### Find the missing persons
This notebook uses publicly available data on missing people and helps visualize all that is known about them.  

However, the data is significantly incomplete. 

You can help improve it by submitting updates to the `missing_persons_clean.csv` file. Only submissions that include valid resource links for verification will be merged.

In [140]:
!pip install pandas folium

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import folium # for mapping the locations of the missing
from folium.plugins import MarkerCluster

pd.set_option('display.max_colwidth', None)



## Load the data

In [141]:
# URL of the raw CSV file on GitHub
url = 'https://raw.githubusercontent.com/sumitdeole/find-the-missing/main/missing_persons_clean.csv'

# Read the CSV file from the URL
df = pd.read_csv(url)

In [142]:
# Let's visualize the dataframe
df.head()

Unnamed: 0,name,description,long,lat,location
0,NamUs #UP4795 ME/C Case Number GC99-158 604UFMN,"date found: November 4, 1999location: Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbs (est)description: A full term infant with umbilical cord still attached was found 10 yds north of the Mississippi shore near 800 Levee Dr, Redwing MN. The body showed slight signs of decomposition upon discovery. The infant had not been in the water for long. The race of the decedent is most likely white. This infant is genetically related maternally to the decedent in case# GC03-127.links: https://www.namus.gov/UnidentifiedPersons/Case#/4795http://www.doenetwork.org/cases/604ufmn.html",-92.541061,44.565821,POINT(-92.5410605 44.5658215)
1,NamUs #UP4804 ME/C Case Number 39918 849UFMN,"date found: July 20, 1977location: St. Paul, Minnesotaage est.: 16 - 30race: Caucasian / Whiteheight: 5'8"" (estimated)weight: 130lbs (estimated)description: Medium length brown hair with brown or green eyes; Multiple abdominal striae; Body found in Mississippi River between Childs and Warner Road, St. Paul.Green, Red, Blue vertically stripped shirt with thinner lateral stripes, high waisted blue jeans, brown knee high stockings size 5 underwear. Shoes�Approximate Size 8 TO 9Multiple Abdominal STRIAlinks: http://doenetwork.org/cases/849ufmn.html;�https://www.namus.gov/UnidentifiedPersons/Case#/4804?nav",-93.055401,44.941109,POINT(-93.0554009 44.9411089)
2,NamUS #UP4808 ME/C Case Number 00-1411 270UFMN,,-93.19406,45.078014,POINT(-93.1940597 45.0780144)
3,NamUs #UP4796 ME/C Case Number GC07-39 102UFMN,"date found: March 26, 2007location: Treasure Island Marina, Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbsdescription: The infant was found in the Buffalo Slough of the Mississippi River at the Treasure Island Marina (Slip 36, Dock C). The infant was near term/term with no apparent congenital abnormalities. The infant appears to be of Caucasian descent and not a member of the Prairie Island tribe. Estimated time of being in the water was from a few weeks from discovery of the body to the previous fall or winter (2006). Hair Black, 3cm in lengthlinks: https://www.namus.gov/UnidentifiedPersons/Case#/4796http://www.doenetwork.org/cases/102ufmn.html",-92.641182,44.635803,POINT(-92.6411819 44.6358033)
4,NamUs #UP6525 ME/C Case Number FC08-61 803UFMN,"date found: August 10, 2008location: Mabel, Minnesotaage est.: 30-40race: African American / Blackheight: weight: description: The skull has been stored in the school since before 1969. Its origin and identity is unknown.links: https://www.namus.gov/UnidentifiedPersons/Case#/6525http://doenetwork.org/cases/803ufmn.html",-91.773434,43.520049,POINT(-91.7734337 43.5200495)


## Data cleaning
### Missing values
Let's now check the extent of missing values.

In [143]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17307 entries, 0 to 17306
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         17307 non-null  object 
 1   description  14803 non-null  object 
 2   long         17307 non-null  float64
 3   lat          17307 non-null  float64
 4   location     17307 non-null  object 
dtypes: float64(2), object(3)
memory usage: 676.2+ KB


(17307, 5)

The *description* column is missing in 2,504 of the 17,307 rows. The latitude and longitude columns, which are needed to visualize the location, have no missing values. So for now we keep the rows with a missing *description*.
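
The non-null counts reported by `df.info()` can also be turned into a per-column count and share of missing values. A minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not rows from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "description": ["text", None, "text", None],
    "lat": [44.5, 44.9, 45.0, 43.5],
})

# Number and share of missing values per column
missing_count = toy.isna().sum()
missing_share = toy.isna().mean().round(2)
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Running the same two lines on `df` should give the missing count for every column at once, which is easier to scan than the `info()` listing.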

In [144]:
# Let's drop the "location" column. I will use the "lat" and "long" columns instead.
df = df.drop("location", axis=1)

### Create a per-person unique identifier
The *name* column contains either the NamUs case number or, in some cases, the known name of the person. Let's extract this information into a new column and use it as a unique identifier.

In [145]:
# Extract the identifier (e.g., UP1234) from the "name" column
df["name_or_NamUs"] = df["name"].str.extract(r'(UP\d+)', expand=False)

# Replacing NaN values (indicating no match found) with the original Name value
df.loc[df['name_or_NamUs'].isnull(), 'name_or_NamUs'] = df.loc[df['name_or_NamUs'].isnull(), 'name']

df.head()

Unnamed: 0,name,description,long,lat,name_or_NamUs
0,NamUs #UP4795 ME/C Case Number GC99-158 604UFMN,"date found: November 4, 1999location: Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbs (est)description: A full term infant with umbilical cord still attached was found 10 yds north of the Mississippi shore near 800 Levee Dr, Redwing MN. The body showed slight signs of decomposition upon discovery. The infant had not been in the water for long. The race of the decedent is most likely white. This infant is genetically related maternally to the decedent in case# GC03-127.links: https://www.namus.gov/UnidentifiedPersons/Case#/4795http://www.doenetwork.org/cases/604ufmn.html",-92.541061,44.565821,UP4795
1,NamUs #UP4804 ME/C Case Number 39918 849UFMN,"date found: July 20, 1977location: St. Paul, Minnesotaage est.: 16 - 30race: Caucasian / Whiteheight: 5'8"" (estimated)weight: 130lbs (estimated)description: Medium length brown hair with brown or green eyes; Multiple abdominal striae; Body found in Mississippi River between Childs and Warner Road, St. Paul.Green, Red, Blue vertically stripped shirt with thinner lateral stripes, high waisted blue jeans, brown knee high stockings size 5 underwear. Shoes�Approximate Size 8 TO 9Multiple Abdominal STRIAlinks: http://doenetwork.org/cases/849ufmn.html;�https://www.namus.gov/UnidentifiedPersons/Case#/4804?nav",-93.055401,44.941109,UP4804
2,NamUS #UP4808 ME/C Case Number 00-1411 270UFMN,,-93.19406,45.078014,UP4808
3,NamUs #UP4796 ME/C Case Number GC07-39 102UFMN,"date found: March 26, 2007location: Treasure Island Marina, Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbsdescription: The infant was found in the Buffalo Slough of the Mississippi River at the Treasure Island Marina (Slip 36, Dock C). The infant was near term/term with no apparent congenital abnormalities. The infant appears to be of Caucasian descent and not a member of the Prairie Island tribe. Estimated time of being in the water was from a few weeks from discovery of the body to the previous fall or winter (2006). Hair Black, 3cm in lengthlinks: https://www.namus.gov/UnidentifiedPersons/Case#/4796http://www.doenetwork.org/cases/102ufmn.html",-92.641182,44.635803,UP4796
4,NamUs #UP6525 ME/C Case Number FC08-61 803UFMN,"date found: August 10, 2008location: Mabel, Minnesotaage est.: 30-40race: African American / Blackheight: weight: description: The skull has been stored in the school since before 1969. Its origin and identity is unknown.links: https://www.namus.gov/UnidentifiedPersons/Case#/6525http://doenetwork.org/cases/803ufmn.html",-91.773434,43.520049,UP6525
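
The pattern `UP\d+` only matches identifiers written without spaces. Some names in the data write the case number differently (for example "NamUs UP # 5054"), and those fall through to the full name. A possible, more tolerant variant is sketched below; the pattern and the sample strings are assumptions for illustration, not part of the original pipeline:

```python
import pandas as pd

# Illustrative names modelled on the formats seen in the data
names = pd.Series([
    "NamUs #UP4795 ME/C Case Number GC99-158 604UFMN",
    "249UFOK NamUs UP # 5054",
    "Yesica Becerra Gomez",
])

# Capture the digits after "UP", allowing optional spaces and a "#"
digits = names.str.extract(r"\bUP\s*#?\s*(\d+)", expand=False)

# Rebuild a normalized identifier; fall back to the full name when nothing matched
identifier = ("UP" + digits).fillna(names)
print(identifier.tolist())
```

Whether this normalization is desirable depends on how the identifiers are used later, so treat it as an option rather than a fix.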


Let's check whether the new column uniquely identifies the dataset.

In [146]:
# Let's write a simple function to check whether the column uniquely identifies the dataset
def is_unique(column):
    unique_values = set()
    for value in column:
        if value in unique_values:
            return False
        unique_values.add(value)
    return True

# Call the function
column = df['name_or_NamUs'] 
is_column_unique = is_unique(column)

if is_column_unique:
    print("The column 'name_or_NamUs' uniquely identifies the dataset.")
else:
    print("The column 'name_or_NamUs' does not uniquely identify the dataset.")

The column 'name_or_NamUs' does not uniquely identify the dataset.
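
pandas also exposes this check directly, so the helper function is optional. A short sketch using `Series.is_unique` and `Series.duplicated` on a made-up series (the `ids` values are illustrative):

```python
import pandas as pd

# Hypothetical identifiers with one repeated value
ids = pd.Series(["UP4795", "UP4804", "UP8086", "UP8086"])

print(ids.is_unique)           # False
print(ids.duplicated().sum())  # 1
```

`duplicated()` flags every occurrence after the first, so its sum is the number of rows that `drop_duplicates(keep='first')` would remove.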


Let's visualize the non-unique instances

In [147]:
# Column to check for uniqueness
column_name = 'name_or_NamUs'  

# Identify duplicate values in the column
duplicates = df[column_name][df[column_name].duplicated()]

# Extract rows with these duplicate values
non_unique_values = df[df[column_name].isin(duplicates)]

# Display non-unique values
print("Non-unique values in the column:")
non_unique_values.sort_values(by="name_or_NamUs", ascending=True).tail(10)

Non-unique values in the column:


Unnamed: 0,name,description,long,lat,name_or_NamUs
738,NamUs #UP8086,"date found: February 10, 2000location: Gulf County, FLage est.: Adultrace: Black/African Americanheight: weight: description: Shrimp boat workers found skull in the Gulf of Mexico links: https://www.namus.gov/UnidentifiedPersons/Case#/8086",-85.478156,29.540976,UP8086
2025,NamUs #UP8086,"date found: February 10, 2000location: Gulf County, Floridaage est.: race: Black / African Americanheight: weight: description: Shrimp boat workers found skull in the Gulf of Mexicolinks: https://www.namus.gov/UnidentifiedPersons/Case#/8086?nav",-85.354938,29.800187,UP8086
10077,NamUs #UP814,,-80.609948,26.427968,UP814
8653,Unidentified Person / NamUs #UP814,,-80.276733,26.651452,UP814
13352,Yesica Becerra Gomez,,-111.749839,31.564789,Yesica Becerra Gomez
13393,Yesica Becerra Gomez,,-111.200866,31.374746,Yesica Becerra Gomez
7431,Yokohama Suijo 18-2,"date found: location: age est.: race: height: weight: description: Estimated Date of Death: May 14, 2006Estimated Age: 55 to 65Height: 176cmDental: Gold bridges on upper front teethDistinguishing Characteristics: 1cm wart on top of headClothing: Black fleece, blue windbreaker, green pants, gray sweat pantsPersonal belongings: Nipperhttp://www.police.pref.kanagawa.jp/corp/c_mes/corp06m2.htm#suijo18-2links:",139.635458,35.464528,Yokohama Suijo 18-2
7387,Yokohama Suijo 18-2,"date found: location: age est.: race: height: weight: description: Estimated Date of Death: April 20, 2006Estimated Age: 50 to 70Height: 161cmDental: Upper teeth all missing as well as lower right molarsDistinguishing Characteristics: Decedent had cut his left wrist and was bandagedClothing: Black suit with a name �Saito� written on it, blue checkered sweater, light brown polo shirt, gray socks, black 24cm leather shoesPersonal belonging: Lighterhttp://www.police.pref.kanagawa.jp/corp/c_mes/corp06m2.htm#suijo18-1links:",139.683266,35.442771,Yokohama Suijo 18-2
13530,Yu Chin Chang Goodson,"Nicknames / Aliases: date: location: age: race: height: weight: description: March 25, 2005 http://charleyproject.org/case/yu-chin-chang-goodsonlinks:",-87.722724,34.504077,Yu Chin Chang Goodson
14302,Yu Chin Chang Goodson,"Nicknames / Aliases: date: location: age: race: height: weight: description: 03/25/2005Goodson was last seen lived in a halfway house for mentally disabled adults in the 100 block of Nortin Avenue in Russellville, Alabama at the time of her disappearance. She was last seen leaving the facility on March 25, 2005.She got into a small gray or silver older model car, possibly a Nissan or Mazda, with a loud muffler. The car headed east on Highway 24 towards Decatur, Alabama. Goodson has never been heard from again.Authorities believe Goodson may be trying to reach her son, who lives in Decatur. She used to live in Florence, Alabama, and had lived in the halfway house for only a few months prior to her disappearance. She did not take any identification with her and has not accessed her bank accounts since her disappearance.http://charleyproject.org/case/yu-chin-chang-goodsonlinks:",-87.723165,34.502266,Yu Chin Chang Goodson


Okay, the culprit is repeated entries: several identifiers appear in more than one row. Let's drop the repeats and keep the first occurrence of each.

### Drop the duplicates

In [148]:
# Drop duplicate rows from the original DataFrame based on the 'name_or_NamUs' column
df_v2 = df.drop_duplicates(subset=['name_or_NamUs'], keep='first')
df_v2.shape

(16897, 5)
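
Note that `keep='first'` discards every later row with the same identifier, even when the rows differ. The two "Yokohama Suijo 18-2" rows shown earlier, for instance, carry different descriptions. A minimal sketch of how such conflicting groups could be listed before dropping them (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame: one harmless duplicate and one conflicting duplicate
toy = pd.DataFrame({
    "name_or_NamUs": ["UP814", "UP814", "Suijo 18-2", "Suijo 18-2"],
    "description": [None, None, "Estimated age: 55 to 65", "Estimated age: 50 to 70"],
})

# Count distinct non-null descriptions per identifier
n_desc = toy.groupby("name_or_NamUs")["description"].nunique()

# Identifiers whose rows disagree and may describe different people
conflicting = n_desc[n_desc > 1].index.tolist()
print(conflicting)
```

Identifiers in `conflicting` may deserve a manual look, since they could be distinct cases that share a label.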

In [149]:
# Check whether the 'long' column has any NaN values
has_nan = df_v2['long'].isna().any()
# Print the result
print(f"Does the 'long' column have any NaN values? {has_nan}")

# Check whether the 'lat' column has any NaN values
has_nan = df_v2['lat'].isna().any()
# Print the result
print(f"Does the 'lat' column have any NaN values? {has_nan}")

Does the 'long' column have any NaN values? False


Does the 'lat' column have any NaN values? False


In [150]:
# Let's check again whether "name_or_NamUs" now uniquely identifies the data
column = df_v2['name_or_NamUs'] 
is_column_unique = is_unique(column)

if is_column_unique:
    print("The column uniquely identifies the dataset.")
else:
    print("The column does not uniquely identify the dataset.")

The column uniquely identifies the dataset.


Perfect! That worked out.


Some missing persons are known by their nicknames. Let's extract the nicknames from the *name* column and then drop that column.

In [151]:
# Some missing persons have nicknames (given in double quotes) that can also be used for visualization. Let's create a simple function to extract them.
def extract_nicknames(name):
    if not isinstance(name, str):
        return None
    else:
        match = re.search(r'"(.*?)"', name)
        return match.group(1) if match else None

df_v3 = df_v2.copy()
# Create a new column for nicknames
df_v3['nickname'] = df_v3['name'].apply(extract_nicknames)

# Drop the "name" column now that the identifier and nickname have been extracted
df_v3 = df_v3.drop(columns=["name"])
df_v3.head()

Unnamed: 0,description,long,lat,name_or_NamUs,nickname
0,"date found: November 4, 1999location: Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbs (est)description: A full term infant with umbilical cord still attached was found 10 yds north of the Mississippi shore near 800 Levee Dr, Redwing MN. The body showed slight signs of decomposition upon discovery. The infant had not been in the water for long. The race of the decedent is most likely white. This infant is genetically related maternally to the decedent in case# GC03-127.links: https://www.namus.gov/UnidentifiedPersons/Case#/4795http://www.doenetwork.org/cases/604ufmn.html",-92.541061,44.565821,UP4795,
1,"date found: July 20, 1977location: St. Paul, Minnesotaage est.: 16 - 30race: Caucasian / Whiteheight: 5'8"" (estimated)weight: 130lbs (estimated)description: Medium length brown hair with brown or green eyes; Multiple abdominal striae; Body found in Mississippi River between Childs and Warner Road, St. Paul.Green, Red, Blue vertically stripped shirt with thinner lateral stripes, high waisted blue jeans, brown knee high stockings size 5 underwear. Shoes�Approximate Size 8 TO 9Multiple Abdominal STRIAlinks: http://doenetwork.org/cases/849ufmn.html;�https://www.namus.gov/UnidentifiedPersons/Case#/4804?nav",-93.055401,44.941109,UP4804,
2,,-93.19406,45.078014,UP4808,
3,"date found: March 26, 2007location: Treasure Island Marina, Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbsdescription: The infant was found in the Buffalo Slough of the Mississippi River at the Treasure Island Marina (Slip 36, Dock C). The infant was near term/term with no apparent congenital abnormalities. The infant appears to be of Caucasian descent and not a member of the Prairie Island tribe. Estimated time of being in the water was from a few weeks from discovery of the body to the previous fall or winter (2006). Hair Black, 3cm in lengthlinks: https://www.namus.gov/UnidentifiedPersons/Case#/4796http://www.doenetwork.org/cases/102ufmn.html",-92.641182,44.635803,UP4796,
4,"date found: August 10, 2008location: Mabel, Minnesotaage est.: 30-40race: African American / Blackheight: weight: description: The skull has been stored in the school since before 1969. Its origin and identity is unknown.links: https://www.namus.gov/UnidentifiedPersons/Case#/6525http://doenetwork.org/cases/803ufmn.html",-91.773434,43.520049,UP6525,


In [152]:
df_v3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16897 entries, 0 to 17306
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   description    14535 non-null  object 
 1   long           16897 non-null  float64
 2   lat            16897 non-null  float64
 3   name_or_NamUs  16897 non-null  object 
 4   nickname       298 non-null    object 
dtypes: float64(2), object(3)
memory usage: 792.0+ KB


### Parse entities from *description* column
Let's first visualize the values.

In [153]:
df_v3["description"][1]

'date found: July 20, 1977location: St. Paul, Minnesotaage est.: 16 - 30race: Caucasian / Whiteheight: 5\'8" (estimated)weight: 130lbs (estimated)description: Medium length brown hair with brown or green eyes; Multiple abdominal striae; Body found in Mississippi River between Childs and Warner Road, St. Paul.Green, Red, Blue vertically stripped shirt with thinner lateral stripes, high waisted blue jeans, brown knee high stockings size 5 underwear. Shoes�Approximate Size 8 TO 9Multiple Abdominal STRIAlinks: http://doenetwork.org/cases/849ufmn.html;�https://www.namus.gov/UnidentifiedPersons/Case#/4804?nav'

The column packs several fields describing the missing person into one string: date found, location, age, race, height, weight, a detailed description, and links.

In [154]:
# Function to parse the description column and extract the description text
def parse_description(row):
    description = row['description']
    
    if not isinstance(description, str):
        return {
            "date_found": None, 
            "date_seen": None, 
            "location": None, 
            "age_est": None, 
            "race": None, 
            "height": None, 
            "weight": None, 
            "description_2": None
        }

    # Patterns to extract each field with variations
    fields = {
        "date_found": r'(date\s*found\s*:|date\s*:\s*)\s*(?P<value>.*?)(?:location:|$)',
        "location": r'location\s*:\s*(?P<value>.*?)(?:age\s*est\s*:|age\s*:|$)',
        "age_est": r'(age\s*est\s*:|age\s*:)\s*(?P<value>.*?)(?:race:|height:|$)',
        "race": r'race\s*:\s*(?P<value>.*?)(?:height:|weight:|$)',
        "height": r'height\s*:\s*(?P<value>.*?)(?:weight:|description:|$)',
        "weight": r'weight\s*:\s*(?P<value>.*?)(?:description:|$)',
    }
    
    # Extract values for each field
    parsed_data = {}
    for field, pattern in fields.items():
        match = re.search(pattern, description, re.IGNORECASE)
        if match:
            parsed_data[field] = match.group('value').strip()
        else:
            parsed_data[field] = None
    
    # Extract everything following "description:" as description_2
    desc_match = re.search(r'description\s*:\s*(?P<value>.*?)\s*(?:links:|$)', description, re.IGNORECASE | re.DOTALL)
    parsed_data["description_2"] = desc_match.group('value').strip() if desc_match else None

    # Attempt to find the first date in the description
    date_pattern = r'\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},\s*\d{4}\b'
    date_match = re.search(date_pattern, description)
    if date_match:
        parsed_data["date_seen"] = date_match.group().strip()
    else:
        parsed_data["date_seen"] = None

    # Add lat and long to the parsed data for debugging purposes
    parsed_data['lat'] = row['lat']
    parsed_data['long'] = row['long']

    return parsed_data

# Apply the parsing function to every row of the DataFrame
parsed_info = df_v3.apply(parse_description, axis=1)

In [155]:
# Convert parsed_info to a DataFrame and concatenate it with the original DataFrame
parsed_df = pd.DataFrame(parsed_info.tolist())
df_v3 = pd.concat([df_v3.reset_index(drop=True), parsed_df.reset_index(drop=True)], axis=1)
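
Because `parse_description` also returns `lat` and `long`, the concatenated frame ends up with two columns of each name (they show up as `lat.1` and `long.1` in the tables further down). A small sketch of how repeated column labels could be detected and dropped, using made-up frames (`left` and `right` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frames that share the "lat" and "long" column names
left = pd.DataFrame({"lat": [44.5], "long": [-92.5], "description": ["text"]})
right = pd.DataFrame({"race": ["White"], "lat": [44.5], "long": [-92.5]})

combined = pd.concat([left, right], axis=1)

# Column labels that occur more than once
dupes = combined.columns[combined.columns.duplicated()].tolist()
print(dupes)

# Keep only the first occurrence of each label
deduped = combined.loc[:, ~combined.columns.duplicated()]
print(deduped.columns.tolist())
```

This notebook keeps the extra columns for debugging, so the sketch is only relevant if the duplicates get in the way later.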

In [156]:
# Let's visualize the df
print("Combined DataFrame:")
df_v3[["description", "description_2"]].sample(3)

Combined DataFrame:


Unnamed: 0,description,description_2
2297,"date found: June 19, 1970location: Lake County, COage est.: 18-30race: White/Caucasianheight: 5'7"" (est)weight: description: Two men hiking down 2 miles east of Independence Pass Summit discovered the decedent in a ditch alongside the road. It is thought he was sitting on the side of the road bank when a rockfall occurred. The decedent's torso was covered by a rockfall. He was found in an advanced decomposed state. He had previous dental work done, had good teeth with gold and platinum fillings.Dark brown hair. Newspaper stated Tattered Sweatshirt, Gray Trousers, Three Pairs of Socks and Tennis or Hiking Shoes (Could possibly be Jungle Boots?) with an unworn sock over the left shoe. Sheriff's report states he was wearing a pair of light colored wash and wear pants, a sweat shirt and a pair of black loafers along with having seven dollars and a razor in his pocket.links: https://www.namus.gov/UnidentifiedPersons/Case#/10738http://www.doenetwork.org/cases/2530umco.htmlhttps://apps.colorado.gov/apps/coldcase/casedetail.html?id=259002","Two men hiking down 2 miles east of Independence Pass Summit discovered the decedent in a ditch alongside the road. It is thought he was sitting on the side of the road bank when a rockfall occurred. The decedent's torso was covered by a rockfall. He was found in an advanced decomposed state. He had previous dental work done, had good teeth with gold and platinum fillings.Dark brown hair. Newspaper stated Tattered Sweatshirt, Gray Trousers, Three Pairs of Socks and Tennis or Hiking Shoes (Could possibly be Jungle Boots?) with an unworn sock over the left shoe. Sheriff's report states he was wearing a pair of light colored wash and wear pants, a sweat shirt and a pair of black loafers along with having seven dollars and a razor in his pocket."
4061,date found: location: age est.: race: height: weight: description: https://identifyus.org/en/cases/3830�links:,https://identifyus.org/en/cases/3830�
16887,"Nicknames / Aliases: date: January 28, 2006location: Halifax, NSage: 62race: Whiteheight: 157 cm (5'2"")weight: 70 kg (145 lbs)description: Marilyn Hersey left the Abbey Lane Hospital on January 28, 2006, and hasn't been seen since. Marilyn is bi-polar and had to take medication for numerous ailments; she did not have her medication with her. links: https://www.services.rcmp-grc.gc.ca/missing-disparus/case-dossier.jsf?case=2013000023&id=9","Marilyn Hersey left the Abbey Lane Hospital on January 28, 2006, and hasn't been seen since. Marilyn is bi-polar and had to take medication for numerous ailments; she did not have her medication with her."


### Extract links from the *description* column

Let's also extract the links and store them as a list in a separate column.

In [157]:
# Function to extract links from a given text as a list of clickable links
def extract_links(description):
    if not isinstance(description, str):
        return []
    # Pattern to match http or https links
    links_pattern = r'(https?://[^\s]+)'
    links_matches = re.findall(links_pattern, description)
    
    # Correctly separate concatenated links
    if links_matches:
        links = []
        for match in links_matches:
            parts = match.split('http')
            for i, part in enumerate(parts):
                if i == 0 and part:
                    links.append(part)
                elif part:
                    links.append('http' + part)
        return links
    return []

# Apply the extract_links function to the 'description' column
df_v3['links'] = df_v3['description'].apply(extract_links)
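
Splitting on `'http'` works for the concatenated links in this dataset. An alternative is to let the regular expression do the splitting with a lookahead. The sketch below is one possible variant, not the function used in this notebook; `split_links` is a hypothetical name:

```python
import re

def split_links(text):
    """Return every http(s) URL in text, splitting URLs that run together."""
    if not isinstance(text, str):
        return []
    # Match lazily up to the next URL scheme, whitespace, or end of string
    return re.findall(r"https?://\S+?(?=https?://|\s|$)", text)

sample = ("links: https://www.namus.gov/UnidentifiedPersons/Case#/4795"
          "http://www.doenetwork.org/cases/604ufmn.html")
print(split_links(sample))
```

On the sample string this yields the NamUs link and the Doe Network link as two separate entries.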

In [158]:
df_v3[['description', 'links']].iloc[0]

description    date found: November 4, 1999location: Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9"weight: 6 lbs (est)description: A full term infant with umbilical cord still attached was found 10 yds north of the Mississippi shore near 800 Levee Dr, Redwing MN. The body showed slight signs of decomposition upon discovery. The infant had not been in the water for long. The race of the decedent is most likely white. This infant is genetically related maternally to the decedent in case# GC03-127.links: https://www.namus.gov/UnidentifiedPersons/Case#/4795http://www.doenetwork.org/cases/604ufmn.html
links                                                                                                                                                                                                                                                                                                                                                                          

In [159]:
substring_match = df_v3["name_or_NamUs"].str.contains("Maura Murray", na=False)
df_v3[substring_match]

Unnamed: 0,description,long,lat,name_or_NamUs,nickname,date_found,location,age_est,race,height,weight,description_2,date_seen,lat.1,long.1,links
11310,"Nicknames / Aliases: date: location: age: race: height: weight: description: Murray was involved in a one-car accident Route 112 in the Woodsville section of Haverhill in northern New Hampshire between 7:00 and 7:30 p.m. on February 9, 2004. Her car, a black 1996 Saturn with Massachusetts license plates, failed to negotiate a sharp curve and ran off the road, striking a tree. Haverhill is five miles away from Wells River, Vermont and one mile away from Swift Water Village by the Connecticut River. This was the second car Murray had wrecked in three days; she had previously damaged her father's vehicle in another accident. A resident near the site of the February 9 crash called the police, even though Murray had asked him not to. She had vanished by the time authorities arrived at the scene about ten minutes later. Her car was left behind, severely damaged in the front end and not in a driveable condition. The doors were locked and a few personal belongings, including Murray's cellular phone and credit and bank cards, were missing, but most of her possessions had been left inside. Murray has never been seen again. There were no footprints in the snow around the car and no indications of a struggle, and tracker dogs lost her scent within 100 yards. Police believe she got a ride from the scene of the accident to parts unknown. The witness to the accident says she did not appear to be injured, but she may have been intoxicated. Murray resided in Hanson, Massachusetts and was a student at the University of Massachusetts at Amherst at the time she disappeared; the university police are assisting with her case. She was a nursing major and was a dean's list student, and was employed by a local art gallery in addition to having a job on campus. She had made arrangements to take a nursing job in Oklahoma after her graduation. 
Four days before she disappeared, she left her job early at her supervisor's suggestion; she appeared to be extremely upset about something and was unable to work. It has not been discovered what was bothering her, but Murray's sister spoke to her on the phone that same evening and said their conversation was normal. Murray emailed her professors the day of her disappearance and said there was a death in her family and she had go away, but would be in touch upon her return in about a week. No one had actually died. After her disappearance, Murray's dormitory room was found packed up, as if she was planning on moving out altogether. She withdrew $280 from her bank account the day she disappeared, but there has been no activity on her bank accounts or credit cards since then. She packed up all her belongings in her dormitory room into boxes, and left behind a personal note for her fiance, an Army lieutenant named William Rausch who was stationed in Fort Still, Oklahoma. Murray also emailed Rausch on the afternoon of her disappearance. In the email she asked to speak with him. The day after she was last seen, Murray called Rausch, but he only heard her breathing on the line. The call could not be traced. Investigators inspected Murray's computer after she vanished; they discovered she had been searching on the internet for information on hotels in the Burlington, Vermont area. Based on this information, they checked Burlington hotels for any signs of Murray, but turned up no clues as to her whereabouts. Murray and her father went hiking together in the Burlington area in the fall of 2003, but she has no other connections to the city. She used to camp regularly in New Hampshire and knew the state well, but there are no known reasons why she would go to Haverhill. Extensive searches of the woods around Haverhill have turned up no evidence as to her whereabouts. 
There is speculation that Murray's case may be related to the disappearance of Brianna Maitland, a girl who vanished from Montgomery, Vermont on March 19, 2004. She is still missing. Montgomery is only about 90 miles from Haverhill. Both of them are attractive brunette young women, and both disappeared after car accidents in which theirlinks:",-71.936159,44.119442,Maura Murray,,,,,,,,"Murray was involved in a one-car accident Route 112 in the Woodsville section of Haverhill in northern New Hampshire between 7:00 and 7:30 p.m. on February 9, 2004. Her car, a black 1996 Saturn with Massachusetts license plates, failed to negotiate a sharp curve and ran off the road, striking a tree. Haverhill is five miles away from Wells River, Vermont and one mile away from Swift Water Village by the Connecticut River. This was the second car Murray had wrecked in three days; she had previously damaged her father's vehicle in another accident. A resident near the site of the February 9 crash called the police, even though Murray had asked him not to. She had vanished by the time authorities arrived at the scene about ten minutes later. Her car was left behind, severely damaged in the front end and not in a driveable condition. The doors were locked and a few personal belongings, including Murray's cellular phone and credit and bank cards, were missing, but most of her possessions had been left inside. Murray has never been seen again. There were no footprints in the snow around the car and no indications of a struggle, and tracker dogs lost her scent within 100 yards. Police believe she got a ride from the scene of the accident to parts unknown. The witness to the accident says she did not appear to be injured, but she may have been intoxicated. Murray resided in Hanson, Massachusetts and was a student at the University of Massachusetts at Amherst at the time she disappeared; the university police are assisting with her case. 
She was a nursing major and was a dean's list student, and was employed by a local art gallery in addition to having a job on campus. She had made arrangements to take a nursing job in Oklahoma after her graduation. Four days before she disappeared, she left her job early at her supervisor's suggestion; she appeared to be extremely upset about something and was unable to work. It has not been discovered what was bothering her, but Murray's sister spoke to her on the phone that same evening and said their conversation was normal. Murray emailed her professors the day of her disappearance and said there was a death in her family and she had go away, but would be in touch upon her return in about a week. No one had actually died. After her disappearance, Murray's dormitory room was found packed up, as if she was planning on moving out altogether. She withdrew $280 from her bank account the day she disappeared, but there has been no activity on her bank accounts or credit cards since then. She packed up all her belongings in her dormitory room into boxes, and left behind a personal note for her fiance, an Army lieutenant named William Rausch who was stationed in Fort Still, Oklahoma. Murray also emailed Rausch on the afternoon of her disappearance. In the email she asked to speak with him. The day after she was last seen, Murray called Rausch, but he only heard her breathing on the line. The call could not be traced. Investigators inspected Murray's computer after she vanished; they discovered she had been searching on the internet for information on hotels in the Burlington, Vermont area. Based on this information, they checked Burlington hotels for any signs of Murray, but turned up no clues as to her whereabouts. Murray and her father went hiking together in the Burlington area in the fall of 2003, but she has no other connections to the city. 
She used to camp regularly in New Hampshire and knew the state well, but there are no known reasons why she would go to Haverhill. Extensive searches of the woods around Haverhill have turned up no evidence as to her whereabouts. There is speculation that Murray's case may be related to the disappearance of Brianna Maitland, a girl who vanished from Montgomery, Vermont on March 19, 2004. She is still missing. Montgomery is only about 90 miles from Haverhill. Both of them are attractive brunette young women, and both disappeared after car accidents in which their","February 9, 2004",44.119442,-71.936159,[]


Unfortunately, whenever *date_found* is missing, *date_seen* and the other information in the *description* column are missing too. Not much can be done there :(

In [160]:
# Show the rows where date_found is null
df_v3[df_v3["date_found"].isnull()]

Unnamed: 0,description,long,lat,name_or_NamUs,nickname,date_found,location,age_est,race,height,weight,description_2,date_seen,lat.1,long.1,links
2,,-93.194060,45.078014,UP4808,,,,,,,,,,,,[]
9,,-97.328052,34.826768,249UFOK NamUs UP # 5054,,,,,,,,,,,,[]
13,,-95.215244,35.494709,Webbers Falls Jane Doe 617UFOK NamUs UP # 9183,,,,,,,,,,,,[]
25,,-92.676587,33.207719,"""El Dorado Jane Doe"" 81UFAR",El Dorado Jane Doe,,,,,,,,,,,[]
56,,-84.106203,42.333587,UP8182,,,,,,,,,,,,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16877,,-113.485811,53.541684,Audrey Ramona Beaver,,,,,,,,,,,,[]
16882,,-113.910077,53.557912,Patricia Janice Salamandyk,,,,,,,,,,,,[]
16883,,-117.788775,53.409230,Stephanie Stewart,,,,,,,,,,,,[]
16884,,-113.483921,53.548474,Freda Alvina Whiteman,,,,,,,,,,,,[]
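
The table above suggests that rows without *date_found* carry no other parsed information either. Rather than judging this by eye, one could count the non-null values left in the other parsed columns for those rows. The following is a minimal sketch on a made-up toy frame; only the column names follow `df_v3`, and `parsed_cols` is a hypothetical selection of columns to check.

```python
import pandas as pd

# Toy stand-in for df_v3 (made-up rows; only the column names follow the notebook)
toy = pd.DataFrame({
    "date_found": ["July 20, 1977", None, None],
    "date_seen": ["July 20, 1977", None, None],
    "location": ["St. Paul, Minnesota", None, None],
    "description_2": ["Body found in river.", None, None],
})

# Hypothetical list of parsed columns to inspect
parsed_cols = ["date_seen", "location", "description_2"]

# Rows that have no date_found
no_date = toy[toy["date_found"].isna()]

# Non-null count per parsed column among those rows;
# all zeros would support the claim that nothing else is recoverable
leftover = no_date[parsed_cols].notna().sum()
print(leftover)
```

If any count is non-zero on the real data, those rows may still hold something worth parsing.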


In [161]:
print(df.shape, df_v3.shape)

(17307, 5) (16897, 16)


In [162]:
# Function to parse the description column and extract the description text
def parse_description(row):
    description = row['description']
    
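    if not isinstance(description, str):
        # No usable text: return every field as None, with the same keys
        # (including lat/long) as the normal return path below
        return {
            "date_found": None,
            "date_seen": None,
            "location": None,
            "age_est": None,
            "race": None,
            "height": None,
            "weight": None,
            "description_2": None,
            "lat": row['lat'],
            "long": row['long'],
        }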

    # Patterns to extract each field with variations.
    # The source text writes the age label as "age est.:", so the dot after "est" is optional.
    fields = {
        "date_found": r'(?:date\s*found\s*:|date\s*:)\s*(?P<value>.*?)(?:location:|$)',
        "location": r'location:\s*(?P<value>.*?)(?:age\s*est\.?\s*:|age\s*:|$)',
        "age_est": r'(?:age\s*est\.?\s*:|age\s*:)\s*(?P<value>.*?)(?:race:|height:|$)',
        "race": r'race\s*:\s*(?P<value>.*?)(?:height:|weight:|$)',
        "height": r'height\s*:\s*(?P<value>.*?)(?:weight:|description:|$)',
        "weight": r'weight\s*:\s*(?P<value>.*?)(?:description:|$)',
    }
    
    # Extract values for each field
    parsed_data = {}
    for field, pattern in fields.items():
        match = re.search(pattern, description, re.IGNORECASE)
        if match:
            parsed_data[field] = match.group('value').strip()
        else:
            parsed_data[field] = None
    
    # Extract everything following "description:" as description_2
    desc_match = re.search(r'description\s*:\s*(?P<value>.*?)\s*(?:links:|$)', description, re.IGNORECASE | re.DOTALL)
    parsed_data["description_2"] = desc_match.group('value').strip() if desc_match else None

    # Attempt to find the first date in the description
    date_pattern = r'\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},\s*\d{4}\b'
    date_match = re.search(date_pattern, description)
    if date_match:
        parsed_data["date_seen"] = date_match.group().strip()
    else:
        parsed_data["date_seen"] = None

    # Add lat and long to the parsed data for debugging purposes
    parsed_data['lat'] = row['lat']
    parsed_data['long'] = row['long']

    return parsed_data

# Apply the parsing function row by row; parsed_info holds one dict per row
parsed_info = df_v3.apply(parse_description, axis=1)

df_v3.sample(2)

Unnamed: 0,description,long,lat,name_or_NamUs,nickname,date_found,location,age_est,race,height,weight,description_2,date_seen,lat.1,long.1,links
9688,"date found: February 18, 2020location: West Deptford, NJage est.: 18-99race: height: weight: description: The cranium was found in a wooded area.links: https://www.namus.gov/UnidentifiedPersons/Case#/71758",-75.187715,39.832251,UP71758,,"February 18, 2020","West Deptford, NJage est.: 18-99race: height: weight: description: The cranium was found in a wooded area.links: https://www.namus.gov/UnidentifiedPersons/Case#/71758",,,,,The cranium was found in a wooded area.,,39.832251,-75.187715,[https://www.namus.gov/UnidentifiedPersons/Case#/71758]
3867,"date found: location: age est.: race: height: weight: description: https://identifyus.org/en/cases/3603�AUT0 VS BICYCLIST, BLU BRITTNEY 10 SPEED BIKElinks:",-118.213221,34.060277,"NamUs UP # 3603 No_photo ME/C Case Number: 9504412 Los Angeles County, California 30 to 50 year old White Male",,,"age est.: race: height: weight: description: https://identifyus.org/en/cases/3603�AUT0 VS BICYCLIST, BLU BRITTNEY 10 SPEED BIKElinks:",,,,,"https://identifyus.org/en/cases/3603�AUT0 VS BICYCLIST, BLU BRITTNEY 10 SPEED BIKE",,34.060277,-118.213221,[https://identifyus.org/en/cases/3603�AUT0]
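
The sample rows show how the labels run straight into the preceding value (for example `NJage est.:`), which makes per-field patterns fragile. An alternative, sketched here under the assumption that the labels always appear in this spelling, is to locate every label first and treat the text between two consecutive labels as the value. `LABELS` and `split_fields` are hypothetical names, not part of the notebook, and the sample string is the description of row 9688 above.

```python
import re

# Hypothetical label list, taken from the descriptions shown above
LABELS = ["date found", "location", "age est.", "race",
          "height", "weight", "description", "links"]

# One pattern that matches any label followed by a colon
LABEL_RE = re.compile(
    r"(?P<label>" + "|".join(re.escape(label) for label in LABELS) + r")\s*:\s*",
    re.IGNORECASE,
)

def split_fields(text):
    """Map each label to the text between it and the next label (None if empty)."""
    matches = list(LABEL_RE.finditer(text))
    fields = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        value = text[current.end():end].strip()
        # Keep the first occurrence if a label appears again inside free text
        fields.setdefault(current.group("label").lower(), value or None)
    return fields

sample = ("date found: February 18, 2020location: West Deptford, NJ"
          "age est.: 18-99race: height: weight: description: The cranium "
          "was found in a wooded area.links: "
          "https://www.namus.gov/UnidentifiedPersons/Case#/71758")

fields = split_fields(sample)
print(fields)
```

A label that also occurs inside the free-text description (say, "weight:" in a sentence) would still split the text there, so this is a starting point rather than a complete parser.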


In [163]:
# Replace NaN values in the 'description' column with an empty string
df_v3['description'] = df_v3['description'].fillna('')

# Let's check the case of 'Maura Murray'
result = df_v3[df_v3['description'].str.contains('Murray', case=False)]

# Display the result
result[['description', 'description_2']].iloc[5:6]

Unnamed: 0,description,description_2
11310,"Nicknames / Aliases: date: location: age: race: height: weight: description: Murray was involved in a one-car accident Route 112 in the Woodsville section of Haverhill in northern New Hampshire between 7:00 and 7:30 p.m. on February 9, 2004. Her car, a black 1996 Saturn with Massachusetts license plates, failed to negotiate a sharp curve and ran off the road, striking a tree. Haverhill is five miles away from Wells River, Vermont and one mile away from Swift Water Village by the Connecticut River. This was the second car Murray had wrecked in three days; she had previously damaged her father's vehicle in another accident. A resident near the site of the February 9 crash called the police, even though Murray had asked him not to. She had vanished by the time authorities arrived at the scene about ten minutes later. Her car was left behind, severely damaged in the front end and not in a driveable condition. The doors were locked and a few personal belongings, including Murray's cellular phone and credit and bank cards, were missing, but most of her possessions had been left inside. Murray has never been seen again. There were no footprints in the snow around the car and no indications of a struggle, and tracker dogs lost her scent within 100 yards. Police believe she got a ride from the scene of the accident to parts unknown. The witness to the accident says she did not appear to be injured, but she may have been intoxicated. Murray resided in Hanson, Massachusetts and was a student at the University of Massachusetts at Amherst at the time she disappeared; the university police are assisting with her case. She was a nursing major and was a dean's list student, and was employed by a local art gallery in addition to having a job on campus. She had made arrangements to take a nursing job in Oklahoma after her graduation. 
Four days before she disappeared, she left her job early at her supervisor's suggestion; she appeared to be extremely upset about something and was unable to work. It has not been discovered what was bothering her, but Murray's sister spoke to her on the phone that same evening and said their conversation was normal. Murray emailed her professors the day of her disappearance and said there was a death in her family and she had go away, but would be in touch upon her return in about a week. No one had actually died. After her disappearance, Murray's dormitory room was found packed up, as if she was planning on moving out altogether. She withdrew $280 from her bank account the day she disappeared, but there has been no activity on her bank accounts or credit cards since then. She packed up all her belongings in her dormitory room into boxes, and left behind a personal note for her fiance, an Army lieutenant named William Rausch who was stationed in Fort Still, Oklahoma. Murray also emailed Rausch on the afternoon of her disappearance. In the email she asked to speak with him. The day after she was last seen, Murray called Rausch, but he only heard her breathing on the line. The call could not be traced. Investigators inspected Murray's computer after she vanished; they discovered she had been searching on the internet for information on hotels in the Burlington, Vermont area. Based on this information, they checked Burlington hotels for any signs of Murray, but turned up no clues as to her whereabouts. Murray and her father went hiking together in the Burlington area in the fall of 2003, but she has no other connections to the city. She used to camp regularly in New Hampshire and knew the state well, but there are no known reasons why she would go to Haverhill. Extensive searches of the woods around Haverhill have turned up no evidence as to her whereabouts. 
There is speculation that Murray's case may be related to the disappearance of Brianna Maitland, a girl who vanished from Montgomery, Vermont on March 19, 2004. She is still missing. Montgomery is only about 90 miles from Haverhill. Both of them are attractive brunette young women, and both disappeared after car accidents in which theirlinks:","Murray was involved in a one-car accident Route 112 in the Woodsville section of Haverhill in northern New Hampshire between 7:00 and 7:30 p.m. on February 9, 2004. Her car, a black 1996 Saturn with Massachusetts license plates, failed to negotiate a sharp curve and ran off the road, striking a tree. Haverhill is five miles away from Wells River, Vermont and one mile away from Swift Water Village by the Connecticut River. This was the second car Murray had wrecked in three days; she had previously damaged her father's vehicle in another accident. A resident near the site of the February 9 crash called the police, even though Murray had asked him not to. She had vanished by the time authorities arrived at the scene about ten minutes later. Her car was left behind, severely damaged in the front end and not in a driveable condition. The doors were locked and a few personal belongings, including Murray's cellular phone and credit and bank cards, were missing, but most of her possessions had been left inside. Murray has never been seen again. There were no footprints in the snow around the car and no indications of a struggle, and tracker dogs lost her scent within 100 yards. Police believe she got a ride from the scene of the accident to parts unknown. The witness to the accident says she did not appear to be injured, but she may have been intoxicated. Murray resided in Hanson, Massachusetts and was a student at the University of Massachusetts at Amherst at the time she disappeared; the university police are assisting with her case. 
She was a nursing major and was a dean's list student, and was employed by a local art gallery in addition to having a job on campus. She had made arrangements to take a nursing job in Oklahoma after her graduation. Four days before she disappeared, she left her job early at her supervisor's suggestion; she appeared to be extremely upset about something and was unable to work. It has not been discovered what was bothering her, but Murray's sister spoke to her on the phone that same evening and said their conversation was normal. Murray emailed her professors the day of her disappearance and said there was a death in her family and she had go away, but would be in touch upon her return in about a week. No one had actually died. After her disappearance, Murray's dormitory room was found packed up, as if she was planning on moving out altogether. She withdrew $280 from her bank account the day she disappeared, but there has been no activity on her bank accounts or credit cards since then. She packed up all her belongings in her dormitory room into boxes, and left behind a personal note for her fiance, an Army lieutenant named William Rausch who was stationed in Fort Still, Oklahoma. Murray also emailed Rausch on the afternoon of her disappearance. In the email she asked to speak with him. The day after she was last seen, Murray called Rausch, but he only heard her breathing on the line. The call could not be traced. Investigators inspected Murray's computer after she vanished; they discovered she had been searching on the internet for information on hotels in the Burlington, Vermont area. Based on this information, they checked Burlington hotels for any signs of Murray, but turned up no clues as to her whereabouts. Murray and her father went hiking together in the Burlington area in the fall of 2003, but she has no other connections to the city. 
She used to camp regularly in New Hampshire and knew the state well, but there are no known reasons why she would go to Haverhill. Extensive searches of the woods around Haverhill have turned up no evidence as to her whereabouts. There is speculation that Murray's case may be related to the disappearance of Brianna Maitland, a girl who vanished from Montgomery, Vermont on March 19, 2004. She is still missing. Montgomery is only about 90 miles from Haverhill. Both of them are attractive brunette young women, and both disappeared after car accidents in which their"
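
The Maura Murray record contains more than one full date in its description, and the first-date heuristic in `parse_description` keeps only the earliest match in the text. A small sketch of collecting every "Month D, YYYY" date and converting it to a date object is given below. `find_dates` is a hypothetical helper, the excerpt is shortened from the record above, and `%B` assumes an English locale.

```python
import re
from datetime import datetime

# Same month-name date shape the notebook searches for
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s*\d{4}\b"
)

def find_dates(text):
    """Return every 'Month D, YYYY' date found in text as datetime.date, in order."""
    dates = []
    for raw in DATE_RE.findall(text):
        # Split on whitespace/commas so spacing variations parse the same way
        month, day, year = re.split(r"[\s,]+", raw)
        dates.append(datetime.strptime(f"{month} {day} {year}", "%B %d %Y").date())
    return dates

excerpt = ("Murray was involved in a one-car accident between 7:00 and 7:30 p.m. "
           "on February 9, 2004. A resident near the site of the February 9 crash "
           "called the police. Brianna Maitland vanished from Montgomery, Vermont "
           "on March 19, 2004.")

found = find_dates(excerpt)
print(found)
```

Dates written without a year, such as "February 9" in the second sentence, are skipped by this pattern.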


In [164]:
# Drop rows with missing latitude or longitude.
# 'lat' and 'long' each appear twice in df_v3 at this point, so each count prints as a Series.
print("missing values in columns 'lat' and 'long' are:", df_v3["lat"].isna().sum(), "and", df_v3["long"].isna().sum())
df_v3 = df_v3.dropna(subset=['lat', 'long'])
print("After deleting missing values in latitude and longitude columns: ", df.shape, df_v3.shape)

missing values in columns 'lat' and 'long' are: lat       0
lat    2362
dtype: int64 and long       0
long    2362
dtype: int64
After deleting missing values in latitude and longitude columns:  (17307, 5) (14535, 16)
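
Dropping NaN values removes missing coordinates but does not catch values that are present yet out of range. A minimal sketch of such a check follows; `is_valid_coordinate` is a hypothetical helper and the bounds are the usual latitude and longitude ranges, not something the notebook defines.

```python
import math

def is_valid_coordinate(lat, lon):
    """True if both values are finite numbers inside the usual lat/long ranges."""
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return False
    if math.isnan(lat) or math.isnan(lon):
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

print(is_valid_coordinate(44.119442, -71.936159))     # coordinates from the Maura Murray row
print(is_valid_coordinate(float("nan"), -71.936159))  # missing latitude
print(is_valid_coordinate(120.0, 10.0))               # latitude out of range
```

A range check cannot detect swapped columns when both values happen to fall inside both ranges, which matters here because the source file lists `long` before `lat`.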


In [165]:
# Print the column names to check for duplicates
print(df_v3.columns)

# Drop duplicated columns, keeping the first occurrence of each name
df_v3 = df_v3.loc[:,~df_v3.columns.duplicated()].copy()
print(df_v3.columns)  # Verify the columns again to ensure duplicates are removed

Index(['description', 'long', 'lat', 'name_or_NamUs', 'nickname', 'date_found',
       'location', 'age_est', 'race', 'height', 'weight', 'description_2',
       'date_seen', 'lat', 'long', 'links'],
      dtype='object')
Index(['description', 'long', 'lat', 'name_or_NamUs', 'nickname', 'date_found',
       'location', 'age_est', 'race', 'height', 'weight', 'description_2',
       'date_seen', 'links'],
      dtype='object')
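
The repeated `lat` and `long` labels also explain why the missing-value counts in the previous cell printed as Series rather than single numbers. The toy example below (made-up values) sketches both the effect and the de-duplication step used above.

```python
import pandas as pd

# Toy frame with a repeated 'lat' label (values are made up)
toy = pd.DataFrame(
    [[44.5, float("nan")], [float("nan"), float("nan")]],
    columns=["lat", "lat"],
)

# Selecting a repeated label returns a DataFrame, so .isna().sum() gives one count per copy
counts = toy["lat"].isna().sum()
print(type(counts).__name__)

# Keep only the first occurrence of each column label
deduped = toy.loc[:, ~toy.columns.duplicated()].copy()
print(deduped["lat"].isna().sum())
```

Because `duplicated()` keeps the first occurrence by default, the surviving column is the leftmost one, so it is worth confirming that the leftmost copy is the one with the fewest gaps before dropping the others.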


## Let's begin the visualization 

In [166]:
# Function to create popup text for each marker
def create_popup(row):
    row = row.to_dict()

    # Normalize 'links' to a list: NaN becomes an empty list,
    # a single non-list value is wrapped, and an existing list is used as is
    if isinstance(row['links'], float) and pd.isna(row['links']):
        links = []
    elif not isinstance(row['links'], list):
        links = [str(row['links'])]
    else:
        links = row['links']

    def safe_str(val):
        return str(val) if not pd.isna(val) else 'N/A'

    popup_text = (
        f"Name/ID: {safe_str(row['name_or_NamUs'])}<br>"
        f"Nickname: {safe_str(row['nickname'])}<br>"
        f"Race: {safe_str(row['race'])}<br>"
        f"Age Est.: {safe_str(row['age_est'])}<br>"
        f"Date Found: {safe_str(row['date_found'])}<br>"
        f"Date Seen: {safe_str(row['date_seen'])}<br>"
        f"Location: {safe_str(row['location'])}<br>"
        f"Height: {safe_str(row['height'])}<br>"
        f"Weight: {safe_str(row['weight'])}<br>"
        f"Description: {safe_str(row['description_2'])}<br>"
        f"Links: {', '.join(links) if links else 'N/A'}"
    )
    return popup_text

# Create a map object centered on the contiguous United States
us_center = [37.0902, -95.7129]  # Approximate center of the contiguous US
map = folium.Map(location=us_center, zoom_start=4)

# Create a marker cluster object
marker_cluster = MarkerCluster().add_to(map)

# Add markers for each location with detailed popup
for idx, row in df_v3.iterrows():
    # Access latitude and longitude values correctly
    lat = row['lat']
    lon = row['long']
    
    # Print debug information
    print(f"Row index: {idx}, Latitude: {lat}, Longitude: {lon}")
    
    if pd.notna(lat) and pd.notna(lon):  # Check if lat and lon are not NaN
        # Construct the popup for the current row
        popup = create_popup(row)
        # Create a marker at the current location with the popup text
        folium.Marker([lat, lon], popup=popup).add_to(marker_cluster)
        
# Display the map
# map

Row index: 0, Latitude: 44.5658215, Longitude: -92.5410605
Row index: 1, Latitude: 44.9411089, Longitude: -93.0554009
Row index: 3, Latitude: 44.6358033, Longitude: -92.6411819
Row index: 4, Latitude: 43.5200495, Longitude: -91.7734337
Row index: 5, Latitude: 35.2459696, Longitude: -97.3172379
Row index: 6, Latitude: 35.2312656, Longitude: -97.4305987
Row index: 7, Latitude: 35.6012586, Longitude: -98.9663366
Row index: 8, Latitude: 36.1568089, Longitude: -96.0039425
Row index: 10, Latitude: 35.5450772, Longitude: -98.3485794
Row index: 11, Latitude: 34.6140671, Longitude: -94.6404362
Row index: 12, Latitude: 35.4296937, Longitude: -94.4390774
Row index: 14, Latitude: 34.7517417, Longitude: -92.2617567
Row index: 15, Latitude: 35.2317388, Longitude: -97.2724342
Row index: 16, Latitude: 35.1428556, Longitude: -95.6445234
Row index: 17, Latitude: 39.0960962, Longitude: -94.6278191
Row index: 18, Latitude: 39.1508302, Longitude: -94.68009
Row index: 19, Latitude: 33.6594958, Longitude: -9
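
Since the popup text is assembled with `<br>` tags, it is treated as HTML, and free-text fields such as the description or a height like `5'8"` may contain characters that interfere with the markup. The sketch below escapes values with the standard library before joining them. `is_blank`, `popup_line` and `link_html` are hypothetical helpers, not part of the notebook.

```python
from html import escape

def is_blank(value):
    """True for None, empty strings and float NaN."""
    return value is None or value == "" or (isinstance(value, float) and value != value)

def popup_line(label, value):
    """Format one 'Label: value' popup line, escaping HTML-special characters."""
    text = "N/A" if is_blank(value) else str(value)
    return f"{escape(label)}: {escape(text)}"

def link_html(url):
    """Wrap a URL in an anchor tag that opens in a new tab."""
    safe = escape(url, quote=True)
    return f'<a href="{safe}" target="_blank">{safe}</a>'

lines = [
    popup_line("Height", "5'8\" (estimated)"),
    popup_line("Race", float("nan")),
    "Links: " + link_html("https://www.namus.gov/UnidentifiedPersons/Case#/4795"),
]
popup_html = "<br>".join(lines)
print(popup_html)
```

The resulting string could be passed as the `popup` argument in the same way as the output of `create_popup`.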

In [167]:
# Save the map as an HTML file for sharing
map.save("map_output.html")