### Find the missing persons
This notebook uses publicly available data on missing people and helps visualize all that is known about them.  

However, the data is significantly incomplete. 

You can contribute to enhancing it by submitting your updates to the missing_persons_clean.csv file. Only submissions that include valid resource links for verification will be merged.

In [98]:
!pip install pandas folium

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import folium # interactive maps of the case locations
from folium.plugins import MarkerCluster



## Load the data

In [99]:
# URL of the raw CSV file on GitHub
url = 'https://raw.githubusercontent.com/sumitdeole/find-the-missing/main/missing_persons_clean.csv'

# Read the CSV file from the URL
df = pd.read_csv(url)

In [100]:
# Let's visualize the dataframe
df.head()

Unnamed: 0,name,description,long,lat,location
0,NamUs #UP4795 ME/C Case Number GC99-158 604UFMN,"date found: November 4, 1999location: Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbs (est)description: A full term infant with umbilical cord still attached was found 10 yds north of the Mississippi shore near 800 Levee Dr, Redwing MN. The body showed slight signs of decomposition upon discovery. The infant had not been in the water for long. The race of the decedent is most likely white. This infant is genetically related maternally to the decedent in case# GC03-127.links: https://www.namus.gov/UnidentifiedPersons/Case#/4795http://www.doenetwork.org/cases/604ufmn.html",-92.541061,44.565821,POINT(-92.5410605 44.5658215)
1,NamUs #UP4804 ME/C Case Number 39918 849UFMN,"date found: July 20, 1977location: St. Paul, Minnesotaage est.: 16 - 30race: Caucasian / Whiteheight: 5'8"" (estimated)weight: 130lbs (estimated)description: Medium length brown hair with brown or green eyes; Multiple abdominal striae; Body found in Mississippi River between Childs and Warner Road, St. Paul.Green, Red, Blue vertically stripped shirt with thinner lateral stripes, high waisted blue jeans, brown knee high stockings size 5 underwear. Shoes�Approximate Size 8 TO 9Multiple Abdominal STRIAlinks: http://doenetwork.org/cases/849ufmn.html;�https://www.namus.gov/UnidentifiedPersons/Case#/4804?nav",-93.055401,44.941109,POINT(-93.0554009 44.9411089)
2,NamUS #UP4808 ME/C Case Number 00-1411 270UFMN,,-93.19406,45.078014,POINT(-93.1940597 45.0780144)
3,NamUs #UP4796 ME/C Case Number GC07-39 102UFMN,"date found: March 26, 2007location: Treasure Island Marina, Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbsdescription: The infant was found in the Buffalo Slough of the Mississippi River at the Treasure Island Marina (Slip 36, Dock C). The infant was near term/term with no apparent congenital abnormalities. The infant appears to be of Caucasian descent and not a member of the Prairie Island tribe. Estimated time of being in the water was from a few weeks from discovery of the body to the previous fall or winter (2006). Hair Black, 3cm in lengthlinks: https://www.namus.gov/UnidentifiedPersons/Case#/4796http://www.doenetwork.org/cases/102ufmn.html",-92.641182,44.635803,POINT(-92.6411819 44.6358033)
4,NamUs #UP6525 ME/C Case Number FC08-61 803UFMN,"date found: August 10, 2008location: Mabel, Minnesotaage est.: 30-40race: African American / Blackheight: weight: description: The skull has been stored in the school since before 1969. Its origin and identity is unknown.links: https://www.namus.gov/UnidentifiedPersons/Case#/6525http://doenetwork.org/cases/803ufmn.html",-91.773434,43.520049,POINT(-91.7734337 43.5200495)


## Data cleaning
### Missing values
Let's now check the extent of missing values.

In [101]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17307 entries, 0 to 17306
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         17307 non-null  object 
 1   description  14803 non-null  object 
 2   long         17307 non-null  float64
 3   lat          17307 non-null  float64
 4   location     17307 non-null  object 
dtypes: float64(2), object(3)
memory usage: 676.2+ KB


(17307, 5)

The *description* column is missing for 2,504 of the 17,307 rows (about 14%). The latitude and longitude columns, which are needed to plot locations, are complete. So for now we keep the rows with a missing *description*.
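
To quantify the gap per column, the share of missing values can be computed directly. A minimal sketch on a toy frame (the `toy` data is made up for illustration; on the real data the same expression would be applied to `df`):

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are illustrative only)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "description": ["text", None, "text", None],
    "lat": [44.5, 44.9, 45.0, 43.5],
})

# isna() yields a boolean frame; its column-wise mean is the missing share
missing_share = toy.isna().mean()
print(missing_share)
```

A share is often easier to compare across columns than the raw non-null counts printed by `info()`.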

In [102]:
# Drop the 'location' column: it repeats the coordinates already held in 'long' and 'lat'.
df = df.drop("location", axis=1)

### Create a per-person unique identifier
The column *name* contains the NamUs case number or, in some rows, the known name of the person. Let's extract the case number where it exists, fall back to the name otherwise, and store the result in a new column. We will use it as a unique identifier.

In [103]:
# Extract the identifier (e.g., UP1234) from the "name" column
df["name_or_NamUs"] = df["name"].str.extract(r'(UP\d+)', expand=False)

# Replacing NaN values (indicating no match found) with the original Name value
df.loc[df['name_or_NamUs'].isnull(), 'name_or_NamUs'] = df.loc[df['name_or_NamUs'].isnull(), 'name']

df.head()

Unnamed: 0,name,description,long,lat,name_or_NamUs
0,NamUs #UP4795 ME/C Case Number GC99-158 604UFMN,"date found: November 4, 1999location: Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbs (est)description: A full term infant with umbilical cord still attached was found 10 yds north of the Mississippi shore near 800 Levee Dr, Redwing MN. The body showed slight signs of decomposition upon discovery. The infant had not been in the water for long. The race of the decedent is most likely white. This infant is genetically related maternally to the decedent in case# GC03-127.links: https://www.namus.gov/UnidentifiedPersons/Case#/4795http://www.doenetwork.org/cases/604ufmn.html",-92.541061,44.565821,UP4795
1,NamUs #UP4804 ME/C Case Number 39918 849UFMN,"date found: July 20, 1977location: St. Paul, Minnesotaage est.: 16 - 30race: Caucasian / Whiteheight: 5'8"" (estimated)weight: 130lbs (estimated)description: Medium length brown hair with brown or green eyes; Multiple abdominal striae; Body found in Mississippi River between Childs and Warner Road, St. Paul.Green, Red, Blue vertically stripped shirt with thinner lateral stripes, high waisted blue jeans, brown knee high stockings size 5 underwear. Shoes�Approximate Size 8 TO 9Multiple Abdominal STRIAlinks: http://doenetwork.org/cases/849ufmn.html;�https://www.namus.gov/UnidentifiedPersons/Case#/4804?nav",-93.055401,44.941109,UP4804
2,NamUS #UP4808 ME/C Case Number 00-1411 270UFMN,,-93.19406,45.078014,UP4808
3,NamUs #UP4796 ME/C Case Number GC07-39 102UFMN,"date found: March 26, 2007location: Treasure Island Marina, Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbsdescription: The infant was found in the Buffalo Slough of the Mississippi River at the Treasure Island Marina (Slip 36, Dock C). The infant was near term/term with no apparent congenital abnormalities. The infant appears to be of Caucasian descent and not a member of the Prairie Island tribe. Estimated time of being in the water was from a few weeks from discovery of the body to the previous fall or winter (2006). Hair Black, 3cm in lengthlinks: https://www.namus.gov/UnidentifiedPersons/Case#/4796http://www.doenetwork.org/cases/102ufmn.html",-92.641182,44.635803,UP4796
4,NamUs #UP6525 ME/C Case Number FC08-61 803UFMN,"date found: August 10, 2008location: Mabel, Minnesotaage est.: 30-40race: African American / Blackheight: weight: description: The skull has been stored in the school since before 1969. Its origin and identity is unknown.links: https://www.namus.gov/UnidentifiedPersons/Case#/6525http://doenetwork.org/cases/803ufmn.html",-91.773434,43.520049,UP6525
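
Not every record writes the case number as `UP1234`. Some names in this dataset use a spaced form such as `UP # 5054`, which the pattern `UP\d+` does not match, so those rows fall back to the full name. The sketch below shows one possible, more tolerant pattern. The helper `extract_case_id` and the constant `CASE_ID` are hypothetical names, not part of the notebook, and the pattern may over-match on names that happen to contain an uppercase "UP" followed by digits.

```python
import re

# Hypothetical pattern: accepts "UP4795", "UP#814" and "UP # 5054"
CASE_ID = re.compile(r'UP\s*#?\s*(\d+)')

def extract_case_id(name):
    """Return a normalized 'UP<digits>' id, or None if no case number is found."""
    if not isinstance(name, str):
        return None
    match = CASE_ID.search(name)
    return f"UP{match.group(1)}" if match else None

print(extract_case_id("NamUs #UP4795 ME/C Case Number GC99-158 604UFMN"))
print(extract_case_id("249UFOK NamUs UP # 5054"))
print(extract_case_id("Yesica Becerra Gomez"))
```

Normalizing to a single `UP<digits>` form would let rows that spell the same case number differently share one identifier.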


Let's check whether the new column uniquely identifies each row of the dataset.

In [104]:
# Let's write a simple function to check whether the column uniquely identifies the dataset
def is_unique(column):
    unique_values = set()
    for value in column:
        if value in unique_values:
            return False
        unique_values.add(value)
    return True

# Call the function
column = df['name_or_NamUs'] 
is_column_unique = is_unique(column)

if is_column_unique:
    print("The column 'name_or_NamUs' uniquely identifies the dataset.")
else:
    print("The column 'name_or_NamUs' does not uniquely identify the dataset.")

The column 'name_or_NamUs' does not uniquely identify the dataset.
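
pandas also exposes this check directly, which avoids the hand-written loop. A short sketch on a made-up Series (`ids` is illustrative, not the notebook's column):

```python
import pandas as pd

# Illustrative identifiers; "UP8086" appears twice
ids = pd.Series(["UP8086", "UP814", "UP8086", "UP4795"])

# Series.is_unique is True only if no value repeats
print(ids.is_unique)

# duplicated(keep=False) flags every row that shares its value with another row
n_repeated_rows = ids.duplicated(keep=False).sum()
print(n_repeated_rows)
```

On the real data, `df['name_or_NamUs'].is_unique` would give the same answer as the `is_unique` function above.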


Let's look at the non-unique instances.

In [105]:
# Column to check for uniqueness
column_name = 'name_or_NamUs'  

# Identify duplicate values in the column
duplicates = df[column_name][df[column_name].duplicated()]

# Extract rows with these duplicate values
non_unique_values = df[df[column_name].isin(duplicates)]

# Display non-unique values
print("Non-unique values in the column:")
non_unique_values.sort_values(by="name_or_NamUs", ascending=True).tail(10)

Non-unique values in the column:


Unnamed: 0,name,description,long,lat,name_or_NamUs
738,NamUs #UP8086,"date found: February 10, 2000location: Gulf County, FLage est.: Adultrace: Black/African Americanheight: weight: description: Shrimp boat workers found skull in the Gulf of Mexico links: https://www.namus.gov/UnidentifiedPersons/Case#/8086",-85.478156,29.540976,UP8086
2025,NamUs #UP8086,"date found: February 10, 2000location: Gulf County, Floridaage est.: race: Black / African Americanheight: weight: description: Shrimp boat workers found skull in the Gulf of Mexicolinks: https://www.namus.gov/UnidentifiedPersons/Case#/8086?nav",-85.354938,29.800187,UP8086
10077,NamUs #UP814,,-80.609948,26.427968,UP814
8653,Unidentified Person / NamUs #UP814,,-80.276733,26.651452,UP814
13352,Yesica Becerra Gomez,,-111.749839,31.564789,Yesica Becerra Gomez
13393,Yesica Becerra Gomez,,-111.200866,31.374746,Yesica Becerra Gomez
7431,Yokohama Suijo 18-2,"date found: location: age est.: race: height: weight: description: Estimated Date of Death: May 14, 2006Estimated Age: 55 to 65Height: 176cmDental: Gold bridges on upper front teethDistinguishing Characteristics: 1cm wart on top of headClothing: Black fleece, blue windbreaker, green pants, gray sweat pantsPersonal belongings: Nipperhttp://www.police.pref.kanagawa.jp/corp/c_mes/corp06m2.htm#suijo18-2links:",139.635458,35.464528,Yokohama Suijo 18-2
7387,Yokohama Suijo 18-2,"date found: location: age est.: race: height: weight: description: Estimated Date of Death: April 20, 2006Estimated Age: 50 to 70Height: 161cmDental: Upper teeth all missing as well as lower right molarsDistinguishing Characteristics: Decedent had cut his left wrist and was bandagedClothing: Black suit with a name �Saito� written on it, blue checkered sweater, light brown polo shirt, gray socks, black 24cm leather shoesPersonal belonging: Lighterhttp://www.police.pref.kanagawa.jp/corp/c_mes/corp06m2.htm#suijo18-1links:",139.683266,35.442771,Yokohama Suijo 18-2
13530,Yu Chin Chang Goodson,"Nicknames / Aliases: date: location: age: race: height: weight: description: March 25, 2005 http://charleyproject.org/case/yu-chin-chang-goodsonlinks:",-87.722724,34.504077,Yu Chin Chang Goodson
14302,Yu Chin Chang Goodson,"Nicknames / Aliases: date: location: age: race: height: weight: description: 03/25/2005Goodson was last seen lived in a halfway house for mentally disabled adults in the 100 block of Nortin Avenue in Russellville, Alabama at the time of her disappearance. She was last seen leaving the facility on March 25, 2005.She got into a small gray or silver older model car, possibly a Nissan or Mazda, with a loud muffler. The car headed east on Highway 24 towards Decatur, Alabama. Goodson has never been heard from again.Authorities believe Goodson may be trying to reach her son, who lives in Decatur. She used to live in Florence, Alabama, and had lived in the halfway house for only a few months prior to her disappearance. She did not take any identification with her and has not accessed her bank accounts since her disappearance.http://charleyproject.org/case/yu-chin-chang-goodsonlinks:",-87.723165,34.502266,Yu Chin Chang Goodson


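Okay, the culprit is repeated values in the *name* column, which carry over to *name_or_NamUs*. Note that some repeats are not exact duplicates: the two "Yokohama Suijo 18-2" rows, for example, differ in coordinates and description. For simplicity, let's keep the first occurrence of each identifier and drop the rest.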

### Drop the duplicates

In [106]:
# Drop rows with a repeated 'name_or_NamUs' value, keeping the first occurrence
df_v2 = df.drop_duplicates(subset=['name_or_NamUs'], keep='first')
df_v2.shape

(16897, 5)
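
Because `keep='first'` discards later rows even when they carry different information, it can help to list the identifiers whose repeated rows disagree before dropping anything. A minimal sketch, assuming that a disagreement in coordinates is the signal of interest. The `toy` frame reuses the two `UP8086` coordinate pairs shown earlier; the repeated `UP4795` row is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "name_or_NamUs": ["UP8086", "UP8086", "UP814", "UP4795", "UP4795"],
    "long": [-85.478156, -85.354938, -80.609948, -92.541061, -92.541061],
    "lat": [29.540976, 29.800187, 26.427968, 44.565821, 44.565821],
})

# Count distinct (long, lat) pairs per identifier
pairs_per_id = (
    toy.drop_duplicates(subset=["name_or_NamUs", "long", "lat"])
       .groupby("name_or_NamUs")
       .size()
)

# Identifiers with more than one distinct location need a manual look
conflicting_ids = pairs_per_id[pairs_per_id > 1].index.tolist()
print(conflicting_ids)
```

Identifiers that appear in `conflicting_ids` are candidates for manual review rather than automatic removal.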

In [107]:
# Check whether the 'long' column has any NaN values
has_nan = df_v2['long'].isna().any()
# Print the result
print(f"Does the 'long' column have any NaN values? {has_nan}")

# Check whether the 'lat' column has any NaN values
has_nan = df_v2['lat'].isna().any()
# Print the result
print(f"Does the 'lat' column have any NaN values? {has_nan}")

Does the 'long' column have any NaN values? False
Does the 'lat' column have any NaN values? False


In [108]:
# Let's check again whether "name_or_NamUs" now uniquely identifies the data
column = df_v2['name_or_NamUs'] 
is_column_unique = is_unique(column)

if is_column_unique:
    print("The column uniquely identifies the dataset.")
else:
    print("The column does not uniquely identify the dataset.")

The column uniquely identifies the dataset.


Perfect! That worked out.


Some missing persons are known by their nicknames. Let's extract the nicknames (the quoted part of the *name* column) and then drop that column.

In [109]:
# Some missing persons have nicknames that can also be used for visualization. Let's create a simple function to extract them.
def extract_nicknames(name):
    if not isinstance(name, str):
        return None
    else:
        match = re.search(r'"(.*?)"', name)
        return match.group(1) if match else None

df_v3 = df_v2.copy()
# Create a new column for nicknames
df_v3['nickname'] = df_v3['name'].apply(extract_nicknames)

# Drop the "name" column ("location" was already dropped above)
df_v3 = df_v3.drop(columns=["name"])
df_v3.head()

Unnamed: 0,description,long,lat,name_or_NamUs,nickname
0,"date found: November 4, 1999location: Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbs (est)description: A full term infant with umbilical cord still attached was found 10 yds north of the Mississippi shore near 800 Levee Dr, Redwing MN. The body showed slight signs of decomposition upon discovery. The infant had not been in the water for long. The race of the decedent is most likely white. This infant is genetically related maternally to the decedent in case# GC03-127.links: https://www.namus.gov/UnidentifiedPersons/Case#/4795http://www.doenetwork.org/cases/604ufmn.html",-92.541061,44.565821,UP4795,
1,"date found: July 20, 1977location: St. Paul, Minnesotaage est.: 16 - 30race: Caucasian / Whiteheight: 5'8"" (estimated)weight: 130lbs (estimated)description: Medium length brown hair with brown or green eyes; Multiple abdominal striae; Body found in Mississippi River between Childs and Warner Road, St. Paul.Green, Red, Blue vertically stripped shirt with thinner lateral stripes, high waisted blue jeans, brown knee high stockings size 5 underwear. Shoes�Approximate Size 8 TO 9Multiple Abdominal STRIAlinks: http://doenetwork.org/cases/849ufmn.html;�https://www.namus.gov/UnidentifiedPersons/Case#/4804?nav",-93.055401,44.941109,UP4804,
2,,-93.19406,45.078014,UP4808,
3,"date found: March 26, 2007location: Treasure Island Marina, Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbsdescription: The infant was found in the Buffalo Slough of the Mississippi River at the Treasure Island Marina (Slip 36, Dock C). The infant was near term/term with no apparent congenital abnormalities. The infant appears to be of Caucasian descent and not a member of the Prairie Island tribe. Estimated time of being in the water was from a few weeks from discovery of the body to the previous fall or winter (2006). Hair Black, 3cm in lengthlinks: https://www.namus.gov/UnidentifiedPersons/Case#/4796http://www.doenetwork.org/cases/102ufmn.html",-92.641182,44.635803,UP4796,
4,"date found: August 10, 2008location: Mabel, Minnesotaage est.: 30-40race: African American / Blackheight: weight: description: The skull has been stored in the school since before 1969. Its origin and identity is unknown.links: https://www.namus.gov/UnidentifiedPersons/Case#/6525http://doenetwork.org/cases/803ufmn.html",-91.773434,43.520049,UP6525,


In [110]:
df_v3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16897 entries, 0 to 17306
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   description    14535 non-null  object 
 1   long           16897 non-null  float64
 2   lat            16897 non-null  float64
 3   name_or_NamUs  16897 non-null  object 
 4   nickname       298 non-null    object 
dtypes: float64(2), object(3)
memory usage: 792.0+ KB


### Parse entities from *description* column
Let's first visualize the values.

In [111]:
df_v3["description"][1]

'date found: July 20, 1977location: St. Paul, Minnesotaage est.: 16 - 30race: Caucasian / Whiteheight: 5\'8" (estimated)weight: 130lbs (estimated)description: Medium length brown hair with brown or green eyes; Multiple abdominal striae; Body found in Mississippi River between Childs and Warner Road, St. Paul.Green, Red, Blue vertically stripped shirt with thinner lateral stripes, high waisted blue jeans, brown knee high stockings size 5 underwear. Shoes�Approximate Size 8 TO 9Multiple Abdominal STRIAlinks: http://doenetwork.org/cases/849ufmn.html;�https://www.namus.gov/UnidentifiedPersons/Case#/4804?nav'

The column packs several labeled fields about the person into one string: date found, location, estimated age, race, height, weight, a detailed description, and links.

In [112]:
# Function to parse the description column and extract its labeled fields
def parse_description(description):
    if not isinstance(description, str):
        return {
            "date_found": None, 
            "date_seen": None, 
            "location": None, 
            "age_est": None, 
            "race": None, 
            "height": None, 
            "weight": None, 
            "description_2": None
        }

    # Patterns to extract each field
    fields = {
        "date_found": r'(date found:|date:)\s*(?P<value>.*?)(location:|$)',
        # The dot in "est." is escaped so it matches a literal period only
        "location": r'location:\s*(?P<value>.*?)(age est\.:|age:|$)',
        "age_est": r'(age est\.:|age:)\s*(?P<value>.*?)(race:|height:|$)',
        "race": r'race:\s*(?P<value>.*?)(height:|weight:|$)',
        "height": r'height:\s*(?P<value>.*?)(weight:|description:|$)',
        "weight": r'weight:\s*(?P<value>.*?)(description:|$)',
        "description_2": r'description:\s*(?P<value>.*?)(links:|$)',
    }
    
    # Extract values for each field
    parsed_data = {}
    for field, pattern in fields.items():
        match = re.search(pattern, description, re.IGNORECASE)
        parsed_data[field] = match.group('value').strip() if match else None

    # Attempt to find the first date in the description
    date_pattern = r'\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},\s*\d{4}\b'
    date_match = re.search(date_pattern, description)
    if date_match:
        parsed_data["date_seen"] = date_match.group().strip()
    else:
        parsed_data["date_seen"] = None

    return parsed_data

# Apply the parsing function to the "description" column
parsed_info = df_v3['description'].apply(parse_description)

# Convert parsed_info to a DataFrame and concatenate it with the original DataFrame
parsed_df = pd.json_normalize(parsed_info)
df_v3 = pd.concat([df_v3, parsed_df], axis=1)
df_v3.head(2)

Unnamed: 0,description,long,lat,name_or_NamUs,nickname,date_found,location,age_est,race,height,weight,description_2,date_seen
0,"date found: November 4, 1999location: Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9""weight: 6 lbs (est)description: A full term infant with umbilical cord still attached was found 10 yds north of the Mississippi shore near 800 Levee Dr, Redwing MN. The body showed slight signs of decomposition upon discovery. The infant had not been in the water for long. The race of the decedent is most likely white. This infant is genetically related maternally to the decedent in case# GC03-127.links: https://www.namus.gov/UnidentifiedPersons/Case#/4795http://www.doenetwork.org/cases/604ufmn.html",-92.541061,44.565821,UP4795,,"November 4, 1999","Red Wing, Minnesota",0-12 months,Caucasian / White,"1'9""",6 lbs (est),"A full term infant with umbilical cord still attached was found 10 yds north of the Mississippi shore near 800 Levee Dr, Redwing MN. The body showed slight signs of decomposition upon discovery. The infant had not been in the water for long. The race of the decedent is most likely white. This infant is genetically related maternally to the decedent in case# GC03-127.",
1,"date found: July 20, 1977location: St. Paul, Minnesotaage est.: 16 - 30race: Caucasian / Whiteheight: 5'8"" (estimated)weight: 130lbs (estimated)description: Medium length brown hair with brown or green eyes; Multiple abdominal striae; Body found in Mississippi River between Childs and Warner Road, St. Paul.Green, Red, Blue vertically stripped shirt with thinner lateral stripes, high waisted blue jeans, brown knee high stockings size 5 underwear. Shoes�Approximate Size 8 TO 9Multiple Abdominal STRIAlinks: http://doenetwork.org/cases/849ufmn.html;�https://www.namus.gov/UnidentifiedPersons/Case#/4804?nav",-93.055401,44.941109,UP4804,,"July 20, 1977","St. Paul, Minnesota",16 - 30,Caucasian / White,"5'8"" (estimated)",130lbs (estimated),"Medium length brown hair with brown or green eyes; Multiple abdominal striae; Body found in Mississippi River between Childs and Warner Road, St. Paul.Green, Red, Blue vertically stripped shirt with thinner lateral stripes, high waisted blue jeans, brown knee high stockings size 5 underwear. Shoes�Approximate Size 8 TO 9Multiple Abdominal STRIA",
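
One thing worth checking at this step: `pd.concat(..., axis=1)` aligns rows by index label, not by position. `df_v3` keeps its original labels after `drop_duplicates`, while `pd.json_normalize` returns a fresh 0-based index, so the two frames may not line up row for row. The toy example below (all data made up) shows the effect and one possible remedy, which is to copy the original index onto the parsed frame before concatenating.

```python
import pandas as pd

# Toy frame whose index has a gap, as after drop_duplicates (labels 0, 2, 3)
left = pd.DataFrame({"id": ["a", "c", "d"]}, index=[0, 2, 3])
parsed = pd.Series([{"race": "x"}, {"race": "y"}, {"race": "z"}], index=left.index)

# json_normalize returns a fresh 0..n-1 index
parsed_df = pd.json_normalize(parsed.tolist())

# Aligns on labels: the union {0, 1, 2, 3} gives 4 rows with shifted values
misaligned = pd.concat([left, parsed_df], axis=1)

# Reuse the original labels so rows line up by position
parsed_df.index = left.index
aligned = pd.concat([left, parsed_df], axis=1)

print(misaligned.shape, aligned.shape)
```

A row count that grows after the concat, or coordinate columns that suddenly contain NaN, would be symptoms of this kind of misalignment.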


### Extract links from the *description* column

Let's also extract the links and store them as a list in a separate column.

In [113]:
# Function to extract links from a given text
def extract_links(description):
    if not isinstance(description, str):
        return []
    # Pattern to match http or https links
    links_pattern = r'(https?://[^\s]+)'
    links_matches = re.findall(links_pattern, description)
    
    # Correctly separate concatenated links
    if links_matches:
        links = []
        for match in links_matches:
            parts = match.split('http')
            for i, part in enumerate(parts):
                if i == 0 and part:
                    links.append(part)
                elif part:
                    links.append('http' + part)
        return links
    return []

# Apply the extract_links function to the 'description' column
df_v3['links'] = df_v3['description'].apply(extract_links)
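
The same splitting of fused links can also be expressed in a single regular expression. This is a sketch of an alternative, not the notebook's method: a lazy match stops at whitespace, at the end of the text, or where the next `http://` or `https://` begins. The names `URL` and `split_links` are hypothetical.

```python
import re

# Lazy match that stops at whitespace, end of text, or the start of the next URL
URL = re.compile(r'https?://\S+?(?=https?://|\s|$)')

def split_links(text):
    """Return the URLs found in text, splitting URLs that are fused together."""
    return URL.findall(text) if isinstance(text, str) else []

sample = ("links: https://www.namus.gov/UnidentifiedPersons/Case#/4795"
          "http://www.doenetwork.org/cases/604ufmn.html")
print(split_links(sample))
```

Like the `split('http')` approach, this would also split a URL that legitimately embeds another URL in its path or query string.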

In [114]:
df_v3[['description', 'links']].iloc[0]

description    date found: November 4, 1999location: Red Wing, Minnesotaage est.: 0-12 monthsrace: Caucasian / Whiteheight: 1'9"weight: 6 lbs (est)description: A full term infant with umbilical cord still attached was found 10 yds north of the Mississippi shore near 800 Levee Dr, Redwing MN. The body showed slight signs of decomposition upon discovery. The infant had not been in the water for long. The race of the decedent is most likely white. This infant is genetically related maternally to the decedent in case# GC03-127.links: https://www.namus.gov/UnidentifiedPersons/Case#/4795http://www.doenetwork.org/cases/604ufmn.html
links                                                                                                                                                                                                                                                                                                                                                                          

In [115]:
substring_match = df_v3["name_or_NamUs"].str.contains("Margaret Eileen Pennington", na=False)
df_v3[substring_match]

Unnamed: 0,description,long,lat,name_or_NamUs,nickname,date_found,location,age_est,race,height,weight,description_2,date_seen,links
15039,"Nicknames / Aliases: date: July 2, 1999location: Silver Springs, NVage: 40race: Whiteheight: 5'1""weight: 180-185 lbsdescription: Last seen at residence in Silver Springs, NV. Last seen wearing DRK/BLK Jeans, BLU SS sweater, WHT tennis shoes.links: https://www.namus.gov/MissingPersons/Case#/20096",-119.225561,39.416766,Margaret Eileen Pennington,,"January 1, 1965","Lake Odessa, MI",26,White,"5'5""-5'7""",140-150 lbs,"Mary was last seen by her 8-year-old son and 6 year old daughter when she visited them in school one day in 1965. She told her son she was going away for a while, but would be back. She then traveled to her sister, Kathleen Lindstrom's,(according to Kathleen) home in Muskegon where she left a photograph and disappeared.Though she wasn't reported missing until August 2002, investigators discovered there is no record of Mary, public or private, since the mid-1960s; and there has been no activity on her social security number since that time.",,[https://www.namus.gov/MissingPersons/Case#/20096]


In [116]:
pd.set_option('display.max_colwidth', None)
substring_match = df_v3["name_or_NamUs"].str.contains("2012001299", na=False)
df_v3[substring_match]

Unnamed: 0,description,long,lat,name_or_NamUs,nickname,date_found,location,age_est,race,height,weight,description_2,date_seen,links
8386,"date found: October 17, 1991location: Wawa, ONage est.: 40-60race: Whiteheight: 167 cm (5'6"")weight: 165 lbs (est)description: On October 17, 1991, a male was found on the shoreline of Sandy Beach in Michipicoten Harbour, Lake Superior, near Wawa. His postmortem weight was recorded as being 165 lbs; however, this measurement may not represent his actual living weight due to postmortem changes noted at autopsy. His appendix was present. He is thought to have been dead from a few weeks to a few months.links: https://www.services.rcmp-grc.gc.ca/missing-disparus/case-dossier.jsf?case=2012001299&id=1http://www.doenetwork.org/hot/hotcase1901.html",-84.855135,47.95556,2012001299 / Hot Case 1901,,,,,,,,"January 5, 1991Los Angeles John Doe was a man found deceased in 1991 in an abandoned car.The vehicle was used by the business for ""spare parts"" and was kept unlocked.https://www.namus.gov/UnidentifiedPersons/Case#/3872",,"[https://www.services.rcmp-grc.gc.ca/missing-disparus/case-dossier.jsf?case=2012001299&id=1, http://www.doenetwork.org/hot/hotcase1901.html]"


In [117]:
df_v3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17307 entries, 0 to 16600
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   description    14535 non-null  object 
 1   long           16897 non-null  float64
 2   lat            16897 non-null  float64
 3   name_or_NamUs  16897 non-null  object 
 4   nickname       298 non-null    object 
 5   date_found     14533 non-null  object 
 6   location       14533 non-null  object 
 7   age_est        14533 non-null  object 
 8   race           14533 non-null  object 
 9   height         14533 non-null  object 
 10  weight         14533 non-null  object 
 11  description_2  14533 non-null  object 
 12  date_seen      4526 non-null   object 
 13  links          17307 non-null  object 
dtypes: float64(2), object(12)
memory usage: 2.0+ MB


Unfortunately, when *date_found* is missing, *date_seen* and the other fields parsed from the *description* column are missing too. Not much can be done there :(
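
A separate reason for sparse *date_seen* values may be the trailing `\b` in the date pattern. When the date is fused to the next label, as in `1999location`, there is no word boundary after the year, so the pattern finds nothing. The sketch below compares the original pattern with a variant that only requires the year not to be followed by another digit. The names `DATE_STRICT` and `DATE_RELAXED` are hypothetical.

```python
import re

MONTHS = (r'(?:January|February|March|April|May|June|July|August|'
          r'September|October|November|December)')

# Original form: requires a word boundary after the four-digit year
DATE_STRICT = re.compile(rf'\b{MONTHS}\s+\d{{1,2}},\s*\d{{4}}\b')
# Relaxed form: only requires that no further digit follows the year
DATE_RELAXED = re.compile(rf'\b{MONTHS}\s+\d{{1,2}},\s*\d{{4}}(?!\d)')

text = "date found: November 4, 1999location: Red Wing, Minnesota"
print(DATE_STRICT.search(text))           # None: "9" and "l" are both word characters
print(DATE_RELAXED.search(text).group())  # November 4, 1999
```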

In [118]:
# check the null values in date_found column
df_v3.loc[(df_v3["date_found"].isnull())]

Unnamed: 0,description,long,lat,name_or_NamUs,nickname,date_found,location,age_est,race,height,weight,description_2,date_seen,links
2,,-93.194060,45.078014,UP4808,,,,,,,,,,[]
9,,-97.328052,34.826768,249UFOK�NamUs UP # 5054,,,,,,,,,,[]
13,,-95.215244,35.494709,Webbers Falls Jane Doe 617UFOK�NamUs UP # 9183,,,,,,,,,,[]
25,,-92.676587,33.207719,"""El Dorado Jane Doe""�81UFAR",El Dorado Jane Doe,,,,,,,,,[]
56,,-84.106203,42.333587,UP8182,,,,,,,,,,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14217,,,,,,,,,,,,,,[]
14264,,,,,,,,,,,,,,[]
15881,,,,,,,,,,,,,,[]
16392,,,,,,,,,,,,,,[]


In [119]:
print(df.shape, df_v3.shape)

(17307, 5) (17307, 14)


In [120]:
# Drop rows that have no latitude or longitude, since they cannot be placed on the map
print("missing values in columns 'lat' and 'long' are:", df_v3["lat"].isna().sum(), "and" ,df_v3["long"].isna().sum())
df_v3 = df_v3.dropna(subset=['lat', 'long'])
print("After deleting missing values in latitude and longitude columns: ", df.shape, df_v3.shape)

missing values in columns 'lat' and 'long' are: 410 and 410
After deleting missing values in latitude and longitude columns:  (17307, 5) (16897, 14)


## Let's begin the visualization 

In [121]:
def create_popup(row):
    # Convert the row to a dictionary to ensure scalar values
    row = row.to_dict()

    # Normalize 'links' to a list: NaN becomes [], a single value becomes [value]
    if isinstance(row['links'], float) and pd.isna(row['links']):
        links = []  # Initialize empty list if 'links' is NaN
    elif not isinstance(row['links'], list):
        # Convert the single link to a list
        links = [row['links']]
    else:
        links = row['links']  # 'links' is already a list, so use it as it is

    # Helper function to handle NaN values
    def safe_str(val):
        return str(val) if not pd.isna(val) else 'N/A'

    # Construct the popup text
    popup_text = (
        f"Name/ID: {safe_str(row['name_or_NamUs'])}<br>"
        f"Nickname: {safe_str(row['nickname'])}<br>"
        f"Race: {safe_str(row['race'])}<br>"
        f"Age Est.: {safe_str(row['age_est'])}<br>"
        f"Date Found: {safe_str(row['date_found'])}<br>"
        f"Date Seen: {safe_str(row['date_seen'])}<br>"
        f"Location: {safe_str(row['location'])}<br>"
        f"Height: {safe_str(row['height'])}<br>"
        f"Weight: {safe_str(row['weight'])}<br>"
        f"Description: {safe_str(row['description_2'])}<br>"
        f"Links: {', '.join(links) if links else 'N/A'}"
    )
    return popup_text

# Create a map object centered on the contiguous United States
us_center = [37.0902, -95.7129]  # Approximate center of the contiguous US
# Named missing_map to avoid shadowing Python's built-in map()
missing_map = folium.Map(location=us_center, zoom_start=4)

# Create a marker cluster object
marker_cluster = MarkerCluster().add_to(missing_map)

# Add markers for each location with detailed popup
for idx, row in df_v3.iterrows():
    lat = row['lat']
    lon = row['long']
    if pd.notna(lat) and pd.notna(lon):  # Check if lat and lon are not NaN
        # Construct the popup for the current row
        popup = create_popup(row)
        # Create a marker at the current location with the popup text
        folium.Marker([lat, lon], popup=popup).add_to(marker_cluster)
        
# Display the map
# missing_map

In [122]:
# Save the map as an HTML file for sharing
missing_map.save("map_output.html")
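
The popup text is rendered as HTML (it already relies on `<br>` for line breaks), so the links could be shown as clickable anchors instead of plain text. A minimal sketch using only the standard library follows. The helper `links_to_html` is hypothetical and would replace the `', '.join(links)` expression in `create_popup`.

```python
import html

def links_to_html(links):
    """Render a list of URLs as clickable anchors, escaping special characters."""
    if not links:
        return "N/A"
    anchors = []
    for url in links:
        # Escape &, <, > and quotes so the URL is safe inside an HTML attribute
        safe = html.escape(url, quote=True)
        anchors.append(f'<a href="{safe}" target="_blank">{safe}</a>')
    return ", ".join(anchors)

print(links_to_html(["https://www.namus.gov/UnidentifiedPersons/Case#/4795"]))
```

Escaping matters here because the descriptions come from scraped text and may contain characters such as `&` or `<`.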