# Wikipedia Notable Life Expectancies

# [Notebook 4 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean3_thanak_2022_06_23.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To supress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")

<IPython.core.display.Javascript object>

## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean2.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean2", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 132652 rows and 21 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0,
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0,


<IPython.core.display.Javascript object>

In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
132650,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(1980),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0,
132651,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0,


<IPython.core.display.Javascript object>

In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
95846,7,Gayle Shepherd,", 81, American singer , dementia.",https://en.wikipedia.org/wiki/Gayle_Shepherd,6,2018,May,(Shepherd Sisters),,American singer,dementia,,,,,,,,,81.0,
20679,21,Bob Poser,", 92, American baseball player .",https://en.wikipedia.org/wiki/Bob_Poser,1,2002,May,"(Chicago White Sox, St. Louis Browns)",,American baseball player,,,,,,,,,,92.0,
102054,18,György Baló,", 71, Hungarian broadcaster.",https://en.wikipedia.org/wiki/Gy%C3%B6rgy_Bal%C3%B3,1,2019,March,,,Hungarian broadcaster,,,,,,,,,,71.0,
75481,16,Adrian Robinson,", 25, American football player , suicide by hanging.",https://en.wikipedia.org/wiki/Adrian_Robinson,12,2015,May,"(Pittsburgh Steelers, Temple Owls)",,American football player,suicide by hanging,,,,,,,,,25.0,
14287,27,Krishna Pal Singh,", 77, Indian activist and politician.",https://en.wikipedia.org/wiki/Krishna_Pal_Singh,5,1999,September,,,Indian activist and politician,,,,,,,,,,77.0,


<IPython.core.display.Javascript object>

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132652 entries, 0 to 132651
Data columns (total 21 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132652 non-null  object 
 1   name            132652 non-null  object 
 2   info            132652 non-null  object 
 3   link            132652 non-null  object 
 4   num_references  132652 non-null  object 
 5   year            132652 non-null  int64  
 6   month           132652 non-null  object 
 7   info_parenth    49830 non-null   object 
 8   info_1          35 non-null      object 
 9   info_2          132604 non-null  object 
 10  info_3          62571 non-null   object 
 11  info_4          12605 non-null   object 
 12  info_5          1497 non-null    object 
 13  info_6          216 non-null     object 
 14  info_7          31 non-null      object 
 15  info_8          6 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   

<IPython.core.display.Javascript object>

#### Loading `nation_country_dict` from Pickle File to Dictionary `nation_map`

In [6]:
# Load the nation_country_dict
with open("nation_country_dict.pkl", "rb") as f:
    nation_map = pickle.load(f)

<IPython.core.display.Javascript object>

## Extracting Nationality Continued
Here is the approach we will take:
- The plan will be to save the country name, in lieu of nationality, in new `place_1` and `place_2` columns as it is standardized for the various associated nationality values.
- First, we will update the keys and values in `nation_map` by replacing hyphens with a single space.
- Then we will remove "-born" from the column we are searching, as well as replace "-" and "/" each with single spaces.  In this step, we can also remove leading and trailing periods and whitespace.
- We will proceed to search the numbered `info` columns in order checking as follows:
    1. if column value starts with a value in the dictionary:
        - save country to `place_1` and remove value from searched column.
    2. if `place_1` value has been found:
        - if updated column value starts with a value in the dictionary:
            - save country to `place_2` and remove value from searched column.
    3. Repeat steps 1 and 2 but comparing with country (dictionary keys)
    4. Check unique values for column starting with capital letters.

#### Removing "-" and "." from `nation_map`

In [7]:
# Removing hyphens from nation_map
nation_map = {
    key.replace("-", ""): value.replace("-", " ") for key, value in nation_map.items()
}

# Removing periods from nation_map
nation_map = {
    key.replace(".", ""): value.replace(".", " ") for key, value in nation_map.items()
}

<IPython.core.display.Javascript object>

#### Removing or Replacing Extra Characters in Numbered `info` Columns

In [8]:
%%time

# List of columns to treat
cols_lst = [
    "info_1",
    "info_2",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Dictionary of keys to find and values to replace keys
replace_dict = {'-born': '', '–born': '', '-': ' ', '–': ' ', '/': ' ', '.': ' '}

# For loop to find and replace characters in replace_dict in columns in cols_list
# and strip any leading or trailing periods or whitespace
for column in cols_lst:
    for key, value in replace_dict.items():
        for index in df[column].notna().index:
            item = df.loc[index, column]
            if item:
                df.loc[index, column] = item.replace(key, value).strip()
                
# Chime notification when cell successfully executes
chime.success()

CPU times: total: 1min 57s
Wall time: 1min 58s


<IPython.core.display.Javascript object>

#### Checking `info_1` for `place_1`

In [9]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_1"

# Dataframe to check
dataframe = df[(df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Check a sample of treated rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1
59485,24,Kristján Jóhannsson,"83, Icelandic Olympic athlete.",https://en.wikipedia.org/wiki/Kristj%C3%A1n_J%C3%B3hannsson_(athlete),2,2013,January,,Olympic athlete,,,,,,,,,,,83.0,,Iceland
11825,23,Manuel Mejía Vallejo,", 75 Colombian writer.",https://en.wikipedia.org/wiki/Manuel_Mej%C3%ADa_Vallejo,2,1998,July,,writer,,,,,,,,,,,75.0,,Colombia


<IPython.core.display.Javascript object>

#### Observations:
- `info_1` provides us a nice small sample on which to test code.
- We successfully extracted those `place_1` values, now we will do the same on the treated rows for `place_2`.

#### Checking `info_1` for `place_2`

In [10]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_2"

# Dataframe to check
dataframe = df[(df["place_1"].notna()) & (df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Check a sample of rows
df.sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1
122087,5,Ashraf Sehrai,", 77, Indian Kashmiri separatist, chairman of All Parties Hurriyat Conference , COVID-19.",https://en.wikipedia.org/wiki/Ashraf_Sehrai,34,2021,May,(since 2018),,Indian Kashmiri separatist,chairman of All Parties Hurriyat Conference,COVID 19,,,,,,,,77.0,,
82957,2,Cherokee Run,", 26, American thoroughbred racehorse and sire, euthanised.",https://en.wikipedia.org/wiki/Cherokee_Run,4,2016,July,,,American thoroughbred racehorse and sire,euthanised,,,,,,,,,26.0,,


<IPython.core.display.Javascript object>

#### Observations:
- Here we can see that the new column `place_2` has not yet been added as there were not any matching values.
- Let us confirm by checking the remaining unique values in `info_1`.

#### Checking Remaining Unique Values in `info_1`

In [11]:
# Checking unique values
df["info_1"].unique()

array([None, 'politician', 'Olympic sprinter', 'gridiron football player',
       'writer', 'businessman', 'social psychologist', 'King of Nepal',
       'Maori leader', 'artist', 'English sports journalist',
       'Jules Engel', 'early', 'aka', 'Jr', 'professional wrestler',
       'automotive engineer', 'materials scientist', 'weightlifter',
       'common chimpanzee', '', 'Olympic athlete', 'actor',
       'Olympic gymnast', 'broadcaster and writer', 'Olympic swimmer',
       'Olympic boxer', 'Olympic wrestler', 'Olympic sailor',
       'basketball player', 'college basketball coach',
       'choral conductor', 'Tree of the Year'], dtype=object)

<IPython.core.display.Javascript object>

#### Obsservations:
- Neither "English" nor "Maori" are keys in the current dictionary.
- Maori is an ethnicity within the country of New Zealand, so for now, we will add it as a key our dictionary with the country value of New Zealand.  If we have matching first and second countries, we can later remove the second value.
- We will also add the key "English" with the country value 'United Kingdom of Great Britain and Northern Ireland'.
- Then, we can rerun the above code for `place_1` and `place_2`.
- The country value of "Nepal" is also present.  We will hold off on extracting country names until we have first exhausted matching nationalities, as the Wikipedia field called for nationalities.

#### Updating `nation_map`

In [12]:
# Adding key: country pairs to nation_map
nation_map["English"] = nation_map["British"]
nation_map["Maori"] = nation_map["New Zealand"]

<IPython.core.display.Javascript object>

#### Re-checking `info_1` for `place_1`

In [13]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_1"

# Dataframe to check
dataframe = df[(df[extract_to].isna()) & (df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1
55331,5,Lucky Diamond,", Maltese 15, American Guinness World Record dog, holder , cancer.",https://en.wikipedia.org/wiki/Lucky_Diamond_(dog),13,2012,June,(dog most photographed with celebrities),,American Guinness World Record dog,holder,cancer,,,,,,,,15.0,,Malta
103467,26,Edmund Seger,", 82 German Olympic wrestler.",https://en.wikipedia.org/wiki/Edmund_Seger,2,2019,May,,Olympic wrestler,,,,,,,,,,,82.0,,Germany


<IPython.core.display.Javascript object>

#### Re-checking `info_1` for `place_2`

In [14]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_2"

# Dataframe to check
dataframe = df[(df["place_1"].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Checking rows
df[df["place_2"].notna()]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
19580,20,Dame Miraka Szászy,", 80. New Zealand Maori leader.",https://en.wikipedia.org/wiki/Mira_Sz%C3%A1szy,21,2001,December,,leader,,,,,,,,,,,80.0,,New Zealand,New Zealand


<IPython.core.display.Javascript object>

#### Observations:
- Our code appears to be finding the matching values and assigning the corresponding country to the correct nation column.
- We see "New Zealand" added to both nation columns here, which was expected as both New Zealand and Maori are in the description
- As an aside, we will need to check our final values where `place_1` is "American" and `place_2` is "Indian" as our code will indicate United States and India, which may or may not be correct. 
- Now we can proceed to doing the same extraction on `info_2`.

#### Checking `info_2` for `place_1`

In [15]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_1"

# Dataframe to check
dataframe = df[(df[extract_to].isna()) & (df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Chime notification when cell successfully executes
chime.success()

CPU times: total: 4min 19s
Wall time: 4min 19s


<IPython.core.display.Javascript object>

In [16]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
44687,22,Jean Vergnes,", 88, French-born American chef.",https://en.wikipedia.org/wiki/Jean_Vergnes,1,2010,April,,,chef,,,,,,,,,,88.0,,United States of America,
47641,19,Helen Maynor Scheirbeck,", 75, American educator and activist, stroke.",https://en.wikipedia.org/wiki/Helen_Maynor_Scheirbeck,1,2010,December,,,educator and activist,stroke,,,,,,,,,75.0,,United States of America,


<IPython.core.display.Javascript object>

#### Checking `info_2` for `place_2`

In [17]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_2"

# Dataframe to check
dataframe = df[
    (df["place_1"].notna()) & (df[extract_to].isna()) & (df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )
            
# Chime notification when cell successfully executes
chime.success()

CPU times: total: 3min 45s
Wall time: 3min 45s


<IPython.core.display.Javascript object>

In [18]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
4185,25,Ludmilla Pajo,", 47, Russian-Albanian writer and journalist.",https://en.wikipedia.org/wiki/Ludmilla_Pajo,1,1995,August,,,writer and journalist,,,,,,,,,,47.0,,Russia,Albania
48370,12,James Elliott,", 82, British-born Australian actor , .",https://en.wikipedia.org/wiki/James_Elliott_(actor),2,2011,February,(Lewy body dementia),,actor,,,,,,,,,,82.0,,United Kingdom of Great Britain and Northern Ireland,Australia


<IPython.core.display.Javascript object>

#### Checking Remaining Missing Values for `place_1` and Number of Rows with a `place_2` Value.

In [19]:
# Checking number of remaining missing values for place_1 and number of captured values for place_2
print(f'There are {df["place_1"].isna().sum()} remaining missing values for place_1.\n')
print(f'{df["place_2"].notna().sum()} entries have a value for place_2, thus far.')

There are 2394 remaining missing values for place_1.

2251 entries have a value for place_2, thus far.


<IPython.core.display.Javascript object>

#### Observations:
- We have captured the `place_1` value for the vast majority of entries.
- Relatively few entries have `place_2` values, which we would expect.
- Before checking for other variations on nationality usage, let us check to see if the nation itself starts the value.

#### Re-checking `info_2` for `place_1` Using Country Value

In [20]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_1"

# Dataframe to check
dataframe = df[(df[extract_to].isna()) & (df[column].notna())]

# For loop to extract nation data to place column
for country in nation_map.values():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(country):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(country, "").strip()
            )

# Chime notification when cell successfully executes
chime.success()

CPU times: total: 3.97 s
Wall time: 3.95 s


<IPython.core.display.Javascript object>

In [21]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
108273,19,Charles Alverson,", 84, American screenwriter .",https://en.wikipedia.org/wiki/Charles_Alverson,5,2020,January,(),,screenwriter,,,,,,,,,,84.0,,United States of America,
96512,11,Norma Bessouet,", 77, Argentine artist.",https://en.wikipedia.org/wiki/Norma_Bessouet,4,2018,June,,,artist,,,,,,,,,,77.0,,Argentina,


<IPython.core.display.Javascript object>

#### Re-checking `info_2` for `place_2` Using Country Value

In [22]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_2"

# Dataframe to check
dataframe = df[
    (df["place_1"].notna()) & (df[extract_to].isna()) & (df[column].notna())
]

# For loop to extract nation data to place column
for country in nation_map.values():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(country):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(country, "").strip()
            )
            
# Chime notification when cell successfully executes
chime.success()

CPU times: total: 3min 38s
Wall time: 3min 38s


<IPython.core.display.Javascript object>

In [23]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
67240,14,Vello Helk,", 90, Estonian-born Danish historian.",https://en.wikipedia.org/wiki/Vello_Helk,2,2014,March,,,historian,,,,,,,,,,90.0,,Estonia,Denmark
91511,5,Dan Hanganu,", 78, Romanian-born Canadian architect.",https://en.wikipedia.org/wiki/Dan_Hanganu,8,2017,October,,,architect,,,,,,,,,,78.0,,Romania,Canada


<IPython.core.display.Javascript object>

#### Checking Remaining Missing Values for `place_1` and Number of Rows with a `place_2` Value.

In [24]:
# Checking number of remaining missing values for place_1 and number of captured values for place_2
print(f'There are {df["place_1"].isna().sum()} remaining missing values for place_1.\n')
print(f'{df["place_2"].notna().sum()} entries have a value for place_2, thus far.')

There are 2275 remaining missing values for place_1.

2293 entries have a value for place_2, thus far.


<IPython.core.display.Javascript object>

#### Observations:
- We captured over 100 more values for `place_1` and over 50 more for `place_2` by directly checking the country name.
- Now, we will examine the remaining capitalized first words in `info_1` and append our dictionary as needed.

#### Examining Unique Values of First Word in `info_1` if Upper Case

In [25]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[(df["place_1"].isna()) & (df[column].notna())]

# Checking set of first words in info_2 where place_1 is missing
print(
    f"There are {len(set([item.split()[0] for item in dataframe[column] if item[0].isupper()]))} unique values for first word in info_1.\n"
)
set([item.split()[0] for item in dataframe[column] if item[0].isupper()])

There are 302 unique values for first word in info_1.



{'AIDS',
 'ANC',
 'Abkhaz',
 'Abkhazian',
 'Aboriginal',
 'Actress',
 'African',
 'Afrikaans',
 'Afrikaner',
 'Afro',
 'Air',
 'Alfa',
 'All',
 'Alyawarre',
 'Amateur',
 'America',
 "America's",
 'Amrican',
 'Anglican',
 'Anglo',
 'Anguillan',
 'Arabic',
 'Archbishop',
 'Archdeacon',
 'Argentinian',
 'Aruba',
 'Aruban',
 'Assamese',
 'Associate',
 'Assyrian',
 'Athletics',
 'Aussie',
 'Austro',
 'Avarian',
 'Azorean',
 'BBC',
 'Baltic',
 'Basque',
 'Bavarian',
 'Benedictine',
 'Bermudan',
 'Bermudian',
 'Bessarabian',
 'Bletchley',
 'Bodo',
 'Bosnia',
 'Breton',
 'Brigadier',
 "Britain's",
 'Britsih',
 'California',
 'Californian',
 'Calypso',
 'Cantonese',
 'Caribbean',
 'Catalan',
 'Catholic',
 'Caymanian',
 'Ceylon',
 'Ceylonese',
 'Chagossian',
 'Chairman',
 'Chechen',
 'Cherokee',
 'Chief',
 'Chilian',
 'China',
 'Chiricahua',
 'Chuvash',
 'Circassian',
 'Civil',
 'Columbian',
 'Commandant',
 'Composer',
 'Computer',
 'Congo',
 'Congoleze',
 'Congresswoman;',
 'Cook',
 'Cornish',


<IPython.core.display.Javascript object>

#### Observations:
- We can see there are some remaining variations on how nationality was entered that are not yet in `nation_country_dict`.
- Let us add those now, then do another iteration for searching `info_2`.
- We can also proceed to collect causes of death, such as 'AIDS' in a separate list.
- Values for nationality that clearly pertain to an ethnicity, such as "Afro" and "Anglo" will be assigned empty strings in the dictionary.
- Descriptions will be assigned to their geographical physical region when nation of membership is remote.

#### Hard-coding Additional Variations on Nationality

In [26]:
# Hard-coding remaining unique nationality/location descriptors
nation_map["ANC"] = nation_map["South African"]
nation_map["Abkhaz"] = nation_map["Georgian"]
nation_map["Abkhazian"] = nation_map["Georgian"]
nation_map["Aboriginal"] = nation_map["Australian"]
nation_map["African"] = "Africa"
nation_map["Afrikaans"] = nation_map["African"]
nation_map["Afrikaner"] = nation_map["African"]
nation_map["Afro"] = ""
nation_map["Alyawarre"] = nation_map["Australian"]
nation_map["America"] = nation_map["US"]
nation_map["America's"] = nation_map["US"]
nation_map["Amrican"] = nation_map["US"]
nation_map["Anglo"] = ""
nation_map["Anguillan"] = "Caribbean"
nation_map["Antigua"] = nation_map["Antiguan"]
nation_map["Arabic"] = "Arab states"
nation_map["Argentinian"] = nation_map["Argentine"]
nation_map["Aruba"] = "Caribbean"
nation_map["Aruban"] = nation_map["Aruba"]
nation_map["Assamese"] = nation_map["Indian"]
nation_map["Assyrian"] = "Middle East"
nation_map["Aussie"] = nation_map["Australian"]
nation_map["Australia"] = nation_map["Australian"]
nation_map["Australia's"] = nation_map["Australian"]
nation_map["Austria"] = nation_map["Austrian"]
nation_map["Austro"] = nation_map["Austrian"]
nation_map["Avarian"] = nation_map["Russian"]
nation_map["Azerbaijan"] = nation_map["Azerbaijani"]
nation_map["Azorean"] = nation_map["Portuguese"]
nation_map["Baltic"] = "Eastern Europe"
nation_map["Bangladesh"] = nation_map["Bangladeshi"]
nation_map["Barbados"] = "Caribbean"
nation_map["Basque"] = "Western Europe"
nation_map["Bavarian"] = nation_map["German"]
nation_map["Belarus"] = nation_map["Belarusian"]
nation_map["Belarussian"] = nation_map["Belarusian"]
nation_map["Belgium"] = nation_map["Belgian"]
nation_map["Bermudan"] = "Caribbean"
nation_map["Bessarabian"] = "Eastern Europe"
nation_map["Bletchley"] = nation_map["British"]
nation_map["Bodo"] = nation_map["Norwegian"]
nation_map["Bosnia"] = nation_map["Bosnian"]
nation_map["Breton"] = nation_map["French"]
nation_map["Britain's"] = nation_map["British"]
nation_map["Britsih"] = nation_map["British"]
nation_map["California"] = nation_map["US"]
nation_map["Californian"] = nation_map["US"]
nation_map["Cantonese"] = nation_map["Chinese"]
nation_map["Caribbean"] = "Caribbean"
nation_map["Catalan"] = nation_map["Spanish"]
nation_map["Caymanian"] = nation_map["Caribbean"]
nation_map["Ceylon"] = nation_map["Sri Lankan"]
nation_map["Ceylonese"] = nation_map["Sri Lankan"]
nation_map["Chagossian"] = "Indian Ocean"
nation_map["Chechen"] = nation_map["Russian"]
nation_map["Cherokee"] = nation_map["US"]
nation_map["Chilian"] = nation_map["Chilean"]
nation_map["China"] = nation_map["Chinese"]
nation_map["Chiricahua"] = nation_map["US"]
nation_map["Chuvash"] = nation_map["Russian"]
nation_map["Circassian"] = nation_map["Russian"]
nation_map["Columbian"] = nation_map["Colombian"]
nation_map["Congo"] = nation_map["Congolese"]
nation_map["Cornish"] = nation_map["British"]
nation_map["Costan Rican"] = nation_map["Costa Rican"]
nation_map["Crimean"] = nation_map["Russian"]
nation_map["Croat"] = nation_map["Croatian"]
nation_map["Curaçaoan"] = nation_map["Dutch"]
nation_map["Curaçaon"] = nation_map["Dutch"]
nation_map["Dagestani"] = nation_map["Russian"]
nation_map["Dahomey"] = "West Africa"
nation_map["Dijiboutian"] = nation_map["Djiboutian"]
nation_map["Dolgan"] = nation_map["Russian"]
nation_map["Dominica"] = nation_map["Caribbean"]
nation_map["East"] = ""
nation_map["Eastern"] = ""
nation_map["England"] = nation_map["British"]
nation_map["Englist"] = nation_map["British"]
nation_map["European"] = "Europe"
nation_map["Falkland Islands"] = "South America"
nation_map["Falkland islands"] = nation_map["Falkland Islands"]
nation_map["Falkland"] = nation_map["Falkland Islands"]
nation_map["Faroese"] = nation_map["Danish"]
nation_map["Filipina"] = nation_map["Filipino"]
nation_map["Filipo"] = nation_map["Filipino"]  # verified entry
nation_map["Fillipina"] = nation_map["Filipino"]
nation_map["Finish"] = nation_map["Finnish"]
nation_map["Flemish"] = nation_map["Belgian"]
nation_map["Franch"] = nation_map["French"]  # verified entry
nation_map["Franco"] = ""
nation_map["Frenck"] = nation_map["French"]  # verified entry
nation_map["Fujianese"] = nation_map["Chinese"]
nation_map["Gaelic"] = ""  # refers to sport of Gaelic football, otherwise language
nation_map["Galician"] = nation_map["Spanish"]
nation_map["Galápagos"] = "Galápagos Islands"  # entry for tortoise
nation_map["Geman"] = nation_map["German"]  # verified entry
nation_map["Germen"] = nation_map["German"]  # verified entry
nation_map["Ghanese"] = "West Africa"
nation_map["Greenlandic"] = nation_map["Danish"]
nation_map["Guadeloupean"] = nation_map["Caribbean"]
nation_map["Guamanian"] = "Oceania"
nation_map["Guernsey"] = nation_map["British"]
nation_map["Hawaiian"] = nation_map["US"]
nation_map["Hindi"] = nation_map["Indian"]
nation_map["Hindu"] = nation_map["Indian"]
nation_map["Hollywood"] = nation_map["US"]
nation_map["Hong Kong"] = nation_map["Chinese"]
nation_map["Houston"] = nation_map["US"]
nation_map["Huaorani"] = nation_map["Ecuadorian"]
nation_map["I Kiribati"] = nation_map["IKiribati"]
nation_map["Ice"] = ""
nation_map["Indigenous"] = ""
nation_map["Indin"] = nation_map["Indian"]  # verified entry
nation_map["Indo"] = ""
nation_map["Ingush"] = nation_map["Russian"]
nation_map["Italo"] = ""
nation_map["Ivoirian"] = "West Africa"
nation_map["Javanese"] = nation_map["Indonesian"]
nation_map["Jersey"] = nation_map["British"]
nation_map["Kabardin"] = nation_map["Russian"]
nation_map["Kashmiri"] = nation_map["Indian"]
nation_map["Korean"] = "East Asia"
nation_map["Kosovan"] = "Eastern Europe"
nation_map["Kurdish"] = "West Asia"
nation_map["Latino"] = ""
nation_map["Lesothan"] = "Southern Africa"
nation_map["Los Angeles"] = nation_map["US"]
nation_map["Louisiana"] = nation_map["US"]
nation_map["MGerman"] = nation_map["German"]  # verified entry
nation_map["Macanese"] = "East Asia"
nation_map["Malayalam"] = nation_map["Indian"]
nation_map["Malayali"] = nation_map["Indian"]
nation_map["Malayan"] = nation_map["Malaysian"]
nation_map["Manx"] = nation_map["British"]
nation_map["Mexian"] = nation_map["Mexican"]
nation_map["Mississippi"] = nation_map["US"]
nation_map["Monegasque"] = nation_map["Monacan"]
nation_map["Montserrat"] = nation_map["Caribbean"]
nation_map["Montserratian"] = nation_map["Caribbean"]
nation_map["Myanmar"] = nation_map["Burmese"]
nation_map["Native"] = ""
nation_map["New York"] = nation_map["US"]
nation_map["Ngarrindjeri"] = nation_map["Australian"]
nation_map["Ni Vanuatu"] = "Oceania"
nation_map["Nigirean"] = nation_map["Nigerian"]
nation_map["Niuean"] = nation_map["NZ"]
nation_map["Northern Ire"] = nation_map["Northern Irish"]
nation_map["Northern Ireland"] = nation_map["Northern Irish"]
nation_map["Norther Irish"] = nation_map["Northern Irish"]
nation_map["North Irish"] = nation_map["Northern Irish"]
nation_map["North American"] = "North America"
nation_map["North Island"] = nation_map["NZ"]
nation_map["Northern Mariana Island"] = "Oceania"
nation_map["Northern Mariana Islander"] = "Oceania"
nation_map["Nubian"] = nation_map["Sudanese"]
nation_map["Ottoman"] = nation_map["Turkish"]
nation_map["Paraguan"] = nation_map["Paraguayan"]  # verified entry
nation_map["Pitcairn"] = "Oceania"
nation_map["Poliosh"] = ""  # verified entry
nation_map["Polis"] = nation_map["Polish"]  # verified entry
nation_map["Prussian"] = nation_map["German"]
nation_map["Punjabi"] = nation_map["Indian"]
nation_map["Quebec"] = nation_map["Canadian"]
nation_map["Québécois"] = nation_map["Canadian"]
nation_map["Republic of China"] = nation_map["Chinese"]
nation_map["Republic"] = ""
nation_map["Rhodesian"] = "Southern Africa"
nation_map["Roman"] = nation_map["Italian"]
nation_map["Réunionese"] = nation_map["French"]
nation_map["S African"] = nation_map["South African"]
nation_map["Saban"] = nation_map["Caribbean"]
nation_map["Saharawi"] = "West Africa"
nation_map["Sahrawi"] = nation_map["Saharawi"]
nation_map["Saint Helena"] = "South Atlantic"
nation_map["Saint Vincent"] = nation_map["Caribbean"]
nation_map["Saint Martin"] = nation_map["Caribbean"]
nation_map["Saint Pierre and Miquelon"] = "North America"
nation_map["Salvadorean"] = nation_map["Salvadoran"]
nation_map["Sanmarinese"] = nation_map["Sammarinese"]
nation_map["Santomean"] = nation_map["São Toméan"]
nation_map["Seychellian"] = nation_map["Seychellois"]
nation_map["Sicilian"] = nation_map["Italian"]
nation_map["Sicillian"] = nation_map["Italian"]
nation_map["Sikkimese"] = nation_map["Indian"]
nation_map["Sorbian"] = nation_map["German"]
nation_map["South Afican"] = nation_map["South African"]
nation_map["South Ossetian"] = nation_map["Georgian"]
nation_map["South "] = ""
nation_map["Southern"] = ""
nation_map["Soviet"] = "United Socialist Soviet Republic"
nation_map["Sri lankan"] = nation_map["Sri Lankan"]
nation_map["St Lucian"] = nation_map["Caribbean"]
nation_map["St Kitts and Nevis"] = nation_map["Kittian and Nevisian"]
nation_map["Sumatran"] = nation_map["Indonesian"]
nation_map["Swish"] = nation_map["Swiss"]  # verified entry
nation_map["Tahitian"] = "South Pacific"
nation_map["Tamil"] = "Indian subcontinent"
nation_map["Tasmanian"] = nation_map["Australian"]
nation_map["Telugu"] = nation_map["Indian"]
nation_map["Texas"] = nation_map["US"]
nation_map["Florida"] = nation_map["US"]
nation_map["Tibetan"] = nation_map["Chinese"]
nation_map["Tirkish"] = nation_map["Turkish"]
nation_map["Tirkish"] = nation_map["Turkish"]
nation_map["Trinbagonian"] = nation_map["Trinidadian"]
nation_map["Trinidad"] = nation_map["Trinidadian"]
nation_map["Turks"] = nation_map["Turkish"]
nation_map["U S"] = nation_map["US"]
nation_map["UAE's"] = nation_map["Emirati"]
nation_map["United Kingdom"] = nation_map["British"]
nation_map["Upper Silesian"] = nation_map["Polish"]
nation_map["Uruguyan"] = nation_map["Uruguayan"]
nation_map["Uyghur"] = "Asia"
nation_map["Wallisian"] = "Oceania"
nation_map["West"] = ""
nation_map["Western"] = ""
nation_map["Xhosa"] = nation_map["South African"]
nation_map["Yazidi"] = "Asia"
nation_map["Yellowstone"] = nation_map["US"]
nation_map["Yugoslav"] = nation_map["Serbian"]
nation_map["Yugoslavia"] = nation_map["Serbian"]
nation_map["Yugoslavian"] = nation_map["Serbian"]
nation_map["Zairean"] = nation_map["Congolese"]
nation_map["Zanzibari"] = nation_map["Tanzanian"]

<IPython.core.display.Javascript object>

#### Example of Checking Rows with Unique Value

In [27]:
# Cell for checking rows of unique values -- used repeatedly while hard-coding above
check_value = "Yellowstone"

check_list = []
for index in df[df["info_2"].notna()].index:
    item = df.loc[index, "info_2"]
    if item:
        if item.startswith(check_value):
            check_list.append(index)
df.loc[check_list, :]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
58424,6,O-Six,", 6, Yellowstone National Park gray wolf, shot.",https://en.wikipedia.org/wiki/O-Six,16,2012,December,,,Yellowstone National Park gray wolf,shot,,,,,,,,,6.0,,,


<IPython.core.display.Javascript object>

#### Appending Other Species List for New Species Observed During Hard-coding

In [28]:
other_species_df = pd.read_csv("other_species.csv")
other_species = other_species_df["species"].tolist()
other_species.append("kiwi")
other_species.append("stallion")
other_species.append("colt")
other_species.append("Thoroughbred hurdler")
other_species.append("race horse")
other_species.append("racing filly")
other_species.append("wolf")

<IPython.core.display.Javascript object>

#### Observations:
- That was more brute force than we would like to be using, but as the nationality descriptions are mixed in with other features, the approach was chosen over using an approximate match approach such as fuzzywuzzy.
- We will proceed to re-run our code searching `info_2`.

#### Checking `info_2` for `place_1` with Updated `nation_map`

In [32]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_1"

# Dataframe to check
dataframe = df[(df[extract_to].isna()) & (df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Chime notification when cell successfully executes
chime.success()

CPU times: total: 6.22 s
Wall time: 6.2 s


<IPython.core.display.Javascript object>

In [33]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
5802,28,Siegfried Mynhardt,", 90, South African actor.",https://en.wikipedia.org/wiki/Siegfried_Mynhardt,2,1996,March,,,actor,,,,,,,,,,90.0,,South Africa,
101353,14,Sergei Zakharov,", 68, Russian singer, heart failure.",https://en.wikipedia.org/wiki/Sergei_Zakharov_(singer),7,2019,February,,,singer,heart failure,,,,,,,,,68.0,,Russia,


<IPython.core.display.Javascript object>

#### Checking `info_2` for `place_2` with Updated `nation_map`

In [34]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_2"

# Dataframe to check
dataframe = df[
    (df["place_1"].notna()) & (df[extract_to].isna()) & (df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )
            
# Chime notification when cell successfully executes
chime.success()

CPU times: total: 6min 53s
Wall time: 6min 53s


<IPython.core.display.Javascript object>

In [35]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
98785,13,Don Leo Jonathan,", 87, American-Canadian Hall of Fame professional wrestler .",https://en.wikipedia.org/wiki/Don_Leo_Jonathan,16,2018,October,(NWA),,Hall of Fame professional wrestler,,,,,,,,,,87.0,,United States of America,Canada
117729,29,Geoffrey Robinson,", 83, Australian Roman Catholic prelate, auxiliary bishop of Sydney .",https://en.wikipedia.org/wiki/Geoffrey_Robinson_(bishop),11,2020,December,(1984–2004),,Catholic prelate,auxiliary bishop of Sydney,,,,,,,,,83.0,,Australia,Italy


<IPython.core.display.Javascript object>

In [36]:
df["place_1"].isna().sum()

633

<IPython.core.display.Javascript object>

In [38]:
df[df["place_1"].isna()]["info_2"].unique()

array(['Royal Netherlands Navy vice admiral', 'President of Laos',
       'Governor general of the Bahamas',
       'Amateur violinist and philanthropist',
       'Composer and music editor',
       'President of the Yemen Arab Republic', 'Prime Minister of Rwanda',
       'Male Hungarian international table tennis player',
       '37th President of the United States',
       'Queen of Jordan as the wife of King Talal',
       'Prime Minister of Zaire under Mobutu Sese Seko',
       'President of Burma and writer',
       'Founder and first leader of North Korea',
       'Moravian American classical pianist', 'President of Ciskei',
       'Prime Minister of Nepal', 'President of Palau',
       'Poet and an Esperantist professor',
       "People's Republic of China politician", 'Jewish rabbi',
       '17th Indian Army officer and Chief of Staff',
       'All American baseball player',
       'pioneering American biophysicist and virologist',
       "native leader and historian of the Me

<IPython.core.display.Javascript object>