# Wikipedia Notable Life Expectancies

# [Notebook 4 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean3_thanak_2022_06_23.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To supress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")

<IPython.core.display.Javascript object>

## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean2.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean2", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 132652 rows and 21 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0,
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0,


<IPython.core.display.Javascript object>

In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
132650,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(1980),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0,
132651,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0,


<IPython.core.display.Javascript object>

In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
17005,30,James C. Corman,", 80, American politician .",https://en.wikipedia.org/wiki/James_C._Corman,16,2000,December,(U.S. Representative for California's 21st and 22nd congressional districts),,American politician,,,,,,,,,,80.0,
49264,23,Tom King,", 68, American guitarist and songwriter , heart failure.",https://en.wikipedia.org/wiki/Tom_King_(musician),5,2011,April,(The Outsiders),,American guitarist and songwriter,heart failure,,,,,,,,,68.0,
33742,1,Norman Adrian Wiggins,", 83, American third president of Campbell University.",https://en.wikipedia.org/wiki/Norman_Adrian_Wiggins,0,2007,August,,,American third president of Campbell University,,,,,,,,,,83.0,
51677,30,Jonas Kubilius,", 90, Lithuanian mathematician.",https://en.wikipedia.org/wiki/Jonas_Kubilius,17,2011,October,,,Lithuanian mathematician,,,,,,,,,,90.0,
105846,22,Tom Polanic,", 76, Canadian ice hockey player .",https://en.wikipedia.org/wiki/Tom_Polanic,4,2019,September,(Minnesota North Stars),,Canadian ice hockey player,,,,,,,,,,76.0,


<IPython.core.display.Javascript object>

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132652 entries, 0 to 132651
Data columns (total 21 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132652 non-null  object 
 1   name            132652 non-null  object 
 2   info            132652 non-null  object 
 3   link            132652 non-null  object 
 4   num_references  132652 non-null  object 
 5   year            132652 non-null  int64  
 6   month           132652 non-null  object 
 7   info_parenth    49830 non-null   object 
 8   info_1          35 non-null      object 
 9   info_2          132604 non-null  object 
 10  info_3          62571 non-null   object 
 11  info_4          12605 non-null   object 
 12  info_5          1497 non-null    object 
 13  info_6          216 non-null     object 
 14  info_7          31 non-null      object 
 15  info_8          6 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   

<IPython.core.display.Javascript object>

#### Loading `nation_country_dict` from Pickle File

In [6]:
# Load the nation_country_dict
with open("nation_country_dict.pkl", "rb") as f:
    nation_country_dict = pickle.load(f)

<IPython.core.display.Javascript object>

## Extracting Nationality Continued
Here is the approach we will take:
- The plan will be to save the country name, in lieu of nationality, in new `place_1` and `place_2` columns as it is standardized for the various associated nationality values.
- First, we will update the keys and values in `nation_country_dict` by replacing hyphens with a single space.
- Then we will remove "-born" from the column we are searching, as well as replace "-" and "/" each with single spaces.  In this step, we can also remove leading and trailing periods and whitespace.
- We will proceed to search the numbered `info` columns in order checking as follows:
    1. if column value starts with a value in the dictionary:
        - save country to `place_1` and remove value from searched column.
    2. if `place_1` value has been found:
        - if updated column value starts with a value in the dictionary:
            - save country to `place_2` and remove value from searched column.
    3. Repeat steps 1 and 2 but comparing with country (dictionary keys)
    4. Check unique values for column starting with capital letters.

#### Removing "-" and "." from `nation_country_dict`

In [7]:
# Removing hyphens from nation_country_dict
nation_country_dict = {
    key.replace("-", ""): value.replace("-", " ")
    for key, value in nation_country_dict.items()
}

# Removing periods from nation_country_dict
nation_country_dict = {
    key.replace(".", ""): value.replace(".", " ")
    for key, value in nation_country_dict.items()
}

<IPython.core.display.Javascript object>

#### Removing or Replacing Extra Characters in Numbered `info` Columns

In [8]:
%%time

# List of columns to treat
cols_lst = [
    "info_1",
    "info_2",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Dictionary of keys to find and values to replace keys
replace_dict = {'-born': '', '–born': '', '-': ' ', '–': ' ', '/': ' ', '.': ' ', 'and': ''}

# For loop to find and replace characters in replace_dict in columns in cols_list
# and strip any leading or trailing periods or whitespace
for column in cols_lst:
    for key, value in replace_dict.items():
        for index in df[column].notna().index:
            item = df.loc[index, column]
            if item:
                df.loc[index, column] = item.replace(key, value).strip(' .')
                
# Chime notification when cell successfully executes
chime.success()

CPU times: total: 1min 47s
Wall time: 1min 47s


<IPython.core.display.Javascript object>

#### Checking `info_1` for `place_1`

In [9]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_1"

# Dataframe to check
dataframe = df[(df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Check a sample of treated rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1
21865,23,Roberto Matta,", 91 Chilean artist.",https://en.wikipedia.org/wiki/Roberto_Matta,7,2002,November,,artist,,,,,,,,,,,91.0,,Chile
7507,21,Kell Areskoug,", 90 Swedish Olympic sprinter.",https://en.wikipedia.org/wiki/Kell_Areskoug,5,1996,December,,Olympic sprinter,,,,,,,,,,,90.0,,Sweden


<IPython.core.display.Javascript object>

#### Observations:
- `info_1` provides us a nice small sample on which to test code.
- We successfully extracted those `place_1` values, now we will do the same on the treated rows for `place_2`.

#### Checking `info_1` for `place_2`

In [10]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_2"

# Dataframe to check
dataframe = df[(df["place_1"].notna()) & (df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Check a sample of rows
df.sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1
91864,25,Ross Powell,", 49, American baseball player , carbon monoxide poisoning.",https://en.wikipedia.org/wiki/Ross_Powell,4,2017,October,"(Cincinnati Reds, Houston Astros, Pittsburgh Pirates)",,American baseball player,carbon monoxide poisoning,,,,,,,,,49.0,,
107857,1,Ng Jui Ping,", 71, Singaporean entrepreneur and army general, Chief of Defence Force , pancreatic cancer.",https://en.wikipedia.org/wiki/Ng_Jui_Ping,4,2020,January,(1992–1995),,Singaporean entrepreneur and army general,Chief of Defence Force,pancreatic cancer,,,,,,,,71.0,,


<IPython.core.display.Javascript object>

#### Observations:
- Here we can see that the new column `place_2` has not yet been added as there were not any matching values.
- Let us confirm by checking the remaining unique values in `info_1`.

#### Checking Remaining Unique Values in `info_1`

In [11]:
# Checking unique values
df["info_1"].unique()

array([None, 'politician', 'Olympic sprinter', 'gridiron football player',
       'writer', 'businessman', 'social psychologist', 'King of Nepal',
       'Maori leader', 'artist', 'English sports journalist',
       'Jules Engel', 'early', 'aka', 'Jr', 'professional wrestler',
       'automotive engineer', 'materials scientist', 'weightlifter',
       'common chimpanzee', '', 'Olympic athlete', 'actor',
       'Olympic gymnast', 'broadcaster and writer', 'Olympic swimmer',
       'Olympic boxer', 'Olympic wrestler', 'Olympic sailor',
       'basketball player', 'college basketball coach',
       'choral conductor', 'Tree of the Year'], dtype=object)

<IPython.core.display.Javascript object>

#### Obsservations:
- Neither "English" nor "Maori" are keys in the current dictionary.
- Maori is an ethnicity within the country of New Zealand, so for now, we will add it as a key our dictionary with the country value of New Zealand.  If we have matching first and second countries, we can later remove the second value.
- We will also add the key "English" with the country value 'United Kingdom of Great Britain and Northern Ireland'.
- Then, we can rerun the above code for `place_1` and `place_2`.
- The country value of "Nepal" is also present.  We will hold off on extracting country names until we have first exhausted matching nationalities, as the Wikipedia field called for nationalities.

#### Updating `nation_country_dict`

In [12]:
# Adding key: country pairs to nation_country_dict
nation_country_dict["English"] = nation_country_dict["British"]
nation_country_dict["Maori"] = nation_country_dict["New Zealand"]

<IPython.core.display.Javascript object>

#### Re-checking `info_1` for `place_1`

In [13]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_1"

# Dataframe to check
dataframe = df[(df[extract_to].isna()) & (df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1
21050,19,Frank Taylor,", 81. English sports journalist.",https://en.wikipedia.org/wiki/Frank_Taylor_(journalist),3,2002,July,,sports journalist,,,,,,,,,,,81.0,,United Kingdom of Great Britain and Northern Ireland
47293,18,Donald Mitchell,", 55 Australian weightlifter.",https://en.wikipedia.org/wiki/Donald_Mitchell_(weightlifter),2,2010,November,,weightlifter,,,,,,,,,,,55.0,,Australia


<IPython.core.display.Javascript object>

#### Re-checking `info_1` for `place_2`

In [14]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_2"

# Dataframe to check
dataframe = df[(df["place_1"].notna()) & (df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Checking rows
df[df["place_2"].notna()]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
19580,20,Dame Miraka Szászy,", 80. New Zealand Maori leader.",https://en.wikipedia.org/wiki/Mira_Sz%C3%A1szy,21,2001,December,,leader,,,,,,,,,,,80.0,,New Zealand,New Zealand


<IPython.core.display.Javascript object>

#### Observations:
- Our code appears to be finding the matching values and assigning the corresponding country to the correct nation column.
- We see "New Zealand" added to both nation columns here, which was expected as both New Zealand and Maori are in the description
- As an aside, we will need to check our final values where `place_1` is "American" and `place_2` is "Indian" as our code will indicate United States and India, which may or may not be correct. 
- Now we can proceed to doing the same extraction on `info_2`.

#### Checking `info_2` for `place_1`

In [15]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_1"

# Dataframe to check
dataframe = df[(df[extract_to].isna()) & (df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Chime notification when cell successfully executes
chime.success()

CPU times: total: 3min 54s
Wall time: 3min 54s


<IPython.core.display.Javascript object>

In [16]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
15631,11,Dale Jennings,", 82, American LGBT rights activist, playwright and author.",https://en.wikipedia.org/wiki/Dale_Jennings_(activist),7,2000,May,,,LGBT rights activist,playwright and author,,,,,,,,,82.0,,United States of America,
2135,5,Asım Orhan Barut,", 68, Turkish-American theoretical physicist.",https://en.wikipedia.org/wiki/As%C4%B1m_Orhan_Barut,7,1994,December,,,theoretical physicist,,,,,,,,,,68.0,,United States of America,


<IPython.core.display.Javascript object>

#### Checking `info_2` for `place_2`

In [17]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_2"

# Dataframe to check
dataframe = df[
    (df["place_1"].notna()) & (df[extract_to].isna()) & (df[column].notna())]

# For loop to extract nation data to place column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )
            
# Chime notification when cell successfully executes
chime.success()

CPU times: total: 3min 33s
Wall time: 3min 34s


<IPython.core.display.Javascript object>

In [18]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
76459,10,Roger Rees,", 71, Welsh-American actor , stomach cancer.",https://en.wikipedia.org/wiki/Roger_Rees,29,2015,July,"(Tony, 1982, ), winner ()",,actor,stomach cancer,,,,,,,,,71.0,,Wales,United States of America
2343,4,Naomi Amir,", 63, American-Israeli pediatric neurologist.",https://en.wikipedia.org/wiki/Naomi_Amir,6,1995,January,,,pediatric neurologist,,,,,,,,,,63.0,,United States of America,Israel


<IPython.core.display.Javascript object>

#### Checking Remaining Missing Values for `place_1` and Number of Rows with a `place_2` Value.

In [19]:
# Checking number of remaining missing values for place_1 and number of captured values for place_2
print(f'There are {df["place_1"].isna().sum()} remaining missing values for place_1.\n')
print(f'{df["place_2"].notna().sum()} entries have a value for place_2, thus far.')

There are 2394 remaining missing values for place_1.

2251 entries have a value for place_2, thus far.


<IPython.core.display.Javascript object>

#### Observations:
- We have captured the `place_1` value for the vast majority of entries.
- Relatively few entries have `place_2` values, which we would expect.
- Before checking for other variations on nationality usage, let us check to see if the nation itself starts the value.

#### Re-checking `info_2` for `place_1` Using Country Value

In [20]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_1"

# Dataframe to check
dataframe = df[(df[extract_to].isna()) & (df[column].notna())]

# For loop to extract nation data to place column
for country in nation_country_dict.values():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(country):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(country, "").strip()
            )

# Chime notification when cell successfully executes
chime.success()

CPU times: total: 4.41 s
Wall time: 4.37 s


<IPython.core.display.Javascript object>

In [21]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
414,4,Aníbal,", 53, Mexican , brain cancer.",https://en.wikipedia.org/wiki/An%C3%ADbal_(wrestler),20,1994,March,(professional wrestler),,,brain cancer,,,,,,,,,53.0,,Mexico,
862,14,Brian Roper,", 64, British-American actor, and real estate agent.",https://en.wikipedia.org/wiki/Brian_Roper_(actor),22,1994,May,,,actor,and real estate agent,,,,,,,,,64.0,,United States of America,


<IPython.core.display.Javascript object>

#### Re-checking `info_2` for `place_2` Using Country Value

In [22]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_2"

# Dataframe to check
dataframe = df[
    (df["place_1"].notna()) & (df[extract_to].isna()) & (df[column].notna())
]

# For loop to extract nation data to place column
for country in nation_country_dict.values():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(country):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(country, "").strip()
            )
            
# Chime notification when cell successfully executes
chime.success()

CPU times: total: 3min 39s
Wall time: 3min 39s


<IPython.core.display.Javascript object>

In [23]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
60405,10,"Princess Lilian, Duchess of Halland",", 97, Welsh-born Swedish royal.","https://en.wikipedia.org/wiki/Princess_Lilian,_Duchess_of_Halland",15,2013,March,,,royal,,,,,,,,,,97.0,,Wales,Sweden
123131,12,Dennis Berry,", 76, American-French film director .",https://en.wikipedia.org/wiki/Dennis_Berry_(director),4,2021,June,"(, , )",,film director,,,,,,,,,,76.0,,United States of America,France


<IPython.core.display.Javascript object>

#### Checking Remaining Missing Values for `place_1` and Number of Rows with a `place_2` Value.

In [24]:
# Checking number of remaining missing values for place_1 and number of captured values for place_2
print(f'There are {df["place_1"].isna().sum()} remaining missing values for place_1.\n')
print(f'{df["place_2"].notna().sum()} entries have a value for place_2, thus far.')

There are 2275 remaining missing values for place_1.

2293 entries have a value for place_2, thus far.


<IPython.core.display.Javascript object>

#### Observations:
- We captured over 100 more values for `place_1` and over 50 more for `place_2` by directly checking the country name.
- Now, we will examine the remaining capitalized first words in `info_1` and append our dictionary as needed.

#### Examining Unique Values of First Word in `info_1` if Upper Case

In [25]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[(df["place_1"].isna()) & (df[column].notna())]

# Checking set of first words in info_2 where place_1 is missing
print(
    f"There are {len(set([item.split()[0] for item in dataframe[column] if item[0].isupper()]))} unique values for first word in info_1.\n"
)
set([item.split()[0] for item in dataframe[column] if item[0].isupper()])

There are 302 unique values for first word in info_1.



{'AIDS',
 'ANC',
 'Abkhaz',
 'Abkhazian',
 'Aboriginal',
 'Actress',
 'African',
 'Afrikaans',
 'Afrikaner',
 'Afro',
 'Air',
 'Alfa',
 'All',
 'Alyawarre',
 'Amateur',
 'America',
 "America's",
 'Amrican',
 'Anglican',
 'Anglo',
 'Anguillan',
 'Arabic',
 'Archbishop',
 'Archdeacon',
 'Argentinian',
 'Aruba',
 'Aruban',
 'Assamese',
 'Associate',
 'Assyrian',
 'Athletics',
 'Aussie',
 'Austro',
 'Avarian',
 'Azorean',
 'BBC',
 'Baltic',
 'Basque',
 'Bavarian',
 'Benedictine',
 'Bermudan',
 'Bermudian',
 'Bessarabian',
 'Bletchley',
 'Bodo',
 'Bosnia',
 'Breton',
 'Brigadier',
 "Britain's",
 'Britsih',
 'California',
 'Californian',
 'Calypso',
 'Cantonese',
 'Caribbean',
 'Catalan',
 'Catholic',
 'Caymanian',
 'Ceylon',
 'Ceylonese',
 'Chagossian',
 'Chairman',
 'Chechen',
 'Cherokee',
 'Chief',
 'Chilian',
 'China',
 'Chiricahua',
 'Chuvash',
 'Circassian',
 'Civil',
 'Columbian',
 'Commandant',
 'Composer',
 'Computer',
 'Congo',
 'Congoleze',
 'Congresswoman;',
 'Cook',
 'Cornish',


<IPython.core.display.Javascript object>

#### Observations:
- We can see there are some remaining variations on how nationality was entered that are not yet in `nation_country_dict`.
- Let us add those now, then do another iteration for searching `info_2`.
- We can also proceed to collect causes of death, such as 'AIDS' in a separate list.
- Values for nationality that clearly pertain to an ethnicity, such as "Afro" and "Anglo" will be assigned empty strings in the dictionary.
- Descriptions will be assigned to their geographical physical region when nation of membership is remote.

In [145]:
nation_country_dict["ANC"] = nation_country_dict["South African"]
nation_country_dict["Abkhaz"] = nation_country_dict["Georgian"]
nation_country_dict["Abkhazian"] = nation_country_dict["Georgian"]
nation_country_dict["Aboriginal"] = nation_country_dict["Australian"]
nation_country_dict["African"] = "Africa"
nation_country_dict["Afrikaans"] = nation_country_dict["African"]
nation_country_dict["Afrikaner"] = nation_country_dict["African"]
nation_country_dict["Afro"] = ""
nation_country_dict["Alyawarre"] = nation_country_dict["Australian"]
nation_country_dict["America"] = nation_country_dict["US"]
nation_country_dict["America's"] = nation_country_dict["US"]
nation_country_dict["Amrican"] = nation_country_dict["US"]
nation_country_dict["Anglo"] = ""
nation_country_dict["Anguillan"] = "Caribbean"
nation_country_dict["Antigua"] = nation_country_dict["Antiguan"]
nation_country_dict["Arabic"] = "Arab states"
nation_country_dict["Argentinian"] = nation_country_dict["Argentine"]
nation_country_dict["Aruba"] = "Caribbean"
nation_country_dict["Aruban"] = nation_country_dict["Aruba"]
nation_country_dict["Assamese"] = nation_country_dict["Indian"]
nation_country_dict["Assyrian"] = "Middle East"
nation_country_dict["Aussie"] = nation_country_dict["Australian"]
nation_country_dict["Australia"] = nation_country_dict["Australian"]
nation_country_dict["Australia's"] = nation_country_dict["Australian"]
nation_country_dict["Austria"] = nation_country_dict["Austrian"]
nation_country_dict["Austro"] = nation_country_dict["Austrian"]
nation_country_dict["Avarian"] = nation_country_dict["Russian"]
nation_country_dict["Azerbaijan"] = nation_country_dict["Azerbaijani"]
nation_country_dict["Azorean"] = nation_country_dict["Portuguese"]
nation_country_dict["Baltic"] = "Eastern Europe"
nation_country_dict["Bangladesh"] = nation_country_dict["Bangladeshi"]
nation_country_dict["Barbados"] = "Caribbean"
nation_country_dict["Basque"] = "Western Europe"
nation_country_dict["Bavarian"] = nation_country_dict["German"]
nation_country_dict["Belarus"] = nation_country_dict["Belarusian"]
nation_country_dict["Belarussian"] = nation_country_dict["Belarusian"]
nation_country_dict["Belgium"] = nation_country_dict["Belgian"]
nation_country_dict["Bermudan"] = "Caribbean"
nation_country_dict["Bessarabian"] = "Eastern Europe"
nation_country_dict["Bletchley"] = nation_country_dict["British"]
nation_country_dict["Bodo"] = nation_country_dict["Norwegian"]
nation_country_dict["Bosnia"] = nation_country_dict["Bosnian"]
nation_country_dict["Breton"] = nation_country_dict["French"]
nation_country_dict["Britain's"] = nation_country_dict["British"]
nation_country_dict["Britsih"] = nation_country_dict["British"]
nation_country_dict["California"] = nation_country_dict["US"]
nation_country_dict["Californian"] = nation_country_dict["US"]
nation_country_dict["Cantonese"] = nation_country_dict["Chinese"]
nation_country_dict["Caribbean"] = "Caribbean"
nation_country_dict["Catalan"] = nation_country_dict["Spanish"]
nation_country_dict["Caymanian"] = nation_country_dict["Caribbean"]
nation_country_dict["Ceylon"] = nation_country_dict["Sri Lankan"]
nation_country_dict["Ceylonese"] = nation_country_dict["Sri Lankan"]
nation_country_dict["Chagossian"] = "Indian Ocean"
nation_country_dict["Chechen"] = nation_country_dict["Russian"]
nation_country_dict["Cherokee"] = nation_country_dict["US"]
nation_country_dict["Chilian"] = nation_country_dict["Chilean"]
nation_country_dict["China"] = nation_country_dict["Chinese"]
nation_country_dict["Chiricahua"] = nation_country_dict["US"]
nation_country_dict["Chuvash"] = nation_country_dict["Russian"]
nation_country_dict["Circassian"] = nation_country_dict["Russian"]
nation_country_dict["Columbian"] = nation_country_dict["Colombian"]
nation_country_dict["Congo"] = nation_country_dict["Congolese"]
nation_country_dict["Cornish"] = nation_country_dict["British"]
nation_country_dict["Costan Rican"] = nation_country_dict["Costa Rican"]
nation_country_dict["Crimean"] = nation_country_dict["Russian"]
nation_country_dict["Croat"] = nation_country_dict["Croatian"]
nation_country_dict["Curaçaoan"] = nation_country_dict["Dutch"]
nation_country_dict["Curaçaon"] = nation_country_dict["Dutch"]
nation_country_dict["Dagestani"] = nation_country_dict["Russian"]
nation_country_dict["Dahomey"] = "West Africa"
nation_country_dict["Dijiboutian"] = nation_country_dict["Djiboutian"]
nation_country_dict["Dolgan"] = nation_country_dict["Russian"]
nation_country_dict["Dominica"] = nation_country_dict["Caribbean"]
nation_country_dict["East"] = ""
nation_country_dict["Eastern"] = ""
nation_country_dict["England"] = nation_country_dict["British"]
nation_country_dict["Englist"] = nation_country_dict["British"]
nation_country_dict["European"] = "Europe"
nation_country_dict["Falkland Islands"] = "South America"
nation_country_dict["Falkland islands"] = nation_country_dict["Falkland Islands"]
nation_country_dict["Falkland"] = nation_country_dict["Falkland Islands"]
nation_country_dict["Faroese"] = nation_country_dict["Danish"]
nation_country_dict["Filipina"] = nation_country_dict["Filipino"]
nation_country_dict["Filipo"] = nation_country_dict["Filipino"]  # verified entry
nation_country_dict["Fillipina"] = nation_country_dict["Filipino"]
nation_country_dict["Finish"] = nation_country_dict["Finnish"]
nation_country_dict["Flemish"] = nation_country_dict["Belgian"]
nation_country_dict["Franch"] = nation_country_dict["French"]  # verified entry
nation_country_dict["Franco"] = ""
nation_country_dict["Frenck"] = nation_country_dict["French"]  # verified entry
nation_country_dict["Fujianese"] = nation_country_dict["Chinese"]
nation_country_dict["Gaelic"] = ""  # refers to sport of Gaelic football, otherwise language
nation_country_dict["Galician"] = nation_country_dict["Spanish"]
nation_country_dict["Galápagos"] = "Galápagos Islands"  # entry for tortoise
nation_country_dict["Geman"] = nation_country_dict["German"]  # verified entry
nation_country_dict["Germen"] = nation_country_dict["German"]  # verified entry
nation_country_dict["Ghanese"] = "West Africa"
nation_country_dict["Greenlandic"] = nation_country_dict["Danish"]
nation_country_dict["Guadeloupean"] = nation_country_dict["Caribbean"]
nation_country_dict["Guamanian"] = "Oceania"
nation_country_dict["Guernsey"] = nation_country_dict["British"]
nation_country_dict["Hawaiian"] = nation_country_dict["US"]
nation_country_dict["Hindi"] = nation_country_dict["Indian"]
nation_country_dict["Hindu"] = nation_country_dict["Indian"]
nation_country_dict["Hollywood"] = nation_country_dict["US"]
nation_country_dict["Hong Kong"] = nation_country_dict["Chinese"]
nation_country_dict["Houston"] = nation_country_dict["US"]
nation_country_dict["Huaorani"] = nation_country_dict["Ecuadorian"]
nation_country_dict["I Kiribati"] = nation_country_dict["IKiribati"]
nation_country_dict["Ice"] = ""
nation_country_dict["Indigenous"] = ""
nation_country_dict["Indin"] = nation_country_dict["Indian"]  # verified entry
nation_country_dict["Indo"] = ""
nation_country_dict["Ingush"] = nation_country_dict["Russian"]
nation_country_dict["Italo"] = ""
nation_country_dict["Ivoirian"] = "West Africa"
nation_country_dict["Javanese"] = nation_country_dict["Indonesian"]
nation_country_dict["Jersey"] = nation_country_dict["British"]
nation_country_dict["Kabardin"] = nation_country_dict["Russian"]
nation_country_dict["Kashmiri"] = nation_country_dict["Indian"]
nation_country_dict["Korean"] = "East Asia"
nation_country_dict["Kosovan"] = "Eastern Europe"
nation_country_dict["Kurdish"] = "West Asia"
nation_country_dict["Latino"] = ""
nation_country_dict["Lesothan"] = "Southern Africa"
nation_country_dict["Los Angeles"] = nation_country_dict["US"]
nation_country_dict["Louisiana"] = nation_country_dict["US"]
nation_country_dict["MGerman"] = nation_country_dict["German"]  # verified entry
nation_country_dict["Macanese"] = "East Asia"
nation_country_dict["Malayalam"] = nation_country_dict["Indian"]
nation_country_dict["Malayali"] = nation_country_dict["Indian"]
nation_country_dict['Malayan'] = nation_country_dict["Malaysian"]
nation_country_dict['Manx'] = nation_country_dict["British"]
nation_country_dict['Mexian'] = nation_country_dict["Mexican"]
nation_country_dict['Mississippi'] = nation_country_dict["US"]
nation_country_dict['Monegasque'] = nation_country_dict["Monacan"]
nation_country_dict['Montserrat'] = nation_country_dict["Caribbean"]
nation_country_dict['Montserratian'] = nation_country_dict["Caribbean"]
nation_country_dict['Myanmar'] = nation_country_dict["Burmese"]
nation_country_dict['Native'] = ''
nation_country_dict['New York'] = nation_country_dict["US"]
nation_country_dict['Ngarrindjeri'] = nation_country_dict["Australian"]
nation_country_dict['Ni Vanuatu'] = "Oceania"
nation_country_dict['Nigirean'] = nation_country_dict["Nigerian"]
nation_country_dict['Niuean'] = nation_country_dict["NZ"]
nation_country_dict['Northern Ire'] = nation_country_dict["Northern Irish"]
nation_country_dict['Northern Ireland'] = nation_country_dict["Northern Irish"]
nation_country_dict['Norther Irish'] = nation_country_dict["Northern Irish"]
nation_country_dictp['North Irish'] = nation_country_dict["Northern Irish"]
nation_country_dict['North American'] = "North America"
nation_country_dict['North Island'] = nation_country_dict["NZ"]
nation_country_dict['Northern Mariana Island'] = "Oceania"
nation_country_dict['Northern Mariana Islander'] = 'Oceania'
nation_country_dict['Nubian'] = nation_country_dict["Sudanese"]
nation_country_dict['Ottoman'] = nation_country_dict["Turkish"]
nation_country_dict['Paraguan'] = nation_country_dict["Paraguayan"]  # verified entry
nation_country_dict['Pitcairn'] = 'Oceania'
nation_country_dict['Poliosh'] = '' # verified entry
nation_country_dict['Polis'] = nation_country_dict["Polish"] # verified entry
nation_country_dict['Prussian'] = nation_country_dict["German"]
nation_country_dict['Punjabi'] = nation_country_dict["Indian"]
nation_country_dict['Quebec'] = nation_country_dict["Canadian"]
nation_country_dict['Québécois'] = nation_country_dict["Canadian"]
nation_country_dict['Republic of China'] = nation_country_dict["Chinese"]
nation_country_dict['Republic'] = ''
nation_country_dict['Rhodesian'] = 'Southern Africa'
nation_country_dict['Roman'] = nation_country_dict["Italian"]
nation_country_dict['Réunionese'] = nation_country_dict["French"]
nation_country_dict['S African'] = nation_country_dict["South African"]
nation_country_dict['Saban'] = nation_country_dict["Caribbean"]
nation_country_dict['Saharawi'] = 'West Africa'
nation_country_dict['Sahrawi'] = nation_country_dict['Saharawi']
nation_country_dict['Saint Helena'] = 'South Atlantic'
nation_country_dict['Saint Vincent'] = nation_country_dict["Caribbean"]
nation_country_dict['Saint Martin'] = nation_country_dict["Caribbean"]
nation_country_dict['Saint Pierre and Miquelon'] = nation_country_dict["North America"]
nation_country_dict['Salvadorean'] = nation_country_dict["Salvadoran"]
nation_country_dict['Sanmarinese'] = nation_country_dict["Sammarinese"]
nation_country_dict['Santomean'] = nation_country_dict["São Toméan"]
nation_country_dict['Seychellian'] = nation_country_dict["Seychellois"]
nation_country_dict['Sicilian'] = nation_country_dict["Italian"]
nation_country_dict['Sicillian'] = nation_country_dict['Italian']
nation_country_dict['Sikkimese'] = nation_country_dict["Indian"]
nation_country_dict['Sorbian'] = nation_country_dict["German"]
nation_country_dict['South Afican'] = nation_country_dict["South African"]
nation_country_dict['South Ossetian'] = nation_country_dict["Georgian"]
nation_country_dict['South '] = ''
nation_country_dict['Southern'] = ''
nation_country_dict['Soviet'] = 'United Socialist Soviet Republic'
nation_country_dict['Sri lankan'] = nation_country_dict["Sri Lankan"]
nation_country_dict['St Lucian'] = nation_country_dict["Caribbean"]
nation_country_dict['St Kitts and Nevis'] = nation_country_dict["Kittian and Nevisian"]



























causes = [
    "AIDS",
]

roles = ["MC"]



<IPython.core.display.Javascript object>

In [169]:
other_species_df = pd.read_csv("other_species.csv")
other_species = other_species_df["species"].tolist()
other_species.append("kiwi")

<IPython.core.display.Javascript object>

In [227]:
nation_country_dict["Kittian and Nevisian"]

'Saint Kitts and Nevis'

<IPython.core.display.Javascript object>

In [29]:
nation_country_dict

{'Afghan': 'Afghanistan',
 'Albanian': 'Albania',
 'Algerian': 'Algeria',
 'Andorran': 'Andorra',
 'Angolan': 'Angola',
 'Antiguan': 'Antigua and Barbuda',
 'Barbudan': 'Antigua and Barbuda',
 'Argentine': 'Argentina',
 'Armenian': 'Armenia',
 'Australian': 'Australia',
 'Austrian': 'Austria',
 'Azerbaijani': 'Azerbaijan',
 'Azeri': 'Azerbaijan',
 'Bahamian': 'The Bahamas',
 'Bahraini': 'Bahrain',
 'Bengali': 'Bangladesh',
 'Barbadian': 'Barbados',
 'Belarusian': 'Belarus',
 'Belgian': 'Belgium',
 'Belizean': 'Belize',
 'Beninese': 'Benin',
 'Beninois': 'Benin',
 'Bhutanese': 'Bhutan',
 'Bolivian': 'Bolivia',
 'Bosnian': 'Bosnia and Herzegovina',
 'Herzegovinian': 'Bosnia and Herzegovina',
 'Motswana': 'Botswana',
 'Botswanan': 'Botswana',
 'Brazilian': 'Brazil',
 'Bruneian': 'Brunei',
 'Bulgarian': 'Bulgaria',
 'Burkinabé': 'Burkina Faso',
 'Burmese': 'Burma',
 'Burundian': 'Burundi',
 'Cabo Verdean': 'Cabo Verde',
 'Cambodian': 'Cambodia',
 'Cameroonian': 'Cameroon',
 'Canadian': 'Ca

<IPython.core.display.Javascript object>

In [225]:
check_value = "St "

check_list = []
for index in df[df["info_2"].notna()].index:
    item = df.loc[index, "info_2"]
    if item:
        if item.startswith(check_value):
            check_list.append(index)

<IPython.core.display.Javascript object>

In [226]:
df.loc[check_list, :]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
15268,6,Roderick Walcott,", 70, St Lucian playwright, screenwriter, painter, theatre director, and lyricist.",https://en.wikipedia.org/wiki/Roderick_Walcott,8,2000,March,,,St Lucian playwright,screenwriter,painter,theatre director,and lyricist,,,,,,70.0,,,
34044,7,Sir John Compton,", 82, St. Lucian Prime Minister , stroke.",https://en.wikipedia.org/wiki/John_Compton,31,2007,September,"(1979, 1982–1996, 2006–2007)",,St Lucian Prime Minister,stroke,,,,,,,,,82.0,,,
88026,25,Sir Cuthbert Sebastian,", 95, St. Kitts and Nevis politician, Governor-General .",https://en.wikipedia.org/wiki/Cuthbert_Sebastian,7,2017,March,(1996–2013),,St Kitts and Nevis politician,Governor General,,,,,,,,,95.0,,,


<IPython.core.display.Javascript object>