# Wikipedia Notable Life Expectancies

# [Data Cleaning Part 3](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean3_thanak_2022_06_23.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean2.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean2", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 132652 rows and 21 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0,
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0,



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
132650,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(1980),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0,
132651,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0,



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
3096,4,"Richard Adrian, 2nd Baron Adrian",", 67, British peer and physiologist.","https://en.wikipedia.org/wiki/Richard_Adrian,_2nd_Baron_Adrian",5,1995,April,,,British peer and physiologist,,,,,,,,,,67.0,
119148,5,Robert E. Kelley,", 87, American lieutenant general.",https://en.wikipedia.org/wiki/Robert_E._Kelley,5,2021,February,,,American lieutenant general,,,,,,,,,,87.0,
28170,9,Rita Keller,", 72, American baseball player .",https://en.wikipedia.org/wiki/Rita_Keller,3,2005,August,(AAGPBL),,American baseball player,,,,,,,,,,72.0,
129501,29,Pete Smith,", 63, New Zealand actor , kidney disease.",https://en.wikipedia.org/wiki/Pete_Smith_(actor),4,2022,January,"(, , )",,New Zealand actor,kidney disease,,,,,,,,,63.0,
94356,23,James Colby,", 56, American actor .",https://en.wikipedia.org/wiki/James_Colby,4,2018,February,"(, , )",,American actor,,,,,,,,,,56.0,



### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132652 entries, 0 to 132651
Data columns (total 21 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132652 non-null  object 
 1   name            132652 non-null  object 
 2   info            132652 non-null  object 
 3   link            132652 non-null  object 
 4   num_references  132652 non-null  object 
 5   year            132652 non-null  int64  
 6   month           132652 non-null  object 
 7   info_parenth    49830 non-null   object 
 8   info_1          35 non-null      object 
 9   info_2          132604 non-null  object 
 10  info_3          62571 non-null   object 
 11  info_4          12605 non-null   object 
 12  info_5          1497 non-null    object 
 13  info_6          216 non-null     object 
 14  info_7          31 non-null      object 
 15  info_8          6 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   


#### Loading `nation_country_dict` from Pickle File to Dictionary `nation_map`

In [6]:
# Load the nation_country_dict
with open("nation_country_dict.pkl", "rb") as f:
    nation_map = pickle.load(f)


#### Function to Save Indices of Rows Matching Regular Expressions Pattern to a List and Print Number of Rows with Match 

In [7]:
# Define a function that takes dataframe, column name, and re pattern as arguments and returns list of indices
# for which column value matches re pattern
def rows_with_pattern(dataframe, column, pattern):
    """
    Takes input of dataframe, column name, and re pattern 
    and returns list of indices for rows that contain match
    for pattern anywhere within value for given column.
    
    dataframe: dataframe
    column: column name
    pattern: re pattern
    """
    index_list = []

    for i in dataframe.index:
        item = dataframe.loc[i, column]
        match = re.search(pattern, item)
        if match:
            index_list.append(i)
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list
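
The row-by-row loop above can also be written with pandas string methods. The following is a minimal sketch, not the notebook's code: the function name `rows_with_pattern_vectorized` and the toy dataframe are assumptions made for illustration. `Series.str.contains` applies search semantics like `re.search`, and `na=False` treats missing values as non-matches, so the column does not need to be filtered for nulls first.

```python
import pandas as pd


def rows_with_pattern_vectorized(dataframe, column, pattern):
    """Return index labels of rows whose `column` value contains a match for `pattern`."""
    # na=False counts missing values as "no match" instead of propagating NaN
    mask = dataframe[column].str.contains(pattern, regex=True, na=False)
    index_list = dataframe.index[mask].tolist()
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list


# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({"info_2": ["British dancer", None, "Irish economist"]})
matches = rows_with_pattern_vectorized(toy, "info_2", r"^Irish")
```

On a dataframe of this size, the vectorized form would likely run considerably faster than an explicit loop over `dataframe.index`, though that has not been measured here.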


#### Function to Use rows_with_pattern Function for Multiple Regular Expression Patterns

In [8]:
# Define a function that calls rows_with_pattern function for multiple re patterns
# returning a single list of indices for all rows with any pattern match


def multiple_patterns(dataframe, column, patterns):
    """
    Takes input dataframe, column, and list of re patterns and returns single list 
    of indices for rows in which a match for any pattern is found with re.search
    
    dataframe: dataframe
    column: column name
    patterns: list of re patterns
    """
    rows_combined = []

    # For loop to check each pattern
    for pattern in patterns:

        # List and number of rows matching each pattern
        print(pattern)
        rows_to_check = rows_with_pattern(dataframe, column, pattern)
        print("")

        # Add list for each pattern to combined list
        rows_combined += rows_to_check

    return rows_combined


## Extracting Nationality Continued
Here is the approach we will take:
- We will save the country name, rather than the nationality, in new `place_1` and `place_2` columns, because a single country name standardizes the various nationality values associated with it.
- First, we will update `nation_map` by removing hyphens and periods from the keys and replacing hyphens in the values with a single space.
- Then we will remove "-born" from the column we are searching, as well as replace "-" and "/" each with single spaces.  In this step, we can also remove leading and trailing periods and whitespace.
- We will proceed to search the numbered `info_` columns in order checking as follows:
    1. if column value starts with a nationality in the dictionary (a key):
        - save country to `place_1` and remove the matched text from searched column.
    2. if `place_1` value has been found:
        - if updated column value starts with a nationality in the dictionary:
            - save country to `place_2` and remove the matched text from searched column.
    3. Repeat steps 1 and 2 but comparing with country (dictionary values).
    4. Check unique values for column starting with capital letters.
- It is tempting to shorten the process by simply searching for nationality and country values within the column value, but as there are entries containing first and second nationality, we have to proceed from left to right to extract these values optimally.
- To avoid extracting incorrect values for a person who studies a place, rather than being from that place (e.g., "Egypt" vs "Egyptologist"), we will search for the exact value and the value with a single trailing space.
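
The matching rule in the last bullet can be sketched as a small standalone function. This is an illustration under stated assumptions, not the code used in the cells that follow: the name `match_leading_place`, the toy dictionaries, the longest-key-first ordering, and slicing off only the leading term are all choices made for this sketch.

```python
def match_leading_place(text, place_map):
    """Return (country, remainder) if `text` starts with a key of `place_map`, else (None, text)."""
    # Trying longer keys first is an assumption of this sketch, so that a longer
    # descriptor is preferred over a shorter key that is also a prefix of the text
    for key in sorted(place_map, key=len, reverse=True):
        # Match the exact value, or the value followed by a single space
        if text == key or text.startswith(key + " "):
            return place_map[key], text[len(key):].strip()
    return None, text


# Illustrative subset only; the real nation_map is loaded from the pickle file
toy_map = {"Egyptian": "Egypt", "American": "United States of America"}

first, rest = match_leading_place("Egyptian American physician", toy_map)
second, rest = match_leading_place(rest, toy_map)

# "Egypt" alone does not match "Egyptologist" because no space follows the key
not_found, unchanged = match_leading_place("Egyptologist", {"Egypt": "Egypt"})
```

Working left to right in this way gives `first` and `second` in the order the nationalities appear in the description, which is the reason for not searching anywhere within the value.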

In [9]:
nation_map

{'Afghan': 'Afghanistan',
 'Albanian': 'Albania',
 'Algerian': 'Algeria',
 'Andorran': 'Andorra',
 'Angolan': 'Angola',
 'Antiguan': 'Antigua and Barbuda',
 'Barbudan': 'Antigua and Barbuda',
 'Argentine': 'Argentina',
 'Armenian': 'Armenia',
 'Australian': 'Australia',
 'Austrian': 'Austria',
 'Azerbaijani': 'Azerbaijan',
 'Azeri': 'Azerbaijan',
 'Bahamian': 'The Bahamas',
 'Bahraini': 'Bahrain',
 'Bengali': 'Bangladesh',
 'Barbadian': 'Barbados',
 'Belarusian': 'Belarus',
 'Belgian': 'Belgium',
 'Belizean': 'Belize',
 'Beninese': 'Benin',
 'Beninois': 'Benin',
 'Bhutanese': 'Bhutan',
 'Bolivian': 'Bolivia',
 'Bosnian': 'Bosnia and Herzegovina',
 'Herzegovinian': 'Bosnia and Herzegovina',
 'Motswana': 'Botswana',
 'Botswanan': 'Botswana',
 'Brazilian': 'Brazil',
 'Bruneian': 'Brunei',
 'Bulgarian': 'Bulgaria',
 'Burkinabé': 'Burkina Faso',
 'Burmese': 'Burma',
 'Burundian': 'Burundi',
 'Cabo Verdean': 'Cabo Verde',
 'Cambodian': 'Cambodia',
 'Cameroonian': 'Cameroon',
 'Canadian': 'Ca


#### Removing "-" and "." from `nation_map`

In [10]:
# Removing hyphens from nation_map
nation_map = {
    key.replace("-", ""): value.replace("-", " ") for key, value in nation_map.items()
}

# Removing periods from nation_map
nation_map = {key.replace(".", ""): value for key, value in nation_map.items()}
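
To make the effect of the two comprehensions concrete, here is a minimal sketch on a two-entry dictionary. The entries in `toy_map` are hypothetical and chosen only to show the transformation; the two passes are combined into one comprehension, which should give the same result as running them in sequence.

```python
# Hypothetical entries used only to show the effect of the cleanup
toy_map = {"I-Kiribati": "Kiribati", "U.S.": "United States of America"}

cleaned = {
    # Keys: hyphens and periods are removed outright
    # Values: hyphens become single spaces
    key.replace("-", "").replace(".", ""): value.replace("-", " ")
    for key, value in toy_map.items()
}
```

Note that hyphens in keys are removed rather than replaced with a space, whereas hyphens in the `info_` columns are replaced with a space in the next step, so a hyphenated descriptor may need a spaced variant added to the dictionary separately.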


#### Removing or Replacing Extra Characters in `info_` Columns

In [11]:
%%time

# List of columns to treat
cols_lst = [
    "info_1",
    "info_2",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
    'info_parenth'
]

# Dictionary of keys to find and values to replace keys
replace_dict = {'-born': '', '–born': '', '-bred': '', '–bred': '', '-': ' ', '–': ' ', '/': ' ', '.': ' ', '(': '', ')': '', "'s": '', "   ": ' ', '  ': ' '}

# For loop to find and replace characters in replace_dict in columns in cols_lst
# and strip any leading or trailing whitespace (periods are replaced with spaces first)
for column in cols_lst:
    for key, value in replace_dict.items():
        # Iterate only over rows with a non-null value in the column
        for index in df[df[column].notna()].index:
            item = df.loc[index, column]
            if item:
                df.loc[index, column] = item.replace(key, value).strip()

CPU times: total: 4min 25s
Wall time: 4min 25s
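
The same cleanup can be expressed with vectorized string methods, which would likely cut the run time well below the minutes reported above (not measured here). This is a sketch rather than the notebook's code: the helper name `clean_text_column` and the toy series are assumptions, and the find/replace pairs are copied from `replace_dict`.

```python
import pandas as pd

# Same find/replace pairs as replace_dict above, applied in the same order
replace_dict = {
    "-born": "", "–born": "", "-bred": "", "–bred": "",
    "-": " ", "–": " ", "/": " ", ".": " ",
    "(": "", ")": "", "'s": "", "   ": " ", "  ": " ",
}


def clean_text_column(series, replacements):
    """Apply literal replacements in order, stripping whitespace after each one."""
    for old, new in replacements.items():
        # regex=False matters here: ".", "(" and ")" are regex metacharacters
        series = series.str.replace(old, new, regex=False).str.strip()
    return series


# Toy values for illustration only
toy = pd.Series([". Austrian-born Olympic sailor (1960)", None, "Canada's last veteran."])
cleaned = clean_text_column(toy, replace_dict)
```

Missing values pass through unchanged, so no null check is needed inside the loop. In the notebook, this would be applied as `df[column] = clean_text_column(df[column], replace_dict)` for each column in `cols_lst`.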



#### Checking `info_1` for `place_1`

In [12]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_1"

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    dataframe = df[(df[column].notna())]
    for index in dataframe.index:
        item = df.loc[index, column]
        if (
            item == nationality
            or item == country
            or item.startswith(nationality + " ")
            or item.startswith(country + " ")
        ):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column]
                .replace(nationality, "")
                .strip()
                .replace(country, "")
                .strip()
            )

# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1
104434,12,Franz Eisl,", 98. Austrian Olympic sailor .",https://en.wikipedia.org/wiki/Franz_Eisl,1,2019,July,"1960, 1972",Olympic sailor,,,,,,,,,,,98.0,,Austria
21865,23,Roberto Matta,", 91 Chilean artist.",https://en.wikipedia.org/wiki/Roberto_Matta,7,2002,November,,artist,,,,,,,,,,,91.0,,Chile



#### Observations:
- `info_1` provides us a small sample on which to test code.
- We successfully extracted those `place_1` values. Now we will do the same on the treated rows for `place_2`.

#### Checking `info_1` for `place_2`

In [13]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_2"

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    dataframe = df[(df["place_1"].notna()) & (df[column].notna())]
    for index in dataframe.index:
        item = df.loc[index, column]
        if (
            item == nationality
            or item == country
            or item.startswith(nationality + " ")
            or item.startswith(country + " ")
        ):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column]
                .replace(nationality, "")
                .strip()
                .replace(country, "")
                .strip()
            )

# Check a sample of rows
df.sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1
117912,3,Dick Kulpa,", 67, American cartoonist , cancer.",https://en.wikipedia.org/wiki/Dick_Kulpa,6,2021,January,",",,American cartoonist,cancer,,,,,,,,,67.0,,
43796,18,John Babcock,", 109, Canadian soldier, Canada's last surviving World War I veteran.",https://en.wikipedia.org/wiki/John_Babcock,21,2010,February,,,Canadian soldier,Canada last surviving World War I veteran,,,,,,,,,109.0,,



#### Observations:
- Here we can see that the new column `place_2` has not yet been added, as there were no matching values.
- Let us confirm by checking the remaining unique values in `info_1`.

#### Checking Remaining Unique Values in `info_1`

In [14]:
# Checking unique values
df["info_1"].unique()

array([None, 'politician', 'Olympic sprinter', 'gridiron football player',
       'writer', 'businessman', 'social psychologist', 'King of Nepal',
       'Maori leader', 'artist', 'English sports journalist',
       'Jules Engel', 'early', 'aka', 'Jr', 'professional wrestler',
       'automotive engineer', 'materials scientist', 'weightlifter',
       'common chimpanzee', '', 'Olympic athlete', 'actor',
       'Olympic gymnast', 'broadcaster and writer', 'Olympic swimmer',
       'Olympic boxer', 'Olympic wrestler', 'Olympic sailor',
       'basketball player', 'college basketball coach',
       'choral conductor', 'Tree of the Year'], dtype=object)


#### Observations:
- Neither "English" nor "Maori" are keys in the current dictionary.
- Maori is an ethnicity within the country of New Zealand, so for now, we will add it as a key in our dictionary with the country value of New Zealand.  If we have matching first and second countries, we can later remove the first value.
- We will also add the key "English" with the country value 'United Kingdom of Great Britain and Northern Ireland'.
- Then, we can rerun the above code for `place_1` and `place_2`.
- The country value of "Nepal" is also present. We will hold off on extracting country names until we have first exhausted matching nationalities, as the Wikipedia field called for nationalities.

#### Updating `nation_map`

In [15]:
# Adding key: country pairs to nation_map
nation_map["English"] = nation_map["British"]
nation_map["Maori"] = nation_map["New Zealand"]


#### Re-checking `info_1` for `place_1`

In [16]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_1"

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    dataframe = df[(df[extract_to].isna()) & (df[column].notna())]
    for index in dataframe.index:
        item = df.loc[index, column]
        if (
            item == nationality
            or item == country
            or item.startswith(nationality + " ")
            or item.startswith(country + " ")
        ):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column]
                .replace(nationality, "")
                .strip()
                .replace(country, "")
                .strip()
            )

# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1
6832,17,Spiro Agnew,", American politician, 77, 39th Vice President of the United States, leukemia.",https://en.wikipedia.org/wiki/Spiro_Agnew,207,1996,September,,politician,,39th Vice President of the United States,leukemia,,,,,,,,77.0,,United States of America
55331,5,Lucky Diamond,", Maltese 15, American Guinness World Record dog, holder , cancer.",https://en.wikipedia.org/wiki/Lucky_Diamond_(dog),13,2012,June,dog most photographed with celebrities,,American Guinness World Record dog,holder,cancer,,,,,,,,15.0,,Malta



#### Re-checking `info_1` for `place_2`

In [17]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "place_2"

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    dataframe = df[(df["place_1"].notna()) & (df[column].notna())]
    for index in dataframe.index:
        item = df.loc[index, column]
        if (
            item == nationality
            or item == country
            or item.startswith(nationality + " ")
            or item.startswith(country + " ")
        ):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column]
                .replace(nationality, "")
                .strip()
                .replace(country, "")
                .strip()
            )

# Checking rows
df[df["place_2"].notna()]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
19580,20,Dame Miraka Szászy,", 80. New Zealand Maori leader.",https://en.wikipedia.org/wiki/Mira_Sz%C3%A1szy,21,2001,December,,leader,,,,,,,,,,,80.0,,New Zealand,New Zealand



#### Observations:
- Our code appears to be finding the matching values and assigning the corresponding country to the correct nation column.
- We see "New Zealand" added to both place columns here, which was expected as both New Zealand and Maori are in the description.
- As an aside, we will need to check our final values where the first nationality is "American" and the second is "Indian", as our code will record United States of America in `place_1` and India in `place_2`, which may or may not be correct.
- Now we can proceed to doing the same extraction on `info_2`.
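
Because the same extraction loop is repeated for each column and target, it could be wrapped in a helper. The sketch below is one possible shape, not the notebook's implementation: `leading_match`, `extract_place`, and the `require_col` parameter are hypothetical names. It also differs in behavior from the cells above in that it stops at the first matching term per row and removes only the leading term rather than every occurrence.

```python
import pandas as pd


def leading_match(item, place_map):
    """Return (country, remainder) for the first term that starts `item`, else None."""
    for nationality, country in place_map.items():
        for term in (nationality, country):
            if item == term or item.startswith(term + " "):
                # Slice off only the leading term, leaving later occurrences intact
                return country, item[len(term):].strip()
    return None


def extract_place(df, column, extract_to, place_map, require_col=None):
    """Move a leading nationality or country from `column` into `extract_to`."""
    if extract_to not in df.columns:
        df[extract_to] = None
    # Rows with text to search and no value yet in the target column
    mask = df[column].notna() & df[extract_to].isna()
    if require_col is not None:
        # e.g. only look for place_2 where place_1 has already been found
        mask &= df[require_col].notna()
    for index in df.index[mask]:
        found = leading_match(df.at[index, column], place_map)
        if found:
            df.at[index, extract_to], df.at[index, column] = found
    return df


# Toy data and an illustrative subset of the dictionary
toy = pd.DataFrame({"info_1": ["New Zealand Maori leader", "politician", None]})
toy_map = {"New Zealand": "New Zealand", "Maori": "New Zealand"}
extract_place(toy, "info_1", "place_1", toy_map)
extract_place(toy, "info_1", "place_2", toy_map, require_col="place_1")
```

Looping over rows once and checking the dictionary per row, instead of re-filtering the dataframe for every dictionary entry, should also reduce the run time on the larger `info_2` column.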

#### Checking `info_2` for `place_1`

In [18]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_1"

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    dataframe = df[(df[extract_to].isna()) & (df[column].notna())]
    for index in dataframe.index:
        item = df.loc[index, column]
        if item == nationality or item == country or item.startswith(nationality + ' ') or item.startswith(country + ' '):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip().replace(country, '').strip()
            )

CPU times: total: 2min 48s
Wall time: 2min 48s



In [19]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
32961,26,Wolfgang Gewalt,", 78, German zoologist, director of the Duisburg Zoo .",https://en.wikipedia.org/wiki/Wolfgang_Gewalt,0,2007,April,1966 1993,,zoologist,director of the Duisburg Zoo,,,,,,,,,78.0,,Germany,
85673,4,Lady Moyra Browne,", 98, British nursing administrator.",https://en.wikipedia.org/wiki/Lady_Moyra_Browne,3,2016,December,,,nursing administrator,,,,,,,,,,98.0,,United Kingdom of Great Britain and Northern Ireland,



#### Checking Remaining Missing Values for `place_1`

In [20]:
# Checking number of remaining missing values for place_1
print(f'There are {df["place_1"].isna().sum()} remaining missing values for place_1.\n')

There are 2465 remaining missing values for place_1.




#### Observations:
- We have captured the `place_1` value for the vast majority of entries.
- Next we will check for other variations on nationality usage.

#### Examining Unique Values of First Word in `info_2` if Upper Case

In [21]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[(df["place_1"].isna()) & (df[column].notna())]

# Checking set of first words in info_2 where place_1 is missing
first_words = set([item.split()[0] for item in dataframe[column] if item[0].isupper()])
print(
    f"There are {len(first_words)} unique values starting with upper case for first word in info_2.\n"
)
first_words

There are 324 unique values starting with upper case for first word in info_2.



{'AIDS',
 'ANC',
 'Abkhaz',
 'Abkhazian',
 'Aboriginal',
 'Actress',
 'Afghani',
 'African',
 'Afrikaans',
 'Afrikaner',
 'Afro',
 'Air',
 'Alfa',
 'All',
 'Alyawarre',
 'Amateur',
 'America',
 'American]]',
 'Americane',
 'Americanthoroughbred',
 'Amrican',
 'Anglican',
 'Anglo',
 'Anguillan',
 'Arabic',
 'Archbishop',
 'Archdeacon',
 'Argentinian',
 'Aruba',
 'Aruban',
 'Assamese',
 'Associate',
 'Assyrian',
 'Athletics',
 'Aussie',
 'Australiancabaret',
 'Austro',
 'Avarian',
 'Azorean',
 'BBC',
 'Baltic',
 'Basque',
 'Bavarian',
 'Belarussian',
 'Benedictine',
 'Bermudan',
 'Bermudian',
 'Bessarabian',
 'Bletchley',
 'Bodo',
 'Bosnia',
 'Braziliam',
 'Breton',
 'Brigadier',
 'Britain',
 'Britsih',
 'Bulgarianactor',
 'California',
 'Californian',
 'Calypso',
 'Cantonese',
 'Caribbean',
 'Catalan',
 'Catholic',
 'Caymanian',
 'Ceylon',
 'Ceylonese',
 'Chagossian',
 'Chairman',
 'Chechen',
 'Cherokee',
 'Chief',
 'Chilen',
 'Chileán',
 'Chilian',
 'China',
 'Chiricahua',
 'Chuvash',



#### Observations:
- We can see there are some remaining variations on how nationality was entered that are not yet in `nation_map`.
- Let us add those now, then do another iteration searching `info_2`.
- Where the sovereign state is remote from the place described, or the nationality description is broad, the description will be assigned to its geographic region rather than to a country.
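
One way to keep such hard-coded additions easier to audit is to hold them in dictionaries and apply them in a single step. The following is a sketch under stated assumptions, not the approach taken in the next cell: the names `aliases` and `regions` are hypothetical, and the starting `nation_map` here is a small illustrative stand-in for the real dictionary.

```python
# Illustrative base entries; the real nation_map is loaded from the pickle file
nation_map = {"Australian": "Australia", "Russian": "Russia"}

# Alias -> existing key whose country value should be reused
aliases = {"Aussie": "Australian", "Tasmanian": "Australian", "Chechen": "Russian"}

# Descriptor -> geographic region assigned directly
regions = {"Basque": "Europe", "Caribbean": "Central America and the Caribbean"}

# Fail early if an alias points at a key that is not in the dictionary
missing = sorted({target for target in aliases.values() if target not in nation_map})
if missing:
    raise KeyError(f"Unknown target keys: {missing}")

nation_map.update({alias: nation_map[target] for alias, target in aliases.items()})
nation_map.update(regions)
```

In this sketch, aliases resolve against the dictionary as it stood before the update, so an alias that points at another newly added alias would need to be applied in a second pass.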

#### Hard-coding Additional Variations on Nationality

In [22]:
# Hard-coding remaining unique nationality/location descriptors
nation_map["ANC"] = nation_map["South African"]
nation_map["Abkhaz"] = nation_map["Georgian"]
nation_map["Abkhazian"] = nation_map["Georgian"]
nation_map["Aboriginal"] = nation_map["Australian"]
nation_map["African"] = "Africa"
nation_map["Afrikaans"] = nation_map["African"]
nation_map["Afrikaner"] = nation_map["African"]
nation_map["Afro"] = "Africa"
nation_map["Alyawarre"] = nation_map["Australian"]
nation_map["America"] = nation_map["US"]
nation_map["America's"] = nation_map["US"]
nation_map["Amrican"] = nation_map["US"]
nation_map["Anglo"] = "Europe"
nation_map["Anguillan"] = "Central America and the Caribbean"
nation_map["Antigua"] = nation_map["Antiguan"]
nation_map["Arabic"] = "Arab states"
nation_map["Argentinian"] = nation_map["Argentine"]
nation_map["Aruba"] = "Central America and the Caribbean"
nation_map["Aruban"] = nation_map["Aruba"]
nation_map["Assamese"] = nation_map["Indian"]
nation_map["Assyrian"] = "Middle East"
nation_map["Aussie"] = nation_map["Australian"]
nation_map["Australia"] = nation_map["Australian"]
nation_map["Australia's"] = nation_map["Australian"]
nation_map["Austria"] = nation_map["Austrian"]
nation_map["Austro"] = nation_map["Austrian"]
nation_map["Avarian"] = nation_map["Russian"]
nation_map["Azerbaijan"] = nation_map["Azerbaijani"]
nation_map["Azorean"] = nation_map["Portuguese"]
nation_map["Baltic"] = "Europe"
nation_map["Bangladesh"] = nation_map["Bangladeshi"]
nation_map["Barbados"] = "Central America and the Caribbean"
nation_map["Basque"] = "Europe"
nation_map["Bavarian"] = nation_map["German"]
nation_map["Belarus"] = nation_map["Belarusian"]
nation_map["Belarussian"] = nation_map["Belarusian"]
nation_map["Belgium"] = nation_map["Belgian"]
nation_map["Bermudan"] = "Central America and the Caribbean"
nation_map["Bermudian"] = "Central America and the Caribbean"
nation_map["Bessarabian"] = "Europe"
nation_map["Bletchley"] = nation_map["British"]
nation_map["Bodo"] = nation_map["Norwegian"]
nation_map["Bosnia"] = nation_map["Bosnian"]
nation_map["Breton"] = nation_map["French"]
nation_map["Britain's"] = nation_map["British"]
nation_map["Britsih"] = nation_map["British"]
nation_map["California"] = nation_map["US"]
nation_map["Californian"] = nation_map["US"]
nation_map["Cantonese"] = nation_map["Chinese"]
nation_map["Caribbean"] = "Central America and the Caribbean"
nation_map["Catalan"] = nation_map["Spanish"]
nation_map["Caymanian"] = nation_map["Caribbean"]
nation_map["Ceylon"] = nation_map["Sri Lankan"]
nation_map["Ceylonese"] = nation_map["Sri Lankan"]
nation_map["Chagossian"] = "Indian Ocean"
nation_map["Chechen"] = nation_map["Russian"]
nation_map["Cherokee"] = nation_map["US"]
nation_map["Chilian"] = nation_map["Chilean"]
nation_map["China"] = nation_map["Chinese"]
nation_map["Chiricahua"] = nation_map["US"]
nation_map["Chuvash"] = nation_map["Russian"]
nation_map["Circassian"] = nation_map["Russian"]
nation_map["Columbian"] = nation_map["Colombian"]
nation_map["Congo"] = nation_map["Congolese"]
nation_map["Congoleze"] = nation_map["Congolese"]
nation_map["Cornish"] = nation_map["British"]
nation_map["Costan Rican"] = nation_map["Costa Rican"]
nation_map["Côte d'Ivoire"] = nation_map["Ivorian"]
nation_map["Crimean"] = nation_map["Russian"]
nation_map["Croat"] = nation_map["Croatian"]
nation_map["Curaçaoan"] = nation_map["Dutch"]
nation_map["Curaçaon"] = nation_map["Dutch"]
nation_map["Dagestani"] = nation_map["Russian"]
nation_map["Dahomey"] = "Africa"
nation_map["Dijiboutian"] = nation_map["Djiboutian"]
nation_map["Dolgan"] = nation_map["Russian"]
nation_map["Dominica"] = nation_map["Caribbean"]
nation_map["England"] = nation_map["British"]
nation_map["Englist"] = nation_map["British"]
nation_map["European"] = "Europe"
nation_map["Falkland Islands"] = "South America"
nation_map["Falkland islands"] = nation_map["Falkland Islands"]
nation_map["Falkland"] = nation_map["Falkland Islands"]
nation_map["Faroese"] = nation_map["Danish"]
nation_map["Filipina"] = nation_map["Filipino"]
nation_map["Filipo"] = nation_map["Filipino"]  # verified entry
nation_map["Fillipina"] = nation_map["Filipino"]
nation_map["Finish"] = nation_map["Finnish"]
nation_map["Flemish"] = nation_map["Belgian"]
nation_map["Franch"] = nation_map["French"]  # verified entry
nation_map["Franco"] = nation_map["French"]
nation_map["Frenck"] = nation_map["French"]  # verified entry
nation_map["Fujianese"] = nation_map["Chinese"]
# "Gaelic" refers to sport of Gaelic football, otherwise language
nation_map["Gaelic"] = "Europe"
nation_map["Galician"] = nation_map["Spanish"]
nation_map["Galápagos"] = nation_map["Ecuadorian"]  # entry for tortoise
nation_map["Geman"] = nation_map["German"]  # verified entry
nation_map["Germen"] = nation_map["German"]  # verified entry
nation_map["Ghanese"] = "Africa"
nation_map["Greenlandic"] = nation_map["Danish"]
nation_map["Guadeloupean"] = nation_map["Caribbean"]
nation_map["Guamanian"] = "Oceania"
nation_map["Guernsey"] = nation_map["British"]
nation_map["Guernseyan"] = nation_map["British"]
nation_map["Hawaiian"] = nation_map["US"]
nation_map["Hollywood"] = nation_map["US"]
nation_map["Hong Kong"] = nation_map["Chinese"]
nation_map["Houston"] = nation_map["US"]
nation_map["Huaorani"] = nation_map["Ecuadorian"]
nation_map["I Kiribati"] = nation_map["IKiribati"]
nation_map["Indin"] = nation_map["Indian"]  # verified entry
nation_map["Indo"] = nation_map["Indian"]
nation_map["Ingush"] = nation_map["Russian"]
nation_map["Italo"] = nation_map["Italian"]
nation_map["Ivoirian"] = "Africa"
nation_map["Javanese"] = nation_map["Indonesian"]
nation_map["Jersey"] = nation_map["British"]
nation_map["Kabardin"] = nation_map["Russian"]
nation_map["Kashmiri"] = nation_map["Indian"]
nation_map["Korean"] = "Asia"
nation_map["Kosovan"] = "Europe"
nation_map["Kosovar"] = "Europe"
nation_map["Kosovo"] = "Europe"
nation_map["Kurdish"] = "Asia"
nation_map["Lesothan"] = "Africa"
nation_map["Los Angeles"] = nation_map["US"]
nation_map["Louisiana"] = nation_map["US"]
nation_map["MGerman"] = nation_map["German"]  # verified entry
nation_map["Macanese"] = nation_map["Chinese"]
nation_map["Malayalam"] = nation_map["Indian"]
nation_map["Malayali"] = nation_map["Indian"]
nation_map["Malayan"] = nation_map["Malaysian"]
nation_map["Manx"] = nation_map["British"]
nation_map["Mexian"] = nation_map["Mexican"]
nation_map["Mississippi"] = nation_map["US"]
nation_map["Monegasque"] = nation_map["Monacan"]
nation_map["Montserrat"] = nation_map["Caribbean"]
nation_map["Montserratian"] = nation_map["Caribbean"]
nation_map["Myanmar"] = nation_map["Burmese"]
nation_map["New York"] = nation_map["US"]
nation_map["Ngarrindjeri"] = nation_map["Australian"]
nation_map["Ni Vanuatu"] = "Oceania"
nation_map["Nigirean"] = nation_map["Nigerian"]
nation_map["Niuean"] = nation_map["NZ"]
nation_map["Northern Ire"] = nation_map["Northern Irish"]
nation_map["Northern Ireland"] = nation_map["Northern Irish"]
nation_map["Norther Irish"] = nation_map["Northern Irish"]
nation_map["North Irish"] = nation_map["Northern Irish"]
nation_map["North American"] = "North America"
nation_map["North Island"] = nation_map["NZ"]
nation_map["Northern Mariana Island"] = "Oceania"
nation_map["Northern Mariana Islander"] = "Oceania"
nation_map["Nubian"] = nation_map["Sudanese"]
nation_map["Ottoman"] = nation_map["Turkish"]
nation_map["Paraguan"] = nation_map["Paraguayan"]  # verified entry
nation_map["People's Republic of China"] = nation_map["Chinese"]
nation_map["Pitcairn"] = "Oceania"
nation_map["Poliosh"] = nation_map["Polish"]  # verified entry
nation_map["Polis"] = nation_map["Polish"]  # verified entry
nation_map["Prussian"] = nation_map["German"]
nation_map["Punjabi"] = nation_map["Indian"]
nation_map["Quebec"] = nation_map["Canadian"]
nation_map["Québécois"] = nation_map["Canadian"]
nation_map["Republic of China"] = nation_map["Chinese"]
nation_map["Rhodesian"] = "Africa"
nation_map["Roman"] = nation_map["Italian"]
nation_map["Réunionese"] = nation_map["French"]
nation_map["S African"] = nation_map["South African"]
nation_map["Saban"] = nation_map["Caribbean"]
nation_map["Saharawi"] = "Africa"
nation_map["Sahrawi"] = nation_map["Saharawi"]
nation_map["Saint Helena"] = "South Atlantic"
nation_map["Saint Vincent"] = nation_map["Caribbean"]
nation_map["Saint Martin"] = nation_map["Caribbean"]
nation_map["Saint Pierre and Miquelon"] = "North America"
nation_map["Salvadorean"] = nation_map["Salvadoran"]
nation_map["Sanmarinese"] = nation_map["Sammarinese"]
nation_map["Santomean"] = nation_map["São Toméan"]
nation_map["Seychellian"] = nation_map["Seychellois"]
nation_map["Sicilian"] = nation_map["Italian"]
nation_map["Sicillian"] = nation_map["Italian"]
nation_map["Sikkimese"] = nation_map["Indian"]
nation_map["Sorbian"] = nation_map["German"]
nation_map["South Afican"] = nation_map["South African"]
nation_map["South Ossetian"] = nation_map["Georgian"]
nation_map["Soviet"] = "United Socialist Soviet Republic"
nation_map["Sri lankan"] = nation_map["Sri Lankan"]
nation_map["St Lucian"] = nation_map["Caribbean"]
nation_map["St Kitts and Nevis"] = nation_map["Kittian and Nevisian"]
nation_map["Sumatran"] = nation_map["Indonesian"]
nation_map["Swish"] = nation_map["Swiss"]  # verified entry
nation_map["Tahitian"] = "South Pacific"
nation_map["Tamil"] = "Indian subcontinent"
nation_map["Tasmanian"] = nation_map["Australian"]
nation_map["Telugu"] = nation_map["Indian"]
nation_map["Texas"] = nation_map["US"]
nation_map["Florida"] = nation_map["US"]
nation_map["Tibetan"] = nation_map["Chinese"]
nation_map["Tirkish"] = nation_map["Turkish"]
nation_map["Trinbagonian"] = nation_map["Trinidadian"]
nation_map["Trinidad"] = nation_map["Trinidadian"]
nation_map["Turks"] = nation_map["Turkish"]
nation_map["U S"] = nation_map["US"]
nation_map["UAE's"] = nation_map["Emirati"]
nation_map["United Kingdom"] = nation_map["British"]
nation_map["Upper Silesian"] = nation_map["Polish"]
nation_map["Uruguyan"] = nation_map["Uruguayan"]
nation_map["Uyghur"] = "Asia"
nation_map["Ni Vanuatu"] = nation_map["Vanuatuan"]
nation_map["Wallisian"] = "Oceania"
nation_map["Xhosa"] = nation_map["South African"]
nation_map["Yazidi"] = "Asia"
nation_map["Yellowstone"] = nation_map["US"]
nation_map["Yugoslav"] = nation_map["Serbian"]
nation_map["Yugoslavia"] = nation_map["Serbian"]
nation_map["Yugoslavian"] = nation_map["Serbian"]
nation_map["Zairean"] = nation_map["Congolese"]
nation_map["Zanzibari"] = nation_map["Tanzanian"]
nation_map["Czechoslovakian"] = nation_map[
    "Czech"
]  # Note:  this will later be converted to Europe, so either Czech or Slovak works here
nation_map["Czechoslovak"] = nation_map["Czech"]  # same as above


#### Example of Checking Rows with Unique Value

In [23]:
# Cell for checking rows of unique values -- used repeatedly while hard-coding above
check_value = "Yellowstone "

check_list = []
for index in df[df["info_2"].notna()].index:
    item = df.loc[index, "info_2"]
    if item:
        if item.startswith(check_value):
            check_list.append(index)
df.loc[check_list, :]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
58424,6,O-Six,", 6, Yellowstone National Park gray wolf, shot.",https://en.wikipedia.org/wiki/O-Six,16,2012,December,,,Yellowstone National Park gray wolf,shot,,,,,,,,,6.0,,,



#### Observations:
- Because the original `info` column contains multiple pieces of information, in various formats, we will be doing some hard-coding along the way.
- We will proceed to re-run our code searching `info_2`.
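
The row check used while hard-coding can also be written with pandas' vectorized string methods. The following is a minimal sketch on a toy dataframe (the rows are stand-ins for illustration, and `matches` is a name introduced here); it assumes missing `info_2` entries should simply count as non-matches.

```python
import pandas as pd

# Toy stand-in for the notebook's df (illustrative rows only)
df = pd.DataFrame(
    {"info_2": ["Yellowstone National Park gray wolf", "British dancer", None]}
)

check_value = "Yellowstone "

# na=False treats missing values as non-matches, so no explicit loop or None check is needed
mask = df["info_2"].str.startswith(check_value, na=False)
matches = df.loc[mask, :]
print(matches)
```

The boolean mask can be reused for counting (`mask.sum()`) before displaying the rows.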

#### Checking `info_2` for `place_1` with Updated `nation_map`

In [24]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_1"


# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    dataframe = df[(df[extract_to].isna()) & (df[column].notna())]
    for index in dataframe.index:
        item = df.loc[index, column]
        if item==nationality or item==country or item.startswith(nationality + ' ') or item.startswith(country + ' '):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip().replace(country, '').strip()
            )

CPU times: total: 11.8 s
Wall time: 11.9 s



In [25]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
35639,16,Bobby Lord,", 74, American country musician.",https://en.wikipedia.org/wiki/Bobby_Lord,3,2008,February,,,country musician,,,,,,,,,,74.0,,United States of America,
126718,23,Jim Malone,", 95, Australian footballer .",https://en.wikipedia.org/wiki/Jim_Malone_(footballer),4,2021,October,North Melbourne,,footballer,,,,,,,,,,95.0,,Australia,



#### Checking Remaining Missing Values for `place_1`

In [26]:
# Checking number of remaining missing values for place_1
print(f'There are {df["place_1"].isna().sum()} remaining missing values for place_1.\n')

There are 750 remaining missing values for place_1.




#### Observations:
- We have narrowed the missing values for `place_1` down to 750.
- Let us take another look at the remaining unique values for `info_2`.

#### Checking Remaining Unique Values for `info_2` Where `place_1` Value is Missing

In [27]:
# Checking remaining unique values for info_2 where place_1 value is missing
df[df["place_1"].isna()]["info_2"].unique()

array(['Royal Netherlands Navy vice admiral', 'President of Laos',
       'Luxembourgian road bicycle racer',
       'Governor general of the Bahamas',
       'Amateur violinist and philanthropist',
       'West German long distance runner and Olympian',
       'Composer and music editor',
       'President of the Yemen Arab Republic',
       'Luxembourgian football player', 'Native American tribal leader',
       'Prime Minister of Rwanda',
       'Male Hungarian international table tennis player',
       '37th President of the United States',
       'Queen of Jordan as the wife of King Talal',
       'Prime Minister of Zaire under Mobutu Sese Seko',
       'President of Burma and writer',
       'Founder and first leader of North Korea',
       'Irani American engineer', 'Moravian American classical pianist',
       'President of Ciskei', 'Prime Minister of Nepal',
       'President of Palau', 'Poet and an Esperantist professor',
       'East German politician', 'People Republic of C


#### Observations:
- We can see some country names embedded in the middle of the values, so we will search there next, for nationality or country.
- The first place value found will be maintained, which does bias the value toward nationalities/countries higher up on the list, as we are not proceeding from left to right for this iteration.
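
To illustrate the ordering bias described in these observations, here is a small sketch comparing "first match in dictionary order" with "leftmost match in the string". The two-entry `toy_map` and both helper functions are hypothetical and only for illustration; the position-based helper uses plain substring positions, which is a simplification of the notebook's matching condition.

```python
# Toy mapping (hypothetical entries; the real nation_map is much larger)
toy_map = {"American": "United States of America", "Irani": "Iran"}


def first_by_map_order(item, mapping):
    """Return the country of the first mapping entry that matches, in dictionary order."""
    for nationality, country in mapping.items():
        if (
            nationality + " " in item
            or country + " " in item
            or item.endswith(nationality)
            or item.endswith(country)
        ):
            return country
    return None


def first_by_position(item, mapping):
    """Return the country whose term appears leftmost in the string."""
    best = None  # (position, country)
    for nationality, country in mapping.items():
        for term in (nationality, country):
            pos = item.find(term)
            if pos != -1 and (best is None or pos < best[0]):
                best = (pos, country)
    return best[1] if best else None


item = "Irani American engineer"
print(first_by_map_order(item, toy_map))
print(first_by_position(item, toy_map))
```

With this toy map, the dictionary-order search returns the country for "American" because that key is listed first, while the position-based search returns the country for "Irani" because it appears first in the text.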

#### Checking `info_2` for `place_1` via Nationality or Country within Value

In [28]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_1"

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    dataframe = df[(df[extract_to].isna()) & (df[column].notna())]
    for index in dataframe.index:
        item = df.loc[index, column]
        if nationality + ' ' in item or country + ' ' in item or item.endswith(nationality) or item.endswith(country):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip().replace(country, '').strip()
            )

CPU times: total: 6.45 s
Wall time: 6.46 s



In [29]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
34634,18,Jim Ford,", 66, American singer songwriter.",https://en.wikipedia.org/wiki/Jim_Ford,2,2007,November,,,singer songwriter,,,,,,,,,,66.0,,United States of America,
71822,10,Whitey Adolfson,", 82, American football and wrestling coach.",https://en.wikipedia.org/wiki/Whitey_Adolfson,2,2014,November,,,football and wrestling coach,,,,,,,,,,82.0,,United States of America,



#### Checking Remaining Missing Values for `place_1`

In [30]:
# Checking number of remaining missing values for place_1
print(f'There are {df["place_1"].isna().sum()} remaining missing values for place_1.\n')

There are 390 remaining missing values for place_1.




#### Observations:
- We have ~400 remaining missing values for `place_1`.
- After checking these rows, we can update the dictionary again.

In [31]:
# Checking remaining unique values for info_2 where place_1 value is missing
df[df["place_1"].isna()]["info_2"].unique()

array(['Luxembourgian road bicycle racer',
       'Governor general of the Bahamas',
       'Amateur violinist and philanthropist',
       'Composer and music editor', 'Luxembourgian football player',
       'Prime Minister of Zaire under Mobutu Sese Seko',
       'President of Ciskei', 'Poet and an Esperantist professor',
       'Jewish rabbi', 'Iranist',
       'Somalian military leader and statesman',
       "native leader and historian of the Metepenagiag Mi'kmaq Nation",
       'Luxembourgian cyclist',
       'sidecarcross rider and the first ever Sidecarcross World Championship',
       'Major League Baseball player', 'professional wrestler',
       'Field hockey player', 'fifth dean of the Harvard Business School',
       'Racecar driver', 'longtime chairman of Liverpool F C',
       'Iraniab historian', 'Orthodox Jewish rabbi',
       'Royal Air Force Air marshal and flying ace during World War II',
       'Archdeacon of Halifax',
       'inspector general and executive directo


#### Hard-coding Additional Variations on Nationality

In [32]:
# Hard-coding remaining unique nationality/location descriptors
nation_map["Bahamas"] = nation_map["Bahamian"]
nation_map["Zaire"] = nation_map["Zairean"]
nation_map["Ciskei"] = "Africa"
nation_map["Metepenagiag Mi'kmaq"] = nation_map["Canadian"]
nation_map["Major League"] = nation_map["US"]
nation_map["Liverpool"] = nation_map["British"]
nation_map["Halifax"] = nation_map["Canadian"]
nation_map["Major Leagues"] = nation_map["US"]
nation_map["Levi Strauss & Co"] = nation_map["US"]
nation_map["Miami"] = nation_map["US"]
nation_map["Heaven Gate"] = nation_map["US"]
nation_map["Royal Navy"] = nation_map["British"]
nation_map["Luftwaffe"] = nation_map["German"]
nation_map["Ashanti"] = "Africa"
nation_map["White House"] = nation_map["US"]
nation_map["Boeing"] = nation_map["US"]
nation_map["Red Army"] = nation_map["Soviet"]
nation_map["Yokohama"] = nation_map["Japanese"]
nation_map["Tasmania"] = nation_map["Tasmanian"]
nation_map["Harvard"] = nation_map["US"]
nation_map["Malaya"] = nation_map["Malayan"]
nation_map["Pennsylvania"] = nation_map["US"]
nation_map["House of Saud"] = nation_map["Saudi"]
nation_map["Faroe Islands"] = nation_map["Faroese"]
nation_map["Kentucky"] = nation_map["US"]
nation_map["Norfolk"] = nation_map["British"]  # verified entry
nation_map["Rhodesia"] = nation_map["Rhodesian"]
nation_map["Royal Marines"] = nation_map["British"]
nation_map["Rhode Island"] = nation_map["US"]
nation_map["Jerusalem"] = nation_map["Israeli"]
nation_map["Kriegsmarine"] = nation_map["German"]
nation_map["Nazi"] = nation_map["German"]
nation_map["South Carolina"] = nation_map["US"]
nation_map["Rashtriya Swayamsevak Sangh"] = nation_map["Indian"]
nation_map["Ontario"] = nation_map["Canadian"]
nation_map["London"] = nation_map["British"]
nation_map["McDonalds"] = nation_map["US"]
nation_map["Kleiner"] = nation_map["US"]
nation_map["Worldcom"] = nation_map["US"]
nation_map["Reino Aventura"] = nation_map["Mexican"]
nation_map["Palm Beach"] = nation_map["US"]
nation_map["Lawrence Radiation Laboratory"] = nation_map["US"]
nation_map["Royal Air Force"] = nation_map["British"]
nation_map["Britain"] = nation_map["British"]
nation_map["american"] = nation_map["US"]
nation_map["St  Lucian"] = nation_map["St Lucian"]
nation_map["Cook Islands"] = "Oceania"
nation_map["E O  Green Junior High School"] = nation_map["US"]
nation_map["Cook Islander"] = "Oceania"
nation_map["western lowland"] = "Africa"
nation_map["Saharan"] = "Africa"
nation_map["St  Kitts and Nevis"] = nation_map["Kittian"]
nation_map["Jammu and Kashmir"] = "Indian subcontinent"
nation_map["Cook Island"] = "Oceania"
nation_map["New Caledonian"] = "Oceania"
nation_map["Luxembourgian"] = "Luxembourg"
nation_map["Iraniab"] = nation_map["Iranian"]
nation_map["Australiancabaret"] = nation_map["Australian"]
nation_map["Chileán"] = nation_map["Chilean"]
nation_map["Somalian"] = nation_map["Somali"]
nation_map["Americanthoroughbred"] = nation_map["US"]
nation_map["USAF"] = nation_map["US"]
nation_map["Egyption"] = nation_map["Egyptian"]
nation_map["USA"] = nation_map["US"]


#### Re-checking `info_2` for `place_1` via Nationality or Country within Value after Updating `nation_map`

In [33]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_1"

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    dataframe = df[(df[extract_to].isna()) & (df[column].notna())]
    for index in dataframe.index:
        item = df.loc[index, column]
        if nationality + ' ' in item or country + ' ' in item or item.endswith(nationality) or item.endswith(country):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip().replace(country, '').strip()
            )

CPU times: total: 6.25 s
Wall time: 6.27 s



In [34]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
121279,11,Artturi Niemelä,", 97, Finnish homesteader and politician, MP .",https://en.wikipedia.org/wiki/Artturi_Niemel%C3%A4,2,2021,April,1970 1975,,homesteader and politician,MP,,,,,,,,,97.0,,Finland,
109483,5,Hossein Sheikholeslam,", 67, Iranian politician, MP , COVID-19.",https://en.wikipedia.org/wiki/Hossein_Sheikholeslam,7,2020,March,2004 2008 and Ambassador to Syria 1998 2003,,politician,MP,COVID 19,,,,,,,,67.0,,Iran,



#### Checking Remaining Missing Values for `place_1`

In [35]:
# Checking number of remaining missing values for place_1
print(f'There are {df["place_1"].isna().sum()} remaining missing values for place_1.\n')

There are 229 remaining missing values for place_1.




#### Observations:
- We have assigned some "nationality" keys, such as "Nazi", which takes the same value as "German", that may not match the person's specific nation. Because our plan is to convert these values to geographical regions later, they still serve our purpose.
- There are ~230 remaining missing values for `place_1`, which we do not expect to find in `info_2`.
- We can now move on to searching for `place_2` in `info_2`.

#### Checking `info_2` for `place_2`

In [36]:
%%time

# Column to check
column = "info_2"

# Extract to column
extract_to = "place_2"

# For loop to extract nation data to place column
for nationality, country in nation_map.items():
    dataframe = df[
    (df["place_1"].notna()) & (df[extract_to].isna()) & (df[column].notna())]
    for index in dataframe.index:
        item = df.loc[index, column]
        if item == nationality or item == country or item.startswith(nationality + ' ') or item.startswith(country + ' '):
            df.loc[index, extract_to] = country
            # Note: the matched text is left in `column` here; remaining nation_map
            # matches are removed from the info_ columns in a later cleanup cell

CPU times: total: 8min 33s
Wall time: 8min 33s
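
Most of the run time above likely goes to the row-by-row `.loc` lookups, which are repeated for every `nation_map` entry. One possible alternative, sketched here under stated assumptions, is to build a single regular expression from all keys and values and let pandas apply it once per column. The toy `nation_map`, the toy dataframe, and the names `terms`, `lookup`, and `info_2_clean` are hypothetical. Note that this sketch prefers the longest matching term rather than the first in dictionary order, so results could differ from the loop.

```python
import re

import pandas as pd

# Toy stand-ins for the notebook's objects (hypothetical values)
nation_map = {
    "American": "United States of America",
    "US": "United States of America",
    "Polish": "Poland",
}
df = pd.DataFrame(
    {"info_2": ["American country musician", "Polish born criminal", "footballer", None]}
)

# All searchable terms (keys and values), longest first so longer names win
terms = sorted(set(nation_map) | set(nation_map.values()), key=len, reverse=True)

# Map every term to its country; countries map to themselves
lookup = {**{country: country for country in nation_map.values()}, **nation_map}

# Match a term at the start of the string, followed by a space or the end of the string
pattern = r"^(" + "|".join(re.escape(term) for term in terms) + r")(?= |$)"

matched = df["info_2"].str.extract(pattern, expand=False)
df["place_1"] = matched.map(lookup)
df["info_2_clean"] = df["info_2"].str.replace(pattern, "", regex=True).str.strip()
print(df)
```

The lookahead `(?= |$)` mirrors the loop's condition that the value either equals the term or starts with the term followed by a space.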



In [37]:
# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
90162,21,Yami Lester,", 75, Australian Aboriginal and anti-nuclear activist.",https://en.wikipedia.org/wiki/Yami_Lester,7,2017,July,,,Aboriginal and anti nuclear activist,,,,,,,,,,75.0,,Australia,Australia
28815,6,Anthony Sawoniuk,", 84, Polish-born Nazi criminal, dead in a United Kingdom prison, natural causes.",https://en.wikipedia.org/wiki/Anthony_Sawoniuk,19,2005,November,,,Nazi criminal,dead in a United Kingdom prison,natural causes,,,,,,,,84.0,,Poland,Germany



#### Checking Number of Entries with Values for `place_2`

In [38]:
# Checking number of entries for place_2
print(f'There are {df["place_2"].notna().sum()} entries with values for place_2.')

There are 9191 entries with values for place_2.



#### Observations:
- We are finished searching `info_2` for place values and can move on to searching the other `info_` columns for them.

#### Checking Other `info_` Columns for Remaining `place_1` Values

In [39]:
%%time

# Columns to search
cols_lst = [
    'info_1',
    'info_3',
    'info_4',
    'info_5',
    'info_6',
    'info_7',
    'info_8',
    'info_9',
    'info_10',
    'info_11',
    'info_parenth'
]

# Extract to column
extract_to = "place_1"

# For loop to extract nation data to place column
for column in cols_lst:
    for nationality, country in nation_map.items():
        dataframe = df[(df[extract_to].isna()) & (df[column].notna())]
        for index in dataframe.index:
            item = df.loc[index, column]
            if item == nationality or item == country or item.startswith(nationality + ' ') or item.startswith(country + ' '):
                df.loc[index, extract_to] = country
                df.loc[index, column] = (
                    df.loc[index, column].replace(nationality, "").strip().replace(country, "").strip()
                )

CPU times: total: 27.7 s
Wall time: 27.7 s



#### Checking Remaining Missing Values for `place_1`

In [40]:
# Checking number of remaining missing values for place_1
print(f'There are {df["place_1"].isna().sum()} remaining missing values for place_1.\n')

There are 206 remaining missing values for place_1.




#### Observations:
- That iteration captured ~20 values.
- We will repeat it, but checking for values that are embedded.

#### Re-checking `info_1`, `info_3` through `info_11`, and `info_parenth` for `place_1` via Nationality or Country within Value

In [41]:
%%time

# Extract to column
extract_to = "place_1"

# For loop to extract nation data to place column
for column in cols_lst:
    for nationality, country in nation_map.items():
        dataframe = df[(df[extract_to].isna()) & (df[column].notna())]
        for index in dataframe.index:
            item = df.loc[index, column]
            if nationality + ' ' in item or country + ' ' in item or item.endswith(nationality) or item.endswith(country):
                df.loc[index, extract_to] = country
                df.loc[index, column] = (
                    df.loc[index, column].replace(nationality, "").strip().replace(country, '').strip()
                )

CPU times: total: 27.3 s
Wall time: 27.4 s



#### Checking Remaining Missing Values for `place_1`

In [42]:
# Checking number of remaining missing values for place_1
print(f'There are {df["place_1"].isna().sum()} remaining missing values for place_1.\n')

# Checking a sample of rows with missing place_1
df[df["place_1"].isna()].sample(2)

There are 195 remaining missing values for place_1.



Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
9359,25,Mauro Cristofani,", 56, Linguist and researcher in Etruscan studies.",https://en.wikipedia.org/wiki/Mauro_Cristofani,0,1997,August,,,Linguist and researcher in Etruscan studies,,,,,,,,,,56.0,,,
19942,2,Paul Baloff,", 41, Exodus vocalist, heart failure.",https://en.wikipedia.org/wiki/Paul_Baloff,1,2002,February,,,Exodus vocalist,heart failure,,,,,,,,,41.0,,,



#### Observations:
- We have completed the search for `place_1` and the remaining values do appear to be lacking that data.
- The next step is to check these columns for any `place_2` values.

#### Checking for Remaining `place_2` Values in Other `info_` Columns

In [43]:
%%time

# Columns to search
cols_lst = [
    'info_1',
    'info_3',
    'info_4',
    'info_5',
    'info_6',
    'info_7',
    'info_8',
    'info_9',
    'info_10',
    'info_11',
    'info_parenth'
]

# Extract to column
extract_to = "place_2"

# For loop to extract nation data to place column
for column in cols_lst:
    for nationality, country in nation_map.items():
        dataframe = df[(df['place_1'].notna()) & (df[extract_to].isna()) & (df[column].notna())]
        for index in dataframe.index:
            item = df.loc[index, column]
            if item==nationality or item==country or item.startswith(nationality+ ' ') or item.startswith(country + ' '):
                df.loc[index, extract_to] = country
                df.loc[index, column] = (
                    df.loc[index, column].replace(nationality, "").strip().replace(country, '').strip()
                  )

CPU times: total: 8min 22s
Wall time: 8min 22s



#### Observations:
- We should likely stop our extraction of place values here, as searching for embedded `place_2` values is likely to generate erroneous values.
- For example, "English foremost expert on Poland" would result in Poland as a value for `place_2`.  We accepted this possibility for the 30 `place_1` values that were extracted from embedded nationality or country data, but we will not do that for `place_2`.
- Let us take a look at a sample of some entries with only `place_1` values, to confirm this decision.
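
The difference between the two matching conditions can be seen directly on the example string from these observations. This is a minimal sketch; the helper names `prefix_match` and `embedded_match` are introduced here for illustration and are not defined in the notebook.

```python
def prefix_match(item, term):
    """True if the value equals the term or starts with the term followed by a space."""
    return item == term or item.startswith(term + " ")


def embedded_match(item, term):
    """True if the term appears followed by a space, or at the very end of the value."""
    return term + " " in item or item.endswith(term)


item = "English foremost expert on Poland"

print(prefix_match(item, "English"))   # nationality at the start of the value
print(prefix_match(item, "Poland"))    # not at the start, so no prefix match
print(embedded_match(item, "Poland"))  # found at the end, which would yield a false place value
```

Only the embedded condition picks up "Poland" here, which is the kind of erroneous `place_2` value the prefix-only search avoids.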

#### Examining Sample of Rows with Only `place_1` Value

In [44]:
df[(df["place_1"].notna()) & df["place_2"].isna()].sample(20)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2
74509,25,Ron Suart,", 94, English football player and manager .",https://en.wikipedia.org/wiki/Ron_Suart,4,2015,March,Chelsea,,football player and manager,,,,,,,,,,94.0,,United Kingdom of Great Britain and Northern Ireland,
60158,26,Tom Griffin,", 96, American aviator, member of the Doolittle Raid.",https://en.wikipedia.org/wiki/Tom_Griffin_(aviator),3,2013,February,,,aviator,member of the Doolittle Raid,,,,,,,,,96.0,,United States of America,
21735,5,Ansley J. Coale,", 84, American demographer, senior research demographer at the Office of Population Research at Princeton.",https://en.wikipedia.org/wiki/Ansley_J._Coale,11,2002,November,,,demographer,senior research demographer at the Office of Population Research at Princeton,,,,,,,,,84.0,,United States of America,
8111,2,Horace Cutler,", 84, British politician.",https://en.wikipedia.org/wiki/Horace_Cutler,3,1997,March,,,politician,,,,,,,,,,84.0,,United Kingdom of Great Britain and Northern Ireland,
99733,5,Thor Hansen,", 71, Norwegian poker player, cancer.",https://en.wikipedia.org/wiki/Thor_Hansen,6,2018,December,,,poker player,cancer,,,,,,,,,71.0,,Norway,
83729,14,Marion Christopher Barry,", 36, American construction company owner, drug overdose.",https://en.wikipedia.org/wiki/Marion_Christopher_Barry,33,2016,August,,,construction company owner,drug overdose,,,,,,,,,36.0,,United States of America,
15428,5,Derrick Walters,", 67, British Anglican priest, Dean of Liverpool.",https://en.wikipedia.org/wiki/Derrick_Walters,9,2000,April,,,Anglican priest,Dean of Liverpool,,,,,,,,,67.0,,United Kingdom of Great Britain and Northern Ireland,
71394,19,Stuart Gallacher,", 68, Welsh rugby player and executive.",https://en.wikipedia.org/wiki/Stuart_Gallacher,6,2014,October,national union and league teams,,rugby player and executive,,,,,,,,,,68.0,,Wales,
73545,3,Norman Yemm,", 81, Australian actor .",https://en.wikipedia.org/wiki/Norman_Yemm,10,2015,February,", ,",,actor,,,,,,,,,,81.0,,Australia,
16709,19,George Cosmas Adyebo,", 53, Ugandan politician and economist, cancer.",https://en.wikipedia.org/wiki/George_Cosmas_Adyebo,3,2000,November,,,politician and economist,cancer,,,,,,,,,53.0,,Uganda,



#### Observations:
- The sample validates the decision to not look for `place_2` values that aren't at the beginning of the column's string value.

#### Final Counts of Missing Values for `place_1` and Number of Entries with Values for `place_2`

In [45]:
# Checking number of remaining missing values for place_1
print(f'There are {df["place_1"].isna().sum()} remaining missing values for place_1.\n')
print(f'There are {df["place_2"].notna().sum()} entries with values for place_2.')

There are 195 remaining missing values for place_1.

There are 11240 entries with values for place_2.



#### Observations:
- We are finished with extracting the place values, but we can search for "IRA" and "CIA" in `info` where `place_1` is missing and assign "Ireland" and "United States of America", respectively.  It is important not to remove these values from the original column as we might need them as `known_for` values.
- It's time to save our dataframe and start a new notebook before extracting `known_role` and `cause_of_death` values.
- First, let us remove any remaining digits and nation or country values from the `info_` columns.  We will make a working copy of `info_parenth` before doing so, as that information is no longer in the original `info` column.
- We will repeat clearing out extra spaces afterward.
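
A plain substring test such as `"CIA" in item` also matches when the letters appear inside a longer uppercase token. Whether such tokens occur in this dataset is not checked here. As a hedged sketch, a word-boundary regular expression restricts the match to the standalone acronym. The `acronyms` dictionary and the `place_from_acronym` helper are hypothetical names, the sample strings are made up, and unlike a pair of independent `if` checks, this helper returns the first acronym found.

```python
import re

# Acronym-to-place pairs as described in the observations (helper names are hypothetical)
acronyms = {"CIA": "United States of America", "IRA": "Ireland"}


def place_from_acronym(info):
    """Return the place for the first standalone acronym found in info, else None."""
    for acronym, place in acronyms.items():
        # \b requires a word boundary on both sides of the acronym
        if re.search(rf"\b{re.escape(acronym)}\b", info):
            return place
    return None


print(place_from_acronym(", 70, former CIA officer."))
print(place_from_acronym(", 55, member of the IRA."))
print(place_from_acronym(", 60, ASCIA chairman."))  # made-up token containing "CIA"
```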

#### Assigning `place_1` for "CIA" and "IRA"

In [46]:
# Separate treatment of CIA and IRA for place_1
for index in df[df["place_1"].isna()].index:
    item = df.loc[index, "info"]
    if "CIA" in item:
        df.loc[index, "place_1"] = nation_map["US"]
    if "IRA" in item:
        df.loc[index, "place_1"] = nation_map["Irish"]

# Repeating for place_2 in the event place_1 was already populated before the above loop
# Can remove duplicated values later
for index in df[df["place_2"].isna()].index:
    item = df.loc[index, "info"]
    if "CIA" in item:
        df.loc[index, "place_2"] = nation_map["US"]
    if "IRA" in item:
        df.loc[index, "place_2"] = nation_map["Irish"]


#### Making Copy of `info_parenth`

In [47]:
# Creating info_parenth_copy column
df["info_parenth_copy"] = df["info_parenth"]


### Removing Remaining Digits and Nationality/Country Values from `info_` Columns

#### List of Columns to Treat

In [48]:
# List of columns to treat
cols_lst = [
    "info_1",
    "info_2",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
    "info_parenth_copy",
]


#### Removing Digits

In [49]:
# Regular expression for a single digit
pattern = r"\d"

# For loop to find indices of rows that have pattern
rows_combined = []
for column in cols_lst:
    dataframe = df[df[column].notna()]
    rows_to_check = rows_with_pattern(dataframe, column, pattern)
    rows_combined += rows_to_check

# Checking a sample of rows
df.loc[rows_combined, :].sample(2)

There are 0 rows with matching pattern in column 'info_1'.
There are 442 rows with matching pattern in column 'info_2'.
There are 2252 rows with matching pattern in column 'info_3'.
There are 1060 rows with matching pattern in column 'info_4'.
There are 69 rows with matching pattern in column 'info_5'.
There are 5 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.
There are 24403 rows with matching pattern in column 'info_parenth_copy'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy
125380,1,Kurt Zwikl,", 72, American politician, member of the Pennsylvania House of Representatives .",https://en.wikipedia.org/wiki/Kurt_Zwikl,6,2021,September,1973 1984,,politician,member of the Pennsylvania House of Representatives,,,,,,,,,72.0,,United States of America,,1973 1984
100609,15,Luis Grajeda,", 81, Mexican Olympic basketball player .",https://en.wikipedia.org/wiki/Luis_Grajeda_(basketball),2,2019,January,"1964, 1968",,Olympic basketball player,,,,,,,,,,81.0,,Mexico,,"1964, 1968"



In [50]:
# For loop to extract digits
for column in cols_lst:
    for index in set(rows_combined):
        item = df.loc[index, column]
        if item:
            match = re.search(pattern, item)
            if match:
                df.loc[index, column] = re.sub(pattern, "", item)

# Rechecking number and example rows after treatment
# For loop to find indices of rows that have pattern
recheck_rows = []
for column in cols_lst:
    dataframe = df[df[column].notna()]
    rows_to_check = rows_with_pattern(dataframe, column, pattern)
    recheck_rows += rows_to_check

There are 0 rows with matching pattern in column 'info_1'.
There are 0 rows with matching pattern in column 'info_2'.
There are 0 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.
There are 0 rows with matching pattern in column 'info_parenth_copy'.



#### Removing Any Remaining Matches with `nation_map` Keys and Values

In [51]:
%%time

# For loop to extract remaining information matching items in nation_map
for column in cols_lst:
    dataframe = df[df[column].notna()]
    for nationality, country in nation_map.items():
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if nationality + ' ' in item or country + ' ' in item or item.endswith(nationality) or item.endswith(country):
                    df.loc[index, column] = item.replace(nationality, "").strip().replace(country,'').strip()

CPU times: total: 16min 8s
Wall time: 16min 8s



#### Removing Extra Spaces in `info_` Columns

In [52]:
%%time

# List of columns to treat
cols_lst = [
    "info_1",
    "info_2",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
    'info_parenth_copy'
]

# Dictionary of keys to find and values to replace keys
replace_dict = {"    ": ' ', "   ": ' ', '  ': ' '}

# For loop to find and replace characters in replace_dict in columns in cols_lst
# and strip any leading or trailing whitespace
for column in cols_lst:
    for key, value in replace_dict.items():
        for index in df[df[column].notna()].index:
            item = df.loc[index, column]
            if item:
                df.loc[index, column] = item.replace(key, value).strip()

CPU times: total: 1min
Wall time: 1min
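
The multi-pass replacement above can also be expressed as one regular-expression substitution per column. This is a minimal sketch on a toy dataframe (the column contents are made up); it assumes only runs of ordinary spaces need collapsing. Missing values are left as missing by the `.str` methods.

```python
import pandas as pd

# Toy stand-in with extra spaces and a missing value (illustrative only)
df = pd.DataFrame({"info_3": ["MP   and    minister ", None, "  actor"]})
cols_lst = ["info_3"]

for column in cols_lst:
    # Collapse any run of two or more spaces into one, then trim both ends
    df[column] = df[column].str.replace(r" {2,}", " ", regex=True).str.strip()

print(df)
```

Because the pattern matches runs of any length, no ordering of replacements from longest to shortest is needed.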



#### Observations:
- It's time to save our dataframe and start a new notebook before extracting `known_role` and `cause_of_death` values.
- Exporting `nation_map` is also a good idea, at this time.

## Exporting Dataset to SQLite Database [wp_life_expect_clean3.db](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean3.db)

In [53]:
# Saving dataset in a SQLite database
conn = sql.connect("wp_life_expect_clean3.db")
df.to_sql("wp_life_expect_clean3", conn, index=False)

132652
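
By default, `to_sql` raises a `ValueError` if the target table already exists, which matters when a notebook is re-run. The sketch below uses an in-memory database and a made-up table name (`demo`) to show the `if_exists="replace"` option and a read-back check; whether replacing is appropriate for the real database file is a judgment call left to the reader.

```python
import sqlite3 as sql

import pandas as pd

# Toy dataframe standing in for the cleaned dataset (illustrative values)
df = pd.DataFrame({"name": ["A", "B"], "age": [86.0, 68.0]})

# In-memory database so the sketch leaves no file behind
conn = sql.connect(":memory:")

# Writing twice succeeds because the existing table is replaced
df.to_sql("demo", conn, index=False, if_exists="replace")
df.to_sql("demo", conn, index=False, if_exists="replace")

# Read back to confirm the row count matches
restored = pd.read_sql("SELECT * FROM demo", conn)
conn.close()
print(len(restored))
```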


## Saving nation_map to a Pickle File [nation_map.pkl](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/nation_map.pkl)

In [54]:
# Create a binary pickle file and write the dictionary to it
# (the with block closes the file automatically)
with open("nation_map.pkl", "wb") as f:
    pickle.dump(nation_map, f)

# Chime notification when cell executes
chime.success()
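
For reference, a pickled dictionary can be loaded back with `pickle.load`. The sketch below round-trips a toy dictionary through a temporary directory so it does not touch the real `nation_map.pkl`; the toy contents and the name `restored` are illustrative assumptions.

```python
import os
import pickle
import tempfile

# Toy dictionary standing in for nation_map (illustrative entry only)
toy_map = {"Polish": "Poland"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "nation_map.pkl")

    # Write the dictionary to a binary pickle file
    with open(path, "wb") as f:
        pickle.dump(toy_map, f)

    # Read it back in a later session or notebook
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_map)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.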


# [Proceed to Data Cleaning Part 4](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean4_thanak_2022_06_23.ipynb)