# Wikipedia Notable Life Expectancies

# [Notebook 3 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re
from fuzzywuzzy import fuzz
from fuzzywuzzy import process


# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview


### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean1.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean1", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 132445 rows and 20 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
132443,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0
132444,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
29536,3,Ustad Qawwal Bahauddin,", 72, Indian-Pakistani Qawwali singer.",https://en.wikipedia.org/wiki/Qawwal_Bahauddin_Khan,7,2006,February,,,Indian-Pakistani Qawwali singer,,,,,,,,,,72.0
3314,4,Arne Arnardo,", 82, Norwegian circus performer and owner.",https://en.wikipedia.org/wiki/Arne_Arnardo,6,1995,May,,,Norwegian circus performer and owner,,,,,,,,,,82.0
86126,5,David Alexander,", 90, British Royal Marines general.",https://en.wikipedia.org/wiki/David_Alexander_(Royal_Marines_officer),3,2017,January,,,British Royal Marines general,,,,,,,,,,90.0
5830,1,Masroor Anwar,", 51, Indian poet, lyricist and screenwriter.",https://en.wikipedia.org/wiki/Masroor_Anwar,6,1996,April,,,Indian poet,lyricist and screenwriter,,,,,,,,,51.0
45415,16,Jim Nestor,", 90, Australian Olympic cyclist.",https://en.wikipedia.org/wiki/Jim_Nestor,2,2010,June,,,Australian Olympic cyclist,,,,,,,,,,90.0



#### Observations:
- There are currently 132,445 rows and 20 columns.

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132445 entries, 0 to 132444
Data columns (total 20 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132445 non-null  object 
 1   name            132445 non-null  object 
 2   info            132445 non-null  object 
 3   link            132445 non-null  object 
 4   num_references  132445 non-null  object 
 5   year            132445 non-null  int64  
 6   month           132445 non-null  object 
 7   info_parenth    49788 non-null   object 
 8   info_1          132445 non-null  object 
 9   info_2          132421 non-null  object 
 10  info_3          62570 non-null   object 
 11  info_4          12587 non-null   object 
 12  info_5          1505 non-null    object 
 13  info_6          217 non-null     object 
 14  info_7          33 non-null      object 
 15  info_8          7 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- Picking up where we left off, we will aim to track down the remaining missing values for `age`, starting with searching for digits in `info_2`.

## Extracting Age Continued

### Remaining Missing Values for Age

In [6]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')

There are 79 missing values for age.



#### Function to Save Indices of Rows Matching Regular Expressions Pattern to a List and Print Number of Rows with Match

In [7]:
# Define a function that takes dataframe, column name, and re pattern as arguments and returns list of indices
# for which column value matches re pattern
def rows_with_pattern(dataframe, column, pattern):
    """
    Takes input of dataframe, column name, and re pattern 
    and returns list of indices for rows that contain match
    for pattern anywhere within value for given column.
    
    dataframe: dataframe
    column: column name
    pattern: re pattern
    """
    index_list = []

    for i in dataframe.index:
        item = dataframe.loc[i, column]
        match = re.search(pattern, item)
        if match:
            index_list.append(i)
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list
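
The row-by-row loop above can also be expressed with pandas string methods. The following is a minimal sketch of that alternative, not the notebook's own implementation; the function name `rows_with_pattern_vectorized` and the toy dataframe are hypothetical and exist only for illustration.

```python
import pandas as pd


def rows_with_pattern_vectorized(dataframe, column, pattern):
    """Return index labels of rows whose `column` value contains a regex match."""
    # na=False counts missing values as non-matches instead of propagating them
    mask = dataframe[column].str.contains(pattern, regex=True, na=False)
    return dataframe.loc[mask].index.to_list()


# Toy data (hypothetical) with a missing value and non-default index labels
toy = pd.DataFrame(
    {"info_2": ["77", "British actor", None, "74–75"]}, index=[10, 11, 12, 13]
)
digit_rows = rows_with_pattern_vectorized(toy, "info_2", r"\d")
print(digit_rows)
```

One difference worth noting: because `na=False` skips missing values, this version does not require filtering with `notna()` before the call.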


#### Function to Use rows_with_pattern Function for Multiple Regular Expression Patterns

In [8]:
# Define a function that calls rows_with_pattern function for multiple re patterns
# returning a single list of indices for all rows with any pattern match


def multiple_patterns(dataframe, column, patterns):
    """
    Takes input dataframe, column, and list of re patterns and returns single list 
    of indices for rows in which a match for any pattern is found with re.search
    
    dataframe: dataframe
    column: column name
    patterns: list of re patterns
    """
    rows_combined = []

    # For loop to check each pattern
    for pattern in patterns:

        # List and number of rows matching each pattern
        print(pattern)
        rows_to_check = rows_with_pattern(dataframe, column, pattern)
        print("")

        # Add list for each pattern to combined list
        rows_combined += rows_to_check

    return rows_combined
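
Because `rows_combined` is built by concatenation, a row that matches more than one pattern appears in it more than once. If unique indices are wanted, one possible approach is sketched below; the helper `combine_unique` and the toy values are hypothetical and are not part of the notebook.

```python
import re


def combine_unique(index_lists):
    """Flatten several index lists into one, keeping first-seen order without repeats."""
    combined = [i for index_list in index_lists for i in index_list]
    # dict.fromkeys preserves insertion order while discarding duplicates
    return list(dict.fromkeys(combined))


# Toy values keyed by row index (hypothetical)
values = {0: "74-75", 1: "80", 2: "actor"}
patterns = [r"\d", r"-"]

# One list of matching indices per pattern, as multiple_patterns would collect them
per_pattern = [
    [i for i, value in values.items() if re.search(p, value)] for p in patterns
]
unique_rows = combine_unique(per_pattern)
print(unique_rows)
```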


### `info_2`

#### Rows Missing `age` with Digits in `info_2`

In [9]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[(df["age"].isna()) & (df[column].notna())]

# Pattern for re
pattern = r"\d"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Examining the rows directly
df.loc[rows_to_check, :]

There are 43 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
6846,17,Spiro Agnew,", American politician, 77, 39th Vice President of the United States, leukemia.",https://en.wikipedia.org/wiki/Spiro_Agnew,207,1996,September,,American politician,77,39th Vice President of the United States,leukemia,,,,,,,,
12861,14,Muslimgauze,", , 37, British electronic musician, fungal infection.",https://en.wikipedia.org/wiki/Muslimgauze,26,1999,January,(Bryn Jones),,37,British electronic musician,fungal infection,,,,,,,,
14993,15,Željko Ražnatović,", , 47, Serbian mobster and paramilitary leader.",https://en.wikipedia.org/wiki/Arkan,55,2000,January,(aka Arkan),,47,Serbian mobster and paramilitary leader,,,,,,,,,
15062,28,Sarah Caudwell,", , 60, British detective story writer and barrister, cancer.",https://en.wikipedia.org/wiki/Sarah_Caudwell,9,2000,January,(aka Sarah Cockburn),,60,British detective story writer and barrister,cancer,,,,,,,,
16111,30,Max Showalter,", , 83, American actor, composer, pianist, singer, cancer.",https://en.wikipedia.org/wiki/Max_Showalter,14,2000,July,(aka Casey Adams),,83,American actor,composer,pianist,singer,cancer,,,,,
17792,15,Joey Ramone,", , 49, American musician, lead singer for The Ramones, lymphoma.",https://en.wikipedia.org/wiki/Joey_Ramone,25,2001,April,(b. Jeffrey Hyman),,49,American musician,lead singer for The Ramones,lymphoma,,,,,,,
18170,4,Dipendra,", King of Nepal, 29, suicide.",https://en.wikipedia.org/wiki/Dipendra_of_Nepal,8,2001,June,,King of Nepal,29,suicide,,,,,,,,,
19017,28,Mohammad Khalequzzaman,", member of the then National Assembly of Pakistan and Union Minister of Labor, died in 28 September .",https://en.wikipedia.org/wiki/Mohammad_Khalequzzaman,3,2001,September,,member of the then National Assembly of Pakistan and Union Minister of Labor,died in 28 September,,,,,,,,,,
19117,12,Lord Hailsham of St Marylebone,", , 94, British lawyer and politician.","https://en.wikipedia.org/wiki/Quintin_Hogg,_Baron_Hailsham_of_St_Marylebone",17,2001,October,(Quintin Hogg),,94,British lawyer and politician,,,,,,,,,
23752,6,Jules Engel,", Jules Engel, 94, American filmmaker, visual artist, and film director.",https://en.wikipedia.org/wiki/Jules_Engel,10,2003,September,,Jules Engel,94,American filmmaker,visual artist,and film director,,,,,,,



#### Observations:
- Several of these rows hold the age as a single integer, and two hold it as a range of two integers (for example, `74–75`).
- The remaining entries contain digits that are not ages (for example, "COVID-19" or a date), so the order of processing matters here.
- We can safely drop any of these rows whose `info_2` value contains a letter of the alphabet, taking care to select only from rows that are missing `age` and have a digit in `info_2`.

In [10]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df.loc[rows_to_check, :]

# Pattern for re
pattern = r"[a-zA-Z]"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking a sample of the rows
df.loc[rows_to_check, :].sample(2)

There are 17 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
123181,21,Haribhushan,", Indian politician and guerrilla, COVID-19.",https://en.wikipedia.org/wiki/Haribhushan,6,2021,June,,Indian politician and guerrilla,COVID-19,,,,,,,,,,
124505,8,Fernando López de Olmedo,", Spanish general commander of Ceuta , COVID-19.",https://en.wikipedia.org/wiki/Fernando_L%C3%B3pez_de_Olmedo,4,2021,August,(Perejil Island crisis),Spanish general commander of Ceuta,COVID-19,,,,,,,,,,



#### Observations:
- We can drop these rows, as they are missing the data for age.
- Extraction of age for two integer ranges, then single integer values, can follow.
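
The reason ranges must be handled before single integers can be sketched with a small standalone function. This is an illustration under assumptions, not the notebook's code: `parse_age` is a hypothetical helper, and the range pattern uses a non-capturing group for the separator, so the second number is group 2 rather than group 3.

```python
import re

# Two integers joined by a hyphen, en dash, slash, or " or "
RANGE = re.compile(r"(\d{1,3})(?:-|–|/| or )(\d{1,3})")
# A standalone integer of one to three digits
SINGLE = re.compile(r"\b(\d{1,3})\b")


def parse_age(text):
    """Return the midpoint of a range, else a single integer as float, else None."""
    match = RANGE.search(text)
    if match:
        return (int(match.group(1)) + int(match.group(2))) / 2
    match = SINGLE.search(text)
    if match:
        return float(match.group(1))
    return None


print(parse_age("74–75"))
print(parse_age("77"))
print(parse_age("British actor"))

# Applying the single-integer pattern first would capture only the lower bound
lower_bound_only = SINGLE.search("74–75").group(1)
```

Checking the range pattern first yields the midpoint for `74–75`, whereas the single-integer pattern alone would stop at `74`.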

#### Dropping Additional Rows with Age Data Absent

In [11]:
# Dropping rows, resetting index, and checking new shape of df
rows_to_drop = rows_to_check.copy()
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132428, 20)


#### Remaining Rows with `age` values in `info_2`

In [12]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[(df["age"].isna()) & (df[column].notna())]

# Pattern for re (any digit)
pattern = r"\d"

# Finding indices of rows that have pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking unique values
df.loc[rows_to_check, :]["info_2"].unique()

There are 26 rows with matching pattern in column 'info_2'.


array(['77', '37', '47', '60', '83', '49', '29', '94', '55', '62', '69',
       '80', '32', '70', '81', '24', '84', '95', '76', '86', '61',
       '74–75', '79–80'], dtype=object)


#### Extracting `age` for Ranges with Two Values

In [13]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[(df["age"].isna()) & (df[column].notna())]

# Pattern for re
pattern = r"(\d{1,3})(-|–|/| or )(\d{1,3})"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

There are 2 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
61915,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,74–75,,,,,,,,,,
86907,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(–),Somali politician,79–80,President of Puntland,,,,,,,,,



In [14]:
# For each matching row, store the midpoint of the range in the age column and remove the range from info_2
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        age = (int(match.group(1)) + int(match.group(3))) / 2
        df.loc[index, "age"] = age
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column]).strip()

# Rechecking number of rows after treatment
recheck_rows = rows_with_pattern(df.loc[rows_to_check, :], column, pattern)

# Recheck a sample of treated rows
df.loc[rows_to_check, :].sample(2)

There are 0 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
86907,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(–),Somali politician,,President of Puntland,,,,,,,,,79.5
61915,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,,,,,,,,,,,74.5



#### Extracting `age` as Single Integer

In [15]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[(df["age"].isna()) & (df[column].notna())]

# Pattern for age given as a single integer
pattern = r"\b(\d{1,3})\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

There are 24 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
24881,8,Muhammad Zaidan,", , 55, Palestinian nationalist, founder of Palestine Liberation Front.",https://en.wikipedia.org/wiki/Muhammad_Zaidan,4,2004,March,(aka Abu Abbas),,55,Palestinian nationalist,founder of Palestine Liberation Front,,,,,,,,
16111,30,Max Showalter,", , 83, American actor, composer, pianist, singer, cancer.",https://en.wikipedia.org/wiki/Max_Showalter,14,2000,July,(aka Casey Adams),,83,American actor,composer,pianist,singer,cancer,,,,,



In [16]:
# For loop to extract age pattern to age column
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[index, "age"] = age
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column]).strip()

# Re-checking number of rows matching pattern
recheck_rows = rows_with_pattern(df.loc[rows_to_check, :], column, pattern)

# Recheck a sample of treated rows
df.loc[rows_to_check, :].sample(2)

There are 0 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
25200,6,Kjell Hallbing,", , 69, Norwegian Western author.",https://en.wikipedia.org/wiki/Kjell_Hallbing,2,2004,May,(aka Louis Masterson),,,Norwegian Western author,,,,,,,,,69.0
19116,12,Lord Hailsham of St Marylebone,", , 94, British lawyer and politician.","https://en.wikipedia.org/wiki/Quintin_Hogg,_Baron_Hailsham_of_St_Marylebone",17,2001,October,(Quintin Hogg),,,British lawyer and politician,,,,,,,,,94.0



#### Number of Remaining Missing Values

In [17]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')
df[df["age"].isna()].sample(2)

There are 36 missing values for age.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
111852,24,Hussain Ahmad Kanjo,", Pakistani politician, Minister of Science and Technology , COVID-19.",https://en.wikipedia.org/wiki/Hussain_Ahmad_Kanjo,2,2020,May,(–),Pakistani politician,Minister of Science and Technology,COVID-19,,,,,,,,,
121814,3,Hamid Rashid Ma`ala,", Iraqi politician, MP , COVID-19.",https://en.wikipedia.org/wiki/Hamid_Rashid_Ma%60ala,2,2021,May,(–),Iraqi politician,MP,COVID-19,,,,,,,,,



#### Observations:
- At this point, we could look through the list to manually extract any remaining `age` info, which would likely be the fastest approach.
- We will instead take a programmatic approach, for the sake of the exercise.
- `info_parenth`, and `info_3` and beyond are the remaining columns to search.
- We see that COVID-19 appears often and the number 19 could be mistakenly extracted as an age.  
- Let us start by moving it to a new column `cause_of_death`.
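
The risk described above can be shown on a single string. This is a small standalone sketch that reuses the single-integer age pattern from earlier; the variable names are hypothetical.

```python
import re

AGE = r"\b(\d{1,3})\b"
text = "COVID-19"

# The hyphen is a word boundary, so "19" qualifies as a standalone integer
naive_match = re.search(AGE, text)
print(naive_match.group(1))

# Removing the cause of death first leaves nothing for the age pattern to find
cleaned = re.sub(r"COVID-19", "", text).strip()
match_after_cleaning = re.search(AGE, cleaned)
print(match_after_cleaning)
```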

#### Extracting "COVID-19" from Remaining `info` Sub-columns for Rows with Missing `age` Value

In [18]:
# List of columns to check
cols_to_check = [
    "info_parenth",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Dataframe to check
dataframe = df[df["age"].isna()]

# Pattern for re
pattern = r"(COVID-19)"

# For loop to collect indices of all rows with pattern
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        dataframe[dataframe[column].notna()], column, pattern
    )
    comb_rows_to_check += rows_to_check

# Checking sample of rows
df.loc[comb_rows_to_check, :].sample(2)

There are 1 rows with matching pattern in column 'info_parenth'.
There are 30 rows with matching pattern in column 'info_3'.
There are 2 rows with matching pattern in column 'info_4'.
There are 1 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
111852,24,Hussain Ahmad Kanjo,", Pakistani politician, Minister of Science and Technology , COVID-19.",https://en.wikipedia.org/wiki/Hussain_Ahmad_Kanjo,2,2020,May,(–),Pakistani politician,Minister of Science and Technology,COVID-19,,,,,,,,,
117176,18,Robina Sentongo,", Ugandan politician, MP , COVID-19.",https://en.wikipedia.org/wiki/Robina_Sentongo,2,2020,December,(since ),Ugandan politician,MP,COVID-19,,,,,,,,,



In [19]:
# For loop to extract COVID-19 from remaining info columns for entries with missing age
for column in cols_to_check:
    for index in comb_rows_to_check:
        if pd.notna(df.loc[index, column]):
            item = df.loc[index, column]
            match = re.search(pattern, item)
            if match:
                cause = match.group(1)
                df.loc[index, "cause_of_death"] = cause
                df.loc[index, column] = re.sub(
                    pattern, "", df.loc[index, column]
                ).strip()

# Re-checking number of rows matching pattern in each column
# Slicing the treated rows first keeps the boolean mask aligned with the subset's index
treated = df.loc[comb_rows_to_check, :]
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        treated[treated[column].notna()], column, pattern
    )

# Recheck a sample of treated rows
df.loc[comb_rows_to_check, :].sample(2)

There are 0 rows with matching pattern in column 'info_parenth'.
There are 0 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.



Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
121426,21,Erasmo Vásquez,", Dominican physician and politician, minister of public health , COVID-19.",https://en.wikipedia.org/wiki/Erasmo_V%C3%A1squez,3,2021,April,(–),Dominican physician and politician,minister of public health,,,,,,,,,,,COVID-19
111198,30,Suleiman Adamu,", Nigerian politician, member of the Nasarawa State House of Assembly, COVID-19.",https://en.wikipedia.org/wiki/Suleiman_Adamu,5,2020,April,,Nigerian politician,member of the Nasarawa State House of Assembly,,,,,,,,,,,COVID-19



#### Observations:
- With "COVID-19" put aside, we can check for any remaining digits in the same columns for these entries.
- The new column `cause_of_death` has been added.

#### Checking Remaining `info` Columns for Remaining Digits

In [20]:
# List of columns to check
cols_to_check = [
    "info_parenth",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Dataframe to check
dataframe = df[df["age"].isna()]

# Pattern for re
pattern = r"\d"

# For loop to collect indices of all rows with pattern
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        dataframe[dataframe[column].notna()], column, pattern
    )
    comb_rows_to_check += rows_to_check

# Examining the rows directly
df.loc[comb_rows_to_check, :]

There are 2 rows with matching pattern in column 'info_parenth'.
There are 0 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
22855,10,Little Eva,", .",https://en.wikipedia.org/wiki/Little_Eva,14,2003,April,"(née Eva Narcissus Boyd), 59, American pop singer ()",,,,,,,,,,,,,
126710,31,Simon Young,", Irish radio presenter .",https://en.wikipedia.org/wiki/Simon_Young_(presenter),9,2021,October,(RTÉ 2fm),Irish radio presenter,,,,,,,,,,,,



#### Observations:
- There are only two entries that still potentially contain digits for the age data.
- Here, we see that there is one entry we can preserve that has an age value in `info_parenth`.
- The other entry has a radio station identification value and is missing age data.
- After we collect this last age, we will drop the remaining entries missing `age`.

#### Extracting `age` from `info_parenth`

In [21]:
# Column to check
column = "info_parenth"

# Dataframe to check
dataframe = df[(df["age"].isna()) & (df[column].notna())]

# Pattern for re
pattern = r"\b(\d{1,3})\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking row
df.loc[rows_to_check, :]

There are 1 rows with matching pattern in column 'info_parenth'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
22855,10,Little Eva,", .",https://en.wikipedia.org/wiki/Little_Eva,14,2003,April,"(née Eva Narcissus Boyd), 59, American pop singer ()",,,,,,,,,,,,,



In [22]:
# For loop to extract age pattern to age column
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[index, "age"] = age
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column]).strip()

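# Re-checking number of rows matching pattern
# Slicing the treated rows first keeps the boolean mask aligned with the subset's index
treated = df.loc[rows_to_check, :]
recheck_rows = rows_with_pattern(treated[treated[column].notna()], column, pattern)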

# Rechecking treated row
df.loc[rows_to_check, :]

There are 0 rows with matching pattern in column 'info_parenth'.



Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
22855,10,Little Eva,", .",https://en.wikipedia.org/wiki/Little_Eva,14,2003,April,"(née Eva Narcissus Boyd), , American pop singer ()",,,,,,,,,,,,59.0,



#### Observations:
- All of the age data has been captured, and it is time to drop the remaining entries with missing values for `age`.

#### Dropping the Last Entries with Missing `age` Values

In [23]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')

There are 35 missing values for age.



In [24]:
# Dropping rows, resetting index, and checking new shape of df
df.dropna(subset=["age"], inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132393, 21)


In [25]:
# Checking current info status
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132393 entries, 0 to 132392
Data columns (total 21 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132393 non-null  object 
 1   name            132393 non-null  object 
 2   info            132393 non-null  object 
 3   link            132393 non-null  object 
 4   num_references  132393 non-null  object 
 5   year            132393 non-null  int64  
 6   month           132393 non-null  object 
 7   info_parenth    49752 non-null   object 
 8   info_1          132393 non-null  object 
 9   info_2          132371 non-null  object 
 10  info_3          62537 non-null   object 
 11  info_4          12584 non-null   object 
 12  info_5          1504 non-null    object 
 13  info_6          217 non-null     object 
 14  info_7          33 non-null      object 
 15  info_8          7 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   


#### Observations:
- We have 132,393 entries containing the target variable `age`.
- Some of these rows may represent groups or members of non-human species, as we have observed previously.
- We have been replacing extracted values with empty strings. Before moving forward, let us replace these empty strings with NaN, as it will simplify slicing the dataframe.
- Then it will be time to search for nationality.
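
The replacement can be previewed on a toy frame before applying it to the full dataset. This is a minimal sketch; the toy dataframe is hypothetical, and the pattern `^\s*$` matches strings that are empty or contain only whitespace.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one empty string, one whitespace-only string, one real value
toy = pd.DataFrame(
    {"info_2": ["", "  ", "British actor"], "age": [74.5, 94.0, 86.0]}
)

# With regex=True, only string values are tested against the pattern
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
n_missing = int(cleaned["info_2"].isna().sum())
print(n_missing)
```

After the replacement, `isna()` and `notna()` can be used directly to slice rows, without a separate check for empty strings.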

#### Replacing Empty Strings with NaN

In [26]:
# Replacing empty strings with NaN
df = df.replace(r"^\s*$", np.nan, regex=True)


In [27]:
# Checking the NaN values per column
df.isna().sum()

day                    0
name                   0
info                   0
link                   0
num_references         0
year                   0
month                  0
info_parenth       82641
info_1            132363
info_2                48
info_3             70065
info_4            119852
info_5            130895
info_6            132179
info_7            132362
info_8            132387
info_9            132392
info_10           132392
info_11           132392
age                    0
cause_of_death    132393
dtype: int64


## Extracting Nationality Data

In [28]:
# Checking a sample
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
112334,11,Hermann Salomon,", 82, German Olympic athlete .",https://en.wikipedia.org/wiki/Hermann_Salomon,7,2020,June,"(, , )",,German Olympic athlete,,,,,,,,,,82.0,
66295,1,Luis Salvadores Salvi,", 81, Chilean basketball player.",https://en.wikipedia.org/wiki/Luis_Salvadores_Salvi,5,2014,February,,,Chilean basketball player,,,,,,,,,,81.0,
128246,30,Ron Jones,", 87, British Olympic sprinter .",https://en.wikipedia.org/wiki/Ron_Jones_(athlete),4,2021,December,"(, )",,British Olympic sprinter,,,,,,,,,,87.0,
68771,4,Nathan Shamuyarira,", 85, Zimbabwean newspaper editor and politician, Minister of Information , chest infection.",https://en.wikipedia.org/wiki/Nathan_Shamuyarira,13,2014,June,(–) and Foreign Affairs (–),,Zimbabwean newspaper editor and politician,Minister of Information,chest infection,,,,,,,,85.0,
102791,2,Md. Nazim Uddin,", 84, Bangladeshi freedom fighter.",https://en.wikipedia.org/wiki/Md._Nazim_Uddin,3,2019,May,,,Bangladeshi freedom fighter,,,,,,,,,,84.0,



#### Observations:
- `info_2` appears broadly consistent with the Wikipedia field that combines "citizenship" and "known for".
- The first word generally represents the nationality.
- Recall that this information is in `info_1` for some entries and may also appear in other `info` columns beyond `info_2`.
- Running the sample check a few times reveals two formats for a change of citizenship: the original citizenship followed by '-born' and then the second citizenship, or two nationalities joined by a hyphen.
- Some nationalities consist of multiple words, and some capitalized words are not part of the nationality.
- We will start with a list of nationalities downloaded from marijn's GitHub gist [List of nationalities](https://gist.github.com/marijn/274449).
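
The value formats noted above can be told apart with regular expressions before any lookup against a nationality list. The following is a minimal sketch rather than the approach taken in this notebook; the function name `split_nationality_prefix` and the sample strings are illustrative assumptions.

```python
import re

# One or more capitalized words, e.g. "German" or "Costa Rican"
CAP_WORDS = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"

BORN_RE = re.compile(rf"^({CAP_WORDS})-born\s+({CAP_WORDS})\b")
HYPHEN_RE = re.compile(r"^([A-Z][a-z]+)-([A-Z][a-z]+)\b")
SINGLE_RE = re.compile(rf"^({CAP_WORDS})\b")


def split_nationality_prefix(text):
    """Return (format_label, candidate_1, candidate_2) for the start of text."""
    for label, regex in (("born", BORN_RE), ("hyphenated", HYPHEN_RE)):
        match = regex.match(text)
        if match:
            return label, match.group(1), match.group(2)
    match = SINGLE_RE.match(text)
    if match:
        return "single", match.group(1), None
    return "none", None, None


print(split_nationality_prefix("German-born American physicist"))
print(split_nationality_prefix("German Olympic athlete"))
```

The second call returns the candidate "German Olympic", which illustrates why capitalized words alone are not enough and each candidate still needs to be compared to a list of known nationalities.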

### Reading List of Nationalities

In [29]:
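# Reading in list of nationalities (one per line)
# Note: the separator "/n" never occurs in the file, so each line is read whole into a single column
nationalities = pd.read_csv(
    "nationalities.txt", sep="/n", engine="python", names=["Nationality"]
)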

print(nationalities.shape)
nationalities.head()

(194, 1)


Unnamed: 0,Nationality
0,Afghan
1,Albanian
2,Algerian
3,American
4,Andorran


In [30]:
# Converting nationalities to a list
nationalities_lst = nationalities["Nationality"].to_list()
nationalities_lst

['Afghan',
 'Albanian',
 'Algerian',
 'American',
 'Andorran',
 'Angolan',
 'Antiguans',
 'Argentinean',
 'Armenian',
 'Australian',
 'Austrian',
 'Azerbaijani',
 'Bahamian',
 'Bahraini',
 'Bangladeshi',
 'Barbadian',
 'Barbudans',
 'Batswana',
 'Belarusian',
 'Belgian',
 'Belizean',
 'Beninese',
 'Bhutanese',
 'Bolivian',
 'Bosnian',
 'Brazilian',
 'British',
 'Bruneian',
 'Bulgarian',
 'Burkinabe',
 'Burmese',
 'Burundian',
 'Cambodian',
 'Cameroonian',
 'Canadian',
 'Cape Verdean',
 'Central African',
 'Chadian',
 'Chilean',
 'Chinese',
 'Colombian',
 'Comoran',
 'Congolese',
 'Costa Rican',
 'Croatian',
 'Cuban',
 'Cypriot',
 'Czech',
 'Danish',
 'Djibouti',
 'Dominican',
 'Dutch',
 'East Timorese',
 'Ecuadorean',
 'Egyptian',
 'Emirian',
 'Equatorial Guinean',
 'Eritrean',
 'Estonian',
 'Ethiopian',
 'Fijian',
 'Filipino',
 'Finnish',
 'French',
 'Gabonese',
 'Gambian',
 'Georgian',
 'German',
 'Ghanaian',
 'Greek',
 'Grenadian',
 'Guatemalan',
 'Guinea-Bissauan',
 'Guinean',
 'Gu

#### Observations:
- Two entries in the current list might appear differently in our dataset: 'Kittian and Nevisian' and 'Trinidadian or Tobagonian', so we will add their individual parts to the list.
- As we encounter other versions for nationality values, we can continue to update the list.
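
As a sketch of how such combined entries could be expanded without typing the parts by hand, the snippet below splits any entry joined by "and" or "or" and appends the parts. The helper name `expand_combined` is hypothetical, and the sample list is a small illustrative subset.

```python
import re


def expand_combined(entries):
    """Return entries plus the individual parts of any 'A and B' or 'A or B' entry."""
    expanded = list(entries)
    for entry in entries:
        parts = re.split(r"\s+(?:and|or)\s+", entry)
        if len(parts) > 1:
            for part in parts:
                if part not in expanded:
                    expanded.append(part)
    return expanded


sample = ["Kittian and Nevisian", "Trinidadian or Tobagonian", "British"]
print(expand_combined(sample))
```

The original combined entries are kept, so values written in either form would still match.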

In [31]:
# Adding nationality values to nationality list
nationalities_lst += ["Kittian", "Nevisian", "Trinidadian", "Tobagonian"]

#### Observations:
- There are now 196 nationalities represented in the nationalities list, which we may use for comparison.
- Each entry in the dataset has one or two nationalities represented ('Nationality', 'Nationality1-born Nationality2', or 'Nationality1-Nationality2').
- We will take the approach of extracting values that contain two nationalities, starting with `info_1`, then address the remaining missing values, likely column by column.
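
Because some nationalities span more than one word, comparing only the first word to the list can miss them. One possible alternative, sketched here under the assumption that values begin with the nationality, is to try the longest run of leading words first. The function name `match_leading_nationality` and the entry "American Samoan" are hypothetical; hyphenated values are not handled.

```python
def match_leading_nationality(text, known, max_words=3):
    """Return (nationality, remainder), preferring the longest leading match in known."""
    words = text.split()
    for end in range(min(len(words), max_words), 0, -1):
        candidate = " ".join(words[:end])
        if candidate in known:
            return candidate, " ".join(words[end:])
    return None, text


# Illustrative subset; "American Samoan" is an assumed addition to the list
known = {"American", "Costa Rican", "Samoan", "American Samoan"}

print(match_leading_nationality("Costa Rican footballer", known))
print(match_leading_nationality("American Samoan politician", known))
```

Using a set for `known` keeps each membership test fast when looping over many rows.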

### Extracting Nationality from `info_1` to `nation_1` and `nation_2`

In [34]:
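# Column to be treated
column = "info_2"

# Rows with a value in column
dataframe = df[df[column].notna()]

# Extracting first nationality to new column nation_1 when the first word
# (or the part of it before a hyphen) is in nationalities_lst
for index in dataframe.index:
    item = df.loc[index, column]
    first_word = item.split()[0].split("-")[0].strip()
    if first_word in nationalities_lst:
        df.loc[index, "nation_1"] = first_word
        # Removing only the leading occurrence so later repeats of the word are kept
        df.loc[index, column] = item.replace(first_word, "", 1)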

In [38]:
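# Column to be treated
column = "info_2"

# Rows for which a first nationality was already extracted
dataframe = df[df["nation_1"].notna()]

# Extracting a second nationality to new column nation_2 when the first remaining word is in nationalities_lst
for index in dataframe.index:
    item = df.loc[index, column]
    # Skipping values that are missing or were left empty by the previous step
    if isinstance(item, str) and item.strip():
        first_word = item.split()[0].split("-")[0].strip()
        if first_word in nationalities_lst:
            df.loc[index, "nation_2"] = first_word
            df.loc[index, column] = item.replace(first_word, "", 1)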

In [61]:
for item in df[df["nation_1"].isna()]["info_2"]:
    if item:
        print(type(item))
#         split_1 = item.split()
#         split_2 = split_1[0].split("-")
#         first_word = split_2[0].strip()
#         if first_word.isupper:
#             print(first_word)
#     if len(split_2) > 1:
#         second_word = split_2[1].strip()
#         if second_word.isupper:
#             print(second_word)
#     if len(split_2) > 2:
#         third_word = split_2[2].strip()
#         if third_word.isupper:
#             print(third_word)

<class 'str'>

In [70]:
# Collecting index labels of rows whose info_2 value is not a string (i.e., is missing)
other = [i for i, item in df["info_2"].items() if not isinstance(item, str)]

In [71]:
df.loc[other, :]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1,nation_2
6846,17,Spiro Agnew,", American politician, 77, 39th Vice President of the United States, leukemia.",https://en.wikipedia.org/wiki/Spiro_Agnew,207,1996,September,,American politician,,39th Vice President of the United States,leukemia,,,,,,,,77.0,,,
7525,21,Kell Areskoug,", 90 Swedish Olympic sprinter.",https://en.wikipedia.org/wiki/Kell_Areskoug,5,1996,December,,Swedish Olympic sprinter,,,,,,,,,,,90.0,,,
11858,23,Manuel Mejía Vallejo,", 75 Colombian writer.",https://en.wikipedia.org/wiki/Manuel_Mej%C3%ADa_Vallejo,2,1998,July,,Colombian writer,,,,,,,,,,,75.0,,,
12861,14,Muslimgauze,", , 37, British electronic musician, fungal infection.",https://en.wikipedia.org/wiki/Muslimgauze,26,1999,January,(Bryn Jones),,,British electronic musician,fungal infection,,,,,,,,37.0,,,
14993,15,Željko Ražnatović,", , 47, Serbian mobster and paramilitary leader.",https://en.wikipedia.org/wiki/Arkan,55,2000,January,(aka Arkan),,,Serbian mobster and paramilitary leader,,,,,,,,,47.0,,,
15062,28,Sarah Caudwell,", , 60, British detective story writer and barrister, cancer.",https://en.wikipedia.org/wiki/Sarah_Caudwell,9,2000,January,(aka Sarah Cockburn),,,British detective story writer and barrister,cancer,,,,,,,,60.0,,,
16111,30,Max Showalter,", , 83, American actor, composer, pianist, singer, cancer.",https://en.wikipedia.org/wiki/Max_Showalter,14,2000,July,(aka Casey Adams),,,American actor,composer,pianist,singer,cancer,,,,,83.0,,,
17792,15,Joey Ramone,", , 49, American musician, lead singer for The Ramones, lymphoma.",https://en.wikipedia.org/wiki/Joey_Ramone,25,2001,April,(b. Jeffrey Hyman),,,American musician,lead singer for The Ramones,lymphoma,,,,,,,49.0,,,
17890,28,Marie Jahoda,", 94 Austrian-British social psychologist.",https://en.wikipedia.org/wiki/Marie_Jahoda,3,2001,April,,Austrian-British social psychologist,,,,,,,,,,,94.0,,,
18170,4,Dipendra,", King of Nepal, 29, suicide.",https://en.wikipedia.org/wiki/Dipendra_of_Nepal,8,2001,June,,King of Nepal,,suicide,,,,,,,,,29.0,,,


In [68]:
type("Teresa")

str
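
The type checks above matter because a missing value stored as a float NaN is truthy in Python, so a guard such as `if item:` screens out empty strings but not NaN. A small standard-library sketch of the difference, using illustrative values:

```python
import math

values = ["German Olympic athlete", "", float("nan"), None]

# A truthiness test drops "" and None but keeps NaN
truthy = [v for v in values if v]

# An explicit type test keeps only non-empty strings
strings = [v for v in values if isinstance(v, str) and v.strip()]

print(len(truthy), strings)
```

In pandas, `pd.isna(value)` or `Series.notna()` serves the same purpose as the explicit type test for filtering out missing values.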

In [45]:
df[df["nation_2"].notna()].sample(50)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1,nation_2
98039,13,Emmanuel Dabbaghian,", 84, Syrian Armenian Catholic hierarch, Archbishop of Baghdad .",https://en.wikipedia.org/wiki/Emmanuel_Dabbaghian,2,2018,September,(–),,Catholic hierarch,Archbishop of Baghdad,,,,,,,,,84.0,,Syrian,Armenian
51241,27,Ida Fink,", 89, Israeli Polish-language author.",https://en.wikipedia.org/wiki/Ida_Fink,8,2011,September,,,-language author,,,,,,,,,,89.0,,Israeli,Polish
6407,6,Kutlu Adalı,", 61, Turkish Cypriot journalist, poet, socio-political researcher, and peace advocate.",https://en.wikipedia.org/wiki/Kutlu_Adal%C4%B1,6,1996,July,,,journalist,poet,socio-political researcher,and peace advocate,,,,,,,61.0,,Turkish,Cypriot
118026,12,Florentin Crihălmeanu,", 61, Romanian Greek Catholic hierarch, bishop of Cluj-Gherla , COVID-19.",https://en.wikipedia.org/wiki/Florentin_Crih%C4%83lmeanu,2,2021,January,(since ),,Catholic hierarch,bishop of Cluj-Gherla,COVID-19,,,,,,,,61.0,,Romanian,Greek
103623,12,Te'o J. Fuavai,", 82, American Samoan politician, former Senator, Speaker of the American Samoa House of Representatives .",https://en.wikipedia.org/wiki/Te%27o_J._Fuavai,16,2019,June,(–19??),,politician,former Senator,Speaker of the American Samoa House of Representatives,,,,,,,,82.0,,American,Samoan
71558,31,Sofron Mudry,", 90, Ukrainian Greek Catholic hierarch, Bishop of Ivano-Frankivsk .",https://en.wikipedia.org/wiki/Sofron_Mudry,0,2014,October,(–),,Catholic hierarch,Bishop of Ivano-Frankivsk,,,,,,,,,90.0,,Ukrainian,Greek
9185,27,K'tut Tantri,", 99, Scottish American hotelier and broadcaster known as .",https://en.wikipedia.org/wiki/K%27tut_Tantri,18,1997,July,,,hotelier and broadcaster known as,,,,,,,,,,99.0,,Scottish,American
55739,30,"Michael Abney-Hastings, 14th Earl of Loudoun",", 69, British Australian peer.","https://en.wikipedia.org/wiki/Michael_Abney-Hastings,_14th_Earl_of_Loudoun",11,2012,June,,,peer,,,,,,,,,,69.0,,British,Australian
8350,1,Evsey Domar,", 82, Russian American economist.",https://en.wikipedia.org/wiki/Evsey_Domar,4,1997,April,,,economist,,,,,,,,,,82.0,,Russian,American
29480,26,Dave Tatsuno,", 92, Japanese American businessman, documented the Topaz Japanese internment camp in his film .",https://en.wikipedia.org/wiki/Dave_Tatsuno,3,2006,January,,,businessman,documented the Topaz Japanese internment camp in his film,,,,,,,,,92.0,,Japanese,American


In [40]:
df["nation_2"].notna().sum()

140
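
In the sample shown above, a second listed word is not always a second nationality: "Israeli Polish-language author" and "Romanian Greek Catholic hierarch" yielded `nation_2` values of Polish and Greek. One possible guard is sketched below, assuming such cases can be recognized from what follows the word. The function name and the `blocked_followers` tuple are hypothetical and certainly not exhaustive.

```python
def is_second_nationality(remainder, known, blocked_followers=("Catholic",)):
    """remainder: text left after the first nationality has been removed."""
    words = remainder.split()
    if not words:
        return False
    first = words[0]
    # Compound such as "Polish-language" describes something other than citizenship
    if "-" in first:
        return False
    if first not in known:
        return False
    # Word pairs such as "Greek Catholic" name a church rather than a nationality
    if len(words) > 1 and words[1] in blocked_followers:
        return False
    return True


known = {"Polish", "Greek", "American", "Armenian"}
print(is_second_nationality("Polish-language author", known))
print(is_second_nationality("American economist", known))
```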

In [None]:
# Define a function that checks if nation_1 or nation_2 value is missing and extracts it from column if present, comparing
# value to defined nationalities_lst.  Also returns missing_nations list to collect any nationalities that might need
# to be added to it.


def extract_nation(dataframe, column):
    """
    Takes input of dataframe and one of its columns and, for values containing '-born',
    checks for a missing value in the nation_1 or nation_2 column and the presence of a
    nationality in previously-defined nationalities_lst, then extracts a found value to
    its respective nation_1 or nation_2 column.  Returns a list of candidate values
    that were not found in nationalities_lst.

    dataframe: dataframe that already has nation_1 and nation_2 columns
    column: column to be checked for presence of nationality data
    """
    missing_nations = []
    for index in dataframe[dataframe[column].notna()].index:
        item = dataframe.loc[index, column]

        if "-born" in item:
            split_1 = item.split("-born")

            # Capitalized words before "-born" are the candidate first nationality
            nation_1 = " ".join(w for w in split_1[0].split() if w[0].isupper())
            # Missing values are NaN, which is truthy, so test with pd.isna rather than `not`
            if pd.isna(dataframe.loc[index, "nation_1"]):
                if nation_1 in nationalities_lst:
                    dataframe.loc[index, "nation_1"] = nation_1
                else:
                    missing_nations.append(nation_1)

            # Capitalized words after "-born" are the candidate second nationality
            nation_2 = " ".join(w for w in split_1[1].split() if w[0].isupper())
            if pd.isna(dataframe.loc[index, "nation_2"]):
                if nation_2 in nationalities_lst:
                    dataframe.loc[index, "nation_2"] = nation_2
                else:
                    missing_nations.append(nation_2)

    #                 for word in split_1:
    #                     if word[0].islower():
    #                     if not dataframe.loc[index, "known_for"]:
    #                         dataframe.loc[index, "known_for"] = ""
    #                     else:
    #                         dataframe.loc[index, "known_for"] = (
    #                             dataframe.loc[index, "known_for"] + " " + word
    #                         )

    #         elif "-" in item:
    #             split_1 = item.split("-")
    #             split_2 = split_1[0].split()
    #             nation_1 = ""
    #             for word in split_2:
    #                 if word.isupper():
    #                     nation_1 = (nation_1 + " " + word).strip()
    #                     if (
    #                         not dataframe.loc[index, "nation_1"]
    #                         and nation_1 in nationalities_lst
    #                     ):
    #                         dataframe.loc[index, "nation_1"] = nation_1
    #                     else:
    #                         missing_nations.append(nation_1)

    #             split_3 = split_1[1].split()
    #             nation_2 = ""
    #             for word in split_2:
    #                 if word[0].isupper():
    #                     nation_2 = (nation_2 + " " + word).strip()
    #                     if (
    #                         not dataframe.loc[index, "nation_2"]
    #                         and nation_2 in nationalities_lst
    #                     ):
    #                         dataframe.loc[index, "nation_2"] = nation_2
    #                     else:
    #                         missing_nations.append(nation_2)

    #             for word in split_1:
    #                 if word[0].islower():
    #                     if not dataframe.loc[index, "known_for"]:
    #                         dataframe.loc[index, "known_for"] = ""
    #                     else:
    #                         dataframe.loc[index, "known_for"] = (
    #                             dataframe.loc[index, "known_for"] + " " + word
    #                         )

    #         else:
    #             split_1 = item.split()
    #             nation_1 = ""
    #             for word in split_1:
    #                 if word[0].isupper():
    #                     nation_1 = (nation_1 + " " + word).strip()
    #                     if (
    #                         not dataframe.loc[index, "nation_1"]
    #                         and nation_1 in nationalities_lst
    #                     ):
    #                         nation_1_test.append(nation_1)

    #             for word in split_1:
    #                 if word[0].islower():
    #                     if not dataframe.loc[index, "known_for"]:
    #                         dataframe.loc[index, "known_for"] = ""
    #                     else:
    #                         dataframe.loc[index, "known_for"] = (
    #                             dataframe.loc[index, "known_for"] + " " + word
    #                         )

    return missing_nations

In [None]:
missing_list = extract_nation(df, "info_2")
set(missing_list)

In [None]:
nationalities_lst

In [None]:
df[df["nation_1"].notna()].sample(100)

In [None]:
df["nation_2"].value_counts()

In [None]:
nation_1_test = []
nation_2_test = []
role_test = []
missing = []
for item in test:
    if "-born" in item:
        split_1 = item.split("-born")
        nation_1 = split_1[0].strip()
        if nation_1 in nationalities_lst:
            nation_1_test.append(nation_1)

        split_nation = split_1[1].split()
        nation_2 = ""
        for word in split_nation:
            if word[0].isupper():
                nation_2 = (nation_2 + " " + word).strip()
                if nation_2 in nationalities_lst:
                    nation_2_test.append(nation_2)
                else:
                    missing.append(nation_2)

        role = ""
        for word in split_nation:
            if word[0].islower():
                role = role + " " + word
        role_test.append(role.strip())

    elif "-" in item:
        split_1 = item.split("-")
        nation_1 = split_1[0].strip()
        if nation_1 in nationalities_lst:
            nation_1_test.append(nation_1)
        else:
            missing.append(nation_1)

        split_nation = split_1[1].split()
        nation_2 = ""
        for word in split_nation:
            if word[0].isupper():
                nation_2 = (nation_2 + " " + word).strip()
                if nation_2 in nationalities_lst:
                    nation_2_test.append(nation_2)
                else:
                    missing.append(nation_2)

        role = ""
        for word in split_nation:
            if word[0].islower():
                role = role + " " + word
        role_test.append(role.strip())
    else:
        split_nation = item.split()
        nation_1 = ""
        for word in split_nation:
            if word[0].isupper():
                nation_1 = (nation_1 + " " + word).strip()
                if nation_1 in nationalities_lst:
                    nation_1_test.append(nation_1)

        role = ""
        for word in split_nation:
            if word[0].islower():
                role = role + " " + word
        role_test.append(role.strip())

nation_1_test
nation_2_test
role_test
missing

In [None]:
# List of columns to check
cols_to_check = [
    "info_1",
    "info_2",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# List of rows matching pattern, combined across columns and nationalities
comb_rows_to_check = []
for column in cols_to_check:
    for nation in nationalities_lst:
        # Pattern for re
        pattern = r"([A-Z][a-z]*\s?[A-Z]?[a-z]*\s?[A-Z]?[a-z]*)-born\s(%s)" % nation
        rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)
        comb_rows_to_check += rows_to_check

# Checking a sample of rows
df.loc[comb_rows_to_check, :].sample(2)

#### Observations:
- Such values are only in `info_2` and `info_3`, so we will treat those values as indicated, one column at a time.
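
As a possible alternative to building one pattern per nationality, a single general pattern can capture both candidates, which are then compared to a set of known nationalities. This is a sketch under stated assumptions: the function name `find_born_pair` is hypothetical, the `known` set is a small illustrative subset, and only the first '-born' match in the text is considered.

```python
import re

# Up to three capitalized words, "-born", then a run of capitalized words
BORN_RE = re.compile(
    r"([A-Z][a-z]+(?:\s[A-Z][a-z]+){0,2})-born\s((?:[A-Z][a-z]+\s?)+)"
)


def find_born_pair(text, known):
    """Return (birth_nationality, later_nationality) if both are in known, else None."""
    match = BORN_RE.search(text)
    if not match:
        return None
    first = match.group(1)
    words = match.group(2).split()
    # Trying the longest run of capitalized words first, e.g. "Costa Rican" before "Costa"
    for end in range(len(words), 0, -1):
        second = " ".join(words[:end])
        if first in known and second in known:
            return first, second
    return None


known = {"German", "American", "British", "Costa Rican"}
print(find_born_pair("German-born American Olympic fencer", known))
```

Compiling the pattern once avoids repeating the search for each of the nearly 200 list entries in every column.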

#### `info_2`

In [None]:
# Column to be treated
column = "info_2"

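# Pattern for re, redefined here because the loop above leaves `pattern` set for only the last nationality in the list
pattern = r"([A-Z][a-z]*\s?[A-Z]?[a-z]*\s?[A-Z]?[a-z]*)-born\s([A-Z][a-z]*)"

# Empty list to collect missing nationalities
missing = []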

# For loop to extract nation information to new columns nation_1 and nation_2
for i in df[df[column].notna()].index:
    item = df.loc[i, column]
    match = re.search(pattern, item)

    if match:
        print(match.group(0))
#         nation_1 = match.group(1).strip()
#         nation_2 = match.group(2).strip()

#         if nation_1 in nationalities_lst:
#             df.loc[i, "nation_1"] = nation_1
#             df.loc[i, column] = df.loc[i, column].replace(nation_1, "").strip()

#             if nation_2 in nationalities_lst:
#                 df.loc[i, "nation_2"] = nation_2
#                 df.loc[i, column] = df.loc[i, column].replace(nation_2, "").strip()
#             else:
#                 missing.append(nation_2)
#         else:
#             missing.append(nation_1)

# # Re-check number of and example rows
# rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)
# df[df["name"] == "Giltedge"]

In [None]:
set(missing)

In [None]:
nationalities_lst

#### `info_3`

In [None]:
# Column to check
column = "info_3"

# Pattern for re
pattern = r"([A-Z][a-z]*-born\s[A-Z][a-z]*\s)"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)

# Checking the matching rows
df.loc[rows_to_check, :]

In [None]:
# Column to be treated
column = "info_3"

# Extracting nation information to new columns nation_1 and nation_2
for i in df[df[column].notna()].index:
    item = df.loc[i, column]
    match = re.search(pattern, item)
    if match:
        nations = match.group(1).split("-born")
        nation_1 = nations[0].strip()
        nation_2 = nations[1].strip()
        df.loc[i, "nation_1"] = nation_1
        df.loc[i, "nation_2"] = nation_2
        df.loc[i, column] = re.sub(pattern, "", df.loc[i, column]).strip()

# Re-check number of and example rows
rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)
df[df["name"] == "Herbert Freudenberger"]

#### Values with 2 Nationalities Hyphenated

In [None]:
# Column to check and treat
column = "info_2"

# Pattern for re
pattern = r"(^[A-Z][a-z]*-[A-Z][a-z]*)"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["nation_1"].isna()) & (df[column].notna())], column, pattern
)

df.loc[rows_to_check, :].sample(2)

#### Observations:
- This pattern may have matches that are not nationalities, so we will add that comparison to our for loop when extracting.

In [None]:
# Empty list to collect matched values not found in nationalities_lst
not_in_list = []

# Extracting nation information to new columns nation_1 and nation_2
for i in df[(df[column].notna()) & (df["nation_1"].isna())].index:
    item = df.loc[i, column]
    match = re.search(pattern, item)

    if match:
        nations = match.group(0).split("-")
        nation_1 = nations[0].strip()
        nation_2 = nations[1].strip()
        if nation_1 in nationalities_lst:
            df.loc[i, "nation_1"] = nation_1
            df.loc[i, column] = df.loc[i, column].replace(nation_1, "").strip()
        else:
            not_in_list.append(nation_1)

        if nation_2 in nationalities_lst:
            df.loc[i, "nation_2"] = nation_2
            df.loc[i, column] = df.loc[i, column].replace(nation_2, "").strip()
        else:
            not_in_list.append(nation_2)

# Re-check number of and example rows
rows_to_check = rows_with_pattern(
    df[(df["nation_1"].isna()) & (df[column].notna())], column, pattern
)
df.loc[rows_to_check, :]

In [None]:
df["nation_1"].unique()

In [None]:
nationalities_lst

In [None]:
# For loop to extract nationality from info_1 to nation_1 and nation_2
for i in df[df["info_1"].notna()].index:
    item = df.loc[i, "info_1"]

    nations = []
    for nationality in nationalities_lst:
        if nationality in item:
            nations.append(nationality)
            print(nations)

#     for j, nation in enumerate(nations):
#         df.loc[i, f"nation_{j+1}"] = nation
#         df.loc[i, "info_1"] = df.loc[i, "info_1"].replace(nation, "")

# # Checking sample of treated rows
# df[df["nation_1"].notna()].sample(2)

#### Observations:
- We have been fortunate in `info_1` in that it has provided a small sample on which to test our code and see a subset of values.
- Let us check the remaining unique values in `info_1` for any missed nationalities.

#### Checking Remaining Unique Values in `info_1`

In [None]:
# Checking unique values in info_1
df["info_1"].unique()

#### Observations:
- `info_1` had only single nationality values, so we do not yet have a `nation_2` column.
- It appears "New Zealand Maori", "English", and "Icelandic" are not included in the nationalities list, but they are used as values on Wikipedia.
- We will take the approach of adding them to the list and re-running the code to extract values from `info_1`.
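
One caution with the `nationality in item` test is that it matches substrings, so a shorter list entry can match inside a longer one (both 'Guinean' and 'Equatorial Guinean' are in the list). A sketch of a stricter search with word boundaries, trying longer entries first, follows. The helper name `build_nationality_regex` is hypothetical and `known` is an illustrative subset.

```python
import re


def build_nationality_regex(known):
    """Compile one pattern that matches whole entries, longest entries first."""
    ordered = sorted(known, key=len, reverse=True)
    return re.compile(r"\b(?:%s)\b" % "|".join(re.escape(n) for n in ordered))


known = ["Guinean", "Equatorial Guinean", "English"]
regex = build_nationality_regex(known)

text = "Equatorial Guinean politician"
naive = [n for n in known if n in text]  # substring test matches both entries
strict = regex.findall(text)  # whole-entry search matches only the longer one

print(naive, strict)
```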

#### Adding Missing Nationalities to `nationalities_lst` and Repeating Extraction of Nationality from `info_1`

In [None]:
# Confirming nationalities are missing then adding them to existing nationalities list
missing_nations = ["New Zealand Maori", "English", "Icelandic"]
missing_nations = [item for item in missing_nations if item not in nationalities_lst]
nationalities_lst += missing_nations

# List of new columns for first and second nationality
new_cols = ["nation_1", "nation_2"]

# For loop to extract nationality from info_1 to nation_1 and nation_2
for col in new_cols:

    for i in df[df["info_1"].notna()].index:
        item = df.loc[i, "info_1"]

        for nationality in nationalities_lst:
            if nationality in item:
                df.loc[i, col] = nationality
                df.loc[i, "info_1"] = df.loc[i, "info_1"].replace(nationality, "")
                # Stopping at the first match so that a second nationality is left for nation_2
                break

# Checking sample of treated rows
df[df["nation_1"].notna()].sample(2)

#### Re-checking Remaining Unique Values in `info_1`

In [None]:
# Re-checking remaining unique values in info_1
df["info_1"].unique()

#### Observations:
- That iteration addressed missing nationalities.
- We do have nationality given as a country, "Nepal", that we will need to keep in mind.

### Extracting Nationality from `info_2` to `nation_1` and `nation_2`

In [None]:
# List of new columns for first and second nationality
new_cols = ["nation_1", "nation_2"]

# For loop to extract nationality from info_2 to nation_1 and nation_2
for col in new_cols:

    for i in df[df["info_2"].notna()].index:
        item = df.loc[i, "info_2"]

        for nationality in nationalities_lst:
            if nationality in item:
                df.loc[i, col] = nationality
                df.loc[i, "info_2"] = df.loc[i, "info_2"].replace(nationality, "")
                # Stopping at the first match so that a second nationality is left for nation_2
                break

# Checking sample of treated rows
df[df["nation_1"].notna()].sample(2)

In [None]:
df["info_2"].unique()

In [None]:
fuzz.partial_ratio("Nepal", "Nepalese")
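
The check above suggests that a country name such as "Nepal" might be mapped to its nationality by string similarity. The sketch below uses `difflib` from the standard library in place of `fuzzywuzzy`, so that it runs without extra installs. The function name `nationality_for_country`, the cutoff of 0.6, and the candidate list are illustrative assumptions.

```python
from difflib import get_close_matches


def nationality_for_country(country, known, cutoff=0.6):
    """Return the closest entry in known to country, or None if nothing is similar enough."""
    matches = get_close_matches(country, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


candidates = ["Nepalese", "Chilean", "British"]
print(nationality_for_country("Nepal", candidates))
```

Similarity matching can also pair unrelated names that happen to share letters, so any matches would be worth reviewing by eye before being written to the dataset.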

### Exporting Dataset to SQLite Database [wp_life_expect_clean2.db](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean1.db)

In [None]:
# Saving the cleaned dataset in a SQLite database (replacing the table if the cell is re-run)
conn = sql.connect("wp_life_expect_clean2.db")
df.to_sql("wp_life_expect_clean2", conn, index=False, if_exists="replace")
conn.close()

# [Proceed to Notebook 4 of 4: Data Cleaning Part 3](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)