# Wikipedia Notable Life Expectancies

# [Notebook 3 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re
from fuzzywuzzy import fuzz
from fuzzywuzzy import process


# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean1.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean1", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 132445 rows and 20 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
132443,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0
132444,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
69388,7,Mario Coyula Cowley,", 79, Cuban architect, cancer.",https://en.wikipedia.org/wiki/Mario_Coyula_Cowley,2,2014,July,,,Cuban architect,cancer,,,,,,,,,79.0
114453,30,Keith Lampard,", 74, American baseball player .",https://en.wikipedia.org/wiki/Keith_Lampard,4,2020,August,(Houston Astros),,American baseball player,,,,,,,,,,74.0
32689,14,Tommy Cavanagh,", 78, British football player and manager of Burnley.",https://en.wikipedia.org/wiki/Tommy_Cavanagh,8,2007,March,,,British football player and manager of Burnley,,,,,,,,,,78.0
55694,26,Mario O'Hara,", 68, Filipino film director, leukemia.",https://en.wikipedia.org/wiki/Mario_O%27Hara,10,2012,June,,,Filipino film director,leukemia,,,,,,,,,68.0
11996,21,Eddie Baxter,", 75, American organist.",https://en.wikipedia.org/wiki/Eddie_Baxter,14,1998,August,,,American organist,,,,,,,,,,75.0



#### Observations:
- There are currently 132,445 rows and 20 columns.

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132445 entries, 0 to 132444
Data columns (total 20 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132445 non-null  object 
 1   name            132445 non-null  object 
 2   info            132445 non-null  object 
 3   link            132445 non-null  object 
 4   num_references  132445 non-null  object 
 5   year            132445 non-null  int64  
 6   month           132445 non-null  object 
 7   info_parenth    49788 non-null   object 
 8   info_1          132445 non-null  object 
 9   info_2          132421 non-null  object 
 10  info_3          62570 non-null   object 
 11  info_4          12587 non-null   object 
 12  info_5          1505 non-null    object 
 13  info_6          217 non-null     object 
 14  info_7          33 non-null      object 
 15  info_8          7 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- Picking up where we left off, we will aim to track down the remaining missing values for `age`, starting with searching for digits in `info_2`.

## Extracting Age Continued

### Remaining Missing Values for Age

In [6]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')

There are 79 missing values for age.



#### Function to Save Indices of Rows Matching Regular Expressions Pattern to a List and Print Number of Rows with Match

In [7]:
# Define a function that takes dataframe, column name, and re pattern as arguments and returns list of indices
# for which column value matches re pattern
def rows_with_pattern(dataframe, column, pattern):
    """
    Takes input of dataframe, column name, and re pattern 
    and returns list of indices for rows that contain match
    for pattern anywhere within value for given column.
    
    dataframe: dataframe
    column: column name
    pattern: re pattern
    """
    index_list = []

    for i in dataframe.index:
        item = dataframe.loc[i, column]
        match = re.search(pattern, item)
        if match:
            index_list.append(i)
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list
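
The row-by-row loop above could also be written with pandas' vectorized string methods. This is a minimal sketch, assuming the column holds strings or missing values; the name `rows_with_pattern_vectorized` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd


def rows_with_pattern_vectorized(dataframe, column, pattern):
    """Return index labels of rows whose `column` value contains a regex match."""
    # na=False treats missing values as non-matches instead of propagating NaN
    mask = dataframe[column].str.contains(pattern, regex=True, na=False)
    return dataframe[mask].index.tolist()


# Toy frame for illustration only
toy = pd.DataFrame({"info_2": ["77", "British actor", None, "74–75"]})
matches = rows_with_pattern_vectorized(toy, "info_2", r"\d")
print(f"There are {len(matches)} rows with matching pattern in column 'info_2'.")
```

Because missing values are handled by `na=False`, this variant would not need the `notna()` pre-filter that the loop version relies on.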


#### Function to Use rows_with_pattern Function for Multiple Regular Expression Patterns

In [8]:
# Define a function that calls rows_with_pattern function for multiple re patterns
# returning a single list of indices for all rows with any pattern match


def multiple_patterns(dataframe, column, patterns):
    """
    Takes input dataframe, column, and list of re patterns and returns single list 
    of indices for rows in which a match for any pattern is found with re.search
    
    dataframe: dataframe
    column: column name
    patterns: list of re patterns
    """
    rows_combined = []

    # For loop to check each pattern
    for pattern in patterns:

        # List and number of rows matching each pattern
        print(pattern)
        rows_to_check = rows_with_pattern(dataframe, column, pattern)
        print("")

        # Add list for each pattern to combined list
        rows_combined += rows_to_check

    return rows_combined
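
Note that concatenating the per-pattern lists can repeat an index when a row matches more than one pattern. The sketch below shows one way to keep each hit once while preserving order, using plain Python lists rather than a dataframe; `match_any` is a hypothetical helper, not a function from the notebook.

```python
import re


def match_any(values, patterns):
    """Return positions of values matching at least one pattern, each position once."""
    hits = []
    for pattern in patterns:
        hits += [
            i
            for i, value in enumerate(values)
            if isinstance(value, str) and re.search(pattern, value)
        ]
    # dict.fromkeys drops repeats while keeping first-seen order
    return list(dict.fromkeys(hits))


# Toy values for illustration only
values = ["74–75", "81", "British actor", None]
positions = match_any(values, [r"\d", r"–"])
```

Here the first value matches both patterns but appears only once in `positions`.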


### `info_2`

#### Rows Missing `age` with Digits in `info_2`

In [9]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[(df["age"].isna()) & (df[column].notna())]

# Pattern for re
pattern = r"\d"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Examining the rows directly
df.loc[rows_to_check, :]

There are 43 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
6846,17,Spiro Agnew,", American politician, 77, 39th Vice President of the United States, leukemia.",https://en.wikipedia.org/wiki/Spiro_Agnew,207,1996,September,,American politician,77,39th Vice President of the United States,leukemia,,,,,,,,
12861,14,Muslimgauze,", , 37, British electronic musician, fungal infection.",https://en.wikipedia.org/wiki/Muslimgauze,26,1999,January,(Bryn Jones),,37,British electronic musician,fungal infection,,,,,,,,
14993,15,Željko Ražnatović,", , 47, Serbian mobster and paramilitary leader.",https://en.wikipedia.org/wiki/Arkan,55,2000,January,(aka Arkan),,47,Serbian mobster and paramilitary leader,,,,,,,,,
15062,28,Sarah Caudwell,", , 60, British detective story writer and barrister, cancer.",https://en.wikipedia.org/wiki/Sarah_Caudwell,9,2000,January,(aka Sarah Cockburn),,60,British detective story writer and barrister,cancer,,,,,,,,
16111,30,Max Showalter,", , 83, American actor, composer, pianist, singer, cancer.",https://en.wikipedia.org/wiki/Max_Showalter,14,2000,July,(aka Casey Adams),,83,American actor,composer,pianist,singer,cancer,,,,,
17792,15,Joey Ramone,", , 49, American musician, lead singer for The Ramones, lymphoma.",https://en.wikipedia.org/wiki/Joey_Ramone,25,2001,April,(b. Jeffrey Hyman),,49,American musician,lead singer for The Ramones,lymphoma,,,,,,,
18170,4,Dipendra,", King of Nepal, 29, suicide.",https://en.wikipedia.org/wiki/Dipendra_of_Nepal,8,2001,June,,King of Nepal,29,suicide,,,,,,,,,
19017,28,Mohammad Khalequzzaman,", member of the then National Assembly of Pakistan and Union Minister of Labor, died in 28 September .",https://en.wikipedia.org/wiki/Mohammad_Khalequzzaman,3,2001,September,,member of the then National Assembly of Pakistan and Union Minister of Labor,died in 28 September,,,,,,,,,,
19117,12,Lord Hailsham of St Marylebone,", , 94, British lawyer and politician.","https://en.wikipedia.org/wiki/Quintin_Hogg,_Baron_Hailsham_of_St_Marylebone",17,2001,October,(Quintin Hogg),,94,British lawyer and politician,,,,,,,,,
23752,6,Jules Engel,", Jules Engel, 94, American filmmaker, visual artist, and film director.",https://en.wikipedia.org/wiki/Jules_Engel,10,2003,September,,Jules Engel,94,American filmmaker,visual artist,and film director,,,,,,,



#### Observations:
- Several ages appear as a single integer number of years, and two appear as a range of two integers.
- The remaining entries are missing age, but do have digits, so order of processing matters here.
- We can safely remove any of these rows whose `info_2` value contains a letter of the alphabet, taking care to select only from rows that are missing `age` and have a digit in `info_2`.

In [10]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df.loc[rows_to_check, :]

# Pattern for re: any ASCII letter (a comma inside the character class would also match literal commas)
pattern = r"[a-zA-Z]"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking a sample of the rows
df.loc[rows_to_check, :].sample(2)

There are 17 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
115179,2,Fadma Abi,", Moroccan surgeon and professor, COVID-19.",https://en.wikipedia.org/wiki/Fadma_Abi,6,2020,October,,Moroccan surgeon and professor,COVID-19,,,,,,,,,,
121419,21,Johny Lal,", Indian cinematographer , COVID-19.",https://en.wikipedia.org/wiki/Johny_Lal,4,2021,April,"(, , )",Indian cinematographer,COVID-19,,,,,,,,,,



#### Observations:
- We can drop these rows, as they are missing the data for age.
- Extraction of age for two integer ranges, then single integer values, can follow.
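
The order described above (ranges first, then single integers) can be sketched as a small standalone parser. This is an illustration under assumptions, not the notebook's code: `parse_age` is a hypothetical name, and it uses `re.fullmatch` so that only values consisting entirely of an age are accepted, whereas the notebook cells use `re.search`.

```python
import re

# Range of two integers, e.g. "74–75", "74-75", "74/75", "74 or 75"
RANGE = re.compile(r"(\d{1,3})(?:-|–|/| or )(\d{1,3})")
# A lone integer of up to three digits
SINGLE = re.compile(r"\d{1,3}")


def parse_age(text):
    """Return the age as a float, or None when the whole value is not an age."""
    text = text.strip()
    range_match = RANGE.fullmatch(text)
    if range_match:
        # Use the midpoint of the two ages in the range
        return (int(range_match.group(1)) + int(range_match.group(2))) / 2
    if SINGLE.fullmatch(text):
        return float(text)
    return None


ages = [parse_age(v) for v in ["74–75", "81", "died in 28 September", "COVID-19"]]
```

Checking the range pattern first matters because a single-integer search would otherwise pick up only the first number of a range.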

#### Dropping Additional Rows with Age Data Absent

In [11]:
# Dropping rows, resetting index, and checking new shape of df
rows_to_drop = rows_to_check.copy()
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132428, 20)


#### Remaining Rows with `age` values in `info_2`

In [12]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[(df["age"].isna()) & (df[column].notna())]

# Pattern for re: any digit
pattern = r"\d"

# Finding indices of rows that have pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking unique values
df.loc[rows_to_check, :]["info_2"].unique()

There are 26 rows with matching pattern in column 'info_2'.


array(['77', '37', '47', '60', '83', '49', '29', '94', '55', '62', '69',
       '80', '32', '70', '81', '24', '84', '95', '76', '86', '61',
       '74–75', '79–80'], dtype=object)


#### Extracting `age` for Ranges with Two Values

In [13]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[(df["age"].isna()) & (df[column].notna())]

# Pattern for re
pattern = r"(\d{1,3})(-|–|/| or )(\d{1,3})"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

There are 2 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
61915,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,74–75,,,,,,,,,,
86907,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(–),Somali politician,79–80,President of Puntland,,,,,,,,,



In [14]:
# For loop to find rows with values and pattern and calculate and extract age to age column and remove age from info_2
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        age = (int(match.group(1)) + int(match.group(3))) / 2
        df.loc[index, "age"] = age
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column]).strip()

# Rechecking number of rows after treatment
recheck_rows = rows_with_pattern(df.loc[rows_to_check, :], column, pattern)

# Recheck a sample of treated rows
df.loc[rows_to_check, :].sample(2)

There are 0 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
61915,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,,,,,,,,,,,74.5
86907,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(–),Somali politician,,President of Puntland,,,,,,,,,79.5



#### Extracting `age` as Single Integer

In [15]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[(df["age"].isna()) & (df[column].notna())]

# Pattern for age given as a single integer
pattern = r"\b(\d{1,3})\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

There are 24 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
26995,19,Kihachi Okamoto,", , 81, Japanese film director, esophageal cancer",https://en.wikipedia.org/wiki/Kihachi_Okamoto,7,2005,February,(岡本喜八),,81,Japanese film director,esophageal cancer,,,,,,,,
27725,30,Takanohana Kenshi,", , 55, Japanese sumo wrestler, aka ""The Prince of Sumo"".",https://en.wikipedia.org/wiki/Takanohana_Kenshi,7,2005,May,(née Mitsuru Hanada),,55,Japanese sumo wrestler,"aka ""The Prince of Sumo""",,,,,,,,



In [16]:
# For loop to extract age pattern to age column
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[index, "age"] = age
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column]).strip()

# Re-checking number of rows matching pattern
recheck_rows = rows_with_pattern(df.loc[rows_to_check, :], column, pattern)

# Recheck a sample of treated rows
df.loc[rows_to_check, :].sample(2)

There are 0 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
27023,22,Lee Eun-ju,", , 24, Korean actress, suicide.",https://en.wikipedia.org/wiki/Lee_Eun-ju,12,2005,February,(이은주),,,Korean actress,suicide,,,,,,,,24.0
30073,20,Stanley Hiller,", Jr., 81, American helicopter designer.",https://en.wikipedia.org/wiki/Stanley_Hiller,5,2006,April,,Jr,,American helicopter designer,,,,,,,,,81.0



#### Number of Remaining Missing Values

In [17]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')
df[df["age"].isna()].sample(2)

There are 36 missing values for age.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
116496,24,Hussein Al-Zuhairi,", Iraqi politician, MP, COVID-19.",https://en.wikipedia.org/wiki/Hussein_Al-Zuhairi,3,2020,November,,Iraqi politician,MP,COVID-19,,,,,,,,,
122278,18,Yolanda Tortolero,", Venezuelan physician and politician, deputy , complications from COVID-19.",https://en.wikipedia.org/wiki/Yolanda_Tortolero,6,2021,May,(since ),Venezuelan physician and politician,deputy,complications from COVID-19,,,,,,,,,



#### Observations:
- At this point, we could look through the list to manually extract any remaining `age` info, which would likely be the fastest approach.
- We will instead take a programmatic approach, for the sake of the exercise.
- `info_parenth`, and `info_3` and beyond are the remaining columns to search.
- We see that COVID-19 appears often and the number 19 could be mistakenly extracted as an age.  
- Let us start by moving it to a new column `cause_of_death`.
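
The risk noted above is easy to reproduce with the single-integer age pattern used earlier in this notebook. A minimal sketch on one toy string (the variable names are illustrative only):

```python
import re

AGE_PATTERN = r"\b(\d{1,3})\b"

value = "complications from COVID-19"

# The hyphen is a word boundary, so "19" is matched as if it were an age
false_hit = re.search(AGE_PATTERN, value).group(1)

# Removing "COVID-19" first leaves nothing for the age pattern to match
cleaned = re.sub(r"COVID-19", "", value).strip()
after_cleaning = re.search(AGE_PATTERN, cleaned)

print(false_hit, cleaned, after_cleaning)
```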

#### Extracting "COVID-19" from Remaining `info` Sub-columns for Rows with Missing `age` Value

In [18]:
# List of columns to check
cols_to_check = [
    "info_parenth",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Dataframe to check
dataframe = df[df["age"].isna()]

# Pattern for re
pattern = r"(COVID-19)"

# For loop to collect indices of all rows with pattern
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        dataframe[dataframe[column].notna()], column, pattern
    )
    comb_rows_to_check += rows_to_check

# Checking sample of rows
df.loc[comb_rows_to_check, :].sample(2)

There are 1 rows with matching pattern in column 'info_parenth'.
There are 30 rows with matching pattern in column 'info_3'.
There are 2 rows with matching pattern in column 'info_4'.
There are 1 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
123455,2,Mohamed Nejib Berriche,", Tunisian politician, deputy , COVID-19.",https://en.wikipedia.org/wiki/Mohamed_Nejib_Berriche,3,2021,July,(–),Tunisian politician,deputy,COVID-19,,,,,,,,,
119194,12,Tohami Khaled,", Libyan military officer, head of the Internal Security Agency, COVID-19.",https://en.wikipedia.org/wiki/Tohami_Khaled,9,2021,February,,Libyan military officer,head of the Internal Security Agency,COVID-19,,,,,,,,,



In [19]:
# For loop to extract COVID-19 from remaining info columns for entries with missing age
for column in cols_to_check:
    for index in comb_rows_to_check:
        item = df.loc[index, column]
        # Skip missing values (None/NaN), which re.search cannot handle
        if isinstance(item, str):
            match = re.search(pattern, item)
            if match:
                cause = match.group(1)
                df.loc[index, "cause_of_death"] = cause
                df.loc[index, column] = re.sub(pattern, "", item).strip()

# Re-checking number of rows matching pattern in each checked column
treated = df.loc[comb_rows_to_check, :]
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        treated[treated[column].notna()], column, pattern
    )

# Recheck a sample of treated rows
df.loc[comb_rows_to_check, :].sample(2)

There are 0 rows with matching pattern in column 'info_parenth'.
There are 0 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
121426,21,Erasmo Vásquez,", Dominican physician and politician, minister of public health , COVID-19.",https://en.wikipedia.org/wiki/Erasmo_V%C3%A1squez,3,2021,April,(–),Dominican physician and politician,minister of public health,,,,,,,,,,,COVID-19
111198,30,Suleiman Adamu,", Nigerian politician, member of the Nasarawa State House of Assembly, COVID-19.",https://en.wikipedia.org/wiki/Suleiman_Adamu,5,2020,April,,Nigerian politician,member of the Nasarawa State House of Assembly,,,,,,,,,,,COVID-19



#### Observations:
- With "COVID-19" put aside, we can check for any remaining digits in the same columns for these entries.
- The new column `cause_of_death` has been added.

#### Checking Remaining `info` Columns for Remaining Digits

In [20]:
# List of columns to check
cols_to_check = [
    "info_parenth",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Dataframe to check
dataframe = df[df["age"].isna()]

# Pattern for re
pattern = r"\d"

# For loop to collect indices of all rows with pattern
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        dataframe[dataframe[column].notna()], column, pattern
    )
    comb_rows_to_check += rows_to_check

# Examining the rows directly
df.loc[comb_rows_to_check, :]

There are 2 rows with matching pattern in column 'info_parenth'.
There are 0 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
22855,10,Little Eva,", .",https://en.wikipedia.org/wiki/Little_Eva,14,2003,April,"(née Eva Narcissus Boyd), 59, American pop singer ()",,,,,,,,,,,,,
126710,31,Simon Young,", Irish radio presenter .",https://en.wikipedia.org/wiki/Simon_Young_(presenter),9,2021,October,(RTÉ 2fm),Irish radio presenter,,,,,,,,,,,,



#### Observations:
- There are only two entries that still potentially contain digits for the age data.
- Here, we see that there is one entry we can preserve that has an age value in `info_parenth`.
- The other entry has a radio station identification value and is missing age data.
- After we collect this last age, we will drop the remaining entries missing `age`.

#### Extracting `age` from `info_parenth`

In [21]:
# Column to check
column = "info_parenth"

# Dataframe to check
dataframe = df[(df["age"].isna()) & (df[column].notna())]

# Pattern for re
pattern = r"\b(\d{1,3})\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking row
df.loc[rows_to_check, :]

There are 1 rows with matching pattern in column 'info_parenth'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
22855,10,Little Eva,", .",https://en.wikipedia.org/wiki/Little_Eva,14,2003,April,"(née Eva Narcissus Boyd), 59, American pop singer ()",,,,,,,,,,,,,



In [22]:
# For loop to extract age pattern to age column
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[index, "age"] = age
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column]).strip()

# Re-checking number of rows matching pattern
treated = df.loc[rows_to_check, :]
recheck_rows = rows_with_pattern(treated[treated[column].notna()], column, pattern)

# Rechecking treated row
df.loc[rows_to_check, :]

There are 0 rows with matching pattern in column 'info_parenth'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
22855,10,Little Eva,", .",https://en.wikipedia.org/wiki/Little_Eva,14,2003,April,"(née Eva Narcissus Boyd), , American pop singer ()",,,,,,,,,,,,59.0,



#### Observations:
- All of the age data has been captured and it's time to drop the remaining entries with missing values for `age`.

#### Dropping the Last Entries with Missing `age` Values

In [23]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')

There are 35 missing values for age.



In [24]:
# Dropping rows, resetting index, and checking new shape of df
df.dropna(subset=["age"], inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132393, 21)


In [25]:
# Checking current info status
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132393 entries, 0 to 132392
Data columns (total 21 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132393 non-null  object 
 1   name            132393 non-null  object 
 2   info            132393 non-null  object 
 3   link            132393 non-null  object 
 4   num_references  132393 non-null  object 
 5   year            132393 non-null  int64  
 6   month           132393 non-null  object 
 7   info_parenth    49752 non-null   object 
 8   info_1          132393 non-null  object 
 9   info_2          132371 non-null  object 
 10  info_3          62537 non-null   object 
 11  info_4          12584 non-null   object 
 12  info_5          1504 non-null    object 
 13  info_6          217 non-null     object 
 14  info_7          33 non-null      object 
 15  info_8          7 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   


#### Observations:
- We have 132,393 entries containing the target variable `age`.
- Some of these rows may represent groups or members of non-human species, as we have observed previously.
- We have been replacing extracted values with empty strings. Before moving forward, let us replace these empty strings with NaN, as it will simplify slicing the dataframe.
- Then it will be time to search for nationality.
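
To illustrate why converting empty strings to NaN simplifies slicing, here is a small sketch on a toy frame (the frame and its values are made up for illustration). After the replacement, `isna()` and `notna()` cover empty, whitespace-only, and missing values alike.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only: empty strings, whitespace, and a missing value
toy = pd.DataFrame(
    {
        "info_1": ["", "Jr", "  "],
        "info_2": ["British actor", "", None],
    }
)

# Empty or whitespace-only strings become NaN; other values are untouched
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

missing = cleaned.isna().sum().to_dict()
rows_with_info_1 = cleaned[cleaned["info_1"].notna()].index.tolist()
```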

#### Replacing Empty Strings with NaN

In [26]:
# Replacing empty strings with NaN
df = df.replace(r"^\s*$", np.nan, regex=True)


In [27]:
# Checking the NaN values per column
df.isna().sum()

day                    0
name                   0
info                   0
link                   0
num_references         0
year                   0
month                  0
info_parenth       82641
info_1            132363
info_2                48
info_3             70065
info_4            119852
info_5            130895
info_6            132179
info_7            132362
info_8            132387
info_9            132392
info_10           132392
info_11           132392
age                    0
cause_of_death    132393
dtype: int64


## Extracting Nationality Data

In [28]:
# Checking a sample
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
24100,30,Mike Yaconelli,", 61, American youth minister, magazine editor and writer, co-founder of Youth Specialties.",https://en.wikipedia.org/wiki/Mike_Yaconelli,4,2003,October,(),,American youth minister,magazine editor and writer,co-founder of Youth Specialties,,,,,,,,61.0,
88551,1,Alice Langtry,", 84, American politician, member of the Pennsylvania House of Representatives .",https://en.wikipedia.org/wiki/Alice_Langtry,4,2017,May,(–),,American politician,member of the Pennsylvania House of Representatives,,,,,,,,,84.0,
4496,4,Else Brems,", 87, Danish contralto.",https://en.wikipedia.org/wiki/Else_Brems,5,1995,October,,,Danish contralto,,,,,,,,,,87.0,
131450,24,Denis Kiwanuka Lote,", 84, Ugandan Roman Catholic prelate, bishop of Kotido .",https://en.wikipedia.org/wiki/Denis_Kiwanuka_Lote,5,2022,April,(–) and archbishop of Tororo (–),,Ugandan Roman Catholic prelate,bishop of Kotido,,,,,,,,,84.0,
34221,22,Karl Hardman,", 80, American horror film producer and actor.",https://en.wikipedia.org/wiki/Karl_Hardman,169,2007,September,,,American horror film producer and actor,,,,,,,,,,80.0,



#### Observations:
- `info_2` appears overall consistent with the Wikipedia field that combines "citizenship" and "known for".
- The first word does appear to represent the nationality, typically.
- Recall that this information is in `info_1` for some entries and may also be in other `info` columns beyond `info_2`.
- Running the sample check a few times reveals that when there are two citizenships (first and second), they may be in the formats of "nationality1-born nationality2", "nationality1-nationality2", or "nationality1 nationality2".
- There are nationalities that have multiple words and there are also additional capitalized words that are not part of the nationality.
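
One way the formats listed above might be handled is to peel known nationalities off the front of the value, longest candidate first, until a word that is not a nationality is reached. This is only a sketch under assumptions: the `KNOWN` set and the function name `leading_nationalities` are hypothetical, and it is not necessarily the approach taken later in the notebook.

```python
import re

# Hypothetical set of known nationality strings, including a multi-word one
KNOWN = {"American", "British", "Irish", "New Zealand"}


def leading_nationalities(text, known=KNOWN):
    """Collect known nationalities found at the start of text, in order."""
    found = []
    rest = text.strip()
    # Try longer names first so multi-word nationalities win over shorter ones
    candidates = sorted(known, key=len, reverse=True)
    while rest:
        for nationality in candidates:
            # Allow an optional "-born" suffix; require a space, hyphen, or end next
            match = re.match(rf"{re.escape(nationality)}(?:-born)?(?=[\s-]|$)", rest)
            if match:
                found.append(nationality)
                rest = rest[match.end():].lstrip(" -")
                break
        else:
            # No known nationality at the front: stop scanning
            break
    return found


examples = [
    leading_nationalities("Irish-born American actor"),
    leading_nationalities("British-American writer"),
    leading_nationalities("New Zealand rugby player"),
    leading_nationalities("Americana singer"),
]
```

The lookahead keeps capitalized words that merely begin with a nationality, such as "Americana", from being counted.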

### Dictionary of Country: Nationality Values
- It would be helpful to have a comprehensive list of countries and their corresponding nationality values.
- To this end, the table from [A List of Nationalities - WorldAtlas](https://www.worldatlas.com/articles/what-is-a-demonym-a-list-of-nationalities.html) was scraped by Spider, ["nations"](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wikipedia_life_expectancy/spiders/nations.py), obtaining both values, which were saved to [nations.csv](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/nations.csv).
- We will then compare the scraped values against a separate list, [nationalities.txt](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/nationalities.txt), downloaded from marijn's GitHub gist, [List of nationalities](https://gist.github.com/marijn/274449), and add any nationality values that the scraped information is missing.

In [29]:
# Reading in nations.csv
country_nation = pd.read_csv("nations.csv", index_col=False)

# Checking shape
print(country_nation.shape)

# Checking first 2 rows
country_nation.head(2)

(202, 2)


Unnamed: 0,country,nationality
0,Afghanistan,Afghan
1,Albania,Albanian



In [30]:
# Checking last 2 rows
country_nation.tail(2)

Unnamed: 0,country,nationality
200,Zambia,Zambian
201,Zimbabwe,Zimbabwean



In [31]:
# Converting country_nation to a dictionary
country_nation_dict = {}
for i in country_nation.index:
    country = country_nation.loc[i, "country"]
    nationality = country_nation.loc[i, "nationality"]
    country_nation_dict[country] = nationality

# Exploring dictionary
country_nation_dict

{'Afghanistan': 'Afghan',
 'Albania': 'Albanian',
 'Algeria': 'Algerian',
 'Andorra': 'Andorran',
 'Angola': 'Angolan',
 'Antigua and Barbuda': 'Antiguan or Barbudan',
 'Argentina': 'Argentine',
 'Armenia': 'Armenian',
 'Australia': 'Australian',
 'Austria': 'Austrian',
 'Azerbaijan': 'Azerbaijani, Azeri',
 'The Bahamas ': 'Bahamian',
 'Bahrain': 'Bahraini',
 'Bangladesh': 'Bengali',
 'Barbados': 'Barbadian',
 'Belarus': 'Belarusian',
 'Belgium': 'Belgian',
 'Belize': 'Belizean',
 'Benin': 'Beninese, Beninois',
 'Bhutan': 'Bhutanese',
 'Bolivia': 'Bolivian',
 'Bosnia and Herzegovina': 'Bosnian or Herzegovinian',
 'Botswana': 'Motswana, Botswanan',
 'Brazil': 'Brazilian',
 'Brunei': 'Bruneian',
 'Bulgaria': 'Bulgarian',
 'Burkina Faso': 'Burkinabé',
 'Burma[2]': 'Burmese',
 'Burundi': 'Burundian',
 'Cabo Verde[3]': 'Cabo Verdean',
 'Cambodia': 'Cambodian',
 'Cameroon': 'Cameroonian',
 'Canada': 'Canadian',
 'Central African Republic': 'Central African',
 'Chad': 'Chadian',
 'Chile': 'Ch


#### Observations:
- Some countries have more than one associated nationality, separated with ',' or 'or'.
- It would be beneficial to split the values into lists and reverse the order of the keys and values, so that each nationality is a key with its country as a value.
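
As an illustration of the split-and-invert idea, the sketch below does both steps in one pass on a toy dictionary. The names `sample` and `invert_nationalities` are hypothetical, and the regular expression is an assumption about the separators seen so far (a comma, or the standalone word "or").

```python
import re

# Toy country -> nationality pairs in the same shape as the scraped table
sample = {
    "Antigua and Barbuda": "Antiguan or Barbudan",
    "Azerbaijan": "Azerbaijani, Azeri",
    "Ecuador": "Ecuadorian",
}


def invert_nationalities(country_to_nationality):
    """Map each individual nationality to its country."""
    inverted = {}
    for country, value in country_to_nationality.items():
        # Split on a comma or on "or" surrounded by whitespace
        for nationality in re.split(r"\s*,\s*|\s+or\s+", value):
            nationality = nationality.strip()
            if nationality:
                inverted[nationality] = country
    return inverted


print(invert_nationalities(sample))
```

Requiring whitespace around "or" means a value such as "Ecuadorian", which merely contains the letters "or", is left whole.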

In [32]:
# Converting dictionary values to lists and splitting where there is more than one value for nationality
for key, value in country_nation_dict.items():
    if "," in value:
        country_nation_dict[key] = value.split(", ")
    elif " or " in value:
        country_nation_dict[key] = value.split(" or ")
    else:
        country_nation_dict[key] = [value]

# Exploring dictionary
country_nation_dict

{'Afghanistan': ['Afghan'],
 'Albania': ['Albanian'],
 'Algeria': ['Algerian'],
 'Andorra': ['Andorran'],
 'Angola': ['Angolan'],
 'Antigua and Barbuda': ['Antiguan', 'Barbudan'],
 'Argentina': ['Argentine'],
 'Armenia': ['Armenian'],
 'Australia': ['Australian'],
 'Austria': ['Austrian'],
 'Azerbaijan': ['Azerbaijani', 'Azeri'],
 'The Bahamas ': ['Bahamian'],
 'Bahrain': ['Bahraini'],
 'Bangladesh': ['Bengali'],
 'Barbados': ['Barbadian'],
 'Belarus': ['Belarusian'],
 'Belgium': ['Belgian'],
 'Belize': ['Belizean'],
 'Benin': ['Beninese', 'Beninois'],
 'Bhutan': ['Bhutanese'],
 'Bolivia': ['Bolivian'],
 'Bosnia and Herzegovina': ['Bosnian', 'Herzegovinian'],
 'Botswana': ['Motswana', 'Botswanan'],
 'Brazil': ['Brazilian'],
 'Brunei': ['Bruneian'],
 'Bulgaria': ['Bulgarian'],
 'Burkina Faso': ['Burkinabé'],
 'Burma[2]': ['Burmese'],
 'Burundi': ['Burundian'],
 'Cabo Verde[3]': ['Cabo Verdean'],
 'Cambodia': ['Cambodian'],
 'Cameroon': ['Cameroonian'],
 'Canada': ['Canadian'],
 'Central


#### Observations:
- We have separated the distinct nationalities, but there is some trailing whitespace that we need to remove when inversely mapping the dictionary.
- Some of the current dictionary's keys also contain double quotation marks and bracketed digits, which we can remove now.
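
For reference, here is a minimal sketch of a targeted way to remove the bracketed footnote markers. The helper name `clean_country_name` is hypothetical. `str.strip` with a character set removes any of the listed characters from both ends of the string, whereas this regular expression removes only a `[digits]` marker, wherever it occurs.

```python
import re


def clean_country_name(raw):
    """Remove bracketed footnote markers, double quotes, and outer whitespace."""
    without_notes = re.sub(r"\[\d+\]", "", raw)
    return without_notes.replace('"', "").strip()


for raw in ["Burma[2]", "The Bahamas ", '"Cabo Verde[3]"']:
    print(repr(clean_country_name(raw)))
```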

### Inverse Mapping of Dictionary to Nationality: Country

In [33]:
# Inverse mapping of dictionary, stripping extra characters from keys and values
nation_country_dict = {}
for key, value in country_nation_dict.items():
    for item in value:
        new_key = item.strip()
        nation_country_dict[new_key] = key.strip(' "[]0123456789')

# Exploring dictionary
nation_country_dict

{'Afghan': 'Afghanistan',
 'Albanian': 'Albania',
 'Algerian': 'Algeria',
 'Andorran': 'Andorra',
 'Angolan': 'Angola',
 'Antiguan': 'Antigua and Barbuda',
 'Barbudan': 'Antigua and Barbuda',
 'Argentine': 'Argentina',
 'Armenian': 'Armenia',
 'Australian': 'Australia',
 'Austrian': 'Austria',
 'Azerbaijani': 'Azerbaijan',
 'Azeri': 'Azerbaijan',
 'Bahamian': 'The Bahamas',
 'Bahraini': 'Bahrain',
 'Bengali': 'Bangladesh',
 'Barbadian': 'Barbados',
 'Belarusian': 'Belarus',
 'Belgian': 'Belgium',
 'Belizean': 'Belize',
 'Beninese': 'Benin',
 'Beninois': 'Benin',
 'Bhutanese': 'Bhutan',
 'Bolivian': 'Bolivia',
 'Bosnian': 'Bosnia and Herzegovina',
 'Herzegovinian': 'Bosnia and Herzegovina',
 'Motswana': 'Botswana',
 'Botswanan': 'Botswana',
 'Brazilian': 'Brazil',
 'Bruneian': 'Brunei',
 'Bulgarian': 'Bulgaria',
 'Burkinabé': 'Burkina Faso',
 'Burmese': 'Burma',
 'Burundian': 'Burundi',
 'Cabo Verdean': 'Cabo Verde',
 'Cambodian': 'Cambodia',
 'Cameroonian': 'Cameroon',
 'Canadian': 'Ca


#### Observations:
- Let us add 'US' as an additional key for United States of America.

In [34]:
# Adding US as key for United States of America
nation_country_dict["US"] = "United States of America"

# Checking value
nation_country_dict["US"]

'United States of America'


#### Observations:
- We have a solid foundation for our dictionary containing nationality keys and country values.
- Let us compare it to the existing list of nationalities, [nationalities.txt](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/nationalities.txt), downloaded from marijn's github repository, [List of nationalities](https://gist.github.com/marijn/274449).

### Comparison of Nationality: Country Dictionary with Existing Nationalities List

In [35]:
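# Reading in nationalities.txt, one nationality per line
# (the "/n" separator never occurs in the file, so each line is read as a single field)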
nation_lst = pd.read_csv(
    "nationalities.txt", sep="/n", engine="python", names=["Nationality"]
)

# Checking shape
print(nation_lst.shape)

# Checking first 5 rows
nation_lst.head()

(194, 1)


Unnamed: 0,Nationality
0,Afghan
1,Albanian
2,Algerian
3,American
4,Andorran



In [36]:
# Checking last 5 rows
nation_lst.tail()

Unnamed: 0,Nationality
189,Vietnamese
190,Welsh
191,Yemenite
192,Zambian
193,Zimbabwean



In [37]:
# Converting nation_lst to a list
nation_lst = nation_lst["Nationality"].to_list()
nation_lst

['Afghan',
 'Albanian',
 'Algerian',
 'American',
 'Andorran',
 'Angolan',
 'Antiguans',
 'Argentinean',
 'Armenian',
 'Australian',
 'Austrian',
 'Azerbaijani',
 'Bahamian',
 'Bahraini',
 'Bangladeshi',
 'Barbadian',
 'Barbudans',
 'Batswana',
 'Belarusian',
 'Belgian',
 'Belizean',
 'Beninese',
 'Bhutanese',
 'Bolivian',
 'Bosnian',
 'Brazilian',
 'British',
 'Bruneian',
 'Bulgarian',
 'Burkinabe',
 'Burmese',
 'Burundian',
 'Cambodian',
 'Cameroonian',
 'Canadian',
 'Cape Verdean',
 'Central African',
 'Chadian',
 'Chilean',
 'Chinese',
 'Colombian',
 'Comoran',
 'Congolese',
 'Costa Rican',
 'Croatian',
 'Cuban',
 'Cypriot',
 'Czech',
 'Danish',
 'Djibouti',
 'Dominican',
 'Dutch',
 'East Timorese',
 'Ecuadorean',
 'Egyptian',
 'Emirian',
 'Equatorial Guinean',
 'Eritrean',
 'Estonian',
 'Ethiopian',
 'Fijian',
 'Filipino',
 'Finnish',
 'French',
 'Gabonese',
 'Gambian',
 'Georgian',
 'German',
 'Ghanaian',
 'Greek',
 'Grenadian',
 'Guatemalan',
 'Guinea-Bissauan',
 'Guinean',
 'Gu


#### Observations:
- There are two nationalities in the current list that might appear differently in our dataset: 'Kittian and Nevisian' and 'Trinidadian or Tobagonian'. We will add their individual parts to the list before comparing it to the dictionary keys.

In [38]:
# Adding nationality values to nationality list
nation_lst += ["Kittian", "Nevisian", "Trinidadian", "Tobagonian"]


#### List of Nationalities in `nation_lst` but not in `nation_country_dict`

In [39]:
# Creating a list of nationalities in nation_lst but not in nation_country_dict
missing_keys = [item for item in nation_lst if item not in nation_country_dict]

# Checking the values and their count
print(f"There are {len(missing_keys)} additional nationality values.")
missing_keys

There are 29 additional nationality values.


['Antiguans',
 'Argentinean',
 'Bangladeshi',
 'Barbudans',
 'Batswana',
 'Burkinabe',
 'Cape Verdean',
 'Djibouti',
 'East Timorese',
 'Ecuadorean',
 'Guinea-Bissauan',
 'Icelander',
 'Kittian and Nevisian',
 'Luxembourger',
 'Mosotho',
 'New Zealander',
 'Northern Irish',
 'San Marinese',
 'Sao Tomean',
 'Scottish',
 'Slovakian',
 'Solomon Islander',
 'Surinamer',
 'Taiwanese',
 'Tajik',
 'Trinidadian or Tobagonian',
 'Welsh',
 'Yemenite',
 'Kittian']


#### Observations:
- There are 29 additional nationality values in marijn's list that we will hard-code, since any variation we can capture will facilitate extracting the nationality value from the `info` columns.
- Where versions of nationality contain parts that might be used alone, a key for that partial value will also be added.
- Additional keys may also be added if other variations are encountered while searching the Internet for country names.
- The intent here is not to enforce strictly correct usage, but to capture usage broadly enough to match the multiple variations in the Wikipedia inputs.
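
Some of the variations differ only in accents or letter case (for example, 'Burkinabe' versus 'Burkinabé'). As a possible complement to hard-coding, the sketch below folds accents and case before lookup. The helper `fold_key` and the toy dictionary are hypothetical, and this normalization is not applied in this notebook.

```python
import unicodedata


def fold_key(text):
    """Casefold text and drop combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Toy nationality -> country pairs, re-keyed by their folded form
sample = {"Burkinabé": "Burkina Faso", "Afghan": "Afghanistan"}
folded = {fold_key(key): value for key, value in sample.items()}

# Accented and unaccented spellings resolve to the same entry
print(folded[fold_key("Burkinabe")])
print(folded[fold_key("Burkinabé")])
```

A trade-off of this approach is that folding case discards the capitalization that otherwise helps distinguish nationalities from ordinary words.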

#### Adding Nationality: Country Pairs to Dictionary from Additional Values in `nation_lst`

In [40]:
# Hard-coding nationality: country pairs that were in nation_lst but not in nation_country_dict

nation_country_dict["Antiguans"] = "Antigua and Barbuda"
nation_country_dict["Argentinean"] = "Argentina"
nation_country_dict["Bangladeshi"] = "Bangladesh"
nation_country_dict["Barbudans"] = "Antigua and Barbuda"
nation_country_dict["Batswana"] = "Botswana"
nation_country_dict["Burkinabe"] = "Burkina Faso"
nation_country_dict["Cape Verdean"] = "Cabo Verde"
nation_country_dict["Verdean"] = "Cabo Verde"
nation_country_dict["Djibouti"] = "Djibouti"
nation_country_dict["East Timorese"] = "Timor-Leste"

nation_country_dict["Ecuadorean"] = "Ecuador"
nation_country_dict["Guinea-Bissauan"] = "Guinea-Bissau"
nation_country_dict["Guinea"] = "Guinea-Bissau"
nation_country_dict["Bissau"] = "Guinea-Bissau"
nation_country_dict["Icelander"] = "Iceland"
nation_country_dict["Kittian and Nevisian"] = "Saint Kitts and Nevis"
nation_country_dict["Kittian"] = "Saint Kitts and Nevis"
nation_country_dict["Nevisian"] = "Saint Kitts and Nevis"
nation_country_dict["Luxembourger"] = "Luxembourg"
nation_country_dict["Mosotho"] = "Lesotho"
nation_country_dict["Basotho"] = "Lesotho"  # plural
nation_country_dict["New Zealander"] = "New Zealand"
nation_country_dict[
    "Northern Irish"
] = "United Kingdom of Great Britain and Northern Ireland"
nation_country_dict["San Marinese"] = "San Marino"
nation_country_dict["Marinese"] = "San Marino"
nation_country_dict["Sao Tomean"] = "São Tomé and Príncipe"
nation_country_dict["Scottish"] = "Scotland"
nation_country_dict["Scot"] = "Scotland"
nation_country_dict["Slovakian"] = "Slovakia"
nation_country_dict["Slovenka"] = "Slovakia"  # feminine
nation_country_dict["Solomon Islander"] = "Solomon Islands"
nation_country_dict["Surinamer"] = "Suriname"
nation_country_dict["Taiwanese"] = "Taiwan"  # part of the Republic of China
nation_country_dict[
    "Tajik"
] = "Tajikistan"  # appears in nationality list but is an ethnicity of multiple countries
nation_country_dict["Trinidadian or Tobagonian"] = "Trinidad and Tobago"
nation_country_dict["Welsh"] = "Wales"  # part of the United Kingdom
nation_country_dict["Yemenite"] = "Yemen"


#### Observations:
- If we find additional variations for nationality, we can further update the dictionary.
- Next, we can begin to extract nationality information, starting with `info_1`, as the values were intended to be entered from left to right on the Wikipedia site.
- But first, we should save our dataframe to a SQLite database, and start a new notebook.
- We also will be needing an exported copy of `nation_country_dict` in the next notebook.

### Exporting Dataset to SQLite Database [wp_life_expect_clean2.db](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean1.db)

In [41]:
# Saving the cleaned dataset in a SQLite database
conn = sql.connect("wp_life_expect_clean2.db")
df.to_sql("wp_life_expect_clean2", conn, index=False)

132393
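
As a quick sanity check on this kind of export, the sketch below writes a toy dataframe to an in-memory SQLite database and reads it back. The table name `sample_table` and the toy data are hypothetical. By default, `to_sql` raises a `ValueError` if the table already exists, so `if_exists="replace"` is one way to make such a cell safe to rerun.

```python
import sqlite3

import pandas as pd

# Toy dataframe standing in for the cleaned dataset
sample_df = pd.DataFrame({"name": ["A", "B"], "age": [86.0, 68.0]})

# An in-memory database stands in for the exported .db file
conn = sqlite3.connect(":memory:")
try:
    # Writing twice shows that "replace" allows the export to be rerun
    sample_df.to_sql("sample_table", conn, index=False, if_exists="replace")
    sample_df.to_sql("sample_table", conn, index=False, if_exists="replace")

    # Reading the table back to confirm the shape matches what was written
    restored = pd.read_sql("SELECT * FROM sample_table", conn)
finally:
    conn.close()

print(restored.shape == sample_df.shape)
```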


### Saving `nation_country_dict` to a Pickle File

In [42]:
# Write the dictionary to a binary pickle file
# (the with block closes the file automatically)
with open("nation_country_dict.pkl", "wb") as f:
    pickle.dump(nation_country_dict, f)
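
For reference, a pickled dictionary can be loaded again with `pickle.load`. The sketch below is a minimal round trip on a toy dictionary; the temporary path is a stand-in for `nation_country_dict.pkl`, and the names `sample_dict` and `restored` are hypothetical.

```python
import os
import pickle
import tempfile

# Toy dictionary standing in for nation_country_dict
sample_dict = {"Afghan": "Afghanistan", "US": "United States of America"}

# A temporary path stands in for "nation_country_dict.pkl"
path = os.path.join(tempfile.mkdtemp(), "sample_dict.pkl")

# Write the dictionary in binary mode
with open(path, "wb") as f:
    pickle.dump(sample_dict, f)

# Read it back in binary mode
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == sample_dict)
```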


# [Proceed to Notebook 4 of 4: Data Cleaning Part 3](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)