# Wikipedia Notable Life Expectancies

# [Notebook 3 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview


### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean1.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean1", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 132445 rows and 20 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
132443,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0
132444,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
110675,14,Ignacio Pichardo Pagaza,", 84, Mexican politician, Governor of the State of Mexico , complications from surgery.",https://en.wikipedia.org/wiki/Ignacio_Pichardo_Pagaza,2,2020,April,(–) and President of the Institutional Revolutionary Party (),,Mexican politician,Governor of the State of Mexico,complications from surgery,,,,,,,,84.0
19621,18,Marcelle Tassencourt,", 87, French actress and theatre director.",https://en.wikipedia.org/wiki/Marcelle_Tassencourt,0,2001,December,,,French actress and theatre director,,,,,,,,,,87.0
48986,30,Denis McLean,", 80, New Zealand diplomat, academic, author and civil servant.",https://en.wikipedia.org/wiki/Denis_McLean,3,2011,March,,,New Zealand diplomat,academic,author and civil servant,,,,,,,,80.0
17660,27,Robert Lee Massie,", 59, American convicted murderer, execution by lethal injection.",https://en.wikipedia.org/wiki/Robert_Lee_Massie,14,2001,March,,,American convicted murderer,execution by lethal injection,,,,,,,,,59.0
114107,16,Aisultan Nazarbayev,", 29, Kazakh footballer and sporting executive.",https://en.wikipedia.org/wiki/Aisultan_Nazarbayev,22,2020,August,,,Kazakh footballer and sporting executive,,,,,,,,,,29.0



#### Observations:
- There are currently 132,445 rows and 20 columns.

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132445 entries, 0 to 132444
Data columns (total 20 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132445 non-null  object 
 1   name            132445 non-null  object 
 2   info            132445 non-null  object 
 3   link            132445 non-null  object 
 4   num_references  132445 non-null  object 
 5   year            132445 non-null  int64  
 6   month           132445 non-null  object 
 7   info_parenth    49788 non-null   object 
 8   info_1          132445 non-null  object 
 9   info_2          132421 non-null  object 
 10  info_3          62570 non-null   object 
 11  info_4          12587 non-null   object 
 12  info_5          1505 non-null    object 
 13  info_6          217 non-null     object 
 14  info_7          33 non-null      object 
 15  info_8          7 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- Picking up where we left off, we will aim to track down the remaining missing values for `age`, starting with searching for digits in `info_2`.

### Remaining Missing Values for Age

In [6]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')

There are 79 missing values for age.



#### Function to Save Indices of Rows Matching Regular Expressions Pattern to a List and Print Number of Rows with Match

In [7]:
# Define a function that takes dataframe, column name, and re pattern as arguments and returns list of indices
# for which column value matches re pattern
def rows_with_pattern(dataframe, column, pattern):
    """
    Takes input of dataframe, column name, and re pattern 
    and returns list of indices for rows that contain match
    for pattern anywhere within value for given column.
    
    dataframe: dataframe
    column: column name
    pattern: re pattern
    """
    index_list = []

    for i in dataframe.index:
        item = dataframe.loc[i, column]
        match = re.search(pattern, item)
        if match:
            index_list.append(i)
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list


#### Function to Use rows_with_pattern Function for Multiple Regular Expression Patterns

In [8]:
# Define a function that calls rows_with_pattern function for multiple re patterns
# returning a single list of indices for all rows with any pattern match


def multiple_patterns(dataframe, column, patterns):
    """
    Takes input dataframe, column, and list of re patterns and returns single list 
    of indices for rows in which a match for any pattern is found with re.search
    
    dataframe: dataframe
    column: column name
    patterns: list of re patterns
    """
    rows_combined = []

    # For loop to check each pattern
    for pattern in patterns:

        # List and number of rows matching each pattern
        print(pattern)
        rows_to_check = rows_with_pattern(dataframe, column, pattern)
        print("")

        # Add list for each pattern to combined list
        rows_combined += rows_to_check

    return rows_combined
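
Because `multiple_patterns` concatenates the per-pattern lists, a row that matches more than one pattern appears in the combined list more than once. A minimal sketch of order-preserving de-duplication follows; the helper name `unique_in_order` and the index values are hypothetical and not part of the notebook.

```python
def unique_in_order(indices):
    """Return indices with duplicates removed, keeping first-seen order."""
    # dict keys keep insertion order (Python 3.7+), so they act as an ordered set
    return list(dict.fromkeys(indices))


# Hypothetical index lists, as if returned for two different patterns
rows_pattern_a = [3, 8, 15]
rows_pattern_b = [8, 21]

rows_combined = rows_pattern_a + rows_pattern_b
print(rows_combined)                   # [3, 8, 15, 8, 21]
print(unique_in_order(rows_combined))  # [3, 8, 15, 21]
```

Duplicates are harmless when the list is only used to select or drop rows, but they would inflate any count taken from the list's length.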


### `info_2`

#### Rows Missing `age` with Digits in `info_2`

In [9]:
# Pattern for re
pattern = r"\d"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Examining the rows directly
df.loc[rows_to_check, :]

There are 43 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
6846,17,Spiro Agnew,", American politician, 77, 39th Vice President of the United States, leukemia.",https://en.wikipedia.org/wiki/Spiro_Agnew,207,1996,September,,American politician,77,39th Vice President of the United States,leukemia,,,,,,,,
12861,14,Muslimgauze,", , 37, British electronic musician, fungal infection.",https://en.wikipedia.org/wiki/Muslimgauze,26,1999,January,(Bryn Jones),,37,British electronic musician,fungal infection,,,,,,,,
14993,15,Željko Ražnatović,", , 47, Serbian mobster and paramilitary leader.",https://en.wikipedia.org/wiki/Arkan,55,2000,January,(aka Arkan),,47,Serbian mobster and paramilitary leader,,,,,,,,,
15062,28,Sarah Caudwell,", , 60, British detective story writer and barrister, cancer.",https://en.wikipedia.org/wiki/Sarah_Caudwell,9,2000,January,(aka Sarah Cockburn),,60,British detective story writer and barrister,cancer,,,,,,,,
16111,30,Max Showalter,", , 83, American actor, composer, pianist, singer, cancer.",https://en.wikipedia.org/wiki/Max_Showalter,14,2000,July,(aka Casey Adams),,83,American actor,composer,pianist,singer,cancer,,,,,
17792,15,Joey Ramone,", , 49, American musician, lead singer for The Ramones, lymphoma.",https://en.wikipedia.org/wiki/Joey_Ramone,25,2001,April,(b. Jeffrey Hyman),,49,American musician,lead singer for The Ramones,lymphoma,,,,,,,
18170,4,Dipendra,", King of Nepal, 29, suicide.",https://en.wikipedia.org/wiki/Dipendra_of_Nepal,8,2001,June,,King of Nepal,29,suicide,,,,,,,,,
19017,28,Mohammad Khalequzzaman,", member of the then National Assembly of Pakistan and Union Minister of Labor, died in 28 September .",https://en.wikipedia.org/wiki/Mohammad_Khalequzzaman,3,2001,September,,member of the then National Assembly of Pakistan and Union Minister of Labor,died in 28 September,,,,,,,,,,
19117,12,Lord Hailsham of St Marylebone,", , 94, British lawyer and politician.","https://en.wikipedia.org/wiki/Quintin_Hogg,_Baron_Hailsham_of_St_Marylebone",17,2001,October,(Quintin Hogg),,94,British lawyer and politician,,,,,,,,,
23752,6,Jules Engel,", Jules Engel, 94, American filmmaker, visual artist, and film director.",https://en.wikipedia.org/wiki/Jules_Engel,10,2003,September,,Jules Engel,94,American filmmaker,visual artist,and film director,,,,,,,



#### Observations:
- Several entries give `age` as a single integer (in years), and two give it as a range of two integers.
- The remaining entries are missing age, but do have digits, so order of processing matters here.
- We can safely remove any of these rows that contains a letter of the alphabet, taking care to select rows only from those that are missing `age` and have a digit in `info_2`.
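
The triage described in these observations can be sketched with the standard library alone. The sample strings below are made up to resemble the `info_2` values shown above, and the `triage` helper is hypothetical; the notebook itself works on the dataframe with `rows_with_pattern`.

```python
import re

# Made-up values resembling info_2 entries for rows missing age
samples = ["77", "74–75", "COVID-19", "died in 28 September", "President of Puntland"]

has_digit = re.compile(r"\d")
has_letter = re.compile(r"[a-zA-Z]")


def triage(value):
    """Label a value as 'no digits', 'text with digits', or 'age candidate'."""
    if not has_digit.search(value):
        return "no digits"
    if has_letter.search(value):
        # Digits embedded in text (e.g. "COVID-19") are not an age
        return "text with digits"
    return "age candidate"


for value in samples:
    print(f"{value!r}: {triage(value)}")
```

Only the values labelled "age candidate" are worth passing to the age-extraction patterns; the "text with digits" values correspond to the rows that are dropped next.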

In [10]:
# Pattern for re
pattern = r"[a-zA-Z]"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df.loc[rows_to_check, :], "info_2", pattern)

# Checking a sample of the rows
df.loc[rows_to_check, :].sample(2)

There are 17 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
111798,21,Kamrun Nahar Putul,", Bangladeshi politician, COVID-19.",https://en.wikipedia.org/wiki/Kamrun_Nahar_Putul,3,2020,May,,Bangladeshi politician,COVID-19,,,,,,,,,,
111267,2,Justa Barrios,", American home care worker and labor organizer, COVID-19.",https://en.wikipedia.org/wiki/Justa_Barrios,9,2020,May,(death announced on this date),American home care worker and labor organizer,COVID-19,,,,,,,,,,



#### Observations:
- We can drop these rows, as they are missing the data for age.
- Extraction of age for single integer value and two integer ranges can follow.
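
The order of the two extractions matters: the single-integer pattern also matches the first number of a range, so ranges need to be handled first. A minimal sketch, reusing the two patterns that appear in the cells below but wrapped in a hypothetical `parse_age` helper that is not part of the notebook:

```python
import re

# Patterns as used in the notebook cells that follow
RANGE = re.compile(r"(\d{1,3})(-|–|/| or )(\d{1,3})")
SINGLE = re.compile(r"\b(\d{1,3})\b")


def parse_age(text):
    """Return age as a float (midpoint for a two-value range), or None."""
    match = RANGE.search(text)
    if match:
        return (int(match.group(1)) + int(match.group(3))) / 2
    match = SINGLE.search(text)
    if match:
        return float(match.group(1))
    return None


print(parse_age("74–75"))  # 74.5
print(parse_age("77"))     # 77.0

# Applying the single-integer pattern to a range keeps only the first number
print(SINGLE.search("74–75").group(1))  # 74
```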

#### Dropping Additional Rows with Age Data Absent

In [11]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_check, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132428, 20)


#### Remaining Rows with `age` values in `info_2`

In [12]:
# Pattern for re (any digit)
pattern = r"\d"

# Finding indices of rows that have pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Checking unique values
df.loc[rows_to_check, :]["info_2"].unique()

There are 26 rows with matching pattern in column 'info_2'.


array(['77', '37', '47', '60', '83', '49', '29', '94', '55', '62', '69',
       '80', '32', '70', '81', '24', '84', '95', '76', '86', '61',
       '74–75', '79–80'], dtype=object)


#### Extracting `age` for Ranges with Two Values

In [13]:
# Pattern for re
pattern = r"(\d{1,3})(-|–|/| or )(\d{1,3})"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

There are 2 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
86907,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(–),Somali politician,79–80,President of Puntland,,,,,,,,,
61915,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,74–75,,,,,,,,,,



In [14]:
# For loop to find rows with values and pattern and calculate and extract age to age column and remove age from info_2
for i in df.loc[rows_to_check, :].index:
    item = df.loc[i, "info_2"]
    match = re.search(pattern, item)
    if match:
        age = (int(match.group(1)) + int(match.group(3))) / 2
        df.loc[i, "age"] = age
        df.loc[i, "info_2"] = re.sub(pattern, "", df.loc[i, "info_2"]).strip()

# Checking example rows
pd.concat([df[df["name"] == "Mohamud Muse Hersi"], df[df["name"] == "Basil Soper"]])

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
86907,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(–),Somali politician,,President of Puntland,,,,,,,,,79.5
61915,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,,,,,,,,,,,74.5



#### Extracting `age` as Single Integer

In [15]:
# Pattern for age given as a single integer
pattern = r"\b(\d{1,3})\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

There are 24 rows with matching pattern in column 'info_2'.



In [16]:
# For loop to extract age pattern to age column
for i in df.loc[rows_to_check, :].index:
    item = df.loc[i, "info_2"]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[i, "age"] = age
        df.loc[i, "info_2"] = re.sub(pattern, "", df.loc[i, "info_2"]).strip()

# Re-checking number of rows matching patterns
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

There are 0 rows with matching pattern in column 'info_2'.



In [17]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')
df[df["age"].isna()].sample(2)

There are 36 missing values for age.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
118488,23,Makhosi Vilakati,", Swazi politician, minister of labour and social security , COVID-19.",https://en.wikipedia.org/wiki/Makhosi_Vilakati,2,2021,January,(since ),Swazi politician,minister of labour and social security,COVID-19,,,,,,,,,
126710,31,Simon Young,", Irish radio presenter .",https://en.wikipedia.org/wiki/Simon_Young_(presenter),9,2021,October,(RTÉ 2fm),Irish radio presenter,,,,,,,,,,,



#### Observations:
- At this point, we could look through the list to manually extract any remaining `age` info, which would likely be the fastest approach.
- We will instead take a programmatic approach, for the sake of the exercise.
- The remaining columns to search are `info_parenth` and `info_3` onward.
- We see that COVID-19 appears often and the number 19 could be mistakenly extracted as an age.  
- Let us start by moving it to a new column `cause_of_death`.
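
A short check, using only the standard library, illustrates the risk: the single-integer age pattern used earlier matches the "19" in "COVID-19", and removing the cause first leaves nothing to mis-extract. The variable names are illustrative only.

```python
import re

AGE = re.compile(r"\b(\d{1,3})\b")
CAUSE = re.compile(r"(COVID-19)")

text = "COVID-19"

# Searching for an age directly picks up the "19"
print(AGE.search(text).group(1))  # 19

# Removing the cause first leaves nothing for the age pattern to match
remaining = CAUSE.sub("", text).strip()
print(repr(remaining))        # ''
print(AGE.search(remaining))  # None
```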

#### Extracting "COVID-19" from Remaining `info` Sub-columns where `age` Missing

In [18]:
# List of columns to check
cols_to_check = [
    "info_parenth",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Pattern for re
pattern = r"(COVID-19)"

# For loop to collect indices of all rows with pattern
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        df[(df["age"].isna()) & (df[column].notna())], column, pattern
    )
    comb_rows_to_check += rows_to_check

# Checking sample of rows
df.loc[comb_rows_to_check, :].sample(2)

There are 1 rows with matching pattern in column 'info_parenth'.
There are 30 rows with matching pattern in column 'info_3'.
There are 2 rows with matching pattern in column 'info_4'.
There are 1 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
112423,14,Tawfiq al-Yasiri,", Iraqi politician, member of the IRDC, COVID-19.",https://en.wikipedia.org/wiki/Tawfiq_al-Yasiri,1,2020,June,,Iraqi politician,member of the IRDC,COVID-19,,,,,,,,,
118043,12,Lingson Belekanyama,", Malawian politician, minister of local government and rural development , COVID-19.",https://en.wikipedia.org/wiki/Lingson_Belekanyama,2,2021,January,(since ),Malawian politician,minister of local government and rural development,COVID-19,,,,,,,,,



In [19]:
# For loop to extract COVID-19 from remaining info columns for entries with missing age
for column in cols_to_check:
    for i in df.loc[comb_rows_to_check, :].index:
        if pd.notna(df.loc[i, column]):
            item = df.loc[i, column]
            match = re.search(pattern, item)
            if match:
                cause = match.group(1)
                df.loc[i, "cause_of_death"] = cause
                df.loc[i, column] = re.sub(pattern, "", df.loc[i, column]).strip()

# Checking sample of rows
df.loc[comb_rows_to_check, :].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
117176,18,Robina Sentongo,", Ugandan politician, MP , COVID-19.",https://en.wikipedia.org/wiki/Robina_Sentongo,2,2020,December,(since ),Ugandan politician,MP,,,,,,,,,,,COVID-19
121426,21,Erasmo Vásquez,", Dominican physician and politician, minister of public health , COVID-19.",https://en.wikipedia.org/wiki/Erasmo_V%C3%A1squez,3,2021,April,(–),Dominican physician and politician,minister of public health,,,,,,,,,,,COVID-19



#### Observations:
- With "COVID-19" put aside, we can check for any remaining digits in the same columns for these entries.

#### Checking Remaining `info` Columns for Remaining Digits

In [20]:
# List of columns to check
cols_to_check = [
    "info_parenth",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Pattern for re
pattern = r"\d"

# For loop to collect indices of all rows with pattern
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        df[(df["age"].isna()) & (df[column].notna())], column, pattern
    )
    comb_rows_to_check += rows_to_check

# Examining the rows directly
df.loc[comb_rows_to_check, :]

There are 2 rows with matching pattern in column 'info_parenth'.
There are 0 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
22855,10,Little Eva,", .",https://en.wikipedia.org/wiki/Little_Eva,14,2003,April,"(née Eva Narcissus Boyd), 59, American pop singer ()",,,,,,,,,,,,,
126710,31,Simon Young,", Irish radio presenter .",https://en.wikipedia.org/wiki/Simon_Young_(presenter),9,2021,October,(RTÉ 2fm),Irish radio presenter,,,,,,,,,,,,



#### Observations:
- There are only two entries that still potentially contain digits for the age data.
- Here, we see that there is one entry we can preserve that has an age value in `info_parenth`.
- The other entry has a radio station identification value and is missing age data.
- After we collect this last age, we will drop the remaining entries missing `age`.
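
The bare digit pattern `\d` flags both entries, but the word-boundary pattern used for single-integer ages distinguishes them, because the "2" in "2fm" is followed directly by a letter. A quick standard-library check on the two `info_parenth` values shown above:

```python
import re

AGE = re.compile(r"\b(\d{1,3})\b")

with_age = "(née Eva Narcissus Boyd), 59, American pop singer ()"
station = "(RTÉ 2fm)"

# "59" stands alone between non-word characters, so it matches
print(AGE.search(with_age).group(1))  # 59

# "2" is immediately followed by "f", so there is no closing word boundary
print(AGE.search(station))            # None
```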

#### Extracting `age` from `info_parenth`

In [21]:
# Pattern for age given as a single integer
pattern = r"\b(\d{1,3})\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_parenth"].notna())], "info_parenth", pattern
)

There are 1 rows with matching pattern in column 'info_parenth'.



In [22]:
# For loop to extract age pattern to age column
for i in df.loc[rows_to_check, :].index:
    item = df.loc[i, "info_parenth"]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[i, "age"] = age
        df.loc[i, "info_parenth"] = re.sub(
            pattern, "", df.loc[i, "info_parenth"]
        ).strip()

# Re-checking number of rows matching patterns
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_parenth"].notna())], "info_parenth", pattern
)

There are 0 rows with matching pattern in column 'info_parenth'.



#### Dropping the Last Entries with Missing `age` Values

In [23]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')

There are 35 missing values for age.



In [24]:
# Dropping rows, resetting index, and checking new shape of df
df.dropna(subset=["age"], inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132393, 21)


In [25]:
# Checking current info status
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132393 entries, 0 to 132392
Data columns (total 21 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132393 non-null  object 
 1   name            132393 non-null  object 
 2   info            132393 non-null  object 
 3   link            132393 non-null  object 
 4   num_references  132393 non-null  object 
 5   year            132393 non-null  int64  
 6   month           132393 non-null  object 
 7   info_parenth    49752 non-null   object 
 8   info_1          132393 non-null  object 
 9   info_2          132371 non-null  object 
 10  info_3          62537 non-null   object 
 11  info_4          12584 non-null   object 
 12  info_5          1504 non-null    object 
 13  info_6          217 non-null     object 
 14  info_7          33 non-null      object 
 15  info_8          7 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   


#### Observations:
- We have 132,393 entries containing the target variable `age`.
- Some of these rows may represent groups or members of non-human species, as we have observed previously.
- We have been replacing extracted values with empty strings. Before moving forward, let us replace these empty strings with NaN, as it will simplify slicing the dataframe.
- Then it will be time to search for nationality.
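
The pattern `^\s*$` used in the next cell matches strings that are empty or contain only whitespace. A plain-Python sketch of the same test, with `None` standing in for NaN and made-up sample values:

```python
import re

BLANK = re.compile(r"^\s*$")

# Made-up values: two blanks and two strings with real content
values = ["", "   ", "author", " British actor "]

# Replace blank or whitespace-only strings with None; leave other values untouched
cleaned = [None if BLANK.match(v) else v for v in values]
print(cleaned)  # [None, None, 'author', ' British actor ']
```

Note that strings with surrounding whitespace but real content are left as they are; only fully blank values are treated as missing.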

#### Replacing Empty Strings with NaN

In [26]:
# Replacing empty strings with NaN
df = df.replace(r"^\s*$", np.nan, regex=True)


In [27]:
# Checking the NaN values per column
df.isna().sum()

day                    0
name                   0
info                   0
link                   0
num_references         0
year                   0
month                  0
info_parenth       82641
info_1            132363
info_2                48
info_3             70065
info_4            119852
info_5            130895
info_6            132179
info_7            132362
info_8            132387
info_9            132392
info_10           132392
info_11           132392
age                    0
cause_of_death    132393
dtype: int64


### Extracting Nationality Data

In [28]:
# Checking a sample
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
129557,8,Dottie Frazier,", 99, American diver.",https://en.wikipedia.org/wiki/Dottie_Frazier,14,2022,February,,,American diver,,,,,,,,,,99.0,
15797,5,Carl-Erik Creutz,", 88, Finnish radio announcer.",https://en.wikipedia.org/wiki/Carl-Erik_Creutz,1,2000,June,,,Finnish radio announcer,,,,,,,,,,88.0,
13779,22,Mervyn de Silva,", 69, Sri Lankan journalist.",https://en.wikipedia.org/wiki/Mervyn_de_Silva,14,1999,June,,,Sri Lankan journalist,,,,,,,,,,69.0,
83458,5,John Alan Robinson,", 86, British philosopher, mathematician and computer scientist.",https://en.wikipedia.org/wiki/John_Alan_Robinson,16,2016,August,,,British philosopher,mathematician and computer scientist,,,,,,,,,86.0,
68988,14,Rodney Thomas,", 41, American football player , heart attack.",https://en.wikipedia.org/wiki/Rodney_Thomas,6,2014,June,"(Texas A&M Aggies, Tennessee Titans)",,American football player,heart attack,,,,,,,,,41.0,



#### Observations:
- `info_2` appears overall consistent with the Wikipedia field that combines "citizenship" and "known for".
- The first word does appear to represent the nationality.
- Recall that this information is in `info_1` for some entries and may also be in other `info` columns beyond `info_2`.
- Running the sample check a few times reveals that, when citizenship changed, the original citizenship may be followed by '-born', then the second citizenship.
- There are nationalities that have multiple words and there are capitalized words that are not part of the nationality.
- We will start with a list of nationalities downloaded from marijn's GitHub gist [List of nationalities](https://gist.github.com/marijn/274449).
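
One way such a list could be used is to test whether a value starts with a known nationality, trying longer names first so that multi-word nationalities are not cut short. This is only a sketch under assumptions: the `leading_nationality` helper is hypothetical, the short list is a made-up stand-in for the downloaded file, and the notebook's own approach may differ.

```python
import re

# Small made-up subset standing in for the full nationalities list
nationalities_subset = ["American", "British", "Finnish", "Sri Lankan"]


def leading_nationality(text, names):
    """Return the longest name that text starts with as a whole word or phrase, else None."""
    for name in sorted(names, key=len, reverse=True):
        # re.escape guards against special characters; \b requires a whole-word match
        if re.match(re.escape(name) + r"\b", text):
            return name
    return None


print(leading_nationality("Sri Lankan journalist", nationalities_subset))     # Sri Lankan
print(leading_nationality("American football player", nationalities_subset))  # American
print(leading_nationality("New Zealand diplomat", nationalities_subset))      # None
```

The last call shows the limitation noted above: a value is only recognized if its exact wording is in the list, so the list may need to be extended with the forms Wikipedia actually uses.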

#### Reading List of Nationalities

In [29]:
# Reading in list of nationalities
nationalities = pd.read_csv(
    "nationalities.txt", sep="/n", engine="python", names=["Nationality"]
)

print(nationalities.shape)
nationalities.head()

(194, 1)


Unnamed: 0,Nationality
0,Afghan
1,Albanian
2,Algerian
3,American
4,Andorran



In [30]:
# Converting nationalities to a list
nationalities_lst = nationalities["Nationality"].to_list()


#### Observations:
- There are 194 nationalities represented in the nationalities list, which we may use for comparison.
- First, let us check entries that have two nationalities listed, either in the form "Nationality_1-born Nationality_2" or "Nationality_1-Nationality_2".
- We will check them separately as the latter may reflect dual nationality, while the former reflects a change in nationality.
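
The two forms can be told apart with the regular expressions used in the cells below. In this sketch the `classify` helper and the "Anglo-Irish writer" string are made up for illustration; the other two strings come from samples shown in this notebook.

```python
import re

# Patterns as used in the notebook cells that follow
BORN = re.compile(r"([A-Z][a-z]*-born\s[A-Z][a-z]*\s)")
HYPHEN = re.compile(r"(^[A-Z][a-z]*-[A-Z][a-z]*)")


def classify(text):
    """Label text as a change in nationality, a hyphenated pair, or neither."""
    if BORN.search(text):
        return "change"
    if HYPHEN.search(text):
        return "hyphenated"
    return "neither"


print(classify("Polish-born British professor of classical guitar"))  # change
print(classify("Anglo-Irish writer"))                                 # hyphenated
print(classify("American diver"))                                     # neither
```

The hyphenated pattern requires a capital letter right after the hyphen, so it does not match the "-born" form on its own.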

#### Extracting Nationality for Entries with Changes in Nationality

In [31]:
# List of columns to check
cols_to_check = [
    "info_1",
    "info_2",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Pattern for re
pattern = r"([A-Z][a-z]*-born\s[A-Z][a-z]*\s)"

# List and number of rows matching patterns
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)
    comb_rows_to_check += rows_to_check

# Checking a sample of rows
df.loc[comb_rows_to_check, :].sample(2)

There are 0 rows with matching pattern in column 'info_1'.
There are 4109 rows with matching pattern in column 'info_2'.
There are 1 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
52400,21,Patrick Bashford,", 82, Polish-born British professor of classical guitar.",https://en.wikipedia.org/wiki/Patrick_Bashford,6,2011,December,,,Polish-born British professor of classical guitar,,,,,,,,,,82.0,
93879,9,Mordechai E. Kreinin,", 88, Israeli-born American economist.",https://en.wikipedia.org/wiki/Mordechai_E._Kreinin,3,2018,February,,,Israeli-born American economist,,,,,,,,,,88.0,



#### Observations:
- For the most part, the pattern captures both nationalities and nothing else.
- We will extract them to a new column `nation_born` that we can divide later.
- Then we can proceed to extracting two-nation values joined by a hyphen.
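
Dividing a `nation_born` value later could look something like the following sketch. The `split_nation_born` helper is hypothetical, and it assumes the two parts are separated by exactly "-born " with a single space.

```python
def split_nation_born(value):
    """Split 'X-born Y' into (birth nationality, later nationality)."""
    # strip() removes the trailing whitespace that the extraction pattern captures
    birth, _, later = value.strip().partition("-born ")
    return birth, later


print(split_nation_born("British-born Canadian "))   # ('British', 'Canadian')
print(split_nation_born("Austrian-born American "))  # ('Austrian', 'American')
```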

In [32]:
# Extracting nation information to new column nation_born
for column in cols_to_check:
    for i in df[df[column].notna()].index:
        item = df.loc[i, column]
        match = re.search(pattern, item)
        if match:
            df.loc[i, "nation_born"] = match.group(1)
            df.loc[i, column] = re.sub(pattern, "", df.loc[i, column]).strip()

# Re-checking number of matching rows and viewing example rows
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)
    comb_rows_to_check += rows_to_check
pd.concat([df[df["name"] == "Dorothy E. Smith"], df[df["name"] == "Walter Abish"]])

There are 0 rows with matching pattern in column 'info_1'.
There are 0 rows with matching pattern in column 'info_2'.
There are 0 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_born
132302,3,Dorothy E. Smith,", 95, British-born Canadian sociologist.",https://en.wikipedia.org/wiki/Dorothy_E._Smith,38,2022,June,,,sociologist,,,,,,,,,,95.0,,British-born Canadian
132161,28,Walter Abish,", 90, Austrian-born American author .",https://en.wikipedia.org/wiki/Walter_Abish,17,2022,May,"(, )",,author,,,,,,,,,,90.0,,Austrian-born American



#### Hyphenated Nationalities

In [33]:
# Pattern for re
pattern = r"(^[A-Z][a-z]*-[A-Z][a-z]*)"

# List and number of rows matching patterns
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)
    comb_rows_to_check += rows_to_check

# Checking a sample of rows
df.loc[comb_rows_to_check, :].sample(2)

There are 1 rows with matching pattern in column 'info_1'.
There are 1866 rows with matching pattern in column 'info_2'.
There are 145 rows with matching pattern in column 'info_3'.
There are 4 rows with matching pattern in column 'info_4'.
There are 1 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_born
87345,28,Pierre Pascau,", 78, Mauritian-Canadian journalist.",https://en.wikipedia.org/wiki/Pierre_Pascau,11,2017,February,,,Mauritian-Canadian journalist,,,,,,,,,,78.0,,
119282,15,Eva Maria Pracht,", 83, German-Canadian equestrian, Olympic bronze medalist , COVID-19.",https://en.wikipedia.org/wiki/Eva_Maria_Pracht,4,2021,February,(),,German-Canadian equestrian,Olympic bronze medalist,COVID-19,,,,,,,,83.0,,



#### Observations:
- This pattern is trickier because other hyphenated values, not only nationalities, also match it.
- We have included the `^` character, which anchors the match to the start of the string, to limit the number of non-nationality values that we extract.
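
As a small illustration of what the `^` anchor does and does not filter out, the sketch below compares the notebook's pattern with an unanchored copy. The example strings are hypothetical, and `unanchored` exists only for comparison.

```python
import re

anchored = r"(^[A-Z][a-z]*-[A-Z][a-z]*)"  # pattern used in the notebook
unanchored = r"([A-Z][a-z]*-[A-Z][a-z]*)"  # same pattern without ^, comparison only

# Hypothetical example values
values = [
    "German-Canadian equestrian",
    "Olympic medalist and Austro-Hungarian officer",
    "Anglo-Catholic priest",
]

results = {}
for value in values:
    a = re.search(anchored, value)
    u = re.search(unanchored, value)
    results[value] = (a.group(1) if a else None, u.group(1) if u else None)
    print(repr(value), "->", results[value])
```

The anchor skips the hyphenated pair in the middle of the second string, but a capitalized hyphenated value at the very start, such as the third string, is still captured, so some non-nationality values can be expected in `nation_dual`.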

In [34]:
# Extracting nation information to new column nation_dual
for column in cols_to_check:
    for i in df[df[column].notna()].index:
        item = df.loc[i, column]
        match = re.search(pattern, item)
        if match:
            df.loc[i, "nation_dual"] = match.group(1)
            df.loc[i, column] = re.sub(pattern, "", df.loc[i, column]).strip()

# Re-check number of and example rows
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)
    comb_rows_to_check += rows_to_check
pd.concat([df[df["name"] == "David Bierk"], df[df["name"] == "Gregor von Rezzori"]])

There are 0 rows with matching pattern in column 'info_1'.
There are 0 rows with matching pattern in column 'info_2'.
There are 0 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_born,nation_dual
21369,28,David Bierk,", 58, American-Canadian artist.",https://en.wikipedia.org/wiki/David_Bierk,28,2002,August,,,artist,,,,,,,,,,58.0,,,American-Canadian
11239,23,Gregor von Rezzori,", 83, Austrian-Romanian journalist, actor, writer and art collector.",https://en.wikipedia.org/wiki/Gregor_von_Rezzori,13,1998,April,,,journalist,actor,writer and art collector,,,,,,,,83.0,,,Austrian-Romanian



#### Observations:
- With the two-value nationalities extracted, to be treated further later, we can move on to single nationality values.
- Some nationalities have multiple words, and some entries have other capitalized words directly after the nationality.
- We will take the approach of capturing all of the capitalized words at the start of the string.
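
A quick check of how a "leading capitalized words" pattern behaves may help here. The sketch compares the `*` form used in the following cells with a hypothetical `+` variant (`plus_pattern`, not part of the notebook) that requires at least one capitalized word.

```python
import re

star_pattern = r"^([A-Z][a-z]*\s)*"  # pattern used in the notebook
plus_pattern = r"^(?:[A-Z][a-z]*\s)+"  # hypothetical variant: one or more words

examples = ["New Zealand Maori leader", "Swedish Olympic sprinter", "lung cancer"]

results = {}
for value in examples:
    star_match = re.search(star_pattern, value)  # never None: * allows an empty match
    plus_match = re.search(plus_pattern, value)
    results[value] = (
        star_match.group(0),
        plus_match.group(0) if plus_match else None,
    )
    print(repr(value), "->", results[value])
```

Because `*` permits zero repetitions, the notebook's pattern matches every string, returning an empty match when there is no leading capitalized word. That would explain why every non-null row is counted as a match. Note also that `group(0)` keeps the trailing space and that words such as "Olympic" are captured along with the nationality.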

#### Entries with Single Nationality Values

In [38]:
# Pattern for re
pattern = r"^([A-Z][a-z]*\s)*"

# List and number of rows matching patterns
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)
    comb_rows_to_check += rows_to_check

# Checking a sample of rows
df.loc[comb_rows_to_check, :].sample(2)

There are 30 rows with matching pattern in column 'info_1'.
There are 132345 rows with matching pattern in column 'info_2'.
There are 62328 rows with matching pattern in column 'info_3'.
There are 12541 rows with matching pattern in column 'info_4'.
There are 1498 rows with matching pattern in column 'info_5'.
There are 214 rows with matching pattern in column 'info_6'.
There are 31 rows with matching pattern in column 'info_7'.
There are 6 rows with matching pattern in column 'info_8'.
There are 1 rows with matching pattern in column 'info_9'.
There are 1 rows with matching pattern in column 'info_10'.
There are 1 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_born,nation_dual
117381,26,George Robert Carruthers,", 81, American physicist and inventor.",https://en.wikipedia.org/wiki/George_Robert_Carruthers,13,2020,December,,,American physicist and inventor,,,,,,,,,,81.0,,,
1961,6,Dean Gallo,", 58, American politician and businessman, prostate cancer.",https://en.wikipedia.org/wiki/Dean_Gallo,12,1994,November,,,American politician and businessman,prostate cancer,,,,,,,,,58.0,,,



#### Observations:
- Here, we see that we will run into problems if we search all of the columns at once: the first column that starts with a capitalized word should contain the nationality information, and that value could be overwritten by matches in later columns.
- So, we will start with `info_1` and work left to right, from column to column.
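
One way to picture the left-to-right idea is the sketch below, which fills `nation_single` only where it is still missing, so later columns cannot overwrite an earlier match. It is an illustration under assumptions, not the notebook's implementation: the toy data and the `+`-based pattern are hypothetical.

```python
import pandas as pd

# Hypothetical variant of the pattern that requires at least one capitalized word
pattern = r"^((?:[A-Z][a-z]*\s)+)"

# Toy data loosely modeled on entries shown in the notebook output
sample = pd.DataFrame(
    {
        "info_1": [None, "German Olympic wrestler", None],
        "info_2": ["American politician and businessman", None, "singer"],
        "info_3": ["prostate cancer", None, "Olympic medalist"],
    }
)
sample["nation_single"] = None

for column in ["info_1", "info_2", "info_3"]:
    extracted = sample[column].str.extract(pattern, expand=False).str.strip()
    # Fill only rows that have no value yet and that matched in this column
    fill = sample["nation_single"].isna() & extracted.notna()
    sample.loc[fill, "nation_single"] = extracted[fill]
    sample.loc[fill, column] = (
        sample.loc[fill, column].str.replace(pattern, "", regex=True).str.strip()
    )

print(sample)
```

The third row shows a limitation of any capitalization-based rule: when no earlier column matches, a capitalized non-nationality word in a later column ("Olympic") is captured instead, so the extracted values still need review.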

#### Single Nationalities in `info_1`

In [39]:
# Pattern for re
pattern = r"^([A-Z][a-z]*\s)*"

# List and number of rows matching patterns
rows_to_check = rows_with_pattern(df[df["info_1"].notna()], "info_1", pattern)

# Checking a sample of rows
df.loc[rows_to_check, :].sample(2)

There are 30 rows with matching pattern in column 'info_1'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_born,nation_dual
103270,26,Edmund Seger,", 82 German Olympic wrestler.",https://en.wikipedia.org/wiki/Edmund_Seger,2,2019,May,,German Olympic wrestler,,,,,,,,,,,82.0,,,
66853,26,Bill Roetzheim,", 85. American Olympic gymnast.",https://en.wikipedia.org/wiki/Bill_Roetzheim,16,2014,February,,American Olympic gymnast,,,,,,,,,,,85.0,,,



In [40]:
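# Extracting nation information to new column nation_single
# Note: the pattern can match an empty string, so re.search always returns a match
# object here, and rows without leading capitalized words receive an empty string
for i in df[df["info_1"].notna()].index:
    item = df.loc[i, "info_1"]
    match = re.search(pattern, item)
    if match:
        df.loc[i, "nation_single"] = match.group(0)
        df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"]).strip()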

# Re-check number of and example rows
rows_to_check = rows_with_pattern(df[df["info_1"].notna()], "info_1", pattern)
pd.concat([df[df["name"] == "Harry Webster"], df[df["name"] == "George Strugar"]])

There are 30 rows with matching pattern in column 'info_1'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_born,nation_dual,nation_single
32396,6,Harry Webster,", 89. British automotive engineer.",https://en.wikipedia.org/wiki/Harry_Webster,1,2007,February,,automotive engineer,,,,,,,,,,,89.0,,,,British
8887,13,George Strugar,", 63. American gridiron football player, lung cancer.",https://en.wikipedia.org/wiki/George_Strugar,0,1997,June,,gridiron football player,lung cancer,,,,,,,,,,63.0,,,,American



In [41]:
df.loc[rows_to_check, :]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_born,nation_dual,nation_single
6846,17,Spiro Agnew,", American politician, 77, 39th Vice President of the United States, leukemia.",https://en.wikipedia.org/wiki/Spiro_Agnew,207,1996,September,,politician,,39th Vice President of the United States,leukemia,,,,,,,,77.0,,,,American
7525,21,Kell Areskoug,", 90 Swedish Olympic sprinter.",https://en.wikipedia.org/wiki/Kell_Areskoug,5,1996,December,,sprinter,,,,,,,,,,,90.0,,,,Swedish Olympic
8887,13,George Strugar,", 63. American gridiron football player, lung cancer.",https://en.wikipedia.org/wiki/George_Strugar,0,1997,June,,gridiron football player,lung cancer,,,,,,,,,,63.0,,,,American
11858,23,Manuel Mejía Vallejo,", 75 Colombian writer.",https://en.wikipedia.org/wiki/Manuel_Mej%C3%ADa_Vallejo,2,1998,July,,writer,,,,,,,,,,,75.0,,,,Colombian
12351,25,Sir Robin Brook,", 90 British businessman, banker and Olympic fencer.",https://en.wikipedia.org/wiki/Robin_Brook,3,1998,October,,businessman,banker and Olympic fencer,,,,,,,,,,90.0,,,,British
17890,28,Marie Jahoda,", 94 Austrian-British social psychologist.",https://en.wikipedia.org/wiki/Marie_Jahoda,3,2001,April,,social psychologist,,,,,,,,,,,94.0,,,Austrian-British,
18170,4,Dipendra,", King of Nepal, 29, suicide.",https://en.wikipedia.org/wiki/Dipendra_of_Nepal,8,2001,June,,of Nepal,,suicide,,,,,,,,,29.0,,,,King
19638,20,Dame Miraka Szászy,", 80. New Zealand Maori leader.",https://en.wikipedia.org/wiki/Mira_Sz%C3%A1szy,21,2001,December,,leader,,,,,,,,,,,80.0,,,,New Zealand Maori
20448,8,Helen Gilbert,", 80 American artist.",https://en.wikipedia.org/wiki/Helen_Gilbert,4,2002,April,,artist,,,,,,,,,,,80.0,,,,American
21110,19,Frank Taylor,", 81. English sports journalist.",https://en.wikipedia.org/wiki/Frank_Taylor_(journalist),3,2002,July,,sports journalist,,,,,,,,,,,81.0,,,,English



### Exporting Dataset to SQLite Database [wp_life_expect_clean2.db](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean1.db)

In [None]:
# Saving the dataset, as cleaned so far, in a SQLite database
conn = sql.connect("wp_life_expect_clean2.db")
df.to_sql("wp_life_expect_clean2", conn, index=False)
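
As an optional check, an export like the one above can be round-tripped. The sketch uses an in-memory database and a hypothetical table name (`demo_table`) so that it does not touch the project files. Passing `if_exists="replace"` is an assumption about the desired behavior on re-runs: by default, `to_sql` raises an error if the table already exists.

```python
import sqlite3 as sql

import pandas as pd

# Toy data standing in for the cleaned dataframe
sample = pd.DataFrame(
    {
        "name": ["Harry Webster", "George Strugar"],
        "nation_single": ["British", "American"],
    }
)

# In-memory database, so nothing is written to disk
conn = sql.connect(":memory:")

# if_exists="replace" lets the cell be re-run without an error
sample.to_sql("demo_table", conn, index=False, if_exists="replace")
sample.to_sql("demo_table", conn, index=False, if_exists="replace")

# Read the table back to confirm the shape survived the round trip
check = pd.read_sql("SELECT * FROM demo_table", conn)
conn.close()

print(f"There are {check.shape[0]} rows and {check.shape[1]} columns.")
```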

# [Proceed to Notebook 4 of 4: Data Cleaning Part 3](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)