# Wikipedia Notable Life Expectancies

# [Notebook 3 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To supress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)

<IPython.core.display.Javascript object>

## Data Overview


### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean1.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean1", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 132445 rows and 20 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0


<IPython.core.display.Javascript object>

In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
132443,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0
132444,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0


<IPython.core.display.Javascript object>

In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
81,13,Norm Jacobson,", 76, Australian rugby player and coach.",https://en.wikipedia.org/wiki/Norm_Jacobson,3,1994,January,,,Australian rugby player and coach,,,,,,,,,,76.0
63759,15,Jackie Lomax,", 69, English guitarist and singer-songwriter.",https://en.wikipedia.org/wiki/Jackie_Lomax,14,2013,September,,,English guitarist and singer-songwriter,,,,,,,,,,69.0
55724,28,Paul Stassino,", 82, Greek-Cypriot actor .",https://en.wikipedia.org/wiki/Paul_Stassino,5,2012,June,(),,Greek-Cypriot actor,,,,,,,,,,82.0
36190,4,Michael White,", 59, Australian inventor of narrative therapy, cardiac arrest.",https://en.wikipedia.org/wiki/Michael_White_(psychotherapist),9,2008,April,,,Australian inventor of narrative therapy,cardiac arrest,,,,,,,,,59.0
105810,2,Bill Bidwill,", 88, American football team owner .",https://en.wikipedia.org/wiki/Bill_Bidwill,9,2019,October,(Arizona Cardinals),,American football team owner,,,,,,,,,,88.0


<IPython.core.display.Javascript object>

#### Observations:
- There are currently 132,445 rows and 20 columns.

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132445 entries, 0 to 132444
Data columns (total 20 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132445 non-null  object 
 1   name            132445 non-null  object 
 2   info            132445 non-null  object 
 3   link            132445 non-null  object 
 4   num_references  132445 non-null  object 
 5   year            132445 non-null  int64  
 6   month           132445 non-null  object 
 7   info_parenth    49788 non-null   object 
 8   info_1          132445 non-null  object 
 9   info_2          132421 non-null  object 
 10  info_3          62570 non-null   object 
 11  info_4          12587 non-null   object 
 12  info_5          1505 non-null    object 
 13  info_6          217 non-null     object 
 14  info_7          33 non-null      object 
 15  info_8          7 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   

<IPython.core.display.Javascript object>

#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- Picking up where we left off, we will aim to track down the remaining missing values for `age`, starting with searching for digits in `info_2`.

### Remaining Missing Values for Age

In [6]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')

There are 79 missing values for age.


<IPython.core.display.Javascript object>

#### Function to Save Indices of Rows Matching Regular Expressions Pattern to a List and Print Number of Rows with Match

In [7]:
# Define a function that takes dataframe, column name, and re pattern as arguments and returns list of indices
# for which column value matches re pattern
def rows_with_pattern(dataframe, column, pattern):
    """
    Takes input of dataframe, column name, and re pattern 
    and returns list of indices for rows that contain match
    for pattern anywhere within value for given column.
    
    dataframe: dataframe
    column: column name
    pattern: re pattern
    """
    index_list = []

    for i in dataframe.index:
        item = dataframe.loc[i, column]
        match = re.search(pattern, item)
        if match:
            index_list.append(i)
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list

<IPython.core.display.Javascript object>

#### Function to Use rows_with_pattern Function for Multiple Regular Expression Patterns

In [8]:
# Define a function that calls rows_with_pattern function for multiple re patterns
# returning a single list of indices for all rows with any pattern match


def multiple_patterns(dataframe, column, patterns):
    """
    Takes input dataframe, column, and list of re patterns and returns single list 
    of indices for rows in which a match for any pattern is found with re.search
    
    dataframe: dataframe
    column: column name
    patterns: list of re patterns
    """
    rows_combined = []

    # For loop to check each pattern
    for pattern in patterns:

        # List and number of rows matching each pattern
        print(pattern)
        rows_to_check = rows_with_pattern(dataframe, column, pattern)
        print("")

        # Add list for each pattern to combined list
        rows_combined += rows_to_check

    return rows_combined

<IPython.core.display.Javascript object>

### `info_2`

#### Rows Missing `age` with Digits in `info_2`

In [9]:
# Pattern for re
pattern = r"\d"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Examining the rows directly
df.loc[rows_to_check, :]

There are 43 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
6846,17,Spiro Agnew,", American politician, 77, 39th Vice President of the United States, leukemia.",https://en.wikipedia.org/wiki/Spiro_Agnew,207,1996,September,,American politician,77,39th Vice President of the United States,leukemia,,,,,,,,
12861,14,Muslimgauze,", , 37, British electronic musician, fungal infection.",https://en.wikipedia.org/wiki/Muslimgauze,26,1999,January,(Bryn Jones),,37,British electronic musician,fungal infection,,,,,,,,
14993,15,Željko Ražnatović,", , 47, Serbian mobster and paramilitary leader.",https://en.wikipedia.org/wiki/Arkan,55,2000,January,(aka Arkan),,47,Serbian mobster and paramilitary leader,,,,,,,,,
15062,28,Sarah Caudwell,", , 60, British detective story writer and barrister, cancer.",https://en.wikipedia.org/wiki/Sarah_Caudwell,9,2000,January,(aka Sarah Cockburn),,60,British detective story writer and barrister,cancer,,,,,,,,
16111,30,Max Showalter,", , 83, American actor, composer, pianist, singer, cancer.",https://en.wikipedia.org/wiki/Max_Showalter,14,2000,July,(aka Casey Adams),,83,American actor,composer,pianist,singer,cancer,,,,,
17792,15,Joey Ramone,", , 49, American musician, lead singer for The Ramones, lymphoma.",https://en.wikipedia.org/wiki/Joey_Ramone,25,2001,April,(b. Jeffrey Hyman),,49,American musician,lead singer for The Ramones,lymphoma,,,,,,,
18170,4,Dipendra,", King of Nepal, 29, suicide.",https://en.wikipedia.org/wiki/Dipendra_of_Nepal,8,2001,June,,King of Nepal,29,suicide,,,,,,,,,
19017,28,Mohammad Khalequzzaman,", member of the then National Assembly of Pakistan and Union Minister of Labor, died in 28 September .",https://en.wikipedia.org/wiki/Mohammad_Khalequzzaman,3,2001,September,,member of the then National Assembly of Pakistan and Union Minister of Labor,died in 28 September,,,,,,,,,,
19117,12,Lord Hailsham of St Marylebone,", , 94, British lawyer and politician.","https://en.wikipedia.org/wiki/Quintin_Hogg,_Baron_Hailsham_of_St_Marylebone",17,2001,October,(Quintin Hogg),,94,British lawyer and politician,,,,,,,,,
23752,6,Jules Engel,", Jules Engel, 94, American filmmaker, visual artist, and film director.",https://en.wikipedia.org/wiki/Jules_Engel,10,2003,September,,Jules Engel,94,American filmmaker,visual artist,and film director,,,,,,,


<IPython.core.display.Javascript object>

#### Observations:
- There are several ages as sole integer year values and two as year ranges with two integers.
- The remaining entries are missing age, but do have digits, so order of processing matters here.
- We can safely remove any of these rows that contains a letter of the alphabet, taking care to select rows only from those that are missing `age` and have a digit in `info_2`.

In [10]:
# Pattern for re
pattern = r"[a-z,A-Z]"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df.loc[rows_to_check, :], "info_2", pattern)

# Checking a sample of the rows
df.loc[rows_to_check, :].sample(2)

There are 17 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
111798,21,Kamrun Nahar Putul,", Bangladeshi politician, COVID-19.",https://en.wikipedia.org/wiki/Kamrun_Nahar_Putul,3,2020,May,,Bangladeshi politician,COVID-19,,,,,,,,,,
115179,2,Fadma Abi,", Moroccan surgeon and professor, COVID-19.",https://en.wikipedia.org/wiki/Fadma_Abi,6,2020,October,,Moroccan surgeon and professor,COVID-19,,,,,,,,,,


<IPython.core.display.Javascript object>

#### Observations:
- We can drop these rows, as they are missing the data for age.
- Extraction of age for single integer value and two integer ranges can follow.

#### Dropping Additional Rows with Age Data Absent

In [11]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_check, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132428, 20)

<IPython.core.display.Javascript object>

#### Remaining Rows with `age` values in `info_2`

In [12]:
# Regular expression for parenthesis and its contents
pattern = r"\d"

# Finding indices of rows that have pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Checking unique values
df.loc[rows_to_check, :]["info_2"].unique()

There are 26 rows with matching pattern in column 'info_2'.


array(['77', '37', '47', '60', '83', '49', '29', '94', '55', '62', '69',
       '80', '32', '70', '81', '24', '84', '95', '76', '86', '61',
       '74–75', '79–80'], dtype=object)

<IPython.core.display.Javascript object>

#### Extracting `age` for Ranges with Two Values

In [13]:
# Pattern for re
pattern = r"(\d{1,3})(-|–|/| or )(\d{1,3})"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

There are 2 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
61915,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,74–75,,,,,,,,,,
86907,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(–),Somali politician,79–80,President of Puntland,,,,,,,,,


<IPython.core.display.Javascript object>

In [14]:
# For loop to find rows with values and pattern and calculate and extract age to age column and remove age from info_2
for i in df.loc[rows_to_check, :].index:
    item = df.loc[i, "info_2"]
    match = re.search(pattern, item)
    if match:
        age = (int(match.group(1)) + int(match.group(3))) / 2
        df.loc[i, "age"] = age
        df.loc[i, "info_2"] = re.sub(pattern, "", df.loc[i, "info_2"]).strip()

# Checking example rows
pd.concat([df[df["name"] == "Mohamud Muse Hersi"], df[df["name"] == "Basil Soper"]])

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
86907,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(–),Somali politician,,President of Puntland,,,,,,,,,79.5
61915,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,,,,,,,,,,,74.5


<IPython.core.display.Javascript object>

#### Extracting `age` as Single Integer

In [15]:
# List of patterns for age formats with single integer for age
pattern = r"\b(\d{1,3})\b"

# List and number of rows matching patterns
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

There are 24 rows with matching pattern in column 'info_2'.


<IPython.core.display.Javascript object>

In [16]:
# For loop to extract age pattern to age column
for i in df.loc[rows_to_check, :].index:
    item = df.loc[i, "info_2"]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[i, "age"] = age
        df.loc[i, "info_2"] = re.sub(pattern, "", df.loc[i, "info_2"]).strip()

# Re-checking number of rows matching patterns
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

There are 0 rows with matching pattern in column 'info_2'.


<IPython.core.display.Javascript object>

In [17]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')
df[df["age"].isna()].sample(2)

There are 36 missing values for age.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
116898,10,Maroof Afzal,", Pakistani civil servant, cabinet secretary , COVID-19.",https://en.wikipedia.org/wiki/Maroof_Afzal,5,2020,December,(–),Pakistani civil servant,cabinet secretary,COVID-19,,,,,,,,,
126710,31,Simon Young,", Irish radio presenter .",https://en.wikipedia.org/wiki/Simon_Young_(presenter),9,2021,October,(RTÉ 2fm),Irish radio presenter,,,,,,,,,,,


<IPython.core.display.Javascript object>

#### Observations:
- At this point, we could look through the list to manually extract any remaining `age` info, which would likely be the fastest approach.
- We will instead take a programmatic approach, for the sake of the exercise.
- `info_parenth`, and `info_3` and beyond are the remaining columns to search.
- We see that COVID-19 appears often and the number 19 could be mistakenly extracted as an age.  
- Let us start by moving it to a new column `cause_of_death`.

#### Extracting "COVID-19" from Remaining `info` Sub-columns where `age` Missing

In [18]:
# List of columns to check
cols_to_check = [
    "info_parenth",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Pattern for re
pattern = r"(COVID-19)"

# For loop to collect indices of all rows with pattern
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        df[(df["age"].isna()) & (df[column].notna())], column, pattern
    )
    comb_rows_to_check += rows_to_check

# Checking sample of rows
df.loc[comb_rows_to_check, :].sample(2)

There are 1 rows with matching pattern in column 'info_parenth'.
There are 30 rows with matching pattern in column 'info_3'.
There are 2 rows with matching pattern in column 'info_4'.
There are 1 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
117312,23,Loyiso Mpumlwana,", South African politician, member of the National Assembly , COVID-19.",https://en.wikipedia.org/wiki/Loyiso_Mpumlwana,9,2020,December,"(–, since )",South African politician,member of the National Assembly,COVID-19,,,,,,,,,
116845,7,Tasiman,", Indonesian politician, regent of Pati , COVID-19.",https://en.wikipedia.org/wiki/Tasiman,1,2020,December,(–),Indonesian politician,regent of Pati,COVID-19,,,,,,,,,


<IPython.core.display.Javascript object>

In [19]:
# For loop to extract COVID-19 from remaining info columns for entries with missing age
for column in cols_to_check:
    for i in df.loc[comb_rows_to_check, :].index:
        if df.loc[i, column]:
            item = df.loc[i, column]
            match = re.search(pattern, item)
            if match:
                cause = match.group(1)
                df.loc[i, "cause_of_death"] = cause
                df.loc[i, column] = re.sub(pattern, "", df.loc[i, column]).strip()

# Checking sample of rows
df.loc[comb_rows_to_check, :].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
119194,12,Tohami Khaled,", Libyan military officer, head of the Internal Security Agency, COVID-19.",https://en.wikipedia.org/wiki/Tohami_Khaled,9,2021,February,,Libyan military officer,head of the Internal Security Agency,,,,,,,,,,,COVID-19
111852,24,Hussain Ahmad Kanjo,", Pakistani politician, Minister of Science and Technology , COVID-19.",https://en.wikipedia.org/wiki/Hussain_Ahmad_Kanjo,2,2020,May,(–),Pakistani politician,Minister of Science and Technology,,,,,,,,,,,COVID-19


<IPython.core.display.Javascript object>

#### Observations:
- With "COVID-19" put aside, we can check for any remaining digits in the same columns for these entries.

#### Checking Remaining `info` Columns for Remaining Digits

In [20]:
# List of columns to check
cols_to_check = [
    "info_parenth",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Pattern for re
pattern = r"\d"

# For loop to collect indices of all rows with pattern
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        df[(df["age"].isna()) & (df[column].notna())], column, pattern
    )
    comb_rows_to_check += rows_to_check

# Checking sample of rows
df.loc[comb_rows_to_check, :]

There are 2 rows with matching pattern in column 'info_parenth'.
There are 0 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
22855,10,Little Eva,", .",https://en.wikipedia.org/wiki/Little_Eva,14,2003,April,"(née Eva Narcissus Boyd), 59, American pop singer ()",,,,,,,,,,,,,
126710,31,Simon Young,", Irish radio presenter .",https://en.wikipedia.org/wiki/Simon_Young_(presenter),9,2021,October,(RTÉ 2fm),Irish radio presenter,,,,,,,,,,,,


<IPython.core.display.Javascript object>

#### Observations:
- There are only two entries that still potentially contain digits for the age data.
- Here, we see that there is one entry we can preserve that has an age value in `info_parenth`.
- The other entry has a radio station identification value and is missing age data.
- After we collect this last age, we will drop the remaining entries missing `age`.

#### Extracting `age` from `info_parenth`

In [21]:
# List of patterns for age formats with single integer for age
pattern = r"\b(\d{1,3})\b"

# List and number of rows matching patterns
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_parenth"].notna())], "info_parenth", pattern
)

There are 1 rows with matching pattern in column 'info_parenth'.


<IPython.core.display.Javascript object>

In [22]:
# For loop to extract age pattern to age column
for i in df.loc[rows_to_check, :].index:
    item = df.loc[i, "info_parenth"]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[i, "age"] = age
        df.loc[i, "info_parenth"] = re.sub(
            pattern, "", df.loc[i, "info_parenth"]
        ).strip()

# Re-checking number of rows matching patterns
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_parenth"].notna())], "info_parenth", pattern
)

There are 0 rows with matching pattern in column 'info_parenth'.


<IPython.core.display.Javascript object>

#### Dropping the Last Entries with Missing `age` Values

In [23]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')

There are 35 missing values for age.


<IPython.core.display.Javascript object>

In [24]:
# Dropping rows, resetting index, and checking new shape of df
df.dropna(subset="age", inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132393, 21)

<IPython.core.display.Javascript object>

In [25]:
# Checking current info status
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132393 entries, 0 to 132392
Data columns (total 21 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132393 non-null  object 
 1   name            132393 non-null  object 
 2   info            132393 non-null  object 
 3   link            132393 non-null  object 
 4   num_references  132393 non-null  object 
 5   year            132393 non-null  int64  
 6   month           132393 non-null  object 
 7   info_parenth    49752 non-null   object 
 8   info_1          132393 non-null  object 
 9   info_2          132371 non-null  object 
 10  info_3          62537 non-null   object 
 11  info_4          12584 non-null   object 
 12  info_5          1504 non-null    object 
 13  info_6          217 non-null     object 
 14  info_7          33 non-null      object 
 15  info_8          7 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   

<IPython.core.display.Javascript object>

#### Observations:
- We have 132,393 entries containing the target variable `age`.
- Some of these rows may represent groups or members of non-human species, as we have observed previously.
- We have been replacing values extracted with empty strings.  Before moving forward let us replace these empty strings with Nan, as it will simplify slicing the dataframe.
- Then it will be time to search for nationality.

#### Replacing Empty Strings with NaN

In [26]:
# Replacing empty strings with NaN
df = df.replace(r"^\s*$", np.nan, regex=True)

<IPython.core.display.Javascript object>

In [27]:
# Checking the NaN values per column
df.isna().sum()

day                    0
name                   0
info                   0
link                   0
num_references         0
year                   0
month                  0
info_parenth       82641
info_1            132363
info_2                48
info_3             70065
info_4            119852
info_5            130895
info_6            132179
info_7            132362
info_8            132387
info_9            132392
info_10           132392
info_11           132392
age                    0
cause_of_death    132393
dtype: int64

<IPython.core.display.Javascript object>

## Extracting Nationality Data

In [28]:
# Checking a sample
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
104773,8,Theodore L. Eliot Jr.,", 91, American diplomat, Ambassador to Afghanistan , heart disease.",https://en.wikipedia.org/wiki/Theodore_L._Eliot_Jr.,3,2019,August,(–),,American diplomat,Ambassador to Afghanistan,heart disease,,,,,,,,91.0,
109548,18,Richard G. Austin,", 89, American weightlifter, complications from diabetes.",https://en.wikipedia.org/wiki/Richard_G._Austin_(weightlifter),2,2020,March,,,American weightlifter,complications from diabetes,,,,,,,,,89.0,
15416,26,Alex Comfort,", 80, British scientist, physician and author , .",https://en.wikipedia.org/wiki/Alex_Comfort,13,2000,March,(cerebral haemorrhage),,British scientist,physician and author,,,,,,,,,80.0,
36841,5,Jacklyn H. Lucas,", 80, American World War II veteran, youngest marine to be awarded the Medal of Honor, cancer.",https://en.wikipedia.org/wiki/Jacklyn_H._Lucas,12,2008,June,,,American World War II veteran,youngest marine to be awarded the Medal of Honor,cancer,,,,,,,,80.0,
26288,14,Evelyn West,", 80, American burlesque stripper, pin-up girl and actress.",https://en.wikipedia.org/wiki/Evelyn_West,6,2004,November,,,American burlesque stripper,pin-up girl and actress,,,,,,,,,80.0,


<IPython.core.display.Javascript object>

#### Observations:
- `info_2` appears overall consistent with the Wikipedia field  that combines "citizenship" and "known for".
- The first word does appear to represent the nationality.
- Recall that this information is in `info_1` for some entries and may also be in other `info` columns beyond `info_2`.
- Running the sample check a few times reveals that when citizenship changed the original citizenship may be followed by '-born', then the second citizenship.
- There are nationalities that have multiple words and there are capitalized words that are not part of the nationality.
- We will start with a list of nationalities downloaded from marijn's github repository [List of nationalities](https://gist.github.com/marijn/274449).

### Reading List of Nationalities

In [29]:
# Reading in list of nationalities
nationalities = pd.read_csv(
    "nationalities.txt", sep="/n", engine="python", names=["Nationality"]
)

print(nationalities.shape)
nationalities.head()

(194, 1)


Unnamed: 0,Nationality
0,Afghan
1,Albanian
2,Algerian
3,American
4,Andorran


<IPython.core.display.Javascript object>

In [30]:
# Converting nationalities to a list
nationalities_lst = nationalities["Nationality"].to_list()

<IPython.core.display.Javascript object>

#### Observations:
- There are 194 nationalities represented in the nationalities list, which we may use for comparison.
- There are single nationalities and dual nationalities represented ('Nationality1-born Nationality2' or "Nationality1-Nationality2").
- We will take the approach of extracting nationalities that match those in the nationalities_lst, and see what missing values remain afterward.
- Most of this information is in `info_2` and it is best for us to approach from left to right to extract it, so we will begin with `info_1` and go column by column, as indicated.

### Extracting Nationality from `info_1` to `nation_1` and `nation_2`

#### Values with 2 Nationalities containing '-born'

In [36]:
# List of columns to check
cols_to_check = [
    "info_1",
    "info_2",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Pattern for re
pattern = r"([A-Z][a-z]*-born\s[A-Z][a-z]*\s)"

# List and number of rows matching patterns
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)
    comb_rows_to_check += rows_to_check

# Checking a sample of rows
df.loc[comb_rows_to_check, :].sample(2)

There are 0 rows with matching pattern in column 'info_1'.
There are 4109 rows with matching pattern in column 'info_2'.
There are 1 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
126097,8,Everett Morton,", 71, Kittitian-born British drummer .",https://en.wikipedia.org/wiki/Everett_Morton,26,2021,October,(The Beat),,Kittitian-born British drummer,,,,,,,,,,71.0,
65238,11,Frederick Fox,", 82, Australian-born British milliner.",https://en.wikipedia.org/wiki/Frederick_Fox_(milliner),5,2013,December,,,Australian-born British milliner,,,,,,,,,,82.0,


<IPython.core.display.Javascript object>

#### Observations:
- Such values are only in `info_2` and `info_3`, so we will treat those values as indicated, one column at a time.

#### `info_2`

In [39]:
# Column to be treated
column = "info_2"

# Extracting nation information to new columns nation_1 and nation_2
for i in df[df[column].notna()].index:
    item = df.loc[i, column]
    match = re.search(pattern, item)
    if match:
        nations = match.group(1).split("-born")
        nation_1 = nations[0].strip()
        nation_2 = nations[1].strip()
        df.loc[i, "nation_1"] = nation_1
        df.loc[i, "nation_2"] = nation_2
        df.loc[i, column] = re.sub(pattern, "", df.loc[i, column]).strip()

# Re-check number of and example rows
rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)
pd.concat([df[df["name"] == "Everett Morton"], df[df["name"] == "Frederick Fox"]])

There are 0 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1,nation_2
126097,8,Everett Morton,", 71, Kittitian-born British drummer .",https://en.wikipedia.org/wiki/Everett_Morton,26,2021,October,(The Beat),,drummer,,,,,,,,,,71.0,,Kittitian,British
65238,11,Frederick Fox,", 82, Australian-born British milliner.",https://en.wikipedia.org/wiki/Frederick_Fox_(milliner),5,2013,December,,,milliner,,,,,,,,,,82.0,,Australian,British


<IPython.core.display.Javascript object>

#### `info_3`

In [55]:
# List of columns to check
column = "info_3"

# Pattern for re
pattern = r"([A-Z][a-z]*-born\s[A-Z][a-z]*\s)"

# List and number of rows matching patterns
rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)

# Checking a sample of rows
df.loc[rows_to_check, :]

There are 0 rows with matching pattern in column 'info_3'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1,nation_2


<IPython.core.display.Javascript object>

In [47]:
# Column to be treated
column = "info_3"

# Extracting nation information to new columns nation_1 and nation_2
for i in df[df[column].notna()].index:
    item = df.loc[i, column]
    match = re.search(pattern, item)
    if match:
        nations = match.group(1).split("-born")
        nation_1 = nations[0].strip()
        nation_2 = nations[1].strip()
        df.loc[i, "nation_1"] = nation_1
        df.loc[i, "nation_2"] = nation_2
        df.loc[i, column] = re.sub(pattern, "", df.loc[i, column]).strip()

# Re-check number of and example rows
rows_to_check = rows_with_pattern(df[df[column].notna()], column, pattern)
df[df["name"] == "Herbert Freudenberger"]

There are 0 rows with matching pattern in column 'info_3'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1,nation_2
14664,29,Herbert Freudenberger,", 73, Jewish, German-born American psychologist, kidney disease.",https://en.wikipedia.org/wiki/Herbert_Freudenberger,6,1999,November,,,Jewish,psychologist,kidney disease,,,,,,,,73.0,,German,American


<IPython.core.display.Javascript object>

#### Values with 2 Nationalities Hyphenated

In [66]:
# Column to check and treat
column = "info_2"

# Patter for re
pattern = r"(^[A-Z][a-z]*-[A-Z][a-z]*)"

# List and number of rows matching patterns
rows_to_check = rows_with_pattern(
    df[(df["nation_1"].isna()) & (df[column].notna())], column, pattern
)

df.loc[rows_to_check, :].sample(2)

There are 34 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1,nation_2
73512,6,Tetaua Taitai,", 67, I-Kiribati politician and physician, Leader of the Opposition, cancer.",https://en.wikipedia.org/wiki/Tetaua_Taitai,3,2015,February,,,I-Kiribati politician and physician,Leader of the Opposition,cancer,,,,,,,,67.0,,,
84595,11,Teatao Teannaki,", 80, I-Kiribati politician, Vice-President , heart attack.",https://en.wikipedia.org/wiki/Teatao_Teannaki,2,2016,October,(–) and President (–),,I-Kiribati politician,Vice-President,heart attack,,,,,,,,80.0,,,


<IPython.core.display.Javascript object>

#### Observations:
- This pattern may have matches that are not nationalities, so we will add that comparison to our for loop when extracting.

In [69]:
# Extracting nation information to new columns nation_1 and nation_2
for i in df[(df[column].notna()) & (df["nation_1"].isna())].index:
    item = df.loc[i, column]
    match = re.search(pattern, item)
    not_in_list = []

    if match:
        nations = match.group(0).split("-")
        nation_1 = nations[0].strip()
        nation_2 = nations[1].strip()
        if nation_1 in nationalities_lst:
            df.loc[i, "nation_1"] = nation_1
            df.loc[i, column] = df.loc[i, column].replace(nation_1, "").strip()
        else:
            not_in_list.append(nation_1)

        if nation_2 in nationalities_lst:
            df.loc[i, "nation_1"] = nation_2
            df.loc[i, column] = df.loc[i, column].replace(nation_2, "").strip()
        else:
            not_in_list.append(nation_2)

# Re-check number of and example rows
rows_to_check = rows_with_pattern(
    df[(df["nation_1"].isna()) & (df[column].notna())], column, pattern
)
df.loc[rows_to_check, :]

There are 34 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1,nation_2
387,28,Jimmy Stevens,", 74, Ni-Vanuatu nationalist and politician, stomach cancer.",https://en.wikipedia.org/wiki/Jimmy_Stevens_(politician),2,1994,February,,,Ni-Vanuatu nationalist and politician,stomach cancer,,,,,,,,,74.0,,,
3619,10,Bruno Lawrence,", 54, English-New Zealand musician and actor, lung cancer.",https://en.wikipedia.org/wiki/Bruno_Lawrence,7,1995,June,,,English-New Zealand musician and actor,lung cancer,,,,,,,,,54.0,,,
4904,28,Sukhan Babayev,", 85, Soviet-Turkmenistan politician and General Secretary of the Communist Party of Turkmenistan.",https://en.wikipedia.org/wiki/Sukhan_Babayev,2,1995,November,,,Soviet-Turkmenistan politician and General Secretary of the Communist Party of Turkmenistan,,,,,,,,,,85.0,,,
5100,19,Nita Barrow,", 79, Governor-General of Barbados, stroke.",https://en.wikipedia.org/wiki/Nita_Barrow,14,1995,December,,,Governor-General of Barbados,stroke,,,,,,,,,79.0,,,
7185,2,Joseph Lambert Eustace,", 88, Governor-General of Saint Vincent and the Grenadines.",https://en.wikipedia.org/wiki/Joseph_Lambert_Eustace,1,1996,November,,,Governor-General of Saint Vincent and the Grenadines,,,,,,,,,,88.0,,,
7356,24,Tupua Leupena,", 74, Governor-General of Tuvalu.",https://en.wikipedia.org/wiki/Tupua_Leupena,5,1996,November,,,Governor-General of Tuvalu,,,,,,,,,,74.0,,,
9597,22,Chiang Wei-kuo,", 80, Secretary-General of the National Security Council of the Republic of China , diabetes.",https://en.wikipedia.org/wiki/Chiang_Wei-kuo,20,1997,September,(–),,Secretary-General of the National Security Council of the Republic of China,diabetes,,,,,,,,,80.0,,,
16704,9,James B. Knighten,", 80, African-Americans US Army Air Corps pilot.",https://en.wikipedia.org/wiki/James_B._Knighten,14,2000,November,,,African-Americans US Army Air Corps pilot,,,,,,,,,,80.0,,,
18439,12,Vlado Dapčević,", 84, Yugoslav-Montenegrin communist and revolutionary.",https://en.wikipedia.org/wiki/Vlado_Dap%C4%8Devi%C4%87,2,2001,July,,,Yugoslav-Montenegrin communist and revolutionary,,,,,,,,,,84.0,,,
20332,22,Sir Kingsford Dibela,", 70, Governor-General of Papua New Guinea.",https://en.wikipedia.org/wiki/Kingsford_Dibela,2,2002,March,,,Governor-General of Papua New Guinea,,,,,,,,,,70.0,,,


<IPython.core.display.Javascript object>

In [70]:
df["nation_1"].unique()

array([nan, 'British', 'Australian', 'Norwegian', 'French', 'Armenian',
       'American', 'Russian', 'Danish', 'Israeli', 'Swedish', 'Irish',
       'Cuban', 'Mexican', 'Canadian', 'Japanese', 'German', 'Finnish',
       'Chilean', 'Uruguayan', 'Ukrainian', 'Belgian', 'Swiss', 'Latvian',
       'Venezuelan', 'Iraqi', 'Austrian', 'Albanian', 'Peruvian',
       'Herzegovinian', 'Chinese', 'Italian', 'Hungarian', 'Czech',
       'Indian', 'Lithuanian', 'African', 'Dutch', 'Burmese', 'Polish',
       'Brazilian', 'Rican', 'Spanish', 'Scottish', 'Empire', 'Estonian',
       'Romanian', 'Fijian', 'Pakistan', 'Croatian', 'Greek', 'Samoan',
       'Algerian', 'Filipino', 'Trinidad', 'Soviet', 'English',
       'Mozambican', 'Zealand', 'Iranian', 'Ghanaian', 'Belgium',
       'Turkish', 'Cypriot', 'Palestinian', 'Zambian', 'Jamaican',
       'Lankan', 'Haitian', 'Lebanese', 'Pakistani', 'Yugoslav',
       'Slovene', 'Iran', 'Serbian', 'Mauritian', 'Egyptian', 'Guinean',
       'Welsh', 'Guyane

<IPython.core.display.Javascript object>

In [71]:
nationalities_lst

['Afghan',
 'Albanian',
 'Algerian',
 'American',
 'Andorran',
 'Angolan',
 'Antiguans',
 'Argentinean',
 'Armenian',
 'Australian',
 'Austrian',
 'Azerbaijani',
 'Bahamian',
 'Bahraini',
 'Bangladeshi',
 'Barbadian',
 'Barbudans',
 'Batswana',
 'Belarusian',
 'Belgian',
 'Belizean',
 'Beninese',
 'Bhutanese',
 'Bolivian',
 'Bosnian',
 'Brazilian',
 'British',
 'Bruneian',
 'Bulgarian',
 'Burkinabe',
 'Burmese',
 'Burundian',
 'Cambodian',
 'Cameroonian',
 'Canadian',
 'Cape Verdean',
 'Central African',
 'Chadian',
 'Chilean',
 'Chinese',
 'Colombian',
 'Comoran',
 'Congolese',
 'Costa Rican',
 'Croatian',
 'Cuban',
 'Cypriot',
 'Czech',
 'Danish',
 'Djibouti',
 'Dominican',
 'Dutch',
 'East Timorese',
 'Ecuadorean',
 'Egyptian',
 'Emirian',
 'Equatorial Guinean',
 'Eritrean',
 'Estonian',
 'Ethiopian',
 'Fijian',
 'Filipino',
 'Finnish',
 'French',
 'Gabonese',
 'Gambian',
 'Georgian',
 'German',
 'Ghanaian',
 'Greek',
 'Grenadian',
 'Guatemalan',
 'Guinea-Bissauan',
 'Guinean',
 'Gu

<IPython.core.display.Javascript object>

In [None]:
# For loop to extract nationality from info_1 to nation_1 and nation_2
for i in df[df["info_1"].notna()].index:
    item = df.loc[i, "info_1"]

    nations = []
    for nationality in nationalities_lst:
        if nationality in item:
            nations.append(nationality)
            print(nations)

#     for j, nation in enumerate(nations):
#         df.loc[i, f"nation_{j+1}"] = nation
#         df.loc[i, "info_1"] = df.loc[i, "info_1"].replace(nation, "")

# # Checking sample of treated rows
# df[df["nation_1"].notna()].sample(2)

#### Observations:
- We have been fortunate in `info_1` in that it has provided a small sample on which to test our code and see a subset of values.
- Let us check the remaining unique values in `info_1` for any missed nationalities.

#### Checking Remaining Unique Values in `info_1`

In [None]:
# Checking unique values in info_1
df["info_1"].unique()

#### Observations:
- `info_1` had only single nationality values, so we do not yet have a `nation_2` column.
- It appears "New Zealand Maori", "English", and "Icelandic" are not included in the nationalities list, but they are used as values on Wikipedia.
- We will take the approach of adding them to the list and re-running the code to extract values from `info_1`.

#### Adding Missing Nationalities to `nationalities_lst` and Repeating Extraction of Nationality from `info_1`

In [None]:
# Confirming nationalities are missing then adding them to existing nationalities list
missing_nations = ["New Zealand Maori", "English", "Icelandic"]
missing_nations = [item for item in missing_nations if item not in nationalities_lst]
nationalities_lst += missing_nations

# List of new columns for first and second nationality
new_cols = ["nation_1", "nation_2"]

# For loop to extract nationality from info_1 to nation_1 and nation_2
for col in new_cols:

    for i in df[df["info_1"].notna()].index:
        item = df.loc[i, "info_1"]

        for nationality in nationalities_lst:
            if nationality in item:
                df.loc[i, col] = nationality
                df.loc[i, "info_1"] = df.loc[i, "info_1"].replace(nationality, "")

# Checking sample of treated rows
df[df["nation_1"].notna()].sample(2)

#### Re-checking Remaning Unique Values in `info_1`

In [None]:
# Re-checking remaining unique values in info_1
df["info_1"].unique()

#### Observations:
- That iteration addressed missing nationalities.
- We do have nationality given as a country, "Nepal", that we will need to keep in mind.

### Extracting Nationality from `info_2` to `nation_1` and `nation_2`

In [None]:
# List of new columns for first and second nationality
new_cols = ["nation_1", "nation_2"]

# For loop to extract nationality from info_2 to nation_1 and nation_2
for col in new_cols:

    for i in df[df["info_2"].notna()].index:
        item = df.loc[i, "info_2"]

        for nationality in nationalities_lst:
            if nationality in item:
                df.loc[i, col] = nationality
                df.loc[i, "info_2"] = df.loc[i, "info_2"].replace(nationality, "")

# Checking sample of treated rows
df[df["nation_1"].notna()].sample(2)

In [None]:
df["info_2"].unique()

### Exporting Dataset to SQLite Database [wp_life_expect_clean2.db](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean1.db)

In [None]:
# Saving complete raw dataset in a SQLite database
conn = sql.connect("wp_life_expect_clean2.db")
df.to_sql("wp_life_expect_clean2", conn, index=False)

# [Proceed to Notebook 4 of  4:  Data Cleaning Part 3](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)