# Wikipedia Notable Life Expectancies

# [Notebook 3 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To define the maximum column width to be displayed in a dataframe
pd.set_option("display.max_colwidth", 150)


## Data Overview


### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean1.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean1", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 132584 rows and 20 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
132582,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(1980),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0
132583,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
55774,1,Ossie Hibbert,", 62, Jamaican musician, heart attack.",https://en.wikipedia.org/wiki/Ossie_Hibbert,4,2012,July,,,Jamaican musician,heart attack,,,,,,,,,62.0
89013,23,Viorel Morariu,", 85, Romanian rugby union player, Vernon Pugh Award for Distinguished Service recipient.",https://en.wikipedia.org/wiki/Viorel_Morariu,8,2017,May,,,Romanian rugby union player,Vernon Pugh Award for Distinguished Service recipient,,,,,,,,,85.0
7119,23,Harold Hughes,", 74, American politician.",https://en.wikipedia.org/wiki/Harold_Hughes,10,1996,October,,,American politician,,,,,,,,,,74.0
92734,16,Tu An,", 94, Chinese poet and translator.",https://en.wikipedia.org/wiki/Tu_An,9,2017,December,,,Chinese poet and translator,,,,,,,,,,94.0
30406,2,Bernard Loomis,", 82, American toymaker responsible for Strawberry Shortcake and action figures, heart disease.",https://en.wikipedia.org/wiki/Bernard_Loomis,2,2006,June,,,American toymaker responsible for Strawberry Shortcake and action figures,heart disease,,,,,,,,,82.0



#### Observations:
- There are currently 132,584 rows and 20 columns.

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132584 entries, 0 to 132583
Data columns (total 20 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132584 non-null  object 
 1   name            132584 non-null  object 
 2   info            132584 non-null  object 
 3   link            132584 non-null  object 
 4   num_references  132584 non-null  object 
 5   year            132584 non-null  int64  
 6   month           132584 non-null  object 
 7   info_parenth    49924 non-null   object 
 8   info_1          132584 non-null  object 
 9   info_2          132555 non-null  object 
 10  info_3          62620 non-null   object 
 11  info_4          12587 non-null   object 
 12  info_5          1505 non-null    object 
 13  info_6          217 non-null     object 
 14  info_7          33 non-null      object 
 15  info_8          7 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- Picking up where we left off, we will aim to track down the remaining missing values for `age`, starting with searching for digits in `info_2`.
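
The database round trip noted above can be sketched on a toy frame. This is a minimal illustration only: the in-memory database, the table name `demo`, and the toy values are assumptions made for the example, not the project's actual files.

```python
import sqlite3 as sql

import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({"name": ["A", "B"], "age": [86.0, None]})

# In-memory database so that nothing is written to disk
conn = sql.connect(":memory:")

# index=False avoids storing the dataframe index as an extra column
toy.to_sql("demo", conn, index=False)

# Read the table back into a new dataframe, then release the connection
restored = pd.read_sql("SELECT * FROM demo", conn)
conn.close()

print(restored.shape)
print(restored["age"].isna().sum())
```

Missing values are stored as SQL `NULL` and come back as missing values, which is why the count of missing `age` values can be picked up again in this notebook.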

### Remaining Missing Values for Age

In [6]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')

There are 218 missing values for age.



#### Function to Save Indices of Rows Matching Regular Expressions Pattern to a List and Print Number of Rows with Match

In [7]:
# Define a function that takes dataframe, column name, and re pattern as arguments and returns list of indices
# for which column value matches re pattern
def rows_with_pattern(dataframe, column, pattern):
    """
    Takes input of dataframe, column name, and re pattern 
    and returns list of indices for rows that contain match
    for pattern anywhere within value for given column.
    
    dataframe: dataframe
    column: column name
    pattern: re pattern
    """
    index_list = []

    for i in dataframe.index:
        item = dataframe.loc[i, column]
        match = re.search(pattern, item)
        if match:
            index_list.append(i)
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list
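
As a point of comparison, the same lookup can be sketched without an explicit loop by using the pandas string accessor. The function name `rows_with_pattern_vectorized` is hypothetical and is not used elsewhere in this notebook; it is a sketch under the assumption that the column holds strings or missing values.

```python
import pandas as pd


def rows_with_pattern_vectorized(dataframe, column, pattern):
    """Hypothetical variant: return index labels of rows whose `column` matches `pattern`."""
    # na=False counts missing values as non-matches, so no pre-filtering is needed
    mask = dataframe[column].str.contains(pattern, regex=True, na=False)
    return dataframe.index[mask].tolist()


# Toy data (illustrative only)
toy = pd.DataFrame({"info_2": ["77", "British actor", None, "74–75"]})
print(rows_with_pattern_vectorized(toy, "info_2", r"\d"))
```

One difference worth knowing: `str.contains` emits a warning when the pattern contains capture groups, so the loop-based function above may be the simpler choice for the grouped patterns used later.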


#### Function to Use rows_with_pattern Function for Multiple Regular Expression Patterns

In [8]:
# Define a function that calls rows_with_pattern function for multiple re patterns
# returning a single list of indices for all rows with any pattern match


def multiple_patterns(dataframe, column, patterns):
    """
    Takes input dataframe, column, and list of re patterns and returns single list 
    of indices for rows in which a match for any pattern is found with re.search
    
    dataframe: dataframe
    column: column name
    patterns: list of re patterns
    """
    rows_combined = []

    # For loop to check each pattern
    for pattern in patterns:

        # List and number of rows matching each pattern
        print(pattern)
        rows_to_check = rows_with_pattern(dataframe, column, pattern)
        print("")

        # Add list for each pattern to combined list
        rows_combined += rows_to_check

    return rows_combined
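
Because the lists are concatenated, a row that matches more than one pattern appears more than once in the combined list. If unique indices are needed, one possible approach (an assumption for illustration, not part of the function above) is to deduplicate while keeping the original order:

```python
# Hypothetical combined list in which row 12 matched two different patterns
rows_combined = [5, 12, 40, 12, 7]

# dict.fromkeys keeps the first occurrence of each key and preserves insertion order
unique_rows = list(dict.fromkeys(rows_combined))
print(unique_rows)  # [5, 12, 40, 7]
```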


### `info_2`

#### Rows Missing `age` with Digits in `info_2`

In [9]:
# Pattern for re
pattern = r"\d"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Examining the rows directly
df.loc[rows_to_check, :]

There are 47 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
6846,17,Spiro Agnew,", American politician, 77, 39th Vice President of the United States, leukemia.",https://en.wikipedia.org/wiki/Spiro_Agnew,207,1996,September,,American politician,77,39th Vice President of the United States,leukemia,,,,,,,,
12861,14,Muslimgauze,", , 37, British electronic musician, fungal infection.",https://en.wikipedia.org/wiki/Muslimgauze,26,1999,January,(Bryn Jones),,37,British electronic musician,fungal infection,,,,,,,,
14994,15,Željko Ražnatović,", , 47, Serbian mobster and paramilitary leader.",https://en.wikipedia.org/wiki/Arkan,55,2000,January,(aka Arkan),,47,Serbian mobster and paramilitary leader,,,,,,,,,
15063,28,Sarah Caudwell,", , 60, British detective story writer and barrister, cancer.",https://en.wikipedia.org/wiki/Sarah_Caudwell,9,2000,January,(aka Sarah Cockburn),,60,British detective story writer and barrister,cancer,,,,,,,,
16112,30,Max Showalter,", , 83, American actor, composer, pianist, singer, cancer.",https://en.wikipedia.org/wiki/Max_Showalter,14,2000,July,(aka Casey Adams),,83,American actor,composer,pianist,singer,cancer,,,,,
17793,15,Joey Ramone,", , 49, American musician, lead singer for The Ramones, lymphoma.",https://en.wikipedia.org/wiki/Joey_Ramone,25,2001,April,(b. Jeffrey Hyman),,49,American musician,lead singer for The Ramones,lymphoma,,,,,,,
18171,4,Dipendra,", King of Nepal, 29, suicide.",https://en.wikipedia.org/wiki/Dipendra_of_Nepal,8,2001,June,,King of Nepal,29,suicide,,,,,,,,,
19018,28,Mohammad Khalequzzaman,", member of the then National Assembly of Pakistan and Union Minister of Labor, died in 28 September 2001.",https://en.wikipedia.org/wiki/Mohammad_Khalequzzaman,3,2001,September,,member of the then National Assembly of Pakistan and Union Minister of Labor,died in 28 September 2001,,,,,,,,,,
19118,12,Lord Hailsham of St Marylebone,", , 94, British lawyer and politician.","https://en.wikipedia.org/wiki/Quintin_Hogg,_Baron_Hailsham_of_St_Marylebone",17,2001,October,(Quintin Hogg),,94,British lawyer and politician,,,,,,,,,
23753,6,Jules Engel,", Jules Engel, 94, American filmmaker, visual artist, and film director.",https://en.wikipedia.org/wiki/Jules_Engel,10,2003,September,,Jules Engel,94,American filmmaker,visual artist,and film director,,,,,,,



#### Observations:
- Several rows hold the age in `info_2` as a single integer, and two hold it as a range of two integers.
- The remaining entries contain digits but no age, so the order of processing matters here.
- We can safely remove any of these rows whose `info_2` value contains a letter of the alphabet.

In [10]:
# Pattern for re: any ASCII letter (a comma inside the brackets would be matched literally)
pattern = r"[a-zA-Z]"

# List and number of rows matching pattern
rows_to_drop = rows_with_pattern(df.loc[rows_to_check, :], "info_2", pattern)

# Checking the rows
df.loc[rows_to_drop, :]

There are 21 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
19018,28,Mohammad Khalequzzaman,", member of the then National Assembly of Pakistan and Union Minister of Labor, died in 28 September 2001.",https://en.wikipedia.org/wiki/Mohammad_Khalequzzaman,3,2001,September,,member of the then National Assembly of Pakistan and Union Minister of Labor,died in 28 September 2001,,,,,,,,,,
31239,15,Guy François,", Haitian Army colonel, participated in failed coups in 1989 and 2001.",https://en.wikipedia.org/wiki/Guy_Fran%C3%A7ois_(colonel),0,2006,September,,Haitian Army colonel,participated in failed coups in 1989 and 2001,,,,,,,,,,
43769,12,Saleban Olad Roble,", Somali government minister, injuries sustained in the 2009 Shamo Hotel bombing.",https://en.wikipedia.org/wiki/Saleban_Olad_Roble,2,2010,February,,Somali government minister,injuries sustained in the 2009 Shamo Hotel bombing,,,,,,,,,,
109917,25,Robert Levinson,", American intelligence officer, missing since 2007.",https://en.wikipedia.org/wiki/Robert_Levinson,43,2020,March,(declared legally deceased on this date),American intelligence officer,missing since 2007,,,,,,,,,,
110688,12,Khalif Mumin Tohow,", Somali justice minister of Hirshabelle State, COVID-19.",https://en.wikipedia.org/wiki/Khalif_Mumin_Tohow,3,2020,April,,Somali justice minister of Hirshabelle State,COVID-19,,,,,,,,,,
110741,14,Danny Delaney,", Irish Gaelic footballer , COVID-19.",https://en.wikipedia.org/wiki/Danny_Delaney,6,2020,April,"(Laois, Stradbally)",Irish Gaelic footballer,COVID-19,,,,,,,,,,
111354,2,Justa Barrios,", American home care worker and labor organizer, COVID-19.",https://en.wikipedia.org/wiki/Justa_Barrios,9,2020,May,(death announced on this date),American home care worker and labor organizer,COVID-19,,,,,,,,,,
111887,21,Kamrun Nahar Putul,", Bangladeshi politician, COVID-19.",https://en.wikipedia.org/wiki/Kamrun_Nahar_Putul,3,2020,May,,Bangladeshi politician,COVID-19,,,,,,,,,,
112736,23,Jean-Michel Bokamba-Yangouma,", Congolese trade unionist and politician, COVID-19.",https://en.wikipedia.org/wiki/Jean-Michel_Bokamba-Yangouma,25,2020,June,,Congolese trade unionist and politician,COVID-19,,,,,,,,,,
113336,16,Cornelius Mwalwanda,", Malawian economist and politician, COVID-19.",https://en.wikipedia.org/wiki/Cornelius_Mwalwanda,5,2020,July,,Malawian economist and politician,COVID-19,,,,,,,,,,



#### Observations:
- We can drop these rows, as they are missing the data for age.
- Extraction of `age` from the single-integer values and the two-integer ranges can follow.

#### Dropping Additional Rows with Age Data Absent

In [11]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132563, 20)
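
Resetting the index renumbers the rows, so index labels collected before the drop no longer point at the same rows afterwards. That is the reason the search is run again in the next cell rather than reusing the earlier list. A toy illustration, with made-up data:

```python
import pandas as pd

# Toy frame (illustrative only)
toy = pd.DataFrame({"name": ["a", "b", "c", "d"]})

# Index label of row "d" before any rows are dropped
label_before = toy.index[toy["name"] == "d"][0]

# Drop the row labelled 1, then renumber the remaining rows from 0
toy = toy.drop([1]).reset_index(drop=True)

# The same row now carries a different label
label_after = toy.index[toy["name"] == "d"][0]

print(label_before, label_after)  # 3 2
```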


#### Remaining Rows with `age` values in `info_2`

In [12]:
# Regular expression for any digit
pattern = r"\d"

# Finding indices of rows that have pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Checking unique values
df.loc[rows_to_check, :]["info_2"].unique()

There are 26 rows with matching pattern in column 'info_2'.


array(['77', '37', '47', '60', '83', '49', '29', '94', '55', '62', '69',
       '80', '32', '70', '81', '24', '84', '95', '76', '86', '61',
       '74–75', '79–80'], dtype=object)


#### Extracting `age` for Ranges with Two Values

In [13]:
# Pattern for re
pattern = r"(\d{1,3})(-|–|/| or )(\d{1,3})"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

There are 2 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
86950,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(2005–2009),Somali politician,79–80,President of Puntland,,,,,,,,,
61933,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,74–75,,,,,,,,,,



In [14]:
# For loop to find rows matching the range pattern, save the midpoint of the range to age, and remove the range from info_2
for i in df.loc[rows_to_check, :].index:
    item = df.loc[i, "info_2"]
    match = re.search(pattern, item)
    if match:
        age = (int(match.group(1)) + int(match.group(3))) / 2
        df.loc[i, "age"] = age
        df.loc[i, "info_2"] = re.sub(pattern, "", df.loc[i, "info_2"])

# Checking example rows
pd.concat([df[df["name"] == "Mohamud Muse Hersi"], df[df["name"] == "Basil Soper"]])

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
86950,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(2005–2009),Somali politician,,President of Puntland,,,,,,,,,79.5
61933,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,,,,,,,,,,,74.5
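
The midpoint calculation used for the ranges can be isolated in a small helper. This is a sketch only: the function name `range_midpoint` is hypothetical, while the pattern is the one from the cell above.

```python
import re

# Same pattern as in the cell above: two integers joined by a range separator
RANGE_PATTERN = r"(\d{1,3})(-|–|/| or )(\d{1,3})"


def range_midpoint(text):
    """Hypothetical helper: return the midpoint of the first age range in `text`, else None."""
    match = re.search(RANGE_PATTERN, text)
    if match is None:
        return None
    # Groups 1 and 3 are the two integers; group 2 is the separator
    low, high = int(match.group(1)), int(match.group(3))
    return (low + high) / 2


print(range_midpoint("79–80"))  # 79.5
print(range_midpoint("74 or 75"))  # 74.5
print(range_midpoint("77"))  # None
```

The pattern has no word boundaries, so it would also match inside a four-digit year range: `range_midpoint("2005–2009")` returns `102.5`. Inspecting the matched rows before extracting, as done above, guards against that.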



#### Extracting `age` as Single Integer

In [15]:
# Pattern for age given as a single integer of one to three digits
pattern = r"\b(\d{1,3})\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

There are 24 rows with matching pattern in column 'info_2'.



In [17]:
# For loop to extract age pattern to age column
for i in df.loc[rows_to_check, :].index:
    item = df.loc[i, "info_2"]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[i, "age"] = age
        df.loc[i, "info_2"] = re.sub(pattern, "", df.loc[i, "info_2"])

# Re-checking number of rows matching patterns
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

There are 0 rows with matching pattern in column 'info_2'.
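
For comparison, the single-integer extraction can also be sketched with the pandas string accessor instead of a loop. The toy frame below is made up for illustration, and unlike the loop above, this sketch does not remove the matched digits from `info_2`.

```python
import pandas as pd

# Toy frame (illustrative only); age starts out missing in every row
toy = pd.DataFrame(
    {"info_2": ["77", "British actor", None], "age": [float("nan")] * 3}
)

# expand=False returns a Series holding capture group 1 (missing where there is no match)
extracted = toy["info_2"].str.extract(r"\b(\d{1,3})\b", expand=False)

# Fill only rows where age is still missing and a number was found
fill = toy["age"].isna() & extracted.notna()
toy.loc[fill, "age"] = extracted[fill].astype(float)

print(toy["age"].tolist())
```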



In [21]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')
df[df["age"].isna()].sample(5)

There are 171 missing values for age.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
113214,12,Mohamed Abdi Hashi,", Somali politician, President of Puntland .",https://en.wikipedia.org/wiki/Mohamed_Abdi_Hashi,6,2020,July,(2004–2005),Somali politician,President of Puntland,,,,,,,,,,
85111,8,Aide Ganasi,", Papua New Guinean politician, MP , heart attack.",https://en.wikipedia.org/wiki/Aide_Ganasi,14,2016,November,(since 2012),Papua New Guinean politician,MP,heart attack,,,,,,,,,
98059,11,Cheikhna Ould Mohamed Laghdaf,", Mauritanian diplomat and politician, Foreign Minister .",https://en.wikipedia.org/wiki/Cheikhna_Ould_Mohamed_Laghdaf,1,2018,September,"(1962–1963, 1978–1979)",Mauritanian diplomat and politician,Foreign Minister,,,,,,,,,,
108620,5,Mohammad Shafiq,", Pakistani politician, MLA , cardiac arrest.",https://en.wikipedia.org/wiki/Mohammad_Shafiq_(politician),4,2020,February,(since 2015),Pakistani politician,MLA,cardiac arrest,,,,,,,,,
62642,9,Bhagwati Prasad,", Indian politician, Uttar Pradesh MLA for Ikauna , multiple organ failure.",https://en.wikipedia.org/wiki/Bhagwati_Prasad_(politician),1,2013,July,"(1967–1969, 1969–1974)",Indian politician,Uttar Pradesh MLA for Ikauna,multiple organ failure,,,,,,,,,



In [19]:
# Examining all rows still missing age
df[df["age"].isna()]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
13511,29,Faustin Birindwa,", Prime Minister of Zaire .",https://en.wikipedia.org/wiki/Faustin_Birindwa,4,1999,April,(1993 – 1994),Prime Minister of Zaire,,,,,,,,,,,
22856,10,Little Eva,", .",https://en.wikipedia.org/wiki/Little_Eva,14,2003,April,"(née Eva Narcissus Boyd), 59, American pop singer ()",,,,,,,,,,,,
26314,18,Alfred Maseng,", Vanuatuan president .",https://en.wikipedia.org/wiki/Alfred_Maseng,3,2004,November,"(1994, 2004) and foreign minister (1995–1996)",Vanuatuan president,,,,,,,,,,,
31766,21,Sir Harold Young,", Australian Liberal politician, President of the Senate .",https://en.wikipedia.org/wiki/Harold_Young_(politician),3,2006,November,(1981–1983),Australian Liberal politician,President of the Senate,,,,,,,,,,
32722,17,Joseph C. Casdin,", American businessman and politician, Mayor of Worcester, Massachusetts",https://en.wikipedia.org/wiki/Joseph_C._Casdin,6,2007,March,(1962–1963),American businessman and politician,Mayor of Worcester,Massachusetts,,,,,,,,,
36487,2,Justin Yak,", Sudanese politician, minister for cabinet affairs for Southern Sudan , plane crash.",https://en.wikipedia.org/wiki/Justin_Yak,1,2008,May,(2006–2007),Sudanese politician,minister for cabinet affairs for Southern Sudan,plane crash,,,,,,,,,
40580,20,Nguyễn Bá Cẩn,", Vietnamese politician, Prime Minister of South Vietnam .",https://en.wikipedia.org/wiki/Nguy%E1%BB%85n_B%C3%A1_C%E1%BA%A9n,3,2009,May,(1975),Vietnamese politician,Prime Minister of South Vietnam,,,,,,,,,,
41574,22,Iftikhar Ali Khan,", Pakistani general, Defence Secretary , heart attack.",https://en.wikipedia.org/wiki/Iftikhar_Ali_Khan_(general),1,2009,August,(1997–1999),Pakistani general,Defence Secretary,heart attack,,,,,,,,,
44835,1,Dragan Kujović,", Montenegrin politician, President .",https://en.wikipedia.org/wiki/Dragan_Kujovi%C4%87,2,2010,May,(2003),Montenegrin politician,President,,,,,,,,,,
45013,15,Gabriel Bien-Aimé,", Haitian politician, Minister of Education , heart attack.",https://en.wikipedia.org/wiki/Gabriel_Bien-Aim%C3%A9,2,2010,May,(2006–2008),Haitian politician,Minister of Education,heart attack,,,,,,,,,

