# Wikipedia Notable Life Expectancies

# [Notebook 4 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean3_thanak_2022_06_23.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean2.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean2", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 132652 rows and 21 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0,
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0,



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
132650,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(1980),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0,
132651,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0,



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
121226,10,Edward Cassidy,", 96, Australian Roman Catholic cardinal, apostolic pro-nuncio to China .",https://en.wikipedia.org/wiki/Edward_Cassidy,13,2021,April,"(1970–1979) and Bangladesh (1973–1979), president of the PCPCU (1989–2001)",,Australian Roman Catholic cardinal,apostolic pro-nuncio to China,,,,,,,,,96.0,
5102,20,Mario Procaccino,", 83, Italian-American lawyer and politician.",https://en.wikipedia.org/wiki/Mario_Procaccino,1,1995,December,,,Italian-American lawyer and politician,,,,,,,,,,83.0,
22165,5,Jean Kerr,", 80, American author and playwright, pneumonia.",https://en.wikipedia.org/wiki/Jean_Kerr,6,2003,January,,,American author and playwright,pneumonia,,,,,,,,,80.0,
117379,18,Jerry Relph,", 76, American politician, member of the Minnesota Senate , complications from COVID-19.",https://en.wikipedia.org/wiki/Jerry_Relph,16,2020,December,(since 2017),,American politician,member of the Minnesota Senate,complications from COVID-19,,,,,,,,76.0,
52418,23,Norayr Musheghyan,", 76, Armenian wrestler, coach and public activist, world champion .",https://en.wikipedia.org/wiki/Norayr_Musheghyan,3,2011,December,(1958),,Armenian wrestler,coach and public activist,world champion,,,,,,,,76.0,



### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132652 entries, 0 to 132651
Data columns (total 21 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132652 non-null  object 
 1   name            132652 non-null  object 
 2   info            132652 non-null  object 
 3   link            132652 non-null  object 
 4   num_references  132652 non-null  object 
 5   year            132652 non-null  int64  
 6   month           132652 non-null  object 
 7   info_parenth    49830 non-null   object 
 8   info_1          35 non-null      object 
 9   info_2          132604 non-null  object 
 10  info_3          62571 non-null   object 
 11  info_4          12605 non-null   object 
 12  info_5          1497 non-null    object 
 13  info_6          216 non-null     object 
 14  info_7          31 non-null      object 
 15  info_8          6 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   


#### Loading `nation_country_dict` from Pickle File

In [6]:
# Load the nation_country_dict
with open("nation_country_dict.pkl", "rb") as f:
    nation_country_dict = pickle.load(f)


## Extracting Nationality Continued
Here is the approach we will take:
- We will save the country name, rather than the nationality, in new `nation_1` and `nation_2` columns, because the country name is standardized across the various nationality values associated with it.
- First, we will update the keys and values in `nation_country_dict` by replacing hyphens with a single space.
- Then we will remove "-born" from the column we are searching and replace "-" with a single space. In this step, we can also remove leading and trailing periods and whitespace.
- We will then search the numbered `info` columns in order, checking as follows:
    1. If the column value starts with a nationality (dictionary key):
        - save the country to `nation_1` and remove the nationality from the searched column.
    2. If a `nation_1` value has been found:
        - if the updated column value starts with a nationality (dictionary key):
            - save the country to `nation_2` and remove the nationality from the searched column.
    3. Repeat steps 1 and 2, but comparing with the country names (dictionary values).
    4. Check the column's remaining unique values that start with a capital letter.
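
Steps 1 and 2 above can be sketched as a single reusable function. This is a minimal illustration rather than the notebook's own code: `extract_nation`, its `require` parameter, and the toy dictionary and rows are hypothetical names introduced here. The sketch differs from a plain `str.replace` in two ways: it slices off only the leading match, so later occurrences of the same word are left intact, and it tries longer keys first, so a longer key is tested before any shorter key that is a prefix of it.

```python
import pandas as pd


def extract_nation(df, column, extract_to, lookup, require=None):
    """Move a leading nationality from `column` into `extract_to` as a country name."""
    if extract_to not in df.columns:
        df[extract_to] = None  # create the target column on first use

    # Longest keys first, so a longer key wins over a shorter prefix of it
    keys = sorted(lookup, key=len, reverse=True)

    # Only rows that have text to search and no value extracted yet
    mask = df[column].notna() & df[extract_to].isna()
    if require is not None:
        # e.g. only look for nation_2 where nation_1 was already found
        mask &= df[require].notna()

    for index in df[mask].index:
        item = df.at[index, column]
        for nationality in keys:
            if item.startswith(nationality):
                df.at[index, extract_to] = lookup[nationality]
                # Slice off only the leading match
                df.at[index, column] = item[len(nationality):].strip()
                break
    return df


# Hypothetical two-entry dictionary and toy rows, for illustration only
toy_dict = {"Canadian": "Canada", "American": "United States of America"}
toy = pd.DataFrame(
    {"info_2": ["Canadian American mathematician", "American actor", None]}
)

extract_nation(toy, "info_2", "nation_1", toy_dict)
extract_nation(toy, "info_2", "nation_2", toy_dict, require="nation_1")
print(toy)
```

Calling the function once per target column mirrors the two-pass structure of the approach: the first call fills `nation_1`, and the second call only considers rows where `nation_1` is already set.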

#### Removing Hyphens from `nation_country_dict`

In [7]:
# Replacing hyphens with a single space in the keys and values of nation_country_dict
nation_country_dict = {
    key.replace("-", " "): value.replace("-", " ")
    for key, value in nation_country_dict.items()
}


In [8]:
# List of columns to treat
cols_lst = [
    "info_1",
    "info_2",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# For loop to remove "-born" and replace "-" with a single space in the columns in cols_lst,
# then strip any leading or trailing periods or whitespace
for column in cols_lst:
    # Iterate over the non-null rows only
    for index in df[df[column].notna()].index:
        if df.loc[index, column]:
            df.loc[index, column] = (
                df.loc[index, column].replace("-born", "").strip(" .")
            )
        if df.loc[index, column]:
            df.loc[index, column] = df.loc[index, column].replace("-", " ").strip(" .")

CPU times: total: 37 s
Wall time: 37 s
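
The row-by-row loop above took about 37 seconds. As a hedged alternative, the same cleanup can be sketched with pandas' vectorized `.str` methods. The toy frame below is made up for illustration and stands in for `df`. One known difference: missing values may come back as `NaN` rather than `None`, which `isna()` and `notna()` treat the same way.

```python
import pandas as pd

# Toy frame standing in for the notebook's df; the values are illustrative only
toy = pd.DataFrame(
    {
        "info_2": ["Greek-born American academic.", " Italian-American lawyer ", None],
        "info_3": ["complications from COVID-19.", None, ""],
    }
)

for column in ["info_2", "info_3"]:
    toy[column] = (
        toy[column]
        .str.replace("-born", "", regex=False)  # drop the "-born" suffix
        .str.replace("-", " ", regex=False)  # remaining hyphens become spaces
        .str.strip(" .")  # trim leading/trailing spaces and periods
    )

print(toy)
```

Note that, as in the loop, every hyphen is replaced, so a value such as "COVID-19" becomes "COVID 19".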



#### Checking `info_1` for `nation_1`

In [9]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "nation_1"

# Create the target column on first use
if extract_to not in df.columns:
    df[extract_to] = None

# Dataframe to check
dataframe = df[(df[column].notna()) & (df[extract_to].isna())]

# For loop to extract nation data to nation column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Check a sample of treated rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1
8861,13,George Strugar,", 63. American gridiron football player, lung cancer.",https://en.wikipedia.org/wiki/George_Strugar,0,1997,June,,gridiron football player,lung cancer,,,,,,,,,,63.0,,United States of America
87054,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(2005–2009),politician,,President of Puntland,,,,,,,,,79.5,,Somalia



#### Observations:
- `info_1` has only 35 non-null values, so it gives us a small sample on which to test the code.
- We successfully extracted those `nation_1` values. Next, we will do the same on the treated rows for `nation_2`.

#### Checking `info_1` for `nation_2`

In [10]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "nation_2"

# Dataframe to check (nation_2 does not exist yet, so it cannot be part of the filter;
# it is created by the first matching assignment below)
dataframe = df[(df[column].notna()) & (df["nation_1"].notna())]

# For loop to extract nation data to nation column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Check a sample of rows
df.sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1
47671,23,Donald Allchin,", 80, British Anglican priest and theologian",https://en.wikipedia.org/wiki/Donald_Allchin,9,2010,December,,,British Anglican priest and theologian,,,,,,,,,,80.0,,
65296,12,Tom Laughlin,", 82, American actor , complications from pneumonia.",https://en.wikipedia.org/wiki/Tom_Laughlin,91,2013,December,(),,American actor,complications from pneumonia,,,,,,,,,82.0,,



#### Observations:
- Here we can see that the new column `nation_2` has not yet been added as there were not any matching values.
- Let us confirm by checking the remaining unique values in `info_1`.

#### Checking Remaining Unique Values in `info_1`

In [11]:
# Checking unique values
df["info_1"].unique()

array([None, 'politician', 'Olympic sprinter', 'gridiron football player',
       'writer', 'businessman', 'social psychologist', 'King of Nepal',
       'Maori leader', 'artist', 'English sports journalist',
       'Jules Engel', 'early', 'aka', 'Jr', 'professional wrestler',
       'automotive engineer', 'materials scientist', 'weightlifter',
       'common chimpanzee', '', 'Olympic athlete', 'actor',
       'Olympic gymnast', 'broadcaster and writer', 'Olympic swimmer',
       'Olympic boxer', 'Olympic wrestler', 'Olympic sailor',
       'basketball player', 'college basketball coach',
       'choral conductor', 'Tree of the Year'], dtype=object)


#### Observations:
- Neither "English" nor "Maori" is a key in the current dictionary.
- Maori is an ethnicity within the country of New Zealand, so for now, we will add it as a key to our dictionary with the country value of New Zealand. If we end up with matching first and second countries, we can later remove the second value.
- We will also add the key "English" with the country value 'United Kingdom of Great Britain and Northern Ireland'.
- Then, we can rerun the above code for `nation_1` and `nation_2`.
- The country name "Nepal" is also present. We will hold off on extracting country names until we have exhausted the matching nationalities, as the Wikipedia field called for nationalities.

#### Updating `nation_country_dict`

In [12]:
# Adding key: country pairs to nation_country_dict
nation_country_dict["English"] = nation_country_dict["British"]
nation_country_dict["Maori"] = nation_country_dict["New Zealand"]


#### Re-checking `info_1` for `nation_1`

In [13]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "nation_1"

# Dataframe to check
dataframe = df[(df[column].notna()) & (df[extract_to].isna())]

# For loop to extract nation data to nation column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1
99323,12,Maryse Morandini,", 85. French Olympic swimmer .",https://en.wikipedia.org/wiki/Maryse_Morandini,2,2018,November,(1952),Olympic swimmer,,,,,,,,,,,85.0,,France
12317,25,Sir Robin Brook,", 90 British businessman, banker and Olympic fencer.",https://en.wikipedia.org/wiki/Robin_Brook,3,1998,October,,businessman,banker and Olympic fencer,,,,,,,,,,90.0,,United Kingdom of Great Britain and Northern Ireland



#### Re-checking `info_1` for `nation_2`

In [14]:
# Column to check
column = "info_1"

# Extract to column
extract_to = "nation_2"

# Create the target column on first use
if extract_to not in df.columns:
    df[extract_to] = None

# Dataframe to check
dataframe = df[
    (df[column].notna()) & (df[extract_to].isna()) & (df["nation_1"].notna())
]

# For loop to extract nation data to nation column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Checking rows
df[df["nation_2"].notna()]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1,nation_2
19580,20,Dame Miraka Szászy,", 80. New Zealand Maori leader.",https://en.wikipedia.org/wiki/Mira_Sz%C3%A1szy,21,2001,December,,leader,,,,,,,,,,,80.0,,New Zealand,New Zealand



#### Observations:
- Our code appears to be finding the matching values and assigning the corresponding country to the correct nation column.
- We see "New Zealand" added to both nation columns here, which was expected as both New Zealand and Maori are in the description.
- Now we can proceed to doing the same extraction on `info_2`.
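
The later cleanup of matching first and second countries could look like the following sketch. It is an assumption about how that step might be done, not code from the notebook, and the toy frame is illustrative only.

```python
import pandas as pd

# Toy frame for illustration; the first row mirrors the New Zealand / Maori case
toy = pd.DataFrame(
    {
        "nation_1": ["New Zealand", "Canada", "France"],
        "nation_2": ["New Zealand", "United States of America", None],
    }
)

# Rows where both columns hold the same country
duplicate = toy["nation_2"].notna() & (toy["nation_1"] == toy["nation_2"])

# Blank out the redundant second value
toy.loc[duplicate, "nation_2"] = None

print(toy)
```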

#### Checking `info_2` for `nation_1`

In [18]:
# Column to check
column = "info_2"

# Extract to column
extract_to = "nation_1"

# Dataframe to check
dataframe = df[(df[column].notna()) & (df[extract_to].isna())]

# For loop to extract nation data to nation column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Check a sample of rows
df[df[extract_to].notna()].sample(2)

CPU times: total: 14.7 s
Wall time: 14.7 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1,nation_2
91043,9,Harold Nutter,", 93, Canadian Anglican prelate, Metropolitan of Canada .",https://en.wikipedia.org/wiki/Harold_Nutter,10,2017,September,(1980–1989),,Anglican prelate,Metropolitan of Canada,,,,,,,,,93.0,,Canada,
114565,26,Ørnulf Tofte,", 98, Norwegian police officer .",https://en.wikipedia.org/wiki/%C3%98rnulf_Tofte,8,2020,August,(Police Surveillance Agency),,police officer,,,,,,,,,,98.0,,Norway,



#### Checking `info_2` for `nation_2`

In [21]:
# Column to check
column = "info_2"

# Extract to column
extract_to = "nation_2"

# Dataframe to check
dataframe = df[
    (df[column].notna()) & (df[extract_to].isna()) & (df["nation_1"].notna())
]

# For loop to extract nation data to nation column
for nationality, country in nation_country_dict.items():
    for index in dataframe.index:
        item = df.loc[index, column]
        if item.startswith(nationality):
            df.loc[index, extract_to] = country
            df.loc[index, column] = (
                df.loc[index, column].replace(nationality, "").strip()
            )

# Check a sample of rows
df[df[extract_to].notna()].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1,nation_2
129107,17,Hale Trotter,", 90, Canadian-American mathematician.",https://en.wikipedia.org/wiki/Hale_Trotter,5,2022,January,,,mathematician,,,,,,,,,,90.0,,Canada,United States of America
74629,1,Peter Diamandopoulos,", 86, Greek-born American academic, President of Adelphi University .",https://en.wikipedia.org/wiki/Peter_Diamandopoulos,11,2015,April,(1985–1997),,academic,President of Adelphi University,,,,,,,,,86.0,,Greece,United States of America



In [23]:
# Counting rows still missing nation_1
df["nation_1"].isna().sum()

2351


In [24]:
# Counting rows without a nation_2 value
df["nation_2"].isna().sum()

128618


In [32]:
# Checking the first rows still missing nation_1
df[df["nation_1"].isna()].head(100)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,nation_1,nation_2
6,2,Viktor Aristov,", 50, Soviet and Russian film director and screenwriter.",https://en.wikipedia.org/wiki/Viktor_Aristov_(director),1,1994,January,,,Soviet and Russian film director and screenwriter,,,,,,,,,,50.0,,,
36,6,Adri van Es,", 80, Royal Netherlands Navy vice admiral.",https://en.wikipedia.org/wiki/Adri_van_Es,0,1994,January,,,Royal Netherlands Navy vice admiral,,,,,,,,,,80.0,,,
42,7,Phoumi Vongvichit,", 84, President of Laos.",https://en.wikipedia.org/wiki/Phoumi_Vongvichit,1,1994,January,,,President of Laos,,,,,,,,,,84.0,,,
63,10,Roman Tkachuk,", 61, Soviet theatre and film actor.",https://en.wikipedia.org/wiki/Roman_Tkachuk,3,1994,January,,,Soviet theatre and film actor,,,,,,,,,,61.0,,,
72,12,Goran Ivandić,", 38, Yugoslav drummer, suicide.",https://en.wikipedia.org/wiki/Ipe_Ivandi%C4%87,3,1994,January,,,Yugoslav drummer,suicide,,,,,,,,,38.0,,,
86,14,Ivan Fuqua,", 84, America track and field athlete.",https://en.wikipedia.org/wiki/Ivan_Fuqua,3,1994,January,,,America track and field athlete,,,,,,,,,,84.0,,,
104,17,Yevgeny Ivanov,", 68, Soviet spy.",https://en.wikipedia.org/wiki/Yevgeny_Ivanov_(spy),10,1994,January,,,Soviet spy,,,,,,,,,,68.0,,,
105,17,Juan Carlos Pugliese,", 78, Argentinian lawyer and politician.",https://en.wikipedia.org/wiki/Juan_Carlos_Pugliese,1,1994,January,,,Argentinian lawyer and politician,,,,,,,,,,78.0,,,
137,23,Alexei Mozhaev,", 75, Soviet and Russian painter, graphic artist, and art teacher.",https://en.wikipedia.org/wiki/Alexei_Mozhaev,5,1994,January,,,Soviet and Russian painter,graphic artist,and art teacher,,,,,,,,75.0,,,
139,23,Nikolai Ogarkov,", 76, Soviet military officer and Hero of the Soviet Union.",https://en.wikipedia.org/wiki/Nikolai_Ogarkov,21,1994,January,,,Soviet military officer and Hero of the Soviet Union,,,,,,,,,,76.0,,,



In [34]:
# Checking whether "Korean" is a key in nation_country_dict
nation_country_dict["Korean"]

KeyError: 'Korean'
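
A membership test avoids the `KeyError` when probing the dictionary for candidate keys. The sketch below uses a hypothetical one-entry stand-in for `nation_country_dict` and an illustrative list of candidates.

```python
# Hypothetical stand-in for nation_country_dict, for illustration only
toy_dict = {"British": "United Kingdom of Great Britain and Northern Ireland"}

# Candidate nationality keys to probe
candidates = ["Korean", "Soviet", "British"]

# Keys that are not yet in the dictionary
missing = [key for key in candidates if key not in toy_dict]
print(missing)
```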


In [35]:
# Counting rows with a nation_1 value
df["nation_1"].notna().sum()

130301
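
To decide which keys to add next, one option is to tally the capitalized leading word of the values that are still unmatched, in the spirit of step 4 of the approach. This is a sketch under assumptions: the toy rows reuse a few `info_2` values from the rows shown above, and the regular expression is an illustrative choice rather than the notebook's method.

```python
import pandas as pd

# Toy rows; the first four info_2 values are taken from unmatched rows shown earlier
toy = pd.DataFrame(
    {
        "info_2": [
            "Soviet spy",
            "Soviet theatre and film actor",
            "Yugoslav drummer",
            "President of Laos",
            "politician",
            "police officer",
        ],
        "nation_1": [None, None, None, None, None, "Norway"],
    }
)

# Searched text for rows that still lack nation_1
unmatched = toy.loc[toy["nation_1"].isna(), "info_2"].dropna()

# Capture the first word only when it begins with a capital letter
leading = unmatched.str.extract(r"^([A-Z][^\s,]*)", expand=False)

# Most frequent leading words are the best candidates for new dictionary keys
counts = leading.value_counts()
print(counts)
```

Words such as "President" will also surface, so the tally is a starting point for manual review rather than a list of nationalities.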
