# Wikipedia Notable Life Expectancies

# [Notebook 4 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean3_thanak_2022_06_23.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean2.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean2", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 132393 rows and 21 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0,
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0,



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
132391,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0,
132392,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0,



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
20095,13,Thomas J. H. Trapnell,", 99, American U.S. Army lieutenant general .",https://en.wikipedia.org/wiki/Thomas_J._H._Trapnell,24,2002,February,(survived Bataan Death March),,American U.S. Army lieutenant general,,,,,,,,,,99.0,
77610,26,Eugene D. Commins,", 83, American physicist.",https://en.wikipedia.org/wiki/Eugene_D._Commins,13,2015,September,,,American physicist,,,,,,,,,,83.0,
69762,28,Madhukar Dighe,", 94, Indian politician, Governor of Meghalaya .",https://en.wikipedia.org/wiki/Madhukar_Dighe,6,2014,July,(–) and Arunachal Pradesh (),,Indian politician,Governor of Meghalaya,,,,,,,,,94.0,
55774,2,Gunnar Eide,", 92, Norwegian actor and theatre director.",https://en.wikipedia.org/wiki/Gunnar_Eide,2,2012,July,,,Norwegian actor and theatre director,,,,,,,,,,92.0,
76402,13,Joan Sebastian,", 64, Mexican singer and songwriter, bone cancer.",https://en.wikipedia.org/wiki/Joan_Sebastian,30,2015,July,,,Mexican singer and songwriter,bone cancer,,,,,,,,,64.0,



### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132393 entries, 0 to 132392
Data columns (total 21 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132393 non-null  object 
 1   name            132393 non-null  object 
 2   info            132393 non-null  object 
 3   link            132393 non-null  object 
 4   num_references  132393 non-null  object 
 5   year            132393 non-null  int64  
 6   month           132393 non-null  object 
 7   info_parenth    49752 non-null   object 
 8   info_1          30 non-null      object 
 9   info_2          132345 non-null  object 
 10  info_3          62328 non-null   object 
 11  info_4          12541 non-null   object 
 12  info_5          1498 non-null    object 
 13  info_6          214 non-null     object 
 14  info_7          31 non-null      object 
 15  info_8          6 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   


#### Loading `nation_country_dict` from Pickle File

In [7]:
# Load the nation_country_dict
with open("nation_country_dict.pkl", "rb") as f:
    nation_country_dict = pickle.load(f)


## Extracting Nationality Continued
Here is the approach we will take:
- We will save the country name in new `nation_1` and `nation_2` columns, since the country name is the standardized form of the various associated nationality values.
- First, we will update the keys in `nation_country_dict` by replacing hyphens with " ".
- Then we will remove "-born" from the column we are searching, and replace any remaining "-" with " ".
- We will then search the numbered `info` columns in order, checking as follows:
    1. If the first word is in the dictionary:
        a. Extract the country to `nation_1` and check the next word (the new first word).
        b. If the new first word is in the dictionary, extract the country to `nation_2`.
        c. If not, check whether the first 2 words are in the dictionary and extract to `nation_2`, and so on.
    2. If the first word is not in the dictionary:
        a. Check whether the first 2 words are in the dictionary and extract to `nation_1`, and so on.
- So, individual words are compared first, then combinations of words.
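
The approach above can be sketched on plain strings before applying it to the dataframe. This is a minimal illustration, not the notebook's implementation: `sample_dict` is a hypothetical stand-in for `nation_country_dict`, and the helper names (`normalize_keys`, `normalize_text`, `match_leading_nation`, `extract_nations`) and the `max_words` limit are assumptions made for this example.

```python
import re

# Hypothetical stand-in for nation_country_dict; the real mapping comes from the pickle file.
sample_dict = {
    "American": "United States of America",
    "German": "Germany",
    "New Zealand": "New Zealand",
    "Example-Hyphenated": "Exampleland",  # made-up entry to show key normalization
}


def normalize_keys(mapping):
    """Return a copy of the mapping with hyphens in the keys replaced by spaces."""
    return {key.replace("-", " "): value for key, value in mapping.items()}


def normalize_text(text):
    """Remove "-born", replace remaining hyphens with spaces, and collapse whitespace."""
    cleaned = text.replace("-born", "").replace("-", " ")
    return re.sub(r"\s+", " ", cleaned).strip()


def match_leading_nation(words, mapping, max_words=3):
    """Try the first word, then the first 2 words, and so on (max_words is an assumed cap).

    Returns (country, number_of_words_consumed), or (None, 0) if nothing matches.
    """
    for n in range(1, min(max_words, len(words)) + 1):
        candidate = " ".join(words[:n])
        if candidate in mapping:
            return mapping[candidate], n
    return None, 0


def extract_nations(text, mapping):
    """Return (nation_1, nation_2, remaining_text) for one info value."""
    words = normalize_text(text).split()
    nation_1, used = match_leading_nation(words, mapping)
    words = words[used:]
    nation_2 = None
    # Only look for a second nationality if a first one was found
    if nation_1 is not None:
        nation_2, used = match_leading_nation(words, mapping)
        words = words[used:]
    return nation_1, nation_2, " ".join(words)


lookup = normalize_keys(sample_dict)
print(extract_nations("German-born American physicist", lookup))
print(extract_nations("New Zealand rugby player", lookup))
```

Checking shorter word combinations first follows the order described in the list. One consequence is that a one-word key would win over a longer key that starts with the same word, so the result depends on which keys the dictionary actually contains.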

In [None]:
#### nation_country_dict

#### Checking First Word in `info_1` for `nation_1` Values

In [None]:
# Column to check
column = "info_1"

# Extract-to column
extract_to = 'nation_1'

# Dataframe to check
dataframe = df[df[column].notna()]

# For loop to split each value into a list and check if the first word is a key in nation_country_dict,
# assigning the country to the new nation_1 column if found.  Removes the found word from the column searched.
for index in dataframe.index:
    item = df.loc[index, column]
    split_1 = item.split()
    # Skip values that are empty or whitespace only
    if not split_1:
        continue
    first_word = split_1[0]
    if first_word in nation_country_dict:
        df.loc[index, extract_to] = nation_country_dict[first_word]
        # Remove only the leading occurrence so later repeats of the word are kept
        df.loc[index, column] = item.replace(first_word, "", 1).strip()

# Check number and sample of treated rows
print(f'{len(df[df[extract_to].notna()])} rows were treated.')
df[df[extract_to].notna()].sample(2)

#### Observations:
- In a way, we are fortunate to have a small sample of rows on which to test code in `info_1`.
- We successfully extracted those `nation_1` values. Next, we will do the same on the treated rows for `nation_2`.

#### Checking First Word in `info_1` for `nation_2` Values