# Wikipedia Notable Life Expectancies

# [Notebook 2 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean_thanak_2022_06_13.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [22]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [23]:
# Reading the wp_life_expect_raw_complete dataset
conn = sql.connect("wp_life_expect_raw_complete.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_raw_complete", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 133900 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



In [24]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
133898,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
133899,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [25]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
126175,August 2021,18,Jill Murphy,", 72, British author (), cancer.",https://en.wikipedia.org/wiki/Jill_Murphy,35
22253,November 2002,24,Kazım Ergin,", 87, Turkish geophysicist.",https://en.wikipedia.org/wiki/Kaz%C4%B1m_Ergin,3
84720,August 2016,21,Marin Moraru,", 79, Romanian actor.",https://en.wikipedia.org/wiki/Marin_Moraru,2
76407,May 2015,22,Alan Koch,", 77, American baseball player (Detroit Tigers, Washington Senators).",https://en.wikipedia.org/wiki/Alan_Koch_(baseball),8
55571,May 2012,10,Barbara D'Arcy,", 84, American visual merchandiser.",https://en.wikipedia.org/wiki/Barbara_D%27Arcy,5



#### Observations:
- There are 133,900 rows and 6 columns.
- `month_year` contains the month and year of death, while `day` contains the day of the month of death.
- `name` is the notable person's name.  It is a nominal feature that will not be used for analysis, but will be maintained for any referencing needs.
- `info` contains multiple items including the notable person's "age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, (and) cause of death (if known)."
- `link` is the url to the notable person's individual Wikipedia page.  If such a page does not exist, there is either a non-working link (https://en.wikipedia.orgNone), or the link is to a page with a message that the page does not exist for that individual.  `link` is a unique identifier for all entries, except the 6 with the non-working link, which do have unique `name` values from each other.
- `num_references` contains the number of references on the notable person's individual Wikipedia page.  This feature serves as a proxy measure of notability.
- Prior to EDA, our task will be to extract the individual elements that are combined in the `month_year` and `info` columns.
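
The uniqueness claim for `link` can be checked directly. The following is a minimal sketch on toy rows, not the notebook's own check; the names and the two placeholder links are illustrative stand-ins for the real table.

```python
import pandas as pd

# Toy rows standing in for the real table (values are illustrative only)
toy = pd.DataFrame(
    {
        "name": ["Person A", "Person B", None, None],
        "link": [
            "https://en.wikipedia.org/wiki/Person_A",
            "https://en.wikipedia.org/wiki/Person_B",
            "https://en.wikipedia.orgNone",
            "https://en.wikipedia.orgNone",
        ],
    }
)

# Flag every row whose link value appears more than once
dup_mask = toy["link"].duplicated(keep=False)

# Count how often each repeated link occurs
dup_counts = toy.loc[dup_mask, "link"].value_counts()
print(dup_counts)
```

On the real data, one would expect only the non-working link to show up in `dup_counts` if the observation above holds.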

### Checking Data Types, Duplicates, and Null Values

In [26]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133894 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



In [27]:
# Checking duplicate rows
df.duplicated().sum()

0


In [28]:
# Check percentage of null values by column (share of all rows, not of non-null rows)
df.isnull().mean() * 100

month_year       0.000
day              0.000
name             0.004
info             0.000
link             0.000
num_references   0.000
dtype: float64


In [29]:
# Checking number of missing values per row (not necessary here, but done to keep process standard)
df.isnull().sum(axis=1).value_counts()

0    133894
1         6
dtype: int64


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- As expected, we have 6 entries that are missing `name`, but the names can be recovered from their `info` values.
- All columns are currently of object type.  We will need to appropriately typecast them after separating the information in `month_year` and `info`.
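
As a rough illustration of the typecasting step, the sketch below converts string columns to numbers with `pd.to_numeric`. The toy frame and the choice of `errors="coerce"` are assumptions for illustration; the notebook may typecast differently.

```python
import pandas as pd

# Toy frame with numeric values stored as strings, as in the raw table
toy = pd.DataFrame({"day": ["1", "22"], "num_references": ["21", "0"]})

for col in ["day", "num_references"]:
    # errors="coerce" turns unparseable strings into NaN instead of raising
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

Coercing rather than raising makes it easy to count unparseable entries afterwards with `isna().sum()`.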

## Data Cleaning

### Addressing Missing `name` Values

In [30]:
# Checking rows with missing name values
missing_name = df[df["name"].isna()]
missing_name

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- These rows differ from the rest of the dataset: the `info` string starts with the person's name.
- As there are so few rows missing `name`, let us address this issue first.

In [31]:
# For loop to copy name value from info column
treat_rows = missing_name.index
for i in treat_rows:
    info = df.loc[i, "info"]
    info_lst = info.split(sep=",", maxsplit=1)
    df.loc[i, "name"] = info_lst[0]

# Re-check rows
df.loc[treat_rows, :]

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,Kevin Kowalcyk,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,Vincent Palmer,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,Barry Stigler,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,Nana Gualdi,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,Scott Workman,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,Thami Shobede,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- Missing `name` values have been addressed.
- The names still appear in the `info` value for these rows, but we can address that as we separate the information in that column.
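
For reference, the same fill can be done without a loop, and the name can be removed from `info` in the same pass. This is only a sketch on two toy rows taken from the outputs above; re-adding the leading comma so the treated rows match the format of the other rows is an assumption about what is convenient later.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["William Chappell", None],
        "info": [
            ", 86, British dancer, ballet designer and director.",
            "Vincent Palmer, 37, British criminal.",
        ],
    }
)

# Rows that still need a name
missing = toy["name"].isna()

# Split once on the first comma: column 0 is the name, column 1 the rest
parts = toy.loc[missing, "info"].str.split(pat=",", n=1, expand=True)

toy.loc[missing, "name"] = parts[0].str.strip()
# Restore the leading comma so info has the same shape as in the other rows
toy.loc[missing, "info"] = "," + parts[1]

print(toy)
```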

In [32]:
# Re-check info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133900 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



#### Observations:
- We have no remaining missing values.
- Let us treat the `month_year` column next.

### Separating `month` and `year`
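
One vectorized way to perform this kind of split is `Series.str.split` with `expand=True`, sketched here on a toy frame. It assumes every value has the form "Month YYYY", which matches the samples shown above but is not verified here for the full dataset.

```python
import pandas as pd

toy = pd.DataFrame({"month_year": ["January 1994", "June 2022"]})

# Split once on the space: column 0 is the month, column 1 the year
parts = toy["month_year"].str.split(pat=" ", n=1, expand=True)

toy["month"] = parts[0]
toy["year"] = parts[1].astype("int64")

print(toy)
```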

In [33]:
# Separating month and year into 2 columns and typecasting year as integer
df.loc[:, "year"] = df["month_year"].apply(lambda x: x.split(sep=" ")[1])
df["year"] = df["year"].astype("int64")

df.loc[:, "month"] = df["month_year"].apply(lambda x: x.split(sep=" ")[0])
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references,year,month
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January



In [34]:
# Dropping month_year column
df.drop("month_year", axis=1, inplace=True)


### Treating `info`

#### Checking a Sample

In [35]:
# Checking a sample of info
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
65542,14,Ramziya al-Iryani,", 58–59, Yemeni novelist, writer, diplomat and feminist.",https://en.wikipedia.org/wiki/Ramziya_al-Iryani,10,2013,November
55464,4,Angelica Garnett,", 93, British writer and painter.",https://en.wikipedia.org/wiki/Angelica_Garnett,19,2012,May
96085,2,Claus Heß,", 84, German Olympic rower (1956).",https://en.wikipedia.org/wiki/Claus_He%C3%9F,6,2018,April
29237,31,Evert Hingst,", 35, Dutch lawyer, allegedly involved in organized crime, shot.",https://en.wikipedia.org/wiki/Evert_Hingst,0,2005,October
75627,8,Ion Trewin,", 71, British editor, publisher and author.",https://en.wikipedia.org/wiki/Ion_Trewin,5,2015,April



#### Observations:
- We have leading and trailing characters of comma, period, and white space that can be removed.
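
A small sketch of how `str.strip` behaves with a character set may help here: it removes any of the listed characters from both ends and leaves interior commas and periods alone. The two strings are taken from the outputs above.

```python
# A typical row: leading ", " and a trailing "."
raw = ", 86, British dancer, ballet designer and director."
cleaned = raw.strip(" ,.")
print(cleaned)

# A row with nothing to strip is returned unchanged
unchanged = "Thami Shobede, 31, Singer Songwriter".strip(" ,.")
print(unchanged)
```

Because the argument is a set of characters rather than a literal prefix or suffix, a value that legitimately ends in a period, such as an abbreviation, would lose that period as well.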

#### Removing Leading and Trailing Commas, Whitespace, and Periods

In [36]:
# Removing the leading/trailing commas, periods, and whitespace
df.loc[:, "info"] = df["info"].apply(lambda x: x.strip(" ,."))
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
24046,31,Pavel Tigrid,"85, Czech writer, publisher, author and politician",https://en.wikipedia.org/wiki/Pavel_Tigrid,1,2003,August
114065,29,James Douglas Henderson,"93, Canadian politician",https://en.wikipedia.org/wiki/James_Douglas_Henderson,6,2020,June
96528,25,Adebayo Adedeji,"87, Nigerian politician and diplomat",https://en.wikipedia.org/wiki/Adebayo_Adedeji,4,2018,April
40002,8,William Alexander Deer,"98, British geologist, Vice-Chancellor of the University of Cambridge (1971–1973)",https://en.wikipedia.org/wiki/William_Alexander_Deer,1,2009,February
28078,26,Albinas Albertynas,"71, Lithuanian politician",https://en.wikipedia.org/wiki/Albinas_Albertynas,2,2005,May



#### Checking the Set of First Substrings (Before First Comma)

In [37]:
# Checking values in the first substring of info (before the first comma)
age_set = set(df["info"].apply(lambda x: x.split(sep=",", maxsplit=1)[0]))
age_set

{'27',
 'American television executive',
 '111',
 '84',
 'Vanuatuan politician and member of parliament',
 '14',
 'Solomon Islands Cabinet Minister',
 'Georgian politician and deputy governor of the Kvemo Kartli region',
 'Pakistani politician',
 '56–57',
 '38-40',
 'Ivorian traditional musician',
 'aka "Sammy Steamboat"',
 'Bangladeshi singer and composer',
 'Syrian playwright',
 'Notable ice hockey players and coaches among the 44 killed in the :\n',
 'Sierra Leonean politician and military officer',
 'Jules Engel',
 'Scottish botanist and agriculturist',
 'Chechen Islamist',
 '73-74',
 'South African plastic surgeon',
 'c. 35',
 'Gambian-American journalist',
 '94–95',
 'Pakistani military officer and diplomat',
 'Italian poet',
 'British snooker player',
 '19',
 '53–54',
 'Malaysian honorary consul',
 'Kyrgyz Imam and alleged Islamic militant',
 'Iranian feminist activist. (Farsi)',
 'Pakistani actor',
 'wife of Robert Guéï',
 'American transwoman and performer',
 'Chinese marathon


#### Observations:
- Though we can see the age values, there is a lot of other information in the first substring of the `info` value.
- Some entries may be missing the age information.
- Also, age is entered in various formats.
- Let us see how many variations there are.
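
To get a feel for the formats before counting them, the first substrings can be grouped by pattern. The sketch below uses a handful of values from the set shown above; the helper `classify` and its category labels are hypothetical and cover only the formats visible in that output.

```python
import re
from collections import Counter

# A few first substrings copied from the set displayed above
first_parts = ["27", "111", "56–57", "38-40", "c. 35", "Italian poet", "Jules Engel"]


def classify(text):
    """Assign a rough format label to a first substring (hypothetical helper)."""
    text = text.strip()
    if re.fullmatch(r"\d{1,3}", text):
        return "single age"
    # Ranges appear with either a hyphen or an en dash
    if re.fullmatch(r"\d{1,3}\s*[-–]\s*\d{1,3}", text):
        return "age range"
    if re.fullmatch(r"c\.\s*\d{1,3}", text):
        return "approximate age"
    return "no age found"


counts = Counter(classify(part) for part in first_parts)
print(counts)
```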

In [38]:
# Checking number of different values for first substring of info column
len(age_set)

1412


#### Observations:
- There are 1,412 distinct values for the first substring of `info`, which is where Wikipedia places the age.
- We may need a regular-expression approach to divide this string feature, in addition to using Python string methods.
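
As one possible form of the regular-expression approach, the sketch below extracts a leading integer age with `Series.str.extract`. It is an illustration on toy rows rather than the notebook's method, and the pattern is an assumption: it accepts any 1 to 3 digit number followed directly by a comma, so it would also accept values such as 0 or 150 that a fixed list of candidate ages might exclude.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "info": [
            "86, British dancer, ballet designer and director",
            "58–59, Yemeni novelist, writer, diplomat and feminist",
            "Thami Shobede, 31, Singer Songwriter",
        ]
    }
)

# Capture 1 to 3 digits at the very start, only if a comma follows immediately
leading_age = toy["info"].str.extract(r"^(\d{1,3}),", expand=False)

# Rows without a match stay NaN, so the column becomes float
toy["age"] = pd.to_numeric(leading_age)

print(toy)
```

Ranges such as "58–59" and rows that start with a name are left as NaN here and would need separate handling.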

In [39]:
# Creating candidate ages 1 through 149 as strings, for matching the start of info
ages = pd.Series(np.arange(1, 150)).astype("string")


In [48]:
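# Assigning age where info starts with a candidate age followed by a comma
# (the position i equals the index label because df has a default RangeIndex)
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + ","):
            df.loc[i, "age"] = int(age)
df.sample(5)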

Unnamed: 0,day,name,info,link,num_references,year,month,age
38197,5,Miroslav Havel,"86, Czech-born Irish chief designer (Waterford Crystal)",https://en.wikipedia.org/wiki/Miroslav_Havel,3,2008,September,86
70724,3,Wimper Guerrero,"32, Ecuadorian footballer, heart attack",https://en.wikipedia.org/wiki/Wimper_Guerrero,2,2014,August,32
107737,10,Werner Andreas Albert,"84, German composer and conductor",https://en.wikipedia.org/wiki/Werner_Andreas_Albert,4,2019,November,84
81356,23,Luis Alberto Machado,"84, Venezuelan lawyer and politician",https://en.wikipedia.org/wiki/Luis_Alberto_Machado,8,2016,February,84
106193,22,Junior Agogo,"40, Ghanaian footballer (Bristol Rovers, Nottingham Forest, national team)",https://en.wikipedia.org/wiki/Junior_Agogo,29,2019,August,40



In [41]:
# Checking number of rows still missing age
df["age"].isna().sum()

2025


In [42]:
df

Unnamed: 0,day,name,info,link,num_references,year,month,age
0,1,William Chappell,"86, British dancer, ballet designer and director",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,86.000
1,1,Raymond Crotty,"68, Irish economist, writer, and academic",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,68.000
2,1,Walter Eckhardt,"87, German politician",https://en.wikipedia.org/wiki/Walter_Eckhardt,1,1994,January,87.000
3,1,"Arthur Porritt, Baron Porritt","93, New Zealand physician, statesman and athlete","https://en.wikipedia.org/wiki/Arthur_Porritt,_Baron_Porritt",25,1994,January,93.000
4,1,Cesar Romero,"86, American actor (, , ) and activist",https://en.wikipedia.org/wiki/Cesar_Romero,42,1994,January,86.000
...,...,...,...,...,...,...,...,...
133895,8,Song Hae,"95, South Korean television host () and singer",https://en.wikipedia.org/wiki/Song_Hae,21,2022,June,95.000
133896,8,Birkha Bahadur Muringla,"79, Indian writer",https://en.wikipedia.org/wiki/Birkha_Bahadur_Muringla,3,2022,June,79.000
133897,9,Aamir Liaquat Hussain,"50, Pakistani journalist and politician, MNA (2002–2007, since 2018)",https://en.wikipedia.org/wiki/Aamir_Liaquat_Hussain,99,2022,June,50.000
133898,9,Oleg Moliboga,"69, Russian volleyball player, Olympic champion (1980) and coach",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,69.000

