# Wikipedia Notable Life Expectancies

# [Notebook 2 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean_thanak_2022_06_13.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress scientific notation in a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the wp_life_expect_raw_complete dataset
conn = sql.connect("wp_life_expect_raw_complete.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_raw_complete", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 133900 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
133898,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
133899,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
131926,March 2022,10,Georges Ginoux,", 88, Belgian-born French politician, senator (2004-2005).",https://en.wikipedia.org/wiki/Georges_Ginoux,2
77018,June 2015,25,Lou Butera,", 78, American pool player, Parkinson's disease.",https://en.wikipedia.org/wiki/Lou_Butera,8
99020,September 2018,5,Lise Payette,", 87, Canadian journalist, writer and politician, MNA (1976–1981).",https://en.wikipedia.org/wiki/Lise_Payette,20
57907,September 2012,30,Jack Morris,", 84, American Jesuit, founder of the Jesuit Volunteer Corps, cancer.",https://en.wikipedia.org/wiki/Jack_Morris_(Jesuit),3
45370,April 2010,30,Manuel Alvarado,", 62, Guatemalan-born British academic.",https://en.wikipedia.org/wiki/Manuel_Alvarado,1



#### Observations:
- There are 133,900 rows and 6 columns.
- `month_year` contains the month and year of death, while `day` contains the day of the month of death.
- `name` is the notable person's name.  It is a nominal feature that will not be used for analysis, but will be maintained for any referencing needs.
- `info` contains multiple items including the notable person's "age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, (and) cause of death (if known)."
- `link` is the URL of the notable person's individual Wikipedia page.  If such a page does not exist, the value is either a non-working link (https://en.wikipedia.orgNone) or a link to a page stating that no page exists for that individual.  `link` is a unique identifier for all entries except the 6 with the non-working link, and the names of those 6 are distinct from one another.
- `num_references` contains the number of references on the notable person's individual Wikipedia page.  This feature serves as a proxy measure of notability.
- Prior to EDA, our task will be to extract the individual elements that are combined in the `month_year` and `info` columns.
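
The uniqueness claim about `link` can be checked directly. The following is a minimal sketch on a hypothetical stand-in frame (`toy`, with made-up rows); only the placeholder link string is taken from the observations above.

```python
import pandas as pd

# Hypothetical stand-in for df; rows are illustrative, not from the dataset
toy = pd.DataFrame(
    {
        "name": ["A", "B", None, None],
        "link": [
            "https://en.wikipedia.org/wiki/A",
            "https://en.wikipedia.org/wiki/B",
            "https://en.wikipedia.orgNone",
            "https://en.wikipedia.orgNone",
        ],
    }
)

placeholder = "https://en.wikipedia.orgNone"

# Rows carrying the non-working placeholder link
is_placeholder = toy["link"].eq(placeholder)
n_placeholder = int(is_placeholder.sum())

# Among the remaining rows, link should identify each entry uniquely
links_unique = toy.loc[~is_placeholder, "link"].is_unique

print(n_placeholder, links_unique)
```

Running the same two checks on `df` would confirm whether the count of placeholder links and the uniqueness of the rest match the observations.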

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133894 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



In [6]:
# Checking duplicate rows
df.duplicated().sum()

0


In [7]:
# Check percentage of null values by column (nulls divided by total rows)
df.isnull().sum() / len(df) * 100

month_year       0.000
day              0.000
name             0.004
info             0.000
link             0.000
num_references   0.000
dtype: float64


In [8]:
# Checking number of missing values per row (not necessary here, but done to keep process standard)
df.isnull().sum(axis=1).value_counts()

0    133894
1         6
dtype: int64


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- As expected, we have 6 entries that are missing `name`, but the names can be recovered from their `info` values.
- All columns are currently of object type.  We will need to appropriately typecast them after separating the information in `month_year` and `info`.
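
One possible way to typecast the numeric columns is `pd.to_numeric`. This is only a sketch on a hypothetical frame (`toy`); the choice of `day` and `num_references` as the columns to convert is an assumption based on the sample rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for the object-typed columns read from the database
toy = pd.DataFrame({"day": ["1", "22", "9"], "num_references": ["21", "0", "3"]})

# errors="raise" (the default) surfaces any non-numeric value instead of hiding it
for col in ["day", "num_references"]:
    toy[col] = pd.to_numeric(toy[col], errors="raise")

print(toy.dtypes)
```

Letting the conversion raise on unexpected values is a deliberate choice here, since a silent coercion to NaN could mask scraping problems.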

## Data Cleaning

### Addressing Missing `name` Values

In [9]:
# Checking rows with missing name values
missing_name = df[df["name"].isna()]
missing_name

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- These rows differ from the main set in that the person's name appears as a substring at the start of the `info` string.
- As there are so few rows missing `name`, let us address this issue first.

In [10]:
# For loop to copy name value from info column
treat_rows = missing_name.index
for i in treat_rows:
    info = df.loc[i, "info"]
    info_lst = info.split(sep=",", maxsplit=1)
    df.loc[i, "name"] = info_lst[0]

# Re-check rows
df.loc[treat_rows, :]

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,Kevin Kowalcyk,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,Vincent Palmer,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,Barry Stigler,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,Nana Gualdi,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,Scott Workman,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,Thami Shobede,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- Missing `name` values have been addressed.
- The names still appear in the `info` value for these rows, but we can address that as we separate the information in that column.
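
The same fill can be written without an explicit loop. A minimal sketch, assuming a hypothetical two-row frame (`toy`) whose rows mirror the shapes seen above:

```python
import pandas as pd

# Hypothetical stand-in: one normal row and one row missing name
toy = pd.DataFrame(
    {
        "name": ["Raymond Crotty", None],
        "info": [
            ", 68, Irish economist, writer, and academic.",
            "Vincent Palmer, 37, British criminal.",
        ],
    }
)

missing = toy["name"].isna()

# Text before the first comma of info, for the affected rows only
toy.loc[missing, "name"] = (
    toy.loc[missing, "info"].str.split(",", n=1).str[0].str.strip()
)

print(toy["name"].tolist())
```

With only 6 affected rows the loop is perfectly adequate; the vectorized form mainly helps if the same situation arises at larger scale.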

In [11]:
# Re-check info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133900 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



#### Observations:
- We have no remaining missing values.
- Let us treat the `month_year` column next.

### Separating `month` and `year`

In [12]:
# Separating month and year into 2 columns and typecasting year as integer
df.loc[:, "year"] = df["month_year"].apply(lambda x: x.split(sep=" ")[1])
df["year"] = df["year"].astype("int64")

df.loc[:, "month"] = df["month_year"].apply(lambda x: x.split(sep=" ")[0])
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references,year,month
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January



In [13]:
# Dropping month_year column
df.drop("month_year", axis=1, inplace=True)
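
As an alternative to two `apply` calls, the month and year can be separated in a single pass with `str.split(..., expand=True)`. The sketch below uses a hypothetical frame (`toy`) rather than `df`.

```python
import pandas as pd

# Hypothetical stand-in for the month_year column
toy = pd.DataFrame({"month_year": ["January 1994", "June 2022"]})

# One split yields both pieces: column 0 is the month, column 1 the year
parts = toy["month_year"].str.split(" ", n=1, expand=True)
toy["month"] = parts[0]
toy["year"] = parts[1].astype("int64")

# Drop the combined column once both pieces exist
toy = toy.drop(columns="month_year")

print(toy)
```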


### Treating `info`

#### Checking a Sample

In [14]:
# Checking a sample of info
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
60943,2,Eriya Kategaya,", 67, Ugandan politician, Deputy Prime Minister and Minister for East African Affairs.",https://en.wikipedia.org/wiki/Eriya_Kategaya,7,2013,March
108167,30,Bertil Fiskesjö,", 91, Swedish politician, MP (1971–1994).",https://en.wikipedia.org/wiki/Bertil_Fiskesj%C3%B6,3,2019,November
30064,18,Ruth Taylor,", 44, Canadian poet, alcohol poisoning.",https://en.wikipedia.org/wiki/Ruth_Taylor_(poet),1,2006,February
108677,24,Noor-Ali Tabandeh,", 92, Iranian Islamic Sufi leader, Qutb of the Ni'matullāhī.",https://en.wikipedia.org/wiki/Noor-Ali_Tabandeh,15,2019,December
114398,12,Bill Gilbreth,", 72, American baseball player (Detroit Tigers, California Angels), complications from heart surgery.",https://en.wikipedia.org/wiki/Bill_Gilbreth,1,2020,July



#### Observations:
- The `info` values have leading and trailing commas, periods, and whitespace that can be removed.
- Also, it would be helpful to remove information contained within parentheses, as we will not be using it.

#### Removing Leading and Trailing Commas, Whitespace, and Periods

In [15]:
# Removing the leading/trailing commas, periods, and whitespace
df.loc[:, "info"] = df["info"].apply(lambda x: x.strip(" ,."))
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
133363,13,Rosmarie Trapp,"93, Austrian-born American singer (Trapp Family)",https://en.wikipedia.org/wiki/Rosmarie_Trapp,2,2022,May
113640,13,Marj Carpenter,"93, American Presbyterian leader, missionary and reporter, Moderator of the General Assembly of the Presbyterian Church (1995–1996)",https://en.wikipedia.org/wiki/Marj_Carpenter,3,2020,June
78738,5,Peter Wespi,"72, Swiss Olympic ice hockey player (1964), cancer",https://en.wikipedia.org/wiki/Peter_Wespi,2,2015,October
2390,8,Sylvia B. Seaman,"94, American novelist and suffragist",https://en.wikipedia.org/wiki/Sylvia_B._Seaman,7,1995,January
32665,21,Richard Ollard,"83, British historian and biographer",https://en.wikipedia.org/wiki/Richard_Ollard,1,2007,January



#### Removing Information within Parentheses in `info`

In [38]:
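# Regular expression for each parenthetical group and its contents
# ([^()]* keeps a match inside one group, so text between two groups is preserved)
pattern = r"\([^()]*\)"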

# Subbing empty string for parentheses and stripping white space
df.loc[:, "info"] = df["info"].apply(lambda x: re.sub(pattern, "", x).strip())
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
69903,19,Avraham Shalom,"86, Austrian-born Israeli security official, Director of the Shin Bet , commander in the capture of Adolf Eichmann and the Bus 300 affair",https://en.wikipedia.org/wiki/Avraham_Shalom,15,2014,June
98186,22,Clemmie Spangler,"86, American banker, natural resource executive",https://en.wikipedia.org/wiki/Clemmie_Spangler,16,2018,July
17519,1,Leslie Vincent,"91, American actor",https://en.wikipedia.org/wiki/Leslie_Vincent,2,2001,February
14725,10,Tom McKinney,"72, Northern Irish rugby player",https://en.wikipedia.org/wiki/Tom_McKinney,6,1999,November
27042,8,Suvad Katana,"35, Bosnian footballer",https://en.wikipedia.org/wiki/Suvad_Katana,4,2005,January
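
When stripping parenthetical text, it matters whether the pattern can span more than one group. The sketch below compares a greedy pattern with a per-group pattern on a made-up string (`text` is hypothetical, not a row from the dataset), and also tidies the stray space that removal leaves before a comma.

```python
import re

# Hypothetical info value containing two parenthetical groups
text = "American footballer (Team A) and coach (Team B), cancer"

# Greedy: matches from the first "(" to the last ")", swallowing "and coach"
greedy = re.sub(r"\(.*\)", "", text)

# Per-group: each match stays within a single pair of parentheses
per_group = re.sub(r"\([^()]*\)", "", text)

# Collapse doubled spaces and remove whitespace left before commas
tidy = re.sub(r"\s+,", ",", re.sub(r"\s{2,}", " ", per_group)).strip()

print(repr(greedy))
print(repr(tidy))
```

The per-group pattern does not handle nested parentheses in a single pass, so entries with nesting would need a second look.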



#### Checking the Set of First Substrings (Before First Comma)

In [63]:
# Checking values in the first substring of info (before the first comma)
substring_set = set(df["info"].apply(lambda x: x.split(sep=",", maxsplit=1)[0]))
substring_set

{'',
 'Irish road racing cyclist',
 '36',
 '53–54',
 '11+',
 'German sculptor',
 'American civil and aeronautical engineer',
 'Swazi educator and politician',
 'Iranian translator and journalist',
 'Indian film director and screenwriter',
 'Russian journalist',
 '51–52',
 'Bissau-Guinean soldier and rebel',
 '93/94',
 'New Zealand mycologist',
 'Dominican Republic diplomat and feminist',
 'Egyptian physician',
 'Senegalese educator and poet',
 'Irish scientist',
 'Turkish artist',
 'Bangladesh army general',
 'Pakistani diplomat',
 'c. 3500',
 'Yemeni soldier.',
 'American football and track coach',
 'Provisional Irish Republican Army volunteer',
 '62–63',
 '117',
 'American police officer ',
 'Israeli Haredi rabbi and orator',
 '16',
 '74–76',
 '7-8',
 'Iranian Army officer and Minister of Defense',
 'American interior designer',
 '82-83',
 'Filipino publisher and editor',
 '69-70',
 'member of the then National Assembly of Pakistan and Union Minister of Labor',
 'American basketball 


#### Observations:
- Though we can see the age values, there is a lot of other information in the first substring of the `info` value: 1412 variations.
- Some entries may be missing the age information.
- Also, age is entered in various formats.
- Let us examine the rows that start with a digit.

In [67]:
# Checking the different patterns of first substrings that start with digits
{item for item in substring_set if re.match(r"\d", item)}

{'1',
 '10',
 '100',
 '100 or 101',
 '100+',
 '100/101',
 '101',
 '102',
 '103',
 '104',
 '104?',
 '105',
 '106',
 '107',
 '108',
 '109',
 '10–11',
 '11',
 '11 months',
 '11+',
 '11-12',
 '110',
 '111',
 '112',
 '113',
 '114',
 '115',
 '116',
 '117',
 '119',
 '12',
 '122',
 '122 ',
 '125',
 '126?',
 '127?',
 '13',
 '13-14',
 '134?',
 '14',
 '146 ',
 '15',
 '16',
 '17',
 '176',
 '18',
 '19',
 '1995',
 '1996',
 '1997',
 '1–2',
 '2',
 '20',
 '21',
 '21-22',
 '22',
 '23',
 '23–24',
 '24',
 '24–26',
 '25',
 '25-26',
 '255 ',
 '25–26',
 '26',
 '27',
 '28',
 '28-29',
 '28/29',
 '28–29',
 '29',
 '29-30',
 '29–30',
 '3',
 '3-months',
 '30',
 '30s',
 '30–31',
 '31',
 '32',
 '32-33',
 '33',
 '33-34',
 '34',
 '34-35',
 '34-38',
 '34/35',
 '34–35',
 '35',
 '35-36',
 '35–42',
 '36',
 '36-37',
 '36–37',
 '37',
 '37-38',
 '37–38',
 '38',
 '38 or 39',
 '38-39',
 '38-40',
 '38/39',
 '38?',
 '39',
 '39-40',
 '39–40',
 '4',
 '40',
 '40 days',
 '40s',
 '40–41',
 '41',
 '41-42',
 '41–42',
 '42',
 '42 or 43'


#### Observations:
- The patterns include "age", "age " (trailing space), "age+", "age/", "age?", "age–" (en dash), "age-" (hyphen), "age months", "age-months", "age–months" (en dash), "ages", "age days", and "age.".
- We will need to address ages in months first.
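
One way to recognize ages given in months or days is a small regular expression over the first substring. This is a sketch only: the helper name `age_in_years_from_unit` and the convention of converting to fractional years are assumptions, not something the notebook prescribes.

```python
import re

# Leading number, optional space/hyphen/en dash, then a "month" or "day" unit
UNIT_PATTERN = re.compile(r"^(\d+)[\s\-\u2013]*(month|day)s?\b")


def age_in_years_from_unit(first_substring):
    """Return age in years for values such as '11 months' or '40 days', else None.

    Assumed convention: months are divided by 12 and days by 365.25.
    """
    match = UNIT_PATTERN.match(first_substring.strip())
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2)
    return value / 12 if unit == "month" else value / 365.25


for sample in ["11 months", "3-months", "40 days", "42", "30s"]:
    print(sample, "->", age_in_years_from_unit(sample))
```

Handling these entries before any pattern that accepts "age " (age followed by a space) would avoid reading "11 months" as 11 years.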

#### Extracting `age` from `info` for `info` Values that Begin with "age,"

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age,'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + ","):
            df.loc[i, "age"] = int(age)
df.sample(5)
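
The nested loop above scans every row once per candidate age. A vectorized alternative is `Series.str.extract` with an anchored pattern; the sketch below runs on a hypothetical frame (`toy`) and is not a drop-in replacement, because `\d{1,3}` also accepts values above 149 and leading zeros, which the loop does not.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned info column
toy = pd.DataFrame(
    {
        "info": [
            "86, British dancer",
            "53–54, Irish poet",
            "German sculptor",
            "104?, American centenarian",
        ]
    }
)

# Leading 1-3 digit number followed directly by a comma; no match yields NaN
extracted = toy["info"].str.extract(r"^(\d{1,3}),", expand=False)
toy["age"] = pd.to_numeric(extracted)

print(toy)
```

Swapping the trailing comma in the pattern for another delimiter adapts the same call to other formats, and a range check on the result would restore the 1 to 149 bound if it is wanted.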

In [None]:
# Checking for remaining rows missing age
df["age"].isna().sum()

#### Observations:
- We were able to extract `age` for the vast majority of entries.
- Let us take a look at the remaining entries.

#### Checking Rows with Missing `age`

In [None]:
df[df["age"].isna()]

#### Observations:
- We can immediately see 2 apparent patterns.  The first is an age range with a hyphen and the second is age missing altogether.
- Let us do another iteration for the age-range pattern, accepting the lower value as age.

#### Extracting `age` from `info` for `info` Values that Begin with "age-"

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age-'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "-"):
            df.loc[i, "age"] = int(age)
df.sample(5)

In [None]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

#### Observations:
- The last iteration captured nearly 200 values.
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].sample(5)

#### Observations:
- It almost appears as if our iteration for `info` values starting with "age-" missed some values.
- Closer examination reveals that the dash character differs in the remaining values (hyphen "-" vs. en dash "–").
- So we will iterate again with the en dash.
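
The two dash characters look alike but are different code points, which is why a `startswith` test on one misses the other. A short sketch using only the standard library (the sample strings are made up):

```python
import unicodedata

hyphen = "-"  # U+002D
en_dash = "\u2013"  # U+2013

# Show the code point and official Unicode name of each character
for ch in (hyphen, en_dash):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Hypothetical info values that differ only in the dash used
samples = [f"34{hyphen}35, actor", f"34{en_dash}35, actor"]

# A hyphen-only test misses the en dash variant
matches_hyphen = [s.startswith("34" + hyphen) for s in samples]

# str.startswith accepts a tuple, so both dashes can be tested at once
matches_either = [s.startswith(("34" + hyphen, "34" + en_dash)) for s in samples]

print(matches_hyphen, matches_either)
```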

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age–' (en dash)
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "–"):
            df.loc[i, "age"] = int(age)
df.sample(5)

In [None]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

#### Observations:
- That iteration captured over 400 values.
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

#### Observations:
- The next pattern starts with 'age/', so we will iterate again with it.

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age/'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "/"):
            df.loc[i, "age"] = int(age)
df.sample(5)

In [None]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

#### Observations:
- That iteration captured nearly 70 missing values.
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

#### Observations:
- The next pattern starts with 'age.', so we will iterate again with it.

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age.'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "."):
            df.loc[i, "age"] = int(age)
df.sample(5)

In [None]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

#### Observations:
- We captured a handful more missing values with that iteration.
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

#### Observations:
- There is at least one entry that escaped the net of that last iteration.  It can be hard-coded later.
- The next pattern starts with "age " (age followed by a space), so we will iterate again with it.

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age ' (age followed by a space)
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + " "):
            df.loc[i, "age"] = int(age)
df.sample(5)

In [None]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

#### Observations:
- That iteration captured 26 more missing values.
- Let us look again at the remaining rows.

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

#### Observations:
- This sample shows entries that are missing age information, or in which the age is embedded in the middle of the string.
- Let us remove rows without any age information, as they will not add to the analysis.

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# List of index of entries to drop due to no age information
drop_list = []

# For loop to add index of entries to drop_list if no age information present
for i in df[df["age"].isna()].index:
    if not any(age in df.loc[i, "info"] for age in ages):
        drop_list.append(i)
df.loc[drop_list, :].sample(50)

In [None]:
# Checking the number of entries that will be dropped
print(len(drop_list))

# Dropping the rows without age information
df.drop(drop_list, axis=0, inplace=True)
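
Because every string from "1" to "149" contains at least one of the digits 1 to 9, and the single digits are themselves in the list, the membership test above is equivalent to asking whether `info` contains any nonzero digit. A vectorized sketch of that check, on a hypothetical frame (`toy`) with made-up rows:

```python
import pandas as pd

# Hypothetical rows still missing age after the extraction passes
toy = pd.DataFrame(
    {
        "info": [
            "American police officer",
            "Yemeni soldier",
            "born 1950, Indian actor",
        ],
        "age": [float("nan")] * 3,
    }
)

# Rows missing age whose info holds none of the digits 1-9
no_age_info = toy["age"].isna() & ~toy["info"].str.contains(r"[1-9]", regex=True)
drop_idx = toy.index[no_age_info]

remaining = toy.drop(drop_idx)

print(len(drop_idx), remaining["info"].tolist())
```

Note that a row is kept whenever any nonzero digit is present, so a year such as 1950 counts as possible age information under either version of the check.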

In [None]:
# Checking df shape after dropping rows
df.shape

#### Observations:
- We have successfully dropped the entries lacking age data.
- Let us examine the remaining rows missing `age`.

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].sample(100)