# Wikipedia Notable Life Expectancies

# [Notebook 2 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean_thanak_2022_06_13.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the wp_life_expect_raw_complete dataset
conn = sql.connect("wp_life_expect_raw_complete.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_raw_complete", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 133900 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
133898,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
133899,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
30971,June 2006,25,Kenneth Griffith,", 84, Welsh actor and documentary maker, Parkinson's disease.",https://en.wikipedia.org/wiki/Kenneth_Griffith,16
76803,June 2015,13,Russell J. Donnelly,", 85, Canadian physicist.",https://en.wikipedia.org/wiki/Russell_J._Donnelly,12
84069,July 2016,16,Robert Burren Morgan,", 90, American politician, member of the U.S. Senate for North Carolina (1975–1981), N.C. Senate (1955–1969) and Attorney General (1969–1974).",https://en.wikipedia.org/wiki/Robert_Burren_Morgan,11
22668,January 2003,20,Bill Werbeniuk,", 56, Canadian snooker player.",https://en.wikipedia.org/wiki/Bill_Werbeniuk,28
116754,October 2020,16,László Branikovits,", 70, Hungarian footballer, Olympic silver medalist (1972).",https://en.wikipedia.org/wiki/L%C3%A1szl%C3%B3_Branikovits,2



#### Observations:
- There are 133,900 rows and 6 columns.
- `month_year` contains the month and year of death, while `day` contains the day of the month of death.
- `name` is the notable person's name.  It is a nominal feature that will not be used for analysis, but will be maintained for any referencing needs.
- `info` contains multiple items including the notable person's "age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, (and) cause of death (if known)."
- `link` is typically the URL of the notable person's individual Wikipedia page.  If such a page does not exist, there is either a non-working link (https://en.wikipedia.orgNone), or the link points to a page with a message that no page exists for that individual.  `link` is intended to serve as a unique identifier for all entries except the 6 with the non-working link, whose `name` values are distinct from one another.  We still need to verify that all other links are unique.
- `num_references` contains the number of references on the notable person's individual Wikipedia page.  This feature serves as a proxy measure of notability.
- Prior to EDA, our task will be to extract the individual elements that are combined in the `month_year` and `info` columns.

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133894 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



In [6]:
# Checking duplicate rows
df.duplicated().sum()

0


In [7]:
# Check percentage of null values by column (null count divided by total rows)
df.isnull().mean() * 100

month_year        0.000000
day               0.000000
name              0.004481
info              0.000000
link              0.000000
num_references    0.000000
dtype: float64


In [8]:
# Checking number of missing values per row
df.isnull().sum(axis=1).value_counts()

0    133894
1         6
dtype: int64


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- As expected, 6 entries are missing `name`, but the names can be recovered from their `info` values.
- All columns are currently of object type.  We will need to appropriately typecast them after separating the information in `month_year` and `info`.
- We do not have any duplicate entries, but we will need to verify that all links are unique.  If an individual's death was entered on more than one date, there could be more than one entry for the individual.

## Data Cleaning

### Addressing Missing `name` Values

In [9]:
# Checking rows with missing name values
missing_name = df[df["name"].isna()]
missing_name

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- These rows vary from the main set as there is a substring containing the person's name at the start of the `info` string.
- As there are so few rows missing `name`, let us address this issue first.
- Within a for loop, we will split the `info` value into a list and extract `name` from the first list entry.  
- Then we will replace the name with an empty string within `info`.
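
The approach described above can be sketched on a single string with the standard library only. The helper name `split_leading_name` is hypothetical, and the use of `re.escape` with `count=1` is an assumption added so that a name containing regex metacharacters is matched literally and only its leading occurrence is removed.

```python
import re


def split_leading_name(info):
    """Return (name, remaining_info) for an info string that starts with a name."""
    # Everything before the first comma is treated as the name
    name = info.split(sep=",", maxsplit=1)[0].strip()
    # Remove only the first literal occurrence of the name
    remaining = re.sub(re.escape(name), "", info, count=1).strip()
    return name, remaining


name, remaining = split_leading_name("Vincent Palmer, 37, British criminal.")
print(name)       # Vincent Palmer
print(remaining)  # , 37, British criminal.
```

The leading comma is kept on purpose, so the treated value has the same shape as the `info` values in the rest of the dataset.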

In [10]:
# For loop to copy name value from info value and remove name from info value
treat_rows = missing_name.index
for i in treat_rows:
    info = df.loc[i, "info"]
    info_lst = info.split(sep=",", maxsplit=1)

    name = info_lst[0].strip()
    df.loc[i, "name"] = name
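    # Escape the name and remove only its first occurrence so it is matched literally
    df.loc[i, "info"] = re.sub(re.escape(name), "", info, count=1).strip()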

# Re-check rows
df.loc[treat_rows, :]

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,Kevin Kowalcyk,", 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,Vincent Palmer,", 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,Barry Stigler,", 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,Nana Gualdi,", 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,Scott Workman,", 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,Thami Shobede,", 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- Missing `name` values have been addressed and those names have been removed from `info` values.
- Now, we can check for any duplicate links (aside from the 6 above) and remove entries that have duplicate links.

In [11]:
# Checking number and sample of rows with duplicate links
rows_to_check = df[
    (df["link"].duplicated()) & (df["link"] != "https://en.wikipedia.orgNone")
].index
print(f"There are {len(df.loc[rows_to_check, :])} individuals with repeated links.")

# Checking a sample of rows
df.loc[rows_to_check, :].sample(5)

There are 125 individuals with repeated links.


Unnamed: 0,month_year,day,name,info,link,num_references
26165,August 2004,21,Moshe Shamir,", 82, Israeli author, playwright and columnist.",https://en.wikipedia.org/wiki/Moshe_Shamir,4
65040,October 2013,17,Lou Scheimer,", 84, American television producer (Filmation, ), co-founder of , Parkinson's disease.",https://en.wikipedia.org/wiki/Lou_Scheimer,14
22711,January 2003,27,Lord Dacre of Glanton,", 89, British historian, authenticator of the hoaxed Hitler Diaries.",https://en.wikipedia.org/wiki/Hugh_Trevor-Roper,73
17490,January 2001,28,Sally Mansfield,", 80, American actress.",https://en.wikipedia.org/wiki/Sally_Mansfield,3
26725,November 2004,24,John Tosi,", 88, American football player.",https://en.wikipedia.org/wiki/John_Tosi,15



#### Observations:
- There are 125 entries whose `link` duplicates that of an earlier entry.
- Let us take a look at an example.

In [13]:
# Checking an example of an individual with duplicate entries
df[df["link"] == "https://en.wikipedia.org/wiki/Sally_Mansfield"]

Unnamed: 0,month_year,day,name,info,link,num_references
17479,January 2001,27,Sally Mansfield,", 77, American actress, lung cancer.",https://en.wikipedia.org/wiki/Sally_Mansfield,3
17490,January 2001,28,Sally Mansfield,", 80, American actress.",https://en.wikipedia.org/wiki/Sally_Mansfield,3



#### Observations:
- We can see that the duplication is a result of the death being entered twice, on two different days, on Wikipedia.
- It is also possible that some links appear more than once because they point to a page for an event or a different associated person, but we will err on the side of avoiding repeated entries for one individual.
- For entries with duplicate links, we will opt to keep the first entry and drop the others.
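
The keep-first rule can be illustrated in plain Python on a handful of rows. This is only a sketch of the logic: `keep_first_by_link` and `PLACEHOLDER` are hypothetical names, and the rows are reduced to the three fields needed here.

```python
PLACEHOLDER = "https://en.wikipedia.orgNone"  # non-working link shared by 6 entries


def keep_first_by_link(rows):
    """Keep the first row for each link; rows with the placeholder link are always kept."""
    seen = set()
    kept = []
    for row in rows:
        link = row["link"]
        if link != PLACEHOLDER and link in seen:
            continue  # a later entry for a link that was already kept
        seen.add(link)
        kept.append(row)
    return kept


rows = [
    {"day": "27", "name": "Sally Mansfield", "link": "https://en.wikipedia.org/wiki/Sally_Mansfield"},
    {"day": "28", "name": "Sally Mansfield", "link": "https://en.wikipedia.org/wiki/Sally_Mansfield"},
    {"day": "22", "name": "Vincent Palmer", "link": PLACEHOLDER},
    {"day": "1", "name": "Barry Stigler", "link": PLACEHOLDER},
]
kept = keep_first_by_link(rows)
print([(row["name"], row["day"]) for row in kept])
# [('Sally Mansfield', '27'), ('Vincent Palmer', '22'), ('Barry Stigler', '1')]
```

In pandas, `df["link"].duplicated()` with its default `keep="first"` flags the same later occurrences, which is why combining it with the placeholder check selects exactly the rows to drop.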

#### Dropping Duplicate Rows for Entries with Duplicate `link`

In [14]:
# Dropping duplicate rows for entries with duplicate link values
rows_to_drop = rows_to_check.copy()
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(133775, 6)


In [15]:
# Re-check info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133775 entries, 0 to 133774
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133775 non-null  object
 1   day             133775 non-null  object
 2   name            133775 non-null  object
 3   info            133775 non-null  object
 4   link            133775 non-null  object
 5   num_references  133775 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



#### Observations:
- We have no remaining missing values for the existing columns and the rows with duplicate `link` values have been dropped.
- Let us treat the `month_year` column next, by separating into two new columns `year` and `month`.
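
As a plain-Python sketch of the split, the function below separates one `month_year` string into its two parts. The name `split_month_year` and the check against a `MONTHS` tuple are hypothetical additions for illustration; the notebook itself applies the split column-wise.

```python
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def split_month_year(month_year):
    """Split a 'Month YYYY' string into (month_name, year_as_int)."""
    month, year = month_year.strip().split(" ")
    # Assumption: reject anything that is not a full English month name
    if month not in MONTHS:
        raise ValueError(f"Unexpected month name: {month!r}")
    return month, int(year)


print(split_month_year("January 1994"))  # ('January', 1994)
```

Validating the month name early makes a malformed value fail loudly at this step rather than surfacing later during analysis.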

### Separating `month_year` into `month` and `year` and Dropping `month_year`

In [16]:
# Separating month and year into 2 columns and typecasting year as integer
df.loc[:, "year"] = df["month_year"].apply(lambda x: x.split(sep=" ")[1].strip())
df["year"] = df["year"].apply(lambda x: int(x))

df.loc[:, "month"] = df["month_year"].apply(lambda x: x.split(sep=" ")[0])
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references,year,month
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January



In [17]:
# Dropping month_year column
df.drop("month_year", axis=1, inplace=True)
df.head(2)

Unnamed: 0,day,name,info,link,num_references,year,month
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January



#### Observations:
- With `month` and `year` separated, we can move on to treating the `info` column.

### Treating `info`
First, we will define two functions to help identify rows that match a given regular expression pattern or list of patterns.  Then, we will start by examining the `info` column in a sample of the dataset.

#### Function to Save Indices of Rows Matching Regular Expression Pattern to a List and Print Number of Rows with Match

In [18]:
# Define a function that takes dataframe, column name, and re pattern as arguments and returns list of indices
# for which column value matches re pattern
def rows_with_pattern(dataframe, column, pattern):
    """
    Takes input of dataframe, column name, and re pattern 
    and returns list of indices for rows that contain match
    for pattern anywhere within value for given column.
    
    dataframe: dataframe
    column: column name
    pattern: re pattern
    """
    index_list = []

    for i in dataframe.index:
        item = dataframe.loc[i, column]
        match = re.search(pattern, item)
        if match:
            index_list.append(i)
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list


#### Function to Use rows_with_pattern Function for Multiple Regular Expression Patterns

In [19]:
# Define a function that calls rows_with_pattern function for multiple re patterns
# returning a single list of indices for all rows with any pattern match


def multiple_patterns(dataframe, column, patterns):
    """
    Takes input dataframe, column, and list of re patterns and returns single list 
    of indices for rows in which a match for any pattern is found with re.search
    
    dataframe: dataframe
    column: column name
    patterns: list of re patterns
    """
    rows_combined = []

    # For loop to check each pattern
    for pattern in patterns:

        # List and number of rows matching each pattern
        print(pattern)
        rows_to_check = rows_with_pattern(dataframe, column, pattern)
        print("")

        # Add list for each pattern to combined list
        rows_combined += rows_to_check

    return rows_combined


#### Checking a Sample

In [20]:
# Checking a sample of info
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
109780,11,Jacques Mehler,", 83, French cognitive psychologist.",https://en.wikipedia.org/wiki/Jacques_Mehler,8,2020,February
3,1,"Arthur Porritt, Baron Porritt",", 93, New Zealand physician, statesman and athlete.","https://en.wikipedia.org/wiki/Arthur_Porritt,_Baron_Porritt",25,1994,January
11415,12,Graham Shaw,", 63, English footbal player.","https://en.wikipedia.org/wiki/Graham_Shaw_(footballer,_born_1934)",1,1998,May
67301,16,Robert J. Conley,", 73, American Cherokee author.",https://en.wikipedia.org/wiki/Robert_J._Conley,4,2014,February
65473,17,Zeke Bella,", 83, American baseball player (New York Yankees, Kansas City Athletics), complications from stroke and fall.",https://en.wikipedia.org/wiki/Zeke_Bella,8,2013,November



#### Observations:
- We can see that `info` varies considerably, so it will take some cleaning effort:
    - There are number values for years in addition to ages.
    - There is extra information inside of parentheses, that we likely don't need.
    - Some entries lack cause of death.
    - Some entries have multiple roles listed, with separating commas.
    - There are nationalities with multiple words.
    - `info` contains capital letters that are not part of citizenship.
- A strategic approach to cleaning the `info` column is needed, as follows:
    1. Drop all entries that are lacking digits in `info`, as they are missing the target `age` information.
    2. Extract any parentheses and their contents from `info` to new column `info_parenth`, as it likely won't be needed but we will preserve it, for now.
    3. Split `info` on "," and save to numbered new `info` columns left to right.
    4. Begin extracting age data from `info_0` and proceed left to right through the numbered `info` columns.
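
The first three steps of this plan can be previewed on a single string before applying them to the whole dataframe. This is a minimal sketch using the standard library; `preview_info_steps` is a hypothetical helper, and the returned dictionary layout is an assumption made for illustration.

```python
import re


def preview_info_steps(info):
    """Apply steps 1 to 3 to one info string and return the intermediate pieces."""
    # Step 1: entries without any digit cannot contain an age
    if not re.search(r"\d", info):
        return None
    # Step 2: move parentheses and their contents out of the main text
    match = re.search(r"\(.*\)", info)
    parenth = match.group(0) if match else None
    text = re.sub(r"\(.*\)", "", info)
    # Step 3: split on commas and trim surrounding spaces and periods
    parts = [part.strip(" .") for part in text.split(",")]
    return {"info_parenth": parenth, "parts": parts}


result = preview_info_steps(
    ", 77, American baseball player (Chicago Cubs, St. Louis Cardinals)."
)
print(result["info_parenth"])  # (Chicago Cubs, St. Louis Cardinals)
print(result["parts"])         # ['', '77', 'American baseball player']
```

Removing the parentheses before splitting matters: the commas inside `(Chicago Cubs, St. Louis Cardinals)` would otherwise produce extra, misleading pieces.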

#### Step 1: Checking and Dropping Rows Lacking Digits (and therefore Age Data) within `info`

In [21]:
# Column to check
column = "info"

# Dataframe to check
dataframe = df

# Pattern for re
pattern = r"\d"

# Finding indices of rows that have pattern/digits and number of rows that do not have digits
has_digits = rows_with_pattern(dataframe, column, pattern)
print(
    f"\nThere are {len(df) - len(has_digits)} rows without numbers in the info column."
)

# Overwriting df with only rows that contain a digit in info column, resetting index, and checking new shape
df = df.loc[has_digits, :]
df.reset_index(inplace=True, drop=True)
df.shape

There are 132852 rows with matching pattern in column 'info'.

There are 923 rows without numbers in the info column.


(132852, 7)


#### Observations:
- 923 rows were removed as they lacked any digits and, therefore, the target data for `age`.
- The resultant number of rows matches the rows containing digits.
- Next, we will extract parentheses and their contents from `info` to a new column `info_parenth`.

#### Step 2: Removing Information within Parentheses from `info` and saving to new column `info_parenth`

In [22]:
# Column to check
column = "info"

# Dataframe to check
dataframe = df

# Regular expression for parenthesis and its contents
pattern = r"\(.*\)"

# Finding indices of rows that have pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking a sample of rows
df.loc[rows_to_check, :].sample(2)

There are 50003 rows with matching pattern in column 'info'.


Unnamed: 0,day,name,info,link,num_references,year,month
63231,9,Glen Hobbie,", 77, American baseball player (Chicago Cubs, St. Louis Cardinals).",https://en.wikipedia.org/wiki/Glen_Hobbie,2,2013,August
67047,4,Lawrence Patrick Henry,", 79, South African Roman Catholic prelate, Archbishop of Cape Town (1990–2009).",https://en.wikipedia.org/wiki/Lawrence_Patrick_Henry,1,2014,March



In [23]:
# For loop to extract parenthesis and its contents from info to info_parenth
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        df.loc[index, "info_parenth"] = match.group(0)
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column])

# Rechecking number and example rows after treatment
recheck_rows = rows_with_pattern(dataframe, column, pattern)

# Recheck a sample of treated rows
df.loc[rows_to_check, :].sample(2)

There are 0 rows with matching pattern in column 'info'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth
62593,5,Bud Asher,", 88, American politician and football coach, Mayor of Daytona Beach, Florida .",https://en.wikipedia.org/wiki/Bud_Asher,7,2013,July,(1995–2003)
50734,12,Ernie Johnson,", 87, American baseball player .",https://en.wikipedia.org/wiki/Ernie_Johnson_(pitcher),2,2011,August,"(Boston Braves/Milwaukee Braves, Baltimore Orioles) and broadcaster (Atlanta Braves)"



#### Observations:
- Parentheses and the information within them have been removed from `info` and assigned to `info_parenth`.
- Next, we will iterate through the rows, splitting `info` on commas and assigning the respective list values to new individual columns `info_0`, `info_1`, and so on.  
- Though we can keep in mind the Wikipedia-defined fields, we will take the general approach of treating column by column, after splitting `info`, varying as indicated to obtain specific feature information.
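
One detail of the pattern `\(.*\)` is worth seeing in isolation: `.*` is greedy, so a string with several parenthesized groups is matched from the first `(` to the last `)`, and any text between the groups moves to `info_parenth` as well. The sketch below uses the Ernie Johnson entry, reconstructed from the `info` and `info_parenth` values shown above. The alternative pattern `\([^()]*\)` is an assumption offered for comparison, not the notebook's method, and it does not handle nested parentheses.

```python
import re

info = (
    ", 87, American baseball player (Boston Braves/Milwaukee Braves, "
    "Baltimore Orioles) and broadcaster (Atlanta Braves)."
)

# Greedy: one match spanning from the first "(" to the last ")"
greedy = re.search(r"\(.*\)", info).group(0)

# Alternative: one match per parenthesized group
per_group = re.findall(r"\([^()]*\)", info)
remaining = re.sub(r"\s*\([^()]*\)", "", info)

print(greedy)
print(per_group)
print(remaining)  # , 87, American baseball player and broadcaster.
```

With the greedy pattern, "and broadcaster" ends up in `info_parenth`; with the per-group pattern it stays in `info`. Which behavior is preferable depends on whether the text between groups is needed later.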

#### Step 3: Splitting `info` on Commas into Separate Columns

In [24]:
# For loop to split info on commas and separate into respective new columns and removing leading/trailing white space and periods
for i, item in enumerate(df["info"]):
    info_lst = item.split(",")

    for j in range(len(info_lst)):
        df.loc[i, f"info_{j}"] = info_lst[j].strip(" .")

# Checking the first 2 rows
df.head(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,86,British dancer,ballet designer and director,,,,,,,,
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,68,Irish economist,writer,and academic,,,,,,,



In [25]:
# Checking the last 2 rows
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
132850,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(1980),,69,Russian volleyball player,Olympic champion and coach,,,,,,,,
132851,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,86,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,



#### Observations:
- The `info` value is successfully divided and we can proceed through it column by column.
- There are 12 resultant new columns for the sections of `info` divided by commas (`info_parenth` excluded), indicating that some entries are quite detailed.
- We will now start our search for age data, starting with `info_0`.

## Extracting Age Data

### `info_0`

In [None]:
# Checking unique value counts
df["info_0"].value_counts()

#### Observations:
- The vast majority of rows have an empty string for `info_0`.
- There is one row representing a group, rather than an individual, and we will drop it.
- We should verify the name and age information for the remainder of unique values in `info_0`.

#### Dropping Entry for Group

In [None]:
# Checking the entry representing a group
group_entry = df[
    df["info_0"]
    == "Notable ice hockey players and coaches among the 44 killed in the :\n"
]
group_entry

In [None]:
# Dropping group entry, resetting index, and checking new shape of df
df.drop(group_entry.index, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

#### Examining Rows with Atypical `info_0` Values

In [None]:
# Examining rows with atypical info_0 values
list_to_check = df["info_0"].value_counts().index.to_list()

verify_df = pd.DataFrame()
for item in list_to_check[1:]:
    verify_df = pd.concat([verify_df, df[df["info_0"] == item]])
verify_df

#### Observations:
- The majority of rows contain additional aliases or titles within `info_0`, that we don't need, but we can leave in place for now, as we can drop the column as a whole, later.  
- There are a few rows that will need to be treated individually to correct the name value, as follows:
    1. Entry is for Mike Alexander whose band was Evile.
    2. Entry is for Herbert Wiere who performed slapstick comedy.
    3. Entry is for Sarah-Jayne Mulvihill who was a Flight Lieutenant.
    4. Entry is for Douglas Scott who was killed by Demetreus Nix.
    5. Entry is for Kim Hwan-Sung who was a member of the band NRG.
- For these rows, we can replace the `name` value with the `info_0` value and hard-code the correct `info_2` and `info_3` values to match the Wikipedia pattern, while staying true to the scraped information.
- The row with the "Nearly 3" value for `info_0` also represents a group, rather than an individual, so it will be dropped after treating the above rows.
- We will then proceed to extract age from `info_0` for the few rows that contain it here instead of in `info_1`.

#### Treating 5 rows with Name in `info_0`

In [None]:
# List of name values in info_0
values_lst = [
    "Mike Alexander",
    "Herbert Wiere",
    "Sarah-Jayne Mulvihill",
    "Douglas Scott",
    "Kim Hwan-Sung",
]

In [None]:
# For loop to copy name from info_0 to name
for i in df[df["info_0"].isin(values_lst)].index.to_list():
    df.loc[i, "name"] = df.loc[i, "info_0"]

# Hard-coding info_2 and info_3 values for Kim Hwan-Sung
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/NRG_(South_Korean_band)"
].index.to_list()
df.loc[index, "info_2"] = "South Korean musician"

df.loc[index, "info_3"] = "respiratory illness"

# Hard-coding info_2 and info_3 values for Douglas Scott
index = df[
    df["link"]
    == "https://en.wikipedia.org/w/index.php?title=Demetreus_Nix&action=edit&redlink=1"
].index.to_list()
df.loc[index, "info_2"] = "student"

df.loc[index, "info_3"] = "murdered"

# Hard-coding info_2 and info_3 values for Sarah-Jayne Mulvihill
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/Flight_Lieutenant"
].index.to_list()
df.loc[index, "info_2"] = "British servicewoman"

df.loc[index, "info_3"] = "killed in action"

In [None]:
# Rechecking updated rows
df[df["info_0"].isin(values_lst)]

#### Dropping Entry for Group

In [None]:
# Checking the entry representing a group
group_entry = df[df["info_0"] == "Nearly 3"]
group_entry

In [None]:
# Dropping group entry, resetting index, and checking new shape of df
df.drop(group_entry.index, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

#### Extracting `age` from `info_0`

In [None]:
# Column to check
column = "info_0"

# Dataframe to check
dataframe = df

# Pattern for re
pattern = r"(\d{1,3})"

# Checking rows with pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking a sample of rows
df.loc[rows_to_check, :].sample(2)

In [None]:
# For loop to extract age from info_0 to age
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        df.loc[index, "age"] = int(match.group(1))
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column])

# Rechecking number and example rows after treatment
recheck_rows = rows_with_pattern(dataframe, column, pattern)

# Recheck a sample of treated rows
df.loc[rows_to_check, :].sample(2)

#### Observations:
- The new `age` column has been added successfully.
- We are finished processing `info_0`, and we can drop it, as it no longer contains information that we need.

#### Re-checking Unique Values for `info_0` and Dropping the Column

In [None]:
# Re-checking unique values for info_0 prior to dropping it
df["info_0"].unique()

In [None]:
# Dropping info_0
df.drop("info_0", axis=1, inplace=True)

#### Observations:
- All of the age values were extracted from `info_0` and the column has been dropped.
- We are ready to move on to processing `info_1`, which should primarily consist of age values, per the defined Wikipedia fields.

### `info_1`

#### Unique Values

In [None]:
# Checking unique values
df["info_1"].unique()

#### Observations:
- There is a lot of variety in `info_1` and in the format of the age data.
- Also, this field contains several values that we would expect in `info_2` and beyond.
- Let us take the approach of extracting age values first.

#### Examining Unique Formats for Age Data

In [None]:
# Column to check
column = "info_1"

# Dataframe to check
dataframe = df

# Pattern for re
pattern = r"\d"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking unique values for column
df.loc[rows_to_check, :][column].unique()

#### Observations:
- The data for age within `info_1` is in the following formats:  
    - single integer ("age", "age ", "age.")
    - range of 2 integers (separators '-', '–', '/', and ' or ')
    - range of 2 integers with only unit value for second number ('age1/age2-2nd-digit')
    - age in days or months ('age days', 'age-months')
    - estimates ('c. age', 'c.age', 'age?', 'ages' (e.g. 80s), 'age+')
- There are some specific rows that need to be examined, with the following values for `info_1`:
    - German Olympic sailor [1]
    - Taiwanese failed assassin in the 3-19 shooting incident
    - 255
    - 176
    - the first wild bear in Germany in 170 years
    - c. 3500
    - common chimpanzee 55
    - Maltese 15
    - c.1000
    - Tree of the Year 150
- There are additional 4-digit values that likely represent years or other information, such as flight number, etc.  Unless the information is preceded by 'c.' we will opt to exclude it.
- We will need to be strategic in the order in which we extract age from `info_1`.
- First, we will look at the atypical values listed above.
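
To make the listed formats concrete, the sketch below labels a single `info_1` value by format. It only classifies; it does not compute an age. The function name, the labels, and the choice to anchor each pattern to the whole value (after trimming spaces and periods) are all assumptions for illustration, whereas the notebook itself searches within the value using `re.search`.

```python
import re

# Ordered from most to least specific; labels are hypothetical
AGE_FORMATS = [
    ("days", re.compile(r"^(\d{1,3}) days$")),
    ("months", re.compile(r"^(\d{1,3})[- ]months$")),
    ("range", re.compile(r"^(\d{1,3})(?:-|–|/| or )(\d{1,3})$")),
    ("estimate", re.compile(r"^(?:c\. ?(\d{1,4})|(\d{1,3})[?+s])$")),
    ("single", re.compile(r"^(\d{1,3})$")),
]


def classify_age_text(text):
    """Return a format label for an info_1 value, or 'other' if none applies."""
    cleaned = text.strip(" .")
    for label, regex in AGE_FORMATS:
        if regex.match(cleaned):
            return label
    return "other"


for sample in ["86", "67/8", "89 or 90", "7 days", "c. 85", "80s", "1994"]:
    print(sample, "->", classify_age_text(sample))
```

Because the `single` pattern allows at most three digits, a bare four-digit value such as "1994" falls through to "other", while "c. 3500" is still labeled as an estimate. That mirrors the decision to exclude four-digit values unless they are preceded by 'c.'.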

#### Examining Rows with Digits and Atypical Values for `info_1`

In [None]:
# List of atypical info_1 values for rows with digits
values_lst = [
    "German Olympic sailor [1]",
    "Taiwanese failed assassin in the 3-19 shooting incident",
    "255",
    "176",
    "the first wild bear in Germany in 170 years",
    "c. 3500",
    "common chimpanzee 55",
    "Maltese 15",
    "c.1000",
    "Tree of the Year 150",
]

df[df[column].isin(values_lst)]

#### Observations:
- Age data is either missing or the entry is for a member of a non-human species (tortoise, chimpanzee, dog, or tree).
- Where age data is missing, we will drop the entry.  Age is missing for all of the human entries and for the bear, above.
- Let us maintain the non-human entries, as they may be interesting to examine in the future, but we should keep a list of non-human species to help distinguish entry types later.

#### Selecting and Dropping Additional Rows Missing Age Data and Creating `other_species` List

In [None]:
# Creating a list for other non-human species
other_species = ["tortoise", "tree", "chimpanzee", "dog"]

# For loop to create list of rows to drop
rows_to_drop = []
for index in df[df[column].isin(values_lst)].index:
    item = df.loc[index, "info"]
    if not any(species in item for species in other_species):
        rows_to_drop.append(index)

# Check rows to drop
df.loc[rows_to_drop, :]

In [None]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

#### Adding "bear" to `other_species` list

In [None]:
# Adding bear to other_species list
other_species.append("bear")

#### Observations:
- With those rows addressed, we will begin by extracting ages given in days and months and converting them to years.
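
The conversion itself is a simple division, sketched here on single strings. `age_in_years` and `UNIT_PATTERNS` are hypothetical names, the two month formats are merged into one pattern as a small simplification, and dividing by 365 ignores leap years.

```python
import re

# Pattern -> number of units per year (365 ignores leap years)
UNIT_PATTERNS = {
    r"(\d{1,3}) days": 365,
    r"(\d{1,3})[- ]months": 12,
}


def age_in_years(text):
    """Return age in years for a days/months value, or None if no pattern matches."""
    for pattern, units_per_year in UNIT_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            return int(match.group(1)) / units_per_year
    return None


print(age_in_years("18 months"))  # 1.5
print(age_in_years("73 days"))    # 0.2
print(age_in_years("86"))         # None
```

Storing these ages as fractions of a year keeps `age` on a single numeric scale, at the cost of making the column a float rather than an integer type.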

### Extracting `age` from `info_1`

#### Step 1: Age in Days and Months

In [None]:
# Column to check
column = "info_1"

# Dataframe to check
dataframe = df

# Dictionary of patterns for days and months formats as keys and factor to convert to years
patterns = {
    r"(\d{1,3})( days)": 365,
    r"(\d{1,3})(-months)": 12,
    r"(\d{1,3})( months)": 12,
}

# List and number of rows matching patterns
rows_to_check = multiple_patterns(dataframe, column, patterns)

# Checking rows
df.loc[rows_to_check, :]

In [None]:
# For loop to extract age in days and months from info_1, convert to years, and save in age
for key, value in patterns.items():
    for index in rows_to_check:
        item = df.loc[index, column]
        match = re.search(key, item)
        if match:
            age = int(match.group(1)) / value
            df.loc[index, "age"] = age
            df.loc[index, column] = re.sub(key, "", df.loc[index, column])

# Re-check number of rows matching patterns
recheck_rows = multiple_patterns(dataframe, column, patterns)

# Checking treated rows
df.loc[rows_to_check, :]

#### Observations:
- We have successfully captured the age in days and months values and converted them to years.
- Next, we will address entries that contain two age values as a range, starting first with those that have a single digit as the second number.
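To make the Step 1 conversion concrete, here is a minimal sketch that applies the same three patterns and divisors to plain strings, without the dataframe. The function name `age_in_years` and the sample strings are hypothetical, chosen only for illustration, and the sketch stops at the first matching pattern rather than looping over all rows.

```python
import re

# Same patterns and divisors as in Step 1: unit pattern -> units per year
UNIT_PATTERNS = {
    r"(\d{1,3})( days)": 365,
    r"(\d{1,3})(-months)": 12,
    r"(\d{1,3})( months)": 12,
}


def age_in_years(text):
    """Return (age in years, text with the age removed), or (None, text) if nothing matches."""
    for pattern, per_year in UNIT_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            return int(match.group(1)) / per_year, re.sub(pattern, "", text)
    return None, text


# Hypothetical sample values
print(age_in_years("6 months"))  # (0.5, '')
print(age_in_years("73 days"))  # (0.2, '')
print(age_in_years("86"))  # (None, '86')
```

Plain integer ages are left untouched here, which is why they can safely be handled in a later step.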

#### Step 2: Extracting `age` from `info_1` for Entries with Age Estimate Containing Age Range with 2 Values

#### Ranges with Single Digit as Upper-end

In [None]:
# Column to check
column = "info_1"

# Dataframe to check
dataframe = df

# Pattern for re
pattern = r"(\d{1,3})(/)(\d)\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

In [None]:
# For loop to find rows with values and pattern and calculate and extract age to age column and remove age from info_1
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        age_1 = int(match.group(1))
        age_2 = int(match.group(3))
        units = ((age_1 % 10) + age_2) / 2
        tens = age_1 - (age_1 % 10)
        age = tens + units
        df.loc[index, "age"] = age
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column])

# Re-check number of rows matching pattern
recheck_rows = rows_with_pattern(dataframe, column, pattern)

# Checking a sample of treated rows
df.loc[rows_to_check, :].sample(2)

#### Observations:
- The age with split units has been updated to reflect the average of the two values.
- Next we will address the age ranges with two complete integer values.
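The split-units arithmetic can be checked by hand on a single string. The sketch below repeats the same calculation outside the dataframe; the function name `split_units_age` and the sample values are hypothetical.

```python
import re

# Same pattern as above: a 1-3 digit age, a slash, then a single digit
SPLIT_UNITS = r"(\d{1,3})(/)(\d)\b"


def split_units_age(text):
    """Average the units digit of the first age with the single digit after the slash."""
    match = re.search(SPLIT_UNITS, text)
    if not match:
        return None
    age_1 = int(match.group(1))
    age_2 = int(match.group(3))
    tens = age_1 - (age_1 % 10)  # e.g. 95 -> 90
    units = ((age_1 % 10) + age_2) / 2  # e.g. (5 + 6) / 2 -> 5.5
    return tens + units


print(split_units_age("95/6"))  # 95.5
print(split_units_age("101/2"))  # 101.5
```

One caveat, stated as an assumption about data not shown here: a value that rolls over a decade, such as a hypothetical "89/0" meaning 89 or 90, would come out as 84.5 under this formula, so such entries would need separate handling if they occur.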

#### Other Ranges with Two Values

In [None]:
# Column to check
column = "info_1"

# Dataframe to check
dataframe = df

# Pattern for re
pattern = r"(\d{1,3})(-|–|/| or )(\d{1,3})"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

In [None]:
# For loop to find rows with values and pattern and calculate and extract age to age column and remove age from info_1
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        age = (int(match.group(1)) + int(match.group(3))) / 2
        df.loc[index, "age"] = age
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column])

# Re-check number of rows matching pattern
recheck_rows = rows_with_pattern(dataframe, column, pattern)

# Checking a sample of treated rows
df.loc[rows_to_check, :].sample(2)

#### Observations:
- The two-integer age ranges have been successfully treated and saved as the average of the two values in `age`.
- Next, we will extract from the entries with straightforward single integer age values, including the formats: "age", "age ", 'age.'.
- More vague estimates (including 'c.', '+', '?', and ending in 's') are excluded here to allow closer examination, as they are more likely to be atypical entries.
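As a small illustration of the two-integer averaging, and of why the single-digit ranges had to be treated first, the sketch below applies the same pattern to plain strings. The function name `range_average` and the samples are hypothetical.

```python
import re

# Same pattern as above: two 1-3 digit integers joined by -, –, / or " or "
TWO_VALUES = r"(\d{1,3})(-|–|/| or )(\d{1,3})"


def range_average(text):
    """Return the mean of the two integers in a range, or None if there is no range."""
    match = re.search(TWO_VALUES, text)
    if not match:
        return None
    return (int(match.group(1)) + int(match.group(3))) / 2


print(range_average("90-91"))  # 90.5
print(range_average("85 or 86"))  # 85.5
print(range_average("95/6"))  # 50.5
```

The last line shows the ordering issue: had "95/6" still been present at this stage, it would have been averaged as 95 and 6 rather than as 95 and 96.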

#### Step 3: Age as Single Integer (Excluding Estimates)

In [None]:
# Column to check
column = "info_1"

# Dataframe to check
dataframe = df

# List of patterns for age formats with single integer for age
patterns = [r"^(\d{1,3})$", r"^(\d{1,3})\s", r"^(\d{1,3})\.\s"]

# List and number of rows matching patterns
rows_to_check = multiple_patterns(df, column, patterns)

# Checking a sample of rows
df.loc[rows_to_check, :].sample(2)

In [None]:
# For loop to check age pattern in info_1, save age to age column, and remove age from info_1
for index in rows_to_check:
    for pattern in patterns:
        item = df.loc[index, column]
        match = re.search(pattern, item)
        if match:
            age = int(match.group(1))
            df.loc[index, "age"] = age
            df.loc[index, column] = re.sub(pattern, "", df.loc[index, column])

# Re-checking number of rows matching patterns
recheck_rows = multiple_patterns(dataframe, column, patterns)

# Checking first 2 rows
df.head(2)

In [None]:
# Checking last 2 rows
df.tail(2)

In [None]:
# Checking the number of remaining missing values for `age`
print(f'There are {df["age"].isna().sum()} remaining missing values for age.')

#### Observations:
- The rows with single integer age data have been addressed.
- There are only 155 remaining missing values for `age` after extracting age in days, months, single integer years, and 2-integer year range values.
- Let us check the rows containing 'c.', '+', or '?' in the age information.  We will do the ranges ending in 's', such as 80s, separately as we will treat the age value differently for those entries.
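Before moving on, here is a minimal sketch of why the Step 3 patterns left those estimate formats untouched: each pattern is anchored at the start of the string and requires the digits to be followed by the end of the string, whitespace, or a period plus whitespace. The function name `single_integer_age` and the sample strings are hypothetical.

```python
import re

# Same anchored patterns as in Step 3
SINGLE_INTEGER = [r"^(\d{1,3})$", r"^(\d{1,3})\s", r"^(\d{1,3})\.\s"]


def single_integer_age(text):
    """Return the leading integer age if any Step 3 pattern matches, else None."""
    for pattern in SINGLE_INTEGER:
        match = re.search(pattern, text)
        if match:
            return int(match.group(1))
    return None


# Plain ages match; estimates and 4-digit values do not
for sample in ["86", "86 ", "86. ", "c. 86", "86+", "80s", "1936"]:
    print(repr(sample), single_integer_age(sample))
```

The first three samples yield 86, while the remaining four yield `None` and so are left for the estimate-specific steps.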

#### Entries with Age Data Containing 'c.', '+', or '?' for Estimate

In [None]:
# Column to check
column = "info_1"

# Dataframe to check
dataframe = df

# Pattern for re
pattern = r"(c\.|\+|\?)"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Inspecting rows containing values
df.loc[rows_to_check, :]

#### Observations:
- Most of the entries are for people, but there are also one or more entries for the following:
    - carp
    - racehorse
    - chimpanzee
    - flamingo
    - cat
    - turkey
- We will proceed to check `info_2` for these non-human-entry values and drop these and other rows representing members of these other species.

#### Checking for Cat, Racehorse, Chimpanzee, Carp, Flamingo, and Turkey in `info_2`

In [None]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[df[column].notna()]

# pattern for re
pattern = r"\b(cat|racehorse|chimpanzee|carp|flamingo|turkey)\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking a sample of rows
df.loc[rows_to_check, :].sample(2)

#### Observations:
- There are sufficient rows to warrant checking species by species.

#### Cat Entries per `info_2`

In [None]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[df[column].notna()]

# Pattern for re
pattern = r"\b(cat)\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Inspecting rows with pattern
df.loc[rows_to_check, :]

#### Observations:
- There is one person represented.  
- We can proceed to drop the others, using "cat", "cat of", and "cat in" patterns.
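The sketch below illustrates how those narrower patterns separate animal entries from people whose description merely mentions cats. The function name `looks_like_cat_entry` and all sample strings are hypothetical; they are not taken from the dataset.

```python
import re

# Patterns used for dropping: "cat" at the end of the value, or "cat of" / "cat in"
CAT_PATTERNS = [r"\bcat$", r"\b(cat of|cat in)\b"]


def looks_like_cat_entry(text):
    """Return True if the text matches any of the cat-entry patterns."""
    return any(re.search(pattern, text) for pattern in CAT_PATTERNS)


# Hypothetical info_2 values
print(looks_like_cat_entry("Japanese station master cat"))  # True
print(looks_like_cat_entry("English cat in residence"))  # True
print(looks_like_cat_entry("British cat breeder and judge"))  # False
print(looks_like_cat_entry("American educator"))  # False
```

The word boundary `\b` keeps words that merely contain "cat", such as "educator", from matching.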

#### Dropping Entries for Cats per `info_2`

In [None]:
column = "info_2"

dataframe = df[df[column].notna()]

# List of re patterns to find
patterns = [r"\bcat$", r"\b(cat of|cat in)\b"]

# List and number of rows matching pattern
rows_to_drop = multiple_patterns(dataframe, column, patterns)

# Checking a sample of rows
df.loc[rows_to_drop, :].sample(2)

In [None]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

#### Racehorse Entries per `info_2`

In [None]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[df[column].notna()]

# Pattern for re
pattern = r"\bracehorse\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking a sample of rows
df.loc[rows_to_check, :].sample(10)

#### Observations:
- There are several entries for people involved in the racehorse business.
- Values that end in 'racehorse' and 'racehorse and sire' can be removed.

#### Dropping Entries for Racehorses per `info_2`

In [None]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[df[column].notna()]

# List of re patterns to find
patterns = [r"\bracehorse$", r"\b(racehorse and sire)$"]

# List and number of rows matching pattern
rows_to_drop = multiple_patterns(dataframe, column, patterns)

# Checking a sample of rows
df.loc[rows_to_drop, :].sample(2)

In [None]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

#### Chimpanzee, Flamingo, Carp, and Turkey Entries per `info_2`

In [None]:
# Column to check
column = "info_2"

# Dataframe to check
dataframe = df[df[column].notna()]

# Defining pattern for re
pattern = r"\b(chimpanzee|flamingo|carp|turkey)\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Inspecting rows with pattern
df.loc[rows_to_check, :]

#### Observations:
- All of these rows are for animals, so we will remove them.

#### Dropping Entries for Chimpanzees, Flamingos, Carp, and Turkeys per `info_2`

In [None]:
# Dropping rows, resetting index, and checking new shape of df
rows_to_drop = rows_to_check.copy()
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

#### Observations:
- With those non-human entries addressed, we can return to processing `info_1`.
- We will need to be on the lookout for other non-human entries, in `info_2` and beyond.
- Let us address the remaining entries with '?', '+', or 'c.', accepting the estimated age as `age`.
- We will treat the age estimates ending in 's' similarly, but separately.

#### Extracting `age` from `info_1` for Entries with Age Estimate Containing '?', '+', or 'c.'

In [None]:
# Column to check
column = "info_1"

# Dataframe to check
dataframe = df[df["age"].isna()]

# Pattern for re
pattern_1 = r"(c\.|\+|\?)"

# List and number of rows matching pattern_1
rows_to_check = rows_with_pattern(dataframe, column, pattern_1)

# Checking a sample of rows
df.loc[rows_to_check, :].sample(2)

In [None]:
# List to identify rows
values = ["c.", "+", "?"]

# Pattern for re
pattern_2 = r"\b(\d{1,3})\b"

# For loop to find rows with values and pattern and extract age to age column and remove age from info_1
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern_2, item)
    if match:
        age = int(match.group(1))
        df.loc[index, "age"] = age
        df.loc[index, column] = re.sub(pattern_2, "", df.loc[index, column])

    for value in values:
        df.loc[index, column] = df.loc[index, column].replace(value, "").strip()


# Re-check number of treated rows still matching pattern_2
recheck_rows = rows_with_pattern(df.loc[rows_to_check, :], column, pattern_2)

# Checking a sample of treated rows
df.loc[rows_to_check, :].sample(2)

#### Observations:
- With those values treated, ages ending in 's' should be the only ones remaining in the `info_1` column.
- We will examine those now.
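For reference, the estimate handling just applied can be sketched on plain strings as follows. The function name `extract_estimate` and the samples are hypothetical; the pattern and marker list mirror the ones used above.

```python
import re

# Markers that flag an estimate, and the 1-3 digit age pattern used above
MARKERS = ["c.", "+", "?"]
AGE = r"\b(\d{1,3})\b"


def extract_estimate(text):
    """Return (age or None, leftover text with the age and estimate markers removed)."""
    match = re.search(AGE, text)
    age = int(match.group(1)) if match else None
    remainder = re.sub(AGE, "", text)
    for marker in MARKERS:
        remainder = remainder.replace(marker, "")
    return age, remainder.strip()


print(extract_estimate("c. 90"))  # (90, '')
print(extract_estimate("100+"))  # (100, '')
print(extract_estimate("86?"))  # (86, '')
```

Because the pattern is limited to three digits between word boundaries, a 4-digit value is not captured as an age.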

#### Extracting `age` from `info_1` for Entries with Age Estimate ending in 's'

In [None]:
# Column to check
column = "info_1"

# Dataframe to check
dataframe = df

# Defining pattern for re
pattern = r"\b(\d{1,3})s\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Inspecting rows with pattern
df.loc[rows_to_check, :]

#### Observations:
- They are all for people, so we can extract `age` from `info_1` and set it to the middle value of the decade (e.g., 85 for "80s").
- We will also remove the age-associated qualifier `early` from `info_1`.
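A minimal sketch of the decade-midpoint rule described above, on plain strings, is shown here. The function name `decade_midpoint` and the samples are hypothetical.

```python
import re

# A 1-3 digit number followed by "s", e.g. "80s"
DECADE = r"\b(\d{1,3})s\b"


def decade_midpoint(text):
    """Return (midpoint age of the decade, leftover text), or (None, text) if no decade is found."""
    match = re.search(DECADE, text)
    if not match:
        return None, text
    age = int(match.group(1)) + 5  # middle of the decade
    remainder = re.sub(DECADE, "", text).replace("early ", "")
    return age, remainder


print(decade_midpoint("80s"))  # (85, '')
print(decade_midpoint("early 90s"))  # (95, '')
```

Note that, under this rule, an "early" qualifier is removed from the text but does not shift the estimate away from the midpoint.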

In [None]:
# For loop to find rows with values and pattern and extract age to age column and remove age from info_1
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[index, "age"] = age + 5
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column])
    if "early " in item:
        df.loc[index, column] = df.loc[index, column].replace("early ", "")

# Re-checking number of rows matching pattern
recheck_rows = rows_with_pattern(dataframe, column, pattern)

# Checking a sample of treated rows
df.loc[rows_to_check, :].sample(2)

#### Observations:
- All of the age data contained in `info_1` should now be extracted.
- We will verify and check the number of remaining missing values for `age`.

### Checking for Any Missed Digits in `info_1` and for Remaining Missing Values for `age`

In [None]:
# Column to check
column = "info_1"

# Dataframe to check
dataframe = df

# Pattern for re
pattern = r"\d"

# Re-checking number of rows matching pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking number of missing values for age
print(f'\nThere are {df["age"].isna().sum()} remaining missing values for age.')

#### Observations:
- All of the age data that had been in `info_1` has been successfully extracted.
- There are 79 remaining missing values for `age` that we hope to find in the other info columns.
- We will include the remaining values in `info_1` when we extract citizenship and role information.
- It is time to export the current dataframe to a SQLite database, so we can start a third notebook.

### Exporting Dataset to SQLite Database [wp_life_expect_clean1.db](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean1.db)

In [None]:
# Saving the partially cleaned dataset in a SQLite database
conn = sql.connect("wp_life_expect_clean1.db")
df.to_sql("wp_life_expect_clean1", conn, index=False)

# [Proceed to Notebook 3 of 4: Data Cleaning Part 2](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)