# Wikipedia Notable Life Expectancies

# [Notebook 2 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean_thanak_2022_06_13.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the wp_life_expect_raw_complete dataset
conn = sql.connect("wp_life_expect_raw_complete.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_raw_complete", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 133900 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
133898,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
133899,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
19560,November 2001,4,Amalie Rothschild,", 85, American artist.",https://en.wikipedia.org/wiki/Amalie_Rothschild,44
8030,February 1997,15,Oscar Adams,", 72, American lawyer and first African-American Alabama Supreme Court justice.",https://en.wikipedia.org/wiki/Oscar_Adams,5
125143,July 2021,13,Jean-Jacques Reboux,", 62, French writer, poet and editor.",https://en.wikipedia.org/wiki/Jean-Jacques_Reboux,32
9899,October 1997,26,"Rankin M. Smith, Sr.",", 72, American businessman and philanthropist.",https://en.wikipedia.org/wiki/Rankin_M._Smith_Sr.,8
82318,April 2016,9,Patrick J. O'Donnell,", 68, Scottish academic.",https://en.wikipedia.org/wiki/Patrick_J._O%27Donnell,3



#### Observations:
- There are 133,900 rows and 6 columns.
- `month_year` contains the month and year of death, while `day` contains the day of the month of death.
- `name` is the notable person's name.  It is a nominal feature that will not be used for analysis, but will be maintained for any referencing needs.
- `info` contains multiple items including the notable person's "age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, (and) cause of death (if known)."
- `link` is the url to the notable person's individual Wikipedia page.  If such a page does not exist, there is either a non-working link (https://en.wikipedia.orgNone), or the link is to a page with a message that the page does not exist for that individual.  `link` is a unique identifier for all entries, except the 6 with the non-working link, which do have unique `name` values from each other.
- `num_references` contains the number of references on the notable person's individual Wikipedia page.  This feature serves as a proxy measure of notability.
- Prior to EDA, our task will be to extract the individual elements that are combined in the `month_year` and `info` columns.

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133894 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



In [6]:
# Checking duplicate rows
df.duplicated().sum()

0


In [7]:
# Check percentage of null values by column (nulls divided by total rows, not by non-null count)
df.isnull().sum() / len(df) * 100

month_year        0.000000
day               0.000000
name              0.004481
info              0.000000
link              0.000000
num_references    0.000000
dtype: float64


In [8]:
# Checking number of missing values per row (not necessary here, but done to keep process standard)
df.isnull().sum(axis=1).value_counts()

0    133894
1         6
dtype: int64


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- As expected, we have 6 entries that are missing `name`, but we will find it in their `info` values.
- All columns are currently of object type.  We will need to appropriately typecast them after separating the information in `month_year` and `info`.
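
As a minimal sketch of the typecasting mentioned above, `pd.to_numeric` can convert string columns such as `day` and `num_references`. The toy frame and its values are made up for illustration, and `errors="coerce"` is an assumption about how unparseable values should be handled, not a step taken in this notebook.

```python
import pandas as pd

# Toy frame standing in for the scraped data (values are illustrative only)
toy = pd.DataFrame({"day": ["1", "15", "9"], "num_references": ["21", "5", "x"]})

# errors="coerce" turns anything non-numeric into NaN instead of raising
toy["day"] = pd.to_numeric(toy["day"], errors="coerce")
toy["num_references"] = pd.to_numeric(toy["num_references"], errors="coerce")

print(toy.dtypes)
```

A column that ends up containing NaN is stored as float, so counting the NaN values after conversion is a quick way to find entries that need a manual look.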

## Data Cleaning

### Addressing Missing `name` Values

In [9]:
# Checking rows with missing name values
missing_name = df[df["name"].isna()]
missing_name

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- These rows vary from the main set as there is a substring containing the person's name at the start of the `info` string.
- As there are so few rows missing `name`, let us address this issue first.

In [10]:
# For loop to copy name value from info value and remove name from info value
treat_rows = missing_name.index
for i in treat_rows:
    info = df.loc[i, "info"]
    info_lst = info.split(sep=",", maxsplit=1)

    name = info_lst[0].strip()
    df.loc[i, "name"] = name
    # Remove the name as a literal string (first occurrence only), not as a regex pattern
    df.loc[i, "info"] = info.replace(name, "", 1).strip()

# Re-check rows
df.loc[treat_rows, :]

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,Kevin Kowalcyk,", 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,Vincent Palmer,", 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,Barry Stigler,", 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,Nana Gualdi,", 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,Scott Workman,", 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,Thami Shobede,", 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- Missing `name` values have been addressed and those names have been removed from `info` values.
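
When a name is removed from `info` by substitution, it matters whether the name is treated as a regular expression or as a literal string. The sketch below uses a made-up entry (not from the dataset) whose name contains parentheses to show the difference.

```python
import re

# Made-up entry whose name contains regex metacharacters
info = "J. R. Smith (poet), 70, Example writer."
name = info.split(",", maxsplit=1)[0].strip()

# Used as a raw pattern, "(poet)" is read as a group, so the literal parentheses never match
as_regex = re.sub(name, "", info)

# Escaping the name (or using str.replace) removes it literally
as_literal = re.sub(re.escape(name), "", info, count=1).strip()

print(repr(as_regex))
print(repr(as_literal))
```

For the six names treated here the two approaches give the same result; the literal form is simply the safer default.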

In [11]:
# Re-check info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133900 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



#### Observations:
- We have no remaining missing values.
- Let us treat the `month_year` column next.

### Separating `month` and `year`

In [12]:
# Separating month and year into 2 columns and typecasting year as integer
df.loc[:, "year"] = df["month_year"].apply(lambda x: x.split(sep=" ")[1].strip())
df["year"] = df["year"].apply(lambda x: int(x))

df.loc[:, "month"] = df["month_year"].apply(lambda x: x.split(sep=" ")[0])
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references,year,month
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January



In [13]:
# Dropping month_year column
df.drop("month_year", axis=1, inplace=True)
df.head(2)

Unnamed: 0,day,name,info,link,num_references,year,month
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January



### Treating `info`

#### Function to Save Indices of Rows Matching Regular Expressions Pattern to a List and Print Number of Rows with Match

In [14]:
# Define a function that takes dataframe, column name, and re pattern as arguments and returns list of indices
# for which column value matches re pattern
def rows_with_pattern(dataframe, column, pattern):
    """
    Takes input of dataframe, column name, and re pattern 
    and returns list of indices for rows that contain match
    for pattern anywhere within value for given column.
    
    dataframe: dataframe
    column: column name
    pattern: re pattern
    """
    index_list = []

    for i in dataframe.index:
        item = dataframe.loc[i, column]
        match = re.search(pattern, item)
        if match:
            index_list.append(i)
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list


#### Function to Use rows_with_pattern Function for Multiple Regular Expression Patterns

In [15]:
# Define a function that calls rows_with_pattern function for multiple re patterns
# returning a single list of indices for all rows with any pattern match


def multiple_patterns(dataframe, column, patterns):
    """
    Takes input dataframe, column, and list of re patterns and returns single list 
    of indices for rows in which a match for any pattern is found with re.search
    
    dataframe: dataframe
    column: column name
    patterns: list of re patterns
    """
    rows_combined = []

    # For loop to check each pattern
    for pattern in patterns:

        # List and number of rows matching each pattern
        print(pattern)
        rows_to_check = rows_with_pattern(dataframe, column, pattern)
        print("")

        # Add list for each pattern to combined list
        rows_combined += rows_to_check

    return rows_combined


#### Checking a Sample

In [16]:
# Checking a sample of info
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
73128,6,Naphtali Lau-Lavie,", 88, Israeli writer and diplomat.",https://en.wikipedia.org/wiki/Naphtali_Lau-Lavie,3,2014,December
24226,27,Donald J. Mitchell,", 80, American politician and member of the United States House of Representatives for New York.",https://en.wikipedia.org/wiki/Donald_J._Mitchell,18,2003,September
68849,28,Garnet de la Hunt,", 80, South African scout, Chairman of the World Scout Committee (1999–2002), cancer.",https://en.wikipedia.org/wiki/Garnet_de_la_Hunt,3,2014,April
53016,21,Ann-Mari Adamsson,", 77, Swedish actor.",https://en.wikipedia.org/wiki/Ann-Mari_Adamsson,2,2011,December
55201,18,Dick Clark,", 82, American television host and producer (, , ), heart attack.",https://en.wikipedia.org/wiki/Dick_Clark,68,2012,April



#### Observations:
- First, let us check for any rows that are missing digits, and therefore the age target, within `info` and remove them.
- Also, it would be helpful to remove information contained within parentheses, as we will not likely be using it.  We can save it to a new column, until we are certain.

#### Checking and Dropping Rows Lacking Digits (and therefore Age Data) within `info`

In [17]:
# Finding indices of rows that have pattern
has_digits = rows_with_pattern(df, "info", r"\d")
print(
    f"\nThere are {len(df) - len(has_digits)} rows without numbers in the info column."
)

# Dropping rows missing age data, resetting index, and checking new shape of df
df = df.loc[has_digits, :]
df.reset_index(inplace=True, drop=True)
df.shape

There are 132975 rows with matching pattern in column 'info'.

There are 925 rows without numbers in the info column.


(132975, 7)


#### Observations:
- 925 rows were removed as they lacked the target data for `age`.
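
The same digit filter can also be written without an explicit loop. This is a minimal sketch on a made-up three-row frame, using `Series.str.contains` as an alternative to the `rows_with_pattern` helper; it is not the code that produced the counts above.

```python
import pandas as pd

# Made-up rows: two contain a digit, one does not
toy = pd.DataFrame(
    {"info": [", 86, British dancer.", ", Irish economist.", ", c. 90, painter."]}
)

# Boolean mask: True where info contains at least one digit
has_digit = toy["info"].str.contains(r"\d", regex=True)

# Keep only rows with a digit and renumber the index
kept = toy[has_digit].reset_index(drop=True)

print(f"Dropped {(~has_digit).sum()} row(s); kept {len(kept)}.")
```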

#### Removing Information within Parentheses from `info` and saving to new column `info_parenth`

In [18]:
# Regular expression for parenthesis and its contents
pattern = r"\(.*\)"

# Finding indices of rows that have pattern
rows_to_check = rows_with_pattern(df, "info", pattern)

There are 50020 rows with matching pattern in column 'info'.



In [19]:
# For loop to extract parenthesis and its contents from info to info_parenth
for i, item in enumerate(df["info"]):
    match = re.search(pattern, item)
    if match:
        df.loc[i, "info_parenth"] = match.group(0)
        df.loc[i, "info"] = re.sub(pattern, "", df.loc[i, "info"])

# Rechecking for rows with pattern in original column
rows_to_check = rows_with_pattern(df, "info", pattern)
rows_to_check = rows_with_pattern(
    df[df["info_parenth"].notna()], "info_parenth", pattern
)

There are 0 rows with matching pattern in column 'info'.
There are 50020 rows with matching pattern in column 'info_parenth'.



#### Observation:
- Parentheses and the information within them have been removed from `info` and assigned to `info_parenth`.
- Next, we will follow the Wikipedia-defined fields to divide the `info` values.
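
One property of the pattern `\(.*\)` used above is worth keeping in mind: `.*` is greedy, so when a string holds more than one parenthesized group, the match runs from the first `(` to the last `)`. The sketch below illustrates this on a made-up string and compares it with the non-greedy form `\(.*?\)`. It shows regex behavior only and makes no claim about how many rows in the dataset are affected.

```python
import re

# Made-up string with two separate parenthesized groups
text = ", 69, player (1980) and coach (1990), cancer."

# Greedy: one match spanning both groups and the text between them
greedy = re.search(r"\(.*\)", text).group(0)

# Non-greedy: each group is matched on its own
non_greedy = re.findall(r"\(.*?\)", text)

print(greedy)
print(non_greedy)
```

With the greedy form, any text sitting between two groups moves into `info_parenth` along with them.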

#### Splitting `info` on Commas into Separate Columns

In [20]:
# For loop to split info on commas and separate into respective new columns and removing leading/trailing white space and periods
for i, item in enumerate(df["info"]):
    info_lst = item.split(",")

    for j in range(len(info_lst)):
        df.loc[i, f"info_{j}"] = info_lst[j].strip(" .")

# Checking the first 2 rows
df.head(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,86,British dancer,ballet designer and director,,,,,,,,
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,68,Irish economist,writer,and academic,,,,,,,



In [21]:
# Checking the last 2 rows
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
132973,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(1980),,69,Russian volleyball player,Olympic champion and coach,,,,,,,,
132974,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,86,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,



#### Observations:
- The `info` value is successfully divided and we can proceed through it column by column.
- We will check the set of values for the first two columns, for age.
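
The comma split above can also be expressed without a row-by-row loop. The following is a sketch of an alternative using `Series.str.split(expand=True)` on two rows copied from the output above; the `info_{j}` column naming mirrors the notebook, and the rest is an assumption about how one might vectorize the step.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "info": [
            ", 86, British dancer, ballet designer and director.",
            ", 68, Irish economist, writer, and academic.",
        ]
    }
)

# One column per comma-separated piece; shorter rows are padded with missing values
parts = toy["info"].str.split(",", expand=True)

# Strip leading/trailing spaces and periods from every piece
parts = parts.apply(lambda col: col.str.strip(" ."))

# Name the pieces info_0, info_1, ... and attach them to the frame
parts.columns = [f"info_{j}" for j in parts.columns]
toy = toy.join(parts)

print(toy.filter(like="info_"))
```

On a frame with over 130,000 rows, this form is likely to run much faster than assigning cells one at a time with `df.loc`.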

### `info_0`

In [22]:
# Checking unique value counts
df["info_0"].value_counts()

                                                                         132946
Sir                                                                           6
92                                                                            2
Douglas Scott                                                                 1
VC                                                                            1
Sir Woolwich West                                                             1
Sir Lord Justice of Appeal                                                    1
Dame MEP                                                                      1
83                                                                            1
Sir Governor-General                                                          1
Notable ice hockey players and coaches among the 44 killed in the :\n         1
Mike Alexander                                                                1
Colonel                                 


#### Observations:
- The vast majority of rows have an empty string for this field.
- There is one row representing a group, rather than an individual, and we will drop it.
- We should verify the name and age information for the remainder of unique values in `info_0`.

#### Dropping Entry for Group

In [23]:
# Checking the entry representing a group
group_entry = df[
    df["info_0"]
    == "Notable ice hockey players and coaches among the 44 killed in the :\n"
]
group_entry

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
51130,7,2011 Lokomotiv Yaroslavl plane crash,Notable ice hockey players and coaches among the 44 killed in the :\n,https://en.wikipedia.org/wiki/2011_Lokomotiv_Yaroslavl_plane_crash,95,2011,September,,Notable ice hockey players and coaches among the 44 killed in the :\n,,,,,,,,,,,



In [24]:
# Dropping group entry, resetting index, and checking new shape of df
df.drop(group_entry.index, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132974, 20)


#### Examining Rows with Atypical `info_0` Values

In [25]:
# Examining rows with atypical info_0 values
list_to_check = df["info_0"].value_counts().index.to_list()

verify_df = pd.DataFrame()
for item in list_to_check[1:]:
    verify_df = pd.concat([verify_df, df[df["info_0"] == item]])
verify_df

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
30349,25,Julian Bullard,"Sir , 78, British diplomat.",https://en.wikipedia.org/wiki/Julian_Bullard,3,2006,May,,Sir,78,British diplomat,,,,,,,,,
31140,1,Kyffin Williams,"Sir , 88, Welsh artist, lung and prostate cancer.",https://en.wikipedia.org/wiki/Kyffin_Williams,15,2006,September,,Sir,88,Welsh artist,lung and prostate cancer,,,,,,,,
34207,15,Jeremy Moore,"Sir , 79, British soldier, commander of UK land forces in the Falklands War.",https://en.wikipedia.org/wiki/Jeremy_Moore,3,2007,September,,Sir,79,British soldier,commander of UK land forces in the Falklands War,,,,,,,,
43606,29,Derek Hodgkinson,"Sir , 92, British air chief marshal.",https://en.wikipedia.org/wiki/Derek_Hodgkinson,5,2010,January,,Sir,92,British air chief marshal,,,,,,,,,
67212,7,Richard Best,"Sir , 80, British diplomat, Ambassador to Iceland .",https://en.wikipedia.org/wiki/Richard_Best_(diplomat),5,2014,March,(1989–1991),Sir,80,British diplomat,Ambassador to Iceland,,,,,,,,
67217,7,Thomas Hinde,"Sir , 88, British novelist.",https://en.wikipedia.org/wiki/Thomas_Hinde_(novelist),13,2014,March,,Sir,88,British novelist,,,,,,,,,
33086,28,David Turnbull,". 92, American materials scientist.",https://en.wikipedia.org/wiki/David_Turnbull_(materials_scientist),8,2007,April,,92,American materials scientist,,,,,,,,,,
117387,11,Gotthilf Fischer,"92, German choral conductor .",https://en.wikipedia.org/wiki/Gotthilf_Fischer,4,2020,December,(Fischer-Chöre),92,German choral conductor,,,,,,,,,,
67517,21,Colin Turner,"Sir Woolwich West, 92, British politician, MP for .",https://en.wikipedia.org/wiki/Colin_Turner,3,2014,March,(1959–1964),Sir Woolwich West,92,British politician,MP for,,,,,,,,
67162,5,Robin Dunn,"Sir Lord Justice of Appeal, 96, British jurist, .",https://en.wikipedia.org/wiki/Robin_Dunn,3,2014,March,(1980–1984),Sir Lord Justice of Appeal,96,British jurist,,,,,,,,,



#### Observations:
- The majority of rows contain additional aliases or titles within `info_0`, that we don't need, but we can leave in place for now.  
- There are a few rows that will need to be treated individually to correct the name value, as follows:
    1. Entry is for Mike Alexander whose band was Evile.
    2. Entry is for Herbert Wiere who performed slapstick.
    3. Entry is for Sarah-Jayne Mulvihill who was a Flight Lieutenant.
    4. Entry is for Douglas Scott who was killed by Demetreus Nix.
    5. Entry is for Kim Hwan-Sung who was a member of the band NRG.
- We can replace the `name` value with the `info_0` value for these rows and hard-code the correct values for the `info_2` and `info_3` fields to match the Wikipedia pattern, while staying true to the information scraped.
- The row with "Nearly 3" value for `info_0` represents a group, rather than an individual, so will be dropped, after treating the above rows.
- We can proceed to extract age from `info_0` for the few rows that contain it here instead of in `info_1`.

#### Treating 5 rows with Name in `info_0`

In [26]:
# List of names values in info_0
values_lst = [
    "Mike Alexander",
    "Herbert Wiere",
    "Sarah-Jayne Mulvihill",
    "Douglas Scott",
    "Kim Hwan-Sung",
]


In [27]:
# For loop to copy name from info_0 to name
for i in df[df["info_0"].isin(values_lst)].index.to_list():
    df.loc[i, "name"] = df.loc[i, "info_0"]

# Hard-coding info_2 and info_3 values for Kim Hwan-Sung
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/NRG_(South_Korean_band)"
].index.to_list()
df.loc[index, "info_2"] = "South Korean musician"

df.loc[index, "info_3"] = "respiratory illness"

# Hard-coding info_2 and info_3 values for Douglas Scott
index = df[
    df["link"]
    == "https://en.wikipedia.org/w/index.php?title=Demetreus_Nix&action=edit&redlink=1"
].index.to_list()
df.loc[index, "info_2"] = "student"

df.loc[index, "info_3"] = "murdered"

# Hard-coding info_2 and info_3 values for Sarah-Jayne Mulvihill
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/Flight_Lieutenant"
].index.to_list()
df.loc[index, "info_2"] = "British servicewoman"

df.loc[index, "info_3"] = "killed in action"


In [28]:
# Rechecking updated rows
df[df["info_0"].isin(values_lst)]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
14035,5,Herbert Wiere,"Herbert Wiere, 91, Austrian-born American Wiere Brothers comedian, member of the",https://en.wikipedia.org/wiki/Slapstick,16,1999,August,,Herbert Wiere,91,Austrian-born American Wiere Brothers comedian,member of the,,,,,,,,
15873,15,Kim Hwan-Sung,"Kim Hwan-Sung, 19, A Member of .",https://en.wikipedia.org/wiki/NRG_(South_Korean_band),27,2000,June,,Kim Hwan-Sung,19,South Korean musician,respiratory illness,,,,,,,,
18312,20,Douglas Scott,"Douglas Scott, 20, High-school student murdered by .",https://en.wikipedia.org/w/index.php?title=Demetreus_Nix&action=edit&redlink=1,0,2001,June,,Douglas Scott,20,student,murdered,,,,,,,,
30210,6,Sarah-Jayne Mulvihill,"Sarah-Jayne Mulvihill, 32, first British servicewoman to be killed in action in Iraq.",https://en.wikipedia.org/wiki/Flight_Lieutenant,12,2006,May,,Sarah-Jayne Mulvihill,32,British servicewoman,killed in action,,,,,,,,
42140,5,Mike Alexander,"Mike Alexander, 32, British bassist , pulmonary embolism.",https://en.wikipedia.org/wiki/Evile,82,2009,October,(),Mike Alexander,32,British bassist,pulmonary embolism,,,,,,,,



#### Dropping Entry for Group

In [29]:
# Checking the entry representing a group
group_entry = df[df["info_0"] == "Nearly 3"]
group_entry

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
18908,11,were killed,"Nearly 3,000 people September 11 attacks in the , including:\n",https://en.wikipedia.org/wiki/Casualties_of_the_September_11_attacks,176,2001,September,,Nearly 3,000 people September 11 attacks in the,including:\n,,,,,,,,,



In [30]:
# Dropping group entry, resetting index, and checking new shape of df
df.drop(group_entry.index, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132973, 20)


#### Extracting `age` from `info_0`

In [31]:
# Pattern for re
pattern = r"(\d{1,3})"

# Checking rows with pattern
rows_to_check = rows_with_pattern(df, "info_0", pattern)

There are 3 rows with matching pattern in column 'info_0'.



In [32]:
# For loop to extract age from info_0 to age
for i, item in enumerate(df["info_0"]):
    match = re.search(pattern, item)
    if match:
        df.loc[i, "age"] = int(match.group(1))
        df.loc[i, "info_0"] = re.sub(pattern, "", df.loc[i, "info_0"])

# Re-checking info_0 and age for pattern
rows_to_check = rows_with_pattern(df, "info_0", pattern)
df[df["age"].notna()]

There are 0 rows with matching pattern in column 'info_0'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
33085,28,David Turnbull,". 92, American materials scientist.",https://en.wikipedia.org/wiki/David_Turnbull_(materials_scientist),8,2007,April,,,American materials scientist,,,,,,,,,,,92.0
59615,24,Kristján Jóhannsson,"83, Icelandic Olympic athlete.",https://en.wikipedia.org/wiki/Kristj%C3%A1n_J%C3%B3hannsson_(athlete),2,2013,January,,,Icelandic Olympic athlete,,,,,,,,,,,83.0
117386,11,Gotthilf Fischer,"92, German choral conductor .",https://en.wikipedia.org/wiki/Gotthilf_Fischer,4,2020,December,(Fischer-Chöre),,German choral conductor,,,,,,,,,,,92.0



#### Observations:
- The new `age` column has been added successfully.
- We are finished processing `info_0`.
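
For reference, the age extraction above can be sketched in vectorized form with `Series.str.extract`. The toy values below are made up to resemble `info_0`, and this is an alternative illustration rather than the code that was run.

```python
import pandas as pd

# Made-up info_0 values: two hold an age, two do not
toy = pd.DataFrame({"info_0": ["", "92", "Sir", "83"]})

# Capture the first run of 1-3 digits; rows without a match become NaN
toy["age"] = pd.to_numeric(toy["info_0"].str.extract(r"(\d{1,3})", expand=False))

# Remove the digits from info_0 once they have been copied to age
toy["info_0"] = toy["info_0"].str.replace(r"\d{1,3}", "", regex=True)

print(toy)
```

Because most rows have no match, `age` is stored as float with NaN for the gaps, which is also why the ages above display as `92.0` and `83.0`.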

#### Re-checking Unique Values for `info_0` and Dropping the Column

In [33]:
# Re-checking unique values for info_0 prior to dropping it
df["info_0"].unique()

array(['', 'Herbert Wiere', 'Kim Hwan-Sung', 'Douglas Scott',
       'aka J Dilla', 'aka "Sammy Steamboat"', 'Sarah-Jayne Mulvihill',
       'Sir', 'aka "Dino Sete Cordas"', 'aka Abu Hamza', 'Sir Winchester',
       'Sir Edinburgh University', 'VC', 'Colonel', 'Mike Alexander',
       'Sir Governor-General', 'Dame MEP', 'Sir Lord Justice of Appeal',
       'Sir Woolwich West', 'Sir Premier'], dtype=object)


In [34]:
# Dropping info_0
df.drop("info_0", axis=1, inplace=True)


#### Observations:
- We are ready to move on to processing `info_1`, which should primarily consist of age values, per the defined Wikipedia fields.

### `info_1`

#### Unique Values

In [35]:
# Checking unique values
df["info_1"].unique()

array(['86', '68', '87', '93', '79', '50', '88', '72', '81', '80', '90',
       '85', '92', '58', '54', '96', '49', '77', '76', '43', '35', '83',
       '31', '64', '57', '52', '84', '78', '70', '73', '67', '99', '33',
       '75', '66', '74', '62', '61', '82', '38', '47', '56', '91', '89',
       '94', '45', '65', '97', '63', '69', '37', '53', '46', '60', '26',
       '71', '25', '59', '39–40', '23', '95', '42', '32', '51', '41',
       '55', '44', '98', '39', '100', '27', '28', '40', '30', '48', '34',
       '29', '36', '111', '22', '104', '14', '21', '106', '105', '18',
       '101', '102', '20', '73-74', '84-85', '67-68', '85-86', '89-90',
       '103', '49-50', '87-88', '107', '48-49', '19', '71-72', '2',
       '82-83', '88-89', '11-12', '63-64', '64-65', '24', '17', '69-70',
       '52-53', '39-40', '80-81', '51-52', '70–71', '66–67', '59-60',
       '94-95', '15', '108', '76-77', '1995', '37-38',
       'American politician', '55/56', '87–88',
       '90 Swedish Olympic sprinte


#### Observations:
- There is a lot of variety in the format of the age data.
- Also, this field contains several values that we would expect in info_2 and beyond.
- Let us take the approach of extracting age values first.

#### Examining Unique Formats for Age Data

In [36]:
# Pattern for re
pattern = r"\d"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df, "info_1", pattern)

# Checking unique values for column
df.loc[rows_to_check, "info_1"].unique()

There are 132752 rows with matching pattern in column 'info_1'.


array(['86', '68', '87', '93', '79', '50', '88', '72', '81', '80', '90',
       '85', '92', '58', '54', '96', '49', '77', '76', '43', '35', '83',
       '31', '64', '57', '52', '84', '78', '70', '73', '67', '99', '33',
       '75', '66', '74', '62', '61', '82', '38', '47', '56', '91', '89',
       '94', '45', '65', '97', '63', '69', '37', '53', '46', '60', '26',
       '71', '25', '59', '39–40', '23', '95', '42', '32', '51', '41',
       '55', '44', '98', '39', '100', '27', '28', '40', '30', '48', '34',
       '29', '36', '111', '22', '104', '14', '21', '106', '105', '18',
       '101', '102', '20', '73-74', '84-85', '67-68', '85-86', '89-90',
       '103', '49-50', '87-88', '107', '48-49', '19', '71-72', '2',
       '82-83', '88-89', '11-12', '63-64', '64-65', '24', '17', '69-70',
       '52-53', '39-40', '80-81', '51-52', '70–71', '66–67', '59-60',
       '94-95', '15', '108', '76-77', '1995', '37-38', '55/56', '87–88',
       '90 Swedish Olympic sprinter', '6', '86-87', '62/63', '79


#### Observations:
- The data for age within `info_1` is in the following formats:  
    - single integer ("age", "age ", "age.")
    - range of 2 integers (separators '-', '–', '/', and ' or ')
    - range of 2 integers with only unit value for second number ('age1/age2-2nd-digit')
    - age in days or months ('age days', 'age-months')
    - estimates ('c. age', 'c.age', 'age?', 'ages' (e.g. 80s), 'age+')
- There are some specific rows that need to be examined, with the following values for `info_1`:
    - 1995
    - 1996
    - 1997
    - German Olympic sailor [1]
    - Taiwanese failed assassin in the 3-19 shooting incident
    - 255
    - 176
    - the first wild bear in Germany in 170 years
    - c. 3500
    - common chimpanzee 55
    - Maltese 15
    - c.1000
    - Tree of the Year 150
- We will need to be strategic in the order in which we extract age from `info_1`.
- First, we will look at the atypical values listed above.
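
The formats listed above can be summarized as a small set of regular expressions. The following is a minimal sketch rather than one of the notebook's own helpers: the function name `classify_age_format` and the category labels are hypothetical, and the patterns cover only the formats named in the list.

```python
import re

# Hypothetical labels paired with patterns; order matters because the first match wins.
AGE_FORMATS = [
    ("days_or_months", re.compile(r"^\d{1,3}(?: days|-months| months)$")),
    ("short_range", re.compile(r"^\d{1,3}/\d$")),
    ("range", re.compile(r"^\d{1,3}(?:-|–|/| or )\d{1,3}$")),
    ("decade", re.compile(r"^(?:early )?\d{1,3}s$")),
    ("estimate", re.compile(r"^(?:c\. ?\d{1,3}|\d{1,3}[?+])$")),
    ("integer", re.compile(r"^\d{1,3}\.?\s*$")),
]


def classify_age_format(value):
    """Return the label of the first format that matches, else 'atypical'."""
    for label, regex in AGE_FORMATS:
        if regex.search(value):
            return label
    return "atypical"
```

Values such as '1995' or 'c. 3500' fall through to 'atypical' because every pattern allows at most three digits. A three-digit value such as '255' would still be labelled 'integer', so the manual review of atypical rows remains necessary.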

#### Examining Rows with Digits and Atypical Values for `info_1`

In [37]:
# List of atypical info_1 values for rows with digits
values_lst = [
    "1995",
    "1996",
    "1997",
    "German Olympic sailor [1]",
    "Taiwanese failed assassin in the 3-19 shooting incident",
    "255",
    "176",
    "the first wild bear in Germany in 170 years",
    "c. 3500",
    "common chimpanzee 55",
    "Maltese 15",
    "c.1000",
    "Tree of the Year 150",
]

df[df["info_1"].isin(values_lst)]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
6046,9,Khaptad Baba,", 1995, Nepalese spiritual saint.",https://en.wikipedia.org/wiki/Khaptad_Baba,6,1996,May,,1995,Nepalese spiritual saint,,,,,,,,,,
6985,5,Wymberley D. Coerr,", 1995, American politician and diplomat.",https://en.wikipedia.org/wiki/Wymberley_D._Coerr,3,1996,October,,1995,American politician and diplomat,,,,,,,,,,
8441,12,Moro Lorenzo,", 1996, Filipino basketball player and executive.",https://en.wikipedia.org/wiki/Moro_Lorenzo,5,1997,April,,1996,Filipino basketball player and executive,,,,,,,,,,
10548,21,Pakoda Kadhar,", 1997, Indian actor.",https://en.wikipedia.org/wiki/Pakoda_Kadhar,1,1998,January,,1997,Indian actor,,,,,,,,,,
22722,17,Klaus Oldendorff,", German Olympic sailor [1]",https://en.wikipedia.org/wiki/Klaus_Oldendorff,1,2003,March,,German Olympic sailor [1],,,,,,,,,,,
25010,29,Chen Yi-hsiung,", Taiwanese failed assassin in the 3-19 shooting incident.",https://en.wikipedia.org/wiki/Chen_Yi-hsiung,4,2004,March,,Taiwanese failed assassin in the 3-19 shooting incident,,,,,,,,,,,
29898,23,Adwaita,", 255 , tortoise claimant for world's oldest animal, reputedly a former pet of General Clive, liver failure.",https://en.wikipedia.org/wiki/Adwaita,4,2006,March,(approximate age),255,tortoise claimant for world's oldest animal,reputedly a former pet of General Clive,liver failure,,,,,,,,
30566,23,Harriet,", 176, Galápagos tortoise believed to be the third oldest animal in the world and allegedly owned by Charles Darwin, heart failure.",https://en.wikipedia.org/wiki/Harriet_(tortoise),10,2006,June,,176,Galápagos tortoise believed to be the third oldest animal in the world and allegedly owned by Charles Darwin,heart failure,,,,,,,,,
30591,26,Bear JJ1,", the first wild bear in Germany in 170 years, shot to death.",https://en.wikipedia.org/wiki/Bear_JJ1,1,2006,June,(Bruno the Bear),the first wild bear in Germany in 170 years,shot to death,,,,,,,,,,
52996,16,The Senator,", c. 3500, American pond cypress tree, largest in the world, fire.",https://en.wikipedia.org/wiki/The_Senator_(tree),11,2012,January,,c. 3500,American pond cypress tree,largest in the world,fire,,,,,,,,



#### Observations:
- Age data is either missing or the entry is for a member of a non-human species.
- We will drop all of these rows.

#### Dropping Rows for Non-Human Entries or Entries Missing Age Data

In [38]:
# List of indexes to be dropped
drop_rows = df[df["info_1"].isin(values_lst)].index.to_list()

# Dropping rows, resetting index, and checking new shape of df
df.drop(drop_rows, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132959, 20)


#### Observations:
- With those rows addressed, we will begin by extracting ages given in days or months and converting them to years.

### Extracting `age` from `info_1`

#### Step 1: Age in Days and Months

In [39]:
# Dictionary of patterns for days and months formats as keys and factor to convert to years
patterns = {
    r"(\d{1,3})( days)": 365,
    r"(\d{1,3})(-months)": 12,
    r"(\d{1,3})( months)": 12,
}

# List and number of rows matching patterns
rows_to_check = multiple_patterns(df, "info_1", patterns)

(\d{1,3})( days)
There are 1 rows with matching pattern in column 'info_1'.

(\d{1,3})(-months)
There are 1 rows with matching pattern in column 'info_1'.

(\d{1,3})( months)
There are 1 rows with matching pattern in column 'info_1'.




In [40]:
# For loop to extract age in days and months from info_1, convert to years, and save in age
for key, value in patterns.items():
    for i, item in enumerate(df["info_1"]):
        match = re.search(key, item)
        if match:
            age = int(match.group(1)) / value
            df.loc[i, "age"] = age
            df.loc[i, "info_1"] = re.sub(key, "", df.loc[i, "info_1"])

# Re-check number of rows matching patterns
rows_to_check = multiple_patterns(df, "info_1", patterns)

# Checking updated rows
df[df["age"].notna()]

(\d{1,3})( days)
There are 0 rows with matching pattern in column 'info_1'.

(\d{1,3})(-months)
There are 0 rows with matching pattern in column 'info_1'.

(\d{1,3})( months)
There are 0 rows with matching pattern in column 'info_1'.



Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
28491,12,Susan Anne Catherine Torres,", 40 days, American baby born to Susan Torres, brain-dead woman, heart failure after intestinal surgery.",https://en.wikipedia.org/wiki/Susan_Torres,3,2005,September,,,American baby born to Susan Torres,brain-dead woman,heart failure after intestinal surgery,,,,,,,,0.109589
30523,18,Chris and Cru Kahui,", 3-months, New Zealand child homicide victims.",https://en.wikipedia.org/wiki/Chris_and_Cru_Kahui,48,2006,June,,,New Zealand child homicide victims,,,,,,,,,,0.25
33076,28,David Turnbull,". 92, American materials scientist.",https://en.wikipedia.org/wiki/David_Turnbull_(materials_scientist),8,2007,April,,American materials scientist,,,,,,,,,,,92.0
59603,24,Kristján Jóhannsson,"83, Icelandic Olympic athlete.",https://en.wikipedia.org/wiki/Kristj%C3%A1n_J%C3%B3hannsson_(athlete),2,2013,January,,Icelandic Olympic athlete,,,,,,,,,,,83.0
90455,28,Charlie Gard,", 11 months, British infant, subject of life support and parental rights case, MDDS.",https://en.wikipedia.org/wiki/Charlie_Gard_case,125,2017,July,,,British infant,subject of life support and parental rights case,MDDS,,,,,,,,0.916667
117373,11,Gotthilf Fischer,"92, German choral conductor .",https://en.wikipedia.org/wiki/Gotthilf_Fischer,4,2020,December,(Fischer-Chöre),German choral conductor,,,,,,,,,,,92.0



#### Observations:
- We have successfully captured the age in days and months values and converted them to years.
- The other rows shown already had `age` populated by our earlier treatment of `info_0`.
- Next, we will address entries that contain two age values as a range, starting first with those that have a single digit as the second number.
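
The unit conversion above can also be written as a small standalone function, which makes the arithmetic easy to check in isolation. This is a sketch under stated assumptions: `to_years`, `UNITS_PER_YEAR`, and `UNIT_PATTERN` are hypothetical names, the single pattern accepts either a space or a hyphen before the unit (slightly broader than the three patterns used above), and a year is taken as 365 days, as in the conversion factor above.

```python
import re

# Units per year; 365 ignores leap years, matching the factor used in the notebook
UNITS_PER_YEAR = {"days": 365, "months": 12}

# A number, then a space or hyphen, then the unit
UNIT_PATTERN = re.compile(r"(\d{1,3})[ -](days|months)")


def to_years(text):
    """Return the age in years for 'N days', 'N months', or 'N-months', else None."""
    match = UNIT_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)) / UNITS_PER_YEAR[match.group(2)]
```

Applied to a column, something like `df["info_1"].map(to_years)` would return the converted value for matching rows and a missing value elsewhere.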

#### Step 2: Extracting `age` from `info_1` for Entries with Age Estimate Containing Age Range with 2 Values

#### Ranges with Single Digit as Upper-end

In [41]:
# Pattern for re
pattern = r"(\d{1,3})(/)(\d)\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df, "info_1", pattern)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

There are 12 rows with matching pattern in column 'info_1'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
18381,2,Sayed Khalifa,", 72/3, Sudanese singer.",https://en.wikipedia.org/wiki/Sayed_Khalifa,5,2001,July,,72/3,Sudanese singer,,,,,,,,,,
40905,17,Joji Banuve,", 68/9, Fijian politician, Minister for Local Government and the Environment, after short illness.",https://en.wikipedia.org/wiki/Joji_Banuve,0,2009,June,,68/9,Fijian politician,Minister for Local Government and the Environment,after short illness,,,,,,,,



In [42]:
# For loop to find rows with values and pattern and calculate and extract age to age column and remove age from info_1
for i in df[df["age"].isna()].index:
    item = df.loc[i, "info_1"]
    match = re.search(pattern, item)
    if match:
        age_1 = int(match.group(1))
        age_2 = int(match.group(3))
        units = ((age_1 % 10) + age_2) / 2
        tens = age_1 - (age_1 % 10)
        age = tens + units
        df.loc[i, "age"] = age
        df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])

# Checking example rows
pd.concat([df[df["name"] == "Sayed Khalifa"], df[df["name"] == "Joji Banuve"]])

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
18381,2,Sayed Khalifa,", 72/3, Sudanese singer.",https://en.wikipedia.org/wiki/Sayed_Khalifa,5,2001,July,,,Sudanese singer,,,,,,,,,,72.5
40905,17,Joji Banuve,", 68/9, Fijian politician, Minister for Local Government and the Environment, after short illness.",https://en.wikipedia.org/wiki/Joji_Banuve,0,2009,June,,,Fijian politician,Minister for Local Government and the Environment,after short illness,,,,,,,,68.5



#### Other Ranges with Two Values

In [43]:
# Pattern for re
pattern = r"(\d{1,3})(-|–|/| or )(\d{1,3})"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df, "info_1", pattern)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

There are 716 rows with matching pattern in column 'info_1'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
48555,18,Paulo de Tarso Alvim,", 91-92, Brazilian biologist.",https://en.wikipedia.org/wiki/Paulo_de_Tarso_Alvim,2,2011,February,,91-92,Brazilian biologist,,,,,,,,,,
122470,9,Ahmed Al Khattab,", 78–79, Jordanian politician, minister of agriculture .",https://en.wikipedia.org/wiki/Ahmed_Al_Khattab,15,2021,May,(2011–2013) and MP (1997–2001),78–79,Jordanian politician,minister of agriculture,,,,,,,,,



In [44]:
# For loop to find rows with values and pattern and calculate and extract age to age column and remove age from info_1
for i in df[df["age"].isna()].index:
    item = df.loc[i, "info_1"]
    match = re.search(pattern, item)
    if match:
        age = (int(match.group(1)) + int(match.group(3))) / 2
        df.loc[i, "age"] = age
        df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])

# Checking example rows
pd.concat(
    [df[df["name"] == "Paulo de Tarso Alvim"], df[df["name"] == "Ahmed Al Khattab"]]
)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
48555,18,Paulo de Tarso Alvim,", 91-92, Brazilian biologist.",https://en.wikipedia.org/wiki/Paulo_de_Tarso_Alvim,2,2011,February,,,Brazilian biologist,,,,,,,,,,91.5
122470,9,Ahmed Al Khattab,", 78–79, Jordanian politician, minister of agriculture .",https://en.wikipedia.org/wiki/Ahmed_Al_Khattab,15,2021,May,(2011–2013) and MP (1997–2001),,Jordanian politician,minister of agriculture,,,,,,,,,78.5
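
The two range-handling loops above both reduce to taking the midpoint of a lower and an upper age. The sketch below folds them into one hypothetical helper, `range_midpoint`. It also handles a possible edge case in the abbreviated form: `((age_1 % 10) + age_2) / 2` would give 74.5 for a value like '79/0' if that were meant as 79 to 80. Whether such a value occurs in this dataset is not shown, so the wraparound rule is an assumption, as is expanding a single-digit upper bound for every separator rather than only '/'.

```python
import re

# Same separators as the notebook's range pattern
RANGE_PATTERN = re.compile(r"(\d{1,3})(?:-|–|/| or )(\d{1,3})")


def range_midpoint(text):
    """Return the midpoint of an age range found in text, else None."""
    match = RANGE_PATTERN.search(text)
    if match is None:
        return None
    low = int(match.group(1))
    high_text = match.group(2)
    high = int(high_text)
    if len(high_text) == 1 and low >= 10:
        # Abbreviated upper bound: borrow the leading digits from the lower bound
        high = low - (low % 10) + high
        if high < low:
            # Assumption: '79/0' is read as 79 to 80
            high += 10
    return (low + high) / 2
```

For the example rows above this reproduces the notebook's results: 72.5 for '72/3', 68.5 for '68/9', 91.5 for '91-92', and 78.5 for '78–79'.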



#### Observations:
- Next, we will extract `age` from the entries with straightforward single-integer age values, in the formats 'age', 'age ', and 'age.'.
- Vaguer estimates are excluded here to allow closer examination, as they are more likely to be atypical entries.

#### Step 3: Age as Single Integer (Excluding Estimates)

In [45]:
# List of patterns for age formats with single integer for age
patterns = [r"^(\d{1,3})$", r"^(\d{1,3})\s", r"^(\d{1,3})\.\s"]

# List and number of rows matching patterns
rows_to_check = multiple_patterns(df, "info_1", patterns)

^(\d{1,3})$
There are 131912 rows with matching pattern in column 'info_1'.

^(\d{1,3})\s
There are 11 rows with matching pattern in column 'info_1'.

^(\d{1,3})\.\s
There are 8 rows with matching pattern in column 'info_1'.




In [46]:
# For loop to check age pattern in info_1, save age to age column, and remove age from info_1
for i, item in enumerate(df["info_1"]):
    for pattern in patterns:
        match = re.search(pattern, item)
        if match:
            age = int(match.group(1))
            df.loc[i, "age"] = age
            df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])

# Re-checking number of rows matching patterns
rows_to_check = multiple_patterns(df, "info_1", patterns)

# Checking first 2 rows
df.head(2)

^(\d{1,3})$
There are 0 rows with matching pattern in column 'info_1'.

^(\d{1,3})\s
There are 0 rows with matching pattern in column 'info_1'.

^(\d{1,3})\.\s
There are 0 rows with matching pattern in column 'info_1'.



Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0



In [47]:
# Checking last 2 rows
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
132957,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(1980),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0
132958,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0



In [48]:
# Checking the number of remaining missing values for `age`
print(f'There are {df["age"].isna().sum()} remaining missing values for age.')

There are 294 remaining missing values for age.



#### Observations:
- The rows with single integer age data have been addressed.
- There are only 294 remaining missing values for `age` after extracting age in days, months, single integer years, and 2-integer year range values.
- Let us check the rows containing 'c.', '+', or '?' in the age information. We will handle the estimates ending in 's', such as '80s', separately.

#### Entries with Age Data Containing 'c.', '+', or '?' for Estimate

In [49]:
# Pattern for re
pattern = r"(c\.|\+|\?)"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df, "info_1", pattern)

# Inspecting rows containing values
df.loc[rows_to_check, :]

There are 61 rows with matching pattern in column 'info_1'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
12921,22,Graciela Quan,", c. 79, Guatemalan lawyer and women's rights activist.",https://en.wikipedia.org/wiki/Graciela_Quan,11,1999,January,,c. 79,Guatemalan lawyer and women's rights activist,,,,,,,,,,
25365,4,Tesfaye Gebre Kidan,", c. 69, Ethiopian general, defense minister and President of Ethiopia.",https://en.wikipedia.org/wiki/Tesfaye_Gebre_Kidan,10,2004,June,,c. 69,Ethiopian general,defense minister and President of Ethiopia,,,,,,,,,
25445,18,Paul Johnson,", c. 49, American hostage, decapitated by al-Qaeda.",https://en.wikipedia.org/wiki/Paul_Marshall_Johnson_Jr.,9,2004,June,,c. 49,American hostage,decapitated by al-Qaeda,,,,,,,,,
25446,18,Nek Mohammed,", c. 27, Pakistani tribal leader in Waziristan and key Taliban ally, killed by Pakistani military forces.",https://en.wikipedia.org/wiki/Nek_Muhammad_Wazir,10,2004,June,,c. 27,Pakistani tribal leader in Waziristan and key Taliban ally,killed by Pakistani military forces,,,,,,,,,
26726,13,Earl Cameron,", 89?, Canadian broadcaster and anchor .",https://en.wikipedia.org/wiki/Earl_Cameron_(broadcaster),3,2005,January,(1959–1966),89?,Canadian broadcaster and anchor,,,,,,,,,,
29743,28,Peter Snow,", c. 70, New Zealand doctor who discovered ""Tapanui flu"" .",https://en.wikipedia.org/wiki/Peter_Snow_(doctor),5,2006,February,(chronic fatigue syndrome),c. 70,"New Zealand doctor who discovered ""Tapanui flu""",,,,,,,,,,
29839,15,Humphrey,", c. 17, British Chief Mouser to the Cabinet Office, .",https://en.wikipedia.org/wiki/Humphrey_(cat),25,2006,March,(1989–1997),c. 17,British Chief Mouser to the Cabinet Office,,,,,,,,,,
30258,14,Paul Marco,", c. 81, American film actor .",https://en.wikipedia.org/wiki/Paul_Marco,1,2006,May,(),c. 81,American film actor,,,,,,,,,,
31173,6,Mohammed Taha Mohammed Ahmed,", c.50, Sudanese newspaper editor, beheaded.",https://en.wikipedia.org/wiki/Mohammed_Taha_Mohammed_Ahmed,0,2006,September,,c.50,Sudanese newspaper editor,beheaded,,,,,,,,,
31455,11,Benito Martínez,", 126?, Cuban claimant to the title of world's oldest person.",https://en.wikipedia.org/wiki/Benito_Mart%C3%ADnez,4,2006,October,,126?,Cuban claimant to the title of world's oldest person,,,,,,,,,,



#### Observations:
- Most of the entries are for people, but there are also one or more entries for the following:
    - carp
    - racehorse
    - chimpanzee
    - flamingo
    - cat
    - turkey
- We will check `info_2` for these species and drop the rows that represent non-human entries.

#### Checking for Cat, Racehorse, Chimpanzee, Carp, Flamingo, and Turkey in `info_2`

In [50]:
# pattern for re
pattern = r"\b(cat|racehorse|chimpanzee|carp|flamingo|turkey)\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df[df["info_2"].notna()], "info_2", pattern)

There are 468 rows with matching pattern in column 'info_2'.



#### Observations:
- There are sufficient rows to warrant checking species by species.

#### Cat Entries per `info_2`

In [51]:
# Pattern for re
pattern = r"\b(cat)\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df[df["info_2"].notna()], "info_2", pattern)

# Inspecting rows with pattern
df.loc[rows_to_check, :]

There are 19 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
18610,4,Lorenzo Music,", 64, American voice actor known for the voice of the cartoon cat Garfield, complications related to lung and bone cancer.",https://en.wikipedia.org/wiki/Lorenzo_Music,6,2001,August,,,American voice actor known for the voice of the cartoon cat Garfield,complications related to lung and bone cancer,,,,,,,,,64.0
31855,29,Dewey Readmore Books,", 19, Library cat, euthanized",https://en.wikipedia.org/wiki/Dewey_Readmore_Books,26,2006,November,,,Library cat,euthanized,,,,,,,,,19.0
38127,11,Scarlett,", 13, American stray cat, name source of the Scarlett Award for Animal Heroism, animal euthanasia.",https://en.wikipedia.org/wiki/Scarlett_(cat),6,2008,October,,,American stray cat,name source of the Scarlett Award for Animal Heroism,animal euthanasia,,,,,,,,13.0
39070,4,India,", 18, American pet cat of George W. Bush.",https://en.wikipedia.org/wiki/India_(cat),7,2009,January,,,American pet cat of George W. Bush,,,,,,,,,,18.0
39688,20,Socks,", 19, American Presidential cat of the Clinton family, euthanized.",https://en.wikipedia.org/wiki/Socks_(cat),21,2009,February,,,American Presidential cat of the Clinton family,euthanized,,,,,,,,,19.0
41343,27,Sybil,", 3, British Downing Street cat, Chief Mouser to the Cabinet Office , after short illness.",https://en.wikipedia.org/wiki/Sybil_(cat),5,2009,July,(2007–2008),,British Downing Street cat,Chief Mouser to the Cabinet Office,after short illness,,,,,,,,3.0
47436,21,Prince Chunk,", 10, American obese cat, heart disease.",https://en.wikipedia.org/wiki/Prince_Chunk,8,2010,November,,,American obese cat,heart disease,,,,,,,,,10.0
54951,5,Meow,", c. 2, American cat, heaviest cat at his time of death, lung failure.",https://en.wikipedia.org/wiki/Meow_(cat),12,2012,May,,c. 2,American cat,heaviest cat at his time of death,lung failure,,,,,,,,
59824,4,Stewie,", 7–8, world's longest domestic cat, cancer.",https://en.wikipedia.org/wiki/Stewie_(cat),6,2013,February,,,world's longest domestic cat,cancer,,,,,,,,,7.5
63271,7,Buurtpoes Bledder,", 1–2, domestic cat in Netherlands, motor accident.",https://en.wikipedia.org/wiki/Buurtpoes_Bledder,12,2013,August,,,domestic cat in Netherlands,motor accident,,,,,,,,,1.5



#### Observations:
- One of these rows represents a person (a voice actor whose description mentions a cartoon cat).
- We can proceed to drop the others, matching `info_2` values that end in "cat" or contain "cat of" or "cat in".
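
To see why anchoring the pattern matters here, the short check below applies the two patterns used in the next cell to a few `info_2` values copied from the output above. The helper name `matches_any` is hypothetical and is only a stand-in for illustration.

```python
import re

# The two patterns used in the next cell
patterns = [r"\bcat$", r"\b(cat of|cat in)\b"]

# info_2 values copied from rows shown above
samples = [
    "American voice actor known for the voice of the cartoon cat Garfield",
    "Library cat",
    "American pet cat of George W. Bush",
    "domestic cat in Netherlands",
    "American obese cat",
]


def matches_any(text, patterns):
    """Return True if any of the patterns is found in text."""
    return any(re.search(pattern, text) for pattern in patterns)


# Keep only the values that would be flagged for dropping
flagged = [text for text in samples if matches_any(text, patterns)]
```

The voice actor's description is not flagged, because "cat" there is followed by "Garfield" rather than ending the value or preceding "of" or "in".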

#### Dropping Entries for Cats per `info_2`

In [52]:
# List of re patterns to find
patterns = [r"\bcat$", r"\b(cat of|cat in)\b"]

# List and number of rows matching pattern
rows_to_drop = multiple_patterns(df[df["info_2"].notna()], "info_2", patterns)

\bcat$
There are 15 rows with matching pattern in column 'info_2'.

\b(cat of|cat in)\b
There are 3 rows with matching pattern in column 'info_2'.




In [53]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132941, 20)


#### Racehorse Entries per `info_2`

In [54]:
# Pattern for re
pattern = r"\bracehorse\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df[df["info_2"].notna()], "info_2", pattern)

# Inspecting rows with pattern
df.loc[rows_to_check, :].sample(10)

There are 441 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
76381,29,Kauto Star,", 15, French-born British-trained racehorse, dual winner of the Cheltenham Gold Cup, euthanized.",https://en.wikipedia.org/wiki/Kauto_Star,48,2015,June,,,French-born British-trained racehorse,dual winner of the Cheltenham Gold Cup,euthanized,,,,,,,,15.0
64423,17,Take Control,", 6, American Thoroughbred racehorse, euthanized.",https://en.wikipedia.org/wiki/Take_Control_(horse),1,2013,October,,,American Thoroughbred racehorse,euthanized,,,,,,,,,6.0
74631,25,Smart Strike,", 23, Canadian Thoroughbred racehorse, euthanized.",https://en.wikipedia.org/wiki/Smart_Strike,4,2015,March,,,Canadian Thoroughbred racehorse,euthanized,,,,,,,,,23.0
85441,13,Denys Smith,", 92, British racehorse trainer.",https://en.wikipedia.org/wiki/Denys_Smith,3,2016,November,,,British racehorse trainer,,,,,,,,,,92.0
34903,6,Harry Thomson Jones,", 82, British racehorse trainer.",https://en.wikipedia.org/wiki/Harry_Thomson_Jones,2,2007,December,,,British racehorse trainer,,,,,,,,,,82.0
124315,17,Powerscourt,", 21, British-bred Irish Thoroughbred racehorse, heart attack.",https://en.wikipedia.org/wiki/Powerscourt_(horse),16,2021,July,(death announced on this date),,British-bred Irish Thoroughbred racehorse,heart attack,,,,,,,,,21.0
40056,27,Alysheba,", 25, American racehorse, Kentucky Derby and Preakness winner , euthanized.",https://en.wikipedia.org/wiki/Alysheba,7,2009,March,(1987),,American racehorse,Kentucky Derby and Preakness winner,euthanized,,,,,,,,25.0
33262,25,Arwon,", 33, New Zealand-born racehorse, longest surviving Melbourne Cup winner, euthanasia.",https://en.wikipedia.org/wiki/Arwon,4,2007,May,,,New Zealand-born racehorse,longest surviving Melbourne Cup winner,euthanasia,,,,,,,,33.0
38799,13,Christmas Past,", 29, American Thoroughbred racehorse, infirmities of old age.",https://en.wikipedia.org/wiki/Christmas_Past,5,2008,December,,,American Thoroughbred racehorse,infirmities of old age,,,,,,,,,29.0
76038,10,Bonecrusher,", 32, New Zealand Thoroughbred racehorse, euthanized following laminitis.",https://en.wikipedia.org/wiki/Bonecrusher_(horse),8,2015,June,,,New Zealand Thoroughbred racehorse,euthanized following laminitis,,,,,,,,,32.0



#### Observations:
- Several entries are for people involved in the racehorse business, such as trainers.
- Rows whose `info_2` value ends in 'racehorse' or 'racehorse and sire' can be dropped.
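
The two endings can also be expressed as one pattern with an optional group. This is a sketch only: the first two sample values are copied from the output above, while the third is a made-up value ending in 'racehorse and sire', since no such row appears in the sample shown.

```python
import re

# One end-anchored pattern covering both endings
racehorse_pattern = re.compile(r"\bracehorse(?: and sire)?$")

samples = [
    "British racehorse trainer",  # person, from the output above
    "American Thoroughbred racehorse",  # horse, from the output above
    "American Thoroughbred racehorse and sire",  # hypothetical example value
]

# True where the value would be flagged as a racehorse entry
is_racehorse = [bool(racehorse_pattern.search(text)) for text in samples]
```

The trainer is not flagged because 'racehorse' does not end the value.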

#### Dropping Entries for Racehorses per `info_2`

In [55]:
# List of re patterns to find
patterns = [r"\bracehorse$", r"\b(racehorse and sire)$"]

# List and number of rows matching pattern
rows_to_drop = multiple_patterns(df[df["info_2"].notna()], "info_2", patterns)

\bracehorse$
There are 319 rows with matching pattern in column 'info_2'.

\b(racehorse and sire)$
There are 30 rows with matching pattern in column 'info_2'.




In [56]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132592, 20)


#### Chimpanzee, Flamingo, Carp, and Turkey Entries per `info_2`

In [57]:
# Defining pattern for re
pattern = r"\b(chimpanzee|flamingo|carp|turkey)\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df[df["info_2"].notna()], "info_2", pattern)

# Inspecting rows with pattern
df.loc[rows_to_check, :]

There are 8 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
34556,30,Washoe,", c.42, American-trained African-born chimpanzee believed to be first non-human to acquire human language, influenza.",https://en.wikipedia.org/wiki/Washoe_(chimpanzee),42,2007,October,,c.42,American-trained African-born chimpanzee believed to be first non-human to acquire human language,influenza,,,,,,,,,
39598,16,Travis,", 13, American-born chimpanzee, television commercial animal, shot.",https://en.wikipedia.org/wiki/Travis_(chimpanzee),69,2009,February,,,American-born chimpanzee,television commercial animal,shot,,,,,,,,13.0
41364,4,Benson,", c. 25, British common carp, voted as Britain's Favourite Carp .",https://en.wikipedia.org/wiki/Benson_(fish),9,2009,August,(death announced on this date),c. 25,British common carp,voted as Britain's Favourite Carp,,,,,,,,,
45230,1,Heather the Leather,", 50, British scaleless carp, old age.",https://en.wikipedia.org/wiki/Heather_the_Leather,5,2010,June,,,British scaleless carp,old age,,,,,,,,,50.0
66268,30,Greater,", c. 83, Australian greater flamingo, world's oldest flamingo, euthanized.",https://en.wikipedia.org/wiki/Greater_(flamingo),12,2014,January,,c. 83,Australian greater flamingo,world's oldest flamingo,euthanized,,,,,,,,
70952,26,Zelda,", 11+, American wild turkey, resident of New York City's Battery Park, traffic collision.",https://en.wikipedia.org/wiki/Zelda_(turkey),13,2014,September,(body found on this date),11+,American wild turkey,resident of New York City's Battery Park,traffic collision,,,,,,,,
76075,22,Don Featherstone,", 79, American artist and inventor of the plastic pink flamingo, Lewy body dementia.",https://en.wikipedia.org/wiki/Don_Featherstone_(artist),10,2015,June,,,American artist and inventor of the plastic pink flamingo,Lewy body dementia,,,,,,,,,79.0
92090,14,Little Mama,", c.79, African-born chimpanzee , oldest on record, kidney failure.",https://en.wikipedia.org/wiki/Little_Mama,3,2017,November,(Lion Country Safari),c.79,African-born chimpanzee,oldest on record,kidney failure,,,,,,,,



#### Observations:
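- All of these rows are for animals except one: Don Featherstone (index 76075) is a person, the inventor of the plastic pink flamingo.
- Dropping every row in `rows_to_check`, as the next cell does, removes his entry along with the animal entries.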
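
One row in the output above, Don Featherstone, describes a person whose `info_2` value happens to contain the word "flamingo". A possible way to keep such rows is to filter the matched indexes before dropping. The sketch below uses a plain dictionary in place of the dataframe, and the exclusion pattern `person_pattern` is a hypothetical choice that works for these rows only.

```python
import re

# Index -> info_2 text, copied from four of the rows shown above
matched_rows = {
    39598: "American-born chimpanzee",
    45230: "British scaleless carp",
    66268: "Australian greater flamingo",
    76075: "American artist and inventor of the plastic pink flamingo",
}

# Hypothetical pattern for descriptions that indicate a person
person_pattern = re.compile(r"\b(artist|inventor)\b")

# Keep only the indexes whose description does not look like a person
rows_to_drop = [
    index for index, text in matched_rows.items() if not person_pattern.search(text)
]
```

Passing a filtered list like `rows_to_drop` to `df.drop` instead of `rows_to_check` would leave the row at index 76075 in place.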

#### Dropping Entries for Chimpanzees, Flamingos, Carps, and Turkeys per `info_2`

In [58]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_check, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132584, 20)


#### Observations:
- With those non-human entries addressed, we can return to processing `info_1`.
- Let us address the remaining entries with '?', '+', or 'c.', accepting the estimated age as `age`.
- We will treat the age estimates ending in 's' similarly, but separately.

#### Extracting `age` from `info_1` for Entries with Age Estimate Containing '?', '+', or 'c.'

In [59]:
# Pattern for re
pattern = r"(c\.|\+|\?)"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df, "info_1", pattern)

# Checking a sample of rows
df.loc[rows_to_check, :].sample(2)

There are 53 rows with matching pattern in column 'info_1'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
58485,9,Ataa Oko,", c. 93, Ghanaian fantasy coffin artist.",https://en.wikipedia.org/wiki/Ataa_Oko,8,2012,December,,c. 93,Ghanaian fantasy coffin artist,,,,,,,,,,
75367,13,Bob Randall,", c. 81, Australian Indigenous musician and author.",https://en.wikipedia.org/wiki/Bob_Randall_(Aboriginal_Australian_elder),7,2015,May,,c. 81,Australian Indigenous musician and author,,,,,,,,,,



In [60]:
# List to identify rows
values = ["c.", "+", "?"]

# Pattern for re
pattern = r"\b(\d{1,3})\b"

# For loop to find rows with values and pattern and extract age to age column and remove age from info_1
for i in df[df["age"].isna()].index:
    item = df.loc[i, "info_1"]

    if any(value in item for value in values):
        match = re.search(pattern, item)

        if match:
            age = int(match.group(1))
            df.loc[i, "age"] = age
            df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])

        for value in values:
            df.loc[i, "info_1"] = df.loc[i, "info_1"].replace(value, "")


# Re-checking number of rows matching pattern
rows_to_check = rows_with_pattern(df, "info_1", pattern)

# Checking example rows
pd.concat([df[df["name"] == "Ataa Oko"], df[df["name"] == "Bob Randall"]])

There are 0 rows with matching pattern in column 'info_1'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
58485,9,Ataa Oko,", c. 93, Ghanaian fantasy coffin artist.",https://en.wikipedia.org/wiki/Ataa_Oko,8,2012,December,,,Ghanaian fantasy coffin artist,,,,,,,,,,93.0
2677,11,Bob Randall,", 57, American screenwriter, playwright, novelist, and television producer.",https://en.wikipedia.org/wiki/Bob_Randall_(writer),7,1995,February,,,American screenwriter,playwright,novelist,and television producer,,,,,,,57.0
75367,13,Bob Randall,", c. 81, Australian Indigenous musician and author.",https://en.wikipedia.org/wiki/Bob_Randall_(Aboriginal_Australian_elder),7,2015,May,,,Australian Indigenous musician and author,,,,,,,,,,81.0



#### Observations:
- Here, the name filter returns two different people named Bob Randall; the extraction of `age` from `info_1` was successful for the estimated entries ('c. 93' and 'c. 81').
- The ages ending in 's' should be the only ones remaining in the `info_1` column.
- We will examine those now.

#### Extracting `age` from `info_1` for Entries with Age Estimate ending in 's'

In [61]:
# Defining pattern for re
pattern = r"\b(\d{1,3})s\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df, "info_1", pattern)

# Inspecting rows with pattern
df.loc[rows_to_check, :]

There are 15 rows with matching pattern in column 'info_1'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
27469,22,Mary Dann,", early 80s, American Indian activist.",https://en.wikipedia.org/wiki/Mary_Dann_and_Carrie_Dann,12,2005,April,,early 80s,American Indian activist,,,,,,,,,,
28056,14,Jacques Roche,", early 40s, Haitian journalist.",https://en.wikipedia.org/wiki/Jacques_Roche,25,2005,July,,early 40s,Haitian journalist,,,,,,,,,,
33053,27,Magda Gerber,", 90s, Hungarian-born American educator.",https://en.wikipedia.org/wiki/Magda_Gerber,7,2007,April,,90s,Hungarian-born American educator,,,,,,,,,,
41695,2,Abdullah Laghmani,", 40s, Afghan Secret Service chief, bomb blast.",https://en.wikipedia.org/wiki/Abdullah_Laghmani,4,2009,September,,40s,Afghan Secret Service chief,bomb blast,,,,,,,,,
47970,9,Makinti Napanangka,", 80s, Australian Papunya Tula artist.",https://en.wikipedia.org/wiki/Makinti_Napanangka,57,2011,January,,80s,Australian Papunya Tula artist,,,,,,,,,,
61930,1,Dorothy Napangardi,", 60s, Australian indigenous artist, traffic collision.",https://en.wikipedia.org/wiki/Dorothy_Napangardi,19,2013,June,,60s,Australian indigenous artist,traffic collision,,,,,,,,,
62092,10,Timothy Apiyo,", 70s, Tanzanian politician and civil servant.",https://en.wikipedia.org/wiki/Timothy_Apiyo,2,2013,June,,70s,Tanzanian politician and civil servant,,,,,,,,,,
71567,30,Mohamed Sheikh Ismail,", 50s, Somali military commander, Chief of the Police Force .",https://en.wikipedia.org/wiki/Mohamed_Sheikh_Ismail,8,2014,October,(2014),50s,Somali military commander,Chief of the Police Force,,,,,,,,,
84772,20,Achieng Abura,", 50s, Kenyan musician.",https://en.wikipedia.org/wiki/Achieng_Abura,13,2016,October,,50s,Kenyan musician,,,,,,,,,,
85973,27,Abu Jandal al-Kuwaiti,", 30s, Kuwaiti ISIL commander.",https://en.wikipedia.org/wiki/Abu_Jandal_al-Kuwaiti,16,2016,December,,30s,Kuwaiti ISIL commander,,,,,,,,,,



#### Observations:
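- All of these entries are for people, so we can extract `age` from `info_1` and set it to the midpoint of the stated decade (e.g., "80s" becomes 85).
- The "early" qualifier is dropped rather than used to adjust the estimate, so "early 80s" is also recorded as 85.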
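
As a small worked example of the midpoint rule, the sketch below applies the same regular expression to a few strings taken from the rows above. The helper name `decade_midpoint` is hypothetical and is not part of this notebook.

```python
import re

# Same pattern as in the notebook: 1 to 3 digits followed by "s"
DECADE_PATTERN = r"\b(\d{1,3})s\b"


def decade_midpoint(text):
    """Return the midpoint of a decade estimate such as '80s', or None if absent.

    Hypothetical helper for illustration only.
    """
    match = re.search(DECADE_PATTERN, text)
    if match is None:
        return None
    # "80s" covers 80-89, so 85 is used as the point estimate
    return int(match.group(1)) + 5


examples = ["early 80s", "40s", "Haitian journalist"]
midpoints = [decade_midpoint(example) for example in examples]
print(midpoints)  # [85, 45, None]
```

Adding 5 treats every decade estimate the same way, which keeps the rule simple at the cost of ignoring qualifiers such as "early".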

In [62]:
# For loop to find rows matching pattern, set age to the decade midpoint, and remove the estimate from info_1
for i in df[df["age"].isna()].index:
    item = df.loc[i, "info_1"]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[i, "age"] = age + 5
        df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])
    if "early " in item:
        df.loc[i, "info_1"] = df.loc[i, "info_1"].replace("early ", "")

# Re-checking number of rows matching pattern
rows_to_check = rows_with_pattern(df, "info_1", pattern)

# Checking example rows
pd.concat([df[df["name"] == "Mary Dann"], df[df["name"] == "Timothy Apiyo"]])

There are 0 rows with matching pattern in column 'info_1'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
27469,22,Mary Dann,", early 80s, American Indian activist.",https://en.wikipedia.org/wiki/Mary_Dann_and_Carrie_Dann,12,2005,April,,,American Indian activist,,,,,,,,,,85.0
62092,10,Timothy Apiyo,", 70s, Tanzanian politician and civil servant.",https://en.wikipedia.org/wiki/Timothy_Apiyo,2,2013,June,,,Tanzanian politician and civil servant,,,,,,,,,,75.0


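
With only a handful of matching rows, the loop above is perfectly adequate. For comparison, the same update could be written without an explicit loop using pandas string methods. This is only a sketch on a toy dataframe (the name `toy` and its contents are made up for illustration), not the approach used in this notebook.

```python
import pandas as pd

pattern = r"\b(\d{1,3})s\b"

# Toy stand-in for df with just the two relevant columns (illustrative data)
toy = pd.DataFrame(
    {
        "info_1": ["early 80s", "70s", ""],
        "age": [float("nan"), float("nan"), 57.0],
    }
)

# Extract the decade as floats; rows without a match become NaN
decade = toy["info_1"].str.extract(pattern, expand=False).astype(float)

# Only update rows that are missing age and have a decade estimate
mask = toy["age"].isna() & decade.notna()
toy.loc[mask, "age"] = decade[mask] + 5

# Remove the estimate, and an optional "early " prefix, from info_1
toy.loc[mask, "info_1"] = (
    toy.loc[mask, "info_1"]
    .str.replace(r"(?:early )?" + pattern, "", regex=True)
    .str.strip()
)

print(toy)
```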

### Checking for Any Missed Digits in `info_1` and for Remaining Missing Values for `age`

In [63]:
# Pattern for re
pattern = r"\d"

# Re-checking number of rows matching pattern
rows_to_check = rows_with_pattern(df, "info_1", pattern)

# Checking number of missing values for age
print(f'\nThere are {df["age"].isna().sum()} remaining missing values for age.')

There are 0 rows with matching pattern in column 'info_1'.

There are 218 remaining missing values for age.



#### Observations:
- All of the age data that had been in `info_1` has been successfully extracted.
- There are 218 remaining missing values for `age` that we hope to find in the other info columns.
- We will include the remaining values in `info_1` when we extract citizenship and role information.
- It is time to export the current dataframe to a SQLite database.
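
The two checks behind these observations (no digits left in `info_1`, and the count of missing `age` values) can also be expressed directly with pandas. The sketch below runs them on a made-up toy dataframe; it does not use the notebook's `rows_with_pattern` helper, and the variable names are illustrative.

```python
import pandas as pd

# Toy data for illustration: one row still contains a digit in info_1
toy = pd.DataFrame(
    {
        "info_1": ["", "American activist", "c. 93"],
        "age": [85.0, None, None],
    }
)

# Rows whose info_1 still contains any digit (missing values count as no match)
has_digit = toy["info_1"].str.contains(r"\d", regex=True, na=False)
n_digit_rows = int(has_digit.sum())

# Rows still missing an age
n_missing_age = int(toy["age"].isna().sum())

print(f"There are {n_digit_rows} rows with a digit in column 'info_1'.")
print(f"There are {n_missing_age} remaining missing values for age.")
```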

### Exporting Dataset to SQLite Database [wp_life_expect_clean1.db](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean1.db)

In [64]:
# Saving the dataset cleaned so far in a SQLite database
conn = sql.connect("wp_life_expect_clean1.db")
df.to_sql("wp_life_expect_clean1", conn, index=False)

132584

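
One way to confirm an export like this is to read the table back and compare row counts. The sketch below does so with an in-memory database and a two-row toy dataframe, so it does not touch `wp_life_expect_clean1.db`. The table name `toy_clean` is hypothetical, and `if_exists="replace"` is an assumption that makes the cell safe to re-run; the export above uses the default, which raises an error if the table already exists.

```python
import sqlite3 as sql

import pandas as pd

# Toy dataframe standing in for the cleaned data
toy = pd.DataFrame({"name": ["Mary Dann", "Timothy Apiyo"], "age": [85.0, 75.0]})

# In-memory database so that no file is created
conn = sql.connect(":memory:")
try:
    # Replace the table if it already exists, so the cell can be re-run
    toy.to_sql("toy_clean", conn, index=False, if_exists="replace")

    # Read the table back to verify the round trip
    restored = pd.read_sql("SELECT * FROM toy_clean", conn)
finally:
    conn.close()

rows_match = len(restored) == len(toy)
print(f"Round trip preserved row count: {rows_match}")
```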

# [Proceed to Notebook 3 of 4: Data Cleaning Part 2](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)