# Wikipedia Notable Life Expectancies

# [Notebook 2 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean_thanak_2022_06_13.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the wp_life_expect_raw_complete dataset
conn = sql.connect("wp_life_expect_raw_complete.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_raw_complete", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 133900 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
133898,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
133899,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
12519,November 1998,6,Niklas Luhmann,", 70, German sociologist and philosopher of social science.",https://en.wikipedia.org/wiki/Niklas_Luhmann,28
33278,April 2007,3,Thomas Hal Phillips,", 84, American novelist and screenwriter.",https://en.wikipedia.org/wiki/Thomas_Hal_Phillips,12
69173,May 2014,12,Ruben T. Profugo,", 76, Filipino Roman Catholic prelate, Bishop of Lucena (1982–2003).",https://en.wikipedia.org/wiki/Ruben_T._Profugo,1
120771,February 2021,19,Joseph Kesenge Wandangakongu,", 92, Congolese Roman Catholic prelate, bishop of Molegbe (1968–1997).",https://en.wikipedia.org/wiki/Joseph_Kesenge_Wandangakongu,1
124921,July 2021,5,Masood Ashar,", 91, Pakistani writer.",https://en.wikipedia.org/wiki/Masood_Ashar,15



#### Observations:
- There are 133,900 rows and 6 columns.
- `month_year` contains the month and year of death, while `day` contains the day of the month of death.
- `name` is the notable person's name.  It is a nominal feature that will not be used for analysis, but will be maintained for any referencing needs.
- `info` contains multiple items including the notable person's "age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, (and) cause of death (if known)."
- `link` is the URL of the notable person's individual Wikipedia page.  If no such page exists, the value is either a non-working link (https://en.wikipedia.orgNone) or a link to a page stating that no article exists for that individual.  `link` is a unique identifier for all entries except the 6 with the non-working link, whose names are nonetheless distinct from one another.
- `num_references` contains the number of references on the notable person's individual Wikipedia page.  This feature serves as a proxy measure of notability.
- Prior to EDA, our task will be to extract the individual elements that are combined in the `month_year` and `info` columns.

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133894 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



In [6]:
# Checking duplicate rows
df.duplicated().sum()

0


In [7]:
# Check percentage of null values by column (null count divided by total row count)
df.isnull().mean() * 100

month_year        0.000000
day               0.000000
name              0.004481
info              0.000000
link              0.000000
num_references    0.000000
dtype: float64


In [8]:
# Checking number of missing values per row (not necessary here, but done to keep process standard)
df.isnull().sum(axis=1).value_counts()

0    133894
1         6
dtype: int64


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- As expected, we have 6 entries that are missing `name`, but we will find it in their `info` values.
- All columns are currently of object type.  We will need to appropriately typecast them after separating the information in `month_year` and `info`.

## Data Cleaning

### Addressing Missing `name` Values

In [9]:
# Checking rows with missing name values
missing_name = df[df["name"].isna()]
missing_name

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- These rows vary from the main set as there is a substring containing the person's name at the start of the `info` string.
- As there are so few rows missing `name`, let us address this issue first.
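
A minimal sketch of the split described above, shown on a plain string rather than the dataframe. The helper name `split_name_from_info` is hypothetical, and it assumes the name is everything before the first comma in `info`.

```python
def split_name_from_info(info):
    """Return (name, remainder); the remainder keeps its leading comma."""
    # partition splits on the first comma only
    name, sep, rest = info.partition(",")
    return name.strip(), (sep + rest).strip()


name, remainder = split_name_from_info("Vincent Palmer, 37, British criminal.")
print(name)       # Vincent Palmer
print(remainder)  # , 37, British criminal.
```

Keeping the leading comma in the remainder leaves the treated rows in the same shape as the rest of the `info` column.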

In [10]:
# For loop to copy name value from info value and remove name from info value
treat_rows = missing_name.index
for i in treat_rows:
    info = df.loc[i, "info"]
    info_lst = info.split(sep=",", maxsplit=1)

    name = info_lst[0].strip()
    df.loc[i, "name"] = name
    # Plain string replacement of the first occurrence, so the name is not interpreted as a regex
    df.loc[i, "info"] = info.replace(name, "", 1).strip()

# Re-check rows
df.loc[treat_rows, :]

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,Kevin Kowalcyk,", 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,Vincent Palmer,", 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,Barry Stigler,", 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,Nana Gualdi,", 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,Scott Workman,", 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,Thami Shobede,", 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- Missing `name` values have been addressed and those names have been removed from `info` values.

In [11]:
# Re-check info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133900 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



#### Observations:
- We have no remaining missing values.
- Let us treat the `month_year` column next.

### Separating `month` and `year`

In [12]:
# Separating month and year into 2 columns and typecasting year as integer
df.loc[:, "year"] = df["month_year"].apply(lambda x: x.split(sep=" ")[1].strip())
df["year"] = df["year"].apply(lambda x: int(x))

df.loc[:, "month"] = df["month_year"].apply(lambda x: x.split(sep=" ")[0])
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references,year,month
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January



In [13]:
# Dropping month_year column
df.drop("month_year", axis=1, inplace=True)
df.head(2)

Unnamed: 0,day,name,info,link,num_references,year,month
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January
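
As an alternative sketch (not the approach used in this notebook), the same month/year split can be written with the pandas string accessor. The frame `sample` below is a made-up two-row stand-in for `df`, and the sketch assumes every value has the form "Month YYYY".

```python
import pandas as pd

# Toy stand-in for the month_year column
sample = pd.DataFrame({"month_year": ["January 1994", "June 2022"]})

# Split once on the space; expand=True returns one column per part
parts = sample["month_year"].str.split(" ", n=1, expand=True)
sample["month"] = parts[0]
sample["year"] = parts[1].astype(int)

print(sample[["month", "year"]])
```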



### Treating `info`

#### Function to Save Indices of Rows Matching re Pattern to a List

In [14]:
# Define a function that takes dataframe, column name, and re pattern as arguments and returns list of indices
# for which column value matches re pattern
def rows_with_pattern(dataframe, column, pattern):
    """
    Takes input of dataframe, column name, and re pattern 
    and returns list of indices for rows that contain match
    for pattern anywhere within value for given column.
    
    dataframe: dataframe
    column: column name
    pattern: re pattern
    """
    index_list = []

    for i in dataframe.index:
        item = dataframe.loc[i, column]
        match = re.search(pattern, item)
        if match:
            index_list.append(i)
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list


#### Checking a Sample

In [15]:
# Checking a sample of info
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
123290,6,Matang Sinh,", 67, Indian politician, MP (1992–1998), complications from COVID-19.",https://en.wikipedia.org/wiki/Matang_Sinh,7,2021,May
55685,17,Herbert Breslin,", 87, American music industry executive, heart attack.",https://en.wikipedia.org/wiki/Herbert_Breslin,2,2012,May
39580,8,Leonidas Vargas,", 60, Colombian drug trafficker, shot.",https://en.wikipedia.org/wiki/Leonidas_Vargas,6,2009,January
106800,21,Jack Donner,", 90, American actor (, , ).",https://en.wikipedia.org/wiki/Jack_Donner,4,2019,September
108963,5,Colin Howson,", 74–75, British philosopher.",https://en.wikipedia.org/wiki/Colin_Howson,4,2020,January



#### Observations:
- First, let us check for any rows that are missing digits, and therefore the age target, within `info` and remove them.
- Also, it would be helpful to remove information contained within parentheses, as we will not likely be using it.  We can save it to a new column, until we are certain.

#### Checking and Dropping Rows Lacking Digits (and therefore Age Data) within `info`

In [16]:
# Finding indices of rows that have pattern
has_digits = rows_with_pattern(df, "info", r"\d")
print(
    f"\nThere are {len(df) - len(has_digits)} rows without numbers in the info column."
)

# Dropping rows missing age data, resetting index, and checking new shape of df
df = df.loc[has_digits, :]
df.reset_index(inplace=True, drop=True)
df.shape

There are 132975 rows with matching pattern in column 'info'.

There are 925 rows without numbers in the info column.


(132975, 7)


#### Observations:
- 925 rows were removed as they lacked the target data for `age`.
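
The same filter could also be sketched with a vectorized boolean mask instead of a row-by-row loop. This is only an illustration on toy rows (the second row is invented to lack digits); the names `sample`, `has_digit`, and `kept` are not part of the notebook.

```python
import pandas as pd

# Toy rows for illustration only; the second has no digits, so no age
sample = pd.DataFrame(
    {
        "info": [
            ", 86, British dancer.",
            ", American actor.",
            ", 74–75, British philosopher.",
        ]
    }
)

# True where the string contains at least one digit
has_digit = sample["info"].str.contains(r"\d", regex=True)

n_dropped = int((~has_digit).sum())
kept = sample[has_digit].reset_index(drop=True)
print(f"{n_dropped} row(s) without digits; {len(kept)} kept.")
```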

#### Removing Information within Parentheses from `info` and saving to new column `info_parenth`

In [17]:
# Regular expression for parenthesis and its contents
pattern = r"\(.*\)"

# Finding indices of rows that have pattern
rows_to_check = rows_with_pattern(df, "info", pattern)

There are 50020 rows with matching pattern in column 'info'.



In [18]:
# For loop to extract parenthesis and its contents from info to info_parenth
for i, item in enumerate(df["info"]):
    match = re.search(pattern, item)
    if match:
        df.loc[i, "info_parenth"] = match.group(0)
        df.loc[i, "info"] = re.sub(pattern, "", df.loc[i, "info"])

# Rechecking for rows with pattern in original column
rows_to_check = rows_with_pattern(df, "info", pattern)
rows_to_check = rows_with_pattern(
    df[df["info_parenth"].notna()], "info_parenth", pattern
)

There are 0 rows with matching pattern in column 'info'.
There are 50020 rows with matching pattern in column 'info_parenth'.



#### Observations:
- Parentheses and the information within them have been removed from `info` and assigned to `info_parenth`.
- Next, we will follow the Wikipedia-defined fields to divide the `info` values.
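
One point worth keeping in mind is that the pattern `\(.*\)` used above is greedy: it matches from the first opening parenthesis to the last closing one. The sketch below contrasts it with a non-greedy variant on an invented string. The string and the `lazy` pattern are assumptions for illustration, and whether the difference matters for this dataset is not checked here.

```python
import re

greedy = r"\(.*\)"   # pattern used in the notebook
lazy = r"\(.*?\)"    # hypothetical non-greedy variant

# Invented example with two parenthetical groups
text = ", 70, actor (1990), director (2001)."

greedy_match = re.search(greedy, text).group(0)
greedy_rest = re.sub(greedy, "", text)

lazy_matches = re.findall(lazy, text)
lazy_rest = re.sub(lazy, "", text)

print(greedy_match)  # spans both groups and the text between them
print(greedy_rest)
print(lazy_matches)  # each group separately
print(lazy_rest)
```

With the greedy pattern, any text between two parenthetical groups moves into `info_parenth` along with them; with the non-greedy variant it stays in `info`.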

#### Splitting `info` on Commas into Separate Columns

In [19]:
# For loop to split info on commas and separate into respective new columns and removing leading/trailing white space and periods
for i, item in enumerate(df["info"]):
    info_lst = item.split(",")

    for j in range(len(info_lst)):
        df.loc[i, f"info_{j}"] = info_lst[j].strip(" .")

# Checking the first 2 rows
df.head(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,86,British dancer,ballet designer and director,,,,,,,,
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,68,Irish economist,writer,and academic,,,,,,,



In [20]:
# Checking the last 2 rows
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
132973,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(1980),,69,Russian volleyball player,Olympic champion and coach,,,,,,,,
132974,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,86,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,



#### Observations:
- The `info` value is successfully divided and we can proceed through it column by column.
- We will check the set of values for the first two columns, for age.
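
For comparison, the comma split can be sketched in vectorized form with `Series.str.split(..., expand=True)`. This is an alternative illustration on two rows copied from the output above, not the loop the notebook ran; the variable names are assumptions.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "info": [
            ", 86, British dancer, ballet designer and director.",
            ", 68, Irish economist, writer, and academic.",
        ]
    }
)

# One column per comma-separated piece; shorter rows are padded with missing values
parts = sample["info"].str.split(",", expand=True)

# Remove leading/trailing spaces and periods from every piece
parts = parts.apply(lambda col: col.str.strip(" ."))
parts.columns = [f"info_{j}" for j in parts.columns]

print(parts)
```

As in the loop version, `info_0` holds whatever precedes the first comma, which is an empty string for most rows.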

### `info_0`

In [21]:
# Checking unique value counts
df["info_0"].value_counts()

                                                                         132946
Sir                                                                           6
92                                                                            2
Douglas Scott                                                                 1
VC                                                                            1
Sir Woolwich West                                                             1
Sir Lord Justice of Appeal                                                    1
Dame MEP                                                                      1
83                                                                            1
Sir Governor-General                                                          1
Notable ice hockey players and coaches among the 44 killed in the :\n         1
Mike Alexander                                                                1
Colonel                                 


#### Observations:
- The vast majority of rows have an empty string for this field.
- There is one row representing a group, rather than an individual, and we will drop it.
- We should verify the name and age information for the remainder of unique values in `info_0`.

#### Dropping Entry for Group

In [22]:
# Checking the entry representing a group
group_entry = df[
    df["info_0"]
    == "Notable ice hockey players and coaches among the 44 killed in the :\n"
]
group_entry

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
51130,7,2011 Lokomotiv Yaroslavl plane crash,Notable ice hockey players and coaches among the 44 killed in the :\n,https://en.wikipedia.org/wiki/2011_Lokomotiv_Yaroslavl_plane_crash,95,2011,September,,Notable ice hockey players and coaches among the 44 killed in the :\n,,,,,,,,,,,



In [23]:
# Dropping group entry, resetting index, and checking new shape of df
df.drop(group_entry.index, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132974, 20)


#### Examining Rows with Atypical `info_0` Values

In [24]:
# Examining rows with atypical info_0 values
list_to_check = df["info_0"].value_counts().index.to_list()

verify_df = pd.DataFrame()
for item in list_to_check[1:]:
    verify_df = pd.concat([verify_df, df[df["info_0"] == item]])
verify_df

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
30349,25,Julian Bullard,"Sir , 78, British diplomat.",https://en.wikipedia.org/wiki/Julian_Bullard,3,2006,May,,Sir,78,British diplomat,,,,,,,,,
31140,1,Kyffin Williams,"Sir , 88, Welsh artist, lung and prostate cancer.",https://en.wikipedia.org/wiki/Kyffin_Williams,15,2006,September,,Sir,88,Welsh artist,lung and prostate cancer,,,,,,,,
34207,15,Jeremy Moore,"Sir , 79, British soldier, commander of UK land forces in the Falklands War.",https://en.wikipedia.org/wiki/Jeremy_Moore,3,2007,September,,Sir,79,British soldier,commander of UK land forces in the Falklands War,,,,,,,,
43606,29,Derek Hodgkinson,"Sir , 92, British air chief marshal.",https://en.wikipedia.org/wiki/Derek_Hodgkinson,5,2010,January,,Sir,92,British air chief marshal,,,,,,,,,
67212,7,Richard Best,"Sir , 80, British diplomat, Ambassador to Iceland .",https://en.wikipedia.org/wiki/Richard_Best_(diplomat),5,2014,March,(1989–1991),Sir,80,British diplomat,Ambassador to Iceland,,,,,,,,
67217,7,Thomas Hinde,"Sir , 88, British novelist.",https://en.wikipedia.org/wiki/Thomas_Hinde_(novelist),13,2014,March,,Sir,88,British novelist,,,,,,,,,
33086,28,David Turnbull,". 92, American materials scientist.",https://en.wikipedia.org/wiki/David_Turnbull_(materials_scientist),8,2007,April,,92,American materials scientist,,,,,,,,,,
117387,11,Gotthilf Fischer,"92, German choral conductor .",https://en.wikipedia.org/wiki/Gotthilf_Fischer,4,2020,December,(Fischer-Chöre),92,German choral conductor,,,,,,,,,,
67517,21,Colin Turner,"Sir Woolwich West, 92, British politician, MP for .",https://en.wikipedia.org/wiki/Colin_Turner,3,2014,March,(1959–1964),Sir Woolwich West,92,British politician,MP for,,,,,,,,
67162,5,Robin Dunn,"Sir Lord Justice of Appeal, 96, British jurist, .",https://en.wikipedia.org/wiki/Robin_Dunn,3,2014,March,(1980–1984),Sir Lord Justice of Appeal,96,British jurist,,,,,,,,,



#### Observations:
- The majority of rows contain additional aliases or titles within `info_0` that we do not need, but we can leave them in place for now.  
- There are a few rows that will need to be treated individually to correct the name value, as follows:
    1. Entry is for Mike Alexander whose band was Evile.
    2. Entry is for Herbert Wiere who performed slapstick.
    3. Entry is for Sarah-Jayne Mulvihill who was a Flight Lieutenant.
    4. Entry is for Douglas Scott who was killed by Demetreus Nix.
    5. Entry is for Kim Hwan-Sung who was a member of the band NRG.
- For these rows, we can replace the `name` value with the `info_0` value and hard-code the correct values for the `info_2` and `info_3` fields to match the Wikipedia pattern, while staying true to the information scraped.
- The row with the "Nearly 3" value for `info_0` represents a group, rather than an individual, so it will be dropped after treating the above rows.
- We can proceed to extract age from `info_0` for the few rows that contain it here instead of in `info_1`.

#### Treating 5 rows with Name in `info_0`

In [25]:
# List of names values in info_0
values_lst = [
    "Mike Alexander",
    "Herbert Wiere",
    "Sarah-Jayne Mulvihill",
    "Douglas Scott",
    "Kim Hwan-Sung",
]


In [26]:
# For loop to copy name from info_0 to name
for i in df[df["info_0"].isin(values_lst)].index.to_list():
    df.loc[i, "name"] = df.loc[i, "info_0"]

# Hard-coding info_2 and info_3 values for Kim Hwan-Sung
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/NRG_(South_Korean_band)"
].index.to_list()
df.loc[index, "info_2"] = "South Korean musician"

df.loc[index, "info_3"] = "respiratory illness"

# Hard-coding info_2 and info_3 values for Douglas Scott
index = df[
    df["link"]
    == "https://en.wikipedia.org/w/index.php?title=Demetreus_Nix&action=edit&redlink=1"
].index.to_list()
df.loc[index, "info_2"] = "student"

df.loc[index, "info_3"] = "murdered"

# Hard-coding info_2 and info_3 values for Sarah-Jayne Mulvihill
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/Flight_Lieutenant"
].index.to_list()
df.loc[index, "info_2"] = "British servicewoman"

df.loc[index, "info_3"] = "killed in action"


In [27]:
# Rechecking updated rows
df[df["info_0"].isin(values_lst)]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
14035,5,Herbert Wiere,"Herbert Wiere, 91, Austrian-born American Wiere Brothers comedian, member of the",https://en.wikipedia.org/wiki/Slapstick,16,1999,August,,Herbert Wiere,91,Austrian-born American Wiere Brothers comedian,member of the,,,,,,,,
15873,15,Kim Hwan-Sung,"Kim Hwan-Sung, 19, A Member of .",https://en.wikipedia.org/wiki/NRG_(South_Korean_band),27,2000,June,,Kim Hwan-Sung,19,South Korean musician,respiratory illness,,,,,,,,
18312,20,Douglas Scott,"Douglas Scott, 20, High-school student murdered by .",https://en.wikipedia.org/w/index.php?title=Demetreus_Nix&action=edit&redlink=1,0,2001,June,,Douglas Scott,20,student,murdered,,,,,,,,
30210,6,Sarah-Jayne Mulvihill,"Sarah-Jayne Mulvihill, 32, first British servicewoman to be killed in action in Iraq.",https://en.wikipedia.org/wiki/Flight_Lieutenant,12,2006,May,,Sarah-Jayne Mulvihill,32,British servicewoman,killed in action,,,,,,,,
42140,5,Mike Alexander,"Mike Alexander, 32, British bassist , pulmonary embolism.",https://en.wikipedia.org/wiki/Evile,82,2009,October,(),Mike Alexander,32,British bassist,pulmonary embolism,,,,,,,,



#### Dropping Entry for Group

In [28]:
# Checking the entry representing a group
group_entry = df[df["info_0"] == "Nearly 3"]
group_entry

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
18908,11,were killed,"Nearly 3,000 people September 11 attacks in the , including:\n",https://en.wikipedia.org/wiki/Casualties_of_the_September_11_attacks,176,2001,September,,Nearly 3,000 people September 11 attacks in the,including:\n,,,,,,,,,



In [29]:
# Dropping group entry, resetting index, and checking new shape of df
df.drop(group_entry.index, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132973, 20)


#### Extracting `age` from `info_0`

In [30]:
# Pattern for re
pattern = r"(\d{1,3})"

# Checking rows with pattern
rows_to_check = rows_with_pattern(df, "info_0", pattern)

There are 3 rows with matching pattern in column 'info_0'.



In [32]:
# For loop to extract age from info_0 to age
for i, item in enumerate(df["info_0"]):
    match = re.search(pattern, item)
    if match:
        df.loc[i, "age"] = int(match.group(1))
        df.loc[i, "info_0"] = re.sub(pattern, "", df.loc[i, "info_0"])

# Re-checking info_0 for pattern
rows_to_check = rows_with_pattern(df, "info_0", pattern)

There are 0 rows with matching pattern in column 'info_0'.



In [33]:
# Dropping info_0
df.drop("info_0", axis=1, inplace=True)


#### Observations:
- The new `age` column has been added successfully.
- We are finished processing `info_0`.

### `info_1`

#### Unique Values

In [None]:
# Checking unique values
df["info_1"].unique()

#### Observations:
- There is a lot of variety in the format of the age data.
- Also, this field contains several values that we would expect in `info_2` and beyond.
- Let us take the approach of extracting age values first, then examining rows with missing `age`.

#### Examining unique formats for age data

In [None]:
# Examining formats for age data
has_num = []
for i, item in enumerate(df["info_1"]):
    pattern = r"\d"
    if re.search(pattern, item) is not None:
        has_num.append(i)

df.loc[has_num, :]["info_1"].unique()

#### Observations:
- The data for age within `info_1` is in the following formats:  
    - single integer ("age", "age ", "age.")
    - range of 2 integers (separators '-', '–', '/', and ' or ')
    - range of 2 integers with only unit value for second number ('age1/age2-2nd-digit')
    - age in days or months ('age days', 'age-months')
    - estimates ('c. age', 'c.age', 'age?', 'ages' (e.g. 80s), 'age+')
- There are some specific rows that need to be examined, with the following values for `info_1`:
    - 1995
    - 1996
    - 1997
    - German Olympic sailor [1]
    - Taiwanese failed assassin in the 3-19 shooting incident
    - 255
    - 176
    - the first wild bear in Germany in 170 years
    - c. 3500
    - common chimpanzee 55
    - Maltese 15
    - c.1000
    - Tree of the Year 150
- We will need to be strategic in the order in which we extract age from `info_1`.
- First, we will look at the atypical values listed above.
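
To illustrate why the order of extraction matters, here is a rough sketch that labels an `info_1` string by the first format it matches, checking the most specific formats first. The labels, the `classify_age_format` helper, and the exact patterns are assumptions made for this illustration; they approximate the formats listed above rather than reproduce the notebook's code.

```python
import re

# Ordered from most specific to least specific; the first match wins
FORMATS = [
    ("days_or_months", r"\d{1,3}[ -](?:days|months)"),
    ("short_range", r"\d{1,3}/\d\b"),
    ("full_range", r"\d{1,3}(?:-|–|/| or )\d{1,3}"),
    ("estimate", r"c\.|\?|\+|\d0s\b"),
    ("single_integer", r"^\d{1,3}$"),
]


def classify_age_format(text):
    """Return the label of the first matching format, or 'other'."""
    text = text.strip()
    for label, pattern in FORMATS:
        if re.search(pattern, text):
            return label
    return "other"


for value in ["86", "74–75", "76/7", "3 days", "c. 80", "80s", "British diplomat"]:
    print(f"{value!r}: {classify_age_format(value)}")
```

If the single-integer check ran before the range checks with a looser pattern, a value such as "74–75" could be read as 74, which is one reason to handle ranges and day/month values before plain integers.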

#### Examining Rows with Digits and Atypical Values for `info_1`

In [None]:
# List of atypical info_1 values for rows with digits
values_lst = [
    "1995",
    "1996",
    "1997",
    "German Olympic sailor [1]",
    "Taiwanese failed assassin in the 3-19 shooting incident",
    "255",
    "176",
    "the first wild bear in Germany in 170 years",
    "c. 3500",
    "common chimpanzee 55",
    "Maltese 15",
    "c.1000",
    "Tree of the Year 150",
]

df[df["info_1"].isin(values_lst)]

#### Observations:
- Age data is either missing or the entry is for a member of a non-human species.
- We will drop all of these rows.

#### Dropping Rows for Non-Human Entries or Entries Missing Age Data

In [None]:
# List of indexes to be dropped
drop_rows = df[df["info_1"].isin(values_lst)].index.to_list()

# Dropping rows, resetting index, and checking new shape of df
df.drop(drop_rows, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

#### Observations:
- With those rows addressed, we will begin extracting age in days and months and convert them to years.

### Extracting `age` from `info_1`

#### Step 1: Age in Days and Months

In [None]:
# Dictionary of patterns for days and months formats as keys and factor to convert to years
patterns = {
    r"(\d{1,3})( days)": 365,
    r"(\d{1,3})(-months)": 12,
    r"(\d{1,3})( months)": 12,
}

# For loop to check for age in days and months, convert to years, and save in age.
# Also removes age in days/months from info_1.
for key, value in patterns.items():
    for i, item in enumerate(df["info_1"]):
        match = re.search(key, item)
        if match:
            age = int(match.group(1)) / value
            df.loc[i, "age"] = age
            df.loc[i, "info_1"] = re.sub(key, "", df.loc[i, "info_1"])

# Checking updated rows
df[df["age"].notna()]

#### Observations:
- We have successfully captured the age in days and months values and converted them to years.
- The other rows are in place already from our treatment of `info_0`.
- Next, we will address entries that contain two age values as a range, starting first with those that have a single digit as the second number.
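
The day and month conversion from Step 1 can be sketched on plain strings as follows. The helper `age_in_years` is hypothetical; the divisors 365 and 12 are the ones used in the cell above, so leap years are ignored.

```python
import re

# Pattern for each unit mapped to the number of those units per year
CONVERSIONS = {
    r"(\d{1,3}) days": 365,
    r"(\d{1,3})[ -]months": 12,
}


def age_in_years(text):
    """Convert an age given in days or months to years, else return None."""
    for pattern, per_year in CONVERSIONS.items():
        match = re.search(pattern, text)
        if match:
            return int(match.group(1)) / per_year
    return None


print(age_in_years("6 months"))   # 0.5
print(age_in_years("18-months"))  # 1.5
print(age_in_years("86"))         # None
```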

#### Step 2: Extracting `age` from `info_1` for Entries with Age Estimate Containing Age Range with 2 Values

#### Ranges with Single Digit as Upper-end

In [None]:
# Pattern for re
pattern = r"(\d{1,3})(/)(\d)\b"

# List of rows to check
rows_to_check = []

# For loop to collect indices of rows whose info_1 value matches the pattern
for i, item in enumerate(df["info_1"]):
    match = re.search(pattern, item)
    if match:
        rows_to_check.append(i)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

In [None]:
# Pattern for re
pattern = r"(\d{1,3})(/)(\d)\b"


# For loop to find rows with values and pattern and calculate and extract age to age column and remove age from info_1
for i in df[df["age"].isna()].index:
    item = df.loc[i, "info_1"]
    match = re.search(pattern, item)
    if match:
        age_1 = int(match.group(1))
        age_2 = int(match.group(3))
        units = ((age_1 % 10) + age_2) / 2
        tens = age_1 - (age_1 % 10)
        age = tens + units
        df.loc[i, "age"] = age
        df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])

# Checking example rows
pd.concat([df[df["name"] == "Lu'ay al-Atassi"], df[df["name"] == "Joji Banuve"]])

#### Other Ranges with Two Values

In [None]:
# Pattern for re
pattern = r"(\d{1,3})(-|–|/| or )(\d{1,3})"

# List of rows to check
rows_to_check = []

# For loop to collect indices of rows whose info_1 value matches the pattern
for i, item in enumerate(df["info_1"]):
    match = re.search(pattern, item)
    if match:
        rows_to_check.append(i)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

In [None]:
# Pattern for re
pattern = r"(\d{1,3})(-|–|/| or )(\d{1,3})"


# For loop to find rows with values and pattern and calculate and extract age to age column and remove age from info_1
for i in df[df["age"].isna()].index:
    item = df.loc[i, "info_1"]
    match = re.search(pattern, item)
    if match:
        age = (int(match.group(1)) + int(match.group(3))) / 2
        df.loc[i, "age"] = age
        df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])

# Checking example rows
pd.concat([df[df["name"] == "Moncef Ouannes"], df[df["name"] == "Fehmi Sağınoğlu"]])
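
As a minimal sketch of the midpoint calculation above, the helper below applies the same pattern to a few made-up strings. The name `range_midpoint` and the samples are assumptions for illustration only.

```python
import re

# Same pattern as the cell above: two integers joined by -, –, /, or " or "
pattern = r"(\d{1,3})(-|–|/| or )(\d{1,3})"


def range_midpoint(text):
    """Hypothetical helper: mean of the two ages in a range, else None."""
    match = re.search(pattern, text)
    if not match:
        return None
    return (int(match.group(1)) + int(match.group(3))) / 2


print(range_midpoint("59–60"))  # 59.5
print(range_midpoint("44 or 45"))  # 44.5
print(range_midpoint("82/83"))  # 82.5
print(range_midpoint("65/6"))  # 35.5
```

The last line shows why the single-digit upper-end ranges are handled first: this more general pattern also matches `65/6`, but it would read it as the two ages 65 and 6.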

#### Observations:
- Next, we will extract from the entries with straightforward single integer age values, including the formats: "age", "age ", 'age.'.
- More vague estimates are excluded here to allow closer examination, as they are more likely to be atypical entries.
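
To illustrate which strings count as a "straightforward" single integer, here is a small sketch using the same three anchored patterns as Step 3. The helper name `leading_integer_age` and the sample strings are hypothetical.

```python
import re

# Same three patterns used in Step 3
patterns = [r"^(\d{1,3})$", r"^(\d{1,3})\s", r"^(\d{1,3})\.\s"]


def leading_integer_age(text):
    """Hypothetical helper: age if text starts with a plain integer, else None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return int(match.group(1))
    return None


print(leading_integer_age("86"))  # 86
print(leading_integer_age("86 "))  # 86
print(leading_integer_age("86. "))  # 86
print(leading_integer_age("c. 86"))  # None
print(leading_integer_age("86?"))  # None
print(leading_integer_age("80s"))  # None
```

Because every pattern is anchored at the start of the string and requires the end of the string, whitespace, or a period followed by whitespace after the digits, estimates such as `c. 86`, `86?`, and `80s` are left untouched for the later steps.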

#### Step 3: Age as Single Integer (Excluding Estimates)

In [None]:
# List of patterns for age formats with single integer for age
patterns = [r"^(\d{1,3})$", r"^(\d{1,3})\s", r"^(\d{1,3})\.\s"]

# For loop to check age pattern in info_1, save age to age column, and remove age from info_1
for i, item in enumerate(df["info_1"]):
    for pattern in patterns:
        match = re.search(pattern, item)
        if match:
            age = int(match.group(1))
            df.loc[i, "age"] = age
            df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])

# Checking first 2 rows
df.head(2)

In [None]:
# Checking last 2 rows
df.tail(2)

In [None]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

#### Observations:
- The rows with single integer age data have been addressed.
- There are only 294 remaining missing values for `age` after extracting age in days, months, single integer years, and 2-integer year range values.
- Let us check the rows containing 'c.', '+', or '?' in the age information.

#### Entries with Age Data Containing 'c.', '+', or '?' for Estimate

In [None]:
# List of values to identify rows
values = ["c.", "+", "?"]

# List of rows to check
rows_to_check = []

# For loop to add index of list containing values to list
for index in df[df["age"].isna()].index:
    if any(value in df.loc[index, "info_1"] for value in values):
        rows_to_check.append(index)

# Inspecting rows containing values
df.loc[rows_to_check, :]

#### Observations:
- Most of the entries are for people, but there are also one or more entries for the following:
    - carp
    - racehorse
    - chimpanzee
    - flamingo
    - cat
    - turkey
- We will proceed to check `info_2` for these species and drop all rows that represent animals rather than people.

#### Checking for Cat, Racehorse, Chimpanzee, Carp, Flamingo, and Turkey in `info_2`

In [None]:
# Defining pattern for re
species_pattern = r"\b(cat|racehorse|chimpanzee|carp|flamingo|turkey)\b"

# Empty list to collect indexes of rows containing pattern
rows_to_check = []

# For loop to add index of rows containing pattern to list
for index in df[df["info_2"].notna()].index:
    match = re.search(species_pattern, df.loc[index, "info_2"])
    if match:
        rows_to_check.append(index)

# Checking number of rows containing pattern
print(f"There are {len(rows_to_check)} rows containing these values.")

#### Observations:
- There are sufficient rows to warrant checking species by species.

#### Cat Entries per `info_2`

In [None]:
# Defining pattern for re
species_pattern = r"\b(cat)\b"

# Empty list to collect indexes of rows containing pattern
rows_to_check = []

# For loop to add index of rows containing pattern to list
for index in df[df["info_2"].notna()].index:
    match = re.search(species_pattern, df.loc[index, "info_2"])
    if match:
        rows_to_check.append(index)

# Inspecting rows with pattern
df.loc[rows_to_check, :]

#### Observations:
- There is one person represented.  
- We can proceed to drop the others, using "cat", "cat of", and "cat in" patterns.
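
As a rough sketch of how these patterns separate cats from a person whose description merely mentions cats, consider the made-up `info_2` strings below. The helper `is_cat_entry` and all three samples are hypothetical and are not taken from the dataset.

```python
import re

# Patterns matching the "cat", "cat of", and "cat in" forms noted above
patterns = [r"\bcat$", r"\b(cat of|cat in)\b"]


def is_cat_entry(info):
    """Hypothetical helper: True if info matches any of the cat patterns."""
    return any(re.search(pattern, info) for pattern in patterns)


print(is_cat_entry("British cat"))  # True
print(is_cat_entry("American cat of the year"))  # True
print(is_cat_entry("English cat breeder and judge"))  # False
```

The first pattern only matches when "cat" is the final word, and the second requires the exact phrases "cat of" or "cat in", so a description in which "cat" is followed by another word is kept.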

#### Dropping Entries for Cats per `info_2`

In [None]:
# List of re patterns to find
patterns = [r"\bcat$", r"\b(cat of|cat in)\b"]

# List to collect indexes of rows to drop
rows_to_drop = []

# For loop to find re pattern and add index of rows with pattern to list
for index in df[df["info_2"].notna()].index:
    for pattern in patterns:
        match = re.search(pattern, df.loc[index, "info_2"])
        if match:
            rows_to_drop.append(index)

# Checking number of rows added to list for dropping
print(f"{len(rows_to_drop)} rows will be dropped.")

In [None]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

#### Racehorse Entries per `info_2`

In [None]:
# Defining pattern for re
pattern = r"\bracehorse\b"

# Empty list to collect indexes of rows containing pattern
rows_to_check = []

# For loop to add index of rows containing pattern to list
for index in df[df["info_2"].notna()].index:
    match = re.search(pattern, df.loc[index, "info_2"])
    if match:
        rows_to_check.append(index)

# Inspecting rows with pattern
df.loc[rows_to_check, :].sample(10)

#### Observations:
- There are several entries for people involved in the racehorse business.
- Rows with `info_2` values that end in 'racehorse' or 'racehorse and sire' can be dropped.

#### Dropping Entries for Racehorses per `info_2`

In [None]:
# List of re patterns to find
patterns = [r"\bracehorse$", r"\b(racehorse and sire)$"]

# List to collect indexes of rows to drop
rows_to_drop = []

# For loop to find re pattern and add index of rows with pattern to list
for index in df[df["info_2"].notna()].index:
    for pattern in patterns:
        match = re.search(pattern, df.loc[index, "info_2"])
        if match:
            rows_to_drop.append(index)

# Checking number of rows added to list for dropping
print(f"{len(rows_to_drop)} rows will be dropped.")

In [None]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

#### Chimpanzee, Flamingo, Carp, and Turkey Entries per `info_2`

In [None]:
# Defining pattern for re
pattern = r"\b(chimpanzee|flamingo|carp|turkey)\b"

# Empty list to collect indexes of rows containing pattern
rows_to_check = []

# For loop to add index of rows containing pattern to list
for index in df[df["info_2"].notna()].index:
    match = re.search(pattern, df.loc[index, "info_2"])
    if match:
        rows_to_check.append(index)

# Inspecting rows with pattern
df.loc[rows_to_check, :]

#### Observations:
- All of these rows are for animals, so we will remove them.

#### Dropping Entries for Chimpanzees, Flamingos, Carp, and Turkeys per `info_2`

In [None]:
# List of re patterns to find
patterns = [r"\b(chimpanzee|flamingo|carp|turkey)\b"]

# List to collect indexes of rows to drop
rows_to_drop = []

# For loop to find re pattern and add index of rows with pattern to list
for index in df[df["info_2"].notna()].index:
    for pattern in patterns:
        match = re.search(pattern, df.loc[index, "info_2"])
        if match:
            rows_to_drop.append(index)

# Checking number of rows added to list for dropping
print(f"{len(rows_to_drop)} rows will be dropped.")

In [None]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

#### Observations:
- With those non-human entries addressed, we can return to processing `info_1`.
- Let us address the remaining entries with '?', '+', or 'c.', accepting the estimated age as `age`.
- We will treat the age estimates ending in 's' similarly, but separately.

#### Extracting `age` from `info_1` for Entries with Age Estimate Containing '?', '+', or 'c.'

In [None]:
# List to identify rows
values = ["c.", "+", "?"]

# Pattern for re
pattern = r"\b(\d{1,3})\b"

# For loop to find rows with values and pattern and extract age to age column and remove age from info_1
for i in df[df["age"].isna()].index:
    item = df.loc[i, "info_1"]

    if any(value in item for value in values):
        match = re.search(pattern, item)

        if match:
            age = int(match.group(1))
            df.loc[i, "age"] = age
            df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])

        for value in values:
            df.loc[i, "info_1"] = df.loc[i, "info_1"].replace(value, "")

# Checking example rows
pd.concat([df[df["name"] == "Tuti Yusupova"], df[df["name"] == "Raed al Atar"]])
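
The effect of the cell above on individual values can be sketched as follows. The helper name `estimated_age` and the sample strings are assumptions for illustration; unlike the cell above, the sketch also strips surrounding whitespace from the leftover text.

```python
import re

# Same markers and pattern as the cell above
values = ["c.", "+", "?"]
pattern = r"\b(\d{1,3})\b"


def estimated_age(text):
    """Hypothetical helper: return (age, leftover text) for a marked estimate."""
    if not any(value in text for value in values):
        return None, text
    match = re.search(pattern, text)
    age = int(match.group(1)) if match else None
    leftover = re.sub(pattern, "", text)
    for value in values:
        leftover = leftover.replace(value, "")
    return age, leftover.strip()


print(estimated_age("c. 90"))  # (90, '')
print(estimated_age("100+"))  # (100, '')
print(estimated_age("45?"))  # (45, '')
print(estimated_age("c. 80s"))  # (None, '80s')
```

The last sample shows that the word boundary after the digits keeps an estimate such as `80s` from being read as 80, so the age stays missing while the marker is still removed.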

#### Observations:
- The ages ending in 's' should be the only ones remaining in the `info_1` column.
- We will treat those now, putting them at the mid-range for the decade indicated.
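
As a minimal sketch of the mid-range rule, the helper below adds 5 to the decade found by the pattern used in the next cells. The name `decade_midpoint` and the sample strings are hypothetical.

```python
import re

# Pattern for an age given as a decade, e.g. "90s"
pattern = r"\b(\d{1,3})s\b"


def decade_midpoint(text):
    """Hypothetical helper: decade start plus 5, or None if no decade is found."""
    match = re.search(pattern, text)
    if not match:
        return None
    return int(match.group(1)) + 5


print(decade_midpoint("90s"))  # 95
print(decade_midpoint("early 40s"))  # 45
print(decade_midpoint("90"))  # None
```

Note that a qualifier such as "early" does not shift the estimate here, which mirrors the extraction cell that follows: it removes the word "early " from `info_1` but still assigns the middle of the decade.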

#### Extracting `age` from `info_1` for Entries with Age Estimate ending in 's'

In [None]:
# Defining pattern for re
pattern = r"\b(\d{1,3})s\b"

# Empty list to collect indexes of rows containing pattern
rows_to_check = []

# For loop to add index of rows containing pattern to list
for index in df[df["age"].isna()].index:
    match = re.search(pattern, df.loc[index, "info_1"])
    if match:
        rows_to_check.append(index)

# Inspecting rows with pattern
df.loc[rows_to_check, :]

In [None]:
# Pattern for re
pattern = r"\b(\d{1,3})s\b"

# For loop to find rows with values and pattern and extract age to age column and remove age from info_1
for i in df[df["age"].isna()].index:
    item = df.loc[i, "info_1"]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[i, "age"] = age + 5
        df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])
    if "early " in item:
        df.loc[i, "info_1"] = df.loc[i, "info_1"].replace("early ", "")

# Checking example rows
pd.concat([df[df["name"] == "Mary Dann"], df[df["name"] == "Timothy Apiyo"]])

### Checking for Any Missed Digits in `info_1` and for Remaining Missing Values for `age`

In [None]:
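# Pattern for re to find any digit
pattern = r"\d"

# List to collect indexes of rows with digits remaining in info_1
rows_to_check = []

# For loop to add index of rows containing pattern to list
for i, item in enumerate(df["info_1"]):
    match = re.search(pattern, item)
    if match:
        rows_to_check.append(i)

# Checking number of entries with digits in info_1 and number of missing values for age
print(
    f"There are {len(df.loc[rows_to_check, :])} remaining entries with digits in info_1 column.\n"
)
print(f'There are {df["age"].isna().sum()} remaining missing values for age.')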

#### Observations:
- All of the age data that had been in `info_1` has been successfully extracted.
- There are 218 remaining missing values for `age` that we hope to find in the other info columns.
- We will include the remaining values in `info_1` when extracting citizenship and role information.

In [None]:
# Inspecting rows with remaining missing values for age
df[df["age"].isna()]