# Wikipedia Notable Life Expectancies

# [Notebook 2 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean_thanak_2022_06_13.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the wp_life_expect_raw_complete dataset
conn = sql.connect("wp_life_expect_raw_complete.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_raw_complete", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 133900 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
133898,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
133899,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
78460,September 2015,20,Jagmohan Dalmiya,", 75, Indian cricket official, President of International Cricket Council (1997–2000) and Board of Control for Cricket in India (2001–2004), cardi...",https://en.wikipedia.org/wiki/Jagmohan_Dalmiya,34
99112,September 2018,11,Richard Newbold Adams,", 94, American anthropologist.",https://en.wikipedia.org/wiki/Richard_Newbold_Adams,2
77638,July 2015,31,W. Eugene Wilson,", 86, American politician.",https://en.wikipedia.org/wiki/W._Eugene_Wilson,3
1977,November 1994,8,Marianne Straub,", 85, British textile designer.",https://en.wikipedia.org/wiki/Marianne_Straub,14
96782,May 2018,8,Frøystein Wedervang,", 100, Norwegian economist.",https://en.wikipedia.org/wiki/Fr%C3%B8ystein_Wedervang,1



#### Observations:
- There are 133,900 rows and 6 columns.
- `month_year` contains the month and year of death, while `day` contains the day of the month of death.
- `name` is the notable person's name.  It is a nominal feature that will not be used for analysis, but will be maintained for any referencing needs.
- `info` contains multiple items including the notable person's "age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, (and) cause of death (if known)."
- `link` is the URL of the notable person's individual Wikipedia page.  If such a page does not exist, there is either a non-working link (https://en.wikipedia.orgNone), or the link leads to a page with a message that no page exists for that individual.  `link` is a unique identifier for all entries except the 6 with the non-working link, which can still be told apart by the names contained in their `info` values.
- `num_references` contains the number of references on the notable person's individual Wikipedia page.  This feature serves as a proxy measure of notability.
- Prior to EDA, our task will be to extract the individual elements that are combined in the `month_year` and `info` columns.

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133894 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



In [6]:
# Checking duplicate rows
df.duplicated().sum()

0


In [7]:
# Check percentage of null values by column
# isnull().mean() divides the null count by the total row count (df.count() would exclude the nulls)
df.isnull().mean() * 100

month_year        0.000000
day               0.000000
name              0.004481
info              0.000000
link              0.000000
num_references    0.000000
dtype: float64


In [8]:
# Checking number of missing values per row (not necessary here, but done to keep process standard)
df.isnull().sum(axis=1).value_counts()

0    133894
1         6
dtype: int64


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- As expected, we have 6 entries that are missing `name`, but the names can be found in their `info` values.
- All columns are currently of object type.  We will need to appropriately typecast them after separating the information in `month_year` and `info`.
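A minimal sketch of the typecasting mentioned above, on a made-up two-row frame (`toy` and its values are hypothetical, not taken from the dataset). `pd.to_numeric` with `errors="coerce"` is one option when a text column might hold entries that are not numbers; whether any column here needs it has not been established at this point.

```python
import pandas as pd

# Hypothetical miniature frame: every column arrives as text, as in the raw data
toy = pd.DataFrame({"day": ["1", "20"], "num_references": ["21", "x"]})

# Clean numeric text converts directly to an integer dtype
toy["day"] = pd.to_numeric(toy["day"])

# errors="coerce" turns unparseable entries into NaN instead of raising
toy["num_references"] = pd.to_numeric(toy["num_references"], errors="coerce")
```

A column that ends up holding NaN is stored as float, so a later cast to integer would need those rows handled first.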

## Data Cleaning

### Addressing Missing `name` Values

In [9]:
# Checking rows with missing name values
missing_name = df[df["name"].isna()]
missing_name

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- These rows vary from the main set as there is a substring containing the person's name at the start of the `info` string.
- As there are so few rows missing `name`, let us address this issue first.

In [10]:
# For loop to copy name value from info value and remove name from info value
treat_rows = missing_name.index
for i in treat_rows:
    info = df.loc[i, "info"]
    info_lst = info.split(sep=",", maxsplit=1)

    name = info_lst[0].strip()
    df.loc[i, "name"] = name
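    # Escape the name so regex metacharacters are treated literally; remove only the first occurrence
    df.loc[i, "info"] = re.sub(re.escape(name), "", info, count=1).strip()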

# Re-check rows
df.loc[treat_rows, :]

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,Kevin Kowalcyk,", 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,Vincent Palmer,", 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,Barry Stigler,", 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,Nana Gualdi,", 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,Scott Workman,", 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,Thami Shobede,", 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- Missing `name` values have been addressed and those names have been removed from `info` values.
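The same repair could also be done without a loop. The sketch below is an alternative under stated assumptions, not the notebook's method: `toy` is a hypothetical two-row frame that mimics the two row shapes seen above, and the fix splits `info` once on the first comma.

```python
import pandas as pd

# Hypothetical frame: one row missing its name, one regular row
toy = pd.DataFrame(
    {
        "name": [None, "Raymond Crotty"],
        "info": [
            "Vincent Palmer, 37, British criminal.",
            ", 68, Irish economist, writer, and academic.",
        ],
    }
)

missing = toy["name"].isna()

# Split once on the first comma: piece 0 is the name, piece 1 is the remainder
parts = toy.loc[missing, "info"].str.split(",", n=1, expand=True)
toy.loc[missing, "name"] = parts[0].str.strip()

# Restore the leading comma so the row matches the format of the other rows
toy.loc[missing, "info"] = "," + parts[1]
```

Rows that already have a name are left untouched because every assignment is restricted to the `missing` mask.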

In [11]:
# Re-check info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133900 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



#### Observations:
- We have no remaining missing values.
- Let us treat the `month_year` column next.

### Separating `month` and `year`

In [12]:
# Separating month and year into 2 columns and typecasting year as integer
df.loc[:, "year"] = df["month_year"].apply(lambda x: x.split(sep=" ")[1].strip())
df["year"] = df["year"].apply(lambda x: int(x))

df.loc[:, "month"] = df["month_year"].apply(lambda x: x.split(sep=" ")[0])
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references,year,month
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January



In [13]:
# Dropping month_year column
df.drop("month_year", axis=1, inplace=True)
df.head(2)

Unnamed: 0,day,name,info,link,num_references,year,month
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January



### Treating `info`

#### Checking a Sample

In [14]:
# Checking a sample of info
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
88355,27,Sin Kek Tong,", 72, Singaporean politician, founder of the Singapore People's Party.",https://en.wikipedia.org/wiki/Sin_Kek_Tong,9,2017,February
33970,28,Catherine Troeh,", 96, American native people activist and historian.",https://en.wikipedia.org/wiki/Catherine_Troeh,7,2007,June
37900,6,Karl Kuehl,", 70, American baseball scout, coach and manager (Montreal Expos), pulmonary fibrosis.",https://en.wikipedia.org/wiki/Karl_Kuehl,3,2008,August
32971,23,John Ritchie,", 65, British footballer for Stoke City F.C., club's record goalscorer.","https://en.wikipedia.org/wiki/John_Ritchie_(footballer,_born_1941)",9,2007,February
89689,6,John Bennison,", 92, Australian businessman (Wesfarmers).",https://en.wikipedia.org/wiki/John_Bennison,7,2017,May



#### Observations:
- First, let us check for any rows that are missing digits, and therefore the age target, within `info` and remove them.
- Also, it would be helpful to remove information contained within parentheses, as we will not be using it.
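As a rough illustration of the first check, rows without any digit in `info` can be located with a vectorized string test. This is a sketch on a hypothetical two-row frame (`toy`, `has_digit`, and `no_age_index` are illustrative names), not the code used in the notebook.

```python
import pandas as pd

# Hypothetical frame: the second row has no digit, so it cannot carry an age
toy = pd.DataFrame({"info": [", 86, British dancer.", ", Scottish footballer."]})

# True where info holds at least one digit
has_digit = toy["info"].str.contains(r"\d", regex=True)

# Index labels of the rows that lack a digit
no_age_index = toy.index[~has_digit]
```

Note that a digit is only a necessary condition for an age: a value such as a year inside the description would also pass this test.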

#### Checking and Dropping Rows Lacking Digits (and therefore Age Data) within `info`

In [15]:
# For loop to extract index of rows without digits in info value
remove_lst = []
for index in df.index:
    pattern = r"\d"
    if re.search(pattern, df.loc[index, "info"]) is None:
        remove_lst.append(index)
print(len(remove_lst), "rows")
df.loc[remove_lst, :].sample(5)

925 rows


Unnamed: 0,day,name,info,link,num_references,year,month
70403,17,Fariha al-Berkawi,", Libyan politician, shot.",https://en.wikipedia.org/wiki/Fariha_al-Berkawi,6,2014,July
18252,13,Kinfe Gebremedhin,", Ethiopian Chief of Security and Immigration, murdered.",https://en.wikipedia.org/wiki/Kinfe_Gebremedhin,5,2001,May
54491,10,George Beattie,", Scottish footballer.",https://en.wikipedia.org/wiki/George_Beattie_(footballer),4,2012,March
108472,14,Gita Siddharth,", Indian actress (, ).",https://en.wikipedia.org/wiki/Gita_Siddharth,6,2019,December
19620,11,Dasht-e Qaleh Taliban ambush,Journalists killed in the \n,https://en.wikipedia.org/wiki/Timeline_of_the_War_in_Afghanistan_(2001%E2%80%93present)#2001,86,2001,November



In [16]:
# Dropping rows missing age data, resetting index, and checking new shape of df
df.drop(remove_lst, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132975, 7)


#### Observations:
- 925 rows were removed as they lacked the target data for `age`.

#### Removing Information within Parentheses in `info`

In [17]:
# Regular expression for parenthesis and its contents
pattern = r"\(.*\)"

# Subbing empty string for parentheses and stripping white space
df.loc[:, "info"] = df["info"].apply(lambda x: re.sub(pattern, "", x).strip())
df.loc[[21850, 125430], :]

Unnamed: 0,day,name,info,link,num_references,year,month
21850,11,Sir Michael Clapham,", 90, British industrialist, president of the Confederation of British Industry from 1972 to 1974.",https://en.wikipedia.org/wiki/Michael_Clapham_(industrialist),14,2002,November
125430,23,Jean-Luc Nancy,", 81, French philosopher.",https://en.wikipedia.org/wiki/Jean-Luc_Nancy,8,2021,August



#### Observations:
- Parentheses and the information within them have been removed from `info`.
- Next, we will follow the Wikipedia-defined fields to divide the `info` values.
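One caveat about the pattern `\(.*\)`: it is greedy, so when a value holds more than one parenthesized group, everything from the first opening parenthesis to the last closing one is removed, including the text in between. Some rows do hold two groups (the Jagmohan Dalmiya entry in the earlier sample is one). The sketch below contrasts the greedy pattern with a per-group alternative on a made-up string; the alternative is a suggestion and is not what produced the outputs shown in this notebook.

```python
import re

# Hypothetical string with two parenthesized groups
text = "a (x) b (y) c"

# Greedy: matches from the first "(" to the last ")", so " b " is lost as well
greedy = re.sub(r"\(.*\)", "", text)

# Per group: matches one "(...)" at a time, keeping the text between groups
per_group = re.sub(r"\([^()]*\)", "", text)
```

Here `greedy` is `"a  c"` while `per_group` is `"a  b  c"`.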

#### Splitting `info` on Commas into Separate Columns

In [18]:
# For loop to split info on commas and separate into respective new columns and removing leading/trailing white space and periods
for i, item in enumerate(df["info"]):
    info_lst = item.split(",")

    for j in range(len(info_lst)):
        df.loc[i, f"info_{j}"] = info_lst[j].strip(" .")

# Checking the first 2 rows
df.head(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,86,British dancer,ballet designer and director,,,,,,,,
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,68,Irish economist,writer,and academic,,,,,,,



In [19]:
# Checking the last 2 rows
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
132973,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,,69,Russian volleyball player,Olympic champion and coach,,,,,,,,
132974,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,86,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,



#### Observations:
- The `info` value is successfully divided and we can proceed through it column by column.
- We will check the set of values for the first two columns, for age.
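For reference, the comma split can also be written with pandas string methods rather than row-by-row assignment, which tends to be much faster on a frame of this size. A minimal sketch on a hypothetical two-row frame (`toy` and `parts` are illustrative names):

```python
import pandas as pd

# Hypothetical frame reusing the two info strings shown in the first rows above
toy = pd.DataFrame(
    {
        "info": [
            ", 86, British dancer, ballet designer and director.",
            ", 68, Irish economist, writer, and academic.",
        ]
    }
)

# expand=True yields one column per comma-separated piece
parts = toy["info"].str.split(",", expand=True)

# Strip spaces and periods from each piece; absent pieces stay missing
parts = parts.apply(lambda col: col.str.strip(" ."))

# Name the new columns info_0, info_1, ... and attach them
parts.columns = [f"info_{j}" for j in range(parts.shape[1])]
toy = toy.join(parts)
```

Rows with fewer pieces than the longest row get missing values in the trailing columns, which matches the empty cells in the output above.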

### `info_0`

In [20]:
# Checking unique value counts
df["info_0"].value_counts()

                                                                       132946
Sir                                                                         6
92                                                                          2
Douglas Scott                                                               1
VC                                                                          1
Sir Woolwich West                                                           1
Sir Lord Justice of Appeal                                                  1
Dame MEP                                                                    1
83                                                                          1
Sir Governor-General                                                        1
Notable ice hockey players and coaches among the 44 killed in the :         1
Mike Alexander                                                              1
Colonel                                                         


#### Observations:
- The vast majority of rows have an empty string for this field.
- There is one row representing a group, rather than an individual, and we will drop it.
- We should verify the name and age information for the remainder of unique values in `info_0`.

#### Dropping Entry for Group

In [21]:
# Checking the entry representing a group
group_entry = df[
    df["info_0"]
    == "Notable ice hockey players and coaches among the 44 killed in the :"
]
group_entry

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
51130,7,2011 Lokomotiv Yaroslavl plane crash,Notable ice hockey players and coaches among the 44 killed in the :,https://en.wikipedia.org/wiki/2011_Lokomotiv_Yaroslavl_plane_crash,95,2011,September,Notable ice hockey players and coaches among the 44 killed in the :,,,,,,,,,,,



In [22]:
# Dropping group entry, resetting index, and checking new shape of df
df.drop(group_entry.index, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132974, 19)


#### Examining Rows with Atypical `info_0` Values

In [23]:
# Examining rows with atypical info_0 values
list_to_check = df["info_0"].value_counts().index.to_list()

verify_df = pd.DataFrame()
for item in list_to_check[1:]:
    verify_df = pd.concat([verify_df, df[df["info_0"] == item]])
verify_df

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
30349,25,Julian Bullard,"Sir , 78, British diplomat.",https://en.wikipedia.org/wiki/Julian_Bullard,3,2006,May,Sir,78,British diplomat,,,,,,,,,
31140,1,Kyffin Williams,"Sir , 88, Welsh artist, lung and prostate cancer.",https://en.wikipedia.org/wiki/Kyffin_Williams,15,2006,September,Sir,88,Welsh artist,lung and prostate cancer,,,,,,,,
34207,15,Jeremy Moore,"Sir , 79, British soldier, commander of UK land forces in the Falklands War.",https://en.wikipedia.org/wiki/Jeremy_Moore,3,2007,September,Sir,79,British soldier,commander of UK land forces in the Falklands War,,,,,,,,
43606,29,Derek Hodgkinson,"Sir , 92, British air chief marshal.",https://en.wikipedia.org/wiki/Derek_Hodgkinson,5,2010,January,Sir,92,British air chief marshal,,,,,,,,,
67212,7,Richard Best,"Sir , 80, British diplomat, Ambassador to Iceland .",https://en.wikipedia.org/wiki/Richard_Best_(diplomat),5,2014,March,Sir,80,British diplomat,Ambassador to Iceland,,,,,,,,
67217,7,Thomas Hinde,"Sir , 88, British novelist.",https://en.wikipedia.org/wiki/Thomas_Hinde_(novelist),13,2014,March,Sir,88,British novelist,,,,,,,,,
33086,28,David Turnbull,". 92, American materials scientist.",https://en.wikipedia.org/wiki/David_Turnbull_(materials_scientist),8,2007,April,92,American materials scientist,,,,,,,,,,
117387,11,Gotthilf Fischer,"92, German choral conductor .",https://en.wikipedia.org/wiki/Gotthilf_Fischer,4,2020,December,92,German choral conductor,,,,,,,,,,
67517,21,Colin Turner,"Sir Woolwich West, 92, British politician, MP for .",https://en.wikipedia.org/wiki/Colin_Turner,3,2014,March,Sir Woolwich West,92,British politician,MP for,,,,,,,,
67162,5,Robin Dunn,"Sir Lord Justice of Appeal, 96, British jurist, .",https://en.wikipedia.org/wiki/Robin_Dunn,3,2014,March,Sir Lord Justice of Appeal,96,British jurist,,,,,,,,,



#### Observations:
- The majority of rows contain additional aliases or titles within `info_0` that we do not need; we can leave them in place for now.
- There are a few rows that will need to be treated individually to correct the name value, as follows:
    1. Entry is for Mike Alexander whose band was Evile.
    2. Entry is for Herbert Wiere who performed slapstick.
    3. Entry is for Sarah-Jayne Mulvihill who was a Flight Lieutenant.
    4. Entry is for Douglas Scott who was killed by Demetreus Nix.
    5. Entry is for Kim Hwan-Sung who was a member of the band NRG.
- For these rows, we can replace the `name` value with the `info_0` value and hard-code the correct values for the `info_2` and `info_3` fields to match the Wikipedia pattern, while staying true to the information scraped.
- The row with "Nearly 3" value for `info_0` represents a group, rather than an individual, so will be dropped, after treating the above rows.
- We can proceed to extract age from `info_0` for the few rows that contain it here instead of in `info_1`.

#### Treating 5 rows with Name in `info_0`

In [24]:
# List of info_0 values that hold the person's name
values_lst = [
    "Mike Alexander",
    "Herbert Wiere",
    "Sarah-Jayne Mulvihill",
    "Douglas Scott",
    "Kim Hwan-Sung",
]


In [25]:
# For loop to copy name from info_0 to name
for i in df[df["info_0"].isin(values_lst)].index.to_list():
    df.loc[i, "name"] = df.loc[i, "info_0"]

# Hard-coding info_2 and info_3 values for Kim Hwan-Sung
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/NRG_(South_Korean_band)"
].index.to_list()
df.loc[index, "info_2"] = "South Korean musician"

df.loc[index, "info_3"] = "respiratory illness"

# Hard-coding info_2 and info_3 values for Douglas Scott
index = df[
    df["link"]
    == "https://en.wikipedia.org/w/index.php?title=Demetreus_Nix&action=edit&redlink=1"
].index.to_list()
df.loc[index, "info_2"] = "student"

df.loc[index, "info_3"] = "murdered"

# Hard-coding info_2 and info_3 values for Sarah-Jayne Mulvihill
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/Flight_Lieutenant"
].index.to_list()
df.loc[index, "info_2"] = "British servicewoman"

df.loc[index, "info_3"] = "killed in action"


In [26]:
# Re-checking the treated rows
df[df["info_0"].isin(values_lst)]

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
14035,5,Herbert Wiere,"Herbert Wiere, 91, Austrian-born American Wiere Brothers comedian, member of the",https://en.wikipedia.org/wiki/Slapstick,16,1999,August,Herbert Wiere,91,Austrian-born American Wiere Brothers comedian,member of the,,,,,,,,
15873,15,Kim Hwan-Sung,"Kim Hwan-Sung, 19, A Member of .",https://en.wikipedia.org/wiki/NRG_(South_Korean_band),27,2000,June,Kim Hwan-Sung,19,South Korean musician,respiratory illness,,,,,,,,
18312,20,Douglas Scott,"Douglas Scott, 20, High-school student murdered by .",https://en.wikipedia.org/w/index.php?title=Demetreus_Nix&action=edit&redlink=1,0,2001,June,Douglas Scott,20,student,murdered,,,,,,,,
30210,6,Sarah-Jayne Mulvihill,"Sarah-Jayne Mulvihill, 32, first British servicewoman to be killed in action in Iraq.",https://en.wikipedia.org/wiki/Flight_Lieutenant,12,2006,May,Sarah-Jayne Mulvihill,32,British servicewoman,killed in action,,,,,,,,
42140,5,Mike Alexander,"Mike Alexander, 32, British bassist , pulmonary embolism.",https://en.wikipedia.org/wiki/Evile,82,2009,October,Mike Alexander,32,British bassist,pulmonary embolism,,,,,,,,



#### Dropping Entry for Group

In [27]:
# Checking the entry representing a group
group_entry = df[df["info_0"] == "Nearly 3"]
group_entry

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
18908,11,were killed,"Nearly 3,000 people September 11 attacks in the , including:",https://en.wikipedia.org/wiki/Casualties_of_the_September_11_attacks,176,2001,September,Nearly 3,000 people September 11 attacks in the,including:,,,,,,,,,



In [28]:
# Dropping group entry, resetting index, and checking new shape of df
df.drop(group_entry.index, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132973, 19)


#### Extracting `age` from `info_0`

In [29]:
# For loop to extract age from info_0
for i, item in enumerate(df["info_0"]):
    pattern = r"\d"
    if re.search(pattern, item):
        df.loc[i, "age"] = int(item)

# Checking one of these rows
df[df["info_0"] == "92"]

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
33085,28,David Turnbull,". 92, American materials scientist.",https://en.wikipedia.org/wiki/David_Turnbull_(materials_scientist),8,2007,April,92,American materials scientist,,,,,,,,,,,92.0
117386,11,Gotthilf Fischer,"92, German choral conductor .",https://en.wikipedia.org/wiki/Gotthilf_Fischer,4,2020,December,92,German choral conductor,,,,,,,,,,,92.0



In [30]:
# Dropping info_0
df.drop("info_0", axis=1, inplace=True)


#### Observations:
- The new `age` column has been added successfully.
- We are finished processing `info_0`.

### `info_1`

#### Unique Values

In [31]:
# Checking unique values
df["info_1"].unique()

array(['86', '68', '87', '93', '79', '50', '88', '72', '81', '80', '90',
       '85', '92', '58', '54', '96', '49', '77', '76', '43', '35', '83',
       '31', '64', '57', '52', '84', '78', '70', '73', '67', '99', '33',
       '75', '66', '74', '62', '61', '82', '38', '47', '56', '91', '89',
       '94', '45', '65', '97', '63', '69', '37', '53', '46', '60', '26',
       '71', '25', '59', '39–40', '23', '95', '42', '32', '51', '41',
       '55', '44', '98', '39', '100', '27', '28', '40', '30', '48', '34',
       '29', '36', '111', '22', '104', '14', '21', '106', '105', '18',
       '101', '102', '20', '73-74', '84-85', '67-68', '85-86', '89-90',
       '103', '49-50', '87-88', '107', '48-49', '19', '71-72', '2',
       '82-83', '88-89', '11-12', '63-64', '64-65', '24', '17', '69-70',
       '52-53', '39-40', '80-81', '51-52', '70–71', '66–67', '59-60',
       '94-95', '15', '108', '76-77', '1995', '37-38',
       'American politician', '55/56', '87–88',
       '90 Swedish Olympic sprinte


#### Observations:
- There is a lot of variety in the format of the age data.
- Also, this field contains several values that we would expect in `info_2` and beyond.
- Let us take the approach of extracting age values first, then examining rows with missing `age`.

#### Examining unique formats for age data

In [32]:
# Examining formats for age data
has_num = []
for i, item in enumerate(df["info_1"]):
    pattern = r"\d"
    if re.search(pattern, item) is not None:
        has_num.append(i)

df.loc[has_num, "info_1"].unique()

array(['86', '68', '87', '93', '79', '50', '88', '72', '81', '80', '90',
       '85', '92', '58', '54', '96', '49', '77', '76', '43', '35', '83',
       '31', '64', '57', '52', '84', '78', '70', '73', '67', '99', '33',
       '75', '66', '74', '62', '61', '82', '38', '47', '56', '91', '89',
       '94', '45', '65', '97', '63', '69', '37', '53', '46', '60', '26',
       '71', '25', '59', '39–40', '23', '95', '42', '32', '51', '41',
       '55', '44', '98', '39', '100', '27', '28', '40', '30', '48', '34',
       '29', '36', '111', '22', '104', '14', '21', '106', '105', '18',
       '101', '102', '20', '73-74', '84-85', '67-68', '85-86', '89-90',
       '103', '49-50', '87-88', '107', '48-49', '19', '71-72', '2',
       '82-83', '88-89', '11-12', '63-64', '64-65', '24', '17', '69-70',
       '52-53', '39-40', '80-81', '51-52', '70–71', '66–67', '59-60',
       '94-95', '15', '108', '76-77', '1995', '37-38', '55/56', '87–88',
       '90 Swedish Olympic sprinter', '6', '86-87', '62/63', '79


#### Observations:
- There are some specific rows that likely need to be dropped, with the following values for `info_1`:
    - 1995
    - 1997
    - German Olympic sailor [1]
    - Taiwanese failed assassin in the 3-19 shooting incident
    - 255
    - 176
    - the first wild bear in Germany in 170 years
    - c. 3500
    - common chimpanzee 55
    - Maltese 15
    - c.1000
    - Tree of the Year 150
- The dataset contains values for non-human species, so we will be on the lookout to drop those rows.
- The age formatting patterns include "age", "age1-age2" (short dash), "age1–age2" (long dash), "age1/age2", "age ", "age.", "c. age", "age/age 2nd digit", "age or age", "age?", "early ages", "age days", "age-months", "c.age", "age+", and "age months".
- We will need to address ages in days and months prior to those in years.
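To make the variety of formats concrete, here is a small sketch of how a few of them could be normalized to a single number of years. It is illustrative only: the function name `parse_age_years` is hypothetical, and resolving a two-age range to its midpoint is an assumption of this sketch, not a decision stated in the notebook. Formats it does not cover (days, months, "age or age", "early ages", "age/age 2nd digit") return `None`.

```python
import re


def parse_age_years(text):
    """Return an age in years for a few simple formats, else None (hypothetical helper)."""
    # Drop approximation markers such as "c.", "?" and "+", then outer spaces and periods
    cleaned = re.sub(r"c\.\s*|[?+]", "", text).strip(" .")

    # Two ages separated by a hyphen, an en dash, or a slash
    match = re.fullmatch(r"(\d{1,3})\s*[-–/]\s*(\d{1,3})", cleaned)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        # Guard against abbreviated forms such as "62/3", which are not handled here
        if high < low:
            return None
        # Assumption of this sketch: use the midpoint of the range
        return (low + high) / 2

    # A single age of one to three digits
    if re.fullmatch(r"\d{1,3}", cleaned):
        return float(cleaned)

    return None


examples = ["86", "73-74", "39–40", "55/56", "c. 85", "1995", "American politician"]
parsed = {value: parse_age_years(value) for value in examples}
```

Four-digit values such as "1995" are deliberately rejected, since they look like years rather than ages.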

#### Examining Rows with Digits and Atypical Values for `info_1`

In [33]:
# List of atypical info_1 values for rows with digits
values_lst = [
    "1995",
    "1997",
    "German Olympic sailor [1]",
    "Taiwanese failed assassin in the 3-19 shooting incident",
    "255",
    "176",
    "the first wild bear in Germany in 170 years",
    "c. 3500",
    "common chimpanzee 55",
    "Maltese 15",
    "c.1000",
    "Tree of the Year 150",
]

df[df["info_1"].isin(values_lst)]

Unnamed: 0,day,name,info,link,num_references,year,month,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
6046,9,Khaptad Baba,", 1995, Nepalese spiritual saint.",https://en.wikipedia.org/wiki/Khaptad_Baba,6,1996,May,1995,Nepalese spiritual saint,,,,,,,,,,
6985,5,Wymberley D. Coerr,", 1995, American politician and diplomat.",https://en.wikipedia.org/wiki/Wymberley_D._Coerr,3,1996,October,1995,American politician and diplomat,,,,,,,,,,
10548,21,Pakoda Kadhar,", 1997, Indian actor.",https://en.wikipedia.org/wiki/Pakoda_Kadhar,1,1998,January,1997,Indian actor,,,,,,,,,,
22722,17,Klaus Oldendorff,", German Olympic sailor [1]",https://en.wikipedia.org/wiki/Klaus_Oldendorff,1,2003,March,German Olympic sailor [1],,,,,,,,,,,
25010,29,Chen Yi-hsiung,", Taiwanese failed assassin in the 3-19 shooting incident.",https://en.wikipedia.org/wiki/Chen_Yi-hsiung,4,2004,March,Taiwanese failed assassin in the 3-19 shooting incident,,,,,,,,,,,
29898,23,Adwaita,", 255 , tortoise claimant for world's oldest animal, reputedly a former pet of General Clive, liver failure.",https://en.wikipedia.org/wiki/Adwaita,4,2006,March,255,tortoise claimant for world's oldest animal,reputedly a former pet of General Clive,liver failure,,,,,,,,
30566,23,Harriet,", 176, Galápagos tortoise believed to be the third oldest animal in the world and allegedly owned by Charles Darwin, heart failure.",https://en.wikipedia.org/wiki/Harriet_(tortoise),10,2006,June,176,Galápagos tortoise believed to be the third oldest animal in the world and allegedly owned by Charles Darwin,heart failure,,,,,,,,,
30591,26,Bear JJ1,", the first wild bear in Germany in 170 years, shot to death.",https://en.wikipedia.org/wiki/Bear_JJ1,1,2006,June,the first wild bear in Germany in 170 years,shot to death,,,,,,,,,,
52996,16,The Senator,", c. 3500, American pond cypress tree, largest in the world, fire.",https://en.wikipedia.org/wiki/The_Senator_(tree),11,2012,January,c. 3500,American pond cypress tree,largest in the world,fire,,,,,,,,
55404,2,Oliver,", common chimpanzee 55, Congolese-born noted for his upright stature and humanlike traits.",https://en.wikipedia.org/wiki/Oliver_(chimpanzee),14,2012,June,common chimpanzee 55,Congolese-born noted for his upright stature and humanlike traits,,,,,,,,,,



#### Observations:
- Age data is either missing or the entry is for a member of a non-human species.
- We will drop all of these rows.

#### Dropping Rows for Non-Human Entries or Entries Missing Age Data

In [34]:
# List of indexes to be dropped
drop_rows = df[df["info_1"].isin(values_lst)].index.to_list()

# Dropping rows, resetting index, and checking new shape of df
df.drop(drop_rows, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132960, 19)


### Extracting `age` from `info_1`

- Recall that the age formatting patterns include "age", "age1-age2" (short dash), "age1–age2" (long dash), "age1/age2", "age ", "age.", "c. age", "age/age 2nd digit", "age or age", "age?", "early ages", "age days", "age-months", "c.age", "age+", and "age months".
- We should address age in days and months first.
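
As a rough illustration of how these format families differ, the sketch below sorts an age string into one of a few categories with regular expressions. The function name `classify_age_format`, the category labels, and the exact patterns are hypothetical and are not part of this notebook; the sample strings come from the formats listed above.

```python
import re

# Hypothetical labels paired with patterns; the first matching pattern wins,
# so days/months are tested before ranges ("3-months" also contains a dash).
AGE_FORMATS = [
    ("days_or_months", re.compile(r"^\d{1,3}(?: days|-months| months)$")),
    ("estimate", re.compile(r"^(?:c\.\s?\d{1,3}|\d{1,3}[?+])$")),
    ("range", re.compile(r"^\d{1,3}\s?(?:[-–/]|or)\s?\d{1,3}$")),
    ("decade", re.compile(r"^early \d{1,2}0s$")),
    ("single", re.compile(r"^\d{1,3}\.?\s?$")),
]


def classify_age_format(text):
    """Return the label of the first format that matches, or 'other'."""
    for label, pattern in AGE_FORMATS:
        if pattern.search(text):
            return label
    return "other"


for sample in ["86", "73-74", "39–40", "72/3", "c.50", "89?", "3-months", "early 80s"]:
    print(sample, "->", classify_age_format(sample))
```

Ordering the patterns from most to least specific mirrors the plan of handling days and months before the plain integer formats.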

#### Step 1: Age in Days and Months

In [35]:
# Dictionary of patterns for days and months formats as keys and factor to convert to years
patterns = {
    r"(\d{1,3})( days)": 365,
    r"(\d{1,3})(-months)": 12,
    r"(\d{1,3})( months)": 12,
}

# For loop to check for age in days or months, convert it to years, and save it in age.
# Also removes the age in days/months from info_1.
for key, value in patterns.items():
    for i, item in enumerate(df["info_1"]):
        match = re.search(key, item)
        if match:
            age = int(match.group(1)) / value
            df.loc[i, "age"] = age
            df.loc[i, "info_1"] = re.sub(key, "", df.loc[i, "info_1"])

# Checking updated rows
df[df["age"].notna()]

Unnamed: 0,day,name,info,link,num_references,year,month,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
28492,12,Susan Anne Catherine Torres,", 40 days, American baby born to Susan Torres, brain-dead woman, heart failure after intestinal surgery.",https://en.wikipedia.org/wiki/Susan_Torres,3,2005,September,,American baby born to Susan Torres,brain-dead woman,heart failure after intestinal surgery,,,,,,,,0.109589
30524,18,Chris and Cru Kahui,", 3-months, New Zealand child homicide victims.",https://en.wikipedia.org/wiki/Chris_and_Cru_Kahui,48,2006,June,,New Zealand child homicide victims,,,,,,,,,,0.25
33077,28,David Turnbull,". 92, American materials scientist.",https://en.wikipedia.org/wiki/David_Turnbull_(materials_scientist),8,2007,April,American materials scientist,,,,,,,,,,,92.0
59604,24,Kristján Jóhannsson,"83, Icelandic Olympic athlete.",https://en.wikipedia.org/wiki/Kristj%C3%A1n_J%C3%B3hannsson_(athlete),2,2013,January,Icelandic Olympic athlete,,,,,,,,,,,83.0
90456,28,Charlie Gard,", 11 months, British infant, subject of life support and parental rights case, MDDS.",https://en.wikipedia.org/wiki/Charlie_Gard_case,125,2017,July,,British infant,subject of life support and parental rights case,MDDS,,,,,,,,0.916667
117374,11,Gotthilf Fischer,"92, German choral conductor .",https://en.wikipedia.org/wiki/Gotthilf_Fischer,4,2020,December,German choral conductor,,,,,,,,,,,92.0



#### Observations:
- We have successfully captured the age in days and months values and converted them to years.
- The other rows are in place already from our treatment of `info_0`.
- Next, we will extract the entries with straightforward single integer age values, including the formats "age", "age ", and "age.".
- Apparent estimates are excluded here to allow closer examination, as they are more likely to be atypical entries.
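
The Step 1 conversion can also be written without an explicit row loop. The following is a minimal sketch on a toy frame, assuming the same `info_1` and `age` column names; the frame `toy`, the dictionary `units_per_year`, and the single anchored pattern are illustrative choices rather than the notebook's own code. Unlike the Step 1 patterns, this pattern is anchored to the whole value and would also accept a form such as "40-days".

```python
import pandas as pd

# Toy frame standing in for df; the first three values mirror the rows shown above
toy = pd.DataFrame({"info_1": ["40 days", "3-months", "11 months", "c. 79", "73-74"]})
toy["age"] = float("nan")

# Number of each unit in one year, mirroring the factors used in Step 1
units_per_year = {"days": 365, "months": 12}

# Named groups capture the number and the unit in one pass
pattern = r"^(?P<num>\d{1,3})[ -](?P<unit>days|months)$"
parts = toy["info_1"].str.extract(pattern)

# Rows that matched get an age in years and an emptied info_1
mask = parts["num"].notna()
toy.loc[mask, "age"] = parts.loc[mask, "num"].astype(int) / parts.loc[mask, "unit"].map(
    units_per_year
)
toy.loc[mask, "info_1"] = ""

print(toy)
```

Rows that do not match, such as "c. 79" and "73-74", keep their `info_1` value and a missing `age`, so later steps can still process them.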

#### Step 2: Age as Single Integer (Excluding Estimates)

In [36]:
# List of patterns for age formats with single integer for age
patterns = [r"^(\d{1,3})$", r"^(\d{1,3})\s", r"^(\d{1,3})\.\s"]

# For loop to check age pattern in info_1, save age to age column, and remove age from info_1
for i, item in enumerate(df["info_1"]):
    for pattern in patterns:
        match = re.search(pattern, item)
        if match:
            age = int(match.group(1))
            df.loc[i, "age"] = age
            df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])

# Checking first 2 rows
df.head(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,British dancer,ballet designer and director,,,,,,,,,86.0
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,Irish economist,writer,and academic,,,,,,,,68.0



In [37]:
# Checking last 2 rows
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
132958,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0
132959,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0



In [38]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

1014


#### Observations:
- The rows with single integer age data have been addressed.
- There are 1014 remaining missing values for `age`.
- Let us check the rows containing 'c.', '+', or '?' in the age information.
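
For reference, the three Step 2 patterns can be folded into a single expression. This is only a sketch: the helper name `extract_single_age` and the combined pattern are assumptions, not code from this notebook.

```python
import re

# The Step 2 alternatives in one pattern: the integer is followed by
# end of string, whitespace, or a period plus whitespace.
SINGLE_AGE = re.compile(r"^(\d{1,3})(?:$|\s|\.\s)")


def extract_single_age(text):
    """Return (age, remainder), or (None, text) if no plain integer age leads the string."""
    match = SINGLE_AGE.search(text)
    if match is None:
        return None, text
    return int(match.group(1)), text[match.end():]


print(extract_single_age("86"))     # plain integer
print(extract_single_age("c. 79"))  # estimate, left untouched for later review
print(extract_single_age("73-74"))  # range, left untouched for later review
```

Because the pattern requires the integer to be followed by whitespace, a period plus whitespace, or the end of the string, values such as "89?", "11+", and "1996" are left for the closer inspection described above.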

#### Entries with Age Data Containing 'c.', '+', or '?' for Estimate

In [39]:
values = ["c.", "+", "?"]

# List of rows to check
rows_to_check = []

# For loop to add index of rows containing values to list
for index in df[df["age"].isna()].index:
    if any(value in df.loc[index, "info_1"] for value in values):
        rows_to_check.append(index)

# Inspecting rows containing values
df.loc[rows_to_check, :]

Unnamed: 0,day,name,info,link,num_references,year,month,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
12922,22,Graciela Quan,", c. 79, Guatemalan lawyer and women's rights activist.",https://en.wikipedia.org/wiki/Graciela_Quan,11,1999,January,c. 79,Guatemalan lawyer and women's rights activist,,,,,,,,,,
25366,4,Tesfaye Gebre Kidan,", c. 69, Ethiopian general, defense minister and President of Ethiopia.",https://en.wikipedia.org/wiki/Tesfaye_Gebre_Kidan,10,2004,June,c. 69,Ethiopian general,defense minister and President of Ethiopia,,,,,,,,,
25446,18,Paul Johnson,", c. 49, American hostage, decapitated by al-Qaeda.",https://en.wikipedia.org/wiki/Paul_Marshall_Johnson_Jr.,9,2004,June,c. 49,American hostage,decapitated by al-Qaeda,,,,,,,,,
25447,18,Nek Mohammed,", c. 27, Pakistani tribal leader in Waziristan and key Taliban ally, killed by Pakistani military forces.",https://en.wikipedia.org/wiki/Nek_Muhammad_Wazir,10,2004,June,c. 27,Pakistani tribal leader in Waziristan and key Taliban ally,killed by Pakistani military forces,,,,,,,,,
26727,13,Earl Cameron,", 89?, Canadian broadcaster and anchor .",https://en.wikipedia.org/wiki/Earl_Cameron_(broadcaster),3,2005,January,89?,Canadian broadcaster and anchor,,,,,,,,,,
29744,28,Peter Snow,", c. 70, New Zealand doctor who discovered ""Tapanui flu"" .",https://en.wikipedia.org/wiki/Peter_Snow_(doctor),5,2006,February,c. 70,"New Zealand doctor who discovered ""Tapanui flu""",,,,,,,,,,
29840,15,Humphrey,", c. 17, British Chief Mouser to the Cabinet Office, .",https://en.wikipedia.org/wiki/Humphrey_(cat),25,2006,March,c. 17,British Chief Mouser to the Cabinet Office,,,,,,,,,,
30259,14,Paul Marco,", c. 81, American film actor .",https://en.wikipedia.org/wiki/Paul_Marco,1,2006,May,c. 81,American film actor,,,,,,,,,,
31174,6,Mohammed Taha Mohammed Ahmed,", c.50, Sudanese newspaper editor, beheaded.",https://en.wikipedia.org/wiki/Mohammed_Taha_Mohammed_Ahmed,0,2006,September,c.50,Sudanese newspaper editor,beheaded,,,,,,,,,
31456,11,Benito Martínez,", 126?, Cuban claimant to the title of world's oldest person.",https://en.wikipedia.org/wiki/Benito_Mart%C3%ADnez,4,2006,October,126?,Cuban claimant to the title of world's oldest person,,,,,,,,,,



#### Observations:
- Most of the entries are for people, but there are also one or more entries for the following:
    - carp
    - racehorse
    - chimpanzee
    - flamingo
    - cat
    - turkey
- We will proceed to check `info_2` for these values and drop these and other rows representing members of these other species.

#### Checking for Cat, Racehorse, Chimpanzee, Carp, Flamingo, and Turkey in `info_2`

In [40]:
# Defining pattern for re
species_pattern = r"\b(cat|racehorse|chimpanzee|carp|flamingo|turkey)\b"

# Empty list to collect indexes of rows containing pattern
rows_to_check = []

# For loop to add index of rows containing pattern to list
for index in df[df["info_2"].notna()].index:
    match = re.search(species_pattern, df.loc[index, "info_2"])
    if match:
        rows_to_check.append(index)

# Checking number of rows containing pattern
print(f"There are {len(rows_to_check)} rows containing these values.")

There are 468 rows containing these values.



#### Observations:
- There are sufficient rows to warrant checking species by species.
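
Before going species by species, it can help to see how the matches split across the species words. The sketch below tallies them with `collections.Counter` on a short list, `sample_info_2`, of `info_2` values copied from rows shown in this notebook; the list and the variable names are illustrative, and in practice the same tally would be run over the non-missing values of `df["info_2"]`.

```python
import re
from collections import Counter

species_pattern = re.compile(r"\b(cat|racehorse|chimpanzee|carp|flamingo|turkey)\b")

# A few info_2 values copied from rows shown in this notebook
sample_info_2 = [
    "Library cat",
    "American voice actor known for the voice of the cartoon cat Garfield",
    "Australian racehorse trainer",
    "Irish-bred racehorse and sire",
    "British common carp",
    "Guatemalan lawyer and women's rights activist",
]

# Count the first species word found in each value; values without a match are skipped
counts = Counter(
    match.group(1)
    for text in sample_info_2
    if (match := species_pattern.search(text))
)

print(counts.most_common())
```

Note that a match only means the word appears in the value; both the voice actor and the racehorse trainer are counted here, which is why each species is inspected by eye before any rows are dropped.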

#### Cat Entries per `info_2`

In [41]:
# Defining pattern for re
species_pattern = r"\b(cat)\b"

# Empty list to collect indexes of rows containing pattern
rows_to_check = []

# For loop to add index of rows containing pattern to list
for index in df[df["info_2"].notna()].index:
    match = re.search(species_pattern, df.loc[index, "info_2"])
    if match:
        rows_to_check.append(index)

# Inspecting rows with pattern
df.loc[rows_to_check, :]

Unnamed: 0,day,name,info,link,num_references,year,month,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
18611,4,Lorenzo Music,", 64, American voice actor known for the voice of the cartoon cat Garfield, complications related to lung and bone cancer.",https://en.wikipedia.org/wiki/Lorenzo_Music,6,2001,August,,American voice actor known for the voice of the cartoon cat Garfield,complications related to lung and bone cancer,,,,,,,,,64.0
31856,29,Dewey Readmore Books,", 19, Library cat, euthanized",https://en.wikipedia.org/wiki/Dewey_Readmore_Books,26,2006,November,,Library cat,euthanized,,,,,,,,,19.0
38128,11,Scarlett,", 13, American stray cat, name source of the Scarlett Award for Animal Heroism, animal euthanasia.",https://en.wikipedia.org/wiki/Scarlett_(cat),6,2008,October,,American stray cat,name source of the Scarlett Award for Animal Heroism,animal euthanasia,,,,,,,,13.0
39071,4,India,", 18, American pet cat of George W. Bush.",https://en.wikipedia.org/wiki/India_(cat),7,2009,January,,American pet cat of George W. Bush,,,,,,,,,,18.0
39689,20,Socks,", 19, American Presidential cat of the Clinton family, euthanized.",https://en.wikipedia.org/wiki/Socks_(cat),21,2009,February,,American Presidential cat of the Clinton family,euthanized,,,,,,,,,19.0
41344,27,Sybil,", 3, British Downing Street cat, Chief Mouser to the Cabinet Office , after short illness.",https://en.wikipedia.org/wiki/Sybil_(cat),5,2009,July,,British Downing Street cat,Chief Mouser to the Cabinet Office,after short illness,,,,,,,,3.0
47437,21,Prince Chunk,", 10, American obese cat, heart disease.",https://en.wikipedia.org/wiki/Prince_Chunk,8,2010,November,,American obese cat,heart disease,,,,,,,,,10.0
54952,5,Meow,", c. 2, American cat, heaviest cat at his time of death, lung failure.",https://en.wikipedia.org/wiki/Meow_(cat),12,2012,May,c. 2,American cat,heaviest cat at his time of death,lung failure,,,,,,,,
59825,4,Stewie,", 7–8, world's longest domestic cat, cancer.",https://en.wikipedia.org/wiki/Stewie_(cat),6,2013,February,7–8,world's longest domestic cat,cancer,,,,,,,,,
63272,7,Buurtpoes Bledder,", 1–2, domestic cat in Netherlands, motor accident.",https://en.wikipedia.org/wiki/Buurtpoes_Bledder,12,2013,August,1–2,domestic cat in Netherlands,motor accident,,,,,,,,,



#### Observations:
- There is one person represented (Lorenzo Music, a voice actor).
- We can proceed to drop the others, using the patterns "cat" at the end of the value, "cat of", and "cat in".
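
To see why these narrower patterns spare the voice actor while still catching the animal entries, the sketch below applies them to `info_2` values taken from the output above. The helper name `is_cat_entry` is hypothetical.

```python
import re

# "cat" at the very end of the value, or followed by "of" / "in"
drop_patterns = [re.compile(r"\bcat$"), re.compile(r"\b(cat of|cat in)\b")]


def is_cat_entry(info_2):
    """True if any drop pattern matches the info_2 value."""
    return any(pattern.search(info_2) for pattern in drop_patterns)


examples = [
    "Library cat",
    "American pet cat of George W. Bush",
    "domestic cat in Netherlands",
    "American voice actor known for the voice of the cartoon cat Garfield",
]
for text in examples:
    print(is_cat_entry(text), "|", text)
```

In "cartoon cat Garfield" the word is neither at the end of the value nor followed by "of" or "in", so that row is kept. Using `any` also means a value is flagged once even when more than one pattern matches it.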

#### Dropping Entries for Cats per `info_2`

In [42]:
# List of re patterns to find
patterns = [r"\bcat$", r"\b(cat of|cat in)\b"]

# List to collect indexes of rows to drop
rows_to_drop = []

# For loop to find re pattern and add index of rows with pattern to list
for index in df[df["info_2"].notna()].index:
    for pattern in patterns:
        match = re.search(pattern, df.loc[index, "info_2"])
        if match:
            rows_to_drop.append(index)

# Checking number of rows added to list for dropping
print(f"{len(rows_to_drop)} rows will be dropped.")

18 rows will be dropped.



In [43]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132942, 19)


#### Racehorse Entries per `info_2`

In [44]:
# Defining pattern for re
species_pattern = r"\bracehorse\b"

# Empty list to collect indexes of rows containing pattern
rows_to_check = []

# For loop to add index of rows containing pattern to list
for index in df[df["info_2"].notna()].index:
    match = re.search(species_pattern, df.loc[index, "info_2"])
    if match:
        rows_to_check.append(index)

# Inspecting rows with pattern
df.loc[rows_to_check, :].sample(10)

Unnamed: 0,day,name,info,link,num_references,year,month,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
12071,2,Tommy J. Smith,", 81, Australian racehorse trainer.",https://en.wikipedia.org/wiki/Tommy_J._Smith,14,1998,September,,Australian racehorse trainer,,,,,,,,,,81.0
88651,21,Cape Cross,", 23, Irish-bred racehorse and sire, euthanized.",https://en.wikipedia.org/wiki/Cape_Cross_(horse),3,2017,April,,Irish-bred racehorse and sire,euthanized,,,,,,,,,23.0
121975,24,John T. Ward Jr.,", 75, American racehorse trainer.",https://en.wikipedia.org/wiki/John_T._Ward_Jr.,1,2021,April,,American racehorse trainer,,,,,,,,,,75.0
104973,30,Deep Impact,", 17, Japanese champion racehorse and sire, euthanised.",https://en.wikipedia.org/wiki/Deep_Impact_(horse),3,2019,July,,Japanese champion racehorse and sire,euthanised,,,,,,,,,17.0
56568,13,Typhoon Tracy,", 6, Australian Thoroughbred racehorse, winner of the Coolmore Classic .",https://en.wikipedia.org/wiki/Typhoon_Tracy,5,2012,August,,Australian Thoroughbred racehorse,winner of the Coolmore Classic,,,,,,,,,6.0
71899,7,Solwhit,", 10, French-bred Irish-trained Thoroughbred racehorse, fall while training.",https://en.wikipedia.org/wiki/Solwhit,8,2014,November,,French-bred Irish-trained Thoroughbred racehorse,fall while training,,,,,,,,,10.0
114845,29,Subzero,", 31, Australian racehorse, Melbourne Cup winner , euthanized.",https://en.wikipedia.org/wiki/Subzero_(horse),8,2020,August,,Australian racehorse,Melbourne Cup winner,euthanized,,,,,,,,31.0
75243,24,Success Express,", 30, American Thoroughbred racehorse.",https://en.wikipedia.org/wiki/Success_Express,1,2015,April,,American Thoroughbred racehorse,,,,,,,,,,30.0
37304,22,Ballindaggin,", 23, American Thoroughbred racehorse, euthanized.",https://en.wikipedia.org/wiki/Ballindaggin_(horse),9,2008,July,,American Thoroughbred racehorse,euthanized,,,,,,,,,23.0
87897,13,Danehill Dancer,", 24, Irish-bred British-trained thoroughbred racehorse and sire, euthanized.",https://en.wikipedia.org/wiki/Danehill_Dancer,34,2017,March,,Irish-bred British-trained thoroughbred racehorse and sire,euthanized,,,,,,,,,24.0



#### Observations:
- There are several entries for people involved in the racehorse business.
- Rows whose `info_2` value ends in 'racehorse' or 'racehorse and sire' can be dropped.
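
The same end-of-value rule can be expressed as a boolean mask instead of a loop. The following is a minimal sketch on a toy frame; `toy`, the combined pattern, and the final row with a missing `info_2` are illustrative assumptions, while the other rows are taken from the sample above.

```python
import pandas as pd

# Toy frame; the last row is hypothetical and only shows how missing values behave
toy = pd.DataFrame(
    {
        "name": ["Tommy J. Smith", "Cape Cross", "Subzero", "John T. Ward Jr.", "(no info_2)"],
        "info_2": [
            "Australian racehorse trainer",
            "Irish-bred racehorse and sire",
            "Australian racehorse",
            "American racehorse trainer",
            None,
        ],
    }
)

# One pattern covers values ending in "racehorse" or "racehorse and sire"
pattern = r"\bracehorse(?: and sire)?$"

# na=False treats missing info_2 as "no match" so those rows are kept
is_racehorse = toy["info_2"].str.contains(pattern, regex=True, na=False)

kept = toy[~is_racehorse].reset_index(drop=True)
print(kept["name"].tolist())
```

Anchoring the pattern to the end of the value is what keeps the trainers, whose descriptions continue past the word "racehorse".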

#### Dropping Entries for Racehorses per `info_2`

In [45]:
# List of re patterns to find
patterns = [r"\bracehorse$", r"\b(racehorse and sire)$"]

# List to collect indexes of rows to drop
rows_to_drop = []

# For loop to find re pattern and add index of rows with pattern to list
for index in df[df["info_2"].notna()].index:
    for pattern in patterns:
        match = re.search(pattern, df.loc[index, "info_2"])
        if match:
            rows_to_drop.append(index)

# Checking number of rows added to list for dropping
len(rows_to_drop)

349


In [46]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132593, 19)


#### Chimpanzee, Flamingo, Carp, and Turkey Entries per `info_2`

In [47]:
# Defining pattern for re
species_pattern = r"\b(chimpanzee|flamingo|carp|turkey)\b"

# Empty list to collect indexes of rows containing pattern
rows_to_check = []

# For loop to add index of rows containing pattern to list
for index in df[df["info_2"].notna()].index:
    match = re.search(species_pattern, df.loc[index, "info_2"])
    if match:
        rows_to_check.append(index)

# Inspecting rows with pattern
df.loc[rows_to_check, :]

Unnamed: 0,day,name,info,link,num_references,year,month,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
34557,30,Washoe,", c.42, American-trained African-born chimpanzee believed to be first non-human to acquire human language, influenza.",https://en.wikipedia.org/wiki/Washoe_(chimpanzee),42,2007,October,c.42,American-trained African-born chimpanzee believed to be first non-human to acquire human language,influenza,,,,,,,,,
39599,16,Travis,", 13, American-born chimpanzee, television commercial animal, shot.",https://en.wikipedia.org/wiki/Travis_(chimpanzee),69,2009,February,,American-born chimpanzee,television commercial animal,shot,,,,,,,,13.0
41365,4,Benson,", c. 25, British common carp, voted as Britain's Favourite Carp .",https://en.wikipedia.org/wiki/Benson_(fish),9,2009,August,c. 25,British common carp,voted as Britain's Favourite Carp,,,,,,,,,
45231,1,Heather the Leather,", 50, British scaleless carp, old age.",https://en.wikipedia.org/wiki/Heather_the_Leather,5,2010,June,,British scaleless carp,old age,,,,,,,,,50.0
66269,30,Greater,", c. 83, Australian greater flamingo, world's oldest flamingo, euthanized.",https://en.wikipedia.org/wiki/Greater_(flamingo),12,2014,January,c. 83,Australian greater flamingo,world's oldest flamingo,euthanized,,,,,,,,
70953,26,Zelda,", 11+, American wild turkey, resident of New York City's Battery Park, traffic collision.",https://en.wikipedia.org/wiki/Zelda_(turkey),13,2014,September,11+,American wild turkey,resident of New York City's Battery Park,traffic collision,,,,,,,,
76076,22,Don Featherstone,", 79, American artist and inventor of the plastic pink flamingo, Lewy body dementia.",https://en.wikipedia.org/wiki/Don_Featherstone_(artist),10,2015,June,,American artist and inventor of the plastic pink flamingo,Lewy body dementia,,,,,,,,,79.0
92091,14,Little Mama,", c.79, African-born chimpanzee , oldest on record, kidney failure.",https://en.wikipedia.org/wiki/Little_Mama,3,2017,November,c.79,African-born chimpanzee,oldest on record,kidney failure,,,,,,,,



#### Observations:
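- All but one of these rows are for animals; the exception is Don Featherstone, the artist who invented the plastic pink flamingo.
- Note that the pattern used in the next cell matches 'flamingo' anywhere in `info_2`, so that row is dropped along with the animal entries.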
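
If the row for Don Featherstone (index 76076 in the output above) should be retained, one option is to pair the species pattern with a guard for words that describe a person. This is a hedged sketch only: `person_pattern`, its keyword list, and the helper `is_animal_entry` are assumptions for illustration, and any such list would need manual review against the full data.

```python
import re

species_pattern = re.compile(r"\b(chimpanzee|flamingo|carp|turkey)\b")

# Hypothetical guard: occupation words suggesting the entry describes a person
person_pattern = re.compile(r"\b(artist|inventor|actor|trainer)\b")


def is_animal_entry(info_2):
    """True if a species word is present and no person keyword is found."""
    has_species = species_pattern.search(info_2) is not None
    has_person = person_pattern.search(info_2) is not None
    return has_species and not has_person


examples = [
    "American-born chimpanzee",
    "British scaleless carp",
    "Australian greater flamingo",
    "American wild turkey",
    "American artist and inventor of the plastic pink flamingo",
]
for text in examples:
    print(is_animal_entry(text), "|", text)
```

With only a handful of matching rows, listing the indexes to drop by hand after inspection would be an equally reasonable choice.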

#### Dropping Entries for Chimpanzees, Flamingos, Carps, and Turkeys per `info_2`

In [48]:
# List of re patterns to find
patterns = [r"\b(chimpanzee|flamingo|carp|turkey)\b"]

# List to collect indexes of rows to drop
rows_to_drop = []

# For loop to find re pattern and add index of rows with pattern to list
for index in df[df["info_2"].notna()].index:
    for pattern in patterns:
        match = re.search(pattern, df.loc[index, "info_2"])
        if match:
            rows_to_drop.append(index)

# Checking number of rows added to list for dropping
len(rows_to_drop)

8


In [49]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_drop, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132585, 19)


#### Observations:
- With those non-human entries addressed, we can return to processing `info_1`.
- Let us address the remaining entries with '?' and 'c.', accepting the estimated age as `age`.
- The sole entry containing '+' in the age estimate has been removed.
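
The extraction of estimated ages can be sketched as a standalone function that returns the age as an `int`. The name `parse_estimated_age` and the constants are hypothetical; the function also strips leftover whitespace and leaves unmatched strings unchanged, which are choices made for this sketch rather than behaviour taken from the notebook.

```python
import re

ESTIMATE_MARKERS = ("c.", "?")
AGE_PATTERN = re.compile(r"\b(\d{1,3})\b")


def parse_estimated_age(text):
    """Return (age, remainder) for strings such as 'c. 79' or '89?'; (None, text) otherwise."""
    if not any(marker in text for marker in ESTIMATE_MARKERS):
        return None, text
    match = AGE_PATTERN.search(text)
    if match is None:
        return None, text
    # Remove the first number found, then the estimate markers
    remainder = AGE_PATTERN.sub("", text, count=1)
    for marker in ESTIMATE_MARKERS:
        remainder = remainder.replace(marker, "")
    # strip() is an addition in this sketch to avoid leaving a lone space
    return int(match.group(1)), remainder.strip()


for sample in ["c. 79", "c.50", "89?", "126?", "73-74"]:
    print(sample, "->", parse_estimated_age(sample))
```

Casting with `int` keeps the extracted value numeric, so it can be stored alongside the ages captured in the earlier steps without changing the column's type.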

#### Extracting `age` from `info_1` for Entries with Age Estimate Containing '?' or 'c.'

In [50]:
# List to identify rows
values = ["c.", "?"]

# Pattern for re
pattern = r"\b(\d{1,3})\b"

# For loop to find rows with values and pattern and extract age to age column and remove age from info_1
for i in df[df["age"].isna()].index:
    item = df.loc[i, "info_1"]

    if any(value in item for value in values):
        match = re.search(pattern, item)

        if match:
            age = match.group(1)
            df.loc[i, "age"] = age
            df.loc[i, "info_1"] = re.sub(pattern, "", df.loc[i, "info_1"])

        for value in values:
            df.loc[i, "info_1"] = df.loc[i, "info_1"].replace(value, "")

# Checking example rows
pd.concat([df[df["name"] == "Tuti Yusupova"], df[df["name"] == "Raed al Atar"]])

Unnamed: 0,day,name,info,link,num_references,year,month,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
74517,28,Tuti Yusupova,", 134?, Uzbekistani longevity claimant, unverified world's oldest person.",https://en.wikipedia.org/wiki/Tuti_Yusupova,5,2015,March,,Uzbekistani longevity claimant,unverified world's oldest person,,,,,,,,,134
70268,21,Raed al Atar,", c. 40, Palestinian militant , missile strike.",https://en.wikipedia.org/wiki/Raed_al_Atar,19,2014,August,,Palestinian militant,missile strike,,,,,,,,,40



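#### Observations:
- For the example rows, the estimated ages were extracted to `age` and removed from `info_1`.
- These two values display as 134 and 40 rather than 134.0 and 40.0, which suggests they were stored as strings because `match.group(1)` is not cast to `int`; `age` would need conversion to a numeric type before analysis.
- Next, we check the unique `info_1` values for rows still missing `age`.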

In [51]:
df[df["age"].isna()]["info_1"].unique()

array(['39–40', '73-74', '84-85', '67-68', '85-86', '89-90', '49-50',
       '87-88', '48-49', '71-72', '82-83', '88-89', '11-12', '63-64',
       '64-65', '69-70', '52-53', '39-40', '80-81', '51-52', '70–71',
       '66–67', '59-60', '94-95', '76-77', '37-38', 'American politician',
       '55/56', '87–88', '86-87', '62/63', '79-80', '1996', '100/101',
       '97/98', '91/92', '66-67', '66/67', '38-39', '82–83', '88–89', '',
       '92–93', '78/79', 'Prime Minister of Zaire', '85/86', '54/55',
       '50/51', '73/74', '67/68', '45–46', '34/35', '69–70', '75/76',
       'King of Nepal', '89/90', '72/3',
       'member of the then National Assembly of Pakistan and Union Minister of Labor',
       '87/88', '28/29', '90/91', '52/53', '69/70', '73/4', '72-73',
       '52–53', '85–86', '82/83', 'Jules Engel', '94/95', '75-76', '76/7',
       '38-40', '60–61', '43/4', 'Vanuatuan president', '53/54', '55–56',
       '81-82', '54/5', 'early 80s', '93/4', '86/7', 'early 40s', '46/47',
       '8


In [None]:
digits = [str(digit) for digit in np.arange(0, 10)]
rows_to_check = []
for index in df[df["age"].isna()].index:
    if any(digit in df.loc[index, "info_1"] for digit in digits):
        rows_to_check.append(index)
df.loc[rows_to_check, :].sample(10)

In [None]:
df[df["info_1"] == "80. New Zealand Maori leader"]

In [None]:
df[df["info_1"] == "11+"]