# Wikipedia Notable Life Expectancies

# [Notebook 2 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean_thanak_2022_06_13.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the wp_life_expect_raw_complete dataset
conn = sql.connect("wp_life_expect_raw_complete.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_raw_complete", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 133900 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
133898,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
133899,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
133417,May 2022,15,Șerban Valeca,", 65, Romanian engineer and politician, deputy (2000) and senator (2008–2020).",https://en.wikipedia.org/wiki/%C8%98erban_Valeca,1
53151,December 2011,29,Svein Krøvel,", 65, Norwegian cinematographer.",https://en.wikipedia.org/wiki/Svein_Kr%C3%B8vel,1
1577,September 1994,6,Nicky Hopkins,", 50, English pianist and organist.",https://en.wikipedia.org/wiki/Nicky_Hopkins,28
26092,August 2004,11,Bjarne Andersson,", 64, Swedish cross-county skier, Olympic silver medallist (1968).",https://en.wikipedia.org/wiki/Bjarne_Andersson,3
90387,June 2017,13,Jack Ong,", 76, American actor (, , ), brain tumor.",https://en.wikipedia.org/wiki/Jack_Ong,2



#### Observations:
- There are 133,900 rows and 6 columns.
- `month_year` contains the month and year of death, while `day` contains the day of the month of death.
- `name` is the notable person's name.  It is a nominal feature that will not be used for analysis, but will be maintained for any referencing needs.
- `info` contains multiple items including the notable person's "age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, (and) cause of death (if known)."
- `link` is the URL of the notable person's individual Wikipedia page.  If such a page does not exist, the link is either non-working (https://en.wikipedia.orgNone) or points to a page stating that no article exists for that individual.  `link` is a unique identifier for all entries except the 6 with the non-working link, and those 6 entries have names that are distinct from one another.
- `num_references` contains the number of references on the notable person's individual Wikipedia page.  This feature serves as a proxy measure of notability.
- Prior to EDA, our task will be to extract the individual elements that are combined in the `month_year` and `info` columns.

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133894 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



In [6]:
# Checking duplicate rows
df.duplicated().sum()

0


In [7]:
# Check percentage of null values by column (nulls divided by total rows)
df.isnull().mean() * 100

month_year        0.000000
day               0.000000
name              0.004481
info              0.000000
link              0.000000
num_references    0.000000
dtype: float64


In [8]:
# Checking number of missing values per row (not necessary here, but done to keep process standard)
df.isnull().sum(axis=1).value_counts()

0    133894
1         6
dtype: int64


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- As expected, we have 6 entries that are missing `name`, but the names can be found in their `info` values.
- All columns are currently of object type.  We will need to appropriately typecast them after separating the information in `month_year` and `info`.

## Data Cleaning

### Addressing Missing `name` Values

In [9]:
# Checking rows with missing name values
missing_name = df[df["name"].isna()]
missing_name

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- These rows vary from the main set as there is a substring containing the person's name at the start of the `info` string.
- As there are so few rows missing `name`, let us address this issue first.

In [10]:
# For loop to copy name value from info value and remove name from info value
treat_rows = missing_name.index
for i in treat_rows:
    info = df.loc[i, "info"]
    info_lst = info.split(sep=",", maxsplit=1)

    name = info_lst[0].strip()
    df.loc[i, "name"] = name
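    # Escape the name so it is matched literally, and remove only its first occurrence
    df.loc[i, "info"] = re.sub(re.escape(name), "", info, count=1).strip()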

# Re-check rows
df.loc[treat_rows, :]

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,Kevin Kowalcyk,", 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,Vincent Palmer,", 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,Barry Stigler,", 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,Nana Gualdi,", 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,Scott Workman,", 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,Thami Shobede,", 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- Missing `name` values have been addressed and those names have been removed from `info` values.

In [11]:
# Re-check info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133900 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



#### Observations:
- We have no remaining missing values.
- Let us treat the `month_year` column next.

### Separating `month` and `year`

In [12]:
# Separating month and year into 2 columns and typecasting year as integer
df.loc[:, "year"] = df["month_year"].apply(lambda x: x.split(sep=" ")[1].strip())
df["year"] = df["year"].apply(lambda x: int(x))

df.loc[:, "month"] = df["month_year"].apply(lambda x: x.split(sep=" ")[0])
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references,year,month
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January



In [13]:
# Dropping month_year column
df.drop("month_year", axis=1, inplace=True)
df.head(2)

Unnamed: 0,day,name,info,link,num_references,year,month
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January
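
The month/year split above can also be written without `apply`. A minimal sketch, assuming every `month_year` value has the form `"<Month> <year>"`; the small `demo` frame is hypothetical and only stands in for `df`:

```python
import pandas as pd

# Hypothetical stand-in for df, using month_year values in the scraped format
demo = pd.DataFrame({"month_year": ["January 1994", "June 2022"]})

# Split once on whitespace into two columns, then typecast year as integer
parts = demo["month_year"].str.split(expand=True)
demo["month"] = parts[0]
demo["year"] = parts[1].astype(int)

print(demo)
```

Splitting once and reusing `parts` avoids parsing each string twice.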



### Treating `info`

#### Checking a Sample

In [14]:
# Checking a sample of info
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
44467,24,Richard Gruenwald,", 93, Canadian politician, Alberta MLA for Lethbridge-West (1971–1975).",https://en.wikipedia.org/wiki/Richard_Gruenwald,3,2010,February
120555,13,Urs Jaeggi,", 89, Swiss sociologist, painter, and author.",https://en.wikipedia.org/wiki/Urs_Jaeggi,2,2021,February
28141,3,Teodoro Benigno,", 82, Filipino journalist.",https://en.wikipedia.org/wiki/Teodoro_Benigno,1,2005,June
33412,19,Anthony Brooks,", 85, British agent who led French Resistance saboteurs after the Normandy Invasion, stomach cancer.",https://en.wikipedia.org/wiki/Anthony_Brooks,11,2007,April
124043,1,Tram Iv Tek,", 72, Cambodian politician, MP (2003–2008, since 2018), minister of public works and transport (2008–2016) and posts and telecommunications (2016–...",https://en.wikipedia.org/wiki/Tram_Iv_Tek,1,2021,June



#### Observations:
- First, let us check for any rows that are missing digits, and therefore the age target, within `info` and remove them.
- Also, it would be helpful to remove information contained within parentheses, as we will not be using it.

#### Checking and Dropping Rows Lacking Digits (and therefore Age Data) within `info`

In [15]:
# For loop to extract index of rows without digits in info value
remove_lst = []
for index in df.index:
    pattern = r"\d"
    if re.search(pattern, df.loc[index, "info"]) is None:
        remove_lst.append(index)
print(len(remove_lst), "rows")
df.loc[remove_lst, :].sample(5)

925 rows


Unnamed: 0,day,name,info,link,num_references,year,month
65645,21,Ronny Coaches,", Ghanaian musician (Buk Bak), heart attack.",https://en.wikipedia.org/wiki/Ronny_Coaches,1,2013,November
31140,14,Tom Frame,", British comic book letterer, cancer.",https://en.wikipedia.org/wiki/Tom_Frame_(letterer),3,2006,July
45215,18,Abu Ayyub al-Masri,", Egyptian terrorist (al-Qaeda), airstrike.",https://en.wikipedia.org/wiki/Abu_Ayyub_al-Masri,42,2010,April
46104,24,Jean-Léonard Rugambage,", Rwandan journalist, shot.",https://en.wikipedia.org/wiki/Jean-L%C3%A9onard_Rugambage,7,2010,June
13330,7,Harris Lamb,", American football, basketball and track coach.",https://en.wikipedia.org/wiki/Harris_Lamb,4,1999,March



In [16]:
# Dropping rows missing age data
df.drop(remove_lst, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132975, 7)


#### Observations:
- 925 rows were removed as they lacked the target data for `age`.
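
The same digit check can be expressed as a boolean mask instead of a row-by-row loop. A minimal sketch on a hypothetical `demo` frame whose `info` values are copied from the samples above:

```python
import pandas as pd

# Hypothetical stand-in for df with three info values from the samples above
demo = pd.DataFrame(
    {
        "info": [
            ", 86, British dancer, ballet designer and director.",
            ", British comic book letterer, cancer.",
            ", Rwandan journalist, shot.",
        ]
    }
)

# Boolean mask: True where info contains at least one digit
has_digit = demo["info"].str.contains(r"\d", regex=True)

# Keep only rows with a digit and renumber the index
kept = demo[has_digit].reset_index(drop=True)
print(len(demo) - len(kept), "rows removed")
```

A digit in `info` does not guarantee that an age is present, so this is only a first filter.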

#### Removing Information within Parentheses in `info`

In [17]:
# Regular expression for parenthesis and its contents
pattern = r"\(.*\)"

# Subbing empty string for parentheses and stripping white space
df.loc[:, "info"] = df["info"].apply(lambda x: re.sub(pattern, "", x).strip())
df.loc[[21850, 125430], :]

Unnamed: 0,day,name,info,link,num_references,year,month
21850,11,Sir Michael Clapham,", 90, British industrialist, president of the Confederation of British Industry from 1972 to 1974.",https://en.wikipedia.org/wiki/Michael_Clapham_(industrialist),14,2002,November
125430,23,Jean-Luc Nancy,", 81, French philosopher.",https://en.wikipedia.org/wiki/Jean-Luc_Nancy,8,2021,August



#### Observations:
- Parentheses and the information within them have been removed from `info`.
- Next, we will follow the Wikipedia-defined fields to divide the `info` values.
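
Note that `.*` is greedy: when a value contains more than one parenthetical, the pattern `\(.*\)` matches from the first opening parenthesis to the last closing one, so the text between them is removed as well. The sketch below compares it with a narrower pattern, `\([^()]*\)`, which is offered here only as an illustration and is not the pattern used in this notebook. The sample string is taken from the `df.sample(5)` output in the Data Overview.

```python
import re

# info value with two parentheticals, copied from the sample output above
s = ", 65, Romanian engineer and politician, deputy (2000) and senator (2008–2020)."

# Greedy: spans from the first "(" to the last ")"
greedy = re.sub(r"\(.*\)", "", s)

# Narrower alternative: each match stops at the nearest ")"
narrow = re.sub(r"\([^()]*\)", "", s)

print(repr(greedy))
print(repr(narrow))
```

With the greedy pattern, "and senator" is lost along with the two parentheticals; the narrower pattern keeps it but leaves doubled spaces that would need separate cleanup.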

#### Splitting `info` on Commas into Separate Columns

In [18]:
# For loop to split info on commas and separate into respective new columns and removing leading/trailing white space and periods
for i, item in enumerate(df["info"]):
    info_lst = item.split(",")

    for j in range(len(info_lst)):
        df.loc[i, f"info_{j}"] = info_lst[j].strip(" .")

# Checking the first 2 rows
df.head(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,86,British dancer,ballet designer and director,,,,,,,,
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,68,Irish economist,writer,and academic,,,,,,,



In [19]:
# Checking the last 2 rows
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
132973,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,,69,Russian volleyball player,Olympic champion and coach,,,,,,,,
132974,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,86,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,



#### Observations:
- The `info` value is successfully divided and we can proceed through it column by column.
- We will check the set of values for the first two columns, for age.
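
For reference, the comma split can also be done in one vectorized step rather than assigning cell by cell. A minimal sketch on a hypothetical `demo` frame (the `info` values are copied from earlier samples); it is intended to give the same `info_0`, `info_1`, ... layout as the loop above:

```python
import pandas as pd

# Hypothetical stand-in for df
demo = pd.DataFrame(
    {
        "info": [
            ", 86, British dancer, ballet designer and director.",
            ", 65, Norwegian cinematographer.",
        ]
    }
)

# Split on commas into one column per field; shorter rows are padded with missing values
parts = demo["info"].str.split(",", expand=True)

# Strip leading/trailing spaces and periods from every field
parts = parts.apply(lambda col: col.str.strip(" ."))

# Name the new columns info_0, info_1, ... and attach them to the frame
parts.columns = [f"info_{j}" for j in range(parts.shape[1])]
demo = demo.join(parts)

print(demo.filter(like="info_"))
```

On a frame of this size, avoiding per-cell `df.loc` assignment should reduce run time considerably, though that is not measured here.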

#### `info_0`

In [20]:
# Checking unique value counts
df["info_0"].value_counts()

                                                                       132946
Sir                                                                         6
92                                                                          2
Douglas Scott                                                               1
VC                                                                          1
Sir Woolwich West                                                           1
Sir Lord Justice of Appeal                                                  1
Dame MEP                                                                    1
83                                                                          1
Sir Governor-General                                                        1
Notable ice hockey players and coaches among the 44 killed in the :         1
Mike Alexander                                                              1
Colonel                                                         


#### Observations:
- The vast majority of rows have an empty string for this field.
- There is one row representing a group, rather than an individual, and we will drop it.
- We should verify the name and age information for the remainder of unique values in `info_0`.

#### Dropping Entry for Group

In [21]:
# Checking the entry representing a group
group_entry = df[
    df["info_0"]
    == "Notable ice hockey players and coaches among the 44 killed in the :"
]
group_entry

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
51130,7,2011 Lokomotiv Yaroslavl plane crash,Notable ice hockey players and coaches among the 44 killed in the :,https://en.wikipedia.org/wiki/2011_Lokomotiv_Yaroslavl_plane_crash,95,2011,September,Notable ice hockey players and coaches among the 44 killed in the :,,,,,,,,,,,



In [22]:
# Dropping group entry
df.drop(group_entry.index, inplace=True)
df.reset_index(inplace=True, drop=True)

df.shape

(132974, 19)


#### Examining Rows with Atypical `info_0` Values

In [23]:
# Examining rows with atypical info_0 values
list_to_check = df["info_0"].value_counts().index.to_list()

verify_df = pd.DataFrame()
for item in list_to_check[1:]:
    verify_df = pd.concat([verify_df, df[df["info_0"] == item]])
verify_df

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
30349,25,Julian Bullard,"Sir , 78, British diplomat.",https://en.wikipedia.org/wiki/Julian_Bullard,3,2006,May,Sir,78,British diplomat,,,,,,,,,
31140,1,Kyffin Williams,"Sir , 88, Welsh artist, lung and prostate cancer.",https://en.wikipedia.org/wiki/Kyffin_Williams,15,2006,September,Sir,88,Welsh artist,lung and prostate cancer,,,,,,,,
34207,15,Jeremy Moore,"Sir , 79, British soldier, commander of UK land forces in the Falklands War.",https://en.wikipedia.org/wiki/Jeremy_Moore,3,2007,September,Sir,79,British soldier,commander of UK land forces in the Falklands War,,,,,,,,
43606,29,Derek Hodgkinson,"Sir , 92, British air chief marshal.",https://en.wikipedia.org/wiki/Derek_Hodgkinson,5,2010,January,Sir,92,British air chief marshal,,,,,,,,,
67212,7,Richard Best,"Sir , 80, British diplomat, Ambassador to Iceland .",https://en.wikipedia.org/wiki/Richard_Best_(diplomat),5,2014,March,Sir,80,British diplomat,Ambassador to Iceland,,,,,,,,
67217,7,Thomas Hinde,"Sir , 88, British novelist.",https://en.wikipedia.org/wiki/Thomas_Hinde_(novelist),13,2014,March,Sir,88,British novelist,,,,,,,,,
33086,28,David Turnbull,". 92, American materials scientist.",https://en.wikipedia.org/wiki/David_Turnbull_(materials_scientist),8,2007,April,92,American materials scientist,,,,,,,,,,
117387,11,Gotthilf Fischer,"92, German choral conductor .",https://en.wikipedia.org/wiki/Gotthilf_Fischer,4,2020,December,92,German choral conductor,,,,,,,,,,
67517,21,Colin Turner,"Sir Woolwich West, 92, British politician, MP for .",https://en.wikipedia.org/wiki/Colin_Turner,3,2014,March,Sir Woolwich West,92,British politician,MP for,,,,,,,,
67162,5,Robin Dunn,"Sir Lord Justice of Appeal, 96, British jurist, .",https://en.wikipedia.org/wiki/Robin_Dunn,3,2014,March,Sir Lord Justice of Appeal,96,British jurist,,,,,,,,,



#### Observations:
- Most of these rows contain additional aliases or titles within `info_0` that we don't need, but we can leave them in place for now.
- There are a few rows that will need to be treated individually to correct the name value, as follows:
    1. Entry is for Mike Alexander whose band was Evile.
    2. Entry is for Herbert Wiere who performed slapstick.
    3. Entry is for Sarah-Jayne Mulvihill who was a Flight Lieutenant.
    4. Entry is for Douglas Scott who was killed by Demetreus Nix.
    5. Entry is for Kim Hwan-Sung who was a member of the band NRG.
- We can replace the `name` value with the `info_0` value for these rows and hard-code the correct values for the `info_2` and `info_3` fields to match the Wikipedia pattern, while staying true to the information scraped.
- The row with the "Nearly 3" value for `info_0` represents a group rather than an individual, so it will be dropped after treating the rows above.
- We can then extract age from `info_0` for the few rows that contain it there instead of in `info_1`.

#### Treating 5 Rows with Name in `info_0`

In [24]:
values_lst = [
    "Mike Alexander",
    "Herbert Wiere",
    "Sarah-Jayne Mulvihill",
    "Douglas Scott",
    "Kim Hwan-Sung",
]


In [25]:
# For loop to copy name from info_0 to name
for i in df[df["info_0"].isin(values_lst)].index.to_list():
    df.loc[i, "name"] = df.loc[i, "info_0"]

# Hard-coding info_2 and info_3 values for Kim Hwan-Sung
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/NRG_(South_Korean_band)"
].index.to_list()
df.loc[index, "info_2"] = "South Korean musician"

df.loc[index, "info_3"] = "respiratory illness"

# Hard-coding info_2 and info_3 values for Douglas Scott
index = df[
    df["link"]
    == "https://en.wikipedia.org/w/index.php?title=Demetreus_Nix&action=edit&redlink=1"
].index.to_list()
df.loc[index, "info_2"] = "student"

df.loc[index, "info_3"] = "murdered"

# Hard-coding info_2 and info_3 values for Sarah-Jayne Mulvihill
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/Flight_Lieutenant"
].index.to_list()
df.loc[index, "info_2"] = "British servicewoman"

df.loc[index, "info_3"] = "killed in action"


In [26]:
df[df["info_0"].isin(values_lst)]

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
14035,5,Herbert Wiere,"Herbert Wiere, 91, Austrian-born American Wiere Brothers comedian, member of the",https://en.wikipedia.org/wiki/Slapstick,16,1999,August,Herbert Wiere,91,Austrian-born American Wiere Brothers comedian,member of the,,,,,,,,
15873,15,Kim Hwan-Sung,"Kim Hwan-Sung, 19, A Member of .",https://en.wikipedia.org/wiki/NRG_(South_Korean_band),27,2000,June,Kim Hwan-Sung,19,South Korean musician,respiratory illness,,,,,,,,
18312,20,Douglas Scott,"Douglas Scott, 20, High-school student murdered by .",https://en.wikipedia.org/w/index.php?title=Demetreus_Nix&action=edit&redlink=1,0,2001,June,Douglas Scott,20,student,murdered,,,,,,,,
30210,6,Sarah-Jayne Mulvihill,"Sarah-Jayne Mulvihill, 32, first British servicewoman to be killed in action in Iraq.",https://en.wikipedia.org/wiki/Flight_Lieutenant,12,2006,May,Sarah-Jayne Mulvihill,32,British servicewoman,killed in action,,,,,,,,
42140,5,Mike Alexander,"Mike Alexander, 32, British bassist , pulmonary embolism.",https://en.wikipedia.org/wiki/Evile,82,2009,October,Mike Alexander,32,British bassist,pulmonary embolism,,,,,,,,



#### Dropping Entry for Group

In [27]:
# Checking the entry representing a group
group_entry = df[df["info_0"] == "Nearly 3"]
group_entry

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11
18908,11,were killed,"Nearly 3,000 people September 11 attacks in the , including:",https://en.wikipedia.org/wiki/Casualties_of_the_September_11_attacks,176,2001,September,Nearly 3,000 people September 11 attacks in the,including:,,,,,,,,,



In [28]:
# Dropping group entry
df.drop(group_entry.index, inplace=True)
df.reset_index(inplace=True, drop=True)

df.shape

(132973, 19)


#### Extracting `age` from `info_0`

In [29]:
# For loop to extract age from info_0
for i, item in enumerate(df["info_0"]):
    pattern = r"\d"
    if re.search(pattern, item):
        df.loc[i, "age"] = int(item)

# Checking one of these rows
df[df["info_0"] == "92"]

Unnamed: 0,day,name,info,link,num_references,year,month,info_0,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
33085,28,David Turnbull,". 92, American materials scientist.",https://en.wikipedia.org/wiki/David_Turnbull_(materials_scientist),8,2007,April,92,American materials scientist,,,,,,,,,,,92.0
117386,11,Gotthilf Fischer,"92, German choral conductor .",https://en.wikipedia.org/wiki/Gotthilf_Fischer,4,2020,December,92,German choral conductor,,,,,,,,,,,92.0



In [30]:
# Dropping info_0
df.drop("info_0", axis=1, inplace=True)


#### Observations:
- The new `age` column has been added successfully.
- We are finished processing `info_0`.

In [None]:
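# Checking the different patterns of first substrings that start with digits
# Assumption: substring_set (undefined above) holds the first comma-delimited substring of each info value
substring_set = {item.lstrip(" ,.").split(",")[0] for item in df["info"]}
{item for item in substring_set if re.match(r"\d", item)}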

#### Observations:
- The patterns include "age", "age ", "age+", "age/", "age?", "age–", "age months", "age-" (distinct dash), "age-months", "age–months" (distinct dash), "ages", "age days", and "age.".
- We will need to address ages in months first.

#### Extracting `age` from `info` for `info` Values that Begin with "age,"

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age,'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + ","):
            df.loc[i, "age"] = int(age)
df.sample(5)

In [None]:
# Checking for remaining rows missing age
df["age"].isna().sum()

#### Observations:
- We were able to extract `age` for the vast majority of entries.
- Let us take a look at the remaining entries.

#### Checking Rows with Missing `age`

In [None]:
df[df["age"].isna()]

#### Observations:
- We can immediately see 2 apparent patterns.  The first is an age range with a hyphen and the second is age missing altogether.
- Let us do another iteration for the age-range pattern, accepting the lower value as age.

#### Extracting `age` from `info` for `info` Values that Begin with "age-"

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age-'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "-"):
            df.loc[i, "age"] = int(age)
df.sample(5)

In [None]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

#### Observations:
- The last iteration captured nearly 200 values.
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].sample(5)

#### Observations:
- It almost appears as if our iteration for `info` values starting with "age-" missed some values.
- Closer examination reveals that the dash character varies for the remaining values ("-" vs "–").
- So we will iterate again with the larger dash.
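
The two dashes look alike but are different characters, which is why a `startswith` check on one does not match the other. A quick way to tell them apart, assuming the longer dash in these values is the en dash (U+2013):

```python
import unicodedata

# Hyphen-minus (typed on a keyboard) and the longer dash, assumed here to be an en dash
dashes = ["-", "–"]

# Code point and official Unicode name of each character
described = [(f"U+{ord(d):04X}", unicodedata.name(d)) for d in dashes]
print(described)
```

Running the same check on a character copied from an unmatched `info` value would confirm which dash it actually contains.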

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age–' (longer dash)
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "–"):
            df.loc[i, "age"] = int(age)
df.sample(5)

In [None]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

#### Observations:
- That iteration captured over 400 values.
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

#### Observations:
- The next pattern starts with 'age/', so we will iterate again with it.

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age/'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "/"):
            df.loc[i, "age"] = int(age)
df.sample(5)

In [None]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

#### Observations:
- That iteration captured nearly 70 missing values.
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

#### Observations:
- The next pattern starts with 'age.', so we will iterate again with it.

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age.'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "."):
            df.loc[i, "age"] = int(age)
df.sample(5)

In [None]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

#### Observations:
- We captured a handful more missing values with that iteration.
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

#### Observations:
- There is at least one entry that escaped the net of that last iteration.  It can be hard-coded later.
- The next pattern starts with "age ", so we will iterate again with it.

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age '
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + " "):
            df.loc[i, "age"] = int(age)
df.sample(5)

In [None]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

#### Observations:
- That iteration captured 26 more missing values.
- Let us look again at the remaining rows.
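
The successive `startswith` passes above differ only in the character that follows the age. Below is a minimal sketch of the same idea as a single regular expression, assuming (as those loops do) that `info` begins with the age. `AGE_PATTERN` and `leading_age` are hypothetical names, and the age-range samples are made up for illustration. Unlike the loops, it accepts any 1 to 3 digit number rather than only 1 to 149.

```python
import re

# Leading age (1 to 3 digits) followed by one of the separators handled above:
# comma, hyphen, en dash, slash, period, or space
AGE_PATTERN = re.compile(r"^(\d{1,3})[,\-–/. ]")


def leading_age(info):
    """Return the leading age as an int, or None if info does not start with one."""
    match = AGE_PATTERN.match(info)
    return int(match.group(1)) if match else None


samples = [
    "86, British dancer",  # age followed by a comma
    "70-71, hypothetical age range",  # hyphen: lower value is kept
    "70–71, hypothetical age range",  # en dash: lower value is kept
    "British comic book letterer, cancer",  # no leading age
]
extracted = [leading_age(s) for s in samples]
print(extracted)
```

It could be applied with `df["info"].map(leading_age)`. Like the pass for "age ", it would also pick up the number in values such as "age months", so ages given in months or days still need separate handling.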

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

#### Observations:
- This sample shows entries that are missing age information, or in which the age is embedded in the middle of the string.
- Let us remove rows without any age information, as they will not add to the analysis.

In [None]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# List of index of entries to drop due to no age information
drop_list = []

# For loop to add index of entries to drop_list if no age information present
for i in df[df["age"].isna()].index:
    if not any(age in df.loc[i, "info"] for age in ages):
        drop_list.append(i)
df.loc[drop_list, :].sample(50)

In [None]:
# Checking the number of entries that will be dropped
print(len(drop_list))

# Dropping the rows without age information
df.drop(drop_list, axis=0, inplace=True)

In [None]:
# Checking df shape after dropping rows
df.shape

#### Observations:
- We have successfully dropped the entries lacking age data.
- Let us examine the remaining rows missing `age`.
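
One property of the substring check used for `drop_list` is worth keeping in mind: because `ages` includes the single-digit strings "1" through "9", `any(age in info for age in ages)` is true whenever `info` contains any nonzero digit, so a year or other number counts as age information. The sketch below illustrates this on hypothetical `info` values (none are from the dataset):

```python
import re

# Same range of candidate ages as in the notebook, built with the standard library
ages = [str(n) for n in range(1, 150)]

# Hypothetical info values, not taken from the dataset
samples = [
    ", Ghanaian musician, heart attack.",  # no digits at all
    ", British agent, died in 2007.",  # a year, not an age
    ", born 1900, hypothetical entry.",  # another year
    ", 0, hypothetical entry.",  # only a zero
]

# Substring check as used for drop_list
loop_result = [any(age in s for age in ages) for s in samples]

# Equivalent check: does the string contain any digit from 1 to 9?
regex_result = [re.search(r"[1-9]", s) is not None for s in samples]

print(loop_result, regex_result)
```

Rows kept by this check may therefore still lack a usable age and need the manual review that follows.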

#### Checking Rows with Missing `age`

In [None]:
# Checking rows still missing `age`
df[df["age"].isna()].sample(100)

#### Removing Leading and Trailing Commas, Whitespace, and Periods

In [None]:
# Removing the leading/trailing commas, periods, and whitespace
df.loc[:, "info"] = df["info"].apply(lambda x: x.strip(" ,."))
df.sample(5)