# Wikipedia Notable Life Expectancies

# [Notebook 2 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean_thanak_2022_06_13.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the wp_life_expect_raw_complete dataset
conn = sql.connect("wp_life_expect_raw_complete.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_raw_complete", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 133900 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
133898,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
133899,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
1108,June 1994,18,Roger Lebel,", 71, Canadian actor.",https://en.wikipedia.org/wiki/Roger_Lebel,0
70246,July 2014,8,Maxine Cochran,", 87, Canadian politician, Nova Scotia Minister of Transport (1985–1988) and MLA for Lunenburg (1984–1988).",https://en.wikipedia.org/wiki/Maxine_Cochran,3
4202,August 1995,25,Doug Stegmeyer,", 43, American rock bassist and vocalist, suicide by gunshot.",https://en.wikipedia.org/wiki/Doug_Stegmeyer,5
57628,September 2012,13,Edgar Metcalfe,", 78, English actor, director and writer, liver cancer.",https://en.wikipedia.org/wiki/Edgar_Metcalfe,20
94594,January 2018,23,Francisco Moreno Martínez,", 86, Spanish cyclist.",https://en.wikipedia.org/wiki/Francisco_Moreno_Mart%C3%ADnez,3



#### Observations:
- There are 133,900 rows and 6 columns.
- `month_year` contains the month and year of death, while `day` contains the day of the month of death.
- `name` is the notable person's name.  It is a nominal feature that will not be used for analysis, but will be maintained for any referencing needs.
- `info` contains multiple items including the notable person's "age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, (and) cause of death (if known)."
- `link` is the URL of the notable person's individual Wikipedia page.  If no such page exists, the link is either non-working (https://en.wikipedia.orgNone) or points to a page stating that no article exists for that individual.  `link` uniquely identifies every entry except the 6 with the non-working link, and those 6 entries have names that are unique among themselves.
- `num_references` contains the number of references on the notable person's individual Wikipedia page.  This feature serves as a proxy measure of notability.
- Prior to EDA, our task will be to extract the individual elements that are combined in the `month_year` and `info` columns.

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133894 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



In [6]:
# Checking duplicate rows
df.duplicated().sum()

0


In [7]:
# Check percentage of null values by column (relative to the total number of rows)
df.isnull().sum() / len(df) * 100

month_year       0.000
day              0.000
name             0.004
info             0.000
link             0.000
num_references   0.000
dtype: float64


In [8]:
# Checking number of missing values per row (not necessary here, but done to keep process standard)
df.isnull().sum(axis=1).value_counts()

0    133894
1         6
dtype: int64


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- As expected, we have 6 entries that are missing `name`, but the names can be recovered from their `info` values.
- All columns are currently of object type.  We will need to appropriately typecast them after separating the information in `month_year` and `info`.
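A percentage of missing values is usually taken over all rows rather than over the non-null count. The sketch below shows one way to compute it on a small made-up frame; `toy` and its values are hypothetical and only stand in for the notebook's `df`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df (values are made up)
toy = pd.DataFrame(
    {
        "name": ["A", None, "C", "D"],
        "info": ["w", "x", "y", "z"],
    }
)

# isna() gives booleans; their mean is the fraction of missing rows per column
pct_missing = toy.isna().mean() * 100

# Number of missing values per row, tallied across the frame
missing_per_row = toy.isna().sum(axis=1).value_counts()

print(pct_missing)
print(missing_per_row)
```

With only a handful of missing values in a large table, the two denominators give nearly the same figure, so the choice matters mainly when a column is sparsely populated.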

## Data Cleaning

### Addressing Missing `name` Values

In [9]:
# Checking rows with missing name values
missing_name = df[df["name"].isna()]
missing_name

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- These rows differ from the main set in that the `info` string starts with a substring containing the person's name.
- As there are so few rows missing `name`, let us address this issue first.

In [10]:
# For loop to copy name value from info column
treat_rows = missing_name.index
for i in treat_rows:
    info = df.loc[i, "info"]
    info_lst = info.split(sep=",", maxsplit=1)
    df.loc[i, "name"] = info_lst[0]

# Re-check rows
df.loc[treat_rows, :]

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,Kevin Kowalcyk,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,0
24985,January 2004,22,Vincent Palmer,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,0
27458,March 2005,1,Barry Stigler,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,0
34077,July 2007,11,Nana Gualdi,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,0
64769,September 2013,29,Scott Workman,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,0
106613,September 2019,12,Thami Shobede,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,0



#### Observations:
- Missing `name` values have been addressed.
- The names still appear in the `info` value for these rows, but we can address that as we separate the information in that column.
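For reference, the same fill can be written without an explicit loop. This is only a sketch on made-up rows (`toy` is hypothetical), assuming, as in the rows above, that the name is everything before the first comma of `info`.

```python
import pandas as pd

# Hypothetical rows: one with a name, one where the name sits at the start of info
toy = pd.DataFrame(
    {
        "name": ["William Chappell", None],
        "info": [
            ", 86, British dancer, ballet designer and director.",
            "Vincent Palmer, 37, British criminal.",
        ],
    }
)

# Select only the rows with a missing name
mask = toy["name"].isna()

# Take the text before the first comma of info and use it as the name
toy.loc[mask, "name"] = (
    toy.loc[mask, "info"].str.split(",", n=1).str[0].str.strip()
)

print(toy[["name", "info"]])
```

Restricting the assignment to `mask` leaves rows that already have a name untouched.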

In [11]:
# Re-check info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133900 entries, 0 to 133899
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133900 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  133900 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



#### Observations:
- We have no remaining missing values.
- Let us treat the `month_year` column next.

### Separating `month` and `year`

In [12]:
# Separating month and year into 2 columns and typecasting year as integer
df.loc[:, "year"] = df["month_year"].apply(lambda x: x.split(sep=" ")[1])
df["year"] = df["year"].astype("int64")

df.loc[:, "month"] = df["month_year"].apply(lambda x: x.split(sep=" ")[0])
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references,year,month
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January



In [13]:
# Dropping month_year column
df.drop("month_year", axis=1, inplace=True)
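
The split and drop above can also be expressed with pandas string methods instead of `apply`. The following is a sketch on a made-up frame (`toy` is hypothetical), assuming every `month_year` value has the form "Month YYYY".

```python
import pandas as pd

# Hypothetical stand-in for the month_year column
toy = pd.DataFrame({"month_year": ["January 1994", "June 2022"]})

# Split once on the space; expand=True returns one column per piece
parts = toy["month_year"].str.split(" ", n=1, expand=True)
toy["month"] = parts[0]
toy["year"] = parts[1].astype("int64")

# Drop the combined column without modifying the frame in place
toy = toy.drop(columns="month_year")

print(toy)
```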


### Treating `info`

#### Checking a Sample

In [14]:
# Checking a sample of info
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
94287,10,William B. Keene,", 92, American judge (Los Angeles County Superior Court) and television personality ().",https://en.wikipedia.org/wiki/William_B._Keene,4,2018,January
130789,31,Flemming Quist Møller,", 79, Danish animator (, ) and screenwriter (), heart attack.",https://en.wikipedia.org/wiki/Flemming_Quist_M%C3%B8ller,8,2022,January
56478,6,Betty Buehler,", 90, American actress.",https://en.wikipedia.org/wiki/Betty_Buehler,5,2012,July
108605,21,Deng Hongxun,", 88, Chinese politician and engineer, Communist Party Secretary of Hainan (1990–1993).",https://en.wikipedia.org/wiki/Deng_Hongxun,5,2019,December
23394,14,Robert Stack,", 84, American film and television actor.",https://en.wikipedia.org/wiki/Robert_Stack,29,2003,May



#### Observations:
- We have leading and trailing characters of comma, period, and white space that can be removed.
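As a small illustration of how such characters can be trimmed, Python's `str.strip` accepts a set of characters and removes any run of them from both ends only. The sample strings below are taken from or modelled on the rows shown in this notebook.

```python
samples = [
    ", 86, British dancer, ballet designer and director.",
    ", 71, Canadian actor.",
    "Notable people killed in the \n",
]

# Strip spaces, commas, and periods from both ends; interior characters are kept
cleaned = [s.strip(" ,.") for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Note that a trailing newline is not in the character set, so the third string is left unchanged, including the space before the newline.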

#### Removing Leading and Trailing Commas, Whitespace, and Periods

In [15]:
# Removing the leading/trailing commas, periods, and whitespace
df.loc[:, "info"] = df["info"].apply(lambda x: x.strip(" ,."))
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month
26588,3,Janet Backhouse,"66, English manuscripts curator at the British Museum, cancer",https://en.wikipedia.org/wiki/Janet_Backhouse,21,2004,November
63810,4,Betty Babcock,"91, American politician, First Lady of Montana (1962–1969), member of Montana House of Representatives (1975–1977)",https://en.wikipedia.org/wiki/Betty_Babcock,4,2013,August
88611,11,András Kovács,"91, Hungarian filmmaker",https://en.wikipedia.org/wiki/Andr%C3%A1s_Kov%C3%A1cs_(film_director),7,2017,March
118112,6,Eugenia Tsoumani-Spentza,"Greek lawyer and politician, MP (2009–2012)",https://en.wikipedia.org/wiki/Eugenia_Tsoumani-Spentza,1,2020,December
86247,17,Steve Truglia,"54, British stuntman (, , ), fall",https://en.wikipedia.org/wiki/Steve_Truglia,7,2016,November



#### Checking the Set of First Substrings (Before First Comma)

In [16]:
# Checking values in the first substring of info (before the first comma)
age_set = set(df["info"].apply(lambda x: x.split(sep=",", maxsplit=1)[0]))
age_set

{'Afghan presidential adviser',
 'French historian and politician',
 'North Korean head of the Unhasu Orchestra',
 '92',
 'British journalist and business executive',
 '98–99',
 'Rwandan journalist',
 'American intelligence officer',
 'c. 66',
 'French doctor and drug test pioneer',
 'Malian actor and comedian',
 '96–97',
 'Indian actor (',
 'Vietnamese politician',
 '40',
 '67–68',
 '42',
 '57-58',
 'American bongo player',
 '79/80',
 'Syrian general',
 'Georgian politician and deputy governor of the Kvemo Kartli region',
 'South African psychotherapist and television production manager (',
 'Burkinabe actor (',
 'common chimpanzee 55',
 'Malawian diplomat involved in leaked diplomatic cable controversy',
 '(岡本喜八)',
 '1–2',
 'Brazilian female pilot',
 '72/73',
 'Sri Lankan lawyer',
 '57–58',
 'Indian Maoist militant',
 'Chilean forester and academic',
 'American librarian and civil rights activist',
 '(Bruno the Bear)',
 '68',
 'American artist',
 'c. 96',
 'French archaeologist',
 'B


#### Observations:
- Though we can see the age values, there is a lot of other information in the first substring of the `info` value.
- Some entries may be missing the age information.
- Also, age is entered in various formats.
- Let us see how many variations there are.

In [17]:
# Checking number of different values for first substring of info column
len(age_set)

1412


#### Observations:
- The first substring of `info`, where age normally appears, takes 1,412 distinct values.
- We can start by extracting the age for those entries that begin with the age value for `info`.
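One possible way to pull out a leading "age," value in a single vectorized pass is a regular expression with `Series.str.extract`. This is a sketch rather than the notebook's method: `toy` is a made-up frame, and the upper bound of 149 is an assumption chosen to mirror the range of ages considered here.

```python
import pandas as pd

# Hypothetical info values covering a match, a range, and a missing age
toy = pd.DataFrame(
    {
        "info": [
            "86, British dancer",
            "39–40, Irish republican paramilitary leader",
            "Canadian actress and jazz singer",
            "100, French saxophonist",
        ]
    }
)

# Capture 1 to 3 leading digits (no leading zero) only when a comma follows
extracted = toy["info"].str.extract(r"^([1-9]\d{0,2}),", expand=False)

# Non-matching rows stay missing; matched strings become floats
toy["age"] = extracted.astype("float64")

# Keep only ages below 150 (assumed bound), blanking anything larger
toy["age"] = toy["age"].where(toy["age"] < 150)

print(toy)
```

Rows whose `info` starts with a range or with no number at all remain `NaN`, so they can be inspected separately afterwards.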

#### Extracting `age` from `info` for `info` Values that Begin with "age,"

In [18]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age,'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + ","):
            df.loc[i, "age"] = int(age)
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,age
45561,13,Peter Provan,"73, Australian rugby league footballer, Balmain Tigers premiership captain (1969), after long illness",https://en.wikipedia.org/wiki/Peter_Provan,5,2010,May,73.0
32198,23,Ştefan Haukler,"64, Romanian Olympic fencer",https://en.wikipedia.org/wiki/%C5%9Etefan_Haukler,3,2006,November,64.0
65594,18,Forrest Claunch,"73, American politician, pancreatic cancer",https://en.wikipedia.org/wiki/Forrest_Claunch,3,2013,November,73.0
19903,18,Marcel Mule,"100, French saxophonist",https://en.wikipedia.org/wiki/Marcel_Mule,1,2001,December,100.0
73477,23,Norman Wray,"91, American Roman Catholic missionary, Alzheimer's disease",https://en.wikipedia.org/wiki/Norman_Wray,9,2014,December,91.0



In [19]:
# Checking for remaining rows missing age
df["age"].isna().sum()

2025


#### Observations:
- We were able to extract `age` for the vast majority of entries.
- Let us take a look at the remaining entries.

#### Checking Rows with Missing `age`

In [20]:
df[df["age"].isna()]

Unnamed: 0,day,name,info,link,num_references,year,month,age
268,10,Dominic McGlinchey,"39–40, Irish republican paramilitary leader, shot",https://en.wikipedia.org/wiki/Dominic_McGlinchey,418,1994,February,
900,20,Fernande Giroux,Canadian actress and jazz singer,https://en.wikipedia.org/wiki/Fernande_Giroux,5,1994,May,
1029,6,Peter Graves,"English actor and nobleman, heart attack","https://en.wikipedia.org/wiki/Peter_Graves,_8th_Baron_Graves",5,1994,June,
1895,24,1994 Colombo suicide attack,Notable people killed in the \n,"https://en.wikipedia.org/wiki/List_of_attacks_attributed_to_the_LTTE,_1990s#1994",41,1994,October,
2318,1,Nina Leen,Russian-born American photographer for,https://en.wikipedia.org/wiki/Nina_Leen,3,1995,January,
...,...,...,...,...,...,...,...,...
133602,24,Khyongla Rato,"98–99, Tibetan Buddhist scholar, founder of The Tibet Center",https://en.wikipedia.org/wiki/Khyongla_Rato,20,2022,May,
133816,4,Mike Omotosho,Nigerian politician,https://en.wikipedia.org/wiki/Mike_Omotosho,10,2022,June,
133830,4,Fei Liang,"85–86, Chinese herpetologist",https://en.wikipedia.org/wiki/Fei_Liang,5,2022,June,
133837,5,Roman Kutuzov,Russian major general,https://en.wikipedia.org/wiki/Roman_Kutuzov_(general),10,2022,June,



#### Observations:
- We can immediately see 2 apparent patterns.  The first is an age range with a hyphen and the second is age missing altogether.
- Let us do another iteration for the age-range pattern, accepting the lower value as age.

#### Extracting `age` from `info` for `info` Values that Begin with "age-"

In [21]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age-'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "-"):
            df.loc[i, "age"] = int(age)
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,age
70621,28,Alakbar Mammadov,"84, Azerbaijani Soviet football player and manager",https://en.wikipedia.org/wiki/Alakbar_Mammadov,8,2014,July,84.0
116366,27,Barclay Palmer,"88, English Olympic shot putter (1956)",https://en.wikipedia.org/wiki/Barclay_Palmer,2,2020,September,88.0
35922,27,Irene Stegun,"88, American mathematician",https://en.wikipedia.org/wiki/Irene_Stegun,3,2008,January,88.0
53612,21,Slavko Ziherl,"66, Slovenian psychiatrist and politician",https://en.wikipedia.org/wiki/Slavko_Ziherl,1,2012,January,66.0
49782,16,Chinesinho,"76, Brazilian footballer, Alzheimer's disease",https://en.wikipedia.org/wiki/Chinesinho,5,2011,April,76.0



In [22]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

1841


#### Observation:
- The last iteration captured nearly 200 values.
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [23]:
# Checking rows still missing `age`
df[df["age"].isna()].sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,age
11996,3,Ahmet Kayhan Dede,Turkish Sufi master,https://en.wikipedia.org/wiki/Ahmet_Kayhan_Dede,0,1998,August,
41319,12,Annesley Dias,Sri Lankan comedian,https://en.wikipedia.org/wiki/Annesley_Dias,9,2009,June,
124542,21,Haribhushan,"Indian politician and guerrilla, COVID-19",https://en.wikipedia.org/wiki/Haribhushan,6,2021,June,
70914,13,Hani Abbadi,"Jordanian politician, member of the House of Representatives (1993–1997)",https://en.wikipedia.org/wiki/Hani_Abbadi,1,2014,August,
90317,10,Abu Khattab al-Tunisi,"Tunisian jihadist, shot",https://en.wikipedia.org/wiki/Abu_Khattab_al-Tunisi,4,2017,June,



#### Observations:
- It appears that our iteration for `info` values starting with "age-" missed some entries.
- Closer examination reveals that the dash character varies among the remaining values: a hyphen ("-") versus an en dash ("–").
- So we will iterate again with the en dash.
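Look-alike dash characters are easy to tell apart by inspecting their code points. The sketch below uses the standard-library `unicodedata` module on two made-up strings in which the third character is the separator.

```python
import unicodedata

# Hypothetical info values: the first uses a hyphen-minus, the second an en dash
samples = ["57-58, example entry", "57–58, example entry"]

names = []
for s in samples:
    sep = s[2]  # the character right after the two-digit lower bound
    names.append(unicodedata.name(sep))
    print(repr(sep), hex(ord(sep)), unicodedata.name(sep))
```

The first separator is U+002D (HYPHEN-MINUS) and the second is U+2013 (EN DASH), which is why a `startswith` test written with one of them does not match the other.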

In [24]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age–' (en dash)
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "–"):
            df.loc[i, "age"] = int(age)
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,age
55084,12,Vladimir Astapovsky,"65, Soviet Olympic bronze medal-winning (1976) footballer",https://en.wikipedia.org/wiki/Vladimir_Astapovsky,2,2012,April,65.0
30048,16,Dennis Kirkland,"63, British television producer and director, after a short illness",https://en.wikipedia.org/wiki/Dennis_Kirkland,0,2006,February,63.0
48918,6,Gary Moore,"58, Irish rock guitarist and singer (Thin Lizzy), heart attack",https://en.wikipedia.org/wiki/Gary_Moore,134,2011,February,58.0
127651,12,Zinaida Korneva,"99, Russian military veteran and charity fundraiser",https://en.wikipedia.org/wiki/Zinaida_Korneva,36,2021,October,99.0
47932,20,Jim Yardley,"64, English cricketer",https://en.wikipedia.org/wiki/Jim_Yardley_(cricketer),1,2010,November,64.0



In [25]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

1374


#### Observation:
- That iteration captured over 400 values.
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [26]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

Unnamed: 0,day,name,info,link,num_references,year,month,age
900,20,Fernande Giroux,Canadian actress and jazz singer,https://en.wikipedia.org/wiki/Fernande_Giroux,5,1994,May,
1029,6,Peter Graves,"English actor and nobleman, heart attack","https://en.wikipedia.org/wiki/Peter_Graves,_8th_Baron_Graves",5,1994,June,
1895,24,1994 Colombo suicide attack,Notable people killed in the \n,"https://en.wikipedia.org/wiki/List_of_attacks_attributed_to_the_LTTE,_1990s#1994",41,1994,October,
2318,1,Nina Leen,Russian-born American photographer for,https://en.wikipedia.org/wiki/Nina_Leen,3,1995,January,
2506,22,Henry Gladstone,"American radio newscaster and actor, heart failure",https://en.wikipedia.org/wiki/Henry_Gladstone,5,1995,January,
2902,11,Ernest Kabushemeye,"Burundian politician and the Minister for Mines and Energy, assassinated",https://en.wikipedia.org/wiki/Ernest_Kabushemeye,2,1995,March,
3296,30,Christopher Chadman,"American dancer and choreographer, AIDS-related complications",https://en.wikipedia.org/wiki/Christopher_Chadman,2,1995,April,
3359,8,Carroll Best,American banjo player,https://en.wikipedia.org/wiki/Carroll_Best,4,1995,May,
4642,24,Syed Abuzar Bukhari,Pakistani scholar and president of Majlis-e-Ahrar-ul-Islam,https://en.wikipedia.org/wiki/Syed_Abuzar_Bukhari,2,1995,October,
5308,12,Jon Pattis,"American engineer imprisoned in Iran, congestive heart failure",https://en.wikipedia.org/wiki/Jon_Pattis,3,1996,January,



#### Observations:
- The next pattern starts with 'age/', so we will iterate again with it.

In [27]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age/'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "/"):
            df.loc[i, "age"] = int(age)
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,age
49341,11,David Brown,"69, British cricketer, brain tumour",https://en.wikipedia.org/wiki/David_Brown_(Scottish_cricketer),4,2011,March,69.0
84018,13,Celso Peçanha,"99, Brazilian politician, Governor of Rio de Janeiro (1961–1962)",https://en.wikipedia.org/wiki/Celso_Pe%C3%A7anha,2,2016,July,99.0
121286,6,Todd R. Klaenhammer,"69, American food scientist and microbiologist",https://en.wikipedia.org/wiki/Todd_R._Klaenhammer,7,2021,March,69.0
38960,16,Luisín Landáez,"77, Venezuelan-Chilean cumbia singer",https://en.wikipedia.org/wiki/Luis%C3%ADn_Land%C3%A1ez,9,2008,November,77.0
22220,19,Harry Watson,"79, Canadian professional hockey player (Detroit Red Wings, Toronto Maple Leafs, Chicago Black Hawks)","https://en.wikipedia.org/wiki/Harry_Watson_(ice_hockey,_born_1923)",0,2002,November,79.0



In [28]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

1305


#### Observation:
- That iteration captured nearly 70 missing values.
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [29]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

Unnamed: 0,day,name,info,link,num_references,year,month,age
900,20,Fernande Giroux,Canadian actress and jazz singer,https://en.wikipedia.org/wiki/Fernande_Giroux,5,1994,May,
1029,6,Peter Graves,"English actor and nobleman, heart attack","https://en.wikipedia.org/wiki/Peter_Graves,_8th_Baron_Graves",5,1994,June,
1895,24,1994 Colombo suicide attack,Notable people killed in the \n,"https://en.wikipedia.org/wiki/List_of_attacks_attributed_to_the_LTTE,_1990s#1994",41,1994,October,
2318,1,Nina Leen,Russian-born American photographer for,https://en.wikipedia.org/wiki/Nina_Leen,3,1995,January,
2506,22,Henry Gladstone,"American radio newscaster and actor, heart failure",https://en.wikipedia.org/wiki/Henry_Gladstone,5,1995,January,
2902,11,Ernest Kabushemeye,"Burundian politician and the Minister for Mines and Energy, assassinated",https://en.wikipedia.org/wiki/Ernest_Kabushemeye,2,1995,March,
3296,30,Christopher Chadman,"American dancer and choreographer, AIDS-related complications",https://en.wikipedia.org/wiki/Christopher_Chadman,2,1995,April,
3359,8,Carroll Best,American banjo player,https://en.wikipedia.org/wiki/Carroll_Best,4,1995,May,
4642,24,Syed Abuzar Bukhari,Pakistani scholar and president of Majlis-e-Ahrar-ul-Islam,https://en.wikipedia.org/wiki/Syed_Abuzar_Bukhari,2,1995,October,
5308,12,Jon Pattis,"American engineer imprisoned in Iran, congestive heart failure",https://en.wikipedia.org/wiki/Jon_Pattis,3,1996,January,



#### Observations:
- The next pattern starts with 'age.', so we will iterate again with it.

In [30]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age.'
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + "."):
            df.loc[i, "age"] = int(age)
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,age
97850,3,Boris Orlov,"73, Russian gymnastics coach",https://en.wikipedia.org/wiki/Boris_Orlov_(coach),5,2018,July,73.0
63524,19,A K Azizul Huq,"84, Bangladeshi civil servant, Comptroller and Auditor General (1983–1989)",https://en.wikipedia.org/wiki/A_K_Azizul_Huq,2,2013,July,84.0
116840,19,Tony Lewis,"62, English bassist, singer and songwriter (The Outfield)",https://en.wikipedia.org/wiki/Tony_Lewis_(musician),17,2020,October,62.0
55582,10,Andreas Shipanga,"80, Namibian politician, Chairman of the Transitional Government of National Unity (1987, 1988), heart attack",https://en.wikipedia.org/wiki/Andreas_Shipanga,4,2012,May,80.0
47513,16,Aldo Maria Lazzarín Stella,"83, Italian-born Chilean Roman Catholic prelate, Vicar Apostolic of Aysén (1989–1998)",https://en.wikipedia.org/wiki/Aldo_Maria_Lazzar%C3%ADn_Stella,1,2010,October,83.0



In [31]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

1297


#### Observation:
- That iteration captured a handful more missing values (8).
- Let us take a look at the remaining rows with missing `age`.

#### Checking Rows with Missing `age`

In [32]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

Unnamed: 0,day,name,info,link,num_references,year,month,age
900,20,Fernande Giroux,Canadian actress and jazz singer,https://en.wikipedia.org/wiki/Fernande_Giroux,5,1994,May,
1029,6,Peter Graves,"English actor and nobleman, heart attack","https://en.wikipedia.org/wiki/Peter_Graves,_8th_Baron_Graves",5,1994,June,
1895,24,1994 Colombo suicide attack,Notable people killed in the \n,"https://en.wikipedia.org/wiki/List_of_attacks_attributed_to_the_LTTE,_1990s#1994",41,1994,October,
2318,1,Nina Leen,Russian-born American photographer for,https://en.wikipedia.org/wiki/Nina_Leen,3,1995,January,
2506,22,Henry Gladstone,"American radio newscaster and actor, heart failure",https://en.wikipedia.org/wiki/Henry_Gladstone,5,1995,January,
2902,11,Ernest Kabushemeye,"Burundian politician and the Minister for Mines and Energy, assassinated",https://en.wikipedia.org/wiki/Ernest_Kabushemeye,2,1995,March,
3296,30,Christopher Chadman,"American dancer and choreographer, AIDS-related complications",https://en.wikipedia.org/wiki/Christopher_Chadman,2,1995,April,
3359,8,Carroll Best,American banjo player,https://en.wikipedia.org/wiki/Carroll_Best,4,1995,May,
4642,24,Syed Abuzar Bukhari,Pakistani scholar and president of Majlis-e-Ahrar-ul-Islam,https://en.wikipedia.org/wiki/Syed_Abuzar_Bukhari,2,1995,October,
5308,12,Jon Pattis,"American engineer imprisoned in Iran, congestive heart failure",https://en.wikipedia.org/wiki/Jon_Pattis,3,1996,January,



#### Observations:
- There is at least one entry that escaped the net of that last iteration.  It can be hard-coded later.
- The next pattern starts with "age ", so we will iterate again with it.

In [33]:
# Creating a list of ages
ages = pd.Series(np.arange(1, 150)).astype("string")

# For loop to extract age from info column and assign to age column, if info starts with 'age ' (age followed by a space)
for age in ages:
    for i, item in enumerate(df["info"]):
        if item.startswith(age + " "):
            df.loc[i, "age"] = int(age)
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,age
88898,25,Louis Feldman,"90, American classical scholar",https://en.wikipedia.org/wiki/Louis_Feldman,5,2017,March,90.0
111236,29,Chandan Singh,"94, Indian military officer",https://en.wikipedia.org/wiki/Chandan_Singh_(Air_Vice_Marshal),13,2020,March,94.0
56680,18,Pancho Martin,"86, Cuban racehorse trainer (Sham)",https://en.wikipedia.org/wiki/Pancho_Martin,10,2012,July,86.0
76216,11,Frankie Sodano,"84, American Olympic boxer (1948)",https://en.wikipedia.org/wiki/Frankie_Sodano,1,2015,May,84.0
38670,20,Sœur Emmanuelle,"99, Belgian-born French nun, natural causes",https://en.wikipedia.org/wiki/S%C5%93ur_Emmanuelle,6,2008,October,99.0



In [34]:
# Checking the number of remaining missing values for `age`
df["age"].isna().sum()

1271


#### Observations:
- That iteration captured 26 more missing values.
- Let us look again at the remaining rows.
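The separate passes for each separator could, in principle, be folded into a single pattern. The sketch below is one such consolidation, not the notebook's implementation: `LEADING_AGE` and `leading_age` are hypothetical names, and the 1 to 149 bound is assumed from the age list used in the loops.

```python
import re

# A leading number (no leading zero, up to 3 digits) followed by one of the
# separators tested so far: comma, hyphen, en dash, slash, period, or space
LEADING_AGE = re.compile(r"^([1-9]\d{0,2})[,\-–/. ]")


def leading_age(info):
    """Return the leading age as an int, or None if info does not start with one."""
    match = LEADING_AGE.match(info)
    if match is None:
        return None
    age = int(match.group(1))
    # Mirror the assumed upper bound of the age list (1 through 149)
    return age if age < 150 else None


examples = [
    "86, British dancer",
    "39–40, Irish republican paramilitary leader",
    "57-58, example entry",
    "79/80, example entry",
    "c. 66, example entry",
    "Canadian actress and jazz singer",
]
for text in examples:
    print(repr(text), "->", leading_age(text))
```

For ranges, this keeps the lower value, matching the choice made earlier. Entries such as "c. 66" do not start with a digit and would still need separate handling.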

#### Checking Rows with Missing `age`

In [35]:
# Checking rows still missing `age`
df[df["age"].isna()].head(100)

Unnamed: 0,day,name,info,link,num_references,year,month,age
900,20,Fernande Giroux,Canadian actress and jazz singer,https://en.wikipedia.org/wiki/Fernande_Giroux,5,1994,May,
1029,6,Peter Graves,"English actor and nobleman, heart attack","https://en.wikipedia.org/wiki/Peter_Graves,_8th_Baron_Graves",5,1994,June,
1895,24,1994 Colombo suicide attack,Notable people killed in the \n,"https://en.wikipedia.org/wiki/List_of_attacks_attributed_to_the_LTTE,_1990s#1994",41,1994,October,
2318,1,Nina Leen,Russian-born American photographer for,https://en.wikipedia.org/wiki/Nina_Leen,3,1995,January,
2506,22,Henry Gladstone,"American radio newscaster and actor, heart failure",https://en.wikipedia.org/wiki/Henry_Gladstone,5,1995,January,
2902,11,Ernest Kabushemeye,"Burundian politician and the Minister for Mines and Energy, assassinated",https://en.wikipedia.org/wiki/Ernest_Kabushemeye,2,1995,March,
3296,30,Christopher Chadman,"American dancer and choreographer, AIDS-related complications",https://en.wikipedia.org/wiki/Christopher_Chadman,2,1995,April,
3359,8,Carroll Best,American banjo player,https://en.wikipedia.org/wiki/Carroll_Best,4,1995,May,
4642,24,Syed Abuzar Bukhari,Pakistani scholar and president of Majlis-e-Ahrar-ul-Islam,https://en.wikipedia.org/wiki/Syed_Abuzar_Bukhari,2,1995,October,
5308,12,Jon Pattis,"American engineer imprisoned in Iran, congestive heart failure",https://en.wikipedia.org/wiki/Jon_Pattis,3,1996,January,



#### Observations:
- This sample shows entries that either lack age information or have the age embedded in the middle of the string.
- Let us remove rows without any age information, as they will not add to the analysis.

In [36]:
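# Creating a Series of candidate ages (1 to 149) as strings
ages = pd.Series(np.arange(1, 150)).astype("string")

# List of index labels of entries to drop due to no age information
drop_list = []

# For loop to add index of entries to drop_list if no candidate age appears in `info`
# Note: this is a substring test, so any nonzero digit in `info` (e.g., within a year) keeps the row
for i in df[df["age"].isna()].index:
    if not any(age in df.loc[i, "info"] for age in ages):
        drop_list.append(i)

# Checking a sample of the entries to be dropped
df.loc[drop_list, :].sample(50)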

Unnamed: 0,day,name,info,link,num_references,year,month,age
75011,10,Mohammed Kaliel,Nigerian army officer and politician,https://en.wikipedia.org/wiki/Mohammed_Kaliel,7,2015,March,
15639,3,Ernesto Layaguin,hospital corpsman of the Philippine Marine Corps and posthumous recipient of the Medal of Valor,https://en.wikipedia.org/wiki/Ernesto_Layaguin,3,2000,April,
46870,24,Bulle Hassan Mo'allim,"Somali politician, member of the Transitional Federal Parliament, victim of Muna Hotel attack",https://en.wikipedia.org/wiki/Bulle_Hassan_Mo%27allim,2,2010,August,
16208,10,Denis O'Conor Don,"O'Conor Don, hereditary Chief of the Name O'Conor",https://en.wikipedia.org/wiki/Denis_O%27Conor_Don,7,2000,July,
8421,6,Barbara Yu Ling,Singapore-British actress,https://en.wikipedia.org/wiki/Barbara_Yu_Ling,44,1997,April,
110073,17,Pandhari Juker,Indian make-up artist,https://en.wikipedia.org/wiki/Pandhari_Juker,6,2020,February,
66946,26,Tom Nyuma,Sierra Leonean politician and military officer,https://en.wikipedia.org/wiki/Tom_Nyuma,3,2014,January,
121810,23,Gilles Rossignol,French writer and editor,https://en.wikipedia.org/wiki/Gilles_Rossignol,4,2021,March,
27407,23,José Cruxent,Venezuelan archaeologist,https://en.wikipedia.org/wiki/Jos%C3%A9_Cruxent,4,2005,February,
65757,26,Himachal Som,Indian diplomat,https://en.wikipedia.org/wiki/Himachal_Som,3,2013,November,
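
The loop above tests whether any string from "1" to "149" occurs in `info`. Since every such string starts with a digit from 1 to 9, that test reduces to "does `info` contain a nonzero digit?". The following is a minimal vectorized sketch of the same idea, not the notebook's own code. The `toy` frame and `toy_drop_list` are hypothetical stand-ins for `df` and `drop_list`, and `na=False` is an added assumption that treats a missing `info` as having no digit (the loop would raise an error on such a value instead).

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; rows are illustrative only
toy = pd.DataFrame(
    {
        "info": [
            "Canadian actress and jazz singer",
            "Sarah-Jayne Mulvihill, 32, first British servicewoman to be killed in action in Iraq",
            "Indian diplomat",
            "Sri Lankan lawyer, President's Counsel (2004)",
        ],
        "age": [float("nan")] * 4,
    },
    index=[900, 30593, 65757, 82352],
)

# Rows still missing `age`
missing_age = toy["age"].isna()

# True where `info` contains at least one digit from 1 to 9
has_nonzero_digit = toy["info"].str.contains(r"[1-9]", regex=True, na=False)

# Index labels of rows with no age and no nonzero digit anywhere in `info`
toy_drop_list = toy.index[missing_age & ~has_nonzero_digit].tolist()
print(toy_drop_list)  # [900, 65757]
```

Note that the row containing only the year 2004 is retained, which mirrors the loop's behavior: a digit anywhere in `info` is enough to keep the row for later inspection.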



In [37]:
# Checking the number of entries that will be dropped
print(len(drop_list))

# Dropping the rows without age information
df.drop(drop_list, axis=0, inplace=True)

925



In [39]:
# Checking df shape after dropping rows
df.shape

(132975, 8)


#### Observations:
- We have successfully dropped the entries lacking age data.
- Let us examine the remaining rows missing `age`.
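
The reported figures are consistent: 133900 rows at load minus 925 dropped rows gives 132975. A check of this kind can also be made programmatically. The sketch below uses a hypothetical `toy` frame and `toy_drop_list` rather than the notebook's `df` and `drop_list`.

```python
import pandas as pd

# Toy stand-in for `df`; values and index labels are illustrative only
toy = pd.DataFrame({"age": [86.0, None, 68.0, None]}, index=[10, 11, 12, 13])
toy_drop_list = [11, 13]

rows_before = len(toy)
toy = toy.drop(index=toy_drop_list)

# The row count should fall by exactly the number of dropped labels
assert len(toy) == rows_before - len(toy_drop_list)

# None of the dropped labels should remain in the index
assert not toy.index.isin(toy_drop_list).any()

print(toy.shape)  # (2, 1)
```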

#### Checking Rows with Missing `age`

In [44]:
# Checking rows still missing `age`
df[df["age"].isna()].sample(100)

Unnamed: 0,day,name,info,link,num_references,year,month,age
34411,21,Čabulītis,"c. 72, American alligator considered to be Europe's oldest",https://en.wikipedia.org/wiki/%C4%8Cabul%C4%ABtis,2,2007,August,
77501,22,Martin Storey,"British Channel Islander politician, member of the States (since 2008), cancer",https://en.wikipedia.org/wiki/Martin_Storey_(politician),2,2015,July,
109765,5,Mohammad Shafiq,"Pakistani politician, MLA (since 2015), cardiac arrest",https://en.wikipedia.org/wiki/Mohammad_Shafiq_(politician),4,2020,February,
30593,6,Flight Lieutenant,"Sarah-Jayne Mulvihill, 32, first British servicewoman to be killed in action in Iraq",https://en.wikipedia.org/wiki/Flight_Lieutenant,12,2006,May,
45575,15,Gabriel Bien-Aimé,"Haitian politician, Minister of Education (2006–2008), heart attack",https://en.wikipedia.org/wiki/Gabriel_Bien-Aim%C3%A9,2,2010,May,
131258,14,Khayal Zaman Orakzai,"Pakistani politician, MNA (since 2013), cancer",https://en.wikipedia.org/wiki/Khayal_Zaman_Orakzai,12,2022,February,
59193,10,Harry Iauko,"Vanuatuan politician, MP for Tanna (2008–2012), complications of pneumonia",https://en.wikipedia.org/wiki/Harry_Iauko,35,2012,December,
36946,2,Justin Yak,"Sudanese politician, minister for cabinet affairs for Southern Sudan (2006–2007), plane crash",https://en.wikipedia.org/wiki/Justin_Yak,1,2008,May,
90591,25,Agha Shahbaz Khan Durrani,"Pakistani politician, Senator (since 2015), heart attack",https://en.wikipedia.org/wiki/Agha_Shahbaz_Khan_Durrani,2,2017,June,
82352,11,A. R. Surendran,"Sri Lankan lawyer, President's Counsel (2004)",https://en.wikipedia.org/wiki/A._R._Surendran,13,2016,April,
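
Some of the remaining rows carry the age mid-string (e.g., "Sarah-Jayne Mulvihill, 32, ...") or behind a "c." prefix, while others contain only years. As a rough sketch, a regular expression could pull out a 1 to 3 digit number that sits at the start of the string or after a comma and is itself followed by a comma. The pattern, the `toy_info` Series, and the `extracted` name are assumptions made for illustration, and the notebook's own approach may differ.

```python
import pandas as pd

# Illustrative `info` strings taken from the sample above
toy_info = pd.Series(
    [
        "c. 72, American alligator considered to be Europe's oldest",
        "Sarah-Jayne Mulvihill, 32, first British servicewoman to be killed in action in Iraq",
        "Pakistani politician, MLA (since 2015), cardiac arrest",
        "Sri Lankan lawyer, President's Counsel (2004)",
    ]
)

# Assumed pattern: start of string or a comma, an optional "c." prefix,
# then 1-3 digits immediately followed by a comma
pattern = r"(?:^|,\s*)(?:c\.\s*)?(\d{1,3})\s*,"

# Extract the captured digits and convert them to numbers (NaN where no match)
extracted = pd.to_numeric(toy_info.str.extract(pattern, expand=False))
print(extracted.tolist())  # [72.0, 32.0, nan, nan]
```

Years in parentheses are not matched because they are not followed by a comma, so rows with only term dates stay missing rather than receiving a spurious age.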

