# Wikipedia Notable Deaths 

## [Notebook 1 of 4:   Data Collection](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths_data_collection_thanak_2022_06_10.ipynb)

## Context

This case study was chosen by this author as an end-to-end portfolio project for the following reasons:  
1. accessibility of the dataset
2. potential for use and demonstration of a wide range of skills, including coding in Python (using PyCharm and Jupyter Notebooks), version control (using Git, GitHub, and ReviewNB), web scraping (using Scrapy), relational database management (using SQLite and [SQLite Viewer](https://inloop.github.io/sqlite-viewer/)), data cleaning, natural language processing (NLP using regular expressions and fuzzywuzzy), Exploratory Data Analysis (EDA using numpy, pandas, matplotlib, plotly, and seaborn), data preprocessing, unsupervised learning (K-means and hierarchical clustering), supervised learning (regression models), and a user interface for predictions.
3. in particular, a dataset with potential for regression modeling was chosen to gain more exposure to its associated algorithms, as this author had relatively broader experience with classification models in past projects.
    

## Objective

The...

## Data Dictionary
Variable: Description

## Data Collection
- Data was collected from 6/9/22 to 6/10/22, using Scrapy. 

### 6/9/2022

- The [Wikipedia](https://en.wikipedia.org/wiki/Main_Page) page, [Lists of deaths by year](https://en.wikipedia.org/wiki/Lists_of_deaths_by_year), contains entries from as early as 1987 to the present day.
- 1994 was chosen as the start year for collection as it is the first year with entries following the current format: "Name, age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, cause of death (if known), and reference."
- For ease of pagination, [Deaths in January 1994](https://en.wikipedia.org/wiki/Deaths_in_January_1994) was the start url for scraping, proceeding month by month through subsequent pages.
- Spider ["by_year"](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/spiders/by_year.py) scraped `month_year`, `day`, `name`, `info` ("age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, and cause of death (if known)"), and `link` for each entry on each month's page.  
- The project's [pipelines.py](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/pipelines.py)* wrote results to SQLite table wp_deaths_94_to_22 within [wp_deaths_94_to_22.db](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wp_deaths_94_to_22.db).  This scraping was successful for January, 1994 through May, 2022 data.
- [Deaths in 2022 -- June](https://en.wikipedia.org/wiki/Deaths_in_2022#June)--the current month's page--varied in format, and was therefore scraped separately, after `num_references` was scraped for the previous entries.
- The original order of entries was preserved by Spider ["by_year"](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/spiders/by_year.py) in [wp_deaths_94_to_22.db](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wp_deaths_94_to_22.db).  Scrapy issues requests concurrently, so responses are not guaranteed to arrive in request order; this trade of order for speed is noticeable when the number of pages is large.  Therefore, each entry's page was scraped for its number of references separately, as the order was sure to vary.  Spider ["references"](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/spiders/references.py) scraped the number of references.  The project's [pipelines.py](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/pipelines.py)* wrote results to SQLite table wp_reference_counts_2 in [wp_reference_counts_2.db](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wp_reference_counts_2.db), falling ~13,400 rows short of by_year's 133,769 rows.
- Finally, the [June, 2022](https://en.wikipedia.org/wiki/Deaths_in_2022#June) page was scraped by Spider ["June_2022"](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/spiders/June_2022.py), successfully capturing all of the previous fields, including number of references.  Note that the number of pages to scrape was orders of magnitude smaller here (145 entries versus more than 133,000).
- The project's [pipelines.py](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/pipelines.py)* wrote results to SQLite table wp_deaths_June_2022 in [wp_deaths_June_2022.db](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wp_deaths_June_2022.db), resulting in 145 rows from the first part of June, 2022.
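
To illustrate how a pipeline can write scraped entries to SQLite, here is a minimal sketch. It is not the project's actual `pipelines.py`: the class name `SQLitePipeline`, the default database and table names, and the all-text column types are assumptions made for illustration. It relies only on the fact that a Scrapy item pipeline is a plain Python class exposing `process_item` (and optionally `open_spider` and `close_spider`), so the sketch runs without Scrapy installed.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical Scrapy-style item pipeline that stores each scraped entry in SQLite."""

    def __init__(self, db_path="wp_deaths_example.db", table="wp_deaths_example"):
        # Both names are placeholders, not the project's real database or table
        self.db_path = db_path
        self.table = table

    def open_spider(self, spider):
        # Called once when the spider starts: open the database and create the table
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {self.table} "
            "(month_year TEXT, day TEXT, name TEXT, info TEXT, link TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields: insert one row, then pass the item on
        self.conn.execute(
            f"INSERT INTO {self.table} VALUES (?, ?, ?, ?, ?)",
            (
                item.get("month_year"),
                item.get("day"),
                item.get("name"),
                item.get("info"),
                item.get("link"),
            ),
        )
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: save the rows and release the file
        self.conn.commit()
        self.conn.close()
```

In a Scrapy project, a class like this would be enabled through the `ITEM_PIPELINES` setting; rows land in the table in the order the responses are processed, which is why order-sensitive fields were scraped separately from the per-page reference counts.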

### 6/10/2022
The remaining data collection steps are outlined in this notebook:
1. [Reading, sampling, and checking data shape](#step1)
- SQLite tables wp_deaths_94_to_22, wp_reference_counts_2, and wp_deaths_June_2022 were read in as Pandas dataframes.
2. [Combining dataframes](#step2)
- Dataframes for wp_deaths_94_to_22 and wp_reference_counts_2 were combined using `link` as the unique identifier.
- Dataframe for wp_deaths_June_2022 was added.
3. [Duplicate Rows](#step3)
- 9 rows of duplicate entries were dropped.
4. [Missing Values](#step4)
- 5 rows missing essential data were dropped.
- 6 entries had missing `name` and `num_references`, but contained the name in the `info` feature, to be extracted later during data cleaning.  As these entries had no associated page, their `num_references` was set equal to 0.
5. [Missing Reference Count Values](#step5)
- A single modification was made to the original XPath for the original "by_year" Spider, to match a variation on the pages for links with missing `num_references`.  
- Those pages were then rescraped iteratively by Spiders ["refs2"](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/spiders/refs2.py), ["refs3"](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/spiders/refs3.py), ["refs4"](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/spiders/refs4.py), and ["refs5"](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/spiders/refs5.py).
- The project's [pipelines.py](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/pipelines.py)* wrote their respective results to SQLite tables refs2 in [refs2.db](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/refs2.db), refs3 in [refs3.db](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/refs3.db), refs4 in [refs4.db](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/refs4.db), and refs5 in [refs5.db](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/refs5.db).
- During each iteration, the additional `num_references` were added to the main dataset and remaining links with missing `num_references` values were identified.
- After the final iteration, 92 entries remained with missing `num_references`; all of them lacked an associated page, so `num_references` was set equal to 0.  These entries contained all of the other relevant features, so they were preserved.
- The resultant raw dataset has 133,900 rows and 6 columns.

\*  The current version of [pipelines.py](https://github.com/teresahanak/wikipedia-notable-deaths/blob/main/wikipedia_notable_deaths/pipelines.py) reflects its use by the most recent project spider crawled, as it is reused for multiple spiders within the [Scrapy project folder](https://github.com/teresahanak/wikipedia-notable-deaths/tree/main/wikipedia_notable_deaths).

## Importing necessary libraries

In [1]:
# To limit the number of threads used by numpy and thereby prevent the known memory leak associated with KMeans
# Note:  this cell must be run BEFORE importing numpy to have the desired effect
import os

os.environ["OMP_NUM_THREADS"] = "1"

In [2]:
# To structure code automatically
%load_ext nb_black

# To import sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

# To be used for data scaling
from sklearn.preprocessing import StandardScaler

# To compute distances
from scipy.spatial.distance import cdist, pdist

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# To visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from yellowbrick.style.palettes import PALETTES

# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)
sns.set_theme()
sns.set_palette(
    (
        "midnightblue",
        "goldenrod",
        "maroon",
        "darkolivegreen",
        "cadetblue",
        "tab:purple",
        "yellowgreen",
    )
)
plt.rc("font", size=12)
plt.rc("axes", titlesize=15)
plt.rc("axes", labelsize=14)
plt.rc("xtick", labelsize=13)
plt.rc("ytick", labelsize=13)
plt.rc("legend", fontsize=14)
plt.rc("figure", titlesize=16)


## Data Overview

<a id='step1'></a>
### Reading, sampling, and checking data shape

### January 1994 through May 2022 Data (without reference counts)

In [3]:
# Reading the wp_deaths_94_to_22 dataset from sql db and table
conn = sql.connect("wp_deaths_94_to_22.db")
raw_94_to_22 = pd.read_sql("SELECT * FROM wp_deaths_94_to_22", conn)

# Making a working copy
df_94_to_22 = raw_94_to_22.copy()

# Checking the shape
print(f"There are {df_94_to_22.shape[0]} rows and {df_94_to_22.shape[1]} columns.")

# Checking first 2 rows of the data
df_94_to_22.head(2)

There are 133769 rows and 5 columns.


Unnamed: 0,month_year,day,name,info,link
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer)
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty



In [4]:
# Checking last 2 rows of the data
df_94_to_22.tail(2)

Unnamed: 0,month_year,day,name,info,link
133767,May 2022,31,Dave Smith,", 72, American sound engineer, founder of Sequential.",https://en.wikipedia.org/wiki/Dave_Smith_(engineer)
133768,May 2022,31,Wang Zherong,", 86, Chinese tank designer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Wang_Zherong



In [5]:
# Checking a sample of the data
df_94_to_22.sample(5)

Unnamed: 0,month_year,day,name,info,link
114252,July 2020,6,Inuwa Abdulkadir,", 54, Nigerian politician, complications from COVID-19.",https://en.wikipedia.org/wiki/Inuwa_Abdulkadir
44538,February 2010,28,Rose Gray,", 71, British restaurateur () and food writer, brain cancer.",https://en.wikipedia.org/wiki/Rose_Gray
29964,February 2006,5,Norma Candal,", 75, Puerto Rican comedian, actress and drama teacher, head injury.",https://en.wikipedia.org/wiki/Norma_Candal
71343,September 2014,2,"William ""Bill"" Ralph Merton",", 96, British military scientist and financier.",https://en.wikipedia.org/wiki/William_%22Bill%22_Ralph_Merton
15535,March 2000,13,Carlo Tagnin,", 67, Italian football player and manager.",https://en.wikipedia.org/wiki/Carlo_Tagnin



#### Observations:
- There are 133,769 rows and 5 columns in the data from January, 1994 through May, 2022.
- The number of references was scraped separately.
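
The read cells in this notebook open a connection with `sql.connect(...)` and never close it. A small helper along the following lines would read a table and release the database file afterwards. This is a sketch, not the notebook's own code: the function name `read_table` is hypothetical, and it assumes the table name comes from a trusted source, since table names cannot be passed as SQL parameters.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def read_table(db_path, table):
    """Read an entire SQLite table into a dataframe and close the connection afterwards."""
    # closing() guarantees conn.close() runs even if the query raises
    with closing(sqlite3.connect(db_path)) as conn:
        # `table` is interpolated directly, so it must be a trusted name
        return pd.read_sql(f"SELECT * FROM {table}", conn)
```

With such a helper, a call like `read_table("wp_deaths_94_to_22.db", "wp_deaths_94_to_22")` would replace the connect-and-read pair in each read cell.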

### January 1994 through May 2022 Reference Count Data

In [6]:
# Reading the wp_reference_counts_2 dataset from sql db and table
conn = sql.connect("wp_reference_counts_2.db")
raw_reference_counts = pd.read_sql("SELECT * FROM wp_reference_counts_2", conn)

# Making a working copy
df_reference_counts = raw_reference_counts.copy()

# Checking the shape
print(
    f"There are {df_reference_counts.shape[0]} rows and {df_reference_counts.shape[1]} columns."
)

# Checking first 2 rows of the data
df_reference_counts.head(2)

There are 120368 rows and 2 columns.


Unnamed: 0,link,num_references
0,https://en.wikipedia.org/wiki/Lys_Gauty,5
1,https://en.wikipedia.org/wiki/William_Chappell_(dancer),21



In [7]:
# Checking last 2 rows of the data
df_reference_counts.tail(2)

Unnamed: 0,link,num_references
120366,https://en.wikipedia.org/wiki/Shirley_Thomas_(USC_professor),6
120367,https://en.wikipedia.org/wiki/James_Doohan,52



In [8]:
# Checking a sample of the data
df_reference_counts.sample(5)

Unnamed: 0,link,num_references
19089,https://en.wikipedia.org/wiki/Eric_Freeman_(artist),3
93410,https://en.wikipedia.org/wiki/Toni_Mendez,13
71896,https://en.wikipedia.org/wiki/Mort_Lindsey,5
11450,https://en.wikipedia.org/wiki/Budi_Darma,20
89093,https://en.wikipedia.org/wiki/Belita,6



#### Observations:
- Here, we see that there are ~13,400 fewer rows in the reference data than in the entry data (120,368 versus 133,769), indicating that some pages were not successfully scraped to obtain the number of references for the individual.
- After combining the three dataframes, we can examine those pages and reattempt scraping them, in order to obtain the missing information.
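
One way to list exactly which pages lack a reference count is a left merge with `indicator=True`, sketched below on toy data. The dataframes `entries` and `ref_counts` are illustrative stand-ins for `df_94_to_22` and `df_reference_counts`, and the links are shortened placeholders.

```python
import pandas as pd

# Toy stand-ins for df_94_to_22 and df_reference_counts
entries = pd.DataFrame(
    {"name": ["A", "B", "C"], "link": ["/wiki/A", "/wiki/B", "/wiki/C"]}
)
ref_counts = pd.DataFrame({"link": ["/wiki/A", "/wiki/C"], "num_references": [5, 21]})

# indicator=True adds a `_merge` column recording where each row was found
merged = entries.merge(ref_counts, how="left", on="link", indicator=True)

# "left_only" rows are entries whose page returned no reference count
missing_links = merged.loc[merged["_merge"] == "left_only", "link"].tolist()
print(missing_links)  # ['/wiki/B']
```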

### June 2022 Data

In [9]:
# Reading the wp_deaths_June_2022 dataset from sql db and table
conn = sql.connect("wp_deaths_June_2022.db")
raw_June_2022 = pd.read_sql("SELECT * FROM wp_deaths_June_2022", conn)

# Making a working copy
df_June_2022 = raw_June_2022.copy()

# Checking the shape
print(f"There are {df_June_2022.shape[0]} rows and {df_June_2022.shape[1]} columns.")

# Checking first 2 rows of the data
df_June_2022.head(2)

There are 145 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,June 2022,8,Mladen Frančić,", 67, Croatian football player and manager (Vrbovec, Podravina, Al-Watani Club).",https://en.wikipedia.org/wiki/Mladen_Fran%C4%8Di%C4%87,1
1,June 2022,6,Valery Ryumin,", 82, Russian cosmonaut (Soyuz 25, Soyuz 32, Soyuz 35).",https://en.wikipedia.org/wiki/Valery_Ryumin,2



#### Observations:
- The June, 2022 data does not follow the same row order as the previous dataframe, which was in order of day of the month. 
- For continuity, before concatenating the two dataframes, we will sort June, 2022 by day.

In [10]:
# Sorting by day
df_June_2022.sort_values(by="day", inplace=True)

# Re-checking first 2 rows of the data
df_June_2022.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references
26,June 2022,1,Richard Oldcorn,", 84, English Olympic fencer (1964, 1968, 1972).",https://en.wikipedia.org/wiki/Richard_Oldcorn,5
20,June 2022,1,István Szőke,", 75, Hungarian footballer (Ferencváros, national team), stroke.",https://en.wikipedia.org/wiki/Istv%C3%A1n_Sz%C5%91ke,2
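
If `day` is stored as text, as the object dtype reported by `df.info()` in this notebook suggests, sorting on it is lexicographic. That matches numeric order here only because the June data covers single-digit days. The sketch below, on toy data, shows the difference and one possible way to sort by numeric value; the dataframe `june` is a stand-in, not the notebook's `df_June_2022`.

```python
import pandas as pd

# Toy stand-in for df_June_2022, with `day` stored as text
june = pd.DataFrame({"day": ["10", "9", "2", "1"], "name": ["D", "C", "B", "A"]})

# Sorting the text values directly is lexicographic: "10" sorts before "2"
text_order = june.sort_values(by="day")["day"].tolist()

# Converting to numbers inside `key` sorts by value;
# kind="stable" keeps the scraped order of entries that share a day
numeric_order = june.sort_values(by="day", key=pd.to_numeric, kind="stable")[
    "day"
].tolist()

print(text_order)     # ['1', '10', '2', '9']
print(numeric_order)  # ['1', '2', '9', '10']
```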



In [11]:
# Checking last 2 rows of the data
df_June_2022.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
8,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
5,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [12]:
# Checking a sample of the data
df_June_2022.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
94,June 2022,4,John Cooksey,", 80, American ophthalmologist and politician, member of the U.S. House of Representatives (1997–2003).",https://en.wikipedia.org/wiki/John_Cooksey,3
19,June 2022,1,Joseph Zoderer,", 86, Italian writer.",https://en.wikipedia.org/wiki/Joseph_Zoderer,3
124,June 2022,7,Keijo Korhonen,", 88, Finnish diplomat and politician, minister for foreign affairs (1976–1977). (death announced on this date)",https://en.wikipedia.org/wiki/Keijo_Korhonen,10
131,June 2022,8,George Thompson,", 74, American basketball player (Milwaukee Bucks), complications from diabetes.",https://en.wikipedia.org/wiki/George_Thompson_(basketball),3
16,June 2022,3,Larry Hillman,", 85, Canadian ice hockey player (Toronto Maple Leafs, Boston Bruins, Detroit Red Wings).",https://en.wikipedia.org/wiki/Larry_Hillman,14



#### Observations:
- Now, we are ready to combine the three dataframes.

<a id='step2'></a>
## Combining Dataframes

### Adding Number of References to 1994 through May 2022 Data

In [13]:
# Adding num_references column to 1994 through May 2022 data
df_combined = pd.merge(df_94_to_22, df_reference_counts, how="left", on="link")

# Checking first 2 rows of the data
df_combined.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



### Adding June 2022 Data

In [14]:
# Adding June 2022 data
df_combined = pd.concat([df_combined, df_June_2022], ignore_index=True)

# Making a working copy
df = df_combined.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 133914 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



In [15]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
133912,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
133913,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [16]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
63692,July 2013,28,Drungo Hazewood,", 53, American baseball player (Baltimore Orioles), cancer.",https://en.wikipedia.org/wiki/Drungo_Hazewood,5.0
80549,January 2016,15,Francisco X. Alarcón,", 61, American poet, cancer.",https://en.wikipedia.org/wiki/Francisco_X._Alarc%C3%B3n,
33311,April 2007,6,Colin Graham,", 75, British opera, theatre and television director, cardiac arrest.",https://en.wikipedia.org/wiki/Colin_Graham,11.0
7221,November 1996,4,Gottlieb Weber,", 86, Swiss cyclist.",https://en.wikipedia.org/wiki/Gottlieb_Weber,1.0
99607,October 2018,6,Michel Vovelle,", 85, French historian.",https://en.wikipedia.org/wiki/Michel_Vovelle,1.0



#### Confirming Correct Number of Resultant Entries

In [17]:
# Confirming correct number of total rows
df_94_to_22.shape[0] + df_June_2022.shape[0]

133914


#### Observations:
- We have successfully combined the three dataframes.
- Now, we can check for data types, duplicates, and missing values.
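
The row-count check confirms that the left merge did not add rows, which holds only when each `link` appears at most once in the reference-count table. As an optional safeguard, `pd.merge` can enforce that condition itself through its `validate` argument. The sketch below uses toy data with a deliberately duplicated link; the dataframe names are illustrative.

```python
import pandas as pd

entries = pd.DataFrame({"name": ["A", "B"], "link": ["/wiki/A", "/wiki/B"]})
# The same page scraped twice: a left merge would silently duplicate entry A
ref_counts = pd.DataFrame({"link": ["/wiki/A", "/wiki/A"], "num_references": [5, 5]})

try:
    # many_to_one requires `link` to be unique in the right-hand dataframe
    pd.merge(entries, ref_counts, how="left", on="link", validate="many_to_one")
except pd.errors.MergeError as err:
    print(f"Merge rejected: {err}")

# After de-duplicating on `link`, the merge passes and preserves the row count
deduped = ref_counts.drop_duplicates(subset="link")
combined = pd.merge(entries, deduped, how="left", on="link", validate="many_to_one")
```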

## Checking data types, duplicates, and null values

### Data types

In [18]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133914 entries, 0 to 133913
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133914 non-null  object
 1   day             133914 non-null  object
 2   name            133903 non-null  object
 3   info            133914 non-null  object
 4   link            133914 non-null  object
 5   num_references  120637 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



#### Observations:
- There are 6 columns, all of object type.
- `name` and `num_references` both have missing values.
- The data is in a very raw format and there are other columns that have combined information that will need to be extracted.
- For now, we will leave all as object type.

<a id='step3'></a>
### Duplicate Rows

In [19]:
# Checking duplicate rows
df.duplicated().sum()

9


#### Observations:
- There are 9 duplicate rows that we will drop now.

In [20]:
# Drop duplicate rows
df.drop_duplicates(inplace=True, ignore_index=True)

# Re-check shape
df.shape

(133905, 6)


<a id='step4'></a>
### Missing Values

In [21]:
# Check null values by column as a percentage of the non-null count (not of all rows)
df.isnull().sum() / df.count() * 100

month_year        0.000
day               0.000
name              0.008
info              0.000
link              0.000
num_references   11.007
dtype: float64


In [22]:
# Checking number of missing values per row
df.isnull().sum(axis=1).value_counts()

0    120628
1     13266
2        11
dtype: int64


#### Observations:
- The number of rows missing only 1 value appears consistent with our anticipated missing `num_references`.
- There are 11 rows that are each missing 2 values.  Let us take a closer look at these rows.
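
A note on the percentages: dividing by `df.count()` expresses nulls relative to the non-null values, so the 11.007 reported for `num_references` is 13,277 nulls over 120,628 non-null values. As a share of all 133,905 rows, the same nulls come to roughly 9.9%. The toy example below shows the two calculations side by side; the dataframe `toy` is illustrative only.

```python
import pandas as pd

toy = pd.DataFrame(
    {"name": ["A", "B", "C", "D"], "num_references": [3, None, None, None]}
)

# Nulls relative to non-null values (the calculation used in the cell above): 3 / 1 * 100
ratio_to_non_null = toy.isnull().sum() / toy.count() * 100

# Nulls relative to all rows: 3 / 4 * 100
share_of_rows = toy.isnull().mean() * 100

print(ratio_to_non_null["num_references"])  # 300.0
print(share_of_rows["num_references"])      # 75.0
```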

In [23]:
# Checking the rows that are missing values for 2 columns
missing_2 = df[df.isnull().sum(axis=1) == 2]
missing_2

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,
24985,January 2004,22,,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,
27458,March 2005,1,,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,
34077,July 2007,11,,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,
35097,November 2007,11,,,https://en.wikipedia.orgNone,
41075,May 2009,18,,Either killed in a missile attack or shot:\n,https://en.wikipedia.orgNone,
64771,September 2013,29,,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,
76024,April 2015,29,,Notable convicted drug traffickers executed by Indonesian firing squad:\n,https://en.wikipedia.orgNone,
105871,August 2019,2,,"Japanese convicted murderers, executed by hanging.\n",https://en.wikipedia.orgNone,
106617,September 2019,12,,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,



#### Observations:
- We can see that multiple rows are missing `name`, but have the name in `info`, so we can extract it later.  
- The missing link itself is not of concern as it serves only as a means by which to retrieve the `num_references` value.
- As there is no associated link for the individual, we can safely replace the NaN `num_references` values for rows with extractable names with 0.
- We can proceed with removing the rows that lack an extractable name, as they also lack other information necessary for the analysis.
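
As a rough preview of the later extraction step, the rows worth keeping can be told apart by whether `info` starts with a name followed by an age. The sketch below is an assumption about how that could be done, not the cleaning code used later in the project; `NAME_PATTERN` and `extract_name` are hypothetical names.

```python
import re

# Hypothetical pattern: the name is everything before the first ", <age>" in `info`
NAME_PATTERN = re.compile(r"^\s*(?P<name>[^,]+?)\s*,\s*\d")


def extract_name(info):
    """Return the leading name from an `info` string, or None when no name/age pattern is present."""
    # Missing values (None or NaN) are not strings and carry no name
    if not isinstance(info, str):
        return None
    match = NAME_PATTERN.match(info)
    return match.group("name") if match else None
```

Applied to the rows above, `extract_name("Vincent Palmer, 37, British criminal.")` returns the name, while the group headings such as "Japanese convicted murderers, executed by hanging." yield `None` because no age follows the first comma.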

In [24]:
# List of rows to keep
keep_rows = [18937, 24985, 27458, 34077, 64771, 106617]

# Replacing num_references NaNs with 0 for rows with extractable names
df.loc[keep_rows, "num_references"] = 0

# List of rows to remove
remove_rows = [index for index in missing_2.index if index not in keep_rows]
del missing_2

# Dropping rows
df.drop(remove_rows, inplace=True)

# Re-checking shape
df.shape

(133900, 6)


In [25]:
# Re-check info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 133900 entries, 0 to 133904
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133894 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  120634 non-null  object
dtypes: object(6)
memory usage: 7.2+ MB



#### Observations:
- There are now only 6 rows with missing `name`, corresponding to the names we identified in `info`, that we will extract later.
- The remaining missing values are all for `num_references`, so we can proceed to make another attempt at scraping this information.
- Let us check a sample of these rows.

In [26]:
# Checking sample of rows missing num_references
df[df["num_references"].isna()].sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
85680,October 2016,16,Kigeli V,", 80, Rwandan monarch, King (1959–1961).",https://en.wikipedia.org/wiki/Kigeli_V_of_Rwanda,
46559,July 2010,29,Sabina Mugabe,", 75, Zimbabwean politician, MP (1985–2008) and sister of Robert Mugabe, after long illness.",https://en.wikipedia.org/wiki/Sabina_Mugabe,
97505,June 2018,15,Frank Harden,", 95, American radio announcer.",https://en.wikipedia.org/wiki/Frank_Harden,
105736,July 2019,28,Ferruh Bozbeyli,", 92, Turkish politician, Chairman of the Democratic Party (1970–1978) and Speaker of the Grand National Assembly (1965–1970).",https://en.wikipedia.org/wiki/Ferruh_Bozbeyli,
36688,April 2008,8,Graham Higman,", 91, British mathematician.",https://en.wikipedia.org/wiki/Graham_Higman,



#### Observations:
- Following the links reveals that the pages contain references.  
- Therefore, they either have a variation in the XPath followed for scraping, or Scrapy had an issue with following their links.
- We will export a dataframe of the links to the pages that need to be re-scraped for `num_references`.

In [28]:
# Exporting dataframe of pages to rescrape for num_references
rescrape_df = df[df["num_references"].isna()]["link"]
rescrape_df.to_csv("rescrape_df.csv", index=False)
del rescrape_df


#### Observations:
- A second iteration of scraping individual pages for number of references reveals variation in the XPath for those pages.
- There appears to be at least a third variation as the second scraping obtained a little more than half of the missing values.
- We will import and merge the data, as before.

<a id='step5'></a>
## Missing Reference Count Values

### First Re-scrape with Spider "refs2"

In [29]:
# Reading the refs2 dataset from sql db and table
conn = sql.connect("refs2.db")
raw_refs2 = pd.read_sql("SELECT * FROM refs2", conn)

# Making a working copy
df_refs2 = raw_refs2.copy()

# Checking the shape
print(f"There are {df_refs2.shape[0]} rows and {df_refs2.shape[1]} columns.")

# Checking first 2 rows of the data
df_refs2.head(2)

There are 7365 rows and 2 columns.


Unnamed: 0,link,num_references
0,https://en.wikipedia.org/wiki/List_of_American_supercentenarians#Charlotte_Benkner,63
1,https://en.wikipedia.org/wiki/Eugene_Record,11



In [30]:
# Checking last 2 rows of the data
df_refs2.tail(2)

Unnamed: 0,link,num_references
7363,https://en.wikipedia.org/wiki/Gunnar_Utterberg,3
7364,https://en.wikipedia.org/wiki/Bill_Sudakis,19



In [31]:
# Checking a sample of the data
df_refs2.sample(5)

Unnamed: 0,link,num_references
7037,https://en.wikipedia.org/wiki/Luke_Letlow,21
3255,https://en.wikipedia.org/wiki/Graham_Anderson,1
1954,https://en.wikipedia.org/wiki/Allan_Ekelund,1
4939,https://en.wikipedia.org/wiki/Vladimir_Pribylovsky,16
5480,https://en.wikipedia.org/wiki/David_Mattingley,3



#### Observations:
- We were able to obtain 7365 of the missing values.

#### Adding Missing References to Dataframe

In [32]:
# Adding new num_references column to data
df = pd.merge(df, df_refs2, how="left", on="link")

# Checking sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references_x,num_references_y
91121,July 2017,24,Naiyer Masud,", 81, Indian Urdu short story writer.",https://en.wikipedia.org/wiki/Naiyer_Masud,5.0,
120353,February 2021,7,Karen Lewis,", 67, American labor leader, president of the Chicago Teachers Union (2010–2014), glioblastoma.",https://en.wikipedia.org/wiki/Karen_Lewis,29.0,
40290,March 2009,7,Barbara Parker,", 62, American novelist, after long illness.",https://en.wikipedia.org/wiki/Barbara_Parker_(writer),3.0,
107248,October 2019,14,Steve Cash,", 73, American singer-songwriter, author and harmonica player (The Ozark Mountain Daredevils).",https://en.wikipedia.org/wiki/Steve_Cash,,4.0
69466,May 2014,28,Massimo Vignelli,", 83, Italian graphic designer (New York City Subway map, American Airlines).",https://en.wikipedia.org/wiki/Massimo_Vignelli,45.0,



In [33]:
# Filling missing values with newly obtained values
# (assigning the result avoids relying on an in-place fill of a column selection)
df["num_references_x"] = df["num_references_x"].fillna(df["num_references_y"])
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references_x,num_references_y
7310,November 1996,16,Benjamin Arthur Quarles,", 92, American historian, educator, and writer, heart attack.",https://en.wikipedia.org/wiki/Benjamin_Arthur_Quarles,4,
58951,November 2012,28,Cosimo Nocera,", 74, Italian footballer (Foggia Calcio).",https://en.wikipedia.org/wiki/Cosimo_Nocera,2,
133115,May 2022,3,Javier Barrero,", 72, Spanish politician, deputy (1982–2016), cancer.",https://en.wikipedia.org/wiki/Javier_Barrero,2,
34042,July 2007,7,John Szarkowski,", 81, American photography curator, complications of a stroke.",https://en.wikipedia.org/wiki/John_Szarkowski,32,
115376,August 2020,16,Nina McClelland,", 90, American chemist.",https://en.wikipedia.org/wiki/Nina_McClelland,26,



In [34]:
# Dropping new references column and reverting to original column name
df.drop("num_references_y", axis=1, inplace=True)
df.rename(columns={"num_references_x": "num_references"}, inplace=True)
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12
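
Because the merge, fill, drop, and rename sequence is repeated for each re-scrape, it could be wrapped in a single helper. The sketch below is an alternative formulation rather than the notebook's method: it looks values up with `Series.map` instead of merging, and the function name `fill_num_references` is hypothetical. It assumes both dataframes have `link` and `num_references` columns.

```python
import pandas as pd


def fill_num_references(df, df_new):
    """Fill missing `num_references` in df from a re-scrape result keyed on `link`."""
    # Build a link -> count lookup; keep the first count if a link was scraped twice
    lookup = df_new.drop_duplicates(subset="link").set_index("link")["num_references"]
    out = df.copy()
    # Only rows still missing a count are filled; existing values are left untouched
    out["num_references"] = out["num_references"].fillna(out["link"].map(lookup))
    return out
```

A call such as `df = fill_num_references(df, df_refs2)` would then stand in for the three cells above, and the row count cannot change because no merge is involved.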



#### Checking Remaining Missing Values

In [35]:
# Checking remaining missing values
df.isna().sum()

month_year           0
day                  0
name                 6
info                 0
link                 0
num_references    5895
dtype: int64


#### Observations:
- We have nearly 6000 remaining missing values for `num_references`, so we will iterate through the rescraping again.

In [36]:
# Checking sample of rows missing num_references
df[df["num_references"].isna()].sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
73251,December 2014,12,Alan Ward,", 79, New Zealand historian.",https://en.wikipedia.org/wiki/Alan_Ward_(historian),
56824,July 2012,26,Walter Goss,", 84, American sound engineer (, , ).",https://en.wikipedia.org/wiki/Walter_Goss,
64925,October 2013,10,Walter P. Lomax Jr.,", 79–80, American medical practitioner.",https://en.wikipedia.org/wiki/Walter_P._Lomax_Jr.,
101746,January 2019,23,Steven H. Amick,", 71, American politician, member of the Delaware House of Representatives (1987–1995) and Senate (1995–2009).",https://en.wikipedia.org/wiki/Steven_H._Amick,
98029,July 2018,13,Stan Dragoti,", 85, American film director (, , ), complications from pneumonia.",https://en.wikipedia.org/wiki/Stan_Dragoti,



#### Observations:
- Following the links again reveals that the pages contain references.  
- We will export another dataframe of the links to the pages that need to be re-scraped for `num_references` and examine the pages for alternate XPaths for scraping.

In [37]:
# Exporting dataframe of pages to rescrape for num_references
rescrape_df_2nd = df[df["num_references"].isna()]["link"]
rescrape_df_2nd.to_csv("rescrape_df_2nd.csv", index=False)


#### Observations:
- On inspection, several of these pages matched the XPath used in the last scraping iteration, so the scraping was repeated for the remaining rows with missing `num_references`.
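
The idea of trying alternate XPaths can be illustrated with a small fallback loop. This is only a sketch under stated assumptions: it uses the standard library's `xml.etree.ElementTree` on well-formed toy markup instead of Scrapy's selectors, and the paths in `CANDIDATE_PATHS` are hypothetical, since the XPaths used by the spiders are not shown here.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical candidate paths, tried in order (not the spiders' real XPaths)
CANDIDATE_PATHS = [
    ".//ol[@class='references']/li",
    ".//div[@class='reflist']/ul/li",
]


def count_references(root: ET.Element) -> Optional[int]:
    """Return the item count from the first candidate path that matches, else None."""
    for path in CANDIDATE_PATHS:
        items = root.findall(path)
        if items:
            return len(items)
    # None (not 0) keeps "layout not recognized" distinct from "no references"
    return None


# Well-formed toy markup in which only the second layout is present
page = ET.fromstring(
    "<html><body><div class='reflist'><ul>"
    "<li>ref 1</li><li>ref 2</li>"
    "</ul></div></body></html>"
)
print(count_references(page))
```

Returning `None` when no path matches leaves the value missing, so such pages can be collected for another re-scrape instead of being recorded as having zero references.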

### Second Re-scrape with Spider "refs3"

In [38]:
# Reading the refs3 dataset from sql db and table
conn = sql.connect("refs3.db")
raw_refs3 = pd.read_sql("SELECT * FROM refs3", conn)

# Making a working copy
df_refs3 = raw_refs3.copy()

# Checking the shape
print(f"There are {df_refs3.shape[0]} rows and {df_refs3.shape[1]} columns.")

# Checking first 2 rows of the data
df_refs3.head(2)

There are 3633 rows and 2 columns.


Unnamed: 0,link,num_references
0,https://en.wikipedia.org/wiki/List_of_American_supercentenarians#Grace_Thaxton,63
1,"https://en.wikipedia.org/wiki/Christopher_Prout,_Baron_Kingsland",6



In [39]:
# Checking last 2 rows of the data
df_refs3.tail(2)

Unnamed: 0,link,num_references
3631,https://en.wikipedia.org/wiki/Concepci%C3%B3n_Ram%C3%ADrez,8
3632,https://en.wikipedia.org/wiki/Mary_Mahoney_(physician),6



In [40]:
# Checking a sample of the data
df_refs3.sample(5)

Unnamed: 0,link,num_references
1917,https://en.wikipedia.org/wiki/Titus_Munteanu,2
2337,https://en.wikipedia.org/wiki/Ragnhild_Barland,1
188,https://en.wikipedia.org/wiki/Graham_Leonard,17
826,https://en.wikipedia.org/wiki/Giulio_Rinaldi,4
2181,https://en.wikipedia.org/wiki/Carol_Severance,4



#### Observations:
- We were able to obtain 3633 of the missing values.
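
Before merging a re-scrape table, it can help to check for repeated links, since a left merge multiplies rows when the key repeats on the right side, and the number of rows returned by a re-scrape need not equal the number of missing values filled when a link appears more than once in either table. The sketch below uses toy data and hypothetical variable names; `validate` is an existing `pd.merge` parameter.

```python
import pandas as pd

# Toy data: link "b" appears twice in the main table and twice in the re-scrape table
main = pd.DataFrame({"link": ["a", "b", "b"], "num_references": [3.0, None, None]})
refs = pd.DataFrame({"link": ["b", "b"], "num_references": [5, 5]})

# Number of repeated links in the re-scrape table
n_duplicate_links = int(refs["link"].duplicated().sum())

# Keeping one row per link prevents the left merge from adding rows
refs_unique = refs.drop_duplicates(subset="link")

# validate="many_to_one" raises pandas.errors.MergeError if a link repeats on the right
merged = pd.merge(main, refs_unique, how="left", on="link", validate="many_to_one")

print(n_duplicate_links, len(main), len(merged))
```

Comparing `len(main)` with `len(merged)` after each merge is a quick way to confirm that no rows were added.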

#### Adding Missing References to Dataframe

In [41]:
# Adding new num_references column to data
df = pd.merge(df, df_refs3, how="left", on="link")

# Checking sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references_x,num_references_y
71183,August 2014,25,Arthur H. White,", 90, American business consultant.",https://en.wikipedia.org/wiki/Arthur_H._White,34,
10973,March 1998,8,Alexandre Gemignani,", 72, Brazilian basketball player.",https://en.wikipedia.org/wiki/Alexandre_Gemignani,2,
24128,September 2003,12,Patrick Wilson,", 75, American librarian, philosopher, professor and author.",https://en.wikipedia.org/wiki/Patrick_Wilson_(librarian),2,
25387,April 2004,4,Austin Willis,", 87, Canadian actor and television host.",https://en.wikipedia.org/wiki/Austin_Willis,5,
40712,April 2009,13,Harry Kalas,", 73, American sportscaster, heart attack.",https://en.wikipedia.org/wiki/Harry_Kalas,29,



In [42]:
# Filling missing values with newly obtained values
df["num_references_x"] = df["num_references_x"].fillna(df["num_references_y"])
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references_x,num_references_y
111406,April 2020,2,Jan Veentjer,", 82, Dutch Olympic field hockey player (1964).",https://en.wikipedia.org/wiki/Jan_Veentjer,3,
19410,October 2001,14,Vernon Harrison,", 89, British photographer and parapsychologist.",https://en.wikipedia.org/wiki/Vernon_Harrison,6,
121244,March 2021,5,Sasa Klaas,", 27, Botswanan singer-songwriter, helicopter crash.",https://en.wikipedia.org/wiki/Sasa_Klaas,8,
133867,June 2022,7,Keijo Korhonen,", 88, Finnish diplomat and politician, minister for foreign affairs (1976–1977). (death announced on this date)",https://en.wikipedia.org/wiki/Keijo_Korhonen,10,
127505,October 2021,7,Raoul Franklin,", 86, British physicist and academic administrator, vice-chancellor of City, University of London (1978–1998).",https://en.wikipedia.org/wiki/Raoul_Franklin,5,



In [43]:
# Dropping new references column and reverting to original column name
df.drop("num_references_y", axis=1, inplace=True)
df.rename(columns={"num_references_x": "num_references"}, inplace=True)
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



#### Checking Remaining Missing Values

In [44]:
# Checking remaining missing values
df.isna().sum()

month_year           0
day                  0
name                 6
info                 0
link                 0
num_references    2260
dtype: int64


#### Observations:
- We have 2260 remaining missing values for `num_references`, so we will iterate through the rescraping again.

In [45]:
# Checking sample of rows missing num_references
df[df["num_references"].isna()].sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
66768,January 2014,17,Nadia Boudesoque,", 97, Mexican actress and Olympic fencer (1948).",https://en.wikipedia.org/wiki/Nadia_Boudesoque,
120864,February 2021,22,Pairoj Jaisingha,", 77, Thai actor ().",https://en.wikipedia.org/wiki/Pairoj_Jaisingha,
102776,March 2019,7,Issei Suda,", 78, Japanese photographer.",https://en.wikipedia.org/wiki/Issei_Suda,
120193,February 2021,3,Abdelkader Jerbi,", Tunisian film director.",https://en.wikipedia.org/wiki/Abdelkader_Jerbi,
69857,June 2014,17,Wolfram Dorn,", 89, German politician.",https://en.wikipedia.org/wiki/Wolfram_Dorn_(politician),



#### Observations:
- Following the links again reveals that the pages contain references.  
- We will export another dataframe of the links to the pages that need to be re-scraped for `num_references` and examine the pages for alternate XPaths for scraping.

In [46]:
# Exporting dataframe of pages to rescrape for num_references
rescrape_df_3rd = df[df["num_references"].isna()]["link"]
rescrape_df_3rd.to_csv("rescrape_df_3rd.csv", index=False)



### Third Re-scrape with Spider "refs4"

In [49]:
# Reading the refs4 dataset from sql db and table
conn = sql.connect("refs4.db")
raw_refs4 = pd.read_sql("SELECT * FROM refs4", conn)

# Making a working copy
df_refs4 = raw_refs4.copy()

# Checking the shape
print(f"There are {df_refs4.shape[0]} rows and {df_refs4.shape[1]} columns.")

# Checking first 2 rows of the data
df_refs4.head(2)

There are 1960 rows and 2 columns.


Unnamed: 0,link,num_references
0,https://en.wikipedia.org/wiki/List_of_French_supercentenarians,157
1,https://en.wikipedia.org/wiki/Bob_Kahler,3



In [50]:
# Checking last 2 rows of the data
df_refs4.tail(2)

Unnamed: 0,link,num_references
1958,https://en.wikipedia.org/wiki/Jorge_Sampaio,73
1959,https://en.wikipedia.org/wiki/Irma_Kalish,14



In [51]:
# Checking a sample of the data
df_refs4.sample(5)

Unnamed: 0,link,num_references
1199,https://en.wikipedia.org/wiki/Brian_Perry_(cricketer),9
1491,https://en.wikipedia.org/wiki/Imrat_Khan,7
1252,https://en.wikipedia.org/wiki/Anna_Campori,4
1698,https://en.wikipedia.org/wiki/Ignazio_Paleari,3
46,https://en.wikipedia.org/wiki/Sverre_Magelssen,7



#### Observations:
- We were able to obtain 1960 of the missing values.

#### Adding Missing References to Dataframe

In [52]:
# Adding new num_references column to data
df = pd.merge(df, df_refs4, how="left", on="link")

# Checking sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references_x,num_references_y
66309,December 2013,27,Carter Camp,", 72, American activist, chair of the American Indian Movement (1973).",https://en.wikipedia.org/wiki/Carter_Camp,6,
69884,June 2014,18,John E. Miller,", 85, American politician, member (1958–1998) and Speaker (1979–1980) of the Arkansas House of Representatives.",https://en.wikipedia.org/wiki/John_E._Miller_(Arkansas_politician),2,
88881,March 2017,24,Jean Rouverol,", 100, American actress () and screenwriter (, ).",https://en.wikipedia.org/wiki/Jean_Rouverol,10,
105997,August 2019,10,Bernard Unabali,", 62, Papua New Guinean Roman Catholic prelate, Bishop of Bougainville (since 2009).",https://en.wikipedia.org/wiki/Bernard_Unabali,3,
37520,June 2008,27,Raymond Lefèvre,", 78, French conductor.",https://en.wikipedia.org/wiki/Raymond_Lef%C3%A8vre,6,



In [53]:
# Filling missing values with newly obtained values
df["num_references_x"] = df["num_references_x"].fillna(df["num_references_y"])
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references_x,num_references_y
112581,May 2020,4,Motoko Fujishiro Huthwaite,", 92, American preservationist, last surviving female Monuments Men, COVID-19.",https://en.wikipedia.org/wiki/Motoko_Fujishiro_Huthwaite,8,
5962,April 1996,23,Mario Luigi Ciappi,", 86, Italian Cardinal of the Roman Catholic Church.",https://en.wikipedia.org/wiki/Mario_Luigi_Ciappi,3,
122874,April 2021,24,Yves Rénier,", 78, Swiss-born French actor (, , ), film director and screenwriter.",https://en.wikipedia.org/wiki/Yves_R%C3%A9nier,2,
91458,August 2017,11,Richard Gordon,", 95, English physician and author ().",https://en.wikipedia.org/wiki/Richard_Gordon_(English_author),8,
45853,June 2010,5,Esma Agolli,", 81, Albanian actress, cardiac arrest.",https://en.wikipedia.org/wiki/Esma_Agolli,1,



In [54]:
# Dropping new references column and reverting to original column name
df.drop("num_references_y", axis=1, inplace=True)
df.rename(columns={"num_references_x": "num_references"}, inplace=True)
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



#### Checking Remaining Missing Values

In [55]:
# Checking remaining missing values
df.isna().sum()

month_year          0
day                 0
name                6
info                0
link                0
num_references    300
dtype: int64


#### Observations:
- We have just 300 remaining missing values for `num_references`, so we will iterate through the rescraping again.

In [56]:
# Checking sample of rows missing num_references
df[df["num_references"].isna()].sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
125508,July 2021,26,David Von Ancken,", 56, American film and television director (, , ), stomach cancer.",https://en.wikipedia.org/wiki/David_Von_Ancken,
133329,May 2022,12,Valeriy Grishko,", 70, Russian stage director, actor (, , ) and drama teacher.",https://en.wikipedia.org/w/index.php?title=Valeriy_Grishko&action=edit&redlink=1,
125874,August 2021,8,Maria José Gonzaga,", 75, Brazilian businesswoman, philanthropist and politician, mayor of Tatuí (since 2017), abdominal cancer.",https://en.wikipedia.org/wiki/Maria_Jos%C3%A9_Gonzaga,
119811,January 2021,23,Makhosi Vilakati,", Swazi politician, minister of labour and social security (since 2019), COVID-19.",https://en.wikipedia.org/wiki/Makhosi_Vilakati,
133404,May 2022,15,Nina Mazaeva,", 100, Russian actress (, , ).",https://en.wikipedia.org/w/index.php?title=Nina_Mazaeva&action=edit&redlink=1,



#### Observations:
- Following the links again reveals that some of the pages contain references, while the links containing `redlink=1` point to pages that do not yet exist.  
- We will export another dataframe of the links to the pages that need to be re-scraped for `num_references` and examine the pages for alternate XPaths for scraping.

In [57]:
# Exporting dataframe of pages to rescrape for num_references
rescrape_df_4th = df[df["num_references"].isna()]["link"]
rescrape_df_4th.to_csv("rescrape_df_4th.csv", index=False)


### Fourth Re-scrape with Spider "refs5"

In [58]:
# Reading the refs5 dataset from sql db and table
conn = sql.connect("refs5.db")
raw_refs5 = pd.read_sql("SELECT * FROM refs5", conn)

# Making a working copy
df_refs5 = raw_refs5.copy()

# Checking the shape
print(f"There are {df_refs5.shape[0]} rows and {df_refs5.shape[1]} columns.")

# Checking first 2 rows of the data
df_refs5.head(2)

There are 208 rows and 2 columns.


Unnamed: 0,link,num_references
0,https://en.wikipedia.org/wiki/Mohamed_Haytham_Khayat,5
1,https://en.wikipedia.org/wiki/Sid_McCray,6



In [59]:
# Checking last 2 rows of the data
df_refs5.tail(2)

Unnamed: 0,link,num_references
206,https://en.wikipedia.org/wiki/Anna_Cataldi,9
207,https://en.wikipedia.org/wiki/Allan_Egolf,4



In [60]:
# Checking a sample of the data
df_refs5.sample(5)

Unnamed: 0,link,num_references
207,https://en.wikipedia.org/wiki/Allan_Egolf,4
187,https://en.wikipedia.org/wiki/Hans_M%C3%BCller_(figure_skater),1
110,https://en.wikipedia.org/wiki/Ken_McCaffery,4
58,https://en.wikipedia.org/wiki/Ruhollah_Zam,22
128,https://en.wikipedia.org/wiki/Miguel_Arsenio_Lara_Sosa,5



#### Observations:
- We were able to obtain 208 of the missing values.

#### Adding Missing References to Dataframe

In [61]:
# Adding new num_references column to data
df = pd.merge(df, df_refs5, how="left", on="link")

# Checking sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references_x,num_references_y
124116,June 2021,4,John M. Patterson,", 99, American politician, attorney general (1955–1959) and governor (1959–1963) of Alabama.",https://en.wikipedia.org/wiki/John_M._Patterson,49,
67011,January 2014,29,Biko,", 30, American Olympic eventing horse (1996), euthanized.",https://en.wikipedia.org/wiki/Biko_(horse),21,
133066,April 2022,30,Mino Raiola,", 54, Italian football agent (Pavel Nedvěd, Paul Pogba, Zlatan Ibrahimović).",https://en.wikipedia.org/wiki/Mino_Raiola,63,
97785,June 2018,29,Lawrence Rondon,", 68, Trinidadian footballer.",https://en.wikipedia.org/wiki/Lawrence_Rondon,2,
37730,July 2008,20,Yann Richter,", 80, Swiss politician, president of the FDP (1978–1984), heart disease.",https://en.wikipedia.org/wiki/Yann_Richter,0,



In [62]:
# Filling missing values with newly obtained values
df["num_references_x"] = df["num_references_x"].fillna(df["num_references_y"])
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references_x,num_references_y
26503,October 2004,19,Sang Lee,", 51, Korean-American three-cushion billiard player, stomach cancer.",https://en.wikipedia.org/wiki/Sang_Lee,4,
44839,March 2010,23,Bob Abbott,", 77, American judge.",https://en.wikipedia.org/wiki/Bob_Abbott,6,
41188,May 2009,31,Martin Clemens,", 94, British colonial administrator and soldier.",https://en.wikipedia.org/wiki/Martin_Clemens,8,
133755,June 2022,1,Richard Oldcorn,", 84, English Olympic fencer (1964, 1968, 1972).",https://en.wikipedia.org/wiki/Richard_Oldcorn,5,
24801,December 2003,25,Frédéric Berthet,", 49, French writer.",https://en.wikipedia.org/wiki/Fr%C3%A9d%C3%A9ric_Berthet,2,



In [63]:
# Dropping new references column and reverting to original column name
df.drop("num_references_y", axis=1, inplace=True)
df.rename(columns={"num_references_x": "num_references"}, inplace=True)
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



#### Checking Remaining Missing Values

In [64]:
# Checking remaining missing values
df.isna().sum()

month_year         0
day                0
name               6
info               0
link               0
num_references    92
dtype: int64


#### Observations:
- We have just 92 remaining missing values for `num_references`.
- Let us examine the remaining rows with missing values more closely.

In [68]:
# Checking all rows still missing num_references
df[df["num_references"].isna()]

Unnamed: 0,month_year,day,name,info,link,num_references
12476,October 1998,29,Sjoerdtsje Faber,", 83, Dutch speed skater.",https://en.wikipedia.org/w/index.php?title=Sjoerdje_Faber&action=edit&redlink=1,
15859,May 2000,10,Bill Foster,", 68, American entertainer.",https://en.wikipedia.org/w/index.php?title=Bill_Foster_(performer)&action=edit&redlink=1,
18561,June 2001,20,Demetreus Nix,"Douglas Scott, 20, High-school student murdered by .",https://en.wikipedia.org/w/index.php?title=Demetreus_Nix&action=edit&redlink=1,
19905,December 2001,18,Sietske Pasveer,", 86, Dutch speed skater.",https://en.wikipedia.org/w/index.php?title=Sietske_Pasveer&action=edit&redlink=1,
22942,March 2003,3,Gilbert Wheeler Beebe,", 90, American epidemiologist and statistician, conducted ground-breaking radiation exposure studies.",https://en.wikipedia.org/w/index.php?title=Gilbert_Wheeler_Beebe&action=edit&redlink=1,
23291,April 2003,27,Charles A. Marvin,", 73, American district attorney and judge.",https://en.wikipedia.org/w/index.php?title=Charles_A._Marvin&action=edit&redlink=1,
24346,October 2003,15,Ray Kuhlman,", 84, American pilot and businessman.",https://en.wikipedia.org/w/index.php?title=Ray_Kuhlman&action=edit&redlink=1,
33653,May 2007,20,Moses Siregar,", 96, Maritime Chef and American World War 2 Veteran",https://en.wikipedia.org/w/index.php?title=Moses_Siregar&action=edit&redlink=1,
35963,January 2008,31,Michael A. Dions,", 90, American Olympic swimmer, natural causes",https://en.wikipedia.org/w/index.php?title=Michael_A._Dions&action=edit&redlink=1,
36198,February 2008,19,Samuel Champkin,", 28, American singer in Metal band Tech Giants, car crash.",https://en.wikipedia.org/w/index.php?title=Samuel_Champkin&action=edit&redlink=1,



#### Observations:
- None of the remaining links leads to a page for the individual, as no such page exists yet (the links are red links).
- Since these rows contain the other elements necessary for analysis, it is safe to replace their missing `num_references` values with 0.
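
Before filling with 0, the claim that every remaining link is a red link could be checked programmatically. The sketch below is illustrative only: it uses two toy rows with links taken from the samples above, and it assumes that a red link can be recognized by `redlink=1` in its URL.

```python
import pandas as pd

# Toy rows standing in for the rows still missing num_references
missing = pd.DataFrame(
    {
        "link": [
            "https://en.wikipedia.org/w/index.php?title=Ray_Kuhlman&action=edit&redlink=1",
            "https://en.wikipedia.org/wiki/Issei_Suda",
        ],
        "num_references": [float("nan"), float("nan")],
    }
)

# Assumption: a red link is recognizable by "redlink=1" in its URL
is_redlink = missing["link"].str.contains("redlink=1", regex=False)

# Fill 0 only where the page does not exist; other rows stay missing for review
missing.loc[is_redlink, "num_references"] = 0

print(missing["num_references"].tolist())
```

If `is_redlink.all()` holds for the real rows, filling every remaining missing value with 0 gives the same result as this targeted fill.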

In [72]:
# Fill remaining missing values for num_references with 0
df["num_references"] = df["num_references"].fillna(0)

# Recheck missing values
df.isna().sum()

month_year        0
day               0
name              6
info              0
link              0
num_references    0
dtype: int64


#### Observations:
- All of the missing values for `num_references` have been addressed.
- The 6 remaining missing values for `name` will be fixed during data cleaning.
- We now have our complete raw dataset, which we will also export to a SQLite database for safekeeping.
- Then it is time to start cleaning the data.

In [74]:
# Saving complete raw dataset in a SQLite database
# conn = sql.connect("wp_notable_deaths_raw_complete.db")
# df.to_sql("wp_notable_deaths_raw_complete", conn)

133900
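
A round trip through SQLite can confirm that an export like the one above is readable. This is a minimal sketch on a toy dataframe, assuming `sql` is the standard `sqlite3` module (the import is not shown in this section); `df_demo` and the in-memory database are stand-ins for the real data and file.

```python
import sqlite3 as sql

import pandas as pd

# Toy frame standing in for the complete raw dataset
df_demo = pd.DataFrame({"name": ["William Chappell"], "num_references": [21]})

# An in-memory database keeps the sketch self-contained; a file path works the same way
conn = sql.connect(":memory:")
try:
    # if_exists="replace" lets the cell be rerun; index=False skips the index column
    df_demo.to_sql(
        "wp_notable_deaths_raw_complete", conn, if_exists="replace", index=False
    )
    restored = pd.read_sql("SELECT * FROM wp_notable_deaths_raw_complete", conn)
finally:
    # Close the connection even if the write or read fails
    conn.close()

print(restored.shape)
```

Passing `if_exists="replace"` is one way to make the export cell safe to rerun, which may remove the need to comment it out after the first run.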
