# Wikipedia Notable Deaths

## Context

The...

## Objective

The...

## Data Dictionary
Variable: Description

### Data Collection
Scraped on 6/9/22 with scrapy.

Main site: https://en.wikipedia.org/wiki/Lists_of_deaths_by_year
Start_url: https://en.wikipedia.org/wiki/Deaths_in_January_1994
Pagination from the start URL through May 2022 was successful.
Reference counts were scraped separately, due to pagination, to maintain the original order in the first dataset.
June 2022, as the current month, had inconsistent formatting and was scraped separately.
~13,000 entries are missing a reference count.

On 6/10/22, set out to identify links with a missing reference count in order to re-scrape them.

## Importing necessary libraries

In [1]:
# To limit the number of OpenMP threads and thereby avoid the known memory leak associated with KMeans
# Note: this cell must be run BEFORE importing numpy to have the desired effect
import os

os.environ["OMP_NUM_THREADS"] = "1"

In [2]:
# To structure code automatically
%load_ext nb_black

# To import sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

# To be used for data scaling
from sklearn.preprocessing import StandardScaler

# To compute distances
from scipy.spatial.distance import cdist, pdist

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# To visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from yellowbrick.style.palettes import PALETTES

# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)
sns.set_theme()
sns.set_palette(
    (
        "midnightblue",
        "goldenrod",
        "maroon",
        "darkolivegreen",
        "cadetblue",
        "tab:purple",
        "yellowgreen",
    )
)
plt.rc("font", size=12)
plt.rc("axes", titlesize=15)
plt.rc("axes", labelsize=14)
plt.rc("xtick", labelsize=13)
plt.rc("ytick", labelsize=13)
plt.rc("legend", fontsize=14)
plt.rc("figure", titlesize=16)


## Data Overview

### Reading, sampling, and checking data shape

### January 1994 through May 2022 Data (without reference counts)

In [3]:
# Reading the wp_deaths_94_to_22 dataset from sql db and table
conn = sql.connect("wp_deaths_94_to_22.db")
raw_94_to_22 = pd.read_sql("SELECT * FROM wp_deaths_94_to_22", conn)

# Making a working copy
df_94_to_22 = raw_94_to_22.copy()

# Checking the shape
print(f"There are {df_94_to_22.shape[0]} rows and {df_94_to_22.shape[1]} columns.")

# Checking first 2 rows of the data
df_94_to_22.head(2)

There are 133769 rows and 5 columns.


Unnamed: 0,month_year,day,name,info,link
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer)
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty



In [4]:
# Checking last 2 rows of the data
df_94_to_22.tail(2)

Unnamed: 0,month_year,day,name,info,link
133767,May 2022,31,Dave Smith,", 72, American sound engineer, founder of Sequential.",https://en.wikipedia.org/wiki/Dave_Smith_(engineer)
133768,May 2022,31,Wang Zherong,", 86, Chinese tank designer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Wang_Zherong



In [5]:
# Checking a sample of the data
df_94_to_22.sample(5)

Unnamed: 0,month_year,day,name,info,link
22651,January 2003,17,Richard Crenna,", 74, American actor (, ), heart failure.",https://en.wikipedia.org/wiki/Richard_Crenna
88354,February 2017,26,Ray Stokes,", 92, Australian footballer.",https://en.wikipedia.org/wiki/Ray_Stokes
40601,April 2009,3,Crodowaldo Pavan,", 89, Brazilian biologist and geneticist, multiple organ dysfunction syndrome and cancer.",https://en.wikipedia.org/wiki/Crodowaldo_Pavan
100011,October 2018,29,Li Xifan,", 90, Chinese literary scholar and redologist.",https://en.wikipedia.org/wiki/Li_Xifan
30384,April 2006,4,Vickery Turner,", 61, British actress of the 60's.",https://en.wikipedia.org/wiki/Vickery_Turner



#### Observations:
- There are 133,769 rows and 5 columns in the data from January, 1994 through May, 2022.
- The number of references was scraped separately.
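The cells in this notebook open a SQLite connection with `sql.connect` and never close it. As a minimal sketch, assuming a hypothetical helper named `read_table` and a throwaway demo database (neither is part of the original notebook), the read could be wrapped so that the connection is always released:

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd


def read_table(db_path: str, table: str) -> pd.DataFrame:
    """Read a whole SQLite table into a dataframe, then close the connection."""
    # closing() guarantees conn.close() runs even if the query raises
    with closing(sqlite3.connect(db_path)) as conn:
        # Table names cannot be bound as SQL parameters, so the identifier is quoted
        return pd.read_sql(f'SELECT * FROM "{table}"', conn)


# Demonstration on a temporary database with one illustrative row
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
with closing(sqlite3.connect(demo_path)) as conn:
    conn.execute("CREATE TABLE demo (link TEXT, num_references TEXT)")
    conn.execute("INSERT INTO demo VALUES ('https://example.org/a', '5')")
    conn.commit()

demo_df = read_table(demo_path, "demo")
print(f"There are {demo_df.shape[0]} rows and {demo_df.shape[1]} columns.")
```

The same helper could then be called once per database file instead of reassigning `conn` in each cell.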

### January 1994 through May 2022 Reference Count Data

In [6]:
# Reading the wp_reference_counts_2 dataset from sql db and table
conn = sql.connect("wp_reference_counts_2.db")
raw_reference_counts = pd.read_sql("SELECT * FROM wp_reference_counts_2", conn)

# Making a working copy
df_reference_counts = raw_reference_counts.copy()

# Checking the shape
print(
    f"There are {df_reference_counts.shape[0]} rows and {df_reference_counts.shape[1]} columns."
)

# Checking first 2 rows of the data
df_reference_counts.head(2)

There are 120368 rows and 2 columns.


Unnamed: 0,link,num_references
0,https://en.wikipedia.org/wiki/Lys_Gauty,5
1,https://en.wikipedia.org/wiki/William_Chappell_(dancer),21



In [7]:
# Checking last 2 rows of the data
df_reference_counts.tail(2)

Unnamed: 0,link,num_references
120366,https://en.wikipedia.org/wiki/Shirley_Thomas_(USC_professor),6
120367,https://en.wikipedia.org/wiki/James_Doohan,52



In [8]:
# Checking a sample of the data
df_reference_counts.sample(5)

Unnamed: 0,link,num_references
64325,https://en.wikipedia.org/wiki/Doe_B,49
97370,https://en.wikipedia.org/wiki/Adrian_Henri,4
90945,https://en.wikipedia.org/wiki/Taiji_Kase,24
79741,https://en.wikipedia.org/wiki/Kamal_Mahsud,1
40797,https://en.wikipedia.org/wiki/Gerry_Lenfest,18



#### Observations:
- Here, we see that there are ~13,000 fewer rows for the reference data, indicating some pages were not successfully scraped to obtain the number of references for the individual.
- After combining the three dataframes, we can examine those pages and reattempt scraping them, in order to obtain the missing information.
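One way to identify the pages without a reference count is a left merge with pandas' `indicator=True` option, which labels each row by where its key was found. The sketch below uses small made-up frames (`deaths`, `ref_counts`) in place of the real data:

```python
import pandas as pd

# Illustrative stand-ins for the scraped tables
deaths = pd.DataFrame({"name": ["A", "B", "C"], "link": ["u/a", "u/b", "u/c"]})
ref_counts = pd.DataFrame({"link": ["u/a", "u/c"], "num_references": ["5", "21"]})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge
merged = deaths.merge(ref_counts, how="left", on="link", indicator=True)

# Rows found only on the left have no scraped reference count
missing_links = merged.loc[merged["_merge"] == "left_only", "link"]
print(missing_links.tolist())
```

This gives the same set of rows as filtering on a null `num_references` after the merge, provided the scraped counts themselves contain no nulls.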

### June 2022 Data

In [9]:
# Reading the wp_deaths_June_2022 dataset from sql db and table
conn = sql.connect("wp_deaths_June_2022.db")
raw_June_2022 = pd.read_sql("SELECT * FROM wp_deaths_June_2022", conn)

# Making a working copy
df_June_2022 = raw_June_2022.copy()

# Checking the shape
print(f"There are {df_June_2022.shape[0]} rows and {df_June_2022.shape[1]} columns.")

# Checking first 2 rows of the data
df_June_2022.head(2)

There are 145 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,June 2022,8,Mladen Frančić,", 67, Croatian football player and manager (Vrbovec, Podravina, Al-Watani Club).",https://en.wikipedia.org/wiki/Mladen_Fran%C4%8Di%C4%87,1
1,June 2022,6,Valery Ryumin,", 82, Russian cosmonaut (Soyuz 25, Soyuz 32, Soyuz 35).",https://en.wikipedia.org/wiki/Valery_Ryumin,2



#### Observations:
- The June 2022 data does not follow the same row order as the previous dataframe, which was ordered by day of the month.
- For continuity, before concatenating the two dataframes, we will sort the June 2022 data by day.
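A caveat on sorting: if `day` is stored as text (an assumption here, based on the column later being reported as `object` dtype), a plain sort is lexicographic and would place "10" before "2" once the month has two-digit days. A hedged sketch with made-up rows, using the `key` argument of `sort_values` to sort numerically:

```python
import pandas as pd

# Illustrative rows with day stored as strings
demo = pd.DataFrame({"day": ["8", "10", "2"], "name": ["X", "Y", "Z"]})

# Plain string sort: "10" < "2" < "8"
text_sorted = demo.sort_values(by="day")

# Numeric sort: convert to int only for the comparison; a stable sort
# keeps the original relative order of entries within the same day
num_sorted = demo.sort_values(by="day", key=lambda s: s.astype(int), kind="stable")

print(text_sorted["day"].tolist())
print(num_sorted["day"].tolist())
```

For data covering only days 1 through 9 the two orderings coincide, so the difference only matters for later days of the month.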

In [10]:
# Sorting by day
df_June_2022.sort_values(by="day", inplace=True)

# Re-checking first 2 rows of the data
df_June_2022.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references
26,June 2022,1,Richard Oldcorn,", 84, English Olympic fencer (1964, 1968, 1972).",https://en.wikipedia.org/wiki/Richard_Oldcorn,5
20,June 2022,1,István Szőke,", 75, Hungarian footballer (Ferencváros, national team), stroke.",https://en.wikipedia.org/wiki/Istv%C3%A1n_Sz%C5%91ke,2



In [11]:
# Checking last 2 rows of the data
df_June_2022.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
8,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
5,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [12]:
# Checking a sample of the data
df_June_2022.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
129,June 2022,7,Robert Alexander,", 64, American football player (Los Angeles Rams).",https://en.wikipedia.org/wiki/Robert_Alexander_(American_football),2
125,June 2022,7,Trudy Haynes,", 95, American journalist (WXYZ-TV, KYW-TV).",https://en.wikipedia.org/wiki/Trudy_Haynes,13
18,June 2022,1,Takeyoshi Tanuma,", 93, Japanese photographer.",https://en.wikipedia.org/wiki/Takeyoshi_Tanuma,2
143,June 2022,8,Rocky Freitas,", 76, American football player (Detroit Lions, Tampa Bay Buccaneers).",https://en.wikipedia.org/wiki/Rocky_Freitas,4
70,June 2022,3,Piergiorgio Bressani,", 92, Italian politician, deputy (1963–1986), and mayor of Udine (1985–1990).",https://en.wikipedia.org/wiki/Piergiorgio_Bressani,3



#### Observations:
- Now, we are ready to combine the three dataframes.

## Combining Dataframes

### Adding Number of References to 1994 through May 2022 Data

In [13]:
# Adding num_references column to 1994 through May 2022 data
df_combined = pd.merge(df_94_to_22, df_reference_counts, how="left", on="link")

# Checking first 2 rows of the data
df_combined.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



### Adding June 2022 Data

In [14]:
# Adding June 2022 data
df_combined = pd.concat([df_combined, df_June_2022], ignore_index=True)

# Making a working copy
df = df_combined.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 133914 rows and 6 columns.


Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



In [15]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,month_year,day,name,info,link,num_references
133912,June 2022,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion (1980) and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2
133913,June 2022,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3



In [16]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
9743,October 1997,5,Dave Marr,", 63, American golfer and sportscaster, stomach cancer.",https://en.wikipedia.org/wiki/Dave_Marr,8
30500,April 2006,22,Alida Valli,", 84, Italian actress ().",https://en.wikipedia.org/wiki/Alida_Valli,15
102555,February 2019,24,Dame Margaret Scott,", 96, South African-Australian ballet dancer.",https://en.wikipedia.org/wiki/Margaret_Scott_(dancer),20
100214,November 2018,8,Raymond Plank,", 96, American businessman (Apache Corporation).",https://en.wikipedia.org/wiki/Raymond_Plank,19
55086,April 2012,11,Keith Leeson,", 83, Australian Olympic hockey player.",https://en.wikipedia.org/wiki/Keith_Leeson,2



#### Confirming Correct Number of Resultant Entries

In [17]:
# Confirming correct number of total rows
df_94_to_22.shape[0] + df_June_2022.shape[0]

133914


#### Observations:
- We have successfully combined the three dataframes.
- Now, we can check for data types, duplicates, and missing values.
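A left merge can add rows when the right-hand table repeats a key, which is why the row count is worth confirming. As a sketch on made-up frames (not the scraped data), the `validate` argument of `merge` can make pandas raise an error instead of silently inflating the result:

```python
import pandas as pd

left = pd.DataFrame({"link": ["u/a", "u/b"], "name": ["A", "B"]})
# The right-hand table repeats the key "u/a"
right_dupes = pd.DataFrame({"link": ["u/a", "u/a"], "num_references": ["5", "5"]})

# Without validation, the repeated key duplicates the matching left row
inflated = left.merge(right_dupes, how="left", on="link")
print(len(left), len(inflated))

# validate="many_to_one" requires the merge keys to be unique on the right
try:
    left.merge(right_dupes, how="left", on="link", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# Dropping repeated keys first makes the merge row-preserving
deduped = right_dupes.drop_duplicates(subset="link")
safe = left.merge(deduped, how="left", on="link", validate="many_to_one")
print(raised, len(safe))
```

Whether repeated links should be dropped, or kept because the counts differ, is a judgment call that depends on the scraped data.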

## Checking data types, duplicates, and null values

### Data types

In [18]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133914 entries, 0 to 133913
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133914 non-null  object
 1   day             133914 non-null  object
 2   name            133903 non-null  object
 3   info            133914 non-null  object
 4   link            133914 non-null  object
 5   num_references  120637 non-null  object
dtypes: object(6)
memory usage: 6.1+ MB



#### Observations:
- There are 6 columns, all of object type.
- `name` and `num_references` both have missing values.
- The data is in a very raw format and there are other columns that have combined information that will need to be extracted.
- For now, we will leave all as object type.
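When the time comes to convert a column such as `num_references` to a numeric type, one possible approach (a sketch on a made-up series, not necessarily the method used later in this project) is `pd.to_numeric` with `errors="coerce"`, followed by the nullable `Int64` dtype so that missing values survive the conversion:

```python
import pandas as pd

# Illustrative values: numeric strings, a missing value, and an unparseable entry
raw = pd.Series(["21", "12", None, "n/a"])

# errors="coerce" turns anything unparseable into NaN instead of raising
as_float = pd.to_numeric(raw, errors="coerce")

# The nullable integer dtype keeps whole numbers as integers and NaN as <NA>
converted = as_float.astype("Int64")
print(converted.tolist())
```

Coercion hides unparseable entries, so it is worth counting the nulls before and after to see how many values were affected.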

### Duplicate Rows

In [19]:
# Checking duplicate rows
df.duplicated().sum()

9


#### Observations:
- There are 9 duplicate rows that we will drop now.

In [20]:
# Drop duplicate rows
df.drop_duplicates(inplace=True, ignore_index=True)

# Re-check shape
df.shape

(133905, 6)


### Missing Values

In [21]:
# Check null values per column as a percentage of non-null values (df.count() excludes nulls)
df.isnull().sum() / df.count() * 100

month_year        0.000
day               0.000
name              0.008
info              0.000
link              0.000
num_references   11.007
dtype: float64
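Note that the expression above divides the null count by `df.count()`, the number of non-null values, so the result is a null-to-non-null ratio rather than a share of all rows. A small sketch on made-up data (the `demo` frame is illustrative only) shows the difference; `isnull().mean()` gives the share of all rows:

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "name": ["A", None, "C", "D"],
        "num_references": ["5", None, None, "2"],
    }
)

# Share of all rows that are null: nulls / total rows
pct_of_rows = demo.isnull().mean() * 100

# Ratio used in the cell above: nulls / non-null values
ratio_to_non_null = demo.isnull().sum() / demo.count() * 100

print(pct_of_rows)
print(ratio_to_non_null)
```

The two measures are close when few values are missing and diverge as the missing share grows.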


In [22]:
# Checking number of missing values per row
df.isnull().sum(axis=1).value_counts()

0    120628
1     13266
2        11
dtype: int64


#### Observations:
- The number of rows missing only 1 value appears consistent with our anticipated missing `num_references`.
- There are 11 rows that are each missing 2 values.  Let us take a closer look at these rows.

In [23]:
# Checking the rows that are missing values for 2 columns
missing_2 = df[df.isnull().sum(axis=1) == 2]
missing_2

Unnamed: 0,month_year,day,name,info,link,num_references
18937,August 2001,11,,"Kevin Kowalcyk, 2, known for eating a hamburger contaminated with E. coli O157:H7.",https://en.wikipedia.orgNone,
24985,January 2004,22,,"Vincent Palmer, 37, British criminal.",https://en.wikipedia.orgNone,
27458,March 2005,1,,"Barry Stigler, 57, American voice actor.",https://en.wikipedia.orgNone,
34077,July 2007,11,,"Nana Gualdi, 75, German singer and actress.",https://en.wikipedia.orgNone,
35097,November 2007,11,,,https://en.wikipedia.orgNone,
41075,May 2009,18,,Either killed in a missile attack or shot:\n,https://en.wikipedia.orgNone,
64771,September 2013,29,,"Scott Workman, 47, American stuntman (, , ), cancer.",https://en.wikipedia.orgNone,
76024,April 2015,29,,Notable convicted drug traffickers executed by Indonesian firing squad:\n,https://en.wikipedia.orgNone,
105871,August 2019,2,,"Japanese convicted murderers, executed by hanging.\n",https://en.wikipedia.orgNone,
106617,September 2019,12,,"Thami Shobede, 31, Singer Songwriter",https://en.wikipedia.orgNone,



#### Observations:
- We can see that multiple rows are missing `name`, but have the name in `info`, so we can extract it later.  
- The missing link itself is not of concern as it serves only as a means by which to retrieve the `num_references` value.
- As there is no associated link for the individual, we can safely replace the NaN `num_references` values for rows with extractable names with 0.
- We can proceed with removing the rows that lack an extractable name, as they also lack other information necessary for the analysis.
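The later name extraction could look something like the following sketch. It assumes the name is everything before the first comma in `info`, which holds for the rows shown above but would fail for a name that itself contains a comma. The `demo` frame reuses two entries from this notebook's output for illustration:

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "name": [None, "Raymond Crotty"],
        "info": [
            "Vincent Palmer, 37, British criminal.",
            ", 68, Irish economist, writer, and academic.",
        ],
    }
)

# Only touch rows where name is missing
needs_name = demo["name"].isna()

# Split once at the first comma: column 0 is the name, column 1 is the rest
parts = demo.loc[needs_name, "info"].str.split(",", n=1, expand=True)

demo.loc[needs_name, "name"] = parts[0].str.strip()
# Restore the leading comma so info matches the format of the other rows
demo.loc[needs_name, "info"] = "," + parts[1]

print(demo)
```

Rows that already have a name are left unchanged because the boolean mask restricts both the split and the assignment.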

In [24]:
# List of rows to keep
keep_rows = [18937, 24985, 27458, 34077, 64771, 106617]

# Replacing num_references NaNs with 0 for rows with extractable names
df.loc[keep_rows, "num_references"] = 0

# List of rows to remove
remove_rows = [index for index in missing_2.index if index not in keep_rows]
del missing_2

# Dropping rows
df.drop(remove_rows, inplace=True)

# Re-checking shape
df.shape

(133900, 6)


In [25]:
# Re-check info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 133900 entries, 0 to 133904
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month_year      133900 non-null  object
 1   day             133900 non-null  object
 2   name            133894 non-null  object
 3   info            133900 non-null  object
 4   link            133900 non-null  object
 5   num_references  120634 non-null  object
dtypes: object(6)
memory usage: 7.2+ MB



#### Observations:
- There are now only 6 rows with missing `name`, corresponding to the names we identified in `info`, that we will extract later.
- The remaining missing values are all for `num_references`, so we can proceed to make another attempt at scraping this information.
- Let us check a sample of these rows.

In [26]:
# Checking sample of rows missing num_references
df[df["num_references"].isna()].sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
54511,March 2012,10,Ruth-Marie Stewart,", 84, American Olympic skier.",https://en.wikipedia.org/wiki/Ruth-Marie_Stewart,
43152,November 2009,20,Roman Trakhtenberg,", 41, Russian actor, television and radio presenter, heart attack.",https://en.wikipedia.org/wiki/Roman_Trakhtenberg,
65499,November 2013,12,Raymond S. Burton,", 74, American politician, Executive Councillor for New Hampshire District 1 (1977–1979, since 1981), kidney cancer.",https://en.wikipedia.org/wiki/Raymond_S._Burton,
96526,April 2018,24,Emma Smith,", 94, English author ().",https://en.wikipedia.org/wiki/Emma_Smith_(author),
93930,December 2017,25,Renato Marchiaro,", 98, Italian footballer.",https://en.wikipedia.org/wiki/Renato_Marchiaro,



#### Observations:
- Following the links reveals that the pages contain references.  
- Therefore, they either have a variation in the XPath followed for scraping, or Scrapy had an issue with following their links.
- We will export a dataframe of the links to the pages that need to be re-scraped for `num_references`.

In [27]:
# Exporting dataframe of pages to rescrape for num_references
rescrape_df = df[df["num_references"].isna()]["link"]
rescrape_df.to_csv("rescrape_df.csv", index=False)


#### Observations:
- A second iteration of scraping individual pages for number of references reveals variation in the XPath for those pages.
- At least a third variation appears to exist, as the second scraping obtained only a little more than half of the missing values.
- We will import and merge the data, as before.

### Missing Reference Count Data -- First Re-scrape

In [28]:
# Reading the refs2 dataset from sql db and table
conn = sql.connect("refs2.db")
raw_refs2 = pd.read_sql("SELECT * FROM refs2", conn)

# Making a working copy
df_refs2 = raw_refs2.copy()

# Checking the shape
print(f"There are {df_refs2.shape[0]} rows and {df_refs2.shape[1]} columns.")

# Checking first 2 rows of the data
df_refs2.head(2)

There are 7365 rows and 2 columns.


Unnamed: 0,link,num_references
0,https://en.wikipedia.org/wiki/List_of_American_supercentenarians#Charlotte_Benkner,63
1,https://en.wikipedia.org/wiki/Eugene_Record,11



In [29]:
# Checking last 2 rows of the data
df_refs2.tail(2)

Unnamed: 0,link,num_references
7363,https://en.wikipedia.org/wiki/Gunnar_Utterberg,3
7364,https://en.wikipedia.org/wiki/Bill_Sudakis,19



In [30]:
# Checking a sample of the data
df_refs2.sample(5)

Unnamed: 0,link,num_references
7074,https://en.wikipedia.org/wiki/Jon_Westling,7
6003,https://en.wikipedia.org/wiki/Kenyatta_Jones,4
3503,https://en.wikipedia.org/wiki/Owen_Lynch,2
1326,https://en.wikipedia.org/wiki/Sandy_Allen,9
6631,https://en.wikipedia.org/wiki/Lorenza_Mazzetti,4



#### Observations:
- We were able to obtain 7,365 of the missing values.

#### Adding Missing References to Dataframe

In [31]:
# Merging the re-scraped reference counts into the main dataframe
df = pd.merge(df, df_refs2, how="left", on="link")

# Checking sample of the data
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references_x,num_references_y
6069,May 1996,11,Ademir de Menezes,", 73, Brazilian football player and manager.",https://en.wikipedia.org/wiki/Ademir_de_Menezes,5.0,
103350,April 2019,3,Einar Iversen,", 88, Norwegian jazz pianist and composer.",https://en.wikipedia.org/wiki/Einar_Iversen,6.0,
123585,May 2021,16,Vijay Singh Yadav,", 67–68, Indian politician, Bihar MLA (1995–2000), MP (2000–2006), COVID-19.",https://en.wikipedia.org/wiki/Vijay_Singh_Yadav,6.0,
83078,May 2016,20,Ádám Rajhona,", 72, Hungarian actor.",https://en.wikipedia.org/wiki/%C3%81d%C3%A1m_Rajhona,,1.0
6715,August 1996,28,Gevork Kotiantz,", 86, Russian artist.",https://en.wikipedia.org/wiki/Gevork_Kotiantz,26.0,



In [38]:
# Filling missing values with newly obtained values
df["num_references_x"] = df["num_references_x"].fillna(df["num_references_y"])
df.sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references_x,num_references_y
113760,June 2020,17,Hugh Fraser,", 62, Canadian jazz musician, cancer.",https://en.wikipedia.org/wiki/Hugh_Fraser_(musician),3,
72487,November 2014,3,Sadashiv Amrapurkar,", 64, Indian actor (, ), lung infection.",https://en.wikipedia.org/wiki/Sadashiv_Amrapurkar,13,13.0
12190,September 1998,8,Leonid Kinskey,", 95, Russian-born actor, complications of a stroke.",https://en.wikipedia.org/wiki/Leonid_Kinskey,10,
103158,March 2019,25,Adolph Lawrence,", 50, Liberian politician, member of the House of Representatives (since 2012), traffic collision.",https://en.wikipedia.org/wiki/Adolph_Lawrence,2,
57740,September 2012,21,Konda Laxman Bapuji,", 96, Indian politician and freedom fighter.",https://en.wikipedia.org/wiki/Konda_Laxman_Bapuji,9,



In [46]:
# Dropping new references column and reverting to original column name
df.drop("num_references_y", axis=1, inplace=True)
df.rename(columns={"num_references_x": "num_references"}, inplace=True)
df.head(2)

Unnamed: 0,month_year,day,name,info,link,num_references
0,January 1994,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21
1,January 1994,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12



#### Checking Remaining Missing Values

In [47]:
# Checking remaining missing values
df.isna().sum()

month_year           0
day                  0
name                 6
info                 0
link                 0
num_references    5895
dtype: int64


#### Observations:
- We have nearly 6000 remaining missing values for `num_references`, so we will iterate through the rescraping again.
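Since the merge, fill, drop, and rename sequence will be repeated for each re-scrape, it could be wrapped in a helper. The sketch below is one possible shape; the function name `fill_reference_counts` and the `_new` suffix are assumptions for illustration, and the demo frames are made up:

```python
import pandas as pd


def fill_reference_counts(df: pd.DataFrame, new_counts: pd.DataFrame) -> pd.DataFrame:
    """Fill missing num_references in df from a re-scraped link/count table."""
    # Keep one row per link so the left merge cannot add rows
    new_counts = new_counts.drop_duplicates(subset="link")
    # The empty left suffix keeps the original column name; the new one gets "_new"
    merged = df.merge(new_counts, how="left", on="link", suffixes=("", "_new"))
    # Existing values win; only nulls are filled from the re-scrape
    merged["num_references"] = merged["num_references"].fillna(
        merged["num_references_new"]
    )
    return merged.drop(columns="num_references_new")


# Illustrative data: two links are missing counts, the re-scrape recovers one
demo = pd.DataFrame(
    {"link": ["u/a", "u/b", "u/c"], "num_references": ["5", None, None]}
)
rescraped = pd.DataFrame({"link": ["u/b"], "num_references": ["3"]})

result = fill_reference_counts(demo, rescraped)
print(result)
print(result["num_references"].isna().sum())
```

Returning a new dataframe rather than modifying in place makes each iteration easy to compare against the previous one.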

In [48]:
# Checking sample of rows missing num_references
df[df["num_references"].isna()].sample(5)

Unnamed: 0,month_year,day,name,info,link,num_references
55356,April 2012,28,Joaquín Dualde,", 79, Spanish Olympic bronze medal-winning (1960) field hockey player. (Spanish)",https://en.wikipedia.org/wiki/Joaqu%C3%ADn_Dualde,
74864,March 2015,1,Matthew Young,", 70, British civil servant and executive (Panini Group).",https://en.wikipedia.org/wiki/Matthew_Young_(civil_servant),
81094,February 2016,10,Richard Unis,", 87, American judge, stroke.",https://en.wikipedia.org/wiki/Richard_Unis,
64139,August 2013,20,John W. Morris,", 91, American army lieutenant general, Chief of Engineers (1976—1980).",https://en.wikipedia.org/wiki/John_W._Morris,
62740,June 2013,6,Elaine Laron,", 83, American songwriter and lyricist (, ), pneumonia.",https://en.wikipedia.org/wiki/Elaine_Laron,



#### Observations:
- Following the links again reveals that the pages contain references.  
- We will export another dataframe of the links to the pages that need to be re-scraped for `num_references` and examine the pages for alternate XPaths for scraping.

In [49]:
# Exporting dataframe of pages to rescrape for num_references
rescrape_df_2nd = df[df["num_references"].isna()]["link"]
rescrape_df_2nd.to_csv("rescrape_df_2nd.csv", index=False)


#### Observations:
