# Wikipedia Notable Deaths

## Context

The...

## Objective

The...

## Data Dictionary
Variable: Description

### Data Collection
Scraped on 6/9/22 with scrapy.

Main site: https://en.wikipedia.org/wiki/Lists_of_deaths_by_year
Start_url: https://en.wikipedia.org/wiki/Deaths_in_January_1994
Pagination from this URL ran successfully through May 2022.
Reference counts were scraped separately, because of pagination, to preserve the original entry order in the first dataset.
June 2022, as the current month, had inconsistent formatting and was scraped separately.
Roughly 13,000 entries are missing a reference count.

On 6/10/22, set out to identify links with a missing reference count so they could be rescraped.
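
One way to identify the links that still need a reference count is a left merge with an indicator column. The sketch below uses toy rows rather than the scraped tables. The column names `link` and `num_references` match the datasets loaded later in this notebook, while the frame names `deaths` and `reference_counts` and all of the values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the two scraped tables; the rows are illustrative only.
deaths = pd.DataFrame(
    {
        "name": ["Person A", "Person B", "Person C"],
        "link": [
            "https://en.wikipedia.org/wiki/Person_A",
            "https://en.wikipedia.org/wiki/Person_B",
            "https://en.wikipedia.org/wiki/Person_C",
        ],
    }
)
reference_counts = pd.DataFrame(
    {
        "link": [
            "https://en.wikipedia.org/wiki/Person_C",
            "https://en.wikipedia.org/wiki/Person_A",
        ],
        "num_references": [7, 3],
    }
)

# Drop duplicate links so the left merge cannot multiply rows
# (whether the real reference-count table has duplicates is an open question).
counts_unique = reference_counts.drop_duplicates(subset="link")

# A left merge keeps every row of `deaths` in its original order;
# indicator=True adds a `_merge` column recording where each row matched.
merged = deaths.merge(counts_unique, on="link", how="left", indicator=True)

# Rows found only in `deaths` have no reference count yet.
links_to_rescrape = merged.loc[merged["_merge"] == "left_only", "link"].tolist()
print(links_to_rescrape)
```

The resulting list could be passed to a second scraping run, and because the merge is a left join the order of the first dataset is unchanged.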

## Importing necessary libraries

In [1]:
# To limit the number of threads used by numpy and thereby prevent the known memory leak associated with KMeans
# Note: this cell must be run BEFORE importing numpy to have the desired effect
import os

os.environ["OMP_NUM_THREADS"] = "1"

In [2]:
# To structure code automatically
%load_ext nb_black

# To import sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

# To be used for data scaling
from sklearn.preprocessing import StandardScaler

# To compute distances
from scipy.spatial.distance import cdist, pdist

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# To visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from yellowbrick.style.palettes import PALETTES

# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

# To set some visualization attributes
sns.set_theme()
sns.set_palette(
    (
        "midnightblue",
        "goldenrod",
        "maroon",
        "darkolivegreen",
        "cadetblue",
        "tab:purple",
        "yellowgreen",
    )
)
plt.rc("font", size=12)
plt.rc("axes", titlesize=15)
plt.rc("axes", labelsize=14)
plt.rc("xtick", labelsize=13)
plt.rc("ytick", labelsize=13)
plt.rc("legend", fontsize=14)
plt.rc("figure", titlesize=16)


## Data Overview

### Reading, sampling, and checking data shape

#### For June 2022

In [31]:
# Reading the wp_deaths_June_2022 dataset from sql db and table
conn = sql.connect("wp_deaths_June_2022.db")
raw_June_2022 = pd.read_sql("SELECT * FROM wp_deaths_June_2022", conn)

# Making a working copy
df_June_2022 = raw_June_2022.copy()

# Checking the shape
print(f"There are {df_June_2022.shape[0]} rows and {df_June_2022.shape[1]} columns.")

# Checking first 2 rows of the data
df_June_2022.head(2)

There are 145 rows and 6 columns.


| | month_year | day | name | info | link | num_references |
|---|---|---|---|---|---|---|
| 0 | June 2022 | 8 | Mladen Frančić | , 67, Croatian football player and manager (Vr... | https://en.wikipedia.org/wiki/Mladen_Fran%C4%8... | 1 |
| 1 | June 2022 | 6 | Valery Ryumin | , 82, Russian cosmonaut (Soyuz 25, Soyuz 32, S... | https://en.wikipedia.org/wiki/Valery_Ryumin | 2 |



In [32]:
# Checking last 2 rows of the data
df_June_2022.tail(2)

| | month_year | day | name | info | link | num_references |
|---|---|---|---|---|---|---|
| 143 | June 2022 | 8 | Rocky Freitas | , 76, American football player (Detroit Lions,... | https://en.wikipedia.org/wiki/Rocky_Freitas | 4 |
| 144 | June 2022 | 8 | Birkha Bahadur Muringla | , 79, Indian writer. | https://en.wikipedia.org/wiki/Birkha_Bahadur_M... | 3 |



In [37]:
# Checking a sample of the data
df_June_2022.sample(5)

| | month_year | day | name | info | link | num_references |
|---|---|---|---|---|---|---|
| 110 | June 2022 | 5 | Shaun Greatbatch | , 52, English darts player. | https://en.wikipedia.org/wiki/Shaun_Greatbatch | 2 |
| 44 | June 2022 | 2 | José Luccioni | , 72, French actor (). | https://en.wikipedia.org/wiki/Jos%C3%A9_Luccio... | 2 |
| 9 | June 2022 | 8 | Tarhan Erdem | , 89, Turkish politician, deputy (1977–1980) a... | https://en.wikipedia.org/wiki/Tarhan_Erdem | 8 |
| 64 | June 2022 | 3 | Geoff Hunter | , 62, English footballer (Crewe Alexandra, Por... | https://en.wikipedia.org/wiki/Geoff_Hunter_(fo... | 10 |
| 129 | June 2022 | 7 | Robert Alexander | , 64, American football player (Los Angeles Ra... | https://en.wikipedia.org/wiki/Robert_Alexander... | 2 |



#### For January 1994 through May 2022 (without reference counts)

In [34]:
# Reading the wp_deaths_94_to_22 dataset from sql db and table
conn = sql.connect("wp_deaths_94_to_22.db")
raw_94_to_22 = pd.read_sql("SELECT * FROM wp_deaths_94_to_22", conn)

# Making a working copy
df_94_to_22 = raw_94_to_22.copy()

# Checking the shape
print(f"There are {df_94_to_22.shape[0]} rows and {df_94_to_22.shape[1]} columns.")

# Checking first 2 rows of the data
df_94_to_22.head(2)

There are 133769 rows and 5 columns.


| | month_year | day | name | info | link |
|---|---|---|---|---|---|
| 0 | January 1994 | 1 | William Chappell | , 86, British dancer, ballet designer and dire... | https://en.wikipedia.org/wiki/William_Chappell... |
| 1 | January 1994 | 1 | Raymond Crotty | , 68, Irish economist, writer, and academic. | https://en.wikipedia.org/wiki/Raymond_Crotty |



In [35]:
# Checking last 2 rows of the data
df_94_to_22.tail(2)

| | month_year | day | name | info | link |
|---|---|---|---|---|---|
| 133767 | May 2022 | 31 | Dave Smith | , 72, American sound engineer, founder of Sequ... | https://en.wikipedia.org/wiki/Dave_Smith_(engi... |
| 133768 | May 2022 | 31 | Wang Zherong | , 86, Chinese tank designer, member of the Chi... | https://en.wikipedia.org/wiki/Wang_Zherong |



In [38]:
# Checking a sample of the data
df_94_to_22.sample(5)

| | month_year | day | name | info | link |
|---|---|---|---|---|---|
| 11120 | March 1998 | 28 | Larry Stephens | , 59, American gridiron football player (Cleve... | https://en.wikipedia.org/wiki/Larry_Stephens_(... |
| 30238 | March 2006 | 15 | George Mackey | , 90, American mathematician, formerly Landon ... | https://en.wikipedia.org/wiki/George_Mackey |
| 61769 | April 2013 | 14 | Efi Arazi | , 76, Israeli businessman. | https://en.wikipedia.org/wiki/Efi_Arazi |
| 129095 | December 2021 | 6 | Eugenio Minasso | , 62, Italian politician, deputy (2008–2013), ... | https://en.wikipedia.org/wiki/Eugenio_Minasso |
| 75906 | April 2015 | 22 | Yoichi Funado | , 71, Japanese novelist, thymic cancer. | https://en.wikipedia.org/wiki/Yoichi_Funado |



#### For January 1994 through May 2022 Reference Counts

In [39]:
# Reading the wp_reference_counts_2 dataset from sql db and table
conn = sql.connect("wp_reference_counts_2.db")
raw_reference_counts = pd.read_sql("SELECT * FROM wp_reference_counts_2", conn)

# Making a working copy
df_reference_counts = raw_reference_counts.copy()

# Checking the shape
print(
    f"There are {df_reference_counts.shape[0]} rows and {df_reference_counts.shape[1]} columns."
)

# Checking first 2 rows of the data
df_reference_counts.head(2)

There are 120368 rows and 2 columns.


| | link | num_references |
|---|---|---|
| 0 | https://en.wikipedia.org/wiki/Lys_Gauty | 5 |
| 1 | https://en.wikipedia.org/wiki/William_Chappell... | 21 |



In [40]:
# Checking last 2 rows of the data
df_reference_counts.tail(2)

| | link | num_references |
|---|---|---|
| 120366 | https://en.wikipedia.org/wiki/Shirley_Thomas_(... | 6 |
| 120367 | https://en.wikipedia.org/wiki/James_Doohan | 52 |



In [41]:
# Checking a sample of the data
df_reference_counts.sample(5)

| | link | num_references |
|---|---|---|
| 47338 | https://en.wikipedia.org/wiki/Sasha_Lakovic | 6 |
| 9388 | https://en.wikipedia.org/wiki/Ann_Cartwright_D... | 4 |
| 35897 | https://en.wikipedia.org/wiki/Leevi_Lehto | 2 |
| 105563 | https://en.wikipedia.org/wiki/Jerry_Rivers | 2 |
| 79660 | https://en.wikipedia.org/wiki/Detlev_Lauscher | 3 |
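
The three loading cells above repeat the same connect-and-read pattern and leave each connection open. A minimal sketch of a helper that wraps the pattern and closes the connection follows. `read_table` and the throwaway `demo` database are hypothetical names used only for illustration, not part of the original notebook.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd


def read_table(db_path, table):
    """Load a whole SQLite table into a DataFrame, then close the connection."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    # A sqlite3 connection used directly as a context manager only manages
    # transactions; contextlib.closing makes sure it is actually closed.
    with closing(sqlite3.connect(db_path)) as conn:
        return pd.read_sql(f"SELECT * FROM {quoted}", conn)


# Demonstration on a temporary database with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.db")
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE demo (link TEXT, num_references INTEGER)")
        conn.executemany("INSERT INTO demo VALUES (?, ?)", [("a", 1), ("b", 2)])
        conn.commit()
    demo = read_table(db_path, "demo")

print(f"There are {demo.shape[0]} rows and {demo.shape[1]} columns.")
```

Under these assumptions, a call such as `read_table("wp_deaths_June_2022.db", "wp_deaths_June_2022")` would stand in for the connect-and-read lines in the first loading cell.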

