# Wikipedia Notable Life Expectancies
## [Notebook 10: Exploratory Data Analysis](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_EDA_thanak_2022_09_30.ipynb)
### Context

The
### Objective

The
### Data Dictionary
- **day**: Reported day of month of death
- **name**: Name of individual
- **info**: Original Wikipedia information fields for the individual from Wikipedia Notable Deaths' List page ("age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, and cause of death (if known)")
- **link**: Link to the individual's page
- **num_references**: Number of references on the individual's page (a proxy for notability)
- **year**: Reported year of death
- **month**: Reported month of death
- **info_parenth**: Additional information for the individual, extracted from **info** because it appeared in parentheses
- **age**: Reported age in integer years at death*
- **cause_of_death**: Reported cause of death
- **place_1**: Country of citizenship at birth
- **place_2**: Subsequent country of citizenship (if applicable)
- **known for categories**: 0 (No) or 1 (Yes) value indicating whether the individual's reported known-for role(s) fall within the category. Multiple categories are possible.†
        - sciences
        - social
        - spiritual
        - academia_humanities
        - business_farming
        - arts
        - sports
        - law_enf_military_operator 
        - politics_govt_law
        - crime
        - event_record_other
        - other_species
- **num_categories**: Total number of known-for categories for the individual

    \* For age reported as a two-value estimated range, **age** reflects the arithmetic mean. A reported estimate given as a single number reflects that number, while estimates covering a decade (e.g., 80s) were converted to the middle of the decade (i.e., 85). The vast majority of entries for **age** reflect the single integer value that was reported.
    
    † See Appendix A for further category definitions and decision-making regarding role categorization.
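
The age conventions in the footnote above can be sketched as a small function. This is only an illustration: `estimate_age` is a hypothetical helper, and the input string formats it accepts (`"80s"`, `"80's"`, `"70-75"`) are assumptions, since the raw reported strings are not shown in this notebook.

```python
import re
from statistics import mean


def estimate_age(reported: str) -> float:
    """Convert a reported age string to a single numeric age (hypothetical helper)."""
    text = reported.strip()

    # Decade estimate such as "80s" or "80's": use the middle of the decade
    decade = re.fullmatch(r"(\d+)0'?s", text)
    if decade:
        return float(int(decade.group(1)) * 10 + 5)

    # Two-value range such as "70-75" (hyphen or en dash): use the arithmetic mean
    bounds = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)", text)
    if bounds:
        return float(mean(int(b) for b in bounds.groups()))

    # Single reported value: use it directly
    return float(int(text))


for example in ["86", "70-75", "80's"]:
    print(example, "->", estimate_age(example))
```

Returning a float in every branch is consistent with **age** being stored as `float64` even though most entries are whole numbers.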

### Importing Libraries

In [20]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
# import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np

# import re

# To help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 211)

# To set some dataframe visualization attributes
pd.set_option("max_colwidth", 150)

# To suppress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some plot visualization attributes
sns.set_theme()
sns.set_palette(
    (
        "midnightblue",
        "goldenrod",
        "maroon",
        "darkolivegreen",
        "cadetblue",
        "tab:purple",
        "yellowgreen",
    )
)
plt.rc("font", size=12)
plt.rc("axes", titlesize=15)
plt.rc("axes", labelsize=14)
plt.rc("xtick", labelsize=13)
plt.rc("ytick", labelsize=13)
plt.rc("legend", fontsize=14)
plt.rc("figure", titlesize=16)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")


## Data Overview

### [Reading](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean8.db), Sampling, and Checking Data Shape

In [12]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean8.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean8", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 98038 rows and 26 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,86.0,,United Kingdom of Great Britain and Northern Ireland,,3.091042,0,0,0,0,0,1,0,0,0,0,0,0,1
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,68.0,,Ireland,,2.564949,0,0,0,1,0,1,0,0,1,0,0,0,3
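
As a quick sanity check on how **num_categories** relates to the known-for indicator columns, the sketch below sums the indicators row-wise and compares the result with the stored total. The `toy` frame is an illustrative stand-in that copies the indicator values from the two rows shown above; on the full dataset the same comparison would be applied to `df`.

```python
import pandas as pd

# The twelve known-for indicator columns listed in the data dictionary
category_cols = [
    "sciences",
    "social",
    "spiritual",
    "academia_humanities",
    "business_farming",
    "arts",
    "sports",
    "law_enf_military_operator",
    "politics_govt_law",
    "crime",
    "event_record_other",
    "other_species",
]

# Indicator values and num_categories copied from the two head() rows
toy = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 3],
    ],
    columns=category_cols + ["num_categories"],
)

# Row-wise sum of the indicators should equal the stored total
recomputed = toy[category_cols].sum(axis=1)
mismatches = toy[recomputed != toy["num_categories"]]
print(f"{len(mismatches)} rows where num_categories differs from the indicator sum")
```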


In [13]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
98036,9,Aamir Liaquat Hussain,", 50, Pakistani journalist and politician, MNA .",https://en.wikipedia.org/wiki/Aamir_Liaquat_Hussain,99,2022,June,"2002 2007, since 2018",50.0,,Pakistan,,4.60517,0,0,0,0,0,1,0,0,1,0,0,0,2
98037,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,86.0,,"China, People's Republic of",,1.386294,1,0,0,0,0,0,0,0,0,0,0,0,1


In [14]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
47693,17,Cheo Feliciano,", 78, American Puerto Rican salsa and bolero composer and singer, traffic collision.",https://en.wikipedia.org/wiki/Cheo_Feliciano,27,2014,April,,78.0,traffic collision,United States of America,Puerto Rico,3.332205,0,0,0,0,0,1,0,0,0,0,0,0,1
2281,14,Jessy Blackburn,", 101, British aviation pioneer.",https://en.wikipedia.org/wiki/Jessy_Blackburn,4,1995,May,,101.0,,United Kingdom of Great Britain and Northern Ireland,,1.609438,0,0,0,0,0,0,0,1,0,0,0,0,1
51649,19,Bob Sadino,", 81, Indonesian businessman.",https://en.wikipedia.org/wiki/Bob_Sadino,6,2015,January,,81.0,,Indonesia,,1.94591,0,0,0,0,1,0,0,0,0,0,0,0,1
40331,21,Chacha Pakistani,", 90, Pakistani nationalist.",https://en.wikipedia.org/wiki/Chacha_Pakistani,4,2012,October,,90.0,,Pakistan,,1.609438,0,0,0,0,0,0,0,0,1,0,0,0,1
4790,1,Edgardo Enríquez,", 84, Chilean physician, academic and politician.",https://en.wikipedia.org/wiki/Edgardo_Enr%C3%ADquez,5,1996,November,,84.0,,Chile,,1.791759,1,0,0,1,0,0,0,0,1,0,0,0,3
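
The `log_num_references` column does not appear in the data dictionary. The displayed values look consistent with the natural log of `num_references` plus one, i.e. `np.log1p`. That reading is an inference from the rows shown above, not a documented definition; a minimal check under that assumption:

```python
import numpy as np

# (num_references, log_num_references) pairs copied from rows displayed above
num_references = np.array([21, 12, 99, 3, 27, 6])
reported = np.array([3.091042, 2.564949, 4.60517, 1.386294, 3.332205, 1.94591])

# Assumed transform: natural log of (1 + num_references)
recomputed = np.log1p(num_references)

# Compare at the six-decimal precision of the displayed values
matches = np.allclose(recomputed, reported, atol=1e-6)
print("log1p reproduces the displayed values:", matches)
```

Adding one before taking the log keeps the transform defined for pages with zero references.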


### Checking data types, duplicates, and null values

In [15]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98038 entries, 0 to 98037
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   day                        98038 non-null  object 
 1   name                       98038 non-null  object 
 2   info                       98038 non-null  object 
 3   link                       98038 non-null  object 
 4   num_references             98038 non-null  int64  
 5   year                       98038 non-null  int64  
 6   month                      98038 non-null  object 
 7   info_parenth               36659 non-null  object 
 8   age                        98038 non-null  float64
 9   cause_of_death             33490 non-null  object 
 10  place_1                    97885 non-null  object 
 11  place_2                    6619 non-null   object 
 12  log_num_references         98038 non-null  float64
 13  sciences                   98038 non-null  int

In [16]:
# Checking duplicate rows
df.duplicated().sum()

0

In [22]:
# Checking for duplicate links
df["link"].duplicated().sum()

0

In [17]:
# Check percentage of null values by column (null count divided by total rows)
df.isnull().sum() / len(df) * 100

day                             0.000000
name                            0.000000
info                            0.000000
link                            0.000000
num_references                  0.000000
year                            0.000000
month                           0.000000
info_parenth                   62.607356
age                             0.000000
cause_of_death                 65.839776
place_1                         0.156062
place_2                        93.248536
log_num_references              0.000000
sciences                        0.000000
social                          0.000000
spiritual                       0.000000
academia_humanities             0.000000
business_farming                0.000000
arts                            0.000000
sports                          0.000000
law_enf_military_operator       0.000000
politics_govt_law               0.000000
crime                           0.000000
event_record_other              0.000000
other_species   

In [18]:
# Checking number of missing values per row
df.isnull().sum(axis=1).value_counts()

2    42651
3    38637
1    15946
0      719
4       85
dtype: int64

#### Observations:
- We have 98,038 rows and 26 columns.
- Since we do not have a date of birth feature, we are not working with age calculated to the day, so we can drop `day`.
- There are no duplicate `link` values in the current dataset, so we can drop the purely nominal `name` column as we are retaining `link`, which we may need for referencing specific entries.
- `num_references`, `year`, and `age` are appropriately of numeric types, either integer or float.
- `month` may be interesting for EDA, but not useful as a predictor as we do not have date of birth.  We will retain it for now and typecast it from object to category.
- `info`, `link`, `info_parenth`, and `cause_of_death` will be left as object type.  `cause_of
- 
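
The drop and typecast steps noted in the observations above could look like the following. This is a sketch on a toy frame whose values are taken from rows displayed earlier; making `month` an ordered categorical in calendar order is an assumption for convenient sorting and plotting, not something the notebook specifies.

```python
import pandas as pd

# Toy frame holding only the columns touched by these steps
toy = pd.DataFrame(
    {
        "day": ["1", "9"],
        "name": ["William Chappell", "Zou Jing"],
        "link": [
            "https://en.wikipedia.org/wiki/William_Chappell_(dancer)",
            "https://en.wikipedia.org/wiki/Zou_Jing_(engineer)",
        ],
        "month": ["January", "June"],
    }
)

# Calendar order for the month categories (assumed ordering)
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Drop day and name; link is retained to reference specific entries
cleaned = toy.drop(columns=["day", "name"])

# Typecast month from object to an ordered category
cleaned["month"] = pd.Categorical(
    cleaned["month"], categories=month_order, ordered=True
)

print(cleaned.dtypes)
```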

In [19]:
df["place_1"].value_counts()

United States of America                                35061
United Kingdom of Great Britain and Northern Ireland    12242
India                                                    3859
Canada                                                   3585
Australia                                                2958
France                                                   2433
Germany                                                  2398
Italy                                                    1862
Russia                                                   1406
New Zealand                                              1264
Ireland                                                  1234
Japan                                                    1176
China, People's Republic of                              1139
Scotland                                                 1094
Spain                                                    1046
Norway                                                    867
Netherla

In [23]:
df["cause_of_death"].value_counts()

cancer                                           4226
heart attack                                     3084
COVID                                            1594
heart failure                                    1278
lung cancer                                      1090
                                                 ... 
complications of Charcot Marie Tooth disease        1
euthanasia due to multiple organ failure            1
kidney failure/heart problems                       1
blood cancer/renal failure                          1
complications from a stroke/Alzheimer disease       1
Name: cause_of_death, Length: 3230, dtype: int64

In [None]:
print("dunzo!")

# Sound notification when cell executes
chime.success()

#### Observations:
- We will now save our dataset and pick back up in a new notebook.

### Exporting Dataset to SQLite Database [wp_life_expect_clean.db]()

In [None]:
# # Exporting dataframe

# # Saving dataset in a SQLite database
# conn = sql.connect("wp_life_expect_clean.db")
# df.to_sql("wp_life_expect_clean", conn, index=False)

In [None]:
print('Complete')

# Chime notification when cell executes
chime.success()

# [Proceed to Data Cleaning Part ]()