# Wikipedia Notable Life Expectancies
## [Notebook 10: Exploratory Data Analysis](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_EDA_thanak_2022_09_30.ipynb)
### Context

The
### Objective

The
### Data Dictionary
- **day**: Reported day of month of death
- **name**: Name of individual
- **info**: Original Wikipedia information fields for the individual from Wikipedia Notable Deaths' List page ("age, country of citizenship at birth, subsequent country of citizenship (if applicable), reason for notability, and cause of death (if known)")
- **link**: Link to the individual's page
- **num_references**: Number of references on the individual's page (a proxy for notability)
- **year**: Reported year of death
- **month**: Reported month of death
- **info_parenth**: Additional information for individual that was extracted from info because it was in parentheses
- **age**: Reported age in integer years at death*
- **cause_of_death**: Reported cause of death
- **place_1**: Country of citizenship at birth
- **place_2**: Subsequent country of citizenship (if applicable)
- **known for categories**: 0 (No) or 1 (Yes) value indicating whether the individual's reported known-for role(s) fall within the category. Multiple categories are possible.†
        - sciences
        - social
        - spiritual
        - academia_humanities
        - business_farming
        - arts
        - sports
        - law_enf_military_operator 
        - politics_govt_law
        - crime
        - event_record_other
        - other_species
- **num_categories**: Total number of known-for categories for individual

    \* For age reported in a two-value estimated range, **age** reflects the arithmetic mean.  Reported estimated values of a single number reflect that number, while estimates covering a decade (e.g., 80's) were converted to the middle of the decade (i.e., 85).  The vast majority of entries for **age** reflect the single integer value that was reported.  
    
    † See Appendix A for further category definitions and decision-making regarding role categorization.
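
The age rules in the footnote above can be sketched as a small parsing function. This is only an illustration: the actual conversion was done in an earlier notebook, and the helper name `parse_reported_age` and the input string formats it accepts (`"80's"`, `"89-90"`, `"c. 70"`) are assumptions, not the author's code.

```python
import re
from typing import Optional


def parse_reported_age(text: str) -> Optional[float]:
    """Convert a reported age string to one number (hypothetical helper)."""
    text = text.strip()
    # Decade estimate such as "80s" or "80's" -> middle of the decade (85)
    decade = re.fullmatch(r"(\d+)0'?s", text)
    if decade:
        return float(decade.group(1)) * 10 + 5
    # Two-value range such as "89-90" (hyphen or en dash) -> arithmetic mean
    rng = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", text)
    if rng:
        return (int(rng.group(1)) + int(rng.group(2))) / 2
    # Single reported number, optionally marked as an estimate ("c. 70")
    single = re.fullmatch(r"(?:c\.\s*)?(\d+(?:\.\d+)?)", text)
    if single:
        return float(single.group(1))
    # Anything else is left for manual review
    return None
```

Returning `None` for unrecognized strings keeps unusual entries visible for manual handling instead of silently coercing them.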

### Importing Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
# import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np

# import re

# To help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 211)

# To set some dataframe visualization attributes
pd.set_option("max_colwidth", 150)

# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some plot visualization attributes
sns.set_theme()
sns.set_palette(
    (
        "midnightblue",
        "goldenrod",
        "maroon",
        "darkolivegreen",
        "cadetblue",
        "tab:purple",
        "yellowgreen",
    )
)
plt.rc("font", size=12)
plt.rc("axes", titlesize=15)
plt.rc("axes", labelsize=14)
plt.rc("xtick", labelsize=13)
plt.rc("ytick", labelsize=13)
plt.rc("legend", fontsize=14)
plt.rc("figure", titlesize=16)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")


## Data Overview

### [Reading](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean8.db), Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean8.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean8", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 98038 rows and 26 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,86.0,,United Kingdom of Great Britain and Northern Ireland,,3.091,0,0,0,0,0,1,0,0,0,0,0,0,1
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,68.0,,Ireland,,2.565,0,0,0,1,0,1,0,0,1,0,0,0,3
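
The read cell above leaves the SQLite connection open. One possible alternative, sketched here with a hypothetical helper `read_table` that is not part of the notebook, wraps the read in `contextlib.closing` so the connection is closed once the dataframe is loaded.

```python
import sqlite3 as sql
from contextlib import closing

import pandas as pd


def read_table(db_path: str, table: str) -> pd.DataFrame:
    """Read a whole SQLite table into a dataframe, then close the connection."""
    with closing(sql.connect(db_path)) as conn:
        # Table names cannot be bound as SQL parameters, so the name is quoted
        return pd.read_sql(f'SELECT * FROM "{table}"', conn)
```

With the file and table names used in this notebook, the call would be `data = read_table("wp_life_expect_clean8.db", "wp_life_expect_clean8")`.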



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
98036,9,Aamir Liaquat Hussain,", 50, Pakistani journalist and politician, MNA .",https://en.wikipedia.org/wiki/Aamir_Liaquat_Hussain,99,2022,June,"2002 2007, since 2018",50.0,,Pakistan,,4.605,0,0,0,0,0,1,0,0,1,0,0,0,2
98037,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,86.0,,"China, People's Republic of",,1.386,1,0,0,0,0,0,0,0,0,0,0,0,1



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
96012,11,Mel Keefer,", 95, American cartoonist .",https://en.wikipedia.org/wiki/Mel_Keefer,4,2022,February,"Inkpot Award, inductee 2007",95.0,,United States of America,,1.609,0,0,0,0,0,1,0,0,0,0,0,0,1
26655,16,Germán Abad Valenzuela,", 89, Ecuadorian radiologist.",https://en.wikipedia.org/wiki/Germ%C3%A1n_Abad_Valenzuela,6,2008,October,,89.0,,Ecuador,,1.946,1,0,0,0,0,0,0,0,0,0,0,0,1
27974,28,Peter F. Donnelly,", 70, American arts patron, vice-chairman of Americans for the Arts, complications of pancreatic cancer.",https://en.wikipedia.org/wiki/Peter_F._Donnelly,13,2009,March,,70.0,complications of pancreatic cancer,United States of America,,2.639,0,0,0,0,0,1,0,0,0,0,0,0,1
28732,6,Vasily Aksyonov,", 76, Russian novelist, stroke.",https://en.wikipedia.org/wiki/Vasily_Aksyonov,6,2009,July,,76.0,stroke,Russia,,1.946,0,0,0,0,0,1,0,0,0,0,0,0,1
11905,9,Robert Armitage,", 45, South African cricketer, cancer.",https://en.wikipedia.org/wiki/Robert_Armitage_(cricketer),9,2000,December,,45.0,cancer,South Africa,,2.303,0,0,0,0,0,0,1,0,0,0,0,0,1



### Checking data types, duplicates, and null values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98038 entries, 0 to 98037
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   day                        98038 non-null  object 
 1   name                       98038 non-null  object 
 2   info                       98038 non-null  object 
 3   link                       98038 non-null  object 
 4   num_references             98038 non-null  int64  
 5   year                       98038 non-null  int64  
 6   month                      98038 non-null  object 
 7   info_parenth               36659 non-null  object 
 8   age                        98038 non-null  float64
 9   cause_of_death             33490 non-null  object 
 10  place_1                    97885 non-null  object 
 11  place_2                    6619 non-null   object 
 12  log_num_references         98038 non-null  float64
 13  sciences                   98038 non-null  int


In [6]:
# Checking duplicate rows
df.duplicated().sum()

0


In [7]:
# Checking for duplicate links
df["link"].duplicated().sum()

0
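
Both duplicate checks above return 0, so there is nothing to inspect here. Had the count been nonzero, a sketch like the following could list the offending rows side by side. The toy frame `demo` is illustrative only and stands in for `df`.

```python
import pandas as pd

# Toy frame standing in for the notebook's df (values are illustrative only)
demo = pd.DataFrame(
    {
        "name": ["A", "B", "A"],
        "link": ["wiki/A", "wiki/B", "wiki/A"],
    }
)

# keep=False marks every member of a duplicated group, not only the repeats
dupes = demo[demo["link"].duplicated(keep=False)].sort_values("link")

# The default keep="first" counts only the repeats after the first occurrence
n_repeats = demo["link"].duplicated().sum()
```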


In [8]:
# Checking sum of null values by column
df.isnull().sum()

day                              0
name                             0
info                             0
link                             0
num_references                   0
year                             0
month                            0
info_parenth                 61379
age                              0
cause_of_death               64548
place_1                        153
place_2                      91419
log_num_references               0
sciences                         0
social                           0
spiritual                        0
academia_humanities              0
business_farming                 0
arts                             0
sports                           0
law_enf_military_operator        0
politics_govt_law                0
crime                            0
event_record_other               0
other_species                    0
num_categories                   0
dtype: int64


In [9]:
# Check percentage of null values by column
df.isnull().sum() / len(df) * 100

day                          0.000
name                         0.000
info                         0.000
link                         0.000
num_references               0.000
year                         0.000
month                        0.000
info_parenth                62.607
age                          0.000
cause_of_death              65.840
place_1                      0.156
place_2                     93.249
log_num_references           0.000
sciences                     0.000
social                       0.000
spiritual                    0.000
academia_humanities          0.000
business_farming             0.000
arts                         0.000
sports                       0.000
law_enf_military_operator    0.000
politics_govt_law            0.000
crime                        0.000
event_record_other           0.000
other_species                0.000
num_categories               0.000
dtype: float64
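
The count and percentage views above can also be combined into one table limited to the columns that actually contain nulls. This is a minimal sketch; `missing_summary` is a hypothetical helper and the toy frame is illustrative only.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages for columns with at least one null."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame(
        {"n_missing": counts, "pct_missing": counts / len(frame) * 100}
    )
    # Keep only columns with missing values, largest count first
    return summary[summary["n_missing"] > 0].sort_values(
        "n_missing", ascending=False
    )


# Toy frame standing in for df
demo = pd.DataFrame(
    {"age": [86.0, 68.0, 50.0, 70.0], "place_2": [None, "X", None, None]}
)
result = missing_summary(demo)
```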


In [10]:
# Checking number of missing values per row
df.isnull().sum(axis=1).value_counts()

2    42651
3    38637
1    15946
0      719
4       85
dtype: int64
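
The per-row counts above show how many values are missing but not which columns are involved. One way to see the combinations, sketched here on a toy frame with illustrative values, is to build a pattern string per row and count the patterns.

```python
import pandas as pd

# Toy frame standing in for df (values are illustrative only)
demo = pd.DataFrame(
    {
        "info_parenth": [None, "x", None],
        "cause_of_death": [None, None, "stroke"],
        "place_1": ["Ireland", None, "Russia"],
    }
)

# Join the names of the null columns in each row into one pattern string
null_mask = demo.isnull()
patterns = null_mask.apply(
    lambda row: ", ".join(row[row].index) or "none missing", axis=1
)
pattern_counts = patterns.value_counts()
```

On the full dataset, filtering `pattern_counts` for patterns that contain `place_1` would isolate the rows of concern.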


#### Observations:
- We have 98,038 rows and 26 columns.
- Our target, `age`, has no missing values.
- As expected, `info_parenth` and `place_2` both have high percentages of missing values.  `cause_of_death` also has a very high proportion of missing values, which is not problematic for the current analysis.
- There are 153 missing values (~0.16%) for `place_1`, the handling of which we will need to consider.
- The value counts for missing values per row look generally consistent with the expected missing values, with the `place_1` missing values being the only concern.
- Since we do not have a date of birth feature, we are not working with age calculated to the day, so we can drop `day`.
- There are no duplicate `link` values in the current dataset, so we can drop the purely nominal `name` column as we are retaining `link`, which we may need for referencing specific entries.
- `num_references`, `year`, `age`, and `num_categories` are all of the appropriate numeric type, either integer or float.
- `month` may be interesting for EDA, but is not anticipated to be useful as a predictor as we do not have date of birth.  We will retain it for now and typecast it from object to category.
- `info`, `link`, `info_parenth`, and `cause_of_death` will be left as object type.  `cause_of_death` is not a focus of this analysis, but this column could be further treated to create broader categories of causes (e.g., grouping all types of cancer) for further analysis.  For now, we will retain it, as we might probe it somewhat during EDA.  `info` and `info_parenth` we will retain for reference only.
- `place_1` and `place_2` are of object type and we will convert them to category.  After initial EDA, we will extract a new feature, `region`, to reduce dimensionality of the `place_` information.
- The `known for` categories are all of integer type, but are boolean in nature.  For EDA, we will typecast them as category, then convert them back to integer for modeling.
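
For the planned `month` typecast, one option is an ordered categorical so that sorting and plotting follow calendar order and not alphabetical order. This is a sketch of that idea on a toy frame, not the notebook's own step. The names `month_order` and `month_dtype` are assumptions.

```python
import pandas as pd

# Calendar order for the month names as they appear in the dataset
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
month_dtype = pd.CategoricalDtype(categories=month_order, ordered=True)

# Toy frame standing in for df
demo = pd.DataFrame({"month": ["June", "January", "October"]})
demo["month"] = demo["month"].astype(month_dtype)

# Sorting now follows the calendar, not the alphabet
sorted_months = demo["month"].sort_values().tolist()
```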

#### Dropping `day` and `name`

In [11]:
# Dropping day and name columns
df.drop(["day", "name"], axis=1, inplace=True)


#### Typecasting `place_1`, `place_2`, and `known for` Categories as Category

In [12]:
# Typecasting place_1, place_2, and the known-for category columns as category
cat_cols = [
    "place_1",
    "place_2",
    "sciences",
    "social",
    "spiritual",
    "academia_humanities",
    "business_farming",
    "arts",
    "sports",
    "law_enf_military_operator",
    "politics_govt_law",
    "crime",
    "event_record_other",
    "other_species",
]
df[cat_cols] = df[cat_cols].astype("category")
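
Since the known-for columns are to be converted back to integer before modeling, the round trip can be sketched as follows. The toy frame and the names `flag_cols` and `model_ready` are illustrative assumptions. The conversion back works because the category labels are themselves the integers 0 and 1.

```python
import pandas as pd

# Two of the known-for columns, with illustrative values only
flag_cols = ["arts", "sports"]
demo = pd.DataFrame({"arts": [1, 0, 1], "sports": [0, 0, 1]})

# Category dtype for EDA (e.g., for grouping and plotting)
demo = demo.astype({col: "category" for col in flag_cols})

# Back to integer for modeling
model_ready = demo.astype({col: "int64" for col in flag_cols})

# Integer flags can be summed again, e.g., to recount roles per row
n_roles = model_ready[flag_cols].sum(axis=1)
```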


#### Confirming Updated Data Types and Number of Columns

In [13]:
# Confirming data types and number of columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98038 entries, 0 to 98037
Data columns (total 24 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   info                       98038 non-null  object  
 1   link                       98038 non-null  object  
 2   num_references             98038 non-null  int64   
 3   year                       98038 non-null  int64   
 4   month                      98038 non-null  object  
 5   info_parenth               36659 non-null  object  
 6   age                        98038 non-null  float64 
 7   cause_of_death             33490 non-null  object  
 8   place_1                    97885 non-null  category
 9   place_2                    6619 non-null   category
 10  log_num_references         98038 non-null  float64 
 11  sciences                   98038 non-null  category
 12  social                     98038 non-null  category
 13  spiritual                  9803


#### Observations:
- With 24 remaining columns, we are ready to proceed with EDA.

### Summary Statistics of Numerical Features

In [14]:
# Summary statistics of numerical features
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| num_references | 98038.0 | 12.685 | 21.37 | 3.0 | 4.0 | 7.0 | 13.0 | 660.0 |
| year | 98038.0 | 2012.153 | 7.817 | 1994.0 | 2007.0 | 2014.0 | 2019.0 | 2022.0 |
| age | 98038.0 | 76.541 | 20.06 | 0.25 | 69.0 | 80.0 | 88.0 | 3500.0 |
| log_num_references | 98038.0 | 2.243 | 0.736 | 1.386 | 1.609 | 2.079 | 2.639 | 6.494 |
| num_categories | 98038.0 | 1.159 | 0.399 | 1.0 | 1.0 | 1.0 | 1.0 | 5.0 |
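
The `log_num_references` values displayed in this notebook appear consistent with the natural log of one plus `num_references`. The transformation itself was applied in an earlier notebook, so the sketch below is only a check against the four rows shown in the `head` and `tail` outputs, not the author's code.

```python
import numpy as np
import pandas as pd

# (num_references, log_num_references) pairs copied from the displayed rows
shown = pd.DataFrame(
    {
        "num_references": [21, 12, 99, 3],
        "log_num_references": [3.091, 2.565, 4.605, 1.386],
    }
)

# log1p(x) computes ln(1 + x), which stays finite even when x is 0
recomputed = np.log1p(shown["num_references"])

# Displayed values are rounded to 3 decimals, so allow a small tolerance
matches = np.allclose(recomputed, shown["log_num_references"], atol=5e-4)
```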



#### Observations:
- Our target, `age`, has a huge spread, which we suspect is the result of inclusion of non-human entries (e.g., a 3500-year-old tree).  Despite that extreme upper-end outlier, the mean and median are close, at ~77 and 80 years, respectively.
- `num_references` has a wide spread and is highly right skewed, with a mean of ~13 and median of 7.
- We see the range of `year` correctly reflects the data that was collected, from 1994 to 2022.
- `num_categories` ranges from 1 to 5, with at least 75% of entries having only 1 category.
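
The suspicion that non-human entries drive the extreme `age` values could be checked by splitting the summary on the `other_species` flag. The frame below is a toy stand-in with illustrative values only. On the real data the same two statements would be run on `df`.

```python
import pandas as pd

# Illustrative rows only; the real check would run on the notebook's df
demo = pd.DataFrame(
    {
        "age": [86.0, 3500.0, 0.25, 45.0],
        "other_species": [0, 1, 0, 0],
    }
)

# Compare the age spread for entries with and without the other_species flag
age_by_species = demo.groupby("other_species")["age"].agg(["count", "min", "max"])

# Ages restricted to entries not flagged as other_species
human_age = demo.loc[demo["other_species"] == 0, "age"]
human_max = human_age.max()
```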

### Summary Statistics of Categorical and Object Features

In [15]:
# Summary statistics of non-numerical features
df.describe(include=["object", "category"]).T

| | count | unique | top | freq |
|---|---|---|---|---|
| info | 98038 | 89928 | `, 87, American baseball player .` | 41 |
| link | 98038 | 98038 | https://en.wikipedia.org/wiki/William_Chappell_(dancer) | 1 |
| month | 98038 | 12 | January | 9923 |
| info_parenth | 36659 | 16977 | `, ,` | 3124 |
| cause_of_death | 33490 | 3230 | cancer | 4226 |
| place_1 | 97885 | 211 | United States of America | 35061 |
| place_2 | 6619 | 155 | United States of America | 2323 |
| sciences | 98038 | 2 | 0 | 89283 |
| social | 98038 | 2 | 0 | 97185 |
| spiritual | 98038 | 2 | 0 | 94590 |



#### Observations
- `info` stands out for having one value shared by 41 entries: American baseball players who lived to age 87.  This feature is retained only for reference, as it is unmanageable untreated, but the example does confirm that entries have identifiable similarities and differences on which to base analysis.
- `link` is again confirmed here as having all unique values.
- As

In [None]:
print("dunzo!")

# Sound notification when cell executes
chime.success()

#### Observations:
- We will now save our dataset and pick back up in a new notebook.

### Exporting Dataset to SQLite Database [wp_life_expect_clean.db]()

In [None]:
# # Exporting dataframe

# # Saving dataset in a SQLite database
# conn = sql.connect("wp_life_expect_clean.db")
# df.to_sql("wp_life_expect_clean", conn, index=False)
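
If the export cell above is re-enabled, note that `to_sql` raises an error by default when the table already exists. A minimal sketch of a re-runnable export follows, using an in-memory database and a hypothetical table name `demo_table` so that no real file is touched.

```python
import sqlite3 as sql
from contextlib import closing

import pandas as pd

# Toy frame standing in for df
demo = pd.DataFrame({"link": ["wiki/A", "wiki/B"], "age": [86.0, 68.0]})

# An in-memory database keeps the sketch from touching any real file
with closing(sql.connect(":memory:")) as conn:
    # if_exists="replace" overwrites the table on re-runs instead of raising
    demo.to_sql("demo_table", conn, index=False, if_exists="replace")
    demo.to_sql("demo_table", conn, index=False, if_exists="replace")

    # Read the table back to confirm the export round-trips
    restored = pd.read_sql("SELECT * FROM demo_table", conn)
```

Note that category dtypes are not preserved by SQLite, so any typecasting done in this notebook would need to be repeated after reloading.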

In [None]:
print('Complete')

# Chime notification when cell executes
chime.success()

# [Proceed to Data Cleaning Part ]()