In [None]:
# import dependencies
from IPython.display import Image
import pandas as pd
import numpy as np
import os
import glob
import pycountry_convert as pc

# <center> Netflix Applied Data Science - Project 1 </center>
## <center> Part 1: Dataset selection and data exploration</center>
### <center> Group 6 </center>
### <center>Vu, Alex, Kwabena </center>
<br>
<center> 10/14/2021 </center>


### Dataset selection
Our group is highly interested in healthcare and medical datasets. After going through **[10 Great Healthcare Datasets](https://www.datasciencecentral.com/profiles/blogs/10-great-healthcare-data-sets)**, we selected the **[Human Mortality Database (HMD)](https://www.mortality.org/)** for this project because of its well-established data structure, well-documented materials, and coverage of countries around the world.

### Short-term Mortality Fluctuations (STMF) data series
Available at **[Human Mortality Database (HMD)](<https://www.mortality.org/>)**
<p>The STMF input dataset provides weekly death counts for <b>38 countries</b>: Austria, Australia, Belgium, Bulgaria, Chile, Canada, Croatia, Czech Republic, Denmark, England and Wales, Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Israel, Italy, Latvia, Lithuania, Luxembourg, Netherlands, New Zealand, Northern Ireland, Norway, Poland, Portugal, Republic of Korea, Russia, Scotland, Slovenia, Slovakia, Spain, Sweden, Switzerland, Taiwan and the USA.</p>

In [None]:
Image("../resources/images/smtf_viz_tool.png")

#### Example of USA STMF input CSV file

In [None]:
stmf_df = pd.read_csv('../resources/dataset/STMFinput/USAstmf.csv')
print(stmf_df.info())
stmf_df.head()

## Questions?
0. Are we interested in human mortality or deaths?
1. What do the factors such as age or gender tell us about deaths?
2. Do death counts differ across geographic locations (continents) around the world?
3. How do death counts over time (weeks/years) give us clues about the causes of death?
4. Why are weekly death counts effective in revealing temporary global health hazards?

In [None]:
def alpha3_to_continent(alpha3):
    alpha2 = pc.country_alpha3_to_country_alpha2(alpha3)
    cont_code = pc.country_alpha2_to_continent_code(alpha2)
    return pc.convert_continent_code_to_continent_name(cont_code)

### Dataset import
Import and merge the country datasets into a single dataframe by concatenating the dataframes read from the CSV files.
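
The concatenation step can be sketched on in-memory data. This is a minimal illustration, not the notebook's actual input: the two CSV strings and their values are invented, and the second one carries an extra column on purpose to show how `pd.concat` handles files whose columns do not match exactly.

```python
import io

import pandas as pd

# Two tiny in-memory CSVs standing in for per-country files (invented values).
csv_a = io.StringIO("PopCode,Year,Week,Deaths\nUSA,2020,1,12\n")
csv_b = io.StringIO("PopCode,Year,Week,Deaths,Access\nNOR,2020,1,3,O\n")

# Read each source into its own dataframe, then stack them row-wise.
frames = [pd.read_csv(buf) for buf in (csv_a, csv_b)]
merged = pd.concat(frames, ignore_index=True)

# Columns missing from one source are filled with NaN in the merged frame.
print(merged)
```

With `ignore_index=True` the merged frame gets a fresh 0..n-1 index, so row labels from the individual files do not collide.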

In [None]:
# read and merge all the CSV dataset in STMFinput/
path = '../resources/dataset/STMFinput'
all_files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat([pd.read_csv(f, low_memory=False) for f in all_files], ignore_index=True)
df.info()

### Dataset cleanup
+ Correct mismatched input columns in the CSV files
+ Convert **Deaths** column to 64-bit integer
+ Add **iso_alpha3** and **continent** columns for visualization
+ Drop unnecessary rows (TOT and UNK in the **Age** and **Week** columns)
+ Split the dataframe into two:
    + one broken down by gender, **clean_df**,
    + and another with genders combined, **all_clean_df**
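
The steps above can be tried on a toy dataframe before running them on the full dataset. The sketch below assumes the same column names and placeholder values as the STMF input (`TOT`, `UNK`, `.`, and `b` for both sexes); the rows themselves are invented, and the code `GBRTENW` is used only as an example of a population code longer than three characters.

```python
import pandas as pd

# Toy rows imitating the STMF input layout (values are invented for illustration).
toy = pd.DataFrame({
    "PopCode": ["USA", "USA", "GBRTENW", "GBRTENW", "USA"],
    "Year": [2020, 2020, 2020, 2020, 2020],
    "Week": ["1", "1", "2", "UNK", "1"],
    "Sex": ["m", "b", "f", "m", "m"],
    "Age": ["0", "0", "85", "5", "TOT"],
    "Deaths": ["12", "25", ".", "3", "40"],
})

# Keep only the first three characters of PopCode as the ISO alpha-3 code.
toy["iso_alpha3"] = toy["PopCode"].str[:3]

# Treat the '.' placeholder as zero deaths, mirroring the notebook's choice.
toy["Deaths"] = toy["Deaths"].replace(".", "0").astype("int64")

# Drop aggregate (TOT) and unknown (UNK) rows; copy so later assignments
# modify an independent dataframe rather than a view of `toy`.
mask = (toy["Age"] != "TOT") & (toy["Age"] != "UNK") & (toy["Week"] != "UNK")
toy_clean = toy[mask].copy()
toy_clean["Age"] = toy_clean["Age"].astype("int32")
toy_clean["Week"] = toy_clean["Week"].astype("int32")

# Split into combined-sex rows and sex-specific rows.
toy_all = toy_clean[toy_clean["Sex"] == "b"].reset_index(drop=True)
toy_by_sex = toy_clean[toy_clean["Sex"] != "b"].reset_index(drop=True)
```

Note that replacing `.` with zero treats a missing count as no deaths; whether that is appropriate depends on the analysis.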

In [None]:
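# Fix the mismatched PopCode value 'a' (should be 'NOR')
df['PopCode'] = df['PopCode'].replace(['a'], 'NOR')
# The first three characters of PopCode are the ISO alpha-3 country code
df['iso_alpha3'] = df['PopCode'].apply(lambda x: x[:3])
df['continent'] = df['iso_alpha3'].apply(alpha3_to_continent)
# Treat the '.' placeholder as zero, then convert to 64-bit integers
df['Deaths'] = df['Deaths'].replace(['.'], '0')
df['Deaths'] = df['Deaths'].astype('int64')
# Drop total/unknown rows; .copy() avoids SettingWithCopyWarning below
clean_df = df[(df['Age']!='TOT') & (df['Age']!='UNK') & (df['Week']!='UNK')].copy()
# print(clean_df['Age'].unique())
clean_df['Age'] = clean_df['Age'].astype('int32')
clean_df['Week'] = clean_df['Week'].astype('int32')
# Sex == 'b' holds both genders combined; the rest is gender-specific
all_clean_df = clean_df[clean_df['Sex']=='b'].reset_index(drop=True)
clean_df = clean_df[clean_df['Sex']!='b'].reset_index(drop=True)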

**clean_df** includes **12 columns** and **1,253,884 rows** after cleaning.

In [None]:
print(clean_df.info())

In [None]:
clean_df.head(20)

### Data exploration

In [None]:
clean_df.describe()

In [None]:
clean_df[['Year', 'Week', 'Age', 'Deaths']].corr()

**all_clean_df** includes **12 columns** and **626,318 rows** after cleaning.

In [None]:
print(all_clean_df.info())

In [None]:
all_clean_df.head(20)

### GitHub collaboration & variable storage
+ The **[nflx-data-project1-group6](https://github.com/vuhpham94/nflx-data-project1-group6)** GitHub repository was created for our group's collaboration. We decided to use separate branches for individual work; after cross-verification, the other branches are merged into **main**.
+ To prevent the data cleanup notebook from being modified while each group member is working, the **clean_df** and **all_clean_df** variables are stored so they can be used in other Jupyter notebooks.
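
In a downstream notebook, the stored variables can be restored with `%store -r clean_df all_clean_df`. Because `%store` is an IPython magic and is not available in plain Python scripts, a file-based round trip is one possible alternative. The sketch below is not part of our workflow: it uses a small stand-in dataframe with invented values and a hypothetical file name, and it writes to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the cleaned dataframe (values invented for illustration).
frame = pd.DataFrame({"Year": [2020, 2020], "Week": [1, 2], "Deaths": [12, 7]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "clean_df.pkl"  # hypothetical file name
    frame.to_pickle(target)              # written by the cleanup notebook
    restored = pd.read_pickle(target)    # read by a downstream notebook

# The round trip preserves both values and dtypes.
print(restored.equals(frame))
```

Pickle files should only be loaded from trusted sources, which is the case when the files are produced within the group's own repository.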

In [None]:
%store clean_df all_clean_df