
MSDS 422 Assignment 1: Exploring and Visualizing Data

**By Siying Zhang**


**Topic:** The spread of COVID-19 (incidence) and its fatality rate

**Data Source**

https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide

*Colab shared link (this notebook)*
https://colab.research.google.com/drive/1gaq5O6_R95mr2VIYfuL_JeBj-2jszPLV?authuser=5#scrollTo=4vdfzElVLcJ_

**Table of contents:**

System & Data Preparation
*   Load Relevant Packages
*   Load Data from CSV
*   Rename Columns
*   Drop Negative Values
*   Drop Missing Values

Data Exploration & Visualization
*   Features Creation
*   Time Series Analysis of Incidence
*   Time Series Analysis of Fatality
*   Incidence Distribution
*   Incidence Distribution by Continent


Data scaling and comparisons
*   Standard Scaling
*   Min-max Scaling
*   Comparison

Insights from analysis

**System & Data Preparation**

Load relevant packages

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib 
import matplotlib.pyplot as plt  
import seaborn as sns  
import io
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from fa_kit import FactorAnalysis
from matplotlib.backends.backend_pdf import PdfPages

In [None]:
# suppress warning messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
# correlation heat map setup for seaborn
def corr_chart(df_corr):
    corr = df_corr.corr()
    # mask the upper triangle (diagonal included) so only the lower triangle is drawn
    top = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24; use the builtin bool
    top[np.triu_indices_from(top)] = True
    fig, ax = plt.subplots(figsize=(12, 12))
    sns.heatmap(corr, mask=top, cmap='coolwarm',
        center=0, square=True,
        linewidths=.5, cbar_kws={'shrink': .5},
        annot=True, annot_kws={'size': 9}, fmt='.3f', ax=ax)
    plt.xticks(rotation=45)  # rotate variable labels on columns (x axis)
    plt.yticks(rotation=0)  # use horizontal variable labels on rows (y axis)
    plt.title('Correlation Heat Map')
    # papertype and frameon are no longer accepted by savefig in current Matplotlib
    plt.savefig('plot-corr-map.pdf',
        bbox_inches='tight', facecolor='w', edgecolor='b',
        orientation='portrait', transparent=True, pad_inches=0.25)

np.set_printoptions(precision=3)
plt.rcParams['figure.dpi'] = 100


Load data from the csv and gather descriptive information of dataframe

In [None]:
df = pd.read_csv('data.csv')
df.head()
df.info()
df.describe()
df.shape

Rename Columns

In [None]:
# Based on calculation, notification_rate_per_100000_population_14-days appears to be
# the 14-day case count per 100k population by country.
# Shorten the column names so they reflect what each variable holds;
# cases_weekly, deaths_weekly and geoId keep their original names.
df = df.rename(columns={
    'dateRep': 'date',
    'year_week': 'yweek',
    'countriesAndTerritories': 'country',
    'countryterritoryCode': 'countrycode',
    'popData2019': 'pop2019',
    'continentExp': 'continent',
    'notification_rate_per_100000_population_14-days': 'cases_14d_per_100k'})
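The comment above says the 14-day notification rate was identified "based on calculation". One way such a check could be done is sketched below, under the assumption that the rate is the sum of the current and previous week's cases divided by population, times 100,000. The toy values and the column name `cases_14d_per_100k_check` are illustrative only and are not taken from the ECDC file.

```python
import pandas as pd

# Toy data with the renamed columns (illustrative values only)
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'yweek': ['2020-10', '2020-11', '2020-12'] * 2,
    'cases_weekly': [100, 150, 200, 10, 30, 20],
    'pop2019': [1_000_000] * 3 + [500_000] * 3,
})

toy = toy.sort_values(['country', 'yweek'])

# Assumption: 14-day cases = current week + previous week, within each country
cases_14d = (toy.groupby('country')['cases_weekly']
                .rolling(window=2).sum()
                .reset_index(level=0, drop=True))

# Hypothetical check column; the first week of each country has no previous week (NaN)
toy['cases_14d_per_100k_check'] = cases_14d * 100_000 / toy['pop2019']
print(toy)
```

On the real dataframe, the recomputed column could then be compared with `cases_14d_per_100k`, for example with `numpy.isclose`, to see how closely the two agree.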

Drop Negative Values

In [None]:
# df.describe() shows negative values in cases_weekly and deaths_weekly,
# which could be data collection errors; drop those rows to improve data quality
df = df.drop(df[(df.cases_weekly < 0) | (df.deaths_weekly < 0)].index)
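Before dropping rows, it can help to see how many are affected. The following is a minimal sketch on toy data (the values and the names `demo`, `neg_mask` and `demo_clean` are illustrative assumptions, not part of the notebook) that counts rows with a negative weekly count and then filters them out.

```python
import pandas as pd

# Toy data (illustrative values only)
demo = pd.DataFrame({
    'cases_weekly': [120, -5, 40, 0],
    'deaths_weekly': [3, 1, -2, 0],
})

# True for rows where either weekly count is negative
neg_mask = (demo['cases_weekly'] < 0) | (demo['deaths_weekly'] < 0)
n_negative = int(neg_mask.sum())
print(f'{n_negative} of {len(demo)} rows have a negative count')

# Keep only the rows that are not flagged
demo_clean = demo[~neg_mask]
```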

Drop Missing Values

In [None]:
# df.describe() shows missing values in pop2019; because they make up only a small
# portion of the data, drop the rows with missing pop2019 to improve data quality
df = df.dropna(subset=['pop2019'])
df.shape
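The claim that missing values make up a small portion can be quantified directly. The sketch below uses toy data (illustrative values; `sample`, `missing_share` and `missing_countries` are hypothetical names) to compute the share of rows with a missing `pop2019` and to list which countries they belong to.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
sample = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'pop2019': [1_000_000.0, np.nan, 250_000.0, 4_000_000.0],
})

# Fraction of rows where pop2019 is missing
missing_share = sample['pop2019'].isna().mean()

# Countries affected by the missing population figure
missing_countries = sample.loc[sample['pop2019'].isna(), 'country'].unique()

print(f'{missing_share:.1%} of rows lack pop2019: {list(missing_countries)}')
```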

**Data Exploration & Visualization**


Features Creation

In [None]:
# Create weekly case rate and death rate per million population
df['case_rate_weekly_per_1m'] = df['cases_weekly'] * 1000000 / df['pop2019']
df['death_rate_weekly_per_1m'] = df['deaths_weekly'] * 1000000 / df['pop2019']
df.describe()
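As a quick worked example of the per-million rate used above, the sketch below applies the same formula to made-up numbers. The helper `rate_per_million` is a hypothetical name introduced only for illustration.

```python
def rate_per_million(count, population):
    """Scale a raw weekly count to a rate per one million people."""
    return count * 1_000_000 / population

# 500 weekly cases in a population of 2,000,000 (illustrative numbers)
weekly_rate = rate_per_million(500, 2_000_000)
print(weekly_rate)
```

Expressing counts per million puts countries of very different sizes on a comparable scale.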


Distribution of Weekly Incidence and Fatality

In [None]:
# plot the distributions of weekly incidence and fatality, and of their rates, in one figure
# histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
f, axes = plt.subplots(2, 2, figsize=(15, 15), sharex=False)
f.suptitle('Distribution of Incidence and Fatality', size=16, y=.9)
sns.histplot(df["cases_weekly"], color="blue", kde=True, stat="density", bins=50, ax=axes[0, 0])
sns.histplot(df["deaths_weekly"], color="red", kde=True, stat="density", bins=50, ax=axes[0, 1])
sns.histplot(df["case_rate_weekly_per_1m"], color="blue", kde=True, stat="density", bins=50, ax=axes[1, 0])
sns.histplot(df["death_rate_weekly_per_1m"], color="red", kde=True, stat="density", bins=50, ax=axes[1, 1])
# papertype and frameon are no longer accepted by savefig in current Matplotlib
f.savefig('Incidence and Fatality Distribution.pdf',
    bbox_inches='tight', facecolor='w', edgecolor='b',
    orientation='portrait', transparent=True, pad_inches=0.25)

Distribution of Weekly Incidence and Fatality by Continent

In [None]:
# plot the distributions of weekly incidence and fatality, and of their rates, by continent
sns.displot(df, x="cases_weekly", hue='continent', kind='kde', multiple='stack')
sns.displot(df, x="deaths_weekly", hue='continent', kind='kde', multiple='stack')
sns.displot(df, x="case_rate_weekly_per_1m", hue='continent', kind='kde', multiple='stack')
sns.displot(df, x="death_rate_weekly_per_1m", hue='continent', kind='kde', multiple='stack')

Scatter Plot of Weekly Incidence and Fatality by Continent

In [None]:
# scatter plot of weekly incidence and fatality, and of their rates, by continent in one figure
# seaborn 0.12+ requires x and y to be passed as keyword arguments
f, axes = plt.subplots(2, 2, figsize=(15, 15), sharex=False)
f.suptitle('Incidence and Fatality by Continent', size=16, y=.9)
sns.scatterplot(data=df, x="cases_weekly", y="continent", hue="continent", ax=axes[0, 0])
sns.scatterplot(data=df, x="deaths_weekly", y="continent", hue="continent", ax=axes[0, 1])
sns.scatterplot(data=df, x="case_rate_weekly_per_1m", y="continent", hue="continent", ax=axes[1, 0])
sns.scatterplot(data=df, x="death_rate_weekly_per_1m", y="continent", hue="continent", ax=axes[1, 1])
# papertype and frameon are no longer accepted by savefig in current Matplotlib
f.savefig('Incidence and Fatality by Continent.pdf',
    bbox_inches='tight', facecolor='w', edgecolor='b',
    orientation='portrait', transparent=True, pad_inches=0.25)

Time Series Analysis of Incidence


Time Series Analysis of Fatality

Incidence Distribution by Continent