## Run me on Colab
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rjlopez2/ADS_CAS_Bern_2020/blob/main/Projects/M1%20and%20M2/M1M2_cas_project.ipynb)


### **Import libraries**

In [None]:
import pandas as pd
!pip install wget # required when running via Colab (comment out if wget is already installed)
import os
from zipfile import ZipFile
import numpy as np
import wget
import fnmatch
import requests
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 200)

# Part I 
# M1 project 

## On data acquisition, formatting and cleaning

# 1. Johns Hopkins data collection and cleaning
### **Download the time series datasets on global Covid cases from Johns Hopkins University**
The time series are organized in 3 different files in their GitHub repository:

 - one file contains the confirmed cases
 - one file contains the deaths
 - one file contains the recovered cases

Below we download the 3 datasets and store them locally in .csv format.

In [None]:
urls = ['https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
       'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv',
       'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv']

path = os.getcwd() # get the current directory

for url in urls:
    filename = path + '/' + os.path.basename(url) # get the full path of the file
    if os.path.exists(filename):
        os.remove(filename) # if exist, remove it directly
    wget.download(url, out=filename) # download it to the specific path.
    
# IMPORTANT: if the download fails because a link is down, skip this cell and go to the next one.
# The next cell then reads only the local copies stored in the repo the last time this script was run and updated

In [None]:
confirmed_df = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths_df = pd.read_csv('time_series_covid19_deaths_global.csv')
recovered_df = pd.read_csv('time_series_covid19_recovered_global.csv')

### **We explore below the structure of the 3 datasets**

By looking at the shape of the 3 dataframes, we observe that recovered_df has different dimensions from the other two.
Closer inspection revealed that 14 Canadian provinces are missing from recovered_df.

In [None]:
# check size of the 3 datasets
print([confirmed_df.shape, deaths_df.shape, recovered_df.shape])


In [None]:
confirmed_df[~confirmed_df['Province/State'].isin(recovered_df['Province/State'])][['Province/State', 'Country/Region']] # !!! 14 'Province/State' values not found in recovered_df

Because of this inconsistency, we decided to exclude data from Canada for now.

In [None]:
recovered_df = recovered_df[recovered_df['Country/Region']!='Canada']
confirmed_df = confirmed_df[confirmed_df['Country/Region']!='Canada']
deaths_df = deaths_df[deaths_df['Country/Region']!='Canada']

In [None]:
# check size of the 3 datasets
print([confirmed_df.shape, deaths_df.shape, recovered_df.shape])


We observed that the first 4 columns of each dataset hold the same variables, so we use them as keys to merge all 3 datasets, and we define the time variable from the remaining columns.

1. We create the vector for the time variable.
2. We transform the 3 dataframes to long format.
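
The reshape and merge described above can be sketched on toy data. This is a minimal illustration, not the notebook's own cell: the tables `confirmed_toy` and `deaths_toy`, the helper `to_long` and all values are made up, and only two id columns are used. It joins the long tables on their shared keys, so the result does not depend on the row order of the input files.

```python
import pandas as pd

id_cols = ['Province/State', 'Country/Region']

# Toy wide tables (made-up values); note the different row order in deaths_toy.
confirmed_toy = pd.DataFrame({
    'Province/State': ['P1', 'P2'],
    'Country/Region': ['A', 'A'],
    '1/22/20': [1, 0],
    '1/23/20': [3, 2],
})
deaths_toy = pd.DataFrame({
    'Province/State': ['P2', 'P1'],
    'Country/Region': ['A', 'A'],
    '1/22/20': [0, 0],
    '1/23/20': [1, 0],
})


def to_long(df, value_name):
    """Melt one wide table into (id columns, Date, value) rows."""
    return df.melt(id_vars=id_cols, var_name='Date', value_name=value_name)


confirmed_long = to_long(confirmed_toy, 'Confirmed')
deaths_long = to_long(deaths_toy, 'Deaths')

# Join on the shared keys so values are matched by region and date,
# not by row position.
merged_toy = confirmed_long.merge(deaths_long, on=id_cols + ['Date'], how='left')
print(merged_toy)
```

Assigning one column to another dataframe by position gives the same result only when all tables list the regions in the same order.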

In [None]:
recovered_df.columns

In [None]:
deaths_df.columns

In [None]:
confirmed_df.columns

In [None]:
dates = confirmed_df.columns[4:]

confirmed_df_long = confirmed_df.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
    value_vars=dates, 
    var_name='Date', 
    value_name='Confirmed')

deaths_df_long = deaths_df.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
    value_vars=dates, 
    var_name='Date', 
    value_name='Deaths')

recovered_df_long = recovered_df.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
    value_vars=dates, 
    var_name='Date', 
    value_name='Recovered')

In [None]:
# check the size of each dataset in long format
print([confirmed_df_long.shape, deaths_df_long.shape, recovered_df_long.shape])

In [None]:
# check if the number of countries are the same in each subset
print(confirmed_df_long['Country/Region'].drop_duplicates().shape, 
      deaths_df_long['Country/Region'].drop_duplicates().shape, 
      recovered_df_long['Country/Region'].drop_duplicates().shape)
      

In [None]:
confirmed_df_long

In [None]:
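# Merge the 3 datasets
# (column assignment aligns on the row index, so this assumes the 3 long tables list the same regions and dates in the same order)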

confirmed_df_long["Deaths"] = deaths_df_long["Deaths"]
confirmed_df_long["Recovered"] = recovered_df_long['Recovered']
full_table = confirmed_df_long

In [None]:
full_table.info(verbose = True)

In [None]:
# transform the "Date" column from string to datetime
full_table['Date'] = pd.to_datetime(full_table['Date'])

### **Check and fix NaN in the full dataset**

In [None]:
full_table.isna().sum()

### **Remove cruise ships data**
#### We also observed confirmed Covid cases from cruise ships (Grand Princess, Diamond Princess and MS Zaandam) that are difficult to fit into the Country category, so we excluded them from our analysis

In [None]:
# select the ship rows (na=False treats missing names as "no match")
ship_rows = (full_table['Province/State'].str.contains('Grand Princess', na=False)
             | full_table['Province/State'].str.contains('Diamond Princess', na=False)
             | full_table['Country/Region'].str.contains('Diamond Princess', na=False)
             | full_table['Country/Region'].str.contains('MS Zaandam', na=False))

In [None]:
full_table = full_table[~(ship_rows)].copy() # the '~' operator negates the selection; copy() lets us add columns later without a SettingWithCopyWarning

## **Add new column for active cases**
Below we compute the active cases by subtracting the number of deaths and recovered cases from the confirmed cases.

In [None]:
# Active Case = confirmed - deaths - recovered
full_table['Active'] = full_table['Confirmed'] - full_table['Deaths'] - full_table['Recovered']
full_table

We aggregate the data by Country and Date (by means of grouping) and calculate the sum of the cases.

In [None]:
full_grouped = full_table.groupby(['Date', 'Country/Region'])[['Confirmed', 'Deaths', 'Recovered', 'Active']].sum().reset_index()

full_grouped

## Add new column(s) for new cases / new deaths / new recovered
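
As a minimal sketch of the idea, assuming toy data: daily new cases are the day-to-day difference of the cumulative counts, computed separately for each country. The frame `cumulative_toy` and its values are made up. Here `groupby(...).diff()` restarts at each country, which is an alternative to masking the first row of each country by hand.

```python
import pandas as pd

# Toy cumulative counts for two countries (made-up values).
cumulative_toy = pd.DataFrame({
    'Country/Region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03'] * 2),
    'Confirmed': [1, 4, 3, 10, 10, 15],
})

cumulative_toy = cumulative_toy.sort_values(['Country/Region', 'Date'])

# diff() within each country; the first day of a country has no previous value.
new_cases = cumulative_toy.groupby('Country/Region')['Confirmed'].diff()

# Treat the first day as 0 and set negative corrections to 0.
cumulative_toy['New_cases'] = new_cases.fillna(0).clip(lower=0).astype(int)
print(cumulative_toy)
```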

In [None]:
# new cases 
temp = full_grouped.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths', 'Recovered']]  # a list of columns; tuple-style selection fails in recent pandas
temp = temp.sum().diff().reset_index()

mask = temp['Country/Region'] != temp['Country/Region'].shift(1)

temp.loc[mask, 'Confirmed'] = np.nan
temp.loc[mask, 'Deaths'] = np.nan
temp.loc[mask, 'Recovered'] = np.nan

# renaming columns
temp.columns = ['Country/Region', 'Date', 'New_cases', 'New_deaths', 'New_recovered']

# merging new values
full_grouped = pd.merge(full_grouped, temp, on=['Country/Region', 'Date'])

# filling na with 0
full_grouped = full_grouped.fillna(0)

# fixing data types
cols = ['New_cases', 'New_deaths', 'New_recovered']
full_grouped[cols] = full_grouped[cols].astype('int')

# set negative new-case counts to 0
full_grouped['New_cases'] = full_grouped['New_cases'].apply(lambda x: 0 if x<0 else x)

In [None]:
#rename the "Country/Region" variable
full_grouped.rename(columns = {'Country/Region' : 'Country_Region'}, inplace = True)


In [None]:
# compute the number of countries registered in the Covid dataset
full_grouped['Country_Region'].unique().size

## **Extract metadata for Covid datasets**
We want to merge the Covid dataset with other datasets on a common variable, in our case the country. Since countries might be named differently in each data source, we use the country code as a standard variable for the later merges. We now assign a new column with the country codes to the Covid dataframe. To achieve this we do the following steps:
 - Load metadata from the Covid repository
 - Extract the information on the country code (here the variable called 'iso3')

We also extract additional information on the population of each country. This will be used later for normalizing our variables.
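
A minimal sketch of how name mismatches can be spotted when attaching country codes, assuming toy data: the frames `cases_toy` and `lookup_toy`, the country names and the codes are all invented. The `indicator=True` option of `merge` adds a `_merge` column that marks rows with no partner in the lookup table.

```python
import pandas as pd

# Invented names and codes, for illustration only.
cases_toy = pd.DataFrame({
    'Country_Region': ['Aland', 'Borduria', 'Ship X'],
    'Confirmed': [10, 20, 1],
})
lookup_toy = pd.DataFrame({
    'Country_Region': ['Aland', 'Borduria'],
    'iso3': ['ALD', 'BRD'],
    'Population': [1000, 5000],
})

# A left join keeps every row of cases_toy; '_merge' records where each row came from.
checked = cases_toy.merge(lookup_toy, on='Country_Region', how='left', indicator=True)

# Rows marked 'left_only' have no country code in the lookup table.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Country_Region'].tolist()
print(unmatched)
```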

In [None]:
covid_metadata_countries = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv",
                                      usecols = ['Country_Region', 'Province_State', 'iso3', 'Population'])


In [None]:
covid_metadata_countries.info(verbose = True)

### **Remove regional subsets in the metadata and work only at the national level**

In [None]:
covid_metadata_countries = covid_metadata_countries[covid_metadata_countries['Province_State'].isna()].drop_duplicates()#.shape

### **Remove the cruise ships information from the metadata on country codes**

In [None]:
# select from columns 'Country_Region' the names 'Diamond Princess'and 'MS Zaandam'
ship_metadata = covid_metadata_countries['Country_Region'].str.contains('Diamond Princess') | covid_metadata_countries['Country_Region'].str.contains('MS Zaandam')
ship_metadata
covid_metadata_countries = covid_metadata_countries[~(ship_metadata)]

### **Summarize the population by country in the metadata dataframe**

In [None]:
#my_covid_variables = ['Confirmed', 'Deaths', 'Recovered', 'Active']
code_vars = ['Country_Region', 'iso3']

#full_table.groupby(['Date', 'Country/Region'])[my_covid_variables].sum().reset_index()
country_population = covid_metadata_countries.groupby(code_vars)['Population'].sum().reset_index()

In [None]:
country_population['Country_Region'].unique().shape

In [None]:
# this is the number of countries registered in the Covid df
full_grouped['Country_Region'].unique().size

### **Merge country code with Covid datasets**

In [None]:
full_grouped_ccode = pd.merge(country_population, full_grouped, how = 'left')


In [None]:
full_grouped_ccode.info(verbose = True)

In [None]:
full_grouped_ccode.rename(columns = {'iso3' : 'CountryCode'}, inplace = True)

In [None]:
# Check NaNs generated during the merge and remove them
full_grouped_ccode.isna().sum()

In [None]:
full_grouped_ccode = full_grouped_ccode[~full_grouped_ccode['Date'].isna()]

# 2. Oxford Stringency index data collection, cleaning and merging

In [None]:
str_url = ["https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv"]

for url in str_url:
    filename = path + '/' + os.path.basename(url) # get the full path of the file
    if os.path.exists(filename):
        os.remove(filename) # if exist, remove it directly
    wget.download(url, out=filename) # download it to the specific path.
# IMPORTANT: if the download fails because the link is down, skip this cell and go to the next one.
# The next cell then reads only the local copy stored in the repo the last time this script was run and updated

In [None]:
my_string_columns = ["Date", "CountryCode", "CountryName", "StringencyIndex", "RegionName", "RegionCode"] 
stringency_raw_dataset = pd.read_csv("OxCGRT_latest.csv", usecols = my_string_columns, low_memory=False)
stringency_raw_dataset.info(verbose = True)

### **Selecting national data only (excluding regional data); see the documentation at this link for why -> https://github.com/OxCGRT/covid-policy-tracker**
To keep only the national data we followed the instructions in the link above and took only the rows where the variable RegionCode is null.

In [None]:
stringe_natio_dataset = stringency_raw_dataset[stringency_raw_dataset.RegionCode.isnull()]


In [None]:
# keep only the columns we need (copy() lets us add columns later without a SettingWithCopyWarning)
stringe_natio_dataset = stringe_natio_dataset[my_string_columns[:4]].copy()

Closer inspection revealed that the number of country codes from the Stringency dataset is less than the number of country codes in the Covid dataframe.


In [None]:
print([stringe_natio_dataset['CountryCode'].unique().size, 
       full_grouped_ccode['CountryCode'].unique().size])


#### **The list below shows the countries/regions/dependencies that have no information on the stringency index. Those countries (31) will be excluded from the analysis for the moment**

In [None]:
# find what is not present in the stringency dataset
full_grouped_ccode[~full_grouped_ccode['CountryCode'].isin(stringe_natio_dataset['CountryCode'])][['CountryCode', 'Country_Region']].drop_duplicates()

In [None]:
## We filter out the countries above before joining with the Covid dataset
full_grouped_ccode_filtered = full_grouped_ccode[full_grouped_ccode['CountryCode'].isin(stringe_natio_dataset['CountryCode'])]

In [None]:
full_grouped_ccode_filtered.info(verbose = True)

### Fixing the date format in the stringency dataset

In [None]:
#stringency_raw_dataset.info(verbose = True)
stringe_natio_dataset['Date'] = pd.to_datetime(stringe_natio_dataset['Date'], format = '%Y%m%d')

In [None]:
stringe_natio_dataset.info(verbose = True)

In [None]:
stringe_natio_dataset

### Create categories for Stringency index
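
As an illustrative alternative, assuming the same five 20-point bands, the binning can be sketched with `pd.cut` on toy values. The series `index_toy` is made up; the band edges and labels mirror the ones used in this section, and missing values are labelled `unknown`.

```python
import numpy as np
import pandas as pd

# Made-up stringency values, including band edges and a missing value.
index_toy = pd.Series([0.0, 20.0, 20.5, 55.0, 80.0, 100.0, np.nan])

bins = [0, 20, 40, 60, 80, 100]
labels = ['Very_low', 'Low', 'Middle', 'High', 'Very_high']

# right=True closes each band on the right; include_lowest=True keeps 0 in the first band.
factor_toy = pd.cut(index_toy, bins=bins, labels=labels, right=True, include_lowest=True)

# Missing index values come out as NaN; label them explicitly as 'unknown'.
factor_toy = factor_toy.cat.add_categories('unknown').fillna('unknown')
print(factor_toy.tolist())
```

Unlike plain strings, the result is an ordered categorical, which keeps the bands in a meaningful order when sorting or plotting.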

In [None]:
condition1 = (stringe_natio_dataset["StringencyIndex"] >= 0) & (stringe_natio_dataset["StringencyIndex"] <= 20)
condition2 = (stringe_natio_dataset["StringencyIndex"] > 20) & (stringe_natio_dataset["StringencyIndex"] <= 40)
condition3 = (stringe_natio_dataset["StringencyIndex"] > 40) & (stringe_natio_dataset["StringencyIndex"] <= 60)
condition4 = (stringe_natio_dataset["StringencyIndex"] > 60) & (stringe_natio_dataset["StringencyIndex"] <= 80)
condition5 = (stringe_natio_dataset["StringencyIndex"] > 80) & (stringe_natio_dataset["StringencyIndex"] <= 100)

case1 = "Very_low"
case2 = "Low"
case3 = "Middle"
case4 = "High"
case5 = "Very_high"


conditions = [condition1, condition2, condition3, condition4, condition5]
cases = [case1, case2, case3, case4, case5]

# rows that match no condition (e.g. a missing index value) are labelled "unknown"
stringe_natio_dataset["StringencyIndex_factor"] = np.select(conditions, cases, default="unknown")
stringe_natio_dataset

# 3. Joining Covid cases with Stringency index data 
Join the Covid dataset with the stringency dataset and transform the final df into a time series

In [None]:
my_complete_df = pd.merge(stringe_natio_dataset[['Date', 'CountryCode','StringencyIndex', 'StringencyIndex_factor']], # selecting only the variables to join
                          full_grouped_ccode_filtered)

my_complete_df.info(verbose = True)

In [None]:
my_complete_df.set_index('Date', inplace = True)

In [None]:
my_complete_df

## Normalize all case variables per 100,000 people per country
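
The normalization is rate = count × 100,000 / population. A minimal sketch on toy data, where the frame `counts_toy` and its values are made up:

```python
import pandas as pd

# Made-up counts and populations for two countries.
counts_toy = pd.DataFrame({
    'Confirmed': [50, 200],
    'Deaths': [1, 10],
    'Population': [100_000, 4_000_000],
})

# rate per 100,000 people = count * 100,000 / population
for col in ['Confirmed', 'Deaths']:
    counts_toy[col + '_100K'] = counts_toy[col] * 100_000 / counts_toy['Population']

print(counts_toy)
```

Looping over the column names keeps the formula in one place when several variables share the same normalization.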

In [None]:
my_complete_df['Confirmed_100K'] = my_complete_df['Confirmed'].multiply(100000, fill_value = 0).divide(my_complete_df['Population'], fill_value = 0)
my_complete_df['Deaths_100K'] = my_complete_df['Deaths'].multiply(100000, fill_value = 0).divide(my_complete_df['Population'], fill_value = 0)
my_complete_df['Recovered_100K'] = my_complete_df['Recovered'].multiply(100000, fill_value = 0).divide(my_complete_df['Population'], fill_value = 0)
my_complete_df['Active_100K'] = my_complete_df['Active'].multiply(100000, fill_value = 0).divide(my_complete_df['Population'], fill_value = 0)
my_complete_df['New_cases_100K'] = my_complete_df['New_cases'].multiply(100000, fill_value = 0).divide(my_complete_df['Population'], fill_value = 0)
my_complete_df['New_deaths_100K'] = my_complete_df['New_deaths'].multiply(100000, fill_value = 0).divide(my_complete_df['Population'], fill_value = 0)
my_complete_df['New_recovered_100K'] = my_complete_df['New_recovered'].multiply(100000, fill_value = 0).divide(my_complete_df['Population'], fill_value = 0)
my_complete_df

# NOTE: This old visualization chunk section below can be removed and use the new one you have created in the exploratory analysis

## Visualizing Covid data by individual countries
To visualize how a country is doing with Covid, assign a country code to the variable `country`, taken from the following dictionary of country names and codes

In [None]:
my_countries_dicc = my_complete_df[['CountryCode', 'Country_Region']].drop_duplicates().reset_index()[['CountryCode', 'Country_Region']].set_index('CountryCode').to_dict()['Country_Region']

#my_countries_dicc # to see all country codes and names uncomment this line

Here we take an example visualizing the dataset from Switzerland (CHE)

In [None]:
country = 'CHE'
single_country_covid_df = my_complete_df[my_complete_df['CountryCode'] == country]
single_country_covid_df

In [None]:
# my_vars_for_ploting = ['StringencyIndex', 'New_cases_100K', 'New_deaths_100K', 'New_recovered_100K', 'Confirmed_100K', 'Deaths_100K', 'Recovered_100K', 'Active_100K']

# i = 0
# for variables in range(len(my_vars_for_ploting)):
#     single_country_covid_df.plot(y=my_vars_for_ploting[i],
#                                  kind="line",
# #                                  c=['c', 'b'], 
#                                  c = 'c',
#                                  label = my_vars_for_ploting[i]) # = ['StringencyIndex', 'New cases', 'New deaths', 'New recovered', 'Confirmed', 'Deaths', 'Recovered']])
#     plt.title('Country = ' + my_countries_dicc['Country_Region'][country])
#     i+=1
# plt.legend()
# plt.show()

# 4. Collecting and merging socioeconomic data from the World Bank
We extract socioeconomic data such as GDP and income level from the World Bank datasets via API query requests.

**Note**: retrieving *GDP* and *income level* in a single call doesn't seem to be straightforward. So the strategy may be to make a call for each dataset and then merge them.
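
Before each download, stale World Bank files are deleted by name pattern. A minimal sketch of that cleanup step, assuming a hypothetical helper `remove_matching` (not part of the notebook) and demo file names, run in a temporary directory so no real files are touched:

```python
import fnmatch
import os
import tempfile


def remove_matching(directory, pattern):
    """Delete files in `directory` whose names match `pattern`; return the removed names."""
    removed = []
    for name in os.listdir(directory):
        if fnmatch.fnmatch(name, pattern):
            os.remove(os.path.join(directory, name))
            removed.append(name)
    return sorted(removed)


# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['API_demo.csv', 'Metadata_demo.csv', 'notes.txt']:
        open(os.path.join(tmp_dir, name), 'w').close()
    removed_files = remove_matching(tmp_dir, 'API_*.csv')
    remaining_files = sorted(os.listdir(tmp_dir))

print(removed_files, remaining_files)
```

Returning the list of removed names makes it easy to tell whether an older file existed, without tracking a separate flag inside the loop.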

In [None]:
my_home_url = 'http://api.worldbank.org/v2/country/all/indicator/NY.GDP.MKTP.CD'
my_params = {'date' : '2019',
            'incomelevel' :'',
            'downloadformat' : 'csv',
            'per_page' : '304'} # dic with the parameters of interest


# remove the extracted data csv file if it exists
for file in os.listdir(path):
    if fnmatch.fnmatch(file, 'API_*.csv'):
        os.remove(os.path.join(path, file))

# remove the extracted metadata csv files if they exist
for file in os.listdir(path):
    if fnmatch.fnmatch(file, 'Metadata_*.csv'):
        os.remove(os.path.join(path, file))

# if a zip file exists it is replaced by a fresh download
file_exists = False  # stays False when no zip file is found
for file in os.listdir(path):
    if fnmatch.fnmatch(file, '*.zip'):
        file_exists = True
        os.remove(os.path.join(path, file))
        r_GDP = requests.get(my_home_url, params = my_params)
        my_zip_file = wget.download(r_GDP.url)

        


In [None]:
file_exists

In [None]:
# if the zip file doesn't exist it will be downloaded
if not file_exists:
    r_GDP = requests.get(my_home_url, params = my_params)
    my_zip_file = wget.download(r_GDP.url)

In [None]:
for file in os.listdir(path):
    if fnmatch.fnmatch(file, '*.zip'):
        #print(file)# find only the zip file
        with ZipFile(file, 'r') as zipObj:
            for content in zipObj.namelist():
                if fnmatch.fnmatch(content, 'API_*'):
                    #print(content) # within the zip file, find and extract the csv file that contains the data
                    my_filename = content
                    zipObj.extract(content)
                    

In [None]:
GDP_raw_df = pd.read_csv(path + '/' + my_filename,
                        header = 2,
                        usecols = [1, 4])
GDP_raw_df

In [None]:
GDP_raw_df.info(verbose = True)

In [None]:
# 1. fixing names in GDP_raw_df dataset

GDP_correct_names = {'Country Code' : 'CountryCode',
                     '2019' : 'GDP_in_USD'}
GDP_raw_df.rename(columns = GDP_correct_names, inplace= True)
GDP_raw_df

In [None]:
GDP_raw_df[~GDP_raw_df['CountryCode'].isin(my_complete_df['CountryCode'])].drop_duplicates().shape # these are the 88 regions or dependencies from the GDP dataset that are not in the Covid df

In [None]:
my_complete_df[my_complete_df['CountryCode'].isin(GDP_raw_df['CountryCode'])][['CountryCode']].drop_duplicates().shape

We select only those countries from the GDP df that are present in the covid dataframe to be merged with the covid full dataframe

In [None]:
GDP_raw_df = GDP_raw_df[GDP_raw_df['CountryCode'].isin(my_complete_df['CountryCode'])]
GDP_raw_df.info()

### Join the GDP data to the covid df

In [None]:
my_complete_df = my_complete_df.reset_index().merge(right=GDP_raw_df[['CountryCode', 'GDP_in_USD']],how='left', on=['CountryCode']).set_index('Date')
my_complete_df.info()

## We now extract data on Income level of the different countries from metadata files from the WorldBank in the zip file
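
A minimal sketch of picking one file out of a zip archive by name pattern, assuming toy data: the archive is built in memory and its file names and contents are made up to resemble the layout used here. The member is read directly into pandas without extracting it to disk.

```python
import fnmatch
import io
import zipfile

import pandas as pd

# Build a small in-memory archive (made-up file names and contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('API_demo.csv', 'Country Code,2019\nAAA,1.5\n')
    archive.writestr('Metadata_Country_demo.csv',
                     'Country Code,IncomeGroup\nAAA,High income\n')

# Find the metadata member by pattern and read it straight from the archive.
with zipfile.ZipFile(buffer) as archive:
    member = next(name for name in archive.namelist()
                  if fnmatch.fnmatch(name, 'Metadata_Country*'))
    with archive.open(member) as handle:
        income_toy = pd.read_csv(handle)

print(member)
print(income_toy)
```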

In [None]:
for file in os.listdir(path):
    if fnmatch.fnmatch(file, '*.zip'):
        #print(file)# find only the zip file
        with ZipFile(file, 'r') as zipObj:
            for content in zipObj.namelist():
                if fnmatch.fnmatch(content, 'Metadata_Country*'):
                    #print(content) # within the zip file, find and extract the csv file that contains the country metadata
                    my_filename = content
                    zipObj.extract(content)
                    

In [None]:
income_raw_df = pd.read_csv(my_filename,
                        #header = 2,
                        usecols = [0, 1, 2, 4])

income_raw_df

In [None]:
#2 (on income_raw_df)
# make names consistent across all datasets
# 1. fixing names in income_raw_df dataset
# 2. take only relevant columns

income_correct_names = {'Country Code' : 'CountryCode',
                        'TableName' : 'Country_Region'}

income_raw_df.rename(columns = income_correct_names, inplace = True)
income_raw_df

In [None]:
income_raw_df[~income_raw_df['CountryCode'].isin(my_complete_df['CountryCode'])].shape # these are the 85 regions, aggregated regions or dependencies from the income level dataset that are not in the Covid df

In [None]:
my_complete_df[~my_complete_df['CountryCode'].isin(income_raw_df['CountryCode'])][['CountryCode']].drop_duplicates()#.shape # these 4 dependencies from the Covid dataset have no income level information

### As before, we select only those countries from the income level dataset which are also present in the Covid dataset, and exclude all other regions and dependencies.

In [None]:
income_raw_df = income_raw_df[income_raw_df['CountryCode'].isin(my_complete_df['CountryCode'])]
income_raw_df.info()

## We make the final join of the income level data with the covid dataset

In [None]:
my_final_df = my_complete_df.reset_index().merge(right=income_raw_df[['CountryCode', 'IncomeGroup']],how='left', on=['CountryCode']).set_index('Date')
my_final_df.info()

In [None]:
my_final_df.head(10)#[my_final_df['CountryCode'] == "DEU"][['New_deaths']]

In [None]:
my_final_df.columns

## This is the final clean working dataframe, which contains:
 - time series of Covid cases for 180 countries or dependencies around the world
 - standard country codes to make countries easy to find
 - the cumulative sum of confirmed, death and recovered cases
 - the new cases, new deaths and new recovered cases on a day-wise basis
 - all the variables mentioned above normalized per 100,000 people per country, which may be useful for comparing different countries
 - the government response to restrain the spread of the pandemic, indicated by the stringency index
 - two socioeconomic indicators for countries: GDP in USD and income level
 

# Part II 
# M2 project on descriptive statistics

## 1.  Descriptive Statistics


# 1. Visualization of Coronavirus cases per country and exploring which income group they belong to


In [None]:
import scipy.stats
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
from scipy.stats import spearmanr
model = LinearRegression()
import seaborn as sb

## Sorting the countries with the **highest** coronavirus outbreak
We first sort the data to identify the countries with the most cases
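
A minimal sketch of ranking countries on a single date, assuming toy data: the frame `snapshot_toy`, the country codes and the values are made up. `nlargest` returns the same rows as sorting in descending order and taking the head.

```python
import pandas as pd

# Made-up snapshot for one date.
snapshot_toy = pd.DataFrame({
    'CountryCode': ['AAA', 'BBB', 'CCC', 'DDD'],
    'Deaths_100K': [5.0, 50.0, 20.0, 1.0],
})

top_n = 2
# Codes of the top_n countries with the highest deaths per 100,000 people.
top_codes = snapshot_toy.nlargest(top_n, 'Deaths_100K')['CountryCode'].tolist()
print(top_codes)
```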


## Plotting the top 5 MOST affected countries based on total death toll and infections


In [None]:
#find last reported date
last_reported_date = my_final_df.reset_index().tail(1)['Date'].to_string(index = False)
last_reported_date

In [None]:
# find the top `my_top` countries with the highest number of deaths and confirmed cases (cumulative, per 100K)
my_top = 5
top_deaths_100K = my_final_df.loc[last_reported_date].sort_values(by = "Deaths_100K", ascending = False).head(my_top).reset_index()['CountryCode'].to_list()#.set_index('CountryCode').to_dict()["Country_Region"]
top_confirmed_100K = my_final_df.loc[last_reported_date].sort_values(by = "Confirmed_100K", ascending = False).head(my_top).reset_index()['CountryCode'].to_list()#.set_index('CountryCode').to_dict()["Country_Region"]

top_deaths_100K_df = my_final_df[my_final_df['CountryCode'].isin(top_deaths_100K)].reset_index()
top_confirmed_100K_df = my_final_df[my_final_df['CountryCode'].isin(top_confirmed_100K)].reset_index()

# font = {'family' : 'normal',
#         'weight' : 'bold',
#         'size'   : 18}
# plt.rc('font', **font)

factor_size = 1.5
SMALL_SIZE = 8 * factor_size
MEDIUM_SIZE = 10 * factor_size
BIGGER_SIZE = 12 * factor_size

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=BIGGER_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

# Multiline Plot: number of confirmed cases in 100K 
fig1 = plt.figure(figsize=(22, 10) , dpi=300)


ax1 = fig1.add_subplot(221)
sb.lineplot(
    data=top_confirmed_100K_df, 
#     kind="line",
    x="Date", 
    y="Confirmed_100K",
    ax = ax1,
    linewidth = 4,
    hue="Country_Region")
plt.xticks(rotation=45) 
plt.legend(frameon=False)
plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)
#plt.title('My title')

ax2 = fig1.add_subplot(222)
sb.barplot(
    data=top_confirmed_100K_df[top_confirmed_100K_df['Date'] == last_reported_date], 
    x="IncomeGroup", 
    y="Confirmed_100K",
    ax = ax2,
    hue="Country_Region")

plt.legend(frameon=False)
plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)
ax3 = fig1.add_subplot(223)

sb.lineplot(
    data=top_deaths_100K_df, 
#     kind="line",
    x="Date", 
    y="Deaths_100K",
    linewidth = 4,
    ax = ax3,
    hue="Country_Region")
plt.xticks(rotation=45)
plt.legend(frameon=False)
plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)

ax4 = fig1.add_subplot(224)
sb.barplot(
    data=top_deaths_100K_df[top_deaths_100K_df['Date'] == last_reported_date], 
    x="IncomeGroup", 
    y="Deaths_100K",
    ax = ax4,
    hue="Country_Region")
plt.legend(frameon=False)
plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)

fig1.tight_layout(pad=3)

plt.savefig('Most_affected_countries.png')

## Plotting the top 5 LEAST affected countries and income level


In [None]:
# find the `my_top_n` countries with the lowest number of deaths and confirmed cases (cumulative, per 100K)
my_top_n = 5
top_deaths_100K = my_final_df.loc[last_reported_date].sort_values(by = "Deaths_100K", ascending = True).head(my_top_n).reset_index()['CountryCode'].to_list()#.set_index('CountryCode').to_dict()["Country_Region"]
top_confirmed_100K = my_final_df.loc[last_reported_date].sort_values(by = "Confirmed_100K", ascending = True).head(my_top_n).reset_index()['CountryCode'].to_list()#.set_index('CountryCode').to_dict()["Country_Region"]

top_deaths_100K_df = my_final_df.loc[my_final_df['CountryCode'].isin(top_deaths_100K)].reset_index()
top_confirmed_100K_df = my_final_df.loc[my_final_df['CountryCode'].isin(top_confirmed_100K)].reset_index()
# font = {'family' : 'normal',
#         'weight' : 'bold',
#         'size'   : 18}
# plt.rc('font', **font)

factor_size = 1.5
SMALL_SIZE = 8 * factor_size
MEDIUM_SIZE = 10 * factor_size
BIGGER_SIZE = 12 * factor_size

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=BIGGER_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

# Multiline Plot: number of confirmed cases in 100K 
fig1 = plt.figure(figsize=(22, 10) , dpi=300)


ax1 = fig1.add_subplot(221)
sb.lineplot(
    data=top_confirmed_100K_df, 
#     kind="line",
    x="Date", 
    y="Confirmed_100K",
    ax = ax1,
    linewidth = 4,
    hue="Country_Region")
plt.xticks(rotation=45) 
plt.legend(frameon=False)
plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)
#plt.title('My title')

ax2 = fig1.add_subplot(222)
sb.barplot(
    data=top_confirmed_100K_df[top_confirmed_100K_df['Date'] == last_reported_date], 
    x="IncomeGroup", 
    y="Confirmed_100K",
    ax = ax2,
    hue="Country_Region")

plt.legend(frameon=False)
plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)
ax3 = fig1.add_subplot(223)

sb.lineplot(
    data=top_deaths_100K_df, 
#     kind="line",
    x="Date", 
    y="Deaths_100K",
    linewidth = 4,
    ax = ax3,
    hue="Country_Region")
plt.xticks(rotation=45)
plt.legend(frameon=False)
plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)

ax4 = fig1.add_subplot(224)
sb.barplot(
    data=top_deaths_100K_df[top_deaths_100K_df['Date'] == last_reported_date], 
    x="IncomeGroup", 
    y="Deaths_100K",
    ax = ax4,
    hue="Country_Region")
plt.legend(frameon=False)
plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)

fig1.tight_layout(pad=3)

plt.savefig('Less_affected_countries.png')

## NOTE: We found that the least affected countries seem to be countries with missing information (no reported cases) or countries with very small populations. 

## Because of this, we remove countries with a small population size (< 1000000 people)
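
A minimal sketch of this filter on toy data, where the frame `snapshot_toy` and all values are made up. Like the cells in this part, it also restricts the selection to high-income countries before looking for the lowest rate.

```python
import pandas as pd

# Made-up snapshot for one date.
snapshot_toy = pd.DataFrame({
    'CountryCode': ['AAA', 'BBB', 'CCC', 'DDD'],
    'Population': [50_000, 2_000_000, 9_000_000, 1_000_000],
    'IncomeGroup': ['High income', 'High income', 'Low income', 'High income'],
    'Deaths_100K': [0.0, 12.0, 3.0, 7.0],
})

# Keep high-income countries with at least one million inhabitants.
large_high_income = snapshot_toy[(snapshot_toy['Population'] >= 1_000_000)
                                 & (snapshot_toy['IncomeGroup'] == 'High income')]

# Country with the lowest deaths per 100,000 people among the remaining ones.
least_affected = large_high_income.nsmallest(1, 'Deaths_100K')['CountryCode'].tolist()
print(least_affected)
```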

In [None]:
my_top_n = 5
best_10_confirmed=my_final_df[(my_final_df['IncomeGroup'] == 'High income') & (my_final_df['Population'] >= 1000000)].loc[last_reported_date].sort_values(by = "Confirmed_100K", ascending = True).head(my_top_n).reset_index()['CountryCode'].to_list()

In [None]:
best_10_deaths=my_final_df[(my_final_df['IncomeGroup'] == 'High income')& (my_final_df['Population'] >= 1000000)].loc[last_reported_date].sort_values(by = "Deaths_100K", ascending = True).head(my_top_n).reset_index()['CountryCode'].to_list()

In [None]:
best_deaths_100K_df = my_final_df.loc[my_final_df['CountryCode'].isin(best_10_deaths)].reset_index()
best_confirmed_100K_df = my_final_df.loc[my_final_df['CountryCode'].isin(best_10_confirmed)].reset_index()

In [None]:
# find the `my_top_n` high-income countries (population >= 1000000) with the lowest number of deaths and confirmed cases (cumulative, per 100K)
my_top_n = 5
best_10_confirmed=my_final_df[(my_final_df['IncomeGroup'] == 'High income') & (my_final_df['Population'] >= 1000000)].loc[last_reported_date].sort_values(by = "Confirmed_100K", ascending = True).head(my_top_n).reset_index()['CountryCode'].to_list()
best_10_deaths=my_final_df[(my_final_df['IncomeGroup'] == 'High income')& (my_final_df['Population'] >= 1000000)].loc[last_reported_date].sort_values(by = "Deaths_100K", ascending = True).head(my_top_n).reset_index()['CountryCode'].to_list()

best_deaths_100K_df = my_final_df.loc[my_final_df['CountryCode'].isin(best_10_deaths)].reset_index()
best_confirmed_100K_df = my_final_df.loc[my_final_df['CountryCode'].isin(best_10_confirmed)].reset_index()



factor_size = 1.5
SMALL_SIZE = 8 * factor_size
MEDIUM_SIZE = 10 * factor_size
BIGGER_SIZE = 12 * factor_size

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=BIGGER_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

# Multi-panel plot: confirmed cases and deaths per 100K for the least affected countries
fig1 = plt.figure(figsize=(22, 10) , dpi=300)


ax1 = fig1.add_subplot(221)
sb.lineplot(
    data=best_confirmed_100K_df, 
#     kind="line",
    x="Date", 
    y="Confirmed_100K",
    ax = ax1,
    linewidth = 4,
    hue="Country_Region")
plt.xticks(rotation=45) 
# One call only: a second plt.legend() would replace the first and discard frameon=False
plt.legend(frameon=False, bbox_to_anchor=(1.04, 1), borderaxespad=0)
#plt.title('My title')

ax2 = fig1.add_subplot(222)
sb.barplot(
    data=best_confirmed_100K_df[best_confirmed_100K_df['Date'] == last_reported_date], 
    x="IncomeGroup", 
    y="Confirmed_100K",
    ax = ax2,
    hue="Country_Region")

plt.legend(frameon=False, bbox_to_anchor=(1.04, 1), borderaxespad=0)

ax3 = fig1.add_subplot(223)
sb.lineplot(
    data=best_deaths_100K_df, 
#     kind="line",
    x="Date", 
    y="Deaths_100K",
    linewidth = 4,
    ax = ax3,
    hue="Country_Region")
plt.xticks(rotation=45)
plt.legend(frameon=False, bbox_to_anchor=(1.04, 1), borderaxespad=0)

ax4 = fig1.add_subplot(224)
sb.barplot(
    data=best_deaths_100K_df[best_deaths_100K_df['Date'] == last_reported_date], 
    x="IncomeGroup", 
    y="Deaths_100K",
    ax = ax4,
    hue="Country_Region")
plt.legend(frameon=False, bbox_to_anchor=(1.04, 1), borderaxespad=0)

fig1.tight_layout(pad=3)

plt.savefig('Real_Less_affected_countries.png')

## Checking the high-income countries to find out which stringency index pattern works best
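
One possible way to put numbers on a "stringency pattern" before plotting is to summarise the index per country. The sketch below is an assumption about how such a summary could look, not the analysis performed in this notebook: `toy`, `stringency_summary` and the cut-off `THRESHOLD` are hypothetical, and only the column names (`Date`, `Country_Region`, `StringencyIndex`) are taken from the plotting code.

```python
import pandas as pd

# Toy time series for two countries (values are made up for illustration).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "Country_Region": ["A"] * 3 + ["B"] * 3,
    "StringencyIndex": [10.0, 50.0, 80.0, 0.0, 20.0, 40.0],
})

THRESHOLD = 50.0  # hypothetical cut-off for "strict" measures


def stringency_summary(df, threshold=THRESHOLD):
    """Per-country mean and peak stringency, plus the first date the index
    reached `threshold` (NaT if it never did)."""
    grouped = df.groupby("Country_Region")["StringencyIndex"]
    summary = grouped.agg(mean_stringency="mean", peak_stringency="max")
    first_strict = (
        df[df["StringencyIndex"] >= threshold]
        .groupby("Country_Region")["Date"]
        .min()
    )
    # Index alignment leaves NaT for countries that never reached the threshold.
    summary["first_strict_date"] = first_strict
    return summary


summary = stringency_summary(toy)
print(summary)
```

A table like this makes it easier to compare how early and how strongly each country reacted than reading the line plots alone.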

In [None]:
my_final_df.IncomeGroup.unique()

In [None]:
# Find the 5 high-income countries (population >= 1,000,000) with the lowest cumulative confirmed cases and deaths per 100K
my_top_n = 5

best_10_confirmed=my_final_df[(my_final_df['IncomeGroup'] == 'High income') & (my_final_df['Population'] >= 1000000)].loc[last_reported_date].sort_values(by = "Confirmed_100K", ascending = True).head(my_top_n).reset_index()['CountryCode'].to_list()
best_10_deaths=my_final_df[(my_final_df['IncomeGroup'] == 'High income') & (my_final_df['Population'] >= 1000000)].loc[last_reported_date].sort_values(by = "Deaths_100K", ascending = True).head(my_top_n).reset_index()['CountryCode'].to_list()

best_deaths_100K_df = my_final_df.loc[my_final_df['CountryCode'].isin(best_10_deaths)].reset_index()
best_confirmed_100K_df = my_final_df.loc[my_final_df['CountryCode'].isin(best_10_confirmed)].reset_index()



factor_size = 1.5
SMALL_SIZE = 8 * factor_size
MEDIUM_SIZE = 10 * factor_size
BIGGER_SIZE = 12 * factor_size

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=BIGGER_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

# Multi-panel plot: confirmed cases and deaths per 100K next to the stringency index of the same countries
fig1 = plt.figure(figsize=(22, 10) , dpi=300)


ax1 = fig1.add_subplot(221)
sb.lineplot(
    data=best_confirmed_100K_df, 
#     kind="line",
    x="Date", 
    y="Confirmed_100K",
    ax = ax1,
    linewidth = 4,
    hue="Country_Region")
plt.xticks(rotation=45) 
# One call only: a second plt.legend() would replace the first and discard frameon=False
plt.legend(frameon=False, bbox_to_anchor=(1.04, 1), borderaxespad=0)
#plt.title('My title')

ax2 = fig1.add_subplot(222)
sb.lineplot(
    data=best_confirmed_100K_df, 
#     kind="line",
    x="Date", 
    y="StringencyIndex",
    ax = ax2,
    linewidth = 4,
    hue="Country_Region")
plt.xticks(rotation=45) 
plt.legend(frameon=False, bbox_to_anchor=(1.04, 1), borderaxespad=0)

ax3 = fig1.add_subplot(223)
sb.lineplot(
    data=best_deaths_100K_df, 
#     kind="line",
    x="Date", 
    y="Deaths_100K",
    linewidth = 4,
    ax = ax3,
    hue="Country_Region")
plt.xticks(rotation=45)
plt.legend(frameon=False, bbox_to_anchor=(1.04, 1), borderaxespad=0)

ax4 = fig1.add_subplot(224)
sb.lineplot(
    data=best_deaths_100K_df, 
#     kind="line",
    x="Date", 
    y="StringencyIndex",
    ax = ax4,
    linewidth = 4,
    hue="Country_Region")
plt.xticks(rotation=45) 
plt.legend(frameon=False, bbox_to_anchor=(1.04, 1), borderaxespad=0)

fig1.tight_layout(pad=3)

plt.savefig('best_perfarmance.png')

# END