<h1 id="tocheading">Data analysis on Coronavirus</h1>

<h2>Team Members & Team Number</h2>

**Group 12**

- Abdelrhman Adel Zaher
- LUO Dan
- Maonan WANG
- Mohamed Yahya Jabokji

## Table of Contents

In [None]:
from IPython.display import Image
Image('/kaggle/input/pictures/snipaste_20200330_183005.png')

In [None]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

## Preparation

In [None]:
!pip install country_converter

In [None]:
import numpy as np
import pandas as pd
# for Visualization
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
# country name and country code convert
import country_converter as coco
import functools
# prediction
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn import preprocessing
# for copy
import copy

In [None]:
latest_date = '5/11/20'
diff_date = '3/21/20'

## Import Data and Data Preprocess

- combine the regions of each country; we analyze only at the country level.
- delete "Cruise Ship" data
- add active cases: confirmed - deaths - recovered
- add new variables, combine with other datasets
    - add "ISO3"
    - add continent
    - add GDP, Pop, and Aging

### Import data

In [None]:
confirm = pd.read_csv('/kaggle/input/coronavirus-analysis/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
death = pd.read_csv('/kaggle/input/coronavirus-analysis/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recovered = pd.read_csv('/kaggle/input/coronavirus-analysis/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')

In [None]:
# show data
confirm.head(5)

In [None]:
# preprocess data: aggregate the statistics to the country level
dict_groupby = {i:'sum' for i in confirm.columns.values[4:]}
dict_groupby['Lat'] = 'mean'
dict_groupby['Long'] = 'mean'

In [None]:
# confirmed cases
confirmCountry = confirm.groupby('Country/Region').agg(dict_groupby)
# confirmCountry.drop('Cruise Ship', inplace=True)
confirmCountry.head()
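
The aggregation above can be illustrated on a toy frame. This is a minimal sketch with made-up numbers (the region names, coordinates, and counts are invented for illustration); it only shows how a per-column aggregation dictionary sums the date columns while averaging the coordinates.

```python
import pandas as pd

# Toy stand-in for the time series table: two regions of country X plus country Y.
# All values are made up for illustration.
toy = pd.DataFrame({
    'Province/State': ['A', 'B', None],
    'Country/Region': ['X', 'X', 'Y'],
    'Lat': [10.0, 20.0, 5.0],
    'Long': [100.0, 110.0, 50.0],
    '1/22/20': [1, 2, 0],
    '1/23/20': [3, 4, 1],
})

# Sum every date column, but average the coordinates.
agg_spec = {col: 'sum' for col in toy.columns[4:]}
agg_spec['Lat'] = 'mean'
agg_spec['Long'] = 'mean'

toy_country = toy.groupby('Country/Region').agg(agg_spec)
print(toy_country)
```

Note that the resulting column order follows the dictionary, so the date columns come first and `Lat`/`Long` come last.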

In [None]:
deathCountry = death.groupby('Country/Region').agg(dict_groupby)
# deathCountry.drop('Cruise Ship', inplace=True)
deathCountry.head()

In [None]:
recoveredCountry = recovered.groupby('Country/Region').agg(dict_groupby)
# recoveredCountry.drop('Cruise Ship',inplace=True)
recoveredCountry.head()

In [None]:
# add active cases: confirmed - deaths - recovered
activeCountry = (confirmCountry - deathCountry - recoveredCountry)
# the subtraction also hits Lat/Long (lat - lat - lat = -lat), so restore the coordinates from confirmCountry
activeCountry['Lat'] = confirmCountry['Lat']
activeCountry['Long'] = confirmCountry['Long']
activeCountry.head()
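
One way to avoid touching the coordinate columns at all is to subtract only the date columns. The sketch below uses made-up toy frames and assumes, as in this notebook, that the three tables share the same index and the same date columns.

```python
import pandas as pd

dates = ['1/22/20', '1/23/20']
idx = pd.Index(['X', 'Y'], name='Country/Region')

# Made-up numbers for illustration only.
confirmed_toy = pd.DataFrame({'1/22/20': [10, 5], '1/23/20': [20, 8],
                              'Lat': [15.0, 5.0], 'Long': [105.0, 50.0]}, index=idx)
deaths_toy = pd.DataFrame({'1/22/20': [1, 0], '1/23/20': [2, 1],
                           'Lat': [15.0, 5.0], 'Long': [105.0, 50.0]}, index=idx)
recovered_toy = pd.DataFrame({'1/22/20': [2, 1], '1/23/20': [5, 2],
                              'Lat': [15.0, 5.0], 'Long': [105.0, 50.0]}, index=idx)

# Subtract only the date columns, then copy the coordinates over unchanged.
active_toy = confirmed_toy[dates] - deaths_toy[dates] - recovered_toy[dates]
active_toy['Lat'] = confirmed_toy['Lat']
active_toy['Long'] = confirmed_toy['Long']
print(active_toy)
```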

### Add New Variables

- add "ISO3"
- add continent
- add GDP, Pop, and Aging

In [None]:
# add continent
continent_name = coco.convert(names = list(confirmCountry.index.values), to='continent')
continent_name[:5]

In [None]:
confirmCountry['continent'] = continent_name
deathCountry['continent'] = continent_name
recoveredCountry['continent'] = continent_name
activeCountry['continent'] = continent_name

In [None]:
confirmCountry.head()

In [None]:
# add country name
country_name = confirmCountry.index.values

In [None]:
confirmCountry['Country/Region'] = country_name
deathCountry['Country/Region'] = country_name
recoveredCountry['Country/Region'] = country_name
activeCountry['Country/Region'] = country_name

In [None]:
# add country code
confirmCountry['ISO3'] = confirmCountry.apply(lambda x : coco.convert(x['Country/Region'], to='ISO3', not_found=None), axis=1)
deathCountry['ISO3'] = deathCountry.apply(lambda x : coco.convert(x['Country/Region'], to='ISO3', not_found=None), axis=1)
recoveredCountry['ISO3'] = recoveredCountry.apply(lambda x : coco.convert(x['Country/Region'], to='ISO3', not_found=None), axis=1)
activeCountry['ISO3'] = activeCountry.apply(lambda x : coco.convert(x['Country/Region'], to='ISO3', not_found=None), axis=1)

In [None]:
# check the head data in confirm table
confirmCountry.head()

### Combine with other datasets

- combine with GDP dataset, https://data.worldbank.org/indicator/NY.GDP.PCAP.CD?locations=MC&name_desc=false
- combine with population dataset, https://data.worldbank.org/indicator/sp.pop.totl?end=2018&start=2018
- combine with aging dataset, https://data.worldbank.org/indicator/SP.POP.65UP.TO.ZS?end=2018&locations=LK&most_recent_value_desc=true&start=2018&view=map&year=2018

#### Combine with GDP dataset

In [None]:
# combine with GDP
GDP_data = pd.read_csv('/kaggle/input/coronavirus-analysis/GDP/GDP.csv')
GDP_data.head(3)

In [None]:
# create new table
GDP_2018 = GDP_data.loc[:,['Country Code', '2018']]
GDP_2018.rename(columns={'Country Code':'ISO3', '2018':'GDP_2018'}, inplace=True)
GDP_2018.dropna(subset=['GDP_2018'], inplace=True)
GDP_2018.head()

In [None]:
confirmCountry = pd.merge(confirmCountry, GDP_2018, on=['ISO3'])
deathCountry = pd.merge(deathCountry, GDP_2018, on=['ISO3'])
recoveredCountry = pd.merge(recoveredCountry, GDP_2018, on=['ISO3'])
activeCountry = pd.merge(activeCountry, GDP_2018, on=['ISO3'])

In [None]:
# ensure the shape of every table
(confirmCountry.shape, deathCountry.shape, recoveredCountry.shape, activeCountry.shape)
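
Because `pd.merge` defaults to an inner join, any country whose ISO3 code is missing from the indicator table silently disappears, which is why the shapes are checked above. A minimal sketch of how the dropped rows could be listed, using made-up ISO3 codes and values:

```python
import pandas as pd

# Toy country table and toy indicator table; codes and values are made up.
cases_toy = pd.DataFrame({'ISO3': ['AAA', 'BBB', 'CCC'], 'cases': [10, 20, 30]})
gdp_toy = pd.DataFrame({'ISO3': ['AAA', 'CCC', 'DDD'], 'GDP_2018': [1000.0, 3000.0, 4000.0]})

# The default inner merge keeps only codes present in both tables.
inner = pd.merge(cases_toy, gdp_toy, on=['ISO3'])

# A left merge with indicator=True shows which rows the inner merge would lose.
checked = pd.merge(cases_toy, gdp_toy, on=['ISO3'], how='left', indicator=True)
dropped = checked.loc[checked['_merge'] == 'left_only', 'ISO3'].tolist()

print(len(cases_toy), '->', len(inner), 'rows; dropped:', dropped)
```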

#### Combine with Population dataset

In [None]:
Pop_data = pd.read_csv('/kaggle/input/coronavirus-analysis/Pop/Pop.csv')
Pop_data.head(3)

In [None]:
Pop_2018 = Pop_data.loc[:,['Country Code', '2018']]
Pop_2018.rename(columns={'Country Code':'ISO3', '2018':'Pop_2018'}, inplace=True)
Pop_2018.dropna(subset=['Pop_2018'], inplace=True)
Pop_2018.head()

In [None]:
confirmCountry = pd.merge(confirmCountry, Pop_2018, on=['ISO3'])
deathCountry = pd.merge(deathCountry, Pop_2018, on=['ISO3'])
recoveredCountry = pd.merge(recoveredCountry, Pop_2018, on=['ISO3'])
activeCountry = pd.merge(activeCountry, Pop_2018, on=['ISO3'])

In [None]:
(confirmCountry.shape, deathCountry.shape, recoveredCountry.shape, activeCountry.shape)

#### Combine with Aging dataset

In [None]:
Aging_data = pd.read_csv('/kaggle/input/coronavirus-analysis/Aging/Aging.csv')
Aging_data.head(3)

In [None]:
Aging_2018 = Aging_data.loc[:,['Country Code', '2018']]
Aging_2018.rename(columns={'Country Code':'ISO3', '2018':'Aging_2018'}, inplace=True)
Aging_2018.dropna(subset=['Aging_2018'], inplace=True)
Aging_2018.head()

In [None]:
confirmCountry = pd.merge(confirmCountry, Aging_2018, on=['ISO3'])
deathCountry = pd.merge(deathCountry, Aging_2018, on=['ISO3'])
recoveredCountry = pd.merge(recoveredCountry,Aging_2018, on=['ISO3'])
activeCountry = pd.merge(activeCountry, Aging_2018, on=['ISO3'])

In [None]:
(confirmCountry.shape, deathCountry.shape, recoveredCountry.shape, activeCountry.shape)

In [None]:
activeCountry.head()

In [None]:
# Check the situation in a certain country
confirmCountry[confirmCountry['Country/Region']=='China']

## Overview of the Coronavirus

### The Global Trend of the Cases 

- The following chart shows three time periods:
- 1/22 to 2/12: confirmed and active cases increase rapidly.
- 2/12 to 3/12: confirmed cases grow slowly, and active cases decrease.
- 3/12 onward: confirmed and active cases increase rapidly again.

In [None]:
matplotlib.style.use('default')
# Start
fig, ax1 = plt.subplots()
fig.set_size_inches(16, 8)
plt.set_cmap('RdBu')
# plt.xkcd()

# multiple line plot
pos = np.where(confirmCountry.columns.values==latest_date)[0][0]+1 # index just past the last date we want to plot
x = confirmCountry.columns.values[:pos] 
lw = 4 
a, = ax1.plot(x, confirmCountry.sum().values[:pos], linewidth=lw, label='Confirmed Cases', marker='o') # confirm
b, = ax1.plot(x, activeCountry.sum().values[:pos], linewidth=lw, label='Active Cases', marker='o') # active
plt.legend(handles = [a,b], fontsize=15)

# add Vertical line
ax1.plot(['2/12/20', '2/12/20'], [0, 10000000], lw=3, linestyle='--', alpha=0.7)
ax1.plot(['3/12/20', '3/12/20'], [0, 10000000], lw=3, linestyle='--', alpha=0.7)
ax1.plot([latest_date, latest_date], [0, 10000000], lw=3, linestyle='--', alpha=0.7)


ax1.yaxis.set_tick_params(labelsize=15) 
# ax1.set_xticks(x) 
xticks = [i if i in ['1/22/20' ,'2/12/20', '3/12/20', latest_date] else '' for i in x] # label only a few key dates on the x-axis
ax1.set_xticklabels(xticks, rotation=0, fontsize=15) # apply the x-axis tick labels



ax1.set_ylabel("Number of Cases (log)", fontsize='x-large')
ax1.set_xlabel('Date',  fontsize='x-large')

ax1.set_title('Worldwide Corona Virus Cases - Confirmed, Active (Line Chart)', fontsize='x-large')

plt.yscale('log')
# plt.ylabel('logy')

plt.show()
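
The chart above labels only a few dates by blanking out the other tick labels. An alternative is to compute the tick positions explicitly. The helper below, `sparse_ticks`, is a hypothetical name introduced for this sketch and is not part of matplotlib.

```python
def sparse_ticks(dates, keep):
    """Return (positions, labels) for the dates that should be labeled."""
    keep = set(keep)
    positions = [i for i, d in enumerate(dates) if d in keep]
    labels = [dates[i] for i in positions]
    return positions, labels

# Toy date axis for illustration.
toy_dates = ['1/22/20', '1/23/20', '1/24/20', '1/25/20']
positions, labels = sparse_ticks(toy_dates, ['1/22/20', '1/25/20'])
print(positions, labels)
```

When the x values are strings, matplotlib places the categories at integer positions in order of appearance, so calling `ax1.set_xticks(positions)` followed by `ax1.set_xticklabels(labels)` should label only the selected dates.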

### The trends in each country

- From 2/12, cases in China rose rapidly and peaked within a week; the number of cases then began to decline.
- From 3/12, the number of cases in Europe and America started to rise rapidly.

First, we plot the confirmed data.

In [None]:
plotCountry = confirmCountry[confirmCountry[latest_date]>200000]


matplotlib.style.use('default')
fig, ax1 = plt.subplots()
fig.set_size_inches(16, 8)
plt.set_cmap('RdBu')
# plt.xkcd()

# multiple line plot
x = confirmCountry.columns.values[:-8]
lw = 1
for name in plotCountry.index.values:
    ax1.plot(x, plotCountry.loc[name].values[:-8], linewidth=lw, label=name, marker='o', markersize=2.5) # plot confirmed data in different countries

plt.legend(plotCountry.loc[:,'Country/Region'])


ax1.plot(['2/12/20', '2/12/20'], [0, 2000000], lw=3, linestyle='--', alpha=0.7)
ax1.plot(['3/12/20', '3/12/20'], [0, 2000000], lw=3, linestyle='--', alpha=0.7)
ax1.plot([latest_date, latest_date], [0, 2000000], lw=3, linestyle='--', alpha=0.7)

ax1.yaxis.set_tick_params(labelsize=15)
xticks = [i if i in ['1/22/20', '2/12/20', '3/12/20', latest_date] else '' for i in x] # show tick labels only for specific dates
ax1.set_xticklabels(xticks, rotation=0, fontsize=15) 


ax1.set_ylabel("Number of Cases (log)", fontsize='x-large')
ax1.set_xlabel('Date',  fontsize='x-large')

# set y scale
plt.yscale('log')

ax1.set_title('Confirmed Corona Virus Cases in Each Country - (Line Chart)', fontsize='x-large')
plt.show()

Then we plot the active data.

In [None]:
# plot the same countries, plus India, Turkey, and China
countryList = plotCountry['Country/Region'].values
countryList = np.append(countryList, np.array(['India', 'Turkey', 'China']))
plotCountry = activeCountry[activeCountry['Country/Region'].isin(countryList)]


matplotlib.style.use('default')
fig, ax1 = plt.subplots()
fig.set_size_inches(16, 8)
plt.set_cmap('RdBu')
# plt.xkcd()

# multiple line plot
x = confirmCountry.columns.values[:-8]
lw = 2 
for name in plotCountry.index.values:
    ax1.plot(x, plotCountry.loc[name].values[:-8], linewidth=lw, label=name, marker='o', markersize=2.5) # active
    
plt.legend(plotCountry.loc[:,'Country/Region'])

plt.text(x[-26], 1.05*(plotCountry[plotCountry['Country/Region']=='China'].values[0,-8-26]), 'China', fontsize=13)
plt.text(x[-10], 0.56*plotCountry[plotCountry['Country/Region']=='US'].values[0,-8-10], 'US', fontsize=13)
plt.text(x[-20], 0.56*plotCountry[plotCountry['Country/Region']=='India'].values[0,-8-20], 'India', fontsize=13)
plt.text(x[-70], 0.56*plotCountry[plotCountry['Country/Region']=='Italy'].values[0,-8-70], 'Italy', fontsize=13)


ax1.plot(['2/12/20', '2/12/20'], [0, 1000000], lw=3, linestyle='--', alpha=0.7)
ax1.plot(['3/12/20', '3/12/20'], [0, 1000000], lw=3, linestyle='--', alpha=0.7)
ax1.plot([latest_date, latest_date], [0, 1000000], lw=3, linestyle='--', alpha=0.7)

ax1.yaxis.set_tick_params(labelsize=15) 
# ax1.set_xticks(x)
xticks = [i if i in ['1/22/20', '2/12/20', '3/12/20', latest_date] else '' for i in x] 
ax1.set_xticklabels(xticks, rotation=0, fontsize=15) 

ax1.set_ylabel("Number of Cases (log)", fontsize='x-large')
ax1.set_xlabel('Date',  fontsize='x-large')

ax1.set_title('Active Corona Virus Cases in Each Country - (Line Chart)', fontsize='x-large')

# set y scale
plt.yscale('log')

plt.show()

### Analysis of each continent

In [None]:
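# sum only the numeric columns so that text columns (country name, ISO3) are left out
plotCountry = confirmCountry.groupby('continent').sum(numeric_only=True)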
plotCountry.drop(['Lat', 'Long', 'GDP_2018', 'Pop_2018', 'Aging_2018'], axis=1, inplace=True)
plotCountry['continent'] = plotCountry.index.values
plotCountry

In [None]:
matplotlib.style.use('default')
fig, ax1 = plt.subplots()
fig.set_size_inches(16, 8)
plt.set_cmap('RdBu')


x = plotCountry.columns.values[:-1] # date columns (everything except 'continent')
lw = 2 
for name in plotCountry.index.values:
    ax1.plot(x, plotCountry.loc[name].values[:-1], linewidth=lw, label=name, marker='o', markersize=2.5) # confirm
    
plt.legend(plotCountry.loc[:,'continent'])


ax1.plot(['2/12/20', '2/12/20'], [0, 1000000], lw=3, linestyle='--', alpha=0.7)
ax1.plot(['3/12/20', '3/12/20'], [0, 1000000], lw=3, linestyle='--', alpha=0.7)
ax1.plot([latest_date, latest_date], [0, 1000000], lw=3, linestyle='--', alpha=0.7)

ax1.yaxis.set_tick_params(labelsize=15) 
# ax1.set_xticks(x)
xticks = [i if i in ['1/22/20', '2/12/20', '3/12/20', latest_date] else '' for i in x] 
ax1.set_xticklabels(xticks, rotation=0, fontsize=15) 

ax1.set_ylabel("Number of Cases (log)", fontsize='x-large')
ax1.set_xlabel('Date',  fontsize='x-large')

ax1.set_title('Confirmed Corona Virus Cases in Each Continent - (Line Chart)', fontsize='x-large')

# set y scale
plt.yscale('log')

plt.show()

In [None]:
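# sum only the numeric columns so that text columns (country name, ISO3) are left out
plotCountry = activeCountry.groupby('continent').sum(numeric_only=True)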
plotCountry.drop(['Lat', 'Long', 'GDP_2018', 'Pop_2018', 'Aging_2018'], axis=1, inplace=True)
plotCountry['continent'] = plotCountry.index.values

matplotlib.style.use('default')
fig, ax1 = plt.subplots()
fig.set_size_inches(16, 8)
plt.set_cmap('RdBu')


x = plotCountry.columns.values[:-1] # date columns (everything except 'continent')
lw = 2 
for name in plotCountry.index.values:
    ax1.plot(x, plotCountry.loc[name].values[:-1], linewidth=lw, label=name, marker='o', markersize=2.5) # active
    
plt.legend(plotCountry.loc[:,'continent'])


ax1.plot(['2/12/20', '2/12/20'], [0, 1000000], lw=3, linestyle='--', alpha=0.7)
ax1.plot(['3/12/20', '3/12/20'], [0, 1000000], lw=3, linestyle='--', alpha=0.7)
ax1.plot([latest_date, latest_date], [0, 1000000], lw=3, linestyle='--', alpha=0.7)

ax1.yaxis.set_tick_params(labelsize=15) 
# ax1.set_xticks(x)
xticks = [i if i in ['1/22/20', '2/12/20', '3/12/20', latest_date] else '' for i in x] 
ax1.set_xticklabels(xticks, rotation=0, fontsize=15) 

ax1.set_ylabel("Number of Cases (log)", fontsize='x-large')
ax1.set_xlabel('Date',  fontsize='x-large')

ax1.set_title('Active Corona Virus Cases in Each Continent - (Line Chart)', fontsize='x-large')

# set y scale
plt.yscale('log')

plt.show()

### Analyze the situation on every continent

- First, we analyze the population on each continent 
- Then we analyze the confirmed and active cases on each continent
- Finally, we analyze the changing trends of all continents

**We have the following findings:**

- Europe's population is relatively small, yet Europe currently has the largest number of confirmed and active cases.
- Very few cases have been found in Africa.

In [None]:
# first plot the worldwide population
fig = px.sunburst(activeCountry, path=['continent', 'Country/Region'], values='Pop_2018',
                  color='continent',
                  color_continuous_scale='OrRd')

fig.update_layout(title='Worldwide Population Analysis',
                  font=dict(family="Courier New, monospace",
                            size=13)
                 )

fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})

fig.show()

In [None]:
# then plot the active cases worldwide
fig = px.sunburst(activeCountry, path=['continent', 'Country/Region'], values=latest_date,
                  color='continent',
                  color_continuous_scale='OrRd')

fig.update_layout(title='Worldwide Corona Virus Cases in Each Country and Continent - Active Cases',
                  font=dict(family="Courier New, monospace",
                            size=13)
                 )

fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})

fig.show()

In [None]:
# plot the confirmed cases worldwide (latest data)
fig = px.sunburst(confirmCountry, path=['continent', 'Country/Region'], values=latest_date,
                  color='continent',
                  color_continuous_scale='OrRd')

fig.update_layout(title='Worldwide Corona Virus Cases in Each Country and Continent - Confirmed Cases',
                  font=dict(family="Courier New, monospace",
                            size=13)
                 )

fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})

fig.show()

### Changes in every continent every day

- We reshape the dataset into long format, with one row per country and date.
- We add two new variables (used in the next part):
    - Death Rate = death/confirmed
    - Sick Rate = confirmed/population

In [None]:
# 1. confirmed cases
animationCountry = confirmCountry.copy()
animationCountry.reset_index(drop=True, inplace=True)
pd1 = pd.melt(animationCountry, id_vars=['Lat', 'Long', 'continent', 'Country/Region', 'ISO3', 'GDP_2018', 'Pop_2018', 'Aging_2018'], var_name = 'date', value_name='Confirmed')

# 2. death cases
animationCountry = deathCountry.copy()
animationCountry.reset_index(drop=True, inplace=True)
pd2 = pd.melt(animationCountry, id_vars=['Lat', 'Long', 'continent', 'Country/Region', 'ISO3', 'GDP_2018', 'Pop_2018', 'Aging_2018'], var_name = 'date', value_name='Death')

# 3. recovered cases
animationCountry = recoveredCountry.copy()
animationCountry.reset_index(drop=True, inplace=True)
pd3 = pd.melt(animationCountry, id_vars=['Lat', 'Long', 'continent', 'Country/Region', 'ISO3', 'GDP_2018', 'Pop_2018', 'Aging_2018'], var_name = 'date', value_name='Recovered')

# 4. active cases
animationCountry = activeCountry.copy()
animationCountry.reset_index(drop=True, inplace=True)
pd4 = pd.melt(animationCountry, id_vars=['Lat', 'Long', 'continent', 'Country/Region', 'ISO3', 'GDP_2018', 'Pop_2018', 'Aging_2018'], var_name = 'date', value_name='Active')

In [None]:
# produce the merged dataset
data_frames = [pd1, pd2, pd3, pd4]
df_merged = functools.reduce(lambda left, right: pd.merge(left, right,on=['Lat', 'Long', 'continent', 'Country/Region', 'ISO3', 'GDP_2018', 'Pop_2018', 'Aging_2018', 'date']), data_frames)
df_merged.head()
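
The reshaping above can be seen more easily on a toy example. This sketch uses made-up values and only two measures; it shows how `pd.melt` turns date columns into rows and how merging on the shared keys lines the measures up side by side.

```python
import functools
import pandas as pd

# Wide toy tables: one row per country, one column per date (made-up values).
wide_confirmed = pd.DataFrame({'Country/Region': ['X', 'Y'], '1/22/20': [10, 5], '1/23/20': [20, 8]})
wide_deaths = pd.DataFrame({'Country/Region': ['X', 'Y'], '1/22/20': [1, 0], '1/23/20': [2, 1]})

# melt turns the date columns into rows: one row per (country, date).
long_confirmed = pd.melt(wide_confirmed, id_vars=['Country/Region'],
                         var_name='date', value_name='Confirmed')
long_deaths = pd.melt(wide_deaths, id_vars=['Country/Region'],
                      var_name='date', value_name='Death')

# Merging on the shared keys puts the measures in separate columns of one table.
long_all = functools.reduce(
    lambda left, right: pd.merge(left, right, on=['Country/Region', 'date']),
    [long_confirmed, long_deaths],
)
print(long_all)
```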

In [None]:
# add two new variables
df_merged['deathRate'] = df_merged['Death']/df_merged['Confirmed']
df_merged['SickRate'] = df_merged['Confirmed']/df_merged['Pop_2018']
df_merged.head(10)
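
The death rate is undefined for rows with zero confirmed cases. In pandas, 0/0 gives NaN while a positive number divided by zero gives inf. The sketch below, on made-up numbers, masks zero denominators so that both situations end up as NaN and can be removed later with `dropna`.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only.
toy_rates = pd.DataFrame({
    'Confirmed': [100, 0, 50],
    'Death': [5, 0, 0],
    'Pop_2018': [1000, 2000, 500],
})

# Replace a zero denominator with NaN so the ratio becomes NaN instead of inf.
toy_rates['deathRate'] = toy_rates['Death'] / toy_rates['Confirmed'].replace(0, np.nan)
toy_rates['SickRate'] = toy_rates['Confirmed'] / toy_rates['Pop_2018']
print(toy_rates)
```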

In [None]:
fig = px.bar(df_merged, x="continent", y="Active", color="continent",
             animation_frame="date", animation_group="Country/Region", range_y=[0,2000000])


fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})

fig.update_layout(title='Worldwide Corona Virus Cases in Each Continent - Active Cases',
                  font=dict(family="Courier New, monospace",
                            size=13)
                 )

fig.show()

### Show the trend of active cases in the Map

In [None]:
fig = px.scatter_geo(df_merged[df_merged['date']==latest_date],
                    locations = 'ISO3',
                    size='Active', size_max = 55, color="continent")

fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})

fig.update_layout(title='Worldwide Corona Virus Cases on {} - Active'.format(latest_date),
                  font=dict(family="Courier New, monospace",size=13)
                 )
                  
fig.show()

In [None]:
fig = px.scatter_geo(df_merged,
                    locations = 'ISO3',
                    size='Active', size_max = 55,
                    animation_frame="date", animation_group='Country/Region',color="continent")

fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})

fig.update_layout(title='Worldwide Corona Virus Cases Time Lapse - Active',
                  font=dict(family="Courier New, monospace",size=13)
                 )
                  
fig.show()

<h2>Comprehensive analysis with other datasets</h2>

In this part, we look for relationships between coronavirus and aging or GDP.

First, we plot the following chart:

- Each circle represents one country.
- The x-axis represents the "GDP per capita (current US$)"
- The y-axis represents the "Population ages 65 and above (% of total population)"
- The size of the circle represents the number of confirmed cases.
- The color represents the continent.

**The circles in the upper right corner are relatively large, which suggests that the virus has a higher incidence in countries with higher GDP and in countries with older populations.**

Based on the charts, countries with higher GDP and older populations, mainly in Europe and the USA, have more confirmed cases than other countries.

In [None]:
confirmCountry.head()

In [None]:
# Correlations for Dataset
correlationM = df_merged[df_merged['date']==latest_date][['GDP_2018','Pop_2018','Aging_2018','Confirmed', 'Death', 'Recovered', 'Active', 'deathRate', 'SickRate']]
correlationM.reset_index(drop=True, inplace=True)
correlationM.head()

In [None]:
# plot the correlation matrix
plt.figure(figsize = (10,10))
ax = sns.heatmap(correlationM.corr(), annot=True, fmt='.1f', cmap="BuPu") # fmt sets the number of decimals shown
# set the font size of the y-axis labels
plt.yticks(rotation=0) # rotate the y-axis labels
# ax.yaxis.set_tick_params(labelsize=15)
plt.title('Correlations for Dataset', fontsize='xx-large')

We set the circle size to the number of confirmed cases and plot the following chart:

- Countries with higher GDP and older populations again have more cases.

In [None]:
px.scatter(confirmCountry, x='GDP_2018', y='Aging_2018', 
           color='continent', size=latest_date, size_max=60, 
           hover_name="Country/Region", log_x=True)

Finally, let's look at the relationship between prevalence and aging. We plot the following chart:

- Each circle represents one country.
- The x-axis represents the "Sick Rate"
    - Sick Rate = confirmed/population
- The y-axis represents the "Population ages 65 and above (% of total population)"
- The size of the circle represents the number of confirmed cases.
- The color represents the continent.

**We find that:**

- As the share of elderly population increases, so does the sick rate.
- Most African countries (green) have a younger population, and their prevalence is lower.
- European countries (red) have an older population, and their prevalence is higher.

In [None]:
# x: sick rate (log scale), y: share of population aged 65+, size: confirmed cases
px.scatter(df_merged[df_merged['date']==latest_date].dropna(subset=['deathRate','SickRate']), 
           x='SickRate', y='Aging_2018', 
           color='continent', size='Confirmed', size_max=60, 
           hover_name="Country/Region", log_x=True)

**Then we fit a linear regression of log(Sick Rate) on log(GDP per capita) and plot the fitted line**

In [None]:
# keep the latest snapshot, dropping countries where either rate is undefined
latest = df_merged[df_merged['date']==latest_date].dropna(subset=['deathRate','SickRate'])
x = latest['GDP_2018'].values.reshape(-1,1)
y = latest['SickRate'].values
weight = latest['Pop_2018'].values # population, kept for reference; not used in the fit below

# fit log(SickRate) = a + b*log(GDP); this assumes every SickRate is strictly positive
reg = LinearRegression().fit(np.log(x), np.log(y))
y_pre = reg.predict(np.log(x))
# get prediction, transformed back to the original scale
pred_dataframe = pd.DataFrame(x, columns=['x'])
pred_dataframe['y_pred'] = np.exp(y_pre)

In [None]:
r2_score(y_true=np.log(y), y_pred=y_pre)
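
A straight line in log-log space corresponds to a power law on the original scale: if log(y) = a + b·log(x), then y = exp(a)·x^b. The sketch below illustrates this on synthetic data that follows an exact power law (not real COVID data). It uses `np.polyfit` as a stand-in for `LinearRegression`; for a single feature both compute the same least-squares line.

```python
import numpy as np

# Synthetic data that follows y = 2 * x**0.5 exactly (not real data).
x_toy = np.array([1.0, 4.0, 9.0, 16.0, 25.0])
y_toy = 2.0 * np.sqrt(x_toy)

# Fit log(y) = intercept + slope * log(x); for degree 1, polyfit returns [slope, intercept].
slope, intercept = np.polyfit(np.log(x_toy), np.log(y_toy), 1)

# Back on the original scale the model is y = exp(intercept) * x**slope.
scale = np.exp(intercept)
y_fit = scale * x_toy ** slope
print(round(slope, 3), round(scale, 3))
```

Read this way, the fitted slope is an elasticity: a 1% increase in x is associated with roughly a `slope`% change in y.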

In [None]:
# scatter of the latest snapshot; the fitted line is added as an extra trace below
fig = px.scatter(df_merged[df_merged['date']==latest_date].dropna(subset=['deathRate','SickRate']), 
           x='GDP_2018', y='SickRate', 
           color='continent', size='Pop_2018', size_max=60, 
           hover_name="Country/Region", log_x=True, log_y=True)

fig.add_trace(go.Scatter(x=pred_dataframe['x'], 
                         y=pred_dataframe['y_pred'],
                         mode='lines',
                         marker_color='rgba(152, 0, 0, .8)',
                         name='Regression'))

fig.update_layout(xaxis_type="log")

fig.show()

<h2>Make the predictions</h2>

- Here, we use regression to predict the number of active cases in the US.
- Many factors affect the spread of the virus; we consider only our existing dataset and use a very simple regression to make predictions.
- The predictions may therefore not be very accurate.

In [None]:
US_Active = df_merged[df_merged['Country/Region']=='US'][['date','Active']]
US_Active.rename(columns={'date':'date', 'Active':'Active Cases'}, inplace=True)
US_Active.reset_index(drop=True, inplace=True)
US_Active.head(-10)

In [None]:
# day index 0..N-1 for the observed dates (1/22/20 through latest_date)
days_observed = np.arange(len(US_Active)).reshape(-1, 1)
# the same index extended by 14 days for the forecast
days_with_forecast = np.arange(len(US_Active)+14).reshape(-1, 1)

In [None]:
# SVR with a cubic polynomial kernel, fitted on the U.S. active cases
svm_confirmed = SVR(shrinking=True, kernel='poly',gamma=0.1, degree=3, C=0.1)
svm_confirmed.fit(days_observed, US_Active['Active Cases'].values)
svm_test_pred = svm_confirmed.predict(days_with_forecast)
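
The model above is fitted on every observed day, so nothing is left to check how well it extrapolates. One common way to gauge this is to hold out the most recent days, fit on the rest, and compare. The sketch below does so on a synthetic series (made-up values) and uses an ordinary cubic polynomial fit via `np.polyfit` as a simple stand-in; it is not the SVR used above.

```python
import numpy as np

# Synthetic series (made up): a noiseless cubic trend over 60 days.
days = np.arange(60, dtype=float)
cases = 0.5 * days ** 3 + 10.0 * days

horizon = 14  # number of most recent days held out for checking the forecast
train_days, test_days = days[:-horizon], days[-horizon:]
train_cases, test_cases = cases[:-horizon], cases[-horizon:]

# Stand-in model: ordinary least-squares cubic polynomial.
coeffs = np.polyfit(train_days, train_cases, 3)
forecast = np.polyval(coeffs, test_days)

# Mean absolute percentage error on the held-out days.
mape = np.mean(np.abs(forecast - test_cases) / test_cases)
print('held-out MAPE: {:.4%}'.format(mape))
```

On real data the held-out error is usually far from zero, and it gives a rough idea of how much trust a 14-day extrapolation deserves.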

In [None]:
# observed U.S. active cases (the same series the SVR above was fitted on)
us_active = US_Active['Active Cases'].values

matplotlib.style.use('default')

fig, ax1 = plt.subplots()
fig.set_size_inches(16, 8)
plt.set_cmap('RdBu')


# multiple line plot
x = confirmCountry.columns.values[:-8] # get date
x_pred = copy.deepcopy(x)
x_pred = np.concatenate([x_pred, (['5/{}/20'.format(date) for date in range(12, 26)])]) # the dates for prediction
lw = 2 

# plot the observed U.S. active cases
ax1.plot(x, us_active, linewidth=lw, label='US', marker='o', markersize=2)

# plot predict
# ax1.plot(x_pred, np.exp(svm_test_pred), linewidth=lw, linestyle='dashed', label='Prediction', marker='o')
ax1.plot(x_pred, svm_test_pred, linewidth=lw, linestyle='dashed', label='Prediction', marker='o', markersize=2)

plt.legend(['US','Prediction'], fontsize=15)

# add vertical lines
ax1.plot(['2/12/20', '2/12/20'], [0, 1600000], lw=3, linestyle='--', alpha=0.7)
ax1.plot(['3/12/20', '3/12/20'], [0, 1600000], lw=3, linestyle='--', alpha=0.7)
ax1.plot([latest_date, latest_date], [0, 1600000], lw=3, linestyle='--', alpha=0.7)
ax1.plot([x_pred[-1], x_pred[-1]], [0, 1600000], lw=3, linestyle='--', alpha=0.7)


ax1.yaxis.set_tick_params(labelsize=15) 
# ax1.set_xticks(x) 
xticks = [i if i in ['2/12/20', '3/12/20', latest_date, x_pred[-1]] else '' for i in x_pred] 
ax1.set_xticklabels(xticks, rotation=0, fontsize=20) 

ax1.set_ylabel("Number of Cases", fontsize='xx-large')
ax1.set_xlabel('Date',  fontsize='xx-large')

ax1.set_title('U.S. Active Cases Predictions - (Line Chart)', fontsize='xx-large')
plt.show()
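
The prediction dates above are hard-coded as `5/12/20` through `5/25/20`, which only works while the forecast stays within May. A minimal sketch of generating the labels from the last observed date instead; `jhu_label` and `future_labels` are hypothetical helper names, and the code assumes two-digit years refer to 20xx.

```python
from datetime import date, timedelta

def jhu_label(d):
    """Format a date like the column headers used above, e.g. 5/11/20 (no zero padding)."""
    return '{}/{}/{:02d}'.format(d.month, d.day, d.year % 100)

def future_labels(last_label, n_days):
    """Return n_days labels following last_label (given as m/d/yy)."""
    month, day, year = (int(part) for part in last_label.split('/'))
    start = date(2000 + year, month, day)  # assumption: years are 20xx
    return [jhu_label(start + timedelta(days=k)) for k in range(1, n_days + 1)]

labels = future_labels('5/11/20', 14)
print(labels[0], labels[-1])
```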

<h2>What we should do now</h2>

### Learn from China

The Chinese government locked down the city on 1/22. Within 2 weeks, the country was starting to get back to work. Within about 5 weeks the outbreak was completely under control, and within 7 weeks new diagnoses were just a trickle.

- Turn some hospitals into specialized Coronavirus hospitals and bring all the cases to those hospitals.
- Work and study from home if possible.
- Ban transportation for at least 2 weeks.


### Learn from South Korea

We can also learn from South Korea. South Korea was the second country after China to experience a large-scale outbreak, but the situation there is now well controlled overall. Their approach is very simple:

- efficient testing
- efficient tracing
- travel bans
- efficient isolating
- efficient quarantining
