# The Analysis of Coronavirus (COVID-19) Comparison Between Top 22 Countries and Taiwan

# 1. Introduction

Why should we not blame China for hiding coronavirus (COVID-19) information from the world in the first place? 

## 1.1 Project Description

Since the novel coronavirus (COVID-19) has spread all over the world, I want to find out how new confirmed cases, new recoveries, and new deaths have changed in the 22 countries with the most confirmed cases and in my country, Taiwan. I use multiple linear regression to fit and predict the trends of these 23 countries: US, Brazil, Russia, United Kingdom, Spain, Italy, France, Germany, Turkey, India, Iran, Peru, Canada, China, Chile, Saudi Arabia, Mexico, Pakistan, Belgium, Qatar, Bangladesh, South Africa, and Taiwan. I then compare the trends of these countries and try to explain the differences and the reasons behind the results.

`Note:` The countries are listed by their confirmed-case ranking on May 26, 2020. The ranking may change over time, but I keep the processing order of these countries fixed.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## 1.2 The Dataset
The dataset comes from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset. I mainly use `covid_19_data.csv`, with `time_series_covid_19_confirmed.csv`, `time_series_covid_19_deaths.csv`, and `time_series_covid_19_recovered.csv` as references.

### Column Description
* SNo: The sequence number of the record
* ObservationDate: The date of the observation
* Province/State: The province or state where the cases were confirmed
* Country/Region: The country the confirmed case belongs to
* Last Update: Last update time
* Confirmed: The accumulated confirmed cases until the observation date
* Deaths: The accumulated deaths until the observation date
* Recovered: The accumulated recovered cases until the observation date

The dataset I used contains confirmed, death, and recovered counts from January 22, 2020, to June 14, 2020.

# 2. Data Munging
## 2.1 Basic knowledge of dataset
The dataset is loaded from the CSV file and assigned to a variable `virus`.

In [None]:
# reading the data
virus = pd.read_csv('../input/novel-corona-virus-2019-dataset/covid_19_data.csv')

# preview data
virus.head()

In [None]:
# preview tail data
virus.tail()

In [None]:
# data dimension
virus.shape
# 36598 observations, 8 features

In [None]:
# columns
virus.keys()

In [None]:
# get basic information about missing values
virus.info()
# we know that not all countries have Province/State data.

In [None]:
# get summary information 
virus.describe()
# a high standard deviation means that the numbers are more spread out

## 2.2 Perform data transformation
We only need the data of the 22 countries with the most confirmed cases, which are US, Brazil, Russia, United Kingdom, Spain, Italy, France, Germany, Turkey, India, Iran, Peru, Canada, China, Chile, Saudi Arabia, Mexico, Pakistan, Belgium, Qatar, Bangladesh, and South Africa, plus my country, Taiwan. Some countries have province or state information, but others do not. We need to handle the two cases separately and use several custom functions to calculate the daily new confirmed cases, new deaths, and new recovered cases.

In [None]:
# change the key/feature name for future use
virus.rename(columns={
    'Province/State': 'ProvinceState',
    'Country/Region': 'Country',
    'Last Update': 'Update'
}, inplace=True)

virus.keys()

In [None]:
# convert ObservationDate to datetime format
# errors='coerce' turns unparseable entries into NaT instead of silently keeping strings
virus['ObservationDate'] = pd.to_datetime(virus['ObservationDate'], format='%m/%d/%Y', errors='coerce')
virus.info() # ObservationDate is datetime64 data type
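
The effect of the date conversion can be checked on a small example. The sketch below uses a made-up three-row column (not taken from the dataset) and assumes `errors='coerce'`, so that an entry that does not match the `%m/%d/%Y` format becomes `NaT` and can be listed explicitly.

```python
import pandas as pd

# Toy column: two well-formed dates and one impossible date (illustrative only).
raw = pd.Series(['01/22/2020', '02/30/2020', '06/14/2020'])

# With errors='coerce', entries that cannot be parsed become NaT.
parsed = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')

# Rows that failed to parse can then be inspected before any further processing.
bad_rows = parsed.isna()
print(parsed.dtype)
print(raw[bad_rows].tolist())
```

Checking `bad_rows` before grouping by date is one way to notice malformed records early; whether the real dataset contains any is not verified here.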

We will separate those 23 countries into their own datasets. Then we will choose some of them as training sets and the others as test sets. This lets us compare how the models learned from each country differ.

In [None]:
# US dataset inspection
us = virus[virus.Country == 'US']
print(us)

We can see that the US data is split by state, but I only want the counts at the country level. Therefore, we group it by state and date first.

In [None]:
# The SQL is 
# select ObservationDate, ProvinceState, sum(Confirmed) as Confirmed, 
#                  sum(Deaths) as Deaths, sum(Recovered) as Recovered
#                  from data 
#                  where Country = '%s' 
#                  group by ProvinceState, ObservationDate
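# numeric_only=True sums only the numeric columns; recent pandas versions no longer drop text columns automatically
usByStateDate = us.groupby(['ProvinceState', 'ObservationDate']).sum(numeric_only=True)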
print(usByStateDate)

Then we need the data accumulated by observation date.

In [None]:
# The SQL is
# select ObservationDate, sum(Confirmed) as Confirmed, sum(Deaths) as Deaths, 
#        sum(Recovered) as Recovered
#        from countryData 
#        group by ObservationDate
usByDate = usByStateDate.groupby(['ObservationDate']).sum(numeric_only=True)
print(usByDate)
# now we have the daily confirmed/deaths/recovered cases by country
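
The two grouping steps can be illustrated on toy data. This is a minimal sketch with two hypothetical states, `A` and `B`, and invented counts. It suggests that, for plain sums, grouping by state and date and then by date gives the same country-level totals as grouping by date directly.

```python
import pandas as pd

# Toy cumulative counts for two hypothetical states over two dates.
toy = pd.DataFrame({
    'ProvinceState': ['A', 'B', 'A', 'B'],
    'ObservationDate': pd.to_datetime(
        ['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-02']),
    'Confirmed': [10, 5, 14, 9],
    'Deaths': [0, 1, 1, 1],
    'Recovered': [1, 0, 2, 1],
})
cols = ['Confirmed', 'Deaths', 'Recovered']

# Stage 1: one row per (state, date). Stage 2: collapse states into one row per date.
by_state_date = toy.groupby(['ProvinceState', 'ObservationDate'])[cols].sum()
two_stage = by_state_date.groupby(level='ObservationDate').sum()

# Direct alternative: group by date only.
by_date = toy.groupby('ObservationDate')[cols].sum()

print(two_stage)
print(by_date)
```

Selecting the numeric columns explicitly keeps text columns such as the country name out of the sum.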

We need to calculate new confirmed, new deaths, and new recovered counts according to the previous day.

In [None]:
# calculate the delta between observations by date
usByDate['ConfirmedNew'] = usByDate.sort_values('ObservationDate')['Confirmed'].diff().fillna(0)
usByDate['DeathsNew'] = usByDate.sort_values('ObservationDate')['Deaths'].diff().fillna(0)
usByDate['RecoveredNew'] = usByDate.sort_values('ObservationDate')['Recovered'].diff().fillna(0)
usByDate
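
The daily delta is the difference between consecutive cumulative values. The sketch below uses an invented four-day series to compare the two ways this notebook fills the first row, which has no previous day: with `0` (as in the cell above) or with the first cumulative value itself (as in `calculateMetrics`).

```python
import pandas as pd

# Toy cumulative confirmed counts for four consecutive days (illustrative only).
cumulative = pd.Series([3, 5, 5, 9], name='Confirmed')

# diff() leaves NaN in the first row because there is no previous observation.
new_zero_fill = cumulative.diff().fillna(0)           # first day counted as 0 new cases
new_self_fill = cumulative.diff().fillna(cumulative)  # first day counted as its own total

print(new_zero_fill.tolist())
print(new_self_fill.tolist())
```

With the second convention the daily deltas add up to the last cumulative value, which is a convenient sanity check.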

Now we have three new columns called ConfirmedNew, DeathsNew, and RecoveredNew. We use this pattern to produce each country's own dataset, so we define two reusable functions.

In [None]:
from sklearn import preprocessing

# define common function to calculate
def groupByProvinceStateDate(country):
    '''
    Group data by ProvinceState and Date to calculate the sum of confirmed, deaths, and recovered

    Parameters
    ----------
    country : String
        The string to filter out the specified country.
    '''
    
    # Some countries have province data and others don't.
    # groupby drops rows whose key is missing,
    # so we fill missing ProvinceState values with a placeholder string first.
    
    data = virus[virus.Country == country].fillna({'ProvinceState':'blank'}) # subset
    
    dataByStateDate = data.groupby(['ProvinceState', 'ObservationDate']).sum(numeric_only=True).reset_index()
    print('\n[****** The', country, 'data by province or state ******]')
    
    print(dataByStateDate)
    return dataByStateDate

def calculateMetrics(data, country):
    '''
    Group data by Date to calculate the sum of confirmed, the sum of deaths, the sum of recovered, death rate, and recovered rate. Then calculate the delta of different observations by date.
    Parameters
    ----------
    data : Pandas Dataframe
        The original dataset to be grouped.
    country : String
        The string to filter out the specified country.
    '''
    
    data = data.groupby(['ObservationDate']).sum(numeric_only=True).reset_index()
        
    # The first observation has no previous observation to subtract,
    # so fillna() uses its own cumulative value as the delta.
    
    # ConfirmedNew is calculated as confirmed-of-today minus confirmed-of-yesterday
    data['ConfirmedNew'] = data.sort_values('ObservationDate')['Confirmed'].diff().fillna(data['Confirmed'])
    
    # DeathsNew is calculated as deaths-of-today minus deaths-of-yesterday
    data['DeathsNew'] = data.sort_values('ObservationDate')['Deaths'].diff().fillna(data['Deaths'])
    
    # RecoveredNew is calculated as recovered-of-today minus recovered-of-yesterday
    data['RecoveredNew'] = data.sort_values('ObservationDate')['Recovered'].diff().fillna(data['Recovered'])
    
    # DeathRate is calculated as deaths divided by confirmed
    data['DeathRate'] = data['Deaths'] / data['Confirmed']
    
    # RecoveredRate is calculated as recovered divided by confirmed
    data['RecoveredRate'] = data['Recovered'] / data['Confirmed']
    
    # The magnitudes differ widely between countries, so we normalize them.
    column_names_to_normalize = ['Confirmed', 'Deaths', 'Recovered', 'ConfirmedNew', 'DeathsNew', 'RecoveredNew']
    column_names_normalize = ['ConfirmedN', 'DeathsN', 'RecoveredN', 'ConfirmedNewN', 'DeathsNewN', 'RecoveredNewN']
    confirmedNew = data[column_names_to_normalize].values
    confirmedNewNormal = preprocessing.MinMaxScaler().fit_transform(confirmedNew)
    temp = pd.DataFrame(confirmedNewNormal, columns=column_names_normalize, index = data.index)
    data = pd.concat([data, temp], axis=1, sort=False)
    
    # The country information is lost after sum(),
    # so we need to add the country back to the dataset.
    data['Country'] = country
    
    # add a Day column: the day number counted from the first confirmed cases
    data.insert(0, 'Day', range(1, len(data) + 1))
    
    # remove SNo column since we have Day column
    del data['SNo']
    
    
    print('\n[****** The', country, 'data by date ******]')
    print(data)
    return data
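
The normalization step in `calculateMetrics` rescales each column to the range 0 to 1. As a minimal sketch, the pure-Python helper below (the name `min_max` is hypothetical) applies the usual min-max formula, (x - min) / (max - min), to two invented series of very different size. Handling a constant column by returning zeros is a convention chosen for this sketch.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Two illustrative series with the same shape but very different magnitudes.
small_country = [1, 3, 5]
large_country = [1000, 3000, 5000]

print(min_max(small_country))
print(min_max(large_country))
```

Both series map to the same normalized values, which is why the normalized columns make the shapes of the curves comparable across countries.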

Now that we have defined the functions that calculate the accumulated confirmed, deaths, and recovered counts and the related metrics, we are going to make a subset for each country.

### The United States Data

In [None]:
# the US has detailed information regarding states, so we need to group it.
usByState = groupByProvinceStateDate('US')

# calculate the delta of confirmed, deaths, recovered cases by grouping date
us = calculateMetrics(usByState, 'US')

us
# we can double-check the result by corresponding output from those two functions

The organized and transformed data contains 17 columns.

In [None]:
us.info()

### The Brazil Data

In [None]:
# Brazil has detailed state information since 5/21, so we need to group it.
brazilByProvince = groupByProvinceStateDate('Brazil')

# Brazil has no confirmed cases on 2020-01-23 (the first row), so we remove that row
brazilByProvince = brazilByProvince[brazilByProvince['Confirmed'] != 0]

# calculate the delta of confirmed, deaths, recovered cases by grouping date
brazil = calculateMetrics(brazilByProvince, 'Brazil')

brazil
# we can double-check the result by corresponding output from those two functions
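
One effect of dropping zero-confirmed rows is on the rate columns. The sketch below uses invented numbers and assumes the rate is computed as `Deaths / Confirmed`, as in `calculateMetrics`. A row with zero confirmed cases produces a missing rate, while the filtered data does not.

```python
import pandas as pd

# Toy cumulative counts whose first row has no confirmed cases (illustrative only).
toy = pd.DataFrame({'Confirmed': [0, 4, 10], 'Deaths': [0, 0, 1]})

# 0 / 0 yields NaN in pandas, so the first row has no defined death rate.
rate_all = toy['Deaths'] / toy['Confirmed']

# Same filter as in the cell above: keep only rows with confirmed cases.
filtered = toy[toy['Confirmed'] != 0]
rate_filtered = filtered['Deaths'] / filtered['Confirmed']

print(rate_all.tolist())
print(rate_filtered.tolist())
```

Removing such rows also means the Day counter starts on the first day with confirmed cases.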

### The Russia Data

In [None]:
# Russia doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
russia = calculateMetrics(virus[virus.Country == 'Russia'], 'Russia')

russia
# we can double-check the result by corresponding output

### The United Kingdom Data

In [None]:
# the UK has detailed information regarding provinces, so we need to group it.
ukByState = groupByProvinceStateDate('UK')

# calculate the delta of confirmed, deaths, recovered cases by grouping date
uk = calculateMetrics(ukByState, 'UK')

uk
# we can double-check the result by corresponding output from those two functions

### The Spain Data

In [None]:
# Spain has detailed information regarding provinces, so we need to group it.
spainByState = groupByProvinceStateDate('Spain')

# calculate the delta of confirmed, deaths, recovered cases by grouping date
spain = calculateMetrics(spainByState, 'Spain')

spain
# we can double-check the result by corresponding output from those two functions

### The Italy Data

In [None]:
# Italy has detailed information regarding provinces, so we need to group it.
italyByState = groupByProvinceStateDate('Italy')

# calculate the delta of confirmed, deaths, recovered cases by grouping date
italy = calculateMetrics(italyByState, 'Italy')

italy
# we can double-check the result by corresponding output from those two functions

### The France Data

In [None]:
# France has detailed information regarding provinces, so we need to group it.
franceByState = groupByProvinceStateDate('France')

# calculate the delta of confirmed, deaths, recovered cases by grouping date
france = calculateMetrics(franceByState, 'France')

france
# we can double-check the result by corresponding output from those two functions

### The Germany Data

In [None]:
# Germany has detailed information regarding provinces, so we need to group it.
germanyByState = groupByProvinceStateDate('Germany')

# calculate the delta of confirmed, deaths, recovered cases by grouping date
germany = calculateMetrics(germanyByState, 'Germany')

germany
# we can double-check the result by corresponding output from those two functions

### The Turkey Data

In [None]:
# Turkey doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
turkey = calculateMetrics(virus[virus.Country == 'Turkey'], 'Turkey')

turkey
# we can double-check the result by corresponding output

### The India Data

In [None]:
# India doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
india = calculateMetrics(virus[virus.Country == 'India'], 'India')

india
# we can double-check the result by corresponding output

### The Iran Data

In [None]:
# Iran doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
iran = calculateMetrics(virus[virus.Country == 'Iran'], 'Iran')

iran
# we can double-check the result by corresponding output

### The Peru Data

In [None]:
# Peru doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
peru = calculateMetrics(virus[virus.Country == 'Peru'], 'Peru')

peru
# we can double-check the result by corresponding output

### The Canada Data

In [None]:
# Canada has detailed information regarding provinces, so we need to group it.
canadaByState = groupByProvinceStateDate('Canada')

# calculate the delta of confirmed, deaths, recovered cases by grouping date
canada = calculateMetrics(canadaByState, 'Canada')

canada
# we can double-check the result by corresponding output from those two functions

### The China Data

In [None]:
# China has detailed information regarding provinces, so we need to group it.
chinaByState = groupByProvinceStateDate('Mainland China')

# calculate the delta of confirmed, deaths, recovered cases by grouping date
china = calculateMetrics(chinaByState, 'Mainland China')

china['Country'] = 'China' # shorten the name of China
china
# we can double-check the result by corresponding output from those two functions

### The Chile Data

In [None]:
# Chile has detailed information regarding provinces since 5/20, so we need to group it.
chileByState = groupByProvinceStateDate('Chile')

# calculate the delta of confirmed, deaths, recovered cases by grouping date
chile = calculateMetrics(chileByState, 'Chile')

chile
# we can double-check the result by corresponding output from those two functions

### The Saudi Arabia Data

In [None]:
# Saudi Arabia doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
arabia = calculateMetrics(virus[virus.Country == 'Saudi Arabia'], 'Saudi Arabia')

arabia
# we can double-check the result by corresponding output

### The Mexico Data

In [None]:
# Mexico has detailed information regarding provinces since 5/20, so we need to group it.
mexicoByState = groupByProvinceStateDate('Mexico')

# Mexico has no confirmed cases on 2020-01-23 (the first row), so we remove that row
mexicoByState = mexicoByState[mexicoByState['Confirmed'] != 0]

# calculate the delta of confirmed, deaths, recovered cases by grouping date
mexico = calculateMetrics(mexicoByState, 'Mexico')

mexico
# we can double-check the result by corresponding output from those two functions

### The Pakistan Data

In [None]:
# Pakistan doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
pakistan = calculateMetrics(virus[virus.Country == 'Pakistan'], 'Pakistan')

pakistan
# we can double-check the result by corresponding output

### The Belgium Data

In [None]:
# Belgium doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
belgium = calculateMetrics(virus[virus.Country == 'Belgium'], 'Belgium')

belgium
# we can double-check the result by corresponding output

### The Qatar Data

In [None]:
# Qatar doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
qatar = calculateMetrics(virus[virus.Country == 'Qatar'], 'Qatar')

qatar
# we can double-check the result by corresponding output

### The Bangladesh Data

In [None]:
# Bangladesh doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
bangladesh = calculateMetrics(virus[virus.Country == 'Bangladesh'], 'Bangladesh')

bangladesh
# we can double-check the result by corresponding output

### The South Africa Data

In [None]:
# South Africa doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
africa = calculateMetrics(virus[virus.Country == 'South Africa'], 'South Africa')

africa
# we can double-check the result by corresponding output

### The Taiwan Data

In [None]:
# Taiwan doesn't have detailed state or province information
# calculate the delta of confirmed, deaths, recovered cases by grouping date
taiwan = calculateMetrics(virus[virus.Country == 'Taiwan'], 'Taiwan')

taiwan
# we can double-check the result by corresponding output

### Add geographical information
We will need geographical information for categorical classification or visualization.

In [None]:
# define custom function to add geographical information
def addGeoInfo(data):
    '''
    add the continent, longitude, and latitude info by country

    Parameters
    ----------
    data : Pandas Dataframe
        The country dataset.
    '''
    country = data['Country'][0] # get the country name
    
    # we use the location of the city/province/state with the most cases in the country
    # the info comes from time_series_covid19_confirmed_global.csv
    if (country in ['US', 'Brazil', 'Peru', 'Canada', 'Chile', 'Mexico']):
        
        data['Continent'] = 'America'
        
        if (country == 'US'):
            addLatLong(data, 37.0902, -95.7129)
        elif (country == 'Brazil'):
            addLatLong(data, -14.235, -51.9253)
        elif (country == 'Peru'):
            addLatLong(data, -9.19, -75.0152)
        elif (country == 'Canada'):
            addLatLong(data, 51.2538, -85.3232)
        elif (country == 'Chile'):
            addLatLong(data, -35.6751, -71.543)
        elif (country == 'Mexico'):
            addLatLong(data, 23.6345, -102.5528)
        else:
            print('Can\'t find the latitude/longitude of country:', country)
            
    elif (country in ['Russia', 'UK', 'Spain', 'Italy', 'France', 'Germany', 'Belgium']) :
        
        data['Continent'] = 'Europe'
        
        if (country == 'Russia'):
            addLatLong(data, 60, 90)
        elif (country == 'UK'):
            addLatLong(data, 49.3723, -2.3644)
        elif (country == 'Spain'):
            addLatLong(data, 40, -4)
        elif (country == 'Italy'):
            addLatLong(data, 43, 12)
        elif (country == 'France'):
            addLatLong(data, 46.2276, 2.2137)
        elif (country == 'Germany'):
            addLatLong(data, 51, 9)
        elif (country == 'Belgium'):
            addLatLong(data, 50.8333, 4)
        else:
            print('Can\'t find the latitude/longitude of country:', country)
            
    elif (country in ["Turkey", 'India', 'Iran', 'China', 'Saudi Arabia', 'Pakistan', 'Qatar', 'Bangladesh', 'Taiwan']) :
        
        data['Continent'] = 'Asia'
        
        if (country == 'Turkey'):
            addLatLong(data, 38.9637, 35.2433)
        elif (country == 'India'):
            addLatLong(data, 21, 78)
        elif (country == 'Iran'):
            addLatLong(data, 32, 53)
        elif (country == 'China'):
            addLatLong(data, 30.9756, 112.2707)
        elif (country == 'Saudi Arabia'):
            addLatLong(data, 24, 45)
        elif (country == 'Pakistan'):
            addLatLong(data, 30.3753, 69.3451)
        elif (country == 'Qatar'):
            addLatLong(data, 25.3548, 51.1839)
        elif (country == 'Bangladesh'):
            addLatLong(data, 23.685, 90.3563)
        elif (country == 'Taiwan'):
            addLatLong(data, 23.7, 121)
        else:
            print('Can\'t find the latitude/longitude of country:', country)
    elif (country in ["South Africa"]) :
        data['Continent'] = 'Africa'
        
        if (country == 'South Africa'):
            addLatLong(data, -30.5595, 22.9375)
        else:
            print('Can\'t find the latitude/longitude of country:', country)
    else:
        print('Can\'t find the country:', country)
        
def addLatLong(data, latitude, longitude):
    '''
    add the longitude and latitude info to dataframe

    Parameters
    ----------
    data : Pandas Dataframe
        The country dataset.
    latitude : Float
        The latitude of the point: the angle between the equatorial plane and the straight line that passes through the point and through (or close to) the center of the Earth.
    longitude : Float
        The longitude of the point: the angle east or west from a reference meridian to the meridian that passes through the point.
    '''
    data['Latitude'] = latitude
    data['Longitude'] = longitude
    print(data['Country'][0], ':', latitude, ',', longitude)
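
The long `if`/`elif` chain in `addGeoInfo` maps a country name to a continent and a coordinate pair. As a possible alternative, the same mapping can be stored in a dictionary. The sketch below is hypothetical (`GEO_INFO` and `lookup_geo` are not part of the notebook) and lists only three of the 23 countries, with values copied from `addGeoInfo`.

```python
# Hypothetical lookup table: country -> (continent, latitude, longitude).
# Only three entries are shown; the values are copied from addGeoInfo.
GEO_INFO = {
    'US': ('America', 37.0902, -95.7129),
    'Germany': ('Europe', 51, 9),
    'Taiwan': ('Asia', 23.7, 121),
}

def lookup_geo(country):
    """Return (continent, latitude, longitude), or None for an unknown country."""
    return GEO_INFO.get(country)

print(lookup_geo('Taiwan'))
print(lookup_geo('Atlantis'))  # a name that is not in the table
```

A table like this keeps each country's data on one line and makes the "country not found" case a single check on `None`.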

Add geographical information to each country.

In [None]:
addGeoInfo(us)
addGeoInfo(brazil)
addGeoInfo(russia)
addGeoInfo(uk)
addGeoInfo(spain)
addGeoInfo(italy)
addGeoInfo(france)
addGeoInfo(germany)
addGeoInfo(turkey)
addGeoInfo(india)
addGeoInfo(iran)
addGeoInfo(peru)
addGeoInfo(canada)
addGeoInfo(china)
addGeoInfo(chile)
addGeoInfo(arabia)
addGeoInfo(mexico)
addGeoInfo(pakistan)
addGeoInfo(belgium)
addGeoInfo(qatar)
addGeoInfo(bangladesh)
addGeoInfo(africa)
addGeoInfo(taiwan)

After we add more features, the dataset now contains 20 columns.

In [None]:
print(us.info())

us.tail() # show some info

For unknown reasons, the dataset is not 100% correct, which produces negative values of ConfirmedNew, DeathsNew, and RecoveredNew. So we try to clean it up.

In [None]:
def clean(data):
    '''
    Remove the row if there is a negative number in ConfirmedNew, DeathsNew, or RecoveredNew. 
    After removing, print the logs.

    Parameters
    ----------
    data : Pandas Dataframe
        The country dataset.
    '''
    country = data.Country[0]
    
    if (len(data[data.ConfirmedNew < 0]) > 0):
        print('The value of ConfirmedNew of', country, 'is negative.')
        print(data[data.ConfirmedNew < 0])
        data.drop(data[data.ConfirmedNew < 0].index, inplace = True)
        print('ConfirmedNew cleaned up\n')
    
    if (len(data[data.DeathsNew < 0]) > 0):
        print('The value of DeathsNew of', country, 'is negative.')
        print(data[data.DeathsNew < 0])
        data.drop(data[data.DeathsNew < 0].index, inplace = True)
        print('DeathsNew cleaned up\n')
    
    if (len(data[data.RecoveredNew < 0]) > 0):
        print('The value of RecoveredNew of', country, 'is negative.')
        print(data[data.RecoveredNew < 0])
        data.drop(data[data.RecoveredNew < 0].index, inplace = True)
        print('RecoveredNew cleaned up\n')
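
A negative delta appears whenever a cumulative count is lower than the previous day's count. The sketch below shows a possible non-mutating variant of the cleaning step on invented data. The helper name `drop_negative_deltas` is hypothetical, and it returns a filtered copy instead of dropping rows in place.

```python
import pandas as pd

def drop_negative_deltas(data, columns=('ConfirmedNew', 'DeathsNew', 'RecoveredNew')):
    """Return a copy of `data` without rows where any delta column is negative."""
    has_negative = (data[list(columns)] < 0).any(axis=1)
    return data[~has_negative].copy()

# Toy deltas: the second and third rows each contain one negative value.
toy = pd.DataFrame({
    'ConfirmedNew': [5, -2, 7],
    'DeathsNew': [0, 1, -1],
    'RecoveredNew': [1, 1, 1],
})

cleaned = drop_negative_deltas(toy)
print(cleaned)
```

Because the input frame is left untouched, the removed rows can still be inspected afterwards by comparing `toy` with `cleaned`.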
        

In [None]:
# clean data correspondingly
clean(us)
clean(brazil)
clean(russia)
clean(uk)
clean(spain)
clean(italy)
clean(france)
clean(germany)
clean(turkey)
clean(india)
clean(iran)
clean(peru)
clean(canada)
clean(china)
clean(chile)
clean(arabia)
clean(mexico)
clean(pakistan)
clean(belgium)
clean(qatar)
clean(bangladesh) 
clean(africa)
clean(taiwan)

# 3. EDA (Exploratory Data Analysis)
## 3.1 Scatterplot
We use scatter plots to observe the relationship between the Day and ConfirmedNew variables, among others, in each country.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
sns.set_style("darkgrid")
   
def scatterplot(data):
    '''
    Draw scatter plots of the related metrics.

    Parameters
    ----------
    data : Pandas Dataframe
        The country dataset to plot.
    '''
    
    print('The Scatter Plot of', data.Country[0])
       
    plt.figure(figsize=(20,6))
    plt.subplot(1,2,1)
    plt.plot(data.Day, data.ConfirmedNew, '.')
    plt.xlabel('$Day$', fontsize=12)
    plt.ylabel('$ConfirmedNew$', fontsize=12)

    plt.subplot(1,2,2)
    plt.plot(data.Day, data.Confirmed, '.')
    plt.xlabel('$Day$', fontsize=12)
    plt.ylabel('$Confirmed$', fontsize=12)
    plt.show()

    plt.figure(figsize=(20,6))
    plt.subplot(1,2,1)
    plt.plot(data.RecoveredNew, data.ConfirmedNew, '.')
    plt.xlabel('$RecoveredNew$', fontsize=12)
    plt.ylabel('$ConfirmedNew$', fontsize=12)

    plt.subplot(1,2,2)
    plt.plot(data.DeathsNew, data.ConfirmedNew, '.')  # x-axis matches the DeathsNew label below
    plt.xlabel('$DeathsNew$', fontsize=12)
    plt.ylabel('$ConfirmedNew$', fontsize=12)
    plt.show()

    plt.figure(figsize=(20,6))
    plt.subplot(1,2,1)
    plt.plot(data.DeathRate, data.DeathsNew, '.')
    plt.xlabel('$DeathRate$', fontsize=12)
    plt.ylabel('$DeathsNew$', fontsize=12)

    plt.subplot(1,2,2)
    plt.plot(data.RecoveredRate, data.RecoveredNew, '.')
    plt.xlabel('$RecoveredRate$', fontsize=12)
    plt.ylabel('$RecoveredNew$', fontsize=12)
    plt.show()
    
    plt.figure(figsize=(20,6))
    plt.subplot(1,2,1)
    plt.plot(data.Day, data.DeathRate, '.')
    plt.xlabel('$Day$', fontsize=12)
    plt.ylabel('$DeathRate$', fontsize=12)

    plt.subplot(1,2,2)
    plt.plot(data.Day, data.RecoveredRate, '.')
    plt.xlabel('$Day$', fontsize=12)
    plt.ylabel('$RecoveredRate$', fontsize=12)
    plt.show()
    

In [None]:
scatterplot(us)

### Quick analysis of US
We can see the first two plots show that the trend of ConfirmedNew cases is decreasing while the trend of Confirmed cases is getting flat.

The second row shows the relationship of ConfirmedNew with RecoveredNew and with DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(brazil)

### Quick analysis of Brazil
We can see the first two plots show that the trend of ConfirmedNew cases is still increasing. Therefore, the trend of Confirmed cases is increasing dramatically.

The second row shows the relationship of ConfirmedNew with RecoveredNew and with DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(russia)

### Quick analysis of Russia
We can see the first two plots show that the trend of ConfirmedNew cases seems to start decreasing while the trend of Confirmed cases is increasing dramatically.

The second row shows the relationship of ConfirmedNew with RecoveredNew and with DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(uk)

### Quick analysis of UK
We can see the first two plots show that the trend of ConfirmedNew cases is decreasing while the trend of Confirmed cases is getting flat.

The second row shows the relationship of ConfirmedNew with RecoveredNew and with DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(spain)

### Quick analysis of Spain
We can see the first two plots show that the trend of ConfirmedNew cases is decreasing while the trend of Confirmed cases is getting flat.

The second row shows the relationship of ConfirmedNew with RecoveredNew and with DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(italy)

### Quick analysis of Italy
We can see the first two plots show that the trend of ConfirmedNew cases is decreasing while the trend of Confirmed cases is getting flat.

The second row shows the relationship of ConfirmedNew with RecoveredNew and with DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(france)

### Quick analysis of France
We can see the first two plots show that the trend of ConfirmedNew cases is getting stable while the trend of Confirmed cases is getting flat.

The second row shows the relationship of ConfirmedNew with RecoveredNew and with DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(germany)

### Quick analysis of Germany
We can see the first two plots show that the trend of ConfirmedNew cases is getting stable while the trend of Confirmed cases is getting flat.

The second row shows the relationship of ConfirmedNew with RecoveredNew and with DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(turkey)

### Quick analysis of Turkey
We can see the first two plots show that the trend of ConfirmedNew cases is decreasing while the trend of Confirmed cases is getting flat.

The second row shows the relationship of ConfirmedNew with RecoveredNew and with DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(india)

### Quick analysis of India
We can see the first two plots show that the trend of ConfirmedNew cases is still increasing. Therefore, the trend of Confirmed cases is increasing dramatically.

The second row shows the relationship of ConfirmedNew with RecoveredNew and with DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(iran)

### Quick analysis of Iran
We can see the first two plots show that although the trend of ConfirmedNew cases was decreasing, it has entered a second wave of increase. Therefore, the trend of Confirmed cases is increasing.

The second row shows the relationship of ConfirmedNew with RecoveredNew and with DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(peru)

### Quick analysis of Peru
We can see the first two plots show that the trend of ConfirmedNew cases is still increasing. Therefore, the trend of Confirmed cases is increasing dramatically.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(canada)

### Quick analysis of Canada
The first two plots show that the trend of ConfirmedNew cases is still decreasing. Therefore, Confirmed cases are increasing more and more slowly.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(china)

### Quick analysis of China
The first two plots show that the trend of ConfirmedNew cases has been stable for a long time. Therefore, the trend of Confirmed cases has stayed flat for a long time. This looks unusual compared with the other countries.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(chile)

### Quick analysis of Chile 
We can see the first two plots show that the trend of ConfirmedNew cases is still increasing. Therefore, the trend of Confirmed cases is increasing dramatically.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(arabia)

### Quick analysis of Saudi Arabia
The first two plots show that ConfirmedNew cases had been decreasing but are now rising again in a second wave. Therefore, the trend of Confirmed cases is increasing.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(mexico)

### Quick analysis of Mexico 
We can see the first two plots show that the trend of ConfirmedNew cases is still increasing. Therefore, the trend of Confirmed cases is increasing dramatically.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(pakistan)

### Quick analysis of Pakistan 
We can see the first two plots show that the trend of ConfirmedNew cases is still increasing. Therefore, the trend of Confirmed cases is increasing dramatically.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(belgium)

### Quick analysis of Belgium
We can see the first two plots show that the trend of ConfirmedNew cases is decreasing while the trend of Confirmed cases is getting flat.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(qatar)

### Quick analysis of Qatar 
We can see the first two plots show that the trend of ConfirmedNew cases is still increasing. Therefore, the trend of Confirmed cases is increasing dramatically.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(bangladesh)

### Quick analysis of Bangladesh 
We can see the first two plots show that the trend of ConfirmedNew cases is still increasing. Therefore, the trend of Confirmed cases is increasing dramatically.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(africa)

### Quick analysis of South Africa 
We can see the first two plots show that the trend of ConfirmedNew cases is still increasing. Therefore, the trend of Confirmed cases is increasing dramatically.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

In [None]:
scatterplot(taiwan)

### Quick analysis of Taiwan 
We can see the first two plots show that the trend of ConfirmedNew cases is getting stable. Therefore, the trend of Confirmed cases is becoming flat.

The second row shows the relationship between ConfirmedNew, RecoveredNew, and DeathsNew.

The third row shows DeathRate and RecoveredRate are increasing while DeathsNew and RecoveredNew are increasing.

The fourth row shows the trend of DeathRate and RecoveredRate.

## 3.2 Distribution plot

Almost all of the distribution plots are right-skewed. Iran's distribution is closer to normal than the others. China, France, and Taiwan all show very long tails, but for different reasons: China's and France's daily values reach the order of ten thousand, while Taiwan's are only in the double digits.
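
Right skew can also be quantified instead of judged by eye. The following is a minimal sketch, assuming the daily counts are held in a pandas Series; the values are invented for illustration and are not taken from any country's data.

```python
import pandas as pd

# Hypothetical daily new cases: many small values and a few very large ones
daily_new = pd.Series([2, 3, 3, 4, 5, 5, 6, 8, 12, 40, 150])

# Sample skewness: a positive value indicates a long right tail
skewness = daily_new.skew()

# In a right-skewed sample the mean is pulled above the median
print('skewness: %.2f' % skewness)
print('mean: %.1f, median: %.1f' % (daily_new.mean(), daily_new.median()))
```

The same call could be applied to a column such as `data.ConfirmedNew` to compare countries by a single number.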

In [None]:
# draw distribution plot
def distribution(data):
    '''
    Draw distribution plot of ConfirmedNew, DeathsNew, and RecoveredNew

    Parameters
    ----------
    data : Pandas Dataframe
        The dataset to draw distribution
    '''
    
    country = data.Country[0]
    plt.figure(figsize=(30,6))

    plt.subplot(1,3,1)
    plt.title('Confirmed New Distribution Plot of ' + country)
    sns.distplot(data.ConfirmedNew)

    # Skip the DeathsNew plot for Taiwan because it raises an error with the message "You have categorical data, but your model needs something numerical. See our one hot encoding tutorial for a solution."
    # The reason is that Taiwan's DeathsNew only takes the values 0, 1, or 3, so it is treated as categorical data
    if country != 'Taiwan':
        plt.subplot(1,3,2)
        plt.title('Deaths New Distribution Plot of ' + country)
        sns.distplot(data.DeathsNew)

    plt.subplot(1,3,3)
    plt.title('Recovered New Distribution Plot of ' + country)
    sns.distplot(data.RecoveredNew)

    plt.show()

In [None]:
distribution(us)

In [None]:
distribution(brazil)

In [None]:
distribution(russia)

In [None]:
distribution(uk)

In [None]:
distribution(spain)

In [None]:
distribution(italy)

In [None]:
distribution(france)

In [None]:
distribution(germany)

In [None]:
distribution(turkey)

In [None]:
distribution(india)

In [None]:
distribution(iran)

In [None]:
distribution(peru)

In [None]:
distribution(canada)

In [None]:
distribution(china)

In [None]:
distribution(chile)

In [None]:
distribution(arabia)

In [None]:
distribution(mexico)

In [None]:
distribution(pakistan)

In [None]:
distribution(belgium)

In [None]:
distribution(qatar)

In [None]:
distribution(bangladesh)

In [None]:
distribution(africa)

In [None]:
distribution(taiwan)

## 3.3 Boxplot
We use boxplots to view the quantitative distribution of each country’s data.

In [None]:
world = pd.concat([us, brazil, russia, uk, spain, italy, france, germany, turkey, india, iran, peru, canada, china, chile, arabia, mexico, pakistan, belgium, qatar, bangladesh, africa, taiwan]) 

# show the distribution of ConfirmedNew for each country as boxplots
plt.figure(figsize=(20,8))
plt.title('Country vs ConfirmedNew')
sns.boxplot(x=world.Country, y=world.ConfirmedNew)
plt.ylabel('Confirmed New')
plt.xlabel('Country')
plt.show()

According to these boxplots, Iran and Turkey show a typical distribution of the data. There are outliers in the plots of Brazil, France, Germany, India, Peru, China, Pakistan, and South Africa, and China has the most outliers.
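
The points drawn as outliers follow a simple rule: with seaborn's default setting (`whis=1.5`), a value is an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to invented numbers; `iqr_outliers` is a hypothetical helper, not part of this notebook.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return the values lying more than whis * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical daily new cases with one extreme day
daily_new = [5, 7, 8, 9, 10, 11, 12, 13, 15, 90]
print(iqr_outliers(daily_new))
```

Counting the returned values per country would give a numeric version of the statement that China has the most outliers.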

## 3.4 Pairplot
We draw a pairplot to view the pairwise relationships between variables.

In [None]:
# exclude unnecessary columns
pairdata = world[world.columns[~world.columns.isin(['ObservationDate', 'Latitude', 'Longitude', 'ConfirmedN', 'DeathsN', 'RecoveredN', 'ConfirmedNewN', 'DeathsNewN', 'RecoveredNewN'])]]

# take america as sample
america = pairdata[pairdata.Continent == 'America']

# ignore the categorical continent column
america = america[america.columns[~america.columns.isin(['Continent'])]]

# draw pairplot
sns.pairplot(america, hue='Country')

We can see that each country has its own shape of plots.
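
A pairplot shows the relationships visually; the same pairs can be summarized as numbers in a correlation matrix. The sketch below uses invented values for three of the notebook's columns, so the printed coefficients say nothing about the real data.

```python
import numpy as np

# Hypothetical values for three of the notebook's columns
confirmed_new = np.array([10, 20, 35, 50, 70, 65, 40])
deaths_new = np.array([0, 1, 2, 4, 6, 7, 5])
recovered_new = np.array([0, 2, 5, 12, 20, 30, 38])

# np.corrcoef treats each row as one variable and returns the
# matrix of pairwise Pearson correlation coefficients
corr = np.corrcoef([confirmed_new, deaths_new, recovered_new])
print(np.round(corr, 2))
```

On a DataFrame, `DataFrame.corr()` computes the same Pearson coefficients for all numeric columns.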

# 4. Machine Learning
## 4.1 Multiple Linear Regression
We want to find out the COVID-19 correlations between these 23 countries.
Although these countries adopt different policies and methods against the pandemic, we assume the coronavirus (COVID-19) outbreak has the same pattern in all countries.

In [None]:
import statsmodels.api as sm

def evaluate(X, y, num):
    '''
    Perform linear regression on specified features

    Parameters
    ----------
    X : array_like
        A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
    y : array_like
        A 1-d endogenous response variable. The dependent variable.
    num : int
        The evaluate number.
    '''
    model = sm.OLS(y, X).fit()
    print('\nModel #', num)
    print(model.summary())

def evaluateByCountry(data):
    '''    
    Evaluate different models built from different combinations of features

    Parameters
    ----------
    data : Pandas Dataframe
        The dataset to train by linear regression
    '''
    
    print('\nThe evaluation of', data.Country[0])

    
    # model #1
    X = data[['Day']]
    y = data['ConfirmedNewN']
    evaluate(X, y, 1)
    
    # model #2
    X = data[['Day', 'ConfirmedN']]
    evaluate(X, y, 2)
    
    # model #3
    X = data[['Day', 'ConfirmedN', 'DeathsN']]
    evaluate(X, y, 3)
    
    # model #4
    X = data[['Day', 'ConfirmedN', 'DeathsN', 'RecoveredN']]
    evaluate(X, y, 4)

    # model #5
    X = data[['Day', 'ConfirmedN', 'DeathsN', 'RecoveredN', 'DeathsNewN']]
    evaluate(X, y, 5)

    # model #6
    X = data[['Day', 'ConfirmedN', 'DeathsN', 'RecoveredN', 'DeathsNewN', 'RecoveredNewN']]
    evaluate(X, y, 6)

    # model #7
    X = data[['Day', 'ConfirmedN', 'DeathsN', 'RecoveredN', 'DeathsNewN', 'RecoveredNewN', 'DeathRate']]
    evaluate(X, y, 7)

    # model #8
    X = data[['Day', 'ConfirmedN', 'DeathsN', 'RecoveredN', 'DeathsNewN', 'RecoveredNewN', 'DeathRate', 'RecoveredRate']]
    evaluate(X, y, 8)
    


In [None]:
evaluateByCountry(us)

#### Summary of US model evaluation
After training linear models on eight different combinations of features, we can see that model #8 has the best R-squared score, which is 0.978. The p-values are very small for all features, which means all features are significantly related to ConfirmedNew.
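
One point to keep in mind when reading these R-squared values: `evaluate` passes `X` to `sm.OLS` without adding a constant column, and for a model without an intercept statsmodels reports the uncentered R-squared, which is never smaller than the centered R-squared of the same fit. The sketch below reproduces both definitions with NumPy on synthetic data so the gap is visible; it is not a rerun of the notebook's models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y has a large offset and depends only weakly on x
x = np.arange(1.0, 51.0)
y = 100.0 + 0.1 * x + rng.normal(0.0, 5.0, size=x.size)

# Least-squares fit without an intercept: y ~ b * x
X = x.reshape(-1, 1)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef
ssr = np.sum(residuals ** 2)

# Uncentered R-squared compares the residuals against sum(y^2)
r2_uncentered = 1.0 - ssr / np.sum(y ** 2)

# Centered R-squared compares them against the spread of y around its mean
r2_centered = 1.0 - ssr / np.sum((y - y.mean()) ** 2)

print('uncentered R-squared: %.3f' % r2_uncentered)
print('centered R-squared:   %.3f' % r2_centered)
```

In statsmodels, `sm.add_constant(X)` adds the intercept column, after which the reported R-squared is the centered one.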

In [None]:
evaluateByCountry(italy)

#### Summary of Italy model evaluation
After training linear models on eight different combinations of features, we can see that models #5 to #8 share the best R-squared score, which is 0.974. The p-values in model #5 and model #6 are small for all features, which means the features are significantly related to ConfirmedNew.

In [None]:
evaluateByCountry(india)

#### Summary of India model evaluation
After training linear models on eight different combinations of features, we can see that models #5 to #8 share the best R-squared score, which is 0.993. The p-values in model #5 and model #6 are small for all features, which means the features are significantly related to ConfirmedNew.

## 4.2 Predicting with each country dataset

After comparing the above results, we choose the six features of model #6 (Day, ConfirmedN, DeathsN, RecoveredN, DeathsNewN, and RecoveredNewN) as the input variables for the following training and prediction process.

In [None]:
from scipy import stats
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

def linearTrain(data):
    '''
    Perform linear regression on specified features

    Parameters
    ----------
    data : Pandas Dataframe
        The dataset to train by linear regression
    '''
    
    X = data[['Day', 'ConfirmedN', 'DeathsN', 'RecoveredN', 'DeathsNewN', 'RecoveredNewN']] # the features selected from model #6
    y = data['ConfirmedNewN']
    model = sm.OLS(y, X).fit()
    return model

def cor(model, data):
    '''
    Predict with the given model and calculate the correlation

    Parameters
    ----------
    model: OLS data model
        The trained linear model
    data : Pandas Dataframe
        The dataset to predict by trained model
    '''

    # predict ConfirmedNew by corresponding model given the test data
    predictConfirmedNewN = model.predict(data[['Day', 'ConfirmedN', 'DeathsN', 'RecoveredN', 'DeathsNewN', 'RecoveredNewN']]) # the features selected from model #6
    
    print('\nThe prediction of', data.Country[0])
    
    # calculate correlation by Pearson's method
    pearson = stats.pearsonr(predictConfirmedNewN, data.ConfirmedNewN)
    correlation = abs(pearson[0]) * 100
    
    # calculate Mean Absolute Error
    mae = mean_absolute_error(data.ConfirmedNewN, predictConfirmedNewN)
    
    # calculate Mean Squared Error
    mse = mean_squared_error(data.ConfirmedNewN, predictConfirmedNewN)
    
    # calculate Root Mean Squared Error
    rmse = np.sqrt(mse)

    # print predicted info
    print('Correlation: %.3f' % correlation, ',       Mean Absolute Error: %.3f' % mae)
    print('Mean Squared Error: %.3f' % mse, ', Root Mean Squared Error: %.3f' % rmse)
    return correlation

In [None]:
# all country list
countryList = ['US', 'Brazil', 'Russia', 'UK', 'Spain', 'Italy', 'France', 'Germany', 'Turkey', 'India', 'Iran', 'Peru', 'Canada', 'China', 'Chile', 'Saudi Arabia', 'Mexico', 'Pakistan', 'Belgium', 'Qatar', 'Bangladesh', 'South Africa', 'Taiwan']

# define a global dataframe for accumulating the correlation scores
correlationScores = pd.DataFrame({'Country' : countryList,
                                  'Correlation' : [0.0] * len(countryList)}) # initial score is 0 for all 23 countries; float so that fractional scores can be accumulated

# show that the correlation scores are all 0
correlationScores

Here, we build a dataframe that contains the 23 countries and their correlation scores. We use the score to test whether a country correlates with the other countries, that is, how similar their patterns are.
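
The score returned by `cor` is the absolute Pearson correlation multiplied by 100, and the function also prints three error metrics. The sketch below recomputes all four with NumPy on invented numbers to show what each one measures; `score_metrics` is a hypothetical helper, not part of this notebook.

```python
import numpy as np

def score_metrics(actual, predicted):
    """Return |Pearson r| * 100 together with MAE, MSE, and RMSE."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(actual, predicted)[0, 1]
    errors = actual - predicted
    mse = np.mean(errors ** 2)
    return {
        'correlation': abs(r) * 100,
        'mae': np.mean(np.abs(errors)),
        'mse': mse,
        'rmse': np.sqrt(mse),
    }

# Hypothetical prediction with the right shape but the wrong level
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [11.0, 12.0, 13.0, 14.0]
metrics = score_metrics(actual, predicted)
print(metrics)
```

In this example the correlation is perfect although every prediction is off by 10, so the correlation score describes similarity of shape, while the error metrics describe how far the predicted values are from the actual ones.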

In [None]:
def corBar(countries, correlations, country):
    '''
    Draw barplots about the countries and correlations.

    Parameters
    ----------
    countries : list of countries
        The specified country in list.
    correlations : list of correlation score
        The calculated correlation score in list
    country : String
        the country name
    '''

    df = pd.DataFrame({'country' : countries,
                       'correlation' : correlations})

    plt.figure(figsize=(20,10))
    sns.barplot(data=df, x='country', y='correlation')
    plt.title('Correlations by Linear Model of ' + country)

    # add the correlation score value to each bar
    # credit: https://stackoverflow.com/a/55866275/510320
    # plt.text() needs the (x, y) location where the number should be placed:
    # index gives the x position and data+1 leaves a small gap above the bar on the y axis
    for index,data in enumerate(correlations):
        plt.text(x=index-0.3 , y =data+1 , s='{:2.2f}'.format(data) , fontdict=dict(fontsize=16))

    plt.show()
    
   
def barBasedOn(data):
    '''
    Calculate correlations and draw plot

    Parameters
    ----------
    data : the pandas dataframe
        The specified country dataset.
    '''
    country = data.Country[0]
    countries = []
    correlations = []
    
    model = linearTrain(data)
    
    # test sets in the same order as countryList and the rows of correlationScores
    datasets = [us, brazil, russia, uk, spain, italy, france, germany, turkey, india, iran, peru, canada, china, chile, arabia, mexico, pakistan, belgium, qatar, bangladesh, africa, taiwan]
    
    for row, (name, dataset) in enumerate(zip(countryList, datasets)):
        if name == country:
            continue # skip the country the model was trained on
        score = cor(model, dataset) # calculate correlation score
        countries.append(name) # append country to show in the bar graph
        correlations.append(score) # append correlation score to show in the bar graph
        correlationScores.iat[row, 1] = correlationScores.iat[row, 1] + score # add correlation score to corresponding country
    
    # start drawing bar plots
    corBar(countries, correlations, country)
    

We train a model on each country in turn and use the other countries as test sets to validate the model and calculate the correlation scores. We can then sum up the correlation scores of each country to check whether all countries have the same or similar patterns.
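
The train-on-one, test-on-the-rest procedure can be sketched in a few lines. The example below uses three synthetic "countries" with a single feature and a NumPy least-squares fit in place of `sm.OLS`; the names `make_country`, `datasets`, and `totals` are hypothetical and only illustrate how the scores accumulate.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_country(slope, n=60):
    """Hypothetical data: one feature (day) and a noisy linear target."""
    day = np.arange(1, n + 1, dtype=float).reshape(-1, 1)
    target = slope * day[:, 0] + rng.normal(0.0, 3.0, size=n)
    return day, target

datasets = {
    'A': make_country(1.0),
    'B': make_country(2.0),
    'C': make_country(-0.5),
}
totals = {name: 0.0 for name in datasets}

for train_name, (X_train, y_train) in datasets.items():
    # fit a least-squares model without intercept on the training country
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    for test_name, (X_test, y_test) in datasets.items():
        if test_name == train_name:
            continue  # a country is never scored against its own model
        predicted = X_test @ coef
        r = np.corrcoef(predicted, y_test)[0, 1]
        totals[test_name] += abs(r) * 100  # accumulate on the tested country

print(totals)
```

Each country ends up with the sum of the scores obtained from the models of all the other countries, which is what `correlationScores` holds after every `barBasedOn` call has run.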

### Predict by the trained model of US 

In [None]:
barBasedOn(us)

Based on the model trained on the US, the predictions for Russia, India, Peru, Canada, Chile, Saudi Arabia, Mexico, Pakistan, Belgium, Qatar, and Bangladesh have higher correlation scores than those for the other countries. The barplot shows the correlation score of each country.

### Predict by the trained model of Brazil 

In [None]:
barBasedOn(brazil)

Based on the model trained on Brazil, the predictions for the US, Russia, India, Peru, Canada, Chile, Saudi Arabia, Mexico, Pakistan, Belgium, Qatar, and Bangladesh have higher correlation scores than those for the other countries. The barplot shows the correlation score of each country.

### Predict by the trained model of Russia

In [None]:
barBasedOn(russia)

Based on the model trained on Russia, the predictions for India, Chile, and Mexico have higher correlation scores than those for the other countries. The barplot shows the correlation score of each country.

### Predict by the trained model of UK 

In [None]:
barBasedOn(uk)

Based on the model trained on the UK, the predictions for the US, Russia, Italy, Turkey, India, Peru, Canada, Chile, Mexico, Pakistan, Belgium, and Bangladesh have higher correlation scores than those for the other countries. The barplot shows the correlation score of each country.

### Predict by the trained model of Spain 

In [None]:
barBasedOn(spain)

### Predict by the trained model of Italy 

In [None]:
barBasedOn(italy)

### Predict by the trained model of France

In [None]:
barBasedOn(france)

### Predict by the trained model of Germany 

In [None]:
barBasedOn(germany)

### Predict by the trained model of Turkey 

In [None]:
barBasedOn(turkey)

### Predict by the trained model of India 

In [None]:
barBasedOn(india)

### Predict by the trained model of Iran 

In [None]:
barBasedOn(iran)

### Predict by the trained model of Peru 

In [None]:
barBasedOn(peru)

### Predict by the trained model of Canada 

In [None]:
barBasedOn(canada)

### Predict by the trained model of China 

In [None]:
barBasedOn(china)

### Predict by the trained model of Chile 

In [None]:
barBasedOn(chile)

### Predict by the trained model of Saudi Arabia 

In [None]:
barBasedOn(arabia)

### Predict by the trained model of Mexico 

In [None]:
barBasedOn(mexico)

### Predict by the trained model of Pakistan

In [None]:
barBasedOn(pakistan)

### Predict by the trained model of Belgium

In [None]:
barBasedOn(belgium)

### Predict by the trained model of Qatar 

In [None]:
barBasedOn(qatar)

### Predict by the trained model of Bangladesh

In [None]:
barBasedOn(bangladesh)

### Predict by the trained model of South Africa

In [None]:
barBasedOn(africa)

### Predict by the trained model of Taiwan 

In [None]:
barBasedOn(taiwan)

After training and validating all the linear regression models, we have the correlation scores of all countries and can plot them.

In [None]:
correlationScores

In [None]:
sortedScores = correlationScores.sort_values('Correlation')

plt.figure(figsize=(20,10))
sns.barplot(data=sortedScores, x='Country', y='Correlation')
plt.title('Overall scores of Mutual Correlation')

# add the correlation score value to each bar, rounded to two decimals so the labels stay readable
for index,data in enumerate(sortedScores.Correlation):
    plt.text(x=index-0.3 , y =data+15 , s='{:2.2f}'.format(data) , fontdict=dict(fontsize=16))

plt.show()


Unsurprisingly, China has the lowest correlation score with other countries. This means China does not have a similar pattern to other countries; it is the only exception, and its score is even lower than half of the average score. Yet we know that China once had the most confirmed cases in the world, and even now it is still in the top 20 countries. The second-lowest correlation score belongs to Taiwan, which is quite different from other countries because it has fewer than 500 confirmed cases and fewer than 10 deaths. I explain both cases in detail in the summary.
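
The comparison against half of the average score can be made explicit with a short check. The numbers below are placeholders, not the notebook's results; with the real data the same two lines could be applied to `correlationScores`.

```python
import pandas as pd

# Hypothetical total scores; the real values come from correlationScores
scores = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Correlation': [1800.0, 1750.0, 1600.0, 700.0],
})

# Countries whose total score is below half of the average score
average = scores['Correlation'].mean()
below_half = scores[scores['Correlation'] < average / 2]

print('average score: %.1f' % average)
print(below_half)
```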

## 4.3 Trend comparison by continent

I compare the trend of new confirmed cases of each country, grouped by three continents: America, Europe, and Asia.

In [None]:
def trend(continent):
    '''
    Draw trend by continent
    
    Parameters
    ----------
    continent : the continent name
        the trend will filter by continent
    '''
    
    plt.figure(figsize=(20,10))
    sns.lineplot(data=world[world['Continent'] == continent ], x='Day', y='ConfirmedNew', hue='Country', style='Country', markers=False, dashes=False, linewidth=1.5)
    plt.show()

### Trend of America continent

In [None]:
trend('America')

According to the plot, almost all countries in America saw daily new confirmed cases start to increase 40 to 50 days after the COVID-19 outbreak began. The US and Brazil are in a more severe situation than the others. Only Canada and the US have had confirmed cases for more than 130 days; the other countries have been affected by COVID-19 for only about 100 days. The daily new confirmed cases of Brazil, Chile, and Mexico are on an increasing trend.
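
Reading the start of the increase off a plot is approximate. One way to make it reproducible is to look for the first day on which the daily new cases exceed a chosen threshold. The sketch below assumes such a threshold-based definition, which is not the method used in this notebook, and `first_day_above` is a hypothetical helper applied to invented numbers.

```python
import numpy as np

def first_day_above(day, confirmed_new, threshold):
    """Return the first day whose new-case count exceeds threshold, or None."""
    day = np.asarray(day)
    confirmed_new = np.asarray(confirmed_new)
    above = np.flatnonzero(confirmed_new > threshold)
    return int(day[above[0]]) if above.size else None

# Hypothetical series: a slow start followed by rapid growth
day = np.arange(1, 11)
confirmed_new = np.array([0, 1, 0, 2, 3, 8, 20, 45, 90, 150])
print(first_day_above(day, confirmed_new, threshold=10))
```

The result depends on the threshold, so the same value should be used for every country when comparing them.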

### Trend of Europe continent

In [None]:
trend('Europe')

According to the plot, almost all countries in Europe saw daily new confirmed cases start to increase after 40 days. France once had the most daily new confirmed cases, but its numbers keep decreasing now. Russia still has many new confirmed cases every day, which makes it one of the top 3 countries by confirmed cases in the world.

### Trend of Asia continent

In [None]:
trend('Asia')

According to the plot, China, Turkey, and Iran saw daily new confirmed cases start to increase after 10 days, while the other countries in Asia started to increase after 40 days. India, Iran, Saudi Arabia, Pakistan, Qatar, and Bangladesh are on an increasing trend of new confirmed cases, whereas Turkey's daily new confirmed cases seem to be slowly decreasing. China and Taiwan are two special cases. China once had more than 15000 new confirmed cases in a day, but the number dropped drastically to double digits within 10 days, which is why it is believed that the reports from China may be fake. According to the media and news in the period after the highest peak, Xi Jinping, the president of China, said the disease was under control. After that, the reported new confirmed cases dropped drastically. Taiwan reached at most 27 new confirmed cases in a day. That is because Taiwan prepared for the COVID-19 outbreak in advance, which made it the only country able to prevent the coronavirus from spreading even though it is the closest country to China.

## 4.4 Choropleth map

In [None]:
# pip install plotly==4.8.1
# reference: https://towardsdatascience.com/visualizing-worldwide-covid-19-data-using-python-plotly-maps-c0fba09a1b37
import plotly.graph_objects as go

fig = go.Figure(data=go.Choropleth(
    locationmode = "country names",
    locations = correlationScores['Country'],
    z = correlationScores['Correlation'],
    text = correlationScores['Correlation'],
    colorscale = 'matter',
    reversescale=True,
    colorbar_title = 'Correlation Score',
))

fig.update_layout(
    title_text='COVID-19 TOP 22 Mutual Correlation Scores',
    geo=dict(
        showcoastlines=True,
    ),
)

fig.show()

We show the correlation scores on the world map. A darker color means a lower correlation score, that is, a COVID-19 pattern less similar to that of the other countries. As we can see, China has the darkest color because it has the lowest correlation score.

# 5. Summary
## 5.1 Linear regression result comparison
The linear model trained on any one country always suits some of the other countries in our samples. The only two exceptions are China and Taiwan. As we all know, the Chinese government ignored and hid the disease at the very beginning and then blocked the information from the world. In the end, the outbreak exploded in Wuhan, and the confirmed cases and deaths suddenly increased sharply. After that, President Xi Jinping controlled the media, the news, and the reported new confirmed cases. It is believed that the Chinese government may have reported fake data to the WHO after Xi’s address to the citizens of China.

Taiwan is in a different situation. Taiwan suffered from SARS in 2003 and has prepared for another disease outbreak ever since. So when Taiwan first heard in early January 2020 that a new coronavirus might have appeared in Wuhan, the Taiwanese government soon decided to set up the Central Epidemic Command Center (CECC) to manage all the information about COVID-19. The CECC put many policies in place to prevent the disease from spreading and to monitor people who might be at risk. The result is an incredibly small number of confirmed cases in Taiwan, although Taiwan is the closest country to China in the world. Do not forget that two to three million people fly back and forth between China and Taiwan every year.

For the above reasons, other countries were not as wary of COVID-19 or of China as Taiwan was. Even the WHO declared at first that this disease was not dangerous, so other countries did not prepare for the outbreak because they believed the WHO. But people in Taiwan never trust China and know that China actually controls the WHO. Hence, Taiwan prepared for it in advance. Taiwan has a population of 23 million but fewer than 500 confirmed cases and fewer than ten deaths, despite the island’s proximity to China, where the outbreak originated. This clearly shows why I say Taiwan has indeed prevented COVID-19 from spreading.

Even though Taiwan has so few cases, its correlation score is higher than China's. I take this as further evidence that China is lying to the world and reporting fake COVID-19 data.

These explanations tell us why the linear models trained on each country fit the other countries but not China or Taiwan: China's data is not accurate, and Taiwan's situation is under control, with no outbreak. We know that Taiwan's first confirmed case came on the same day as the first confirmed cases in South Korea and Japan. Yet by June 8, South Korea had 11,814 confirmed cases and Japan had 17,056, which makes both of them severely affected areas.
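The cross-country comparison described above can be sketched as a matrix of scores, where entry (i, j) is the R² of the model trained on country i when it is evaluated on country j. This is only a minimal illustration, not the exact code used earlier in this notebook: the helper names (`fit_linear`, `r2_score`, `cross_country_r2`), the two-feature setup, and the synthetic countries "A", "B", and "C" are all assumptions made for the example, and no real COVID-19 data is used.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r2_score(coef, X, y):
    """R^2 of a fitted model on (X, y); it can be negative for a poor fit."""
    A = np.column_stack([np.ones(len(X)), X])
    residual = y - A @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def cross_country_r2(data):
    """Train on each country (rows) and score on every country (columns)."""
    names = list(data)
    scores = np.empty((len(names), len(names)))
    for i, train in enumerate(names):
        coef = fit_linear(*data[train])
        for j, test in enumerate(names):
            scores[i, j] = r2_score(coef, *data[test])
    return names, scores

# Synthetic stand-ins (NOT real data): two features per day, e.g. new
# recovered and new deaths, and a target standing in for new confirmed cases.
rng = np.random.default_rng(0)

def make_country(weights, n=60):
    X = rng.uniform(0, 100, size=(n, 2))
    y = X @ np.array(weights) + rng.normal(0, 1.0, size=n)
    return X, y

data = {
    "A": make_country([0.8, 0.1]),
    "B": make_country([0.8, 0.1]),   # same relationship as A
    "C": make_country([-0.5, 0.9]),  # a different relationship
}
names, scores = cross_country_r2(data)
print(names)
print(np.round(scores, 2))
```

In this toy setup the model trained on "A" scores close to 1 on "B" and poorly on "C". That is the kind of contrast I describe between most countries on one side and China and Taiwan on the other.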

## 5.2 Conclusion
According to the trends and plots above, the top 22 countries by confirmed cases have followed a similar course since the disease appeared in them. The only exception in the top 22 is China. These comparisons show that the rise in new confirmed cases begins about 30 to 40 days after the first confirmed case. The linear models trained on each country fit the other countries well, except for China and Taiwan. This means the COVID-19 outbreak follows a similar pattern in almost every country except China and Taiwan. For Taiwan, this is because it prepared for and controlled the disease before the WHO's announcement. For China, the strange patterns are evidently the result of control by the communist government. To this day, China's reporting of new confirmed cases is still not transparent.
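One way to put a number on the "30 to 40 days" observation is to count the days between a country's first confirmed case and the first day its smoothed new-case count crosses a threshold. The sketch below is an assumption-laden illustration rather than the method used for the plots above: the function name `days_to_surge`, the 7-day window, the threshold of 100 cases, and the example series are all hypothetical.

```python
import numpy as np

def days_to_surge(new_cases, threshold=100.0, window=7):
    """Days from the first confirmed case to the first day on which the
    trailing `window`-day mean of new cases exceeds `threshold`.
    Returns None if there are no cases or the threshold is never crossed."""
    new_cases = np.asarray(new_cases, dtype=float)
    nonzero = np.flatnonzero(new_cases > 0)
    if nonzero.size == 0:
        return None
    first_case = nonzero[0]
    # Moving average; entry k covers days k .. k + window - 1.
    kernel = np.ones(window) / window
    smooth = np.convolve(new_cases, kernel, mode="valid")
    above = np.flatnonzero(smooth > threshold)
    if above.size == 0:
        return None
    surge_day = above[0] + window - 1  # last day of the first window above threshold
    return int(surge_day - first_case)

# Hypothetical series (NOT real data): 5 quiet days, 30 days with one new
# case per day, then 10 days with 200 new cases per day.
series = [0] * 5 + [1] * 30 + [200] * 10
print(days_to_surge(series))  # 33
```

The threshold and window strongly affect the result, so any comparison between countries should use the same values for all of them.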

So, why should we not blame China for hiding the coronavirus (COVID-19) information from the world in the first place? 
In fact, we should blame China for its lies and dishonorable behavior regarding COVID-19. The world needs to fight the disease together, but China cares only about its reputation and does not follow international rules.

This project shows how strange the dramatic drop in China's new confirmed cases is, and how different China's pattern is from that of the other top 22 countries. With machine learning, we can not only train models to learn which features are related to new confirmed cases but also use the trained models to find similar patterns in the COVID-19 pandemic.