# Table of Contents

* [1. Introduction](#Introduction)
* [2. Data Analysis](#Data_Analysis)
* [3. Feature Engineering](#Feature_Engineering)
* [4. Model Building](#Model_Building)
    * [4.1. First Model](#First_Model)
    * [4.2. Second Model](#Second_Model)
    * [4.3. Third Model](#Third_Model)
* [5. Model Comparison](#Model_Comparing)

<a id="Introduction"></a>
# 1. Introduction

Kaggle describes this competition as [follows](https://www.kaggle.com/c/covid19-global-forecasting-week-5/overview)

**The Challenge**
<br>Kaggle is launching a companion COVID-19 forecasting challenges to help answer a subset of the NASEM/WHO questions. While the challenge involves developing quantile estimates intervals for confirmed cases and fatalities between May 12 and June 7 by region, the primary goal isn't only to produce accurate forecasts. It's also to identify factors that appear to impact the transmission rate of COVID-19.

## The Story of COVID-19
#### The COVID-19 pandemic is the defining global health crisis of our time and the greatest global humanitarian challenge the world has faced since World War II. The virus has spread widely, and the number of cases is rising daily as governments work to slow its spread. India has moved quickly, implementing a proactive, nationwide lockdown with the goal of flattening the curve and using the time to plan and resource responses adequately.

![alt text](https://kesk.org.tr/wp-content/uploads/2020/04/covid.png)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Importing Libraries

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from scipy import stats
from scipy.stats import norm
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from tqdm import tqdm
from xgboost import XGBRegressor

# Prophet was published as "fbprophet" before version 1.0 and as "prophet" since
try:
    from prophet import Prophet
except ImportError:
    from fbprophet import Prophet

%matplotlib inline
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

## Importing Data

In [None]:
path = "../input/covid19-global-forecasting-week-5/train.csv"
path2 = "../input/covid19-global-forecasting-week-5/test.csv"
path3="../input/covid19-useful-features-by-country/Countries_usefulFeatures.csv"
path4="../input/covid19-global-forecasting-week-5/submission.csv"


In [None]:
df_train = pd.read_csv(path,encoding = 'unicode_escape')
df_test = pd.read_csv(path2,encoding = 'unicode_escape')
df_count_feat=pd.read_csv(path3,encoding = 'unicode_escape')
df_sub=pd.read_csv(path4,encoding = 'unicode_escape')

<a id="Data_Analysis"></a>
# 2. Data Analysis

In [None]:
df_train.head()

**Below we see that the County and Province_State variables have null values.**

In [None]:
df_train.info()

In [None]:
#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

**Below we see that TargetValue is negative at some points, but the number of cases must be at least 0.**

In [None]:
df_train.describe()

**We drop the negative values.**

In [None]:
df_train.drop(df_train[df_train.TargetValue < 0].index, inplace=True)

## Map

In [None]:
country_wise=df_train[df_train['Province_State'].isnull()]
country_wise=country_wise[country_wise['Target']=='ConfirmedCases']
country_wise=country_wise.groupby('Country_Region')['TargetValue'].sum().reset_index()
country_wise=country_wise.rename(columns={"Country_Region":"Country_Region","TargetValue":"ConfimedCases"})
country_wise

### According to the map and the bar chart below, most cases are in the US, followed by Brazil.

In [None]:
def plot_map(df, col, pal):
    df = df[df[col]>0]
    fig = px.choropleth(df, locations="Country_Region", locationmode='country names', 
                  color=col, hover_name="Country_Region", 
                  title=col, hover_data=[col], color_continuous_scale=pal)
#     fig.update_layout(coloraxis_showscale=False)
    fig.show()

In [None]:
plot_map(country_wise, 'ConfimedCases', 'matter')

## Top 15 Countries

In [None]:
def plot_hbar(df, col, n, hover_data=None):  # None avoids a shared mutable default list
    fig = px.bar(df.sort_values(col).tail(n), 
                 x=col, y="Country_Region", color='Country_Region',  
                 text=col, orientation='h', width=700, hover_data=hover_data,
                 color_discrete_sequence = px.colors.qualitative.Dark2)
    fig.update_layout(title=col, xaxis_title="", yaxis_title="", 
                      yaxis_categoryorder = 'total ascending',
                      uniformtext_minsize=8, uniformtext_mode='hide')
    fig.show()

In [None]:
plot_hbar(country_wise, 'ConfimedCases', 15)

**We merge the 3 countries with the highest number of cases into a single table.**

In [None]:
df_Us=df_train[df_train['Country_Region']=='US']
df_Us=df_Us[df_Us['Target']=='ConfirmedCases']
df_Us=df_Us[df_Us['Province_State'].isnull()]
df_plot=df_Us.rename(columns={"Date":"Date","TargetValue": "US_TotalCase"})
df_plot=df_plot[["Date","US_TotalCase"]]
df_Br=df_train[df_train['Country_Region']=='Brazil']
df_Br=df_Br[df_Br['Target']=='ConfirmedCases']
df_Br=df_Br.rename(columns={"Date":"Date","TargetValue": "Brazil_TotalCase"})
df_Br=df_Br[["Date","Brazil_TotalCase"]]
df_Rus=df_train[df_train['Country_Region']=='Russia']
df_Rus=df_Rus[df_Rus['Target']=='ConfirmedCases']
df_Rus=df_Rus.rename(columns={"Date":"Date","TargetValue": "Russia_TotalCase"})
df_Rus=df_Rus[["Date","Russia_TotalCase"]]

In [None]:
df_plot=df_plot.merge(df_Br,on='Date').merge(df_Rus,on='Date')
df_plot

### The graph below shows how cases evolve in the top 3 countries. After May, cases in the US and Brazil increase while those in Russia decrease.

In [None]:
# Select several columns with a list; tuple-style selection on a groupby is no longer supported
temp = df_plot.groupby('Date')[['Russia_TotalCase','Brazil_TotalCase','US_TotalCase']].sum().reset_index()
temp = temp.melt(id_vars="Date", value_vars=['Russia_TotalCase','Brazil_TotalCase','US_TotalCase'],
                 var_name='Case', value_name='Count')
temp.head()

fig = px.area(temp, x="Date", y="Count", color='Case', height=600, width=700,
             title='Cases over time', color_discrete_sequence = ["blue", "green", "red"])
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()

In [None]:
df_train2=df_train.merge(df_count_feat[['Country_Region','Tourism','Latitude','Longtitude','Mean_Age','Lockdown_Date','Lockdown_Type']], on='Country_Region', how='inner', sort=False)
df_train2.head()

<a id="Feature_Engineering"></a>
# 3. Feature Engineering

### Creating ConfirmedCases and Fatalities variables for some calculation

In [None]:
def confatal(df):
    df_Confirmed_Cases=df[df["Target"]=="ConfirmedCases"]
    df_Confirmed_Cases=df_Confirmed_Cases.rename(columns={"TargetValue": "ConfirmedCases"})
    df_Confirmed_Cases=df_Confirmed_Cases.drop(['Target'], axis=1)
    df_Fatalities=df[df["Target"]=="Fatalities"]
    df_Fatalities=df_Fatalities.rename(columns={"TargetValue": "Fatalities"})
    df_Fatalities=df_Fatalities.drop(['Target'], axis=1)
    df=pd.merge(df_Confirmed_Cases,df_Fatalities[['Date','County','Province_State','Country_Region','Fatalities']],on=['Date','County','Province_State','Country_Region'], how='inner')
    df=df[['Id','County','Province_State','Country_Region','Population','Weight','Date','ConfirmedCases','Fatalities','Tourism','Latitude','Longtitude','Mean_Age','Lockdown_Date','Lockdown_Type']]
    return df

In [None]:
df_confat=confatal(df_train2)
df_confat.head()
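
To illustrate what `confatal` does, here is a minimal sketch of the same long-to-wide reshaping on a toy frame. The toy data and the use of `pivot_table` are assumptions for illustration, not the method used above. Note that `pivot_table` drops rows whose index keys are missing by default, so the `County` and `Province_State` nulls in the real data would need to be filled first.

```python
import pandas as pd

# Toy long-format frame: one row per (date, region, target), as in train.csv
long_df = pd.DataFrame({
    "Date": ["2020-05-01", "2020-05-01", "2020-05-02", "2020-05-02"],
    "Country_Region": ["US", "US", "US", "US"],
    "Target": ["ConfirmedCases", "Fatalities", "ConfirmedCases", "Fatalities"],
    "TargetValue": [100, 5, 120, 7],
})

# Turn the two Target labels into two value columns
wide_df = (
    long_df.pivot_table(index=["Date", "Country_Region"],
                        columns="Target", values="TargetValue", aggfunc="sum")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "Target" axis label
print(wide_df)
```

Each date now has one row with a `ConfirmedCases` and a `Fatalities` column, which is the shape `df_confat` has after the merge.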

### Creating Datetime Features

**We create datetime features to use in the model.**

In [None]:
def create_date_features(df):
    # Parse Date; unparseable values become NaT instead of raising
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    df['Year']=df.Date.dt.year
    df['Month']=df.Date.dt.month
    df['Day']=df.Date.dt.day  # integer day of month, consistent with Year and Month
    df['Day_number_of_week'] = df.Date.dt.weekday  # Monday=0 ... Sunday=6
    return df

In [None]:
create_date_features(df_confat)

### Transformation of datetime features

In [None]:
DAY_NAMES = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
             4: "Friday", 5: "Saturday", 6: "Sunday"}

def getdayofweek(dow):
    # Map the weekday number (Monday=0) to its name; unknown values give None
    return DAY_NAMES.get(dow)

In [None]:
df_confat['Dayofweek'] = df_confat.Day_number_of_week.apply(getdayofweek)
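
As a possible alternative to the hand-written mapping, pandas can produce the weekday name directly from a datetime column. The short sketch below uses made-up dates and assumes the column has already been parsed with `pd.to_datetime`.

```python
import pandas as pd

# Three illustrative dates (not taken from the competition data)
dates = pd.to_datetime(pd.Series(["2020-05-11", "2020-05-12", "2020-05-17"]))

numbers = dates.dt.weekday    # Monday=0 ... Sunday=6, as in Day_number_of_week
names = dates.dt.day_name()   # English weekday names by default
print(list(zip(numbers, names)))
```

Both approaches should give the same labels, so the choice is mainly a matter of taste.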

<a id="Model_Building"></a>
# 4. Model Building

<a id="First_Model"></a>
>## 4.1 Time Series for US - with Prophet
I use Prophet for the time series because it provides intuitive parameters that are easy to tune.

In [None]:
df_Us=df_train[df_train['Country_Region']=='US']
df_Us=df_Us[df_Us['Target']=='ConfirmedCases']

In [None]:
df_Us=df_Us[df_Us['Province_State'].isnull()]
df_Us

In [None]:
plt.figure(figsize=(20, 10))
sns.lineplot(data=df_Us[df_Us['Date']<"2020-05-01"], x="Date", y="TargetValue")
plt.xticks(rotation=90);

Since we are asked to predict the cases after May 12, we split the observations at this date into train and test sets. The train set includes 110 observations and the test set 30.

In [None]:
Train_us=df_Us[df_Us["Date"]<"2020-05-12"]
Test_us=df_Us[df_Us["Date"]>="2020-05-12"]
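
The split above compares the `Date` column with a string. This works because ISO-formatted dates (`YYYY-MM-DD`) sort lexicographically in chronological order. A minimal sketch on a toy frame (the values are invented for illustration):

```python
import pandas as pd

# Toy frame with ISO-formatted date strings
toy = pd.DataFrame({
    "Date": ["2020-05-10", "2020-05-11", "2020-05-12", "2020-05-13"],
    "TargetValue": [10, 20, 30, 40],
})

cutoff = "2020-05-12"
train = toy[toy["Date"] < cutoff]    # everything strictly before the cutoff
test = toy[toy["Date"] >= cutoff]    # the cutoff day and everything after it
print(len(train), len(test))
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for a forecasting task.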

In [None]:
Train_us=Train_us[["Date","TargetValue"]]
Test_us=Test_us[["Date","TargetValue"]]
Train_us=Train_us.rename(columns={"Date":"ds","TargetValue":"y"})
Test_us=Test_us.rename(columns={"Date":"ds","TargetValue":"y"})

In [None]:
model=Prophet(growth='linear',changepoint_prior_scale=60)
model.fit(Train_us)
forecast = model.predict(Test_us)
fig = model.plot_components(forecast)

In [None]:
plot = model.plot(forecast)

In [None]:
Test_us['yhat']=forecast['yhat'].values
Test_us

When we look at the actual vs. prediction chart, we see that the model catches the change points, but the gap between the actual values and the predictions increased after May.

In [None]:
plt.figure(figsize=(20, 8))
plt.plot(Test_us['ds'], Test_us['y'], 'b-', label = 'Actual')
plt.plot(Test_us['ds'], Test_us['yhat'], 'r--', label = 'Prediction')
plt.xlabel('Date',rotation=90); plt.ylabel('Confirmed cases'); plt.title('Actual vs Prediction')
plt.xticks(rotation=90)
plt.legend();

Accuracy is 84% for this baseline Prophet model (linear growth, `changepoint_prior_scale=60`). We should try parameter tuning to increase accuracy.

In [None]:
Test_us['diff']=(Test_us.y-Test_us.yhat).abs()
acc_ts=(1-(Test_us['diff'].sum()/Test_us['y'].sum()))*100
acc_ts
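
The accuracy used here is 100 × (1 − Σ|y − ŷ| / Σy), that is, one minus the total absolute error relative to the total number of actual cases. The helper below is a small sketch of the same calculation; the function name `accuracy_pct` is hypothetical and the numbers are made up.

```python
import numpy as np

def accuracy_pct(y_true, y_pred):
    """Return 100 * (1 - sum(|y - yhat|) / sum(y)), the score used in this notebook."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return (1 - np.abs(y_true - y_pred).sum() / y_true.sum()) * 100

# Absolute errors are 10 and 20, actual total is 300, so the relative error is 10%
score = accuracy_pct([100, 200], [90, 220])
print(score)
```

Because the errors are summed before dividing, days with many cases weigh more than days with few cases.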

I use MAE, MSE and RMSE as performance metrics because they are easy to explain. The prediction differs from the actual value by 3378 cases. That is not bad, because the US reports around 20,000 to 25,000 cases a day.

In [None]:
MAE_ts=metrics.mean_absolute_error(Test_us['y'], Test_us['yhat'])
MSE_ts=metrics.mean_squared_error(Test_us['y'], Test_us['yhat'])
RMSE_ts=np.sqrt(metrics.mean_squared_error(Test_us['y'], Test_us['yhat']))
print('MAE:', MAE_ts)
print('MSE:', MSE_ts)
print('RMSE:', RMSE_ts)
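
For reference, the three metrics can also be computed by hand. The sketch below uses invented numbers and plain NumPy, and it should agree with the `sklearn.metrics` functions used above.

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

err = y_true - y_pred
mae = np.mean(np.abs(err))   # mean absolute error
mse = np.mean(err ** 2)      # mean squared error
rmse = np.sqrt(mse)          # root mean squared error, in the same unit as y
print(mae, mse, rmse)
```

MAE and RMSE are expressed in cases per day, so they can be compared directly with the daily case counts, while MSE is in squared units.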

### Parameter Tuning for Prophet

In [None]:
params_grid = {'seasonality_mode':('multiplicative','additive'),
               'changepoint_prior_scale':[0.5,1.2,2.5],
              'seasonality_prior_scale':[0.5,1.2,2.5]
              }
grid = ParameterGrid(params_grid)
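
`ParameterGrid` expands the dictionary into every combination of the listed values. As a quick sketch, the same grid as above yields 2 × 3 × 3 = 18 candidate settings (the variable names `toy_grid` and `first` are only for illustration):

```python
from sklearn.model_selection import ParameterGrid

toy_grid = ParameterGrid({
    "seasonality_mode": ["multiplicative", "additive"],
    "changepoint_prior_scale": [0.5, 1.2, 2.5],
    "seasonality_prior_scale": [0.5, 1.2, 2.5],
})

print(len(toy_grid))       # number of combinations
first = list(toy_grid)[0]  # each element is a plain dict of parameter values
print(first)
```

Each Prophet fit takes some time, so the size of the grid directly determines how long the tuning loop runs.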

In [None]:
results = []
for p in tqdm(grid):

    # Fresh copy of the hold-out rows for every parameter combination
    Valid = Test_us[['ds', 'y']].copy()

    m = Prophet(changepoint_prior_scale=p['changepoint_prior_scale'],
                seasonality_prior_scale=p['seasonality_prior_scale'],
                seasonality_mode=p['seasonality_mode'],
                interval_width=0.95)

    m.fit(Train_us)

    # predict() returns one row per input date, sorted by ds; Valid is already
    # in date order, so the values can be aligned by position as in the cells above
    forecast = m.predict(Valid[['ds']])
    Valid['yhat'] = forecast['yhat'].values

    # performance metric
    Valid['diff'] = (Valid.y - Valid.yhat).abs()
    acc = (1 - (Valid['diff'].sum() / Valid['y'].sum())) * 100

    results.append({'Acc': acc, 'Parameters': p})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list
model_parameters = pd.DataFrame(results, columns=['Acc', 'Parameters'])
parameters = model_parameters.sort_values(by=['Acc'], ascending=False)
parameters = parameters.reset_index(drop=True)

best_parameters = parameters['Parameters'][0]

In [None]:
best_parameters

We fit the model with the best parameters.

In [None]:
m = Prophet(
        growth="linear",
        seasonality_mode=best_parameters['seasonality_mode'],
        changepoint_prior_scale=best_parameters['changepoint_prior_scale'],
        seasonality_prior_scale=best_parameters['seasonality_prior_scale']
        )
m.fit(Train_us)
forecast=m.predict(Test_us)

We see the trend and weekly seasonality learned from the train dataset.

In [None]:
fig = m.plot_components(forecast)

In [None]:
plot = m.plot(forecast)

In [None]:
Test_us['yhat']=forecast['yhat'].values
plt.figure(figsize=(20, 8))
plt.plot(Test_us['ds'], Test_us['y'], 'b-', label = 'Actual')
plt.plot(Test_us['ds'], Test_us['yhat'], 'r--', label = 'Prediction')
plt.xlabel('Date',rotation=90); plt.ylabel('Confirmed cases'); plt.title('Actual vs Prediction')
plt.xticks(rotation=90)
plt.legend();

Accuracy is better with the best parameters

In [None]:
Test_us['diff']=(Test_us.y-Test_us.yhat).abs()
acc_ts2=(1-(Test_us['diff'].sum()/Test_us['y'].sum()))*100
acc_ts2

In [None]:
MAE_ts2=metrics.mean_absolute_error(Test_us['y'], Test_us['yhat'])
MSE_ts2=metrics.mean_squared_error(Test_us['y'], Test_us['yhat'])
RMSE_ts2=np.sqrt(metrics.mean_squared_error(Test_us['y'], Test_us['yhat']))
print('MAE:', MAE_ts2)
print('MSE:', MSE_ts2)
print('RMSE:', RMSE_ts2)

<a id="Second_Model"></a>
>## 4.2 US Confirmed Case Forecasting with Random Forest Regressor

**Creating Date Features**

In [None]:
# Work on a copy so the date features are not written into a slice of df_train
reg_us=create_date_features(df_Us.copy())

**Creating day of week variable from Date**

In [None]:
reg_us['Dayofweek'] = reg_us.Day_number_of_week.apply(getdayofweek)

**One hot encoding for Dayofweek Feature**
<br>We use one-hot encoding to transform the categorical variable (Dayofweek) so that we can use it in our model.

In [None]:
reg_us=pd.get_dummies(reg_us,columns=['Dayofweek'])
reg_us.head()
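
As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a toy `Dayofweek` column (the three rows are invented). Each distinct value becomes its own indicator column named `Dayofweek_<value>`.

```python
import pandas as pd

# Toy frame with a single categorical column
toy = pd.DataFrame({"Dayofweek": ["Monday", "Tuesday", "Monday"]})

encoded = pd.get_dummies(toy, columns=["Dayofweek"])
print(encoded)
```

Only values that occur in the data get a column, so the seven `Dayofweek_*` columns selected later exist only because every weekday appears in `reg_us`.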

**Splitting Data - Train and Test**
<br>We will try to predict the cases after 2020-05-12, so we split the dataset at this date.

In [None]:
Train_reg_us=reg_us[reg_us["Date"]<"2020-05-12"]
Test_reg_us=reg_us[reg_us["Date"]>="2020-05-12"]

In [None]:
x_train_reg_us=Train_reg_us[['Month','Dayofweek_Monday','Dayofweek_Tuesday','Dayofweek_Wednesday','Dayofweek_Thursday','Dayofweek_Friday','Dayofweek_Saturday','Dayofweek_Sunday']]
y_train_reg_us=Train_reg_us[['TargetValue']]

In [None]:
x_test_reg_us=Test_reg_us[['Month','Dayofweek_Monday','Dayofweek_Tuesday','Dayofweek_Wednesday','Dayofweek_Thursday','Dayofweek_Friday','Dayofweek_Saturday','Dayofweek_Sunday']]
y_test_reg_us=Test_reg_us[['TargetValue']]

In [None]:
rf_us = RandomForestRegressor(n_estimators=100)
rf_us.fit(x_train_reg_us,y_train_reg_us)
pred_rf = rf_us.predict(x_test_reg_us)

Accuracy is 83.42% for the Random Forest regressor with default parameters. We should try parameter tuning to increase accuracy.

In [None]:
y_test_reg_us['diff']=(y_test_reg_us.TargetValue-pred_rf).abs()
acc_rf=(1-(y_test_reg_us['diff'].sum()/y_test_reg_us['TargetValue'].sum()))*100
acc_rf

In [None]:
MAE_rf=metrics.mean_absolute_error(y_test_reg_us.TargetValue, pred_rf)
MSE_rf=metrics.mean_squared_error(y_test_reg_us.TargetValue, pred_rf)
RMSE_rf=np.sqrt(metrics.mean_squared_error(y_test_reg_us.TargetValue, pred_rf))
print('MAE:', MAE_rf)
print('MSE:', MSE_rf)
print('RMSE:', RMSE_rf)

### Parameter Tuning for RandomForest Regressor

In [None]:
param_grid = { 
        "n_estimators"      : [10,20,100,200,300,500],
        # "auto" (all features for a regressor) was removed in scikit-learn 1.3; 1.0 is the equivalent
        "max_features"      : [1.0, "sqrt", "log2"],
        "min_samples_split" : [2,4,6,8],
        "bootstrap": [True, False],
            }
grid = ParameterGrid(param_grid)

In [None]:
results = []
for p in tqdm(grid):

    X_Train = x_train_reg_us.copy()
    Y_Train = y_train_reg_us.copy()
    X_Valid = x_test_reg_us.copy()
    Y_Valid = y_test_reg_us.copy()
    m = RandomForestRegressor(n_estimators=p['n_estimators'],
                              max_features=p['max_features'],
                              min_samples_split=p['min_samples_split'],
                              bootstrap=p['bootstrap'])

    # ravel() passes the single target column as the 1-D array scikit-learn expects
    m.fit(X_Train, Y_Train.values.ravel())
    pred_rf2 = m.predict(X_Valid)

    Y_Valid['yhat'] = pred_rf2

    # performance metric
    Y_Valid['diff'] = (Y_Valid.TargetValue - Y_Valid.yhat).abs()
    acc = (1 - (Y_Valid['diff'].sum() / Y_Valid['TargetValue'].sum())) * 100

    results.append({'Acc': acc, 'Parameters': p})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list
model_parameters = pd.DataFrame(results, columns=['Acc', 'Parameters'])
parameters = model_parameters.sort_values(by=['Acc'], ascending=False)
parameters = parameters.reset_index(drop=True)

best_parameters = parameters['Parameters'][0]

In [None]:
best_parameters

In [None]:
m = RandomForestRegressor(
        bootstrap=best_parameters['bootstrap'],
        max_features=best_parameters['max_features'],
        min_samples_split=best_parameters['min_samples_split'],
        n_estimators=best_parameters['n_estimators']
        )
m.fit(x_train_reg_us,y_train_reg_us)
pred_rf2 = m.predict(x_test_reg_us)

Accuracy is better with the best parameters

In [None]:
y_test_reg_us['diff']=(y_test_reg_us.TargetValue-pred_rf2).abs()
acc_rf2=(1-(y_test_reg_us['diff'].sum()/y_test_reg_us['TargetValue'].sum()))*100
acc_rf2

In [None]:
MAE_rf2=metrics.mean_absolute_error(y_test_reg_us.TargetValue, pred_rf2)
MSE_rf2=metrics.mean_squared_error(y_test_reg_us.TargetValue, pred_rf2)
RMSE_rf2=np.sqrt(metrics.mean_squared_error(y_test_reg_us.TargetValue, pred_rf2))
print('MAE:', MAE_rf2)
print('MSE:', MSE_rf2)
print('RMSE:', RMSE_rf2)

<a id="Third_Model"></a>
>## 4.3 US Confirmed Case Forecasting with XGBoost Regressor

In [None]:
xgb_us = XGBRegressor(n_estimators=100)
xgb_us.fit(x_train_reg_us,y_train_reg_us)
xgb_pred = xgb_us.predict(x_test_reg_us)

Accuracy is 83% for the XGBoost regressor with default parameters. We should try parameter tuning to increase accuracy.

In [None]:
y_test_reg_us['diff']=(y_test_reg_us.TargetValue-xgb_pred).abs()
acc_xg=(1-(y_test_reg_us['diff'].sum()/y_test_reg_us['TargetValue'].sum()))*100
acc_xg

In [None]:
MAE_xgb=metrics.mean_absolute_error(y_test_reg_us.TargetValue, xgb_pred)
MSE_xgb=metrics.mean_squared_error(y_test_reg_us.TargetValue, xgb_pred)
RMSE_xgb=np.sqrt(metrics.mean_squared_error(y_test_reg_us.TargetValue, xgb_pred))
print('MAE:', MAE_xgb)
print('MSE:', MSE_xgb)
print('RMSE:', RMSE_xgb)

### Parameter Tuning for XGBOOST Regressor

We fit the XGBoost regressor for each combination in the parameter grid and score it on the hold-out data.

In [None]:
param_grid = { 
            'nthread':[4], #when use hyperthread, xgboost may become slower,
            'learning_rate': [.03, 0.05, .07], #so called `eta` value
            'max_depth': [5, 6, 7],
            'min_child_weight': [1,4],
            'subsample': [0.7],
            'colsample_bytree': [0.7],
            'n_estimators': [100,200,500]
            }
grid = ParameterGrid(param_grid)

In [None]:
results = []
for p in tqdm(grid):

    X_Train = x_train_reg_us.copy()
    Y_Train = y_train_reg_us.copy()
    X_Valid = x_test_reg_us.copy()
    Y_Valid = y_test_reg_us.copy()
    m = XGBRegressor(nthread=p['nthread'],
                     learning_rate=p['learning_rate'],
                     max_depth=p['max_depth'],
                     min_child_weight=p['min_child_weight'],
                     subsample=p['subsample'],
                     colsample_bytree=p['colsample_bytree'],
                     n_estimators=p['n_estimators'])

    m.fit(X_Train, Y_Train)
    pred_xg2 = m.predict(X_Valid)

    Y_Valid['yhat'] = pred_xg2

    # performance metric
    Y_Valid['diff'] = (Y_Valid.TargetValue - Y_Valid.yhat).abs()
    acc = (1 - (Y_Valid['diff'].sum() / Y_Valid['TargetValue'].sum())) * 100

    results.append({'Acc': acc, 'Parameters': p})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list
model_parameters = pd.DataFrame(results, columns=['Acc', 'Parameters'])
parameters = model_parameters.sort_values(by=['Acc'], ascending=False)
parameters = parameters.reset_index(drop=True)

best_parameters = parameters['Parameters'][0]

In [None]:
best_parameters

In [None]:
m = XGBRegressor(
        nthread=best_parameters['nthread'],
        learning_rate=best_parameters['learning_rate'],
        max_depth=best_parameters['max_depth'],
        min_child_weight=best_parameters['min_child_weight'],
        subsample=best_parameters['subsample'],
        colsample_bytree=best_parameters['colsample_bytree'],
        n_estimators=best_parameters['n_estimators']
        )
m.fit(x_train_reg_us,y_train_reg_us)
pred_xg2 = m.predict(x_test_reg_us)

Accuracy is better with the best parameters

In [None]:
y_test_reg_us['diff']=(y_test_reg_us.TargetValue-pred_xg2).abs()
acc_xg2=(1-(y_test_reg_us['diff'].sum()/y_test_reg_us['TargetValue'].sum()))*100
acc_xg2

In [None]:
MAE_xgb2=metrics.mean_absolute_error(y_test_reg_us.TargetValue, pred_xg2)
MSE_xgb2=metrics.mean_squared_error(y_test_reg_us.TargetValue, pred_xg2)
RMSE_xgb2=np.sqrt(metrics.mean_squared_error(y_test_reg_us.TargetValue, pred_xg2))
print('MAE:', MAE_xgb2)
print('MSE:', MSE_xgb2)
print('RMSE:', RMSE_xgb2)

<a id="Model_Comparing"></a>
# 5. Model Comparison for Confirmed Cases for US

In [None]:
performance = {'Model':['Prophet','Prophet','Random Forest','Random Forest','XGBoost','XGBoost'],
        'Parameters':['Default','Best','Default','Best','Default','Best'],
        'Accuracy':[acc_ts,acc_ts2,acc_rf,acc_rf2,acc_xg,acc_xg2],
        'MAE': [MAE_ts,MAE_ts2, MAE_rf,MAE_rf2,MAE_xgb,MAE_xgb2],
        'MSE': ['{:f}'.format(MSE_ts),'{:f}'.format(MSE_ts2),'{:f}'.format(MSE_rf),'{:f}'.format(MSE_rf2),'{:f}'.format(MSE_xgb),'{:f}'.format(MSE_xgb2)], 
        'RMSE': ['{:f}'.format(RMSE_ts),'{:f}'.format(RMSE_ts2),'{:f}'.format(RMSE_rf),'{:f}'.format(RMSE_rf2),'{:f}'.format(RMSE_xgb),'{:f}'.format(RMSE_xgb2)]}
# Keep the table as a DataFrame so it can be displayed and passed to seaborn directly
df_performance = pd.DataFrame(performance)
df_performance

In [None]:
fig, ax = plt.subplots(figsize=(9,4))
a=sns.barplot(data=df_performance,x="Model", y="Accuracy",hue = 'Parameters')
a.set_title("Model Performance",fontsize=15)
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.legend(loc='upper left')
plt.ylim(60, 92)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(9,4))
a=sns.barplot(data=df_performance,x="Model", y="MAE",hue = 'Parameters')
a.set_title("Model Performance",fontsize=15)
plt.xlabel('Model')
plt.ylabel('Mean Absolute Error')
plt.ylim(500, 4500)
plt.legend(loc='upper left')
plt.show()