# Modelling

## Steps for predicting fatalities and confirmed cases

We follow the below steps for the modelling process:

* Load Required Packages and Datasets
* Combine Train and Test Data
* Join Government Measures Data
* Join Distance from China
* Join COVID Indicators
* Prepare Data
* Split into Train and Test Sets
* Functions to make predictions
* Linear Models
    + Linear Regression
    + Lasso Regression
    + Ridge Regression
* Non-Linear Models
    + Decision Trees
    + Random Forests
    + Gradient Boosting
* Choosing the Best Model for Submission

Our focus in the modelling process is to model data for predicting fatalities and confirmed cases, so we won't be diving much into the inferences from each of these models, except evaluating their performance in terms of predictions.

In [744]:
# Loading Necessary Packages

import pandas as pd
import numpy as np
import sys
import matplotlib.pyplot as plt
%matplotlib inline
import io
import seaborn as sns
import time
import datetime

from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

from sklearn import datasets, linear_model
from sklearn.linear_model import Ridge
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn import ensemble

from geopy.distance import geodesic
from geopy.distance import distance
from geopy import Point

In [712]:
# Loading Kaggle Files

train_data = pd.read_csv("train.csv", encoding= 'unicode_escape', parse_dates = ['Date'])
test_data = pd.read_csv("test.csv", encoding= 'unicode_escape', parse_dates = ['Date'])
submission_data = pd.read_csv("submission.csv", encoding= 'unicode_escape')

# Loading Distance From China Data
lat_long = pd.read_csv("johns-hopkins-covid-19-daily-dashboard-cases-by-country.csv", encoding= 'unicode_escape')

# Loading Government Measurement Data

govt_measures_data = pd.read_csv("acaps-covid-19-government-measures-dataset.csv", encoding= 'unicode_escape')

# Loading Covid Indicators Data

covid_indicators_data = pd.read_csv("inform-covid-indicators.csv", encoding= 'unicode_escape')

# Combine Train and Test Data

In [713]:
#Adding Indicator Columns to identify datasets
train_data['data_set'] = 'Train'
test_data['data_set'] = 'Test'

#Convert Target columns into log scale
train_data['ConfirmedCases'] = np.log(train_data['ConfirmedCases']+1)
train_data['Fatalities'] = np.log(train_data['Fatalities']+1)

#Adding columns to test data set
test_data = test_data.rename(columns={"ForecastId": "Id"})
test_data['ConfirmedCases'] = None
test_data['Fatalities'] = None

data = pd.concat([train_data,test_data])

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  


In [714]:
# Days since first occurence

data['days'] = data['Date']-data['Date'].min()
data['days'] = data['days'].astype('timedelta64[D]').astype('int32')

# Add Government Measures Data

In [715]:
#Clean Data

var_req = ['country', 'measure']
govt_measures_data = govt_measures_data[var_req]
govt_measures_data['measure'] = govt_measures_data['measure'].str.lower()
govt_measures_data = govt_measures_data.drop_duplicates()


In [716]:
#Create Categorical Columns

govt_measures_data = govt_measures_data.reset_index()
govt_measures_data['val'] = 1
govt_measures_data = govt_measures_data.set_index(['index','country','measure']).unstack(level=2).fillna(0).groupby('country').max()
govt_measures_data = govt_measures_data.reset_index()

In [717]:
#Renaming Columns

names = govt_measures_data.columns
new_names = ['Country_Region', 'additional health/documents requirements upon arrival', 'amendments to funeral and burial regulations', 'awareness campaigns', 'border checks', 'border closure', 'changes in prison-related policies', 'checkpoints within the country', 'complete border closure', 'curfews', 'domestic travel restrictions', 'economic measures', 'emergency administrative structures activated or established', 'full lockdown', 'general recommendations', 'health screenings in airports and border crossings', 'humanitarian exemptions', 'international flights suspension', 'introduction of quarantine policies', 'limit product imports/exports', 'limit public gatherings', 'lockdown of refugee/idp camps or other minorities', 'mass population testing', 'military deployment', 'obligatory medical tests not related to covid-19', 'other public health measures enforced', 'partial lockdown', 'psychological assistance and medical social work', 'public services closure', 'requirement to wear protective gear in public', 'schools closure', 'state of emergency declared', 'strengthening the public health system', 'surveillance and monitoring', 'testing policy', 'visa restrictions']
new_names = ["gm_"+ s for s in new_names]
new_names[0] = 'Country_Region'
govt_measures_data.columns = new_names


In [718]:
#Join with the original dataset

data = data.merge(govt_measures_data, how = 'left', on='Country_Region')

# Add Distance From China

In [719]:
lat_long = lat_long[['country_region','lat','long']]
lat_long = lat_long.dropna(0)

#Wuhan Co-ordinates
Wuhan_Cord = (30.583332, 114.2833330)

#Calculate Distance from China
def calc_distance(row, site_coords):
    target_coords = (row['lat'], row['long'])
    dist = geodesic(site_coords, target_coords).miles
    return(dist)

lat_long['distance_from_china'] = lat_long.apply(calc_distance, site_coords=Wuhan_Cord, axis=1)

#Get Rid of Lat, Long Columns
lat_long = lat_long.rename(columns={"country_region": "Country_Region"}).drop(['lat', 'long'], axis=1)

In [720]:
data = data.merge(lat_long, how = 'left', on='Country_Region')

# Add CoVID Indicators

In [721]:
# Adding Indicator Columns
covid_indicators_data = covid_indicators_data.drop(['iso3'], axis=1)
names2 = covid_indicators_data.columns
names2 = ["ci_"+ s for s in names2]
names2[0] = 'Country_Region'
covid_indicators_data.columns = names2
covid_indicators_data = covid_indicators_data.replace({'No data': 0.0001, 'x':0.0001})

In [722]:
# Merge with Existing Data
data = data.merge(covid_indicators_data, how = 'left', on='Country_Region')

# Prepare Data

In [723]:
#Define function to deal with overfitting/incorrect predictions later on 
def data_prep(data):
    data = data.astype('float32')
    data = np.nan_to_num(data)
    
    pt = PowerTransformer()
    pt.fit_transform(data)
    
    scaler = StandardScaler()
    scaler.fit_transform(data)

    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    imp_mean.fit_transform(data)

    sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
    sel.fit_transform(data)
    
    return(data)

# Split into train and test 



In [724]:
#Isolate Training and Testing Test

#Training Set
train = data[data['data_set'] == 'Train']
train = train.drop(['data_set', 'Id', 'Province_State', 'Country_Region', 'Date'], axis = 1)

train_confirmed_X = train.drop(['Fatalities', 'ConfirmedCases'], axis = 1)
train_fatalities_X = train.drop(['Fatalities','ConfirmedCases'], axis = 1)

train_confirmed_y = train['ConfirmedCases']
train_fatalities_y = train['Fatalities']

train_confirmed_X = data_prep(train_confirmed_X)
train_fatalities_X = data_prep(train_fatalities_X)

#Testing Set
test = data[data['data_set'] == 'Test']
test = test.drop(['data_set', 'Id', 'Province_State', 'Country_Region', 'Date','ConfirmedCases', 'Fatalities'], axis = 1)

test = data_prep(test)

  loglike = -n_samples / 2 * np.log(x_trans.var())
  x = um.multiply(x, x, out=x)
  loglike = -n_samples / 2 * np.log(x_trans.var())
  x = um.multiply(x, x, out=x)
  loglike = -n_samples / 2 * np.log(x_trans.var())
  x = um.multiply(x, x, out=x)


# Functions for Evaluation Metric and Making Predictions

In [None]:
# RMSLE: Evaluation Metric

def rmsle(real, predicted):
    sum=0.0
    for x in range(len(predicted)):
        if predicted[x]<0 or real[x]<0: #check for negative values
            continue
        p = np.log(predicted[x]+1)
        r = np.log(real[x]+1)
        sum = sum + (p - r)**2
    return (sum/len(predicted))**0.5

In [763]:
#Functions for making predictions

def make_pred(train_x, train_y, model):
    #Train test split
    X_train, X_test, y_train, y_test = train_test_split(train_x, train_y, test_size=0.3, random_state=66)

    # Create linear regression object
    regr = model

    # Train the model using the training sets
    regr.fit(X_train, y_train)

    # Make predictions using the testing set
    y_pred = regr.predict(X_test)
    
    #Transform to log scale
    y_pred = np.exp(y_pred)-1
    
    y_test = np.array(y_test).astype(float)
    y_test = np.exp(y_test)-1
    
    #Calculate Error
    err = rmsle(y_test, y_pred)
    
    return(y_pred, err, regr)

# Linear Regression

In [734]:
# Confirmed Cases Prediction
lin_pred_confirmed = make_pred(train_confirmed_X, train_confirmed_y,linear_model.LinearRegression())

# Fatalities Cases Prediction
lin_pred_fatalities = make_pred(train_fatalities_X, train_fatalities_y,linear_model.LinearRegression())

In [740]:
# Mean RMSLE

(lin_pred_confirmed[1]+lin_pred_fatalities[1])/2

1.328103957784514

# Lasso Regression

In [742]:
# Confirmed Cases Prediction
lasso_pred_confirmed = make_pred(train_confirmed_X, train_confirmed_y,linear_model.Lasso(alpha=0.1))

# Fatalities Cases Prediction
lasso_pred_fatalities = make_pred(train_fatalities_X, train_fatalities_y,linear_model.Lasso(alpha=0.1))

In [743]:
# Mean RMSLE
(lasso_pred_confirmed[1]+lasso_pred_fatalities[1])/2

1.419285223064493

# Ridge Regression

In [746]:
# Confirmed Cases Prediction
ridge_pred_confirmed = make_pred(train_confirmed_X, train_confirmed_y,Ridge(alpha=1.0))

# Fatalities Cases Prediction
ridge_pred_fatalities = make_pred(train_fatalities_X, train_fatalities_y,Ridge(alpha=1.0))

  overwrite_a=True).T
  overwrite_a=True).T


In [747]:
# Mean RMSLE
(ridge_pred_confirmed[1]+ridge_pred_confirmed[1])/2

1.646777130221392

# Decision Tree

In [764]:
# Confirmed Cases Prediction
tree_pred_confirmed = make_pred(train_confirmed_X, train_confirmed_y,tree.DecisionTreeRegressor())

# Fatalities Cases Prediction
tree_pred_fatalities = make_pred(train_fatalities_X, train_fatalities_y,tree.DecisionTreeRegressor())

In [765]:
# Mean RMSLE
(tree_pred_confirmed[1]+tree_pred_fatalities[1])/2

0.9104719298477852

# Random Forests

In [756]:
# Confirmed Cases Prediction
forests_pred_confirmed = make_pred(train_confirmed_X, train_confirmed_y,RandomForestRegressor(max_depth=2, random_state=66, max_features = 'sqrt'))

# Fatalities Cases Prediction
forests_pred_fatalities = make_pred(train_fatalities_X, train_fatalities_y,RandomForestRegressor(max_depth=2, random_state=66, max_features = 'sqrt'))



In [755]:
# Mean RMSLE
(forests_pred_confirmed[1]+forests_pred_fatalities[1])/2

1.683731318178617

# Gradient Boosting

In [752]:
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'ls'}

# Confirmed Cases Prediction
gb_pred_confirmed = make_pred(train_confirmed_X, train_confirmed_y,ensemble.GradientBoostingRegressor(**params))

# Fatalities Cases Prediction
gb_pred_fatalities = make_pred(train_fatalities_X, train_fatalities_y,ensemble.GradientBoostingRegressor(**params))

In [753]:
# Mean RMSLE
(gb_pred_confirmed[1]+gb_pred_fatalities[1])/2

0.9291424684558763

## Submission

Since Decision Trees performed the best in terms of RMSLE, we choose decision tree model to make predictions on the test set and submit it on Kaggle.

### Confirmed Cases

In [768]:
# Make predictions using the testing set
y_confirmedcases = tree_pred_confirmed[2].predict(test)
y_confirmedcases = np.exp(y_confirmedcases)-1

### Fatalities

In [770]:
# Make predictions using the testing set
y_fatalities = tree_pred_fatalities[2].predict(test)
y_fatalities = np.exp(y_fatalities)-1

In [772]:
a = pd.Series(np.arange(1,len(y_fatalities)+1).astype(int))
b = pd.Series(y_confirmedcases)
c = pd.Series(y_fatalities)

result = pd.concat([a, b, c], axis=1, sort=False)
result.columns = ['ForecastId','ConfirmedCases', 'Fatalities']

In [773]:
result.to_csv('submission_vs.csv',index=False)

*The test RMSLE is 1.23572, which means we are in the top 200 submissions. These are vanilla models but we have a lot of scope to use hyper-parameter tuning and other complex techniques to make better predictions.*