# StepFunction Team COVID Global Forecast

In the context of the global COVID-19 pandemic, Kaggle has launched several challenges to provide insights that may help answer some of the open scientific questions about the virus. One of them is the [COVID19 Global Forecasting](https://www.kaggle.com/c/covid19-global-forecasting-week-1) challenge, in which participants fit worldwide data to predict how the pandemic will evolve, hopefully helping to determine the factors that affect the transmission rate of COVID-19.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn import preprocessing
import time
from datetime import datetime

# ML libraries
import lightgbm as lgb
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# 1. Load Data <a id="section1"></a>

First of all, let's take a look at the data structure:

In [None]:
#submission = pd.read_csv("../input/covid19-global-forecasting-week-1/submission.csv")
test = pd.read_csv("../input/covid19-global-forecasting-week-1/test.csv")
train = pd.read_csv("../input/covid19-global-forecasting-week-1/train.csv")
display(train.head(5))
display(train.describe())
print("Number of Country/Region: ", train['Country/Region'].nunique())
print("Dates go from day", min(train['Date']), "to day", max(train['Date']), ", a total of", train['Date'].nunique(), "days")

In [None]:
print("Number of Country/Region: ", test['Country/Region'].nunique())
print("Dates go from day", min(test['Date']), "to day", max(test['Date']), ", a total of", test['Date'].nunique(), "days")

The dataset covers 163 countries and almost 2 full months from 2020, which is enough data to get some clues about the pandemic.

# 2. Data enrichment <a id="section3"></a>

Main workflow of this section:
1. Join data, filter dates and clean missings
2. Add country related details
3. Compute lags and trends


**Disclaimer**: this data enrichment is not mandatory and we could end up not using all of the new features in our models. However I consider it a didactical step that will surely add some value, for example in an in-depth exploratory analysis.

## 2.1. Join data, filter dates and clean missings

First of all, we perform some pre-processing to prepare the dataset, consisting of:

* **Join data**. Join train/test to facilitate data transformations
* **Filter dates**. According to the challenge conditions, remove ConfirmedCases and Fatalities post 2020-03-12
* **Missings**. Analyze and fix missing values
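
The join step can be sketched on toy data as follows. The frames and values below are invented for illustration; only the pattern (drop the test dates that train already covers, then concatenate) mirrors what this notebook does.

```python
import pandas as pd

# Toy frames; all values are made up for illustration
toy_train = pd.DataFrame({
    "Date": ["2020-03-10", "2020-03-11", "2020-03-12"],
    "ConfirmedCases": [5.0, 8.0, 13.0],
})
toy_test = pd.DataFrame({
    "ForecastId": [1, 2, 3],
    "Date": ["2020-03-12", "2020-03-13", "2020-03-14"],
})

# Dates present in both frames
overlap = set(toy_train["Date"]) & set(toy_test["Date"])

# Keep only the test rows whose dates are not already covered by train
toy_test_only = toy_test[~toy_test["Date"].isin(overlap)]

# Stack both frames; columns missing on one side become NaN
toy_all = pd.concat([toy_train, toy_test_only], axis=0, sort=False)
toy_all = toy_all.sort_values("Date").reset_index(drop=True)
print(toy_all)
```

After the concatenation every date appears once, and the rows that came from the test frame have no ConfirmedCases yet, which is why the missing values need to be handled afterwards.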

In [None]:
# Merge train and test, exclude overlap from test set so that the forecast id are correct
#dates_overlap = ['2020-03-12','2020-03-13','2020-03-14','2020-03-15','2020-03-16','2020-03-17','2020-03-18','2020-03-19','2020-03-20','2020-03-21','2020-03-22']
dates=set(test.Date)
dates_overlap=set(train[train['Date'].isin(dates)]['Date'])
dates_prediction=dates-dates_overlap

# train2 = train.loc[~train['Date'].isin(dates_overlap)]
test2 = test.loc[~test['Date'].isin(dates_overlap)]
all_data = pd.concat([train, test2], axis = 0, sort=False)
all_data=all_data.sort_values(by=['Country/Region','Province/State','Date'])

#update the forecastid from test set
all_data.loc[all_data['Date']>='2020-03-12','ForecastId']=test['ForecastId'].to_list()

# Double check that there are no informed ConfirmedCases and Fatalities after 2020-03-11
# all_data.loc[all_data['Date'] >= '2020-03-12', 'ConfirmedCases'] = np.nan
# all_data.loc[all_data['Date'] >= '2020-03-12', 'Fatalities'] = np.nan
all_data['Date'] = pd.to_datetime(all_data['Date'])

# Create column Day, label encoding Date
le = preprocessing.LabelEncoder()
all_data['Day'] = le.fit_transform(all_data.Date)

# # Country wise days since 1st case
countrydate = all_data[all_data['ConfirmedCases']>0].groupby('Country/Region').agg({"Date":'min'}).reset_index()
countrydate.columns=['Country/Region','Dayof1stcase']
all_data=all_data.merge(countrydate, left_on='Country/Region', right_on='Country/Region', how='left')
all_data['Dayofcases'] = (all_data['Date']-all_data['Dayof1stcase']).dt.days
all_data.loc[~(all_data['Dayofcases']>0),'Dayofcases']=-1
all_data=all_data.drop(columns=['Dayof1stcase'])

# Aruba has no Lat nor Long. Inform it manually
all_data.loc[all_data['Lat'].isna()==True, 'Lat'] = 12.510052
all_data.loc[all_data['Long'].isna()==True, 'Long'] = -70.009354

# Fill null values given that we merged train-test datasets
all_data['Province/State'].fillna("None", inplace=True)
all_data['ConfirmedCases'].fillna(0, inplace=True)
all_data['Fatalities'].fillna(0, inplace=True)
all_data['Id'].fillna(-1, inplace=True)
all_data['ForecastId'].fillna(-1, inplace=True)

#Add day of week and month
all_data['dayofweek'] = all_data['Date'].dt.dayofweek
# all_data['month'] = all_data['Date'].dt.month
# all_data['dayofyear'] = all_data['Date'].dt.dayofyear

display(all_data)
display(all_data.loc[all_data['Date'] == '2020-03-12'])

**Observations**: 
* "ConfirmedCases" and "Fatalities" are now only informed for dates previous to 2020-03-12
* The dataset includes all countries and dates, which is required for the lag/trend step
* Missing values for "ConfirmedCases" and "Fatalities" have been replaced by 0, which may be dangerous if we do not remember it at the end of the process. However, since we will train only on dates previous to 2020-03-12, this won't impact our prediction algorithm
* A new column "Day" has been created, as a day counter starting from the first date
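
As a rough illustration of the day counters, here is a minimal sketch on invented data. It assumes contiguous daily dates, in which case subtracting the earliest date gives the same counter as label encoding the dates. The column names `Country` and `DaysSinceFirstCase` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy data; all values are made up for illustration
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24"] * 2),
    "ConfirmedCases": [0, 1, 3, 0, 0, 2],
})

# Day counter: number of days since the earliest date in the data
toy["Day"] = (toy["Date"] - toy["Date"].min()).dt.days

# Date of the first confirmed case in each country
first_case = (
    toy[toy["ConfirmedCases"] > 0]
    .groupby("Country")["Date"]
    .min()
    .reset_index(name="FirstCase")
)

# Days elapsed since that first case (negative before it happened)
toy = toy.merge(first_case, on="Country", how="left")
toy["DaysSinceFirstCase"] = (toy["Date"] - toy["FirstCase"]).dt.days
print(toy)
```

The notebook additionally sets every value that is not strictly positive to -1; the sketch keeps the raw differences so that the arithmetic is easier to follow.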

Double-check that there are no remaining missing values:

In [None]:
missings_count = {col:all_data[col].isnull().sum() for col in all_data.columns}
missings = pd.DataFrame.from_dict(missings_count, orient='index')
print(missings.nlargest(30, 0))

## 2.2. Add country details

Variables like the total population of a country, the average age of its citizens or the fraction of people living in cities may strongly affect COVID-19 transmission. Hence, it's important to consider these factors. I'm using [Tanu's dataset](https://www.kaggle.com/tanuprabhu/population-by-country-2020), which is based on web scraping, for this purpose.
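
Joining external country data only works when country names match exactly. One possible way to spot mismatches before fixing them by hand is a left merge with `indicator=True`, sketched below on invented data (the population figures are placeholders, not real values).

```python
import pandas as pd

# Toy frames; the population numbers are placeholders
cases = pd.DataFrame({"Country/Region": ["US", "Korea, South", "Czechia"]})
population = pd.DataFrame({
    "country": ["United States", "South Korea", "Czechia"],
    "Population (2020)": [100, 200, 300],
})

# indicator=True adds a "_merge" column telling where each row was found
merged = cases.merge(population, left_on="Country/Region", right_on="country",
                     how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Country/Region"].tolist()
print(unmatched)

# Harmonize the names in the external table and merge again
population["country"] = population["country"].replace(
    {"United States": "US", "South Korea": "Korea, South"}
)
fixed = cases.merge(population, left_on="Country/Region", right_on="country",
                    how="left", indicator=True)
still_unmatched = fixed.loc[fixed["_merge"] == "left_only", "Country/Region"].tolist()
print(still_unmatched)
```

Running such a check after each join shows which names still need a manual rename like the ones applied in the cells below.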

In [None]:
world_population = pd.read_csv("/kaggle/input/population-by-country-2020/population_by_country_2020.csv")
country_data = pd.read_csv("/kaggle/input/countryinfo/covid19countryinfo.csv")
continent = pd.read_csv("/kaggle/input/country-to-continent/countryContinent.csv", encoding = 'ISO-8859-1')
economy = pd.read_csv("/kaggle/input/the-economic-freedom-index/economic_freedom_index2019_data.csv", encoding = 'ISO-8859-1')


# Select desired columns and rename some of them
world_population = world_population[['Country (or dependency)', 'Population (2020)', 'Density (P/Km²)', 'Land Area (Km²)', 'Med. Age', 'Urban Pop %']]
world_population.columns = ['country', 'Population (2020)', 'Density', 'Land Area', 'Med Age', 'Urban Pop']

country_data = country_data[['country','quarantine', 'schools', 'restrictions', 'hospibed','smokers']]
country_data.columns = ['country', 'Quarantine', 'Schools', 'Restrictions', 'Hospibed','Smokers']

continent = continent[['country','continent','sub_region']]
continent.columns = ['country', 'Continent','Sub_Region']

economy = economy[['Country Name', 'World Rank','Region Rank', '2019 Score', 'Property Rights', 'Judical Effectiveness',
       'Government Integrity', 'Tax Burden', "Gov't Spending", 'Fiscal Health',
       'Business Freedom', 'Labor Freedom', 'Monetary Freedom',
       'Trade Freedom', 'Investment Freedom ', 'Financial Freedom',
       'Tariff Rate (%)', 'Income Tax Rate (%)', 'Corporate Tax Rate (%)',
       'Tax Burden % of GDP', "Gov't Expenditure % of GDP ",  'GDP (Billions, PPP)', 'GDP Growth Rate (%)',
       '5 Year GDP Growth Rate (%)', 'GDP per Capita (PPP)',
       'Unemployment (%)', 'Inflation (%)', 'FDI Inflow (Millions)',
       'Public Debt (% of GDP)']]

economy.columns=['country', 'World Rank','Region Rank', '2019 Score', 'Property Rights', 'Judical Effectiveness',
       'Government Integrity', 'Tax Burden', "Gov't Spending", 'Fiscal Health',
       'Business Freedom', 'Labor Freedom', 'Monetary Freedom',
       'Trade Freedom', 'Investment Freedom ', 'Financial Freedom',
       'Tariff Rate (%)', 'Income Tax Rate (%)', 'Corporate Tax Rate (%)',
       'Tax Burden % of GDP', "Gov't Expenditure % of GDP ",  'GDP (Billions, PPP)', 'GDP Growth Rate (%)',
       '5 Year GDP Growth Rate (%)', 'GDP per Capita (PPP)',
       'Unemployment (%)', 'Inflation (%)', 'FDI Inflow (Millions)',
       'Public Debt (% of GDP)']

In [None]:

# Replace United States by US
world_population.loc[world_population['country']=='United States', 'country'] = 'US'
world_population.loc[world_population['country']=="Gambia",'country']='The Gambia'
world_population.loc[world_population['country']=="Bahamas",'country']='The Bahamas'
world_population.loc[world_population['country']=="Réunion",'country']='Reunion'
world_population.loc[world_population['country']=="Czech Republic (Czechia)",'country']='Czechia'
world_population.loc[world_population['country']=="DR Congo",'country']='Congo (Kinshasa)'
world_population.loc[world_population['country']=="Congo",'country']='Congo (Brazzaville)'
world_population.loc[world_population['country']=="Côte d'Ivoire",'country']="Cote d'Ivoire"
world_population.loc[world_population['country']=="South Korea",'country']="Korea, South"
world_population.loc[world_population['country']=="St. Vincent & Grenadines",'country']='Saint Vincent and the Grenadines'

# continent
continent.loc[continent['country']=="United States of America",'country']='US'
continent.loc[continent['country']=="Bolivia (Plurinational State of)",'country']='Bolivia'
continent.loc[continent['country']=="Brunei Darussalam" ,'country'] = 'Brunei'
continent.loc[continent['country']=="Gambia",'country']='The Gambia'
continent.loc[continent['country']=="Bahamas",'country']='The Bahamas'
continent.loc[continent['country']=="Réunion",'country']='Reunion'
continent.loc[continent['country']=="Congo (Democratic Republic of the)",'country']='Congo (Kinshasa)'
continent.loc[continent['country']=="Congo",'country']='Congo (Brazzaville)'
continent.loc[continent['country']=="Czech Republic",'country']='Czechia'
continent.loc[continent['country']=="Côte d'Ivoire",'country']="Cote d'Ivoire"
continent.loc[continent['country']=="Macedonia (the former Yugoslav Republic of)",'country']="North Macedonia"
continent.loc[continent['country']=="Viet Nam",'country']='Vietnam'
continent.loc[continent['country']=="Venezuela (Bolivarian Republic of)",'country']='Venezuela'
continent.loc[continent['country']=="United Kingdom of Great Britain and Northern Ireland",'country']='United Kingdom'
continent.loc[continent['country']=="Tanzania, United Republic of",'country']='Tanzania'
continent.loc[continent['country']=="Russian Federation",'country']='Russia'
continent.loc[continent['country']=="Moldova (Republic of)",'country']='Moldova'
continent.loc[continent['country']=="Korea (Republic of)",'country']='Korea, South'
continent.loc[continent['country']=="Iran (Islamic Republic of)",'country']='Iran'


# actual data
# all_data.loc[train['Country/Region']=="Republic of the Congo",'country']='Congo (Kinshasa)'
all_data.loc[all_data['Country/Region']=="Gambia, The",'Country/Region']='The Gambia'

In [None]:
# DEMOGRAPHICS
# Remove the % character from Urban Pop values
world_population['Urban Pop'] = world_population['Urban Pop'].str.rstrip('%')

# Replace Urban Pop and Med Age "N.A" by their respective modes, then transform to int
world_population.loc[world_population['Urban Pop']=='N.A.', 'Urban Pop'] = int(world_population.loc[world_population['Urban Pop']!='N.A.', 'Urban Pop'].mode()[0])
world_population['Urban Pop'] = world_population['Urban Pop'].astype('int16')
world_population.loc[world_population['Med Age']=='N.A.', 'Med Age'] = int(world_population.loc[world_population['Med Age']!='N.A.', 'Med Age'].mode()[0])
world_population['Med Age'] = world_population['Med Age'].astype('int16')

print("Cleaned country details dataset")
display(world_population)

# Now join the dataset to our previous DataFrame and clean missings (not match in left join)
print("Enriched dataset")
all_data = all_data.merge(world_population, left_on='Country/Region', right_on='country', how='left')
all_data[['Population (2020)', 'Density', 'Land Area', 'Med Age', 'Urban Pop']] = all_data[['Population (2020)', 'Density', 'Land Area', 'Med Age', 'Urban Pop']].fillna(0)
all_data=all_data.drop(columns=['country'])
# display(all_data)


# CONTINENT INFO
# Now join the dataset to our previous DataFrame and clean missings (not match in left join)
print("Enriched dataset")
all_data = all_data.merge(continent, left_on='Country/Region', right_on='country', how='left')
all_data[['Continent','Sub_Region']] = all_data[['Continent','Sub_Region']].fillna('Other')
all_data=all_data.drop(columns=['country'])
# display(all_data)


# ECONOMY DATA
# Remove the $ character from GDP values
economy['GDP per Capita (PPP)'] = economy['GDP per Capita (PPP)'].str.strip('$')
economy['GDP (Billions, PPP)']  = economy['GDP (Billions, PPP)'].str.strip('$')

# Now join the dataset to our previous DataFrame and clean missings (not match in left join)
print("Enriched dataset")
all_data = all_data.merge(economy, left_on='Country/Region', right_on='country', how='left')
all_data[economy.columns] = all_data[economy.columns].fillna('0')
all_data=all_data.drop(columns=['country'])
# display(all_data)
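# Convert the GDP strings to floats and store the result (previously it was computed but discarded)
all_data['GDP per Capita (PPP)'] = all_data['GDP per Capita (PPP)'].str.split('(').apply(lambda x: x[0]).str.replace(",","").astype(float)
all_data['GDP (Billions, PPP)'] = all_data['GDP (Billions, PPP)'].str.split(' ').apply(lambda x: x[0]).str.replace(",","").astype(float)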


# Covid Country Hospital Smokers Quarantine Data
country_data['Smokers'].fillna(country_data.Smokers.mode()[0],inplace=True)
country_data['Hospibed'].fillna(country_data.Hospibed.mode()[0],inplace=True)
country_data['Quarantine']=pd.to_datetime(country_data['Quarantine'])
country_data['Schools']=pd.to_datetime(country_data['Schools'])
country_data['Restrictions']=pd.to_datetime(country_data['Restrictions'])


print("Enriching country actions dataset")
all_data = all_data.merge(country_data, left_on='Country/Region', right_on='country', how='left')
all_data['Smokers'].fillna(country_data.Smokers.mode()[0],inplace=True)
all_data['Hospibed'].fillna(country_data.Hospibed.mode()[0],inplace=True)
all_data=all_data.drop(columns=['country'])
display(all_data)



In [None]:
# Quarantine info flags
all_data['quarantine_flag']=(pd.to_datetime(all_data['Quarantine'])<all_data['Date'])
all_data['schools_flag']=(pd.to_datetime(all_data['Schools'])<all_data['Date'])
all_data['restrictions_flag']=(pd.to_datetime(all_data['Restrictions'])<all_data['Date'])

## 2.3 Compute lags and trends

Enriching a dataset is key to obtaining good results. In this case we will apply 2 different transformations:

**Lag**. A lag is the value a column had a given number of days earlier, so lag 1 of ConfirmedCases holds the value of this column on the previous day. The lag 3 of a feature X is simply:
$$X_{lag3}(t) = X(t-3)$$


**Trend**. Transforming a column into its trend gives the relative change of this column from one day to the next, which is different from the raw value. The definition of trend I will apply is: 
$$Trend_{X} = {X(t) - X(t-1) \over X(t-1)}$$
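
To make the two definitions concrete, here is a minimal sketch on invented data. It follows the formulas above literally and, as an assumption that differs from the notebook's helper functions, computes the shifts within each country via `groupby` so that values from one country never leak into the next. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data; all values are made up for illustration
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "Cases": [1, 2, 4, 8, 10, 10, 15, 30],
})

grouped = toy.groupby("Country")["Cases"]

# Lag 1 and lag 3: X(t-1) and X(t-3), computed within each country
toy["Cases_lag1"] = grouped.shift(1)
toy["Cases_lag3"] = grouped.shift(3)

# Trend: (X(t) - X(t-1)) / X(t-1)
toy["Cases_trend"] = (toy["Cases"] - toy["Cases_lag1"]) / toy["Cases_lag1"]

# Undefined values (no history, or division by zero) are set to 0
toy = toy.replace([np.inf, -np.inf], 0).fillna(0)
print(toy)
```

Country A doubles every day, so its trend is 1.0 wherever a previous value exists, while country B shows a flat day (trend 0.0) followed by increases of 50% and 100%.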

The code below computes lags of 1 to 6 days and a 1-day trend for ConfirmedCases and Fatalities:

In [None]:
def calculate_trend(df, lag_list, column):
    for lag in lag_list:
        trend_column_lag = "Trend_" + column + "_" + str(lag)
        df[trend_column_lag] = (df[column].shift(lag, fill_value=0)-df[column].shift(lag+1, fill_value=-999))/df[column].shift(lag+1, fill_value=0)
    return df


def calculate_lag(df, lag_list, column):
    for lag in lag_list:
        column_lag = column + "_" + str(lag)
        df[column_lag] = df[column].shift(lag, fill_value=0)
    return df


ts = time.time()
all_data=all_data.sort_values(by=['Country/Region','Province/State','Date'])

all_data = calculate_lag(all_data, range(1,7), 'ConfirmedCases')
all_data = calculate_lag(all_data, range(1,7), 'Fatalities')
all_data = calculate_trend(all_data, [1], 'ConfirmedCases')
all_data = calculate_trend(all_data, [1], 'Fatalities')
all_data.replace([np.inf, -np.inf], 0, inplace=True)
all_data.fillna(0, inplace=True)
print("Time spent: ", time.time()-ts)

As you see, the process is really fast. An example of some of the lag/trend columns for Spain:

# 4. Predictions with machine learning <a id="section4"></a>

Our objective in this section is to predict the evolution of the spread from a data-centric perspective, like any other regression problem. To do so, remember that the challenge specifies that submissions on the public LB should only use data prior to 2020-03-12.
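
A date-based cutoff of this kind can be sketched as follows. The data is invented and the variable names are hypothetical; the point is only that rows are split by time rather than at random, so the model never sees the period it is evaluated on.

```python
import pandas as pd

# Toy data; all values are made up for illustration
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-09", "2020-03-10", "2020-03-11",
                            "2020-03-12", "2020-03-13"]),
    "ConfirmedCases": [3, 5, 8, 13, 21],
})

cutoff = pd.Timestamp("2020-03-12")

# Everything strictly before the cutoff is used for fitting
train_part = toy[toy["Date"] < cutoff]

# The cutoff date and everything after it is held out
holdout_part = toy[toy["Date"] >= cutoff]

print(len(train_part), len(holdout_part))
```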

Models to apply:
1. Random Forest


In [None]:
raw_data=all_data.copy()

# Prepare data for Fitting model

In [None]:
# Label encode country names
data = all_data.copy()
data = data.drop(columns=['Quarantine','Schools', 'Restrictions'])

data['Country/Region'] = le.fit_transform(data['Country/Region'])

# Save dictionary for exploration purposes
number = data['Country/Region']
countries = le.inverse_transform(data['Country/Region'])
country_dict = dict(zip(countries, number)) 

data['Continent'] = le.fit_transform(data['Continent'])
data['Sub_Region'] = le.fit_transform(data['Sub_Region'])


In [None]:
# data = data[['Date','Id', 'ForecastId','Lat', 'Long', 'Country/Region', 'ConfirmedCases', 'Fatalities', 
#        'Day', 'Dayofcases', 'dayofweek', 'ConfirmedCases_1', 'ConfirmedCases_2',
#        'ConfirmedCases_3', 'ConfirmedCases_4', 'ConfirmedCases_5',
#        'ConfirmedCases_6', 'Fatalities_1', 'Fatalities_2', 'Fatalities_3',
#        'Fatalities_4', 'Fatalities_5', 'Fatalities_6',
#        'Trend_ConfirmedCases_1', 'Trend_Fatalities_1','Population (2020)', 'Density', 'Land Area',
#        'Med Age', 'Urban Pop', 'Continent', 'Sub_Region','Hospibed', 'Smokers', 'quarantine_flag',
#        'schools_flag', 'restrictions_flag']]

In [None]:

def split_data_with_custom_date(data,valid_date,end_date):
    
    # Train set
    X_train   = data[(data.ForecastId == -1)&(data.Date<valid_date)].drop(['ConfirmedCases', 'Fatalities'], axis=1)
    Y_train_1 = data[(data.ForecastId == -1)&(data.Date<valid_date)]['ConfirmedCases']
    Y_train_2 = data[(data.ForecastId == -1)&(data.Date<valid_date)]['Fatalities']

    # Test set
    X_test = data[(data.ForecastId == -1)&(data.Date>=valid_date)&(data.Date<end_date)].drop(['ConfirmedCases', 'Fatalities'], axis=1)
    #Y_test_1 = test['ConfirmedCases']
    #Y_test_2 = test['Fatalities']

    # Test set
    #X_test = data[data.Day > day_valid].drop(['ConfirmedCases', 'Fatalities'], axis=1)
    
    X_train.drop(columns=['Id','ForecastId','Date'], inplace=True, errors='ignore')
    
#     X_test.drop('Id', inplace=True, errors='ignore')
    index = X_test['ForecastId']
    X_test.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
#     testdata=X_test.copy()
#     X_test.drop('Date', inplace=True, errors='ignore')
    
    return X_train, Y_train_1, Y_train_2, X_test, index



def split_data(data):
        # Train set
    new_train=data[data.ForecastId == -1]
    X_train = new_train.drop(['Province/State','ConfirmedCases', 'Fatalities'], axis=1)
    Y_train_1 = new_train['ConfirmedCases']
    Y_train_2 = new_train['Fatalities']

    # Valid set
    valid=data[(data.ForecastId != -1)&(data.Date.isin(dates_overlap))]
    X_valid = valid.drop(['Province/State','ConfirmedCases', 'Fatalities'], axis=1)
    Y_valid_1 = valid['ConfirmedCases']
    Y_valid_2 = valid['Fatalities']

    # Test set
    new_test=data[(data.ForecastId != -1)&(data.Date.isin(dates))]
    X_test = new_test.drop(['Province/State','ConfirmedCases', 'Fatalities'], axis=1)

    # Test set
    #X_test = data[data.Day > day_valid].drop(['ConfirmedCases', 'Fatalities'], axis=1)
    
    X_train.drop(columns=['Id','ForecastId','Date'], inplace=True, errors='ignore')
    
    valid_index= X_valid['ForecastId']
    test_index = X_test['ForecastId']
    X_valid.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
    X_test.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
#     testdata=X_test.copy()
#     X_test.drop('Date', inplace=True, errors='ignore')

    return new_train, X_train, Y_train_1, Y_train_2, valid, X_valid, Y_valid_1, Y_valid_2, new_test, X_test, valid_index, test_index

In [None]:
restrictions=all_data[all_data['restrictions_flag']==True].groupby('Country/Region').agg({"Date":'min'})
restrictions.columns=['restricted_date']
new_data = all_data.merge(restrictions, left_on=['Country/Region','Date'], right_on=['Country/Region','restricted_date'], how='inner')
new_data.head()

In [None]:
new_data[['ForecastId','restricted_date','Dayofcases','ConfirmedCases','Country/Region','Province/State']]

In [None]:
def recreate_valid_split(data):
    # Valid set
    valid   = data[(data.ForecastId != -1)&(data.Date.isin(dates_overlap))]
    X_valid = valid.drop(['Province/State','ConfirmedCases', 'Fatalities'], axis=1)
    Y_valid_1 = valid['ConfirmedCases']
    Y_valid_2 = valid['Fatalities']
    X_valid.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
#     testdata=X_test.copy()
#     X_test.drop('Date', inplace=True, errors='ignore')
    return X_valid#, Y_valid_1, Y_valid_2

def recreate_submission_split(data):
    # Submission set
    sub   = data[(data.ForecastId != -1)&(data.Date.isin(dates))]
    X_sub = sub.drop(['Province/State','ConfirmedCases', 'Fatalities'], axis=1)
    Y_sub_1 = sub['ConfirmedCases']
    Y_sub_2 = sub['Fatalities']
    X_sub.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
#     testdata=X_test.copy()
#     X_test.drop('Date', inplace=True, errors='ignore')
    return X_sub#, Y_valid_1, Y_valid_2

def recalculate_lags(df):
    df = df.sort_values(by=['Country/Region','Province/State','Date'])
    df = calculate_lag(df, range(1,7), 'ConfirmedCases')
    df = calculate_lag(df, range(1,7), 'Fatalities')
    df = calculate_trend(df, [1], 'ConfirmedCases')
    df = calculate_trend(df, [1], 'Fatalities')
    df.replace([np.inf, -np.inf], 0, inplace=True)
    df.fillna(0, inplace=True)
    return df
#     print("Time spent: ", time.time()-ts)

In [None]:
economy_columns=['World Rank', 'Region Rank', '2019 Score',
       'Property Rights', 'Judical Effectiveness', 'Government Integrity',
       'Tax Burden', "Gov't Spending", 'Fiscal Health', 'Business Freedom',
       'Labor Freedom', 'Monetary Freedom', 'Trade Freedom',
       'Investment Freedom ', 'Financial Freedom', 'Tariff Rate (%)',
       'Income Tax Rate (%)', 'Corporate Tax Rate (%)', 'Tax Burden % of GDP',
       "Gov't Expenditure % of GDP ", 'GDP (Billions, PPP)',
       'GDP Growth Rate (%)', '5 Year GDP Growth Rate (%)',
       'GDP per Capita (PPP)', 'Unemployment (%)', 'Inflation (%)',
       'FDI Inflow (Millions)', 'Public Debt (% of GDP)']


In [None]:
new_train,X_train, Y_train_1, Y_train_2, valid,X_valid, Y_valid_1, Y_valid_2, new_test, X_test, valid_index, test_index=split_data(data.drop(columns=economy_columns))

## 4.1 Random Forest Regressor 

Recalculate lags and trends after each date of predictions
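
The idea of feeding each prediction back in as a lag feature for the next day can be sketched as below. This is only a toy: a least-squares line fitted with `np.polyfit` stands in for the random forest, and a single lag-1 feature stands in for the notebook's full feature set.

```python
import numpy as np

# Toy history of cumulative cases that doubles every day (made-up values)
history = [1.0, 2.0, 4.0, 8.0, 16.0]

# Stand-in model: a line mapping the lag-1 value to the current value
lag1 = np.array(history[:-1])
target = np.array(history[1:])
slope, intercept = np.polyfit(lag1, target, deg=1)

# Recursive forecast: each prediction becomes the lag feature for the next day
series = list(history)
forecast = []
for _ in range(3):
    next_value = slope * series[-1] + intercept
    forecast.append(next_value)
    series.append(next_value)  # feed the prediction back in before the next step

print(np.round(forecast, 2))
```

Because every step builds on the previous prediction, errors accumulate over the horizon, which is why the multi-day scores further down are expected to be worse than the next-day scores.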

In [None]:
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_log_error

Using RandomizedSearchCV

In [None]:
random_grid={'bootstrap': [True, False],
 'max_depth': [10, 30, 50, None],
 'max_features': [1.0, 'sqrt'],  # 1.0 = all features; replaces the 'auto' alias removed in scikit-learn 1.3
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [100,200,300,400]}

rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 10, cv = 2, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, Y_train_1)

rf = RandomForestRegressor()
rf_random2 = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 10, cv = 2, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random2.fit(X_train, Y_train_2)

In [None]:
# regr1=RandomForestRegressor(bootstrap=True,max_depth=None, max_features='auto', max_leaf_nodes=None,
#                             n_estimators=150, random_state=None, n_jobs=1, verbose=0)
# regr1.fit(X_train, Y_train_1)

# regr2=RandomForestRegressor(bootstrap=True,max_depth=None, max_features='auto', max_leaf_nodes=None,
#                             n_estimators=150, random_state=None, n_jobs=1, verbose=0)
# regr2.fit(X_train, Y_train_2)
# # X_test.head()

In [None]:
regr1=rf_random.best_estimator_
regr1.fit(X_train, Y_train_1)

regr2=rf_random2.best_estimator_
regr2.fit(X_train, Y_train_2)
# X_test.head()

In [None]:
display(regr1.feature_importances_)
display(X_train.columns)
display(regr2.feature_importances_)

## Using actual data from the previous day (next-day prediction only, lower error)

In [None]:
x_pred=valid.copy()
predictions1=regr1.predict(X_valid.drop(columns=['Date']))
x_pred['predictions1']=predictions1
predictions2=regr2.predict(X_valid.drop(columns=['Date']))
x_pred['predictions2']=predictions2

xpred = x_pred[['Date','Country/Region','ConfirmedCases','Fatalities','predictions1','predictions2']]
display(xpred[xpred['Country/Region']==74])
display(xpred[xpred['Country/Region']==68])

In [None]:
print(np.sqrt(mean_squared_log_error( Y_valid_1, predictions1 )))
print(np.sqrt(mean_squared_log_error( Y_valid_2, predictions2 )))

## One-shot predictions based on forecasted data points

In [None]:
# valid_dataset=X_valid.drop(columns=['Date'])
# confirmed=sorted(X_valid.columns[X_valid.columns.str.contains('Confirmed')])
# fatalities=sorted(X_valid.columns[X_valid.columns.str.contains('Fatalities')])
# train,X_train, Y_train_1, Y_train_2, valid,X_valid, Y_valid_1, Y_valid_2, test,X_test, valid_index, test_index=split_data(data.drop(columns=economy_columns))
temp_data=data.copy()
temp_x_valid= X_valid.copy()
x_pred=valid.copy() # Final predictions
x_pred['Predictions1']=0
x_pred['Predictions2']=0

for date in sorted(dates_overlap):  
    if date>min(dates_overlap):
        temp_data.loc[temp_data['ForecastId'].isin(forecastids),'ConfirmedCases']=predictions1
        temp_data.loc[temp_data['ForecastId'].isin(forecastids),'Fatalities']=predictions2
        temp_data = recalculate_lags(temp_data)
        temp_x_valid = recreate_valid_split(temp_data.drop(columns=economy_columns))
    forecastids=temp_data[temp_data['Date']==date]['ForecastId']
    valid_dataset=temp_x_valid[temp_x_valid['Date']==date].drop(columns=['Date'])
    predictions1=regr1.predict(valid_dataset)
    predictions2=regr2.predict(valid_dataset)
    x_pred.loc[x_pred['ForecastId'].isin(forecastids),'Predictions1']=predictions1
    x_pred.loc[x_pred['ForecastId'].isin(forecastids),'Predictions2']=predictions2

    

In [None]:
temp_data.head()

In [None]:
print(np.sqrt(mean_squared_log_error( Y_valid_1, x_pred['Predictions1'] )))
print(np.sqrt(mean_squared_log_error( Y_valid_2, x_pred['Predictions2'] )))

print(np.sqrt(mean_squared_log_error( Y_valid_1, np.round(x_pred['Predictions1']) )))
print(np.sqrt(mean_squared_log_error( Y_valid_2, np.round(x_pred['Predictions2']) )))

In [None]:
print(np.sqrt(mean_squared_log_error( Y_valid_1[x_pred['Date']>='2020-03-20'], x_pred[x_pred['Date']>='2020-03-20']['Predictions1'] )))
print(np.sqrt(mean_squared_log_error( Y_valid_2[x_pred['Date']>='2020-03-20'], x_pred[x_pred['Date']>='2020-03-20']['Predictions2'] )))

## Submissions Part

In [None]:
# train,X_train, Y_train_1, Y_train_2, valid,X_valid, Y_valid_1, Y_valid_2, test,X_test, valid_index, test_index = split_data(data.drop(columns=economy_columns))
temp_data=data.copy()
temp_x_sub= X_test.copy()
x_pred2=new_test.copy() # Final predictions
x_pred2['Predictions1']=0
x_pred2['Predictions2']=0

for date in sorted(dates):  
    if date>min(dates):
        temp_data.loc[temp_data['ForecastId'].isin(forecastids),'ConfirmedCases']=predictions1
        temp_data.loc[temp_data['ForecastId'].isin(forecastids),'Fatalities']=predictions2
        temp_data = recalculate_lags(temp_data)
        temp_x_sub = recreate_submission_split(temp_data.drop(columns=economy_columns))
    forecastids=temp_data[temp_data['Date']==date]['ForecastId']
    sub_dataset=temp_x_sub[temp_x_sub['Date']==date].drop(columns=['Date'])
    predictions1=regr1.predict(sub_dataset)
    predictions2=regr2.predict(sub_dataset)
    x_pred2.loc[x_pred2['ForecastId'].isin(forecastids),'Predictions1']=predictions1
    x_pred2.loc[x_pred2['ForecastId'].isin(forecastids),'Predictions2']=predictions2

    

In [None]:
print(np.sqrt(mean_squared_log_error( Y_valid_1, np.ceil(x_pred2[x_pred2['Date']<='2020-03-23']['Predictions1']) )))
print(np.sqrt(mean_squared_log_error( Y_valid_2, np.round(x_pred2[x_pred2['Date']<='2020-03-23']['Predictions2']) )))

In [None]:
def get_submission(x_pred):
    # Submit predictions
#     submission = pd.DataFrame({
#         "ForecastId": index, 
#         "ConfirmedCases": prediction1,
#         "Fatalities": predictions2
#     })
    submission=x_pred[['ForecastId','Predictions1','Predictions2']].copy()
    submission.columns=['ForecastId','ConfirmedCases','Fatalities']
    submission.loc[:,'ForecastId']=submission['ForecastId'].astype(int)
    submission.loc[:,'ConfirmedCases']=np.ceil(submission['ConfirmedCases'])
    submission.loc[:,'Fatalities']=np.round(submission['Fatalities'])
    
#     submission['ConfirmedCases']=np.ceil(submission['ConfirmedCases'])
    submission.to_csv('submission.csv', index=False)
    

In [None]:
get_submission(x_pred2)
# type(x_pred['ForecastId'][50])

In [None]:
# new_test[new_test['Province/State']=='New Brunswick'].head(50)

In [None]:
# country_dict