# StepFunction Team COVID Global Forecast

In the context of the global COVID-19 pandemic, Kaggle has launched several challenges to provide insights that may answer some of the open scientific questions about the virus. One of them is the [COVID19 Global Forecasting](https://www.kaggle.com/c/covid19-global-forecasting-week-1) challenge, in which participants are encouraged to fit worldwide data in order to predict the evolution of the pandemic, hopefully helping to determine the factors that impact the transmission rate of COVID-19.


We chose the following notebook as the starting point for the model. It has a great EDA and modeling approach:
https://www.kaggle.com/saga21/covid-global-forecast-sir-model-ml-regressions

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn import preprocessing
import time
from datetime import datetime
from google.cloud import bigquery

# ML libraries
import lightgbm as lgb
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

pd.options.display.max_columns = None

# 1. Get DATA <a id="section1"></a>

First of all, let's take a look at the data structure:

In [None]:
#submission = pd.read_csv("../input/covid19-global-forecasting-week-1/submission.csv")
test = pd.read_csv("../input/covid19-global-forecasting-week-1/test.csv")
train = pd.read_csv("../input/covid19-global-forecasting-week-1/train.csv")
display(train.head(5))
display(train.describe())
print("Number of Country/Region: ", train['Country/Region'].nunique())
print("Dates go from day", min(train['Date']), "to day", max(train['Date']), ", a total of", train['Date'].nunique(), "days")

In [None]:
print("Number of Country/Region: ", test['Country/Region'].nunique())
print("Dates go from day", min(test['Date']), "to day", max(test['Date']), ", a total of", test['Date'].nunique(), "days")

The dataset covers 163 countries and almost 2 full months of 2020, which is enough data to get some clues about the pandemic.

# 2. Data enrichment <a id="section3"></a>


Main workflow of this section:
1. Join data, filter dates and clean missing values + feature engineering
2. Add country details (external datasets)
3. Add Weather Data
4. Compute lags and trends
5. Target Encoding

**Disclaimer**: this data enrichment is not mandatory and we could end up not using all of the new features in our models. However, I consider it a didactic step that will surely add some value, for example in an in-depth exploratory analysis.

## 2.1. Join data, filter dates and clean missing values

First of all, we perform some pre-processing to prepare the dataset, consisting of:

* **Join data**. Join train/test to facilitate data transformations
* **Filter dates**. According to the challenge conditions, remove ConfirmedCases and Fatalities after 2020-03-12
* **Missing values**. Analyze and fix missing values

In [None]:
# Merge train and test, exclude overlap from test set so that the forecast id are correct
#dates_overlap = ['2020-03-12','2020-03-13','2020-03-14','2020-03-15','2020-03-16','2020-03-17','2020-03-18','2020-03-19','2020-03-20','2020-03-21','2020-03-22']
dates=set(test.Date)
dates_overlap=set(train[train['Date'].isin(dates)]['Date'])
dates_prediction=dates-dates_overlap

# train2 = train.loc[~train['Date'].isin(dates_overlap)]
test2 = test.loc[~test['Date'].isin(dates_overlap)]
all_data = pd.concat([train, test2], axis = 0, sort=False)
all_data=all_data.sort_values(by=['Country/Region','Province/State','Date'])

#update the forecastid from test set
all_data.loc[all_data['Date']>='2020-03-12','ForecastId']=test['ForecastId'].to_list()

# Double check that there are no informed ConfirmedCases and Fatalities after 2020-03-11
# all_data.loc[all_data['Date'] >= '2020-03-12', 'ConfirmedCases'] = np.nan
# all_data.loc[all_data['Date'] >= '2020-03-12', 'Fatalities'] = np.nan
all_data['Date'] = pd.to_datetime(all_data['Date'])

# Create column Day, label encoding Date
le = preprocessing.LabelEncoder()
all_data['Day'] = le.fit_transform(all_data.Date)

# # Country wise days since 1st case
all_data['Province/State'].fillna("None", inplace=True)
all_data['country/state']=all_data['Country/Region']+"_"+all_data['Province/State']

countrydate = all_data[all_data['ConfirmedCases']>0].groupby('country/state').agg({"Date":'min'}).reset_index()
countrydate.columns=['country/state','Dayof1stcase']
all_data=all_data.merge(countrydate, left_on='country/state', right_on='country/state', how='left')
all_data['Dayofcases'] = (all_data['Date']-all_data['Dayof1stcase']).dt.days
all_data.loc[~(all_data['Dayofcases']>=0),'Dayofcases']=-1
all_data=all_data.drop(columns=['Dayof1stcase'])

# # Country wise days since 1st fatality
countrydate = all_data[all_data['Fatalities']>0].groupby('country/state').agg({"Date":'min'}).reset_index()
countrydate.columns=['country/state','Dayof1stfatality']
all_data=all_data.merge(countrydate, left_on='country/state', right_on='country/state', how='left')
all_data['Dayoffatalities'] = (all_data['Date']-all_data['Dayof1stfatality']).dt.days
all_data.loc[~(all_data['Dayoffatalities']>=0),'Dayoffatalities']=-1
all_data=all_data.drop(columns=['Dayof1stfatality'])

# Aruba has no Lat nor Long. Inform it manually
all_data.loc[all_data['Lat'].isna()==True, 'Lat'] = 12.510052
all_data.loc[all_data['Long'].isna()==True, 'Long'] = -70.009354

# Fill null values given that we merged train-test datasets
all_data['Province/State'].fillna("None", inplace=True)
all_data['ConfirmedCases'].fillna(0, inplace=True)
all_data['Fatalities'].fillna(0, inplace=True)
all_data['Id'].fillna(-1, inplace=True)
all_data['ForecastId'].fillna(-1, inplace=True)

#Add day of week and month
all_data['dayofweek'] = all_data['Date'].dt.dayofweek
# all_data['month'] = all_data['Date'].dt.month
# all_data['dayofyear'] = all_data['Date'].dt.dayofyear

# display(all_data)
display(all_data.loc[all_data['Date'] == '2020-03-12'])

**Observations**: 
* The dataset includes all countries and dates, which is required for the lag/trend step
* Missing values for "ConfirmedCases" and "Fatalities" have been replaced by 0, which can be dangerous if we forget about it later in the process. However, since we only train on dates before 2020-03-12, this won't affect our prediction algorithm
* A new column "Day" has been created, as a day counter starting from the first date
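
The "days since first case" counter built in the cell above can be illustrated on a small example. This is only a sketch on made-up data (the `toy` frame and its `region` column are hypothetical, not part of the competition files); it follows the same idea of taking the earliest date with a positive count per region and marking earlier rows with -1.

```python
import pandas as pd

# Toy data (hypothetical): two regions with cumulative confirmed cases
toy = pd.DataFrame({
    "region": ["A"] * 4 + ["B"] * 4,
    "Date": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"] * 2
    ),
    "ConfirmedCases": [0, 2, 5, 9, 0, 0, 0, 1],
})

# Earliest date with a positive count, per region
first_case = (
    toy.loc[toy["ConfirmedCases"] > 0]
    .groupby("region")["Date"]
    .min()
    .rename("first_case_date")
)
toy = toy.merge(first_case, left_on="region", right_index=True, how="left")

# Days elapsed since the first case; rows before it are marked with -1
days = (toy["Date"] - toy["first_case_date"]).dt.days
toy["Dayofcases"] = days.where(days >= 0, -1).astype(int)

print(toy[["region", "Date", "ConfirmedCases", "Dayofcases"]])
```

Using -1 as the sentinel keeps it distinct from day 0, the day of the first case.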

Double-check that there are no remaining missing values:

In [None]:
missings_count = {col:all_data[col].isnull().sum() for col in all_data.columns}
missings = pd.DataFrame.from_dict(missings_count, orient='index')
print(missings.nlargest(30, 0))

## 2.2. Add country details

Variables like the total population of a country, the average age of its citizens or the fraction of people living in cities may strongly impact the COVID-19 transmission behavior. Hence, it's important to consider these factors.

Datasets used:
* https://www.kaggle.com/tanuprabhu/population-by-country-2020
* https://www.kaggle.com/koryto/countryinfo
* https://www.kaggle.com/lewisduncan93/the-economic-freedom-index



In [None]:
world_population = pd.read_csv("/kaggle/input/population-by-country-2020/population_by_country_2020.csv")
country_data = pd.read_csv("/kaggle/input/countryinfo/covid19countryinfo.csv")
continent = pd.read_csv("/kaggle/input/country-to-continent/countryContinent.csv", encoding = 'ISO-8859-1')
economy = pd.read_csv("/kaggle/input/the-economic-freedom-index/economic_freedom_index2019_data.csv", encoding = 'ISO-8859-1')


# Select desired columns and rename some of them
world_population = world_population[['Country (or dependency)', 'Population (2020)', 'Density (P/Km²)', 'Land Area (Km²)', 'Med. Age', 'Urban Pop %']]
world_population.columns = ['country', 'Population (2020)', 'Density', 'Land Area', 'Med Age', 'Urban Pop']

country_data = country_data[['country','quarantine', 'schools', 'restrictions', 'hospibed','smokers']]
country_data.columns = ['country', 'Quarantine', 'Schools', 'Restrictions', 'Hospibed','Smokers']

continent = continent[['country','continent','sub_region']]
continent.columns = ['country', 'Continent','Sub_Region']

economy = economy[['Country Name', 'World Rank','Region Rank', '2019 Score', 'Property Rights', 'Judical Effectiveness',
       'Government Integrity', 'Tax Burden', "Gov't Spending", 'Fiscal Health',
       'Business Freedom', 'Labor Freedom', 'Monetary Freedom',
       'Trade Freedom', 'Investment Freedom ', 'Financial Freedom',
       'Tariff Rate (%)', 'Income Tax Rate (%)', 'Corporate Tax Rate (%)',
       'Tax Burden % of GDP', "Gov't Expenditure % of GDP ",  'GDP (Billions, PPP)', 'GDP Growth Rate (%)',
       '5 Year GDP Growth Rate (%)', 'GDP per Capita (PPP)',
       'Unemployment (%)', 'Inflation (%)', 'FDI Inflow (Millions)',
       'Public Debt (% of GDP)']]

# Only the key column needs a new name; all other columns keep their original names
economy = economy.rename(columns={'Country Name': 'country'})

In [None]:

# Replace United States by US
world_population.loc[world_population['country']=='United States', 'country'] = 'US'
world_population.loc[world_population['country']=="Gambia",'country']='The Gambia'
world_population.loc[world_population['country']=="Bahamas",'country']='The Bahamas'
world_population.loc[world_population['country']=="Réunion",'country']='Reunion'
world_population.loc[world_population['country']=="Czech Republic (Czechia)",'country']='Czechia'
world_population.loc[world_population['country']=="DR Congo",'country']='Congo (Kinshasa)'
world_population.loc[world_population['country']=="Congo",'country']='Congo (Brazzaville)'
world_population.loc[world_population['country']=="Côte d'Ivoire",'country']="Cote d'Ivoire"
world_population.loc[world_population['country']=="South Korea",'country']="Korea, South"
world_population.loc[world_population['country']=="St. Vincent & Grenadines",'country']='Saint Vincent and the Grenadines'

# continent
continent.loc[continent['country']=="United States of America",'country']='US'
continent.loc[continent['country']=="Bolivia (Plurinational State of)",'country']='Bolivia'
continent.loc[continent['country']=="Brunei Darussalam" ,'country'] = 'Brunei'
continent.loc[continent['country']=="Gambia",'country']='The Gambia'
continent.loc[continent['country']=="Bahamas",'country']='The Bahamas'
continent.loc[continent['country']=="Réunion",'country']='Reunion'
continent.loc[continent['country']=="Congo (Democratic Republic of the)",'country']='Congo (Kinshasa)'
continent.loc[continent['country']=="Congo",'country']='Congo (Brazzaville)'
continent.loc[continent['country']=="Czech Republic",'country']='Czechia'
continent.loc[continent['country']=="Côte d'Ivoire",'country']="Cote d'Ivoire"
continent.loc[continent['country']=="Macedonia (the former Yugoslav Republic of)",'country']="North Macedonia"
continent.loc[continent['country']=="Viet Nam",'country']='Vietnam'
continent.loc[continent['country']=="Venezuela (Bolivarian Republic of)",'country']='Venezuela'
continent.loc[continent['country']=="United Kingdom of Great Britain and Northern Ireland",'country']='United Kingdom'
continent.loc[continent['country']=="Tanzania, United Republic of",'country']='Tanzania'
continent.loc[continent['country']=="Russian Federation",'country']='Russia'
continent.loc[continent['country']=="Moldova (Republic of)",'country']='Moldova'
continent.loc[continent['country']=="Korea (Republic of)",'country']='Korea, South'
continent.loc[continent['country']=="Iran (Islamic Republic of)",'country']='Iran'

economy.loc[economy['country']=="United States of America",'country']='US'
economy.loc[economy['country']=="Bolivia (Plurinational State of)",'country']='Bolivia'
economy.loc[economy['country']=="Brunei Darussalam" ,'country'] = 'Brunei'
economy.loc[economy['country']=="Gambia",'country']='The Gambia'
economy.loc[economy['country']=="Bahamas",'country']='The Bahamas'
economy.loc[economy['country']=="Réunion",'country']='Reunion'
economy.loc[economy['country']=="Congo (Democratic Republic of the)",'country']='Congo (Kinshasa)'
economy.loc[economy['country']=="Congo",'country']='Congo (Brazzaville)'
economy.loc[economy['country']=="Czech Republic",'country']='Czechia'
economy.loc[economy['country']=="Côte d'Ivoire",'country']="Cote d'Ivoire"
economy.loc[economy['country']=="Macedonia (the former Yugoslav Republic of)",'country']="North Macedonia"
economy.loc[economy['country']=="Viet Nam",'country']='Vietnam'
economy.loc[economy['country']=="Venezuela (Bolivarian Republic of)",'country']='Venezuela'
economy.loc[economy['country']=="United Kingdom of Great Britain and Northern Ireland",'country']='United Kingdom'
economy.loc[economy['country']=="Tanzania, United Republic of",'country']='Tanzania'
economy.loc[economy['country']=="Russian Federation",'country']='Russia'
economy.loc[economy['country']=="Moldova (Republic of)",'country']='Moldova'
economy.loc[economy['country']=="Korea (Republic of)",'country']='Korea, South'
economy.loc[economy['country']=="Iran (Islamic Republic of)",'country']='Iran'

# actual data
# all_data.loc[train['Country/Region']=="Republic of the Congo",'country']='Congo (Kinshasa)'
all_data.loc[all_data['Country/Region']=="Gambia, The",'Country/Region']='The Gambia'
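
The manual renames above are needed because each external dataset spells some country names differently from the competition data. One possible way to find the names that still fail to match is a left merge with `indicator=True`. The sketch below uses small made-up frames (`cases`, `population` and the `pop` values are hypothetical), not the real datasets.

```python
import pandas as pd

# Hypothetical stand-ins for the competition data and one external table
cases = pd.DataFrame({"Country/Region": ["US", "Czechia", "Korea, South"]})
population = pd.DataFrame({
    "country": ["United States", "Czechia", "South Korea"],
    "pop": [100, 200, 300],  # toy values, not real populations
})

# indicator=True adds a "_merge" column that says where each row was found
check = cases.merge(
    population,
    left_on="Country/Region",
    right_on="country",
    how="left",
    indicator=True,
)

# Rows marked "left_only" have no counterpart in the external table
unmatched = check.loc[check["_merge"] == "left_only", "Country/Region"].tolist()
print(unmatched)
```

Any name in `unmatched` would otherwise be silently filled with a default value after the left join, so checking it before the `fillna` step can be worthwhile.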

In [None]:
# DEMOGRAPHICS
# Remove the % character from Urban Pop values
world_population['Urban Pop'] = world_population['Urban Pop'].str.rstrip('%')

# Replace Urban Pop and Med Age "N.A" by their respective modes, then transform to int
world_population.loc[world_population['Urban Pop']=='N.A.', 'Urban Pop'] = int(world_population.loc[world_population['Urban Pop']!='N.A.', 'Urban Pop'].mode()[0])
world_population['Urban Pop'] = world_population['Urban Pop'].astype('int16')
world_population.loc[world_population['Med Age']=='N.A.', 'Med Age'] = int(world_population.loc[world_population['Med Age']!='N.A.', 'Med Age'].mode()[0])
world_population['Med Age'] = world_population['Med Age'].astype('int16')

print("Cleaned country details dataset")
display(world_population)

# Now join the dataset to our previous DataFrame and clean missings (not match in left join)
print("Enriched dataset")
all_data = all_data.merge(world_population, left_on='Country/Region', right_on='country', how='left')
all_data[['Population (2020)', 'Density', 'Land Area', 'Med Age', 'Urban Pop']] = all_data[['Population (2020)', 'Density', 'Land Area', 'Med Age', 'Urban Pop']].fillna(0)
all_data=all_data.drop(columns=['country'])
# display(all_data)


# CONTINENT INFO
# Now join the dataset to our previous DataFrame and clean missings (not match in left join)
print("Enriched dataset")
all_data = all_data.merge(continent, left_on='Country/Region', right_on='country', how='left')
all_data[['Continent','Sub_Region']] = all_data[['Continent','Sub_Region']].fillna('Other')
all_data=all_data.drop(columns=['country'])
# display(all_data)


# ECONOMY DATA
# Remove the $ character from GDP values
economy['GDP per Capita (PPP)'] = economy['GDP per Capita (PPP)'].str.strip('$')
economy['GDP (Billions, PPP)'] = economy['GDP (Billions, PPP)'].str.strip('$')
# economy['GDP per Capita (PPP)'] = economy['GDP per Capita (PPP)'].str.split('(').apply(lambda x: x[0]).str.replace(",","").astype(float)
# economy['GDP (Billions, PPP)'] = economy['GDP (Billions, PPP)'].str.split('\ ').apply(lambda x: x[0]).str.replace(",","").astype(float)
# Now join the dataset to our previous DataFrame and clean missings (not match in left join)
print("Enriched dataset")
all_data = all_data.merge(economy, left_on='Country/Region', right_on='country', how='left')
all_data[economy.columns] = all_data[economy.columns].fillna(0)
all_data=all_data.drop(columns=['country'])
# display(all_data)
# all_data['GDP per Capita (PPP)']= all_data['GDP per Capita (PPP)'].str.split('(').apply(lambda x: x[0]).str.replace(",","").astype(float)
# all_data['GDP (Billions, PPP)'] = all_data['GDP (Billions, PPP)'].str.split('\ ').apply(lambda x: x[0]).str.replace(",","").astype(float)


# Covid Country Hospital Smokers Quarantine Data
country_data['Smokers'].fillna(country_data.Smokers.mode()[0],inplace=True)
country_data['Hospibed'].fillna(country_data.Hospibed.mode()[0],inplace=True)
country_data['Quarantine']=pd.to_datetime(country_data['Quarantine'])
country_data['Schools']=pd.to_datetime(country_data['Schools'])
country_data['Restrictions']=pd.to_datetime(country_data['Restrictions'])


print("Enriching country actions dataset")
all_data = all_data.merge(country_data, left_on='Country/Region', right_on='country', how='left')
all_data['Smokers'].fillna(country_data.Smokers.mode()[0],inplace=True)
all_data['Hospibed'].fillna(country_data.Hospibed.mode()[0],inplace=True)
all_data=all_data.drop(columns=['country'])
display(all_data)



In [None]:
# Quarantine info flags
all_data['quarantine_flag']=(pd.to_datetime(all_data['Quarantine'])<all_data['Date'])
all_data['schools_flag']=(pd.to_datetime(all_data['Schools'])<all_data['Date'])
all_data['restrictions_flag']=(pd.to_datetime(all_data['Restrictions'])<all_data['Date'])

In [None]:
# a=(pd.to_datetime(all_data['Quarantine'])-pd.to_datetime(all_data['Date'])).dt.days
# a[~(a>=0)]

# Days elapsed since each measure started (Date - start date).
# -1 marks rows where the measure has not started yet or its date is unknown,
# so the sentinel cannot collide with a real day count.
all_data['quarantine_days']=(pd.to_datetime(all_data['Date'])-pd.to_datetime(all_data['Quarantine'])).dt.days
all_data['schools_days']=(pd.to_datetime(all_data['Date'])-pd.to_datetime(all_data['Schools'])).dt.days
all_data['restrictions_days']=(pd.to_datetime(all_data['Date'])-pd.to_datetime(all_data['Restrictions'])).dt.days

all_data.loc[~(all_data['quarantine_days']>=0),'quarantine_days']=-1
all_data.loc[~(all_data['schools_days']>=0),'schools_days'] =-1
all_data.loc[~(all_data['restrictions_days']>=0),'restrictions_days']=-1

## 2.3 Adding Weather & Happiness Index Data
For weather data, we will use the technique outlined in the great notebook https://www.kaggle.com/davidbnn92/weather-data?scriptVersionId=30695168


Dataset used for happiness data:
https://www.kaggle.com/londeen/world-happiness-report-2020

In [None]:
# all_data.to_csv('all_data.csv', index=False)
weather_data= pd.read_csv('../input/data-for-covid19/all_data.csv')
weather_data['Date']=pd.to_datetime(weather_data['Date'])


In [None]:
all_data=all_data.merge(weather_data[['Lat','Long','Country/Region','Province/State','Date','temp', 'min', 'max', 'stp', 'wdsp', 'prcp', 'fog']],
                          left_on=['Lat','Long','Country/Region','Province/State', 'Date'], right_on=['Lat','Long','Country/Region','Province/State', 'Date'], how='left')

In [None]:
world_happiness_index = pd.read_csv("../input/world-happiness/2019.csv")
world_happiness_index.loc[world_happiness_index['Country or region']=='United States', 'Country or region'] = 'US'
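# .last() keeps one row per country and uses the country name as the index,
# which the right_index=True merge in the next cell relies on
world_happiness_grouped = world_happiness_index.groupby('Country or region').last()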
# world_happiness_grouped.drop("Year", axis=1, inplace=True)

In [None]:
all_data = pd.merge(left=all_data, right=world_happiness_grouped, how='left', left_on='Country/Region', right_index=True)
# all_data = all_data.drop(columns=['Country or region'])

In [None]:
all_data=all_data.sort_values(by=['Country/Region','Province/State','Date'])
all_data.shape

## 2.4. Compute lags and trends

Enriching a dataset is key to obtaining good results. In this case we will apply 3 different transformations:

**Lag**. A lag is the value of a column from a previous day, so lag 1 of ConfirmedCases gives the value of this column on the previous day. The lag 3 of a feature X is simply:
$$X_{lag3}(t) = X(t-3)$$


**Trend**. Transforming a column into its trend gives its relative day-over-day change, which is different from the raw value. The definition of trend I will apply is: 
$$Trend_{X}(t) = {X(t) - X(t-1) \over X(t-1)}$$


**Moving Average**. Rolling mean of confirmed cases and fatalities computed over the previous 7 days:
$$Moving_{X}(t) = \frac{1}{7} \sum_{i=1}^{7} X(t-i)$$

Both lags and trends are computed for the previous 1 to 6 days (`range(1,7)`), for ConfirmedCases and Fatalities:

In [None]:
def calculate_trend(df, lag_list, column):
    for lag in lag_list:
        trend_column_lag = "Trend_" + column + "_" + str(lag)
        df[trend_column_lag] = (df[column].shift(lag, fill_value=0)-df[column].shift(lag+1, fill_value=-999))/df[column].shift(lag+1, fill_value=0)
    return df


def calculate_lag(df, lag_list, column):
    for lag in lag_list:
        column_lag = "Lag_" + column + "_" + str(lag)
        df[column_lag] = df[column].shift(lag, fill_value=0)
    return df

def moving_average(df,column):
    column_ma = "Moving_" + column 
    column_lag= "Lag_" + column + "_" + str(1)
    df[column_ma] = df[column_lag].rolling(window=7).mean()
    return df

ts = time.time()
all_data=all_data.sort_values(by=['Country/Region','Province/State','Date'])

all_data = calculate_lag(all_data, range(1,7), 'ConfirmedCases')
all_data = calculate_lag(all_data, range(1,7), 'Fatalities')
all_data = calculate_trend(all_data, range(1,7), 'ConfirmedCases')
all_data = calculate_trend(all_data, range(1,7), 'Fatalities')
all_data.replace([np.inf, -np.inf], 0, inplace=True)
all_data.fillna(0, inplace=True)

all_data = moving_average(all_data,'ConfirmedCases')
all_data = moving_average(all_data,'Fatalities')
all_data.replace([np.inf, -np.inf], 0, inplace=True)
all_data.fillna(0, inplace=True)

print("Time spent: ", time.time()-ts)
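
The cell above shifts the whole sorted frame, so the first rows of each region receive values from the end of the previous region. As an alternative, the sketch below computes the same three kinds of features per region with `groupby`. It is only an illustration on toy data: the `region` column, the 3-day window (shortened from 7 so that the toy series is long enough) and the column names are assumptions, not the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): cumulative counts for two regions over five days
toy = pd.DataFrame({
    "region": ["A"] * 5 + ["B"] * 5,
    "Day": list(range(5)) * 2,
    "ConfirmedCases": [1, 2, 4, 8, 16, 10, 10, 20, 20, 40],
}).sort_values(["region", "Day"])

by_region = toy.groupby("region")["ConfirmedCases"]

# Lag 1: previous day's value within the same region (0 where no history exists)
prev1 = by_region.shift(1)
prev2 = by_region.shift(2)
toy["Lag_1"] = prev1.fillna(0)

# Trend at lag 1: relative change between day t-2 and day t-1
trend = (prev1 - prev2) / prev2
toy["Trend_1"] = trend.replace([np.inf, -np.inf], np.nan).fillna(0)

# Moving average of the previous 3 days, computed inside each region
toy["Moving_3"] = by_region.transform(
    lambda s: s.shift(1).rolling(window=3).mean()
).fillna(0)

print(toy)
```

With the grouped shift, the first row of region B gets a lag of 0 instead of the last value of region A.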

In [None]:
# all_data[all_data['Country/Region']=='Spain'].iloc[40:50][['Id', 'Province/State', 'Country/Region', 'Lat', 'Long', 'Date',
#        'ConfirmedCases', 'Fatalities', 'ForecastId', 'Day', 'ConfirmedCases_1',
#        'ConfirmedCases_2', 'ConfirmedCases_3', 'Fatalities_1', 'Fatalities_2',
#        'Fatalities_3']]

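## 2.5 Likelihood Encoding

For each unique country + province combination, we encode the targets using the training data only: the maximum number of confirmed cases (or fatalities) divided by the number of days since the first case (or fatality), i.e. the average daily increase.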
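
As a small worked example of this encoding, the sketch below applies the same ratio to made-up numbers (the `toy` frame is hypothetical). Region `A_None` reaches 10 cases on day 1 after its first case, so its encoding is 10 / (1 + 1) = 5; region `B_None` reaches 3 cases on day 0, giving 3 / (0 + 1) = 3.

```python
import pandas as pd

# Toy training data (hypothetical): Dayofcases is -1 before the first case
toy = pd.DataFrame({
    "country/state": ["A_None"] * 3 + ["B_None"] * 3,
    "ConfirmedCases": [0, 4, 10, 0, 0, 3],
    "Dayofcases": [-1, 0, 1, -1, -1, 0],
})

# Maximum cumulative count and maximum day counter per region
enc = toy.groupby("country/state").agg(
    {"ConfirmedCases": "max", "Dayofcases": "max"}
)

# Average daily increase since the first case
enc["mean_encoding_confirmedCases"] = enc["ConfirmedCases"] / (enc["Dayofcases"] + 1)

# Attach the encoding to every row of the corresponding region
toy = toy.merge(
    enc[["mean_encoding_confirmedCases"]],
    left_on="country/state",
    right_index=True,
    how="left",
)
print(enc)
```

A region without any case has a maximum `Dayofcases` of -1, so the ratio becomes 0/0; this is why the notebook replaces infinite and missing values with 0 afterwards.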

In [None]:
train_data = all_data[all_data['ForecastId'] == -1]
encoding = train_data.groupby('country/state').agg({'ConfirmedCases': 'max', 'Dayofcases': 'max','Fatalities': 'max', 'Dayoffatalities': 'max'})
encoding['mean_encoding_confirmedCases'] = abs(encoding['ConfirmedCases']/(encoding['Dayofcases']+1))
encoding['mean_encoding_deathCases'] = abs(encoding['Fatalities']/(encoding['Dayoffatalities']+1))
encoding.replace([np.inf, -np.inf, np.nan], 0, inplace=True)

In [None]:
all_data=all_data.merge(encoding[['mean_encoding_confirmedCases','mean_encoding_deathCases']], left_on='country/state', right_index=True, how='left')

# 3. Predictions with machine learning <a id="section4"></a>

Our objective in this section is to predict the evolution of the pandemic from a data-centric perspective, like any other regression problem. To do so, remember that the challenge specifies that submissions on the public LB should only use data prior to 2020-03-12.

Models applied:
1. RandomForest


# Prepare data for model fitting

In [None]:
# Label encode country names
data = all_data.copy()
data = data.drop(columns=['country/state','Quarantine','Schools', 'Restrictions'])

data['Country/Region'] = le.fit_transform(data['Country/Region'])

# Save dictionary for exploration purposes
number = data['Country/Region']
countries = le.inverse_transform(data['Country/Region'])
country_dict = dict(zip(countries, number)) 

data['Continent'] = le.fit_transform(data['Continent'])
data['Sub_Region'] = le.fit_transform(data['Sub_Region'])
data['Province/State'] = le.fit_transform(data['Province/State'])
# data.drop(columns=[''])

In [None]:
dates_valid={date for date in dates_overlap if date > '2020-03-18'}

In [None]:
def split_data(data):
        # Train set
    new_train=data[data.ForecastId == -1]
    X_train = new_train.drop(['ConfirmedCases', 'Fatalities'], axis=1)
    Y_train_1 = new_train['ConfirmedCases']
    Y_train_2 = new_train['Fatalities']

    # Valid set
    valid=data[(data.ForecastId != -1)&(data.Date.isin(dates_overlap))]
    X_valid = valid.drop(['ConfirmedCases', 'Fatalities'], axis=1)
    Y_valid_1 = valid['ConfirmedCases']
    Y_valid_2 = valid['Fatalities']

    # Test set
    new_test=data[(data.ForecastId != -1)&(data.Date.isin(dates))]
    X_test = new_test.drop(['ConfirmedCases', 'Fatalities'], axis=1)

    # Test set
    #X_test = data[data.Day > day_valid].drop(['ConfirmedCases', 'Fatalities'], axis=1)
    
    X_train.drop(columns=['Id','ForecastId','Date'], inplace=True, errors='ignore')
    
    valid_index= X_valid['ForecastId']
    test_index = X_test['ForecastId']
    X_valid.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
    X_test.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
#     testdata=X_test.copy()
#     X_test.drop('Date', inplace=True, errors='ignore')

    return new_train, X_train, Y_train_1, Y_train_2, valid, X_valid, Y_valid_1, Y_valid_2, new_test, X_test, valid_index, test_index

def split_data_24(data):
        # Train set
    new_train=data[data.Date <=max(dates_overlap)]
    X_train = new_train.drop(['ConfirmedCases', 'Fatalities'], axis=1)
    Y_train_1 = new_train['ConfirmedCases']
    Y_train_2 = new_train['Fatalities']

    # Valid set
    valid=data[(data.ForecastId != -1)&(data.Date.isin(dates_overlap))]
    X_valid = valid.drop(['ConfirmedCases', 'Fatalities'], axis=1)
    Y_valid_1 = valid['ConfirmedCases']
    Y_valid_2 = valid['Fatalities']

    # Test set
    new_test=data[(data.ForecastId != -1)&(data.Date.isin(dates))]
    X_test = new_test.drop(['ConfirmedCases', 'Fatalities'], axis=1)

    # Test set
    #X_test = data[data.Day > day_valid].drop(['ConfirmedCases', 'Fatalities'], axis=1)
    
    X_train.drop(columns=['Id','ForecastId','Date'], inplace=True, errors='ignore')
    
    valid_index= X_valid['ForecastId']
    test_index = X_test['ForecastId']
    X_valid.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
    X_test.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
#     testdata=X_test.copy()
#     X_test.drop('Date', inplace=True, errors='ignore')

    return new_train, X_train, Y_train_1, Y_train_2, valid, X_valid, Y_valid_1, Y_valid_2, new_test, X_test, valid_index, test_index



def split_data_18(data):
        # Train set
    new_train=data[data.Date <='2020-03-18']
    X_train = new_train.drop(['ConfirmedCases', 'Fatalities'], axis=1)
    Y_train_1 = new_train['ConfirmedCases']
    Y_train_2 = new_train['Fatalities']

    # Valid set
    valid=data[(data.ForecastId != -1)&(data.Date.isin(dates_overlap))]
    X_valid = valid.drop(['ConfirmedCases', 'Fatalities'], axis=1)
    Y_valid_1 = valid['ConfirmedCases']
    Y_valid_2 = valid['Fatalities']

    # Test set
    new_test=data[(data.ForecastId != -1)&(data.Date.isin(dates))]
    X_test = new_test.drop(['ConfirmedCases', 'Fatalities'], axis=1)

    # Test set
    #X_test = data[data.Day > day_valid].drop(['ConfirmedCases', 'Fatalities'], axis=1)
    
    X_train.drop(columns=['Id','ForecastId','Date'], inplace=True, errors='ignore')
    
    valid_index= X_valid['ForecastId']
    test_index = X_test['ForecastId']
    X_valid.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
    X_test.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
#     testdata=X_test.copy()
#     X_test.drop('Date', inplace=True, errors='ignore')

    return new_train, X_train, Y_train_1, Y_train_2, valid, X_valid, Y_valid_1, Y_valid_2, new_test, X_test, valid_index, test_index

In [None]:
def recreate_valid_split(data):
    # Valid set
    valid   = data[(data.ForecastId != -1)&(data.Date.isin(dates_overlap))]
    X_valid = valid.drop(['ConfirmedCases', 'Fatalities'], axis=1)
    Y_valid_1 = valid['ConfirmedCases']
    Y_valid_2 = valid['Fatalities']
    X_valid.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
#     testdata=X_test.copy()
#     X_test.drop('Date', inplace=True, errors='ignore')
    return X_valid#, Y_valid_1, Y_valid_2

def recreate_submission_split(data):
    # Submission set (every row that carries a ForecastId)
    sub   = data[(data.ForecastId != -1)]
    X_sub = sub.drop(['ConfirmedCases', 'Fatalities'], axis=1)
    Y_sub_1 = sub['ConfirmedCases']
    Y_sub_2 = sub['Fatalities']
    X_sub.drop(columns=['Id','ForecastId'], inplace=True, errors='ignore')
#     testdata=X_test.copy()
#     X_test.drop('Date', inplace=True, errors='ignore')
    return X_sub#, Y_valid_1, Y_valid_2

def recalculate_lags(df):
    df = df.sort_values(by=['Country/Region','Province/State','Date'])
    df = calculate_lag(df, range(1,7), 'ConfirmedCases')
    df = calculate_lag(df, range(1,7), 'Fatalities')
#     df = calculate_trend(df, [1], 'ConfirmedCases')
#     df = calculate_trend(df, [1], 'Fatalities')
    df = calculate_trend(df, range(1,7), 'ConfirmedCases')
    df = calculate_trend(df, range(1,7), 'Fatalities')
    df.replace([np.inf, -np.inf], 0, inplace=True)
    df.fillna(0, inplace=True)
    
    df = moving_average(df,'ConfirmedCases')
    df = moving_average(df,'Fatalities')
    df.replace([np.inf, -np.inf], 0, inplace=True)
    df.fillna(0, inplace=True)
    return df
#     print("Time spent: ", time.time()-ts)

In [None]:
economy_columns=['World Rank', 'Region Rank', '2019 Score',
       'Property Rights', 'Judical Effectiveness', 'Government Integrity',
       'Tax Burden', "Gov't Spending", 'Fiscal Health', 'Business Freedom',
       'Labor Freedom', 'Monetary Freedom', 'Trade Freedom',
       'Investment Freedom ', 'Financial Freedom', 'Tariff Rate (%)',
       'Income Tax Rate (%)', 'Corporate Tax Rate (%)', 'Tax Burden % of GDP',
       "Gov't Expenditure % of GDP ", 'GDP (Billions, PPP)',
       'GDP Growth Rate (%)', '5 Year GDP Growth Rate (%)',
       'GDP per Capita (PPP)', 'Unemployment (%)', 'Inflation (%)',
       'FDI Inflow (Millions)', 'Public Debt (% of GDP)']


In [None]:
trend_columns=[e  for e in data.columns if e.startswith('Trend_')]
lag_columns=[e for e in data.columns if e.startswith('Lag_')  ]

# for e in data.columns:
#     if e.str.contains('Lag_'):
#         print(e)

In [None]:
new_train,X_train, Y_train_1, Y_train_2, valid, \
X_valid, Y_valid_1, Y_valid_2, new_test,X_test, valid_index, test_index \
= split_data(data.drop(columns=economy_columns ))
                     #+lag_columns+trend_columns))



## 3.1 Random Forest Regressor 

Lags, trends and moving averages are recalculated after each predicted date, so that every day's predictions feed the features used for the next day.
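
The idea can be sketched in plain Python, independent of the notebook's DataFrames. This is only an illustration: `recursive_forecast` and `naive_model` are hypothetical names, and `naive_model` is a toy stand-in for the fitted regressors, not the model used here.

```python
def recursive_forecast(history, horizon, n_lags, predict_one):
    """Forecast `horizon` steps, feeding each prediction back in as a lag."""
    series = list(history)
    forecasts = []
    for _ in range(horizon):
        # Lag features, most recent value first (Lag_1, Lag_2, ...).
        lags = series[-n_lags:][::-1]
        # Clip negatives: case counts cannot be below zero.
        y_hat = max(0.0, predict_one(lags))
        forecasts.append(y_hat)
        # The prediction becomes Lag_1 for the next day.
        series.append(y_hat)
    return forecasts


def naive_model(lags):
    """Hypothetical stand-in model: continue the last observed daily increase."""
    return lags[0] + (lags[0] - lags[1])


print(recursive_forecast([10, 12, 15], horizon=3, n_lags=2, predict_one=naive_model))
```

Because every step consumes earlier predictions, errors can accumulate along the horizon, which is one reason to clip negative values at each step rather than only at the end.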

In [None]:
test1=RandomForestRegressor(n_estimators=150,random_state = 42)

test1.fit(X_train, Y_train_1)

test2=RandomForestRegressor(n_estimators=150,random_state = 42)

test2.fit(X_train, Y_train_2)

In [None]:
columns=pd.DataFrame(data=X_train.columns, columns=['col'])
importances1=test1.feature_importances_
importances2=test2.feature_importances_
columns['importances1']=importances1
columns['importances2']=importances2

display(columns[(~columns.col.str.startswith('Trend'))&(~columns.col.str.startswith('Lag'))].sort_values(by='importances1',ascending=False).head(10))

display(columns[(~columns.col.str.startswith('Trend'))&(~columns.col.str.startswith('Lag'))].sort_values(by='importances2',ascending=False).head(10))



**Inferences from the model**

1. Engineered features on Fatalities and Confirmed Cases (lags, trends and moving averages) turned out to be the most important features for the model.
2. Days since the first confirmed case and days since quarantine also strongly affect the predictions.
3. Likelihood encodings for the province and country combination helped the model achieve better results.
4. School and restriction flags did not appear among the important features, probably because the dates of these restrictions fall within the validation period (12th to 25th March).
5. Weather data, specifically temperature, also affects model performance, although with lower feature importance values.
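
One way to check the first point is to sum the importances per feature family instead of reading them column by column. The sketch below is a plain-Python illustration: `group_importances` is a hypothetical helper, and the column names and importance values are made up for the example (only the `Lag_` and `Trend_` prefixes come from the notebook).

```python
from collections import defaultdict


def group_importances(names, importances, prefixes=("Lag_", "Trend_")):
    """Sum feature importances per prefix; unmatched columns go to 'other'."""
    totals = defaultdict(float)
    for name, value in zip(names, importances):
        group = next((p for p in prefixes if name.startswith(p)), "other")
        totals[group] += value
    return dict(totals)


# Illustrative values only, not results from the fitted model.
names = ["Lag_ConfirmedCases_1", "Trend_ConfirmedCases_1",
         "Lag_Fatalities_1", "days_since_first_case"]
importances = [0.5, 0.2, 0.2, 0.1]
print(group_importances(names, importances))
```

With the notebook's objects, the same helper could presumably be called as `group_importances(X_train.columns, test1.feature_importances_)`.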


## Validation Set and RMSLE Error

In [None]:
temp_data=data.copy()
temp_x_valid= X_valid.copy()
x_pred=valid.copy() # Final predictions
x_pred['Predictions1']=0
x_pred['Predictions2']=0
for date in sorted(dates_overlap):  
    if date>min(dates_overlap):
        temp_data.loc[temp_data['ForecastId'].isin(forecastids),'ConfirmedCases']=predictions1
        temp_data.loc[temp_data['ForecastId'].isin(forecastids),'Fatalities']=predictions2
        temp_data = recalculate_lags(temp_data)
#         temp_x_valid = recreate_valid_split(temp_data[important_columns+drop_features+lag_columns+trend_columns])
        temp_x_valid = recreate_valid_split(temp_data.drop(columns=economy_columns))
    forecastids=temp_data[temp_data['Date']==date]['ForecastId']
    valid_dataset=temp_x_valid[temp_x_valid['Date']==date].drop(columns=['Date'])
    
    predictions1=test1.predict(valid_dataset)
#     predictions1=pd.concat((valid_dataset['Lag_ConfirmedCases_1'].reset_index(),pd.Series(predictions1)),axis=1)
#     predictions1=predictions1.drop(columns=['index'])
#     predictions1=predictions1.max(axis=1).to_list()
    predictions1[predictions1 < 0] = 0
    
    predictions2=test2.predict(valid_dataset)
#     predictions2=pd.concat((valid_dataset['Lag_Fatalities_1'].reset_index(),pd.Series(predictions2)),axis=1)
#     predictions2=predictions2.drop(columns=['index'])
#     predictions2=predictions2.max(axis=1).to_list()
    predictions2[predictions2 < 0] = 0
    x_pred.loc[x_pred['ForecastId'].isin(forecastids),'Predictions1']=predictions1
    x_pred.loc[x_pred['ForecastId'].isin(forecastids),'Predictions2']=predictions2
#     print(predictions1)
#     print(predictions2)
   

In [None]:
(np.sqrt(mean_squared_log_error( Y_valid_1, x_pred['Predictions1'] )) +
      np.sqrt(mean_squared_log_error( Y_valid_2, x_pred['Predictions2'] )))/2

**The error on the validation set is 0.0 because the training and validation data overlap, so it says little about true forecasting performance. The real target is the private leaderboard.**
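
For reference, the score computed above is the mean of two RMSLE values, one for confirmed cases and one for fatalities. A minimal standard-library sketch of the metric follows. It uses `log(1 + x)`, as `mean_squared_log_error` does; `rmsle` and `mean_rmsle` are hypothetical helper names, not functions from the notebook.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both inputs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mean_rmsle(cases_true, cases_pred, deaths_true, deaths_pred):
    """Average of the two per-target RMSLE scores, as in the cell above."""
    return (rmsle(cases_true, cases_pred) + rmsle(deaths_true, deaths_pred)) / 2


# Perfect predictions give 0.0, which is what overlapping train/validation data produces.
print(mean_rmsle([10, 20], [10, 20], [1, 2], [1, 2]))
```

Since the metric works on logarithms, it penalises relative rather than absolute errors, so a miss of 10 cases matters more for a region with 20 cases than for one with 20,000.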

In [None]:
# xpred = x_pred[['Date','Country/Region','ConfirmedCases','Fatalities','Predictions1','Predictions2']]
# display(xpred[xpred['Country/Region']==74].tail(5))
# display(xpred[xpred['Country/Region']==68].tail(5))

## 2nd Model, trained up to 24th March

In [None]:
new_train,X_train, Y_train_1, Y_train_2, valid, \
X_valid, Y_valid_1, Y_valid_2, new_test,X_test, valid_index, test_index \
= split_data_24(data.drop(columns=economy_columns ))
                     #+lag_columns+trend_columns))

In [None]:
test1=RandomForestRegressor(n_estimators=150,random_state = 42)

test1.fit(X_train, Y_train_1)

test2=RandomForestRegressor(n_estimators=150,random_state = 42)

test2.fit(X_train, Y_train_2)

In [None]:
# train,X_train, Y_train_1, Y_train_2, valid,X_valid, Y_valid_1, Y_valid_2, test,X_test, valid_index, test_index = split_data(data.drop(columns=economy_columns))
temp_data=data.copy()
temp_x_sub= X_test.copy()
x_pred2=new_test.copy() # Final predictions
x_pred2['Predictions1']=0
x_pred2['Predictions2']=0

for date in sorted(dates):  
    if date>min(dates):
        temp_data.loc[temp_data['ForecastId'].isin(forecastids),'ConfirmedCases']=predictions1
        temp_data.loc[temp_data['ForecastId'].isin(forecastids),'Fatalities']=predictions2
        temp_data = recalculate_lags(temp_data)
        temp_x_sub = recreate_submission_split(temp_data.drop(columns=economy_columns))
    forecastids=temp_data[temp_data['Date']==date]['ForecastId']
    sub_dataset=temp_x_sub[temp_x_sub['Date']==date].drop(columns=['Date'])
    predictions1=test1.predict(sub_dataset)
    predictions1[predictions1 < 0] = 0
    predictions2=test2.predict(sub_dataset)
    predictions2[predictions2 < 0] = 0
    x_pred2.loc[x_pred2['ForecastId'].isin(forecastids),'Predictions1']=predictions1
    x_pred2.loc[x_pred2['ForecastId'].isin(forecastids),'Predictions2']=predictions2

    

### Combine validation and test period results

In [None]:
x_pred2.loc[x_pred2['ForecastId'].isin(x_pred['ForecastId']),['Predictions1','Predictions2']]=x_pred[['Predictions1','Predictions2']]

In [None]:
(np.sqrt(mean_squared_log_error( Y_valid_1, np.floor(x_pred2[x_pred2.Date.isin(dates_overlap)]['Predictions1']) )) +
      np.sqrt(mean_squared_log_error( Y_valid_2, np.floor(x_pred2[x_pred2.Date.isin(dates_overlap)]['Predictions2']) )))/2

## Submission: 12th-25th March from the 1st model, 26th March onwards from the 2nd model
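
The stitching rule can be sketched with plain dictionaries keyed by date. This is an illustration under assumptions: `stitch_predictions` is a hypothetical helper and the numbers are made up. The notebook itself combines the two result sets by `ForecastId` on DataFrames.

```python
from datetime import date


def stitch_predictions(first_model, second_model, cutoff):
    """Keep second_model everywhere, then overwrite days up to `cutoff` with first_model."""
    combined = dict(second_model)
    for day, value in first_model.items():
        if day <= cutoff:
            combined[day] = value
    return combined


# Illustrative values only.
first = {date(2020, 3, 24): 100.0, date(2020, 3, 25): 110.0}
second = {date(2020, 3, 25): 999.0, date(2020, 3, 26): 125.0}
print(stitch_predictions(first, second, cutoff=date(2020, 3, 25)))
```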

In [None]:
def get_submission(x_pred):
    # Submit predictions
#     submission = pd.DataFrame({
#         "ForecastId": index, 
#         "ConfirmedCases": prediction1,
#         "Fatalities": predictions2
#     })
    submission=x_pred[['ForecastId','Predictions1','Predictions2']].copy()
    submission.columns=['ForecastId','ConfirmedCases','Fatalities']
    # Assign the column directly: with .loc[:, col] newer pandas versions may
    # set values in place and keep the old dtype instead of casting to int.
    submission['ForecastId'] = submission['ForecastId'].astype(int)
    
#     submission['ConfirmedCases']=np.ceil(submission['ConfirmedCases'])
    submission.to_csv('submission.csv', index=False)
    

In [None]:
get_submission(x_pred2)