# COVID19 Predictions using XGBOOST
![covid19](https://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2020/02/coronavirus-768x432.jpg)
In this project I'll use the past three months of data to predict the Confirmed Cases and Fatalities for the month of April. The model used for training will be an **XGBOOST** model.
## <a id='main'>Table of Contents</a>
- [Let's Explore the Data](#exp)
- [Exploratory Data Analysis(EDA)](#eda)
    1. [Universal growth of COVID19 over time](#world)
    2. [Trend of COVID19 in top 10 affected countries](#top10)
    3. [Country Specific growth of COVID19](#country)
        - [United States of America](#us)
        - [India](#in)
        - [China](#ch)
- [Preprocessing](#pp)
- [Training and evaluating the model](#te)
- [Prediction](#pred)

# <a id='exp'>Let's explore the Data</a>

In [None]:
#Libraries to import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import pycountry
import plotly.express as px
sns.set_style('darkgrid')
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import OrdinalEncoder
from sklearn import metrics
import xgboost as xgb
from xgboost import XGBRegressor
from xgboost import plot_importance, plot_tree

In [None]:
df_train = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-2/train.csv') 
df_test = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-2/test.csv')

In [None]:
display(df_train.head())
display(df_train.describe())
display(df_train.info())

Currently, the date is stored as a string. Let's convert it to datetime format so that EDA on the data becomes easier.

In [None]:
df_train['Date'] = pd.to_datetime(df_train['Date'], format = '%Y-%m-%d')
df_test['Date'] = pd.to_datetime(df_test['Date'], format = '%Y-%m-%d')
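
As a small illustration of what this conversion buys us, here is a sketch on a toy frame (the dates are made up, not taken from `train.csv`). Once the column is a real datetime, `min`/`max`, date arithmetic, and the `.dt` accessor all work on dates rather than on strings.

```python
import pandas as pd

# Toy data for illustration only; not taken from train.csv.
toy = pd.DataFrame({"Date": ["2020-01-22", "2020-03-05", "2020-02-10"]})
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# Date arithmetic and component extraction now work directly.
span_days = (toy["Date"].max() - toy["Date"].min()).days
months = toy["Date"].dt.month.tolist()
print(span_days, months)
```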

In [None]:
print('Minimum date from training set: {}'.format(df_train['Date'].min()))
print('Maximum date from training set: {}'.format(df_train['Date'].max()))

In [None]:
print('Minimum date from test set: {}'.format(df_test['Date'].min()))
print('Maximum date from test set: {}'.format(df_test['Date'].max()))

# <a id='eda'>Exploratory Data Analysis(EDA)</a>
[Go back to the main page](#main)

After exploring the data and its datatypes, let's perform some EDA on the data in order to get a better understanding of the data and how COVID19 is affecting all of us.

### <a id='world'>Universal growth of COVID19 over time</a>
In this section, I'll look at how COVID19 has been growing throughout the world since 22nd January 2020. I'll use a choropleth map with a time slider to show the daily impact of the virus.

In [None]:
df_map = df_train.copy()
df_map['Date'] = df_map['Date'].astype(str)
df_map = df_map.groupby(['Date','Country_Region'], as_index=False)[['ConfirmedCases','Fatalities']].sum()

In [None]:
def get_iso3_util(country_name):
    # Try an exact name match first, then fall back to manual fixes and a fuzzy search.
    try:
        country = pycountry.countries.get(name=country_name)
        return country.alpha_3
    except Exception:
        if 'Congo' in country_name:
            country_name = 'Congo'
        elif country_name == 'Diamond Princess' or country_name == 'Laos':
            return country_name
        elif country_name == 'Korea, South':
            country_name = 'Korea, Republic of'
        elif country_name == 'Taiwan*':
            country_name = 'Taiwan'
        country = pycountry.countries.search_fuzzy(country_name)
        return country[0].alpha_3

d = {}  # cache of country name -> ISO alpha-3 code
def get_iso3(country):
    # Look up each country only once and always return the cached code.
    if country not in d:
        d[country] = get_iso3_util(country)
    return d[country]

df_map['iso_alpha'] = df_map['Country_Region'].map(get_iso3)
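
The dictionary above acts as a cache so that each country is looked up only once. As a minimal sketch of the same idea, `functools.lru_cache` from the standard library can do the bookkeeping. The lookup table and the `toy_iso3` function below are hypothetical stand-ins for the real `pycountry` call, used only so the example runs offline.

```python
from functools import lru_cache

# Hypothetical stand-in for the real pycountry lookup.
_TOY_CODES = {"India": "IND", "China": "CHN", "Italy": "ITA"}
calls = []

@lru_cache(maxsize=None)
def toy_iso3(country_name):
    calls.append(country_name)  # record each lookup that is actually executed
    return _TOY_CODES.get(country_name)

codes = [toy_iso3(c) for c in ["India", "China", "India", "India"]]
print(codes, len(calls))
```

Only two lookups are executed for four requests; the repeats are served from the cache.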

In [None]:
df_map['ln(ConfirmedCases)'] = np.log(df_map.ConfirmedCases + 1)
df_map['ln(Fatalities)'] = np.log(df_map.Fatalities + 1)

> Since cases and fatalities have grown exponentially over the last two months in countries like China, Italy, the USA, and Spain, I have plotted the choropleth maps on a logarithmic scale. You can hover over a country to see its total confirmed cases or fatalities.
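
To make the effect of the transform concrete, here is a short sketch with made-up counts. It also shows that `np.log1p` is NumPy's built-in equivalent of `np.log(x + 1)`, which could be used instead.

```python
import numpy as np

# Made-up cumulative counts spanning several orders of magnitude.
cases = np.array([0, 9, 99, 9_999, 99_999])

ln_cases = np.log(cases + 1)                    # the transform used in this notebook
same = np.allclose(ln_cases, np.log1p(cases))   # log1p gives the same values
print(ln_cases.round(2), same)
```

A zero count maps to 0 and a count of roughly 100,000 maps to about 11.5, so small and large outbreaks can share one colour scale.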

In [None]:
px.choropleth(df_map, 
              locations="iso_alpha", 
              color="ln(ConfirmedCases)", 
              hover_name="Country_Region", 
              hover_data=["ConfirmedCases"] ,
              animation_frame="Date",
              color_continuous_scale=px.colors.sequential.dense, 
              title='Daily Confirmed Cases growth(Logarithmic Scale)')

In [None]:
px.choropleth(df_map, 
              locations="iso_alpha", 
              color="ln(Fatalities)", 
              hover_name="Country_Region",
              hover_data=["Fatalities"],
              animation_frame="Date",
              color_continuous_scale=px.colors.sequential.OrRd,
              title = 'Daily Deaths growth(Logarithmic Scale)')

Below are my findings from the above plots:
- China was the first country to experience the onset of the virus.
- The US and Italy, currently the worst-affected countries, didn't record many cases in January. This shows how fast the virus spreads.
- The majority of cases are in the northern hemisphere, which is relatively cool at this time of year. The virus may be temperature sensitive; if so, case growth in the northern hemisphere may fall as summer progresses, which would mean a difficult winter season for the southern hemisphere.
- Western Europe is the worst-affected region and can be regarded as the new epicenter of COVID19. The USA is also in the reckoning.
- The lockdown seems to have worked in China's favour, as the growth rate has plummeted.

### <a id='top10'>Trend of COVID19 in top 10 affected countries</a>
I need to find the top 10 affected countries. Since Confirmed Cases and Fatalities are cumulative sums to date, I'll find the top 10 countries using the country totals on the last date for which training data is available.

In [None]:
#Get the top 10 countries
last_date = df_train.Date.max()
df_countries = df_train[df_train['Date']==last_date]
df_countries = df_countries.groupby('Country_Region', as_index=False)[['ConfirmedCases','Fatalities']].sum()
df_countries = df_countries.nlargest(10,'ConfirmedCases')
#Get the trend for top 10 countries
df_trend = df_train.groupby(['Date','Country_Region'], as_index=False)[['ConfirmedCases','Fatalities']].sum()
df_trend = df_trend.merge(df_countries, on='Country_Region')
df_trend.drop(['ConfirmedCases_y','Fatalities_y'],axis=1, inplace=True)
df_trend.rename(columns={'Country_Region':'Country', 'ConfirmedCases_x':'Cases', 'Fatalities_x':'Deaths'}, inplace=True)
#Add columns for studying logarithmic trends
df_trend['ln(Cases)'] = np.log(df_trend['Cases']+1)# Added 1 to remove error due to log(0).
df_trend['ln(Deaths)'] = np.log(df_trend['Deaths']+1)
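
Here is the same "latest date, then largest totals" logic on a toy frame, as a sketch with made-up numbers. Country `A` is split into two provinces to show why the per-country sum is needed before ranking.

```python
import pandas as pd

# Toy cumulative counts (made up): two dates, three countries, "A" has two provinces.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01"] * 4 + ["2020-03-02"] * 4),
    "Country_Region": ["A", "A", "B", "C"] * 2,
    "ConfirmedCases": [5, 5, 20, 1, 30, 30, 25, 2],
})

# Counts are cumulative, so the last date already holds each region's running total.
latest = toy[toy["Date"] == toy["Date"].max()]
totals = latest.groupby("Country_Region", as_index=False)[["ConfirmedCases"]].sum()
top2 = totals.nlargest(2, "ConfirmedCases")["Country_Region"].tolist()
print(top2)
```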

In [None]:
px.line(df_trend, x='Date', y='Cases', color='Country', title='COVID19 Cases growth for top 10 worst affected countries')

In [None]:
px.line(df_trend, x='Date', y='Deaths', color='Country', title='COVID19 Deaths growth for top 10 worst affected countries')

> Below is my analysis of the above line plots for the top 10 affected countries:
- Cases and deaths for China have stagnated over time.
- Cases and deaths are increasing monotonically (almost exponentially) for the rest of the countries.
- The US has shown the greatest rise in the number of confirmed cases. Italy, on the other hand, has the highest rise in deaths and is bearing the brunt of the virus. Spain is a close second to Italy.
- 6 of the top 10 affected countries are Western European countries.

Below, I have also plotted the daily variation of Confirmed Cases and Deaths for Top 10 affected countries on a logarithmic scale.

In [None]:
px.line(df_trend, x='Date', y='ln(Cases)', color='Country', title='COVID19 Cases growth for top 10 worst affected countries(Logarithmic Scale)')

In [None]:
px.line(df_trend, x='Date', y='ln(Deaths)', color='Country', title='COVID19 Deaths growth for top 10 worst affected countries(Logarithmic Scale)')
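
One reason the logarithmic view is useful: if cases grow exponentially, ln(cases) is a straight line in time, and its slope gives the growth rate, from which a doubling time of ln(2)/slope follows. The sketch below checks this on a synthetic series that doubles every 3 days (an assumption for illustration, not a fit to the real data). For large counts, the +1 used in the notebook's transform is negligible.

```python
import numpy as np

# Synthetic series that doubles every 3 days (illustrative assumption).
days = np.arange(10)
cases = 100 * 2 ** (days / 3)

# The slope of ln(cases) against time is the exponential growth rate.
growth_rate, _ = np.polyfit(days, np.log(cases), 1)
doubling_time = np.log(2) / growth_rate
print(round(doubling_time, 2))
```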

### <a id='country'>Country Specific growth of COVID19</a>

#### <a id='us'>United States of America</a>
As can be seen in the graphs below:
- The COVID19 outbreak started in Washington state on the west coast and later picked up pace in New York on the east coast.
- New York alone now has around 40% of the total cases in the USA.
- New Jersey is second in the list of worst-affected states.
- Cases and fatalities on the East Coast exceed those on the West Coast.

In [None]:
# Dictionary to get the state codes from state names for US
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

In [None]:
# Work on a copy so the new columns don't trigger SettingWithCopyWarning
df_us = df_train[df_train['Country_Region']=='US'].copy()
df_us['Date'] = df_us['Date'].astype(str)
# States missing from the dictionary are mapped to NaN
df_us['state_code'] = df_us['Province_State'].map(us_state_abbrev)
df_us['ln(ConfirmedCases)'] = np.log(df_us.ConfirmedCases + 1)
df_us['ln(Fatalities)'] = np.log(df_us.Fatalities + 1)

In [None]:
px.choropleth(df_us,
              locationmode="USA-states",
              scope="usa",
              locations="state_code",
              color="ln(ConfirmedCases)",
              hover_name="Province_State",
              hover_data=["ConfirmedCases"],
              animation_frame="Date",
              color_continuous_scale=px.colors.sequential.Darkmint,
              title = 'Daily Cases growth for USA(Logarithmic Scale)')

In [None]:
px.choropleth(df_us,
              locationmode="USA-states",
              scope="usa",
              locations="state_code",
              color="ln(Fatalities)",
              hover_name="Province_State",
              hover_data=["Fatalities"],
              animation_frame="Date",
              color_continuous_scale=px.colors.sequential.OrRd,
              title = 'Daily deaths growth for USA(Logarithmic Scale)')

#### <a id='in'>India</a>
The COVID19 outbreak started a bit late in India compared to other countries, but it has begun to pick up pace. With limited testing and an underfunded healthcare system, India is surely in for a challenge. Let's hope that the 21-day lockdown helps to stop, or at least slow down, the spread of this dreaded virus.

In [None]:
df_train['Province_State'] = df_train['Province_State'].fillna('NaN')

In [None]:
df_plot = df_train.groupby(['Date','Country_Region','Province_State'], as_index=False)[['ConfirmedCases','Fatalities']].sum()

In [None]:
df = df_plot.query("Country_Region=='India'")
px.line(df, x='Date', y='ConfirmedCases', title='Daily Cases growth for India')

In [None]:
px.line(df, x='Date', y='Fatalities', title='Daily Deaths growth for India')

#### <a id='ch'>China</a>
- This is where it all started! The graphs show that China has largely been able to stop the spread of COVID19.
- Almost all the cases are from Hubei Province, which can be attributed to the fact that the outbreak started in its capital, Wuhan.

> To get a better understanding of the growth in cases/fatalities in the other provinces, click on Hubei in the legend to hide it; the axis will then rescale automatically.

In [None]:
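ch_geojson = "../input/china-regions-map/china-provinces.json"
df_plot['day'] = df_plot.Date.dt.dayofyear
# NOTE: this assigns the same name (Xinjiang Uygur Autonomous Region) to every row;
# a Province_State-to-NL_NAME_1 lookup would be needed to colour each province separately.
df_plot['Province_ch'] = "新疆维吾尔自治区"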

In [None]:
df = df_plot.query("Country_Region=='China'")
fig = px.choropleth_mapbox(df,
              geojson=ch_geojson,
              #scope="asia",
              color="ConfirmedCases",
              locations="Province_ch",
              featureidkey="objects.CHN_adm1.geometries.properties.NL_NAME_1",
              #featureidkey="features.properties.name",
              animation_frame="day")
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
df = df_plot.query("Country_Region=='China'")
px.line(df, x='Date', y='ConfirmedCases', color='Province_State', title='Daily Cases growth for China')

In [None]:
px.line(df, x='Date', y='Fatalities', color='Province_State', title='Daily Deaths growth for China')

# <a id='pp'>Preprocessing</a>
[Go back to the main page](#main)

Convert the categorical variables, **Province_State** and **Country_Region**, into integers for training the model.

Province_State contains null values. Since the rows with a missing province are still needed for training, I'll convert the null values to the string 'NaN'. OrdinalEncoder() can then be applied directly.

In [None]:
def categoricalToInteger(df):
    #convert NaN Province State values to a string
    df['Province_State'] = df['Province_State'].fillna('NaN')
    #Define Ordinal Encoder Model
    oe = OrdinalEncoder()
    #Select the columns by name rather than by position
    cat_cols = ['Province_State','Country_Region']
    df[cat_cols] = oe.fit_transform(df[cat_cols])
    return df
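
One thing to watch with this helper: it fits a fresh encoder on whatever frame it receives, so the train and test codes agree only if both frames contain exactly the same set of regions (which I'm assuming holds for this competition's files). The sketch below uses toy frames with made-up region names to show what happens otherwise, and how fitting once and reusing the encoder avoids it.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames (made up). The test frame lacks region "B".
toy_train = pd.DataFrame({"Country_Region": ["A", "B", "C"]})
toy_test = pd.DataFrame({"Country_Region": ["A", "C"]})

# Fit once on the training categories, then reuse the same mapping.
oe = OrdinalEncoder()
oe.fit(toy_train[["Country_Region"]])
shared = oe.transform(toy_test[["Country_Region"]]).ravel().tolist()

# Fitting separately on the test frame gives "C" a different code.
separate = OrdinalEncoder().fit_transform(toy_test[["Country_Region"]]).ravel().tolist()
print(shared, separate)
```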

Extract useful features from date.

In [None]:
def create_features(df):
    df['day'] = df['Date'].dt.day
    df['month'] = df['Date'].dt.month
    df['dayofweek'] = df['Date'].dt.dayofweek
    df['dayofyear'] = df['Date'].dt.dayofyear
    df['quarter'] = df['Date'].dt.quarter
    df['weekofyear'] = df['Date'].dt.isocalendar().week.astype(int)
    return df
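
To see what these features look like, here is a sketch on two hand-picked dates. It uses `dt.isocalendar().week` for the week number, which is the replacement for the `.dt.weekofyear` accessor that newer pandas versions no longer provide.

```python
import pandas as pd

dates = pd.DataFrame({"Date": pd.to_datetime(["2020-01-22", "2020-04-01"])})
feats = pd.DataFrame({
    "day": dates["Date"].dt.day,
    "month": dates["Date"].dt.month,
    "dayofweek": dates["Date"].dt.dayofweek,   # Monday=0
    "dayofyear": dates["Date"].dt.dayofyear,
    "quarter": dates["Date"].dt.quarter,
    # ISO week number, cast from UInt32 to a plain integer
    "weekofyear": dates["Date"].dt.isocalendar().week.astype(int),
})
print(feats)
```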

In [None]:
def cum_sum(df, date, country, state):
    sub_df = df[(df['Country_Region']==country) & (df['Province_State']==state) & (df['Date']<=date)]
    display(sub_df)
    return sub_df['ConfirmedCases'].sum(), sub_df['Fatalities'].sum()

Split the training data into a train set and a dev set for validation. The split is by date: the last 7 days are held out as the dev set.

In [None]:
def train_dev_split(df):
    date = df['Date'].max() - dt.timedelta(days=7)
    return df[df['Date'] <= date], df[df['Date'] > date]
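
A quick check of this split on a toy 20-day date range (made up for illustration): the dev set gets the last 7 days, and every training date comes before every dev date, so the model is never validated on dates it has already seen.

```python
import datetime as dt
import pandas as pd

toy = pd.DataFrame({"Date": pd.date_range("2020-03-01", periods=20, freq="D")})

# Same rule as train_dev_split: everything after (max date - 7 days) is held out.
cutoff = toy["Date"].max() - dt.timedelta(days=7)
toy_train, toy_dev = toy[toy["Date"] <= cutoff], toy[toy["Date"] > cutoff]
print(len(toy_train), len(toy_dev))
```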

In [None]:
df_train = categoricalToInteger(df_train)
df_train = create_features(df_train)

In [None]:
df_train, df_dev = train_dev_split(df_train)

Select all the columns that are needed for training the model.

In [None]:
columns = ['day','month','dayofweek','dayofyear','quarter','weekofyear','Province_State', 'Country_Region','ConfirmedCases','Fatalities']
df_train = df_train[columns]
df_dev = df_dev[columns]

# <a id='te'>Training and evaluating the model</a>
[Go back to the main page](#main)

In this section, I'll train an XGBOOST model on the data and evaluate it on the dev set. Since I have to predict both **Confirmed Cases** and **Fatalities**, I'll use 2 separate models.

Obtain the numpy arrays of the train and dev set.

In [None]:
train = df_train.values
dev = df_dev.values
X_train, y_train = train[:,:-2], train[:,-2:]
X_dev, y_dev = dev[:,:-2], dev[:,-2:]
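
The slicing relies on the two targets being the last two columns of the selected frame. A tiny sketch with toy values shows how the features and each target are separated:

```python
import numpy as np

# Columns: two features followed by the two targets (toy values).
arr = np.array([[1, 10, 100, 5],
                [2, 20, 200, 8]])

X, y = arr[:, :-2], arr[:, -2:]        # all but the last two columns / the last two
y_cases, y_deaths = y[:, 0], y[:, 1]   # one target vector per model
print(X.shape, y_cases, y_deaths)
```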

In [None]:
'''train = df_train.values
X_train, y_train = train[:,:-2], train[:,-2:]'''

In [None]:
def modelfit(alg, X_train, y_train,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(X_train, label=y_train)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='rmse', early_stopping_rounds=early_stopping_rounds, show_stdv=False)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(X_train, y_train)
        
    #Predict training set:
    predictions = alg.predict(X_train)
    #predprob = alg.predict_proba(X_train)[:,1]
        
    #Print model report:
    print("\nModel Report")
    #print("Accuracy : %.4g" % metrics.accuracy_score(y_train, predictions))
    print("RMSE Score (Train): %f" % np.sqrt(metrics.mean_squared_error(y_train, predictions)))
                    
    feat_imp = pd.Series(alg.feature_importances_).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
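
As a reminder for the report printed by this helper, RMSE is the square root of the mean squared error, so `mean_squared_error` on its own returns the squared quantity. A worked example with made-up numbers:

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

mse = np.mean((y_true - y_pred) ** 2)   # (4 + 4 + 9) / 3
rmse = np.sqrt(mse)                     # same units as the target
print(round(mse, 3), round(rmse, 3))
```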

In [None]:
'''model1 = XGBRegressor(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'reg:squarederror',
 scale_pos_weight=1)
modelfit(model1, X_train, y_train[:,0])'''

In [None]:
'''model2 = XGBRegressor(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'reg:squarederror',
 scale_pos_weight=1)
modelfit(model2, X_train, y_train[:,1])'''

Create the model

In [None]:
model1 = XGBRegressor(n_estimators=1000)
model2 = XGBRegressor(n_estimators=1000)

Train the models and evaluate them using the dev set, which holds the last week of the training data.

In [None]:
model1.fit(X_train, y_train[:,0],
           eval_set=[(X_train, y_train[:,0]), (X_dev, y_dev[:,0])],
           verbose=False)

In [None]:
model2.fit(X_train, y_train[:,1],
           eval_set=[(X_train, y_train[:,1]), (X_dev, y_dev[:,1])],
           verbose=False)

* Get the feature importance for both the models.

In [None]:
plot_importance(model1);

In [None]:
plot_importance(model2);

# <a id='pred'>Prediction</a>
[Go back to the main page](#main)

Here, I have combined the predictions from both the models and prepared the submission file.

In [None]:
df_test = categoricalToInteger(df_test)
df_test = create_features(df_test)

In [None]:
columns = ['day','month','dayofweek','dayofyear','quarter','weekofyear','Province_State', 'Country_Region']
df_test = df_test[columns]

In [None]:
y_pred1 = model1.predict(df_test.values)
y_pred2 = model2.predict(df_test.values)

In [None]:
df_submit = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-2/submission.csv')

In [None]:
df_submit.ConfirmedCases = y_pred1
df_submit.Fatalities = y_pred2

In [None]:
'''df_submit.ConfirmedCases = df_submit.ConfirmedCases.apply(lambda x:max(0,round(x,0)))
df_submit.Fatalities = df_submit.Fatalities.apply(lambda x:max(0,round(x,0)))'''
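
The commented-out cell above rounds the predictions and floors them at zero, since a regressor can return negative or fractional counts. A vectorised sketch of the same clean-up with NumPy, on made-up predictions:

```python
import numpy as np

# Made-up raw model outputs, including a negative value.
raw = np.array([-3.2, 0.4, 17.6, 250.49])

# Round to whole counts, then replace anything below zero with zero.
cleaned = np.clip(np.round(raw), 0, None)
print(cleaned)
```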

In [None]:
df_submit.to_csv(r'submission.csv', index=False)

# Do leave an upvote if you like the work :) Constructive feedback is welcome!