# INTRODUCTION

The 2019–20 coronavirus pandemic is an ongoing pandemic of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The outbreak started in Wuhan, Hubei Province, China, in December 2019. The World Health Organization (WHO) declared the outbreak to be a Public Health Emergency of International Concern on 30 January 2020 and recognised it as a pandemic on 11 March 2020.
As of 3 April 2020, more than 1.01 million cases of COVID-19 have been reported in more than 180 countries and 200 territories, resulting in more than 53,100 deaths. More than 212,000 people have recovered.


Efforts to prevent the virus spreading include travel restrictions, quarantines, curfews, workplace hazard controls, event postponements and cancellations, and facility closures. These include national or regional quarantines throughout the world (starting with the quarantine of Hubei), curfew measures in mainland China and South Korea, various border closures or incoming passenger restrictions, screening at airports and train stations, and outgoing passenger travel bans.



# EXPLORATORY DATA ANALYSIS (EDA)

**TASKS**

1.   Counting the number of countries affected by the virus.
2.   Visualizing global Confirmed and Death cases over time.
3.   Analysing the top 5 countries by Confirmed and Death cases.

In [None]:
# Importing required libraries for data processing and visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_dark"
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
df_train = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-3/train.csv', na_filter=False)
df_train.columns.tolist()

In [None]:
df_train=df_train.drop(['Id'],axis=1)

df_train.head(10)

In [None]:
df_train['Country_Region'].unique().tolist()

In [None]:
# Get the number of countries affected by the virus

affected_country = df_train['Country_Region'].nunique()
earliest_entry = df_train['Date'].min()
last_entry = df_train['Date'].max()
print('There are {a} affected countries between {b} and {c}'.format(a=affected_country, b=earliest_entry, c=last_entry))


In [None]:
# confirmed cases as of 03-04-2020

xy = df_train.drop('Province_State',axis=1)
current = xy[xy['Date'] == max(xy['Date'])].reset_index()
# Select several columns with a list; tuple-style selection on a groupby fails in recent pandas
current_case = current.groupby('Country_Region')[['ConfirmedCases','Fatalities']].sum().reset_index()
highest_case = current.groupby('Country_Region')['ConfirmedCases'].sum().reset_index()
fig = px.bar(highest_case.sort_values('ConfirmedCases', ascending=False)[:5][::-1], 
             x='ConfirmedCases', y='Country_Region',
             title='Global Confirmed Cases (03-04-2020)', text='ConfirmedCases', height=900, orientation='h')
fig.show()

**OBSERVATION:** The chart above shows the five countries with the most confirmed cases as of 3 April 2020. The USA tops the chart with 308,690 confirmed cases, followed by Italy, Spain, China and Germany.

In [None]:
# plot the confirmed cases in the world over time.

world_wide_case = df_train.groupby('Date')['ConfirmedCases'].sum().reset_index()
fig = px.line(world_wide_case, x="Date", y="ConfirmedCases", 
              title="Worldwide Confirmed Cases Over Time")
fig.show()

**OBSERVATION:** Between Jan 22 and Apr 4 the worldwide number of confirmed cases kept climbing with no sign of levelling off, which is not encouraging at all. From the graph above, we observe that the cumulative total has risen close to 1.2 million.
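
One way to put a number on the growth described above is to look at daily new cases and the day-over-day percentage change of the cumulative series. The sketch below runs on a small made-up series (the values are illustrative, not taken from `train.csv`); `toy`, `NewCases` and `GrowthPct` are names introduced here for the example.

```python
import pandas as pd

# Illustrative cumulative counts only; these are not values from train.csv.
toy = pd.DataFrame({
    "Date": pd.date_range("2020-03-01", periods=5, freq="D"),
    "ConfirmedCases": [100, 150, 225, 300, 360],
})

# Daily new cases are the first difference of the cumulative series.
toy["NewCases"] = toy["ConfirmedCases"].diff()

# Day-over-day growth of the cumulative total, in percent.
toy["GrowthPct"] = toy["ConfirmedCases"].pct_change() * 100

print(toy)
```

The same two lines could be applied to `world_wide_case` to check whether the percentage growth is slowing even while the cumulative total keeps rising.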

In [None]:
# Countries with the highest number of deaths (counts, not rates)

highest_death = current.groupby('Country_Region')['Fatalities'].sum().reset_index()
fig = px.bar(highest_death.sort_values('Fatalities',ascending=False)[:5][::-1],
            x='Fatalities',y='Country_Region',
             title='Global Death Cases (03-04-2020)', text='Fatalities', height=900, orientation='h')
fig.show()

**OBSERVATIONS:** 

This chart shows the countries with the highest number of deaths recorded between January and April 3. Italy leads with an estimated 15362, followed by Spain (11947), the USA (estimated 1840), then France and the UK with an estimated 7574 and 4320 respectively.

In [None]:
# Death cases worldwide over time

death_cases = df_train.groupby('Date')['Fatalities'].sum().reset_index()
fig = px.line(death_cases, x="Date", y="Fatalities", 
              title="Worldwide Fatalities Over Time")
fig.show()

**OBSERVATION:** 
A steady surge of COVID-19 activity on many continents has pushed the global death toll from the novel coronavirus past 64,000, which points to an alarmingly high level of mortality.
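
The mortality mentioned above can be approximated by dividing cumulative fatalities by cumulative confirmed cases, which gives a crude case fatality ratio. This is a minimal sketch on made-up numbers, not the notebook's data; `toy` and `CFR_pct` are hypothetical names used only here.

```python
import pandas as pd

# Made-up totals for three placeholder regions.
toy = pd.DataFrame({
    "Country_Region": ["A", "B", "C"],
    "ConfirmedCases": [1000, 500, 0],
    "Fatalities": [50, 10, 0],
})

# Mask zero case counts so the division yields NaN instead of dividing by zero.
confirmed = toy["ConfirmedCases"].where(toy["ConfirmedCases"] > 0)

# Crude case fatality ratio in percent.
toy["CFR_pct"] = toy["Fatalities"] / confirmed * 100

print(toy)
```

Because both columns in `train.csv` are cumulative, such a ratio would be computed on a single date (for example the `current` frame), not on sums over dates.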

# How did covid-19 spread to the rest of world, and its impact?

In [None]:
# How did covid-19 spread?

# Select several columns with a list; tuple-style selection on a groupby fails in recent pandas
virus_spread = df_train.groupby(['Date', 'Country_Region'])[['ConfirmedCases', 'Fatalities']].max()
virus_spread = virus_spread.reset_index()
virus_spread['Date'] = pd.to_datetime(virus_spread['Date'])
virus_spread['Date'] = virus_spread['Date'].dt.strftime('%m/%d/%Y')
# Compress the marker size so large outbreaks do not swamp the map
virus_spread['Size'] = virus_spread['ConfirmedCases'].pow(0.3)
fig = px.scatter_geo(virus_spread, locations="Country_Region", locationmode='country names',
                     color="ConfirmedCases", size='Size', hover_name="Country_Region",
                     range_color=[0, 100], projection="natural earth", animation_frame="Date",
                     title='COVID-19: Virus Spread Over Time Globally (2020-01-22 to 2020-03-30)',
                     color_continuous_scale="peach")
fig.show()

**OBSERVATION:**

Wow!

At the earliest point (from the data available) the disease seems to be only around China and its neighboring countries.

However, it quickly spread to Europe, Australia and even the US, which is very interesting.

Things still seem to be in fairly good shape for European countries even in mid-February.

West Asia, especially Iran and Iraq, begins to catch fire at the end of February, with Italy showing signs of the dread to come. South Korea and China are peaking at this point.

By March 5, look at Europe. They could have locked down right at that moment.

The disease also took Africa and the Americas by storm in early March, with alarm bells ringing loudly for the US. Needless to say how it ended.

According to the data so far, the USA, UK, Spain, Italy, Germany and France are in deep trouble. The next few days are crucial for how the disease develops around the world.


In [None]:
df_test = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-3/test.csv")
df_test.head()

In [None]:
test_data = (
    df_test.groupby(["Date", "Country_Region"]).last().reset_index()[["Date", "Country_Region"]])
test_data

# TRAINING AND EVALUATING THE MODELS

In [None]:
# importing required libraries for data processing and prediction

from sklearn.preprocessing import OrdinalEncoder
from sklearn import metrics
import xgboost as xgb
from xgboost import XGBRegressor
from xgboost import plot_importance, plot_tree
import datetime as dt

In [None]:
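def categoricalToInteger(df):
    # Replace missing Province_State values (NaN, or '' when read with na_filter=False) with the string 'NaN'
    df['Province_State'] = df['Province_State'].fillna('NaN').replace('', 'NaN')
    # Define the ordinal encoder
    oe = OrdinalEncoder()
    # Select the columns by name: their positions differ between train (Id dropped) and test
    cat_cols = ['Province_State', 'Country_Region']
    # A fresh encoder is fitted on every call, so train and test codes agree only if both hold the same categories
    df[cat_cols] = oe.fit_transform(df[cat_cols])
    return df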
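
The function above fits a new encoder each time it is called. An alternative that keeps the integer codes identical across train and test by construction is to fit the encoder once on the training frame and reuse it on the test frame. A minimal sketch on made-up rows; `toy_train`, `toy_test`, `cat_cols` and `encoder` are hypothetical names, and this is not the approach the notebook itself takes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["Province_State", "Country_Region"]

# Made-up rows for illustration only.
toy_train = pd.DataFrame({
    "Province_State": ["NaN", "Hubei", "NaN"],
    "Country_Region": ["Italy", "China", "Spain"],
})
toy_test = pd.DataFrame({
    "Province_State": ["Hubei", "NaN"],
    "Country_Region": ["China", "Spain"],
})

# Fit once on the training categories, then reuse the same mapping on test.
encoder = OrdinalEncoder()
toy_train[cat_cols] = encoder.fit_transform(toy_train[cat_cols])
toy_test[cat_cols] = encoder.transform(toy_test[cat_cols])

print(toy_train)
print(toy_test)
```

With this pattern, `transform` raises an error if the test frame contains a category that was never seen during fitting, which surfaces mismatches early.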

In [None]:
def create_features(df):
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    df['day'] = df['Date'].dt.day
    df['month'] = df['Date'].dt.month
    df['dayofweek'] = df['Date'].dt.dayofweek
    df['dayofyear'] = df['Date'].dt.dayofyear
    df['quarter'] = df['Date'].dt.quarter
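    # dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the ISO week number
    df['weekofyear'] = df['Date'].dt.isocalendar().week.astype(int)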
    return df
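
To make the calendar features concrete, the sketch below derives them for two dates from the period covered by the data. It uses `dt.isocalendar().week`, which newer pandas releases provide in place of `dt.weekofyear`; `dates` and `features` are names introduced for this example.

```python
import pandas as pd

# First date in the data and the date used in the bar charts.
dates = pd.to_datetime(pd.Series(["2020-01-22", "2020-04-03"]))

features = pd.DataFrame({
    "day": dates.dt.day,
    "month": dates.dt.month,
    "dayofweek": dates.dt.dayofweek,   # Monday is 0
    "dayofyear": dates.dt.dayofyear,
    "quarter": dates.dt.quarter,
    "weekofyear": dates.dt.isocalendar().week.astype(int),
})

print(features)
```

Note that every one of these features is a function of the date alone, so within a given region the model sees time only through these calendar values.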

In [None]:
def cum_sum(df, date, country, state):
    sub_df = df[(df['Country_Region']==country) & (df['Province_State']==state) & (df['Date']<=date)]
    display(sub_df)
    return sub_df['ConfirmedCases'].sum(), sub_df['Fatalities'].sum()

In [None]:
# Split the training data into a train set and a dev set: the last 7 days are held out for validation.

def train_dev_split(df):
    date = df['Date'].max() - dt.timedelta(days=7)
    return df[df['Date'] <= date], df[df['Date'] > date]
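
The effect of this time-based hold-out can be checked on a small frame. The sketch below repeats the same split logic on 20 made-up consecutive days; the `holdout_days` parameter and the `toy` frame are additions for illustration and are not part of the notebook.

```python
import datetime as dt
import pandas as pd

def train_dev_split(df, holdout_days=7):
    # Everything up to the cutoff is training data; the final days form the dev set.
    cutoff = df["Date"].max() - dt.timedelta(days=holdout_days)
    return df[df["Date"] <= cutoff], df[df["Date"] > cutoff]

# 20 consecutive days of placeholder data.
toy = pd.DataFrame({
    "Date": pd.date_range("2020-03-01", periods=20, freq="D"),
    "ConfirmedCases": range(20),
})

train_part, dev_part = train_dev_split(toy)
print(len(train_part), len(dev_part))
print(train_part["Date"].max(), dev_part["Date"].min())
```

Splitting by date instead of at random keeps future observations out of the training set, which matters when the goal is forecasting.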

In [None]:
df_train = categoricalToInteger(df_train)
df_train = create_features(df_train)

In [None]:
df_train, df_dev = train_dev_split(df_train)

In [None]:
# Selecting all columns that are necessary for prediction

columns = ['day','month','dayofweek','dayofyear','quarter','weekofyear','Province_State', 'Country_Region','ConfirmedCases','Fatalities']
df_train = df_train[columns]
df_dev = df_dev[columns]

In [None]:
# Build the feature matrices (all but the last two columns) and the targets (ConfirmedCases, Fatalities)

train = df_train.values
dev = df_dev.values
X_train, y_train = train[:,:-2], train[:,-2:]
X_dev, y_dev = dev[:,:-2], dev[:,-2:]

In [None]:
'''train = df_train.values
X_train, y_train = train[:,:-2], train[:,-2:]'''

In [None]:
def modelfit(alg, X_train, y_train,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(X_train, label=y_train)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='rmse', early_stopping_rounds=early_stopping_rounds, show_stdv=False)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data (eval_metric is no longer accepted by fit() in recent xgboost)
    alg.fit(X_train, y_train)
        
    #Predict training set:
    predictions = alg.predict(X_train)
    #predprob = alg.predict_proba(X_train)[:,1]
        
    #Print model report:
    print("\nModel Report")
    #print("Accuracy : %.4g" % metrics.accuracy_score(y_train, predictions))
    #Take the square root: mean_squared_error returns the MSE, not the RMSE
    print("RMSE Score (Train): %f" % np.sqrt(metrics.mean_squared_error(y_train, predictions)))
                    
    feat_imp = pd.Series(alg.feature_importances_).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')

In [None]:
'''model1 = XGBRegressor(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'reg:squarederror',
 scale_pos_weight=1)
modelfit(model1, X_train, y_train[:,0])'''

In [None]:
'''model2 = XGBRegressor(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'reg:squarederror',
 scale_pos_weight=1)
modelfit(model2, X_train, y_train[:,1])'''

In [None]:
# Creating the model

model1 = XGBRegressor(n_estimators=1000)
model2 = XGBRegressor(n_estimators=1000)

In [None]:
# training the model

model1.fit(X_train, y_train[:,0],
           eval_set=[(X_train, y_train[:,0]), (X_dev, y_dev[:,0])],
           verbose=False)

In [None]:
model2.fit(X_train, y_train[:,1],
           eval_set=[(X_train, y_train[:,1]), (X_dev, y_dev[:,1])],
           verbose=False)

In [None]:
plot_importance(model1);

In [None]:
plot_importance(model2);
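
The models above are fitted with an `eval_set`, but no hold-out score is printed. A score on the dev set could be computed along the following lines. This is a sketch under assumptions: the helper names `rmse` and `rmsle` and the toy arrays are introduced here, and RMSLE is included only because it is a common choice for cumulative counts that span several orders of magnitude.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmsle(y_true, y_pred):
    # Root mean squared logarithmic error.
    # Negative predictions are clipped to zero so that log1p is defined.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))

# Made-up targets and predictions for illustration.
y_dev_toy = [100, 200, 400]
pred_toy = [110, 190, 400]

print("RMSE :", rmse(y_dev_toy, pred_toy))
print("RMSLE:", rmsle(y_dev_toy, pred_toy))
```

In the notebook, a call such as `rmse(y_dev[:, 0], model1.predict(X_dev))` would give the dev-set error for confirmed cases, and the same with `y_dev[:, 1]` and `model2` for fatalities.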

# FORECASTING

In [None]:
# Apply the same preprocessing to the test set (assign back to df_test, not df_train)
df_test = categoricalToInteger(df_test)
df_test = create_features(df_test)

In [None]:
columns = ['day','month','dayofweek','dayofyear','quarter','weekofyear','Province_State', 'Country_Region']
df_test = df_test[columns]

In [None]:
y_pred1 = model1.predict(df_test.values)
y_pred2 = model2.predict(df_test.values)

In [None]:
df_submit = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-3/submission.csv')

In [None]:
df_submit.ConfirmedCases = y_pred1
df_submit.Fatalities = y_pred2

In [None]:
'''df_submit.ConfirmedCases = df_submit.ConfirmedCases.apply(lambda x:max(0,round(x,0)))
df_submit.Fatalities = df_submit.Fatalities.apply(lambda x:max(0,round(x,0)))'''
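
The commented-out cell above rounds the predictions and floors them at zero, since a regressor can return negative or fractional counts. A vectorised version of that clean-up could look like the sketch below. The toy array is made up, and the final monotonic step is an extra assumption: it only makes sense when applied to one region's predictions in date order, because cumulative counts should never decrease.

```python
import numpy as np

# Made-up raw predictions for one region, in date order.
raw_pred = np.array([-3.2, 0.4, 12.6, 11.9, 20.1])

# Round to whole counts and floor at zero.
cleaned = np.clip(np.round(raw_pred), 0, None)

# Optional: force the cumulative series to be non-decreasing.
monotone = np.maximum.accumulate(cleaned)

print(cleaned)
print(monotone)
```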


In [None]:
df_submit.to_csv(r'submission.csv', index=False)

Please leave an upvote and a comment, thank you!

# **\#Stay Safe Everyone**
