<h2>Forecasting The Number Of People Infected With Coronavirus in the World</h2>

<h3>Importing necessary libraries</h3>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import mean_absolute_error,mean_squared_error
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
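# recent statsmodels releases removed arima_model.ARIMA and ar_model.AR;
# their replacements are imported here
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.ar_model import AutoReg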
from datetime import datetime
from statsmodels.tsa.holtwinters import ExponentialSmoothing
import math
from statsmodels.tsa.statespace.sarimax import SARIMAX
from keras.preprocessing.sequence import TimeseriesGenerator
from statsmodels.tsa.seasonal import seasonal_decompose

In [None]:
cor_inf = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv')
cor_inf.head(5)

<h3>Data Preprocessing for the covid-19 dataset</h3>

<p>The features we need are:</p>
<li>Country names</li>
<li>Count of infected people</li>
<li>Dates on which those counts were recorded in each country</li>
<p>We do not forecast individual countries, because a per-country forecast can be biased:
    the predicted counts for some countries may be very large or very small and act as outliers.</p>

In [None]:
#inspect the columns; Lat, Long and Province/State are dropped in the next step
cor_inf.columns

<p>Drop the <b>Province/State</b>, <b>Latitude</b> and <b>Longitude</b> columns: they make the data too fine-grained for a worldwide forecast, and values in those columns may be missing for some countries.</p>

In [None]:
cor_inf.drop(labels = ['Province/State','Lat', 'Long'],axis = 1, inplace= True)

In [None]:
cor_inf.head(20)

<h3>Reshaping the data to proceed with Forecasting</h3>

<p>Now we reshape the data into the form a time series model needs: dates as the index
and country names as the column names.
</p>

In [None]:
cor_inf.shape

In [None]:
#cor_inf['Country/Region'].value_counts()

<p>To get the total number of infected people per day in the world, we sum the infected counts of all countries for each date.</p>
<p>First, however, the Country/Region column contains repeated country names, so we
must sum the rows of each country before computing the worldwide count.</p>

In [None]:
#group the rows by country, since a country name can appear on more than one row
cor_inf = cor_inf.groupby(['Country/Region']).sum()
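
<p>A minimal sketch of the same idea on a toy table, assuming made-up counts and only two date columns. It shows how repeated country rows collapse into one row per country, and how a transpose followed by a row-wise sum yields the worldwide total per date.</p>

```python
import pandas as pd

# Toy stand-in for the confirmed-cases table: one row per province,
# so a country can appear on several rows (all values are made up).
toy = pd.DataFrame({
    'Country/Region': ['China', 'China', 'Italy'],
    '1/22/20': [10, 5, 0],
    '1/23/20': [12, 7, 1],
})

# Collapse the repeated country rows into one row per country.
per_country = toy.groupby('Country/Region').sum()

# Transpose so dates become the index and countries the columns,
# then sum across the columns for the worldwide total per date.
per_date = per_country.T
per_date['Total infected'] = per_date.sum(axis=1)
print(per_date)
```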

In [None]:
cor_inf.loc['China'].tail(5)

In [None]:
#reshape the data for time series analysis:
#transposing puts one row per date and one column per country
cor_inf_re = cor_inf.T.copy()

In [None]:
type(cor_inf.index[0])

In [None]:
cor_inf_re.index = cor_inf.columns[:]

In [None]:
cor_inf_re.head(5)

In [None]:
def total_infected_sum():
    # row-wise sum across all country columns gives the worldwide total per date
    return cor_inf_re.sum(axis=1).tolist()

In [None]:
cor_inf_re['Total infected'] = total_infected_sum()

In [None]:
cor_inf_re.tail(5)

<p>All that remains is to convert the index from strings (dtype object) to datetime.</p>

In [None]:
def parser(date):
    # the labels are month/day/two-digit-year strings, e.g. '1/22/20'
    return datetime.strptime(date,'%m/%d/%y')

In [None]:
#convert str to datetime in index 
timestamp = []
for i in range(0,len(cor_inf_re)):
    timestamp.append(parser(cor_inf_re.index[i]))
cor_inf_re.index = timestamp
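
<p>As an alternative to the loop above, pandas can parse a whole list of labels in one call. A short sketch, assuming the labels follow the month/day/two-digit-year pattern; the sample labels here are illustrative, not read from the dataset.</p>

```python
import pandas as pd

# Labels in month/day/two-digit-year form (illustrative values).
labels = ['1/22/20', '1/23/20', '2/1/20']

# The explicit format removes any day/month ambiguity.
index = pd.to_datetime(labels, format='%m/%d/%y')
print(index)
```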

<b>Summing the infected counts of all countries gives the total number of infected people per date.
</b>

In [None]:
cor_inf_re.to_csv('./covid_19_confirmed.csv')

In [None]:
#preparing for time series
infected_people = cor_inf_re['Total infected']

In [None]:
#column for infected per day: the first day keeps its own count,
#later days are the day-over-day difference of the cumulative total
#(.iloc avoids positional [] lookups on a datetime index)
total = cor_inf_re['Total infected']
cor_inf_re['Infected_per_Day'] = total.diff().fillna(total.iloc[0]).astype(int)
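
<p>The relation between the cumulative total and the per-day column can be checked on a few made-up numbers. This sketch uses NumPy rather than the notebook's DataFrame: differencing the running total gives the daily counts, and a cumulative sum of the daily counts recovers the running total.</p>

```python
import numpy as np

# Made-up running totals for four consecutive days.
cumulative = np.array([5, 8, 8, 15])

# Day-over-day increase; prepend=0 keeps the first day's own count.
daily = np.diff(cumulative, prepend=0)

# Summing the daily counts back up recovers the running total.
rebuilt = np.cumsum(daily)
print(daily, rebuilt)
```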

<h3>Visualization of dataset</h3>

In [None]:
#visualization
plt.xlabel('dates')
plt.ylabel('infected people')
infected_people.plot(figsize = (11,5),marker='o')
plt.legend()

In [None]:
#check the statistical part of the data
infected_people.describe()

In [None]:
#decompose the series to check for a trend or seasonality
#(seasonal_decompose is already imported in the first cell)
result = seasonal_decompose(infected_people)

In [None]:
result.trend.plot(figsize=(12,4))

In [None]:
result.seasonal.plot(figsize=(12,4))
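
<p>To make the trend and seasonal components less of a black box, here is a hand-written sketch of an additive decomposition on synthetic data. It assumes a weekly period of 7 and a made-up weekly pattern; it is a simplified illustration, not the exact procedure statsmodels uses.</p>

```python
import numpy as np
import pandas as pd

period = 7                      # assumed weekly cycle
n = 6 * period
t = np.arange(n)
weekly = np.array([3.0, -1.0, -2.0, 0.0, 1.0, 2.0, -3.0])  # made up, sums to zero

# Synthetic series: linear trend plus the repeating weekly pattern.
series = pd.Series(10.0 + 2.0 * t + weekly[t % period],
                   index=pd.date_range('2020-01-22', periods=n, freq='D'))

# Trend: centered moving average over one full period (NaN at the edges).
trend = series.rolling(window=period, center=True).mean()

# Seasonal: mean detrended value at each position within the period.
detrended = series - trend
seasonal = detrended.groupby(t % period).mean()
print(seasonal)
```

<p>Because the weekly pattern averages to zero over a full period, the moving average recovers the linear trend and the grouped means recover the weekly pattern.</p>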

In [None]:
#autocorrelation graph
plot_acf(infected_people)

In [None]:
plot_pacf(infected_people)

In [None]:
infec_one = infected_people.diff(periods=1)
infec_one = infec_one[1:]
plot_acf(infec_one)
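
<p>The cell above differences the series once before plotting its autocorrelation. A small sketch on a made-up quadratic series shows why differencing helps: each round of differencing removes one degree of trend.</p>

```python
import numpy as np

t = np.arange(10)
quadratic = 3 + 2 * t + t ** 2       # toy series with a curved trend

first = np.diff(quadratic)           # still trending upward: 3, 5, 7, ...
second = np.diff(quadratic, n=2)     # constant, so the trend is gone
print(first, second)
```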

In [None]:
train = infected_people.iloc[:-8]
test = infected_people.iloc[-8:]
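
<p>A sketch of how the hold-out split maps to forecast positions, using a made-up 20-day series. The names <code>horizon</code>, <code>start</code> and <code>end</code> are illustrative. Deriving the positions from the split keeps the forecast aligned with the test set even when the dataset grows.</p>

```python
import pandas as pd

# Made-up daily series standing in for the worldwide totals.
series = pd.Series(range(20),
                   index=pd.date_range('2020-01-22', periods=20, freq='D'))

horizon = 8                              # number of held-out days
train = series.iloc[:-horizon]
test = series.iloc[-horizon:]

# Zero-based positions of the first and last held-out points.
start = len(train)
end = len(train) + len(test) - 1
print(start, end)
```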

<h2>Exponential Smoothing</h2>

Single, Double and Triple Exponential Smoothing can be implemented in Python using the ExponentialSmoothing Statsmodels class.

First, an instance of the ExponentialSmoothing class must be instantiated, specifying both the training data and some configuration for the model.

Specifically, you must specify the following configuration parameters:

<li>trend: The type of trend component, as either “add” for additive or “mul” for multiplicative. Modeling the trend can be disabled by setting it to None.</li>
<li>damped: Whether or not the trend component should be damped, either True or False (named damped_trend in recent statsmodels releases).</li>
<li>seasonal: The type of seasonal component, as either “add” for additive or “mul” for multiplicative. Modeling the seasonal component can be disabled by setting it to None.</li>
<li>seasonal_periods: The number of time steps in a seasonal period, e.g. 12 for 12 months in a yearly seasonal structure.
</li>

The model can then be fit on the training data by calling the fit() function.

This function allows you to either specify the smoothing coefficients of the exponential smoothing model or have them optimized. By default, they are optimized (e.g. optimized=True). These coefficients include:

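<li>smoothing_level (alpha): the smoothing coefficient for the level.</li>
<li>smoothing_slope (beta): the smoothing coefficient for the trend (named smoothing_trend in recent statsmodels releases).</li>
<li>smoothing_seasonal (gamma): the smoothing coefficient for the seasonal component.</li>
<li>damping_slope (phi): the coefficient for the damped trend (named damping_trend in recent statsmodels releases).</li>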
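
To make the role of alpha concrete, here is a hand-written sketch of single exponential smoothing on made-up numbers. It is not the statsmodels implementation; the function name and the choice to start from the first observation are assumptions made for illustration. The trend and seasonal coefficients follow analogous recursions.

```python
def simple_exp_smoothing(values, alpha):
    """Return the smoothed level after each observation."""
    level = values[0]            # assumption: initialise with the first value
    levels = [level]
    for x in values[1:]:
        # New level: weighted mix of the new observation and the old level.
        level = alpha * x + (1 - alpha) * level
        levels.append(level)
    return levels

data = [10.0, 12.0, 11.0, 15.0]  # made-up observations
smooth = simple_exp_smoothing(data, alpha=0.5)
print(smooth)
```

An alpha close to 1 makes the level follow the latest observation; an alpha close to 0 makes it change slowly.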

In [None]:
model = ExponentialSmoothing(train,trend = "mul",seasonal_periods=7,seasonal="add").fit()

In [None]:
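#forecast exactly the held-out positions instead of hard-coded indices
predictions = model.predict(start = len(train), end = len(train) + len(test) - 1)
#predictions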

In [None]:
plt.figure(figsize = (12,4))
predictions.plot(c ='r',marker = 'o',markersize=10,linestyle='--')
test.plot(marker = 'o',markersize=10,linestyle='--')
print("root mean squared error : ",math.sqrt(mean_squared_error(test,predictions)))
print("mean absolute error : ",mean_absolute_error(test,predictions))
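
For reference, the two error metrics printed above can be computed by hand. A sketch on made-up actual and predicted values, using only the standard library:

```python
import math

actual = [100.0, 110.0, 120.0]       # made-up observed values
predicted = [98.0, 113.0, 120.0]     # made-up forecasts

errors = [a - p for a, p in zip(actual, predicted)]

# Mean absolute error: average size of the errors, in the data's units.
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: penalises large errors more heavily.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(mae, rmse)
```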

<h2>SARIMAX model</h2>

In [None]:
model = SARIMAX(train,order = (4,2,1),trend='t',seasonal_order=(2, 2, 1, 14))
model_fit = model.fit()

In [None]:
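#forecast the same held-out positions as the test set so the errors are comparable
predictions = model_fit.predict(start = len(train), end = len(train) + len(test) - 1)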
predictions

In [None]:
plt.figure(figsize = (12,4))
plt.plot(predictions,'r',marker = 'o',markersize=10,linestyle='--')
plt.plot(test,marker = 'o',markersize=10,linestyle='--')
print("root mean squared error : ",math.sqrt(mean_squared_error(test,predictions)))
print("mean absolute error : ",mean_absolute_error(test,predictions))

<p>I therefore settled on the SARIMA model, as it performed better: its predictions were
very close to the real data and it did not overfit, which was the case with exponential smoothing.
</p>

In [None]:
#18-day forecast starting at the first held-out date
predictions = model_fit.predict(start = len(train), end = len(train) + 17)
predictions

<p>According to this forecast, by the end of March the virus may have infected about 450,000
people worldwide, so precaution and prevention should be a first priority for every human
being.</p>