The code below is starter code. I have just started learning time series analysis, and I wrote it while working through a course on the subject. If you see improvements that should be made, please let me know and I will do my best to incorporate them. Also, if you like my work, please upvote this notebook; I will be very grateful. Again, thank you for your support!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf

In [None]:
# Reading Train and Test Datasets
train = pd.read_csv('../input/covid19-global-forecasting-week-2/train.csv')
test = pd.read_csv('../input/covid19-global-forecasting-week-2/test.csv')
print("\t\tTrain Data:\n")
display(train.head())
display(train.tail())
print("\t\tTest Data:\n")
display(test.head())
display(test.tail())
print("\t\tSummary of Train Data:\n")
display(train.describe())

# <u>Exploratory Analysis</u>

## 1. Exploring The Fatalities and Confirmed Cases

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
confirmed_cases = train.groupby(['Date']).agg({'ConfirmedCases' : ['sum']})
fatalities = train.groupby(['Date']).agg({'Fatalities' : ['sum']})
totalCases = confirmed_cases.join(fatalities)
fig = plt.figure(figsize=(17, 8))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
confirmed_cases.plot(ax=ax1)
ax1.set_title('Exploration of Global Confirmed Cases', size = 13)
ax1.set_ylabel('Number of Cases', size = 13)
ax1.set_xlabel('Date', size = 13)
fatalities.plot(ax=ax2)
ax2.set_title('Exploration of Fatalities', size = 13)
ax2.set_ylabel('Number of Fatalities', size = 13)
ax2.set_xlabel('Date', size = 13)
fig.tight_layout()
plt.show()

As the plots above show, both confirmed cases and fatalities started increasing significantly from 11 Feb 2020. About one month later, from 12 March 2020, both spiked sharply at the global level.
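
Because the plotted series are cumulative totals, a spike can be easier to judge from day-over-day changes. The following is a minimal sketch on made-up numbers (the values and the `cumulative` name are illustrative assumptions, not taken from the dataset) showing how `diff()` and `pct_change()` turn a cumulative series into daily new counts and growth rates.

```python
import pandas as pd

# Hypothetical cumulative counts indexed by date (illustrative values only)
cumulative = pd.Series(
    [100, 120, 150, 210, 300],
    index=pd.date_range("2020-03-01", periods=5, freq="D"),
    name="ConfirmedCases",
)

# diff() converts cumulative totals into new cases per day;
# the first day has no predecessor, so its NaN is dropped
daily_new = cumulative.diff().dropna()

# pct_change() gives the fractional day-over-day growth of the total
growth_rate = cumulative.pct_change().dropna()

print(daily_new)
print(growth_rate.round(3))
```

The same two calls could be applied to the aggregated `confirmed_cases` and `fatalities` frames built above.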

In [None]:
totalCases.head(10)

## 2. Exploring Fatalities and Confirmed Cases by Country

In [None]:
import plotly.express as px
countries = list(set(list(train['Country_Region'])))
agg_funcs = {'Date': 'first', 'ConfirmedCases': 'sum', 'Fatalities': 'sum'}

### Exploring Confirmed Cases for Top 20 Countries

In [None]:
num_conf_cases = []
for country in countries:
    data2 = train.loc[train['Country_Region'] == country]
    num_cases_country = data2.groupby(data2['Date']).aggregate(agg_funcs).max().ConfirmedCases
    num_conf_cases.append(num_cases_country)

# index ordered by num_conf_cases    
idx_top_by_cases = list(reversed(np.argsort(num_conf_cases)))

for i in range(20):
    idx_top = idx_top_by_cases[i]
    print('%d: %s (%d cases)' % (i+1, countries[idx_top], num_conf_cases[idx_top]))

In [None]:
countries_str = '[%s]'% (', '.join(["'%s'"%countries[idx] for idx in idx_top_by_cases[:20]]))  
data_top_countries = train.query("Country_Region == %s" % countries_str) 

fig = px.line(data_top_countries, x="Date", y="ConfirmedCases", color="Country_Region",
              line_group="Country_Region", hover_name="Country_Region",
              title="Top 20 Countries with Most Confirmed Cases")
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()

Before 16 February 2020, confirmed cases in China continued to rise, but after 18 February 2020 the number almost plateaued. In contrast, confirmed cases grew suddenly in Italy from 12 March 2020, in Spain from 14 March 2020, and in the US from 19 March 2020. As of 26 March 2020, Italy's confirmed cases exceed China's, as the plot above shows.
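
The per-country ranking used for this plot can also be computed without an explicit loop. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration; only the column names mirror `train.csv`). It sums provinces per country and date, takes each country's peak, and keeps the largest entries.

```python
import pandas as pd

# Hypothetical miniature of the training data; values are made up
toy = pd.DataFrame({
    "Country_Region": ["A", "A", "A", "A", "B", "B", "C", "C"],
    "Province_State": ["x", "y", "x", "y", None, None, None, None],
    "Date": ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02",
             "2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02"],
    "ConfirmedCases": [10, 5, 20, 8, 50, 60, 1, 2],
})

# Sum over provinces so each (country, date) pair has one total
daily_by_country = toy.groupby(["Country_Region", "Date"])["ConfirmedCases"].sum()

# Peak daily total per country, mirroring the .max() in the loop version
peak = daily_by_country.groupby(level="Country_Region").max()

# Keep the n largest countries (n=2 here; the notebook uses 20)
top = peak.nlargest(2)
print(top)
```

With the real data, `top.index` could then be passed to `train['Country_Region'].isin(...)` instead of building a query string by hand.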

### Exploring Top 20 Countries for Cases of Fatalities

In [None]:
fatalities_cases = []
for country in countries:
    data2 = train.loc[train['Country_Region'] == country]
    fatalities_country = data2.groupby(data2['Date']).aggregate(agg_funcs).max().Fatalities
    fatalities_cases.append(fatalities_country)

# index ordered by fatalities_cases
idx_top_by_fatalities = list(reversed(np.argsort(fatalities_cases)))

for i in range(20):
    idx_top = idx_top_by_fatalities[i]
    print('%d: %s (%d fatalities)' % (i+1, countries[idx_top], fatalities_cases[idx_top]))

In [None]:
countries_str = '[%s]'% (', '.join(["'%s'"%countries[idx] for idx in idx_top_by_fatalities[:20]]))

data_top_countries = train.query("Country_Region == %s" % countries_str)

fig = px.line(data_top_countries, x="Date", y="Fatalities", color="Country_Region",
              line_group="Country_Region", hover_name="Country_Region",
              title="Top 20 Countries with Most Fatalities")
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()

After 24 February 2020, the number of fatalities in China almost flattened. From 8 March 2020 in Italy and 14 March 2020 in Spain, fatalities spiked and have since surpassed China's. After 8 March 2020, Iran's fatalities also started increasing, almost reaching China's. From 20 March 2020, fatalities in France started increasing as well, nearly overtaking Iran's.

## 3. Correlation Between Confirmed Cases and Fatalities

In [None]:
#Calculating percentage increase in Confirmed Cases and Fatalities
totalCases['CasesIncrease'] = totalCases.ConfirmedCases.pct_change()
totalCases['FatalitiesIncrease'] = totalCases.Fatalities.pct_change()

In [None]:
totalCases.head()

In [None]:
fig = px.scatter(totalCases, x = 'CasesIncrease', y = 'FatalitiesIncrease')
fig.show()

In [None]:
# Finding Correlation Between Confirmed Cases and Fatalities
correlation = totalCases.CasesIncrease.corr(totalCases.FatalitiesIncrease)
print("Correlation Between Increase in Fatalities and Confirmed Cases:", correlation)

In [None]:
type(totalCases)

In [None]:
percent_change = pd.DataFrame({"CasesIncrease" : [x for x in totalCases.CasesIncrease],
                              "FatalitiesIncrease" : [y for y in totalCases.FatalitiesIncrease]}, index=totalCases.index)
percent_change.head()

## 4. Simple Linear Regression Model

In [None]:
import statsmodels.api as sm

In [None]:
X = pd.DataFrame(percent_change, columns = ['CasesIncrease'])
X = sm.add_constant(X)
X.head()

In [None]:
X.fillna(0, inplace = True)
X.head()

In [None]:
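# Assign the result back instead of calling fillna(inplace=True) on a column,
# which may not modify the original DataFrame in recent pandas versions
percent_change['FatalitiesIncrease'] = percent_change['FatalitiesIncrease'].fillna(0)
y = percent_change['FatalitiesIncrease']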
print(y)

In [None]:
results = sm.OLS(y, X).fit()
results.summary()

The model above suggests a roughly linear relationship between CasesIncrease and FatalitiesIncrease: as the growth rate of confirmed cases rises, the growth rate of fatalities worldwide rises with it. The slope coefficient, its p-value, and the R-squared value in the summary indicate how strong this relationship is.

## 5. Autocorrelation

In [None]:
percent_change.head()

In [None]:
totalCases2 = totalCases[['ConfirmedCases', 'Fatalities']]
totalCases2.head()

In [None]:
totalCases2.index = pd.to_datetime(totalCases2.index)
totalCases3 = totalCases2.resample(rule='W').last() # Weekly
percent_change3 = totalCases3.pct_change()
percent_change3 = percent_change3.dropna()

In [None]:
percent_change3.head()

In [None]:
# Weekly Autocorrelations
auto_corr_conf = percent_change3['ConfirmedCases']['sum'].autocorr() #Autocorrelation of CasesIncrease 
auto_corr_Fatal = percent_change3['Fatalities']['sum'].autocorr() #Autocorrelation of FatalitiesIncrease
print("The Autocorrelation of CasesIncrease Time Series:", auto_corr_conf)
print("The Autocorrelation of FatalitiesIncrease Time Series:", auto_corr_Fatal)

On a weekly basis, both series have positive lag-1 autocorrelation, i.e. they tend to follow the trend: an above-average weekly increase is likely to be followed by another.
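
For reference, the lag-1 autocorrelation reported by `Series.autocorr()` is the Pearson correlation between a series and a copy of itself shifted by one step. The sketch below checks this on a synthetic trend-following series (the series and the 0.6 carry-over factor are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Synthetic series in which each value carries over 60% of the previous one
rng = np.random.default_rng(1)
shocks = rng.normal(size=200)
values = np.zeros(200)
for t in range(1, 200):
    values[t] = 0.6 * values[t - 1] + shocks[t]
series = pd.Series(values)

# Built-in lag-1 autocorrelation
lag1 = series.autocorr(lag=1)

# The same quantity computed by hand: correlation with the shifted series
manual = series.corr(series.shift(1))

print(lag1, manual)
```

A positive value indicates momentum (trend following), while a negative value would indicate mean reversion.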

In [None]:
# Plotting ACF of Total Confirmed Cases
plot_acf(totalCases2.ConfirmedCases, alpha = 0.05)

In [None]:
# Plotting ACF of Total Fatalities
plot_acf(totalCases2.Fatalities, alpha = 0.05)

### 5.1. Test for Random Walk in Total Confirmed Cases and Fatalities

In [None]:
totalCases2.head()

In [None]:
from statsmodels.tsa.stattools import adfuller

# Running ADF test on ConfirmedCases
results_ADF = adfuller(totalCases2.ConfirmedCases)

# print p-value
print(results_ADF[1])

<B>Null Hypothesis:</B> ConfirmedCases Time Series follows Random Walk<br>
<B>Alternate Hypothesis:</B> ConfirmedCases Time Series does not follow Random Walk<br>
<br><br>
As the ADF test above shows, the p-value is greater than 0.05, so at the 5% significance level we do not reject the Null Hypothesis that the ConfirmedCases Time Series follows a Random Walk (i.e. has a unit root).

In [None]:
# Running ADF test on Fatalities
results2_ADF = adfuller(totalCases2.Fatalities)

# print p-value
print(results2_ADF[1])

<B>Null Hypothesis:</B> Fatalities Time Series follows Random Walk<br>
<B>Alternate Hypothesis:</B> Fatalities Time Series does not follow Random Walk<br>
<br><br>
As the ADF test above shows, the p-value is greater than 0.05, so at the 5% significance level we do not reject the Null Hypothesis that the Fatalities Time Series follows a Random Walk (i.e. has a unit root).

In [None]:
# Repeating the same test for the percentage increase in ConfirmedCases and Fatalities
totalCases4 = totalCases2.pct_change()
totalCases4.dropna(inplace = True)
totalCases4.columns = ['CasesIncrease', 'FatalitiesIncrease']
totalCases4.head()

In [None]:
# Running ADF test on FatalitiesIncrease
results3_ADF = adfuller(totalCases4.FatalitiesIncrease)

# print p-value
print(results3_ADF[1])

<B>Null Hypothesis:</B> FatalitiesIncrease Time Series follows Random Walk<br>
<B>Alternate Hypothesis:</B> FatalitiesIncrease Time Series does not follow Random Walk<br>
<br><br>
As the ADF test above shows, the p-value is greater than 0.05, so at the 5% significance level we do not reject the Null Hypothesis that the FatalitiesIncrease Time Series follows a Random Walk (i.e. has a unit root).

In [None]:
# Running ADF test on CasesIncrease
results4_ADF = adfuller(totalCases4.CasesIncrease)

# print p-value
print(results4_ADF[1])

<B>Null Hypothesis:</B> CasesIncrease Time Series follows Random Walk<br>
<B>Alternate Hypothesis:</B> CasesIncrease Time Series does not follow Random Walk<br>
<br><br>
As the ADF test above shows, the p-value is greater than 0.05, so at the 5% significance level we do not reject the Null Hypothesis that the CasesIncrease Time Series follows a Random Walk (i.e. has a unit root).

### 5.2. Exploring Autocorrelations in Total Confirmed Cases and Fatalities in India

In [None]:
india_data = train[train.Country_Region == 'India']
india_data.tail()

In [None]:
confirmed_cases_india = india_data.groupby(['Date']).agg({'ConfirmedCases' : ['sum']})
fatalities_india = india_data.groupby(['Date']).agg({'Fatalities' : ['sum']})
# Join the India-specific frames (not the global ones)
totalCasesIndia = confirmed_cases_india.join(fatalities_india)
fig = plt.figure(figsize=(17, 8))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
confirmed_cases_india.plot(ax=ax1)
ax1.set_title('Exploration of Confirmed Cases in India', size = 13)
ax1.set_ylabel('Number of Cases', size = 13)
ax1.set_xlabel('Date', size = 13)
fatalities_india.plot(ax=ax2)
ax2.set_title('Exploration of Fatalities in India', size = 13)
ax2.set_ylabel('Number of Fatalities', size = 13)
ax2.set_xlabel('Date', size = 13)
fig.tight_layout()
plt.show()

In India too, the numbers of confirmed cases and fatalities are on an upward swing.

In [None]:
acf_array_cases = acf(confirmed_cases_india)
print(acf_array_cases)

In [None]:
acf_array_fatalities = acf(fatalities_india)
print(acf_array_fatalities)

In [None]:
# Plotting ACF of Confirmed Cases in India
plot_acf(confirmed_cases_india, alpha = 0.05)

In [None]:
# Plotting ACF of Fatalities in India
plot_acf(fatalities_india, alpha = 0.05)

#### 5.2.1 Test for Random Walk for Confirmed Cases and Fatalities in India

In [None]:
# Running ADF test on Confirmed Cases in India
results_india_conf = adfuller(confirmed_cases_india)

# print p-value
print(results_india_conf[1])

In [None]:
# Running ADF test on Fatalities in India
results_india_fatal = adfuller(fatalities_india)

# print p-value
print(results_india_fatal[1])

The results for India are the same as those for confirmed cases and fatalities globally.

## 6. AR Model

In [None]:
from statsmodels.graphics.tsaplots import plot_pacf

In [None]:
plot_pacf(totalCases2.ConfirmedCases, alpha=0.05)

Based on the PACF, an AR(1) model looks suitable for ConfirmedCases.

In [None]:
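# Note: ARMA was removed in statsmodels 0.13. On newer versions, use
# statsmodels.tsa.arima.model.ARIMA with order=(p, 0, q) instead.
from statsmodels.tsa.arima_model import ARMA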

# ARMA Model for ConfirmedCases
mod = ARMA(totalCases2.ConfirmedCases, order = (1, 0))
result = mod.fit()
result.summary()

In [None]:
result.plot_predict(start=60, end=90)
plt.show()

In [None]:
# PACF of Fatalities (used to pick the AR order for the ARMA model)
plot_pacf(totalCases2.Fatalities, alpha=0.05)

In [None]:
mod2 = ARMA(totalCases2.Fatalities, order = (1, 0))
result2 = mod2.fit()
result2.summary()

In [None]:
result2.plot_predict(start = 60, end = 90)
plt.show()

In [None]:
totalCases4.head()

In [None]:
plot_pacf(totalCases4.FatalitiesIncrease, alpha=0.05)

In [None]:
BIC = np.zeros(6)
for p in range(6):
    mod3 = ARMA(totalCases4.FatalitiesIncrease, order = (p, 0))
    result3 = mod3.fit()
    # Storing BIC
    BIC[p] = result3.bic

In [None]:
# Plot the BIC as a function of p
plt.plot(range(1,6), BIC[1:6], marker='o')
plt.xlabel('Order of AR Model')
plt.ylabel('Bayesian Information Criterion')
plt.show()

In [None]:
mod4 = ARMA(totalCases4.FatalitiesIncrease, order = (3, 0))
result4 = mod4.fit()
result4.summary()

In [None]:
result4.plot_predict(start = 50, end = 90)
plt.show()

In [None]:
plot_pacf(totalCases4.CasesIncrease, alpha=0.05)

In [None]:
BIC = np.zeros(6)
for p in range(6):
    mod3 = ARMA(totalCases4.CasesIncrease, order = (p, 0))
    result3 = mod3.fit()
    # Storing BIC
    BIC[p] = result3.bic

In [None]:
# Plot the BIC as a function of p
plt.plot(range(1,6), BIC[1:6], marker='o')
plt.xlabel('Order of AR Model')
plt.ylabel('Bayesian Information Criterion')
plt.show()

In [None]:
mod5 = ARMA(totalCases4.CasesIncrease, order = (3, 0))
result5 = mod5.fit()
result5.summary()

In [None]:
result5.plot_predict(start=50, end=90)
plt.show()