# COVID-19 - Prediction using VAR model

**The objective of this notebook is to build a time-series model using Vector Auto Regression (VAR) to predict the confirmed cases, samples tested and positive cases for each state over the upcoming days.**

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import folium
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/covid19-in-india/covid_19_india.csv')
df.head()

## Data Wrangling

In [None]:
df = df.dropna()

In [None]:
df = df.replace({'Telangana':'Telengana'})

The date and time columns are converted to a uniform format so that the dataset can be treated as time-series data.
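
As a small illustration of the date conversion, the sketch below parses day-first date strings and rewrites them as `YYYY-MM-DD`. The sample values are made up, and the day-first layout is an assumption based on the `dayfirst=True` argument used in the next cell.

```python
import pandas as pd

# Made-up sample rows; the day-first layout is assumed, not taken from the dataset
sample = pd.DataFrame({'Date': ['30/01/2020', '01/02/2020']})

# dayfirst=True reads '01/02/2020' as 1 February rather than 2 January
sample['Date'] = pd.to_datetime(sample['Date'], dayfirst=True).dt.strftime('%Y-%m-%d')
print(sample)
```

Storing dates as ISO-formatted strings keeps them sortable as plain text, which is what the later `sort_values(by='Date')` call relies on.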

In [None]:
df['Date'] = pd.to_datetime(df.Date,dayfirst=True).dt.strftime('%Y-%m-%d')
df['Time'] = pd.to_datetime(df['Time']).dt.strftime('%H:%M:%S')
df.tail()

In [None]:
df.info()

In [None]:
df = df.loc[df['State/UnionTerritory'] != 'Unassigned']
df = df.loc[df['State/UnionTerritory'] != 'Cases being reassigned to states']

## Numerical variable dependency

In [None]:
df_train = df[['ConfirmedIndianNational','ConfirmedForeignNational','Cured','Deaths','Confirmed']]
sns.pairplot(df_train)

The pair plot suggests that the number of cured patients in a state rises faster than the number of deaths. Likewise, the number of confirmed cases rises faster than the number of patients cured.

Since the latest counts are recorded against the most recent date, the last record for each state is used for visualisation.
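
A compact way to express "last record per state" is a `groupby` followed by `idxmax`. The sketch below uses a made-up toy frame and assumes, as this notebook does, that a higher `Sno` means a more recent record.

```python
import pandas as pd

# Toy data (made up); higher Sno is assumed to mean a later record
toy = pd.DataFrame({
    'Sno': [1, 2, 3, 4],
    'State/UnionTerritory': ['Kerala', 'Delhi', 'Kerala', 'Delhi'],
    'Confirmed': [1, 2, 3, 5],
})

# idxmax returns the row label of the highest Sno within each state
latest = toy.loc[toy.groupby('State/UnionTerritory')['Sno'].idxmax()]
print(latest)
```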

In [None]:
df_u = df['State/UnionTerritory'].unique()
# Collect the most recent record (highest Sno) for each state
latest_rows = []
for s in df_u:
    l = df.loc[df['State/UnionTerritory'] == s]
    latest_rows.append(l.loc[[l['Sno'].idxmax()]])
df3 = pd.concat(latest_rows)
df3.head()

In [None]:
df3 = df3.reset_index()
df3 = df3.drop(columns=['index','Sno'])
df3 = df3[['State/UnionTerritory','Confirmed','Deaths','Cured']]
df3 = df3.loc[df3['State/UnionTerritory'] != 'Unassigned']
df3 = df3.loc[df3['State/UnionTerritory'] != 'Cases being reassigned to states']
df3.head()

Calculation of the total cases in each state

In [None]:
df3['Total Cases'] = (df3['Confirmed'] + df3['Deaths'] + df3['Cured']).astype(int)
df3.head()


## Total cases in each state in India

In [None]:
cols = df3[['State/UnionTerritory','Total Cases']]
states = np.asarray(df3['State/UnionTerritory'])
plt.figure(figsize=(15,10))
p = sns.barplot(x=df3['State/UnionTerritory'],y=df3['Total Cases'])
p.set_xticklabels(labels=states,rotation=90)
plt.title('Total Cases in India');

From the graph, Delhi, Tamil Nadu, Maharashtra and Gujarat each have more than 1 lakh (100,000) total cases to date, while the other states have far fewer.

## Rate of increase in confirmed cases based on Date

In [None]:
import matplotlib.ticker as ticker

n_cols = 4
n_rows = -(-len(df_u) // n_cols)  # ceiling division so every state gets a panel
fig = plt.figure(figsize=(20, 20))
for j, i in enumerate(df_u, start=1):
    g = df.loc[df['State/UnionTerritory'] == i]
    x = g['Date']
    y = g['Confirmed']
    ax = plt.subplot(n_rows, n_cols, j)
    ax.plot(x, y)
    plt.xticks(rotation=90)
    plt.title(i)
    # Show one tick label for every 10th date
    ax.xaxis.set_major_locator(ticker.MultipleLocator(10))
fig.tight_layout(pad=3.0)

The graphs above are time series of the confirmed cases to date. Most states show either a steady or an exponential increase in the number of confirmed cases.

## Samples Tested

In [None]:
df_testing = pd.read_csv('/kaggle/input/covid19-in-india/StatewiseTestingDetails.csv')
df_testing = df_testing.fillna(0)
df_testing['Date'] = pd.to_datetime(df_testing['Date']).dt.strftime('%Y-%m-%d')
df_testing.head()

## Total samples collected every day in each state

In [None]:
n_cols = 4
n_rows = -(-len(df_u) // n_cols)  # ceiling division so every state gets a panel
fig = plt.figure(figsize=(20, 20))
for j, i in enumerate(df_u, start=1):
    g = df_testing.loc[df_testing['State'] == i]
    x = g['Date']
    y = g['TotalSamples']
    ax = plt.subplot(n_rows, n_cols, j)
    ax.plot(x, y, color='green')
    plt.xticks(rotation=90)
    plt.title(i)
    # Show one tick label for every 10th date
    ax.xaxis.set_major_locator(ticker.MultipleLocator(10))
fig.tight_layout(pad=3.0)

In most states, the number of samples collected rises sharply from day to day. However, this feature is suitable for prediction only if 87% of the samples are tested and categorised as 'positive' or 'negative'. In some states the sample collection rate is low, which makes it difficult to predict the testing rate with the ML model.

## Positive cases detected every day in each state

In [None]:
n_cols = 4
n_rows = -(-len(df_u) // n_cols)  # ceiling division so every state gets a panel
fig = plt.figure(figsize=(20, 20))
for j, i in enumerate(df_u, start=1):
    g = df_testing.loc[df_testing['State'] == i]
    x = g['Date']
    y = g['Positive']
    ax = plt.subplot(n_rows, n_cols, j)
    ax.plot(x, y, color='red')
    plt.xticks(rotation=90)
    plt.title(i)
    # Show one tick label for every 10th date
    ax.xaxis.set_major_locator(ticker.MultipleLocator(10))
fig.tight_layout(pad=3.0)

For every state, a considerable percentage of the samples tested turn out to be positive. For example, in Tamil Nadu (TN), 5% of 6 lakh (600,000) samples were positive. On the other hand, some states show a flat positive-case curve, which means no new positive cases were recorded for a few days.

An inference from the above observations is that testing samples on a daily basis has helped in detecting a considerable number of positive cases.
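
The Tamil Nadu figure can be checked with a line of arithmetic. The helper below is a hypothetical illustration (`positivity_rate` is not part of the notebook); it also guards against states where no samples were tested.

```python
def positivity_rate(positive, total_samples):
    """Percentage of tested samples that were positive; None when nothing was tested."""
    if total_samples == 0:
        return None
    return 100.0 * positive / total_samples

# Tamil Nadu figures quoted above: 5% of 6 lakh samples
tn_samples = 600_000
tn_positive = tn_samples * 5 // 100
print(tn_positive)                               # 30000
print(positivity_rate(tn_positive, tn_samples))  # 5.0
```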

## Merging testing details with the main covid-19 dataset

The two datasets are merged to build a model that predicts COVID-19 cases from the samples tested and the positive/negative cases observed every day.
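
The idea of the merge can be sketched on toy data: each daily case record is matched with the testing record for the same state and date, and days without testing data are filled with zero. The numbers below are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the case and testing datasets
cases = pd.DataFrame({
    'Date': ['2020-04-01', '2020-04-02'],
    'State/UnionTerritory': ['Kerala', 'Kerala'],
    'Confirmed': [265, 286],
})
tests = pd.DataFrame({
    'Date': ['2020-04-02'],
    'State': ['Kerala'],
    'TotalSamples': [7965.0],
})

# A left join keeps every case record, even when no testing record exists for that day
merged = cases.merge(tests, how='left',
                     left_on=['State/UnionTerritory', 'Date'],
                     right_on=['State', 'Date']).fillna({'TotalSamples': 0})
print(merged)
```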

In [None]:
# Match each state's daily case record with its testing record for the same date.
# A left join keeps every case record; dates without testing data get NaN values.
df4 = df.merge(df_testing, how='left',
               left_on=['State/UnionTerritory', 'Date'],
               right_on=['State', 'Date'])

In [None]:
df4 = df4.drop(columns=['Time','ConfirmedIndianNational','ConfirmedForeignNational','State'],axis=1)
df4 = df4.fillna(0)
df4 = df4.reset_index()
df4 = df4.drop(columns = ['index','Sno'],axis=1)
df4

## Correlation between the numerical parameters

In [None]:
# Restrict the correlation to numeric columns (Date and state names are strings)
coff = df4.corr(numeric_only=True)
coff[['Confirmed']]

Based on the correlation coefficients, the TotalSamples, Positive and Negative features are the best candidates for independent variables. Although the Cured and Deaths features are very strongly correlated with Confirmed, they cannot be used to predict confirmed cases because both of them depend on it.

## Model Development

Since the dataset is time-series data, Vector Auto Regression (VAR) is used for forecasting. VAR is chosen because several features need to be predicted together: confirmed cases, total samples tested, and positive and negative cases.
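
For reference, a VAR model of order 1 can be written as below. This assumes the state vector is $y_t = (\text{Confirmed}_t, \text{TotalSamples}_t, \text{Positive}_t)^\top$, matching the three columns passed to the model later in this notebook; $c$, $A_1$ and $\varepsilon_t$ are standard textbook symbols, not names from the code.

$$
y_t = c + A_1\, y_{t-1} + \varepsilon_t
$$

Here $c$ is a vector of intercepts, $A_1$ is a $3 \times 3$ coefficient matrix and $\varepsilon_t$ is the error term. Each variable is regressed on the previous day's values of all three variables, so a rise in samples tested can feed into the next day's forecast of confirmed cases.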

In [None]:
# Order the records chronologically; shuffling beforehand has no effect once the rows are sorted
df5 = df4.sort_values(by='Date').reset_index(drop=True)
df5

## Prediction using VAR Model

Before training the model, the Johansen test is run on the series. Strictly speaking, this is a test for cointegration between the series rather than a stationarity check on each one, and its eigenvalues lie between 0 and 1 by construction. Here, the eigenvalues for all features came out below 1, and the model is fitted on the data as-is, without further differencing or integration.

Further, the VAR model is trained with the data of a single state and the same is used to predict the cases for the upcoming days.
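
To make the fit-then-forecast idea concrete, here is a minimal sketch of a VAR(1) model fitted by ordinary least squares with NumPy. It is a stand-in for illustration, not the statsmodels implementation used in this notebook; `fit_var1` and `forecast_var1` are hypothetical helper names, and the series is synthetic and noise-free so the recovered coefficients can be checked.

```python
import numpy as np

def fit_var1(y):
    """Least-squares fit of y_t = c + A @ y_{t-1}; y has shape (T, k)."""
    X = np.hstack([np.ones((len(y) - 1, 1)), y[:-1]])  # intercept + lagged values
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)   # shape (k + 1, k)
    return coef[0], coef[1:].T                         # c with shape (k,), A with shape (k, k)

def forecast_var1(c, A, last, steps):
    """Iterate the fitted recursion forward from the last observation."""
    out = []
    for _ in range(steps):
        last = c + A @ last
        out.append(last)
    return np.array(out)

# Synthetic, noise-free series generated from a known VAR(1)
A_true = np.array([[0.5, 0.1], [0.0, 0.8]])
c_true = np.array([1.0, 2.0])
y = np.zeros((30, 2))
for t in range(1, 30):
    y[t] = c_true + A_true @ y[t - 1]

c_hat, A_hat = fit_var1(y)
pred = forecast_var1(c_hat, A_hat, y[-1], steps=5)
print(A_hat.round(3))
print(pred.round(2))
```

Each forecast step feeds the previous prediction back into the model, which is why errors tend to compound over long horizons.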

The model below is interactive: the user selects the state and the number of days to forecast. Although the model can forecast any number of days, a long horizon is hard to read on the graph, so a horizon of at most 10 to 20 days is recommended for clear visualisation.

***Note***: Telengana, Daman & Diu, and Dadra and Nagar Haveli have no testing sample data, so no forecast is produced for these states. Also, for a few states the number of samples tested is zero; in that case the Johansen test fails and predictions can't be made.
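
One way to avoid these failures is to check a state's data before fitting. The sketch below is a hypothetical pre-check (`can_forecast` and the `min_rows` threshold of 10 are assumptions, not part of the notebook) that skips states with too few rows or with no non-zero `TotalSamples` values.

```python
import pandas as pd

def can_forecast(state_df, min_rows=10):
    """Hypothetical pre-check: enough rows and at least one non-zero TotalSamples value."""
    if len(state_df) < min_rows:
        return False
    return bool((state_df['TotalSamples'] > 0).any())

# Made-up frames covering the three situations
with_tests = pd.DataFrame({'TotalSamples': [0, 5, 9] * 4})  # 12 rows, some testing data
no_tests = pd.DataFrame({'TotalSamples': [0] * 12})         # 12 rows, all zero
too_short = pd.DataFrame({'TotalSamples': [5, 9]})          # only 2 rows

print(can_forecast(with_tests), can_forecast(no_tests), can_forecast(too_short))
```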

In [None]:
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from statsmodels.tsa.vector_ar.vecm import coint_johansen
from statsmodels.tsa.vector_ar.var_model import VAR

@interact
def forecast_model(State=list(df_u), days=widgets.IntSlider(value=5, min=1, max=20)):
    df6 = df5.loc[df5['State/UnionTerritory'] == State]
    df6 = df6[['Date', 'Confirmed', 'TotalSamples', 'Positive']].set_index('Date')
    # Hold out the most recent day; the forecast starts from that date
    df6_upd = df6.loc[df6.index != df6.index.max()]

    # Johansen test on the training data (deterministic order 1, 1 lagged difference)
    jtest = coint_johansen(df6_upd, 1, 1)

    # Fit the existing data trends to the forecast model
    model = VAR(df6_upd).fit()

    # Forecast from the last k_ar observations of the training data
    valid_pred = model.forecast(df6_upd.values[-model.k_ar:], steps=days)
    df7 = pd.DataFrame(valid_pred.round(0), columns=['Confirmed', 'TotalSamples', 'Positive'])
    df7.index = pd.date_range(df6.index.max(), periods=days)

    # Plot predictions
    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    plt.plot(df7['Confirmed'], color='red')
    plt.xticks(df7.index, rotation=90)
    plt.tight_layout(pad=7.0)
    plt.title('Confirmed cases')
    plt.subplot(1, 2, 2)
    plt.plot(df7['TotalSamples'])
    plt.xticks(df7.index, rotation=90)
    plt.title('Total Samples to be tested')
    return df7

**The interactive model will run only when the kernel is active.**

In [None]:
from statsmodels.tsa.vector_ar.var_model import VAR

frames = []
for i in df_u:
    df9 = df5.loc[df5['State/UnionTerritory'] == i]
    df9 = df9[['Date', 'Confirmed', 'TotalSamples', 'Positive']].set_index('Date')
    # Hold out the most recent day; the forecast starts from that date
    df9_upd = df9.loc[df9.index != df9.index.max()]

    # Daman & Diu is excluded from the batch forecast
    if i != 'Daman & Diu':
        # Fit the existing data trends to the forecast model
        model = VAR(df9_upd).fit()
        # Forecast the next 10 days from the last k_ar observations
        valid_pred = model.forecast(df9_upd.values[-model.k_ar:], steps=10)
        df8 = pd.DataFrame(valid_pred.round(0), columns=['Confirmed', 'TotalSamples', 'Positive'])
        df8.insert(0, 'State', i)
        df8.insert(0, 'Date', pd.date_range(df9.index.max(), periods=10))
        frames.append(df8)

# One row per state and forecast date
df_f = pd.concat(frames)

In [None]:
df_forecast = df_f
# A flat list gives ordinary column labels; a nested list would create a MultiIndex
df_forecast.columns = ['Date', 'State', 'Confirmed', 'TotalSamples', 'Positive']
df_forecast = df_forecast.reset_index(drop=True)
df_forecast

In [None]:
df_forecast.to_csv('/kaggle/working/forecast.csv')
out = pd.read_csv('/kaggle/working/forecast.csv')
out.head()

In [None]:
n_cols = 4
n_rows = -(-len(df_u) // n_cols)  # ceiling division so every state gets a panel
fig = plt.figure(figsize=(20, 20))
for j, i in enumerate(df_u, start=1):
    g = out.loc[out['State'] == i]
    x = g['Date']
    y = g['Confirmed']
    ax = plt.subplot(n_rows, n_cols, j)
    ax.plot(x, y, color='red')
    plt.xticks(rotation=90)
    plt.title(i)
    # Show a tick label for every forecast date
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
fig.tight_layout(pad=3.0)

The forecast suggests that, over the coming week, most states will see a steady increase in confirmed cases, while a few states such as Karnataka show a steep decline. Gujarat shows a decrease in the first few days, followed by a steady increase after 2 days.
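
A quick way to spot such declines is to difference the forecast series. The sketch below uses made-up numbers and assumes `Confirmed` is a cumulative count, in which case a negative day-over-day change is worth a second look rather than a result to take at face value.

```python
import pandas as pd

# Made-up 4-day forecast of cumulative confirmed cases
forecast = pd.Series([1000, 1040, 1035, 1090],
                     index=pd.date_range('2020-07-01', periods=4),
                     name='Confirmed')

daily_new = forecast.diff().dropna()  # day-over-day change
declines = daily_new[daily_new < 0]   # days on which the forecast total drops
print(daily_new)
print(declines)
```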

In [None]:
df2 = pd.read_csv('/kaggle/input/states/states.csv')
df2.head()

## Interactive Map Viz using Folium

The interactive mode works only when the kernel is active. Its purpose is to show the counts for each of the dates for which a forecast has been made.

In [None]:
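import folium

@interact
def map_view(Date=sorted(out['Date'].unique())):
    # Build a fresh map per call so markers from earlier selections do not pile up;
    # zoom_start sets the initial zoom level (4 roughly frames India)
    ind_map = folium.Map(location=(20, 78), zoom_start=4)
    day = out.loc[out['Date'] == Date]
    for _, d1 in df2.iterrows():
        d2 = day.loc[day['State'] == d1['State']]
        if d2.empty:  # no forecast available for this state
            continue
        label = folium.Popup(
            'Date: ' + str(Date)
            + ', Confirmed: ' + str(d2['Confirmed'].iloc[0])
            + ', Positive: ' + str(d2['Positive'].iloc[0]),
            parse_html=True)
        # CircleMarker accepts the radius and fill options used here
        folium.CircleMarker(
            [d1['Latitude'], d1['Longitude']],
            radius=10,
            popup=label,
            fill=True,
            fill_opacity=0.7).add_to(ind_map)
    return ind_map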




## Inference

Most of the states have a steady increase in the cases for this particular week. With better testing rates, the confirmed cases are identified more easily.