# Predicting 2021 COVID cases using Time-Series for Telangana

### Context

Coronaviruses are a large family of viruses that may cause illness in animals or humans. In humans, several coronaviruses are known to cause respiratory infections ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). The most recently discovered coronavirus causes coronavirus disease COVID-19. (Source: World Health Organization)
The number of new cases is increasing day by day around the world. This dataset has daily-level information from the states and union territories of India.


I use time series analysis to understand the data better and to answer the questions that may arise.

So what is Time Series?

1. A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.
2. An observed time series can be decomposed into three components:
   * the trend (long term direction)
   * the seasonal (systematic, calendar related movements) 
   * the irregular (unsystematic, short term fluctuations).
3. Time series analysis is a statistical technique for working with time series data, that is, data recorded over a series of particular time periods or intervals, and for studying the trends in that data.
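
To make the three components concrete, here is a minimal sketch that builds a toy weekly series from a trend, a seasonal part and an irregular part. The dates, magnitudes and the name `toy_series` are made up for illustration and are not taken from the COVID dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly index; dates and magnitudes are illustrative only.
idx = pd.date_range("2020-03-01", periods=40, freq="W")
t = np.arange(len(idx))

trend = 50.0 * t                                  # long-term direction
seasonal = 30.0 * np.sin(2 * np.pi * t / 4)       # pattern repeating every 4 weeks
rng = np.random.default_rng(0)
irregular = rng.normal(scale=5.0, size=len(idx))  # short-term fluctuations

# An additive series is simply the sum of the three components.
toy_series = pd.Series(trend + seasonal + irregular, index=idx, name="toy_cases")
print(toy_series.head())
```

Plotting `toy_series` would show a rising line with a small repeating wiggle and some noise on top.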

How to do a time series analysis?

* Step 1: Visualize the Time Series. It is essential to analyze the trends prior to building any kind of time series model.
* Step 2: Stationarize/Decompose the Series.
* Step 3: Find Optimal Parameters.
* Step 4: Build ARIMA Model.
* Step 5: Make Predictions.
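
As a rough illustration of Step 2, one informal way to judge whether a series is stationary is to look at its rolling mean and standard deviation: if they drift over time, the series probably needs differencing. The helper `rolling_summary` and the toy numbers below are assumptions for this sketch, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def rolling_summary(series: pd.Series, window: int = 4) -> pd.DataFrame:
    """Return the rolling mean and std so drift over time is easy to spot."""
    return pd.DataFrame({
        "rolling_mean": series.rolling(window).mean(),
        "rolling_std": series.rolling(window).std(),
    })

# Made-up cumulative counts that grow steadily (clearly non-stationary).
toy = pd.Series(np.arange(0, 200, 10, dtype=float))
summary = rolling_summary(toy)

# After first differencing, the rolling mean of this toy series stays flat.
differenced = rolling_summary(toy.diff().dropna())
print(summary.tail(3))
print(differenced.tail(3))
```

This is only a visual heuristic; a formal test would be a separate step.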

In [None]:
import pandas as pd                      
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
df = pd.read_csv("../input/covid19-in-india/covid_19_india.csv")

In [None]:
print(df.shape)
df.head(5)


In [None]:
df.iloc[0]

## Data wrangling/preprocessing

In [None]:
df.columns

In [None]:
df.tail()

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
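# correlate only the numeric columns (recent pandas raises an error on text columns)
df.select_dtypes('number').corr()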

In [None]:
print(df['Cured'].unique())

In [None]:
df['State/UnionTerritory'].unique()

In [None]:
len(df['State/UnionTerritory'].unique())

### Many state names are duplicates that end with '***'; we can either drop or replace them
### Here I drop them

In [None]:
# Drop rows whose state name ends with "***"
# (a boolean mask replaces Series.iteritems, which was removed in pandas 2.0)
df = df[~df['State/UnionTerritory'].str.endswith('***', na=False)]

In [None]:
df['State/UnionTerritory'].unique()

In [None]:
len(df['State/UnionTerritory'].unique())

### Above we can also notice "Telengana" and "Telangana", which are the same state with a spelling mistake.
### So we replace every "Telengana" with "Telangana" to make our visualization more accurate.

In [None]:
df = df.replace(to_replace ="Telengana", value ="Telangana")
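
The two cleaning steps above (dropping the `***` rows and fixing the misspelt state name) can be tried on a small toy frame first. This is a minimal sketch with made-up rows; only the column names follow the dataset.

```python
import pandas as pd

# Toy data; the names mimic the kinds of values discussed above.
toy = pd.DataFrame({
    "State/UnionTerritory": ["Telangana", "Telengana", "Telangana***", "Kerala"],
    "Confirmed": [10, 12, 12, 7],
})

# Keep only rows whose state name does not end with "***".
mask = toy["State/UnionTerritory"].str.endswith("***")
cleaned = toy.loc[~mask].copy()

# Fix the misspelling in the state column only.
cleaned["State/UnionTerritory"] = cleaned["State/UnionTerritory"].replace(
    {"Telengana": "Telangana"}
)
print(sorted(cleaned["State/UnionTerritory"].unique()))
```

Restricting the replacement to one column is a design choice: it avoids touching any other column that happens to contain the same string.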

In [None]:
df['State/UnionTerritory'].unique()

# "Telengana" has now been replaced with "Telangana", which is why "Telengana" no longer appears in the list

In [None]:
len(df['State/UnionTerritory'].unique())

## Data Visualization

In [None]:
df['Cured'].plot(alpha=0.8)
df['Deaths'].plot(alpha=0.3)
df['Confirmed'].plot(alpha=0.5)
plt.show()


In [None]:
df.groupby('State/UnionTerritory')['Confirmed'].plot()
plt.show()
df.groupby('State/UnionTerritory')['Deaths'].plot()
plt.show()
df.groupby('State/UnionTerritory')['Cured'].plot()
plt.show()

In [None]:
## combine date and time into a new "Datetime" column, for convenience and better visualization
df['Datetime'] = df['Date']+' '+df['Time']


In [None]:
l = df.groupby('State/UnionTerritory')
current = l.last()
print(current)
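
The `groupby(...).last()` call above keeps, for each state, the last row in the frame's current order. A minimal sketch on made-up data shows the behaviour; it assumes, as the notebook does, that rows are already in chronological order.

```python
import pandas as pd

# Toy rows in chronological order; state names and counts are made up.
toy = pd.DataFrame({
    "State/UnionTerritory": ["A", "B", "A", "B"],
    "Confirmed": [1, 5, 3, 9],
})

# last() returns the final row seen for each group.
latest = toy.groupby("State/UnionTerritory").last()
print(latest)
```

If the rows were not sorted by date, sorting before grouping would be needed to get the most recent figures.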


In [None]:
fig ,ax = plt.subplots(figsize= (12,8))
plt.title('Contaminated States')  # this chart shows every state, not only the top 10
current1 = current.sort_values("Confirmed",ascending=False)
p = sns.barplot(ax=ax, x=current1.index, y=current1['Confirmed'])
p.set_xticklabels(labels=current1.index, rotation=90)
p.set_yticklabels(labels=(p.get_yticks()*1).astype(int))
plt.show()

In [None]:
fig ,ax = plt.subplots(figsize= (15,15))
plt.title('Contaminated States (horizontal bars)')

P = sns.barplot(ax=ax,y= current1.index, x=current1['Confirmed'])
P.set_yticklabels(labels=current1.index)
P.set_xticklabels(labels=(P.get_xticks()*1).astype(int))
plt.show()

### Now we will see only the top 10 contaminated states

In [None]:
fig ,ax = plt.subplots(figsize= (12,8))
plt.title('Top 10 Contaminated States')

current2 = current.sort_values("Confirmed", ascending=False)[:10]

p = sns.barplot(ax=ax, x=current2.index, y=current2['Confirmed'])
p.set_xticklabels(labels=current2.index, rotation=90)
p.set_yticklabels(labels=(p.get_yticks()*1).astype(int))
plt.show()

### Now we will see only the top 10 states by cured people

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
plt.title("Top 10 states by cured people")
current3 = current.sort_values("Cured", ascending=False)[:10]

p1 = sns.barplot(ax=ax, x=current3.index, y=current3["Cured"])
p1.set_xticklabels(labels=current3.index, rotation=90)
p1.set_yticklabels(labels=(p1.get_yticks()*1).astype(int))
plt.show()



### Now we will see only the top 10 states by deaths

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
plt.title("Top 10 states by deaths")
current4 = current.sort_values("Deaths", ascending=False)[:10]

p1 = sns.barplot(ax=ax, x=current4.index, y=current4["Deaths"])
p1.set_xticklabels(labels=current4.index, rotation=90)
p1.set_yticklabels(labels=(p1.get_yticks()*1).astype(int))
plt.show()


* From the charts above we can conclude that Maharashtra is 1st in confirmed cases, cured and deaths.
* We can also see that "Andhra Pradesh" is 3rd in confirmed cases but 2nd in cured and 7th in deaths, which is actually impressive.
* We can also see that Odisha is doing better than Rajasthan, with fewer deaths.

## Time Series Analysis For 'Telangana' State

In [None]:
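# .copy() gives an independent frame, so later column edits do not trigger SettingWithCopyWarning
TS = df.loc[df['State/UnionTerritory'] == 'Telangana'].copy()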

In [None]:
TS.head()

In [None]:
TS.shape

In [None]:
TS.isnull().sum()

In [None]:
TS['Date'] = pd.to_datetime(TS['Date'])

In [None]:
TS.head()

In [None]:
TS.columns

### dropping all the unnecessary columns, keeping only Confirmed and Datetime

In [None]:
cols = ['Sno', 'Time', 'State/UnionTerritory',
       'ConfirmedIndianNational', 'ConfirmedForeignNational', 'Cured',
       'Deaths', 'Date']
TS.drop(cols, axis=1, inplace=True)

In [None]:
TS= TS.sort_values('Datetime')
TS.isnull().sum()

In [None]:
TS.head()


* converting the Datetime column type (if it is a string, it is converted to datetime) and
* setting Datetime as our index

In [None]:
# convert TS's own column (not df's), then index by it and sort chronologically
TS['Datetime'] = pd.to_datetime(TS['Datetime'])
TS = TS.set_index('Datetime').sort_index()

### converting daily data into weekly data

In [None]:
TS=TS.resample('W').mean()
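
To see what `resample('W').mean()` does, here is a minimal sketch on two weeks of made-up daily counts. The variable names are hypothetical. It also shows `.last()` as an alternative aggregation, which may be worth considering because `Confirmed` is a running total; the notebook itself uses the mean.

```python
import pandas as pd

# 14 made-up daily values starting on a Monday (2020-03-02).
idx = pd.date_range("2020-03-02", periods=14, freq="D")
daily = pd.DataFrame({"Confirmed": range(1, 15)}, index=idx)

# 'W' groups days into weeks ending on Sunday.
weekly_mean = daily.resample("W").mean()   # average level within each week
weekly_last = daily.resample("W").last()   # value at the end of each week

print(weekly_mean)
print(weekly_last)
```

Each output row is labelled with the Sunday that closes the week.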

In [None]:
TS.head()

In [None]:
TS.tail()

In [None]:
TS.shape

In [None]:
TS.fillna(0, inplace=True)

In [None]:
TS.head()

#### Now let's plot a graph showing the increasing trend and seasonality in the data

In [None]:
plot_ts = TS.plot(figsize=(14,8))

plot_ts.set_yticklabels(labels=(plot_ts.get_yticks()*1).astype(int))
plt.legend()
plt.show()

#### Now let's plot the decomposition plot, which shows:

* original data
* Trend in the data
* Seasonality
* Residual


### But why do we decompose time series?
When we decompose a time series into components, we usually combine the trend and cycle into a single trend-cycle component (sometimes called the trend for simplicity). Often this is done to help improve understanding of the time series, but it can also be used to improve forecast accuracy.


### Types of decomposition :
* Multiplicative : The components multiply together to make the time series. If you have an increasing trend, the amplitude of seasonal activity increases. Everything becomes more exaggerated.
* Additive : In an additive time series, the components add together to make the time series.
  (Here we used Additive)
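
The difference between the two types can be sketched with a few made-up numbers: in the additive form the seasonal swing has the same size at every level, while in the multiplicative form it grows with the trend. All values below are illustrative assumptions.

```python
import numpy as np

t = np.arange(12)
trend = 100.0 + 10.0 * t

seasonal_add = np.tile([5.0, -5.0, 0.0, 0.0], 3)    # same size at every level
seasonal_mul = np.tile([1.05, 0.95, 1.0, 1.0], 3)   # proportional to the level

additive = trend + seasonal_add
multiplicative = trend * seasonal_mul

# The multiplicative swing grows as the trend grows.
print(abs(multiplicative - trend)[[0, 8]])

# Taking logs turns the multiplicative form into an additive one.
log_form = np.log(trend) + np.log(seasonal_mul)
print(np.allclose(np.log(multiplicative), log_form))
```

This is why a log transform is sometimes applied before an additive decomposition when the seasonal swings scale with the level.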

In [None]:
from pylab import rcParams
import statsmodels.api as sm

rcParams['figure.figsize'] = 18, 16
# recent statsmodels versions take `period` instead of the old `freq` argument
decomposition = sm.tsa.seasonal_decompose(TS['Confirmed'], period=20, model='additive')
fig = decomposition.plot()

plt.show()

# Implementing SARIMAX

(We used SARIMAX)

-> Seasonal AutoRegressive Integrated Moving Average:

     One of the methods available in Python to model and predict future points of a time series is SARIMAX, which stands for Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors.

-> What does an ARIMA model do?

    ARIMA stands for AutoRegressive Integrated Moving Average. It is a class of statistical models for analyzing and forecasting time series data. It explicitly caters to a suite of standard structures in time series data, and as such provides a simple yet powerful method for making skillful time series forecasts.

-> How to select a suitable ARIMA model?

    There are rules of thumb for identifying ARIMA models, including general seasonal models such as ARIMA(0,1,1)x(0,1,1). For the order of differencing and the constant: if the series has positive autocorrelations out to a high number of lags (say, 10 or more), then it probably needs a higher order of differencing.

* We use order=(p, d, q)=(1,1,1) because (1,1,1) and (0,1,1) are common starting points that often give good results.

* In the seasonal order we add the extra value "4" because it is the length of one season, which here is 4 weeks.
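
To give a feel for what the differencing terms mean (d=1 in the order and D=1 with period 4 in the seasonal order), here is a minimal sketch using `pandas` `diff` on a made-up series. It only illustrates the two differencing steps, not the autoregressive or moving-average parts of the model.

```python
import pandas as pd

# Made-up weekly values: quadratic growth plus a pattern repeating every 4 weeks.
pattern = [0, 3, 1, 2]
values = [t * t + pattern[t % 4] for t in range(16)]
series = pd.Series(values, dtype=float)

first_diff = series.diff()            # d=1: week-over-week change
seasonal_diff = first_diff.diff(4)    # D=1, period 4: change versus 4 weeks earlier

# For this toy series both the growth and the 4-week pattern are removed,
# leaving a constant.
print(seasonal_diff.dropna().unique())
```

Real data will not reduce to a constant, but the differenced series should look much flatter than the original.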


## Preparing Model

In [None]:
import statsmodels.api as sm

model = sm.tsa.statespace.SARIMAX(TS['Confirmed'], order = (1,1,1), seasonal_order=(1,1,1,4))
results = model.fit()


### testing our model

In [None]:
TS['forecast'] = results.predict(start=44, end =50, dynamic=True)
ax = TS[['Confirmed','forecast']].plot(figsize=(16,8))

ax.set_yticklabels(labels=(ax.get_yticks()*1).astype(int))
plt.legend()
plt.show()
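
Besides comparing the two lines by eye, the forecast can be summarised with a single error number. The sketch below computes the mean absolute error over the rows where both an actual and a forecast value exist; the helper name `mean_absolute_error` and the toy values are assumptions for illustration, not results from this notebook.

```python
import numpy as np
import pandas as pd

def mean_absolute_error(actual: pd.Series, forecast: pd.Series) -> float:
    """MAE over the rows where both an actual and a forecast value exist."""
    both = pd.concat([actual, forecast], axis=1, keys=["actual", "forecast"]).dropna()
    return float((both["actual"] - both["forecast"]).abs().mean())

# Toy values: the forecast only covers the last two points.
actual = pd.Series([100.0, 110.0, 120.0, 130.0])
forecast = pd.Series([np.nan, np.nan, 118.0, 135.0])
print(mean_absolute_error(actual, forecast))
```

With the notebook's data, the call would presumably look like `mean_absolute_error(TS['Confirmed'], TS['forecast'])`.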

## Preparing data for future prediction

### Here we create a new "future_data_TS" which has weekly dates for 2021

In [None]:
from pandas.tseries.offsets import DateOffset


future_dates = [ TS.index[-1]+DateOffset(weeks=x) for x in range(0,53) ]
future_data_TS = pd.DataFrame(index=future_dates[1:], columns = TS.columns)
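
The same 52 future weekly dates can also be built with `pd.date_range`. The sketch below compares both approaches; `last_date` is a hypothetical stand-in for `TS.index[-1]`, so the real dates depend on the data.

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

# Hypothetical last date of the weekly series (stand-in for TS.index[-1]).
last_date = pd.Timestamp("2020-12-13")

# Approach used above: add 1..52 weeks to the last date.
via_offset = [last_date + DateOffset(weeks=x) for x in range(1, 53)]

# Alternative: a 7-day range starting at the last date, skipping the first entry.
via_range = pd.date_range(start=last_date, periods=53, freq="7D")[1:]

print(via_range[0], via_range[-1])
```

Both give the same dates; `date_range` returns a `DatetimeIndex` directly, which can be passed as the `index` of the new frame.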


In [None]:
future_data_TS.tail()

### Now combine "future_data_TS" with "TS" to get a new "future_TS" data set

In [None]:
future_TS = pd.concat([TS, future_data_TS])

In [None]:
future_TS.head()

In [None]:
future_TS.tail()

In [None]:
future_TS['forecast'] = results.predict(start=44, end = pd.to_datetime('2021-12-12'), dynamic=True)


ax = future_TS[['Confirmed','forecast']].plot(figsize=(16,8))
plt.title("Prediction of covid19 cases of 2021 for Telangana State")
ax.set_yticklabels(labels=(ax.get_yticks()*1).astype(int))	
plt.legend()
plt.show()