# **A brief analysis of COVID-19's impact on India's AQI and Time Series Forecasting with SARIMA**

The air quality index (AQI) is an index for reporting air quality on a daily basis. A low AQI indicates good air quality and low levels of pollution, while a higher AQI indicates increased concentrations of pollutants in the air, which are detrimental to human health.

The air quality index is composed of 8 pollutants (PM10, PM2.5, NO2, SO2, CO, O3, NH3, and Pb).

AQI scores and categories:

* Good (0–50)
* Satisfactory (51–100)
* Moderately polluted (101–200)
* Poor (201–300)
* Very poor (301–400)
* Severe (401–500)
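
As a quick illustration of the bands listed above, here is a minimal sketch of a lookup from an AQI value to its category. The function name `aqi_category` and the two constants are hypothetical helpers for this example and are not used elsewhere in the notebook.

```python
from bisect import bisect_left

# Upper bound of each band, taken from the category list above.
AQI_UPPER_BOUNDS = [50, 100, 200, 300, 400, 500]
AQI_CATEGORIES = [
    "Good",
    "Satisfactory",
    "Moderately polluted",
    "Poor",
    "Very poor",
    "Severe",
]


def aqi_category(aqi):
    """Return the category name for an AQI value between 0 and 500."""
    if not 0 <= aqi <= 500:
        raise ValueError(f"AQI must lie in [0, 500], got {aqi}")
    # bisect_left finds the first band whose upper bound is >= aqi
    return AQI_CATEGORIES[bisect_left(AQI_UPPER_BOUNDS, aqi)]


print(aqi_category(42))     # Good
print(aqi_category(100))    # Satisfactory
print(aqi_category(250.5))  # Poor
```

Fractional values between two bands (for example 50.5) fall into the higher band in this sketch; that is an assumption of the example, since the list above only gives integer boundaries.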

In this notebook, we will be using the day-wise AQI dataset which contains information regarding the daily level of pollutants and AQI in around 26 Indian cities from 2015-2020.

Our notebook has two major parts:
1. Finding the most polluted cities in recent years and analysing the levels of pollutants there. We then study the impact of the COVID-19 induced lockdowns on air quality in some of the major cities: which cities underwent the most drastic improvement in air quality, and which cities showed a spike in AQI levels in spite of a stringent lockdown.

2. A time-series analysis of the data, in which we fit a SARIMA model with computed orders to forecast India's AQI in 2021.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input/air-quality-data-in-india'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#importing day-wise data of cities
cities= pd.read_csv('../input/air-quality-data-in-india/city_day.csv')

#visualizing the top rows of the dataset
cities.head()

In [None]:
#getting information about the columns in our dataset
cities.info()

To be able to construct time plots, we need to first convert the Date column into a DateTime format.

In [None]:
print(cities.shape)

#converting column Date into DateTime format
cities['Date']=pd.to_datetime(cities['Date'])

Next, we try to find the percentage of missing values in each column.

In [None]:
import missingno as msno
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#finding the proportion of missing values in each column
missing=pd.DataFrame(cities.isna().sum()/len(cities))
missing.columns=['Proportion']
print(missing.sort_values(by='Proportion', ascending=False))

#plotting the number of non-null values in each column
msno.bar(cities)

We see that Xylene and PM10 have a very high proportion of missing values. This might be due to manual inconsistencies/human negligence while recording or absence of enough stations in the city.

Next, for ease of analysis, we process the dataset further and club columns of the same category together.

In [None]:
#filling missing values with zero - can also be imputed by mean of the observations
cities.fillna(0,inplace=True)

#extracting year and month for each record
cities['year'] = pd.DatetimeIndex(cities['Date']).year
cities['month'] = pd.DatetimeIndex(cities['Date']).month

#clubbing all particulate matter
cities['PM']=cities['PM2.5'] + cities['PM10']

#clubbing nitrogen oxides
cities['Nitric']=cities['NO'] + cities['NO2']+ cities['NOx']

#clubbing Benzene, toluene and Xylene together
cities['BTX']=cities['Benzene'] + cities['Toluene']+ cities['Xylene']

#grouping pollutant levels in every city by year and month
cities_group_ym=cities.groupby(['City','year','month'])[['PM','Nitric','CO','NH3','O3','SO2','BTX','AQI']].mean()

cities_group_ym=cities_group_ym.reset_index(['City','year','month'])
cities_group_ym.head()
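
Because the cell above replaces missing readings with 0 before averaging, months with many gaps end up with lower means than the recorded values alone would give. The following sketch uses made-up readings (not the notebook data) to show the size of that effect compared with letting pandas skip missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical readings for one city-month; NaN marks a missing day.
readings = pd.Series([120.0, np.nan, 80.0, np.nan])

mean_zero_filled = readings.fillna(0).mean()  # missing days counted as 0
mean_skipping_nan = readings.mean()           # pandas skips NaN by default

print(mean_zero_filled)   # 50.0
print(mean_skipping_nan)  # 100.0
```

One trade-off to keep in mind: without the fill, a sum such as `PM2.5 + PM10` is NaN whenever either term is missing, so the clubbed columns would need their own handling.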

### **Visualizing the most polluted cities for each category of pollutants (2017-19)**

We take years 2017-2019 as our reference years to understand the general trend of pollutants prevailing in some of the most polluted Indian cities:

In [None]:
#taking a subset of our dataset for the last three years before 2020
cities_17_19=cities_group_ym[cities_group_ym['year'].isin([2017,2018,2019])]

#list of pollutants
pollutants=['PM','Nitric','CO','NH3','O3','SO2','BTX','AQI']
sns.set_theme(style='whitegrid')

#plotting the top 10 most polluted cities for each category of pollutants, as well as overall AQI
for i in pollutants:
    df=cities_17_19.groupby(['City'])[[i]].mean().sort_values(i,ascending=False).iloc[:10,:]
    
    df=df.reset_index(['City'])

    plt.figure()
    sns.barplot(data=df, x="City", y=i, palette="viridis", alpha=.9)
    plt.xticks(rotation=45) 

Now, we try to find how each individual pollutant is related to the AQI.

In [None]:
#plotting the correlation matrix with sns heatmap
corr_matrix = cities_group_ym.iloc[:,3:].corr()
print(corr_matrix)
fig = plt.figure(figsize = (6, 4))
sns.heatmap(corr_matrix, vmin=-1, vmax=1)
plt.show()

We see that BTX has the lowest correlation with AQI, which is consistent with how the AQI is calculated: the index is composed of 8 pollutants (PM10, PM2.5, NO2, SO2, CO, O3, NH3, and Pb) and does not directly account for BTX.

Next, we study the general AQI trend over the months from year 2017-19.

In [None]:
df_AQI_trend= cities_17_19.groupby(['City','month'])[['AQI']].mean().reset_index()

sns.lineplot(
    data=df_AQI_trend,
    x="month", y="AQI",hue='City'
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

We see a clear pattern emerge here. AQI decreases in the summer months, which means that air quality improves over these months.

## Analysing the impact of COVID-19 induced lockdown on AQI:

To start off, we pick some of the most polluted cities (as inferred above) and visualize how their AQI changed in 2020 compared to 2019.

From here on, our analysis and inferences are based on the percentage change in AQI from 2019 to 2020.
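
The percentage change is 100 × (AQI in 2020 − AQI in 2019) / (AQI in 2019), so a negative value means the air quality improved. Here is a minimal sketch of the merge-and-compare step on made-up numbers; the city name `A` and the AQI values are placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical monthly AQI means for a single city (values are made up).
aqi_2019 = pd.DataFrame({"City": ["A", "A"], "month": [4, 5], "AQI": [200.0, 150.0]})
aqi_2020 = pd.DataFrame({"City": ["A", "A"], "month": [4, 5], "AQI": [150.0, 165.0]})

# Both frames have an 'AQI' column, so merge appends the default suffixes:
# AQI_x comes from the left frame (2019), AQI_y from the right frame (2020).
both = pd.merge(aqi_2019, aqi_2020, how="inner", on=["City", "month"])

both["AQI Percentage change"] = 100 * (both["AQI_y"] - both["AQI_x"]) / both["AQI_x"]
print(both[["month", "AQI Percentage change"]])
# month 4 -> -25.0 (improvement), month 5 -> 10.0 (deterioration)
```

Note that a 2019 value of 0 in the denominator would make the ratio infinite or undefined, so rows with a zero baseline are worth checking before interpreting the result.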

In [None]:
#creating a list of some of the most polluted cities
most_polluted=['Delhi','Patna','Ahmedabad','Gurugram','Kolkata']

#forming two df's- containing data from 2019 and 2020 respectively
cities_2019= cities_group_ym[(cities_group_ym['City'].isin(most_polluted)) & (cities_group_ym['year']==2019)]
cities_2020= cities_group_ym[(cities_group_ym['City'].isin(most_polluted)) & (cities_group_ym['year']==2020)]

cities_19_vs_20 = pd.merge(cities_2019, cities_2020, how="inner", on=["City", "month"])

#computing the percentage change in AQI
cities_19_vs_20['AQI Percentage change']=100*(cities_19_vs_20['AQI_y']-cities_19_vs_20['AQI_x'])/cities_19_vs_20['AQI_x']

#plotting AQI change for a few highly polluted cities
fig = plt.figure(figsize=(10,7))
sns.lineplot(
    data=cities_19_vs_20,
    x="month", y="AQI Percentage change",hue='City',linewidth=4.5,
    markers=True, dashes=False
)


The general trend shows that the AQI indeed decreased for the lockdown months, signifying a major improvement in Air quality with reduced pollution levels.

However, we will now investigate the cities which fared the best in the four lockdown months (March to June) and also the ones which showed anomalies with a spike in AQI.

## Cities which underwent the most drastic improvement in Air Quality:

In [None]:
#forming two separate dataframes for years 2019 and 2020
cities_19_all= cities_group_ym[cities_group_ym['year']==2019]
cities_20_all= cities_group_ym[cities_group_ym['year']==2020]

#joining the two df's to get a comparative view of AQI value in 2019 and 2020
cities_19_vs_20_all = pd.merge(cities_19_all, cities_20_all, how="inner", on=["City", "month"])
cities_19_vs_20_all['AQI Percentage change']=100*(cities_19_vs_20_all['AQI_y']-cities_19_vs_20_all['AQI_x'])/cities_19_vs_20_all['AQI_x']

#lockdown months- which we will be analysing
months=[3,4,5,6]
fig, axes = plt.subplots(ncols=2, nrows=2,figsize=(12, 7))

#plotting the top 10 cities for the months March-June 2020 which had the most improvement in AQI
for i, ax in zip(months, axes.flat):
    cities_AQI_comp=cities_19_vs_20_all[(cities_19_vs_20_all['AQI_y']!= 0.000000) & (cities_19_vs_20_all['month']==i)]
    cities_AQI_comp_10=cities_AQI_comp[['City','month','AQI_x','AQI_y','AQI Percentage change']].sort_values(by='AQI Percentage change', ascending=True).iloc[:10,:]
    
    h=sns.barplot(data=cities_AQI_comp_10, x="City", y='AQI Percentage change', palette="flare", alpha=.9, ax=ax)
    h.set(title='Month : {}'.format(i))
    h.set_xticklabels(h.get_xticklabels(), rotation=45)
    
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

We can see that there has been a significant improvement in the air quality for these cities over the four months.

### Cities which showed an increased AQI as compared to 2019 in the lockdown-months:

Analysing the cities which showed a positive AQI change percentage, denoting an increased AQI in 2020:

In [None]:
#finding cities which had a higher AQI in March, April, May or June of 2020 than in the same month of 2019
lockdown_increase = (cities_19_vs_20_all['AQI Percentage change']>=0) & (cities_19_vs_20_all['month'].isin([3,4,5,6]))
cities_19_vs_20_all[lockdown_increase][['City','month','AQI_x','AQI_y','AQI Percentage change']].sort_values(by='AQI Percentage change', ascending=False)

Analysing in detail the towns/cities which recorded a higher AQI in April and May of 2020 as compared to 2019:

In [None]:
#cities displayed above which showed a higher AQI in the lockdown months of 2020 as compared to 2019
anomalies=['Guwahati','Jorapokhar','Brajrajnagar','Talcher']

#understanding the rise of pollutants which contributed to the increased AQI in 2020 by comparing the levels of each pollutant in 2019 and 2020
for i in anomalies:
    city_19_20= cities_group_ym[(cities_group_ym['City']==i) & 
                                 (cities_group_ym['year'].isin([2019,2020]))&
                                (cities_group_ym['month']<8)]

    sns.set_theme(style="whitegrid")
    #plt.subplots already creates the figure, so a separate plt.figure() call would only leave an empty figure behind
    fig, axes = plt.subplots(2,4,figsize=(12, 5))
    
    sns.barplot(
        data=city_19_20, 
        x="month", y="AQI", hue="year",
         palette="dark", alpha=.6,ax=axes[0,0]
    )
    sns.lineplot(
        data=city_19_20,
        x="month", y="PM", hue="year",palette='dark',
        markers=True, dashes=False,ax=axes[0,1]
    )
    sns.lineplot(
        data=city_19_20,
        x="month", y="Nitric", hue="year",palette='dark',
        markers=True, dashes=False,ax=axes[0,2]
    )
    sns.lineplot(
        data=city_19_20,
        x="month", y="CO", hue="year",palette='dark',
        markers=True, dashes=False,ax=axes[0,3]
    )
    sns.lineplot(
        data=city_19_20,
        x="month", y="BTX", hue="year",palette='dark',
        markers=True, dashes=False,ax=axes[1,0]
    )
    sns.lineplot(
        data=city_19_20,
        x="month", y="SO2", hue="year",palette='dark',
        markers=True, dashes=False,ax=axes[1,1]
    )
    sns.lineplot(
        data=city_19_20,
        x="month", y="NH3", hue="year",palette='dark',
        markers=True,ax=axes[1,2]
    )
    sns.lineplot(
        data=city_19_20,
        x="month", y="O3", hue="year",palette='dark',
        markers=True, ax=axes[1,3]
    )

    fig.tight_layout()
    print(i,':')
    plt.show()

We see that the four cities mentioned above did not witness the expected improvement in AQI during the COVID-19 induced lockdown. There could be several reasons: sparse AQI readings in 2020, flouting of lockdown norms, or a natural phenomenon overriding the positive impact of decreased human and industrial activity.

* Guwahati: AQI for April'20 is more than 20% higher than in April'19. Particulate matter and NH3 were the main contributing factors.
* Jorapokhar: AQI for May'20 is substantially higher than in May'19. Concentrations of almost all pollutants have increased.
* Brajrajnagar: Higher AQI in April'20 than in April'19. This can be attributed to increased O3 and PM levels.
* Talcher: Higher AQI in April'20 than in April'19. This can be attributed to increased CO, O3 and NH3 levels.

### Part 2 : Time Series Analysis and Forecasting:

In [None]:
import warnings
warnings.filterwarnings('ignore')

#importing day-wise data of cities
df= pd.read_csv('../input/air-quality-data-in-india/city_day.csv')

df['Date'] = pd.to_datetime(df['Date'])

#visualizing the last rows of the dataset
df.tail(5)

We pivot the values from the 'City' column, so that we get a comparative view of every city's AQI for each day.

Then we resample them to find the mean of every month, so now our dataset contains month-wise data.
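
The pivot-then-resample step can be sketched on a tiny made-up table. The city names `A` and `B` and all AQI values below are placeholders for illustration only.

```python
import pandas as pd

# Hypothetical long-format data: one row per city per day (values are made up).
toy = pd.DataFrame({
    "City": ["A", "B", "A", "B", "A", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02",
                            "2020-01-02", "2020-02-01", "2020-02-01"]),
    "AQI": [100.0, 200.0, 120.0, 220.0, 90.0, 150.0],
})

# Wide format: one row per date, one column per city.
wide = toy.pivot_table(values="AQI", index="Date", columns="City").add_suffix("_AQI")

# 'MS' labels each monthly bin with the first day of the month.
monthly = wide.resample("MS").mean()
print(monthly)
```

In the result, January holds the mean of the two daily readings per city (110 for `A_AQI`, 210 for `B_AQI`) and February holds the single reading available for that month.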

In [None]:
cities_all = df.pivot_table(values='AQI', index=['Date'], columns='City')
cities_all=cities_all.add_suffix('_AQI')
cities=cities_all.resample(rule='MS').mean()
cities.head()

In [None]:
#form a new column containing India's AQI for every month by taking the average of all cities for that month
cities['India_AQI']=cities.mean(axis=1)
cities.head()

In [None]:
sns.set_theme(style='whitegrid')

#plot India's AQI
cities['India_AQI'].plot(kind='line',grid=True,figsize=(10, 6), linewidth=2.5)

From the plot above, we can visually see that there is a slight downward trend and a seasonality present. However, we will decompose the plot into trend, seasonality and residuals to get a clearer picture.

In [None]:
plt.rcParams['figure.figsize'] = (10, 6)
#India_AQI was already computed above, so it is not recomputed here
fig = seasonal_decompose(cities['India_AQI'], model='additive').plot()

We can see a clear seasonality and trend present here. The AQI decreases towards mid-year before rising again.

### Augmented Dickey-Fuller Test:

We'll perform the ADF test to determine whether the time series is stationary.

In [None]:
dftest = adfuller(cities['India_AQI'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
dfoutput

The p-value is 0.94, well above the usual 0.05 threshold, so we cannot reject the null hypothesis of a unit root: the time series is not stationary.
We apply first-order differencing to remove the trend and then perform the ADF test again.
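
To see why differencing removes a trend, here is a small sketch on a synthetic series (a linear trend of -2 per month plus a 12-month cycle, chosen purely for illustration). It does not run the ADF test; it only shows that the yearly level shifts in the raw series while the differenced series keeps a constant mean.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend of -2 per month plus a 12-month cycle.
months = np.arange(48)
index = pd.date_range("2015-01-01", periods=48, freq="MS")
series = pd.Series(300 - 2 * months + 10 * np.sin(2 * np.pi * months / 12), index=index)

# First-order differencing: y[t] - y[t-1]; the first value has no predecessor.
diff = series.diff(periods=1).dropna()

# Raw series: the yearly mean level drifts downwards.
print(round(series.iloc[:12].mean(), 2), round(series.iloc[-12:].mean(), 2))  # 289.0 217.0

# Differenced series: the mean over any full 12-month cycle is the constant slope.
print(round(diff.iloc[:12].mean(), 2), round(diff.iloc[-12:].mean(), 2))  # -2.0 -2.0
```

The seasonal cycle is still present after this step, which is why the model later includes seasonal terms in addition to d=1.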

In [None]:
diff = cities['India_AQI'].diff(periods=1)
diff.dropna(inplace=True)
fig = seasonal_decompose(diff, model='additive').plot()

In [None]:
dftest = adfuller(diff)
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
dfoutput

From the p-value and the test statistic, we can conclude that the time series becomes stationary after one round of differencing. Therefore, d=1.

In [None]:
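fig, ax = plt.subplots(2,figsize=(13, 8))
#plot_acf and plot_pacf draw onto the axes passed in and return the figure, so their result is not stored back into ax
plot_acf(diff, lags=30, ax=ax[0])
plot_pacf(diff,lags=30, ax=ax[1])
plt.show()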

We can use auto-arima to determine the parameters of the SARIMA model.

In [None]:
#installing pmdarima
!pip install pmdarima;
from pmdarima import auto_arima;  

In [None]:
auto_arima(y=cities['India_AQI'], start_p=1, start_P=1, start_q=1, start_Q=1,
           seasonal=True, m=12, stepwise=True).summary()

We take the orders suggested by the search, fit the model on data up to the end of 2018, and hold out 2019 as the test set.

In [None]:
#dividing into train (up to Dec 2018) and test (Jan-Dec 2019):
train_data=cities['India_AQI'][:'2018-12']
test_data=cities['India_AQI']['2019-01':'2019-12']

#Building the model:
model=SARIMAX(train_data,order=(0,1,2),seasonal_order=(1,0,1,12), trend='n')
results=model.fit()

#printing summary of model results
results.summary()
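
Label-based slicing on a DatetimeIndex includes both endpoints, and a partial date string such as `'2018-12'` covers the whole month. The following is a minimal sketch on a synthetic monthly index (placeholder values, not the notebook data) showing a train/hold-out split with no overlap.

```python
import pandas as pd

# Synthetic monthly series standing in for a monthly AQI series (placeholder values).
index = pd.date_range("2015-01-01", "2019-12-01", freq="MS")
series = pd.Series(range(len(index)), index=index, dtype="float64")

train = series.loc[:"2018-12"]           # everything up to and including Dec 2018
test = series.loc["2019-01":"2019-12"]   # the 12 months of 2019 only

print(len(series), len(train), len(test))           # 60 48 12
print(train.index.intersection(test.index).empty)   # True
```

Keeping the two slices disjoint matters because any overlap would let the evaluation reuse months the model was already fitted on.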

In [None]:
fig, ax= plt.subplots(figsize=(10,6))

#predict the next 12 months values to compare with the test dataset
forecasts = results.get_forecast(steps=12, dynamic=True)

#find the confidence intervals
confidence_intervals=forecasts.conf_int()
lower_limits = confidence_intervals.loc[:,'lower India_AQI']
upper_limits = confidence_intervals.loc[:,'upper India_AQI']

#plot the forecasted mean data for the next 12 months and the confidence interval
forecasts.predicted_mean.plot(legend=True, ax=ax, label ='Predicted Values')
plt.fill_between(confidence_intervals.index, lower_limits, upper_limits, color='pink')

#plotting the actual value from test data
test_data.plot(legend=True, ax=ax)

In [None]:
from sklearn.metrics import mean_squared_error

test= cities['India_AQI']['2019-01':'2019-12']
RMSE=np.sqrt(mean_squared_error(forecasts.predicted_mean,test))
print('RMSE = ',RMSE)

We see that the model has an RMSE of 22.75 on the test data set. Now, we can use this model to predict values into the future.
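
For reference, the RMSE is the square root of the mean squared difference between forecast and actual values, so it is expressed in the same units as the AQI itself. Here is a hand-checkable sketch with made-up numbers (not the notebook's forecasts):

```python
import numpy as np

# Made-up values, only to make the arithmetic easy to follow.
actual = np.array([100.0, 110.0, 120.0])
predicted = np.array([90.0, 110.0, 130.0])

errors = predicted - actual            # [-10, 0, 10]
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((100 + 0 + 100) / 3)
print(round(rmse, 3))  # 8.165
```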

We'll be forecasting AQI values for 2021. However, as we saw earlier, 2020 yielded unexpected AQI values owing to the lockdown imposed due to COVID-19, so our prediction may carry a wider margin of error.

In [None]:
fig, ax= plt.subplots(figsize=(10,6))

forecasts = results.get_forecast(steps=36, dynamic=True)

confidence_intervals=forecasts.conf_int()
lower_limits = confidence_intervals.loc[:,'lower India_AQI']
upper_limits = confidence_intervals.loc[:,'upper India_AQI']

#plot the forecasted data
forecasts.predicted_mean.plot(legend=True, ax=ax, label ='Predicted Values')

#plot the confidence interval as the shaded area
plt.fill_between(confidence_intervals.index, lower_limits, upper_limits, color='pink')

#Plot India's AQI Data
cities['India_AQI'].plot(legend=True, ax=ax)