# **An analysis and forecast of air pollution levels in India**

In this project we look at air pollution levels in 24 Indian cities over the years and forecast the level for the country as a whole one year ahead, assuming pollution continues at its current rate. We also try to explain the trends, seasonality and other patterns in the data.
We use the Air Quality Index (AQI) as our measure of air pollution.

The data has been made publicly available by the Central Pollution Control Board (https://cpcb.nic.in/), the official portal of the Government of India. They also have a real-time monitoring app: https://app.cpcbccr.com/AQI_India/.



There will be two main parts to the project:

1. To compare the various cities on their level of pollution for the year 2019.

2. To find trends and seasonality in the pollution levels of India as a whole, as well as Delhi, and forecast them into the future.

## A brief introduction to the calculation of AQI

<img style="float: center;" src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcR8MwkjROMGpNIVRTeKgO_vIj2QU-J9MAIW8v6wf6yg6mWvPPWH&usqp=CAU.jpg">

1. The AQI calculation uses 7 measures: PM2.5 (particulate matter of 2.5 micrometres or less), PM10, SO2, NOx, NH3, CO and O3 (ozone).

2. For PM2.5, PM10, SO2, NOx and NH3, the average value over the last 24 hours is used, provided at least 16 values are available.

3. For CO and O3, the maximum value over the last 8 hours is used.

4. Each measure is converted into a sub-index based on pre-defined groups.

5. Sometimes a measure is unavailable because it was not monitored or too few data points were recorded.

6. The final AQI is the maximum sub-index, provided that at least one of PM2.5 and PM10 is available and at least three of the seven measures are available.
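The aggregation rule in the last step can be sketched in a few lines. This is only an illustration of the rule as stated above, not CPCB's implementation; the function name `overall_aqi` and the example sub-index values are made up for the sketch.

```python
from typing import Mapping, Optional

PARTICULATES = ("PM2.5", "PM10")

def overall_aqi(sub_indices: Mapping[str, Optional[float]]) -> Optional[float]:
    """Return the worst (maximum) available sub-index, or None if data are insufficient."""
    # Keep only the pollutants for which a sub-index could be computed
    available = {name: value for name, value in sub_indices.items() if value is not None}
    has_particulate = any(name in available for name in PARTICULATES)
    # Rule: at least three sub-indices, one of which must be PM2.5 or PM10
    if len(available) < 3 or not has_particulate:
        return None
    return max(available.values())

# Hypothetical sub-index values for one station on one day
example = {"PM2.5": 180.0, "PM10": 140.0, "SO2": 20.0, "NOx": None,
           "NH3": None, "CO": 60.0, "O3": None}
print(overall_aqi(example))
```

Returning `None` rather than a number for insufficient data mirrors the missing AQI values we will meet in the dataset.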

### How is AQI calculated?
1. The Sub-indices for individual pollutants at a monitoring location are calculated using its
24-hourly average concentration value (8-hourly in case of CO and O3) and health
breakpoint concentration range. The worst sub-index is the AQI for that location.
2. All the eight pollutants may not be monitored at all the locations. Overall AQI is
calculated only if data are available for minimum three pollutants out of which one should
necessarily be either PM2.5 or PM10. Else, data are considered insufficient for calculating
AQI. Similarly, a minimum of 16 hours’ data is considered necessary for calculating subindex.
3. The sub-indices for monitored pollutants are calculated and disseminated, even if data are
inadequate for determining AQI. The Individual pollutant-wise sub-index will provide air
quality status for that pollutant.
4. The web-based system is designed to provide AQI on real time basis. It is an automated
system that captures data from continuous monitoring stations without human
intervention, and displays AQI based on running average values (e.g. AQI at 6am on a
day will incorporate data from 6am on previous day to the current day).
5. For manual monitoring stations, an AQI calculator is developed wherein data can be fed
manually to get AQI value. 
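To make the idea of a sub-index concrete, here is a minimal sketch of linear interpolation between breakpoints. The breakpoint table below is an assumption meant to resemble the PM2.5 bands and should be checked against CPCB's published table before any real use; `sub_index` is a hypothetical helper name.

```python
# Illustrative PM2.5 breakpoints (assumed values, verify against CPCB):
# (concentration_low, concentration_high, index_low, index_high)
PM25_BREAKPOINTS = [
    (0, 30, 0, 50),
    (30, 60, 50, 100),
    (60, 90, 100, 200),
    (90, 120, 200, 300),
    (120, 250, 300, 400),
]

def sub_index(concentration, breakpoints=PM25_BREAKPOINTS):
    """Linearly interpolate a sub-index within the band containing the concentration."""
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= concentration <= c_hi:
            return i_lo + (i_hi - i_lo) * (concentration - c_lo) / (c_hi - c_lo)
    raise ValueError("concentration outside the illustrative table")

print(sub_index(75))  # halfway through the 60-90 band
```

The overall AQI is then the largest of these sub-indices across the monitored pollutants.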

Let us take a look at the ranges of AQI.

<img src="https://i.imgur.com/XmnE0rT.png" alt="">

Now we can proceed with our analysis.

## Downloading the dataset and importing libraries to conduct analysis:


In [None]:
# Importing necessary libraries to conduct our analysis
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
from IPython.display import HTML,display


for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#Reading the dataset into object 'df' using pandas:
df= pd.read_csv('/kaggle/input/air-quality-data-in-india/city_day.csv',parse_dates=True)
df['Date'] = pd.to_datetime(df['Date'])

## Exploratory data analysis (EDA), data wrangling and pre-processing:
First, let us take a look at the first five rows of our dataset.

In [None]:
df.head(5)

Right off we can notice there are many missing values, which can lead to incorrect predictions and inferences.

Taking a deeper look, we notice that only Delhi has a complete AQI column; the rest of the cities have incomplete data. Unfortunately this cannot be rectified, as the official records of pollutant levels are only available as given above, leaving a large amount of data missing.

Next let us take a look at a summary of all the data:

In [None]:
df.describe()

Above are summary statistics for all the columns. The AQI, as explained above, is based on these columns, and in this notebook we will only deal with AQI values across the cities.




#### Removing unused columns:
Here we will keep the columns 'City', 'Date', 'AQI' and 'AQI_Bucket'.



In [None]:
df=df[['City','Date','AQI','AQI_Bucket']]

#### Modifying dataset for our needs:
Here, we will transform the data so that each city's AQI is a column, which lets us compare AQI between cities. The table after transforming is given below.

In [None]:
cities=pd.unique(df['City'])
column1= cities+'_AQI'
column2=cities+'_AQI_Bucket'
columns=[*column1,*column2]

In [None]:
final_df=pd.DataFrame(index=np.arange('2015-01-01','2020-05-02',dtype='datetime64[D]'),columns=column1)
for city,i in zip(cities,final_df.columns):
    # Each city's records are assumed to be contiguous and to end on the last date of the index
    city_aqi=df.loc[df['City']==city,'AQI'].to_numpy()
    final_df.loc[final_df.index[-len(city_aqi):],i]=city_aqi
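The loop above aligns each city by position, which relies on every city's records running without gaps up to the last date. A possible alternative, shown here on a small made-up table rather than the real dataset, is to let pandas align by date with `pivot`. The toy values and the name `wide` are assumptions for illustration.

```python
import pandas as pd

# Toy long-format data in the same shape as df[['City', 'Date', 'AQI']]
toy = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                            "2020-01-01", "2020-01-03"]),
    "AQI": [300.0, 320.0, 310.0, 90.0, 110.0],
})

# One column per city, one row per date; dates a city lacks become NaN
wide = toy.pivot(index="Date", columns="City", values="AQI")
wide = wide.asfreq("D")  # make sure every calendar day is present
wide.columns = [f"{city}_AQI" for city in wide.columns]
print(wide)
```

Because alignment is by date, a missing day for one city (Mumbai on 2 January here) shows up as NaN instead of shifting the values that follow.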

Notice that the data is daily. For convenience, we will convert it into monthly data by averaging each month's values.

In [None]:
final_df=final_df.astype('float64')
final_df=final_df.resample(rule='MS').mean()

In [None]:
final_df.tail()

Next, we will add a column 'India_AQI', the average across all cities for each row. Note that this is not necessarily an accurate measure of AQI for India as a whole, as only a small subset of cities is used. Nevertheless, we can consider it a reasonably representative measure.

In [None]:
final_df['India_AQI']=final_df.mean(axis=1)

Let us take a quick look at the graph of India's AQI over the years.

In [None]:
ax=final_df[['India_AQI']].plot(figsize=(12,8),grid=True,lw=2,color='Red')
ax.autoscale(enable=True, axis='both', tight=True)

Straight away we can see patterns and trends over the years. Two are highly noticeable. The first is the general downward trend: over the past 5 years the AQI has reduced marginally. Note that this can be a little misleading, especially because of the 2015 data: the first few observations comprise only Delhi and Ahmedabad, which have relatively high pollution compared to the rest of the cities, so the initial portion of the graph is highly exaggerated. Nevertheless, we can see a general decline in pollution over the years.
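The composition effect described above is easy to reproduce on made-up numbers. In the sketch below (all values are invented for illustration), the row mean drops sharply in the month a cleaner city starts reporting, even though no city's own AQI changed much.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", periods=4, freq="MS")
toy = pd.DataFrame({
    "Delhi_AQI": [350.0, 340.0, 330.0, 320.0],
    "Ahmedabad_AQI": [400.0, 390.0, 380.0, 370.0],
    "Shillong_AQI": [np.nan, np.nan, 50.0, 50.0],  # starts reporting in month 3
}, index=idx)

row_mean = toy.mean(axis=1)          # NaN values are skipped by default
n_cities = toy.notna().sum(axis=1)   # how many cities contribute to each mean
print(pd.DataFrame({"mean_AQI": row_mean, "cities_reporting": n_cities}))
```

Running `final_df.notna().sum(axis=1)` on the real table before adding 'India_AQI' would show how many cities back each monthly average.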

The next easily observable pattern is the seasonal component, which plays a big role in the country's pollution. We will discuss it further in the second part of our project.
One other important point to note is the effect of COVID-19 on India's pollution level: pollution levels are drastically lower during 2020.

We can move on to comparing the AQI of the cities to find the most polluted city and the least.
Note that we will be leaving the unavailable data as is and further modify if required.

## Air pollution by city for the year 2019
The aim of this section is to find the level of pollution in each city and compare them. We use 2019 because it is by far the most complete year in terms of data and the most recent full year, which makes it apt for comparison.

We will start with forming a table with the data from 2019.

In [None]:
# Label slicing is inclusive, so stop at December to keep January 2020 out of the 2019 table
df_2019=final_df.loc['2019-01-01':'2019-12-31']
df_2019.head()

We can see that there still seem to be quite a few missing values. Let us take a look at the missing data.

In [None]:
df_2019.isna().sum()

We can see that there are 3 cities whose data is missing in its entirety. We will remove these columns as they serve no purpose. A few other columns are missing a few months of data. For our analysis we will keep them, even though this might add to the inaccuracy of our results.

In [None]:
df_2019=df_2019.drop(['Aizawl_AQI','Ernakulam_AQI','Kochi_AQI'],axis=1)

We will take the average of all the months for each city to find the AQI for the year 2019.

In [None]:
AQI_2019=df_2019.mean(axis=0)

Before looking at the means of the AQI values of the cities, we will take a look at the boxplots of the AQI values of the various cities.

In [None]:
plt.figure(figsize=(20,8))
plt.xticks(rotation=90)
bplot = sns.boxplot( data=df_2019,  width=0.75,palette="GnBu_d")
plt.ylabel('AQI');
bplot.grid(True)


We can see that Ahmedabad easily has the highest AQI values in the country, with Delhi a distant second. Let us take a look at the mean AQI values for further comparison.

In [None]:
plt.figure(figsize=(20,8))
plt.xticks(rotation=90)
plt.ylabel('AQI')
bplot=sns.barplot(x=AQI_2019.index, y=AQI_2019.values,palette="GnBu_d")


We can see that Ahmedabad and Delhi are the most polluted, whereas Shillong is the least polluted, followed by Trivandrum. With this we end the comparison and move to the next section: forecasting future AQI values for the whole of India.

## Analysing and forecasting AQI values:
We will first take a look at the seasonal decomposition of the AQI values of India.

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
India_AQI=final_df['India_AQI']
result=seasonal_decompose(India_AQI,model='multiplicative')
result.plot();

As discussed earlier, there is a very clear seasonality and a less clear trend. The trend is possibly due to increasing government restrictions on pollution, and the final drop is clearly due to COVID-19.

What about the seasonality: what causes the increase during certain months and the decline in others? Let us take a closer look at the months in which pollution peaks.

In [None]:
from matplotlib import dates
ax=result.seasonal.plot(xlim=['2018-01-01','2020-02-10'],figsize=(20,8),lw=2)
ax.yaxis.grid(True)
ax.xaxis.grid(True)

We can see that there are two main peaks, one in October and the other in January. Pollution is lowest around July-September, after which there is a sharp increase.
Similarly, there is a decrease from January to July. The spike in winter is due to a combination of factors. One point of note is that North Indian states see a larger increase in pollution.
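A rough way to check peak and trough months without reading them off the plot is to average each calendar month and divide by the overall mean. The sketch below uses a synthetic winter-peaking series (an assumption for illustration, not the real data), and this simple ratio is not the same computation that `seasonal_decompose` performs.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", periods=48, freq="MS")
months = idx.month.to_numpy()
# Synthetic monthly series that peaks in January and bottoms out in July
series = pd.Series(150 + 60 * np.cos(2 * np.pi * (months - 1) / 12), index=idx)

# Average level of each calendar month relative to the overall average
monthly_index = series.groupby(series.index.month).mean() / series.mean()
print(monthly_index.round(2))
print("peak month:", monthly_index.idxmax(), "trough month:", monthly_index.idxmin())
```

Applied to the real 'India_AQI' series, values above 1 would mark the months that are more polluted than the yearly average.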

The spike is due to factors including winter inversion and the valley effect (both explained below), as well as seasonal factors such as dust storms, crop fires and stubble burning, burning of solid fuels for heating, and firecracker-related pollution during Diwali.


### Winter Inversion:
In summer, air in the planetary boundary layer (the lowest part of the atmosphere) is warmer and lighter, and rises upwards more easily. This carries pollutants away from the ground and mixes them with cleaner air in the upper layers of the atmosphere in a process called ‘vertical mixing’.  

During winters the planetary boundary layer is thinner as the cooler air near the earth’s surface is dense. The cooler air is trapped under the warm air above that forms a kind of atmospheric ‘lid’. This phenomenon is called winter inversion. Since the vertical mixing of air happens only within this layer, the pollutants released lack enough space to disperse in the atmosphere.
During summers, pollution levels decrease as the warmer air rises up freely, making the boundary layer thicker, and providing enough space for pollutants to disperse. The same thing happens during winter afternoons, when increased heat brings down pollution slightly.

The effects of inversion are stronger at night, which is why air quality levels drop overnight. This is also why experts ask people to refrain from early morning walks, as they could be exposed to much higher pollution levels at that time.
In cities closer to the coast, like Mumbai, the sea breeze and moisture help disperse pollution. However, the Indo-Gangetic plain, which includes Punjab, Delhi, UP, Bihar and West Bengal, is like a valley surrounded by the Himalayas and other mountain ranges. Polluted air settles in this land-locked valley and is unable to escape due to low wind speeds.
In major cities of this region, such as Delhi and Kanpur, high industrial and vehicular emissions coupled with biomass burning in surrounding areas cause more pollution that gets trapped due to this valley effect and inversion.

Now that we have an explanation for the seasonal component as well as the trend component, let us try to predict future values of AQI based purely on previous values.



# Forecasting: 
We will be using three methods to forecast AQI values for India: SARIMA, an RNN using LSTM, and Facebook Prophet.
Using all three is obviously overkill; however, being new to time series, I would personally like to explore all three options. Normally an RNN would not be recommended for such a small dataset.

## SARIMA (Seasonal Autoregressive Integrated Moving Average)
Autoregressive Integrated Moving Average, or ARIMA, is one of the most widely used forecasting methods for univariate time series data. Although the method can handle data with a trend, it does not support time series with a seasonal component. An extension to ARIMA that supports direct modelling of the seasonal component of the series is called SARIMA.
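The "integrated" part of ARIMA refers to differencing the series, and SARIMA can additionally difference at the seasonal lag. Here is a small sketch of both operations on a synthetic series with a linear trend and a 12-month cycle (the series is an assumption for illustration only).

```python
import numpy as np

t = np.arange(48)
# Synthetic series: upward trend plus a yearly cycle
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12)

first_diff = np.diff(series, n=1)            # y[t] - y[t-1], removes the linear trend
seasonal_diff = series[12:] - series[:-12]   # y[t] - y[t-12], removes the yearly cycle
print(first_diff[:3])
print(seasonal_diff[:3])
```

In the `order=(p,d,q)` and `seasonal_order=(P,D,Q,m)` arguments used later, `d` and `D` are the number of times each kind of differencing is applied.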

In [None]:
# Load specific forecasting tools
from statsmodels.tsa.statespace.sarimax import SARIMAX
!pip install pmdarima;
from pmdarima import auto_arima;                              # for determining ARIMA orders

First, we run auto_arima to find the parameters of the model. We could do this manually; however, it is much easier to let the notebook do the work for us.

In [None]:
auto_arima(y=India_AQI,start_p=0,start_P=0,start_q=0,start_Q=0,seasonal=True, m=12).summary()

We have found that the optimal parameters for the SARIMAX model are (1,1,1)x(1,0,1,12). Note that the model is called SARIMAX; however, we do not have an exogenous variable, so it reduces to SARIMA. The model selection criterion is AIC, which is the default.

Our next step is to forecast into the future using this model. However, since we do not know future values, we will split the data into training and testing sets and try to predict 1 year ahead. We will use January 2015 to June 2018 as our training set and July 2018 to June 2019 as our test set. We exclude 2020 because it is an outlier due to COVID-19, so it would not give an accurate picture of the model's predictive power. We will also take a look at the predicted values for 2020 for reference. Further, we will predict into the year 2021.
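Since the split is done by integer position, it helps to check which position each month occupies. The sketch below rebuilds the monthly index, assuming (as in the table construction earlier) that the data runs from January 2015 to May 2020.

```python
import pandas as pd

# Monthly index matching the resampled data: January 2015 to May 2020
idx = pd.date_range("2015-01-01", "2020-05-01", freq="MS")

print(len(idx))                                  # total number of months
print(idx.get_loc(pd.Timestamp("2018-06-01")))   # position of June 2018
print(idx.get_loc(pd.Timestamp("2018-07-01")))   # position of July 2018
print(idx.get_loc(pd.Timestamp("2019-06-01")))   # position of June 2019
```

Remember that a slice such as `[:n]` stops just before position `n`, so including June 2018 in the training set requires its position plus one as the upper bound.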

In [None]:
len(India_AQI)

In [None]:
#dividing into train and test:
train=India_AQI[:42]      # January 2015 to June 2018
test=India_AQI[42:54]     # July 2018 to June 2019

In [None]:
# Forming the model:
model=SARIMAX(train,order=(1,1,1),seasonal_order=(1,0,1,12),)
results=model.fit()
results.summary()


We have fitted our model with the training data and the required parameters. Next we need to forecast the AQI values for the next 12 months.

In [None]:
#Obtaining predicted values:
predictions = results.predict(start=42, end=53, typ='levels').rename('Predictions')

In [None]:
#Plotting predicted values against the true values:
predictions.plot(legend=True)
test.plot(legend=True);

We can see that the SARIMA predictions are fairly close to the actual values; it is quite fascinating how much insight previous values give us into future air pollution. However, there is a discrepancy at the peak of the graph, where our model has not predicted with high accuracy. To quantify the error, we will use the root mean square error (RMSE), which also lets us compare the models.

In [None]:
from sklearn.metrics import mean_squared_error
RMSE=np.sqrt(mean_squared_error(predictions,test))
print('RMSE = ',RMSE)
print('Mean AQI',test.mean())

We get an RMSE of approximately 21, which is quite acceptable. We can judge the scale of the error by comparing it with the mean AQI of 177: the error is roughly 1/9 (about 12%) of the actual values.
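For reference, RMSE is simply the square root of the average squared difference between actual and predicted values. A minimal sketch follows, with a hypothetical helper `rmse` and made-up inputs; the last line repeats the ratio of the two figures quoted above.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up example: errors of 3 and 4 give sqrt((9 + 16) / 2)
print(rmse([100, 200], [103, 196]))

# Relative size of the error quoted above: RMSE of about 21 against a mean of 177
print(round(21 / 177, 3))
```

Because RMSE is in the same units as the AQI, dividing it by the mean gives a quick sense of scale when comparing models on the same test period.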

Next we will try predicting the AQI values for July 2019 to May 2020.

In [None]:
#dividing into train and test:
train=India_AQI[:54]      # January 2015 to June 2019
test=India_AQI[54:]       # July 2019 to May 2020
# Forming the model:
model=SARIMAX(train,order=(1,1,1),seasonal_order=(1,0,1,12),)
results=model.fit()
results.summary()
#Obtaining predicted values:
predictions = results.predict(start=54, end=64, typ='levels').rename('Predictions')
#Plotting predicted values against the true values:
predictions.plot(legend=True)
test.plot(legend=True);

As expected, the predicted values are much higher than the actual value as we can see from the graphs. Let us take a look at the error value.

In [None]:
#Finding RMSE:
from sklearn.metrics import mean_squared_error
RMSE=np.sqrt(mean_squared_error(predictions,test))
print('RMSE = ',RMSE)
print('Mean AQI',test.mean())

The error is much higher than before, for obvious reasons: predicting for 2020 is not going to yield accurate results because of COVID-19.

Next we will take a look at forecasting into the unknown, i.e. 2020-2021.
This poses a problem. If we include the 2020 data, we are bound to get an inaccurate prediction for next year simply because 2020 is an outlier. However, if we remove 2020 and predict from 2019 through to 2021, the predictions will certainly miss the drop, and since COVID-19 could have lasting effects, we would predict poorly as well.
We choose to include 2020 in this prediction. We can compare against the actual values next year.

### Predicting into the future:

In [None]:
# Forming the model:
model=SARIMAX(India_AQI,order=(1,1,1),seasonal_order=(1,0,1,12))
results=model.fit()
results.summary()
#Obtaining predicted values:
predictions = results.predict(start=64, end=77, typ='levels').rename('Predictions')
#Plotting predicted values against the true values:
predictions.plot(legend=True)
India_AQI.plot(legend=True,figsize=(12,8),grid=True);

We can see the predictions plotted as a continuation of 2020, and one thing we note is how optimistic they are. That is purely because 2020 is such an outlier. Chances are that pollution levels will return to the pre-2020 trend, which would mean a bump in AQI levels, unless the country keeps the restrictions as they are, which is highly unlikely. We could get a more accurate prediction by skipping 2020.

Next, we will try forecasting using Facebook's Prophet.

## Facebook Prophet:
This is a library developed by Facebook that is extremely easy to use and produces aesthetically pleasing visuals. It is also a well-made model. One drawback is that it is fairly hard to customize, both in the modelling itself and in the visualization, which makes it less transparent than SARIMA, though quite a bit less of a black box than RNNs. Also note that Prophet's main strength is daily data, which is not what we are using here.

In [None]:
from fbprophet import Prophet

One peculiarity of the library is that the data has to be in a very specific form: a DataFrame with a date column named 'ds' and a value column named 'y'.

In [None]:
#Formatting necessary to Prophet:
India_AQI=India_AQI.reset_index()
India_AQI.columns=['ds','y']
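The same reshaping can be tried on a tiny stand-in series to see what the resulting frame looks like. The values below are made up; only the column names 'ds' and 'y' matter.

```python
import pandas as pd

# Stand-in for the monthly India_AQI series (values are invented)
series = pd.Series([200.0, 180.0, 150.0],
                   index=pd.date_range("2015-01-01", periods=3, freq="MS"),
                   name="India_AQI")

# Move the dates into a column called 'ds' and name the values 'y'
frame = series.rename_axis("ds").reset_index(name="y")
print(frame)
print(frame.dtypes)
```

Keeping 'ds' as a proper datetime column lets the library work out the monthly spacing of the data.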

### Creating and fitting the model:
First we will split the data into train/test sets as before and predict for 2018-2019 to compare its general predictive power with SARIMA.

In [None]:
# Forming test/train data:
train=India_AQI[:-24]
test=India_AQI[-24:-12]
m = Prophet(seasonality_mode='multiplicative')
m.fit(train)

### Forecasting: 
For prophet to work, we need to make a dataset for future predictions beforehand.

In [None]:
future = m.make_future_dataframe(periods=12,freq = 'MS')

Next we forecast for our model and plot the graph:

In [None]:
forecast = m.predict(future)
m.plot(forecast);

Next we will take a look at root mean square error:

In [None]:
#Finding RMSE:
from sklearn.metrics import mean_squared_error
RMSE=np.sqrt(mean_squared_error(forecast['yhat'][-12:],test['y']))
print('RMSE = ',RMSE)
print('Mean AQI',test['y'].mean())

The RMSE is a lot higher than that of SARIMA, which could indicate that Prophet gives more weight to older data. Perhaps the model could be tweaked further to get better results.

Lastly, let us predict into the future using the Prophet Library.

In [None]:
m = Prophet(seasonality_mode='multiplicative',weekly_seasonality=False,daily_seasonality=False)
m.fit(India_AQI)
future = m.make_future_dataframe(periods=12,freq = 'MS')
forecast = m.predict(future)
m.plot(forecast);

This is the final plot into the future. Surprisingly, this model looks like it could be a better predictor than SARIMA in this case, since the trend has not been altered too greatly by COVID-19. This further seems to indicate that Prophet places more emphasis on older values than SARIMA does. Note that we can retrieve the predicted values from the forecast object.

Finally let us take a look at the components of our data found by prophet:

In [None]:
m.plot_components(forecast);

The above diagram gives us a much clearer, in-depth idea of the trend and seasonal components of our data.
With this we have come to the end of predicting AQI values using Prophet.

## Recurrent Neural Networks (RNN):
For this last forecast we will use an RNN, a type of neural network used for sequential data such as text, speech and time series. We will use a particular cell type, the LSTM (long short-term memory). LSTM networks are designed to retain information over a longer term than regular RNNs. Like all neural networks, RNNs work best with a huge amount of data. An RNN is a black-box method, which means there is little transparency in the model and how it trains.
Another major disadvantage is the large number of hyperparameters to tune. Hence RNNs should preferably be used as a last resort.

In [None]:
India_AQI=India_AQI.set_index('ds')

#### Splitting into test/train:

In [None]:
train=India_AQI[:-24]
test=India_AQI[-24:-12]

#### Scaling the data: 
For this model we will be scaling the data to 0-1.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(train)

In [None]:
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)
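What the scaler does can be written out by hand. The sketch below uses made-up numbers and hypothetical helper names `scale` and `unscale`; it shows why test values can fall outside 0-1 when the minimum and maximum are learned from the training data only.

```python
import numpy as np

train = np.array([100.0, 150.0, 200.0])
test = np.array([120.0, 220.0])

# The minimum and maximum are learned from the training data only
lo, hi = train.min(), train.max()

def scale(x):
    return (x - lo) / (hi - lo)

def unscale(z):
    return z * (hi - lo) + lo

scaled_test = scale(test)
print(scale(train))          # training values land exactly in [0, 1]
print(scaled_test)           # a test value above the training maximum exceeds 1
print(unscale(scaled_test))  # the inverse transform recovers the original values
```

Fitting on the training set alone avoids leaking information about the test period into the model.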

We need to put the data in a particular format for Keras, the library used to implement the RNN. n_input tells us how many values before the output value are used to make a prediction. I have chosen 2 years (24 months). One year is also a reasonable value. However, since I want to predict into the future, I want the year before COVID-19 to be in my calculation too (note that I could do this for SARIMA as well). n_features is the number of variables observed at each time step, which is 1 here because we only have the AQI series.

### Formatting the data:

In [None]:
from keras.preprocessing.sequence import TimeseriesGenerator
n_input = 24
n_features = 1
generator = TimeseriesGenerator(scaled_train, scaled_train, length=n_input, batch_size=1)

In [None]:
#To give an idea of what generator file holds:
X,y = generator[0]

In [None]:
# We can see that the x array gives the list of values that we are going to predict y of:
print(f'Given the Array: \n{X.flatten()}')
print(f'Predict this y: \n {y}')
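The windowing that the generator performs can be reproduced with plain NumPy, which makes it easier to see how few training samples we actually have. This is a sketch with a hypothetical helper `make_windows`, not the Keras implementation; the input is a stand-in series of 41 values, the length of our training set.

```python
import numpy as np

def make_windows(series, n_input):
    """Build (window, next value) pairs from a 1-D series."""
    series = np.asarray(series, dtype=float).reshape(-1, 1)
    # Each sample is n_input consecutive values; its target is the value that follows
    X = np.stack([series[i:i + n_input] for i in range(len(series) - n_input)])
    y = series[n_input:]
    return X, y

# 41 monthly values with a 24-month window leave only 17 training samples
X, y = make_windows(np.arange(41), n_input=24)
print(X.shape, y.shape)
```

With so few samples per epoch, the caution about using an RNN on a small dataset is worth keeping in mind.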

### Creating the model:

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

In [None]:
# defining the model(note that  I am using a very basic model here, a 2 layer model only):
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_input, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

model.summary()

### Fitting the model:
We can define the number of epochs we want.

In [None]:
# Fitting the model with the generator object:
model.fit(generator,epochs=250)

The plot below shows how the loss falls with each epoch.

In [None]:
loss_per_epoch = model.history.history['loss']
plt.plot(range(len(loss_per_epoch)),loss_per_epoch)

### Forming our predictions and putting them in the array test_predictions:

In [None]:
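test_predictions = []

# Start from the last n_input scaled training values
first_eval_batch = scaled_train[-n_input:]
current_batch = first_eval_batch.reshape((1, n_input, n_features))

for i in range(len(test)):
    # Predict one step ahead from the current window
    current_pred = model.predict(current_batch)[0]
    test_predictions.append(current_pred)
    # Slide the window: drop the oldest value and append the prediction
    current_batch = np.append(current_batch[:,1:,:],[[current_pred]],axis=1)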
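The loop above feeds each prediction back in as the newest input. The same idea is shown below with a trivial stand-in "model" (the mean of the window) so that the mechanics can be followed by hand. `recursive_forecast` and the stand-in model are assumptions for illustration only.

```python
import numpy as np

def recursive_forecast(predict_one, history, n_input, steps):
    """Forecast several steps ahead by feeding each prediction back into the window."""
    window = list(history[-n_input:])
    forecasts = []
    for _ in range(steps):
        next_value = float(predict_one(np.array(window)))
        forecasts.append(next_value)
        # Drop the oldest value and append the new prediction
        window = window[1:] + [next_value]
    return forecasts

# Stand-in model: predict the mean of the current window
preds = recursive_forecast(lambda w: w.mean(), [1.0, 2.0, 3.0], n_input=2, steps=3)
print(preds)
```

Since later steps are built on earlier predictions rather than on observed values, errors can compound as the forecast horizon grows.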

In [None]:
true_predictions = scaler.inverse_transform(test_predictions)

In [None]:
test['Predictions'] = true_predictions

### Plotting our predictions with the true values:

In [None]:
# test now holds both the true values ('y') and 'Predictions', so one call plots both on the date axis
test.plot(figsize=(12,8))


The graph looks quite on point! Let us find the RMSE value for this model:

In [None]:
RMSE=np.sqrt(mean_squared_error(test['y'],test['Predictions']))
print('RMSE = ',RMSE)
print('Mean AQI',test['y'].mean())

The RMSE is lower than what we obtained with the two models above, even with our limited dataset.

### Forecasting into the future with RNN:
We will use the same model but with the entire dataset now and predict one year into the future.

In [None]:
scaler.fit(India_AQI)
scaled_India_AQI=scaler.transform(India_AQI)

In [None]:
generator = TimeseriesGenerator(scaled_India_AQI, scaled_India_AQI, length=n_input, batch_size=1)

In [None]:
model.fit(generator,epochs=250)

In [None]:
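test_predictions = []

# Start from the last n_input scaled values of the full series
first_eval_batch = scaled_India_AQI[-n_input:]
current_batch = first_eval_batch.reshape((1, n_input, n_features))

# len(test) is 12, so this forecasts 12 months ahead
for i in range(len(test)):
    # Predict one step ahead from the current window
    current_pred = model.predict(current_batch)[0]
    test_predictions.append(current_pred)
    # Slide the window: drop the oldest value and append the prediction
    current_batch = np.append(current_batch[:,1:,:],[[current_pred]],axis=1)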

In [None]:
true_predictions = scaler.inverse_transform(test_predictions)

In [None]:
true_predictions=true_predictions.flatten()

In [None]:
true_preds=pd.DataFrame(true_predictions,columns=['Forecast'])
true_preds=true_preds.set_index(pd.date_range('2020-06-01',periods=12,freq='MS'))

Given below are the forecasted values:

In [None]:
true_preds

Next, we will take a look at the plot of the actual values followed by the predicted values:

In [None]:
plt.figure(figsize=(20,8))
plt.grid(True)
plt.plot( true_preds['Forecast'])
plt.plot( India_AQI['y'])

Again, as with SARIMA, we can see that the prediction is highly optimistic due to COVID-19. It could possibly be improved by removing 2020 and predicting two years ahead using only pre-2020 data.

With this we have come to the end of the forecasting section and the notebook overall.