In [None]:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


In [None]:
dataset = pd.read_csv('../input/air-quality-data-in-india/city_day.csv')
df_city_day = dataset.copy()

In [None]:
df_city_day.head()

In [None]:
df_city_day.info()

In [None]:
df_city_day.isnull().sum()

We can see that Xylene and PM10 have the most null values (about 18K and 11K, respectively).
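Raw null counts are easier to compare when they are also shown as a share of all rows. The following is a minimal sketch of that idea; the `toy` frame and its values are made up for illustration and only stand in for `df_city_day`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_city_day; the values are invented for illustration.
toy = pd.DataFrame({
    "PM2.5": [10.0, 20.0, np.nan, 40.0],
    "PM10": [np.nan, 50.0, np.nan, 80.0],
    "Xylene": [np.nan, np.nan, np.nan, 1.0],
})

# Null count per column, plus the same figure as a percentage of all rows,
# sorted so the columns with the most missing data come first.
null_summary = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("null_count", ascending=False)

print(null_summary)
```

Replacing `toy` with `df_city_day` should give the same kind of table for the real data.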

In [None]:
df_city_day['AQI_Bucket'].value_counts(ascending= True)

There are six AQI categories in the data: *Good, Satisfactory, Moderate, Poor, Very Poor and Severe.*

In [None]:
sns.catplot(x = "AQI_Bucket", kind= "count", palette = "ch: 2.87", height=5, aspect=1.1, data = df_city_day)

**Plot: Number of entries vs AQI_Bucket Category**
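By default the buckets appear in the order they are encountered, not in order of severity. A small sketch of how the counts could be arranged from *Good* to *Severe* follows; `bucket_order` and `toy_buckets` are hypothetical names, and the sample values are made up.

```python
import pandas as pd

# Severity order of the six buckets named above.
bucket_order = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]

# Toy stand-in for df_city_day['AQI_Bucket']; values are invented.
toy_buckets = pd.Series(["Poor", "Good", "Severe", "Moderate", "Good", None])

# Count each bucket (nulls are ignored), then reindex into severity order,
# filling buckets that never occur with zero.
counts = toy_buckets.value_counts().reindex(bucket_order, fill_value=0)

print(counts)
```

The same list can be passed to `sns.catplot` through its `order` argument so that the bars follow the severity order as well.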

Let us drop the rows with null values in PM10 and Xylene, the two columns with the most missing entries.

In [None]:
df_city_day = df_city_day.dropna(axis = 0, subset = ['PM10'])

In [None]:
df_city_day = df_city_day.dropna(axis = 0, subset = ['Xylene'])

In [None]:
df_city_day.info()
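The two `dropna` calls can also be written as one call with both columns in `subset`, and it is worth checking how many rows survive. A minimal sketch, using a made-up `toy` frame in place of `df_city_day`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_city_day; the values are invented for illustration.
toy = pd.DataFrame({
    "PM10": [np.nan, 50.0, 60.0, 80.0],
    "Xylene": [1.0, np.nan, 2.0, 3.0],
    "NO": [5.0, 6.0, np.nan, 8.0],
})

rows_before = len(toy)

# Drop a row if either PM10 or Xylene is missing; nulls in other
# columns (here NO) do not cause a row to be dropped.
cleaned = toy.dropna(axis=0, subset=["PM10", "Xylene"])
rows_after = len(cleaned)

print(f"kept {rows_after} of {rows_before} rows")
```

Note that nulls in the remaining columns are left in place, so later steps still have to tolerate them.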

Next, we analyse the levels of the major pollutants city-wise, using the median value for each city.

In [None]:
df_city_day[['PM2.5', 'City']].groupby(['City']).median().sort_values("PM2.5", ascending = False).plot.bar()


*Delhi* has the **highest median level of PM2.5**, whereas Mumbai and Ernakulam sit at the other end of the table. Major contributors to the rise of PM2.5 levels in Delhi include increasing traffic, dust and smoke from fires.

In [None]:
df_city_day[['NO', 'City']].groupby(['City']).median().sort_values("NO", ascending = False).plot.bar(color='brown')

*Mumbai* has the **highest median level of NO**, whereas Gurugram and Amaravati have very low figures. High NO levels such as those in the commercial capital of India can cause *respiratory ailments, hematologic side effects, metabolic disorders, low blood pressure, nausea, vomiting and diarrhoea.*

In [None]:
df_city_day[['NO2', 'City']].groupby(['City']).median().sort_values("NO2", ascending = False).plot.bar(color='purple')
df_city_day[['CO', 'City']].groupby(['City']).median().sort_values("CO", ascending = False).plot.bar(color='y')
df_city_day[['SO2', 'City']].groupby(['City']).median().sort_values("SO2", ascending = False).plot.bar(color='r')
df_city_day[['O3', 'City']].groupby(['City']).median().sort_values("O3", ascending = False).plot.bar(color='orange')
df_city_day[['Benzene', 'City']].groupby(['City']).median().sort_values("Benzene", ascending = False).plot.bar(color='teal')

Ahmedabad ranks highest for **nitrogen dioxide, sulphur dioxide and carbon monoxide**, whereas Gurugram and Kolkata are the most polluted in terms of **ozone** and **benzene** respectively. On a broader view, Ernakulam and Amaravati seem to be less hazardous than the other mid-tier and top-tier cities.
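The per-pollutant rankings above all follow the same group, median, sort pattern, which can be wrapped in a small helper. The sketch below assumes a hypothetical function name, `median_by_city`, and uses a made-up `toy` frame with placeholder city names.

```python
import pandas as pd

def median_by_city(df, pollutant):
    """Return the per-city median of one pollutant, highest first."""
    return df.groupby("City")[pollutant].median().sort_values(ascending=False)

# Toy stand-in for df_city_day; cities and values are invented.
toy = pd.DataFrame({
    "City": ["A", "A", "B", "B", "C"],
    "NO2": [10.0, 30.0, 5.0, 7.0, 50.0],
    "CO": [1.0, 3.0, 9.0, 11.0, 0.5],
})

# Report the top-ranked city for each pollutant.
top_city = {}
for pollutant in ["NO2", "CO"]:
    ranking = median_by_city(toy, pollutant)
    top_city[pollutant] = ranking.index[0]
    print(pollutant, "->", top_city[pollutant])
```

Calling `.plot.bar()` on the returned Series would reproduce a bar chart like the ones in the cells above.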

In [None]:
sns.set()
cols = ['SO2', 'NOx', 'O3', 'NO2', 'PM2.5']
sns.pairplot(df_city_day[cols], height = 2.5)
plt.show()

The SO2 and NO2 points cluster near the origin, which hints at some correlation between the two. For the other pairs the points are widely scattered, so there is no clear indication.

In [None]:
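# Correlate numeric columns only (City, Date and AQI_Bucket are not numeric)
corrmat = df_city_day.corr(numeric_only = True)
# Leave these columns out of the heatmap, on both axes
drop_cols = ['NOx', 'NH3', 'O3', 'Toluene', 'Xylene', 'AQI']
heatmap_df = corrmat.drop(index = drop_cols, columns = drop_cols)
f, ax = plt.subplots(figsize = (10,10))
sns.heatmap(heatmap_df, vmax = 1, square = True, annot = True)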

**PM2.5** is strongly correlated with **PM10**, both being in the particulate matter category. There is also some correlation between **carbon monoxide** and **sulphur dioxide**, and likewise between **sulphur dioxide** and **nitrogen dioxide**. The remaining pairs show no clear correlation.
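Reading pairs off a heatmap by eye can be supplemented with a ranked list of correlations. The sketch below shows one way to do that; the `toy` frame is synthetic (PM10 is deliberately built from PM2.5 plus noise), so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: PM10 is constructed to track PM2.5 closely,
# while O3 is independent noise. None of this comes from the real dataset.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "PM2.5": base,
    "PM10": base * 1.5 + rng.normal(scale=0.1, size=200),
    "O3": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so that each
# pair appears once, then flatten and sort from strongest to weakest.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs)
```

Applied to `heatmap_df`, the same steps would list the pollutant pairs in order of correlation.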

As my learning continues, I am looking to analyse more datasets relating to sustainability, social causes, the environment and similar topics. Please suggest any that are relevant to these domains.

Thank you and Happy Learning!