We have daily and hourly data at both the city and the station level. A station is a continuous pollution monitoring station operated and maintained by the Central Pollution Control Board (CPCB) or a State Pollution Control Board. Let's begin by analyzing the cities' daily data to get the big picture.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import warnings
warnings.filterwarnings('ignore')

df_city_day = pd.read_csv('../input/air-quality-data-in-india/city_day.csv')
df_city_hour = pd.read_csv('../input/air-quality-data-in-india/city_hour.csv')
df_station_day = pd.read_csv('../input/air-quality-data-in-india/station_day.csv')
df_station_hour = pd.read_csv('../input/air-quality-data-in-india/station_hour.csv')
df_stations = pd.read_csv('../input/air-quality-data-in-india/stations.csv')

list_of_df = [df_city_day,df_city_hour,df_station_day,df_station_hour,df_stations]
list_of_df_name = ['df_city_day','df_city_hour','df_station_day','df_station_hour','df_stations']
print(f"Available datasets are: {list_of_df_name}")
for i,df in zip(list_of_df_name,list_of_df):
    print(f"Columns of {i} are \n{df.columns}\n")

In [None]:
df_city_day.head()

In [None]:
#creating a func to make missing value table so that it can be used again
def missing_value_table(df):
    values = df.isnull().sum()
    percentage = 100*df.isnull().sum()/len(df)
    table = pd.concat([values,percentage.round(2)],axis=1)
    table.columns = ['No of missing values','% of missing values']
    return table[table['No of missing values']!=0].sort_values('% of missing values',ascending=False).style.background_gradient('Greens')
    
missing_value_table(df_city_day)

In [None]:
#converting dtype of date column to datetime
df_city_day['Date'] = pd.to_datetime(df_city_day['Date'])
#setting date column as index
df_city_day.set_index('Date',inplace=True)

### Handling missing values

Here, I am not imputing missing values, as this notebook is a task submission for which only EDA is required.
> For modelling, missing values would need to be imputed or removed. To see one way to fill them, see the hidden code block.

In [None]:
# imputing missing values using interpolation
# df_city_day.interpolate(method='linear', axis=0, limit_direction='both', inplace=True)

# imputing AQI_Bucket column according to AQI column
# def custom_imputer(df):
#     if df['AQI'] < 51.0:
#         return 'Good'
#     elif 50.0 < df['AQI'] < 101.0:
#         return 'Satisfactory'
#     elif 100.0 < df['AQI'] < 201.0:
#         return 'Moderate'
#     elif 200.0 < df['AQI'] < 301.0:
#         return 'Poor'
#     elif 300.0 < df['AQI'] < 401.0:
#         return 'Very Poor'
#     else:
#         return 'Severe'

# df_city_day['AQI_Bucket'] = df_city_day.apply(custom_imputer, axis=1)
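
As an alternative to the row-wise `custom_imputer` above, the bucket labels could be derived in one vectorized step with `pd.cut`. This is only a sketch on a toy series: the bin edges mirror the thresholds in the commented-out function, and the names `aqi_to_bucket`, `AQI_BINS` and `AQI_LABELS` are hypothetical. Unlike the `if`/`elif` chain, where a missing AQI fails every comparison and falls through to 'Severe', `pd.cut` leaves a missing AQI as missing.

```python
import numpy as np
import pandas as pd

# Bin edges mirror the thresholds in custom_imputer; labels are its return values.
AQI_BINS = [0, 50, 100, 200, 300, 400, np.inf]
AQI_LABELS = ['Good', 'Satisfactory', 'Moderate', 'Poor', 'Very Poor', 'Severe']

def aqi_to_bucket(aqi):
    """Map a Series of AQI values to bucket labels; NaN stays NaN."""
    return pd.cut(aqi, bins=AQI_BINS, labels=AQI_LABELS, include_lowest=True)

# Toy values (not taken from the dataset) to show the mapping.
toy_aqi = pd.Series([30, 50, 51, 150, 250, 350, 450, np.nan])
toy_buckets = aqi_to_bucket(toy_aqi)
print(toy_buckets.tolist())
```

For whole-number AQI values the result matches the chain. A fractional value such as 50.5 would land in 'Satisfactory' here but in 'Good' in the chain.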

### EDA

In [None]:
print(f"City data is available from {df_city_day.index.min().date()} to {df_city_day.index.max().date()}")

In [None]:
df_city_day[['City','AQI']].groupby('City').mean().sort_values('AQI').plot(kind='barh',cmap='summer',figsize=(8,8))
plt.title('Average AQI in last 5 years');

Ahmedabad has the worst average AQI. Delhi, Gurugram, Patna and Lucknow also have alarming AQI levels.

Next, I combine related features into groups: Nitrogen Oxides (NO, NO2, NOx); Benzene, Toluene and Xylene (BTX); and Particulate Matter (PM2.5 and PM10).

In [None]:
city_day = df_city_day.copy()
city_day['BTX'] = city_day['Benzene']+city_day['Toluene']+city_day['Xylene']
city_day['Particulate_Matter'] = city_day['PM2.5']+city_day['PM10']
city_day['Nitrogen Oxides'] = city_day['NO']+city_day['NO2']+city_day['NOx']
city_day.drop(['Benzene','Toluene','Xylene','PM2.5','PM10','NO','NO2','NOx'],axis=1,inplace=True)

plt.figure(figsize=(5,4))
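# numeric_only=True skips the text columns (City, AQI_Bucket), which recent pandas versions refuse to correlate
sns.heatmap(city_day.corr(numeric_only=True),cmap='coolwarm',annot=True);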

AQI is highly correlated with `Particulate_Matter`, `Nitrogen Oxides` and CO.
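
One caveat about the combined features built above: adding columns with `+` yields NaN whenever any component is missing, so a day with only PM2.5 recorded gets no `Particulate_Matter` value. If that is not wanted, `DataFrame.sum` with `min_count=1` is one possible alternative, sketched here on toy numbers (not the real data). The names `toy`, `strict_sum` and `lenient_sum` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy readings: a complete row, a row missing PM10, and a row missing both.
toy = pd.DataFrame({
    'PM2.5': [40.0, 55.0, np.nan],
    'PM10':  [90.0, np.nan, np.nan],
})

# `+` propagates NaN: any missing component makes the total missing.
strict_sum = toy['PM2.5'] + toy['PM10']

# sum(min_count=1) skips NaN but still returns NaN when every component is missing.
lenient_sum = toy[['PM2.5', 'PM10']].sum(axis=1, min_count=1)

print(pd.DataFrame({'strict': strict_sum, 'lenient': lenient_sum}))
```

Whether a partial sum is meaningful is a judgment call, since it mixes days with different numbers of recorded components.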

In [None]:
pollutants = ['City','AQI_Bucket', 'AQI', 'Particulate_Matter', 'Nitrogen Oxides','NH3', 'CO', 'SO2', 'O3',  'BTX']
city_day = city_day[pollutants]
print('Distribution of different pollutants in last 5 years')
city_day.plot(kind='line',figsize=(18,18),cmap='coolwarm',subplots=True,fontsize=10);

* NH3 (Ammonia): Levels rose in 2016 and 2018.
* CO (Carbon Monoxide) and SO2 (Sulphur Dioxide): Levels have been increasing since 2018, with a slight seasonal effect.
* O3 (Ozone): Levels stayed roughly the same across these 5 years.
* BTX: Levels were at their lowest from 2016 to 2018.
* Particulate Matter and Nitrogen Oxides show strong seasonal effects.
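
The seasonal effects noted above can be checked numerically by averaging each calendar month across years. The sketch below uses a synthetic winter-peaking series rather than the real data, so the names `toy`, `monthly_profile` and `smoothed` are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for one pollutant column (not real data):
# a winter-peaking seasonal cycle on top of a constant baseline.
dates = pd.date_range('2018-01-01', '2019-12-31', freq='D')
day_of_year = dates.dayofyear.to_numpy()
seasonal = 50 * np.cos(2 * np.pi * (day_of_year - 1) / 365.25)
toy = pd.Series(100 + seasonal, index=dates, name='pollutant')

# Average by calendar month across all years to expose the seasonal profile.
monthly_profile = toy.groupby(toy.index.month).mean()

# A 30-day rolling mean smooths daily noise before plotting a trend.
smoothed = toy.rolling(window=30, min_periods=1).mean()

print(monthly_profile.round(1))
```

Applied to a real column, the same `groupby` on `index.month` would show which months carry the highest average levels.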

> The tables below show the five cities with the worst average level of each pollutant and of AQI. The darker the colour, the higher the level in that column.

In [None]:
def max_polluted_cities(pollutant):
    table = city_day[[pollutant,'City']].groupby(["City"]).mean().sort_values(by=pollutant,ascending=False).reset_index()
    return table[:5].style.background_gradient(cmap='Reds')

print("Cities having worst levels of each pollutant-")
for pollutant in pollutants[2:]:
    df = max_polluted_cities(pollutant)
    display(df)

For further analysis I will consider only AQI. In India, the proposed AQI considers eight pollutants (PM10, PM2.5, NO2, SO2, CO, O3, NH3, and Pb) for which short-term (up to 24-hourly averaging period) National Ambient Air Quality Standards are prescribed, so these pollutants are already accounted for in the AQI.

In [None]:
city_ahmedabad = city_day[city_day['City']=='Ahmedabad'].copy()  # copy so new columns don't trigger SettingWithCopyWarning
city_ahmedabad['month']=city_ahmedabad.index.month
city_ahmedabad['year']=city_ahmedabad.index.year
print("AQI distribution in Ahmedabad")
fig,axes=plt.subplots(1,2,figsize=(10,5))
sns.pointplot(x='month',y='AQI',data=city_ahmedabad,ax=axes[0])
sns.pointplot(x='year',y='AQI',data=city_ahmedabad,ax=axes[1]);

In [None]:
#extracting hour of day from df_city_hour
df_city_hour['Datetime'] = pd.to_datetime(df_city_hour['Datetime'])
df_city_hour['Hour'] = df_city_hour['Datetime'].dt.hour

city_ahmedabad_hour = df_city_hour[df_city_hour['City']=='Ahmedabad']
sns.pointplot(x='Hour',y='AQI',data=city_ahmedabad_hour,color='Orange')
plt.title('AQI level throughout the day in Ahmedabad');

In Ahmedabad, AQI rose steeply in 2017-2018, and its air is in the worst condition of all the cities. Winter is especially dangerous for residents, as air quality was beyond the 'Very Poor' category. Surprisingly, AQI is worst at night.

In [None]:
city_Delhi = city_day[city_day['City']=='Delhi'].copy()  # copy so new columns don't trigger SettingWithCopyWarning
city_Delhi['month']=city_Delhi.index.month
city_Delhi['year']=city_Delhi.index.year
print("AQI distribution in Delhi")
fig,axes=plt.subplots(1,2,figsize=(10,5))
sns.pointplot(x='month',y='AQI',data=city_Delhi,ax=axes[0],color='Green')
sns.pointplot(x='year',y='AQI',data=city_Delhi,ax=axes[1],color='Green');

In [None]:
city_delhi_hour = df_city_hour[df_city_hour['City']=='Delhi']
sns.pointplot(x='Hour',y='AQI',data=city_delhi_hour,color='Orange')
plt.title('AQI level throughout the day in Delhi');

Delhi saw its worst air pollution in 2015-2016 and took many steps to combat it. As a result, its AQI is decreasing in the long term. Due to firecracker burning and stubble burning in neighbouring states, Delhiites see a peak in pollution at the end of the year. AQI is similar throughout the day, with a peak in the evening.

In [None]:
city_Patna = city_day[city_day['City']=='Patna'].copy()  # copy so new columns don't trigger SettingWithCopyWarning
city_Patna['month']=city_Patna.index.month
city_Patna['year']=city_Patna.index.year
print("AQI distribution in Patna")
fig,axes=plt.subplots(1,2,figsize=(10,5))
sns.pointplot(x='month',y='AQI',data=city_Patna,ax=axes[0],color='Purple')
sns.pointplot(x='year',y='AQI',data=city_Patna,ax=axes[1],color='Purple');

In [None]:
city_Patna_hour = df_city_hour[df_city_hour['City']=='Patna']
sns.pointplot(x='Hour',y='AQI',data=city_Patna_hour,color='Orange')
plt.title('AQI level throughout the day in Patna');

Patna's AQI has decreased over the past 3 years, though little can be said about the present year, as the drop may be due to the lockdown.
Moderate AQI is observed in the third quarter of the year. November, December and January are the most polluted months. Within a day, pollution is high after noon.
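
The Ahmedabad, Delhi and Patna sections repeat the same filtering and column-adding steps. A small helper could factor them out; the sketch below uses a hypothetical function name, `city_with_calendar`, and a toy frame with made-up values in place of `city_day`.

```python
import pandas as pd

def city_with_calendar(df, city):
    """Return one city's rows as an independent copy with month and year columns.

    Assumes `df` has a DatetimeIndex and a 'City' column, as city_day does.
    """
    subset = df[df['City'] == city].copy()  # copy leaves the source frame untouched
    subset['month'] = subset.index.month
    subset['year'] = subset.index.year
    return subset

# Toy frame standing in for city_day (values are made up).
toy = pd.DataFrame(
    {'City': ['Delhi', 'Patna', 'Delhi'], 'AQI': [310.0, 240.0, 180.0]},
    index=pd.to_datetime(['2019-01-15', '2019-01-15', '2020-07-01']),
)
delhi = city_with_calendar(toy, 'Delhi')
print(delhi)
```

The returned frame can be passed straight to `sns.pointplot` with `x='month'` or `x='year'`, as in the cells above.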

# Letter to uncle:

Dear Uncle,
I am sorry for your loss. It is so inspiring of you to think about others in this hard time of your life.

According to my analysis, Ahmedabad is in the worst state of air pollution and needs more attention and budget to combat it. Since 2018, air pollution there has risen to severe levels. It is slowly killing the people of Ahmedabad, as cases of lung and heart disease are on the rise. While pollution caused by firecrackers and climatic conditions is a seasonal contributor, according to the Gujarat ENVIS Centre the major contributors to air pollution are industries and vehicles. Most air contaminants originate from combustion processes. Your money can be used for better wastewater treatment, solid-waste management, and hazardous-waste management. Large-scale air purifiers can also be installed at various hotspots.

You can also use your influence to spread awareness among citizens and the Government of India. The government can play a crucial role in combating air pollution by making strict rules and punishing violators.

If this project is successful, you can turn to Delhi and Patna next.