## Data Preprocessing and Exploratory Data Analysis on the Indian Air Quality Dataset

#### Data Preprocessing Steps

Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Loading the dataset

In [None]:
data = pd.read_csv('../input/india-air-quality-data/data.csv', encoding = 'ISO-8859-1')
data.head()

In [None]:
data.shape

In [None]:
data.info()

Finding the percentage of data missing from each attribute

In [None]:
print((data.isna().sum()/len(data))*100)

As seen from the missing-value percentages, the spm and pm2_5 columns need to be dropped, as they have about 54 and 97 percent of their data missing, respectively. The stn_code column is not useful for this analysis, and since the dataset has two date-related columns, the sampling_date column is dropped as well.

In [None]:
to_drop = ['stn_code','sampling_date','spm','pm2_5']
data = data.drop(to_drop, axis = 1)
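The dropping rule above can also be written as a threshold on the missing percentage. This is only a minimal sketch on a toy frame: the values are made up, and the 50 percent cut-off (`threshold`) is an assumption for illustration, not a rule taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'so2':   [4.0, 5.0, np.nan, 6.0],
    'spm':   [np.nan, np.nan, np.nan, 120.0],
    'pm2_5': [np.nan, np.nan, np.nan, np.nan],
})

# Percentage of missing values per column
missing_pct = toy.isna().mean() * 100

# Hypothetical cut-off: drop any column with more than 50% of its values missing
threshold = 50
sparse_cols = missing_pct[missing_pct > threshold].index.tolist()
toy_reduced = toy.drop(columns=sparse_cols)

print(sparse_cols)
print(toy_reduced.columns.tolist())
```

Columns such as stn_code and sampling_date are dropped for relevance rather than sparsity, so they would still need to be listed by hand.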

Getting list of Numerical and Categorical Columns

In [None]:
cat_cols = list(data.select_dtypes(include= 'object').columns)
num_cols = list(data.select_dtypes(exclude= 'object').columns)
print('Categorical Columns: ', cat_cols)
print('Numerical Columns: ', num_cols )

'date' column:

In [None]:
data['date'] = pd.to_datetime(data['date'])

## Taking only year from date format
data['year'] = data['date'].dt.year
data['year'].unique()
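If some date strings cannot be parsed, `pd.to_datetime` raises an error by default. A hedged sketch of a more forgiving variant, using made-up strings, is shown here: `errors='coerce'` turns unparseable entries into `NaT`, so the extracted year becomes missing instead of stopping the notebook.

```python
import pandas as pd

# Made-up date strings; the second one is deliberately invalid
raw_dates = pd.Series(['2015-01-01', 'not a date'])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, errors='coerce')

# The year is missing (NaN) wherever the date could not be parsed
years = parsed.dt.year
print(years)
```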

Numerical columns:

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='mean') 
imputer.fit(data[num_cols])    
data[num_cols] = imputer.transform(data[num_cols])
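For intuition, mean imputation replaces each missing value with the mean of the observed values in the same column. The sketch below reproduces that idea with plain pandas on a toy frame (made-up numbers), and also shows a median variant as one possible alternative when a column has strong outliers; neither is claimed to be what the notebook should use.

```python
import numpy as np
import pandas as pd

# Toy numerical columns with made-up values
toy = pd.DataFrame({
    'so2': [2.0, np.nan, 4.0],
    'no2': [10.0, 20.0, np.nan],
})

# Column-wise mean imputation: each NaN takes the mean of its own column
imputed_mean = toy.fillna(toy.mean())

# Alternative sketch: the median is less sensitive to extreme readings
imputed_median = toy.fillna(toy.median())

print(imputed_mean)
```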

In [None]:
data = data.drop('date', axis = 1)
data.head()

Checking each categorical column for irregularities by looking at its unique values

'state' column

In [None]:
data['state'].unique()

'location' column

In [None]:
data['location'].unique()

In [None]:
data['location'].nunique()

In [None]:
data.replace(to_replace= 'Visakhapatnam', value = 'Vishakhapatnam', inplace = True)
data.replace(to_replace= 'Silcher', value = 'Silchar', inplace = True)
data.replace(to_replace= 'Kotttayam' , value = 'Kottayam', inplace = True)
data.replace(to_replace= 'Bhubaneswar' , value = 'Bhubaneshwar', inplace = True)
data.replace(to_replace= 'Pondichery' , value = 'Pondicherry', inplace = True)
data.replace(to_replace= 'Noida, Ghaziabad' , value = 'Noida', inplace = True)
data.replace(to_replace= 'Calcutta', value = 'Kolkata', inplace = True)
data.replace(to_replace= 'Greater Mumbai', value = 'Mumbai', inplace = True)
data.replace(to_replace= 'Navi Mumbai', value = 'Mumbai', inplace = True)
data.replace(to_replace= 'Bombay', value = 'Mumbai', inplace = True)

# Convert all locations to lowercase to avoid duplicates that differ only by case
data['location'] = data['location'].str.lower()

data['location'].nunique()
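The same clean-up can be sketched with a single alias dictionary applied to one column, which keeps the spelling fixes in one place. The series below is a toy example; the alias pairs are taken from the replacements above, and the variable names are illustrative.

```python
import pandas as pd

# Toy location values showing alias and casing duplicates
locations = pd.Series(['Bombay', 'Navi Mumbai', 'MUMBAI', 'Calcutta', 'Kolkata'])

# Map alternative spellings and old names to one canonical name
aliases = {
    'Bombay': 'Mumbai',
    'Navi Mumbai': 'Mumbai',
    'Calcutta': 'Kolkata',
}

# Replace exact matches, then strip whitespace and lowercase the result
normalized = locations.replace(aliases).str.strip().str.lower()

print(normalized.tolist())
print(normalized.nunique())
```

Note that `str.lower()` returns a new Series, so the result has to be assigned back to take effect.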

The agency column has around 34 percent missing values and contains many irregularities. The agencies listed are mostly state agencies, which can be inferred from the state column itself, so it is better to drop this column.

In [None]:
data = data.drop('agency', axis = 1)

The 'location_monitoring_station' column is also not useful, as most of the information about the monitoring station is already captured by the location column. Roughly speaking, each monitoring station lies within its location.

In [None]:
data = data.drop('location_monitoring_station', axis = 1)

'type' column

In [None]:
data['type'].isna().sum()

In [None]:
data['type'] = data['type'].fillna('Others')

In [None]:
data['type'].value_counts()

In [None]:
res_str = 'Residential|RIRUO'
ind_str = 'Industrial'
sens_str = 'Sensitive'

res_mask = data['type'].str.contains(res_str, regex = True)
ind_mask = data['type'].str.contains(ind_str, regex = True)
sens_mask = data['type'].str.contains(sens_str, regex = True)

# Use .loc to assign in place and avoid chained-assignment problems
data.loc[res_mask, 'type'] = 'Residential,Rural'
data.loc[ind_mask, 'type'] = 'Industrial'
data.loc[sens_mask, 'type'] = 'Sensitive'

print(data['type'].value_counts())
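The consolidation of 'type' values can be checked on a small example. This is a sketch with a handful of sample labels, not the full set of values in the dataset; the `patterns` dictionary is an illustrative way to hold the same three regular expressions.

```python
import pandas as pd

# A few sample raw labels (not the complete list found in the dataset)
types = pd.Series([
    'Residential, Rural and other Areas',
    'Industrial Area',
    'RIRUO',
    'Sensitive Areas',
    'Others',
])

# Canonical label -> regular expression matching its raw variants
patterns = {
    'Residential,Rural': 'Residential|RIRUO',
    'Industrial': 'Industrial',
    'Sensitive': 'Sensitive',
}

cleaned = types.copy()
for label, pattern in patterns.items():
    # Masks are computed on the raw labels so the order of rules does not matter
    mask = types.str.contains(pattern, regex=True)
    cleaned.loc[mask] = label

print(cleaned.value_counts())
```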

In [None]:
data.head()

SO2 (Sulphur Dioxide): The largest source of SO2 in the atmosphere is the burning of fossil fuels by power plants and other industrial facilities. Smaller sources of SO2 emissions include industrial processes such as extracting metal from ore, natural sources such as volcanoes, and locomotives, ships and other vehicles.

NO2 (Nitrogen Dioxide): NO2 is one of a group of highly reactive gases known as oxides of nitrogen or nitrogen oxides (NOx). It can irritate the eyes, nose and throat, and when inhaled may cause lung irritation and decreased lung function. Areas with higher levels of nitrogen dioxide see a greater chance of asthma attacks.

rspm (Respirable Suspended Particulate Matter): RSPM is the fraction of TSPM (total suspended particulate matter) that is readily inhaled by humans through the respiratory system. It is generally taken to be particulate matter with an aerodynamic diameter of less than 10 micrometers (PM10); the finer PM2.5 fraction is recorded separately in the dataset's pm2_5 column. Larger particles are filtered in the nasal passage. RSPM is a causative agent of mortality and morbidity. Fine particles and other air pollutants are linked with a number of health problems such as premature death and asthma.

#### Distribution of SO2 emission with respect to area type

In [None]:
so2_type_groupby = data.groupby([data['type']])['so2'].mean().sort_values(ascending = False)
so2_type_groupby.plot.bar()
plt.xlabel('Type')
plt.ylabel('SO2 emission')
plt.title('Distribution of SO2 emission wrt type')
plt.show()

Takeaway:
SO2 emission is highest in industrial areas, compared with residential or rural ones. The 'Others' type appears to have high emission too, but that category was created from missing values in the dataset, so its records could belong to any of the three major type categories. We therefore treat Industrial areas as the highest emitters.
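Because 'Others' is a placeholder for missing types, one way to make the comparison cleaner is to leave it out before averaging. The sketch below uses a toy frame with made-up readings; it shows the filtering step only and says nothing about the real values.

```python
import pandas as pd

# Toy data with made-up SO2 readings
toy = pd.DataFrame({
    'type': ['Industrial', 'Industrial', 'Residential,Rural', 'Others', 'Others'],
    'so2':  [12.0, 14.0, 8.0, 20.0, 22.0],
})

# Keep only rows whose area type is actually known
known = toy[toy['type'] != 'Others']

# Mean SO2 per known area type, highest first
so2_by_type = known.groupby('type')['so2'].mean().sort_values(ascending=False)

print(so2_by_type)
```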

In [None]:
no2_type_groupby = data.groupby([data['type']])['no2'].mean().sort_values(ascending = False)
no2_type_groupby.plot.bar(color = 'g')
plt.xlabel('Type')
plt.ylabel('NO2 emission')
plt.title('Distribution of NO2 emission wrt type')
plt.show()

Takeaway:
As expected, NO2 emission is highest in industrial areas, since manufacturing plants emit most of the harmful gases.

#### SO2, NO2 and RSPM emission trends over the years

In [None]:
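# numeric_only=True skips the text columns, which recent pandas versions refuse to aggregate
year_groupby = data.groupby(['year']).median(numeric_only = True).reset_index().sort_values(by = 'year', ascending = False)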

# SO2
f,ax = plt.subplots(figsize = (10,7))
plt.xticks(rotation = 90)
sns.pointplot(x = 'year', y = 'so2', data = year_groupby, color= 'r')
plt.xlabel('Year')
plt.ylabel('So2')
plt.title('So2 emission trends over the years')
plt.show()

In [None]:
# No2
f,ax = plt.subplots(figsize = (10,7))
plt.xticks(rotation = 90)
sns.pointplot(x = 'year', y = 'no2', data = year_groupby)
plt.xlabel('Year')
plt.ylabel('No2')
plt.title('No2 emission trends over the years')
plt.show()

In [None]:
#rspm 
f,ax = plt.subplots(figsize = (10,7))
plt.xticks(rotation = 90)
sns.pointplot(x = 'year', y = 'rspm', data = year_groupby)
plt.xlabel('Year')
plt.ylabel('rspm')
plt.title('rspm emission trends over the years')
plt.show()

Takeaway:
- **SO2** - SO2 emission peaks in the decade 1990-2000. After 2000 it declines over the years, likely due to successive government acts aimed at reducing SO2 emissions.
- **rspm** - RSPM emission is also very high in the early years, but shows a decreasing slope after 2005.
- **NO2** - NO2 emission only starts decreasing after 2010; before that it stays at a similar level over the years (taking the mean over 1987-2010).
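Trend statements like these can be cross-checked by aggregating per decade instead of reading them off the plot. The following is a minimal sketch on made-up yearly values; the `decade` helper series is introduced here for illustration only.

```python
import pandas as pd

# Made-up yearly SO2 medians, for illustration only
yearly = pd.DataFrame({
    'year': [1995, 1998, 2003, 2008, 2012],
    'so2':  [20.0, 18.0, 12.0, 10.0, 8.0],
})

# Map each year to the start of its decade, e.g. 1998 -> 1990
decade = (yearly['year'] // 10) * 10

# Median SO2 per decade
so2_by_decade = yearly.groupby(decade)['so2'].median()

print(so2_by_decade)
```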

#### Statewise Emission trends

In [None]:
# numeric_only=True restricts the mean to numerical columns
state_groupby = data.groupby(['state']).mean(numeric_only = True).reset_index()

In [None]:
fig, axes = plt.subplots(3, 1, figsize = (17, 35))
fig.suptitle('Statewise Emission distributions')
# One bar chart per pollutant, with state names rotated for readability
for ax, col, label in zip(axes, ['so2', 'no2', 'rspm'], ['SO2', 'NO2', 'RSPM']):
    sns.barplot(ax = ax, x = 'state', y = col, data = state_groupby)
    ax.set_title(f'Statewise {label} emission')
    ax.tick_params(axis = 'x', rotation = 90)
plt.show()

#### Top 10 States in SO2 emission

In [None]:
so2_emission_top_states = state_groupby.sort_values(by= 'so2', ascending= False)
so2_emission_top_states = so2_emission_top_states[['state', 'so2']].head(10)
sns.barplot(x= 'state', y = 'so2', data = so2_emission_top_states)
plt.xticks(rotation = 90)
plt.show()

#### Top 10 States in NO2 emission

In [None]:
no2_emission_top_states = state_groupby.sort_values(by= 'no2', ascending= False)
no2_emission_top_states = no2_emission_top_states[['state', 'no2']].head(10)
sns.barplot(x= 'state', y = 'no2', data = no2_emission_top_states)
plt.xticks(rotation = 90)
plt.show()

#### Top 10 States in RSPM emission

In [None]:
rspm_emission_top_states = state_groupby.sort_values(by= 'rspm', ascending= False)
rspm_emission_top_states = rspm_emission_top_states[['state', 'rspm']].head(10)
sns.barplot(x= 'state', y = 'rspm', data = rspm_emission_top_states)
plt.xticks(rotation = 90)
plt.show()

Takeaway:
- **SO2 emission** - Top 3 states are **Uttaranchal, Jharkhand and Sikkim.**
- **NO2 emission** - Top 3 states are **West Bengal, Delhi and Jharkhand.**
- **RSPM emission** - Top 3 states are **Delhi, Uttar Pradesh and Jharkhand.**

Considering all three pollutants together, **Delhi and Jharkhand rank highest overall**.
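The top-N lists and the overlap between them can also be computed directly rather than read from the bar charts. This sketch uses placeholder state names and made-up means; `nlargest` is used as a compact equivalent of sorting and taking the head.

```python
import pandas as pd

# Placeholder states with made-up mean values
state_means = pd.DataFrame({
    'state': ['A', 'B', 'C', 'D'],
    'so2':   [5.0, 9.0, 7.0, 3.0],
    'no2':   [30.0, 10.0, 20.0, 40.0],
})

# Top 3 states per pollutant
top_so2 = state_means.nlargest(3, 'so2')['state'].tolist()
top_no2 = state_means.nlargest(3, 'no2')['state'].tolist()

# States that appear in both top-3 lists
overlap = sorted(set(top_so2) & set(top_no2))

print(top_so2, top_no2, overlap)
```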

#### Top cities by combined SO2 + NO2 emission

In [None]:
fig, axes = plt.subplots(1,1, figsize =(12, 10))
data['so2+no2'] = data['so2'].values + data['no2'].values
location_groupby = data.groupby(['location']).mean(numeric_only = True).reset_index()
top_cities = location_groupby.sort_values(ascending = False, by = 'so2+no2')
top_cities = top_cities[['location', 'so2+no2']].head(15)
sns.barplot(ax = axes, x= 'so2+no2', y = 'location', data = top_cities, orient= 'h')
plt.xticks(rotation = 90)
plt.show()

The graph above shows the top 15 locations/cities in India by combined SO2 and NO2 emission. Most of these emissions are by-products of industries, factories and manufacturing plants, i.e. they come mainly from the industrial sector.
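As a small worked check of the combined metric: when no values are missing, the mean of the per-row sum for a location equals the sum of that location's SO2 and NO2 means. The sketch below shows this on toy data with made-up readings and placeholder location names.

```python
import pandas as pd

# Toy readings for two placeholder locations
toy = pd.DataFrame({
    'location': ['a', 'a', 'b'],
    'so2':      [1.0, 3.0, 5.0],
    'no2':      [10.0, 20.0, 5.0],
})

# Approach used above: add the two pollutants per row, then average per location
toy['so2+no2'] = toy['so2'] + toy['no2']
mean_of_sum = toy.groupby('location')['so2+no2'].mean()

# Equivalent when there are no missing values: average each pollutant, then add
sum_of_means = toy.groupby('location')[['so2', 'no2']].mean().sum(axis=1)

print(mean_of_sum.sort_values(ascending=False))
```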

In [None]:
# Save the cleaned DataFrame as a CSV file

data.to_csv('Indian_airquality_cleaned.csv', index= False)