**AIR QUALITY INDEX**


The air quality index (AQI) is an index for reporting air quality on a daily basis. It indicates how air pollution affects health over a short period, and its purpose is to help people understand how local air quality impacts their health. The Environmental Protection Agency (EPA) calculates the AQI for five major air pollutants for which national air quality standards have been established to safeguard public health:
 
1. Ground-level ozone
2. Particle pollution/particulate matter (PM2.5/PM10)
3. Carbon monoxide
4. Sulfur dioxide
5. Nitrogen dioxide
 
The higher the AQI value, the greater the level of air pollution and the greater the health concern. The AQI has been widely used in many developed countries for more than three decades, and it allows air quality information to be disseminated quickly, in real time.
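
How a pollutant concentration is turned into an index value can be sketched as a piecewise-linear interpolation between breakpoints, with the overall AQI reported as the highest of the individual pollutant indices. The breakpoint table below is illustrative only (made-up values for this example, not an official table), and `sub_index` and `overall_index` are names introduced here.

```python
# Illustrative breakpoints only: (conc_low, conc_high, index_low, index_high).
# These numbers are assumptions for the example, not an official table.
EXAMPLE_BREAKPOINTS = [
    (0.0, 30.0, 0, 50),
    (30.0, 60.0, 50, 100),
    (60.0, 90.0, 100, 200),
    (90.0, 120.0, 200, 300),
]


def sub_index(conc, breakpoints=EXAMPLE_BREAKPOINTS):
    """Linearly interpolate a concentration onto the index scale."""
    if conc < breakpoints[0][0]:
        raise ValueError("concentration is below the first breakpoint")
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= conc <= c_hi:
            return i_lo + (i_hi - i_lo) * (conc - c_lo) / (c_hi - c_lo)
    # Concentrations above the last breakpoint are capped at the top index.
    return float(breakpoints[-1][3])


def overall_index(sub_indices):
    """The reported index is the worst (highest) pollutant sub-index."""
    return max(sub_indices)


example = overall_index([sub_index(15.0), sub_index(75.0)])
```

Because the overall value is a maximum, a single badly elevated pollutant is enough to push the index into a worse category.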

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
df=pd.read_csv('../input/air-quality-data-in-india/city_day.csv')
df.head()

Only the 2020 data is used for predicting AQI, so the earlier rows are filtered out.

In [None]:
# data for 2020
df=df[df['Date'] >= ('2020-01-01')]
print(df.shape)
df.head()
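
The string comparison above works because zero-padded `YYYY-MM-DD` dates sort in the same order as text and as dates. A minimal sketch on a toy frame (the values are made up and only stand in for `city_day.csv`) compares it with a datetime-based filter, which does not depend on the text layout:

```python
import pandas as pd

# Toy frame standing in for city_day.csv; the values are made up.
toy = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "Date": ["2019-12-31", "2020-01-01", "2020-03-15", "2018-06-30"],
    "AQI": [110.0, 95.0, 60.0, 200.0],
})

# String comparison: valid only because every date is zero-padded YYYY-MM-DD.
by_string = toy[toy["Date"] >= "2020-01-01"]

# Datetime comparison: independent of how the dates are written as text.
dates = pd.to_datetime(toy["Date"], format="%Y-%m-%d")
by_datetime = toy[dates >= pd.Timestamp("2020-01-01")]

# Filtering keeps the original row labels; resetting gives a clean 0..n-1 index.
by_datetime = by_datetime.reset_index(drop=True)
```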

AS THE INDEX IS NO LONGER CONTIGUOUS AFTER FILTERING, WE RESET IT

In [None]:
df.reset_index(drop=True,inplace=True)
df.head()

**TO CHECK THE NULL VALUES IN OUR 2020 DATASET**

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
# number of distinct values per column (NaN counted as a value, as with unique())
for col in df.columns:
    print(f'column name:{col}    unique values:{df[col].nunique(dropna=False)}')

**DEALING WITH MISSING VALUES**

In [None]:
# pollutant columns: fill missing values with the column mean
pollutant_cols=['PM2.5','PM10','NO','NO2','NOx','NH3','CO','SO2','O3','Benzene','Toluene','Xylene']
for col in pollutant_cols:
    df[col]=df[col].fillna(df[col].mean())
# AQI: fill with its most frequent value; AQI_Bucket: fill with 'Moderate'
df['AQI']=df['AQI'].fillna(df['AQI'].mode()[0])
df['AQI_Bucket']=df['AQI_Bucket'].fillna('Moderate')

**WE ARE FILLING MISSING VALUES IN 2 DIFFERENT WAYS**


**FOR NUMERIC POLLUTANT COLUMNS : MEAN()**


**FOR AQI AND THE CATEGORICAL AQI_BUCKET : MODE() (AQI_BUCKET IS HARD-CODED TO 'Moderate')**
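
The mean/mode rule can also be applied by column type instead of column by column. This is a minimal sketch on a toy frame with made-up values. Unlike the cell above, it fills every numeric column with its mean, and `filled` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and a few gaps.
toy = pd.DataFrame({
    "PM10": [100.0, np.nan, 140.0],
    "CO": [1.0, 3.0, np.nan],
    "AQI_Bucket": ["Moderate", None, "Moderate"],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

filled = toy.copy()
# Numeric columns: each gap receives the mean of its own column.
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
# Categorical columns: each gap receives the most frequent category.
for col in categorical_cols:
    filled[col] = filled[col].fillna(filled[col].mode()[0])
```

Mean imputation keeps the column average unchanged but shrinks its variance, which is worth keeping in mind when many values are missing.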

In [None]:
df.head()

In [None]:
# average AQI, PM10 and CO per city, sorted from the highest to the lowest AQI
most_polluted = df[['City', 'AQI', 'PM10', 'CO']].groupby(['City']).mean().sort_values(by = 'AQI', ascending = False)
most_polluted

In [None]:
import matplotlib.pyplot as plt

# the 'seaborn-whitegrid' matplotlib style name is deprecated; set the style through seaborn instead
sns.set_style('whitegrid')
f, ax_ = plt.subplots(1, 3, figsize = (15,15))

# one horizontal bar chart per metric, with the cities on the y-axis
columns = ['AQI', 'PM10', 'CO']
palettes = ['Reds_r', 'RdBu', 'RdBu']
titles = ['AirQualityIndex', 'ParticulateMatter10', 'CO']
for i in range(3):
    sns.barplot(x = most_polluted[columns[i]],
                y = most_polluted.index,
                palette = palettes[i],
                ax = ax_[i])
    ax_[i].set_ylabel('')
    ax_[i].tick_params(axis = 'y', labelsize = 14)
    ax_[i].set_title(titles[i])
f.tight_layout()

**AS THERE ARE MANY POLLUTANTS IN THE AIR WE ARE CLASSIFYING THEM INTO 2 CATEGORIES**

**VEHICLE_POLLUTANTS : PM2.5 , PM10 , NO , NOx , NH3 , CO**

**INDUSTRY_POLLUTANTS : SO2 , O3 , BENZENE , TOLUENE , XYLENE**

In [None]:
df1=df.copy()
# sum the concentrations of each group into a single feature
df1['Vehicle_Pollution_content']=df1['PM2.5']+df1['PM10']+df1['NO']+df1['NOx']+df1['NH3']+df1['CO']
df1['Industry_pollutants']=df1['SO2']+df1['O3']+df1['Benzene']+df1['Toluene']+df1['Xylene']
# drop the individual pollutant columns (note: NO2 is dropped without being part of either sum)
df1.drop(['PM2.5','PM10','NO','NO2','NOx','NH3','CO','SO2','O3','Benzene','Toluene','Xylene'],axis=1,inplace=True)
df1.head()

**DISTRIBUTING DATE INTO DAY AND MONTH COLUMNS**

In [None]:
# parse once; the dates are written as YYYY-MM-DD, so the format uses dashes
dates=pd.to_datetime(df1['Date'],format='%Y-%m-%d')
df1['Day_date']=dates.dt.day
df1['month_date']=dates.dt.month
df1.drop(['Date'],axis=1,inplace=True)
df1.head()

In [None]:
df1.describe()

**IDENTIFYING OUTLIERS**

In [None]:
outliers=df1.loc[df1['Vehicle_Pollution_content'] > (1000)]
outliers

In [None]:
outliers=df1.loc[df1['Industry_pollutants']>(800)]
outliers
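
The fixed cut-offs above (1000 and 800) are chosen by eye. A data-driven alternative, not used in this notebook, is the interquartile-range rule. The sketch below applies it to a toy series with made-up values, and `iqr_outliers` is a hypothetical helper introduced here.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy values: five ordinary readings and one extreme reading.
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
mask = iqr_outliers(toy)
flagged = toy[mask]
```

On the real data this could be called as `iqr_outliers(df1['Vehicle_Pollution_content'])`, with `k` controlling how strict the rule is.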

In [None]:
sns.pairplot(data=df1)

In [None]:
df1.drop(['AQI_Bucket'],axis=1,inplace=True)
df1.head()

**IN ORDER TO SEE EVERY COLUMN IN THE DATASET**

In [None]:
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

**CREATING DUMMIES FOR CITY COLUMN**

In [None]:
df1=pd.get_dummies(df1,drop_first=True)
print(df1.shape)
df1.head()
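
What `drop_first=True` does can be seen on a toy frame (made-up values). The first category in sorted order gets no column of its own and is represented by zeros in all remaining dummy columns.

```python
import pandas as pd

# Toy frame with three cities; the AQI values are made up.
toy = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Chennai", "Delhi"],
    "AQI": [300.0, 120.0, 90.0, 280.0],
})

# Only the text column is encoded; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, drop_first=True)

# "Chennai" sorts first, so it has no column and is encoded as all zeros.
chennai_row = encoded.iloc[2]
```

Dropping one level avoids perfectly collinear dummy columns. This matters mainly for linear models and is harmless for the tree ensembles used below.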

**AS AQI IS THE DEPENDENT VARIABLE WE ARE DROPPING IT FROM THE DATASET AND PUTTING IT IN Y**

In [None]:
X=df1.drop(['AQI'],axis=1)
y=df1['AQI']
print(X.shape)
print(y.shape)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

**RANDOM FOREST REGRESSOR**

In [None]:
from sklearn.ensemble import RandomForestRegressor
model3=RandomForestRegressor()
model3.fit(X_train,y_train)
# for a regressor, score() returns the R^2 on the given data
model3.score(X_test,y_test)

**GRADIENT BOOSTING REGRESSOR**

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
model3=GradientBoostingRegressor()
model3.fit(X_train,y_train)
model3.score(X_test,y_test)

**EXTRA TREES REGRESSOR**

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
model3=ExtraTreesRegressor()
model3.fit(X_train,y_train)
model3.score(X_test,y_test)
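
`score()` reports only R², which says nothing about the size of the errors in AQI units. A minimal sketch of comparing the three regressors on R², MAE and RMSE follows. It uses synthetic data from `make_regression` as a stand-in for the notebook's `X` and `y`, and the names `models` and `results` are introduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0,
                                 random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = {
        "r2": r2_score(y_te, pred),  # same value as model.score(X_te, y_te)
        "mae": mean_absolute_error(y_te, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
    }
```

Keeping each fitted model under its own key also avoids overwriting one variable, so the models stay available for later comparison.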

**HYPERPARAMETER TUNING OF EXTRA TREES REGRESSOR**

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

clf = ExtraTreesRegressor(random_state=12)
# sp_randint(low, high) samples integers from low to high-1
param_dist = {"n_estimators": [5, 10],
              "max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),  # an integer value must be at least 2
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False]}
r = RandomizedSearchCV(clf, param_distributions=param_dist,
                       cv=10,
                       scoring='r2',
                       n_iter=3,
                       n_jobs=2)
# fit on the training split only, so that X_test stays unseen for the score below
r.fit(X_train, y_train)

In [None]:
print(r.best_params_)
r.score(X_test,y_test)
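
For the test score to be a genuine hold-out estimate, the search should only ever see the training split. The sketch below shows that pattern on synthetic data (a stand-in for the notebook's data; the small search space is chosen only to keep the example fast) and reads both the cross-validated score and the test score.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the notebook's X and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=5.0,
                                 random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=1)

search = RandomizedSearchCV(
    ExtraTreesRegressor(n_estimators=20, random_state=1),
    param_distributions={
        "max_depth": [3, None],
        "max_features": randint(1, 9),        # integers 1..8 (8 features here)
        "min_samples_split": randint(2, 11),  # integers 2..10
        "min_samples_leaf": randint(1, 11),   # integers 1..10
    },
    n_iter=5,
    cv=3,
    scoring="r2",
    random_state=1,
)
search.fit(X_tr, y_tr)  # the test split is never shown to the search

cv_r2 = search.best_score_            # mean R^2 over the CV folds of the training split
test_r2 = search.score(X_te, y_te)    # R^2 of the refitted best model on held-out data
```

A large gap between `cv_r2` and `test_r2` would suggest that the chosen parameters do not generalise well beyond the training folds.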

**HYPERPARAMETER TUNING OF RANDOM FOREST REGRESSOR**

In [None]:
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 1000, num = 5)]
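# 1.0 uses all features, which is what 'auto' meant for regressors before it was removed from scikit-learn
max_features = [1.0, 'sqrt']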
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [None]:
rf = RandomForestRegressor()
# 100 sampled parameter combinations, each evaluated with 3-fold cross-validation
rf_random = RandomizedSearchCV(estimator = rf,
                               param_distributions = random_grid,
                               n_iter = 100,
                               cv = 3,
                               verbose = 2,
                               random_state = 42,
                               n_jobs = -1)
rf_random.fit(X_train,y_train)

In [None]:
rf_random.best_params_

In [None]:
rf_random.score(X_test,y_test)

In [None]:
prediction=rf_random.predict(X_test)

In [None]:
# distribution of the residuals; histplot replaces the deprecated distplot
sns.histplot(y_test-prediction, kde=True)
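
The residual plot can be complemented with a few summary numbers. This is a minimal sketch with made-up arrays standing in for `y_test` and `prediction`. A mean residual close to zero indicates no systematic over- or under-prediction, while MAE and RMSE give the typical error size.

```python
import numpy as np

# Made-up values standing in for y_test and prediction.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 205.0, 245.0])

residuals = y_true - y_pred              # positive means the model under-predicted
bias = residuals.mean()                  # average signed error
mae = np.abs(residuals).mean()           # average absolute error
rmse = np.sqrt((residuals ** 2).mean())  # penalises large errors more strongly
```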