In [None]:
%%HTML
<style type="text/css">

div.h2 {
    background-color: #00b050; 
    color: white; 
    padding: 5px; 
    padding-right: 300px; 
    font-size: 25px;  
    margin-top: 2px;
    margin-bottom: 10px;
}

div.h3 {
    background-color: white; 
    color: #fe0000; 
    padding: 5px; 
    padding-right: 300px; 
    font-size: 20px; 
    margin-top: 2px;
    margin-bottom: 10px;
}
</style>

<center><h1>Making every bit count - Where to invest to combat air pollution in India?</h1></center>
<center><i>Identifying the cities that require immediate attention to their increasing air pollution</i></center>
<br>
<br>
<p><i>Snap...pop...crackle...snap...flare...</i> - The fire kept changing its tone as I fed more twigs and leaves into its mouth with the long-handled outdoor broom I was holding. I was 10 years old. Along with my grandfather, I was burning all the dry leaves and twigs that were lying in our backyard. It was fun to do! But looking back, I realise that I was unknowingly contributing to the air pollution problem in India.<br>
    <strong>Did my grandfather and I want to pollute the environment? Nope.</strong><br>
    <strong>But were we? Yes.</strong><br>
    Air pollution is an evil, but not necessarily a ploy of bad people. When countries and local industries are fighting for survival, they somehow tend to forget their obligations to the environment. It's not entirely their fault. Hence, I believe the strongest focus to deal with air pollution or any kind of pollution is to not place a ton of sanctions but to provide an alternative way for people to live the lives they dream of.
</p>
In this notebook, I make an attempt to understand how investments can be made in order to minimize the effect of air pollution in a city. 
<br>
<br>
<strong>NOTE:</strong>
The work here is my interpretation of the data at hand. The recommendations I will suggest are based on what I find is true through my analysis. They need not necessarily resonate in the same tone with every reader.
    
    

![](https://www.halifax.ca/sites/default/files/pages/in-content/2018-06/Brush-Burning-HRM.jpg)
<center><i>But, what if those twigs were being burnt because we needed warmth? Are we still wrong?</i></center>

In [None]:
# install awesomeness
!pip install pywaffle
!pip install bubbly

In [None]:
import os
import pandas as pd
import numpy as np
from itertools import islice

# import viz libraries here
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from pywaffle import Waffle
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
from bubbly.bubbly import bubbleplot

# Disable warnings 
import warnings
warnings.filterwarnings('ignore')

# for slide deck embed
from IPython.display import HTML

In [None]:
# Global Functions and Utility functions

COLOR_ASSOCIATION = {
    '#00b050': 'Good',
    '#91cf4f': 'Satisfactory',
    '#fefe00': 'Moderate',
    '#ffbf00': 'Poor',
    '#fe0000': 'Very Poor',
    '#bf0000': 'Severe',
    '#737373': 'Not Available'
}

def trim(x):
    """Strip leading and trailing whitespace"""
    
    return x.strip()

def order_bucket(old, order=('Good', 'Satisfactory', 'Moderate', 'Poor', 'Very Poor', 'Severe', 'Not Available')):
    """Order by bucket
    `old` => Old dictionary
    `order` => Order of AQI buckets (optional)
    """
    
    # keep only the buckets present in `old`, in the requested order
    return {cat: old[cat] for cat in order if cat in old}

<div class="h2">Introduction</div>

This notebook is a submission to the [Where to deploy resources in India to combat air pollution](https://www.kaggle.com/rohanrao/air-quality-data-in-india/tasks?taskId=1877) hosted by [@romandovega](https://www.kaggle.com/romandovega) on the [Air Quality Data in India](https://www.kaggle.com/rohanrao/air-quality-data-in-india) dataset compiled by [@rohanrao](https://www.kaggle.com/rohanrao).  

In essence, the task requires a submission that would convince a rich uncle to invest money in improving the air quality of a given city. The submission needs to tie up all loose ends with data-based evidence, present a rough plan of how things should be done and explain how progress can be measured. A maximum of 3 cities can be chosen from the prospective list of 25+ cities present in the dataset at the time of this analysis.

In [None]:
# Loading the data

home = "../input/air-quality-data-in-india"
try:
    cd = pd.read_csv(os.path.join(home, "city_day.csv"))
    ch = pd.read_csv(os.path.join(home, "city_hour.csv"))
    sd = pd.read_csv(os.path.join(home, "station_day.csv"))
    sh = pd.read_csv(os.path.join(home, "station_hour.csv"))
    st = pd.read_csv(os.path.join(home, "stations.csv"))
    city = pd.read_csv("/kaggle/input/top-500-indian-cities/cities_r2.csv")
    city_co = pd.read_csv("/kaggle/input/indian-cities-database/Indian Cities Database.csv")
    
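except FileNotFoundError as err:
    # report the missing path and re-raise instead of silently hiding the error
    print(f"File names have changed! Could not find: {err.filename}")
    raise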

In [None]:
# prepare data on cities

city['name_of_city'] = city['name_of_city'].apply(trim)
new_names = {
    'Ahmadabad': 'Ahmedabad',
    'Amravati': 'Amaravati',
    'Gurgaon': 'Gurugram',
    'Greater Mumbai': 'Mumbai',
    'Greater Hyderabad': 'Hyderabad',
}
city['name_of_city'] = city['name_of_city'].replace(new_names)

rel_cols = [
    'name_of_city',
    'state_name',
    'population_total',
    'population_male',
    'population_female',
    '0-6_population_total',
    'literates_total',
    'literates_male',
    'literates_female',
    'sex_ratio',
    'effective_literacy_rate_total',
    'total_graduates',
]
city = city[rel_cols]

cities = ['Ahmedabad',
 'Aizawl',
 'Amaravati',
 'Amritsar',
 'Bengaluru',
 'Bhopal',
 'Brajrajnagar',
 'Chandigarh',
 'Chennai',
 'Coimbatore',
 'Delhi',
 'Ernakulam',
 'Gurugram',
 'Guwahati',
 'Hyderabad',
 'Jaipur',
 'Jorapokhar',
 'Kochi',
 'Kolkata',
 'Lucknow',
 'Mumbai',
 'Patna',
 'Shillong',
 'Talcher',
 'Thiruvananthapuram',
 'Visakhapatnam']
city_t = city[city['name_of_city'].isin(cities)]

# combine data into `df`
df = pd.merge(
    left=cd,
    right=city_t,
    left_on='City',
    right_on='name_of_city',
    how='outer'
)

<div class="h2">Summary</div>

In summary, the three main cities that have been chosen are Ahmedabad, Lucknow and Patna. Ahmedabad is to be given the initial investment for the next 3 years. If this is successful, investment should go to Lucknow and Patna before any other city in this dataset.  

Gurugram needs to be kept under observation for its rapid decline in air quality. However, instead of directly funding air quality improvements in Gurugram, it would be better to inform the tech giants in Gurugram (almost 50% of Fortune 500 companies have offices here) about the problem. They could tackle the issue as a part of their CSR activities to improve the living standards of their employees.  

The following slide deck gives a brief view of why the 3 above cities need to be funded. The deck is typically all you need to read, *Potential Investor*.

In [None]:
HTML('<div class="canva-embed" data-design-id="DAEHX8BoyPA" data-height-ratio="0.5625" style="padding:56.2500% 5px 5px 5px;background:rgba(0,0,0,0.03);border-radius:8px;"></div><script async src="https:&#x2F;&#x2F;sdk.canva.com&#x2F;v1&#x2F;embed.js"></script><a href="https:&#x2F;&#x2F;www.canva.com&#x2F;design&#x2F;DAEHX8BoyPA&#x2F;view?utm_content=DAEHX8BoyPA&amp;utm_campaign=designshare&amp;utm_medium=embeds&amp;utm_source=link" target="_blank" rel="noopener">AQI Analysis Executive Summary</a> by Ramshankar Yadhunath')

For a nuts and bolts understanding of how the analysis was conducted, **please read on.**

<div class="h2">India's National Air Quality Index</div>

India's National Air Quality Index programme was put into effect in the year 2015 as a step to monitor the air quality in the country. It was initially started in 14 cities and later extended to 34 ([Source](http://moef.gov.in/environment/pollution/)).  

As per the AQI classification, every AQI value falls into one of six buckets. Each bucket is represented by a color, as shown below.

![](https://i.imgur.com/XmnE0rT.png)
<center><i>Source: Calculating AQI Tutorial by Rohan Rao</i></center>
<br>

Further information on how AQI is calculated is provided in detail, with code by Rohan Rao in his [notebook](https://www.kaggle.com/rohanrao/calculating-aqi-air-quality-index-tutorial).  

<div class="h2">Methodology</div>

The following methodology was used in order to tackle the problem at hand. Across these 7 steps, the pool of prospective cities that could be provided the funding was continuously shortened in order to narrow down to the 3 main cities of the 25+ ones at the start.

![](https://github.com/ry05/aqi_project/blob/master/methodology_aqi.png?raw=true)

**Step 1: Understanding the Data**  
The 5 .csv datasets (city_day.csv, city_hour.csv, station_day.csv, station_hour.csv and stations.csv) were examined, and the possible ways of combining them were identified. 

**Step 2: Formulate Questions**  
Questions other than the ones asked by the task author were formulated. An important idea that emerged from this step was to **not impute missing values**, but rather treat them as **problems with the data collection of a city**.

**Step 3: Analysis on the Basis of Number of Available Records**   
**Step 4: Modelling Level Transition between Days of a City**  
**Step 5: Year-wise AQI Change in Cities**  
The above 3 steps dealt with analysis on the AQI dataset.

**Step 6: Analysing with Socio-economic Indicators**  
Combining other kinds of data to make a more rounded judgement of which city should receive the funding.

**Step 7: Design a Rough Plan**  
Designing a rough plan suggesting ideas to make improvements in the 3 chosen cities, the use of funds and a method to track progress.  

<div class="h2">Preliminary Analysis</div>

In this section, some preliminary groundwork is performed which will become more relevant as the analysis progresses.

<div class="h3">How do AQI levels distribute for all cities over the last 5 years?</div>

In [None]:
# Day-wise AQI levels across Indian cities from 2015-2020 (Scaled Representation) 

mpl.rc_file_defaults()

# prepare data
temp = cd['AQI_Bucket'].fillna('Not Available')
temp = temp.value_counts().to_dict()  # works regardless of how pandas names the counts column
temp = order_bucket(temp) # order the dict based on AQI buckets

# plot
fig = plt.figure(
    title={
        'label': 'Day-wise AQI levels across Indian cities from 2015-2020 (Scaled Representation)\nTotal Records:29531\n',
        'loc': 'left',
        'fontdict': {
            'fontsize': 15,
        }
    },
    FigureClass=Waffle, 
    rows=10, 
    columns=20,
    values=temp, 
    colors=['#00b050', '#91cf4f', "#fefe00", "#ffbf00", "#fe0000", "#bf0000", "#737373"],
    labels=[f"{k} ({round((v/cd.shape[0]*100),2)}%)" for k, v in temp.items()],
    #legend={'loc': 'upper left', 'bbox_to_anchor': (1.1, 1)},
    legend={
        # 'labels': [f"{k} ({v}%)" for k, v in data.items()],  # labels could also be under legend instead
        'loc': 'lower left',
        'bbox_to_anchor': (0, -0.2),
        'ncol': 4,
        'framealpha': 0,
        'fontsize': 12
    },
    block_arranging_style='style',
    figsize=(10, 20),
    starting_location='NW',
    vertical=False,
)
# show plot
plt.show()

💡 **INSIGHTS**
- Only 5% of all recorded entries have an AQI that would harm no section of the population
- The Poor, Very Poor and Severe categories (i.e. the ones capable of harming healthy people) account for around one-fifth of all entries
- 16% of all entries have missing AQI levels!
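
The percentages above can be reproduced with a few lines of pandas. The following is only a minimal sketch on made-up values (the `buckets` series is a stand-in for the `AQI_Bucket` column, not the real data); it shows missing buckets being counted as their own category rather than being imputed.

```python
import pandas as pd

# Toy stand-in for the AQI_Bucket column of city_day.csv (values are made up)
buckets = pd.Series(
    ["Good", "Moderate", "Moderate", None, "Severe", "Satisfactory", None, "Poor"],
    name="AQI_Bucket",
)

# Treat missing buckets as their own category instead of imputing them,
# then express every category as a percentage of all records
shares = (
    buckets.fillna("Not Available")
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
)
print(shares)
```

On the real data, replacing `buckets` with `cd['AQI_Bucket']` should give the shares shown in the waffle chart legend.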

<div class="h3">Not all cities have the same number of days measured !</div>

If we are going to compare cities, the comparison should ideally be between cities with a comparable number of records. However, some cities have many records while others have far fewer. Therefore, it was necessary to take this into consideration while building the case.

In [None]:
# setting matplotlib parameters
plt.rcParams['figure.figsize'] = 12, 8
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = 'Ubuntu'
plt.rcParams['font.monospace'] = 'Ubuntu Mono'
plt.rcParams['axes.labelweight'] = 'bold'
plt.rcParams['axes.labelsize'] = 15
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['figure.titlesize'] = 20
plt.rcParams['figure.titleweight'] = 'bold'
plt.rcParams['axes.titlelocation'] = 'left'
plt.rcParams['axes.titleweight'] = 'bold'
plt.rcParams['legend.shadow'] = False
plt.rcParams['legend.frameon'] = True

# plotting num records per city
temp = cd['City'].value_counts().sort_values(ascending=True).to_dict()
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.axvline(x=1005, c='#737373', linestyle='--')
    ax.barh(list(temp.keys()), list(temp.values()), label='Low Number of Records', color=['#737373']*12 + ['#30a2da']*14)
    ax.legend()
    ax.yaxis.grid(False)
    ax.text(x=1050, y='Shillong', s='50% of maximum records\navailable for a city')
    ax.set_xlabel("\nNumber of Records")
    ax.set_ylabel("Name of City\n")
    fig.suptitle("Available Number of Daily AQI Records per City (2015-2020)")
    #ax.set_title("Available number of daily AQI records per city (2015-2020)")

💡 **INSIGHTS**
- The most records any city in the dataset has is 2009
- 12 of the 26 cities have fewer than half as many day-wise records for 2015-2020 as the maximum available for a city, i.e. 2009 for Mumbai, Delhi, Lucknow, Chennai, Bengaluru, Ahmedabad and Hyderabad
- I use 50% of 2009 (about 1005 records) as the threshold to divide the cities into WMR and WLR
    - WMR: With More Records (in blue)
    - WLR: With Less Records (in grey)
- Why do the WLR cities have so few records?
    - Hypothesis 1: Maybe they don't have the resources
    - Hypothesis 2: Maybe they are not prioritized by the AQI programme
- Whether a city with few records can be ruled out for investment depends on the reason
    - They have few records because they are clean and safe
    - They have few records because they are not efficiently monitored
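
The WMR/WLR split can also be derived from the record counts instead of being typed out by hand. The sketch below uses made-up counts and hypothetical city names (`records`, `cities_wmr` and `cities_wlr` are illustrative names, not part of the original notebook) and assumes the threshold is half of the largest record count.

```python
import pandas as pd

# Made-up record counts per city; in the notebook these would come from
# cd['City'].value_counts()
records = pd.Series({"CityA": 2009, "CityB": 1500, "CityC": 900, "CityD": 300})

# Threshold: half of the largest record count (1004.5 here, drawn at 1005 in the plot)
threshold = records.max() / 2

cities_wmr = sorted(records[records >= threshold].index)  # With More Records
cities_wlr = sorted(records[records < threshold].index)   # With Less Records

print(cities_wmr, cities_wlr)
```

Deriving the lists this way keeps them in sync with the data if the dataset is updated, at the cost of the lists no longer being visible at a glance.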

In [None]:
# wmr will mean they have enough records to compare
# wlr means these places need to be provided more records if the pollution levels are getting worse here
# btw, the govt. does prioritise areas when they need to make a comparison

# cities with more than 1005 records
CITIES_WMR = ['Thiruvananthapuram',
 'Jaipur',
 'Jorapokhar',
 'Amritsar',
 'Visakhapatnam',
 'Gurugram',
 'Patna',
 'Hyderabad',
 'Lucknow',
 'Bengaluru',
 'Mumbai',
 'Chennai',
 'Delhi',
 'Ahmedabad'] 

# cities with less than 1005 records
CITIES_WLR = ['Aizawl',
 'Ernakulam',
 'Kochi',
 'Bhopal',
 'Chandigarh',
 'Shillong',
 'Coimbatore',
 'Guwahati',
 'Kolkata',
 'Talcher',
 'Brajrajnagar',
 'Amaravati']

In [None]:
# custom aggregate functions

def unique_cnt(series):
    """Returns count of unique values in a series"""
    
    return len(series.unique())

def active_station_cnt(series):
    """Returns count of active stations"""
    
    return (list(series).count('Active'))

def asset_plot(df, xlabel, title):
    """ Plot a bar plot of quantity of a particular asset to a city
    > Ensure that the `df` follows a structure as follows
    | City | StationId |
    
    """
    
    # prepare data for wmr and wlr
    wmr = df[df['City'].isin(CITIES_WMR)]
    wlr = df[df['City'].isin(CITIES_WLR)]
    
    
    # set to default values
    mpl.rc_file_defaults()
    plt.rcParams['figure.figsize'] = 10, 8
    plt.rcParams['font.family'] = 'serif'
    plt.rcParams['font.serif'] = 'Ubuntu'
    plt.rcParams['font.monospace'] = 'Ubuntu Mono'
    plt.rcParams['axes.labelweight'] = 'bold'
    plt.rcParams['axes.labelsize'] = 15
    plt.rcParams['xtick.labelsize'] = 12
    plt.rcParams['ytick.labelsize'] = 12
    plt.rcParams['axes.titlesize'] = 15
    plt.rcParams['figure.titlesize'] = 20
    plt.rcParams['figure.titleweight'] = 'bold'
    plt.rcParams['axes.titlelocation'] = 'left'
    plt.rcParams['axes.titleweight'] = 'normal'
    with plt.style.context('ggplot'):
        fig, axs = plt.subplots(nrows=2, ncols=1, sharex=True)
        axs[0].barh(list(wmr['City']), list(wmr['StationId']), color='#30a2da')
        axs[1].barh(list(wlr['City']), list(wlr['StationId']), color='#30a2da')
        #ax.yaxis.grid(False)
        #ax.xaxis.grid(False)
        
        #axs[0].set_xlabel(f"\n{xlabel}")
        axs[0].set_title(f"Cities with more records")
        #axs[0].set_ylabel("Name of City\n")
        #axs[1].set_xlabel(f"\n{xlabel}")
        axs[1].set_title(f"Cities with less records")
        fig.text(0.5, 0.04, xlabel, ha='center', fontsize=12)  # use the `xlabel` argument instead of a fixed label
        fig.suptitle(f"{title} (2015-2020)")
    

In [None]:
# number of stations per city(daywise measures)
station_data = pd.merge(sd, st)

# gives a count of stations in each city
city_station_cnt = station_data.groupby(['City'], as_index=False).\
    agg({'StationId': unique_cnt}).\
    sort_values(by='StationId')

# wmr and wlr
city_station_cnt_wmr = city_station_cnt[city_station_cnt['City'].isin(CITIES_WMR)]
city_station_cnt_wlr = city_station_cnt[city_station_cnt['City'].isin(CITIES_WLR)]

# plot the num of stations per city
asset_plot(city_station_cnt, 'Count of Stations', 'Number of Stations per City')

💡 **INSIGHTS**

- Naturally, cities with more records in the data have more stations
    - Delhi has over 35 stations
    - Mumbai and Bengaluru rank second with 10 stations
    - Ahmedabad, Jorapokhar, Visakhapatnam and Amritsar have only 1 station each
- Amongst the cities in WLR, Kolkata is an outlier. It has 7 stations, while all others in this group only have 1 station
    - Does this mean Kolkata is not registering enough entries into the database? Is this a possible problem with administration?

<div class="h2">Analysis</div>

In [None]:
# replace missing AQI_Bucket values with 'Not Available'
station_data['AQI_Bucket'] = station_data['AQI_Bucket'].fillna('Not Available')
station_data.head()

# AQI levels across cities for 5 years
city_buckets_wmr = station_data[station_data['City'].isin(CITIES_WMR)].groupby(['City', 'AQI_Bucket']).agg({'Date':'count'})
temp_wmr = city_buckets_wmr.unstack(level='AQI_Bucket', fill_value=0)
temp_wmr = temp_wmr['Date'][['Good', 'Satisfactory', 'Moderate', 'Poor', 'Very Poor', 'Severe', 'Not Available']]
city_buckets_wlr = station_data[station_data['City'].isin(CITIES_WLR)].groupby(['City', 'AQI_Bucket']).agg({'Date':'count'})
temp_wlr = city_buckets_wlr.unstack(level='AQI_Bucket', fill_value=0)
temp_wlr = temp_wlr['Date'][['Good', 'Satisfactory', 'Moderate', 'Poor', 'Very Poor', 'Severe', 'Not Available']]

# making into percentages
temp_wmr['Total'] = temp_wmr.apply('sum', axis=1)
temp_wmr['Good'] = round(temp_wmr['Good'] / temp_wmr['Total'], 2) * 100
temp_wmr['Satisfactory'] = round(temp_wmr['Satisfactory'] / temp_wmr['Total'], 2) * 100
temp_wmr['Moderate'] = round(temp_wmr['Moderate'] / temp_wmr['Total'], 2) * 100
temp_wmr['Poor'] = round(temp_wmr['Poor'] / temp_wmr['Total'], 2) * 100
temp_wmr['Very Poor'] = round(temp_wmr['Very Poor'] / temp_wmr['Total'], 2) * 100
temp_wmr['Severe'] = round(temp_wmr['Severe'] / temp_wmr['Total'], 2) * 100
temp_wmr['Not Available'] = round(temp_wmr['Not Available'] / temp_wmr['Total'], 2) * 100
temp_wmr = temp_wmr.drop(['Total'], axis=1)

temp_wlr['Total'] = temp_wlr.apply('sum', axis=1)
temp_wlr['Good'] = round(temp_wlr['Good'] / temp_wlr['Total'], 2) * 100
temp_wlr['Satisfactory'] = round(temp_wlr['Satisfactory'] / temp_wlr['Total'], 2) * 100
temp_wlr['Moderate'] = round(temp_wlr['Moderate'] / temp_wlr['Total'], 2) * 100
temp_wlr['Poor'] = round(temp_wlr['Poor'] / temp_wlr['Total'], 2) * 100
temp_wlr['Very Poor'] = round(temp_wlr['Very Poor'] / temp_wlr['Total'], 2) * 100
temp_wlr['Severe'] = round(temp_wlr['Severe'] / temp_wlr['Total'], 2) * 100
temp_wlr['Not Available'] = round(temp_wlr['Not Available'] / temp_wlr['Total'], 2) * 100
temp_wlr = temp_wlr.drop(['Total'], axis=1)


# plot
mpl.rc_file_defaults()
plt.rcParams['figure.figsize'] = 20, 10
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = 'Ubuntu'
plt.rcParams['font.monospace'] = 'Ubuntu Mono'
plt.rcParams['axes.labelweight'] = 'bold'
plt.rcParams['axes.labelsize'] = 15
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['axes.titlesize'] = 15
plt.rcParams['figure.titlesize'] = 20
plt.rcParams['figure.titleweight'] = 'bold'
plt.rcParams['axes.titlelocation'] = 'left'
plt.rcParams['axes.titleweight'] = 'normal'
with plt.style.context('ggplot'):
    fig, axs = plt.subplots(nrows=2, ncols=1, sharex=True)
    temp_wmr.apply(lambda x: x*100/sum(x), axis=1).plot(kind='barh',
                                                    stacked=True,
                                                    legend=False,
                                                    color=['#00b050', '#91cf4f', "#fefe00", "#ffbf00", "#fe0000", "#bf0000", "#737373"],
                                                    ax=axs[0])
    temp_wlr.apply(lambda x: x*100/sum(x), axis=1).plot(kind='barh',
                                                    stacked=True,
                                                    legend=False,
                                                    color=['#00b050', '#91cf4f', "#fefe00", "#ffbf00", "#fe0000", "#bf0000", "#737373"],
                                                    ax=axs[1])
    axs[0].set_title(f"Cities with more records")
    axs[0].set_ylabel('')
    axs[1].set_title(f"Cities with less records")
    axs[1].set_ylabel('')
    fig.text(0.5, 0.04, 'Percentage of Measurements', ha='center', fontsize=15)
    fig.suptitle(f"Overall AQI Classifications per City (2015-2020)")

💡 **INSIGHTS**

- At a glance, it is evident that 'Cities with less records' have measured more Good or Satisfactory classifications than the 'Cities with more records'.
- In the first set, Thiruvananthapuram registered the highest percentage of its daily measurements for 2015-2020 as "Good" or "Satisfactory". Ahmedabad, on the other hand, has less than 5% of its entries from 2015-2020 under the positive category (Good or Satisfactory).
- Ahmedabad also has close to 35% of its entries under the 'Not Available' status.

In [None]:
# prep data for 3-class categorization

temp_wmr = temp_wmr.reset_index()
temp_wmr['Acceptable'] = temp_wmr['Good'] + temp_wmr['Satisfactory']
temp_wmr['Unacceptable'] = temp_wmr['Moderate'] + temp_wmr['Poor'] + temp_wmr['Very Poor'] + temp_wmr['Severe']
temp_wmr.sort_values(by='Unacceptable', ascending=False)

temp_wlr = temp_wlr.reset_index()
temp_wlr['Acceptable'] = temp_wlr['Good'] + temp_wlr['Satisfactory']
temp_wlr['Unacceptable'] = temp_wlr['Moderate'] + temp_wlr['Poor'] + temp_wlr['Very Poor'] + temp_wlr['Severe']
temp_wlr.sort_values(by='Unacceptable', ascending=False)

In [None]:
# move this func on top later
def three_cat_comp(df1, df2):
    """ Interactive comparison across 3 categories """

    fig = go.Figure()
    fig = make_subplots(rows=2, cols=1, 
                       subplot_titles=('Cities with more records', 'Cities with less records'))
    fig.add_trace(go.Bar(
        x=df1['City'],
        y=df1['Acceptable'],
        name='Acceptable levels',
        marker_color='#00b050',
    ), row=1, col=1)

    fig.add_trace(go.Bar(
        x=df1['City'],
        y=df1['Unacceptable'],
        name='Unacceptable levels',
        marker_color='#bf0000',
    ), row=1, col=1)

    fig.add_trace(go.Bar(
        x=df1['City'],
        y=df1['Not Available'],
        name='Missing Records',
        marker_color='#737373',
    ), row=1, col=1)
    
    fig.add_trace(go.Bar(
        x=df2['City'],
        y=df2['Acceptable'],
        marker_color='#00b050',
        showlegend=False,
    ), row=2, col=1)

    fig.add_trace(go.Bar(
        x=df2['City'],
        y=df2['Unacceptable'],
        marker_color='#bf0000',
        showlegend=False,
    ), row=2, col=1)

    fig.add_trace(go.Bar(
        x=df2['City'],
        y=df2['Not Available'],
        marker_color='#737373',
        showlegend=False,
    ), row=2, col=1)

    fig.update_layout(template='ggplot2', barmode='group', xaxis_tickangle=-45, title_text='Percentages of AQI Levels per City (2015-2020)',
                     height=700, width=1000)
    
    return fig.show()

# call the function
three_cat_comp(temp_wmr, temp_wlr)

In the above visualization, I introduce a new way of classifying the AQI levels in order to simplify the plot:
- If the AQI bucket is Good or Satisfactory, it's put into the **Acceptable** label => Means it does not harm people too much
- If the AQI bucket is Moderate, Poor, Very Poor or Severe, it's put into the **Unacceptable** label => Means it can cause harm to a healthy population
- **Missing** is a new label that takes into account the missing or null values of AQI buckets => Missing data is a red flag as it indicates poor administration or faulty apparatus
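
The three-class grouping described above can be sketched as follows. This is an illustration on made-up rows, not the notebook's own code path: `to_three_class` is a hypothetical helper and `toy` is a stand-in for the merged station data.

```python
import pandas as pd

ACCEPTABLE = {"Good", "Satisfactory"}
UNACCEPTABLE = {"Moderate", "Poor", "Very Poor", "Severe"}

def to_three_class(bucket):
    """Map an AQI bucket to Acceptable / Unacceptable / Missing (hypothetical helper)."""
    if bucket in ACCEPTABLE:
        return "Acceptable"
    if bucket in UNACCEPTABLE:
        return "Unacceptable"
    return "Missing"

# Made-up rows standing in for the daily station data
toy = pd.DataFrame({
    "City": ["X", "X", "X", "X", "Y", "Y"],
    "AQI_Bucket": ["Good", "Poor", None, "Severe", "Satisfactory", "Good"],
})
toy["Class"] = toy["AQI_Bucket"].fillna("Not Available").map(to_three_class)

# Percentage of each class per city (each row sums to 100)
pct = pd.crosstab(toy["City"], toy["Class"], normalize="index") * 100
print(pct)
```

Normalizing by row keeps cities with very different record counts comparable, which is the reason percentages rather than raw counts are plotted.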


💡 **INSIGHTS**

- Over the last 5 years, Ahmedabad has had the lowest amount of Acceptable levels, 3rd highest amount of Unacceptable levels and the highest amount in terms of Missing levels in all WMR cities
    - Other cities that look troubled are Delhi, Gurugram, Patna, Lucknow, Jaipur and Jorapokhar
- Thiruvananthapuram is a happy outlier in that top bar plot
- Amongst the WLR cities, Bhopal looks the most troubled
    - Other cities in WLR that have recorded more Unacceptable days than Acceptable ones are Brajrajnagar and Talcher

<div class="h3">Which cities are under the radar as of now?</div>

With the analysis so far, a few cities have emerged as potential recipients of the monetary funding to improve their state. These are =>
- Ahmedabad: A very high percentage of the days it has registered measurements show unacceptable AQI levels
- Delhi: Same as Ahmedabad. Also, it's highly discussed in national and international media
- Kolkata: The *more stations, but fewer records* phenomenon puts Kolkata under scrutiny for poor administration
- Other cities that make it to this list are 
    - Gurugram
    - Patna
    - Lucknow
    - Jaipur
    - Jorapokhar
    - Bhopal
    - Brajrajnagar
    - Talcher

<div class="h3">Using a State Transition Idea to Prioritize Cities based on Air Pollution Levels</div>

From above, we have 11 cities that make it into the list of cities we need to consider for monetary investment. What we need next is a way to filter these cities into 3 main cities. In this section, I discuss one approach: Modelling the Pollution in a City as a State(or Level) Transition diagram.

In [None]:
cd['AQI_Bucket'] = cd['AQI_Bucket'].fillna('Not Available')

def to_level(x):
    """Converting to a level"""
    
    if(x in ['Good', 'Satisfactory']):
        return 'Level 1'
    elif(x in ['Moderate', 'Poor']):
        return 'Level 2'
    elif(x in ['Very Poor', 'Severe']):
        return 'Level 3'
    elif(x == 'Not Available'):
        return 'Level 4'

def window(seq, n=2):
    """Sliding window of width n from seq. From old itertools recipes.
    Source: https://stackoverflow.com/questions/47297585/building-a-transition-matrix-using-words-in-python-numpy
    """
    
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result
        
def make_trans_mat(states):
    """Make transition probability matrix"""
    
    # get the counts
    pairs = pd.DataFrame(window(states), columns=['Current', 'Next'])
    c = pairs.groupby('Current', as_index=False).agg({'Next': 'count'})
    c.columns = ['Current', 'Total']
    k = pd.DataFrame(pairs.groupby('Current')['Next'].value_counts())
    k.columns = ['Count']
    k = k.reset_index()
    
    # calculate probabilities
    t_mat = pd.merge(k, c)
    t_mat['Prob'] = t_mat['Count'] / t_mat['Total']
    t_mat = t_mat.pivot(index='Current', columns='Next', values='Prob').fillna(0)
    
    return t_mat

def model_city(city):
    """Build the model for a city"""
    
    # define states
    t_city = cd[cd['City']==city].copy()  # copy so the new column is not written to a slice of `cd`
    t_city['Level'] = t_city['AQI_Bucket'].apply(to_level)
    
    # make_trans_mat
    possible_states = list(t_city['Level'])
    return make_trans_mat(possible_states)

def agg_models(cities):
    """Aggregate models"""
    
    # prep data
    i = 0
    c_name = cities[i]
    temp = model_city(cities[i])
    temp['City'] = c_name
    for i in range(1, len(cities)):
        city_df = model_city(cities[i])
        city_df['City'] = cities[i]
        temp = pd.concat([temp, city_df])
        
    # rename
    temp = temp[['Level 1', 'Level 2', 'Level 3', 'City']]
    temp = temp.fillna(0)
    return temp

def plot_probs(city_list, level_no, mode='imp'):
    """
    Plot probs for `cities` to transition into `level no`
    from any other level
    """
    
    nxt = f'Level {level_no}'
    
    # aggregate data
    e = agg_models(city_list).sort_values(by=nxt, ascending=False).reset_index()
    e = e[(e['Current']!=nxt) & (e['Current']!='Level 4')][['Current', nxt, 'City']]

    # overlook very small probabilities
    e[nxt] = round(e[nxt], 2)
    e = e[e[nxt]>0.0]
    
    # create the dataframes
    improvements = e[e['Current']>nxt].reset_index(drop=True)
    deteriorations = e[e['Current']<nxt].reset_index(drop=True)
    
    if(mode=='imp'):
        return improvements.style.set_caption(f'Improvements to Level {level_no}')\
            .background_gradient(cmap='inferno')
    elif(mode=='det'):
        return deteriorations.style.set_caption(f'Deteriorations to Level {level_no}')\
            .background_gradient(cmap='inferno')

### The Idea
The AQI levels for a city have not been uniform over the last 5 years. Sometimes the level has been bad for human respiration and at other times harmless. However, what if we hypothesized that the **AQI level of a city on day d depends on the AQI level of the same city on day (d-1)**? This resembles the Markov property (though the analysis is not completely based on it).  

AQI of a city on any given day is a result of several factors like:
- The population
- The vehicles on the road
- The industries
- and others...

Most of these factors will be common between consecutive days. Hence, if a city is at a given AQI level on a given day, there is a high chance it will remain at the same level the next day. This tells us a city is **bad** if that level is bad. But it tells us nothing about the **volatility of a city's AQI level**. Simply put, the ideal city is one that can move to a better AQI level the next day if it has a bad one today. 
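
The claim that a city tends to stay at the same level from one day to the next can be checked directly. Here is a small sketch on a made-up sequence of levels (the `levels` series and the name `stay_rate` are illustrative assumptions): it measures the share of consecutive-day pairs in which the level does not change.

```python
import pandas as pd

# Made-up daily levels for one city, in date order
levels = pd.Series(
    ["Level 2", "Level 2", "Level 3", "Level 3", "Level 3", "Level 2", "Level 1"]
)

# Pair each day with the next one; the last day has no successor
today = levels.iloc[:-1].reset_index(drop=True)
tomorrow = levels.iloc[1:].reset_index(drop=True)

# Share of day pairs where the level stays the same
stay_rate = (today == tomorrow).mean()
print(stay_rate)
```

A high `stay_rate` at a bad level means the city is stuck there, while a low one indicates volatility. This sketch assumes the rows are consecutive days with no gaps in the dates.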

### The Method
1. Convert the AQI levels in the dataset into 4 levels
    - Good and Satisfactory => Level 1
    - Moderate and Poor => Level 2
    - Very Poor and Severe => Level 3
    - Not Available => Level 4
2. Create a transition matrix showing the probability of a day in a city changing to Level j if it is currently in Level i
3. Order the cities based on the probability for each city to transition from one level i to the next level j
    - If i > j, it is an improvement
    - If i < j, it is a deterioration
4. Cities that have high probabilities of deterioration and low probabilities of improvement are the ones that need most focus
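
The steps above can be sketched compactly with `pd.crosstab`. This is an alternative formulation on a made-up sequence, not the `make_trans_mat` function defined earlier in the notebook; the `levels` list is an assumption for illustration only.

```python
import pandas as pd

# Made-up sequence of daily levels for one city, in date order
levels = ["Level 1", "Level 2", "Level 2", "Level 3", "Level 2", "Level 1", "Level 2"]

# Pair every day (Current) with the following day (Next)
current = pd.Series(levels[:-1], name="Current")
nxt = pd.Series(levels[1:], name="Next")

# Row-normalized counts: entry (i, j) estimates P(Next = j | Current = i)
t_mat = pd.crosstab(current, nxt, normalize="index")
print(t_mat)

# Deterioration and improvement probabilities starting from Level 2
p_det = t_mat.loc["Level 2", "Level 3"]
p_imp = t_mat.loc["Level 2", "Level 1"]
```

Each row of `t_mat` sums to 1, so the entries to the right of the diagonal are deterioration probabilities and those to the left are improvement probabilities.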

### An Example
Here is an example to make the above idea clearer in practice.

In [None]:
# plotting transition probability matrix for Ahmedabad
model_city('Ahmedabad')

The above transition probability matrix can be visualized as follows:

<center><img src="https://github.com/ry05/aqi_project/blob/master/ahmedabad_transition.png?raw=true" width="900" height="700"></center>
<center><i>Credits: Author</i></center>
<br>

It shows that if a given day in Ahmedabad has an AQI classification lying in Level 2, there is a 21% chance for the next day to have a classification in Level 3(deterioration probability). And the probability for the next day to be Level 1(improvement probability) is as low as 2%!

In [None]:
plot_probs(CITIES_WMR, 2, 'det')

💡 **INSIGHTS**
- Jorapokhar has the highest probability of all cities in WMR to transition into Level 2 from Level 1 (34%)
- It is followed by Patna(32%), Delhi(28%), Gurugram(25%), Jaipur(23%) and Ahmedabad(23%) [Top 5]

In [None]:
plot_probs(CITIES_WLR, 2, mode='det')

💡 **INSIGHTS**
- Amongst WLR cities, Brajrajnagar(22%) and Talcher(20%) are the top 2 cities in terms of deteriorations to level 2

In [None]:
plot_probs(CITIES_WMR, 3, 'det')

💡 **INSIGHTS**
- Ahmedabad has the highest(20%) chance of deteriorating to Level 3 the next day if it was at Level 2 today
- It's followed by Delhi(12%), Gurugram(12%), Lucknow(9%) and Patna(8%)

In [None]:
plot_probs(CITIES_WLR, 3, mode='det')

💡 **INSIGHTS**
- Guwahati has a 12% probability of going into level 3 the next day if it's at level 2 today
- Talcher is second with a 5% probability

In [None]:
plot_probs(CITIES_WMR, 2)

💡 **INSIGHTS**
- Ahmedabad has the lowest probability(10%) of showing an improvement to level 2 the next day if it was at level 3 today
- Other cities that have low probabilities here are Patna(11%), Lucknow(15%), Delhi(17%) and Gurugram(19%)

In [None]:
plot_probs(CITIES_WMR, 1)

💡 **INSIGHTS**
- Ahmedabad has the lowest probability(2%) of showing an improvement to level 1 the next day if it was at level 2 today
- Other cities that have low probabilities here are Delhi(5%), Hyderabad(6%), Chennai(6%), Gurugram(7%) and Patna(8%)

In [None]:
plot_probs(CITIES_WLR, 2)

💡 **INSIGHTS**
- Kolkata has the lowest probability(11%) of showing an improvement to level 2 the next day if it was at level 3 today amongst the WLR cities
- Talcher follows with 17%

With the above insights provided by the **State Transition Idea**, we now can quantify the priority of each city involved more definitively.

<div class="h3">Which cities are under the radar now?</div>

As seen in all the analyses of the previous sections, **Ahmedabad** is a **high-priority city** when it comes to AQI. It has some of the highest probabilities of deteriorating from one level to the next (5th highest for Level 1 => Level 2 and highest for Level 2 => Level 3). It also has the lowest probabilities of improving from one level to another.

Therefore, Ahmedabad is most likely the city that requires the initial monetary funding to reduce its pollution.  

Other cities that are still under consideration for the second and third spots are =>
- Patna
- Delhi
- Gurugram
- Lucknow
- Kolkata
- Talcher
- Guwahati

<div class="h3">Unacceptable AQI Levels and Indeterminable AQI Levels - New metrics?</div>
<br>

From a previous classification, there were 3 categories =>
* Acceptable (Good and Satisfactory AQI levels)
* Unacceptable (Moderate, Poor, Very Poor and Severe levels)
* Not Available (Missing AQI levels)

**NOTE:** The **Not Available** classification is better called **Indeterminable**. It means that even if data was collected for individual pollutant levels, the collected data did not meet the requirements to generate a final AQI bucket for the day.


🔍 **METRICS CREATED**

For this section, two key metrics are used:
* `Unacceptable AQI Level Percentage`(UALP) = (`Number of Unacceptable AQI Levels` / `Number of Records`) * 100
* `Indeterminable AQI Level Percentage`(IALP) = (`Number of Indeterminable AQI Levels` / `Number of Records`) * 100

`Record` => A record is registered when there is an entry for a given day in the dataset. A record for a given day does not indicate that the day has a determinable AQI level.


Ideally, a city that is high priority on the AQI index will have a high UALP. Having a high IALP indicates problems with the data collection of the city.
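
As a quick worked example of the two formulas, here is a minimal sketch on ten made-up daily records. The helper name `ualp_ialp` and the toy list are assumptions for illustration only; the actual computation on the dataset follows in the next cells.

```python
UNACCEPTABLE = {'Moderate', 'Poor', 'Very Poor', 'Severe'}

def ualp_ialp(buckets):
    """Return (UALP, IALP) in percent for a list of daily AQI buckets."""
    n_records = len(buckets)
    n_unacceptable = sum(b in UNACCEPTABLE for b in buckets)
    n_indeterminable = sum(b == 'Not Available' for b in buckets)
    ualp = n_unacceptable / n_records * 100
    ialp = n_indeterminable / n_records * 100
    return round(ualp, 2), round(ialp, 2)

# 10 made-up records: 6 unacceptable, 1 indeterminable, 3 acceptable
toy_year = ['Good', 'Moderate', 'Poor', 'Severe', 'Not Available',
            'Satisfactory', 'Very Poor', 'Moderate', 'Good', 'Poor']
print(ualp_ialp(toy_year))  # (60.0, 10.0)
```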

In [None]:
# convert to datetime
cd['Date'] = pd.to_datetime(cd['Date'])

# engineer year (e.g. '2015') and full month name (e.g. 'January') as strings
cd['Year'] = cd['Date'].dt.strftime('%Y')
cd['Month'] = cd['Date'].dt.strftime('%B')

# combine Benzene, Toluene and Xylene into a single BTX column
# (the sum is NaN whenever any of the three readings is missing)
cd['BTX'] = cd['Benzene'] + cd['Toluene'] + cd['Xylene']

In [None]:
def cnt_acc(series):
    
    return (list(series).count('Good') + (list(series).count('Satisfactory')))

def cnt_unacc(series):
    
    return (list(series).count('Moderate') + (list(series).count('Poor')) 
           + list(series).count('Very Poor') + list(series).count('Severe'))

def cnt_navail(series):
    
    return (list(series).count('Not Available'))

# feature engineer
cd_tot = cd.groupby(['City', 'Year'], as_index=False).agg({'Date':'count'})
cd_tot.columns = ['City', 'Year', 'Recorded']
cd_acc = cd.groupby(['City', 'Year'], as_index=False).agg({'AQI_Bucket':cnt_acc})
cd_acc.columns = ['City', 'Year', 'Acceptable']
cd_acc = cd_acc.drop(['City', 'Year'], axis=1)
cd_unacc = cd.groupby(['City', 'Year'], as_index=False).agg({'AQI_Bucket':cnt_unacc})
cd_unacc.columns = ['City', 'Year', 'Unacceptable']
cd_unacc = cd_unacc.drop(['City', 'Year'], axis=1)
cd_navail = cd.groupby(['City', 'Year'], as_index=False).agg({'AQI_Bucket':cnt_navail})
cd_navail.columns = ['City', 'Year', 'Not Available']
cd_navail = cd_navail.drop(['City', 'Year'], axis=1)

# keep only the 8 shortlisted cities
# (.copy() so that new columns can be added without a SettingWithCopyWarning)
yr_wise = pd.concat([cd_tot, cd_acc, cd_unacc, cd_navail], axis=1)
filtered = ['Ahmedabad', 'Patna', 'Delhi', 'Gurugram', 'Lucknow', 'Kolkata', 'Talcher', 'Guwahati']
yr_wise_f = yr_wise[yr_wise['City'].isin(filtered)].copy()

# convert to percentages
yr_wise_f['Acceptable_Percent'] = round((yr_wise_f['Acceptable'] / yr_wise_f['Recorded']) * 100, 2)
yr_wise_f['Unacceptable_Percent'] = round((yr_wise_f['Unacceptable'] / yr_wise_f['Recorded']) * 100, 2)
yr_wise_f['Not_Available_Percent'] = round((yr_wise_f['Not Available'] / yr_wise_f['Recorded']) * 100, 2)

yr_wise_f['Year'] = yr_wise_f['Year'].astype('int')

In the visualization below, each subplot represents a year. The idea of the visualization is to show the transition of each of the 8 cities across the UALP and IALP measures for the last 6 years.

UALP is on the X Axis and IALP is on the Y Axis.

In [None]:
years = [2015, 2016, 2017, 2018, 2019, 2020]
fig = make_subplots(rows=3, cols=2, subplot_titles=[str(y) for y in years])

# one fixed colour per city, so that a city keeps its colour in every subplot
city_colors = {
    'Ahmedabad': 'rgb(127,60,141)',
    'Delhi': 'rgb(17,165,121)',
    'Gurugram': 'rgb(57,105,172)',
    'Guwahati': 'rgb(242,183,1)',
    'Kolkata': 'rgb(231,63,116)',
    'Lucknow': 'rgb(128,186,90)',
    'Patna': 'rgb(230,131,16)',
    'Talcher': 'rgb(249,123,114)',
}

# add one scatter trace per year, filling the 3x2 grid row by row
for idx, yr in enumerate(years):
    row, col = idx // 2 + 1, idx % 2 + 1
    year = yr_wise_f[yr_wise_f['Year'] == yr]
    fig.add_trace(
        go.Scatter(x=year['Unacceptable_Percent'],
                   y=year['Not_Available_Percent'],
                   mode='markers',
                   text=year['City'],
                   marker=dict(
                       size=15,
                       color=year['City'].map(city_colors).tolist(),
                   ),
                   opacity=0.8,
                   showlegend=False),
        row=row, col=col)

    # axis titles and ranges; only the left column carries a y-axis title
    fig.update_xaxes(title_text="Percentage of Unacceptable AQI Levels",
                     row=row, col=col, range=[-10, 110])
    fig.update_yaxes(title_text="Percentage of Indeterminable AQI Levels" if col == 1 else "",
                     row=row, col=col, range=[-20, 110])

fig.update_layout(template='ggplot2',
    title={
        "text": "Unacceptable AQI Levels vs Missing AQI Levels",
        "font": {"family": "Rockwell", "size": 25},
        "xanchor": "center",
        "yanchor": "top",
    },
    width=900,
    height=1200,
)
fig.show()

💡 **INSIGHTS**

* As the years progress, the cities generally move towards the bottom right.
    * The bottom right is a region with high UALP and low IALP
    * This indicates that data collection has in general become better for these cities over the given time period
    * The year 2017 is an outlier in the sense that both Ahmedabad and Patna show a higher IALP than they did in 2016. Patna in particular shows a rise of about 50% in IALP from 2016, which then drops back to 1% in 2018. This does cast reasonable doubt on these readings.
* The following cities are removed from this list of 8 cities for the following reasons:
    * Talcher => It has a population of only around 40,000, which is far smaller than the other cities. So, a fair comparison is not possible
    * Guwahati => Data available only for 2019 and 2020
    * Kolkata => Data available only for 2018, 2019 and 2020

**NOTE:** In an earlier statement, I had cast doubt on Kolkata's administration because of the lower number of records in spite of having more stations in place. This could be because Kolkata only began registering records in 2018 (3 years later than most other cities in this data).

The above visualization does provide insights. But a better way of seeing a pattern would be to use a bubble chart with only the top 5 cities (Ahmedabad, Delhi, Patna, Gurugram and Lucknow) in consideration.

In [None]:
# filter cities
filtered = ['Ahmedabad', 'Patna', 'Delhi', 'Gurugram', 'Lucknow', 'Kolkata', 'Talcher', 'Guwahati']
city_f = city[city['name_of_city'].isin(filtered)]
bubble = pd.merge(
    left=yr_wise_f,
    right=city_f,
    left_on='City',
    right_on='name_of_city',
)

# prep data for bubble plot
bubble = bubble[(bubble['City']!='Guwahati') & (bubble['City']!='Talcher') & (bubble['City']!='Kolkata')]
# area of each city in sq kms.
areas = pd.DataFrame({
    'City': ['Ahmedabad', 'Delhi', 'Gurugram', 'Lucknow', 'Patna'],
    'Area(km2)': [464, 1484, 732, 349, 136]
})
bubble = pd.merge(bubble, areas)
# per_capita_income
per_cap_inc = pd.DataFrame({
    'City': ['Ahmedabad', 'Delhi', 'Gurugram', 'Lucknow', 'Patna'],
    'Per_Capita_Income(INR)': [173000, 360644, 122000, 71000, 106000]
})
bubble = pd.merge(bubble, per_cap_inc)

fig = px.scatter(bubble, x="Unacceptable_Percent", y="Not_Available_Percent", animation_frame="Year", animation_group="City",
           color="City", hover_name="City", size='Area(km2)', size_max=65, color_discrete_sequence=px.colors.qualitative.G10)
fig.update_layout(template='ggplot2',
    title={
        "text": "The Five Year Transition of AQI in our Top 5 Cities",
        "font": {"family": "Rockwell", "size": 25},
        "xanchor": "center",
        "yanchor": "top",
    },
    height=500,
    xaxis_title='Recorded Days with Unacceptable AQI Levels(%)',
    yaxis_title='Recorded Days with Indeterminable AQI Levels(%)',
)
fig.show()

**NOTE:** Size of the point represents geographical area of the city.

💡 **INSIGHTS**

* Run the visualization first and keep your eyes on Gurugram
    - Notice how Gurugram travels the longest diagonal path to get from the **High IALP-Low UALP region** to the **Low IALP-High UALP region**
    - This shows how Gurugram is getting worse faster than other cities
    - This might be attributed to Gurugram's quick growth as [India's fastest growing IT Tech Region](https://ceoworld.biz/2016/12/02/indias-top-12-tech-cities-digital-indian-cities-survey-2016/)
* Ahmedabad starts 2015 with an IALP of about 28%, but in the next 2 years it records back-to-back increases in IALP to 68% and 81%. From 2018 onwards, however, it has stayed below 4%.
    - The increase of IALP in 2016 and 2017 is a point to note
    - Because of the high IALP measures, the low UALP values are not meaningful for these two years in Ahmedabad
* Delhi has had very good data collection practices in place. The highest IALP it registered was only 2.47%, in 2017
    - It started off with a UALP of 99.73% in 2015
    - After a series of fluctuations, it is at 86.34% in 2020
* Patna has recorded oscillations similar to Ahmedabad
* Lucknow has shown a sharp dive in UALP in 2020 compared to 2019. But 2020 is not over! So, any change or pattern observed from 2019 to 2020 should not be treated as conclusive

<div class="h3">Comparing our top 5 cities on the basis of socio-economic factors</div>

The analysis has so far focused on the specific AQI data available. However, to choose the most relevant cities for the monetary investment, the filtered cities need to be compared across other indicators. For this purpose, I am comparing the cities across 2 other indicators:
* Per Capita Income of the City(In INR)
* Population of Children under 6 years in the City

📚 **RELEVANCE OF THESE METRICS**

Direct consequences of air pollution are respiratory troubles and health disorders. Naturally, treating health issues costs money. These treatments can at times be an extra cost (or burden) on families that are not well off. Therefore, the **per capita income of a city** is a factor in deciding which city needs the investment to reduce air pollution and subsequently decrease medical expenditure due to respiratory problems.

Children are the most vulnerable of all age groups, and have always been among the most protected groups across human civilization. Children also spend time outdoors playing games with their friends and so tend to be in more direct contact with outside air. Therefore, cities with a larger **child population (under 6 years)** are to be given some consideration.

Finally, cities with low per capita income and high child population are to be given a very high priority, as these fall under a *region of urgency*. 
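
One possible way to make the *region of urgency* idea concrete is to rank cities on both indicators and add the ranks. The sketch below is an assumption on my part, not a method used elsewhere in this notebook: the equal weighting of the two ranks, the helper name `urgency_ranking`, the column names, and the placeholder cities and values are all hypothetical.

```python
import pandas as pd

def urgency_ranking(df):
    """Order cities so that low income and high child population come first."""
    out = df.copy()
    # rank 1 = lowest per capita income
    out['income_rank'] = out['per_capita_income'].rank(ascending=True)
    # rank 1 = largest child population
    out['child_rank'] = out['child_population'].rank(ascending=False)
    # smaller combined rank = more urgent (equal weights, an assumption)
    out['urgency_score'] = out['income_rank'] + out['child_rank']
    return out.sort_values('urgency_score').reset_index(drop=True)

# made-up placeholder cities and values, for illustration only
toy = pd.DataFrame({
    'city': ['City A', 'City B', 'City C', 'City D'],
    'per_capita_income': [70000, 100000, 120000, 350000],
    'child_population': [400000, 150000, 100000, 300000],
})
ranked = urgency_ranking(toy)
print(ranked[['city', 'urgency_score']])
```

A rank-based score is used here because the two indicators are on very different scales; any other weighting would be an equally valid choice to explore.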

💾 **SOURCE OF DATA**

The two indicators used in this section have been compiled from the following sources:

* Population of Children under 6 years in the City => [Top 500 Indian Cities - Based on 2011 Census](https://www.kaggle.com/zed9941/top-500-indian-cities)
* Per Capita Income of the City(INR) => 
    - Compiled manually from multiple sources
    - Ahmedabad: https://www.prsindia.org/sites/default/files/budget_files/State%20Budget%20Analysis%20-%20Gujarat%202020-21_Final.pdf
    - Delhi: https://www.prsindia.org/sites/default/files/budget_files/Delhi%20Budget%20Analysis%20-%202019-20.pdf
    - Gurgaon: https://economictimes.indiatimes.com/work-career/best-cities-to-move-into-if-you-are-starting-a-new-career/gurgaon/slideshow/55224194.cms
    - Patna: https://patna.nic.in/economy/#:~:text=As%20of%202015%2C%20GDP%20per,rate%20is%207.29%20per%20cent.
    - Lucknow: https://www.hindustantimes.com/lucknow/kasganj-s-per-capita-gdp-fourth-highest-in-uttar-pradesh/story-8tSlU6yrjAeF9H92835X5I.html
    
❗ **DISCLOSURE**

The data above has been collected from multiple sources, as there was no single source for it. The years in which each city's per capita income was recorded differ. However, this difference is not more than 3 years, with Patna's being the earliest at 2015 and Ahmedabad's the latest at 2017-18.

In [None]:
# subset data
data = bubble[bubble['Year']==2015]

# comparing cities
fig = px.scatter(data, x="Per_Capita_Income(INR)", y="0-6_population_total",
           color="City", hover_name="City", size='Area(km2)', size_max=50, color_discrete_sequence=px.colors.qualitative.G10)
fig.update_layout(template='ggplot2',
    title={
        "text": "Comparing Cities by Area, Income and Child Population",
        "font": {"family": "Rockwell", "size": 25},
        "xanchor": "center",
        "yanchor": "top",
    },
    height=500,
    xaxis_title='Per Capita Income(INR)',
    yaxis_title='Child Population (0-6 years)',
)
fig.show()

**NOTE:** Size of the point represents geographical area of the city.


💡 **INSIGHTS**

* Delhi has the highest child population, but also the highest per capita income. In fact, Delhi is an outlier in this 5-city group!
* Ahmedabad has already been identified as a highly polluted city by the **State/Level Transition Diagram** analysis earlier
    - In the above bubble plot, it can also be noticed that Ahmedabad has a high child population value
* The other 3 cities are of specific interest
* Lucknow is a city with a higher child population and lower per capita income than Patna and Gurugram. This makes it a *region of urgency* (relative to these 5 cities)
* Patna is between Lucknow and Gurugram in terms of the two factors considered here
* Gurugram was seen in the previous analysis to have a sharp decline in air quality relative to the other cities. However, it has a very low child population and a higher per capita income than Patna and Lucknow



Based on the analysis so far, **Ahmedabad** is the city that should receive the initial funding for 3 years. This is because the findings of the Level Transition Diagram-based approach indicate that Ahmedabad has constantly deteriorating air quality, long periods of bad air quality and very little subsequent natural improvement. 

**Lucknow and Patna** are the 2 cities that should be funded if the funding to Ahmedabad is successful. This is because of the relative importance of these regions due to low income and high child population. 

**Why not Gurugram ❓** 
Gurugram has shown a steep decline, no doubt. However, **it's a city with a larger area**. Moreover, since it is one of the fastest growing areas in the country, the amount required for any kind of improvement or initiative to curb air pollution **will most likely be significantly higher**. Given this doubt, Gurugram has been excluded from the top 3 list.

**Why not Delhi ❓**
Same reasons as for Gurugram. In addition, **Delhi's air pollution problem is internationally discussed**. Everybody is talking about it already. Therefore, it will probably receive funding from other sources. It is the other cities, which only appear in "lists of most polluted cities" and see little media visibility, that require funding to improve the lives of their citizens.

<div class="h2">Rough Plan to Use the Investment</div>

This section draws focus on some of the opportunities that the investment can be used for in order to improve the state of air quality in Ahmedabad, Lucknow and Patna. There are 3 main kinds of areas where I feel the funding can be applied:

![](https://github.com/ry05/aqi_project/blob/master/where_to_invest.png?raw=true)

## Invest in Clean Technology

**Clean technology** is any process, product or service that reduces negative environmental impacts through significant energy efficiency improvements, the sustainable use of resources, or environmental protection activities ([Source](https://en.wikipedia.org/wiki/Clean_technology)). A couple of ways the *uncle* could invest his money would be to

- Set up a Money-lending firm or bank
    - Provide low-interest loans for people to convert their vehicles into CNG-run from petrol-run or diesel-run
    - Provide incentives to those who are willing to give away their old fuel-driven vehicles and switch to electric-vehicles. The extra money could be used by the family for a purpose like long-term deposits for children
- Launch an Electric Carpool Startup
    - The traditional carpool system with only electric cars
    - Provide free rides and offers to attract people
    - This can generate livelihood for cab-drivers without causing pollution to the environment
    - Autorickshaw drivers can be targeted and brought into the revolution as [here](https://bengaluru.citizenmatters.in/e-rickshaws-air-pollution-bengaluru-policy-transport-28136#:~:text=According%20to%20it%2C%20in%20a,sector%20is%200.44%20million%20tonne.)
    - Under this startup, more solar-powered battery charge stations for any e-vehicle can be set up. This is important; without it, everything else fails

## Invest in Community-driven Change

Initiatives fail because people are not ready to accept them. People are not ready to accept them because most initiatives and ideas are brought into effect without a public study or social research. Therefore, investment has to be put into creating an organization that would **ask people questions** and **act on their responses**. Basically, bring about effective change in the existing behaviour of citizens by involving them as key stakeholders in any relevant policy decision.

- Ask the Right Questions
    - Use social surveys to collect data
    - Make data open, but maintain ethical standards
    - Understand the *why* behind behaviours
        - Why do people not change their polluting vehicles?
        - Why do people not adhere to government regulations?
        - Why do people not want to carpool?
        - Is there cultural relevance to how people's lifestyles are?
- Make People Responsible
    - Involve local communities, for example an apartment locality
    - Make people take initiative, rather than wait for govt. policy

## Invest in Making Connections

Alone, it will be difficult to make a change. So, it's important to form the right partnerships. Develop plans with corporations to help out as part of their CSR. Create accountability by keeping all transactions open and public. 

<div class="h3">How to measure progress?</div>

Progress in Ahmedabad after the first 3 years can be measured based on the following metrics:

- UALP
    - Unacceptable AQI Level Day Percentage
    - Investment has been successful if UALP has declined consistently (by at least 10%) over the last 3 years (with no increase in IALP by more than 3%)
    - As of 2019, UALP measures were as follows:
        - Ahmedabad = 96.44%
        - Lucknow = 79.45%
        - Patna = 83.56%
    - If the plan works, Ahmedabad will see a UALP of at most 66.44% by 2024. 
- IALP
    - Indeterminable AQI Level Day Percentage
    - It is also a responsibility to ensure that IALP never rises above 3% in any of the next 3 years
- Number of e-carpool cars in each city
    - Also include a count of number of cab drivers who have switched to e-carpool
    - Any increase in the proportion of e-vehicles is positive
    
There can definitely be more metrics created from the **Rough Plan** demonstrated before. In a real-world scenario, metrics need to be defined by taking way more parameters into consideration. Hence, I shall stop at these 3 simple metrics.
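
The UALP and IALP criteria above could be checked with a small helper like the one below. This is a rough sketch under stated assumptions: I read the "10%" as a drop of at least 10 percentage points per year (which matches the 96.44% to 66.44% target), the function name `investment_successful` is hypothetical, and every number other than the 2019 baseline of 96.44% is made up for illustration.

```python
def investment_successful(ualp_by_year, ialp_by_year, min_drop=10.0, max_ialp=3.0):
    """Check the UALP/IALP success criteria.

    Both arguments are lists of yearly percentages: the first entry is the
    baseline year, the remaining entries are the funded years.
    """
    # UALP must fall by at least `min_drop` percentage points every year
    drops = [prev - cur for prev, cur in zip(ualp_by_year, ualp_by_year[1:])]
    ualp_ok = all(d >= min_drop for d in drops)
    # IALP must stay at or below `max_ialp` in every funded year
    ialp_ok = all(v <= max_ialp for v in ialp_by_year[1:])
    return ualp_ok and ialp_ok

# baseline UALP is Ahmedabad's 2019 value; all later values are hypothetical
print(investment_successful([96.44, 86.0, 75.5, 65.0], [1.0, 2.0, 1.5, 2.5]))  # True
print(investment_successful([96.44, 90.0, 80.0, 70.0], [1.0, 2.0, 1.5, 2.5]))  # False
```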

<div class="h2">References</div>

### Domain Knowledge References
1. [Air Pollutants](https://en.wikipedia.org/wiki/Air_pollution#Pollutants)
2. [UNEP Monitoring Air Quality](https://www.unep.org/explore-topics/air/what-we-do/monitoring-air-quality)
3. [UNEP's Air Programme](https://www.unep.org/explore-topics/air)
4. [UNEP's 2016 report on "Actions on Air Quality"](https://www.unenvironment.org/resources/assessment/actions-air-quality?_ga=2.209423066.934997296.1598950157-760778986.1598950157)
5. [Ministry of Environment, Forestry and Climate Change - Govt. of India](http://moef.gov.in/environment/pollution/)

### Kaggle Kernels for Inspiration
1. https://www.kaggle.com/romandovega/chaieda-air-quality-in-india-eda-using-tableau 
2. https://www.kaggle.com/parulpandey/breathe-india-covid-19-effect-on-pollution
3. https://www.kaggle.com/rohanrao/calculating-aqi-air-quality-index-tutorial

Hope it was a good read! Thank you ;)