# Air Quality Analysis across The World

## Goal

- Conduct a comprehensive global data analysis to identify the regions with the cleanest and dirtiest air, focusing on key parameters such as the overall Air Quality Index (AQI), Carbon Monoxide (CO) AQI, Ozone AQI, Nitrogen Dioxide (NO2) AQI, and PM2.5 AQI.
- The objective is to discern and rank regions based on air quality, identifying both the cleanest and dirtiest areas worldwide.
- The motivation behind the analysis is the growing concern about the impact of air pollution on health and the environment. Clean, healthy air is crucial for the well-being of communities globally.
- Leveraging advanced analytics, this study aims to offer insights into the geographical distribution of air pollution, facilitating the development of strategic environmental policies and targeted public health initiatives on a global scale.


## Data Acquisition and Processing

- The dataset, "World Air Quality Index by City and Coordinates", was acquired from Kaggle and is licensed under CC BY-NC-SA 4.0: https://www.kaggle.com/datasets/adityaramachandran27/world-air-quality-index-by-city-and-coordinates/data

The provided Dataset contains air quality data with the following features:

- Country: The country to which the data belongs.
- City: The specific city within the country.
- AQI value: The overall Air Quality Index (AQI) value, representing the combined impact of multiple air pollutants.
- AQI category: The category or level associated with the overall AQI value, indicating the degree of air quality (e.g., Good, Moderate, Unhealthy).
- CO AQI value: The AQI value specifically for Carbon Monoxide (CO) concentration.
- CO AQI category: The category associated with the CO AQI value.
- Ozone AQI value: The AQI value specifically for Ozone concentration.
- Ozone AQI category: The category associated with the Ozone AQI value.
- NO2 AQI value: The AQI value specifically for Nitrogen Dioxide (NO2) concentration.
- NO2 AQI category: The category associated with the NO2 AQI value.
- PM2.5 AQI value: The AQI value specifically for Particulate Matter (PM2.5) concentration.
- PM2.5 AQI category: The category associated with the PM2.5 AQI value.
- lat: The latitude coordinate of the city.
- lng: The longitude coordinate of the city.
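
The category columns place each AQI value in a named band. The sketch below shows how such a mapping can work, assuming the dataset follows the standard US EPA AQI breakpoints (the dataset description does not state this explicitly). The helper `aqi_category` is hypothetical and not part of the dataset or of any library.

```python
from bisect import bisect_left

# Upper bounds of the US EPA AQI bands (assumed to match the dataset's categories)
AQI_UPPER_BOUNDS = [50, 100, 150, 200, 300]
AQI_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def aqi_category(aqi_value):
    """Return the band label for an AQI value (hypothetical helper)."""
    # bisect_left finds the first band whose upper bound is >= aqi_value
    return AQI_LABELS[bisect_left(AQI_UPPER_BOUNDS, aqi_value)]


for value in (7, 50, 51, 163, 500):
    print(value, aqi_category(value))
```

Comparing the output of such a helper with the `AQI category` column is one way to check whether the dataset really uses these bands.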





In [None]:
# Libraries used

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px                     # pip install plotly

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Load Dataset

In [None]:
# Read the CSV straight into a DataFrame
df = pd.read_csv('dataset/aqi_dataset.csv')

In [None]:
df.head()

## Data Cleaning

In [None]:
# Replace spaces in Column names with underscore

df.columns = ['Country','City','AQI_value','AQI_category','CO_AQI_value','CO_AQI_category','Ozone_AQI_value','Ozone_AQI_category','NO2_AQI_value','NO2_AQI_category','PM2.5_AQI_value','PM2.5_AQI_category','lat','lng']
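
The hard-coded list above depends on the column order in the CSV. As an alternative, the spaces can be replaced programmatically. This is a minimal sketch on a stand-in frame; the raw header names used here are assumptions, not read from the actual file.

```python
import pandas as pd

# Tiny stand-in frame; these raw header names are assumed for illustration
sample = pd.DataFrame(columns=["Country", "AQI Value", "PM2.5 AQI Category"])

# Replace every space with an underscore in a single pass
sample.columns = sample.columns.str.replace(" ", "_", regex=False)

print(list(sample.columns))
```

Unlike the explicit list, this keeps the original capitalization of each header, so later code has to use whatever casing the CSV provides.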

In [None]:
df.head()

In [None]:
# Count how many values are NAN
df.isna().sum()

In [None]:
# Drop rows that contain NaN values
df.dropna(inplace=True)
df.isna().sum()

In [None]:
# Check Duplicate cities if there are any 
df[df.duplicated(subset=['City'])]
 

In [None]:
# There are duplicate rows with the same city name and values: keep the first occurrence of each city and drop the rest
df.drop_duplicates(subset='City', inplace=True)
df
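
Deduplicating on `City` alone also removes cities that share a name but lie in different countries. The sketch below, on made-up rows, compares that with deduplicating on the `Country` and `City` pair. Whether the real dataset contains such same-named cities is not checked here.

```python
import pandas as pd

# Made-up rows: one exact duplicate plus a same-named city in another country
sample = pd.DataFrame({
    "Country": ["United States of America", "United States of America", "France"],
    "City": ["Paris", "Paris", "Paris"],
    "AQI_value": [40, 40, 55],
})

# Deduplicating on City alone keeps a single "Paris"
by_city = sample.drop_duplicates(subset="City")

# Deduplicating on the (Country, City) pair keeps one row per country
by_country_city = sample.drop_duplicates(subset=["Country", "City"])

print(len(by_city), len(by_country_city))
```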

In [None]:
# Reset index
df.reset_index(drop=True, inplace=True)
df

In [None]:
df.describe()

## Analysis
Let's find the top 10 cleanest and dirtiest cities in the world in terms of AQI.

In [None]:
df[['Country', 'City', 'AQI_value',]].nsmallest(10,'AQI_value')

In [None]:
df[['Country', 'City', 'AQI_value']].nlargest(10,'AQI_value')
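
When several cities share the same extreme value, `nlargest(10, ...)` returns only ten of them. A small sketch on made-up values shows how pandas' `keep='all'` option retains every tied row instead.

```python
import pandas as pd

# Made-up values: three cities tie at the maximum
sample = pd.DataFrame({
    "City": ["A", "B", "C", "D"],
    "AQI_value": [500, 500, 500, 120],
})

# Default behaviour: exactly n rows, ties cut off at the first occurrences
top2 = sample.nlargest(2, "AQI_value")

# keep='all' returns every row tied with the last selected value
top2_all = sample.nlargest(2, "AQI_value", keep="all")

print(len(top2), len(top2_all))
```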

- The data shows that several cities across different countries, including Pakistan, India, and the United States, recorded the highest possible Air Quality Index (AQI) value of 500. In contrast, cities such as Macas (Ecuador), Azogues (Ecuador), and Tari (Papua New Guinea) have exceptionally low AQI values of 7 to 8, which indicates good air quality and is a positive sign for the environmental health of these areas.

- Since the dataset contains several air quality indicators besides the overall AQI, we will now calculate the average value of each parameter per country and plot the results to see whether a different picture or new insights emerge.

In [None]:
# Top 10 countries with avg cleanest and dirtiest air
top_AQI_clean_countries_df = df.groupby('Country').agg({'AQI_value':'mean'}).nsmallest(10,"AQI_value")
top_AQI_dirtiest_countries_df = df.groupby('Country').agg({'AQI_value':'mean'}).nlargest(10,"AQI_value")

top_PM25_clean_countries_df = df.groupby('Country').agg({'PM2.5_AQI_value':'mean'}).nsmallest(10,"PM2.5_AQI_value")
top_PM25_dirtiest_countries_df = df.groupby('Country').agg({'PM2.5_AQI_value':'mean'}).nlargest(10,"PM2.5_AQI_value")

top_CO_clean_countries_df = df[df.CO_AQI_value!=0].groupby('Country').agg({'CO_AQI_value':'mean'}).nsmallest(10,"CO_AQI_value")
top_CO_dirtiest_countries_df = df.groupby('Country').agg({'CO_AQI_value':'mean'}).nlargest(10,"CO_AQI_value")


plt.style.use('ggplot')

fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2, figsize=(20,25),)


sns.barplot(ax=ax1, x='Country', y='AQI_value', data=top_AQI_clean_countries_df, hue='Country')
ax1.set_title('Avg AQI of Top 10 Cleanest countries')
ax1.set_xlabel('Country')
ax1.set_ylabel('Avg AQI')

sns.barplot(ax=ax2, x='Country', y='AQI_value', data=top_AQI_dirtiest_countries_df, hue='Country')
ax2.set_title('Avg AQI of Top 10 Dirtiest countries')
ax2.set_xlabel('Country')
ax2.set_ylabel('Avg AQI')

sns.barplot(ax=ax3, x='Country', y="PM2.5_AQI_value", data=top_PM25_clean_countries_df, hue='Country')
ax3.set_title('Avg PM2.5 Value of Top 10 Cleanest countries')
ax3.set_xlabel('Country')
ax3.set_ylabel('Avg PM2.5 Value')

sns.barplot(ax=ax4, x='Country', y="PM2.5_AQI_value", data=top_PM25_dirtiest_countries_df, hue='Country')
ax4.set_title('Avg PM2.5 Value of Top 10 Dirtiest countries')
ax4.set_xlabel('Country')
ax4.set_ylabel('Avg PM2.5 Value')

sns.barplot(ax=ax5, x='Country', y='CO_AQI_value', data=top_CO_clean_countries_df, hue='Country')
ax5.set_title('Avg CO AQI of Top 10 Cleanest countries')
ax5.set_xlabel('Country')
ax5.set_ylabel('Avg CO AQI')

sns.barplot(ax=ax6, x='Country', y='CO_AQI_value', data=top_CO_dirtiest_countries_df, hue='Country')
ax6.set_title('Avg CO AQI of Top 10 Dirtiest countries')
ax6.set_xlabel('Country')
ax6.set_ylabel('Avg CO AQI')


# Rotate each subplot's own tick labels (reusing the ax1/ax2 labels would mislabel the other plots)
for ax in (ax1, ax2, ax3, ax4, ax5, ax6):
    plt.setp(ax.get_xticklabels(), rotation=45, ha='right')


plt.show()

## Common Patterns and Similarities

1. **Elevated AQI in Certain Countries:**
   - Republic of Korea consistently appears with high AQI values across multiple pollutants, suggesting persistent air quality challenges.
   - Bahrain, Pakistan, and Saudi Arabia also exhibit high AQI values across various pollutants, indicating potential shared environmental concerns.

2. **High PM2.5 Levels in Specific Regions:**
   - Republic of Korea, Bahrain, Pakistan, and Saudi Arabia consistently show high PM2.5 AQI values, emphasizing the prevalence of fine particulate matter in these regions.

3. **Low AQI in Some Countries:**
   - Palau, Solomon Islands, Maldives, and Luxembourg consistently maintain low overall AQI values, indicating generally favorable air quality conditions.

4. **Differences in CO AQI:**
   - Republic of Korea has a notably higher CO AQI compared to other countries, suggesting potential sources of carbon monoxide emissions in the region.

5. **Variability in NO2 AQI:**
   - China, Rwanda, and Nigeria show relatively higher NO2 AQI values, indicating potential sources of nitrogen dioxide emissions, such as industrial and vehicular activities.
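
The cell above only computes averages for AQI, PM2.5, and CO. The same pattern can be applied to NO2 and Ozone, as in this sketch. The rows are made up for illustration, so the resulting ranking says nothing about the real countries.

```python
import pandas as pd

# Made-up rows standing in for the cleaned DataFrame
sample = pd.DataFrame({
    "Country": ["X", "X", "Y", "Z"],
    "NO2_AQI_value": [10, 20, 2, 0],
    "Ozone_AQI_value": [30, 40, 25, 60],
})

# Average both pollutants per country in a single groupby
avg = sample.groupby("Country", as_index=False)[
    ["NO2_AQI_value", "Ozone_AQI_value"]
].mean()

# Countries with the highest average NO2 and Ozone AQI
top_no2 = avg.nlargest(2, "NO2_AQI_value")
top_ozone = avg.nlargest(2, "Ozone_AQI_value")

print(top_no2)
print(top_ozone)
```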


## Summed AQI Values per Country

AQI values can also be added together to give a combined summary per country. Let's sum each parameter per country and plot the distribution for the top 5 clean and dirty countries to see if we can find something meaningful. Note that a sum grows with the number of cities a country has in the dataset, so it reflects coverage as well as pollution.


In [None]:
# Let's see the percentage distribution for the top 5 countries

top_AQI_sum_clean_countries_df = df.groupby('Country',as_index=False).agg({'AQI_value':'sum'}).nsmallest(5,"AQI_value")
top_AQI_sum_dirtiest_countries_df = df.groupby('Country',as_index=False).agg({'AQI_value':'sum'}).nlargest(5,"AQI_value")

top_PM25_sum_clean_countries_df = df.groupby('Country',as_index=False).agg({'PM2.5_AQI_value':'sum'}).nsmallest(5,"PM2.5_AQI_value")
top_PM25_sum_dirtiest_countries_df = df.groupby('Country',as_index=False).agg({'PM2.5_AQI_value':'sum'}).nlargest(5,"PM2.5_AQI_value")


top_CO_sum_clean_countries_df = df[df.CO_AQI_value!=0].groupby('Country',as_index=False).agg({'CO_AQI_value':'sum'}).nsmallest(5,"CO_AQI_value")
top_CO_sum_dirtiest_countries_df = df.groupby('Country',as_index=False).agg({'CO_AQI_value':'sum'}).nlargest(5,"CO_AQI_value")

# Create the pie charts
fig, axes = plt.subplots(3, 2, figsize=(15, 12))

# Five colors, one per slice (with only four, the fifth slice would repeat the first color)
colors = ['red', 'green', 'blue', 'orange', 'purple']

# (data, value column, title) for each subplot, in row-major order
pies = [
    (top_AQI_sum_clean_countries_df, 'AQI_value', 'Distribution of Top 5 Clean AQI Countries'),
    (top_AQI_sum_dirtiest_countries_df, 'AQI_value', 'Distribution of Top 5 Dirty AQI Countries'),
    (top_PM25_sum_clean_countries_df, 'PM2.5_AQI_value', 'Distribution of Top 5 Clean PM2.5 Countries'),
    (top_PM25_sum_dirtiest_countries_df, 'PM2.5_AQI_value', 'Distribution of Top 5 Dirty PM2.5 Countries'),
    (top_CO_sum_clean_countries_df, 'CO_AQI_value', 'Distribution of Top 5 Clean CO Value Countries'),
    (top_CO_sum_dirtiest_countries_df, 'CO_AQI_value', 'Distribution of Top 5 Dirty CO Value Countries'),
]

for ax, (data, column, title) in zip(axes.flat, pies):
    ax.pie(data[column], labels=data['Country'], autopct='%1.1f%%', startangle=90, colors=colors)
    ax.set_title(title)

plt.tight_layout()

# Show the plot
plt.show()



### Insights from the Pie Charts

As expected, the pie charts show a different and more varied distribution, especially for the countries with dirty air. Here are some insights:

### AQI (Air Quality Index):

**Clean AQI Countries:**
- Palau, Solomon Islands, Maldives, Luxembourg, Iceland.

**Dirty AQI Countries:**
- China, India, United States of America, Germany, Italy.

**Unique Finding:**
- Iceland appears in both clean and dirty categories for different pollutants.

### PM2.5 (Particulate Matter 2.5):

**Clean PM2.5 Countries:**
- Solomon Islands, Palau, Luxembourg, Maldives, Iceland.

**Dirty PM2.5 Countries:**
- China, India, United States of America, Germany, Italy.

### Carbon Monoxide (CO) Values:

**Top 5 Dirty CO Value Countries:**
1. Russian Federation
2. China
3. India
4. Italy
5. United States of America.
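
Because the rankings above come from sums, it can help to look at the city count, the sum, and the mean side by side. The sketch below uses two made-up countries (`Big` and `Small` are hypothetical) to show how a country with many moderately polluted cities can outrank one with a single heavily polluted city.

```python
import pandas as pd

# Made-up rows: "Big" has four moderate cities, "Small" has one polluted city
sample = pd.DataFrame({
    "Country": ["Big"] * 4 + ["Small"],
    "AQI_value": [40, 50, 60, 50, 150],
})

# Number of cities, summed AQI, and average AQI per country
summary = sample.groupby("Country")["AQI_value"].agg(
    cities="count", total="sum", avg="mean"
)

print(summary)
```

Here `Big` has the larger total while `Small` has the larger average, so the two rankings disagree.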

## Some More Plots to Look for Patterns

In [None]:
# Pair plot
sns.pairplot(data=df,hue='AQI_category', vars=['CO_AQI_value', 'Ozone_AQI_value', 'NO2_AQI_value', 'PM2.5_AQI_value'],)
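
A correlation matrix gives a numeric companion to the pair plot. This is a minimal sketch on made-up values, using pandas' default Pearson correlation. The numbers are illustrative and do not describe the real dataset.

```python
import pandas as pd

# Made-up values: NO2 rises with CO, PM2.5 falls as CO rises
sample = pd.DataFrame({
    "CO_AQI_value": [1, 2, 3, 4],
    "NO2_AQI_value": [2, 4, 6, 8],
    "PM2.5_AQI_value": [80, 60, 40, 20],
})

# Pairwise Pearson correlation between the pollutant columns
corr = sample.corr()

print(corr.round(2))
```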

## Let's Plot These Features Geographically


In [None]:
# AQI MAP
sns.set(style="darkgrid")  # set the style first so it applies to this figure
plt.figure(figsize=(16, 8))
sns.scatterplot(data=df, x='lng', y='lat', hue='AQI_category')
plt.show()

In [None]:
# PM2.5 MAP

sns.set(style="darkgrid")  # set the style first so it applies to this figure
plt.figure(figsize=(16, 8))
sns.scatterplot(data=df, x='lng', y='lat', hue='PM2.5_AQI_category')
plt.show()

- For PM2.5, the map is largely similar to the AQI map. There are some outliers, but they are hard to identify.

In [None]:
# Carbon Monoxide(CO) Map

sns.set(style="darkgrid")  # set the style first so it applies to this figure
plt.figure(figsize=(16, 8))
sns.scatterplot(data=df, x='lng', y='lat', hue='CO_AQI_category')
plt.show()

In [None]:
#Ozone Map
sns.set(style="darkgrid")  # set the style first so it applies to this figure
plt.figure(figsize=(16, 8))
sns.scatterplot(data=df, x='lng', y='lat', hue='Ozone_AQI_category')
plt.show()

In [None]:
# NO2 Map
sns.set(style="darkgrid")  # set the style first so it applies to this figure
plt.figure(figsize=(16, 8))
sns.scatterplot(data=df, x='lng', y='lat', hue='NO2_AQI_category')
plt.show()

### For AQI Map:
- Most parts of North America, Europe, and Asia exhibit a mix of “Good,” “Moderate,” and “Unhealthy” air quality for all types of air quality parameters.
- Clusters of red dots in Asia highlight regions with consistently unhealthy air quality.
- South America predominantly shows good to moderate air quality, with scattered unhealthy spots.
- Africa displays a mix but leans toward moderate air quality.
- Australia’s eastern part is moderate to unhealthy, while the western part remains mostly good.

## Conclusion: Cleanest and Dirtiest Air Quality

The analysis of global air quality data has allowed us to categorize regions into the cleanest and dirtiest based on various parameters, including the overall Air Quality Index (AQI), PM2.5 levels, and Carbon Monoxide (CO) concentrations.

### Cleanest Air Quality Regions:

1. **Clean AQI Countries:**
   - Palau, Solomon Islands, Maldives, Luxembourg, Iceland.
   - These regions consistently maintain low overall AQI values, indicating generally favorable air quality conditions.

2. **Clean PM2.5 Countries:**
   - Solomon Islands, Palau, Luxembourg, Maldives, Iceland.
   - These areas exhibit low levels of particulate matter (PM2.5), contributing to good air quality.

3. **Top 5 Clean CO Value Countries:**
   - Solomon Islands, Palau, Luxembourg, Maldives, Iceland.


### Dirtiest Air Quality Regions:

1. **Dirty AQI Countries:**
   - China, India, United States of America, Germany, Italy.
   - These countries record the highest summed AQI values, which partly reflects how many of their cities appear in the dataset.

2. **Dirty PM2.5 Countries:**
   - China, India, United States of America, Germany, Italy.
   - These countries record the highest summed PM2.5 AQI values, a figure that also grows with the number of cities covered.

3. **Top 5 Dirty CO Value Countries:**
   - Russian Federation, China, India, Italy, United States of America.
   - These countries exhibit the highest levels of Carbon Monoxide (CO), indicating significant sources of CO emissions.

### Global Implications and Recommendations:

- The cleanest regions may reflect environmental policies and practices that can serve as models for others.
- Regions with consistently high pollution levels require urgent intervention through stricter regulations and emission control measures.
- International collaboration is essential to address the global challenge of air pollution, promoting sustainable development and environmental health.

This analysis provides valuable insights for policymakers, environmentalists, and the public to prioritize efforts towards achieving cleaner air quality and ensuring a healthier and more sustainable future.

In [30]:
import plotly.express as px

# Interactive map of AQI categories; 'open-street-map' tiles need no access token
fig = px.scatter_mapbox(df,
                        lon='lng',
                        lat='lat',
                        color='AQI_category',
                        zoom=2,
                        width=1100,
                        height=700,
                        title='Air Quality across the world')
fig.update_layout(mapbox_style='open-street-map')

fig.show()