<a href="https://colab.research.google.com/github/zukhrafarshadz-sudo/PDA/blob/main/PDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practical Data Analysis Coursework  
## Air Quality Index (AQI) Analysis Using Python

**Module:** CMP7005 – Programming for Data Analysis  
**Student ID:** ST20318986  
**Student Name:** Zukhraf Arshad  
**Academic Year:** 2025–2026  

---

### Project Overview
This project analyses air quality data collected from 26 Indian cities.  
The objectives are to:
- Integrate multiple city-level datasets into a single dataset
- Perform exploratory data analysis (EDA)
- Prepare the data for predictive modelling
- Develop insights into air pollution patterns and AQI behaviour


## Task 1: Data Handling and Integration

**Objective:**  
To load, merge, and validate multiple city-level air quality datasets obtained from a GitHub repository into a single unified dataset for analysis.


### Integrating My GitHub Account

In [1]:
!git config --global user.name "zukhrafarshadz-sudo"
!git config --global user.email "zukhrafarshadz@gmail.com"

In [2]:
# Clone the repository that contains the dataset from GitHub
!git clone https://github.com/zukhrafarshadz-sudo/PDA.git


Cloning into 'PDA'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 31 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (31/31), 770.67 KiB | 5.04 MiB/s, done.


In [5]:
%cd PDA

/content/PDA


In [6]:
%ls

Ahmedabad_data.csv     Coimbatore_data.csv  Kolkata_data.csv
Aizawl_data.csv        Delhi_data.csv       Lucknow_data.csv
Amaravati_data.csv     Ernakulam_data.csv   Mumbai_data.csv
Amritsar_data.csv      Gurugram_data.csv    Patna_data.csv
Bengaluru_data.csv     Guwahati_data.csv    README.md
Bhopal_data.csv        Hyderabad_data.csv   Shillong_data.csv
Brajrajnagar_data.csv  Jaipur_data.csv      Talcher_data.csv
Chandigarh_data.csv    Jorapokhar_data.csv  Thiruvananthapuram_data.csv
Chennai_data.csv       Kochi_data.csv       Visakhapatnam_data.csv


### Importing the Required Libraries

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import os
import glob

### Combining all the CSV files in the dataset

In [12]:
# Confirm we are inside the cloned repo folder (PDA)
print("Current folder:", os.getcwd())

# Find all per-city CSV files (names ending in _data.csv), in alphabetical order
city_files = sorted(glob.glob("*_data.csv"))
print(f"Found {len(city_files)} city CSV files.")
# city_files[:5]  # uncomment to preview the first 5 file names

# Read each city file into its own DataFrame and collect them in a list
all_cities_data = []

for file_name in city_files:
    city_df = pd.read_csv(file_name)
    all_cities_data.append(city_df)
    print(f"Loaded: {file_name} | rows: {len(city_df)}")

# Combine into one big table
combined_data = pd.concat(all_cities_data, ignore_index=True)

print("\nSUCCESS: Combined dataset created!")
print("Total rows:", len(combined_data))
print("Total columns:", combined_data.shape[1])

# Saving the combined dataset to CSV
combined_file_name = "all_cities_combined.csv"
combined_data.to_csv(combined_file_name, index=False)

print(f"Combined file saved as: {combined_file_name}")


Current folder: /content/PDA
Found 26 city CSV files.
Loaded: Ahmedabad_data.csv | rows: 2009
Loaded: Aizawl_data.csv | rows: 113
Loaded: Amaravati_data.csv | rows: 951
Loaded: Amritsar_data.csv | rows: 1221
Loaded: Bengaluru_data.csv | rows: 2009
Loaded: Bhopal_data.csv | rows: 289
Loaded: Brajrajnagar_data.csv | rows: 938
Loaded: Chandigarh_data.csv | rows: 304
Loaded: Chennai_data.csv | rows: 2009
Loaded: Coimbatore_data.csv | rows: 386
Loaded: Delhi_data.csv | rows: 2009
Loaded: Ernakulam_data.csv | rows: 162
Loaded: Gurugram_data.csv | rows: 1679
Loaded: Guwahati_data.csv | rows: 502
Loaded: Hyderabad_data.csv | rows: 2006
Loaded: Jaipur_data.csv | rows: 1114
Loaded: Jorapokhar_data.csv | rows: 1169
Loaded: Kochi_data.csv | rows: 162
Loaded: Kolkata_data.csv | rows: 814
Loaded: Lucknow_data.csv | rows: 2009
Loaded: Mumbai_data.csv | rows: 2009
Loaded: Patna_data.csv | rows: 1858
Loaded: Shillong_data.csv | rows: 310
Loaded: Talcher_data.csv | rows: 925
Loaded: Thiruvananthapuram_d

In [13]:
df = pd.read_csv('all_cities_combined.csv')
df

Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Ahmedabad,01/01/2015,,,0.92,18.22,17.15,,0.92,27.64,133.36,0.00,0.02,0.00,,
1,Ahmedabad,02/01/2015,,,0.97,15.69,16.46,,0.97,24.55,34.06,3.68,5.50,3.77,,
2,Ahmedabad,03/01/2015,,,17.40,19.30,29.70,,17.40,29.07,30.70,6.80,16.40,2.25,,
3,Ahmedabad,04/01/2015,,,1.70,18.48,17.97,,1.70,18.59,36.08,4.43,10.14,1.00,,
4,Ahmedabad,05/01/2015,,,22.10,21.42,37.76,,22.10,39.33,39.31,7.01,18.89,2.78,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29526,Visakhapatnam,27/06/2020,15.02,50.94,7.68,25.06,19.54,12.47,0.47,8.55,23.30,2.24,12.07,0.73,41.0,Good
29527,Visakhapatnam,28/06/2020,24.38,74.09,3.42,26.06,16.53,11.99,0.52,12.72,30.14,0.74,2.21,0.38,70.0,Satisfactory
29528,Visakhapatnam,29/06/2020,22.91,65.73,3.45,29.53,18.33,10.71,0.48,8.42,30.96,0.01,0.01,0.00,68.0,Satisfactory
29529,Visakhapatnam,30/06/2020,16.64,49.97,4.05,29.26,18.80,10.03,0.52,9.84,28.30,0.00,0.00,0.00,54.0,Satisfactory
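
The `Date` values in the preview above are day-first strings (for example `27/06/2020`), and `pd.read_csv` leaves them as plain text by default. A minimal sketch of parsing them explicitly, assuming the `dd/mm/yyyy` format holds for every row; the small `sample` frame is illustrative and stands in for `df`:

```python
import pandas as pd

# Two Date values copied from the preview above; this frame stands in for the full df.
sample = pd.DataFrame({
    "City": ["Ahmedabad", "Visakhapatnam"],
    "Date": ["02/01/2015", "27/06/2020"],
})

# An explicit format avoids ambiguous day/month guessing;
# errors="coerce" turns any malformed value into NaT instead of raising.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y", errors="coerce")

print(sample.dtypes)
print(sample)
```

With a proper datetime column, later steps such as sorting by date or grouping by month or year can use the `.dt` accessor directly.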


**Task 1 Summary**

All 26 city-level air quality CSV files were successfully merged into a single dataset containing 29,531 records and 16 variables. The combined dataset was saved as `all_cities_combined.csv` and retains the original structure of the source datasets, making it suitable for exploratory data analysis and predictive modelling.
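
The figures in this summary can also be checked programmatically. The sketch below is one possible way to do that, not part of the original notebook: `validate_merge` is a hypothetical helper, and the two tiny frames are illustrative stand-ins for the per-city CSV files (their values are taken from the preview above).

```python
import pandas as pd

def validate_merge(frames, combined):
    """Return basic consistency checks for a row-wise concatenated dataset."""
    expected_rows = sum(len(f) for f in frames)
    source_cities = set().union(*(set(f["City"]) for f in frames))
    return {
        # Concatenation neither dropped nor added rows
        "row_count_matches": len(combined) == expected_rows,
        # Every source frame has the same columns, in the same order, as the result
        "columns_match": all(list(f.columns) == list(combined.columns) for f in frames),
        # Every city present in the sources appears in the result, and no others
        "cities_match": set(combined["City"]) == source_cities,
    }

# Tiny illustrative frames standing in for two of the per-city files
a = pd.DataFrame({"City": ["Visakhapatnam", "Visakhapatnam"], "AQI": [41.0, 70.0]})
b = pd.DataFrame({"City": ["Ahmedabad"], "AQI": [None]})
merged = pd.concat([a, b], ignore_index=True)

checks = validate_merge([a, b], merged)
print(checks)
```

In the notebook itself, the same helper could be called as `validate_merge(all_cities_data, combined_data)`, since both variables already exist after the combining cell.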


## Task 2: Exploratory Data Analysis (EDA)

**Objective:**  
To explore, clean, and understand the combined air quality dataset using statistical analysis and data visualisation techniques. This task aims to identify data quality issues, understand pollutant behaviour, and uncover patterns and relationships that will inform subsequent predictive modelling.
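
One of the first data quality issues visible in the Task 1 preview is missing values (the early Ahmedabad rows have no PM2.5, PM10, NH3 or AQI readings). A minimal sketch of quantifying this per column, assuming a hypothetical helper named `missing_summary`; the `sample` frame holds three rows from the preview and stands in for `df`:

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Three rows taken from the preview; NaN marks the blank cells
sample = pd.DataFrame({
    "City": ["Ahmedabad", "Ahmedabad", "Visakhapatnam"],
    "PM2.5": [np.nan, np.nan, 15.02],
    "NO": [0.92, 0.97, 7.68],
    "AQI": [np.nan, np.nan, 41.0],
})

print(missing_summary(sample))
```

Applied to the full dataset as `missing_summary(df)`, this would show which pollutants are too sparsely recorded to use without imputation.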



### 2.1 Data Understanding
