

## Programming for Data Analysis and Visualisation
### Assessment (Indian Air Pollution Data)


### **🎯 Objective:**

The objective of this project is to analyse India's air quality data and build a complete end-to-end machine learning workflow. This includes importing and merging the dataset, performing exploratory data analysis, cleaning and preprocessing the data, developing an AQI prediction model, and creating a simple multi-page GUI application to present results.
The project also aims to use GitHub for version control by maintaining a well-organised repository with regular commits and documentation.

# 🧭 Task 1 - Data Handling


## Importing the Required Libraries

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import os
import glob

## Mounting the drive

In [2]:
# Load the Drive helper and mount Google Drive.
# This will prompt for authorization.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
%cd '/content/drive/MyDrive/week5_datavisulisation/Assessment Data-20251027'
# Folder on my Drive where the assessment data files are saved

/content/drive/MyDrive/week5_datavisulisation/Assessment Data-20251027


In [4]:
# List the contents of the current folder to confirm the path is correct
%ls

Ahmedabad_data.csv       Chennai_data.csv     Kochi_data.csv
Aizawl_data.csv          Coimbatore_data.csv  Kolkata_data.csv
all_cities_combined.csv  Delhi_data.csv       Lucknow_data.csv
Amaravati_data.csv       Ernakulam_data.csv   Mumbai_data.csv
Amritsar_data.csv        Gurugram_data.csv    Patna_data.csv
Bengaluru_data.csv       Guwahati_data.csv    Shillong_data.csv
Bhopal_data.csv          Hyderabad_data.csv   Talcher_data.csv
Brajrajnagar_data.csv    Jaipur_data.csv      Thiruvananthapuram_data.csv
Chandigarh_data.csv      Jorapokhar_data.csv  Visakhapatnam_data.csv


## Combining all the CSV files

In [16]:
all_cities_data = []

city_files = glob.glob("*_data.csv") # Finding all the files that end with _data.csv
for file_name in city_files:
    city_df = pd.read_csv(file_name) # Reading the CSV file into a data frame
    # Adding city's data to a list
    all_cities_data.append(city_df)

    print(f"Loaded: {file_name}")

# Combining all city data into one big table
combined_data = pd.concat(all_cities_data, ignore_index=True)

# Saving the combined data to a new CSV file
combined_data.to_csv("all_cities_combined.csv", index=False)

print(f"SUCCESS: Combined {len(city_files)} city files into one file with {len(combined_data)} total rows")
print("The combined file is saved as: all_cities_combined.csv")


Loaded: Delhi_data.csv
Loaded: Brajrajnagar_data.csv
Loaded: Gurugram_data.csv
Loaded: Chennai_data.csv
Loaded: Hyderabad_data.csv
Loaded: Jaipur_data.csv
Loaded: Patna_data.csv
Loaded: Bhopal_data.csv
Loaded: Mumbai_data.csv
Loaded: Jorapokhar_data.csv
Loaded: Ernakulam_data.csv
Loaded: Thiruvananthapuram_data.csv
Loaded: Ahmedabad_data.csv
Loaded: Kochi_data.csv
Loaded: Amritsar_data.csv
Loaded: Lucknow_data.csv
Loaded: Visakhapatnam_data.csv
Loaded: Chandigarh_data.csv
Loaded: Bengaluru_data.csv
Loaded: Talcher_data.csv
Loaded: Shillong_data.csv
Loaded: Kolkata_data.csv
Loaded: Guwahati_data.csv
Loaded: Aizawl_data.csv
Loaded: Amaravati_data.csv
Loaded: Coimbatore_data.csv
SUCCESS: Combined 26 city files into one file with 29531 total rows
The combined file is saved as: all_cities_combined.csv
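
The log above shows that `glob.glob` returns the files in no particular order. The following is a minimal sketch of a more defensive variant of the same step, not the method used in this notebook. The function name `combine_city_files` and its parameters are hypothetical. It sorts the file list so the row order is reproducible and raises an error if no file matches the pattern.

```python
from pathlib import Path

import pandas as pd


def combine_city_files(folder=".", pattern="*_data.csv"):
    """Read every CSV in `folder` matching `pattern` and stack them row-wise.

    Hypothetical helper: the name and parameters are assumptions for illustration.
    """
    # sorted() makes the load order deterministic; raw glob order depends on the file system
    city_files = sorted(Path(folder).glob(pattern))
    if not city_files:
        raise FileNotFoundError(f"No files matching {pattern!r} in {folder!r}")

    # One DataFrame per city file, then a single concat with a fresh 0..n-1 index
    frames = [pd.read_csv(path) for path in city_files]
    return pd.concat(frames, ignore_index=True)
```

Calling `combine_city_files()` from the data folder would skip `all_cities_combined.csv`, because that file name does not end in `_data.csv`.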


## Fundamental Data Understanding to Gain General Insight

In [17]:
df = pd.read_csv('all_cities_combined.csv')
df

| | City | Date | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | O3 | Benzene | Toluene | Xylene | AQI | AQI_Bucket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Delhi | 01/01/2015 | 313.22 | 607.98 | 69.16 | 36.39 | 110.59 | 33.85 | 15.20 | 9.25 | 41.68 | 14.36 | 24.86 | 9.84 | 472.0 | Severe |
| 1 | Delhi | 02/01/2015 | 186.18 | 269.55 | 62.09 | 32.87 | 88.14 | 31.83 | 9.54 | 6.65 | 29.97 | 10.55 | 20.09 | 4.29 | 454.0 | Severe |
| 2 | Delhi | 03/01/2015 | 87.18 | 131.90 | 25.73 | 30.31 | 47.95 | 69.55 | 10.61 | 2.65 | 19.71 | 3.91 | 10.23 | 1.99 | 143.0 | Moderate |
| 3 | Delhi | 04/01/2015 | 151.84 | 241.84 | 25.01 | 36.91 | 48.62 | 130.36 | 11.54 | 4.63 | 25.36 | 4.26 | 9.71 | 3.34 | 319.0 | Very Poor |
| 4 | Delhi | 05/01/2015 | 146.60 | 219.13 | 14.01 | 34.92 | 38.25 | 122.88 | 9.20 | 3.33 | 23.20 | 2.80 | 6.21 | 2.96 | 325.0 | Very Poor |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29526 | Coimbatore | 27/06/2020 | 11.81 | 15.08 | NaN | 40.84 | 23.43 | 2.49 | 0.57 | 6.04 | 15.42 | 0.00 | 0.00 | NaN | 34.0 | Good |
| 29527 | Coimbatore | 28/06/2020 | 14.04 | 16.03 | NaN | 44.77 | 26.75 | 2.63 | 0.57 | 5.88 | 11.45 | 0.00 | 0.00 | NaN | 32.0 | Good |
| 29528 | Coimbatore | 29/06/2020 | 16.26 | 20.81 | NaN | 49.22 | 31.02 | 2.01 | 0.61 | 6.19 | 10.09 | 0.00 | 0.00 | NaN | 41.0 | Good |
| 29529 | Coimbatore | 30/06/2020 | 14.21 | 15.69 | NaN | 39.15 | 20.83 | 1.72 | 0.59 | 5.59 | 13.85 | 0.00 | 0.00 | NaN | 33.0 | Good |


In [18]:
df.head()

| | City | Date | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | O3 | Benzene | Toluene | Xylene | AQI | AQI_Bucket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Delhi | 01/01/2015 | 313.22 | 607.98 | 69.16 | 36.39 | 110.59 | 33.85 | 15.2 | 9.25 | 41.68 | 14.36 | 24.86 | 9.84 | 472.0 | Severe |
| 1 | Delhi | 02/01/2015 | 186.18 | 269.55 | 62.09 | 32.87 | 88.14 | 31.83 | 9.54 | 6.65 | 29.97 | 10.55 | 20.09 | 4.29 | 454.0 | Severe |
| 2 | Delhi | 03/01/2015 | 87.18 | 131.9 | 25.73 | 30.31 | 47.95 | 69.55 | 10.61 | 2.65 | 19.71 | 3.91 | 10.23 | 1.99 | 143.0 | Moderate |
| 3 | Delhi | 04/01/2015 | 151.84 | 241.84 | 25.01 | 36.91 | 48.62 | 130.36 | 11.54 | 4.63 | 25.36 | 4.26 | 9.71 | 3.34 | 319.0 | Very Poor |
| 4 | Delhi | 05/01/2015 | 146.6 | 219.13 | 14.01 | 34.92 | 38.25 | 122.88 | 9.2 | 3.33 | 23.2 | 2.8 | 6.21 | 2.96 | 325.0 | Very Poor |


In [19]:
df.tail()

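| | City | Date | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | O3 | Benzene | Toluene | Xylene | AQI | AQI_Bucket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29526 | Coimbatore | 27/06/2020 | 11.81 | 15.08 | NaN | 40.84 | 23.43 | 2.49 | 0.57 | 6.04 | 15.42 | 0.0 | 0.0 | NaN | 34.0 | Good |
| 29527 | Coimbatore | 28/06/2020 | 14.04 | 16.03 | NaN | 44.77 | 26.75 | 2.63 | 0.57 | 5.88 | 11.45 | 0.0 | 0.0 | NaN | 32.0 | Good |
| 29528 | Coimbatore | 29/06/2020 | 16.26 | 20.81 | NaN | 49.22 | 31.02 | 2.01 | 0.61 | 6.19 | 10.09 | 0.0 | 0.0 | NaN | 41.0 | Good |
| 29529 | Coimbatore | 30/06/2020 | 14.21 | 15.69 | NaN | 39.15 | 20.83 | 1.72 | 0.59 | 5.59 | 13.85 | 0.0 | 0.0 | NaN | 33.0 | Good |
| 29530 | Coimbatore | 01/07/2020 | NaN | NaN | NaN | 46.03 | 27.57 | NaN | 0.57 | 5.73 | 10.59 | 0.0 | 0.0 | NaN | NaN | NaN |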


In [21]:
# df.shape is a (rows, columns) tuple
print(f'No of Rows: {df.shape[0]}, No of Columns: {df.shape[1]}')

No of Rows: 29531, No of Columns: 16


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29531 entries, 0 to 29530
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   City        29531 non-null  object 
 1   Date        29531 non-null  object 
 2   PM2.5       24933 non-null  float64
 3   PM10        18391 non-null  float64
 4   NO          25949 non-null  float64
 5   NO2         25946 non-null  float64
 6   NOx         25346 non-null  float64
 7   NH3         19203 non-null  float64
 8   CO          27472 non-null  float64
 9   SO2         25677 non-null  float64
 10  O3          25509 non-null  float64
 11  Benzene     23908 non-null  float64
 12  Toluene     21490 non-null  float64
 13  Xylene      11422 non-null  float64
 14  AQI         24850 non-null  float64
 15  AQI_Bucket  24850 non-null  object 
dtypes: float64(13), object(3)
memory usage: 3.6+ MB
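
The non-null counts above imply a large number of missing values in some columns. For example, `Xylene` has 11422 non-null entries out of 29531 rows, so roughly 61% of its values are missing. As a rough sketch, a per-column summary of missing values could be produced as follows. The helper name `missing_summary` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values per column, worst first.

    Hypothetical helper for illustration only.
    """
    missing = frame.isna().sum()  # number of NaN entries in each column
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)
```

Running `missing_summary(df)` on the combined data would list the sparsest columns first, which can help when deciding between dropping and imputing a column later in the cleaning step.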


In [23]:
df.dtypes

City           object
Date           object
PM2.5         float64
PM10          float64
NO            float64
NO2           float64
NOx           float64
NH3           float64
CO            float64
SO2           float64
O3            float64
Benzene       float64
Toluene       float64
Xylene        float64
AQI           float64
AQI_Bucket     object
dtype: object


In [24]:
df.columns

Index(['City', 'Date', 'PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2',
       'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI', 'AQI_Bucket'],
      dtype='object')
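
The outputs above report `Date` as `object`, meaning the dates are stored as text. The sample rows (for example `27/06/2020`) suggest a day/month/year layout. The following is a minimal sketch of converting the column to a proper datetime type, assuming every row follows that layout. The helper name `parse_dates` and its parameters are hypothetical.

```python
import pandas as pd


def parse_dates(frame, column="Date", fmt="%d/%m/%Y"):
    """Return a copy of `frame` with `column` converted from text to datetime64.

    Hypothetical helper; assumes all values follow the day/month/year format `fmt`.
    """
    out = frame.copy()  # leave the caller's DataFrame unchanged
    # An explicit format avoids day/month ambiguity, e.g. 02/01/2015 is 2 January
    out[column] = pd.to_datetime(out[column], format=fmt)
    return out
```

With `df = parse_dates(df)`, the `Date` column could then be used for time-based grouping such as yearly or monthly averages. If any value does not match the format, `pd.to_datetime` raises an error instead of silently misreading it.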