## Importing Required Libraries

To begin the data analysis and machine learning workflow, we import essential Python libraries for data manipulation, visualization, and numerical operations:

### Libraries Used:
- **pandas**: For reading, cleaning, and processing structured data (DataFrames).
- **numpy**: For efficient numerical computations and array operations.
- **seaborn**: For enhanced data visualization with statistical plotting.
- **matplotlib.pyplot**: For creating custom plots and charts.

These libraries form the core environment for exploring the dataset and preparing it for modeling.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Loading Air Quality and Weather Datasets

We begin by loading two separate datasets: one containing **air quality data** and another with **weather conditions**. These datasets are later merged to build a comprehensive view for AQI prediction.

### Datasets:
- **Air Quality Data (`aqidf`)**
  - File: `city-railway station, bangalore-air-quality.csv`
  - Contains daily pollutant readings (`pm10`, `no2`, `so2`, `co`) recorded near a railway station in Bangalore. The file has no AQI column; an AQI value is derived from these readings later in the notebook.

- **Weather Data (`weatherdf`)**
  - File: `climate_data.csv`
  - Includes meteorological parameters such as temperature, humidity, rainfall, fog, and wind speed.

Both datasets are read using `pandas.read_csv()` and will be preprocessed and merged in subsequent steps to create a unified dataset for modeling.


In [2]:
aqidf = pd.read_csv(r"D:\Downloads\city-railway station, bangalore-air-quality.csv")
weatherdf = pd.read_csv(r"D:\Downloads\climate_data.csv")
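
A minimal sketch of a load step that handles placeholders up front. It assumes, which the original notebook does not do, that `-` is declared as a missing-value marker at read time and that header whitespace is stripped immediately. The inline CSV text is a hypothetical stand-in for the real files.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the weather CSV; the real file is not reproduced here.
sample_csv = io.StringIO(
    "Year,Month,Day,T,SLP\n"
    "2006,1,1,19.3,-\n"
    "2006,1,2,19.5,-\n"
)

# na_values tells read_csv to treat "-" as missing instead of as text.
sample_df = pd.read_csv(sample_csv, na_values=["-"])

# Strip stray whitespace from header names (the air quality file has names such as " pm10").
sample_df.columns = sample_df.columns.str.strip()

print(sample_df.dtypes)
```

Declaring the placeholder at load time lets numeric columns be parsed as numbers instead of `object`, which would make some of the later conversion steps unnecessary.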

## Previewing the Air Quality Dataset

To understand the structure and contents of the air quality data, we use the `head()` function to display the first few rows of the DataFrame.

### Purpose:
- To inspect the available features (columns), here the pollutant readings and the date components.
- To spot obvious formatting issues or blank entries in the first rows (data types and missing-value counts are checked with `info()` further below).
- To verify the successful loading of the dataset.

This preview helps guide the data cleaning and feature engineering steps that follow.



In [3]:
aqidf.head()

Unnamed: 0,pm10,no2,so2,co,Day,Month,Year
0,69,18,3,10,1,3,2025
1,71,18,3,9,2,3,2025
2,70,16,3,9,3,3,2025
3,72,12,4,11,4,3,2025
4,73,11,4,11,5,3,2025


## Previewing the Weather Dataset

We use the `head()` function to examine the first few rows of the weather dataset.

### Purpose:
- To explore the structure and types of weather-related features available.
- To verify the presence of key variables such as temperature, humidity, wind speed, rainfall, and fog.
- To confirm successful data loading and assess if any initial cleaning is required.

This preview ensures we understand how weather variables are recorded and prepares us for merging with the air quality data.


In [4]:
weatherdf.head()

Unnamed: 0,Year,Month,Day,T,TM,Tm,SLP,H,PP,VV,V,VM,VG,RA,SN,TS,FG
0,2006,1,1.0,19.3,25.3,13.4,-,70.0,0.0,6.6,5.7,9.4,-,,,,
1,2006,1,2.0,19.5,25.5,14.5,-,69.0,0.0,6.6,5.7,9.4,-,,,,
2,2006,1,3.0,20.6,27.6,15.8,-,60.0,0.0,6.3,4.8,11.1,-,,,,
3,2006,1,4.0,,,,,,,,,,,,,,
4,2006,1,5.0,,,,,,,,,,,,,,


In [5]:
weatherdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7261 entries, 0 to 7260
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Year    7261 non-null   int64  
 1   Month   7261 non-null   int64  
 2   Day     7030 non-null   float64
 3   T       3868 non-null   object 
 4   TM      3868 non-null   object 
 5   Tm      3868 non-null   object 
 6   SLP     3868 non-null   object 
 7   H       3868 non-null   object 
 8   PP      3868 non-null   object 
 9   VV      3868 non-null   object 
 10  V       3868 non-null   object 
 11  VM      3868 non-null   object 
 12  VG      3637 non-null   object 
 13  RA      2674 non-null   object 
 14  SN      314 non-null    object 
 15  TS      874 non-null    object 
 16  FG      327 non-null    object 
dtypes: float64(1), int64(2), object(14)
memory usage: 964.5+ KB


In [6]:
aqidf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2798 entries, 0 to 2797
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0    pm10   2798 non-null   object
 1    no2    2798 non-null   object
 2    so2    2798 non-null   object
 3    co     2798 non-null   object
 4   Day     2798 non-null   int64 
 5   Month   2798 non-null   int64 
 6   Year    2798 non-null   int64 
dtypes: int64(3), object(4)
memory usage: 153.1+ KB


## Cleaning the Weather Dataset

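To ensure consistency and avoid issues during merging or analysis, we remove any rows from the weather dataset where the **`Day`** value is missing. The `info()` output above reports 7030 non-null `Day` values out of 7261 rows, so 231 rows are dropped.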

### Operation:
- `dropna(subset=['Day'])`: Removes rows where the `Day` column contains `NaN`.

### Purpose:
- The `Day` column is essential for time-based merging with the air quality dataset.
- Missing date components can lead to mismatches or invalid entries during join operations.

This step ensures the weather dataset contains only valid, complete date entries before proceeding with further processing or merging.


In [8]:
weatherdf = weatherdf.dropna(subset=['Day'])

## Converting 'Day' Column to Integer

To ensure uniformity in data types and support accurate merging with other datasets, we convert the `Day` column in the weather dataset to integers.

### Operation:
- `astype(int)`: Casts all values in the `Day` column to integer type.

### Purpose:
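- Aligns the `Day` column with the `Day` column of the air quality dataset, which `aqidf.info()` shows is stored as `int64`. In the weather data, `Day` was read as `float64` only because it contained missing values.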
- Prevents issues during filtering, comparisons, or merging operations that depend on date alignment.

This step is part of standardizing temporal fields across datasets before merging.


In [9]:
weatherdf['Day'] = weatherdf['Day'].astype(int)
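
The two cleaning steps above can be sketched on a toy frame. The data is made up for illustration, and the explicit `"int64"` target is an assumption chosen so the result does not depend on the platform's default integer width (the notebook's own output later shows `int32` for `Day`).

```python
import numpy as np
import pandas as pd

# Toy frame: a single missing Day forces the whole column to float64.
toy_weather = pd.DataFrame(
    {"Day": [1.0, 2.0, np.nan, 4.0], "T": ["19.3", "19.5", None, None]}
)

# Drop rows without a Day, then cast the remaining values to a fixed-width integer.
toy_weather = toy_weather.dropna(subset=["Day"]).copy()
toy_weather["Day"] = toy_weather["Day"].astype("int64")

print(toy_weather["Day"].tolist())  # [1, 2, 4]
```

The cast has to come after the `dropna`, because `NaN` cannot be converted to a plain integer.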

## Merging Weather and Air Quality Datasets

To build a unified dataset for AQI prediction, we merge the **weather** and **air quality** datasets based on common temporal keys.

### Merge Details:
- **Keys Used**: `Year`, `Month`, and `Day`
- **Merge Type**: `inner`
  - Ensures only rows with matching dates in both datasets are retained.

### Purpose:
- Combines meteorological features (from `weatherdf`) with pollutant and AQI values (from `aqidf`).
- Creates a single dataset (`merge_df`) that can be used for feature engineering and model training.

This merged dataset forms the foundation for all subsequent analysis and modeling tasks.


In [10]:
merge_df = pd.merge(weatherdf, aqidf, on=['Year','Month','Day'], how='inner')
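
A small sketch of how the inner merge behaves, using made-up frames. The outer merge with `indicator=True` is not part of the original notebook; it is shown here only as one way to audit how many dates fail to match.

```python
import pandas as pd

# Toy frames with partially overlapping dates (values are illustrative).
toy_weather = pd.DataFrame(
    {"Year": [2016, 2016, 2016], "Month": [1, 1, 1], "Day": [1, 2, 3], "T": [20.1, 20.8, 21.4]}
)
toy_aqi = pd.DataFrame(
    {"Year": [2016, 2016, 2016], "Month": [1, 1, 1], "Day": [2, 3, 4], "pm10": [79, 72, 76]}
)
keys = ["Year", "Month", "Day"]

# Inner merge: only dates present in both frames survive.
inner = pd.merge(toy_weather, toy_aqi, on=keys, how="inner")

# Outer merge with indicator=True adds a `_merge` column that labels each row
# as "both", "left_only", or "right_only".
audit = pd.merge(toy_weather, toy_aqi, on=keys, how="outer", indicator=True)

print(len(inner))
print(audit["_merge"].value_counts())
```

Checking the `_merge` counts before committing to `how='inner'` shows how many rows from each side would be discarded.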

## Previewing the Merged Dataset

After merging the weather and air quality datasets, we use `head()` to examine the first few rows of the resulting DataFrame `merge_df`.

### Purpose:
- To verify that the merge was successful.
- To ensure that both weather and air quality features are present and properly aligned.
- To check for any inconsistencies or issues introduced during the merge process.

This step confirms the integrity of the combined dataset before proceeding with cleaning, feature engineering, or modeling.


In [11]:
merge_df.head()

Unnamed: 0,Year,Month,Day,T,TM,Tm,SLP,H,PP,VV,...,VM,VG,RA,SN,TS,FG,pm10,no2,so2,co
0,2015,12,29,,,,,,,,...,,,,,,,,48.0,6.0,17.0
1,2015,12,30,,,,,,,,...,,,,,,,49.0,54.0,7.0,21.0
2,2015,12,31,-,-,-,-,-,-,-,...,-,-,-,-,-,-,62.0,,,
3,2016,1,2,20.8,28.3,12.9,-,45,0,6.3,...,5.4,-,,,,,,60.0,9.0,15.0
4,2016,1,3,21.4,29,13.7,-,45,0,6.3,...,3.5,-,,,,,79.0,45.0,10.0,13.0


## Inspecting the End of the Merged Dataset

We use `tail()` to view the last few rows of the merged DataFrame `merge_df`.

### Purpose:
- To confirm that the dataset extends consistently across the expected date range.
- To identify any irregularities or missing data toward the end of the dataset.
- To ensure temporal consistency in both weather and air quality features after merging.

This helps validate the completeness of the dataset and ensures it's ready for further preprocessing.


In [12]:
merge_df.tail()

Unnamed: 0,Year,Month,Day,T,TM,Tm,SLP,H,PP,VV,...,VM,VG,RA,SN,TS,FG,pm10,no2,so2,co
2793,2025,3,20,-,-,-,-,-,-,-,...,-,-,-,-,-,-,70,12.0,4.0,12.0
2794,2025,3,21,-,-,-,-,-,-,-,...,-,-,-,-,-,-,70,11.0,4.0,12.0
2795,2025,3,22,-,-,-,-,-,-,-,...,-,-,-,-,-,-,69,11.0,3.0,14.0
2796,2025,3,23,-,-,-,-,-,-,-,...,-,-,-,-,-,-,66,,,
2797,2025,3,23,-,-,-,-,-,-,-,...,-,-,-,-,-,-,72,11.2,3.5,19.9


## Handling Missing Values in the Merged Dataset

To reduce gaps before further cleaning, we apply forward fill followed by backward fill to every column of the merged dataset.

### Operations:
- `ffill(inplace=True)`: **Forward fill** propagates the last valid value forward into missing entries.
- `bfill(inplace=True)`: **Backward fill** fills entries that are still missing (for example at the start of the data) with the next valid value.

### Purpose:
- Removes all `NaN` entries without dropping rows, as the `info()` output below confirms (2798 non-null values in every column).
- Fills each gap from the nearest earlier day, or from the nearest later day when no earlier value exists.

### Caveats:
- The string placeholder `"-"` counts as a valid value at this point. It is not filled; it is propagated like any other value and is only converted to `NaN` in a later step.
- In the event columns (`RA`, `TS`, `FG`), the `o` marker is also carried into days that originally had no entry, which likely inflates the event counts seen later.


In [13]:
merge_df.ffill(inplace=True)
merge_df.bfill(inplace=True)
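
The order of filling and placeholder replacement matters. The sketch below uses a made-up column to compare the notebook's order (fill first) with the alternative (replace `"-"` first); the alternative is an illustration, not what the notebook runs.

```python
import numpy as np
import pandas as pd

# Toy column mixing a real reading, true gaps, and the "-" placeholder.
toy = pd.DataFrame({"T": ["20.8", np.nan, "-", np.nan]})

# Notebook order: fill first. "-" is treated as a valid value and gets propagated.
filled_first = toy.ffill().bfill()

# Alternative order: turn "-" into NaN first, then fill.
replaced_first = toy.replace("-", np.nan).ffill().bfill()

print(filled_first["T"].tolist())    # ['20.8', '20.8', '-', '-']
print(replaced_first["T"].tolist())  # ['20.8', '20.8', '20.8', '20.8']
```

With the fill-first order, rows that inherit `"-"` become `NaN` again once the placeholder is replaced, so a second round of imputation is needed afterwards.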

In [14]:
merge_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2798 entries, 0 to 2797
Data columns (total 21 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Year    2798 non-null   int64 
 1   Month   2798 non-null   int64 
 2   Day     2798 non-null   int32 
 3   T       2798 non-null   object
 4   TM      2798 non-null   object
 5   Tm      2798 non-null   object
 6   SLP     2798 non-null   object
 7   H       2798 non-null   object
 8   PP      2798 non-null   object
 9   VV      2798 non-null   object
 10  V       2798 non-null   object
 11  VM      2798 non-null   object
 12  VG      2798 non-null   object
 13  RA      2798 non-null   object
 14  SN      2798 non-null   object
 15  TS      2798 non-null   object
 16  FG      2798 non-null   object
 17   pm10   2798 non-null   object
 18   no2    2798 non-null   object
 19   so2    2798 non-null   object
 20   co     2798 non-null   object
dtypes: int32(1), int64(2), object(18)
memory usage: 448.2+ KB


## Replacing Placeholder Values with NaN

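In the merged dataset, missing or unavailable readings are often recorded as the string `"-"` (visible in the previews above) rather than as `NaN`. Because pandas treats `"-"` as ordinary text, the earlier fill step did not touch these entries. We replace the placeholder to standardize how missing values are represented.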

### Operation:
- `replace("-", np.nan, inplace=True)`: Converts all occurrences of `"-"` to `np.nan`.

### Purpose:
- Ensures consistency in how missing values are handled.
- Allows functions like `.fillna()`, `.dropna()`, and imputation strategies to work correctly.
- Prepares the dataset for numeric conversion and further analysis.

This step is crucial before converting columns to numeric types or applying any statistical methods.


In [15]:
merge_df.replace("-", np.nan, inplace=True)

## Dropping Unnecessary Columns

To streamline the dataset and focus on relevant features for AQI prediction, we remove columns that are not useful for modeling.

### Operation:
- `drop(columns=['SLP', 'VG', 'SN'], inplace=True)`

### Purpose:
- **`SLP` (Sea Level Pressure)** and **`VG` (Wind Gust)** show only the `-` placeholder in the previews above, and **`SN` (Snowfall)** has very few recorded entries (314 non-null values in the raw weather data), so these columns carry little usable information.
- Reduces dimensionality and potential noise in the model.
- Simplifies the feature set for more efficient processing and interpretation.

This cleanup step ensures that only meaningful features are retained for further analysis and modeling.


In [16]:
merge_df.drop(columns=['SLP','VG','SN'],inplace=True)

## Removing Data from the Year 2015

To ensure the dataset contains only reliable and relevant observations, we remove all rows corresponding to the year 2015.

### Operation:
- `merge_df.drop(merge_df[merge_df['Year'] == 2015].index, inplace=True)`

### Purpose:
- The merged preview shows that the 2015 rows cover only the last days of December and have empty or `-` weather fields.
- Dropping them starts the dataset in January 2016, where weather readings are present.
- Prevents sparse early records from being filled with values borrowed from later dates.

This step refines the temporal scope of the dataset and improves overall data quality.


In [17]:
merge_df.drop(merge_df[merge_df['Year']==2015].index,inplace=True)
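
An equivalent way to express the same filter is a boolean mask. The sketch below uses made-up rows; the `reset_index(drop=True)` call is an optional addition that is not in the original cell.

```python
import pandas as pd

# Toy frame with two 2015 rows followed by later years (values are illustrative).
toy = pd.DataFrame({"Year": [2015, 2015, 2016, 2017], "pm10": [49.0, 62.0, 79.0, 72.0]})

# Keep every row whose Year is not 2015, then renumber the index from 0.
filtered = toy[toy["Year"] != 2015].reset_index(drop=True)

print(filtered)
```

Without the index reset, the surviving rows keep their original labels, which is harmless here because the dataset is later written with `index=False`.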

## Exploring Rainfall (`RA`) Value Distribution

To see how the `RA` (rain) column is encoded in the merged dataset, we call `value_counts()` on it.

### Purpose:
- Identify which distinct values occur and how often.
- Check whether the column holds measured amounts or only an occurrence marker.
- Spot anomalies or inconsistent entries that may need cleaning or conversion.

The output below shows a single distinct value, `o`, so `RA` behaves as an occurrence flag rather than a rainfall amount. `value_counts()` excludes `NaN` by default, so rows without the marker are not counted here.


In [18]:
merge_df['RA'].value_counts()

RA
o    2548
Name: count, dtype: int64

## Cleaning and Converting the Rainfall (`RA`) Column

The `RA` column holds only the text marker `o` and missing values, so we convert it to a numeric 0/1 flag.

### Steps:
1. **Replace the Marker**:
   - `'o'` (which appears to mark that rain was recorded on that day) is replaced with `1` using `replace({'o': 1})`.

2. **Fill Missing Values**:
   - Missing rainfall values (`NaN`) are filled with `0`, assuming no rainfall on those days.

3. **Convert to Integer**:
   - The column is cast to `int` to ensure it holds clean, numeric values for modeling.

### Purpose:
- Ensures the `RA` column is numeric, consistent, and free of invalid entries.
- Prepares the feature for numerical analysis and machine learning input.


In [19]:
merge_df['RA']=merge_df['RA'].replace({'o':1})
merge_df['RA']= merge_df['RA'].fillna(0)
merge_df['RA'] = merge_df['RA'].astype(int)
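
The same marker-to-flag conversion can be sketched as a single comparison. `to_flag` is a hypothetical helper introduced only for this illustration, and the data is made up.

```python
import numpy as np
import pandas as pd


def to_flag(series: pd.Series, marker: str = "o") -> pd.Series:
    """Hypothetical helper: 1 where the marker is present, 0 otherwise (including NaN)."""
    return series.eq(marker).astype(int)


# Toy column with the "o" marker and missing entries.
toy = pd.DataFrame({"RA": ["o", np.nan, "o", np.nan]})
toy["RA"] = to_flag(toy["RA"])

print(toy["RA"].tolist())  # [1, 0, 1, 0]
```

One difference from the three-step version: any unexpected value is mapped to `0` here, whereas the `astype(int)` call in the notebook would raise an error on leftover text. Which behavior is preferable depends on how much the input is trusted.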

## Exploring Thunderstorm (`TS`) Value Distribution

To examine how thunderstorm data is recorded in the dataset, we use `value_counts()` on the `TS` column.

### Purpose:
- Understand how frequently thunderstorms (`TS`) occur in the dataset.
- Identify patterns such as binary indicators (`0` for no thunderstorm, `1` for presence).
- Detect any non-standard entries or missing values that may need cleaning.

This step helps determine whether the `TS` column can be used as-is or requires transformation before modeling.


In [20]:
merge_df['TS'].value_counts()

TS
o    2350
Name: count, dtype: int64

## Cleaning and Converting the Thunderstorm (`TS`) Column

The `TS` column, which indicates thunderstorm activity, holds only the text marker `o` and missing values. We convert it to a numeric 0/1 flag in the same way as `RA`.

### Steps:
1. **Replace the Marker**:
   - Replace `'o'` (likely representing the presence of a thunderstorm) with `1`.

2. **Fill Missing Values**:
   - Fill all `NaN` values with `0`, assuming no thunderstorm occurred when no marker is present.

3. **Convert to Integer**:
   - Cast the column to `int` so it is consistent and compatible with modeling tools.

### Purpose:
- Creates a clean, binary indicator for thunderstorm presence.
- Ensures the feature is ready for machine learning as a numeric variable.


In [None]:
merge_df['TS']=merge_df['TS'].replace({'o':1})
merge_df['TS']= merge_df['TS'].fillna(0)
merge_df['TS'] = merge_df['TS'].astype(int)

## Exploring Fog (`FG`) Value Distribution

To assess how fog presence is recorded in the dataset, we use `value_counts()` on the `FG` (Fog) column.

### Purpose:
- Understand the frequency of foggy days in the dataset.
- Detect whether fog data is encoded as binary (`0` for no fog, `1` for fog) or uses other representations.
- Identify any unusual or inconsistent values that may require cleaning.

This step provides insight into the distribution and reliability of fog-related data for potential use as a predictive feature.


In [22]:
merge_df['FG'].value_counts()

FG
o    1609
Name: count, dtype: int64

## Cleaning and Converting the Fog (`FG`) Column

The `FG` column, representing fog presence, contains inconsistent and missing values. We standardize it for use in analysis and modeling.

### Steps:
1. **Replace Non-Standard Values**:
   - Replace `'o'` (likely indicating fog) with `1`.

2. **Fill Missing Values**:
   - Fill all `NaN` values with `0`, assuming no fog when data is missing.

3. **Convert to Integer**:
   - Convert the column to `int` type to ensure a consistent binary representation.

### Purpose:
- Transforms the `FG` column into a clean binary indicator (1 for fog, 0 for no fog).
- Ensures the feature is ready for modeling and compatible with numerical processing tools.


In [23]:
merge_df['FG']=merge_df['FG'].replace({'o':1})
merge_df['FG']= merge_df['FG'].fillna(0)
merge_df['FG'] = merge_df['FG'].astype(int)

## Dropping Rows with Missing Key Weather Features

To ensure data quality and consistency for modeling, we remove any rows that are missing critical weather features.

### Operation:
- `dropna(subset=['T', 'VV', 'VM'])`: Drops rows where any of the following columns contain `NaN`:
  - `T`: Average Temperature
  - `VV`: Visibility
  - `VM`: Maximum Wind Speed

### Purpose:
- These features are essential for understanding weather conditions and their effect on air quality.
- Ensures the model is trained on complete and reliable data.
- Prevents errors during feature engineering or training due to missing values.

This step finalizes the dataset with only complete records for key environmental indicators.


In [24]:
merge_df = merge_df.dropna(subset=['T','VV','VM'])

## Exploring Unique Values Across All Columns

To gain a comprehensive understanding of the data, we extract and examine the unique values present in each column of the merged dataset.

### Operation:
- `merge_df.apply(pd.Series.unique)`: Applies the `unique()` function to each column to retrieve all distinct values.

### Purpose:
- Identify categorical or binary features (e.g., `FG`, `TS`).
- Detect potential data quality issues or inconsistencies.
- Understand the range and diversity of values, which can inform encoding decisions and feature transformations.

This step is useful for both exploratory data analysis and preparing the dataset for machine learning.


In [25]:
unique_values = merge_df.apply(pd.Series.unique)
print(unique_values)

Year     [2016, 2017, 2018, 2019, 2020, 2021, 2022, 202...
Month              [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 5]
Day      [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 17, 18, 19, 2...
T        [20.8, 21.4, 21, 20.4, 20.7, 21.2, 22.6, 22.9,...
TM       [28.3, 29, 28.7, 27.6, 28.4, 25.9, 27.4, 29.2,...
Tm       [12.9, 13.7, 17, 16.2, 15.1, 17.8, 19.4, 19, 1...
H        [45, 65, 61, 47, 73, 88, 78, 70, 68, 66, 43, 3...
PP       [0, 0.51, 4.32, nan, 1.27, 3.3, 0.76, 7.11, 23...
VV       [6.3, 5.5, 4.8, 4, 5.3, 6.9, 5.6, 6.6, 7.4, 7....
V        [1.1, 0.4, 3.1, 1.3, 0.9, 3, 2.8, 2.6, 1.5, 2....
VM       [5.4, 3.5, 7.6, 37, 11.1, 14.8, 18.3, 9.4, 51....
RA                                                  [0, 1]
TS                                                  [0, 1]
FG                                                  [0, 1]
 pm10    [ , 79, 72, 76, 65, 57, 52, 54, 51, 43, 55, 39...
 no2     [60, 45, 42, 25, 36, 33, 28,  , 41, 27, 40, 31...
 so2     [9, 10, 11, 14, 26, 15, 17, 18, 20,  , 30, 32,.
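
The output above shows blank entries (a single space) among the pollutant values. A minimal sketch for counting such entries per column follows; `count_blank` is a hypothetical helper and the frame is made up.

```python
import pandas as pd

# Toy frame imitating the pollutant columns, including whitespace-only entries.
toy = pd.DataFrame(
    {" pm10": [" ", "79", "72"], " no2": ["60", "45", " "], "RA": [0, 1, 0]}
)


def count_blank(series: pd.Series) -> int:
    """Hypothetical helper: number of entries that are empty or whitespace-only."""
    return int(series.astype(str).str.strip().eq("").sum())


blank_counts = {col: count_blank(toy[col]) for col in toy.columns}
print(blank_counts)
```

A per-column count is easier to scan than the full list of unique values when the goal is only to find text that will not convert to a number.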

## Converting All Columns to Numeric Format

To ensure the dataset is fully numeric and compatible with machine learning models, we convert all columns to numeric types.

### Operation:
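- `merge_df.apply(pd.to_numeric, errors='coerce')`: Attempts to convert each column to a numeric type.
  - Entries that cannot be parsed as numbers are coerced to `NaN`. The unique-value listing above shows that such entries exist: the pollutant columns contain blank strings.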

### Purpose:
- Ensures uniform data types across the dataset.
- Prepares the dataset for statistical analysis and modeling.
- Handles any leftover non-numeric or malformed entries gracefully by converting them to `NaN`.

This step is especially useful after replacing text-based placeholders and ensures the dataset is clean and fully numerical.


In [26]:
merge_df = merge_df.apply(pd.to_numeric, errors='coerce')
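
Because `errors='coerce'` silently turns unparseable entries into `NaN`, it can help to count how many values were lost. The sketch below does this on a made-up frame; the before/after comparison is an addition for illustration and is not in the original notebook.

```python
import pandas as pd

# Toy frame: one blank pollutant entry, otherwise numeric text.
toy = pd.DataFrame({"pm10": ["79", " ", "72"], "T": ["20.8", "21.4", "21"]})

missing_before = toy.isna().sum()

# Same call as in the notebook: unparseable entries become NaN.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# Number of values per column that became NaN during conversion.
newly_missing = numeric.isna().sum() - missing_before
print(newly_missing)
```

A large count in any column would suggest a formatting problem worth inspecting before the values are imputed.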

## Final Imputation of Missing Values

After converting all columns to numeric types, we perform a final round of missing value imputation using forward and backward fill strategies.

### Steps:
1. **Create a Copy**:
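   - `merge_df.copy()`: Makes an explicit copy of the DataFrame before the in-place fills. Since `merge_df` was just reassigned from the result of `apply()`, this is a defensive step rather than a strict requirement.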

2. **Forward Fill (`ffill`)**:
   - Propagates the last valid observation forward to fill missing values.

3. **Backward Fill (`bfill`)**:
   - Fills any remaining missing values using the next valid observation.

### Purpose:
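- Ensures a fully populated dataset with no missing values, including the entries that became `NaN` when `"-"` and blank strings were converted.
- Fills each gap from the nearest earlier day. Long runs of missing readings therefore repeat the same value across consecutive days, which is worth keeping in mind when interpreting the weather features.
- Prepares the final dataset for feature engineering and model training.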

This step concludes the data cleaning process, ensuring the dataset is consistent, complete, and ready for modeling.



In [27]:
merge_df = merge_df.copy()
merge_df.ffill(inplace=True)
merge_df.bfill(inplace=True)

## Saving the Cleaned and Merged Dataset

After completing all cleaning, preprocessing, and imputation steps, we save the final dataset to a CSV file for future use.

### Operation:
- `to_csv('merged_df.csv', index=False)`: Exports the DataFrame to a CSV file named `merged_df.csv` without row indices.

### Purpose:
- Preserves the cleaned dataset for use in modeling, analysis, or deployment.
- Allows easy reloading of preprocessed data without repeating the cleaning workflow.

This step marks the completion of the data preparation phase and creates a reproducible output for downstream tasks.


In [28]:
merge_df.to_csv('merged_df.csv', index=False)

## Loading the Final Preprocessed Dataset

To begin the modeling phase, we load the cleaned and fully processed dataset from the saved CSV file.

### Operation:
- `pd.read_csv('merged_df.csv')`: Reads the cleaned dataset into a new DataFrame named `df`.

### Purpose:
- Ensures a fresh start using the final prepared dataset.
- Avoids re-running all preprocessing steps.
- Ready for feature engineering, exploratory data analysis, and model training.

This marks the transition from data preparation to analysis and predictive modeling.


In [36]:
df = pd.read_csv('merged_df.csv')

## Previewing the Final Dataset

We use `head()` to inspect the first few rows of the fully cleaned and merged dataset loaded into `df`.

### Purpose:
- Verify the structure and content of the final dataset.
- Confirm that all relevant features (date components, weather variables, and pollutant readings) are present and in the expected format. The `AQI` column does not exist yet; it is computed in a later step.
- Ensure that the dataset is ready for feature engineering and model training.

This final check confirms the dataset's integrity before proceeding with modeling tasks.


In [37]:
df.head()

Unnamed: 0,Year,Month,Day,T,TM,Tm,H,PP,VV,V,VM,RA,TS,FG,pm10,no2,so2,co
0,2016,1,2,20.8,28.3,12.9,45,0.0,6.3,1.1,5.4,0,0,0,79.0,60.0,9.0,15.0
1,2016,1,3,21.4,29.0,13.7,45,0.0,6.3,0.4,3.5,0,0,0,79.0,45.0,10.0,13.0
2,2016,1,4,21.4,29.0,13.7,45,0.0,6.3,0.4,3.5,0,0,0,72.0,45.0,11.0,13.0
3,2016,1,5,21.4,29.0,13.7,45,0.0,6.3,0.4,3.5,0,0,0,76.0,42.0,14.0,11.0
4,2016,1,6,21.4,29.0,13.7,45,0.0,6.3,0.4,3.5,0,0,0,65.0,25.0,26.0,6.0


## Inspecting Column Names in the Final Dataset

We check the list of column names in the DataFrame to identify available features and detect any formatting issues.

### Columns Present:
- **Date Features**: `Year`, `Month`, `Day`
- **Weather Features**: `T`, `TM`, `Tm`, `H`, `PP`, `VV`, `V`, `VM`, `RA`, `TS`, `FG`
- **Air Pollutants**: `' pm10'`, `' no2'`, `' so2'`, `' co'`

### Observations:
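- The pollutant columns have **leading spaces** in their names (`' pm10'`, `' no2'`, `' so2'`, `' co'`). The space must be included whenever these columns are selected; otherwise pandas raises a `KeyError`.

The following cells keep these names unchanged, so the AQI computation refers to the columns with the leading space. Stripping the names would be a sensible cleanup, provided later references are updated to match.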


In [38]:
df.columns

Index(['Year', 'Month', 'Day', 'T', 'TM', 'Tm', 'H', 'PP', 'VV', 'V', 'VM',
       'RA', 'TS', 'FG', ' pm10', ' no2', ' so2', ' co'],
      dtype='object')
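
A minimal sketch of stripping the header whitespace, shown on a made-up one-row frame. The notebook itself does not perform this step, so adopting it would require changing later references such as `' pm10'` to `'pm10'`.

```python
import pandas as pd

# Toy frame with the same kind of leading-space column names.
toy = pd.DataFrame({"Year": [2016], " pm10": [79.0], " no2": [60.0]})

# str.strip() removes leading and trailing whitespace from every column name.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))  # ['Year', 'pm10', 'no2']
```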

## Computing the Air Quality Index (AQI)

To derive a target variable for prediction, we calculate the **Air Quality Index (AQI)** based on the maximum value among key pollutant concentrations.

### Steps:
1. **Select Pollutant Columns**:
   - `[' pm10', ' no2', ' so2', ' co']` represent concentrations of various pollutants.
   - These columns currently contain leading spaces and should be cleaned later for consistency.

2. **Calculate AQI**:
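   - `df['AQI'] = df[aqi_columns].max(axis=1)` assigns the largest of the four pollutant values in each row as the AQI value.
   - This is a simplification. A standard AQI is the maximum of per-pollutant sub-indices, each obtained by mapping a concentration onto an index scale through breakpoint tables. Taking the maximum of the raw columns matches that definition only if the columns already hold sub-index values on a common scale.

### Purpose:
- Creates a target variable (`AQI`) for supervised machine learning.
- Enables modeling of AQI as a function of weather features. Because `AQI` is computed directly from the pollutant columns, including those columns as predictors would let a model reproduce the target trivially.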



In [39]:
aqi_columns = [' pm10', ' no2', ' so2', ' co']
df['AQI']= df[aqi_columns].max(axis=1)
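
The row-wise maximum can be sketched on a few rows. The first two rows reuse values from the preview above; the third row is hypothetical and added so that a pollutant other than PM10 is the largest. The `dominant` column is an illustrative addition that is not in the notebook.

```python
import pandas as pd

aqi_columns = [" pm10", " no2", " so2", " co"]

toy = pd.DataFrame(
    {
        " pm10": [79.0, 65.0, 40.0],  # third value is hypothetical
        " no2": [60.0, 25.0, 55.0],
        " so2": [9.0, 26.0, 12.0],
        " co": [15.0, 6.0, 8.0],
    }
)

# Row-wise maximum across the pollutant columns, as in the notebook.
toy["AQI"] = toy[aqi_columns].max(axis=1)

# Name of the column that supplied the maximum in each row.
toy["dominant"] = toy[aqi_columns].idxmax(axis=1).str.strip()

print(toy[["AQI", "dominant"]])
```

Tracking the dominant pollutant shows how often the target is simply a copy of one column, which in the preview rows is PM10.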

## Saving the Final Dataset with Computed AQI

After calculating the AQI column, we save the updated dataset to a new CSV file for use in modeling and analysis.

### Operation:
- `df.to_csv('aqidataset.csv', index=False)`: Exports the dataset, including the new `AQI` column, to a CSV file named `aqidataset.csv` without row indices.

### Purpose:
- Preserves the final, feature-rich dataset with the AQI target variable.
- Enables consistent and reusable input for model development pipelines.
- Avoids repeating preprocessing and AQI computation steps in the future.

This concludes the data preparation process and finalizes the dataset for machine learning tasks.


In [40]:
df.to_csv('aqidataset.csv', index=False)

## Loading the Final AQI Dataset for Modeling

We load the dataset containing all cleaned features and the computed `AQI` target variable.

### Operation:
- `pd.read_csv('aqidataset.csv')`: Reads the previously saved CSV file into a DataFrame named `df`.

### Purpose:
- Marks the starting point for model training and evaluation.
- Ensures all features and the `AQI` column are readily available for analysis and machine learning.
- Avoids repeating any previous preprocessing steps.

This step initiates the modeling phase using the finalized dataset.


In [2]:
df=pd.read_csv('aqidataset.csv')

In [3]:
df.head()

Unnamed: 0,Year,Month,Day,T,TM,Tm,H,PP,VV,V,VM,RA,TS,FG,pm10,no2,so2,co,AQI
0,2016,1,2,20.8,28.3,12.9,45,0.0,6.3,1.1,5.4,0,0,0,79.0,60.0,9.0,15.0,79.0
1,2016,1,3,21.4,29.0,13.7,45,0.0,6.3,0.4,3.5,0,0,0,79.0,45.0,10.0,13.0,79.0
2,2016,1,4,21.4,29.0,13.7,45,0.0,6.3,0.4,3.5,0,0,0,72.0,45.0,11.0,13.0,72.0
3,2016,1,5,21.4,29.0,13.7,45,0.0,6.3,0.4,3.5,0,0,0,76.0,42.0,14.0,11.0,76.0
4,2016,1,6,21.4,29.0,13.7,45,0.0,6.3,0.4,3.5,0,0,0,65.0,25.0,26.0,6.0,65.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2726 entries, 0 to 2725
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Year    2726 non-null   int64  
 1   Month   2726 non-null   int64  
 2   Day     2726 non-null   int64  
 3   T       2726 non-null   float64
 4   TM      2726 non-null   float64
 5   Tm      2726 non-null   float64
 6   H       2726 non-null   int64  
 7   PP      2726 non-null   float64
 8   VV      2726 non-null   float64
 9   V       2726 non-null   float64
 10  VM      2726 non-null   float64
 11  RA      2726 non-null   int64  
 12  TS      2726 non-null   int64  
 13  FG      2726 non-null   int64  
 14   pm10   2726 non-null   float64
 15   no2    2726 non-null   float64
 16   so2    2726 non-null   float64
 17   co     2726 non-null   float64
 18  AQI     2726 non-null   float64
dtypes: float64(12), int64(7)
memory usage: 404.8 KB
