#  Project Title: Air Quality Index (AQI) Prediction Using Machine Learning

##  Introduction

Air pollution has become one of the most serious environmental issues affecting public health and climate.  
The **Air Quality Index (AQI)** is a standardized measure used to describe the level of air pollution in a specific region.  
It indicates how clean or polluted the air is, and what health effects it may cause to humans.

This project aims to **predict the AQI** of a location using various **machine learning regression models** such as **Decision Tree, Random Forest, and XGBoost**.  
The prediction is based on key atmospheric parameters like **PM2.5, PM10, NO₂, SO₂, CO, O₃, NH₃, Benzene, and Toluene**.

By training models on historical air quality data, the system learns complex relationships between pollutant concentrations and AQI values.  
Once trained, it can estimate AQI from new pollutant readings, whether real-time measurements or assumed values for upcoming dates, helping authorities and citizens take timely preventive action.


##  Objectives

- To analyze air quality data and understand pollution trends.  
- To build and compare multiple regression models for AQI prediction.  
- To evaluate model performance using metrics like **MAE**, **RMSE**, and **R² Score**.  
- To visualize the relationship between **actual** and **predicted** AQI values.  
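
The three metrics named in the objectives can be written out directly: MAE is the mean of |actual − predicted|, RMSE is the square root of the mean squared error, and R² is 1 − SS_res / SS_tot. The block below is a minimal plain-Python sketch of those formulas. `regression_metrics` is a hypothetical helper, and the sample values are illustrative rather than taken from `city_day.csv`; in the notebook itself the scikit-learn functions `mean_absolute_error`, `mean_squared_error`, and `r2_score` do this work.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE and R2 for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n            # mean absolute error
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)                     # root mean squared error

    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # R2 is undefined when every actual value is identical; report NaN in that case.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Illustrative values only, not taken from the dataset.
actual_aqi = [100.0, 150.0, 200.0, 250.0]
predicted_aqi = [110.0, 140.0, 210.0, 240.0]
metrics = regression_metrics(actual_aqi, predicted_aqi)
print(metrics)
```

Because RMSE squares the errors before averaging, a few large misses raise it more than they raise MAE, which is why the two are usually reported together.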

##  Outcome

The final model provides an **efficient and accurate AQI prediction system** that can assist environmental agencies, urban planners, and the general public in **monitoring and improving air quality**.


## 👥 Team Members

| No. | Name |
|-----|-------------------|
| 1.  | Manav Patel |
| 2.  | Het Bhatt |
| 3.  | Varmil Parikh |


##  Import Required Libraries for Data Analysis & Model Building


In [None]:
import joblib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn as sk
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    f1_score, precision_score, recall_score,
    mean_absolute_error, mean_squared_error, r2_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    LabelEncoder, MinMaxScaler, OrdinalEncoder, PolynomialFeatures, StandardScaler,
)
from sklearn.tree import DecisionTreeRegressor

##  Load Dataset (`city_day.csv`) and Display Initial Records


In [None]:
df = pd.read_csv('city_day.csv')
df.head()

##  Check Total Number of Missing Values in the Dataset


In [None]:
df.isnull().sum().sum()

##  Check Missing Values Column-wise


In [None]:
df.isnull().sum()

##  Handle Missing Values (Numeric → Mean, Categorical → Mode)


In [None]:
import pandas as pd

# Reload the raw data so this cell can be re-run on its own.
df = pd.read_csv("city_day.csv")
print("Missing values before filling:")
print(df.isnull().sum())

# Numeric columns (the pollutants and the AQI target) are filled with the column mean.
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Text columns (for example AQI_Bucket) are filled with their most frequent value.
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print("\nMissing values after filling:")
print(df.isnull().sum())
print("\nDataset shape:", df.shape)

##  Generate Statistical Summary of the Dataset


In [None]:
df.describe()

##  Encode Categorical Columns ('City' & 'AQI_Bucket')


In [None]:
from sklearn.preprocessing import LabelEncoder
le_city = LabelEncoder()
df['City'] = le_city.fit_transform(df['City'])
aqi_mapping = {
    'Good': 0,
    'Satisfactory': 1,
    'Moderate': 2,
    'Poor': 3,
    'Very Poor': 4,
    'Severe': 5
}
df['AQI_Bucket'] = df['AQI_Bucket'].map(aqi_mapping)
print(df.head())
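
The ordinal codes in `aqi_mapping` follow the same band order as the thresholds used by `get_aqi_category_and_suggestion` in the prediction cell of this notebook (50, 100, 200, 300, 400). Assuming the `AQI_Bucket` labels in the CSV follow those bands, a bucket code can also be derived straight from an AQI value. The sketch below uses only the standard library; `aqi_to_bucket_code` and `aqi_to_bucket_name` are hypothetical helpers, not part of the original notebook.

```python
from bisect import bisect_left

# Upper bounds of the first five bands, copied from the thresholds used in this
# notebook's prediction cell; anything above 400 falls into "Severe".
BAND_UPPER_BOUNDS = [50, 100, 200, 300, 400]
BAND_NAMES = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]


def aqi_to_bucket_code(aqi):
    """Map an AQI value to the ordinal code 0..5 used by aqi_mapping."""
    # bisect_left keeps a value equal to a bound in the lower band (50 -> "Good").
    return bisect_left(BAND_UPPER_BOUNDS, aqi)


def aqi_to_bucket_name(aqi):
    """Map an AQI value to its band name."""
    return BAND_NAMES[aqi_to_bucket_code(aqi)]


for value in (42, 50, 51, 180, 301, 450):
    print(value, aqi_to_bucket_code(value), aqi_to_bucket_name(value))
```

One possible use is a consistency check: comparing `df['AQI'].map(aqi_to_bucket_code)` with the mapped `AQI_Bucket` column would show rows where a filled-in label no longer agrees with its AQI value.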


##  Standardize Numeric Features Using StandardScaler


In [None]:
from sklearn.preprocessing import StandardScaler
cols_to_scale = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 
                 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene']
scaler = StandardScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
print(df.head())
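
The cell above fits the scaler on the whole DataFrame, before the train/test split, so rows that later end up in the test set contribute to the mean and standard deviation. A common alternative is to learn the statistics from the training rows only and reuse them for the test rows. The plain-Python sketch below shows the arithmetic; `fit_standardizer` and `standardize` are hypothetical helpers and the readings are illustrative, not taken from the dataset. Like `StandardScaler`, it divides by the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training values only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply previously learned statistics to any set of values."""
    return [(v - mean) / std for v in values]


# Illustrative PM2.5 readings, not taken from the dataset.
train_pm25 = [40.0, 60.0, 80.0, 100.0]
test_pm25 = [70.0, 130.0]

mean, std = fit_standardizer(train_pm25)          # statistics come from training rows only
train_scaled = standardize(train_pm25, mean, std)
test_scaled = standardize(test_pm25, mean, std)   # test rows reuse the training statistics
print(mean, round(std, 4), [round(v, 3) for v in test_scaled])
```

With scikit-learn the same idea is `scaler.fit(X_train)` followed by `scaler.transform(X_test)`. Tree-based models such as the ones trained later split on thresholds, so standardizing the features is generally not expected to change their predictions much.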


##  Detect and Remove Outliers from AQI Using IQR Method


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
sns.boxplot(x=df['AQI'])
plt.title("AQI Before Removing Outliers")
plt.show()
Q1 = df['AQI'].quantile(0.25)
Q3 = df['AQI'].quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
print(f"AQI Lower Limit: {lower_limit}")
print(f"AQI Upper Limit: {upper_limit}")
df_no_outlier = df[(df['AQI'] >= lower_limit) & (df['AQI'] <= upper_limit)]
print(f"Original dataset shape: {df.shape}")
print(f"After removing outliers: {df_no_outlier.shape}")
plt.figure(figsize=(8,5))
sns.boxplot(x=df_no_outlier['AQI'])
plt.title("AQI After Removing Outliers")
plt.show()
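
The IQR rule used in the cell above keeps values inside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. The following standard-library sketch works through the rule on a small sample; `iqr_fences` is a hypothetical helper and the numbers are illustrative, not taken from the dataset. The `"inclusive"` method interpolates linearly between data points, which should agree with the default behaviour of `Series.quantile`.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative AQI values with one extreme reading.
sample_aqi = [10, 20, 30, 40, 50, 60, 70, 80, 400]
lower, upper = iqr_fences(sample_aqi)
kept = [v for v in sample_aqi if lower <= v <= upper]
print(lower, upper, kept)
```

Here Q1 is 30 and Q3 is 70, so the fences are −30 and 130 and only the reading of 400 is dropped. For AQI, high values can be genuine pollution episodes, so whether to remove them is a modelling choice rather than a pure data-cleaning step.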


##  Split Dataset into Features (X) and Target (y)


In [None]:
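# Use the outlier-filtered frame; drop the target, its derived label AQI_Bucket, and the raw Date string.
x = df_no_outlier.drop(columns=['AQI', 'AQI_Bucket', 'Date'])
y = df_no_outlier['AQI']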

##  Display All Feature Column Names


In [None]:
x.columns

##  Display Target Variable (y) Values


In [None]:
y

##  Split Data into Training and Testing Sets (80% / 20%)


In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
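
An 80% / 20% split is only repeatable when the shuffle behind it is seeded. The plain-Python sketch below illustrates the idea; `seeded_split` is a hypothetical helper, and it is not meant to reproduce the exact rows that `train_test_split` would select.

```python
import math
import random


def seeded_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out the first share as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)          # same seed -> same order on every run
    n_test = math.ceil(len(rows) * test_size)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))                            # stand-in for ten data rows
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)
print(len(train_a), len(test_a), test_a == test_b)
```

With a fixed seed, metrics from different runs and different models are computed on the same held-out rows, which makes them comparable.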

##  Train and Evaluate Decision Tree Regressor Model


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np

dt = DecisionTreeRegressor(max_depth=5, random_state=42)
dt.fit(X_train, y_train)

y_pred_train = dt.predict(X_train)
y_pred_test = dt.predict(X_test)

print("📘 Decision Tree Regressor - Training Data")
print(f"MAE: {mean_absolute_error(y_train, y_pred_train):.2f}")
print(f"MSE: {mean_squared_error(y_train, y_pred_train):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_train, y_pred_train)):.2f}")
print(f"R² Score: {r2_score(y_train, y_pred_train):.2f}")
print("\n📗 Decision Tree Regressor - Test Data")
print(f"MAE: {mean_absolute_error(y_test, y_pred_test):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred_test):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):.2f}")
print(f"R² Score: {r2_score(y_test, y_pred_test):.2f}")
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.scatter(y_train, y_pred_train, alpha=0.6, color='skyblue', edgecolors='k')
plt.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', linewidth=2)
plt.xlabel("Actual AQI")
plt.ylabel("Predicted AQI")
plt.title("Decision Tree AQI - Training Data")
plt.grid(True, linestyle='--', alpha=0.5)
plt.subplot(1,2,2)
plt.scatter(y_test, y_pred_test, alpha=0.6, color='lightgreen', edgecolors='k')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
plt.xlabel("Actual AQI")
plt.ylabel("Predicted AQI")
plt.title("Decision Tree AQI - Test Data")
plt.grid(True, linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()


##  Train and Evaluate Random Forest Regressor Model


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np
rf = RandomForestRegressor(
    n_estimators=200,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
y_pred_train_rf = rf.predict(X_train)
y_pred_test_rf = rf.predict(X_test)
print("📘 Random Forest Regressor - Training Data")
print(f"MAE: {mean_absolute_error(y_train, y_pred_train_rf):.2f}")
print(f"MSE: {mean_squared_error(y_train, y_pred_train_rf):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_train, y_pred_train_rf)):.2f}")
print(f"R² Score: {r2_score(y_train, y_pred_train_rf):.2f}")
print("\n📗 Random Forest Regressor - Test Data")
print(f"MAE: {mean_absolute_error(y_test, y_pred_test_rf):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred_test_rf):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test_rf)):.2f}")
print(f"R² Score: {r2_score(y_test, y_pred_test_rf):.2f}")
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.scatter(y_train, y_pred_train_rf, alpha=0.6, color='skyblue', edgecolors='k')
plt.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', linewidth=2)
plt.xlabel("Actual AQI")
plt.ylabel("Predicted AQI")
plt.title("Random Forest AQI - Training Data")
plt.grid(True, linestyle='--', alpha=0.5)
plt.subplot(1,2,2)
plt.scatter(y_test, y_pred_test_rf, alpha=0.6, color='lightgreen', edgecolors='k')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
plt.xlabel("Actual AQI")
plt.ylabel("Predicted AQI")
plt.title("Random Forest AQI - Test Data")
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


##  Train and Evaluate XGBoost Regressor Model


In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np
xgb = XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)
xgb.fit(X_train, y_train)
y_pred_train_xgb = xgb.predict(X_train)
y_pred_test_xgb = xgb.predict(X_test)
print("📘 XGBoost Regressor - Training Data")
print(f"MAE: {mean_absolute_error(y_train, y_pred_train_xgb):.2f}")
print(f"MSE: {mean_squared_error(y_train, y_pred_train_xgb):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_train, y_pred_train_xgb)):.2f}")
print(f"R² Score: {r2_score(y_train, y_pred_train_xgb):.2f}")
print("\n📗 XGBoost Regressor - Test Data")
print(f"MAE: {mean_absolute_error(y_test, y_pred_test_xgb):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred_test_xgb):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test_xgb)):.2f}")
print(f"R² Score: {r2_score(y_test, y_pred_test_xgb):.2f}")
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.scatter(y_train, y_pred_train_xgb, alpha=0.6, color='skyblue', edgecolors='k')
plt.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', linewidth=2)
plt.xlabel("Actual AQI")
plt.ylabel("Predicted AQI")
plt.title("XGBoost AQI - Training Data")
plt.grid(True, linestyle='--', alpha=0.5)
plt.subplot(1,2,2)
plt.scatter(y_test, y_pred_test_xgb, alpha=0.6, color='lightgreen', edgecolors='k')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
plt.xlabel("Actual AQI")
plt.ylabel("Predicted AQI")
plt.title("XGBoost AQI - Test Data")
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
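
The three cells above print their metrics separately. To compare the models side by side, one option is to rank them by test RMSE and look at the gap between training and test error, since a large gap suggests overfitting. The sketch below is plain Python; `rmse` and `rank_models` are hypothetical helpers, and the numbers are placeholders for illustration, not results from this notebook.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rank_models(predictions, y_train, y_test):
    """predictions maps a model name to (training predictions, test predictions)."""
    rows = []
    for name, (pred_train, pred_test) in predictions.items():
        train_rmse = rmse(y_train, pred_train)
        test_rmse = rmse(y_test, pred_test)
        rows.append({
            "model": name,
            "train_rmse": train_rmse,
            "test_rmse": test_rmse,
            "gap": test_rmse - train_rmse,    # large positive gap hints at overfitting
        })
    return sorted(rows, key=lambda row: row["test_rmse"])   # best test RMSE first


# Placeholder numbers for illustration only; not results from this notebook.
y_train_demo = [100.0, 200.0, 300.0]
y_test_demo = [150.0, 250.0]
demo_predictions = {
    "model_a": ([100.0, 200.0, 300.0], [120.0, 290.0]),   # perfect on training, worse on test
    "model_b": ([110.0, 190.0, 310.0], [160.0, 240.0]),   # similar error on both sets
}
ranking = rank_models(demo_predictions, y_train_demo, y_test_demo)
for row in ranking:
    print(row)
```

In the notebook, the dictionary could instead hold `(y_pred_train, y_pred_test)`, `(y_pred_train_rf, y_pred_test_rf)`, and `(y_pred_train_xgb, y_pred_test_xgb)`, with `y_train` and `y_test` as the reference values.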


##  Air Quality Index (AQI) Prediction Using XGBoost Regression Model


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor
from fuzzywuzzy import process
from termcolor import colored
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def get_aqi_category_and_suggestion(aqi_value):
    if aqi_value <= 50:
        return "Good", "Air quality is satisfactory - you can enjoy outdoor activities freely."
    elif aqi_value <= 100:
        return "Satisfactory", "Air quality is acceptable - sensitive people should limit long outdoor exposure."
    elif aqi_value <= 200:
        return "Moderate", "Consider reducing outdoor activities for long durations."
    elif aqi_value <= 300:
        return "Poor", "Avoid outdoor exercise and wear a mask if you go outside."
    elif aqi_value <= 400:
        return "Very Poor", "Air is unhealthy - stay indoors and use an air purifier if possible."
    else:
        return "Severe", "Serious health effects - avoid going outdoors and keep windows closed."

def get_aqi_range(category):
    ranges = {
        "Good": "0–50",
        "Satisfactory": "51–100",
        "Moderate": "101–200",
        "Poor": "201–300",
        "Very Poor": "301–400",
        "Severe": "401–500"
    }
    return ranges.get(category, "Unknown")

df = pd.read_csv("city_day.csv")
df['City'] = df['City'].astype(str).str.strip().str.title()
df = df.dropna(subset=['AQI'])
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df = df[(df['AQI'] >= 0) & (df['AQI'] <= 500)]

clip_bounds = {
    'PM2.5': (0, 500),
    'PM10': (0, 600),
    'NO': (0, 400),
    'NO2': (0, 400),
    'NOx': (0, 500),
    'NH3': (0, 300),
    'CO': (0, 10),
    'SO2': (0, 300),
    'O3': (0, 400),
    'Benzene': (0, 50),
    'Toluene': (0, 200),
    'Xylene': (0, 50),
}
for col, (lo, hi) in clip_bounds.items():
    if col in df.columns:
        df[col] = df[col].astype(float).clip(lower=lo, upper=hi)

le_city = LabelEncoder()
df['City'] = le_city.fit_transform(df['City'])

feature_cols = ['City','PM2.5','PM10','NO','NO2','NOx','NH3','CO','SO2','O3','Benzene','Toluene','Xylene','Year','Month','Day']
X = df[feature_cols]
y = df['AQI']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

xgb = XGBRegressor(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1,
    eval_metric='rmse'
)
xgb.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)


cities = sorted(le_city.classes_)
print("\nAvailable Cities in Dataset:\n")
for c in cities:
    print("-", c)

user_city = input("\nEnter your city name: ").strip().title()
best_match, score = process.extractOne(user_city, cities)
if score >= 80:
    city = best_match
    print(f"Using closest match: {city}")
else:
    print(f"{user_city} not recognized. Please enter a valid city.")
    raise SystemExit

year = int(input("Year: "))
month = int(input("Month: "))
day = int(input("Day: "))

def get_valid_input(prompt, low, high):
    while True:
        try:
            val = float(input(prompt))
            return float(np.clip(val, low, high))
        except ValueError:
            print("Invalid input. Enter a numeric value.")

pm25 = get_valid_input("PM2.5 (µg/m³): ", *clip_bounds['PM2.5'])
pm10 = get_valid_input("PM10 (µg/m³): ", *clip_bounds['PM10'])
no2  = get_valid_input("NO2 (µg/m³): ", *clip_bounds['NO2'])
so2  = get_valid_input("SO2 (µg/m³): ", *clip_bounds['SO2'])
co   = get_valid_input("CO (mg/m³): ", *clip_bounds['CO'])
o3   = get_valid_input("O3 (µg/m³): ", *clip_bounds['O3'])

city_encoded = le_city.transform([city])[0]

input_data = pd.DataFrame({
    'City': [city_encoded],
    'PM2.5': [pm25],
    'PM10': [pm10],
    'NO': [0],
    'NO2': [no2],
    'NOx': [0],
    'NH3': [0],
    'CO': [co],
    'SO2': [so2],
    'O3': [o3],
    'Benzene': [0],
    'Toluene': [0],
    'Xylene': [0],
    'Year': [year],
    'Month': [month],
    'Day': [day]
})

city_data_all = df[df['City'] == city_encoded]
city_medians = city_data_all.median(numeric_only=True)
for col in ['NO','NOx','NH3','Benzene','Toluene','Xylene']:
    if input_data[col].iloc[0] == 0:
        input_data[col] = city_medians.get(col, 0.0)

predicted_aqi = float(xgb.predict(input_data)[0])
predicted_aqi = float(np.clip(predicted_aqi, 0, 500))

category, suggestion = get_aqi_category_and_suggestion(predicted_aqi)
aqi_range = get_aqi_range(category)

color_map = {
    "Good": "green",
    "Satisfactory": "light_green",
    "Moderate": "yellow",
    "Poor": "light_red",
    "Very Poor": "red",
    "Severe": "magenta"
}

print(f"\nPredicted AQI: {round(predicted_aqi, 2)}")
print(colored(f"AQI Category: {category}", color_map.get(category, "white")))
print(colored(f"AQI Range: {aqi_range} ({category})", color_map.get(category, "white")))
print(colored(f"Suggestion: {suggestion}", color_map.get(category, "white")))

city_data = df[df['City'] == city_encoded]
city_data = city_data[(city_data['AQI'] >= 0) & (city_data['AQI'] <= 500)]
print(f"\nAQI Statistics for {city}:")
print(f"Average AQI: {city_data['AQI'].mean():.2f}")
print(f"Maximum AQI: {city_data['AQI'].max():.2f}")
print(f"Minimum AQI: {city_data['AQI'].min():.2f}")

plt.figure(figsize=(8,4))
city_plot = city_data.dropna(subset=['AQI']).sort_values('Date')
plt.plot(pd.to_datetime(city_plot['Date']), city_plot['AQI'], linewidth=2)
plt.title(f"AQI Trend for {city}")
plt.xlabel("Date")
plt.ylabel("AQI")
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

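print("\nPredicted AQI for Next 3 Days:")
# Roll the calendar forward properly so that, for example, the 31st + 1 becomes the 1st of the next month.
start_date = pd.Timestamp(year=year, month=month, day=day)
for i in range(1, 4):
    future = input_data.copy()
    future_date = start_date + pd.Timedelta(days=i)
    future['Year'] = future_date.year
    future['Month'] = future_date.month
    future['Day'] = future_date.day
    # No pollutant forecast is available, so the entered readings are randomly perturbed by up to 3%.
    scale = np.random.uniform(0.97, 1.03)
    for col in ['PM2.5','PM10','NO2','SO2','CO','O3']:
        lo, hi = clip_bounds[col]
        future[col] = float(np.clip(future[col].iloc[0] * scale, lo, hi))
    pred_next = float(xgb.predict(future)[0])
    pred_next = float(np.clip(pred_next, 0, 500))
    print(f"Day +{i}: {round(pred_next, 2)}")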
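
The Year, Month, and Day prompts in the cell above accept any integers, so a combination such as 30 February reaches the model unchecked. The standard-library sketch below shows one way to validate the entered date and to list the following calendar days; `parse_calendar_date` and `next_days` are hypothetical helpers, not part of the original notebook.

```python
from datetime import date, timedelta


def parse_calendar_date(year, month, day):
    """Return a date object, or None if the combination is not a real calendar date."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def next_days(start, count=3):
    """Return the `count` calendar dates that follow `start`."""
    return [start + timedelta(days=i) for i in range(1, count + 1)]


start = parse_calendar_date(2020, 1, 30)
upcoming = next_days(start)
for d in upcoming:
    # Each date exposes .year, .month and .day for the model's Year/Month/Day features.
    print(d.year, d.month, d.day)

invalid = parse_calendar_date(2021, 2, 30)   # not a real date, so this is None
```

A prompt loop could call `parse_calendar_date` and ask again whenever it returns `None`, in the same way `get_valid_input` retries on non-numeric pollutant values.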
