<a href="https://colab.research.google.com/github/uzeziogho/Private-Projects/blob/main/Sustainability_%E2%80%93_Air_Quality_%26_Weather.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [12]:
# Step 1: Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Step 2: Load dataset
df = pd.read_csv("/content/AirQualityUCI.csv", sep=';', decimal=',')  # decimal fix for numbers like "7,8"

# Step 3: Handle missing values (-200 → NaN)
df.replace(-200, np.nan, inplace=True)

# Step 4: Select target and features
target = "CO(GT)"  # pollutant to predict
X = df.drop(columns=[target, "Date", "Time"], errors="ignore")
y = df[target]

# Drop completely empty / unnamed columns
X = X.loc[:, X.notna().any(axis=0)]

# Step 5: Impute missing values
imputer = SimpleImputer(strategy="median")
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
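# Note: this fills missing targets with the median. Dropping rows with a missing
# CO(GT) is the more common choice, since filled labels are not real measurements.
y = y.fillna(y.median())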

# Step 6: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 7: Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_scaled = StandardScaler().fit_transform(X)   # full dataset for clustering; separate scaler so the train-fitted one is not refit

# Step 8: Supervised Models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "XGBoost": xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=6, random_state=42)
}

print("\n=== SUPERVISED LEARNING (Regression) ===")
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"\n--- {name} ---")
    print(f"RMSE: {rmse:.3f}")
    print(f"MAE: {mae:.3f}")
    print(f"R²: {r2:.3f}")

# Step 9: Unsupervised Clustering
print("\n=== UNSUPERVISED LEARNING (Clustering) ===")
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Evaluate clustering
sil_score = silhouette_score(X_scaled, clusters)
ch_score = calinski_harabasz_score(X_scaled, clusters)

print(f"Silhouette Score: {sil_score:.3f}")
print(f"Calinski-Harabasz Index: {ch_score:.3f}")

# Optional: Add cluster labels back to df for inspection
df["Cluster"] = clusters
print("\nCluster sample counts:")
print(df["Cluster"].value_counts())



=== SUPERVISED LEARNING (Regression) ===

--- Linear Regression ---
RMSE: 0.555
MAE: 0.372
R²: 0.827

--- Random Forest ---
RMSE: 0.519
MAE: 0.306
R²: 0.849

--- XGBoost ---
RMSE: 0.493
MAE: 0.308
R²: 0.864

=== UNSUPERVISED LEARNING (Clustering) ===
Silhouette Score: 0.219
Calinski-Harabasz Index: 3576.039

Cluster sample counts:
Cluster
0    4036
2    3147
1    2288
Name: count, dtype: int64


**Supervised Learning (Regression)**

You predicted CO(GT) (the reference CO concentration, in mg/m³) with three models, each scored on the 20% held-out test split.

| Model | RMSE | MAE | R² | Interpretation |
|---|---|---|---|---|
| Linear Regression | 0.555 | 0.372 | 0.827 | Decent linear fit. R² of 0.827 means ~83% of the variance in CO levels is explained. |
| Random Forest | 0.519 | 0.306 | 0.849 | Slightly better than Linear Regression, with smaller prediction errors and the lowest MAE of the three. |
| XGBoost | 0.493 | 0.308 | 0.864 | Best overall. Lowest RMSE and ~86% of variance explained; MAE is essentially tied with Random Forest. |
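
To make the three metrics concrete, here is a minimal sketch that computes RMSE, MAE, and R² directly from their definitions with NumPy. The arrays `y_true` and `y_hat` are made-up values for illustration, not rows from the dataset.

```python
import numpy as np

# Hypothetical true values and predictions (illustrative only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual (penalizes large errors more)
rmse = np.sqrt(np.mean(residuals ** 2))

# MAE: mean absolute residual (same units as the target)
mae = np.mean(np.abs(residuals))

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.3f}  MAE: {mae:.3f}  R²: {r2:.3f}")
```

Because RMSE squares the residuals before averaging, a model can have the lowest RMSE without having the lowest MAE, which is the pattern seen between XGBoost and Random Forest here.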

**Takeaway:**

All three models perform well, but XGBoost is the strongest overall, with the lowest RMSE and the highest explained variance.

Random Forest is a close second (its MAE is marginally lower than XGBoost's) and usually needs less hyperparameter tuning.

Linear Regression is the simplest and most interpretable of the three, at the cost of slightly lower accuracy.
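
In the code above, the median imputer is fitted on the full dataset before the train-test split, so test rows influence the fill values. One way to avoid this is to wrap imputation, scaling, and the model in a scikit-learn `Pipeline` that is fitted on the training rows only. The sketch below uses synthetic data as a stand-in for the air quality features; the data and the names `X_demo`, `y_demo`, and `pipe` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the air quality dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

# Knock out roughly 10% of feature values to mimic missing sensor readings
mask = rng.random(X_demo.shape) < 0.10
X_demo[mask] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Imputer and scaler statistics are learned from the training rows only
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X_tr, y_tr)

test_r2 = r2_score(y_te, pipe.predict(X_te))
print(f"Test R²: {test_r2:.3f}")
```

The same pipeline object can be passed to `cross_val_score`, which refits the imputer and scaler inside every fold.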

**Unsupervised Learning (Clustering)**

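You grouped the dataset into 3 clusters with K-Means, fitted on the standardized feature columns (CO(GT) is the regression target and was not part of the clustering input):

- Silhouette Score: 0.219. Measures how well-separated the clusters are, on a scale from -1 to 1: values near 1 mean compact, well-separated clusters, and values near 0 mean overlapping ones. 0.219 is low, so the clusters overlap considerably.
- Calinski-Harabasz Index: 3576.039. Ratio of between-cluster to within-cluster dispersion; higher is better. The index has no fixed scale and depends on the sample size, so this value is mainly useful for comparison against the same data clustered with a different number of clusters.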
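
Both scores are most informative when compared across candidate cluster counts. Below is a minimal sketch that scores several values of k; it runs on synthetic blobs with four known groups (an assumption, standing in for `X_scaled`), and `n_init=10` is set explicitly rather than taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with 4 well-separated groups (stand-in for X_scaled)
centers = [[-5, -5], [5, 5], [-5, 5], [5, -5]]
X_blobs, _ = make_blobs(n_samples=600, centers=centers, cluster_std=0.6, random_state=42)
X_blobs = StandardScaler().fit_transform(X_blobs)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_blobs)
    scores[k] = (
        silhouette_score(X_blobs, labels),
        calinski_harabasz_score(X_blobs, labels),
    )
    print(f"k={k}: silhouette={scores[k][0]:.3f}, CH={scores[k][1]:.1f}")

# Pick the k with the highest silhouette score
best_k = max(scores, key=lambda k: scores[k][0])
print("Best k by silhouette:", best_k)
```

Running the same loop on the real `X_scaled` would show whether k=3 is actually the best-supported choice or whether another k separates the data more cleanly.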

Cluster counts:

- Cluster 0: 4036 samples
- Cluster 2: 3147 samples
- Cluster 1: 2288 samples


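Cluster 0 is the largest group, possibly “moderate pollution”.

Cluster 1 is the smallest group and could correspond to the “highest” or “lowest” pollution days.

K-Means label numbers are arbitrary, so these readings are guesses until the cluster averages are checked.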

**Takeaway:**

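Clusters give a rough categorization of pollution levels, but they overlap considerably (silhouette 0.219). Because the clustering input also includes the weather columns (T, RH, AH), a cluster may partly reflect weather conditions rather than pollution alone.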

You can inspect the cluster averages for CO, NOx, NO2, etc., to label clusters as Low, Medium, High Pollution.
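
A minimal sketch of that inspection, using a small made-up DataFrame in place of `df`. The readings and cluster labels below are invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Made-up readings and cluster labels (illustrative only, not real data)
demo = pd.DataFrame({
    "CO(GT)":  [0.5, 0.7, 2.0, 2.4, 5.1, 4.9],
    "NOx(GT)": [40, 55, 180, 210, 520, 480],
    "NO2(GT)": [30, 42, 95, 110, 170, 160],
    "Cluster": [1, 1, 0, 0, 2, 2],
})

# Mean of each pollutant per cluster
profile = demo.groupby("Cluster")[["CO(GT)", "NOx(GT)", "NO2(GT)"]].mean()

# Rank clusters by mean CO and attach readable names (lowest mean -> "Low")
order = profile["CO(GT)"].sort_values().index
names = dict(zip(order, ["Low", "Medium", "High"]))
demo["Pollution level"] = demo["Cluster"].map(names)

print(profile)
print(names)
```

Ranking by a single pollutant is a simplification; if the pollutant means disagree about the ordering, the Low/Medium/High labels do not describe the clusters well.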

**Overall Recommendation**

- Supervised: Use XGBoost for final pollutant predictions.

- Unsupervised: Use clusters as pollution categories for downstream analysis (e.g., alerting high pollution days).

Optionally, combine regression predictions with cluster labels for deeper insights (e.g., which sensor readings predict high pollution days).
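
One simple way to combine the two is to break the regression error down by cluster, which shows whether the model is less accurate in some groups. The sketch below uses made-up values; the `results` frame and its column names are hypothetical, not outputs of the notebook.

```python
import pandas as pd

# Made-up test-set results (illustrative only): actual CO, predicted CO, cluster label
results = pd.DataFrame({
    "actual":    [0.6, 0.8, 2.1, 2.5, 4.8, 5.2],
    "predicted": [0.7, 0.8, 2.0, 2.9, 4.0, 5.0],
    "Cluster":   [1, 1, 0, 0, 2, 2],
})

results["abs_error"] = (results["actual"] - results["predicted"]).abs()

# Mean prediction, mean absolute error, and row count within each cluster
summary = results.groupby("Cluster").agg(
    mean_predicted=("predicted", "mean"),
    mae=("abs_error", "mean"),
    n=("abs_error", "size"),
)
print(summary)
```

With the real data, the cluster labels would need to be aligned to the test rows (for example via `X_test.index`) before building such a table.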