In [2]:
import pandas as pd

df = pd.read_csv("climate_dashboard_base.csv")

In [3]:
df.columns

Index(['Year', 'Season', 'Temp_Range', 'Rain', 'Temp Max', 'Heatwave'], dtype='object')

In [4]:
X = df[['Temp Max', 'Temp_Range', 'Season', 'Year']]
y = df['Rain']

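Supervised learning needs the input features (X) kept separate from the target (y), which here is Rain. Keeping them apart also ensures that preprocessing steps such as encoding and imputation are applied to the features only.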

In [5]:
X = pd.get_dummies(X, columns=['Season'], drop_first=True)

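Most scikit-learn estimators require numeric input, so the text-valued Season column is one-hot encoded, which lets the model learn a separate effect for each season. Setting drop_first=True drops one category as the baseline and avoids perfectly collinear dummy columns in the linear model.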
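As a small illustration of what the encoding step produces, here is a sketch on a toy frame. The values are made up, and the "Monsoon" label for the dropped baseline category is an assumption inferred from the dummy column names in the notebook output.

```python
import pandas as pd

# Toy data for illustration only; "Monsoon" as the fourth season is an assumption.
toy = pd.DataFrame({
    "Season": ["Monsoon", "Post-Monsoon", "Summer", "Winter", "Monsoon"],
    "Temp Max": [30.1, 31.5, 34.2, 28.9, 29.8],
})

# drop_first=True removes the alphabetically first category ("Monsoon" here),
# which then acts as the baseline: its rows have all dummy columns set to 0/False.
encoded = pd.get_dummies(toy, columns=["Season"], drop_first=True)
dummy_cols = [c for c in encoded.columns if c.startswith("Season_")]

print(encoded.columns.tolist())
print(encoded[dummy_cols].astype(int))
```

A row belonging to the baseline season is identified by the absence of any active dummy column, so no information is lost by dropping it.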

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

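The data is split 80/20 so that performance is measured on rows the model never saw during training. A large gap between training error and test error is the sign of overfitting; the split makes overfitting detectable, it does not prevent it.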
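To make the purpose of the held-out split concrete, the following sketch uses synthetic data (not the Mumbai dataset) and an unconstrained decision tree, a model chosen here only because it overfits easily. All names in it are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy signal; stands in for any regression dataset.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A fully grown tree memorises the training rows, noise included.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
test_mae = mean_absolute_error(y_te, tree.predict(X_te))
print(f"train MAE: {train_mae:.3f}, test MAE: {test_mae:.3f}")
```

The training error alone would suggest a near-perfect model; only the test error reveals how well it generalises.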

In [11]:
X.isna().sum()

Temp Max               58
Temp_Range             60
Year                    3
Season_Post-Monsoon     0
Season_Summer           0
Season_Winter           0
dtype: int64

Many ML models (including Linear Regression) cannot handle NaN values directly.
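A quick way to see this behaviour is to fit LinearRegression on a tiny array that contains a NaN. The numbers below are invented for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented values; the second row has a missing first feature.
X_nan = np.array([[30.0, 10.0], [np.nan, 12.0], [33.5, 9.0], [31.0, 11.5]])
y_nan = np.array([0.0, 5.2, 1.1, 3.4])

try:
    LinearRegression().fit(X_nan, y_nan)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates the input and refuses NaNs for this estimator.
    fit_failed = True
    print(f"fit rejected the input: {err}")
```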

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("model", LinearRegression())
])

pipeline.fit(X_train, y_train)


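Handles missing values safely: NaNs are filled before the data reaches the model

Prevents data leakage: the imputation medians are learned from the training split only

Ensures consistent preprocessing: the same fitted imputer is applied during training and prediction

Median imputation was chosen because the median is robust to the skew and outliers typical of climate data.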
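The points above can be checked on a toy column. This sketch uses SimpleImputer directly rather than the notebook's pipeline, and the numbers are invented to be clearly right-skewed.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented, right-skewed training column with one missing value.
train_col = np.array([[1.0], [2.0], [3.0], [np.nan], [100.0]])
# Held-out column with its own missing value.
test_col = np.array([[np.nan], [4.0]])

# fit() learns the median from the training data only (NaNs are ignored).
imputer = SimpleImputer(strategy="median").fit(train_col)
train_median = imputer.statistics_[0]

# transform() reuses that stored median; the test data never influences it.
filled_test = imputer.transform(test_col)

# For comparison: the mean is pulled far upward by the single large value.
train_mean = np.nanmean(train_col)

print(train_median, train_mean, filled_test.ravel())
```

Inside a Pipeline the same thing happens automatically: fit learns the statistic from the training rows, and predict applies it unchanged to new rows.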

In [13]:
y_pred = pipeline.predict(X_test)

Predictions on the held-out test set are compared against y_test to measure the model's error.

In [14]:
X_train.describe()

           Temp Max    Temp_Range          Year
count  21396.000000  21394.000000  21442.000000
mean      32.069140     11.295339   1987.219849
std        2.719954      4.092538     21.165018
min       22.719999      1.800000   1951.000000
25%       29.959999      7.450221   1969.000000
50%       31.850000     12.038318   1987.000000
75%       34.200000     14.810001   2006.000000
max       40.200001     20.730000   2024.000000


In [15]:
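from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# X_train still contains NaNs. Random forests accept them natively only in
# scikit-learn 1.4+; on older versions, wrap the model in a Pipeline with an imputer.
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

rf_pred = rf.predict(X_test)

# Mean absolute error on the held-out test set
mean_absolute_error(y_test, rf_pred)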


0.4293803181411816
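MAE is one of the three metrics named in the summary; RMSE and R² can be computed alongside it. The sketch below uses small invented arrays so the arithmetic can be checked by hand, and derives RMSE as the square root of the mean squared error so that it does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented values, chosen so the results are easy to verify by hand.
y_true = np.array([0.0, 2.0, 4.0, 6.0])
y_hat = np.array([1.0, 2.0, 3.0, 8.0])

mae = mean_absolute_error(y_true, y_hat)           # (1 + 0 + 1 + 2) / 4 = 1.0
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # sqrt((1 + 0 + 1 + 4) / 4) = sqrt(1.5)
r2 = r2_score(y_true, y_hat)                       # 1 - 6 / 20 = 0.7

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

MAE and RMSE are in the units of the target; RMSE penalises large errors more heavily, and R² reports the share of variance explained relative to always predicting the mean.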

Project Summary (Mumbai Weather – ML Model)

In this phase, we developed a machine learning model to predict daily rainfall in Mumbai using historical weather data. Starting from a cleaned and feature-engineered dataset, we framed the problem as a regression task, selected relevant numerical and seasonal features, and built a reusable scikit-learn pipeline. Missing values were handled using median imputation within the pipeline to avoid data leakage.

We trained a Linear Regression model as a baseline and compared its performance with a Random Forest Regressor, which can capture non-linear weather patterns. Model performance was evaluated using MAE, RMSE, and R². The comparison showed that Linear Regression offers interpretability, while Random Forest performs better on this climate data, reaching a test MAE of about 0.43.

This exercise completed the full ML workflow: data preparation → preprocessing → model training → evaluation → comparison, marking a successful transition from EDA to real model development.