## Day 07

**Problem Statement**

The goal of this analysis is to understand long-term climate patterns in Mumbai,
identify factors influencing rainfall and temperature variation, and prepare
the dataset for predictive modeling or dashboarding.


In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the cleaned data saved at the end of Day 06
climate_df = pd.read_csv("mumbai_climate_day06.csv")

# Restore datetime type
climate_df['Date'] = pd.to_datetime(climate_df['Date'])

# Recreate engineered features
climate_df['Temp_Range'] = climate_df['Temp Max'] - climate_df['Temp Min']


In [3]:
# Keep every non-object column (numeric and datetime); the text column Season is excluded
num = climate_df.select_dtypes(exclude=object)
num

Unnamed: 0,Date,Rain,Temp Max,Temp Min,Year,Month,Day,Temp_Range
0,1951-01-01,0.0,28.530001,14.54,1951.0,1.0,1.0,13.990001
1,1951-01-02,0.0,28.850000,14.48,1951.0,1.0,2.0,14.370001
2,1951-01-03,0.0,30.660000,14.43,1951.0,1.0,3.0,16.230000
3,1951-01-04,0.0,30.139999,14.36,1951.0,1.0,4.0,15.780000
4,1951-01-05,0.0,29.180000,13.34,1951.0,1.0,5.0,15.840000
...,...,...,...,...,...,...,...,...
26801,2024-06-18,0.0,34.300000,26.60,2024.0,6.0,18.0,7.700000
26802,2024-06-19,20.0,34.800000,25.50,2024.0,6.0,19.0,9.300000
26803,NaT,18.0,33.100000,25.40,,,,7.700000
26804,NaT,4.0,30.900000,26.70,,,,4.200000


In [4]:
corr_matrix = num.corr()
corr_matrix

Unnamed: 0,Date,Rain,Temp Max,Temp Min,Year,Month,Day,Temp_Range
Date,1.0,0.075243,0.086202,0.109552,0.999907,0.006592,0.000346,-0.045065
Rain,0.075243,1.0,-0.037145,0.043221,0.077293,0.01826,-0.010666,-0.065081
Temp Max,0.086202,-0.037145,1.0,0.253516,0.089328,-0.179822,-0.001433,0.428261
Temp Min,0.109552,0.043221,0.253516,1.0,0.109715,0.168273,0.003115,-0.765562
Year,0.999907,0.077293,0.089328,0.109715,1.0,-0.006978,-0.000933,-0.043134
Month,0.006592,0.01826,-0.179822,0.168273,-0.006978,1.0,0.010649,-0.276807
Day,0.000346,-0.010666,-0.001433,0.003115,-0.000933,0.010649,1.0,-0.003874
Temp_Range,-0.045065,-0.065081,0.428261,-0.765562,-0.043134,-0.276807,-0.003874,1.0


In [5]:
rain_corr = corr_matrix['Rain'].sort_values(ascending=False)
rain_corr

Rain          1.000000
Year          0.077293
Date          0.075243
Temp Min      0.043221
Month         0.018260
Day          -0.010666
Temp Max     -0.037145
Temp_Range   -0.065081
Name: Rain, dtype: float64

In [6]:
# Day barely correlates with Rain (-0.011); Temp Min equals Temp Max minus Temp_Range
climate_df = climate_df.drop(columns=['Day', 'Temp Min'])

In [7]:
climate_df.columns

Index(['Date', 'Rain', 'Temp Max', 'Year', 'Month', 'Season', 'Temp_Range'], dtype='object')

**Rainfall Correlation Insight**

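- Rain is selected as the primary target variable.
- No single feature has a strong linear relationship with Rain: the largest absolute correlation is 0.077 (Year).
- The temperature variables are collinear: Temp_Range is computed as Temp Max minus Temp Min, and it correlates with Temp Min at -0.77. Temp Min is therefore dropped.
- Temp_Range is retained to capture daily variability.
- Season is preferred over Month because rainfall varies non-linearly across the year; the linear correlation between Month and Rain is only 0.018.
- Day is dropped: its correlation with Rain is -0.011, so it adds noise rather than signal.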
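
The multicollinearity point can be checked directly. The following is a minimal sketch on a small made-up frame, not the Mumbai data: because Temp_Range is defined as Temp Max minus Temp Min, the three temperature columns are exactly linearly dependent, which shows up as a rank-deficient matrix. The `toy` frame and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only (not taken from the Mumbai dataset).
toy = pd.DataFrame({
    "Temp Max": [28.5, 30.7, 34.3, 33.1, 30.9],
    "Temp Min": [14.5, 14.4, 26.6, 25.4, 26.7],
})
toy["Temp_Range"] = toy["Temp Max"] - toy["Temp Min"]

temp_cols = ["Temp Max", "Temp Min", "Temp_Range"]

# Three columns, but only two independent directions.
rank = np.linalg.matrix_rank(toy[temp_cols].to_numpy())
print("rank of temperature block:", rank)

# Dropping one of the three columns removes the exact dependency.
reduced = toy[["Temp Max", "Temp_Range"]]
reduced_rank = np.linalg.matrix_rank(reduced.to_numpy())
print("rank after dropping Temp Min:", reduced_rank)
```

Any one of the three columns could be dropped to break the dependency; this notebook drops Temp Min and keeps Temp_Range.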

In [10]:
ml_df = climate_df[
    ['Year', 'Season', 'Temp_Range', 'Rain', 'Temp Max']
]

In [11]:
ml_df

Unnamed: 0,Year,Season,Temp_Range,Rain,Temp Max
0,1951.0,Winter,13.990001,0.0,28.530001
1,1951.0,Winter,14.370001,0.0,28.850000
2,1951.0,Winter,16.230000,0.0,30.660000
3,1951.0,Winter,15.780000,0.0,30.139999
4,1951.0,Winter,15.840000,0.0,29.180000
...,...,...,...,...,...
26801,2024.0,Monsoon,7.700000,0.0,34.300000
26802,2024.0,Monsoon,9.300000,20.0,34.800000
26803,,Post-Monsoon,7.700000,18.0,33.100000
26804,,Post-Monsoon,4.200000,4.0,30.900000


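Because each day's notebook runs independently, anything a CSV does not preserve
must be restored after loading. That includes the datetime type of Date and
engineered features such as Temp_Range, both of which are rebuilt in cell In [2]
to keep the results reproducible.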
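
One way to make that restore step repeatable is to wrap it in a small loader. The sketch below is an assumption rather than part of the original notebook: the helper name `load_climate` is hypothetical, the in-memory CSV stands in for the saved file, and `errors="coerce"` is an added choice so that unparseable dates become `NaT` instead of raising.

```python
import io
import pandas as pd

def load_climate(source):
    """Read a saved climate CSV and rebuild dtypes and derived columns."""
    df = pd.read_csv(source)
    # CSV stores dates as text; unparseable entries become NaT (assumed policy).
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
    # Recompute the derived column instead of trusting what is on disk.
    df["Temp_Range"] = df["Temp Max"] - df["Temp Min"]
    return df

# Tiny in-memory stand-in for the saved file (values are illustrative).
sample_csv = io.StringIO(
    "Date,Rain,Temp Max,Temp Min\n"
    "1951-01-01,0.0,28.5,14.5\n"
    "not-a-date,18.0,33.0,25.5\n"
)
sample = load_climate(sample_csv)
print(sample.dtypes)
print(sample)
```

Calling the same helper at the top of each day's notebook would keep the feature definitions in one place.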

In [12]:
climate_df.groupby('Season')['Rain'].describe()

Season,count,mean,std,min,25%,50%,75%,max
Monsoon,8925.0,0.996112,9.253023,0.0,0.0,0.0,0.0,253.0
Post-Monsoon,4456.0,0.066629,1.997652,0.0,0.0,0.0,0.0,114.0
Summer,6808.0,0.011751,0.413522,0.0,0.0,0.0,0.0,21.0
Winter,6617.0,0.168037,12.487379,0.0,0.0,0.0,0.0,1011.7
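
Every quartile in this table is 0.0 in every season, so `describe()` mostly reflects dry days. One possible follow-up is to separate how often it rains from how much falls on wet days. The sketch below uses a made-up frame; the `toy` values and the `> 0` wet-day threshold are assumptions, not taken from the notebook.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "Season": ["Monsoon", "Monsoon", "Monsoon", "Monsoon",
               "Winter", "Winter", "Summer", "Summer"],
    "Rain": [0.0, 20.0, 18.0, 4.0, 0.0, 0.0, 0.0, 2.0],
})

# Assumed threshold: any recorded rainfall counts as a wet day.
toy["Wet"] = toy["Rain"] > 0

by_season = toy.groupby("Season").agg(
    days=("Rain", "size"),       # number of observations per season
    wet_share=("Wet", "mean"),   # fraction of days with rain
)
# Mean rainfall on wet days only; seasons without wet days come out as NaN.
by_season["wet_day_mean"] = toy[toy["Wet"]].groupby("Season")["Rain"].mean()
print(by_season)
```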


In [15]:
ml_df.to_csv("mumbai_climate_day07.csv", index=False)
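
The saved frame still holds Season as text, and the preview of `ml_df` shows rows with a missing Year. The sketch below shows one way a later notebook might handle both before fitting a model. It runs on a made-up frame, and both the drop-missing policy and the one-hot encoding are assumptions rather than steps the original notebook performs.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like ml_df (values are illustrative).
toy = pd.DataFrame({
    "Year": [1951.0, 2024.0, np.nan],
    "Season": ["Winter", "Monsoon", "Post-Monsoon"],
    "Temp_Range": [13.99, 9.3, 7.7],
    "Rain": [0.0, 20.0, 18.0],
    "Temp Max": [28.53, 34.8, 33.1],
})

# Assumed policy: drop rows with no Year rather than impute them.
model_df = toy.dropna(subset=["Year"]).copy()
model_df["Year"] = model_df["Year"].astype(int)

# One indicator column per season present in the remaining rows.
model_df = pd.get_dummies(model_df, columns=["Season"], prefix="Season")
print(model_df.columns.tolist())
```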

**Day 07 Summary**

- Identified rainfall and temperature as primary climate targets
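- Analyzed feature relevance using the correlation matrix
- Addressed multicollinearity among temperature variables by dropping Temp Min
- Compared rainfall by season: mean daily rainfall is highest in Monsoon (0.996) and lowest in Summer (0.012)
- Prepared an ML-ready dataset (mumbai_climate_day07.csv) for future modeling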