
# Week 2 Project – Heatwave Risk Analysis 🌡️🔥

This notebook continues from the **Week 1 Project**, focusing on:

- **Exploratory Data Analysis (EDA)**
- **Data Transformation**
- **Feature Engineering**
- **Feature Selection**

Dataset: `heatwave_data.csv`  
Final Output: Cleaned & transformed dataset `heatwave_processed.csv`


In [None]:

# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif, chi2, RFE
from sklearn.linear_model import LogisticRegression


In [None]:

# Load dataset
df = pd.read_csv("heatwave_data.csv")
df.head()


## Exploratory Data Analysis (EDA)

In [None]:

# Dataset info
df.info()

# Statistical summary
df.describe()


In [None]:

# Missing values check
df.isnull().sum()
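
Raw counts are hard to compare across columns, so a percentage view can help decide which columns need attention. The sketch below uses a small made-up frame; the column names and values are illustrative and not taken from `heatwave_data.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; it only stands in for the real dataset.
toy = pd.DataFrame({
    "Temperature": [38.5, np.nan, 41.0, 39.2],
    "Humidity": [20.0, 35.0, np.nan, np.nan],
    "Region": ["North", "South", None, "East"],
})

# isnull().mean() is the fraction of missing entries per column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

The same two lines should work on `df` once it is loaded.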


In [None]:

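# Correlation heatmap
# numeric_only=True skips text columns, which would otherwise raise an error
# in recent pandas versions because encoding happens later in the notebook.
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()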


In [None]:

# Distribution plots for numerical features
df.hist(bins=30, figsize=(12,8))
plt.show()


In [None]:

# Outlier detection using boxplots
plt.figure(figsize=(12,6))
sns.boxplot(data=df)
plt.xticks(rotation=90)
plt.title("Outlier Detection")
plt.show()
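
Boxplots flag points beyond 1.5 × IQR from the quartiles. A minimal sketch of the same rule in code, on made-up temperatures (the values are assumptions for illustration), makes it possible to count or list the flagged rows:

```python
import pandas as pd

# Made-up temperatures with one obvious extreme value.
temps = pd.Series([36.0, 37.5, 38.0, 38.5, 39.0, 40.0, 55.0], name="Temperature")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = temps.quantile(0.25), temps.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values outside the fences.
outliers = temps[(temps < lower) | (temps > upper)]
print(outliers)
```

Whether a flagged value is an error or a genuine heat extreme is a judgment call; for heatwave data, extremes may be exactly the signal of interest.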


## Data Transformation

In [None]:

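# Fill missing values (mean for numeric, mode for categorical)
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may not update df in recent pandas versions.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])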
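
Mean imputation is sensitive to extreme values, which the boxplots above may reveal. The sketch below compares mean and median fills on made-up numbers (an illustration, not a result from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up column with one extreme reading and one missing value.
temps = pd.Series([36.0, 37.0, 38.0, 60.0, np.nan], name="Temperature")

# The mean is pulled up by the extreme value; the median is not.
mean_filled = temps.fillna(temps.mean())
median_filled = temps.fillna(temps.median())

print(mean_filled.iloc[-1], median_filled.iloc[-1])
```

For skewed columns, the median may be the safer default.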


In [None]:

# Encoding categorical variables
le = LabelEncoder()
for col in df.select_dtypes(include="object").columns:
    df[col] = le.fit_transform(df[col])
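
`LabelEncoder` maps categories to integers in sorted order, which can suggest an ordering that does not exist (scikit-learn documents it as intended for target labels). A hedged alternative for nominal features is one-hot encoding with `pd.get_dummies`; the `Region` column here is a made-up example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal column.
regions = pd.Series(["North", "South", "East", "South"], name="Region")

# Integer codes follow sorted label order: East=0, North=1, South=2.
codes = LabelEncoder().fit_transform(regions)
print(codes)

# One-hot encoding creates one indicator column per category instead.
one_hot = pd.get_dummies(regions, prefix="Region")
print(one_hot.columns.tolist())
```

One-hot encoding adds columns, so it suits features with few distinct categories.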


In [None]:

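# Scaling numeric data
# Scale the features only. The target (assumed to be the last column) is left
# unscaled so it remains a discrete class label for feature selection.
target_col = df.columns[-1]
feature_cols = df.columns.drop(target_col)
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[feature_cols]), columns=feature_cols, index=df.index)
df_scaled[target_col] = df[target_col]
df_scaled.head()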
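
Standardizing a class label turns it into a continuous value, which classifiers and `f_classif` reject. A minimal sketch of scaling features while leaving the label untouched, with a made-up frame and a hypothetical `Heatwave` label:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data; "Heatwave" is a hypothetical binary class label.
toy = pd.DataFrame({
    "Temperature": [36.0, 40.0, 44.0],
    "Humidity": [10.0, 20.0, 30.0],
    "Heatwave": [0, 1, 1],
})

target_col = "Heatwave"
feature_cols = [c for c in toy.columns if c != target_col]

# Fit the scaler on the feature columns only.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy[feature_cols]),
    columns=feature_cols,
    index=toy.index,
)

# Reattach the label unchanged.
scaled[target_col] = toy[target_col]
print(scaled)
```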


## Feature Engineering

In [None]:

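# Example: Create Heat Index (Temp * Humidity)
# Note: this is a simple interaction term on the scaled values, not the
# meteorological heat index formula.
if "Temperature" in df.columns and "Humidity" in df.columns:
    df_scaled["Heat_Index"] = df_scaled["Temperature"] * df_scaled["Humidity"]
df_scaled.head()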


## Feature Selection

In [None]:

# Correlation-based feature selection (remove highly correlated features)
corr_matrix = df_scaled.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
df_reduced = df_scaled.drop(to_drop, axis=1)
print("Dropped features due to high correlation:", to_drop)
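
To see how the upper-triangle trick behaves, here is a small worked example on made-up columns (names and values are assumptions). `Temp_F` is a linear function of `Temp_C`, so their correlation is 1 and the later column of the pair is dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Temp_C": [30.0, 32.0, 35.0, 38.0, 41.0],
    "Temp_F": [86.0, 89.6, 95.0, 100.4, 105.8],  # Temp_C * 9/5 + 32
    "Wind": [12.0, 5.0, 9.0, 3.0, 7.0],
})

corr = toy.corr().abs()

# Keep only entries above the diagonal so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is dropped if it correlates strongly with any earlier column.
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
print(to_drop)
```

Because only the later column of a pair is removed, column order decides which feature survives; it may be worth checking that the target is not in `to_drop`.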


In [None]:

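# SelectKBest (ANOVA F-test)
# The target is assumed to be the last column of the original data. Take y from
# df (encoded but unscaled) so it stays a discrete class label, and drop it by
# name so an engineered column such as Heat_Index is not mistaken for the target.
target_col = df.columns[-1]
X = df_reduced.drop(columns=[target_col], errors="ignore")  # features
y = df[target_col]  # target

best_features = SelectKBest(score_func=f_classif, k=5)
fit = best_features.fit(X, y)
df_scores = pd.DataFrame({"Feature": X.columns, "Score": fit.scores_})
df_scores.sort_values(by="Score", ascending=False)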
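
The score table ranks features, but the names actually kept by `SelectKBest` come from `get_support()`. A sketch on synthetic data from `make_classification` (the data and the `f0`..`f5` names are placeholders, not the heatwave features):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data with six placeholder features.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Fit the selector and read the boolean mask of kept columns.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)
kept = X_demo.columns[selector.get_support()].tolist()
print(kept)
```

Note that `f_classif` assumes a categorical target; for a continuous target, `f_regression` would be the analogous score function.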


In [None]:

# Recursive Feature Elimination (RFE)
model = LogisticRegression(max_iter=500)
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(X, y)
selected_features = X.columns[fit.support_]
selected_features
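
Besides the boolean `support_` mask, a fitted `RFE` exposes `ranking_`, where selected features have rank 1 and higher ranks were eliminated earlier. A minimal sketch on synthetic data (placeholder features, not the heatwave columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data with six placeholder features.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Eliminate one feature per round until two remain.
rfe_demo = RFE(LogisticRegression(max_iter=500), n_features_to_select=2).fit(X_demo, y_demo)

# Rank 1 marks the selected features.
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```

Comparing this ranking with the SelectKBest scores can show whether the two methods agree on the strongest features.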


In [None]:

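# Final dataset with selected features
# .copy() makes df_final an independent frame, so adding a column does not
# trigger a SettingWithCopyWarning.
df_final = df_reduced[selected_features].copy()
df_final["Target"] = y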

# Save processed dataset
df_final.to_csv("heatwave_processed.csv", index=False)
df_final.head()



## ✅ Conclusion
- Performed **EDA** (summary, correlations, distributions, outliers).  
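- Applied **Data Transformation** (missing values, encoding, scaling).  
- Added an example **Feature Engineering** step (`Heat_Index`, when `Temperature` and `Humidity` are present).  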
- Conducted **Feature Selection** (correlation removal, SelectKBest, RFE).  
- Saved cleaned dataset as **heatwave_processed.csv** for further modeling (Week 3).  
