
# Week 2 — Climate Risk & Disaster Management  
**Dataset:** World Disaster Risk Index Time Series  
**Goal:** *Predict the next year's World Risk Index (WRI) for a chosen country (identified by the `Region` column in the code below).*

This notebook continues from Week 1 and adds:
- Exploratory Data Analysis (EDA)
- Data Cleaning & Transformations
- Feature Engineering (including next-year target)
- Feature Selection (Correlation filter, SelectKBest, RandomForest importance)

> **Note:** Place the dataset CSV in the same folder as this notebook and set `DATA_FILE` accordingly.


In [None]:

# 1) Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# ML / feature selection
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor

# Display options
pd.set_option("display.max_columns", 100)

DATA_FILE = "global_disaster_risk_index_time_series.csv"  # <-- change if your filename is different
assert Path(DATA_FILE).exists(), f"Dataset file '{DATA_FILE}' not found. Place it next to this notebook."


In [None]:

# 2) Load & basic cleaning
df = pd.read_csv(DATA_FILE)

# Strip whitespace from column names to avoid accidental mismatches
df.columns = df.columns.str.strip()

print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())

print("\nFirst 5 rows:")
display(df.head())


## Exploratory Data Analysis (EDA)

In [None]:

# Info and basic stats
print("----- INFO -----")
df.info()
print("\n----- DESCRIBE (numeric) -----")
display(df.describe())
print("\n----- Missing Values -----")
display(df.isnull().sum())


In [None]:

# Distributions for key numeric features
numeric_cols = ['WRI', 'Exposure', 'Vulnerability', 'Susceptibility', 
                'Lack of Coping Capabilities', 'Lack of Adaptive Capacities', 'Year']
numeric_cols = [c for c in numeric_cols if c in df.columns]

for col in numeric_cols:
    plt.figure()
    plt.hist(df[col].dropna(), bins=30)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.show()


In [None]:

# Average WRI trend over years
if 'Year' in df.columns and 'WRI' in df.columns:
    wri_year = df.groupby('Year')['WRI'].mean()
    plt.figure()
    plt.plot(wri_year.index, wri_year.values, marker='o')
    plt.title("Average WRI Over Years")
    plt.xlabel("Year")
    plt.ylabel("Average WRI")
    plt.grid(True)
    plt.show()


In [None]:

# Top 10 regions by average WRI
if 'Region' in df.columns and 'WRI' in df.columns:
    top_regions = df.groupby('Region')['WRI'].mean().sort_values(ascending=False).head(10)
    display(top_regions.to_frame('Avg WRI'))


## Feature Engineering — Next-Year Target

In [None]:

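# Create target: next year's WRI per Region (shift -1)
# For each Region and Year, the target is that Region's WRI in Year+1
required = ['Region', 'Year', 'WRI']
missing = [c for c in required if c not in df.columns]
if missing:
    raise ValueError(f"Required columns are missing: {missing}")

# Sort first so that shift(-1) picks up the chronologically next row
df = df.sort_values(['Region', 'Year'])
df['WRI_next'] = df.groupby('Region')['WRI'].shift(-1)

# Keep the target only when the next row really is Year+1 (guards against gaps)
next_year = df.groupby('Region')['Year'].shift(-1)
df.loc[next_year != df['Year'] + 1, 'WRI_next'] = np.nan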

print("Rows before dropping NA target:", len(df))
df_model = df.dropna(subset=['WRI_next']).copy()
print("Rows after dropping NA target:", len(df_model))

display(df_model.head())
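
To make the group-wise shift concrete, here is a minimal sketch on made-up values. The toy frame and the `next_is_consecutive` column are illustrative assumptions, not part of the dataset. Region `B` is given a missing year on purpose, to show that a plain `shift(-1)` pairs a row with whatever row comes next, even when that row is not the following year.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): region "B" has no row for 2021
toy = pd.DataFrame({
    "Region": ["A", "A", "A", "B", "B", "B"],
    "Year":   [2019, 2020, 2021, 2019, 2020, 2022],
    "WRI":    [5.0, 5.5, 6.0, 10.0, 11.0, 12.0],
})

toy = toy.sort_values(["Region", "Year"]).reset_index(drop=True)
grouped = toy.groupby("Region")

# Plain group-wise shift: the value from the next row within the same region
toy["WRI_next"] = grouped["WRI"].shift(-1)

# Flag rows whose "next row" is exactly one year later
toy["next_is_consecutive"] = grouped["Year"].shift(-1) == toy["Year"] + 1

print(toy)
```

The last row of each region has no successor, so its `WRI_next` is `NaN`. The row for region `B` in 2020 does get a value, but the flag shows that it comes from 2022 rather than 2021.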


## Data Preparation — Imputation, Encoding, Scaling

In [None]:

# Identify features
cat_cols = [c for c in ['Region', 'Exposure Category', 'WRI Category',
                        'Vulnerability Category', 'Susceptibility Category'] if c in df_model.columns]

num_cols = [c for c in ['WRI', 'Exposure', 'Vulnerability', 'Susceptibility',
                        'Lack of Coping Capabilities', 'Lack of Adaptive Capacities', 'Year']
            if c in df_model.columns]

target = 'WRI_next'
features = cat_cols + num_cols

print("Categorical:", cat_cols)
print("Numeric:", num_cols)

X = df_model[features].copy()
y = df_model[target].copy()

# Column transformer: impute & encode categoricals, impute & scale numerics
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_cols),
        ("cat", categorical_transformer, cat_cols)
    ],
    remainder="drop"
)

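# Fit on all rows to build the processed matrix for the exploratory feature ranking below.
# For model evaluation, fit the preprocessor on the training split only to avoid leakage.
X_processed = preprocess.fit_transform(X)
print("Processed feature matrix shape:", X_processed.shape)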
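
The preprocessing above is fitted on every row, which is fine for exploration. When the data is later split for modeling, one common approach is to learn the imputation and scaling statistics from the training rows only. The sketch below illustrates that on a toy frame. The values, the `cutoff` year, and the chronological split are all assumptions made for the example, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for df_model (hypothetical values)
toy = pd.DataFrame({
    "Year": [2018, 2019, 2020, 2021, 2018, 2019, 2020, 2021],
    "WRI":  [5.0, np.nan, 6.0, 6.5, 10.0, 10.5, 11.0, 30.0],
})

# Chronological split: the cutoff year is an arbitrary choice for illustration
cutoff = 2020
train = toy[toy["Year"] <= cutoff]
test = toy[toy["Year"] > cutoff]

num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Learn the mean/std from the training rows only, then reuse them on the test rows
X_train = num_pipe.fit_transform(train[["WRI"]])
X_test = num_pipe.transform(test[["WRI"]])

print("Imputation mean learned from train:", num_pipe.named_steps["imputer"].statistics_)
print("Train mean after scaling:", X_train.mean().round(6))
```

Because `transform` reuses the training statistics, the extreme test value (30.0) does not influence the mean used for imputation or scaling.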


## Feature Selection

In [None]:

# 1) Correlation filter (on numeric-only columns)
corr_info = {}
if num_cols:
    corr_df = pd.concat([X[num_cols], y], axis=1).dropna()
    corr = corr_df.corr(numeric_only=True)
    display(corr[['WRI_next']].sort_values(by='WRI_next', ascending=False))
    corr_info = corr[['WRI_next']].to_dict()['WRI_next']
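
The cell above ranks numeric features by their correlation with the target but does not drop any. A correlation filter could then be applied roughly as sketched below. The toy values and the `threshold` of 0.5 are assumptions for illustration only. The absolute value is used so that strongly negative correlations are kept too.

```python
import pandas as pd

# Toy numeric frame (hypothetical values); "noise" is deliberately unrelated to the target
toy = pd.DataFrame({
    "WRI":      [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Exposure": [10.0, 12.5, 13.0, 16.5, 18.0, 19.5],
    "noise":    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "WRI_next": [6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
})

threshold = 0.5  # arbitrary cutoff for illustration; tune it for the real data

# Absolute Pearson correlation of each feature with the target
target_corr = toy.corr()["WRI_next"].drop("WRI_next").abs()

kept = target_corr[target_corr >= threshold].index.tolist()
dropped = target_corr[target_corr < threshold].index.tolist()

print("Kept:", kept)
print("Dropped:", dropped)
```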


In [None]:

# 2) SelectKBest with f_regression on processed features
# Recover the post-encoding column names so the ranking is readable.
feature_names = preprocess.get_feature_names_out()
k = min(20, X_processed.shape[1])  # pick top 20 or fewer
skb = SelectKBest(score_func=f_regression, k=k)
X_skb = skb.fit_transform(X_processed, y)

skb_scores = pd.Series(skb.scores_, index=feature_names)[skb.get_support()]
print("Top-k features by F-score:")
display(skb_scores.sort_values(ascending=False).to_frame("F-score"))


In [None]:

# 3) RandomForest feature importances (rough guidance)
rf = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
rf.fit(X_processed, y)

# Label importances with the post-encoding column names and show the top 20
rf_importances = pd.Series(rf.feature_importances_, index=preprocess.get_feature_names_out())
display(rf_importances.sort_values(ascending=False).head(20).to_frame("Importance"))
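
One-hot encoding spreads a categorical column's importance across many indicator columns, which makes `Region` hard to compare with a single numeric feature. A possible way to read the result is to sum importances back to the source column, as sketched below. The names follow the `num__` / `cat__` prefix pattern that `ColumnTransformer.get_feature_names_out()` produces by default, but the names, the importance values, and the `source_column` helper are hypothetical.

```python
import pandas as pd

# Hypothetical encoded feature names and importances, shaped like the output of
# ColumnTransformer.get_feature_names_out() and a fitted forest's feature_importances_
names = ["num__WRI", "num__Exposure", "cat__Region_A", "cat__Region_B",
         "cat__WRI Category_High", "cat__WRI Category_Low"]
importances = [0.50, 0.20, 0.10, 0.05, 0.10, 0.05]
cat_cols = ["Region", "WRI Category"]

def source_column(encoded_name, cat_cols):
    """Map an encoded feature name back to the column it was derived from."""
    _, _, rest = encoded_name.partition("__")  # strip the transformer prefix
    for col in cat_cols:
        if rest.startswith(col + "_"):
            return col
    return rest  # numeric features keep their own name

# Sum the importances of all encoded columns that share a source column
grouped = (
    pd.Series(importances, index=names)
    .groupby(lambda name: source_column(name, cat_cols))
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```

Matching against the known categorical column names, rather than splitting on the last underscore, keeps this working when category labels themselves contain underscores.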


## Save Cleaned & Model-Ready Data

In [None]:

# Save a cleaned version for Week 3 (optional)
clean_out = "clean_wdris_for_model.csv"
df_model.to_csv(clean_out, index=False)
print("Saved:", clean_out)



## Improvements (Highlights to paste in LMS)
- Created **next-year target** `WRI_next` per Region using group-wise shift.
- Performed **EDA** (info, describe, missing values, numeric distributions, trend by year).
- Cleaned column names and **imputed** missing values (mean for numeric, mode for categorical).
- Applied **One-Hot Encoding** for categories and **Standard Scaling** for numeric features.
- Ran **three feature selection approaches**: numeric correlation with target, SelectKBest, and RandomForest importances.
- Exported a **clean, model-ready CSV** for downstream modeling.
