# Child Mortality Prediction Notebook

This notebook covers the full workflow:
1. **Data ingestion**
2. **Inspection for NaN and usable years**
3. **Large timeframe prediction (1990–2023)** with robust scaling and country weights
4. **Detailed predictions (1990–2005)** for World Bank income Groups 2 (Lower-middle) and 3 (Upper-middle), comparing core and education-inclusive predictor sets

All models use 5-fold cross-validation and report R², RMSE, MAE, and MAPE.
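
As a quick reference for the four reported metrics, here is a minimal sketch on toy numbers. The arrays are invented for illustration and are not taken from the datasets. MAPE is computed as a fraction rather than a percentage.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Toy values, invented for illustration only (not taken from the datasets).
y_true = np.array([10.0, 8.0, 6.0, 4.0])
y_pred = np.array([9.0, 8.5, 5.0, 4.5])

r2 = r2_score(y_true, y_pred)                              # 1 - SS_res / SS_tot
rmse_value = np.sqrt(mean_squared_error(y_true, y_pred))   # root of the mean squared error
mae_value = mean_absolute_error(y_true, y_pred)            # mean absolute error
mape_value = np.mean(np.abs((y_true - y_pred) / y_true))   # fraction, not percent

print(f"R2={r2:.3f} RMSE={rmse_value:.3f} MAE={mae_value:.3f} MAPE={mape_value:.3f}")
```

RMSE and MAE are in the units of the target, while R² and MAPE are unitless, which is why all four are reported together.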

## Setup & Imports

In [1]:

import pandas as pd
import numpy as np
import os

from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, r2_score, mean_squared_error, mean_absolute_error


## Data Ingestion

In [5]:
import pandas as pd
import os

# Robust relative path builder
def data_path(*parts):
    return os.path.join('.', 'numerical-data', *parts)

paths = {
    'mortality': data_path('child-mortality-by-income-level-of-country', 'child-mortality-by-income-level-of-country.csv'),
    'health_exp': data_path('public-health-expenditure-share-gdp', 'public-health-expenditure-share-gdp.csv'),
    # Both 'electricity-access' and 'access-to-electricity' exist; choose one consistently (electricity-access)
    'electricity_gdp': data_path('electricity-access', 'access-to-electricity-vs-gdp-per-capita.csv'),
    'fertility': data_path('fertility-rate-vs-share-of-women-between-25-and-29-years-old-with-no-education', 'fertility-rate-vs-share-of-women-between-25-and-29-years-old-with-no-education.csv')
}

missing = [k for k,p in paths.items() if not os.path.exists(p)]
if missing:
    print('Warning: some files not found:', {k: paths[k] for k in missing})
else:
    print('All source CSV files located.')

mortality_df   = pd.read_csv(paths['mortality'])
health_exp_df  = pd.read_csv(paths['health_exp'])
elec_df        = pd.read_csv(paths['electricity_gdp'])
fertility_df   = pd.read_csv(paths['fertility'])

All source CSV files located.


## Inspection for NaN and Usable Years

In [6]:

print(mortality_df.head())
print(health_exp_df.head())
print(elec_df.head())
print(fertility_df.head())

print("\nNaN counts:")
print("Mortality:", mortality_df.isna().sum().sum())
print("Health:", health_exp_df.isna().sum().sum())
print("Electricity/GDP:", elec_df.isna().sum().sum())
print("Fertility/Education:", fertility_df.isna().sum().sum())


        Entity Code  Year  \
0  Afghanistan  AFG  1957   
1  Afghanistan  AFG  1958   
2  Afghanistan  AFG  1959   
3  Afghanistan  AFG  1960   
4  Afghanistan  AFG  1961   

   Child mortality rate of children aged under five years, per 100 live births  
0                                          37.132380                            
1                                          36.523033                            
2                                          35.951195                            
3                                          35.316550                            
4                                          34.760840                            
      Entity Code  Year  Public health expenditure as a share of GDP
0  Argentina  ARG  1880                                          0.0
1  Argentina  ARG  1890                                          0.0
2  Argentina  ARG  1900                                          0.0
3  Argentina  ARG  1910                                        
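
The cell above prints one grand NaN total per table. To decide which years are usable, a per-column and per-year breakdown may be more informative. The following is a minimal sketch on a synthetic frame: the `value` column name and the 60% coverage threshold are assumptions made for illustration, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one source table; the "value" column name is hypothetical.
toy = pd.DataFrame({
    "Entity": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Year":   [1989, 1990, 1991] * 3,
    "value":  [np.nan, 1.0, 2.0, np.nan, 3.0, np.nan, 5.0, 6.0, 7.0],
})

# NaN count per column rather than one grand total.
nan_per_column = toy.isna().sum()

# Share of entities with a non-missing value in each year.
coverage = toy.groupby("Year")["value"].apply(lambda s: s.notna().mean())

# Assumed rule: a year is usable when at least 60% of entities report a value.
MIN_COVERAGE = 0.6
usable_years = coverage[coverage >= MIN_COVERAGE].index.tolist()

print(nan_per_column)
print(coverage)
print("Usable years:", usable_years)
```

The same pattern could be applied to each source frame in turn, with the intersection of usable years defining the modelling window.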

## Helper Functions

In [7]:

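def rmse(y_true, y_pred, sample_weight=None):
    return np.sqrt(mean_squared_error(y_true, y_pred, sample_weight=sample_weight))

def mae(y_true, y_pred, sample_weight=None):
    return mean_absolute_error(y_true, y_pred, sample_weight=sample_weight)

def mape(y_true, y_pred, sample_weight=None):
    # Returns a fraction (0.1 means 10%). Rows with a zero target are skipped
    # to avoid division by zero.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    # Apply the weights when given; np.average with weights=None is a plain mean.
    weights = None if sample_weight is None else np.asarray(sample_weight)[mask]
    return np.average(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]), weights=weights)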

r2_scorer   = make_scorer(r2_score, greater_is_better=True)
rmse_scorer = make_scorer(rmse, greater_is_better=False)
mae_scorer  = make_scorer(mae, greater_is_better=False)
mape_scorer = make_scorer(mape, greater_is_better=False)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

import inspect

def cv_scores_full(model, X, y, w, fit_param_name="reg__sample_weight"):
    # The weights reach the estimator's fit only; the scorers stay unweighted.
    # Each entry is (scorer, sign): error scorers are negated by make_scorer.
    scorers = {"R2": (r2_scorer, 1.0), "RMSE": (rmse_scorer, -1.0),
               "MAE": (mae_scorer, -1.0), "MAPE": (mape_scorer, -1.0)}
    # Newer scikit-learn releases take `params` where older ones take `fit_params`;
    # use whichever keyword the installed version accepts.
    kw = "params" if "params" in inspect.signature(cross_val_score).parameters else "fit_params"
    out = {}
    try:
        raw = {k: cross_val_score(model, X, y, scoring=s, cv=cv, **{kw: {fit_param_name: w}})
               for k, (s, _) in scorers.items()}
        out["weighted"] = True
    except Exception as e:
        raw = {k: cross_val_score(model, X, y, scoring=s, cv=cv)
               for k, (s, _) in scorers.items()}
        out["weighted"] = False
        out["warning"] = f"Weighted CV not supported; fallback to unweighted. ({e})"
    for k, (_, sign) in scorers.items():
        out[f"{k}_mean"] = float(sign * np.mean(raw[k]))
        out[f"{k}_std"] = float(np.std(raw[k]))
    return out

def make_models():
    pipe_ols = Pipeline([("scale", RobustScaler()), ("reg", LinearRegression())])
    pipe_ridge = Pipeline([("scale", RobustScaler()), ("reg", Ridge(random_state=42))])
    pipe_lasso = Pipeline([("scale", RobustScaler()), ("reg", Lasso(max_iter=10000, random_state=42))])
    pipe_rf = Pipeline([("reg", RandomForestRegressor(random_state=42, n_jobs=-1))])

    ridge_grid = GridSearchCV(pipe_ridge, {"reg__alpha": np.logspace(-3, 2, 12)}, scoring=r2_scorer, cv=cv, n_jobs=-1)
    lasso_grid = GridSearchCV(pipe_lasso, {"reg__alpha": np.logspace(-3, 1, 12)}, scoring=r2_scorer, cv=cv, n_jobs=-1)
    rf_grid    = GridSearchCV(pipe_rf, {
        "reg__n_estimators":[200,400],
        "reg__max_depth":[4,6,8,None],
        "reg__min_samples_leaf":[1,2,5],
        # "auto" is no longer accepted by recent scikit-learn; for regressors it meant all features (1.0).
        "reg__max_features":[1.0,"sqrt"]
    }, scoring=r2_scorer, cv=cv, n_jobs=-1)

    return pipe_ols, ridge_grid, lasso_grid, rf_grid
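
To make the role of the `reg__sample_weight` fit parameter concrete, here is a minimal sketch of weighted cross-validation written as an explicit fold loop. The data and weights are synthetic and invented for illustration. Unlike `cv_scores_full`, this sketch also applies the weights to the held-out metrics, which is an assumption about what may be wanted rather than what the notebook does.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
w = rng.uniform(0.5, 1.5, size=100)  # hypothetical per-row weights

pipe = Pipeline([("scale", RobustScaler()), ("reg", Ridge(alpha=1.0))])
folds = KFold(n_splits=5, shuffle=True, random_state=42)

fold_r2, fold_rmse = [], []
for train_idx, test_idx in folds.split(X):
    # "reg__sample_weight" routes the weights to the pipeline step named "reg".
    pipe.fit(X[train_idx], y[train_idx], reg__sample_weight=w[train_idx])
    pred = pipe.predict(X[test_idx])
    # Here the weights are applied to the held-out metrics as well.
    fold_r2.append(r2_score(y[test_idx], pred, sample_weight=w[test_idx]))
    fold_rmse.append(np.sqrt(mean_squared_error(y[test_idx], pred, sample_weight=w[test_idx])))

print(f"R2   {np.mean(fold_r2):.3f} +/- {np.std(fold_r2):.3f}")
print(f"RMSE {np.mean(fold_rmse):.3f} +/- {np.std(fold_rmse):.3f}")
```

The scaler is fitted inside each training fold because it is part of the pipeline, so no statistics leak from the held-out rows.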


## Large Timeframe Prediction (1990–2023)

This section builds the 1990–2023 dataset from the core predictors only, applies robust scaling and country weights, and fits OLS, Ridge, Lasso, and Random Forest models under 5-fold cross-validation.
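
The preparation step could look roughly like the sketch below, which uses two small synthetic frames. `Entity` and `Year` match the columns shown in the inspection output; every other column name is hypothetical. The weighting scheme (each row weighted by the inverse of its country's row count, so that every country carries the same total weight) is an assumption, since the notebook text does not define the country weights.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the source tables; value column names are hypothetical.
mortality = pd.DataFrame({
    "Entity": ["A"] * 4 + ["B"] * 2,
    "Year": [1989, 1990, 1991, 1992, 1990, 1991],
    "child_mortality": [9.0, 8.5, 8.0, 7.6, 4.0, 3.8],
})
health = pd.DataFrame({
    "Entity": ["A"] * 4 + ["B"] * 2,
    "Year": [1989, 1990, 1991, 1992, 1990, 1991],
    "health_exp_share": [np.nan, 1.1, 1.2, 1.3, 3.0, np.nan],
})

# Inner join on the shared keys, keep the modelling window, drop incomplete rows.
panel = mortality.merge(health, on=["Entity", "Year"], how="inner")
panel = panel[panel["Year"].between(1990, 2023)].dropna().reset_index(drop=True)

# Assumed weighting scheme: 1 / (number of rows for the country), so each
# country contributes the same total weight however many years it reports.
panel["weight"] = 1.0 / panel.groupby("Entity")["Entity"].transform("count")

X = panel[["health_exp_share"]].to_numpy()
y = panel["child_mortality"].to_numpy()
w = panel["weight"].to_numpy()
print(panel)
```

The resulting `X`, `y`, and `w` have the shapes expected by `cv_scores_full(model, X, y, w)`.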

## Detailed Predictions (1990–2005, Groups 2 & 3)

We run four analyses:
- Group 2 (Lower-middle) — Core
- Group 2 — +Education
- Group 3 (Upper-middle) — Core
- Group 3 — +Education

All four analyses use robust scaling, country weights, and 5-fold cross-validation, and report R², RMSE, MAE, and MAPE.
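
The four analyses form a 2 × 2 grid (income group × predictor set), which can be sketched as a nested loop. The panel below is synthetic, and all column names (`income_group`, `health_exp_share`, `gdp_per_capita`, `women_no_education_share`, `child_mortality`) are hypothetical. For brevity the sketch fits only an unweighted OLS pipeline and reports R².

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic panel, invented for illustration; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 400
panel = pd.DataFrame({
    "income_group": rng.choice([1, 2, 3, 4], size=n),
    "Year": rng.integers(1985, 2011, size=n),
    "health_exp_share": rng.uniform(0, 8, size=n),
    "gdp_per_capita": rng.uniform(500, 20000, size=n),
    "women_no_education_share": rng.uniform(0, 60, size=n),
})
panel["child_mortality"] = (
    12 - 0.5 * panel["health_exp_share"]
    + 0.1 * panel["women_no_education_share"]
    + rng.normal(scale=0.5, size=n)
)

CORE = ["health_exp_share", "gdp_per_capita"]
FEATURE_SETS = {"Core": CORE, "+Education": CORE + ["women_no_education_share"]}
GROUPS = {2: "Lower-middle", 3: "Upper-middle"}

folds = KFold(n_splits=5, shuffle=True, random_state=42)
rows = []
for group_id, group_name in GROUPS.items():
    # Restrict to one income group and the 1990-2005 window.
    subset = panel[(panel["income_group"] == group_id) & panel["Year"].between(1990, 2005)]
    for set_name, columns in FEATURE_SETS.items():
        model = Pipeline([("scale", RobustScaler()), ("reg", LinearRegression())])
        scores = cross_val_score(model, subset[columns], subset["child_mortality"],
                                 scoring="r2", cv=folds)
        rows.append({"group": group_name, "predictors": set_name,
                     "n": len(subset), "R2_mean": scores.mean()})

results = pd.DataFrame(rows)
print(results)
```

Collecting one row per analysis in a single table keeps the Core and +Education results for each group side by side for comparison.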