
# Capstone Module 6  
## What Factors Drive National Mortality Rates?

**Author:** [Your Name]  
**Date:** [Date]  

---

## Business Understanding

We aim to understand what factors most influence a nation's overall mortality rate.  
Specifically, is it driven by healthcare access, doctor-patient ratio, per capita income, education levels, or other factors?

**Goal:** Provide insights and actionable recommendations for policymakers to prioritize interventions that can lower mortality rates.


# 1. Data Loading

In [None]:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Set visualization style (set_theme is the current name for sns.set)
sns.set_theme(style='whitegrid', palette='muted')


In [None]:

# Load data
df = pd.read_csv("data.csv")  # Replace with correct path

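# Quick look at data
# A notebook cell only displays its last expression, so print the others explicitly
print(df.head())
df.info()  # info() prints its own summary
print(df.describe())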


# 2. Data Cleaning

In [None]:

# Check missing values (printed so the result shows even though it is not the last line)
print(df.isnull().sum())

# Drop rows that are missing the target or any of the core predictors
df = df.dropna(subset=['Mortality Rate (%)', 'Healthcare Access (%)', 'Doctors per 1000', 'Per Capita Income (USD)', 'Education Index'])

# Optional: Encode 'Availability of Vaccines/Treatment' if it is stored as 'Yes'/'No'
# Note: .map() turns every value that is not a key in the dictionary into NaN
df['Availability of Vaccines/Treatment'] = df['Availability of Vaccines/Treatment'].map({'Yes': 1, 'No': 0})

# Optional: Encode Gender if needed (the integer codes are labels, not an ordering)
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1, 'Other': 2})
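
Because `.map()` silently converts unmapped labels to `NaN`, it can help to check which labels fell through before modeling. The following is a minimal sketch on a toy column; the label values (including the lowercase `'female'`) are hypothetical and are not taken from the project's dataset.

```python
import pandas as pd

# Toy stand-in for the real column; these label values are hypothetical.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Other", "female", None]})

mapping = {"Male": 0, "Female": 1, "Other": 2}
encoded = toy["Gender"].map(mapping)

# Rows that had a value before mapping but are NaN afterwards
# carried a label that is not a key in the mapping.
unmapped = toy.loc[toy["Gender"].notna() & encoded.isna(), "Gender"]
print("Unmapped labels:", unmapped.unique().tolist())
print("NaN count after mapping:", int(encoded.isna().sum()))
```

If the list of unmapped labels is not empty, either extend the mapping or normalize the raw strings (for example with `.str.strip().str.title()`) before encoding.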


# 3. Exploratory Data Analysis (EDA)

In [None]:

# Correlation Heatmap
plt.figure(figsize=(12,10))
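# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x
corr_matrix = df.corr(numeric_only=True)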
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
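
A full heatmap can be hard to read when there are many columns. One possible complement is to rank the features by the strength of their correlation with the target alone. This sketch uses synthetic data; the column values and the `Country` column are made up for illustration, and only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (values are invented for illustration).
rng = np.random.default_rng(0)
n = 200
access = rng.uniform(20, 100, n)
toy = pd.DataFrame({
    "Healthcare Access (%)": access,
    "Education Index": rng.uniform(0.3, 0.9, n),
    "Mortality Rate (%)": 10 - 0.08 * access + rng.normal(0, 0.5, n),
    "Country": ["X"] * n,  # non-numeric column, ignored by numeric_only=True
})

target = "Mortality Rate (%)"
corr_with_target = (
    toy.corr(numeric_only=True)[target]
    .drop(target)                                          # drop the self-correlation of 1.0
    .sort_values(key=lambda s: s.abs(), ascending=False)   # strongest first, either sign
)
print(corr_with_target)
```

Sorting by absolute value keeps strong negative correlations at the top, which matters here because better healthcare access would be expected to lower mortality.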


In [None]:

# Distribution of Mortality Rate
plt.figure(figsize=(8,6))
sns.histplot(df['Mortality Rate (%)'], kde=True)
plt.title('Distribution of Mortality Rate (%)')
plt.xlabel('Mortality Rate (%)')
plt.ylabel('Frequency')
plt.show()


# 4. PCA - Dimensionality Reduction

In [None]:

# Selecting numerical features
features = ['Healthcare Access (%)', 'Doctors per 1000', 'Hospital Beds per 1000', 
            'Average Treatment Cost (USD)', 'Availability of Vaccines/Treatment',
            'Recovery Rate (%)', 'DALYs', 'Improvement in 5 Years (%)',
            'Per Capita Income (USD)', 'Education Index', 'Urbanization Rate (%)']

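# Keep only rows with complete values for every selected feature and the target,
# because PCA cannot handle NaN
model_df = df.dropna(subset=features + ['Mortality Rate (%)'])
X = model_df[features]
y = model_df['Mortality Rate (%)']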

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_scaled)

# Plot PCA
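pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
# Use the raw values: y keeps the original row labels after dropna, so assigning
# the Series directly would misalign it with pca_df's fresh 0..n-1 index
pca_df['Mortality Rate'] = y.to_numpy()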

plt.figure(figsize=(8,6))
sns.scatterplot(x='PC1', y='PC2', hue='Mortality Rate', data=pca_df, palette='viridis')
plt.title('PCA - Top 2 Components vs Mortality Rate')
plt.show()

# Explained variance
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
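
Two components are convenient for plotting but may capture only part of the variance. A common way to judge this is to look at the cumulative explained variance across all components. The sketch below runs on synthetic data, and the 90% threshold is an arbitrary assumption chosen for illustration, not a value from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 6 observed features driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
X_toy = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Fit PCA with all components to see how variance accumulates.
pca_full = PCA().fit(X_toy_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90  # illustrative choice
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed for 90%:", n_needed)
```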


# 5. Regression Modeling

In [None]:

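# Train/Test Split (split the unscaled features first so the test set stays unseen)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets
model_scaler = StandardScaler()
X_train = model_scaler.fit_transform(X_train)
X_test = model_scaler.transform(X_test)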

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predictions
y_pred_lr = lr.predict(X_test)

# Evaluation
print("Linear Regression R2:", r2_score(y_test, y_pred_lr))
print("Linear Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lr)))
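
A single train/test split gives one estimate of performance, which can vary with the random seed. As a hedged alternative, scaling and regression can be wrapped in a pipeline and scored with `cross_val_score`, so the scaler is refit inside each fold. The data here is synthetic and the feature scales are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4)) * np.array([1, 10, 100, 1000])
y_toy = 2.0 * X_toy[:, 0] - 0.3 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

# The pipeline refits StandardScaler on the training part of every fold,
# so no statistics from the validation part leak into the model.
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="r2")

print("Fold R2:", np.round(scores, 3))
print("Mean R2:", round(scores.mean(), 3))
```

The spread of the fold scores gives a rough sense of how stable the R² estimate is.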


In [None]:

# Random Forest Regressor + GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20]
}

rf = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)

# Best Model
best_rf = grid_search.best_estimator_
y_pred_rf = best_rf.predict(X_test)

# Evaluation
print("Random Forest R2:", r2_score(y_test, y_pred_rf))
print("Random Forest RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))
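
R² and RMSE are easier to interpret next to a trivial baseline. The sketch below compares a Random Forest with scikit-learn's `DummyRegressor`, which always predicts the training mean. The data is synthetic, with a nonlinear target invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic nonlinear target (illustrative only).
rng = np.random.default_rng(1)
X_toy = rng.uniform(-2, 2, size=(300, 3))
y_toy = np.sin(X_toy[:, 0]) + X_toy[:, 1] ** 2 + rng.normal(scale=0.2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

estimators = {
    "Mean baseline": DummyRegressor(strategy="mean"),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, est in estimators.items():
    pred = est.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "r2": r2_score(y_te, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
    }
    print(f"{name}: R2={results[name]['r2']:.3f}, RMSE={results[name]['rmse']:.3f}")
```

A model is only useful to the extent that it beats this baseline. The mean predictor can never score above zero on R², so it marks the floor.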


In [None]:

# Feature Importances
importances = best_rf.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importances}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importances - Random Forest')
plt.show()
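
Impurity-based importances from `feature_importances_` are computed on the training data and can favor features with many distinct values. One possible cross-check is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. This sketch uses synthetic data, and the feature names `signal`, `weak`, and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: one strong driver, one weak driver, one irrelevant feature.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(400, 3)), columns=["signal", "weak", "noise"])
y_toy = 3.0 * X_toy["signal"] + 0.5 * X_toy["weak"] + rng.normal(scale=0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in the test set and measure how much the score drops.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

perm_df = pd.DataFrame({
    "Feature": X_toy.columns,
    "Importance": result.importances_mean,
}).sort_values(by="Importance", ascending=False)
print(perm_df)
```

If the two rankings disagree noticeably, the permutation result is usually the safer one to report, since it reflects performance on data the model has not seen.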



# 6. Findings and Recommendations

## Key Insights:
- The factors most strongly associated with Mortality Rate in this analysis were:
    - Healthcare Access (%)
    - Per Capita Income (USD)
    - Education Index
    - DALYs (Disability-Adjusted Life Years)

- Random Forest achieved a higher R² than Linear Regression on the test set, which suggests nonlinear effects or feature interactions that a linear model does not capture.



# README.md (Summary):

- This project analyzes **global health statistics** to determine the key drivers of national mortality rates.
- Techniques used:
    - PCA for dimensionality reduction
    - Correlation analysis
    - Regression models (Linear & Random Forest with GridSearchCV)
- Key findings are summarized to help policymakers prioritize interventions that can lower mortality rates.




