<a href="https://colab.research.google.com/github/surendran2566/Life_Expectancy_Analysis/blob/main/Life_Expectancy_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
## 🧠 Project Title: Life Expectancy Analysis

### 📄 About the Dataset:
'''
This dataset represents key health and economic indicators from 193 countries spanning 2000–2015. Collected from WHO and UN sources,
it includes features related to immunization, mortality, GDP, healthcare expenditure, schooling, and more.'''

### 🎯 Project Objective:

'''We aim to determine which socio-economic and healthcare-related factors impact life expectancy the most, and how policies might help countries improve it.'''

### 🔍 Key Research Questions:
'''
1. Which features truly impact life expectancy?
2. Should countries with low life expectancy (<65 years) increase healthcare expenditure?
3. How do infant and adult mortality influence lifespan?
4. Do lifestyle habits (e.g., alcohol, smoking, exercise) affect life expectancy?
5. What role does schooling play in lifespan?
6. Is alcohol consumption positively or negatively related to life expectancy?
7. Does population density correlate with life expectancy?
8. How does immunization affect longevity? '''

## 🔧 Step-by-Step Implementation

### Step 1: Install and Import Required Libraries

!pip install tensorflow keras keras-tuner shap scikit-learn seaborn xgboost
!pip install nbstripout

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
import keras_tuner as kt
import shap
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

warnings.filterwarnings('ignore')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
shap.initjs()

In [None]:
### Step 2: Load and Inspect the Dataset

from google.colab import files
import io

# Upload your dataset (you'll get a pop-up to select the CSV file)
uploaded = files.upload()

# Load the first uploaded file safely
df = pd.read_csv(io.BytesIO(next(iter(uploaded.values()))))

# View the structure of the data
print("Dataset Shape:", df.shape)
print("\nSample Data:")
print(df.head())
print("\nMissing Values Count:")
print(df.isnull().sum())
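
# The raw null counts above are easier to act on as percentages. The sketch below is a
# minimal helper for that; `missing_report` and the toy frame are hypothetical names and
# made-up values used only for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(frame),
    })
    # Keep only the columns that actually contain gaps
    return report[report["missing"] > 0].sort_values("percent", ascending=False)

# Toy frame for illustration only; the values are invented
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "GDP": [1.0, np.nan, 3.0, np.nan],
    "Schooling": [10.0, 11.0, np.nan, 12.0],
})
report = missing_report(toy)
print(report)
```

# On the real data, `missing_report(df)` would show which indicators lose the most rows
# if nulls are dropped outright.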

In [None]:
### Step 3: Data Cleaning

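# Drop every row that has at least one missing value.
# Note: this is aggressive, and it leaves nothing for the per-country imputation in Step 4 to fill.
df.dropna(inplace=True)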
print("After Removing Nulls:", df.shape)

In [None]:
### Step 4: Data Cleaning and Feature Engineering (Reshape Style)

# Rename for consistency
if 'Life expectancy' in df.columns:
    df.rename(columns={'Life expectancy': 'Life_expectancy'}, inplace=True)

# Ensure 'Year' is numeric
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')

# Move the identifier columns (Country, Year, and Gender if present) to the front
if 'Gender' in df.columns:
    reorder_cols = ['Country', 'Year', 'Gender'] + [col for col in df.columns if col not in ['Country', 'Year', 'Gender']]
else:
    reorder_cols = ['Country', 'Year'] + [col for col in df.columns if col not in ['Country', 'Year']]
df = df[reorder_cols]

# Exclude identifier, target, and non-numeric columns from imputation
non_numeric_cols = df.select_dtypes(include=['object']).columns.tolist()
columns_to_impute = [col for col in df.columns if col not in ['Country', 'Year', 'Life_expectancy', 'Gender'] and col not in non_numeric_cols]

# Impute missing values per country
def groupwise_mean_impute(df, group_col, target_col):
    # Coerce to numeric first so the group mean is computed on clean values
    numeric = pd.to_numeric(df[target_col], errors='coerce')
    return numeric.groupby(df[group_col]).transform(lambda x: x.fillna(x.mean()))

for col in columns_to_impute:
    df[col] = groupwise_mean_impute(df, 'Country', col)

# Drop any remaining NaNs
df.dropna(inplace=True)

# Summary outputs
print("Cleaned Data Shape:", df.shape)
print("Duplicate Records:", df.duplicated().sum())
print("\nDescriptive Statistics:")
display(df.describe())
print("\nNumber of Countries:", df['Country'].nunique())
print("Year Range:", df['Year'].min(), "-", df['Year'].max())
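
# To make the per-country imputation concrete, here is a small self-contained sketch of
# the same groupby/transform idea on a toy frame. The data and the `GDP_filled` column
# name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: country A has one gap, country B has one gap
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "GDP": [100.0, np.nan, 300.0, np.nan, 50.0],
})

# Fill each gap with the mean of the same country's observed values
toy["GDP_filled"] = toy.groupby("Country")["GDP"].transform(lambda s: s.fillna(s.mean()))
print(toy)
```

# Country A's gap becomes 200.0 (the mean of 100 and 300) and country B's becomes 50.0.
# A country with no observed values at all would stay NaN, which is why a final dropna
# is still needed afterwards.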

In [None]:
### Step 5: Exploratory Data Analysis (EDA)

# Dynamically drop identifier and target columns based on availability
# (the target name is matched after stripping whitespace, since the raw column may be 'Life expectancy ')
drop_cols = [col for col in df.columns
             if col in ('Country', 'Year', 'Gender')
             or col.strip() in ('Life_expectancy', 'Life expectancy')]

features = df.columns.drop(drop_cols)

# Distribution plots
for col in features:
    plt.figure(figsize=(6, 4))
    sns.histplot(df[col], kde=True)
    plt.title(f"Distribution: {col}")
    plt.tight_layout()
    plt.show()

# Correlation heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm', annot=False)
plt.title("Correlation Matrix")
plt.show()

# Gender vs Life Expectancy (if available)
life_col = next((col for col in df.columns
                 if col.strip() in ('Life_expectancy', 'Life expectancy')), None)
if 'Gender' in df.columns and life_col is not None:
    sns.boxplot(x='Gender', y=life_col, data=df)
    plt.title("Life Expectancy by Gender")
    plt.show()
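
# The heatmap shows every pairwise correlation. For the research questions it can help to
# rank features by their correlation with the target alone. A minimal sketch follows;
# `rank_correlations` is a hypothetical helper and the toy columns are invented stand-ins,
# not values from the WHO data.

```python
import numpy as np
import pandas as pd

def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of each numeric column with the target, strongest first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    # Order by absolute strength but keep the sign in the result
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

x = np.arange(10.0)
toy = pd.DataFrame({
    "Schooling": x,                          # perfectly linear with the target
    "Adult_Mortality": -(x ** 2),            # strongly negative, not perfectly linear
    "Unrelated": np.tile([1.0, -1.0], 5),    # alternating values, weak relationship
    "Life_expectancy": 50 + 2 * x,
})
ranked = rank_correlations(toy, "Life_expectancy")
print(ranked.round(3))
```

# Correlation only measures linear association, so a ranking like this is a starting
# point for the questions above rather than evidence of causation.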

In [None]:
### Step 6: Preprocessing – Scaling & PCA

# Identify numeric columns only
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

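# Resolve the actual target column name: the raw column may carry a trailing space
# ('Life expectancy '), or it may have been renamed to 'Life_expectancy' in Step 4
target_column = next(col for col in df.columns
                     if col.strip() in ('Life_expectancy', 'Life expectancy'))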

# Exclude identifier and target columns
exclude_cols = ['Country', 'Year', target_column]
if 'Gender' in df.columns:
    exclude_cols.append('Gender')

features = [col for col in numeric_cols if col not in exclude_cols]

# Define input and output
X = df[features].copy()
y = df[target_column].copy()

# Standardize input features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA to preserve 90% variance
pca = PCA(n_components=0.9)
X_pca = pca.fit_transform(X_scaled)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)

# Output shapes
print("Original numeric feature count:", len(features))
print("PCA reduced feature count:", X_pca.shape[1])
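
# The cell above fits the scaler and PCA on all rows before splitting, so the test rows
# influence the preprocessing statistics. One possible alternative, sketched here on
# synthetic data, is to split first and fit a scikit-learn Pipeline on the training rows
# only. All `*_demo`, `X_tr`, and `X_te` names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix; two columns are made correlated on purpose
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 8))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)
y_demo = 3 * X_demo[:, 0] + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

preprocess = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.9)),  # keep enough components for 90% of the variance
])
X_tr_pca = preprocess.fit_transform(X_tr)  # statistics learned from training rows only
X_te_pca = preprocess.transform(X_te)      # test rows are transformed, never fitted

explained = preprocess.named_steps["pca"].explained_variance_ratio_
print("Components kept:", X_tr_pca.shape[1])
print("Cumulative variance:", explained.cumsum().round(3))
```

# Printing the cumulative explained variance also shows how much information each
# retained component contributes.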

In [None]:
### Step 7: Deep Learning Model

def build_nn_model():
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

nn_model = build_nn_model()
nn_history = nn_model.fit(X_train, y_train, epochs=200, validation_split=0.2, verbose=0)

plt.plot(nn_history.history['loss'], label='Train Loss')
plt.plot(nn_history.history['val_loss'], label='Val Loss')
plt.title('DL Model Loss History')
plt.legend()
plt.show()

print("Test MSE:", nn_model.evaluate(X_test, y_test))

In [None]:
### Step 8: Classical ML Models + Evaluation

X_train_ml, X_test_ml, y_train_ml, y_test_ml = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

models = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "Extra Trees": ExtraTreesRegressor(random_state=42),
    "Gradient Boost": GradientBoostingRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42)
}

results = []
for name, model in models.items():
    model.fit(X_train_ml, y_train_ml)
    preds = model.predict(X_test_ml)
    rmse = np.sqrt(mean_squared_error(y_test_ml, preds))
    r2 = r2_score(y_test_ml, preds)
    results.append((name, rmse, r2))

results_df = pd.DataFrame(results, columns=['Model', 'RMSE', 'R2_Score']).sort_values(by='R2_Score', ascending=False)
print("\nModel Evaluation Results:")
print(results_df)

plt.figure(figsize=(8,6))
sns.barplot(x='Model', y='R2_Score', data=results_df)
plt.title("Model Comparison - R2 Score")
plt.ylim(0.9, 1.0)
plt.tight_layout()
plt.show()
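
# For readers unfamiliar with the two metrics in the table, the sketch below computes
# RMSE and R2 by hand with NumPy. The helper names `rmse` and `r2` and the four sample
# values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: typical prediction error in the target's units."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred) -> float:
    """Share of the target's variance explained by the predictions."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return float(1 - ss_res / ss_tot)

# Invented life expectancy values (years) for illustration
y_true_demo = [70.0, 65.0, 80.0, 75.0]
y_pred_demo = [72.0, 64.0, 78.0, 75.0]

print(f"RMSE: {rmse(y_true_demo, y_pred_demo):.3f}")  # RMSE: 1.500
print(f"R2:   {r2(y_true_demo, y_pred_demo):.3f}")    # R2:   0.928
```

# An RMSE of 1.5 here means predictions miss by about a year and a half on average,
# which is often easier to interpret than R2 alone.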

In [None]:
### Step 9: Cross-Validation (XGBoost)

best_model = XGBRegressor(random_state=42)
kf = KFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(best_model, X_scaled, y, cv=kf, scoring='r2')

print("\nCross-Validation (XGBoost):")
print(f"Mean R2 Score: {cv_scores.mean():.4f}")
print(f"Standard Deviation: {cv_scores.std():.4f}")

plt.plot(cv_scores, marker='o')
plt.title("Cross-Validation Scores")
plt.xlabel("Fold")
plt.ylabel("R2 Score")
plt.grid(True)
plt.show()
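
# Cross-validation can report more than one metric per fold. The sketch below uses
# scikit-learn's `cross_validate` to collect both R2 and RMSE. It runs on synthetic data
# and swaps in GradientBoostingRegressor for XGBoost so that it depends on scikit-learn
# only; the `*_demo` names and the `n_estimators` and fold settings are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem standing in for the scaled features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)

model_demo = GradientBoostingRegressor(n_estimators=50, random_state=42)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    model_demo, X_demo, y_demo, cv=kf_demo,
    scoring=("r2", "neg_root_mean_squared_error"),
)
r2_scores = scores["test_r2"]
rmse_scores = -scores["test_neg_root_mean_squared_error"]  # flip sign back to a positive error

print(f"R2:   {r2_scores.mean():.4f} +/- {r2_scores.std():.4f}")
print(f"RMSE: {rmse_scores.mean():.4f} +/- {rmse_scores.std():.4f}")
```

# scikit-learn reports error metrics as negated scores so that higher is always better,
# hence the sign flip before printing.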

In [None]:
### Step 10: Feature Importance (Random Forest)

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train_ml, y_train_ml)
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), np.array(features)[indices], rotation=90)
plt.title("Feature Importances via Random Forest")
plt.tight_layout()
plt.show()
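
# Impurity-based importances from a random forest are computed on the training data and
# can favor features with many distinct values. As a complementary check, the sketch
# below uses scikit-learn's `permutation_importance` on held-out rows. The synthetic
# data and the feature names are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: one strong driver, one weak driver, two pure-noise columns
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 5 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)
names = ["strong", "weak", "noise_1", "noise_2"]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in R2
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=42)
order = np.argsort(result.importances_mean)[::-1]
for i in order:
    print(f"{names[i]:<8} {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

# If both rankings agree on the top features, that gives more confidence in the answer
# to the first research question.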

In [None]:
### Step 11: SHAP Value Interpretation (Neural Network)

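X_sample = X_train[:100]
background = shap.kmeans(X_sample, 10)

# Flatten the (n, 1) network output so KernelExplainer treats it as a single-output model
explainer = shap.KernelExplainer(lambda data: nn_model.predict(data, verbose=0).flatten(), background)
shap_values = explainer.shap_values(X_sample)

# The network was trained on PCA components, so label the plot with component names
# rather than the original feature names
pc_names = [f"PC{i + 1}" for i in range(X_sample.shape[1])]
shap.summary_plot(shap_values, features=X_sample, feature_names=pc_names)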
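
# Because the network was trained on PCA components, SHAP values describe components
# rather than the original indicators. One way to relate them back is to inspect the PCA
# loadings, sketched below on synthetic data. The toy column names and the `pca_demo`
# and `loadings` variables are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two correlated columns and one independent column, all invented
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "Schooling": base + 0.1 * rng.normal(size=200),
    "Income": base + 0.1 * rng.normal(size=200),
    "Alcohol": rng.normal(size=200),
})

pca_demo = PCA(n_components=2).fit(StandardScaler().fit_transform(toy))

# Rows are original features, columns are components; large absolute values
# mean the feature contributes strongly to that component
loadings = pd.DataFrame(pca_demo.components_.T, index=toy.columns, columns=["PC1", "PC2"])
print(loadings.round(2))
```

# With the notebook's own objects, the same table could be built from `pca.components_`
# and the `features` list, which would show which indicators sit behind a component that
# SHAP flags as influential.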

### ✅ Conclusion:
'''
This notebook covers a complete life expectancy study.
It includes data preprocessing, EDA, model training (deep learning and classical), evaluation, cross-validation, and interpretation.
All steps align with the original project objectives, dataset documentation, and research questions.'''

 ##### THIS IS A WORK DONE BY 'SURENDRAN L'

