# **Imputation: World Happiness Report**

**Group Number:** 97  
**Members:**  
Roy Rui #300176548  
Jiayi Ma #300263220
 

---

# **Dataset II: World Happiness Report (2019)** 
**Authors & Collaborators**: Sustainable Development Solutions Network, Abigail Larion  
**Source**: [World Happiness Report on Kaggle](https://www.kaggle.com/datasets/unsdsn/world-happiness), based on Gallup World Poll data  
**Shape**: **9 Columns, 156 Rows**  
**Purpose**: Evaluates and ranks countries based on happiness indicators such as GDP per capita, social support, life expectancy, and more.  

---

# **Introduction**
This Jupyter Notebook corresponds to Assignment 2 of the CSI4142 course. The main objective is to demonstrate **imputation tests** as required by the assignment instructions. Each test is clearly separated, accompanied by code, results, and explanations to ensure reproducibility and clarity.



---

# **Dataset Description**
The **World Happiness Report** is an annual survey that ranks countries by happiness levels, using multiple socioeconomic and well-being indicators. The happiness score is derived from responses to the **Cantril ladder question**, which asks respondents to rate their lives on a scale from **0 (worst possible life)** to **10 (best possible life)**.  

The rankings are determined using **six key contributing factors**:
- **GDP per capita** (economic production)
- **Social support** (availability of help from friends/family)
- **Healthy life expectancy** (life expectancy at birth)
- **Freedom to make life choices** (perceived ability to make key decisions)
- **Generosity** (charitable giving and volunteering behavior)
- **Perceptions of corruption** (trust in government/business integrity)

This dataset provides valuable insights into the relationship between economic, social, and governance factors and overall happiness across different countries.  
  


| Feature                        | Description                                                    | Data Type   |
|--------------------------------|----------------------------------------------------------------|-------------|
| Overall rank                   | Happiness rank (1 = happiest)                                  | Numerical   |
| Country or region              | Name of the country/region                                     | Categorical |
| Score                          | Average life evaluation score (0–10)                           | Numerical   |
| GDP per capita                 | GDP per capita (normalized)                                     | Numerical   |
| Social support                 | Level of social support (Gallup measure)                       | Numerical   |
| Healthy life expectancy        | Healthy life expectancy at birth                               | Numerical   |
| Freedom to make life choices   | Perceived freedom in decision-making                           | Numerical   |
| Generosity                     | Charitable giving and volunteering behavior                    | Numerical   |
| Perceptions of corruption      | Perceived corruption in government and business                | Numerical   |

In this dataset, **scores** come from the Cantril ladder approach, asking respondents to imagine life on a 0–10 scale. The factors above help explain **why** certain countries rank higher than others.


---

# **Imputation Tests**
The assignment requires **3 imputation tests**, each employing a distinct method (no method repeated). The setup cells below import the libraries and load the data; the three tests follow.



**Import Libraries**

This section initializes the required Python libraries for data processing and imputation tasks:
- `pandas`: Facilitates data reading and manipulation.
- `numpy`: Supports numerical operations and random value selection.
- `sklearn.linear_model.LinearRegression`: Implements linear regression for regression-based imputation.
- `sklearn.impute.KNNImputer`: Performs K-Nearest Neighbors (KNN) imputation for filling missing values.
- `sklearn.metrics`: Computes evaluation metrics, including **Mean Absolute Error (MAE)** and **Mean Squared Error (MSE)** to assess imputation quality.

A random seed is set to ensure the reproducibility of experimental results.


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Set random seed for reproducibility
np.random.seed(42)
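
As a minimal sketch of what the seed provides: resetting NumPy's global generator to the same seed before calling `np.random.choice` reproduces the same selection. The helper name `pick_indices` and its arguments are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def pick_indices(n_rows, fraction, seed):
    """Return a reproducible random sample of row positions (hypothetical helper)."""
    np.random.seed(seed)  # reset the global generator to a known state
    n_pick = int(fraction * n_rows)
    return np.random.choice(np.arange(n_rows), n_pick, replace=False)

first = pick_indices(156, 0.2, seed=42)
second = pick_indices(156, 0.2, seed=42)
print(len(first), np.array_equal(first, second))  # prints: 31 True
```

Because the notebook seeds the generator only once, in the import cell, the draws in later cells depend on the order in which cells are run; rerunning a single test cell in isolation would give a different selection.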


**Read and Explore the Dataset**

The *World Happiness Report* dataset is loaded from a CSV file into a pandas DataFrame (`worldHappiness`). 

An initial inspection displays the table with `.head(160)`; because the dataset has only 156 rows, this covers every row, and the notebook abbreviates the middle of the output.

In [2]:
worldHappiness = pd.read_csv('dataset2/2019.csv')

# View the dataset (head(160) covers all 156 rows)
worldHappiness.head(160)


Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.340,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.600,1.383,1.573,0.996,0.592,0.252,0.410
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.380,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298
...,...,...,...,...,...,...,...,...,...
151,152,Rwanda,3.334,0.359,0.711,0.614,0.555,0.217,0.411
152,153,Tanzania,3.231,0.476,0.885,0.499,0.417,0.276,0.147
153,154,Afghanistan,3.203,0.350,0.517,0.361,0.000,0.158,0.025
154,155,Central African Republic,3.083,0.026,0.000,0.105,0.225,0.235,0.035



## Test 1: Mean Imputation on “GDP per capita” (MCAR)

**(a) Choose an Attribute**  
For this first imputation experiment, the selected attribute is **“GDP per capita.”**

**(b) Simulate Missing Values (MCAR)**  
The procedure sets the “GDP per capita” value to NaN in a uniformly random 20% of the rows (31 of 156). Because every row is equally likely to be selected, missingness does not depend on any observed or unobserved variable, which characterizes a **Missing Completely At Random (MCAR)** scenario.

**(c) Imputation Method: Default Value Imputation (Mean)**  
A univariate approach is used by calculating the **mean** of the non-missing “GDP per capita” values and filling in the NaN entries with that mean. This corresponds to “Default Value Imputation” (method #2) and is distinct from the methods applied in subsequent tests.

**(d) Evaluation**  
To assess the approximation, the algorithm compares the imputed values against the original (pre-deletion) data for those rows intentionally set to NaN. It calculates:
- **MAE (Mean Absolute Error)**
- **MSE (Mean Squared Error)**  
Lower MAE/MSE indicates more accurate imputation.
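
The two metrics can be computed by hand, which makes their definitions concrete. The following is a small sketch on made-up numbers (not values from the dataset); the notebook itself uses `mean_absolute_error` and `mean_squared_error` from `sklearn.metrics`.

```python
import numpy as np

# Toy values (illustrative only, not taken from the dataset)
original = np.array([1.0, 0.5, 1.5, 0.8])
imputed = np.array([0.9, 0.9, 0.9, 0.9])  # e.g. a constant fill value

errors = original - imputed
mae = np.mean(np.abs(errors))  # average absolute deviation
mse = np.mean(errors ** 2)     # average squared deviation

print(f"MAE: {mae:.4f}")  # MAE: 0.3000
print(f"MSE: {mse:.4f}")  # MSE: 0.1350
```

MSE squares each error before averaging, so a few large misses raise it more than they raise MAE.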

Below is the relevant code snippet:

In [None]:
# Load the dataset
worldHappiness_t1 = pd.read_csv('dataset2/2019.csv')

# Backup the original column
original_values_1 = worldHappiness_t1['GDP per capita'].copy()

# Simulate missing values (MCAR) by randomly deleting 20% of 'GDP per capita'
missing_fraction = 0.2
n_missing_1 = int(missing_fraction * len(worldHappiness_t1))
missing_indices_1 = np.random.choice(worldHappiness_t1.index, n_missing_1, replace=False)
worldHappiness_t1.loc[missing_indices_1, 'GDP per capita'] = np.nan

# Print a table focusing on the deleted rows to visualize missing data
print(f"Deleted {n_missing_1} rows from 'GDP per capita'.")
print("Showing the deleted rows:")
display(
    worldHappiness_t1.loc[missing_indices_1]
        .head(35)
        .style
        .set_caption("MCAR Deletion: 20% on GDP per capita")
)

# Mean imputation (Default Value Imputation)
mean_val = worldHappiness_t1['GDP per capita'].mean()
worldHappiness_t1['GDP_per_capita_mean_imputed'] = worldHappiness_t1['GDP per capita'].fillna(mean_val)

# Evaluate imputation quality by comparing original vs. imputed values
original_missing_values_1 = original_values_1.loc[missing_indices_1]
imputed_values_1 = worldHappiness_t1.loc[missing_indices_1, 'GDP_per_capita_mean_imputed']

# Compute error metrics: MAE (Mean Absolute Error) and MSE (Mean Squared Error)
mae_1 = mean_absolute_error(original_missing_values_1, imputed_values_1)
mse_1 = mean_squared_error(original_missing_values_1, imputed_values_1)

# Print evaluation results
print("\nEvaluation Results:")
print(f"MAE: {mae_1:.4f}")
print(f"MSE: {mse_1:.4f}") 


Deleted 31 rows from 'GDP per capita'.
Showing the deleted rows:


Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
96,97,Bulgaria,5.011,,1.513,0.815,0.311,0.081,0.004
69,70,Serbia,5.603,,1.383,0.854,0.282,0.137,0.039
82,83,Mongolia,5.285,,1.531,0.667,0.317,0.235,0.038
76,77,Dominican Republic,5.425,,1.401,0.779,0.497,0.113,0.101
114,115,Burkina Faso,4.587,,1.056,0.38,0.255,0.177,0.113
29,30,Spain,6.354,,1.484,1.062,0.362,0.153,0.079
94,95,Bhutan,5.082,,1.321,0.604,0.457,0.37,0.167
132,133,Ukraine,4.332,,1.39,0.739,0.178,0.187,0.01
93,94,Vietnam,5.175,,1.346,0.851,0.543,0.147,0.073
139,140,India,4.015,,0.765,0.588,0.498,0.2,0.085



Evaluation Results:
MAE: 0.2642
MSE: 0.1028
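
Because mean imputation fills every gap with the same constant, its MSE under MCAR is expected to be close to the variance of the column itself. The sketch below illustrates this on synthetic data; the distribution, sample size, and seed are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; the seed is arbitrary
values = rng.normal(loc=0.9, scale=0.4, size=2000)  # synthetic stand-in column

# MCAR: every entry has the same chance of being deleted
missing = rng.random(values.size) < 0.2
fill_value = values[~missing].mean()  # mean of the observed entries only

mse = np.mean((values[missing] - fill_value) ** 2)
variance = values.var()

print(f"MSE of mean imputation: {mse:.4f}")
print(f"Variance of the column: {variance:.4f}")
```

Read this way, the square root of the Test 1 MSE (about 0.32) is roughly the spread of the deleted “GDP per capita” values, which is why a constant fill mainly serves as a baseline for the other methods.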


## Test 2: Regression Imputation on “Healthy life expectancy” (MNAR)

**(a) Choose an Attribute**  
The chosen attribute for this second experiment is **“Healthy life expectancy.”**

**(b) Simulate Missing Values (MNAR)**  
Rows where “Healthy life expectancy” is above its 70th percentile are selected, and the attribute value is set to NaN in a random 50% of that subset. Because the probability of missingness depends on the attribute’s own value, this constitutes a **Missing Not At Random (MNAR)** setup.

**(c) Imputation Method: Regression**  
A **linear regression model** is trained on the rows where both values are present, using “GDP per capita” as the single predictor of “Healthy life expectancy.” This aligns with “Regression Imputation” (method #7). Predictions are then written to the rows where “Healthy life expectancy” is NaN and “GDP per capita” is present.

**(d) Evaluation**  
The procedure again compares the original (pre-deletion) and the imputed values for the specifically removed rows. It computes:
- **MAE** (average absolute deviation)
- **MSE** (average squared deviation)  
Any row still NaN (e.g., if “GDP per capita” is also missing) is skipped to avoid evaluation errors.

Below is the corresponding code snippet:

In [4]:
# Load the dataset
worldHappiness_t2 = pd.read_csv('dataset2/2019.csv')

# Backup the original column
original_values_2 = worldHappiness_t2['Healthy life expectancy'].copy()

# Simulate MNAR by selecting high life expectancy rows and deleting 50%
threshold = worldHappiness_t2['Healthy life expectancy'].quantile(0.7)
high_indices = worldHappiness_t2.index[
    (worldHappiness_t2['Healthy life expectancy'] > threshold) &
    (worldHappiness_t2['GDP per capita'].notna())
]

# Randomly delete 50% of the identified high life expectancy rows
delete_fraction_2 = 0.5
n_missing_2 = int(delete_fraction_2 * len(high_indices))
missing_indices_2 = np.random.choice(high_indices, n_missing_2, replace=False)
worldHappiness_t2.loc[missing_indices_2, 'Healthy life expectancy'] = np.nan

# Print details of deleted values for visualization
print(f"Deleted {n_missing_2} rows from 'Healthy life expectancy' among the high-life group.")
print("Here are the deleted rows:")
display(
    worldHappiness_t2.loc[missing_indices_2]
        .head(35)
        .style
        .set_caption("MNAR Deletion: 50% High Life Expectancy Rows")
)

# Train a regression model using 'GDP per capita' to predict 'Healthy life expectancy'
not_null_df_2 = worldHappiness_t2.dropna(subset=['Healthy life expectancy', 'GDP per capita'])
X_train_2 = not_null_df_2[['GDP per capita']]
y_train_2 = not_null_df_2['Healthy life expectancy']

reg_model = LinearRegression()
reg_model.fit(X_train_2, y_train_2)

# Predict missing values where 'GDP per capita' is not missing
null_df_2 = worldHappiness_t2[
    worldHappiness_t2['Healthy life expectancy'].isna() &
    worldHappiness_t2['GDP per capita'].notna()
]
X_test_2 = null_df_2[['GDP per capita']]
predicted_2 = reg_model.predict(X_test_2)

# Store imputed values in a new column
worldHappiness_t2['LifeExp_reg_imputed'] = worldHappiness_t2['Healthy life expectancy'].copy()
worldHappiness_t2.loc[X_test_2.index, 'LifeExp_reg_imputed'] = predicted_2

# Evaluate imputation
original_missing_values_2 = original_values_2.loc[missing_indices_2]
imputed_values_2 = worldHappiness_t2.loc[missing_indices_2, 'LifeExp_reg_imputed']

# Remove remaining NaN values before evaluation
imputed_values_2_no_nan = imputed_values_2.dropna()
original_missing_values_2_no_nan = original_missing_values_2.loc[imputed_values_2_no_nan.index]

# Compute MAE and MSE to measure imputation performance
mae_2 = mean_absolute_error(original_missing_values_2_no_nan, imputed_values_2_no_nan)
mse_2 = mean_squared_error(original_missing_values_2_no_nan, imputed_values_2_no_nan)

# Print evaluation results
print("\nEvaluation Results:")
print(f"Skipped rows (still NaN): {len(imputed_values_2) - len(imputed_values_2_no_nan)}")
print(f"MAE: {mae_2:.4f}")
print(f"MSE: {mse_2:.4f}")

Deleted 23 rows from 'Healthy life expectancy' among the high-life group.
Here are the deleted rows:


Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
24,25,Taiwan,6.446,1.368,1.43,,0.351,0.242,0.097
54,55,Estonia,5.893,1.237,1.528,,0.495,0.103,0.161
3,4,Iceland,7.494,1.38,1.624,,0.591,0.354,0.118
65,66,Portugal,5.693,1.221,1.431,,0.508,0.047,0.025
15,16,Ireland,7.021,1.499,1.553,,0.516,0.298,0.31
21,22,Malta,6.726,1.3,1.52,,0.564,0.375,0.151
46,47,Argentina,6.086,1.092,1.432,,0.471,0.066,0.05
29,30,Spain,6.354,1.286,1.484,,0.362,0.153,0.079
37,38,Slovakia,6.198,1.246,1.504,,0.334,0.121,0.014
17,18,Belgium,6.923,1.356,1.504,,0.473,0.16,0.21



Evaluation Results:
Skipped rows (still NaN): 0
MAE: 0.0751
MSE: 0.0078
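
MAE and MSE do not show the direction of the error. In an MNAR setup like this one, where only high values are deleted, a model fitted on the remaining rows may underestimate the deleted values. The sketch below checks this with a mean signed error on synthetic data; the data-generating coefficients, sample size, and seed are assumptions, and the notebook does not compute this quantity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)                        # arbitrary seed
x = rng.uniform(0.0, 1.6, size=2000)                  # synthetic predictor
y = 0.3 + 0.5 * x + rng.normal(0.0, 0.1, size=2000)   # synthetic target

# MNAR: delete half of the targets that lie above the 70th percentile
high = np.flatnonzero(y > np.quantile(y, 0.7))
deleted = rng.choice(high, size=len(high) // 2, replace=False)
observed = np.setdiff1d(np.arange(y.size), deleted)

# Fit on the rows that still have a target, then predict the deleted ones
model = LinearRegression().fit(x[observed].reshape(-1, 1), y[observed])
predicted = model.predict(x[deleted].reshape(-1, 1))

# Mean signed error: negative means the imputed values are too low on average
mean_signed_error = np.mean(predicted - y[deleted])
print(f"Mean signed error: {mean_signed_error:.4f}")
```

In the notebook, the same check could be applied to `imputed_values_2_no_nan - original_missing_values_2_no_nan` to see whether the regression systematically underestimates the deleted life-expectancy values.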


## Test 3: KNN Imputation on “Generosity” (MAR)

**(a) Choice of Attribute**  
For the third imputation experiment, the selected attribute is **Generosity**.

**(b) Simulation of Missing Values (MAR)**  
A **Missing At Random (MAR)** mechanism is created by targeting rows where `Score < 5.5` (i.e., relatively low happiness scores) and deleting 30% of those rows’ Generosity values. This means missingness is related to a different observed variable (`Score`) rather than the Generosity column itself.

**(c) Imputation Approach: Similarity-Based (KNN)**  
KNNImputer (method #9) is used for a multivariate, similarity-based approach. Multiple features—`Score`, `GDP per capita`, `Healthy life expectancy`, `Freedom to make life choices`, and `Generosity`—are leveraged to find the nearest neighbors and fill missing values based on neighbor averages. This differs from earlier tests (mean or regression) and helps satisfy the requirement of employing three distinct methods.

**(d) Evaluation**  
After inserting `NaN` into `Generosity` for 30% of the selected subset (countries with `Score < 5.5`), KNNImputer is applied to fill missing values. The resulting imputed values are compared to the original values specifically deleted, using **Mean Absolute Error (MAE)** and **Mean Squared Error (MSE)**. Smaller MAE/MSE indicates closer alignment with the true values.

Below is the corresponding code snippet:

In [5]:
# Load the dataset
worldHappiness_t3 = pd.read_csv('dataset2/2019.csv')

# Backup the original 'Generosity' column
original_values_3 = worldHappiness_t3['Generosity'].copy()

# Simulate MAR by selecting rows where Score < 5.5 and deleting 30%
score_threshold = 5.5
sub_indices_3 = worldHappiness_t3.index[worldHappiness_t3['Score'] < score_threshold]

# Suppose we delete 30% of those rows in the subset
delete_fraction_3 = 0.3
n_missing_3 = int(delete_fraction_3 * len(sub_indices_3))

# Randomly pick from that subset
missing_indices_3 = np.random.choice(sub_indices_3, n_missing_3, replace=False)

# Set 'Generosity' to NaN in those rows
worldHappiness_t3.loc[missing_indices_3, 'Generosity'] = np.nan

# Print details of deleted values for visualization
print(f"Deleted {n_missing_3} rows from 'Generosity' among countries with Score < {score_threshold}.")
print("Here are the deleted rows:")
display(
    worldHappiness_t3.loc[missing_indices_3]
        .head(35)
        .style
        .set_caption(f"MAR Deletion: Score < {score_threshold}, 30% in that subset")
)

# KNN Imputation (using multiple features)
imputer = KNNImputer(n_neighbors=5)
columns_for_knn = [
    'Score',
    'GDP per capita',
    'Healthy life expectancy',
    'Freedom to make life choices',
    'Generosity'
]

# Apply KNN imputation using multiple features
worldHappiness_knn = worldHappiness_t3[columns_for_knn].copy()
worldHappiness_imputed_array = imputer.fit_transform(worldHappiness_knn)
worldHappiness_knn_imputed = pd.DataFrame(
    worldHappiness_imputed_array,
    columns=columns_for_knn,
    index=worldHappiness_knn.index
)

# Store the imputed "Generosity" in a new column
worldHappiness_t3['Generosity_knn_imputed'] = worldHappiness_knn_imputed['Generosity']

# Evaluate imputation
original_missing_values_3 = original_values_3.loc[missing_indices_3]
imputed_values_3 = worldHappiness_t3.loc[missing_indices_3, 'Generosity_knn_imputed']

# Compute MAE and MSE to measure imputation performance
mae_3 = mean_absolute_error(original_missing_values_3, imputed_values_3)
mse_3 = mean_squared_error(original_missing_values_3, imputed_values_3)

# Print evaluation results
print("\nEvaluation Results:")
print(f"MAE: {mae_3:.4f}")
print(f"MSE: {mse_3:.4f}")


Deleted 24 rows from 'Generosity' among countries with Score < 5.5.
Here are the deleted rows:


Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
82,83,Mongolia,5.285,0.948,1.531,0.667,0.317,,0.038
93,94,Vietnam,5.175,0.741,1.346,0.851,0.543,,0.073
85,86,Kyrgyzstan,5.261,0.551,1.438,0.723,0.508,,0.023
86,87,Turkmenistan,5.247,1.052,1.538,0.657,0.394,,0.028
143,144,Lesotho,3.802,0.489,1.169,0.168,0.359,,0.093
87,88,Algeria,5.211,1.002,1.16,0.785,0.086,,0.114
121,122,Mauritania,4.49,0.57,1.167,0.489,0.066,,0.088
152,153,Tanzania,3.231,0.476,0.885,0.499,0.417,,0.147
115,116,Armenia,4.559,0.85,1.055,0.815,0.283,,0.064
123,124,Tunisia,4.461,0.921,1.0,0.815,0.167,,0.055



Evaluation Results:
MAE: 0.0571
MSE: 0.0044
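
`KNNImputer` finds neighbors with a NaN-aware Euclidean distance on the raw feature values, so a wide-range feature such as `Score` (roughly 3 to 8 here) can outweigh features that stay between 0 and about 1.6. One common option, which the notebook does not use, is to standardize the features before imputing and map the result back afterwards. The sketch below shows the idea on a small synthetic frame; the column names mirror the notebook, but the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)  # arbitrary seed
frame = pd.DataFrame({
    "Score": rng.uniform(3.0, 8.0, size=60),           # wide range
    "GDP per capita": rng.uniform(0.0, 1.6, size=60),
    "Generosity": rng.uniform(0.0, 0.5, size=60),      # narrow range
})
frame.loc[[3, 10, 25], "Generosity"] = np.nan          # synthetic gaps

scaler = StandardScaler()  # ignores NaN when fitting, keeps NaN on transform
scaled = scaler.fit_transform(frame)
filled_scaled = KNNImputer(n_neighbors=5).fit_transform(scaled)

# Map the imputed matrix back to the original units
filled = pd.DataFrame(
    scaler.inverse_transform(filled_scaled),
    columns=frame.columns,
    index=frame.index,
)
print(filled.loc[[3, 10, 25], "Generosity"])
```

Whether scaling lowers the MAE/MSE on this dataset would need to be checked by rerunning Test 3 with and without it.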


**Summary of MAE/MSE for the three imputation tests:**

In [6]:
print(f"Test 1 (Mean Imputation on 'GDP per capita') =>   MAE = {mae_1:.4f}, MSE = {mse_1:.4f}")
print(f"Test 2 (Regression Imputation on 'Healthy life expectancy') =>   MAE = {mae_2:.4f}, MSE = {mse_2:.4f}")
print(f"Test 3 (KNN Imputation on 'Generosity') =>   MAE = {mae_3:.4f}, MSE = {mse_3:.4f}")

Test 1 (Mean Imputation on 'GDP per capita') =>   MAE = 0.2642, MSE = 0.1028
Test 2 (Regression Imputation on 'Healthy life expectancy') =>   MAE = 0.0751, MSE = 0.0078
Test 3 (KNN Imputation on 'Generosity') =>   MAE = 0.0571, MSE = 0.0044
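
The three errors above are measured on different attributes with different spreads, so the raw values are hard to compare. One possible way to put them on a common footing is to divide each MAE by the standard deviation of the original column. The helper `normalized_mae` below is a hypothetical sketch of that idea, shown on toy values.

```python
import numpy as np

def normalized_mae(original, imputed, reference_column):
    """MAE divided by the standard deviation of the full column (hypothetical helper)."""
    original = np.asarray(original, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mae = np.mean(np.abs(original - imputed))
    return mae / np.std(reference_column)

# Toy columns on different scales (illustrative values only)
wide = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
narrow = wide / 10.0

# The same relative mistake on each scale yields the same normalized score
score_wide = normalized_mae(wide[:2], wide[:2] + 0.5, wide)
score_narrow = normalized_mae(narrow[:2], narrow[:2] + 0.05, narrow)
print(round(score_wide, 4), round(score_narrow, 4))
```

In the notebook this could be called as, for example, `normalized_mae(original_missing_values_1, imputed_values_1, original_values_1)` for Test 1, and analogously for the other two tests.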


---

# **Conclusion**
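The imputation experiments in this notebook apply three methods (mean, regression, KNN) under three missingness mechanisms (MCAR, MNAR, MAR). Lower MAE/MSE values indicate closer recovery of the original data: mean imputation on “GDP per capita” produced the largest errors (MAE 0.2642), while regression on “Healthy life expectancy” (MAE 0.0751) and KNN on “Generosity” (MAE 0.0571) produced smaller ones. Because each test pairs a different attribute, mechanism, and method, these errors are not directly comparable and do not by themselves show which method is best. Additional analyses could apply all three methods to the same attribute and mechanism, investigate other techniques such as MICE, or use larger datasets for validation. The results are consistent with the general guidance that the imputation approach should be chosen with the underlying missingness mechanism in mind.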


---

# **References**
1. **World Happiness Report**: [Kaggle - World Happiness](https://www.kaggle.com/datasets/unsdsn/world-happiness)  
2. **Assignment 2 PDF**: *CSI4142 Data Science, Imputation Section Revised on Feb. 14th*  

---

## **Acknowledgments**
- **ChatGPT**: Formatting markdown texts, paraphrasing, grammar checks, and code debugging.  