# Import Libraries

In this section, we import the necessary Python libraries for our analysis, including:
- `pandas`: For data reading and processing.
- `numpy`: For numerical operations and random selections.
- `sklearn.linear_model.LinearRegression`: A linear regression model used for regression imputation.
- `sklearn.impute.KNNImputer`: Used for KNN imputation.
- `sklearn.metrics`: To evaluate the imputation results using MSE and MAE.

We also set a random seed to ensure the reproducibility of our experiments.


In [116]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Set random seed for reproducibility
np.random.seed(42)
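
To make the effect of the seed concrete, here is a minimal sketch. The helper name `draw_indices` and the sizes are illustrative assumptions, not part of the notebook. It shows that reseeding reproduces the same random selection, and that a local `Generator` is an alternative to the global seed.

```python
import numpy as np

def draw_indices(seed, n_rows=10, n_pick=3):
    """Pick n_pick distinct row positions after resetting the global seed."""
    np.random.seed(seed)
    return np.random.choice(n_rows, n_pick, replace=False)

first = draw_indices(42)
second = draw_indices(42)
print(np.array_equal(first, second))  # True: same seed, same draw

# Alternative: a local Generator keeps the random state out of global scope
rng = np.random.default_rng(42)
local_draw = rng.choice(10, 3, replace=False)
print(local_draw)
```

Because the notebook seeds the global state only once, the indices drawn in later cells depend on the cells being run in order from the top.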


# Read and Explore the Dataset

In this section, we load the *World Happiness Report* dataset from a CSV file into a pandas DataFrame named `worldHappiness`.

We then call `.head(160)` to display the rows and inspect the columns and their values. Calling `.info()` in addition would report the number of rows and columns and the non-null count per column, which reveals any missing values.


In [117]:
worldHappiness = pd.read_csv('dataset2/2019.csv')

# View dataset (a cell displays only its last expression, so one call suffices)
worldHappiness.head(160)


Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.340,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.600,1.383,1.573,0.996,0.592,0.252,0.410
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.380,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298
...,...,...,...,...,...,...,...,...,...
151,152,Rwanda,3.334,0.359,0.711,0.614,0.555,0.217,0.411
152,153,Tanzania,3.231,0.476,0.885,0.499,0.417,0.276,0.147
153,154,Afghanistan,3.203,0.350,0.517,0.361,0.000,0.158,0.025
154,155,Central African Republic,3.083,0.026,0.000,0.105,0.225,0.235,0.035
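
Before simulating missingness, it helps to confirm that the table has no gaps of its own. The sketch below shows one way to check this with `.info()` and `.isna().sum()`. It uses a made-up toy frame (an assumption for illustration) rather than the real CSV, so it runs without the data file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table (values are made up for illustration)
toy = pd.DataFrame({
    "Country or region": ["A", "B", "C", "D"],
    "Score": [7.1, 6.3, np.nan, 4.2],
    "Generosity": [0.15, np.nan, np.nan, 0.20],
})

toy.info()                              # dtypes and non-null counts per column
missing_per_column = toy.isna().sum()   # number of NaN entries per column
print(missing_per_column)
```

Applied to `worldHappiness`, the same two calls would show whether any column already contains missing values.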


## Test 1: Mean Imputation on “GDP per capita” (MCAR)

**(a) Choose an Attribute**  
For this first imputation test, the chosen attribute is **“GDP per capita.”**

**(b) Simulate Missing Values (MCAR)**  
This test randomly removes 20% of the values in the “GDP per capita” column. Every row has the same chance of being selected, so the missingness does not depend on observed or unobserved values in the dataset. This corresponds to **Missing Completely At Random (MCAR)**.

**(c) Imputation Method: Default Value Imputation (Mean)**  
This test uses a univariate approach: it computes the **mean** of the non-missing “GDP per capita” values and fills the missing entries with that mean. This falls under “Default Value Imputation” (method #2), which differs from the methods used in the other tests.

**(d) Evaluation**  
To measure how well the imputed values approximate the original data, the test compares the imputed values against the original (pre-deletion) values for the rows that were intentionally set to NaN. It calculates:
- **MAE (Mean Absolute Error)**
- **MSE (Mean Squared Error)**  
Lower errors indicate more accurate imputation.
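
For reference, the two metrics can be written out as follows. The notation is an assumption introduced here: $M$ is the set of rows whose values were deleted, $n = |M|$, $y_i$ is the original value, and $\hat{y}_i$ is the imputed value.

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i \in M} \lvert y_i - \hat{y}_i \rvert,
\qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i \in M} \left( y_i - \hat{y}_i \right)^2
```

MSE squares each deviation, so it penalizes a few large errors more heavily than MAE does.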

Below is the code snippet showing these steps:


In [118]:
# Load the dataset
worldHappiness_t1 = pd.read_csv('dataset2/2019.csv')

# Backup the original column
original_values_1 = worldHappiness_t1['GDP per capita'].copy()

# Simulate missing values (MCAR) by randomly deleting 20% of 'GDP per capita'
missing_fraction = 0.2
n_missing_1 = int(missing_fraction * len(worldHappiness_t1))
missing_indices_1 = np.random.choice(worldHappiness_t1.index, n_missing_1, replace=False)
worldHappiness_t1.loc[missing_indices_1, 'GDP per capita'] = np.nan

# Print a table focusing on the first 35 deleted rows to visualize missing data
print(f"Deleted {n_missing_1} rows from 'GDP per capita'. Showing the deleted rows:")
display(
    worldHappiness_t1.loc[missing_indices_1]
        .head(35)
        .style
        .set_caption("MCAR Deletion: 20% on GDP per capita")
)

# Mean imputation (Default Value Imputation)
mean_val = worldHappiness_t1['GDP per capita'].mean()
worldHappiness_t1['GDP_per_capita_mean_imputed'] = worldHappiness_t1['GDP per capita'].fillna(mean_val)

# Evaluate imputation quality by comparing original vs. imputed values
original_missing_values_1 = original_values_1.loc[missing_indices_1]
imputed_values_1 = worldHappiness_t1.loc[missing_indices_1, 'GDP_per_capita_mean_imputed']

# Compute error metrics: MAE (Mean Absolute Error) and MSE (Mean Squared Error)
mae_1 = mean_absolute_error(original_missing_values_1, imputed_values_1)
mse_1 = mean_squared_error(original_missing_values_1, imputed_values_1)

# Print evaluation results
print("\nEvaluation Results:")
print(f"MAE: {mae_1:.4f}")
print(f"MSE: {mse_1:.4f}") 


Deleted 31 rows from 'GDP per capita'. Showing the deleted rows:


Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
96,97,Bulgaria,5.011,,1.513,0.815,0.311,0.081,0.004
69,70,Serbia,5.603,,1.383,0.854,0.282,0.137,0.039
82,83,Mongolia,5.285,,1.531,0.667,0.317,0.235,0.038
76,77,Dominican Republic,5.425,,1.401,0.779,0.497,0.113,0.101
114,115,Burkina Faso,4.587,,1.056,0.38,0.255,0.177,0.113
29,30,Spain,6.354,,1.484,1.062,0.362,0.153,0.079
94,95,Bhutan,5.082,,1.321,0.604,0.457,0.37,0.167
132,133,Ukraine,4.332,,1.39,0.739,0.178,0.187,0.01
93,94,Vietnam,5.175,,1.346,0.851,0.543,0.147,0.073
139,140,India,4.015,,0.765,0.588,0.498,0.2,0.085



Evaluation Results:
MAE: 0.2642
MSE: 0.1028
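
As a cross-check on the manual `fillna` approach, the same mean fill can be obtained from scikit-learn's `SimpleImputer`. This is a minimal sketch on a made-up toy column (an assumption for illustration), not a rerun of the experiment above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with two missing entries (values are illustrative only)
toy = pd.DataFrame({"GDP per capita": [1.0, np.nan, 0.4, np.nan, 1.6]})

# Manual mean fill, mirroring the approach in the cell above
manual = toy["GDP per capita"].fillna(toy["GDP per capita"].mean())

# Same fill via scikit-learn; fit_transform returns a 2-D NumPy array
imputer = SimpleImputer(strategy="mean")
via_sklearn = imputer.fit_transform(toy[["GDP per capita"]]).ravel()

print(np.allclose(manual.to_numpy(), via_sklearn))  # True
```

Either way, every missing row receives the same constant. The MSE of mean imputation is therefore the average squared distance of the deleted values from the observed mean.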


## Test 2: Regression Imputation on “Healthy life expectancy” (MNAR)

**(a) Choose an Attribute**  
For the second test, the chosen attribute is **“Healthy life expectancy.”**

**(b) Simulate Missing Values (MNAR)**  
This test selects rows where “Healthy life expectancy” is above its 70th percentile (the “high” values) and deletes the value in a randomly chosen 50% of that subset. Because the probability of missingness depends on the attribute’s own value, this simulates **Missing Not At Random (MNAR).**

**(c) Imputation Method: Regression**  
This test trains a **linear regression model** using “GDP per capita” as the predictor to estimate missing “Healthy life expectancy.” This corresponds to “Regression Imputation” (method #7). The model is then applied to rows where “Healthy life expectancy” is NaN but “GDP per capita” is not.

**(d) Evaluation**  
The test again compares the original (pre-deletion) and imputed values for the specific rows that were intentionally set to NaN. It computes:
- **MAE** to show average absolute deviation 
- **MSE** to reflect squared error accumulation  
Note: if any rows remain NaN after imputation (e.g., because “GDP per capita” was missing too), they are skipped to avoid errors in the evaluation.

Below is the corresponding code snippet:


In [119]:
# Load the dataset
worldHappiness_t2 = pd.read_csv('dataset2/2019.csv')

# Backup the original column
original_values_2 = worldHappiness_t2['Healthy life expectancy'].copy()

# Simulate MNAR by selecting high life expectancy rows and deleting 50%
threshold = worldHappiness_t2['Healthy life expectancy'].quantile(0.7)
high_indices = worldHappiness_t2.index[
    (worldHappiness_t2['Healthy life expectancy'] > threshold) &
    (worldHappiness_t2['GDP per capita'].notna())
]

# Randomly delete 50% of the identified high life expectancy rows
delete_fraction_2 = 0.5
n_missing_2 = int(delete_fraction_2 * len(high_indices))
missing_indices_2 = np.random.choice(high_indices, n_missing_2, replace=False)
worldHappiness_t2.loc[missing_indices_2, 'Healthy life expectancy'] = np.nan

# Print details of deleted values for visualization
print(f"Deleted {n_missing_2} rows from 'Healthy life expectancy' among the high-life group.")
print("Here are the deleted rows:")
display(
    worldHappiness_t2.loc[missing_indices_2]
        .head(35)
        .style
        .set_caption("MNAR Deletion: 50% High Life Expectancy Rows")
)

# Train a regression model using 'GDP per capita' to predict 'Healthy life expectancy'
not_null_df_2 = worldHappiness_t2.dropna(subset=['Healthy life expectancy', 'GDP per capita'])
X_train_2 = not_null_df_2[['GDP per capita']]
y_train_2 = not_null_df_2['Healthy life expectancy']

reg_model = LinearRegression()
reg_model.fit(X_train_2, y_train_2)

# Predict missing values where 'GDP per capita' is not missing
null_df_2 = worldHappiness_t2[
    worldHappiness_t2['Healthy life expectancy'].isna() &
    worldHappiness_t2['GDP per capita'].notna()
]
X_test_2 = null_df_2[['GDP per capita']]
predicted_2 = reg_model.predict(X_test_2)

# Store imputed values in a new column
worldHappiness_t2['LifeExp_reg_imputed'] = worldHappiness_t2['Healthy life expectancy'].copy()
worldHappiness_t2.loc[X_test_2.index, 'LifeExp_reg_imputed'] = predicted_2

# Evaluate imputation
original_missing_values_2 = original_values_2.loc[missing_indices_2]
imputed_values_2 = worldHappiness_t2.loc[missing_indices_2, 'LifeExp_reg_imputed']

# Remove remaining NaN values before evaluation
imputed_values_2_no_nan = imputed_values_2.dropna()
original_missing_values_2_no_nan = original_missing_values_2.loc[imputed_values_2_no_nan.index]

# Compute MAE and MSE to measure imputation performance
mae_2 = mean_absolute_error(original_missing_values_2_no_nan, imputed_values_2_no_nan)
mse_2 = mean_squared_error(original_missing_values_2_no_nan, imputed_values_2_no_nan)

# Print evaluation results
print("\nEvaluation Results:")
print(f"Number of rows successfully imputed: {len(imputed_values_2_no_nan)}")
print(f"Skipped rows (still NaN): {len(imputed_values_2) - len(imputed_values_2_no_nan)}")
print(f"MAE: {mae_2:.4f}")
print(f"MSE: {mse_2:.4f}")

Deleted 23 rows from 'Healthy life expectancy' among the high-life group.
Here are the deleted rows:


Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
24,25,Taiwan,6.446,1.368,1.43,,0.351,0.242,0.097
54,55,Estonia,5.893,1.237,1.528,,0.495,0.103,0.161
3,4,Iceland,7.494,1.38,1.624,,0.591,0.354,0.118
65,66,Portugal,5.693,1.221,1.431,,0.508,0.047,0.025
15,16,Ireland,7.021,1.499,1.553,,0.516,0.298,0.31
21,22,Malta,6.726,1.3,1.52,,0.564,0.375,0.151
46,47,Argentina,6.086,1.092,1.432,,0.471,0.066,0.05
29,30,Spain,6.354,1.286,1.484,,0.362,0.153,0.079
37,38,Slovakia,6.198,1.246,1.504,,0.334,0.121,0.014
17,18,Belgium,6.923,1.356,1.504,,0.473,0.16,0.21



Evaluation Results:
Number of rows successfully imputed: 23
Skipped rows (still NaN): 0
MAE: 0.0751
MSE: 0.0078
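
The MNAR mechanism in step (b) removes only high values, so the remaining observations are no longer representative of the full column. The sketch below illustrates this on synthetic data. The normal distribution, its parameters, and the sample size are assumptions chosen for illustration, not properties of the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an attribute (not the real data)
values = rng.normal(loc=0.7, scale=0.2, size=1000)

# MNAR-style deletion: drop half of the entries above the 70th percentile
threshold = np.quantile(values, 0.7)
high_idx = np.flatnonzero(values > threshold)
deleted = rng.choice(high_idx, size=len(high_idx) // 2, replace=False)

observed = np.delete(values, deleted)

print(f"true mean:     {values.mean():.3f}")
print(f"observed mean: {observed.mean():.3f}")  # lower: only high values were removed
```

The regression model in Test 2 is trained on rows that under-represent high life expectancy in the same way, which may pull its predictions for the deleted rows downward.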


## Test 3: KNN Imputation on “Generosity” (MAR)

**(a) Choice of Attribute**  
For our third imputation test, the chosen attribute is **Generosity**.

**(b) Simulation of Missing Values (MAR)**  
Instead of randomly removing values (MCAR), we simulate a **Missing At Random (MAR)** mechanism by selecting rows where `Score < 5.5` (i.e., countries with relatively low happiness scores) and removing 30% of those rows’ Generosity values. This approach ties missingness to another observed variable (`Score`), thus fulfilling the MAR condition.

**(c) Imputation Approach: Similarity-Based (KNN)**  
This test uses **KNNImputer** for a multivariate, similarity-based approach (method #9). It leverages multiple features (`Score`, `GDP per capita`, `Healthy life expectancy`, `Freedom to make life choices`, `Generosity`) to find the nearest neighbors of each row and fills missing values with the average of those neighbors’ values. This differs from our previous tests (mean imputation and regression) and ensures we meet the requirement of using three distinct imputation methods.

**(d) Evaluation**  
After inserting `NaN` into `Generosity` for 30% of the selected subset (those with `Score < 5.5`), we apply KNN to fill the missing values. We then compare the imputed values against the original ones for precisely those deleted rows, computing the **Mean Absolute Error (MAE)** and **Mean Squared Error (MSE)** to quantify imputation quality. The lower the MAE/MSE, the closer our imputed values are to the true ones.

Below is the corresponding code cell implementing these steps:

In [120]:
# Load the dataset
worldHappiness_t3 = pd.read_csv('dataset2/2019.csv')

# Backup the original 'Generosity' column
original_values_3 = worldHappiness_t3['Generosity'].copy()

# Simulate MAR by selecting rows where Score < 5.5 and deleting 30%
score_threshold = 5.5
sub_indices_3 = worldHappiness_t3.index[worldHappiness_t3['Score'] < score_threshold]

# Number of values to delete: 30% of the rows in that subset
delete_fraction_3 = 0.3
n_missing_3 = int(delete_fraction_3 * len(sub_indices_3))

# Randomly pick from that subset
missing_indices_3 = np.random.choice(sub_indices_3, n_missing_3, replace=False)

# Set 'Generosity' to NaN in those rows
worldHappiness_t3.loc[missing_indices_3, 'Generosity'] = np.nan

# Print details of deleted values for visualization
print(f"Deleted {n_missing_3} rows from 'Generosity' among countries with Score < {score_threshold}.")
print("Here are the deleted rows:")
display(
    worldHappiness_t3.loc[missing_indices_3]
        .head(35)
        .style
        .set_caption(f"MAR Deletion: Score < {score_threshold}, 30% in that subset")
)
# KNN Imputation (using multiple features)
imputer = KNNImputer(n_neighbors=5)
columns_for_knn = [
    'Score',
    'GDP per capita',
    'Healthy life expectancy',
    'Freedom to make life choices',
    'Generosity'
]

# Apply KNN imputation using multiple features
worldHappiness_knn = worldHappiness_t3[columns_for_knn].copy()
worldHappiness_imputed_array = imputer.fit_transform(worldHappiness_knn)
worldHappiness_knn_imputed = pd.DataFrame(
    worldHappiness_imputed_array,
    columns=columns_for_knn,
    index=worldHappiness_knn.index
)

# Store the imputed "Generosity" in a new column
worldHappiness_t3['Generosity_knn_imputed'] = worldHappiness_knn_imputed['Generosity']

# Evaluate imputation
original_missing_values_3 = original_values_3.loc[missing_indices_3]
imputed_values_3 = worldHappiness_t3.loc[missing_indices_3, 'Generosity_knn_imputed']

# Compute MAE and MSE to measure imputation performance
mae_3 = mean_absolute_error(original_missing_values_3, imputed_values_3)
mse_3 = mean_squared_error(original_missing_values_3, imputed_values_3)

# Print evaluation results
print("\nEvaluation Results:")
print(f"Number of rows actually imputed: {len(imputed_values_3.dropna())}")
print(f"MAE: {mae_3:.4f}")
print(f"MSE: {mse_3:.4f}")


Deleted 24 rows from 'Generosity' among countries with Score < 5.5.
Here are the deleted rows:


Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
82,83,Mongolia,5.285,0.948,1.531,0.667,0.317,,0.038
93,94,Vietnam,5.175,0.741,1.346,0.851,0.543,,0.073
85,86,Kyrgyzstan,5.261,0.551,1.438,0.723,0.508,,0.023
86,87,Turkmenistan,5.247,1.052,1.538,0.657,0.394,,0.028
143,144,Lesotho,3.802,0.489,1.169,0.168,0.359,,0.093
87,88,Algeria,5.211,1.002,1.16,0.785,0.086,,0.114
121,122,Mauritania,4.49,0.57,1.167,0.489,0.066,,0.088
152,153,Tanzania,3.231,0.476,0.885,0.499,0.417,,0.147
115,116,Armenia,4.559,0.85,1.055,0.815,0.283,,0.064
123,124,Tunisia,4.461,0.921,1.0,0.815,0.167,,0.055



Evaluation Results:
Number of rows actually imputed: 24
MAE: 0.0571
MSE: 0.0044
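
`KNNImputer` measures similarity with a Euclidean-style distance, so a feature with a wider range (here `Score`, roughly 3 to 8) contributes more to the distance than features that stay below about 1.6. One possible variant, not used in the test above, is to standardize the columns before imputing and map the result back afterwards. The sketch below uses a made-up toy frame, and the choice of `StandardScaler` and `n_neighbors=2` is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy frame with two missing 'Generosity' entries (made-up values)
toy = pd.DataFrame({
    "Score": [7.5, 7.0, 5.4, 5.0, 3.4, 3.2],
    "Generosity": [0.30, 0.25, np.nan, 0.15, 0.22, np.nan],
})

# Standardize each column; StandardScaler disregards NaN when fitting
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Impute in the scaled space, then map back to the original units
imputer = KNNImputer(n_neighbors=2)
imputed = scaler.inverse_transform(imputer.fit_transform(scaled))

toy_imputed = pd.DataFrame(imputed, columns=toy.columns, index=toy.index)
print(toy_imputed)
```

Each missing entry is filled with the average `Generosity` of its two nearest rows by `Score`. Whether scaling lowers the MAE/MSE on the real data would have to be checked by rerunning the evaluation.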
