# Import Libraries

In this section, we import the necessary Python libraries for our analysis, including:
- `pandas`: For data reading and processing.
- `numpy`: For numerical operations and random selections.
- `sklearn.linear_model.LinearRegression`: A linear regression model used for regression imputation.
- `sklearn.impute.KNNImputer`: Used for KNN imputation.
- `sklearn.metrics`: To evaluate the imputation results using MSE and MAE.

We also set a random seed to ensure the reproducibility of our experiments.
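
To illustrate why the seed matters, here is a minimal sketch showing that reseeding NumPy's global generator reproduces the same random selection. The helper `draw_indices` and its parameters are hypothetical and not part of the notebook.

```python
import numpy as np

def draw_indices(seed, n_rows=10, n_pick=3):
    """Hypothetical helper: reseed the global RNG, then sample row indices without replacement."""
    np.random.seed(seed)
    return np.random.choice(np.arange(n_rows), n_pick, replace=False)

first = draw_indices(42)
second = draw_indices(42)

# Both draws use the same seed, so they select the same indices
print(first, second)
```

Because the notebook sets the seed only once, the global generator's state advances with every random call. Re-running a single cell in isolation will therefore generally produce a different selection than running the notebook from the top.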


In [103]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Set random seed for reproducibility
np.random.seed(42)


# Read and Explore the Dataset

In this section, we load the *World Happiness Report* dataset from a CSV file into a pandas DataFrame named `worldHappiness`.

We then display up to the first 160 rows with `.head(160)` to inspect the columns and values. Calling `.info()` would additionally report the number of rows, the column types, and the non-null count per column, which reveals any missing values.
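
A minimal sketch of such an inspection, on a toy DataFrame with made-up values (not the real CSV), might look like this. The names `toy` and `missing_per_column` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "Country or region": ["A", "B", "C", "D"],
    "Score": [7.1, 6.5, np.nan, 4.2],
    "GDP per capita": [1.3, np.nan, 0.9, 0.4],
})

print(toy.head(2))   # first two rows
toy.info()           # row count, dtypes, and non-null counts per column

# Explicit count of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```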


In [104]:
worldHappiness = pd.read_csv('dataset2/2019.csv')

# View the dataset (up to the first 160 rows). A notebook cell displays only
# its last expression, so a single call is enough.
worldHappiness.head(160)


Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.38,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298
5,6,Switzerland,7.48,1.452,1.526,1.052,0.572,0.263,0.343
6,7,Sweden,7.343,1.387,1.487,1.009,0.574,0.267,0.373
7,8,New Zealand,7.307,1.303,1.557,1.026,0.585,0.33,0.38
8,9,Canada,7.278,1.365,1.505,1.039,0.584,0.285,0.308
9,10,Austria,7.246,1.376,1.475,1.016,0.532,0.244,0.226


# Test 1 - Univariate Imputation (Mean Imputation)

**Objective:**
1. Select a numerical column (e.g., 'GDP per capita').
2. Randomly remove 20% of its values (simulate MCAR - Missing Completely at Random).
3. Perform mean imputation to fill in missing values.
4. Compare the imputed values with the original values and evaluate using MAE and MSE.

This gives us a baseline for how well simple constant-value imputation performs on this feature.
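
To make the two metrics concrete, the sketch below computes MAE and MSE by hand for a mean-style imputation and checks them against scikit-learn. The numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true values of the entries that were removed
true_values = np.array([1.0, 2.0, 4.0])

# Stand-in for the mean of the remaining (observed) values
observed_mean = 2.0
imputed = np.full_like(true_values, observed_mean)

errors = true_values - imputed          # [-1, 0, 2]
mae_manual = np.mean(np.abs(errors))    # (1 + 0 + 2) / 3
mse_manual = np.mean(errors ** 2)       # (1 + 0 + 4) / 3

mae_sklearn = mean_absolute_error(true_values, imputed)
mse_sklearn = mean_squared_error(true_values, imputed)
print(mae_manual, mse_manual)
```

Since mean imputation fills every gap with the same constant, MSE penalizes the entries far from the mean more heavily than MAE does.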


In [105]:
# Backup the original column for later evaluation
original_values_1 = worldHappiness['GDP per capita'].copy()

# Randomly remove 20% of 'GDP per capita' (simulate MCAR)
missing_fraction = 0.2
n_missing = int(missing_fraction * len(worldHappiness))
missing_indices = np.random.choice(worldHappiness.index, n_missing, replace=False)

# Set these rows as NaN
worldHappiness.loc[missing_indices, 'GDP per capita'] = np.nan

# Print the data after deletion
print("Data after randomly deleting 20% of 'GDP per capita' values:")
print(worldHappiness.head(20))

print(f"\nAfter simulation, 'GDP per capita' missing count: {worldHappiness['GDP per capita'].isna().sum()}")

# Perform mean imputation (.mean() skips NaN, so only the observed values are used)
mean_value = worldHappiness['GDP per capita'].mean()
worldHappiness['GDP_per_capita_mean_imputed'] = worldHappiness['GDP per capita'].fillna(mean_value)

# Evaluate Imputation Quality (only for removed rows)
original_missing_values_1 = original_values_1.loc[missing_indices]
imputed_values_1 = worldHappiness.loc[missing_indices, 'GDP_per_capita_mean_imputed']

mae_1 = mean_absolute_error(original_missing_values_1, imputed_values_1)
mse_1 = mean_squared_error(original_missing_values_1, imputed_values_1)

print("\n[Test 1: Mean Imputation on 'GDP per capita']")
print("MAE:", mae_1)
print("MSE:", mse_1)


Data after randomly deleting 20% of 'GDP per capita' values:
    Overall rank Country or region  Score  GDP per capita  Social support  \
0              1           Finland  7.769           1.340           1.587   
1              2           Denmark  7.600           1.383           1.573   
2              3            Norway  7.554           1.488           1.582   
3              4           Iceland  7.494           1.380           1.624   
4              5       Netherlands  7.488           1.396           1.522   
5              6       Switzerland  7.480           1.452           1.526   
6              7            Sweden  7.343           1.387           1.487   
7              8       New Zealand  7.307           1.303           1.557   
8              9            Canada  7.278           1.365           1.505   
9             10           Austria  7.246             NaN           1.475   
10            11         Australia  7.228           1.372           1.548   
11            1

# Test 2 - Regression Imputation

**Objective:**
1. Select another numerical column (e.g., 'Healthy life expectancy').
2. Design a non-random missingness mechanism. Here we remove part of the highest life expectancy values, which is MNAR (Missing Not at Random) because the chance of a value being missing depends on the value itself.
3. Train a linear regression model using another feature (e.g., 'GDP per capita') to predict and impute missing values.
4. Compare the imputed values with the original values and evaluate using MAE and MSE.

By leveraging feature relationships, regression imputation aims to provide more accurate imputation than simple mean imputation.
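
The effect of this kind of missingness can be sketched on synthetic data: when only high values are eligible for deletion, the mean of the remaining values is pulled downward. The data below is randomly generated and is not the happiness dataset; `rng` is a local generator, so it does not affect the notebook's global seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # local generator, independent of np.random.seed

# Synthetic stand-in for a numeric column (not real data)
values = pd.Series(rng.normal(loc=0.7, scale=0.2, size=200))

# MNAR-style deletion: only values above the 70th percentile are eligible
threshold = values.quantile(0.7)
high_idx = values.index[values > threshold].to_numpy()
drop_idx = rng.choice(high_idx, size=len(high_idx) // 2, replace=False)

observed = values.copy()
observed.loc[drop_idx] = np.nan

true_mean = values.mean()
observed_mean = observed.mean()  # pandas skips NaN by default
print(f"true mean: {true_mean:.3f}, observed mean: {observed_mean:.3f}")
```

Any model fitted only on the observed rows, including a linear regression, sees fewer high values than the full data contains, which is worth keeping in mind when reading the evaluation metrics.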


In [106]:
# Backup the original data for later evaluation
original_values_2 = worldHappiness['Healthy life expectancy'].copy()

# Set a threshold to select "higher life expectancy" rows (e.g., top 30%)
threshold = worldHappiness['Healthy life expectancy'].quantile(0.7)
high_indices = worldHappiness.index[
    (worldHappiness['Healthy life expectancy'] > threshold) &
    (worldHappiness['GDP per capita'].notna())  # Ensure GDP per capita is not missing
]

# Randomly delete a portion of life expectancy values within the selected high group (MNAR simulation)
delete_fraction = 0.5
n_missing_2 = int(delete_fraction * len(high_indices))
missing_indices_2 = np.random.choice(high_indices, n_missing_2, replace=False)

worldHappiness.loc[missing_indices_2, 'Healthy life expectancy'] = np.nan

print("\nData after randomly deleting 'Healthy life expectancy' in high life expectancy rows:")
print(worldHappiness.loc[missing_indices_2].head(10))


print("\n[Test 2: Regression Imputation on 'Healthy life expectancy']")
print(f"Total rows: {len(worldHappiness)}")
print(f"High life expectancy rows: {len(high_indices)}")
print(f"Deleted {n_missing_2} rows from them for MNAR simulation.")

# Train a regression model on the non-missing subset (X = 'GDP per capita')
not_null_df_2 = worldHappiness.dropna(subset=['Healthy life expectancy', 'GDP per capita'])
X_train_2 = not_null_df_2[['GDP per capita']]
y_train_2 = not_null_df_2['Healthy life expectancy']

reg_model = LinearRegression()
reg_model.fit(X_train_2, y_train_2)

# Predict missing values
null_df_2 = worldHappiness[
    (worldHappiness['Healthy life expectancy'].isna()) &
    (worldHappiness['GDP per capita'].notna())
]
X_test_2 = null_df_2[['GDP per capita']]
predicted_2 = reg_model.predict(X_test_2)

# Write predictions back to a new column
worldHappiness['LifeExp_reg_imputed'] = worldHappiness['Healthy life expectancy'].copy()
worldHappiness.loc[X_test_2.index, 'LifeExp_reg_imputed'] = predicted_2

# Evaluate Imputation: Only compare rows we deliberately deleted
original_missing_values_2 = original_values_2.loc[missing_indices_2]
imputed_values_2 = worldHappiness.loc[missing_indices_2, 'LifeExp_reg_imputed']

# If some rows remain NaN (e.g., if GDP was also missing), drop them before evaluation
imputed_values_2_no_nan = imputed_values_2.dropna()
original_missing_values_2_no_nan = original_missing_values_2.loc[imputed_values_2_no_nan.index]

mae_2 = mean_absolute_error(original_missing_values_2_no_nan, imputed_values_2_no_nan)
mse_2 = mean_squared_error(original_missing_values_2_no_nan, imputed_values_2_no_nan)

print(f"\nNumber of rows successfully imputed: {len(imputed_values_2_no_nan)}")
print(f"Skipped rows (still NaN) => {len(imputed_values_2) - len(imputed_values_2_no_nan)}")

print("\n--- Evaluation ---")
print("MAE:", mae_2)
print("MSE:", mse_2)



Data after randomly deleting 'Healthy life expectancy' in high life expectancy rows:
    Overall rank Country or region  Score  GDP per capita  Social support  \
17            18           Belgium  6.923           1.356           1.504   
23            24            France  6.592           1.324           1.472   
33            34         Singapore  6.262           1.572           1.463   
1              2           Denmark  7.600           1.383           1.573   
16            17           Germany  6.985           1.373           1.454   
25            26             Chile  6.444           1.159           1.369   
3              4           Iceland  7.494           1.380           1.624   
21            22             Malta  6.726           1.300           1.520   
37            38          Slovakia  6.198           1.246           1.504   
43            44          Slovenia  6.118           1.258           1.523   

    Healthy life expectancy  Freedom to make life choices  Generos

# Test 3 - KNN Imputation (Similarity-based)

**Objective:**
1. Select a third feature (e.g., 'Generosity').
2. Randomly remove 20% of the values (simulate MCAR).
3. Use `KNNImputer`, incorporating multiple features (e.g., 'Score', 'GDP per capita', 'Healthy life expectancy', etc.), to fill missing values based on similarity.
4. Evaluate the imputation quality using MAE and MSE.

KNN imputation fills each missing value from the rows that are most similar across several features, so it can be more accurate than simple mean imputation when those features are related to the target column.
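
As a small, self-contained illustration of how `KNNImputer` behaves, the sketch below imputes a toy array with `n_neighbors=2`. The array is made up for illustration and is unrelated to the happiness data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Each missing entry becomes the average of that column over the
# 2 nearest rows that have a value in that column
print(X_filled)
```

By default `KNNImputer` measures similarity with a NaN-aware Euclidean distance, so features on a larger numeric scale contribute more to the distance. The notebook applies it to unscaled columns; whether scaling first would change the results here is not tested.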


In [None]:

# Backup original 'Generosity' column
original_values_3 = worldHappiness['Generosity'].copy()

# Randomly remove 20% of 'Generosity' (MCAR)
missing_fraction_3 = 0.2
n_missing_3 = int(missing_fraction_3 * len(worldHappiness))
missing_indices_3 = np.random.choice(worldHappiness.index, n_missing_3, replace=False)

worldHappiness.loc[missing_indices_3, 'Generosity'] = np.nan

print("Data after randomly deleting 20% of 'Generosity':")
print(worldHappiness.loc[missing_indices_3].head(10))

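# Create the imputer and select multiple features for KNN imputation.
# Note: 'GDP per capita' and 'Healthy life expectancy' still contain the NaNs
# introduced in Tests 1 and 2; KNNImputer fills those columns as well.
imputer = KNNImputer(n_neighbors=5)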
columns_for_knn = [
    'Score',
    'GDP per capita',
    'Healthy life expectancy',
    'Freedom to make life choices',
    'Generosity'
]

worldHappiness_knn = worldHappiness[columns_for_knn].copy()
worldHappiness_imputed_array = imputer.fit_transform(worldHappiness_knn)

worldHappiness_knn_imputed = pd.DataFrame(worldHappiness_imputed_array, columns=columns_for_knn, index=worldHappiness_knn.index)
# Copy the imputed 'Generosity' values into a new column of the main DataFrame
worldHappiness['Generosity_knn_imputed'] = worldHappiness_knn_imputed['Generosity']

# Evaluate Imputation (Compare original and imputed values)
original_missing_values_3 = original_values_3.loc[missing_indices_3]
imputed_values_3 = worldHappiness.loc[missing_indices_3, 'Generosity_knn_imputed']

mae_3 = mean_absolute_error(original_missing_values_3, imputed_values_3)
mse_3 = mean_squared_error(original_missing_values_3, imputed_values_3)

print("\n[Test 3: KNN Imputation on 'Generosity']")
print("MAE:", mae_3)
print("MSE:", mse_3)




[Test 3: KNN Imputation on 'Generosity']
MAE: 0.0670258064516129
MSE: 0.006435681290322581
