# Applied Machine Learning: In-class Exercise Imputation

## Goal

Our goal for this exercise sheet is to learn the basics of imputation using Python and scikit-learn. Imputation is the process of filling in missing data in a dataset using statistical methods like mean, median, mode, or predictive models.
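
To make the simple statistical strategies concrete, here is a minimal sketch on a made-up array (the name `toy` and its values are assumptions for illustration, not part of the Miami data). It fits scikit-learn's `SimpleImputer` once per strategy and records the fill value learned for each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns with one missing value each
toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 10.0],
    [6.0, 40.0],
])

fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy).fit(toy)
    # statistics_ holds the per-column value used to replace NaN
    fill_values[strategy] = imputer.statistics_.tolist()
    print(strategy, fill_values[strategy])

# Apply one of the fitted strategies to obtain a complete array
toy_filled = SimpleImputer(strategy="median").fit_transform(toy)
```

For the first column the observed values are 1, 2, and 6, so the mean fill is 3 and the median fill is 2. The three strategies can therefore give noticeably different imputations even on tiny data.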

### Required packages

We will use scikit-learn for machine learning, including its `fetch_openml` function to download the dataset from OpenML, together with NumPy and pandas for data handling.

### Data: Miami house prices

We will use house price data on 13,932 single-family homes sold in Miami in 2016.

We load the data from OpenML and drop the column `"PARCELNO"`, which is not needed for the analysis. Then, we ensure that numeric and categorical features are appropriately typed. Specifically, `"avno60plus"` and `"structure_quality"` are converted to categorical features.

We artificially introduce missing values in three features: `"OCEAN_DIST"`, `"TOT_LVG_AREA"`, and `"structure_quality"`. The missing values are introduced only for houses with an `age` greater than 50. For each feature, 2000 such rows are selected at random and set to `NaN`.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

rng = np.random.default_rng(12345)

X, y = fetch_openml(data_id=43093, as_frame=True, return_X_y=True)
miami = X.copy()

miami.drop(columns=["PARCELNO"], inplace=True)

# Convert "avno60plus" and "structure_quality" to object
for col in ("avno60plus", "structure_quality"):
    miami[col] = miami[col].astype('object')

# Introduce missingness in three features for homes with age > 50
mask_idxs = miami.index[miami['age'] > 50]
for feature in ['OCEAN_DIST', 'TOT_LVG_AREA', 'structure_quality']:
    sampled = rng.choice(mask_idxs, size=2000, replace=False)
    miami.loc[sampled, feature] = np.nan

# Quick sanity check
print("\nMissing values introduced:")
print(miami[['OCEAN_DIST', 'TOT_LVG_AREA', 'structure_quality']].isna().sum())

miami.info()



Missing values introduced:
OCEAN_DIST           2000
TOT_LVG_AREA         2000
structure_quality    2000
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13932 entries, 0 to 13931
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   LATITUDE           13932 non-null  float64
 1   LONGITUDE          13932 non-null  float64
 2   LND_SQFOOT         13932 non-null  int64  
 3   TOT_LVG_AREA       11932 non-null  float64
 4   SPEC_FEAT_VAL      13932 non-null  int64  
 5   RAIL_DIST          13932 non-null  float64
 6   OCEAN_DIST         11932 non-null  float64
 7   WATER_DIST         13932 non-null  float64
 8   CNTR_DIST          13932 non-null  float64
 9   SUBCNTR_DI         13932 non-null  float64
 10  HWY_DIST           13932 non-null  float64
 11  age                13932 non-null  int64  
 12  avno60plus         13932 non-null  object 
 13  month_sold         13932 non-null  int64  
 14 

## 1 Create simple imputation pipelines

Imputation can be integrated into standard pipeline workflows using scikit-learn's `Pipeline` objects. In this step, you will create two pipelines:

* One pipeline to impute **numerical features** by randomly sampling observed (non-missing) values from each column.
* Another pipeline to impute **categorical features** using out-of-range imputation, which adds a new category (in this case, `-1`) to represent missing values.

Use the provided `RandomSampleImputer` to randomly fill in missing numeric values based on the existing observed values in each column. 

Use the `sklearn.impute.SimpleImputer` to add an out-of-range (OOR) category (here: `-1`) to represent missing categorical values.

**Note**: Unlike the R version, where the OOR imputer adds the string `".MISSING"` as an extra category, we use the integer `-1` as the OOR category here. The columns `structure_quality` and `avno60plus` contain integer-like values such as `0, 1, 4, ...`. Although we explicitly set their `dtype` to `object`, `OneHotEncoder` still sees these values as `int`, so after OOR imputation with `".MISSING"` each column would hold mixed types (`int` and `str`), and `OneHotEncoder` would raise an error. Using `-1` avoids this complication.
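
The mixed-type problem described in the note can be reproduced in isolation. The sketch below uses two small hand-made object arrays (`mixed` and `uniform` are hypothetical names, and the values only mimic `structure_quality`) to compare a string OOR category with the integer `-1`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Integer-like categories plus a string OOR category: mixed int + str
mixed = np.array([[0], [1], [4], [".MISSING"]], dtype=object)
# Integer-like categories plus an integer OOR category: uniformly int
uniform = np.array([[0], [1], [4], [-1]], dtype=object)

try:
    OneHotEncoder().fit(mixed)
    mixed_error = None
except TypeError as err:
    # The encoder cannot sort categories of different Python types
    mixed_error = str(err)

encoded = OneHotEncoder().fit_transform(uniform)

print("Error with '.MISSING':", mixed_error)
print("Encoded shape with -1:", encoded.shape)
```

An alternative that would also avoid the error is converting the whole column to strings before imputing, at the cost of changing how the categories are labeled.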


In [2]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.utils.validation import check_is_fitted


class RandomSampleImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=None):
        self.random_state = random_state
        self._is_fitted = False
        self._sklearn_output_config = {"transform": "default"}

    def fit(self, X, y=None):
        # X is expected to be a DataFrame
        self.rng_ = np.random.default_rng(self.random_state)
        # for each column, store the non-missing pool
        self.pool_ = {
            col: X[col].dropna().values
            for col in X.columns
        }
        self._is_fitted = True
        return self

    def transform(self, X):
        check_is_fitted(self)  # raises NotFittedError unless fit() has been called
        X = X.copy()
        for col, values in self.pool_.items():
            mask = X[col].isna()
            n_miss = mask.sum()
            if n_miss > 0:
                draws = self.rng_.choice(values, size=n_miss, replace=True)
                X.loc[mask, col] = draws
        return X
        
    def __sklearn_is_fitted__(self):
        return hasattr(self, "pool_") and self._is_fitted
        
    def set_output(self, *, transform=None):
        """Set output container.
        
        Parameters
        ----------
        transform : {"default", "pandas"}, default=None
            Configure output of `transform` and `fit_transform`.
            
        Returns
        -------
        self : estimator instance
            Estimator instance.
        """
        if transform is not None:
            self._sklearn_output_config["transform"] = transform
        return self
        
    def get_feature_names_out(self, input_features=None):
        """Get output feature names for transformation.
        
        Parameters
        ----------
        input_features : array-like of str or None, default=None
            Input features.
            
        Returns
        -------
        feature_names_out : ndarray of str objects
            Transformed feature names.
        """
        check_is_fitted(self)
        if input_features is None:
            raise ValueError("Input features not specified")
        return np.array(input_features, dtype=object)

In [3]:
#===SOLUTION===
from sklearn.impute import SimpleImputer

impute_numeric = Pipeline([
    ("random_sample", RandomSampleImputer(random_state=12345))
])

impute_factor = Pipeline([
    ("oor", SimpleImputer(strategy="constant", fill_value=-1))
])

## 2 Create and inspect a pipeline graph

Combine both imputation pipelines with a random forest learning algorithm into a complete `Pipeline`. The numeric features are processed using random sampling imputation, while categorical features are imputed with an out-of-range value and then one-hot encoded. These are merged using a `ColumnTransformer`, and the final estimator is a `RandomForestRegressor`.

While scikit-learn does not have a built-in function to *plot* a pipeline like `mlr3pipelines::plot()` in R, you can inspect the structure by printing the pipeline object.
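
As a rough illustration of inspecting a pipeline without a plot, the sketch below builds a small stand-in pipeline and looks at its steps and nested parameter names. The names `toy_pipeline`, `"impute"`, and `"rf"` are made up for this example and are not the exercise solution.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical two-step pipeline used only to demonstrate inspection
toy_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("rf", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# Printing shows the nested structure as text
print(toy_pipeline)

# Step names in execution order
step_names = [name for name, _ in toy_pipeline.steps]

# Individual steps can be accessed by name
imputer_strategy = toy_pipeline.named_steps["impute"].strategy

# Nested parameters use the "<step>__<parameter>" naming scheme
rf_params = sorted(k for k in toy_pipeline.get_params() if k.startswith("rf__"))
print(step_names, imputer_strategy, rf_params[:3])
```

The `"<step>__<parameter>"` names listed by `get_params()` are the same names that hyperparameter search utilities expect.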

<details><summary>Hint 1:</summary>

Use `ColumnTransformer` to apply different preprocessing steps to numerical and categorical columns.

</details>

<details><summary>Hint 2:</summary>

When determining which columns a preprocessing pipeline should be applied to, use `sklearn.compose.make_column_selector` to select columns by dtype.

</details>


In [4]:
#===SOLUTION===

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector

impute_and_encode = Pipeline([
    ("oor", SimpleImputer(strategy="constant", fill_value=-1)),
    ("onehot", OneHotEncoder())
])

# Use make_column_selector to automatically select columns by dtype
numeric_selector = make_column_selector(dtype_exclude=object)
categorical_selector = make_column_selector(dtype_include=object)

preprocessor = ColumnTransformer([
    ("num_impute", impute_numeric, numeric_selector),
    ("cat_process", impute_and_encode, categorical_selector)    
], remainder="drop")

full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=12345))
])

full_pipeline

### Simple Imputation

As an alternative to a full pipeline that includes a learner, we can define a simpler pipeline that only handles imputation. This allows us to preprocess the data independently before modeling.

In this step, you define a pipeline containing only the numeric and categorical imputation steps from before. You then apply this pipeline to the Miami housing data using `fit_transform`, which performs the imputation. Finally, you reassemble the result into a DataFrame and inspect the first few rows to verify the imputed values.

In [5]:
# Define a "simple" imputation pipeline (no learner)
simple_impute = ColumnTransformer([
    ("num_impute", impute_numeric, numeric_selector),
    ("cat_impute", SimpleImputer(strategy="constant", fill_value=-1), categorical_selector)
], remainder="drop")

# Set output to pandas DataFrame directly
simple_impute.set_output(transform="pandas")

# The fit_transform will now return a DataFrame automatically
miami_imputed = simple_impute.fit_transform(miami)

# Display the results
miami_imputed.head()

| | `num_impute__LATITUDE` | `num_impute__LONGITUDE` | `num_impute__LND_SQFOOT` | `num_impute__TOT_LVG_AREA` | `num_impute__SPEC_FEAT_VAL` | `num_impute__RAIL_DIST` | `num_impute__OCEAN_DIST` | `num_impute__WATER_DIST` | `num_impute__CNTR_DIST` | `num_impute__SUBCNTR_DI` | `num_impute__HWY_DIST` | `num_impute__age` | `num_impute__month_sold` | `cat_impute__avno60plus` | `cat_impute__structure_quality` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.891031 | -80.160561 | 9375 | 2750.0 | 0 | 2815.9 | 33320.2 | 347.6 | 42815.3 | 37742.2 | 15954.9 | 67 | 8 | 0 | -1 |
| 1 | 25.891324 | -80.153968 | 9375 | 1624.0 | 0 | 4359.1 | 17666.7 | 337.8 | 43504.9 | 37340.5 | 18125.0 | 63 | 9 | 0 | -1 |
| 2 | 25.891334 | -80.15374 | 9375 | 1283.0 | 49206 | 4412.9 | 32759.1 | 297.1 | 43530.4 | 37328.7 | 18200.5 | 61 | 2 | 0 | -1 |
| 3 | 25.891765 | -80.152657 | 12450 | 1294.0 | 10033 | 4585.0 | 10156.5 | 0.0 | 43797.5 | 37423.2 | 18514.4 | 63 | 9 | 0 | 4 |
| 4 | 25.891825 | -80.154639 | 12800 | 1684.0 | 16681 | 4063.4 | 10836.8 | 326.6 | 43599.7 | 37550.8 | 17903.4 | 42 | 7 | 0 | 4 |


### Assessing Performance

To evaluate the performance of the full pipeline—including both imputation and the random forest learner—you will use 3-fold cross-validation. This provides an estimate of the model's error on unseen data by splitting the dataset into three parts, training on two, and validating on the third in rotation.

The metric used here is the **mean squared error (MSE)**.

<details><summary>Hint 1:</summary>
Use `cross_val_score` with `scoring="neg_mean_squared_error"`—note that scikit-learn returns the negative MSE, so you'll need to multiply the result by -1 to interpret it properly.
</details>


In [6]:
#===SOLUTION===

from sklearn.model_selection import KFold, cross_val_score


cv = KFold(n_splits=3, shuffle=True, random_state=12345)
neg_mse = cross_val_score(
    full_pipeline, 
    miami, 
    y, 
    cv=cv, 
    scoring="neg_mean_squared_error",
    n_jobs=-1
)

# Convert to positive MSE and summarize
mse_scores = -neg_mse
print(f"MSE per fold: {mse_scores}")
print(f"→ Average 3-fold CV MSE: {mse_scores.mean():.2f}")


MSE per fold: [1.12846963e+10 1.37675313e+10 1.01202609e+10]
→ Average 3-fold CV MSE: 11724162846.14
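
MSE values of this size are hard to read because they are in squared units of the target. As a small worked step, the sketch below takes the per-fold MSE values printed above (copied by hand into `mse_per_fold`) and converts them to root mean squared errors, which are on the same scale as the sale price.

```python
import numpy as np

# Per-fold MSE values copied from the output above
mse_per_fold = np.array([1.12846963e10, 1.37675313e10, 1.01202609e10])

# RMSE per fold, in the units of the target rather than squared units
rmse_per_fold = np.sqrt(mse_per_fold)

# RMSE corresponding to the average CV MSE
rmse_of_mean = np.sqrt(mse_per_fold.mean())

print(f"RMSE per fold: {np.round(rmse_per_fold)}")
print(f"RMSE of average MSE: {rmse_of_mean:.0f}")
```

Note that the square root of the average MSE is not the same as the average of the per-fold RMSE values, so it is worth stating which of the two is reported.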


## 3 Model-based imputation

Instead of relying on simple statistical imputation methods, you can use model-based imputation, where missing values are predicted using other observed features. This approach treats the feature with missing values as a target variable and uses a supervised learning model to estimate its values.
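
The idea of treating a feature with missing values as a target can be sketched by hand for a single feature. The example below uses synthetic data (`x1`, `x2`, and all constants are assumptions chosen for illustration) and a plain `LinearRegression`. It is a simplified single-pass version of the idea, not what `IterativeImputer` does internally.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: x2 depends linearly on x1 plus a little noise
toy = pd.DataFrame({"x1": rng.uniform(0, 10, size=200)})
toy["x2"] = 2.0 * toy["x1"] + 1.0 + rng.normal(0, 0.1, size=200)
truth = toy["x2"].copy()

# Remove 40 values of x2 at random
missing_rows = rng.choice(toy.index, size=40, replace=False)
toy.loc[missing_rows, "x2"] = np.nan

# Fit a model for x2 on the rows where x2 is observed ...
observed = toy["x2"].notna()
model = LinearRegression().fit(toy.loc[observed, ["x1"]], toy.loc[observed, "x2"])

# ... and use its predictions to fill in the missing entries
toy.loc[~observed, "x2"] = model.predict(toy.loc[~observed, ["x1"]])

max_abs_error = (toy["x2"] - truth).abs().max()
print(f"Largest imputation error: {max_abs_error:.3f}")
```

Because `x2` is almost a deterministic function of `x1` here, the imputed values land close to the removed ones. With weakly related features, model-based imputation has much less to work with.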

You will now create **two pipelines** that differ in how numeric features are imputed:

* One pipeline uses **linear regression** for imputation.
* The other uses a **decision tree**.

In both pipelines, categorical features are still imputed using the out-of-range strategy from earlier. After imputation, each pipeline ends with a `RandomForestRegressor` for the final prediction task.

<details><summary>Hint 1:</summary>
Use [`IterativeImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) from `sklearn.impute` with the desired estimator (`LinearRegression` or `DecisionTreeRegressor`) for model-based numeric imputation.
</details>

<details><summary>Hint 2:</summary>
Don't forget to import `enable_iterative_imputer` from `sklearn.experimental` before using `IterativeImputer`.
</details>


In [7]:
#===SOLUTION===

# First import the experimental module to enable IterativeImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# 1. Define the imputers
dt_imputer = IterativeImputer(
    estimator=DecisionTreeRegressor(random_state=12345),
    initial_strategy="mean",
    random_state=12345
)
linear_imputer = IterativeImputer(
    estimator=LinearRegression(),
    initial_strategy="mean",
    random_state=12345
)

# 2. Define the preprocessors and pipelines
dt_preprocessor = ColumnTransformer([
    ("cat_impute", SimpleImputer(strategy="constant", fill_value=-1), categorical_selector),
    ("num_impute", dt_imputer, numeric_selector),
], remainder="drop")

linear_preprocessor = ColumnTransformer([
    ("cat_impute", SimpleImputer(strategy="constant", fill_value=-1), categorical_selector),
    ("num_impute", linear_imputer, numeric_selector),
], remainder="drop")

dt_pipeline = Pipeline([
    ("preprocessor", dt_preprocessor),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=12345))
])

linear_pipeline = Pipeline([
    ("preprocessor", linear_preprocessor),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=12345))
])

In [8]:
#===SOLUTION===

dt_pipeline 

In [9]:
#===SOLUTION===

linear_pipeline

### Assessing Performance

As before, use 3-fold cross-validation to compare the errors of the two pipelines and identify which learner seems to work best for imputation on this data set.

In [10]:
#===SOLUTION===

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)


# 3. Evaluate the tree-based imputation pipeline
neg_mse_dt = cross_val_score(
    dt_pipeline,
    miami,
    y,
    cv=cv,
    scoring="neg_mean_squared_error",
)
mse_dt = -neg_mse_dt
print(f"Decision-tree imputation pipeline MSE per fold: {mse_dt}")
print(f"→ Average MSE: {mse_dt.mean():.2f}\n")

# 4. Evaluate the linear-model imputation pipeline
neg_mse_linear = cross_val_score(
    linear_pipeline,
    miami,
    y,
    cv=cv,
    scoring="neg_mean_squared_error",
)
mse_linear = -neg_mse_linear
print(f"Linear-model imputation pipeline MSE per fold: {mse_linear}")
print(f"→ Average MSE: {mse_linear.mean():.2f}")


Decision-tree imputation pipeline MSE per fold: [1.15086059e+10 1.07000392e+10 9.40927713e+09]
→ Average MSE: 10539307389.39

Linear-model imputation pipeline MSE per fold: [1.07898296e+10 1.02297789e+10 9.17037115e+09]
→ Average MSE: 10063326559.55


Question: What is your observation?

===SOLUTION===

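In this case, using a linear model for model-based imputation (average CV MSE of about 1.01e10) seems to outperform a decision tree with default hyperparameter settings (about 1.05e10). Both pipelines achieve a lower average MSE than the random-sampling pipeline evaluated earlier (about 1.17e10).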

## 4 Branches in pipelines

Pipelines can become very complex. Within a pipeline, we may want to know which imputation method works best. An elegant way to find out is to treat the imputation method as just another hyperparameter that we tune alongside the other hyperparameters of the pipeline. One way to do this is pipeline branching.

However, unlike `mlr3`, `sklearn` does not provide off-the-shelf support for pipeline branching. To achieve the same imputation-switching effect, we can add two different imputers in the hyperparameter search space.

Set up a `Pipeline` that contains the following elements:

1. A `ColumnTransformer` consisting of an (arbitrary) imputer for numerical features and a `SimpleImputer` for categorical features.
2. A `RandomForestRegressor` serving as the learner.

Later, in the hyperparameter search space, we will switch between `IterativeImputer` and `HistogramImputer` for imputing the numerical features.
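
The step-swapping trick can be tried on a much smaller problem first. The sketch below is a hedged illustration on synthetic data: the pipeline, the step names `"imputer"` and `"model"`, and the choice of `Ridge` and `GridSearchCV` are all assumptions made for brevity, not the setup of this exercise. The point is only that a whole pipeline step can appear as a value in the search space.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data with roughly 10% of entries set to NaN
X_toy, y_toy = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)
rng = np.random.default_rng(0)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer()),  # placeholder, replaced during the search
    ("model", Ridge()),
])

# The key "imputer" replaces the entire step with each candidate object
param_grid = {
    "imputer": [SimpleImputer(strategy="mean"), SimpleImputer(strategy="median")],
    "model__alpha": [0.1, 1.0],
}

toy_search = GridSearchCV(
    toy_pipeline, param_grid, cv=3, scoring="neg_mean_squared_error"
)
toy_search.fit(X_toy, y_toy)

n_candidates = len(toy_search.cv_results_["params"])
print(n_candidates, toy_search.best_params_)
```

Two imputers times two values of `alpha` give four candidate configurations, each evaluated with 3-fold cross-validation.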

In [11]:

class HistogramImputer(BaseEstimator, TransformerMixin):
    def __init__(self, bins=10, random_state=None):
        self.bins = bins
        self.random_state = random_state

    def fit(self, X, y=None):
        self.rng_ = np.random.default_rng(self.random_state)
        self.hist_ = {}
        self.edges_ = {}
        for col in X.columns:
            vals = X[col].dropna().values
            counts, edges = np.histogram(vals, bins=self.bins)
            probs = counts / counts.sum()
            self.hist_[col] = probs
            self.edges_[col] = edges
        return self

    def transform(self, X):
        X = X.copy()
        for col, probs in self.hist_.items():
            mask = X[col].isna()
            n_miss = mask.sum()
            if n_miss > 0:
                edges = self.edges_[col]
                # pick bins according to observed frequencies
                bins_idx = self.rng_.choice(len(probs), size=n_miss, p=probs)
                # sample uniformly within those bins
                draws = self.rng_.uniform(edges[bins_idx], edges[bins_idx+1])
                X.loc[mask, col] = draws
        return X
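
To see what `HistogramImputer` does for a single column, the two sampling steps can be run on their own. The sketch below uses made-up normally distributed values (`observed` and its parameters are assumptions for illustration) and mirrors the logic of the class above with plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed (non-missing) values of one numeric column
observed = rng.normal(loc=50.0, scale=10.0, size=1000)

# Step 1: summarize the observed values as a histogram
counts, edges = np.histogram(observed, bins=10)
probs = counts / counts.sum()

# Step 2: for each missing entry, pick a bin with probability proportional
# to its observed frequency, then draw uniformly inside that bin
n_missing = 5
bin_idx = rng.choice(len(probs), size=n_missing, p=probs)
draws = rng.uniform(edges[bin_idx], edges[bin_idx + 1])

print(draws)
```

Unlike sampling directly from the observed values, this can produce values that never occurred in the data, but they always stay within the observed range.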

In [12]:
#===SOLUTION===

from sklearn.model_selection import RandomizedSearchCV

pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", "passthrough", numeric_selector),  # placeholder, will be set in param_dist
        ("cat", SimpleImputer(strategy="constant", fill_value=-1), categorical_selector),
    ], remainder="drop")),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=12345))
])

### Define a search space

To tune the pipeline, we define a **hyperparameter search space** that includes:

1. The `max_features` parameter of the random forest, searched over values from 2 to 8.
2. The **imputation method**—choosing between histogram-based and decision tree-based imputation.
3. The `max_depth` of the decision tree used in the model-based imputation, varied from 1 to 30.

This allows the tuning procedure to evaluate combinations of both preprocessing and modeling strategies.

<details><summary>Hint 1:</summary>

Suppose the `ColumnTransformer` step of the pipeline is named `"preprocessor"` and the numeric imputer inside it is named `"num"`. When model-based imputation is selected, this `"num"` step is replaced by an `IterativeImputer`, and the search space for its decision tree can then be specified as `"preprocessor__num__estimator__max_depth": list(range(1, 31))`.

</details>
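
How such a nested parameter name is resolved can be checked with `set_params` before running any search. The sketch below rebuilds a small pipeline with the same step names as in the hint, using integer column positions instead of dtype selectors and a smaller forest. These simplifications are assumptions for illustration. It then applies one hypothetical configuration by hand.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

toy_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", "passthrough", [0, 1]),  # placeholder for the numeric imputer
        ("cat", SimpleImputer(strategy="constant", fill_value=-1), [2]),
    ], remainder="drop")),
    ("rf", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# One hypothetical configuration, written exactly like a search-space entry
toy_pipeline.set_params(**{
    "preprocessor__num": IterativeImputer(
        estimator=DecisionTreeRegressor(random_state=0), random_state=0
    ),
    "preprocessor__num__estimator__max_depth": 7,
    "rf__max_features": 3,
})

# Each "__" descends one level: pipeline step -> transformer -> estimator
num_step = toy_pipeline.named_steps["preprocessor"].transformers[0][1]
print(type(num_step).__name__, num_step.estimator.max_depth)
print(toy_pipeline.named_steps["rf"].max_features)
```

The `"num"` step is replaced first and the depth is then set on the new imputer's tree, which is why both keys can appear in the same configuration.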


In [13]:
#===SOLUTION===

# define search space
param_dist = [
    {
        # Use HistogramImputer for numerical features
        "preprocessor__num": [HistogramImputer(bins=10, random_state=12345)],
        "rf__max_features": list(range(2, 9)),
    },
    {
        # Use IterativeImputer for numerical features  
        "preprocessor__num": [
            IterativeImputer(
                estimator=DecisionTreeRegressor(random_state=12345), 
                initial_strategy="mean", 
                random_state=12345
            )
        ],
        "preprocessor__num__estimator__max_depth": list(range(1, 31)),
        "rf__max_features": list(range(2, 9)),
    }
]

### Tuning the pipeline

Now tune the pipeline with `RandomizedSearchCV`, using 3-fold CV. You can stop after 10 evaluations to reduce run time. Then display the best hyperparameter configuration found by the tuner, as judged by mean squared error.

In [14]:
#===SOLUTION===

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_dist,
    n_iter=10, 
    cv=cv,
    scoring="neg_mean_squared_error",
    random_state=12345,
)

search.fit(miami, y)

# display results
best_params = search.best_params_
best_mse = -search.best_score_

print(f"Best parameters found:\n{best_params}")
print(f"→ 3-fold CV MSE (avg): {best_mse:.2f}")

Best parameters found:
{'rf__max_features': 3, 'preprocessor__num__estimator__max_depth': 7, 'preprocessor__num': IterativeImputer(estimator=DecisionTreeRegressor(random_state=12345),
                 random_state=12345)}
→ 3-fold CV MSE (avg): 9834692119.76
