# Applied Machine Learning: Homework Exercise 03-1

## Goal

Apply what you have learned about using pipelines for efficient pre-processing and model training on a regression problem.

## House Prices in King County

In this exercise, we want to model house sale prices in King County in the state of Washington, USA.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml

rng = np.random.default_rng(124)

X, y = fetch_openml(data_id=42092, return_X_y=True)

X.head()

| | date | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20141013T000000 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 20141209T000000 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 20150225T000000 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 20141209T000000 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 20150218T000000 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


We do some simple feature pre-processing first:

In [2]:
# To be consistent with the mlr3data package, we need steps 1-4 below.
# These steps are not shown in the R solution because mlr3data already performs them.
# For details please see https://github.com/mlr-org/mlr3data/blob/main/R/kc_housing.R

# 1. Convert dates from strings to datetime
X['date'] = pd.to_datetime(X['date'])

# 2. Replace 0 values with NA in yr_renovated
X['yr_renovated'] = X['yr_renovated'].replace(0, np.nan)

# 3. Replace 0 values with NA in sqft_basement
X['sqft_basement'] = X['sqft_basement'].replace(0, np.nan)

# 4. Convert waterfront to category.
X['waterfront'] = X['waterfront'].astype('category')

# Next, we do the feature preprocessing

# Convert zipcode to category
X['zipcode'] = X['zipcode'].astype('category')

# Transform date to numeric variable (days since earliest date)
min_date = X['date'].min()
X['date'] = (X['date'] - min_date).dt.days

# Scale the target variable by dividing by 1000
y = y.astype(float) / 1000

# Delete columns containing NAs
X = X.drop(X.columns[X.isna().any()], axis=1)

# Print information about the DataFrame
print("DataFrame info:")
X.info()  # info() prints its summary directly and returns None

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   date           21613 non-null  int64   
 1   bedrooms       21613 non-null  int64   
 2   bathrooms      21613 non-null  float64 
 3   sqft_living    21613 non-null  int64   
 4   sqft_lot       21613 non-null  int64   
 5   floors         21613 non-null  float64 
 6   waterfront     21613 non-null  category
 7   view           21613 non-null  int64   
 8   condition      21613 non-null  int64   
 9   grade          21613 non-null  int64   
 10  sqft_above     21613 non-null  int64   
 11  yr_built       21613 non-null  int64   
 12  zipcode        21613 non-null  category
 13  lat            21613 non-null  float64 
 14  long           21613 non-null  float64 
 15  sqft_living15  21613 non-null  int64   
 16  sqft_lot15     21613 non-null  int64   
dtypes: category(2), float64(4), int64(11)

## Train-test split

Before we train a model, let’s reserve some data for evaluating our model later on:

In [3]:
from sklearn.model_selection import train_test_split

# Split the data with a 60% training ratio.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=124)

# Print out the shapes of the training and testing sets.
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")


Training set shape: (12967, 17)
Test set shape: (8646, 17)


## XGBoost

XGBoost (Chen and Guestrin, 2016) is a highly performant library for gradient-boosted trees. Like some other ML algorithms, XGBoost's scikit-learn interface does not handle categorical data by default, so categorical features must be encoded as numerical variables before use.

In the King County dataset, there are two categorical features: **"waterfront"** and **"zipcode"**. Categorical features can be grouped by their cardinality, which refers to the number of unique levels they contain: binary features (two levels), low-cardinality features, and high-cardinality features. There is no universal threshold defining when a feature is considered high-cardinality; such a threshold can even be tuned based on performance. 

Low-cardinality features, including binary features, can typically be handled by **one-hot encoding**. One-hot encoding converts a categorical feature into a binary representation in which each category becomes a separate binary feature. In theory, one fewer binary feature than levels is sufficient (often called dummy or treatment encoding, particularly relevant for generalized linear models), but scikit-learn's `OneHotEncoder` creates one binary feature per level by default (`drop=None`).

For this dataset:

- **"waterfront"** has **2** unique levels, making it a binary feature suitable for one-hot (or dummy) encoding.
- **"zipcode"** has **70** unique levels, making it a high-cardinality feature.

High-cardinality categorical features, like `"zipcode"`, can be problematic if encoded with methods like one-hot encoding due to dimensionality explosion, and might require alternative encoding methods or preprocessing strategies.
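
To illustrate how cardinality drives the width of a one-hot encoding, the sketch below builds a synthetic frame with a binary column and a 70-level column (the names `flag` and `code` and all values are made up for this example) and counts the resulting columns:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: a binary flag and a 70-level code (hypothetical values).
levels = [f"z{i:02d}" for i in range(70)]
demo = pd.DataFrame({
    # np.resize tiles the values so every level appears at least once.
    "flag": pd.Categorical(np.resize([0, 1], 1000)),
    "code": pd.Categorical(np.resize(levels, 1000)),
})

# Cardinality = number of unique levels per categorical column.
cardinality = demo.nunique()
print(cardinality)

# A full one-hot encoding creates one column per level of each feature.
onehot_width = pd.get_dummies(demo).shape[1]
print(f"One-hot columns: {onehot_width}")
```

The one-hot width is the sum of the cardinalities, so nearly all of the added columns come from the high-cardinality feature.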

## Impact encoding

Impact encoding (Micci-Barreca, 2001) is a good approach for handling high-cardinality features. It converts a categorical feature into numeric values by using the target to map each level to a number that reflects how strongly that level is associated with the target. Impact encoding involves the following steps:

1. Group the target variable by the categorical feature.
2. Compute the mean of the target variable for each group.
3. Compute the global mean of the target variable.
4. Compute the impact score for each group as the difference between the mean of the target variable for the group and the global mean of the target variable.
5. Replace the categorical feature with the impact scores.
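
The five steps above can be sketched with pandas on a small toy frame. The column names `zone` and `price` and their values are hypothetical and only serve to make the arithmetic easy to follow:

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a numeric target.
toy = pd.DataFrame({
    "zone": ["a", "a", "b", "b", "b", "c"],
    "price": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
})

# Steps 1-2: group the target by the categorical feature and take group means.
group_means = toy.groupby("zone")["price"].mean()

# Step 3: global mean of the target.
global_mean = toy["price"].mean()

# Step 4: impact score = group mean minus global mean.
impact = group_means - global_mean

# Step 5: replace the categorical feature with its impact score.
toy["zone_impact"] = toy["zone"].map(impact)
print(toy)
```

Here the group means are 150, 400, and 600 and the global mean is 350, so the levels `a`, `b`, and `c` receive the scores -200, 50, and 250.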

Impact encoding preserves the information of the categorical feature while also creating a numerical representation that reflects its importance in predicting the target. Compared to one-hot encoding, the main advantage is that only a single numeric feature is created regardless of the number of levels of the categorical features, hence it is especially useful for high-cardinality features. As information from the target is used to compute the impact scores, the encoding process must be embedded in cross-validation to avoid leakage between training and testing data.
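
One way to respect this requirement is to learn the impact scores on the training rows only and then apply the learned mapping to held-out rows. The helper functions `fit_impact` and `apply_impact` below are hypothetical names for this sketch, and mapping unseen levels to 0 (no deviation from the global mean) is an assumption, not a rule from the text above:

```python
import pandas as pd


def fit_impact(categories: pd.Series, target: pd.Series) -> pd.Series:
    """Learn impact scores (group mean minus global mean) from training data only."""
    return target.groupby(categories).mean() - target.mean()


def apply_impact(categories: pd.Series, impact: pd.Series) -> pd.Series:
    """Map levels to learned scores; unseen levels get 0 (assumption of this sketch)."""
    return categories.map(impact).fillna(0.0)


# Hypothetical toy split: the level "c" occurs only in the test rows.
train = pd.DataFrame({"zone": ["a", "a", "b", "b"], "price": [100.0, 300.0, 500.0, 700.0]})
test = pd.DataFrame({"zone": ["a", "b", "c"]})

impact = fit_impact(train["zone"], train["price"])
test_encoded = apply_impact(test["zone"], impact)
print(test_encoded.tolist())
```

The test targets never enter `fit_impact`, so the encoded test features carry no information from the test labels.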

scikit-learn implements this idea in the class `sklearn.preprocessing.TargetEncoder`, which encodes each level as a smoothed mean of the target rather than as a difference from the global mean. Please refer to [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html) for more details.

## Exercises

### Exercise 1: Create a pipeline

Create a pipeline that preprocesses each categorical variable using impact encoding. Then tune the hyperparameters of an XGBoost model in this pipeline using randomized search, with mean squared error (MSE) as the performance measure and two-fold cross-validation (`cv=2`).

For the hyperparameter search space, define ranges informed by empirical best practices and common defaults for tuning XGBoost:

- `learning_rate`: sample from a log-uniform distribution between 1e-4 and 1.
- `max_depth`: sample integer values from 1 to 20.
- `colsample_bytree` and `colsample_bylevel`: uniform distributions from 0.1 to 1.0.
- `reg_lambda` and `reg_alpha`: log-uniform distributions between 0.001 and 1000.
- `subsample`: uniform distribution from 0.1 to 1.0.
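
The `scipy.stats` distributions have parameterizations that are easy to get wrong, so the sketch below draws samples from three of the ranges listed above and prints their observed bounds. The variable names and the sample size are arbitrary choices for this illustration:

```python
from scipy.stats import loguniform, randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], so (0.1, 0.9) spans 0.1 to 1.0.
subsample_dist = uniform(0.1, 0.9)

# loguniform(a, b) takes the two endpoints directly.
learning_rate_dist = loguniform(1e-4, 1.0)

# randint(low, high) excludes `high`, so (1, 21) yields integers from 1 to 20.
max_depth_dist = randint(1, 21)

samples = {
    "subsample": subsample_dist.rvs(size=1000, random_state=0),
    "learning_rate": learning_rate_dist.rvs(size=1000, random_state=0),
    "max_depth": max_depth_dist.rvs(size=1000, random_state=0),
}

for name, values in samples.items():
    print(f"{name}: min={values.min():.4g}, max={values.max():.4g}")
```

Note in particular that the second argument of `uniform` is the width of the interval, not its upper end.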

To speed up computations, set `n_estimators=100` and enable multi-core computing (`n_jobs=-1`) in your `XGBRegressor`.

Additionally, you will compare the pipeline using impact encoding with an alternative pipeline using one-hot encoding, keeping all other pipeline steps and hyperparameter tuning procedures identical. This will allow you to explore how different encoding strategies affect the model's predictive performance.

<details><summary>Hint 1:</summary>
    Use `loguniform`, `uniform`, `randint` from `scipy.stats` for specifying the distributions in parameter search.
</details>

<details><summary>Hint 2:</summary>
    Use `objective='reg:squarederror'` in `XGBRegressor`.
</details>

<details><summary>Hint 3:</summary>
    Use `neg_mean_squared_error` in `RandomizedSearchCV`.
</details>


In [4]:
#===SOLUTION===

from scipy.stats import loguniform, uniform, randint
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, TargetEncoder
from xgboost import XGBRegressor


# Define the search space for the XGBoost hyperparameters
param_distributions = {
    'xgb__learning_rate': loguniform(1e-4, 1.0),         # eta: from 1e-4 to 1 (log scale)
    'xgb__max_depth': randint(1, 21),                    # max_depth: from 1 to 20
    'xgb__colsample_bytree': uniform(0.1, 0.9),          # colsample_bytree: from 0.1 to 1.0
    'xgb__colsample_bylevel': uniform(0.1, 0.9),         # colsample_bylevel: from 0.1 to 1.0
    'xgb__reg_lambda': loguniform(0.001, 1000),          # reg_lambda: from 0.001 to 1000 (log scale)
    'xgb__reg_alpha': loguniform(0.001, 1000),           # reg_alpha: from 0.001 to 1000 (log scale)
    'xgb__subsample': uniform(0.1, 0.9)                  # subsample: from 0.1 to 1.0
}

# Define categorical and numerical features
categorical_features = ['zipcode', 'waterfront']
numerical_features = [col for col in X_train.columns if col not in categorical_features]

# 1. Pipeline with TargetEncoder
target_preprocessor = ColumnTransformer([
    ('target_encoder', TargetEncoder(smooth=0.0001), categorical_features),
    ('passthrough', 'passthrough', numerical_features)
])

target_pipeline = Pipeline([
    ('preprocessor', target_preprocessor),
    ('xgb', XGBRegressor(n_estimators=100, n_jobs=-1, objective='reg:squarederror', verbosity=0))
])

# 2. Pipeline with OneHotEncoder
onehot_preprocessor = ColumnTransformer([
    ('onehot_encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features),
    ('passthrough', 'passthrough', numerical_features)
])

onehot_pipeline = Pipeline([
    ('preprocessor', onehot_preprocessor),
    ('xgb', XGBRegressor(n_estimators=100, n_jobs=-1, objective='reg:squarederror', verbosity=0))
])

# Create RandomizedSearchCV for TargetEncoder pipeline
target_search = RandomizedSearchCV(
    estimator=target_pipeline,
    param_distributions=param_distributions,
    n_iter=50,
    scoring='neg_mean_squared_error',
    cv=2,
    random_state=111,
    verbose=1,
)

# Create RandomizedSearchCV for OneHotEncoder pipeline
onehot_search = RandomizedSearchCV(
    estimator=onehot_pipeline,
    param_distributions=param_distributions,
    n_iter=50,
    scoring='neg_mean_squared_error',
    cv=2,
    random_state=124,
    verbose=1,
)


### Exercise 2: Benchmark a pipeline

Assess the performance of both pipelines using the untouched test set principle:

- Perform CV on your training dataset (`X_train`, `y_train`).
- Evaluate their predictive performance (MSE) on the separate test set (`X_test`, `y_test`).

Finally, report and compare the optimal hyperparameters and corresponding performance metrics obtained from cross-validation and on the independent test set.

<details><summary>Hint 1:</summary>
    Use `RandomizedSearchCV.best_score_` to see the best validation score, and use `RandomizedSearchCV.score` to run performance evaluation on the test set.
</details>


<details><summary>Hint 2:</summary>
    Recall that `RandomizedSearchCV` uses negative MSE as the score during HPO. Therefore, when printing the performance on the validation and test sets, you need to flip the sign of the score.
</details>
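
The sign convention from the hints can be checked on a small synthetic problem. This sketch uses `Ridge` as a lightweight stand-in for `XGBRegressor`, and the data, the search range for `alpha`, and all variable names are assumptions made for illustration:

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data (hypothetical, for illustration only).
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng_demo.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.6, random_state=0)

search = RandomizedSearchCV(
    estimator=Ridge(),  # stand-in model; the exercise uses XGBRegressor
    param_distributions={"alpha": loguniform(1e-3, 1e3)},
    n_iter=10,
    scoring="neg_mean_squared_error",
    cv=2,
    random_state=0,
)
search.fit(X_tr, y_tr)

# Scores are negated MSE values (higher is better), so flip the sign to report MSE.
cv_mse = -search.best_score_
test_mse = -search.score(X_te, y_te)

# The flipped test score matches the MSE computed directly from the predictions.
direct_mse = mean_squared_error(y_te, search.predict(X_te))
print(f"CV MSE: {cv_mse:.4f}, test MSE: {test_mse:.4f}, direct MSE: {direct_mse:.4f}")
```

Because the search refits the best configuration on the full training set by default, `search.score` and `search.predict` both use that refitted model.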


In [5]:
#===SOLUTION===

print("\nTraining model with OneHotEncoder...")
onehot_search.fit(X_train, y_train)

print("Training model with TargetEncoder...")
target_search.fit(X_train, y_train)

# Output results for OneHotEncoder
print("\n--- OneHotEncoder Results ---")
print(f"Best parameters: {onehot_search.best_params_}")
print(f"Best CV score (MSE): {-onehot_search.best_score_:.4f}")
onehot_test_score = onehot_search.score(X_test, y_test)
print(f"Test set score (MSE): {-onehot_test_score:.4f}")

# Output results for TargetEncoder
print("\n--- TargetEncoder Results ---")
print(f"Best parameters: {target_search.best_params_}")
print(f"Best CV score (MSE): {-target_search.best_score_:.4f}")
target_test_score = target_search.score(X_test, y_test)
print(f"Test set score (MSE): {-target_test_score:.4f}")


Training model with OneHotEncoder...
Fitting 2 folds for each of 50 candidates, totalling 100 fits
Training model with TargetEncoder...
Fitting 2 folds for each of 50 candidates, totalling 100 fits

--- OneHotEncoder Results ---
Best parameters: {'xgb__colsample_bylevel': 0.9739842882294312, 'xgb__colsample_bytree': 0.2816306321316162, 'xgb__learning_rate': 0.24734640788083925, 'xgb__max_depth': 4, 'xgb__reg_alpha': 63.31327250747733, 'xgb__reg_lambda': 0.014633103639863188, 'xgb__subsample': 0.8480834882645039}
Best CV score (MSE): 17334.5468
Test set score (MSE): 16713.4028

--- TargetEncoder Results ---
Best parameters: {'xgb__colsample_bylevel': 0.8656327252373124, 'xgb__colsample_bytree': 0.5827666592066569, 'xgb__learning_rate': 0.07834804904588204, 'xgb__max_depth': 8, 'xgb__reg_alpha': 57.53289051844459, 'xgb__reg_lambda': 0.8489426245140799, 'xgb__subsample': 0.658346139517151}
Best CV score (MSE): 18180.8702
Test set score (MSE): 14889.9817


## Summary

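We learned how to combine pre-processing steps with hyperparameter tuning in pipelines for benchmark experiments. In the run shown above, the pipeline with impact encoding reached a lower test MSE (14889.98) than the pipeline with one-hot encoding (16713.40), although its best cross-validation MSE was higher (18180.87 vs. 17334.55).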