<img src="images\Logo_UCLL_ENG_RGB.png" style="background-color:white;" />

# Data Analytics & Machine learning

Lecturers: Aimée Lynn Backiel, Kenric Borgelioen, Sofie Torfs, Lies Bollens

Academic year 2025-2026

## Lab 6: Machine learning, part 2

### Lecture outline

1. Recap of previous weeks
2. Overfitting and underfitting
3. Automating machine learning pipelines with scikit-learn
4. Model evaluation
5. Parameters and hyperparameters

### Recap of last lecture(s)

#### Lab 1

1. We ensured we had a valid Python installation.
2. We learnt what a virtual environment is:
   * Isolated Python executable and packages.
   * We created a virtual environment.
3. Absolute path vs relative path recap.
4. Recap of data structures in Python

#### Lab 2
1. Installed Pandas
2. Learnt how to read data
3. Learnt how to calculate mean, mode, median etc.
4. Basic exploration of the 4 variables

#### Lab 3
1. Wrapped up computing summary statistics (mean, median, mode, ...)
2. Learnt how to deal with outliers 
3. Focused on exploration of data

#### Lab 4
1. Univariate data visualization using Matplotlib
   1. Figures and axes
   2. Histograms
   3. Box plots
   4. Bar charts
2. Multivariate data visualization using Seaborn
   1. Scatter plots
   2. Small multiples
   3. Color coding

#### Lab 5
1. Intro to machine learning using scikit-learn
   1. Preprocessing
      1. One Hot encoding
      2. Scaling
      3. Outliers
   2. Classification and regression

### The case

Ada Turing Travelogue, or Ada as everyone calls her, has just started working part time at her parents' travel agency. She has a keen interest in and understanding of everything related to applied computer science, ranging from server & system management to full stack software development. Through database foundations she already understands how to query data, and programming 1 and 2 covered the essentials of the Python programming language. Recently she decided to start learning about data analytics & machine learning as well.

She uses her skills to connect to the travel agency's database, where she finds many normalized tables. Ada recalls what she learnt in database foundations and performs all the correct joins. Afterwards she saves the data in the `data/` folder.


She finds the following dataset:

| Column Name          | Description                                                                                       |
| -------------------- | ------------------------------------------------------------------------------------------------- |
| SalesID              | Unique identifier for each sale.                                                                  |
| Age                  | Age of the traveler.                                                                              |
| Country              | Country of origin of the traveler.                                                                |
| Membership_Status    | Membership level of the traveler in the booking system; could be 'standard', 'silver', or 'gold'. |
| Previous_Purchases   | Number of previous bookings made by the traveler.                                                 |
| Destination          | Travel destination chosen by the traveler.                                                        |
| Stay_length          | Duration of stay at the destination.                                                              |
| Guests               | Number of guests traveling (including the primary traveler).                                             |
| Travel_month         | Month in which the travel is scheduled.                                                           |
| Months_before_travel | Number of months prior to travel that the booking was made.                                       |
| Earlybird_discount   | Boolean flag indicating whether the traveler received an early bird discount.                     |
| Package_Type         | Type of travel package chosen by the traveler.                                                    |
| Cost                 | Calculated cost of the travel package.                                                            |
| Margin               | The cost for the traveler minus what the travel agency pays.                                      |
| Additional_Services_Cost | The amount spent on additional services (towels, car rentals, room service, ...) bought during the trip. |


#### Our challenge

Before getting into harder use cases, we will start off by predicting the cost of a given stay. Right now Ada's parents do this manually, so automating this task would already be a big help to their business.



<center>
<img src="https://www.datascience-pm.com/wp-content/uploads/2021/02/CRISP-DM.png" style="background-color:white;width:50%">
</center>

It also helps to situate our progress within CRISP-DM. We have completed the first three steps; from this lecture onwards we progress to modeling. As mentioned in the lecture, this is an iterative procedure: while modeling, we will need to circle back to data preparation.

### Machine learning with scikit-learn


<center>
<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" style="background-color:white;width:50%">
</center>

##### ❓ Which parts of the image above have we covered so far? What stages have we completed?


We have done train test splitting and we have built a few models on the training set. 

#### Recap overfitting and underfitting

Overfitting is when the model doesn't learn general patterns from the data but rather focuses on doing really well on the training data. 

❗ Typically it means that the performance on the test set will be a lot worse than that on the training set.

Underfitting is when the model learns a pattern that is too simple to sufficiently capture the structure of the training set.

❗ Typically it means that the performance on the test set will be similarly bad to that on the training set.

<center>
<img src="https://www.mathworks.com/discovery/overfitting/_jcr_content/mainParsys/image.adapt.full.medium.svg/1686825007300.svg" style="background-color:white;width:50%">
</center>
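To make the two failure modes concrete, here is a minimal sketch on synthetic data (a noisy sine wave, which is an assumption for illustration and not the travel dataset). It compares a fully grown decision tree with a tree that is only allowed a single split; the helper `train_test_mae` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy sine wave with one feature.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=200)

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

def train_test_mae(model):
    """Fit the model and return (training MAE, test MAE)."""
    model.fit(X_demo_train, y_demo_train)
    return (
        mean_absolute_error(y_demo_train, model.predict(X_demo_train)),
        mean_absolute_error(y_demo_test, model.predict(X_demo_test)),
    )

# A fully grown tree can memorise every training point, noise included.
deep_train, deep_test = train_test_mae(DecisionTreeRegressor(random_state=0))
# A tree with a single split is too simple to follow the sine wave.
stump_train, stump_test = train_test_mae(DecisionTreeRegressor(max_depth=1, random_state=0))

print(f"deep tree: train MAE {deep_train:.2f}, test MAE {deep_test:.2f}")
print(f"depth 1  : train MAE {stump_train:.2f}, test MAE {stump_test:.2f}")
```

The deep tree should show a (near) zero training error next to a clearly larger test error, which is the overfitting signature. The depth-1 tree should show errors of a similar, unsatisfying size on both sets, which is the underfitting signature.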

Let's try this out, continuing where we left off last session.
Create two different regression models that predict cost based on the other variables in the dataset.
Use a linear regression model and a decision tree regressor.
Compare both methods in terms of performance.
We'll give you some code and tips to get you started. The rest is up to you!

In [None]:
import pandas as pd # by convention
pd.options.display.float_format = '{:.2f}'.format
from sklearn.model_selection import train_test_split
import plotly.express as px
import numpy  as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [None]:
#Read in the data
#X = #features are denoted with a capital X
#y = #the target variable is denoted with a small y

In [None]:
travel_dataset = pd.read_csv("data/lab_6_dataset.csv")
X = travel_dataset.drop(columns="cost") #features are denoted with a capital X
y = travel_dataset["cost"] #the target variable is denoted with a small y

In [None]:
# Split the data into a training and test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

##### ❓ Should we apply our preprocessing step to the training dataset, the test dataset, or both?

The proper methodology is to fit the OneHotEncoder and StandardScaler on the training data only. This way, they "learn" the categories of the categorical variables and the distribution (mean and standard deviation) of the numerical variables from the training set. Once fitted, these preprocessors should then be applied to the test data, ensuring that the transformation applied is consistent and does not give the model any information about the test set.
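As a small illustration of "fit on train, apply to test", the sketch below uses a toy `age` column with made-up values (not the travel dataset). The names `ages_train`, `ages_test` and `scaler_demo` are assumptions introduced for this example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with made-up values (assumption, not the travel dataset).
ages_train = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
ages_test = pd.DataFrame({"age": [30.0, 60.0]})

scaler_demo = StandardScaler()
scaler_demo.fit(ages_train)  # learns mean and standard deviation from the training data only
ages_test_scaled = scaler_demo.transform(ages_test)  # reuses those statistics, nothing is re-learnt

print(scaler_demo.mean_)         # mean of the training ages
print(ages_test_scaled.ravel())  # test ages expressed relative to the training distribution
```

An age of 30 in the test set maps to 0 because 30 is the training mean. The test set's own mean (45) never enters the calculation.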

In [None]:
#Preprocess the data

In [None]:
cat_columns = ["package_Type", "destination", "country"]
numeric_columns = ["guests", "age", "stay_length"]


In [None]:
def fit_preprocessors(train_data: pd.DataFrame, categorical_columns: list[str], numeric_columns: list[str]) -> tuple[OneHotEncoder, StandardScaler]:
    # Fit the OneHotEncoder and StandardScaler to the training data
    ohe = OneHotEncoder(sparse_output=False)
    scaler = StandardScaler()
    ohe.fit(train_data[categorical_columns])
    scaler.fit(train_data[numeric_columns])
    
    # Return the fitted preprocessors
    return ohe, scaler

def transform_data(data: pd.DataFrame, categorical_columns: list[str], numeric_columns: list[str], ohe: OneHotEncoder, scaler: StandardScaler) -> np.ndarray:
    # Transform data using the already fitted preprocessors
    cat_cols = ohe.transform(data[categorical_columns])
    num_cols = scaler.transform(data[numeric_columns])
    
    # Return the transformed data
    return np.hstack((cat_cols, num_cols))


In [None]:
ohe, scaler = fit_preprocessors(X_train, cat_columns, numeric_columns)
X_train_preprocessed = transform_data(X_train, cat_columns, numeric_columns, ohe=ohe, scaler=scaler)
X_test_preprocessed = transform_data(X_test, cat_columns, numeric_columns, ohe=ohe, scaler=scaler)

In [None]:
#Fit a linear regression model and predict both the training and test set

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train_preprocessed, y_train)
predictions_lin_reg = lin_reg.predict(X_train_preprocessed)
predictions_test_lin_reg = lin_reg.predict(X_test_preprocessed)

##### ❓ Plot both the predictions of the training and the test set compared to the actual labels. 
You'll see that you basically have to create the same plot multiple times.
Tip: create a function that can help you avoid repetition.

In [None]:
#Your function here

In [None]:
def plot_results(y_true: pd.Series, predictions: np.ndarray, title: str) -> None:
    
    fig = px.scatter(x=predictions, y=y_true, labels={"x": "predicted", "y": "actual"}, title=title)
    fig.add_shape(type="line",
                x0=-1000, 
                y0=-1000, 
                x1=7000, 
                y1=7000)
    
    fig.show()

In [None]:
#Code to plot the comparison on the training set

In [None]:
plot_results(y_train, predictions_lin_reg, title="predicted versus actual on the training set for linear regression")

In [None]:
#Code to plot the comparison on the test set

In [None]:
plot_results(y_test, predictions_test_lin_reg, title="predicted versus actual on the test set for linear regression")

##### ❓ Repeat the process for the decision tree regressor.

In [25]:
decision_tree = DecisionTreeRegressor()
decision_tree.fit(X_train_preprocessed, y_train)
predictions_dt = decision_tree.predict(X_train_preprocessed)
predictions_test_dt = decision_tree.predict(X_test_preprocessed)

In [None]:
plot_results(y_train, predictions_dt, title="predicted versus actual on the training set for a decision tree")

In [None]:
plot_results(y_test, predictions_test_dt, title="predicted versus actual on the test set for a decision tree")

##### ❓ Which of the two models is overfitting? Can you describe how and why?

The decision tree is overfitting. It's more or less analogous to the first image: the error is exactly 0 for many values in the training set, which is already a danger sign. Upon closer inspection of the test set we can see that the error there is far larger. We can say that the model has failed to learn a general pattern. The pattern it learnt describes only the training set well and nothing else.

##### ❓ Which of the two models is underfitting? Can you describe how and why?

The linear regression is underfitting. The model is too simplistic: it is unable to account for certain important patterns and combinations of variables. The error it makes is large on both the training and the test set. We can do better than this.

#### Using pipelines to avoid mistakes

In machine learning, the preprocessing step is critical to prepare the data for modeling. However, there's a subtle but crucial methodological error in the function shown below. The issue lies in the lines where `OneHotEncoder` and `StandardScaler` are fitted:

```python
def preprocess_data(data: pd.DataFrame, categorical_columns: list[str], numeric_columns: list[str]) -> np.ndarray:
    # Anti-pattern: the preprocessors are re-fitted on whatever data is passed in
    ohe = OneHotEncoder(sparse_output=False)
    ohe.fit(data[categorical_columns])  # The error occurs here!
    cat_cols = ohe.transform(data[categorical_columns])

    scaler = StandardScaler()
    num_cols = scaler.fit_transform(data[numeric_columns])  # And here!

    return np.hstack((cat_cols, num_cols))
```

The key mistake in the preprocessing function provided lies in fitting the OneHotEncoder and StandardScaler to the data inside the function. This could lead to a situation where, when preprocessing the test set, the encoders and scalers are fitted again separately with the test set statistics, which is incorrect. The preprocessing steps that "learn" from the data, such as encoding categorical variables and scaling numerical variables, should be based solely on the training data to avoid introducing bias from the test set into the model.

The proper methodology is to fit the OneHotEncoder and StandardScaler on the training data only. This way, they "learn" the categories of the categorical variables and the distribution (mean and standard deviation) of the numerical variables from the training set. Once fitted, these preprocessors should then be applied to the test data, ensuring that the transformation applied is consistent and does not give the model any information about the test set.

One potential way to solve it is as follows:

```python
def fit_preprocessors(train_data: pd.DataFrame, categorical_columns: list[str], numeric_columns: list[str]) -> tuple[OneHotEncoder, StandardScaler]:
    # Fit the OneHotEncoder and StandardScaler to the training data
    ohe = OneHotEncoder(sparse_output=False)
    scaler = StandardScaler()
    ohe.fit(train_data[categorical_columns])
    scaler.fit(train_data[numeric_columns])
    
    # Return the fitted preprocessors
    return ohe, scaler

def transform_data(data: pd.DataFrame, categorical_columns: list[str], numeric_columns: list[str], ohe: OneHotEncoder, scaler: StandardScaler) -> np.ndarray:
    # Transform data using the already fitted preprocessors
    cat_cols = ohe.transform(data[categorical_columns])
    num_cols = scaler.transform(data[numeric_columns])
    
    # Return the transformed data
    return np.hstack((cat_cols, num_cols))

```

In scikit-learn, the Pipeline and ColumnTransformer classes offer an idiomatic and streamlined way to chain multiple preprocessing steps and a model into a single workflow. The `Pipeline` class allows you to assemble sequences of transformations and a final model, which simplifies your code and helps prevent common mistakes, such as fitting preprocessing steps to the test data. The `ColumnTransformer` is particularly useful for applying different preprocessing to different columns, such as one-hot encoding for categorical variables and scaling for numerical variables. By combining these tools, you not only reduce the need for manual 'glue' code but also safeguard against the leakage of information from the test set into the training process. We highly recommend you always use this in the scope of this course.



The syntax is quite simple:

*  `make_column_transformer` expects one or more tuples. The transformer (preprocessing step) is in the first position of the tuple, and the columns you apply the transformer to (type: `list[str]`) are in the second position.
*  `make_pipeline` similarly expects all objects in sequence, so typically you add your preprocessing first, followed by the model you want to apply.

In [30]:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [31]:
preprocessing = make_column_transformer(
    (StandardScaler(), numeric_columns),
    (OneHotEncoder(sparse_output=False), cat_columns),
    remainder="drop"
)

In [None]:
preprocessing.fit_transform(X_train)

In [33]:
lin_reg_pipe = make_pipeline(preprocessing, LinearRegression())

In [34]:
lin_reg_pipe.fit(X_train, y_train)
predictions_lin_reg_train = lin_reg_pipe.predict(X_train)
predictions_lin_reg_test = lin_reg_pipe.predict(X_test)

That's all!

Take a moment to convince yourself that this is a lot simpler than what we were doing previously. We simply use `make_column_transformer` to indicate which preprocessing we want to apply to which columns. Afterwards we put our preprocessing in a pipeline together with the model we want to use.

❗ When you call `lin_reg_pipe.fit(X_train, y_train)`, both the preprocessing and the model are fitted in one go, using the training data only. Calling `predict` afterwards applies the already fitted preprocessing before the model makes its predictions.
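To see this behaviour in isolation, the sketch below builds a tiny pipeline on toy data (made-up values, with `X_toy`, `y_toy` and `toy_pipe` as names assumed for this example) and then inspects the scaler that lives inside the fitted pipeline.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with made-up values: cost is exactly 100 per night.
X_toy = pd.DataFrame({"stay_length": [2.0, 4.0, 6.0, 8.0]})
y_toy = pd.Series([200.0, 400.0, 600.0, 800.0])

toy_pipe = make_pipeline(StandardScaler(), LinearRegression())
toy_pipe.fit(X_toy, y_toy)  # fits the scaler first, then the regression on the scaled data

# make_pipeline names each step after its lowercased class name.
fitted_scaler = toy_pipe.named_steps["standardscaler"]
print(fitted_scaler.mean_)  # statistics learnt from X_toy during fit

# predict() applies the already fitted scaler before calling the model.
toy_prediction = toy_pipe.predict(pd.DataFrame({"stay_length": [5.0]}))
print(toy_prediction)
```

Because the scaler is stored inside the pipeline, there is no separate object to remember to fit or to accidentally re-fit on the test set.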

Pipelines make it easy to make many different models in one go. 

##### ❓ Create a function that can train 4 models in one go: Linear Regression, Decision Tree, Random Forest and HistGradient Boosting.
Use the components you already created above and the `make_pipeline` function.
Store your results in a list called `results`. Create a tuple for each of the techniques, containing the name of the technique, the predictions on the training set and the predictions on the test set.

In [35]:
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor

In [36]:
decision_tree_pipe = make_pipeline(preprocessing, DecisionTreeRegressor())
rf_pipe = make_pipeline(preprocessing, RandomForestRegressor())
hgb_pipe = make_pipeline(preprocessing, HistGradientBoostingRegressor())

You can even do the following if you want:

```python

model_name_pair = [
    ("random_forest", RandomForestRegressor()),
    ("gradient boosting", HistGradientBoostingRegressor()),
    ("decision tree", DecisionTreeRegressor()),
]
results = []
for name, model in model_name_pair:
    pipe = make_pipeline(preprocessing, model)
    pipe.fit(X_train, y_train)
    predictions_train = pipe.predict(X_train)
    predictions_test = pipe.predict(X_test)
    # Store one tuple per model: (name, training predictions, test predictions)
    results.append((name, predictions_train, predictions_test))
```

The code above would train 3 models in a single for loop. Afterwards you can use a single function to evaluate the results you have obtained from all three models.

#### Model evaluation

Up until now we've only *qualitatively* judged the quality of our models by looking at our predicted versus actual plots. We're interested in having a single number that summarizes the performance of our model, because inspecting graphs doesn't scale well when you have many models to compare.

##### Attempt 0: taking the mean of the error

Our first intuition might be to take the difference of the predictions and the actual values. This is called the **error** or the **residual**. After we have this value we may be tempted to take the mean. 

In [None]:
error = predictions_lin_reg_test - y_test
error

In [None]:
np.mean(error)

In [None]:
px.histogram(error, title="distribution of the errors of linear regression on the test set.")

##### ❓ Why is the mean of residuals misleading?


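When we take the mean, the negative errors and the positive errors cancel each other out. A model that is badly wrong in both directions can therefore still have a mean error close to 0. This is definitely not what we want.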
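A tiny worked example with made-up residuals (the name `toy_error` is an assumption for this illustration) shows the cancellation: every prediction is off by 100, yet the mean error suggests a perfect model.

```python
import numpy as np

# Made-up residuals: every prediction is off by 100, half too high and half too low.
toy_error = np.array([100.0, -100.0, 100.0, -100.0])

toy_mean_error = np.mean(toy_error)  # positive and negative errors cancel out
print(toy_mean_error)  # 0.0
```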

##### ❓ How can we better quantify errors?

We can start by taking the absolute value of the errors and then taking the mean.

##### Attempt 1: taking the mean of the absolute error (MAE)

In [None]:
np.mean(np.abs(error))

In [None]:
px.histogram(np.abs(error), title="distribution of the absolute errors of linear regression on the test set.")

This approach is more informative as it provides a clearer picture of how much error is present on average in our predictions.

By considering the MAE, we obtain a useful summary statistic for model performance that is easy to understand and directly interpretable in terms of the problem at hand.

In practice squaring the errors is more common.

##### Attempt 2: Taking the mean of the squared errors (MSE)

Squaring the errors ensures that all error values are positive, as the following equation shows: $(-2)^2 = 4 = 2^2$. This method has two main advantages:

1. Large errors are amplified more than smaller ones, which can be particularly important in cases where larger deviations are less tolerable.
2. MSE is a differentiable function, which makes it mathematically convenient for optimization algorithms used in model training. There are other statistical properties that the MSE can leverage.

Note: point 2 is only for your information. You don't need to know this at all. 
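The amplification in point 1 can be illustrated with two made-up sets of errors that have the same MAE (the names `even_errors` and `uneven_errors` are assumptions for this example). Only the MSE distinguishes the set that contains one large miss.

```python
import numpy as np

# Two made-up sets of errors with the same total absolute error.
even_errors = np.array([5.0, 5.0, 5.0, 5.0])     # four medium errors
uneven_errors = np.array([0.0, 0.0, 0.0, 20.0])  # one large error

mae_even = np.mean(np.abs(even_errors))
mae_uneven = np.mean(np.abs(uneven_errors))
mse_even = np.mean(np.square(even_errors))
mse_uneven = np.mean(np.square(uneven_errors))

print(f"even:   MAE = {mae_even}, MSE = {mse_even}")      # MAE = 5.0, MSE = 25.0
print(f"uneven: MAE = {mae_uneven}, MSE = {mse_uneven}")  # MAE = 5.0, MSE = 100.0
```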


The MSE can be calculated as follows:

In [None]:
np.mean(np.square(error))

Since MSE can result in large numbers that are difficult to interpret, we often take the square root to obtain the Root Mean Squared Error (RMSE), which has the same units as the original values:

*Note: the square root of $x$ is written as $\sqrt{x}$.*

In [43]:
rmse = np.sqrt(np.mean(np.square(error)))

In [None]:
px.histogram(np.square(error), title="distribution of the square errors of linear regression on the test set.")

##### ❓ What are downsides of the MSE? There are two, but one is harder to come up with.

* Easier: The MSE can be more difficult to interpret since it's not in the same units as the original data. Taking the square root to obtain the RMSE helps mitigate this issue.
* Harder: MSE is sensitive to outliers because errors are squared, so outliers have a disproportionately large impact on the total error.

##### Summary

Our advice:

* Compute both the MAE and the RMSE. 
* The one you should focus on the most is typically the RMSE. 
* If there are error outliers then the difference between the MAE and the RMSE is likely going to be large. In that case it is typically more interesting to look at the MAE. 
 
Typically, RMSE is preferred because it is more sensitive to large errors, which can be critical in applications where such errors are especially problematic, like in autonomous vehicle guidance systems. However, if your model is prone to outliers, or if large errors are less impactful, MAE can be a more relevant metric. Always consider the specific context of your application when choosing your primary evaluation metric.

##### Model evaluation using scikit-learn


scikit-learn offers all of these functions out of the box. All you need to do is remember their names.
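As a quick sanity check, the sketch below compares the scikit-learn functions with the NumPy calculations from the previous sections on made-up values (`y_true_toy` and `y_pred_toy` are assumptions for this example). The RMSE is obtained by taking the square root of the MSE with NumPy, a choice made here because it does not depend on which scikit-learn version is installed.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual values and predictions.
y_true_toy = np.array([100.0, 200.0, 300.0])
y_pred_toy = np.array([110.0, 190.0, 330.0])
toy_residuals = y_pred_toy - y_true_toy

mae_toy = mean_absolute_error(y_true_toy, y_pred_toy)
mse_toy = mean_squared_error(y_true_toy, y_pred_toy)
rmse_toy = np.sqrt(mse_toy)  # back in the same units as the target

# The library functions agree with the manual NumPy calculations.
print(np.isclose(mae_toy, np.mean(np.abs(toy_residuals))))
print(np.isclose(mse_toy, np.mean(np.square(toy_residuals))))
```

Note that both functions take the actual values first and the predictions second.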

In [45]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

##### ❓ Use the pipeline approach discussed above and the mean_absolute_error and mean_squared_error for the following models: RandomForestRegressor, HistGradientBoostingRegressor, DecisionTreeRegressor and LinearRegression. You can build on the function that you created above that returned results for each of the regressors.

In [None]:
model_name_pair = [
    ("random_forest", RandomForestRegressor(n_jobs=-1)),
    ("gradient boosting", HistGradientBoostingRegressor()),
    ("decision tree", DecisionTreeRegressor()),
    ("linear regression", LinearRegression()),
]
results = []
for name, model in model_name_pair:
    print(f"STARTING {name}")
    pipe = make_pipeline(preprocessing, model)
    pipe.fit(X_train, y_train)
    predictions_train = pipe.predict(X_train)
    predictions_test = pipe.predict(X_test)
    results.append((name, predictions_train, predictions_test))
    # RMSE as the square root of the MSE; the `squared=False` argument is not available in recent scikit-learn versions
    rmse_train = np.sqrt(mean_squared_error(y_train, predictions_train))
    rmse_test = np.sqrt(mean_squared_error(y_test, predictions_test))
    print("-" * 20)
    print(f"The MAE on the training set is {mean_absolute_error(y_train, predictions_train)} and on the test set it is {mean_absolute_error(y_test, predictions_test)}")
    print(f"The RMSE on the training set is {rmse_train} and on the test set it is {rmse_test}\n")

##### ❓ Comment on the behavior of the models. Are they overfitting? Underfitting?

It seems like the decision tree and random forest are both overfitting. There is a large gap between the performance on the test set and the training set.
Linear regression is underfitting: the performance on the test and training set is similarly bad.
Gradient boosting is neither underfitting nor overfitting. The difference between test and train is minimal and, on top of that, its MAE and RMSE are better than those of all the alternatives we have tried.
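Reading these gaps from printed sentences gets tedious as the number of models grows. One possible way to organise the comparison, sketched here on synthetic data from `make_regression` rather than the travel dataset, is to collect the metrics in a DataFrame and add a column for the train/test gap. The helper `summarize` and all variable names are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def summarize(name, model, X_fit, X_eval, y_fit, y_eval) -> dict:
    """Fit the model and return its train and test metrics as one table row."""
    model.fit(X_fit, y_fit)
    pred_fit, pred_eval = model.predict(X_fit), model.predict(X_eval)
    return {
        "model": name,
        "mae_train": mean_absolute_error(y_fit, pred_fit),
        "mae_test": mean_absolute_error(y_eval, pred_eval),
        "rmse_train": np.sqrt(mean_squared_error(y_fit, pred_fit)),
        "rmse_test": np.sqrt(mean_squared_error(y_eval, pred_eval)),
    }

# Synthetic stand-in for the travel data (assumption).
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, train_size=0.8, random_state=42
)

candidate_models = [
    ("linear regression", LinearRegression()),
    ("decision tree", DecisionTreeRegressor(random_state=0)),
]
summary = pd.DataFrame(
    [summarize(n, m, X_syn_train, X_syn_test, y_syn_train, y_syn_test) for n, m in candidate_models]
)
# A large positive gap between test and training error hints at overfitting.
summary["rmse_gap"] = summary["rmse_test"] - summary["rmse_train"]
print(summary)
```

With the real data you would pass the pipelines built earlier instead of bare models. Sorting the table by `rmse_test` then gives a quick ranking, while `rmse_gap` flags the models that memorise the training set.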