---
title: "Bank of England (BoE) Interest Rate Predictor"
author: 'Yuyao Bai'
format: html
self-contained: true
jupyter: python3
engine: jupyter
editor:
  render-on-save: true
  preview: true
---


### Set-up

In [423]:
# libraries for dataframe and data manipulation
import pandas as pd
import numpy as np
import re
# libraries for plotting
from lets_plot import *
LetsPlot.setup_html()
from lets_plot.plot import gggrid
import matplotlib.pyplot as plt
#libraries for table formatting
from pytablewriter import MarkdownTableWriter
# libraries for data exploration (e.g missing values)
import missingno as msno
import sweetviz as sv
#libraries for pre-processing the data 
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures, MinMaxScaler, PowerTransformer
#libraries for models
from xgboost import XGBClassifier, XGBRegressor
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LinearRegression, Lasso, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC
import miceforest as mf
#libraries for train/test split
from sklearn.model_selection import TimeSeriesSplit, train_test_split
#libraries for hyperparameter tuning
from sklearn.model_selection import GridSearchCV, cross_val_score
# libraries for metric definition (model evaluation)
from sklearn.metrics import mean_absolute_percentage_error, balanced_accuracy_score, confusion_matrix, roc_auc_score, f1_score, precision_score, recall_score, average_precision_score, make_scorer, classification_report, fbeta_score, precision_recall_curve
# libraries for dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold

import statsmodels.api as sm

### 1.1 Data Processing  
#### 1.1.1 BoE Dataset

In [449]:
df = pd.read_csv('data/BoE_interest_rates.csv')
df

Unnamed: 0,Date,Rate,rate_change
0,1997-05-06,6.25,1
1,1997-06-06,6.50,1
2,1997-07-10,6.75,1
3,1997-08-07,7.00,1
4,1997-11-06,7.25,1
...,...,...,...
63,2023-06-22,5.00,1
64,2023-08-03,5.25,1
65,2024-08-01,5.00,-1
66,2024-11-07,4.75,-1


In [450]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         68 non-null     object 
 1   Rate         68 non-null     float64
 2   rate_change  68 non-null     int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ KB


In [451]:
# make sure observations are ordered in the correct order of date 
df = df.sort_values(by='Date').reset_index(drop=True)

To see how rate setting events are spaced in time, we calculate the number of days between consecutive events. The output below shows that the events are irregularly spaced, with gaps ranging from 8 days to 2709 days. Because every row shown has a `rate_change` of 1 or -1, the dataset appears to record only the meetings at which the rate actually changed, so this tells us that **rate changes do not occur at fixed intervals**: long periods in which the rate is held alternate with runs of closely spaced moves.

In [452]:
# compute the gap between consecutive rate setting events and count how often each gap occurs
df['Date'] = pd.to_datetime(df['Date'])
df['Date_diff'] = df['Date'].diff()
df['Date_diff'].value_counts()

Date_diff
28 days      8
49 days      7
42 days      6
35 days      5
63 days      5
364 days     4
91 days      4
56 days      3
154 days     2
455 days     2
119 days     2
98 days      2
273 days     1
2709 days    1
29 days      1
181 days     1
8 days       1
637 days     1
587 days     1
31 days      1
16 days      1
34 days      1
84 days      1
70 days      1
57 days      1
90 days      1
126 days     1
210 days     1
47 days      1
Name: count, dtype: int64
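
The frequency table above is long, so a numeric summary of the gaps can be easier to read. The sketch below is illustrative only: it uses the first five event dates shown earlier rather than the full dataset, and the names `sample` and `gap_days` are hypothetical.

```python
import pandas as pd

# First five rate-change dates shown in the table above (illustrative subset only).
sample = pd.DataFrame({
    "Date": pd.to_datetime(
        ["1997-05-06", "1997-06-06", "1997-07-10", "1997-08-07", "1997-11-06"]
    )
})

# Gap between consecutive events, expressed in whole days (first row has no predecessor).
sample["gap_days"] = sample["Date"].diff().dt.days

# Summary statistics condense the gaps into a few numbers.
gap_summary = sample["gap_days"].describe()
print(gap_summary[["min", "50%", "max"]])
```

Applied to the full `Date_diff` column, the same `describe()` call would report the 8-day minimum and 2709-day maximum directly.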

#### 1.1.2 Economic Indicators Dataset

In [453]:
indicators_df = pd.read_csv('data/economic_indicators_interest_rate_setting.csv')

In [454]:
indicators_df.head()

Unnamed: 0,Date,CCI,"Unemployment rate (aged 16 and over, seasonally adjusted): %",10-year-gilt-yield,CPIH MONTHLY RATE 00: ALL ITEMS 2015=100,Gross Value Added - Monthly (Index 1dp) :CVM SA,"Monthly average Spot exchange rate, Sterling into US$ [a] XUMAGBD","Monthly average Spot exchange rates, Sterling into Euro [a] XUMASER"
0,1997-01-01,102.2504,7.5,7.5552,-0.3,62.0,0.6031,0.7376
1,1997-02-01,102.5327,7.3,7.1962,0.2,62.5,0.6156,0.7192
2,1997-03-01,102.6905,7.2,7.4544,0.2,62.6,0.6226,0.7172
3,1997-04-01,102.79,7.2,7.638,0.4,63.3,0.6137,0.7022
4,1997-05-01,102.9294,7.2,7.1681,0.4,62.7,0.6122,0.7034


In [455]:
indicators_df.columns

Index(['Date', 'CCI',
       'Unemployment rate (aged 16 and over, seasonally adjusted): %',
       '10-year-gilt-yield', 'CPIH MONTHLY RATE 00: ALL ITEMS 2015=100',
       'Gross Value Added - Monthly (Index 1dp) :CVM SA',
       'Monthly average Spot exchange rate, Sterling into US$              [a]             XUMAGBD',
       'Monthly average Spot exchange rates, Sterling into Euro              [a]             XUMASER'],
      dtype='object')

We will rename the columns above to make them more concise. Additionally, similar to the BoE dataset, we convert the date columns to datetime format and ensure that the data is correctly sorted by date.

In [456]:
# rename the columns
column_names = {
    'Date': 'date',
    'CCI': 'cci',
    'Unemployment rate (aged 16 and over, seasonally adjusted): %': 'unemployment_rate',
    '10-year-gilt-yield': 'gilt_yield',
    'CPIH MONTHLY RATE 00: ALL ITEMS 2015=100': 'cpih',
    'Gross Value Added - Monthly (Index 1dp) :CVM SA': 'gva', 
    'Monthly average Spot exchange rate, Sterling into US$              [a]             XUMAGBD': 'er_gbp_usd',
    'Monthly average Spot exchange rates, Sterling into Euro              [a]             XUMASER': 'er_gbp_eur'
}

indicators_df = indicators_df.rename(columns=column_names)

# convert the date column to datetime
indicators_df['date'] = pd.to_datetime(indicators_df['date'])

# make sure observations are ordered in the correct order of date
indicators_df = indicators_df.sort_values(by='date').reset_index(drop=True)

In [457]:
indicators_df.head()

Unnamed: 0,date,cci,unemployment_rate,gilt_yield,cpih,gva,er_gbp_usd,er_gbp_eur
0,1997-01-01,102.2504,7.5,7.5552,-0.3,62.0,0.6031,0.7376
1,1997-02-01,102.5327,7.3,7.1962,0.2,62.5,0.6156,0.7192
2,1997-03-01,102.6905,7.2,7.4544,0.2,62.6,0.6226,0.7172
3,1997-04-01,102.79,7.2,7.638,0.4,63.3,0.6137,0.7022
4,1997-05-01,102.9294,7.2,7.1681,0.4,62.7,0.6122,0.7034


#### 1.1.3 Merging Datasets
Now, we want to assign to each row of ```df_baseline``` the values of the economic indicators over the last quarter, i.e. the average of each indicator over the three months leading up to the date of the rate setting event. For example, if the rate setting event is on 06/05/1997, we average the GVA (our monthly GDP proxy) figures for March 1997, April 1997, and May 1997, and do the same separately for the other indicators, i.e. exchange rates, 10-year gilt yield, unemployment rate, CPIH, and CCI. A function ```get_last_quarter_values``` is defined to achieve this.  
  
By taking the aggregate of the economic indicators for the last quarter, we aim to capture the economic conditions leading up to the rate setting event and assess how these conditions influence the Bank of England's monetary policy decisions. 
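
To make the window concrete, here is a small worked example for the event on 06/05/1997. It is a sketch that copies the five CCI readings from `indicators_df.head()` above; the names `toy_indicators`, `event_date` and `window_start` are hypothetical and used only for illustration.

```python
import pandas as pd

# First five monthly CCI readings, copied from indicators_df.head() above.
toy_indicators = pd.DataFrame({
    "date": pd.to_datetime(
        ["1997-01-01", "1997-02-01", "1997-03-01", "1997-04-01", "1997-05-01"]
    ),
    "cci": [102.2504, 102.5327, 102.6905, 102.79, 102.9294],
})

event_date = pd.Timestamp("1997-05-06")
window_start = event_date - pd.DateOffset(months=3)  # 1997-02-06

# Keep rows stamped on or after window_start and strictly before the event date.
in_window = toy_indicators[
    (toy_indicators["date"] >= window_start) & (toy_indicators["date"] < event_date)
]

months_used = in_window["date"].dt.strftime("%Y-%m").tolist()
cci_last_quarter = in_window["cci"].mean()
print(months_used, round(cci_last_quarter, 4))
```

The window picks up the rows stamped 1 March, 1 April and 1 May 1997, and their mean matches the `cci` value in the first row of the merged table.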

In [458]:
indicators_df_baseline = indicators_df.copy()
df_baseline = df.copy()

indicators = ['cci', 'unemployment_rate', 'gilt_yield', 'cpih', 'gva', 'er_gbp_usd', 'er_gbp_eur']

def get_last_quarter_values(date, indicators_df_baseline, indicators):
    # get the last quarter values
    last_quarter = date - pd.DateOffset(months=3)
    last_quarter_values = indicators_df_baseline.query('date < @date & date >= @last_quarter')[indicators].mean()
    
    return last_quarter_values

df_baseline[indicators] = df_baseline['Date'].apply(lambda x: get_last_quarter_values(x, indicators_df_baseline, indicators))

df_baseline

Unnamed: 0,Date,Rate,rate_change,Date_diff,cci,unemployment_rate,gilt_yield,cpih,gva,er_gbp_usd,er_gbp_eur
0,1997-05-06,6.25,1,NaT,102.803300,7.200000,7.420167,0.333333,62.866667,0.616167,0.707600
1,1997-06-06,6.50,1,31 days,102.885733,7.233333,7.315967,0.333333,63.000000,0.611333,0.698467
2,1997-07-10,6.75,1,34 days,102.910667,7.200000,7.122533,0.133333,63.033333,0.606367,0.683067
3,1997-08-07,7.00,1,28 days,102.891300,7.066667,7.096267,0.133333,63.266667,0.610233,0.670233
4,1997-11-06,7.25,1,91 days,102.958967,6.600000,6.652167,0.133333,63.933333,0.609667,0.679933
...,...,...,...,...,...,...,...,...,...,...,...
63,2023-06-22,5.00,1,42 days,97.089057,4.166667,3.992767,0.666667,100.466667,0.798900,0.869900
64,2023-08-03,5.25,1,42 days,97.734747,4.233333,4.444300,0.100000,100.533333,0.785067,0.858533
65,2024-08-01,5.00,-1,364 days,99.977790,4.166667,4.175300,0.200000,101.300000,0.785267,0.848567
66,2024-11-07,4.75,-1,98 days,99.141300,4.366667,4.173667,0.300000,101.200000,0.769200,0.836300


In [459]:
df_baseline.isnull().sum()

Date                 0
Rate                 0
rate_change          0
Date_diff            1
cci                  1
unemployment_rate    1
gilt_yield           1
cpih                 1
gva                  1
er_gbp_usd           1
er_gbp_eur           1
dtype: int64

The ```"Date_diff"``` column created in 1.1.1 is dropped as it is no longer needed. We also drop the last row of the dataset as it contains NaN values due to the lack of data for the last quarter.

In [460]:
# drop the Date_diff column
df_baseline = df_baseline.drop('Date_diff', axis=1)

# drop rows with missing indicator values (here, only the last row, for which we lack indicator data)
df_baseline = df_baseline.dropna()

Now, ```df_baseline``` does not contain any missing values and is ready for modelling in the next section.

### 1.2 Baseline logistic regression model  
In this part, we'll focus on predicting ```"rate_change"``` contained in the Bank of England interest rates dataset. A logistic regression model will be used as the baseline model to predict whether the Bank of England will increase or decrease the interest rates at the next rate setting event.   
  
We first note that the target variable, ```"rate_change"```, is defined as follows:
* 1: if the Bank of England increases the interest rates at a rate setting event  
* -1: if the Bank of England decreases the interest rates at a rate setting event   
  
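sklearn's `LogisticRegression` accepts labels in {-1, 1} as they are, but a {0, 1} encoding is the more common convention and makes class 1 (a rate increase) the unambiguous positive class for the metrics used later. We therefore do a simple encoding of the target variable by mapping -1 to 0.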

In [461]:
# encode the rate_change column
df_baseline['rate_change'] = (df_baseline['rate_change'] == 1).astype(int)


Next, we split the data into training and test sets, with the first 70% of observations forming the training set. Since we have already ensured in 1.1.1 and 1.1.2 that the data is correctly sorted by date, we can use `train_test_split` with `shuffle=False` to maintain the order of the data. This is important as we are dealing with time series data, and we want to avoid data leakage from the future into the past, i.e. we want to train the model on past data and test it on future data.
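
As a quick illustration of what `shuffle=False` does, the sketch below splits ten hypothetical, chronologically ordered observations (`X_toy`, `y_toy` are made-up stand-ins, not our data) and shows that every training row precedes every test row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for 10 chronologically ordered observations (hypothetical data).
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 1, 0, 1, 1, 0, 1, 1, 0, 1])

# shuffle=False keeps the original order: the earliest rows go to train, the latest to test.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, shuffle=False)

print("train rows:", X_tr.ravel().tolist())
print("test rows: ", X_te.ravel().tolist())
```

Note that with `shuffle=False` the `random_state` argument has no effect, since no randomness is involved in the split.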

In [462]:
# train/test split (70/30), noting that the data is temporal
X = df_baseline[indicators]
y = df_baseline['rate_change']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False, random_state=42)

Before fitting the model, we standardise the features using the `StandardScaler` so that they are all on the same scale. This matters here because sklearn's `LogisticRegression` applies L2 regularisation by default, which penalises coefficients unevenly when features are on different scales; it also makes the coefficients comparable in 1.2.2. The scaler is fit on the training set only and then applied to both the training and test sets, so as to avoid data leakage.
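
An alternative way to guarantee that the scaler only ever sees training data is to wrap it in a `Pipeline`. This is a sketch on synthetic data, not the approach used in the cells below; `X_toy`, `y_toy` and the step names `"scale"` and `"logit"` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 40 ordered observations, 3 features on very different scales.
X_toy = rng.normal(size=(40, 3)) * np.array([1.0, 50.0, 0.01])
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=40) > 0).astype(int)

# Chronological 70/30 split: first 28 rows for training, last 12 for testing.
X_tr, X_te = X_toy[:28], X_toy[28:]
y_tr, y_te = y_toy[:28], y_toy[28:]

# fit() learns the scaling statistics from the training rows only;
# predict_proba() reuses those statistics when transforming the test rows.
pipe = Pipeline([("scale", StandardScaler()), ("logit", LogisticRegression())])
pipe.fit(X_tr, y_tr)

fitted_means = pipe.named_steps["scale"].mean_
test_probas = pipe.predict_proba(X_te)[:, 1]
```

A pipeline of this kind also plugs directly into cross-validation, where the scaler is refit on each training fold.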

In [463]:
# scale the features, fitting the scaler on the training set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [464]:
logit = LogisticRegression()
# fit this instance to the training set
_ = logit.fit(X_train_scaled, y_train)

In [465]:
# generate predictions on the training and test sets
y_train_pred = logit.predict(X_train_scaled)
y_test_pred = logit.predict(X_test_scaled)

y_train_probas = logit.predict_proba(X_train_scaled)[:, 1]
y_test_probas = logit.predict_proba(X_test_scaled)[:, 1]


#### 1.2.1 Baseline Logistic Regression Model Evaluation  
Let us look at the class distribution of the overall dataset, as well as the training and test sets to decide an appropriate evaluation metric. 

In [466]:
df_baseline['rate_change'].value_counts(normalize=True)

rate_change
1    0.537313
0    0.462687
Name: proportion, dtype: float64

In [467]:
print(f"Train Class Distribution: {y_train.value_counts(normalize=True)}")
print(f"Test Class Distribution: {y_test.value_counts(normalize=True)}")

Train Class Distribution: rate_change
0    0.565217
1    0.434783
Name: proportion, dtype: float64
Test Class Distribution: rate_change
1    0.761905
0    0.238095
Name: proportion, dtype: float64


While the overall dataset is relatively balanced, when we look at the distribution of the target variable in the training and test sets, we see that there is **moderate class imbalance in the test set**, with more instances of interest rate increases (76.2%) than decreases (23.8%). Hence, we need to take this into account when choosing a metric to evaluate the model's performance.  
  
> As an aside, we note that the class imbalance in the test set is nothing surprising when we consider the real-world context of the Bank of England's interest rate decisions. The high proportion of rate hikes reflects the BoE's sustained monetary tightening strategy from late 2021 through 2023, as they aggressively raised rates to combat inflation. This hiking cycle, the most rapid in decades, was driven by post-pandemic supply chain disruptions, labour shortages, and energy price shocks from the Ukraine war, forcing the MPC to take a hawkish stance to curb inflationary pressures despite recession risks. As a result, our test set is skewed towards rate hikes.

Given the class imbalance, I choose to use the **Area Under the Precision-Recall Curve (AUC-PR)** as the evaluation metric, computed with sklearn's `average_precision_score`. AUC-PR is well-suited for imbalanced datasets because it focuses on the model's ability to correctly identify interest rate increases (class 1) while balancing precision (the proportion of predicted interest rate increases that are actual increases) and recall (the proportion of actual interest rate increases that the model predicts correctly).  
  
AUC-PR is also more informative than the F1-score. The F1-score is computed at a single classification threshold (0.5 by default), whereas the AUC-PR considers all possible classification thresholds, providing a better overall picture of how well the model distinguishes between rate increases and decreases.
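
The difference between the two metrics can be seen on a tiny example. The sketch below uses four made-up observations (`y_true`, `y_score` are hypothetical); it also computes the positive-class share, which is roughly the score a classifier with no skill would obtain and is therefore a useful reference level.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, precision_recall_curve

# Four made-up observations with predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUC-PR (average precision) summarises precision over all thresholds.
ap = average_precision_score(y_true, y_score)

# F1 depends on one chosen threshold, here 0.5.
f1_at_half = f1_score(y_true, (y_score >= 0.5).astype(int))

# Reference level: the share of positives in the data.
no_skill = y_true.mean()

# The full curve is available if we want to inspect individual thresholds.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

print(f"AUC-PR: {ap:.3f}, F1 at 0.5: {f1_at_half:.3f}, no-skill level: {no_skill:.3f}")
```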

In [468]:
train_auc_pr = average_precision_score(y_train, y_train_probas)
test_auc_pr = average_precision_score(y_test, y_test_probas)

print(f'Train AUC-PR: {train_auc_pr:.3f}')
print(f'Test AUC-PR: {test_auc_pr:.3f}')

Train AUC-PR: 0.972
Test AUC-PR: 0.734


**Interpretation:**  
The model achieves an AUC-PR of 0.972 on the training set, indicating that it separates rate increases from decreases very well on the data it was fitted to. The AUC-PR on the test set drops to 0.734, a sign of overfitting. This test value also needs context: a classifier with no skill scores an AUC-PR of roughly the positive-class share, which is 0.762 in the test set, so 0.734 is no better than that reference level. In addition, we have not yet considered publication lags in the economic indicators data (detailed explanation in section 1.3), so even the training performance should be interpreted with caution. Hence, we cannot conclude that the baseline model performs well at predicting the Bank of England's interest rate decisions using the economic indicators.  
  
Let's also plot the confusion matrices for the training and test sets to easily visualise how the model performs in terms of true positives, false positives, true negatives, and false negatives. In the context of this problem:
* True Positive (TP): The model correctly predicts an interest rate increase
* True Negative (TN): The model correctly predicts an interest rate decrease
* False Positive (FP): The model incorrectly predicts an interest rate increase when it is actually a decrease
* False Negative (FN): The model incorrectly predicts an interest rate decrease when it is actually an increase
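
For reference, sklearn's `confusion_matrix` places actual classes on the rows and predicted classes on the columns, so flattening a binary matrix yields the four counts in the order TN, FP, FN, TP. A minimal sketch with made-up labels (`y_true_toy`, `y_pred_toy` are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = rate decrease, 1 = rate increase.
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 0, 1, 1]

# Rows are actual classes, columns are predicted classes (in the order given by labels).
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```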

In [469]:
def plot_confusion_matrix(y_true, y_pred, title="Confusion Matrix"):
    """
    Generates and plots a confusion matrix using Lets-Plot.

    Parameters:
    - y_true: Actual labels
    - y_pred: Predicted labels
    - title: Title for the plot
    """
    # Compute confusion matrix
    conf_matrix = confusion_matrix(y_true, y_pred)
    
    # Extract TP, FP, FN, TN
    TN, FP, FN, TP = conf_matrix.ravel()

    # Convert confusion matrix to a DataFrame
    conf_matrix_df = pd.DataFrame(conf_matrix, 
                                  columns=["Predicted 0", "Predicted 1"], 
                                  index=["Actual 0", "Actual 1"])

    # Melt the confusion matrix DataFrame to long format
    conf_matrix_long = conf_matrix_df.reset_index().melt(id_vars="index", value_vars=["Predicted 0", "Predicted 1"])
    conf_matrix_long.columns = ["Actual", "Predicted", "Count"]

    # Define mapping for labels
    label_map = {
        ("Actual 0", "Predicted 0"): "TN",
        ("Actual 0", "Predicted 1"): "FP",
        ("Actual 1", "Predicted 0"): "FN",
        ("Actual 1", "Predicted 1"): "TP",
    }

    # Add annotations for TP, FP, FN, TN
    conf_matrix_long['Annotation'] = conf_matrix_long.apply(
        lambda row: f"{label_map[(row['Actual'], row['Predicted'])]}: {row['Count']}", axis=1
    )

    # Create confusion matrix plot with Lets-Plot
    plot = ggplot(conf_matrix_long, aes(x='Predicted', y='Actual', fill='Count')) + \
        geom_tile() + \
        geom_text(aes(label='Annotation'), size=10, color='black', vjust=0.5, hjust=0.5) + \
        scale_fill_gradient(low='white', high='#FF7F50') + \
        ggtitle(title) + \
        xlab('Predicted') + \
        ylab('Actual') + \
        coord_fixed(ratio=1) + \
        theme_minimal() + \
        theme(legend_position='right')

    return plot


conf_matrix_plot_train = plot_confusion_matrix(y_train, y_train_pred, title="Confusion Matrix (Train)")
conf_matrix_plot_test = plot_confusion_matrix(y_test, y_test_pred, title="Confusion Matrix (Test)")

gggrid([conf_matrix_plot_train, conf_matrix_plot_test])

Again, we see that the baseline model classifies the training set well, with a high number of true positives and true negatives. However, the model struggles with the test set. While it correctly classifies all interest rate decreases (true negatives), it also misclassifies all interest rate increases as decreases (false negatives). Since the model never sees the test labels, the test-set imbalance cannot by itself cause these errors; a more likely explanation is that the relationship between the indicators and rate decisions learnt from the training period does not carry over to the test period, which is dominated by rate increases. Let's explore this further in 1.3.

#### 1.2.2 Interpretation of Regression Coefficients  
Here, we see the advantage of using a logistic regression model. It is interpretable, and we can extract the coefficients to understand the impact of each feature on the log-odds of the Bank of England raising interest rates. This provides a clear mathematical relationship between the predictors and the target variable, allowing us to understand the direction and strength of the relationships.

In [470]:
# Get regression coefficients and intercept from the logistic regression model
coefficients = np.round(logit.coef_[0], 3)
intercept = np.round(logit.intercept_[0], 3)

coefficients_df = pd.DataFrame({
    'Feature': ['Intercept'] + indicators,
    'Coefficient': [intercept] + coefficients.tolist()
})

coefficients_df


Unnamed: 0,Feature,Coefficient
0,Intercept,-0.762
1,cci,0.493
2,unemployment_rate,-0.185
3,gilt_yield,1.916
4,cpih,-0.473
5,gva,0.195
6,er_gbp_usd,-1.428
7,er_gbp_eur,-0.856


Since we standardised all the features, each logistic regression coefficient represents the change in the **log-odds** of the target variable (rate change) associated with a **one standard deviation increase** in the corresponding predictor variable. The logistic regression equation takes the form:     
$$\text{Log Odds} = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$  
where, in our context:
* $p$ is the probability of the target variable (rate change) being 1 (interest rate increase)
* $\beta_0$ is the intercept
* $\beta_i$ is the regression coefficient for feature $x_i$


Let us look at CCI as an example. The coefficient of ```cci``` is 0.493, which means that for every one standard deviation increase in the CCI, the log-odds of the Bank of England raising interest rates increase by 0.493, *holding all other variables constant*. This implies that a higher CCI is associated with a higher likelihood of an interest rate increase. For easier interpretation, we can also exponentiate the regression coefficient to get the **odds ratio** for a particular feature $x_i$:
$$\text{Odds Ratio}_{x_i} = e^{\beta_i}$$  
Continuing with the example of CCI, $\text{Odds Ratio}_{cci} = e^{0.493} \approx 1.637$. This means that for every **one standard deviation increase in the CCI**, the **odds of a rate hike increase by a factor of 1.637**, *holding all other variables constant*. 
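
To see what an odds ratio of this size means in probability terms, the sketch below converts the CCI coefficient into an odds ratio and applies it to a starting probability. The helper `shift_probability` and the starting probability of 0.50 are hypothetical choices for illustration.

```python
import numpy as np

beta_cci = 0.493  # coefficient of cci from the table above

# Odds ratio for a one standard deviation increase in CCI.
odds_ratio_cci = np.exp(beta_cci)

def shift_probability(p_start, log_odds_change):
    """Return the probability after adding log_odds_change to the log-odds of p_start."""
    odds = p_start / (1 - p_start)
    new_odds = odds * np.exp(log_odds_change)
    return new_odds / (1 + new_odds)

# Starting from even odds (p = 0.50, a hypothetical starting point).
p_after = shift_probability(0.50, beta_cci)
print(f"odds ratio: {odds_ratio_cci:.3f}, probability after shift: {p_after:.3f}")
```

The same odds ratio implies a different change in probability at a different starting point, which is why odds ratios, rather than probability changes, are reported per feature.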

:::{.callout-note}
Since there are multiple regressors (features) in the model, the interpretation of the coefficients is based on the assumption that the other variables are held constant. This is known as the **ceteris paribus** assumption.
:::
  
The same interpretation can be applied to the other features in the model, noting that a **negative coefficient** implies that the feature is associated with a **lower likelihood of an interest rate increase**. Each feature's odds ratio is calculated and added to `coefficients_df` below. 

Lastly, the intercept is the log-odds of a rate increase when all standardised predictors are zero, i.e. when every indicator sits at its training-set mean. An intercept of -0.762 corresponds to a probability of about 0.32 of a rate increase under these average conditions. Beyond serving as this reference point, it has limited practical meaning.

In [471]:
# compute the odds ratios
coefficients_df['Odds Ratio'] = np.exp(coefficients_df['Coefficient']).round(3)

coefficients_df

Unnamed: 0,Feature,Coefficient,Odds Ratio
0,Intercept,-0.762,0.467
1,cci,0.493,1.637
2,unemployment_rate,-0.185,0.831
3,gilt_yield,1.916,6.794
4,cpih,-0.473,0.623
5,gva,0.195,1.215
6,er_gbp_usd,-1.428,0.24
7,er_gbp_eur,-0.856,0.425


Holding the other regressors constant, the 10-year gilt yield (`"gilt_yield"`) has the highest odds ratio, 6.794: a one standard deviation increase in the 10-year gilt yield multiplies the odds of the Bank of England raising interest rates by 6.794. Its coefficient is also the largest in magnitude, which suggests that, within this model, the 10-year gilt yield is the strongest predictor of the Bank of England's interest rate decisions. 

### 1.3 Improving the model  
In 1.2.1, we noted some issues in the model specifications that could be causing overly optimistic performance metrics. In this section, I will elaborate on them and attempt to improve the baseline logistic regression model by:  
1. Applying more appropriate lags to the economic indicators to capture **publication delays**
2. Dimensionality reduction using **PCA** to address multicollinearity
3. Using a **Random Forest Classifier** to capture non-linear relationships and interactions between features
4. **Cross-validation** to ensure robustness of the model (using `TimeSeriesSplit`)
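
As a brief illustration of the last point, `TimeSeriesSplit` produces folds in which the training indices always precede the test indices, so each fold is evaluated on data that comes after the data it was trained on. The sketch below uses ten hypothetical ordered observations (`X_toy`) and an arbitrary choice of three splits.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten hypothetical observations in chronological order.
X_toy = np.arange(10).reshape(-1, 1)

# Expanding-window cross-validation; n_splits=3 is an arbitrary choice for illustration.
tscv = TimeSeriesSplit(n_splits=3)
folds = [
    (train_idx.tolist(), test_idx.tolist()) for train_idx, test_idx in tscv.split(X_toy)
]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```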

As mentioned earlier, in the baseline model, we did not consider **publication lags**. This refers to the fact that when an economic indicator is published, it typically reflects data from a past period. Based on the sources of each economic indicator in our dataset, we find that the publication delay varies between indicators, as listed below: 
* [CCI](https://www.oecd.org/en/data/indicators/consumer-confidence-index-cci.html?oecdcontrol-cf46a27224-var1=GBR&oecdcontrol-b2a0dbca4d-var3=1997-01&oecdcontrol-b2a0dbca4d-var4=2024-12), [Unemployment Rate](https://www.ons.gov.uk/employmentandlabourmarket/peoplenotinwork/unemployment/timeseries/mgsx/lms/): 3 months
* [GDP (GVA)](https://www.ons.gov.uk/economy/grossdomesticproductgdp/datasets/gdpmonthlyestimateuktimeseriesdataset), [Monthly Spot Exchange Rates](https://www.bankofengland.co.uk/boeapps/database/fromshowcolumns.asp?Travel=NIxIRxSUx&FromSeries=1&ToSeries=50&DAT=RNG&FD=1&FM=Dec&FY=1974&TD=31&TM=Jan&TY=2025&FNY=&CSVF=TT&html.x=46&html.y=35&C=1D1&C=IN3&Filter=N): 2 months
* [10-year Gilt Yield](https://fred.stlouisfed.org/series/IRLTLT01GBM156N#), [CPIH](https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/l59c/mm23): 1 month  
  
*What does this mean for our baseline model?*   
There was likely severe **data leakage** in the baseline model, where we used the economic indicators from the quarter prior to the rate setting event to predict the rate change. For example, suppose the rate setting event was on 06/05/1997. Since the unemployment rate has a 3-month publication lag, policy makers would not have had access to the unemployment rate for May 1997 at the time of the event; the latest figure available to them would have been the one for February 1997. The baseline model therefore used data that policy makers could not have seen at the time of the rate setting event to predict the rate change. This is a significant issue that will be addressed in our improved model specification.  
  
*How will we address this in our improved model?*  
The code below does the following:
1. Checks that there are no gaps in the economic indicators data, i.e. that there is data available for each month from the start of the dataset to the end.
2. Defines the publication lag of each economic indicator based on the source data.
3. Computes a 3-month rolling average for each economic indicator (the month itself and the two preceding months) to reflect the quarter-over-quarter trends policymakers would analyse. 
4. Calculates, for each rate setting date (`df_new['Date']`), a reference month by subtracting the indicator's publication lag from the rate setting date. The reference month is then converted to a monthly period for proper alignment when merging datasets. This is important because the economic indicators in our dataset are stamped at the start of each month, whereas rate setting events fall on specific dates.  
5. Merges the rolling average of each indicator at its reference month into the main dataset `df_new`, ensuring only historically available data is used for prediction. 
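
To illustrate the reference month and rolling window from the steps above, here is a small sketch for the event on 06/05/1997. It covers three of the indicators only; `example_lags` and `windows` are hypothetical names, and the lags are those listed earlier.

```python
import pandas as pd

event_date = pd.Timestamp("1997-05-06")

# Publication lags (in months) for three of the indicators, as listed above.
example_lags = {"unemployment_rate": 3, "gva": 2, "cpih": 1}

windows = {}
for name, lag in example_lags.items():
    # Reference month: the rate setting date shifted back by the publication lag.
    ref_month = (event_date - pd.DateOffset(months=lag)).to_period("M")
    # The 3-month rolling average at the reference month covers it and the two months before.
    windows[name] = [str(ref_month - k) for k in (2, 1, 0)]
    print(f"{name}: reference month {ref_month}, window {windows[name]}")
```

For the unemployment rate the window reaches back to December 1996, which is before the indicators dataset begins, so the strict `min_periods=3` setting leaves that value missing for the first event.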

In [472]:
df_new = df.copy()
indicators_df_new = indicators_df.copy()

# check that monthly data is available for all months for the economic indicators data
date_range = pd.date_range(indicators_df_new['date'].min(), indicators_df_new['date'].max(), freq='MS') 
missing_dates = date_range[~date_range.isin(indicators_df_new['date'])]
print("Missing dates:", missing_dates)


Missing dates: DatetimeIndex([], dtype='datetime64[ns]', freq='MS')


In [473]:
# define lags for each indicator based on its publication delay
indicator_lag = {
    'cci': 3,   # this means cci has a 3-month publication delay
    'unemployment_rate': 3,
    'gilt_yield': 1,  
    'cpih': 1,
    'gva': 2,
    'er_gbp_usd': 2,
    'er_gbp_eur': 2
}

# Process each indicator
for indicator, lag in indicator_lag.items():
    # Create a temporary DataFrame with 3-month averages
    indicator_avg = (
        indicators_df_new
        .sort_values('date')
        .set_index('date')[indicator]
        .rolling(window=3, min_periods=3) # min_periods=3 to enforce strict windows (NaN otherwise)
        .mean()
        .reset_index()
        .rename(columns={indicator: f'{indicator}_avg'})
    )
    
    # For each rate-setting date, compute the reference month (accounting for lag)
    df_new[f'{indicator}_ref_month'] = df_new['Date'] - pd.DateOffset(months=lag)
    
    # Convert reference month to period for alignment when merging
    df_new[f'{indicator}_ref_month'] = df_new[f'{indicator}_ref_month'].dt.to_period('M') 
    indicator_avg['month_period'] = indicator_avg['date'].dt.to_period('M')
    
    # Merge the 3-month average based on the reference month
    df_new = df_new.merge(
        indicator_avg,
        left_on=f'{indicator}_ref_month',
        right_on='month_period',
        how='left'
    ).rename(columns={f'{indicator}_avg': f'{indicator}_lagged'})
    
    # Clean up temporary columns
    df_new = df_new.drop(columns=[f'{indicator}_ref_month', 'month_period', 'date'])

In [474]:
df_new

Unnamed: 0,Date,Rate,rate_change,Date_diff,cci_lagged,unemployment_rate_lagged,gilt_yield_lagged,cpih_lagged,gva_lagged,er_gbp_usd_lagged,er_gbp_eur_lagged
0,1997-05-06,6.25,1,NaT,,,7.429533,0.266667,62.366667,0.613767,0.724667
1,1997-06-06,6.50,1,31 days,102.491200,7.333333,7.420167,0.333333,62.800000,0.617300,0.712867
2,1997-07-10,6.75,1,34 days,102.671067,7.233333,7.315967,0.333333,62.866667,0.616167,0.707600
3,1997-08-07,7.00,1,28 days,102.803300,7.200000,7.122533,0.133333,63.000000,0.611333,0.698467
4,1997-11-06,7.25,1,91 days,102.891300,7.066667,6.807267,0.266667,63.466667,0.615667,0.668767
...,...,...,...,...,...,...,...,...,...,...,...
63,2023-06-22,5.00,1,42 days,94.449537,3.966667,3.725400,0.833333,100.533333,0.818267,0.883233
64,2023-08-03,5.25,1,42 days,96.307567,4.033333,4.255067,0.166667,100.466667,0.798900,0.869900
65,2024-08-01,5.00,-1,364 days,99.229860,4.333333,4.175300,0.200000,101.266667,0.792533,0.853100
66,2024-11-07,4.75,-1,98 days,100.100567,4.200000,4.016333,0.366667,101.266667,0.769167,0.845033


Now that we have addressed the issue of publication lags, we will also do the following: 
1. Encode the binary target variable `rate_change` as 0 for a rate decrease and 1 for a rate increase.
2. Drop the `Date_diff` column as it is not needed for this analysis.
3. Drop rows with missing values in the dataset (more on this below).

In [475]:
# encode the rate_change column
df_new['rate_change'] = (df_new['rate_change'] == 1).astype(int)

# drop the Date_diff column
df_new = df_new.drop('Date_diff', axis=1)

We observe that the only rows with missing values are the first and last rows of the dataset. The first row is missing values for CCI and the unemployment rate because, with a 3-month lag, their rolling window would need data from before the start of the indicators dataset (January 1997). The last row is missing values because there is no data for 10-year gilt yields, CPIH, GVA, and monthly spot exchange rates for the quarter prior to the last rate setting event in the dataset. It also makes sense that no values are missing for the years in between, as we have previously checked that there are no gaps in the economic indicators data.  
  
Since these rows do not contain any information that can be used for prediction, we drop them from the dataset.

In [476]:
print(df_new.isnull().sum())

# drop rows with missing lag values
df_new = df_new.dropna().reset_index(drop=True)

Date                        0
Rate                        0
rate_change                 0
cci_lagged                  1
unemployment_rate_lagged    1
gilt_yield_lagged           1
cpih_lagged                 1
gva_lagged                  1
er_gbp_usd_lagged           1
er_gbp_eur_lagged           1
dtype: int64


Next, it is reasonable to suspect that the economic indicators are correlated with each other, which could lead to multicollinearity in the model. We can plot a correlation matrix to visualise the relationships between the economic indicators:

In [477]:
# correlation matrix
indicators_lagged_cols = [f'{indicator}_lagged' for indicator in indicators]
numeric_df = df_new[indicators_lagged_cols]
indicators_corr = numeric_df.corr()

# plot the correlation matrix
ggplot(pd.melt(indicators_corr.reset_index(), id_vars='index'), aes(x='index', y='variable', fill='value')) + \
    geom_tile() + \
    scale_fill_gradient(low='white', high='blue') + \
    labs(title='Correlation Matrix of Economic Indicators (Lagged)', x='Indicator', y='Indicator') + \
    theme(axis_text_x=element_text(angle=45, hjust=1)) + \
    ggsize(800, 800)

It seems that **many indicators are moderately to highly correlated with each other**. I will elaborate on two examples:
* GDP (proxied here by `"gva_lagged"`) and the unemployment rate (`"unemployment_rate_lagged"`) are strongly negatively correlated, with a correlation coefficient of -0.867.
    + This aligns with basic economic theory: when the economy is growing (GDP is rising), businesses tend to expand, leading to increased demand for labour and thus lower unemployment. 
* The monthly spot exchange rate for GBP to EUR (`"er_gbp_eur_lagged"`) and GDP are also highly correlated, with a correlation coefficient of 0.851. 
    + One plausible explanation is that strong economic performance in the UK (higher GDP) attracts foreign investment in UK assets, increasing the demand for GBP and thus appreciating its value against the EUR.  
  
To address multicollinearity, we can use **PCA to reduce the dimensionality** of the dataset while retaining as much information as possible. PCA will transform the original economic indicators into a set of linearly uncorrelated variables called principal components. These principal components are ordered by the amount of variance they explain in the data, with the first component explaining the most variance. By selecting a subset of the principal components that explain most of the variance in the data, we can reduce the dimensionality of the dataset and mitigate multicollinearity. 
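
To make the decorrelation property concrete, here is a small sketch on synthetic data (three made-up features driven by one common factor, not the actual indicators). It checks that the principal components have essentially zero covariance with each other and shows how a fractional `n_components` selects the number of components to keep:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic, strongly correlated features: one common factor plus noise
common = rng.normal(size=n)
X_toy = np.column_stack([
    common + 0.3 * rng.normal(size=n),
    -common + 0.3 * rng.normal(size=n),
    0.8 * common + 0.5 * rng.normal(size=n),
])
feature_corr = np.corrcoef(X_toy, rowvar=False)

# Standardise first, since PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X_toy)

# Full PCA: the transformed columns are mutually uncorrelated
Z = PCA().fit_transform(X_scaled)
comp_cov = np.cov(Z, rowvar=False)
off_diag = comp_cov - np.diag(np.diag(comp_cov))

# Fractional n_components keeps just enough components to pass 90% of variance
pca90 = PCA(n_components=0.9).fit(X_scaled)
cumulative = np.cumsum(pca90.explained_variance_ratio_)

print("corr(feature 0, feature 1):", round(feature_corr[0, 1], 3))
print("largest off-diagonal component covariance:", np.abs(off_diag).max())
print("components kept:", pca90.n_components_, "explaining", round(cumulative[-1], 3))
```

The trade-off is interpretability: each component is a linear mix of all indicators, so the model's inputs no longer map to a single economic variable.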
  
Before applying PCA, we first split the data into training and test sets, using the same splitting method as in the baseline model (the earliest 70% of the data forms the training set). Again, we use `train_test_split` with `shuffle=False` to maintain the chronological order of the data, as we are dealing with time series data:

In [478]:
X, y = df_new[indicators_lagged_cols], df_new['rate_change']
# shuffle=False keeps chronological order (random_state has no effect without shuffling)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)

As before, our test set is quite imbalanced, with 80% of the instances being rate increases (class 1): 

In [479]:
print(f"Train Class Distribution: \n{y_train.value_counts(normalize=True)}")
print(f"Test Class Distribution: \n{y_test.value_counts(normalize=True)}")

Train Class Distribution: 
rate_change
0    0.586957
1    0.413043
Name: proportion, dtype: float64
Test Class Distribution: 
rate_change
1    0.8
0    0.2
Name: proportion, dtype: float64


In [480]:
len(X_train), len(X_test)

(46, 20)

One preprocessing step we take is to **standardise the features** before applying PCA so that all features have a mean of 0 and a standard deviation of 1. This step ensures that each feature contributes equally to the PCA, since PCA is sensitive to the scale of the features.  
  
We will use a `Pipeline` to combine the standardisation, PCA, and instantiation of a Random Forest Classifier into a single object. This allows us to streamline the process and ensures that the same transformations are applied to both the training and test sets. Also note that:  
* `n_components` parameter in PCA is set to 0.90, meaning that PCA will retain enough components to explain 90% of the variance in the data. We want to retain as much information as possible while reducing the dimensionality of the dataset.  
* `class_weight='balanced'` parameter in the Random Forest Classifier is used to account for the slight class imbalance in the training set when we fit the pipeline to the training data. This parameter assigns weights to the classes inversely proportional to their frequency, so that the model gives more importance to the minority class (rate increases) during training.
* `TimeSeriesSplit` is used to ensure that the cross-validation preserves the temporal order of the data. This is important as we are dealing with time series data, and we want to avoid data leakage from the future into the past. `n_splits` is set to 3, which yields three expanding-window train/validation splits: each fold trains on all observations that precede its validation block. We keep the number of splits small because we have a *limited amount of data* (46 data points in the training set).
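
The last two points can be checked with a short sketch. It assumes only the sizes reported above (46 training rows, split 27/19 between the classes) and uses placeholder arrays rather than the real features:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.utils.class_weight import compute_class_weight

n_train = 46  # size of the training set reported above
tscv_demo = TimeSeriesSplit(n_splits=3)

fold_sizes = []
for train_idx, val_idx in tscv_demo.split(np.arange(n_train)):
    # Every validation index comes strictly after every training index
    assert train_idx.max() < val_idx.min()
    fold_sizes.append((len(train_idx), len(val_idx)))
print(fold_sizes)

# class_weight='balanced' corresponds to n_samples / (n_classes * class_count)
y_demo = np.array([0] * 27 + [1] * 19)  # 58.7% / 41.3%, as in the training set
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print({label: round(float(w), 3) for label, w in zip([0, 1], weights)})
```

With 46 rows this gives training windows of 13, 24 and 35 observations, each followed by an 11-row validation block, so the earliest fold is fitted on very little data. Inside the cross-validation loop the class weights are recomputed from each fold's own labels, so the values printed here apply only to the fit on the full training set.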

In [481]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.9)),
    ('rf', RandomForestClassifier(class_weight='balanced', random_state=42))
])

tscv = TimeSeriesSplit(n_splits=3)
num_folds = tscv.get_n_splits()

# lists to store metrics and confusion matrices
train_results, val_results = [], []
train_cm_sum = None

# cross-validation loop on the training set only
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_train), 1):
    print(f"Processing fold {fold}/{num_folds}...")

    # Split the training data into training and validation sets
    X_train_fold, X_val_fold = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[val_idx]

    # Fit the pipeline only on the training set
    pipeline.fit(X_train_fold, y_train_fold)

    # predictions
    y_train_pred = pipeline.predict(X_train_fold)
    y_val_pred = pipeline.predict(X_val_fold)
    y_train_probas = pipeline.predict_proba(X_train_fold)[:, 1]
    y_val_probas = pipeline.predict_proba(X_val_fold)[:, 1]  

    # Calculate the AUC-PR
    train_auc_pr = average_precision_score(y_train_fold, y_train_probas)
    val_auc_pr = average_precision_score(y_val_fold, y_val_probas)

    # store the results
    train_results.append(train_auc_pr)
    val_results.append(val_auc_pr)

    # confusion matrix on this fold's training portion (in-sample predictions)
    labels = np.sort(y.unique())
    cm_train = confusion_matrix(y_train_fold, y_train_pred, labels=labels)

    # sum the confusion matrices
    train_cm_sum = cm_train if train_cm_sum is None else train_cm_sum + cm_train

# calculate the average AUC-PR for the training and validation sets
print(f"Average Train AUC-PR: {np.mean(train_results):.3f}")
print(f"Average Validation AUC-PR: {np.mean(val_results):.3f}")
    

Processing fold 1/3...
Processing fold 2/3...
Processing fold 3/3...
Average Train AUC-PR: 1.000
Average Validation AUC-PR: 0.561


After performing the cross-validation on the training set and obtaining its aggregate AUC-PR score, we fit the pipeline to the entire training set and evaluate the model on the test set. This allows us to assess the model's performance on unseen data:

In [482]:
_ = pipeline.fit(X_train, y_train)

y_test_pred = pipeline.predict(X_test)
y_test_probas = pipeline.predict_proba(X_test)[:, 1]

# calculate the performance metrics
auc_pr = average_precision_score(y_test, y_test_probas)

print(f"Test Set Performance: \nAUC-PR = {auc_pr:.3f}")


Test Set Performance: 
AUC-PR = 0.936


#### 1.3.1 Interpretation of Random Forest Classifier Performance (AUC-PR)  
|AUC-PR|Baseline Logistic Regression|Improved Random Forest Classifier|
|---|---|---|
|Train set|0.972|1.000|
|Test set|0.734|0.936|  
  
Based on the AUC-PR metric, the improved Random Forest Classifier outperforms the baseline logistic regression model on both the training and test sets. It achieves a perfect AUC-PR of 1.000 on the training set, meaning it separates rate increases from decreases without error in the data it was fitted on. On the test set, it achieves an AUC-PR of 0.936, substantially higher than the baseline model's 0.734. However, the gap between the perfect training score and the mean cross-validation AUC-PR of 0.561 is still a **red flag for overfitting**.  
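
When reading these numbers, it helps to know what AUC-PR a model with no ranking ability would score. For average precision, that reference level equals the share of positives, so it is high when the positive class dominates. The sketch below uses hypothetical labels with the same composition as the test set (16 increases, 4 decreases) and a constant score for every observation:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical labels matching the test-set composition: 80% rate increases
y_demo = np.array([1] * 16 + [0] * 4)

# A model with no ranking ability: the same score for every observation
constant_scores = np.full(len(y_demo), 0.5)
no_skill_ap = average_precision_score(y_demo, constant_scores)

print(f"Positive share: {y_demo.mean():.2f}")
print(f"No-skill AUC-PR: {no_skill_ap:.2f}")
```

Test-set AUC-PR values are therefore best judged against this level rather than against 0.5.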
  
Therefore, the steps that we have taken to address the issues in the baseline model (accounting for publication lags, reducing dimensionality using PCA, and using a Random Forest Classifier with balanced class weights) have led to a clear **improvement in the model's predictive performance** on the test set.  
  
#### 1.3.2 Confusion Matrices  
We plot the confusion matrix averaged over the training portions of the cross-validation folds, together with the confusion matrix for the test set, to get a visual representation of the model's performance in terms of true positives, false positives, true negatives, and false negatives:


In [483]:
# compute mean confusion matrix for the training set since we have multiple folds
train_cm_avg = np.round(train_cm_sum / num_folds).astype(int)

# convert confusion matrices to dataframe
def get_cm_df(cm, dataset, labels):
    """
    Converts a confusion matrix to a DataFrame for plotting.

    Parameters:
    - cm: Confusion matrix (2D numpy array)
    - dataset: Label for the dataset (e.g., 'Train' or 'Test')
    - labels: Class labels

    Returns:
    - DataFrame with columns: ['Actual', 'Predicted', 'Count', 'Dataset']
    """
    return pd.DataFrame([
        (actual, predicted, cm[i, j]) for i, actual in enumerate(labels) for j, predicted in enumerate(labels)
    ], columns=["Actual", "Predicted", "Count"]).assign(Dataset=dataset)

train_cm_df = get_cm_df(train_cm_avg, 'Train (CV-Averaged)', labels)

# Define the label map for annotations
label_map = {
    (0, 0): "TN",
    (0, 1): "FP",
    (1, 0): "FN",
    (1, 1): "TP",
}

def plot_confusion_matrix_with_annotations(cm_df, label_map, title="Confusion Matrix"):
    """
    Creates and plots a confusion matrix with annotations for TP, FP, FN, TN using Lets-Plot.

    Parameters:
    - cm_df: DataFrame containing confusion matrix details
    - label_map: Dictionary mapping (Actual, Predicted) to annotation labels
    - title: Title for the plot
    """
    # Add annotations for TP, FP, FN, TN
    cm_df['Annotation'] = cm_df.apply(
        lambda row: f"{label_map[(row['Actual'], row['Predicted'])]}: {row['Count']}", axis=1
    )

    # Create confusion matrix plot with Lets-Plot
    plot = ggplot(cm_df, aes(x="Predicted", y="Actual", fill="Count")) + \
        geom_tile() + \
        geom_text(aes(label="Annotation"), size=10, color="black", vjust=0.5, hjust=0.5) + \
        scale_fill_gradient(low="white", high="#FF7F50") + \
        ggtitle(title) + \
        xlab("Predicted") + \
        ylab("Actual") + \
        coord_fixed(ratio=1) + \
        theme_minimal() + \
        theme(legend_position="right")

    return plot

# Now plot the confusion matrix with annotations
p_train_avg_with_annotations = plot_confusion_matrix_with_annotations(train_cm_df, label_map, title="Averaged Confusion Matrix (Train - CV)")
p_test = plot_confusion_matrix(y_test, y_test_pred, title="Confusion Matrix (Test)")

gggrid([p_train_avg_with_annotations, p_test], ncol=2, widths=[1, 1])


The Random Forest Classifier perfectly classifies the training folds, an improvement over the baseline logistic regression. Note, however, that this averaged matrix is computed from predictions on the same data each fold's model was fitted on, so it reflects in-sample fit rather than validation performance. More importantly, the model **still fails to correctly classify any interest rate increases in the test set** (they all appear as false negatives), even though rate increases make up a high proportion (80%) of the test set.   
  
One potential explanation is that the **unprecedented economic events** in recent years, such as the pandemic and the subsequent monetary policy responses, introduced market dynamics that were not present in earlier training data. Consequently, the model, even after incorporating publication lags, PCA, and other preprocessing steps, is unable to capture the new patterns associated with rate hikes. Essentially, the model’s predicted probabilities for rate hikes remain extremely low (mean probability = 0.0665), leading to a default classification of "rate decrease" when using the standard threshold. Let's look at the summary statistics of the predicted probabilities for the test set:


In [484]:
print("Min:", np.min(y_test_probas))
print("Max:", np.max(y_test_probas))
print("Mean:", np.mean(y_test_probas))
print("Median:", np.median(y_test_probas))

Min: 0.02
Max: 0.2
Mean: 0.0665
Median: 0.03


This suggests that, **while the model might rank instances relatively well** (as indicated by a high AUC-PR in the test set), its calibration is off: it is **too conservative in assigning higher probabilities to rate hikes**. Addressing this discrepancy may require a more targeted rebalancing of the training data, threshold recalibration, or the incorporation of additional features that capture these unprecedented economic conditions. I explore one of these approaches in 1.3.3 as an extension.  
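
The gap between ranking quality and thresholded predictions can be illustrated with made-up probabilities (synthetic values chosen to resemble the low range observed above, not the model's actual outputs). Every positive is ranked above every negative, yet all scores sit below the default 0.5 cut-off:

```python
import numpy as np
from sklearn.metrics import average_precision_score, recall_score

# Synthetic example: 16 rate increases, 4 decreases
y_demo = np.array([1] * 16 + [0] * 4)

# Positives receive low but strictly higher probabilities than negatives
proba_demo = np.concatenate([np.linspace(0.03, 0.20, 16), np.full(4, 0.02)])

# AUC-PR depends only on the ordering of the scores
ap = average_precision_score(y_demo, proba_demo)

# Recall depends on where the decision threshold sits
recall_default = recall_score(y_demo, (proba_demo >= 0.5).astype(int))
recall_lowered = recall_score(y_demo, (proba_demo >= 0.025).astype(int))

print(f"AUC-PR: {ap:.3f}")
print(f"Recall at threshold 0.5:   {recall_default:.2f}")
print(f"Recall at threshold 0.025: {recall_lowered:.2f}")
```

In practice, a lowered threshold would need to be chosen on validation data, since tuning it on the test set would leak information into the evaluation.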
 
#### 1.3.3 Rebalancing the Train/Test Split
Instead of the instructed 70/30 split, let's consider a **75/25 split** for the training and test sets. The motivation is that, in our earlier analysis, the training set consisted largely of pre-2021 observations, when rate hikes were relatively rare, so the model learned patterns indicating a low likelihood of rate increases. Shifting to a 75/25 split moves more recent data, where rate hikes are more frequent, into the training set. The aim is to provide the model with up-to-date economic signals reflective of the current monetary policy environment, potentially leading to more realistic probability estimates and improved predictive performance on the test set.
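
To see how much data actually changes hands, the following sketch applies both splits to a placeholder index of 66 rows (the number of rate decisions implied by the 46/20 split above) and lists the rows that move from the test set into the training set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(66)  # placeholder positions for the 66 rate decisions

# Chronological splits: no shuffling, so the test set is always the most recent block
train_70, test_70 = train_test_split(rows, test_size=0.30, shuffle=False)
train_75, test_75 = train_test_split(rows, test_size=0.25, shuffle=False)

# Rows that are in the 75/25 training set but were test rows under 70/30
moved = np.setdiff1d(train_75, train_70)

print(len(train_70), len(test_70))
print(len(train_75), len(test_75))
print(moved.tolist())
```

Only three observations move into the training set, which is worth keeping in mind when comparing the two models.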

In [485]:
X, y = df_new[indicators_lagged_cols], df_new['rate_change']
# shuffle=False keeps chronological order (random_state has no effect without shuffling)
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X, y, test_size=0.25, shuffle=False)

In [486]:
print(f"Train Class Distribution: \n{y_train_new.value_counts()}")
print(f"Test Class Distribution: \n{y_test_new.value_counts()}")

Train Class Distribution: 
rate_change
0    28
1    21
Name: count, dtype: int64
Test Class Distribution: 
rate_change
1    14
0     3
Name: count, dtype: int64


In [487]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.9)),
    ('rf', RandomForestClassifier(class_weight='balanced', random_state=42))
])

tscv = TimeSeriesSplit(n_splits=3)
num_folds = tscv.get_n_splits()

# lists to store metrics and confusion matrices
train_results_new, val_results_new = [], []

# cross-validation loop on the training set only
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_train_new), 1):
    print(f"Processing fold {fold}/{num_folds}...")

    # Split the training data into training and validation sets
    X_train_new_fold, X_val_new_fold = X_train_new.iloc[train_idx], X_train_new.iloc[val_idx]
    y_train_new_fold, y_val_new_fold = y_train_new.iloc[train_idx], y_train_new.iloc[val_idx]

    # Fit the pipeline only on the training set
    pipeline.fit(X_train_new_fold, y_train_new_fold)

    # predictions
    y_train_new_pred = pipeline.predict(X_train_new_fold)
    y_val_new_pred = pipeline.predict(X_val_new_fold)
    y_train_new_probas = pipeline.predict_proba(X_train_new_fold)[:, 1]
    y_val_new_probas = pipeline.predict_proba(X_val_new_fold)[:, 1]  

    # Calculate the AUC-PR
    train_auc_pr_new = average_precision_score(y_train_new_fold, y_train_new_probas)
    val_auc_pr_new = average_precision_score(y_val_new_fold, y_val_new_probas)

    # store the results
    train_results_new.append(train_auc_pr_new)
    val_results_new.append(val_auc_pr_new)

# calculate the average AUC-PR for the training and validation sets
print(f"Average Train AUC-PR: {np.mean(train_results_new):.3f}")
print(f"Average Validation AUC-PR: {np.mean(val_results_new):.3f}")
    

Processing fold 1/3...
Processing fold 2/3...
Processing fold 3/3...
Average Train AUC-PR: 1.000
Average Validation AUC-PR: 0.387


In [488]:
_ = pipeline.fit(X_train_new, y_train_new)

y_test_pred_new = pipeline.predict(X_test_new)
y_test_probas_new = pipeline.predict_proba(X_test_new)[:, 1]

# calculate the performance metrics
auc_pr_new = average_precision_score(y_test_new, y_test_probas_new)

print(f"Test Set Performance: \nAUC-PR = {auc_pr_new:.3f}")


Test Set Performance: 
AUC-PR = 0.970


In [489]:
print("Min:", np.min(y_test_probas_new))
print("Max:", np.max(y_test_probas_new))
print("Mean:", np.mean(y_test_probas_new))
print("Median:", np.median(y_test_probas_new))


Min: 0.12
Max: 0.83
Mean: 0.6611764705882353
Median: 0.69


In [490]:
p_test_new = plot_confusion_matrix(y_test_new, y_test_pred_new, title="Confusion Matrix (Test) (75/25 Split)")
p_test_new

**Interpretation of results**: 
Now, the Random Forest Classifier classifies rate hikes in the test set almost perfectly (only 1 false negative), a marked improvement over the previous model. However, the number of false positives also increased. This suggests that the model is now more sensitive to rate hikes, but at the cost of making more false positive predictions.  
  
Comparing the AUC-PR scores:

| AUC-PR | Random Forest Classifier (70/30 split) | Improved Random Forest Classifier (75/25 split) |
|--------|---------------------------------------|-------------------------------------------------|
| Aggregate Train | 1.000 | 1.000 |
| Aggregate Validation | 0.561 | 0.387 |
| Test set | 0.936 | 0.970 |  
  
The discrepancy between the aggregate train and validation AUC-PR scores is even greater under the 75/25 split than under the 70/30 split. Despite the overfitting evident in cross-validation, the test set AUC-PR remains high (0.970) and close to the training performance. This may be because the test set, which comprises more recent data, has a distribution of economic indicators that the model can rank effectively now that it has been trained on more recent patterns. In other words, while the model struggles with certain periods during cross-validation, it performs well on the test period, likely because the test data reflects conditions similar to those the model overfit on. The new test set is also very small (17 observations, only 3 of them rate decreases), so its metrics are noisy. As such, it is difficult to conclude whether this new model truly generalises better to unseen data. 
  
### 1.4 Conclusion 
Predicting interest rate hikes and cuts is inherently challenging due to the complex and dynamic nature of economic conditions. In the real world, central banks base their decisions on a variety of factors, including inflation, unemployment, geopolitical events, and economic shocks, all of which are often unpredictable. The difficulty in accurately predicting these rate changes is compounded by the fact that these events are rare and can be heavily influenced by unforeseen circumstances, such as the COVID-19 pandemic or geopolitical tensions like the Russia-Ukraine war. 

Moreover, models trained on historical data may struggle to predict future rate changes if they encounter new economic patterns or events that weren't present in the training data. This is evident in our analysis, where even with an improved model, we saw limitations in its ability to predict rate hikes, especially in the post-2021 period, when monetary policy became more reactive to unforeseen challenges. 

Thus, while machine learning models can provide insights and offer probability estimates, they face significant limitations when it comes to capturing the full range of variables and shocks that influence central bank decisions.   
  
## Generative AI acknowledgement  
This notebook was written in VS Code with the GitHub Copilot extension activated. It provided some useful suggestions when autocompleting code snippets. I also used ChatGPT for debugging and for checking my understanding and interpretations of the evaluation metrics.