CS4001/4042 Assignment 1, Part B, Q4
---

Model degradation is a common issue faced when deploying machine learning models (including neural networks) in the real world. New data points could exhibit a different pattern from older data points due to factors such as changes in government policy or market sentiments. For instance, housing prices in Singapore have been increasing and the Singapore government has introduced 3 rounds of cooling measures over the past years (16 December 2021, 30 September 2022, 27 April 2023).

In such situations, the distribution of the new data points can differ from the distribution the models were trained on. Recall that machine learning models usually assume that the test distribution is similar to the training distribution. When this assumption is violated, model performance suffers. In the last part of this assignment, we will investigate the extent to which model degradation has occurred.




---



Your co-investigators used a linear regression model to rapidly test several train/test split combinations and shared their findings in a brief report, attached in Appendix A below. You wish to investigate whether your deep learning model corroborates their findings.

In [4]:
%pip install alibi-detect

Note: you may need to restart the kernel to use updated packages.




In [1]:
SEED = 42

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd
from sklearn.metrics import r2_score
import math
import warnings

from alibi_detect.cd import TabularDrift

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

> Evaluate your model from B1 on data from year 2022 and report the test R2.

In [2]:
# TODO: Enter your code here
df = pd.read_csv('hdb_price_prediction.csv')

warnings.filterwarnings("ignore")

# Filtering data for evaluation on 2022 data
train_df_2019 = df[df['year'] <= 2019]
validation_df_2020 = df[df['year'] == 2020]
test_df_2022 = df[df['year'] == 2022]

# Sanity Check: Get unique values 
train_unique_years_2022 = ', '.join(map(str, train_df_2019['year'].unique()))
validation_unique_years_2022 = ', '.join(map(str, validation_df_2020['year'].unique()))
test_unique_years_2022 = ', '.join(map(str, test_df_2022['year'].unique()))

# Sanity Check: Print the formatted unique values
print(f"Unique years in train_df_2022: {train_unique_years_2022}")
print(f"Unique years in validation_df_2022: {validation_unique_years_2022}")
print(f"Unique years in test_df_2022: {test_unique_years_2022}\n\n")

# Define the DataConfig 
data_config= DataConfig(
    target=['resale_price'], 
    continuous_cols=['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm'],
    categorical_cols=['month', 'town', 'flat_model_type', 'storey_range']
)

# Define the TrainerConfig 
trainer_config = TrainerConfig(
    auto_lr_find=True, 
    batch_size=1024, 
    max_epochs=50,  
)

# Define the CategoryEmbeddingModelConfig 
model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50" 
)

# Create the OptimizerConfig 
optimizer_config = OptimizerConfig(optimizer="Adam")

# Create the TabularModel
model_2022 = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

# Train the model on data up to 2019, using 2020 for validation
model_2022.fit(train=train_df_2019, validation=validation_df_2020, seed=SEED)

# Evaluate the model on the 2022 test dataset and store the results
result_2022 = model_2022.evaluate(test_df_2022)

# Print the evaluation results
print(result_2022)

# Calculate the Root Mean Square Error (RMSE) from the test results
rmse_2022 = math.sqrt(result_2022[0]['test_mean_squared_error'])

# Get predictions for the 2022 test dataset
pred_2022_df = model_2022.predict(test_df_2022)
actual_values_2022 = test_df_2022['resale_price'].values  
predicted_values_2022 = pred_2022_df["resale_price_prediction"].values  

# Calculate the R² for 2022 Test Set
r2_2022 = r2_score(actual_values_2022, predicted_values_2022)

# Print values
print(f"RMSE for 2022 Test Set: {round(rmse_2022,4)}")
print(f"R² for 2022 Test Set: {round(r2_2022,4)}")


2023-10-09 13:07:45,420 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off


Unique years in train_df_2022: 2017, 2018, 2019
Unique years in validation_df_2022: 2020
Unique years in test_df_2022: 2022




Global seed set to 42
2023-10-09 13:07:45,464 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-09 13:07:45,471 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-09 13:07:45,588 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-09 13:07:45,659 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-10-09 13:07:45,714 - {pytorch_tabular.tabular_model:573} - INFO - Auto LR Find Started


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at c:\Users\Wei Kang\Desktop\Individual Assignment\.lr_find_478720eb-7fdd-4e3f-be98-452cdbfd85a2.ckpt
Restored all states from the checkpoint file at c:\Users\Wei Kang\Desktop\Individual Assignment\.lr_find_478720eb-7fdd-4e3f-be98-452cdbfd85a2.ckpt
2023-10-09 13:07:52,881 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-09 13:07:52,883 - {pytorch_tabular.tabular_model:582} - INFO - Training Started


Output()

2023-10-09 13:08:35,197 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-09 13:08:35,198 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model


Output()

Output()

RMSE for 2022 Test Set: 127542.3017
R² for 2022 Test Set: 0.4388


> Evaluate your model from B1 on data from year 2023 and report the test R2.

In [3]:
# TODO: Enter your code here

# 1. Filter data for year 2023
test_2023_df = df[df['year'] == 2023]

# 2. Evaluate the model on the 2023 dataset and store the results
result_2023 = model_2022.evaluate(test_2023_df)

# 3. Get predictions for the 2023 dataset
pred_2023_df = model_2022.predict(test_2023_df)

# Extract the actual and predicted values for 2023
actual_values_2023 = test_2023_df['resale_price'].values
predicted_values_2023 = pred_2023_df["resale_price_prediction"].values

# Calculate the Root Mean Square Error (RMSE) from the test results
rmse_2023 = math.sqrt(result_2023[0]['test_mean_squared_error'])
print(f"RMSE for 2023 Test Set: {round(rmse_2023,4)}")

# Calculate the R² for 2023 data
r2_2023 = r2_score(actual_values_2023, predicted_values_2023)

# Print the R² value for the 2023 data  
print(f"R² for 2023 Test Set: {round(r2_2023,4)}")

Output()

Output()

RMSE for 2023 Test Set: 157166.6766
R² for 2023 Test Set: 0.1621


> Did model degradation occur for the deep learning model?


### Yes. 

Model degradation, within the machine learning domain, is the deterioration of a model's performance when exposed to new data that may not adhere to the original training data's distribution. This drop in efficacy is especially pronounced when the incoming data significantly deviates from patterns present during training.

Given the results:

- R² for 2022: 0.4388
- R² for 2023: 0.1621

The R² value drops markedly from 2022 (0.4388) to 2023 (0.1621), and the RMSE rises from about 127,500 to about 157,200, so the model performs worse the further the test year is from the training period. Given Singapore's rising housing prices and the government's cooling measures between December 2021 and April 2023, it is plausible that the 2023 data distribution differs from that of the training set (data up to 2019). This suggests that concept drift may have occurred, i.e. the relationship between the features and the target variable (resale price) has changed over time. (This is discussed further below.)
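To see how sensitive R² is to this kind of change, recall that R² = 1 - SS_res / SS_tot. The sketch below uses made-up prices (not taken from the dataset) and a hypothetical helper `r2_and_rmse`. It illustrates one way R² can collapse: a model that ranks flats correctly but under-predicts every price by a constant amount. It is only an illustration and makes no claim about what happened to the model above.

```python
import math


def r2_and_rmse(actual, predicted):
    """Return (R^2, RMSE) for two equal-length sequences of numbers."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n)


# Hypothetical resale prices, for illustration only
actual = [400_000, 500_000, 600_000, 700_000]
close = [410_000, 490_000, 610_000, 690_000]   # small, unsystematic errors
shifted = [a - 100_000 for a in actual]        # constant under-prediction

r2_close, rmse_close = r2_and_rmse(actual, close)
r2_shifted, rmse_shifted = r2_and_rmse(actual, shifted)

print(f"close:   R2={r2_close:.3f}, RMSE={rmse_close:.0f}")
print(f"shifted: R2={r2_shifted:.3f}, RMSE={rmse_shifted:.0f}")
```

In this toy case the constant offset alone takes R² from 0.992 to 0.2, even though the ordering of the predictions is unchanged.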



---



Model degradation could be caused by [various data distribution shifts](https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html#data-shift-types): covariate shift (features), label shift and/or concept drift (altered relationship between features and labels).
There are various conflicting terminologies in the [literature](https://www.sciencedirect.com/science/article/pii/S0950705122002854#tbl1). Let’s stick to this reference for this assignment.

> Using the **Alibi Detect** library, apply the **TabularDrift** function with the training data (year 2019 and before) used as the reference and **detect which features have drifted** in the 2023 test dataset. Before running the statistical tests, ensure you **sample 1000 data points** each from the train and test data. Do not use the whole train/test data. (Hint: use this example as a guide https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_chi2ks_adult.html)


In [5]:
print(df.columns)

Index(['month', 'year', 'town', 'full_address', 'nearest_stn',
       'dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality',
       'eigenvector_centrality', 'flat_model_type', 'remaining_lease_years',
       'floor_area_sqm', 'storey_range', 'resale_price'],
      dtype='object')


In [12]:
def print_delimiter(character="=", length=50):
    print(character * length)

# Load the dataset
df = pd.read_csv("hdb_price_prediction.csv")

# Split data into training set (year 2019 and before) and 2023 test set
X_train = df[df['year'] <= 2019].drop(columns=['resale_price'])
X_test_2023 = df[df['year'] == 2023].drop(columns=['resale_price'])

# Sample 1000 data points from the training data (year 2019 and before) and 2023 test data
X_train_sample = X_train.sample(1000, random_state=SEED)
X_test_2023_sample = X_test_2023.sample(1000, random_state=SEED)

feature_names = ["month","year","town","full_address","nearest_stn","dist_to_nearest_stn","dist_to_dhoby","degree_centrality",
                 "eigenvector_centrality","flat_model_type","remaining_lease_years","floor_area_sqm","storey_range"]

labels = ['No!', 'Yes!']

category_map = {
    0: None, # month 
    1: None, # year
    2: None, # town
    3: None, # full_address
    4: None, # nearest_stn
    9: None, # flat_model_type
    12: None # storey_range
}

# Map each categorical column index to None so that the categories are inferred from the reference data
categories_per_feature = {f: None for f in category_map}
cd = TabularDrift(X_train_sample.values, p_val=.05, categories_per_feature=categories_per_feature)
preds = cd.predict(X_test_2023_sample.values)
fpreds = cd.predict(X_test_2023_sample.values, drift_type='feature')
print_delimiter()
# Check for Drift on the Test Set:
print('Was there a drift on Test Set? {}'.format(labels[preds['data']['is_drift']]))
print_delimiter()
print_delimiter()

print('Printing the p-values for each feature:', end = '\n\n')
# Inspect Results for Each Feature:
for f in range(cd.n_features):
    # Check if the feature is categorical or continuous
    stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
    fname = feature_names[f]
    is_drift = fpreds['data']['is_drift'][f]
    stat_val, p_val = fpreds['data']['distance'][f], fpreds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- {stat} {stat_val:.4f} -- p-value {p_val:.4f}')

print_delimiter()


Was there a drift on Test Set? Yes!
Printing the p-values for each feature:

month -- Drift? Yes! -- Chi2 430.3365 -- p-value 0.0000
year -- Drift? Yes! -- Chi2 2000.0000 -- p-value 0.0000
town -- Drift? No! -- Chi2 33.1777 -- p-value 0.1267
full_address -- Drift? No! -- Chi2 1799.3334 -- p-value 0.1532
nearest_stn -- Drift? Yes! -- Chi2 110.0444 -- p-value 0.0080
dist_to_nearest_stn -- Drift? No! -- K-S 0.0350 -- p-value 0.5606
dist_to_dhoby -- Drift? No! -- K-S 0.0590 -- p-value 0.0591
degree_centrality -- Drift? No! -- K-S 0.0380 -- p-value 0.4547
eigenvector_centrality -- Drift? No! -- K-S 0.0560 -- p-value 0.0837
flat_model_type -- Drift? Yes! -- Chi2 62.1219 -- p-value 0.0011
remaining_lease_years -- Drift? Yes! -- K-S 0.1630 -- p-value 0.0000
floor_area_sqm -- Drift? Yes! -- K-S 0.0620 -- p-value 0.0410
storey_range -- Drift? Yes! -- Chi2 27.8423 -- p-value 0.0095
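
As a sanity check on the K-S rows above: the two-sample K-S statistic is the largest gap between the two empirical CDFs, and for large samples the rejection threshold at level α is approximately c(α)·sqrt((n + m) / (n·m)) with c(α) = sqrt(-ln(α/2) / 2). The sketch below is a plain-Python illustration of these two formulas. It is not how Alibi Detect computes its p-values, and the helper names are made up for this example.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at a data point.
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / n
        f_b = bisect_right(b_sorted, x) / m
        d = max(d, abs(f_a - f_b))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the two-sample K-S rejection threshold."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


threshold = ks_critical_value(1000, 1000, alpha=0.05)
print(f"Approximate K-S threshold for n = m = 1000: {threshold:.4f}")
```

With 1000 points per sample the threshold is about 0.0607, which agrees with the output above: `floor_area_sqm` (0.0620) is flagged, while `dist_to_dhoby` (0.0590) and `eigenvector_centrality` (0.0560) are not.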


> Assuming that the flurry of housing measures have made an impact on the relationship between all the features and resale_price (i.e. P(Y|X) changes), which type of data distribution shift possibly led to model degradation?


# **Concept Drift**

Given the recent developments in the Singapore housing market, notably the introduction of several rounds of cooling measures, it's imperative to understand the type of data distribution shift that might have occurred, resulting in model degradation.

### **1. Covariate Shift**
- **Definition**: The distribution of the input features, P(X), changes, but the relationship between input and output, P(Y|X), remains the same.
- **Context**: This shift would imply that the distribution of features like `town`, `flat_model_type`, `storey_range`, etc. has changed over time, while the way they relate to `resale_price` has not. For example, if the mix of flat types being sold or the towns they are in has changed, but a given flat would still fetch the same price as before, that would indicate a covariate shift.

### **2. Label Shift**
- **Definition**: The distribution of the labels (or target), P(Y), changes, but the distribution of the features given the label, P(X|Y), stays the same.
- **Context**: This would mean that the distribution of `resale_price` itself has changed, for example because a larger share of transactions falls in a higher price band, while the flats sold within any given price band look the same as before.

### **3. Concept Drift**
- **Definition**: The relationship between the input features and the target variable, P(Y|X), changes, while the input distribution P(X) may stay the same.
- **Context**: The same flat (same town, floor area, remaining lease, and so on) now sells for a different price than before. Since the cooling measures are assumed to have changed the relationship between all the features and `resale_price`, this is the case that matches the question.

### **Conclusion**
Since the question assumes that the housing measures have changed the relationship between the features and `resale_price`, i.e. \(P(Y|X)\) has changed, **Concept Drift** is the most probable type of shift in this scenario.
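
The three shift types can be read off the two factorizations of the joint distribution. The summary below follows the definitions in the reference linked earlier. It is a compact restatement, not an additional result.

```latex
\begin{align*}
P(X, Y) &= P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y) \\[4pt]
\text{Covariate shift:} &\quad P(X)\ \text{changes},\quad P(Y \mid X)\ \text{unchanged} \\
\text{Label shift:}     &\quad P(Y)\ \text{changes},\quad P(X \mid Y)\ \text{unchanged} \\
\text{Concept drift:}   &\quad P(Y \mid X)\ \text{changes},\quad P(X)\ \text{unchanged}
\end{align*}
```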


> From your analysis via TabularDrift, which features contribute to this shift?


# **Drifted Features**

The results from the `TabularDrift` analysis are as follows:

| Feature                  | Drift Detected | Test Used | Test Statistic | P-Value   |
|--------------------------|----------------|-----------|----------------|-----------|
| month              | Yes            | Chi2      | 430.3365       | 0.0000 |
| year               | Yes            | Chi2      | 2000.0000      | 0.0000 |
| town                     | No             | Chi2      | 33.1777        | 0.1267    |
| full_address             | No             | Chi2      | 1799.3334      | 0.1532    |
| ***nearest_stn***              | Yes            | Chi2      | 110.0444       | ***0.0080*** |
| dist_to_nearest_stn      | No             | K-S       | 0.0350         | 0.5606    |
| dist_to_dhoby            | No             | K-S       | 0.0590         | 0.0591    |
| degree_centrality        | No             | K-S       | 0.0380         | 0.4547    |
| eigenvector_centrality   | No             | K-S       | 0.0560         | 0.0837    |
| ***flat_model_type***    | Yes            | Chi2      | 62.1219        | ***0.0011*** |
| ***remaining_lease_years*** | Yes        | K-S       | 0.1630         | ***0.0000*** |
| ***floor_area_sqm***     | Yes            | K-S       | 0.0620         | ***0.0410*** |
| ***storey_range***       | Yes            | Chi2      | 27.8423        | ***0.0095*** |

Based on our results, the features that show significant drift (p-value < 0.05) include:

- month
- year
- nearest_stn
- flat_model_type
- remaining_lease_years
- floor_area_sqm
- storey_range

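Among these, `month` and `year` are time-based and expected to change, so the more informative contributors are the remaining features: `nearest_stn`, `flat_model_type`, `remaining_lease_years`, `floor_area_sqm` and `storey_range`. Note that `nearest_stn`, like `year` and `full_address`, is not among the inputs of the model defined earlier, so its drift cannot affect that model's predictions directly.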
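
One way to narrow the list further is to keep only the drifted features that the model actually consumes. The sketch below copies the p-values from the TabularDrift output and the input columns from the `DataConfig` used earlier. The variable names are illustrative, and the 0.05 threshold is the `p_val` passed to `TabularDrift`.

```python
# p-values copied from the TabularDrift output above
p_values = {
    "month": 0.0000,
    "year": 0.0000,
    "town": 0.1267,
    "full_address": 0.1532,
    "nearest_stn": 0.0080,
    "dist_to_nearest_stn": 0.5606,
    "dist_to_dhoby": 0.0591,
    "degree_centrality": 0.4547,
    "eigenvector_centrality": 0.0837,
    "flat_model_type": 0.0011,
    "remaining_lease_years": 0.0000,
    "floor_area_sqm": 0.0410,
    "storey_range": 0.0095,
}

# Model inputs, as listed in the DataConfig (continuous_cols + categorical_cols)
model_inputs = {
    "dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality",
    "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm",
    "month", "town", "flat_model_type", "storey_range",
}

ALPHA = 0.05  # same threshold as p_val in TabularDrift

# Keep only features that both drifted and are fed to the model
drifted_model_inputs = [
    name for name, p in p_values.items()
    if name in model_inputs and p < ALPHA
]
print(drifted_model_inputs)
```

This leaves `month`, `flat_model_type`, `remaining_lease_years`, `floor_area_sqm` and `storey_range` as the drifted features that reach the model.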

> Suggest 1 way to address model degradation and implement it, showing improved test R2 for year 2023.


# Addressing Model Degradation

The housing market has changed significantly due to government interventions and cooling measures, and these external forces likely contributed to the observed model degradation. To counter this and keep the predictive model robust, we propose the following strategy:

### **1. Adaptive Training with Weighted Samples**

#### **Insights from the Appendix**:
- Models trained closer to the test year tend to perform better, indicating the importance of using recent data for training.
- While older data provides a historical context, it seems to lose predictive power for future years, especially in the face of external changes like government interventions.

#### **Rationale:**
- **Recent Market Changes:** Recent data reflects the current state of the market, including new policies and economic conditions, so the model should be adapted by incorporating it. This is supported by Appendix A, where training sets closer to the 2021 test year gave higher test R2 values.
- **Balancing Historical and Current Data:** While it's essential to capture long-term trends, recent data provides crucial information about the market's current state. Weighing samples based on their recency allows the model to benefit from both historical context and current dynamics.

#### **Strategy:**
- **Data Inclusion:** Incorporate the data from 2020 to 2022 into the training set. We use the first 90% of the rows up to 2022 for training and the remaining 10% for validation. The validation set plays only a minor role here, since the main goal is to address model degradation by ensuring that the model generalizes to new data (year 2023).
- **Weight Assignment:** Weight the samples so that newer samples are drawn more often during training, using a weighted random sampler with weight 1 / (2024 - year). This ensures that the model still sees historical data but emphasizes recent market conditions (e.g. per-sample weights of **0.000004 for 2017** and **0.000015 for 2022** after normalization).
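
To make the weighting concrete, the sketch below works through the 1 / (2024 - year) rule. The function name `recency_weight` and the equal row counts per year are assumptions made for illustration. The real dataset has different counts, which is why the normalized per-sample weights in the notebook are so small.

```python
def recency_weight(year, reference_year=2024):
    """Unnormalized sampling weight: more recent years get larger weights."""
    return 1 / (reference_year - year)


years = range(2017, 2023)
relative = {year: recency_weight(year) for year in years}

# A 2022 row is 3.5 times as likely to be drawn as a 2017 row
ratio_2022_to_2017 = relative[2022] / relative[2017]

# Hypothetical: the same number of rows in every year (NOT the real counts)
rows_per_year = {year: 20_000 for year in years}
total_weight = sum(rows_per_year[y] * relative[y] for y in years)

# Expected share of sampled rows that come from each year
expected_share = {
    y: rows_per_year[y] * relative[y] / total_weight for y in years
}

for y in years:
    print(f"{y}: relative weight {relative[y]:.4f}, expected share {expected_share[y]:.1%}")
```

Only the relative sizes of the weights matter to the sampler. Normalizing them to sum to 1 changes the printed values but not how often each row is drawn.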

#### **Result:**
- **Increase in R2 Value:** With this strategy, the test R2 for 2023 improved from **`0.1621`** to **`0.6038`** (RMSE fell from about 157,167 to about 108,073). Because the recent data and the recency weighting were introduced together, this result shows the combined effect of both changes rather than the contribution of each.

In [13]:
# TODO: Enter your code here
df = pd.read_csv('hdb_price_prediction.csv')
warnings.filterwarnings("ignore")
from torch.utils.data import WeightedRandomSampler

# Use data up to 2022: the first 90% of rows (in file order) for training, the last 10% for validation.
# Note: this is a positional split, not a random one.
data_up_to_2022 = df[df['year'] <= 2022]
split_idx = int(0.9 * len(data_up_to_2022))  # Find the 90% mark

train_up_to_2022 = data_up_to_2022.iloc[:split_idx]  # First 90%
valid_up_to_2022 = data_up_to_2022.iloc[split_idx:]  # Last 10%

test_2023_df = df[df['year'] == 2023]

# Compute the sample weights using the year column of train_up_to_2022

years = train_up_to_2022['year'].values
weights = 1 / (2024 - years)  # Give more weight to recent years
weights = weights / np.sum(weights)  # Normalize the weights to make them sum up to 1
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)  # Create the weighted sampler

# Create a dictionary to store one weight per unique year
unique_year_weights = {}

# Loop through the unique years
for year in np.unique(years):
    # Find the index of the first occurrence of the current year
    year_idx = np.where(years == year)[0][0]
    # Get the weight of the first occurrence of the current year
    unique_year_weights[year] = weights[year_idx]

# Convert the dictionary to a DataFrame for easier viewing
unique_year_weights_df = pd.DataFrame(list(unique_year_weights.items()), columns=['Year', 'Weight'])

# Display the DataFrame to see the weight assigned to each unique year
display(unique_year_weights_df)

# Define the DataConfig
data_config_updated = DataConfig(
    target=['resale_price'],
    continuous_cols=['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm'],
    categorical_cols=['month', 'town', 'flat_model_type', 'storey_range'],
    validation_split=None
)

# Define the TrainerConfig
trainer_config_updated = TrainerConfig(
    auto_lr_find=True,
    batch_size=1024,
    max_epochs=50,
)

# Define the CategoryEmbeddingModelConfig
model_config_updated = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50"
)

# Create the OptimizerConfig
optimizer_config_updated = OptimizerConfig(optimizer="Adam")

# Create the TabularModel with updated configurations
model_updated = TabularModel(
    data_config=data_config_updated,
    model_config=model_config_updated,
    optimizer_config=optimizer_config_updated,
    trainer_config=trainer_config_updated,
)

# Train the updated model using data up to 2022 with the weighted sampler
model_updated.fit(train=train_up_to_2022, validation=valid_up_to_2022, train_sampler=sampler, seed=SEED)

# Evaluate the model on the 2023 test dataset and store the results
result_updated = model_updated.evaluate(test_2023_df)

# Print the evaluation results
print(result_updated)

# Calculate the Root Mean Square Error (RMSE) from the test results
rmse_updated = math.sqrt(result_updated[0]['test_mean_squared_error'])

# Get predictions for the test dataset
pred_2023_df = model_updated.predict(test_2023_df)
actual_values_updated = test_2023_df['resale_price'].values
predicted_values_updated = pred_2023_df["resale_price_prediction"].values

# Calculate the R² for the updated model
r2_updated = r2_score(actual_values_updated, predicted_values_updated)

# rmse_2023 and r2_2023 (old model) were computed in the 2023 evaluation cell above;
# rmse_updated and r2_updated (new model) were computed earlier in this cell.

# Creating dataframe
data = {
    'Metrics': ['RMSE', 'R^2'],
    'Old Values': [rmse_2023, r2_2023],
    'New Values': [rmse_updated, r2_updated]
}

comparison_df = pd.DataFrame(data)
comparison_df_rounded = comparison_df.round(4)
display(comparison_df_rounded)


| Year | Weight  |
|------|---------|
| 2017 | 4e-06   |
| 2018 | 5e-06   |
| 2019 | 6e-06   |
| 2020 | 8e-06   |
| 2021 | 1e-05   |
| 2022 | 1.5e-05 |


2023-10-09 13:33:53,640 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off
Global seed set to 42
2023-10-09 13:33:53,671 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-09 13:33:53,681 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-09 13:33:53,842 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-09 13:33:53,875 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-10-09 13:33:53,932 - {pytorch_tabular.tabular_model:573} - INFO - Auto LR Find Started


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at c:\Users\Wei Kang\Desktop\Individual Assignment\.lr_find_eea41a8e-a11f-420e-b2cd-c9828ba7397e.ckpt
Restored all states from the checkpoint file at c:\Users\Wei Kang\Desktop\Individual Assignment\.lr_find_eea41a8e-a11f-420e-b2cd-c9828ba7397e.ckpt
2023-10-09 13:33:59,114 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-09 13:33:59,116 - {pytorch_tabular.tabular_model:582} - INFO - Training Started


Output()

2023-10-09 13:34:57,169 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-09 13:34:57,171 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model


Output()

Output()

[{'test_loss': 11679669248.0, 'test_mean_squared_error': 11679669248.0}]


| Metrics | Old Values  | New Values  |
|---------|-------------|-------------|
| RMSE    | 157166.6766 | 108072.5185 |
| R^2     | 0.1621      | 0.6038      |


### Appendix A



Here are our results from a linear regression model. We used StandardScaler for continuous variables and OneHotEncoder for categorical variables.

While 2021 data can be predicted well, test R2 dropped rapidly for 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| Year <= 2020 | 2021     | 0.76    |
| Year <= 2020 | **2022**     | 0.41    |
| Year <= 2020 | **2023**     | **0.10**   |



Similarly, a model trained on 2017 data predicts 2018-2021 well (with slight degradation for 2021), but its test R2 drops drastically for 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2017         | 2018     | 0.90    |
|              | 2019     | 0.89    |
|              | 2020     | 0.87    |
|              | 2021     | 0.72    |
|              | **2022**     | **0.37**    |
|              | **2023**     | **0.09**    |

With the test set fixed at year 2021, training on data from 2017-2020 still works well on the test data, with minimal degradation. Training sets closer to year 2021 generally do better.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2020         | 2021     | 0.81    |
| 2019         | 2021     | 0.75    |
| 2018         | 2021     | 0.73    |
| 2017         | 2021     | 0.72    |