### **Setting Up Workspace**

In [1]:
# src/train_model.py
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split # We will explain why NOT to use this
from sklearn.metrics import classification_report, mean_absolute_error, r2_score
import joblib # For saving our trained models

### **Step 1: Load Clean Feature Data**

Our first step is to load the `hyp_a_features_{year}_present.parquet` file that the previous script, `feature_engineering.py`, created.

In [2]:
# --- Step 1: Load Feature Data ---
print("Step 1: Loading features...")

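year = 2015  # Start year of the feature file; reused later in the model filenames.
try:
    df = pd.read_parquet(f'../data/processed/hyp_a_features_{year}_present.parquet')
except FileNotFoundError:
    print("Error: The feature file was not found.")
    print("Please run the 'feature_engineering.py' script first.")
    raise SystemExit(1)  # Stop here; nothing below can run without the data.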

print("Features loaded successfully.")
df.info()

Step 1: Loading features...
Features loaded successfully.
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2754 entries, 2015-01-15 to 2025-09-19
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   day_of_week        2754 non-null   int64  
 1   asia_return        2754 non-null   float64
 2   asia_range         2754 non-null   float64
 3   atr_at_asia_close  2754 non-null   float64
 4   rsi_at_asia_close  2754 non-null   float64
 5   ema50_dist         2754 non-null   float64
 6   ema200_dist        2754 non-null   float64
 7   london_direction   2754 non-null   int64  
 8   london_return      2754 non-null   float64
dtypes: float64(7), int64(2)
memory usage: 215.2 KB
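
Because the split in Step 2 slices rows by position, it helps to confirm first that the index is sorted by date and that nothing is missing. The following is a minimal sketch of such a check, not part of the original script: `check_feature_frame` is a hypothetical helper, and the toy frame only stands in for `df`.

```python
import pandas as pd

def check_feature_frame(frame: pd.DataFrame) -> list[str]:
    """Return a list of problems that would make a positional time split unsafe."""
    problems = []
    if not isinstance(frame.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    if not frame.index.is_monotonic_increasing:
        problems.append("index is not sorted in ascending date order")
    if frame.index.has_duplicates:
        problems.append("index contains duplicate dates")
    if frame.isna().any().any():
        problems.append("frame contains missing values")
    return problems

# Toy stand-in for df: three trading days, deliberately out of order.
toy = pd.DataFrame(
    {"asia_return": [0.001, -0.002, 0.0005]},
    index=pd.to_datetime(["2015-01-16", "2015-01-15", "2015-01-19"]),
)
print(check_feature_frame(toy))               # Reports the unsorted index.
print(check_feature_frame(toy.sort_index()))  # Empty list: safe to split by position.
```

In the notebook itself the same helper could be called as `check_feature_frame(df)` right after loading.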


### **Step 2: Define Features (X) and Targets (y) and Split the Data**

This is one of the most important conceptual steps in machine learning. We need to separate our data into two groups:

- `X` (The Features): The information the model will use to make a prediction (e.g., asia_return, rsi_at_asia_close).
- `y` (The Target): The answer the model is trying to predict (e.g., london_direction).

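**Crucially, for time-series data, we CANNOT split the data randomly**, which is what `train_test_split` does by default. A random split lets the model train on days that come after some of the test days, so information from the future leaks into training. We must split chronologically to simulate reality: we train on the past and test on the more recent "future".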
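
The difference between the two splitting strategies can be checked directly. The sketch below uses a toy frame of 100 business days (made-up data, not the feature file) and counts how many training rows are dated after the first test day under each strategy.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the feature frame: 100 business days with one random feature.
dates = pd.bdate_range("2020-01-01", periods=100)
toy = pd.DataFrame({"feature": np.random.default_rng(0).normal(size=100)}, index=dates)

# Random split (the train_test_split default): rows are shuffled before splitting.
rand_train, rand_test = train_test_split(toy, test_size=0.2, random_state=0)

# Chronological split: first 80% of rows for training, last 20% for testing.
cut = int(len(toy) * 0.8)
chrono_train, chrono_test = toy.iloc[:cut], toy.iloc[cut:]

# Count training rows dated after the earliest test row (0 means no look-ahead).
leak_random = int((rand_train.index > rand_test.index.min()).sum())
leak_chrono = int((chrono_train.index > chrono_test.index.min()).sum())
print(f"Random split: {leak_random} training rows are later than the first test day")
print(f"Chronological split: {leak_chrono} training rows are later than the first test day")
```

The chronological count is zero by construction, while the shuffled split mixes later days into the training set.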

In [3]:
# --- Step 2: Define Features, Targets, and Split Data ---
print("\nStep 2: Preparing data for training...")

# 'X' is our feature set. We drop the two target columns.
X = df.drop(columns=['london_direction', 'london_return'])

# We have two separate targets we want to predict.
y_class = df['london_direction']  # For our classification model
y_reg = df['london_return']      # For our regression model

# --- The Time-Series Split ---
# We will use the first 80% of the data for training and the last 20% for testing.
# This ensures we are always testing on data that comes after our training data.
train_size = int(len(df) * 0.8)

X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train_class, y_test_class = y_class.iloc[:train_size], y_class.iloc[train_size:]
y_train_reg, y_test_reg = y_reg.iloc[:train_size], y_reg.iloc[train_size:]

print("Data split into training and testing sets:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape:  {X_test.shape}")


Step 2: Preparing data for training...
Data split into training and testing sets:
X_train shape: (2203, 7)
X_test shape:  (551, 7)
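
A single 80/20 cut yields only one out-of-sample window. One common extension, which this notebook does not use, is walk-forward validation with scikit-learn's `TimeSeriesSplit`. The sketch below assumes a placeholder array with the same 2754 rows as the feature frame; only the row order matters.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 2754  # Same row count as the feature frame above.
X_demo = np.arange(n_rows).reshape(-1, 1)  # Placeholder features; only row order matters.

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X_demo))
for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # Every training index precedes every test index, so no future data leaks in.
    print(f"Fold {fold}: train rows 0-{train_idx[-1]}, test rows {test_idx[0]}-{test_idx[-1]}")
```

Each fold trains on an expanding window of past rows and tests on the block that follows it, so a model can be scored on several "futures" instead of one.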


### **Step 3: Train the Classification Model (Hypothesis A1)**

Now we'll teach our first model to predict the direction (1 = bullish, 0 = bearish). We will use `XGBClassifier`.

In [4]:
# --- Step 3: Train Classification Model ---
print("\n--- Training Classification Model (Hypothesis A1) ---")

# Initialize the XGBoost Classifier model with some standard parameters.
# 'objective' tells it to perform binary (two-class) classification.
# 'eval_metric' is the metric reported on evaluation sets. No evaluation set or
# early stopping is configured here, so all 1000 trees are always built.
model_class = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    n_estimators=1000, # Number of decision trees (boosting rounds) to build.
    learning_rate=0.05,
    max_depth=5,
    use_label_encoder=False, # Ignored by recent XGBoost versions (see the warning in the output); safe to remove.
    random_state=42
)

# Train the model on our training data.
model_class.fit(X_train, y_train_class)

# --- Evaluate the Classification Model ---
print("\n--- Evaluating Classification Model ---")

# Make predictions on the unseen test data.
y_pred_class = model_class.predict(X_test)

# Print a report showing key metrics.
# Precision: Of all the "bullish" predictions, how many were correct?
# Recall: Of all the actual bullish days, how many did we correctly identify?
# F1-Score: A combined score of precision and recall.
print(classification_report(y_test_class, y_pred_class, target_names=['Bearish (0)', 'Bullish (1)']))


--- Training Classification Model (Hypothesis A1) ---


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



--- Evaluating Classification Model ---
              precision    recall  f1-score   support

 Bearish (0)       0.48      0.39      0.43       263
 Bullish (1)       0.52      0.61      0.56       288

    accuracy                           0.50       551
   macro avg       0.50      0.50      0.50       551
weighted avg       0.50      0.50      0.50       551
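
To put the 0.50 accuracy in context, it can be compared with a trivial baseline that always predicts the more frequent class. The sketch below rebuilds the test labels from the support column of the report (263 bearish, 288 bullish); `majority_baseline_accuracy` is a hypothetical helper, not part of the original script.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Rebuild the test labels from the support column: 263 bearish (0) and 288 bullish (1).
y_true = [0] * 263 + [1] * 288

baseline = majority_baseline_accuracy(y_true)
print(f"Always-bullish baseline accuracy: {baseline:.4f}")
```

The baseline works out to 288 / 551, roughly 0.52, which is slightly above the reported accuracy. On this test window the classifier is therefore not yet beating the simplest possible rule.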



### **Step 4: Train the Regression Model (Hypothesis A2)**

Next, we'll teach our second model to predict the actual return value (`london_return`). We will use `XGBRegressor`.

In [5]:
# --- Step 4: Train Regression Model ---
print("\n--- Training Regression Model (Hypothesis A2) ---")

# Initialize the XGBoost Regressor model.
# 'objective' tells it to minimize the squared error, which is standard for regression.
model_reg = xgb.XGBRegressor(
    objective='reg:squarederror',
    eval_metric='rmse', # Root Mean Squared Error
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=5,
    random_state=42
)

# Train the model on our training data.
model_reg.fit(X_train, y_train_reg)

# --- Evaluate the Regression Model ---
print("\n--- Evaluating Regression Model ---")

# Make predictions on the unseen test data.
y_pred_reg = model_reg.predict(X_test)

# Calculate and print key metrics.
mae = mean_absolute_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"Mean Absolute Error (MAE): {mae:.6f}")
print("MAE tells us, on average, how far off our return prediction was in percentage points.")
print(f"R-squared (R2 Score): {r2:.4f}")
print("R2 Score tells us how much of the variance in the returns our model can explain (closer to 1 is better).")


--- Training Regression Model (Hypothesis A2) ---

--- Evaluating Regression Model ---
Mean Absolute Error (MAE): 0.005311
MAE tells us, on average, how far off our return prediction was in percentage points.
R-squared (R2 Score): -0.4444
R2 Score tells us how much of the variance in the returns our model can explain (closer to 1 is better).


--- Training Regression Model (Hypothesis A2) ---

--- Evaluating Regression Model ---
Mean Absolute Error (MAE): 0.005979
MAE tells us, on average, how far off our return prediction was in percentage points.
R-squared (R2 Score): -0.7396
R2 Score tells us how much of the variance in the returns our model can explain (closer to 1 is better).
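
A negative R2 means the model's squared error on the test set is larger than that of a constant prediction equal to the mean of the test targets. The sketch below illustrates this with made-up numbers (not data from this notebook): predicting the mean scores 0, and predictions that point the wrong way score below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up illustrative values, not data from this notebook.
y_true = np.array([0.002, -0.001, 0.003, -0.002])
mean_pred = np.full_like(y_true, y_true.mean())        # Constant prediction at the test mean.
noisy_pred = np.array([-0.004, 0.005, -0.003, 0.004])  # Predictions that point the wrong way.

r2_mean = r2_score(y_true, mean_pred)
r2_noisy = r2_score(y_true, noisy_pred)
print(f"R2 of predicting the mean: {r2_mean:.4f}")
print(f"R2 of the poor predictions: {r2_noisy:.4f}")
```

Read this way, the negative scores reported above indicate that, on this test window, the regressor does worse than simply predicting the average return.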

### **Step 5: Save Your Trained Models**

The final step is to save our two trained models so we can load them later in our backtesting script without having to retrain them.

In [6]:
# --- Step 5: Save Trained Models ---
print("\nStep 5: Saving models...")

# Define the paths where the models will be saved.
class_model_path = f'../models/xgb_classifier_hyp_a_{year}_present.joblib'
reg_model_path = f'../models/xgb_regressor_hyp_a_{year}_present.joblib'

# Use joblib to dump the trained model objects into files.
joblib.dump(model_class, class_model_path)
joblib.dump(model_reg, reg_model_path)

print(f"Classification model saved to: {class_model_path}")
print(f"Regression model saved to: {reg_model_path}")


Step 5: Saving models...
Classification model saved to: ../models/xgb_classifier_hyp_a_2015_present.joblib
Regression model saved to: ../models/xgb_regressor_hyp_a_2015_present.joblib
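
In the backtesting script, the saved models can be restored with `joblib.load`. The sketch below shows the round trip with a small scikit-learn model and a temporary directory so that it runs without the files above; the stand-in model and the file name `demo_model.joblib` are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model; the real script would load the XGBoost files saved above.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.joblib"
    joblib.dump(model, model_path)      # Same call used in Step 5.
    restored = joblib.load(model_path)  # What a later script would run.

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(f"Predictions match after reload: {same_predictions}")
```

For the models in this notebook, the equivalent call would be `joblib.load(class_model_path)`. Files written this way are pickles, so they are best loaded with the same library versions that created them.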
