# Data Pipeline
For this project, the choice of trading strategies determines the success of the trading bot.<br>
The pipeline has three steps, each in its own folder under `src/`:<br>
`backtesting/` -> `data/` -> `models/`<br>
1. `backtesting/` - We try out candidate strategies with `backtrader` to find an approach based on technical indicators that would have turned a profit historically.
2. `data/` - We apply those technical indicators to a `DataFrame` of historical data for a stock of our choice and add the signals that indicate when to buy or sell. The resulting `DataFrame` contains all the features and labels needed to train a model.
3. `models/` - We train the machine learning models on the modified `DataFrame`s and validate that they can accurately predict the buy and sell signals from the technical indicators.<br>
This pipeline lets us check the technical indicators against historical data before the machine learning models train on them.

# Backtesting
We first test the strategy on historical data with `backtrader`, using technical indicators of our choice. Once we decide which technical indicators to use, we apply them in the `process_data()` method in `src/data/data_processing.py`. After defining these modifications to the `DataFrame`, we can start training the ML models.
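To make concrete what a backtest measures, here is a minimal, library-free sketch. It is not `backtrader` and not the project's strategy code: the function name `simulate_long_only`, the starting cash, and the all-in/all-out trading rule are assumptions chosen for illustration, and the sketch ignores commissions and slippage.

```python
import pandas as pd


def simulate_long_only(close: pd.Series, signal: pd.Series, cash: float = 10_000.0) -> float:
    """Replay BUY/SELL/HOLD signals over closing prices and return the final portfolio value.

    Hypothetical helper: buys with all available cash on BUY, sells everything on SELL.
    """
    shares = 0.0
    for price, action in zip(close, signal):
        if action == "BUY" and shares == 0.0:
            shares = cash / price  # go fully long
            cash = 0.0
        elif action == "SELL" and shares > 0.0:
            cash = shares * price  # close the position
            shares = 0.0
    # Value any open position at the last closing price
    return cash + shares * close.iloc[-1]


# Toy prices and signals (made up for the example, not market data)
demo_close = pd.Series([10.0, 12.0, 15.0, 9.0])
demo_signal = pd.Series(["BUY", "HOLD", "SELL", "HOLD"])
final_value = simulate_long_only(demo_close, demo_signal)
```

A strategy is a candidate for the next step when this kind of replay ends above the starting cash across the historical periods of interest.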

In [2]:
import os
import sys

# Add the parent directory to the import path so the `data` package resolves
sys.path.insert(0, '../')

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from data import data_processing as dp  # provides DataFrames with technical indicators

# Get DataFrame
In `src/data/data_processing.py`, the `get_df` method loads a `DataFrame` with our technical indicators for a specified ticker. To try a new strategy, modify the technical indicators in the `process_data` method. That method also adds the signals to the `DataFrame` so that ML models can train on them.

In [6]:
df = dp.get_df("AAPL")
df

[*********************100%***********************]  1 of 1 completed


| Date | Close | High | Low | Open | Volume | EMA(12) | EMA(26) | EMA(12-26) | Signal Line(9) | MACD | Signal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1980-12-12 | 0.098597 | 0.099025 | 0.098597 | 0.098597 | 469033600 | 0.098597 | 0.098597 | 0.000000 | 0.000000 | 0 | HOLD |
| 1980-12-15 | 0.093453 | 0.093881 | 0.093453 | 0.093881 | 175884800 | 0.097805 | 0.098216 | -0.000410 | -0.000082 | 0 | HOLD |
| 1980-12-16 | 0.086594 | 0.087022 | 0.086594 | 0.087022 | 105728000 | 0.096080 | 0.097355 | -0.001274 | -0.000321 | 0 | HOLD |
| 1980-12-17 | 0.088737 | 0.089166 | 0.088737 | 0.088737 | 86441600 | 0.094951 | 0.096716 | -0.001766 | -0.000610 | 0 | HOLD |
| 1980-12-18 | 0.091310 | 0.091738 | 0.091310 | 0.091310 | 73449600 | 0.094390 | 0.096316 | -0.001925 | -0.000873 | 0 | HOLD |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-05-07 | 195.992981 | 199.178806 | 192.996910 | 198.909155 | 68536700 | 202.929953 | 205.212675 | -2.282722 | -2.417410 | 0 | HOLD |
| 2025-05-08 | 197.231369 | 199.788014 | 194.425036 | 197.461064 | 50478900 | 202.053248 | 204.621467 | -2.568219 | -2.447572 | -1 | SELL |
| 2025-05-09 | 198.270004 | 200.277366 | 197.281295 | 198.739390 | 36453900 | 201.471210 | 204.150988 | -2.679778 | -2.494013 | 0 | HOLD |
| 2025-05-12 | 210.789993 | 211.270004 | 206.750000 | 210.970001 | 63677800 | 202.904869 | 204.642766 | -1.737897 | -2.342790 | 1 | BUY |
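
The columns in the output above are consistent with a standard MACD crossover: two exponential moving averages of the close, their difference, a 9-period average of that difference, and a signal whenever the two lines cross. The actual `process_data` implementation is not shown here, so the following is only a sketch under that assumption. The function name `add_macd_signals` is hypothetical, and it expects a flat `Close` column rather than the multi-level columns shown above.

```python
import numpy as np
import pandas as pd


def add_macd_signals(prices: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `prices` with MACD columns and BUY/SELL/HOLD labels.

    Hypothetical helper; assumes a flat 'Close' column.
    """
    out = prices.copy()
    close = out["Close"]

    # Recursive EMAs seeded with the first close (adjust=False)
    out["EMA(12)"] = close.ewm(span=12, adjust=False).mean()
    out["EMA(26)"] = close.ewm(span=26, adjust=False).mean()
    out["EMA(12-26)"] = out["EMA(12)"] - out["EMA(26)"]
    out["Signal Line(9)"] = out["EMA(12-26)"].ewm(span=9, adjust=False).mean()

    # A crossover happens when the MACD line changes sides of the signal line
    above = out["EMA(12-26)"] > out["Signal Line(9)"]
    prev_above = above.shift(1, fill_value=False)
    cross_up = above & ~prev_above
    cross_down = ~above & prev_above

    out["MACD"] = np.where(cross_up, 1, np.where(cross_down, -1, 0))
    out["Signal"] = out["MACD"].map({1: "BUY", -1: "SELL", 0: "HOLD"})
    return out


# Toy rising series (made up for the example, not market data)
macd_demo = add_macd_signals(pd.DataFrame({"Close": [1.0, 2.0, 3.0]}))
```

Because crossovers are rare events, most rows end up labeled `HOLD`, which matches the imbalance visible in the output.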


# Construct ML model
Once we have a `DataFrame` with the technical indicators we want, we can construct the ML model. We use scikit-learn to simplify the process: we scale the data, set up the pipeline, and split the data. However, most of the signals in the `DataFrame` are "HOLD", so we need substantial preprocessing to address that imbalance and give the model more diverse labels.
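
The scale, pipeline, and split steps described above could look roughly like the sketch below. It runs on synthetic stand-in data rather than the project's `DataFrame`, and the feature names, the random forest settings, and the 80/20 time-ordered split are assumptions for illustration, not the project's final configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT market data): two indicator-like features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "macd_line": rng.normal(size=300),
    "signal_line": rng.normal(size=300),
})
# Toy labels derived from the features so the example has something to learn
y_demo = pd.Series(np.where(X_demo["macd_line"] > X_demo["signal_line"], "BUY", "SELL"))

# Scaling and the classifier live in one object, so the scaler is fit
# only on the rows passed to fit()
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Keep row order: the first 80% of rows train, the last 20% are held out
cut = int(len(X_demo) * 0.8)
model.fit(X_demo.iloc[:cut], y_demo.iloc[:cut])
demo_preds = model.predict(X_demo.iloc[cut:])
```

Wrapping the scaler in the `Pipeline` keeps statistics from the held-out rows out of the training step.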

# Train ML Model
We train the model on the training data, which is one chunk of the `DataFrame` we got from yfinance. Once it is trained, we evaluate it on another chunk held out as validation data and adjust parameters as needed. Finally, we run the model on the test data for the final analysis.
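
One way to carve out the three chunks is a split that preserves row order, sketched below. The helper name `chronological_split` and the 70/15/15 proportions are assumptions for illustration; the cell that follows uses its own approach.

```python
import pandas as pd


def chronological_split(frame: pd.DataFrame, train_frac: float = 0.7, val_frac: float = 0.15):
    """Split rows, in their existing order, into train, validation, and test parts.

    Hypothetical helper; whatever remains after train and validation becomes test.
    """
    n = len(frame)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return frame.iloc[:train_end], frame.iloc[train_end:val_end], frame.iloc[val_end:]


# Toy frame with 100 ordered rows (made up for the example)
split_demo = pd.DataFrame({"x": range(100)})
train_part, val_part, test_part = chronological_split(split_demo)
```

With time series data, keeping the chunks in order means the model is always evaluated on rows that come after the ones it trained on.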

In [58]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from imblearn.under_sampling import RandomUnderSampler
import warnings

X = df.iloc[:, [0, 4]]  # Features: only the Close and Volume columns
y = df.iloc[:, 10]  # Label: the Signal column (BUY / SELL / HOLD)

rus = RandomUnderSampler(random_state=42) 
X_res, y_res = rus.fit_resample(X, y) 

X_res

# X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)

# # Initialize RandomForestClassifier
# rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# # Fit the classifier to the training data
# rf_classifier.fit(X_train, y_train)

# # Make predictions
# y_pred = rf_classifier.predict(X_test)

# accuracy = accuracy_score(y_test, y_pred)
# classification_rep = classification_report(y_test, y_pred)

# # Print the results
# print(f"Accuracy: {accuracy:.2f}")
# print("\nClassification Report:\n", classification_rep)


| Date | Close | Volume |
|---|---|---|
| 2024-08-14 | 220.943359 | 41960600 |
| 1988-05-31 | 0.287201 | 123200000 |
| 1999-06-03 | 0.356424 | 488510400 |
| 1984-02-16 | 0.087022 | 105235200 |
| 2018-02-14 | 39.439358 | 162579600 |
| ... | ... | ... |
| 2024-10-25 | 230.599411 | 38802300 |
| 2024-12-30 | 251.593079 | 35557500 |
| 2025-03-03 | 237.718262 | 47184000 |
| 2025-04-03 | 202.923904 | 103419000 |


# Export Model 
Once the model is trained, we can export it with the `pickle` library or an equivalent. A driver can then load the model, feed it live data from the `finnhub` API, recompute the technical indicators used in training, and let the bot decide on and execute trades through the Alpaca API.
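
The export and reload step can be sketched with `pickle` as follows. The toy model, its training data, and the file name `model_demo.pkl` are assumptions for illustration; the live-data and trade-execution parts of the driver are not shown.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic training set (NOT market data), just to have a fitted model
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array(["SELL", "SELL", "BUY", "BUY"])
trained = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# Write the fitted model to disk, then load it back as a driver script would
model_path = Path(tempfile.gettempdir()) / "model_demo.pkl"  # hypothetical file name
with open(model_path, "wb") as fh:
    pickle.dump(trained, fh)
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

# The restored model should reproduce the original model's predictions
same_predictions = bool((restored.predict(X_toy) == trained.predict(X_toy)).all())
```

Pickle files should only be loaded from trusted sources, and the driver should run the same scikit-learn version that produced the file.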

# Test model on new data 