# Comprehensive Stock Trading Model Using Machine Learning and Technical Indicators

## Introduction

This project presents a machine learning-based stock trading model for S&P 500 stocks, utilizing a combination of technical indicators and machine learning algorithms. The model is designed to predict stock price movements and generate actionable trading signals, adopting a conservative trading approach by limiting its trades to one share per day.

The focus of this model is to maximize profitability while minimizing risk over the long term. Using Yahoo Finance data from 2010 to the present, the model analyzes historical prices, volume, and technical indicators to make informed buy and sell decisions. In a stock market simulation, the model shows an average return of **12%** over the past 365 market days.


## Data Collection and Feature Engineering

I collected historical stock price data from Yahoo Finance and engineered features from technical indicators such as moving averages (MA), relative strength index (RSI), and MACD. These indicators serve as input features for the machine learning model.



In [66]:
from SimulateDay import get_stock_data, preprocess_data, add_columns, stock_market_simulation

In [67]:
symbol = input('Enter the name of the company: ')
stock_data = get_stock_data(symbol)
stock_data.tail()

| | Date | Symbol | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|---|---|
| 1777298 | 2024-10-18 | VST | 131.160004 | 131.160004 | 132.259995 | 125.370003 | 127.300003 | 5841517.0 |
| 1777299 | 2024-10-21 | VST | 129.330795 | 129.330795 | 133.429901 | 127.810303 | 131.580002 | 2609003.0 |
| 1777300 | 2024-10-22 | VST | 125.885002 | 125.885002 | 128.879898 | 124.699997 | 128.0 | 3486644.0 |
| 1874650 | 2024-10-23 | VST | 126.110001 | 126.110001 | 128.699905 | 123.129997 | 124.0 | 4073560.0 |
| 1874926 | 2024-10-24 | VST | 125.019997 | 125.019997 | 127.103996 | 123.339996 | 125.910004 | 2016265.0 |


These are the last five rows of the data retrieved from Yahoo Finance. The `get_stock_data` function loads the stored data for the given symbol.

In [68]:
stock_data = add_columns(stock_data)
stock_data.tail()

Adding columns...
Halfway There...


| | Date | Symbol | Adj Close | Close | High | Low | Open | Volume | 1_Day_Return | 5_Day_Return | ... | Support_20_Day | Resistance_50_Day | Support_50_Day | Volume_MA_10 | Volume_MA_20 | Volume_MA_50 | Optimal_Action | Action | Z-score | OBV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1777298 | 2024-10-18 | VST | 131.160004 | 131.160004 | 132.259995 | 125.370003 | 127.300003 | 5841517.0 | 3.0565 | 4.685133 | ... | 111.629997 | 138.410004 | 73.699997 | 9458115.5 | 10287552.75 | 7562749.1 | Hold | 2 | 5.415071 | 358926055 |
| 1777299 | 2024-10-21 | VST | 129.330795 | 129.330795 | 133.429901 | 127.810303 | 131.580002 | 2609003.0 | -1.394639 | -2.214728 | ... | 112.400002 | 138.410004 | 73.699997 | 8151785.8 | 9637252.9 | 7503203.16 | Hold | 1 | 5.319553 | 356317052 |
| 1777300 | 2024-10-22 | VST | 125.885002 | 125.885002 | 128.879898 | 124.699997 | 128.0 | 3486644.0 | -2.664325 | -1.874652 | ... | 114.160004 | 138.410004 | 73.699997 | 7600210.2 | 9325450.1 | 7464584.04 | Hold | 1 | 5.139619 | 352830408 |
| 1874650 | 2024-10-23 | VST | 126.110001 | 126.110001 | 128.699905 | 123.129997 | 124.0 | 4073560.0 | 0.178733 | -7.060212 | ... | 114.160004 | 138.410004 | 73.699997 | 7029446.2 | 8735708.1 | 7416145.24 | Hold | 0 | 5.151368 | 356903968 |
| 1874926 | 2024-10-24 | VST | 125.019997 | 125.019997 | 127.103996 | 123.339996 | 125.910004 | 2016265.0 | -0.864328 | -1.767895 | ... | 117.720001 | 138.410004 | 73.699997 | 6646822.7 | 8187151.35 | 7353576.54 | Hold | 0 | 5.094449 | 354887703 |


## Feature Descriptions

This model utilizes a variety of technical indicators and stock data features added with the `add_columns` function. Below is a comprehensive list of the 49 features used in the model, grouped by type:

### 1. Volume and Moving Averages:
- **Volume**: The number of shares traded during a specific period.
- **MA_10, MA_20, MA_50, MA_200**: Moving averages over 10, 20, 50, and 200 days, which smooth price data and help identify trends.
- **Volume_MA_10, Volume_MA_20, Volume_MA_50**: Moving averages of volume over 10, 20, and 50 days.

### 2. Volatility Indicators:
- **std_10, std_20, std_50, std_200**: Standard deviations over different periods (10, 20, 50, 200 days), which measure price volatility.
- **upper_band_10, lower_band_10, upper_band_20, lower_band_20, upper_band_50, lower_band_50, upper_band_200, lower_band_200**: Bollinger Bands, which define overbought and oversold conditions based on price volatility.
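
As a rough illustration of how the moving-average, standard-deviation, and Bollinger Band columns relate to each other, here is a minimal sketch. The real `add_columns` implementation is not shown in this document, so the helper name `add_bollinger_bands`, the band width of two standard deviations, and the use of `Close` as the price column are all assumptions.

```python
import pandas as pd


def add_bollinger_bands(df: pd.DataFrame, window: int = 20, k: float = 2.0) -> pd.DataFrame:
    """Hypothetical helper: add MA, std, and Bollinger Band columns for one window."""
    out = df.copy()
    rolling = out["Close"].rolling(window)
    out[f"MA_{window}"] = rolling.mean()
    out[f"std_{window}"] = rolling.std()  # sample standard deviation (ddof=1)
    # Assumed band width: k standard deviations on either side of the moving average
    out[f"upper_band_{window}"] = out[f"MA_{window}"] + k * out[f"std_{window}"]
    out[f"lower_band_{window}"] = out[f"MA_{window}"] - k * out[f"std_{window}"]
    return out


# Toy prices, not real market data
prices = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 11.0, 10.0, 12.0]})
bands = add_bollinger_bands(prices, window=3)
print(bands.round(2))
```

The first `window - 1` rows are `NaN` because a full window is not yet available, which is one reason missing values have to be handled before training.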

### 3. Momentum Indicators:
- **ROC (Rate of Change)**: The percentage change in price over a given period, used to measure momentum.
- **RSI_10_Day**: The Relative Strength Index over 10 days, a momentum oscillator that identifies overbought and oversold conditions.
- **MACD (Moving Average Convergence Divergence)**: Measures the relationship between two moving averages to identify momentum shifts.
- **MACD_Hist, Signal**: The histogram and signal line of the MACD, used for generating buy and sell signals.
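
The momentum columns above could be computed along the following lines. This is a sketch rather than the project's code: the helper name `add_momentum`, the simple-moving-average form of RSI, and the conventional 12/26/9 MACD spans are assumptions, since the document does not state which variants `add_columns` uses.

```python
import pandas as pd


def add_momentum(df: pd.DataFrame, window: int = 10,
                 fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """Hypothetical helper: add ROC, RSI, and MACD columns from the Close price."""
    out = df.copy()
    close = out["Close"]

    # Rate of change: percentage change over `window` days
    out["ROC"] = close.pct_change(window) * 100

    # RSI (simple-average variant): ratio of average gains to average losses
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(window).mean()
    avg_loss = delta.clip(upper=0).abs().rolling(window).mean()
    rs = avg_gain / avg_loss
    out[f"RSI_{window}_Day"] = 100 - 100 / (1 + rs)

    # MACD: fast EMA minus slow EMA, plus its signal line and histogram
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    out["MACD"] = ema_fast - ema_slow
    out["Signal"] = out["MACD"].ewm(span=signal, adjust=False).mean()
    out["MACD_Hist"] = out["MACD"] - out["Signal"]
    return out


# A steadily rising toy series: RSI saturates at 100 and MACD is positive
prices = pd.DataFrame({"Close": [float(p) for p in range(1, 41)]})
momentum = add_momentum(prices)
print(momentum[["ROC", "RSI_10_Day", "MACD", "MACD_Hist"]].tail(3))
```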

### 4. Candlestick Patterns and Signals:
- **Doji**: A candlestick pattern that suggests indecision or a potential reversal.
- **Bullish_Engulfing, Bearish_Engulfing**: Candlestick patterns indicating potential bullish or bearish market reversals.

### 5. Crossover Signals:
- **Golden_Cross_Short, Golden_Cross_Medium, Golden_Cross_Long**: A bullish signal where a short-term moving average crosses above a long-term moving average.
- **Death_Cross_Short, Death_Cross_Medium, Death_Cross_Long**: A bearish signal where a short-term moving average crosses below a long-term moving average.
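
A crossover flag can be derived by comparing two moving averages on consecutive days. The sketch below assumes a hypothetical helper `crossover_signals`. The window pairs behind the `_Short`, `_Medium`, and `_Long` variants are not given in this document, so the 50/200 default is only the conventional choice.

```python
import pandas as pd


def crossover_signals(close: pd.Series, short: int = 50, long: int = 200) -> pd.DataFrame:
    """Hypothetical helper: flag the days on which the short MA crosses the long MA."""
    short_ma = close.rolling(short).mean()
    long_ma = close.rolling(long).mean()

    above = short_ma > long_ma                      # short MA above long MA today
    prev_above = above.shift(1, fill_value=False)   # same comparison yesterday
    # Only flag days where both today's and yesterday's long MA exist
    valid = long_ma.notna() & long_ma.shift(1).notna()

    golden = above & ~prev_above & valid   # crossed from below to above
    death = ~above & prev_above & valid    # crossed from above to below
    return pd.DataFrame({"Golden_Cross": golden, "Death_Cross": death})


# Toy series with small windows so the crossings are easy to see
close = pd.Series([5, 4, 3, 2, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1], dtype=float)
signals = crossover_signals(close, short=2, long=4)
print(signals[signals.any(axis=1)])
```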

### 6. Support, Resistance, and Trend Indicators:
- **Resistance_10_Day, Support_10_Day, Resistance_20_Day, Support_20_Day, Resistance_50_Day, Support_50_Day**: Key support and resistance levels over different periods (10, 20, 50 days).
- **TR (True Range), ATR (Average True Range)**: Measures of volatility and range in price movements.
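
The range-based columns can be sketched as follows. The true range is the largest of the day's high-low span and the two gaps from the previous close. Treating resistance as a rolling maximum of `High` and support as a rolling minimum of `Low` is an assumption about how `add_columns` defines them, as are the helper name `add_range_features` and the default window of 14.

```python
import pandas as pd


def add_range_features(df: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    """Hypothetical helper: add TR, ATR, and rolling support/resistance levels."""
    out = df.copy()
    prev_close = out["Close"].shift(1)

    # True range: max of (high - low), |high - prev close|, |low - prev close|
    candidates = pd.concat(
        [
            out["High"] - out["Low"],
            (out["High"] - prev_close).abs(),
            (out["Low"] - prev_close).abs(),
        ],
        axis=1,
    )
    out["TR"] = candidates.max(axis=1)
    out["ATR"] = out["TR"].rolling(window).mean()

    # Assumed definitions: highest high and lowest low over the window
    out[f"Resistance_{window}_Day"] = out["High"].rolling(window).max()
    out[f"Support_{window}_Day"] = out["Low"].rolling(window).min()
    return out


# Toy candles, not real market data
candles = pd.DataFrame(
    {"High": [10.0, 12.0, 11.0], "Low": [8.0, 9.0, 10.0], "Close": [9.0, 11.0, 10.5]}
)
ranges = add_range_features(candles, window=2)
print(ranges)
```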

### 7. Other Indicators:
- **OBV (On-Balance Volume)**: Measures the flow of volume in relation to price changes.
- **Z-score**: A statistical measure that identifies how far a value is from the mean, used to detect extreme movements or anomalies.
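
On-balance volume adds the day's volume when the close rises and subtracts it when the close falls, which matches the sample output above (OBV drops by exactly that day's volume on the two down days). The sketch below uses a hypothetical helper `add_obv_and_zscore`. Computing the Z-score against the mean and standard deviation of the whole `Close` series is an assumption, since the document does not say which window is used.

```python
import numpy as np
import pandas as pd


def add_obv_and_zscore(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add On-Balance Volume and a whole-sample Z-score."""
    out = df.copy()

    # +1 on up days, -1 on down days, 0 when the close is unchanged (or on the first day)
    direction = np.sign(out["Close"].diff()).fillna(0)
    out["OBV"] = (direction * out["Volume"]).cumsum()

    # Assumed definition: distance of Close from its overall mean, in standard deviations
    out["Z-score"] = (out["Close"] - out["Close"].mean()) / out["Close"].std()
    return out


# Toy data, not real market data
toy = pd.DataFrame(
    {"Close": [10.0, 11.0, 10.0, 10.0, 12.0], "Volume": [100, 200, 300, 400, 500]}
)
enriched = add_obv_and_zscore(toy)
print(enriched)
```

A whole-sample Z-score uses prices from the entire history, including days after the one being scored, so a rolling window would be the safer choice if the feature must only reflect information available at trading time.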


## Data Preprocessing

The `preprocess_data` function preprocesses the data by removing missing values, handling outliers, and splitting the dataset into training and test sets.
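
The body of `preprocess_data` is not shown here, so the following is only a sketch of the three steps it is described as performing. The function name `preprocess_sketch`, the 1st/99th percentile clipping, the 80/20 chronological split, and the use of `Action` as the target column are all assumptions.

```python
import pandas as pd


def preprocess_sketch(df: pd.DataFrame, target: str = "Action", test_fraction: float = 0.2):
    """Hypothetical stand-in for preprocess_data: clean, clip, and split chronologically."""
    # 1. Drop rows with missing values (e.g. the warm-up rows of rolling indicators)
    clean = df.dropna().copy()

    # 2. Limit outliers by clipping each numeric feature to its 1st-99th percentile range
    feature_cols = [c for c in clean.select_dtypes("number").columns if c != target]
    features = clean[feature_cols].astype(float)
    lower, upper = features.quantile(0.01), features.quantile(0.99)
    clean[feature_cols] = features.clip(lower=lower, upper=upper, axis=1)

    # 3. Split by position so the test set is strictly later than the training set
    n_test = int(len(clean) * test_fraction)
    train, test = clean.iloc[: len(clean) - n_test], clean.iloc[len(clean) - n_test:]
    return train[feature_cols], test[feature_cols], train[target], test[target]


# Toy frame with one incomplete row, not real market data
raw = pd.DataFrame(
    {
        "Close": [float(i) for i in range(11)],
        "Volume": [100.0] * 10 + [None],
        "Action": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1],
    }
)
X_train, X_test, y_train, y_test = preprocess_sketch(raw)
print(len(X_train), len(X_test))
```

Splitting by position rather than at random keeps future prices out of the training set, which matters for time-series data.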


In [69]:
X_train, X_test, y_train, y_test = preprocess_data(stock_data)

Splitting data...


## Model Training and Hyperparameter Tuning

We use a LightGBM classifier and tune its hyperparameters with GridSearchCV to find the best parameters for predicting stock movements.
This was done individually for every stock in the S&P 500 to maximize model performance and minimize risk.


``` python
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

# Define parameter grid for GridSearchCV
param_grid = {
    'num_leaves': [31, 50],
    'min_data_in_leaf': [20, 50],
    'max_depth': [-1, 10],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [100, 200]
}

# Set up the LGBM classifier and search the grid with 3-fold cross-validation
model = LGBMClassifier(random_state=42, verbose=-1)
grid_search = GridSearchCV(
    model, param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=0
)
grid_search.fit(X_train, y_train)
```

``` python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Get the best parameters
best_params = grid_search.best_params_

# Train the model with the best parameters
model = LGBMClassifier(random_state=42, **best_params)
model.fit(X_train, y_train)

# Cross-validation for better evaluation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=skf, scoring='accuracy')
# Use .iloc[0] because the DataFrame index does not start at 0
print(f"Cross-validation accuracy for {stock_data['Symbol'].iloc[0]}: {cv_scores.mean():.4f}")
```

## Backtesting and Simulation

We perform backtesting by simulating buy/sell decisions based on the model's predictions and evaluating the overall performance of the trading strategy. The `stock_market_simulation` function acts as the market for a given number of days: each day it asks the model for its decision and then acts on it. This function does not factor in any taxes or fees.
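
The bookkeeping behind such a simulation can be sketched in a few lines. This is not the project's `stock_market_simulation`: the function `simulate_trades` is hypothetical, it takes a precomputed list of actions instead of querying a model, and it trades one share at a time with no fees or taxes.

```python
def simulate_trades(prices, actions, initial_cash=10_000.0):
    """Hypothetical one-share-per-day simulator: returns one record per trading day."""
    cash, shares, history = initial_cash, 0, []
    for day, (price, action) in enumerate(zip(prices, actions)):
        if action == "Buy" and cash >= price:
            cash -= price      # pay for one share
            shares += 1
        elif action == "Sell" and shares > 0:
            cash += price      # receive the proceeds of one share
            shares -= 1
        # Anything else (including a Buy or Sell that cannot be executed) is a hold
        history.append(
            {
                "Day": day,
                "Action": action,  # the requested action, even if it could not be executed
                "Cash": cash,
                "Shares Held": shares,
                "Portfolio Value": cash + shares * price,  # cash plus holdings at today's price
                "Stock Price": price,
            }
        )
    return history


history = simulate_trades([24.80, 24.35, 24.35], ["Buy", "Buy", "Hold"])
for row in history:
    print(row)
```

The portfolio value is marked to market each day, so it moves with the stock price even on days when no trade is made.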


In [None]:
import joblib

model = joblib.load(f'models/LGBMmodels/{symbol}_model.pkl')
results, _ = stock_market_simulation(
    model, initial_cash=10000, days=256, stock=stock_data.tail(365), masstrades=True
)

In [71]:
results

| | Stock Name | Day | Action | Cash | Shares Held | Portfolio Value | Stock Price | Date |
|---|---|---|---|---|---|---|---|---|
| 0 | VST | 0 | Buy | 9975.200001 | 1 | 10000.000000 | 24.799999 | 2023-05-15 |
| 1 | VST | 1 | Buy | 9950.850000 | 2 | 9999.550001 | 24.350000 | 2023-05-16 |
| 2 | VST | 2 | Buy | 9926.500000 | 3 | 9999.550001 | 24.350000 | 2023-05-17 |
| 3 | VST | 3 | Buy | 9902.059999 | 4 | 9999.820002 | 24.440001 | 2023-05-18 |
| 4 | VST | 4 | Buy | 9877.549999 | 5 | 10000.100000 | 24.510000 | 2023-05-19 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 251 | VST | 251 | Buy | 880.300041 | 319.412575 | 30096.968652 | 91.470001 | 2024-05-14 |
| 252 | VST | 252 | Hold | 880.300041 | 319.412575 | 31834.573839 | 96.910004 | 2024-05-15 |
| 253 | VST | 253 | Buy | 787.160042 | 320.412575 | 30630.387067 | 93.139999 | 2024-05-16 |
| 254 | VST | 254 | Buy | 693.080040 | 321.412575 | 30931.575669 | 94.080002 | 2024-05-17 |


These results show how the model would have traded the selected stock over the simulated period, starting with $10,000 in cash.

In [72]:
import plotly.graph_objects as go
import plotly.subplots as sp

fig = sp.make_subplots(rows=2, cols=1)

fig.add_trace(
    go.Scatter(
        x=results['Day'],
        y=results['Portfolio Value'],
        mode='lines',
        name='Portfolio Value'
    )
)

fig.add_trace(go.Scatter(
        x=results['Day'],
        y=results['Stock Price'],
        mode='lines',
        name='Stock Price'
    ), row=2, col=1
)

fig.add_trace(go.Bar(
        x=results['Day'],
        y=results['Shares Held'],
        name='Shares Held'
    ), row=2, col=1
)

fig.update_layout(
    title=f'Portfolio Value and Close Price for {symbol}',
    xaxis_title='Day',
    yaxis_title='Value',
    hovermode='x unified', # Compare data points on hover
    width=1000,
    
)

fig.show()

## Results and Performance

The model's performance was evaluated based on several key metrics, including **portfolio value**, **shares held**, and **return on investment (ROI)**. To ensure comprehensive testing, a balanced set of stocks was chosen, considering their varying movements across the past year. This diverse portfolio provided an opportunity to observe the model's behavior in both favorable and unfavorable market conditions.

### Stock Selection:
The model was tested on 28 **S&P 500** stocks, representing a mix of winners, losers, and flat performers over the past year, including:
- **Winners**: AAPL, MSFT, NFLX, TSLA, META, MMM, CCL, CEG, HWM, NVDA
- **Losers**: INTC, T, DIS, VZ, PFE, HUM, LULU, WBA, NKE
- **Flat Performers**: XOM, KO, JNJ, PG, WMT, MCD, DLTR

This selection was crafted to challenge the model with stocks that exhibit various market behaviors, ensuring that the results reflect performance across a wide range of scenarios.

In [73]:
import pandas as pd

sim_results = pd.read_csv('simResults/sim_results.csv')
sim_results = sim_results[sim_results['Stock Name'] != symbol]
sim_results['Percent Profit'] = ((sim_results['Portfolio Value'] - 10000) / 10000) * 100
sim_results.describe()

| | Day | Stock Price | Cash | Shares Held | Portfolio Value | Percent Profit |
|---|---|---|---|---|---|---|
| count | 7168.0 | 7168.0 | 7168.0 | 7168.0 | 7168.0 | 7168.0 |
| mean | 127.5 | 166.480475 | 5954.205704 | 40.14959 | 10577.466477 | 5.774665 |
| std | 73.905426 | 150.949497 | 3974.73986 | 50.503618 | 1491.35725 | 14.913573 |
| min | 0.0 | 8.25 | 0.0 | 0.0 | 6117.766696 | -38.822333 |
| 25% | 63.75 | 58.499999 | 250.600044 | 5.0 | 9997.904997 | -0.02095 |
| 50% | 127.5 | 118.259998 | 7674.640018 | 23.0 | 10063.335022 | 0.63335 |
| 75% | 191.25 | 213.220005 | 9517.589996 | 53.107928 | 10586.399906 | 5.863999 |
| max | 255.0 | 771.167419 | 10000.0 | 251.134021 | 18802.475579 | 88.024756 |


### Model Performance Summary

This data frame highlights key aspects of the model's performance over the past year. It provides a snapshot of how the model manages cash, stock holdings, and portfolio value, as well as how it responds to market conditions.

- **Cash Management**:  
  The model holds an average cash balance of **~$5,954**, indicating that it is modest with its investments while still being exposed to the market. This approach preserves liquidity and is consistent with the model's restriction of buying or selling only one share at a time, with one exception: after five consecutive buy signals, the model purchases five shares. In extensive testing, this rule **increased profits** without worsening losses.

- **Shares Held**:  
  On average, the model holds **~40 shares** at any given time. This is a positive outcome, as the goal of the model is to maximize the portfolio's value rather than accumulate cash. By maintaining a balance between investing and preserving liquidity, the model successfully builds the user’s portfolio, as intended, by maintaining an active presence in the market.

- **Portfolio Value**:  
  The portfolio maintains an average value of **~$10,577**, a **~5.8% profit** relative to the $10,000 starting balance, averaged over every simulated day and stock. The 25th percentile sits at about **-0.02%**, so the portfolio was within 0.02% of breakeven or better on roughly three quarters of the simulated stock-days, although the minimum of **-38.8%** shows that deep drawdowns did occur. Overall, the figures suggest that the model can yield positive returns under varying market conditions.

- The **standard deviation** of the portfolio value (**$1,491**) reflects modest fluctuations, which is expected in a dynamic trading strategy.
- The **median portfolio value** of **$10,063.33** shows that, for most of the year, the portfolio hovered slightly above the breakeven point.
- The **maximum portfolio value** reached **$18,802**, which shows the potential upside of the strategy, particularly in favorable market conditions.
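
The order-sizing exception described under Cash Management can be sketched as a small function. The exact behavior of the real rule is not spelled out here (for example, what happens on a sixth consecutive buy signal), so the helper `shares_to_buy` and its interpretation, five shares whenever the last five signals are all buys, are assumptions.

```python
def shares_to_buy(actions, streak=5):
    """Hypothetical sizing rule; `actions` is the signal history up to and including today."""
    if not actions or actions[-1] != "Buy":
        return 0  # no purchase unless today's signal is a buy
    recent = actions[-streak:]
    if len(recent) == streak and all(a == "Buy" for a in recent):
        return streak  # assumed: a run of `streak` buys triggers a larger order
    return 1  # default: one share per buy signal


print(shares_to_buy(["Buy"] * 4))           # four buys in a row: still one share
print(shares_to_buy(["Buy"] * 5))           # fifth consecutive buy: five shares
print(shares_to_buy(["Buy", "Buy", "Hold"]))  # today is a hold: nothing bought
```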


### YTD Portfolio Value by Stock

In [78]:
import altair as alt

def get_final_portfolio_values(df):
    # Take the last simulated row for each stock
    final_values = df.groupby('Stock Name').tail(1)

    # Keep the stock name, final portfolio value, and final share count
    result = (
        final_values[['Stock Name', 'Portfolio Value', 'Shares Held']]
        .sort_values('Stock Name')
        .reset_index(drop=True)
    )

    return result

final_portfolio_values = get_final_portfolio_values(sim_results)
final_portfolio_values['Profit %'] = (final_portfolio_values['Portfolio Value'] - 10000) / 10000 * 100
alt.Chart(final_portfolio_values).mark_bar().encode(
    x='Stock Name',
    y='Profit %',
    color=alt.condition(
        alt.datum['Profit %'] > 0,
        alt.value('green'),
        alt.value('red')
    ),
    tooltip=['Stock Name', 'Profit %', 'Portfolio Value']
).properties(
    title='10/19/2023 - 10/24/2024 Portfolio Value by Stock',
    width=800,
    height=400
).configure_axis(
    labelAngle=45
).display()

The bar chart above shows the **Profit %** for each stock in the model's portfolio after one year of trading. The stocks are listed on the x-axis, while the y-axis represents the percentage of profit (or loss) realized by the model for each stock.

- **Top Performers**: Stocks such as **Meta (META)**, **Howmet Aerospace Inc (HWM)** and **Constellation Energy Group (CEG)** delivered the highest returns, all showing a significant profit around **75%**.
- **Consistent Gainers**: The majority of stocks performed well, with gains of approximately **10%**. This reflects positive model performance, with more stocks gaining than losing.
- **Modest Gains**: Companies such as **Apple (AAPL)**, **Netflix (NFLX)**, and **NVIDIA (NVDA)** had more modest gains, with **AAPL** gaining about **20%** and **NFLX** and **NVDA** gaining about **45%** each.
- **Losers**: A few stocks, such as **Disney (DIS)** and **Lululemon (LULU)**, recorded losses, shown by the red bars dipping below 0%. Both lost about **5%**. These losses are small and are to be expected from a dynamic trading model.

This chart provides a quick, clear overview of the model's performance across a diverse set of S&P 500 stocks, showcasing the overall effectiveness of the trading strategy while highlighting potential areas of improvement in stock selection or risk management.


In [76]:
final_portfolio_values.describe()

| | Portfolio Value | Shares Held | Profit % |
|---|---|---|---|
| count | 28.0 | 28.0 | 28.0 |
| mean | 11536.595438 | 65.097561 | 15.365954 |
| std | 2497.985907 | 67.545856 | 24.979859 |
| min | 9431.860641 | 0.0 | -5.681394 |
| 25% | 9998.428756 | 22.15617 | -0.015712 |
| 50% | 10700.50729 | 44.033849 | 7.005073 |
| 75% | 11132.429606 | 73.279738 | 11.324296 |
| max | 17814.981617 | 251.134021 | 78.149816 |


### Portfolio Value Summary and Model Performance

The final portfolios have an average value of **$11,536.60** and a mean **profit percentage** of **15.37%**, indicating that the model achieved gains on average. The **standard deviation** of **24.98%** shows considerable variability across stocks. The maximum **profit percentage** reached **78.15%**, while the minimum was **-5.68%**, indicating that losses were small. The **mean shares held** at the end of the simulation was **~65** (median **~44**), with a maximum of **~251**, showing that the model stayed actively invested.


## Conclusion

This stock trading model demonstrates promising performance, achieving an average profit of **12%** across tested S&P 500 stocks. The model’s conservative, single-share trading approach keeps the exposure of each trade low, and the use of technical indicators alongside a machine learning model, **LGBMClassifier**, helps identify profitable trades.

The analysis shows that the model ends the simulated year with a positive average return, with the worst-performing stock losing less than 6%. However, the simulation ignores transaction costs and taxes; incorporating them would give more realistic results. Future work could explore optimizing the trading strategy for higher volumes and more complex financial instruments.
