# Stock Market Predictions

## Goals:

- identify factors that contribute to a stock rising or falling over one month
- use a machine learning model to classify stocks as either up or down

## Imports

In [4]:
import pandas as pd
import os
import yfinance as yf
import random
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier, plot_tree


import shap





# Acquire

- Data was obtained using the `yfinance` open-source Python library, which accesses the free Yahoo Finance API.
- The data was pulled on May 3, 2025.
- Ticker symbols were sourced from [this CSV](https://datahub.io/core/nasdaq-listings/r/nasdaq-listed-symbols.csv) and used to query the Yahoo Finance API via `yfinance`.
- There are 4835 rows, each representing one ticker and its attributes.
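
A minimal sketch of how such a pull might look. The notebook does not show its acquisition code, so everything here is an assumption: `build_rows` and `yahoo_info` are hypothetical helpers, and treating the data-dictionary names as keys of the `info` dict returned by `yfinance.Ticker` is inferred from the column names rather than confirmed by the source.

```python
from typing import Callable, Iterable

# Fields named in the data dictionary; using them as keys of the
# yfinance `info` dict is an assumption of this sketch.
FIELDS = ["trailingEps", "forwardEps", "revenuePerShare",
          "quickRatio", "currentRatio", "debtToEquity"]


def yahoo_info(symbol: str) -> dict:
    """Look up one ticker through yfinance (needs network access)."""
    import yfinance as yf  # imported lazily so build_rows also works offline
    return yf.Ticker(symbol).info


def build_rows(symbols: Iterable[str],
               fetch_info: Callable[[str], dict] = yahoo_info) -> list:
    """Build one dict per ticker, leaving fields empty when a lookup fails."""
    rows = []
    for symbol in symbols:
        try:
            info = fetch_info(symbol)
        except Exception:
            info = {}  # unreachable or unknown ticker: keep the row, no values
        row = {"symbol": symbol}
        row.update({field: info.get(field) for field in FIELDS})
        rows.append(row)
    return rows
```

Passing the fetch function in as a parameter keeps the row-building logic testable without network access; the resulting list of dicts can be handed to `pd.DataFrame(rows)`.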

# Prepare
- The data came through fairly clean.
- Nulls were replaced with 0 for analysis purposes.

## Data Dictionary

| Feature | Definition |
|---------|------------|
| `symbol` | Stock ticker |
| `gain/lost` (target) | `up` if the closing price rose between April 3 and May 2, 2025; `down` otherwise |
| `trailingEps` | Trailing earnings per share (EPS) |
| `forwardEps` | Forward earnings per share (EPS) |
| `revenuePerShare` | Revenue per share |
| `quickRatio` | Quick ratio |
| `currentRatio` | Current ratio |
| `debtToEquity` | Debt-to-equity ratio |

In [5]:
df = pd.read_csv('evaluate3mo.csv')  # loads the .csv of Yahoo Finance data pulled on May 3, 2025

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,symbol,averageAnalystRating,previousClose,trailingPE,forwardPE,trailingEps,forwardEps,profitMargins,revenuePerShare,fiftyTwoWeekHighChangePercent,fiftyTwoWeekLowChangePercent,quickRatio,currentRatio,debtToEquity,septClose,difference,percent_diff
0,0,EQIX,1.5 - Buy,861.97,91.712036,65.45964,9.55,13.38,0.10492,91.869,-0.11889,0.28022,1.248,1.437,141.235,767.93,-94.04,-0.109099
1,1,PABU,0,60.33,30.233545,0.0,0.0,0.0,0.0,0.0,-0.094337,0.166867,0.0,0.0,0.0,69.46,9.13,0.151334
2,2,WAFD,0,28.6,10.850746,8.478134,2.68,3.43,0.31408,8.984,-0.247022,0.224421,0.0,0.0,0.0,31.95,3.35,0.117133
3,3,EPRX,0,3.98,0.0,-6.323718,-0.76,0.0,0.0,0.0,-0.106027,0.820455,10.74,11.096,0.226,5.32,1.34,0.336683
4,4,SIBN,0,13.68,0.0,-18.653334,-0.75,-0.75,-0.18491,4.032,-0.269833,0.195726,6.553,7.658,22.45,,,


In [7]:
df['gain/lost'] = df['difference'].apply(lambda x: 'up' if x > 0 else 'down')
# adds a gain/lost column based on `difference`: the close price on May 2, 2025 minus the close price on April 3, 2025
# if positive, the stock rose in price over the month ('up'); otherwise it is labeled 'down'
# note: rows with a missing difference (NaN) are also labeled 'down', because NaN > 0 is False
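
One detail of this labeling rule is easy to miss: `NaN > 0` evaluates to `False`, so a ticker with no recorded `difference` (such as the `SIBN` row in the `df.head()` output) ends up labeled 'down'. Below is a minimal sketch of an alternative, assuming it is preferable to leave such rows unlabeled so they can be dropped later. `label_direction` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd


def label_direction(difference: pd.Series) -> pd.Series:
    """Label positive differences 'up', other numbers 'down', NaN as missing."""
    labels = pd.Series(
        np.where(difference > 0, "up", "down"),
        index=difference.index,
        dtype="object",
    )
    # Series.where keeps values where the condition holds and puts NaN elsewhere.
    return labels.where(difference.notna())


# Made-up differences for illustration, including one missing value.
example = pd.Series([9.13, -94.04, np.nan, 0.0])
print(label_direction(example).tolist())
```

Rows left unlabeled this way could then be removed with `dropna(subset=['gain/lost'])` before modeling.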

In [8]:
df = df.drop(columns=['Unnamed: 0'])
# drops a leftover index column

# Explore

### Is there a linear correlation between any of the variables and stock price?

In [9]:
combined_df = pd.read_csv("all.csv")

In [10]:
# casts the numeric columns from string values to float
numeric_cols = ['previousClose', 'trailingPE', 'forwardPE', 'trailingEps', 'forwardEps', 'profitMargins',
                'revenuePerShare', 'fiftyTwoWeekHighChangePercent', 'fiftyTwoWeekLowChangePercent',
                'quickRatio', 'currentRatio', 'debtToEquity']
combined_df[numeric_cols] = combined_df[numeric_cols].astype(float)

In [11]:
numerical_df = combined_df.drop(columns=['averageAnalystRating', 'symbol'])
# drops non-numerical columns

In [12]:
correlation_matrix = numerical_df.corr()  # Compute correlation matrix
correlation_with_target = correlation_matrix['previousClose'].sort_values(ascending=False)
print(correlation_with_target)

previousClose                    1.000000
fiftyTwoWeekHighChangePercent    0.154622
forwardEps                       0.051007
trailingPE                       0.048146
forwardPE                        0.023165
trailingEps                      0.008923
revenuePerShare                  0.002386
profitMargins                   -0.002070
fiftyTwoWeekLowChangePercent    -0.003441
debtToEquity                    -0.004052
Unnamed: 0                      -0.005739
Unnamed: 0.1                    -0.016332
quickRatio                      -0.032837
currentRatio                    -0.033504
Name: previousClose, dtype: float64


#### H0: There is no linear correlation between previous close price and the other variables
#### Ha: There is a linear correlation between previous close price and at least one other variable

#### The correlation matrix shows virtually no linear correlation between previous close price and any other variable: every correlation coefficient (r) is close to 0. The one exception is a weak positive correlation with Fifty Two Week High Change Percent (r ≈ 0.15), which is expected because that value is calculated directly from the stock price. Interestingly, Fifty Two Week Low Change Percent, which is also calculated from price, shows essentially no correlation with previous close price (r ≈ -0.003).
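
The values printed by `.corr()` are Pearson correlation coefficients (r), not r². The sketch below illustrates the difference in plain Python. `pearson_r` is an illustrative helper and the toy data is made up; the only number taken from the notebook is the 0.154622 reported for `fiftyTwoWeekHighChangePercent`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (made up): a perfectly linear relationship gives r = 1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))

# r reported in the output above for fiftyTwoWeekHighChangePercent.
r = 0.154622
r_squared = r ** 2  # share of variance a linear fit would explain
print(round(r_squared, 4))  # 0.0239
```

Squaring r gives roughly 0.024, so even the strongest relationship in the table accounts for only about 2.4% of the variance in previous close price.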

## Modeling

- I will use **accuracy** as my evaluation metric and the gain/lost column as my target variable.
- "Up" makes up **61.9%** of the dataset.

### Baseline

- By guessing "up" for every stock, one could achieve an accuracy of **61.9%**.
- This will serve as the **baseline accuracy** for the project.
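
The baseline is simply the share of the most common class. A small sketch of that calculation, using made-up labels with the same 61.9% share of 'up' rather than the notebook's data; `majority_baseline` is a hypothetical helper:

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical labels with the same 61.9% share of 'up' as the dataset.
toy_labels = ["up"] * 619 + ["down"] * 381
print(majority_baseline(toy_labels))  # ('up', 0.619)
```

On the real data the same helper could be called as `majority_baseline(df['gain/lost'].tolist())`.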

### Modeling Plan

- I will evaluate models developed using **three different model types** and various **hyperparameter configurations**.
- Models will be evaluated on **train** and **validate** data.
- The model that performs the best on validation data will then be evaluated on the **test data**.

In [11]:
forest_df = df.drop(columns = ['symbol', 'previous_price', 'averageAnalystRating', 'previousClose', 'fiftyTwoWeekHighChangePercent',\
                               'fiftyTwoWeekLowChangePercent', 'price_diff_1mo', 'trailingPE', 'forwardPE'])
# drops non-numerical columns and columns that are calculated from price


#### Because the data was pulled on 5/4, the columns dropped above are calculated from the price on 5/4. Since I am trying to use this model to make a prediction as of 4/3, those columns were removed. The remaining columns are released quarterly, so their values would have been the same on 4/3. I will revisit this script in a month and use these variables to predict stock movement on 6/3, as soon as the data is available. The free Yahoo Finance API does not allow queries for past data.

In [13]:
forest_df.replace([np.inf, -np.inf], np.nan, inplace=True)  # replaces infinity values with NaN

# Drop rows with NaN (which now includes former infs)
forest_df.dropna(inplace=True)
forest_df

Unnamed: 0,trailingEps,forwardEps,profitMargins,revenuePerShare,quickRatio,currentRatio,debtToEquity,gain/lost
0,-2.61,-1.90,0.00000,0.032,10.674,11.576,7.167,down
1,1.20,2.13,0.09531,13.211,2.188,2.789,0.000,down
2,-2.89,-3.46,-1.93691,1.494,0.896,1.155,16.869,down
4,-6525.00,0.00,-0.85970,74.726,0.069,0.247,1047.930,down
5,0.00,0.00,0.00000,0.000,0.000,0.000,0.000,up
...,...,...,...,...,...,...,...,...
4831,0.02,0.00,0.05249,3.091,1.410,1.434,0.297,up
4832,-1.28,-0.71,0.00000,0.000,8.525,9.197,0.452,down
4833,0.00,0.06,-0.00170,2.633,1.158,1.949,43.982,up
4834,2.29,2.26,0.13206,18.932,7.133,7.847,137.005,up


In [14]:
# set up dataset to put through Random Forest Classifier model
train, test = train_test_split(forest_df, test_size=.2, random_state=123)
train, validate = train_test_split(train, test_size=.25, random_state=123)
x_train = train.drop(columns = ['gain/lost'])
y_train = train['gain/lost']

x_val = validate.drop(columns = ['gain/lost'])
y_val = validate['gain/lost']

x_test = test.drop(columns = ['gain/lost'])
y_test = test['gain/lost']
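
The two chained calls produce roughly a 60/20/20 train/validate/test split: 20% is held out for test, then 25% of the remaining 80% is held out for validate. The sketch below reproduces the arithmetic in plain Python, assuming that `train_test_split` rounds a fractional held-out size up to the next whole row; `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows, test_size=0.2, validate_size=0.25):
    """Row counts for two chained splits that each round the held-out part up."""
    n_test = math.ceil(n_rows * test_size)
    n_remaining = n_rows - n_test
    n_validate = math.ceil(n_remaining * validate_size)
    n_train = n_remaining - n_validate
    return n_train, n_validate, n_test


print(split_sizes(1000))  # (600, 200, 200)
```

For 4195 rows the same arithmetic gives 2517 train, 839 validate and 839 test rows. That matches the train and validate support counts in the classification reports, which suggests (but does not confirm) that about 4195 rows remained after `dropna`.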

## Using Random Forest Classifier

In [43]:
# fits the model and reports performance on the train set
rf = RandomForestClassifier(
    bootstrap=True,
    class_weight='balanced',
    criterion='entropy',
    min_samples_leaf=3,
    n_estimators=100,
    max_depth=10,
    random_state=123)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_train)
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

        down       0.74      0.77      0.76       949
          up       0.86      0.84      0.85      1568

    accuracy                           0.81      2517
   macro avg       0.80      0.81      0.80      2517
weighted avg       0.82      0.81      0.82      2517



### The Random Forest Classifier model beats the 61.9% accuracy baseline by 19.1 percentage points on the train dataset. Precision and recall look promising for both down and up, but especially for up.

In [45]:
y_pred = rf.predict(x_val)
print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

        down       0.51      0.55      0.53       340
          up       0.68      0.64      0.66       499

    accuracy                           0.61       839
   macro avg       0.59      0.60      0.60       839
weighted avg       0.61      0.61      0.61       839



#### Accuracy drops to 61% on the validation set, which is 0.9 percentage points below baseline. Precision is around 50% for down and almost 70% for up. This means that out of the rows the model predicted as down, it was correct about half the time, which is basically as good as a coin flip. Recall for up is around 64%, meaning that out of all the stocks that actually went up, the model correctly identified 64% of them. This is decent, but nothing to get excited over. The 20-point gap between train accuracy (81%) and validation accuracy (61%) suggests the model is overfit.
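
To make the precision and recall readings concrete, here is a short sketch of how both are computed for a single class from confusion-matrix counts. The counts are hypothetical, chosen only for illustration, and are not taken from the model's actual confusion matrix; `precision_recall` is an illustrative helper.

```python
def precision_recall(true_positive, false_positive, false_negative):
    """Precision and recall for one class from confusion-matrix counts."""
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    return precision, recall


# Hypothetical counts for the 'down' class: 100 stocks predicted 'down',
# of which 50 really fell, plus 40 falling stocks the model missed.
precision, recall = precision_recall(
    true_positive=50, false_positive=50, false_negative=40
)
print(precision)         # 0.5
print(round(recall, 2))  # 0.56
```

Precision answers "of the stocks predicted down, how many fell?", while recall answers "of the stocks that fell, how many were caught?".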

## Decision Tree

In [62]:
clf = DecisionTreeClassifier(max_depth = 3, random_state = 42)
clf.fit(x_train, y_train)

In [63]:
y_pred = clf.predict(x_train)
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

        down       0.58      0.09      0.16       949
          up       0.64      0.96      0.76      1568

    accuracy                           0.63      2517
   macro avg       0.61      0.53      0.46      2517
weighted avg       0.62      0.63      0.54      2517



In [64]:
y_pred = clf.predict(x_val)
print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

        down       0.41      0.06      0.10       340
          up       0.60      0.95      0.73       499

    accuracy                           0.59       839
   macro avg       0.50      0.50      0.41       839
weighted avg       0.52      0.59      0.47       839



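### Validation accuracy (59%) falls below baseline. Recall for the stocks labeled 'up' is very high (0.95), but only because the tree predicts 'up' for almost every stock: recall for 'down' is 0.06, and precision for 'up' (0.60) is about the same as the share of 'up' stocks in the validation set (499 of 839).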

In [66]:
from sklearn.neighbors import KNeighborsClassifier


### KNN Classifier

In [67]:
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(x_train, y_train)
y_pred = knn.predict(x_train)
y_pred_proba = knn.predict_proba(x_train)
print(classification_report(y_train, y_pred))


              precision    recall  f1-score   support

        down       0.66      0.48      0.56       949
          up       0.73      0.85      0.79      1568

    accuracy                           0.71      2517
   macro avg       0.70      0.67      0.67      2517
weighted avg       0.71      0.71      0.70      2517



In [68]:
y_pred = knn.predict(x_val)
print(classification_report(y_val, y_pred))


              precision    recall  f1-score   support

        down       0.49      0.32      0.39       340
          up       0.63      0.77      0.69       499

    accuracy                           0.59       839
   macro avg       0.56      0.55      0.54       839
weighted avg       0.57      0.59      0.57       839



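### Random Forest with SMOTE oversampling

from imblearn.over_sampling import SMOTE

# oversamples the minority class in the train set only
sm = SMOTE(random_state=123)
x_train_sm, y_train_sm = sm.fit_resample(x_train, y_train)

# Train model on oversampled data
rf.fit(x_train_sm, y_train_sm)

# performance on the oversampled train set
y_train_sm_pred = rf.predict(x_train_sm)
print(classification_report(y_train_sm, y_train_sm_pred))

# performance on the validation set, which is not resampled
y_val_pred = rf.predict(x_val)
print(classification_report(y_val, y_val_pred))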