# Stock Market Predictions

## Goals:

- Identify factors that contribute to a stock rising or falling over one month
- Use a machine learning model to classify stocks as either up or down

## Imports

In [1]:
import pandas as pd
import os
import yfinance as yf
import random
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

import shap



# Acquire

- Data was obtained using the `yfinance` open-source Python library, which accesses the free Yahoo Finance API.
- The data was pulled on May 3, 2025.
- Ticker symbols were sourced from [this CSV](https://datahub.io/core/nasdaq-listings/r/nasdaq-listed-symbols.csv) and used to query the Yahoo Finance API via `yfinance`.
- There are 4,835 rows, each one representing a ticker and its attributes.
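
The acquisition step described above might look roughly like the sketch below. It is not the code that produced `realone.csv`: the helper name `fetch_fundamentals`, its `fetch_info` parameter, and the choice of fields are illustrative assumptions. The default path assumes `yfinance.Ticker(symbol).info` returns a dictionary whose keys can differ from ticker to ticker, which is why missing fields are tolerated.

```python
import pandas as pd

# Fields taken from the data dictionary in this notebook.
FIELDS = ["trailingEps", "forwardEps", "revenuePerShare",
          "quickRatio", "currentRatio", "debtToEquity"]


def fetch_fundamentals(symbols, fetch_info=None):
    """Build one row per ticker from a per-symbol info dictionary.

    `fetch_info` is a hypothetical hook: any callable mapping a symbol to a
    dict. When omitted, yfinance is used, which requires network access.
    """
    if fetch_info is None:
        import yfinance as yf  # imported lazily so the sketch loads offline

        def fetch_info(symbol):
            return yf.Ticker(symbol).info

    rows = []
    for symbol in symbols:
        try:
            info = fetch_info(symbol)
        except Exception:
            info = {}  # keep the ticker, leave its fields missing
        row = {"symbol": symbol}
        row.update({field: info.get(field) for field in FIELDS})
        rows.append(row)
    return pd.DataFrame(rows, columns=["symbol"] + FIELDS)
```

Passing a custom `fetch_info` makes it possible to try the function on a few made-up tickers before querying the real API.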

# Prepare
- The data came through fairly clean.
- Nulls were replaced with 0 for analysis purposes.
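
As a minimal sketch of the null replacement, assuming it is applied to the numeric columns only, the toy frame below (made-up tickers and values) counts the nulls first and then fills them with 0. Note that 0 is also a legitimate value for these metrics, so after filling, a missing value and a true zero can no longer be told apart.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only, not rows from realone.csv.
sample = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "trailingEps": [1.2, np.nan, -0.8],
    "quickRatio": [np.nan, 2.1, 0.9],
})

numeric_cols = sample.select_dtypes(include="number").columns

# Record how many values are missing before they are overwritten.
null_counts = sample[numeric_cols].isna().sum()

# Fill only the numeric columns; identifiers such as symbol are left alone.
filled = sample.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(0)

print(null_counts)
print(filled)
```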

## Data Dictionary

| Feature | Definition |
|---------|------------|
| `symbol` | Stock ticker |
| `gain/lost` (target) | `up` if the price rose from April 4th to May 5th, `down` if it fell |
| `trailingEps` | Trailing EPS metric |
| `forwardEps` | Forward EPS metric |
| `revenuePerShare` | Revenue per share metric |
| `quickRatio` | Quick ratio metric |
| `currentRatio` | Current ratio metric |
| `debtToEquity` | Debt-to-equity metric |

In [2]:
df = pd.read_csv('realone.csv') #reads .csv file that took data from yahoo finance on May 3, 2025

In [3]:
df['gain/lost'] = df['price_diff_1mo'].apply(lambda x: 'up' if x > 0 else 'down')
# Adds a gain/lost column based on price_diff_1mo: the close price on May 2, 2025 minus the close price on April 3, 2025.
# A positive difference means the stock rose over the month ('up'); anything else, including no change or a missing value, is labeled 'down'.
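
The labeling rule above sends a missing `price_diff_1mo` to `'down'`, because a comparison with NaN is false. A minimal sketch of one alternative, which keeps missing differences as missing labels, is shown below. The `close_start` and `close_end` column names and all values are hypothetical; only `price_diff_1mo` and `gain/lost` come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy prices for illustration; close_start and close_end are assumed names.
prices = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC", "DDD"],
    "close_start": [10.0, 20.0, 5.0, np.nan],
    "close_end": [12.0, 18.0, 5.0, 7.0],
})

# One-month difference: end close minus start close.
prices["price_diff_1mo"] = prices["close_end"] - prices["close_start"]
diff = prices["price_diff_1mo"]

# 'up' for a positive difference, 'down' otherwise (same rule as the notebook),
# then blank out the label wherever the difference itself is missing.
labels = pd.Series(np.where(diff > 0, "up", "down"), index=diff.index)
prices["gain/lost"] = labels.where(diff.notna())

print(prices[["symbol", "price_diff_1mo", "gain/lost"]])
```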

In [4]:
df = df.drop(columns = ['Unnamed: 0'])
#drops unnecessary columns

# Explore

### Is there a linear correlation between any of the variables and stock price?

In [5]:
combined_df = pd.read_csv("all.csv")

In [6]:
# Cast the price and fundamentals columns to float so they can be correlated
float_cols = ['previousClose', 'trailingPE', 'forwardPE', 'trailingEps',
              'forwardEps', 'profitMargins', 'revenuePerShare',
              'fiftyTwoWeekHighChangePercent', 'fiftyTwoWeekLowChangePercent',
              'quickRatio', 'currentRatio', 'debtToEquity']
combined_df[float_cols] = combined_df[float_cols].astype(float)

In [7]:
# Keep only the numeric columns for the correlation matrix
numerical_df = combined_df.drop(columns=['averageAnalystRating', 'symbol'])

In [8]:
correlation_matrix = numerical_df.corr()  # Compute correlation matrix
correlation_with_target = correlation_matrix['previousClose'].sort_values(ascending=False)
print(correlation_with_target)

previousClose                    1.000000
fiftyTwoWeekHighChangePercent    0.154622
forwardEps                       0.051007
trailingPE                       0.048146
forwardPE                        0.023165
trailingEps                      0.008923
revenuePerShare                  0.002386
profitMargins                   -0.002070
fiftyTwoWeekLowChangePercent    -0.003441
debtToEquity                    -0.004052
Unnamed: 0                      -0.005739
Unnamed: 0.1                    -0.016332
quickRatio                      -0.032837
currentRatio                    -0.033504
Name: previousClose, dtype: float64


#### Ho: There is no linear correlation between previous close price and any of the other variables
#### Ha: There is a linear correlation between previous close price and at least one of the other variables

#### The correlation matrix reveals virtually no linear correlation between previous close price and the other variables, because the correlation coefficients (r) are so close to 0. The largest is with Fifty Two Week High Change Percent (r ≈ 0.15), which is expected because that metric is computed directly from the stock price. It is interesting that there is no meaningful negative correlation between previous close price and Fifty Two Week Low Change Percent (r ≈ -0.003).
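
To weigh Ho against Ha more formally, each coefficient can be paired with a p-value. The sketch below uses `scipy.stats.pearsonr` on synthetic data, not the notebook's data; the variable names and the 0.05 slope are assumptions chosen only to produce one unrelated and one related feature. It also prints r squared next to r, since the two are easy to confuse.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
n = 500

# Synthetic stand-ins, not the notebook's data.
price = rng.lognormal(mean=3.0, sigma=1.0, size=n)  # plays the role of previousClose
unrelated = rng.normal(size=n)                      # generated independently of price
related = 0.05 * price + rng.normal(size=n)         # constructed to depend on price

results = {}
for name, feature in [("unrelated", unrelated), ("related", related)]:
    # pearsonr returns the correlation coefficient and a two-sided p-value
    r, p_value = stats.pearsonr(price, feature)
    results[name] = (r, p_value)
    print(f"{name:>9}: r = {r:+.3f}, r^2 = {r**2:.3f}, p = {p_value:.3g}")
```

A small p-value would be grounds to reject Ho for that pair of variables; a coefficient near 0 with a large p-value would not.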

## Modeling

- I will use **accuracy** as my evaluation metric and the gain/lost column as my target variable.
- "Up" makes up about **62%** of the training data (1,568 of 2,517 rows).

### Baseline

- By guessing "up" for every stock, one could achieve an accuracy of about **62%** on the training data.
- This will serve as the **baseline accuracy** for the project.
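
A majority-class baseline can be computed directly from the label counts. The sketch below is one way to do it; the helper name `majority_baseline` is an assumption, and the counts are the support values from the training classification report later in this notebook (949 down, 1,568 up).

```python
import pandas as pd


def majority_baseline(labels):
    """Return the most frequent class and the accuracy of always guessing it."""
    counts = pd.Series(labels).value_counts()
    return counts.idxmax(), counts.max() / counts.sum()


# Rebuild the training labels from the support counts in the report.
labels = ["down"] * 949 + ["up"] * 1568
majority_class, baseline_accuracy = majority_baseline(labels)

print(majority_class, round(baseline_accuracy, 3))
```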

### Modeling Plan

- I will evaluate models developed using **two different model types** and various **hyperparameter configurations**.
- Models will be evaluated on **train** and **validate** data.
- The model that performs the best on validation data will then be evaluated on the **test data**.
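
The plan above can be sketched as a small loop that fits each hyperparameter configuration on train and scores it on validate, leaving test untouched until a winner is chosen. This is an illustration on synthetic data from `make_classification`, and the grid values are assumptions, not the configurations used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data standing in for the stock features.
X, y = make_classification(n_samples=600, n_features=7, n_informative=3,
                           weights=[0.4, 0.6], random_state=123)

# 60% train, 20% validate, 20% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=123,
    stratify=y_trainval)

results = []
for max_depth in (2, 4, 6):
    for min_samples_leaf in (5, 10):
        model = RandomForestClassifier(
            n_estimators=50, max_depth=max_depth,
            min_samples_leaf=min_samples_leaf,
            class_weight="balanced", random_state=123)
        model.fit(X_train, y_train)
        results.append({
            "max_depth": max_depth,
            "min_samples_leaf": min_samples_leaf,
            "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
            "val_accuracy": accuracy_score(y_val, model.predict(X_val)),
        })

# Pick the configuration by validation accuracy, never by test accuracy.
best = max(results, key=lambda row: row["val_accuracy"])
print(best)
```

A large gap between `train_accuracy` and `val_accuracy` for a configuration is a sign of overfitting, which is why both are recorded.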

In [10]:
forest_df = df.drop(columns = ['symbol', 'previous_price', 'averageAnalystRating', 'previousClose', 'fiftyTwoWeekHighChangePercent',\
                               'fiftyTwoWeekLowChangePercent', 'price_diff_1mo', 'trailingPE', 'forwardPE'])
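# Drops the ticker symbol, the non-numeric analyst rating, and the price-derived columns, including price_diff_1mo, which defines the target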


#### The remaining columns hold values that would be the same on April 4th 2025 as on May 5th 2025, because this data is released quarterly. I will revisit this script in one month, when I can run an analysis that includes the columns dropped above.

In [12]:
forest_df[forest_df['gain/lost'] == 'up']

Unnamed: 0,trailingEps,forwardEps,profitMargins,revenuePerShare,quickRatio,currentRatio,debtToEquity,gain/lost
5,0.00,0.00,0.00000,0.000,0.000,0.000,0.000,up
6,-0.83,-0.83,-0.27165,3.057,2.212,2.699,244.895,up
8,-68.17,-0.52,0.00000,0.000,2.674,3.077,4.029,up
10,,,0.00000,,,,,up
12,-1.13,,-0.12832,8.806,1.253,1.316,2.613,up
...,...,...,...,...,...,...,...,...
4830,-1.07,-0.21,-0.21095,36.981,3.203,3.273,1.110,up
4831,0.02,0.00,0.05249,3.091,1.410,1.434,0.297,up
4833,0.00,0.06,-0.00170,2.633,1.158,1.949,43.982,up
4834,2.29,2.26,0.13206,18.932,7.133,7.847,137.005,up


In [7]:
forest_df.replace([np.inf, -np.inf], np.nan, inplace=True)  # replaces infinity values with NaN

# Drop rows with NaN (which now includes former infs)
forest_df.dropna(inplace=True)
forest_df

Unnamed: 0,trailingEps,forwardEps,profitMargins,revenuePerShare,quickRatio,currentRatio,debtToEquity,gain/lost
0,-2.61,-1.90,0.00000,0.032,10.674,11.576,7.167,down
1,1.20,2.13,0.09531,13.211,2.188,2.789,0.000,down
2,-2.89,-3.46,-1.93691,1.494,0.896,1.155,16.869,down
4,-6525.00,0.00,-0.85970,74.726,0.069,0.247,1047.930,down
5,0.00,0.00,0.00000,0.000,0.000,0.000,0.000,up
...,...,...,...,...,...,...,...,...
4831,0.02,0.00,0.05249,3.091,1.410,1.434,0.297,up
4832,-1.28,-0.71,0.00000,0.000,8.525,9.197,0.452,down
4833,0.00,0.06,-0.00170,2.633,1.158,1.949,43.982,up
4834,2.29,2.26,0.13206,18.932,7.133,7.847,137.005,up


In [10]:
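# Set up the dataset for the Random Forest Classifier: 60% train, 20% validate, 20% test
train, test = train_test_split(forest_df, test_size=.2, random_state=123)
# test_size=.25 of the remaining 80% gives a validation set that is 20% of the full data
train, validate = train_test_split(train, test_size=.25, random_state=123)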
x_train = train.drop(columns = ['gain/lost'])
y_train = train['gain/lost']

x_val = validate.drop(columns = ['gain/lost'])
y_val = validate['gain/lost']

x_test = test.drop(columns = ['gain/lost'])
y_test = test['gain/lost']

In [11]:
# Random forest with balanced class weights and shallow trees to limit overfitting
rf = RandomForestClassifier(
    bootstrap=True,
    class_weight='balanced',
    criterion='gini',
    min_samples_leaf=10,
    n_estimators=200,
    max_depth=6,
    random_state=123)
rf.fit(x_train, y_train)

# Evaluate on the training data
y_pred = rf.predict(x_train)
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

        down       0.54      0.71      0.61       949
          up       0.78      0.63      0.70      1568

    accuracy                           0.66      2517
   macro avg       0.66      0.67      0.66      2517
weighted avg       0.69      0.66      0.67      2517



### The Random Forest Classifier beats the baseline accuracy of 0.62 by 0.04 on the train dataset, with an accuracy of 0.66.


In [47]:
y_val_pred = rf.predict(x_val)


In [35]:
print(classification_report(y_val, y_val_pred))

              precision    recall  f1-score   support

        down       0.52      0.19      0.28       401
          up       0.62      0.88      0.73       606

    accuracy                           0.61      1007
   macro avg       0.57      0.54      0.50      1007
weighted avg       0.58      0.61      0.55      1007



In [48]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=123)
x_train_sm, y_train_sm = sm.fit_resample(x_train, y_train)

# Train model on oversampled data
rf.fit(x_train_sm, y_train_sm)

In [50]:
# Predict on the oversampled training data (not the validation set)
y_train_sm_pred = rf.predict(x_train_sm)

In [51]:
print(classification_report(y_train_sm, y_train_sm_pred))

              precision    recall  f1-score   support

        down       0.65      0.75      0.70      1461
          up       0.71      0.60      0.65      1461

    accuracy                           0.68      2922
   macro avg       0.68      0.68      0.67      2922
weighted avg       0.68      0.68      0.67      2922
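
The report above scores the model on the same oversampled rows it was trained on, so a check against untouched validation data is the more informative comparison. The sketch below illustrates that pattern on synthetic data. As a stated substitution, it uses plain random oversampling via `sklearn.utils.resample` instead of SMOTE, so that it depends only on scikit-learn; the principle of resampling train only and scoring on validate is the same.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the stock features.
X, y = make_classification(n_samples=800, n_features=7, n_informative=3,
                           weights=[0.35, 0.65], random_state=123)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y)

# Oversample the training split only; validation rows are never resampled.
classes, counts = np.unique(y_train, return_counts=True)
target = counts.max()
parts_X, parts_y = [], []
for cls, count in zip(classes, counts):
    X_cls = X_train[y_train == cls]
    if count < target:
        # Draw with replacement until this class matches the majority count.
        X_cls = resample(X_cls, replace=True, n_samples=target,
                         random_state=123)
    parts_X.append(X_cls)
    parts_y.append(np.full(len(X_cls), cls))
X_bal = np.vstack(parts_X)
y_bal = np.concatenate(parts_y)

model = RandomForestClassifier(n_estimators=100, max_depth=6,
                               min_samples_leaf=10, random_state=123)
model.fit(X_bal, y_bal)

train_acc = accuracy_score(y_bal, model.predict(X_bal))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"resampled train accuracy: {train_acc:.3f}")
print(f"validation accuracy:      {val_acc:.3f}")
```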

