# Stock Return Analysis & Classification

## Introduction & Motivation

Financial markets are inherently noisy and volatile, with stock prices influenced by a wide range of external factors and human behavior. This uncertainty makes short-term price movements difficult to model and challenges the assumption that clear, stable patterns exist in historical price data.

This project stems from a growing interest in financial markets alongside an introductory background in machine learning. Rather than assuming predictability, the goal is to explore how standard machine learning techniques behave when applied to financial time-series data. In particular, the project investigates whether commonly used statistical and technical features such as returns, moving averages, and rolling volatility capture any consistent structure in daily stock price movement.

By analyzing historical stock data and evaluating simple classification models, this project aims to better understand both the potential and the limitations of applying machine learning to uncertain, real-world financial data.

---

### Key Characteristics of This Project
- Exploratory rather than predictive in nature  
- Focused on understanding behavior and limitations, not forecasting accuracy  
- Uses historical daily stock data as a case study for noisy time-series modeling  

---

### What This Project Does *Not* Claim
- It does **not** attempt to beat the market or generate a trading strategy  
- It does **not** assume short-term stock movements are reliably predictable  


---

## Libraries and Dependencies

This project relies on a small set of Python libraries to support data acquisition, numerical analysis, visualization, and preprocessing. Each library is used with a specific role in the analysis pipeline.

### Data Acquisition
- **yfinance** is used to retrieve historical daily stock price data from Yahoo Finance. This provides reproducible, publicly available market data and defines the temporal resolution (daily) of the analysis.


In [1]:
import yfinance as yf

### Data Manipulation and Numerical Analysis
- **pandas** and **numpy** are used for structured time-series manipulation and numerical computation. These libraries support feature engineering operations such as return calculation, rolling statistics, and time-based aggregation.

### Exploratory Data Analysis and Visualization
- **matplotlib** and **seaborn** are used to visualize price trends, return distributions, and volatility behavior. Visualizations are used primarily for exploratory analysis and interpretation rather than presentation.

### Outlier Handling
- **winsorize** from `scipy.stats.mstats` is used to limit the influence of extreme return values. Financial return data is often heavy-tailed, and winsorization provides a simple way to reduce the impact of rare but extreme observations without removing data points entirely.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats.mstats import winsorize
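
To make the effect of winsorization concrete, here is a minimal sketch on a small set of made-up return values. The array `returns` and the 10% limits are illustrative assumptions, not data or settings from this project.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical daily returns (%) with one extreme value at each end
returns = np.array([-9.0, -1.2, -0.5, -0.1, 0.0, 0.2, 0.4, 0.9, 1.5, 12.0])

# Replace the lowest 10% and highest 10% of values with the nearest remaining value
clipped = np.asarray(winsorize(returns, limits=(0.1, 0.1)))

print(clipped)
```

The number of observations is unchanged. Only the two extreme values are pulled in to the nearest remaining values.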

### Warning Management
- Python warnings are suppressed to improve notebook readability. This helps keep the focus on analysis results rather than non-critical library warnings, while assuming careful handling of data preprocessing steps.

In [3]:
import warnings
warnings.filterwarnings('ignore')

## Data Acquisition

The first step in the analysis is to obtain historical stock price data that will serve as the foundation for all subsequent feature engineering and modeling.

Historical daily price data is retrieved using the `yfinance` library, which requires a stock’s ticker symbol along with a specified start and end date in `YYYY-MM-DD` format.

To improve reproducibility and separate data collection from analysis, the retrieved data is saved locally as a `.csv` file in a dedicated data directory.


In [4]:
# to_csv() returns None, so the file is written without assigning the result
yf.download('NKE', start='2020-01-01', end='2024-01-01').to_csv('Datasets/nike_stock.csv')

[*********************100%***********************]  1 of 1 completed


## Data Cleaning and Formatting

The saved CSV file is then read back into a pandas DataFrame to enable structured manipulation and analysis.

In [11]:
nike = pd.read_csv('../Datasets/nike_stock.csv')
nike

Unnamed: 0,Price,Close,High,Low,Open,Volume
0,Ticker,NKE,NKE,NKE,NKE,NKE
1,Date,,,,,
2,2020-01-02,95.36821746826172,95.3775509900979,94.26709683150074,94.58437249941043,5644100
3,2020-01-03,95.10693359375,95.18158752791594,93.6045569857679,93.86583863595764,4541800
4,2020-01-06,95.0229263305664,95.03225273015106,94.12710073259969,94.20174752622184,4612400
...,...,...,...,...,...,...
1003,2023-12-22,104.98536682128906,107.66733492795588,104.41204414073968,105.19914790848757,46666200
1004,2023-12-26,104.96592712402344,105.61698972341972,104.44120116881932,105.2380168592794,12846700
1005,2023-12-27,104.1010971069336,105.51981745011808,103.8290147690198,105.27688573216825,10157900
1006,2023-12-28,105.7433090209961,106.30691233863952,103.79013609671432,104.17882829556108,9352900


Due to the CSV export format, the first two rows of the dataset are leftover header rows rather than observations: one repeats the stock ticker across the columns, and the other holds only the `Date` label. Both rows are removed to ensure that all remaining rows represent valid daily observations.

In [12]:
nike = nike.drop([0,1],axis=0)
nike

Unnamed: 0,Price,Close,High,Low,Open,Volume
2,2020-01-02,95.36821746826172,95.3775509900979,94.26709683150074,94.58437249941043,5644100
3,2020-01-03,95.10693359375,95.18158752791594,93.6045569857679,93.86583863595764,4541800
4,2020-01-06,95.0229263305664,95.03225273015106,94.12710073259969,94.20174752622184,4612400
5,2020-01-07,94.97627258300781,95.87209824443295,94.07111340194787,95.00426602252274,6719900
6,2020-01-08,94.76165771484375,95.31221437997459,94.10844656574092,94.53769594517937,4942200
...,...,...,...,...,...,...
1003,2023-12-22,104.98536682128906,107.66733492795588,104.41204414073968,105.19914790848757,46666200
1004,2023-12-26,104.96592712402344,105.61698972341972,104.44120116881932,105.2380168592794,12846700
1005,2023-12-27,104.1010971069336,105.51981745011808,103.8290147690198,105.27688573216825,10157900
1006,2023-12-28,105.7433090209961,106.30691233863952,103.79013609671432,104.17882829556108,9352900


As a side effect of the export, the column holding the dates is labeled `Price`. Its values are copied into a new `Date` column and the original column is dropped. Conversion to a datetime type, which supports time-series operations such as sorting, indexing, and rolling-window calculations, happens in the next step together with the numeric conversion.

In [None]:
nike['Date'] = nike['Price']
nike = nike.drop('Price',axis=1)

nike

Finally, price and volume columns are converted to numeric types. Explicit type conversion ensures that downstream feature engineering and statistical operations behave as expected and prevents silent errors caused by incorrect data types.

In [9]:
for i in nike.columns:
    if i != 'Date':
        nike[i] = pd.to_numeric(nike[i])
    else:
        nike[i] = pd.to_datetime(nike[i])
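
The cleaning steps above (dropping the two extra header rows, renaming the date column, converting types) could also be handled at load time. The following is a sketch under the assumption that the file has the layout shown earlier. The `csv_text` sample and the name `sample` are hypothetical, with values rounded from the displayed rows.

```python
import io
import pandas as pd

# Small hypothetical sample in the same layout as the exported file
csv_text = """Price,Close,High,Low,Open,Volume
Ticker,NKE,NKE,NKE,NKE,NKE
Date,,,,,
2020-01-02,95.37,95.38,94.27,94.58,5644100
2020-01-03,95.11,95.18,93.60,93.87,4541800
"""

# Skip the ticker row and the 'Date' label row (file lines 1 and 2, zero-based)
sample = (
    pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])
      .rename(columns={"Price": "Date"})
)
sample["Date"] = pd.to_datetime(sample["Date"])

print(sample.dtypes)
```

Because the non-numeric rows never enter the DataFrame, the price and volume columns are parsed as numbers directly and no separate `pd.to_numeric` pass is needed.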

Calling `describe()` on the DataFrame summarizes each column: the number of observations, the mean daily trading volume over the selected date range, and the minimum and maximum of the stock's open, high, low, and close (OHLC) prices.

In [10]:
nike.describe().round(2)

Unnamed: 0,Close,High,Low,Open,Volume,Date
count,1006.0,1006.0,1006.0,1006.0,1006.0,1006
mean,115.49,116.78,114.19,115.5,7271308.75,2021-12-30 11:57:08.230616320
min,58.76,62.58,56.14,60.63,1821900.0,2020-01-02 00:00:00
25%,99.54,100.7,98.62,99.55,5065750.0,2020-12-30 06:00:00
50%,113.73,115.46,112.78,114.07,6275250.0,2021-12-29 12:00:00
75%,128.18,129.79,126.68,128.24,8102450.0,2022-12-28 18:00:00
max,168.18,169.68,166.27,167.08,48176100.0,2023-12-29 00:00:00
std,21.94,22.01,21.94,22.02,4231354.48,


## Stock Analysis

In [None]:
plt.figure(figsize=(9,4),dpi=250)
for col in ['Close', 'Open']:
    sns.lineplot(data=nike, x='Date', y=col, label=col)
plt.title('Price of NKE Stock')
plt.xticks(rotation=90)
plt.ylabel('Price($)')
plt.legend()
plt.show()

In [None]:
nike['Open-to-Close Return(%)'] = ((nike['Close'] - nike['Open'])/(nike['Open']))*100
nike['Day-to-day Return(%)'] = nike['Close'].pct_change() * 100
nike

In [None]:
plt.figure(figsize=(9,4),dpi=250)
plt.title('NKE Stock Day-to-Day Return')
sns.scatterplot(data=nike, x='Date',y='Day-to-day Return(%)',alpha=0.8)
plt.axhline(0, color='black', linestyle='--')
plt.xticks(rotation=-45);

In [None]:
sns.displot(nike, x='Day-to-day Return(%)',kde=True);

In [None]:
# 1 if the next close is higher than today's; the last row has no next day and defaults to 0
nike['NextDayReturn'] = (nike['Close'].shift(-1) > nike['Close']).astype(int)
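
The final row has no following day, so the comparison against a missing value evaluates to `False` and that row is labeled 0. A possible variant, sketched here on hypothetical prices, leaves that label missing so a later `dropna` can remove it. This is an illustration, not the labeling the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for five consecutive trading days
close = pd.Series([100.0, 101.5, 101.0, 102.0, 101.0], name="Close")

next_close = close.shift(-1)

# 1.0 when the next close is higher, 0.0 otherwise, NaN where no next day exists
label = (next_close > close).astype(float).where(next_close.notna())

print(label.tolist())
```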

In [None]:
nike

In [None]:
# Restrict to numeric columns so the datetime 'Date' column is excluded
corr = nike.corr(numeric_only=True)

In [None]:
sns.heatmap(corr, annot=False,cmap='coolwarm');

In [None]:
nike['SMA20'] = nike['Close'].rolling(window=20).mean()

In [None]:
nike

In [None]:
plt.figure(figsize=(9,4),dpi=250)
for col in ['Close', 'SMA20']:
    sns.lineplot(data=nike, x='Date', y=col, label=col)
plt.title('Price of NKE Stock')
plt.xticks(rotation=90)
plt.ylabel('Price($)')
plt.legend()
plt.show()

### DateTime Analysis

In [None]:
nike['Month'] = nike['Date'].dt.month
nike['Year'] = nike['Date'].dt.year
nike['DayofWeek'] = nike['Date'].dt.day_of_week
nike

In [None]:
month_closing = nike.groupby('Month')['Close'].mean()
month_closing


In [None]:
plt.plot(month_closing);

In [None]:
month_returns = nike.groupby('Month')['Day-to-day Return(%)'].mean()
month_returns

In [None]:
sns.barplot(x=month_returns.index, y=month_returns)

In [None]:
sns.countplot(data=nike,x='DayofWeek',hue='NextDayReturn')

### Volatility

In [None]:
nike['volatility'] = nike['Day-to-day Return(%)'].rolling(window=20).std()
nike['volatility']
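
The rolling standard deviation is undefined until a full window of returns is available, which is why the first rows of `volatility` are missing and are dropped later. A small sketch on hypothetical prices, with a window of 3 standing in for the 20 used above:

```python
import pandas as pd

# Hypothetical closing prices; a window of 3 stands in for the notebook's 20
close = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0, 102.0])

daily_return = close.pct_change() * 100             # first value is NaN
volatility = daily_return.rolling(window=3).std()   # needs 3 valid returns

print(volatility.isna().sum())
```

The count of missing values equals the window length: one row is lost to `pct_change`, and the window needs that many valid returns before it produces a value.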

In [None]:
sns.lineplot(data=nike,x='Date',y='volatility')
plt.xticks(rotation=-45);

In [None]:
sns.displot(data=nike,x='volatility')

## Stock Analysis Using Models

### Data Preparation / Cleaning

In [None]:
nike

In [None]:
nike.isnull().sum()

In [None]:
# Copy so that later column assignments modify nikeDF itself, not a view of nike
nikeDF = nike.dropna(axis=0).copy()

In [None]:
nikeDF.head(10)

In [None]:
nikeDF.isnull().sum() == 0

In [None]:
winsorizedClosing = winsorize(nikeDF['Open-to-Close Return(%)'],(0.05,0.05)).data
nikeDF['Open-to-Close Return(%)'] = winsorizedClosing

In [None]:
winsorizedRet = winsorize(nikeDF['Day-to-day Return(%)'],(0.05,0.05)).data
nikeDF['Day-to-day Return(%)'] = winsorizedRet

In [None]:
winsorizedVol = winsorize(nikeDF['volatility'],(0.05,0.05)).data
nikeDF['volatility'] = winsorizedVol

In [None]:
nikeDF

### Logistic Regression

In [None]:
nikeDum = pd.get_dummies(nikeDF, columns=['Month', 'Year', 'DayofWeek'], drop_first=True)

In [None]:
nikeDum

In [None]:
X = nikeDum.drop(['Date','NextDayReturn'],axis=1)
y = nikeDum['NextDayReturn']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
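
`train_test_split` shuffles rows by default, so with time-ordered data the training set contains days that come after some test days. One commonly used alternative is a chronological split. The sketch below illustrates it on hypothetical data (`X_demo`, `y_demo`). It is not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the feature matrix, with rows ordered by date
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(rng.integers(0, 2, size=100))

# shuffle=False keeps row order, so the last 20% of rows form the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

print(X_tr.index.max(), X_te.index.min())
```

With this split every test row is later than every training row, which is closer to how a model would be evaluated on unseen future days.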

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

In [None]:
from sklearn.linear_model import LogisticRegressionCV

log_model = LogisticRegressionCV()

In [None]:
log_model.fit(X_train_sc,y_train)

In [None]:
log_model.C_

In [None]:
coefs = pd.Series(index=X.columns,data=log_model.coef_[0]).sort_values()
sns.barplot(x=coefs.index,y=coefs.values)
plt.xticks(rotation=90);

In [None]:
from sklearn.metrics import classification_report

yPreds = log_model.predict(X_test_sc)

In [None]:
print(classification_report(y_test,yPreds))
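
A classification report is easier to interpret next to a trivial baseline. The sketch below assumes a majority-class predictor is a reasonable reference point and uses hypothetical labels (`y_tr_demo`, `y_te_demo`) in place of the notebook's `y_train` and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for y_train / y_test
rng = np.random.default_rng(1)
y_tr_demo = rng.integers(0, 2, size=200)
y_te_demo = rng.integers(0, 2, size=50)

# This strategy ignores the features, so a column of zeros is enough
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((200, 1)), y_tr_demo)

baseline_preds = baseline.predict(np.zeros((50, 1)))
baseline_acc = accuracy_score(y_te_demo, baseline_preds)
print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
```

If a model's accuracy is close to this baseline, the features are adding little beyond the class balance.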

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
rfc = RandomForestClassifier()

In [None]:
n_estimators = [64,100,128,175]
# 'auto' was deprecated in scikit-learn 1.1 and later removed; for classifiers it meant 'sqrt'
max_features = ['sqrt','log2']
bootstrap = [True,False]
oob_score = [True,False]

In [None]:
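# oob_score=True is only valid when bootstrap=True, so that invalid combination is excluded
param_grid = [
    {'n_estimators': n_estimators, 'max_features': max_features,
     'bootstrap': [True], 'oob_score': oob_score},
    {'n_estimators': n_estimators, 'max_features': max_features,
     'bootstrap': [False], 'oob_score': [False]},
]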

In [None]:
grid = GridSearchCV(rfc,param_grid)
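
When no `cv` argument is given, `GridSearchCV` uses five-fold cross-validation, which does not respect time order. A possible alternative for time-series data is `TimeSeriesSplit`, sketched here with hypothetical data and a deliberately small grid (`small_grid`). It is not the configuration used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Hypothetical time-ordered data standing in for X_train / y_train
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(120, 4))
y_demo = rng.integers(0, 2, size=120)

# Each fold trains on earlier rows and validates on the rows that follow
cv = TimeSeriesSplit(n_splits=3)
small_grid = {"n_estimators": [10, 20], "max_features": ["sqrt", "log2"]}

search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_)
```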

In [None]:
grid.fit(X_train,y_train)

In [None]:
grid.best_params_

In [None]:
pred = grid.predict(X_test)

In [None]:
print(classification_report(y_test,pred))