# Stock Volatility 

#### David Montoto
#### Vatsal Bagri

About:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.metrics import mean_absolute_error, mean_squared_error
import torch
from torch import nn
from sklearn.model_selection import train_test_split


### Dataset 1 - Stock data from Yahoo Finance

Our initial dataset, provided by Yahoo Finance, is a collection of stocks with their daily prices and dividends. Most of the data is already clean and easy to work with. However, the 'Date' column includes a time component, so we simplify it to the calendar date only.

In [None]:
df1_initial = pd.read_csv('data/stock_details_5_years.csv', header = 0)
df1_initial.head()


In [None]:
df1_initial['Date'] = pd.to_datetime(df1_initial['Date'].str.extract(r'(\d{4}-\d{2}-\d{2})')[0])
df1_initial.head()
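As a small illustration of the date clean-up, the sketch below applies the same regular expression to made-up timestamp strings. The exact format of the raw 'Date' values is an assumption here; only the leading `YYYY-MM-DD` part matters to the pattern.

```python
import pandas as pd

# Hypothetical raw values; the real 'Date' strings may be formatted differently.
sample = pd.DataFrame({'Date': ['2018-11-29 00:00:00-05:00',
                                '2018-11-30 00:00:00-05:00']})

# Keep only the leading YYYY-MM-DD part, then parse it as a datetime.
date_only = sample['Date'].str.extract(r'(\d{4}-\d{2}-\d{2})')[0]
sample['Date'] = pd.to_datetime(date_only)

print(sample['Date'].dtype)
print(sample)
```

Extracting the date as text before parsing sidesteps any time zone handling, which is not needed when only the calendar day is used.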


### Dataset 2 - Volatility Data For Each Company (Computed and Wrangled From Dataset 1)

We will create a second dataset, derived from a subset of the data in our first dataset. Our aim is to group stocks by volatility. This data is kept separate because it is used independently of the rest of the data in dataset 1.

In [None]:
df1_initial['perc_change'] = abs(((df1_initial['Close'] - df1_initial['Open']) / df1_initial['Open']) * 100.0)

changes_df = df1_initial.groupby('Company')['perc_change'].apply(list).reset_index()

changes_df['length'] = changes_df['perc_change'].apply(len)
full_length = changes_df['length'].max()
# .copy() so that columns added to final_df later do not trigger SettingWithCopyWarning
final_df = changes_df[changes_df['length'] == full_length].drop(columns=['length']).copy()

final_df.head(10)
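To make the wrangling step concrete, here is a minimal sketch on toy data. The company names and prices are invented for illustration; the logic mirrors the cell above: compute the absolute daily percent change, collect it per company, and keep only companies with a full-length series.

```python
import pandas as pd

# Toy prices for two hypothetical companies; 'BBB' is missing one day.
toy = pd.DataFrame({
    'Company': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'Open':    [100.0, 102.0, 101.0, 50.0, 51.0],
    'Close':   [102.0, 101.0, 101.0, 49.0, 51.0],
})

# Absolute open-to-close move, in percent.
toy['perc_change'] = ((toy['Close'] - toy['Open']) / toy['Open']).abs() * 100.0

# One row per company, holding the list of daily changes.
series = toy.groupby('Company')['perc_change'].apply(list).reset_index()
series['length'] = series['perc_change'].apply(len)

# Keep only companies whose series is as long as the longest one.
complete = series[series['length'] == series['length'].max()]

print(complete['Company'].tolist())
```

Only 'AAA' survives the filter, because 'BBB' has two observations instead of three. The same rule is what lets every remaining company be stacked into one rectangular array later on.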



In [None]:
X = np.array(final_df['perc_change'].tolist())
pca = PCA(n_components=250) 
X_pca = pca.fit_transform(X)

kmeans = KMeans(n_clusters=11, random_state=24) # 11 clusters: see 'Baseline K Means and GMM Models' below
cluster_labels = kmeans.fit_predict(X_pca)

silh_score = silhouette_score(X_pca, cluster_labels)
print('silhouette score:', silh_score)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='hsv', alpha=0.6)
plt.title('Baseline k-means clustering')
plt.xlabel('principal component x direction')
plt.ylabel('principal component y direction')
plt.colorbar()
plt.show()


### PCA Analysis

In our stock market analysis, we used Principal Component Analysis (PCA) to reduce the dimensionality of the dataset, where each company's series has 1250 observations. To choose the number of components, we plotted the cumulative explained variance (the elbow method) and found that 84 components explain about 95% of the variance, which balances dimensionality reduction against information loss. We also excluded companies with fewer than 1250 observations so that every row passed to PCA has the same length.

In [None]:
X = np.array(final_df['perc_change'].tolist())
pca = PCA()
pca.fit(X)

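# elbow method: cumulative explained variance as components are added
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('Cumulative explained variance vs. number of components')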
plt.grid(True)
plt.show()
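The component count can also be read off numerically rather than from the plot. The sketch below is one way to do that, run on synthetic data standing in for `X`; the names `X_demo`, `threshold`, and `n_needed` are illustrative and not part of the analysis above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for X: 60 "companies" with 40 observations each.
X_demo = rng.normal(size=(60, 40))

pca_demo = PCA().fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# First index where the cumulative variance reaches the threshold,
# plus 1 to turn a zero-based index into a component count.
threshold = 0.95
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(n_needed, cumulative[n_needed - 1])
```

scikit-learn's `PCA` also accepts a float `n_components` between 0 and 1 (with `svd_solver='full'`) to select the number of components by explained variance directly.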


In the following subsections we cluster the data, tune hyperparameters through cross validation, analyze the final clusters, and carry out a supervised learning analysis.

### Baseline K Means and GMM Models

To compare the performance of the GMM and K-means models on our dataset, we first fit both models with 11 clusters, matching the 11 sectors of the S&P 500, to explore potential industry patterns. As a baseline, we ran both models on the PCA-processed data without any hyperparameter tuning. To evaluate them, we plotted the cluster assignments on the first two principal components and calculated the silhouette score for each model, which measures how well separated the resulting clusters are.

In [None]:
X = np.array(final_df['perc_change'].tolist())
pca = PCA(n_components=84) 
X_pca = pca.fit_transform(X)

kmeans = KMeans(n_clusters=11, random_state=24) # use of 11 explained in writing above
cluster_labels = kmeans.fit_predict(X_pca)

silh_score = silhouette_score(X_pca, cluster_labels)
print('silhouette score:', silh_score)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='hsv', alpha=0.6)
plt.title('Baseline k-means clustering')
plt.xlabel('principal component x direction')
plt.ylabel('principal component y direction')
plt.colorbar()
plt.show()

In [None]:
gmm = GaussianMixture(n_components=11, random_state=24)  
cluster_labels = gmm.fit_predict(X_pca)

silh_score = silhouette_score(X_pca, cluster_labels)
print('silhouette score:', silh_score)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='hsv', alpha=0.6)
plt.title('Baseline GMM clustering')
plt.xlabel('principal component x direction')
plt.ylabel('principal component y direction')
plt.colorbar()
plt.show()

### Comparison of Cluster Assignment between the two techniques

Because the silhouette scores of the two techniques were so similar, we printed the cluster assignments from each model as a sanity check.

In [None]:
kmeans_labels = kmeans.predict(X_pca)
gmm_labels = gmm.predict(X_pca)

# Compare the cluster assignments
print("KMeans cluster assignments:", kmeans_labels[:10])
print("GMM cluster assignments:", gmm_labels[:10])

kmeans_silhouette = silhouette_score(X_pca, kmeans_labels)
gmm_silhouette = silhouette_score(X_pca, gmm_labels)

print(f"KMeans Silhouette Score: {kmeans_silhouette}")
print(f"GMM Silhouette Score: {gmm_silhouette}")
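Cluster IDs are arbitrary, so the same grouping can appear under different numbers in the two models, and comparing printed labels by eye can mislead. One label-invariant way to quantify agreement is the adjusted Rand index from scikit-learn. The sketch below uses hypothetical label vectors rather than the notebook's `kmeans_labels` and `gmm_labels`.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors: the same partition with the label IDs permuted.
labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([2, 2, 0, 0, 1, 1])

# A genuinely different grouping of the same six points.
labels_c = np.array([0, 1, 0, 1, 0, 1])

# Identical partitions score 1.0 regardless of how the clusters are numbered.
same_partition = adjusted_rand_score(labels_a, labels_b)
different_partition = adjusted_rand_score(labels_a, labels_c)

print(same_partition, different_partition)
```

Applied to the two models, `adjusted_rand_score(kmeans_labels, gmm_labels)` would give a single number summarizing how closely their assignments agree.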

### Cross validation and optimization of Baseline GMM and K Means
To optimize our GMM and K-means models for clustering stocks, we ran a cross-validation-style search over the number of clusters. Each model was refit for every cluster count from 2 to 18 and scored with the silhouette score, which rewards clusters that are internally coherent and distinct from one another. Note that each candidate is fit and scored on the full PCA-reduced dataset rather than on held-out segments, and the search loop is unseeded, so the selected count can vary between runs.

In [None]:
silhouette_scores = []

# cross validation here, note it is unseeded
for n_clusters in range(2, 19):
    kmeans = KMeans(n_clusters=n_clusters)
    cluster_labels = kmeans.fit_predict(X_pca)
    silh_score = silhouette_score(X_pca, cluster_labels)
    silhouette_scores.append(silh_score)

final_cluster_num = np.argmax(silhouette_scores) + 2 # add 2 because index 0 corresponds to 2 clusters

print("best # of clusters:", final_cluster_num)
print("Corresponding best cross validation silhouette score", silhouette_scores[final_cluster_num - 2])

# refit k-means with the best number of clusters found above
kmeans = KMeans(n_clusters=final_cluster_num, random_state=24)
cluster_labels = kmeans.fit_predict(X_pca)

# save the label results in final_df for later analysis
final_df['k-means label'] = cluster_labels

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='coolwarm', alpha=0.5)
plt.title('Tuned k-means Labels')
plt.xlabel('Principal Component x direction')
plt.ylabel('Principal Component y direction')
plt.colorbar()
plt.show()

# silhouette score for the plotted k-means clustering
final_silhouette_score = silhouette_score(X_pca, cluster_labels)
print("Silhouette score for graphed clustered:", final_silhouette_score)

In [None]:
silhouette_scores = []

# cross validation here, note this part is unseeded
for n_components in range(2, 19):
    gmm = GaussianMixture(n_components=n_components)
    cluster_labels = gmm.fit_predict(X_pca)
    silh_score = silhouette_score(X_pca, cluster_labels)
    silhouette_scores.append(silh_score)

final_cluster_num = np.argmax(silhouette_scores) + 2

print("best # of clusters:", final_cluster_num)
print("Corresponding best cross validation silhouette score", silhouette_scores[final_cluster_num - 2])

gmm = GaussianMixture(n_components=final_cluster_num, random_state=24)
cluster_labels = gmm.fit_predict(X_pca)

# Update final_df with cluster labels
final_df['gmm labels'] = cluster_labels

# Plot the clustering labels

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='coolwarm', alpha=0.5)
plt.title('Tuned GMM Labels')
plt.xlabel('Principal Component x direction')
plt.ylabel('Principal Component y direction')
plt.colorbar()
plt.show()

final_silhouette_score = silhouette_score(X_pca, cluster_labels)
print("Silhouette score for graphed clustered:", final_silhouette_score)
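Because the search loops above are unseeded, the selected number of clusters can change from run to run. A seeded variant is sketched below on synthetic blobs standing in for `X_pca`; the helper `sweep_kmeans` and its parameters are hypothetical names introduced for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_pca: three well-separated blobs.
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)

def sweep_kmeans(X, k_values, seed=24):
    """Return {k: silhouette score} for a seeded KMeans fit at each k."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

scores = sweep_kmeans(X_demo, range(2, 7))
best_k = max(scores, key=scores.get)

print(best_k, scores[best_k])
```

Fixing `random_state` inside the loop makes the sweep repeatable, so the reported best cluster count and the final refit describe the same model. The same pattern should carry over to `GaussianMixture`, which also takes a `random_state` argument.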

### Comparing Clusters from Each Model

After tuning, both the GMM and K-means models repeatedly selected 2 clusters in the cross validation. Now that each company has a cluster label from each model, we compare the clusters to see how the two models differ.

In [None]:
# peek at the data frame after adding labels from our models
final_df.head(8)

In [None]:
# volatility comparison of clusters in the K-means model
# each 'perc_change' entry is a Python list, which has no .mean(), so use np.mean
final_df['avg_volatility'] = final_df['perc_change'].apply(np.mean)
final_df[['k-means label', 'avg_volatility']].groupby('k-means label').mean()

In [None]:
# volatility comparison of clusters in the GMM model
final_df[['gmm labels', 'avg_volatility']].groupby('gmm labels').mean()

In [None]:
kmeans_0 = final_df.index[final_df['k-means label'] == 0].values

kmeans_1 = final_df.index[final_df['k-means label'] == 1].values

gmm_0 = final_df.index[final_df['gmm labels'] == 0].values

gmm_1 = final_df.index[final_df['gmm labels'] == 1].values

#lengths of each 

print("k-means 0 length:", len(kmeans_0))
print("k-means 1 length:", len(kmeans_1), '\n')
print("gmm 0 length:", len(gmm_0))
print("gmm 1 length:", len(gmm_1), '\n')

# to visualize what these arrays look like, print the row indices of the first 10 companies in the 1-label groups
print(gmm_1[0:10])
print(kmeans_1[0:10])
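Group sizes alone do not show whether the two models put the same companies together. A contingency table of one model's labels against the other's makes the overlap explicit. The sketch below uses hypothetical labels for six companies; the column names mirror those in `final_df`.

```python
import pandas as pd

# Hypothetical labels for six companies; column names mirror final_df.
demo = pd.DataFrame({
    'k-means label': [0, 0, 1, 1, 1, 0],
    'gmm labels':    [1, 1, 0, 0, 1, 1],
})

# Rows: k-means clusters, columns: GMM clusters, cells: number of companies.
overlap = pd.crosstab(demo['k-means label'], demo['gmm labels'])

print(overlap)
```

If most of the counts fall in one cell per row, the two models largely agree up to a renaming of the clusters; counts spread across a row indicate companies that the models group differently.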