# Clustering NIFTY Feature-Engineered Data

This notebook demonstrates unsupervised clustering (K-Means) on the feature-engineered NIFTY dataset.

- Data: `data/nifty/train/featured.csv`
- Libraries: pandas, numpy, scikit-learn, matplotlib, seaborn

---

In [ ]:
# 1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# set_theme is the current name; sns.set is kept only as an alias
sns.set_theme(style='whitegrid', palette='muted', font_scale=1.1)
%matplotlib inline

In [ ]:
# 2. Load the feature-engineered data
data_path = '../../data/nifty/train/featured.csv'
df = pd.read_csv(data_path)
print(f'Loaded {len(df)} rows and {len(df.columns)} columns.')
df.head()

## 3. Feature selection for clustering

We select a subset of numeric features for clustering: returns, moving averages, volatility, and momentum indicators (RSI, MACD, stochastic).
Adjust this list to suit your analysis goals.

In [ ]:
# Select features for clustering
features = [
    'daily_return', 'log_return', 'price_range',
    'ma_5', 'ma_20', 'volatility_5', 'volatility_20',
    'rsi_14', 'macd_12_26', 'macd_signal_12_26', 'macd_histogram_12_26',
    'stoch_14', 'stoch_smoothk', 'stoch_smoothd'
]
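# Drop rows with missing values in the selected features.
# dropna() keeps the original index labels, which step 6 uses to align cluster labels with df.
X = df[features].dropna().copy()
print(f'Clustering on {X.shape[0]} rows and {X.shape[1]} features.')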

## 4. Feature scaling

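K-Means assigns points by Euclidean distance, so features with large numeric ranges (such as the price-level moving averages) would dominate those with small ranges (such as returns). We therefore standardize each feature to zero mean and unit variance.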

In [ ]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled[:5]
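
To make the effect of standardization concrete, here is a minimal sketch on synthetic data (not the NIFTY file). The two columns are hypothetical stand-ins for a return-like feature and a price-level feature; the array name `demo` and the distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on very different scales (assumed values)
rng = np.random.default_rng(0)
demo = np.column_stack([
    rng.normal(0.0, 0.01, size=500),       # return-like column
    rng.normal(18000.0, 500.0, size=500),  # price-level column
])

demo_scaled = StandardScaler().fit_transform(demo)

# Before scaling the second column's spread dwarfs the first;
# after scaling both columns have mean ~0 and standard deviation ~1.
print('std before:', demo.std(axis=0))
print('mean after:', demo_scaled.mean(axis=0).round(6))
print('std after: ', demo_scaled.std(axis=0).round(6))
```

Without this step, Euclidean distances between rows would be driven almost entirely by the price-level column.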

## 5. Choosing the number of clusters (Elbow Method)

We plot the inertia (within-cluster sum of squares) for k from 2 to 10 and look for the "elbow": the point beyond which adding clusters no longer reduces inertia substantially.

In [ ]:
inertia = []
K_range = range(2, 11)
for k in K_range:
    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(K_range, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()
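
To show what the inertia on the y-axis measures, the following sketch recomputes it by hand as the sum of squared distances from each point to its assigned centroid. It uses three synthetic blobs rather than the NIFTY features; `demo`, `km`, and the blob centers are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic 2D blobs (assumed centers at -3, 0 and 3)
rng = np.random.default_rng(42)
demo = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3.0, 0.0, 3.0)
])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(demo)

# Look up each point's assigned centroid, then sum the squared distances
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = ((demo - assigned_centroids) ** 2).sum()

print('manual inertia: ', manual_inertia)
print('kmeans.inertia_:', km.inertia_)
```

The two printed values should agree up to floating-point error. Inertia can only decrease as k grows, which is why the plot is read for its bend rather than its minimum.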

## 6. Fit K-Means and assign clusters

Choose the number of clusters based on the elbow plot above (e.g., k=3).

In [ ]:
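k = 3  # Change this based on the elbow plot
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
# X.index holds the rows that survived dropna(), so labels line up one-to-one
df_clustered = df.loc[X.index].copy()
df_clustered['cluster'] = clusters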
df_clustered.head()
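
The `df.loc[X.index]` step matters because `dropna()` removes rows but keeps the original index labels. A small sketch with a hypothetical four-row frame (the column names, values, and placeholder labels below are invented for illustration) shows the alignment:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 and 2 each have a missing feature value
demo_df = pd.DataFrame({
    'feature_a': [0.1, np.nan, 0.3, 0.4],
    'feature_b': [1.0, 2.0, np.nan, 4.0],
    'other': ['w', 'x', 'y', 'z'],
})

demo_X = demo_df[['feature_a', 'feature_b']].dropna()
demo_labels = np.array([0, 1])  # placeholder labels, one per surviving row

# Select the surviving rows by index label, then attach the labels
demo_clustered = demo_df.loc[demo_X.index].copy()
demo_clustered['cluster'] = demo_labels
print(demo_clustered)
```

Only rows 0 and 3 survive, and each keeps its non-feature columns alongside its label. Assigning the label array directly to the full frame would fail instead, because the lengths differ.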

## 7. Visualize clusters using PCA (2D plot)

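We use PCA to project the 14 standardized features onto 2 dimensions for visualization. The clusters were fitted in the full feature space, so they may overlap in this projection even when they are well separated there.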

In [ ]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=clusters, palette='Set2', alpha=0.7)
plt.title('Clusters visualized with PCA')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.legend(title='Cluster')
plt.show()
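
How faithful a 2D projection is depends on how much variance the first two components retain. The sketch below checks this with `explained_variance_ratio_` on synthetic data; the array `demo` and its correlation structure are assumptions, not properties of the NIFTY features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: six columns, the second strongly correlated with the first
rng = np.random.default_rng(0)
demo = rng.normal(size=(300, 6))
demo[:, 1] = demo[:, 0] + 0.1 * rng.normal(size=300)

pca_demo = PCA(n_components=2).fit(demo)

# Fraction of total variance captured by each retained component
retained = pca_demo.explained_variance_ratio_.sum()
print('per component:', pca_demo.explained_variance_ratio_)
print('retained in 2D:', retained)
```

Applied to the notebook's own `pca` object, a low retained fraction would suggest reading the scatter plot with caution.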

## 8. Cluster analysis

Let's look at the mean value of each feature within each cluster, in the original (unscaled) units.

In [ ]:
df_clustered.groupby('cluster')[features].mean()
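
Cluster means are easier to interpret next to cluster sizes, since a mean over a handful of rows is much less stable than one over hundreds. A minimal sketch using a hypothetical six-row frame (the values are invented, only the column names follow the notebook):

```python
import pandas as pd

# Hypothetical cluster assignments and returns, for illustration only
demo = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 1, 2],
    'daily_return': [0.01, 0.03, -0.02, -0.01, -0.03, 0.10],
})

# Row count and mean per cluster in a single table
summary = demo.groupby('cluster')['daily_return'].agg(['count', 'mean'])
print(summary)
```

Here cluster 2 has the largest mean but contains a single row, so it would deserve a closer look before being interpreted as a distinct regime.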

---
You can further analyze clusters, visualize time series by cluster, or use other clustering algorithms as needed!