# Module 4: Machine Learning

**Python for Business | FinHub**

---

## What You'll Build

![Module 4 Preview](images/module4_preview.png)

## By the End of This Module

You'll be able to:

| Skill | Example |
|-------|--------|
| Discover groups in data | "These 20 stocks cluster into 2 natural groups" |
| Predict categories | "Will the stock go UP or DOWN tomorrow?" |
| Interpret feature importance | "Volume change matters more than today's return" |
| Evaluate out-of-sample | Train on 4 years, test on 1 year of unseen data |

**Two techniques:**
1. **K-Means Clustering** — Find groups without labels (unsupervised)
2. **Random Forest Classification** — Predict categories and see which features matter (supervised)

Both use the same scikit-learn patterns you learned in Module 3.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load stock characteristics
stock_df = pd.read_csv("data/stock_characteristics.csv")
print(f"Loaded {len(stock_df)} stocks")
stock_df.head()

---

## 1. K-Means Clustering

### The Question

Imagine you're an analyst looking at 20 stocks. You have two characteristics for each:
- **Beta**: How volatile is it relative to the market?
- **Dividend Yield**: What percentage does it pay out?

**Question: Can we find meaningful investment groups based only on these numbers?**

We don't have labels. We just have data. This is **unsupervised learning**.

In [None]:
# What do we have?
print("Our 20 stocks:")
print(stock_df[["ticker", "beta", "dividend_yield"]])

In [None]:
# Visualize: just the data, no labels
fig, ax = plt.subplots(figsize=(10, 7))

ax.scatter(stock_df["beta"], stock_df["dividend_yield"], s=150, alpha=0.7)

for _, row in stock_df.iterrows():                    # Label each point
    ax.annotate(row["ticker"], (row["beta"], row["dividend_yield"]),
                xytext=(5, 5), textcoords="offset points", fontsize=9)

ax.set_xlabel("Beta (Market Sensitivity)", fontsize=12)
ax.set_ylabel("Dividend Yield (%)", fontsize=12)
ax.set_title("20 Stocks: Can You See Two Groups?", fontsize=14)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Look at the plot. Do you see a pattern?")
print("Some stocks cluster in the upper-left, others in the lower-right...")

### K-Means Algorithm

K-Means finds groups in four steps:
1. Pick K random points as initial "centers"
2. Assign each stock to its nearest center
3. Recalculate centers as the average of each group
4. Repeat until stable
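
To make the four steps concrete, here is a minimal NumPy sketch of the loop. It is an illustration only, not what scikit-learn runs internally. The function name `kmeans_sketch` and the toy data are made up for this example, and the sketch assumes the initial centers are drawn from the data points themselves.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal K-Means loop: assign, recompute centers, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers (an assumption)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: distance from every point to every center, then nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: each center becomes the mean of its assigned points
        # (an empty cluster keeps its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two well-separated blobs of 10 points each
rng = np.random.default_rng(42)
toy = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(10, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(10, 2)),
])
toy_labels, toy_centers = kmeans_sketch(toy, k=2)
print(toy_labels)
print(toy_centers.round(2))
```

The first ten points should share one label and the last ten the other. Which blob gets called 0 or 1 is arbitrary, which is also true of the cluster numbers scikit-learn returns.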

**Important:** Always standardize features first. K-Means measures similarity by distance, so without scaling, the feature with the largest numeric range would dominate the result.
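
Here is a small sketch of why scale matters for distance. The three stocks and the market-cap column are hypothetical (our dataset uses beta and dividend yield). They are chosen so that one feature has a much larger numeric range than the other.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: column 0 is beta, column 1 is market cap in $ billions
raw = np.array([
    [0.5, 100.0],   # stock A
    [1.5, 101.0],   # stock B: very different beta, similar size
    [0.5, 130.0],   # stock C: same beta as A, different size
])

def dist(a, b):
    """Euclidean distance, the measure K-Means uses."""
    return float(np.linalg.norm(a - b))

print(f"Raw:    A-B = {dist(raw[0], raw[1]):.2f}, A-C = {dist(raw[0], raw[2]):.2f}")

# Put both columns on the same scale (mean=0, std=1), then measure again
scaled = StandardScaler().fit_transform(raw)
print(f"Scaled: A-B = {dist(scaled[0], scaled[1]):.2f}, A-C = {dist(scaled[0], scaled[2]):.2f}")
```

On the raw numbers, A looks far closer to B than to C purely because market cap is measured in bigger units. After standardization the two distances are comparable, so both features get a say.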

In [None]:
# Standardize features (mean=0, std=1)
features = ["beta", "dividend_yield"]
X = stock_df[features].values                         # Raw data

scaler = StandardScaler()                             # Create scaler
X_scaled = scaler.fit_transform(X)                    # Fit and transform

print("Before standardization:")
print(f"  Beta range: {X[:, 0].min():.2f} to {X[:, 0].max():.2f}")
print(f"  Dividend range: {X[:, 1].min():.2f} to {X[:, 1].max():.2f}")
print("\nAfter standardization:")
print(f"  Beta range: {X_scaled[:, 0].min():.2f} to {X_scaled[:, 0].max():.2f}")
print(f"  Dividend range: {X_scaled[:, 1].min():.2f} to {X_scaled[:, 1].max():.2f}")

In [None]:
# Fit K-Means with K=2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X_scaled)                                  # Fit on standardized data

# Add cluster labels to our data
stock_df["cluster"] = kmeans.labels_

print("Cluster assignments:")
print(stock_df[["ticker", "beta", "dividend_yield", "cluster"]])

In [None]:
# Visualize the clusters K-Means found
fig, ax = plt.subplots(figsize=(10, 7))

colors = ["#2ecc71", "#9b59b6"]                       # Green and purple
for cluster in [0, 1]:
    mask = stock_df["cluster"] == cluster
    ax.scatter(
        stock_df.loc[mask, "beta"],
        stock_df.loc[mask, "dividend_yield"],
        c=colors[cluster], s=150, label=f"Cluster {cluster}", alpha=0.7
    )

for _, row in stock_df.iterrows():
    ax.annotate(row["ticker"], (row["beta"], row["dividend_yield"]),
                xytext=(5, 5), textcoords="offset points", fontsize=9)

ax.set_xlabel("Beta", fontsize=12)
ax.set_ylabel("Dividend Yield (%)", fontsize=12)
ax.set_title("K-Means Found Two Groups (No Labels Used!)", fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### The Reveal: What Did K-Means Find?

Let's check what these clusters actually represent. Our data has a "sector" column we haven't looked at yet...

In [None]:
# What sectors are in each cluster?
print("Cluster composition:")
for cluster in [0, 1]:
    cluster_stocks = stock_df[stock_df["cluster"] == cluster]
    sectors = cluster_stocks["sector"].value_counts()
    print(f"\nCluster {cluster}:")
    for sector, count in sectors.items():
        print(f"  {sector}: {count} stocks")

In [None]:
# Side by side: K-Means clusters vs True sectors
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: K-Means clusters
for cluster in [0, 1]:
    mask = stock_df["cluster"] == cluster
    axes[0].scatter(
        stock_df.loc[mask, "beta"],
        stock_df.loc[mask, "dividend_yield"],
        c=colors[cluster], s=150, label=f"Cluster {cluster}", alpha=0.7
    )
for _, row in stock_df.iterrows():
    axes[0].annotate(row["ticker"], (row["beta"], row["dividend_yield"]),
                     xytext=(5, 5), textcoords="offset points", fontsize=9)
axes[0].set_xlabel("Beta")
axes[0].set_ylabel("Dividend Yield (%)")
axes[0].set_title("K-Means Clusters (Unsupervised)")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right: True sectors
sector_colors = {"Utilities": "#3498db", "Tech": "#e74c3c"}
for sector, color in sector_colors.items():
    mask = stock_df["sector"] == sector
    axes[1].scatter(
        stock_df.loc[mask, "beta"],
        stock_df.loc[mask, "dividend_yield"],
        c=color, s=150, label=sector, alpha=0.7
    )
for _, row in stock_df.iterrows():
    axes[1].annotate(row["ticker"], (row["beta"], row["dividend_yield"]),
                     xytext=(5, 5), textcoords="offset points", fontsize=9)
axes[1].set_xlabel("Beta")
axes[1].set_ylabel("Dividend Yield (%)")
axes[1].set_title("True Sector Labels")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("K-Means recovered the sector groupings almost perfectly — with no labels!")
print("It found that Utilities (low beta, high dividend) cluster separately from Tech (high beta, low dividend).")

### ✏️ Checkpoint: Try K=3 Clusters

What happens if we ask for 3 clusters instead of 2? Does it find a meaningful third group?

In [None]:
# Your code here


---

## 2. Random Forest Classification

### A Different Question

In Module 3, we tried to predict the *exact* return. That's hard — R² is basically zero.

But what if we ask an easier question: **Will the stock go UP or DOWN tomorrow?**

This is **classification** instead of regression. Instead of predicting a number, we predict a category.

And instead of R², we measure **accuracy**: what percentage of days did we get right? If we're just guessing, we'd expect 50%. Can we beat that?
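
As a quick sketch of how accuracy is computed, here is a ten-day example with made-up labels. The arrays are hypothetical, and the "majority baseline" is an extra check added here, not something from the dataset: if UP days are more common than DOWN days, always predicting UP already beats 50%.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for 10 days: 1 = UP, 0 = DOWN
y_true_demo = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 0])

# Accuracy is the share of days where the prediction matches reality
acc_demo = accuracy_score(y_true_demo, y_pred_demo)

# A stricter baseline than coin-flipping: always predict the more common class
up_share = y_true_demo.mean()
majority_baseline = max(up_share, 1 - up_share)

print(f"Accuracy:          {acc_demo:.0%}")
print(f"Majority baseline: {majority_baseline:.0%}")
```

Here the predictions are right on 7 of 10 days, while always guessing UP would be right on 6 of 10. That gap, not the gap to 50%, is the fairer measure of what a model adds.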

We'll use features from our dataset:
- **Basic**: today's return, volume change, price range, overnight gap
- **Technical**: 20-day volatility, 5-day momentum
- **Market**: S&P 500 return, VIX level

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load and prepare stock data with features
df = pd.read_csv("data/stock_prices.csv")
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(["ticker", "date"]).reset_index(drop=True)

# Calculate returns within each ticker
df["return"] = df.groupby("ticker")["close"].pct_change()

# Feature engineering (computed within each ticker group)
df["volume_chg"] = df.groupby("ticker")["volume"].pct_change()
df["range_pct"] = (df["high"] - df["low"]) / df["close"]
prev_close = df.groupby("ticker")["close"].shift(1)   # Yesterday's close, per ticker
df["gap"] = (df["open"] - prev_close) / prev_close

# Moving averages and technical features
df["ma5"] = df.groupby("ticker")["close"].transform(lambda x: x.rolling(5).mean())
df["ma20"] = df.groupby("ticker")["close"].transform(lambda x: x.rolling(20).mean())
df["price_vs_ma5"] = (df["close"] - df["ma5"]) / df["ma5"]
df["price_vs_ma20"] = (df["close"] - df["ma20"]) / df["ma20"]

# Volatility and momentum
df["volatility_5d"] = df.groupby("ticker")["return"].transform(lambda x: x.rolling(5).std())
df["volatility_20d"] = df.groupby("ticker")["return"].transform(lambda x: x.rolling(20).std())
df["momentum_5d"] = df.groupby("ticker")["close"].transform(lambda x: x.pct_change(5))
df["momentum_20d"] = df.groupby("ticker")["close"].transform(lambda x: x.pct_change(20))

# Load market data and add SPY return and VIX
market_df = pd.read_csv("data/market_data.csv")
market_df["date"] = pd.to_datetime(market_df["date"])

spy = market_df[market_df["ticker"] == "SPY"][["date", "close_px"]].rename(columns={"close_px": "spy_close"})
spy["spy_return"] = spy["spy_close"].pct_change()
vix = market_df[market_df["ticker"] == "VIX"][["date", "close_px"]].rename(columns={"close_px": "vix"})

df = df.merge(spy[["date", "spy_return"]], on="date", how="left")
df = df.merge(vix[["date", "vix"]], on="date", how="left")

print(f"Loaded {len(df):,} rows")
print(f"Tickers: {df['ticker'].unique().tolist()}")

# Show available features
print(f"\nFeatures available:")
print(df.columns.tolist())

In [None]:
# Focus on AAPL
aapl = df[df["ticker"] == "AAPL"].copy().set_index("date")

# Target: will tomorrow go UP (1) or DOWN (0)?
aapl["tomorrow_return"] = aapl["return"].shift(-1)
aapl["direction"] = (aapl["tomorrow_return"] > 0).astype(int)

# Features we'll use
features = [
    # Basic (things we observe today)
    "return", "volume_chg", "range_pct", "gap",
    # Technical
    "volatility_20d", "momentum_5d",
    # Market
    "spy_return", "vix",
]

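# Prepare data (drop rows with missing values, including the last day,
# which has no "tomorrow" and would otherwise be mislabeled as DOWN)
class_data = aapl[features + ["tomorrow_return", "direction"]].dropna()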
split_idx = int(len(class_data) * 0.8)

train = class_data.iloc[:split_idx]
test = class_data.iloc[split_idx:]

X_train = train[features].values
y_train = train["direction"].values
X_test = test[features].values
y_test = test["direction"].values

print(f"Using {len(features)} features:")
for f in features:
    print(f"  • {f}")
print(f"\nTraining: {len(train)} days")
print(f"Testing:  {len(test)} days")
print(f"\nClass balance in test data:")
print(f"  Up days:   {y_test.sum()} ({y_test.mean():.1%})")
print(f"  Down days: {len(y_test) - y_test.sum()} ({1 - y_test.mean():.1%})")

### What is Random Forest?

A Random Forest is a collection of decision trees. Each tree learns simple rules like:
- "If volatility > 0.02 AND momentum < 0, predict DOWN"
- "If volume is high AND price is above MA, predict UP"

The "forest" part: instead of one tree, we build many trees (100 by default). Each tree is trained on a random sample of the rows (drawn with replacement), and each split considers only a random subset of the features. The final prediction is a **vote**: whatever most trees predict wins.

Why this works:
- Individual trees might overfit, but averaging many trees reduces noise
- Different trees focus on different patterns
- It's robust: it works reasonably well with default settings and doesn't need standardized features
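
To show the idea, here is a hand-rolled sketch that builds a tiny "forest" from individual decision trees and lets them vote. It uses synthetic data from `make_classification`, not our stock data, and the variable names are made up for this example. One caveat: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so treat this as an illustration of the principle.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for real features (not the stock data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Each tree sees a bootstrap sample: rows drawn with replacement
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_demo[rows], y_demo[rows]))

# Collect every tree's prediction, then take the majority vote per row
all_votes = np.array([t.predict(X_demo) for t in trees])   # shape: (25, 300)
forest_vote = (all_votes.mean(axis=0) > 0.5).astype(int)

print(f"First tree agrees with the vote on {(all_votes[0] == forest_vote).mean():.1%} of rows")
print(f"Voted accuracy on the training rows: {(forest_vote == y_demo).mean():.1%}")
```

An odd number of trees avoids ties. Accuracy here is measured on the same rows the trees were trained on, so it says nothing about out-of-sample performance.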

In [None]:
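# Fit Random Forest
# Note: max_depth=250 is far deeper than these trees are likely to grow,
# so in practice the depth is unlimited (the same as the default, None)
rf = RandomForestClassifier(n_estimators=100, max_depth=250, random_state=42)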
rf.fit(X_train, y_train)

# Predictions
y_pred_train = rf.predict(X_train)
y_pred_test = rf.predict(X_test)

# Accuracy
acc_train = accuracy_score(y_train, y_pred_train)
acc_test = accuracy_score(y_test, y_pred_test)

print("=" * 50)
print("RANDOM FOREST: Predicting UP or DOWN")
print("=" * 50)
print(f"Training accuracy: {acc_train:.1%}")
print(f"Test accuracy:     {acc_test:.1%}")
print()
print("Baseline (random guessing): 50%")
print(f"Improvement over baseline:  {acc_test - 0.5:+.1%}")

In [None]:
# Confusion matrix: where did we get it right/wrong?
cm = confusion_matrix(y_test, y_pred_test)

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(cm, cmap="Blues")

ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(["Predicted DOWN", "Predicted UP"])
ax.set_yticklabels(["Actual DOWN", "Actual UP"])

for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha="center", va="center", fontsize=20,
                color="white" if cm[i, j] > cm.max()/2 else "black")

ax.set_title("Confusion Matrix (Test Set)")
plt.colorbar(im)
plt.tight_layout()
plt.show()

print("Reading the matrix:")
print(f"  Correctly predicted DOWN: {cm[0,0]} days")
print(f"  Correctly predicted UP:   {cm[1,1]} days")
print(f"  Wrong (predicted UP, was DOWN):   {cm[0,1]} days")
print(f"  Wrong (predicted DOWN, was UP):   {cm[1,0]} days")

### Feature Importance: What Does the Model Care About?

One of the best things about Random Forest: it tells you **which features matter most**.

The importance score measures how much each feature reduces impurity (how mixed the UP and DOWN days are) across all the splits in all the trees. The scores sum to 1. Higher = more important.
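
Here is a short sketch of what `feature_importances_` contains, using synthetic data rather than our stock data. With `shuffle=False`, `make_classification` places the 2 informative features first and the 3 pure-noise features after them, so we know in advance which columns should score highest. All variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: columns 0-1 are informative, columns 2-4 are noise
X_imp, y_imp = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
rf_imp = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_imp, y_imp)

importances = rf_imp.feature_importances_
print("Importances:", importances.round(3))
print("Sum:", importances.sum().round(3))

# The forest's score is the average of each tree's own importance scores
per_tree = np.array([t.feature_importances_ for t in rf_imp.estimators_])
print("Matches per-tree average:", np.allclose(per_tree.mean(axis=0), importances))
```

Because the scores are shares of a total, a feature's importance only means something relative to the other features in the same model.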

In [None]:
# Feature importance
importance_df = pd.DataFrame({
    "feature": features,
    "importance": rf.feature_importances_
}).sort_values("importance", ascending=True)

# Plot
fig, ax = plt.subplots(figsize=(10, 8))
# Use a separate name so we don't overwrite the cluster `colors` list from Section 1
bar_colors = plt.cm.Blues(importance_df["importance"] / importance_df["importance"].max())
ax.barh(importance_df["feature"], importance_df["importance"], color=bar_colors)
ax.set_xlabel("Importance")
ax.set_title("Random Forest: Which Features Matter Most?")
plt.tight_layout()
plt.show()

# Top features
print("Most important features:")
for _, row in importance_df.tail(5).iloc[::-1].iterrows():
    print(f"  {row['feature']}: {row['importance']:.3f}")

### The Takeaway

We achieved a small improvement over random guessing (50%).

That might sound disappointing, but consider:
- **Stock returns are hard to predict.** This is a fundamental result in finance.
- **Even small edges matter.** A few percent edge, applied consistently over thousands of trades, can be meaningful.
- **The model learned something real.** Look at the feature importance — it's telling us which signals have predictive power.
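
To put rough numbers on the "small edges" point, here is a back-of-the-envelope sketch. Every value in it is hypothetical and chosen only to illustrate the arithmetic: it assumes equal-sized wins and losses, independent trades, no compounding, and no trading costs.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic
win_rate = 0.53        # assumed share of correct UP/DOWN calls
move = 0.01            # assumed gain on a correct call and loss on a wrong one (1%)
n_trades = 1000        # assumed number of independent trades

# Expected return per trade: win_rate * move - (1 - win_rate) * move
edge_per_trade = (2 * win_rate - 1) * move

# Sum of expected returns across all trades (no compounding, no costs)
expected_total = edge_per_trade * n_trades

print(f"Expected return per trade: {edge_per_trade:.2%}")
print(f"Expected total over {n_trades} trades: {expected_total:.0%}")
```

Under these assumptions the edge is 0.06% per trade. A trading cost of 0.06% per trade would wipe it out entirely, which is one reason a small edge on paper does not automatically translate into profit.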

What's interesting about the feature importance:
- **Momentum features** often rank highly — recent trends have some predictive power
- **Volatility** measures how "noisy" the stock is
- **Volume changes** can signal unusual activity

The honest truth: you're not going to get rich from a simple model like this. But you now understand how practitioners approach prediction problems in finance.

### ✏️ Checkpoint: Tune the Forest

Try changing these Random Forest parameters and see how accuracy changes:
- `n_estimators`: Number of trees (default 100). More trees = more stable but slower.
- `max_depth`: How deep each tree can go (default None = unlimited). Shallower = less overfitting.
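
If you want a pattern to start from, here is a sketch of a depth comparison on synthetic, deliberately noisy data (not our stock data). The variable names are made up for this example. It uses a random `train_test_split`, which is fine for synthetic rows with no time order. For the stock data, keep the chronological split used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with 30% of labels flipped at random to mimic a noisy target
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, n_informative=3, flip_y=0.3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

results = {}
for depth in [2, 5, None]:                       # None = unlimited depth
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    results[depth] = (
        accuracy_score(y_tr, model.predict(X_tr)),   # train accuracy
        accuracy_score(y_te, model.predict(X_te)),   # test accuracy
    )
    print(f"max_depth={depth}: train={results[depth][0]:.1%}, test={results[depth][1]:.1%}")
```

Compare the gap between train and test accuracy at each depth. A large gap suggests the trees are memorizing noise in the training rows.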

In [None]:
# Your code here


---

## 3. Exercises

Complete these to finish Module 4.

### Exercise 4.1: Elbow Method

How do we choose K for K-Means? One way: plot the "inertia" (within-cluster sum of squares) for K=1,2,3,...,6. Look for an "elbow" where adding more clusters stops helping much.

```python
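inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)  # n_init=10 matches Section 1
    km.fit(X_scaled)
    inertias.append(km.inertia_)                            # Within-cluster sum of squares
plt.plot(range(1, 7), inertias, marker='o')
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.show()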
```

In [None]:
# Your code here


### Exercise 4.2: Try a Different Stock

Does Random Forest work better or worse on different stocks? Try running the classification on NVDA or MSFT instead of AAPL.

```python
# Change the ticker and re-run the classification
ticker = "NVDA"  # or "MSFT", "GOOGL", etc.
stock = df[df["ticker"] == ticker].copy().set_index("date")
# ... (the features are already in df for every ticker; repeat the target, split, and model steps)
```

Does the accuracy change? Does the feature importance change?

In [None]:
# Your code here — try a different stock


### Exercise 4.3: Commit Your Work

```bash
git add .
git commit -m "Complete module 4 exercises"
git push
```

---

## Recap

| Technique | Use When | Key Method |
|-----------|----------|------------|
| **K-Means** | Find groups without labels | `KMeans(n_clusters=K).fit(X)` |
| **Random Forest** | Predict categories, understand feature importance | `RandomForestClassifier().fit(X, y)` |

### Key Takeaways

1. **K-Means discovers structure** — it found industry groupings from just beta and dividend yield
2. **Always standardize** before K-Means (it uses distance)
3. **Classification reframes the problem** — "up or down?" is easier to interpret than "what's the return?"
4. **Feature importance** shows what the model learned
5. **Out-of-sample evaluation is essential** — train accuracy can lie, test accuracy tells the truth
6. **Small edges matter in finance** — even a few percent above 50% is meaningful at scale

---

**Next up:** Module 5 — Responsible Coding with AI. We'll build a project from scratch using AI assistance.