# Machine Learning for Finance - Fundamentals

Welcome to this comprehensive tutorial on machine learning fundamentals with applications in finance. By the end of this notebook, you will understand:

- Core machine learning concepts and terminology
- How to work with financial data in Python
- Key ML algorithms: regression, classification, and clustering
- Best practices for evaluating models on financial data
- How to build a simple ML-based trading strategy

**Prerequisites**: Intermediate Python knowledge (functions, classes, basic syntax)

---

## Part 1: Environment Setup and Python Essentials

First, let's install and import all the libraries we'll need throughout this tutorial.

In [None]:
# Install required packages (uncomment if needed)
# !pip install numpy pandas matplotlib seaborn scikit-learn yfinance

In [None]:
# Core libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, TimeSeriesSplit, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (
    mean_squared_error, r2_score,
    accuracy_score, precision_score, recall_score, 
    confusion_matrix, classification_report
)

# Financial data
import yfinance as yf

# Settings
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

print("All libraries imported successfully!")

### Quick NumPy and Pandas Refresher

Machine learning relies heavily on numerical operations. Let's quickly review the key data structures.

In [None]:
# NumPy: Efficient numerical arrays
prices = np.array([100, 102, 101, 105, 103, 108])
print("Stock prices:", prices)
print("Mean price:", np.mean(prices))
print("Standard deviation:", np.std(prices))

# Calculate daily returns: (today - yesterday) / yesterday
returns = np.diff(prices) / prices[:-1]
print("Daily returns:", returns)

In [None]:
# Pandas: DataFrames for tabular data
df_example = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=6),
    'Price': prices,
    'Volume': [1000, 1200, 900, 1500, 1100, 1300]
})
df_example['Returns'] = df_example['Price'].pct_change()
print(df_example)
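
The return calculations above can be checked by compounding them back into a total return. The following is a small sketch using the same toy prices; the variable names are illustrative and not part of the rest of the notebook. It also computes log returns, which add up over time rather than compounding multiplicatively.

```python
import numpy as np
import pandas as pd

# Same toy prices as in the refresher above
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 103.0, 108.0])

# Simple returns: (P_t - P_{t-1}) / P_{t-1}
simple_returns = prices.pct_change().dropna()
# Log returns: ln(P_t / P_{t-1})
log_returns = np.log(prices / prices.shift(1)).dropna()

# Three routes to the same total return over the whole period
total_direct = prices.iloc[-1] / prices.iloc[0] - 1
total_from_simple = (1 + simple_returns).prod() - 1   # compound multiplicatively
total_from_log = np.exp(log_returns.sum()) - 1        # sum, then exponentiate

print(f"Direct:            {total_direct:.4f}")
print(f"From simple rets:  {total_from_simple:.4f}")
print(f"From log rets:     {total_from_log:.4f}")
```

All three numbers should agree, which is a quick sanity check whenever you build return-based features.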

---

## Part 2: Understanding Machine Learning

### What is Machine Learning?

Machine Learning (ML) is a subset of artificial intelligence where computers learn patterns from data without being explicitly programmed. Instead of writing rules, we provide examples and let the algorithm discover the rules.

### Types of Machine Learning

| Type | Description | Financial Example |
|------|-------------|-------------------|
| **Supervised Learning** | Learn from labeled data (input → output) | Predict stock price direction |
| **Unsupervised Learning** | Find patterns in unlabeled data | Group similar stocks together |
| **Reinforcement Learning** | Learn through trial and error | Algorithmic trading agents |

### The ML Workflow

```
Data Collection → Data Preprocessing → Feature Engineering → 
Model Training → Model Evaluation → Deployment
```

### Key Terminology

- **Features (X)**: Input variables used to make predictions (e.g., past prices, volume, indicators)
- **Labels/Target (y)**: The output we want to predict (e.g., tomorrow's price, buy/sell signal)
- **Training Data**: Data used to teach the model
- **Test Data**: Data used to evaluate model performance (never seen during training)
- **Overfitting**: Model memorizes training data but fails on new data
- **Underfitting**: Model is too simple to capture patterns

In [None]:
# Visual: Overfitting vs Good Fit
np.random.seed(42)
X_demo = np.linspace(0, 10, 20)
y_demo = 2 * X_demo + 1 + np.random.randn(20) * 2

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Underfitting (too simple)
axes[0].scatter(X_demo, y_demo, alpha=0.7)
axes[0].axhline(y=np.mean(y_demo), color='red', linewidth=2)
axes[0].set_title('Underfitting\n(Model too simple)', fontsize=12)

# Good fit
axes[1].scatter(X_demo, y_demo, alpha=0.7)
z = np.polyfit(X_demo, y_demo, 1)
axes[1].plot(X_demo, np.poly1d(z)(X_demo), color='green', linewidth=2)
axes[1].set_title('Good Fit\n(Captures the pattern)', fontsize=12)

# Overfitting (too complex)
axes[2].scatter(X_demo, y_demo, alpha=0.7)
z = np.polyfit(X_demo, y_demo, 15)
X_smooth = np.linspace(0, 10, 100)
axes[2].plot(X_smooth, np.poly1d(z)(X_smooth), color='red', linewidth=2)
axes[2].set_title('Overfitting\n(Memorizes noise)', fontsize=12)

plt.tight_layout()
plt.show()
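
The plots show the idea visually; the same effect can be measured numerically. The sketch below is an illustration on synthetic data (the sample size, noise level, and polynomial degrees are assumptions chosen for the demo). It fits polynomials of increasing degree and compares training error with error on held-out points.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 40)
y = 2 * x + 1 + rng.normal(scale=2.0, size=x.size)  # linear signal plus noise

# Alternate points into train/test halves.
# (Acceptable here because this toy data has no time ordering.)
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def train_test_mse(degree):
    """Fit a polynomial on the training half and return (train MSE, test MSE)."""
    model = np.poly1d(np.polyfit(x_train, y_train, degree))
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 0 = constant (underfit), 1 = line (matches the true process), 8 = flexible
errors = {degree: train_test_mse(degree) for degree in (0, 1, 8)}
for degree, (train_mse, test_mse) in errors.items():
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Training error can only fall as the degree grows, so it says little about model quality on its own; the gap between training and test error is the signal to watch.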

---

## Part 3: Working with Financial Data

Let's fetch real stock market data and explore it.

In [None]:
# Fetch historical data for Apple, Google, and Bitcoin
tickers = ['AAPL', 'GOOGL', 'BTC-USD']
start_date = '2020-01-01'
end_date = '2024-12-31'

# Download data
data = {}
for ticker in tickers:
    data[ticker] = yf.download(ticker, start=start_date, end=end_date, progress=False)
    print(f"{ticker}: {len(data[ticker])} trading days")

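# We'll primarily use Apple for our examples
df = data['AAPL'].copy()

# Newer yfinance releases may return MultiIndex columns (field, ticker)
# even for a single ticker; flatten them so df['Close'] is a Series.
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.get_level_values(0)

df.head(10)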

In [None]:
# Basic info about the data
print("Data shape:", df.shape)
print("\nColumn types:")
print(df.dtypes)
print("\nBasic statistics:")
df.describe()

### Exploratory Data Analysis (EDA)

Before building models, we need to understand our data.

In [None]:
# Visualize price history
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Price over time
axes[0, 0].plot(df.index, df['Close'], color='blue', linewidth=1)
axes[0, 0].set_title('AAPL Closing Price Over Time', fontsize=12)
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Price ($)')

# Volume over time
axes[0, 1].bar(df.index, df['Volume'], color='gray', alpha=0.7, width=1)
axes[0, 1].set_title('Trading Volume Over Time', fontsize=12)
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Volume')

# Daily returns distribution
daily_returns = df['Close'].pct_change().dropna()
axes[1, 0].hist(daily_returns, bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[1, 0].axvline(x=0, color='red', linestyle='--')
axes[1, 0].set_title('Distribution of Daily Returns', fontsize=12)
axes[1, 0].set_xlabel('Daily Return')
axes[1, 0].set_ylabel('Frequency')

# Daily high-low range (a rough proxy for intraday volatility)
df['Range'] = df['High'] - df['Low']
axes[1, 1].plot(df.index, df['Range'], color='orange', linewidth=0.5)
axes[1, 1].set_title('Daily Price Range (Volatility Indicator)', fontsize=12)
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('High - Low ($)')

plt.tight_layout()
plt.show()

### Feature Engineering for Finance

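Raw price levels make poor model inputs: they trend over time (they are non-stationary), so a model trained on one price range may not transfer to another. Instead, we derive features such as returns, moving-average ratios, and volatility.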

In [None]:
def create_features(df):
    """
    Create technical indicators and features for ML models.
    """
    df = df.copy()
    
    # Returns
    df['Returns'] = df['Close'].pct_change()
    df['Log_Returns'] = np.log(df['Close'] / df['Close'].shift(1))
    
    # Moving Averages
    df['SMA_5'] = df['Close'].rolling(window=5).mean()
    df['SMA_20'] = df['Close'].rolling(window=20).mean()
    df['SMA_50'] = df['Close'].rolling(window=50).mean()
    
    # Moving Average Crossover Signal
    df['SMA_Cross'] = (df['SMA_5'] > df['SMA_20']).astype(int)
    
    # Volatility (20-day rolling standard deviation of returns)
    df['Volatility'] = df['Returns'].rolling(window=20).std()
    
    # Relative Strength Index (RSI), simple-moving-average variant (14-day window)
    delta = df['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    df['RSI'] = 100 - (100 / (1 + rs))
    
    # Price relative to moving average
    df['Price_SMA20_Ratio'] = df['Close'] / df['SMA_20']
    
    # Momentum (5-day price change)
    df['Momentum_5'] = df['Close'].pct_change(periods=5)
    
    # Volume features
    df['Volume_SMA_20'] = df['Volume'].rolling(window=20).mean()
    df['Volume_Ratio'] = df['Volume'] / df['Volume_SMA_20']
    
    return df

# Apply feature engineering
df = create_features(df)
print("New features created:")
print(df.columns.tolist())
df.tail(10)
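
To make the RSI formula in `create_features` concrete, here is a hand-checkable sketch on six made-up closing prices with a 3-day window instead of 14 (both are assumptions chosen only to keep the arithmetic small). It uses the same simple rolling averages of gains and losses as the function above.

```python
import pandas as pd

close = pd.Series([10.0, 9.0, 11.0, 12.0, 11.0, 13.0])  # made-up prices
window = 3                                               # shortened from 14 for the demo

delta = close.diff()
# Average gain and average loss over the rolling window
avg_gain = delta.where(delta > 0, 0.0).rolling(window=window).mean()
avg_loss = (-delta.where(delta < 0, 0.0)).rolling(window=window).mean()

rs = avg_gain / avg_loss          # relative strength
rsi = 100 - 100 / (1 + rs)        # maps RS into the 0-100 range

print(pd.DataFrame({"close": close, "delta": delta,
                    "avg_gain": avg_gain, "avg_loss": avg_loss, "rsi": rsi}).round(3))
```

At index 3 the window holds gains of 0, 2, 1 (average 1) and losses of 1, 0, 0 (average 1/3), so RS = 3 and RSI = 100 - 100/4 = 75.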

In [None]:
# Visualize some technical indicators
fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

# Price with Moving Averages
recent = df.iloc[-252:]  # Last year of data
axes[0].plot(recent.index, recent['Close'], label='Close', linewidth=1)
axes[0].plot(recent.index, recent['SMA_5'], label='SMA 5', linewidth=1, alpha=0.8)
axes[0].plot(recent.index, recent['SMA_20'], label='SMA 20', linewidth=1, alpha=0.8)
axes[0].plot(recent.index, recent['SMA_50'], label='SMA 50', linewidth=1, alpha=0.8)
axes[0].set_title('Price with Moving Averages', fontsize=12)
axes[0].legend(loc='upper left')
axes[0].set_ylabel('Price ($)')

# RSI
axes[1].plot(recent.index, recent['RSI'], color='purple', linewidth=1)
axes[1].axhline(y=70, color='red', linestyle='--', alpha=0.7, label='Overbought (70)')
axes[1].axhline(y=30, color='green', linestyle='--', alpha=0.7, label='Oversold (30)')
axes[1].set_title('Relative Strength Index (RSI)', fontsize=12)
axes[1].set_ylabel('RSI')
axes[1].legend(loc='upper left')
axes[1].set_ylim(0, 100)

# Volatility
axes[2].plot(recent.index, recent['Volatility'] * 100, color='orange', linewidth=1)
axes[2].set_title('20-Day Rolling Volatility', fontsize=12)
axes[2].set_ylabel('Volatility (%)')
axes[2].set_xlabel('Date')

plt.tight_layout()
plt.show()

---

## Part 4: Data Preprocessing

Before feeding data into ML models, we need to clean and prepare it properly.

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal rows: {len(df)}")

In [None]:
# Handle missing values by dropping rows with NaN
# (These occur at the start due to rolling calculations)
df_clean = df.dropna().copy()
print(f"Rows after cleaning: {len(df_clean)}")
print(f"Rows removed: {len(df) - len(df_clean)}")

### Creating Labels for Classification

For classification, we need to create target labels. A common task is predicting whether the price will go up or down.

In [None]:
# Create target variable: Will price go UP (1) or DOWN (0) tomorrow?
df_clean['Target'] = (df_clean['Close'].shift(-1) > df_clean['Close']).astype(int)

# Remove the last row (no future data to create target)
df_clean = df_clean.iloc[:-1]

print("Target distribution:")
print(df_clean['Target'].value_counts())
print(f"\nUp days: {df_clean['Target'].sum() / len(df_clean) * 100:.1f}%")
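
The `shift(-1)` step is easy to get backwards, so it helps to see it on a tiny frame. This is a minimal sketch with four invented closing prices; the `Next_Close` column is added only for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Close": [100.0, 101.0, 99.0, 102.0]},
    index=pd.date_range("2024-01-01", periods=4),
)

# shift(-1) pulls tomorrow's close onto today's row
toy["Next_Close"] = toy["Close"].shift(-1)
toy["Target"] = (toy["Next_Close"] > toy["Close"]).astype(int)

# The last row has no tomorrow, so its label is meaningless and is dropped
toy = toy.iloc[:-1]
print(toy)
```

Each remaining row pairs information known at today's close with a label that is only revealed tomorrow, which is the alignment a predictive model needs.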

### Feature Scaling

Many ML algorithms work better when features are on similar scales.

In [None]:
# Select features for our models
feature_columns = [
    'Returns', 'Volatility', 'RSI', 'Price_SMA20_Ratio',
    'Momentum_5', 'Volume_Ratio', 'SMA_Cross'
]

X = df_clean[feature_columns].values
y = df_clean['Target'].values

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)

# Show feature statistics before scaling
print("\nFeature statistics (before scaling):")
print(df_clean[feature_columns].describe().round(4))

In [None]:
# StandardScaler: transforms data to have mean=0 and std=1
scaler = StandardScaler()

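# Illustration only: fitting on the full dataset leaks test-period statistics.
# For modeling, fit the scaler on the training split (see the next section).
X_scaled = scaler.fit_transform(X)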

print("After StandardScaler:")
print(f"Mean of each feature: {X_scaled.mean(axis=0).round(4)}")
print(f"Std of each feature: {X_scaled.std(axis=0).round(4)}")

### Train/Test Split for Time Series

**IMPORTANT**: For time series data, we cannot randomly split! We must preserve temporal order to avoid look-ahead bias.

In [None]:
# WRONG way (don't do this for time series!)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# CORRECT way: Use temporal split
split_index = int(len(X) * 0.8)  # 80% train, 20% test

X_train = X[:split_index]
X_test = X[split_index:]
y_train = y[:split_index]
y_test = y[split_index:]

# Scale AFTER splitting (fit on train, transform both)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same parameters!

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nTrain date range: {df_clean.index[0].date()} to {df_clean.index[split_index-1].date()}")
print(f"Test date range: {df_clean.index[split_index].date()} to {df_clean.index[-1].date()}")
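
The "fit on train, transform both" rule can be sketched without scikit-learn. The example below uses a hypothetical feature that drifts upward over time (an assumption made to exaggerate the effect) and standardizes it by hand, using the population standard deviation as `StandardScaler` does.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature whose level drifts upward through time
feature = np.linspace(0.0, 5.0, 100) + rng.normal(scale=0.1, size=100)
train, test = feature[:80], feature[80:]   # temporal split, no shuffling

# "fit": learn the parameters from the training period only
mu, sigma = train.mean(), train.std()

# "transform": apply the same parameters to both periods
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma

print(f"train mean/std after scaling: {train_z.mean():.3f} / {train_z.std():.3f}")
print(f"test  mean/std after scaling: {test_z.mean():.3f} / {test_z.std():.3f}")
```

The scaled test set does not have mean 0 and standard deviation 1, and that is expected: it reflects how different the test period really is from the training period, which is information a live model would also face.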

---

## Part 5: Supervised Learning - Regression

Regression predicts a continuous value. Let's predict next-day returns.

In [None]:
# For regression, our target is the next day's return (continuous)
df_reg = df_clean.copy()
df_reg['Target_Return'] = df_reg['Returns'].shift(-1)
df_reg = df_reg.dropna()

X_reg = df_reg[feature_columns].values
y_reg = df_reg['Target_Return'].values

# Time series split
split_idx = int(len(X_reg) * 0.8)
X_train_reg = X_reg[:split_idx]
X_test_reg = X_reg[split_idx:]
y_train_reg = y_reg[:split_idx]
y_test_reg = y_reg[split_idx:]

# Scale features
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

In [None]:
# Train a Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train_reg_scaled, y_train_reg)

# Make predictions
y_pred_train = lr_model.predict(X_train_reg_scaled)
y_pred_test = lr_model.predict(X_test_reg_scaled)

# Evaluate
print("Linear Regression Results:")
print(f"\nTraining Set:")
print(f"  R² Score: {r2_score(y_train_reg, y_pred_train):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_train_reg, y_pred_train)):.6f}")

print(f"\nTest Set:")
print(f"  R² Score: {r2_score(y_test_reg, y_pred_test):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_test)):.6f}")

In [None]:
# Understand the model: Feature coefficients
coef_df = pd.DataFrame({
    'Feature': feature_columns,
    'Coefficient': lr_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print("Feature Importance (by coefficient magnitude):")
print(coef_df)

# Visualize coefficients
plt.figure(figsize=(10, 5))
colors = ['green' if c > 0 else 'red' for c in coef_df['Coefficient']]
plt.barh(coef_df['Feature'], coef_df['Coefficient'], color=colors)
plt.xlabel('Coefficient Value')
plt.title('Linear Regression Coefficients')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.show()

In [None]:
# Visualize predictions vs actual
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot
axes[0].scatter(y_test_reg, y_pred_test, alpha=0.5, s=10)
axes[0].plot([y_test_reg.min(), y_test_reg.max()], 
             [y_test_reg.min(), y_test_reg.max()], 
             'r--', linewidth=2, label='Perfect prediction')
axes[0].set_xlabel('Actual Returns')
axes[0].set_ylabel('Predicted Returns')
axes[0].set_title('Predicted vs Actual Returns')
axes[0].legend()

# Time series comparison
test_dates = df_reg.index[split_idx:]
axes[1].plot(test_dates[:100], y_test_reg[:100], label='Actual', alpha=0.8)
axes[1].plot(test_dates[:100], y_pred_test[:100], label='Predicted', alpha=0.8)
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Returns')
axes[1].set_title('Actual vs Predicted Returns (First 100 Test Days)')
axes[1].legend()

plt.tight_layout()
plt.show()

### Key Insight

Notice the R² is likely low or even negative. This is expected! Financial markets are notoriously hard to predict, especially with simple models. This is why:

1. Markets are largely efficient - prices quickly incorporate available information
2. There's significant randomness (noise) in short-term movements
3. Simple linear relationships often don't capture market dynamics

But even slight predictive power can be valuable in trading!
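
A negative R² can look like a bug, so here is a short sketch of the formula, R² = 1 - SS_res / SS_tot, on four invented daily returns. The helper `r_squared` is illustrative and written to follow the same definition as `r2_score`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

actual = np.array([0.010, -0.020, 0.015, -0.005])   # made-up daily returns

predict_mean = np.full_like(actual, actual.mean())  # always predict the average
predict_wrong_sign = -actual                        # systematically wrong direction

print(f"Perfect prediction:   {r_squared(actual, actual):.2f}")
print(f"Always the mean:      {r_squared(actual, predict_mean):.2f}")
print(f"Wrong sign every day: {r_squared(actual, predict_wrong_sign):.2f}")
```

R² compares the model with a baseline that always predicts the mean of the evaluated data, so any model that does worse than that baseline scores below zero.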

---

## Part 6: Supervised Learning - Classification

Classification predicts discrete categories. Let's predict whether price goes UP or DOWN.

In [None]:
# We already prepared classification data earlier
print("Classification task: Predict if tomorrow's price is UP (1) or DOWN (0)")
print(f"\nTraining samples: {len(X_train_scaled)}")
print(f"Test samples: {len(X_test_scaled)}")
print(f"\nClass balance in training: {np.mean(y_train):.2%} UP days")
print(f"Class balance in test: {np.mean(y_test):.2%} UP days")

### Logistic Regression

Despite its name, Logistic Regression is a classification algorithm. It predicts probabilities.

In [None]:
# Train Logistic Regression
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred_log = log_reg.predict(X_test_scaled)
y_prob_log = log_reg.predict_proba(X_test_scaled)[:, 1]  # Probability of UP

# Evaluate
print("Logistic Regression Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_log):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_log):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_log):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_log, target_names=['DOWN', 'UP']))
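
To show where the probabilities come from, the sketch below applies the logistic (sigmoid) function to a linear score by hand. The weights, bias, and inputs are hypothetical values chosen for illustration, not coefficients from the fitted `log_reg` model.

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

weights = np.array([0.8, -0.5])   # hypothetical coefficients for two features
bias = 0.1                        # hypothetical intercept

samples = np.array([
    [1.0, 0.5],    # score =  0.65
    [-1.0, 1.0],   # score = -1.20
    [0.0, 0.0],    # score =  0.10
])

scores = samples @ weights + bias      # linear part of the model
probs = sigmoid(scores)                # probability of class 1 (UP)
labels = (probs >= 0.5).astype(int)    # default decision threshold

for s, p, lab in zip(scores, probs, labels):
    print(f"score {s:+.2f} -> P(UP) = {p:.3f} -> predicted class {lab}")
```

Because the output is a probability, you can raise the threshold above 0.5 to trade only on higher-confidence signals, at the cost of fewer trades.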

### Decision Tree

Decision Trees create interpretable rules - perfect for understanding what drives predictions.

In [None]:
# Train Decision Tree (with limited depth to prevent overfitting)
dt_model = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_model.fit(X_train_scaled, y_train)

# Predictions
y_pred_dt = dt_model.predict(X_test_scaled)

# Evaluate
print("Decision Tree Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_dt):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_dt):.4f}")

In [None]:
# Visualize feature importance for Decision Tree
importance_dt = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=True)

plt.figure(figsize=(10, 5))
plt.barh(importance_dt['Feature'], importance_dt['Importance'], color='steelblue')
plt.xlabel('Feature Importance')
plt.title('Decision Tree Feature Importance')
plt.tight_layout()
plt.show()

### Random Forest

Random Forest is an ensemble of many decision trees - usually more robust than a single tree.

In [None]:
# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42, n_jobs=-1)
rf_model.fit(X_train_scaled, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test_scaled)
y_prob_rf = rf_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate
print("Random Forest Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_rf):.4f}")

In [None]:
# Compare all classification models
models_comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_log),
        accuracy_score(y_test, y_pred_dt),
        accuracy_score(y_test, y_pred_rf)
    ],
    'Precision': [
        precision_score(y_test, y_pred_log),
        precision_score(y_test, y_pred_dt),
        precision_score(y_test, y_pred_rf)
    ],
    'Recall': [
        recall_score(y_test, y_pred_log),
        recall_score(y_test, y_pred_dt),
        recall_score(y_test, y_pred_rf)
    ]
})

print("Model Comparison:")
print(models_comparison.to_string(index=False))

In [None]:
# Visualize confusion matrices for all three models
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, (name, y_pred) in zip(axes, 
    [('Logistic Regression', y_pred_log), 
     ('Decision Tree', y_pred_dt), 
     ('Random Forest', y_pred_rf)]):
    
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=['DOWN', 'UP'], yticklabels=['DOWN', 'UP'])
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_title(f'{name}\nAccuracy: {accuracy_score(y_test, y_pred):.2%}')

plt.tight_layout()
plt.show()
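
Accuracy, precision, and recall can all be read straight off a confusion matrix. The sketch below uses invented counts laid out like the heatmaps above (rows are actual classes, columns are predicted classes, in the order DOWN then UP).

```python
# Invented counts: rows = actual, columns = predicted, order = [DOWN, UP]
cm = [[40, 60],
      [30, 70]]

tn, fp = cm[0]   # actual DOWN: predicted DOWN (true negative), predicted UP (false positive)
fn, tp = cm[1]   # actual UP:   predicted DOWN (false negative), predicted UP (true positive)

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all days called correctly
precision = tp / (tp + fp)                   # when the model says UP, how often is it right?
recall = tp / (tp + fn)                      # of all real UP days, how many were caught?

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

For a long-only strategy, precision on the UP class often matters more than recall, since every false positive is a day spent holding a falling stock.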

---

## Part 7: Unsupervised Learning

Unsupervised learning finds patterns without labeled data.

### K-Means Clustering

Let's group stocks by their behavior patterns.

In [None]:
# Prepare data for multiple stocks
stock_features = {}
tickers_for_clustering = ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'META', 
                           'NVDA', 'JPM', 'BAC', 'XOM', 'CVX',
                           'JNJ', 'PFE', 'KO', 'PEP', 'WMT']

print("Downloading stock data for clustering...")
for ticker in tickers_for_clustering:
    try:
        stock_data = yf.download(ticker, start='2023-01-01', end='2024-12-31', progress=False)
        if len(stock_data) > 100:
            # squeeze() turns a one-column DataFrame (newer yfinance) into a Series
            close = stock_data['Close'].squeeze()
            returns = close.pct_change().dropna()
            stock_features[ticker] = {
                'mean_return': returns.mean(),
                'volatility': returns.std(),
                'sharpe': returns.mean() / returns.std() if returns.std() > 0 else 0,
                'skewness': returns.skew(),
                'max_drawdown': (close / close.cummax() - 1).min()
            }
    except Exception as exc:
        # Report failures instead of silently skipping the ticker
        print(f"Skipping {ticker}: {exc}")

# Create DataFrame
cluster_df = pd.DataFrame(stock_features).T
print(f"\nCollected data for {len(cluster_df)} stocks")
cluster_df

In [None]:
# Scale features for clustering
scaler_cluster = StandardScaler()
X_cluster = scaler_cluster.fit_transform(cluster_df)

# Apply K-Means with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_df['Cluster'] = kmeans.fit_predict(X_cluster)

print("Cluster assignments:")
for cluster in sorted(cluster_df['Cluster'].unique()):
    stocks = cluster_df[cluster_df['Cluster'] == cluster].index.tolist()
    print(f"\nCluster {cluster}: {stocks}")

In [None]:
# Visualize clusters
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Volatility vs Return
colors = ['red', 'blue', 'green']
for cluster in range(3):
    mask = cluster_df['Cluster'] == cluster
    axes[0].scatter(
        cluster_df.loc[mask, 'volatility'] * 100,
        cluster_df.loc[mask, 'mean_return'] * 100,
        c=colors[cluster], label=f'Cluster {cluster}', s=100, alpha=0.7
    )
    for ticker in cluster_df[mask].index:
        axes[0].annotate(ticker, 
            (cluster_df.loc[ticker, 'volatility'] * 100, 
             cluster_df.loc[ticker, 'mean_return'] * 100),
            fontsize=8, ha='center', va='bottom')

axes[0].set_xlabel('Daily Volatility (%)')
axes[0].set_ylabel('Mean Daily Return (%)')
axes[0].set_title('Stock Clusters: Risk vs Return')
axes[0].legend()
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)

# Plot 2: Cluster characteristics
cluster_means = cluster_df.groupby('Cluster')[['mean_return', 'volatility', 'sharpe']].mean()
cluster_means.plot(kind='bar', ax=axes[1], width=0.8)
axes[1].set_xlabel('Cluster')
axes[1].set_ylabel('Value')
axes[1].set_title('Average Characteristics by Cluster')
axes[1].legend(loc='upper right')
axes[1].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

### Principal Component Analysis (PCA)

PCA reduces dimensionality while preserving the most important patterns.

In [None]:
# Apply PCA to our feature set
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_cluster)

print("Explained variance ratio:")
print(f"PC1: {pca.explained_variance_ratio_[0]:.2%}")
print(f"PC2: {pca.explained_variance_ratio_[1]:.2%}")
print(f"Total: {sum(pca.explained_variance_ratio_):.2%}")

In [None]:
# Visualize PCA results with clusters
plt.figure(figsize=(10, 7))

for cluster in range(3):
    mask = cluster_df['Cluster'] == cluster
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], 
                c=colors[cluster], label=f'Cluster {cluster}', s=150, alpha=0.7)

# Add stock labels
for i, ticker in enumerate(cluster_df.index):
    plt.annotate(ticker, (X_pca[i, 0], X_pca[i, 1]), 
                 fontsize=9, ha='center', va='bottom')

plt.xlabel(f'Principal Component 1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'Principal Component 2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('PCA: Stocks in 2D Space')
plt.legend()
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Understanding what the principal components represent
pca_loadings = pd.DataFrame(
    pca.components_.T,
    columns=['PC1', 'PC2'],
    index=cluster_df.columns[:-1]  # Exclude 'Cluster' column
)

print("PCA Loadings (what each feature contributes to each PC):")
print(pca_loadings.round(3))
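
For intuition about what `PCA` computes, the sketch below reproduces the core steps with NumPy: center the data, take a singular value decomposition, and turn the singular values into explained-variance ratios. The data is randomly generated (15 hypothetical "stocks" with 3 features, two of them nearly redundant), so the numbers do not correspond to the stocks above.

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(15, 1))
# Feature 1 is almost a copy of feature 0; feature 2 is independent noise
X = np.hstack([base,
               base + 0.1 * rng.normal(size=(15, 1)),
               rng.normal(size=(15, 1))])

# 1. Center each feature
Xc = X - X.mean(axis=0)

# 2. SVD: rows of Vt are the principal directions (the loadings)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Variance captured by each component, as a share of the total
explained_variance = S ** 2 / (len(X) - 1)
ratio = explained_variance / explained_variance.sum()

# 4. Project the data onto the components
scores = Xc @ Vt.T

print("Explained variance ratio:", ratio.round(3))
print("First two PC scores of sample 0:", scores[0, :2].round(3))
```

Keeping only the first two columns of `scores` corresponds to `n_components=2`: the remaining component is discarded along with its share of the variance.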

---

## Part 8: Model Evaluation Best Practices

Proper evaluation is crucial to avoid overfitting and ensure real-world performance.

### Time Series Cross-Validation

For time series, we use rolling/expanding windows instead of random folds.

In [None]:
# TimeSeriesSplit maintains temporal order
tscv = TimeSeriesSplit(n_splits=5)

# Visualize the splits
fig, ax = plt.subplots(figsize=(12, 4))

for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    ax.scatter(train_idx, [i] * len(train_idx), c='blue', s=1, label='Train' if i==0 else '')
    ax.scatter(test_idx, [i] * len(test_idx), c='red', s=1, label='Test' if i==0 else '')

ax.set_xlabel('Sample Index')
ax.set_ylabel('CV Fold')
ax.set_title('Time Series Cross-Validation Splits')
ax.legend(loc='upper left')
plt.tight_layout()
plt.show()
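
The index pattern in the plot can be written out explicitly. The helper below, `expanding_window_splits`, is a hand-rolled illustration of the expanding-window idea (a hypothetical function, not scikit-learn's code): every fold trains on all data before its test block, and the test blocks move forward in time.

```python
def expanding_window_splits(n_samples, n_splits):
    """Return (train_indices, test_indices) pairs with an expanding training window."""
    test_size = n_samples // (n_splits + 1)
    splits = []
    for fold in range(n_splits):
        # Test blocks are laid out back-to-back at the end of the sample
        test_start = n_samples - (n_splits - fold) * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        splits.append((train_idx, test_idx))
    return splits

splits = expanding_window_splits(n_samples=12, n_splits=3)
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train {train_idx[0]}-{train_idx[-1]}, test {test_idx[0]}-{test_idx[-1]}")
```

Note that every training index is strictly earlier than every test index in the same fold, which is the property that random K-fold splitting would violate.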

In [None]:
# Cross-validation scores for different models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
}

cv_results = {}
print("Cross-Validation Results (TimeSeriesSplit, 5 folds):")
print("-" * 50)

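# Wrap each model in a pipeline so the scaler is re-fit on each fold's
# training portion only (pre-scaled data would leak validation statistics)
from sklearn.pipeline import make_pipeline

for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_train, y_train, cv=tscv, scoring='accuracy')
    cv_results[name] = scores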
    print(f"{name}:")
    print(f"  Mean Accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
    print(f"  Fold scores: {[f'{s:.3f}' for s in scores]}")
    print()

In [None]:
# Visualize CV results
cv_df = pd.DataFrame(cv_results)
cv_df.index = [f'Fold {i+1}' for i in range(5)]

fig, ax = plt.subplots(figsize=(10, 5))
cv_df.plot(kind='bar', ax=ax, width=0.8)
ax.set_ylabel('Accuracy')
ax.set_title('Cross-Validation Accuracy by Fold')
ax.set_ylim(0.4, 0.7)
ax.axhline(y=0.5, color='gray', linestyle='--', label='Random guess')
ax.legend(loc='lower right')
ax.tick_params(axis='x', rotation=0)
plt.tight_layout()
plt.show()

### Avoiding Look-Ahead Bias

**Look-ahead bias** occurs when a model or backtest uses information that would not have been available at the time of the prediction. Common mistakes:

1. **Random train/test split** on time series data
2. **Scaling** with test data included in the calculation
3. **Feature engineering** using future values
4. **Model selection** based on test set performance

Always ask: "Would I have this information at the time of prediction?"
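
A toy experiment shows how much damage a one-day misalignment can do. In the sketch below the returns are pure random noise (an assumption, so no genuine edge exists), and the "signal" deliberately peeks at the same day's return.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Simulated daily returns with no predictable structure
returns = pd.Series(rng.normal(0.0, 0.01, size=500))

# A "signal" that is only knowable after the day's return is known
signal = (returns > 0).astype(int)

# Leaky backtest: the signal is applied to the same day it was computed from
leaky_pnl = (signal * returns).sum()

# Honest backtest: today's signal can only be traded tomorrow
honest_pnl = (signal.shift(1).fillna(0) * returns).sum()

print(f"Leaky total return:  {leaky_pnl:.2%}")
print(f"Honest total return: {honest_pnl:.2%}")
```

The leaky version looks spectacular on data that is unpredictable by construction, which is a useful smell test: results that seem too good usually point to an alignment error rather than a discovery.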

---

## Part 9: Practical Application - Building a Simple Trading Strategy

Let's combine everything we've learned into a simple ML-based trading strategy.

In [None]:
# Prepare data for backtesting
# Use only the test period; copy() so new columns are added to an independent frame
backtest_df = df_clean.iloc[split_index:].copy()

# Get model predictions for test period
backtest_df['Prediction'] = rf_model.predict(X_test_scaled)
backtest_df['Probability'] = rf_model.predict_proba(X_test_scaled)[:, 1]

print(f"Backtesting period: {backtest_df.index[0].date()} to {backtest_df.index[-1].date()}")
print(f"Number of trading days: {len(backtest_df)}")
backtest_df[['Close', 'Returns', 'Target', 'Prediction', 'Probability']].head(10)

In [None]:
# Simple strategy: 
# - Go LONG (hold stock) when model predicts UP
# - Go to CASH when model predicts DOWN

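# Strategy returns: the signal formed at the close of day t predicts day t+1,
# so it earns the NEXT day's return. Shift by one day to avoid look-ahead bias.
backtest_df['Position'] = backtest_df['Prediction'].shift(1).fillna(0)
backtest_df['Strategy_Returns'] = backtest_df['Position'] * backtest_df['Returns']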

# Cumulative returns
backtest_df['Cumulative_Market'] = (1 + backtest_df['Returns']).cumprod()
backtest_df['Cumulative_Strategy'] = (1 + backtest_df['Strategy_Returns']).cumprod()

# Calculate performance metrics
total_market_return = backtest_df['Cumulative_Market'].iloc[-1] - 1
total_strategy_return = backtest_df['Cumulative_Strategy'].iloc[-1] - 1

print("Performance Comparison:")
print(f"\nBuy & Hold (Market):")
print(f"  Total Return: {total_market_return:.2%}")
print(f"  Annualized Return: {(backtest_df['Cumulative_Market'].iloc[-1] ** (252/len(backtest_df)) - 1):.2%}")

print(f"\nML Strategy:")
print(f"  Total Return: {total_strategy_return:.2%}")
print(f"  Annualized Return: {(backtest_df['Cumulative_Strategy'].iloc[-1] ** (252/len(backtest_df)) - 1):.2%}")

In [None]:
# Visualize strategy performance
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Plot 1: Cumulative returns
axes[0].plot(backtest_df.index, backtest_df['Cumulative_Market'], 
             label='Buy & Hold', linewidth=2)
axes[0].plot(backtest_df.index, backtest_df['Cumulative_Strategy'], 
             label='ML Strategy', linewidth=2)
axes[0].set_title('Cumulative Returns: ML Strategy vs Buy & Hold', fontsize=12)
axes[0].set_ylabel('Portfolio Value (Starting $1)')
axes[0].legend(loc='upper left')
axes[0].axhline(y=1, color='gray', linestyle='--', alpha=0.5)

# Plot 2: Trading signals
axes[1].plot(backtest_df.index, backtest_df['Close'], label='Price', alpha=0.7)
# Mark long positions
long_mask = backtest_df['Prediction'] == 1
axes[1].fill_between(backtest_df.index, backtest_df['Close'].min(), backtest_df['Close'].max(),
                      where=long_mask, alpha=0.2, color='green', label='Long Position')
axes[1].set_title('Trading Signals', fontsize=12)
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Price ($)')
axes[1].legend(loc='upper left')

plt.tight_layout()
plt.show()

In [None]:
# Calculate additional risk metrics
strategy_returns = backtest_df['Strategy_Returns']
market_returns = backtest_df['Returns']

# Sharpe Ratio (assuming 0% risk-free rate for simplicity)
sharpe_market = np.sqrt(252) * market_returns.mean() / market_returns.std()
sharpe_strategy = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std()

# Maximum Drawdown
def max_drawdown(cumulative_returns):
    peak = cumulative_returns.cummax()
    drawdown = (cumulative_returns - peak) / peak
    return drawdown.min()

mdd_market = max_drawdown(backtest_df['Cumulative_Market'])
mdd_strategy = max_drawdown(backtest_df['Cumulative_Strategy'])

# Prediction accuracy (share of days where the predicted direction was correct)
correct_predictions = (backtest_df['Prediction'] == backtest_df['Target']).sum()
win_rate = correct_predictions / len(backtest_df)

print("Risk Metrics Comparison:")
print("\n{:<25} {:>15} {:>15}".format('Metric', 'Buy & Hold', 'ML Strategy'))
print("-" * 55)
print("{:<25} {:>15.2%} {:>15.2%}".format('Total Return', total_market_return, total_strategy_return))
print("{:<25} {:>15.2f} {:>15.2f}".format('Sharpe Ratio', sharpe_market, sharpe_strategy))
print("{:<25} {:>15.2%} {:>15.2%}".format('Max Drawdown', mdd_market, mdd_strategy))
print("{:<25} {:>15} {:>15.2%}".format('Prediction Accuracy', 'N/A', win_rate))
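
Maximum drawdown and the Sharpe ratio are easier to trust once you have checked them by hand. This sketch runs the same formulas on a five-point equity curve with invented values.

```python
import numpy as np
import pandas as pd

equity = pd.Series([1.00, 1.20, 0.90, 1.10, 1.30])   # invented portfolio values

# Drawdown: how far below the running peak the portfolio currently sits
running_peak = equity.cummax()
drawdown = (equity - running_peak) / running_peak
max_dd = drawdown.min()

# Annualized Sharpe ratio with a 0% risk-free rate, as in the cell above
period_returns = equity.pct_change().dropna()
sharpe = np.sqrt(252) * period_returns.mean() / period_returns.std()

print(pd.DataFrame({"equity": equity, "peak": running_peak, "drawdown": drawdown}).round(4))
print(f"Max drawdown: {max_dd:.2%}")
print(f"Sharpe ratio: {sharpe:.2f}")
```

The worst point is the drop from the peak of 1.20 to 0.90, a drawdown of (0.90 - 1.20) / 1.20 = -25%, even though the curve ends at a new high.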

### Important Caveats

This is a simplified example. In real trading:

1. **Transaction costs** - buying/selling has fees that eat into returns
2. **Slippage** - you may not get the exact price you expect
3. **Market impact** - large trades move prices
4. **Survivorship bias** - we're testing on stocks that still exist
5. **Overfitting risk** - past performance doesn't guarantee future results
6. **Regime changes** - markets evolve, patterns may stop working
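
The first caveat is simple to approximate in a backtest. The sketch below charges a flat fee each time the position changes; the 0.1% cost, the returns, and the positions are all assumed values for illustration, and real costs vary by broker and asset.

```python
import pandas as pd

returns = pd.Series([0.010, -0.005, 0.020, 0.000, 0.015])   # invented daily returns
position = pd.Series([1, 1, 0, 1, 1])                        # 1 = long, 0 = cash (already lagged)
cost_per_trade = 0.001                                       # assumed 0.1% per position change

# A trade happens whenever the position changes; entering on day 0 also counts
trades = position.diff().abs().fillna(position.iloc[0])

gross_returns = position * returns
net_returns = gross_returns - trades * cost_per_trade

print(f"Number of trades: {int(trades.sum())}")
print(f"Gross total:      {gross_returns.sum():.4f}")
print(f"Net total:        {net_returns.sum():.4f}")
```

A strategy that flips position often pays this fee often, so a model with a small edge and high turnover can be unprofitable after costs.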

---

## Part 10: Next Steps and Resources

Congratulations! You've learned the fundamentals of machine learning with financial applications. Here's where to go next:

### Deep Learning Preview

Neural networks can capture complex patterns:

- **LSTM (Long Short-Term Memory)**: Great for sequential/time series data
- **Transformer models**: State-of-the-art for many prediction tasks
- **Autoencoders**: Anomaly detection in trading

### Recommended Libraries

| Library | Purpose |
|---------|--------|
| `ta-lib` | Technical analysis indicators |
| `backtrader` | Backtesting trading strategies |
| `zipline` | Algorithmic trading simulation |
| `tensorflow` / `pytorch` | Deep learning |
| `xgboost` / `lightgbm` | Gradient boosting (often wins competitions) |

### Common Pitfalls to Avoid

1. **Overfitting**: Always validate on out-of-sample data
2. **Look-ahead bias**: Never use future data in predictions
3. **Survivorship bias**: Include delisted stocks in analysis
4. **Ignoring transaction costs**: They can eliminate profits
5. **Data snooping**: Don't test too many strategies on the same data
6. **Overconfidence**: Markets are adversarial - other traders adapt

### Further Learning

- **Books**: "Advances in Financial Machine Learning" by Marcos López de Prado
- **Courses**: Machine Learning for Trading (Coursera/Udacity)
- **Practice**: Kaggle competitions on financial prediction
- **Paper trading**: Test strategies with fake money before risking real capital

In [None]:
# Summary of what we covered
print("""
╔══════════════════════════════════════════════════════════════════╗
║           MACHINE LEARNING FOR FINANCE - SUMMARY                 ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  Part 1:  Environment Setup & Python Essentials                  ║
║  Part 2:  Understanding Machine Learning                         ║
║  Part 3:  Working with Financial Data                            ║
║  Part 4:  Data Preprocessing                                     ║
║  Part 5:  Supervised Learning - Regression                       ║
║  Part 6:  Supervised Learning - Classification                   ║
║  Part 7:  Unsupervised Learning                                  ║
║  Part 8:  Model Evaluation Best Practices                        ║
║  Part 9:  Building a Simple Trading Strategy                     ║
║  Part 10: Next Steps and Resources                               ║
║                                                                  ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  Key Takeaways:                                                  ║
║  • Always use temporal train/test splits for time series         ║
║  • Feature engineering is crucial for financial ML               ║
║  • Validate rigorously to avoid overfitting                      ║
║  • Markets are hard to predict - even a small edge is valuable   ║
║  • Consider transaction costs and real-world constraints         ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝
""")

print("Happy learning and trading!")

---

## Practice Exercises

Try these exercises to reinforce your learning:

1. **Feature Engineering**: Add more technical indicators (Bollinger Bands, MACD) and see if they improve model performance

2. **Different Assets**: Apply the same analysis to cryptocurrency (BTC-USD) or forex data

3. **Model Tuning**: Use GridSearchCV to find optimal hyperparameters for Random Forest

4. **Multi-class Classification**: Instead of UP/DOWN, try predicting Strong Up, Weak Up, Weak Down, Strong Down

5. **Portfolio Clustering**: Use clustering to build a diversified portfolio of uncorrelated stocks