##### Created by Gil on 9/24/2025

# Market-Driven Municipal Bond Embeddings via Siamese Networks

## Executive Summary
This notebook implements a Bond2Vec approach to learning municipal bond embeddings based on market behavior rather than traditional characteristics. We use a Siamese neural network to learn representations where bonds with similar trading patterns are close in embedding space.

## Key Innovation
Traditional bond similarity relies on static features (rating, sector, maturity). This approach discovers bonds that "dance to the same tune" in the market, potentially uncovering:
- Cross-sector relationships
- Hidden risk factors  
- Market microstructure effects
- Novel relative value opportunities

## Methodology Overview

### 1. Temporal Co-occurrence
- Bonds trading within 30-60 second windows are candidates for behavioral similarity
- Similar to "customers who viewed X also viewed Y" in recommendation systems
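
The windowing idea can be sketched with the standard library alone. This is a minimal illustration, not the notebook's pipeline (which floors timestamps with pandas further down); the CUSIPs and timestamps are made up, and `floor_to_window` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import combinations

WINDOW_SECONDS = 30  # illustrative; the text mentions 30-60 second windows


def floor_to_window(ts, seconds=WINDOW_SECONDS):
    """Round a timestamp down to the start of its window (measured from midnight)."""
    since_midnight = ts.hour * 3600 + ts.minute * 60 + ts.second
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(seconds=since_midnight - since_midnight % seconds)


# Hypothetical trades: (cusip, trade time)
trades = [
    ("CUSIP_A", datetime(2025, 9, 10, 10, 0, 5)),
    ("CUSIP_B", datetime(2025, 9, 10, 10, 0, 21)),
    ("CUSIP_C", datetime(2025, 9, 10, 10, 0, 44)),
    ("CUSIP_D", datetime(2025, 9, 10, 10, 0, 59)),
]

# Bucket CUSIPs by window, then pair up everything that shares a bucket
windows = defaultdict(set)
for cusip, ts in trades:
    windows[floor_to_window(ts)].add(cusip)

candidate_pairs = sorted(
    pair
    for cusips in windows.values()
    for pair in combinations(sorted(cusips), 2)
)
print(candidate_pairs)  # [('CUSIP_A', 'CUSIP_B'), ('CUSIP_C', 'CUSIP_D')]
```

Note that `CUSIP_B` and `CUSIP_C` trade only 23 seconds apart but land in different buckets. Fixed windows trade that boundary effect for simplicity.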

### 2. Behavioral Similarity
- Compute cosine similarity on trade history arrays (5 previous trades)
- Each trade encoded as: [yield_spread, treasury_spread, log_par_traded, trade_type_1, trade_type_2, log_seconds_ago]
- Similarity > 0.5 → positive pair (similar behavior)
- Similarity < 0.2 → negative pair (dissimilar behavior)
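
A small sketch of the similarity and labeling rule, assuming each history is a 5x6 array laid out as described above. The histories are toy values, and `label_pair` is a hypothetical helper; pairs between the two thresholds are treated as ambiguous and left unlabeled.

```python
import numpy as np

POSITIVE_THRESHOLD = 0.5  # from the text: similarity > 0.5 -> positive pair
NEGATIVE_THRESHOLD = 0.2  # from the text: similarity < 0.2 -> negative pair


def cosine_similarity(hist_a, hist_b):
    """Cosine similarity between two flattened trade-history arrays."""
    a = np.asarray(hist_a, dtype=float).ravel()
    b = np.asarray(hist_b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def label_pair(similarity):
    """Return 1 (positive), 0 (negative), or None for the ambiguous middle band."""
    if similarity > POSITIVE_THRESHOLD:
        return 1
    if similarity < NEGATIVE_THRESHOLD:
        return 0
    return None


# Toy histories: 5 trades x 6 features each
hist_a = np.tile([100.0, 20.0, 4.7, 0.0, 0.0, 2.2], (5, 1))
hist_b = hist_a * 1.1                                        # same shape of behavior, scaled
hist_c = np.tile([-80.0, -15.0, 4.7, 1.0, 0.0, 3.0], (5, 1))  # spreads of opposite sign

print(label_pair(cosine_similarity(hist_a, hist_b)))  # 1
print(label_pair(cosine_similarity(hist_a, hist_c)))  # 0
print(label_pair(0.35))                               # None
```

Because the arrays are not rescaled before the dot product, the two spread columns (tens to hundreds of basis points) are likely to dominate the similarity over the log-scaled and indicator columns.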

### 3. Siamese Network Training
- Twin networks with shared weights process two CUSIPs
- Learn embeddings that preserve behavioral similarity
- Contrastive loss pushes similar bonds together, dissimilar apart

### 4. Temporal Alignment (Critical!)
- Each pair uses point-in-time features from the EXACT trade timestamps
- Maintains temporal consistency: comparing apples to apples
- Creates index: (CUSIP, trade_datetime) → feature_vector
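
One way to picture the point-in-time index is a dictionary keyed by `(cusip, trade_datetime)`. The sketch below is illustrative only: the feature values are placeholders, and the pair column names (`trade_datetime1`, `trade_datetime2`) are assumptions rather than the notebook's actual schema.

```python
from datetime import datetime

# Hypothetical point-in-time features; the same CUSIP can appear at several timestamps
trades = [
    {"cusip": "CUSIP_A", "trade_datetime": datetime(2025, 9, 10, 10, 0, 5), "features": [0.1, 0.9]},
    {"cusip": "CUSIP_A", "trade_datetime": datetime(2025, 9, 10, 14, 30, 0), "features": [0.4, 0.7]},
    {"cusip": "CUSIP_B", "trade_datetime": datetime(2025, 9, 10, 10, 0, 21), "features": [0.2, 0.8]},
]

# (CUSIP, trade_datetime) -> feature_vector
feature_index = {(t["cusip"], t["trade_datetime"]): t["features"] for t in trades}


def pair_features(pair):
    """Look up both sides of a pair at the exact timestamps that formed the pair."""
    left = feature_index[(pair["cusip1"], pair["trade_datetime1"])]
    right = feature_index[(pair["cusip2"], pair["trade_datetime2"])]
    return left, right


pair = {
    "cusip1": "CUSIP_A", "trade_datetime1": datetime(2025, 9, 10, 10, 0, 5),
    "cusip2": "CUSIP_B", "trade_datetime2": datetime(2025, 9, 10, 10, 0, 21),
}
left, right = pair_features(pair)
print(left, right)  # [0.1, 0.9] [0.2, 0.8]
```

Keying on the timestamp as well as the CUSIP is what keeps the morning features of `CUSIP_A` from being replaced by its afternoon features.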

## Production Use Cases
1. **Daily Arbitrage Scanner**: Find bonds with high embedding similarity but divergent yield spreads
2. **Portfolio Risk Analysis**: Measure concentration in "behavioral clusters"
3. **Trade Idea Generation**: Find substitutes based on market behavior, not just ratings
4. **Anomaly Detection**: Flag bonds diverging from their embedding neighbors


In [None]:
import os

# Step 1: Set paths BEFORE importing tensorflow
os.environ['LD_LIBRARY_PATH'] = '/usr/local/cuda-12/lib64:/usr/local/cuda/lib64'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# Standard data science imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Machine learning imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Model, Input
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Data processing and utilities
from itertools import combinations
from tqdm import tqdm
import ast
import pickle
from collections import defaultdict
from typing import List, Tuple, Dict
import random
from scipy import stats
from decimal import Decimal

# Set random seeds for reproducibility
random.seed(42)  # Python's random module is used later for pair sampling and shuffling
np.random.seed(42)
tf.random.set_seed(42)

# Configuration
EMBEDDING_DIM = 128
BATCH_SIZE = 8192
LEARNING_RATE = 0.001
NUM_EPOCHS = 100
N_BONDS_TO_ANALYZE = 1000  # Number of bonds to analyze for correlation pairs
MIN_YIELD_SPREAD_RANGE = 5.0  # Minimum basis points movement

# Correlation thresholds
CORRELATION_THRESHOLD = 0.7  # For positive pairs
NEGATIVE_CORRELATION_THRESHOLD = -0.3  # For negative pairs

# Check GPU availability
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {tf.config.list_physical_devices('GPU')}")
print(f"Num GPUs: {len(tf.config.list_physical_devices('GPU'))}")
gpus = tf.config.list_physical_devices('GPU')
# Mixed precision for faster training (optional)
# tf.keras.mixed_precision.set_global_policy('mixed_float16')

TensorFlow version: 2.16.1
GPU Available: []
Num GPUs: 0


In [29]:
# Load the data
# file_path = "/home/gil/git/ficc/notebooks/gil_modeling/embeddings/processed_data_yield_spread_with_similar_trades_v2.pkl"
file_path = "/home/gil/git/ficc/notebooks/gil_modeling/embeddings/2025-09-22_one_month.pkl"

df = pd.read_pickle(file_path)

print(f"Loaded {len(df):,} bond trades")
print(f"Unique CUSIPs: {df['cusip'].nunique():,}")
print(f"Date range: {df['trade_date'].min()} to {df['trade_date'].max()}")
print(f"\nColumns: {df.columns.tolist()[:10]}...")  # Show first 10 columns

Loaded 1,405,865 bond trades
Unique CUSIPs: 196,446
Date range: 2025-08-20 00:00:00 to 2025-09-19 00:00:00

Columns: ['rtrs_control_number', 'cusip', 'yield', 'is_callable', 'refund_date', 'accrual_date', 'dated_date', 'next_sink_date', 'coupon', 'delivery_date']...


In [None]:
# df = df[:1000]

In [None]:
# Import the modules
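from modules.pair_generation import run_pair_generation_pipeline
# get_embeddings is called in step 4; it is assumed to live alongside run_training_pipeline
from modules.siamese_network import run_training_pipeline, get_embeddings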

# 1. Load your data
df = pd.read_pickle('/Users/gil/git/ficc/notebooks/gil_modeling/embeddings/2025-09-10.pkl')

# 2. Generate temporally-aligned pairs
pairs_df = run_pair_generation_pipeline(
    df,
    time_window_seconds=30,
    min_similarity=0.5,
    max_pairs_per_window=500,
    n_processes=11
)

# 3. Train the Siamese network with temporal alignment
base_network, artifacts, history = run_training_pipeline(
    df,           # All trades with reference data
    pairs_df,     # Pairs with temporal metadata
    test_size=0.2,
    embedding_dim=128,
    epochs=100
)

# 4. Generate embeddings for unique CUSIPs
unique_cusips_df = df.sort_values('trade_datetime').groupby('cusip').last().reset_index()
embeddings_df = get_embeddings(unique_cusips_df, base_network, artifacts)

## Understanding ficc.ai Trade History Arrays

The trade history arrays encode historical trades as 6-element numpy arrays. Each row represents one historical trade, with the following structure:

### Array Structure

```python
[yield_spread, treasury_spread, log_par_traded, trade_type_1, trade_type_2, log_seconds_ago]
```

### Feature Breakdown

| Index | Feature | Description | Example Value |
|-------|---------|-------------|---------------|
| 0 | **Yield Spread** | Bond yield × 100 - FICC yield curve level (basis points) | 101.42 |
| 1 | **Treasury Spread** | (Bond yield - Treasury rate) × 100 (basis points, rounded to 3 decimals) | 19.0 |
| 2 | **Log Par Traded** | log₁₀(trade size) - normalizes trade sizes | 4.70 |
| 3 | **Trade Type 1** | First component of trade type encoding | 0.0 |
| 4 | **Trade Type 2** | Second component of trade type encoding | 0.0 |
| 5 | **Log Seconds Ago** | log₁₀(1 + seconds since trade) - normalizes time decay | 2.18 |

### Trade Type Encoding (One-Hot)

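The trade type uses a 2-component binary encoding. Strictly speaking this is dummy coding rather than one-hot, because dealer-to-dealer trades are the all-zero baseline: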

| Trade Type | Code | Component 1 | Component 2 | Description |
|------------|------|-------------|-------------|-------------|
| **D** | [0, 0] | 0 | 0 | Dealer-to-dealer |
| **S** | [0, 1] | 0 | 1 | Dealer sell / Customer buy |
| **P** | [1, 0] | 1 | 0 | Dealer purchase / Customer sell |
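
The table can be turned into a small decoder. This is a sketch; `decode_trade_type` is a hypothetical helper, not a function from the notebook.

```python
# Codes taken from the table above
TRADE_TYPE_BY_CODE = {
    (0, 0): "D",  # Dealer-to-dealer
    (0, 1): "S",  # Dealer sell / Customer buy
    (1, 0): "P",  # Dealer purchase / Customer sell
}


def decode_trade_type(component_1, component_2):
    """Map the two trade-type components (stored as floats) back to D, S, or P."""
    code = (int(component_1), int(component_2))
    if code not in TRADE_TYPE_BY_CODE:
        raise ValueError(f"Unexpected trade type encoding: {code}")
    return TRADE_TYPE_BY_CODE[code]


print(decode_trade_type(0.0, 0.0))  # D
print(decode_trade_type(0.0, 1.0))  # S
print(decode_trade_type(1.0, 0.0))  # P
```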

### Example: Decoding a Trade History Array

```python
# Sample trade history with 5 trades
trade_history = np.array([
    [101.42, 19.0,  4.70, 0., 0., 2.18],  # D trade, 151 seconds ago
    [100.22, 17.8,  4.70, 0., 1., 2.18],  # S trade, 151 seconds ago
    [104.42, 22.0,  4.70, 0., 0., 3.75],  # D trade, ~5,620 seconds ago
    [106.02, 23.6,  4.70, 0., 0., 3.75],  # D trade, ~5,620 seconds ago
    [103.62, 21.2,  4.70, 0., 0., 3.75],  # D trade, ~5,620 seconds ago
])
```

**Interpretation:**
- All trades have the same size (10^4.70 ≈ 50,000 par value)
- Yield spreads range from 100.22 to 106.02 basis points
- Most are dealer-to-dealer trades (D), except trade #2 which is dealer sell (S)
- Trades occurred at two different times: ~151 seconds ago and ~5,620 seconds ago
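
The interpretation above can be checked by inverting the two log transforms from the feature table. The helper names are hypothetical.

```python
def par_from_log(log_par_traded):
    """Invert log10(trade size)."""
    return 10 ** log_par_traded


def seconds_from_log(log_seconds_ago):
    """Invert log10(1 + seconds since trade)."""
    return 10 ** log_seconds_ago - 1


print(round(par_from_log(4.70)))         # about 50,119 par value
print(round(seconds_from_log(2.18), 2))  # about 150.36 seconds
print(round(seconds_from_log(3.75), 2))  # about 5,622.41 seconds
```

The rounded figures quoted above (151 and ~5,620 seconds) are close to these values. The small gap comes from rounding and from the `1 +` inside the logarithm.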

### Key Transformations

1. **Logarithmic scaling**: Applied to trade size and time to handle wide ranges
2. **Basis point conversion**: Yields multiplied by 100 for better model scaling
3. **One-hot encoding**: Categorical trade types converted to numerical representation

### Model Usage

- **Yield Spread Models**: Use 5 historical trades (`NUM_TRADES_IN_HISTORY_YIELD_SPREAD_MODEL = 5`)
- **Dollar Price Models**: Use 2 historical trades (`NUM_TRADES_IN_HISTORY_DOLLAR_PRICE_MODEL = 2`)
- **Similar Trades Model**: Incorporates trades from bonds with similar characteristics

## Generating Training Pairs from Temporal Co-occurrence

The next cell buckets trades into fixed time windows, computes the cosine similarity between the trade histories of every pair of CUSIPs that share a window, and labels the pairs:
1. **Positive pairs**: Similarity > `min_similarity` (0.5 in this run)
2. **Negative pairs**: Similarity < 0.2
3. **Discarded pairs**: Similarity between the two thresholds, which is treated as ambiguous

In [36]:
import pandas as pd
import numpy as np
from itertools import combinations
from collections import defaultdict
import random
from datetime import timedelta
import pickle
from multiprocessing import Pool, cpu_count
from functools import partial
import time

def compute_behavioral_similarity_fast(hist1, hist2):
    """Simple cosine similarity using dot product on normalized histories"""
    if len(hist1) < 5 or len(hist2) < 5:
        return 0.0
    
    hist1_flat = hist1.flatten()
    hist2_flat = hist2.flatten()
    
    hist1_norm = hist1_flat / (np.linalg.norm(hist1_flat) + 1e-8)
    hist2_norm = hist2_flat / (np.linalg.norm(hist2_flat) + 1e-8)
    
    return np.dot(hist1_norm, hist2_norm)

def get_trade_history(hist):
    """Return the trade history as a numpy array (an empty array if it is missing)"""
    if isinstance(hist, np.ndarray):
        return hist
    return np.array(hist) if hist is not None else np.array([])

def process_single_window(args):
    """Process a single time window - for parallel execution"""
    window, window_df, time_window_seconds, min_similarity, max_pairs_per_window = args
    
    # Get unique CUSIPs in this window
    cusip_groups = window_df.groupby('cusip').first()
    
    if len(cusip_groups) < 2:
        return [], []
    
    # Prepare CUSIP data with histories
    cusip_data = {}
    for cusip, row in cusip_groups.iterrows():
        hist = get_trade_history(row['trade_history'])
        if len(hist) >= 5:
            cusip_data[cusip] = hist
    
    if len(cusip_data) < 2:
        return [], []
    
    # Generate all pairs within this window
    window_pairs = []
    for cusip1, cusip2 in combinations(cusip_data.keys(), 2):
        hist1 = cusip_data[cusip1]
        hist2 = cusip_data[cusip2]
        sim = compute_behavioral_similarity_fast(hist1, hist2)
        window_pairs.append((cusip1, cusip2, sim))
    
    # Cap pairs per window if needed
    if len(window_pairs) > max_pairs_per_window:
        window_pairs = random.sample(window_pairs, max_pairs_per_window)
    
    # Classify into positive and negative
    positive_pairs = []
    negative_pairs = []
    for cusip1, cusip2, sim in window_pairs:
        if sim > min_similarity:
            positive_pairs.append((cusip1, cusip2, sim))
        elif sim < 0.2:
            negative_pairs.append((cusip1, cusip2, sim))
    
    return positive_pairs, negative_pairs

def generate_temporal_behavioral_pairs_parallel(
    df,
    time_window_seconds=30,
    min_similarity=0.5,
    max_pairs_per_window=1000,
    n_processes=None
):
    """
    Generate pairs using parallel processing
    """
    if n_processes is None:
        n_processes = max(1, cpu_count() - 1)  # Leave one CPU free, but keep at least one worker
    
    print(f"Using {n_processes} processes for pair generation")
    
    # Ensure datetime column
    df['trade_dt'] = pd.to_datetime(df['trade_datetime'])
    
    # Create time windows
    df['time_window'] = df['trade_dt'].dt.floor(f'{time_window_seconds}s')
    
    # Group trades by window once instead of re-filtering the full frame per window
    grouped = df.groupby('time_window', sort=False)
    print(f"Processing {grouped.ngroups} time windows")
    
    # Prepare arguments for parallel processing
    window_args = [
        (window, window_df, time_window_seconds, min_similarity, max_pairs_per_window)
        for window, window_df in grouped
    ]
    
    # Process windows in parallel
    start_time = time.time()
    with Pool(processes=n_processes) as pool:
        results = pool.map(process_single_window, window_args)
    
    # Combine results
    all_positive_pairs = []
    all_negative_pairs = []
    seen_pairs = set()
    
    for pos_pairs, neg_pairs in results:
        # Add positive pairs, avoiding duplicates
        for cusip1, cusip2, sim in pos_pairs:
            pair_key = tuple(sorted([cusip1, cusip2]))
            if pair_key not in seen_pairs:
                seen_pairs.add(pair_key)
                all_positive_pairs.append((cusip1, cusip2, sim))
        
        # Add negative pairs, avoiding duplicates
        for cusip1, cusip2, sim in neg_pairs:
            pair_key = tuple(sorted([cusip1, cusip2]))
            if pair_key not in seen_pairs:
                seen_pairs.add(pair_key)
                all_negative_pairs.append((cusip1, cusip2, sim))
    
    elapsed_time = time.time() - start_time
    print(f"Pair generation completed in {elapsed_time:.2f} seconds")
    
    return all_positive_pairs, all_negative_pairs

def create_training_dataset(positive_pairs, negative_pairs, neg_to_pos_ratio=3):
    """Create balanced training dataset with proper CUSIP pairs"""
    n_positives = len(positive_pairs)
    n_negatives_needed = min(len(negative_pairs), n_positives * neg_to_pos_ratio)
    
    if len(negative_pairs) > n_negatives_needed:
        negative_pairs = random.sample(negative_pairs, n_negatives_needed)
    
    all_pairs = []
    
    for cusip1, cusip2, sim in positive_pairs:
        all_pairs.append({
            'cusip1': cusip1,
            'cusip2': cusip2,
            'similarity': sim,
            'label': 1
        })
    
    for cusip1, cusip2, sim in negative_pairs:
        all_pairs.append({
            'cusip1': cusip1,
            'cusip2': cusip2,
            'similarity': sim,
            'label': 0
        })
    
    random.shuffle(all_pairs)
    return pd.DataFrame(all_pairs)

def run_pair_generation_parallel(
    df_or_pkl_path,
    time_window_seconds=30,
    min_similarity=0.5,
    max_pairs_per_window=1000,
    output_path='embedding_pairs.pkl',
    n_processes=None
):
    """
    Main function to run parallel pair generation
    """
    # Load data
    if isinstance(df_or_pkl_path, str):
        print(f"Loading data from {df_or_pkl_path}")
        with open(df_or_pkl_path, 'rb') as f:
            df = pickle.load(f)
    else:
        df = df_or_pkl_path
    
    print(f"Data shape: {df.shape}")
    print(f"Unique CUSIPs: {df['cusip'].nunique()}")
    print(f"Time window: {time_window_seconds} seconds")
    
    # Generate pairs in parallel
    print("\nGenerating pairs with parallel processing...")
    positive_pairs, negative_pairs = generate_temporal_behavioral_pairs_parallel(
        df,
        time_window_seconds=time_window_seconds,
        min_similarity=min_similarity,
        max_pairs_per_window=max_pairs_per_window,
        n_processes=n_processes
    )
    
    print(f"\nGenerated:")
    print(f"  Positive pairs: {len(positive_pairs):,}")
    print(f"  Negative pairs: {len(negative_pairs):,}")
    
    # Create training dataset
    pairs_df = create_training_dataset(positive_pairs, negative_pairs)
    
    print(f"\nFinal dataset shape: {pairs_df.shape}")
    print(f"Positive ratio: {pairs_df['label'].mean():.2%}")
    
    # Save
    pairs_df.to_pickle(output_path)
    print(f"\nSaved to {output_path}")
    
    # Stats
    if len(pairs_df) > 0:
        print("\nSimilarity statistics:")
        pos_df = pairs_df[pairs_df['label']==1]
        neg_df = pairs_df[pairs_df['label']==0]
        
        if len(pos_df) > 0:
            print(f"Positive pairs - Mean: {pos_df['similarity'].mean():.3f}")
            print(f"Positive pairs - Std:  {pos_df['similarity'].std():.3f}")
        
        if len(neg_df) > 0:
            print(f"Negative pairs - Mean: {neg_df['similarity'].mean():.3f}")
            print(f"Negative pairs - Std:  {neg_df['similarity'].std():.3f}")
    
    return pairs_df

n_cpus = max(1, cpu_count() - 1)  # Leave one CPU free, but keep at least one worker
pairs_df = run_pair_generation_parallel(
    df,
    time_window_seconds=30,
    min_similarity=0.5,
    max_pairs_per_window=500,
    n_processes=n_cpus,
)



Data shape: (1405865, 150)
Unique CUSIPs: 196446
Time window: 30 seconds

Generating pairs with parallel processing...
Using 11 processes for pair generation
Processing 25439 time windows
Pair generation completed in 147.93 seconds

Generated:
  Positive pairs: 3,871,776
  Negative pairs: 2,661,695

Final dataset shape: (6533471, 4)
Positive ratio: 59.26%

Saved to embedding_pairs.pkl

Similarity statistics:
Positive pairs - Mean: 0.824
Positive pairs - Std:  0.146
Negative pairs - Mean: -0.257
Negative pairs - Std:  0.290


In [None]:
visualize_pair_distributions(pairs_df)

# CUSIP Siamese Network Embedding System

## Overview
This system implements a Siamese neural network for learning embeddings of CUSIPs. The model learns to map similar securities close together in an embedding space while keeping dissimilar securities far apart.

## Table of Contents
1. [Constants and Utilities](#constants-and-utilities)
2. [Helper Functions](#helper-functions)
3. [Feature Engineering](#feature-engineering)
4. [Siamese Network Architecture](#siamese-network-architecture)
5. [Training Pipeline](#training-pipeline)
6. [Usage Examples](#usage-examples)



## Constants and Utilities

### Key Constants

```python
NUM_OF_DAYS_IN_YEAR = 360  # MSRB convention for bond calculations
```

The system uses the **360-day year convention** commonly used in municipal bond markets (MSRB Rule G-33).

### Coupon Frequency Mappings

The code maintains two dictionaries for handling coupon payment frequencies:

1. **`COUPON_FREQUENCY_DICT`**: Maps numeric codes (0-36) to human-readable frequency descriptions
   - Example: `1` -> "Semiannually", `31` -> "Zero coupon"

2. **`COUPON_FREQUENCY_TYPE`**: Maps frequency descriptions to annual payment counts
   - Example: "Semiannually" -> `2`, "Quarterly" -> `4`, "Interest at maturity" -> `0`

### Default Values Dictionary

**`FEATURES_AND_DEFAULT_VALUES`** provides fallback values for missing features:
- Numeric fields default to sensible values (e.g., prices default to 100)
- Boolean fields default to False
- Text fields get placeholder values
- Some fields use computed defaults based on the dataset



## Helper Functions

### Date Calculation Functions

#### `diff_in_days_two_dates_360_30()`
Calculates the difference between two dates using the **30/360 day count convention**:
- Assumes 30 days per month
- Assumes 360 days per year
- Used for standardized bond calculations
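
A worked comparison of the 30/360 count against the exact calendar count. The function below mirrors the day-capping logic used in this notebook's helper but is a standalone sketch built on `datetime.date`; the dates are arbitrary.

```python
from datetime import date


def days_30_360(end, start):
    """30/360 day count: every month is 30 days, every year is 360 days."""
    d1 = min(start.day, 30)
    # The end day is capped at 30 only when the start day is 30 (or was capped to 30)
    d2 = min(end.day, 30) if d1 == 30 else end.day
    return (end.year - start.year) * 360 + (end.month - start.month) * 30 + (d2 - d1)


start, end = date(2025, 1, 15), date(2025, 7, 15)
print(days_30_360(end, start))  # 180: six 30-day months
print((end - start).days)       # 181: actual calendar days

# Month-end dates are capped at day 30 on both sides
print(days_30_360(date(2025, 3, 31), date(2025, 1, 31)))  # 60
```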

#### `diff_in_days_two_dates_exact()`
Calculates the exact calendar day difference between dates.

#### `diff_in_days()`
A wrapper function that:
- Handles accrual date logic
- Supports both 360/30 and exact conventions
- Returns 0 when either date is missing

### Interest Payment Calculations

#### `days_in_interest_payment()`
Calculates the number of days in an interest payment period:
- Converts frequency codes to actual periods
- Returns 360 divided by annual frequency
- Returns a sentinel of `1e6` for frequencies with no regular period (e.g. zero coupon, irregular payments)
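
A compact sketch of the two-step lookup, using a small subset of the notebook's frequency tables. `period_days` and the constant names are hypothetical; the 360-day year and the `1e6` sentinel follow the notebook's code.

```python
# Subsets of COUPON_FREQUENCY_DICT and COUPON_FREQUENCY_TYPE, for illustration only
FREQUENCY_BY_CODE = {1: "Semiannually", 5: "Quarterly", 16: "Interest at maturity"}
PAYMENTS_PER_YEAR = {"Semiannually": 2, "Quarterly": 4, "Interest at maturity": 0}
NO_REGULAR_PERIOD = 1e6  # sentinel for frequencies without a regular period


def period_days(code):
    """Days in one interest payment period under a 360-day year."""
    name = FREQUENCY_BY_CODE.get(code, "Unknown")
    per_year = PAYMENTS_PER_YEAR.get(name, NO_REGULAR_PERIOD)
    if per_year == 0 or per_year >= NO_REGULAR_PERIOD:
        return NO_REGULAR_PERIOD
    return 360 / per_year


print(period_days(1))   # 180.0 (semiannual)
print(period_days(5))   # 90.0 (quarterly)
print(period_days(16))  # 1000000.0 (interest at maturity)
print(period_days(99))  # 1000000.0 (unmapped code)
```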

#### `calculate_a_over_e()`
Computes the **A/E ratio**: accrued days (A) divided by the number of days in the interest payment period (E):
- Measures the portion of the current coupon period that has elapsed
- Used for accrued interest calculations
- Important for bond pricing
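
A worked A/E example. It assumes the ratio is simply the calendar days since the previous coupon divided by the period length; the dates are made up, and `a_over_e` is a hypothetical helper.

```python
from datetime import date


def a_over_e(settlement_date, previous_coupon_date, days_in_period=180):
    """Fraction of the current coupon period that has elapsed at settlement."""
    accrued_days = (settlement_date - previous_coupon_date).days
    return accrued_days / days_in_period


# 45 days into a 180-day (semiannual) period
ratio = a_over_e(date(2025, 4, 15), date(2025, 3, 1))
print(ratio)  # 0.25
```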

### Data Cleaning Functions

#### `to_numeric()`
Safely converts series to numeric values:
- Handles Decimal type objects
- Converts to float for calculations
- Maintains data integrity

#### `fill_missing_values()`
Intelligently fills missing values:
- Uses predefined defaults from `FEATURES_AND_DEFAULT_VALUES`
- Can compute dynamic defaults (e.g., mean values)
- Falls back to related columns when appropriate



## Feature Engineering

### Main Processing Function: `process_features()`

This function performs the core feature transformations:

#### 1. **Data Type Conversions**
- Converts Decimal columns to float
- Ensures numeric compatibility

#### 2. **Interest Payment Frequency Processing**
- Maps numeric codes to descriptive strings
- Handles missing values appropriately

#### 3. **Quantity Transformations**
- Applies **log transformation** to par traded amounts
- Helps normalize highly skewed distributions
- Formula: `log10(par_traded)`

#### 4. **Amount Fields Processing**
- Log-transforms large monetary values
- Includes: issue_amount, maturity_amount, orig_principal_amount
- Formula: `log10(1 + amount)` to handle zeros
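
The `1 +` inside the logarithm matters because amount fields can be zero. A minimal illustration, with a hypothetical `log_amount` helper:

```python
import math


def log_amount(amount):
    """log10(1 + amount): defined at zero, close to log10(amount) for large values."""
    return math.log10(1 + amount)


print(log_amount(0))        # 0.0
print(log_amount(999_999))  # 6.0
print(round(log_amount(50_000_000), 3))

# Without the offset, a zero amount cannot be transformed at all
try:
    math.log10(0)
except ValueError as exc:
    print(f"log10(0) fails: {exc}")
```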

#### 5. **Binary Feature Creation**
Creates indicator variables for:
- **callable**: Bond can be called early
- **called**: Bond has been called
- **zerocoupon**: No periodic interest payments
- **whenissued**: Traded before settlement
- **sinking**: Has sinking fund provisions
- **deferred**: Interest payments are deferred

#### 6. **Date Feature Engineering**
Converts dates to numeric features:
- **days_to_settle**: Settlement lag
- **days_to_maturity**: Time until bond matures
- **days_to_call**: Time until callable
- **days_to_refund**: Time until refundable
- All transformed with `log10(1 + days)` for normalization

#### 7. **Accrual Calculations**
- **accrued_days**: Days since last coupon
- **scaled_accrued_days**: Normalized by payment period
- **A/E ratio**: Fraction of period elapsed

### Complete Feature Engineering: `engineer_features_complete()`

This is the production-ready feature engineering pipeline:

#### Features Processed:

1. **Binary Features** (converted to 0/1):
   - Bond characteristics (callable, called, zerocoupon)
   - Trading status (whenissued, sinking, deferred)
   - Tax and legal status indicators

2. **Numeric Direct Features** (used as-is after scaling):
   - Coupon rate, prices, yields
   - Time-based calculations
   - Accrual metrics

3. **Numeric Log Features** (already transformed):
   - Quantities and amounts
   - Pre-processed in `process_features()`

4. **Categorical Features** (one-hot encoded):
   - State codes, trade types, ratings
   - Purpose classes, tax status
   - Series names, security types

#### Rating Score Mapping
Converts letter ratings to numeric scores:
- AAA = 22 (highest)
- D = 1 (default)
- MR/NR = 0 (missing/not rated)
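
Only the three anchor points above come from the notebook. The sketch below fills in the intermediate steps with a standard 22-notch S&P-style scale, which happens to place AAA at 22 and D at 1; treat that ordering, and the names `RATING_SCALE` and `rating_score`, as assumptions.

```python
# Assumed 22-notch scale, lowest to highest
RATING_SCALE = [
    "D", "C", "CC", "CCC-", "CCC", "CCC+", "B-", "B", "B+",
    "BB-", "BB", "BB+", "BBB-", "BBB", "BBB+",
    "A-", "A", "A+", "AA-", "AA", "AA+", "AAA",
]
RATING_SCORE = {rating: score for score, rating in enumerate(RATING_SCALE, start=1)}


def rating_score(rating):
    """Numeric score for a letter rating; MR, NR, or anything unrecognized maps to 0."""
    return RATING_SCORE.get(rating, 0)


print(rating_score("AAA"))  # 22
print(rating_score("D"))    # 1
print(rating_score("NR"))   # 0
```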

#### Scaling and Encoding
- Uses **RobustScaler** for numeric features (resistant to outliers)
- **LabelEncoder** + one-hot encoding for categoricals
- Handles unseen categories with "UNKNOWN" class
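
To make the two ideas concrete, here is a NumPy-only sketch: median/IQR scaling, which is what scikit-learn's `RobustScaler` does with its default settings, and a one-hot encoder with a trailing "UNKNOWN" slot. The helper names and sample values are illustrative, not taken from the notebook.

```python
import numpy as np


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (values - median) / (q3 - q1)


def one_hot_with_unknown(value, known_categories):
    """One-hot encode `value`; anything unseen falls into a trailing UNKNOWN slot."""
    classes = list(known_categories) + ["UNKNOWN"]
    index = classes.index(value) if value in known_categories else len(classes) - 1
    return [1 if i == index else 0 for i in range(len(classes))]


# The outlier (100.0) does not distort the scale of the other points
scaled = robust_scale([1.0, 2.0, 3.0, 4.0, 100.0])
print(scaled)  # [-1.  -0.5  0.   0.5 48.5]

states = ["CA", "NY", "TX"]
print(one_hot_with_unknown("NY", states))  # [0, 1, 0, 0]
print(one_hot_with_unknown("ZZ", states))  # [0, 0, 0, 1]
```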



## Siamese Network Architecture

### Base Network: `create_base_network()`

The core embedding network consists of:

```
Input Layer (n features)
    ↓
Dense Layer (512 units, ReLU)
    ↓
Batch Normalization
    ↓
Dropout (%)
    ↓
Dense Layer (256 units, ReLU)
    ↓
Batch Normalization
    ↓
Dropout (%)
    ↓
Dense Layer (256 units, ReLU)
    ↓
Batch Normalization
    ↓
Dropout (%)
    ↓
Embedding Layer (128 units, Linear)
    ↓
L2 Normalization
```

**Key Design Choices:**
- **Progressive dimension reduction**: 512 -> 256 -> 256 -> 128
- **Batch normalization**: Stabilizes training
- **Dropout**: Prevents overfitting (decreasing rates)
- **L2 normalization**: Creates unit-length embeddings
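
One consequence of the L2 normalization step is worth spelling out: for unit-length embeddings, cosine similarity is just a dot product, and squared Euclidean distance equals `2 - 2 * cosine`. A NumPy sketch with toy 2-dimensional vectors (the real embeddings are 128-dimensional):

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


raw = np.array([[3.0, 4.0], [1.0, 0.0]])
emb = l2_normalize(raw)  # rows become [0.6, 0.8] and [1.0, 0.0]

cosine = float(emb[0] @ emb[1])
squared_distance = float(np.sum((emb[0] - emb[1]) ** 2))

print(round(cosine, 6))            # 0.6
print(round(squared_distance, 6))  # 0.8, which equals 2 - 2 * 0.6
```

So ranking neighbors by cosine similarity or by Euclidean distance gives the same order once embeddings are normalized.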

### Siamese Network: `create_siamese_network()`

The full architecture:
1. Two **identical base networks** (shared weights)
2. Process two input CUSIPs in parallel
3. Compute **cosine similarity** between embeddings
4. Output similarity score (-1 to 1)

### Loss Function: `contrastive_loss()`

Implements contrastive loss with margin:
- **Positive pairs** (similar): Minimizes distance
- **Negative pairs** (dissimilar): Enforces minimum margin
- **Margin parameter**: Default 0.5
- Balances attraction and repulsion forces
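
The notebook's `contrastive_loss()` body is not shown here, so the following is a sketch of the common textbook form of contrastive loss, written in NumPy for readability. Whether the actual implementation works on Euclidean distance or on `1 - cosine similarity` is an assumption left open; only the default margin of 0.5 comes from the text.

```python
import numpy as np


def contrastive_loss(labels, distances, margin=0.5):
    """Mean contrastive loss over a batch.

    labels: 1 for similar pairs, 0 for dissimilar pairs
    distances: non-negative distance between the two embeddings of each pair
    """
    labels = np.asarray(labels, dtype=float)
    distances = np.asarray(distances, dtype=float)
    pull = labels * distances ** 2                                  # similar pairs: shrink distance
    push = (1 - labels) * np.maximum(margin - distances, 0.0) ** 2  # dissimilar pairs: enforce margin
    return float(np.mean(pull + push))


# One close positive, one negative inside the margin, one negative already far apart
loss = contrastive_loss(labels=[1, 0, 0], distances=[0.1, 0.2, 0.9])
print(round(loss, 4))  # 0.0333
```

The third pair contributes nothing: once a negative pair is farther apart than the margin, the loss stops pushing it.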



## Training Pipeline

### Data Preparation: `prepare_pairs_for_training()`

1. **Date Processing**: Ensures all date columns are datetime objects
2. **Feature Engineering**: Applies complete feature pipeline
3. **CUSIP Mapping**: Creates dictionary of CUSIP -> feature vector
4. **Pair Assembly**: Matches CUSIP pairs with their labels
5. **Missing Data Handling**: Reports any CUSIPs not found

### Training Function: `train_siamese_network()`

**Training Configuration:**
- **Optimizer**: Adam (learning rate = 0.001)
- **Batch Size**: 256
- **Epochs**: Up to 100 (with early stopping)

**Callbacks:**
1. **EarlyStopping**: Patience of 15 epochs
2. **ModelCheckpoint**: Saves best model
3. **ReduceLROnPlateau**: Adaptive learning rate

### Embedding Extraction: `get_embeddings()`

Generates embeddings for new CUSIPs:
1. Processes features using saved artifacts
2. Passes through trained base network
3. Returns dataframe with CUSIP and embedding columns

### Complete Pipeline: `run_training_pipeline()`

Orchestrates the entire training process:

1. **Data Preparation**
   - Engineers features
   - Creates training pairs

2. **Train/Validation Split**
   - 80/20 split by default
   - Stratified sampling

3. **Model Training**
   - Trains Siamese network
   - Monitors validation loss

4. **Artifact Saving**
   - Model weights: `cusip_embedding_model.h5`
   - Feature artifacts: `feature_artifacts.pkl`
   - Feature names: `feature_names.pkl`



## Usage Examples

### Basic Training

```python
# Load your data
df = pd.read_pickle('processed_data_yield_spread_with_similar_trades_v2.pkl')
pairs_df = pd.read_pickle('/Users/gil/git/ficc/notebooks/gil_modeling/embeddings/bond_pairs_180days.pkl')

# Train the model
base_network, artifacts, history = run_training_pipeline(
    df,  
    pairs_df,
    test_size=0.2,
    embedding_dim=128,
    epochs=100
)
```

### Generating Embeddings

```python
# Load saved model and artifacts
from tensorflow.keras.models import load_model
import pickle

base_network = load_model('cusip_embedding_model.h5')
with open('feature_artifacts.pkl', 'rb') as f:
    artifacts = pickle.load(f)

# Generate embeddings for new data
new_df = pd.read_csv('new_cusips.csv')
embeddings_df = get_embeddings(new_df, base_network, artifacts)
embeddings_df.to_csv('cusip_embeddings.csv', index=False)
```

### Finding Similar Securities

```python
# Calculate similarities between CUSIPs
from sklearn.metrics.pairwise import cosine_similarity

# Get embeddings
embedding_matrix = embeddings_df.iloc[:, 1:].values

# Find most similar to first CUSIP
similarities = cosine_similarity(embedding_matrix[0:1], embedding_matrix)[0]
ranked = similarities.argsort()[::-1]      # Most similar first
top_10_indices = ranked[ranked != 0][:10]  # Drop the query row (index 0) explicitly
similar_cusips = embeddings_df.iloc[top_10_indices]['cusip'].values
```

## Key Features and Benefits

### Advantages of This Approach

1. **Robust Feature Engineering**
   - Handles multiple data types (dates, decimals, categoricals)
   - Intelligent missing value imputation
   - Domain-specific transformations (360/30 convention)

2. **Scalable Architecture**
   - Siamese networks learn from pairs efficiently
   - Can handle large CUSIP databases
   - Embeddings enable fast similarity searches

3. **Production Ready**
   - Saves all artifacts for deployment
   - Handles unseen categories gracefully
   - Includes comprehensive error handling

### Use Cases

- **Portfolio Analysis**: Find similar bonds for diversification
- **Risk Management**: Identify securities with similar risk profiles
- **Trading**: Discover arbitrage opportunities between similar securities
- **Research**: Cluster securities for market analysis

In [None]:
# ===========================
# CONSTANTS AND UTILITIES
# ===========================

NUM_OF_DAYS_IN_YEAR = 360  # MSRB convention

# Coupon frequency mapping
COUPON_FREQUENCY_DICT = {
    0: 'Unknown', 1: 'Semiannually', 2: 'Monthly', 3: 'Annually', 4: 'Weekly',
    5: 'Quarterly', 6: 'Every 2 years', 7: 'Every 3 years', 8: 'Every 4 years',
    9: 'Every 5 years', 10: 'Every 7 years', 11: 'Every 8 years', 12: 'Biweekly',
    13: 'Changeable', 14: 'Daily', 15: 'Term mode', 16: 'Interest at maturity',
    17: 'Bimonthly', 18: 'Every 13 weeks', 19: 'Irregular', 20: 'Every 28 days',
    21: 'Every 35 days', 22: 'Every 26 weeks', 23: 'Not Applicable', 24: 'Tied to prime',
    25: 'One time', 26: 'Every 10 years', 27: 'Frequency to be determined',
    28: 'Mandatory put', 29: 'Every 52 weeks', 30: 'When interest adjusts-commercial paper',
    31: 'Zero coupon', 32: 'Certain years only', 33: 'Under certain circumstances',
    34: 'Every 15 years', 35: 'Custom', 36: 'Single Interest Payment'
}

COUPON_FREQUENCY_TYPE = {
    'Unknown': 1e6, 'Semiannually': 2, 'Monthly': 12, 'Annually': 1,
    'Weekly': 52, 'Quarterly': 4, 'Every 2 years': 0.5, 'Every 3 years': 1/3,
    'Every 4 years': 0.25, 'Every 5 years': 0.2, 'Every 7 years': 1/7,
    'Every 8 years': 1/8, 'Every 10 years': 1/10, 'Biweekly': 26,
    'Changeable': 44, 'Daily': 360, 'Interest at maturity': 0, 'Not Applicable': 1e6
}

# Helper to convert Decimal to float
def to_numeric(series):
    """Convert a series to numeric, handling Decimal types"""
    if series.dtype == object:
        # Check if series contains Decimal objects
        if any(isinstance(x, Decimal) for x in series.dropna().head()):
            return series.apply(lambda x: float(x) if isinstance(x, Decimal) else x).astype(float)
    return pd.to_numeric(series, errors='coerce')

# Default values for missing features
FEATURES_AND_DEFAULT_VALUES = {
    'purpose_class': 0, 'call_timing': 0, 'call_timing_in_part': 0,
    'sink_frequency': 0, 'sink_amount_type': 10, 'issue_text': 'No issue text',
    'state_tax_status': 0, 'series_name': 'No series name', 'transaction_type': 'I',
    'next_call_price': 100, 'par_call_price': 100, 'min_amount_outstanding': 0,
    'max_amount_outstanding': 0, 'maturity_amount': 0,
    'issue_price': lambda df: to_numeric(df.issue_price).mean(skipna=True) if 'issue_price' in df.columns else 100,
    'orig_principal_amount': lambda df: np.log10(to_numeric(10 ** to_numeric(df.orig_principal_amount)).mean(skipna=True)) if 'orig_principal_amount' in df.columns else 0,
    'par_price': 100, 'called_redemption_type': 0, 'extraordinary_make_whole_call': False,
    'make_whole_call': False, 'default_indicator': False, 'days_to_settle': 0,
    'days_to_maturity': 0, 'days_to_refund': 0, 'call_to_maturity': 0,
    'days_in_interest_payment': 180
}

FEATURES_AND_DEFAULT_COLUMNS = {
    'days_to_par': 'days_to_maturity',
    'days_to_call': 'days_to_maturity'
}

# ===========================
# HELPER FUNCTIONS
# ===========================

def diff_in_days_two_dates_360_30(end_date, start_date):
    """Calculate difference in days using 360/30 convention (MSRB Rule G-33)"""
    if pd.isna(end_date) or pd.isna(start_date):
        return np.nan
    
    Y2, Y1 = end_date.year, start_date.year
    M2, M1 = end_date.month, start_date.month
    D2, D1 = end_date.day, start_date.day
    
    D1 = min(D1, 30)
    if D1 == 30:
        D2 = min(D2, 30)
    
    return (Y2 - Y1) * 360 + (M2 - M1) * 30 + (D2 - D1)

def diff_in_days_two_dates_exact(end_date, start_date):
    """Calculate exact difference in days"""
    if pd.isna(end_date) or pd.isna(start_date):
        return np.nan
    return (end_date - start_date).days

def diff_in_days(trade, convention='360/30', calc_type=None):
    """Calculate days difference for a trade"""
    if calc_type == 'accrual':
        if pd.isnull(trade.get('accrual_date')):
            start_date = trade.get('dated_date')
        else:
            start_date = trade.get('accrual_date')
    else:
        start_date = trade.get('dated_date')
    
    end_date = trade.get('settlement_date')
    
    if pd.isna(start_date) or pd.isna(end_date):
        return 0
    
    if convention == '360/30':
        return diff_in_days_two_dates_360_30(end_date, start_date)
    else:
        return diff_in_days_two_dates_exact(end_date, start_date)

def days_in_interest_payment(trade):
    """Calculate days in interest payment period"""
    if 'interest_payment_frequency' not in trade:
        return 180  # Default
    
    freq_val = trade['interest_payment_frequency']
    
    # Handle if it's already converted to string
    if isinstance(freq_val, str):
        frequency = COUPON_FREQUENCY_TYPE.get(freq_val, 1e6)
    else:
        # Convert numeric code to string first
        freq_str = COUPON_FREQUENCY_DICT.get(freq_val, 'Unknown')
        frequency = COUPON_FREQUENCY_TYPE.get(freq_str, 1e6)
    
    if frequency == 0 or frequency >= 1e6:
        return 1e6
    
    return 360 / frequency

def calculate_a_over_e(row):
    """Calculate A/E ratio"""
    if not pd.isnull(row.get('previous_coupon_payment_date')) and not pd.isnull(row.get('settlement_date')):
        try:
            A = (row['settlement_date'] - row['previous_coupon_payment_date']).days
            days_ip = row.get('days_in_interest_payment', 180)
            if days_ip > 0:
                return A / days_ip
        except Exception:
            pass
    
    accrued = row.get('accrued_days', 0)
    return accrued / NUM_OF_DAYS_IN_YEAR if NUM_OF_DAYS_IN_YEAR > 0 else 0

def fill_missing_values(df):
    """Fill missing values with defaults"""
    df = df.copy()
    
    # Fill with default values
    for feature, default_value in FEATURES_AND_DEFAULT_VALUES.items():
        if feature in df.columns:
            if callable(default_value):
                try:
                    default_value = default_value(df)
                except Exception as e:
                    print(f"Warning: Could not compute default for {feature}: {e}")
                    default_value = 0 if feature in ['orig_principal_amount'] else 100
            df[feature] = df[feature].fillna(default_value)
    
    # Fill with other columns
    for feature, feature_to_replace_with in FEATURES_AND_DEFAULT_COLUMNS.items():
        if feature in df.columns and feature_to_replace_with in df.columns:
            df[feature] = df[feature].fillna(df[feature_to_replace_with])
    
    return df

# ===========================
# FEATURE ENGINEERING
# ===========================

def process_features(df):
    """Main feature processing function from the original codebase"""
    df = df.copy()
    
    # Convert Decimal columns to float
    for col in df.columns:
        if df[col].dtype == object:
            # Check if column contains Decimal objects
            sample = df[col].dropna().head()
            if len(sample) > 0 and any(isinstance(x, Decimal) for x in sample):
                df[col] = df[col].apply(lambda x: float(x) if isinstance(x, Decimal) else x)
    
    # Process interest payment frequency
    if 'interest_payment_frequency' in df.columns:
        df['interest_payment_frequency'] = df['interest_payment_frequency'].fillna(0)
        df['interest_payment_frequency'] = df['interest_payment_frequency'].apply(
            lambda x: COUPON_FREQUENCY_DICT.get(int(x), 'Unknown') if isinstance(x, (int, float)) else x
        )
    
    # Process quantity
    if 'par_traded' in df.columns:
        df['par_traded'] = to_numeric(df['par_traded'])
        df['quantity'] = np.log10(df['par_traded'].clip(lower=1))
    
    # Process amounts with log transformation
    for col in ['issue_amount', 'maturity_amount', 'orig_principal_amount', 'max_amount_outstanding']:
        if col in df.columns:
            df[col] = to_numeric(df[col])
            df[col] = np.log10(1 + df[col].fillna(0).clip(lower=0))
    
    # Process coupon
    if 'coupon' in df.columns:
        df['coupon'] = to_numeric(df['coupon']).fillna(0)
    
    # Create binary features
    if 'is_callable' in df.columns:
        df['callable'] = df['is_callable'].astype(bool).astype(int)
    if 'is_called' in df.columns:
        df['called'] = df['is_called'].astype(bool).astype(int)
    if 'coupon' in df.columns:
        df['zerocoupon'] = (df['coupon'] == 0).astype(int)
    if 'delivery_date' in df.columns and 'trade_date' in df.columns:
        df['whenissued'] = (df['delivery_date'] >= df['trade_date']).astype(int)
    if 'next_sink_date' in df.columns:
        df['sinking'] = (~df['next_sink_date'].isnull()).astype(int)
    if 'interest_payment_frequency' in df.columns and 'zerocoupon' in df.columns:
        df['deferred'] = ((df['interest_payment_frequency'] == 'Unknown') | (df['zerocoupon'] == 1)).astype(int)
    
    # Process dates - convert to days from trade date
    if 'settlement_date' in df.columns and 'trade_date' in df.columns:
        df['days_to_settle'] = (df['settlement_date'] - df['trade_date']).dt.days.fillna(0)
        # Remove trades settling >= 30 days from trade date
        df = df[df['days_to_settle'] < 30]
    
    # Calculate days to various dates with log transformation
    date_calculations = [
        ('days_to_maturity', 'maturity_date', 'trade_date'),
        ('days_to_call', 'next_call_date', 'trade_date'),
        ('days_to_refund', 'refund_date', 'trade_date'),
        ('days_to_par', 'par_call_date', 'trade_date'),
        ('call_to_maturity', 'maturity_date', 'next_call_date')
    ]
    
    for new_col, end_col, start_col in date_calculations:
        if end_col in df.columns and start_col in df.columns:
            days_diff = (df[end_col] - df[start_col]).dt.days.fillna(0).clip(lower=0)
            with warnings.catch_warnings():
                warnings.simplefilter('ignore')
                df[new_col] = np.log10(1 + days_diff)
                df[new_col] = df[new_col].replace([-np.inf, np.inf], 0)
    
    # Calculate accrual features
    if 'settlement_date' in df.columns:
        df['accrued_days'] = df.apply(lambda row: diff_in_days(row, calc_type='accrual'), axis=1)
        df['accrued_days'] = df['accrued_days'].fillna(0)
    
    if 'interest_payment_frequency' in df.columns:
        df['days_in_interest_payment'] = df.apply(days_in_interest_payment, axis=1)
    else:
        df['days_in_interest_payment'] = 180
    
    if 'accrued_days' in df.columns and 'days_in_interest_payment' in df.columns:
        df['scaled_accrued_days'] = df['accrued_days'] / (360 / df['days_in_interest_payment'].clip(lower=1))
        df['scaled_accrued_days'] = df['scaled_accrued_days'].fillna(0)
        df['A/E'] = df.apply(calculate_a_over_e, axis=1)
        df['A/E'] = df['A/E'].fillna(0)
    
    # Process rating
    if 'sp_long' in df.columns:
        df['sp_long'] = df['sp_long'].fillna('MR')
        df['rating'] = df['sp_long']
    elif 'rating' not in df.columns:
        df['rating'] = 'MR'
    
    # Fill missing values
    df = fill_missing_values(df)
    
    return df

def engineer_features_complete(df_raw: pd.DataFrame, fit: bool=True, artifacts: dict=None):
    """
    Complete feature engineering - ROBUST VERSION
    Handles all nullable integer types properly
    """
    # First apply the original processing
    df = process_features(df_raw)
    
    if artifacts is None:
        artifacts = {}
    
    # Binary features - ensure they're numeric
    binary_features = [
        'callable', 'called', 'zerocoupon', 'whenissued', 'sinking', 'deferred',
        'is_non_transaction_based_compensation', 'is_general_obligation',
        'callable_at_cav', 'extraordinary_make_whole_call', 'make_whole_call',
        'has_unexpired_lines_of_credit', 'escrow_exists', 'default_indicator'
    ]
    
    # Numeric features (direct - no transformation)
    numeric_direct = [
        'coupon', 'issue_price', 'par_price', 'original_yield',
        'next_call_price', 'par_call_price', 'days_to_settle',
        'days_to_maturity', 'days_to_call', 'days_to_refund',
        'days_to_par', 'call_to_maturity', 'accrued_days',
        'days_in_interest_payment', 'scaled_accrued_days', 'A/E'
    ]
    
    # Numeric features (already log-transformed)
    numeric_log = [
        'quantity', 'issue_amount', 'maturity_amount',
        'orig_principal_amount', 'max_amount_outstanding'
    ]
    
    # All potential categorical features
    categorical_features = [
        'incorporated_state_code', 'trade_type', 'purpose_class',
        'rating', 'purpose_sub_class', 'called_redemption_type',
        'call_timing', 'call_timing_in_part', 'sink_frequency',
        'sink_amount_type', 'state_tax_status', 'transaction_type',
        'coupon_type', 'federal_tax_status', 'use_of_proceeds',
        'muni_security_type', 'muni_issue_type', 'capital_type',
        'other_enhancement_type', 'series_name'
    ]
    
    # Collect features
    feature_list = []
    feature_names = []
    
    # Binary features
    for feat in binary_features:
        if feat in df.columns:
            # Ensure numeric and handle any remaining issues
            val = pd.to_numeric(df[feat], errors='coerce').fillna(0).astype(float).values.reshape(-1, 1)
            feature_list.append(val)
            feature_names.append(feat)
    
    # Numeric direct features
    for feat in numeric_direct:
        if feat in df.columns:
            val = to_numeric(df[feat]).fillna(0).astype(float).values.reshape(-1, 1)
            feature_list.append(val)
            feature_names.append(feat)
    
    # Numeric log features (already transformed)
    for feat in numeric_log:
        if feat in df.columns:
            val = to_numeric(df[feat]).fillna(0).astype(float).values.reshape(-1, 1)
            feature_list.append(val)
            feature_names.append(feat)
    
    # Add rating score
    rating_map = {
        "AAA": 22, "AA+": 21, "AA": 20, "AA-": 19,
        "A+": 18, "A": 17, "A-": 16,
        "BBB+": 15, "BBB": 14, "BBB-": 13,
        "BB+": 12, "BB": 11, "BB-": 10,
        "B+": 9, "B": 8, "B-": 7,
        "CCC+": 6, "CCC": 5, "CCC-": 4,
        "CC": 3, "C": 2, "D": 1, "MR": 0, "NR": 0
    }
    
    if 'rating' in df.columns:
        rating_score = df['rating'].map(rating_map).fillna(0).values.reshape(-1, 1)
        feature_list.append(rating_score)
        feature_names.append('rating_score')
    
    # Combine numeric features
    if feature_list:
        X_numeric = np.hstack(feature_list)
    else:
        X_numeric = np.zeros((len(df), 0))
    
    # Scale numeric features
    if fit:
        scaler = RobustScaler()
        X_numeric_scaled = scaler.fit_transform(X_numeric)
        artifacts['scaler'] = scaler
    else:
        scaler = artifacts.get('scaler')
        if scaler is not None:
            X_numeric_scaled = scaler.transform(X_numeric)
        else:
            X_numeric_scaled = X_numeric
    
    # One-hot encode categorical features with robust handling
    cat_encoded_list = []
    cat_feature_names = []
    
    if fit:
        artifacts['encoders'] = {}
    
    for cat in categorical_features:
        if cat in df.columns:
            # Check the dtype and handle appropriately
            dtype_name = str(df[cat].dtype)
            
            # Handle nullable integer types (Int64, Int32, etc.)
            if 'Int' in dtype_name or 'int' in dtype_name.lower():
                # Convert to regular numpy int, then to string
                # Use copy to avoid modifying original
                cat_series = df[cat].copy()
                # Fill NaN with a special integer value first
                cat_series = cat_series.fillna(-9999)
                # Convert to regular int64 (not nullable)
                cat_series = cat_series.astype('int64')
                # Now convert to string
                cat_values = cat_series.astype(str)
                # Replace the special value marker
                cat_values = cat_values.replace('-9999', 'MISSING')
            
            # Handle float types
            elif 'float' in dtype_name.lower():
                # Convert float to int first, then to string
                cat_series = df[cat].copy()
                cat_series = cat_series.fillna(-9999.0)
                cat_series = cat_series.astype('int64')
                cat_values = cat_series.astype(str)
                cat_values = cat_values.replace('-9999', 'MISSING')
            
            # Handle regular object/string types
            else:
                # Standard string handling
                cat_values = df[cat].fillna('MISSING').astype(str)
            
            if fit:
                encoder = LabelEncoder()
                unique_vals = cat_values.unique().tolist()
                if 'UNKNOWN' not in unique_vals:
                    unique_vals.append('UNKNOWN')
                encoder.fit(unique_vals)
                artifacts['encoders'][cat] = encoder
            else:
                encoder = artifacts.get('encoders', {}).get(cat)
                if encoder is None:
                    continue
            
            # Handle unseen categories (set lookup avoids scanning classes_ for every row)
            known_classes = set(encoder.classes_)
            cat_values = cat_values.apply(lambda x: x if x in known_classes else 'UNKNOWN')
            encoded = encoder.transform(cat_values)
            
            # One-hot encode
            n_classes = len(encoder.classes_)
            one_hot = np.zeros((len(df), n_classes))
            one_hot[np.arange(len(df)), encoded] = 1
            
            cat_encoded_list.append(one_hot)
            
            # Add feature names
            for i, class_name in enumerate(encoder.classes_):
                cat_feature_names.append(f"{cat}_{class_name}")
    
    # Combine all features
    if cat_encoded_list:
        X_cat = np.hstack(cat_encoded_list)
        X = np.hstack([X_numeric_scaled, X_cat])
        all_feature_names = feature_names + cat_feature_names
    else:
        X = X_numeric_scaled
        all_feature_names = feature_names
    
    print(f"Engineered {len(all_feature_names)} features")
    return X.astype(np.float32), artifacts, all_feature_names

# ===========================
# SIAMESE NETWORK (original code with updated feature engineering)
# ===========================

def create_base_network(input_dim, embedding_dim=128):
    """Create the base network for one side of the Siamese network"""
    inputs = Input(shape=(input_dim,))
    
    x = layers.Dense(512, activation='relu')(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.1)(x)
    
    x = layers.Dense(256, activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.1)(x)
    
    x = layers.Dense(256, activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.1)(x)
    
    embeddings = layers.Dense(embedding_dim, activation='linear', name='embeddings')(x)
    embeddings = layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(embeddings)
    
    model = Model(inputs=inputs, outputs=embeddings, name='base_network')
    return model

def create_siamese_network(input_dim, embedding_dim=128):
    """Create the full Siamese network"""
    base_network = create_base_network(input_dim, embedding_dim)
    
    input_a = Input(shape=(input_dim,), name='input_a')
    input_b = Input(shape=(input_dim,), name='input_b')
    
    embedding_a = base_network(input_a)
    embedding_b = base_network(input_b)
    
    cosine_similarity = layers.Dot(axes=1, normalize=False)([embedding_a, embedding_b])
    
    siamese_model = Model(inputs=[input_a, input_b], outputs=cosine_similarity)
    
    return siamese_model, base_network

def contrastive_loss(y_true, y_pred, margin=0.5):
    """Contrastive loss for Siamese network"""
    y_pred_dist = 1 - y_pred
    pos_loss = y_true * tf.square(y_pred_dist)
    neg_loss = (1 - y_true) * tf.square(tf.maximum(0.0, margin - y_pred_dist))
    return tf.reduce_mean(pos_loss + neg_loss)

def prepare_pairs_for_training(features_df, pairs_df, artifacts=None):
    """Prepare pairs of CUSIPs for training"""
    print("Engineering features...")
    
    # Ensure date columns are datetime
    date_columns = [
        'refund_date', 'accrual_date', 'dated_date', 'next_sink_date',
        'delivery_date', 'trade_date', 'trade_datetime', 'par_call_date',
        'maturity_date', 'settlement_date', 'next_call_date',
        'previous_coupon_payment_date', 'next_coupon_payment_date',
        'first_coupon_date', 'last_period_accrues_from_date'
    ]
    
    for col in date_columns:
        if col in features_df.columns:
            features_df[col] = pd.to_datetime(features_df[col], errors='coerce')
    
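    # process_features drops trades settling 30+ days after the trade date.
    # Apply the same filter here so the rows of X_all stay aligned with features_df.
    if 'settlement_date' in features_df.columns and 'trade_date' in features_df.columns:
        days_to_settle = (features_df['settlement_date'] - features_df['trade_date']).dt.days.fillna(0)
        features_df = features_df[days_to_settle < 30]
    
    # Use the complete feature engineering
    X_all, artifacts, feature_names = engineer_features_complete(
        features_df,
        fit=(artifacts is None),
        artifacts=artifacts
    )
    
    print(f"Feature dimension: {X_all.shape[1]}")
    print(f"Total rows processed: {X_all.shape[0]}")
    
    # One vector per CUSIP; if a CUSIP appears in several rows, the last row wins
    cusip_to_vec = dict(zip(features_df["cusip"].values, X_all))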
    
    features_a = []
    features_b = []
    labels = []
    missing_cusips = set()
    
    for _, row in pairs_df.iterrows():
        cusip1, cusip2 = row["cusip1"], row["cusip2"]
        
        if cusip1 in cusip_to_vec and cusip2 in cusip_to_vec:
            features_a.append(cusip_to_vec[cusip1])
            features_b.append(cusip_to_vec[cusip2])
            labels.append(row["label"])
        else:
            if cusip1 not in cusip_to_vec:
                missing_cusips.add(cusip1)
            if cusip2 not in cusip_to_vec:
                missing_cusips.add(cusip2)
    
    if missing_cusips:
        print(f"Warning: {len(missing_cusips)} CUSIPs from pairs not found in features")
    
    X_pairs = (np.array(features_a, dtype=np.float32), 
               np.array(features_b, dtype=np.float32))
    y = np.array(labels, dtype=np.float32)
    
    print(f"Valid pairs prepared: {len(y)}")
    
    return X_pairs, y, artifacts, feature_names

def train_siamese_network(X_train, y_train, X_val, y_val, 
                         input_dim, embedding_dim=128, 
                         epochs=100, batch_size=256):
    """Train the Siamese network"""
    siamese_model, base_network = create_siamese_network(input_dim, embedding_dim)
    
    optimizer = keras.optimizers.Adam(learning_rate=0.001)
    siamese_model.compile(
        optimizer=optimizer,
        loss=contrastive_loss,
        metrics=['mae']
    )
    
    print("\nBase network summary:")
    base_network.summary()
    
    callbacks = [
        EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True, verbose=1),
        ModelCheckpoint('best_siamese_model.h5', save_best_only=True, monitor='val_loss', verbose=1),
        ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6, verbose=1)
    ]
    
    print("\nStarting training...")
    history = siamese_model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=epochs,
        batch_size=batch_size,
        callbacks=callbacks,
        verbose=1
    )
    
    return siamese_model, base_network, history

def get_embeddings(cusip_features_df, base_network, artifacts):
    """Get embeddings for CUSIPs"""
    # Ensure date columns are datetime
    date_columns = [
        'refund_date', 'accrual_date', 'dated_date', 'next_sink_date',
        'delivery_date', 'trade_date', 'trade_datetime', 'par_call_date',
        'maturity_date', 'settlement_date', 'next_call_date',
        'previous_coupon_payment_date', 'next_coupon_payment_date',
        'first_coupon_date', 'last_period_accrues_from_date'
    ]
    
    for col in date_columns:
        if col in cusip_features_df.columns:
            cusip_features_df[col] = pd.to_datetime(cusip_features_df[col], errors='coerce')
    
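    # process_features drops trades settling 30+ days after the trade date.
    # Apply the same filter first so embeddings and CUSIPs stay aligned row for row.
    if 'settlement_date' in cusip_features_df.columns and 'trade_date' in cusip_features_df.columns:
        days_to_settle = (cusip_features_df['settlement_date'] - cusip_features_df['trade_date']).dt.days.fillna(0)
        cusip_features_df = cusip_features_df[days_to_settle < 30]
    
    X_features, _, _ = engineer_features_complete(cusip_features_df, fit=False, artifacts=artifacts)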
    embeddings = base_network.predict(X_features, batch_size=256)
    
    embedding_cols = [f'emb_{i}' for i in range(embeddings.shape[1])]
    embeddings_df = pd.DataFrame(embeddings, columns=embedding_cols)
    embeddings_df['cusip'] = cusip_features_df['cusip'].values
    embeddings_df = embeddings_df[['cusip'] + embedding_cols]
    
    return embeddings_df

def run_training_pipeline(features_df, pairs_df, test_size=0.2, embedding_dim=128, epochs=100):
    """Run the complete training pipeline"""
    print("Preparing features and pairs...")
    X_pairs, y, artifacts, feature_names = prepare_pairs_for_training(features_df, pairs_df)
    
    print("\nSplitting data...")
    X_train_a, X_val_a, X_train_b, X_val_b, y_train, y_val = train_test_split(
        X_pairs[0], X_pairs[1], y, 
        test_size=test_size, 
        random_state=42, 
        stratify=y
    )
    X_train = (X_train_a, X_train_b)
    X_val = (X_val_a, X_val_b)
    
    input_dim = X_train[0].shape[1]
    print(f"\nData summary:")
    print(f"  Input dimension: {input_dim}")
    print(f"  Number of features: {len(feature_names)}")
    print(f"  Training samples: {len(y_train):,}")
    print(f"  Validation samples: {len(y_val):,}")
    print(f"  Positive ratio in train: {y_train.mean():.2%}")
    print(f"  Positive ratio in val: {y_val.mean():.2%}")
    
    siamese_model, base_network, history = train_siamese_network(
        X_train, y_train, X_val, y_val,
        input_dim=input_dim,
        embedding_dim=embedding_dim,
        epochs=epochs
    )
    
    print("\nSaving model and artifacts...")
    base_network.save('cusip_embedding_model.h5')
    
    with open('feature_artifacts.pkl', 'wb') as f:
        pickle.dump(artifacts, f)
    
    with open('feature_names.pkl', 'wb') as f:
        pickle.dump(feature_names, f)
    
    print("Training complete!")
    print(f"Final validation loss: {history.history['val_loss'][-1]:.4f}")
    
    return base_network, artifacts, history

# Train the model
base_network, artifacts, history = run_training_pipeline(
    df,  
    pairs_df,
    test_size=0.2,
    embedding_dim=128,
    epochs=100
)
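
To make the training objective concrete, here is a small NumPy restatement of `contrastive_loss` from the cell above (same formula, `margin=0.5`, distance defined as `1 - cosine similarity`). It is only an illustrative sketch: the function name `contrastive_loss_np` and the sample labels and similarities are made up for this example.

```python
import numpy as np

def contrastive_loss_np(y_true, cos_sim, margin=0.5):
    """Per-pair contrastive loss on cosine distance (illustrative NumPy version)."""
    dist = 1.0 - cos_sim                                          # cosine distance, in [0, 2]
    pos = y_true * dist ** 2                                      # similar pairs: any distance is penalized
    neg = (1.0 - y_true) * np.maximum(0.0, margin - dist) ** 2    # dissimilar pairs: penalized only inside the margin
    return pos + neg                                              # training averages this over the batch

# Hypothetical pairs: label 1 = similar, label 0 = dissimilar
y_true = np.array([1.0, 1.0, 0.0, 0.0])
cos_sim = np.array([0.9, 0.2, 0.9, 0.2])
per_pair = contrastive_loss_np(y_true, cos_sim)
print(per_pair)
```

With `margin=0.5`, a dissimilar pair stops contributing to the loss once its cosine similarity falls below 0.5, while a similar pair is always pulled toward similarity 1.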

In [None]:
# Get unique CUSIPs BEFORE generating embeddings
# This prevents generating 1.4M embeddings for duplicate CUSIPs
unique_cusips_df = df.sort_values('trade_datetime').groupby('cusip').last().reset_index()
print(f"Original data shape: {df.shape}")
print(f"Unique CUSIPs to embed: {len(unique_cusips_df)}")

# Generate embeddings for unique CUSIPs only
embeddings_df = get_embeddings(unique_cusips_df, base_network, artifacts)
print(f"Generated embeddings shape: {embeddings_df.shape}")

# Save the correct embeddings
embeddings_df.to_csv('cusip_embeddings_unique.csv', index=False)

Engineered 301 features

2025-09-23 02:23:27.712045: I external/local_xla/xla/service/gpu/autotuning/dot_search_space.cc:208] All configs were filtered out because none of them sufficiently match the hints. Maybe the hints set does not contain a good representative set of valid configs? Working around this by using the full hints set instead.

5492/5492 ━━━━━━━━━━━━━━━━━━━━ 11s 2ms/step


In [None]:
embedding = pd.read_csv('cusip_embeddings.csv')
embedding.head()

Unnamed: 0,cusip,emb_0,emb_1,emb_2,emb_3,emb_4,emb_5,emb_6,emb_7,emb_8,...,emb_118,emb_119,emb_120,emb_121,emb_122,emb_123,emb_124,emb_125,emb_126,emb_127
0,19648FYV0,-0.046834,0.087962,0.082753,0.083307,0.072688,-0.051867,-0.070449,0.090668,-0.025534,...,0.059696,-0.127566,0.007473,0.075059,0.029351,-0.001663,-0.063635,0.183805,0.03712,0.080757
1,180848XJ7,0.002178,0.087889,0.08304,0.080956,0.072237,-0.055294,-0.068476,0.090352,-0.025602,...,0.061586,-0.131052,-0.018359,0.076542,0.032309,0.018512,-0.035759,0.205933,0.034215,0.045631
2,54639TDU3,0.102498,0.112877,0.10783,0.089851,0.06817,-0.042887,-0.056532,0.08018,0.008092,...,0.055697,-0.108302,-0.048728,0.101579,0.00712,0.038187,-0.005147,0.19876,0.03031,0.015362
3,575896PX7,-0.044441,0.07005,0.083702,0.072695,0.07419,-0.052992,-0.061353,0.098432,-0.059692,...,0.059641,-0.122083,0.000705,0.068069,0.036816,-0.002147,-0.055479,0.191326,0.030561,0.061081
4,575896PX7,-0.03517,0.070882,0.081211,0.074073,0.076941,-0.051934,-0.060732,0.097129,-0.052026,...,0.056449,-0.123906,-0.004465,0.06896,0.033432,0.003381,-0.049986,0.196347,0.029966,0.057655


In [7]:
embeddings_df = embedding.drop_duplicates(subset='cusip', keep='first')

In [10]:
df = pd.read_pickle('2025-09-22_one_month.pkl')

# Municipal Bond Embeddings: Practical Analysis & Trading Applications

Extract actionable insights from 128-dimensional bond embeddings trained via Siamese networks on market behavior patterns.

## What Are These Embeddings?

Our embeddings capture the "behavioral DNA" of municipal bonds - how they trade relative to others based on historical patterns. Bonds close in embedding space exhibit similar market behavior, regardless of their static characteristics. This can reveal non-obvious relationships and pricing inefficiencies.

## Pipeline Overview

### 1. **Data Preparation**
- Merge unique CUSIP embeddings with latest trade data
- Extract 128-dimensional vectors normalized for cosine similarity
- Retain key trading fields: yield, price, rating, maturity, state

### 2. **Core Functions**

#### **Similarity Search**
Find bonds that behave similarly to a target CUSIP with practical filters:
- Same state requirement (typical for muni analysis)
- Maturity bands (±X years)
- Rating bucket constraints (AAA/AA/A/BBB/BB/B)

#### **Arbitrage Discovery**
Identify pricing inefficiencies using simple logic:
- **High embedding similarity** (>0.93) = bonds behave similarly in market
- **Large yield spread** (>25 bps) = pricing discrepancy
- **Result**: Buy the higher-yielding (cheaper) bond and sell the lower-yielding (richer) bond, expecting yields to converge
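
The screening rule above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual scanner: the function name `find_yield_gaps`, its arguments, and the three toy bonds are assumptions. It takes L2-normalized embeddings and yields in percent, and returns ordered pairs where the first bond yields more than the second.

```python
import numpy as np

def find_yield_gaps(X_unit, yields_pct, cusips, min_sim=0.93, min_spread_bps=25.0):
    """Return (higher_yield_cusip, lower_yield_cusip, similarity, spread_bps) tuples.

    X_unit     : (n, d) array of L2-normalized embeddings (hypothetical input)
    yields_pct : (n,) array of yields in percent
    """
    sims = X_unit @ X_unit.T                                          # cosine similarity for unit vectors
    spread_bps = (yields_pct[:, None] - yields_pct[None, :]) * 100.0  # row yield minus column yield
    # Ordered pairs (i, j) where bond i yields more than bond j; self-pairs have zero spread
    hits = np.argwhere((sims > min_sim) & (spread_bps > min_spread_bps))
    return [(cusips[i], cusips[j], float(sims[i, j]), float(spread_bps[i, j])) for i, j in hits]

# Toy data: bonds A and B point in nearly the same direction, C is orthogonal to A
X = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
yields_pct = np.array([3.50, 3.10, 3.60])
cusips = ["BOND_A", "BOND_B", "BOND_C"]                               # placeholder identifiers
pairs = find_yield_gaps(X_unit, yields_pct, cusips)
print(pairs)
```

The full n×n similarity matrix is only practical on a sample of bonds; for the whole universe the scan would need to be done in blocks or per query bond.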

#### **Portfolio Concentration Analysis**
Measure systematic risk from holding similar-behaving bonds:
- Compute pairwise similarities within portfolio
- Flag high concentration (avg similarity >0.85)
- Identify most similar pairs for risk management
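
One way to compute the concentration metric is sketched below, assuming the portfolio's embeddings are already L2-normalized. The function name `portfolio_concentration` and the returned dictionary keys are illustrative choices, not part of the notebook.

```python
import numpy as np

def portfolio_concentration(X_unit, threshold=0.85):
    """Average pairwise cosine similarity across distinct pairs of holdings."""
    n = X_unit.shape[0]
    if n < 2:
        return {"avg_similarity": float("nan"), "concentrated": False, "most_similar_pair": None}
    sims = X_unit @ X_unit.T
    rows, cols = np.triu_indices(n, k=1)      # each unordered pair once, no self-pairs
    pair_sims = sims[rows, cols]
    top = int(np.argmax(pair_sims))
    avg = float(pair_sims.mean())
    return {
        "avg_similarity": avg,
        "concentrated": avg > threshold,      # flag when the average exceeds the threshold
        "most_similar_pair": (int(rows[top]), int(cols[top])),
    }

# Toy portfolio: two identical embeddings and one orthogonal to both
X_port = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
report = portfolio_concentration(X_port)
print(report)
```

Averaging over the upper triangle only keeps the diagonal (each bond's similarity of 1 with itself) from inflating the score.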

### 3. **Key Outputs**

Each analysis produces trader-friendly outputs:
- **Similar bonds table**: CUSIP, similarity score, yield differential
- **Arbitrage trades**: Buy/sell CUSIPs with spread in basis points
- **Concentration metrics**: Average portfolio similarity with risk warnings

## Implementation Details

**Technical choices:**
- Cosine similarity on L2-normalized vectors for speed
- Vectorized numpy operations (no loops over 196k CUSIPs)
- Optional filters match real trading constraints
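
As a minimal sketch of why L2-normalization helps: for unit vectors, cosine similarity reduces to a dot product, so a nearest-neighbor query is a single matrix-vector product. The random matrix below is only a stand-in for real embeddings, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 128))                                      # stand-in for raw embeddings
X_unit = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)    # L2-normalize each row

query = X_unit[0]
sims = X_unit @ query                                              # cosine similarity to every row

ranked = sims.copy()
ranked[0] = -np.inf                                                # exclude the query itself
top_k = np.argsort(-ranked)[:3]                                    # indices of the 3 nearest neighbors

# Reference: the explicit cosine formula on the unnormalized vectors
explicit = (X @ X[0]) / (np.linalg.norm(X, axis=1) * np.linalg.norm(X[0]))
print(np.allclose(sims, explicit), top_k)
```

For a large universe, `np.argpartition` can replace the full `argsort` when only the top few neighbors are needed.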

**Performance:**
- Similarity search: <100ms for 10 neighbors among 196k bonds
- Arbitrage scan: ~2 seconds for 500 bond sample
- Portfolio analysis: <50ms for 100-bond portfolio

## Production Use Cases

1. **Daily Arbitrage Scanner**
   - Run nightly on all bonds with >$1M daily volume
   - Email top 10 opportunities to trading desk
   - Track convergence over time

2. **Trade Idea API**
   ```
   GET /api/similar-bonds/{cusip}?state=same&maturity_band=3
   ```
   Returns similar bonds with yields for relative value analysis

3. **Portfolio Risk Dashboard**
   - Upload holdings → receive concentration analysis
   - Highlight bonds that are behavioral "twins"
   - Suggest diversification candidates from different embedding clusters

4. **Market Anomaly Detection**
   - Flag bonds whose yields diverge from embedding neighbors
   - Identify potential rating changes before agencies act
   - Spot liquidity shifts via neighbor trading patterns
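
The anomaly-detection idea could be prototyped along these lines: compare each bond's yield with the average yield of its nearest embedding neighbors. This is a hedged sketch under simple assumptions (unit-normalized embeddings, yields in percent, a small enough universe for a full similarity matrix); `neighbor_yield_gap` and `k` are hypothetical names.

```python
import numpy as np

def neighbor_yield_gap(X_unit, yields_pct, k=5):
    """Yield of each bond minus the mean yield of its k nearest embedding neighbors, in bps."""
    sims = X_unit @ X_unit.T
    np.fill_diagonal(sims, -np.inf)                  # a bond is not its own neighbor
    nbr_idx = np.argsort(-sims, axis=1)[:, :k]       # (n, k) indices of the most similar bonds
    nbr_mean = yields_pct[nbr_idx].mean(axis=1)
    return (yields_pct - nbr_mean) * 100.0

# Toy data: four bonds with identical embeddings, one of which yields 100 bps more
X_unit = np.array([[1.0, 0.0]] * 4)
yields_pct = np.array([3.0, 3.0, 3.0, 4.0])
gaps = neighbor_yield_gap(X_unit, yields_pct, k=3)
print(gaps)
```

Bonds with a large positive or negative gap are candidates for review; what counts as large would need to be calibrated on historical data.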

## Interpreting Results

**High similarity (>0.95) means:**
- Bonds react similarly to market events
- Similar flow patterns and trader behavior
- Often (but not always) similar fundamentals

**Arbitrage signals are strongest when:**
- Same state, similar maturity (reduces exogenous factors)
- High liquidity (>10 trades/day for both bonds)
- No recent credit events or calls

**Portfolio concentration warnings indicate:**
- Systematic risk from correlated behaviors
- Need for geographic/sector/duration diversification
- Potential for synchronized losses in stress scenarios

## Next Steps

1. **Backtest arbitrage signals** on historical data
2. **Build real-time similarity API** for trader terminals
3. **Integrate with existing** yield curve and relative value models
4. **Create embedding drift monitoring** to detect market regime changes



In [25]:
# Municipal Bond Embedding Explorer
# Clean, practical exploration of bond embeddings

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

# ============================================
# 1. SETUP AND DATA PREP
# ============================================
print("=" * 60)
print("MUNICIPAL BOND EMBEDDING EXPLORER")
print("=" * 60)

# Load your data (assuming these are already in memory)
# embeddings_df = pd.read_csv('cusip_embeddings_unique.csv')  
# df = pd.read_pickle('2025-09-22_one_month.pkl')

# Merge embeddings with the latest trade data for each CUSIP
latest_trades = (df.sort_values('trade_datetime')
                   .groupby('cusip')
                   .last()
                   .reset_index())

# Key fields we care about for analysis
key_fields = ['cusip', 'yield', 'dollar_price', 'rating', 'maturity_date', 
              'coupon', 'incorporated_state_code', 'trade_datetime', 
              'par_traded', 'trade_type']

# Keep only available fields
available_fields = [f for f in key_fields if f in latest_trades.columns]
analysis_df = embeddings_df.merge(latest_trades[available_fields], on='cusip', how='inner')

# Convert yields from basis points to percent and clean bad data
if 'yield' in analysis_df.columns:
    analysis_df['yield'] = analysis_df['yield'] / 100.0
    
    # Filter out obviously bad yields (>15% is extremely rare for munis)
    bad_yield_mask = (analysis_df['yield'] > 15) | (analysis_df['yield'] < 0)
    n_bad = bad_yield_mask.sum()
    
    if n_bad > 0:
        print(f"⚠️ Found {n_bad} bonds with invalid yields (>15% or negative)")
        bad_examples = analysis_df[bad_yield_mask][['cusip', 'yield']].head()
        print(f"   Examples:\n{bad_examples}")
        analysis_df.loc[bad_yield_mask, 'yield'] = np.nan
    
    valid_yields = analysis_df['yield'].dropna()
    print(f"✓ Valid yields: {len(valid_yields):,} bonds, median: {valid_yields.median():.2f}%, range: {valid_yields.min():.2f}%-{valid_yields.max():.2f}%")

print(f"✓ Loaded {len(embeddings_df):,} unique CUSIP embeddings")
print(f"✓ Merged with trade data: {len(analysis_df):,} CUSIPs with full data")

# Extract embedding columns and create matrix
embedding_cols = [c for c in embeddings_df.columns if c.startswith('emb_')]
X = analysis_df[embedding_cols].values
X_normalized = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)

print(f"✓ Embedding dimension: {len(embedding_cols)}")

# ============================================
# 2. SIMILARITY SEARCH FUNCTION
# ============================================

def find_similar_bonds(cusip_query, n_similar=10, same_state_only=False, 
                       similar_maturity_years=None, similar_rating_bucket=False):
    """
    Find the most similar bonds to a given CUSIP
    
    Parameters:
    - cusip_query: The CUSIP to search for
    - n_similar: Number of similar bonds to return
    - same_state_only: If True, only return bonds from same state
    - similar_maturity_years: If set, only bonds within ±X years maturity
    - similar_rating_bucket: If True, only bonds in same rating bucket (AAA/AA/A/BBB/BB/B)
    """
    
    # Find the query bond
    if cusip_query not in analysis_df['cusip'].values:
        print(f"❌ CUSIP {cusip_query} not found in embeddings")
        return None
    
    # Use the row position (not the index label) so it lines up with X_normalized and .iloc
    query_idx = np.flatnonzero(analysis_df['cusip'].values == cusip_query)[0]
    query_embedding = X_normalized[query_idx].reshape(1, -1)
    query_info = analysis_df.iloc[query_idx]
    
    # Compute similarities to all bonds
    similarities = cosine_similarity(query_embedding, X_normalized)[0]
    
    # Create mask for filtering
    mask = np.ones(len(analysis_df), dtype=bool)
    mask[query_idx] = False  # Exclude self
    
    # Apply filters
    if same_state_only and 'incorporated_state_code' in analysis_df.columns:
        mask &= (analysis_df['incorporated_state_code'] == query_info['incorporated_state_code'])
    
    if similar_maturity_years is not None and 'maturity_date' in analysis_df.columns:
        query_maturity = pd.to_datetime(query_info['maturity_date'])
        if pd.notna(query_maturity):
            all_maturities = pd.to_datetime(analysis_df['maturity_date'])
            years_diff = abs((all_maturities - query_maturity).dt.days / 365.25)
            mask &= (years_diff <= similar_maturity_years)
    
    if similar_rating_bucket and 'rating' in analysis_df.columns:
        def get_rating_bucket(r):
            r_str = str(r).upper()
            if r_str.startswith('AAA'): return 'AAA'
            if r_str.startswith('AA'): return 'AA'
            if r_str.startswith('A'): return 'A'
            if r_str.startswith('BBB'): return 'BBB'
            if r_str.startswith('BB'): return 'BB'
            if r_str.startswith('B'): return 'B'
            return 'NR'
        
        query_bucket = get_rating_bucket(query_info.get('rating', 'NR'))
        all_buckets = analysis_df['rating'].apply(get_rating_bucket)
        mask &= (all_buckets == query_bucket)
    
    # Get top similar bonds
    valid_indices = np.where(mask)[0]
    if len(valid_indices) == 0:
        print("⚠️ No bonds found matching the criteria")
        return None
    
    valid_similarities = similarities[valid_indices]
    top_indices = valid_indices[np.argsort(valid_similarities)[-n_similar:][::-1]]
    
    # Create results dataframe
    results = analysis_df.iloc[top_indices].copy()
    results['similarity'] = similarities[top_indices]
    
    # Add useful comparison columns
    if 'yield' in results.columns and pd.notna(query_info.get('yield')):
        results['yield_diff_bps'] = (results['yield'] - query_info['yield']) * 100
    
    if 'dollar_price' in results.columns and pd.notna(query_info.get('dollar_price')):
        results['price_diff'] = results['dollar_price'] - query_info['dollar_price']
    
    # Reorder columns for clarity. Comparison columns are only created when the
    # query bond has a valid yield/price, so keep just the ones that exist.
    cols_order = ['cusip', 'similarity', 'yield', 'yield_diff_bps', 'dollar_price', 'price_diff']
    cols_order = [c for c in cols_order if c in results.columns]
    cols_order.extend([c for c in results.columns if c not in cols_order and not c.startswith('emb_')])
    
    results = results[cols_order]
    
    # Print query bond info
    print(f"\n📌 Query Bond: {cusip_query}")
    if 'yield' in query_info and pd.notna(query_info['yield']):
        print(f"   Yield: {query_info['yield']:.3f}%")
    if 'rating' in query_info:
        print(f"   Rating: {query_info['rating']}")
    if 'maturity_date' in query_info:
        print(f"   Maturity: {query_info['maturity_date']}")
    if 'incorporated_state_code' in query_info:
        print(f"   State: {query_info['incorporated_state_code']}")
    
    return results
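
# --- Illustrative sketch (not part of the original pipeline) ---
# Because the rows of X_normalized are unit length, cosine similarity reduces to
# a dot product, so neighbours can be ranked with one matrix-vector product and
# np.argpartition instead of a full sort. `_topk_cosine_demo` and the toy matrix
# are hypothetical names used only for illustration.
import numpy as np


def _topk_cosine_demo(unit_rows, query_pos, k):
    """Return positions of the k rows most similar to row `query_pos`, best first."""
    sims = unit_rows @ unit_rows[query_pos]      # dot product == cosine for unit rows
    sims[query_pos] = -np.inf                    # exclude the query itself
    k = min(k, len(sims) - 1)
    if k <= 0:
        return np.array([], dtype=int)
    top = np.argpartition(sims, -k)[-k:]         # unordered top-k, linear time
    return top[np.argsort(sims[top])[::-1]]      # order the k winners by similarity


_toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
_toy_unit = _toy / np.linalg.norm(_toy, axis=1, keepdims=True)
_toy_neighbours = _topk_cosine_demo(_toy_unit, query_pos=0, k=2)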

# ============================================
# 3. ARBITRAGE OPPORTUNITY FINDER
# ============================================

def find_arbitrage_opportunities(min_similarity=0.95, min_yield_spread_bps=30, 
                                max_results=20, same_state=True, 
                                similar_maturity_years=2.0):
    """
    Find potential arbitrage opportunities:
    Bonds that are very similar but have significant yield differences
    
    Arbitrage logic: If two bonds behave identically (high embedding similarity)
    but trade at different yields, there may be a pricing inefficiency
    """
    
    print("\n🔍 Searching for arbitrage opportunities...")
    print(f"   Criteria: similarity > {min_similarity}, yield spread > {min_yield_spread_bps} bps")
    
    if 'yield' not in analysis_df.columns:
        print("❌ No yield data available for arbitrage analysis")
        return None
    
    # Only consider bonds with valid yields (already converted to percent)
    # Filter out unreasonable yields (>15% is extremely rare for munis)
    valid_yield_mask = (analysis_df['yield'].notna()) & (analysis_df['yield'] <= 15) & (analysis_df['yield'] >= 0)
    valid_df = analysis_df[valid_yield_mask].copy()
    valid_X = X_normalized[valid_yield_mask]
    
    if len(valid_df) < 2:
        print("❌ Insufficient bonds with valid yield data")
        return None
    
    print(f"   Analyzing {len(valid_df)} bonds with valid yields...")
    
    opportunities = []
    
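    # Sample bonds to check (for speed, check a subset).
    # The sample is unseeded, so results vary between runs unless the
    # NumPy random seed is fixed beforehand.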
    n_sample = min(500, len(valid_df))
    sample_indices = np.random.choice(len(valid_df), n_sample, replace=False)
    
    for i in sample_indices:
        query_embedding = valid_X[i].reshape(1, -1)
        similarities = cosine_similarity(query_embedding, valid_X)[0]
        
        # Find highly similar bonds
        similar_mask = (similarities > min_similarity)
        similar_mask[i] = False  # Exclude self
        
        # Apply additional filters
        if same_state and 'incorporated_state_code' in valid_df.columns:
            state_mask = (valid_df['incorporated_state_code'].values == 
                         valid_df.iloc[i]['incorporated_state_code'])
            similar_mask &= state_mask
        
        if similar_maturity_years is not None and 'maturity_date' in valid_df.columns:
            query_maturity = pd.to_datetime(valid_df.iloc[i]['maturity_date'])
            if pd.notna(query_maturity):
                all_maturities = pd.to_datetime(valid_df['maturity_date'])
                years_diff = abs((all_maturities - query_maturity).dt.days / 365.25)
                similar_mask &= (years_diff.values <= similar_maturity_years)
        
        similar_indices = np.where(similar_mask)[0]
        
        for j in similar_indices:
            yield_i = valid_df.iloc[i]['yield']
            yield_j = valid_df.iloc[j]['yield']
            yield_spread = abs(yield_i - yield_j) * 100  # Convert to bps
            
            if yield_spread >= min_yield_spread_bps:
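                # Determine buy/sell direction. As written, the lower-yielding bond is
                # labelled "buy" and the higher-yielding bond "sell"; a relative-value
                # trade would normally buy the higher-yielding (cheaper) bond, so
                # confirm the intended direction before acting on these labels.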
                if yield_i > yield_j:
                    buy_idx, sell_idx = j, i
                else:
                    buy_idx, sell_idx = i, j
                
                opportunities.append({
                    'buy_cusip': valid_df.iloc[buy_idx]['cusip'],
                    'buy_yield': valid_df.iloc[buy_idx]['yield'],
                    'sell_cusip': valid_df.iloc[sell_idx]['cusip'],
                    'sell_yield': valid_df.iloc[sell_idx]['yield'],
                    'yield_spread_bps': yield_spread,
                    'similarity': similarities[j],
                    'state': valid_df.iloc[i].get('incorporated_state_code', 'N/A'),
                    'buy_rating': valid_df.iloc[buy_idx].get('rating', 'NR'),
                    'sell_rating': valid_df.iloc[sell_idx].get('rating', 'NR'),
                })
    
    if not opportunities:
        print("No arbitrage opportunities found with current criteria")
        return None
    
    # Remove duplicates and sort by opportunity size
    arb_df = pd.DataFrame(opportunities)
    arb_df['pair_key'] = arb_df.apply(
        lambda x: tuple(sorted([x['buy_cusip'], x['sell_cusip']])), axis=1
    )
    arb_df = arb_df.drop_duplicates(subset='pair_key').drop('pair_key', axis=1)
    arb_df = arb_df.sort_values('yield_spread_bps', ascending=False).head(max_results)
    
    print(f"✓ Found {len(arb_df)} unique arbitrage opportunities")
    
    return arb_df
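
# --- Illustrative sketch (not part of the original pipeline) ---
# The screen above boils down to: keep pairs (i, j) whose cosine similarity
# exceeds a threshold and whose yields differ by at least a minimum spread.
# On a small universe this can be written without the nested Python loops.
# `_screen_pairs_demo` and the toy inputs are hypothetical. With roughly 200k
# CUSIPs the full n x n matrix is too large to build in one piece, so the
# sampling loop (or a chunked variant) is still needed at full scale.
import numpy as np


def _screen_pairs_demo(unit_rows, yields_pct, min_similarity, min_spread_bps):
    """Return (high_yield_pos, low_yield_pos, spread_bps) for each qualifying pair."""
    sims = unit_rows @ unit_rows.T               # cosine similarity for unit rows
    spread_bps = np.abs(yields_pct[:, None] - yields_pct[None, :]) * 100.0
    # Upper triangle only, so each unordered pair is reported once
    hit = np.triu((sims > min_similarity) & (spread_bps >= min_spread_bps), k=1)
    pairs = []
    for i, j in zip(*np.nonzero(hit)):
        hi, lo = (i, j) if yields_pct[i] > yields_pct[j] else (j, i)
        pairs.append((int(hi), int(lo), float(spread_bps[i, j])))
    return pairs


_toy_rows = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # already unit length
_toy_yields = np.array([3.0, 3.5, 3.0])                       # percent
_toy_pairs = _screen_pairs_demo(_toy_rows, _toy_yields,
                                min_similarity=0.95, min_spread_bps=30)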

# ============================================
# 4. PORTFOLIO SIMILARITY ANALYSIS
# ============================================

def analyze_portfolio_similarity(cusip_list):
    """
    Analyze how similar bonds in a portfolio are to each other
    Useful for understanding concentration risk
    """
    
    # Filter to CUSIPs in our embeddings
    valid_cusips = [c for c in cusip_list if c in analysis_df['cusip'].values]
    
    if len(valid_cusips) < 2:
        print("❌ Need at least 2 valid CUSIPs for portfolio analysis")
        return None
    
    print(f"\n📊 Portfolio Analysis: {len(valid_cusips)} bonds")
    
    # Get embeddings for portfolio
    portfolio_mask = analysis_df['cusip'].isin(valid_cusips)
    portfolio_df = analysis_df[portfolio_mask].copy()
    portfolio_X = X_normalized[portfolio_mask]
    
    # Compute pairwise similarities
    sim_matrix = cosine_similarity(portfolio_X)
    
    # Extract upper triangle (excluding diagonal)
    upper_triangle = sim_matrix[np.triu_indices_from(sim_matrix, k=1)]
    
    # Statistics
    avg_similarity = np.mean(upper_triangle)
    max_similarity = np.max(upper_triangle)
    min_similarity = np.min(upper_triangle)
    
    print(f"\nPortfolio Similarity Metrics:")
    print(f"  Average pairwise similarity: {avg_similarity:.3f}")
    print(f"  Max similarity: {max_similarity:.3f}")
    print(f"  Min similarity: {min_similarity:.3f}")
    
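    # Find most similar pair. Mask the diagonal with -inf so a bond is never
    # paired with itself, even when every off-diagonal similarity is negative.
    off_diag = sim_matrix.copy()
    np.fill_diagonal(off_diag, -np.inf)
    max_idx = np.unravel_index(np.argmax(off_diag), off_diag.shape)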
    most_similar_pair = (portfolio_df.iloc[max_idx[0]]['cusip'],
                        portfolio_df.iloc[max_idx[1]]['cusip'])
    
    print(f"\nMost similar pair: {most_similar_pair[0]} <-> {most_similar_pair[1]}")
    print(f"  Similarity: {sim_matrix[max_idx]:.3f}")
    
    # Concentration warning
    if avg_similarity > 0.85:
        print("\n⚠️ WARNING: High portfolio concentration - bonds are very similar")
        print("   Consider diversifying to reduce systematic risk")
    elif avg_similarity > 0.75:
        print("\n⚠️ CAUTION: Moderate portfolio concentration")
    else:
        print("\n✓ Portfolio appears well-diversified in embedding space")
    
    return {
        'similarity_matrix': sim_matrix,
        'portfolio_df': portfolio_df,
        'avg_similarity': avg_similarity,
        'most_similar_pair': most_similar_pair
    }

# ============================================
# 5. EXAMPLE USAGE
# ============================================

print("\n" + "="*60)
print("EXAMPLE ANALYSES")
print("="*60)

# Example 1: Find similar bonds to a specific CUSIP
print("\n1️⃣ SIMILARITY SEARCH")
print("-" * 40)

# Pick the first CUSIP with a valid yield (deterministic, not random)
sample_cusips = analysis_df[(analysis_df['yield'].notna()) & 
                           (analysis_df['yield'] <= 15) & 
                           (analysis_df['yield'] >= 0)]['cusip'].values
if len(sample_cusips) > 0:
    example_cusip = sample_cusips[0]
    similar_bonds = find_similar_bonds(
        example_cusip, 
        n_similar=5,
        same_state_only=True,
        similar_maturity_years=3
    )
    
    if similar_bonds is not None:
        print("\nMost similar bonds:")
        display_cols = ['cusip', 'similarity', 'yield', 'yield_diff_bps', 'rating']
        display_cols = [c for c in display_cols if c in similar_bonds.columns]
        print(similar_bonds[display_cols].to_string(index=False))

# Example 2: Find arbitrage opportunities
print("\n2️⃣ ARBITRAGE OPPORTUNITIES")
print("-" * 40)

arb_opportunities = find_arbitrage_opportunities(
    min_similarity=0.93,  # Slightly lower threshold for more results
    min_yield_spread_bps=25,
    max_results=5,
    same_state=True
)

if arb_opportunities is not None and len(arb_opportunities) > 0:
    print("\nTop arbitrage trades:")
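    # Number trades by rank; the DataFrame index is not sequential after sorting and de-duplication
    for rank, (_, row) in enumerate(arb_opportunities.head(3).iterrows(), start=1):
        print(f"\n  Trade #{rank}:")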
        print(f"    BUY:  {row['buy_cusip']} @ {row['buy_yield']:.3f}% (Rating: {row['buy_rating']})")
        print(f"    SELL: {row['sell_cusip']} @ {row['sell_yield']:.3f}% (Rating: {row['sell_rating']})")
        print(f"    Spread: {row['yield_spread_bps']:.1f} bps | Similarity: {row['similarity']:.3f}")

# Example 3: Portfolio concentration analysis
print("\n3️⃣ PORTFOLIO CONCENTRATION CHECK")
print("-" * 40)

# Create a sample portfolio
sample_portfolio = sample_cusips[:10].tolist() if len(sample_cusips) >= 10 else sample_cusips.tolist()
portfolio_analysis = analyze_portfolio_similarity(sample_portfolio)

print("\n" + "="*60)
print("✅ Analysis complete! Bond embeddings ready for production use.")
print("="*60)

MUNICIPAL BOND EMBEDDING EXPLORER
⚠️ Found 85 bonds with invalid yields (>15% or negative)
   Examples:
           cusip   yield
3134   28304CBU0  32.779
5358   13069AAW8  17.641
10468  05753PCD2  19.309
23431  574193MH8  19.365
25514  927676MW3  24.689
✓ Valid yields: 196,361 bonds, median: 3.26%, range: 0.00%-14.83%
✓ Loaded 196,446 unique CUSIP embeddings
✓ Merged with trade data: 196,446 CUSIPs with full data
✓ Embedding dimension: 128

EXAMPLE ANALYSES

1️⃣ SIMILARITY SEARCH
----------------------------------------

📌 Query Bond: 19648FYV0
   Yield: 4.641%
   Rating: AA
   Maturity: 2044-08-01 00:00:00
   State: CO

Most similar bonds:
    cusip  similarity  yield  yield_diff_bps rating
19648FJE5    0.999854  4.690             4.9     AA
61531ABA4    0.999830  4.452           -18.9     AA
249176DE5    0.999799  4.384           -25.7    AAA
249182FR2    0.999724  4.515           -12.6    AA-
249176DD7    0.999642  4.565            -7.6    AAA

2️⃣ ARBITRAGE O