## **Basic Features**

**Purpose:** Engineer basic microstructure features from cleaned data

**What it does:**
- Loads cleaned data from `data/interim/`
- Computes fundamental LOB features
- Creates simple derived variables
- Saves engineered features to `data/processed/`

**Output:**
- Feature dataset in `data/processed/`
- Feature visualizations
- Correlation analysis
- Feature statistics


```
data/processed/{SYMBOL}_basic_features.parquet   (e.g. BTCUSDT_basic_features.parquet)
Columns:
- timestamp
- mid_price              ← NEW (derived)
- spread_abs             ← NEW (derived)
- spread_bps             ← NEW (derived)
- imbalance              ← NEW (derived)
- bid_depth_1            ← NEW (derived)
- ask_depth_1            ← NEW (derived)
- bid_depth_5            ← NEW (derived)
- ask_depth_5            ← NEW (derived)
- weighted_mid           ← NEW (derived)
```

The limit order book features considered for this project fall into two tiers, from basic to intermediate:

## 🟢 Basic Features
*Foundational measures that are straightforward to compute and interpret*

1. **Spread** - Best ask minus best bid
2. **Mid Price** - Average of best bid and best ask
3. **Relative Spread** - Spread divided by mid price (often quoted in bps)
4. **Returns (Multiple Lags)** - Log price changes
5. **Queue Depth (Top 1, Top 5, Cumulative)** - Raw quantities at levels
6. **Spread at Multiple Levels** - Price differences at each level
7. **Average Spread (Top N Levels)** - Mean of spreads
8. **Intraday Patterns (Hour, Minute)** - Time-of-day indicators
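
Several of the basic items above (multi-lag returns, per-level spreads, their average, and time-of-day indicators) are not computed later in this notebook. The following is a minimal sketch of how they could be derived, assuming the `bid_px_i` / `ask_px_i` / `timestamp` column layout of the cleaned data. The function name `basic_extras`, its parameters, and the output column names are hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd


def basic_extras(lob: pd.DataFrame, n_levels: int = 2, lags=(1, 2)) -> pd.DataFrame:
    """Sketch: multi-lag log returns, per-level spreads, average spread, hour/minute."""
    out = pd.DataFrame(index=lob.index)
    mid = (lob["bid_px_1"] + lob["ask_px_1"]) / 2.0

    # Log returns of the mid price; lags are counted in snapshots, not seconds.
    log_mid = np.log(mid)
    for lag in lags:
        out[f"log_ret_{lag}"] = log_mid.diff(lag)

    # Spread at each level, then the mean across the top n_levels.
    level_spreads = pd.DataFrame({
        f"spread_l{i}": lob[f"ask_px_{i}"] - lob[f"bid_px_{i}"]
        for i in range(1, n_levels + 1)
    })
    out = out.join(level_spreads)
    out[f"avg_spread_{n_levels}"] = level_spreads.mean(axis=1)

    # Time-of-day indicators taken from the (UTC) snapshot timestamp.
    ts = pd.to_datetime(lob["timestamp"], utc=True)
    out["hour"] = ts.dt.hour
    out["minute"] = ts.dt.minute
    return out


# Toy data, invented purely for illustration.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2025-10-09 17:07:08", "2025-10-09 17:07:09", "2025-10-09 18:00:10"], utc=True
    ),
    "bid_px_1": [100.0, 100.0, 101.0],
    "ask_px_1": [102.0, 102.0, 103.0],
    "bid_px_2": [99.0, 99.0, 100.0],
    "ask_px_2": [103.0, 104.0, 104.0],
})
extras = basic_extras(toy)
print(extras)
```

The first `lag` rows of each return column are `NaN` by construction, so downstream code would need to drop or fill them.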

## 🟡 Intermediate Features
*Require some statistical/financial understanding but still conceptually accessible*

9. **Weighted Mid Price** - Volume-weighted midpoint
10. **Rolling Volatility** - Standard deviation of returns
11. **VWAP** - Volume-weighted average price
12. **Effective Spread** - VWAP difference
13. **Microprice** - Another volume-weighted fair value
14. **Depth Imbalance** - Normalized bid/ask difference
15. **Queue Imbalance (Multiple Levels)** - Depth imbalance at specific levels
16. **Cumulative Volume Imbalance** - Absolute quantity difference
17. **Order Book Thickness** - Average quantity per level
18. **Depth Concentration** - Fraction of liquidity at top
19. **Time Since Event** - Temporal gap measurement
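
As a rough illustration of four of the intermediate items (rolling volatility, cumulative volume imbalance, order book thickness, and depth concentration), here is a sketch under assumed definitions. The list gives only one-line descriptions, so the exact formulas below are one common reading, not necessarily the ones this project uses. The function name `intermediate_extras` and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd


def intermediate_extras(lob: pd.DataFrame, n_levels: int = 2, vol_window: int = 3) -> pd.DataFrame:
    """Sketch of a few intermediate LOB features over the top n_levels."""
    out = pd.DataFrame(index=lob.index)
    bid_total = lob[[f"bid_qty_{i}" for i in range(1, n_levels + 1)]].sum(axis=1)
    ask_total = lob[[f"ask_qty_{i}" for i in range(1, n_levels + 1)]].sum(axis=1)

    # Rolling volatility: standard deviation of 1-snapshot log mid returns.
    mid = (lob["bid_px_1"] + lob["ask_px_1"]) / 2.0
    out["rolling_vol"] = np.log(mid).diff().rolling(vol_window).std()

    # Cumulative volume imbalance: absolute (unnormalized) bid minus ask quantity.
    out["cum_vol_imbalance"] = bid_total - ask_total

    # Order book thickness: average quantity per level across both sides.
    out["book_thickness"] = (bid_total + ask_total) / (2 * n_levels)

    # Depth concentration: share of top-n liquidity resting at level 1.
    # Assumes total quantity is non-zero; an empty book would give NaN or inf.
    out["depth_concentration"] = (lob["bid_qty_1"] + lob["ask_qty_1"]) / (bid_total + ask_total)
    return out


# Toy data, invented purely for illustration.
toy_book = pd.DataFrame({
    "bid_px_1": [100.0, 100.0, 101.0, 101.0],
    "ask_px_1": [102.0, 102.0, 103.0, 103.0],
    "bid_qty_1": [1.0, 3.0, 2.0, 2.0],
    "bid_qty_2": [1.0, 1.0, 2.0, 6.0],
    "ask_qty_1": [3.0, 1.0, 2.0, 2.0],
    "ask_qty_2": [1.0, 3.0, 2.0, 6.0],
})
inter = intermediate_extras(toy_book)
print(inter)
```

With `vol_window=3` the first three volatility values are `NaN`, because the first return is itself undefined and the rolling window needs three valid observations.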


In [None]:
# ============================================================================
# 10_basic_features.ipynb
# Engineer basic microstructure features from cleaned LOB data
# ============================================================================

# %% Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from typing import Dict, List
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# %% Project imports (assuming you ran `pip install -e .`)
from src.config import (
    INTERIM_DATA_DIR, 
    PROCESSED_DATA_DIR,
    FIGURES_DIR
)
from src.data.lob_loader import LOBLoader
# from src.features.basic_features import (
#     compute_mid_price,
#     compute_spread,
#     compute_imbalance,
#     compute_depth,
#     compute_weighted_mid,
#     compute_all_basic_features  # Convenience function
# )
# from src.utils.plotting import (
#     plot_price_series,
#     plot_feature_distributions,
#     plot_correlation_matrix
# )
# from src.utils.metrics import compute_summary_stats




In [1]:
# ============================================================================
# 10_basic_features.ipynb
# Purpose: Engineer basic microstructure features from cleaned data
# ============================================================================

# %% Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Visualization settings
plt.style.use('seaborn-v0_8-paper')  
plt.rcParams.update({
    'font.family': 'serif',
    'font.weight': 'bold',        
    'axes.labelweight': 'bold',    
    'axes.titleweight': 'bold',   
    'axes.linewidth': 1.2,
    'axes.spines.top': False,
    'axes.spines.right': False,
})
%matplotlib inline


from src.config import RAW_DATA_DIR, INTERIM_DATA_DIR, PROCESSED_DATA_DIR, FIGURES_DIR




In [2]:
# Configuration
SYMBOL = "BTCUSDT"
INPUT_FILE = INTERIM_DATA_DIR / f"{SYMBOL}_lob_cleaned.parquet"
OUTPUT_FILE = PROCESSED_DATA_DIR / f"{SYMBOL}_basic_features.parquet"

# Create output directories if needed
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)
FIGURES_DIR.mkdir(parents=True, exist_ok=True)

print(f"Input: {INPUT_FILE}")
print(f"Output: {OUTPUT_FILE}")

Input: /Users/rylanspence/Desktop/Git/HF/Order-Book-Microstructure-Analysis/data/interim/BTCUSDT_lob_cleaned.parquet
Output: /Users/rylanspence/Desktop/Git/HF/Order-Book-Microstructure-Analysis/data/processed/BTCUSDT_basic_features.parquet


In [3]:
# Load cleaned data
print("Loading cleaned LOB data...")
lob_df = pd.read_parquet(INPUT_FILE)

print(f"Loaded {len(lob_df):,} snapshots")
print(f"Date range: {lob_df['timestamp'].min()} to {lob_df['timestamp'].max()}")
print(f"\nColumns: {lob_df.columns.tolist()}")
lob_df.head()

Loading cleaned LOB data...
Loaded 3,154 snapshots
Date range: 2025-10-09 17:07:08+00:00 to 2025-10-09 18:07:07+00:00

Columns: ['timestamp', 'bid_px_1', 'bid_qty_1', 'bid_px_2', 'bid_qty_2', 'bid_px_3', 'bid_qty_3', 'bid_px_4', 'bid_qty_4', 'bid_px_5', 'bid_qty_5', 'bid_px_6', 'bid_qty_6', 'bid_px_7', 'bid_qty_7', 'bid_px_8', 'bid_qty_8', 'bid_px_9', 'bid_qty_9', 'bid_px_10', 'bid_qty_10', 'bid_px_11', 'bid_qty_11', 'bid_px_12', 'bid_qty_12', 'bid_px_13', 'bid_qty_13', 'bid_px_14', 'bid_qty_14', 'bid_px_15', 'bid_qty_15', 'bid_px_16', 'bid_qty_16', 'bid_px_17', 'bid_qty_17', 'bid_px_18', 'bid_qty_18', 'bid_px_19', 'bid_qty_19', 'bid_px_20', 'bid_qty_20', 'ask_px_1', 'ask_qty_1', 'ask_px_2', 'ask_qty_2', 'ask_px_3', 'ask_qty_3', 'ask_px_4', 'ask_qty_4', 'ask_px_5', 'ask_qty_5', 'ask_px_6', 'ask_qty_6', 'ask_px_7', 'ask_qty_7', 'ask_px_8', 'ask_qty_8', 'ask_px_9', 'ask_qty_9', 'ask_px_10', 'ask_qty_10', 'ask_px_11', 'ask_qty_11', 'ask_px_12', 'ask_qty_12', 'ask_px_13', 'ask_qty_13', 'as

Unnamed: 0,timestamp,bid_px_1,bid_qty_1,bid_px_2,bid_qty_2,bid_px_3,bid_qty_3,bid_px_4,bid_qty_4,bid_px_5,...,ask_px_16,ask_qty_16,ask_px_17,ask_qty_17,ask_px_18,ask_qty_18,ask_px_19,ask_qty_19,ask_px_20,ask_qty_20
0,2025-10-09 17:07:08+00:00,119900.47,0.00094,119900.45,0.00229,119890.0,0.00834,119885.26,0.04879,119788.84,...,120595.0,0.24645,120605.85,0.0016,120625.0,0.29963,120632.5,6e-05,120663.16,0.00099
1,2025-10-09 17:07:09+00:00,119900.47,0.00094,119900.45,0.00229,119890.0,0.00834,119885.26,0.04879,119788.82,...,120526.76,0.0001,120536.0,0.12204,120595.0,0.24645,120605.85,0.0016,120625.0,0.29963
2,2025-10-09 17:07:10+00:00,119900.47,0.00094,119900.45,0.00229,119890.0,0.00834,119885.26,0.04879,119788.82,...,120526.76,0.0001,120536.0,0.12204,120595.0,0.24645,120605.85,0.0016,120625.0,0.29963
3,2025-10-09 17:07:11+00:00,119900.47,0.00094,119900.45,0.00229,119890.0,0.00834,119885.27,1.35252,119885.26,...,120526.76,0.0001,120536.0,0.12204,120595.0,0.24645,120605.85,0.0016,120625.0,0.29963
4,2025-10-09 17:07:12+00:00,119900.48,0.04878,119900.47,0.00094,119900.45,0.00229,119890.01,1.35252,119890.0,...,120605.85,0.0016,120632.5,6e-05,120663.16,0.00099,120672.83,0.01021,120735.87,4e-05
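
The feature cell that follows calls `compute_mid_price`, `compute_spread`, `compute_imbalance`, `compute_depth`, and `compute_weighted_mid`, but their import from `src.features.basic_features` is commented out earlier in the notebook, so the actual implementations are not shown. The sketch below is one possible set of stand-ins with signatures inferred from the call sites. The function bodies are assumptions (in particular, the weighted mid uses the common convention of weighting each price by the opposite side's quantity), not the project's real code.

```python
from typing import List

import numpy as np
import pandas as pd


def compute_mid_price(bid_price: pd.Series, ask_price: pd.Series) -> pd.Series:
    """Midpoint of the best bid and best ask."""
    return (bid_price + ask_price) / 2.0


def compute_spread(bid_price: pd.Series, ask_price: pd.Series, mid_price: pd.Series) -> pd.DataFrame:
    """Absolute spread and spread in basis points of the mid price."""
    spread_abs = ask_price - bid_price
    return pd.DataFrame({
        "spread_abs": spread_abs,
        "spread_bps": spread_abs / mid_price * 1e4,
    })


def compute_imbalance(bid_volume: pd.Series, ask_volume: pd.Series) -> pd.Series:
    """Normalized imbalance in [-1, 1]; assumes bid + ask volume is non-zero."""
    return (bid_volume - ask_volume) / (bid_volume + ask_volume)


def compute_depth(lob_df: pd.DataFrame, levels: List[int]) -> pd.DataFrame:
    """Cumulative quantity over the top n levels on each side, for each n in levels."""
    depth = pd.DataFrame(index=lob_df.index)
    for n in levels:
        depth[f"bid_depth_{n}"] = lob_df[[f"bid_qty_{i}" for i in range(1, n + 1)]].sum(axis=1)
        depth[f"ask_depth_{n}"] = lob_df[[f"ask_qty_{i}" for i in range(1, n + 1)]].sum(axis=1)
    return depth


def compute_weighted_mid(bid_price, ask_price, bid_volume, ask_volume) -> pd.Series:
    """Mid price weighted by opposite-side quantity (assumed convention)."""
    return (bid_price * ask_volume + ask_price * bid_volume) / (bid_volume + ask_volume)


# Toy data, invented purely for illustration.
demo = pd.DataFrame({
    "bid_px_1": [100.0, 100.0], "ask_px_1": [102.0, 101.0],
    "bid_qty_1": [1.0, 3.0], "ask_qty_1": [3.0, 1.0],
    "bid_qty_2": [1.0, 1.0], "ask_qty_2": [1.0, 2.0],
})
demo_mid = compute_mid_price(demo["bid_px_1"], demo["ask_px_1"])
demo_spread = compute_spread(demo["bid_px_1"], demo["ask_px_1"], demo_mid)
demo_imb = compute_imbalance(demo["bid_qty_1"], demo["ask_qty_1"])
demo_depth = compute_depth(demo, levels=[1, 2])
demo_wmid = compute_weighted_mid(
    demo["bid_px_1"], demo["ask_px_1"], demo["bid_qty_1"], demo["ask_qty_1"]
)
```

Under this weighting, a heavier bid queue pulls the weighted mid toward the ask, which matches the intuition that buying pressure pushes the fair value up.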


In [None]:
# Compute basic features with the compute_* helpers from
# src.features.basic_features (that import is commented out in the first
# cell and must be enabled, or the helpers defined, before running this cell).
# The cleaned data uses 1-based level names (bid_px_1, bid_qty_1, ...),
# so level 1 is the best quote; there is no bid_px_0 or bid_volume_0 column.

print("\n=== Computing Basic Features ===\n")


print("Computing mid-price...")
lob_df['mid_price'] = compute_mid_price(
    bid_price=lob_df['bid_px_1'],
    ask_price=lob_df['ask_px_1']
)

print("Computing spread...")
spread_features = compute_spread(
    bid_price=lob_df['bid_px_1'],
    ask_price=lob_df['ask_px_1'],
    mid_price=lob_df['mid_price']
)
lob_df['spread_abs'] = spread_features['spread_abs']
lob_df['spread_bps'] = spread_features['spread_bps']

# Imbalance of resting quantity at the best quotes (a snapshot measure,
# not order flow imbalance, which is built from changes between snapshots)
print("Computing order book imbalance...")
lob_df['imbalance'] = compute_imbalance(
    bid_volume=lob_df['bid_qty_1'],
    ask_volume=lob_df['ask_qty_1']
)

print("Computing depth features...")
depth_features = compute_depth(
    lob_df=lob_df,
    levels=[1, 5, 10]  # Top 1, top 5, top 10 levels
)
lob_df = pd.concat([lob_df, depth_features], axis=1)

print("Computing weighted mid-price...")
lob_df['weighted_mid'] = compute_weighted_mid(
    bid_price=lob_df['bid_px_1'],
    ask_price=lob_df['ask_px_1'],
    bid_volume=lob_df['bid_qty_1'],
    ask_volume=lob_df['ask_qty_1']
)

#  Use convenience function (recommended for production)
# lob_df = compute_all_basic_features(lob_df)

print("\nFeature engineering complete!")




In [None]:
# %% Feature summary statistics
feature_cols = [
    'mid_price', 'spread_abs', 'spread_bps', 'imbalance',
    'bid_depth_1', 'ask_depth_1', 'bid_depth_5', 'ask_depth_5',
    'weighted_mid'
]

print("\n=== Feature Summary Statistics ===\n")
summary_stats = compute_summary_stats(lob_df[feature_cols])
print(summary_stats)

In [None]:
# %% Visualize features

print("\n=== Visualizing Features ===\n")

# Time series plots
fig, axes = plt.subplots(3, 1, figsize=(15, 10))

axes[0].plot(lob_df['timestamp'], lob_df['mid_price'], linewidth=0.5)
axes[0].set_title('Mid Price Over Time', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Price')

axes[1].plot(lob_df['timestamp'], lob_df['spread_bps'], linewidth=0.5, color='orange')
axes[1].set_title('Spread (bps) Over Time', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Spread (bps)')

axes[2].plot(lob_df['timestamp'], lob_df['imbalance'], linewidth=0.5, color='green')
axes[2].set_title('Order Book Imbalance Over Time', fontsize=14, fontweight='bold')
axes[2].set_ylabel('Imbalance')
axes[2].set_xlabel('Time')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'basic_features_timeseries.png', dpi=300, bbox_inches='tight')
plt.show()

# Distribution plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].hist(lob_df['spread_bps'], bins=100, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Spread Distribution', fontweight='bold')
axes[0, 0].set_xlabel('Spread (bps)')
axes[0, 0].set_ylabel('Frequency')

axes[0, 1].hist(lob_df['imbalance'], bins=100, edgecolor='black', alpha=0.7, color='orange')
axes[0, 1].set_title('Imbalance Distribution', fontweight='bold')
axes[0, 1].set_xlabel('Imbalance')
axes[0, 1].set_ylabel('Frequency')

axes[1, 0].hist(lob_df['bid_depth_5'], bins=100, edgecolor='black', alpha=0.7, color='green')
axes[1, 0].set_title('Bid Depth (Top 5) Distribution', fontweight='bold')
axes[1, 0].set_xlabel('Volume')
axes[1, 0].set_ylabel('Frequency')

axes[1, 1].hist(lob_df['ask_depth_5'], bins=100, edgecolor='black', alpha=0.7, color='red')
axes[1, 1].set_title('Ask Depth (Top 5) Distribution', fontweight='bold')
axes[1, 1].set_xlabel('Volume')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'basic_features_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:


# Correlation analysis

print("\n=== Feature Correlations ===\n")

# Select numeric features only
numeric_features = [
    'mid_price', 'spread_abs', 'spread_bps', 'imbalance',
    'bid_depth_5', 'ask_depth_5', 'weighted_mid'
]

corr_matrix = lob_df[numeric_features].corr()

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    center=0,
    square=True,
    linewidths=1,
    cbar_kws={"shrink": 0.8}
)
plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'basic_features_correlation.png', dpi=300, bbox_inches='tight')
plt.show()

# Identify highly correlated pairs
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.7:
            high_corr_pairs.append({
                'feature_1': corr_matrix.columns[i],
                'feature_2': corr_matrix.columns[j],
                'correlation': corr_matrix.iloc[i, j]
            })

if high_corr_pairs:
    print("Highly correlated feature pairs (|r| > 0.7):")
    for pair in high_corr_pairs:
        print(f"  {pair['feature_1']} <-> {pair['feature_2']}: {pair['correlation']:.3f}")
else:
    print("No highly correlated feature pairs found.")

# %% Check for missing values and outliers

print("\n=== Data Quality Check ===\n")

# Missing values
missing = lob_df[feature_cols].isnull().sum()
if missing.sum() > 0:
    print("Missing values:")
    print(missing[missing > 0])
else:
    print("No missing values in features")

# Outliers (using IQR method)
print("\nOutlier detection (IQR method):")
for col in feature_cols:
    Q1 = lob_df[col].quantile(0.25)
    Q3 = lob_df[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = ((lob_df[col] < (Q1 - 1.5 * IQR)) | (lob_df[col] > (Q3 + 1.5 * IQR))).sum()
    pct = (outliers / len(lob_df)) * 100
    print(f"  {col}: {outliers} outliers ({pct:.2f}%)")


In [None]:
# Save features

print("\n=== Saving Features ===\n")

# Select columns to save
output_cols = ['timestamp'] + feature_cols

# Save to parquet
lob_df[output_cols].to_parquet(OUTPUT_FILE, index=False)
print(f"Saved {len(feature_cols)} features (plus timestamp) to {OUTPUT_FILE}")
print(f"File size: {OUTPUT_FILE.stat().st_size / 1024 / 1024:.2f} MB")

# Also save a CSV sample for easy inspection
sample_file = PROCESSED_DATA_DIR / f"{SYMBOL}_basic_features_sample.csv"
lob_df[output_cols].head(1000).to_csv(sample_file, index=False)
print(f"Saved sample (1000 rows) to {sample_file}")

print("\n=== Feature Engineering Complete ===")
print(f"Next step: Run 15_advanced_features.ipy