# K2 Market Data Platform - Demo Notebook

**K2 Platform** is a distributed market data processing platform designed for high-frequency trading environments.

This notebook demonstrates:
- Platform architecture and data flow
- Sample data exploration (ASX trades)
- Data ingestion pipeline
- Query engine capabilities
- Time-travel queries with Iceberg
- REST API usage

**Prerequisites**:
- Docker services running (`make docker-up`)
- Infrastructure initialized (`make init-infra`)

## 1. Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                       K2 Platform Architecture                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  CSV Files → Kafka (Avro) → Iceberg → DuckDB → REST API            │
│                                                                     │
│  • CSV batch ingestion with schema validation                       │
│  • Kafka streaming with at-least-once delivery                      │
│  • Iceberg ACID transactions with time-travel                       │
│  • DuckDB sub-second analytical queries                             │
│  • FastAPI REST endpoints with OpenAPI docs                         │
│                                                                     │
│  Key Technologies:                                                  │
│  • Apache Kafka + Schema Registry                                   │
│  • Apache Iceberg (table format)                                    │
│  • DuckDB (embedded analytics)                                      │
│  • FastAPI + Prometheus + Grafana                                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## 2. Setup & Imports

In [None]:
# Standard library
import sys
from pathlib import Path
from datetime import datetime
from decimal import Decimal

# Data processing
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Add src to path for k2 imports
sys.path.insert(0, str(Path.cwd().parent / "src"))

# Display settings
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 150)
plt.style.use('seaborn-v0_8-whitegrid')

print("Imports loaded successfully!")

In [None]:
# Configuration
SAMPLE_DATA_DIR = Path.cwd().parent / "data" / "sample"
TRADES_DIR = SAMPLE_DATA_DIR / "trades"
QUOTES_DIR = SAMPLE_DATA_DIR / "quotes"

# Company ID to Symbol mapping
COMPANY_MAPPING = {
    "7181": {"symbol": "DVN", "name": "Devine Ltd"},
    "3153": {"symbol": "MWR", "name": "MGM Wireless"},
    "7078": {"symbol": "BHP", "name": "BHP Billiton"},
    "7458": {"symbol": "RIO", "name": "Rio Tinto"},
}

# Verify sample data exists
if SAMPLE_DATA_DIR.exists():
    print(f"Sample data directory: {SAMPLE_DATA_DIR}")
    print(f"Available trade files: {list(TRADES_DIR.glob('*.csv'))}")
else:
    print("Warning: Sample data not found!")

## 3. Sample Data Exploration

The sample data contains real ASX (Australian Securities Exchange) market data from March 10-14, 2014.

| Symbol | Company | Trades | Notes |
|--------|---------|--------|-------|
| DVN | Devine Ltd | 231 | Low volume, good for demos |
| MWR | MGM Wireless | 10 | Very low volume |
| BHP | BHP Billiton | 91,630 | High volume mining stock |
| RIO | Rio Tinto | 108,670 | High volume mining stock |
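
The trade counts in the table can be checked against the files themselves. The following is a minimal sketch, assuming each trade file is a headerless CSV named `<company_id>.csv` with one trade per row (as the loading cell that follows indicates). `count_trades` is a hypothetical helper written for this notebook, not part of the K2 codebase.

```python
import csv
from pathlib import Path


def count_trades(trades_dir: Path) -> dict[str, int]:
    """Return {company_id: row_count} for every CSV file in trades_dir.

    Assumes headerless files, so every non-empty row is one trade.
    """
    counts = {}
    for csv_path in sorted(trades_dir.glob("*.csv")):
        with csv_path.open(newline="") as handle:
            # Skip blank rows so a trailing newline is not counted as a trade
            counts[csv_path.stem] = sum(1 for row in csv.reader(handle) if row)
    return counts
```

Calling `count_trades(TRADES_DIR)` after the configuration cell should give one entry per company ID, which can then be compared with the "Trades" column above.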

In [None]:
# Load DVN trades (small dataset for demo)
dvn_file = TRADES_DIR / "7181.csv"

# Sample data has no header, so we specify column names
df_dvn = pd.read_csv(
    dvn_file,
    names=["Date", "Time", "Price", "Volume", "Qualifiers", "Venue", "BuyerID"],
)

print(f"DVN Trades: {len(df_dvn)} records")
print(f"Date range: {df_dvn['Date'].iloc[0]} to {df_dvn['Date'].iloc[-1]}")
print("\nFirst 10 trades:")
df_dvn.head(10)

In [None]:
# Data statistics
print("=== DVN Trade Statistics ===")
print(f"\nPrice Range: ${df_dvn['Price'].min():.2f} - ${df_dvn['Price'].max():.2f}")
print(f"Average Price: ${df_dvn['Price'].mean():.2f}")
print(f"Total Volume: {df_dvn['Volume'].sum():,}")
print(f"Average Trade Size: {df_dvn['Volume'].mean():,.0f}")
print(f"\nVenues: {df_dvn['Venue'].unique().tolist()}")
print(f"Qualifiers: {df_dvn['Qualifiers'].unique().tolist()}")

In [None]:
# Parse timestamps for time-series analysis
def parse_timestamp(row):
    """Parse sample data timestamp format to datetime."""
    dt_str = f"{row['Date']} {row['Time']}"
    try:
        return datetime.strptime(dt_str, "%m/%d/%Y %H:%M:%S.%f")
    except ValueError:
        return datetime.strptime(dt_str, "%m/%d/%Y %H:%M:%S")

df_dvn['Timestamp'] = df_dvn.apply(parse_timestamp, axis=1)
df_dvn['DateParsed'] = pd.to_datetime(df_dvn['Date'], format='%m/%d/%Y')

print("Timestamps parsed successfully!")
df_dvn[['Date', 'Time', 'Timestamp', 'Price', 'Volume']].head()
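
Row-wise `apply` is fine for the 231 DVN trades but becomes slow on the BHP and RIO files. A vectorized alternative might look like the sketch below. It assumes the same two timestamp layouts that `parse_timestamp` handles (with and without fractional seconds); `parse_timestamps_vectorized` is a hypothetical helper, not an existing K2 function.

```python
import pandas as pd


def parse_timestamps_vectorized(dates: pd.Series, times: pd.Series) -> pd.Series:
    """Combine Date and Time columns into datetimes without a Python-level loop."""
    combined = dates.astype(str) + " " + times.astype(str)
    # Rows that do not match a format become NaT instead of raising
    with_fraction = pd.to_datetime(
        combined, format="%m/%d/%Y %H:%M:%S.%f", errors="coerce"
    )
    without_fraction = pd.to_datetime(
        combined, format="%m/%d/%Y %H:%M:%S", errors="coerce"
    )
    # Prefer the fractional-second parse, fall back to whole seconds
    return with_fraction.fillna(without_fraction)
```

Usage would be `df_dvn['Timestamp'] = parse_timestamps_vectorized(df_dvn['Date'], df_dvn['Time'])`. Any row matching neither layout is left as `NaT` rather than raising, so a quick `isna().sum()` check afterwards is advisable.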

## 4. Data Visualization

In [None]:
# Intraday price chart
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)

# Price scatter plot
ax1 = axes[0]
scatter = ax1.scatter(
    df_dvn['Timestamp'], 
    df_dvn['Price'], 
    c=df_dvn['Volume'],
    cmap='viridis',
    alpha=0.7,
    s=df_dvn['Volume'] / 1000,  # Size by volume
)
ax1.set_ylabel('Price ($)', fontsize=12)
ax1.set_title('DVN Intraday Trades - March 10-14, 2014', fontsize=14)
plt.colorbar(scatter, ax=ax1, label='Volume')

# Volume bar chart
ax2 = axes[1]
ax2.bar(
    df_dvn['Timestamp'], 
    df_dvn['Volume'],
    width=0.001,
    color='steelblue',
    alpha=0.7,
)
ax2.set_ylabel('Volume', fontsize=12)
ax2.set_xlabel('Time', fontsize=12)

# Format x-axis
ax2.xaxis.set_major_formatter(mdates.DateFormatter('%m/%d %H:%M'))
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Daily OHLCV summary
daily_ohlcv = df_dvn.groupby('DateParsed').agg({
    'Price': ['first', 'max', 'min', 'last', 'mean'],
    'Volume': 'sum',
    'Timestamp': 'count',
}).round(2)

daily_ohlcv.columns = ['Open', 'High', 'Low', 'Close', 'VWAP', 'Volume', 'Trades']

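# The 'mean' aggregate above is only a placeholder: overwrite it with the
# true volume-weighted average price, sum(price * volume) / sum(volume)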
for date in daily_ohlcv.index:
    day_data = df_dvn[df_dvn['DateParsed'] == date]
    vwap = (day_data['Price'] * day_data['Volume']).sum() / day_data['Volume'].sum()
    daily_ohlcv.loc[date, 'VWAP'] = round(vwap, 2)

print("=== Daily OHLCV Summary ===")
daily_ohlcv
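
The cell above recomputes the `VWAP` column because a simple mean of trade prices and a volume-weighted average differ whenever trade sizes vary. A small worked sketch with made-up numbers (not taken from the sample data) shows the gap:

```python
def vwap(trades: list[tuple[float, int]]) -> float:
    """Volume-weighted average price for (price, volume) pairs."""
    total_volume = sum(volume for _, volume in trades)
    if total_volume == 0:
        raise ValueError("VWAP is undefined when total volume is zero")
    return sum(price * volume for price, volume in trades) / total_volume


# Illustrative trades: a small print at a high price, a large one at a low price
trades = [(2.00, 100), (1.00, 900)]
simple_mean = sum(price for price, _ in trades) / len(trades)

print(f"Simple mean: {simple_mean:.2f}")   # 1.50
print(f"VWAP:        {vwap(trades):.2f}")  # 1.10
```

The large trade dominates the VWAP, which is why the unweighted mean overstates the price most of the volume actually traded at.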

## 5. Data Transformation for Ingestion

The sample CSV format differs from our Avro schema, so each file is transformed before ingestion:
- Add `symbol` from the `company_id` mapping
- Combine `Date` + `Time` → `exchange_timestamp`
- Generate `sequence_number` from row order

In [None]:
def transform_sample_trades(company_id: str, limit: int = None) -> pd.DataFrame:
    """Transform sample trade data to BatchLoader-compatible format."""
    csv_path = TRADES_DIR / f"{company_id}.csv"
    
    df = pd.read_csv(
        csv_path,
        names=["Date", "Time", "Price", "Volume", "Qualifiers", "Venue", "BuyerID"],
    )
    
    if limit is not None:
        df = df.head(limit)
    
    company_info = COMPANY_MAPPING[company_id]
    
    # Transform to schema format
    result = pd.DataFrame({
        "symbol": company_info["symbol"],
        "company_id": int(company_id),
        "exchange": "ASX",
        "exchange_timestamp": df.apply(
            lambda row: parse_timestamp(row).isoformat(),
            axis=1,
        ),
        "price": df["Price"],
        "volume": df["Volume"],
        "qualifiers": df["Qualifiers"],
        "venue": df["Venue"].fillna("X"),
        "buyer_id": df["BuyerID"].fillna(""),
        "sequence_number": range(1, len(df) + 1),
    })
    
    return result

# Transform DVN data
df_transformed = transform_sample_trades("7181", limit=20)
print("Transformed data (first 10 rows):")
df_transformed.head(10)

In [None]:
# Schema validation
required_columns = [
    "symbol", "company_id", "exchange", "exchange_timestamp",
    "price", "volume", "qualifiers", "venue", "sequence_number",
]

print("Schema Validation:")
for col in required_columns:
    status = "✓" if col in df_transformed.columns else "✗"
    print(f"  {status} {col}: {df_transformed[col].dtype if col in df_transformed.columns else 'MISSING'}")

## 6. Query Engine Demo

The K2 QueryEngine uses DuckDB with the Iceberg extension to query market data.

**Note**: This section requires the Docker services to be running and data to be loaded into Iceberg.

In [None]:
# Try to import and connect to QueryEngine
try:
    from k2.query.engine import QueryEngine
    
    engine = QueryEngine()
    print("QueryEngine connected successfully!")
    
    # Get statistics
    stats = engine.get_stats()
    print("\nDatabase Statistics:")
    for key, value in stats.items():
        print(f"  {key}: {value}")
        
except Exception as e:
    print(f"QueryEngine not available: {e}")
    print("\nTo use QueryEngine, run:")
    print("  make docker-up")
    print("  make init-infra")
    engine = None

In [None]:
# Query trades if engine is available
if engine:
    symbols = engine.get_symbols()
    print(f"Available symbols: {symbols}")
    
    if symbols:
        # Query first symbol
        symbol = symbols[0]
        print(f"\nQuerying trades for {symbol}...")
        
        trades = engine.query_trades(symbol=symbol, limit=10)
        print(f"Found {len(trades)} trades:")
        display(pd.DataFrame(trades))
else:
    print("Skipping query demo - QueryEngine not available")

In [None]:
# Market summary example
if engine:
    symbols = engine.get_symbols()
    if symbols:
        symbol = symbols[0]
        date_range = engine.get_date_range(symbol)
        
        if date_range and date_range.get('min_date'):
            date = date_range['min_date'].strftime('%Y-%m-%d')
            print(f"Market Summary for {symbol} on {date}:")
            
            summary = engine.get_market_summary(symbol, date)
            for key, value in summary.items():
                print(f"  {key}: {value}")
else:
    print("Skipping market summary - QueryEngine not available")

## 7. Time-Travel Demo

Apache Iceberg records a new snapshot each time a table is committed to and keeps a history of those snapshots. This enables:
- Querying data as it existed at any retained snapshot
- Auditing changes over time
- Rolling back to a previous state
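
Conceptually, an "as of" query resolves a timestamp to the newest snapshot committed at or before it. The sketch below illustrates that selection rule in plain Python. The snapshot dictionaries and their `snapshot_id` / `timestamp_ms` keys are a hypothetical shape chosen for this example, not the actual output of `ReplayEngine.list_snapshots`.

```python
from datetime import datetime, timezone
from typing import Optional


def snapshot_as_of(snapshots: list[dict], as_of: datetime) -> Optional[dict]:
    """Pick the newest snapshot committed at or before `as_of`.

    Each snapshot is assumed to be a dict with `snapshot_id` and
    `timestamp_ms` keys (hypothetical shape for this sketch).
    """
    cutoff_ms = int(as_of.timestamp() * 1000)
    eligible = [s for s in snapshots if s["timestamp_ms"] <= cutoff_ms]
    # default=None covers the case where as_of predates every snapshot
    return max(eligible, key=lambda s: s["timestamp_ms"], default=None)


# Made-up history: three commits ten minutes apart
history = [
    {"snapshot_id": 1, "timestamp_ms": 1_700_000_000_000},
    {"snapshot_id": 2, "timestamp_ms": 1_700_000_600_000},
    {"snapshot_id": 3, "timestamp_ms": 1_700_001_200_000},
]
as_of = datetime.fromtimestamp(1_700_000_900, tz=timezone.utc)
print(snapshot_as_of(history, as_of))  # the snapshot with snapshot_id 2
```

A timestamp earlier than the first commit resolves to `None`, which corresponds to the table not yet existing in any queryable state.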

In [None]:
# Try to import ReplayEngine
try:
    from k2.query.replay import ReplayEngine
    
    replay = ReplayEngine()
    print("ReplayEngine connected successfully!")
    
    # List snapshots
    snapshots = replay.list_snapshots(table_type="trades", limit=5)
    print(f"\nFound {len(snapshots)} snapshots:")
    
    if snapshots:
        df_snapshots = pd.DataFrame(snapshots)
        display(df_snapshots)
    else:
        print("No snapshots found (table may be empty)")
        
    replay.close()
    
except Exception as e:
    print(f"ReplayEngine not available: {e}")
    print("\nTime-travel requires Iceberg tables with data.")

## 8. REST API Demo

The K2 API provides REST endpoints for querying market data.

**Base URL**: http://localhost:8000

**Endpoints**:
- `GET /health` - Health check
- `GET /v1/trades` - Query trades
- `GET /v1/quotes` - Query quotes
- `GET /v1/symbols` - List symbols
- `GET /v1/summary/{symbol}/{date}` - OHLCV summary
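
The summary endpoint takes its arguments as path segments rather than query parameters. A minimal sketch of building such a URL follows. It assumes the `{date}` segment is ISO formatted (`YYYY-MM-DD`), matching the format this notebook passes to `get_market_summary`; `summary_url` is a hypothetical helper.

```python
from datetime import date
from urllib.parse import quote

API_BASE = "http://localhost:8000"


def summary_url(symbol: str, day: date) -> str:
    """Build the OHLCV summary URL for one symbol and trading day."""
    # Assumption: the {date} path segment is ISO formatted (YYYY-MM-DD)
    # quote() percent-encodes any characters that are unsafe in a path segment
    return f"{API_BASE}/v1/summary/{quote(symbol, safe='')}/{day.isoformat()}"


print(summary_url("BHP", date(2014, 3, 10)))
# http://localhost:8000/v1/summary/BHP/2014-03-10
```

The path after the host can be passed to the `api_get` helper defined in the next cell, since that helper prepends the same base URL.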

In [None]:
import requests

API_BASE = "http://localhost:8000"
API_KEY = "k2-dev-api-key-2026"
HEADERS = {"X-API-Key": API_KEY}

def api_get(endpoint, params=None, auth=True):
    """Make API GET request."""
    headers = HEADERS if auth else {}
    try:
        response = requests.get(f"{API_BASE}{endpoint}", headers=headers, params=params, timeout=5)
        return response.status_code, response.json()
    except requests.exceptions.ConnectionError:
        return None, {"error": "API not running. Start with: make api"}
    except Exception as e:
        return None, {"error": str(e)}

In [None]:
# Health check (no auth required)
status, data = api_get("/health", auth=False)
print(f"Health Check: {status}")
print(f"Response: {data}")

In [None]:
# Get trades (requires auth)
status, data = api_get("/v1/trades", params={"limit": 5})
print(f"Trades Endpoint: {status}")

if status == 200 and "data" in data:
    df_trades = pd.DataFrame(data["data"])
    display(df_trades)
else:
    print(f"Response: {data}")

In [None]:
# List symbols
status, data = api_get("/v1/symbols")
print(f"Symbols Endpoint: {status}")

if status == 200 and "data" in data:
    print(f"Available symbols: {data['data']}")
else:
    print(f"Response: {data}")

## 9. Summary

### What We Demonstrated

| Component | Description | Status |
|-----------|-------------|--------|
| Sample Data | Real ASX market data (March 2014) | Explored |
| Data Transformation | CSV → Avro schema format | Demonstrated |
| Visualization | Intraday charts, OHLCV | Created |
| Query Engine | DuckDB + Iceberg queries | Tested |
| Time-Travel | Iceberg snapshots | Explored |
| REST API | FastAPI endpoints | Tested |

### Key Commands

```bash
make docker-up      # Start all services
make init-infra     # Initialize Kafka topics and Iceberg tables
make api            # Start REST API server
k2-query --help     # Query CLI usage
```

### Links

- **API Docs**: http://localhost:8000/docs
- **Grafana**: http://localhost:3000 (admin/admin)
- **Kafka UI**: http://localhost:8080
- **MinIO Console**: http://localhost:9001

In [None]:
# Cleanup
if 'engine' in dir() and engine:
    engine.close()
    print("QueryEngine closed.")

print("\nDemo complete!")