"""
================================================================================
PHASE 03A: PER-ROW / PER-TRADE FEATURE ENGINEERING
================================================================================
UOA Research Pipeline - Feature Engineering Module

Author:         CycleLabs
Version:        1.0
Last Updated:   2025-02-01
Python:         3.8+

================================================================================
OVERVIEW
================================================================================
This module performs per-trade feature engineering on options trade data that 
has been enriched with underlying OHLCV and EOD options chain data from Phase 02.
It transforms raw merged trade data into analysis-ready features for downstream 
anomaly detection processing.

Pipeline Position:
    Phase 02B (EOD Chain Merge) --> [Phase 03A] --> Phase 03B (Aggregated Features)

================================================================================
OPERATIONS PERFORMED
================================================================================
1. DATA CLEANING
   - Removes unnecessary columns: 'correction', 'exchange'
   - Filters out invalid trade conditions: 201, 203, 205, 207, 246, 247
     (canceled trades, proprietary products, compression trades)

2. CONDITION CODE MAPPING
   - Converts numeric condition codes to binary categorical flags
   - Creates 9 binary columns for trade execution type analysis:
     * condn_contmarkettrade    - Continuous market trade
     * condn_timingadjusted     - Late/out-of-sequence trade
     * cndn_autoelectronic      - Auto-electronic execution
     * condn_auctionmechanism   - Auction mechanism execution
     * cndn_crossingtrade       - Crossing trade execution
     * cndn_floorexecuted       - Floor-executed trade
     * cndn_multilegstrategy    - Multi-leg strategy component
     * condn_stockoptioncombo   - Stock-option combo component
     * condn_extendedhours      - Extended hours trade

3. OPTION TICKER PARSING
   - Parses OCC-format tickers (e.g., O:CIFR240202C00002500)
   - Extracts: strike_price, option_type (C/P), opt_expiration_date
   - Creates binary flag: option_type_call (1=call, 0=put)

4. MONEYNESS CALCULATIONS
   - otm_percentage:    Out-of-the-money % relative to strike
   - intrinsic_value:   Value if exercised immediately
   - extrinsic_value:   Time value (price - intrinsic)
   - extrinsic_ratio:   Extrinsic / price
   - days_to_expiry:    Calendar days until expiration
   - risk_reward_ratio: Extrinsic / (intrinsic + epsilon)

5. SIZE & VOLUME RATIOS
   - size_to_oi_ratio:                 size / open_interest_now
   - size_to_underlyingvolume_ratio:   (size * CONTRACT_MULTIPLIER / underlying_volume) * 100
   - open_interest_change:             open_interest_now - open_interest_yesterday
                                       (0 when open_interest_yesterday is missing)
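
   Worked example (illustrative sketch only): the scalar helper below mirrors the
   three formulas in this section. The function name and the sample numbers are
   hypothetical and are not part of the pipeline, which computes the same
   quantities column-wise with pandas.

```python
CONTRACT_MULTIPLIER = 100  # shares per contract, matching the module default

# Hypothetical scalar helper; the module itself works on whole DataFrame columns.
def size_ratios(size, open_interest_now, open_interest_yesterday, underlying_volume):
    size_to_oi_ratio = size / open_interest_now
    # Convert contracts to share equivalents before comparing with share volume.
    share_equivalent = size * CONTRACT_MULTIPLIER
    size_to_underlyingvolume_ratio = share_equivalent / underlying_volume * 100
    # A missing previous-day value falls back to today's OI, so the change is 0.
    if open_interest_yesterday is None:
        open_interest_yesterday = open_interest_now
    open_interest_change = open_interest_now - open_interest_yesterday
    return size_to_oi_ratio, size_to_underlyingvolume_ratio, open_interest_change

# Made-up trade: 50 contracts, OI of 200 (150 the day before), 1,000,000 shares traded.
example_ratios = size_ratios(50, 200, 150, 1_000_000)
print(example_ratios)
```

   Note that this scalar version raises ZeroDivisionError when open_interest_now
   is 0, whereas pandas column division yields inf for such rows.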

6. NOTIONAL VALUE METRICS
   - opt_trade_notional_value:         price * size * CONTRACT_MULTIPLIER
   - notional_underlyingvolume_ratio:  notional / underlying_volume

7. POSITION-WEIGHTED GREEKS
   - delta_size, gamma_size, vega_size, theta_size
   - Formula: greek * size * CONTRACT_MULTIPLIER

8. GREEKS-TO-NOTIONAL RATIOS
   - delta_notional_ratio, gamma_notional_ratio,
     vega_notional_ratio, theta_notional_ratio
   - Formula: <greek>_size / (opt_trade_notional_value + 1e-8)
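
   Worked example (illustrative sketch only) tying sections 6 to 8 together for
   a single trade. The helper name, the EPSILON constant name, and the sample
   numbers are assumptions made for illustration; only the formulas come from
   this module.

```python
CONTRACT_MULTIPLIER = 100  # same default as the module configuration
EPSILON = 1e-8             # keeps the ratio finite when the notional is zero

# Hypothetical helper showing delta only; gamma, vega and theta follow the same pattern.
def position_metrics(price, size, delta):
    notional = price * size * CONTRACT_MULTIPLIER
    delta_size = delta * size * CONTRACT_MULTIPLIER
    delta_notional_ratio = delta_size / (notional + EPSILON)
    return notional, delta_size, delta_notional_ratio

# Made-up trade: 10 contracts at 0.50 with an EOD delta of 0.40.
notional, delta_size, delta_ratio = position_metrics(0.50, 10, 0.40)
print(notional, delta_size, delta_ratio)
```

   Because CONTRACT_MULTIPLIER appears in both numerator and denominator, the
   ratio reduces to roughly greek / price whenever the notional is non-zero.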

================================================================================
INPUT SPECIFICATIONS
================================================================================
Directory:      Configured via INPUT_FOLDER
Filename:       {TICKER}_comptrades_YYYY-MM-DD.parquet
Example:        CIFR_comptrades_2024-02-01.parquet
Source:         Phase 02B output

Required Columns:
    ticker                  str     Option ticker (OCC format: O:CIFR240202C00002500)
    conditions              int     Trade condition code
    price                   float   Trade execution price per contract
    sip_timestamp           str     Trade timestamp (YYYY-MM-DD HH:MM:SS)
    size                    int     Number of contracts traded
    underlying              str     Underlying asset ticker symbol
    underlying_close        float   EOD closing price of underlying
    underlying_volume       int     EOD volume of underlying
    open_interest_now       int     Current day EOD open interest
    open_interest_yesterday int     Previous day EOD open interest (nullable)
    implied_volatility      float   EOD implied volatility
    grk_delta               float   EOD delta
    grk_gamma               float   EOD gamma
    grk_theta               float   EOD theta
    grk_vega                float   EOD vega
    trade_date              str     Trade date (YYYY-MM-DD)

================================================================================
OUTPUT SPECIFICATIONS
================================================================================
Directory:      Configured via OUTPUT_FOLDER
Filename:       {TICKER}_perrowfeatures_YYYY-MM-DD.parquet
Example:        CIFR_perrowfeatures_2024-02-01.parquet

New Columns Added (in addition to retained input columns):
    
    [Option Contract Attributes]
    strike_price            float   Strike price (parsed from ticker)
    option_type             str     'C' for call, 'P' for put
    opt_expiration_date     datetime Expiration date (parsed from ticker)
    option_type_call        int     Binary: 1=call, 0=put
    
    [Moneyness & Value Metrics]
    otm_percentage          float   Out-of-the-money percentage
    intrinsic_value         float   Intrinsic value per contract
    extrinsic_value         float   Extrinsic (time) value
    extrinsic_ratio         float   Extrinsic / price
    days_to_expiry          int     Days until expiration
    risk_reward_ratio       float   Extrinsic / (intrinsic + 1e-8)
    
    [Size & Volume Ratios]
    size_to_oi_ratio        float   Trade size / open interest
    size_to_underlyingvolume_ratio  float   Share-equiv size vs underlying vol (%)
    open_interest_change    float   OI change from previous day
    
    [Notional Metrics]
    opt_trade_notional_value float  Total trade dollar value
    notional_underlyingvolume_ratio float   Notional / underlying volume
    
    [Position-Weighted Greeks]
    delta_size              float   Position delta exposure
    gamma_size              float   Position gamma exposure
    vega_size               float   Position vega exposure
    theta_size              float   Position theta exposure
    
    [Greeks-to-Notional Ratios]
    delta_notional_ratio    float   Delta per dollar invested
    gamma_notional_ratio    float   Gamma per dollar invested
    vega_notional_ratio     float   Vega per dollar invested
    theta_notional_ratio    float   Theta per dollar invested
    
    [Condition Binary Flags]
    condn_contmarkettrade   int     Continuous market trade flag
    condn_timingadjusted    int     Timing adjusted flag
    cndn_autoelectronic     int     Auto-electronic flag
    condn_auctionmechanism  int     Auction mechanism flag
    cndn_crossingtrade      int     Crossing trade flag
    cndn_floorexecuted      int     Floor executed flag
    cndn_multilegstrategy   int     Multi-leg strategy flag
    condn_stockoptioncombo  int     Stock-option combo flag
    condn_extendedhours     int     Extended hours flag

Columns Removed:
    correction, exchange

================================================================================
CONFIGURATION
================================================================================
Modify the following variables in the USER CONFIGURABLE section:

    INPUT_FOLDER            Path to Phase 02B output parquet files
    OUTPUT_FOLDER           Path for feature-engineered output files
    TICKERS_TO_PROCESS      List of tickers to process, or None/[] for all
    START_DATE              Start date filter (inclusive, YYYY-MM-DD)
    END_DATE                End date filter (inclusive, YYYY-MM-DD)
    CONTRACT_MULTIPLIER     Standard option multiplier (default: 100)
    REMOVE_CONDITIONS       Condition codes to exclude from analysis
    CONDITIONS_MAP          Condition code to binary flag mapping

================================================================================
USAGE
================================================================================
Basic execution:
    $ python Phase_03A_perrow_feature_engineering.py

Processing specific tickers:
    1. Set TICKERS_TO_PROCESS = ["AAPL", "TSLA", "CIFR"]
    2. Run the script

Processing all tickers in date range:
    1. Set TICKERS_TO_PROCESS = None  (or empty list [])
    2. Set START_DATE and END_DATE
    3. Run the script
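
How the ticker and date filters select files can be sketched as follows. This
is a simplified, hypothetical stand-in that works on plain filename strings
rather than scanning INPUT_FOLDER; the function name and the sample filenames
are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical stand-in for the file-selection step, operating on filename strings.
def select_files(filenames, tickers=None, start_date=None, end_date=None):
    selected = []
    for name in filenames:
        stem = name[: -len(".parquet")] if name.endswith(".parquet") else name
        parts = stem.split("_comptrades_")
        if len(parts) != 2:
            continue  # not a TICKER_comptrades_DATE name
        ticker, date_str = parts
        try:
            datetime.strptime(date_str, "%Y-%m-%d")
        except ValueError:
            continue  # malformed date part
        if tickers and ticker not in tickers:
            continue
        # Zero-padded ISO dates sort correctly as plain strings.
        if start_date and date_str < start_date:
            continue
        if end_date and date_str > end_date:
            continue
        selected.append((ticker, date_str))
    return sorted(selected)

sample_names = [
    "CIFR_comptrades_2024-02-01.parquet",
    "CIFR_comptrades_2024-03-15.parquet",
    "AAPL_comptrades_2024-02-01.parquet",
    "notes.parquet",
]
print(select_files(sample_names, ["CIFR"], "2024-02-01", "2024-02-29"))
```

Passing tickers=None (or an empty list) keeps every ticker, which matches the
"process all tickers" setting described above.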

================================================================================
CONDITION CODES REFERENCE
================================================================================
REMOVED (not suitable for anomaly detection):
    201 - Canceled
    203 - Last and Canceled
    205 - Opening Trade and Canceled
    207 - Only Trade and Canceled
    246 - Multi Leg Floor Trade of Proprietary Products
    247 - Multilateral Compression Trade of Proprietary Products

RETAINED (mapped to binary flags):
    202 - Late and Out Of Sequence      → contmarkettrade, timingadjusted
    204 - Late                          → contmarkettrade, timingadjusted
    209 - Automatic Execution           → contmarkettrade, autoelectronic
    210 - Reopening Trade               → contmarkettrade
    219 - Intermarket Sweep Order       → contmarkettrade, autoelectronic
    227 - Single Leg Auction Non-ISO    → auctionmechanism
    228 - Single Leg Auction ISO        → auctionmechanism
    229 - Single Leg Cross Non-ISO      → crossingtrade
    230 - Single Leg Cross ISO          → crossingtrade
    231 - Single Leg Floor Trade        → floorexecuted
    232 - Multi Leg Auto-Electronic     → multilegstrategy, autoelectronic
    233 - Multi Leg Auction             → multilegstrategy, auctionmechanism
    234 - Multi Leg Cross               → multilegstrategy, crossingtrade
    235 - Multi Leg Floor Trade         → multilegstrategy, floorexecuted
    236 - Multi Leg Auto vs Single      → multilegstrategy, autoelectronic
    237 - Stock Options Auction         → stockoptioncombo, auctionmechanism
    238 - Multi Leg Auction vs Single   → multilegstrategy, auctionmechanism
    239 - Multi Leg Floor vs Single     → multilegstrategy, floorexecuted
    240 - Stock Options Auto            → stockoptioncombo, autoelectronic
    241 - Stock Options Cross           → stockoptioncombo, crossingtrade
    242 - Stock Options Floor Trade     → stockoptioncombo, floorexecuted
    243 - Stock Options Auto vs Single  → stockoptioncombo, autoelectronic
    244 - Stock Options Auction vs Single → stockoptioncombo, auctionmechanism
    245 - Stock Options Floor vs Single → stockoptioncombo, floorexecuted
    248 - Extended Hours Trade          → extendedhours
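
The code-to-flag mapping above can be illustrated for a single trade as
follows. This is a minimal sketch: SAMPLE_MAP holds only two entries copied
from the table, and flags_for is a hypothetical per-row helper, whereas the
module applies the full CONDITIONS_MAP to whole columns at once.

```python
# Two entries copied from the reference table; the remaining codes are omitted.
SAMPLE_MAP = {
    209: ["condn_contmarkettrade", "cndn_autoelectronic"],
    233: ["cndn_multilegstrategy", "condn_auctionmechanism"],
}

# Hypothetical per-row helper, for illustration only.
def flags_for(code, conditions_map):
    # Start every known flag at 0, then set the flags mapped to this code.
    all_flags = sorted({name for names in conditions_map.values() for name in names})
    row = {name: 0 for name in all_flags}
    for name in conditions_map.get(code, []):
        row[name] = 1
    return row

print(flags_for(233, SAMPLE_MAP))
print(flags_for(999, SAMPLE_MAP))  # unmapped code: every flag stays 0
```

A retained code that is missing from the mapping therefore produces a row with
all flags set to 0 rather than an error.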

================================================================================
FORMULA REFERENCE
================================================================================
OTM Percentage:
    Call: max(strike_price - underlying_close, 0) / strike_price * 100
    Put:  max(underlying_close - strike_price, 0) / strike_price * 100

Intrinsic Value:
    Call: max(underlying_close - strike_price, 0)
    Put:  max(strike_price - underlying_close, 0)

Extrinsic Value:
    price - intrinsic_value
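
Worked example (illustrative sketch only) of the three formulas above for a
single option. The helper name and the sample prices are hypothetical; the
module evaluates the same expressions column-wise.

```python
# Hypothetical scalar helper mirroring the formulas above (not part of the pipeline).
def moneyness(option_type, strike_price, underlying_close, price):
    if option_type == "C":
        otm_gap = max(strike_price - underlying_close, 0.0)
        intrinsic_value = max(underlying_close - strike_price, 0.0)
    else:
        otm_gap = max(underlying_close - strike_price, 0.0)
        intrinsic_value = max(strike_price - underlying_close, 0.0)
    # The out-of-the-money gap is expressed as a percentage of the strike.
    otm_percentage = otm_gap / strike_price * 100
    extrinsic_value = price - intrinsic_value
    return otm_percentage, intrinsic_value, extrinsic_value

# Made-up numbers: a 2.50-strike call traded at 0.30 with the underlying closing at 2.00.
call_result = moneyness("C", 2.5, 2.0, 0.30)
# Same strike and close for a put traded at 0.60, which is in the money by 0.50.
put_result = moneyness("P", 2.5, 2.0, 0.60)
print(call_result, put_result)
```

At most one of otm_percentage and intrinsic_value is non-zero for a given
option, since an option cannot be both in and out of the money.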

Option Ticker Parsing (OCC format: O:CIFR240202C00002500):
    - Last 8 chars:         Strike price (divide by 1000)
    - 9th char from end:    Option type (C/P)
    - 15th-10th from end:   Expiration date (YYMMDD)
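
The slicing rules above can be sketched as a standalone function. The name
split_occ_ticker is hypothetical, and returning the underlying symbol is an
addition made for illustration; the module's own parser returns only strike,
type, and expiration.

```python
from datetime import datetime

# Hypothetical standalone parser; slices are taken from the end of the string.
def split_occ_ticker(ticker_str):
    core = ticker_str.split(":")[1]          # drop the "O:" prefix
    strike_price = int(core[-8:]) / 1000     # last 8 chars, in thousandths
    option_type = core[-9]                   # "C" or "P"
    expiration = datetime.strptime(core[-15:-9], "%y%m%d")
    underlying = core[:-15]                  # whatever precedes the date
    return underlying, expiration, option_type, strike_price

parsed_example = split_occ_ticker("O:CIFR240202C00002500")
print(parsed_example)
```

Counting from the end of the string means the underlying symbol can be any
length without affecting the other fields.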

================================================================================
DEPENDENCIES
================================================================================
Required packages:
    pandas          Data manipulation and parquet I/O
    numpy           Numerical operations
    pyarrow         Parquet engine used by pandas (optional dependency, installed separately)

Standard library:
    pathlib         File path handling
    datetime        Date parsing

Install:
    $ pip install pandas numpy pyarrow

================================================================================
GITHUB REPOSITORY
================================================================================
Owner:      toolsandsoftware@cyclelabs.net
Repo:       20260201_UOAResearchPipeline
Location:   notebooks/003_FEATUREENGINEERING/01v1_perrow_feature_engineering.py
Link:       https://github.com/toolsandsoftware-cyclelabs/20260201_UOAResearchPipeline

================================================================================
CHANGELOG
================================================================================
v1.0 (2025-02-01)
    - Initial release
    - Per-trade feature engineering from Phase 02B output
    - Condition code filtering and binary flag mapping
    - Option ticker parsing
    - Moneyness, size ratio, notional, and Greeks calculations

================================================================================
"""


import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime

# ============================================================================
# ============================ USER CONFIGURABLE =============================
# ============================================================================

# Folder containing Phase 02B output parquet files (format: TICKER_comptrades_YYYY-MM-DD.parquet)
INPUT_FOLDER = Path(r"D:\cyclelabs_codes\CL_20251120_siphontrades\01_FIXINGRAWDATA\output_mergedall")

# Folder to save feature-engineered files
OUTPUT_FOLDER = Path(r"D:\cyclelabs_codes\CL_20251120_siphontrades\01_FIXINGRAWDATA\output_2_perrowfeateng")

# Tickers to process (use None or empty list [] to process ALL tickers in folder)
TICKERS_TO_PROCESS = ["CIFR"]  # e.g., ["CIFR", "AAPL", "TSLA"] or None for all

# Date range filter (inclusive)
START_DATE = "2024-02-01"  # YYYY-MM-DD
END_DATE = "2025-12-31"    # YYYY-MM-DD

# Standard option contract multiplier
CONTRACT_MULTIPLIER = 100

# Trade conditions to DROP (exclude from analysis)
REMOVE_CONDITIONS = [201, 203, 205, 207, 246, 247]

# Trade conditions to binary column mapping
CONDITIONS_MAP = {
    202: ["condn_contmarkettrade", "condn_timingadjusted"],
    204: ["condn_contmarkettrade", "condn_timingadjusted"],
    209: ["condn_contmarkettrade", "cndn_autoelectronic"],
    210: ["condn_contmarkettrade"],
    219: ["condn_contmarkettrade", "cndn_autoelectronic"],
    227: ["condn_auctionmechanism"],
    228: ["condn_auctionmechanism"],
    229: ["cndn_crossingtrade"],
    230: ["cndn_crossingtrade"],
    231: ["cndn_floorexecuted"],
    232: ["cndn_multilegstrategy", "cndn_autoelectronic"],
    233: ["cndn_multilegstrategy", "condn_auctionmechanism"],
    234: ["cndn_multilegstrategy", "cndn_crossingtrade"],
    235: ["cndn_multilegstrategy", "cndn_floorexecuted"],
    236: ["cndn_multilegstrategy", "cndn_autoelectronic"],
    237: ["condn_stockoptioncombo", "condn_auctionmechanism"],
    238: ["cndn_multilegstrategy", "condn_auctionmechanism"],
    239: ["cndn_multilegstrategy", "cndn_floorexecuted"],
    240: ["condn_stockoptioncombo", "cndn_autoelectronic"],
    241: ["condn_stockoptioncombo", "cndn_crossingtrade"],
    242: ["condn_stockoptioncombo", "cndn_floorexecuted"],
    243: ["condn_stockoptioncombo", "cndn_autoelectronic"],
    244: ["condn_stockoptioncombo", "condn_auctionmechanism"],
    245: ["condn_stockoptioncombo", "cndn_floorexecuted"],
    248: ["condn_extendedhours"]
}

# ============================================================================
# ============================ HELPER FUNCTIONS ==============================
# ============================================================================

def parse_filename(filename):
    """
    Parse filename in format: TICKER_comptrades_YYYY-MM-DD.parquet
    
    Returns:
        tuple: (ticker, date_str) or (None, None) if parsing fails
    
    Example:
        'CIFR_comptrades_2024-02-01.parquet' -> ('CIFR', '2024-02-01')
    """
    try:
        # Remove .parquet extension
        name_without_ext = filename.replace(".parquet", "")
        # Split by '_comptrades_'
        parts = name_without_ext.split("_comptrades_")
        if len(parts) == 2:
            ticker = parts[0]
            date_str = parts[1]
            # Validate date format
            datetime.strptime(date_str, "%Y-%m-%d")
            return ticker, date_str
    except ValueError:
        # strptime raises ValueError when the date part is malformed
        pass
    return None, None


def parse_option_ticker(ticker_str):
    """
    Parse option ticker to extract strike, option type, and expiration date.
    
    Format: O:<UNDERLYING><YYMMDD><C|P><STRIKE>, with no separators
    (e.g., O:CIFR240202C00002500)
    - Last 8 chars: strike price (divide by 1000)
    - 9th from last: C (call) or P (put)
    - 15th to 10th from last: expiration YYMMDD
    
    Returns:
        tuple: (strike_price, option_type, opt_expiration_date)
    
    Example:
        'O:CIFR240202C00002500' -> (2.5, 'C', datetime(2024, 2, 2))
    """
    try:
        core = ticker_str.split(":")[1]  # Remove 'O:' prefix
        strike_str = core[-8:]           # Last 8 digits = strike
        option_type = core[-9]           # Char before strike = C/P
        expiry_str = core[-15:-9]        # 6 digits before option type = YYMMDD
        
        strike_price = int(strike_str) / 1000
        opt_expiration_date = datetime.strptime(expiry_str, "%y%m%d")
        return strike_price, option_type, opt_expiration_date
    except Exception:
        return np.nan, np.nan, pd.NaT


def get_files_to_process(input_folder, tickers=None, start_date=None, end_date=None):
    """
    Get list of files to process based on ticker and date filters.
    
    Args:
        input_folder: Path to folder containing parquet files
        tickers: List of tickers to process (None or [] for all)
        start_date: Start date string 'YYYY-MM-DD' (None for no filter)
        end_date: End date string 'YYYY-MM-DD' (None for no filter)
    
    Returns:
        list: List of tuples (file_path, ticker, date_str) for files to process
    """
    files_to_process = []
    
    for file_path in input_folder.glob("*.parquet"):
        ticker, date_str = parse_filename(file_path.name)
        
        # Skip if filename doesn't match expected format
        if ticker is None:
            print(f"  Skipping (invalid format): {file_path.name}")
            continue
        
        # Filter by ticker if specified
        if tickers and ticker not in tickers:
            continue
        
        # Filter by date range
        if start_date and date_str < start_date:
            continue
        if end_date and date_str > end_date:
            continue
        
        files_to_process.append((file_path, ticker, date_str))
    
    # Sort by ticker, then by date
    files_to_process.sort(key=lambda x: (x[1], x[2]))
    
    return files_to_process


def apply_condition_flags(df, conditions_map):
    """
    Apply binary condition flags based on condition codes (vectorized).
    
    Args:
        df: DataFrame with 'conditions' column
        conditions_map: Dict mapping condition codes to list of column names
    
    Returns:
        DataFrame with binary condition columns added
    """
    # Get all unique column names from the mapping
    all_columns = set()
    for cols in conditions_map.values():
        all_columns.update(cols)
    
    # Initialize all columns to 0
    for col in all_columns:
        df[col] = 0
    
    # Vectorized assignment for each condition code
    for cond_code, col_names in conditions_map.items():
        mask = df["conditions"] == cond_code
        for col in col_names:
            df.loc[mask, col] = 1
    
    return df


# ============================================================================
# ============================ MAIN PROCESSING ===============================
# ============================================================================

def main():
    # Create output folder if it doesn't exist
    OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)
    
    # Get files to process
    print("=" * 70)
    print("SCANNING INPUT FOLDER FOR FILES TO PROCESS")
    print("=" * 70)
    print(f"Input folder:  {INPUT_FOLDER}")
    print(f"Tickers:       {TICKERS_TO_PROCESS if TICKERS_TO_PROCESS else 'ALL'}")
    print(f"Date range:    {START_DATE} to {END_DATE}")
    print()
    
    files_to_process = get_files_to_process(
        INPUT_FOLDER, 
        TICKERS_TO_PROCESS, 
        START_DATE, 
        END_DATE
    )
    
    if not files_to_process:
        print("No files found matching criteria.")
        return
    
    print(f"Found {len(files_to_process)} file(s) to process:")
    for fp, ticker, date_str in files_to_process:
        print(f"  - {fp.name} (Ticker: {ticker}, Date: {date_str})")
    print()
    
    # Process each file
    print("=" * 70)
    print("PROCESSING FILES")
    print("=" * 70)
    
    for file_path, ticker, date_str in files_to_process:
        print(f"\nProcessing: {file_path.name}")
        print("-" * 50)
        
        # Load data
        df = pd.read_parquet(file_path)
        print(f"  Loaded {len(df)} rows")
        
        # Remove unwanted columns
        df = df.drop(columns=["correction", "exchange"], errors="ignore")
        
        # Remove unwanted condition rows
        rows_before = len(df)
        df = df[~df["conditions"].isin(REMOVE_CONDITIONS)].copy()
        rows_removed = rows_before - len(df)
        if rows_removed > 0:
            print(f"  Removed {rows_removed} rows with excluded conditions")
        
        # Apply condition binary flags (vectorized)
        df = apply_condition_flags(df, CONDITIONS_MAP)
        
        # Parse option ticker components
        parsed = df["ticker"].apply(parse_option_ticker)
        df["strike_price"] = parsed.apply(lambda x: x[0])
        df["option_type"] = parsed.apply(lambda x: x[1])
        # Coerce explicitly so the .dt accessor below works even when no rows remain
        df["opt_expiration_date"] = pd.to_datetime(parsed.apply(lambda x: x[2]))
        
        # Binary flag: 1 if call, 0 if put (rows whose ticker failed to parse also get 0)
        df["option_type_call"] = (df["option_type"] == "C").astype(int)
        
        # OTM percentage (relative to strike)
        df["otm_percentage"] = np.where(
            df["option_type_call"] == 1,
            np.maximum(df["strike_price"] - df["underlying_close"], 0) / df["strike_price"] * 100,
            np.maximum(df["underlying_close"] - df["strike_price"], 0) / df["strike_price"] * 100
        )
        
        # Intrinsic value
        df["intrinsic_value"] = np.where(
            df["option_type_call"] == 1,
            np.maximum(df["underlying_close"] - df["strike_price"], 0),
            np.maximum(df["strike_price"] - df["underlying_close"], 0)
        )
        
        # Extrinsic value and ratio
        df["extrinsic_value"] = df["price"] - df["intrinsic_value"]
        df["extrinsic_ratio"] = df["extrinsic_value"] / df["price"]
        
        # Days to expiry
        df["days_to_expiry"] = (df["opt_expiration_date"] - pd.to_datetime(df["trade_date"])).dt.days
        
        # Size ratios
        df["size_to_oi_ratio"] = df["size"] / df["open_interest_now"]
        df["size_to_underlyingvolume_ratio"] = (df["size"] * CONTRACT_MULTIPLIER / df["underlying_volume"]) * 100
        
        # Risk/reward ratio
        df["risk_reward_ratio"] = df["extrinsic_value"] / (df["intrinsic_value"] + 1e-8)
        
        # Notional values
        df["opt_trade_notional_value"] = df["price"] * df["size"] * CONTRACT_MULTIPLIER
        df["notional_underlyingvolume_ratio"] = df["opt_trade_notional_value"] / df["underlying_volume"]
        
        # Greeks weighted by size
        df["delta_size"] = df["grk_delta"] * df["size"] * CONTRACT_MULTIPLIER
        df["gamma_size"] = df["grk_gamma"] * df["size"] * CONTRACT_MULTIPLIER
        df["vega_size"] = df["grk_vega"] * df["size"] * CONTRACT_MULTIPLIER
        df["theta_size"] = df["grk_theta"] * df["size"] * CONTRACT_MULTIPLIER
        
        # Greeks / notional ratios
        df["delta_notional_ratio"] = df["delta_size"] / (df["opt_trade_notional_value"] + 1e-8)
        df["gamma_notional_ratio"] = df["gamma_size"] / (df["opt_trade_notional_value"] + 1e-8)
        df["vega_notional_ratio"] = df["vega_size"] / (df["opt_trade_notional_value"] + 1e-8)
        df["theta_notional_ratio"] = df["theta_size"] / (df["opt_trade_notional_value"] + 1e-8)
        
        # Open interest change
        df["open_interest_change"] = df["open_interest_now"] - df["open_interest_yesterday"].fillna(df["open_interest_now"])
        
        # Save to output
        output_file = OUTPUT_FOLDER / f"{ticker}_perrowfeatures_{date_str}.parquet"
        df.to_parquet(output_file, index=False)
        print(f"  Saved: {output_file.name} ({len(df)} rows, {len(df.columns)} columns)")
    
    print()
    print("=" * 70)
    print("PROCESSING COMPLETE")
    print("=" * 70)


if __name__ == "__main__":
    main()