# Feature Engineering for Stock Data

This notebook loads raw OHLCV data and creates technical indicators and target variables for machine learning models.

## Mathematical Context:
- **RSI (Relative Strength Index)**: Momentum oscillator (0-100) measuring speed/magnitude of price changes. RSI = 100 - (100 / (1 + RS)), where RS = avg gain / avg loss over period
- **EMA (Exponential Moving Average)**: Weighted moving average giving more weight to recent prices. EMA(t) = Price(t) * k + EMA(t-1) * (1 - k), where k = 2/(N+1), N = period
- **ATR (Average True Range)**: Volatility measure. ATR = MA(True Range), where True Range = max(High-Low, |High-PrevClose|, |Low-PrevClose|)
- **Target**: Binary classification label (1 if the close is higher 5 trading days later, 0 otherwise)
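
The formulas above can be checked by hand on small numbers. The following is a minimal sketch in plain Python, not the implementation used by the `ta` library: the helper names are hypothetical, the EMA is seeded with the first price (one common convention), and the RSI helper takes the average gain and loss as given rather than computing them from prices.

```python
def ema(prices, period):
    """EMA via the recursion EMA(t) = price(t) * k + EMA(t-1) * (1 - k)."""
    k = 2 / (period + 1)
    values = [prices[0]]  # assumption: seed the recursion with the first price
    for price in prices[1:]:
        values.append(price * k + values[-1] * (1 - k))
    return values


def true_range(high, low, prev_close):
    """True Range = max(High-Low, |High-PrevClose|, |Low-PrevClose|)."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))


def rsi_from_averages(avg_gain, avg_loss):
    """RSI = 100 - 100 / (1 + RS), with RS = avg_gain / avg_loss."""
    if avg_loss == 0:
        return 100.0  # no losses in the window: RSI saturates at 100
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)


ema_values = ema([10.0, 11.0, 12.0], period=3)        # k = 2 / (3 + 1) = 0.5
tr = true_range(high=12.0, low=10.0, prev_close=9.0)  # max(2, 3, 1)
rsi = rsi_from_averages(avg_gain=1.0, avg_loss=1.0)   # RS = 1

print(ema_values, tr, rsi)  # [10.0, 10.5, 11.25] 3.0 50.0
```

ATR is then a moving average of the True Range values; the exact smoothing scheme depends on the library.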


In [None]:
import pandas as pd
import ta
from pathlib import Path
from typing import Optional
import sys

# Add src to path for utilities
sys.path.append(str(Path.cwd().parent))

from src.utils.paths import get_project_root, resolve_data_path
from src.utils.config import load_config, get_config_value


In [None]:
class FeatureBuilder:
    """
    A class to build technical features from raw OHLCV stock data.
    
    This class handles loading raw data, computing technical indicators,
    creating target variables, and saving processed datasets.
    """
    
    def __init__(self, data_dir: str = "data/raw", output_dir: str = "data/processed") -> None:
        """
        Initialize the FeatureBuilder.
        
        Args:
            data_dir: Directory containing raw CSV files
            output_dir: Directory to save processed CSV files
        """
        # Use utility function for project root detection
        project_root = get_project_root()
        
        # Resolve paths relative to project root
        self.data_dir = (project_root / data_dir).resolve()
        self.output_dir = (project_root / output_dir).resolve()
        self.df: Optional[pd.DataFrame] = None
    
    def load_data(self, filepath: Optional[str] = None) -> pd.DataFrame:
        """
        Load raw OHLCV data from CSV file.
        
        If filepath is not provided, automatically finds the first SPY CSV file
        in the data directory.
        
        Args:
            filepath: Optional path to CSV file. If None, searches for SPY files.
        
        Returns:
            DataFrame with OHLCV data
        
        Raises:
            FileNotFoundError: If no CSV file is found
        """
        # Use utility function for data loading
        from src.utils.data_loader import load_raw_data
        
        if filepath:
            # If specific file provided, extract filename
            file_path = Path(filepath)
            if file_path.is_absolute():
                # If absolute path, use it directly
                self.df = pd.read_csv(file_path)
            else:
                # Relative path: pass only the file name to the loader utility
                # (any directory component of filepath is dropped)
                filename = file_path.name
                self.df = load_raw_data(filename=filename)
        else:
            # Auto-detect SPY CSV
            self.df = load_raw_data(pattern="SPY*.csv", symbol="SPY")
        
        print(f"✓ Loaded {len(self.df)} rows of data")
        print(f"  Date range: {self.df['time'].min()} to {self.df['time'].max()}")
        
        return self.df
    
    def add_technical_indicators(self) -> pd.DataFrame:
        """
        Add technical indicators to the dataset.
        
        Adds:
        - RSI (14 periods): Relative Strength Index
        - EMA (20 periods): Exponential Moving Average
        - EMA (50 periods): Exponential Moving Average
        - ATR: Average True Range
        
        Returns:
            DataFrame with technical indicators added
        """
        if self.df is None:
            raise ValueError("No data loaded. Call load_data() first.")
        
        # Verify required columns exist
        required_cols = ['open', 'high', 'low', 'close']
        missing_cols = [col for col in required_cols if col not in self.df.columns]
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")
        
        print("Adding technical indicators...")
        
        # Add RSI (14 periods)
        # RSI measures momentum on a scale of 0-100
        rsi_indicator = ta.momentum.RSIIndicator(close=self.df['close'], window=14)
        self.df['rsi_14'] = rsi_indicator.rsi()
        print("  ✓ Added RSI(14)")
        
        # Add EMA (20 periods)
        # EMA gives more weight to recent prices
        ema_20_indicator = ta.trend.EMAIndicator(close=self.df['close'], window=20)
        self.df['ema_20'] = ema_20_indicator.ema_indicator()
        print("  ✓ Added EMA(20)")
        
        # Add EMA (50 periods)
        ema_50_indicator = ta.trend.EMAIndicator(close=self.df['close'], window=50)
        self.df['ema_50'] = ema_50_indicator.ema_indicator()
        print("  ✓ Added EMA(50)")
        
        # Add ATR (Average True Range)
        # ATR measures volatility using high, low, and close prices
        atr_indicator = ta.volatility.AverageTrueRange(
            high=self.df['high'],
            low=self.df['low'],
            close=self.df['close'],
            window=14
        )
        self.df['atr'] = atr_indicator.average_true_range()
        print("  ✓ Added ATR(14)")
        
        return self.df
    
    def create_target(self, horizon_days: int = 5) -> pd.DataFrame:
        """
        Create target variable for binary classification.
        
        Target = 1 if the close price is higher after the horizon, 0 otherwise.
        This creates a forward-looking target by comparing the current close
        price with the close price N rows (trading days) in the future.
        
        Mathematical Context:
        - Target[i] = 1 if close[i+horizon_days] > close[i]
        - Target[i] = 0 if close[i+horizon_days] <= close[i]
        - The future close is aligned to row i with shift(-horizon_days). The
          target therefore contains future information by design and must be
          used only as a label, never as an input feature
        
        Args:
            horizon_days: Number of days ahead to look (default: 5)
        
        Returns:
            DataFrame with target column added
        """
        if self.df is None:
            raise ValueError("No data loaded. Call load_data() first.")
        
        if 'close' not in self.df.columns:
            raise ValueError("'close' column not found in data")
        
        print(f"Creating target variable (horizon: {horizon_days} days)...")
        
        # Shift the close series up by horizon_days rows so that row i holds
        # the close price horizon_days rows in the future
        future_close = self.df['close'].shift(-horizon_days)
        
        # Create binary target: 1 if price increases, 0 otherwise
        # We compare future price with current price.
        # Caveat: the last horizon_days rows have no future price (NaN), so the
        # comparison is False and they are labeled 0 rather than dropped later.
        self.df['target'] = (future_close > self.df['close']).astype(int)
        
        print("  ✓ Created target column")
        print(f"  Target distribution: {self.df['target'].value_counts().to_dict()}")
        
        return self.df
    
    def remove_nans(self) -> pd.DataFrame:
        """
        Remove rows with NaN values caused by technical indicator calculations.
        
        NaN values occur at the beginning of the dataset due to:
        - Moving average windows (EMA, RSI need N periods to calculate)
        - ATR calculations requiring previous periods
        
        Returns:
            DataFrame with NaN rows removed
        """
        if self.df is None:
            raise ValueError("No data loaded. Call load_data() first.")
        
        initial_rows = len(self.df)
        
        # Remove rows with any NaN values
        self.df = self.df.dropna().reset_index(drop=True)
        
        removed_rows = initial_rows - len(self.df)
        
        print(f"✓ Removed {removed_rows} rows with NaN values")
        print(f"  Remaining rows: {len(self.df)}")
        
        return self.df
    
    def save_processed_data(self, filename: str = "spy_featured.csv") -> bool:
        """
        Save processed dataset with features to CSV file.
        
        Args:
            filename: Name of output CSV file
        
        Returns:
            True if save successful, False otherwise
        """
        if self.df is None or self.df.empty:
            print("✗ Cannot save empty DataFrame")
            return False
        
        # Create output directory if it doesn't exist
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        filepath = self.output_dir / filename
        
        try:
            self.df.to_csv(filepath, index=False)
            print(f"✓ Saved processed data to: {filepath}")
            print(f"  Shape: {self.df.shape}")
            print(f"  Columns: {list(self.df.columns)}")
            return True
        except Exception as e:
            print(f"✗ Failed to save CSV: {e}")
            return False
    
    def build_features(
        self, 
        filepath: Optional[str] = None,
        horizon_days: int = 5,
        output_filename: str = "spy_featured.csv"
    ) -> pd.DataFrame:
        """
        Complete feature engineering pipeline.
        
        This method orchestrates the entire process:
        1. Load raw data
        2. Add technical indicators
        3. Create target variable
        4. Remove NaN values
        5. Save processed data
        
        Args:
            filepath: Optional path to input CSV file
            horizon_days: Days ahead for target calculation (default: 5)
            output_filename: Name of output CSV file
        
        Returns:
            Final processed DataFrame
        """
        # Load data
        self.load_data(filepath)
        
        # Add technical indicators
        self.add_technical_indicators()
        
        # Create target
        self.create_target(horizon_days=horizon_days)
        
        # Remove NaN values
        self.remove_nans()
        
        # Save processed data
        self.save_processed_data(output_filename)
        
        return self.df


In [3]:
# Initialize feature builder
builder = FeatureBuilder()

# Build features (complete pipeline)
# If auto-detect fails, you can specify the filepath directly:
# filepath = "data/raw/SPY_D1_20251228_215819.csv"

df = builder.build_features(
    filepath=None,  # Auto-detect SPY CSV in data/raw/
    horizon_days=5,
    output_filename="spy_featured.csv"  # Output filename
)

# Display summary
print("\n" + "="*50)
print("Feature Engineering Complete!")
print("="*50)
print(f"\nFinal dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")


✓ Found CSV files in: /Users/rakehsaleem/ai-trading-lab/data/raw
✓ Using CSV file: /Users/rakehsaleem/ai-trading-lab/data/raw/SPY_D1_20251228_215819.csv
✓ Loaded 5000 rows of data
  Date range: 2006-02-13 00:00:00-05:00 to 2025-12-26 00:00:00-05:00
Adding technical indicators...
  ✓ Added RSI(14)
  ✓ Added EMA(20)
  ✓ Added EMA(50)
  ✓ Added ATR(14)
Creating target variable (horizon: 5 days)...
  ✓ Created target column
  Target distribution: {1: 3004, 0: 1996}
✓ Removed 49 rows with NaN values
  Remaining rows: 4951
✓ Saved processed data to: /Users/rakehsaleem/ai-trading-lab/data/processed/spy_featured.csv
  Shape: (4951, 11)
  Columns: ['time', 'open', 'high', 'low', 'close', 'volume', 'rsi_14', 'ema_20', 'ema_50', 'atr', 'target']

Feature Engineering Complete!

Final dataset shape: (4951, 11)

Columns: ['time', 'open', 'high', 'low', 'close', 'volume', 'rsi_14', 'ema_20', 'ema_50', 'atr', 'target']
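
The log reports 49 removed rows (5000 - 49 = 4951), which is consistent with the longest look-back window in the pipeline, the 50-period EMA. The sketch below reproduces that count on synthetic data. It assumes an N-period EMA that emits no value until N observations exist, modeled here with pandas' `ewm(..., min_periods=N)`; it does not call the `ta` library, so treat it as an illustration rather than a trace of the actual run.

```python
import numpy as np
import pandas as pd

# Synthetic close prices; any sufficiently long series works here.
close = pd.Series(np.linspace(100.0, 110.0, 200))

# Assumption: the indicator stays NaN until a full window is available.
ema_20 = close.ewm(span=20, adjust=False, min_periods=20).mean()
ema_50 = close.ewm(span=50, adjust=False, min_periods=50).mean()

features = pd.DataFrame({"close": close, "ema_20": ema_20, "ema_50": ema_50})
dropped = len(features) - len(features.dropna())

# Warm-up NaNs per column, and rows lost to dropna()
print(ema_20.isna().sum(), ema_50.isna().sum(), dropped)  # 19 49 49
```

The number of rows lost to `dropna()` is set by the column with the longest warm-up, not by the sum over all columns.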




In [4]:
# Display first few rows
print("First 5 rows:")
display(df.head())

# Display last few rows
print("\nLast 5 rows:")
display(df.tail())


First 5 rows:


| | time | open | high | low | close | volume | rsi_14 | ema_20 | ema_50 | atr | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-04-25 00:00:00-04:00 | 90.818981 | 90.874427 | 90.042753 | 90.35463 | 84359800 | 54.184452 | 90.148349 | 89.540572 | 0.805781 | 1 |
| 1 | 2006-04-26 00:00:00-04:00 | 90.444723 | 90.888282 | 90.306112 | 90.375412 | 67262400 | 54.412988 | 90.169974 | 89.573311 | 0.789808 | 1 |
| 2 | 2006-04-27 00:00:00-04:00 | 90.028863 | 91.227869 | 89.814015 | 90.812027 | 124478600 | 59.036054 | 90.231122 | 89.621888 | 0.834383 | 1 |
| 3 | 2006-04-28 00:00:00-04:00 | 90.645701 | 91.311046 | 90.590265 | 91.116989 | 55854400 | 61.939386 | 90.31549 | 89.680519 | 0.826269 | 1 |
| 4 | 2006-05-01 00:00:00-04:00 | 91.116994 | 91.345707 | 90.319976 | 90.375412 | 64990300 | 52.242845 | 90.321197 | 89.70777 | 0.840516 | 1 |



Last 5 rows:


| | time | open | high | low | close | volume | rsi_14 | ema_20 | ema_50 | atr | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4946 | 2025-12-19 00:00:00-05:00 | 676.590027 | 681.090027 | 676.469971 | 680.590027 | 103599500 | 54.247804 | 677.136383 | 671.657565 | 7.488277 | 0 |
| 4947 | 2025-12-22 00:00:00-05:00 | 683.940002 | 685.359985 | 680.590027 | 684.830017 | 69556700 | 57.677623 | 677.86911 | 672.174132 | 7.294112 | 0 |
| 4948 | 2025-12-23 00:00:00-05:00 | 683.919983 | 688.200012 | 683.869995 | 687.960022 | 64840000 | 60.058043 | 678.830149 | 672.793186 | 7.082391 | 0 |
| 4949 | 2025-12-24 00:00:00-05:00 | 687.950012 | 690.830017 | 687.799988 | 690.380005 | 39445600 | 61.844893 | 679.930135 | 673.482866 | 6.792936 | 0 |
| 4950 | 2025-12-26 00:00:00-05:00 | 690.640015 | 691.659973 | 689.27002 | 690.309998 | 41588400 | 61.758818 | 680.918694 | 674.142753 | 6.478438 | 0 |
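
All five final rows carry target 0. With `horizon_days=5` these rows have no close price five rows ahead, so `shift(-5)` yields NaN there, and a comparison against NaN evaluates to False. The zeros are therefore placeholders for unknown outcomes, not observed declines. The sketch below shows the effect on a toy series and one possible way to keep such labels missing so that a later `dropna()` removes them. The variable names and the shortened horizon of 2 are illustrative assumptions, not part of the notebook's code.

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 13.0, 15.0])
horizon_days = 2  # shortened horizon so the effect is visible on 8 rows

future_close = close.shift(-horizon_days)

# Plain comparison: NaN > x is False, so the tail silently becomes 0.
naive_target = (future_close > close).astype(int)

# Masked variant: rows without a future price keep a missing label.
masked_target = (future_close > close).astype(float).where(future_close.notna())
labeled = masked_target.dropna().astype(int)

print(naive_target.tolist())        # [1, 0, 1, 1, 0, 1, 0, 0]
print(masked_target.isna().sum())   # 2
print(labeled.tolist())             # [1, 0, 1, 1, 0, 1]
```

Whether to drop the unlabeled tail or keep it for later inference depends on how the dataset is used downstream.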


In [5]:
# Display data summary
print("Data Summary:")
print(df.info())

# Display basic statistics
print("\nBasic Statistics:")
display(df.describe())

# Display target distribution
print("\nTarget Distribution:")
print(df['target'].value_counts())
print(f"\nTarget percentage: {df['target'].mean() * 100:.2f}% positive")


Data Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4951 entries, 0 to 4950
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    4951 non-null   object 
 1   open    4951 non-null   float64
 2   high    4951 non-null   float64
 3   low     4951 non-null   float64
 4   close   4951 non-null   float64
 5   volume  4951 non-null   int64  
 6   rsi_14  4951 non-null   float64
 7   ema_20  4951 non-null   float64
 8   ema_50  4951 non-null   float64
 9   atr     4951 non-null   float64
 10  target  4951 non-null   int64  
dtypes: float64(8), int64(2), object(1)
memory usage: 425.6+ KB
None

Basic Statistics:


| | open | high | low | close | volume | rsi_14 | ema_20 | ema_50 | atr | target |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 4951.0 | 4951.0 | 4951.0 | 4951.0 | 4951.0 | 4951.0 | 4951.0 | 4951.0 | 4951.0 | 4951.0 |
| mean | 234.65265 | 235.955154 | 233.227028 | 234.68818 | 127267200.0 | 55.679786 | 233.554566 | 231.795115 | 3.014513 | 0.601899 |
| std | 157.019857 | 157.784175 | 156.166705 | 157.050453 | 91462890.0 | 11.502877 | 155.832624 | 153.984905 | 2.473956 | 0.489556 |
| min | 49.827258 | 51.330511 | 49.203962 | 49.944588 | 20270000.0 | 16.802867 | 54.768383 | 58.305682 | 0.654841 | 0.0 |
| 25% | 102.503077 | 102.942542 | 101.944469 | 102.561901 | 67954100.0 | 47.402033 | 101.989362 | 101.439328 | 1.253129 | 0.0 |
| 50% | 176.149154 | 176.762742 | 175.397661 | 176.073715 | 96428000.0 | 57.043513 | 175.627294 | 174.740477 | 1.91411 | 1.0 |
| 75% | 350.475122 | 352.558414 | 345.951011 | 348.85408 | 156926400.0 | 64.370256 | 351.642444 | 343.364792 | 4.221719 | 1.0 |
| max | 690.640015 | 691.659973 | 689.27002 | 690.380005 | 871026300.0 | 87.191874 | 680.918694 | 674.142753 | 20.142962 | 1.0 |



Target Distribution:
target
1    2980
0    1971
Name: count, dtype: int64

Target percentage: 60.19% positive
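
Because the classes are not balanced, a model's accuracy is easier to read next to the score of a trivial classifier that always predicts the majority class. The arithmetic below uses only the two counts printed in the target distribution; the variable names are illustrative.

```python
# Class counts taken from the target distribution printed above
positives, negatives = 2980, 1971

total = positives + negatives
positive_rate = positives / total

# Accuracy of always predicting the more frequent class
majority_baseline = max(positives, negatives) / total

print(total)                       # 4951
print(f"{positive_rate:.2%}")      # 60.19%
print(f"{majority_baseline:.2%}")  # 60.19%
```

Here the positive class is the majority, so the baseline accuracy equals the positive rate.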
