# Step 2: Cache Setup

**Objective:**  
Create a script to hold session data and serve rows for streaming.

**Instructions:**
- Load a CSV, Parquet, or JSON file into a Python cache (list, dict, or DataFrame)
- Validate that rows can be iterated and streamed one at a time, and that the data is fresh (row order reflects a realistic event order)
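
As a rough illustration of the two instructions above, here is a minimal sketch of a list-backed cache with a read cursor. `ListCache` and its methods are hypothetical names chosen for this example; it is not the project's `CacheManager`, whose implementation is not shown here, and it uses only the standard library.

```python
import csv
import io


class ListCache:
    """Hypothetical list-backed cache with a read cursor (not the project's CacheManager)."""

    def __init__(self):
        self._rows = []
        self._pos = 0

    def load_csv(self, handle):
        # Read every row into memory as a dict keyed by the CSV header.
        # csv.DictReader keeps all values as strings.
        self._rows = list(csv.DictReader(handle))
        self._pos = 0
        return len(self._rows)

    def get_next_record(self):
        # Return the next row, or None once the cache is exhausted.
        if self._pos >= len(self._rows):
            return None
        record = self._rows[self._pos]
        self._pos += 1
        return record

    def reset(self):
        # Rewind so streaming can restart from the first row.
        self._pos = 0


# In-memory CSV standing in for a file on disk (made-up values).
sample = io.StringIO("DriverID,Speed\nAAA,0.0\nBBB,12.5\n")
cache = ListCache()
loaded = cache.load_csv(sample)

first = cache.get_next_record()
second = cache.get_next_record()
exhausted = cache.get_next_record()  # None: both rows were already served
```

Keeping the cursor inside the cache means a consumer only ever asks for "the next row", which is the access pattern a streaming step needs.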


In [1]:
# Import required libraries
import sys
import os
import pandas as pd

# Add project root to Python path
# Jupyter usually starts the kernel in the notebook's own directory,
# so if that directory is notebooks/, step up one level to reach the project root
current_dir = os.getcwd()
if os.path.basename(current_dir) == 'notebooks':
    # We're in notebooks/ directory, go up one level
    project_root = os.path.dirname(current_dir)
else:
    # We're already at project root
    project_root = current_dir

# Add to path if not already there
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from src.cache_manager import CacheManager
from src.utils import validate_data

print("✅ Imports successful")


✅ Imports successful


In [2]:
# Initialize cache manager
cache_manager = CacheManager()
print("✅ Cache manager initialized")


✅ Cache manager initialized


In [3]:
# Find latest data file
data_dir = "../data"
data_files = [f for f in os.listdir(data_dir) if f.endswith(('.csv', '.parquet'))]

if not data_files:
    print("❌ No data files found. Please run Step 1 first.")
else:
    # Use the most recently modified file
    # (getctime is the metadata-change time on Unix, not the creation time)
    latest_file = max(data_files, key=lambda f: os.path.getmtime(os.path.join(data_dir, f)))
    file_path = os.path.join(data_dir, latest_file)
    print(f"Using data file: {latest_file}")
    
    # Load data into cache
    format_type = 'parquet' if latest_file.endswith('.parquet') else 'csv'
    if cache_manager.load_cache(file_path, format=format_type):
        print(f"✅ Data loaded into cache: {cache_manager.get_num_records()} records")
        print(f"✅ Frequency: {cache_manager.get_frequency():.2f} Hz")
    else:
        print("❌ Failed to load data into cache")


INFO:src.cache_manager:Loading data from ../data/2023_Monaco_Race_20251119_182219.parquet (parquet format)...


Using data file: 2023_Monaco_Race_20251119_182219.parquet


INFO:src.cache_manager:Loaded 982330 records
INFO:src.utils:Calculated frequency: 999848759.83 Hz (avg interval: 0.00 ms)
INFO:src.cache_manager:Cache loaded successfully. Frequency: 999848759.83 Hz


✅ Data loaded into cache: 982330 records
✅ Frequency: 999848759.83 Hz
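
A rate close to 1 GHz, with a logged average interval of 0.00 ms, is implausibly high for roughly a million records and may point to a timestamp unit mismatch. The sketch below shows one way a sampling frequency can be derived from timestamps in seconds, and how the result inflates when the same values are misread as nanoseconds. `estimate_frequency_hz` is a hypothetical helper, not the project's `get_frequency`, and the sample values are made up.

```python
def estimate_frequency_hz(timestamps_s):
    """Estimate sampling frequency from ascending timestamps given in seconds."""
    if len(timestamps_s) < 2:
        return 0.0
    span = timestamps_s[-1] - timestamps_s[0]
    if span <= 0:
        return 0.0
    # n samples cover n - 1 intervals.
    return (len(timestamps_s) - 1) / span


# Made-up timestamps: five samples spaced 0.25 s apart.
samples_s = [0.0, 0.25, 0.5, 0.75, 1.0]
demo_hz = estimate_frequency_hz(samples_s)

# The same numbers treated as nanosecond counts shrink the span by 1e9,
# so the estimated frequency grows by the same factor.
misread_hz = estimate_frequency_hz([t * 1e-9 for t in samples_s])
```

Comparing the estimate against the expected order of magnitude for the data source is a cheap check to run right after loading.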


In [4]:
# Validate cache can iterate row-by-row
print("Testing row-by-row iteration...")
sample_records = []
for i in range(5):  # Test first 5 records
    record = cache_manager.get_next_record()
    if record:
        sample_records.append(record)
        print(f"Record {i+1}: Driver={record.get('DriverID', 'N/A')}, Speed={record.get('Speed', 'N/A')}")
    else:
        break

if sample_records:
    print(f"\n✅ Cache iteration successful: {len(sample_records)} records retrieved")
    # Reset to beginning
    cache_manager.reset()
else:
    print("❌ Cache iteration failed")


INFO:src.cache_manager:Cache reset to beginning


Testing row-by-row iteration...
Record 1: Driver=VER, Speed=0.0
Record 2: Driver=ALO, Speed=0.0
Record 3: Driver=LEC, Speed=0.0
Record 4: Driver=SAI, Speed=0.0
Record 5: Driver=HUL, Speed=0.0

✅ Cache iteration successful: 5 records retrieved
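
The manual loop in this cell can also be wrapped in a generator so that callers iterate with a plain `for`. This is a minimal sketch: `iter_records` is a hypothetical helper, and it assumes the cache signals exhaustion by returning `None`, which the source does not confirm.

```python
def iter_records(get_next, limit=None):
    """Yield records from a get_next() callable until it returns None or limit is reached."""
    count = 0
    while limit is None or count < limit:
        record = get_next()
        if record is None:
            break
        yield record
        count += 1


# Made-up records; a real cache would supply its own "next record" method.
_source = iter([{"DriverID": "AAA"}, {"DriverID": "BBB"}, {"DriverID": "CCC"}])


def get_next():
    # Return the next record, or None when the source is exhausted.
    return next(_source, None)


first_two = list(iter_records(get_next, limit=2))
remaining = list(iter_records(get_next))
```

Under that assumption, a call such as `iter_records(cache_manager.get_next_record, limit=5)` would serve the same purpose as the loop above.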


In [5]:
# Validate data order (timestamps should be increasing)
print("Validating data order (timestamps)...")
all_records = cache_manager.get_all_records()

if 'SessionTime' in all_records.columns:
    timestamps = all_records['SessionTime']

    # Convert only when the column is not already a timedelta or datetime dtype
    if not (pd.api.types.is_timedelta64_dtype(timestamps)
            or pd.api.types.is_datetime64_any_dtype(timestamps)):
        if timestamps.dtype == 'object' and timestamps.astype(str).str.contains('days', na=False).any():
            # Strings such as "0 days 01:02:03" are timedelta text
            timestamps = pd.to_timedelta(timestamps)
        else:
            # Try datetime first, then fall back to timedelta.
            # Note: pd.to_datetime reads purely numeric values as nanoseconds since the Unix epoch.
            try:
                timestamps = pd.to_datetime(timestamps)
            except (TypeError, ValueError):
                timestamps = pd.to_timedelta(timestamps)

    # Ordered means no negative step between consecutive rows
    is_ordered = (timestamps.diff().dropna() >= pd.Timedelta(0)).all()
    first_ts = timestamps.iloc[0]
    last_ts = timestamps.iloc[-1]
    duration = last_ts - first_ts
    
    if is_ordered:
        print("✅ Data is ordered chronologically (realistic event order)")
    else:
        print("⚠️ Warning: Some timestamps are out of order")
    
    print(f"First timestamp: {first_ts}")
    print(f"Last timestamp: {last_ts}")
    print(f"Duration: {duration}")
else:
    print("⚠️ Warning: SessionTime column not found, cannot validate order")

# Reset cache
cache_manager.reset()
print("\n✅ Step 2 Complete: Cache setup successful!")


Validating data order (timestamps)...


INFO:src.cache_manager:Cache reset to beginning


✅ Data is ordered chronologically (realistic event order)
First timestamp: 1970-01-01 00:00:00.000003722
Last timestamp: 1970-01-01 00:00:00.000010334
Duration: 0 days 00:00:00.000006612

✅ Step 2 Complete: Cache setup successful!
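
The timestamps printed above fall a few microseconds after 1970-01-01, which suggests that `SessionTime` was stored as a plain number and that `pd.to_datetime` interpreted it as nanoseconds since the Unix epoch. A minimal sketch of a stricter conversion follows, assuming the column holds elapsed seconds (an assumption; the source does not state the unit). `to_elapsed` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def to_elapsed(series, unit="s"):
    """Return a timedelta Series; numeric input is read in `unit` (assumed seconds)."""
    if pd.api.types.is_timedelta64_dtype(series):
        return series
    if pd.api.types.is_numeric_dtype(series):
        # State the unit explicitly instead of relying on the nanosecond default.
        return pd.to_timedelta(series, unit=unit)
    # Strings such as "0 days 00:00:01" parse without a unit.
    return pd.to_timedelta(series)


# Made-up values, not taken from the dataset.
raw = pd.Series([10.0, 10.5, 11.0])
elapsed = to_elapsed(raw)

# Same ordering check as in the cell above: no negative step between rows.
is_ordered = bool((elapsed.diff().dropna() >= pd.Timedelta(0)).all())
duration = elapsed.iloc[-1] - elapsed.iloc[0]
```

If the unit assumption holds, the reported first and last timestamps and the duration would then be expressed as elapsed session time rather than as epoch dates.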
