### OHLCV Data Cleaning Pipeline

This notebook cleans the consolidated OHLCV data by ensuring data integrity and temporal alignment across all tickers.

**Workflow:**

1.  **Load Data:** The raw, consolidated OHLCV data is loaded.
2.  **Trim Data:** The data is trimmed to a recent time window (e.g., the last 250 trading days).
3.  **Clean & Filter:** The data undergoes a multi-step cleaning process:
    *   **Date Alignment:** Tickers whose date index does not match that of the reference symbol (`VOO`) are removed.
    *   **Completeness Check:** Tickers with any `NaN` values or an incomplete history are removed.
    *   **Spike Removal:** Tickers with a single-day price change above `MAX_DAILY_CHANGE_THRESHOLD` (50% as configured below) are removed.
4.  **Save Data:** The final, clean DataFrame is saved.
5.  **Summarize:** A report details the number of tickers removed at each stage.

### Setup and Configuration

**This is the only cell you need to modify.**

In [1]:
import sys
from pathlib import Path
import pandas as pd

# --- Project Path Setup ---
NOTEBOOK_DIR = Path.cwd()
ROOT_DIR = NOTEBOOK_DIR.parent
DATA_DIR = ROOT_DIR / 'data'
SRC_DIR = ROOT_DIR / 'src'
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))
import utils

# --- File Configuration ---
SOURCE_FILENAME = 'df_OHLCV_stocks_etfs.parquet'
DEST_FILENAME = 'df_OHLCV_clean_stocks_etfs.parquet'
SOURCE_PATH = DATA_DIR / SOURCE_FILENAME
DEST_PATH = DATA_DIR / DEST_FILENAME

# --- Cleaning Parameters ---
DAYS_TO_KEEP = 250
REFERENCE_SYMBOL = 'VOO'
MAX_DAILY_CHANGE_THRESHOLD = 0.50  # Maximum allowed single-day price change (0.50 = 50%)

# --- Notebook Setup ---
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 2000)
%load_ext autoreload
%autoreload 2

# --- Verification ---
print(f"Source file: {SOURCE_PATH}")
print(f"Destination file: {DEST_PATH}")
print(f"Reference symbol: '{REFERENCE_SYMBOL}'")
assert SOURCE_PATH.exists(), f"Source file not found at {SOURCE_PATH}"


Source file: c:\Users\ping\Files_win10\python\py311\stocks\data\df_OHLCV_stocks_etfs.parquet
Destination file: c:\Users\ping\Files_win10\python\py311\stocks\data\df_OHLCV_clean_stocks_etfs.parquet
Reference symbol: 'VOO'
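
To make `MAX_DAILY_CHANGE_THRESHOLD` concrete, here is a minimal sketch of one way a spike check could work. It is not the actual `utils.filter_symbols_with_extreme_changes` implementation, which is not shown in this notebook. The function name `find_spiky_tickers`, the `Adj Close` column, and the toy prices are all assumptions made for illustration.

```python
import pandas as pd

def find_spiky_tickers(df, threshold, price_col="Adj Close"):
    """Return tickers whose largest absolute day-over-day change exceeds `threshold`.

    Hypothetical helper; `price_col` is an assumed column name.
    """
    prices = df[price_col].sort_index()
    # Compute returns per ticker so one ticker's first day is never
    # compared against another ticker's last day.
    daily_change = prices.groupby(level="Ticker").pct_change()
    max_abs_change = daily_change.abs().groupby(level="Ticker").max()
    return sorted(max_abs_change[max_abs_change > threshold].index)

# Toy data: BBB doubles from 10.5 to 21.0 in one day (a 100% change).
dates = pd.bdate_range("2024-01-01", periods=4)
index = pd.MultiIndex.from_product([["AAA", "BBB"], dates], names=["Ticker", "Date"])
toy = pd.DataFrame(
    {"Adj Close": [100.0, 101.0, 102.0, 103.0, 10.0, 10.5, 21.0, 20.0]},
    index=index,
)

spiky = find_spiky_tickers(toy, threshold=0.50)
print(spiky)
```

With a threshold of 0.50, only `BBB` is flagged in this toy data; `AAA` moves about 1% per day and is kept.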


### Step 1: Load Raw OHLCV Data
Load the consolidated data and validate that the reference symbol exists.


In [2]:
print(f"--- Step 1: Loading raw data from {SOURCE_PATH.name} ---")
df_raw = pd.read_parquet(SOURCE_PATH)

if REFERENCE_SYMBOL not in df_raw.index.get_level_values('Ticker'):
    raise ValueError(f"Reference symbol '{REFERENCE_SYMBOL}' not found. Halting.")

initial_tickers = set(df_raw.index.get_level_values('Ticker').unique())
print(f"Successfully loaded data with {len(initial_tickers)} unique tickers.")
df_raw.info()

--- Step 1: Loading raw data from df_OHLCV_stocks_etfs.parquet ---


ValueError: Reference symbol 'VOO' not found. Halting.
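
The recorded run halts here, and the output alone does not say why `VOO` is absent from the source file. One way to investigate is sketched below. The helper `diagnose_missing_symbol` is hypothetical (it is not part of `utils`), and the toy frame with a padded, lower-case ticker only illustrates one possible cause, not the confirmed one.

```python
import pandas as pd

def diagnose_missing_symbol(df, symbol, level="Ticker"):
    """Collect hints about why `symbol` is absent from an index level.

    Hypothetical diagnostic helper, not part of the project's utils module.
    """
    report = {"index_names": list(df.index.names)}
    if level not in df.index.names:
        report["problem"] = f"index has no level named {level!r}"
        return report
    tickers = pd.Index(df.index.get_level_values(level).unique()).astype(str)
    # Compare after stripping whitespace and upper-casing to catch near misses.
    normalized = tickers.str.strip().str.upper()
    report["ticker_count"] = len(tickers)
    report["near_matches"] = sorted(tickers[normalized == symbol.strip().upper()])
    return report

# Toy frame where the reference ticker was stored as " voo" (illustrative only).
dates = pd.bdate_range("2024-01-01", periods=2)
index = pd.MultiIndex.from_product([[" voo", "SPY"], dates], names=["Ticker", "Date"])
toy = pd.DataFrame({"Close": [1.0, 2.0, 3.0, 4.0]}, index=index)

report = diagnose_missing_symbol(toy, "VOO")
print(report)
```

If `near_matches` is empty and the level name is correct, the symbol is most likely missing from the upstream download rather than misformatted.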

### Step 2: Trim Data to Recent Period
Reduce the dataset to a manageable, recent time window using `utils.trim_dataframe_to_recent_days`.

In [None]:
df_trimmed = utils.trim_dataframe_to_recent_days(
    df=df_raw,
    days_to_keep=DAYS_TO_KEEP
)
trimmed_tickers = set(df_trimmed.index.get_level_values('Ticker').unique())
display(df_trimmed.head(3))
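
The body of `utils.trim_dataframe_to_recent_days` is not shown in this notebook. As a rough sketch of the idea, the hypothetical function below keeps the last N distinct dates found anywhere in the `Date` index level. Treating "recent trading days" as the most recent unique dates across all tickers is an assumption.

```python
import pandas as pd

def trim_to_recent_days(df, days_to_keep):
    """Keep rows whose date is among the `days_to_keep` most recent unique dates.

    Hypothetical stand-in for the project's utility; assumes a 'Date' index level.
    """
    dates = df.index.get_level_values("Date")
    recent_dates = dates.unique().sort_values()[-days_to_keep:]
    return df[dates.isin(recent_dates)]

# Toy data: two tickers over five business days.
all_dates = pd.bdate_range("2024-01-01", periods=5)
index = pd.MultiIndex.from_product([["AAA", "BBB"], all_dates], names=["Ticker", "Date"])
toy = pd.DataFrame({"Close": range(10)}, index=index)

trimmed = trim_to_recent_days(toy, days_to_keep=3)
print(trimmed)
```

Selecting by unique dates rather than by row count keeps the window identical for every ticker, which matters for the date-alignment step that follows.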


### Step 3: Clean and Filter Data
Apply a sequence of cleaning functions to ensure data quality.


In [None]:
# --- Part A: Align dates to the reference symbol ---
print("\n--- Part A: Aligning dates to reference symbol ---")
df_aligned, removed_by_date = utils.filter_df_dates_to_reference_symbol(
    df=df_trimmed,
    reference_symbol=REFERENCE_SYMBOL
)
df_aligned.index.names = ['Ticker', 'Date'] # Restore index names

# --- Part B: Remove symbols with missing values (NaNs) or incomplete history ---
print("\n--- Part B: Removing symbols with missing values or incomplete data ---")
df_complete, removed_by_nan = utils.filter_symbols_with_missing_values(
    df=df_aligned
)
df_complete.index.names = ['Ticker', 'Date'] # Restore index names

# --- Part C: Filter out tickers with extreme single-day price changes ---
print("\n--- Part C: Removing symbols with extreme price changes ---")
df_clean, removed_by_spike = utils.filter_symbols_with_extreme_changes(
    df=df_complete,
    threshold=MAX_DAILY_CHANGE_THRESHOLD
)
df_clean.index.names = ['Ticker', 'Date'] # Restore index names

final_tickers = set(df_clean.index.get_level_values('Ticker').unique())
print(f"\nCleaning complete. Final ticker count: {len(final_tickers)}")
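
The filter functions called in Parts A and B live in `utils` and are not reproduced here. The sketch below shows one plausible reading of what they do, returning a `(DataFrame, removed_tickers)` pair like the calls above. The function names, the exact-match rule for dates, and the toy data are assumptions, not the project's actual code.

```python
import pandas as pd

def keep_tickers_matching_reference(df, reference_symbol):
    """Drop tickers whose set of dates differs from the reference symbol's."""
    ticker_level = df.index.get_level_values("Ticker")
    date_level = df.index.get_level_values("Date")
    ref_dates = date_level[ticker_level == reference_symbol].sort_values()
    removed = []
    for ticker, group in df.groupby(level="Ticker"):
        ticker_dates = group.index.get_level_values("Date").sort_values()
        if not ticker_dates.equals(ref_dates):
            removed.append(ticker)
    return df[~ticker_level.isin(removed)], removed

def drop_tickers_with_nans(df):
    """Drop every ticker that has at least one NaN in any column."""
    has_nan = df.isna().any(axis=1).groupby(level="Ticker").any()
    removed = sorted(has_nan[has_nan].index)
    return df[~df.index.get_level_values("Ticker").isin(removed)], removed

def make_ticker(ticker, dates, closes):
    """Build a one-ticker toy frame indexed by (Ticker, Date)."""
    idx = pd.MultiIndex.from_product([[ticker], dates], names=["Ticker", "Date"])
    return pd.DataFrame({"Close": closes}, index=idx)

dates = pd.bdate_range("2024-01-01", periods=3)
toy = pd.concat([
    make_ticker("REF", dates, [1.0, 2.0, 3.0]),
    make_ticker("SHORT", dates[1:], [5.0, 6.0]),            # missing first day
    make_ticker("GAPPY", dates, [7.0, float("nan"), 9.0]),  # contains a NaN
    make_ticker("GOOD", dates, [4.0, 4.1, 4.2]),
])

aligned, removed_by_date_toy = keep_tickers_matching_reference(toy, "REF")
complete, removed_by_nan_toy = drop_tickers_with_nans(aligned)
print(removed_by_date_toy, removed_by_nan_toy)
```

In this toy run, `SHORT` is removed by date alignment and `GAPPY` by the completeness check, leaving `REF` and `GOOD`.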

### Step 4: Save Cleaned Data
Save the fully cleaned DataFrame to a new Parquet file.

In [None]:
print("\n--- Step 4: Saving cleaned data ---")
if not df_clean.empty:
    DEST_PATH.parent.mkdir(parents=True, exist_ok=True)
    df_clean.to_parquet(DEST_PATH, engine='pyarrow', compression='zstd')
    print(f"Successfully saved cleaned data with {len(final_tickers)} tickers to: {DEST_PATH}")
else:
    print("Clean DataFrame is empty. Nothing to save.")

### Step 5: Final Summary
Provide a report on the number of tickers at each stage and list those that were removed.

In [None]:
# Summarize the funnel using the lists of removed tickers collected in Step 3.
print("\n--- Step 5: Cleaning Process Summary ---")

# Calculate counts at each stage
initial_count = len(initial_tickers)
trimmed_count = len(trimmed_tickers)
aligned_count = df_aligned.index.get_level_values('Ticker').nunique()
complete_count = df_complete.index.get_level_values('Ticker').nunique()
final_count = len(final_tickers)

# Print the funnel report
print("\n--- Ticker Count Funnel ---")
print(f"{'Initial raw count:':<35} {initial_count}")
print(f"{'After trimming to recent days:':<35} {trimmed_count}")
print(f"{'After date alignment:':<35} {aligned_count}")
print(f"{'After NaN/completeness check:':<35} {complete_count}")
print(f"{'After spike removal:':<35} {final_count}")
print("-" * 45)
print(f"{'Total tickers removed:':<35} {initial_count - final_count}")

# Print the lists of removed tickers
print("\n--- Details of Removed Tickers ---")
print(f"\n{len(removed_by_date)} symbols removed due to non-matching date index:")
print(sorted(removed_by_date))

print(f"\n{len(removed_by_nan)} symbols removed due to NaNs or incomplete history:")
print(sorted(removed_by_nan))

print(f"\n{len(removed_by_spike)} symbols removed due to extreme price spikes:")
print(sorted(removed_by_spike))
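
As an optional alternative to the printed funnel, the same counts could be collected into a small table that also shows how many tickers each stage dropped. This is only a sketch: `build_funnel` is a hypothetical helper, and the counts below are made-up placeholders, not results from this notebook.

```python
import pandas as pd

def build_funnel(stage_counts):
    """Turn an ordered mapping of stage label -> ticker count into a funnel table."""
    funnel = pd.DataFrame({"tickers": pd.Series(stage_counts, dtype="int64")})
    # Tickers dropped at a stage = previous count minus current count.
    funnel["removed_at_stage"] = -funnel["tickers"].diff().fillna(0).astype("int64")
    return funnel

# Placeholder counts for illustration only.
example_counts = {
    "Initial raw count": 10,
    "After trimming to recent days": 10,
    "After date alignment": 8,
    "After NaN/completeness check": 7,
    "After spike removal": 6,
}

funnel = build_funnel(example_counts)
print(funnel)
```

In the notebook itself, the dictionary would be filled from `initial_count`, `trimmed_count`, `aligned_count`, `complete_count`, and `final_count`.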