### Consolidate Yloader OHLCV Data

This notebook reads all individual ticker `.csv` files generated by the `Yloader` application, combines them into a single, analysis-ready DataFrame, and saves it in the efficient Parquet format.

**Workflow:**

1.  **Prerequisite:** Run `Yloader` to download OHLCV data. This should create multiple `.csv` files (e.g., `AAPL.csv`, `GOOG.csv`) in a dedicated directory.
2.  **Process & Combine:** The notebook scans the source directory, reads each CSV whose modification date matches that of the anchor ticker file (`SPY.csv` by default), and consolidates them into a single pandas DataFrame with a `(Ticker, Date)` MultiIndex.
3.  **Save:** The combined data is split into regular tickers and indices (tickers starting with `^`), and each set is saved as its own `.parquet` file for fast loading in subsequent analysis notebooks.
4.  **Verify:** The saved OHLCV Parquet file is read back to confirm that it loads correctly.
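
As a rough illustration of the `(Ticker, Date)` MultiIndex described in the workflow, the sketch below stacks two toy per-ticker frames with `pd.concat`. The frames and their values are made up for the example and only stand in for parsed Yloader CSVs.

```python
import pandas as pd

# Toy per-ticker frames standing in for two parsed Yloader CSVs (values are made up).
dates = pd.to_datetime(["2025-01-02", "2025-01-03"])
aapl = pd.DataFrame({"Adj Close": [10.0, 11.0], "Volume": [100, 110]}, index=dates)
goog = pd.DataFrame({"Adj Close": [20.0, 21.0], "Volume": [200, 210]}, index=dates)

# `keys` becomes the outer index level; `names` labels both index levels.
combined = pd.concat([aapl, goog], keys=["AAPL", "GOOG"], names=["Ticker", "Date"])

print(combined.index.names)
print(combined.loc["GOOG"])  # all rows for one ticker, indexed by Date
```

Keeping the ticker in the index rather than in a column is what makes per-ticker lookups such as `combined.loc["GOOG"]` possible.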

### Setup and Configuration

**This is the only cell you need to modify.** Adjust the paths and column definitions below to match your project structure.

In [1]:
import sys
import pandas as pd
from pathlib import Path
from typing import List, Optional

# --- Configuration ---

# 1. Set the directory where your Yloader CSV files are located.
#    This uses Path.home() to be portable across different computers and OS.
YLOADER_DATA_DIR = Path.home() / "Desktop" / "yloader"

# 2. Define the destination path for the final combined Parquet file.
#    It's good practice to save data outputs within the project's data folder.
#    Assuming a project structure where this notebook is in a `notebooks` subdir.
# ROOT_DIR = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
# DESTINATION_PARQUET_PATH = ROOT_DIR / "data" / "df_OHLCV_stocks_etfs.parquet"
NOTEBOOK_DIR = Path.cwd()
ROOT_DIR = NOTEBOOK_DIR.parent
SRC_DIR = ROOT_DIR / "src"
if str(ROOT_DIR) not in sys.path:
    sys.path.append(str(ROOT_DIR))
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))
DESTINATION_PARQUET_PATH_OHLCV = ROOT_DIR / "data" / "df_OHLCV_stocks_etfs.parquet"
DESTINATION_PARQUET_PATH_INDICES = ROOT_DIR / "data" / "df_indices.parquet"

# 3. Define the column names for the CSV files.
#    Assumes CSVs have no header and columns are in this fixed order.
#    The first column is always assumed to be 'Date'.
CANONICAL_COLUMN_NAMES = ["Adj Open", "Adj High", "Adj Low", "Adj Close", "Volume"]

# --- Verification ---
print(f"Project Root Directory set to: {ROOT_DIR}")
print(f"Reading Yloader CSVs from: {YLOADER_DATA_DIR}")
print(f"Ticker OHLCV output will be saved to: {DESTINATION_PARQUET_PATH_OHLCV}")
print(f"Indices Output will be saved to: {DESTINATION_PARQUET_PATH_INDICES}")

# Set pandas display options for better readability
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 1000)

Project Root Directory set to: c:\Users\ping\Files_win10\python\py311\stocks
Reading Yloader CSVs from: C:\Users\ping\Desktop\yloader
Ticker OHLCV output will be saved to: c:\Users\ping\Files_win10\python\py311\stocks\data\df_OHLCV_stocks_etfs.parquet
Indices Output will be saved to: c:\Users\ping\Files_win10\python\py311\stocks\data\df_indices.parquet
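
The column configuration above assumes headerless CSVs with the date in the first column. The following sketch shows how such a file would be parsed under that assumption; the in-memory sample and its ISO date format are hypothetical, so check them against an actual Yloader file before relying on this.

```python
import io

import pandas as pd

CANONICAL_COLUMN_NAMES = ["Adj Open", "Adj High", "Adj Low", "Adj Close", "Volume"]

# Hypothetical file content: no header row, date first, then the five data columns.
sample_csv = io.StringIO(
    "2025-12-24,47.91,48.19,47.72,47.95,262000\n"
    "2025-12-26,47.84,48.17,47.61,47.84,351200\n"
)

df = pd.read_csv(
    sample_csv,
    header=None,                              # the file has no header row
    names=["Date"] + CANONICAL_COLUMN_NAMES,  # supply the column names ourselves
    parse_dates=["Date"],
    index_col="Date",
)
print(df.dtypes)
```

If a real file has a different column order or an extra column, the names will be assigned to the wrong data without raising an error, so it is worth inspecting one file by hand first.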


### Step 1: Process and Combine CSV Data

This cell defines and executes the core logic to read all individual CSV files and merge them into a single, multi-indexed DataFrame.

In [2]:
from pathlib import Path
from datetime import datetime
from typing import List, Optional

import pandas as pd


def process_yloader_csvs(
    data_dir: Path,
    canonical_cols: List[str],
    anchor_ticker: str = "SPY",
) -> Optional[pd.DataFrame]:
    """
    Reads and combines CSV files from the Yloader output directory whose
    modification *date* (ignoring time-of-day) matches that of the specified
    anchor ticker file.

    Args:
        data_dir (Path):
            Directory containing the Yloader CSV files.
        canonical_cols (List[str]):
            Expected data column names, **excluding** 'Date'.
        anchor_ticker (str, optional):
            Ticker whose modification date is used as the reference.
            Defaults to "SPY".

    Returns:
        Optional[pd.DataFrame]:
            A multi-indexed DataFrame (Ticker, Date), sorted by Ticker
            ascending and Date descending, or None if no data could be
            processed.
    """
    if not data_dir.is_dir():
        print(f"Error: Directory not found: {data_dir}")
        return None

    # ------------------------------------------------------------------ #
    # 1. Determine the reference date (modification date of anchor file) #
    # ------------------------------------------------------------------ #
    anchor_file = data_dir / f"{anchor_ticker.upper()}.csv"
    if not anchor_file.exists():
        print(f"Anchor file not found: {anchor_file}")
        return None

    anchor_date = datetime.fromtimestamp(anchor_file.stat().st_mtime).date()
    print(f"\nProcess csv files with this date: {anchor_date}")

    # ---------------------------------------------------------- #
    # 2. Collect CSV files whose modification date matches anchor #
    # ---------------------------------------------------------- #
    csv_files = [
        f
        for f in data_dir.glob("*.csv")
        if datetime.fromtimestamp(f.stat().st_mtime).date() == anchor_date
    ]

    if not csv_files:
        print(f"No CSV files found with the same modification date as {anchor_ticker}.")
        return None

    print(
        f"Found {len(csv_files)} CSV files with modification date "
        f"{anchor_date} (anchor: {anchor_ticker})..."
    )

    # ---------------------------------------------------------- #
    # 3. Load and combine the matching files                     #
    # ---------------------------------------------------------- #
    all_dataframes = []
    tickers_list = []
    expected_csv_cols = ["Date"] + canonical_cols

    for file_path in csv_files:
        ticker = file_path.stem
        try:
            df_temp = pd.read_csv(
                file_path,
                header=None,
                names=expected_csv_cols,
                parse_dates=["Date"],
                index_col="Date",
            )

            if df_temp.empty:
                print(f"Warning: File {file_path.name} is empty. Skipping.")
                continue

            all_dataframes.append(df_temp)
            tickers_list.append(ticker)

        except Exception as e:
            print(f"Error processing {file_path.name}: {e}. Skipping.")

    if not all_dataframes:
        print("No data was successfully loaded. Aborting.")
        return None

    multi_index_df = pd.concat(
        all_dataframes, keys=tickers_list, names=["Ticker", "Date"]
    )
    multi_index_df.sort_index(
        level=["Ticker", "Date"], ascending=[True, False], inplace=True
    )

    print("\nCSV processing complete. DataFrame created successfully.")
    return multi_index_df
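
The modification-date filter in `process_yloader_csvs` can be tried in isolation. This is a minimal sketch on a temporary directory; the helper name `files_matching_anchor_date`, the file contents, and the timestamps are all made up for the demonstration.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path
from typing import List


def files_matching_anchor_date(data_dir: Path, anchor_ticker: str = "SPY") -> List[Path]:
    """Return CSVs whose modification date equals that of the anchor file (hypothetical helper)."""
    anchor_file = data_dir / f"{anchor_ticker.upper()}.csv"
    anchor_date = datetime.fromtimestamp(anchor_file.stat().st_mtime).date()
    return sorted(
        f
        for f in data_dir.glob("*.csv")
        if datetime.fromtimestamp(f.stat().st_mtime).date() == anchor_date
    )


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    fresh = datetime(2025, 1, 10, 12, 0).timestamp()
    stale = datetime(2025, 1, 7, 12, 0).timestamp()

    # SPY and AAPL share a modification date; OLD was "downloaded" three days earlier.
    for name, mtime in (("SPY", fresh), ("AAPL", fresh), ("OLD", stale)):
        path = tmp_dir / f"{name}.csv"
        path.write_text("2025-01-02,1.0,2.0,0.5,1.5,100\n")
        os.utime(path, (mtime, mtime))

    matched = [f.stem for f in files_matching_anchor_date(tmp_dir)]

print(matched)
```

Only the calendar date is compared, so files written at different times on the same day still match, while leftovers from an earlier download are skipped.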

In [3]:
# # --- Execute Step 1 ---
# final_df = process_yloader_csvs(YLOADER_DATA_DIR, CANONICAL_COLUMN_NAMES)

# if final_df is not None:
#     # chronological sort final_df
#     if not final_df.index.is_monotonic_increasing:
#         print(f'\nsorting final_df chronologically ...')
#         final_df.sort_index(inplace=True)
#     else:
#         print(f'\nfinal_df is sorted chronologically')

#     print("\n--- DataFrame Info ---")
#     final_df.info()

#     print("\n--- Last 5 Rows of Combined DataFrame ---")
#     display(final_df.tail())
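
The `is_monotonic_increasing` check matters because `process_yloader_csvs` returns dates newest-first within each ticker. The sketch below reproduces that ordering on toy data (values are made up) to show why the check fails and what a plain `sort_index()` does about it.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["AAPL", "GOOG"], pd.to_datetime(["2025-01-02", "2025-01-03"])],
    names=["Ticker", "Date"],
)
df = pd.DataFrame({"Adj Close": [10.0, 11.0, 20.0, 21.0]}, index=idx)

# Same ordering that process_yloader_csvs applies: tickers A-Z, dates newest first.
newest_first = df.sort_index(level=["Ticker", "Date"], ascending=[True, False])
needs_sort = not newest_first.index.is_monotonic_increasing

# A plain sort_index() orders both levels ascending: tickers A-Z, dates oldest first.
chronological = newest_first.sort_index()

print(needs_sort)
print(chronological.index.is_monotonic_increasing)
```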

### Step 2: Save DataFrame to Parquet File

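This step splits the combined DataFrame into regular tickers and indices (tickers starting with `^`), then saves each to its own Parquet file using the pyarrow engine with zstd compression. Parquet is a compressed columnar format, so it is compact on disk and fast to reload.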


In [4]:
import numpy as np
import pandas as pd
import gc
from pathlib import Path
from typing import Optional, Tuple

# ==========================================
# Helper Functions
# ==========================================


def separate_indices_from_tickers(
    df: pd.DataFrame,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Efficiently splits a MultiIndex DataFrame into two:
    1. Regular tickers
    2. Indices (tickers starting with '^')

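    The string test runs once per unique ticker label (O(N_unique)); the row
    mask is then built with an integer comparison on the MultiIndex codes
    (O(N_rows)), which avoids a string operation on every row.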
    """
    if df is None or df.empty:
        return pd.DataFrame(), pd.DataFrame()

    print("\nSeparating indices from regular tickers...")

    # 1. Access underlying structure
    # levels: unique strings, codes: integer map to those strings
    levels = df.index.levels[0]
    codes = df.index.codes[0]

    # 2. Identify unique index tickers (vectorized string op on small array)
    is_index_ticker = levels.str.startswith("^")

    # 3. Create boolean mask using integer codes (fast integer comparison)
    target_codes = np.where(is_index_ticker)[0]
    mask = np.isin(codes, target_codes)

    # 4. Split and Copy
    # We use copy() so the new DFs own their data, allowing garbage collection of the original
    df_indices = df.loc[mask].copy()
    df_OHLCV = df.loc[~mask].copy()

    # 5. Clean up metadata
    # Remove unused levels so df_OHLCV doesn't carry metadata about '^VIX', etc.
    df_indices.index = df_indices.index.remove_unused_levels()
    df_OHLCV.index = df_OHLCV.index.remove_unused_levels()

    print(
        f"Separation complete: {len(df_OHLCV):,} regular rows, {len(df_indices):,} index rows."
    )
    return df_OHLCV, df_indices


def save_dataframe_to_parquet(df: Optional[pd.DataFrame], dest_path: Path):
    """
    Saves a DataFrame to a Parquet file using pyarrow/zstd.
    """
    if df is None or df.empty:
        print(f"Skipping save: DataFrame for {dest_path.name} is empty or None.")
        return

    try:
        dest_path.parent.mkdir(parents=True, exist_ok=True)
        df.to_parquet(dest_path, engine="pyarrow", compression="zstd", index=True)
        print(f"Successfully saved to: {dest_path}")
    except Exception as e:
        print(f"Error saving {dest_path}: {e}")


# ==========================================
# Main Execution
# ==========================================

# 1. Load Data
final_df = process_yloader_csvs(YLOADER_DATA_DIR, CANONICAL_COLUMN_NAMES)

if final_df is not None and not final_df.empty:

    # 2. Sort Chronologically (Optimization: Sort once before splitting)
    if not final_df.index.is_monotonic_increasing:
        print("\nSorting combined DataFrame chronologically...")
        final_df.sort_index(inplace=True)
    else:
        print("\nCombined DataFrame is already sorted.")

    print("\n--- Combined DataFrame Info ---")
    final_df.info()

    # 3. Split DataFrames
    df_OHLCV, df_indices = separate_indices_from_tickers(final_df)

    # 4. Memory Cleanup
    # Critical for large datasets: delete the massive original DF and force GC
    print("\nReleasing memory from combined DataFrame...")
    del final_df
    gc.collect()

    # 5. Save Outputs
    save_dataframe_to_parquet(df_OHLCV, DESTINATION_PARQUET_PATH_OHLCV)
    save_dataframe_to_parquet(df_indices, DESTINATION_PARQUET_PATH_INDICES)

    # Display preview (using display if in Jupyter, else print)
    try:
        print("\n--- Preview: Regular Tickers (df_OHLCV) ---")
        display(df_OHLCV.tail(3))
        print("\n--- Preview: Indices (df_indices) ---")
        display(df_indices.tail(3))
    except NameError:
        print(df_OHLCV.tail(3))
        print(df_indices.tail(3))

else:
    print("No data loaded from process_yloader_csvs.")


Process csv files with this date: 2025-12-29
Found 1603 CSV files with modification date 2025-12-29 (anchor: SPY)...

CSV processing complete. DataFrame created successfully.

Sorting combined DataFrame chronologically...

--- Combined DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9620322 entries, ('A', Timestamp('1999-11-18 00:00:00')) to ('^VIX3M', Timestamp('2025-12-29 00:00:00'))
Data columns (total 5 columns):
 #   Column     Dtype  
---  ------     -----  
 0   Adj Open   float64
 1   Adj High   float64
 2   Adj Low    float64
 3   Adj Close  float64
 4   Volume     int64  
dtypes: float64(4), int64(1)
memory usage: 404.3+ MB

Separating indices from regular tickers...
Separation complete: 9,484,759 regular rows, 135,563 index rows.

Releasing memory from combined DataFrame...
Successfully saved to: c:\Users\ping\Files_win10\python\py311\stocks\data\df_OHLCV_stocks_etfs.parquet
Successfully saved to: c:\Users\ping\Files_win10\python\py311\stocks\data\df_in

| Ticker | Date | Adj Open | Adj High | Adj Low | Adj Close | Volume |
| --- | --- | --- | --- | --- | --- | --- |
| ZWS | 2025-12-24 | 47.91 | 48.19 | 47.72 | 47.95 | 262000 |
| ZWS | 2025-12-26 | 47.84 | 48.17 | 47.61 | 47.84 | 351200 |
| ZWS | 2025-12-29 | 47.94 | 48.1738 | 47.69 | 47.77 | 371555 |



--- Preview: Indices (df_indices) ---


| Ticker | Date | Adj Open | Adj High | Adj Low | Adj Close | Volume |
| --- | --- | --- | --- | --- | --- | --- |
| ^VIX3M | 2025-12-24 | 17.93 | 17.93 | 17.63 | 17.77 | 0 |
| ^VIX3M | 2025-12-26 | 17.93 | 18.05 | 17.71 | 17.77 | 0 |
| ^VIX3M | 2025-12-29 | 18.26 | 18.29 | 17.65 | 17.82 | 0 |
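
The levels/codes technique used by `separate_indices_from_tickers` is easier to follow on a tiny frame. This is a minimal sketch with made-up values: the string test touches only the unique ticker labels, and the per-row mask is built from integer codes.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["AAPL", "^VIX"], pd.to_datetime(["2025-01-02", "2025-01-03"])],
    names=["Ticker", "Date"],
)
df = pd.DataFrame({"Adj Close": [10.0, 11.0, 17.0, 18.0]}, index=idx)

levels = df.index.levels[0]  # unique ticker labels
codes = df.index.codes[0]    # per-row integer positions into `levels`

# One string test per unique ticker, not per row.
is_index_ticker = levels.str.startswith("^")

# One integer membership test per row.
mask = np.isin(codes, np.where(is_index_ticker)[0])

df_indices = df.loc[mask]
df_regular = df.loc[~mask]

print(mask)
```

With about 1,600 tickers and several million rows, the string comparison runs roughly 1,600 times instead of once per row, which is where the saving comes from.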


### Step 3: Verify the Saved File (Optional)

This final step reads the OHLCV Parquet file back into a new DataFrame to confirm that it loads and to spot-check its contents. The indices file is not checked here.

In [5]:
print(f"--- Verifying the saved file at: {DESTINATION_PARQUET_PATH_OHLCV} ---")

try:
    if DESTINATION_PARQUET_PATH_OHLCV.exists():
        verified_df = pd.read_parquet(DESTINATION_PARQUET_PATH_OHLCV)
        print("Verification successful! File read back into memory.")

        print("\n--- First 5 Rows of Verified DataFrame ---")
        display(verified_df.head())

        # Optional: Check a specific ticker
        if not verified_df.empty:
            example_ticker = verified_df.index.get_level_values("Ticker")[0]
            print(f"\n--- Data for first available ticker '{example_ticker}' ---")
            display(verified_df.loc[example_ticker].head())

    else:
        print("Error: The output file was not found at the specified path.")

except Exception as e:
    print(f"An error occurred during verification: {e}")

--- Verifying the saved file at: c:\Users\ping\Files_win10\python\py311\stocks\data\df_OHLCV_stocks_etfs.parquet ---
Verification successful! File read back into memory.

--- First 5 Rows of Verified DataFrame ---


| Ticker | Date | Adj Open | Adj High | Adj Low | Adj Close | Volume |
| --- | --- | --- | --- | --- | --- | --- |
| A | 1999-11-18 | 27.2452 | 29.9398 | 23.9518 | 26.347 | 74716417 |
| A | 1999-11-19 | 25.7108 | 25.7482 | 23.8396 | 24.1764 | 18198352 |
| A | 1999-11-22 | 24.7378 | 26.347 | 23.9893 | 26.347 | 7857766 |
| A | 1999-11-23 | 25.4488 | 26.1225 | 23.9518 | 23.9518 | 7138321 |
| A | 1999-11-24 | 24.0267 | 25.112 | 23.9518 | 24.5881 | 5785608 |



--- Data for first available ticker 'A' ---


| Date | Adj Open | Adj High | Adj Low | Adj Close | Volume |
| --- | --- | --- | --- | --- | --- |
| 1999-11-18 | 27.2452 | 29.9398 | 23.9518 | 26.347 | 74716417 |
| 1999-11-19 | 25.7108 | 25.7482 | 23.8396 | 24.1764 | 18198352 |
| 1999-11-22 | 24.7378 | 26.347 | 23.9893 | 26.347 | 7857766 |
| 1999-11-23 | 25.4488 | 26.1225 | 23.9518 | 23.9518 | 7138321 |
| 1999-11-24 | 24.0267 | 25.112 | 23.9518 | 24.5881 | 5785608 |


In [6]:
verified_df.loc["AAPL"]

| Date | Adj Open | Adj High | Adj Low | Adj Close | Volume |
| --- | --- | --- | --- | --- | --- |
| 1980-12-12 | 0.098390 | 0.098817 | 0.098390 | 0.098390 | 611849325 |
| 1980-12-15 | 0.093684 | 0.093684 | 0.093256 | 0.093256 | 229439776 |
| 1980-12-16 | 0.086839 | 0.086839 | 0.086412 | 0.086412 | 137921067 |
| 1980-12-17 | 0.088550 | 0.088978 | 0.088550 | 0.088550 | 112762114 |
| 1980-12-18 | 0.091118 | 0.091545 | 0.091118 | 0.091118 | 95814196 |
| ... | ... | ... | ... | ... | ... |
| 2025-12-22 | 272.860000 | 273.880000 | 270.510000 | 270.970000 | 36571800 |
| 2025-12-23 | 270.840000 | 272.500000 | 269.560000 | 272.360000 | 29642000 |
| 2025-12-24 | 272.340000 | 275.430000 | 272.200000 | 273.810000 | 17910600 |
| 2025-12-26 | 274.160000 | 275.370000 | 272.860000 | 273.400000 | 21455300 |
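
Beyond `verified_df.loc["AAPL"]`, a few other selections are common with a `(Ticker, Date)` index. The sketch below shows them on toy data (values are made up); the label slice relies on the index being sorted, which the notebook ensures before saving.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["AAPL", "GOOG"], pd.to_datetime(["2025-01-02", "2025-01-03", "2025-01-06"])],
    names=["Ticker", "Date"],
)
df = pd.DataFrame({"Adj Close": [10.0, 11.0, 12.0, 20.0, 21.0, 22.0]}, index=idx)

# All rows for one ticker; the Ticker level is dropped from the result.
aapl = df.loc["AAPL"]

# One ticker from a start date onward (label slicing needs a sorted index).
aapl_recent = df.loc[("AAPL", slice(pd.Timestamp("2025-01-03"), None)), :]

# Every ticker on a single date; the Date level is dropped from the result.
one_day = df.xs(pd.Timestamp("2025-01-03"), level="Date")

print(aapl_recent)
print(one_day)
```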

