### Consolidate Yloader OHLCV Data

This notebook reads all individual ticker `.csv` files generated by the `Yloader` application, combines them into a single, analysis-ready DataFrame, and saves it in the efficient Parquet format.

**Workflow:**

1.  **Prerequisite:** Run `Yloader` to download OHLCV data. This should create multiple `.csv` files (e.g., `AAPL.csv`, `GOOG.csv`) in a dedicated directory.
2.  **Process & Combine:** The notebook scans the source directory, reads each CSV, and consolidates them into a single pandas DataFrame with a `(Ticker, Date)` MultiIndex.
3.  **Save:** The final DataFrame is saved as a single `.parquet` file for fast loading in subsequent analysis notebooks.
4.  **Verify:** The saved Parquet file is read back to confirm its integrity.

### Setup and Configuration

**This is the only cell you need to modify.** Adjust the paths and column definitions below to match your project structure.

In [1]:
import sys
import pandas as pd
from pathlib import Path
from typing import List, Optional

# --- Configuration ---

# 1. Set the directory where your Yloader CSV files are located.
#    This uses Path.home() to be portable across different computers and OS.
YLOADER_DATA_DIR = Path.home() / "Desktop" / "yloader"

# 2. Define the destination path for the final combined Parquet file.
#    Outputs go in the project's `data` folder. ROOT_DIR is the parent of the
#    current working directory, which assumes this notebook runs from a
#    subdirectory (e.g. `notebooks`) directly under the project root.
NOTEBOOK_DIR = Path.cwd()
ROOT_DIR = NOTEBOOK_DIR.parent
SRC_DIR = ROOT_DIR / 'src'

# Make the project root and `src` importable from this notebook.
if str(ROOT_DIR) not in sys.path:
    sys.path.append(str(ROOT_DIR))
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))

DESTINATION_PARQUET_PATH = ROOT_DIR / "data" / "df_OHLCV_stocks_etfs.parquet"

# 3. Define the column names for the CSV files.
#    Assumes CSVs have no header and columns are in this fixed order.
#    The first column is always assumed to be 'Date'.
CANONICAL_COLUMN_NAMES = ['Adj Open', 'Adj High', 'Adj Low', 'Adj Close', 'Volume']

# --- Verification ---
print(f"Project Root Directory set to: {ROOT_DIR}")
print(f"Reading Yloader CSVs from: {YLOADER_DATA_DIR}")
print(f"Output will be saved to: {DESTINATION_PARQUET_PATH}")

# Set pandas display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

Project Root Directory set to: c:\Users\ping\Files_win10\python\py311\stocks
Reading Yloader CSVs from: C:\Users\ping\Desktop\yloader
Output will be saved to: c:\Users\ping\Files_win10\python\py311\stocks\data\df_OHLCV_stocks_etfs.parquet
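
As a minimal sketch of what configuration item 3 implies, the snippet below reads a headerless CSV with the same `header`, `names`, `parse_dates`, and `index_col` arguments that Step 1 passes to `pd.read_csv`. The two sample rows are invented for illustration and are not real Yloader output; the ISO date format in particular is an assumption.

```python
import io

import pandas as pd

CANONICAL_COLUMN_NAMES = ["Adj Open", "Adj High", "Adj Low", "Adj Close", "Volume"]

# Hypothetical sample: no header row, Date first, then the five data columns.
sample_csv = io.StringIO(
    "2024-01-02,10.0,11.0,9.5,10.5,1000\n"
    "2024-01-03,10.5,12.0,10.0,11.5,1500\n"
)

sample_df = pd.read_csv(
    sample_csv,
    header=None,                              # the file has no header row
    names=["Date"] + CANONICAL_COLUMN_NAMES,  # so names are supplied here
    parse_dates=["Date"],
    index_col="Date",
)

print(sample_df.dtypes)
print(sample_df.index)
```

If the column order in your files differs, the names are still assigned by position, so a mismatch would mislabel columns silently rather than raise an error.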


### Step 1: Process and Combine CSV Data

This cell defines and executes the core logic to read all individual CSV files and merge them into a single, multi-indexed DataFrame.

In [2]:
from pathlib import Path
from datetime import datetime
from typing import List, Optional

import pandas as pd


def process_yloader_csvs(
    data_dir: Path,
    canonical_cols: List[str],
    anchor_ticker: str = "SPY",
) -> Optional[pd.DataFrame]:
    """
    Reads and combines CSV files from the Yloader output directory whose
    modification *date* (ignoring time-of-day) matches that of the specified
    anchor ticker file.

    Args:
        data_dir (Path):
            Directory containing the Yloader CSV files.
        canonical_cols (List[str]):
            Expected data column names, **excluding** 'Date'.
        anchor_ticker (str, optional):
            Ticker whose modification date is used as the reference.
            Defaults to "SPY".

    Returns:
        Optional[pd.DataFrame]:
            A multi-indexed DataFrame (Ticker, Date), sorted by Ticker
            ascending and Date descending, or None if no data could be
            processed.
    """
    if not data_dir.is_dir():
        print(f"Error: Directory not found: {data_dir}")
        return None

    # ------------------------------------------------------------------ #
    # 1. Determine the reference date (modification date of anchor file) #
    # ------------------------------------------------------------------ #
    anchor_file = data_dir / f"{anchor_ticker.upper()}.csv"
    if not anchor_file.exists():
        print(f"Anchor file not found: {anchor_file}")
        return None

    anchor_date = datetime.fromtimestamp(anchor_file.stat().st_mtime).date()
    print(f'\nProcess csv files with this date: {anchor_date}')

    # ---------------------------------------------------------- #
    # 2. Collect CSV files whose modification date matches anchor #
    # ---------------------------------------------------------- #
    csv_files = [
        f
        for f in data_dir.glob("*.csv")
        if datetime.fromtimestamp(f.stat().st_mtime).date() == anchor_date
    ]

    if not csv_files:
        print(f"No CSV files found with the same modification date as {anchor_ticker}.")
        return None

    print(
        f"Found {len(csv_files)} CSV files with modification date "
        f"{anchor_date} (anchor: {anchor_ticker})..."
    )

    # ---------------------------------------------------------- #
    # 3. Load and combine the matching files                     #
    # ---------------------------------------------------------- #
    all_dataframes = []
    tickers_list = []
    expected_csv_cols = ["Date"] + canonical_cols

    for file_path in csv_files:
        ticker = file_path.stem
        try:
            df_temp = pd.read_csv(
                file_path,
                header=None,
                names=expected_csv_cols,
                parse_dates=["Date"],
                index_col="Date",
            )

            if df_temp.empty:
                print(f"Warning: File {file_path.name} is empty. Skipping.")
                continue

            all_dataframes.append(df_temp)
            tickers_list.append(ticker)

        except Exception as e:
            print(f"Error processing {file_path.name}: {e}. Skipping.")

    if not all_dataframes:
        print("No data was successfully loaded. Aborting.")
        return None

    multi_index_df = pd.concat(
        all_dataframes, keys=tickers_list, names=["Ticker", "Date"]
    )
    multi_index_df.sort_index(level=["Ticker", "Date"], ascending=[True, False], inplace=True)

    print("\nCSV processing complete. DataFrame created successfully.")
    return multi_index_df
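
One consequence of the anchor-date filter is that CSVs left over from an earlier Yloader run are skipped without a warning. The following sketch isolates that filter so its behavior is easy to see. The helper name `files_modified_on_anchor_date` and the placeholder files are hypothetical; modification times are set explicitly with `os.utime` so the result does not depend on when the snippet runs.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path
from typing import List


def files_modified_on_anchor_date(data_dir: Path, anchor_name: str) -> List[Path]:
    """Return CSVs in data_dir whose mtime calendar date equals the anchor file's."""
    anchor_date = datetime.fromtimestamp((data_dir / anchor_name).stat().st_mtime).date()
    return sorted(
        f
        for f in data_dir.glob("*.csv")
        if datetime.fromtimestamp(f.stat().st_mtime).date() == anchor_date
    )


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    noon_today = datetime.now().replace(hour=12, minute=0, second=0, microsecond=0).timestamp()
    ten_days_earlier = noon_today - 10 * 24 * 3600

    # SPY.csv and AAA.csv share today's date; OLD.csv is backdated by ten days.
    for name, mtime in (
        ("SPY.csv", noon_today),
        ("AAA.csv", noon_today),
        ("OLD.csv", ten_days_earlier),
    ):
        file_path = tmp_dir / name
        file_path.write_text("placeholder\n")
        os.utime(file_path, (mtime, mtime))

    matched_names = [f.name for f in files_modified_on_anchor_date(tmp_dir, "SPY.csv")]

print(matched_names)
```

Only the time of day is ignored; a file refreshed just after midnight would land on a different calendar date than the anchor and be excluded.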

In [3]:
# --- Execute Step 1 ---
final_df = process_yloader_csvs(YLOADER_DATA_DIR, CANONICAL_COLUMN_NAMES)

if final_df is not None:
    # Re-sort so dates run oldest-to-newest within each ticker
    if not final_df.index.is_monotonic_increasing:
        print('\nsorting final_df chronologically ...')
        final_df.sort_index(inplace=True)
    else:
        print('\nfinal_df is sorted chronologically')

    print("\n--- DataFrame Info ---")
    final_df.info()
    
    print("\n--- Last 5 Rows of Combined DataFrame ---")
    display(final_df.tail())


Process csv files with this date: 2025-10-06
Found 885 CSV files with modification date 2025-10-06 (anchor: SPY)...

CSV processing complete. DataFrame created successfully.

sorting final_df chronologically ...

--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4036635 entries, ('A', Timestamp('1999-11-18 00:00:00')) to ('ZWS', Timestamp('2025-10-06 00:00:00'))
Data columns (total 5 columns):
 #   Column     Dtype  
---  ------     -----  
 0   Adj Open   float64
 1   Adj High   float64
 2   Adj Low    float64
 3   Adj Close  float64
 4   Volume     int64  
dtypes: float64(4), int64(1)
memory usage: 170.0+ MB

--- Last 5 Rows of Combined DataFrame ---

| Ticker | Date | Adj Open | Adj High | Adj Low | Adj Close | Volume |
| --- | --- | --- | --- | --- | --- | --- |
| ZWS | 2025-09-30 | 46.9 | 47.4 | 46.77 | 47.03 | 609500 |
| ZWS | 2025-10-01 | 46.72 | 47.04 | 46.4 | 46.8 | 599400 |
| ZWS | 2025-10-02 | 46.86 | 47.17 | 46.54 | 46.91 | 839900 |
| ZWS | 2025-10-03 | 46.87 | 47.37 | 46.62 | 46.84 | 780500 |
| ZWS | 2025-10-06 | 47.16 | 47.37 | 46.67 | 47.29 | 661018 |
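
To make the MultiIndex construction concrete, here is a small sketch with two made-up tickers. It shows how `pd.concat` with `keys` and `names` produces the `(Ticker, Date)` index, and why a frame ordered the way `process_yloader_csvs` returns it (dates descending within each ticker) fails the `is_monotonic_increasing` check until it is re-sorted. The tickers and prices are invented.

```python
import pandas as pd

dates = pd.DatetimeIndex(["2024-01-02", "2024-01-03"], name="Date")

# Hypothetical per-ticker frames, deliberately listed out of alphabetical order.
tickers = ["BBB", "AAA"]
frames = [
    pd.DataFrame({"Adj Close": [20.0, 21.0]}, index=dates),
    pd.DataFrame({"Adj Close": [10.0, 11.0]}, index=dates),
]

# `keys` becomes the outer index level; `names` labels both levels.
combined = pd.concat(frames, keys=tickers, names=["Ticker", "Date"])

# Same ordering the function applies: Ticker ascending, Date descending.
combined = combined.sort_index(level=["Ticker", "Date"], ascending=[True, False])
print(combined.index.is_monotonic_increasing)

# A plain sort_index() orders both levels ascending, i.e. chronologically per ticker.
chronological = combined.sort_index()
print(chronological.index.is_monotonic_increasing)
print(chronological.loc["AAA"])
```

The first check prints `False` and the second prints `True`, which matches the "sorting final_df chronologically" message in the output above.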


### Step 2: Save DataFrame to Parquet File

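This step saves the combined DataFrame to a Parquet file. Parquet is a compressed, columnar binary format that stores column dtypes and the `(Ticker, Date)` index alongside the data, so later notebooks can load the file without re-parsing CSV text.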


In [4]:
def save_dataframe_to_parquet(df: Optional[pd.DataFrame], dest_path: Path):
    """
    Saves a DataFrame to a Parquet file if it's not empty.

    Args:
        df (Optional[pd.DataFrame]): The DataFrame to save.
        dest_path (Path): The destination file path.
    """
    if df is None or df.empty:
        print("DataFrame is empty or None. Nothing to save.")
        return

    try:
        # Ensure the destination directory exists
        dest_path.parent.mkdir(parents=True, exist_ok=True)
        
        # Save the file using the efficient 'pyarrow' engine and 'zstd' compression
        df.to_parquet(dest_path, engine='pyarrow', compression='zstd', index=True)
        print(f"\nSuccessfully saved DataFrame to: {dest_path}")
        
    except Exception as e:
        print(f"\nError: Failed to save Parquet file. Details: {e}")

# --- Execute Step 2 ---
save_dataframe_to_parquet(final_df, DESTINATION_PARQUET_PATH)


Successfully saved DataFrame to: c:\Users\ping\Files_win10\python\py311\stocks\data\df_OHLCV_stocks_etfs.parquet


### Step 3: Verify the Saved File (Optional)

This final step reads the Parquet file back into a new DataFrame to confirm that the file is readable and to spot-check its contents.

In [5]:
print(f"--- Verifying the saved file at: {DESTINATION_PARQUET_PATH} ---")

try:
    if DESTINATION_PARQUET_PATH.exists():
        verified_df = pd.read_parquet(DESTINATION_PARQUET_PATH)
        print("Verification successful! File read back into memory.")
        
        print("\n--- First 5 Rows of Verified DataFrame ---")
        display(verified_df.head())
        
        # Optional: Check a specific ticker
        if not verified_df.empty:
            example_ticker = verified_df.index.get_level_values('Ticker')[0]
            print(f"\n--- Data for first available ticker '{example_ticker}' ---")
            display(verified_df.loc[example_ticker].head())

    else:
        print("Error: The output file was not found at the specified path.")
        
except Exception as e:
    print(f"An error occurred during verification: {e}")

--- Verifying the saved file at: c:\Users\ping\Files_win10\python\py311\stocks\data\df_OHLCV_stocks_etfs.parquet ---
Verification successful! File read back into memory.

--- First 5 Rows of Verified DataFrame ---


| Ticker | Date | Adj Open | Adj High | Adj Low | Adj Close | Volume |
| --- | --- | --- | --- | --- | --- | --- |
| A | 1999-11-18 | 27.2452 | 29.9398 | 23.9518 | 26.347 | 74716422 |
| A | 1999-11-19 | 25.7108 | 25.7482 | 23.8396 | 24.1764 | 18198351 |
| A | 1999-11-22 | 24.7378 | 26.347 | 23.9893 | 26.347 | 7857767 |
| A | 1999-11-23 | 25.4488 | 26.1225 | 23.9518 | 23.9518 | 7138322 |
| A | 1999-11-24 | 24.0267 | 25.112 | 23.9518 | 24.5881 | 5785607 |



--- Data for first available ticker 'A' ---


| Date | Adj Open | Adj High | Adj Low | Adj Close | Volume |
| --- | --- | --- | --- | --- | --- |
| 1999-11-18 | 27.2452 | 29.9398 | 23.9518 | 26.347 | 74716422 |
| 1999-11-19 | 25.7108 | 25.7482 | 23.8396 | 24.1764 | 18198351 |
| 1999-11-22 | 24.7378 | 26.347 | 23.9893 | 26.347 | 7857767 |
| 1999-11-23 | 25.4488 | 26.1225 | 23.9518 | 23.9518 | 7138322 |
| 1999-11-24 | 24.0267 | 25.112 | 23.9518 | 24.5881 | 5785607 |
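
The verification cell confirms that the file can be read, but it does not compare the reloaded frame with the one that was saved. A stricter check, sketched here under the assumption that `final_df` is still in memory, could use `pd.testing.assert_frame_equal`. The helper name `frames_match` is hypothetical, and small in-memory frames stand in for `final_df` and `verified_df`.

```python
import pandas as pd


def frames_match(original: pd.DataFrame, reloaded: pd.DataFrame) -> bool:
    """Return True if index, columns, dtypes, and values are all equal."""
    try:
        pd.testing.assert_frame_equal(original, reloaded)
        return True
    except AssertionError as err:
        print(f"Mismatch: {err}")
        return False


# Stand-in data with the same (Ticker, Date) index layout as the notebook.
index = pd.MultiIndex.from_product(
    [["AAA", "BBB"], pd.to_datetime(["2024-01-02", "2024-01-03"])],
    names=["Ticker", "Date"],
)
saved = pd.DataFrame({"Adj Close": [10.0, 11.0, 20.0, 21.0]}, index=index)

intact_copy = saved.copy()
damaged_copy = saved.copy()
damaged_copy.iloc[0, 0] = 99.0  # simulate a value that changed on disk

print(frames_match(saved, intact_copy))
print(frames_match(saved, damaged_copy))
```

In this notebook the equivalent call would be `frames_match(final_df, verified_df)`.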


In [6]:
verified_df

| Ticker | Date | Adj Open | Adj High | Adj Low | Adj Close | Volume |
| --- | --- | --- | --- | --- | --- | --- |
| A | 1999-11-18 | 27.2452 | 29.9398 | 23.9518 | 26.3470 | 74716422 |
| A | 1999-11-19 | 25.7108 | 25.7482 | 23.8396 | 24.1764 | 18198351 |
| A | 1999-11-22 | 24.7378 | 26.3470 | 23.9893 | 26.3470 | 7857767 |
| A | 1999-11-23 | 25.4488 | 26.1225 | 23.9518 | 23.9518 | 7138322 |
| A | 1999-11-24 | 24.0267 | 25.1120 | 23.9518 | 24.5881 | 5785607 |
| ... | ... | ... | ... | ... | ... | ... |
| ZWS | 2025-09-30 | 46.9000 | 47.4000 | 46.7700 | 47.0300 | 609500 |
| ZWS | 2025-10-01 | 46.7200 | 47.0400 | 46.4000 | 46.8000 | 599400 |
| ZWS | 2025-10-02 | 46.8600 | 47.1700 | 46.5400 | 46.9100 | 839900 |
| ZWS | 2025-10-03 | 46.8700 | 47.3700 | 46.6200 | 46.8400 | 780500 |
