### Data Processing Orchestrator

This notebook orchestrates a data processing workflow by preparing a configuration file and running an external analysis script (`run_sequence.py`) for one or more dates.

**Workflow:**

1.  **Prerequisites:**
    *   The `Yloader` application has been run to download OHLCV data.
    *   A `finviz` data generation process has created `.parquet` files whose names start with `df_finviz_YYYY-MM-DD` (e.g., `df_finviz_2025-06-13_stocks_etfs.parquet`) in the `Downloads` directory.
2.  **Find Data:** The notebook scans the `Downloads` directory for recent Finviz data files.
3.  **Select Date(s):** It extracts all available dates from the filenames and selects a subset for processing based on user configuration (e.g., only the latest date).
4.  **Configure & Run:** For each selected date, it generates a `config.py` file and executes the `run_sequence.py` script.



### Setup and Configuration

**This is the only cell you need to modify.** Adjust the variables below to match your environment and desired processing scope.

In [2]:
import sys
import re
from pathlib import Path
import pandas as pd

# --- Project and Path Configuration ---

# Autodetect the project's root directory.
# Assumes this notebook is in `root/notebooks/` or the `root/` directory.
NOTEBOOK_DIR = Path.cwd()
ROOT_DIR = NOTEBOOK_DIR.parent if NOTEBOOK_DIR.name == 'notebooks' else NOTEBOOK_DIR

# Add the project's source directory to the Python path
SRC_DIR = ROOT_DIR / 'src'
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))
    
# Import the custom utility module now that the path is set
import utils

# --- Data File Configuration ---
DOWNLOADS_DIR = Path.home() / "Downloads"
DATA_FILE_PREFIX = 'df_finviz'  # Prefix for files like 'df_finviz_2024-01-15.parquet'
DATA_FILE_EXTENSION = 'parquet'
DATA_FILES_TO_SCAN = 100  # How many recent files to check for dates

# --- Analysis Run Configuration ---

# Define which dates to process using a slice.
# Examples:
#   slice(-1, None, None) -> Processes only the most recent date.
#   slice(None)           -> Processes ALL found dates.
#   slice(-5, None, None) -> Processes the 5 most recent dates.
#   slice(0, 5, None)     -> Processes the 5 oldest dates.
DATE_SLICE = slice(-2, None, None)

# --- config.py Generation Parameters ---
# These values will be written into the config.py file for each run.
DEST_DIR = ROOT_DIR / 'data' # Destination directory for processed data
ANNUAL_RISK_FREE_RATE = 0.04
TRADING_DAYS_YEAR = 252

# --- Verification ---
print(f"Project Root Directory: {ROOT_DIR}")
print(f"Source Directory: {SRC_DIR}")
print(f"Scanning for data files in: {DOWNLOADS_DIR}")
print(f"Date selection rule: {DATE_SLICE}")

# Set pandas display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Enable auto-reloading of external modules
%load_ext autoreload
%autoreload 2

Project Root Directory: c:\Users\ping\Files_win10\python\py311\stocks
Source Directory: c:\Users\ping\Files_win10\python\py311\stocks\src
Scanning for data files in: C:\Users\ping\Downloads
Date selection rule: slice(-2, None, None)
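
The `DATE_SLICE` rule is an ordinary Python `slice` applied to the chronologically sorted list of dates. The following is a minimal sketch of how each example rule behaves. The `sample_dates` list and the `select_dates` helper are hypothetical stand-ins and are not part of the notebook.

```python
# Hypothetical sample of sorted date strings (oldest first), standing in
# for the list the notebook builds from the data filenames.
sample_dates = [
    "2025-06-09",
    "2025-06-10",
    "2025-06-11",
    "2025-06-12",
    "2025-06-13",
]


def select_dates(dates: list[str], rule: slice) -> list[str]:
    """Apply a slice rule to a chronologically sorted list of dates."""
    return dates[rule]


latest_only = select_dates(sample_dates, slice(-1, None))  # ['2025-06-13']
latest_two = select_dates(sample_dates, slice(-2, None))   # ['2025-06-12', '2025-06-13']
oldest_two = select_dates(sample_dates, slice(0, 2))       # ['2025-06-09', '2025-06-10']
everything = select_dates(sample_dates, slice(None))       # all five dates

# A slice that asks for more items than exist is clamped rather than raising.
too_many = select_dates(sample_dates, slice(-10, None))    # all five dates
```

Because slicing never raises on out-of-range bounds, an overly wide rule returns every available date and an empty list of dates yields an empty selection.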


### Step 1: Find Recent Data Files

This step searches the configured directory for data files that match the specified prefix and extension.

In [3]:
# --- Execute Step 1 ---
print("--- Step 1: Finding recent data files ---")

# Use the utility function to get a list of recent filenames.
# NOTE: the existing `utils.py` helper takes a directory *name* rather than a
# full path, so pass `DOWNLOADS_DIR.name`. Check the `target_dir` line in the
# output to confirm it resolved to the intended folder.
found_files = utils.get_recent_files_in_directory(
    prefix=DATA_FILE_PREFIX,
    extension=DATA_FILE_EXTENSION,
    count=DATA_FILES_TO_SCAN,
    directory_name=DOWNLOADS_DIR.name,
)

if found_files:
    print(f"Found {len(found_files)} potential data file(s).")
    # Display the first 5 found files for brevity
    for i, filename in enumerate(found_files[:5]):
        print(f"  {i+1}. {filename}")
    if len(found_files) > 5:
        print("  ...")
else:
    print(f"No files matching '{DATA_FILE_PREFIX}*.{DATA_FILE_EXTENSION}' found in '{DOWNLOADS_DIR}'.")
    # Initialize as empty list to prevent errors in the next step
    found_files = []

--- Step 1: Finding recent data files ---
target_dir: C:\Users\ping\Downloads
Found 35 potential data file(s).
  1. df_finviz_2025-06-13_stocks_etfs.parquet
  2. df_finviz_2025-06-12_stocks_etfs.parquet
  3. df_finviz_2025-06-11_stocks_etfs.parquet
  4. df_finviz_2025-06-10_stocks_etfs.parquet
  5. df_finviz_2025-06-09_stocks_etfs.parquet
  ...
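
The source of `utils.get_recent_files_in_directory` is not shown in this notebook. As a rough sketch of what such a helper could look like, the hypothetical `list_recent_files` below globs for `prefix*.extension` and returns the newest filenames first. Ordering by modification time is an assumption here, and the real helper may sort differently or take other arguments.

```python
from pathlib import Path


def list_recent_files(
    directory: Path, prefix: str, extension: str, count: int
) -> list[str]:
    """Return up to `count` matching filenames, most recently modified first.

    Hypothetical stand-in for the notebook's `utils` helper; the real
    implementation is not shown and may behave differently.
    """
    candidates = [
        path
        for path in directory.glob(f"{prefix}*.{extension}")
        if path.is_file()
    ]
    # Newest first, judged by filesystem modification time (an assumption).
    candidates.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    return [path.name for path in candidates[:count]]
```

A call such as `list_recent_files(Path.home() / "Downloads", "df_finviz", "parquet", 100)` would mirror the parameters configured in the setup cell.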


### Step 2: Extract and Select Dates for Processing

This step extracts dates from the found filenames, sorts them, and then selects the dates to be processed based on the `DATE_SLICE` configuration.

In [4]:
def extract_and_sort_dates_from_files(filenames: list[str]) -> list[str]:
    """
    Extracts date strings (YYYY-MM-DD) from a list of filenames using a
    regular expression, removes duplicates, and sorts them chronologically.

    Args:
        filenames: A list of filenames.

    Returns:
        A sorted list of unique date strings.
    """
    dates = set()
    date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
    
    for filename in filenames:
        match = date_pattern.search(filename)
        if match:
            dates.add(match.group(0))
            
    # ISO-formatted dates (YYYY-MM-DD) sort chronologically as plain strings.
    return sorted(dates)

# --- Execute Step 2 ---
print("\n--- Step 2: Extracting and selecting dates ---")

# 1. Extract all available dates from the filenames
available_dates = extract_and_sort_dates_from_files(found_files)
print(f"Found {len(available_dates)} unique dates.")
if available_dates:
    print(f"Date range: {available_dates[0]} to {available_dates[-1]}")

# 2. Select the dates to process based on the configured slice
dates_to_process = available_dates[DATE_SLICE]

if dates_to_process:
    print(f"\nSelected {len(dates_to_process)} date(s) for processing:")
    for d in dates_to_process:
        print(f"  - {d}")
else:
    print("\nNo dates were selected for processing based on the current configuration.")


--- Step 2: Extracting and selecting dates ---
Found 35 unique dates.
Date range: 2025-04-25 to 2025-06-13

Selected 2 date(s) for processing:
  - 2025-06-12
  - 2025-06-13
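
The pattern `\d{4}-\d{2}-\d{2}` accepts any digits in that shape, including impossible dates such as `2025-13-45`. If stray filenames are a concern, one option is to validate each match with `datetime.strptime`. The sketch below is a hypothetical variant (`extract_valid_dates` is not part of the notebook) that drops matches that are not real calendar dates.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def extract_valid_dates(filenames: list[str]) -> list[str]:
    """Return sorted, unique YYYY-MM-DD strings that are real calendar dates."""
    valid = set()
    for name in filenames:
        match = DATE_PATTERN.search(name)
        if match is None:
            continue  # no date-shaped text in this filename
        candidate = match.group(0)
        try:
            # strptime raises ValueError for impossible dates (e.g. month 13).
            datetime.strptime(candidate, "%Y-%m-%d")
        except ValueError:
            continue
        valid.add(candidate)
    return sorted(valid)


sample = [
    "df_finviz_2025-06-13_stocks_etfs.parquet",
    "df_finviz_2025-06-12_stocks_etfs.parquet",
    "df_finviz_2025-13-45.parquet",  # hypothetical malformed filename
    "notes.txt",
]
dates = extract_valid_dates(sample)  # ['2025-06-12', '2025-06-13']
```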


### Step 3: Generate Configuration and Run Analysis for Each Selected Date

This is the main execution step. For each selected date, the loop overwrites `config.py` in the project root and then runs `run_sequence.py`, so the dates are processed one at a time in chronological order.
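
Writing paths into generated Python source needs care on Windows, because sequences such as `\t` or `\n` inside a plain quoted string would be read as escape characters. The sketch below illustrates why formatting values with `repr()` (here via `!r`) is a safe choice: it renders a small config, loads it back with `runpy.run_path`, and checks that the values survive the round trip. The `render_config` function and the example paths are hypothetical and chosen only to exercise the escaping.

```python
import runpy
import tempfile
from pathlib import Path


def render_config(date_str: str, download_dir: str, dest_dir: str) -> str:
    """Render config source, using repr() so backslashes are escaped safely."""
    return (
        "# config.py (illustrative)\n"
        f"DATE_STR = {date_str!r}\n"
        f"DOWNLOAD_DIR = {download_dir!r}\n"
        f"DEST_DIR = {dest_dir!r}\n"
    )


# Hypothetical Windows-style paths; '\t' and '\n' would be misread as a tab
# and a newline if they were pasted between plain quotes.
download_dir = r"C:\Users\example\Downloads"
dest_dir = r"C:\temp\new_data"

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.py"
    config_path.write_text(
        render_config("2025-06-13", download_dir, dest_dir), encoding="utf-8"
    )
    # run_path executes the file and returns its module-level names as a dict.
    loaded = runpy.run_path(str(config_path))

round_trip_ok = (
    loaded["DOWNLOAD_DIR"] == download_dir and loaded["DEST_DIR"] == dest_dir
)
```

A quick round-trip check like this can catch quoting mistakes before the generated file is handed to a longer-running script.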

In [5]:
def create_config_file(date_str: str, config_path: Path) -> None:
    """
    Creates a config.py file with dynamic paths and parameters.
    It pulls configuration from the global variables set in the setup cell.

    Args:
        date_str (str): The date to be written into the config file.
        config_path (Path): The path where the config.py file will be saved.
    """
    # Use repr() to get a string representation of the path, which correctly
    # handles backslashes on Windows (e.g., 'C:\\Users\\...')
    config_content = f"""# config.py
# This file is auto-generated by a notebook. DO NOT EDIT MANUALLY.

# --- File path configuration ---
DATE_STR = {date_str!r}
DOWNLOAD_DIR = {repr(str(DOWNLOADS_DIR))}
DEST_DIR = {repr(str(DEST_DIR))}

# --- Analysis Parameters ---
ANNUAL_RISK_FREE_RATE = {ANNUAL_RISK_FREE_RATE}
TRADING_DAYS_YEAR = {TRADING_DAYS_YEAR}
"""
    
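    # Write with an explicit encoding so the result does not depend on the
    # platform's default text encoding.
    config_path.write_text(config_content, encoding='utf-8')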

# --- Execute Step 3 ---
print("\n--- Step 3: Starting processing sequence ---")

if not dates_to_process:
    print("No dates to process. Halting execution.")
else:
    for date_str in dates_to_process:
        print(f"\n{'='*20} PROCESSING DATE: {date_str} {'='*20}")
        
        # Define the path for the config file (in the project root)
        config_file_path = ROOT_DIR / 'config.py'
        
        # 1. Create the config.py file for the current date
        create_config_file(date_str, config_file_path)
        print(f"Successfully created config file: {config_file_path}")

        # 2. Run the external processing script
        print(f"Executing run_sequence.py for {date_str}...")
        %run -i {ROOT_DIR / 'run_sequence.py'}
        print(f"--- Finished processing for {date_str} ---")

    print(f"\n{'='*20} WORKFLOW COMPLETE {'='*20}")


--- Step 3: Starting processing sequence ---

Successfully created config file: c:\Users\ping\Files_win10\python\py311\stocks\config.py
Executing run_sequence.py for 2025-06-12...
Starting notebook execution sequence...

--- Running py0_get_yloader_OHLCV_data_v1.ipynb ---

Running command: c:\Users\ping\Files_win10\python\py311\.venv\Scripts\jupyter nbconvert --to notebook --execute --output executed\executed_py0_get_yloader_OHLCV_data_v1.ipynb py0_get_yloader_OHLCV_data_v1.ipynb
Successfully executed py0_get_yloader_OHLCV_data_v1.ipynb
Output saved to: executed\executed_py0_get_yloader_OHLCV_data_v1.ipynb

--- Running py1_clean_df_finviz_v14.ipynb ---

Running command: c:\Users\ping\Files_win10\python\py311\.venv\Scripts\jupyter nbconvert --to notebook --execute --output executed\executed_py1_clean_df_finviz_v14.ipynb py1_clean_df_finviz_v14.ipynb
Successfully executed py1_clean_df_finviz_v14.ipynb
Output saved to: executed\executed_py1_clean_df_finviz_v14.ipynb

--- Running py2_clea