### Pipeline Orchestrator

This notebook finds the available Finviz data files, lets the user select which date(s) to process, and then runs the main processing pipeline (`run_sequence_v2.py`) for each selected date.

**Workflow:**
1.  **Setup:** Configure paths and define the default date selection rule.
2.  **Get Valid Trading Days:** Load the OHLCV data and use the unique dates in its index as the set of valid trading days.
3.  **Find Data Files:** Scan the `Downloads` directory for recent Finviz data files.
4.  **Select Dates:** Extract available dates and apply the default selection rule.
5.  **(Optional) Refine Selection:** Interactively prompt the user to override the default date selection.
6.  **Execute Pipeline:** For each selected date, generate a `config.py` file and run the external processing script.

### Setup and Configuration

In [7]:
import sys
from pathlib import Path
import pandas as pd

# --- Project Path Setup ---
NOTEBOOK_DIR = Path.cwd()
ROOT_DIR = NOTEBOOK_DIR.parent 
SRC_DIR = ROOT_DIR / 'src'
# Make the project root and src/ importable (added only once)
if str(ROOT_DIR) not in sys.path:
    sys.path.append(str(ROOT_DIR))
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))

import utils

# --- Data File Configuration ---
DOWNLOADS_DIR = Path.home() / "Downloads"
DATA_FILE_PREFIX = 'df_finviz'
DATA_FILE_EXTENSION = 'parquet'
DATA_FILES_TO_SCAN = 100
OHLCV_PARQUET_PATH = ROOT_DIR / "data" / "df_OHLCV_stocks_etfs.parquet"

# --- Analysis Run Configuration ---
# Default rule for selecting which dates to process.
# slice(-1, None, None) -> Processes only the most recent date.
DATE_SLICE = slice(-1, None, None)

# --- config.py Generation Parameters ---
DEST_DIR = ROOT_DIR / 'data'
ANNUAL_RISK_FREE_RATE = 0.04
TRADING_DAYS_PER_YEAR = 252

# --- Notebook Setup ---
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
%load_ext autoreload
%autoreload 2

# --- Verification ---
print(f"Project Root Directory: {ROOT_DIR}")
print(f"Scanning for data files in: {DOWNLOADS_DIR}")
print(f'OHLCV Parquet Path: {OHLCV_PARQUET_PATH}')
print(f'SRC Path: {SRC_DIR}')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Project Root Directory: c:\Users\ping\Files_win10\python\py311\stocks
Scanning for data files in: C:\Users\ping\Downloads
OHLCV Parquet Path: c:\Users\ping\Files_win10\python\py311\stocks\data\df_OHLCV_stocks_etfs.parquet
SRC Path: c:\Users\ping\Files_win10\python\py311\stocks\src
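
The data-file settings above (`DATA_FILE_PREFIX`, `DATA_FILE_EXTENSION`, `DATA_FILES_TO_SCAN`) are later passed to `utils.get_recent_files`, whose source is not shown in this notebook. As a rough sketch of what such a scan might look like, the hypothetical `find_recent_files` below globs for `prefix*.extension` and returns the newest matches first. The function name, the newest-first ordering, and the sample file names are all assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path


def find_recent_files(directory, prefix, extension, count):
    """Return up to `count` file names matching prefix*.extension, newest first.

    Hypothetical stand-in for utils.get_recent_files; the real helper may differ.
    """
    matches = Path(directory).glob(f"{prefix}*.{extension}")
    # Sort by modification time, most recently modified first.
    newest_first = sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)
    return [p.name for p in newest_first[:count]]


# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    sample_names = [
        "df_finviz_2025-07-23.parquet",
        "df_finviz_2025-07-24.parquet",
        "notes.txt",  # wrong prefix and extension, so it is ignored
    ]
    for i, name in enumerate(sample_names):
        path = Path(tmp) / name
        path.touch()
        os.utime(path, (1_000_000 + i, 1_000_000 + i))  # force distinct mtimes
    recent = find_recent_files(tmp, "df_finviz", "parquet", count=100)

print(recent)
# ['df_finviz_2025-07-24.parquet', 'df_finviz_2025-07-23.parquet']
```

Capping the result at `count` keeps the scan bounded even when the `Downloads` directory holds many old files.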


### Step 1: Get Valid Trading Days

In [8]:
df_prices = pd.read_parquet(OHLCV_PARQUET_PATH)

# 'Date' is the second index level (position 1); select it by name,
# then de-duplicate and sort to get the trading-day calendar
trading_days = (
    df_prices.index
    .get_level_values('Date')
    .unique()
    .sort_values()
)

# trading_days is now a sorted DatetimeIndex
print(f'trading_days (total {len(trading_days)}):')
print(f'type(trading_days): {type(trading_days)}')
print(f'trading_days:\n{trading_days}')

trading_days (total 371):
type(trading_days): <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
trading_days:
DatetimeIndex(['2024-02-01', '2024-02-02', '2024-02-05', '2024-02-06', '2024-02-07', '2024-02-08', '2024-02-09', '2024-02-12', '2024-02-13', '2024-02-14',
               ...
               '2025-07-14', '2025-07-15', '2025-07-16', '2025-07-17', '2025-07-18', '2025-07-21', '2025-07-22', '2025-07-23', '2025-07-24', '2025-07-25'], dtype='datetime64[ns]', name='Date', length=371, freq=None)
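
To see the same extraction on a small scale, the sketch below builds a toy price frame with a two-level index and pulls out its sorted, unique dates. The `Ticker` level name, the tickers, and the prices are made up for illustration; only the `Date` level name comes from the cell above.

```python
import pandas as pd

# Toy frame with a (Ticker, Date) MultiIndex; tickers and prices are made up.
index = pd.MultiIndex.from_tuples(
    [
        ("AAA", pd.Timestamp("2024-02-02")),
        ("AAA", pd.Timestamp("2024-02-01")),
        ("BBB", pd.Timestamp("2024-02-01")),
        ("BBB", pd.Timestamp("2024-02-05")),
    ],
    names=["Ticker", "Date"],
)
toy_prices = pd.DataFrame({"Close": [10.0, 9.5, 20.0, 21.0]}, index=index)

# Same chain as the notebook cell: one value per row, then unique, then sorted.
toy_days = toy_prices.index.get_level_values("Date").unique().sort_values()
print(list(toy_days.strftime("%Y-%m-%d")))
# ['2024-02-01', '2024-02-02', '2024-02-05']
```

Because `get_level_values` returns one entry per row, the `unique()` call is what collapses the per-ticker repeats into a single calendar.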


### Step 2: Find and Display Available Data

In [9]:
print("--- Step 1: Finding recent data files ---")

found_files = utils.get_recent_files(    
    directory_path=DOWNLOADS_DIR,
    prefix=DATA_FILE_PREFIX,
    extension=DATA_FILE_EXTENSION,
    count=DATA_FILES_TO_SCAN
)

if not found_files:
    print(f"No files matching '{DATA_FILE_PREFIX}*.{DATA_FILE_EXTENSION}' found.")
    available_dates = []
else:
    # Extract dates from filenames
    available_dates = utils.extract_and_sort_dates_from_filenames(found_files)
    
    # Keep only the dates that appear in the trading-day calendar
    print("\nFiltering available dates against actual trading days...")

    # Convert the date strings to a DatetimeIndex so they can be compared
    available_dt_index = pd.to_datetime(available_dates)

    # Boolean mask: True where the date is a valid trading day
    is_trading_day_mask = available_dt_index.isin(trading_days)

    # Keep only the valid dates, preserving their sorted order
    available_dates = [
        date for date, is_valid in zip(available_dates, is_trading_day_mask) if is_valid
    ]

    print(f"\nFound {len(available_dates)} valid trading dates to process:")
    utils.print_list_in_columns(available_dates, num_columns=5)

--- Step 1: Finding recent data files ---

Filtering available dates against actual trading days...

Found 62 valid trading dates to process:
  0    2025-04-25    1    2025-04-28    2    2025-04-29    3    2025-04-30    4    2025-05-01
  5    2025-05-02    6    2025-05-05    7    2025-05-06    8    2025-05-07    9    2025-05-08
  10   2025-05-09    11   2025-05-12    12   2025-05-13    13   2025-05-14    14   2025-05-15
  15   2025-05-16    16   2025-05-19    17   2025-05-20    18   2025-05-21    19   2025-05-22
  20   2025-05-23    21   2025-05-27    22   2025-05-28    23   2025-05-29    24   2025-05-30
  25   2025-06-02    26   2025-06-03    27   2025-06-04    28   2025-06-05    29   2025-06-06
  30   2025-06-09    31   2025-06-10    32   2025-06-11    33   2025-06-12    34   2025-06-13
  35   2025-06-16    36   2025-06-17    37   2025-06-18    38   2025-06-20    39   2025-06-23
  40   2025-06-24    41   2025-06-25    42   2025-06-26    43   2025-06-27    44   2025-06-30
  45   2025-
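
The date extraction and trading-day filter can be tried in isolation. The sketch below assumes each file name embeds a `YYYY-MM-DD` date. The helper `dates_from_filenames` is a hypothetical stand-in for `utils.extract_and_sort_dates_from_filenames`, and the file names and the three-day calendar are invented for the example.

```python
import re

import pandas as pd

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def dates_from_filenames(filenames):
    """Return the sorted, de-duplicated YYYY-MM-DD strings found in `filenames`."""
    found = set()
    for name in filenames:
        match = DATE_PATTERN.search(name)
        if match:
            found.add(match.group())
    return sorted(found)  # ISO dates sort correctly as plain strings


# Made-up file names; 2025-07-19 is a Saturday, so the calendar omits it.
sample_files = [
    "df_finviz_2025-07-21_stocks_etfs.parquet",
    "df_finviz_2025-07-18_stocks_etfs.parquet",
    "df_finviz_2025-07-19_stocks_etfs.parquet",
]
trading_calendar = pd.DatetimeIndex(["2025-07-17", "2025-07-18", "2025-07-21"])

candidates = dates_from_filenames(sample_files)
mask = pd.to_datetime(candidates).isin(trading_calendar)
valid = [d for d, ok in zip(candidates, mask) if ok]
print(valid)
# ['2025-07-18', '2025-07-21']
```

Filtering against the calendar drops files saved on weekends or holidays, so a later slice such as "most recent date" never lands on a day with no price data.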

### Step 3: Select Dates for Processing (Default)

In [10]:
if available_dates:
    # Apply the default slice defined in the setup cell
    dates_to_process = available_dates[DATE_SLICE]
    print(f"\n--- Step 2: Applying default selection rule ---")
    print(f"Default rule '{DATE_SLICE}' selected {len(dates_to_process)} date(s):")
    print(dates_to_process)
else:
    dates_to_process = []


--- Step 2: Applying default selection rule ---
Default rule 'slice(-1, None, None)' selected 1 date(s):
['2025-07-24']
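
Since `DATE_SLICE` is an ordinary Python `slice` object, other selection rules follow the usual list-slicing semantics. The short sketch below applies a few rules to a made-up list of dates; the rule labels are only descriptive.

```python
dates = ["2025-07-21", "2025-07-22", "2025-07-23", "2025-07-24"]

rules = {
    "most recent date": slice(-1, None, None),
    "last three dates": slice(-3, None, None),
    "every date": slice(None, None, None),
    "all but the most recent": slice(None, -1, None),
}

# Indexing a list with a slice object is the same as writing dates[start:stop:step].
selected = {label: dates[rule] for label, rule in rules.items()}
for label, picked in selected.items():
    print(f"{label:<24} {picked}")
```

A slice always yields a list, even for a single date, so downstream code can loop over the result without special-casing.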


### Step 4: (OPTIONAL) Interactively Refine Selection

In [11]:
if available_dates:
    # Call the interactive utility function
    NEW_DATE_SLICE = utils.prompt_for_slice_update("DATE_SLICE", DATE_SLICE)
    
    # If the slice was changed, update the list of dates to process
    if NEW_DATE_SLICE != DATE_SLICE:
        DATE_SLICE = NEW_DATE_SLICE
        dates_to_process = available_dates[DATE_SLICE]
        print(f"\nUpdated selection. Now processing {len(dates_to_process)} date(s):")
        print(dates_to_process)

   Continuing with the current value.
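
The source of `utils.prompt_for_slice_update` is not shown here. One plausible way to turn typed input into a slice is sketched below. Both function names are hypothetical, and the `start:stop[:step]` input syntax is an assumption rather than the helper's documented behavior.

```python
def parse_slice(text):
    """Parse 'start:stop' or 'start:stop:step' (parts may be empty) into a slice."""
    parts = text.strip().split(":")
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"expected 'start:stop' or 'start:stop:step', got {text!r}")
    # An empty part means "no bound", which slice() represents as None.
    numbers = [int(p) if p.strip() else None for p in parts]
    return slice(*numbers)


def updated_slice(current, user_input):
    """Return `current` for blank input, otherwise the parsed replacement."""
    return current if not user_input.strip() else parse_slice(user_input)


print(updated_slice(slice(-1, None, None), ""))     # slice(-1, None, None)
print(updated_slice(slice(-1, None, None), "-3:"))  # slice(-3, None, None)
```

Treating blank input as "keep the current value" matches the output of the cell above, where the default selection was left unchanged.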


### Step 5: Execute Pipeline

This cell iterates through the final list of selected dates, generates the `config.py` file for each, and executes the `run_sequence_v2.py` script.
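
Before the cell itself, a rough picture of the config-generation step may help. The sketch below writes a small Python module of constants and reads it back. `write_config` is a hypothetical stand-in for `utils.create_pipeline_config_file`, and the variable names it writes are assumptions, not the real file's contents.

```python
import tempfile
from pathlib import Path


def write_config(config_path, date_str, downloads_dir, dest_dir,
                 annual_risk_free_rate, trading_days_per_year):
    """Write a module of constants; the variable names here are illustrative."""
    lines = [
        "# Auto-generated; edits are overwritten on the next run.",
        f"DATE_STR = {date_str!r}",
        f"DOWNLOADS_DIR = {str(downloads_dir)!r}",  # repr() escapes backslashes
        f"DEST_DIR = {str(dest_dir)!r}",
        f"ANNUAL_RISK_FREE_RATE = {annual_risk_free_rate!r}",
        f"TRADING_DAYS_PER_YEAR = {trading_days_per_year!r}",
    ]
    Path(config_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.py"
    write_config(config_path, "2025-07-24",
                 Path(tmp) / "Downloads", Path(tmp) / "data", 0.04, 252)
    # Execute the generated file to confirm it is valid Python.
    namespace = {}
    exec(config_path.read_text(encoding="utf-8"), namespace)

print(namespace["DATE_STR"], namespace["TRADING_DAYS_PER_YEAR"])
# 2025-07-24 252
```

Rewriting the file on every loop iteration means each run of the external script sees only the date currently being processed.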

In [12]:
print("\n--- Step 4: Starting processing sequence ---")

if not dates_to_process:
    print("No dates to process. Halting execution.")
else:
    for date_str in dates_to_process:
        print(f"\n{'='*20} PROCESSING DATE: {date_str} {'='*20}")
        
        # 1. Create the config.py file for the current date
        utils.create_pipeline_config_file(
            config_path=ROOT_DIR / 'config.py',
            date_str=date_str,
            downloads_dir=DOWNLOADS_DIR,
            dest_dir=DEST_DIR,
            annual_risk_free_rate=ANNUAL_RISK_FREE_RATE,
            trading_days_per_year=TRADING_DAYS_PER_YEAR
        )

        # 2. Run the external processing script in the notebook's namespace
        print(f"Executing run_sequence_v2.py for {date_str}...")

        # Build the full path first so the quoted %run argument stays simple
        script_to_run = ROOT_DIR / 'run_sequence_v2.py'
        get_ipython().run_line_magic('run', f'-i "{script_to_run}"')

        print(f"--- Finished processing for {date_str} ---")

    print(f"\n{'='*25} ALL PROCESSING COMPLETE {'='*25}")


--- Step 4: Starting processing sequence ---

Successfully created config file: c:\Users\ping\Files_win10\python\py311\stocks\config.py
Executing run_sequence_v2.py for 2025-07-24...
Starting notebook execution sequence...

--- Running py1_clean_df_finviz_v15.ipynb ---

Running command: c:\Users\ping\Files_win10\python\py311\.venv\Scripts\python.exe -m jupyter nbconvert --to notebook --execute --output C:\Users\ping\Files_win10\python\py311\stocks\notebooks_mean_reversion\executed\executed_py1_clean_df_finviz_v15.ipynb C:\Users\ping\Files_win10\python\py311\stocks\notebooks_mean_reversion\py1_clean_df_finviz_v15.ipynb
Successfully executed py1_clean_df_finviz_v15.ipynb
Output saved to: C:\Users\ping\Files_win10\python\py311\stocks\notebooks_mean_reversion\executed\executed_py1_clean_df_finviz_v15.ipynb

--- Running py2_clean_df_OHLCV_v10.ipynb ---

Running command: c:\Users\ping\Files_win10\python\py311\.venv\Scripts\python.exe -m jupyter nbconvert --to notebook --execute --output C:\