### Final Data Merge and Schema Finalization

This notebook completes data preparation: it adds the remaining calculated columns and enforces a standardized column order.

**Workflow:**

1.  **Load Data:** Loads the main DataFrame containing both Finviz data and previously calculated performance ratios.
2.  **Feature Engineering:** Calculates new features directly from the loaded data (e.g., `ATR/Price %`).
3.  **Merge External Data:** Calculates the 3-day performance from the separate OHLCV data file and merges this new column into the main DataFrame.
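4.  **Finalize Schema:** Reorders columns according to a predefined master list to ensure a consistent output format. Columns in the list that are missing from the DataFrame are added as empty; columns not in the list are dropped.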
5.  **Save & Verify:** Saves the final, completed DataFrame and reads it back to confirm success.

### Setup and Configuration

This cell loads all necessary libraries and configuration parameters. It pulls dynamic settings from `config.py` and defines the final column schema.
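
Because the schema list is long and maintained by hand, a duplicated name is easy to introduce and would produce duplicate columns after reindexing. The following is a minimal sketch of a sanity check, not part of the original notebook; `find_duplicate_columns` and `example_order` are hypothetical names used only for illustration.

```python
from collections import Counter


def find_duplicate_columns(column_order):
    """Return the names that appear more than once, in first-seen order."""
    counts = Counter(column_order)
    return [name for name, n in counts.items() if n > 1]


# Hypothetical short list standing in for FINAL_COLUMN_ORDER
example_order = ['No.', 'Company', 'Price', 'ATR', 'Price']
duplicates = find_duplicate_columns(example_order)
print(f"Duplicate column names: {duplicates}")
```

In the notebook, the same call could be made on `FINAL_COLUMN_ORDER` right after it is defined, stopping early if the returned list is not empty.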


In [24]:
import sys
from pathlib import Path
import pandas as pd

# --- Project Path Setup ---
NOTEBOOK_DIR = Path.cwd()
ROOT_DIR = NOTEBOOK_DIR.parent if NOTEBOOK_DIR.name == 'notebooks' else NOTEBOOK_DIR
SRC_DIR = ROOT_DIR / 'src'
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))

# --- Dynamic Configuration (from config.py) ---
from config import DATE_STR, DEST_DIR

# --- File Path Construction ---
DATA_DIR = Path(DEST_DIR)
SOURCE_PATH = DATA_DIR / f'{DATE_STR}_df_finviz_n_ratios_stocks_etfs.parquet'
OHLCV_PATH = DATA_DIR / 'df_OHLCV_clean_stocks_etfs.parquet'
DEST_PATH = DATA_DIR / f'{DATE_STR}_df_finviz_merged_stocks_etfs.parquet'

# --- Final Schema Configuration ---
# This list defines the exact order of columns in the final output file.
FINAL_COLUMN_ORDER = [
    'No.', 'Company', 'Index', 'Sector', 'Industry', 'Country', 'Exchange',
    'Info', 'MktCap AUM, M', 'Rank', 'Market Cap, M', 'P/E', 'Fwd P/E', 'PEG',
    'P/S', 'P/B', 'P/C', 'P/FCF', 'Book/sh', 'Cash/sh', 'Dividend %',
    'Dividend TTM', 'Dividend Ex Date', 'Payout Ratio %', 'EPS', 'EPS next Q',
    'EPS this Y %', 'EPS next Y %', 'EPS past 5Y %', 'EPS next 5Y %',
    'Sales past 5Y %', 'Sales Q/Q %', 'EPS Q/Q %', 'EPS YoY TTM %',
    'Sales YoY TTM %', 'Sales, M', 'Income, M', 'EPS Surprise %',
    'Revenue Surprise %', 'Outstanding, M', 'Float, M', 'Float %',
    'Insider Own %', 'Insider Trans %', 'Inst Own %', 'Inst Trans %',
    'Short Float %', 'Short Ratio', 'Short Interest, M', 'ROA %', 'ROE %',
    # 'ROI %', 'Curr R', 'Quick R', 'LTDebt/Eq', 'Debt/Eq', 'Gross M %',
    'ROIC %', 'Curr R', 'Quick R', 'LTDebt/Eq', 'Debt/Eq', 'Gross M %',    
    'Oper M %', 'Profit M %', 'Perf 3D %', 'Perf Week %', 'Perf Month %',
    'Perf Quart %', 'Perf Half %', 'Perf Year %', 'Perf YTD %', 'Beta',
    'ATR', 'ATR/Price %', 'Volatility W %', 'Volatility M %', 'SMA20 %',
    'SMA50 %', 'SMA200 %', '50D High %', '50D Low %', '52W High %',
    '52W Low %', '52W Range', 'All-Time High %', 'All-Time Low %', 'RSI',
    'Earnings', 'IPO Date', 'Optionable', 'Shortable', 'Employees',
    'Change from Open %', 'Gap %', 'Recom', 'Avg Volume, M', 'Rel Volume',
    'Volume', 'Target Price', 'Prev Close', 'Open', 'High', 'Low', 'Price',
    'Change %', 'Single Category', 'Asset Type', 'Expense %', 'Holdings',
    'AUM, M', 'Flows 1M, M', 'Flows% 1M', 'Flows 3M, M', 'Flows% 3M',
    'Flows YTD, M', 'Flows% YTD', 'Return% 1Y', 'Return% 3Y', 'Return% 5Y',
    'Tags', 'Sharpe 3d', 'Sortino 3d', 'Omega 3d', 'Sharpe 5d',
    'Sortino 5d', 'Omega 5d', 'Sharpe 10d', 'Sortino 10d', 'Omega 10d',
    'Sharpe 15d', 'Sortino 15d', 'Omega 15d', 'Sharpe 30d',
    'Sortino 30d', 'Omega 30d', 'Sharpe 60d', 'Sortino 60d', 'Omega 60d',
    'Sharpe 120d', 'Sortino 120d', 'Omega 120d', 'Sharpe 250d',
    'Sortino 250d', 'Omega 250d'
]

# --- Notebook Setup ---
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 2000)
%load_ext autoreload
%autoreload 2

# --- Verification ---
print(f"Processing for Date: {DATE_STR}")
print(f"Source file: {SOURCE_PATH}")
print(f"OHLCV source for 3D Perf: {OHLCV_PATH}")
print(f"Destination file: {DEST_PATH}")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Processing for Date: 2025-06-13
Source file: ..\data\2025-06-13_df_finviz_n_ratios_stocks_etfs.parquet
OHLCV source for 3D Perf: ..\data\df_OHLCV_clean_stocks_etfs.parquet
Destination file: ..\data\2025-06-13_df_finviz_merged_stocks_etfs.parquet


### Step 1: Load Source Data

Load the main DataFrame containing the combined Finviz and performance ratio data.


In [25]:
print(f"--- Step 1: Loading data from {SOURCE_PATH.name} ---")

try:
    df_finviz = pd.read_parquet(SOURCE_PATH)
    # The list of tickers is derived directly from our primary source file.
    tickers = df_finviz.index.tolist()
    print(f"Successfully loaded data for {len(tickers)} tickers.")
    df_finviz.info()
    
except FileNotFoundError:
    print(f"ERROR: Source file not found at {SOURCE_PATH}. Halting execution.")
    df_finviz = None
except Exception as e:
    print(f"An error occurred during file loading: {e}")
    df_finviz = None

--- Step 1: Loading data from 2025-06-13_df_finviz_n_ratios_stocks_etfs.parquet ---
Successfully loaded data for 1530 tickers.
<class 'pandas.core.frame.DataFrame'>
Index: 1530 entries, MSFT to VTHR
Columns: 137 entries, No. to Omega 250d
dtypes: float64(118), int64(3), object(16)
memory usage: 1.6+ MB


### Step 2: Feature Engineering

Calculate new columns based on the existing data in the DataFrame.
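
A plain division yields `inf` when `Price` is zero. The sketch below shows one possible guard, assuming that a zero or missing price should give a missing ratio; the helper name `atr_pct_of_price` and the `ZERO` and `MISSING` rows are hypothetical and not taken from the source data.

```python
import numpy as np
import pandas as pd


def atr_pct_of_price(df):
    """ATR as a percentage of Price; NaN where Price is zero or missing."""
    # Replace zero prices with NaN so the division cannot produce inf
    price = df['Price'].replace(0, np.nan)
    return df['ATR'] / price * 100


toy = pd.DataFrame(
    {'ATR': [6.98, 1.0, 2.0], 'Price': [474.96, 0.0, np.nan]},
    index=['MSFT', 'ZERO', 'MISSING'],
)
toy['ATR/Price %'] = atr_pct_of_price(toy)
print(toy)
```

The `MSFT` row uses the ATR and Price values shown in this notebook's output; the other two rows exist only to exercise the guard.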

In [26]:
if df_finviz is not None:
    print("\n--- Step 2: Engineering new features from existing data ---")
    
    # Calculate ATR as a percentage of Price
    df_finviz['ATR/Price %'] = (df_finviz['ATR'] / df_finviz['Price']) * 100
    print("Created 'ATR/Price %' column.")
    
    display(df_finviz[['ATR', 'Price', 'ATR/Price %']].head())


--- Step 2: Engineering new features from existing data ---
Created 'ATR/Price %' column.


|       | ATR  | Price  | ATR/Price % |
|-------|------|--------|-------------|
| MSFT  | 6.98 | 474.96 | 1.469597    |
| NVDA  | 4.25 | 141.97 | 2.99359     |
| AAPL  | 5.04 | 196.45 | 2.565538    |
| AMZN  | 5.08 | 212.1  | 2.395097    |
| GOOGL | 4.59 | 174.67 | 2.627812    |


### Step 3: Calculate and Merge External Data (`Perf 3D %`)

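This step computes each ticker's 3-day performance (the percentage change in `Adj Close` between the latest row and the row three periods earlier) from the external OHLCV file and joins it to the main DataFrame on the ticker index.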
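
To make the calculation concrete, here is a small worked example on made-up prices. It assumes the OHLCV file uses a two-level index whose ticker level is named `Ticker`, as the notebook's `unstack` call implies. The `Date` level name, the tickers `AAA` and `BBB`, and all prices are hypothetical.

```python
import pandas as pd

# Five business days of invented prices for two invented tickers
dates = pd.date_range('2025-06-09', periods=5, freq='B')
index = pd.MultiIndex.from_product([['AAA', 'BBB'], dates], names=['Ticker', 'Date'])
toy_ohlcv = pd.DataFrame(
    {'Adj Close': [100.0, 101.0, 102.0, 103.0, 104.0,
                   50.0, 49.0, 48.0, 47.0, 46.0]},
    index=index,
)

# Pivot to one column per ticker, with dates as rows
wide = toy_ohlcv['Adj Close'].unstack(level='Ticker')

# Percentage change versus the value three rows earlier, latest row only
perf_3d = wide.pct_change(periods=3).iloc[-1] * 100
perf_3d = perf_3d.rename('Perf 3D %').to_frame()
print(perf_3d)
```

For `AAA` the result is `(104 / 101 - 1) * 100`, since the latest close is compared with the close three rows back rather than with the first row of the series.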

In [27]:
def calculate_3d_performance(ohlcv_path: Path, ticker_list: list) -> pd.DataFrame:
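    """
    Load OHLCV data and compute each ticker's 3-day performance.

    Expects an 'Adj Close' column and a MultiIndex with a level named 'Ticker'.
    Returns a one-column DataFrame ('Perf 3D %') indexed by ticker, or an
    empty DataFrame if the OHLCV file is not found.
    """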
    try:
        df_ohlcv = pd.read_parquet(ohlcv_path)
    except FileNotFoundError:
        print(f"ERROR: OHLCV file not found at {ohlcv_path}. Cannot calculate 3D performance.")
        return pd.DataFrame()

    # Pivot to wide format with tickers as columns
    df_adj_close = df_ohlcv['Adj Close'].unstack(level='Ticker')
    
    # Filter for tickers present in our main DataFrame
    valid_tickers = [t for t in ticker_list if t in df_adj_close.columns]
    df_adj_close_filtered = df_adj_close[valid_tickers]
    
    # Calculate 3-day percentage change and get the latest value
    df_returns = df_adj_close_filtered.pct_change(periods=3) * 100
    latest_returns = df_returns.tail(1)
    
    # Transpose and rename for merging
    df_perf_3d = latest_returns.T
    df_perf_3d.columns = ['Perf 3D %']
    
    return df_perf_3d

if df_finviz is not None:
    print("\n--- Step 3: Calculating and merging 3-day performance ---")
    df_perf_3d = calculate_3d_performance(OHLCV_PATH, tickers)

    if not df_perf_3d.empty:
        # Join the new column to the main DataFrame
        df_merged = df_finviz.join(df_perf_3d)
        print("Successfully calculated and merged 'Perf 3D %'.")
        display(df_merged[['Perf 3D %']].head())
    else:
        print("Could not calculate 3D performance, continuing without it.")
        df_merged = df_finviz # Assign original df if calculation failed
else:
    print("Skipping merge step because source data did not load.")
    df_merged = None


--- Step 3: Calculating and merging 3-day performance ---
Successfully calculated and merged 'Perf 3D %'.


|       | Perf 3D % |
|-------|-----------|
| MSFT  | 0.857895  |
| NVDA  | -1.375478 |
| AAPL  | -3.069028 |
| AMZN  | -2.532053 |
| GOOGL | -2.200448 |


### Step 4: Finalize Schema by Reordering Columns

Enforce the final column order as defined in the `FINAL_COLUMN_ORDER` list.
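
`DataFrame.reindex(columns=...)` does more than reorder: labels absent from the frame come back as all-NaN columns, and columns absent from the list are discarded. The toy example below illustrates both effects; the frame and the `Scratch` column are invented for the illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Price': [10.0], 'Scratch': [1], 'Company': ['Acme']},
    index=['ACME'],
)

# 'Beta' is not in the frame; 'Scratch' is not in the target order
ordered = toy.reindex(columns=['Company', 'Price', 'Beta'])

print(list(ordered.columns))      # target order is applied exactly
print(ordered['Beta'].isna().all())  # the missing column is filled with NaN
print('Scratch' in ordered.columns)  # the unlisted column is gone
```

This is why it can be worth reporting both missing and unlisted columns before reindexing, since neither case raises an error.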

In [28]:
if df_merged is not None:
    print("\n--- Step 4: Finalizing DataFrame schema ---")

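    # Check for any columns in the master list that are missing from our DataFrame
    missing_cols = [col for col in FINAL_COLUMN_ORDER if col not in df_merged.columns]
    if missing_cols:
        print(f"Warning: The following columns from the master list are missing and will be added as empty: {missing_cols}")

    # reindex() silently drops columns that are not in the master list, so report those too
    extra_cols = [col for col in df_merged.columns if col not in FINAL_COLUMN_ORDER]
    if extra_cols:
        print(f"Warning: The following columns are not in the master list and will be dropped: {extra_cols}")

    # Reindex the DataFrame to match the final desired column order
    df_final = df_merged.reindex(columns=FINAL_COLUMN_ORDER)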
    
    print("Columns have been reordered to the final schema.")
    df_final.info()
else:
    print("Skipping schema finalization as merged data is not available.")
    df_final = None


--- Step 4: Finalizing DataFrame schema ---
Columns have been reordered to the final schema.
<class 'pandas.core.frame.DataFrame'>
Index: 1530 entries, MSFT to VTHR
Columns: 139 entries, No. to Omega 250d
dtypes: float64(120), int64(3), object(16)
memory usage: 1.7+ MB


### Step 5: Save and Verify Final DataFrame

Save the completed DataFrame and read it back to confirm that the file was written and can be loaded.
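
Reading the file back shows that it loads, but not that it matches what was saved. A stricter check could compare the reloaded frame with the in-memory one. The sketch below is one way to do that, with `check_round_trip` as a hypothetical helper; it compares layout only (shape, column order, index), not cell values.

```python
import pandas as pd


def check_round_trip(original, reloaded):
    """Return a list of layout mismatches; an empty list means the frames agree."""
    problems = []
    if original.shape != reloaded.shape:
        problems.append(f"shape {original.shape} != {reloaded.shape}")
    if list(original.columns) != list(reloaded.columns):
        problems.append("columns differ")
    if not original.index.equals(reloaded.index):
        problems.append("index differs")
    return problems


# Toy frames standing in for df_final and the reloaded verified_df
toy = pd.DataFrame({'Price': [474.96, 141.97]}, index=['MSFT', 'NVDA'])
print(check_round_trip(toy, toy.copy()))
print(check_round_trip(toy, toy.iloc[:1]))
```

In the notebook this could be called as `check_round_trip(df_final, verified_df)` after the read-back, treating a non-empty result as a failed verification.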

In [29]:
if df_final is not None:
    print("\n--- Step 5: Saving and verifying final data ---")
    try:
        df_final.to_parquet(DEST_PATH, engine='pyarrow', compression='zstd')
        print(f"Successfully saved final DataFrame to: {DEST_PATH}")

        # Verification step
        print("\nVerifying saved file...")
        verified_df = pd.read_parquet(DEST_PATH)
        print("Verification successful. First 5 rows of final saved file:")
        display(verified_df.head())
        
    except Exception as e:
        print(f"An error occurred during save or verification: {e}")
else:
    print("\nSkipping final save step as the final DataFrame was not created.")


--- Step 5: Saving and verifying final data ---
Successfully saved final DataFrame to: ..\data\2025-06-13_df_finviz_merged_stocks_etfs.parquet

Verifying saved file...
Verification successful. First 5 rows of final saved file:


Unnamed: 0,No.,Company,Index,Sector,Industry,Country,Exchange,Info,"MktCap AUM, M",Rank,"Market Cap, M",P/E,Fwd P/E,PEG,P/S,P/B,P/C,P/FCF,Book/sh,Cash/sh,Dividend %,Dividend TTM,Dividend Ex Date,Payout Ratio %,EPS,EPS next Q,EPS this Y %,EPS next Y %,EPS past 5Y %,EPS next 5Y %,Sales past 5Y %,Sales Q/Q %,EPS Q/Q %,EPS YoY TTM %,Sales YoY TTM %,"Sales, M","Income, M",EPS Surprise %,Revenue Surprise %,"Outstanding, M","Float, M",Float %,Insider Own %,Insider Trans %,Inst Own %,Inst Trans %,Short Float %,Short Ratio,"Short Interest, M",ROA %,ROE %,ROIC %,Curr R,Quick R,LTDebt/Eq,Debt/Eq,Gross M %,Oper M %,Profit M %,Perf 3D %,Perf Week %,Perf Month %,Perf Quart %,Perf Half %,Perf Year %,Perf YTD %,Beta,ATR,ATR/Price %,Volatility W %,Volatility M %,SMA20 %,SMA50 %,SMA200 %,50D High %,50D Low %,52W High %,52W Low %,52W Range,All-Time High %,All-Time Low %,RSI,Earnings,IPO Date,Optionable,Shortable,Employees,Change from Open %,Gap %,Recom,"Avg Volume, M",Rel Volume,Volume,Target Price,Prev Close,Open,High,Low,Price,Change %,Single Category,Asset Type,Expense %,Holdings,"AUM, M","Flows 1M, M",Flows% 1M,"Flows 3M, M",Flows% 3M,"Flows YTD, M",Flows% YTD,Return% 1Y,Return% 3Y,Return% 5Y,Tags,Sharpe 3d,Sortino 3d,Omega 3d,Sharpe 5d,Sortino 5d,Omega 5d,Sharpe 10d,Sortino 10d,Omega 10d,Sharpe 15d,Sortino 15d,Omega 15d,Sharpe 30d,Sortino 30d,Omega 30d,Sharpe 60d,Sortino 60d,Omega 60d,Sharpe 120d,Sortino 120d,Omega 120d,Sharpe 250d,Sortino 250d,Omega 250d
MSFT,1,Microsoft Corporation,"DJIA, NDX, S&P 500",Technology,Software - Infrastructure,USA,NASD,"Technology, Software - Infrastructure",3530160.0,1,3530160.0,36.7,31.35,2.52,13.07,10.97,44.34,50.89,43.3,10.71,0.69,2.41,8/21/2025,25.42,12.94,3.37,13.5,13.12,18.45,14.54,14.33,13.27,17.88,12.1,14.13,270010.0,96640.0,7.38,2.38,7430.0,7320.0,98.51,1.47,-0.12,73.62,0.4,0.79,2.53,58.18,18.46,33.61,23.24,1.37,1.36,0.29,0.33,69.07,45.23,35.79,0.857895,0.97,4.86,22.24,7.13,9.77,12.68,1.03,6.98,1.469597,0.94,0.83,2.55,11.56,13.2,-1.14,37.75,-1.14,37.75,344.79 - 480.42,-1.14,595994.5,71.47,Apr 30/a,3/13/1986,Yes,Yes,228000.0,-0.35,-0.46,1.3,23.01,0.73,16794463,515.98,478.87,476.65,479.18,472.76,474.96,-0.82,,,,,,,,,,,,,,,-,2.488393,6.394298,1.569649,1.7619,3.573143,1.337003,7.420258,15.164748,3.145411,7.37705,18.594159,3.627778,5.582242,11.884837,2.66315,2.638729,5.536697,1.746356,0.62626,1.05022,1.130811,0.279762,0.415881,1.053934
NVDA,2,NVIDIA Corp,"DJIA, NDX, S&P 500",Technology,Semiconductors,USA,NASD,"Technology, Semiconductors",3464070.0,2,3464070.0,45.73,24.81,1.53,23.32,41.3,64.52,48.07,3.44,2.2,0.03,0.04,6/11/2025,1.16,3.1,1.0,45.3,31.72,91.83,29.9,64.24,69.18,27.6,81.36,86.17,148510.0,76770.0,9.89,1.68,24390.0,23400.0,95.96,4.08,-0.37,66.38,0.2,0.98,0.88,229.34,75.89,115.46,81.82,3.39,2.96,0.12,0.12,70.11,58.03,51.69,-1.375478,0.18,4.9,16.68,5.11,17.42,5.72,2.12,4.25,2.99359,1.65,1.73,2.71,16.63,11.24,-2.09,63.9,-7.29,63.9,86.62 - 153.13,-7.29,425809.98,62.71,May 28/a,1/22/1999,Yes,Yes,36000.0,-0.34,-1.75,1.38,259.3,0.68,177084632,171.67,145.0,142.46,143.58,140.85,141.97,-2.09,,,,,,,,,,,,,,,-,-1.872763,-3.209977,0.714033,-1.157285,-1.690841,0.834673,3.747898,6.521017,1.760265,4.655852,8.386854,2.017132,5.505622,13.458762,2.597512,1.485872,2.551612,1.329186,0.530987,0.74975,1.099794,0.352133,0.499934,1.062407
AAPL,3,Apple Inc,"DJIA, NDX, S&P 500",Technology,Consumer Electronics,USA,NASD,"Technology, Consumer Electronics",2934140.0,3,2934140.0,30.66,25.24,3.82,7.33,43.94,60.5,29.79,4.47,3.25,0.53,1.01,5/12/2025,16.11,6.41,1.42,6.27,8.51,15.41,8.03,8.51,5.08,7.68,-0.36,4.91,400370.0,97290.0,1.39,0.86,14940.0,14920.0,99.88,0.1,-1.92,63.82,-0.54,0.64,1.55,94.83,29.1,138.02,66.93,0.82,0.78,1.18,1.47,46.63,31.81,24.3,-3.069028,-3.66,-7.48,-7.98,-20.71,-5.17,-21.55,1.21,5.04,2.565538,1.23,1.32,-2.7,-2.47,-12.48,-12.76,16.1,-24.47,16.1,169.21 - 260.10,-24.47,308705.92,40.72,May 01/a,12/12/1980,Yes,Yes,164000.0,-1.5,0.12,2.08,61.37,0.84,51282817,228.26,199.2,199.44,200.37,195.7,196.45,-1.38,,,,,,,,,,,,,,,-,-8.468945,-9.654115,0.139943,-8.288563,-8.469575,0.235665,-4.047547,-4.949148,0.538323,0.447138,0.689102,1.07719,-1.333089,-2.087804,0.776781,-0.527132,-0.79683,0.897593,-1.111297,-1.580116,0.802835,-0.194979,-0.275857,0.962613
AMZN,4,Amazon.com Inc,"DJIA, NDX, S&P 500",Consumer Cyclical,Internet Retail,USA,NASD,"Consumer Cyclical, Internet Retail",2251730.0,4,2251730.0,34.59,29.23,2.01,3.46,7.36,22.92,108.2,28.82,9.25,,,-,0.0,6.13,1.32,11.93,17.23,36.89,17.21,17.86,8.62,62.33,71.88,10.08,650310.0,65940.0,16.38,0.33,10610.0,9490.0,89.45,10.58,-0.02,64.44,0.06,0.65,1.26,61.36,11.23,25.24,15.02,1.05,0.84,0.44,0.49,49.16,11.15,10.14,-2.532053,-0.69,0.88,7.15,-5.75,13.28,-3.32,1.34,5.08,2.395097,1.04,1.2,2.05,8.94,4.41,-2.88,31.43,-12.54,39.9,151.61 - 242.52,-12.54,323100.02,59.9,May 01/a,5/15/1997,Yes,Yes,1556000.0,1.0,-1.52,1.23,48.83,0.6,29271170,239.44,213.24,210.0,214.05,209.62,212.1,-0.53,,,,,,,,,,,,,,,-,-11.107779,-11.166068,0.005248,-8.884616,-8.688586,0.10697,3.322746,6.192967,1.827848,4.727346,9.772446,2.297858,3.091178,7.314785,1.837948,0.904564,1.456662,1.184248,-0.201535,-0.300903,0.964277,0.476787,0.695027,1.087018
GOOGL,5,Alphabet Inc,"NDX, S&P 500",Communication Services,Internet Content & Information,USA,NASD,"Communication Services, Internet Content & Inf...",2126510.0,5,2126510.0,19.48,17.18,1.51,5.92,6.15,22.31,28.4,28.41,7.83,0.3,0.81,6/9/2025,7.46,8.97,2.16,19.2,6.06,26.76,12.91,16.73,11.81,48.77,37.73,13.02,359310.0,111000.0,38.81,1.15,5830.0,5800.0,99.65,52.17,-0.01,38.3,-1.88,1.14,1.66,66.27,25.15,34.79,30.02,1.77,1.77,0.07,0.08,58.54,32.6,30.89,-2.200448,0.57,5.62,5.55,-5.67,-1.1,-7.73,1.01,4.59,2.627812,1.09,1.51,2.12,7.67,1.79,-3.55,24.29,-15.64,24.29,140.53 - 207.05,-15.64,7173.9,58.92,Apr 24/a,8/19/2004,Yes,Yes,183323.0,1.28,-1.84,1.42,39.99,0.69,27617315,199.64,175.7,172.47,177.13,172.38,174.67,-0.59,,,,,,,,,,,,,,,-,-50.503209,-15.496357,0.0,-3.107006,-5.098036,0.622559,3.81453,8.235403,1.847759,2.770405,5.706723,1.583275,1.675757,2.383805,1.347874,0.859436,1.301817,1.158333,-0.374498,-0.514476,0.938637,0.009306,0.012869,1.001585
