# Advanced Financial NLP Pipeline: AAPL, NVDA, GOOGL

This notebook implements the multi-feature sentiment analysis pipeline as defined in `plan.md`. We will analyze `AAPL`, `NVDA`, and `GOOGL` for the most recent 12-month period available through the Finnhub API.

### Notebook Workflow

1.  **Phase 1: Environment Setup**
    - Import libraries (`finnhub`, `pandas`, `transformers`, etc.).
    - Initialize the Finnhub API client.

2.  **Phase 2: Data Acquisition**
    - Fetch company news (headlines, summaries) from the `company-news` endpoint.
    - Fetch insider transaction sentiment (MSPR) from the `stock/insider-sentiment` endpoint.

3.  **Phase 3: NLP Sentiment Analysis**
    - Load a pre-trained `FinBERT` model for sentiment analysis.
    - Calculate and apply sentiment scores to news headlines and summaries.

4.  **Phase 4: Data Aggregation & Consolidation**
    - Merge the news and insider sentiment data into a single DataFrame.
    - Resample the combined data into a final quarterly format.
    - Generate the aggregated DataFrame containing mean sentiment scores, news volume, and insider sentiment metrics.

## Phase 1: Setup and Imports

In [2]:
print("1.0: Library Installation")
print("-"*30)
# To ensure packages install into the correct kernel environment, we explicitly use
# the 'sys.executable' to call pip. This avoids issues where '!pip' might
# point to a different Python installation.
import sys

# NOTE: To run this on GColab, you can also use the %pip magic command instead of !{sys.executable} -m pip

# Consolidated installation of all required libraries
!{sys.executable} -m pip install finnhub-python pandas seaborn matplotlib numpy datasets kaggle python-dotenv --quiet

print("Required libraries installed successfully.")

1.0: Library Installation
------------------------------
Required libraries installed successfully.



[notice] A new release of pip is available: 23.2.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [3]:
print("1.1: Library Imports")
print("-"*30)
import os
import numpy as np
import pandas as pd
import seaborn as sns
import finnhub as fi
import matplotlib.pyplot as plt
from dotenv import load_dotenv

import json
from datetime import datetime

print("Core libraries imported successfully.")

1.1: Library Imports
------------------------------
Core libraries imported successfully.


In [4]:
print("1.2: Finnhub Client Initialization")
print("-"*30)
# --- Secure API Key Management ---
# It is a security best practice to store your API key as an environment variable
# to avoid exposing it directly in the code.

# Before running this cell, set the 'FINNHUB_API_KEY' in your environment.
# For example, in your terminal:
# export FINNHUB_API_KEY='your_api_key_here'
# You will need to restart your notebook's kernel after setting the variable.

api_key = "d3r0knpr01qna05k8e40d3r0knpr01qna05k8e4g"

if not api_key:
    print("API key not found in environment variables. Please set 'FINNHUB_API_KEY'.")
    # You can temporarily hardcode your key here for testing, but it is not recommended for production.
    # api_key = "YOUR_API_KEY_HERE" 
    finnhub_client = None
else:
    finnhub_client = fi.Client(api_key=api_key)
    print("Finnhub client initialized.")
    # --- Test API Client ---
    # Optional: Test the client with a simple, free API call to ensure it's working.
    try:
        profile = finnhub_client.company_profile2(symbol='AAPL')
        print(f"Successfully fetched company profile for: {profile.get('name', 'AAPL')}")
    except Exception as e:
        print(f"Client may be initialized, but a test API call failed: {e}")

1.2: Finnhub Client Initialization
------------------------------
Finnhub client initialized.
Successfully fetched company profile for: Apple Inc


## Phase 2: Data Gathering

### Strategy 2.1 : Historical `Insider Sentiment`

In [5]:
print("2.1.0: Global Configuration")
print("-"*30)

# --- Configuration ---
# Tickers for the companies we are analyzing.
STOCKS = ['AAPL', 'NVDA', 'GOOGL']

# --- Date Range for Long-Term Data (2018-2024) ---
# This range applies to all data sources.
START_YEAR = 2018
END_YEAR = 2024

print("Global configuration loaded:")
print(f"Tickers: {STOCKS}")
print(f"Date Range: {START_YEAR}-{END_YEAR}")

2.1.0: Global Configuration
------------------------------
Global configuration loaded:
Tickers: ['AAPL', 'NVDA', 'GOOGL']
Date Range: 2018-2024


In [6]:
print("2.1.1: Long-Term Data Extraction (Insider Sentiment)")
print("-"*30)

# --- Data Storage ---
all_insider_data = []

print(f"Fetching long-term insider sentiment from {START_YEAR} to {END_YEAR}...")

# --- Fetch Data for Each Stock and Year ---
for stock in STOCKS:
    print(f"  > Processing {stock}...")
    for year in range(START_YEAR, END_YEAR + 1):
        start_date = f"{year}-01-01"
        end_date = f"{year}-12-31"
        try:
            insider_sentiment = finnhub_client.stock_insider_sentiment(stock, _from=start_date, to=end_date)
            insider_transactions = insider_sentiment.get('data', [])
            for item in insider_transactions:
                report_date = datetime(year=item['year'], month=item['month'], day=1).date()
                all_insider_data.append({
                    'ticker': stock,
                    'date': report_date,
                    'mspr': item['mspr'],
                    'change': item['change']
                })
            # A small confirmation to show progress.
            if insider_transactions:
                print(f"    - Found {len(insider_transactions)} records for {year}.")
        except Exception as e:
            print(f"    - Error fetching insider sentiment for {stock} in {year}: {e}")

print("\nLong-term insider sentiment fetching complete.")

2.1.1: Long-Term Data Extraction (Insider Sentiment)
------------------------------
Fetching long-term insider sentiment from 2018 to 2024...
  > Processing AAPL...
    - Found 10 records for 2018.
    - Found 10 records for 2019.
    - Found 9 records for 2020.
    - Found 8 records for 2021.
    - Found 9 records for 2022.
    - Found 7 records for 2023.
    - Found 8 records for 2024.
  > Processing NVDA...
    - Found 8 records for 2018.
    - Found 10 records for 2019.
    - Found 10 records for 2020.
    - Found 10 records for 2021.
    - Found 8 records for 2022.
    - Found 10 records for 2023.
    - Found 8 records for 2024.
  > Processing GOOGL...
    - Found 9 records for 2018.
    - Found 7 records for 2019.
    - Found 12 records for 2020.
    - Found 12 records for 2021.
    - Found 12 records for 2022.
    - Found 12 records for 2023.
    - Found 8 records for 2024.

Long-term insider sentiment fetching complete.


In [7]:
print("2.1.2: Create Company-Specific Insider DataFrames")
print("-"*30)

# This cell refactors the insider sentiment data into separate, clean
# DataFrames for each company, formatted for time series analysis.

# Create a temporary DataFrame from the raw collected data
insider_df = pd.DataFrame(all_insider_data)

# Dictionary to hold the final, structured DataFrames for each company
# NOTE: We can access the selevant dataset by calling `insider_datasets["AAPL"].head()` for example
insider_datasets = {}

if not insider_df.empty:
    # Convert 'date' column to datetime objects for manipulation
    insider_df['date'] = pd.to_datetime(insider_df['date'])

    # Engineer the 'Period' column in the specified 'YYYY-Q' format
    insider_df['Period'] = insider_df['date'].dt.year.astype(str) + '-Q' + insider_df['date'].dt.quarter.astype(str)
    
    print("Processing insider data for each target ticker...")
    # Iterate through the globally defined STOCKS list to create a DF for each
    for ticker in STOCKS:
        # Filter data for the current ticker
        ticker_specific_df = insider_df[insider_df['ticker'] == ticker].copy()

        if not ticker_specific_df.empty:
            # Select, rename, and sort the columns to match the desired format
            final_df = ticker_specific_df[['Period', 'mspr']].rename(columns={'mspr': 'MSPR'})
            final_df = final_df.sort_values(by='Period').reset_index(drop=True)
            
            # Store the processed DataFrame in the dictionary
            insider_datasets[ticker] = final_df 
        else:
            print(f"No insider sentiment data was found for {ticker}")
    
    print("\nSuccessfully created and structured insider sentiment DataFrames for all tickers.")
    print("DataFrames are stored in the 'insider_datasets' dictionary.")

else:
    print("The initial 'all_insider_data' list is empty. No DataFrames were created.")

2.1.2: Create Company-Specific Insider DataFrames
------------------------------
Processing insider data for each target ticker...

Successfully created and structured insider sentiment DataFrames for all tickers.
DataFrames are stored in the 'insider_datasets' dictionary.


In [8]:

print("2.1.2: Display Company-Specific Insider DataFrames")
print("-"*30)

# This is how we can access the insider_datasets ictionary now
display(insider_datasets["AAPL"].head())
display(insider_datasets["NVDA"].head())
display(insider_datasets["GOOGL"].head())

# This bit is just to gather contextual info on data distributions, quantities, ect ect
for TICKER in list(insider_datasets):
    print(f"Relevant Information on {TICKER}:\n")
    display(insider_datasets[TICKER].info())
    display(insider_datasets[TICKER].describe())

2.1.2: Display Company-Specific Insider DataFrames
------------------------------


Unnamed: 0,Period,MSPR
0,2018-Q1,-100.0
1,2018-Q1,7.840257
2,2018-Q2,-22.737514
3,2018-Q2,-54.7286
4,2018-Q2,-33.333332


Unnamed: 0,Period,MSPR
0,2018-Q1,-48.497414
1,2018-Q1,-39.06278
2,2018-Q2,-87.43943
3,2018-Q2,-62.33284
4,2018-Q3,-100.0


Unnamed: 0,Period,MSPR
0,2018-Q2,-33.796238
1,2018-Q2,-52.667397
2,2018-Q2,-48.464813
3,2018-Q3,-59.29357
4,2018-Q3,-61.71693


Relevant Information on AAPL:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Period  61 non-null     object 
 1   MSPR    61 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.1+ KB


None

Unnamed: 0,MSPR
count,61.0
mean,-26.480193
std,64.47905
min,-100.0
25%,-85.37024
50%,-33.200634
75%,-7.226337
max,100.0


Relevant Information on NVDA:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Period  64 non-null     object 
 1   MSPR    64 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.1+ KB


None

Unnamed: 0,MSPR
count,64.0
mean,-51.053201
std,49.574298
min,-100.0
25%,-100.0
50%,-55.415127
75%,-10.659913
max,100.0


Relevant Information on GOOGL:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Period  72 non-null     object 
 1   MSPR    72 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.3+ KB


None

Unnamed: 0,MSPR
count,72.0
mean,-29.655626
std,38.240153
min,-100.0
25%,-49.633164
50%,-37.794344
75%,-15.689648
max,80.80216


We now have a `Dictionary` containing: 
* `AAPL` Dataset 
* `NVDA` Dataset
* `GOOGL` Dataset

Each of these datasets has this format:
|   Period   | MSPR        |
|------------|-------------|
| 2018-Q1    | Value       | 
| 2018-Q1    | Value       | 
| 2018-Q1    | Value       | 
| 2018-Q2    | Value       | 
| ...        | ...         | 
| 2024-Q4    | Value       | 

### Strategy 2.2: Historical News via Hugging Face Datasets

In [12]:
print("2.2.0: Configuration & Authentication for Hugging Face")
print("-"*30)

# Load environment variables from .env file
load_dotenv()

# NOTE: --- Authentication (IMPORTANT) ---
# The Hugging Face User Access Token is loaded from the .env file.
# Make sure your .env file has the line: HF_TOKEN='your_token_here'

hf_token = os.getenv("HF_TOKEN")

if hf_token:
    print("Found Hugging Face token. Logging in...")
    try:
        login(token=hf_token)
        print("Login successful.")
    except Exception as e:
        print(f"Login failed: {e}")
else:
    print("Hugging Face token not found in environment variables.")
    print("Please ensure your .env file is correctly configured with 'HF_TOKEN'.")


# --- Configuration ---
# These are defined globally, but we re-state them here for clarity.
# Note: START_YEAR and END_YEAR must be defined in a previous cell.
try:
    TARGET_TICKERS = ['AAPL', 'NVDA', 'GOOGL']
    START_DATE = f"{START_YEAR}-01-01"
    END_DATE = f"{END_YEAR}-12-31"
    
    print("\nConfiguration for historical news extraction is set.")
    print(f"Tickers: {TARGET_TICKERS}")
    print(f"Date Range: {START_DATE} to {END_DATE}")
except NameError:
    print("\nWarning: START_YEAR and END_YEAR are not defined.")
    print("Please run a configuration cell first.")

2.2.0: Configuration & Authentication for Hugging Face
------------------------------
Found Hugging Face token. Logging in...
Login failed: name 'login' is not defined

Configuration for historical news extraction is set.
Tickers: ['AAPL', 'NVDA', 'GOOGL']
Date Range: 2018-01-01 to 2024-12-31


In [None]:
print("2.2.1: Extract Ticker-Specific News Articles via Streaming")
print("-"*30)

# --- Expanded Search Dictionary ---
# Maps a canonical ticker to a set of lowercase search terms for broader matching.
# Using sets provides a slight performance boost for lookups.
SEARCH_TERMS = {
    'AAPL': {'aapl', 'apple'},
    'NVDA': {'nvda', 'nvidia'},
    'GOOGL': {'googl', 'google', 'alphabet'}
}
# A flattened set of all terms for a very fast initial check.
ALL_SEARCH_TERMS = set.union(*SEARCH_TERMS.values())

# --- Data Storage ---
multisource_articles = []

print("Loading and filtering 'financial-news-multisource' dataset...")
print("(This is a large dataset and will take some time to process.)")

try:
    # --- Increase the download timeout for stability on large datasets ---
    import huggingface_hub.constants
    huggingface_hub.constants.HF_HUB_DOWNLOAD_TIMEOUT = 120 

    # --- Load all subsets of the dataset in streaming mode ---
    multisource_dataset = load_dataset(
        "Brianferrell787/financial-news-multisource",
        data_files="data/*/*.parquet",
        split="train",
        streaming=True
    )
    
    # --- Iterate through the stream with the optimized multi-step filter ---
    for i, article in enumerate(iter(multisource_dataset)):
        # Provide progress updates to show the process is not stalled
        if (i + 1) % 10_000 == 0:
            print(f"  > Processed {i + 1:,} articles. Found {len(multisource_articles)} relevant so far...")
            print(f"  > We are currently at date: {article['date'][:10]}")

        # --- OPTIMIZED FILTERING LOGIC ---
        # Step 1: Filter by date (fastest, string comparison).
        if not (START_DATE <= article['date'][:10] <= END_DATE):
            continue
        
        # Step 2: Quick pre-filter. Check if any of our expanded search terms appear
        # anywhere in the text or metadata before doing more expensive work.
        text_lower = article['text'].lower()
        extra_fields_lower = article['extra_fields'].lower()
        if not any(term in text_lower or term in extra_fields_lower for term in ALL_SEARCH_TERMS):
            continue

        # --- Step 3: Precise ticker identification for articles that passed the pre-filters ---
        mentioned_tickers = set() # Use a set to store found tickers to avoid duplicates

        # Primary Method: Check the structured 'stocks' field for high precision.
        try:
            extra_data = json.loads(article['extra_fields'])
            if 'stocks' in extra_data and isinstance(extra_data['stocks'], list):
                # Find the intersection of our target tickers and the article's tickers
                found = set(TARGET_TICKERS) & set(extra_data['stocks'])
                mentioned_tickers.update(found)
        except (json.JSONDecodeError, TypeError):
            # If JSON is invalid, we can fall back to text search.
            pass

        # Fallback Method: If no tickers were found in metadata, check the text.
        # This increases recall for articles that might not be perfectly tagged.
        if not mentioned_tickers:
            for ticker, terms in SEARCH_TERMS.items():
                if any(term in text_lower for term in terms):
                    mentioned_tickers.add(ticker)
        
        # If we found one or more relevant tickers, add entries to our list.
        if mentioned_tickers:
            for ticker in mentioned_tickers:
                multisource_articles.append({
                    'date': article['date'],
                    'ticker': ticker,
                    'text': article['text']
                })
    
    print(f"\nExtraction complete. Total relevant article entries collected: {len(multisource_articles)}")

except Exception as e:
    print(f"\nAn error occurred while processing the dataset: {e}")

In [None]:
print("2.2.2: Create the News Articles DataFrame")
print("-"*30)

# --- Create DataFrame from the collected list ---
news_articles_df = pd.DataFrame(multisource_articles)

if not news_articles_df.empty:
    # Convert 'date' column to datetime objects for future analysis
    news_articles_df['date'] = pd.to_datetime(news_articles_df['date']).dt.date
    
    print("`news_articles_df` DataFrame created successfully.")
    
    # Display summary and head
    print("\n--- DataFrame Info ---")
    news_articles_df.info()
    
    print("\n--- DataFrame Head ---")
    display(news_articles_df.head())
    
else:
    print("No articles from 'financial-news-multisource' matched the filtering criteria.")

In [None]:
print("2.2.3: Save Filtered News Articles to a Compressed File")
print("-"*30)

# This step is crucial for checkpointing our progress.

if 'news_articles_df' in locals() and not news_articles_df.empty:
    # Define the output file path. Using the '.gz' extension with the 'gzip'
    # compression type is a standard and efficient way to save large CSVs.
    output_filename = "filtered_news_articles.csv.gz"

    print(f"Saving the 'news_articles_df' to a compressed CSV file: {output_filename}")
    print(f"This may take a moment given the size of the DataFrame ({len(news_articles_df):,} rows)...")

    try:
        # Save the DataFrame to a gzipped CSV.
        # - compression='gzip' handles the zipping automatically.
        # - index=False prevents pandas from writing the DataFrame index as a column.
        news_articles_df.to_csv(output_filename, index=False, compression='gzip')
        
        print(f"\nDataFrame successfully saved to '{output_filename}'.")
        print("You can now load this file in future sessions to skip the long extraction process.")

    except Exception as e:
        print(f"\nAn error occurred while saving the DataFrame: {e}")
else:
    print("The 'news_articles_df' DataFrame is not available or is empty. Skipping save operation.")


### Strategy 2.3: Article Headlines Via `Kaggle`

In [None]:
print("2.3.0: Setup and Kaggle API Configuration")
print("-"*30)

# Load environment variables from .env file
load_dotenv()

# --- 1. Configure Kaggle API ---
# Credentials are loaded from your .env file.
# Ensure your .env file contains:
# KAGGLE_USERNAME='your_username'
# KAGGLE_KEY='your_api_key'

KAGGLE_USERNAME = os.getenv('KAGGLE_USERNAME')
KAGGLE_KEY = os.getenv('KAGGLE_KEY')

# Set environment variables for the Kaggle CLI
if KAGGLE_USERNAME and KAGGLE_KEY:
    os.environ['KAGGLE_USERNAME'] = KAGGLE_USERNAME
    os.environ['KAGGLE_KEY'] = KAGGLE_KEY
    print("Kaggle API credentials configured from environment variables.")
else:
    print("Warning: Kaggle credentials not found in environment variables.")
    print("Please ensure your .env file is correctly configured.")

# --- 2. Define Constants ---
# Date range for filtering headlines.
START_YEAR = 2018
END_YEAR = 2024

print("\nSetup complete. Constants defined.")
print(f"Data will be filtered for the period: {START_YEAR}-{END_YEAR}")

In [None]:
print("2.3.1: Download, Extract, and Filter S&P 500 Headlines Dataset")
print("-"*30)

import zipfile

# --- Define constants for data directory and file paths ---
DATA_DIR = "Data"
DATASET_NAME = 'dyutidasmahaptra/s-and-p-500-with-financial-news-headlines-20082024'
ZIP_FILE_NAME = 's-and-p-500-with-financial-news-headlines-20082024.zip'
CSV_FILE_NAME = 'sp500_headlines_2008_2024.csv'

# Construct full paths for the files within the Data directory
zip_file_path = os.path.join(DATA_DIR, ZIP_FILE_NAME)
csv_file_path = os.path.join(DATA_DIR, CSV_FILE_NAME)

# Ensure the target data directory exists
os.makedirs(DATA_DIR, exist_ok=True)
print(f"Ensured that the target directory '{DATA_DIR}' exists.")

# --- Download the dataset using the Kaggle API into the specified directory ---
print(f"Downloading dataset '{DATASET_NAME}' to '{DATA_DIR}'...")
# The -p flag directs the Kaggle CLI to download to the specified path.
!kaggle datasets download -d {DATASET_NAME} -p {DATA_DIR} --quiet

# --- Extract the CSV file from the downloaded zip ---
print(f"Extracting '{CSV_FILE_NAME}' into '{DATA_DIR}'...")
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(DATA_DIR)
print("Extraction complete.")

# --- Clean up the downloaded zip file ---
print(f"Removing temporary zip file: '{zip_file_path}'")
os.remove(zip_file_path)

# --- Load and process the data with pandas ---
print(f"Loading data from '{csv_file_path}' into DataFrame and filtering...")
try:
    # Load the CSV from the Data directory
    df = pd.read_csv(csv_file_path)

    # Convert the 'date' column to datetime objects for reliable filtering
    df['Date'] = pd.to_datetime(df['Date'])

    # Create the filter condition for the date range
    start_date = f'{START_YEAR}-01-01'
    end_date = f'{END_YEAR}-12-31'
    date_filter = (df['Date'] >= start_date) & (df['Date'] <= end_date)

    # Apply the filter and create the final DataFrame
    market_headlines_df = df[date_filter].copy()

    # Drop the 'close' column as it is not needed for sentiment analysis
    market_headlines_df = market_headlines_df.drop(columns=['CP'])

    print("Filtering successful. The data is ready in 'market_headlines_df'.")

except FileNotFoundError:
    print(f"ERROR: The file '{csv_file_path}' was not found after extraction.")
except Exception as e:
    print(f"An error occurred during data processing: {e}")

In [None]:
print("2.3.2: Display Final Market Headlines DataFrame")
print("-"*30)

# Sort by date just to be sure
market_headlines_df = market_headlines_df.sort_values(by='Date').reset_index(drop=True)

if 'market_headlines_df' in locals() and not market_headlines_df.empty:
    print("DataFrame for general market sentiment created successfully.")
    
    print("\n--- DataFrame Info ---")
    market_headlines_df.info()
    
    print("\n--- DataFrame Head (sorted) ---")
    display(market_headlines_df.head())
    
    print("\n--- DataFrame Tail (sorted) ---")
    display(market_headlines_df.tail())
else:
    print("The 'market_headlines_df' DataFrame was not created or is empty. Please check the previous cell for errors.")


### Strategy 2.4: Company filings via `EDGAR` API

In [None]:
print("2.4.0: Setup, Installations, and EDGAR Identity")
print("-"*30)

# --- 1. Install necessary libraries ---
# 'edgartools' is the library we'll use to interface with the SEC EDGAR database.
import sys
!{sys.executable} -m pip install edgartools --quiet

import pandas as pd
from edgar import Company, set_identity

# --- 2. Set EDGAR User Identity (CRITICAL STEP) ---
# The SEC requires any script or bot that accesses EDGAR to have a custom User-Agent
# that identifies the user. This is a compliance requirement to avoid being blocked.
# Replace the example with your own company/project name and email address.
# Format: "Sample Company Name your.email@example.com"
set_identity("University of Southampton ab3u21@soton.ac.uk")
print("EDGAR user identity set successfully.")

# --- 3. Define Constants ---
# These constants will be used to filter the filings.
# We use the globally defined START_YEAR and END_YEAR from cell 2.1.0
TARGET_TICKERS = ['AAPL', 'NVDA', 'GOOGL']
FORM_TYPES = ["10-K", "10-Q", "8-K"]
DATE_RANGE = f"{START_YEAR}-01-01:{END_YEAR}-12-31"


print("\nSetup complete. Ready to extract SEC filings.")
print(f"Tickers: {TARGET_TICKERS}")
print(f"Form Types: {FORM_TYPES}")
print(f"Date Range: {DATE_RANGE}")

In [None]:
print("2.4.1: Extract SEC Filings Data")
print("-"*30)

# --- Data Storage ---
all_filings_data = []

print("Starting extraction of SEC filings. This may take a significant amount of time...")

# --- Loop through each stock and extract its filings ---
for ticker in TARGET_TICKERS:
    print(f"  > Processing filings for {ticker}...")
    try:
        # Create a Company object for the current ticker
        company = Company(ticker)
        
        # Get all filings and immediately filter by date range and form types
        filings = company.get_filings().filter(date=DATE_RANGE, form=FORM_TYPES)
        
        # The 'filings' object is a generator; we iterate through it to get each filing
        for filing in filings:
            # The .text() method conveniently extracts and cleans the full filing text
            filing_text = filing.text()
            
            # Append the structured data to our list
            all_filings_data.append({
                'filing_date': filing.filing_date,
                'ticker': ticker,
                'form_type': filing.form,
                'text': filing_text
            })
            # Log progress for each file found
            print(f"    - Extracted {filing.form} from {filing.filing_date}")

    except Exception as e:
        print(f"    - ERROR: Could not process filings for {ticker}. Reason: {e}")

print(f"\nFilings extraction complete. Total documents extracted: {len(all_filings_data)}")


In [None]:
print("2.4.2: Create and Display Filings DataFrame")
print("-"*30)

# --- Create DataFrame from the collected list ---
filings_df = pd.DataFrame(all_filings_data)

if not filings_df.empty:
    # Convert 'filing_date' column to datetime objects
    filings_df['filing_date'] = pd.to_datetime(filings_df['filing_date'])
    
    # Sort the DataFrame by date and ticker for good practice
    filings_df = filings_df.sort_values(by=['filing_date', 'ticker']).reset_index(drop=True)
    
    print("`filings_df` DataFrame created successfully.")
    
    # Display summary and head
    print("\n--- DataFrame Info ---")
    filings_df.info()
    
    print("\n--- DataFrame Head ---")
    display(filings_df.head(40).sort_values(["filing_date"], ascending=True))
    
else:
    print("The 'all_filings_data' list is empty. No DataFrame was created. Please check cell 2.4.1 for errors.")

In [None]:
print("2.4.3: Save Filtered Filings to a Compressed File")
print("-"*30)

# Check if the filings_df exists and is not empty before attempting to save.
if 'filings_df' in locals() and not filings_df.empty:
    
    # Define the name for the compressed output file.
    output_filename = "filtered_filings.csv.gz"

    print(f"Saving the 'filings_df' to a compressed CSV file: {output_filename}")
    print(f"This may take a moment, as the DataFrame contains {len(filings_df):,} documents...")

    try:
        # Save the DataFrame to a gzipped CSV.
        # - compression='gzip' handles the compression.
        # - index=False prevents pandas from writing the DataFrame index as a column.
        filings_df.to_csv(output_filename, index=False, compression='gzip')
        
        print(f"\nDataFrame successfully saved to '{output_filename}'.")
        print("This file can be loaded in future sessions to bypass the EDGAR extraction process.")

    except Exception as e:
        print(f"\nAn error occurred while saving the filings DataFrame: {e}")
else:
    print("The 'filings_df' DataFrame is not available or is empty. Skipping the save operation.")


## Phase 3: NLP Pipelines

### 3.1: NLP Pipeline for `news_articles_df`

#### We start this section by *Down_sampling*: 
We want to go from our `>700k` articles to `~11k` articles for each company over the 2018-2024 year period. That's `5` articles per day per company!

In [None]:
print("3.1.0: Setup and Imports for NLP Pre-processing")
print("-"*30)

# --- 1. Install necessary libraries ---
import sys
!{sys.executable} -m pip install nltk pandas --quiet

import pandas as pd
import re
import nltk
import os

# --- 2. Download NLTK resources ---
# We need 'punkt' and 'punkt_tab' for word tokenization. The most reliable
# way to ensure they are available in a notebook is to call download() directly.
# NLTK will not re-download the data if it's already present.
print("Ensuring NLTK resources ('punkt', 'punkt_tab') are available...")
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True) # This command fixes the LookupError.
print("NLTK resources are up to date.")

# --- 3. Define File path for fallback loading ---
NEWS_ARTICLES_FILE = "filtered_news_articles.csv.gz"

print("\nSetup complete. Libraries and NLTK resources are ready.")

#### Here we calculate a `Mention_Density` metric for each atricle that will help us understand how relevant an article is for our selected company.

In [None]:
print("3.1.1: Calculate Mention Density Score for Each Article")
print("-"*30)

# --- 1. Define Comprehensive Company Keywords ---
COMPANY_KEYWORDS = {
    'AAPL': {
        'aapl', 'apple', 'iphone', 'ipad', 'macbook', 'imac', 'watchos',
        'ios', 'macos', 'airpods', 'tim cook', 'app store', 'vision pro'
    },
    'NVDA': {
        'nvda', 'nvidia', 'geforce', 'rtx', 'quadro', 'tesla', # Note: 'tesla' is a GPU arch
        'cuda', 'dgx', 'tegra', 'jensen huang', 'omniverse'
    },
    'GOOGL': {
        'googl', 'google', 'alphabet', 'android', 'youtube', 'chrome',
        'pixel', 'nest', 'waymo', 'gcp', 'sundar pichai', 'gemini'
    }
}
print("Comprehensive keyword dictionaries defined.")

# --- 2. Load the Dataset (prioritizing live DataFrame) ---
news_mention_density_df = None # Initialize to None

# Prioritize using the 'news_articles_df' if it's already in memory.
if 'news_articles_df' in locals() and isinstance(news_articles_df, pd.DataFrame) and not news_articles_df.empty:
    print("Using the live 'news_articles_df' DataFrame from the current session.")
    news_mention_density_df = news_articles_df.copy() # Use a copy to avoid side effects
    
# NOTE: --- This block is commented out but can be used for future sessions ---
# Fallback to loading from the file if the live DataFrame is not available.
# elif os.path.exists(NEWS_ARTICLES_FILE):
#     print(f"Loading dataset from '{NEWS_ARTICLES_FILE}'...")
#     # In a new session, you would uncomment the line below:
#     # news_mention_density_df = pd.read_csv(NEWS_ARTICLES_FILE, compression='gzip')
# -----------------------------------------------------------------------

if news_mention_density_df is None:
    print(f"ERROR: 'news_articles_df' not found in the current session.")
    print(f"To run this cell independently, uncomment the file loading logic above and ensure '{NEWS_ARTICLES_FILE}' exists.")
else:
    print(f"DataFrame loaded with {len(news_mention_density_df):,} articles.")
    
    # --- 3. Define the Density Calculation Function ---
    def calculate_all_densities(text):
        if not isinstance(text, str) or not text.strip():
            return pd.Series({f'{ticker}_Density': 0.0 for ticker in COMPANY_KEYWORDS})
        text_lower = text.lower()
        total_words = len(nltk.word_tokenize(text_lower))
        if total_words == 0:
            return pd.Series({f'{ticker}_Density': 0.0 for ticker in COMPANY_KEYWORDS})
        densities = {}
        for ticker, keywords in COMPANY_KEYWORDS.items():
            pattern = r'\b(' + '|'.join(re.escape(k) for k in keywords) + r')\b'
            mention_count = len(re.findall(pattern, text_lower))
            density = mention_count / total_words if total_words > 0 else 0
            densities[f'{ticker}_Density'] = density
        return pd.Series(densities)

    # --- 4. Apply the function to the DataFrame ---
    print("\nCalculating mention densities for all articles. This may take a few minutes...")
    density_scores = news_mention_density_df['text'].apply(calculate_all_densities)
    
    news_mention_density_df = pd.concat([news_mention_density_df, density_scores], axis=1)
    print("Mention density calculation complete.")
    
    # --- 5. Display Results ---
    print("\n--- DataFrame with Mention Density Scores ---")
    display(news_mention_density_df[['date', 'ticker', 'AAPL_Density', 'NVDA_Density', 'GOOGL_Density']].head())

#### Now that we have `Mention_Density` for each company, we can select the Top `5` articles per day for each company that have a `Mention_Density` score of >1%.
We will also include `article_volume` for each day, representing the total number of articles published for that company (with `Mention_Density` >1%.) in that day. 

In [None]:
print("3.1.2: Final Optimized Down-Sampling with Language Filter and Volume")
print("-"*30)

# --- Import language detection library ---
import sys
!{sys.executable} -m pip install langdetect --quiet
from langdetect import detect, LangDetectException

# --- 1. Define Filtering Parameters ---
# MIN_DENSITY_THRESHOLD: The minimum relevance score an article must have to be considered.
MIN_DENSITY_THRESHOLD = 0.01
# TOP_N_ARTICLES: The maximum number of highest-scoring articles to select for any given day.
TOP_N_ARTICLES = 5

print(f"Filtering Parameters:")
print(f" - Minimum Mention Density Threshold: {MIN_DENSITY_THRESHOLD}")
print(f" - Top N Articles per Day: {TOP_N_ARTICLES}\n")

# --- 2. Initialize Storage ---
# This dictionary will hold the final, filtered DataFrames, one for each company.
company_top_articles = {}

# Check if the main DataFrame from the previous step is available in memory.
if 'news_mention_density_df' in locals() and not news_mention_density_df.empty:
    
    # --- Step A: De-duplicate Articles ---
    # This ensures we only process each unique article text once per day.
    print(f"Original article entry count: {len(news_mention_density_df):,}")
    deduped_df = news_mention_density_df.drop_duplicates(subset=['date', 'text']).copy()
    print(f"De-duplicated article count: {len(deduped_df):,}")
    
    # --- Step B: Lightweight Language Filtering with Progress Indicator ---
    # This function checks only the first 100 characters of text for speed.
    def is_english_fast(text):
        try:
            # We only need a small sample of the text to accurately detect the language.
            sample = text[:100] if isinstance(text, str) else ''
            # Return True only if the sample is valid and detected as English ('en').
            return sample.strip() and detect(sample) == 'en'
        except LangDetectException:
            # If detection fails, we assume it's not the language we want.
            return False

    print("\nFiltering for English-language articles (with progress updates)...")
    total_texts_to_check = len(deduped_df)
    print_interval = 10000  # How often to print an update.
    
    english_mask = []
    # We use an explicit loop here to provide progress feedback.
    for i, text in enumerate(deduped_df['text']):
        english_mask.append(is_english_fast(text))
        
        # This block prints a status update at the specified interval.
        if (i + 1) % print_interval == 0 or (i + 1) == total_texts_to_check:
            print(f"  > Language check progress: {i + 1:,} of {total_texts_to_check:,} articles processed...")

    # Use the generated boolean mask to select only the English articles.
    english_df = deduped_df[english_mask]
    print(f"\nFiltered down to {len(english_df):,} English articles.\n")
    
    # --- Step 3. Process Each Company Independently ---
    for ticker in COMPANY_KEYWORDS.keys():
        density_col = f'{ticker}_Density'
        print(f"--- Processing {ticker} ---")

        # Step C: Filter for relevance based on the density threshold.
        relevant_articles_df = english_df[english_df[density_col] >= MIN_DENSITY_THRESHOLD].copy()
        
        if relevant_articles_df.empty:
            print(f"  > No articles found for {ticker} above the density threshold. Skipping.")
            company_top_articles[ticker] = pd.DataFrame()
            continue
        
        print(f"  > Found {len(relevant_articles_df):,} relevant English articles for {ticker}.")

        # Step D: Calculate daily article volume from the relevant set.
        daily_volume = relevant_articles_df.groupby('date').size().rename('article_volume')
        
        # Step E: Perform the "Top N per Day" selection.
        top_n_df = relevant_articles_df.groupby('date').apply(
            lambda x: x.nlargest(TOP_N_ARTICLES, density_col),
            include_groups=False
        ).reset_index(level=0)
        
        # Step F: Merge the daily volume feature onto the down-sampled DataFrame.
        top_n_df = top_n_df.merge(daily_volume, on='date', how='left')
        
        # Step G: Clean up the ticker column for clarity.
        top_n_df = top_n_df.drop(columns=['ticker'])
        top_n_df['assigned_ticker'] = ticker
        
        print(f"  > Down-sampled to {len(top_n_df):,} top articles for NLP analysis.")

        # Store the final, enriched DataFrame in the dictionary.
        company_top_articles[ticker] = top_n_df
    
    print("\nDown-sampling and all filtering complete.")
    
    # --- 4. Display a Sample of the Results ---
    print("\n--- Sample of Final 'AAPL' Articles ---")
    if 'AAPL' in company_top_articles and not company_top_articles['AAPL'].empty:
        display(company_top_articles['AAPL'][['date', 'assigned_ticker', 'AAPL_Density', 'article_volume']].head())
    else:
        print("No articles to display for AAPL.")
        
else:
    print("ERROR: 'news_mention_density_df' not found. Please run cell 3.1.1 first.")

In [None]:
for TICKER in list(company_top_articles):
    print(TICKER)
    display(company_top_articles[TICKER].head(10))

In [None]:
print("3.1.3: Inspect Down-Sampled DataFrames for Missing Data and Volume")
print("-"*30)

if 'company_top_articles' in locals() and isinstance(company_top_articles, dict):

    full_date_range = pd.date_range(start=f'{START_YEAR}-01-01', end=f'{END_YEAR}-12-31', freq='D')
    total_days_in_period = len(full_date_range)
    
    print(f"Analyzing data coverage over a total period of {total_days_in_period} days ({START_YEAR}-{END_YEAR}).\n")

    for ticker, df in company_top_articles.items():
        print(f"--- Inspection for {ticker} ---")
        
        if df.empty:
            print("  > No articles were selected for this ticker. DataFrame is empty.\n")
            continue
        
        df['date'] = pd.to_datetime(df['date'])
        
        total_articles = len(df)
        unique_days_with_articles = df['date'].nunique()
        missing_days = total_days_in_period - unique_days_with_articles
        coverage_percentage = (unique_days_with_articles / total_days_in_period) * 100
        
        print(f"  Article Coverage:")
        print(f"    > Total selected articles for NLP: {total_articles:,}")
        print(f"    > Unique days with articles: {unique_days_with_articles:,}")
        print(f"    > Days with NO articles (to be imputed): {missing_days:,}")
        print(f"    > Daily coverage percentage: {coverage_percentage:.2f}%")
        
        # --- NEW: Reporting on 'article_volume' ---
        # To get daily stats, we first drop duplicates for each day since 'article_volume' is repeated.
        daily_stats_df = df.drop_duplicates(subset=['date'])
        
        print(f"\n  Daily Article Volume (for days with coverage):")
        print(f"    > Mean daily volume: {daily_stats_df['article_volume'].mean():.2f}")
        print(f"    > Median daily volume: {daily_stats_df['article_volume'].median():.0f}")
        print(f"    > Max daily volume: {daily_stats_df['article_volume'].max():.0f}")
        print(f"    > Min daily volume: {daily_stats_df['article_volume'].min():.0f}\n")
        
else:
    print("ERROR: 'company_top_articles' dictionary not found. Please run cell 3.1.2 first.")

#### These steps below are just exporting and verifying importing doesn't break anything.

In [None]:
print("3.1.4: Save Filtered Articles Dictionary to a File")
print("-"*30)

# Import the pickle library for object serialization.
import pickle

# Check if the dictionary from the previous cell exists.
if 'company_top_articles' in locals() and isinstance(company_top_articles, dict):
    
    # Define the output filename for our serialized dictionary.
    output_filename = "company_top_articles.pkl"

    print(f"Saving the 'company_top_articles' dictionary to '{output_filename}'...")
    print("This file will serve as a checkpoint to avoid re-running the filtering process.")

    try:
        # Open the file in write-binary ('wb') mode.
        with open(output_filename, 'wb') as f:
            # Use pickle.dump() to serialize the entire dictionary object into the file.
            pickle.dump(company_top_articles, f)
        
        print(f"\nDictionary successfully saved to '{output_filename}'.")
        
        # Provide a snippet for how to load the data in a future session.
        print("\nTo load this data in a future session, you can use the following code:")
        print("------------------------------------------------------------------")
        print("import pickle")
        print("with open('company_top_articles.pkl', 'rb') as f:")
        print("    company_top_articles = pickle.load(f)")
        print("------------------------------------------------------------------")

    except Exception as e:
        print(f"\nAn error occurred while saving the dictionary: {e}")
else:
    print("ERROR: 'company_top_articles' dictionary not found. Please run the previous cells first.")

In [None]:
print("3.1.5: Load and Verify the Filtered Articles Dictionary from File")
print("-"*30)

# Import the pickle library and pandas for displaying DataFrames.
import pickle
import pandas as pd

# Define the filename where the dictionary was saved.
FILENAME = "Data/company_top_articles.pkl"

try:
    # Open the file in read-binary ('rb') mode.
    with open(FILENAME, 'rb') as f:
        # Load the entire dictionary object from the pickle file.
        loaded_company_articles = pickle.load(f)
    
    print(f"Successfully loaded dictionary from '{FILENAME}'.")
    print("Verifying contents by displaying the head of each DataFrame...\n")
    
    # --- Verification Loop ---
    # Iterate through the keys (tickers) and values (DataFrames) of the loaded dictionary.
    for ticker, df in loaded_company_articles.items():
        print(f"--- Top 10 Articles for: {ticker} ---")
        
        if not df.empty:
            # Display the first 10 rows for visual inspection.
            display(df.head(10))
        else:
            print("  > DataFrame is empty for this ticker.")
        print("\n" + "-"*50 + "\n")

except FileNotFoundError:
    print(f"ERROR: The file '{FILENAME}' was not found. Please run cell 3.1.4 to save the file first.")
except Exception as e:
    print(f"An unexpected error occurred while loading the file: {e}")

#### And we have completed the Down-Sampling. Next, we calculate composite_sentiment scores using `FinBERT`
We now have some missing data because news articles (with `Mention_Density` >1%) are sometimes unavailable for certain days.\
Our lowest coverage is `82.79%` so we can safely reconstruct the missing sentiment data after we have computed these scores.

In [1]:
print("3.1.6: Install GPU-Enabled PyTorch (v2.6.0+) & Transformers")
print("-"*30)

import sys

# --- Step 1: Uninstall any old versions ---
# This is crucial to ensure a clean install.
print("Uninstalling previous torch versions (if any)...")
# !{sys.executable} -m pip uninstall torch torchvision torchaudio -y --quiet

# --- Step 2: Install CUDA-enabled PyTorch (v2.6.0+ for CUDA 12.4) ---
# We MUST use the 'cu124' index, as 'cu121' does not have torch 2.6.0+.
# This version is required by transformers to patch CVE-2025-32434.
print("Installing torch (>=2.6.0) for CUDA 12.4 (this may take a few minutes)...")
!{sys.executable} -m pip install torch>=2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 --quiet

# --- Step 3: Install Transformers & TQDM ---
# We add 'tqdm' here to get the progress bars and remove the warning.
print("Installing Hugging Face Transformers and TQDM...")
!{sys.executable} -m pip install transformers tqdm --quiet

# --- Step 4: Critical Kernel Restart ---
print("\n" + "-"*30)
print("! ! ! CRITICAL STEP ! ! !")
print("Installation complete. You MUST now restart the Jupyter kernel.")
print("In VS Code / Jupyter: Find the 'Restart' button for the kernel.")
print("After restarting, you can run the subsequent cells (3.1.7 onward).")
print("-"*30)

3.1.6: Install GPU-Enabled PyTorch (v2.6.0+) & Transformers
------------------------------
Uninstalling previous torch versions (if any)...
Installing torch (>=2.6.0) for CUDA 12.4 (this may take a few minutes)...
Installing Hugging Face Transformers and TQDM...



[notice] A new release of pip is available: 23.2.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip



------------------------------
! ! ! CRITICAL STEP ! ! !
Installation complete. You MUST now restart the Jupyter kernel.
In VS Code / Jupyter: Find the 'Restart' button for the kernel.
After restarting, you can run the subsequent cells (3.1.7 onward).
------------------------------



[notice] A new release of pip is available: 23.2.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [1]:
print("3.1.7: Verify Environment and Initialize Pipeline")
print("-"*30)

import torch
from transformers import pipeline, AutoTokenizer
import sys

# --- 1. Verify PyTorch Installation ---
print(f"PyTorch version: {torch.__version__}")
if not torch.__version__.startswith("2.6.0"):
    print("Warning: PyTorch version is not 2.6.0. This might cause issues.")

# --- 2. Initialize the Sentiment Analysis Pipeline with GPU Auto-Detection ---
print("\nInitializing the FinBERT sentiment analysis pipeline...")

# Check if a CUDA-compatible GPU is available
if torch.cuda.is_available():
    device_id = 0
    print(f"GPU detected: {torch.cuda.get_device_name(device_id)}. The pipeline will run on the GPU.")
else:
    device_id = -1
    print("No GPU detected or PyTorch CUDA is not installed. The pipeline will run on the CPU.")

# Load the main pipeline
# The device parameter automatically assigns the model to the GPU (0) or CPU (-1).
sentiment_pipeline = pipeline("sentiment-analysis", model="ProsusAI/finbert", device=device_id)

# Load the tokenizer separately.
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

print("\nFinBERT pipeline and tokenizer initialized successfully.")

3.1.7: Verify Environment and Initialize Pipeline
------------------------------


  from .autonotebook import tqdm as notebook_tqdm


PyTorch version: 2.6.0+cu124

Initializing the FinBERT sentiment analysis pipeline...
GPU detected: NVIDIA GeForce RTX 3050 Ti Laptop GPU. The pipeline will run on the GPU.


Device set to use cuda:0



FinBERT pipeline and tokenizer initialized successfully.


In [None]:
print("3.1.8: Re-import the dataframes after kernel restart")
print("-"*30)

import pickle
with open('Data/company_top_articles.pkl', 'rb') as f:
    company_top_articles = pickle.load(f)

In [None]:
print("3.1.9: Apply NLP with Chunking and Aggregate Daily Sentiment")
print("-"*30)

import pandas as pd
import numpy as np
from tqdm.auto import tqdm # For rich progress bars in notebooks

# --- 1. Define Helper Functions ---
# (These functions are unchanged)

def get_sentiment_score(result):
    """Converts the pipeline's output dictionary to a single numerical score."""
    label = result['label']
    score = result['score']
    if label == 'positive':
        return score
    elif label == 'negative':
        return -score
    return 0.0

def analyze_sentiment_with_chunking(text):
    """
    Analyzes sentiment of a text, handling long inputs by chunking them.
    Averages the sentiment scores of all chunks for a final document score.
    """
    # Max tokens for the model, leaving a small buffer for special tokens.
    max_length = 512
    # Amount of token overlap between chunks to maintain context.
    overlap = 25
    
    try:
        # Tokenize the text to see if it needs chunking.
        tokens = tokenizer.encode(str(text), return_tensors='pt', truncation=False)
        
        # If text is short enough, process it in one go.
        if tokens.size(1) <= max_length:
            result = sentiment_pipeline(str(text))[0]
            return get_sentiment_score(result)

        # If text is too long, split it into chunks.
        chunk_texts = []
        for i in range(0, tokens.size(1), max_length - overlap):
            chunk_tokens = tokens[0, i:i + max_length]
            chunk_texts.append(tokenizer.decode(chunk_tokens, skip_special_tokens=True))
        
        # Run the pipeline on the list of chunks.
        chunk_sentiments = sentiment_pipeline(chunk_texts)
        
        # Calculate the average sentiment score from all the chunks.
        scores = [get_sentiment_score(result) for result in chunk_sentiments]
        return sum(scores) / len(scores) if scores else 0.0

    except Exception:
        # If any error occurs (e.g., empty text), return a neutral score.
        return 0.0

# --- 2. Initialize Storage for Final Results ---
daily_sentiment_dfs = {}

if 'company_top_articles' in locals():
    print(f"Starting NLP analysis on {len(company_top_articles)} companies...")

    # --- 3. Iterate Through Each Company's DataFrame (with TQDM) ---
    for ticker, df in tqdm(company_top_articles.items(), desc="Processing all companies"):
        
        if df.empty:
            print(f"  > DataFrame for {ticker} is empty. Skipping.")
            daily_sentiment_dfs[ticker] = pd.DataFrame()
            continue

        # --- 4. Apply Sentiment Analysis with a Progress Bar ---
        # CORRECTED: Set the description for the inner progress bar by re-initializing tqdm.pandas()
        tqdm.pandas(desc=f"Analyzing {ticker} articles")
        
        # Now, call .progress_apply() without any extra arguments.
        df['sentiment_score'] = df['text'].progress_apply(analyze_sentiment_with_chunking)
        
        # --- 5. Aggregate to a Daily Time-Series DataFrame ---
        
        # Dynamically create the aggregation dictionary to include all density columns
        agg_dict = {
            'article_volume': ('article_volume', 'first'),
            'average_news_sentiment': ('sentiment_score', 'mean')
        }
        
        # Find all density columns in the DataFrame and add them to the aggregation.
        density_cols = [col for col in df.columns if '_Density' in col]
        for col in density_cols:
            agg_dict[f'AVG_{col}'] = (col, 'mean')

        daily_agg_df = df.groupby('date').agg(**agg_dict).reset_index()

        daily_agg_df = daily_agg_df.rename(columns={'date': 'Period'})
        
        print(f"\n  > Aggregation for {ticker} complete. Created daily time-series with {len(daily_agg_df)} unique days.")
        daily_sentiment_dfs[ticker] = daily_agg_df

    print("\n--- All NLP Processing and Aggregation Complete ---")
    
    # --- 6. Display a Sample of the Final Output ---
    print("\n--- Sample of Final Daily Sentiment DataFrame for 'AAPL' ---")
    if 'AAPL' in daily_sentiment_dfs and not daily_sentiment_dfs['AAPL'].empty:
        display(daily_sentiment_dfs['AAPL'].head())

else:
    print("ERROR: 'company_top_articles' dictionary not found. Please run the previous cells first.")

#### Now the dataframe has composte sentiment scores for each day and for each company.
We just want to fill in missing values via a simple `backward_fill` and `forward_fill` imputation strategy.

In [None]:
print("3.1.10: Impute Missing Values in Daily Company Sentiment")
print("-"*30)

if 'daily_sentiment_dfs' in locals() and isinstance(daily_sentiment_dfs, dict):

    # Create the full date range for the entire analysis period
    full_date_range = pd.date_range(start=f'{START_YEAR}-01-01', end=f'{END_YEAR}-12-31', freq='D')
    
    imputed_sentiment_dfs = {}
    
    print("Starting imputation for company-specific sentiment data...")

    for ticker, df in daily_sentiment_dfs.items():
        print(f"  > Processing {ticker}...")
        
        if df.empty:
            print(f"    - DataFrame for {ticker} is empty. Skipping.")
            imputed_sentiment_dfs[ticker] = pd.DataFrame()
            continue
        
        # Ensure 'Period' is a datetime object and set it as the index
        df['Period'] = pd.to_datetime(df['Period'])
        df = df.set_index('Period')
        
        original_rows = len(df)
        
        # Reindex the DataFrame to include all days in the range, creating NaNs for missing days
        imputed_df = df.reindex(full_date_range)
        
        # --- Imputation Strategy ---
        # 1. For 'article_volume', missing days correctly mean 0 articles were published.
        imputed_df['article_volume'] = imputed_df['article_volume'].fillna(0)
        
        # 2. For sentiment and density scores, the last known value persists.
        sentiment_cols = [col for col in imputed_df.columns if col != 'article_volume']
        imputed_df[sentiment_cols] = imputed_df[sentiment_cols].ffill().bfill()
        
        # Reset the index to bring 'Period' back as a column
        imputed_df = imputed_df.reset_index().rename(columns={'index': 'Period'})
        
        imputed_sentiment_dfs[ticker] = imputed_df
        
        print(f"    - Imputation complete. Rows changed from {original_rows} to {len(imputed_df)}.")

    # Overwrite the old dictionary with the new one containing the complete, imputed data
    daily_sentiment_dfs = imputed_sentiment_dfs
    
    print("\nImputation process complete for all tickers.")
    
    # Display a sample to verify the imputation
    print("\n--- Sample of Imputed 'NVDA' DataFrame (showing start of 2018) ---")
    if 'NVDA' in daily_sentiment_dfs and not daily_sentiment_dfs['NVDA'].empty:
        display(daily_sentiment_dfs['NVDA'].head())

else:
    print("ERROR: 'daily_sentiment_dfs' dictionary not found. Please run cell 3.1.9 first.")


In [None]:
print("3.1.11: Save Daily Sentiment DataFrames to a File")
print("-"*30)

import pickle
import os

# Check if the dictionary from the previous cell exists.
if 'daily_sentiment_dfs' in locals() and isinstance(daily_sentiment_dfs, dict):
    
    # Define the output directory and filename.
    output_dir = "Data"
    output_filename = "daily_sentiment_dfs.pkl"
    output_path = os.path.join(output_dir, output_filename)
    
    # Ensure the output directory exists.
    os.makedirs(output_dir, exist_ok=True)

    print(f"Saving the 'daily_sentiment_dfs' dictionary to '{output_path}'...")
    print("This file will serve as a checkpoint to avoid re-running the NLP analysis.")

    try:
        # Open the file in write-binary ('wb') mode.
        with open(output_path, 'wb') as f:
            # Use pickle.dump() to serialize the entire dictionary object into the file.
            pickle.dump(daily_sentiment_dfs, f)
        
        print(f"\nDictionary successfully saved to '{output_path}'.")
        
        # Provide a snippet for how to load the data in a future session.
        print("\nTo load this data in a future session, you can use the following code:")
        print("------------------------------------------------------------------")
        print("import pickle")
        print(f"with open('{output_path}', 'rb') as f:")
        print("    daily_sentiment_dfs = pickle.load(f)")
        print("------------------------------------------------------------------")

    except Exception as e:
        print(f"\nAn error occurred while saving the dictionary: {e}")
else:
    print("ERROR: 'daily_sentiment_dfs' dictionary not found. Please run cell 3.1.9 first.")


### 3.2: NLP Pipeline for `market_headlines_df`

In [None]:
print("3.2.0: Setup and Load Market Headlines Data")
print("-"*30)

import sys
# Ensure langdetect is installed for language filtering
!{sys.executable} -m pip install langdetect --quiet

import pandas as pd
import os
from langdetect import detect, LangDetectException

# Define the file path for the market headlines CSV
MARKET_HEADLINES_FILE = os.path.join("Data", "sp500_headlines_2008_2024.csv")

# --- Load the Dataset ---
# Prioritize using the DataFrame if it's already in memory from cell 2.3.1.
if 'market_headlines_df' in locals() and isinstance(market_headlines_df, pd.DataFrame):
    print("Using 'market_headlines_df' from the current session.")
    # Create a copy to avoid modifying the original DataFrame
    headlines_df = market_headlines_df.copy()
else:
    # Fallback to loading from the CSV file
    print(f"Loading market headlines from '{MARKET_HEADLINES_FILE}'...")
    try:
        headlines_df = pd.read_csv(MARKET_HEADLINES_FILE)
        # Ensure date column is in the correct format if loading fresh
        headlines_df['Date'] = pd.to_datetime(headlines_df['Date'])
        print("DataFrame loaded successfully.")
    except FileNotFoundError:
        print(f"ERROR: File not found at '{MARKET_HEADLINES_FILE}'. Please run cell 2.3.1 first.")
        headlines_df = pd.DataFrame() # Create empty DF to prevent downstream errors

if not headlines_df.empty:
    print(f"\nLoaded {len(headlines_df):,} total headlines for the period {START_YEAR}-{END_YEAR}.")

In [None]:
print("3.2.1: Filter for English-Language Headlines")
print("-"*30)

if not 'headlines_df' in locals() or headlines_df.empty:
    print("ERROR: 'headlines_df' not found or is empty. Skipping filtering.")
else:
    # This function checks if a given text is English.
    def is_english(text):
        try:
            # Check for valid string input for first 100 characters
            return isinstance(text, str) and detect(text[:100]) == 'en'
        except LangDetectException:
            # If detection fails, assume it's not the language we want.
            return False

    print("Filtering for English headlines. This may take a moment...")
    # Apply the language filter to the 'Title' column
    english_mask = headlines_df['Title'].apply(is_english)
    english_market_headlines_df = headlines_df[english_mask]

    original_count = len(headlines_df)
    filtered_count = len(english_market_headlines_df)
    print(f"Filtering complete.")
    print(f"  > Original headline count: {original_count:,}")
    print(f"  > English headline count:  {filtered_count:,} ({filtered_count/original_count:.2%})")

In [None]:
print("3.2.2: Analyze Data for Missing Days")
print("-"*30)

if 'english_market_headlines_df' in locals() and not english_market_headlines_df.empty:
    # Create a complete date range for our analysis period
    full_date_range = pd.date_range(start=f'{START_YEAR}-01-01', end=f'{END_YEAR}-12-31', freq='D')
    
    # Get the unique days that have at least one headline
    unique_days_with_headlines = english_market_headlines_df['Date'].nunique()
    total_days_in_period = len(full_date_range)
    missing_days = total_days_in_period - unique_days_with_headlines
    
    print(f"Analyzing headline coverage from {START_YEAR} to {END_YEAR} ({total_days_in_period} days total).")
    print(f"  > Unique days with at least one headline: {unique_days_with_headlines:,}")
    print(f"  > Days with NO headlines (to be imputed): {missing_days:,}")

    if missing_days > 0:
        print("\nMissing days detected. Imputation will be required in a later step.")
    else:
        print("\nComplete daily coverage found. No imputation needed.")
else:
    print("ERROR: 'english_market_headlines_df' not found. Cannot perform analysis.")


In [None]:
print("3.2.3: Calculate and Aggregate Daily Market Sentiment")
print("-"*30)

if 'english_market_headlines_df' in locals() and not english_market_headlines_df.empty:
    
    # --- 1. Calculate Sentiment for Each Headline ---
    # Register tqdm with pandas and set the description for the progress bar.
    tqdm.pandas(desc="Analyzing market headlines")
    
    # Since headlines are short, the chunking function will process them quickly.
    # We re-use it for consistency with the previous NLP task.
    english_market_headlines_df['sentiment_score'] = english_market_headlines_df['Title'].progress_apply(analyze_sentiment_with_chunking)
    
    # --- 2. Aggregate to a Daily Time-Series ---
    print("\nAggregating sentiment scores into a daily average...")
    daily_market_sentiment_df = english_market_headlines_df.groupby('Date').agg(
        market_average_sentiment=('sentiment_score', 'mean')
    ).reset_index()
    
    # Rename 'Date' to 'Period' for consistency
    daily_market_sentiment_df = daily_market_sentiment_df.rename(columns={'Date': 'Period'})

    print("Aggregation complete.")
    print("\n--- Sample of Daily Market Sentiment ---")
    display(daily_market_sentiment_df.head())
    
else:
    print("ERROR: 'english_market_headlines_df' is not available. Skipping sentiment analysis.")


In [None]:
print("3.2.4: Impute Missing Sentiment Values")
print("-"*30)

if 'daily_market_sentiment_df' in locals() and not daily_market_sentiment_df.empty:
    
    # Create a copy to work with, preventing state issues on re-runs
    df_to_impute = daily_market_sentiment_df.copy()

    # Ensure 'Period' column is in datetime format
    df_to_impute['Period'] = pd.to_datetime(df_to_impute['Period'])

    # Set 'Period' as the index to perform time-series operations
    df_to_impute = df_to_impute.set_index('Period')
    
    # Create the full daily date range again
    full_date_range = pd.date_range(start=f'{START_YEAR}-01-01', end=f'{END_YEAR}-12-31', freq='D')
    
    # Reindex the DataFrame to include all days in the range
    imputed_market_sentiment_df = df_to_impute.reindex(full_date_range)
    
    # Use forward-fill and then back-fill to handle all missing values
    imputed_market_sentiment_df['market_average_sentiment'] = imputed_market_sentiment_df['market_average_sentiment'].ffill().bfill()
    
    # Reset the index to bring 'Period' back to a column
    imputed_market_sentiment_df = imputed_market_sentiment_df.reset_index().rename(columns={'index': 'Period'})

    print(f"Imputation complete. DataFrame now contains {len(imputed_market_sentiment_df)} days.")
    print("\n--- Sample of Final Imputed Market Sentiment ---")
    display(imputed_market_sentiment_df.head())

else:
    print("ERROR: 'daily_market_sentiment_df' is not available. Skipping imputation.")

In [None]:
print("3.2.5: Save Final Market Sentiment DataFrame")
print("-"*30)

import pickle
import os

if 'imputed_market_sentiment_df' in locals() and not imputed_market_sentiment_df.empty:
    
    output_dir = "Data"
    output_filename = "market_sentiment_df.pkl"
    output_path = os.path.join(output_dir, output_filename)
    
    os.makedirs(output_dir, exist_ok=True)

    print(f"Saving the final market sentiment DataFrame to '{output_path}'...")

    try:
        with open(output_path, 'wb') as f:
            pickle.dump(imputed_market_sentiment_df, f)
        
        print(f"\nDataFrame successfully saved to '{output_path}'.")
        
    except Exception as e:
        print(f"\nAn error occurred while saving the DataFrame: {e}")
else:
    print("ERROR: 'imputed_market_sentiment_df' not found. Nothing to save.")


### 3.3: NLP Pipeline for `filings_df`

In [None]:
print("3.3.0: Setup and Environment Check for SEC Filings NLP")
print("-"*30)

import re
import numpy as np
import pandas as pd
import os

# Check if 'filings_df' is already loaded in the environment
if 'filings_df' in locals() and isinstance(filings_df, pd.DataFrame) and not filings_df.empty:
    print("`filings_df` is loaded and ready for processing.")
    print(f"Total filings to process: {len(filings_df):,}")
    # Create a working copy to avoid modifying the original DataFrame
    working_filings_df = filings_df.copy()
else:
    print("`filings_df` not found or is empty. Attempting to load from file...")
    data_path = "Data/filtered_filings.csv.gz"
    
    try:
        print(f"Loading from '{data_path}'...")
        # Load the dataframe from the gzipped CSV
        filings_df = pd.read_csv(data_path, compression='gzip')
        
        # Convert 'filing_date' back to datetime objects, as CSVs don't preserve the type
        if 'filing_date' in filings_df.columns:
            filings_df['filing_date'] = pd.to_datetime(filings_df['filing_date'])

        print(f"Successfully loaded `filings_df` with {len(filings_df):,} records.")
        
        # Create a working copy for the NLP tasks
        working_filings_df = filings_df.copy()
        print(f"Total filings to process: {len(working_filings_df):,}")

    except FileNotFoundError:
        print(f"ERROR: Backup file not found at '{data_path}'.")
        print("Please run Section 2.4 to generate it.")
        # Create an empty dataframe to prevent downstream errors
        working_filings_df = pd.DataFrame()
    except Exception as e:
        print(f"An unexpected error occurred while loading the data: {e}")
        # Create an empty dataframe to prevent downstream errors
        working_filings_df = pd.DataFrame()

if 'working_filings_df' not in locals():
     working_filings_df = pd.DataFrame()

if not working_filings_df.empty:
    print("`working_filings_df` is ready.")
else:
    print("Warning: `working_filings_df` is empty. Subsequent cells may not run correctly.")

In [None]:
for item in working_filings_df['text']:
    print(item)

In [None]:
print("3.3.1: Normalize Filing Text using FinBERT Tokenizer")
print("-"*30)

import pandas as pd
from transformers import AutoTokenizer
from tqdm.auto import tqdm

# We use the same tokenizer as the sentiment model to ensure consistency.
MODEL_NAME = "ProsusAI/finbert"
try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    print(f"Successfully loaded tokenizer for '{MODEL_NAME}'.")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    tokenizer = None

def normalize_text_with_tokenizer(text, tokenizer_instance):
    """
    Cleans and normalizes text by tokenizing and re-joining it using a
    pretrained transformer tokenizer. This is the most effective way to
    prepare text for the corresponding model.
    """
    if not isinstance(text, str) or not tokenizer_instance:
        return ""
    
    # This process handles special characters, whitespace, and subword splitting
    # according to the model's vocabulary.
    tokens = tokenizer_instance.tokenize(text)
    
    # Convert tokens back to a single string.
    return tokenizer_instance.convert_tokens_to_string(tokens)

if 'working_filings_df' in locals() and not working_filings_df.empty and tokenizer:
    print("Applying tokenizer-based normalization to the 'text' column...")
    tqdm.pandas(desc="Normalizing filing text")
    working_filings_df['cleaned_text'] = working_filings_df['text'].progress_apply(
        normalize_text_with_tokenizer, 
        tokenizer_instance=tokenizer
    )
    print("Text normalization complete.")
elif tokenizer is None:
    print("Skipping normalization as tokenizer failed to load.")
else:
    print("Skipping text normalization as `working_filings_df` is empty.")

In [None]:
display(working_filings_df.head(200))

In [None]:
print("3.3.2: Inspect for Empty Filings Post-Cleaning")
print("-"*30)

import numpy as np

if not working_filings_df.empty:
    # A filing is considered "empty" if its cleaned text has fewer than 100 characters.
    # This lower threshold is less likely to incorrectly flag short filings like 8-Ks.
    char_threshold = 100
    working_filings_df['is_empty'] = working_filings_df['cleaned_text'].str.len() < char_threshold
    
    empty_count = working_filings_df['is_empty'].sum()
    total_count = len(working_filings_df)
    
    print(f"Found {empty_count} filings with insufficient text (less than {char_threshold} characters) after cleaning.")
    
    if empty_count > 0:
        print("These will be assigned a neutral sentiment score by skipping NLP processing.")
        # Replace the text of empty filings with NaN
        working_filings_df.loc[working_filings_df['is_empty'], 'cleaned_text'] = np.nan
else:
    print("Skipping inspection as `working_filings_df` is empty.")

In [None]:
display(working_filings_df)

In [None]:
print("3.3.3: Calculate Filing Sentiment and Reshape Data")
print("-"*30)

import pandas as pd
import numpy as np
import warnings
import torch
from transformers import pipeline, AutoTokenizer

# --- 0. Suppress Warnings & Setup Environment ---
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning, module='transformers')
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'GPU' if device == 0 else 'CPU'}")

# --- 1. Helper Functions for Advanced Sentiment Analysis ---
MODEL_NAME = "ProsusAI/finbert"
try:
    # Initialize the pipeline once
    sentiment_pipeline = pipeline("sentiment-analysis", model=MODEL_NAME, device=device)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    MAX_CHUNK_LENGTH = 512
    OVERLAP = 50
except Exception as e:
    print(f"Error loading model {MODEL_NAME}: {e}")
    sentiment_pipeline = None

def analyze_sentiment_with_full_distribution(text, pipeline_instance, tokenizer_instance):
    """
    Analyzes sentiment using the model's full probability distribution
    to create a continuous score from -1 to 1.
    """
    if pd.isna(text) or not isinstance(text, str) or not text.strip() or not pipeline_instance:
        return 0.0

    tokens = tokenizer_instance.encode(text, add_special_tokens=False)
    if not tokens:
        return 0.0
    
    chunk_step = MAX_CHUNK_LENGTH - OVERLAP
    token_chunks = [tokens[i:i + MAX_CHUNK_LENGTH] for i in range(0, len(tokens), chunk_step)]
    text_chunks = [tokenizer_instance.decode(chunk) for chunk in token_chunks]
    
    if not text_chunks:
        return 0.0

    try:
        # KEY CHANGE: Get scores for ALL labels (positive, negative, neutral)
        sentiments_per_chunk = pipeline_instance(text_chunks, truncation=True, return_all_scores=True)
    except Exception:
        return 0.0
    
    chunk_scores = []
    for sentiment_list in sentiments_per_chunk:
        # sentiment_list is now like [{'label':'positive', 'score':...}, {'label':'negative', 'score':...}]
        scores_dict = {item['label'].lower(): item['score'] for item in sentiment_list}
        
        # The score is the balance between positive and negative confidence
        score = scores_dict.get('positive', 0.0) - scores_dict.get('negative', 0.0)
        chunk_scores.append(score)
        
    return sum(chunk_scores) / len(chunk_scores) if chunk_scores else 0.0

# --- 2. Main Processing Logic with Manual Progress Updates ---
company_filings_sentiment_dfs = {}

if 'working_filings_df' in locals() and not working_filings_df.empty and sentiment_pipeline:
    total_companies = len(TARGET_TICKERS)
    print(f"Starting sentiment analysis for {total_companies} companies...")

    for i, ticker in enumerate(TARGET_TICKERS):
        print(f"\n--- Processing company {i+1}/{total_companies}: {ticker} ---")
        ticker_df = working_filings_df[working_filings_df['ticker'] == ticker].copy()

        if ticker_df.empty:
            print(f"   > No filings found for {ticker}. Skipping.")
            continue

        sentiment_scores = []
        total_filings = len(ticker_df)
        print(f"   > Analyzing {total_filings} filings for {ticker}...")
        
        for idx, row in ticker_df.iterrows():
            score = analyze_sentiment_with_full_distribution(row['cleaned_text'], sentiment_pipeline, tokenizer)
            sentiment_scores.append(score)
            
            if (len(sentiment_scores)) % 20 == 0 and len(sentiment_scores) > 0:
                print(f"   ... processed {len(sentiment_scores)}/{total_filings} filings for {ticker}")

        print(f"   > Finished analysis for {ticker}.")
        ticker_df['sentiment_score'] = sentiment_scores
        ticker_df['sentiment_score'] = ticker_df['sentiment_score'].fillna(0.0)

        pivoted_df = ticker_df.pivot_table(index='filing_date', columns='form_type', values='sentiment_score', aggfunc='mean')
        
        if not pivoted_df.empty:
            full_date_range = pd.date_range(start='2018-01-01', end='2024-12-31', freq='D')
            pivoted_df = pivoted_df.reindex(full_date_range)
            pivoted_df.ffill(inplace=True)
            pivoted_df.bfill(inplace=True)
        
        pivoted_df = pivoted_df.rename(columns={'10-K': '10-K_sentiment', '10-Q': '10-Q_sentiment', '8-K': '8-K_sentiment'})
        company_filings_sentiment_dfs[ticker] = pivoted_df

    print("\n\nAll tickers processed. Dictionary of filing sentiment DataFrames is ready.")

    if company_filings_sentiment_dfs:
        display_ticker = next(iter(company_filings_sentiment_dfs))
        print(f"\nSample of daily-indexed, forward-filled data for {display_ticker}:\n")
        display(company_filings_sentiment_dfs[display_ticker].head(10))
else:
    print("`working_filings_df` is empty or sentiment pipeline failed to load. Skipping NLP and reshaping.")

In [None]:
display(pd.DataFrame([company_filings_sentiment_dfs['AAPL']['8-K_sentiment']]).describe())

In [None]:
print("3.3.4: Impute Missing Daily Filing Sentiments")
print("-"*30)

if company_filings_sentiment_dfs:
    
    imputed_filings_sentiment_dfs = {}
    full_date_range = pd.date_range(start=f'{START_YEAR}-01-01', end=f'{END_YEAR}-12-31', freq='D')

    for ticker, df in company_filings_sentiment_dfs.items():
        # Reindex to the full daily range
        imputed_df = df.reindex(full_date_range)
        
        # Forward-fill and then back-fill to ensure no NaNs remain
        imputed_df = imputed_df.ffill().bfill()
        
        imputed_df = imputed_df.reset_index().rename(columns={'index': 'Period'})
        imputed_filings_sentiment_dfs[ticker] = imputed_df
        print(f"  > Imputed daily filing sentiment for {ticker}.")
        
    # Overwrite the old dictionary with the imputed one
    company_filings_sentiment_dfs = imputed_filings_sentiment_dfs
    
    print("\nImputation complete for all tickers.")
    print("\n--- Sample of Final Imputed Filing Sentiment for 'AAPL' ---")
    display(company_filings_sentiment_dfs['AAPL'].head())
    
else:
    print("`company_filings_sentiment_dfs` is empty. Nothing to impute.")

In [None]:
print("3.3.5: Save Filing Sentiment DataFrames")
print("-"*30)

import pickle

if company_filings_sentiment_dfs:
    
    output_path = os.path.join("Data", "company_filings_sentiment.pkl")
    os.makedirs("Data", exist_ok=True)
    
    print(f"Saving the 'company_filings_sentiment_dfs' dictionary to '{output_path}'...")
    with open(output_path, 'wb') as f:
        pickle.dump(company_filings_sentiment_dfs, f)
    print("Save complete.")

else:
    print("Dictionary is empty. Nothing to save.")

In [None]:
print("3.3.6: Checkpoint - How to Load Filing Sentiment Data")
print("-"*30)

import pickle
import os

path = os.path.join("Data", "company_filings_sentiment.pkl")

print("To load the saved filing sentiment data in a future session, run this code:")
print("-" * 65)
print("import pickle")
print(f"with open('{path}', 'rb') as f:")
print("    loaded_filings_sentiment = pickle.load(f)")
print("\n# Example: Accessing the AAPL DataFrame")
print("aapl_df = loaded_filings_sentiment.get('AAPL')")
print("-" * 65)

#### NLP Pipeline is now complete. We have all the necessary data.

## Phase 4: Creating the `DataFrames`

In this last part, we just merge our dataframes into one dataframe per company. All dataframes will have the same features.

In [9]:
print("4.1: Load All Processed Data Sources")
print("-"*30)

import os
import pickle

# --- 1. Define Helper Function for Loading ---
def load_pickle_data(filename):
    """Loads a pickle file from the 'Data' directory and returns its content."""
    path = os.path.join("Data", filename)
    print(f"Loading '{path}'...")
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        print(f"  > ERROR: File not found. Please ensure the required file exists.")
        return None
    except Exception as e:
        print(f"  > ERROR: An unexpected error occurred: {e}")
        return None

# --- 2. Load the Datasets ---
# Company-specific news sentiment
daily_sentiment_dfs = load_pickle_data("daily_sentiment_dfs.pkl")
# General market news sentiment
market_sentiment_df = load_pickle_data("market_sentiment_df.pkl")
# Company-specific SEC filing sentiment
company_filings_sentiment_dfs = load_pickle_data("company_filings_sentiment.pkl")

# --- 3. Access Live Insider Data ---
# This dictionary was created in Section 2.1 and should be in memory.
if 'insider_datasets' in locals():
    print("Accessing 'insider_datasets' from the current session.")
else:
    print("  > WARNING: 'insider_datasets' not found in memory. This data will be missing from the final merge.")
    insider_datasets = {} # Create an empty dict to prevent errors

print("\nData loading process complete.")

4.1: Load All Processed Data Sources
------------------------------
Loading 'Data\daily_sentiment_dfs.pkl'...
Loading 'Data\market_sentiment_df.pkl'...
Loading 'Data\company_filings_sentiment.pkl'...
Accessing 'insider_datasets' from the current session.

Data loading process complete.


In [10]:
print("4.2: Verify Loaded DataFrames")
print("-"*30)

# Verify company news sentiment
if daily_sentiment_dfs:
    print("\n--- Sample: Daily Sentiment for AAPL ---")
    display(daily_sentiment_dfs.get('AAPL', pd.DataFrame()).head(3))
else:
    print("\n'daily_sentiment_dfs' is not loaded.")

# Verify market sentiment
if market_sentiment_df is not None:
    print("\n--- Sample: General Market Sentiment ---")
    display(market_sentiment_df.head(3))
else:
    print("\n'market_sentiment_df' is not loaded.")
    
# Verify filing sentiment
if company_filings_sentiment_dfs:
    print("\n--- Sample: Filing Sentiment for AAPL ---")
    display(company_filings_sentiment_dfs.get('AAPL', pd.DataFrame()).head(3))
else:
    print("\n'company_filings_sentiment_dfs' is not loaded.")

# Verify insider data
if insider_datasets:
    print("\n--- Sample: Insider Data for AAPL ---")
    display(insider_datasets.get('AAPL', pd.DataFrame()).head(3))
else:
    print("\n'insider_datasets' is not loaded.")

4.2: Verify Loaded DataFrames
------------------------------

--- Sample: Daily Sentiment for AAPL ---


Unnamed: 0,Period,article_volume,average_news_sentiment,AVG_AAPL_Density,AVG_NVDA_Density,AVG_GOOGL_Density
0,2018-01-01,8.0,-0.017894,0.068807,0.0125,0.0
1,2018-01-02,51.0,-0.127239,0.166147,0.0,0.0
2,2018-01-03,62.0,-0.518287,0.124039,0.0,0.0



--- Sample: General Market Sentiment ---


Unnamed: 0,Period,market_average_sentiment
0,2018-01-01,0.311223
1,2018-01-02,0.311223
2,2018-01-03,-0.018505



--- Sample: Filing Sentiment for AAPL ---


form_type,Period,10-K_sentiment,10-Q_sentiment,8-K_sentiment
0,2018-01-01,-0.146227,-0.217508,-0.023673
1,2018-01-02,-0.146227,-0.217508,-0.023673
2,2018-01-03,-0.146227,-0.217508,-0.023673



--- Sample: Insider Data for AAPL ---


Unnamed: 0,Period,MSPR
0,2018-Q1,-100.0
1,2018-Q1,7.840257
2,2018-Q2,-22.737514


In [14]:
print("4.3: Merge All Data Sources into Final Company DataFrames")
print("-"*30)

from functools import reduce

final_company_dfs = {}

# Check if all data sources are available before starting
if not all([daily_sentiment_dfs, market_sentiment_df is not None, company_filings_sentiment_dfs, insider_datasets]):
    print("ERROR: One or more required data sources are missing. Cannot proceed with the merge.")
else:
    print("Starting merge process for each target ticker...")
    for ticker in TARGET_TICKERS:
        print(f"  > Processing {ticker}...")
        
        # --- 1. Prepare all DataFrames for the merge ---
        
        # Company-specific news sentiment (Base DataFrame)
        df1 = daily_sentiment_dfs[ticker].copy()
        
        # General market sentiment
        df2 = market_sentiment_df.copy()

        # Company-specific filing sentiment
        df3 = company_filings_sentiment_dfs[ticker].copy()
        
        # Insider sentiment (needs pre-processing)
        insider_df = insider_datasets[ticker].copy()
        # Rename date column and convert to datetime
        insider_df = insider_df.rename(columns={'filingDate': 'Period'})
        insider_df['Period'] = pd.to_datetime(insider_df['Period'])
        # Aggregate by day in case of multiple filings on the same day
        df4 = insider_df.groupby('Period').agg(mspr=('MSPR', 'mean')).reset_index()

        # --- 2. Perform the merge ---
        data_frames_to_merge = [df1, df2, df3, df4]
        
        # Use reduce to cleanly merge all dataframes in the list on 'Period'
        # 'how=left' ensures we start with the complete daily index from our base df
        merged_df = reduce(lambda left, right: pd.merge(left, right, on='Period', how='left'), data_frames_to_merge)
        
        # --- 3. Impute the sparse 'mspr' column ---
        # The only column with NaNs should be 'mspr' from the sparse insider data
        merged_df['mspr'] = merged_df['mspr'].ffill().bfill()

        final_company_dfs[ticker] = merged_df
        print(f"    - Merge complete for {ticker}. Final shape: {merged_df.shape}")

    print("\n--- All companies processed. Final DataFrames are ready. ---")
    print("\n--- Sample of Final Consolidated DataFrame for 'NVDA' ---")
    display(final_company_dfs['NVDA'].head())

4.3: Merge All Data Sources into Final Company DataFrames
------------------------------
Starting merge process for each target ticker...
  > Processing AAPL...
    - Merge complete for AAPL. Final shape: (2557, 11)
  > Processing NVDA...
    - Merge complete for NVDA. Final shape: (2557, 11)
  > Processing GOOGL...
    - Merge complete for GOOGL. Final shape: (2557, 11)

--- All companies processed. Final DataFrames are ready. ---

--- Sample of Final Consolidated DataFrame for 'NVDA' ---


  insider_df['Period'] = pd.to_datetime(insider_df['Period'])
  insider_df['Period'] = pd.to_datetime(insider_df['Period'])
  insider_df['Period'] = pd.to_datetime(insider_df['Period'])


Unnamed: 0,Period,article_volume,average_news_sentiment,AVG_AAPL_Density,AVG_NVDA_Density,AVG_GOOGL_Density,market_average_sentiment,10-K_sentiment,10-Q_sentiment,8-K_sentiment,mspr
0,2018-01-01,1.0,0.0,0.0625,0.0625,0.0,0.311223,-0.073956,0.020997,-0.032128,-43.780097
1,2018-01-02,7.0,0.0,0.000997,0.041338,0.0,0.311223,-0.073956,0.020997,-0.032128,-43.780097
2,2018-01-03,10.0,0.234918,0.0,0.071479,0.0,-0.018505,-0.073956,0.020997,-0.032128,-43.780097
3,2018-01-04,13.0,-0.065777,0.005405,0.041763,0.0,0.0,-0.073956,0.020997,-0.032128,-43.780097
4,2018-01-05,9.0,0.375672,0.0125,0.071191,0.0,0.071511,-0.073956,0.020997,-0.032128,-43.780097


In [15]:
display(final_company_dfs['AAPL'].head())

Unnamed: 0,Period,article_volume,average_news_sentiment,AVG_AAPL_Density,AVG_NVDA_Density,AVG_GOOGL_Density,market_average_sentiment,10-K_sentiment,10-Q_sentiment,8-K_sentiment,mspr
0,2018-01-01,8.0,-0.017894,0.068807,0.0125,0.0,0.311223,-0.146227,-0.217508,-0.023673,-46.079872
1,2018-01-02,51.0,-0.127239,0.166147,0.0,0.0,0.311223,-0.146227,-0.217508,-0.023673,-46.079872
2,2018-01-03,62.0,-0.518287,0.124039,0.0,0.0,-0.018505,-0.146227,-0.217508,-0.023673,-46.079872
3,2018-01-04,53.0,0.222064,0.148336,0.0,0.0,0.0,-0.146227,-0.217508,-0.023673,-46.079872
4,2018-01-05,104.0,-0.443398,0.173987,0.0,0.0,0.071511,-0.146227,-0.217508,-0.023673,-46.079872


In [16]:
display(final_company_dfs['GOOGL'].head())

Unnamed: 0,Period,article_volume,average_news_sentiment,AVG_AAPL_Density,AVG_NVDA_Density,AVG_GOOGL_Density,market_average_sentiment,10-K_sentiment,10-Q_sentiment,8-K_sentiment,mspr
0,2018-01-01,13.0,-0.014047,0.0,0.0,0.071884,0.311223,-0.094467,-0.055268,-0.035295,-44.976149
1,2018-01-02,25.0,0.031997,0.0,0.0,0.054931,0.311223,-0.094467,-0.055268,-0.035295,-44.976149
2,2018-01-03,58.0,-0.333055,0.0,0.0,0.119026,-0.018505,-0.094467,-0.055268,-0.035295,-44.976149
3,2018-01-04,51.0,-0.224332,0.0,0.0,0.124848,0.0,-0.094467,-0.055268,-0.035295,-44.976149
4,2018-01-05,64.0,0.180071,0.014286,0.0,0.106121,0.071511,-0.094467,-0.055268,-0.035295,-44.976149


In [17]:
print("4.4: Export Final Company DataFrames")
print("-"*30)

import os
import pickle

# Check if the final dictionary exists before attempting to save.
if 'final_company_dfs' in locals() and isinstance(final_company_dfs, dict):
    
    # Define the new directory for the final datasets.
    output_dir = "Datasets"
    os.makedirs(output_dir, exist_ok=True)
    print(f"Ensured output directory '{output_dir}' exists.")

    # --- 1. Save the entire dictionary as a single .pkl file ---
    pkl_file_path = os.path.join(output_dir, 'final_company_dfs.pkl')
    print(f"\nSaving the complete dictionary to '{pkl_file_path}'...")
    try:
        with open(pkl_file_path, 'wb') as f:
            pickle.dump(final_company_dfs, f)
        print("  > Dictionary saved successfully.")
    except Exception as e:
        print(f"  > An error occurred while saving the dictionary: {e}")

    # --- 2. Save each DataFrame as a separate .csv file ---
    print("\nSaving each company's DataFrame as a separate CSV file...")
    for ticker, df in final_company_dfs.items():
        csv_file_path = os.path.join(output_dir, f"{ticker}_dataset.csv")
        try:
            # Use index=False to avoid writing the DataFrame index as a column.
            df.to_csv(csv_file_path, index=False)
            print(f"  > Successfully saved '{csv_file_path}'")
        except Exception as e:
            print(f"  > An error occurred while saving the CSV for {ticker}: {e}")
            
    print("\n--- All datasets have been exported. ---")

else:
    print("ERROR: 'final_company_dfs' dictionary not found. Nothing to export.")

4.4: Export Final Company DataFrames
------------------------------
Ensured output directory 'Datasets' exists.

Saving the complete dictionary to 'Datasets\final_company_dfs.pkl'...
  > Dictionary saved successfully.

Saving each company's DataFrame as a separate CSV file...
  > Successfully saved 'Datasets\AAPL_dataset.csv'
  > Successfully saved 'Datasets\NVDA_dataset.csv'
  > Successfully saved 'Datasets\GOOGL_dataset.csv'

--- All datasets have been exported. ---
