# Advanced Financial NLP Pipeline: AAPL, NVDA, GOOGL

This notebook implements the multi-feature sentiment analysis pipeline as defined in `plan.md`. We will analyze `AAPL`, `NVDA`, and `GOOGL` for the most recent 12-month period available through the Finnhub API.

### Notebook Workflow

1.  **Phase 1: Environment Setup**
    - Import libraries (`finnhub`, `pandas`, `transformers`, etc.).
    - Initialize the Finnhub API client.

2.  **Phase 2: Data Acquisition**
    - Fetch company news (headlines, summaries) from the `company-news` endpoint.
    - Fetch insider transaction sentiment (MSPR) from the `stock/insider-sentiment` endpoint.

3.  **Phase 3: NLP Sentiment Analysis**
    - Load a pre-trained `FinBERT` model for sentiment analysis.
    - Calculate and apply sentiment scores to news headlines and summaries.

4.  **Phase 4: Data Aggregation & Consolidation**
    - Merge the news and insider sentiment data into a single DataFrame.
    - Resample the combined data into a final quarterly format.
    - Generate the aggregated DataFrame containing mean sentiment scores, news volume, and insider sentiment metrics.

## Phase 1: Setup and Imports

In [1]:
print("1.0: Library Installation")
print("-"*30)
# To ensure packages install into the correct kernel environment, we explicitly use
# the 'sys.executable' to call pip. This avoids issues where '!pip' might
# point to a different Python installation.
import sys

# NOTE: To run this on GColab, you can also use the %pip magic command instead of !{sys.executable} -m pip

# Consolidated installation of all required libraries
!{sys.executable} -m pip install finnhub-python pandas seaborn matplotlib numpy datasets kaggle python-dotenv --quiet

print("Required libraries installed successfully.")

1.0: Library Installation
------------------------------
Required libraries installed successfully.



[notice] A new release of pip is available: 23.2.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [4]:
print("1.1: Library Imports")
print("-"*30)
import os
import numpy as np
import pandas as pd
import seaborn as sns
import finnhub as fi
import matplotlib.pyplot as plt
from dotenv import load_dotenv

import json
from datetime import datetime

print("Core libraries imported successfully.")

1.1: Library Imports
------------------------------
Core libraries imported successfully.


In [3]:
print("1.2: Finnhub Client Initialization")
print("-"*30)
# --- Secure API Key Management ---
# It is a security best practice to store your API key as an environment variable
# to avoid exposing it directly in the code.

# Before running this cell, set the 'FINNHUB_API_KEY' in your environment.
# For example, in your terminal:
# export FINNHUB_API_KEY='your_api_key_here'
# You will need to restart your notebook's kernel after setting the variable.

api_key = "d3r0knpr01qna05k8e40d3r0knpr01qna05k8e4g"

if not api_key:
    print("API key not found in environment variables. Please set 'FINNHUB_API_KEY'.")
    # You can temporarily hardcode your key here for testing, but it is not recommended for production.
    # api_key = "YOUR_API_KEY_HERE" 
    finnhub_client = None
else:
    finnhub_client = fi.Client(api_key=api_key)
    print("Finnhub client initialized.")
    # --- Test API Client ---
    # Optional: Test the client with a simple, free API call to ensure it's working.
    try:
        profile = finnhub_client.company_profile2(symbol='AAPL')
        print(f"Successfully fetched company profile for: {profile.get('name', 'AAPL')}")
    except Exception as e:
        print(f"Client may be initialized, but a test API call failed: {e}")

1.2: Finnhub Client Initialization
------------------------------
Finnhub client initialized.
Successfully fetched company profile for: Apple Inc


## Phase 2: Data Gathering

### Strategy 2.1 : Historical `Insider Sentiment`

In [6]:
print("2.1.0: Global Configuration")
print("-"*30)

# --- Configuration ---
# Tickers for the companies we are analyzing.
STOCKS = ['AAPL', 'NVDA', 'GOOGL']

# --- Date Range for Long-Term Data (2018-2024) ---
# This range applies to all data sources.
START_YEAR = 2018
END_YEAR = 2024

print("Global configuration loaded:")
print(f"Tickers: {STOCKS}")
print(f"Date Range: {START_YEAR}-{END_YEAR}")

2.1.0: Global Configuration
------------------------------
Global configuration loaded:
Tickers: ['AAPL', 'NVDA', 'GOOGL']
Date Range: 2018-2024


In [44]:
print("2.1.1: Long-Term Data Extraction (Insider Sentiment)")
print("-"*30)

# --- Data Storage ---
all_insider_data = []

print(f"Fetching long-term insider sentiment from {START_YEAR} to {END_YEAR}...")

# --- Fetch Data for Each Stock and Year ---
for stock in STOCKS:
    print(f"  > Processing {stock}...")
    for year in range(START_YEAR, END_YEAR + 1):
        start_date = f"{year}-01-01"
        end_date = f"{year}-12-31"
        try:
            insider_sentiment = finnhub_client.stock_insider_sentiment(stock, _from=start_date, to=end_date)
            insider_transactions = insider_sentiment.get('data', [])
            for item in insider_transactions:
                report_date = datetime(year=item['year'], month=item['month'], day=1).date()
                all_insider_data.append({
                    'ticker': stock,
                    'date': report_date,
                    'mspr': item['mspr'],
                    'change': item['change']
                })
            # A small confirmation to show progress.
            if insider_transactions:
                print(f"    - Found {len(insider_transactions)} records for {year}.")
        except Exception as e:
            print(f"    - Error fetching insider sentiment for {stock} in {year}: {e}")

print("\nLong-term insider sentiment fetching complete.")

2.1.1: Long-Term Data Extraction (Insider Sentiment)
------------------------------
Fetching long-term insider sentiment from 2018 to 2024...
  > Processing AAPL...
    - Found 10 records for 2018.
    - Found 10 records for 2019.
    - Found 9 records for 2020.
    - Found 8 records for 2021.
    - Found 9 records for 2022.
    - Found 7 records for 2023.
    - Found 8 records for 2024.
  > Processing NVDA...
    - Found 8 records for 2018.
    - Found 10 records for 2019.
    - Found 10 records for 2020.
    - Found 10 records for 2021.
    - Found 8 records for 2022.
    - Found 10 records for 2023.
    - Found 8 records for 2024.
  > Processing GOOGL...
    - Found 9 records for 2018.
    - Found 7 records for 2019.
    - Found 12 records for 2020.
    - Found 12 records for 2021.
    - Found 12 records for 2022.
    - Found 12 records for 2023.
    - Found 8 records for 2024.

Long-term insider sentiment fetching complete.


In [45]:
print("2.1.2: Create Company-Specific Insider DataFrames")
print("-"*30)

# This cell refactors the insider sentiment data into separate, clean
# DataFrames for each company, formatted for time series analysis.

# Create a temporary DataFrame from the raw collected data
insider_df = pd.DataFrame(all_insider_data)

# Dictionary to hold the final, structured DataFrames for each company
# NOTE: We can access the selevant dataset by calling `insider_datasets["AAPL"].head()` for example
insider_datasets = {}

if not insider_df.empty:
    # Convert 'date' column to datetime objects for manipulation
    insider_df['date'] = pd.to_datetime(insider_df['date'])

    # Engineer the 'Period' column in the specified 'YYYY-Q' format
    insider_df['Period'] = insider_df['date'].dt.year.astype(str) + '-Q' + insider_df['date'].dt.quarter.astype(str)
    
    print("Processing insider data for each target ticker...")
    # Iterate through the globally defined STOCKS list to create a DF for each
    for ticker in STOCKS:
        # Filter data for the current ticker
        ticker_specific_df = insider_df[insider_df['ticker'] == ticker].copy()

        if not ticker_specific_df.empty:
            # Select, rename, and sort the columns to match the desired format
            final_df = ticker_specific_df[['Period', 'mspr']].rename(columns={'mspr': 'MSPR'})
            final_df = final_df.sort_values(by='Period').reset_index(drop=True)
            
            # Store the processed DataFrame in the dictionary
            insider_datasets[ticker] = final_df 
        else:
            print(f"No insider sentiment data was found for {ticker}")
    
    print("\nSuccessfully created and structured insider sentiment DataFrames for all tickers.")
    print("DataFrames are stored in the 'insider_datasets' dictionary.")

else:
    print("The initial 'all_insider_data' list is empty. No DataFrames were created.")

2.1.2: Create Company-Specific Insider DataFrames
------------------------------
Processing insider data for each target ticker...

Successfully created and structured insider sentiment DataFrames for all tickers.
DataFrames are stored in the 'insider_datasets' dictionary.


In [46]:

print("2.1.2: Display Company-Specific Insider DataFrames")
print("-"*30)

# This is how we can access the insider_datasets ictionary now
display(insider_datasets["AAPL"].head())
display(insider_datasets["NVDA"].head())
display(insider_datasets["GOOGL"].head())

# This bit is just to gather contextual info on data distributions, quantities, ect ect
for TICKER in list(insider_datasets):
    print(f"Relevant Information on {TICKER}:\n")
    display(insider_datasets[TICKER].info())
    display(insider_datasets[TICKER].describe())

2.1.2: Display Company-Specific Insider DataFrames
------------------------------


Unnamed: 0,Period,MSPR
0,2018-Q1,-100.0
1,2018-Q1,7.840257
2,2018-Q2,-22.737514
3,2018-Q2,-54.7286
4,2018-Q2,-33.333332


Unnamed: 0,Period,MSPR
0,2018-Q1,-48.497414
1,2018-Q1,-39.06278
2,2018-Q2,-87.43943
3,2018-Q2,-62.33284
4,2018-Q3,-100.0


Unnamed: 0,Period,MSPR
0,2018-Q2,-33.796238
1,2018-Q2,-52.667397
2,2018-Q2,-48.464813
3,2018-Q3,-59.29357
4,2018-Q3,-61.71693


Relevant Information on AAPL:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Period  61 non-null     object 
 1   MSPR    61 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.1+ KB


None

Unnamed: 0,MSPR
count,61.0
mean,-26.480193
std,64.47905
min,-100.0
25%,-85.37024
50%,-33.200634
75%,-7.226337
max,100.0


Relevant Information on NVDA:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Period  64 non-null     object 
 1   MSPR    64 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.1+ KB


None

Unnamed: 0,MSPR
count,64.0
mean,-51.053201
std,49.574298
min,-100.0
25%,-100.0
50%,-55.415127
75%,-10.659913
max,100.0


Relevant Information on GOOGL:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Period  72 non-null     object 
 1   MSPR    72 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.3+ KB


None

Unnamed: 0,MSPR
count,72.0
mean,-29.655626
std,38.240153
min,-100.0
25%,-49.633164
50%,-37.794344
75%,-15.689648
max,80.80216


We now have a `Dictionary` containing: 
* `AAPL` Dataset 
* `NVDA` Dataset
* `GOOGL` Dataset

Each of these datasets has this format:
|   Period   | MSPR        |
|------------|-------------|
| 2018-Q1    | Value       | 
| 2018-Q1    | Value       | 
| 2018-Q1    | Value       | 
| 2018-Q2    | Value       | 
| ...        | ...         | 
| 2024-Q4    | Value       | 

### Strategy 2.2: Historical News via Hugging Face Datasets

In [7]:
print("2.2.0: Configuration & Authentication for Hugging Face")
print("-"*30)

# Load environment variables from .env file
load_dotenv()

# NOTE: --- Authentication (IMPORTANT) ---
# The Hugging Face User Access Token is loaded from the .env file.
# Make sure your .env file has the line: HF_TOKEN='your_token_here'

hf_token = os.getenv("HF_TOKEN")

if hf_token:
    print("Found Hugging Face token. Logging in...")
    try:
        login(token=hf_token)
        print("Login successful.")
    except Exception as e:
        print(f"Login failed: {e}")
else:
    print("Hugging Face token not found in environment variables.")
    print("Please ensure your .env file is correctly configured with 'HF_TOKEN'.")


# --- Configuration ---
# These are defined globally, but we re-state them here for clarity.
# Note: START_YEAR and END_YEAR must be defined in a previous cell.
try:
    TARGET_TICKERS = ['AAPL', 'NVDA', 'GOOGL']
    START_DATE = f"{START_YEAR}-01-01"
    END_DATE = f"{END_YEAR}-12-31"
    
    print("\nConfiguration for historical news extraction is set.")
    print(f"Tickers: {TARGET_TICKERS}")
    print(f"Date Range: {START_DATE} to {END_DATE}")
except NameError:
    print("\nWarning: START_YEAR and END_YEAR are not defined.")
    print("Please run a configuration cell first.")

2.2.0: Configuration & Authentication for Hugging Face
------------------------------
Found Hugging Face token. Logging in...
Login failed: name 'login' is not defined

Configuration for historical news extraction is set.
Tickers: ['AAPL', 'NVDA', 'GOOGL']
Date Range: 2018-01-01 to 2024-12-31


In [55]:
print("2.2.1: Extract Ticker-Specific News Articles via Streaming")
print("-"*30)

# --- Expanded Search Dictionary ---
# Maps a canonical ticker to a set of lowercase search terms for broader matching.
# Using sets provides a slight performance boost for lookups.
SEARCH_TERMS = {
    'AAPL': {'aapl', 'apple'},
    'NVDA': {'nvda', 'nvidia'},
    'GOOGL': {'googl', 'google', 'alphabet'}
}
# A flattened set of all terms for a very fast initial check.
ALL_SEARCH_TERMS = set.union(*SEARCH_TERMS.values())

# --- Data Storage ---
multisource_articles = []

print("Loading and filtering 'financial-news-multisource' dataset...")
print("(This is a large dataset and will take some time to process.)")

try:
    # --- Increase the download timeout for stability on large datasets ---
    import huggingface_hub.constants
    huggingface_hub.constants.HF_HUB_DOWNLOAD_TIMEOUT = 120 

    # --- Load all subsets of the dataset in streaming mode ---
    multisource_dataset = load_dataset(
        "Brianferrell787/financial-news-multisource",
        data_files="data/*/*.parquet",
        split="train",
        streaming=True
    )
    
    # --- Iterate through the stream with the optimized multi-step filter ---
    for i, article in enumerate(iter(multisource_dataset)):
        # Provide progress updates to show the process is not stalled
        if (i + 1) % 10_000 == 0:
            print(f"  > Processed {i + 1:,} articles. Found {len(multisource_articles)} relevant so far...")
            print(f"  > We are currently at date: {article['date'][:10]}")

        # --- OPTIMIZED FILTERING LOGIC ---
        # Step 1: Filter by date (fastest, string comparison).
        if not (START_DATE <= article['date'][:10] <= END_DATE):
            continue
        
        # Step 2: Quick pre-filter. Check if any of our expanded search terms appear
        # anywhere in the text or metadata before doing more expensive work.
        text_lower = article['text'].lower()
        extra_fields_lower = article['extra_fields'].lower()
        if not any(term in text_lower or term in extra_fields_lower for term in ALL_SEARCH_TERMS):
            continue

        # --- Step 3: Precise ticker identification for articles that passed the pre-filters ---
        mentioned_tickers = set() # Use a set to store found tickers to avoid duplicates

        # Primary Method: Check the structured 'stocks' field for high precision.
        try:
            extra_data = json.loads(article['extra_fields'])
            if 'stocks' in extra_data and isinstance(extra_data['stocks'], list):
                # Find the intersection of our target tickers and the article's tickers
                found = set(TARGET_TICKERS) & set(extra_data['stocks'])
                mentioned_tickers.update(found)
        except (json.JSONDecodeError, TypeError):
            # If JSON is invalid, we can fall back to text search.
            pass

        # Fallback Method: If no tickers were found in metadata, check the text.
        # This increases recall for articles that might not be perfectly tagged.
        if not mentioned_tickers:
            for ticker, terms in SEARCH_TERMS.items():
                if any(term in text_lower for term in terms):
                    mentioned_tickers.add(ticker)
        
        # If we found one or more relevant tickers, add entries to our list.
        if mentioned_tickers:
            for ticker in mentioned_tickers:
                multisource_articles.append({
                    'date': article['date'],
                    'ticker': ticker,
                    'text': article['text']
                })
    
    print(f"\nExtraction complete. Total relevant article entries collected: {len(multisource_articles)}")

except Exception as e:
    print(f"\nAn error occurred while processing the dataset: {e}")

2.2.1: Extract Ticker-Specific News Articles via Streaming
------------------------------
Loading and filtering 'financial-news-multisource' dataset...
(This is a large dataset and will take some time to process.)
  > Processed 10,000 articles. Found 0 relevant so far...
We are currently at date: 2016-01-11
  > Processed 20,000 articles. Found 0 relevant so far...
We are currently at date: 2016-01-20
  > Processed 30,000 articles. Found 0 relevant so far...
We are currently at date: 2016-01-28
  > Processed 40,000 articles. Found 0 relevant so far...
We are currently at date: 2016-02-06
  > Processed 50,000 articles. Found 0 relevant so far...
We are currently at date: 2016-02-16
  > Processed 60,000 articles. Found 0 relevant so far...
We are currently at date: 2016-02-24
  > Processed 70,000 articles. Found 0 relevant so far...
We are currently at date: 2016-03-04
  > Processed 80,000 articles. Found 0 relevant so far...
We are currently at date: 2016-03-14
  > Processed 90,000 artic

'(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')), '(Request ID: 8dbdf5bf-614a-4465-afd3-3084037fface)')' thrown while requesting GET https://huggingface.co/datasets/Brianferrell787/financial-news-multisource/resolve/2a2e8f5c97a2034236514c4e516a85007bc85f63/data/fnspid_news/fnspid_news.053.parquet
Retrying in 1s [Retry 1/5].
'(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')), '(Request ID: 89ca139d-7197-453a-a324-a5613e5372ac)')' thrown while requesting GET https://huggingface.co/datasets/Brianferrell787/financial-news-multisource/resolve/2a2e8f5c97a2034236514c4e516a85007bc85f63/data/fnspid_news/fnspid_news.053.parquet
Retrying in 2s [Retry 2/5].


  > Processed 38,190,000 articles. Found 620411 relevant so far...
We are currently at date: 2010-09-15
  > Processed 38,200,000 articles. Found 620411 relevant so far...
We are currently at date: 2013-08-29
  > Processed 38,210,000 articles. Found 620411 relevant so far...
We are currently at date: 2013-08-01
  > Processed 38,220,000 articles. Found 620411 relevant so far...
We are currently at date: 2008-12-10
  > Processed 38,230,000 articles. Found 620411 relevant so far...
We are currently at date: 2008-12-04
  > Processed 38,240,000 articles. Found 620411 relevant so far...
We are currently at date: 2013-04-18
  > Processed 38,250,000 articles. Found 620411 relevant so far...
We are currently at date: 2015-03-13
  > Processed 38,260,000 articles. Found 620411 relevant so far...
We are currently at date: 2008-03-24
  > Processed 38,270,000 articles. Found 620411 relevant so far...
We are currently at date: 2015-10-31
  > Processed 38,280,000 articles. Found 620411 relevant so far.

In [56]:
print("2.2.2: Create the News Articles DataFrame")
print("-"*30)

# --- Create DataFrame from the collected list ---
news_articles_df = pd.DataFrame(multisource_articles)

if not news_articles_df.empty:
    # Convert 'date' column to datetime objects for future analysis
    news_articles_df['date'] = pd.to_datetime(news_articles_df['date']).dt.date
    
    print("`news_articles_df` DataFrame created successfully.")
    
    # Display summary and head
    print("\n--- DataFrame Info ---")
    news_articles_df.info()
    
    print("\n--- DataFrame Head ---")
    display(news_articles_df.head())
    
else:
    print("No articles from 'financial-news-multisource' matched the filtering criteria.")

2.2.2: Create the News Articles DataFrame
------------------------------
`news_articles_df` DataFrame created successfully.

--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731183 entries, 0 to 731182
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   date    731183 non-null  object
 1   ticker  731183 non-null  object
 2   text    731183 non-null  object
dtypes: object(3)
memory usage: 16.7+ MB

--- DataFrame Head ---


Unnamed: 0,date,ticker,text
0,2018-01-01,GOOGL,Tech trends for 2018: From helpful robots to v...
1,2018-01-01,GOOGL,Former Google career coach shares ways to ace ...
2,2018-01-01,GOOGL,"Trump, Macron, tech companies & growth set to ..."
3,2018-01-01,GOOGL,8 books to help you become wealthier in 2018\n...
4,2018-01-01,AAPL,Medical jargon may cloud doctor-patient commun...


In [57]:
print("2.2.3: Save Filtered News Articles to a Compressed File")
print("-"*30)

# This step is crucial for checkpointing our progress.

if 'news_articles_df' in locals() and not news_articles_df.empty:
    # Define the output file path. Using the '.gz' extension with the 'gzip'
    # compression type is a standard and efficient way to save large CSVs.
    output_filename = "filtered_news_articles.csv.gz"

    print(f"Saving the 'news_articles_df' to a compressed CSV file: {output_filename}")
    print(f"This may take a moment given the size of the DataFrame ({len(news_articles_df):,} rows)...")

    try:
        # Save the DataFrame to a gzipped CSV.
        # - compression='gzip' handles the zipping automatically.
        # - index=False prevents pandas from writing the DataFrame index as a column.
        news_articles_df.to_csv(output_filename, index=False, compression='gzip')
        
        print(f"\nDataFrame successfully saved to '{output_filename}'.")
        print("You can now load this file in future sessions to skip the long extraction process.")

    except Exception as e:
        print(f"\nAn error occurred while saving the DataFrame: {e}")
else:
    print("The 'news_articles_df' DataFrame is not available or is empty. Skipping save operation.")


2.2.3: Save Filtered News Articles to a Compressed File
------------------------------
Saving the 'news_articles_df' to a compressed CSV file: filtered_news_articles.csv.gz
This may take a moment given the size of the DataFrame (731,183 rows)...

DataFrame successfully saved to 'filtered_news_articles.csv.gz'.
You can now load this file in future sessions to skip the long extraction process.


### Strategy 2.3: Article Headlines Via `Kaggle`

In [8]:
print("2.3.0: Setup and Kaggle API Configuration")
print("-"*30)

# Load environment variables from .env file
load_dotenv()

# --- 1. Configure Kaggle API ---
# Credentials are loaded from your .env file.
# Ensure your .env file contains:
# KAGGLE_USERNAME='your_username'
# KAGGLE_KEY='your_api_key'

KAGGLE_USERNAME = os.getenv('KAGGLE_USERNAME')
KAGGLE_KEY = os.getenv('KAGGLE_KEY')

# Set environment variables for the Kaggle CLI
if KAGGLE_USERNAME and KAGGLE_KEY:
    os.environ['KAGGLE_USERNAME'] = KAGGLE_USERNAME
    os.environ['KAGGLE_KEY'] = KAGGLE_KEY
    print("Kaggle API credentials configured from environment variables.")
else:
    print("Warning: Kaggle credentials not found in environment variables.")
    print("Please ensure your .env file is correctly configured.")

# --- 2. Define Constants ---
# Date range for filtering headlines.
START_YEAR = 2018
END_YEAR = 2024

print("\nSetup complete. Constants defined.")
print(f"Data will be filtered for the period: {START_YEAR}-{END_YEAR}")

2.3.0: Setup and Kaggle API Configuration
------------------------------
Kaggle API credentials configured from environment variables.

Setup complete. Constants defined.
Data will be filtered for the period: 2018-2024


In [70]:
print("2.3.1: Download, Extract, and Filter S&P 500 Headlines Dataset")
print("-"*30)

# --- Define dataset details ---
DATASET_NAME = 'dyutidasmahaptra/s-and-p-500-with-financial-news-headlines-20082024'
ZIP_FILE_NAME = 's-and-p-500-with-financial-news-headlines-20082024.zip'
CSV_FILE_NAME = 'sp500_headlines_2008_2024.csv'

# --- Download the dataset using the Kaggle API ---
print(f"Downloading dataset '{DATASET_NAME}'...")
!kaggle datasets download -d {DATASET_NAME} --quiet

# --- Extract the CSV file from the downloaded zip ---
print(f"Extracting '{CSV_FILE_NAME}' from the zip file...")
with zipfile.ZipFile(ZIP_FILE_NAME, 'r') as zip_ref:
    zip_ref.extractall('.')
print("Extraction complete.")

# --- Load and process the data with pandas ---
print("Loading data into DataFrame and filtering...")
try:
    # Load the entire CSV into a DataFrame
    df = pd.read_csv(CSV_FILE_NAME)

    # Convert the 'date' column to datetime objects for reliable filtering
    df['Date'] = pd.to_datetime(df['Date'])

    # Create the filter condition for the date range
    start_date = f'{START_YEAR}-01-01'
    end_date = f'{END_YEAR}-12-31'
    date_filter = (df['Date'] >= start_date) & (df['Date'] <= end_date)

    # Apply the filter and create the final DataFrame
    market_headlines_df = df[date_filter].copy()

    # Drop the 'close' column as it is not needed for sentiment analysis
    market_headlines_df = market_headlines_df.drop(columns=['CP'])

    print("Filtering successful. The data is ready in 'market_headlines_df'.")

except FileNotFoundError:
    print(f"ERROR: The file '{CSV_FILE_NAME}' was not found after extraction.")
except Exception as e:
    print(f"An error occurred during data processing: {e}")

2.3.1: Download, Extract, and Filter S&P 500 Headlines Dataset
------------------------------
Downloading dataset 'dyutidasmahaptra/s-and-p-500-with-financial-news-headlines-20082024'...
Extracting 'sp500_headlines_2008_2024.csv' from the zip file...
Extraction complete.
Loading data into DataFrame and filtering...
Filtering successful. The data is ready in 'market_headlines_df'.


Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\totob\AppData\Local\Programs\Python\Python312\Scripts\kaggle.exe\__main__.py", line 7, in <module>
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\kaggle\cli.py", line 68, in main
    out = args.func(**command_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\kaggle\api\kaggle_api_extended.py", line 1741, in dataset_download_cli
    with self.build_kaggle_client() as kaggle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\kaggle\api\kaggle_api_extended.py", line 688, in build_kaggle_client
    username=self.config_values['username'],
             ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'username'


In [60]:
print("2.3.2: Display Final Market Headlines DataFrame")
print("-"*30)

# Sort by date just to be sure
market_headlines_df = market_headlines_df.sort_values(by='Date').reset_index(drop=True)

if 'market_headlines_df' in locals() and not market_headlines_df.empty:
    print("DataFrame for general market sentiment created successfully.")
    
    print("\n--- DataFrame Info ---")
    market_headlines_df.info()
    
    print("\n--- DataFrame Head (sorted) ---")
    display(market_headlines_df.head())
    
    print("\n--- DataFrame Tail (sorted) ---")
    display(market_headlines_df.tail())
else:
    print("The 'market_headlines_df' DataFrame was not created or is empty. Please check the previous cell for errors.")


2.3.2: Display Final Market Headlines DataFrame
------------------------------
DataFrame for general market sentiment created successfully.

--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13422 entries, 0 to 13421
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Title   13422 non-null  object        
 1   Date    13422 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 209.8+ KB

--- DataFrame Head (sorted) ---


Unnamed: 0,Title,Date
0,2018 Stock Market Cycles Outlook: New Highs......,2018-01-02
1,The Surprising Outperformance Of A Value-Tilte...,2018-01-02
2,This Day In Market History: NYSE Volume Hits 2...,2018-01-02
3,These 4 S&P 500 Stocks Doubled in 2017,2018-01-02
4,Warren Buffett wins $1M bet against hedge fund...,2018-01-02



--- DataFrame Tail (sorted) ---


Unnamed: 0,Title,Date
13417,Zions (ZION) Loses Spot in S&P 500 as Concerns...,2024-03-04
13418,"S&P 500: Super Micro, Deckers Jump On News The...",2024-03-04
13419,"Bank of America boosts S&P 500 target to 5,400...",2024-03-04
13420,S&P 500 Price Forecast – S&P 500 Continues to ...,2024-03-04
13421,S&P 500 Gains and Losses Today: Tesla Shares T...,2024-03-04


### Strategy 2.4: Company filings via `EDGAR` API

In [61]:
print("2.4.0: Setup, Installations, and EDGAR Identity")
print("-"*30)

# --- 1. Install necessary libraries ---
# 'edgartools' is the library we'll use to interface with the SEC EDGAR database.
import sys
!{sys.executable} -m pip install edgartools --quiet

import pandas as pd
from edgar import Company, set_identity

# --- 2. Set EDGAR User Identity (CRITICAL STEP) ---
# The SEC requires any script or bot that accesses EDGAR to have a custom User-Agent
# that identifies the user. This is a compliance requirement to avoid being blocked.
# Replace the example with your own company/project name and email address.
# Format: "Sample Company Name your.email@example.com"
set_identity("University of Southampton ab3u21@soton.ac.uk")
print("EDGAR user identity set successfully.")

# --- 3. Define Constants ---
# These constants will be used to filter the filings.
# We use the globally defined START_YEAR and END_YEAR from cell 2.1.0
TARGET_TICKERS = ['AAPL', 'NVDA', 'GOOGL']
FORM_TYPES = ["10-K", "10-Q", "8-K"]
DATE_RANGE = f"{START_YEAR}-01-01:{END_YEAR}-12-31"


print("\nSetup complete. Ready to extract SEC filings.")
print(f"Tickers: {TARGET_TICKERS}")
print(f"Form Types: {FORM_TYPES}")
print(f"Date Range: {DATE_RANGE}")

2.4.0: Setup, Installations, and EDGAR Identity
------------------------------



[notice] A new release of pip is available: 23.2.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


EDGAR user identity set successfully.

Setup complete. Ready to extract SEC filings.
Tickers: ['AAPL', 'NVDA', 'GOOGL']
Form Types: ['10-K', '10-Q', '8-K']
Date Range: 2018-01-01:2024-12-31


In [62]:
print("2.4.1: Extract SEC Filings Data")
print("-"*30)

# --- Data Storage ---
all_filings_data = []

print("Starting extraction of SEC filings. This may take a significant amount of time...")

# --- Loop through each stock and extract its filings ---
for ticker in TARGET_TICKERS:
    print(f"  > Processing filings for {ticker}...")
    try:
        # Create a Company object for the current ticker
        company = Company(ticker)
        
        # Get all filings and immediately filter by date range and form types
        filings = company.get_filings().filter(date=DATE_RANGE, form=FORM_TYPES)
        
        # The 'filings' object is a generator; we iterate through it to get each filing
        for filing in filings:
            # The .text() method conveniently extracts and cleans the full filing text
            filing_text = filing.text()
            
            # Append the structured data to our list
            all_filings_data.append({
                'filing_date': filing.filing_date,
                'ticker': ticker,
                'form_type': filing.form,
                'text': filing_text
            })
            # Log progress for each file found
            print(f"    - Extracted {filing.form} from {filing.filing_date}")

    except Exception as e:
        print(f"    - ERROR: Could not process filings for {ticker}. Reason: {e}")

print(f"\nFilings extraction complete. Total documents extracted: {len(all_filings_data)}")


2.4.1: Extract SEC Filings Data
------------------------------
Starting extraction of SEC filings. This may take a significant amount of time...
  > Processing filings for AAPL...


    - Extracted 10-K from 2024-11-01


    - Extracted 8-K from 2024-10-31


    - Extracted 8-K from 2024-09-10


    - Extracted 8-K from 2024-08-26


    - Extracted 8-K from 2024-08-23


    - Extracted 10-Q from 2024-08-02


    - Extracted 8-K from 2024-08-01


    - Extracted 8-K from 2024-05-03


    - Extracted 10-Q from 2024-05-03


    - Extracted 8-K from 2024-05-02


    - Extracted 8-K from 2024-02-28


    - Extracted 10-Q from 2024-02-02


    - Extracted 8-K from 2024-02-01


    - Extracted 10-K from 2023-11-03


    - Extracted 8-K from 2023-11-02


    - Extracted 10-Q from 2023-08-04


    - Extracted 8-K from 2023-08-03


    - Extracted 8-K from 2023-05-10


    - Extracted 10-Q from 2023-05-05


    - Extracted 8-K from 2023-05-04


    - Extracted 8-K from 2023-03-10


    - Extracted 10-Q from 2023-02-03


    - Extracted 8-K from 2023-02-02


    - Extracted 8-K from 2022-11-07


    - Extracted 10-K from 2022-10-28


    - Extracted 8-K from 2022-10-27


    - Extracted 8-K from 2022-08-19


    - Extracted 8-K from 2022-08-08


    - Extracted 10-Q from 2022-07-29


    - Extracted 8-K from 2022-07-28


    - Extracted 10-Q from 2022-04-29


    - Extracted 8-K from 2022-04-28


    - Extracted 8-K from 2022-03-04


    - Extracted 10-Q from 2022-01-28


    - Extracted 8-K from 2022-01-27


    - Extracted 8-K from 2021-11-12


    - Extracted 10-K from 2021-10-29


    - Extracted 8-K from 2021-10-28


    - Extracted 8-K from 2021-08-05


    - Extracted 10-Q from 2021-07-28


    - Extracted 8-K from 2021-07-27


    - Extracted 10-Q from 2021-04-29


    - Extracted 8-K from 2021-04-28


    - Extracted 8-K from 2021-02-24


    - Extracted 8-K from 2021-02-08


    - Extracted 10-Q from 2021-01-28


    - Extracted 8-K from 2021-01-27


    - Extracted 8-K from 2021-01-05


    - Extracted 10-K from 2020-10-30


    - Extracted 8-K from 2020-10-29


    - Extracted 8-K from 2020-08-20


    - Extracted 8-K from 2020-08-07


    - Extracted 10-Q from 2020-07-31


    - Extracted 8-K from 2020-07-30


    - Extracted 8-K from 2020-05-11


    - Extracted 10-Q from 2020-05-01


    - Extracted 8-K from 2020-04-30


    - Extracted 8-K from 2020-02-27


    - Extracted 8-K from 2020-02-18


    - Extracted 10-Q from 2020-01-29


    - Extracted 8-K from 2020-01-28


    - Extracted 8-K from 2019-11-15


    - Extracted 10-K from 2019-10-31


    - Extracted 8-K from 2019-10-30


    - Extracted 8-K from 2019-09-13


    - Extracted 8-K from 2019-09-11


    - Extracted 10-Q from 2019-07-31


    - Extracted 8-K from 2019-07-30


    - Extracted 10-Q from 2019-05-01


    - Extracted 8-K from 2019-04-30


    - Extracted 8-K from 2019-03-04


    - Extracted 8-K from 2019-02-06


    - Extracted 10-Q from 2019-01-30


    - Extracted 8-K from 2019-01-29


    - Extracted 8-K from 2019-01-02


    - Extracted 10-K from 2018-11-05


    - Extracted 8-K from 2018-11-01


    - Extracted 10-Q from 2018-08-01


    - Extracted 8-K from 2018-07-31


    - Extracted 8-K from 2018-05-07


    - Extracted 10-Q from 2018-05-02


    - Extracted 8-K from 2018-05-01


    - Extracted 8-K from 2018-02-14


    - Extracted 10-Q from 2018-02-02


    - Extracted 8-K from 2018-02-01
  > Processing filings for NVDA...


    - Extracted 10-Q from 2024-11-20


    - Extracted 8-K from 2024-11-20


    - Extracted 8-K from 2024-11-07


    - Extracted 10-Q from 2024-08-28


    - Extracted 8-K from 2024-08-28


    - Extracted 8-K from 2024-07-02


    - Extracted 8-K from 2024-06-07


    - Extracted 10-Q from 2024-05-29


    - Extracted 8-K from 2024-05-22


    - Extracted 8-K from 2024-03-14


    - Extracted 10-K from 2024-02-21


    - Extracted 8-K from 2024-02-21


    - Extracted 10-Q from 2023-11-21


    - Extracted 8-K from 2023-11-21


    - Extracted 8-K from 2023-10-24


    - Extracted 8-K from 2023-10-17


    - Extracted 10-Q from 2023-08-28


    - Extracted 8-K from 2023-08-23


    - Extracted 8-K from 2023-07-24


    - Extracted 8-K from 2023-06-27


    - Extracted 10-Q from 2023-05-26


    - Extracted 8-K from 2023-05-24


    - Extracted 8-K from 2023-03-08


    - Extracted 10-K from 2023-02-24


    - Extracted 8-K from 2023-02-22


    - Extracted 10-Q from 2022-11-18


    - Extracted 8-K from 2022-11-16


    - Extracted 8-K from 2022-09-01


    - Extracted 10-Q from 2022-08-31


    - Extracted 8-K from 2022-08-31


    - Extracted 8-K from 2022-08-24


    - Extracted 8-K from 2022-08-08


    - Extracted 8-K from 2022-06-06


    - Extracted 10-Q from 2022-05-27


    - Extracted 8-K from 2022-05-25


    - Extracted 10-K from 2022-03-18


    - Extracted 8-K from 2022-03-09


    - Extracted 8-K from 2022-02-16


    - Extracted 8-K from 2022-02-08


    - Extracted 10-Q from 2021-11-22


    - Extracted 8-K from 2021-11-17


    - Extracted 10-Q from 2021-08-20


    - Extracted 8-K from 2021-08-18


    - Extracted 8-K from 2021-06-28


    - Extracted 8-K from 2021-06-16


    - Extracted 8-K from 2021-06-07


    - Extracted 10-Q from 2021-05-26


    - Extracted 8-K from 2021-05-26


    - Extracted 8-K from 2021-05-21


    - Extracted 8-K from 2021-04-12


    - Extracted 8-K from 2021-03-19


    - Extracted 10-K from 2021-02-26


    - Extracted 8-K from 2021-02-24


    - Extracted 10-Q from 2020-11-18


    - Extracted 8-K from 2020-11-18


    - Extracted 8-K from 2020-11-09


    - Extracted 8-K from 2020-09-14


    - Extracted 10-Q from 2020-08-19


    - Extracted 8-K from 2020-08-19


    - Extracted 8-K from 2020-07-13


    - Extracted 8-K from 2020-06-15


    - Extracted 10-Q from 2020-05-21


    - Extracted 8-K from 2020-05-21


    - Extracted 8-K from 2020-04-27


    - Extracted 8-K from 2020-04-17


    - Extracted 8-K from 2020-03-31


    - Extracted 8-K from 2020-03-10


    - Extracted 10-K from 2020-02-20


    - Extracted 8-K from 2020-02-13


    - Extracted 10-Q from 2019-11-14


    - Extracted 8-K from 2019-11-14


    - Extracted 10-Q from 2019-08-15


    - Extracted 8-K from 2019-08-15


    - Extracted 8-K from 2019-06-17


    - Extracted 8-K from 2019-05-29


    - Extracted 10-Q from 2019-05-16


    - Extracted 8-K from 2019-05-16


    - Extracted 8-K from 2019-04-01


    - Extracted 8-K from 2019-03-11


    - Extracted 8-K from 2019-03-11


    - Extracted 10-K from 2019-02-21


    - Extracted 8-K from 2019-02-14


    - Extracted 8-K from 2019-01-28


    - Extracted 10-Q from 2018-11-15


    - Extracted 8-K from 2018-11-15


    - Extracted 10-Q from 2018-08-16


    - Extracted 8-K from 2018-08-16


    - Extracted 10-Q from 2018-05-22


    - Extracted 8-K from 2018-05-21


    - Extracted 8-K from 2018-05-10


    - Extracted 8-K from 2018-03-13


    - Extracted 10-K from 2018-02-28


    - Extracted 8-K from 2018-02-08
  > Processing filings for GOOGL...


    - Extracted 10-Q from 2024-10-30


    - Extracted 8-K from 2024-10-29


    - Extracted 8-K from 2024-10-17


    - Extracted 8-K from 2024-09-24


    - Extracted 8-K from 2024-08-06


    - Extracted 10-Q from 2024-07-24


    - Extracted 8-K from 2024-07-23


    - Extracted 8-K from 2024-06-26


    - Extracted 8-K from 2024-06-13


    - Extracted 8-K from 2024-06-07


    - Extracted 8-K from 2024-06-05


    - Extracted 10-Q from 2024-04-26


    - Extracted 8-K from 2024-04-25


    - Extracted 8-K from 2024-02-08


    - Extracted 10-K from 2024-01-31


    - Extracted 8-K from 2024-01-30


    - Extracted 10-Q from 2023-10-25


    - Extracted 8-K from 2023-10-24


    - Extracted 10-Q from 2023-07-26


    - Extracted 8-K from 2023-07-25


    - Extracted 8-K from 2023-06-08


    - Extracted 10-Q from 2023-04-26


    - Extracted 8-K from 2023-04-25


    - Extracted 8-K from 2023-04-21


    - Extracted 8-K from 2023-04-20


    - Extracted 10-K from 2023-02-03


    - Extracted 8-K from 2023-02-02


    - Extracted 8-K from 2023-01-25


    - Extracted 8-K from 2023-01-20


    - Extracted 8-K from 2022-12-21


    - Extracted 10-Q from 2022-10-26


    - Extracted 8-K from 2022-10-25


    - Extracted 10-Q from 2022-07-27


    - Extracted 8-K from 2022-07-26


    - Extracted 8-K from 2022-07-14


    - Extracted 8-K from 2022-07-13


    - Extracted 8-K from 2022-06-03


    - Extracted 10-Q from 2022-04-27


    - Extracted 8-K from 2022-04-26


    - Extracted 10-K from 2022-02-02


    - Extracted 8-K from 2022-02-01


    - Extracted 8-K from 2022-01-04


    - Extracted 10-Q from 2021-10-27


    - Extracted 8-K from 2021-10-26


    - Extracted 10-Q from 2021-07-28


    - Extracted 8-K from 2021-07-27


    - Extracted 8-K from 2021-07-08


    - Extracted 8-K from 2021-06-04


    - Extracted 10-Q from 2021-04-28


    - Extracted 8-K from 2021-04-27


    - Extracted 8-K from 2021-04-09


    - Extracted 8-K from 2021-03-04


    - Extracted 10-K from 2021-02-03


    - Extracted 8-K from 2021-02-02


    - Extracted 8-K from 2020-12-21


    - Extracted 10-Q from 2020-10-30


    - Extracted 8-K from 2020-10-29


    - Extracted 8-K from 2020-10-26


    - Extracted 8-K from 2020-10-23


    - Extracted 8-K from 2020-10-20


    - Extracted 8-K from 2020-09-25


    - Extracted 8-K from 2020-08-05


    - Extracted 10-Q from 2020-07-31


    - Extracted 8-K from 2020-07-30


    - Extracted 8-K from 2020-06-05


    - Extracted 10-Q from 2020-04-29


    - Extracted 8-K from 2020-04-28


    - Extracted 10-K from 2020-02-04


    - Extracted 8-K from 2020-02-03


    - Extracted 8-K from 2020-01-10


    - Extracted 8-K from 2019-12-20


    - Extracted 8-K from 2019-12-09


    - Extracted 8-K from 2019-12-04


    - Extracted 10-Q from 2019-10-29


    - Extracted 8-K from 2019-10-28


    - Extracted 8-K from 2019-09-06


    - Extracted 10-Q from 2019-07-26


    - Extracted 8-K from 2019-07-25


    - Extracted 8-K from 2019-06-21


    - Extracted 8-K from 2019-04-30


    - Extracted 10-Q from 2019-04-30


    - Extracted 8-K from 2019-04-29


    - Extracted 8-K from 2019-03-20


    - Extracted 10-K from 2019-02-05


    - Extracted 8-K from 2019-02-04


    - Extracted 10-Q from 2018-10-26


    - Extracted 8-K from 2018-10-25


    - Extracted 10-Q from 2018-07-24


    - Extracted 8-K from 2018-07-23


    - Extracted 8-K from 2018-07-18


    - Extracted 8-K from 2018-06-08


    - Extracted 10-Q from 2018-04-24


    - Extracted 8-K from 2018-04-23


    - Extracted 10-K from 2018-02-06


    - Extracted 8-K from 2018-02-01

Filings extraction complete. Total documents extracted: 273


In [None]:
print("2.4.2: Create and Display Filings DataFrame")
print("-"*30)

# --- Create DataFrame from the collected list ---
filings_df = pd.DataFrame(all_filings_data)

if not filings_df.empty:
    # Convert 'filing_date' column to datetime objects
    filings_df['filing_date'] = pd.to_datetime(filings_df['filing_date'])
    
    # Sort the DataFrame by date and ticker for good practice
    filings_df = filings_df.sort_values(by=['filing_date', 'ticker']).reset_index(drop=True)
    
    print("`filings_df` DataFrame created successfully.")
    
    # Display summary and head
    print("\n--- DataFrame Info ---")
    filings_df.info()
    
    print("\n--- DataFrame Head ---")
    display(filings_df.head(40).sort_values(["filing_date"], ascending=True))
    
else:
    print("The 'all_filings_data' list is empty. No DataFrame was created. Please check cell 2.4.1 for errors.")

2.4.2: Create and Display Filings DataFrame
------------------------------
`filings_df` DataFrame created successfully.

--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   filing_date  273 non-null    datetime64[ns]
 1   ticker       273 non-null    object        
 2   form_type    273 non-null    object        
 3   text         273 non-null    object        
dtypes: datetime64[ns](1), object(3)
memory usage: 8.7+ KB

--- DataFrame Head ---


Unnamed: 0,filing_date,ticker,form_type,text
0,2018-02-01,AAPL,8-K,...
1,2018-02-01,GOOGL,8-K,...
2,2018-02-02,AAPL,10-Q,...
3,2018-02-06,GOOGL,10-K,...
4,2018-02-08,NVDA,8-K,...
5,2018-02-14,AAPL,8-K,...
6,2018-02-28,NVDA,10-K,...
7,2018-03-13,NVDA,8-K,...
8,2018-04-23,GOOGL,8-K,...
9,2018-04-24,GOOGL,10-Q,Table of Contents ...


In [71]:
print("2.4.3: Save Filtered Filings to a Compressed File")
print("-"*30)

# Check if the filings_df exists and is not empty before attempting to save.
if 'filings_df' in locals() and not filings_df.empty:
    
    # Define the name for the compressed output file.
    output_filename = "filtered_filings.csv.gz"

    print(f"Saving the 'filings_df' to a compressed CSV file: {output_filename}")
    print(f"This may take a moment, as the DataFrame contains {len(filings_df):,} documents...")

    try:
        # Save the DataFrame to a gzipped CSV.
        # - compression='gzip' handles the compression.
        # - index=False prevents pandas from writing the DataFrame index as a column.
        filings_df.to_csv(output_filename, index=False, compression='gzip')
        
        print(f"\nDataFrame successfully saved to '{output_filename}'.")
        print("This file can be loaded in future sessions to bypass the EDGAR extraction process.")

    except Exception as e:
        print(f"\nAn error occurred while saving the filings DataFrame: {e}")
else:
    print("The 'filings_df' DataFrame is not available or is empty. Skipping the save operation.")


2.4.3: Save Filtered Filings to a Compressed File
------------------------------
Saving the 'filings_df' to a compressed CSV file: filtered_filings.csv.gz
This may take a moment, as the DataFrame contains 273 documents...

DataFrame successfully saved to 'filtered_filings.csv.gz'.
This file can be loaded in future sessions to bypass the EDGAR extraction process.


## Phase 3: NLP Pipelines

### 3.1: NLP Pipeline for `news_articles_df`

#### We start this section by *Down_sampling*: 
We want to go from our `>700k` articles to `~11k` articles for each company over the 2018-2024 year period. That's `5` articles per day per company!

In [78]:
print("3.1.0: Setup and Imports for NLP Pre-processing")
print("-"*30)

# --- 1. Install necessary libraries ---
import sys
!{sys.executable} -m pip install nltk pandas --quiet

import pandas as pd
import re
import nltk
import os

# --- 2. Download NLTK resources ---
# We need 'punkt' and 'punkt_tab' for word tokenization. The most reliable
# way to ensure they are available in a notebook is to call download() directly.
# NLTK will not re-download the data if it's already present.
print("Ensuring NLTK resources ('punkt', 'punkt_tab') are available...")
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True) # This command fixes the LookupError.
print("NLTK resources are up to date.")

# --- 3. Define File path for fallback loading ---
NEWS_ARTICLES_FILE = "filtered_news_articles.csv.gz"

print("\nSetup complete. Libraries and NLTK resources are ready.")

3.1.0: Setup and Imports for NLP Pre-processing
------------------------------



[notice] A new release of pip is available: 23.2.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Ensuring NLTK resources ('punkt', 'punkt_tab') are available...
NLTK resources are up to date.

Setup complete. Libraries and NLTK resources are ready.


#### Here we calculate a `Mention_Density` metric for each atricle that will help us understand how relevant an article is for our selected company.

In [79]:
print("3.1.1: Calculate Mention Density Score for Each Article")
print("-"*30)

# --- 1. Define Comprehensive Company Keywords ---
COMPANY_KEYWORDS = {
    'AAPL': {
        'aapl', 'apple', 'iphone', 'ipad', 'macbook', 'imac', 'watchos',
        'ios', 'macos', 'airpods', 'tim cook', 'app store', 'vision pro'
    },
    'NVDA': {
        'nvda', 'nvidia', 'geforce', 'rtx', 'quadro', 'tesla', # Note: 'tesla' is a GPU arch
        'cuda', 'dgx', 'tegra', 'jensen huang', 'omniverse'
    },
    'GOOGL': {
        'googl', 'google', 'alphabet', 'android', 'youtube', 'chrome',
        'pixel', 'nest', 'waymo', 'gcp', 'sundar pichai', 'gemini'
    }
}
print("Comprehensive keyword dictionaries defined.")

# --- 2. Load the Dataset (prioritizing live DataFrame) ---
news_mention_density_df = None # Initialize to None

# Prioritize using the 'news_articles_df' if it's already in memory.
if 'news_articles_df' in locals() and isinstance(news_articles_df, pd.DataFrame) and not news_articles_df.empty:
    print("Using the live 'news_articles_df' DataFrame from the current session.")
    news_mention_density_df = news_articles_df.copy() # Use a copy to avoid side effects
    
# NOTE: --- This block is commented out but can be used for future sessions ---
# Fallback to loading from the file if the live DataFrame is not available.
# elif os.path.exists(NEWS_ARTICLES_FILE):
#     print(f"Loading dataset from '{NEWS_ARTICLES_FILE}'...")
#     # In a new session, you would uncomment the line below:
#     # news_mention_density_df = pd.read_csv(NEWS_ARTICLES_FILE, compression='gzip')
# -----------------------------------------------------------------------

if news_mention_density_df is None:
    print(f"ERROR: 'news_articles_df' not found in the current session.")
    print(f"To run this cell independently, uncomment the file loading logic above and ensure '{NEWS_ARTICLES_FILE}' exists.")
else:
    print(f"DataFrame loaded with {len(news_mention_density_df):,} articles.")
    
    # --- 3. Define the Density Calculation Function ---
    def calculate_all_densities(text):
        if not isinstance(text, str) or not text.strip():
            return pd.Series({f'{ticker}_Density': 0.0 for ticker in COMPANY_KEYWORDS})
        text_lower = text.lower()
        total_words = len(nltk.word_tokenize(text_lower))
        if total_words == 0:
            return pd.Series({f'{ticker}_Density': 0.0 for ticker in COMPANY_KEYWORDS})
        densities = {}
        for ticker, keywords in COMPANY_KEYWORDS.items():
            pattern = r'\b(' + '|'.join(re.escape(k) for k in keywords) + r')\b'
            mention_count = len(re.findall(pattern, text_lower))
            density = mention_count / total_words if total_words > 0 else 0
            densities[f'{ticker}_Density'] = density
        return pd.Series(densities)

    # --- 4. Apply the function to the DataFrame ---
    print("\nCalculating mention densities for all articles. This may take a few minutes...")
    density_scores = news_mention_density_df['text'].apply(calculate_all_densities)
    
    news_mention_density_df = pd.concat([news_mention_density_df, density_scores], axis=1)
    print("Mention density calculation complete.")
    
    # --- 5. Display Results ---
    print("\n--- DataFrame with Mention Density Scores ---")
    display(news_mention_density_df[['date', 'ticker', 'AAPL_Density', 'NVDA_Density', 'GOOGL_Density']].head())

3.1.1: Calculate Mention Density Score for Each Article
------------------------------
Comprehensive keyword dictionaries defined.
Using the live 'news_articles_df' DataFrame from the current session.
DataFrame loaded with 731,183 articles.

Calculating mention densities for all articles. This may take a few minutes...
Mention density calculation complete.

--- DataFrame with Mention Density Scores ---


Unnamed: 0,date,ticker,AAPL_Density,NVDA_Density,GOOGL_Density
0,2018-01-01,GOOGL,0.002762,0.0,0.005525
1,2018-01-01,GOOGL,0.0,0.0,0.003781
2,2018-01-01,GOOGL,0.0,0.0,0.001136
3,2018-01-01,GOOGL,0.0,0.0,0.00114
4,2018-01-01,AAPL,0.0,0.0,0.0


#### Now that we have `Mention_Density` for each company, we can select the Top `5` articles per day for each company that have a `Mention_Density` score of >1%.
We will also include `article_volume` for each day, representing the total number of articles published for that company (with `Mention_Density` >1%.) in that day. 

In [130]:
print("3.1.2: Final Optimized Down-Sampling with Language Filter and Volume")
print("-"*30)

# --- Import language detection library ---
import sys
!{sys.executable} -m pip install langdetect --quiet
from langdetect import detect, LangDetectException

# --- 1. Define Filtering Parameters ---
# MIN_DENSITY_THRESHOLD: The minimum relevance score an article must have to be considered.
MIN_DENSITY_THRESHOLD = 0.01
# TOP_N_ARTICLES: The maximum number of highest-scoring articles to select for any given day.
TOP_N_ARTICLES = 5

print(f"Filtering Parameters:")
print(f" - Minimum Mention Density Threshold: {MIN_DENSITY_THRESHOLD}")
print(f" - Top N Articles per Day: {TOP_N_ARTICLES}\n")

# --- 2. Initialize Storage ---
# This dictionary will hold the final, filtered DataFrames, one for each company.
company_top_articles = {}

# Check if the main DataFrame from the previous step is available in memory.
if 'news_mention_density_df' in locals() and not news_mention_density_df.empty:
    
    # --- Step A: De-duplicate Articles ---
    # This ensures we only process each unique article text once per day.
    print(f"Original article entry count: {len(news_mention_density_df):,}")
    deduped_df = news_mention_density_df.drop_duplicates(subset=['date', 'text']).copy()
    print(f"De-duplicated article count: {len(deduped_df):,}")
    
    # --- Step B: Lightweight Language Filtering with Progress Indicator ---
    # This function checks only the first 100 characters of text for speed.
    def is_english_fast(text):
        try:
            # We only need a small sample of the text to accurately detect the language.
            sample = text[:100] if isinstance(text, str) else ''
            # Return True only if the sample is valid and detected as English ('en').
            return sample.strip() and detect(sample) == 'en'
        except LangDetectException:
            # If detection fails, we assume it's not the language we want.
            return False

    print("\nFiltering for English-language articles (with progress updates)...")
    total_texts_to_check = len(deduped_df)
    print_interval = 10000  # How often to print an update.
    
    english_mask = []
    # We use an explicit loop here to provide progress feedback.
    for i, text in enumerate(deduped_df['text']):
        english_mask.append(is_english_fast(text))
        
        # This block prints a status update at the specified interval.
        if (i + 1) % print_interval == 0 or (i + 1) == total_texts_to_check:
            print(f"  > Language check progress: {i + 1:,} of {total_texts_to_check:,} articles processed...")

    # Use the generated boolean mask to select only the English articles.
    english_df = deduped_df[english_mask]
    print(f"\nFiltered down to {len(english_df):,} English articles.\n")
    
    # --- Step 3. Process Each Company Independently ---
    for ticker in COMPANY_KEYWORDS.keys():
        density_col = f'{ticker}_Density'
        print(f"--- Processing {ticker} ---")

        # Step C: Filter for relevance based on the density threshold.
        relevant_articles_df = english_df[english_df[density_col] >= MIN_DENSITY_THRESHOLD].copy()
        
        if relevant_articles_df.empty:
            print(f"  > No articles found for {ticker} above the density threshold. Skipping.")
            company_top_articles[ticker] = pd.DataFrame()
            continue
        
        print(f"  > Found {len(relevant_articles_df):,} relevant English articles for {ticker}.")

        # Step D: Calculate daily article volume from the relevant set.
        daily_volume = relevant_articles_df.groupby('date').size().rename('article_volume')
        
        # Step E: Perform the "Top N per Day" selection.
        top_n_df = relevant_articles_df.groupby('date').apply(
            lambda x: x.nlargest(TOP_N_ARTICLES, density_col),
            include_groups=False
        ).reset_index(level=0)
        
        # Step F: Merge the daily volume feature onto the down-sampled DataFrame.
        top_n_df = top_n_df.merge(daily_volume, on='date', how='left')
        
        # Step G: Clean up the ticker column for clarity.
        top_n_df = top_n_df.drop(columns=['ticker'])
        top_n_df['assigned_ticker'] = ticker
        
        print(f"  > Down-sampled to {len(top_n_df):,} top articles for NLP analysis.")

        # Store the final, enriched DataFrame in the dictionary.
        company_top_articles[ticker] = top_n_df
    
    print("\nDown-sampling and all filtering complete.")
    
    # --- 4. Display a Sample of the Results ---
    print("\n--- Sample of Final 'AAPL' Articles ---")
    if 'AAPL' in company_top_articles and not company_top_articles['AAPL'].empty:
        display(company_top_articles['AAPL'][['date', 'assigned_ticker', 'AAPL_Density', 'article_volume']].head())
    else:
        print("No articles to display for AAPL.")
        
else:
    print("ERROR: 'news_mention_density_df' not found. Please run cell 3.1.1 first.")

3.1.2: Final Optimized Down-Sampling with Language Filter and Volume
------------------------------



[notice] A new release of pip is available: 23.2.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Filtering Parameters:
 - Minimum Mention Density Threshold: 0.01
 - Top N Articles per Day: 5

Original article entry count: 731,183
De-duplicated article count: 338,316

Filtering for English-language articles (with progress updates)...
  > Language check progress: 10,000 of 338,316 articles processed...
  > Language check progress: 20,000 of 338,316 articles processed...
  > Language check progress: 30,000 of 338,316 articles processed...
  > Language check progress: 40,000 of 338,316 articles processed...
  > Language check progress: 50,000 of 338,316 articles processed...
  > Language check progress: 60,000 of 338,316 articles processed...
  > Language check progress: 70,000 of 338,316 articles processed...
  > Language check progress: 80,000 of 338,316 articles processed...
  > Language check progress: 90,000 of 338,316 articles processed...
  > Language check progress: 100,000 of 338,316 articles processed...
  > Language check progress: 110,000 of 338,316 articles processed...
 

Unnamed: 0,date,assigned_ticker,AAPL_Density,article_volume
0,2018-01-01,AAPL,0.1,8
1,2018-01-01,AAPL,0.083333,8
2,2018-01-01,AAPL,0.076923,8
3,2018-01-01,AAPL,0.0625,8
4,2018-01-01,AAPL,0.021277,8


In [131]:
for TICKER in list(company_top_articles):
    print(TICKER)
    display(company_top_articles[TICKER].head(10))

AAPL


Unnamed: 0,date,text,AAPL_Density,NVDA_Density,GOOGL_Density,article_volume,assigned_ticker
0,2018-01-01,Apple offers battery discounts as apology for ...,0.1,0.0,0.0,8,AAPL
1,2018-01-01,Apple's battery mea culpa joins rapidly growin...,0.083333,0.0,0.0,8,AAPL
2,2018-01-01,Could Apple’s battery program encourage people...,0.076923,0.0,0.0,8,AAPL
3,2018-01-01,"Benzinga's Bulls & Bears: Apple, GE, Starbucks...",0.0625,0.0625,0.0,8,AAPL
4,2018-01-01,Suspect in first NYC homicide of 2018 found ha...,0.021277,0.0,0.0,8,AAPL
5,2018-01-02,Apple has already started discounting iPhone b...,0.285714,0.0,0.0,51,AAPL
6,2018-01-02,Apple to replace iPhone batteries regardless o...,0.166667,0.0,0.0,51,AAPL
7,2018-01-02,iPhone buyer survey raises confidence in stron...,0.142857,0.0,0.0,51,AAPL
8,2018-01-02,CLSA: Apple iPhone X expectations 'still too h...,0.121212,0.0,0.0,51,AAPL
9,2018-01-02,Cramer: Apple's iPhone 'batterygate' issue doe...,0.114286,0.0,0.0,51,AAPL


NVDA


Unnamed: 0,date,text,AAPL_Density,NVDA_Density,GOOGL_Density,article_volume,assigned_ticker
0,2018-01-01,"Benzinga's Bulls & Bears: Apple, GE, Starbucks...",0.0625,0.0625,0.0,1,NVDA
1,2018-01-02,Here's Why NVIDIA Stock is Worth Adding to You...,0.0,0.090909,0.0,7,NVDA
2,2018-01-02,"For Tesla, 2018 is all about getting Model 3 p...",0.0,0.055556,0.0,7,NVDA
3,2018-01-02,"Nvidia, Other Chip Winners Seen Repeating Succ...",0.0,0.020362,0.0,7,NVDA
4,2018-01-02,"Why You Should Watch Nvidia, Intel, Other Chip...",0.002532,0.020253,0.0,7,NVDA
5,2018-01-02,"Why You Should Watch Nvidia, Intel, Other Chip...",0.002451,0.019608,0.0,7,NVDA
6,2018-01-03,"Benchmark: WDC, NVDA adding fuel to RISC-V mov...",0.0,0.1,0.0,10,NVDA
7,2018-01-03,"Nvidia, AMD Among Top Semis In 2018, According...",0.0,0.071429,0.0,10,NVDA
8,2018-01-03,"IBM, Intel Divide Dow As AMD, Nvidia, Ambarell...",0.0,0.066667,0.0,10,NVDA
9,2018-01-03,"NVIDIA, Vulcan Materials and Alibaba highlight...",0.0,0.066667,0.0,10,NVDA


GOOGL


Unnamed: 0,date,text,AAPL_Density,NVDA_Density,GOOGL_Density,article_volume,assigned_ticker
0,2018-01-01,Amazon's feud with Google is getting pettier t...,0.0,0.0,0.1,13,GOOGL
1,2018-01-01,"Sue Grafton, author of best-selling ‘alphabet’...",0.0,0.0,0.071429,13,GOOGL
2,2018-01-01,A grandmother's hilarious interaction with a n...,0.0,0.0,0.066667,13,GOOGL
3,2018-01-01,Former Google career coach shares 3 ways to ac...,0.0,0.0,0.0625,13,GOOGL
4,2018-01-01,Google looks to mend fences after rocky 2017\n...,0.0,0.0,0.058824,13,GOOGL
5,2018-01-02,Google ‘NBA founder’ and brace yourself: Its a...,0.0,0.0,0.08,25,GOOGL
6,2018-01-02,Google is now testing its mysterious Fuchsia O...,0.0,0.0,0.056075,25,GOOGL
7,2018-01-02,Social Media Sites In Germany Can Now Be Fined...,0.0,0.0,0.054054,25,GOOGL
8,2018-01-02,The Alibaba browser no one’s heard of is dethr...,0.0,0.0,0.042857,25,GOOGL
9,2018-01-02,"Buying Costco, Selling Alphabet, Verizon\n\nTh...",0.0,0.0,0.041667,25,GOOGL


In [132]:
print("3.1.3: Inspect Down-Sampled DataFrames for Missing Data and Volume")
print("-"*30)

if 'company_top_articles' in locals() and isinstance(company_top_articles, dict):

    full_date_range = pd.date_range(start=f'{START_YEAR}-01-01', end=f'{END_YEAR}-12-31', freq='D')
    total_days_in_period = len(full_date_range)
    
    print(f"Analyzing data coverage over a total period of {total_days_in_period} days ({START_YEAR}-{END_YEAR}).\n")

    for ticker, df in company_top_articles.items():
        print(f"--- Inspection for {ticker} ---")
        
        if df.empty:
            print("  > No articles were selected for this ticker. DataFrame is empty.\n")
            continue
        
        df['date'] = pd.to_datetime(df['date'])
        
        total_articles = len(df)
        unique_days_with_articles = df['date'].nunique()
        missing_days = total_days_in_period - unique_days_with_articles
        coverage_percentage = (unique_days_with_articles / total_days_in_period) * 100
        
        print(f"  Article Coverage:")
        print(f"    > Total selected articles for NLP: {total_articles:,}")
        print(f"    > Unique days with articles: {unique_days_with_articles:,}")
        print(f"    > Days with NO articles (to be imputed): {missing_days:,}")
        print(f"    > Daily coverage percentage: {coverage_percentage:.2f}%")
        
        # --- NEW: Reporting on 'article_volume' ---
        # To get daily stats, we first drop duplicates for each day since 'article_volume' is repeated.
        daily_stats_df = df.drop_duplicates(subset=['date'])
        
        print(f"\n  Daily Article Volume (for days with coverage):")
        print(f"    > Mean daily volume: {daily_stats_df['article_volume'].mean():.2f}")
        print(f"    > Median daily volume: {daily_stats_df['article_volume'].median():.0f}")
        print(f"    > Max daily volume: {daily_stats_df['article_volume'].max():.0f}")
        print(f"    > Min daily volume: {daily_stats_df['article_volume'].min():.0f}\n")
        
else:
    print("ERROR: 'company_top_articles' dictionary not found. Please run cell 3.1.2 first.")

3.1.3: Inspect Down-Sampled DataFrames for Missing Data and Volume
------------------------------
Analyzing data coverage over a total period of 2557 days (2018-2024).

--- Inspection for AAPL ---
  Article Coverage:
    > Total selected articles for NLP: 10,634
    > Unique days with articles: 2,336
    > Days with NO articles (to be imputed): 221
    > Daily coverage percentage: 91.36%

  Daily Article Volume (for days with coverage):
    > Mean daily volume: 25.17
    > Median daily volume: 14
    > Max daily volume: 387
    > Min daily volume: 1

--- Inspection for NVDA ---
  Article Coverage:
    > Total selected articles for NLP: 6,961
    > Unique days with articles: 2,117
    > Days with NO articles (to be imputed): 440
    > Daily coverage percentage: 82.79%

  Daily Article Volume (for days with coverage):
    > Mean daily volume: 5.61
    > Median daily volume: 3
    > Max daily volume: 103
    > Min daily volume: 1

--- Inspection for GOOGL ---
  Article Coverage:
    > Tot

#### These steps below are just exporting and verifying importing doesn't break anything.

In [133]:
print("3.1.4: Save Filtered Articles Dictionary to a File")
print("-"*30)

# Import the pickle library for object serialization.
import pickle

# Check if the dictionary from the previous cell exists.
if 'company_top_articles' in locals() and isinstance(company_top_articles, dict):
    
    # Define the output filename for our serialized dictionary.
    output_filename = "company_top_articles.pkl"

    print(f"Saving the 'company_top_articles' dictionary to '{output_filename}'...")
    print("This file will serve as a checkpoint to avoid re-running the filtering process.")

    try:
        # Open the file in write-binary ('wb') mode.
        with open(output_filename, 'wb') as f:
            # Use pickle.dump() to serialize the entire dictionary object into the file.
            pickle.dump(company_top_articles, f)
        
        print(f"\nDictionary successfully saved to '{output_filename}'.")
        
        # Provide a snippet for how to load the data in a future session.
        print("\nTo load this data in a future session, you can use the following code:")
        print("------------------------------------------------------------------")
        print("import pickle")
        print("with open('company_top_articles.pkl', 'rb') as f:")
        print("    company_top_articles = pickle.load(f)")
        print("------------------------------------------------------------------")

    except Exception as e:
        print(f"\nAn error occurred while saving the dictionary: {e}")
else:
    print("ERROR: 'company_top_articles' dictionary not found. Please run the previous cells first.")

3.1.4: Save Filtered Articles Dictionary to a File
------------------------------
Saving the 'company_top_articles' dictionary to 'company_top_articles.pkl'...
This file will serve as a checkpoint to avoid re-running the filtering process.

Dictionary successfully saved to 'company_top_articles.pkl'.

To load this data in a future session, you can use the following code:
------------------------------------------------------------------
import pickle
with open('company_top_articles.pkl', 'rb') as f:
    company_top_articles = pickle.load(f)
------------------------------------------------------------------


In [134]:
print("3.1.5: Load and Verify the Filtered Articles Dictionary from File")
print("-"*30)

# Import the pickle library and pandas for displaying DataFrames.
import pickle
import pandas as pd

# Define the filename where the dictionary was saved.
FILENAME = "company_top_articles.pkl"

try:
    # Open the file in read-binary ('rb') mode.
    with open(FILENAME, 'rb') as f:
        # Load the entire dictionary object from the pickle file.
        loaded_company_articles = pickle.load(f)
    
    print(f"Successfully loaded dictionary from '{FILENAME}'.")
    print("Verifying contents by displaying the head of each DataFrame...\n")
    
    # --- Verification Loop ---
    # Iterate through the keys (tickers) and values (DataFrames) of the loaded dictionary.
    for ticker, df in loaded_company_articles.items():
        print(f"--- Top 10 Articles for: {ticker} ---")
        
        if not df.empty:
            # Display the first 10 rows for visual inspection.
            display(df.head(10))
        else:
            print("  > DataFrame is empty for this ticker.")
        print("\n" + "-"*50 + "\n")

except FileNotFoundError:
    print(f"ERROR: The file '{FILENAME}' was not found. Please run cell 3.1.4 to save the file first.")
except Exception as e:
    print(f"An unexpected error occurred while loading the file: {e}")

3.1.5: Load and Verify the Filtered Articles Dictionary from File
------------------------------
Successfully loaded dictionary from 'company_top_articles.pkl'.
Verifying contents by displaying the head of each DataFrame...

--- Top 10 Articles for: AAPL ---


Unnamed: 0,date,text,AAPL_Density,NVDA_Density,GOOGL_Density,article_volume,assigned_ticker
0,2018-01-01,Apple offers battery discounts as apology for ...,0.1,0.0,0.0,8,AAPL
1,2018-01-01,Apple's battery mea culpa joins rapidly growin...,0.083333,0.0,0.0,8,AAPL
2,2018-01-01,Could Apple’s battery program encourage people...,0.076923,0.0,0.0,8,AAPL
3,2018-01-01,"Benzinga's Bulls & Bears: Apple, GE, Starbucks...",0.0625,0.0625,0.0,8,AAPL
4,2018-01-01,Suspect in first NYC homicide of 2018 found ha...,0.021277,0.0,0.0,8,AAPL
5,2018-01-02,Apple has already started discounting iPhone b...,0.285714,0.0,0.0,51,AAPL
6,2018-01-02,Apple to replace iPhone batteries regardless o...,0.166667,0.0,0.0,51,AAPL
7,2018-01-02,iPhone buyer survey raises confidence in stron...,0.142857,0.0,0.0,51,AAPL
8,2018-01-02,CLSA: Apple iPhone X expectations 'still too h...,0.121212,0.0,0.0,51,AAPL
9,2018-01-02,Cramer: Apple's iPhone 'batterygate' issue doe...,0.114286,0.0,0.0,51,AAPL



--------------------------------------------------

--- Top 10 Articles for: NVDA ---


Unnamed: 0,date,text,AAPL_Density,NVDA_Density,GOOGL_Density,article_volume,assigned_ticker
0,2018-01-01,"Benzinga's Bulls & Bears: Apple, GE, Starbucks...",0.0625,0.0625,0.0,1,NVDA
1,2018-01-02,Here's Why NVIDIA Stock is Worth Adding to You...,0.0,0.090909,0.0,7,NVDA
2,2018-01-02,"For Tesla, 2018 is all about getting Model 3 p...",0.0,0.055556,0.0,7,NVDA
3,2018-01-02,"Nvidia, Other Chip Winners Seen Repeating Succ...",0.0,0.020362,0.0,7,NVDA
4,2018-01-02,"Why You Should Watch Nvidia, Intel, Other Chip...",0.002532,0.020253,0.0,7,NVDA
5,2018-01-02,"Why You Should Watch Nvidia, Intel, Other Chip...",0.002451,0.019608,0.0,7,NVDA
6,2018-01-03,"Benchmark: WDC, NVDA adding fuel to RISC-V mov...",0.0,0.1,0.0,10,NVDA
7,2018-01-03,"Nvidia, AMD Among Top Semis In 2018, According...",0.0,0.071429,0.0,10,NVDA
8,2018-01-03,"IBM, Intel Divide Dow As AMD, Nvidia, Ambarell...",0.0,0.066667,0.0,10,NVDA
9,2018-01-03,"NVIDIA, Vulcan Materials and Alibaba highlight...",0.0,0.066667,0.0,10,NVDA



--------------------------------------------------

--- Top 10 Articles for: GOOGL ---


Unnamed: 0,date,text,AAPL_Density,NVDA_Density,GOOGL_Density,article_volume,assigned_ticker
0,2018-01-01,Amazon's feud with Google is getting pettier t...,0.0,0.0,0.1,13,GOOGL
1,2018-01-01,"Sue Grafton, author of best-selling ‘alphabet’...",0.0,0.0,0.071429,13,GOOGL
2,2018-01-01,A grandmother's hilarious interaction with a n...,0.0,0.0,0.066667,13,GOOGL
3,2018-01-01,Former Google career coach shares 3 ways to ac...,0.0,0.0,0.0625,13,GOOGL
4,2018-01-01,Google looks to mend fences after rocky 2017\n...,0.0,0.0,0.058824,13,GOOGL
5,2018-01-02,Google ‘NBA founder’ and brace yourself: Its a...,0.0,0.0,0.08,25,GOOGL
6,2018-01-02,Google is now testing its mysterious Fuchsia O...,0.0,0.0,0.056075,25,GOOGL
7,2018-01-02,Social Media Sites In Germany Can Now Be Fined...,0.0,0.0,0.054054,25,GOOGL
8,2018-01-02,The Alibaba browser no one’s heard of is dethr...,0.0,0.0,0.042857,25,GOOGL
9,2018-01-02,"Buying Costco, Selling Alphabet, Verizon\n\nTh...",0.0,0.0,0.041667,25,GOOGL



--------------------------------------------------



#### And we have completed the Down-Sampling. Next, we calculate composite_sentiment scores using `FinBERT`
We now have some missing data because news articles (with `Mention_Density` >1%) are sometimes unavailable for certain days.\
Our lowest coverage is `82.79%` so we can safely reconstruct the missing sentiment data after we have computed these scores.

In [9]:
print("3.1.6: Install GPU-Enabled PyTorch & Transformers")
print("-"*30)

import sys

# --- Step 1: Install CUDA-enabled PyTorch ---
# To use an NVIDIA GPU, we must install a version of PyTorch that has been
# compiled with the correct CUDA toolkit. The command below targets CUDA 12.1,
# which is compatible with modern NVIDIA drivers and your RTX 3050 Ti.
print("Installing CUDA-enabled PyTorch (this may take a few minutes)...")
!{sys.executable} -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --quiet

# --- Step 2: Install Transformers ---
# Now that the correct torch version is installed, we can install transformers.
print("Installing Hugging Face Transformers library...")
!{sys.executable} -m pip install transformers --quiet

# --- Step 3: Critical Kernel Restart ---
# The new libraries, especially the GPU-enabled PyTorch, will not be recognized
# by the current session. A kernel restart is mandatory.
print("\n" + "-"*50)
print("! ! ! CRITICAL STEP ! ! !")
print("Installation complete. You MUST now restart the Jupyter kernel.")
print("In VS Code / Jupyter: Find the 'Restart' button for the kernel.")
print("After restarting, you can run the subsequent cells (3.1.7 onward).")
print("-"*50)

3.1.6: Install GPU-Enabled PyTorch & Transformers
------------------------------
Installing CUDA-enabled PyTorch (this may take a few minutes)...
Installing Hugging Face Transformers library...



[notice] A new release of pip is available: 23.2.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip



--------------------------------------------------
! ! ! CRITICAL STEP ! ! !
Installation complete. You MUST now restart the Jupyter kernel.
In VS Code / Jupyter: Find the 'Restart' button for the kernel.
After restarting, you can run the subsequent cells (3.1.7 onward).
--------------------------------------------------



[notice] A new release of pip is available: 23.2.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [1]:
print("3.1.7: Further Imports for the Pipeline")
print("-"*30)
# --- 2. Import necessary components ---
import torch
from transformers import pipeline, AutoTokenizer
import sys # Good to import here just in case

# --- 3. Initialize the Sentiment Analysis Pipeline with GPU Auto-Detection ---
print("Initializing the FinBERT sentiment analysis pipeline...")

# Check if a CUDA-compatible GPU is available
if torch.cuda.is_available():
    device_id = 0
    print(f"GPU detected: {torch.cuda.get_device_name(device_id)}. The pipeline will run on the GPU.")
else:
    device_id = -1
    print("No GPU detected or PyTorch CUDA is not installed. The pipeline will run on the CPU.")

# Load the main pipeline
# The device parameter automatically assigns the model to the GPU (0) or CPU (-1).
sentiment_pipeline = pipeline("sentiment-analysis", model="ProsusAI/finbert", device=device_id)

# Load the tokenizer separately.
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

print("\nFinBERT pipeline and tokenizer initialized successfully.")

3.1.7: Further Imports for the Pipeline
------------------------------


  from .autonotebook import tqdm as notebook_tqdm


Initializing the FinBERT sentiment analysis pipeline...
GPU detected: NVIDIA GeForce RTX 3050 Ti Laptop GPU. The pipeline will run on the GPU.


ValueError: Could not load model ProsusAI/finbert with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSequenceClassification'>, <class 'transformers.models.bert.modeling_bert.BertForSequenceClassification'>). See the original errors:

while loading with AutoModelForSequenceClassification, an error is thrown:
Traceback (most recent call last):
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\pipelines\base.py", line 293, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\models\auto\auto_factory.py", line 604, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 277, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 4962, in from_pretrained
    config, dtype, dtype_orig = _get_dtype(
                                ^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 1250, in _get_dtype
    state_dict = load_state_dict(
                 ^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 508, in load_state_dict
    check_torch_load_is_safe()
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\utils\import_utils.py", line 1647, in check_torch_load_is_safe
    raise ValueError(
ValueError: Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply when loading files with safetensors.
See the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\pipelines\base.py", line 311, in infer_framework_load_model
    model = model_class.from_pretrained(model, **fp32_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\models\auto\auto_factory.py", line 604, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 277, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 5048, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 5316, in _load_pretrained_model
    load_state_dict(checkpoint_files[0], map_location="meta", weights_only=weights_only).keys()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 508, in load_state_dict
    check_torch_load_is_safe()
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\utils\import_utils.py", line 1647, in check_torch_load_is_safe
    raise ValueError(
ValueError: Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply when loading files with safetensors.
See the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434

while loading with BertForSequenceClassification, an error is thrown:
Traceback (most recent call last):
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\pipelines\base.py", line 293, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 277, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 4962, in from_pretrained
    config, dtype, dtype_orig = _get_dtype(
                                ^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 1250, in _get_dtype
    state_dict = load_state_dict(
                 ^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 508, in load_state_dict
    check_torch_load_is_safe()
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\utils\import_utils.py", line 1647, in check_torch_load_is_safe
    raise ValueError(
ValueError: Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply when loading files with safetensors.
See the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\pipelines\base.py", line 311, in infer_framework_load_model
    model = model_class.from_pretrained(model, **fp32_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 277, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 5048, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 5316, in _load_pretrained_model
    load_state_dict(checkpoint_files[0], map_location="meta", weights_only=weights_only).keys()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\modeling_utils.py", line 508, in load_state_dict
    check_torch_load_is_safe()
  File "c:\Users\totob\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\utils\import_utils.py", line 1647, in check_torch_load_is_safe
    raise ValueError(
ValueError: Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply when loading files with safetensors.
See the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434




In [None]:
print("3.1.8: Re-import the dataframes after kernel restart")
print("-"*30)

import pickle
with open('company_top_articles.pkl', 'rb') as f:
    company_top_articles = pickle.load(f)


In [None]:
print("3.1.9: Apply NLP with Chunking and Aggregate Daily Sentiment")
print("-"*30)

import pandas as pd
import numpy as np

# --- 1. Define Helper Functions ---

def get_sentiment_score(result):
    """Converts the pipeline's output dictionary to a single numerical score."""
    label = result['label']
    score = result['score']
    if label == 'positive':
        return score
    elif label == 'negative':
        return -score
    return 0.0

def analyze_sentiment_with_chunking(text):
    """
    Analyzes sentiment of a text, handling long inputs by chunking them.
    Averages the sentiment scores of all chunks for a final document score.
    """
    # Max tokens for the model, leaving a small buffer for special tokens.
    max_length = 512
    # Amount of token overlap between chunks to maintain context.
    overlap = 25
    
    try:
        # Tokenize the text to see if it needs chunking.
        tokens = tokenizer.encode(text, return_tensors='pt', truncation=False)
        
        # If text is short enough, process it in one go.
        if tokens.size(1) <= max_length:
            result = sentiment_pipeline(text)[0]
            return get_sentiment_score(result)

        # If text is too long, split it into chunks.
        # We decode the token chunks back to text strings for the pipeline.
        chunk_texts = []
        for i in range(0, tokens.size(1), max_length - overlap):
            chunk_tokens = tokens[0, i:i + max_length]
            chunk_texts.append(tokenizer.decode(chunk_tokens, skip_special_tokens=True))
        
        # Run the pipeline on the list of chunks.
        chunk_sentiments = sentiment_pipeline(chunk_texts)
        
        # Calculate the average sentiment score from all the chunks.
        scores = [get_sentiment_score(result) for result in chunk_sentiments]
        return sum(scores) / len(scores) if scores else 0.0

    except Exception:
        # If any error occurs during processing, return a neutral score.
        return 0.0

# --- 2. Initialize Storage for Final Results ---
daily_sentiment_dfs = {}

if 'company_top_articles' in locals():
    # --- 3. Iterate Through Each Company's DataFrame ---
    total_articles = sum(len(df) for df in company_top_articles.values())
    print(f"Starting NLP analysis on a total of {total_articles:,} articles.")
    processed_count = 0
    
    for ticker, df in company_top_articles.items():
        print(f"\n--- Processing {ticker} DataFrame with {len(df):,} articles ---")
        
        if df.empty:
            print("  > DataFrame is empty. Skipping.")
            daily_sentiment_dfs[ticker] = pd.DataFrame()
            continue

        # --- 4. Apply Sentiment Analysis with Progress Indicator ---
        # We use .apply() to run our new chunking-aware function on each article text.
        df['sentiment_score'] = df['text'].apply(analyze_sentiment_with_chunking)
        
        # This is a simplified progress update after each company is done.
        processed_count += len(df)
        print(f"  > NLP Progress: {processed_count:,} of {total_articles:,} articles analyzed.")

        # --- 5. Aggregate to a Daily Time-Series DataFrame ---
        daily_agg_df = df.groupby('date').agg(
            average_density=(f'{ticker}_Density', 'mean'),
            article_volume=('article_volume', 'first'),
            average_news_sentiment=('sentiment_score', 'mean')
        ).reset_index()

        daily_agg_df = daily_agg_df.rename(columns={'date': 'Period', 'average_density': f'AVG_{ticker}_Density'})
        
        print(f"  > Aggregation for {ticker} complete. Created daily time-series with {len(daily_agg_df)} unique days.")
        daily_sentiment_dfs[ticker] = daily_agg_df

    print("\n--- All NLP Processing and Aggregation Complete ---")
    
    # --- 6. Display a Sample of the Final Output ---
    print("\n--- Sample of Final Daily Sentiment DataFrame for 'AAPL' ---")
    if 'AAPL' in daily_sentiment_dfs and not daily_sentiment_dfs['AAPL'].empty:
        display(daily_sentiment_dfs['AAPL'].head())

else:
    print("ERROR: 'company_top_articles' dictionary not found. Please run the previous cells first.")

3.1.7: Apply NLP with Chunking and Aggregate Daily Sentiment
------------------------------
Starting NLP analysis on a total of 27,705 articles.

--- Processing AAPL DataFrame with 10,634 articles ---


Token indices sequence length is longer than the specified maximum sequence length for this model (710 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (513 > 512). Running this sequence through the model will result in indexing errors


KeyboardInterrupt: 

### 3.2: NLP Pipeline for `market_headlines_df`

### 3.3: NLP Pipeline for `filings_df`