News Source:

https://www.kaggle.com/datasets/notlucasp/financial-news-headlines/data

Contents:

Data scraped from CNBC contains the headlines, last-updated date, and preview text of articles from the end of December 2017 to July 19th, 2020.

Data scraped from the Guardian Business contains only the headlines and last-updated date of articles from the end of December 2017 to July 19th, 2020, since the Guardian Business does not offer preview text.

Data scraped from Reuters contains the headlines, last-updated date, and preview text of articles from the end of March 2018 to July 19th, 2020.

Sentiment via TextBlob:

TextBlob's default analyzer is not an ML-trained model like BERT or FinBERT. It is dictionary-based, so it works acceptably for simple financial headlines but may miss sarcasm, jargon, or complex financial tone.

# Financial News Agentic AI Pipeline Summary

### 1. Dataset Preparation
- Load multiple news datasets (CNBC, Guardian, Reuters).
- Handle differences in columns (e.g., some may not have `Description`).

### 2. Text Preprocessing
- Lowercase all text.
- Keep only letters, digits, and whitespace (punctuation is removed).
- Remove stopwords.
- Output: `cleaned_text`.
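
The steps above can be sketched as a small standalone function. This is a minimal illustration rather than the notebook's exact code: `clean_text` is a name chosen here, and `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words("english")
SAMPLE_STOPWORDS = {"the", "a", "an", "of", "to", "in", "for", "is"}

def clean_text(text, stop_words=SAMPLE_STOPWORDS):
    # 1. Lowercase
    text = text.lower()
    # 2. Drop everything except lowercase letters, digits, and whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # 3. Remove stopwords token by token
    tokens = [w for w in text.split() if w not in stop_words]
    return " ".join(tokens)

print(clean_text("Tesla's Q2 earnings beat the Street!"))
# teslas q2 earnings beat street
```

Note that removing punctuation also removes apostrophes, so `Tesla's` becomes `teslas`.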

### 3. Topic Tagging
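- Keyword-based classification of each headline (topics are checked in this order and the first match wins; keywords are matched as lowercase substrings of the cleaned text):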
  - `earnings`: earnings, revenue, EPS, quarter
  - `market`: stock, shares, price
  - `macro`: fed, inflation, GDP, rate
  - `general`: default
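
A minimal sketch of this tagging rule follows. `tag_topic_substring` mirrors the substring test used in the notebook; `tag_topic_tokens` is a hypothetical whole-word alternative, included only to show how the two can disagree. Both function names and the `TOPIC_KEYWORDS` table are assumptions made for this example.

```python
# Ordered list: earlier topics take priority over later ones
TOPIC_KEYWORDS = [
    ("earnings", ["earnings", "quarter", "revenue", "eps"]),
    ("market", ["market", "stock", "shares", "price"]),
    ("macro", ["fed", "inflation", "gdp", "rate"]),
]

def tag_topic_substring(cleaned_text):
    # Substring test: a keyword matches anywhere inside the text
    for topic, keywords in TOPIC_KEYWORDS:
        if any(k in cleaned_text for k in keywords):
            return topic
    return "general"

def tag_topic_tokens(cleaned_text):
    # Whole-word test: a keyword must equal one of the tokens
    tokens = set(cleaned_text.split())
    for topic, keywords in TOPIC_KEYWORDS:
        if tokens & set(keywords):
            return topic
    return "general"

headline = "corporate bond sales slow"
print(tag_topic_substring(headline))  # macro ("rate" occurs inside "corporate")
print(tag_topic_tokens(headline))     # general
```

Substring matching is more permissive (it also catches plurals such as `stocks`), at the cost of false positives like the one shown.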

### 4. Sentiment Analysis
- Use **TextBlob** to compute the polarity of the cleaned text:
  - `positive` if polarity > 0.1  
  - `negative` if polarity < -0.1  
  - `neutral` otherwise
- Output: sentiment label per headline.
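
The thresholding can be sketched on its own, independent of TextBlob. In the notebook the polarity value comes from `TextBlob(text).sentiment.polarity`, which lies in [-1, 1]; the helper name `label_from_polarity` and its `threshold` parameter are assumptions made for this example.

```python
def label_from_polarity(polarity, threshold=0.1):
    # Strict inequalities: a polarity of exactly +/-0.1 stays neutral
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

for p in (0.5, 0.1, 0.0, -0.1, -0.35):
    print(p, label_from_polarity(p))
# 0.5 positive
# 0.1 neutral
# 0.0 neutral
# -0.1 neutral
# -0.35 negative
```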

### 5. Routing to Specialist Agents
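- **Earnings agent:** topic + sentiment → `eps_signal`, `revenue_signal` (1 if positive, otherwise 0; never negative)
- **Market agent:** topic + sentiment → `market_signal` (+1 positive, -1 negative, 0 neutral)
- **Macro agent:** topic + sentiment → `macro_signal` (+1 positive, -1 negative, 0 neutral)
- **General agent:** topic + sentiment → `general_signal` (always 0)
- Agents are **rule-based**; their only input signal is the sentiment label derived from TextBlob's dictionary-based polarity score.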
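
The routing rules can be sketched as a lookup table from topic to agent. The notebook itself uses an `if`/`elif` chain and passes the cleaned text to each agent as well; the dictionary dispatch, the `signed` helper, and the `route` name below are assumptions made for this illustration.

```python
def signed(sentiment):
    # +1 for positive, -1 for negative, 0 for anything else
    return {"positive": 1, "negative": -1}.get(sentiment, 0)

def earnings_agent(sentiment):
    # Only positive sentiment produces a signal; negative maps to 0
    s = 1 if sentiment == "positive" else 0
    return {"eps_signal": s, "revenue_signal": s}

AGENTS = {
    "earnings": earnings_agent,
    "market": lambda s: {"market_signal": signed(s)},
    "macro": lambda s: {"macro_signal": signed(s)},
}

def route(topic, sentiment):
    # Unknown topics fall back to the general agent, which always emits 0
    agent = AGENTS.get(topic, lambda s: {"general_signal": 0})
    return agent(sentiment)

print(route("earnings", "negative"))  # {'eps_signal': 0, 'revenue_signal': 0}
print(route("market", "negative"))    # {'market_signal': -1}
print(route("general", "positive"))   # {'general_signal': 0}
```

The first call shows the asymmetry in the earnings agent: a negative earnings headline contributes nothing, while a negative market headline contributes -1.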

### 6. Tesla Filtering
- Select rows that mention `"Tesla"` (case-insensitive) in `Headlines` or in `Description` (if that column exists).
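
A minimal sketch of this filter, assuming pandas is available. The helper name `filter_mentions` and the sample rows are made up for illustration; the `str.contains` arguments match those used in the notebook.

```python
import pandas as pd

def filter_mentions(df, term="Tesla"):
    # na=False treats missing text as "no match" instead of propagating NaN
    mask = df["Headlines"].str.contains(term, case=False, na=False)
    if "Description" in df.columns:
        mask |= df["Description"].str.contains(term, case=False, na=False)
    return df[mask]

# Made-up rows, including one with missing values
sample = pd.DataFrame({
    "Headlines": ["TESLA shares jump", "Oil prices fall", None],
    "Description": ["EV maker rallies", "Analysts cite Tesla demand", None],
})

print(len(filter_mentions(sample)))                 # 2
print(len(filter_mentions(sample[["Headlines"]])))  # 1
```

The second call simulates a dataset without a `Description` column, as with the Guardian data.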

### 7. Weighted Aggregation
- Combine all agent outputs for Tesla headlines.
- Signals are multiplied by predefined weights:
  - `eps_signal` → weight 3  
  - `revenue_signal` → weight 2  
  - `market_signal` → weight 2  
  - `macro_signal` → weight 1  
  - `general_signal` → weight 0
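
As a worked example of the weighting, the sketch below scores four made-up agent outputs. The weights are the ones listed above; the `weighted_score` name and the sample outputs are assumptions.

```python
WEIGHTS = {
    "eps_signal": 3,
    "revenue_signal": 2,
    "market_signal": 2,
    "macro_signal": 1,
    "general_signal": 0,
}

def weighted_score(agent_outputs):
    # Multiply every signal by its weight and sum over all headlines
    return sum(
        value * WEIGHTS.get(key, 0)
        for out in agent_outputs
        for key, value in out.items()
    )

outputs = [
    {"eps_signal": 1, "revenue_signal": 1},  # 1*3 + 1*2 = 5
    {"market_signal": -1},                   # -1*2     = -2
    {"macro_signal": 1},                     # 1*1      = 1
    {"general_signal": 0},                   # 0*0      = 0
]
print(weighted_score(outputs))  # 4
```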

### 8. Trade Suggestion
- Sum the weighted signals over all Tesla headlines in the dataset to get a total score.
- Map total score → trade action:
  - Total ≥ 3 → **BUY**
  - Total ≤ -2 → **SELL**
  - Otherwise → **HOLD**
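
The mapping can be sketched as a small function; `trade_action` is a name chosen for this example, while the thresholds are the ones listed above.

```python
def trade_action(total_score):
    if total_score >= 3:
        return "BUY"
    if total_score <= -2:
        return "SELL"
    return "HOLD"

for score in (5, 3, 2, 0, -1, -2):
    print(score, trade_action(score))
# 5 BUY
# 3 BUY
# 2 HOLD
# 0 HOLD
# -1 HOLD
# -2 SELL
```

The thresholds are asymmetric: with a market weight of 2, a single negative market headline (score -2) is enough for SELL, whereas a single positive market headline (score +2) still yields HOLD.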

### 9. Reporting
- Print per dataset:
  - Number of Tesla headlines
  - Aggregated trade suggestion (BUY/HOLD/SELL)



In [1]:
import re
import nltk
import os
import shutil
import kagglehub
import pandas as pd
from nltk.corpus import stopwords
from textblob import TextBlob

In [11]:
# ------------------------
# Setup
# ------------------------
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# ------------------------
# Download dataset (cached)
# ------------------------
path = kagglehub.dataset_download("notlucasp/financial-news-headlines")
print("Cache path:", path)

# ------------------------
# Define target folder
# ------------------------
target_dir = os.path.abspath(os.path.join("..", "news-datasets", "kaggle-headlines-data"))
if os.path.exists(target_dir):
    shutil.rmtree(target_dir)
os.makedirs(target_dir, exist_ok=True)

# ------------------------
# Copy all files from versioned folder to target
# ------------------------
for item in os.listdir(path):
    src_file = os.path.join(path, item)
    if os.path.isfile(src_file):
        shutil.copy(src_file, target_dir)

# ------------------------
# List dataset files
# ------------------------
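# os.listdir returns entries in arbitrary order; sort so that files[0], files[1],
# files[2] reliably map to the cnbc, guardian, and reuters CSVs used below
files = sorted(os.listdir(target_dir))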
print(f"{len(files)} Files in dataset:\n", files)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Cache path: C:\Users\Administrator\.cache\kagglehub\datasets\notlucasp\financial-news-headlines\versions\2
3 Files in dataset:
 ['cnbc_headlines.csv', 'guardian_headlines.csv', 'reuters_headlines.csv']


In [4]:
# Load cnbc CSV into dataframe
file_path = os.path.join(target_dir, files[0])
cnbc_raw_df = pd.read_csv(file_path)
print(files[0])
print(cnbc_raw_df.head().to_string(index=False))

cnbc_headlines.csv
                                                                Headlines                           Time                                                                                                                                        Description
     Jim Cramer: A better way to invest in the Covid-19 vaccine gold rush  7:51  PM ET Fri, 17 July 2020                                              "Mad Money" host Jim Cramer recommended buying four companies that are supporting vaccine developers.
                           Cramer's lightning round: I would own Teradyne  7:33  PM ET Fri, 17 July 2020        "Mad Money" host Jim Cramer rings the lightning round bell, which means he's giving his answers to callers' stock questions at rapid speed.
                                                                      NaN                            NaN                                                                                                                         

In [5]:
cnbc_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3080 entries, 0 to 3079
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Headlines    2800 non-null   object
 1   Time         2800 non-null   object
 2   Description  2800 non-null   object
dtypes: object(3)
memory usage: 72.3+ KB


In [6]:
# Load guardian CSV into dataframe
file_path = os.path.join(target_dir, files[1])
guardian_raw_df = pd.read_csv(file_path)
print(files[1])
print(guardian_raw_df.head().to_string(index=False))

guardian_headlines.csv
     Time                                                                                                         Headlines
18-Jul-20                                                                  Johnson is asking Santa for a Christmas recovery
18-Jul-20                                       ‘I now fear the worst’: four grim tales of working life upended by Covid-19
18-Jul-20                                                    Five key areas Sunak must tackle to serve up economic recovery
18-Jul-20                                                   Covid-19 leaves firms ‘fatally ill-prepared’ for no-deal Brexit
18-Jul-20 The Week in Patriarchy  \n\n\n  Bacardi's 'lady vodka': the latest in a long line of depressing gendered products


In [7]:
guardian_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17800 entries, 0 to 17799
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Time       17800 non-null  object
 1   Headlines  17800 non-null  object
dtypes: object(2)
memory usage: 278.3+ KB


In [9]:
# Load reuters CSV into dataframe
file_path = os.path.join(target_dir, files[2])
reuters_raw_df = pd.read_csv(file_path)
print(files[2])
print(reuters_raw_df.head().to_string(index=False))

reuters_headlines.csv
                                                                    Headlines        Time                                                                                                                                                                                                                                                                                      Description
                 TikTok considers London and other locations for headquarters Jul 18 2020                                                                TikTok has been in discussions with the UK government over the past few months to locate its headquarters in London, a source familiar with the matter said, as part of a strategy to distance itself from its Chinese ownership.
                Disney cuts ad spending on Facebook amid growing boycott: WSJ Jul 18 2020 Walt Disney  has become the latest company to slash its advertising spending on Facebook Inc  as the social media giant faces an a

In [10]:
reuters_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32770 entries, 0 to 32769
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Headlines    32770 non-null  object
 1   Time         32770 non-null  object
 2   Description  32770 non-null  object
dtypes: object(3)
memory usage: 768.2+ KB


In [39]:
# ------------------------
# Function: Preprocessing + Tagging
# ------------------------
def preprocess_and_tag(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Keep alphanumeric only
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # 3. Remove stopwords
    tokens = [w for w in text.split() if w not in stop_words]
    cleaned_text = " ".join(tokens)
    
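    # 4. Topic tagging (first match wins). Note: `in` on a string is a substring
    #    test, so e.g. "rate" also matches "corporate" and "eps" matches "steps".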
    if any(word in cleaned_text for word in ["earnings", "quarter", "revenue", "eps"]):
        topic = "earnings"
    elif any(word in cleaned_text for word in ["market", "stock", "shares", "price"]):
        topic = "market"
    elif any(word in cleaned_text for word in ["fed", "inflation", "gdp", "rate"]):
        topic = "macro"
    else:
        topic = "general"
    
    # 5. Sentiment tagging using TextBlob
    polarity = TextBlob(cleaned_text).sentiment.polarity
    if polarity > 0.1:
        sentiment = "positive"
    elif polarity < -0.1:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    
    return cleaned_text, topic, sentiment

# ------------------------
# Specialist Agent Functions
# ------------------------
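def earnings_agent(text, sentiment):
    # Positive sentiment sets both signals to 1; neutral and negative both map to 0,
    # so this agent never contributes a negative score.
    eps_signal = 1 if sentiment == "positive" else 0
    revenue_signal = 1 if sentiment == "positive" else 0
    return {"eps_signal": eps_signal, "revenue_signal": revenue_signal}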

def market_agent(text, sentiment):
    market_signal = 1 if sentiment == "positive" else (-1 if sentiment == "negative" else 0)
    return {"market_signal": market_signal}

def macro_agent(text, sentiment):
    macro_signal = 1 if sentiment == "positive" else (-1 if sentiment == "negative" else 0)
    return {"macro_signal": macro_signal}

def general_agent(text, sentiment):
    return {"general_signal": 0}

# ------------------------
# Routing Function
# ------------------------
def route_to_agent(row):
    topic = row["topic"]
    sentiment = row["sentiment"]
    text = row["cleaned_text"]
    
    if topic == "earnings":
        return earnings_agent(text, sentiment)
    elif topic == "market":
        return market_agent(text, sentiment)
    elif topic == "macro":
        return macro_agent(text, sentiment)
    else:
        return general_agent(text, sentiment)

# ------------------------
# Function: Full processing for a single dataset
# ------------------------
def process_dataset(df, label):
    df = df.copy()
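    # Preprocess Headlines (str(x) turns missing values into the text "nan",
    # which matches no topic keyword and so falls through to the general topic)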
    df[["cleaned_text", "topic", "sentiment"]] = df["Headlines"].apply(
        lambda x: pd.Series(preprocess_and_tag(str(x)))
    )
    # Route to agents
    df["agent_output"] = df.apply(route_to_agent, axis=1)
    
    # ------------------------
    # Filter Tesla-specific headlines
    # ------------------------
    if "Description" in df.columns:
        tesla_df = df[
            df["Headlines"].str.contains("Tesla", case=False, na=False) |
            df["Description"].str.contains("Tesla", case=False, na=False)
        ]
    else:
        tesla_df = df[
            df["Headlines"].str.contains("Tesla", case=False, na=False)
        ]
    
    # ------------------------
    # Aggregation + Weighted Trade Signal
    # ------------------------
    def determine_trade_signal(agent_outputs):
        weights = {
            "eps_signal": 3,
            "revenue_signal": 2,
            "market_signal": 2,
            "macro_signal": 1,
            "general_signal": 0
        }
        total_score = 0
        for out in agent_outputs:
            for key, value in out.items():
                total_score += value * weights.get(key, 0)
        if total_score >= 3:
            return "BUY"
        elif total_score <= -2:
            return "SELL"
        else:
            return "HOLD"
    
    tesla_outputs = list(tesla_df["agent_output"])
    trade_signal = determine_trade_signal(tesla_outputs)
    
    # ------------------------
    # Print dataset-specific results
    # ------------------------
    print(f"--- Dataset: {label} ---")
    print(f"Tesla headlines count: {len(tesla_df)}")
    print(f"Trade suggestion: {trade_signal}\n")
    
    return df

# ------------------------
# Process all datasets with labels
# ------------------------
datasets = {
    "cnbc": cnbc_raw_df,
    "guardian": guardian_raw_df,
    "reuters": reuters_raw_df
}

processed_dfs = {}
for label, df in datasets.items():
    processed_dfs[label] = process_dataset(df, label)

--- Dataset: cnbc ---
Tesla headlines count: 36
Trade suggestion: SELL

--- Dataset: guardian ---
Tesla headlines count: 78
Trade suggestion: BUY

--- Dataset: reuters ---
Tesla headlines count: 699
Trade suggestion: BUY

