News Source:

https://www.kaggle.com/datasets/notlucasp/financial-news-headlines/data

Contents:

Data scraped from CNBC contains the headlines, last-updated date, and preview text of articles from the end of December 2017 to July 19th, 2020.

Data scraped from the Guardian Business contains only the headlines and last-updated date of articles from the end of December 2017 to July 19th, 2020, since the Guardian Business does not offer preview text.

Data scraped from Reuters contains the headlines, last-updated date, and preview text of articles from the end of March 2018 to July 19th, 2020.

Sentiment via TextBlob:

TextBlob is not an ML-trained model like BERT or FinBERT. Its default sentiment analyzer is dictionary-based, so it works acceptably for simple financial headlines but may miss sarcasm, jargon, or complex financial tone.

# Financial News Agentic AI Pipeline Summary

### 1. Dataset Preparation
- Load multiple news datasets (CNBC, Guardian, Reuters).
- Handle differences in columns (e.g., some may not have `Description`).

### 2. Text Preprocessing
- Lowercase all text.
- Keep alphanumeric characters only.
- Remove stopwords.
- Output: `cleaned_text`.
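
The three cleaning steps above can be sketched without any external dependency. This is a minimal illustration, not the notebook's exact code: `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, and `clean_text` is an illustrative name.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list, kept tiny so the
# sketch runs without any downloads.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "as", "and"}

def clean_text(text: str) -> str:
    """Lowercase, drop non-alphanumeric characters, then remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)  # keep letters, digits, whitespace
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("Tesla's shares are UP 5% in the Q2 rally!"))
# teslas shares up 5 q2 rally
```

Note that stripping punctuation fuses possessives (`Tesla's` becomes `teslas`) and drops symbols such as `%`, which is worth keeping in mind when matching keywords later.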

### 3. Topic Tagging
- Keyword-based classification of each headline (groups are checked in the order below; the first match wins):
  - `earnings`: earnings, revenue, EPS, quarter
  - `market`: stock, shares, price
  - `macro`: fed, inflation, GDP, rate
  - `general`: default
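
A rough sketch of the keyword tagging, assuming the keyword groups listed above. The `whole_words` flag is a hypothetical addition for comparison: with it off, the test is a substring match on the cleaned string (so short keywords can fire inside longer words); with it on, only whole tokens count.

```python
# Keyword groups in priority order: the first group with a hit wins.
TOPIC_KEYWORDS = [
    ("earnings", ["earnings", "quarter", "revenue", "eps"]),
    ("market", ["market", "stock", "shares", "price"]),
    ("macro", ["fed", "inflation", "gdp", "rate"]),
]

def tag_topic(cleaned_text: str, whole_words: bool = False) -> str:
    """Return the first topic whose keywords appear in the text."""
    tokens = set(cleaned_text.split())
    for topic, keywords in TOPIC_KEYWORDS:
        if whole_words:
            hit = any(k in tokens for k in keywords)        # exact token match
        else:
            hit = any(k in cleaned_text for k in keywords)  # substring match
        if hit:
            return topic
    return "general"

print(tag_topic("tesla shares jump record quarter"))               # earnings
print(tag_topic("tesla takes steps cut costs"))                    # earnings
print(tag_topic("tesla takes steps cut costs", whole_words=True))  # general
```

The second call lands in `earnings` only because `eps` occurs inside `steps`; whole-token matching avoids that kind of false positive at the cost of missing inflected forms.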

### 4. Sentiment Analysis
- Use **TextBlob** to compute the polarity of the cleaned text:
  - `positive` if polarity > 0.1  
  - `negative` if polarity < -0.1  
  - `neutral` otherwise
- Output: sentiment label per headline.
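
The thresholding itself is independent of TextBlob and can be sketched as a small function. The name `label_sentiment` and the `threshold` parameter are illustrative assumptions; the input is assumed to be a polarity score in the range -1 to 1, as TextBlob reports it.

```python
def label_sentiment(polarity: float, threshold: float = 0.1) -> str:
    """Map a polarity score to a label using a symmetric dead zone."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

for p in (0.5, 0.1, 0.0, -0.1, -0.3):
    print(p, label_sentiment(p))
```

Both comparisons are strict, so scores of exactly 0.1 and -0.1 fall in the neutral band.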

### 5. Routing to Specialist Agents
- **Earnings agent:** topic + sentiment → `eps_signal`, `revenue_signal` (1 if positive, otherwise 0)
- **Market agent:** topic + sentiment → `market_signal` (1, 0, or -1)
- **Macro agent:** topic + sentiment → `macro_signal` (1, 0, or -1)
- **General agent:** topic + sentiment → `general_signal` (always 0)
- Agents are **rule-based**; the only library-computed input is the TextBlob sentiment label.
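
One way to express the routing is a dispatch table keyed by topic. This is a simplified sketch: the agents here take only the sentiment label, and `AGENTS`, `route`, and `_direction` are illustrative names rather than the notebook's own.

```python
def _direction(sentiment: str) -> int:
    """+1 for positive, -1 for negative, 0 otherwise."""
    return {"positive": 1, "negative": -1}.get(sentiment, 0)

def earnings_agent(sentiment: str) -> dict:
    # Positive raises both signals; negative and neutral both give 0.
    signal = 1 if sentiment == "positive" else 0
    return {"eps_signal": signal, "revenue_signal": signal}

def market_agent(sentiment: str) -> dict:
    return {"market_signal": _direction(sentiment)}

def macro_agent(sentiment: str) -> dict:
    return {"macro_signal": _direction(sentiment)}

def general_agent(sentiment: str) -> dict:
    return {"general_signal": 0}

AGENTS = {"earnings": earnings_agent, "market": market_agent, "macro": macro_agent}

def route(topic: str, sentiment: str) -> dict:
    """Pick the agent for the topic, falling back to the general agent."""
    return AGENTS.get(topic, general_agent)(sentiment)

print(route("earnings", "positive"))  # {'eps_signal': 1, 'revenue_signal': 1}
print(route("market", "negative"))    # {'market_signal': -1}
print(route("sports", "positive"))    # {'general_signal': 0}
```

A table like this keeps the topic-to-agent mapping in one place, so adding a topic means adding one entry instead of another `elif` branch.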

### 6. Tesla Filtering
- Select headlines mentioning `"Tesla"` (case-insensitive) in `Headlines` or `Description` (if it exists).
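
The filter logic can be illustrated on plain dictionaries, without pandas. The records below are made up for illustration and are not rows from the dataset; `mentions_company` is a hypothetical helper name.

```python
def mentions_company(record: dict, company: str = "Tesla") -> bool:
    """True if the company name appears in Headlines or Description (if present)."""
    needle = company.lower()
    fields = (record.get("Headlines"), record.get("Description"))
    # Skip missing or non-string values (e.g. NaN floats from a CSV load).
    return any(isinstance(f, str) and needle in f.lower() for f in fields)

# Made-up records for illustration only.
records = [
    {"Headlines": "TESLA deliveries beat estimates"},
    {"Headlines": "Carmakers rally", "Description": "Tesla led the gains."},
    {"Headlines": "Oil prices slip", "Description": None},
]
matches = [r for r in records if mentions_company(r)]
print(len(matches))  # 2
```

The first record has no `Description` key at all, which mirrors the Guardian data, and the match is case-insensitive.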

### 7. Weighted Aggregation
- Combine all agent outputs for Tesla headlines.
- Signals are multiplied by predefined weights:
  - `eps_signal` → weight 3  
  - `revenue_signal` → weight 2  
  - `market_signal` → weight 2  
  - `macro_signal` → weight 1  
  - `general_signal` → weight 0
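
As a worked example of the weighting, the sketch below scores two agent outputs of the kind the pipeline prints. `weighted_score` is an illustrative helper name; the weights are the ones listed above.

```python
WEIGHTS = {
    "eps_signal": 3,
    "revenue_signal": 2,
    "market_signal": 2,
    "macro_signal": 1,
    "general_signal": 0,
}

def weighted_score(agent_output: dict) -> int:
    """Sum each signal multiplied by its weight; unknown keys count as 0."""
    return sum(value * WEIGHTS.get(key, 0) for key, value in agent_output.items())

print(weighted_score({"eps_signal": 1, "revenue_signal": 1}))  # 5
print(weighted_score({"market_signal": -1}))                   # -2
```

A positive earnings headline therefore contributes 3 + 2 = 5, while a negative market headline contributes -2.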

### 8. Trade Suggestion
- Sum weighted signals to compute total score.
- Map total score → trade action:
  - Total ≥ 3 → **BUY**
  - Total ≤ -2 → **SELL**
  - Otherwise → **HOLD**
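
The mapping can be checked by hand against the CNBC run, whose printed non-zero article scores are four of -2 and two of +2 (all other Tesla headlines score 0). `trade_action` is an illustrative name for the thresholding step.

```python
def trade_action(total_score: int) -> str:
    """Map the summed weighted score to a trade suggestion."""
    if total_score >= 3:
        return "BUY"
    if total_score <= -2:
        return "SELL"
    return "HOLD"

# Non-zero per-article scores printed for the CNBC dataset.
cnbc_scores = [-2, -2, -2, -2, 2, 2]
total = sum(cnbc_scores)
print(total, trade_action(total))  # -4 SELL
```

The thresholds are asymmetric: a single negative market headline (-2) is enough to trigger SELL, whereas BUY needs a total of at least 3.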

### 9. Reporting
- Print per dataset:
  - Number of Tesla headlines
  - Aggregated trade suggestion (BUY/HOLD/SELL)
  - Weighted score and agent output for each article with a non-zero, non-general signal



In [1]:
import re
import nltk
import os
import shutil
import kagglehub
import pandas as pd
from nltk.corpus import stopwords
from textblob import TextBlob

In [12]:
# ------------------------
# Setup
# ------------------------
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# ------------------------
# Download dataset (cached)
# ------------------------
path = kagglehub.dataset_download("notlucasp/financial-news-headlines")
print("Cache path:", path)

# ------------------------
# Define target folder
# ------------------------
target_dir = os.path.abspath(os.path.join("..", "news-datasets", "kaggle-headlines-data"))
if os.path.exists(target_dir):
    shutil.rmtree(target_dir)
os.makedirs(target_dir, exist_ok=True)

# ------------------------
# Copy all files from versioned folder to target
# ------------------------
for item in os.listdir(path):
    src_file = os.path.join(path, item)
    if os.path.isfile(src_file):
        shutil.copy(src_file, target_dir)

# ------------------------
# Show where files were copied, count, and filenames
# ------------------------
# os.listdir() order is arbitrary; sort so files[0], files[1], files[2] are stable
files = sorted(os.listdir(target_dir))
print("Files moved to:", target_dir)
print("Number of files:", len(files))
print("Filenames:", files)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Cache path: C:\Users\Administrator\.cache\kagglehub\datasets\notlucasp\financial-news-headlines\versions\2
Files moved to: C:\Users\Administrator\Documents\GitHub\aai-520-final-project-group5\news-datasets\kaggle-headlines-data
Number of files: 3
Filenames: ['cnbc_headlines.csv', 'guardian_headlines.csv', 'reuters_headlines.csv']


In [16]:
# Load cnbc CSV into dataframe
file_path = os.path.join(target_dir, files[0])
cnbc_raw_df = pd.read_csv(file_path)

# Truncate text for display
cnbc_display = cnbc_raw_df.copy()
cnbc_display["Headlines"] = cnbc_display["Headlines"].fillna("").str[:50] + "…"
if "Description" in cnbc_display.columns:
    cnbc_display["Description"] = cnbc_display["Description"].fillna("").str[:50] + "…"

# Print truncated dataframe
print(files[0])
print(cnbc_display.head().to_string(index=False))

cnbc_headlines.csv
                                          Headlines                           Time                                         Description
Jim Cramer: A better way to invest in the Covid-19…  7:51  PM ET Fri, 17 July 2020 "Mad Money" host Jim Cramer recommended buying fou…
    Cramer's lightning round: I would own Teradyne…  7:33  PM ET Fri, 17 July 2020 "Mad Money" host Jim Cramer rings the lightning ro…
                                                  …                            NaN                                                   …
Cramer's week ahead: Big week for earnings, even b…  7:25  PM ET Fri, 17 July 2020 "We'll pay more for the earnings of the non-Covid …
IQ Capital CEO Keith Bliss says tech and healthcar…  4:24  PM ET Fri, 17 July 2020 Keith Bliss, IQ Capital CEO, joins "Closing Bell" …


In [5]:
cnbc_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3080 entries, 0 to 3079
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Headlines    2800 non-null   object
 1   Time         2800 non-null   object
 2   Description  2800 non-null   object
dtypes: object(3)
memory usage: 72.3+ KB


In [15]:
# Load guardian CSV into dataframe
file_path = os.path.join(target_dir, files[1])
guardian_raw_df = pd.read_csv(file_path)

# Truncate text for display
guardian_display = guardian_raw_df.copy()
guardian_display["Headlines"] = guardian_display["Headlines"].fillna("").str[:50] + "…"

# Print truncated dataframe
print(files[1])
print(guardian_display.head().to_string(index=False))

guardian_headlines.csv
     Time                                              Headlines
18-Jul-20      Johnson is asking Santa for a Christmas recovery…
18-Jul-20    ‘I now fear the worst’: four grim tales of working…
18-Jul-20    Five key areas Sunak must tackle to serve up econo…
18-Jul-20    Covid-19 leaves firms ‘fatally ill-prepared’ for n…
18-Jul-20 The Week in Patriarchy  \n\n\n  Bacardi's 'lady vodka…


In [7]:
guardian_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17800 entries, 0 to 17799
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Time       17800 non-null  object
 1   Headlines  17800 non-null  object
dtypes: object(2)
memory usage: 278.3+ KB


In [13]:
# Load reuters CSV into dataframe
file_path = os.path.join(target_dir, files[2])
reuters_raw_df = pd.read_csv(file_path)

# Truncate to first 50 characters directly using .str
reuters_display = reuters_raw_df.copy()
reuters_display["Headlines"] = reuters_display["Headlines"].str[:50] + "…"
if "Description" in reuters_display.columns:
    reuters_display["Description"] = reuters_display["Description"].str[:50] + "…"

print(files[2])
print(reuters_display.head().to_string(index=False))

reuters_headlines.csv
                                          Headlines        Time                                         Description
TikTok considers London and other locations for he… Jul 18 2020 TikTok has been in discussions with the UK governm…
Disney cuts ad spending on Facebook amid growing b… Jul 18 2020 Walt Disney  has become the latest company to slas…
Trail of missing Wirecard executive leads to Belar… Jul 18 2020 Former Wirecard  chief operating officer Jan Marsa…
Twitter says attackers downloaded data from up to … Jul 18 2020 Twitter Inc said on Saturday that hackers were abl…
U.S. Republicans seek liability protections as cor… Jul 17 2020 A battle in the U.S. Congress over a new coronavir…


In [10]:
reuters_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32770 entries, 0 to 32769
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Headlines    32770 non-null  object
 1   Time         32770 non-null  object
 2   Description  32770 non-null  object
dtypes: object(3)
memory usage: 768.2+ KB


In [34]:
# ------------------------
# Define company to filter
# ------------------------
company_name = "Tesla"  # Change this to filter a different company

# ------------------------
# Define datasets dictionary
# ------------------------
datasets = {
    "cnbc": cnbc_raw_df,
    "guardian": guardian_raw_df,
    "reuters": reuters_raw_df
}

# ------------------------
# Function: Preprocessing + Tagging
# ------------------------
def preprocess_and_tag(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Keep alphanumeric only
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # 3. Remove stopwords
    tokens = [w for w in text.split() if w not in stop_words]
    cleaned_text = " ".join(tokens)
    
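    # 4. Topic tagging (first matching group wins).
    #    Note: `word in cleaned_text` is a substring test on the joined string,
    #    so "rate" also matches "corporate" and "eps" also matches "steps".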
    if any(word in cleaned_text for word in ["earnings", "quarter", "revenue", "eps"]):
        topic = "earnings"
    elif any(word in cleaned_text for word in ["market", "stock", "shares", "price"]):
        topic = "market"
    elif any(word in cleaned_text for word in ["fed", "inflation", "gdp", "rate"]):
        topic = "macro"
    else:
        topic = "general"
    
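    # 5. Sentiment tagging using TextBlob.
    #    Polarity is computed on cleaned_text, i.e. after stopwords (including
    #    negations such as "not") have already been removed.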
    polarity = TextBlob(cleaned_text).sentiment.polarity
    if polarity > 0.1:
        sentiment = "positive"
    elif polarity < -0.1:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    
    return cleaned_text, topic, sentiment

# ------------------------
# Specialist Agent Functions
# ------------------------
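def earnings_agent(text, sentiment):
    # Positive sentiment raises both signals; negative and neutral both map to 0,
    # so this agent never emits a bearish signal (unlike market_agent/macro_agent).
    eps_signal = 1 if sentiment == "positive" else 0
    revenue_signal = 1 if sentiment == "positive" else 0
    return {"eps_signal": eps_signal, "revenue_signal": revenue_signal}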

def market_agent(text, sentiment):
    market_signal = 1 if sentiment == "positive" else (-1 if sentiment == "negative" else 0)
    return {"market_signal": market_signal}

def macro_agent(text, sentiment):
    macro_signal = 1 if sentiment == "positive" else (-1 if sentiment == "negative" else 0)
    return {"macro_signal": macro_signal}

def general_agent(text, sentiment):
    return {"general_signal": 0}

# ------------------------
# Routing Function
# ------------------------
def route_to_agent(row):
    topic = row["topic"]
    sentiment = row["sentiment"]
    text = row["cleaned_text"]
    
    if topic == "earnings":
        return earnings_agent(text, sentiment)
    elif topic == "market":
        return market_agent(text, sentiment)
    elif topic == "macro":
        return macro_agent(text, sentiment)
    else:
        return general_agent(text, sentiment)

# ------------------------
# Function: Full processing for a single dataset with weighted scores and agent labels
# ------------------------
def process_dataset_company_with_agent_scores(df, label, company_name):
    df = df.copy()
    
    # Preprocess Headlines
    df[["cleaned_text", "topic", "sentiment"]] = df["Headlines"].apply(
        lambda x: pd.Series(preprocess_and_tag(str(x)))
    )
    
    # Route each row to its specialist agent; the agent name is the topic label
    df["agent_output"] = [route_to_agent(row) for _, row in df.iterrows()]
    df["agent_name"] = df["topic"]
    
    # Filter company-specific headlines
    if "Description" in df.columns:
        company_df = df[
            df["Headlines"].str.contains(company_name, case=False, na=False) |
            df["Description"].str.contains(company_name, case=False, na=False)
        ]
    else:
        company_df = df[
            df["Headlines"].str.contains(company_name, case=False, na=False)
        ]
    
    # Define weights
    # ------------------------
    # The weights reflect the relative importance of each agent's signal to the final trade suggestion:
    # - eps_signal (3): EPS is a key indicator of company profitability, most impactful on buy/sell decisions.
    # - revenue_signal (2): Revenue trends are important but slightly less critical than EPS.
    # - market_signal (2): Overall stock/market sentiment influences short-term price moves, similar importance to revenue.
    # - macro_signal (1): Macro news (Fed, GDP, inflation) has an indirect effect, so lower weight.
    # - general_signal (0): General or uncategorized news has minimal predictive value, so weight is 0.
    weights = {
        "eps_signal": 3,
        "revenue_signal": 2,
        "market_signal": 2,
        "macro_signal": 1,
        "general_signal": 0
    }
    
    # Compute weighted scores per row
    weighted_scores = []
    for out in company_df["agent_output"]:
        score = sum(value * weights.get(key, 0) for key, value in out.items())
        weighted_scores.append(score)
    
    # Aggregate to determine trade signal
    total_score = sum(weighted_scores)
    if total_score >= 3:
        trade_signal = "BUY"
    elif total_score <= -2:
        trade_signal = "SELL"
    else:
        trade_signal = "HOLD"
    
    # ------------------------
    # Print dataset-specific results
    # ------------------------
    print(f"--- Dataset: {label} --- Total {company_name} headlines count: {len(company_df)} Trade suggestion: {trade_signal} ---")
    print("Non-general agent weighted scores per article:")
    for idx, (row, score) in enumerate(zip(company_df.itertuples(), weighted_scores), start=1):
        # Only show outputs that are not from the general agent
        filtered_out = {k: v for k, v in row.agent_output.items() if k != "general_signal" and v != 0}
        if filtered_out:
            print(f"    Article {idx}: Agent: {row.agent_name} | Weighted Article Score: {score} | Agent Output: {filtered_out}")
    print("\n")
    
    return df

# ------------------------
# Process all datasets for the company with agent labels and weighted scores
# ------------------------
processed_dfs_with_agent_scores = {}
for label, df in datasets.items():
    processed_dfs_with_agent_scores[label] = process_dataset_company_with_agent_scores(df, label, company_name)

--- Dataset: cnbc --- Total Tesla headlines count: 36 Trade suggestion: SELL ---
Non-general agent weighted scores per article:
    Article 12: Agent: market | Weighted Article Score: -2 | Agent Output: {'market_signal': -1}
    Article 17: Agent: market | Weighted Article Score: -2 | Agent Output: {'market_signal': -1}
    Article 20: Agent: market | Weighted Article Score: -2 | Agent Output: {'market_signal': -1}
    Article 25: Agent: market | Weighted Article Score: -2 | Agent Output: {'market_signal': -1}
    Article 31: Agent: market | Weighted Article Score: 2 | Agent Output: {'market_signal': 1}
    Article 32: Agent: market | Weighted Article Score: 2 | Agent Output: {'market_signal': 1}


--- Dataset: guardian --- Total Tesla headlines count: 78 Trade suggestion: BUY ---
Non-general agent weighted scores per article:
    Article 26: Agent: earnings | Weighted Article Score: 5 | Agent Output: {'eps_signal': 1, 'revenue_signal': 1}
    Article 38: Agent: earnings | Weighted Art