News Source:

https://www.kaggle.com/datasets/notlucasp/financial-news-headlines/data

Contents:

Data scraped from CNBC contains the headlines, last-updated date, and preview text of articles from the end of December 2017 to July 19th, 2020.

Data scraped from the Guardian Business contains only the headlines and last-updated date of articles from the end of December 2017 to July 19th, 2020, because the Guardian Business does not offer preview text.

Data scraped from Reuters contains the headlines, last-updated date, and preview text of articles from the end of March 2018 to July 19th, 2020.

Sentiment via TextBlob:

TextBlob is not an ML-trained model like BERT or FinBERT. Its sentiment scoring is dictionary-based, so it works acceptably for simple financial headlines but may miss sarcasm, jargon, or complex financial tone.

# Financial News Agentic AI Pipeline Summary

### 1. Dataset Preparation
- Load multiple news datasets (CNBC, Guardian, Reuters).
- Handle differences in columns (e.g., some may not have `Description`).

### 2. Text Preprocessing
- Lowercase all text.
- Keep alphanumeric characters only.
- Remove stopwords.
- Output: `cleaned_text`.
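
The three cleaning steps above can be sketched with the standard library alone. `DEMO_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, and `clean_text` is an illustrative name, not a function from the notebook.

```python
import re

# Hypothetical, deliberately tiny stopword set so the sketch runs without
# downloading the NLTK corpus; the notebook uses stopwords.words('english').
DEMO_STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "for", "and", "is"}

def clean_text(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, keep only letters/digits/whitespace, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    tokens = [w for w in text.split() if w not in stop_words]
    return " ".join(tokens)

print(clean_text("Tesla's Q2 earnings beat the Street!"))
# teslas q2 earnings beat street
```

Note that stripping punctuation turns `Tesla's` into `teslas`, which matters for any later step that matches on exact tokens.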

### 3. Topic Tagging
- Keyword-based classification of each headline (first matching topic wins):
  - `earnings`: earnings, revenue, EPS, quarter
  - `market`: market, stock, shares, price
  - `macro`: fed, inflation, GDP, rate
  - `general`: default
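
A minimal sketch of the tagging rule, assuming whole-token matching. This is a deliberate variation: the notebook's `preprocess_and_tag` uses substring checks on the cleaned string, so a keyword such as `rate` would also match inside `corporate`. `TOPIC_KEYWORDS` and `tag_topic` are hypothetical names introduced for illustration.

```python
# Keyword sets follow the notebook's topic rules; dict order sets priority.
TOPIC_KEYWORDS = {
    "earnings": {"earnings", "revenue", "eps", "quarter"},
    "market": {"market", "stock", "shares", "price"},
    "macro": {"fed", "inflation", "gdp", "rate"},
}

def tag_topic(cleaned_text):
    """Return the first topic whose keyword set shares a token with the text."""
    tokens = set(cleaned_text.split())
    for topic, keywords in TOPIC_KEYWORDS.items():
        if tokens & keywords:
            return topic
    return "general"

print(tag_topic("tesla shares jump"))    # market
print(tag_topic("corporate tax plan"))   # general (a substring check would hit "rate")
```

Token matching trades recall for precision: it no longer catches plurals such as `stocks` unless they are added to the keyword sets.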

### 4. Sentiment Analysis
- Use **TextBlob** to compute a polarity score for each headline:
  - `positive` if polarity > 0.1  
  - `negative` if polarity < -0.1  
  - `neutral` otherwise
- Output: sentiment label per headline.
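
The thresholding step can be sketched on its own, independent of TextBlob. The function takes a polarity value (TextBlob reports polarity in the range -1.0 to 1.0); `label_sentiment` and its `threshold` parameter are hypothetical names for illustration.

```python
def label_sentiment(polarity, threshold=0.1):
    """Map a polarity score to a label using a symmetric dead zone around 0."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

for p in (0.5, 0.1, 0.0, -0.1, -0.35):
    print(p, label_sentiment(p))
# 0.5 positive
# 0.1 neutral
# 0.0 neutral
# -0.1 neutral
# -0.35 negative
```

The comparisons are strict, so a polarity of exactly 0.1 or -0.1 is labeled `neutral`.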

### 5. Routing to Specialist Agents
- **Earnings agent:** receives topic + sentiment → `eps_signal`, `revenue_signal`
- **Market agent:** topic + sentiment → `market_signal`
- **Macro agent:** topic + sentiment → `macro_signal`
- **General agent:** topic + sentiment → `general_signal` (always 0)
- Agents are **rule-based**; the only library-derived input is the TextBlob sentiment label, which is itself dictionary-based rather than ML-trained.
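
Routing amounts to a dictionary dispatch keyed on topic. The sketch below mirrors the notebook's agent rules in simplified form: agents here take only the sentiment label, and `signed`, `AGENT_MAP`, and `route` are hypothetical names for illustration.

```python
def signed(sentiment):
    """positive -> 1, negative -> -1, anything else -> 0."""
    return {"positive": 1, "negative": -1}.get(sentiment, 0)

def earnings_agent(sentiment):
    # As in the notebook: positive -> 1, neutral and negative both -> 0.
    s = 1 if sentiment == "positive" else 0
    return {"eps_signal": s, "revenue_signal": s}

def market_agent(sentiment):
    return {"market_signal": signed(sentiment)}

def macro_agent(sentiment):
    return {"macro_signal": signed(sentiment)}

def general_agent(sentiment):
    return {"general_signal": 0}

AGENT_MAP = {
    "earnings": earnings_agent,
    "market": market_agent,
    "macro": macro_agent,
    "general": general_agent,
}

def route(topic, sentiment):
    """Dispatch to the agent for this topic; unknown topics fall back to general."""
    return AGENT_MAP.get(topic, general_agent)(sentiment)

print(route("market", "negative"))    # {'market_signal': -1}
print(route("earnings", "negative"))  # {'eps_signal': 0, 'revenue_signal': 0}
```

One consequence of these rules is that an earnings headline can raise the score but never lower it, while market and macro headlines can do both.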

### 6. Tesla Filtering
- Select headlines that mention `"Tesla"` (case-insensitive) in `Headlines` or, when the column exists, in `Description`.

### 7. Weighted Aggregation
- Combine all agent outputs for Tesla headlines.
- Signals are multiplied by predefined weights:
  - `eps_signal` → weight 3  
  - `revenue_signal` → weight 2  
  - `market_signal` → weight 2  
  - `macro_signal` → weight 1  
  - `general_signal` → weight 0
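
As a worked example of the weighting, the per-article score is the sum of each signal times its weight. `WEIGHTS` copies the values listed above; `weighted_score` is a hypothetical helper name.

```python
WEIGHTS = {
    "eps_signal": 3,
    "revenue_signal": 2,
    "market_signal": 2,
    "macro_signal": 1,
    "general_signal": 0,
}

def weighted_score(agent_output, weights=WEIGHTS):
    """Sum signal * weight; signals without a listed weight contribute 0."""
    return sum(value * weights.get(name, 0) for name, value in agent_output.items())

# Positive earnings headline: 1*3 + 1*2
print(weighted_score({"eps_signal": 1, "revenue_signal": 1}))  # 5
# Negative market headline: -1*2
print(weighted_score({"market_signal": -1}))                   # -2
```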

### 8. Trade Suggestion
- Sum weighted signals to compute total score.
- Map total score → trade action:
  - Total ≥ 3 → **BUY**
  - Total ≤ -2 → **SELL**
  - Otherwise → **HOLD**
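
The mapping from total score to action can be sketched as a small function. `trade_action` is a hypothetical name, and the example list is illustrative: four market headlines scored -2 and two scored +2.

```python
def trade_action(total_score):
    """Map the summed weighted score to BUY / SELL / HOLD."""
    if total_score >= 3:
        return "BUY"
    if total_score <= -2:
        return "SELL"
    return "HOLD"

article_scores = [-2, -2, -2, -2, 2, 2]
total = sum(article_scores)
print(total, trade_action(total))  # -4 SELL
```

Because the total is a plain sum over every matching headline, it grows with the number of headlines, so the fixed thresholds are easier to reach on larger datasets. The thresholds are also asymmetric: a single negative market headline (-2) is enough for SELL, while BUY needs at least 3.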

### 9. Reporting
- Print per dataset:
  - Number of Tesla headlines
  - Aggregated trade suggestion (BUY/HOLD/SELL)



In [1]:
import re
import nltk
import os
import shutil
import kagglehub
import pandas as pd
from nltk.corpus import stopwords
from textblob import TextBlob

In [2]:
# ------------------------
# Setup
# ------------------------
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# ------------------------
# Download dataset (cached)
# ------------------------
path = kagglehub.dataset_download("notlucasp/financial-news-headlines")
print("Cache path:", path)

# ------------------------
# Define target folder
# ------------------------
target_dir = os.path.abspath(os.path.join("..", "news-datasets", "kaggle-headlines-data"))
if os.path.exists(target_dir):
    shutil.rmtree(target_dir)
os.makedirs(target_dir, exist_ok=True)

# ------------------------
# Copy all files from versioned folder to target
# ------------------------
for item in os.listdir(path):
    src_file = os.path.join(path, item)
    if os.path.isfile(src_file):
        shutil.copy(src_file, target_dir)

# ------------------------
# Show where files were copied, count, and filenames
# (sorted, because os.listdir order is arbitrary and later cells index files[0..2])
# ------------------------
files = sorted(os.listdir(target_dir))
print("Files moved to:", target_dir)
print("Number of files:", len(files))
print("Filenames:", files)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Cache path: C:\Users\Administrator\.cache\kagglehub\datasets\notlucasp\financial-news-headlines\versions\2
Files moved to: C:\Users\Administrator\Documents\GitHub\aai-520-final-project-group5\news-datasets\kaggle-headlines-data
Number of files: 3
Filenames: ['cnbc_headlines.csv', 'guardian_headlines.csv', 'reuters_headlines.csv']


In [3]:
# Load cnbc CSV into dataframe
file_path = os.path.join(target_dir, files[0])
cnbc_raw_df = pd.read_csv(file_path)

# Truncate text for display
cnbc_display = cnbc_raw_df.copy()
cnbc_display["Headlines"] = cnbc_display["Headlines"].fillna("").str[:50] + "…"
if "Description" in cnbc_display.columns:
    cnbc_display["Description"] = cnbc_display["Description"].fillna("").str[:50] + "…"

# Print truncated dataframe
print(files[0])
print(cnbc_display.head().to_string(index=False))

cnbc_headlines.csv
                                          Headlines                           Time                                         Description
Jim Cramer: A better way to invest in the Covid-19…  7:51  PM ET Fri, 17 July 2020 "Mad Money" host Jim Cramer recommended buying fou…
    Cramer's lightning round: I would own Teradyne…  7:33  PM ET Fri, 17 July 2020 "Mad Money" host Jim Cramer rings the lightning ro…
                                                  …                            NaN                                                   …
Cramer's week ahead: Big week for earnings, even b…  7:25  PM ET Fri, 17 July 2020 "We'll pay more for the earnings of the non-Covid …
IQ Capital CEO Keith Bliss says tech and healthcar…  4:24  PM ET Fri, 17 July 2020 Keith Bliss, IQ Capital CEO, joins "Closing Bell" …


In [4]:
cnbc_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3080 entries, 0 to 3079
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Headlines    2800 non-null   object
 1   Time         2800 non-null   object
 2   Description  2800 non-null   object
dtypes: object(3)
memory usage: 72.3+ KB


In [5]:
# Load guardian CSV into dataframe
file_path = os.path.join(target_dir, files[1])
guardian_raw_df = pd.read_csv(file_path)

# Truncate text for display
guardian_display = guardian_raw_df.copy()
guardian_display["Headlines"] = guardian_display["Headlines"].fillna("").str[:50] + "…"

# Print truncated dataframe
print(files[1])
print(guardian_display.head().to_string(index=False))

guardian_headlines.csv
     Time                                              Headlines
18-Jul-20      Johnson is asking Santa for a Christmas recovery…
18-Jul-20    ‘I now fear the worst’: four grim tales of working…
18-Jul-20    Five key areas Sunak must tackle to serve up econo…
18-Jul-20    Covid-19 leaves firms ‘fatally ill-prepared’ for n…
18-Jul-20 The Week in Patriarchy  \n\n\n  Bacardi's 'lady vodka…


In [6]:
guardian_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17800 entries, 0 to 17799
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Time       17800 non-null  object
 1   Headlines  17800 non-null  object
dtypes: object(2)
memory usage: 278.3+ KB


In [7]:
# Load reuters CSV into dataframe
file_path = os.path.join(target_dir, files[2])
reuters_raw_df = pd.read_csv(file_path)

# Truncate to first 50 characters directly using .str
reuters_display = reuters_raw_df.copy()
reuters_display["Headlines"] = reuters_display["Headlines"].str[:50] + "…"
if "Description" in reuters_display.columns:
    reuters_display["Description"] = reuters_display["Description"].str[:50] + "…"

print(files[2])
print(reuters_display.head().to_string(index=False))

reuters_headlines.csv
                                          Headlines        Time                                         Description
TikTok considers London and other locations for he… Jul 18 2020 TikTok has been in discussions with the UK governm…
Disney cuts ad spending on Facebook amid growing b… Jul 18 2020 Walt Disney  has become the latest company to slas…
Trail of missing Wirecard executive leads to Belar… Jul 18 2020 Former Wirecard  chief operating officer Jan Marsa…
Twitter says attackers downloaded data from up to … Jul 18 2020 Twitter Inc said on Saturday that hackers were abl…
U.S. Republicans seek liability protections as cor… Jul 17 2020 A battle in the U.S. Congress over a new coronavir…


In [8]:
reuters_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32770 entries, 0 to 32769
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Headlines    32770 non-null  object
 1   Time         32770 non-null  object
 2   Description  32770 non-null  object
dtypes: object(3)
memory usage: 768.2+ KB


In [16]:
# ------------------------
# Define company to filter
# ------------------------
company_name = "Tesla"  # Change this to filter a different company

# ------------------------
# Define datasets dictionary
# ------------------------
datasets = {
    "cnbc": cnbc_raw_df,
    "guardian": guardian_raw_df,
    "reuters": reuters_raw_df
}


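# Rule-based agent functions for the current dataset workflow.
# Each takes (text, sentiment) for a uniform interface; `text` is not used yet.
def earnings_agent(text, sentiment):
    # Positive sentiment -> 1; neutral and negative both -> 0 (there is no -1 branch)
    eps_signal = 1 if sentiment == "positive" else 0
    revenue_signal = 1 if sentiment == "positive" else 0
    return {"eps_signal": eps_signal, "revenue_signal": revenue_signal}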

def market_agent(text, sentiment):
    market_signal = 1 if sentiment == "positive" else (-1 if sentiment == "negative" else 0)
    return {"market_signal": market_signal}

def macro_agent(text, sentiment):
    macro_signal = 1 if sentiment == "positive" else (-1 if sentiment == "negative" else 0)
    return {"macro_signal": macro_signal}

def general_agent(text, sentiment):
    return {"general_signal": 0}

## Orchestrator Class

- Executes data retrieval and planning (whether it also owns a memory store is still an open question)
- Routes each record to the agent that handles its topic
- Executes the aggregator
- Routes aggregated data to the evaluator, then the optimizer
- Returns the final output

In [17]:

class Orchestrator:
    def __init__(self, datasets, stop_words, company_name):
        self.datasets = datasets
        self.stop_words = stop_words
        self.company_name = company_name
        
        # Define agent lookup for routing
        self.agent_map = {
            "earnings": earnings_agent,
            "market": market_agent,
            "macro": macro_agent,
            "general": general_agent
        }
        
        # Define weights for evaluation
        self.weights = {
            "eps_signal": 3,
            "revenue_signal": 2,
            "market_signal": 2,
            "macro_signal": 1,
            "general_signal": 0
        }
    
    def preprocess_and_tag(self, text):
        # 1. Lowercase
        text = text.lower()
        # 2. Keep alphanumeric only
        text = re.sub(r'[^a-z0-9\s]', '', text)
        # 3. Remove stopwords
        tokens = [w for w in text.split() if w not in self.stop_words]
        cleaned_text = " ".join(tokens)
        # 4. Topic tagging
        if any(word in cleaned_text for word in ["earnings", "quarter", "revenue", "eps"]):
            topic = "earnings"
        elif any(word in cleaned_text for word in ["market", "stock", "shares", "price"]):
            topic = "market"
        elif any(word in cleaned_text for word in ["fed", "inflation", "gdp", "rate"]):
            topic = "macro"
        else:
            topic = "general"
        # 5. Sentiment tagging
        polarity = TextBlob(cleaned_text).sentiment.polarity
        if polarity > 0.1:
            sentiment = "positive"
        elif polarity < -0.1:
            sentiment = "negative"
        else:
            sentiment = "neutral"
        return cleaned_text, topic, sentiment
    
    def route_to_agent(self, row):
        topic = row["topic"]
        sentiment = row["sentiment"]
        return self.agent_map.get(topic, general_agent)(row["cleaned_text"], sentiment)
    
    def process_dataset(self, df, label):
        df = df.copy()
        # Preprocess
        df[["cleaned_text", "topic", "sentiment"]] = df["Headlines"].apply(
            lambda x: pd.Series(self.preprocess_and_tag(str(x)))
        )
        # Route to agents
        agent_outputs = []
        agent_names = []
        for idx, row in df.iterrows():
            out = self.route_to_agent(row)
            agent_outputs.append(out)
            agent_names.append(row["topic"])
        df["agent_output"] = agent_outputs
        df["agent_name"] = agent_names
        # Filter company
        if "Description" in df.columns:
            company_df = df[
                df["Headlines"].str.contains(self.company_name, case=False, na=False) |
                df["Description"].str.contains(self.company_name, case=False, na=False)
            ]
        else:
            company_df = df[df["Headlines"].str.contains(self.company_name, case=False, na=False)]
        # Compute weighted scores
        weighted_scores = [
            sum(value * self.weights.get(k, 0) for k, value in out.items())
            for out in company_df["agent_output"]
        ]
        total_score = sum(weighted_scores)
        if total_score >= 3:
            trade_signal = "BUY"
        elif total_score <= -2:
            trade_signal = "SELL"
        else:
            trade_signal = "HOLD"
        
        # ------------------------
        # Simulate Evaluator and Optimizer (no changes)
        # ------------------------
        # evaluator(company_df) → optimizer(company_df)
        final_company_df = company_df.copy()
        
        # Print results
        print(f"--- Dataset: {label} --- Total {self.company_name} headlines count: {len(company_df)} Trade suggestion: {trade_signal} ---")
        print("Non-general agent weighted scores per article:")
        for idx, (row, score) in enumerate(zip(company_df.itertuples(), weighted_scores), start=1):
            filtered_out = {k: v for k, v in row.agent_output.items() if k != "general_signal" and v != 0}
            if filtered_out:
                print(f"    Article {idx}: Agent: {row.agent_name} | Weighted Article Score: {score} | Agent Output: {filtered_out}")
        print("\n")
        
        return final_company_df
    
    def run_all(self):
        results = {}
        for label, df in self.datasets.items():
            results[label] = self.process_dataset(df, label)
        return results

# ------------------------
# Run Orchestrator
# ------------------------
orchestrator = Orchestrator(datasets, stop_words, company_name)
processed_dfs_with_agent_scores = orchestrator.run_all()

--- Dataset: cnbc --- Total Tesla headlines count: 36 Trade suggestion: SELL ---
Non-general agent weighted scores per article:
    Article 12: Agent: market | Weighted Article Score: -2 | Agent Output: {'market_signal': -1}
    Article 17: Agent: market | Weighted Article Score: -2 | Agent Output: {'market_signal': -1}
    Article 20: Agent: market | Weighted Article Score: -2 | Agent Output: {'market_signal': -1}
    Article 25: Agent: market | Weighted Article Score: -2 | Agent Output: {'market_signal': -1}
    Article 31: Agent: market | Weighted Article Score: 2 | Agent Output: {'market_signal': 1}
    Article 32: Agent: market | Weighted Article Score: 2 | Agent Output: {'market_signal': 1}


--- Dataset: guardian --- Total Tesla headlines count: 78 Trade suggestion: BUY ---
Non-general agent weighted scores per article:
    Article 26: Agent: earnings | Weighted Article Score: 5 | Agent Output: {'eps_signal': 1, 'revenue_signal': 1}
    Article 38: Agent: earnings | Weighted Art

# Module 7 lab for reference only:

In [None]:
# import openai
# import time
# import traceback

# from dotenv import load_dotenv
# import os
# from openai import OpenAI, APIConnectionError

# # ------------------------- 1 - Load environment & Connect to OpenAI
# def init_openai():
#     load_dotenv('database.env')
#     os.environ["OPENAI_API_KEY"] = os.getenv('openaikey')
#     client = OpenAI()
#     try:
#         # Attempt to list models to verify connection
#         _ = client.models.list()
#         # No print on success
#     except APIConnectionError as e:
#         print("OpenAI API connection error:", e)
#         traceback.print_exc()
#         exit()
#     return client

# client = init_openai()    


# # BASE AGENT CLASS

# class Agent:
#     def __init__(self, name, role, model="gpt-3.5-turbo"): # Model defined here can be changed
#         # Defines a general-purpose agent;
#         # `name` is the identity of the agent
#         self.name = name
#         self.role = role # What the agent is specialized in
#         self.model = model # Which LLM model to use
#         self.memory = [] # Stores past task and response for traceability
    
#     # Core agent method:
#     # takes a prompt, sends it to the LLM, and returns the response
#     def call_llm(self, prompt):

#         try:
#             response = client.chat.completions.create(
#                 model=self.model,
#                 # Chat messages sent to the model: system role plus the user prompt
#                 messages=[
#                     {"role": "system", "content": f"You are {self.name}, a {self.role} agent."},
#                     {"role": "user", "content": prompt}
#                 ],
#                 # Model parameters
#                 max_tokens=300,
#                 temperature=0.7
#             )
#             result = response.choices[0].message.content
#             print(f"{self.name}: {result[:60]}...")
#             return result
#         except Exception as e:
#             print(f" API failed for {self.name}: {e}")
#             return f"Mock response from {self.name}: {prompt[:50]}..."
    
#     # Wraps the task into a prompt, gets the result, and stores it in memory
#     def process(self, task):

#         prompt = f"As a {self.role}, {task}"
#         result = self.call_llm(prompt)
#         self.memory.append({"task": task, "result": result})
#         return result
#     # send_to passes a result to another agent as a message,
#     # letting agents communicate by handing tasks to each other
#     # (communication between LLMs or other sub-agents)
#     # Shouldn't this message content be based on certain categories???****
#     # ROUTING?
#     def send_to(self, other_agent, message):

#         print(f" {self.name} → {other_agent.name}: {message[:40]}...")
#         return other_agent.process(f"Process this from {self.name}: {message}")

# # ------------------
# # SPECIALIZED AGENTS
# # -----------------
# # Domain-specific agents; the coordinator uses gpt-4

# # The coordinator builds its team through add_agent
# class Coordinator(Agent):
#     def __init__(self):
#         super().__init__("Boss", "project coordinator", "gpt-4")
#         self.team = []
#     # Holds the team list that add_agent appends to

#     def add_agent(self, agent):
#         self.team.append(agent)
#         print(f"➕ Added {agent.name} to team")
    
#     # Delegate project tasks to the team
#     def delegate_project(self, project):
#         print(f"\n COORDINATING PROJECT: {project}")
#         plan = self.process(f"Create a plan for: {project}")

#         results = {}
#         for agent in self.team:
#             task = f"Work on project '{project}' using your {agent.role} skills"
#             results[agent.name] = agent.process(task)

#         return {"plan": plan, "results": results}

# # ------------------------------
# # Researcher, analyst, writer agents
# # ----------------------------------

# # The researcher agent gathers information about a specific topic via gpt-3.5-turbo (fast)
# #  Gets "data" ?
# class Researcher(Agent):
#     def __init__(self):
#         super().__init__("Scout", "researcher", "gpt-3.5-turbo")

#     def research(self, topic):
#         return self.process(f"research and gather information about: {topic}")

# # Analyzes the data with analyze method via gpt-4
# # Gets "content" ?
# class Analyst(Agent):
#     def __init__(self):
#         super().__init__("Brain", "data analyst", "gpt-4")

#     def analyze(self, data):
#         return self.process(f"analyze this data and provide insights: {data}")

# # Writer: Writes a report based on the data contents using write_report method
# #  Gets "report"?
# class Writer(Agent):
#     def __init__(self):
#         super().__init__("Pen", "content writer", "gpt-3.5-turbo")
    
#     # Write report method
#     def write_report(self, content):
#         return self.process(f"write a professional report based on: {content}")


# # MULTIAGENT SYSTEM
#         # Creates a team of agents and runs a full project

# class MultiAgentTeam:
#     def __init__(self):
#         # Create team
#         self.coordinator = Coordinator()
#         self.researcher = Researcher()
#         self.analyst = Analyst()
#         self.writer = Writer()

#         # Setup team
#         self.coordinator.add_agent(self.researcher)
#         self.coordinator.add_agent(self.analyst)
#         self.coordinator.add_agent(self.writer)

#         print(" Multiagent team created!")
    
#     # execute_project coordinates and plans the project, researches and gathers info, sends it to the analyst, then writes the report
#     def execute_project(self, project_description):
#         print(f"\n EXECUTING PROJECT")
#         print("=" * 40)

#         # Step 1: Coordinate
#         coordination = self.coordinator.delegate_project(project_description)

#         # Step 2: Research
#         research_data = self.researcher.research(project_description)

#         # Step 3: Analyze research
#         # Sends to analyst for analyzing
#         analysis = self.researcher.send_to(self.analyst, research_data)

#         # Step 4: Write report
#         final_report = self.analyst.send_to(self.writer, analysis)
        
#         # Print task for each agent to show status, how it looks
#         print(f"\n PROJECT COMPLETED!")
#         return {
#             "coordination": coordination,
#             "research": research_data,
#             "analysis": analysis,
#             "report": final_report
#         }
#     # Prints the number of completed tasks for each agent on the team
#     def show_team_status(self):
#         print(f"\n TEAM STATUS")
#         agents = [self.coordinator, self.researcher, self.analyst, self.writer]
#         for agent in agents:
#             print(f"  {agent.name} ({agent.role}): {len(agent.memory)} tasks completed")


# # DEMO FUNCTIONS
# # Runs the full multi-agent pipeline on an example project:
# # "Analyze AI market trends and create business strategy"

# def demo_basic_multiagent():
#     print(" MULTIAGENT SYSTEM DEMO")
#     print("=" * 40)

#     # Create team
#     team = MultiAgentTeam()

#     # Execute project
#     project = "Analyze AI market trends and create business strategy"
#     results = team.execute_project(project)

#     # Show status
#     team.show_team_status()

#     return results

# # Agent communication demo
# # Shows communication between the researcher and the analyst ***
# # The researcher gathers information and sends it to the analyst,
# # who then analyzes it
# def demo_agent_communication():
#     print(f"\n AGENT COMMUNICATION")
#     print("-" * 30)

#     researcher = Researcher()
#     analyst = Analyst()

#     # Research → Analysis chain
#     research = researcher.research("artificial intelligence trends")
#     analysis = researcher.send_to(analyst, research)

#     print(" Communication chain completed!")


# # Show different models tackling the same task
# def demo_different_models():

#     print(f"\n DIFFERENT LLM MODELS")
#     print("-" * 30)

#     fast_agent = Agent("Fast", "assistant", "gpt-3.5-turbo")
#     smart_agent = Agent("Smart", "analyst", "gpt-4")

#     task = "Evaluate autonomous vehicles"

#     fast_result = fast_agent.process(task)
#     smart_result = smart_agent.process(task)

#     print(" Different models demonstrated!")

# # -------------------------------------
# # MAIN EXECUTION
# # Ties everything together
# # Shows agent to agent communication
# # Demonstrates model comparisons
# # Gives us an idea of how a multiagent system works from start to finish
# # ---------------------------------------

# def run_multiagent_demo():
#     print("MULTIAGENT SYSTEM WITH REAL LLMs")
#     print("=" * 50)
#     print("  Set your OpenAI API key at the top!")

#     try:
#         # Main demo
#         # Ties everything together
#         results = demo_basic_multiagent()

#         # Additional demos
#         # Shows agent to agent communication
#         demo_agent_communication()
#         # Demonstrates model comparisons
#         demo_different_models()

#     except Exception as e:
#         print(f" Demo failed: {e}")
#         print(" Check your API key!")

# if __name__ == "__main__":
#     # Execute run_multiagent_demo
#     run_multiagent_demo()


# # The output shows agents joining the team
# # Project goal: "Analyze AI market trends and create business strategy"
# # Phase 1 - provide some input and do the market research, identify key research areas, provide some kind of report
# # Move forward with what other models can create; for example, again you see it has a summary based on
# # T-key product that provided by Scout
# # Shows the project completed and shows team status with the count of tasks completed per agent, explaining each agent's purpose
# # Explains communication between each agent ("Agent Communication")
# # Shows a demo of different outputs with different models