# Introduction to Retrieval Augmented Generation with S&P 500 news

In this notebook, you will explore how to build a simple Retrieval-Augmented Generation (RAG) pipeline using financial news articles from S&P 500 companies.

We'll start by vectorizing text data, creating a vector store using FAISS, and integrating it with OpenAI's GPT models to answer questions using retrieved information.

This workflow emulates real-world systems in finance where natural language data (news, filings, analyst reports) are used to support decision-making.

# 📌 Objectives

By the end of this notebook, students will be able to:

1. **Perform Semantic Search with Metadata Filtering:**
   - Query the provided FAISS vector store to retrieve relevant financial news articles based on natural language questions.
   - Apply optional filters using metadata such as ticker or publication date to refine search results.

2. **Enrich Data with Company Metadata:**
   - Use the `yfinance` library to retrieve company-level metadata (company name, sector, industry) for tickers in the dataset.
   - Integrate this metadata to support enhanced filtering and analysis of news data.

3. **Build a Retrieval-Augmented Generation (RAG) Pipeline:**
   - Combine retrieved news snippets as context to generate answers using OpenAI’s GPT models.
   - Construct effective prompts that guide the language model to provide concise, context-aware responses.

4. **Evaluate and Analyze RAG Outputs:**
   - Review generated answers alongside the supporting news excerpts.
   - Reflect on the strengths and limitations of the simple RAG pipeline and consider potential improvements, such as adding more filters or refining retrieval strategies.

5. **Incorporate Financial Metadata into Retrieval Context:**
   - Enrich retrieved news snippets with key financial metadata including ticker, company name, sector, and industry.
   - Format prompts that combine both text excerpts and metadata to provide richer context to the language model.

6. **Generate Context-Aware Answers Using OpenAI Models:**
   - Construct and send prompts to an LLM that leverage both news content and metadata to produce concise, informed financial analysis.

7. **Compare Answers With and Without Metadata:**
   - Evaluate the impact of including financial metadata on answer quality using criteria such as clarity, detail, accuracy, and contextual relevance.
   - Summarize findings to reflect on the role of metadata in improving retrieval-augmented generation.

## Install and import the required libraries

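First, we install and import the libraries needed for:
- Text embedding generation (`sentence-transformers`)
- Efficient similarity search (`faiss`)
- Data manipulation (`pandas`, `numpy`)
- Visualization (`matplotlib`)
- LLM calls and company metadata (`openai`, `yfinance`)

> ℹ️ We get cosine similarity from FAISS by L2-normalizing the vectors and searching an inner-product index: for unit-length vectors, the inner product equals the cosine similarity.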

In [18]:
%pip install sentence-transformers
%pip install faiss-cpu
%pip install openai
%pip install yfinance



In [20]:
import os
from collections import Counter

import faiss
import matplotlib.pyplot as plt
import numpy as np
import openai
import pandas as pd
import yfinance as yf
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load news data
We load a CSV file of financial news, focusing on TITLE and SUMMARY, along with metadata like TICKER and PUBLICATION_DATE.
These will be embedded into vectors and used for semantic retrieval.

In [5]:
K = 25

In [6]:
df_news = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/MNA/Fintech e innovación digital en finanzas (Gpo 10)/df_news.csv')

df_news['PUBLICATION_DATE'] = pd.to_datetime(df_news['PUBLICATION_DATE']).dt.date
display(df_news)

Unnamed: 0,TICKER,TITLE,SUMMARY,PUBLICATION_DATE,PROVIDER,URL
0,MMM,2 Dow Jones Stocks with Promising Prospects an...,The Dow Jones (^DJI) is made up of 30 of the m...,2025-05-29,StockStory,https://finance.yahoo.com/news/2-dow-jones-sto...
1,MMM,3 S&P 500 Stocks Skating on Thin Ice,The S&P 500 (^GSPC) is often seen as a benchma...,2025-05-27,StockStory,https://finance.yahoo.com/news/3-p-500-stocks-...
2,MMM,3M Rises 15.8% YTD: Should You Buy the Stock N...,"MMM is making strides in the aerospace, indust...",2025-05-22,Zacks,https://finance.yahoo.com/news/3m-rises-15-8-y...
3,MMM,Q1 Earnings Roundup: 3M (NYSE:MMM) And The Res...,Quarterly earnings results are a good time to ...,2025-05-22,StockStory,https://finance.yahoo.com/news/q1-earnings-rou...
4,MMM,3 Cash-Producing Stocks with Questionable Fund...,While strong cash flow is a key indicator of s...,2025-05-19,StockStory,https://finance.yahoo.com/news/3-cash-producin...
...,...,...,...,...,...,...
4866,ZTS,2 Dividend Stocks to Buy With $500 and Hold Fo...,Zoetis is a leading animal health company with...,2025-05-23,Motley Fool,https://www.fool.com/investing/2025/05/23/2-di...
4867,ZTS,Zoetis (NYSE:ZTS) Declares US$0.50 Dividend Pe...,Zoetis (NYSE:ZTS) recently affirmed a dividend...,2025-05-22,Simply Wall St.,https://finance.yahoo.com/news/zoetis-nyse-zts...
4868,ZTS,Jim Cramer on Zoetis (ZTS): “It Does Seem to B...,We recently published a list of Jim Cramer Tal...,2025-05-21,Insider Monkey,https://finance.yahoo.com/news/jim-cramer-zoet...
4869,ZTS,Zoetis (ZTS) Upgraded to Buy: Here's Why,Zoetis (ZTS) might move higher on growing opti...,2025-05-21,Zacks,https://finance.yahoo.com/news/zoetis-zts-upgr...


In [7]:
df_news['EMBEDDED_TEXT'] = df_news['TITLE'] + ' : ' + df_news['SUMMARY']

In [8]:
model = SentenceTransformer('all-MiniLM-L6-v2')


## Implement FAISS vector store
We:
- Use a pre-trained sentence transformer (all-MiniLM-L6-v2) to embed documents.
- Normalize vectors to use cosine similarity.
- Create a FAISS index and implement a basic search function.

This will allow us to retrieve relevant news snippets given a natural language question.
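
To see why normalizing makes an inner-product index behave like a cosine-similarity index, here is a minimal sketch in plain Python. The two vectors and the helper names (`l2_normalize`, `dot`, `cosine_from_definition`) are made up for illustration and are not part of the notebook's pipeline.

```python
import math

def l2_normalize(vec):
    """Scale a vector so that its L2 norm is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_from_definition(a, b):
    """Cosine similarity computed directly: a.b / (|a| * |b|)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Illustrative vectors (assumed values, not real embeddings)
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Inner product of the normalized vectors...
ip_of_normalized = dot(l2_normalize(a), l2_normalize(b))

# ...matches the cosine similarity of the original vectors.
assert math.isclose(ip_of_normalized, cosine_from_definition(a, b))
```

This is the same idea the notebook applies with `np.linalg.norm` and `faiss.IndexFlatIP`, just on toy data.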


In [9]:
# Load model and compute embeddings
text_embeddings = model.encode(df_news['EMBEDDED_TEXT'].tolist(), convert_to_numpy=True)

# Normalize embeddings to use cosine similarity (via inner product in FAISS)
text_embeddings = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)

# Prepare metadata
documents = df_news['EMBEDDED_TEXT'].tolist()
metadata = [
    {
        'PUBLICATION_DATE': row['PUBLICATION_DATE'],
        'TICKER': row['TICKER'],
        'PROVIDER': row['PROVIDER']
    }
    for _, row in df_news.iterrows()
]

In [10]:
embedding_dim = text_embeddings.shape[1]
faiss_index = faiss.IndexFlatIP(embedding_dim)  # Cosine similarity via inner product
faiss_index.add(text_embeddings)

In [11]:
class FaissVectorStore:
    def __init__(self, model, index, embeddings, documents, metadata):
        self.model = model
        self.index = index
        self.embeddings = embeddings
        self.documents = documents
        self.metadata = metadata

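    def search(self, query, k=5, metadata_filter=None):
        # Embed the query and L2-normalize it so inner product equals cosine similarity
        query_embedding = self.model.encode([query])
        query_embedding = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)

        if metadata_filter:
            filtered_indices = [i for i, meta in enumerate(self.metadata) if metadata_filter(meta)]
            if not filtered_indices:
                return []
            # Build a temporary index over only the documents that pass the filter
            filtered_embeddings = self.embeddings[filtered_indices]
            temp_index = faiss.IndexFlatIP(filtered_embeddings.shape[1])
            temp_index.add(filtered_embeddings)
            # Never ask for more neighbors than the temporary index holds
            D, I = temp_index.search(query_embedding, min(k, len(filtered_indices)))
            # Map positions in the temporary index back to positions in the full store
            indices = [filtered_indices[i] for i in I[0]]
        else:
            D, I = self.index.search(query_embedding, k)
            indices = I[0]
        # Both branches return 2-D arrays with one row per query; keep the single row
        scores = D[0]

        results = []
        for idx, sim in zip(indices, scores):
            if idx < 0:  # FAISS pads with -1 when fewer than k hits exist
                continue
            results.append((self.documents[idx], self.metadata[idx], float(sim)))
        return results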

In [12]:
# Create FAISS-based store
faiss_store = FaissVectorStore(
    model=model,
    index=faiss_index,
    embeddings=text_embeddings,
    documents=documents,
    metadata=metadata
)
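
The `metadata_filter` argument of `search` is any callable that takes one metadata dict and returns `True` or `False`. As a rough sketch, a small factory can build such callables for ticker and date-range filtering. The helper name `make_metadata_filter` and the sample records below are hypothetical and only illustrate the shape of the metadata used in this notebook.

```python
from datetime import date

def make_metadata_filter(tickers=None, start_date=None, end_date=None):
    """Build a predicate over one metadata dict (hypothetical helper).

    Every argument is optional; a criterion left as None is not applied.
    """
    allowed = set(tickers) if tickers is not None else None

    def predicate(meta):
        if allowed is not None and meta.get("TICKER") not in allowed:
            return False
        published = meta.get("PUBLICATION_DATE")
        if start_date is not None and (published is None or published < start_date):
            return False
        if end_date is not None and (published is None or published > end_date):
            return False
        return True

    return predicate

# Made-up records with the same keys as the notebook's metadata entries
sample_metadata = [
    {"TICKER": "MSFT", "PUBLICATION_DATE": date(2025, 5, 30), "PROVIDER": "Zacks"},
    {"TICKER": "MSFT", "PUBLICATION_DATE": date(2025, 4, 2), "PROVIDER": "Zacks"},
    {"TICKER": "AMZN", "PUBLICATION_DATE": date(2025, 5, 28), "PROVIDER": "Motley Fool"},
]

msft_since_may = make_metadata_filter(tickers=["MSFT"], start_date=date(2025, 5, 1))
kept = [m for m in sample_metadata if msft_since_may(m)]
```

With the store defined above, the same predicate could be passed as `faiss_store.search(question, k=5, metadata_filter=msft_since_may)`.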

### Setup OpenAI Client

👉 **Instructions**:
- Import the `OpenAI` client from the `openai` Python library.
- You will need an **OpenAI API key** to use their models programmatically:
  - Go to [https://platform.openai.com/](https://platform.openai.com/) and sign up or log in.
  - Create an API key from your [API keys dashboard](https://platform.openai.com/account/api-keys).
  - ⚠️ **Keep your API key private** and **do not** share or hardcode it in public notebooks.
- Note that **usage of the OpenAI API is not free**. You will need to:
  - Add a payment method.
  - Monitor your usage to avoid unexpected charges.
  - Optionally set usage limits from your account settings.
- You can refer to the **course’s Study Resources** for a step-by-step guide on creating an OpenAI account and retrieving your API key.

Then:
- Initialize the client with `OpenAI(api_key="YOUR_KEY_HERE")`.
- Send a test request using `.responses.create()` and the `"gpt-4o-mini"` model with a simple prompt:

  ```python
  response = client.responses.create(
      model="gpt-4o-mini",
      input="Write a one-sentence bedtime story about a unicorn."
  )
  print(response.output_text)
  ```


In [17]:
# Never hardcode the key in the notebook: read it from the environment,
# or prompt for it so it does not end up in the saved file.
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.responses.create(
    model="gpt-4o-mini",
    input="Write a one-sentence bedtime story about a unicorn."
)
print(response.output_text)

As the moonlit forest shimmered with magic, a gentle unicorn named Lila tucked her heart-shaped sparkles into a cozy nest of clouds, dreaming of adventures yet to come.


## Retrieve Additional Metadata from Yahoo Finance

👉 **Instructions**:
- We will enrich our news dataset by retrieving **company-level metadata** using the `yfinance` library.
- The goal is to map each unique stock ticker (`TICKER`) in the dataset to:
  - `COMPANY_NAME`
  - `SECTOR`
  - `INDUSTRY`

> ℹ️ `yfinance` fetches live data from Yahoo Finance. If you're running this in a cloud environment or during peak hours, expect some tickers to fail or rate limits to apply.

✅ After this step, you will have a new DataFrame (e.g. `df_meta`) with the columns `TICKER`, `COMPANY_NAME`, `SECTOR`, `INDUSTRY` that maps tickers to their company names, sectors, and industries. This metadata will be useful later to add filters and analysis based on sector or industry categories.
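
Because lookups can fail intermittently or hit rate limits, one option is to wrap each call in a retry loop with exponential backoff. The sketch below is one possible approach, not something `yfinance` provides: `call_with_retries` is a hypothetical helper, and the flaky function only simulates a network call so the example runs offline.

```python
import time

def call_with_retries(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry on any exception, doubling the wait each time.

    Hypothetical helper: `fetch` is any zero-argument callable, for example
    `lambda: yf.Ticker("MMM").info`. Returns None if every attempt fails.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                return None
            sleep(base_delay * 2 ** attempt)

# Stand-in for a flaky network call: fails twice, then succeeds.
attempts = {"count": 0}

def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated rate limit")
    return {"longName": "Example Corp"}

# Record the waits instead of actually sleeping, to keep the demo instant.
waits = []
info = call_with_retries(flaky_fetch, retries=3, base_delay=0.5, sleep=waits.append)
```

Returning `None` on final failure keeps the loop over tickers going, which matches the goal of still producing one row per ticker.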


In [33]:
unique_tickers = df_news['TICKER'].unique()

metadata_list = []

for ticker in unique_tickers:
    try:
        ticker_obj = yf.Ticker(ticker)
        info = ticker_obj.info

        company_name = info.get("longName", None)
        sector = info.get("sector", None)
        industry = info.get("industry", None)

        metadata_list.append({
            "TICKER": ticker,
            "COMPANY_NAME": company_name,
            "SECTOR": sector,
            "INDUSTRY": industry
        })
    except Exception as e:
        print(f"Error fetching data for {ticker}: {e}")
        metadata_list.append({
            "TICKER": ticker,
            "COMPANY_NAME": None,
            "SECTOR": None,
            "INDUSTRY": None
        })

df_meta = pd.DataFrame(metadata_list)

display(df_meta.head())

Unnamed: 0,TICKER,COMPANY_NAME,SECTOR,INDUSTRY
0,MMM,3M Company,Industrials,Conglomerates
1,AOS,A. O. Smith Corporation,Industrials,Specialty Industrial Machinery
2,ABT,Abbott Laboratories,Healthcare,Medical Devices
3,ABBV,AbbVie Inc.,Healthcare,Drug Manufacturers - General
4,ACN,Accenture plc,Technology,Information Technology Services


## Retrieval-Augmented Generation (RAG): Retrieve Documents and Generate Answers

👉 **Instructions**:

In this part of the assignment, your task is to build a simple Retrieval-Augmented Generation (RAG) pipeline that:

- Takes a user question as input.
- Searches the FAISS vector store to find a set of relevant financial news articles based on semantic similarity.
- Uses the retrieved news articles as context to generate a clear, concise answer to the question by interacting with the OpenAI language model.
- Returns both the generated answer and the underlying news snippets used for context.

### What you need to focus on:

- Implement a retrieval mechanism to query your vector store and obtain the top relevant documents for any question.
- Construct prompts that effectively combine retrieved news content with the user’s question to guide the language model’s response.
- Use the OpenAI API to generate answers grounded in the retrieved context.
- Organize the outputs so that for each question, you have:
  - The generated answer.
  - The collection of news excerpts used to produce that answer.

### What you will be provided:

- Helper functions to display outputs in markdown format.
- Lists of example questions covering topics, companies, and industries to test your implementation.

---

Your solution can take any form or structure you find appropriate, as long as it fulfills these core objectives. This exercise will give you hands-on experience with integrating retrieval and generation for practical applications in finance.


#### Print markdown
You can use the following function to render answers from `gpt-4o-mini` as markdown.

In [22]:
from IPython.display import Markdown, display

def print_markdown(text):
    display(Markdown(text))

#### Predefined questions

In [23]:
questions_topic = [
"What are the major concerns expressed in financial news about inflation?",
"How is investor sentiment described in recent financial headlines?",
"What role is artificial intelligence playing in recent finance-related news stories?"
]

questions_company = [
"How is Microsoft being portrayed in news stories about artificial intelligence?",
"What financial news headlines connect Amazon with automation or logistics?"
]

questions_industry = [
"What are the main themes emerging in financial news about the semiconductor industry?",
"What trends are being reported in the retail industry?",
"What risks or challenges are discussed in recent news about the energy industry?"
]

In [24]:
def rag_query(question, k=5, metadata_filter=None):
    # Search for relevant documents
    results = faiss_store.search(question, k=k, metadata_filter=metadata_filter)

    if not results:
        return "No relevant results found.", []

    # Build context from retrieved docs
    context_parts = []
    for doc, meta, score in results:
        context_parts.append(
            f"[{meta.get('TICKER')}, {meta.get('PUBLICATION_DATE')}, {meta.get('PROVIDER')}] {doc}"
        )

    context = "\n\n".join(context_parts)

    # Prompt for the LLM
    prompt = f"""
    You are a financial analyst. Use only the information provided below
    to answer the question in a concise and precise way.

    News Articles:
    {context}

    Question:
    {question}

    Answer:
    """

    # Call OpenAI model
    response = client.responses.create(
        model="gpt-4o-mini",
        input=prompt
    )

    return response.output_text, results

**Example: Topic Questions**

In [32]:
for q in questions_topic:
    answer, retrieved = rag_query(q, k=5)

    # No space after the opening `**`, otherwise markdown does not render the text as bold
    print_markdown(f"### {q}\n\n**Answer:**\n\n{answer}\n\n---\n**Supporting News Articles:**")
    for doc, meta, score in retrieved:
        print_markdown(f"- `{meta['TICKER']}` ({meta['PUBLICATION_DATE']}) — {meta['PROVIDER']} — Score: {score:.4f}\n\n{doc}")


###  What are the major concerns expressed in financial news about inflation?

**Answer:**

The major concerns expressed in financial news about inflation include mounting worries about persistent US inflation, indicated by the Federal Reserve's May policy meeting, and the impact of food inflation dampening expectations for a rate cut, as highlighted in multiple articles. Additionally, ongoing tariff issues contribute to the inflation-related uncertainty.

---
**Supporting News Articles:**

- `BLK` (2025-05-29) — Yahoo Finance UK — Score: 0.5771

Bitcoin price slips as Fed minutes flag US inflation risks : The Federal Reserve’s May policy meeting revealed mounting concern over persistent US inflation and the potential for economic slowdown.

- `TSLA` (2025-05-31) — Yahoo Finance UK — Score: 0.4920

The Weekend: Food inflation dampens hopes of a rate cut as tariff twists and turns continue : Key moments from the last seven days, plus a glimpse at the week ahead

- `NVDA` (2025-05-31) — Yahoo Finance UK — Score: 0.4920

The Weekend: Food inflation dampens hopes of a rate cut as tariff twists and turns continue : Key moments from the last seven days, plus a glimpse at the week ahead

- `LULU` (2025-05-31) — Yahoo Finance UK — Score: 0.4920

The Weekend: Food inflation dampens hopes of a rate cut as tariff twists and turns continue : Key moments from the last seven days, plus a glimpse at the week ahead

- `AVGO` (2025-05-31) — Yahoo Finance UK — Score: 0.4920

The Weekend: Food inflation dampens hopes of a rate cut as tariff twists and turns continue : Key moments from the last seven days, plus a glimpse at the week ahead

###  How is investor sentiment described in recent financial headlines?

**Answer:**

Investor sentiment in recent financial headlines is characterized by a mix of optimism and skepticism. While analysts highlight significant upside potential and positive catalysts for certain stocks, they also caution about the pressures that lead to overly optimistic forecasts. Additionally, bearish ratings are noted as rare, signaling a hesitance to take a negative stance that could impact business relationships. Overall, there is an underlying caution amidst the buoyant outlook.

---
**Supporting News Articles:**

- `KMX` (2025-05-26) — StockStory — Score: 0.6115

3 of Wall Street’s Favorite Stocks Facing Headwinds : Wall Street has set ambitious price targets for the stocks in this article. While this suggests attractive upside potential, it’s important to remain skeptical because analysts face institutional pressures that can sometimes lead to overly optimistic forecasts.

- `MCHP` (2025-05-20) — StockStory — Score: 0.5978

3 Hyped Up  Stocks Facing Headwinds : Great things are happening to the stocks in this article. They’re all outperforming the market over the last month because of positive catalysts such as a new product line, constructive news flow, or even a loyal Reddit fanbase.

- `MPWR` (2025-05-06) — StockStory — Score: 0.5894

1 of Wall Street’s Favorite Stock with Impressive Fundamentals and 2 to Think Twice About : The stocks in this article have caught Wall Street’s attention in a big way, with price targets implying returns above 20%. But investors should take these forecasts with a grain of salt because analysts typically say nice things about companies so their firms can win business in other product lines like M&A advisory.

- `DRI` (2025-05-21) — StockStory — Score: 0.5767

1 Unpopular Stock that Should Get More Attention and 2 to Steer Clear Of : When Wall Street turns bearish on a stock, it’s worth paying attention. These calls stand out because analysts rarely issue grim ratings on companies for fear their firms will lose out in other business lines such as M&A advisory.

- `RVTY` (2025-05-23) — StockStory — Score: 0.5701

3 of Wall Street’s Favorite Stocks with Questionable Fundamentals : Wall Street is overwhelmingly bullish on the stocks in this article, with price targets suggesting significant upside potential. However, it’s worth remembering that analysts rarely issue sell ratings, partly because their firms often seek other business from the same companies they cover.

###  What role is artificial intelligence playing in recent finance-related news stories?

**Answer:**

Artificial intelligence is prominently featured in recent finance-related news stories as a driving force behind productivity enhancements and investment opportunities. Companies like Jack Henry & Associates are integrating AI-driven technologies to improve lending processes, while Meta Platforms is banking on AI investments to boost its stock value beyond its legacy business. Additionally, firms like Palantir and Upstart are harnessing AI for significant financial gains and better risk assessment, indicating strong market potential. Overall, AI is perceived as a critical factor influencing stock valuations and strategic growth in the tech sector.

---
**Supporting News Articles:**

- `JKHY` (2025-03-17) — Insider Monkey — Score: 0.6974

Jack Henry (JKHY) Integrates AI-Driven Lending Tech With Algebrik : We recently published a list of 12 AI News Investors Should Not Miss This Week. In this article, we are going to take a look at where Jack Henry & Associates, Inc. (NASDAQ:JKHY) stands against other AI news Investors should not miss this week. Artificial Intelligence (AI) is known to increase productivity, decrease human error, […]

- `META` (2025-05-31) — Motley Fool — Score: 0.6257

This "Magnificent Seven" Stock Is Set to Skyrocket If Its AI Investments Pay Off : Meta Platforms has investments in several AI applications.  The tech giant's stock is only valued on its legacy business.  Over the past two-and-a-half years, investors have heard about various artificial intelligence (AI) investments that tech companies are making.

- `PLTR` (2025-05-31) — Motley Fool — Score: 0.6193

Billionaires Are Buying 2 Artificial Intelligence (AI) Stocks That Wall Street Analysts Say Can Soar Up to 240% : Several billionaire hedge fund managers bought shares of Palantir and/or Upstart in the first quarter -- stocks where certain analysts anticipate substantial upside.  Palantir is successfully tapping demand for artificial intelligence (AI) with government and commercial customers, but the stock trades at a very expensive valuation.  Upstart is generating attractive returns for lenders by helping them quantify credit risk with artificial intelligence, and the stock trades at a very reasonable valuation.

- `PLTR` (2025-05-31) — Motley Fool — Score: 0.6184

Better Artificial Intelligence (AI) Stock: Palantir vs. Snowflake : Shares of both Palantir and Snowflake have delivered healthy gains in 2025 despite the broader stock market weakness.  Palantir stock has shot up 63% this year despite bouts of volatility.  Palantir Technologies helps commercial and government clients integrate generative AI capabilities into their operations with its Artificial Intelligence Platform (AIP), which was launched roughly two years ago.

- `NFLX` (2025-05-29) — Motley Fool — Score: 0.5797

2 Underrated Artificial Intelligence (AI) Stocks to Buy and Hold : Generative AI can simplify and speed up many tasks, including content production.  It's easy to see the potential for Netflix, whose content strategy is integral to its success.  Netflix's creations have attracted millions of viewers and won many awards.

## Analysis & Questions - Section 1

### Analysis and Reflection on Retrieval and Generation Results
After running the RAG pipeline and obtaining answers along with their supporting news excerpts, take some time to carefully review both the generated responses and the retrieved contexts.

- **For each question, read the answer and then the corresponding news snippets used as context.**

- Reflect on the following points and document your observations:
1. **Relevance**
2. **Completeness**  
3. **Bias or Noise**
4. **Consistency**  
5. **Improvement Ideas**   

and answer the questions below:

#### **Question 1.** How well do the retrieved news snippets support the generated answer? Are the key facts or themes in the answer clearly grounded in the context?

The retrieved snippets support the generated answers quite well.

For the inflation question, the answer references concerns about persistent US inflation, food inflation, and tariffs, all of which are explicitly mentioned in the supporting news.

For investor sentiment, the answer reflects themes of optimism mixed with caution, which matches the snippets discussing bullish targets but also analyst skepticism.

For AI in finance, the answer’s points about productivity, investments, and company examples (Jack Henry, Meta, Palantir, Upstart) are clearly grounded in the retrieved context.

#### **Question 2.** Does the answer fully address the question, or does it leave important aspects out? Consider if the retrieved context provided enough information to generate a thorough response.

The answers address the core of each question but sometimes lack depth.
For example, the inflation answer doesn’t touch on broader economic effects (employment, consumer spending) that might also be in the news.
Investor sentiment could include more detail on specific sectors or macroeconomic influences.

The AI answer is solid, but could also mention potential risks or criticisms that might have been present in other retrieved articles.
In short: answers are complete enough for a summary, but not exhaustive.  

#### **Question 3.** Are there any irrelevant or misleading snippets retrieved that may have influenced the answer? How might this affect the quality of the output?

There’s minimal outright irrelevance, but some inflation snippets are duplicated (“The Weekend: Food inflation dampens hopes…” appears for multiple tickers).

While not misleading, this duplication adds noise and reduces diversity of perspectives.

This could subtly bias the answer toward those repeated themes while ignoring other equally relevant but less represented angles.

#### **Question 4.**  Do the news snippets show consistent information, or are there conflicting viewpoints? How does the LLM handle potential contradictions in the context?

The retrieved articles generally show consistent information within each topic, with no major contradictions.

The LLM doesn’t appear to have had to reconcile conflicting viewpoints here, but if there were, it might default to presenting a balanced synthesis unless instructed to weigh sources differently.

In this dataset, consistency is high, which likely made the summarization easier.

#### **Question 5.**  Based on your observations, suggest ways the retrieval or generation process could be improved (e.g., better filtering, adjusting `k`, refining prompt design).

* **Reduce duplicates**: Deduplicate or cluster similar snippets before passing them to the LLM.
* **Improve diversity**: Use semantic diversity filtering so k=5 includes more varied perspectives.
* **Metadata filtering**: For company or industry-specific queries, use sector/industry filters from `df_meta`.
* **Adjust k dynamically**: For broad topics (e.g., inflation), increase k to 8–10; for specific companies, keep k small to avoid irrelevant noise.
* **Prompt refinement**: Explicitly ask the LLM to note disagreements or uncertainty in the sources to capture a fuller picture.
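
The deduplication idea can be sketched as a small post-processing step on the retrieved results. This is one possible approach under simple assumptions: `deduplicate_results` is a hypothetical helper, it treats two snippets as duplicates only when their text matches after lowercasing and whitespace normalization, and the sample hits are abbreviated from the inflation results above.

```python
def deduplicate_results(results):
    """Keep the first (highest-scoring) hit for each distinct document text.

    `results` is a list of (doc_text, meta_dict, score) tuples sorted by
    descending score, the shape returned by FaissVectorStore.search.
    """
    seen = set()
    unique = []
    for doc, meta, score in results:
        key = " ".join(doc.lower().split())  # normalize case and whitespace
        if key in seen:
            continue
        seen.add(key)
        unique.append((doc, meta, score))
    return unique

# Abbreviated versions of the hits returned for the inflation question
retrieved = [
    ("Bitcoin price slips as Fed minutes flag US inflation risks", {"TICKER": "BLK"}, 0.5771),
    ("The Weekend: Food inflation dampens hopes of a rate cut", {"TICKER": "TSLA"}, 0.4920),
    ("The Weekend: Food inflation dampens hopes of a rate cut", {"TICKER": "NVDA"}, 0.4920),
    ("The Weekend: Food inflation dampens hopes of a rate cut", {"TICKER": "LULU"}, 0.4920),
]
unique_hits = deduplicate_results(retrieved)
```

In practice it may help to retrieve more than `k` candidates first (for example `3 * k`), deduplicate, and then keep the top `k`, so that removing duplicates does not shrink the context.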

## 🧠 Retrieval-Augmented Generation (RAG) v2: Adding Financial Metadata to Improve Generation

👉 **Instructions**:

In this part of the assignment, you’ll enhance your Retrieval-Augmented Generation (RAG) pipeline by incorporating *financial metadata* to provide more contextually rich answers.

Your goal is to evaluate whether metadata such as **company name**, **sector**, and **industry** helps the LLM generate **more accurate and grounded answers** to financial questions.

---

### ✅ What your updated pipeline should do:

- Retrieve relevant financial news articles using semantic similarity with FAISS.
- Enrich each retrieved document with financial metadata:
  - Ticker symbol
  - Full company name
  - Sector (e.g., Technology, Energy)
  - Industry (e.g., Semiconductors, Retail)
- Construct prompts that include both:
  - Retrieved news text
  - Associated metadata
- Send the prompt to the OpenAI model to generate an informed response.
- Return:
  - The final answer
  - The exact set of contextual documents used to produce that answer

---

### 🧪 Evaluation and Comparison:

You will test your improved RAG pipeline on the same three types of questions provided earlier:
- **Topic-focused** (e.g., inflation, interest rates)
- **Company-focused** (e.g., questions about Tesla, Nvidia)
- **Industry-focused** (e.g., semiconductors, utilities)


In [34]:
# --------------------------
# 2) Merge df_meta into faiss_store.metadata so each doc has company/sector/industry
# --------------------------
# Assumes you have df_meta with columns: TICKER, COMPANY_NAME, SECTOR, INDUSTRY
# and that faiss_store (the original store) already exists.

# create a ticker -> metadata map from df_meta
ticker_map = df_meta.set_index('TICKER').to_dict(orient='index')

# update each metadata entry in faiss_store.metadata
for i, meta in enumerate(faiss_store.metadata):
    ticker = meta.get('TICKER')
    if ticker in ticker_map:
        # keep existing keys and add/overwrite with company-level metadata
        faiss_store.metadata[i].update({
            'COMPANY_NAME': ticker_map[ticker].get('COMPANY_NAME'),
            'SECTOR': ticker_map[ticker].get('SECTOR'),
            'INDUSTRY': ticker_map[ticker].get('INDUSTRY')
        })
    else:
        faiss_store.metadata[i].update({
            'COMPANY_NAME': None,
            'SECTOR': None,
            'INDUSTRY': None
        })


In [35]:
# --------------------------
# 3) RAG v2 query that includes metadata in the prompt
# --------------------------
def format_doc_with_meta(doc_text, meta):
    """
    Returns a string combining metadata and text for feeding into the LLM.
    """
    parts = []
    # `or 'N/A'` also covers keys that exist but hold None (e.g. failed yfinance lookups)
    parts.append(f"TICKER: {meta.get('TICKER') or 'N/A'}")
    parts.append(f"COMPANY: {meta.get('COMPANY_NAME') or 'N/A'}")
    parts.append(f"SECTOR: {meta.get('SECTOR') or 'N/A'}")
    parts.append(f"INDUSTRY: {meta.get('INDUSTRY') or 'N/A'}")
    parts.append(f"DATE: {meta.get('PUBLICATION_DATE') or 'N/A'}")
    parts.append(f"PROVIDER: {meta.get('PROVIDER') or 'N/A'}")
    parts.append("ARTICLE:")
    parts.append(doc_text)
    return "\n".join(parts)

def rag_query_with_metadata(question, k=5, metadata_filter=None, deduplicate=False, model_name="gpt-4o-mini"):
    """
    Returns: (answer_text, retrieved_list)
    retrieved_list: list of tuples (doc_text, meta_dict, similarity_score)
    """
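    # FaissVectorStore.search takes no `deduplicate` argument, so duplicates are removed here
    retrieved = faiss_store.search(question, k=k, metadata_filter=metadata_filter)
    if deduplicate:
        seen, unique = set(), []
        for doc, meta, score in retrieved:
            if doc not in seen:
                seen.add(doc)
                unique.append((doc, meta, score))
        retrieved = unique
    if not retrieved:
        return "No relevant results found.", []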

    # Build contextual block
    context_blocks = []
    for doc, meta, score in retrieved:
        block = format_doc_with_meta(doc, meta)
        context_blocks.append(f"---\n{block}\n(semantic_score: {score:.4f})")
    context = "\n\n".join(context_blocks)

    # Prompt template - instruct model to use ONLY the provided context
    prompt = f"""
You are a professional financial analyst. Using ONLY the news articles and metadata provided below,
answer the user's question concisely and cite the ticker(s) you used. If the context does not answer the
question, say "Insufficient context to answer."

--- CONTEXT START ---
{context}
--- CONTEXT END ---

QUESTION:
{question}

Please provide:
1) A short answer (2-6 sentences).
2) A "Sources" line with tickers and publication dates you used.
"""

    # Call OpenAI
    response = client.responses.create(
        model=model_name,
        input=prompt
    )

    return response.output_text, retrieved
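
A detail worth checking when metadata is formatted for the prompt: `dict.get(key, default)` falls back to the default only when the key is missing, not when the key is present with the value `None`. Because the enrichment step stores `None` for tickers without company data, such fields can end up in the prompt as the literal string `None`. The sketch below illustrates the difference on made-up metadata; `meta_value` is a hypothetical helper, not part of the notebook.

```python
def meta_value(meta, key, default="N/A"):
    """Return meta[key], falling back to `default` when the key is missing or None."""
    value = meta.get(key)
    return default if value is None else value

# Illustrative entry shaped like a ticker with no company data (values are made up)
sample_meta = {"TICKER": "XYZ", "COMPANY_NAME": None, "SECTOR": None}

plain = sample_meta.get("SECTOR", "N/A")       # key exists, so the stored None is returned
safe = meta_value(sample_meta, "SECTOR")       # None is replaced by the default
missing = meta_value(sample_meta, "INDUSTRY")  # an absent key also falls back

print(f"SECTOR via dict.get: {plain}")
print(f"SECTOR via meta_value: {safe}")
print(f"INDUSTRY via meta_value: {missing}")
```

Rendering a consistent `N/A` keeps the model from treating `None` as if it were a real sector or industry label.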


In [41]:
# --------------------------
# 4) Helper to run and show comparisons: with vs without metadata
# --------------------------
def compare_with_without_metadata(question, k=5, industry_or_sector_filter=None):
    """
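    Runs two queries:
      - baseline: prompt built inline from ticker-tagged snippets only (no company metadata)
      - enhanced: rag_query_with_metadata (adds company, sector, industry, date, provider)
    Both use the same retrieval settings, including the optional metadata filter.
    Returns dict with both answers and retrieved docs for inspection.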
    """
    # baseline search (no metadata in prompt, but still may filter by metadata if provided)
    baseline_retrieved = faiss_store.search(question, k=k, metadata_filter=industry_or_sector_filter)
    baseline_context = "\n\n".join([f"[{m.get('TICKER')}] {d}" for d, m, s in baseline_retrieved])
    baseline_prompt = f"""
You are a financial analyst. Using ONLY the following news snippets, answer the question concisely.

News:
{baseline_context}

Question:
{question}
"""
    baseline_resp = client.responses.create(model="gpt-4o-mini", input=baseline_prompt)

    # enhanced
    enhanced_resp, enhanced_retrieved = rag_query_with_metadata(question, k=k, metadata_filter=industry_or_sector_filter)

    return {
        'question': question,
        'baseline': {
            'answer': baseline_resp.output_text,
            'retrieved': baseline_retrieved
        },
        'enhanced': {
            'answer': enhanced_resp,
            'retrieved': enhanced_retrieved
        }
    }

# Example: run compare for a topic question
example_result = compare_with_without_metadata("What are the major concerns expressed in financial news about inflation?", k=5)
print_markdown("### QUESTION\n" + example_result['question'])
print_markdown("### BASELINE (without metadata)\n" + example_result['baseline']['answer'])
print_markdown("### ENHANCED (with metadata)\n" + example_result['enhanced']['answer'])

# Print the tickers used for enhanced
print_markdown("**Enhanced sources used:**")
for d, m, s in example_result['enhanced']['retrieved']:
    print_markdown(f"- `{m.get('TICKER')}` | {m.get('COMPANY_NAME')} | {m.get('SECTOR')} | {m.get('INDUSTRY')} | score: {s:.4f}")
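
Since the enhanced prompt asks the model for a "Sources" line, one rough way to check grounding is to compare the tickers cited there against the tickers that were actually retrieved. The sketch below is a heuristic under stated assumptions: `cited_tickers` and `unsupported_citations` are hypothetical helpers, the sample answer and retrieved list are made up, and the uppercase-token regex can misread abbreviations such as `N/A` as tickers.

```python
import re

def cited_tickers(answer_text):
    """Extract ticker-like tokens from the first line that starts with 'Sources'."""
    for line in answer_text.splitlines():
        # Strip list numbering and markdown emphasis such as '2) ' or '**'
        cleaned = line.strip().lstrip("*#0123456789). ").strip()
        if cleaned.lower().startswith("sources"):
            # Keep only the text after the label, then collect 1-5 letter uppercase tokens
            _, _, rest = cleaned.partition(":")
            return set(re.findall(r"\b[A-Z]{1,5}\b", rest))
    return set()

def unsupported_citations(answer_text, retrieved):
    """Return cited tickers that do not appear in the retrieved metadata."""
    retrieved_tickers = {meta.get("TICKER") for _, meta, _ in retrieved}
    return cited_tickers(answer_text) - retrieved_tickers

# Illustrative data only; real input would come from rag_query_with_metadata
sample_answer = "Inflation is pressuring margins.\nSources: AAPL (2024-01-05), TSLA (2024-01-07)"
sample_retrieved = [
    ("article text", {"TICKER": "AAPL"}, 0.81),
    ("article text", {"TICKER": "MSFT"}, 0.78),
]
print(unsupported_citations(sample_answer, sample_retrieved))
```

A non-empty result suggests the model cited a company that was not in the provided context, which is worth noting under the "Accuracy & Grounding" criterion.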



## Analysis & Questions - Section 2

### Instructions: Evaluate Answers With and Without Metadata

For each question, compare the two answers provided:
- One generated **without** metadata
- One generated **with** metadata

---

### Steps:

1. Use the following evaluation criteria:
   - Clarity
   - Detail & Depth
   - Use of Context
   - Accuracy & Grounding
   - Relevance
   - Narrative Flow

2. For each criterion, write brief notes comparing how the answer **without metadata** performs versus the answer **with metadata**.

3. Summarize your evaluation in a markdown table with the following columns:

| Criteria       | WITHOUT METADATA            | WITH METADATA             |
|----------------|----------------------------|--------------------------|
| Clarity        | [Your brief note here]     | [Your brief note here]   |
| Detail & Depth         | [Your brief note here]     | [Your brief note here]   |
| Use of Context        | [Your brief note here]     | [Your brief note here]   |
| Accuracy & Grounding       | [Your brief note here]     | [Your brief note here]   |
| Relevance      | [Your brief note here]     | [Your brief note here]   |
| Narrative Flow      | [Your brief note here]     | [Your brief note here]   |

---

**Note:** Keep comments short and clear for easy comparison.
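
The table described in step 3 can also be assembled programmatically, which helps when comparing several questions. This is a minimal sketch: `build_evaluation_table` and `example_notes` are hypothetical names, and the notes shown are placeholders rather than real evaluation results.

```python
CRITERIA = [
    "Clarity",
    "Detail & Depth",
    "Use of Context",
    "Accuracy & Grounding",
    "Relevance",
    "Narrative Flow",
]

def build_evaluation_table(notes):
    """
    notes: dict mapping criterion -> (note_without_metadata, note_with_metadata).
    Criteria with no entry keep a placeholder so gaps are easy to spot.
    """
    placeholder = "[Your brief note here]"
    rows = [
        "| Criteria | WITHOUT METADATA | WITH METADATA |",
        "|---|---|---|",
    ]
    for criterion in CRITERIA:
        without, with_meta = notes.get(criterion, (placeholder, placeholder))
        rows.append(f"| {criterion} | {without} | {with_meta} |")
    return "\n".join(rows)

# Placeholder notes for illustration only
example_notes = {"Clarity": ("Generic wording", "Names the company and sector")}
print(build_evaluation_table(example_notes))
```

In the notebook, the returned string could be passed to `print_markdown` so the table renders in the cell output.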



| Criteria       | WITHOUT METADATA            | WITH METADATA             |
|----------------|----------------------------|--------------------------|
| Clarity        | Answers are sometimes vague or generic; lacks clear linkage to financial context.     | Clearer explanations; ties statements to company, sector, and industry.   |
| Detail & Depth         | Limited to surface-level observations; few specific data points.     | Provides richer detail with company-specific and sector-specific references.   |
| Use of Context        | Uses only the retrieved text; context feels disconnected from the query.     | Integrates metadata into the reasoning, making connections more explicit.   |
| Accuracy & Grounding       | Occasionally makes assumptions not backed by sources.     | More factually grounded; cites sector/industry trends relevant to the query.   |
| Relevance      | Some parts drift from the actual question.     | Stays focused on the query, aligning context with the intended topic.   |
| Narrative Flow      | Jumps between points; less cohesive.     | Flows logically from metadata to retrieved facts to answer.   |

---