# Create a RAG-based question answering system

In this notebook, we will create a RAG-based Q&A system. The user can ask any question, and we leverage a Reuters news dataset to create a grounded answer.

RAG (retrieval-augmented generation) is a technique to help an AI system generate more accurate answers. For this purpose, we take the user question, search through a large database, retrieve relevant information, and then pass that information to the AI system together with the question.

Our goal is to leverage the [Reuters News dataset](https://huggingface.co/datasets/ucirvine/reuters21578) and answer some questions around the [JAL airplane crash](https://en.wikipedia.org/wiki/Japan_Air_Lines_Flight_123). Japan Air Lines Flight 123 was a 1985 flight which left Tokyo towards Osaka. Initially it was unclear why the airplane crashed, and various theories emerged over time. We'll try to leverage RAG and an LLM to find out more about the root cause.

![JAL airplane](assets/jal_airplane.png "JAL airplane")

We will create a RAG database as follows:
- Fetch the Reuters news dataset from Hugging Face
- Do some data preprocessing
- Compute an embedding for each news article. An embedding is a high-dimensional vector.
- Implement a search function, which takes in a question, and returns a news article which contains the answer to the question.

The Q&A pipeline works as follows:
- The user can ask any news-related question
- The question gets embedded into a high-dimensional vector
- Search for related news articles in the RAG database. For this purpose, compare the embedding of the question to the embeddings of all news articles.
- Take both the user question and the most relevant news articles, put them both in a prompt, and query an LLM (large language model).
- The LLM will answer the user question in text form, and the answer will be grounded in facts from the news database.
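
The pipeline steps above can be sketched without any API calls. The example below is only a toy: it uses word overlap as a stand-in for embedding similarity, and the helper names (`word_overlap`, `retrieve`, `build_prompt`) and the three mini articles are invented for illustration. The notebook itself uses real embeddings and an LLM.

```python
from typing import List


def word_overlap(question: str, article: str) -> float:
    """Toy relevance score: Jaccard overlap of the lowercase word sets.

    This is only a stand-in for the embedding similarity used in this notebook.
    """
    q_words = set(question.lower().split())
    a_words = set(article.lower().split())
    if not q_words or not a_words:
        return 0.0
    return len(q_words & a_words) / len(q_words | a_words)


def retrieve(question: str, articles: List[str], top_k: int) -> List[str]:
    """Return the top_k articles with the highest toy relevance score."""
    ranked = sorted(articles, key=lambda a: word_overlap(question, a), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: List[str]) -> str:
    """Put the retrieved articles and the question into one prompt string."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(context, start=1))
    return f"Answer using only these articles:\n{numbered}\n\nQuestion: {question}"


# Hypothetical mini corpus, invented for illustration only
toy_articles = [
    "investigators report on the plane crash",
    "oil prices rise after opec meeting",
    "new bank interest rates announced",
]
toy_question = "what caused the plane crash"
toy_context = retrieve(toy_question, toy_articles, top_k=1)
print(build_prompt(toy_question, toy_context))
```

In the real pipeline, the prompt string would then be sent to the LLM.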

![Question answering](assets/question_answering.png "A question answering system using RAG")

## Instructions for workshop participants

Ensure you understand the project description above. If you have any questions, reach out to a workshop host.

Next, we start with the implementation of the RAG system. Make sure you understand the content of each notebook cell, and execute one cell after another.

In some places, there are open tasks that you should work on. These tasks are marked as follows:

```
# >>>>>>>>>>>>>>>>>>
# TODO: <Some instructions ...>
# <your code should go here>
# <<<<<<<<<<<<<<<<<<
```

In [None]:
# Install necessary dependencies

!pip install datasets==3.5.1

In [None]:
# Do all necessary imports

import math
import tqdm
from typing import List
from datasets import load_dataset
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from google import genai
from google.genai import types

## Setup function to call Google AI Studio

Next, we prepare the Python code to call the LLM API of Google. Make sure you have your API key ready, as explained in `sessions/1_prompt_engineering_and_rag/README.md`.

We use the [GenAI library from Google](https://pypi.org/project/google-genai/) for this purpose.
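
If you prefer not to paste the key into the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `GOOGLE_API_KEY`. Both that name and the helper `read_api_key()` are illustrative and not part of the workshop material.

```python
import os


def read_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Read the API key from an environment variable.

    The default variable name is an assumption for this sketch;
    use whatever name you exported in your shell.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set or empty.")
    return key
```

With the variable exported, `google_llm_api_key = read_api_key()` could replace the hard-coded string in the next cell.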

In [None]:
# Set up LLM API key
# >>>>>>>>>>>>>>>>>>
# TODO: add your LLM API key here. You can get your key from Google AI Studio.
# Note: More instructions for creating the API key can be found in `sessions/1_prompt_engineering_and_rag/README.md`
google_llm_api_key: str = "<key goes here>"
# <<<<<<<<<<<<<<<<<<

In [None]:

# Prepare client for Google LLM API
client = genai.Client(api_key=google_llm_api_key)


def llm_generate_response(system_message: str, user_message: str, print_prompt: bool = False) -> str:
    """
    Use an LLM to answer a question.

    :param system_message: The general instructions for the LLM, which shape the AI's general behavior.
    :param user_message: The question from the user.
    :param print_prompt: If True, print the system and user messages before calling the LLM.
    :return: The response from the LLM.
    """
    if print_prompt:
        print("=================")
        print("Prompt for LLM:")
        print(f"[System message]: [{system_message}]")
        print(f"[User message]: [{user_message}]")
        print("=================")
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-04-17",
        config=types.GenerateContentConfig(
            system_instruction=system_message),
        contents=user_message
    )
    return response.text


def llm_generate_embeddings(texts: List[str]) -> List[np.ndarray]:
    """
    Use an LLM to generate embeddings for news articles.
    :param texts: A list of news articles to generate embeddings for.
    :return: A list of embeddings for the news articles.
    """
    response = client.models.embed_content(
        model="text-embedding-004",
        contents=texts,
        config=types.EmbedContentConfig(
            output_dimensionality=128
        )
    )
    return [np.array(embedding.values) for embedding in response.embeddings]

## Load Reuters dataset

Throughout this notebook, we'll be using the [Reuters news dataset](https://huggingface.co/datasets/ucirvine/reuters21578) from Hugging Face.
We download it below. This dataset contains short articles from Reuters' financial newswire service from 1987. 

In [None]:
# Load dataset
# - if asked to run custom code, type "y" for YES.
reuters_ds = load_dataset('ucirvine/reuters21578', 'ModHayes')
news_raw = reuters_ds["train"].to_pandas()
print(f"Loaded {len(news_raw)} news articles.")
news_raw.head()

## Preprocess news articles

First we perform some preprocessing on the news data. We'll store all articles in a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

We do the following data processing:
- Concatenate the news text with the news title
- Remove unwanted characters
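
To make the two steps above concrete, here is a minimal sketch that applies them to a single record in plain Python. The helper `clean_article()` and the sample title and text are made up for illustration; the notebook applies the same idea to the whole DataFrame.

```python
def clean_article(title: str, text: str) -> str:
    """Merge title and text, then remove unwanted characters."""
    merged = f"{title} | {text}"
    # Replace literal backslash-n sequences (not real newlines) with spaces
    merged = merged.replace("\\n", " ")
    # Replace escaped quotes (backslash + quote) with plain quotes
    merged = merged.replace('\\"', '"')
    # Collapse runs of whitespace into single spaces
    return " ".join(merged.split())


# Hypothetical raw record, invented for illustration
sample_title = "JAL CRASH"
sample_text = 'Officials  said\\nthe cause was \\"unknown\\".'
print(clean_article(sample_title, sample_text))
```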

In [None]:
# Merge title and text, drop unnecessary columns
news_raw["title_and_text"] = news_raw['title'] + ' | ' + news_raw['text']
# Work on an explicit copy, so the assignments below do not write into a view of news_raw
news = news_raw[["title_and_text", "date", "places"]].copy()

# Clean up text, remove unnecessary characters
# Replace literal "\n" sequences with spaces
news["title_and_text"] = news["title_and_text"].str.replace("\\n", " ", regex=False)
# Replace escaped quotes with plain quotes
news["title_and_text"] = news["title_and_text"].str.replace("\\\"", "\"", regex=False)
# Collapse runs of whitespace into single spaces
news["title_and_text"] = news["title_and_text"].str.split().str.join(" ")
news.head()

## Semantic embedding
 
Next, we embed each news article using an embedding model from Google AI Studio.

After this step, we'll have all of the following data:
- `texts_to_encode`: A list of news articles
- `semantic_embeddings`: A list of embeddings. Each embedding is a high-dimensional vector.
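
Because the embedding API limits the input size per request, the articles are sent in batches. A minimal sketch of the slicing logic follows, using a hypothetical helper `iter_batches()` and made-up article strings. It does not call any API.

```python
from typing import Iterator, List


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 250 placeholder articles split into batches of 100
demo_items = [f"article {i}" for i in range(250)]
batch_lengths = [len(batch) for batch in iter_batches(demo_items, batch_size=100)]
print(batch_lengths)  # [100, 100, 50]
```

The last batch is simply shorter, so no article is dropped or duplicated.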

In [None]:
# Demo for embedding 1 news article
news_samples = news["title_and_text"].iloc[0]
article_embedding = llm_generate_embeddings([news_samples])
print(f"News article: {news_samples},\nEmbedding vector: {article_embedding[0]}")

In [None]:
# Next, we will generate embeddings for all news articles in the dataset.
# Expect some delay here, please give it ~4 minutes
# Since the LLM API has an input size limit, we will process the news articles in batches.

# Get the content of each news article (title + text)
texts_to_encode = news['title_and_text'].to_list()
batch_size = 100
# Keep all embedding results in this list
semantic_embeddings = []
# Calculate the number of batches needed
num_batches = math.ceil(len(texts_to_encode)/batch_size)
# Process each batch
for i in tqdm.tqdm(range(num_batches)):
    # Get a batch of news articles
    news_articles_this_batch = texts_to_encode[i * batch_size:(i+1) * batch_size]

    # >>>>>>>>>>>>>>>>>>
    # TODO: Call the LLM API to generate embeddings for this batch
    # Note: you can use the previously defined function `llm_generate_embeddings()`
    batch_embeddings = llm_generate_embeddings(news_articles_this_batch)
    # <<<<<<<<<<<<<<<<<<
    
    # Keep a list of the embeddings from ALL batches
    semantic_embeddings.extend(batch_embeddings)
# Verify the final result
assert len(semantic_embeddings) == len(texts_to_encode)
assert semantic_embeddings[0].shape == (128,)

## Search in the embedding space

Next, we implement a function `semantic_search()` which can find the most relevant news articles for a given query.
The function should return the best articles with the highest cosine similarity.  
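
As a small worked example of this ranking step, the sketch below computes cosine similarity by hand and picks the top-k entries. It uses plain Python and tiny two-dimensional vectors invented for illustration; the helper names `cosine()` and `top_k_indices()` are hypothetical, and the notebook relies on scikit-learn's `cosine_similarity` instead.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Design choice of this sketch: treat zero vectors as "not similar"
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_indices(query: Sequence[float], corpus: List[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k corpus vectors most similar to the query."""
    scores = [cosine(query, doc) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]


# Toy 2-dimensional "embeddings", invented for illustration
toy_corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query = [1.0, 0.1]
print(top_k_indices(toy_query, toy_corpus, k=2))  # [0, 1]
```

Cosine similarity depends only on the angle between vectors, so a long article and a short question can still match well.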

In [None]:
def semantic_search(query: str,
                    top_k: int) -> pd.DataFrame:
    """
    Perform semantic search for a given query.
    
    :param query: The question we'll try to answer. We use the question to search for relevant news articles.  
    :param top_k: Search for the top_k most suitable news articles
    :return: A pandas DataFrame with the most similar article texts and their respective semantic scores
    """

    # Prepare new DataFrame for the results
    results = news.copy()

    # >>>>>>>>>>>>>>>>>>
    # TODO: Call the LLM API to generate the embedding for this query
    # Note: you can use the previously defined function `llm_generate_embeddings()`
    # Note: Store the result in `query_embedding`, a list which contains 1 embedding.
    #       The embedding itself is a NumPy array of shape (number of embedding features,).
    query_embedding = llm_generate_embeddings([query])
    # <<<<<<<<<<<<<<<<<<

    # Next, we need to compare the embedding of the query with the embeddings of all news articles.
    # We will use the cosine similarity for this purpose.
    # After `[0].tolist()`, the result is a list with one similarity score per news article.
    semantic_similarities = cosine_similarity(
        query_embedding,
        semantic_embeddings
    )[0].tolist()
    # Add semantic similarity score as column to the DataFrame
    results['semantic_score'] = semantic_similarities
    # Get indices of top-k results
    top_k_indices = np.argsort(semantic_similarities)[-top_k:][::-1]
    # Only keep top-k results
    results = results.iloc[top_k_indices]
    return results

In [None]:
# Search for an article related to the JAL airplane crash

query = "What caused the crash of the JAL plane?"
search_results = semantic_search(query=query, top_k=5)
search_results.head()

## TODO: Review the semantic search results

Take a look at the 5 top search results. Which of them are really relevant to the query? 

## Retrieval augmented generation: 1-shot RAG

Next, we combine our search functionality and the text generation capability of LLMs to answer the user question. We ground the LLM in facts, by providing relevant news articles in the context window.  

Here, we use 1-shot RAG, which means we provide the top-1 news article as context to the LLM.

In [None]:
# Define a helper function for RAG (retrieval-augmented generation)
def answer_news_question(question: str, relevant_news: List[str]) -> str:
    """
    :param question: The user question about the news articles
    :param relevant_news: A list of relevant news articles which should help to answer the question
    
    :return: The response generated by the LLM.
    """
    user_message = question
    system_message = f"You are an assistant specialized in answering questions about news. Answer the question provided by the user as requested based on the provided articles. The answer should start with a one-sentence summary, then go into more details for about 5 sentences. The related news are here: {relevant_news}."
    return llm_generate_response(system_message, user_message, print_prompt=True)

In [None]:
# Perform RAG with 1 news article as context
# First, grab the top-1 news article
relevant_news = search_results.iloc[0]["title_and_text"]
# Next, we prepare the question
question = query

# >>>>>>>>>>>>>>>>>>
# TODO: Take the user question and the relevant news articles, and query an LLM to generate a response
# Note: you can use the previously defined function `answer_news_question()`
llm_response = answer_news_question(
    question=question,
    relevant_news=[relevant_news]
)
# <<<<<<<<<<<<<<<<<<

print(llm_response)

## TODO: Investigate the result, is it grounded in truth?

## Retrieval augmented generation: 5-shot RAG

Next, we want to improve the answer to our question. Above, the LLM was only grounded using 1 news article. This limits the facts available to the LLM for a detailed reply.
We now switch to using the top-5 news articles, and add them to the context window for the LLM.
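
When several articles go into one prompt, it can help to mark where each article starts and ends. The sketch below shows one possible layout; the helper `format_articles()` and the two sample strings are hypothetical and not part of the notebook's `answer_news_question()`.

```python
from typing import Iterable


def format_articles(articles: Iterable[str]) -> str:
    """Join articles into one string, each preceded by a numbered header."""
    return "\n\n".join(
        f"Article {i}:\n{text}" for i, text in enumerate(articles, start=1)
    )


# Hypothetical article texts, invented for illustration
toy_news = [
    "Investigators published a report on the crash.",
    "The airline commented on the findings.",
]
print(format_articles(toy_news))
```

The helper accepts any iterable of strings, so a plain list works as well as other string collections.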

In [None]:
# Perform few-shot RAG, leveraging 5 news articles
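# First, grab the top-5 news articles as a list of strings
# (formatting a pandas Series into the prompt would insert its truncated display form)
relevant_news = search_results.iloc[:5]["title_and_text"].to_list()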
# Next, prepare the question
question = query
# Finally, query the LLM with relevant context
llm_response = answer_news_question(question, relevant_news)
print(llm_response)

## TODO: Investigate the 5-shot result, is it better than the 1-shot result?

## Limitations of semantic search

We managed to find the root cause of the JAL airplane crash. Now we switch to a new topic and a harder question:
> "What are politicians planning for Marne-la-Vallee?"

The new query contains a very specific geographical location. Semantic search can fail in such circumstances.

In [None]:
# Demo of a semantic search that fails

query = "What are politicians planning for Marne-la-Vallee?"
search_results = semantic_search(query=query, top_k=3)
search_results.head()

## TODO: Investigate all 3 top results.
Do the results contain plans by politicians for Marne-la-Vallee?

## Answer generation without relevant facts

Next, we test RAG for the case where the LLM is not grounded in facts.
As we've seen above, the top-rated articles do not answer the user question. We want to investigate how the LLM reacts when it doesn't have enough information.
Older LLMs often started hallucinating in this situation.
More recent LLMs tend to handle it better and may decline to answer the question.
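
One possible safeguard, not used in this notebook, is to drop weakly matching articles before building the prompt and to skip the LLM call when nothing remains. The helper `select_context()`, the toy scores, and the threshold of 0.5 are assumptions for illustration. A useful threshold depends on the embedding model and would need tuning.

```python
from typing import List


def select_context(articles: List[str], scores: List[float], min_score: float) -> List[str]:
    """Keep only the articles whose similarity score reaches min_score."""
    return [article for article, score in zip(articles, scores) if score >= min_score]


# Hypothetical search results and scores, invented for illustration
toy_articles = ["article about oil prices", "article about bank rates"]
toy_scores = [0.31, 0.28]

context = select_context(toy_articles, toy_scores, min_score=0.5)
if context:
    print(f"Querying the LLM with {len(context)} article(s).")
else:
    print("No sufficiently similar article found; skipping the LLM call.")
```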

In [None]:
# Perform few-shot RAG for the case where semantic search fails
# Grab the 3 most relevant news articles from "search_results".
relevant_news = search_results.iloc[:3]["title_and_text"].to_list()
# Next, prepare the question
question = query
# Finally, query the LLM with relevant context
llm_response = answer_news_question(question, relevant_news)
print(llm_response)

## TODO: Investigate the results.

Does the LLM try to answer the question, even without relevant information in its context?

# Hybrid search: use semantics plus word frequencies

As seen above, some questions are harder to answer than others. When asking about the construction of a new town called [Marne-la-Vallée](https://en.wikipedia.org/wiki/Val_d%27Europe) by Walt Disney in France,
the semantic search fails. To improve the search, we'll use a combination of semantics plus word frequencies ([TF-IDF](https://en.wikipedia.org/wiki/Tf–idf)).

As an FYI, Marne-la-Vallée was a joint project between the French government and the Walt Disney Company. The project started in 1987 and included six municipalities, a Disneyland Park, and a shopping center.

![Marne-la-Vallee](assets/marne_la_vallee.png "Marne-la-Vallee")

As mentioned, we will improve the news article search by leveraging TF-IDF. With the new data pipeline, the search algorithm will work as follows:
- Embed both the news articles and the user question. Find the news articles with the most similar embedding.
- Encode the news articles and the question with TF-IDF. Again, find the best-matching news articles, this time based on TF-IDF encodings.
- Each of the above encoding methods yields a similarity rank for every news article.
- Use [Reciprocal rank fusion](https://dl.acm.org/doi/abs/10.1145/1571941.1572114) (RRF) to merge the two ranks into one final rank.
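
As a worked example of the last step: with two rankings, the fused RRF score of an article is `1 / (rrf_k + semantic_rank) + 1 / (rrf_k + tfidf_rank)`, where rank 1 is the best rank and a higher score is better. The sketch below applies this to three articles. The ranks are invented for illustration, and the helper name `rrf_scores()` is hypothetical.

```python
from typing import List


def rrf_scores(semantic_ranks: List[int], tfidf_ranks: List[int], rrf_k: float = 60.0) -> List[float]:
    """Fuse two rank lists (1 = best) into one score per article, higher is better."""
    return [
        1.0 / (s_rank + rrf_k) + 1.0 / (t_rank + rrf_k)
        for s_rank, t_rank in zip(semantic_ranks, tfidf_ranks)
    ]


# Hypothetical ranks for three articles (index 0, 1, 2)
semantic_ranks = [1, 2, 3]
tfidf_ranks = [3, 1, 2]

scores = rrf_scores(semantic_ranks, tfidf_ranks)
best_first = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(best_first)  # [1, 0, 2]
```

Article 1 wins because it ranks well in both lists, even though article 0 is the top semantic match. Only ranks enter the formula, so the two similarity scales never need to be made comparable.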

![Hybrid search](assets/hybrid_search.png "Search with sentence transformer plus TF-IDF")

As we shall see, the new search mechanism will find better results for the following question:

> What are politicians planning for Marne-la-Vallee?



## Encode news articles with TF-IDF

We'll compute word frequencies for every news article, leveraging [scikit-learn's TF-IDF implementation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
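
To build some intuition for what TF-IDF measures, here is a toy computation using the textbook formula `tf * log(N / df)`. This is a simplified sketch: scikit-learn's vectorizer additionally applies smoothing and normalization, so its numbers will differ. The helper `tfidf()` and the three sample sentences are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[str]) -> List[Dict[str, float]]:
    """Compute textbook TF-IDF weights for each word in each document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical mini corpus, invented for illustration
toy_docs = [
    "disney plans park near paris",
    "paris stock market rises",
    "oil market prices fall",
]
toy_weights = tfidf(toy_docs)
print(toy_weights[0])
```

In the first document, the rare word "disney" receives a higher weight than "paris", which also appears in the second document. This is why TF-IDF can help with queries that contain rare, specific terms such as place names.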

In [None]:
# Initialize a TF-IDF object
tfidf_vectorizer = TfidfVectorizer(
    lowercase=True, stop_words="english"
)
# Compute TF-IDF encodings for every news article
tfidf_corpus = tfidf_vectorizer.fit_transform(texts_to_encode)

## Hybrid search

We now combine semantic search with TF-IDF similarity. Let's create a function which performs a lookup in our news database.
The function should use "Reciprocal rank fusion" for combining the semantic ranks and the word frequency ranks.

In [None]:
def hybrid_search(query: str, top_k: int, rrf_k: float = 60.0) -> pd.DataFrame:
    """
    Perform a hybrid search, using both semantics and word frequencies.
    
    :param query: The question we'll try to answer. We use the question to search for relevant news articles.  
    :param top_k: Search for the top k most suitable news articles
    :param rrf_k: A hyper-parameter for Reciprocal rank fusion
    :return: A pandas DataFrame with the most relevant news articles and their RRF score (stored in the `rrf_rank` column)
    """

    # Write results to new DataFrame
    results = news.copy()
    # Encode query and compute cosine score for semantic similarities
    # Note: You have already implemented this in "semantic_search()". You can copy your code here.
    query_embedding = llm_generate_embeddings([query])[0]
    semantic_similarities = cosine_similarity(
        semantic_embeddings,
        [query_embedding]
    )
    # Add semantic similarity score as column to the DataFrame
    results['semantic_score'] = semantic_similarities
    
    # Compute TF-IDF encoding of query
    # Note: you've already encoded the news articles with TF-IDF, do the same here for the query
    tfidf_encoding = tfidf_vectorizer.transform([query])
    # Compute cosine similarities, this time for TF-IDF encodings
    # The comparison should happen between "tfidf_corpus" and "tfidf_encoding"
    tfidf_similarities = cosine_similarity(tfidf_corpus, tfidf_encoding)
    results['tfidf_score'] = tfidf_similarities

    # Compute the semantic and TF-IDF ranks.
    # Note: Ranks start at 1, which is the best rank
    semantic_ranks = np.argsort(-semantic_similarities.ravel()).argsort() + 1
    tfidf_ranks = np.argsort(-tfidf_similarities.ravel()).argsort() + 1
    # Calculate RRF scores, which combine the semantic and word frequency ranks
    # Use the formula from today's presentation for the RRF score.
    # Note: despite the name `rrf_rank`, this is a score, and higher means better.
    rrf_rank = (1 / (semantic_ranks + rrf_k) + 1 / (tfidf_ranks + rrf_k))
    results['rrf_rank'] = rrf_rank

    # Get top-k results
    top_k_indices = np.argsort(-rrf_rank)[:top_k]
    results = results.iloc[top_k_indices]
    return results

In [None]:
# Test the hybrid search

top_k = 3
query = "What are politicians planning for Marne-la-Vallee?"

# >>>>>>>>>>>>>>>>>>
# TODO: Use hybrid search to find relevant information about "Marne-la-Vallee"
# Note: you can use the previously defined function `hybrid_search()`
search_results = hybrid_search(query=query, top_k=top_k)
# <<<<<<<<<<<<<<<<<<

# Show the top-3 results
search_results.head(3)


## TODO: Analyze the search results

Do the hybrid search results contain more relevant information about "Marne-la-Vallee"? Are there some irrelevant articles in the results?

## Retrieval augmented generation, this time with hybrid search

Let's generate an answer to our question:
> "What are politicians planning for Marne-la-Vallee?"

This time, we use hybrid search to find relevant entries in our knowledge database. By combining semantic search with TF-IDF, we achieve better results.

In [None]:
# >>>>>>>>>>>>>>>>>>
# TODO: Leverage relevant articles from hybrid search for RAG. Add the articles to the LLM context window, and let the LLM answer the question about Marne-la-Vallee.
# First, grab the top-3 news articles
# Note: we have already performed hybrid search, and have the results ready in "search_results"
relevant_news = search_results.iloc[:3]["title_and_text"].to_list()
# Define the question
question = query
# Finally, query the LLM with relevant context
# Note: you can use the previously defined function `answer_news_question()`
llm_response = answer_news_question(
    question=question,
    relevant_news=relevant_news
)
# <<<<<<<<<<<<<<<<<<

# Show the RAG result
print(llm_response)


## TODO: Analyze the RAG result from hybrid search.
Is the LLM able to answer the question factually? Does the LLM manage to ignore irrelevant news articles?

## TODO: Add further improvements to the search algorithm

Can you think of any additional ways to improve our RAG pipeline? Here are some ideas:
- Give more weight to news **titles** than **texts**
- Leverage article **release dates**, put more weight on the most recent article
- ... any other ideas?