## Create a RAG-based question answering system

In this notebook, we will create a RAG-based Q&A system.

Our goal is to use the [Reuters News dataset](https://huggingface.co/datasets/ucirvine/reuters21578) to answer questions about the [JAL airplane crash](https://en.wikipedia.org/wiki/Japan_Air_Lines_Flight_123). Japan Air Lines Flight 123 was a 1985 flight from Tokyo to Osaka. Initially it was unclear why the airplane crashed, and various theories emerged over time. We'll use RAG and an LLM to find out more about the root cause.

![JAL airplane](assets/jal_airplane.png "JAL airplane")

We will create a RAG database as follows:
- Fetch the Reuters news dataset from Hugging Face
- Do some data preprocessing and cleaning
- Embed each news article using a sentence transformer
- Implement a search function with cosine similarity

The Q&A pipeline works as follows:
- The user can ask any news-related question
- The question gets embedded with the same sentence transformer as above
- Find the news article most similar to the question
- Query an LLM with the user question, and provide the relevant news article in the context window

![Question answering](assets/question_answering.png "A question answering system using RAG")

In [None]:
# Do all necessary imports

from datasets import load_dataset
import numpy as np
from typing import List
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from google import genai
from google.genai import types
import math
import tqdm
import os
from dotenv import load_dotenv
from pathlib import Path


## Setup LLM client for Google AI Studio

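We use the Google Gen AI SDK to access models hosted on Google AI Studio. The API key is read from the environment variable `GOOGLE_LLM_API_KEY`, which we load from a `.env` file in the parent directory.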

In [None]:
# Prepare LLM client
env_file_path = Path('../.env')
load_dotenv(dotenv_path=env_file_path)
google_llm_api_key = os.environ.get('GOOGLE_LLM_API_KEY')
client = genai.Client(api_key=google_llm_api_key)

In [None]:
# Helper function for running LLM in autoregressive mode
def llm_generate_response(user_message: str, system_message: str) -> str:

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-04-17",
        config=types.GenerateContentConfig(
            system_instruction=system_message),
        contents=user_message
    )
    return response.text

In [None]:
# Helper function to get the embeddings

def llm_generate_embeddings(texts: List[str]) -> List[np.ndarray]:
    response = client.models.embed_content(
        model="text-embedding-004",
        contents=texts,
        config=types.EmbedContentConfig(
            output_dimensionality=128
        )
    )
    return [np.array(embedding.values) for embedding in response.embeddings]
    


## Load Reuters dataset

Throughout this notebook, we'll be using the [Reuters news dataset](https://huggingface.co/datasets/ucirvine/reuters21578) from Hugging Face.
We download it below. This dataset contains short articles from Reuters' financial newswire service from 1987. 

In [None]:
# Load dataset
# - if asked to run custom code, type "y" for YES.
reuters_ds = load_dataset('ucirvine/reuters21578','ModHayes')
news_raw = reuters_ds["train"].to_pandas()
print(f"Loaded {len(news_raw)} news articles.")

## Preprocess news articles

First we perform some preprocessing on the news data. We'll store all articles in a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). For each article, we keep the actual news text, plus the news title. On the resulting strings, we remove unwanted characters.
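
The cleaning step can be illustrated on a single string. The sketch below uses a hypothetical helper `clean_text` (not part of the notebook) and a made-up sample string. It assumes the raw text contains escape sequences written out as literal characters (a backslash followed by `n` or `"`) as well as irregular whitespace.

```python
def clean_text(raw: str) -> str:
    """Normalize one article string (illustrative helper, not from the notebook)."""
    # Replace the literal two-character sequence backslash + "n" with a space
    text = raw.replace("\\n", " ")
    # Turn escaped quotes (backslash + double quote) into plain double quotes
    text = text.replace('\\"', '"')
    # Collapse runs of whitespace (spaces, tabs, real newlines) into single spaces
    return " ".join(text.split())


# Made-up sample string, not taken from the dataset
sample = 'Title | Line one\\nline  two with \\"quotes\\".\n'
print(clean_text(sample))
```

Splitting on whitespace and re-joining with single spaces also removes leading and trailing whitespace, so no separate `strip()` call is needed.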

In [None]:
# Merge title and text, drop unnecessary columns
news_raw["title_and_text"] = news_raw['title'] + ' | ' + news_raw['text']
news = news_raw[["title_and_text", "date", "places"]]

In [None]:
# Clean up text, remove unnecessary characters
pd.options.mode.chained_assignment = None
news["title_and_text"] = (
    news["title_and_text"]
    .str.replace("\\n", " ", regex=False)    # literal backslash-n sequences
    .str.replace("\\\"", "\"", regex=False)  # escaped double quotes
    .str.split().str.join(" ")               # collapse repeated whitespace
)
news.head()

## Semantic embedding
 
Next, we embed each news article using an embedding model from Google AI Studio.
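
Embedding requests have a size limit, so the full corpus has to be sent in batches. A minimal sketch of the batching logic follows, using a hypothetical `iter_batches` helper and placeholder strings instead of real articles. The batch size of 100 mirrors the value used in this notebook.

```python
from typing import Iterator, List


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in data: 250 placeholder strings instead of real news articles
texts = [f"article {i}" for i in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(texts, batch_size=100)]
print(batch_sizes)
```

The last batch is simply shorter when the corpus size is not a multiple of the batch size.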

In [None]:
# Demo for embedding one news article
news_samples = [news["title_and_text"].iloc[0]]
article_embedding = llm_generate_embeddings(news_samples)
print(f"Embedding for article 0: {article_embedding[0]}")

In [None]:
# Embed all news articles
# Expect some delay here, please give it ~4 minutes

texts_to_encode = news['title_and_text'].to_list()
# We need to use batching, as the LLM requests have a size limit
batch_size = 100
semantic_embeddings = []
# Calculate the number of batches needed
num_batches = math.ceil(len(texts_to_encode)/batch_size)
# Process each batch
for i in tqdm.tqdm(range(num_batches)):
    current_batch = texts_to_encode[i * batch_size:(i+1) * batch_size]
    batch_embeddings = llm_generate_embeddings(current_batch)
    semantic_embeddings.extend(batch_embeddings)



## Search in the embedding space

Next, we implement a function `semantic_search()` which can find the most relevant news articles for a given query.
The function should return the `top_k` articles with the highest cosine similarity to the query, sorted from most to least similar.
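
As an illustration of the ranking logic, here is a small sketch on made-up 2-D vectors. It computes cosine similarity directly with NumPy rather than with scikit-learn, and all values are invented for the example.

```python
import numpy as np

# Toy 2-D "embeddings" for four documents (made-up values for illustration)
doc_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
query_embedding = np.array([1.0, 0.2])

# Cosine similarity = dot product divided by the product of the vector norms
doc_norms = np.linalg.norm(doc_embeddings, axis=1)
query_norm = np.linalg.norm(query_embedding)
similarities = doc_embeddings @ query_embedding / (doc_norms * query_norm)

# argsort is ascending, so take the last top_k indices and reverse them
top_k = 2
top_k_indices = np.argsort(similarities)[-top_k:][::-1]
print(top_k_indices, similarities[top_k_indices])
```

Document 0 points in nearly the same direction as the query and ranks first, while document 3 points the opposite way and gets a negative score.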

In [None]:
def semantic_search(query: str,
                    top_k: int) -> pd.DataFrame:
    """
    Perform semantic search for a given query.
    
    :param query: The question we'll try to answer. We use the question to search for relevant news articles.  
    :param top_k: Search for the top k most suitable news articles
    :return: A pandas DataFrame with the most similar article texts and their respective semantic scores
    """

    # Write results to new DataFrame
    results = news.copy()
    # Encode the query using the same LLM as for embedding the news articles.
    # Taking element [0] yields a 1-D NumPy array with shape (number of embedding features, ).
    query_embedding = llm_generate_embeddings([query])[0]
    # Calculate cosine similarity
    # Each vector inside "semantic_embeddings" should be compared to the vector "query_embedding".
    # The resulting Numpy array should have dimensions (number of news articles, )
    semantic_similarities = cosine_similarity(
        [query_embedding],
        semantic_embeddings
    )[0].tolist()
    # Add semantic similarity score as column to the DataFrame
    results['semantic_score'] = semantic_similarities
    # Get indices of top-k results
    top_k_indices = np.argsort(semantic_similarities)[-top_k:][::-1]
    # Only keep top-k results
    results = results.iloc[top_k_indices]
    return results

In [None]:
# Search for an article related to the JAL airplane crash

query = "What caused the crash of the JAL plane?"
search_results = semantic_search(query=query, top_k=5)
search_results.head()

In [None]:
# TODO: Review the semantic search results
# Take a look at the 5 search results. Which of them are really relevant to the query? 

## Retrieval augmented generation: Combine semantic search with GenAI

Use 1-shot RAG to answer the user question about the JAL airplane crash. We combine information retrieval with a generative model, and ground the LLM in facts by providing a relevant news article in its context window.

In [None]:
# Define a helper function for RAG (retrieval-augmented generation)
def answer_news_question(question: str, relevant_news: List[str]) -> str:
    """
    Answer a question with an LLM that is grounded in the given news articles.

    :param question: The user question about the news articles
    :param relevant_news: A list of relevant news articles which should help to answer the question
    
    :return: The answer to the question
    """
    user_message = question
    system_message = f"You are an assistant specialized in answering questions about the news. Answer the user's questions based on the provided articles. Provide a long-form answer that explains and summarizes all relevant details. The related news articles are the following: {relevant_news}."
    return llm_generate_response(user_message, system_message)

In [None]:
# Perform retrieval augmented generation
# 1-shot RAG: We only retrieve the top rated news article
# First, grab the top-1 news article from the "search_results"
# Wrap it in a list, as "answer_news_question()" expects a list of articles
relevant_news = [search_results.iloc[0]["title_and_text"]]
# Next, prepare the question
question = query
# Finally, query the LLM with relevant context
llm_response = answer_news_question(question, relevant_news)
print(llm_response)

In [None]:
# TODO: Investigate the result, is it grounded in truth?

## Few-shot RAG

Next, we want to improve the answer to our question. Above, the LLM was grounded using only 1 news article, which limits the factual detail available for an extensive reply.
We now switch to using the top-5 news articles, and add them to the context window for the LLM.
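
When several articles are passed to the LLM, it can help to format them explicitly instead of interpolating a container object into the prompt. A minimal sketch, using a hypothetical `format_context` helper and placeholder strings in place of real articles:

```python
from typing import List


def format_context(articles: List[str]) -> str:
    """Join articles into one numbered, clearly delimited context string."""
    blocks = [f"[Article {i}]\n{text}" for i, text in enumerate(articles, start=1)]
    return "\n\n".join(blocks)


# Placeholder strings standing in for retrieved news articles
articles = ["First article text.", "Second article text."]
context = format_context(articles)
print(context)
```

Numbering the articles gives the LLM a way to refer to individual sources. For a pandas Series, calling `.tolist()` first hands the helper plain strings.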

In [None]:
# Perform few-shot RAG, leveraging 5 news articles
# Grab the 5 most relevant news articles from "search_results".
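# Convert to a plain list, so the prompt contains the article texts rather than a pandas Series repr
relevant_news = search_results.iloc[:5]["title_and_text"].tolist()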
# Next, prepare the question
question = query
# Finally, query the LLM with relevant context
llm_response = answer_news_question(question, relevant_news)
print(llm_response)

In [None]:
# TODO: Investigate the 5-shot result, is it better than the 1-shot result?

## Limitations of semantic search

We managed to find the root cause of the JAL airplane crash. Now we switch to a new topic and increase the difficulty of the question. The new question goes as follows:
> "What are politicians planning for Marne-la-Vallee?"

The new query contains a very specific geographical location. Semantic search can fail in such circumstances.

In [None]:
# Demo of a semantic search that only partially works

query = "What are politicians planning for Marne-la-Vallee?"
search_results = semantic_search(query=query, top_k=3)
search_results.head()

In [None]:
# TODO: Investigate all 3 top results. Do they contain plans by politicians for Marne-la-Vallee?

## What happens when the LLM doesn't get relevant information

Let us now investigate what happens when we perform RAG, and the top-rated articles are not answering the user question. How does the LLM react when it doesn't have enough information?
Earlier LLMs would often start hallucinating when they didn't have enough information to answer a very specific question.
More recent LLMs can frequently spot this situation. If they do, they state that the provided information is insufficient and decline to answer the question.

In [None]:
# Perform few-shot RAG for the case where semantic search fails
# Grab the 3 most relevant news articles from "search_results".
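# Convert to a plain list, so the prompt contains the article texts rather than a pandas Series repr
relevant_news = search_results.iloc[:3]["title_and_text"].tolist()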
# Next, prepare the question
question = query
# Finally, query the LLM with relevant context
llm_response = answer_news_question(question, relevant_news)
print(llm_response)


In [None]:
# TODO: Investigate the results. Does the LLM try to answer the question, even without up-to-date information?

# Hybrid search: use semantics plus word frequencies

As seen above, some questions are harder to answer than others. When we ask about the construction of a new town called [Marne-la-Vallée](https://en.wikipedia.org/wiki/Val_d%27Europe) by Walt Disney in France,
the semantic search fails. To improve the search, we'll use a combination of semantics plus word frequencies ([TF-IDF](https://en.wikipedia.org/wiki/Tf–idf)).

For background, Marne-la-Vallée was a joint project between the French government and the Walt Disney Company. The project started in 1987 and included six municipalities, a Disneyland Park, and a shopping center.

![Marne-la-Vallee](assets/marne_la_vallee.png "Marne-la-Vallee")

As mentioned, we will improve the news article search by leveraging TF-IDF. With the new data pipeline, the news article search will work as follows:
- Embed the news articles and the question with a sentence transformer. Find the news articles with the most similar embedding.
- Encode the news articles and the question with TF-IDF. Again, find the best news articles matches, this time based on TF-IDF encodings.
- Each of the above encoding methods yields a similarity rank for every news article.
- Use [Reciprocal rank fusion](https://dl.acm.org/doi/abs/10.1145/1571941.1572114) (RRF) to turn the two ranks into one final rank.

As we shall see, the new search mechanism will find better results for the following question:

"What will be constructed in Marne-la-Vallee?"

![Hybrid search](assets/hybrid_search.png "Search with sentence transformer plus TF-IDF")

## Encode news articles with TF-IDF

We'll compute word frequencies for every news article, leveraging [scikit-learn's TF-IDF implementation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
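
To make the idea concrete, here is a hand-rolled sketch of TF-IDF on three made-up sentences. It assumes plain whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which corresponds to scikit-learn's documented defaults. It is for illustration only and not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter
from typing import Dict, List

# Made-up mini corpus, not taken from the dataset
docs = [
    "disney plans park near paris",
    "oil prices rise in paris",
    "oil prices fall",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens: List[str]) -> Dict[str, float]:
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


weights = tfidf(tokenized[0])
# "disney" occurs in one document, "paris" in two, so "disney" weighs more
print(weights["disney"] > weights["paris"])
```

Rare terms such as place names receive high weights, which is why word frequencies can complement semantic embeddings for queries about specific locations.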

In [None]:
# Initialize a TF-IDF object
tfidf_vectorizer = TfidfVectorizer(
    lowercase=True, stop_words="english"
)
# Compute TF-IDF encodings for every news article
# Encode all news articles, which we stored in the variable "texts_to_encode".
# Study the documentation for scikit-learn's TfidfVectorizer class, if necessary.
tfidf_corpus = tfidf_vectorizer.fit_transform(texts_to_encode)



## Hybrid search

We now combine semantic search with TF-IDF similarity. Let's create a function which performs a lookup in our news database.
The function should use "Reciprocal rank fusion" for combining the semantic ranks and the word frequency ranks.
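
Reciprocal rank fusion scores each document as the sum of `1 / (k + rank)` over the individual rankings, where rank 1 is the best. A small sketch with made-up similarity scores for four documents, assuming `k = 60`:

```python
import numpy as np

# Made-up similarity scores for four documents from two retrieval methods
semantic_scores = np.array([0.90, 0.20, 0.80, 0.10])
tfidf_scores = np.array([0.10, 0.30, 0.70, 0.00])
rrf_k = 60.0

# Double argsort converts scores into ranks; rank 1 is the highest score
semantic_ranks = np.argsort(-semantic_scores).argsort() + 1
tfidf_ranks = np.argsort(-tfidf_scores).argsort() + 1

# RRF score: sum of reciprocal (shifted) ranks; higher means better
rrf_scores = 1 / (rrf_k + semantic_ranks) + 1 / (rrf_k + tfidf_ranks)
best_first = np.argsort(-rrf_scores)
print(semantic_ranks, tfidf_ranks, best_first)
```

Document 2 is ranked second semantically and first by TF-IDF, so it ends up on top after fusion, ahead of document 0, which is first semantically but only third by TF-IDF.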

In [None]:
def hybrid_search(query: str, top_k: int, rrf_k: float = 60.0) -> pd.DataFrame:
    """
    Perform a hybrid search, using both semantics and word frequencies.
    
    :param query: The question we'll try to answer. We use the question to search for relevant news articles.  
    :param top_k: Search for the top k most suitable news articles
    :param rrf_k: A hyper-parameter for Reciprocal rank fusion
    :return: A pandas dataframe with the most relevant news articles and their RRF rank
    """

    # Write results to new DataFrame
    results = news.copy()
    # Encode query and compute cosine score for semantic similarities
    # Note: You have already implemented this in "semantic_search()". You can copy your code here.
    query_embedding = llm_generate_embeddings([query])[0]
    semantic_similarities = cosine_similarity(
        semantic_embeddings,
        [query_embedding]
    ).ravel()  # flatten the (number of news articles, 1) result to 1-D
    # Add semantic similarity score as column to the DataFrame
    results['semantic_score'] = semantic_similarities
    
    # Compute TF-IDF encoding of query
    # Note: you've already encoded the news articles with TF-IDF, do the same here for the query
    tfidf_encoding = tfidf_vectorizer.transform([query])
    # Compute cosine similarities, this time for TF-IDF encodings
    # The comparison should happen between "tfidf_corpus" and "tfidf_encoding"
    tfidf_similarities = cosine_similarity(tfidf_corpus, tfidf_encoding).ravel()
    results['tfidf_score'] = tfidf_similarities

    # Compute the semantic and TF-IDF ranks.
    # Note: Ranks start at 1, which is the best rank
    semantic_ranks = np.argsort(-semantic_similarities.ravel()).argsort() + 1
    tfidf_ranks = np.argsort(-tfidf_similarities.ravel()).argsort() + 1
    # Calculate RRF scores, which combine the semantic and word frequency ranks:
    # score = 1 / (rrf_k + semantic_rank) + 1 / (rrf_k + tfidf_rank)
    # Note: higher means better.
    rrf_rank = (1 / (semantic_ranks + rrf_k) + 1 / (tfidf_ranks + rrf_k))
    results['rrf_rank'] = rrf_rank

    # Get top-k results
    top_k_indices = np.argsort(-rrf_rank)[:top_k]
    results = results.iloc[top_k_indices]
    return results

In [None]:
# Test the hybrid search

top_k = 20
query = "What will be constructed in Marne-la-Vallee?"
# Run a hybrid search
search_results = hybrid_search(query=query, top_k=top_k)
search_results.head()

## Analyze the search results

TODO: Do the new search results contain more relevant information about "Marne-la-Vallee"? Are there some irrelevant articles in the results?

## Perform RAG, this time with hybrid search

In [None]:
# TODO: Look up relevant articles with hybrid search, add the articles to the LLM context window, and let the LLM answer the question about Marne-la-Vallee.
# Use 5-shot RAG. We already have the search results available in "search_results".

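# Convert to a plain list, so the prompt contains the article texts rather than a pandas Series repr
relevant_news = search_results.iloc[:5]["title_and_text"].tolist()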
question = query
llm_response = answer_news_question(question, relevant_news)
print(llm_response)


In [None]:
# TODO: Analyze the LLM response. Is it able to answer the question factually? Does the LLM manage to ignore irrelevant news articles?

# Further improvements to the search algorithm

Can you think of any additional ways for improving our RAG pipeline? Here are some ideas:
- Give more weight to news **titles** than **texts**
- Leverage article **release dates**, and put more weight on the most recent articles
- Use a more sophisticated LLM for semantic embedding
- ... any other ideas?