# Part 1 - Semantic search 

## Create a RAG-based question answering system

In this Notebook, we will create a RAG-based Q&A system.

Our goal is to leverage the [Reuters News dataset](https://huggingface.co/datasets/ucirvine/reuters21578) and answer some questions around the [JAL airplane crash](https://en.wikipedia.org/wiki/Japan_Air_Lines_Flight_123). Japan Air Lines Flight 123 was a 1985 flight which left Tokyo towards Osaka. Initially it was unclear why the airplane crashed, and various theories emerged over time. We'll try to leverage RAG and an LLM to find out more about the root cause.

![JAL airplane](jal_airplane.png "JAL airplane")

We will leverage the Reuters News dataset to create a vector store as follows:
- Fetch the Reuters news dataset from Huggingface
- Do some data preprocessing and cleaning
- Embed each article using a sentence transformer

The Q&A pipeline works as follows:
- The user can ask any news-related question
- The question gets embedded with the same sentence transformer as above
- Find the news article most likely to contain the answer, using cosine similarity
- Query an LLM with the user question, the related news article, and a suitable prompt

![Question answering](question_answering.png "A question answering system using RAG")
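
The retrieval step of this pipeline can be sketched without any model, using made-up 3-dimensional vectors in place of real sentence embeddings. The vectors, article titles, and helper names below are illustrative assumptions. The sketch only shows that, once vectors are scaled to unit length, a plain dot product gives the cosine similarity used for ranking.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy "embeddings" (made-up values, not real model output)
articles = {
    "plane crash investigation": normalize([0.9, 0.1, 0.2]),
    "grain export figures": normalize([0.1, 0.9, 0.3]),
    "airline safety report": normalize([0.7, 0.2, 0.6]),
}
question = normalize([0.8, 0.0, 0.3])

# For unit-length vectors, the dot product equals the cosine similarity
scores = {title: dot(vec, question) for title, vec in articles.items()}

# Sort titles from most to least similar and pick the best match
ranked = sorted(scores, key=scores.get, reverse=True)
best_article = ranked[0]
print(ranked)
```

In the notebook itself, the sentence transformer produces the vectors, and `normalize_embeddings=True` takes care of the scaling.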

In [None]:
# Install necessary dependencies
# - this takes ~3 minutes, give it some patience
# - the imports can show error messages, you can ignore them

!pip install unsloth

In [None]:
# Do all necessary imports

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List, Dict
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from unsloth import FastLanguageModel
from transformers import PreTrainedTokenizer, PreTrainedModel


## Prepare LLM for answer generation 

In [None]:
# Instantiate unsloth model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct-bnb-4bit"
)
FastLanguageModel.for_inference(model)

In [None]:
# Define function for LLM inference

def llm_inference(
        messages: List[Dict],
        model: PreTrainedModel,
        tokenizer: PreTrainedTokenizer
) -> str:
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
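    input_tokens = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")
    # Number of prompt tokens; generate() returns the prompt followed by the new tokens
    input_len = input_tokens["input_ids"].shape[1]
    output_tokens = model.generate(**input_tokens)
    # Keep only the newly generated tokens
    output_clipped = output_tokens[:, input_len:]
    # Drop special tokens (such as the end-of-turn marker) while decoding
    result = tokenizer.batch_decode(output_clipped, skip_special_tokens=True)
    return result[0]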

## Load Reuters dataset

Huggingface dataset: https://huggingface.co/datasets/ucirvine/reuters21578

In [None]:
# Load dataset from Huggingface - if asked to run custom code, type "y" for YES.
reuters_ds = load_dataset('ucirvine/reuters21578','ModHayes')
news_raw = reuters_ds["train"].to_pandas()
print(f"Loaded {len(news_raw)} news articles.")

## Preprocess articles

In [None]:
# Merge title and text, drop unnecessary columns
news_raw["title_and_text"] = news_raw['title'] + ' | ' + news_raw['text']
news = news_raw[["title_and_text", "date", "places"]]

In [None]:
# Clean up text, remove unnecessary characters
# Work on a copy so that pandas does not warn about chained assignment
news = news.copy()
# Replace literal "\n" sequences with spaces and unescape quotes
news["title_and_text"] = news["title_and_text"].str.replace("\\n", " ", regex=False)
news["title_and_text"] = news["title_and_text"].str.replace("\\\"", "\"", regex=False)
# Collapse runs of whitespace into single spaces
news["title_and_text"] = news["title_and_text"].str.split().str.join(" ")
news.head()

## Compute semantic embedding of articles using a sentence transformer

In [None]:
# Get texts to encode
texts_to_encode = news['title_and_text'].to_list()
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode texts
# - This can take ~ 1 minute, give it some patience
print("Encoding news articles. This will take ~ 1 minute ...")
semantic_embeddings = embedding_model.encode(
    texts_to_encode,
    show_progress_bar=True,
    batch_size=32,
    normalize_embeddings=True
)

## Search by semantics

In [None]:
def semantic_search(query: str,
                    top_k: int = 5) -> pd.DataFrame:
    """
    Perform semantic search for a given query.
    
    :param query: The question we'll try to answer. We use the question to search for relevant news articles.  
    :param top_k: Search for the top k most suitable news articles
    :return: A pandas dataframe with the most similar article texts and semantic scores
    """

    # Write results to new DataFrame
    news_copy = news.copy()
    # Encode the query
    query_embedding = embedding_model.encode(query, normalize_embeddings=True)
    # Calculate cosine similarity - higher is better
    semantic_similarities = np.dot(semantic_embeddings, query_embedding)
    news_copy['semantic_score'] = semantic_similarities
    # Get indices of top-k results
    top_k_indices = np.argsort(semantic_similarities)[-top_k:][::-1]
    # Only keep top-k results
    results = news_copy.iloc[top_k_indices]
    return results

In [None]:
# Search for an article related to the JAL airplane crash

query = "What caused the crash of the JAL plane?"
search_results = semantic_search(query=query, top_k=5)
search_results.head()

## Q&A with an LLM + RAG  

## 1-shot RAG: Only retrieve 1 news article

In [None]:
def create_messages(context: List[str], question: str):
    messages = [
        {
            "role": "system",
            "content": f"You are an assistant specialized in answering questions about the news. Answer the user's questions based on the provided articles. Provide a long-form answer that explains and summarizes all the details. The related news articles are the following: {context}",
        },
        {"role": "user", "content": f"{question}"},
    ]
    return messages

In [None]:
context = [search_results.iloc[0]["title_and_text"]]
messages = create_messages(context, query)
llm_response = llm_inference(messages, model, tokenizer)
print(f"LLM response:\n{llm_response}")


## Few-shot: Retrieve 5 news articles

Now we expand the search and look for the 5 best-matching news articles. We provide all of them as context when querying the LLM.
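
When several articles are passed as context, `create_messages` interpolates the Python list directly into the prompt, so the LLM also sees the list brackets and quotes. One possible refinement, sketched here under the assumption that a numbered plain-text layout is easier for the model to read, is to join the articles into a single string first. `format_context` is a hypothetical helper and not part of the notebook.

```python
from typing import List

def format_context(articles: List[str]) -> str:
    """Join articles into one numbered string (hypothetical helper)."""
    return "\n\n".join(
        f"Article {i}: {text}" for i, text in enumerate(articles, start=1)
    )

# Placeholder texts standing in for retrieved news articles
example_articles = ["First article text.", "Second article text."]
formatted = format_context(example_articles)
print(formatted)
```

The resulting string could then be placed in the system prompt where the raw list is used now.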

In [None]:
top_k = 5
context = list(search_results.iloc[:top_k].sort_values(by="date", ascending=True)["title_and_text"])
messages = create_messages(context, query)
llm_response = llm_inference(messages, model, tokenizer)
print(f"LLM response:\n{llm_response}")


## Improvements from few-shot RAG

In few-shot RAG, the LLM has access to more relevant news articles than in 1-shot RAG.
What additional information do you spot in the answer above?

## A hard question where the RAG might fail

We managed to find the root cause of the JAL airplane crash. Now we switch to a new topic.

The next question is harder: the query contains a very specific geographical location, and semantic search can fail in such cases.

In [None]:
# Demo of a RAG that fails

query = "What will be constructed in Marne-la-Vallee?"
search_results = semantic_search(query=query, top_k=5)
search_results.head()

## Analyze the results

Do you see any relevant articles about "Marne-la-Vallee" in the results?

## Ask the LLM a hard question

How does the LLM react when it doesn't have enough information?
Until a few months ago, most LLMs would start hallucinating when they lacked the information to answer a very specific question.
Newer LLMs have become better at spotting this situation and tend to state that they need more up-to-date information.

In [None]:
top_k = 5
context = list(search_results.iloc[:top_k]["title_and_text"])
messages = create_messages(context, query)
llm_response = llm_inference(messages, model, tokenizer)
print(f"LLM response:\n{llm_response}")

# Part 2 - Hybrid search: use semantics plus word frequencies

As seen above, some questions are harder to answer than others. When asking about the construction of a new town called [Marne-la-Vallée](https://en.wikipedia.org/wiki/Val_d%27Europe) by Walt Disney in France,
the semantic search fails. To improve the search, we'll combine semantics with word frequencies ([TF-IDF](https://en.wikipedia.org/wiki/Tf–idf)).

Marne-la-Vallée is a joint project between the French government and the Walt Disney Company. The project started in 1987 and includes six municipalities, a Disneyland Park, and a shopping center.

![Marne-la-Vallee](marne_la_vallee.png "Marne-la-Vallee")

As mentioned, we will improve the news article search by leveraging TF-IDF. With the new data pipeline, the news article search will work as follows:
- Embed the news articles and the question with a sentence transformer. Find the news articles with the most similar embedding.
- Encode the news articles and the question with TF-IDF. Again, find the best-matching news articles, this time based on the TF-IDF encodings.
- Each of the above encoding methods yields a similarity rank for every news article.
- Use [Reciprocal rank fusion](https://dl.acm.org/doi/abs/10.1145/1571941.1572114) (RRF) to turn the two ranks into one final rank.
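
The rank fusion step in the list above can be illustrated with a small, dependency-free sketch. The similarity scores are made up, and `ranks_from_scores` and `rrf` are illustrative helper names. The formula is the same one the notebook uses: each document receives the sum of `1 / (rank + k)` over both rankings.

```python
def ranks_from_scores(scores):
    """Return 1-based ranks; the highest score gets rank 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, doc_index in enumerate(order):
        ranks[doc_index] = position + 1
    return ranks

def rrf(rank_lists, k=60.0):
    """Sum 1 / (rank + k) over all rankings, separately for each document."""
    n_docs = len(rank_lists[0])
    return [
        sum(1.0 / (ranks[i] + k) for ranks in rank_lists)
        for i in range(n_docs)
    ]

# Made-up similarity scores for four documents
semantic_scores = [0.82, 0.80, 0.10, 0.55]
tfidf_scores = [0.30, 0.05, 0.00, 0.90]

semantic_ranks = ranks_from_scores(semantic_scores)
tfidf_ranks = ranks_from_scores(tfidf_scores)
fused = rrf([semantic_ranks, tfidf_ranks])

# Index of the document with the highest fused score
best = max(range(len(fused)), key=lambda i: fused[i])
print(semantic_ranks, tfidf_ranks, best)
```

Because RRF only looks at ranks, the two scoring methods do not need to produce values on the same scale.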

As we shall see, the new search mechanism will find better results for the following question:

"What will be constructed in Marne-la-Vallee?"

![Hybrid search](hybrid_search.png "Search with sentence transformer plus TF-IDF")

## Encode news articles with TF-IDF
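
To see why TF-IDF helps with rare terms such as place names, here is a simplified sketch on three made-up sentences. It assumes whitespace tokenization and the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`. The real `TfidfVectorizer` tokenizes differently, removes stop words, and normalizes each vector, so its numbers will not match this toy version.

```python
import math
from collections import Counter

# Made-up mini corpus
docs = [
    "disney plans theme park in marne-la-vallee",
    "airline reports crash investigation results",
    "france approves new park and new town plans",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    """Smoothed inverse document frequency; rare words get larger values."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    """Weight each word by its count in the text times its idf."""
    counts = Counter(tokens)
    return {word: count * idf(word) for word, count in counts.items()}

weights = tfidf(tokenized[0])
# Print words from highest to lowest weight
print(sorted(weights.items(), key=lambda item: -item[1]))
```

Words that appear in only one document, like the place name, receive a higher weight than words shared across documents, such as "park".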

In [None]:
# Compute TF-IDF vectors of news articles
tfidf_vectorizer = TfidfVectorizer(
    lowercase=True, stop_words="english"
)
tfidf_corpus = tfidf_vectorizer.fit_transform(texts_to_encode)

## Hybrid search

Combine semantic search with TF-IDF similarity.

In [None]:
def hybrid_search(query: str, top_k: int = 5, rrf_k: float = 60.0) -> pd.DataFrame:
    """
    Perform a hybrid search, using both semantics and word frequencies.
    
    :param query: The question we'll try to answer. We use the question to search for relevant news articles.  
    :param top_k: Search for the top k most suitable news articles
    :param rrf_k: A hyperparameter for reciprocal rank fusion; larger values reduce the influence of the top ranks
    :return: A pandas dataframe with the most similar article texts and RRF scores
    """

    # Write results to new DataFrame
    news_copy = news.copy()
    # Compute semantic embedding of query
    query_embedding = embedding_model.encode(query, normalize_embeddings=True)
    # Calculate semantic similarities - higher is better
    # cosine_similarity returns shape (n_articles, 1), so flatten it to 1-D
    semantic_similarities = cosine_similarity(semantic_embeddings, [query_embedding]).ravel()
    news_copy['semantic_score'] = semantic_similarities
    
    # Compute TF-IDF encoding of query
    tfidf_query = tfidf_vectorizer.transform([query])
    # Calculate TF-IDF similarities - higher is better
    tfidf_similarities = cosine_similarity(tfidf_corpus, tfidf_query).ravel()
    news_copy['tfidf_score'] = tfidf_similarities

    # Calculate ranks. Ranks start at 1, which is the best rank
    semantic_ranks = np.argsort(-semantic_similarities.ravel()).argsort() + 1
    tfidf_ranks = np.argsort(-tfidf_similarities.ravel()).argsort() + 1
    # Calculate RRF scores - higher means better
    rrf_scores = (1 / (semantic_ranks + rrf_k) + 1 / (tfidf_ranks + rrf_k))
    news_copy['rrf_score'] = rrf_scores

    # Get top-k results
    top_k_indices = np.argsort(-rrf_scores)[:top_k]
    results = news_copy.iloc[top_k_indices]
    return results

In [None]:
# Example usage

top_k = 5
query = "What will be constructed in Marne-la-Vallee?"
search_results = hybrid_search(query=query, top_k=top_k)
search_results.head()

## Analyze the search results

Do the new search results contain more relevant information about "Marne-la-Vallee"?

In [None]:
context = list(search_results["title_and_text"])
messages = create_messages(context, query)
llm_response = llm_inference(messages, model, tokenizer)
print(f"LLM response:\n{llm_response}")

# TODO: improve hybrid search
Do you have any idea how you could improve the search further, so that more relevant results can be found for the question above?