# <a id='toc1_'></a>[Mini challenge retrieval augmented generation (RAG)](#toc0_)
> Oliver Pejic; Arian Iseni; Christof Weickhardt. 

> Here Summary

- [Description of the task](https://spaces.technik.fhnw.ch/storage/uploads/spaces/81/exercises/20240911_NPR_Trainingscenter_MiniChallenge_RAG-1729000458.pdf)
- [Introduction RAG](https://spaces.technik.fhnw.ch/storage/uploads/spaces/81/Retrieval-Augmented-Generation-Intro-1727189074.pdf)

[**Table of contents**](#toc0_)
- [Mini challenge retrieval augmented generation (RAG)](#toc1_)
- [Our approach](#toc2_)
- [Setup](#toc3_)
- [Data Loading & Preprocessing](#toc4_) 
- [Chunking](#toc5_)
- [Embedding & VectorDB](#toc6_)
- [Reasoning for our models](#toc7_)
- [Baseline Pipeline](#toc8_)
- [Evaluation Metrics](#toc9_)
  - [Ragas Metrics](#toc9_1_)
  - [Non-LLM Based Metrics](#toc9_2_)
- [Evaluation Data](#toc10_)
- [Experiment 1:](#toc11_)
- [Experiment 2:](#toc12_)
- [Experiment 3:](#toc13_)
- [Experiment 4:](#toc14_)
- [Personal Takeaways](#toc15_)
- [AI Tools - Improving Our Project with Assistance](#toc16_)
  - [How AI Tools Help Us](#toc16_1_)
  - [Using ChatGPT](#toc16_2_)
  - [Using GitHub Copilot](#toc16_3_)
  - [What Works Best When Asking AI for Help](#toc16_4_)
  - [Summary](#toc16_5_)

# <a id='toc3_'></a>[Setup](#toc0_)


# <a id='toc4_'></a>[Data Loading & Preprocessing](#toc0_)

We will now load the data into a pandas DataFrame and preprocess it. This preprocessing is informed by the insights documented in the `notebooks/cleaning.ipynb` notebook.

The preprocessing steps are implemented in the `src/preprocess.py` file. A class named `TextPreprocessor` is responsible for preparing the main dataset for indexing and retrieval. The preprocessing includes the following steps:

1.	Language Detection and Filtering: Retain only English texts to ensure language consistency.
2.	HTML Cleaning: Strip out HTML tags to focus on the raw textual data.
3.	Special Character Removal: Remove unwanted non-alphanumeric characters while retaining essential punctuation.
4.	Duplicate Removal: Eliminate duplicate text entries to ensure data uniqueness.
5.	Unique Identifier Generation: Create a unique ID for each dataset row using a hash of its content.
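
Steps 2 to 5 above can be sketched with the standard library alone. This is only an illustration, not the actual `TextPreprocessor` implementation: the function names (`strip_html`, `remove_special_chars`, `content_id`, `clean_texts`), the regular expressions, and the choice of SHA-256 are assumptions made for this example. Language detection is left out because it needs a third-party library.

```python
import hashlib
import html
import re


def strip_html(text: str) -> str:
    """Remove HTML tags and decode entities such as &amp;."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(without_tags)


def remove_special_chars(text: str) -> str:
    """Keep letters, digits, whitespace and basic punctuation; collapse whitespace."""
    kept = re.sub(r"[^A-Za-z0-9\s.,;:!?()%&'-]", "", text)
    return re.sub(r"\s+", " ", kept).strip()


def content_id(text: str) -> str:
    """Derive a deterministic ID from the text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def clean_texts(texts):
    """Clean each text, drop exact duplicates, and attach a content hash."""
    seen = set()
    rows = []
    for raw in texts:
        cleaned = remove_special_chars(strip_html(raw))
        if not cleaned or cleaned in seen:
            continue  # skip empty results and duplicates
        seen.add(cleaned)
        rows.append({"id": content_id(cleaned), "content": cleaned})
    return rows


# The first two inputs clean to the same text, so only one of them is kept.
rows = clean_texts(["<p>Solar &amp; wind</p>", "Solar & wind", "<b>Hydro</b> power™"])
```

Hashing the cleaned content means the same text always receives the same ID, which keeps IDs stable across reruns.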

In [None]:
import pandas as pd

df = pd.read_csv(
    "data/raw/cleantech_media_dataset_v2_2024-02-23.csv"
)
df.head()

In [None]:
# delete author col
df = df.drop(columns=['author'])
# rename Unnamed: 0 to 'id'
df = df.rename(columns={'Unnamed: 0': 'id'})

In [None]:
from src.preprocess import TextPreprocessor

tp = TextPreprocessor(df, 'content')

cleaned_data = tp.preprocess_data()

tp.add_unique_id()

The preprocessed data is then saved to a Parquet file for further analysis.

In [None]:
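import os

# Create the output directory if it does not exist yet, then write the
# cleaned data as Parquet. The path is relative to the project root,
# matching the path used to read the raw CSV above.
output_dir = 'data/preprocessed'
os.makedirs(output_dir, exist_ok=True)
cleaned_data.to_parquet(os.path.join(output_dir, 'clean_cleantech.parquet'))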

# <a id='toc5_'></a>[Chunking](#toc0_)
Text chunking involves dividing large documents into smaller, manageable pieces or “chunks” that are easier to process and index. This is particularly useful when dealing with extensive datasets or documents, where processing the entire content at once would be computationally expensive and inefficient.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)

def create_documents(df: pd.DataFrame, text_splitter, verbose=True):
    metadata_cols = ['url', 'domain', 'title', 'date', 'id']
    if not all(col in df.columns for col in metadata_cols + ['content']):
        raise ValueError(
            f"DataFrame must contain all metadata columns and a 'content' column: {metadata_cols + ['content']}")

    metadata = df[metadata_cols].rename(columns={'id': 'origin_doc_id'}).to_dict('records')
    # Replace missing values (None or NaN) with the string 'None'.
    for i, m in enumerate(metadata):
        metadata[i] = {k: 'None' if pd.isna(v) else v for k, v in m.items()}

    docs = text_splitter.create_documents(df['content'], metadata)

    if verbose:
        print(
            f"{text_splitter.__class__.__name__}: "
            f"Number of documents created: {len(docs)}, "
            f"Number of rows in source df: {len(df)}, "
            f"Percentage of documents created: {len(docs) / len(df) * 100:.2f}%")

    return docs

documents = create_documents(df, recursive_text_splitter)

This cell splits each article in the DataFrame into non-overlapping chunks of at most 1000 characters. With `chunk_overlap=0`, no text is repeated across neighbouring chunks, which avoids redundant entries in the index and keeps the chunks clearly separated. `create_documents` first checks that the required metadata columns are present, replaces missing metadata values with the string `'None'`, and then passes the text and metadata to the splitter, so every chunk carries the metadata of its source article. With `verbose=True` it also reports how many chunks were created relative to the number of rows in the DataFrame.
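
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break at separators such as paragraph breaks, newlines, and spaces, whereas the hypothetical `split_fixed` helper below cuts at fixed character offsets.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


text = "abcdefghij"
no_overlap = split_fixed(text, chunk_size=4, chunk_overlap=0)
with_overlap = split_fixed(text, chunk_size=4, chunk_overlap=2)
```

Without overlap, joining the chunks reproduces the original text exactly. With overlap, neighbouring chunks share characters, which can preserve context across chunk boundaries at the cost of a larger index.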

# <a id='toc6_'></a>[Embedding & VectorDB](#toc0_)
In a RAG system, embeddings and a vector database together provide the retrieval step. An embedding model maps queries and document chunks to vectors, and the vector database finds the chunks whose vectors are closest to the query vector. The retrieved chunks then give the language model relevant context, which makes the generated answers more relevant and better grounded in the document collection.

In [None]:
from src.custom_embeddings import bge_m3_embed, qwen2_embed, nomic_embed

# Initialize a list with all three embedding models.
embedding_models = [bge_m3_embed, qwen2_embed, nomic_embed]

# Loop through each model, print its name, embed a sample query, and print the first 20 dimensions of the resulting embedding.
for model in embedding_models:
    print(model.model_name)
    embedding = model.embed_query("The company is also aiming to reduce gas flaring?")
    print(embedding[:20])
    print()

# Define a function to create unique collection names based on the model and text splitter.
from src.vectorstorage import EmbeddingVectorStorage
def get_col_name_vectordb(embeddings, text_splitter):
    return f"{embeddings.model_name}_{text_splitter.__class__.__name__}"

# Create a dictionary to store vector storage instances.
vector_stores = {}

# For each embedding model, create a vector storage instance and include documents.
for model in embedding_models:
    collection_name = get_col_name_vectordb(model, recursive_text_splitter)
    print(f"Collection name: {collection_name}")
    vector_storage = EmbeddingVectorStorage(method_of_embedding=model, collection=collection_name)
    vector_storage.include_documents(documents, should_verbose=True)
    vector_stores[model.model_name] = vector_storage

# Print the dictionary of vector storages.
print(vector_stores)

# Query each vector storage for documents similar to a specific query and print the results.
query = "The company is also aiming to reduce gas flaring?"
for model_name, vector_store in vector_stores.items():
    print(f"Results for model: {model_name}")
    try:
        results = vector_store.search_similar_w_scores(query)
        for doc, score in results:
            print(f"Document: {doc}")
            print(f"Score: {score}")
        print()
    except Exception as e:
        print(f"Error searching in vector store '{model_name}': {e}")
        print()

Here we initialize and use custom embedding models together with a vector storage system to embed and query the document chunks. The custom embedding models are defined in the `CustomHuggingFaceEndpointEmbeddings` class, which extends Hugging Face's endpoint embeddings with a `model_name` attribute. This attribute lets us identify and manage multiple models within our system.

Each model is configured with a specific server endpoint URL and is responsible for transforming text into embeddings. These embeddings are then stored in a `chromadb`-based vector storage system, which allows for efficient retrieval and management of the vector data.

The script performs the following key operations:
1. Initializes three embedding models with unique names and endpoints.
2. Embeds a sample query using each model and prints the first 20 elements of the embeddings for verification.
3. Defines a function to generate unique collection names for storing embeddings based on the model's name and the text splitter class, ensuring organized data management.
4. Creates a vector storage instance for each model and adds the chunked documents to it, printing progress updates when `should_verbose=True`.
5. Demonstrates querying the vector storage with a sample text to find and print documents similar to the query along with their similarity scores.
6. Includes error handling to manage and report potential issues during the search operations, ensuring the robustness of the system.
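
The core idea behind steps 4 and 5, storing vectors and ranking them against a query vector, can be illustrated with a toy in-memory store. This sketch is an assumption-laden stand-in and not the `EmbeddingVectorStorage` or `chromadb` implementation: `ToyVectorStore` is a hypothetical class, the three-dimensional vectors are made up by hand, and cosine similarity is an assumed scoring function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and ranks them by brute force."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, k=2):
        """Return the k most similar texts with their similarity scores."""
        scored = [
            (text, cosine_similarity(query_vector, vector))
            for text, vector in self._items
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = ToyVectorStore()
store.add("gas flaring reduction", [0.9, 0.1, 0.0])  # hand-made vectors,
store.add("solar panel prices", [0.1, 0.9, 0.0])     # not real embeddings
store.add("methane emissions", [0.7, 0.2, 0.1])

results = store.search([1.0, 0.0, 0.0], k=2)
```

A real vector database replaces the brute-force loop with an index. Depending on its configuration, it may also report a distance (lower is better) rather than a similarity, so the direction of the printed scores is worth checking.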

# <a id='toc7_'></a>[Reasoning for our models](#toc0_)


# <a id='toc8_'></a>[Baseline Pipeline](#toc0_)
Our baseline pipeline for question answering employs a multi-component approach integrating embeddings, language models, and various utilities from the LangChain library. This setup allows for efficient retrieval and processing of relevant documents to generate accurate answers to user queries.

**Components**

1.	Embeddings:
- We use `bge_m3_vectordb`, a vector store imported from `src.custom_embeddings` and backed by embeddings from Hugging Face. These embeddings are used to retrieve the most relevant documents from our database based on their semantic similarity to the input query.
2.	Language Models:
- Answer generation is handled by `OllamaLLM` from the `langchain_ollama` package, using the `qwen2.5:0.5b-instruct-q4_0` model. The model generates the answer from the context provided by the retrieved documents.
3.	Retrieval and Processing Utilities:
- Retriever: The basic retriever is created with the `as_retriever()` method of `bge_m3_vectordb` and fetches the documents most relevant to the input question.
- [**Prompt**](https://smith.langchain.com/hub/rlm/rag-prompt?organizationId=2d6cd9b7-5b49-44db-a523-a13c23f12f29): We use a pre-built prompt from LangChain’s hub, specifically `rlm/rag-prompt`. This prompt guides the model to answer the question from the supplied context.

In [None]:
from langchain_ollama import OllamaLLM
from src.custom_embeddings import bge_m3_vectordb
from langchain import hub


qwen2_5_0_5b_model = 'qwen2.5:0.5b-instruct-q4_0'


basic_retriever = bge_m3_vectordb.as_retriever()
llm_model = OllamaLLM(model=qwen2_5_0_5b_model)
basic_prompt = hub.pull("rlm/rag-prompt")

[**Pipeline Execution**](https://python.langchain.com/v0.1/docs/use_cases/question_answering/sources/)

The pipeline is run by invoking `basic_rag_chain` with a sample question. The process includes:

- Retrieving contextually relevant documents from the vector database.
- Joining the retrieved documents into a single context string with `format_docs`, which is then processed by the `rag_chain_from_docs` sequence.
- Generating an answer through the combination of retriever, prompt, and language model.
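
The data flow of these three steps can be traced in plain Python with stand-ins for the retriever and the language model. Everything below is hypothetical and only mirrors the shape of the pipeline: `fake_retriever` does simple keyword matching instead of a vector search, `fake_llm` returns a canned string, and the prompt template is a simplified assumption rather than the actual `rlm/rag-prompt` text.

```python
CORPUS = [
    "The company plans to cut routine gas flaring by 2030.",
    "Flaring burns off excess natural gas at production sites.",
    "Solar capacity grew strongly last year.",
]


def tokens(text):
    """Lower-cased words with surrounding punctuation removed."""
    return {word.strip("?.,").lower() for word in text.split()}


def fake_retriever(question):
    """Stand-in retriever: keep documents sharing a longer word with the question."""
    terms = {t for t in tokens(question) if len(t) > 4}
    return [doc for doc in CORPUS if terms & tokens(doc)]


def format_docs(docs):
    """Join the retrieved texts into one context string."""
    return "\n\n".join(docs)


def build_prompt(context, question):
    """Simplified prompt template (an assumption, not the hub prompt)."""
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def fake_llm(prompt):
    """Stand-in language model returning a canned string."""
    return f"(stub answer generated from a {len(prompt)}-character prompt)"


def run_chain(question):
    docs = fake_retriever(question)                       # 1. retrieve
    prompt = build_prompt(format_docs(docs), question)    # 2. format
    answer = fake_llm(prompt)                             # 3. generate
    return {"context": docs, "question": question, "answer": answer}


result = run_chain("Is the company aiming to reduce gas flaring?")
```

Returning the retrieved context alongside the answer, as this sketch does, makes it possible to inspect which sources an answer was based on.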

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from dotenv import load_dotenv

load_dotenv()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | basic_prompt
    | llm_model
    | StrOutputParser()
)

basic_rag_chain = RunnableParallel(
    {"context": basic_retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

The `basic_rag_chain` can now generate answers to questions. The following cell runs it on a sample question.

In [None]:
basic_rag_chain.invoke("The company is also aiming to reduce gas flaring?")

# <a id='toc9_'></a>[Evaluation Metrics](#toc0_)

To assess and compare different pipeline configurations, we use a structured evaluation approach: LLM-based metrics from the `ragas` library to judge the generated answers and the retrieved contexts, and traditional non-LLM metrics to judge the retrieval step on its own. Below, we explain the specific metrics used in our evaluations.

## <a id='toc9_1_'></a>[Ragas Metrics](#toc0_)

We use the following metrics from the `ragas` library to evaluate the quality of the generated answers and the relevance of the retrieved contexts:

1.	[**Faithfulness**](https://docs.ragas.io/en/v0.1.21/concepts/metrics/faithfulness.html): Measures the factual consistency of the generated answer against the given context. It is computed by identifying claims in the answer and verifying if each can be inferred from the context. The metric ranges from 0 to 1, where higher scores indicate better factual consistency.

2.	[**Answer Relevancy**](https://docs.ragas.io/en/v0.1.21/concepts/metrics/answer_relevance.html): Assesses the pertinence of the generated answer to the given prompt. It is calculated as the mean cosine similarity between the original question and a set of questions that an LLM generates from the answer. High scores are given to answers that address the prompt directly and appropriately, without redundant details or omissions.

3.	[**Context Precision**](https://docs.ragas.io/en/v0.1.21/concepts/metrics/context_precision.html): Evaluates the accuracy with which relevant items from the context are retrieved and ranked. This metric checks if the essential chunks of context appear at the top of the ranking, using a range from 0 to 1 where higher values indicate better precision.

4.	[**Context Entity Recall**](https://docs.ragas.io/en/v0.1.21/concepts/metrics/context_entities_recall.html): Measures the recall of entities from the retrieved context compared to the entities present in the ground truth. This metric is crucial for use cases where specific entity-related information is necessary, such as historical QA or tourism help desks, indicating the fraction of correctly recalled entities.

5.	[**Answer Similarity**](https://docs.ragas.io/en/v0.1.21/concepts/metrics/semantic_similarity.html): Assesses the semantic resemblance between the generated answer and the ideal answer (ground truth). This metric uses a cross-encoder model to calculate semantic similarity, with values ranging from 0 to 1, where higher scores denote a closer alignment with the ground truth.

6.	[**Answer Correctness**](https://docs.ragas.io/en/v0.1.21/concepts/metrics/answer_correctness.html): Evaluates both the semantic and factual accuracy of the generated answer in relation to the ground truth. The metric combines aspects of semantic and factual similarity using a weighted scheme and offers a scoring range from 0 to 1. Higher scores indicate better alignment and correctness, with an optional ‘threshold’ for rounding scores to binary values if needed.
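
Two of these metrics reduce to simple arithmetic once the LLM judgments are available. The sketch below assumes those judgments are already given as booleans (whether each claim is supported, whether each retrieved chunk is relevant) and only shows the final aggregation as we understand it from the `ragas` documentation. The function names are our own and not part of the `ragas` API.

```python
def faithfulness(claim_supported):
    """Fraction of answer claims that can be inferred from the context."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


def context_precision(chunk_relevant):
    """Average of precision@k over the ranks k that hold a relevant chunk.

    chunk_relevant lists the relevance verdicts in ranking order.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision@k at this rank
    return total / hits if hits else 0.0


# Two of three claims are supported by the context.
faith_score = faithfulness([True, True, False])

# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2.
precision_score = context_precision([True, False, True])
```

Because precision@k is only accumulated at ranks holding a relevant chunk, the same relevant chunks score higher when they appear earlier in the ranking.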

## <a id='toc9_2_'></a>[Non-LLM Based Metrics](#toc0_)

To assess the retrieval step on its own, we use the following non-LLM-based metrics, which measure how well relevant documents are found and ranked:

1.	**Mean Reciprocal Rank (MRR)**: A ranking quality metric that evaluates how quickly a system can present the first relevant item among its results. MRR is calculated as the average of the reciprocal ranks of the first relevant answer for different queries, where the reciprocal rank is the inverse of the rank at which the first relevant item appears. MRR values range from 0 to 1, with higher values indicating that the first relevant item typically appears earlier in the list, thus suggesting better performance.

2.	**Precision at K**: Measures the accuracy of the system in identifying relevant items within the top K results. Precision at K is the proportion of relevant items among the top K positions in the list, highlighting the system’s effectiveness at ranking relevant documents higher. This metric helps to understand how many of the top K items are actually relevant to the user, with values ranging from 0 to 1 where higher values indicate better accuracy.

3.	**Recall at K**: Evaluates how comprehensive the system’s retrieval is by measuring the proportion of relevant items that are retrieved within the top K results out of all relevant items available in the dataset. This metric assesses the system’s ability to include as many relevant items as possible within the top ranks, reflecting its effectiveness in covering the relevant documents needed for user queries. Similar to Precision, Recall values range from 0 to 1, with higher values indicating more comprehensive retrieval of relevant items.
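
The three metrics above follow directly from their definitions. The sketch below is a minimal reference implementation under the assumption that each query comes with a ranked list of retrieved document IDs and a set of relevant IDs. The function names and the example IDs are illustrative only.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved items that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)


runs = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant item at rank 2
    (["d2", "d5", "d9"], {"d2", "d9"}),  # first relevant item at rank 1
]

mrr = mean_reciprocal_rank(runs)                  # (1/2 + 1) / 2
p_at_2 = precision_at_k(*runs[1], k=2)            # 1 of the top 2 is relevant
r_at_2 = recall_at_k(*runs[1], k=2)               # 1 of 2 relevant items found
r_at_3 = recall_at_k(*runs[1], k=3)               # both relevant items found
```

Precision and recall use the same numerator but different denominators (K versus the number of relevant items), so increasing K can only raise recall while precision may fall.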

# <a id='toc10_'></a>[Evaluation Data](#toc0_)

In [None]:
eval_df = pd.read_csv('data/eval_dataset/cleantech_rag_evaluation_data_2024-09-20.csv')
eval_df.head()

# <a id='toc11_'></a>[Experiment 1:](#toc0_)

# <a id='toc12_'></a>[Experiment 2:](#toc0_)

# <a id='toc13_'></a>[Experiment 3:](#toc0_)

# <a id='toc14_'></a>[Experiment 4:](#toc0_)

# <a id='toc15_'></a>[Personal Takeaways](#toc0_)

Although this work is meant to demonstrate understanding and skills related to the task at hand, we would each like to reflect in a few sentences on our journey through this mini-challenge.

### Arian

In this challenge, I had the incredible opportunity to learn so much about large language models, how a retrieval system works, and how the metrics are used to rank the relevant information for a query. LLM-as-a-judge was also an amazing experience, seeing how the models judge each other. I found tools like Ollama and models from Hugging Face to be incredibly useful and easy to use. On top of all that, I also learned so much about Docker.

### Oliver

This project was an incredible opportunity for me to dive deep into the fascinating world of large language models, RAG systems, and how to use different models with Hugging Face and Ollama. These tools are truly cutting-edge, combining advanced technology and data management to create smart responses. It was an amazing experience that has greatly increased my understanding and appreciation of how these powerful systems work.

### Christof

I've had a truly enriching experience taking part in this challenge! It's been an amazing opportunity to explore the cutting-edge RAG technologies that are changing the world. I've gained so much knowledge from new methodologies, scholarly papers and insights into infrastructure and computational frameworks. And it's been so inspiring to work with other like-minded individuals, sharing knowledge and skills, and fostering a real sense of camaraderie.

# <a id='toc16_'></a>[AI Tools - Improving Our Project with Assistance](#toc0_)

We've started using AI tools like ChatGPT and GitHub Copilot in our projects, and they've really helped us work better and solve problems faster. This document explains how we use these tools, what tasks they help with, and which ways of using them work best.

## <a id='toc16_1_'></a>[How AI Tools Help Us](#toc0_)

These AI tools make it easier for us to handle coding tasks and fix errors, freeing up time to focus on more important parts of our projects.

### <a id='toc16_2_'></a>[Using ChatGPT](#toc0_)

We use ChatGPT mainly for two things: fixing code errors and coming up with new ideas. When we run into a coding error, we paste the faulty code and the error message into ChatGPT, and it often gives us a solution right away. It's also great for brainstorming new ways to improve our projects, offering suggestions we might not think of on our own.

### <a id='toc16_3_'></a>[Using GitHub Copilot](#toc0_)

GitHub Copilot helps us write code faster. It's like having a coding assistant that suggests lines of code as we type, which is really helpful for straightforward tasks. For more complex problems, though, we still need to do a lot of the work ourselves.

### <a id='toc16_4_'></a>[What Works Best When Asking AI for Help](#toc0_)

Getting the best out of these AI tools depends on how we ask for help. For ChatGPT, being clear about what the error is and what part of the code isn't working is crucial. For new ideas, explaining exactly what we need helps the AI give us useful suggestions.

For GitHub Copilot, it helps to start by clearly writing out what we want the code to do. This makes the tool more likely to suggest the right kind of code.

### <a id='toc16_5_'></a>[Summary](#toc0_)

Using AI tools like ChatGPT and GitHub Copilot has made our projects run smoother and has sped up how quickly we can write code and fix problems. ChatGPT is excellent for quickly dealing with errors and for helping us think of new ideas. GitHub Copilot is great for speeding up our coding, especially for simpler tasks.