# LibRAGrian: Building a RAG pipeline with Chonkie Chunker & Faiss

Installs

In [None]:
import os

!git clone https://github.com/theophile-bb/LibRAGrian.git
%cd LibRAGrian

In [None]:
!pip install -r requirements.txt

In [None]:
import sys
sys.path.append("src")

# Utils
from utils import *

First, we need the book dataset. We retrieve it from Hugging Face and then start processing it.

Dataset : https://huggingface.co/datasets/stas/gutenberg-100

In [None]:
import pandas as pd

# Login using e.g. `huggingface-cli login` to access this dataset
df = pd.read_csv("hf://datasets/stas/gutenberg-100/books-100.csv")

In [None]:
df.head()

---

## Text processing and cleaning

We keep the relevant columns and remove the book duplicates.

In [None]:
book_df = df[['title','author','text']]
book_df = book_df.drop_duplicates(subset=["title","author"])

In [None]:
book_df

We then have to clean and strip the unnecessary parts of the texts.

The cleaning steps are as follows:
- Decode the UTF-8 BOM text bytes so that punctuation displays correctly.
- Keep only the main text of the book, located between the *START OF THE PROJECT GUTENBERG EBOOK.* and *END OF THE PROJECT GUTENBERG EBOOK.* tags.
- Remove the unnecessary line breaks (*\n*) to improve readability.
- Strip the remaining noise at the beginning of the text: everything before the main title of the book, illustration descriptions, and translation and distribution credits.
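
The cleaning steps above can be sketched roughly as follows. This is only an illustration, not the actual `Clean_book` helper from `src/utils`: the function name `clean_book_sketch`, the regular expressions, and the assumption that the text is already a decoded `str` are all hypothetical.

```python
import re

# Assumed marker formats; real Gutenberg files vary slightly.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def clean_book_sketch(raw_text: str, title: str) -> str:
    """Hypothetical stand-in for Clean_book (assumes raw_text is a str)."""
    # 1. Drop a leading byte order mark if one survived decoding.
    text = raw_text.lstrip("\ufeff")

    # 2. Keep only the text between the START and END markers, when present.
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]

    # 3. Normalise line endings and collapse runs of blank lines.
    text = text.replace("\r\n", "\n")
    text = re.sub(r"\n{3,}", "\n\n", text)

    # 4. Skip front matter: start at the first occurrence of the title.
    title_match = re.search(re.escape(title), text, re.IGNORECASE)
    if title_match:
        text = text[title_match.start():]

    return text.strip()
```

Starting at the first occurrence of the title is a simple heuristic; it can cut too early if the title also appears in the credits.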

Apply the processing to the whole text corpus.

In [None]:
book_df["cleaned_text"] = book_df.apply(lambda row: Clean_book(row["text"], row["title"]),axis=1)

We then proceed to filter the books to keep only the ones written in English.

In [None]:
book_df['language'] = book_df['cleaned_text'].apply(detect)

In [None]:
book_df = book_df[book_df['language'].isin(['en'])]
book_df = book_df.reset_index(drop=True)

In [None]:
book_df

---

## Chunking of the book

The next step is chunking. We split each book into chunks and store them in a dataframe along with their metadata.

In [None]:
chunks_df = Create_chunk_df(book_df)

In [None]:
chunks_df.head()
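
As a rough illustration of what this step produces, the sketch below splits text into overlapping word windows and attaches metadata to each chunk. It is a plain-Python stand-in, not the Chonkie chunker that `Create_chunk_df` relies on; the function names, the `chunk_size` and `overlap` defaults, and the `chunk_id` field are assumptions made for the example.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into windows of chunk_size words sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def build_chunk_records(books, chunk_size=200, overlap=20):
    """Return one record per chunk, carrying the book metadata with it."""
    records = []
    for book in books:
        pieces = chunk_words(book["cleaned_text"], chunk_size, overlap)
        for chunk_id, chunk in enumerate(pieces):
            records.append({
                "title": book["title"],
                "author": book["author"],
                "chunk_id": chunk_id,  # hypothetical field name
                "chunk": chunk,
            })
    return records
```

A list of such records can be turned into a dataframe with `pd.DataFrame(records)`.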

---

## Embedding of the chunks

We transform the chunks into embeddings using the BGE small model (`BAAI/bge-small-en-v1.5`).

In [None]:
embed_model = "BAAI/bge-small-en-v1.5"
texts = chunks_df['chunk'].tolist()

embeddings = Embedding(embed_model, texts, batch_size = 32)

In [None]:
chunks_df["embedding"] = embeddings

---

## Index the embeddings with Faiss

We then create a Faiss index with the embeddings for retrieval using cosine similarity.

In [None]:
index = Create_Faiss_index(embeddings)
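
A common way to get cosine similarity out of Faiss is to L2-normalise the vectors (for example with `faiss.normalize_L2`) and search an inner-product index such as `faiss.IndexFlatIP`. Whether `Create_Faiss_index` does exactly this is an assumption; its code is not shown here. The plain-Python sketch below, with hypothetical function names, only illustrates the underlying idea: on unit-length vectors, the inner product equals the cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_cosine(query, vectors, k=3):
    """Return (index, score) pairs for the k vectors most similar to query."""
    q = l2_normalize(query)
    scored = []
    for i, vec in enumerate(vectors):
        v = l2_normalize(vec)
        # Inner product of two unit vectors is their cosine similarity.
        score = sum(a * b for a, b in zip(q, v))
        scored.append((i, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

This brute-force loop is only meant to show the scoring; an actual index performs the same comparison far more efficiently over all embeddings.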

---

## Retrieval with Qwen2.5-3B

The last step is retrieval. We use the embedding model to embed the query and retrieve the k most relevant chunks. We then pass these chunks, along with their context (book titles and chunk ids) and the query, to the generative model (here Qwen2.5-3B-Instruct) and wait for a reply to be generated.
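
The prompt-assembly part of this step might look like the sketch below. It is a minimal illustration, not the code behind `answer_query`: the function name `build_prompt`, the prompt wording, and the record fields `title`, `chunk_id`, and `chunk` are assumptions.

```python
def build_prompt(query, retrieved):
    """Combine retrieved chunks and the user query into a single prompt.

    `retrieved` is assumed to be a list of dicts with the keys
    'title', 'chunk_id' and 'chunk' (hypothetical field names).
    """
    context_blocks = []
    for item in retrieved:
        # Label each chunk with its source so the answer can cite it.
        header = f"[{item['title']} | chunk {item['chunk_id']}]"
        context_blocks.append(f"{header}\n{item['chunk']}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Keeping the title and chunk id next to each passage makes it possible to return the sources alongside the generated answer.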

In [None]:
gen_model = "Qwen/Qwen2.5-3B-Instruct"

embed_model = load_embedding_model(embed_model)
gen_tokenizer, gen_model, device = load_generation_model(gen_model)


The RAG pipeline is now ready for querying!

In [None]:
query = "Can you tell me all the Jules Verne books you have heard of ?"
#query = "Who is the main character in the book Twenty Thousand Leagues Under the Seas ? Can you describe him/her a bit ?"
#query = "What is the plot about in the book The Strange Case of Dr. Jekyll and Mr. Hyde ?"
#query = "What is the plot about in the book The Skylark of Space ?"
#query = "Among all the books you know who is the most evil character you've seen ? The one with the least moral values or who causes the most pain."
#query = "In the novel The Time Machine, how long did it take to build the machine ?"

result = answer_query(
    query=query,
    embed_model=embed_model,
    index=index,
    chunks_df=chunks_df,
    gen_model=gen_model,
    gen_tokenizer=gen_tokenizer,
    device=device
)

print(result["answer"])
print(result["sources"])

---