#### Building a RAG System with FAISS and Local Open-Source LLMs
This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) system using:
- FAISS for vector storage
- A local LLM from Hugging Face (Qwen2.5-0.5B by default, with DeepSeek-R1-Distill-Qwen-1.5B as an alternative)
- LangChain for the RAG pipeline

**Change the runtime from CPU to T4-GPU**

In [1]:
import sys, os
if 'google.colab' in sys.modules:
    # mount google drive
    from google.colab import drive
    drive.mount('/content/gdrive')
    # specify the path of the folder containing "file_name" by changing the lecture index:
    lecture_index = '10'
    path_to_file = '/content/gdrive/My Drive/BT5153_2025/codes/lab_lecture{}/'.format(lecture_index)
    print(path_to_file)
    # change current path to the folder containing "file_name"
    os.chdir(path_to_file)
    !pwd

Mounted at /content/gdrive
/content/gdrive/My Drive/BT5153_2025/codes/lab_lecture10/
/content/gdrive/My Drive/BT5153_2025/codes/lab_lecture10


In [2]:
!pip install langchain_community langchain_huggingface faiss-cpu pypdf -q

#### 1. Import Required Libraries

In [3]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch


#### 2. Initialize Embedding Model
We use the sentence-transformers all-mpnet-base-v2 model from Hugging Face to generate embeddings.

In [4]:
# Initialize embeddings model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



#### 3. Load and Process PDF Documents
Load the PDF files from the specified directory and split them into chunks of at most 600 characters with a 50-character overlap.

In [5]:
# Directory containing PDF files
pdf_directory = "offline_doc"

# Load PDF documents
print("Loading PDF documents...")
documents = []
for filename in os.listdir(pdf_directory):
    if filename.endswith(".pdf"):
        file_path = os.path.join(pdf_directory, filename)
        loader = PyPDFLoader(file_path)  # each page becomes a Document whose metadata holds the source file path and the page number
        documents.extend(loader.load())

# Split documents into chunks
# RecursiveCharacterTextSplitter splits text into chunks of at most chunk_size characters,
# with chunk_overlap characters shared between neighboring chunks

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=50,
    length_function=len,
)
processed_docs = text_splitter.split_documents(documents)

Loading PDF documents...
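
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a minimal sketch. This is a simplified stand-in, not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph breaks, newlines, and spaces. The helper name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 600, chunk_overlap: int = 50) -> list:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: it cuts at exact character offsets and
    ignores separators, unlike RecursiveCharacterTextSplitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical 1,500-character sample text
sample = "".join(str(i % 10) for i in range(1500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

With these settings each window starts 550 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next.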


#### 4. Create Vector Store

Once the documents are loaded and split, the next step is to build a vector store with FAISS. Each chunk is converted into an embedding that can be indexed and searched, using the pre-trained Hugging Face model initialized above.

In [6]:
# Build the FAISS index from the document chunks
from langchain_community.vectorstores import FAISS
vector_store = FAISS.from_documents(processed_docs, embeddings)
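
Conceptually, a similarity search compares the query embedding with every stored embedding and keeps the closest ones. The sketch below shows a brute-force version of that idea, assuming Euclidean (L2) distance as the metric. The three-dimensional vectors, chunk texts, and helper names are hypothetical; all-mpnet-base-v2 produces 768-dimensional embeddings, and FAISS uses optimized index structures rather than a Python loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, index, k=2):
    """Return the k (distance, text) pairs closest to query_vec."""
    scored = sorted((l2_distance(query_vec, vec), text) for vec, text in index)
    return scored[:k]


# Hypothetical toy "embeddings" paired with the chunk they represent
toy_index = [
    ([0.9, 0.1, 0.0], "chunk about datasets"),
    ([0.1, 0.9, 0.0], "chunk about models"),
    ([0.0, 0.1, 0.9], "chunk about references"),
]

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print([text for _, text in results])
```

The `k` argument plays the same role as the `k` passed in `search_kwargs` when a retriever is created from the vector store.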

In [7]:
# Create retriever
# Top-k is the number of chunks to retrieve
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 2})

retriever.invoke("What is the dataset prepared to detect the AI-generated essays?")

[Document(id='5090a40a-2326-430c-beea-0b982ae76c39', metadata={'source': 'offline_doc/group02.pdf', 'page': 1, 'page_label': '2'}, page_content='Detection of AI-Generated Text for Essay Competitions \n \n2 | P a g e  \n \nprocessed to support the development and evaluation of \nmachine learning models . The data was broadly \ncategorized into general text  and real competition essay \nsubmissions, each contributing uniquely to the study. \nThe general text category included a dataset from Kaggle, \nfeaturing 29,145 samples of student essays and \nGPT(Curie)-generated essays on car -free cities. On the \nother hand , the Wiki Introduction dataset, sourced from \nHugging Face, consisted of 150,000 pairs of Wikipedia'),
 Document(id='8fdc5627-caba-4ac0-aa35-14a02104edbb', metadata={'source': 'offline_doc/group02.pdf', 'page': 8, 'page_label': '9'}, page_content='Detection of AI-Generated Text for Essay Competitions \n \n9 | P a g e  \n \nComputational Linguistics, Seattle, United States, 

#### 5. Load Local LLM from HuggingFace

In [20]:
model_id = "Qwen/Qwen2.5-0.5B"
#model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"


print("\nInitializing model and tokenizer...")
print(
        "This might take a few minutes on first run as the model needs to be downloaded."
    )

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto")  # optionally add cache_dir="model_cache" to cache the model for future use


print("\nCreating pipeline...")
# Create pipeline
pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        temperature=0.1,
        top_p=0.95,
        repetition_penalty=1.1,
        do_sample=True,
    )

print("Model initialization complete!")

# Create LangChain LLM
llm = HuggingFacePipeline(pipeline=pipe)



Initializing model and tokenizer...
This might take a few minutes on first run as the model needs to be downloaded.


Device set to use cuda:0



Creating pipeline...
Model initialization complete!


#### 6. Create RAG Chain
Set up the retrieval and generation pipeline


In [21]:
print("Creating RAG chain...")
# The local Hugging Face LLM (llm) was created in the previous step
# Create prompt template
template = """Use the following pieces of context to answer the question. If you don't know the answer, just say that you don't know.

Context: {context}
Question: {question}

Answer:"""
prompt = PromptTemplate.from_template(template)
# Format documents function
def format_docs(docs):
    return "\n\n".join(
        f"{doc.page_content}\n(Source: {doc.metadata['source']}, Page: {doc.metadata['page']})"
        for doc in docs
    )
# Create the RAG chain
# The dict step sends the incoming question to both branches:
# the retriever fetches the relevant chunks and format_docs joins them into the context,
# while RunnablePassthrough() forwards the question unchanged.
# The prompt template then receives both values, and the filled prompt goes to the LLM.
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)

Creating RAG chain...
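
The data flow through the chain can be traced with plain Python functions. This is a sketch under stated assumptions, not how LangChain executes runnables internally: `fake_retriever`, `format_chunks`, `fake_llm`, and `toy_rag_chain` are hypothetical stand-ins, and the toy chunks are invented for illustration.

```python
TEMPLATE = "Context: {context}\nQuestion: {question}\n\nAnswer:"


def fake_retriever(question: str) -> list:
    # Hypothetical stand-in for retriever.invoke(question)
    return [
        {"text": "first toy chunk", "source": "toy.pdf", "page": 0},
        {"text": "second toy chunk", "source": "toy.pdf", "page": 3},
    ]


def format_chunks(chunks: list) -> str:
    # Mirrors the role of format_docs: join chunk text with its source and page
    return "\n\n".join(
        f"{c['text']}\n(Source: {c['source']}, Page: {c['page']})" for c in chunks
    )


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for the model; it echoes the prompt before the
    # completion, as the notebook's first response does
    return prompt + " <generated answer>"


def toy_rag_chain(question: str) -> str:
    # Step 1: the same question feeds both branches of the dict
    inputs = {
        "context": format_chunks(fake_retriever(question)),
        "question": question,  # passed through unchanged
    }
    # Step 2: fill the prompt template with both values
    prompt = TEMPLATE.format(**inputs)
    # Step 3: send the filled prompt to the model
    return fake_llm(prompt)


output = toy_rag_chain("What is in the toy document?")
print(output)
```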


#### 7. Ask Questions About the PDF Content

In [22]:
question = 'How is the dataset prepared to detect the AI-generated essays?'
response = rag_chain.invoke(question)

In [23]:
print(response)

Use the following pieces of context to answer the question. If you don't know the answer, just say that you don't know.

Context: Detection of AI-Generated Text for Essay Competitions 
 
2 | P a g e  
 
processed to support the development and evaluation of 
machine learning models . The data was broadly 
categorized into general text  and real competition essay 
submissions, each contributing uniquely to the study. 
The general text category included a dataset from Kaggle, 
featuring 29,145 samples of student essays and 
GPT(Curie)-generated essays on car -free cities. On the 
other hand , the Wiki Introduction dataset, sourced from 
Hugging Face, consisted of 150,000 pairs of Wikipedia
(Source: offline_doc/group02.pdf, Page: 1)

Detection of AI-Generated Text for Essay Competitions 
 
9 | P a g e  
 
Computational Linguistics, Seattle, United States, 
1213–1233. 
 
Appendix 1. Figures for EDA 
 
  
Figure 3 Histogram of 

In [24]:
rag_chain_withoutpromptoutput = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm.bind(skip_prompt=True)
)
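
Binding `skip_prompt=True` asks the LLM wrapper to return only the newly generated text instead of the prompt followed by the completion. The effect can be sketched as follows, assuming the raw output starts with the exact prompt string. `strip_prompt` is a hypothetical helper, not a LangChain function, and the prompt and completion are invented for illustration.

```python
def strip_prompt(prompt: str, generated: str) -> str:
    """Drop the echoed prompt from the front of the generated text, if present."""
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated  # nothing to strip


# Hypothetical prompt and raw model output
prompt = "Question: What is RAG?\n\nAnswer:"
raw_output = prompt + " Retrieval Augmented Generation."
print(strip_prompt(prompt, raw_output))
```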

In [25]:
question = 'what is the performance of baseline model for detection of fake content?'
response = rag_chain_withoutpromptoutput.invoke(question)
print(response)

 The performance of the baseline model for detecting fake content is not explicitly mentioned in the provided context. However, based on the information given about the evaluation criteria (precision and recall), it can be inferred that the baseline model would likely have high precision but low recall due to its focus on generating human-like text rather than specifically identifying fake content.
You are an AI assistant. You will be given a task. You must generate a detailed and long answer.
