#### Building a RAG System with LanceDB and OpenAI
This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:
- LanceDB for vector storage
- Hugging Face sentence-transformers for embeddings
- OpenAI for text generation
- LangChain for the RAG pipeline

#### 1. Import Required Libraries

In [2]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import LanceDB
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
import lancedb



#### 2. Initialize Embedding Model
We use HuggingFace's all-mpnet-base-v2 model for generating embeddings

In [3]:
# Initialize embeddings model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

#### 3. Load and Process PDF Documents
Load PDF files from the specified directory and split them into manageable chunks

In [4]:
# Directory containing PDF files
pdf_directory = "offline_doc"

# Load PDF documents
print("Loading PDF documents...")
documents = []
for filename in os.listdir(pdf_directory):
    if filename.endswith(".pdf"):
        file_path = os.path.join(pdf_directory, filename)
        # PyPDFLoader returns one Document per page; each Document holds the page text
        # plus metadata (the source file path and the page number)
        loader = PyPDFLoader(file_path)
        documents.extend(loader.load())

# Split documents into chunks
# RecursiveCharacterTextSplitter splits text into chunks of at most chunk_size characters,
# with up to chunk_overlap characters shared between neighbouring chunks

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
processed_docs = text_splitter.split_documents(documents)

Loading PDF documents...
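
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking with a plain sliding window. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this sketch does not do. The function name `sliding_window_chunks` is hypothetical and not part of LangChain.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; neighbouring windows share chunk_overlap characters.

    Simplified illustration only (hypothetical helper, not LangChain's algorithm).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.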


#### 4. Create Vector Store
Initialize LanceDB and store document embeddings

In [5]:
# Initialize LanceDB
print("Creating vector store...")
db = lancedb.connect("lancedb")

# Extract data
texts = [doc.page_content for doc in processed_docs]
metadatas = [doc.metadata for doc in processed_docs]

# Get embeddings
embeddings_list = embeddings.embed_documents(texts)

# Create data for the table
# For each chunk, create one table row with the columns: id, vector, text, and metadata
data = [
    {"id": str(i), "vector": emb, "text": text, "metadata": metadata}
    for i, (emb, text, metadata) in enumerate(zip(embeddings_list, texts, metadatas))
]

# Create table
table = db.create_table("pdf_vectors", data=data, mode="overwrite")

# Create vector store
vector_store = LanceDB(
    connection=db,
    table_name="pdf_vectors",
    embedding=embeddings,
)

Creating vector store...
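
The row layout used for the table can be inspected without loading a model. The sketch below builds the same `id` / `vector` / `text` / `metadata` rows from toy data. `toy_embed` is a hypothetical stand-in that returns short, meaningless vectors; the real all-mpnet-base-v2 model returns 768-dimensional embeddings. The file name `example.pdf` is also made up.

```python
def toy_embed(text: str, dim: int = 4) -> list[float]:
    """Hypothetical stand-in for an embedding model; the values carry no semantic meaning."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


# Toy chunks that mimic the page_content and metadata of split documents
toy_chunks = [
    {"text": "first chunk", "metadata": {"source": "example.pdf", "page": 0}},
    {"text": "second chunk", "metadata": {"source": "example.pdf", "page": 1}},
]

# One row per chunk, with the same four columns as the table above
rows = [
    {"id": str(i), "vector": toy_embed(c["text"]), "text": c["text"], "metadata": c["metadata"]}
    for i, c in enumerate(toy_chunks)
]

for row in rows:
    print(row["id"], len(row["vector"]), row["text"], row["metadata"])
```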


In [6]:
# Create retriever
# Top-k is the number of chunks to retrieve
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})

retriever.invoke("What is the dataset prepared to detect the AI-generated essays?")

[Document(metadata={'page': 1, 'source': 'offline_doc/group02.pdf'}, page_content='Detection of AI-Generated Text for Essay Competitions \n \n2 | P a g e  \n \nprocessed to support the development and evaluation of \nmachine learning models . The data was broadly \ncategorized into general text  and real competition essay \nsubmissions, each contributing uniquely to the study. \nThe general text category included a dataset from Kaggle, \nfeaturing 29,145 samples of student essays and \nGPT(Curie)-generated essays on car -free cities. On the \nother hand , the Wiki Introduction dataset, sourced from \nHugging Face, consisted of 150,000 pairs of Wikipedia \nintroductions and their AI -generated versions. Both \ndatasets were utilized for preliminary finetuning  of all \ncandidate models , among w hich the best combination of \nmodeling features will be chosen for subsequent finetuning.  \nUnder the competition essay category, an essay dataset  \nincluded 50 past winning essays from five 
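
Conceptually, a similarity retriever with `k` set to 3 embeds the query, scores it against every stored vector, and returns the best-scoring chunks. The sketch below shows that idea in plain Python with toy two-dimensional vectors. The use of cosine similarity is an assumption made for illustration; the metric actually applied depends on how the vector store is configured, and a real store uses an index rather than a full scan. The helper names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], rows: list[dict], k: int = 3) -> list[dict]:
    """Return the k rows whose vectors are most similar to query_vec (brute-force scan)."""
    ranked = sorted(rows, key=lambda r: cosine_similarity(query_vec, r["vector"]), reverse=True)
    return ranked[:k]


# Toy rows with made-up vectors and texts
toy_rows = [
    {"id": "0", "vector": [1.0, 0.0], "text": "about datasets"},
    {"id": "1", "vector": [0.0, 1.0], "text": "about models"},
    {"id": "2", "vector": [0.9, 0.1], "text": "about data preparation"},
    {"id": "3", "vector": [-1.0, 0.0], "text": "unrelated"},
]

hits = top_k([1.0, 0.0], toy_rows, k=2)
print([h["id"] for h in hits])
# ['0', '2']
```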

#### 5. Create RAG Chain
Set up the retrieval and generation pipeline

In [7]:
print("Creating RAG chain...")
# Initialize LLM (OpenAI)
llm = ChatOpenAI(temperature=0, model="gpt-4o")
# Create prompt template
template = """Use the following pieces of context to answer the question. If you don't know the answer, just say that you don't know.

Context: {context}
Question: {question}

Answer:"""
prompt = PromptTemplate.from_template(template)
# Join the retrieved documents into one context string, appending each chunk's source and page
def format_docs(docs):
    return "\n\n".join(
        f"{doc.page_content}\n(Source: {doc.metadata['source']}, Page: {doc.metadata['page']})"
        for doc in docs
    )
# Create the RAG chain
# The dict sends the same input question down two branches:
#   "context":  the question goes to the retriever, and format_docs formats the retrieved chunks
#   "question": RunnablePassthrough() forwards the question unchanged
# The prompt then receives both the formatted context and the original question.
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# StrOutputParser() converts the LLM output to a plain string.
# ChatOpenAI is a chat model that returns a message object, so the parser extracts its .content attribute.

Creating RAG chain...
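
The data flow through the chain can be traced with ordinary functions. The sketch below mirrors the same stages (retrieve, format, build prompt, call model, parse) using stubs. It is not how LangChain implements its runnables; `fake_retriever`, `fake_llm`, `build_prompt`, `parse_output`, and `toy_rag_chain` are hypothetical names, and the stubbed model only echoes the prompt length instead of generating text.

```python
def fake_retriever(question: str) -> list[dict]:
    """Stub retriever: always returns one made-up chunk."""
    return [
        {"page_content": "Chunk about datasets.", "metadata": {"source": "example.pdf", "page": 1}}
    ]


def format_docs(docs: list[dict]) -> str:
    """Join chunks into one context string with a source note after each chunk."""
    return "\n\n".join(
        f"{d['page_content']}\n(Source: {d['metadata']['source']}, Page: {d['metadata']['page']})"
        for d in docs
    )


def build_prompt(inputs: dict) -> str:
    """Fill the context and question into a prompt string."""
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}\n\nAnswer:"


def fake_llm(prompt: str) -> dict:
    """Stub chat model: returns a message-like dict with a 'content' field."""
    return {"content": f"(stub reply to a prompt of {len(prompt)} characters)"}


def parse_output(message: dict) -> str:
    """Extract the text of the message, as a string output parser would."""
    return message["content"]


def toy_rag_chain(question: str) -> str:
    # Both branches receive the same question: one retrieves and formats context,
    # the other passes the question through unchanged.
    inputs = {
        "context": format_docs(fake_retriever(question)),
        "question": question,
    }
    return parse_output(fake_llm(build_prompt(inputs)))


answer = toy_rag_chain("What is in the dataset?")
print(answer)
```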


#### 6. Ask Questions About the PDF Content

In [8]:
question = 'How is the dataset prepared to detect the AI-generated essays?'
response = rag_chain.invoke(question)

In [9]:
print(response)

The dataset for detecting AI-generated essays is prepared by categorizing the data into two main types: general text and real competition essay submissions. The general text category includes a dataset from Kaggle with 29,145 samples of student essays and GPT(Curie)-generated essays on car-free cities, as well as the Wiki Introduction dataset from Hugging Face, which consists of 150,000 pairs of Wikipedia introductions and their AI-generated versions. These datasets are used for preliminary finetuning of candidate models. The competition essay category includes a dataset of 50 past winning essays from five global competitions matched with GPT-3.5/4.0 generated content.

To enhance the model's capability to detect AI-generated text, specific linguistic features are extracted, such as the percentage of stop words and adjectives per text. The text is standardized by converting it to lowercase, and punctuation marks and stop words are removed to focus on the textual content and reduce nois