# Initial Setup
This notebook covers the initial setup of the LLM application, which is the same for every model.

The focus is on the data preprocessing step of the RAG pipeline: getting the data into the vector database.

We need to load the 2023-2024 SMU Catalog into Qdrant, a cloud vector database, so that the language model can retrieve the information it needs.

Many choices at this step change how the text goes into the vector database (for example, different text splitters or document loaders).

We will install some of the necessary dependencies now and the rest as they are needed throughout the notebook.

In [None]:
%pip install openai pypdf qdrant-client langchain python-dotenv tiktoken langchain-openai pandas

In [28]:
# Load API keys from the .env file into the environment
import os
from dotenv import find_dotenv, load_dotenv

# Load environment variables from the .env file
load_dotenv(find_dotenv(filename='SURF-Project_Optimizing-PerunaBot/setup/.env'))

True
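
`load_dotenv` returning `True` only tells us that at least one variable was loaded, not that every key we need is present. As a minimal sketch, we could check the variables that later cells read from `os.environ` before going any further. The helper name `missing_env_vars` is hypothetical and not part of python-dotenv.

```python
import os

# Names taken from the cells in this notebook that read os.environ.
REQUIRED_VARS = ["QDRANT_HOST", "QDRANT_API_KEY", "OPENAI_API_KEY"]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after the `.env` file is loaded surfaces a missing key here instead of as a `KeyError` in a later cell.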

If you take a look at the PDFs in the data folder, the catalog PDF is over 800 pages long! To upload it into the vector database, we first have to extract all the text from the documents.

We will handle the PDFs first, then later the CSV of the FAQs with a different method.

In [25]:
# langchain imports
from langchain.schema import Document
from langchain_community.document_loaders import PyPDFLoader

# file paths to the two PDFs we're using
pdf_paths = ['C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/20232024 Undergraduate Catalog91123.pdf',
             'C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/Official University Calendar 2023-2024.pdf'
             ]

def load_pdfs_with_langchain(pdf_paths):
    documents = []
    for path in pdf_paths:
        try:
            # Use LangChain's PyPDFLoader to load the PDF
            loader = PyPDFLoader(path)
            # Load and parse the PDF into document instances
            pdf_doc = loader.load()
            # Insert pdf into documents list variable
            documents.extend(pdf_doc)
        except Exception as e:
            print(f"Error loading {path}: {e}")
    return documents

# Load PDF documents using the function
docs = load_pdfs_with_langchain(pdf_paths)

print(len(docs))
print(docs[0].page_content[0:100])
print(docs[0].metadata)

896
 
 
 
 
 
 
Southern Methodist University 
General Information 
Undergraduate Catalog  
2023 -2024  
{'source': 'C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/20232024 Undergraduate Catalog91123.pdf', 'page': 0}
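
The absolute paths above only resolve on one machine, and the loader function only reports a problem after it tries to open each file. As a rough sketch, assuming a hypothetical `DATA_DIR` that points at the repository's `Data` folder, the paths could be built with `pathlib` and checked up front. The helper `split_existing` is illustrative and not part of LangChain.

```python
from pathlib import Path

def split_existing(data_dir, file_names):
    """Return (found, missing) lists of path strings for the given file names."""
    data_dir = Path(data_dir)
    found, missing = [], []
    for name in file_names:
        path = data_dir / name
        (found if path.is_file() else missing).append(str(path))
    return found, missing

# Hypothetical location; adjust to wherever the repository is checked out.
DATA_DIR = Path("Data")
PDF_NAMES = [
    "20232024 Undergraduate Catalog91123.pdf",
    "Official University Calendar 2023-2024.pdf",
]

pdf_paths_found, pdf_paths_missing = split_existing(DATA_DIR, PDF_NAMES)
print(f"found {len(pdf_paths_found)}, missing {len(pdf_paths_missing)}")
```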


Here we initialize an API connection to the Qdrant vector database (https://qdrant.tech/).

In [20]:
from qdrant_client import QdrantClient
from qdrant_client.http import models

In [21]:
# Initializing Qdrant host URL and API key
qdrant_host = os.environ['QDRANT_HOST']
qdrant_api_key = os.environ['QDRANT_API_KEY']

# Initialize Qdrant client
client = QdrantClient(
    url=qdrant_host,
    api_key=qdrant_api_key,
)

Now we are going to create the 1st collection of vectors and the vector store inside the database. Eventually, we will have more than one collection so we can see how changes to the way the data is uploaded and retrieved affect accuracy and other evaluation metrics.

This is a very important step because an LLM application is only as good as its data and the documents it retrieves to create an answer.

Since we will have more than one collection within the vector database, we will write a function that creates a new vector store collection whenever we need one.

In [None]:
# Function to create a vector store for a given collection name
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant

# Initializing OpenAI API key for embeddings and later use
openai_api_key = os.environ['OPENAI_API_KEY']

def create_vectorstore(qdrant_collection_name):
    # The vector size must match the OpenAI embedding model (1536 dimensions)
    vectors_config = models.VectorParams(
        size=1536,
        distance=models.Distance.COSINE,
    )

    # Create the collection in Qdrant (this fails if it already exists)
    client.create_collection(
        collection_name=qdrant_collection_name,
        vectors_config=vectors_config,
    )

    # Wrap the new collection in a LangChain vector store
    vector_store = Qdrant(
        client=client,
        collection_name=qdrant_collection_name,
        embeddings=OpenAIEmbeddings(),
    )

    return vector_store
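
The collection above is configured with cosine distance, which compares the direction of two embedding vectors and ignores their length. As a small illustration in plain Python (not how Qdrant computes it internally), here is cosine similarity on toy 3-dimensional vectors standing in for the 1536-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors: one points the same way as the query, one is orthogonal to it.
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(query, same_direction))  # close to 1: same direction
print(cosine_similarity(query, orthogonal))      # 0: nothing in common
```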

In [None]:
# create 1st collection of vectors

qdrant_collection_1 = os.environ['QDRANT_COLLECTION_1']
vector_store_1 = create_vectorstore(qdrant_collection_1)

This is where the experiment begins! Now that we have all the text from the PDFs in the `docs` variable, the text needs to be split into chunks using LangChain text splitters so the chunks can be turned into vectors with the OpenAI embeddings model.

We are going to test two methods:
1. Parent Document Retriever method with the **RecursiveCharacterTextSplitter**

2. Semantic Chunking method using the **SemanticChunker** text splitter

Since this will generate two different types of chunks, we will put them in two different collections within the vector database.

In [None]:
# Parent Document Retriever Method
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

child_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=25,
                                                length_function=len, add_start_index=True)

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=750, chunk_overlap=50,
                                                 length_function=len, add_start_index=True)

# in-memory storage for the parent chunks (not persisted between sessions)
store = InMemoryStore()

# retriever
parent_retriever = ParentDocumentRetriever(
    vectorstore=vector_store_1, 
    docstore=store, 
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
    )

# adding documents into the Qdrant vector database in the 1st collection
# (called once, so the chunks are not uploaded twice)
parent_retriever.add_documents(docs)

# testing the retriever
parent_retriever.invoke("Tell me about Computer Science at SMU")
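
To make the parent/child idea concrete, here is a simplified sketch in plain Python. It uses fixed-size character windows rather than the `RecursiveCharacterTextSplitter`, and a substring match instead of a vector search, so it only illustrates the bookkeeping: small child chunks are what get searched, and each child points back to the larger parent chunk that is returned. The function names are hypothetical.

```python
def window_split(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def build_parent_child_index(text):
    """Return (parents, children); each child records the id of its parent."""
    # Same sizes as the splitters in the cell above: 750/50 and 250/25.
    parents = dict(enumerate(window_split(text, 750, 50)))
    children = [
        {"parent_id": parent_id, "text": child}
        for parent_id, parent in parents.items()
        for child in window_split(parent, 250, 25)
    ]
    return parents, children

def retrieve_parent(query_word, parents, children):
    """Toy lookup: find the first child containing the word, return its parent."""
    for child in children:
        if query_word in child["text"]:
            return parents[child["parent_id"]]
    return None

sample = "Computer Science " * 100  # placeholder text, 1700 characters
parents, children = build_parent_child_index(sample)
print(len(parents), len(children))
```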

Now for the 2nd method, which uses the SemanticChunker. It splits the text based on the meaning of each sentence, giving more granular control over retrieval.

For this one, we will use the Ensemble Retriever, which combines the results of multiple retrievers and gives each of them a weight. Within the Ensemble Retriever we will use:
- BM25 Retriever, a keyword-based ranking method used by search engines
- Base retriever that comes with using the vector store as a retriever


In [11]:
# semantic text splitting method
# do '%pip install langchain_experimental' if needed
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

semantic_text_splitter = SemanticChunker(
    OpenAIEmbeddings(), 
    breakpoint_threshold_type="percentile")

semantic_docs = semantic_text_splitter.split_documents(docs)
print(semantic_docs[0].page_content)
print(len(semantic_docs))


 
 
 
 
 
 
Southern Methodist University 
General Information 
Undergraduate Catalog  
2023 -2024   
2135
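
The `"percentile"` breakpoint type can be pictured as follows: measure how far apart neighbouring sentences are in embedding space, then start a new chunk wherever that gap is unusually large. The sketch below is a simplified illustration with toy 2-dimensional vectors, not the `SemanticChunker` implementation, and the 95th percentile used as the cut-off is an assumption.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def semantic_chunks(sentences, embeddings, pct=95.0):
    """Start a new chunk wherever the gap between neighbours exceeds the percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Toy data: two sentences about one topic, then two about another.
sentences = ["Sentence A1.", "Sentence A2.", "Sentence B1.", "Sentence B2."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
print(semantic_chunks(sentences, embeddings))
```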


In [None]:
# %pip install rank_bm25
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# we already imported the Qdrant vector store and OpenAI embeddings in a previous step

bm25_retriever = BM25Retriever.from_documents(semantic_docs)

# creating another instance of a vector store with a new collection using the function we made earlier
qdrant_collection_2 = os.environ['QDRANT_COLLECTION_2']
vector_store_2 = create_vectorstore(qdrant_collection_2)

# adding the semantically split docs into the vector store
# (add_documents uploads into the existing collection; from_documents is a
# class method that builds a separate store, so it is not used here)
vector_store_2.add_documents(semantic_docs)
vector_store_2_retriever = vector_store_2.as_retriever()

# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_store_2_retriever], weights=[0.5, 0.5]
)

ensemble_retriever.invoke("Tell me about Computer Science at SMU")
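
To show what the weights do, here is a sketch of weighted Reciprocal Rank Fusion, the rank-combining approach that LangChain describes for the Ensemble Retriever. This is a simplified illustration rather than the library's code: the constant `c=60` and the document ids are assumptions made for the example.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document scores more the higher it ranks in each list.
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical rankings from the keyword retriever and the vector retriever.
bm25_ranking = ["cs_major", "calendar", "cs_minor"]
vector_ranking = ["cs_minor", "cs_major", "admissions"]
print(weighted_rrf([bm25_ranking, vector_ranking], [0.5, 0.5]))
```

A document that both retrievers rank highly rises to the top, and shifting the weights away from 0.5 and 0.5 lets one retriever dominate the fused order.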

Now we are going to work on uploading the Excel file of 115 FAQs into each vector database collection. The Excel file has multiple sheets, so we will turn each sheet into a CSV using the pandas library and then use LangChain's CSV loader to turn the CSVs into LangChain documents.

In [None]:
# using the pandas library to work with the excel file
import pandas as pd

# Load the Excel file
excel_path = 'C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/SMU FAQs.xlsx'
xlsx = pd.ExcelFile(excel_path)

# checking to see if loading the file worked
print(xlsx.sheet_names)

# Iterate through each sheet and save as a CSV file
csv_files = []
for sheet_name in xlsx.sheet_names:
    # Read the sheet into a DataFrame (the first row becomes the column headers)
    df = pd.read_excel(xlsx, sheet_name=sheet_name)
    
    # Extract metadata from the first cell below the header row
    metadata = df.iat[0, 0]
    
    # Save the DataFrame to a CSV file
    csv_path = f'C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/{sheet_name}.csv'
    df.to_csv(csv_path, index=False)
    csv_files.append((csv_path, metadata))

# Display the list of generated CSV files and their metadata
csv_files

In [46]:
# Filter out the specific CSV file
unwanted_csv = 'C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/SMU FAQs.csv'
filtered_csv_files = [file for file in csv_files if file[0] != unwanted_csv]

In [None]:
# Now turning each csv into a langchain document
from langchain_community.document_loaders import CSVLoader

# Create LangChain documents from CSV files with metadata
csv_documents = []

# Use the filtered list so the unwanted CSV is skipped
for csv_path, metadata in filtered_csv_files:
    loader = CSVLoader(file_path=csv_path)
    csv_docs = loader.load()
    for csv_doc in csv_docs:
        # Replace the file path with the sheet's metadata as the source
        csv_doc.metadata['source'] = metadata
    # Add this file's documents (not the PDF docs) to the running list
    csv_documents.extend(csv_docs)

# Display the first document as an example
print(csv_documents[0])
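
The CSV loader produces one document per row, with the row's columns written out as `column: value` lines. As a rough sketch of that shape using only the standard library (plain dictionaries stand in for LangChain `Document` objects, and the column names and FAQ text are made up for the example):

```python
import csv
import io

def csv_rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_number},
        })
    return documents

# Hypothetical FAQ row; the real column names in the spreadsheet may differ.
sample_csv = "Question,Answer\nExample question?,Example answer.\n"
for doc in csv_rows_to_documents(sample_csv, source="Example sheet"):
    print(doc["page_content"])
    print(doc["metadata"])
```

Keeping each question and answer together in one document means a retrieved FAQ arrives with its answer attached.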