# Initial Setup
This notebook covers the initial setup of the LLM application, which is the same for each model.

The focus is on the data preprocessing step of the RAG pipeline: getting the data into the vector database.

We need to load the 2023-2024 SMU Catalog into Qdrant, a cloud vector database. This allows the language model to access and retrieve the necessary information.

Many choices at this step change how the text goes into the vector database (e.g., different text splitters, document loaders, etc.).

Install the necessary dependencies for this notebook.

In [None]:
# Dependencies necessary for this step are: openai, pypdf, qdrant-client, langchain, python-dotenv, tiktoken, langchain-openai, and pandas
%pip install openai pypdf qdrant-client langchain python-dotenv tiktoken langchain-openai pandas

In [8]:
# Load API keys from the .env file into the environment
import os
from dotenv import load_dotenv

# Load environment variables from the .env file (absolute path on the author's machine)
load_dotenv(dotenv_path='c:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/setup/.env')

True
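
Before moving on, it can help to confirm that every key this notebook reads from the environment was actually loaded. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` and the `REQUIRED_VARS` list are illustrative assumptions, with the variable names taken from the `os.environ[...]` lookups used in this notebook.

```python
import os

# Variable names read via os.environ[...] in this notebook.
REQUIRED_VARS = [
    "QDRANT_HOST",
    "QDRANT_API_KEY",
    "QDRANT_COLLECTION_NAME",
    "OPENAI_API_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    # Hypothetical helper, not part of python-dotenv.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

A check like this reports every missing key at once, instead of failing with a `KeyError` on the first missing one in a later cell.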

In [24]:
from qdrant_client import qdrant_client
from qdrant_client.http import models

In [25]:
# Initializing Qdrant host URL and API keys
qdrant_host = os.environ['QDRANT_HOST']
qdrant_api_key = os.environ['QDRANT_API_KEY']
qdrant_collection_name = os.environ['QDRANT_COLLECTION_NAME']

# Initialize Qdrant client
client = qdrant_client.QdrantClient(
    url=qdrant_host,
    api_key=qdrant_api_key,
)

In [26]:
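# Create the first collection of vectors

vectors_config = models.VectorParams(
    size=1536,  # embedding dimension of OpenAI's text-embedding-ada-002
    distance=models.Distance.COSINE,
)

# Note: recreate_collection deletes any existing collection with this name first
client.recreate_collection(
    collection_name=qdrant_collection_name,
    vectors_config=vectors_config,
)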

True
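
The collection above is configured with cosine distance, so Qdrant ranks stored vectors by how closely their direction matches the query vector. As a rough illustration of that score (a plain-Python sketch, not Qdrant's implementation), the hypothetical helper `cosine_similarity` below computes it for two small vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score highest, regardless of length.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Orthogonal vectors share no direction and score lowest among non-negative vectors.
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Because only direction matters, two chunks of very different length can still be close matches if their embeddings point the same way.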

Now we are going to create the vector store for the collection inside the database. Eventually, we will have more than one collection to see how changes in the way the data is uploaded affect accuracy and other evaluation metrics.
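
One way to keep several collections apart is to encode the upload settings in each collection's name. This is only a sketch of that idea: the `CollectionConfig` class, the `make_configs` helper, the `catalog` base name, and the naming scheme are all hypothetical and are not used elsewhere in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionConfig:
    """Settings for one experimental collection (hypothetical structure)."""
    name: str
    chunk_size: int
    chunk_overlap: int


def make_configs(base_name, settings):
    """Build one config per (chunk_size, chunk_overlap) pair."""
    return [
        CollectionConfig(
            name=f"{base_name}_cs{size}_co{overlap}",
            chunk_size=size,
            chunk_overlap=overlap,
        )
        for size, overlap in settings
    ]


# Example settings only; the pairs mirror the splitter sizes used later on.
for config in make_configs("catalog", [(150, 30), (400, 50)]):
    print(config)
```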

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Qdrant


# Create the vector store
def get_vector_store():
    client = qdrant_client.QdrantClient(
        url=qdrant_host,
        api_key=qdrant_api_key,
    )

    # Reads OPENAI_API_KEY from the environment
    embeddings = OpenAIEmbeddings()

    vector_store = Qdrant(
        client=client,
        collection_name=qdrant_collection_name,
        embeddings=embeddings,
    )

    return vector_store

# Create the vector store
vector_store = get_vector_store()

Now is where the experiment begins! If you take a look at the PDFs in the data folder, the catalog PDF is over 800 pages long! To upload it into the vector database, we first have to get all the text from the PDFs and split it into chunks using LangChain text splitters. We will then use the OpenAI embedding model to turn the text chunks into vector embeddings.

First we will do the PDFs, then later on the CSV of the FAQs.
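
To make the chunk size and overlap settings concrete, here is a minimal sketch of fixed-width splitting in plain Python. It is a simplification: LangChain's `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, while the hypothetical `split_with_overlap` helper below cuts purely by character count.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into chunks of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new chunk starts after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "The catalog lists degree requirements, course descriptions, and policies."
for chunk in split_with_overlap(sample, chunk_size=30, chunk_overlap=10):
    print(repr(chunk))
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.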

In [None]:
# Initializing OpenAI API key
openai_api_key = os.environ['OPENAI_API_KEY']

In [None]:
import pypdf

# rank_bm25 is required by BM25Retriever
%pip install rank_bm25

# langchain imports
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever, EnsembleRetriever, BM25Retriever
from langchain.storage import InMemoryStore

# file paths to the two PDFs we're using
pdf_paths = ['C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/20232024 Undergraduate Catalog91123.pdf',
             'C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/Official University Calendar 2023-2024.pdf'
             ]

# function to get the text from the PDFs as LangChain Document objects
def load_pdf_documents(pdf_paths):
    pdf_documents = []
    for path in pdf_paths:
        try:
            # Open the PDF file in binary read mode
            with open(path, 'rb') as file:
                # Read the PDF file using pypdf and extract the text page by page
                pdf_reader = pypdf.PdfReader(file)
                for page_number, page in enumerate(pdf_reader.pages):
                    text = page.extract_text() or ""
                    # Skip pages with no extractable text
                    if text.strip():
                        pdf_documents.append(
                            Document(page_content=text, metadata={"source": path, "page": page_number})
                        )
        except Exception as e:
            print(f"Error loading {path}: {e}")
    return pdf_documents

# one Document per PDF page
pdfs_doc_text = load_pdf_documents(pdf_paths)

# langchain text splitting method
child_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=30, length_function=len)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50, length_function=len)

# in-memory storage for the parent chunks
store = InMemoryStore()

# retriever
parent_retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Split the documents: child chunks are embedded into Qdrant, parent chunks go into the docstore
parent_retriever.add_documents(pdfs_doc_text)
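
The idea behind the parent and child splitters can be sketched without any libraries: small child chunks are what gets searched, and each child points back to the larger parent chunk that is returned as context. In the toy version below, a substring match stands in for vector similarity, and the helpers `split_fixed`, `build_index`, and `retrieve_parents` are hypothetical names, not LangChain APIs.

```python
def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters (no overlap)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(text, parent_size, child_size):
    """Return (parents, children); each child is stored with its parent's id."""
    parents = dict(enumerate(split_fixed(text, parent_size)))
    children = [
        (parent_id, child)
        for parent_id, parent in parents.items()
        for child in split_fixed(parent, child_size)
    ]
    return parents, children


def retrieve_parents(query, parents, children):
    """Search the child chunks, then return each matching parent once."""
    hit_ids = []
    for parent_id, child in children:
        # Substring match is a stand-in for embedding similarity search.
        if query.lower() in child.lower() and parent_id not in hit_ids:
            hit_ids.append(parent_id)
    return [parents[parent_id] for parent_id in hit_ids]


text = "Fall term begins in August. Spring term begins in January. Finals are in May."
parents, children = build_index(text, parent_size=40, child_size=20)
for parent in retrieve_parents("january", parents, children):
    print(repr(parent))
```

Small children keep the match precise, while returning the parent gives the language model more surrounding text to answer from.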