# Initial Setup
This notebook covers the initial setup of the LLM application, which is the same for each model.

The focus is on the data preprocessing step of the RAG pipeline and on getting the data into the vector database.

We need to load the 2023-2024 SMU Catalog into Qdrant, a cloud vector database. This allows the language model to access and retrieve the necessary information.

There are different experiments that can be done here.

Install the necessary dependencies for this part from requirements.txt.

In [None]:
# Dependencies are: openai, pypdf, qdrant-client, langchain, langsmith, chainlit, python-dotenv, tiktoken,
# langchain-openai, pandas, databricks-vector-search, giskard[llm], and mlflow
%pip install -r requirements.txt

In [None]:
# Load API keys from the .env file into environment variables
import os
from dotenv import load_dotenv
load_dotenv()
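
Before moving on, it can help to confirm that the variables the later cells read are actually set. The following is a minimal sketch, not part of the original setup: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and the inclusion of `OPENAI_API_KEY` assumes the OpenAI embeddings are authenticated through that environment variable.

```python
import os

# Assumed list: the three Qdrant variables read later in this notebook,
# plus OPENAI_API_KEY (an assumption about how the OpenAI key is supplied).
REQUIRED_VARS = [
    "QDRANT_HOST",
    "QDRANT_API_KEY",
    "QDRANT_COLLECTION_NAME",
    "OPENAI_API_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

If anything is reported as missing, check that the `.env` file sits in the working directory and uses exactly these names.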

In [None]:
from qdrant_client import qdrant_client
from qdrant_client.http import models

In [None]:
# Initializing Qdrant host URL and API keys
qdrant_host = os.environ['QDRANT_HOST']
qdrant_api_key = os.environ['QDRANT_API_KEY']
qdrant_collection_name = os.environ['QDRANT_COLLECTION_NAME']

# Initialize the Qdrant client
client = qdrant_client.QdrantClient(
    url=qdrant_host,
    api_key=qdrant_api_key,
)

In [None]:
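# Create the first collection of vectors.
# Note: recreate_collection deletes any existing collection with the same name first.
vectors_config = models.VectorParams(
    size=1536,  # dimension of OpenAI embeddings such as text-embedding-ada-002
    distance=models.Distance.COSINE,
)

client.recreate_collection(
    collection_name=qdrant_collection_name,
    vectors_config=vectors_config,
)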
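
The collection above is configured to compare vectors by cosine distance. As a rough illustration of what that metric measures, here is a small pure-Python sketch of cosine similarity. It is only meant to show the formula on toy vectors; `cosine_similarity` is a hypothetical helper and says nothing about how Qdrant computes scores internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; real embeddings here have 1536 dimensions.
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing the same way score close to 1 regardless of their length, which is why this metric suits text embeddings: direction carries the meaning, not magnitude.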

Now we are going to create the vector store for the collection inside the database. Eventually, we will have more than one collection, so that we can see how changes in the way the data is uploaded affect accuracy and other evaluation metrics.

In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

# Build a LangChain vector store backed by the Qdrant collection
def get_vector_store():
    client = qdrant_client.QdrantClient(
        url=qdrant_host,
        api_key=qdrant_api_key,
    )

    embeddings = OpenAIEmbeddings()

    vector_store = Qdrant(
        client=client, 
        collection_name=qdrant_collection_name, 
        embeddings=embeddings,
    )

    return vector_store

# Create the vector store
vector_store = get_vector_store()
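
One aspect of how the data is uploaded that could be varied between collections is the way text is split before embedding. The sketch below is an assumption for illustration only: this notebook does not specify a splitting strategy, and `chunk_text` is a hypothetical character-based helper, not a LangChain or Qdrant function.

```python
def chunk_text(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so context at a
    boundary appears in both neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Placeholder string standing in for text extracted from a PDF.
sample = "abcdefghij"
no_overlap = chunk_text(sample, 5)
with_overlap = chunk_text(sample, 4, overlap=2)
```

Chunks produced this way could then be passed to the vector store, for example with `vector_store.add_texts(chunks)`, using a different collection for each chunking configuration being compared.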

Now is where the experiment begins! If you take a look at the pdfs in the data section