# Step 2a: RAG pipeline for the OpenAI chat model
Next, we build the RAG pipeline for the base OpenAI model, which is not fine-tuned. This is the closest to how it was done in [the original project](https://github.com/yawbtng/SMUChatBot_Project/blob/main/app.py), where we did not train the model on the data but just gave it access to the data. Think of it like an open-book test: you haven't **learned** the information, but it's directly in front of you.

So in this step, we will create the actual [RAG chain](https://python.langchain.com/v0.2/docs/tutorials/rag/#retrieval-and-generation) using the vector stores and retrievers we made in the data preprocessing Python script in the 'Common' folder, along with other modules we will need.
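To make the open-book idea concrete, here is a minimal sketch of the retrieve-then-prompt flow in plain Python. It is not the LangChain chain built in this notebook: the `DOCS` text is made up, and `retrieve` uses simple word overlap as a stand-in for the vector-store retrievers. The names `tokenize`, `retrieve`, and `build_prompt` are hypothetical helpers for illustration only.

```python
import re

# Toy corpus standing in for the catalog and FAQ documents (made-up text).
DOCS = [
    "The University Advising Center helps students plan their degree.",
    "Student Financial Services answers questions about tuition and aid.",
    "The Undergraduate Catalog lists general information and degree requirements.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, docs, k=2):
    """Return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved text in front of the model, like an open book."""
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

question = "Who answers questions about tuition?"
top_docs = retrieve(question, DOCS, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In the real pipeline the retrieval step is handled by a retriever over embedded documents, and the prompt is sent to the chat model, but the shape of the flow is the same: find relevant text, then hand it to the model together with the question.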



First, we need to include the proper imports and load any environment variables we will need.

In [1]:
# Load API keys from the .env file into the environment
import os
from dotenv import find_dotenv, load_dotenv

# Load environment variables from the .env files
load_dotenv(find_dotenv(filename='SURF-Project_Optimizing-PerunaBot/setup/.env'))

True
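The `True` above only indicates that `load_dotenv` found the file and set at least one variable; it does not confirm that every key this notebook reads is present. As an optional check, a small helper along these lines could list what is missing. The function name `missing_env_vars` is hypothetical, and the variable list is simply the set of names read elsewhere in this notebook.

```python
import os

# Names taken from the cells in this notebook; adjust to your own .env file.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "LANGSMITH_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_PROJECT",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with a stand-in mapping instead of the real environment.
example_env = {"OPENAI_API_KEY": "sk-example", "LANGCHAIN_PROJECT": ""}
print(missing_env_vars(REQUIRED_VARS, example_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` with no second argument checks the real environment, which gives a clearer message than the `KeyError` raised by `os.environ[...]` when a key is absent.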

In [2]:
# langsmith for tracing

from langsmith import Client
langsmith_api_key = os.environ["LANGSMITH_API_KEY"]
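# Read (not set) the tracing flag; raises KeyError if it is missing from the environment
langchain_tracing = os.environ["LANGCHAIN_TRACING_V2"]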
langchain_endpoint = os.environ["LANGCHAIN_ENDPOINT"]
langsmith_project = os.environ["LANGCHAIN_PROJECT"]

langmsiht_client = Client()

# test
from langchain_openai import ChatOpenAI
llm = ChatOpenAI()
llm.invoke("What is 2+7")

AIMessage(content='2 + 7 = 9', response_metadata={'token_usage': {'completion_tokens': 7, 'prompt_tokens': 13, 'total_tokens': 20}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-c7f93632-7be0-4fb6-91a9-e790010ac9d9-0', usage_metadata={'input_tokens': 13, 'output_tokens': 7, 'total_tokens': 20})

We import data_preprocessing.py from the Common folder to use the functions that get the LangChain docs, vector stores, and retrievers.

In [1]:
import sys
sys.path.append('C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Common')
import data_preprocessing

1015
Southern Methodist University
General Information
Undergraduate Catalog
2023 -2024
{'source': 'C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/20232024 Undergraduate Catalog91123.pdf', 'page': 0}
Southern Methodist University
General Information
Undergraduate Catalog
2023 -2024
2443
Southern Methodist University
General Information
Undergraduate Catalog
2023 -2024
7323
['University Advising Center FAQs', 'Student Financial Services FAQs', 'Parent FAQs', 'SMU Experience FAQs', 'UG Admissions Academics FAQs']
[('C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/University Advising Center FAQs.csv', 'https://www.smu.edu/provost/saes/academic-support/university-advising-center/frequently-asked-questions'), ('C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/Student Financial Services FAQs.csv', 'https://www.smu.edu/provost/saes/academic-support/student-a

In [2]:
# Call each helper once and reuse the returned dictionaries
langchain_docs = data_preprocessing.get_all_langchain_docs()
vectorstores = data_preprocessing.get_all_vectorstores()
retrievers = data_preprocessing.get_all_retrievers()

# getting langchain docs
pdf_docs = langchain_docs["pdf_docs"]
csv_docs = langchain_docs["csv_docs"]

# getting collection 0 retriever and vector store
vector_store_0 = vectorstores["vector_store_0"]
vector_store_0_retriever = retrievers["vector_store_0_retriever"]

# getting collection 1 retriever and vector store
vector_store_1 = vectorstores["vector_store_1"]
parent_retriever = retrievers["parent_retriever"]

# getting collection 2 retriever and vector store
vector_store_2 = vectorstores["vector_store_2"]
ensemble_retriever = retrievers["ensemble_retriever"]

# Now you can use these objects as needed in your notebook
print(f"Number of PDF docs: {len(pdf_docs)}")
print(f"Number of CSV docs: {len(csv_docs)}")

Number of PDF docs: 1015
Number of CSV docs: 105


In [None]:
from langchain_openai import ChatOpenAI

# Initializing OpenAI API key for chat model and later use
openai_api_key = os.environ['OPENAI_API_KEY']

gpt_4o = ChatOpenAI(model="gpt-4o", temperature=0, max_tokens=None,
                    timeout=None, max_retries=2)

gpt_3point5 = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=None,
                         timeout=None, max_retries=2)
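Both models are created with `max_retries=2`, meaning a failed request may be attempted again up to two more times. The sketch below illustrates that general pattern (one initial attempt plus a bounded number of retries with exponential backoff) in plain Python. It is an illustration only, not the OpenAI client's actual implementation: the real client retries only certain error types and chooses its own delays. `call_with_retries`, `base_delay`, and `flaky` are hypothetical names.

```python
import time

def call_with_retries(fn, max_retries=2, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure, retry up to max_retries times with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            # Out of retries: let the last error propagate to the caller.
            if attempt == max_retries:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)

# Stand-in for a request that fails twice and then succeeds.
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky, max_retries=2, sleep=delays.append)
print(result, attempts["count"], delays)
```

With `max_retries=2`, the stand-in succeeds on its third and final allowed attempt; one more failure would have raised the error to the caller.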