# Step 1: Data Pre-processing
This notebook covers the initial setup of the LLM application, which is the same for each model.

The focus is on the data pre-processing step of the RAG pipeline and getting the data into the vector database.

We need to load the 2023-2024 SMU Catalog into Qdrant, a cloud vector database. This will allow the language model to access and retrieve the necessary information.

Many choices at this step change how the text goes into the vector database (e.g., different text splitters, document loaders, and retrievers).

We will install some of the necessary dependencies now and the rest along the way throughout the notebook.

In [6]:


# Load API keys from the .env file into environment variables
import os
from dotenv import find_dotenv, load_dotenv

# Load environment variables from the .env files
load_dotenv(find_dotenv(filename='SURF-Project_Optimizing-PerunaBot/setup/.env'))

True
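Since the later cells read several keys straight from `os.environ`, a missing entry in the `.env` file only surfaces as a `KeyError` partway through the notebook. The sketch below is one way to check up front. The helper name `missing_env_vars` is hypothetical, and the list of names is an assumption based on the variables this notebook reads, so adjust it to match your own `.env` file.

```python
import os

# Names assumed from the cells in this notebook; edit to match your .env file.
REQUIRED_VARS = [
    "LANGSMITH_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_PROJECT",
    "QDRANT_HOST",
    "QDRANT_API_KEY",
    "OPENAI_API_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set")
```

Running this right after `load_dotenv` makes it clear whether the `.env` path was found and populated before any client is created.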

Here we will initialize LangSmith for tracing and tracking.

In [7]:
# Import the Client class from the langsmith package for tracing and tracking

from langsmith import Client

# Retrieve Langsmith API key and other related environment variables
langsmith_api_key = os.environ["LANGSMITH_API_KEY"]
langchain_tracing_v2 = os.environ["LANGCHAIN_TRACING_V2"]
langchain_endpoint = os.environ["LANGCHAIN_ENDPOINT"]
langsmith_project = os.environ["LANGCHAIN_PROJECT"]

# Initialize a Langsmith Client instance
langsmith_client = Client()

# Test section (commented out)
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI()
# llm.invoke("What can you do?")

## PDF Breakdown
If you take a look at the PDFs in the data folder, the catalog PDF alone is over 800 pages long!

To upload it into the vector database, we first have to load it into a list of LangChain documents.

In [61]:
# Import the PyPDFLoader from langchain_community.document_loaders for PDF document loading
from langchain_community.document_loaders import PyPDFLoader

# file paths of PDFs to be used
pdf_paths = ['../Data/RAG Knowledge Base/20232024 Undergraduate Catalog91123.pdf',
             '../Data/RAG Knowledge Base/Official University Calendar 2023-2024.pdf',
             '../Data/RAG Knowledge Base/2023_PerunaPassport.pdf',
             '../Data/RAG Knowledge Base/SMU Student Handbook 23-24.pdf',
             '../Data/RAG Knowledge Base/SMUCampusGuideFactsMap.pdf',
             ]

# Function to load PDFs using LangChain's PyPDFLoader
def load_pdfs_with_langchain(pdf_paths):
    documents = []
    for path in pdf_paths:
        try:
            # Use LangChain's PyPDFLoader to load the PDF
            loader = PyPDFLoader(path)
            # Load and parse the PDF into document instances
            pdf_doc = loader.load()
            # Insert the parsed PDF documents into the documents list
            documents.extend(pdf_doc)
        except Exception as e:
            print(f"Error loading {path}: {e}")
    return documents

# Load PDF documents using the function
pdf_docs = load_pdfs_with_langchain(pdf_paths)

print(len(pdf_docs))
print(pdf_docs[0].page_content[0:100])
print(pdf_docs[0].metadata)

1015
 
 
 
 
 
 
Southern Methodist University 
General Information 
Undergraduate Catalog  
2023 -2024  
{'source': '../Data/RAG Knowledge Base/20232024 Undergraduate Catalog91123.pdf', 'page': 0}
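Because the loading function above catches exceptions and keeps going, a PDF that fails to load only shows up as a printed message. Counting pages per source is a quick way to confirm every file contributed documents. This is a minimal sketch: `pages_per_source` is a hypothetical helper, and the `SimpleNamespace` objects stand in for LangChain documents (anything with a `.metadata` dict works, so `pdf_docs` can be passed directly).

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(docs):
    """Count how many documents (pages) came from each source file."""
    return Counter(doc.metadata["source"] for doc in docs)


# Stand-in objects with the same metadata shape as the loader output above.
sample_docs = [
    SimpleNamespace(metadata={"source": "catalog.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "catalog.pdf", "page": 1}),
    SimpleNamespace(metadata={"source": "handbook.pdf", "page": 0}),
]

for source, count in pages_per_source(sample_docs).items():
    print(f"{source}: {count} pages")
```

In the notebook itself, `pages_per_source(pdf_docs)` should list all five paths from `pdf_paths`; a missing path suggests that file raised an error during loading.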


## CSV Breakdown
Now we are going to upload the Excel file of 115 FAQs into each vector database collection. The Excel file has multiple sheets, so we will turn each sheet into a CSV using the pandas library and then use LangChain's CSV loader to turn them into LangChain documents.

The CSVs currently in the data folder were created by iterating through the xlsx file, so you can delete them if you want to see them being recreated (be careful not to delete the xlsx file).

In [62]:
# Import the pandas library to work with the Excel file and convert it to a data frame
import pandas as pd

# Load the Excel file
excel_path = '../Data/RAG Knowledge Base/SMU FAQs.xlsx'
xlsx = pd.ExcelFile(excel_path)

# checking to see if loading the file worked
print(xlsx.sheet_names)

# Iterate through each sheet and save as a CSV file
csv_files = []
for sheet_name in xlsx.sheet_names:
    # Read the entire sheet to extract the metadata from cell A1
    sheet_df = pd.read_excel(xlsx, sheet_name=sheet_name, header=None)
    
    # Get the link of the webpage to include as metadata
    metadata = sheet_df.iat[0, 0]
    
    # Read the sheet into a DataFrame starting from the second row
    df = pd.read_excel(xlsx, sheet_name=sheet_name, skiprows=1)
    
    # Save the DataFrame to a CSV file
    csv_path = f'../Data/RAG Knowledge Base/{sheet_name}.csv'
    df.to_csv(csv_path, index=False, encoding='utf-8')
    csv_files.append((csv_path, metadata))

# Display the list of generated CSV files and their metadata
print(csv_files)

['University Advising Center FAQs', 'Student Financial Services FAQs', 'Parent FAQs', 'SMU Experience FAQs', 'UG Admissions Academics FAQs']
[('../Data/RAG Knowledge Base/University Advising Center FAQs.csv', 'https://www.smu.edu/provost/saes/academic-support/university-advising-center/frequently-asked-questions'), ('../Data/RAG Knowledge Base/Student Financial Services FAQs.csv', 'https://www.smu.edu/provost/saes/academic-support/student-academic-success/faq'), ('../Data/RAG Knowledge Base/Parent FAQs.csv', 'https://www.smu.edu/provost/saes/academic-support/university-advising-center/incoming-students/for-parents'), ('../Data/RAG Knowledge Base/SMU Experience FAQs.csv', 'https://www.smu.edu/admission/campuslife/faqlivingoncampus'), ('../Data/RAG Knowledge Base/UG Admissions Academics FAQs.csv', 'https://www.smu.edu/admission/academics/faqsacademics')]


In [63]:
# Import CSVLoader from langchain_community.document_loaders to load CSV documents
from langchain_community.document_loaders import CSVLoader

# Create LangChain documents from CSV files with metadata
csv_documents = []

for csv_path, metadata in csv_files:
    loader = CSVLoader(file_path=csv_path, encoding='utf-8')
    csv_docs = loader.load()
    for csv_doc in csv_docs:
        csv_doc.metadata['source'] = metadata
    csv_documents.extend(csv_docs)

# Display the first document as an example
print(csv_documents[0])

page_content='question: What's the difference between all-college GPA and SMU GPA?
answer: Your all-college GPA is the GPA used from all your grades at any college or university you’ve attended, including SMU. 
Your SMU GPA is your GPA based only on your SMU grades. Some schools/majors use your all-college GPA or grades in courses you’ve
 taken at another institution to determine if they will admit you to their major.' metadata={'source': 'https://www.smu.edu/provost/saes/academic-support/university-advising-center/frequently-asked-questions', 'row': 0}
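The output above shows the shape each CSV row takes as a document: one `column: value` line per column. The sketch below approximates that formatting with the standard library only, to make clear what text ends up being embedded. It is not the loader's actual implementation, and `rows_to_page_content` and the sample row are hypothetical.

```python
import csv
import io


def rows_to_page_content(csv_text):
    """Turn each CSV row into 'column: value' lines, one string per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


# Hypothetical two-column FAQ sheet with a single row.
sample_csv = "question,answer\nWhere is the library?,See the campus map.\n"

contents = rows_to_page_content(sample_csv)
print(contents[0])
```

Seeing the text this way also explains why the column headers (`question`, `answer`) appear in retrieved results: they are part of the embedded content, not metadata.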


Here we are initializing an API connection to the Qdrant vector database from https://qdrant.tech/

In [8]:
# Import Qdrant client for vector database cloud store
from qdrant_client import qdrant_client
from qdrant_client.http import models

# Initialize Qdrant host URL and API key from environment variables
qdrant_host = os.environ['QDRANT_HOST']
qdrant_api_key = os.environ['QDRANT_API_KEY']

# Initialize Qdrant Client
client = qdrant_client.QdrantClient(
    url=qdrant_host, 
    api_key = qdrant_api_key,
)

Here we set up functions to create a vector store collection and to retrieve it once it has already been created.

In [9]:
# Import OpenAIEmbeddings and Qdrant from respective langchain modules

from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore as Qdrant

# Retrieve OpenAI API key from environment variables
openai_api_key = os.environ['OPENAI_API_KEY']

# Function to create a vector store based on the collection name
def create_vectorstore(qdrant_collection_name):
    
    # Ensuring Qdrant Client connection
    client = qdrant_client.QdrantClient(
        url=qdrant_host, 
        api_key = qdrant_api_key,
    )

    vectors_config = models.VectorParams(
        size=1536, #for OpenAI
        distance=models.Distance.COSINE
   )
    # Create a Qdrant collection with the specified name and vectors configuration
    client.create_collection(
        collection_name = qdrant_collection_name,
        vectors_config=vectors_config,   
    )

    # Initialize the vector store with the created Qdrant collection
    vector_store = Qdrant(
        client=client, 
        collection_name=qdrant_collection_name, 
        embeddings=OpenAIEmbeddings(),
    )
  
    return vector_store

# Function to return the vector store if it already exists

def get_vectorstore(qdrant_collection_name):
    
    # Ensuring Qdrant Client connection
    client = qdrant_client.QdrantClient(
    url=qdrant_host, 
    api_key = qdrant_api_key,
    )

    vector_store = Qdrant(
        client=client, 
        collection_name=qdrant_collection_name, 
        embeddings=OpenAIEmbeddings(),
    )

    return vector_store
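The collection above is configured with cosine distance, which compares the direction of two embedding vectors rather than their length. As a rough illustration of what that metric computes, here is a pure-Python sketch using tiny toy vectors in place of the 1536-dimensional embeddings. The function name is hypothetical and this is not how Qdrant computes it internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The `size=1536` setting has to match the length of the vectors the embedding model produces; a mismatch between the two would be rejected when vectors are uploaded.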

Now we are going to create the first collection of vectors and the vectorstore inside the database. Eventually, we will have more than one collection so we can see how changes to the way the data is uploaded and retrieved affect accuracy and other evaluation metrics.

This is a very important step because an LLM application is only as good as its data and the documents it retrieves to create an answer.

Since we will have more than one collection within the vector database, we will use the functions defined above to create a new vectorstore collection whenever one is needed.

In [18]:
# create 1st collection of vectors
qdrant_collection_1 = os.environ['QDRANT_COLLECTION_1']

# Checking if the collection already exists
collection_check_1 = False

if client.collection_exists(qdrant_collection_1):
    vector_store_1 = get_vectorstore(qdrant_collection_1)
    collection_check_1 = True
    print(qdrant_collection_1 + " already exists")
else:
    vector_store_1 = create_vectorstore(qdrant_collection_1)
    print(qdrant_collection_1 + " was just created")

smu-data_1 already exists


This is where the experiment begins! Now that we have all the text from the PDFs in the `pdf_docs` variable, the text needs to be split into chunks using LangChain text splitters so it can be turned into vectors by the OpenAI embeddings model.

We are going to test three methods:
1. Parent Document Retriever method with the **RecursiveCharacterTextSplitter**

2. Semantic chunking method using the **SemanticChunker** (semantic text splitter)

3. Using the vector store as the retriever with the **RecursiveCharacterTextSplitter**

Since these methods generate different types of chunks, we will put them in separate collections within the vector database.
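The character-based splitters in this notebook are configured with a `chunk_size` and a `chunk_overlap`. To make those two numbers concrete, here is a simplified fixed-window splitter. It is only a sketch: the real RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, which this hypothetical `split_with_overlap` function ignores.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


for chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last character of one chunk is repeated as the first character of the next. Larger overlaps reduce the chance that a sentence is cut off from its context, at the cost of more chunks to embed.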

In [71]:
# Parent Document Retriever Method
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
# created a custom class in ParentDocumentRetriever that adds the documents to the docstore but not to the vectorstore
from langchain.storage import InMemoryStore


child_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=25, 
                                                length_function=len, add_start_index=True) 
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=750, chunk_overlap=50, 
                                                length_function=len, add_start_index=True)  

# storage for parent splitter
store = InMemoryStore()

# retriever
def create_parent_retriever():
    parent_retriever = ParentDocumentRetriever(
        vectorstore=vector_store_1, 
        docstore=store, 
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
        search_kwargs = {"k": 8}
        )
    return parent_retriever


parent_retriever = create_parent_retriever()
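The idea behind the parent document approach is that small child chunks are what get searched, while the larger parent chunk they came from is what gets returned. The sketch below illustrates that lookup with plain dictionaries. It is a toy model under stated assumptions: substring matching stands in for vector search, the IDs are simple strings rather than UUIDs, and both function names are hypothetical.

```python
def build_parent_child_index(parents, child_size):
    """Store each parent by ID and record which parent every child chunk came from."""
    docstore = {}
    child_index = []
    for number, parent_text in enumerate(parents):
        parent_id = f"parent-{number}"
        docstore[parent_id] = parent_text
        for start in range(0, len(parent_text), child_size):
            child_index.append((parent_text[start:start + child_size], parent_id))
    return docstore, child_index


def retrieve_parents(query, docstore, child_index):
    """Match the query against child chunks, then return the de-duplicated parents."""
    matched_ids = []
    for child_text, parent_id in child_index:
        if query.lower() in child_text.lower() and parent_id not in matched_ids:
            matched_ids.append(parent_id)
    return [docstore[parent_id] for parent_id in matched_ids]


parents = [
    "Honors Program entry is by invitation.",
    "Lyle School of Engineering facilities.",
]
docstore, child_index = build_parent_child_index(parents, child_size=20)
print(retrieve_parents("invitation", docstore, child_index))
```

The match happens on a 20-character child, but the full parent text comes back. In the cell above, the vector store plays the role of `child_index` and the `InMemoryStore` plays the role of `docstore`.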

In [20]:
# Add documents only if the collection is empty, whether it was just created or already existed
if client.get_collection(qdrant_collection_1).vectors_count is None:
    # Add documents to the Qdrant vector database and parent store
    parent_retriever.add_documents(pdf_docs)
    parent_retriever.add_documents(csv_documents)
    print("PDF docs and CSV docs added to doc store and vectorstore")

# testing the retriever
parent_retriever.invoke("What is SMU")

PDF docs and CSV docs added to doc store and vectorstore


[Document(metadata={'source': '../Data/RAG Knowledge Base/20232024 Undergraduate Catalog91123.pdf', 'page': 80, 'start_index': 3259, 'doc_id': '7ac85156-5906-4e02-9517-6c2564a3d313'}, page_content='fellow  students in nonhonors classes; in the Residential Commons; in the student center; on the playing fields; and \nin the numerous student governing, social, pre -professional, political, cultural and social organizations that enhance \nstudent life at SMU.  \nEntrance to the University Honors Program is by invitation prior to matriculation or by application after at least one term of coursework at SMU. At the end of their undergraduate years, students who maintain a 3.000 GPA in their \nhonors courses and at leas t a 3.300 overall GPA receive a diploma inscribed with the designation "Honors in the \nLiberal Arts." More information about the University Honors Program is available on the website'),
 Document(metadata={'source': '../Data/RAG Knowledge Base/20232024 Undergraduate Catalog911

In [None]:
parent_retriever.invoke("Lyle School of Engineering")

[Document(metadata={'source': '../Data/RAG Knowledge Base/20232024 Undergraduate Catalog91123.pdf', 'page': 489, 'start_index': 704, 'doc_id': '28e7a512-2efa-473d-af0a-fa7c2e473670'}, page_content="and computer engineering, environmental engineering, mechanical engineering and management science. The \nDallas area's national prominence i n high technology and research has been beneficial to the Lyle School of \nEngineering and its students. Corporate support for the Lyle School has generated a remarkable array of equipment and laboratories.  \nIn addition, the Lyle School is the home of the following facilities:  \nResearch Center for Advanced Manufacturing (RCAM)  lies at the interface among science, engineering and \nindustrial practice. The center provides North Texas companies applied research and development opportunities for material processing and testing."),
 Document(metadata={'source': '../Data/RAG Knowledge Base/20232024 Undergraduate Catalog91123.pdf', 'page': 490, 'start_i

In [64]:
list(store.yield_keys())
store.mget(list(store.yield_keys()))

[Document(metadata={'source': '../Data/RAG Knowledge Base/20232024 Undergraduate Catalog91123.pdf', 'page': 0, 'start_index': 12, 'doc_id': 'f495443c-1e57-4de8-b2ac-eb45812822c5'}, page_content='Southern Methodist University \nGeneral Information \nUndergraduate Catalog  \n2023 -2024'),
 Document(metadata={'source': '../Data/RAG Knowledge Base/20232024 Undergraduate Catalog91123.pdf', 'page': 1, 'start_index': 0, 'doc_id': '3db36a4a-1fe3-4d6c-9970-84a664e3dd4d'}, page_content='2 \n Catalog Policy and Legal Statement  \nBulletin of Southern Methodist University 2023- 2024 Vol. CVII  \nSouthern Methodist University publishes a complete bulletin every year. The following catalogs constitute the \nGeneral Bulletin of the University:  \n• Undergraduate Catalog*  \no Cox School of Business  \no Dedman College of Humanities and Sciences  \no Lyle School of Engineering  \no Meadows School of the Arts  \no Simmons School of Education and Human Development  \n• Graduate Catalog*  \no Cox School 

In [65]:
in_memory_store_docs = store.mget(list(store.yield_keys()))
document_ids = list(store.yield_keys())

In [66]:
parent_retriever_docs_dict = {}
for doc_id in document_ids:
   pr_doc = store.mget([doc_id])
   parent_retriever_docs_dict[doc_id] = pr_doc

In [67]:
for doc_id, pr_docs in parent_retriever_docs_dict.items():
    for pr_doc in pr_docs:
        print(f"Document ID: {doc_id}")
        print(f"Document Content: {pr_doc.page_content}")
        print(f"Document Metadata: {pr_doc.metadata}")
        print()

Document ID: f495443c-1e57-4de8-b2ac-eb45812822c5
Document Content: Southern Methodist University 
General Information 
Undergraduate Catalog  
2023 -2024
Document Metadata: {'source': '../Data/RAG Knowledge Base/20232024 Undergraduate Catalog91123.pdf', 'page': 0, 'start_index': 12, 'doc_id': 'f495443c-1e57-4de8-b2ac-eb45812822c5'}

Document ID: 3db36a4a-1fe3-4d6c-9970-84a664e3dd4d
Document Content: 2 
 Catalog Policy and Legal Statement  
Bulletin of Southern Methodist University 2023- 2024 Vol. CVII  
Southern Methodist University publishes a complete bulletin every year. The following catalogs constitute the 
General Bulletin of the University:  
• Undergraduate Catalog*  
o Cox School of Business  
o Dedman College of Humanities and Sciences  
o Lyle School of Engineering  
o Meadows School of the Arts  
o Simmons School of Education and Human Development  
• Graduate Catalog*  
o Cox School of Business  
o Dedman College of Humanities and Sciences  
o Dedman School of Law  
o Lyl

Now for the second method, which uses the semantic text splitter (SemanticChunker). It splits the text based on the meaning of each sentence, giving more granular control over retrieval.

For this one, we will use the Ensemble Retriever which allows us to combine the results of multiple retrievers, giving them different weights. Within the Ensemble retriever we will use: 

- BM25 Retriever, which is a keyword-based retrieval method used by search engines
- Base retriever that comes with using the vectorstore as a retriever


In [68]:
# semantic text splitting method
# Import SemanticChunker from langchain_experimental.text_splitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Initialize the semantic text splitter with OpenAI embeddings
semantic_text_splitter = SemanticChunker(
    OpenAIEmbeddings(), 
    breakpoint_threshold_type="percentile")
# Split documents using the semantic text splitter
semantic_docs = semantic_text_splitter.split_documents(pdf_docs)

print(semantic_docs[0].page_content)
print(len(semantic_docs))

 
 
 
 
 
 
Southern Methodist University 
General Information 
Undergraduate Catalog  
2023 -2024   
2444
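The `percentile` breakpoint type used above can be pictured as follows: measure how far apart each pair of neighboring sentences is in embedding space, then start a new chunk wherever that distance is unusually large. The sketch below is a simplified illustration with toy 2-D vectors in place of real embeddings. It is not the library's implementation, all function names are hypothetical, and the default of 95 is an assumption.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def split_by_percentile(sentences, vectors, pct=95):
    """Start a new chunk wherever the neighbor distance exceeds the pct-th percentile."""
    if not sentences:
        return []
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    if not distances:
        return [sentences[0]]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["A.", "B.", "C.", "D."]
toy_vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
print(split_by_percentile(sentences, toy_vectors, pct=95))
```

In the toy data the first two vectors point one way and the last two point another, so the only large jump is between "B." and "C.", and that is where the split lands. Unlike fixed-size splitting, chunk lengths vary with the content.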


In [None]:
# Create another instance of a vector store with a new collection using the function created earlier
qdrant_collection_2 = os.environ['QDRANT_COLLECTION_2']

# Check if the second collection already exists
collection_check_2 = False

# creating the second vector store and retriever
if client.collection_exists(qdrant_collection_2):
    vector_store_2 = get_vectorstore(qdrant_collection_2)
    print(qdrant_collection_2 + " already exists")
    collection_check_2 = True
else:
    vector_store_2 = create_vectorstore(qdrant_collection_2)
    print(qdrant_collection_2 + " was just created")

smu-data_2 already exists


In [None]:
vector_store_2_retriever = vector_store_2.as_retriever(search_type="similarity_score_threshold",
                                                            search_kwargs = {"k": 8, "score_threshold" : 0.75})

# Add documents to the second vector store if necessary
if collection_check_2 == False:
        vector_store_2_retriever.add_documents(semantic_docs) # adding the semantically split docs into the vector store if not there already
        vector_store_2_retriever.add_documents(csv_documents) # adding csv docs to vectorstore 
elif collection_check_2 == True:
    if client.get_collection(qdrant_collection_2).vectors_count == None:
      vector_store_2_retriever.add_documents(semantic_docs) # adding the semantically split docs into the vector store if not there already
      vector_store_2_retriever.add_documents(csv_documents) # adding csv docs to vectorstore
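# Illustration only: a plain-Python sketch of what the "k" and "score_threshold"
# search settings above ask for. Results scoring below the threshold are dropped,
# then at most k of the best remaining results are kept. The helper name and the
# sample scores are hypothetical; this is not how the vector database implements it.
def top_k_above_threshold(scored_docs, k, score_threshold):
    """Keep (doc, score) pairs with score >= score_threshold, best first, at most k."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


sample_scores = [("doc_a", 0.91), ("doc_b", 0.62), ("doc_c", 0.78)]
print(top_k_above_threshold(sample_scores, k=8, score_threshold=0.75))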

In [None]:
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Initialize BM25 retriever from combined semantic and CSV documents
bm25_retriever = BM25Retriever.from_documents(semantic_docs+csv_documents)

# Initialize the ensemble retriever with BM25 and vector store retrievers
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_store_2_retriever], 
    weights=[0.5, 0.5]
)

# Test the ensemble retriever
ensemble_retriever.invoke("How many credit hours should I take my first year?")

[Document(metadata={'page': 25, 'source': 'C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/2023_PerunaPassport.pdf', '_id': 'e9ec59aa-9040-4ca0-a9ab-a023e91a877d', '_collection_name': 'smu-data_2'}, page_content='Credit Hours  Each course is assigned a certain number of credit hours. Most are three credits. You can deter-\nmine the number of credit hours a particular course is worth by referencing the second digit in the course num -\nber. For example, WRTR 1312 is a three credit course and HDEV 1210 is a two credit course. Full-time status  You will need to be enrolled in 12 credits to be considered a full-time student. Most students \nenroll in an average of 15 credits per term. Section  Some courses have multiple sections – the same topic is being taught at different times or by different  \nfaculty members.'),
 Document(metadata={'row': 25, 'source': 'https://www.smu.edu/provost/saes/academic-support/university-advising-center/frequently-asked-questi

We are also going to have a base option that splits the text using just the RecursiveCharacterTextSplitter, like the [original repo](https://github.com/yawbtng/SMUChatBot_Project/blob/main/main.py) does. From there, we create a third vector store collection to upload this text into, using the vector store as the retriever.

In [69]:
# Initialize a base text splitter for normal splitting of documents
from langchain_text_splitters import RecursiveCharacterTextSplitter
base_text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50, 
                                                length_function=len, add_start_index=True)  

# Split documents using the base text splitter
normal_split_docs = base_text_splitter.split_documents(pdf_docs)

# Check and print the result of the normal splitting
print(normal_split_docs[0].page_content)
print(len(normal_split_docs))

Southern Methodist University 
General Information 
Undergraduate Catalog  
2023 -2024
7323


In [12]:
# getting the collection name of the third vector store
qdrant_collection_0 = os.environ['QDRANT_COLLECTION_0']

# Check if the third collection already exists
collection_check_0 = False

# creating the third vector store and retriever
if client.collection_exists(qdrant_collection_0):
    vector_store_0 = get_vectorstore(qdrant_collection_0)
    collection_check_0 = True
    print(qdrant_collection_0 + " is already there")
else:
    vector_store_0 = create_vectorstore(qdrant_collection_0)
    print(qdrant_collection_0 + " was just created")

smu-data_0 was just created


In [13]:
# Initialize the retriever for the third vector store
vector_store_0_retriever = vector_store_0.as_retriever(search_kwargs = {"k": 8, "score_threshold" : 0.75})

# Add documents to the third vector store if necessary
if collection_check_0 == False:
    vector_store_0_retriever.add_documents(normal_split_docs) # adding the normally split docs into the vector store if not there already
    vector_store_0_retriever.add_documents(csv_documents) # adding csv docs to vectorstore
elif collection_check_0 == True:
    if client.get_collection(qdrant_collection_0).vectors_count == None:
        # Use the third collection's retriever here, not vector_store_2_retriever
        vector_store_0_retriever.add_documents(normal_split_docs) # adding the normally split docs into the vector store if not there already
        vector_store_0_retriever.add_documents(csv_documents) # adding csv docs to vectorstore

# Test the third vector store retriever
vector_store_0_retriever.invoke("Can I do study abroad?")

[Document(metadata={'row': 10, 'source': 'https://www.smu.edu/admission/academics/faqsacademics', '_id': 'efb28245-60f4-49e3-8b29-3a752cf7f9e2', '_collection_name': 'smu-data_0'}, page_content='question: Will I have the opportunity to study abroad?\nanswer: Yes. SMU offers summer, semester and year-long opportunities to live, study and travel in foreign countries.\nSMU Abroad\xa0offers 148 programs in 50 countries.'),
 Document(metadata={'page': 241, 'source': 'C:/Users/yawbt/OneDrive/Documents/GitHub/SURF-Project_Optimizing-PerunaBot/Data/20232024 Undergraduate Catalog91123.pdf', 'start_index': 935, '_id': '6d68a960-0610-46be-93ad-00a08bd4bdd2', '_collection_name': 'smu-data_0'}, page_content='To maximize the educational experience in these degree programs, all international studies majors are strongly \nencouraged to spend at least one term or summer studying abroad. The University offers numerous study abroad \nopportunities around the world; most  of these courses may be applied to

The last thing we need to do is create functions that return the objects we will need in other notebooks or Python scripts. However, this method can be inefficient because the notebooks we will use later are in sibling folders, so we would have to run this whole notebook again. Instead, we will use another method shown below.

In [None]:
def get_all_langchain_docs():
  return {
    "pdf_docs": pdf_docs,
    "csv_docs": csv_documents
  }

def get_all_vectorstores():
  return {
      "vector_store_0": vector_store_0, # collection smu-data_0
      "vector_store_1": vector_store_1, # collection smu-data_1
      "vector_store_2": vector_store_2, # collection smu-data_2
  }

def get_all_retrievers():
  return {
      "vector_store_0_retriever": vector_store_0_retriever, # collection smu-data_0
      "parent_retriever": parent_retriever, # collection smu-data_1
      "ensemble_retriever": ensemble_retriever, # collection smu-data_2
  }

Here is the second method. We are going to serialize the objects that can be serialized so we can use them later on. The other objects above that are not included here are **unserializable** (trust me, I tried to get it to work...for 4 days straight), so we will just have to recreate them in the code.

In [70]:
# Collecting needed langchain objects into a dictionary
all_data = {
    'pdf_docs': pdf_docs,
    'csv_docs': csv_documents,
    'semantic_docs': semantic_docs,
    'normal_split_docs': normal_split_docs,
    'parent_retriever_memory_store': store,
    'parent_retriever_document_ids': document_ids,
    'parent_retriever_in_memory_store_docs': in_memory_store_docs,
    "parent_retriever_docs_dict": parent_retriever_docs_dict,
}

import shelve

# Serialize the LangChain documents to a shelve database file (pickle-based, not JSON)
with shelve.open("data_preprocessing_langchain_docs.db") as db:
    for key, value in all_data.items():
        db[key] = value
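Reading the objects back in another notebook mirrors the write step. The sketch below is a self-contained round trip using a temporary directory and a plain list in place of the real documents. The helper names `save_objects` and `load_objects` and the file name `demo_docs.db` are hypothetical.

```python
import os
import shelve
import tempfile


def save_objects(path, objects):
    """Write each key/value pair into a shelve database at path."""
    with shelve.open(path) as db:
        for key, value in objects.items():
            db[key] = value


def load_objects(path, keys):
    """Read the requested keys back from a shelve database opened read-only."""
    with shelve.open(path, flag="r") as db:
        return {key: db[key] for key in keys}


with tempfile.TemporaryDirectory() as tmp_dir:
    shelf_path = os.path.join(tmp_dir, "demo_docs.db")
    save_objects(shelf_path, {"csv_docs": ["row 0", "row 1"]})
    restored = load_objects(shelf_path, ["csv_docs"])

print(restored)
```

Because shelve stores values with pickle, the notebook that loads the file needs the same classes importable (for example, the LangChain packages installed) for the documents to be restored. For the same reason, only open shelve files that you created yourself.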