# **Demo: LangChain Loader, Splitter, Embeddings, and VectorStore**

# __Description:__
In this activity, you will use LangChain’s loaders, splitters, embeddings, and vector stores to load documents, split them into chunks, embed the chunks, and search them by similarity.
The two files used in this demo are practical examples of the real-world data you might encounter in natural language processing tasks:

*   The **state_of_union.txt** file, which contains transcripts of the United States’ State of the Union Addresses, represents a large text document that can be loaded and processed.

*   The **michael_resume.pdf** file, an open source resume, represents a common type of document that one might analyze for tasks such as resume screening or information extraction.




# **Steps to Perform:**


1.   Import the Necessary Modules
2.   Load Text Data from a File Using TextLoader
3.   Load PDFs from the Internet Using PyPDFLoader
4.   Split the Documents Using RecursiveCharacterTextSplitter
5.   Embed the Documents Using HuggingFaceEmbeddings and Print the Length of the Embedding
6.   Embed the Documents Using OpenAIEmbeddings and Print the Length of the Embedding
7.   Create a FAISS Instance
8.   Perform a Similarity Search on the FAISS Instance
9.   Persist the FAISS Instance
10.  Load the Persisted FAISS Instance




# **Step 1: Import the Necessary Modules**







In [1]:
!pip install pysqlite3
!pip install pysqlite3-binary
!pip install pypdf



In [2]:
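# Loaders, text splitter, embedding wrappers, and the FAISS vector store
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
from langchain.vectorstores import FAISS
import faiss  # Fails early if the FAISS library itself is not installed
import pysqlite3
import sys

# Make later `import sqlite3` statements resolve to pysqlite3 instead of the standard-library module
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")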

# **Step 2: Load Text Data from a File Using TextLoader**



*   Print the first 100 characters from the loaded text.



In [6]:
text_loader = TextLoader("state_of_union.txt")
text_document = text_loader.load()
print(text_document[0].page_content[:100])  # Prints the first 100 characters of the text document

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th
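
To make the shape of the loader's return value concrete, here is a minimal sketch in plain Python. `SimpleDocument` and `load_text` are hypothetical stand-ins, not LangChain classes; they only mimic the idea that loading a text file yields a list containing one document with `page_content` and a `metadata` dictionary that records the source path.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> List[SimpleDocument]:
    """Read a whole text file into a one-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Use a throwaway file so the sketch runs without state_of_union.txt
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text("Madam Speaker, Madam Vice President", encoding="utf-8")
    docs = load_text(sample_path)

print(len(docs))                   # One document for the whole file
print(docs[0].page_content[:13])   # Slice the text just like in the cell above
```

This is why the cell above indexes `text_document[0]` before slicing `page_content`.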


# **Step 3: Load PDFs from the Internet Using PyPDFLoader**






In [7]:
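# PyPDFLoader was already imported in Step 1
# Here the PDF is read from a local copy in the working directory
pdf_loader = PyPDFLoader("michael_resume.pdf")
# load_and_split() reads the PDF page by page and passes the pages through LangChain's default text splitter
pdf_pages = pdf_loader.load_and_split()
print(pdf_pages[0].page_content[:100])  # Prints the first 100 characters of the first returned document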


CURRICULUM VITAE :  
M ichael M . Scott OBE, B.Sc., Dip.Ed  
 
Home address :  Strome House     Date


# **Step 4: Split the Documents Using RecursiveCharacterTextSplitter**


*   Split the PDF pages into smaller chunks and print the number of chunks.



In [8]:
doc_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
split_texts = doc_splitter.split_documents(pdf_pages)
print(len(split_texts))  # Prints the number of chunks the PDF has been split into


15
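
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a much simpler splitter than the one used above. The sketch below is a fixed-width sliding window written in plain Python; it is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks), and the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so context that
    straddles a boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With a window of 10 and an overlap of 3, each chunk starts with the last three characters of the previous one. The same idea, applied with 1024 and 64, keeps sentences near a chunk boundary visible to both neighbouring chunks.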


# **Step 5: Embed the Documents Using HuggingFaceEmbeddings and Print the Length of the Embedding**






In [9]:
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
hf_embed = HuggingFaceEmbeddings(model_name=MODEL_NAME)
text = split_texts[0].page_content
hf_embed_result = hf_embed.embed_documents([text])
print(len(hf_embed_result[0]))  # Prints the dimensionality of the embedding vector for the first document

768
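
The number printed above is the length of one embedding vector. Vectors like this are useful because texts with similar meaning tend to map to nearby vectors. As a rough illustration, the following sketch computes cosine similarity in plain Python on tiny made-up vectors; the three-element lists are toy values, not real model output, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (real ones here have 768 entries)
toy_a = [1.0, 0.0, 1.0]
toy_b = [2.0, 0.0, 2.0]   # Same direction as toy_a, different magnitude
toy_c = [0.0, 1.0, 0.0]   # Orthogonal to toy_a

print(cosine_similarity(toy_a, toy_b))  # Close to 1: same direction
print(cosine_similarity(toy_a, toy_c))  # 0: no shared direction
```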


# **Step 6: Embed the Documents Using OpenAIEmbeddings and Print the Length of the Embedding**




In [10]:
openai_embed = OpenAIEmbeddings()
openai_embed_result = openai_embed.embed_documents([text])
print(len(openai_embed_result[0]))  # Prints the dimensionality of the embedding vector for the first document


1536
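
The two models produce vectors of different lengths (768 and 1536), so vectors from one model cannot be compared with, or searched against, vectors from the other. The sketch below shows the kind of guard one might add before querying an index; `assert_same_dimension` is a hypothetical helper, not a LangChain or FAISS function.

```python
def assert_same_dimension(index_dim: int, query_vector) -> None:
    """Raise ValueError if the query vector does not match the index dimension."""
    if len(query_vector) != index_dim:
        raise ValueError(
            f"query has {len(query_vector)} dimensions, index expects {index_dim}"
        )


HF_DIM = 768       # Length printed in Step 5
OPENAI_DIM = 1536  # Length printed in Step 6

query = [0.0] * HF_DIM  # Placeholder vector with the Hugging Face length

try:
    assert_same_dimension(OPENAI_DIM, query)
    mismatch_detected = False
except ValueError as exc:
    mismatch_detected = True
    print(exc)
```

In practice this means the embedding object used to build a vector store should also be the one used to query it.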


# **Step 7: Create a FAISS Instance**

*   Create a FAISS instance using the split texts and the OpenAIEmbeddings.

In [11]:
# Create FAISS instance from documents and embeddings
faiss_index = FAISS.from_documents(split_texts, openai_embed)


# **Step 8: Perform a Similarity Search on the FAISS Instance**


*   Print the top two most similar documents.

In [12]:
# Perform a similarity search and print the top two most similar documents
search_result = faiss_index.similarity_search_with_score("What is the candidate's skill sets?", k=2)
print(search_result)  # Prints the top 2 most similar documents to the query


[(Document(id='dbc62b0f-22cd-4283-a0a7-9b3fbf32dbbe', metadata={'producer': 'BCL easyPDF 2.00.030', 'creator': 'PyPDF', 'creationdate': '', 'source': 'michael_resume.pdf', 'total_pages': 4, 'page': 3, 'page_label': '4'}, page_content='spring 2005, I went fully digital, and all photographs can be supplied in electronic format. \n \nComputer knowledge  \nI am reasonably fluent in basic PC computer skills, using Windows XP, Word, WordPro, Excel, PowerPoint, \nAdobe Photoshop Elements, e-mail, internet etc.  I have full computer and broadband facilities at home. \n \nOther interests  \nBotanising (especially mountain flowers), travel, walking, Scottish islands, gardening, photography, \ncomputers, rugby supporter, cinema, good wine, Runrig concerts (!). \n \n[updated, 26.03.08]'), 0.45549238), (Document(id='50bccc31-eaba-499d-96fc-67f89efbeecc', metadata={'producer': 'BCL easyPDF 2.00.030', 'creator': 'PyPDF', 'creationdate': '', 'source': 'michael_resume.pdf', 'total_pages': 4, 'page': 0,
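
Each result above is a `(Document, score)` pair. Assuming the store uses an L2 (Euclidean) distance, which is an assumption about the default configuration rather than something shown in this notebook, a lower score means a closer match. The brute-force sketch below illustrates that ranking on toy two-dimensional vectors in plain Python; `squared_l2` and `top_k` are hypothetical helpers, and a real FAISS index performs this search far more efficiently.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k=2):
    """Return the k (index, distance) pairs closest to the query, nearest first."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # Smaller distance means more similar
    return scored[:k]


# Toy vectors standing in for chunk embeddings
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
toy_query = [2.0, 2.0]

results = top_k(toy_query, toy_vectors, k=2)
print(results)
```

Here the vector at index 1 is nearest to the query, followed by the vector at index 2.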

# **Step 9: Persist the FAISS Instance**


*   Save the FAISS instance to a folder in the current working directory.

In [13]:
# Save the FAISS index to a local folder named "faiss_index"
faiss_index.save_local("faiss_index")
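
After saving, it can be worth confirming that the folder holds what the loader will need. The sketch below assumes the default file names `index.faiss` and `index.pkl`; treat those names as an assumption and adjust them if your version of the library writes something different. `missing_index_files` is a hypothetical helper, demonstrated on a temporary folder so it runs without a real index.

```python
import os
import tempfile

# Assumed default file names written by save_local()
EXPECTED_FILES = ("index.faiss", "index.pkl")


def missing_index_files(folder):
    """Return the expected index files that are not present in the folder."""
    present = set(os.listdir(folder)) if os.path.isdir(folder) else set()
    return [name for name in EXPECTED_FILES if name not in present]


# Demonstrate on a temporary folder that contains only one of the two files
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "index.faiss"), "wb").close()
    missing = missing_index_files(tmp)

print(missing)
```

Calling `missing_index_files("faiss_index")` after the cell above should return an empty list if both files were written.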


# **Step 10: Load the Persisted FAISS Instance**




In [14]:
from langchain_community.vectorstores import FAISS

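# Load the persisted FAISS index from the "faiss_index" folder.
# allow_dangerous_deserialization=True permits unpickling the stored data, so only load index files you created or trust.
faiss_index_loaded = FAISS.load_local(
    "faiss_index",
    openai_embed,
    allow_dangerous_deserialization=True
)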

# Perform a similarity search with the loaded FAISS index
vector_search_result = faiss_index_loaded.similarity_search_with_score(
    "What is the candidate's skill sets?", k=2
)
print(vector_search_result)


[(Document(id='dbc62b0f-22cd-4283-a0a7-9b3fbf32dbbe', metadata={'producer': 'BCL easyPDF 2.00.030', 'creator': 'PyPDF', 'creationdate': '', 'source': 'michael_resume.pdf', 'total_pages': 4, 'page': 3, 'page_label': '4'}, page_content='spring 2005, I went fully digital, and all photographs can be supplied in electronic format. \n \nComputer knowledge  \nI am reasonably fluent in basic PC computer skills, using Windows XP, Word, WordPro, Excel, PowerPoint, \nAdobe Photoshop Elements, e-mail, internet etc.  I have full computer and broadband facilities at home. \n \nOther interests  \nBotanising (especially mountain flowers), travel, walking, Scottish islands, gardening, photography, \ncomputers, rugby supporter, cinema, good wine, Runrig concerts (!). \n \n[updated, 26.03.08]'), 0.45549238), (Document(id='50bccc31-eaba-499d-96fc-67f89efbeecc', metadata={'producer': 'BCL easyPDF 2.00.030', 'creator': 'PyPDF', 'creationdate': '', 'source': 'michael_resume.pdf', 'total_pages': 4, 'page': 0,
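
Printing the raw list of tuples is hard to read. A small formatter such as the sketch below can make the results easier to scan. `format_results` is a hypothetical helper; it only assumes that each result is a `(document, score)` pair whose document has `page_content` and `metadata` attributes, and it is demonstrated here on a stand-in object rather than a real search result.

```python
from types import SimpleNamespace


def format_results(results, preview_chars=80):
    """Turn (document, score) pairs into short, one-line summaries."""
    lines = []
    for rank, (doc, score) in enumerate(results, start=1):
        page = doc.metadata.get("page", "?")
        # Collapse newlines and repeated spaces so the preview fits on one line
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{rank}. score={score:.4f} page={page} {preview}")
    return lines


# Stand-in for one search result, using text and score taken from the output above
fake_doc = SimpleNamespace(
    page_content="Computer knowledge  \nI am reasonably fluent",
    metadata={"page": 3},
)
summary = format_results([(fake_doc, 0.45549238)])
for line in summary:
    print(line)
```

Passing `vector_search_result` to `format_results` should give one line per match with its score, page number, and a short preview.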

# **Conclusion**

This activity walked through LangChain’s loaders, splitters, embeddings, and vector stores step by step. You now know how to load documents, split them into manageable chunks, embed the chunks as numerical vectors, and store those vectors in FAISS for efficient similarity search.