# **LangChain Loader, Splitter, and Embeddings**

# __Description:__
In this activity, you will use LangChain’s loaders, splitters, and embeddings to prepare documents for retrieval.
The two files used in the tutorial are practical examples of the real-world data you might encounter in natural language processing tasks. They are:

*   The **state_of_union.txt** file, which contains transcripts of the United States’ State of the Union Addresses, represents a large text document that can be loaded and processed.

*   The **michael_resume.pdf** file, an open-source resume, represents a common type of document that one might analyze for tasks such as resume screening or information extraction.




# **Steps to Perform:**


1.   Import the Necessary Modules
2.   Load Text Data from a File Using TextLoader
3.   Load PDFs from the Internet Using PyPDFLoader
4.   Split the Documents Using RecursiveCharacterTextSplitter
5.   Embed the Documents Using HuggingFaceEmbeddings and Print the Length of the Embedding
6.   Embed the Documents Using OpenAIEmbeddings and Print the Length of the Embedding
7.   Create a FAISS Instance
8.   Perform a Similarity Search on the FAISS Instance






# Demo: LangChain Document Loading → Splitting → Embedding → FAISS Search

## Overview
This demo walks through the data preparation side of a Retrieval-Augmented Generation (RAG) pipeline using **LangChain** with its current split-package imports (langchain_community, langchain_text_splitters, langchain_openai), **OpenAI embeddings**, and a **FAISS** vector store.
It shows how to load text and PDF files, split them into manageable chunks, embed the chunks as vectors, store the vectors in a searchable index, and run both standard and diversified (MMR) similarity searches.

---

## Step-by-Step Process

### **Step 1: Install Required Packages**
Install the core LangChain libraries, the OpenAI integration, the FAISS vector store, and supporting tools.
```bash
pip install -U langchain langchain-community langchain-text-splitters \
    langchain-openai faiss-cpu pypdf tiktoken chromadb
```


In a notebook, the same installation can be run from a cell:

In [1]:
!python -m pip install -U langchain langchain-community langchain-text-splitters \
    langchain-openai faiss-cpu pypdf tiktoken chromadb



### **Step 2: Configuration**
*   Set USE_OPENAI = True to use OpenAI embeddings (the default for this class).
*   Set your OPENAI_API_KEY (a project-scoped key is enough for the demo).
*   Define the paths to the .txt and .pdf files and, optionally, a directory for persistent Chroma storage.

In [None]:
# Step 2: configuration

# OpenAI embeddings by default for the class demo.
# Note: the later cells in this demo do not branch on this flag.
USE_OPENAI = True

# Paste your project-scoped OpenAI key here (do not commit a real key)
OPENAI_API_KEY = "YOUR_OPENAI_KEY"

# Demo file locations
DATA_DIR   = "./data"
TEXT_PATH  = f"{DATA_DIR}/state_of_union.txt"
PDF_PATH   = f"{DATA_DIR}/michael_resume.pdf"
CHROMA_DIR = "./chroma_demo_store"
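
As an alternative to pasting the key into the notebook, the key can be read from the environment. The sketch below is only an illustration: the helper name resolve_api_key is hypothetical, and it assumes the key is exported as OPENAI_API_KEY before the notebook starts.

```python
import os


def resolve_api_key(env_var="OPENAI_API_KEY", placeholder="YOUR_OPENAI_KEY"):
    """Return the key from the environment (hypothetical helper, not part of LangChain).

    Raises RuntimeError if the variable is unset, empty, or still the placeholder.
    """
    key = os.environ.get(env_var, "").strip()
    if not key or key == placeholder:
        raise RuntimeError(
            f"Set {env_var} in the environment before running the embedding cells."
        )
    return key
```

With this in place, the configuration cell could assign OPENAI_API_KEY = resolve_api_key() so that a missing key fails early instead of at the first embedding call.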


In [5]:
from pathlib import Path

Path(DATA_DIR).mkdir(parents=True, exist_ok=True)

if not Path(TEXT_PATH).exists():
    sample_text = """
    The United States is a nation of possibilities. We will build, innovate, and lead.
    This placeholder exists so the demo runs even without a real file.
    """.strip()
    with open(TEXT_PATH, "w", encoding="utf-8") as f:
        f.write(sample_text)

print("Text file created at:", TEXT_PATH)


Text file created at: ./data/state_of_union.txt


### **Step 3: Load Text Documents**
*   Use TextLoader from langchain_community.document_loaders to load a .txt file.
*   Preview the first part of the document to confirm that loading worked.

In [None]:
# Step 3 

from langchain_community.document_loaders import TextLoader

# Load the text file into a Document list
text_loader = TextLoader(TEXT_PATH, encoding="utf-8")
text_docs = text_loader.load()

# Show the first 200 characters of the first document
print(f"Loaded {len(text_docs)} document(s)")
print(text_docs[0].page_content[:200])


Loaded 1 document(s)
The United States is a nation of possibilities. We will build, innovate, and lead.
    This placeholder exists so the demo runs even without a real file.


### **Step 4: Load PDF Documents (Optional)**
*   Use PyPDFLoader from langchain_community.document_loaders to load a PDF as one Document per page.
*   Preview a sample page.
*   If no PDF is present, skip this step; the demo still works with text-only data.

In [8]:
from langchain_community.document_loaders import PyPDFLoader
from pathlib import Path

pdf_docs = []
if Path(PDF_PATH).exists():
    pdf_loader = PyPDFLoader(PDF_PATH)
    # load() returns one Document per page; chunking happens in Step 5
    pdf_docs = pdf_loader.load()
    print(f"Loaded {len(pdf_docs)} page(s) from PDF")
    print(pdf_docs[0].page_content[:200])
else:
    print("No PDF found at", PDF_PATH)



No PDF found at ./data/michael_resume.pdf


### **Step 5: Combine and Split Documents**
*   Combine all text and PDF Document objects into one list.
*   Use RecursiveCharacterTextSplitter from langchain_text_splitters to break the content into smaller chunks, and print the number of chunks.

Parameters:

*   chunk_size: 800 to 1024 characters
*   chunk_overlap: 64 to 120 characters

Overlap helps preserve context across chunk boundaries.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Combine all available documents (text + pdf)
all_docs = text_docs + pdf_docs

# Initialize splitter (adjust chunk size/overlap as needed)
doc_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)

# Perform the split
split_texts = doc_splitter.split_documents(all_docs)

print(f"Source docs: {len(all_docs)}")
print(f"Chunks: {len(split_texts)}")
print("Sample chunk:\n", split_texts[0].page_content[:300])



Source docs: 1
Chunks: 1
Sample chunk:
 The United States is a nation of possibilities. We will build, innovate, and lead.
    This placeholder exists so the demo runs even without a real file.
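
To make the interaction between chunk_size and chunk_overlap concrete, here is a simplified sketch. It is a fixed-width sliding window, not the separator-aware logic of RecursiveCharacterTextSplitter, and the function name split_with_overlap is made up for this illustration.

```python
def split_with_overlap(text, chunk_size=1024, chunk_overlap=64):
    """Fixed-width sliding window over characters (simplified illustration).

    Each chunk starts (chunk_size - chunk_overlap) characters after the previous
    one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 2500 characters of dummy text: digits 0-9 repeated
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample, chunk_size=1024, chunk_overlap=64)
print([len(c) for c in chunks])            # prints [1024, 1024, 580]
print(chunks[0][-64:] == chunks[1][:64])   # prints True: the shared overlap
```

A larger overlap repeats more text across chunks, which raises the number of vectors to embed but makes it less likely that a sentence is cut off from its context at a chunk boundary.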

### **Step 6: Embed the Chunks**
*   Use OpenAIEmbeddings(model="text-embedding-3-small") for compact, cost-effective embeddings.
*   Embed the split text chunks into numerical vectors.
*   Test with a single chunk to confirm the vector length (1536 dimensions for this model, as the outputs in this step show).

The first cell embeds one chunk with embed_query, which takes a single string and returns a single vector.

In [10]:
from langchain_openai import OpenAIEmbeddings
import os

# Ensure API key is set
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# Initialize OpenAI embeddings
embedder = OpenAIEmbeddings(model="text-embedding-3-small")

# Test embedding size with first chunk
sample_text = split_texts[0].page_content
embed_result = embedder.embed_query(sample_text)

print("Embedding vector length:", len(embed_result))
print("First 10 values:", embed_result[:10])


Embedding vector length: 1536
First 10 values: [0.008483029901981354, 0.027606286108493805, 0.022286172956228256, 0.05445463955402374, -0.014859877526760101, -0.058594122529029846, -0.008978602476418018, 0.0401996485888958, 0.018642259761691093, 0.016324730589985847]


The second cell repeats the check with embed_documents, which takes a list of strings and returns one vector per input.

In [11]:
from langchain_openai import OpenAIEmbeddings
import os

# Ensure API key is set
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# Initialize OpenAI embeddings with explicit model
openai_embed = OpenAIEmbeddings(model="text-embedding-3-small")

# Take the first chunk’s text for testing
sample_text = split_texts[0].page_content

# Embed (returns a list of vectors, one per input string)
openai_embed_result = openai_embed.embed_documents([sample_text])

print("Embedding vector length:", len(openai_embed_result[0]))
print("First 10 values:", openai_embed_result[0][:10])



Embedding vector length: 1536
First 10 values: [0.008483029901981354, 0.027606286108493805, 0.022286172956228256, 0.05445463955402374, -0.014859877526760101, -0.058594122529029846, -0.008978602476418018, 0.0401996485888958, 0.018642259761691093, 0.016324730589985847]
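
For running the rest of the pipeline without an API key, a stand-in embedder can be handy. The class below is a hypothetical toy, not a LangChain class and not a semantic model: it hashes words into a fixed number of buckets, and it only mirrors the two method names used above (embed_query and embed_documents).

```python
import hashlib
import math


class ToyHashEmbeddings:
    """Toy stand-in embedder (hypothetical): hashed bag-of-words, L2-normalized."""

    def __init__(self, dim=64):
        self.dim = dim

    def embed_query(self, text):
        # One string in, one vector out
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vec[bucket] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec

    def embed_documents(self, texts):
        # List of strings in, list of vectors out (one per input)
        return [self.embed_query(t) for t in texts]


toy = ToyHashEmbeddings(dim=64)
single = toy.embed_query("We will build, innovate, and lead.")
batch = toy.embed_documents(["We will build.", "We will lead."])
print(len(single), len(batch), len(batch[0]))  # prints 64 2 64
```

The point of the sketch is the shape of the results: a query call yields one vector, a documents call yields a list of vectors, and every vector has the same length regardless of the input text.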


### **Step 7: Create a FAISS Vector Store**
*   Create a FAISS instance from the split texts and the OpenAIEmbeddings with FAISS.from_documents(split_texts, openai_embed), which stores the vectors in memory.
*   FAISS is fast and well suited to demos or temporary retrieval indexes.

In [12]:
from langchain_community.vectorstores import FAISS

# Create FAISS instance from split chunks and OpenAI embeddings
faiss_store = FAISS.from_documents(split_texts, openai_embed)

print("FAISS store created")
# Read the count from the index itself rather than from the input list
print("Number of vectors stored:", faiss_store.index.ntotal)



FAISS store created
Number of vectors stored: 1


### **Step 8: Perform a Similarity Search**
*   Use similarity_search(query, k) to retrieve the most relevant chunks for a query, and print the top two most similar documents.
*   Optionally use similarity_search_with_score to view relevance scores.
*   Print the results in a readable format for class discussion.

In [13]:
# Define your query
query = "What is the candidate's skill set?"

# Perform search (top 2 matches)
search_results = faiss_store.similarity_search(query, k=2)

# Nicely format the output
for i, doc in enumerate(search_results, start=1):
    print(f"\n--- Result {i} ---")
    print(doc.page_content[:300], "...")




--- Result 1 ---
The United States is a nation of possibilities. We will build, innovate, and lead.
    This placeholder exists so the demo runs even without a real file. ...
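
Conceptually, a similarity search ranks stored vectors by their distance to the query vector and keeps the k closest. The brute-force sketch below assumes squared Euclidean distance as the metric (lower means closer); the function names are made up for illustration, and the sketch does not describe how the FAISS store works internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, distance) pairs for the k closest vectors, closest first."""
    scored = [(i, squared_l2(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
results = top_k([1.0, 0.0], doc_vecs, k=2)
print([index for index, _ in results])  # prints [0, 2]

# With only one stored vector, asking for k=2 can return only one result
print(len(top_k([1.0, 0.0], doc_vecs[:1], k=2)))  # prints 1
```

The second call mirrors the run above: the index holds a single chunk, so a request for two results returns one.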


### **Step 9: Perform a Max Marginal Relevance (MMR) Search**
Use max_marginal_relevance_search(query, k, fetch_k, lambda_mult) to retrieve results that balance relevance and diversity.

Key parameters:

*   k: number of results in the final set.
*   fetch_k: number of candidates to consider before selecting the final set.
*   lambda_mult: trade-off between relevance and diversity.
    *   0.0: maximum diversity
    *   1.0: maximum relevance
    *   0.3 to 0.7: balanced

MMR is useful for avoiding duplicate or near-duplicate chunks.

In [14]:
# MMR: balances relevance and diversity in retrieved results
mmr_results = faiss_store.max_marginal_relevance_search(
    query,
    k=4,          # final number of results
    fetch_k=12,   # pool of candidates to diversify from
    lambda_mult=0.5  # 0.0 = max diversity, 1.0 = max relevance
)

print(f"MMR returned {len(mmr_results)} results")
for i, doc in enumerate(mmr_results, start=1):
    print(f"\n--- MMR Result {i} ---")
    print(doc.page_content[:300], "...")


MMR returned 1 results

--- MMR Result 1 ---
The United States is a nation of possibilities. We will build, innovate, and lead.
    This placeholder exists so the demo runs even without a real file. ...
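
The selection rule behind MMR can be sketched in a few lines. This is a generic greedy version using cosine similarity, written for illustration; the function names are hypothetical and the library's implementation may differ in details. The candidate list plays the role of the fetch_k pool.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec, cand_vecs, k=4, lambda_mult=0.5):
    """Greedily pick up to k candidate indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, cand_vecs[i])
            # Highest similarity to anything already picked (0.0 for the first pick)
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
cand_vecs = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]]  # index 1 duplicates index 0
print(mmr_select(query_vec, cand_vecs, k=2, lambda_mult=1.0))  # prints [0, 1]
print(mmr_select(query_vec, cand_vecs, k=2, lambda_mult=0.3))  # prints [0, 2]
```

With lambda_mult=1.0 the duplicate is returned because only relevance counts; with lambda_mult=0.3 the duplicate is penalized and the less similar but distinct candidate is chosen instead.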


# **Conclusion**

This activity provided a step-by-step guide on how to use LangChain’s loaders, splitters, and embeddings. You now know how to load documents, split them into manageable chunks, embed them into a numerical space, and store these embeddings for efficient similarity searches.

*   Pipeline modularity: Loader → Splitter → Embedder → Store → Retriever.
*   Chunking strategy: Size and overlap impact retrieval quality.
*   Embeddings choice: OpenAI for convenience, HuggingFace for offline use.
*   Vector store choice:
    *   FAISS: fast, in-memory (ephemeral).
    *   Chroma: persistent, simple local DB.
*   Retrieval strategies:
    *   Basic similarity search for pure relevance.
    *   MMR search to diversify results and reduce redundancy.