## **Tutorial: Prepare your own documents for vector search**

In [None]:
!uv add langchain pypdf langchain-google-genai langchain-community python-dotenv chromadb

### **Loading documents**

In [2]:
import os
from langchain_community.document_loaders import PyPDFLoader

docs_dir = "./docs"

all_files = [os.path.join(docs_dir, f) for f in os.listdir(docs_dir)
             if os.path.isfile(os.path.join(docs_dir, f))]

documents = []
for file_path in all_files:
    try:
        loader = PyPDFLoader(
            file_path=file_path
        )
        docs = loader.load()
        documents.extend(docs)
        print(f"Loaded {len(docs)} page(s) from {file_path}")
    except Exception as e:
        print(f"Error loading {file_path}: {e}")

print(f"Loaded a total of {len(documents)} pages")

Loaded 1 page(s) from ./docs\amazon.pdf
Loaded 1 page(s) from ./docs\apple.pdf
Loaded 1 page(s) from ./docs\google.pdf
Loaded 1 page(s) from ./docs\meta.pdf
Loaded 1 page(s) from ./docs\microsoft.pdf
Loaded 1 page(s) from ./docs\tesla.pdf
Loaded a total of 6 pages
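
The loop above hands every file in `./docs` to `PyPDFLoader`, so a stray non-PDF file only shows up as an error message. A minimal sketch of filtering by extension first, using only the standard library; the helper name `list_pdf_files` is hypothetical and not part of LangChain:

```python
from pathlib import Path


def list_pdf_files(docs_dir: str) -> list[str]:
    """Return sorted paths of the PDF files directly inside docs_dir."""
    return sorted(
        str(path)
        for path in Path(docs_dir).iterdir()
        # Keep regular files only, matching the extension case-insensitively
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

The returned list could be used in place of `all_files` in the loading cell.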


### **Chunking documents**

Split large documents into smaller, overlapping chunks so that each embedding covers one focused piece of text.

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split into chunks of at most 1000 characters, with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

Created 6 chunks
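
Six documents in and six chunks out suggests that each PDF page already fits within `chunk_size`, so nothing was actually split here. To see how `chunk_size` and `chunk_overlap` interact on longer text, here is a simplified fixed-window sketch in plain Python. It is an illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

With these values each window starts 6 characters after the previous one, so neighboring chunks share 4 characters.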


In [5]:
for chunk in chunks:
    print(chunk.page_content)
    print("-"*100)

Amazon
Contact: Joe Smith - VP, Infrastructure Strategy
Sales Stage: Discovery (SQL)
Deal Size: Est. $450K ARR
Meetings:
May 12: Intro call. Focus on AWS internal cost visibility.
May 17: Demo of usage spike detection module.
Needs:
Real-time usage-based alerts for finance ops.
Historical usage analytics with granularity below SKU level.
Objections:
Concerned about integration effort with internal billing APIs.
Internal Notes:
Similar use case to Jetstream (current customer); POC completed in 3 
weeks.
Next Steps:
Share pricing tiers.
Align with FinOps team lead on pilot scope by June 1.
----------------------------------------------------------------------------------------------------
Apple
Contact: Sara Cooper – Sr. Manager, Enterprise Security
Sales Stage: MQL
Deal Size: Unknown (early stage)
Meetings:
May 20: Attended webinar, downloaded whitepaper.
Call scheduled for June 3.
Needs:
Post-quantum cryptographic module + supplier risk insights.
Objections:
Very early stage; waiting f

### **Generate embeddings**

In [6]:
import os
from dotenv import load_dotenv

load_dotenv()

gg_api_key = os.getenv("GOOGLE_API_KEY")
if gg_api_key is None:
    print("GOOGLE_API_KEY not found; please set it in your .env file")
else:
    print("Loaded GOOGLE_API_KEY successfully")

Loaded GOOGLE_API_KEY successfully
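
The check above only prints a message, so a missing key would likely surface later, when the embedding client is created or called. A minimal sketch of failing fast instead; `require_env` is a hypothetical helper, not part of `python-dotenv`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:  # treats both a missing and an empty variable as unset
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

In this notebook it would be called after `load_dotenv()`, for example `gg_api_key = require_env("GOOGLE_API_KEY")`.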


In [7]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(
    model="text-embedding-004",
    google_api_key=gg_api_key
)

### **Store embeddings in Chroma**

Initialize a Chroma vector store and persist it locally.

In [8]:
from langchain_community.vectorstores import Chroma

# Embed the chunks and store them in a Chroma collection persisted in the local "db" directory
vectordb = Chroma.from_documents(
    chunks,
    embedding=embeddings,
    persist_directory="db",
    collection_name="crm"
)
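
Conceptually, a vector store keeps one embedding per chunk and answers a query by ranking those embeddings against the query embedding. The sketch below shows that idea with cosine similarity and made-up 3-dimensional vectors. It is not Chroma's implementation (Chroma uses an approximate nearest-neighbor index and a configurable distance metric), and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Return the k keys whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query, store[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors invented for illustration; real embeddings have hundreds of dimensions
store = {
    "tesla": [0.9, 0.1, 0.0],
    "apple": [0.1, 0.9, 0.1],
    "amazon": [0.2, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], store, k=2))
```

The query vector points mostly along the first axis, so the entry whose vector does the same ranks first.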

### **Example similarity search**

In [9]:
# Retrieve the 5 chunks most similar to the query
query = "robotics"
results = vectordb.similarity_search(query, k=5)
for i, doc in enumerate(results):
    print(f"Result {i+1}: {doc.page_content}...\n")

Result 1: Tesla
Contact: Nathan Ash – Head of Software Integration
Sales Stage: Early Demo
Deal Size: $600K (initial scope)
Meetings:
May 10: Robotics analytics walkthrough.
May 14: Discussion with manufacturing systems team.
Needs:
Predictive maintenance alerts for Gigafactory robotics arms.
Offline deployment (low latency / no external traffic).
Objections:
Concern about vendor lock-in; prefers modular system.
Internal Notes:
Use on-prem reference from Vantage Energy rollout.
Next Steps:
Share local agent deployment option + performance benchmarks....

Result 2: Apple
Contact: Sara Cooper – Sr. Manager, Enterprise Security
Sales Stage: MQL
Deal Size: Unknown (early stage)
Meetings:
May 20: Attended webinar, downloaded whitepaper.
Call scheduled for June 3.
Needs:
Post-quantum cryptographic module + supplier risk insights.
Objections:
Very early stage; waiting for internal exploration greenlight.
Internal Notes:
Mentioned interest in minimal data collection footprint.
Next Steps:
Send