In [9]:
%pip install -U cryptography langchain-text-splitters langchain-community langchain-chroma pypdf sentence-transformers langchain

Collecting cryptography
  Using cached cryptography-46.0.5-cp311-abi3-manylinux_2_34_x86_64.whl.metadata (5.7 kB)
Collecting cffi>=2.0.0 (from cryptography)
  Using cached cffi-2.0.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.6 kB)
Collecting pycparser (from cffi>=2.0.0->cryptography)
  Using cached pycparser-3.0-py3-none-any.whl.metadata (8.2 kB)
Using cached cryptography-46.0.5-cp311-abi3-manylinux_2_34_x86_64.whl (4.5 MB)
Using cached cffi-2.0.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (219 kB)
Using cached pycparser-3.0-py3-none-any.whl (48 kB)
Installing collected packages: pycparser, cffi, cryptography
Successfully installed cffi-2.0.0 cryptography-46.0.5 pycparser-3.0

In [3]:
import os
from langchain_community.document_loaders import PyPDFLoader
# Text splitters now live in their own package:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# The Chroma integration is published as a separate package:
from langchain_chroma import Chroma
# Embeddings (requires sentence-transformers to be installed).
# SentenceTransformerEmbeddings is an alias of HuggingFaceEmbeddings in langchain_community:
from langchain_community.embeddings import SentenceTransformerEmbeddings

# 1. Set up data pipeline paths
PDF_PATH = "./data/2023-Equity-Derivatives-2023-Latham-Watkins.pdf"
CHROMA_PATH = "chroma_db_store"

def run_pipeline():
    # 2. Extract & Load: read the PDF
    # PyPDFLoader yields one Document per page; load() reads all pages into memory
    loader = PyPDFLoader(PDF_PATH)
    raw_documents = loader.load()
    
    # 3. Transform: Chunking
    # RecursiveCharacterTextSplitter tries to keep paragraphs together
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, 
        chunk_overlap=200,
        add_start_index=True
    )
    chunks = text_splitter.split_documents(raw_documents)
    print(f"Created {len(chunks)} chunks from {len(raw_documents)} pages.")

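    # 4. Embed & Load: Inject into ChromaDB
    # Using a free, local embedding model (SentenceTransformer)
    embedding_func = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

    # Stable IDs let a re-run overwrite existing entries instead of appending
    # the same chunks again to the persisted collection (the likely cause of
    # repeated hits at query time). Entries stored by earlier runs without
    # explicit IDs remain until CHROMA_PATH is deleted.
    chunk_ids = [f"{PDF_PATH}:{i}" for i in range(len(chunks))]

    vector_db = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_func,
        ids=chunk_ids,
        persist_directory=CHROMA_PATH
    )
    print("Data successfully injected into ChromaDB.")
    return vector_db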

def query_database(vector_db, query_text):
    # 5. Retrieve: Search for relevant data chunks
    results = vector_db.similarity_search(query_text, k=3)
    return results


if __name__ == "__main__":
    # Run the ingestion
    db = run_pipeline()
    
    # Example Query
    query = "What are the key derivative risks mentioned?"
    relevant_chunks = query_database(db, query)
    
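    for i, chunk in enumerate(relevant_chunks):
        # Note: PyPDFLoader's 'page' metadata is 0-based, so the value printed
        # here is one less than the page's position in the PDF.
        print(f"\n--- Relevant Chunk {i+1} (Page {chunk.metadata['page']}) ---")
        print(chunk.page_content[:300] + "...")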


Created 463 chunks from 115 pages.
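
The count above works out to roughly four chunks per page. Both `chunk_size=1000` and `chunk_overlap=200` are measured in characters. As a rough illustration of what those two settings mean, here is a fixed-width sliding window in plain Python. This is a deliberately simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses (the real splitter prefers paragraph, line and word boundaries), and the name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters.

    Simplified illustration only: it ignores paragraph and word boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        # start plays the role of the start_index metadata field
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


demo = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for chunk in demo:
    print(chunk["start_index"], chunk["text"])
```

Each window begins with the last four characters of the previous one, which is why a sentence that straddles a chunk boundary can still be retrieved whole from one of the two neighbouring chunks.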


Loading weights: 100%|██████████| 103/103 [00:00<00:00, 2435.81it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Data successfully injected into ChromaDB.
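
Once the chunks are stored, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The toy sketch below shows that idea with a brute-force cosine-similarity ranking. It is not how Chroma works internally (Chroma uses an approximate nearest-neighbour index and its own distance settings), the three-dimensional vectors are made up for illustration, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=3):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors standing in for real 384-dimensional MiniLM embeddings.
toy_store = [
    ("counterparty insolvency risk", [0.9, 0.1, 0.0]),
    ("insider dealing rules", [0.2, 0.9, 0.1]),
    ("tax treatment", [0.0, 0.2, 0.9]),
]

best = top_k([1.0, 0.0, 0.0], toy_store, k=2)
for score, text in best:
    print(f"{score:.3f}  {text}")
```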

--- Relevant Chunk 1 (Page 12) ---
publicly available and that would be likely to significantly impact the price of the shares if it 
were to be made publicly available) or engage in the unlawful disclosure of inside information 
or market manipulation. If the counterparty to an OTC derivative transaction involving shares 
in an issu...

--- Relevant Chunk 2 (Page 12) ---
publicly available and that would be likely to significantly impact the price of the shares if it 
were to be made publicly available) or engage in the unlawful disclosure of inside information 
or market manipulation. If the counterparty to an OTC derivative transaction involving shares 
in an issu...
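
Chunks 1 and 2 above are identical. That usually means the same chunk is stored more than once, for example because the ingestion cell was run several times against the same `persist_directory`. This is an assumption, since the output alone cannot confirm it. A minimal sketch of filtering such repeats at query time follows. It uses a stand-in `FakeDoc` class instead of LangChain's `Document` so that it runs without the library; `FakeDoc` and `dedupe_chunks` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_chunks(docs):
    """Keep the first occurrence of each chunk, preserving ranking order."""
    seen = set()
    unique = []
    for doc in docs:
        key = (
            doc.metadata.get("page"),
            doc.metadata.get("start_index"),
            doc.page_content,
        )
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Placeholder hits mimicking the shape of the output above.
hits = [
    FakeDoc("publicly available and that would be likely ...", {"page": 12, "start_index": 0}),
    FakeDoc("publicly available and that would be likely ...", {"page": 12, "start_index": 0}),
    FakeDoc("Risk ...", {"page": 34, "start_index": 0}),
]

unique_hits = dedupe_chunks(hits)
print([doc.metadata["page"] for doc in unique_hits])
```

With the real store, one option is to request more results than needed (say `k=6`) and keep the first three that survive this filter.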

--- Relevant Chunk 3 (Page 34) ---
Risk
8 What types of risks do dealers face in the event of a bankruptcy or insolvency 
of the counterparty? Do any special bankruptcy or insolvency rules apply if the 
counterparty is the issuer or an affiliate of the issuer?
The risk that de