<a href="https://colab.research.google.com/github/satvik314/RAG_experiments/blob/main/Parent_Document_Retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notebook Credits: [Sam Witteveen](https://www.youtube.com/watch?v=wQEl0GGxPcM&list=PL8motc6AQftn-X1HkaGG9KjmKtWImCKJS&index=8)

In [1]:
!pip -q install langchain langchain_openai openai tiktoken chromadb lark
!pip -q install sentence_transformers
!pip -q install -U FlagEmbedding


In [4]:
# Download the blog posts from https://www.dropbox.com/scl/fi/ulbt145sthizf2nazey49/langchain_blog_posts.zip?rlkey=9unhw0vukhlwacahmpnk5m591&dl=0
# and upload the zip to /content before running this cell.
# !mkdir -p blog_posts
!unzip -q /content/langchain_blog_posts.zip -d blog_posts

In [5]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass()



Parent Document Retriever
- Search over small chunks, then return the full documents they came from
- Search over small chunks, then return the larger chunks that contain them
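
To make the idea concrete, here is a minimal toy sketch of the lookup pattern. It is not LangChain's implementation: it assumes a naive word-overlap score in place of embeddings, and `split_text`, `build_index`, `overlap`, and `retrieve_parents` are hypothetical helper names used only for illustration.

```python
import uuid


def split_text(text, chunk_size):
    """Cut text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_index(parents, chunk_size=40):
    """Store each parent under an id and index its small chunks with that id."""
    docstore = {}      # parent_id -> full parent text
    child_index = []   # (child_text, parent_id) pairs, a stand-in for a vector store
    for text in parents:
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = text
        for child in split_text(text, chunk_size):
            child_index.append((child, parent_id))
    return docstore, child_index


def overlap(query, text):
    """Toy relevance score: shared lowercase words (an assumption, not embeddings)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, docstore, child_index, k=2):
    """Rank child chunks, then return the de-duplicated parents they point to."""
    ranked = sorted(child_index, key=lambda pair: overlap(query, pair[0]), reverse=True)
    seen, results = set(), []
    for _, parent_id in ranked[:k]:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(docstore[parent_id])
    return results


# Made-up stand-ins for the blog posts.
parents = [
    "LangSmith is a platform for debugging and testing LLM applications. " * 3,
    "Benchmarking question answering over CSV data with agents. " * 3,
]
docstore, child_index = build_index(parents)
hits = retrieve_parents("what is langsmith", docstore, child_index)
```

Both top-scoring chunks belong to the first text, so `hits` holds that full text once: the search ran over 40-character chunks, but the caller gets the whole parent back.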

In [7]:
from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain.retrievers import ParentDocumentRetriever

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

In [None]:
### BGE Embeddings

from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en-v1.5"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},
    encode_kwargs=encode_kwargs
)
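
The `normalize_embeddings` comment above rests on a small identity: once vectors are scaled to unit length, their dot product equals their cosine similarity. A quick sketch with made-up 3-dimensional vectors (not real BGE outputs; the helper names are illustrative) checks this:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed from raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up "embeddings", purely illustrative.
a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

cos_raw = cosine_similarity(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))

# The two values agree up to floating-point error.
print(math.isclose(cos_raw, dot_normalized))
```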

Data Prep

In [8]:
loaders = [
    TextLoader('/content/blog_posts/blog.langchain.dev_announcing-langsmith_.txt'),
    TextLoader('/content/blog_posts/blog.langchain.dev_benchmarking-question-answering-over-csv-data_.txt'),
]
docs = []
for l in loaders:
    docs.extend(l.load())

In [9]:
len(docs)

2

1. Retrieving full documents rather than chunks

In [10]:
# this text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# vector store used to index the child chunks
vectorstore = Chroma(
    collection_name = "full_documents",
    embedding_function = embeddings
)

# storage layer for the parent documents
store = InMemoryStore()

full_doc_retriever = ParentDocumentRetriever(
    vectorstore = vectorstore,
    docstore = store,
    child_splitter = child_splitter
)

In [11]:
# ids=None lets the retriever generate a UUID for each parent document
full_doc_retriever.add_documents(docs, ids=None)

In [12]:
list(store.yield_keys())

['fe830ff8-9ec0-4a26-881b-28b8e8ecf123',
 'e8efaaea-0c14-46c6-8c60-bc0c115e11e6']

In [13]:
sub_docs = vectorstore.similarity_search("what is langsmith", k=2)

In [14]:
print(sub_docs[0].page_content)

URL: https://blog.langchain.dev/announcing-langsmith/
Title: Announcing LangSmith, a unified platform for debugging, testing, evaluating, and monitoring your LLM applications

LangChain exists to make it as easy as possible to develop LLM-powered applications.


In [15]:
retrieved_docs = full_doc_retriever.get_relevant_documents("What is langsmith?")

In [16]:
len(retrieved_docs[0].page_content)

11652

2. Retrieving Larger Chunks

In [17]:
parent_splitter = RecursiveCharacterTextSplitter(chunk_size = 2000)

child_splitter = RecursiveCharacterTextSplitter(chunk_size = 400)

vectorstore = Chroma(collection_name = "split_parents", embedding_function = embeddings)

store = InMemoryStore()

In [18]:
big_chunks_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [19]:
big_chunks_retriever.add_documents(docs)

In [20]:
len(list(store.yield_keys()))

18
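
The docstore now holds parent chunks rather than whole files, which is why the key count is larger than the number of loaded documents. A rough sketch of the two-level split, assuming fixed-width character splitting (the real `RecursiveCharacterTextSplitter` breaks on separators, so its counts will differ) and hypothetical helper names:

```python
def split_text(text, chunk_size):
    """Fixed-width split; a simplification of RecursiveCharacterTextSplitter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def two_level_split(text, parent_size=2000, child_size=400):
    """Return {parent_index: (parent_chunk, [child_chunks])}."""
    result = {}
    for idx, parent in enumerate(split_text(text, parent_size)):
        result[idx] = (parent, split_text(parent, child_size))
    return result


# Placeholder text of 5,000 characters (not one of the blog posts).
sample = "x" * 5000
levels = two_level_split(sample)

# Parent chunks are what a docstore would hold.
num_parents = len(levels)
# Child chunks are what a vector store would embed and index.
num_children = sum(len(kids) for _, kids in levels.values())

print(num_parents, num_children)  # 3 13
```

Each child chunk is searched, but only its parent chunk (at most 2000 characters here) is returned, which keeps the context passed to the LLM bounded.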

In [22]:
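# re-run the search against the new "split_parents" collection
sub_docs = vectorstore.similarity_search("what is langsmith", k=2)
print(sub_docs[0].page_content)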

URL: https://blog.langchain.dev/announcing-langsmith/
Title: Announcing LangSmith, a unified platform for debugging, testing, evaluating, and monitoring your LLM applications

LangChain exists to make it as easy as possible to develop LLM-powered applications.


In [23]:
retrieved_docs = big_chunks_retriever.get_relevant_documents("what is langsmith")

In [24]:
len(retrieved_docs)

3

In [25]:
len(retrieved_docs[0].page_content)

1869

In [26]:
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=big_chunks_retriever,
)

In [27]:
query = "what is langsmith?"

qa.run(query)

' LangSmith is a platform designed for building and iterating on products that can harness the power and complexity of LLMs. It helps developers close the gap between prototype and production by providing tools for debugging, testing, evaluating, and monitoring LLM applications. It also offers deep visibility into model performance and allows for easy integration and evaluation of different chain methods. LangSmith is currently in closed beta and has been used by companies such as Snowflake, Boston Consulting Group, and Fintual to develop and improve their LLM-powered applications. '