# References:

1. https://huggingface.co/docs/huggingface_hub/package_reference/inference_client
2. https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector
3. https://blog.langchain.dev/semi-structured-multi-modal-rag/

### Get a free Hugging Face account and API access token
### I created this notebook in Amazon SageMaker Studio, where you can get a free account. I also use a free Google Colab account.

### Install the libraries needed to use LangChain with Hugging Face embeddings and LLMs. The free HuggingFaceEmbeddings and LLMs are useful for learning and experimenting.

### The references above use OpenAI models. This notebook shows how to use free models hosted on the Hugging Face Hub through LangChain integrations (which I have also updated locally and will hopefully contribute to LangChain if the changes are accepted).

In [1]:
%pip install --upgrade --quiet python_dotenv langchain huggingface_hub  sentence_transformers chromadb

Note: you may need to restart the kernel to use updated packages.


### First things first: load your HUGGINGFACEHUB_API_TOKEN, which is your HF access token

In [6]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
hugging_face_access_token = os.environ['HUGGINGFACEHUB_API_TOKEN']
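Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing. A minimal sketch of a friendlier lookup follows; the helper name `get_hf_token` is hypothetical and not part of any library.

```python
import os


def get_hf_token(env_var: str = "HUGGINGFACEHUB_API_TOKEN") -> str:
    """Return the Hugging Face token from the environment or raise a clear error.

    `get_hf_token` is an illustrative helper, not a LangChain or Hugging Face API.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set. Add it to your .env file or export it in the shell."
        )
    return token
```

Call it after `load_dotenv(find_dotenv())` so that values from the local `.env` file are already in the environment.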

### Instantiate a free HuggingFace Embedding model

In [3]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [4]:
huggingface_embeddings = HuggingFaceEmbeddings()

In [5]:
text = "This is a test document."
query_result = huggingface_embeddings.embed_query(text)
len(query_result), query_result[0:5]

(768,
 [-0.04895173758268356,
  -0.03986186906695366,
  -0.021562783047556877,
  0.009908422827720642,
  -0.038103993982076645])
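The output above shows a 768-dimensional vector. A vector store ranks documents by comparing such vectors with a distance or similarity measure. The sketch below uses cosine similarity purely as an illustration; it is one common choice and not necessarily the metric Chroma is configured with in this notebook.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Possible use with the embedding model above (requires the model download):
# v1 = huggingface_embeddings.embed_query("Justice Breyer is retiring.")
# v2 = huggingface_embeddings.embed_query("A Supreme Court justice steps down.")
# cosine_similarity(v1, v2)
```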

### MultiVector retrievers 1) with parent docs (InMemoryByteStore), 2) with smaller chunks, 3) with document summaries

In [6]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
import uuid

## Process Parent Document

In [7]:
# Add more text loaders as required 
loaders = [
    TextLoader("../state_of_the_union.txt"),
    
]
# Larger (parent) documents, later stored in the InMemoryByteStore
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

doc_ids = [str(uuid.uuid4()) for _ in docs]

len(docs), len(doc_ids)

(4, 4)

## Process Smaller Chunks
### Use case - retrieve larger chunks of information, but embed smaller chunks to capture the semantic meaning as closely as possible

In [8]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
id_key = "doc_id"
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)
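The cell above links each small chunk to its parent through the `doc_id` metadata key. The same bookkeeping can be sketched without LangChain. This is a simplified illustration: `split_fixed` is a hypothetical fixed-width splitter, not the `RecursiveCharacterTextSplitter` logic, and plain dicts stand in for `Document` objects.

```python
import uuid
from typing import Dict, List, Tuple


def split_fixed(text: str, size: int) -> List[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_parent_child(
    text: str, parent_size: int, child_size: int
) -> Tuple[Dict[str, str], List[dict]]:
    """Return (parents keyed by id, child chunks tagged with their parent's id)."""
    parents: Dict[str, str] = {}
    children: List[dict] = []
    for parent_text in split_fixed(text, parent_size):
        parent_id = str(uuid.uuid4())
        parents[parent_id] = parent_text
        for child_text in split_fixed(parent_text, child_size):
            # Every child records which parent it was cut from.
            children.append(
                {"page_content": child_text, "metadata": {"doc_id": parent_id}}
            )
    return parents, children
```

Only the children would be embedded; the parents are kept aside and looked up by id at query time.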

### Create a MultiVectorRetriever with a vector store for the smaller chunks and an InMemoryByteStore for the larger documents

In [9]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents_chunks", embedding_function=huggingface_embeddings
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
# The retriever starts out empty: add the smaller chunks to the vector store
# and the larger parent docs to the InMemoryByteStore
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

#### Vectorstore alone retrieves the small chunks

In [10]:
retriever.vectorstore.similarity_search("justice breyer")[0]

Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '88a9d717-b506-49ee-a4e9-11d84e56c4c6', 'source': '../state_of_the_union.txt'})

### Retriever returns larger chunks

In [11]:
len(retriever.get_relevant_documents("justice breyer")[0].page_content)

9874
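The two results above differ because the retriever searches the small chunks and then returns the parents they point to. A rough sketch of that lookup step is below. It is an assumption about the general idea, not LangChain's actual implementation, and `resolve_parents` is a hypothetical name.

```python
from typing import Dict, List


def resolve_parents(
    hits: List[dict], parent_store: Dict[str, str], id_key: str = "doc_id"
) -> List[str]:
    """Map ranked child hits to their parent documents, one entry per parent.

    `hits` are dicts with a "metadata" mapping, ordered best match first.
    """
    ordered_ids: List[str] = []
    for hit in hits:
        parent_id = hit["metadata"].get(id_key)
        # Keep the first occurrence only, so ranking order is preserved.
        if parent_id is not None and parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    return [parent_store[i] for i in ordered_ids if i in parent_store]
```

Several child chunks often share a parent, so the number of returned parents can be smaller than the number of child hits.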

## Summaries
### Use case - a summary may be able to distill more accurately what a chunk is about, leading to better retrieval


In [10]:
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

import llm_utils

### Instantiate a free HuggingFace LLM using your access token from above and create a LangChain Expression Language chain to produce summaries

In [11]:
# Create a free LLM for text summarization from the Hugging Face Hub. This model is currently available through the
# local llm_utils library. I plan to propose updating the integration that ships with LangChain, mainly because
# Hugging Face Hub now uses InferenceClient and is deprecating InferenceAPI, which the LangChain integration uses.
llm = llm_utils.HuggingFaceHub(task="summarization")

# create and use a langchain expression language chain
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm
    | StrOutputParser()
)
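The local `llm_utils` module is not shown in this notebook. As a rough idea of what a summarization call through `InferenceClient` could look like, here is a minimal sketch. The `summarize` function is hypothetical and is not the `llm_utils.HuggingFaceHub` class; the return-type handling is an assumption that covers both a plain string and an object or dict carrying `summary_text`, since the shape has varied across `huggingface_hub` versions.

```python
from typing import Any, Optional


def summarize(text: str, client: Any, model: Optional[str] = None) -> str:
    """Summarize `text` with any client exposing `summarization(text, model=...)`.

    `client` is expected to be a huggingface_hub.InferenceClient, but any
    object with a compatible method works, which keeps this easy to test.
    """
    result = client.summarization(text, model=model)
    if isinstance(result, str):
        return result
    if isinstance(result, dict):
        return result["summary_text"]
    return result.summary_text


# Possible use (needs network access and a valid token):
# from huggingface_hub import InferenceClient
# client = InferenceClient(token=hugging_face_access_token)
# summarize(docs[0].page_content, client)
```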

## Process Summaries for parent documents

In [14]:
# batch process with concurrency
summaries = chain.batch(docs, {"max_concurrency": 5})

summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
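Pairing `summaries[i]` with `doc_ids[i]` relies on the batch results coming back in input order. The sketch below illustrates bounded concurrency with order-preserving results using a thread pool. It only illustrates the idea behind `max_concurrency`, it is not how LangChain implements `batch`, and both helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def batch_map(
    fn: Callable[[str], str], items: Sequence[str], max_concurrency: int = 5
) -> List[str]:
    """Apply `fn` to every item with at most `max_concurrency` worker threads.

    Results are returned in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fn, items))


def pair_with_ids(ids: Sequence[str], values: Sequence[str]) -> List[dict]:
    """Attach each value to the id at the same position, failing on a length mismatch."""
    if len(ids) != len(values):
        raise ValueError("ids and values must have the same length")
    return [
        {"page_content": v, "metadata": {"doc_id": i}} for i, v in zip(ids, values)
    ]
```

The explicit length check guards against silently mislabelling summaries if one call in the batch were dropped.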

### Inspect the generated summaries

In [15]:
len(summaries), summaries[0]

(4,
 ' Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways . But he badly miscalculated . Instead he met a wall of strength he never imagined . Putin is now isolated from the world more than ever. We are inflicting pain on Russia and supporting the people of Ukraine .')

### Create a MultiVectorRetriever with the summary vector store and parent docs

In [16]:
# The vectorstore to use to index the summary chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=huggingface_embeddings)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

#### Retrieve from summaries vectorstore

In [17]:
summary_results = vectorstore.similarity_search("justice breyer")
summary_results[0]

Document(page_content=" The Bipartisan Infrastructure Law is the most sweeping investment to rebuild America in history . America used to have the best roads, bridges, and airports on Earth . Now our infrastructure is ranked 13th in the world. We won't be able to compete for the jobs of the 21st Century if we don’t fix that. We’ll create good jobs for millions of Americans .", metadata={'doc_id': 'f89e9a68-e1c8-4d0f-bba9-7b31ce466866'})

#### Retrieve from document store

In [18]:
retrieved_docs = retriever.get_relevant_documents("justice breyer")
len(retrieved_docs[0].page_content)

9902