# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room we'll be exploring how to set up a LangChain LCEL chain in a way that takes advantage of all of the out-of-the-box, production-ready features it offers.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to pin very specific versions today to try to mitigate potential environment issues!

> NOTE: Dependency issues are a large portion of what you're going to be tackling as you integrate new technology into your work - please keep in mind that one of the things you should be passively learning throughout this course is ways to mitigate dependency issues.

In [None]:
#!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121 langchain_huggingface==0.2.0

We'll need an HF Token:

In [1]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HF Token Key:")

And the LangSmith set-up:

In [2]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Session 16 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

Let's verify our project so we can leverage it in LangSmith later.

In [3]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Session 16 - 4def1fe4


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.
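
To make the "asynchronous requests" benefit concrete, here is a minimal standard-library sketch of the idea: a component that only defines a blocking `invoke` can still expose an `ainvoke` by handing the blocking call to a worker thread, which lets several requests overlap. `ToyRunnable` and `slow_upper` are hypothetical names made up for this illustration; this is not LangChain's code, only an analogy for the kind of plumbing LCEL is assumed to handle for us.

```python
import asyncio
import time


class ToyRunnable:
    """Hypothetical mini-runnable that only mimics the invoke/ainvoke shape."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    async def ainvoke(self, value):
        # Hand the blocking call to a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.invoke, value)


def slow_upper(text: str) -> str:
    time.sleep(0.1)  # stands in for a network call to a model endpoint
    return text.upper()


async def main() -> list:
    step = ToyRunnable(slow_upper)
    # The three calls overlap instead of running one after another.
    return await asyncio.gather(*(step.ainvoke(q) for q in ["a", "b", "c"]))


results = asyncio.run(main())
print(results)
```

Inside a notebook an event loop is already running, so there you would `await main()` directly rather than calling `asyncio.run`.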

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

In [6]:
import ipywidgets as widgets
from IPython.display import display

upload_widget = widgets.FileUpload(accept='.pdf', multiple=False)
display(upload_widget)

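def get_uploaded_file_path(upload_widget):
    if upload_widget.value:
        value = upload_widget.value
        if isinstance(value, dict):
            # ipywidgets 7: dict keyed by filename, with the name under 'metadata'
            file_info = list(value.values())[0]
            file_path = file_info['metadata']['name']
        else:
            # ipywidgets 8: tuple of dicts, with the name at the top level
            file_info = value[0]
            file_path = file_info['name']
        # Write the uploaded bytes next to the notebook so the loader can read them
        with open(file_path, 'wb') as f:
            f.write(file_info['content'])
        return file_path
    return None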

file_path = get_uploaded_file_path(upload_widget)
file_path

FileUpload(value=(), accept='.pdf', description='Upload')

In [7]:
file_path = "./DeepSeek_R1.pdf"
file_path

'./DeepSeek_R1.pdf'

We'll define our chunking strategy.

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
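
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch that cuts text into fixed-size windows, each repeating the tail of the previous one. It is an assumption-laden illustration only: `sliding_window_chunks` is a hypothetical helper, and the real `RecursiveCharacterTextSplitter` additionally prefers to split on separators such as paragraph and line breaks, so its chunks will not match this output.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.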



We'll chunk our uploaded PDF file.

In [9]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### Qdrant Vector Database - Cache Backed Embeddings

Embedding is typically a very time-consuming process - for every single vector in our VDB, as well as every query, we must:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our texts and their embeddings (similar to, or in some cases literally, a vector database)
2. Check the cache to see if we've already converted this text before.
   - If we have: Return the cached vector representation
   - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc.), then wait for processing and proceed
3. Store the text that was converted alongside its vector representation in the cache.
4. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".
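
The caching idea above can be sketched in plain Python. This is a toy under stated assumptions, not how `CacheBackedEmbeddings` is implemented: `ToyCachedEmbedder`, `fake_embed`, and the `toy-model-v1:` namespace are all hypothetical, and the "model" simply derives two numbers from the text so we can count how often it gets called.

```python
import hashlib


class ToyCachedEmbedder:
    """Hypothetical stand-in for a cache-backed embedder."""

    def __init__(self, embed_fn, namespace: str):
        self.embed_fn = embed_fn    # the slow, paid call (an API endpoint in practice)
        self.namespace = namespace  # keeps vectors from different models apart
        self.store = {}             # key -> vector
        self.api_calls = 0

    def _key(self, text: str) -> str:
        # Exact-match key: any change to the text produces a different key.
        return self.namespace + hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts: list) -> list:
        # Only texts we have never seen are sent to the underlying model.
        missing = [t for t in dict.fromkeys(texts) if self._key(t) not in self.store]
        for text in missing:
            self.store[self._key(text)] = self.embed_fn(text)
            self.api_calls += 1
        return [self.store[self._key(t)] for t in texts]


def fake_embed(text: str) -> list:
    # Placeholder "model"; real vectors would come from the embedding endpoint.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


embedder = ToyCachedEmbedder(fake_embed, namespace="toy-model-v1:")
first = embedder.embed_documents(["alpha", "beta"])
second = embedder.embed_documents(["alpha", "beta", "gamma"])
print(embedder.api_calls)
# → 3
```

Five texts were requested in total, but only three reached the "model": the second call paid only for `"gamma"`.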

Let's see how this is implemented in the code.

In [11]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings
import hashlib

YOUR_EMBED_MODEL_URL = "https://yp7jizu4274bi6ne.us-east-1.aws.endpoints.huggingface.cloud"

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=YOUR_EMBED_MODEL_URL,
    task="feature-extraction",
)

collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Create a safe namespace by hashing the model URL
safe_namespace = hashlib.md5(hf_embeddings.model.encode()).hexdigest()

store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    hf_embeddings, store, namespace=safe_namespace, batch_size=32
)

# Typical Qdrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)
vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 1})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> A: 
>
> Limitations:
> - Storage can become costly, as you will need to pay for it.
> - You need a lot of embeddings to make it worth it or there is little chance of having a cache hit.
> - If you have a lot of embeddings, you'll need a lot of disk space.
> - You need an exact match to get a cache hit.
> - Cache invalidation can be difficult, need to regenerate embeddings if the model is updated.
> - Doesn't work well with distributed systems when the cache is a local file store, as each node will have its own cache. You can, however, use a shared data set to pre-populate each cache.
>
> Useful when:
> - You have a lot of embeddings and you can't afford to wait for processing.
> - You have a lot of embeddings and a tight budget
> - Deploying a shared cache across multiple services
> 
> Least useful when:
> - You have dynamic data
> - You have a small amount of embeddings
> - You have limited disk space
> - You have a distributed system
> - You have unlimited money

##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [21]:
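import time

query = "DeepSeek: A Deep Learning-Based System for Scientific Literature Search"

# Uncached: call the underlying embedder directly, so the text always goes to the endpoint.
print('Uncached embedding')
start_time = time.time()
hf_embeddings.embed_documents([query])
end_time = time.time()
uncached_time = end_time - start_time
print(f"Elapsed time: {uncached_time:.2f} seconds")

# Warm the cache once. CacheBackedEmbeddings caches `embed_documents` calls;
# by default `embed_query` is passed straight through to the underlying embedder.
cached_embedder.embed_documents([query])

# Cached: the same text is now read back from the LocalFileStore.
print('Cached embedding')
start_time = time.time()
cached_embedder.embed_documents([query])
end_time = time.time()
cached_time = end_time - start_time
print(f"Elapsed time: {cached_time:.2f} seconds")

difference = uncached_time - cached_time
improvement = difference / uncached_time
print(f"Difference of {difference:.2f} seconds")
print(f"Speedup of {(improvement * 100):.2f}%")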

Uncached embedding
Elapsed time: 0.10 seconds
Cached embedding
Elapsed time: 0.06 seconds
Difference of 0.04 seconds
Speedup of 41.95%


### Augmentation

We'll create the classic RAG prompt and build our `ChatPromptTemplate` as per usual.

In [22]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

As usual, we'll set up an LLM - today it's a model served from a Hugging Face Inference Endpoint, which we'll access through `HuggingFaceEndpoint`.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.
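
As a rough illustration of that idea, here is a toy exact-match response cache. It is a sketch under assumptions rather than LangChain's `InMemoryCache`: `ToyLLMCache`, `fake_llm`, and the settings strings are hypothetical. The one design point it borrows is keying on both the prompt and a string describing the model settings, so a response generated under one configuration is not reused for another.

```python
class ToyLLMCache:
    """Hypothetical exact-match response cache keyed on (prompt, model settings)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def generate(self, prompt: str, llm_string: str, llm_fn) -> str:
        key = (prompt, llm_string)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # seen before: skip the model call entirely
        self.misses += 1
        response = llm_fn(prompt)    # slow path: actually call the model
        self._store[key] = response
        return response


def fake_llm(prompt: str) -> str:
    # Placeholder for a real model call.
    return f"echo: {prompt}"


cache = ToyLLMCache()
settings = "toy-model|temperature=0.01"
a = cache.generate("What is RAG?", settings, fake_llm)   # miss: first time we see it
b = cache.generate("What is RAG?", settings, fake_llm)   # hit: identical prompt and settings
c = cache.generate("What is RAG? ", settings, fake_llm)  # miss: a trailing space changes the key
d = cache.generate("What is RAG?", "toy-model|temperature=0.7", fake_llm)  # miss: settings differ
print(cache.hits, cache.misses)
# → 1 3
```

Note how small a change it takes to miss the cache: a single extra space in the prompt is enough.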

In [23]:
from langchain_core.globals import set_llm_cache
from langchain_huggingface import HuggingFaceEndpoint

YOUR_LLM_ENDPOINT_URL = "https://qs9jxmyvdm7eimw7.us-east-1.aws.endpoints.huggingface.cloud"

hf_llm = HuggingFaceEndpoint(
    endpoint_url=f"{YOUR_LLM_ENDPOINT_URL}",
    task="text-generation",
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Setting up the cache can be done as follows:

In [24]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> A:
>
> Limitations:
> - If you shut down your server you lose your cache, because it's stored in memory.
> - You need a lot of traffic to make it worth it or there is little chance of having a cache hit.
> - You'll need a lot of memory.
> - You need an exact match to get a cache hit.
> - Doesn't work with distributed systems
> - Responses won't be dynamic, as you'll always get the same response for the same prompt.
>
> Useful when:
> - Prototyping
> - You have a slow LLM, agent or process.
> - You have a tight budget
> 
> Least useful when:
> - You have dynamic data
> - You have very little traffic
> - You have limited memory
> - You have a distributed system
> - You have unlimited money

##### 🏗️ Activity #2:

Create a simple experiment that tests the LLM cache.

In [28]:
import time

query = "What is Deepseek R1's advantage over other systems?"

YOUR_LLM_ENDPOINT_URL = "https://qs9jxmyvdm7eimw7.us-east-1.aws.endpoints.huggingface.cloud"

hf_llm = HuggingFaceEndpoint(
    endpoint_url=f"{YOUR_LLM_ENDPOINT_URL}",
    task="text-generation",
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)
set_llm_cache(InMemoryCache())

print('Uncached LLM')
start_time = time.time()
response = hf_llm.invoke(query)
end_time = time.time()
print(f"Uncached response: {response}")
uncached_time = end_time - start_time
print(f"Elapsed time: {uncached_time:.2f} seconds")

print('Cached LLM')
start_time = time.time()
response = hf_llm.invoke(query)
end_time = time.time()
print(f"Cached response: {response}")
cached_time = end_time - start_time
print(f"Elapsed time: {cached_time:.2f} seconds")

difference = uncached_time - cached_time
improvement = difference / uncached_time
print(f"Difference of {difference:.2f} seconds")
print(f"Speedup of {(improvement * 100):.2f}%")

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Uncached LLM
Uncached response:  Deepseek R1 is a highly advanced, high-performance, and high-sensitivity underwater acoustic communication system. Its advantages over other systems include:
    1. High data rate: Deepseek R1 can achieve data rates of up to 100 Mbps, which is significantly higher than other underwater acoustic communication systems.
    2. Long range: Deepseek R1 has a range of up to 10 km, making it suitable for applications that require long-range communication.
    3. High sensitivity: Deepseek R1 has a high sensitivity of -140 dBm, allowing it to detect weak signals and maintain reliable communication in challenging underwater
Elapsed time: 7.76 seconds
Cached LLM
Cached response:  Deepseek R1 is a highly advanced, high-performance, and high-sensitivity underwater acoustic communication system. Its advantages over other systems include:
    1. High data rate: Deepseek R1 can achieve data rates of up to 100 Mbps, which is significantly higher than other underwater a

## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!

In [29]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | hf_llm
    )
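
To make "in parallel" concrete: the dict at the head of the chain gets treated as a parallel step, where every value receives the same input and the results are gathered back under the same keys. The standard-library sketch below imitates that behaviour with a thread pool. It is an analogy under assumptions, not LangChain's implementation, and `run_parallel_map`, `fake_retriever`, and `pick_question` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_parallel_map(steps: dict, value) -> dict:
    """Run every branch on the same input concurrently and collect results by key."""
    with ThreadPoolExecutor(max_workers=len(steps)) as pool:
        futures = {key: pool.submit(fn, value) for key, fn in steps.items()}
        return {key: future.result() for key, future in futures.items()}


def fake_retriever(inputs: dict) -> list:
    time.sleep(0.1)  # stands in for the vector store lookup
    return [f"chunk about {inputs['question']}"]


def pick_question(inputs: dict) -> str:
    return inputs["question"]


first_link = {"context": fake_retriever, "question": pick_question}
prompt_inputs = run_parallel_map(first_link, {"question": "What is LCEL?"})
print(prompt_inputs)
# → {'context': ['chunk about What is LCEL?'], 'question': 'What is LCEL?'}
```

In our chain the `question` branch is nearly free, so the gain here is small; the pattern pays off when several branches each do slow work, such as querying multiple retrievers.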

Let's test it out!

In [30]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})



'What is the name of the person who contributed to the document?\nAnswer:\nThe names of the people who contributed to the document are listed in the "Contributors" section. Some of the contributors include Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu'

In [31]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})



'What is the name of the person who contributed to the document?\nAnswer:\nThe names of the people who contributed to the document are listed in the "Contributors" section. Some of the contributors include Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu'

In [32]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})



'What is the name of the person who contributed to the document?\nAnswer:\nThe names of the people who contributed to the document are listed in the "Contributors" section. Some of the contributors include Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu'

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

![LangSmith Dashboard](./Screenshot-LangSmith-Dashboard.png)