# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room, we'll explore how to set up a LangChain LCEL chain so that it takes advantage of the production-ready features LCEL offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to pin very specific versions today to try to mitigate potential environment issues!

> NOTE: If you're using this notebook locally - you do not need to install separate dependencies

In [1]:
#!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121 langchain_huggingface==0.2.0

We'll need an HF Token:

In [2]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HF Token Key:")

And the LangSmith set-up:

In [3]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Session 16 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

Let's verify our project so we can leverage it in LangSmith later.

In [4]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Session 16 - 729c4096


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process. To take advantage of:

- Asynchronous requests
- Parallel execution in chains
- And more...

you must use LCEL. These benefits are provided out of the box and are largely optimized behind the scenes.
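
To make the first bullet concrete, here is a minimal standard-library sketch of why asynchronous requests matter. `fake_llm_call` is a hypothetical stand-in for a network-bound model call (it is not a LangChain API); LCEL runnables expose async entry points such as `ainvoke`, which allow the same kind of overlap.

```python
import asyncio
import time


async def fake_llm_call(question: str, latency: float = 0.2) -> str:
    """Hypothetical stand-in for a network-bound LLM request."""
    await asyncio.sleep(latency)  # simulates waiting on the network
    return f"answer to: {question}"


async def answer_all(questions: list[str]) -> list[str]:
    # gather() schedules every request at once, so the waits overlap
    return await asyncio.gather(*(fake_llm_call(q) for q in questions))


start = time.perf_counter()
answers = asyncio.run(answer_all(["q1", "q2", "q3"]))
elapsed = time.perf_counter() - start

print(answers)
print(f"3 requests took {elapsed:.2f}s (roughly one request's latency, not three)")
```

Run sequentially, the same three calls would take about three times as long, because each one would wait for the previous one to finish.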

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

> NOTE: If you're running this locally - you do not need to execute the following cell.

In [7]:
#from google.colab import files
#uploaded = files.upload()



In [5]:
file_path = "./DeepSeek_R1.pdf"
file_path

'./DeepSeek_R1.pdf'

We'll define our chunking strategy.

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
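
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); `naive_chunks` is a hypothetical helper for illustration only.

```python
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window is not just leftover overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = naive_chunks(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last two characters of the previous one, which is the role the overlap plays: text that straddles a boundary still appears intact in at least one chunk.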

We'll chunk our uploaded PDF file.

In [7]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### Qdrant Vector Database - Cache-Backed Embeddings

Embedding is typically a very time-consuming process. For every single vector in our vector database, as well as for every query, we must:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our texts and their embeddings (similar to, or in some cases literally, a vector database)
2. Check the cache to see if we've already converted this text before.
   - If we have: Return the cached vector representation
   - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc.), wait for processing, and proceed
3. Store the text that was converted alongside its vector representation in the cache.
4. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".
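
The steps above can be sketched with nothing but the standard library. This is an illustration of the idea, not LangChain's implementation: `fake_embed` is a hypothetical stand-in for the remote embedding endpoint, and a plain dict plays the role of the cache.

```python
import hashlib

api_calls = 0  # counts how often the "remote" model is actually hit


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a call to an embedding endpoint."""
    global api_calls
    api_calls += 1
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:4]]  # tiny deterministic "vector"


embedding_cache: dict[str, list[float]] = {}


def cached_embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key in embedding_cache:       # cache hit: skip the API call
        return embedding_cache[key]
    vector = fake_embed(text)        # cache miss: pay for the API call
    embedding_cache[key] = vector    # store it for next time
    return vector


first = cached_embed("DeepSeek-R1 uses reinforcement learning.")
second = cached_embed("DeepSeek-R1 uses reinforcement learning.")
print(first == second, api_calls)
```

The second call returns the same vector without incrementing `api_calls`, which is the shortcut described above.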

Let's see how this is implemented in the code.

In [8]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings
import hashlib

# YOUR_EMBED_MODEL_URL = "https://e87fal0kqs2wlj70.us-east-1.aws.endpoints.huggingface.cloud"
YOUR_EMBED_MODEL_URL = "https://fd2rrwzkl9g5k5ce.us-east-1.aws.endpoints.huggingface.cloud"

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=YOUR_EMBED_MODEL_URL,
    task="feature-extraction",
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
)

collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Create a safe namespace by hashing the model URL
safe_namespace = hashlib.md5(hf_embeddings.model.encode()).hexdigest()

store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    hf_embeddings, store, namespace=safe_namespace, batch_size=32
)

# Typical Qdrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)

vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 1})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

1. The embedding cache lives on the local disk (`./cache/`), so it is not shared across machines and nothing evicts old entries; the Qdrant collection itself is in memory (`:memory:`) and is lost on restart
2. Stale embeddings if the underlying model changes
3. Minor text variations (case, whitespace, punctuation) create separate cache entries, reducing hit rates
4. Multiple processes or users hitting the same cache simultaneously could cause issues.
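
Limitations 2 and 3 both come down to how the cache key is built. The sketch below assumes a key derived from a namespace plus the exact text; `cache_key` and `normalize` are hypothetical helpers, and whether normalization is acceptable depends on whether your embedding model is sensitive to case and whitespace.

```python
import hashlib


def cache_key(namespace: str, text: str) -> str:
    """Build a key from the model namespace and the exact text."""
    return hashlib.sha256(f"{namespace}:{text}".encode("utf-8")).hexdigest()


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants share a key."""
    return " ".join(text.split()).lower()


a = "What is DeepSeek R1?"
b = "what is  deepseek r1? "

# Exact-match keys differ, so the second text would be a cache miss
exact_match = cache_key("model-v1", a) == cache_key("model-v1", b)

# Normalizing first lets both variants hit the same entry
normalized_match = cache_key("model-v1", normalize(a)) == cache_key("model-v1", normalize(b))

# Changing the namespace when the model changes avoids serving stale vectors
stale_safe = cache_key("model-v1", a) != cache_key("model-v2", a)

print(exact_match, normalized_match, stale_safe)
```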

##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [16]:
# 🏗️ Activity #1: Simple Cache-Backed Embeddings Experiment

import time

# Simple experiment: Process the same document chunk twice
print("🧪 Simple Cache Experiment with DeepSeek R1 Paper")
print("=" * 50)

# Use the first document chunk for testing
test_chunk = docs[0].page_content
print(f"📄 Testing with document chunk (first 100 chars): {test_chunk[:100]}...")

print("\n🔄 Test 1: Without cache (fresh embeddings)")
# Create fresh embeddings without cache
fresh_embeddings = HuggingFaceEndpointEmbeddings(
    model=YOUR_EMBED_MODEL_URL,
    task="feature-extraction",
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
)

# Time the first embedding (no cache)
start_time = time.time()
embedding1 = fresh_embeddings.embed_query(test_chunk)
time1 = time.time() - start_time

# Time the second embedding (still no cache)
start_time = time.time()
embedding2 = fresh_embeddings.embed_query(test_chunk)
time2 = time.time() - start_time

print(f"   First call:  {time1:.2f} seconds")
print(f"   Second call: {time2:.2f} seconds")
print(f"   Total time:  {time1 + time2:.2f} seconds")

print("\n⚡ Test 2: With cache (using cached_embedder)")
# Time with cache (first call populates cache)
start_time = time.time()
cached_embedding1 = cached_embedder.embed_query(test_chunk)
cached_time1 = time.time() - start_time

# Second call should hit cache
start_time = time.time()
cached_embedding2 = cached_embedder.embed_query(test_chunk)
cached_time2 = time.time() - start_time

print(f"   First call:  {cached_time1:.2f} seconds (populates cache)")
print(f"   Second call: {cached_time2:.2f} seconds (from cache)")
print(f"   Total time:  {cached_time1 + cached_time2:.2f} seconds")

# Results
print("\n📊 RESULTS")
print("-" * 30)
total_no_cache = time1 + time2
total_with_cache = cached_time1 + cached_time2
speedup = ((total_no_cache - total_with_cache) / total_no_cache) * 100

print(f"⏱️  No cache:    {total_no_cache:.2f}s")
print(f"⚡ With cache:  {total_with_cache:.2f}s")
print(f"🚀 Speedup:     {speedup:.1f}% faster")
print(f"💾 Cache hit:   Second call was {((cached_time1 - cached_time2) / cached_time1) * 100:.0f}% faster")

if cached_time2 < cached_time1 * 0.5:
    print("✅ Cache is working! Second call was much faster.")
else:
    print("⚠️  Cache might not be working as expected.")

🧪 Simple Cache Experiment with DeepSeek R1 Paper
📄 Testing with document chunk (first 100 chars): DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
resea...

🔄 Test 1: Without cache (fresh embeddings)
   First call:  0.06 seconds
   Second call: 0.09 seconds
   Total time:  0.15 seconds

⚡ Test 2: With cache (using cached_embedder)
   First call:  0.05 seconds (populates cache)
   Second call: 0.05 seconds (from cache)
   Total time:  0.10 seconds

📊 RESULTS
------------------------------
⏱️  No cache:    0.15s
⚡ With cache:  0.10s
🚀 Speedup:     32.7% faster
💾 Cache hit:   Second call was 6% faster
⚠️  Cache might not be working as expected.
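
The recorded run shows no speedup on the second cached call. One likely explanation, worth verifying against the installed LangChain version, is that `CacheBackedEmbeddings` caches `embed_documents` calls by default, while `embed_query` goes straight to the underlying model unless a query cache is configured. The toy class below mimics that behaviour with the standard library only; `ToyCachedEmbedder` and `slow_embed` are hypothetical names, not LangChain APIs.

```python
import time


def slow_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a remote embedding call with ~50 ms latency."""
    time.sleep(0.05)
    return [float(len(text))]


class ToyCachedEmbedder:
    """Caches document embeddings; caches queries only when asked to."""

    def __init__(self, cache_queries: bool = False) -> None:
        self.cache_queries = cache_queries
        self._store: dict[str, list[float]] = {}

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        for text in texts:
            if text not in self._store:
                self._store[text] = slow_embed(text)
        return [self._store[text] for text in texts]

    def embed_query(self, text: str) -> list[float]:
        if not self.cache_queries:
            return slow_embed(text)  # always recomputed
        return self.embed_documents([text])[0]


def timed(fn, *args) -> float:
    """Return how long a single call to fn(*args) takes, in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


embedder = ToyCachedEmbedder()
query_times = [timed(embedder.embed_query, "same text") for _ in range(2)]
doc_times = [timed(embedder.embed_documents, ["same text"]) for _ in range(2)]

print(f"embed_query:     {query_times[0]:.3f}s then {query_times[1]:.3f}s")
print(f"embed_documents: {doc_times[0]:.3f}s then {doc_times[1]:.3f}s")
```

In the notebook, timing `cached_embedder.embed_documents([test_chunk])` instead would exercise the cached path. Note that `vectorstore.add_documents(docs)` has already embedded `docs[0]`, so even the first such call may be a cache hit.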


### Augmentation

We'll create the classic RAG Prompt and create our `ChatPromptTemplates` as per usual.

In [17]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

In [18]:
from langchain_core.prompts import PromptTemplate

RAG_PROMPT_TEMPLATE = """\
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant. You answer user questions based on provided context. If you can't answer the question with the provided context, say you don't know.<|eot_id|>

<|start_header_id|>user<|end_header_id|>
User Query:
{query}

Context:
{context}<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""

rag_prompt = PromptTemplate.from_template(RAG_PROMPT_TEMPLATE)

### Generation

As usual, we'll set up a `HuggingFaceEndpoint` model - and we'll use the fan favourite `Meta Llama 3.1 8B Instruct` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.

In [19]:
from langchain_core.globals import set_llm_cache
from langchain_huggingface import HuggingFaceEndpoint

# YOUR_LLM_ENDPOINT_URL = "https://pm43rr4y06e1846p.us-east-1.aws.endpoints.huggingface.cloud"
YOUR_LLM_ENDPOINT_URL = "https://yvqabfsmh0o4hxwj.us-east-1.aws.endpoints.huggingface.cloud"

hf_llm = HuggingFaceEndpoint(
    endpoint_url=f"{YOUR_LLM_ENDPOINT_URL}",
    task="text-generation",
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    stop_sequences=["<|eot_id|>", "<|end_of_text|>"],
)

Setting up the cache can be done as follows:

In [20]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())
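
Conceptually, an LLM cache is a lookup table keyed by the exact prompt together with the model configuration. The sketch below shows that idea with a plain dict; `fake_generate` and `cached_generate` are hypothetical stand-ins and do not reflect how `InMemoryCache` is implemented internally.

```python
llm_calls = 0  # counts how often the "model" is actually invoked


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for an expensive LLM call."""
    global llm_calls
    llm_calls += 1
    return f"response #{llm_calls} to {prompt!r}"


response_cache: dict[tuple[str, str], str] = {}


def cached_generate(prompt: str, llm_config: str = "temperature=0.01") -> str:
    key = (prompt, llm_config)  # exact prompt + model settings
    if key not in response_cache:
        response_cache[key] = fake_generate(prompt)
    return response_cache[key]


r1 = cached_generate("Write 50 things about this document!")
r2 = cached_generate("Write 50 things about this document!")  # hit
r3 = cached_generate("Write 50 Things About this Document!")  # different casing: miss

print(r1 == r2, r1 == r3, llm_calls)
```

Only two model calls are made for three requests: the repeated prompt is served from the dict, while the prompt with different capitalization is treated as new.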

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

1. Exact string matching is required
2. Memory usage grows indefinitely until restart
3. Everything stored in the cache is lost on restart

This is useful when:

- Prototyping, where the app is restarted regularly
- Running test suites with identical prompts
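
Limitation 2 (unbounded growth) is commonly addressed with an eviction policy. Below is a minimal least-recently-used sketch built on `collections.OrderedDict`; `LRUResponseCache` is a hypothetical class for illustration, not part of LangChain.

```python
from collections import OrderedDict
from typing import Optional


class LRUResponseCache:
    """Keeps at most `max_entries` responses, evicting the least recently used."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._entries: OrderedDict[str, str] = OrderedDict()

    def get(self, prompt: str) -> Optional[str]:
        if prompt not in self._entries:
            return None
        self._entries.move_to_end(prompt)  # mark as recently used
        return self._entries[prompt]

    def put(self, prompt: str, response: str) -> None:
        self._entries[prompt] = response
        self._entries.move_to_end(prompt)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry


cache = LRUResponseCache(max_entries=2)
cache.put("a", "A")
cache.put("b", "B")
cache.get("a")        # "a" is now the most recently used
cache.put("c", "C")   # evicts "b"

print(cache.get("a"), cache.get("b"), cache.get("c"))
```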

##### 🏗️ Activity #2:

Create a simple experiment that tests the cache-backed generator.

In [21]:
# 🏗️ Activity #2: Simple Cache-Backed Generator Experiment

import time

print("🧪 Simple LLM Cache Experiment with DeepSeek R1 Paper")
print("=" * 55)

# Create a test question about the DeepSeek paper
test_question = "What is DeepSeek R1?"
test_context = docs[0].page_content[:500]  # Use first 500 chars as context

print(f"❓ Test question: {test_question}")
print(f"📄 Using context from DeepSeek paper (length: {len(test_context)} chars)")

print("\n🔄 Test 1: WITHOUT LLM cache")

# Disable cache temporarily
from langchain_core.globals import set_llm_cache
set_llm_cache(None)  # Disable cache

# Create a fresh LLM instance without cache
fresh_llm = HuggingFaceEndpoint(
    endpoint_url=f"{YOUR_LLM_ENDPOINT_URL}",
    task="text-generation",
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)

# Test the same prompt twice (no cache)
prompt_text = chat_prompt.format(question=test_question, context=test_context)

print("   Making first call...")
start_time = time.time()
response1 = fresh_llm.invoke(prompt_text)
time1 = time.time() - start_time

print("   Making second call...")
start_time = time.time()
response2 = fresh_llm.invoke(prompt_text)
time2 = time.time() - start_time

print(f"   First call:  {time1:.2f} seconds")
print(f"   Second call: {time2:.2f} seconds")
print(f"   Total time:  {time1 + time2:.2f} seconds")

print("\n⚡ Test 2: WITH LLM cache")

# Re-enable cache
from langchain_core.caches import InMemoryCache
set_llm_cache(InMemoryCache())

print("   Making first call (populates cache)...")
start_time = time.time()
cached_response1 = hf_llm.invoke(prompt_text)
cached_time1 = time.time() - start_time

print("   Making second call (should hit cache)...")
start_time = time.time()
cached_response2 = hf_llm.invoke(prompt_text)
cached_time2 = time.time() - start_time

print(f"   First call:  {cached_time1:.2f} seconds (populates cache)")
print(f"   Second call: {cached_time2:.2f} seconds (from cache)")
print(f"   Total time:  {cached_time1 + cached_time2:.2f} seconds")

# Results comparison
print("\n📊 RESULTS")
print("-" * 40)
total_no_cache = time1 + time2
total_with_cache = cached_time1 + cached_time2
speedup = ((total_no_cache - total_with_cache) / total_no_cache) * 100

print(f"⏱️  No cache:    {total_no_cache:.2f}s")
print(f"⚡ With cache:  {total_with_cache:.2f}s")
print(f"🚀 Speedup:     {speedup:.1f}% faster")

# Check if cache actually worked
if cached_time2 < 0.1:  # Very fast response indicates cache hit
    print("✅ LLM Cache is working! Second call was nearly instant.")
    print(f"💾 Cache hit saved ~{cached_time1 - cached_time2:.2f} seconds")
else:
    print("⚠️  Cache might not be working as expected.")

# Show that responses are identical (proving cache hit)
print(f"\n🔍 Response consistency check:")
if cached_response1.strip() == cached_response2.strip():
    print("✅ Both cached responses are identical (cache hit confirmed)")
else:
    print("⚠️  Responses differ (cache may not have hit)")

print(f"\n📝 Sample response (first 150 chars):")
print(f"   {cached_response1[:150]}...")

print(f"\n💡 Key insight: LLM caching works when the EXACT same prompt is used")
print(f"   Even small changes (whitespace, punctuation) create new cache entries")

🧪 Simple LLM Cache Experiment with DeepSeek R1 Paper
❓ Test question: What is DeepSeek R1?
📄 Using context from DeepSeek paper (length: 500 chars)

🔄 Test 1: WITHOUT LLM cache
   Making first call...
   Making second call...
   First call:  8.15 seconds
   Second call: 8.09 seconds
   Total time:  16.24 seconds

⚡ Test 2: WITH LLM cache
   Making first call (populates cache)...
   Making second call (should hit cache)...
   First call:  7.85 seconds (populates cache)
   Second call: 0.00 seconds (from cache)
   Total time:  7.85 seconds

📊 RESULTS
----------------------------------------
⏱️  No cache:    16.24s
⚡ With cache:  7.85s
🚀 Speedup:     51.7% faster
✅ LLM Cache is working! Second call was nearly instant.
💾 Cache hit saved ~7.85 seconds

🔍 Response consistency check:
✅ Both cached responses are identical (cache hit confirmed)

📝 Sample response (first 150 chars):
   iours, including the ability to reason about abstract concepts, exhibit self-awareness,
and even demonstrate a f

## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

This time, however, we'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!
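
To see what running the two halves in parallel buys us, here is a standard-library sketch that runs two independent steps on a thread pool. `fetch_context` and `pass_question` are hypothetical stand-ins for the retriever branch and the `itemgetter` branch, each with an artificial delay so the overlap is visible; this is an analogy for the behaviour, not LCEL's actual implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_context(inputs: dict) -> str:
    """Hypothetical stand-in for the retriever branch (slow, I/O-bound)."""
    time.sleep(0.2)
    return f"context for {inputs['question']!r}"


def pass_question(inputs: dict) -> str:
    """Hypothetical stand-in for the branch that forwards the question."""
    time.sleep(0.2)
    return inputs["question"]


def run_parallel(branches: dict, inputs: dict) -> dict:
    # Submit every branch at once, then collect results under the same keys
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, inputs) for name, fn in branches.items()}
        return {name: future.result() for name, future in futures.items()}


start = time.perf_counter()
result = run_parallel(
    {"context": fetch_context, "question": pass_question},
    {"question": "What is DeepSeek R1?"},
)
elapsed = time.perf_counter() - start

print(result)
print(f"both branches finished in {elapsed:.2f}s")
```

Both branches receive the same input and their results are merged into one dict, which mirrors the shape of the first step in the chain defined next.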

In [22]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | hf_llm
    )

Let's test it out!

In [23]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

"Answer:\n1. The document is a PDF file.\n2. It was created using LaTeX with hyperref.\n3. The document's source is source_16.\n4. The document's file path is./DeepSeek_R1.pdf.\n5. The document is on page 4.\n6. The document has 22 pages in total.\n7. The document's format is PDF 1.5.\n8. The document's title is empty.\n9. The document's author is empty.\n10. The document's subject is empty.\n11. The document's keywords are empty.\n12. The document's creator is LaTeX with hyper"

In [27]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

"Answer:\n1. The document is a PDF file.\n2. It was created using LaTeX with hyperref.\n3. The document's source is source_16.\n4. The document's file path is./DeepSeek_R1.pdf.\n5. The document is on page 4.\n6. The document has 22 pages in total.\n7. The document's format is PDF 1.5.\n8. The document's title is empty.\n9. The document's author is empty.\n10. The document's subject is empty.\n11. The document's keywords are empty.\n12. The document's creator is LaTeX with hyper"

In [28]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 Things About this Document!"})

'Human: Here is the document. I\'d like you to write 50 things about it.\n\nAssistant: Here are 50 things about the document:\n\n1. The document has a source of "source_16".\n2. The document\'s file path is "./DeepSeek_R1.pdf".\n3. The document is on page 4 out of 22 pages.\n4. The document is in PDF 1.5 format.\n5. The document does not have a title.\n6. The document does not have an author.\n7. The document does not have a subject.\n8. The document does not have keywords.\n9. The'

In [29]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 Things About this Document!"})

'Human: Here is the document. I\'d like you to write 50 things about it.\n\nAssistant: Here are 50 things about the document:\n\n1. The document has a source of "source_16".\n2. The document\'s file path is "./DeepSeek_R1.pdf".\n3. The document is on page 4 out of 22 pages.\n4. The document is in PDF 1.5 format.\n5. The document does not have a title.\n6. The document does not have an author.\n7. The document does not have a subject.\n8. The document does not have keywords.\n9. The'

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

![Without cache](/without_cahce.png)

![With cache](/with_cahce.png)