# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room we'll be exploring how to set up a LangChain LCEL chain in a way that takes advantage of the production-ready features it offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to use very specific versioning today to try to mitigate potential env. issues!

> NOTE: Dependency issues are a large portion of what you're going to be tackling as you integrate new technology into your work - please keep in mind that one of the things you should be passively learning throughout this course is ways to mitigate dependency issues.

In [28]:
!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121

We'll need an OpenAI API Key:

In [29]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

And the LangSmith set-up:

In [30]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Week 8 Assignment 1 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

Let's verify our project so we can leverage it in LangSmith later.

In [31]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Week 8 Assignment 1 - e3c95832


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.
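
As a rough illustration of why asynchronous requests matter, the sketch below uses only the standard library. `fake_chain_ainvoke` is a hypothetical stand-in for awaiting a chain's `ainvoke` method; it sleeps instead of calling a model. The three simulated requests overlap rather than run back to back.

```python
import asyncio


async def fake_chain_ainvoke(question: str) -> str:
    """Hypothetical stand-in for `await chain.ainvoke(...)`; sleeps instead of calling an API."""
    await asyncio.sleep(0.1)  # simulated network latency
    return f"answer to: {question}"


async def main() -> list:
    questions = ["What is Scrum?", "What is TDD?", "What is swarming?"]
    # gather schedules all three coroutines concurrently and preserves input order
    return await asyncio.gather(*(fake_chain_ainvoke(q) for q in questions))


answers = asyncio.run(main())
print(answers)
```

Inside a notebook an event loop is usually already running, so `await main()` in a cell would replace the `asyncio.run(main())` call.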

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

In [32]:
!pip install -q pymupdf
!pip install -q ipywidgets

Function to save an uploaded file

In [33]:
def save_uploaded_file(upload_widget, folder="data"):
    if upload_widget.value:
        # Get the uploaded file as a dictionary (from the tuple)
        uploaded_file = upload_widget.value[0]  # Access the first item in the tuple
        pdf_data = uploaded_file['content']  # Extract the content (which is a memory object)
        pdf_name = uploaded_file['name']  # Extract the file name
        
        # Ensure the target folder exists
        if not os.path.exists(folder):
            os.makedirs(folder)
        
        # Define full path for the file
        file_path = os.path.join(folder, pdf_name)
        
        # Save the file
        with open(file_path, 'wb') as f:
            f.write(pdf_data.tobytes())  # Convert memory content to bytes and write
        
        print(f"File saved as: {file_path}")
        
        return file_path
    else:
        print("No file uploaded")
        return None

Select a file from the local drive

In [34]:
import ipywidgets as widgets
from IPython.display import display

upload_widget = widgets.FileUpload(accept='.pdf', multiple=False)  # To only allow PDF uploads
display(upload_widget)

FileUpload(value=(), accept='.pdf', description='Upload')

Save the file

In [35]:

file_path = save_uploaded_file(upload_widget)

No file uploaded


We'll define our chunking strategy.

In [36]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

We'll chunk our uploaded PDF file.

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### Qdrant Vector Database - Cache Backed Embeddings

The process of embedding is typically a very time-consuming one - we must, for every single vector in our VDB as well as every query:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our text and embeddings (similar to, or in some cases literally, a vector database).
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the vector representation.
  - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc.), then wait for processing and proceed.
3. Store the text that was converted alongside its vector representation in the cache.
4. Return the vector representation.

Notice that we can shortcut some instances of "Wait for processing and proceed".
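
The steps above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: `make_cached_embedder` and `fake_embedding_api` are hypothetical names, and the "API" just returns a toy two-number vector while recording each call.

```python
from typing import Callable, Dict, List


def make_cached_embedder(embed_fn: Callable[[str], List[float]]) -> Callable[[str], List[float]]:
    """Wrap an embedding function with an exact-match, in-memory cache."""
    cache: Dict[str, List[float]] = {}

    def embed(text: str) -> List[float]:
        if text in cache:          # cache hit: skip the API call entirely
            return cache[text]
        vector = embed_fn(text)    # cache miss: pay for the API call
        cache[text] = vector       # store for next time
        return vector

    return embed


calls: List[str] = []  # records every text that reached the (fake) API


def fake_embedding_api(text: str) -> List[float]:
    """Hypothetical stand-in for a real embedding endpoint."""
    calls.append(text)
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


embed = make_cached_embedder(fake_embedding_api)
first = embed("hello world")
second = embed("hello world")  # served from the cache
third = embed("goodbye")

print(len(calls))  # → 2
```

Three texts were embedded, but only two reached the API; the repeat was answered from the cache.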

Let's see how this is implemented in the code.

In [25]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings

# Typical Embedding Model
core_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Typical QDrant Client Set-up
collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Adding cache!
store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings, store, namespace=core_embeddings.model
)

# Typical QDrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)
vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

#### ! Answer #1:

- The cache is stored locally, so it could consume a lot of disk space for very large vector databases.
    - The document PDF consumes 41 MB.
    - Each cached embedding takes up 34 KB. There are 589 files, which consume 20.7 MB of disk space.
- Changes to an existing document create new cached embeddings without removing the older ones.
- The cache needs some kind of purging policy, or it must be monitored and managed manually.
- Changing to a new embedding model, or possibly switching to a new version, could invalidate all cached embeddings.
- This looks like a single cache store. I don't see a collection name associated with it, so I wonder whether a shared cache could serve up the wrong information.
- Disk I/O can become a bottleneck.
- The cache relies on an exact match of the input text, so synonyms and even small rewordings miss the cache.
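
To make the exact-match and model-change points concrete, here is a small sketch. It assumes cache keys are built from a namespace (the model name) plus a hash of the exact text; the `cache_key` helper, the SHA-256 choice, and the key layout are illustrative assumptions, not the library's actual scheme.

```python
import hashlib


def cache_key(text: str, namespace: str) -> str:
    """Hypothetical key scheme: namespace plus a digest of the exact text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"


same = cache_key("Agile teams ship often.", "text-embedding-3-small")
again = cache_key("Agile teams ship often.", "text-embedding-3-small")
reworded = cache_key("Agile teams release often.", "text-embedding-3-small")
other_model = cache_key("Agile teams ship often.", "text-embedding-3-large")

print(same == again)        # identical text and namespace: cache hit
print(same == reworded)     # one word changed: cache miss
print(same == other_model)  # different namespace: cache miss
```

Under this assumption, a synonym or a model switch produces a new key, so the old entry is never reused and stays on disk until something purges it.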


##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

We can embed the same document twice while timing the embedding process.

Let's get a new document that we will then embed twice.

#### Determine Time Differential Caused by Embedding Caching

Create functions to time the embedding:
- create_documents - chunks the document
- create_embeddings - either creates embeddings or finds them in the cache

In [47]:
import time

def create_documents(file_path):
    """Load a PDF, split it into chunks, and tag each chunk with a source id."""
    loader = Loader(file_path)
    documents = loader.load()
    docs = text_splitter.split_documents(documents)
    for i, doc in enumerate(docs):
        doc.metadata["source"] = f"source_{i}"
    return docs


def create_embeddings(docs):
    """Add the chunks to the vector store and return the elapsed time in seconds."""
    start_time = time.time()  # Record the start time
    vectorstore.add_documents(docs)
    end_time = time.time()  # Record the end time
    return end_time - start_time  # Elapsed time


In [52]:
upload_widget = widgets.FileUpload(accept='.pdf', multiple=False)  # To only allow PDF uploads
display(upload_widget)

FileUpload(value=(), accept='.pdf', description='Upload')

In [53]:
file_path = save_uploaded_file(upload_widget)
print(file_path)
docs = create_documents(file_path)
time_1 = create_embeddings(docs)
time_2 = create_embeddings(docs)
diff = time_1 - time_2
print(f"Time 1: {time_1}  Time 2: {time_2} for a difference of {diff}")

File saved as: data/Agile and Scrum Fundamentals - Ryan Brooks.pdf
data/Agile and Scrum Fundamentals - Ryan Brooks.pdf
Time 1: 1.4840822219848633  Time 2: 0.055361270904541016 for a difference of 1.4287209510803223


- For the first document: Time 1: 10.824454307556152  Time 2: 0.6557285785675049 for a difference of 10.168725728988647
- For the second document: Time 1: 1.4840822219848633  Time 2: 0.055361270904541016 for a difference of 1.4287209510803223

#### So caching the embeddings can save a lot of time and tokens.
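
The reported timings can also be read as speedup ratios. The short calculation below uses only the four measurements listed above and works out to roughly 16x and 27x for the two documents; the numbers come from a single run each, so treat them as indicative rather than a benchmark.

```python
# (uncached seconds, cached seconds) taken from the runs reported above
timings = {
    "first document": (10.824454307556152, 0.6557285785675049),
    "second document": (1.4840822219848633, 0.055361270904541016),
}

speedups = {}
for name, (uncached, cached) in timings.items():
    speedups[name] = uncached / cached
    print(f"{name}: {speedups[name]:.1f}x faster, {uncached - cached:.2f} s saved")
```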

### Augmentation

We'll create the classic RAG prompt and create our `ChatPromptTemplate` as per usual.

In [100]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

As usual, we'll set up a `ChatOpenAI` model - and we'll use the fan favourite `gpt-4o-mini` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.

In [55]:
from langchain_core.globals import set_llm_cache
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4o-mini")

Setting up the cache can be done as follows:

In [56]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())
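
Conceptually, a prompt cache behaves like a dictionary keyed on the prompt together with a string describing the model configuration. The sketch below is an illustrative stand-in, not LangChain's `InMemoryCache`; `TinyLLMCache`, `cached_generate`, `fake_model`, and the `llm_string` values are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


class TinyLLMCache:
    """Illustrative in-memory cache keyed on (prompt, llm_string)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    def lookup(self, prompt: str, llm_string: str) -> Optional[str]:
        return self._store.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        self._store[(prompt, llm_string)] = response


def cached_generate(cache: TinyLLMCache, prompt: str, llm_string: str,
                    generate: Callable[[str], str]) -> str:
    hit = cache.lookup(prompt, llm_string)
    if hit is not None:
        return hit                      # seen before: reuse the stored response
    response = generate(prompt)         # not seen: call the model
    cache.update(prompt, llm_string, response)
    return response


calls: List[str] = []  # records every prompt that reached the (fake) model


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return f"response #{len(calls)} to: {prompt}"


cache = TinyLLMCache()
a = cached_generate(cache, "Tell me a story", "gpt-4o-mini|temp=0.7", fake_model)
b = cached_generate(cache, "Tell me a story", "gpt-4o-mini|temp=0.7", fake_model)
c = cached_generate(cache, "Tell me a story", "other-model|temp=0.7", fake_model)

print(a == b)      # → True
print(len(calls))  # → 2
```

The repeated prompt returns the identical stored text, while the same prompt under a different model configuration triggers a fresh call.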

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

#### ! Answer #2:

- The biggest problem I see arises when I really do want a different answer to the same prompt, either because the first answer was insufficient or because I just want a choice of responses.
- This is very common when creating images - when we reuse the same prompt we expect a different image.
- It also ignores the case where the user has set a high temperature and expects creative variation between calls.
- A cached answer goes stale if the data the LLM is accessing changes.
- It could be a problem if the front end lets the user switch between LLMs, each of which would most likely return a different response.

##### 🏗️ Activity #2:

Create a simple experiment that tests the LLM prompt cache.

Set up a simple prompt and chain.

In [98]:
from langchain_core.prompts import ChatPromptTemplate

system_template = "Provide a short concise answer to the question based on your knowledge"
human_template = "{content}"

experimental_chat_prompt = ChatPromptTemplate.from_messages([
    ("system", system_template),
    ("human", human_template)
])

chain = experimental_chat_prompt | chat_model

In [99]:
chain.invoke({"content": "Tell me a story"})
chain.invoke({"content": "Tell me a different story"})

AIMessage(content='Once in a small village, a young girl named Lila discovered a hidden garden filled with vibrant flowers that only bloomed under the moonlight. Each night, she would sneak away to explore this magical place, where she could hear the whispers of the flowers sharing secrets of the universe. One night, she met a wise old owl who told her that the garden was a sanctuary for lost dreams. Inspired, Lila decided to gather the dreams of her villagers and plant them in the garden. As the flowers blossomed, the villagers found renewed hope and purpose, transforming their lives. Lila learned that dreams, when nurtured, could create beauty and change in the world.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 135, 'prompt_tokens': 28, 'total_tokens': 163, 'completion_tokens_details': {'audio_tokens': None, 'reasoning_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-0

OK - let's create a generic timer function.

In [90]:
def time_invoke(message):
    start_time = time.time()  # Record the start time
    result = chain.invoke({"content": message})
    end_time = time.time()  # Record the end time
    elapsed_time = end_time - start_time  # Calculate elapsed time

    return {"result": result, "elapsed_time": elapsed_time}
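
A more general variant of this timer could wrap any callable. The sketch below is one possible shape, using `time.perf_counter`, a monotonic clock intended for measuring intervals; the `timed` helper is a hypothetical name, and `sum` stands in for a chain call.

```python
import time
from typing import Any, Callable, Dict


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Dict[str, Any]:
    """Call fn and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return {"result": result, "elapsed_time": elapsed}


outcome = timed(sum, range(1_000_000))
print(outcome["result"], f"{outcome['elapsed_time']:.4f} s")
```

With the chain defined earlier, the equivalent call would look like `timed(chain.invoke, {"content": "Tell me a story"})`.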

Let's time some prompts.

In [93]:
messages = ["Tell me a story about London", "Who is Jim Croce?", "How do you make French bread"]
responses = []
for message in messages:

    response_1 = time_invoke(message)
    response_2 = time_invoke(message)
    elapsed_1 = response_1["elapsed_time"]
    elapsed_2 = response_2["elapsed_time"]
    time_difference = abs(elapsed_1 - elapsed_2)

    comparison = {
        "message": message,
        "response_1": response_1,
        "response_2": response_2,
        "first_time": elapsed_1,
        "second_time": elapsed_2,
        "time_difference": time_difference
    }
    
    responses.append(comparison)

for response in responses:
    print(f"Message: {response['message']}")
    print(f"Message: {response['response_1']}")
    print(f"Message: {response['response_2']}")
    print(f"Time difference: {response['time_difference']} seconds\n")

Tell me a story about London
Tell me a story about London
Who is Jim Croce?
Who is Jim Croce?
How do you make French bread
How do you make French bread
Message: Tell me a story about London
Message: {'result': AIMessage(content='Once upon a time in London, a young artist named Clara roamed the bustling streets, sketching the iconic landmarks. One chilly autumn day, she stumbled upon an old, hidden bookstore in an alley. Inside, she found a dusty, leather-bound journal filled with stories of the city’s past. Intrigued, Clara took it home, and as she read, the characters came to life, leading her on adventures through the foggy streets of Victorian London.\n\nInspired, Clara began to paint the scenes from the journal, blending the past with her present. Her artwork caught the eye of a local gallery, and soon her exhibition, "London Through Time," became a sensation. People flocked from all over to see the city through Clara’s eyes.\n\nAs she stood at her opening night, surrounded by her 

In [94]:
for entry in responses:
    print(f"Message: {entry['message']}, Time Difference: {entry['time_difference']} seconds")

Message: Tell me a story about London, Time Difference: 2.901224136352539 seconds
Message: Who is Jim Croce?, Time Difference: 6.698512554168701 seconds
Message: How do you make French bread, Time Difference: 4.860957860946655 seconds


The results: 

- Tell me a story about London, Time Difference: 2.901224136352539 seconds
- Who is Jim Croce?, Time Difference: 6.698512554168701 seconds
- How do you make French bread, Time Difference: 4.860957860946655 seconds

#### So using a Prompt Cache can save a lot of time and tokens.

## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!

In [101]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | chat_model
    )
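
To illustrate what running the two halves of that first dictionary "in parallel" means, here is a standard-library analogy using a thread pool. It is a sketch of the idea only, not LCEL's internals; `run_parallel`, `slow_retriever`, and `get_question` are hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def run_parallel(steps: Dict[str, Callable[[Any], Any]], value: Any) -> Dict[str, Any]:
    """Run every step on the same input concurrently and collect results by key."""
    with ThreadPoolExecutor(max_workers=len(steps)) as pool:
        futures = {key: pool.submit(fn, value) for key, fn in steps.items()}
        return {key: future.result() for key, future in futures.items()}


def slow_retriever(inputs: Dict[str, str]) -> list:
    time.sleep(0.2)  # simulated vector-store lookup
    return [f"chunk about {inputs['question']}"]


def get_question(inputs: Dict[str, str]) -> str:
    time.sleep(0.2)  # artificial delay so the overlap is visible
    return inputs["question"]


start = time.perf_counter()
out = run_parallel(
    {"context": slow_retriever, "question": get_question},
    {"question": "What is Scrum?"},
)
elapsed = time.perf_counter() - start

print(out)
print(f"{elapsed:.2f} s")  # close to one delay, not the sum of both
```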

Let's test it out!

In [102]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

AIMessage(content='1. The document is titled "Agile and Scrum Fundamentals."\n2. It is authored by Ryan Brooks.\n3. The document is in PDF format, specifically version 1.5.\n4. The document was created using Microsoft PowerPoint 2013.\n5. The total number of pages is 80.\n6. The current page referenced is page 57.\n7. The document\'s creation date is May 15, 2020.\n8. The last modification date is also May 15, 2020.\n9. The document includes information on Agile and Scrum methodologies.\n10. It uses a presentation format, indicating a structured format for delivering content.\n11. Page 37 features a comparison of frameworks versus detailed manuals.\n12. The document appears to involve discussions on Test-Driven Development (TDD).\n13. It mentions "Swarming" as a concept related to Agile practices.\n14. The document lists steps in a process including "Story Kickoff!" and "Write Accept Tests."\n15. Unit testing is emphasized as part of the Agile process.\n16. It instructs on writing docu

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

#### LangSmith Trace With Parallel Steps

![image](images/langsmith_parallel_2.jpg)
![image](images/langsmith_parallel.jpg)

The images above represent the request for "Write 50 things about this document!" which used the retrieval_augmented_qa_chain.

    retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | chat_model
    )


This chain had 2 steps running in parallel:
- get the question and use the retriever to fetch documents for the context
- extract the question from the input dictionary

In LangSmith this parallel processing is represented by `RunnableParallel<context, question>`, which shows the steps run in parallel:
- map:key:context including the VectorStoreRetriever
- RunnableLambda

The diagrams below show a simpler chain from a previous run that has no parallel steps:

![image](images/langsmith_not_parallel.jpg)
![image](images/langsmith_not_parallel_2.jpg)


These diagrams show there is no parallel step - i.e. no RunnableParallel.