# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room, we'll explore how to set up a LangChain LCEL chain so that it takes advantage of the production-ready features LCEL offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.

🤝 BREAKOUT ROOM #1:
  - Task 1: Dependencies and Set-Up
  - Task 2: Setting up RAG With Production in Mind
  - Task 3: RAG LCEL Chain



## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to pin very specific versions today to try to mitigate potential environment issues!

> NOTE: Dependency issues are a large portion of what you're going to be tackling as you integrate new technology into your work - please keep in mind that one of the things you should be passively learning throughout this course is ways to mitigate dependency issues.
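One lightweight way to catch dependency drift is to compare what is actually installed against the pins. The sketch below is illustrative only: `PINNED` and `find_version_mismatches` are hypothetical names, the versions are copied from the install command in this section, and the distribution names are assumed to match how the packages are published on PyPI. Run it once the install cell has finished.

```python
from importlib import metadata

# Pins copied from the pip install command in this section.
# Distribution names are an assumption (hyphenated PyPI spellings).
PINNED = {
    "langchain-openai": "0.2.0",
    "langchain-community": "0.3.0",
    "langchain": "0.3.0",
    "PyMuPDF": "1.24.10",
    "qdrant-client": "1.11.2",
    "langchain-qdrant": "0.1.4",
    "langsmith": "0.1.121",
}


def find_version_mismatches(pins):
    """Return {package: (expected, installed_or_None)} for every unsatisfied pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in find_version_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pin is satisfied; anything printed is worth fixing before debugging the rest of the notebook.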

In [1]:
!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121

We'll need an OpenAI API Key:

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


And the LangSmith set-up:

In [3]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Week 8 Assignment 1 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

LangChain API Key:··········


Let's verify our project so we can leverage it in LangSmith later.

In [4]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Week 8 Assignment 1 - 0cb11d43


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.
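To make "asynchronous requests" concrete, here is a minimal standard-library sketch of what answering several questions concurrently looks like. `fake_chain_ainvoke` is a hypothetical stand-in for calling `ainvoke` on an LCEL chain, and the sleep simulates network latency; no LangChain code is exercised here.

```python
import asyncio


async def fake_chain_ainvoke(question):
    """Hypothetical stand-in for `await chain.ainvoke(...)`; the sleep simulates I/O wait."""
    await asyncio.sleep(0.05)
    return f"answer: {question}"


async def answer_all(questions):
    # All requests are in flight at once, so total wall time is roughly
    # one request's latency rather than the sum of all of them.
    return await asyncio.gather(*(fake_chain_ainvoke(q) for q in questions))


# In a script use asyncio.run; inside a notebook cell, use `await answer_all([...])` instead.
answers = asyncio.run(answer_all(["What is LCEL?", "What is caching?"]))
print(answers)
```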

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

⚠ If you're running in a non-Chrome browser, you may run into issues with this cell. In that case, upload the file using Colab's file upload and set `file_path` to the uploaded file's path in the `file_path` cell below. ⚠

![image](https://i.imgur.com/Qa1Uwlj.png)



> NOTE: You can skip this step if you are running locally - please just point to your local file.

In [5]:
from google.colab import files
uploaded = files.upload()

Saving 2307.06435v9.pdf to 2307.06435v9 (1).pdf


In [6]:
file_path = list(uploaded.keys())[0]
file_path

'2307.06435v9 (1).pdf'

We'll define our chunking strategy.

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
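To build intuition for what `chunk_size` and `chunk_overlap` control, the sketch below uses a plain fixed-width sliding window. This is a deliberate simplification and an assumption of this example: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraphs and lines, so its real chunk boundaries will differ. `sliding_window_chunks` is a hypothetical helper, not part of LangChain.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Each chunk repeats the last two characters of the previous one, which is the role the 100-character overlap plays for the 1000-character chunks configured above.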

We'll chunk our uploaded PDF file.

In [8]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### QDrant Vector Database - Cache Backed Embeddings

The process of embedding is typically very time-consuming - for every single vector in our vector database, as well as for every query, we must:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our text and its embeddings (similar to, or in some cases literally, a vector database)
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the cached vector representation
  - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc). Wait for processing and proceed
3. Store the text that was converted alongside its vector representation in the cache.
4. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".
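The cache-first flow can be sketched with nothing more than a dictionary. Everything here is illustrative: `DictCachedEmbedder` and `fake_embed` are hypothetical names, the "embedding" is a toy two-number vector, and the real `CacheBackedEmbeddings` class may differ in its details.

```python
import hashlib


class DictCachedEmbedder:
    """Toy cache-backed embedder: only texts never seen before reach embed_fn."""

    def __init__(self, embed_fn, namespace):
        self._embed_fn = embed_fn
        self._namespace = namespace  # keeps vectors from different models apart
        self._cache = {}
        self.misses = 0  # number of texts actually sent to embed_fn

    def _key(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self._namespace}:{digest}"

    def embed_documents(self, texts):
        # De-duplicate the texts that are not cached yet, preserving order.
        missing = list(dict.fromkeys(t for t in texts if self._key(t) not in self._cache))
        if missing:
            self.misses += len(missing)
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._cache[self._key(text)] = vector
        return [self._cache[self._key(t)] for t in texts]


def fake_embed(texts):
    """Hypothetical stand-in for a slow, paid embedding API."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


embedder = DictCachedEmbedder(fake_embed, namespace="fake-model")
embedder.embed_documents(["alpha", "beta"])           # two misses
embedder.embed_documents(["alpha", "beta", "gamma"])  # only "gamma" is new
print(embedder.misses)
```

Including the model name in the key mirrors the `namespace` argument used in the next cell: vectors produced by different embedding models must never be mixed.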

Let's see how this is implemented in the code.

In [9]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings

# Typical Embedding Model
core_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Typical QDrant Client Set-up
collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Adding cache!
store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings, store, namespace=core_embeddings.model
)

# Typical QDrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)
vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

One issue is that the cache consumes extra storage: with a lot of documents, it can eat up disk space (for a file store) or RAM (for an in-memory cache). One also has to be careful that the cache is not shared across user sessions, otherwise one user could access another user's data. In addition, any change to the underlying knowledge makes the cached data stale, so it has to be removed, and it would probably be difficult to determine which entries need to be flushed.

This approach is useful for small documents where one needs to do a fast process as the caching will help in performance.

##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [10]:
import time
query = "What does the document say about open source models?"
_begin = time.time()
retriever.invoke(query)
_end = time.time()
_begin2 = time.time()
retriever.invoke(query)
_end2 = time.time()
print(f"Time for first retrieval: {_end-_begin} sec")
print(f"Time for second retrieval: {_end2-_begin2} sec")

Time for first retrieval: 0.5727956295013428 sec
Time for second retrieval: 0.45288562774658203 sec


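The second retrieval took about 21% less time (0.45 s vs. 0.57 s). Note that, by default, `CacheBackedEmbeddings` caches document embeddings but not query embeddings (unless a `query_embedding_cache` is provided), so a single-run difference this small may simply be network noise rather than a cache hit.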
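A single pair of timings is easy to misread, so it can help to repeat the call several times and compare the first run against the rest. The helper below is a minimal sketch: `time_calls` and `slow_stand_in` are hypothetical names, and the sleep stands in for a real retrieval call.

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Call fn `repeats` times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def slow_stand_in():
    """Hypothetical stand-in for a retrieval call."""
    time.sleep(0.01)


durations = time_calls(slow_stand_in, repeats=3)
print(f"first call: {durations[0]:.4f} s")
print(f"median of later calls: {statistics.median(durations[1:]):.4f} s")
```

In the notebook, the same helper could be called as `time_calls(lambda: retriever.invoke(query))`.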

### Augmentation

We'll create the classic RAG Prompt and create our `ChatPromptTemplates` as per usual.

In [11]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

Like usual, we'll set-up a `ChatOpenAI` model - and we'll use the fan favourite `gpt-4o-mini` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.
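As a rough mental model, an exact-match prompt cache is a lookup table keyed on the prompt text plus the model configuration. The sketch below is illustrative only: `ExactMatchCache`, `cached_generate`, and `fake_llm` are hypothetical names, and LangChain's own cache classes may behave differently in the details.

```python
class ExactMatchCache:
    """Maps (prompt, llm_config) to a previously generated response."""

    def __init__(self):
        self._store = {}

    def lookup(self, prompt, llm_config):
        return self._store.get((prompt, llm_config))

    def update(self, prompt, llm_config, response):
        self._store[(prompt, llm_config)] = response


def cached_generate(cache, llm_fn, prompt, llm_config):
    """Return the cached response if present; otherwise call the model and store the result."""
    cached = cache.lookup(prompt, llm_config)
    if cached is not None:
        return cached
    response = llm_fn(prompt)
    cache.update(prompt, llm_config, response)
    return response


api_calls = []


def fake_llm(prompt):
    """Hypothetical stand-in for a chat model call; records every real invocation."""
    api_calls.append(prompt)
    return f"response to: {prompt}"


cache = ExactMatchCache()
config = "model=gpt-4o-mini"
cached_generate(cache, fake_llm, "Hello", config)   # miss: the model is called
cached_generate(cache, fake_llm, "Hello", config)   # hit: served from the cache
cached_generate(cache, fake_llm, "Hello!", config)  # one extra character, so a miss
print(len(api_calls))
```

Because the key is the exact prompt string, even a small change in wording or in the retrieved context produces a miss.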

In [12]:
from langchain_core.globals import set_llm_cache
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4o-mini")

Setting up the cache can be done as follows:

In [13]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

One issue is that the underlying model can change: providers such as OpenAI frequently update their models, and those changes would make the cached responses stale or incorrect.

Also, if one relies on a high temperature setting to get varied outputs, this will not work, as the cache always returns the same response for the same prompt.

##### 🏗️ Activity #2:

Create a simple experiment that tests the LLM cache.

In [14]:
import time
query = "When was this paper published?"
_begin = time.time()
chat_model.invoke(query)
_end = time.time()
_begin2 = time.time()
chat_model.invoke(query)
_end2 = time.time()
print(f"Time for first llm call: {_end-_begin} sec")
print(f"Time for second llm call: {_end2-_begin2} sec")

Time for first llm call: 1.422450304031372 sec
Time for second llm call: 0.0022842884063720703 sec


## Task 3: RAG LCEL Chain

We'll also set-up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!
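The idea of running both halves of the first link at the same time can be illustrated with a thread pool. This is a standard-library sketch of the concept, not LCEL's actual implementation: `run_branches_in_parallel`, `fake_retriever`, and `pass_question` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_branches_in_parallel(branches, payload):
    """Call every branch with the same payload on its own thread; return {name: result}."""
    with ThreadPoolExecutor(max_workers=max(1, len(branches))) as pool:
        futures = {name: pool.submit(branch, payload) for name, branch in branches.items()}
        return {name: future.result() for name, future in futures.items()}


def fake_retriever(payload):
    """Hypothetical stand-in for `itemgetter("question") | retriever`."""
    return [f"chunk about {payload['question']}"]


def pass_question(payload):
    """Stand-in for `itemgetter("question")`."""
    return payload["question"]


first_link = {"context": fake_retriever, "question": pass_question}
print(run_branches_in_parallel(first_link, {"question": "caching"}))
```

In this particular chain only the retriever branch does meaningful work, so the gain is small; the pattern pays off when several branches each make a slow call.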

In [15]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | chat_model
    )

Let's test it out!

In [16]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

AIMessage(content='1. The document is titled "A Comprehensive Overview of Large Language Models."\n2. It was authored by Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian.\n3. The document has a total of 46 pages.\n4. It is in PDF format, specifically PDF 1.5.\n5. The creation date of the document is April 11, 2024.\n6. The document\'s metadata includes a unique identifier (_id) for reference.\n7. The authors are affiliated with research in computational linguistics and artificial intelligence.\n8. The document discusses various large language models (LLMs).\n9. It references multiple datasets such as PanGu-Σ, WuDaoCorpora, and CLUE.\n10. The document includes contributions from various research papers and preprints.\n11. It lists models like BloombergGPT and LLaMA-2.\n12. The document is produced by pdfTeX-1.40.25.\n13. It was created using LaTeX with hyperref.\n14. The authors explore model architectures an

In [17]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

AIMessage(content='1. The document is titled "A Comprehensive Overview of Large Language Models."\n2. It was authored by Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian.\n3. The document has a total of 46 pages.\n4. It is in PDF format, specifically PDF 1.5.\n5. The creation date of the document is April 11, 2024.\n6. The document\'s metadata includes a unique identifier (_id) for reference.\n7. The authors are affiliated with research in computational linguistics and artificial intelligence.\n8. The document discusses various large language models (LLMs).\n9. It references multiple datasets such as PanGu-Σ, WuDaoCorpora, and CLUE.\n10. The document includes contributions from various research papers and preprints.\n11. It lists models like BloombergGPT and LLaMA-2.\n12. The document is produced by pdfTeX-1.40.25.\n13. It was created using LaTeX with hyperref.\n14. The authors explore model architectures an

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

Below are the screenshots for the traces of the first and second retrievals/LLM calls. The first one will not experience any caching but the second one should. 

![image](langsmith-overall.png)

### Retriever
#### Below are the traces for the retriever for the first call.
![image](langsmith-retriever-first.png)

#### Below are the traces for the retriever for the second call. Surprisingly, they do not show an improvement in latency.
![image](langsmith-retriever.png)

### LLM
#### Below are the traces for the LLM for the first call.
![image](langsmith-llm-first.png)

#### Below are the traces for the LLM for the second call. We see a substantial improvement in latency.
![image](langsmith-llm-second.png)