# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room, we'll explore how to set up a LangChain LCEL chain so that it takes advantage of the production-ready features LCEL offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to pin very specific versions today to try to mitigate potential environment issues!

> NOTE: Dependency issues are a large portion of what you're going to be tackling as you integrate new technology into your work - please keep in mind that one of the things you should be passively learning throughout this course is ways to mitigate dependency issues.

In [1]:
!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121

We'll need an OpenAI API Key:

In [3]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


And the LangSmith set-up:

In [4]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Week 8 Assignment 1 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

LangChain API Key:··········


Let's verify our project so we can leverage it in LangSmith later.

In [5]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Week 8 Assignment 1 - 9229a539


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.
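To make "out of the box" a bit more concrete, here is a minimal standard-library sketch of the idea behind pipe composition: once steps are composed with `|`, a synchronous and an asynchronous entry point can both be derived from the same definition. The `Step` class and its methods are hypothetical stand-ins written for this illustration; they are not LangChain's implementation.

```python
import asyncio
from typing import Any, Callable


class Step:
    """Hypothetical stand-in for a runnable: wraps a plain function."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        # Synchronous entry point.
        return self.func(value)

    async def ainvoke(self, value: Any) -> Any:
        # Async entry point derived from the sync one via a worker thread.
        return await asyncio.to_thread(self.func, value)


# Compose three small steps into one chain.
chain = Step(str.strip) | Step(str.upper) | Step(lambda s: f"{s}!")

sync_result = chain.invoke("  hello lcel ")
async_result = asyncio.run(chain.ainvoke("  hello lcel "))
print(sync_result, async_result)
```

The point of the sketch is only that the chain author writes the pipeline once and gets both calling styles without extra code.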

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

In [6]:
from google.colab import files
uploaded = files.upload()

Saving Blueprint-for-an-AI-Bill-of-Rights.pdf to Blueprint-for-an-AI-Bill-of-Rights (4).pdf


In [7]:
file_path = list(uploaded.keys())[0]
file_path

'Blueprint-for-an-AI-Bill-of-Rights (4).pdf'

We'll define our chunking strategy.

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
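To build some intuition for what `chunk_size=1000` and `chunk_overlap=100` mean, here is a simplified sketch of fixed-size chunking with overlap. It is an assumption-laden illustration: `RecursiveCharacterTextSplitter` also tries to split on natural separators (paragraphs, lines, words) before falling back to raw character counts, which this hypothetical `chunk_text` helper does not do.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Toy input: 2,500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

Each chunk starts 900 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next.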

We'll chunk our uploaded PDF file.

In [9]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### Qdrant Vector Database - Cache Backed Embeddings

The process of embedding is typically a very time-consuming one - we must, for every single vector in our VDB as well as every query:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our texts and their embeddings (similar to, or in some cases literally, a vector database)
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the cached vector representation
  - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc.), wait for processing, and proceed
3. Store the text that was converted alongside its vector representation in the cache.
4. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".
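The check-then-embed flow can be sketched with nothing but the standard library. Everything here is hypothetical and for illustration only: `CachedEmbedder`, `fake_embed`, and the `api_calls` counter are made-up names, and the toy "embedding" is just two numbers. It is not how `CacheBackedEmbeddings` is implemented.

```python
import hashlib
from typing import Callable


class CachedEmbedder:
    """Hypothetical cache wrapper around any text -> vector function."""

    def __init__(self, embed_fn: Callable[[str], list[float]], namespace: str):
        self.embed_fn = embed_fn
        self.namespace = namespace  # keeps vectors from different models apart
        self.store: dict[str, list[float]] = {}
        self.api_calls = 0

    def _key(self, text: str) -> str:
        # Hash the text so the key has a fixed size regardless of input length.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.namespace}:{digest}"

    def embed(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self.store:
            return self.store[key]  # cache hit: no API call
        self.api_calls += 1  # cache miss: pay for the embedding
        vector = self.embed_fn(text)
        self.store[key] = vector
        return vector


def fake_embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding API: [length, vowel count].
    return [float(len(text)), float(sum(ch in "aeiou" for ch in text))]


embedder = CachedEmbedder(fake_embed, namespace="toy-model")
first = embedder.embed("hello world")
second = embedder.embed("hello world")  # served from the cache
third = embedder.embed("goodbye")
print(embedder.api_calls)
```

Three requests result in only two calls to the underlying embedding function, because the repeated text is answered from the cache.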

Let's see how this is implemented in the code.

In [10]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings

# Typical Embedding Model
core_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Typical QDrant Client Set-up
collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Adding cache!
store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings, store, namespace=core_embeddings.model
)

# Typical QDrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)
vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})
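The retriever above uses `search_type="mmr"` (maximal marginal relevance), which trades off relevance to the query against redundancy among the results already picked. Below is a rough, self-contained sketch of that selection rule on toy 2-D vectors. The function names, the default weight of 0.5, and the vectors are assumptions chosen for this illustration; the real retriever delegates this work to the vector store.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k=3, lambda_mult=0.5):
    """Greedily pick k candidate indices, balancing relevance and diversity."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # 0: very relevant
    [1.0, 0.11],  # 1: near-duplicate of 0
    [0.8, -0.6],  # 2: fairly relevant, points in a different direction
    [0.0, 1.0],   # 3: irrelevant
]

diverse = mmr_select(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With the diversity penalty active, the near-duplicate at index 1 is skipped in favour of index 2; with `lambda_mult=1.0` the rule reduces to plain top-k by similarity.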

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

> ANSWER: We would likely need a lot of space to cache a lot of documents. If we cache a lot of documents, checking whether we've seen a given document before might take a long time. Finally, this method might not be very helpful if the probability of the same document being embedded again is low.

##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [11]:
### YOUR CODE HERE
%%time
vectorstore.add_documents(docs)
# Adding the same documents a second time is much faster because the embeddings come from the cache instead of being recomputed.
# Note: this call also inserts a second copy of every chunk into the collection.

CPU times: user 301 ms, sys: 18.2 ms, total: 319 ms
Wall time: 319 ms


['b72a05c704f945d8a38276730cc00e80',
 '4883c1b57a564ba2ba401827d3da5259',
 'b6436f7cbf5147f1a64ac07f1ad7cc9f',
 '93a05b10243343d79b050d19bb8c65f3',
 '2bee964e83174689a6c743951da1189e',
 '787ceee3e0ee4299b53fb7806090592a',
 'fcba3c8bd17f439c97bee6c8717613fb',
 '36bc08dec4d749ed82e93b2e514ceb5d',
 '6ea72ad1c82f4b05bb65fd4b24197f38',
 'e139c85466bf491f8ad1f93e5eb856b9',
 'fb115401282c44aca852ce0d1bb938aa',
 'e2a58c6051d54063bd44ef3ae1412dfa',
 '88edd6b878ce4d7ebc9f9609ee43eebe',
 'fd3d4a33231f4160801e77fdde715298',
 '623dde07292243c98d2a788c00fe8274',
 '8e5360f42a3343ceaee3f5341d62ad58',
 '98ef25b105e24cb8beace36a5d78d58f',
 '60922961050b4679b5bae364f65f1499',
 '4a87db8c884f4f51a4cb533a5cdb11c7',
 '1d16ba3af1d841b8b6fed6d67ee2952a',
 '580a518a786a443893cc14aa937070c4',
 '7bee89d54e7c4abdb9a7c54aed53955c',
 'd36870e1f35341958867f3113d92ca37',
 '0b6fefb8477f45d6918ab693eba6a67e',
 '342396bf773c4426b64c09990c449ce4',
 '964797c32adf4a5eb79d67046e3b5f70',
 '673a714b919f4b6dad728a41f9355b99',
 

### Augmentation

We'll create the classic RAG prompt and build our `ChatPromptTemplate` as per usual.

In [12]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])
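As a rough picture of what the template does at run time, the placeholders `{question}` and `{context}` are filled in and the result becomes a list of role-tagged messages. The sketch below imitates that with plain `str.format`; `build_messages` is a hypothetical helper, and joining the chunks with blank lines is an assumption rather than what `ChatPromptTemplate` does with retrieved documents.

```python
SYSTEM_TEMPLATE = "You are a helpful assistant that uses the provided context to answer questions."
USER_TEMPLATE = "Question:\n{question}\nContext:\n{context}\n"


def build_messages(question: str, context_chunks: list[str]) -> list[dict[str, str]]:
    """Fill the user template and return role-tagged messages."""
    context = "\n\n".join(context_chunks)  # assumption: chunks separated by blank lines
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE},
        {"role": "user", "content": USER_TEMPLATE.format(question=question, context=context)},
    ]


messages = build_messages("What is LCEL?", ["chunk one", "chunk two"])
print(messages[1]["content"])
```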

### Generation

As usual, we'll set up a `ChatOpenAI` model - and we'll use the fan favourite `gpt-4o-mini` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.
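Here is a minimal sketch of that idea, assuming the cache is keyed on the exact prompt text plus a string describing the model configuration. The `lookup`/`update` method names loosely mirror LangChain's cache interface, but `TinyLLMCache`, `cached_call`, and `fake_model` are hypothetical and exist only for this illustration.

```python
from typing import Callable, Optional


class TinyLLMCache:
    """Hypothetical in-memory cache keyed by (prompt, model description)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], str] = {}

    def lookup(self, prompt: str, llm_string: str) -> Optional[str]:
        return self._store.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        self._store[(prompt, llm_string)] = response


def cached_call(cache: TinyLLMCache, model_fn: Callable[[str], str],
                prompt: str, llm_string: str) -> str:
    hit = cache.lookup(prompt, llm_string)
    if hit is not None:
        return hit  # seen before: reuse the stored response
    response = model_fn(prompt)  # not seen: call the model and remember the answer
    cache.update(prompt, llm_string, response)
    return response


calls: list[str] = []


def fake_model(prompt: str) -> str:
    # Stand-in for a real chat model; records every real "API call".
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = TinyLLMCache()
a = cached_call(cache, fake_model, "What can we use AI for?", "model=gpt-4o-mini")
b = cached_call(cache, fake_model, "What can we use AI for?", "model=gpt-4o-mini")
c = cached_call(cache, fake_model, "What can we use AI for?", "model=other-model")
print(len(calls))
```

The same prompt sent to a differently configured model is treated as a miss, which is why the model description is part of the key.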

In [13]:
from langchain_core.globals import set_llm_cache
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4o-mini")

Setting up the cache can be done as follows:

In [14]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

> ANSWER: Similarly, we might need a lot of space to cache all users' queries. If we cache a large set of queries, checking whether we've seen one before could take a long time. Keeping a record of users' queries can also raise privacy concerns. Lastly, this method is only useful if the probability of a repeated query (an exact match) is high; it doesn't help if we rarely see the same query twice.
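The exact-match limitation is easy to demonstrate. In the sketch below, two queries that a person would consider identical produce different cache keys, so the second would be a miss. The `normalized_key` helper is a hypothetical mitigation (lowercasing and collapsing whitespace) and is an assumption on our part; it would be the wrong choice for prompts where case or spacing matters.

```python
import hashlib


def cache_key(text: str) -> str:
    # Exact-match key: any change to the text changes the hash.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def normalized_key(text: str) -> str:
    # Hypothetical mitigation: lowercase and collapse whitespace before hashing.
    return cache_key(" ".join(text.lower().split()))


q1 = "What can we use AI for?"
q2 = "what can we use AI for? "  # different capitalisation, trailing space

exact_hit = cache_key(q1) == cache_key(q2)
normalized_hit = normalized_key(q1) == normalized_key(q2)
print(exact_hit, normalized_hit)
```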

##### 🏗️ Activity #2:

Create a simple experiment that tests the LLM cache.

In [15]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

In [16]:
### YOUR CODE HERE
rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | chat_prompt | chat_model | StrOutputParser()
)

In [17]:
%%time
rag_chain.invoke({"question": "What can we use AI for?"})
# The first call took 2.75 s of wall time (537 ms of CPU time)

CPU times: user 379 ms, sys: 158 ms, total: 537 ms
Wall time: 2.75 s


"AI can be used for a variety of applications, including:\n\n1. **Healthcare**: AI technologies assist in medical diagnostics, clinical decision-making, health risk assessments, and drug addiction risk assessments. They also support wellness applications and insurance care allocation.\n\n2. **Financial Services**: AI algorithms play a role in loan allocation, credit scoring, financial system access determinations, and risk assessments. They can also automate interest rate determinations and apply penalties in financial systems.\n\n3. **Surveillance and Management**: AI can be integrated into surveillance systems and management frameworks to enhance decision-making processes.\n\n4. **Predictive Models**: AI can utilize algorithms to forecast outcomes in various domains, which supports proactive decision-making.\n\n5. **Transparency and Trust**: AI systems can be designed to engage with stakeholders, ensure transparency, and build public trust through participatory design and explanation

In [18]:
%%time
rag_chain.invoke({"question": "What can we use AI for?"})
# The second call, using the exact same query, took 124 ms of wall time (73.7 ms of CPU time)

CPU times: user 58.3 ms, sys: 15.4 ms, total: 73.7 ms
Wall time: 124 ms


"AI can be used for a variety of applications, including:\n\n1. **Healthcare**: AI technologies assist in medical diagnostics, clinical decision-making, health risk assessments, and drug addiction risk assessments. They also support wellness applications and insurance care allocation.\n\n2. **Financial Services**: AI algorithms play a role in loan allocation, credit scoring, financial system access determinations, and risk assessments. They can also automate interest rate determinations and apply penalties in financial systems.\n\n3. **Surveillance and Management**: AI can be integrated into surveillance systems and management frameworks to enhance decision-making processes.\n\n4. **Predictive Models**: AI can utilize algorithms to forecast outcomes in various domains, which supports proactive decision-making.\n\n5. **Transparency and Trust**: AI systems can be designed to engage with stakeholders, ensure transparency, and build public trust through participatory design and explanation

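Using the wall times reported by the two `%%time` runs above (2.75 s for the first call, 124 ms for the second), we can work out the rough speedup from the cache. This is just arithmetic on a single pair of measurements, so treat the result as indicative rather than a benchmark.

```python
first_call_wall_s = 2.75    # wall time of the first invoke
second_call_wall_s = 0.124  # wall time of the second invoke (124 ms)

speedup = first_call_wall_s / second_call_wall_s
time_saved_s = first_call_wall_s - second_call_wall_s

print(f"speedup: {speedup:.1f}x, time saved: {time_saved_s:.3f} s")
```

That works out to roughly a 22x faster response for the repeated query in this run.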
## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

However, this time we'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!
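To see why running the two halves in parallel matters, here is a small standard-library sketch that runs every branch of a dict concurrently on a thread pool and collects the results under the same keys. `run_parallel` and the two slow branch functions are hypothetical; the sleeps stand in for a retriever call and any other slow step, and this is only an approximation of what LCEL does for a dict step.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_parallel(branches: dict[str, Callable[[Any], Any]], value: Any) -> dict[str, Any]:
    """Run every branch on the same input concurrently and collect results by key."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, value) for name, fn in branches.items()}
        return {name: future.result() for name, future in futures.items()}


def slow_retrieve(inputs: dict) -> list[str]:
    time.sleep(0.3)  # pretend this is a vector store lookup
    return [f"chunk about {inputs['question']}"]


def slow_passthrough(inputs: dict) -> str:
    time.sleep(0.3)  # pretend this branch is slow too
    return inputs["question"]


start = time.perf_counter()
result = run_parallel(
    {"context": slow_retrieve, "question": slow_passthrough},
    {"question": "AI"},
)
elapsed = time.perf_counter() - start
print(result, f"{elapsed:.2f} s")
```

Run sequentially, the two branches would take about 0.6 s; run concurrently, the total is close to the slower branch alone.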

In [19]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | chat_model
    )

Let's test it out!

In [21]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

AIMessage(content='1. Title of the document is "Blueprint for an AI Bill of Rights."\n2. The document is formatted as a PDF version 1.6.\n3. The total number of pages in the document is 73.\n4. The document was created using Adobe Illustrator 26.3 on a Macintosh.\n5. It was produced by iLovePDF.\n6. The document\'s creation date is September 20, 2022.\n7. The last modification date of the document is October 3, 2022.\n8. The document includes a "Table of Contents."\n9. The first section is titled "FROM PRINCIPLES TO PRACTICE: A TECHNICAL COMPANION TO THE BLUEPRINT FOR AN AI BILL OF RIGHTS."\n10. The document contains guidelines for "USING THIS TECHNICAL COMPANION."\n11. There is a section focused on "SAFE AND EFFECTIVE SYSTEMS."\n12. The document addresses "ALGORITHMIC DISCRIMINATION PROTECTIONS."\n13. It includes information on "DATA PRIVACY."\n14. The document emphasizes the need for "NOTICE AND EXPLANATION."\n15. "HUMAN ALTERNATIVES, CONSIDERATION, AND FALLBACK" is another key secti

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

## With Cache
![With Cache](with_cache.png)

## Without Cache
![No Cache](no_cache.png)

In [None]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings

# Typical Embedding Model
core_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Typical QDrant Client Set-up
collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Typical QDrant Vector Store Set-up, this time with the raw embedding model (no cache wrapper)
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=core_embeddings)
vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})

In [None]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | chat_model
    )

In [13]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

AIMessage(content='1. The document is titled "Blueprint for an AI Bill of Rights."\n2. It consists of 73 pages in total.\n3. It was created using Adobe Illustrator 26.3 on a Macintosh.\n4. The document was produced by iLovePDF.\n5. The creation date of the document is September 20, 2022.\n6. The last modification date is October 3, 2022.\n7. It is formatted as PDF 1.6.\n8. The document includes a technical companion section.\n9. The table of contents outlines various sections.\n10. One section focuses on "Safe and Effective Systems."\n11. There are protections against algorithmic discrimination.\n12. Data privacy is a key topic discussed in the document.\n13. The document emphasizes the importance of notice and explanation.\n14. Human alternatives and considerations are included in the discussions.\n15. An appendix provides additional information.\n16. The document gives examples of automated systems.\n17. It mentions the importance of listening to the American people.\n18. The page co