# LangSmith and Evaluation Overview with AI Makerspace

Today we'll be looking at an amazing tool:

[LangSmith](https://docs.smith.langchain.com/)!

This tool will help us monitor, test, debug, and evaluate our LangChain applications - and more!

We'll also be looking at a few Advanced Retrieval techniques along the way - and evaluating them using LangSmith!

✋BREAKOUT ROOM #2:
- Task 1: Dependencies and OpenAI API Key
- Task 2: LCEL RAG Chain
- Task 3: Setting Up LangSmith
- Task 4: Examining the Trace in LangSmith!
- Task 5: Create Testing Dataset
- Task 6: Evaluation

## Task 1: Dependencies and OpenAI API Key

We'll be using OpenAI's suite of models today to help us generate and embed our documents for a simple RAG system built on top of LangChain's blogs!

In [1]:
!pip install langchain_core langchain_openai langchain_community langchain-qdrant qdrant-client langsmith openai tiktoken cohere lxml -qU

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

#### Asyncio Bug Handling

This is necessary for Colab.

In [3]:
import nest_asyncio
nest_asyncio.apply()

## Task 2: Create a Simple RAG Application Using Qdrant, OpenAI, and LCEL

Now that we have a grasp on how LCEL works, and how we can use LangChain and OpenAI to interact with our data - let's step it up a notch and incorporate Qdrant!

## LangChain Powered RAG

First and foremost, LangChain provides a convenient way to store our chunks and their embeddings.

It's called a `VectorStore`!

We'll be using Qdrant as our `VectorStore` today. You can read more about it [here](https://qdrant.tech/documentation/).

Think of a `VectorStore` as a smart way to house your chunks and their associated embedding vectors. The implementation of the `VectorStore` also allows for smarter and more efficient search of our embedding vectors - as the method we used above would not scale well as we got into the millions of chunks.

Otherwise, the process remains relatively similar under the hood!
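To make the scaling concern concrete, here is a minimal sketch of what a naive store does on every query: it compares the query vector against every stored vector. The `ToyVectorStore` class and the three-dimensional vectors are hypothetical and only for illustration; a real `VectorStore` such as Qdrant uses indexing so it does not have to scan everything.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store that keeps (text, vector) pairs."""

    def __init__(self):
        self.items = []

    def add(self, text, vector):
        self.items.append((text, vector))

    def search(self, query_vector, k=2):
        # Brute-force scan: one similarity computation per stored chunk.
        scored = [(cosine_similarity(query_vector, vec), text) for text, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
store.add("LangSmith tracing", [0.9, 0.1, 0.0])   # made-up embedding
store.add("Qdrant indexing", [0.1, 0.9, 0.0])     # made-up embedding
store.add("Prompt templates", [0.0, 0.2, 0.9])    # made-up embedding

top_hits = store.search([1.0, 0.0, 0.0], k=2)
print(top_hits)
```

The scan costs one comparison per stored chunk, which is fine for three items but becomes slow with millions of chunks.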

We'll use a `SitemapLoader` to scrape the LangChain blogs - which will serve as our data for today!

### Data Collection

We'll be leveraging the `SitemapLoader` to load our blog posts directly from the web!

In [4]:
from langchain_community.document_loaders import SitemapLoader

documents = SitemapLoader(web_path="https://blog.langchain.dev/sitemap-posts.xml").load()

USER_AGENT environment variable not set, consider setting it to identify your requests.
Fetching pages: 100%|##########| 220/220 [00:19<00:00, 11.47it/s]


These are posts from the LangChain blog.

![Langchain Posts](images/langchain_posts.jpg)

In [5]:
# let's look at what a document looks like
print(documents[0])


page_content='
How Podium optimized agent behavior and reduced engineering intervention by 90% with LangSmith
Skip to content
All Posts
Case Studies
In the Loop
LangChain
Docs
Changelog
Sign in
Subscribe
How Podium optimized agent behavior and reduced engineering intervention by 90% with LangSmith
See how Podium tests across the lifecycle development of their AI employee agent, using LangSmith for dataset curation and finetuning. They improved agent F1 response quality to 98% and reduced the need for engineering intervention by 90%.

5 min read
Aug 15, 2024





About PodiumPodium is a communication platform that helps small businesses connect quickly with customers via phone, text, email, and social media. Small businesses often have high-touch interactions with customers — think automotive dealers, jewelers, bike shops — yet are understaffed. Podium's mission is to help the

### Chunking Our Documents

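Let's do the same process as we did before with our `RecursiveCharacterTextSplitter` - but this time we'll use ~500 characters as our max chunk size! Note that because we pass `length_function = len`, the chunk size is measured in characters, not tokens.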

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 40,
    length_function = len,
)

split_chunks = text_splitter.split_documents(documents)

In [7]:
print(f"Number of chunks: {len(split_chunks)}")

Number of chunks: 5087


Alright, now we have 5087 chunks of at most ~500 characters each.
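To see how a chunk size and an overlap interact when lengths are counted with `len`, here is a simplified sketch. The helper `chunk_by_characters` is hypothetical and uses plain fixed-size windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to break on separators such as paragraphs and sentences.

```python
def chunk_by_characters(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early so the last window is not just a copy of the previous overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample_text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters of hypothetical input
chunks = chunk_by_characters(sample_text, chunk_size=10, chunk_overlap=4)

for chunk in chunks:
    print(len(chunk), chunk)
```

Each window is at most 10 characters long, and the last 4 characters of one window are the first 4 of the next.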

Let's verify the process worked as intended by checking our max document length.

In [8]:
max_chunk_length = 0

for chunk in split_chunks:
  max_chunk_length = max(max_chunk_length, len(chunk.page_content))

print(f"max chunk length: {max_chunk_length}")

max chunk length: 499


Perfect! Now we can carry on to creating and storing our embeddings.

### Embeddings and Vector Storage

We'll use the `text-embedding-3-small` embedding model again - and `Qdrant` to store all our embedding vectors for easy retrieval later!

In [9]:
from langchain_community.vectorstores import Qdrant
from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

qdrant_vectorstore = Qdrant.from_documents(
    documents=split_chunks,
    embedding=embedding_model,
    location=":memory:"
)

Now let's set up our retriever, just as we saw before, but this time using LangChain's simple `as_retriever()` method!

In [10]:
qdrant_retriever = qdrant_vectorstore.as_retriever()

#### Back to the Flow

We're ready to move to the next step!

### Setting up our RAG

We'll use the LCEL we touched on earlier to create a RAG chain.

Let's think through each part:

1. First we need to retrieve context
2. We need to pipe that context to our model
3. We need to parse that output

Let's start by setting up our prompt again, just so it's fresh in our minds!

<div style="border: 2px solid white; background: black; padding: 10px;">

#### 🏗️ Activity #2:

Complete the prompt so that your RAG application answers queries based on the context provided, but *does not* answer queries if the context is unrelated to the query.

</div>

In [11]:
from langchain_core.prompts import ChatPromptTemplate

base_rag_prompt_template = """\
You are a helpful assistant.
Use the context provided below to answer the question asked below. 
Only use the information in the context.
Do not use information from other sources or documents or websites.
If you do not know the answer based on the information in the context respond with 'I have insufficient information to answer that'.
Be clear with your answers and concise.
Ensure your answers are correct.

Context:
{context}

Question:
{question}
"""

base_rag_prompt = ChatPromptTemplate.from_template(base_rag_prompt_template)

We'll set our Generator - `gpt-4o-mini` in this case - below. This model helps us avoid rate limiting and keeps our costs down.

In [12]:
from langchain_openai.chat_models import ChatOpenAI

base_llm = ChatOpenAI(model="gpt-4o-mini", tags=["base_llm"])

#### Our RAG Chain

Notice how we have a bit of a more complex chain this time - that's because we want to return our sources with the response.

Let's break down the chain step-by-step:

1. We invoke the chain with the `question` item. Notice how we only need to provide `question` since both the retriever and the `"question"` object depend on it.
  - We also chain our `"question"` into our `retriever`! This is what ultimately collects the context through Qdrant.
2. We assign our collected context to a `RunnablePassthrough()` from the previous object. This is going to let us simply pass it through to the next step, but still allow us to run that section of the chain.
3. We finally collect our response by chaining our prompt, which expects both a `"question"` and `"context"`, into our `llm`. We also collect the `"context"` again so we can output it in the final response object.

The key thing to keep in mind here is that we need to pass our context through *after* we've retrieved it - to populate the object in a way that doesn't require us to call it or try and use it for something else.

In [13]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the qdrant_retriever
    #
    {"context": itemgetter("question") | qdrant_retriever, "question": itemgetter("question")}
    #
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    #
    | RunnablePassthrough.assign(context=itemgetter("context"))
    #
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    #
    | {"response": base_rag_prompt | base_llm, "context": itemgetter("context")}
)
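To make the data flow through the three stages easier to follow, here is a plain-Python sketch of the same steps without LangChain. The functions `fake_retriever`, `fake_llm`, `format_prompt`, and `run_chain` are hypothetical stand-ins for illustration only. The real chain runs these stages as LCEL runnables.

```python
def fake_retriever(question):
    """Hypothetical stand-in for the Qdrant retriever."""
    return [f"document about: {question}"]


def fake_llm(prompt_text):
    """Hypothetical stand-in for the chat model."""
    return f"answer based on a prompt of {len(prompt_text)} characters"


def format_prompt(context, question):
    """Hypothetical stand-in for the prompt template."""
    return f"Context:\n{context}\n\nQuestion:\n{question}\n"


def run_chain(inputs):
    # Stage 1: build "context" and "question" from the same "question" input.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Stage 2: pass the state through, re-assigning "context" to itself.
    state = {**state, "context": state["context"]}
    # Stage 3: produce the response and keep the context alongside it.
    return {
        "response": fake_llm(format_prompt(state["context"], state["question"])),
        "context": state["context"],
    }


result = run_chain({"question": "What is LangSmith?"})
print(sorted(result.keys()))
```

The output dictionary has the same two keys as the real chain, `"response"` and `"context"`, which is what lets us show the sources next to the answer.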

Let's get a visual understanding of our chain!

In [23]:
!pip install -qU grandalf

In [24]:
print(retrieval_augmented_qa_chain.get_graph().draw_ascii())

          +---------------------------------+      
          | Parallel<context,question>Input |      
          +---------------------------------+      
                    **            **               
                  **                **             
                **                    **           
         +--------+                     **         
         | Lambda |                      *         
         +--------+                      *         
              *                          *         
              *                          *         
              *                          *         
  +----------------------+          +--------+     
  | VectorStoreRetriever |          | Lambda |     
  +----------------------+          +--------+     
                    **            **               
                      **        **                 
                        **    **                   
          +----------------------------------+     
          | 

Let's try another visual representation:

![image](https://i.imgur.com/Ad31AhL.png)

Let's test our chain out!

In [14]:
response = retrieval_augmented_qa_chain.invoke({"question" : "What's new in LangChain v0.2?"})

In [15]:
print(response["response"].content)

LangChain v0.2 introduces several improvements, including:
- A full separation of langchain and langchain-community.
- New and versioned documentation.
- A more mature and controllable agent framework.
- Improved LLM interface standardization, particularly regarding tool calling.


In [16]:
print(f"Number of found context: {len(response['context'])}")
for context in response["context"]:
  print("Context:")
  print(context)
  print("----")

Number of found context: 4
Context:
page_content='Four months ago, we released the first stable version of LangChain. Today, we are following up by announcing a pre-release of langchain v0.2.This release builds upon the foundation laid in v0.1 and incorporates community feedback. We’re excited to share that v0.2 brings: A much-desired full separation of langchain and langchain-community New (and versioned!) docs A more mature and controllable agent framework Improved LLM interface standardization, particularly around tool callingBetter' metadata={'source': 'https://blog.langchain.dev/langchain-v02-leap-to-stability/', 'loc': 'https://blog.langchain.dev/langchain-v02-leap-to-stability/', 'lastmod': '2024-05-16T22:26:07.000Z', '_id': 'e95bbca0f3d840af972479acb7c7bc2b', '_collection_name': '25a427116bbf455bb852f7f34433807e'}
----
Context:
page_content='LangChain v0.2: A Leap Towards Stability
Skip to content
All Posts

Let's see if it can handle a query that is totally unrelated to the source documents.

In [17]:
response = retrieval_augmented_qa_chain.invoke({"question" : "What is the airspeed velocity of an unladen swallow?"})

In [18]:
print(response["response"].content)

I have insufficient information to answer that.


## Task 3: Setting Up LangSmith

Now that we have a chain - we're ready to get started with LangSmith!

We're going to go ahead and use the following `env` variables to get our Colab notebook set up to start reporting.

If all you needed was simple monitoring - this is all you would need to do!

In [19]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"LangSmith - {unique_id}"
print(f"Your LangSmith project is: {os.environ['LANGCHAIN_PROJECT']}")

Your LangSmith project is: LangSmith - 5d73bab4


### LangSmith API

In order to use LangSmith, you will need a beta key. You can join the queue through the `Beta Sign Up` button on LangSmith's homepage!

Join [here](https://www.langchain.com/langsmith)

In [20]:
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass('Enter your LangSmith API key: ')

Let's test out our first generation!

In [21]:
response = retrieval_augmented_qa_chain.invoke({"question" : "What is LangSmith?"}, {"tags" : ["Demo Run"]})['response']
print(response)
print(response.content)

content='LangSmith is a framework built on LangChain, designed to track the inner workings of large language models (LLMs) and AI agents within products. It provides tools for debugging, testing, and improving the reliability of LLM applications. LangSmith allows users to analyze LLM behavior, create and run tests with existing or new datasets, and offers fine-grained controls and customizability through its SDK.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 81, 'prompt_tokens': 1032, 'total_tokens': 1113}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_f33667828e', 'finish_reason': 'stop', 'logprobs': None} id='run-217c84bf-4508-4500-b5ad-db8dbf1a8806-0' usage_metadata={'input_tokens': 1032, 'output_tokens': 81, 'total_tokens': 1113}
LangSmith is a framework built on LangChain, designed to track the inner workings of large language models (LLMs) and AI agents within products. It provides tools for debugging, testing, an

## Task 4: Examining the Trace in LangSmith!

Head on over to your LangSmith web UI to check out how the trace looks in LangSmith!

<div style="border: 2px solid white; background: black; padding: 10px;">

#### 🏗️ Activity #1:

Include a screenshot of your trace and explain what it means.

#### ! Answer #1:

![Sample Image](images/langsmith_3.jpg)

The LangSmith interface is a bit overwhelming and I spent a while clicking around it.

The Personal -> Projects view lists all of the traces and evaluations that you have set up.

For this project I used the name "LangSmith - 5d73bab4"  

In the summary it tells you:
- run count - number of times this trace has been run
- error rate
- total tokens used
- total cost 

Drilling into the selected project you see the following screen with runs, threads, monitor and setup as main tabs.

![Sample Image](images/langsmith_4.jpg)

The Runs tab shows each run with the input and output. At this level, the run represents the entire chain. The top level run can be expanded to show the different steps in the chain. Each can be expanded further.

Summary information is displayed on the table to the right. This run summary includes start and end time, status of the run, total tokens used, latency, etc.

Selecting one of the runs (in this case ChatOpenAI) displays a more detailed screen giving information on that step of the chain, as shown below.

![Sample Image](images/langsmith_5.jpg)

For each step the input and output are displayed. There is also a summary with the start and end, the status of the step, number of tokens and cost of the step. It also identifies the type of run.

In this case, the model represented by the ChatOpenAI was passed an input of Human information composed of the instructions, the context (as selected by the retriever) and the Question. The response has the answer provided by AI.

Looking at other runs, each details its input and output, with the information depending on the type of run.

Each run is identified by type with an icon and a name: Prompt, Sequence, Chain, Retriever, and so on.

- The retriever run shows the documents that were returned by the retriever.
- The prompt run shows the inputs, including the question and the context, as well as the output, which is the final prompt sent to the model.

</div>

## Task 5: Loading Our Testing Set

Get the data for the LangSmith testing from the AI Makerspace GitHub repository.

In [68]:
!git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 84, done.
remote: Counting objects: 100% (76/76), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 84 (delta 23), reused 28 (delta 8), pack-reused 8 (from 1)
Receiving objects: 100% (84/84), 70.08 MiB | 3.60 MiB/s, done.
Resolving deltas: 100% (23/23), done.


In [22]:
import pandas as pd

test_df = pd.read_csv("DataRepository/langchain_blog_test_data.csv")

Now we can set up our LangSmith client - and we'll add the above created dataset to our LangSmith instance!

> NOTE: Read more about this process [here](https://docs.smith.langchain.com/old/evaluation/faq/manage-datasets#create-from-list-of-values)

In [23]:
from langsmith import Client

client = Client()

dataset_name = "langsmith-demo-dataset-aie4-triples-v7"

dataset = client.create_dataset(
    dataset_name=dataset_name, description="LangChain Blog Test Questions"
)

for _, triplet in test_df.iterrows():
  client.create_example(
      inputs={"question" : triplet["question"], "context": triplet["context"]},
      outputs={"answer" : triplet["answer"]},
      dataset_id=dataset.id
  )
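To show the shape of each example without needing a LangSmith account, here is a small sketch that splits rows into `inputs` and `outputs` dictionaries using only the standard library. The two CSV rows and the helper `rows_to_examples` are hypothetical. Only the column names `question`, `context`, and `answer` come from the loop above.

```python
import csv
import io

# Hypothetical two-row CSV using the column names read by the loop above.
csv_text = """question,context,answer
What is LCEL?,LCEL is the LangChain Expression Language.,The LangChain Expression Language
What is Qdrant?,Qdrant is a vector database.,A vector database
"""


def rows_to_examples(rows):
    """Split each row into the inputs and outputs dictionaries of one example."""
    return [
        {
            "inputs": {"question": row["question"], "context": row["context"]},
            "outputs": {"answer": row["answer"]},
        }
        for row in rows
    ]


examples = rows_to_examples(csv.DictReader(io.StringIO(csv_text)))
print(len(examples), examples[0]["inputs"]["question"])
```

Keeping the reference context under `inputs` and the reference answer under `outputs` is what later lets an evaluator choose which of the two to compare against.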

## Task 6: Evaluation

Now we can run the evaluation!

We'll need to start by writing some custom data preparation functions to ensure our chain works with the expected inputs/outputs from the `evaluate` process in LangSmith.

> NOTE: More reading on this available [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#evaluate-a-langchain-runnable)

In [24]:
def prepare_data_ref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "reference" : example.outputs["answer"],
      "input" : example.inputs["question"]
  }

def prepare_data_noref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "input" : example.inputs["question"]
  }

def prepare_context_ref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "reference" : example.inputs["context"],
      "input" : example.inputs["question"]
  }
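To check what one of these functions returns without running an experiment, we can call it with lightweight stand-ins. This is only a sketch: `fake_run` and `fake_example` are hypothetical objects that expose just the attributes the function reads, and the strings inside them are made up.

```python
from types import SimpleNamespace


def prepare_data_ref_sketch(run, example):
    """Same mapping as prepare_data_ref above, repeated so this block runs alone."""
    return {
        "prediction": run.outputs["response"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Hypothetical stand-ins with only the attributes the function reads.
fake_run = SimpleNamespace(outputs={"response": "LangSmith helps debug LLM apps."})
fake_example = SimpleNamespace(
    inputs={"question": "What is LangSmith?", "context": "LangSmith is ..."},
    outputs={"answer": "A platform for debugging and testing LLM apps."},
)

prepared = prepare_data_ref_sketch(fake_run, fake_example)
print(prepared)
```

The three functions differ only in what they use as `reference`: the reference answer, nothing at all, or the reference context.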

We'll be using a few custom evaluators to evaluate our pipeline, as well as a few "built in" methods!

Check out the built-ins [here](https://docs.smith.langchain.com/reference/sdk_reference/langchain_evaluators)!
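As a rough illustration of what a hand-written evaluator could look like, here is a sketch of an exact-match check. It assumes `evaluate` accepts a plain function of `(run, example)` that returns a dictionary with a `key` and a `score`. The function name `exact_match_evaluator` and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def exact_match_evaluator(run, example):
    """Hypothetical evaluator: score 1 if prediction equals reference, else 0."""
    prediction = str(run.outputs["response"]).strip().lower()
    reference = str(example.outputs["answer"]).strip().lower()
    return {"key": "exact_match", "score": int(prediction == reference)}


# Hypothetical stand-ins for a run and a dataset example.
matching_run = SimpleNamespace(outputs={"response": "A vector database "})
other_run = SimpleNamespace(outputs={"response": "A search engine"})
reference_example = SimpleNamespace(outputs={"answer": "a vector database"})

match_result = exact_match_evaluator(matching_run, reference_example)
miss_result = exact_match_evaluator(other_run, reference_example)
print(match_result, miss_result)
```

An exact-match check is deterministic, which makes it a useful contrast to the LLM-based evaluators used in this notebook.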

In [30]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

cot_qa_evaluator = LangChainStringEvaluator("cot_qa",
                                            config={"llm":base_llm},
                                            prepare_data=prepare_context_ref)

unlabeled_dopeness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "llm":base_llm,
        "criteria" : {
            "dopeness" : "Is the answer to the question dope, meaning cool - awesome - and legit?"
        }
    },
    prepare_data=prepare_data_noref
)

labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "llm":base_llm,
        "criteria": {
            "accuracy": "Is the generated answer the same as the reference answer?"
        },
    },
    prepare_data=prepare_data_ref
)

unlabeled_spelling_errors_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "llm":base_llm,
        "criteria" : {
            "typos" : "Are there any typographical errors ie words spelled incorrectly?"
        }
    },
    prepare_data=prepare_data_noref
)


base_rag_results = evaluate(
    retrieval_augmented_qa_chain.invoke,
    data=dataset_name,
    evaluators=[
        cot_qa_evaluator,
        unlabeled_dopeness_evaluator,
        labeled_score_evaluator,
        unlabeled_spelling_errors_evaluator,
        ],
    experiment_prefix="Base RAG Evaluation"
)

View the evaluation results for experiment: 'Base RAG Evaluation-b64674cb' at:
https://smith.langchain.com/o/c97b0028-7cab-5f76-a748-1369ba450931/datasets/7784a166-5cb4-4bf0-9152-3bdecf032df6/compare?selectedSessions=03b29b92-16b1-4b6a-92ad-8e1c7c09be73




0it [00:00, ?it/s]

<div style="border: 2px solid white; background: black; padding: 10px;">

#### ❓Question #1:

![Langchain Posts](images/langsmith_2.jpg)

<b>What conclusions can you draw about the above results?</b>

Over 7 runs there is some variance between runs:

- Chain-of-thought contextual accuracy ranges from 0.66 to 0.74
- Dopeness ranges from 0.27 to 0.53
- Accuracy ranges from 5.6 to 6.5
- There were no typos

This is without any changes to the data, the processing, or the evaluation criteria, so this is a bit surprising. The graph above shows this variation.

Dopeness swings from 6 Y to 12 Y, so there is a lot of variability here. Contextual accuracy varies from 15 to 17 correct, so there is less variability, which I guess is more reasonable since Dopeness is a much more subjective evaluation.

<b>Describe in your own words what the metrics are expressing.</b>

COT Contextual Accuracy - checks how accurate the answer is based on the chain of thought. Either correct or incorrect.

Accuracy - how accurate the answer is and whether it is missing any specific details. Rating of 1-10.

Dopeness - answers are checked for accuracy, clarity, engagement (language of the answer, how "exciting"), completeness (does it fully answer the question), overall impact (does it evoke emotional reaction). Provides a Y or N score. However, it seems that different questions have slightly different evaluation criteria for dopeness, which seems concerning.

Accuracy - this provides a numerical score for how accurately the answer matches the reference answer. It also checks how different the answer is and how the answer is phrased to determine how accurate it is. Low scores are given if the response does not answer the question. It is also looking for context and coherent explanations. It seems to also look for clarity. Scores appear to be on a scale of 1 to 10.

I experimented with grade level and never got outside of a Y or N answer. I also added an evaluator for typos, which found none (but labeled the result as Y). It is interesting to read the explanations of how the text was checked for accuracy.


</div>
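Run-to-run variation like this is easier to compare across metrics once it is summarised. Here is a minimal sketch of one way to do that with the standard library. The helper `summarize_scores` is hypothetical, and the score list is made up for illustration. It is NOT taken from the experiments above.

```python
from statistics import mean, pstdev


def summarize_scores(scores):
    """Return mean, min, max, and population standard deviation of a score list."""
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": pstdev(scores),
    }


# Hypothetical per-run scores, NOT taken from the experiments above.
hypothetical_runs = [0.5, 0.75, 0.5, 0.25]
summary = summarize_scores(hypothetical_runs)
print(summary)
```

Comparing the spread relative to the mean for each metric gives a quick sense of which evaluators are stable enough to trust from a single run.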