# Experiments




### Setup

In [27]:
# You can set them inline
import os
os.environ["MISTRAL_API_KEY"] = "MISTRAL_API_KEY"
os.environ["LANGSMITH_API_KEY"] = "LANGSMITH_API_KEY"
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "langsmith-academy-mistral"

In [28]:
# Or you can use a .env file
from dotenv import load_dotenv
load_dotenv(dotenv_path="../../.env", override=True)

True

Here is the RAG application that we've been working with throughout this course.

In [29]:
import os
import tempfile
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders.sitemap import SitemapLoader
from langchain_community.vectorstores import SKLearnVectorStore
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.messages import SystemMessage, HumanMessage
from langsmith import traceable
from typing import List
import nest_asyncio

# Configured for Mistral AI
MODEL_NAME = "mistral-small-latest"
MODEL_PROVIDER = "mistral" 
APP_VERSION = 2.0
RAG_SYSTEM_PROMPT = """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the latest question in the conversation. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.
"""

mistral_client = ChatMistralAI(model=MODEL_NAME, temperature=0.3)

def get_vector_db_retriever():
    persist_path = os.path.join(tempfile.gettempdir(), "mistral_docs.parquet")
    # Use HuggingFace embeddings instead of OpenAI
    embd = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={'device': 'cpu'}
    )

    # If vector store exists, then load it
    if os.path.exists(persist_path):
        vectorstore = SKLearnVectorStore(
            embedding=embd,
            persist_path=persist_path,
            serializer="parquet"
        )
        return vectorstore.as_retriever(lambda_mult=0)

    # Otherwise, index LangSmith documents and create new vector store
    ls_docs_sitemap_loader = SitemapLoader(web_path="https://docs.smith.langchain.com/sitemap.xml", continue_on_failure=True)
    ls_docs = ls_docs_sitemap_loader.load()

    # Use standard text splitter since we're using HuggingFace embeddings
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=50
    )
    doc_splits = text_splitter.split_documents(ls_docs)

    vectorstore = SKLearnVectorStore.from_documents(
        documents=doc_splits,
        embedding=embd,
        persist_path=persist_path,
        serializer="parquet"
    )
    vectorstore.persist()
    return vectorstore.as_retriever(lambda_mult=0)

nest_asyncio.apply()
retriever = get_vector_db_retriever()

"""
retrieve_documents
- Returns documents fetched from a vectorstore based on the user's question
"""
@traceable(run_type="chain")
def retrieve_documents(question: str):
    return retriever.invoke(question)

"""
generate_response
- Calls `call_mistral` to generate a model response after formatting inputs
"""
@traceable(run_type="chain")
def generate_response(question: str, documents):
    formatted_docs = "\n\n".join(doc.page_content for doc in documents)
    messages = [
        SystemMessage(content=RAG_SYSTEM_PROMPT),
        HumanMessage(content=f"Context: {formatted_docs} \n\n Question: {question}")
    ]
    return call_mistral(messages)

"""
call_mistral
- Returns the chat completion output from Mistral AI
"""
@traceable(
    run_type="llm",
    metadata={
        "ls_provider": MODEL_PROVIDER,
        "ls_model_name": MODEL_NAME
    }
)
def call_mistral(messages: List):
    # Return the full message object; `langsmith_rag` reads its `.content`
    response = mistral_client.invoke(messages)
    return response

"""
langsmith_rag
- Calls `retrieve_documents` to fetch documents
- Calls `generate_response` to generate a response based on the fetched documents
- Returns the model response
"""
@traceable(run_type="chain")
def langsmith_rag(question: str):
    documents = retrieve_documents(question)
    response = generate_response(question, documents)
    return response.content


### Experiment

Here is a code snippet that should look similar to what you see from the starter code!

There are a few important components here.

1. We have defined an Evaluator
2. We use a target function to map each dataset example (a dict) to the input shape that our function `langsmith_rag` expects (a str)
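
The two components above can be sketched offline, without LangSmith or Mistral. This is only an illustration of how the pieces fit together: `fake_rag`, `is_short`, and `run_locally` are hypothetical names, and `run_locally` is a simplified stand-in for what `evaluate` does, not its actual implementation.

```python
from typing import Callable, Dict, List


def fake_rag(question: str) -> str:
    """Hypothetical stand-in for `langsmith_rag`; returns a canned answer."""
    return f"Answer to: {question}"


def target(inputs: dict) -> dict:
    # Adapt the dataset's dict input to the plain string the app expects.
    return {"output": fake_rag(inputs["question"])}


def is_short(reference_outputs: dict, outputs: dict) -> dict:
    # Pass when the answer is shorter than 1.5x the reference answer.
    passed = len(outputs["output"]) < 1.5 * len(reference_outputs["output"])
    return {"key": "is_short", "score": int(passed)}


def run_locally(examples: List[dict], target: Callable, evaluators: List[Callable]) -> List[Dict]:
    """Call the target on each example, then score its output with each evaluator."""
    rows = []
    for ex in examples:
        outputs = target(ex["inputs"])
        feedback = [ev(ex["outputs"], outputs) for ev in evaluators]
        rows.append({"outputs": outputs, "feedback": feedback})
    return rows


examples = [
    {
        "inputs": {"question": "What is tracing?"},
        "outputs": {"output": "Tracing records each step of a run."},
    }
]
rows = run_locally(examples, target, [is_short])
print(rows[0]["feedback"])
```

The evaluator only sees dicts, which is why the target function wraps the string returned by the application in `{"output": ...}`.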

In [30]:
from langsmith import evaluate, Client

client = Client()
dataset_name = "Mistral RAG Application Dataset"

# First, let's create the dataset if it doesn't exist
try:
    # Try to get the dataset to see if it exists
    dataset = client.read_dataset(dataset_name=dataset_name)
    print(f"Dataset '{dataset_name}' already exists with {dataset.example_count} examples")
except Exception:
    # Dataset doesn't exist, let's create it with sample examples
    print(f"Creating dataset '{dataset_name}'...")
    
    # Create sample examples for testing
    examples = [
        {
            "question": "What is LangSmith and how does it help with LLM applications?",
            "output": "LangSmith is a platform for debugging, testing, and monitoring LLM applications. It provides tracing capabilities to track execution steps, evaluation tools for systematic testing, and monitoring features for production deployments."
        },
        {
            "question": "How do I set up tracing in my Mistral AI application?",
            "output": "To set up tracing in your Mistral AI application, you need to install langsmith, set your LANGSMITH_API_KEY environment variable, and use the @traceable decorator on your functions. This allows automatic tracking of your application's execution flow."
        },
        {
            "question": "What are the benefits of using custom evaluators?",
            "output": "Custom evaluators allow you to test specific aspects of your application beyond basic metrics. They can check for domain relevance, response quality, accuracy, and other application-specific criteria to ensure your LLM performs well for your use case."
        },
        {
            "question": "How can I compare different Mistral AI models?",
            "output": "You can compare different Mistral AI models by running experiments with the same dataset and evaluators but different model configurations. LangSmith's experiment feature allows you to systematically compare performance across model variants."
        }
    ]
    
    # Create the dataset
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="Dataset for testing Mistral AI RAG application with custom evaluators"
    )
    
    # Add examples to the dataset
    client.create_examples(
        inputs=[{"question": ex["question"]} for ex in examples],
        outputs=[{"output": ex["output"]} for ex in examples],
        dataset_id=dataset.id
    )
    
    print(f"Created dataset '{dataset_name}' with {len(examples)} examples")

# Custom evaluator for Mistral AI responses
def is_concise_enough(reference_outputs: dict, outputs: dict) -> dict:
    score = len(outputs["output"]) < 1.5 * len(reference_outputs["output"])
    return {"key": "is_concise", "score": int(score)}

# Additional custom evaluator for response quality
def contains_key_terms(reference_outputs: dict, outputs: dict) -> dict:
    key_terms = ["langsmith", "mistral", "trace", "evaluation"]
    output_lower = outputs["output"].lower()
    score = any(term in output_lower for term in key_terms)
    return {"key": "contains_key_terms", "score": int(score)}

def target_function(inputs: dict):
    return {"output": langsmith_rag(inputs["question"])}

print("Running evaluation with Mistral AI...")
evaluate(
    target_function,
    data=dataset_name,
    evaluators=[is_concise_enough, contains_key_terms],
    experiment_prefix="mistral-small-latest"
)

Dataset 'Mistral RAG Application Dataset' already exists with 4 examples
Running evaluation with Mistral AI...
View the evaluation results for experiment: 'mistral-small-latest-0ea11166' at:
https://smith.langchain.com/o/bd531ccf-4286-4467-99ba-7eab707122af/datasets/dab4404e-e95f-4521-9892-284e6476bbca/compare?selectedSessions=5a2fa2ac-fe87-4668-9f9b-ce0c62f63cb2





| | inputs.question | outputs.output | error | reference.output | feedback.is_concise | feedback.contains_key_terms | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is LangSmith and how does it help with LL... | LangSmith is a platform designed for evaluatin... | | LangSmith is a platform for debugging, testing... | 0 | 1 | 1.308382 | 7dfd0d3f-c378-4fe2-8a08-301d52dd92e2 | ec182711-27c0-4392-a3ee-e248c3baab5c |
| 1 | How can I compare different Mistral AI models? | You can compare different Mistral AI models us... | | You can compare different Mistral AI models by... | 1 | 1 | 1.017068 | 99f57a2e-5d00-40e2-9fbe-01a263fcebba | 036ba565-85ee-41cd-b096-b58a8fa07b2c |
| 2 | What are the benefits of using custom evaluators? | Custom evaluators are beneficial when each eva... | | Custom evaluators allow you to test specific a... | 1 | 1 | 1.321609 | 9ebb8f0f-b91f-425d-b2da-d91149c0cfa1 | eebd145a-55d7-4d27-a931-0a96ff610fb2 |
| 3 | How do I set up tracing in my Mistral AI appli... | To set up tracing in your Mistral AI applicati... | | To set up tracing in your Mistral AI applicati... | 1 | 1 | 1.029923 | d1fc5afd-2380-4215-accc-cb26660ba5b3 | 78b210fb-21d4-4f0f-97b9-8cf79251194b |


### Modifying your Application

Now, let's change our model to `mistral-tiny` and see how it performs compared to `mistral-small-latest`!

Make this change, and then run this code snippet!

In [31]:
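from langsmith import evaluate, Client

# Update the model for comparison. `call_mistral` looks up `mistral_client`
# at call time, so rebinding it here is enough to switch models.
# Note: the `ls_model_name` metadata on `call_mistral` was fixed when the
# function was decorated, so traces still report the earlier model name.
MODEL_NAME = "mistral-tiny"
mistral_client = ChatMistralAI(model=MODEL_NAME, temperature=0.3)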

def target_function(inputs: dict):
    return {"output": langsmith_rag(inputs["question"])}

print("Running evaluation with Mistral Tiny...")
evaluate(
    target_function,
    data=dataset_name,
    evaluators=[is_concise_enough, contains_key_terms],
    experiment_prefix="mistral-tiny"
)

Running evaluation with Mistral Tiny...
View the evaluation results for experiment: 'mistral-tiny-f7525ce0' at:
https://smith.langchain.com/o/bd531ccf-4286-4467-99ba-7eab707122af/datasets/dab4404e-e95f-4521-9892-284e6476bbca/compare?selectedSessions=57527339-dbf2-44ca-8ee8-6851c8dc9c29





| | inputs.question | outputs.output | error | reference.output | feedback.is_concise | feedback.contains_key_terms | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is LangSmith and how does it help with LL... | LangSmith is a platform that consists of a fro... | | LangSmith is a platform for debugging, testing... | 0 | 1 | 0.907266 | 7dfd0d3f-c378-4fe2-8a08-301d52dd92e2 | 42291fe6-c104-4127-9e71-88125e128516 |
| 1 | How can I compare different Mistral AI models? | To compare different Mistral AI models, you ca... | | You can compare different Mistral AI models by... | 0 | 1 | 1.102619 | 99f57a2e-5d00-40e2-9fbe-01a263fcebba | 38c188d7-b6ff-478a-bfe1-d20d057e68c4 |
| 2 | What are the benefits of using custom evaluators? | Custom evaluators can be beneficial in complex... | | Custom evaluators allow you to test specific a... | 1 | 1 | 0.764297 | 9ebb8f0f-b91f-425d-b2da-d91149c0cfa1 | 87cb8911-7f26-49fc-8ec4-fb9e12c9e60c |
| 3 | How do I set up tracing in my Mistral AI appli... | To set up tracing in your Mistral AI applicati... | | To set up tracing in your Mistral AI applicati... | 0 | 1 | 1.398458 | d1fc5afd-2380-4215-accc-cb26660ba5b3 | c94133e0-5857-46f7-bf23-ee7b55ef2bc2 |
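
To compare the two runs without opening the UI, one option is to compute a pass rate per model and list the examples that regressed. The sketch below is a minimal offline illustration: the `feedback.is_concise` scores are copied by hand from the two result tables above, example IDs are shortened to their first eight characters, and the variable names are assumptions made for this example.

```python
from statistics import mean

# feedback.is_concise per example, copied from the results above.
small_scores = {"7dfd0d3f": 0, "99f57a2e": 1, "9ebb8f0f": 1, "d1fc5afd": 1}
tiny_scores = {"7dfd0d3f": 0, "99f57a2e": 0, "9ebb8f0f": 1, "d1fc5afd": 0}

# Fraction of examples that passed the conciseness check for each model.
small_pass_rate = mean(small_scores.values())
tiny_pass_rate = mean(tiny_scores.values())

# Examples that passed with mistral-small-latest but failed with mistral-tiny.
regressions = sorted(
    ex_id for ex_id, score in small_scores.items()
    if score == 1 and tiny_scores[ex_id] == 0
)

print(f"mistral-small-latest: {small_pass_rate:.2f}")
print(f"mistral-tiny:         {tiny_pass_rate:.2f}")
print(f"regressed examples:   {regressions}")
```

With only four examples and a single run per model, these numbers are a rough signal rather than a reliable ranking.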


### Running over Different pieces of Data

##### Dataset Version

You can run an experiment on a specific version of a dataset in the SDK by passing the `as_of` parameter to `list_examples`.

Let's try running on just our initial dataset.

In [32]:
# Note: This will work once you have dataset versions set up
try:
    evaluate(
        target_function,
        data=client.list_examples(dataset_name=dataset_name, as_of="initial dataset"),   # We use as_of to specify a version
        evaluators=[is_concise_enough],
        experiment_prefix="initial dataset version"
    )
except Exception as e:
    print(f"Dataset version experiment skipped: {e}")
    print("Note: Create dataset versions in LangSmith UI to use this feature")

Dataset version experiment skipped: 
Note: Create dataset versions in LangSmith UI to use this feature
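
As far as the LangSmith SDK documents it, `as_of` accepts either a version tag (a string) or a timestamp (a `datetime`). The sketch below shows both call shapes. `StubClient` is a hypothetical offline stand-in that only records keyword arguments, `examples_for_version` is a helper name invented for this example, and the cutoff date is arbitrary; with a real `Client` the same helper would return the examples belonging to that version.

```python
import datetime
from typing import Any, Dict, Iterator, List


class StubClient:
    """Hypothetical stand-in that records the kwargs passed to `list_examples`."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def list_examples(self, **kwargs: Any) -> Iterator[dict]:
        self.calls.append(kwargs)
        return iter([])  # a real client would yield examples here


def examples_for_version(client: Any, dataset_name: str, as_of: Any) -> list:
    """Fetch the examples of `dataset_name` as they existed at `as_of`."""
    return list(client.list_examples(dataset_name=dataset_name, as_of=as_of))


stub = StubClient()

# Select a version by timestamp (arbitrary date, for illustration only).
cutoff = datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc)
examples_for_version(stub, "Mistral RAG Application Dataset", cutoff)

# Select a version by tag, as the cell above does.
examples_for_version(stub, "Mistral RAG Application Dataset", "initial dataset")

print(stub.calls)
```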


##### Dataset Split

You can also run an experiment on a specific split of your dataset. Let's try running on the `Crucial Examples` split.

In [33]:
# Note: This will work once you have dataset splits configured
try:
    evaluate(
        target_function,
        data=client.list_examples(dataset_name=dataset_name, splits=["Crucial Examples"]),  # We pass in a list of Splits
        evaluators=[is_concise_enough],
        experiment_prefix="Crucial Examples split"
    )
except Exception as e:
    print(f"Dataset split experiment skipped: {e}")
    print("Note: Create dataset splits in LangSmith UI to use this feature")

Dataset split experiment skipped: 
Note: Create dataset splits in LangSmith UI to use this feature


##### Specific Data Points

You can also run an experiment over individual data points by passing their example IDs.

In [34]:
# Get example IDs from the dataset and run on first two examples
try:
    examples = list(client.list_examples(dataset_name=dataset_name, limit=2))
    if len(examples) >= 2:
        example_ids = [ex.id for ex in examples[:2]]
        print(f"Running experiment on specific examples: {example_ids}")
        
        evaluate(
            target_function,
            data=client.list_examples(
                dataset_name=dataset_name, 
                example_ids=example_ids
            ),
            evaluators=[is_concise_enough],
            experiment_prefix="two specific example ids"
        )
    else:
        print("Not enough examples in dataset for this experiment")
except Exception as e:
    print(f"Specific example IDs experiment skipped: {e}")

Running experiment on specific examples: [UUID('7dfd0d3f-c378-4fe2-8a08-301d52dd92e2'), UUID('99f57a2e-5d00-40e2-9fbe-01a263fcebba')]
View the evaluation results for experiment: 'two specific example ids-1fab4719' at:
https://smith.langchain.com/o/bd531ccf-4286-4467-99ba-7eab707122af/datasets/dab4404e-e95f-4521-9892-284e6476bbca/compare?selectedSessions=bd0d6903-1167-4448-8a33-418a511a7e09





### Other Parameters

##### Repetitions

You can run an experiment several times to check how consistent your results are.

In [35]:
evaluate(
    target_function,
    data=dataset_name,
    evaluators=[is_concise_enough],
    experiment_prefix="two repetitions",
    num_repetitions=2   # This field defaults to 1
)

View the evaluation results for experiment: 'two repetitions-d6cb5eb3' at:
https://smith.langchain.com/o/bd531ccf-4286-4467-99ba-7eab707122af/datasets/dab4404e-e95f-4521-9892-284e6476bbca/compare?selectedSessions=a05f477d-d81e-482e-abd4-58c5f88fe2eb





| | inputs.question | outputs.output | error | reference.output | feedback.is_concise | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|
| 0 | What is LangSmith and how does it help with LL... | LangSmith is a platform that consists of a fro... | | LangSmith is a platform for debugging, testing... | 0 | 1.142637 | 7dfd0d3f-c378-4fe2-8a08-301d52dd92e2 | 2540fb5a-f57a-4faf-92f4-33ad3fcb1701 |
| 1 | How can I compare different Mistral AI models? | To compare different Mistral AI models, you ca... | | You can compare different Mistral AI models by... | 0 | 1.095756 | 99f57a2e-5d00-40e2-9fbe-01a263fcebba | 090a81f4-aee3-428c-9c2c-47ebce68302a |
| 2 | What are the benefits of using custom evaluators? | Custom evaluators can be beneficial in complex... | | Custom evaluators allow you to test specific a... | 1 | 0.725765 | 9ebb8f0f-b91f-425d-b2da-d91149c0cfa1 | 244b5115-c2c7-47c6-9159-4fd4a60835ef |
| 3 | How do I set up tracing in my Mistral AI appli... | To set up tracing in your Mistral AI applicati... | | To set up tracing in your Mistral AI applicati... | 0 | 1.206153 | d1fc5afd-2380-4215-accc-cb26660ba5b3 | 2877b2b8-f0ba-430e-85f1-d9d84ee64577 |
| 4 | What is LangSmith and how does it help with LL... | LangSmith is a platform that consists of a fro... | | LangSmith is a platform for debugging, testing... | 0 | 0.827014 | 7dfd0d3f-c378-4fe2-8a08-301d52dd92e2 | 70b6d428-8d4b-4fa0-af1a-cabb456b058f |
| 5 | How can I compare different Mistral AI models? | To compare different Mistral AI models, you ca... | | You can compare different Mistral AI models by... | 0 | 0.848918 | 99f57a2e-5d00-40e2-9fbe-01a263fcebba | 19da2ce2-b322-4dad-8615-62a02edfef9b |
| 6 | What are the benefits of using custom evaluators? | Custom evaluators can be beneficial in complex... | | Custom evaluators allow you to test specific a... | 0 | 0.832502 | 9ebb8f0f-b91f-425d-b2da-d91149c0cfa1 | 971525d7-d2a1-4463-85bd-1226fd6214e3 |
| 7 | How do I set up tracing in my Mistral AI appli... | To set up tracing in your Mistral AI applicati... | | To set up tracing in your Mistral AI applicati... | 0 | 1.315302 | d1fc5afd-2380-4215-accc-cb26660ba5b3 | 1c7990b0-b8f7-4d5f-aab1-a7097171fd8b |
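
One way to read repeated runs is to group the scores by example and flag any example whose score changes between repetitions. The following is a small offline sketch of that check: the `(example_id, is_concise)` pairs are copied by hand from the results above, with IDs shortened to their first eight characters, and the variable names are assumptions for illustration.

```python
from collections import defaultdict

# (example_id prefix, feedback.is_concise) for all 8 runs, in table order.
runs = [
    ("7dfd0d3f", 0), ("99f57a2e", 0), ("9ebb8f0f", 1), ("d1fc5afd", 0),
    ("7dfd0d3f", 0), ("99f57a2e", 0), ("9ebb8f0f", 0), ("d1fc5afd", 0),
]

# Collect every score observed for each example across repetitions.
scores_by_example = defaultdict(list)
for example_id, score in runs:
    scores_by_example[example_id].append(score)

# An example is unstable if its repetitions did not all get the same score.
unstable = sorted(
    ex_id for ex_id, scores in scores_by_example.items()
    if len(set(scores)) > 1
)

print(dict(scores_by_example))
print(f"unstable examples: {unstable}")
```

Here one example passes the conciseness check in the first repetition and fails it in the second, which is the kind of variation that repetitions are meant to surface.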


##### Concurrency
You can also kick off concurrent threads of execution to make your experiments finish faster!

In [36]:
evaluate(
    target_function,
    data=dataset_name,
    evaluators=[is_concise_enough],
    experiment_prefix="concurrency",
    max_concurrency=3,  # Run up to 3 evaluations at once; the default depends on your langsmith version
)

View the evaluation results for experiment: 'concurrency-9ab8caad' at:
https://smith.langchain.com/o/bd531ccf-4286-4467-99ba-7eab707122af/datasets/dab4404e-e95f-4521-9892-284e6476bbca/compare?selectedSessions=253bbe46-f1a4-4a23-906f-fd9fad5f0ae5





| | inputs.question | outputs.output | error | reference.output | feedback.is_concise | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|
| 0 | How can I compare different Mistral AI models? | To compare different Mistral AI models, you ca... | | You can compare different Mistral AI models by... | 0 | 1.041991 | 99f57a2e-5d00-40e2-9fbe-01a263fcebba | d679df64-ee4d-4fa2-9219-e1a073e6827e |
| 1 | What are the benefits of using custom evaluators? | Custom evaluators can be beneficial in complex... | | Custom evaluators allow you to test specific a... | 1 | 1.062715 | 9ebb8f0f-b91f-425d-b2da-d91149c0cfa1 | 3f51b044-d92c-47aa-86c9-64dbae4c8695 |
| 2 | What is LangSmith and how does it help with LL... | LangSmith is a platform that consists of a fro... | | LangSmith is a platform for debugging, testing... | 0 | 1.14945 | 7dfd0d3f-c378-4fe2-8a08-301d52dd92e2 | 86a467ac-6c78-42be-9868-6193fd8f0a2f |
| 3 | How do I set up tracing in my Mistral AI appli... | To set up tracing in your Mistral AI applicati... | | To set up tracing in your Mistral AI applicati... | 0 | 1.264403 | d1fc5afd-2380-4215-accc-cb26660ba5b3 | 99190fd8-9c97-43f4-95ea-f48682c06be2 |
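
The idea behind a concurrency limit can be illustrated with a bounded thread pool from the standard library. This is a generic sketch of the technique, not LangSmith's internal implementation: `slow_target` is a hypothetical stand-in for a target function that waits on a model call, and the pool size of 3 mirrors `max_concurrency=3` above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_target(question: str) -> str:
    """Hypothetical target that simulates waiting on a model call."""
    time.sleep(0.05)
    return question.upper()


questions = [
    "what is langsmith?",
    "how do i set up tracing?",
    "why use custom evaluators?",
    "how do i compare models?",
]

# At most 3 calls are in flight at any time; the 4th waits for a free worker.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(slow_target, q): q for q in questions}
    # Results are collected in completion order, not submission order.
    results = {futures[f]: f.result() for f in as_completed(futures)}

print(results)
```

Because results arrive in completion order, rows from a concurrent experiment may be listed in a different order than in a sequential one, as in the output above.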


##### Metadata

You can (and should) add metadata to your experiments to make them easier to find in the UI.

In [37]:
evaluate(
    target_function,
    data=dataset_name,
    evaluators=[is_concise_enough, contains_key_terms],
    experiment_prefix="metadata added",
    metadata={  # Custom metadata for Mistral AI experiments
        "model_name": MODEL_NAME,
        "model_provider": MODEL_PROVIDER,
        "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
        "app_version": APP_VERSION,
        "experiment_type": "mistral_comparison"
    }
)

View the evaluation results for experiment: 'metadata added-57bbb851' at:
https://smith.langchain.com/o/bd531ccf-4286-4467-99ba-7eab707122af/datasets/dab4404e-e95f-4521-9892-284e6476bbca/compare?selectedSessions=fa5357b3-c4f0-4114-a4f7-6981eb26b7a0





| | inputs.question | outputs.output | error | reference.output | feedback.is_concise | feedback.contains_key_terms | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is LangSmith and how does it help with LL... | LangSmith is a platform that consists of a fro... | | LangSmith is a platform for debugging, testing... | 0 | 1 | 0.95969 | 7dfd0d3f-c378-4fe2-8a08-301d52dd92e2 | 299fa867-f7b5-4b22-b13c-5f4dc78be42a |
| 1 | How can I compare different Mistral AI models? | To compare different Mistral AI models, you ca... | | You can compare different Mistral AI models by... | 0 | 1 | 1.048974 | 99f57a2e-5d00-40e2-9fbe-01a263fcebba | 77bd5bf5-8e27-4451-9a3f-7f2d6e73b731 |
| 2 | What are the benefits of using custom evaluators? | Custom evaluators can be beneficial in complex... | | Custom evaluators allow you to test specific a... | 1 | 1 | 0.751763 | 9ebb8f0f-b91f-425d-b2da-d91149c0cfa1 | a67b782d-9b20-43d5-be9b-f1100d49e83e |
| 3 | How do I set up tracing in my Mistral AI appli... | To set up tracing in your Mistral AI applicati... | | To set up tracing in your Mistral AI applicati... | 0 | 1 | 1.305908 | d1fc5afd-2380-4215-accc-cb26660ba5b3 | e3b17f17-82d9-4ea3-b45d-3920fcd1970f |


## Migration Summary and Learning Outcomes

### Key Changes Made:
1. **Model Provider**: Migrated from OpenAI to Mistral AI using `ChatMistralAI`
2. **Environment Variables**: Changed from `OPENAI_API_KEY` to `MISTRAL_API_KEY`
3. **Embeddings**: Replaced OpenAI embeddings with HuggingFace (`sentence-transformers/all-MiniLM-L6-v2`)
4. **Message Format**: Updated to use LangChain message objects (SystemMessage, HumanMessage)
5. **Model Configuration**: Updated to use `mistral-small-latest` and `mistral-tiny` for comparisons
6. **Project Name**: Updated to `langsmith-academy-mistral`

### Custom Tweaks:
1. **Additional Evaluator**: Added `contains_key_terms` evaluator to check for domain-specific terms
2. **Enhanced Metadata**: Extended experiment metadata to include embedding model and experiment type
3. **Model Comparison**: Changed comparison from GPT-4 vs GPT-3.5 to mistral-small-latest vs mistral-tiny
4. **Custom Dataset**: Updated dataset name to `Mistral RAG Application Dataset`
5. **Temperature Setting**: Set a low temperature (0.3) for more focused, less variable responses
6. **Persistent Storage**: Updated vector store path to `mistral_docs.parquet`

### Experimental Features Added:
1. **Dual Evaluation**: Combined conciseness and domain relevance evaluation
2. **Enhanced Tracing**: Improved metadata tracking for Mistral AI model calls
3. **Model Variants**: Easy comparison between different Mistral AI model sizes
4. **Embedding Optimization**: Used efficient open-source embeddings for cost reduction

### What I Learned:
1. **Experiment Migration**: Successfully adapted LangSmith experiments from OpenAI to Mistral AI while maintaining evaluation consistency
2. **Multi-Model Comparison**: Learned to compare different variants of the same model family (mistral-small vs mistral-tiny)
3. **Custom Evaluators**: Created domain-specific evaluators that test for relevant content beyond just conciseness
4. **Metadata Management**: Enhanced experiment tracking with comprehensive metadata for better analysis
5. **Cost Optimization**: Implemented cost-effective alternatives using HuggingFace embeddings while maintaining quality

### Technical Insights:
- Mistral AI offers models in several sizes; in these runs `mistral-small-latest` passed the conciseness check on 3 of 4 examples, versus 1 of 4 for `mistral-tiny`
- HuggingFace embeddings offer a viable open-source alternative to proprietary embedding services
- LangChain message objects provide consistent interfaces across different model providers
- Custom evaluators can be tailored to specific domains for more meaningful assessment
- Experiment metadata is crucial for tracking and comparing different model configurations

### Experimental Capabilities Demonstrated:
1. **Dataset Versioning**: Running experiments on specific dataset versions and splits
2. **Concurrency Control**: Optimizing experiment execution with parallel processing
3. **Repetition Testing**: Checking how consistent results are across repeated runs
4. **Selective Testing**: Running experiments on specific data points or subsets
5. **Comprehensive Metadata**: Tracking detailed experiment information for analysis

This migration demonstrates how to adapt experimental workflows when changing language model providers while adding improvements in evaluation methodology and cost efficiency.