# Tracing Basics

### Setup

Make sure you set your environment variables, including your Mistral API key.

In [20]:
# You can set them inline
import os
os.environ["MISTRAL_API_KEY"] = "MISTRAL_API_KEY"
os.environ["LANGSMITH_API_KEY"] = "LANGSMITH_API_KEY"
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "langsmith-academy"

In [21]:
# Or you can use a .env file
from dotenv import load_dotenv
load_dotenv(dotenv_path="../../.env", override=True)

True

### Tracing with @traceable

The @traceable decorator is a simple way to log traces from the LangSmith Python SDK. Simply decorate any function with @traceable.

The decorator works by creating a run tree each time the function is called and inserting it into the current trace. The function's inputs, name, and other information are then streamed to LangSmith. If the function raises an error or returns a response, that information is also added to the tree, and the updates are patched to LangSmith so you can detect and diagnose the source of errors. All of this happens on a background thread, so it does not block your app's execution.
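
To make the run-tree idea concrete, here is a minimal sketch of a toy decorator that nests child runs under the currently active run. It is not LangSmith's implementation: `toy_traceable`, `ROOT_RUNS`, and the dictionary layout are hypothetical names for illustration. It also keeps everything in memory instead of streaming to a backend on a background thread.

```python
import contextvars
import functools
import uuid

# Toy stand-in for a tracing decorator; NOT the LangSmith implementation.
_current_run = contextvars.ContextVar("current_run", default=None)
ROOT_RUNS = []  # top-level runs; a real SDK would send these to a backend


def toy_traceable(run_type="chain"):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            parent = _current_run.get()
            run = {
                "id": str(uuid.uuid4()),
                "name": func.__name__,
                "run_type": run_type,
                "inputs": {"args": args, "kwargs": kwargs},
                "outputs": None,
                "error": None,
                "children": [],
            }
            # Attach to the enclosing run, or register as a new root trace
            if parent is not None:
                parent["children"].append(run)
            else:
                ROOT_RUNS.append(run)
            token = _current_run.set(run)
            try:
                run["outputs"] = func(*args, **kwargs)
                return run["outputs"]
            except Exception as exc:
                # Record the failure on the run, then let it propagate
                run["error"] = repr(exc)
                raise
            finally:
                _current_run.reset(token)
        return wrapper
    return decorator


@toy_traceable(run_type="retriever")
def fetch(question):
    return ["doc about " + question]


@toy_traceable(run_type="chain")
def pipeline(question):
    return len(fetch(question))


pipeline("tracing")
root = ROOT_RUNS[0]
print(root["name"], "->", [child["name"] for child in root["children"]])
# prints: pipeline -> ['fetch']
```

Because `fetch` is called while `pipeline` is the active run, it is recorded as a child rather than as a separate trace. The nested runs in the cell below rely on the same idea.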

In [22]:
# Import traceable decorator for LangSmith tracing
from langsmith import traceable
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.messages import HumanMessage, SystemMessage
from typing import List
import nest_asyncio
from utils import get_vector_db_retriever

MODEL_PROVIDER = "mistral"
MODEL_NAME = "mistral-small-latest"
APP_VERSION = 1.0
RAG_SYSTEM_PROMPT = """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the latest question in the conversation. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.
"""

mistral_client = ChatMistralAI(model=MODEL_NAME)
nest_asyncio.apply()
retriever = get_vector_db_retriever()

# Set up tracing for each function using @traceable decorator
@traceable(run_type="retriever")
def retrieve_documents(question: str):
    return retriever.invoke(question)   # NOTE: This is a LangChain vector db retriever, so this .invoke() call will be traced automatically


@traceable(run_type="chain")
def generate_response(question: str, documents):
    formatted_docs = "\n\n".join(doc.page_content for doc in documents)
    messages = [
        {
            "role": "system",
            "content": RAG_SYSTEM_PROMPT
        },
        {
            "role": "user",
            "content": f"Context: {formatted_docs} \n\n Question: {question}"
        }
    ]
    return call_mistral(messages)


@traceable(run_type="llm")
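def call_mistral(
    messages: List[dict], model: str = MODEL_NAME, temperature: float = 0.0
):
    # Returns the AIMessage produced by ChatMistralAI (callers read .content), not a plain string.
    # NOTE: `model` and `temperature` are accepted but not forwarded to mistral_client.
    # Convert dict messages to LangChain message objects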
    langchain_messages = []
    for msg in messages:
        if msg["role"] == "system":
            langchain_messages.append(SystemMessage(content=msg["content"]))
        elif msg["role"] == "user":
            langchain_messages.append(HumanMessage(content=msg["content"]))
    
    return mistral_client.invoke(langchain_messages)


@traceable(run_type="chain")
def langsmith_rag(question: str):
    documents = retrieve_documents(question)
    response = generate_response(question, documents)
    return response.content


@traceable handles the RunTree lifecycle for you!

In [23]:
question = "What are the key benefits of using @traceable decorator for ML model monitoring?"
ai_answer = langsmith_rag(question)
print(f"Question: {question}")
print(f"Answer: {ai_answer}")
print(f"Model used: {MODEL_NAME}")
print("-" * 80)

Question: What are the key benefits of using @traceable decorator for ML model monitoring?
Answer: The @traceable decorator helps in monitoring traces by automatically logging and tracking runs related to a single operation. It provides detailed information on trace count, latency, and error rates. Additionally, it allows for indefinite data retention when traces are added to datasets, ensuring that the data is never deleted.
Model used: mistral-small-latest
--------------------------------------------------------------------------------


##### Let's take a look in LangSmith!

### Adding Metadata

LangSmith supports sending arbitrary metadata along with traces.

Metadata is a collection of key-value pairs that can be attached to runs. Use it to store additional information about a run, such as the version of the application that generated it, the environment it ran in, or anything else you want to associate with it. As with tags, you can use metadata to filter runs in the LangSmith UI and to group runs together for analysis.
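
As a rough illustration of why key-value metadata is useful for filtering and grouping, the sketch below works on plain Python dictionaries. The `runs` records, `filter_runs`, and `group_by` are hypothetical; they only mimic the kind of query the LangSmith UI lets you run and do not use the LangSmith API.

```python
from collections import defaultdict

# Hypothetical run records; in practice these would live in your tracing backend.
runs = [
    {"name": "langsmith_rag", "latency_s": 1.2, "metadata": {"pipeline_version": "1.0", "env": "dev"}},
    {"name": "langsmith_rag", "latency_s": 0.8, "metadata": {"pipeline_version": "1.1", "env": "dev"}},
    {"name": "langsmith_rag", "latency_s": 1.0, "metadata": {"pipeline_version": "1.1", "env": "prod"}},
]


def filter_runs(runs, **criteria):
    """Keep runs whose metadata matches every key/value pair in criteria."""
    return [
        run for run in runs
        if all(run["metadata"].get(key) == value for key, value in criteria.items())
    ]


def group_by(runs, key):
    """Bucket runs by the value stored under one metadata key."""
    groups = defaultdict(list)
    for run in runs:
        groups[run["metadata"].get(key)].append(run)
    return dict(groups)


dev_runs = filter_runs(runs, env="dev")
by_version = group_by(runs, "pipeline_version")

# Average latency per pipeline version
avg_latency = {
    version: sum(run["latency_s"] for run in group) / len(group)
    for version, group in by_version.items()
}
print(len(dev_runs), avg_latency)
```

Consistent keys such as `pipeline_version` are what make this kind of comparison possible across runs.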

In [24]:
from langsmith import traceable

@traceable(
    run_type="retriever",
    metadata={"vectordb": "sklearn", "embedding_model": "huggingface", "retrieval_method": "similarity_search"}
)
def retrieve_documents(question: str):
    return retriever.invoke(question)

@traceable(
    run_type="chain",
    metadata={"processing_step": "document_formatting", "max_context_length": 4000}
)
def generate_response(question: str, documents):
    formatted_docs = "\n\n".join(doc.page_content for doc in documents)
    messages = [
        {
            "role": "system",
            "content": RAG_SYSTEM_PROMPT
        },
        {
            "role": "user",
            "content": f"Context: {formatted_docs} \n\n Question: {question}"
        }
    ]
    return call_mistral(messages)

@traceable(
    run_type="llm",
    metadata={"model_name": MODEL_NAME, "model_provider": MODEL_PROVIDER, "temperature": 0.0, "max_tokens": 150}
)
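def call_mistral(
    messages: List[dict], model: str = MODEL_NAME, temperature: float = 0.0
):
    # Returns the AIMessage produced by ChatMistralAI (callers read .content), not a plain string.
    # NOTE: `model` and `temperature` are accepted but not forwarded to mistral_client.
    # Convert dict messages to LangChain message objects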
    langchain_messages = []
    for msg in messages:
        if msg["role"] == "system":
            langchain_messages.append(SystemMessage(content=msg["content"]))
        elif msg["role"] == "user":
            langchain_messages.append(HumanMessage(content=msg["content"]))
    
    return mistral_client.invoke(langchain_messages)

@traceable(
    run_type="chain",
    metadata={"pipeline_version": APP_VERSION, "system_type": "RAG", "user_id": "rakshit"}
)
def langsmith_rag(question: str):
    documents = retrieve_documents(question)
    response = generate_response(question, documents)
    return response.content


In [25]:
question = "How can metadata help in debugging and monitoring ML production systems?"
ai_answer = langsmith_rag(question)
print(f"Metadata Test Question: {question}")
print(f"Answer: {ai_answer}")
print(f"Metadata includes: vectordb, model_provider, pipeline_version, etc.")

Metadata Test Question: How can metadata help in debugging and monitoring ML production systems?
Answer: Metadata can help in debugging and monitoring ML production systems by providing additional context and information about each run. It allows for filtering and grouping runs based on specific criteria, such as the application version or environment. This makes it easier to identify and analyze issues within the system.
Metadata includes: vectordb, model_provider, pipeline_version, etc.


You can also add metadata at runtime!

In [26]:
import time
from datetime import datetime

question = "What are best practices for implementing real-time model performance monitoring?"
runtime_metadata = {
    "runtime_metadata": "production_test",
    "execution_time": datetime.now().isoformat(),
    "experiment_id": "exp_001",
    "user_session": "rakshit_session_123",
    "performance_mode": "optimized"
}

start_time = time.time()
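# NOTE: functions decorated with @traceable accept a langsmith_extra keyword argument at call time,
# e.g. langsmith_rag(question, langsmith_extra={"metadata": runtime_metadata}).
# This cell calls the function without it, so runtime_metadata is prepared but not attached to the trace.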
ai_answer = langsmith_rag(question)
end_time = time.time()

print(f"Runtime Metadata Test:")
print(f"Question: {question}")
print(f"Answer: {ai_answer}")
print(f"Execution time: {end_time - start_time:.2f} seconds")
print(f"Runtime metadata prepared: {list(runtime_metadata.keys())}")
print("Note: Runtime metadata would be added via langsmith_extra in production")

Runtime Metadata Test:
Question: What are best practices for implementing real-time model performance monitoring?
Answer: To implement real-time model performance monitoring, consider using online evaluation to assess your application's outputs in near real-time. This involves running evaluators on real inputs and outputs as they are produced, allowing you to monitor your application and flag unintended behavior. Additionally, you can use automation rules to send specific traces to annotation queues for human review, helping to spot check for issues and gather valuable feedback.
Execution time: 1.08 seconds
Runtime metadata prepared: ['runtime_metadata', 'execution_time', 'experiment_id', 'user_session', 'performance_mode']
Note: Runtime metadata would be added via langsmith_extra in production
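
The cell above prepares `runtime_metadata` but does not attach it to the run. With LangSmith's `@traceable`, call-time metadata is passed through the `langsmith_extra` keyword argument, for example `langsmith_rag(question, langsmith_extra={"metadata": runtime_metadata})`. The sketch below is a toy stand-in, not LangSmith code, showing one way a decorator can consume that keyword and combine it with decorator-level metadata. `toy_traceable_with_extra` and `RECORDED` are hypothetical names, and the rule that call-time keys override static ones is an assumption of this sketch.

```python
import functools

RECORDED = []  # stands in for runs sent to a tracing backend


def toy_traceable_with_extra(metadata=None):
    static_metadata = dict(metadata or {})

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, langsmith_extra=None, **kwargs):
            # The extra keyword is consumed here and never reaches func
            call_metadata = (langsmith_extra or {}).get("metadata", {})
            # Assumption of this sketch: call-time keys win on conflict
            merged = {**static_metadata, **call_metadata}
            result = func(*args, **kwargs)
            RECORDED.append({"name": func.__name__, "metadata": merged})
            return result
        return wrapper
    return decorator


@toy_traceable_with_extra(metadata={"system_type": "RAG", "pipeline_version": 1.0})
def answer(question):
    return question.upper()


answer("hello", langsmith_extra={"metadata": {"experiment_id": "exp_001"}})
answer("again")
print(RECORDED[0]["metadata"])
print(RECORDED[1]["metadata"])
```

The first recorded run carries both the static keys and `experiment_id`, while the second carries only the static keys. This is the per-call behavior that runtime metadata is meant to provide.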


##### Let's take a look in LangSmith!

### Custom Tracing Experiments

Let's explore advanced tracing scenarios with error handling, performance monitoring, and batch processing.

In [27]:
# Custom Experiment 1: Error Handling and Performance Monitoring
import time
from datetime import datetime

@traceable(
    run_type="chain",
    metadata={"experiment": "error_handling", "version": "1.0", "author": "rakshit"}
)
def robust_rag_with_fallback(question: str, max_retries: int = 2):
    """Enhanced RAG system with error handling and performance monitoring"""
    
    start_time = time.time()
    attempt_count = 0
    
    for attempt in range(max_retries + 1):
        attempt_count += 1
        try:
            # Add attempt-specific metadata
            attempt_metadata = {
                "attempt_number": attempt_count,
                "timestamp": datetime.now().isoformat(),
                "max_retries": max_retries
            }
            
            if len(question.strip()) == 0:
                raise ValueError("Empty question provided")
            
            # Simulate potential network delays
            if attempt > 0:
                time.sleep(0.1 * attempt)
            
            documents = retrieve_documents(question)
            if not documents:
                raise RuntimeError("No documents retrieved")
            
            response = generate_response(question, documents)
            
            end_time = time.time()
            execution_time = end_time - start_time
            
            return {
                "success": True,
                "response": response.content,  # Extract content from AIMessage
                "attempts": attempt_count,
                "execution_time": execution_time,
                "metadata": attempt_metadata
            }
            
        except Exception as e:
            if attempt == max_retries:
                # Final attempt failed
                end_time = time.time()
                return {
                    "success": False,
                    "error": str(e),
                    "attempts": attempt_count,
                    "execution_time": end_time - start_time,
                    "metadata": attempt_metadata
                }
            else:
                print(f"Attempt {attempt_count} failed: {e}. Retrying...")
                continue

# Test the robust RAG system
test_questions = [
    "How do I implement fault-tolerant ML systems?",
    "",  # This will trigger error handling
    "What are the key principles of reliable software architecture?"
]

print("Robust RAG System Test Results:")
print("=" * 50)

for i, test_q in enumerate(test_questions, 1):
    print(f"\nTest {i}: '{test_q}'")
    result = robust_rag_with_fallback(test_q)
    
    if result["success"]:
        print(f"Status: SUCCESS after {result['attempts']} attempts")
        print(f"Response: {result['response'][:100]}{'...' if len(result['response']) > 100 else ''}")
        print(f"Execution time: {result['execution_time']:.2f}s")
    else:
        print(f"Status: FAILED after {result['attempts']} attempts")
        print(f"Error: {result['error']}")
        print(f"Total time: {result['execution_time']:.2f}s")

Robust RAG System Test Results:
==================================================

Test 1: 'How do I implement fault-tolerant ML systems?'
Status: SUCCESS after 1 attempts
Response: To implement fault-tolerant ML systems, you can use libraries like tenacity or backoff in Python to ...
Execution time: 1.03s

Test 2: ''
Attempt 1 failed: Empty question provided. Retrying...
Attempt 2 failed: Empty question provided. Retrying...
Status: FAILED after 3 attempts
Error: Empty question provided
Total time: 0.00s

Test 3: 'What are the key principles of reliable software architecture?'
Status: SUCCESS after 1 attempts
Response: To implement fault-tolerant ML systems, you can use libraries like tenacity or backoff in Python to ...
Execution time: 1.03s


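In the output above, the empty question is attempted three times even though an invalid input can never succeed. One possible refinement, sketched here under stated assumptions, is to retry only failures that are plausibly transient and let everything else fail fast. `TransientError`, `call_with_retries`, `answer_question`, and `flaky` are hypothetical names for illustration. The exponential backoff is a swapped-in choice, not what the cell above does.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def call_with_retries(func, *args, max_retries=2, base_delay=0.1, sleep=time.sleep):
    """Retry func on TransientError with exponential backoff; other errors propagate."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args), attempt + 1
        except TransientError:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))


def answer_question(question):
    if not question.strip():
        # Invalid input: retrying cannot help, so this is not a TransientError
        raise ValueError("Empty question provided")
    return f"answer to {question!r}"


calls = {"n": 0}


def flaky(question):
    """Fails twice with a simulated timeout, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("simulated timeout")
    return answer_question(question)


# Sleep is stubbed out so the demo runs instantly
result, attempts = call_with_retries(flaky, "What is tracing?", sleep=lambda seconds: None)
print(result, attempts)

try:
    call_with_retries(answer_question, "   ")
except ValueError as exc:
    print("failed fast:", exc)
```

If the wrapped function is decorated with `@traceable`, each attempt would presumably appear as its own child run, so the trace would show how many retries a request needed.
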
In [28]:
# Custom Experiment 2: Batch Processing with Tracing Analytics
from typing import Dict, Any
import statistics

@traceable(
    run_type="chain",
    metadata={"experiment": "batch_processing", "processing_mode": "parallel_simulation"}
)
def batch_rag_processor(questions: List[str], batch_id: str = "batch_001"):
    """Process multiple questions with detailed tracing and analytics"""
    
    batch_start_time = time.time()
    results = []
    
    print(f"Processing batch: {batch_id}")
    print(f"Total questions: {len(questions)}")
    print("-" * 40)
    
    for i, question in enumerate(questions, 1):
        question_start_time = time.time()
        
        # Create question-specific metadata
        question_metadata = {
            "batch_id": batch_id,
            "question_index": i,
            "total_questions": len(questions),
            "question_length": len(question),
            "complexity_score": len(question.split()) / 10.0  # Simple complexity metric
        }
        
        try:
            # Process question (metadata would be added via langsmith_extra in production)
            response = langsmith_rag(question)
            
            question_end_time = time.time()
            processing_time = question_end_time - question_start_time
            
            result = {
                "index": i,
                "question": question,
                "response": response,
                "processing_time": processing_time,
                "success": True,
                "metadata": question_metadata
            }
            
            print(f"Q{i}: {question[:60]}{'...' if len(question) > 60 else ''}")
            print(f"     Processed in {processing_time:.2f}s")
            
        except Exception as e:
            result = {
                "index": i,
                "question": question,
                "error": str(e),
                "success": False,
                "metadata": question_metadata
            }
            print(f"Q{i}: ERROR - {e}")
        
        results.append(result)
    
    batch_end_time = time.time()
    total_batch_time = batch_end_time - batch_start_time
    
    # Generate batch analytics
    successful_results = [r for r in results if r["success"]]
    failed_results = [r for r in results if not r["success"]]
    
    analytics = {
        "batch_id": batch_id,
        "total_questions": len(questions),
        "successful": len(successful_results),
        "failed": len(failed_results),
        "success_rate": len(successful_results) / len(questions) * 100 if questions else 0.0,  # avoid ZeroDivisionError on an empty batch
        "total_batch_time": total_batch_time,
        "average_processing_time": statistics.mean([r["processing_time"] for r in successful_results]) if successful_results else 0,
        "min_processing_time": min([r["processing_time"] for r in successful_results]) if successful_results else 0,
        "max_processing_time": max([r["processing_time"] for r in successful_results]) if successful_results else 0
    }
    
    return results, analytics

# Test batch processing
batch_questions = [
    "What is the difference between machine learning and deep learning?",
    "How do I choose the right algorithm for my dataset?",
    "What are the key metrics for evaluating classification models?",
    "Explain the concept of overfitting and how to prevent it",
    "How do I implement model versioning in production?"
]

print("Batch Processing Experiment:")
print("=" * 60)

batch_results, batch_analytics = batch_rag_processor(batch_questions, "ml_concepts_batch")

print(f"\nBatch Analytics Summary:")
print(f"Batch ID: {batch_analytics['batch_id']}")
print(f"Success Rate: {batch_analytics['success_rate']:.1f}% ({batch_analytics['successful']}/{batch_analytics['total_questions']})")
print(f"Total Processing Time: {batch_analytics['total_batch_time']:.2f}s")
print(f"Average Question Time: {batch_analytics['average_processing_time']:.2f}s")
print(f"Fastest Question: {batch_analytics['min_processing_time']:.2f}s")
print(f"Slowest Question: {batch_analytics['max_processing_time']:.2f}s")

Batch Processing Experiment:
============================================================
Processing batch: ml_concepts_batch
Total questions: 5
----------------------------------------
Q1: What is the difference between machine learning and deep lea...
     Processed in 0.51s
Q2: How do I choose the right algorithm for my dataset?
     Processed in 1.63s
Q3: What are the key metrics for evaluating classification model...
     Processed in 1.02s
Q4: Explain the concept of overfitting and how to prevent it
     Processed in 0.62s
Q5: How do I implement model versioning in production?
     Processed in 1.31s

Batch Analytics Summary:
Batch ID: ml_concepts_batch
Success Rate: 100.0% (5/5)
Total Processing 

In [29]:
# Custom Experiment 3: A/B Testing Different Prompts with Tracing
import random

@traceable(run_type="chain", metadata={"experiment": "ab_testing", "test_type": "prompt_optimization"})
def ab_test_prompts(question: str, test_id: str = "ab_test_001"):
    """Test different prompt variations with tracing"""
    
    # Define prompt variations
    prompt_variations = {
        "concise": """You are a concise assistant. Answer briefly and directly. 
        Use the context provided. Maximum 2 sentences.""",
        
        "detailed": """You are a detailed assistant for question-answering tasks. 
        Use the following pieces of retrieved context to provide comprehensive answers. 
        Include examples when relevant. Use 3-4 sentences maximum.""",
        
        "technical": """You are a technical expert assistant. 
        Use the provided context to give precise, technical answers. 
        Include specific terminology and implementation details. Keep responses focused."""
    }
    
    results = {}
    
    for prompt_name, system_prompt in prompt_variations.items():
        start_time = time.time()
        
        # Create variation-specific metadata
        variation_metadata = {
            "test_id": test_id,
            "prompt_variation": prompt_name,
            "prompt_length": len(system_prompt),
            "timestamp": datetime.now().isoformat()
        }
        
        try:
            # Get documents (same for all variations)
            documents = retrieve_documents(question)
            formatted_docs = "\n\n".join(doc.page_content for doc in documents)
            
            # Create messages with the specific prompt variation
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Context: {formatted_docs} \n\n Question: {question}"}
            ]
            
            # Call Mistral with tracing
            response = call_mistral(messages)
            
            end_time = time.time()
            processing_time = end_time - start_time
            
            # Extract content from AIMessage object
            response_content = response.content if hasattr(response, 'content') else str(response)
            
            results[prompt_name] = {
                "response": response_content,
                "processing_time": processing_time,
                "response_length": len(response_content),
                "word_count": len(response_content.split()),
                "success": True,
                "metadata": variation_metadata
            }
            
        except Exception as e:
            results[prompt_name] = {
                "error": str(e),
                "success": False,
                "metadata": variation_metadata
            }
    
    return results

# Test A/B testing
test_question = "How do I optimize the performance of my machine learning model?"

print("A/B Testing Experiment: Prompt Variations")
print("=" * 55)
print(f"Test Question: {test_question}")
print()

ab_results = ab_test_prompts(test_question, "prompt_optimization_v1")

# Analyze results
for prompt_name, result in ab_results.items():
    print(f"Prompt Variation: {prompt_name.upper()}")
    print("-" * 30)
    
    if result["success"]:
        print(f"Response: {result['response']}")
        print(f"Processing time: {result['processing_time']:.2f}s")
        print(f"Response length: {result['response_length']} chars")
        print(f"Word count: {result['word_count']} words")
    else:
        print(f"Error: {result['error']}")
    
    print()

# Summary comparison
successful_results = {k: v for k, v in ab_results.items() if v["success"]}

if len(successful_results) > 1:
    print("Performance Comparison:")
    print("-" * 25)
    
    # Find fastest and most detailed responses
    fastest = min(successful_results.items(), key=lambda x: x[1]["processing_time"])
    most_detailed = max(successful_results.items(), key=lambda x: x[1]["word_count"])
    
    print(f"Fastest response: {fastest[0]} ({fastest[1]['processing_time']:.2f}s)")
    print(f"Most detailed: {most_detailed[0]} ({most_detailed[1]['word_count']} words)")
    
    avg_time = sum(r["processing_time"] for r in successful_results.values()) / len(successful_results)
    avg_length = sum(r["word_count"] for r in successful_results.values()) / len(successful_results)
    
    print(f"Average processing time: {avg_time:.2f}s")
    print(f"Average response length: {avg_length:.1f} words")

A/B Testing Experiment: Prompt Variations
=======================================================
Test Question: How do I optimize the performance of my machine learning model?

Prompt Variation: CONCISE
------------------------------
Response: To optimize your classifier's performance based on user feedback, follow the tutorial on LangChain's documentation. It guides you through building a GitHub issue classifier and improving it using collected feedback.
Processing time: 0.67s
Response length: 215 chars
Word count: 30 words

Prompt Variation: DETAILED
------------------------------
Response: To optimize the performance of your machine learning model, you can follow these steps:

1. **Collect and Utilize User Feedback**: Gather user feedback to create few-shot examples that can help refine your classifier. For instance, if you're classifying GitHub issues based on their titles, user feedback can provide the desired outputs, making it easier to improve the classifier's accuracy.

2. **Implement Rate Limit Handling**: When running large eval

## Summary

Working through this notebook was really eye-opening! I discovered that the @traceable decorator is like having a smart assistant that automatically logs everything your AI functions do - it creates these neat run trees and sends all the details to LangSmith without you having to think about it. What really clicked for me was seeing how you can add metadata to make debugging so much easier, and it all happens in the background without slowing anything down.

**What I Changed:**
- Switched everything from OpenAI to Mistral AI (using their mistral-small-latest model) but kept all the cool LangSmith tracing features working perfectly
- Updated the API keys - bye bye OPENAI_API_KEY, hello MISTRAL_API_KEY
- Fixed up the RAG system to work nicely with LangChain's ChatMistralAI - had to convert message formats properly
- Added tons of useful metadata to each function so I can track things like which model I'm using, performance metrics, and version info
- Built some fun experiments to test error handling, batch processing multiple questions at once, and A/B testing different prompts
- Created a robust system that can retry when things go wrong and monitors how long everything takes
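- Played around with runtime metadata: the notebook builds a per-call metadata dict (experiment ID, session, timestamp) that can be attached to a run through langsmith_extra, which is pretty neat for tracking dynamic information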

The coolest thing I learned is that LangSmith's tracing doesn't care which AI model you use - switching from OpenAI to Mistral was seamless. Plus, all those experiments showed me how powerful tracing can be for monitoring real production systems, optimizing performance, and systematically testing different approaches.