# Tracing for Different Types of Runs

### Types of Runs

LangSmith supports many different types of Runs - you can specify what type your Run is in the @traceable decorator. The types of runs are:

- LLM: Invokes an LLM
- Retriever: Retrieves documents from databases or other sources
- Tool: Executes actions with function calls
- Chain: Default type; combines multiple Runs into a larger process
- Prompt: Hydrates a prompt to be used with an LLM
- Parser: Extracts structured data

### Setup

In [1]:
# You can set them inline!
import os
os.environ["MISTRAL_API_KEY"] = "MISTRAL_API_KEY"
os.environ["LANGSMITH_API_KEY"] = "LANGSMITH_API_KEY"
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "langsmith-academy"

In [2]:
# Or you can use a .env file
from dotenv import load_dotenv
load_dotenv(dotenv_path="../../.env", override=True)

True

### LLM Runs for Chat Models

LangSmith provides special rendering and processing for LLM traces. In order to make the most of this feature, you must log your LLM traces in a specific format.

For chat-style models, inputs must be a list of messages in OpenAI-compatible format, represented as Python dictionaries or TypeScript object. Each message must contain the key role and content.

The output is accepted in any of the following formats:

- A dictionary/object that contains the key choices with a value that is a list of dictionaries/objects. Each dictionary/object must contain the key message, which maps to a message object with the keys role and content.
- A dictionary/object that contains the key message with a value that is a message object with the keys role and content.
- A tuple/array of two elements, where the first element is the role and the second element is the content.
- A dictionary/object that contains the key role and content.
The input to your function should be named messages.

You can also provide the following metadata fields to help LangSmith identify the model and calculate costs. If using LangChain or OpenAI wrapper, these fields will be automatically populated correctly.
- ls_provider: The provider of the model, eg "openai", "anthropic", etc.
- ls_model_name: The name of the model, eg "gpt-4o-mini", "claude-3-opus-20240307", etc.

In [3]:
from langsmith import traceable

inputs = [
  {"role": "system", "content": "You are a helpful AI assistant specializing in customer service."},
  {"role": "user", "content": "I'd like to book a table for two at your restaurant for tonight."},
]

output = {
  "choices": [
      {
          "message": {
              "role": "assistant",
              "content": "I'd be happy to help you with a reservation! What time would you prefer for your table for two tonight? We have availability between 6:00 PM and 9:00 PM."
          }
      }
  ]
}

# Can also use one of these formats:
# output = {
#     "message": {
#         "role": "assistant",
#         "content": "I'd be happy to help you with a reservation! What time would you prefer?"
#     }
# }
#
# output = {
#     "role": "assistant",
#     "content": "I'd be happy to help you with a reservation! What time would you prefer?"
# }
#
# output = ["assistant", "I'd be happy to help you with a reservation! What time would you prefer?"]

@traceable(
    run_type="llm",
    metadata={
        "ls_provider": "mistral",
        "ls_model_name": "mistral-small-latest",
        "temperature": 0.7,
        "max_tokens": 150,
        "use_case": "customer_service"
    }
)
def chat_model(messages: list):
    # Simulate Mistral AI chat model response
    print(f"Processing {len(messages)} messages with Mistral AI")
    return output

result = chat_model(inputs)
print(f"Chat model result: {result['choices'][0]['message']['content']}")

Processing 2 messages with Mistral AI
Chat model result: I'd be happy to help you with a reservation! What time would you prefer for your table for two tonight? We have availability between 6:00 PM and 9:00 PM.


### Handling Streaming LLM Runs

For streaming, you can "reduce" the outputs into the same format as the non-streaming version. This is currently only supported in Python.

In [4]:
def _reduce_chunks(chunks: list):
    """Combine streaming chunks into final response format"""
    all_text = "".join([chunk["choices"][0]["message"]["content"] for chunk in chunks])
    return {"choices": [{"message": {"content": all_text, "role": "assistant"}}]}

@traceable(
    run_type="llm",
    metadata={
        "ls_provider": "mistral", 
        "ls_model_name": "mistral-small-latest",
        "streaming": True,
        "chunk_size": "dynamic"
    },
    reduce_fn=_reduce_chunks
)
def my_streaming_mistral_model(messages: list):
    """Simulate Mistral AI streaming response"""
    user_name = messages[1]["content"] if len(messages) > 1 else "friend"
    
    # Simulate streaming chunks
    chunks = [
        "Hello there, ",
        f"{user_name}! ",
        "I'm Mistral AI, ",
        "and I'm here to help you. ",
        "How can I assist you today?"
    ]
    
    for chunk in chunks:
        yield {
            "choices": [
                {
                    "message": {
                        "content": chunk,
                        "role": "assistant",
                    }
                }
            ]
        }

# Test streaming with custom input
streaming_messages = [
    {"role": "system", "content": "You are Mistral AI, a helpful assistant. Please greet the user warmly."},
    {"role": "user", "content": "Data Scientist"},
]

print("Streaming Mistral AI Response:")
streaming_result = list(my_streaming_mistral_model(streaming_messages))
print(f"Generated {len(streaming_result)} chunks")
print(f"Final combined response: {_reduce_chunks(streaming_result)['choices'][0]['message']['content']}")

Streaming Mistral AI Response:
Generated 5 chunks
Final combined response: Hello there, Data Scientist! I'm Mistral AI, and I'm here to help you. How can I assist you today?


### Retriever Runs + Documents

Many LLM applications require looking up documents from vector databases, knowledge graphs, or other types of indexes. Retriever traces are a way to log the documents that are retrieved by the retriever. LangSmith provides special rendering for retrieval steps in traces to make it easier to understand and diagnose retrieval issues. In order for retrieval steps to be rendered correctly, a few small steps need to be taken.

1. Annotate the retriever step with run_type="retriever".
2. Return a list of Python dictionaries or TypeScript objects from the retriever step. Each dictionary should contain the following keys:
    - page_content: The text of the document.
    - type: This should always be "Document".
    - metadata: A python dictionary or TypeScript object containing metadata about the document. This metadata will be displayed in the trace.

In [5]:
from langsmith import traceable

def _convert_docs(results, query_metadata=None):
    """Convert retrieved results to proper document format"""
    return [
        {
            "page_content": r,
            "type": "Document",  # Fixed: This should be "type", not "Document"
            "metadata": {
                "source": f"knowledge_base_doc_{i+1}",
                "relevance_score": 0.95 - (i * 0.1),
                "document_type": "technical_guide",
                "query_used": query_metadata.get("query", "") if query_metadata else "",
                "retrieval_timestamp": "2025-10-04T12:00:00Z"
            }
        }
        for i, r in enumerate(results)
    ]

@traceable(
    run_type="retriever",
    metadata={
        "retriever_type": "vector_similarity",
        "embedding_model": "huggingface",
        "top_k": 3,
        "similarity_threshold": 0.8
    }
)
def retrieve_ml_docs(query):
    """Enhanced retriever for ML-related documents"""
    print(f"Retrieving documents for query: '{query}'")
    
    # Simulate ML-focused document retrieval
    if "model" in query.lower():
        contents = [
            "Machine learning models require careful evaluation using metrics like accuracy, precision, and recall.",
            "Model deployment involves containerization, API creation, and monitoring systems for production.",
            "Model versioning is crucial for tracking experiments and maintaining reproducible results."
        ]
    elif "data" in query.lower():
        contents = [
            "Data preprocessing includes cleaning, normalization, and feature engineering steps.",
            "Data quality assessment involves checking for missing values, outliers, and data consistency.",
            "Data pipeline automation ensures consistent and reliable data flow for ML systems."
        ]
    else:
        contents = [
            "General ML documentation covering best practices and implementation guidelines.",
            "Production ML systems require monitoring, logging, and automated testing capabilities.",
            "MLOps practices integrate development and operations for efficient ML lifecycle management."
        ]
    
    query_metadata = {"query": query, "total_results": len(contents)}
    retrieved_docs = _convert_docs(contents, query_metadata)
    
    print(f"Retrieved {len(retrieved_docs)} documents")
    for i, doc in enumerate(retrieved_docs, 1):
        print(f"Doc {i}: {doc['page_content'][:60]}...")
    
    return retrieved_docs

# Test retriever with different queries
test_queries = ["model evaluation techniques", "data preprocessing steps", "production deployment"]

for query in test_queries:
    print(f"\n--- Testing query: '{query}' ---")
    docs = retrieve_ml_docs(query)
    print(f"Metadata for first doc: {docs[0]['metadata']}")
    print("-" * 50)


--- Testing query: 'model evaluation techniques' ---
Retrieving documents for query: 'model evaluation techniques'
Retrieved 3 documents
Doc 1: Machine learning models require careful evaluation using met...
Doc 2: Model deployment involves containerization, API creation, an...
Doc 3: Model versioning is crucial for tracking experiments and mai...
Metadata for first doc: {'source': 'knowledge_base_doc_1', 'relevance_score': 0.95, 'document_type': 'technical_guide', 'query_used': 'model evaluation techniques', 'retrieval_timestamp': '2025-10-04T12:00:00Z'}
--------------------------------------------------

--- Testing query: 'data preprocessing steps' ---
Retrieving documents for query: 'data preprocessing steps'
Retrieved 3 documents
Doc 1: Data preprocessing includes cleaning, normalization, and fea...
Doc 2: Data quality assessment involves checking for missing values...
Doc 3: Data pipeline automation ensures consistent and reliable dat...
Metadata for first doc: {'source': 'knowl

### Tool Calling

LangSmith has custom rendering for Tool Calls made by the model to make it clear when provided tools are being used.

In [6]:
from langsmith import traceable
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.messages import HumanMessage, SystemMessage, ToolMessage
from typing import List, Optional, Dict, Any
import json
import random

mistral_client = ChatMistralAI(model="mistral-small-latest")

@traceable(
    run_type="tool",
    metadata={
        "tool_name": "get_current_temperature",
        "tool_category": "weather",
        "response_format": "json"
    }
)
def get_current_temperature(location: str, unit: str):
    """Enhanced temperature tool with realistic simulation"""
    print(f"Getting temperature for {location} in {unit}")
    
    # Simulate realistic temperature data
    base_temps = {
        "New York": {"F": 68, "C": 20},
        "London": {"F": 59, "C": 15},
        "Tokyo": {"F": 73, "C": 23},
        "Sydney": {"F": 77, "C": 25}
    }
    
    # Find closest match or use default
    city_key = next((k for k in base_temps.keys() if k.lower() in location.lower()), "New York")
    unit_key = "F" if unit == "Fahrenheit" else "C"
    
    # Add some realistic variation
    base_temp = base_temps[city_key][unit_key]
    actual_temp = base_temp + random.randint(-5, 5)
    
    return {
        "temperature": actual_temp,
        "location": location,
        "unit": unit,
        "conditions": random.choice(["sunny", "partly cloudy", "overcast", "light rain"]),
        "humidity": random.randint(40, 80)
    }

@traceable(
    run_type="tool",
    metadata={
        "tool_name": "get_weather_forecast",
        "tool_category": "weather",
        "forecast_days": 3
    }
)
def get_weather_forecast(location: str, days: int = 3):
    """Additional weather tool for extended forecasting"""
    print(f"Getting {days}-day forecast for {location}")
    
    forecast = []
    for day in range(days):
        temp_f = random.randint(60, 85)
        forecast.append({
            "day": f"Day {day + 1}",
            "temperature_f": temp_f,
            "temperature_c": round((temp_f - 32) * 5/9),
            "condition": random.choice(["sunny", "cloudy", "rainy", "partly cloudy"])
        })
    
    return {
        "location": location,
        "forecast": forecast,
        "generated_at": "2025-10-04T12:00:00Z"
    }

@traceable(
    run_type="llm",
    metadata={
        "ls_provider": "mistral",
        "ls_model_name": "mistral-small-latest",
        "supports_tools": True
    }
)
def call_mistral_with_tools(messages: List[dict], tools: Optional[List[dict]] = None):
    """Simulate Mistral AI with tool calling capability"""
    print(f"Calling Mistral AI with {len(messages)} messages and {len(tools) if tools else 0} tools")
    
    # Simulate tool calling logic
    user_message = next((msg["content"] for msg in messages if msg["role"] == "user"), "")
    
    if "temperature" in user_message.lower() or "weather" in user_message.lower():
        # Extract location from user message (simplified)
        location = "New York City"  # Default location
        if "london" in user_message.lower():
            location = "London"
        elif "tokyo" in user_message.lower():  
            location = "Tokyo"
        elif "sydney" in user_message.lower():
            location = "Sydney"
        
        # Simulate tool call response
        return {
            "choices": [{
                "message": {
                    "role": "assistant",
                    "content": None,
                    "tool_calls": [{
                        "id": f"call_{random.randint(1000, 9999)}",
                        "type": "function",  
                        "function": {
                            "name": "get_current_temperature",
                            "arguments": json.dumps({
                                "location": location,
                                "unit": "Fahrenheit"
                            })
                        }
                    }]
                }
            }]
        }
    
    # Regular response without tools
    return {
        "choices": [{
            "message": {
                "role": "assistant",
                "content": "I'd be happy to help! However, I need more specific information to assist you properly."
            }
        }]
    }

@traceable(
    run_type="chain",
    metadata={
        "pipeline_name": "weather_assistant",
        "supports_tools": True,
        "version": "2.0"
    }
)
def enhanced_weather_assistant(inputs, tools):
    """Enhanced weather assistant with better error handling and multiple tools"""
    try:
        print("Starting enhanced weather assistant...")
        
        # First call to Mistral AI
        response = call_mistral_with_tools(inputs, tools)
        
        if response.choices[0].message.tool_calls:
            # Process tool calls
            tool_call = response.choices[0].message.tool_calls[0]
            function_name = tool_call.function.name
            function_args = json.loads(tool_call.function.arguments)
            
            print(f"Tool called: {function_name} with args: {function_args}")
            
            # Execute the appropriate tool
            if function_name == "get_current_temperature":
                tool_result = get_current_temperature(**function_args)
            elif function_name == "get_weather_forecast":
                tool_result = get_weather_forecast(**function_args)
            else:
                tool_result = {"error": f"Unknown function: {function_name}"}
            
            # Create tool response message
            tool_response_message = {
                "role": "tool",
                "content": json.dumps(tool_result),
                "tool_call_id": tool_call.id
            }
            
            # Add assistant message and tool response to conversation
            inputs.append(response.choices[0].message)
            inputs.append(tool_response_message)
            
            # Second call to generate final response
            final_response = call_mistral_with_tools(inputs, None)
            
            return {
                "final_response": final_response,
                "tool_used": function_name,
                "tool_result": tool_result,
                "conversation_length": len(inputs)
            }
        
        else:
            return {
                "final_response": response,
                "tool_used": None,
                "tool_result": None,
                "conversation_length": len(inputs)
            }
            
    except Exception as e:
        return {
            "error": str(e),
            "conversation_length": len(inputs)
        }

# Enhanced tools definition
enhanced_tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_temperature",
            "description": "Get the current temperature and weather conditions for a specific location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state/country, e.g., San Francisco, CA or London, UK"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["Celsius", "Fahrenheit"],
                        "description": "The temperature unit to use. Default to Fahrenheit for US locations, Celsius for others."
                    }
                },
                "required": ["location", "unit"]
            }
        }
    },
    {
        "type": "function", 
        "function": {
            "name": "get_weather_forecast",
            "description": "Get multi-day weather forecast for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state/country for the forecast"
                    },
                    "days": {
                        "type": "integer",
                        "description": "Number of days to forecast (1-7)",
                        "minimum": 1,
                        "maximum": 7
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Test with multiple scenarios
test_scenarios = [
    {
        "name": "Current Weather Query",
        "inputs": [
            {"role": "system", "content": "You are a helpful weather assistant with access to current weather data."},
            {"role": "user", "content": "What's the current temperature in Tokyo?"},
        ]
    },
    {
        "name": "Detailed Weather Query", 
        "inputs": [
            {"role": "system", "content": "You are a helpful weather assistant."},
            {"role": "user", "content": "Can you tell me about the weather in London today? I'm planning outdoor activities."},
        ]
    }
]

for scenario in test_scenarios:
    print(f"\n{'='*60}")
    print(f"Testing: {scenario['name']}")
    print(f"{'='*60}")
    
    result = enhanced_weather_assistant(scenario["inputs"], enhanced_tools)
    
    if "error" not in result:
        print(f"Tool used: {result['tool_used']}")
        if result['tool_result']:
            print(f"Tool result: {result['tool_result']}")
        print(f"Conversation length: {result['conversation_length']} messages")
    else:
        print(f"Error occurred: {result['error']}")
    
    print("-" * 60)


Testing: Current Weather Query
Starting enhanced weather assistant...
Calling Mistral AI with 2 messages and 2 tools
Error occurred: 'dict' object has no attribute 'choices'
------------------------------------------------------------

Testing: Detailed Weather Query
Starting enhanced weather assistant...
Calling Mistral AI with 2 messages and 2 tools
Error occurred: 'dict' object has no attribute 'choices'
------------------------------------------------------------


### Custom Run Type Experiments

Let's explore advanced scenarios with different run types including Prompt, Parser, and Chain runs in production-like settings.

In [7]:
# Custom Experiment 1: Prompt Run Type with Template Management
import re
from datetime import datetime

@traceable(
    run_type="prompt",
    metadata={
        "prompt_template": "ml_evaluation_prompt",
        "version": "1.2",
        "parameters": ["model_type", "dataset", "metrics", "context"]
    }
)
def create_ml_evaluation_prompt(model_type: str, dataset: str, metrics: list, context: str = "production"):
    """Create specialized prompts for ML model evaluation"""
    
    template = """
    You are an expert ML engineer evaluating a {model_type} model trained on {dataset}.
    
    Context: {context} environment
    
    Please analyze the following metrics and provide insights:
    Metrics to evaluate: {metrics_list}
    
    Your evaluation should include:
    1. Performance assessment
    2. Potential issues or concerns  
    3. Recommendations for improvement
    4. Production readiness assessment
    
    Be specific and actionable in your recommendations.
    """
    
    metrics_list = ", ".join(metrics) if isinstance(metrics, list) else str(metrics)
    
    prompt = template.format(
        model_type=model_type,
        dataset=dataset,
        context=context,
        metrics_list=metrics_list
    )
    
    return {
        "prompt": prompt.strip(),
        "metadata": {
            "template_used": "ml_evaluation_prompt",
            "parameters": {
                "model_type": model_type,
                "dataset": dataset,
                "metrics": metrics,
                "context": context
            },
            "prompt_length": len(prompt),
            "generated_at": datetime.now().isoformat()
        }
    }

@traceable(
    run_type="prompt", 
    metadata={
        "prompt_template": "data_analysis_prompt",
        "use_case": "data_science"
    }
)
def create_data_analysis_prompt(analysis_type: str, data_source: str, objectives: list):
    """Create prompts for data analysis tasks"""
    
    template = """
    As a senior data scientist, please perform a {analysis_type} analysis on data from {data_source}.
    
    Analysis Objectives:
    {objectives_list}
    
    Please provide:
    - Key findings and insights
    - Statistical significance of results
    - Visualization recommendations
    - Next steps for investigation
    
    Focus on actionable insights that can drive business decisions.
    """
    
    objectives_list = "\n".join([f"- {obj}" for obj in objectives])
    
    prompt = template.format(
        analysis_type=analysis_type,
        data_source=data_source,
        objectives_list=objectives_list
    )
    
    return {
        "prompt": prompt.strip(),
        "parameters_used": {
            "analysis_type": analysis_type,
            "data_source": data_source, 
            "objectives_count": len(objectives)
        }
    }

# Test prompt generation
print("Testing ML Evaluation Prompt Generation:")
print("=" * 50)

ml_prompt_result = create_ml_evaluation_prompt(
    model_type="Random Forest Classifier",
    dataset="customer_churn_data",  
    metrics=["accuracy", "precision", "recall", "f1-score", "auc-roc"],
    context="production"
)

print("Generated Prompt:")
print(ml_prompt_result["prompt"])
print(f"\nPrompt metadata: {ml_prompt_result['metadata']}")

print("\n" + "="*50)
print("Testing Data Analysis Prompt Generation:")
print("=" * 50)

data_prompt_result = create_data_analysis_prompt(
    analysis_type="exploratory",
    data_source="sales_transactions_db",
    objectives=[
        "Identify seasonal patterns in sales",
        "Analyze customer segmentation opportunities", 
        "Detect anomalies in transaction patterns"
    ]
)

print("Generated Prompt:")
print(data_prompt_result["prompt"])
print(f"\nParameters used: {data_prompt_result['parameters_used']}")

Testing ML Evaluation Prompt Generation:
Generated Prompt:
You are an expert ML engineer evaluating a Random Forest Classifier model trained on customer_churn_data.

    Context: production environment

    Please analyze the following metrics and provide insights:
    Metrics to evaluate: accuracy, precision, recall, f1-score, auc-roc

    Your evaluation should include:
    1. Performance assessment
    2. Potential issues or concerns  
    3. Recommendations for improvement
    4. Production readiness assessment

    Be specific and actionable in your recommendations.

Prompt metadata: {'template_used': 'ml_evaluation_prompt', 'parameters': {'model_type': 'Random Forest Classifier', 'dataset': 'customer_churn_data', 'metrics': ['accuracy', 'precision', 'recall', 'f1-score', 'auc-roc'], 'context': 'production'}, 'prompt_length': 528, 'generated_at': '2025-10-04T17:30:48.960003'}

Testing Data Analysis Prompt Generation:
Generated Prompt:
As a senior data scientist, please perform a e

In [8]:
# Custom Experiment 2: Parser Run Type for Structured Data Extraction
import json
import re
from typing import Dict, List, Union

@traceable(
    run_type="parser",
    metadata={
        "parser_type": "ml_metrics_parser",
        "output_format": "structured_json",
        "validation": True
    }
)
def parse_ml_metrics_response(raw_response: str):
    """Parse ML model evaluation response into structured format"""
    
    parsed_data = {
        "metrics": {},
        "recommendations": [],
        "issues": [],
        "production_ready": None,
        "confidence_score": 0.0
    }
    
    try:
        # Extract metrics using regex patterns
        metric_patterns = {
            "accuracy": r"accuracy[:\s]*([0-9.]+)",
            "precision": r"precision[:\s]*([0-9.]+)",
            "recall": r"recall[:\s]*([0-9.]+)",
            "f1_score": r"f1[-\s]?score[:\s]*([0-9.]+)",
            "auc": r"auc[:\s]*([0-9.]+)"
        }
        
        for metric, pattern in metric_patterns.items():
            match = re.search(pattern, raw_response.lower())
            if match:
                parsed_data["metrics"][metric] = float(match.group(1))
        
        # Extract recommendations
        recommendations_section = re.search(
            r"recommendations?[:\s]*(.*?)(?=\n\n|\n[A-Z]|$)", 
            raw_response, 
            re.IGNORECASE | re.DOTALL
        )
        if recommendations_section:
            recs_text = recommendations_section.group(1)
            parsed_data["recommendations"] = [
                rec.strip() for rec in re.split(r'[•\-\d+\.]\s*', recs_text) 
                if rec.strip() and len(rec.strip()) > 10
            ]
        
        # Extract issues/concerns
        issues_section = re.search(
            r"(?:issues?|concerns?|problems?)[:\s]*(.*?)(?=\n\n|\n[A-Z]|$)",
            raw_response,
            re.IGNORECASE | re.DOTALL
        )
        if issues_section:
            issues_text = issues_section.group(1)
            parsed_data["issues"] = [
                issue.strip() for issue in re.split(r'[•\-\d+\.]\s*', issues_text)
                if issue.strip() and len(issue.strip()) > 10
            ]
        
        # Determine production readiness
        production_indicators = {
            "ready": ["production ready", "ready for deployment", "suitable for production"],
            "not_ready": ["not ready", "needs improvement", "requires optimization"]
        }
        
        response_lower = raw_response.lower()
        ready_score = sum(1 for phrase in production_indicators["ready"] if phrase in response_lower)
        not_ready_score = sum(1 for phrase in production_indicators["not_ready"] if phrase in response_lower)
        
        if ready_score > not_ready_score:
            parsed_data["production_ready"] = True
        elif not_ready_score > ready_score:
            parsed_data["production_ready"] = False
        
        # Calculate confidence score based on completeness
        completeness_score = 0
        if parsed_data["metrics"]: completeness_score += 0.4
        if parsed_data["recommendations"]: completeness_score += 0.3  
        if parsed_data["issues"]: completeness_score += 0.2
        if parsed_data["production_ready"] is not None: completeness_score += 0.1
        
        parsed_data["confidence_score"] = completeness_score
        
        return parsed_data
        
    except Exception as e:
        return {
            "error": str(e),
            "raw_response": raw_response[:200] + "..." if len(raw_response) > 200 else raw_response
        }

@traceable(
    run_type="parser",
    metadata={
        "parser_type": "data_insights_parser",
        "structured_output": True
    }
)
def parse_data_analysis_response(raw_response: str):
    """Parse data analysis response into actionable insights"""
    
    insights = {
        "key_findings": [],
        "statistical_significance": {},
        "visualizations": [],
        "next_steps": [],
        "business_impact": None
    }
    
    # Extract key findings
    findings_match = re.search(
        r"(?:key findings?|findings?|insights?)[:\s]*(.*?)(?=statistical|visualization|next steps|$)",
        raw_response,
        re.IGNORECASE | re.DOTALL
    )
    if findings_match:
        findings_text = findings_match.group(1)
        insights["key_findings"] = [
            finding.strip() for finding in re.split(r'[•\-\d+\.]\s*', findings_text)
            if finding.strip() and len(finding.strip()) > 15
        ]
    
    # Extract visualization recommendations
    viz_match = re.search(
        r"visualization[s]?[:\s]*(.*?)(?=next steps|business|$)",
        raw_response,
        re.IGNORECASE | re.DOTALL
    )
    if viz_match:
        viz_text = viz_match.group(1) 
        insights["visualizations"] = [
            viz.strip() for viz in re.split(r'[•\-\d+\.]\s*', viz_text)
            if viz.strip() and len(viz.strip()) > 10
        ]
    
    return insights

# Test parsing functionality
print("Testing ML Metrics Parser:")
print("=" * 40)

sample_ml_response = """
The Random Forest model shows strong performance with the following metrics:
- Accuracy: 0.87
- Precision: 0.84  
- Recall: 0.89
- F1-score: 0.86
- AUC: 0.91

Recommendations:
- Consider feature selection to reduce overfitting
- Implement cross-validation for more robust evaluation
- Monitor model drift in production

Issues:
- Model shows slight bias towards majority class
- Training time is relatively high for large datasets

The model appears ready for production deployment with proper monitoring.
"""

parsed_ml_result = parse_ml_metrics_response(sample_ml_response)
print("Parsed ML Metrics:")
print(json.dumps(parsed_ml_result, indent=2))

print("\n" + "="*40)
print("Testing Data Analysis Parser:")
print("=" * 40)

sample_analysis_response = """
Key Findings:
- Sales show clear seasonal patterns with 40% increase in Q4
- Customer segments cluster into 3 distinct groups based on purchasing behavior
- Transaction anomalies detected in 2.3% of records, primarily high-value purchases

Visualizations:
- Time series plot for seasonal trends
- Cluster scatter plot for customer segmentation  
- Histogram of transaction amounts with outlier highlighting

Next Steps:
- Investigate anomalous transactions for fraud patterns
- Develop targeted marketing for each customer segment
"""

parsed_analysis_result = parse_data_analysis_response(sample_analysis_response)
print("Parsed Analysis Insights:")
print(json.dumps(parsed_analysis_result, indent=2))

Testing ML Metrics Parser:
Parsed ML Metrics:
{
  "metrics": {
    "accuracy": 0.87,
    "precision": 0.84,
    "recall": 0.89,
    "f1_score": 0.86,
    "auc": 0.91
  },
  "recommendations": [
    "Consider feature selection to reduce overfitting",
    "Implement cross",
    "validation for more robust evaluation",
    "Monitor model drift in production"
  ],
  "issues": [
    "Model shows slight bias towards majority class",
    "Training time is relatively high for large datasets"
  ],
  "production_ready": null,
  "confidence_score": 0.8999999999999999
}

Testing Data Analysis Parser:
Parsed Analysis Insights:
{
  "key_findings": [
    "Sales show clear seasonal patterns with",
    "Customer segments cluster into",
    "distinct groups based on purchasing behavior",
    "Transaction anomalies detected in",
    "% of records, primarily high"
  ],
  "statistical_significance": {},
  "visualizations": [
    "Time series plot for seasonal trends",
    "Cluster scatter plot for customer

In [9]:
# Custom Experiment 3: Complex Chain Run Type - ML Analysis Pipeline
from typing import Any, Dict
import time

@traceable(
    run_type="chain",
    metadata={
        "chain_name": "ml_analysis_pipeline",
        "version": "2.1",
        "components": ["prompt", "llm", "parser", "retriever"],
        "use_case": "automated_ml_evaluation"
    }
)
def ml_analysis_pipeline(
    model_info: Dict[str, Any], 
    evaluation_request: str,
    include_recommendations: bool = True
):
    """Complete ML analysis pipeline orchestrating multiple run types"""
    
    pipeline_start = time.time()
    results = {
        "pipeline_id": f"ml_analysis_{int(time.time())}",
        "stages": {},
        "final_output": {},
        "performance_metrics": {}
    }
    
    try:
        # Stage 1: Document Retrieval (Retriever run)
        print("Stage 1: Retrieving relevant ML documentation...")
        stage1_start = time.time()
        
        retrieval_query = f"best practices for {model_info.get('type', 'machine learning')} model evaluation"
        retrieved_docs = retrieve_ml_docs(retrieval_query)
        
        results["stages"]["document_retrieval"] = {
            "status": "completed",
            "documents_count": len(retrieved_docs),
            "duration": time.time() - stage1_start
        }
        
        # Stage 2: Prompt Generation (Prompt run)
        print("Stage 2: Generating specialized evaluation prompt...")
        stage2_start = time.time()
        
        prompt_result = create_ml_evaluation_prompt(
            model_type=model_info.get("type", "Unknown"),
            dataset=model_info.get("dataset", "Unknown"),
            metrics=model_info.get("metrics", ["accuracy", "precision", "recall"]),
            context="automated_analysis"
        )
        
        results["stages"]["prompt_generation"] = {
            "status": "completed", 
            "prompt_length": len(prompt_result["prompt"]),
            "duration": time.time() - stage2_start
        }
        
        # Stage 3: LLM Processing (LLM run)
        print("Stage 3: Processing with Mistral AI...")
        stage3_start = time.time()
        
        # Simulate LLM call with the generated prompt
        llm_messages = [
            {"role": "system", "content": "You are an expert ML evaluation assistant."},
            {"role": "user", "content": prompt_result["prompt"] + "\n\nAdditional context: " + evaluation_request}
        ]
        
        # Simulate realistic ML evaluation response
        simulated_response = f"""
        Based on the {model_info.get('type', 'model')} trained on {model_info.get('dataset', 'the dataset')}, here's my evaluation:

        Performance Metrics Analysis:
        - Accuracy: {model_info.get('accuracy', 0.85):.2f} - Shows good overall performance
        - Precision: {model_info.get('precision', 0.82):.2f} - Acceptable for most applications  
        - Recall: {model_info.get('recall', 0.88):.2f} - Good coverage of positive cases
        - F1-score: 0.85 - Balanced performance between precision and recall

        Key Findings:
        - Model demonstrates consistent performance across validation sets
        - No significant signs of overfitting detected
        - Feature importance analysis shows logical patterns

        Recommendations:
        - Implement real-time monitoring for model drift detection
        - Set up automated retraining pipeline for data updates
        - Consider ensemble methods to improve robustness
        - Establish baseline metrics for production comparison

        Issues and Concerns:
        - Model inference time may be high for real-time applications
        - Limited testing on edge cases and adversarial inputs
        - Missing bias analysis for protected attributes

        Production Readiness Assessment:
        The model appears ready for production deployment with proper monitoring infrastructure.
        Confidence level: High (85%)
        """
        
        results["stages"]["llm_processing"] = {
            "status": "completed",
            "response_length": len(simulated_response),
            "duration": time.time() - stage3_start
        }
        
        # Stage 4: Response Parsing (Parser run)
        print("Stage 4: Parsing and structuring response...")
        stage4_start = time.time()
        
        parsed_result = parse_ml_metrics_response(simulated_response)
        
        results["stages"]["response_parsing"] = {
            "status": "completed",
            "metrics_extracted": len(parsed_result.get("metrics", {})),
            "recommendations_count": len(parsed_result.get("recommendations", [])),
            "confidence_score": parsed_result.get("confidence_score", 0),
            "duration": time.time() - stage4_start
        }
        
        # Stage 5: Final Analysis and Reporting (Chain orchestration)
        print("Stage 5: Generating final analysis report...")
        stage5_start = time.time()
        
        total_pipeline_time = time.time() - pipeline_start
        
        final_report = {
            "model_evaluation": parsed_result,
            "retrieved_context": {
                "documents_used": len(retrieved_docs),
                "relevance_scores": [doc["metadata"]["relevance_score"] for doc in retrieved_docs]
            },
            "pipeline_performance": {
                "total_duration": total_pipeline_time,
                "stages_completed": len([s for s in results["stages"].values() if s["status"] == "completed"]),
                "average_stage_time": sum(s["duration"] for s in results["stages"].values()) / len(results["stages"])
            },
            "quality_assessment": {
                "data_completeness": parsed_result.get("confidence_score", 0),
                "recommendation_coverage": "high" if len(parsed_result.get("recommendations", [])) >= 3 else "medium",
                "production_readiness": parsed_result.get("production_ready", False)
            }
        }
        
        results["stages"]["final_reporting"] = {
            "status": "completed",
            "duration": time.time() - stage5_start
        }
        
        results["final_output"] = final_report
        results["performance_metrics"] = {
            "pipeline_success": True,
            "total_execution_time": total_pipeline_time,
            "stages_completed": 5,
            "efficiency_score": 1.0 / total_pipeline_time if total_pipeline_time > 0 else 0
        }
        
        return results
        
    except Exception as e:
        results["error"] = str(e)
        results["performance_metrics"] = {
            "pipeline_success": False,
            "total_execution_time": time.time() - pipeline_start,
            "error_stage": len(results["stages"]) + 1
        }
        return results

# Test the complete ML analysis pipeline
print("Testing Complete ML Analysis Pipeline:")
print("=" * 60)

test_model_info = {
    "type": "Gradient Boosting Classifier",
    "dataset": "customer_behavior_analysis",
    "accuracy": 0.89,
    "precision": 0.87,
    "recall": 0.91,
    "metrics": ["accuracy", "precision", "recall", "f1-score", "auc-roc", "feature_importance"]
}

test_request = """
Please evaluate this model for deployment in a customer recommendation system. 
Focus on production readiness, potential biases, and scalability concerns.
The system needs to handle 10,000+ predictions per day with <100ms response time.
"""

print(f"Model Info: {test_model_info}")
print(f"Evaluation Request: {test_request[:100]}...")
print("\nExecuting pipeline...")
print("-" * 60)

pipeline_result = ml_analysis_pipeline(test_model_info, test_request)

if pipeline_result["performance_metrics"]["pipeline_success"]:
    print("Pipeline Status: SUCCESS")
    print(f"Total Execution Time: {pipeline_result['performance_metrics']['total_execution_time']:.2f}s")
    print(f"Stages Completed: {pipeline_result['performance_metrics']['stages_completed']}")
    
    print("\nStage Performance:")
    for stage_name, stage_info in pipeline_result["stages"].items():
        print(f"  {stage_name}: {stage_info['status']} ({stage_info['duration']:.2f}s)")
    
    print("\nFinal Analysis Summary:")
    final_output = pipeline_result["final_output"]
    print(f"  Production Ready: {final_output['model_evaluation'].get('production_ready', 'Unknown')}")
    print(f"  Recommendations: {len(final_output['model_evaluation'].get('recommendations', []))}")
    print(f"  Issues Identified: {len(final_output['model_evaluation'].get('issues', []))}")
    print(f"  Data Completeness: {final_output['quality_assessment']['data_completeness']:.2f}")
    
else:
    print("Pipeline Status: FAILED")
    print(f"Error: {pipeline_result.get('error', 'Unknown error')}")
    print(f"Failed at stage: {pipeline_result['performance_metrics'].get('error_stage', 'Unknown')}")

print("\n" + "="*60)

Testing Complete ML Analysis Pipeline:
Model Info: {'type': 'Gradient Boosting Classifier', 'dataset': 'customer_behavior_analysis', 'accuracy': 0.89, 'precision': 0.87, 'recall': 0.91, 'metrics': ['accuracy', 'precision', 'recall', 'f1-score', 'auc-roc', 'feature_importance']}
Evaluation Request: 
Please evaluate this model for deployment in a customer recommendation system. 
Focus on production...

Executing pipeline...
------------------------------------------------------------
Stage 1: Retrieving relevant ML documentation...
Retrieving documents for query: 'best practices for Gradient Boosting Classifier model evaluation'
Retrieved 3 documents
Doc 1: Machine learning models require careful evaluation using met...
Doc 2: Model deployment involves containerization, API creation, an...
Doc 3: Model versioning is crucial for tracking experiments and mai...
Stage 2: Generating specialized evaluation prompt...
Stage 3: Processing with Mistral AI...
Stage 4: Parsing and structuring respo

## Summary of Learning and Changes

### What I Learned
Through this notebook, I gained hands-on experience with LangSmith's different run types and their specific use cases. The key learning was understanding how each run type serves a distinct purpose in ML workflows:

- **LLM runs** require specific input/output formats and metadata for proper cost tracking and visualization
- **Retriever runs** need structured document formats with metadata for effective debugging of retrieval issues  
- **Tool runs** enable function calling capabilities and require proper tool definition schemas
- **Prompt runs** help manage and version template generation for consistent LLM interactions
- **Parser runs** structure unstructured LLM outputs into actionable data formats
- **Chain runs** orchestrate multiple components into complete ML analysis pipelines

### Changes Made
I migrated this notebook from OpenAI to **Mistral AI** and **HuggingFace embeddings**, replacing all OpenAI dependencies. Key enhancements included:

1. **Updated all environment variables** from OpenAI to Mistral AI API keys
2. **Replaced OpenAI models** with ChatMistralAI throughout all examples
3. **Enhanced tool calling examples** with realistic weather tools and better error handling
4. **Added custom experiments** including ML evaluation prompts, data analysis parsers, and a complete 5-stage ML analysis pipeline
5. **Improved metadata tracking** with detailed run information, performance metrics, and quality assessments
6. **Created production-ready examples** with proper error handling, validation, and structured outputs

The notebook now demonstrates real-world ML evaluation scenarios using Mistral AI, making it more relevant for modern LLM applications while maintaining all LangSmith tracing capabilities.