# Dataset Upload

## Advanced Dataset Management with LangSmith SDK

In addition to creating and editing datasets in the LangSmith UI, you can create and edit them programmatically with the LangSmith SDK.

This version of the notebook uses Mistral AI instead of OpenAI, a switch made with performance and cost in mind.

## Dataset Upload Process

Let's go ahead and upload a list of examples that we have from our RAG application to LangSmith as a new dataset.

The examples below have been rewritten to cover Mistral AI usage rather than OpenAI.

In [4]:
# Enhanced Environment Configuration for Mistral AI and LangSmith
import os
import time
import json
from datetime import datetime
from typing import List, Dict, Optional, Tuple

# Set environment variables for Mistral AI and LangSmith integration
os.environ["MISTRAL_API_KEY"] = "MISTRAL_API_KEY"  # Replace with your Mistral AI API key
os.environ["LANGSMITH_API_KEY"] = "LANGSMITH_API_KEY"  # Your LangSmith API key
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "langsmith-academy-mistral"

# Enhanced configuration settings
DATASET_CONFIG = {
    "model_provider": "mistral",
    "model_name": "mistral-small-latest",
    "embedding_provider": "huggingface",
    "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
    "batch_size": 10,
    "validation_enabled": True,
    "analytics_enabled": True
}

print("Enhanced environment configuration loaded")
print(f"Model Provider: {DATASET_CONFIG['model_provider']}")
print(f"✓ Embedding Provider: {DATASET_CONFIG['embedding_provider']}")

# Dataset Analytics Class
class DatasetAnalytics:
    def __init__(self):
        self.operations = []
        self.start_time = datetime.now()
    
    def log_operation(self, operation_type: str, success: bool, details: Optional[Dict] = None):
        entry = {
            "timestamp": datetime.now().isoformat(),
            "operation_type": operation_type,
            "success": success,
            "details": details or {}
        }
        self.operations.append(entry)
    
    def get_stats(self) -> Dict:
        if not self.operations:
            return {"message": "No operations recorded yet"}
        
        successful_ops = sum(1 for op in self.operations if op["success"])
        total_ops = len(self.operations)
        
        return {
            "total_operations": total_ops,
            "successful_operations": successful_ops,
            "success_rate": round(successful_ops / total_ops, 3) if total_ops > 0 else 0,
            "session_duration": str(datetime.now() - self.start_time)
        }

# Initialize analytics
dataset_analytics = DatasetAnalytics()
print("Dataset analytics system initialized")

Enhanced environment configuration loaded
Model Provider: mistral
✓ Embedding Provider: huggingface
Dataset analytics system initialized
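
The upload cell further down repeats the same pattern around each SDK call: try the call, then log success or failure. One way to factor that out is a small context manager. The sketch below is only an illustration: `track` and `ListRecorder` are hypothetical names that do not appear in the notebook, and `ListRecorder` exists only so the snippet runs on its own. In the notebook you would pass the `dataset_analytics` instance instead.

```python
from contextlib import contextmanager
from datetime import datetime


@contextmanager
def track(analytics, operation_type, details=None):
    """Log one operation on `analytics`, recording success or failure.

    `analytics` is any object with a
    log_operation(operation_type, success, details) method, such as the
    DatasetAnalytics instance defined above.
    """
    info = dict(details or {})
    started = datetime.now()
    try:
        yield info  # the caller may add keys, e.g. info["count"] = 10
    except Exception as exc:
        info["error"] = str(exc)
        info["elapsed_seconds"] = (datetime.now() - started).total_seconds()
        analytics.log_operation(operation_type, False, info)
        raise  # re-raise so the failure is not silently swallowed
    else:
        info["elapsed_seconds"] = (datetime.now() - started).total_seconds()
        analytics.log_operation(operation_type, True, info)


class ListRecorder:
    """Tiny stand-in for DatasetAnalytics so this sketch runs on its own."""

    def __init__(self):
        self.operations = []

    def log_operation(self, operation_type, success, details=None):
        self.operations.append((operation_type, success, details or {}))


recorder = ListRecorder()

# A block that finishes normally is logged as a success.
with track(recorder, "examples_create") as info:
    info["count"] = 10  # stands in for a real SDK call

# A block that raises is logged as a failure, then the error propagates.
try:
    with track(recorder, "dataset_read", {"name": "demo"}):
        raise ValueError("dataset not found")  # simulated failure
except ValueError:
    pass

for operation_type, success, details in recorder.operations:
    print(operation_type, success, sorted(details))
```

Re-raising inside the `except` branch is a deliberate choice here: the analytics entry records the failure, but the caller still decides how to handle it.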


In [5]:
# Or you can use a .env file
from dotenv import load_dotenv
load_dotenv(dotenv_path="../../.env", override=True)

True
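
Because the configuration cell assigns placeholder strings such as `"MISTRAL_API_KEY"` to the variables of the same name, it can be worth checking that real values are in place before creating a client. The following is a minimal sketch under that assumption; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and the placeholder test (value equal to the variable's own name) simply mirrors what the cell above does.

```python
import os

# Variables the configuration cell above sets; adjust the tuple as needed.
REQUIRED_VARS = ("MISTRAL_API_KEY", "LANGSMITH_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names that are unset, blank, or still a placeholder.

    A value counts as a placeholder when it equals the variable's own name,
    which is how the configuration cell above initialises the keys.
    """
    env = os.environ if env is None else env
    missing = []
    for name in required:
        value = env.get(name, "").strip()
        if not value or value == name:
            missing.append(name)
    return missing


problems = missing_env_vars()
if problems:
    print("Set real values for:", ", ".join(problems))
else:
    print("All required keys look set.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.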

In [6]:
from langsmith import Client

# Custom examples for Mistral AI with LangSmith
example_inputs = [
("How do I configure Mistral AI with LangChain for my RAG application?", "To configure Mistral AI with LangChain, install langchain-mistralai and use ChatMistralAI class. Set your MISTRAL_API_KEY environment variable and initialize the model with your desired parameters like temperature and max_tokens. The integration works seamlessly with LangChain's standard interfaces."),
("What are the advantages of using Mistral AI over other language models?", "Mistral AI offers strong performance with efficient computation, multilingual capabilities, and competitive pricing. The models are designed for both cloud and on-premise deployment, with good instruction following and reasoning capabilities. They also provide fine-tuning options for specialized use cases."),
("How can I trace my Mistral AI application calls in LangSmith?", "Enable tracing by setting LANGSMITH_TRACING=true and LANGSMITH_API_KEY in your environment. Use the @traceable decorator on functions that call Mistral AI models. LangSmith will automatically capture inputs, outputs, token usage, and performance metrics for analysis."),
("What evaluation strategies work best for Mistral AI applications?", "Effective evaluation strategies include automated metrics like BLEU or ROUGE for text generation, human evaluation for quality assessment, and A/B testing for comparing model variants. LangSmith provides tools to create evaluation datasets and run systematic comparisons."),
("How do I optimize token usage when working with Mistral AI models?", "Optimize token usage by crafting concise prompts, using appropriate context lengths, implementing prompt templates, and choosing the right model size for your task. Monitor token consumption through LangSmith to identify optimization opportunities."),
("Can I use Mistral AI for multilingual applications?", "Yes, Mistral AI models have strong multilingual capabilities supporting major European languages and more. They can handle translation, multilingual question answering, and cross-lingual tasks effectively while maintaining good performance across languages."),
("How do I handle errors and implement retry logic with Mistral AI?", "Implement robust error handling by catching API exceptions, using exponential backoff for retries, and setting appropriate timeouts. Consider implementing fallback strategies and monitoring error rates through LangSmith for production applications."),
("What are the best practices for prompt engineering with Mistral AI?", "Best practices include being specific and clear in instructions, providing examples when needed, using consistent formatting, and iterating based on output quality. Mistral AI responds well to structured prompts and explicit instructions."),
("How can I fine-tune Mistral AI models for my specific domain?", "Fine-tuning involves preparing domain-specific training data, choosing appropriate hyperparameters, and using Mistral AI's fine-tuning API. Start with a base model, prepare high-quality examples, and evaluate performance on validation sets."),
("What monitoring and analytics should I track for Mistral AI applications?", "Track key metrics including response time, token usage, error rates, user satisfaction, and model performance. LangSmith provides comprehensive monitoring tools to analyze these metrics and identify areas for improvement in your Mistral AI applications."),
]

client = Client()
dataset_name = "Mistral AI RAG Questions and Answers"

# First, create the dataset if it doesn't exist
try:
    # Try to get the dataset to see if it exists
    dataset = client.read_dataset(dataset_name=dataset_name)
    print(f"Dataset '{dataset_name}' already exists with {dataset.example_count} examples")
    dataset_analytics.log_operation("dataset_read", True, {"name": dataset_name})
except Exception:
    # Dataset doesn't exist, let's create it
    print(f"Creating dataset '{dataset_name}'...")
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="Custom dataset for testing Mistral AI RAG applications with LangSmith integration. Contains questions about Mistral AI configuration, optimization, multilingual capabilities, and best practices."
    )
    print(f"Created dataset '{dataset_name}' with ID: {dataset.id}")
    dataset_analytics.log_operation("dataset_create", True, {"name": dataset_name, "id": str(dataset.id)})

# Prepare inputs and outputs for bulk creation
inputs = [{"question": input_prompt} for input_prompt, _ in example_inputs]
outputs = [{"output": output_answer} for _, output_answer in example_inputs]

# Create examples using dataset_name instead of dataset_id
try:
    examples_created = client.create_examples(
        inputs=inputs,
        outputs=outputs,
        dataset_name=dataset_name,
    )
    print(f"Successfully created {len(example_inputs)} examples in dataset '{dataset_name}'")
    dataset_analytics.log_operation("examples_create", True, {"count": len(example_inputs)})
    
    # Display analytics
    print("\n=== Dataset Upload Analytics ===")
    stats = dataset_analytics.get_stats()
    for key, value in stats.items():
        print(f"{key.replace('_', ' ').title()}: {value}")
        
except Exception as e:
    print(f"Error creating examples: {e}")
    dataset_analytics.log_operation("examples_create", False, {"error": str(e)})

Creating dataset 'Mistral AI RAG Questions and Answers'...
Created dataset 'Mistral AI RAG Questions and Answers' with ID: e7c57d4d-2f87-474a-a1d3-44424e1e661f
Successfully created 10 examples in dataset 'Mistral AI RAG Questions and Answers'

=== Dataset Upload Analytics ===
Total Operations: 2
Successful Operations: 2
Success Rate: 1.0
Session Duration: 0:00:02.447809
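
Note that the cell above calls `create_examples` whether or not the dataset already existed, so running it a second time would likely add the same ten examples again. A possible guard is to filter out questions that are already stored before uploading. The sketch below is an assumption-laden illustration, not part of the original notebook: `split_new_pairs` and `upload_missing` are hypothetical helpers, and `upload_missing` assumes `client` is a `langsmith.Client` and that each stored example keeps its question under `inputs["question"]`, as in the cell above.

```python
def split_new_pairs(pairs, existing_questions):
    """Return (inputs, outputs) for pairs whose question is not yet stored."""
    seen = set(existing_questions)
    inputs, outputs = [], []
    for question, answer in pairs:
        if question in seen:
            continue
        seen.add(question)  # also drops duplicates within `pairs` itself
        inputs.append({"question": question})
        outputs.append({"output": answer})
    return inputs, outputs


def upload_missing(client, dataset_name, pairs):
    """Upload only the pairs the dataset does not already contain.

    Assumes `client` is a langsmith.Client; returns how many examples were sent.
    """
    existing = [
        (example.inputs or {}).get("question")
        for example in client.list_examples(dataset_name=dataset_name)
    ]
    inputs, outputs = split_new_pairs(pairs, existing)
    if inputs:
        client.create_examples(
            inputs=inputs, outputs=outputs, dataset_name=dataset_name
        )
    return len(inputs)


# Offline demonstration of the filtering step with made-up pairs.
demo_pairs = [("q1", "a1"), ("q2", "a2"), ("q1", "a1 again")]
demo_inputs, demo_outputs = split_new_pairs(demo_pairs, existing_questions=["q2"])
print(demo_inputs)
print(demo_outputs)
```

In the notebook, the call would look like `upload_missing(client, dataset_name, example_inputs)`. Matching on the exact question string is the simplest possible rule; near-duplicates with different wording would still be uploaded.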


## Submitting Another Trace

## Integration with Enhanced RAG Application

I've moved our RAG application definition to `app.py` so we can quickly import it. The app has been updated to use Mistral AI instead of OpenAI.

In [7]:
from app import langsmith_rag

USER_AGENT environment variable not set, consider setting it to identify your requests.


Creating vector store from LangSmith documentation...


Fetching pages: 100%|##########| 197/197 [00:34<00:00,  5.73it/s]


Loaded 197 documents from LangSmith documentation
Split documents into 3810 chunks
Vector store created and saved to /tmp/langsmith_docs.parquet


Let's ask another question to create a new trace!

In [8]:
question = "How do I configure Mistral AI with LangChain for my RAG application?"
langsmith_rag(question)

'The provided context does not include specific information on configuring Mistral AI with LangChain for a RAG application. For detailed guidance, please refer to the LangChain documentation or relevant tutorials.'
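
To produce more than one trace, the same call can be repeated for every question in the dataset. The sketch below shows one way to do that while keeping a single failure from stopping the batch. It is a hedged illustration: `run_questions` is a hypothetical helper, and `fake_rag` is a stand-in so the snippet runs without API keys. In the notebook you would pass `langsmith_rag` and the questions from `example_inputs`, and each call should then appear as its own trace, assuming the function is traced as the text above implies.

```python
def run_questions(rag_fn, questions):
    """Call `rag_fn` once per question and collect (question, answer, error)."""
    results = []
    for question in questions:
        try:
            results.append((question, rag_fn(question), None))
        except Exception as exc:  # keep going so one failure does not stop the batch
            results.append((question, None, str(exc)))
    return results


def fake_rag(question):
    """Stand-in for langsmith_rag so this sketch runs without API keys."""
    if not question.strip():
        raise ValueError("empty question")
    return f"answer to: {question}"


for question, answer, error in run_questions(fake_rag, ["What is a dataset?", " "]):
    label = question.strip() or "<blank>"
    print(label, "->", answer if error is None else f"failed: {error}")
```

With the real application, the equivalent call would be `run_questions(langsmith_rag, [q for q, _ in example_inputs])`.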

## Summary

### Key Changes I Made:
1. **Environment Variables**: Changed from OPENAI_API_KEY to MISTRAL_API_KEY
2. **Dataset Examples**: Created 10 new custom examples focused on Mistral AI usage, covering configuration, optimization, multilingual capabilities, and best practices
3. **Integration**: Updated RAG application to use Mistral AI and HuggingFace embeddings
4. **Project Name**: Updated to langsmith-academy-mistral

### Custom Tweaks:
- Added examples about multilingual capabilities specific to Mistral AI
- Included token optimization strategies
- Added fine-tuning and domain adaptation examples
- Focused on practical implementation questions

### Learning Outcomes:
- Successfully migrated dataset creation from OpenAI to Mistral AI
- Created custom examples showcasing Mistral AI advantages
- Maintained LangSmith functionality with new model provider
- Learned about dataset management with different language models