# Monitoring Sarvam AI Usage with OpenLIT: OpenTelemetry-Native Observability

## **Overview**

This cookbook demonstrates how to implement comprehensive observability for Sarvam AI applications using **OpenLIT**, an OpenTelemetry-native LLM monitoring platform. You'll learn how to:

- Auto-instrument Sarvam AI API calls with a single `openlit.init()` call, without changing how you call the SDK
- Track costs, performance, and token usage in real-time
- Capture detailed traces of chat completions, translations, and speech services
- Visualize and analyze your AI application's behavior
- Debug issues with complete request/response visibility

By the end of this tutorial, you'll have a working monitoring setup for your Sarvam AI applications, plus a checklist for taking it to production.

## **Why Monitor Your AI Applications?**

### **The Challenge of AI Observability**

AI-powered applications introduce unique observability challenges that traditional monitoring tools weren't designed to handle. When your Sarvam AI application fails or behaves unexpectedly, you need answers:

- **Which API call failed and why?** - Understanding failure modes across chat completions, translations, and speech services
- **What was the model thinking?** - Visibility into prompts, responses, and reasoning patterns
- **How much is each request costing?** - Real-time cost tracking across different Sarvam models and services
- **Which calls succeeded and which failed?** - Success rates, error patterns, and reliability metrics
- **How long did each operation take?** - Performance bottlenecks in LLM inference, translation, or speech processing
- **What inputs led to this behavior?** - Complete context of user messages, system prompts, and model parameters

Without proper observability, debugging means adding print statements, guessing at failure points, and blindly optimizing costs. You're essentially flying blind through complex AI workflows.

### **The Hidden Complexity of AI Failures**

Consider a typical scenario: Your multilingual chatbot using Sarvam AI fails to respond correctly in Hindi. Without proper tracing, you're left wondering:

- Did the translation service fail?
- Was the language detection incorrect?
- Did the chat model misunderstand the context?
- Was there a quota or rate limit issue?
- Did the request timeout or fail silently?

Each of these failures requires different solutions, but without visibility into execution, you're reduced to trial-and-error debugging.

### **The Role of OpenTelemetry in AI Observability**

OpenTelemetry has emerged as the industry standard for observability, offering a vendor-neutral approach to collecting traces, metrics, and logs. For AI applications, OpenTelemetry's structured tracing is particularly powerful because it can capture:

- **Distributed execution flows** - Following requests across multiple services and components
- **Hierarchical relationships** - Parent-child relationships between operations (API call → model inference → response)
- **Rich contextual data** - Arbitrary attributes attached to each span for detailed analysis
- **Performance metrics** - Timing data at every level of the execution stack
- **Semantic conventions** - Consistent attribute naming for AI workloads (model name, token counts, prompts, completions); note that the OpenTelemetry GenAI conventions are still marked experimental and may change
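To make the parent-child idea concrete, here is a toy, stdlib-only tracer. It is not the OpenTelemetry API: it only records which span was open when another span started, plus attributes and durations. Real OpenTelemetry spans link through trace and span IDs rather than names, and the span names and attribute keys below (`user_language`, `source`, `target`) are illustrative assumptions.

```python
import time
from contextlib import contextmanager


class ToyTracer:
    """Records nested spans with parent links and attributes (not the OpenTelemetry API)."""

    def __init__(self):
        self.finished = []  # spans in the order they end (children before parents)
        self._open = []     # stack of open spans; the last one is the current parent

    @contextmanager
    def span(self, name, **attributes):
        parent = self._open[-1]["name"] if self._open else None
        record = {"name": name, "parent": parent, "attributes": attributes}
        self._open.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - start
            self._open.pop()
            self.finished.append(record)


# One request that chains translation and chat, as in a multilingual chatbot
tracer = ToyTracer()
with tracer.span("handle_request", user_language="hi-IN"):
    with tracer.span("translate", source="hi-IN", target="en-IN"):
        pass  # a translation call would go here
    with tracer.span("chat.completions", model="sarvam-m"):
        pass  # a chat completion call would go here

for s in tracer.finished:
    print(f"{s['name']:<17} parent={s['parent']!s:<15} {s['attributes']}")
```

Both child spans report `handle_request` as their parent, which is the structure a trace viewer renders as a tree.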

### **Why OpenTelemetry Matters for AI Applications**

Traditional APM tools excel at monitoring web servers and databases, but they fall short with AI applications because:

- **AI workflows are non-deterministic** - The same input can produce different outputs, making traditional error tracking insufficient
- **Context is everything** - You need to see not just that a call failed, but what the model was processing and how it responded
- **Token costs are variable** - Unlike fixed compute costs, AI costs vary dramatically with prompt length, response length, and model choice
- **Multi-step reasoning** - Complex applications chain multiple AI services (translation → chat → speech), requiring full execution history

### **What OpenLIT Provides**

OpenLIT builds on OpenTelemetry to provide AI-specific observability:

- **Auto-instrumentation** - Monitor Sarvam AI calls with one `openlit.init()` call and no changes to individual API calls
- **Cost tracking** - Real-time visibility into token usage and estimated costs
- **Performance monitoring** - Latency tracking, throughput analysis, and bottleneck identification
- **Request/response capture** - Complete visibility into prompts, completions, and model parameters
- **Error tracking** - Detailed error logs with full context for debugging
- **Multi-backend support** - Send telemetry to Grafana, Datadog, New Relic, or any OpenTelemetry-compatible backend

Let's dive into implementing this powerful monitoring for your Sarvam AI applications.

## Installation

First, install the required packages:
- `sarvamai` - Official Sarvam AI Python SDK
- `openlit` - OpenTelemetry-native LLM observability SDK

```python
!pip install -Uqq sarvamai openlit
```

## Basic Setup with OpenLIT

### Import Required Libraries

```python
import openlit
from sarvamai import SarvamAI
import os
```

### Configure Your API Key

Get your Sarvam AI API key from the [Sarvam AI Dashboard](https://dashboard.sarvam.ai/).

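```python
# Read the key from the environment if it is already set; otherwise paste it here.
# Avoid committing real keys to notebooks or version control.
SARVAM_API_KEY = os.environ.get("SARVAM_API_KEY", "YOUR_SARVAM_API_KEY")
os.environ["SARVAM_API_KEY"] = SARVAM_API_KEY
```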

### Initialize OpenLIT Monitoring

This single line of code enables automatic instrumentation of all Sarvam AI API calls. OpenLIT will capture:
- Request parameters (model, messages, temperature, etc.)
- Response data (completions, token counts, etc.)
- Timing information (latency, duration)
- Cost estimates (based on token usage)
- Error details (if any failures occur)

```python
# With no otlp_endpoint configured, OpenLIT prints traces to the console,
# which is handy during development
openlit.init()
```

### Initialize Sarvam AI Client

```python
client = SarvamAI(api_subscription_key=SARVAM_API_KEY)
```

## Monitoring Chat Completions

### Basic Chat Completion with Tracing

Let's make a simple chat completion request. OpenLIT will automatically capture all the details.

```python
# Make a chat completion request
response = client.chat.completions(
    messages=[
        {"role": "system", "content": "You are a helpful assistant knowledgeable about Indian culture."},
        {"role": "user", "content": "What are the major classical dance forms of India?"}
    ],
    temperature=0.7
)

print("Response:", response.choices[0].message.content)
print("\n--- OpenLIT is capturing this interaction in the background ---")
```

### Multi-turn Conversation with Context Tracking

Each call is traced as its own span. Because the full message history is sent on every turn, each span records the complete conversation context up to that point.

```python
# Multi-turn conversation
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about Bharatanatyam."}
]

# First turn
response1 = client.chat.completions(messages=conversation, temperature=0.7)
print("Turn 1:", response1.choices[0].message.content[:200] + "...")

# Add assistant's response to conversation
conversation.append({
    "role": "assistant",
    "content": response1.choices[0].message.content
})

# Second turn
conversation.append({"role": "user", "content": "What are the key mudras used?"})
response2 = client.chat.completions(messages=conversation, temperature=0.7)
print("\nTurn 2:", response2.choices[0].message.content[:200] + "...")
```
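The append-call-append pattern above can be folded into a small helper. `ask` and `StubChat` are hypothetical names introduced here; the stub stands in for `client.chat` so the sketch runs offline, and it records how many messages each call sent to show why prompt-token counts grow turn by turn in the traces.

```python
from types import SimpleNamespace


def ask(client, conversation, user_text, **params):
    """Add a user turn, send the full history, and append the assistant's reply.

    Because every call resends the whole conversation, each traced span's input
    (and its prompt-token count) grows with the number of turns.
    """
    conversation.append({"role": "user", "content": user_text})
    response = client.chat.completions(messages=conversation, **params)
    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply


class StubChat:
    """Hypothetical offline stand-in for client.chat; records how many messages each call sent."""

    def __init__(self):
        self.sent_lengths = []

    def completions(self, messages, **params):
        self.sent_lengths.append(len(messages))
        message = SimpleNamespace(content=f"reply {len(self.sent_lengths)}")
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


stub_client = SimpleNamespace(chat=StubChat())
history = [{"role": "system", "content": "You are a helpful assistant."}]
ask(stub_client, history, "Tell me about Bharatanatyam.")
ask(stub_client, history, "What are the key mudras used?")
print(stub_client.chat.sent_lengths)  # [2, 4]
```

With the real client, the second turn becomes `ask(client, conversation, "What are the key mudras used?", temperature=0.7)`.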

### Monitoring Different Reasoning Levels

Set `reasoning_effort` to `"low"`, `"medium"`, or `"high"` and compare how latency and token usage change across levels.

```python
# High reasoning effort request
complex_question = "Explain the philosophical differences between Advaita and Dvaita Vedanta."

response = client.chat.completions(
    messages=[{"role": "user", "content": complex_question}],
    temperature=0.5,
    reasoning_effort="high"
)

print("Response:", response.choices[0].message.content[:300] + "...")
print("\n--- Check OpenLIT dashboard to compare latency and token usage across reasoning levels ---")
```

### Wikipedia-Grounded Queries

Monitor requests that use Wikipedia grounding for factual accuracy.

```python
# Wikipedia-grounded query
response = client.chat.completions(
    messages=[{"role": "user", "content": "What is the history of the Taj Mahal?"}],
    temperature=0.2,
    wiki_grounding=True
)

print("Response:", response.choices[0].message.content[:300] + "...")
```

## Advanced Configuration: Sending Traces to OpenLIT Platform

### Deploy OpenLIT Platform Locally

To visualize and analyze your traces, run the OpenLIT platform. For local evaluation, start it with Docker Compose from your terminal:

```bash
# Clone OpenLIT repository
git clone https://github.com/openlit/openlit.git
cd openlit

# Start OpenLIT with Docker Compose
docker compose up -d
```

Access the OpenLIT dashboard at `http://127.0.0.1:3000`

Default credentials:
- Email: `user@openlit.io`
- Password: `openlituser`

### Configure OpenLIT to Send Traces to Platform

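```python
# Call openlit.init() only once per process. OpenTelemetry allows the global tracer
# provider to be set only once, so a second init() may keep exporting to the console.
# Restart the kernel and re-run the setup cells, using this call instead of the earlier one.
openlit.init(
    otlp_endpoint="http://127.0.0.1:4318",  # OTLP/HTTP endpoint of the OpenLIT platform
    application_name="sarvam-ai-app",        # Your application name
    environment="development",               # Environment (development/staging/production)
)

print("OpenLIT is sending traces to http://127.0.0.1:4318; view them at http://127.0.0.1:3000")
```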

### Test with Platform Integration

```python
# Make some requests that will be visible in the OpenLIT dashboard
test_queries = [
    "What are the benefits of yoga?",
    "Explain the Indian monsoon season.",
    "What are the main ingredients in biryani?"
]

for query in test_queries:
    response = client.chat.completions(
        messages=[{"role": "user", "content": query}],
        temperature=0.7
    )
    print(f"Query: {query}")
    print(f"Response: {response.choices[0].message.content[:100]}...\n")

print("\n✅ Check your OpenLIT dashboard at http://127.0.0.1:3000 to see these traces!")
```
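If the dashboard stays empty, a quick first check is whether anything is listening on the OTLP port at all. The helper below is a hypothetical stdlib sketch: it only attempts a TCP connection to the endpoint's host and port, so a `True` result does not prove the OTLP protocol is working.

```python
import socket
from urllib.parse import urlparse


def otlp_endpoint_reachable(endpoint, timeout=2.0):
    """Return True if a TCP connection to the endpoint's host:port succeeds.

    This only shows that something is listening; it does not validate OTLP itself.
    """
    parsed = urlparse(endpoint)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


print(otlp_endpoint_reachable("http://127.0.0.1:4318"))
```

A `False` here usually means the Docker Compose stack is not running or the port mapping differs from the default.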

## Integration with Other Observability Backends

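OpenLIT can send traces to any OpenTelemetry-compatible backend: change `otlp_endpoint` and, where the backend requires authentication, pass `otlp_headers`. The endpoints below are examples; check each vendor's documentation for the URL that matches your region or account.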

### Grafana Cloud Integration

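```python
# Example: Configure for Grafana Cloud. Use the OTLP endpoint shown for your stack
# (the region in the URL varies). Authentication is HTTP Basic auth with
# base64("<instance ID>:<access policy token>").
# openlit.init(
#     otlp_endpoint="https://otlp-gateway-prod-us-central-0.grafana.net/otlp",
#     otlp_headers={"Authorization": "Basic <base64-encoded instanceID:token>"},
#     application_name="sarvam-ai-production",
#     environment="production",
# )
```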
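The Basic auth value for Grafana Cloud is the base64 encoding of `instance_id:token`. The hypothetical helper below builds it with the standard library; the instance ID and token are placeholders. It returns a dict on the assumption that your OpenLIT version accepts a dict for `otlp_headers`, as recent releases document; check your installed version.

```python
import base64


def grafana_otlp_headers(instance_id, api_token):
    """Build an HTTP Basic auth header from a Grafana Cloud instance ID and token."""
    credentials = f"{instance_id}:{api_token}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder values; use the instance ID and token from your Grafana Cloud stack.
headers = grafana_otlp_headers("123456", "glc_example_token")
print(headers)
```

You would then pass `otlp_headers=headers` to `openlit.init()` together with your stack's OTLP endpoint. Keep the real token out of the notebook, for example by reading it from an environment variable.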

### Datadog Integration

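```python
# Example: Configure for Datadog through a local Datadog Agent with OTLP ingestion enabled.
# The Agent's OTLP/HTTP receiver listens on port 4318 by default and forwards data
# using the API key configured in the Agent, so no headers are needed here.
# openlit.init(
#     otlp_endpoint="http://localhost:4318",
#     application_name="sarvam-ai-production",
#     environment="production",
# )
```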

### New Relic Integration

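```python
# Example: Configure for New Relic (US region; EU accounts use otlp.eu01.nr-data.net).
# New Relic authenticates OTLP requests with an "api-key" header holding your license key.
# openlit.init(
#     otlp_endpoint="https://otlp.nr-data.net:4318",
#     otlp_headers={"api-key": "YOUR_NEW_RELIC_LICENSE_KEY"},
#     application_name="sarvam-ai-production",
#     environment="production",
# )
```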

## What OpenLIT Captures

For each Sarvam AI API call, OpenLIT automatically captures:

### **Request Attributes**
- Model name (e.g., `sarvam-m`)
- Temperature, top_p, and other parameters
- Reasoning effort level
- Wikipedia grounding status
- Complete message history
- System prompts and user messages

### **Response Attributes**
- Complete model response/completion
- Token counts (prompt tokens, completion tokens, total)
- Response time and latency
- Finish reason (for example, `stop` or `length`)

### **Performance Metrics**
- Request duration (total time)
- Time to first token (for streaming)
- Tokens per second
- API call success/failure rates
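Tokens per second is simple arithmetic over two captured values: completion tokens divided by request duration. The sketch below uses hypothetical numbers. For non-streaming calls the duration also includes network time and prompt processing, so this figure understates the model's pure generation speed.

```python
def tokens_per_second(completion_tokens, duration_s):
    """Generated tokens divided by wall-clock request duration."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return completion_tokens / duration_s


# Hypothetical numbers: 340 completion tokens returned in a 4-second request
print(tokens_per_second(340, 4.0))  # 85.0
```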

### **Cost Information**
- Estimated cost per request
- Token usage breakdown
- Cumulative costs over time
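Cost estimates like these come from multiplying token counts by per-token prices, so the arithmetic is easy to reproduce as a sanity check. The prices and the model name `example-model` below are made up for illustration and are not Sarvam AI's actual pricing; substitute your real rates.

```python
from collections import defaultdict

# Hypothetical prices per 1,000 tokens in arbitrary currency units; replace with real rates.
PRICES_PER_1K = {"example-model": {"prompt": 0.5, "completion": 1.5}}


def estimate_cost(model, prompt_tokens, completion_tokens, prices=PRICES_PER_1K):
    """Cost = prompt_tokens/1000 * prompt_price + completion_tokens/1000 * completion_price."""
    rate = prices[model]
    return (prompt_tokens / 1000) * rate["prompt"] + (completion_tokens / 1000) * rate["completion"]


def cumulative_costs(calls, prices=PRICES_PER_1K):
    """Sum estimated cost per model over (model, prompt_tokens, completion_tokens) tuples."""
    totals = defaultdict(float)
    for model, prompt_tokens, completion_tokens in calls:
        totals[model] += estimate_cost(model, prompt_tokens, completion_tokens, prices)
    return dict(totals)


calls = [("example-model", 1000, 2000), ("example-model", 200, 400)]
print(cumulative_costs(calls))
```

Completion tokens are often priced differently from prompt tokens, which is why long answers can dominate cost even when prompts are short.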

### **Error Details**
- Exception type and message
- Stack traces
- HTTP status codes
- Error context (request that caused the error)
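OpenLIT records errors on the traced span for you; the stdlib sketch below only illustrates what "error context" contains. The keys follow OpenTelemetry's exception attribute names (`exception.type`, `exception.message`, `exception.stacktrace`), while `call_with_error_context` and `flaky_call` are hypothetical helpers.

```python
import traceback


def call_with_error_context(fn, **request):
    """Run fn(**request); on failure, return the details an error trace usually carries."""
    try:
        return {"ok": True, "result": fn(**request)}
    except Exception as exc:
        return {
            "ok": False,
            "exception.type": type(exc).__name__,
            "exception.message": str(exc),
            "exception.stacktrace": traceback.format_exc(),
            "request": request,  # the inputs that triggered the failure
        }


def flaky_call(messages, temperature):
    """Hypothetical stand-in for an API call that fails."""
    raise TimeoutError("request timed out")


outcome = call_with_error_context(
    flaky_call, messages=[{"role": "user", "content": "Hi"}], temperature=0.7
)
print(outcome["exception.type"], "-", outcome["exception.message"])
```

Keeping the request next to the exception is what lets you tell a timeout on a very long prompt apart from one on a short prompt.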

## Using the OpenLIT Dashboard

Once you've deployed the OpenLIT platform and sent traces to it, you can:

### View Request Traces
- Navigate to the **Traces** tab
- Filter by application name, environment, or time range
- Click on individual traces to see detailed span information
- View complete request/response payloads

### Analyze Performance
- Check the **Metrics** dashboard for:
  - Average response time trends
  - Token usage patterns
  - Request volume over time
  - Error rates and types

### Track Costs
- View the **Cost Analysis** dashboard
- See cost breakdown by:
  - Model type
  - Time period
  - User or session
  - Feature or endpoint

### Debug Errors
- Use the **Exceptions** tab to:
  - Identify common error patterns
  - See full error context
  - Track error resolution over time

## Real-World Use Cases

### **Use Case 1: Cost Optimization**

Use OpenLIT to identify expensive requests and optimize your usage:

```python
# Compare costs between different approaches

# Approach 1: High reasoning effort
response1 = client.chat.completions(
    messages=[{"role": "user", "content": "Summarize the Indian Constitution."}],
    reasoning_effort="high",
    temperature=0.5
)

# Approach 2: Medium reasoning effort
response2 = client.chat.completions(
    messages=[{"role": "user", "content": "Summarize the Indian Constitution."}],
    reasoning_effort="medium",
    temperature=0.5
)

# Approach 3: Low reasoning effort
response3 = client.chat.completions(
    messages=[{"role": "user", "content": "Summarize the Indian Constitution."}],
    reasoning_effort="low",
    temperature=0.5
)

print("Check OpenLIT dashboard to compare token usage and costs across reasoning levels")
```
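For a quick local comparison alongside the dashboard, you can read token counts straight off the responses. This sketch assumes an OpenAI-style `response.usage` object with `prompt_tokens`, `completion_tokens`, and `total_tokens`; verify those field names against your `sarvamai` version. Missing fields come back as `None` rather than raising. The demo uses a stand-in response so it runs offline.

```python
from types import SimpleNamespace


def usage_summary(label, response):
    """Return token counts from a chat response, or None for any missing field.

    Assumes an OpenAI-style `response.usage` object; check your SDK version.
    """
    usage = getattr(response, "usage", None)
    fields = ("prompt_tokens", "completion_tokens", "total_tokens")
    return {"label": label, **{f: getattr(usage, f, None) for f in fields}}


# Offline demo with a stand-in response; in the notebook, pass response1/2/3 instead.
fake = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=12, completion_tokens=340, total_tokens=352)
)
print(usage_summary("high", fake))
print(usage_summary("no-usage", SimpleNamespace()))
```

In the notebook: `for label, r in [("high", response1), ("medium", response2), ("low", response3)]: print(usage_summary(label, r))`.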

### **Use Case 2: Performance Benchmarking**

Track latency across different types of requests:

```python
import time

# Benchmark different query types (a single request each)
test_cases = [
    ("Simple query", "What is 2+2?"),
    ("Medium query", "Explain the water cycle in 3 sentences."),
    ("Complex query", "Analyze the economic impact of digital payments in India."),
]

for test_name, query in test_cases:
    start = time.perf_counter()  # monotonic, high-resolution clock suited to timing
    response = client.chat.completions(
        messages=[{"role": "user", "content": query}],
        temperature=0.7
    )
    duration = time.perf_counter() - start
    print(f"{test_name}: {duration:.2f}s")

print("\nView detailed timing breakdowns in OpenLIT dashboard")
```
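A single timed request is a noisy measurement. A slightly sturdier sketch repeats each call and reports the median and the worst case. `benchmark` is a hypothetical helper, and the demo times `time.sleep` so it runs offline. Each real run also costs tokens and produces its own trace.

```python
import statistics
import time


def benchmark(fn, runs=5):
    """Call fn() `runs` times and return the median and maximum latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {"median_s": statistics.median(samples), "max_s": max(samples), "runs": runs}


# Offline demo: a stand-in for a model call that sleeps about 10 ms.
result = benchmark(lambda: time.sleep(0.01), runs=3)
print(f"median={result['median_s']:.3f}s max={result['max_s']:.3f}s")
```

With the real client: `benchmark(lambda: client.chat.completions(messages=[{"role": "user", "content": query}], temperature=0.7))`. The median resists one-off network spikes, while the maximum shows the tail your users may hit.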

### **Use Case 3: A/B Testing and Experimentation**

Use OpenLIT to compare different prompt strategies:

```python
# A/B test: Different system prompts

# Version A: Concise style
response_a = client.chat.completions(
    messages=[
        {"role": "system", "content": "Answer concisely in 1-2 sentences."},
        {"role": "user", "content": "What is artificial intelligence?"}
    ]
)

# Version B: Detailed style
response_b = client.chat.completions(
    messages=[
        {"role": "system", "content": "Provide detailed, comprehensive explanations."},
        {"role": "user", "content": "What is artificial intelligence?"}
    ]
)

print("Version A (concise):", response_a.choices[0].message.content)
print("\nVersion B (detailed):", response_b.choices[0].message.content)
print("\nCompare token usage and costs in OpenLIT to determine the optimal approach")
```
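When you move from two one-off calls to real traffic, each user should consistently see the same variant. A common approach is deterministic bucketing on a stable hash of the user ID. The helper and user IDs below are hypothetical; log the chosen variant alongside each request so you can match it to the corresponding traces.

```python
import hashlib


def assign_variant(user_id, variants=("A", "B")):
    """Map a user ID to a variant with a stable hash, so repeat visits see the same prompt.

    Python's built-in hash() is salted per process for strings, so a SHA-256
    digest is used to keep assignments stable across restarts.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return variants[int.from_bytes(digest[:8], "big") % len(variants)]


SYSTEM_PROMPTS = {
    "A": "Answer concisely in 1-2 sentences.",
    "B": "Provide detailed, comprehensive explanations.",
}

for user_id in ["user-1", "user-2", "user-3"]:
    variant = assign_variant(user_id)
    print(user_id, variant, SYSTEM_PROMPTS[variant])
```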

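## Privacy Mode: Metrics Without Content

If your prompts or responses contain sensitive data, you can keep collecting metrics and timings while disabling prompt and response capture:

```python
# Privacy mode: record metrics and span timings, but not prompts or responses.
# Recent OpenLIT releases name this flag capture_message_content; older releases
# used trace_content. Call openlit.init() only once per process (restart the kernel first).
openlit.init(
    otlp_endpoint="http://127.0.0.1:4318",
    application_name="sarvam-ai-app",
    environment="production",
    capture_message_content=False,  # Do not record prompts and completions
)

print("OpenLIT configured with privacy mode - metrics tracked but content not logged")
```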

## Additional Resources

### **OpenLIT Documentation**
- Official Docs: [docs.openlit.io](https://docs.openlit.io)
- GitHub: [github.com/openlit/openlit](https://github.com/openlit/openlit)
- Discord Community: [Join Discord](https://discord.gg/openlit)

### **Sarvam AI Documentation**
- API Reference: [docs.sarvam.ai](https://docs.sarvam.ai)
- Dashboard: [dashboard.sarvam.ai](https://dashboard.sarvam.ai)
- Discord Community: [Join Discord](https://discord.gg/hTuVuPNF)

### **OpenTelemetry Resources**
- OpenTelemetry Docs: [opentelemetry.io](https://opentelemetry.io)
- LLM Observability Guide: [opentelemetry.io/blog/2024/llm-observability](https://opentelemetry.io/blog/2024/llm-observability)

## Next Steps

Now that you have monitoring set up:

1. **Explore Your Data** - Make various API calls and explore the traces in OpenLIT dashboard
2. **Set Up Alerts** - Configure alerts for critical metrics in your observability backend
3. **Optimize Performance** - Use insights to optimize prompt design, model selection, and parameters
4. **Track Costs** - Monitor spending and set up budget alerts
5. **Share with Team** - Set up shared dashboards for your engineering team

### **Production Deployment Checklist**

- [ ] Configure production OTLP endpoint
- [ ] Set up authentication/API keys
- [ ] Keep span batching enabled (OpenLIT's default) for better performance
- [ ] Configure sampling for high-volume applications
- [ ] Set up alerts for critical metrics
- [ ] Document monitoring setup for your team
- [ ] Test error scenarios and verify error tracking
- [ ] Configure privacy settings if handling sensitive data
- [ ] Set up cost monitoring dashboards
- [ ] Schedule regular performance reviews
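Sampling keeps only a fraction of traces in high-volume applications. The stdlib sketch below mirrors the idea behind OpenTelemetry's `TraceIdRatioBased` sampler: a trace is kept when the low 64 bits of its ID fall below `rate * 2**64`, so every service that sees the same trace ID makes the same decision. It is illustrative only. In practice you would configure the SDK's sampler, for example with the standard `OTEL_TRACES_SAMPLER=traceidratio` and `OTEL_TRACES_SAMPLER_ARG` environment variables, after verifying that your OpenLIT version's tracer provider honors them.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of a 128-bit trace ID


def ratio_sampled(trace_id, rate):
    """Keep a trace if the low 64 bits of its ID fall below rate * 2**64."""
    bound = round(rate * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


# Simulate 10,000 random 128-bit trace IDs with a fixed seed for repeatability
rng = random.Random(0)
ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(ratio_sampled(t, 0.1) for t in ids)
print(f"kept {kept} of {len(ids)} traces")  # roughly 10%
```

Because the decision depends only on the trace ID, sampled traces stay complete: you never keep a child span whose parent was dropped.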

## Conclusion

You've successfully set up comprehensive monitoring for your Sarvam AI applications using OpenLIT! With this setup, you now have:

- **Complete visibility** into your AI application's behavior
- **Real-time cost tracking** to manage your AI budget
- **Performance insights** to optimize latency and throughput
- **Error tracking** with full context for faster debugging
- **Industry-standard observability** using OpenTelemetry

Remember: observability is more than monitoring. It is about understanding your system's behavior, optimizing performance, and building more reliable AI applications.