# Opik

The Opik platform enables you to log, view, and assess your LLM traces throughout development and production. By utilizing the platform alongside our LLM as a Judge evaluator, you can pinpoint and resolve issues in your LLM application.

You can access Opik through Comet's Managed Cloud offering or self-host it on your own infrastructure. Self-hosting allows you to utilize all Opik features, including tracing and evaluation, though user management features will not be available.

If you opt for self-hosting, you have two deployment options:

1. **Local Installation:** Ideal for getting started, but not suitable for production.
2. **Kubernetes Installation:** A production-ready Opik platform that operates on a Kubernetes cluster.

Alternatively, to use the managed cloud offering, install the Opik SDK and configure it with your API key.

## Tracing

### Logging Traces

You can log traces to the Comet LLM Evaluation platform using either a REST API or the Opik Python SDK.

**Using the Python SDK**  
First, install the SDK and configure it with your Opik API key or local deployment settings. This setup process is straightforward and guided.

Once set up, you can log traces through Comet's integrations, function decorators, or manually.

```
import opik

opik.configure(use_local=False)
```

**Function Decorators**  
The simplest way to log with Opik is by using function decorators. This method integrates seamlessly with your existing LLM application and is recommended alongside Comet's integrations for optimal results.
```
from opik import track

@track
def preprocess_input(text):
    return text.strip().lower()
```

If you’re defining LLM chains manually, you can use the track decorators to monitor LLM calls. By default, these decorators capture the function's input and output, but you can customize what is recorded.
```
@track(capture_input=False, capture_output=False)
def preprocess_input(text):
    return text.strip().lower()
```
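
To make the capture options concrete, the following is a minimal sketch of how a tracing decorator of this kind can work. It is not Opik's implementation: `toy_track`, `RECORDS`, and `handle_secret` are hypothetical names used only for illustration, and the records are kept in memory instead of being sent to a backend.

```python
import functools

# Hypothetical in-memory store standing in for a tracing backend.
RECORDS = []


def toy_track(capture_input=True, capture_output=True):
    """Illustrative decorator that records a call's arguments and result."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = {"name": func.__name__}
            if capture_input:
                record["input"] = {"args": args, "kwargs": kwargs}
            result = func(*args, **kwargs)
            if capture_output:
                record["output"] = result
            RECORDS.append(record)
            return result
        return wrapper
    return decorator


@toy_track()
def preprocess_input(text):
    return text.strip().lower()


@toy_track(capture_input=False, capture_output=False)
def handle_secret(token):
    # Neither the argument nor the return value is recorded for this call.
    return token[::-1]


preprocess_input("  Hello ")
handle_secret("abc")
print(RECORDS)
```

With both flags disabled, only the function name is recorded, which is the behaviour you would want for calls that handle sensitive values.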

**Manual Logging**  
For more control, you can log traces and spans manually using the Comet client. This allows you to create traces and add spans for various operations, tracking inputs and outputs as needed.
```
from opik import Opik

client = Opik(project_name="test")

# Create a trace
trace = client.trace(
    name="my_trace",
    input={"user_question": "Hello, how are you?"},
    output={"response": "Comment ça va?"}
)
```

**Updating Trace and Span Attributes**  
You can modify the attributes of traces and spans during execution to update metadata or log scores.

**Logging Scores**  
Scores can be logged to traces and spans using dedicated methods, allowing you to capture feedback on the quality and coherence of responses.

**Advanced Usage**  
To enhance performance, logging occurs in a background thread. If you want to ensure all traces are sent to Comet before exiting your program, use the flush method.
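
The reason a flush is needed can be illustrated with a small standard-library sketch. This is a conceptual model under stated assumptions, not Opik's code: `BackgroundLogger` and its `sent` list are hypothetical, and the "network call" is replaced by an in-memory append.

```python
import queue
import threading


class BackgroundLogger:
    """Toy logger: log() enqueues, a worker thread sends, flush() waits."""

    def __init__(self):
        self._queue = queue.Queue()
        self.sent = []  # stands in for the remote backend
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        while True:
            item = self._queue.get()
            self.sent.append(item)  # a real client would make a network call here
            self._queue.task_done()

    def log(self, item):
        # Returns immediately; the caller is never blocked by the "network".
        self._queue.put(item)

    def flush(self):
        # Blocks until every queued item has been processed by the worker.
        self._queue.join()


logger = BackgroundLogger()
for i in range(3):
    logger.log({"trace": i})

logger.flush()  # without this, a short-lived script could exit before sending
print(logger.sent)
```

Because the worker is a daemon thread, the interpreter does not wait for it at exit, which is why short scripts should call the client's flush method before finishing.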

**Logging Distributed Traces**  
In complex LLM applications, tracking traces across multiple services is essential. Comet supports distributed tracing seamlessly when using function decorators, employing a mechanism similar to OpenTelemetry.
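
The general idea behind header-based trace propagation can be sketched as follows. This is an illustration of the mechanism only, not Opik's API: the header names `x-trace-id` and `x-parent-span-id`, the `SPANS` list, and the two service functions are assumptions made for the example.

```python
import uuid

# Hypothetical header names, chosen for illustration only.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"

SPANS = []  # in-memory stand-in for a tracing backend


def start_span(name, trace_id=None, parent_span_id=None):
    """Create a span, starting a new trace when no trace id is supplied."""
    span = {
        "span_id": uuid.uuid4().hex,
        "trace_id": trace_id or uuid.uuid4().hex,
        "parent_span_id": parent_span_id,
        "name": name,
    }
    SPANS.append(span)
    return span


def outgoing_headers(span):
    """Headers the caller attaches to a request to a downstream service."""
    return {TRACE_HEADER: span["trace_id"], PARENT_HEADER: span["span_id"]}


def service_b(headers):
    """Downstream service: continues the caller's trace from the headers."""
    return start_span(
        "service_b.handle",
        trace_id=headers.get(TRACE_HEADER),
        parent_span_id=headers.get(PARENT_HEADER),
    )


def service_a():
    root = start_span("service_a.handle")
    child = service_b(outgoing_headers(root))  # stands in for an HTTP call
    return root, child


root, child = service_a()
print(child["trace_id"] == root["trace_id"])
```

Both spans share one trace id, and the downstream span points at the caller's span as its parent, so a backend can reassemble the full call tree.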

For more detail, refer to the [log traces documentation](https://www.comet.com/docs/opik/tracing/log_traces).

### Annotating Traces

Annotating traces is essential for evaluating and enhancing your LLM-based applications. By systematically recording qualitative or quantitative feedback on specific interactions or entire conversation flows, you can:

- Track performance over time
- Identify areas for improvement
- Compare different model versions or prompts
- Gather data for fine-tuning or retraining
- Provide stakeholders with concrete metrics on system effectiveness

Opik enables you to annotate traces using either the SDK or the UI.

**Annotating Traces through the UI**  
To annotate traces via the UI, navigate to the trace you wish to annotate on the traces page and click the "Annotate" button. This opens a sidebar where you can add your annotations. You can annotate both traces and spans, so be sure to select the correct span in the sidebar.

**Annotating Traces through the SDK**  
You can also annotate traces and spans using the SDK, which is useful for incorporating evaluation or user feedback scores.

**Logging Feedback Scores for Traces**  
You can log feedback scores for traces with the `log_traces_feedback_scores` method. Each score can include an optional reason field for clarity.
```
from opik import Opik

client = Opik(project_name="my_project")

trace = client.trace(name="my_trace")

client.log_traces_feedback_scores(
    scores=[
        {"id": trace.id, "name": "overall_quality", "value": 0.85},
        {"id": trace.id, "name": "coherence", "value": 0.75},
    ]
)
```

**Logging Feedback Scores for Spans**  
To log feedback for individual spans, use the `log_spans_feedback_scores` method. This allows you to capture detailed feedback on specific parts of your LLM application.
```
from opik import Opik

client = Opik()

trace = client.trace(name="my_trace")
span = trace.span(name="my_span")

client.log_spans_feedback_scores(
    scores=[
        {"id": span.id, "name": "overall_quality", "value": 0.85},
        {"id": span.id, "name": "coherence", "value": 0.75},
    ],
)
```

Computing feedback scores can be challenging with Large Language Models (LLMs) due to their unstructured and non-deterministic outputs. To assist with this, Opik offers a range of built-in evaluation metrics.

## Evaluation

When working with LLM applications, the evaluation process can often slow down iteration. While manually reviewing outputs is possible, it’s inefficient and not scalable. Opik streamlines this by automating the evaluation of your LLM application.

To effectively run evaluations in Opik, it’s essential to understand two key concepts:

- **Dataset:** A dataset consists of a collection of samples used for evaluation. It stores the input and expected outputs for each sample, while the actual outputs from your LLM application are computed and scored during the evaluation.

- **Experiment:** An experiment represents a single evaluation of your LLM application. During an experiment, each item in the dataset is processed, the output is generated by your LLM, and then the output is scored.
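
The relationship between the two concepts can be sketched in plain Python. This is a conceptual model, not the Opik SDK: `Item`, `ExperimentResult`, `run_experiment`, `exact_match`, and `toy_app` are hypothetical names, and a lookup table stands in for the LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One dataset sample: an input and the output we expect for it."""
    input: str
    expected_output: str


@dataclass
class ExperimentResult:
    name: str
    rows: list = field(default_factory=list)


def exact_match(output, expected):
    return 1.0 if output == expected else 0.0


def run_experiment(name, dataset, task, metric):
    """Process every dataset item, generate an output, and score it."""
    result = ExperimentResult(name)
    for item in dataset:
        output = task(item.input)
        result.rows.append({
            "input": item.input,
            "expected_output": item.expected_output,
            "output": output,
            "score": metric(output, item.expected_output),
        })
    return result


def toy_app(question):
    # Stand-in for an LLM application; the last answer is deliberately wrong.
    canned = {"2 + 2": "4", "3 * 3": "9", "10 / 4": "2"}
    return canned.get(question, "")


dataset = [Item("2 + 2", "4"), Item("3 * 3", "9"), Item("10 / 4", "2.5")]
experiment = run_experiment("baseline", dataset, toy_app, exact_match)
print([row["score"] for row in experiment.rows])
```

The dataset holds only inputs and expected outputs; the actual outputs and scores exist per experiment, so the same dataset can be reused across many runs.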

### Datasets

The first step in automating the evaluation of your LLM application is to create a dataset, which is a collection of samples for evaluation. Each dataset consists of Dataset Items that store the input, expected output, and other metadata for individual samples.

Given their importance in the evaluation process, teams often dedicate considerable time to curating and preparing their datasets. There are three primary ways to create a dataset:

1. **Manually Curating Examples:** Start by manually selecting a set of examples based on your understanding of the application. Involve subject matter experts to enhance the quality of the dataset.

2. **Using Synthetic Data:** If you lack sufficient data for a diverse set of examples, consider using synthetic data generation tools.

3. **Leveraging Production Data:** If your application is already in production, you can enrich your dataset by using real-world data generated during operation. While this may not be the initial step, it can significantly enhance the dataset.

If you're using Opik for production monitoring, you can easily add traces to your dataset by selecting them in the UI and choosing "Add to dataset" from the Actions dropdown.


#### Manage Datasets
Datasets are essential for tracking the test cases you want to evaluate your LLM on. Each dataset consists of DatasetItems, which include input, optional expected output, and metadata fields. You can create datasets through the following methods:

1. **Python SDK:** Use the Python SDK to create a dataset and add items programmatically.
2. **Traces Table:** Incorporate existing logged traces (e.g., from a production application) into a dataset.
3. **Comet UI:** Manually create a dataset and add items directly through the Comet user interface.

Once a dataset is created, you can run Experiments on it. Each Experiment evaluates your LLM application against the test cases in the dataset, utilizing an evaluation metric to report results back to the dataset.

### Experiments

Experiments are the foundational elements of the Opik evaluation framework. Each time you conduct a new evaluation, a new experiment is generated, consisting of two main components:

#### Experiment Configuration
The configuration object for each experiment enables you to track metadata, such as the prompt template used. This is particularly useful for maintaining clarity about what has changed between iterations of an experiment, such as the model used, temperature settings, and other relevant parameters. You can easily compare the configurations of different experiments through the Opik UI to identify any differences.
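
Outside the UI, the same kind of comparison can be approximated with a few lines of Python. This is a sketch assuming configurations are flat dictionaries; `diff_configs` and the example values are hypothetical.

```python
def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ.

    A key missing from one side is reported with None for that side.
    """
    changed = {}
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key), b.get(key)
        if left != right:
            changed[key] = (left, right)
    return changed


baseline = {"model": "gpt-3.5-turbo", "temperature": 0.0, "prompt_template": "v1"}
candidate = {"model": "gpt-3.5-turbo", "temperature": 0.7, "prompt_template": "v2"}

print(diff_configs(baseline, candidate))
```

Recording every parameter that can change in the configuration makes it possible to attribute a score difference between two experiments to a specific change.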

#### Experiment Items
Experiment items capture essential details for each dataset sample processed during an experiment. They include the input, expected output, actual output, and feedback scores. Each item is also linked to a trace, providing context for why a particular item received its score.

In addition, you'll be able to view the average scores for each metric associated with the experiment, giving you a comprehensive overview of the evaluation results.

### Evaluating Your LLM Application

Evaluating your LLM application helps ensure confidence in its performance. This evaluation typically occurs both during development and as part of application testing.

The evaluation process consists of five steps:

1. **Add Tracing:** Integrate tracing into your LLM application to log relevant data.
2. **Define the Evaluation Task:** Specify the task that maps the inputs to the expected outputs.
3. **Choose the Dataset:** Select the dataset on which you want to evaluate your application.
4. **Select Metrics:** Decide on the metrics you will use for evaluation.
5. **Create and Run the Evaluation Experiment:** Set up and execute the evaluation experiment based on the chosen components.
    To run an evaluation, you'll need to gather the following components:

    1. **Dataset:** The dataset on which you want to conduct the evaluation.
    2. **Evaluation Task:** This defines how the inputs in the dataset are mapped to the outputs you wish to score. This is typically the LLM application you are developing.
    3. **Metrics:** The specific metrics you want to use for scoring the outputs of your LLM.

By following these steps, you can effectively assess and improve the performance of your LLM application.


```
from opik import Opik, track, DatasetItem
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, Hallucination
from opik.integrations.openai import track_openai
import openai

# Define the task to evaluate
openai_client = track_openai(openai.OpenAI())

MODEL = "gpt-3.5-turbo"

# Add tracing to your application
@track
def your_llm_application(input: str) -> str:
    response = openai_client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": input}],
    )

    return response.choices[0].message.content

# Define the evaluation task
def evaluation_task(x: DatasetItem):
    return {
        "input": x.input['user_question'],
        "output": your_llm_application(x.input['user_question']),
        "context": your_context_retriever(x.input['user_question'])
    }

@track
def your_context_retriever(input: str) -> list:
    return ["..."]


# Create a simple dataset
client = Opik()
try:
    dataset = client.create_dataset(name="your-dataset-name")
    dataset.insert([
        {"input": {"user_question": "What is the capital of France?"}},
        {"input": {"user_question": "What is the capital of Germany?"}},
    ])
except Exception:  # for example, the dataset already exists
    dataset = client.get_dataset(name="your-dataset-name")

# Define the metrics
hallucination_metric = Hallucination()

# Create and Run the Evaluation Experiment
evaluation = evaluate(
    experiment_name="My experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    experiment_config={
        "model": MODEL
    }
)
```

### Metrics Overview

Opik offers a set of built-in evaluation metrics to assess the output of your LLM calls, categorized into two main types:

1. **Heuristic Metrics:** These metrics are deterministic and typically statistical in nature.
2. **LLM as a Judge Metrics:** These are non-deterministic metrics that utilize an LLM to evaluate the output of another LLM.

#### Built-in Evaluation Metrics

| Metric           | Type                 | Description                                        |
|------------------|----------------------|----------------------------------------------------|
| **Equals**       | Heuristic            | Checks if the output exactly matches an expected string. |
| **Contains**     | Heuristic            | Verifies if the output contains a specific substring, with case sensitivity options. |
| **RegexMatch**   | Heuristic            | Checks if the output matches a specified regular expression pattern. |
| **IsJson**       | Heuristic            | Validates if the output is a valid JSON object.   |
| **Levenshtein**  | Heuristic            | Calculates the Levenshtein distance between the output and an expected string. |
| **Hallucination**| LLM as a Judge       | Identifies if the output contains any hallucinations. |
| **Moderation**   | LLM as a Judge       | Checks if the output contains harmful content.     |
| **AnswerRelevance** | LLM as a Judge    | Assesses if the output is relevant to the question. |
| **ContextRecall**| LLM as a Judge       | Evaluates if the output includes relevant context. |
| **ContextPrecision** | LLM as a Judge   | Measures the precision of the output in relation to the context. |

You can also create your own custom metric; for more information, refer to the [custom metric documentation](https://www.comet.com/docs/opik/evaluation/metrics/custom_metric).
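
To show why heuristic metrics are deterministic, the checks described in the table can be approximated in plain Python. These functions are illustrative stand-ins, not Opik's implementations: the names, signatures, and return values are assumptions, and the built-in metrics may normalize or report scores differently.

```python
import json
import re


def equals(output, expected):
    return 1.0 if output == expected else 0.0


def contains(output, substring, case_sensitive=False):
    if not case_sensitive:
        output, substring = output.lower(), substring.lower()
    return 1.0 if substring in output else 0.0


def regex_match(output, pattern):
    return 1.0 if re.search(pattern, output) else 0.0


def is_json(output):
    # This sketch accepts any valid JSON document, not only objects.
    try:
        json.loads(output)
        return 1.0
    except ValueError:
        return 0.0


def levenshtein_distance(a, b):
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))
```

Each function returns the same score for the same input on every call, which is the property that separates these metrics from LLM as a Judge metrics.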

## Integration

Opik simplifies the process of logging, viewing, and evaluating your LLM traces by offering a range of integrations:

| Integration   | Description                                |
|---------------|--------------------------------------------|
| **OpenAI**    | Log traces for all OpenAI LLM calls       |
| **LangChain** | Log traces for all LangChain LLM calls    |
| **LlamaIndex**| Log traces for all LlamaIndex LLM calls   |
| **Ollama**    | Log traces for all Ollama LLM calls       |
| **Predibase** | Fine-tune and serve open-source LLMs      |
| **Ragas**     | Evaluation framework for Retrieval Augmented Generation (RAG) pipelines |

These integrations help you effectively manage your LLM applications.

### Example: Opik with LangChain for Text to SQL Query Generation

Comet offers smooth integration with LangChain, enabling you to effortlessly log and track your LangChain applications.

This example walks you through generating SQL queries from natural language questions using LangChain and the Chinook database. The workflow involves creating a synthetic dataset of questions, building a LangChain chain to generate SQL queries, and automating the evaluation of those queries.

#### Prerequisites

Before you begin, ensure you have:

*   An account on Comet to access the Opik platform.
*   Access to an LLM. This example uses the OpenAI API.

```
pip install --upgrade --quiet opik langchain langchain-community langchain-openai
```

#### Setting Up API Keys

```
import opik

opik.configure(use_local=False)
```

The call prompts for your API key and workspace:

```
OPIK: Your Opik cloud API key is available at https://www.comet.com/api/my/settings/.
Please enter your Opik Cloud API key:··········
Do you want to use "sumitmishra5504" workspace? (Y/n)Y
OPIK: Saved configuration to a file: /root/.opik.config
```

```
import os
import getpass

# Prompt for the OpenAI key only if it is not already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
```


#### Creating and Preparing Synthetic Dataset

To generate a dataset of questions, we will utilize the OpenAI API. The goal is to create 25 diverse questions related to the Chinook database.

```
# Download the Chinook SQLite database if it is not already present
import os

import requests
from langchain_community.utilities import SQLDatabase

url = "https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"
filename = "./data/chinook/Chinook_Sqlite.sqlite"

folder = os.path.dirname(filename)
os.makedirs(folder, exist_ok=True)

if not os.path.exists(filename):
    response = requests.get(url)
    response.raise_for_status()  # fail early on a bad download
    with open(filename, "wb") as file:
        file.write(response.content)
    print("Chinook database downloaded")

# Wrap the SQLite file so LangChain can inspect and query it
db = SQLDatabase.from_uri(f"sqlite:///{filename}")
```

```
Chinook database downloaded
```


Then, use the OpenAI API to generate the questions. Wrap the client with the `track_openai` function from the `opik` library so that each API call is tracked.

```
from opik.integrations.openai import track_openai
from openai import OpenAI
import json

os.environ["OPIK_PROJECT_NAME"] = "langchain-integration-demo"
client = OpenAI()

# Every call made through the wrapped client is logged to Opik
openai_client = track_openai(client)

prompt = """
Create 25 different example questions a user might ask based on the Chinook Database.

These questions should be complex and require the model to think. They should include complex joins and window functions to answer.

Return the response as a json object with a "result" key and an array of strings with the question.
"""

completion = openai_client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "user", "content": prompt}
  ]
)

print(completion.choices[0].message.content)
```

The response (truncated here) looks like:

```
{
  "result": [
    "What is the average total purchase amount for each customer?",
    "Which customer has the highest total purchase amount?",
    "How many unique tracks has each customer purchased?",
    "Which customer has purchased the most unique tracks?",
    "What is the average number of tracks purchased per invoice?",
    "Which album has the highest number of tracks purchased?",
    "What is the average duration of tracks purchased by each customer?",
    "Which customer has purchased tracks with the longest duration on average?",
    "What is the total revenue generated by each genre?",
    "Which genre has generated the most revenue?",
    "What is the average purchase amount for each employee?",
    "Which employee has the highest average purchase amount?",
    "How does the total purchase amount vary by country?",
    "Which country has the highest total purchase amount?",
    "What is the total number of customers in each country?",
    "Which country has the most cust
```

Next, insert the generated questions into an Opik dataset.

```
# Create the synthetic dataset
import opik
from opik import DatasetItem

synthetic_questions = json.loads(completion.choices[0].message.content)["result"]

client = opik.Opik()
try:
    dataset = client.create_dataset(name="synthetic_questions")
    dataset.insert([
        DatasetItem(input={"question": question}) for question in synthetic_questions
    ])
except opik.rest_api.core.ApiError:
    print("Dataset already exists")
```

#### Creating a LangChain Chain


Next, we will create a chain that converts these natural language questions into SQL queries. This is accomplished using the `create_sql_query_chain` function from the LangChain library.

```
# Use LangChain to create a SQL query to answer the question
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI
from opik.integrations.langchain import OpikTracer

# The tracer logs every chain invocation to Opik
opik_tracer = OpikTracer(tags=["simple_chain"])

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = create_sql_query_chain(llm, db).with_config({"callbacks": [opik_tracer]})
response = chain.invoke({"question": "How many employees are there ?"})

print(response)
```

```
SELECT COUNT("EmployeeId") AS "TotalEmployees" FROM "Employee"
```


#### Automating the Evaluation

To verify that our application is functioning correctly, we will evaluate the generated SQL queries against our synthetic dataset. The custom metric defined here checks that each query is valid, meaning that it runs against the database without raising an error.

Use the `evaluate` function from the `opik` library to score every generated query with this metric.

```
from opik import Opik, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import base_metric, score_result
from typing import Any

class ValidSQLQuery(base_metric.BaseMetric):
    def __init__(self, name: str, db: Any):
        self.name = name
        self.db = db

    def score(self, output: str, **ignored_kwargs: Any):
        # Score 1 if the query executes against the database, 0 otherwise
        try:
            self.db.run(output)
            return score_result.ScoreResult(
                name=self.name,
                value=1,
                reason="Query ran successfully"
            )
        except Exception as e:
            return score_result.ScoreResult(
                name=self.name,
                value=0,
                reason=str(e)
            )

valid_sql_query = ValidSQLQuery(name="valid_sql_query", db=db)

client = Opik()
dataset = client.get_dataset("synthetic_questions")

@track()
def llm_chain(input: str) -> str:
    response = chain.invoke({"question": input})

    return response

def evaluation_task(item):
    response = llm_chain(item.input["question"])

    return {
        "reference": "hello",
        "output": response,
    }

res = evaluate(
    experiment_name="SQL question answering",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[valid_sql_query]
)
```

```
Evaluation: 100%|██████████| 24/24 [00:02<00:00,  9.98it/s]
╭─ synthetic_questions (24 samples) ─╮
│                                    │
│ Total time:        00:00:03        │
│ Number of samples: 24              │
│                                    │
│ valid_sql_query: 0.9167 (avg)      │
│                                    │
╰────────────────────────────────────╯
```
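
Because each query scores either 0 or 1, the reported average is the fraction of queries that ran: 0.9167 over 24 samples corresponds to 22 valid queries (22/24 ≈ 0.9167). The same check and aggregation can be reproduced without LangChain or Opik. The sketch below assumes a tiny in-memory SQLite table in place of the Chinook database; `score_query` and the sample queries are hypothetical.

```python
import sqlite3

# A small in-memory table standing in for the Chinook Employee table.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, LastName TEXT)"
)
connection.executemany(
    "INSERT INTO Employee (LastName) VALUES (?)", [("Adams",), ("Edwards",)]
)


def score_query(conn, query):
    """Return 1 if the query executes, otherwise 0 with the error as reason."""
    try:
        conn.execute(query).fetchall()
        return {"value": 1, "reason": "Query ran successfully"}
    except sqlite3.Error as error:
        return {"value": 0, "reason": str(error)}


queries = [
    'SELECT COUNT("EmployeeId") AS "TotalEmployees" FROM "Employee"',
    "SELECT LastName FROM Employee ORDER BY LastName",
    "SELECT Salary FROM Employee",  # fails: no such column
    "SELEC * FROM Employee",        # fails: syntax error
]

results = [score_query(connection, query) for query in queries]
average = sum(result["value"] for result in results) / len(results)
print(f"valid_sql_query: {average:.4f} (avg)")
```

A metric of this kind only confirms that a query executes. It does not confirm that the result answers the question, so a high average is a necessary signal of quality and not a sufficient one.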


---