### Deepseek model with Langfuse

In [None]:
# Drop-in replacement to get full logging by changing only the import
from langfuse.openai import OpenAI

# Configure the OpenAI client to use https://openrouter.ai/api/v1 as base url
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="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",
)

response = client.chat.completions.create(
    model="deepseek/deepseek-chat",
    messages=[{"role": "user", "content": "What is Langfuse?"}],
)
print(response.choices[0].message.content)


In [11]:
from langfuse import observe
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="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"
)

@observe()
def story():
    return (
        client.chat.completions.create(
            model="deepseek/deepseek-chat",
            messages=[{"role": "user", "content": "What is Langfuse?"}],
        )
        .choices[0]
        .message.content
    )

@observe()
def main():
    return story()


main()

'**Langfuse** is an **open-source observability and analytics platform** designed for **LLM (Large Language Model) applications**. It helps developers and teams track, analyze, and optimize interactions with AI models by providing detailed insights into prompts, responses, costs, latencies, and user feedback.\n\n### **Key Features of Langfuse:**\n1. **Prompt & Response Tracking**  \n   - Logs full LLM interactions (input prompts, outputs, metadata).\n   - Supports multiple model providers (OpenAI, Anthropic, Mistral, etc.).\n\n2. **Tracing & Debugging**  \n   - Visualizes complex LLM workflows (e.g., multi-step chains, RAG pipelines).\n   - Helps identify errors, inefficiencies, or unexpected behavior.\n\n3. **Analytics & Metrics**  \n   - Tracks **costs, latencies, token usage**, and user feedback.\n   - Provides dashboards for performance monitoring.\n\n4. **Feedback & Evaluation**  \n   - Collects user ratings (thumbs up/down) or custom metrics.\n   - Enables human or automated eval

### Evaluate Langfuse LLM Traces with an External Evaluation Pipeline


In [None]:
%pip install langfuse openai deepeval --upgrade

In [None]:
import os
 
# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-d34a5339-8fc2-4cdd-b91b-828c2d9447dd" 
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-d1562c7b-7525-4835-80d0-31e153dcd175" 
LANGFUSE_HOST="http://localhost:3000"

# Your openai key
os.environ["OPENAI_API_KEY"] = "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"

Let‚Äôs go ahead and generate a list of topic suggestions that we can later query to our application.

In [10]:
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENAI_API_KEY"]
)

 
topic_suggestion = """ You're a world-class journalist, specialized
in figuring out which are the topics that excite people the most.
Your task is to give me 50 suggestions for pop-science topics that the general
public would love to read about. Make sure topics don't repeat.
The output must be a comma-separated list. Generate the list and NOTHING else.
The use of numbers is FORBIDDEN.
"""
 
output = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": topic_suggestion
        }
    ],
    model="deepseek/deepseek-chat:free",
 
    temperature=1
).choices[0].message.content
 
topics = [item.strip() for item in output.split(",")]
for topic in topics:
    print(topic)

NotFoundError: Error code: 404 - {'error': {'message': 'No endpoints found for deepseek/deepseek-chat:free.', 'code': 404}, 'user_id': 'user_2zXO05UXFvcyo7gqwuBgsDROSDE'}

we‚Äôll use Langfuse‚Äôs @observe() decorator. This decorator automatically monitors all LLM calls (generations) nested in the function. We‚Äôre also using the langfuse class to label and tag the traces, making it easier to fetch them later.

In [None]:
from langfuse import observe, get_client
langfuse = get_client()
 
prompt_template = """
You're an expert science communicator, able to explain complex topics in an
approachable manner. Your task is to respond to the questions of users in an
engaging, informative, and friendly way. Stay factual, and refrain from using
jargon. Your answer should be 4 sentences at max.
Remember, keep it ENGAGING and FUN!
 
Question: {question}
"""
 
@observe()
def explain_concept(topic):
    langfuse.update_current_trace(
        name=f"Explanation '{topic}'",
        tags=["ext_eval_pipelines"]
    )
    prompt = prompt_template.format(question=topic)
 
 
    return client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="deepseek/deepseek-chat:free",
 
        temperature=0.6
    ).choices[0].message.content
 
 
for topic in topics:
    print(f"Input: Please explain to me {topic.lower()}")
    print(f"Answer: {explain_concept(topic)} \n")

### 1. Fetch Your Traces

 We‚Äôll take an incremental approach: first, we‚Äôll fetch the initial 10 traces and evaluate them. After that, we‚Äôll add our scores back into Langfuse and move on to the next batch of 10 traces. We‚Äôll keep this cycle going until we‚Äôve processed a total of 50 traces.

In [None]:
from langfuse import get_client
from datetime import datetime, timedelta
 
BATCH_SIZE = 10
TOTAL_TRACES = 50
 
langfuse = get_client()
 
now = datetime.now()
five_am_today = datetime(now.year, now.month, now.day, 5, 0)
five_am_yesterday = five_am_today - timedelta(days=1)
 
traces_batch = langfuse.api.trace.list(page=1,
                                     limit=BATCH_SIZE,
                                     tags="ext_eval_pipelines",
                                     from_timestamp=five_am_yesterday,
                                     to_timestamp=datetime.now()
                                   ).data
 
print(f"Traces in first batch: {len(traces_batch)}")

In [14]:
print("content: ", traces_batch[1].output)

content:  Ah, decision-making‚Äîour brain‚Äôs daily game of ‚Äúchoose your own adventure‚Äù! It‚Äôs a mix of logic, emotions, and even a pinch of gut feeling, all working together to guide us. Sometimes, we overthink (hello, analysis paralysis!), while other times, we go with snap judgments (thanks, instincts!). Understanding this process helps us make better choices and maybe even outsmart our own biases! üß†‚ú®


### 2. Run your evaluations

#### 2.1. Categoric Evaluations

In [None]:
template_tone_eval = """
You're an expert in human emotional intelligence. You can identify with ease the
 tone in human-written text. Your task is to identify the tones present in a
 piece of <text/> with precission. Your output is a comma separated list of three
 tones. PRINT THE LIST ALONE, NOTHING ELSE.
 
<possible_tones>
neutral, confident, joyful, optimistic, friendly, urgent, analytical, respectful
</possible_tones>
 
<example_1>
Input: Citizen science plays a crucial role in research by involving everyday
people in scientific projects. This collaboration allows researchers to collect
vast amounts of data that would be impossible to gather on their own. Citizen
scientists contribute valuable observations and insights that can lead to new
discoveries and advancements in various fields. By participating in citizen
science projects, individuals can actively contribute to scientific research
and make a meaningful impact on our understanding of the world around us.
 
Output: respectful,optimistic,confident
</example_1>
 
<example_2>
Input: Bionics is a field that combines biology and engineering to create
devices that can enhance human abilities. By merging humans and machines,
bionics aims to improve quality of life for individuals with disabilities
or enhance performance for others. These technologies often mimic natural
processes in the body to create seamless integration. Overall, bionics holds
great potential for revolutionizing healthcare and technology in the future.
 
Output: optimistic,confident,analytical
</example_2>
 
<example_3>
Input: Social media can have both positive and negative impacts on mental
health. On the positive side, it can help people connect, share experiences,
and find support. However, excessive use of social media can also lead to
feelings of inadequacy, loneliness, and anxiety. It's important to find a
balance and be mindful of how social media affects your mental well-being.
Remember, it's okay to take breaks and prioritize your mental health.
 
Output: friendly,neutral,respectful
</example_3>
 
<text>
{text}
</text>
"""
 
 
test_tone_score = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": template_tone_eval.format(
                text=traces_batch[1].output),
        }
    ],
    model="deepseek/deepseek-chat:free",
 
    temperature=0
).choices[0].message.content
print(f"User query: {traces_batch[1].input['args'][0]}")
print(f"Model answer: {traces_batch[1].output}")
print(f"Dominant tones: {test_tone_score}")

Identifying human intents and tones can be tricky for language models. To handle this, we used a multi-shot prompt, which means giving the model several examples to learn from. Now let‚Äôs wrap our code in an evaluation function for convenience.

In [None]:
def tone_score(trace):
    return client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": template_tone_eval.format(text=trace.output),
            }
        ],
        model="deepseek/deepseek-chat:free",
        temperature=0
    ).choices[0].message.content
 
tone_score(traces_batch[1])

#### 2.2. Numeric Evaluations

In [None]:
%pip install ipywidgets

GEval ÊîØÊåÅÁöÑÊ®°Âûã: ÈªòËÆ§‰ΩøÁî® OpenAI

In [None]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase
 
def joyfulness_score(trace):
		joyfulness_metric = GEval(
		    name="Correctness",
		    criteria="Determine whether the output is engaging and fun.",
		    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
		)
		test_case = LLMTestCase(
    input=trace.input["args"],
    actual_output=trace.output)
 
		joyfulness_metric.measure(test_case)
 
		print(f"Score: {joyfulness_metric.score}")
		print(f"Reason: {joyfulness_metric.reason}")
 
		return {"score": joyfulness_metric.score, "reason": joyfulness_metric.reason}
 
joyfulness_score(traces_batch[1])

OpenRouter Âíå DeepseekÔºå‰øÆÊîπÊàê

In [None]:
from deepeval.models import DeepEvalBaseLLM
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase
import os

class CustomLLM(DeepEvalBaseLLM):
    def __init__(self, model_name="deepseek/deepseek-chat"):
        self.model_name = model_name
        self.client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=os.environ["OPENAI_API_KEY"]
        )
    
    def load_model(self):
        return self.model_name
    
    def generate(self, prompt: str) -> str:
        try:
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=[{"role": "user", "content": prompt}],
                temperature=0
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Error: {e}"
    
    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)
    
    def get_model_name(self):
        return self.model_name

# ‰ΩøÁî®Ëá™ÂÆö‰πâÊ®°ÂûãËøõË°åËØÑ‰º∞
def joyfulness_score(trace):
    custom_llm = CustomLLM("deepseek/deepseek-chat")
    
    joyfulness_metric = GEval(
        name="Joyfulness",
        criteria="Determine whether the output is engaging and fun.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
        model=custom_llm
    )
    
    test_case = LLMTestCase(
        input=trace.input["args"],
        actual_output=trace.output
    )
    
    joyfulness_metric.measure(test_case)
    
    return {"score": joyfulness_metric.score, "reason": joyfulness_metric.reason}

joyfulness_score(traces_batch[1])

 GEval uses chain of thought (CoT) prompting to formulate a set of criteria for scoring prompts. When developing your own metrics, it‚Äôs important to review the reasoning behind these scores. This helps ensure that the model evaluates the traces just as you intended when you wrote the evaluation prompt.

### 3. Pushing Scores to Langfuse

Now that we have our evaluation functions ready, it‚Äôs time to put them to work. Use the Langfuse client to add scores to existing traces.

In [None]:
langfuse.create_score(
    trace_id=traces_batch[1].id,
    name="tone",
    value=joyfulness_score(traces_batch[1])["score"],
    comment=joyfulness_score(traces_batch[1])["reason"]
)

And thus, you‚Äôve added your first externally-evaluated score to Langfuse! Just 49 more to go üòÅ. But don‚Äôt worry ‚Äî our solutions are easy to scale.

### 4. Putting everything together


In [23]:
import math
 
for page_number in range(1, math.ceil(TOTAL_TRACES/BATCH_SIZE)):
 
    traces_batch = langfuse.api.trace.list(
        tags="ext_eval_pipelines",
        page=page_number,
        from_timestamp=five_am_yesterday,
        to_timestamp=five_am_today,
        limit=BATCH_SIZE
    ).data
 
    for trace in traces_batch:
        print(f"Processing {trace.name}")
 
        if trace.output is None:
            print(f"Warning: \n Trace {trace.name} had no generated output, \
            it was skipped")
            continue
 
        langfuse.create_score(
            trace_id=trace.id,
            name="tone",
            value=tone_score(trace)
        )
 
        jscore = joyfulness_score(trace)
        langfuse.create_score(
            trace_id=trace.id,
            name="joyfulness",
            value=jscore["score"],
            comment=jscore["reason"]
        )
 
    print(f"Batch {page_number} processed üöÄ \n")

Batch 1 processed üöÄ 

Batch 2 processed üöÄ 

Batch 3 processed üöÄ 

Batch 4 processed üöÄ 

