**Overview**

In this notebook, we cover how to implement a `PromptScorer`. To guide our example, let's imagine we have a customer support chatbot and want to evaluate whether its responses are polite/positive.

`PromptScorers` are powerful LLM-based scorers that are analogous to [LLM Judges](https://arxiv.org/abs/2306.05685). You can use Judgment to create custom LLM judges that are best suited to your specific evaluation case! Before you try implementing an LLM judge, you should check if any ready-made Judgment scorers already fit your evaluation needs.

With that, let's break down the `PromptScorer` class. 

In [None]:
import sys 
sys.path.append("./judgeval")  # root of judgeval

from dotenv import load_dotenv
load_dotenv()

from judgeval.judgment_client import JudgmentClient
from judgeval.data import Example
from judgeval.judges import TogetherJudge
from judgeval.scorers import PromptScorer
import nest_asyncio

# This allows us to run async code in notebooks
nest_asyncio.apply()

qwen = TogetherJudge()

**Building our custom prompt scorer**

Every prompt scorer that inherits from the `PromptScorer` class must implement the `build_measure_prompt()` method. This method takes an `Example` and builds a prompt for the LLM judge from its data. The only constraint is that the prompt must instruct the judge to end its answer with a JSON object containing two fields: `score` and `reason`. We use these later in our `success_check()` method!
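To make the JSON constraint concrete, here is a minimal sketch of how a judge reply ending in `{"score": ..., "reason": ...}` could be parsed and validated. The helper name `parse_judge_reply` is hypothetical and not part of Judgment's API; it uses only Python's standard `json` module and assumes the JSON object is flat and contains no braces inside the `reason` text.

```python
import json


def parse_judge_reply(reply: str) -> dict:
    """Extract the trailing {"score": ..., "reason": ...} object from a judge reply.

    Hypothetical helper for illustration only; not part of judgeval.
    Assumes the reply ends with a flat JSON object (no nested braces).
    """
    start = reply.rfind("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in judge reply")

    data = json.loads(reply[start:end + 1])

    # Both fields are required by the prompt's output contract.
    missing = {"score", "reason"} - data.keys()
    if missing:
        raise ValueError(f"judge reply is missing fields: {sorted(missing)}")

    return {"score": float(data["score"]), "reason": str(data["reason"])}


# Example reply text (made up for this sketch).
reply = 'The tone is friendly. {"score": 1, "reason": "The response is upbeat and helpful."}'
parsed = parse_judge_reply(reply)
print(parsed["score"], parsed["reason"])
```

Validating the two fields up front means a malformed judge reply fails loudly instead of silently producing a wrong score.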

Since we want to evaluate the sentiment of our chatbot's responses, let's have our judge examine the answer our chatbot gives to a question and decide whether that response is negative or not.

Lastly, we must implement the `success_check()` method. This method determines whether a single `Example` passes when treated as a test case. In our case, we want our chatbot to respond with neutral or positive sentiment (never negative!).
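As a rough sketch of the pass/fail rule, assume the judge rates negativity on a 1 to 5 scale where 5 is very negative, as the scorer in the next cell does. The function name `passes` and the constant `NEGATIVITY_CUTOFF` are illustrative only and are not part of Judgment's API.

```python
NEGATIVITY_CUTOFF = 3  # assumed cutoff on the 1-5 scale (5 = very negative)


def passes(score: float, cutoff: float = NEGATIVITY_CUTOFF) -> bool:
    """Return True when the judged response is neutral or positive."""
    return score <= cutoff


# Show the outcome for every score the judge can return.
for score in range(1, 6):
    print(score, "pass" if passes(score) else "fail")
```

With this cutoff, scores of 1 to 3 pass and scores of 4 and 5 fail. Lower the cutoff if neutral-leaning responses should fail too.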

In [2]:
class SentimentScorer(PromptScorer):
    """
    Detects negative sentiment (angry, sad, upset, etc.) in a response
    """
    def __init__(
        self, 
        name="Sentiment Scorer", 
        threshold=0.5, 
        model=qwen, 
        include_reason=True, 
        async_mode=True, 
        strict_mode=False, 
        verbose_mode=False
        ):
        super().__init__(
            name=name,
            threshold=threshold,
            model=model,
            include_reason=include_reason,
            async_mode=async_mode,
            strict_mode=strict_mode,
            verbose_mode=verbose_mode,
        )
        self.score = 0.0

    def build_measure_prompt(self, example: Example):
        SYSTEM_ROLE = (
            'You are a great judge of emotional intelligence. You understand the feelings ' 
            'and intentions of others. You will be tasked with judging whether the following '
            'response is negative (sad, angry, upset) or not. After deciding whether the '
            'response is negative or not, you will be asked to provide a brief, 1 sentence-long reason for your decision. '
            'You should score the response based on a 1 to 5 scale, where 1 is not negative and '
            '5 is very negative. Please end your response in the following JSON format: {"score": <score>, "reason": <reason>}'
        )
        return [
            {"role": "system", "content": SYSTEM_ROLE},
            {"role": "user", "content": f"Response: {example.actual_output}\n\nYour judgment: "}
        ] 

    def success_check(self):
        # On the 1-5 negativity scale, scores of 1-3 count as neutral or positive; 4-5 are too negative
        POSITIVITY_THRESHOLD = 3
        return self.score <= POSITIVITY_THRESHOLD


**Trying out our scorer**

That's it! We can now run our prompt scorer on some examples and see how it does!

In [None]:
pos_example = Example(
    input="What's the store return policy?",
    actual_output="Our return policy is wonderful! You may return any item within 30 days of purchase for a full refund.",
)

scorer = SentimentScorer()

client = JudgmentClient()
results = client.run_evaluation(
    [pos_example],
    [scorer],
    model="QWEN"
) 
print(results)


**Your Turn**

Now that we've seen how to implement a prompt scorer, try adapting it to your use case! Good luck!