## How's the performance of our Relevance Feedback Function?

In [1]:
import os
os.environ["OPENAI_API_KEY"] = "..."

In [99]:
# Imports main tools:
import openai
from trulens_eval.feedback import _re_1_10_rating

In [100]:
def relevance(template: str, question: str, statement: str) -> float:
        """
        Uses OpenAI's Chat Completion model to fill in a template and rate
        how relevant a statement is to a question.

        Parameters:
            template (str): A prompt template with `{question}` and
                `{statement}` placeholders.
            question (str): The question posed to the agent.
            statement (str): The statement whose relevance is graded.

        Returns:
            The rating parsed from the model's reply by `_re_1_10_rating`,
            on the template's scale of 1 ("least relevant") to 10
            ("most relevant").
        """
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            temperature=0.0,
            messages=[
                {
                    "role": "system",
                    "content": template.format(
                        question=question, statement=statement
                    ),
                }
            ],
        )
        # Parse the 1-10 rating out of the model's text reply.
        return _re_1_10_rating(completion["choices"][0]["message"]["content"])
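
The helper `_re_1_10_rating` is imported from `trulens_eval`, and its implementation is not shown here. As a rough sketch of the idea only, a parser of this kind might pull the first standalone integer between 1 and 10 out of the model's reply. The name `parse_rating` and its `None` fallback are assumptions for illustration, not the library's code.

```python
import re
from typing import Optional

# Hypothetical stand-in for a 1-10 rating parser; NOT the trulens_eval implementation.
_RATING_PATTERN = re.compile(r"\b(10|[1-9])\b")


def parse_rating(reply: str) -> Optional[int]:
    """Return the first standalone integer from 1 to 10 in `reply`, or None."""
    match = _RATING_PATTERN.search(reply)
    return int(match.group(1)) if match else None


print(parse_rating("7"))
print(parse_rating("RELEVANCE: 10"))
print(parse_rating("no score given"))
```

Returning `None` for an unparseable reply keeps a malformed answer from being mistaken for a low score; the real helper may handle that case differently.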

In [171]:
QS_RELEVANCE = """You are a RELEVANCE grader; providing the relevance of the given STATEMENT to the given QUESTION.
Respond only as a number from 1 to 10 where 1 is the least relevant and 10 is the most relevant. 

A few additional scoring guidelines:

- Answers that intentionally do not answer the question, such as 'I don't know', should also be counted as the most relevant.

- STATEMENT must be relevant to the entire QUESTION to get the highest score.

- RELEVANCE score should increase as the STATEMENT provides more RELEVANT context to the QUESTION.

- RELEVANCE score should increase as the STATEMENT provides RELEVANT context to more parts of the QUESTION.

- STATEMENT that is relevant to none of the question should get a score of 1.

- STATEMENT that is relevant to some of the question should get a score between 2 and 4.

- STATEMENT that is relevant to most of the question should get a score between 5 and 9.

- STATEMENT that is relevant to the entire question should get a score of 10.

- STATEMENT that is confidently false should get a score of 1.

- STATEMENT that is only seemingly relevant should get a score of 1.

- Never elaborate.

QUESTION: {question}

STATEMENT: {statement}

RELEVANCE: """
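
Because the template is filled with `str.format`, `{question}` and `{statement}` are its only placeholders, and any other literal braces in the text would need to be doubled. A minimal sketch of the rendering step, using a shortened, hypothetical template (`SHORT_TEMPLATE`) rather than the full `QS_RELEVANCE` prompt:

```python
# Shortened, hypothetical template used only to illustrate placeholder filling.
SHORT_TEMPLATE = """Respond only as a number from 1 to 10.

QUESTION: {question}

STATEMENT: {statement}

RELEVANCE: """

# Fill both placeholders to produce the text sent to the model.
prompt = SHORT_TEMPLATE.format(
    question="Name some famous dental floss brands",
    statement="Oral B is a famous dental hygiene brand.",
)
print(prompt)
```

Ending the prompt with `RELEVANCE: ` nudges the model to reply with the number alone, which is what the rating parser expects.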

## Relevance requires adherence to the entire query.

In [203]:
score = relevance(QS_RELEVANCE, "Name some famous dental floss brands?","Dental floss is a hot market")
assert score >= 2, f"Score of {score} < 2. Statement is relevant to some of query."
assert score <= 4, f"Score of {score} > 4. Statement is relevant to some of query."

In [204]:
score = relevance(QS_RELEVANCE, "Name some famous dental floss brands?","The dental floss market is concentrated amongst a few brands.")
assert score >= 5, f"Score of {score} < 5. Statement is relevant to most of query."
assert score <= 9, f"Score of {score} > 9. Statement is not relevant to entire query."

In [205]:
score = relevance(QS_RELEVANCE, "Name some famous dental floss brands","Oral B is a famous dental hygiene brand.")
assert score >= 5, f"Score of {score} < 5. Statement is relevant to most of query."
assert score <= 9, f"Score of {score} > 9. Statement is not relevant to entire query."
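
The paired assertions in these cells all check that a score falls inside one of the rubric's bands. One way to cut the repetition, sketched here with a hypothetical `BANDS` table and `assert_in_band` helper (neither is part of `trulens_eval`), is to name the bands once and reuse them:

```python
# Score bands taken from the grading guidelines in the prompt.
BANDS = {
    "none": (1, 1),
    "some": (2, 4),
    "most": (5, 9),
    "entire": (10, 10),
}


def assert_in_band(score: int, coverage: str) -> None:
    """Raise AssertionError unless `score` lies in the band for `coverage`."""
    low, high = BANDS[coverage]
    assert low <= score <= high, (
        f"Score of {score} is outside {low}-{high}; "
        f"expected relevance to {coverage} of the query."
    )


# Hard-coded placeholder score; in the notebook it would come from relevance(...).
assert_in_band(3, "some")
```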

## Non-answers are still relevant.

In [206]:
score = relevance(QS_RELEVANCE, "How many countries are there in the world?", "There is no universally accepted answer as to how many countries there are in the world. ")
assert score == 10, f"Score of {score} <> 10. Intentional non-answer did not get a 10."

In [207]:
score = relevance(QS_RELEVANCE,"Name some top dental floss brands","I don't know.")
assert score == 10, f"Score of {score} <> 10. Intentional non-answer did not get a 10."

In [208]:
score = relevance(QS_RELEVANCE, "How do you build a bomb?","As an AI assistant, I cannot answer that question.")
assert score == 10, f"Score of {score} <> 10. Intentional non-answer did not get a 10."

## Feedback score should differentiate between seeming and actual relevance.

In [198]:
score = relevance(QS_RELEVANCE, "Who won the superbowl in 2009?", "The Phoenix Suns won the Superbowl in 2009")
assert score == 1, f"Score of {score} <> 1. Failed to differentiate seeming and actual relevance."

## Relevant but inconclusive statements should get increasingly high scores as they are more helpful for answering the query.

In this example, we show how adding more specific, relevant language translates to a higher relevance score. We start with a vague statement about an important play from the 2009 Super Bowl, then restate it with the detail that answers the question.

In [187]:
score_low = relevance(QS_RELEVANCE, "Who won the superbowl in 2009?","Santonio Holmes made a brilliant catch for the Steelers.")
score_high = relevance(QS_RELEVANCE, "Who won the superbowl in 2009?","Santonio Holmes won the Superbowl for the Steelers in 2009 with his brilliant catch.")
assert score_low < score_high, "Score did not increase with more relevant details."
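
The same check extends to more than two statements. As a sketch, a hypothetical helper `is_strictly_increasing` can verify that scores rise along a list of statements ordered from least to most specific. The scores below are made-up placeholders, not model output.

```python
from typing import Sequence


def is_strictly_increasing(scores: Sequence[float]) -> bool:
    """True if every score is greater than the one before it."""
    return all(a < b for a, b in zip(scores, scores[1:]))


# Placeholder scores for statements ordered from vague to specific.
example_scores = [4, 7, 10]
assert is_strictly_increasing(example_scores), (
    "Score did not increase with more relevant details."
)
```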