# LangChain Evaluation (Criteria-Based Evaluation)

Reference: https://python.langchain.com/docs/guides/evaluation/

When we want a model's output to meet specific, user-defined standards, criteria-based evaluation is a natural fit. This notebook explores that approach and shows how built-in and custom criteria can be used to evaluate model outputs.

In this notebook, we use an OpenAI API key so that an OpenAI model can judge the model output against certain criteria.

In [15]:
from dotenv import load_dotenv, find_dotenv
import os

load_dotenv(find_dotenv("credentials.env"), override=True)

True
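
Before calling any evaluator, it can help to confirm that the key was actually loaded. The following is a minimal sketch, assuming `credentials.env` defines `OPENAI_API_KEY` (the notebook does not show the file's contents, so the variable name is an assumption). `require_env` is a hypothetical helper, not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; check credentials.env")
    return value


# Assumed variable name; adjust it if credentials.env uses a different one.
# api_key = require_env("OPENAI_API_KEY")
```

Failing early here gives a clearer error than an authentication failure raised later, in the middle of an evaluation run.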

Let's load the packages first.

In [22]:
from langchain.evaluation import load_evaluator
from langchain.evaluation import EvaluatorType

To list the pre-implemented criteria:

In [23]:
from langchain.evaluation import Criteria

list(Criteria)

[<Criteria.CONCISENESS: 'conciseness'>,
 <Criteria.RELEVANCE: 'relevance'>,
 <Criteria.CORRECTNESS: 'correctness'>,
 <Criteria.COHERENCE: 'coherence'>,
 <Criteria.HARMFULNESS: 'harmfulness'>,
 <Criteria.MALICIOUSNESS: 'maliciousness'>,
 <Criteria.HELPFULNESS: 'helpfulness'>,
 <Criteria.CONTROVERSIALITY: 'controversiality'>,
 <Criteria.MISOGYNY: 'misogyny'>,
 <Criteria.CRIMINALITY: 'criminality'>,
 <Criteria.INSENSITIVITY: 'insensitivity'>,
 <Criteria.DEPTH: 'depth'>,
 <Criteria.CREATIVITY: 'creativity'>,
 <Criteria.DETAIL: 'detail'>]

The list above shows the built-in criteria for assessing model responses. The "correctness" criterion requires a known correct answer (a reference) to compare against. For the other criteria, the model's response on its own is enough.

Next, we use the CriteriaEvalChain to check whether an output meets the "conciseness" criterion.
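
The rule above (a reference is needed only for "correctness") can be sketched as a small lookup. `NEEDS_REFERENCE` and `pick_evaluator_type` are hypothetical names introduced for illustration. The returned strings are the evaluator type names: `"criteria"` is the value of `EvaluatorType.CRITERIA`, and `"labeled_criteria"` is the type used later in this notebook for the reference-based check.

```python
# Criteria that can only be judged against a known-good answer.
# Per the text above, only "correctness" falls into this group.
NEEDS_REFERENCE = {"correctness"}


def pick_evaluator_type(criterion: str) -> str:
    """Return the evaluator type name suited to the given criterion."""
    if criterion in NEEDS_REFERENCE:
        return "labeled_criteria"  # expects a reference answer
    return "criteria"  # judges the prediction on its own


for name in ("conciseness", "correctness"):
    print(name, "->", pick_evaluator_type(name))
```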

In [24]:
evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="conciseness")

In [26]:
eval_result = evaluator.evaluate_strings(
  prediction="8",
  input="5+5?",
)
print(eval_result)

{'reasoning': 'The criterion is about the conciseness of the submission. The submission is indeed concise and to the point, as it is a single number. However, the criterion does not address the correctness of the answer. The submission is incorrect as the answer to 5+5 is 10, not 8. But since the criterion only asks for conciseness and not correctness, the submission technically meets the given criterion.\n\nY', 'value': 'Y', 'score': 1}


Although the response is incorrect, the evaluation criterion here is conciseness only, so the score is 1 and the value is Y.

Let's check another example.
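
The result is a plain dictionary with three keys: `reasoning`, `value`, and `score`. As a rough sketch, it can be reduced to a one-line verdict as follows. `summarize` is a hypothetical helper, and the reasoning text below is shortened from the output above.

```python
def summarize(result: dict) -> str:
    """Turn an evaluator result into a one-line verdict."""
    verdict = "PASS" if result["score"] == 1 else "FAIL"
    return f"{verdict} (value={result['value']}, score={result['score']})"


# Result copied from the cell above, with the reasoning shortened.
eval_result = {
    "reasoning": "The submission is indeed concise and to the point ...\n\nY",
    "value": "Y",
    "score": 1,
}
print(summarize(eval_result))  # PASS (value=Y, score=1)
```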

In [27]:
evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="helpfulness")

In [28]:
eval_result = evaluator.evaluate_strings(
  prediction="8",
  input="5+5?",
)
print(eval_result)

{'reasoning': 'The criterion is "helpfulness: Is the submission helpful, insightful, and appropriate?"\n\nThe input is a simple arithmetic problem: 5+5?\n\nThe submitted answer is 8.\n\nThe correct answer to the input problem is 10, not 8. Therefore, the submitted answer is incorrect.\n\nGiven that the submitted answer is incorrect, it is not helpful or appropriate for the given task. It does not provide the correct information or insight into the problem.\n\nTherefore, the submission does not meet the criteria.\n\nN', 'value': 'N', 'score': 0}


Here, the same output is evaluated against a different criterion. The evaluator judges the answer unhelpful because it is incorrect, so the score is 0 and the value is N.

In this way, we can judge any response against any criterion.
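
To make the contrast explicit, the two results can be placed side by side. This is only a sketch: `format_comparison` is a hypothetical helper, and the values are copied from the two evaluations above (prediction "8", input "5+5?").

```python
# Values and scores copied from the two evaluations above.
results = {
    "conciseness": {"value": "Y", "score": 1},
    "helpfulness": {"value": "N", "score": 0},
}


def format_comparison(results: dict) -> list:
    """Return one aligned text row per criterion: name, value, score."""
    width = max(len(name) for name in results)
    return [
        f"{name.ljust(width)}  {r['value']}  {r['score']}"
        for name, r in results.items()
    ]


for row in format_comparison(results):
    print(row)
```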

## Defining custom criteria

We can also define our own criteria, as shown below.

In [29]:
custom_criterion = {
    "numeric": "Does the output contain numeric or mathematical information?"
}

Here we define a "numeric" criterion that checks whether an output contains any numeric or mathematical information.
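
A custom criterion is just a mapping from a name to a description. The sketch below checks that shape before the mapping is handed to an evaluator. `check_criteria` is a hypothetical helper, and the rule that descriptions should be phrased as questions is a convention borrowed from the examples in this notebook, not a library requirement.

```python
def check_criteria(criteria: dict) -> list:
    """Return a list of problems found in a custom criteria mapping."""
    problems = []
    for name, description in criteria.items():
        if not isinstance(name, str) or not name.strip():
            problems.append(f"invalid criterion name: {name!r}")
        if not isinstance(description, str) or not description.strip():
            problems.append(f"{name!r}: description is empty")
        elif not description.strip().endswith("?"):
            # Assumed convention: a yes/no question maps cleanly to a Y/N verdict.
            problems.append(f"{name!r}: description is not phrased as a question")
    return problems


print(check_criteria(
    {"numeric": "Does the output contain numeric or mathematical information?"}
))  # []
```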

In [30]:
eval_chain = load_evaluator(
    EvaluatorType.CRITERIA,
    criteria=custom_criterion,
)

In [32]:
eval_result = eval_chain.evaluate_strings(
  prediction="8",
  input="5+5?",
)
print(eval_result)

{'reasoning': 'The criterion is asking if the output contains numeric or mathematical information. The submission is "8", which is a numeric value. Therefore, the submission does meet the criterion, even though the answer to the mathematical problem is incorrect. \n\nY', 'value': 'Y', 'score': 1}


We can also define multiple criteria:

In [33]:
custom_criteria = {
    "numeric": "Does the output contain numeric information?",
    "mathematical": "Does the output contain mathematical information?",
    "grammatical": "Is the output grammatically correct?",
    "logical": "Is the output logical?",
}

In [34]:
eval_chain = load_evaluator(
    EvaluatorType.CRITERIA,
    criteria=custom_criteria,
)

In [35]:
eval_result = eval_chain.evaluate_strings(
  prediction="8",
  input="5+5?",
)
print(eval_result)

{'reasoning': 'Let\'s assess the submission based on the given criteria:\n\n1. Numeric: The output does contain numeric information. The answer provided is a number, "8". So, it meets this criterion.\n\n2. Mathematical: The output does contain mathematical information. The answer provided is a result of a mathematical operation, even though it\'s incorrect. So, it meets this criterion.\n\n3. Grammatical: The output is grammatically correct. It\'s a single number without any additional words or symbols that could make it grammatically incorrect. So, it meets this criterion.\n\n4. Logical: The output is not logical. The sum of 5 and 5 is 10, not 8. So, it does not meet this criterion.\n\nThe submission meets the first three criteria but fails to meet the last one.\n\nN', 'value': 'N', 'score': 0}


The output was judged against all of the criteria at once, as the reasoning shows. Because it fails the "logical" criterion, the overall value is N and the score is 0.

Now let's evaluate the correctness criterion, which requires us to supply a reference text.
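
With several criteria in one evaluator, the result carries a single value and score. If a per-criterion breakdown is wanted, one option is to score each criterion separately. The sketch below assumes that approach. `evaluate_each` is a hypothetical helper that takes the grading step as a callable, and `fake_evaluate_one` is a stand-in that mirrors the reasoning printed above, so the example runs without an API call.

```python
from typing import Callable, Dict


def evaluate_each(
    criteria: Dict[str, str],
    evaluate_one: Callable[[str, str], dict],
) -> Dict[str, int]:
    """Score every criterion on its own and return {name: score}."""
    return {
        name: evaluate_one(name, description)["score"]
        for name, description in criteria.items()
    }


# Stand-in grader (not a real model call): only "logical" fails,
# matching the reasoning shown in the output above.
def fake_evaluate_one(name: str, description: str) -> dict:
    if name == "logical":
        return {"value": "N", "score": 0}
    return {"value": "Y", "score": 1}


custom_criteria = {
    "numeric": "Does the output contain numeric information?",
    "mathematical": "Does the output contain mathematical information?",
    "grammatical": "Is the output grammatically correct?",
    "logical": "Is the output logical?",
}

scores = evaluate_each(custom_criteria, fake_evaluate_one)
print(scores)
```

In the notebook, `evaluate_one` could instead build a single-criterion evaluator with `load_evaluator(EvaluatorType.CRITERIA, criteria={name: description})` and call `evaluate_strings(prediction="8", input="5+5?")` on it, at the cost of one model call per criterion.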

In [40]:
eval_chain = load_evaluator(
    "labeled_criteria",
    criteria='correctness',
)

In [41]:
eval_result = eval_chain.evaluate_strings(
    prediction="Moon is not the satellite of the earth",
    reference="Moon is the satellite of the earth",
    input="Name the satellite of the earth?",
)


In [42]:
print(eval_result)

{'reasoning': 'The criterion for this task is the correctness of the submission. The submission should be correct, accurate, and factual.\n\nThe input asks for the name of the satellite of the earth. The reference answer to this input is "Moon is the satellite of the earth", which is a correct and factual statement.\n\nThe submitted answer is "Moon is not the satellite of the earth". This statement is incorrect and not factual. The moon is indeed the satellite of the earth.\n\nTherefore, the submission does not meet the criterion of correctness.\n\nN', 'value': 'N', 'score': 0}
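
When many labeled examples are graded this way, the individual scores can be rolled up into a pass rate. This is a minimal sketch: `pass_rate` is a hypothetical helper, the first result is the one printed above (reasoning omitted), and the second is an invented passing result added purely for illustration.

```python
def pass_rate(results: list) -> float:
    """Return the fraction of results whose score is 1."""
    scores = [r["score"] for r in results]
    if not scores:
        raise ValueError("no results to aggregate")
    return sum(scores) / len(scores)


results = [
    {"value": "N", "score": 0},  # from the evaluation above
    {"value": "Y", "score": 1},  # hypothetical passing example
]
print(pass_rate(results))  # 0.5
```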


In the examples above, we saw how to use pre-implemented or custom criteria to evaluate a model output.