# deepeval-geval

Code reproduced from https://www.datacamp.com/tutorial/deepeval

The evaluation model is phi4, served locally with Ollama.

This notebook demonstrates how to use DeepEval, a framework for evaluating large language models (LLMs), to assess the correctness of generated outputs. It uses the GEval metric to compare each LLM response against an expected output for factual accuracy and completeness. Three test cases (an exact match, a correct but incomplete answer, and a factually wrong answer) show how GEval returns a score and a written reason for each response.

In [1]:
!deepeval set-local-model --model-name=phi4 --base-url="http://localhost:11434/v1/" --api-key="ollama"

🙌 Congratulations! You're now using a local model for all evals that require an
LLM.


In [2]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
correctness_metric = GEval(
    name="Correctness",
    # model="gpt-4o" is left out on purpose; the local model set in the previous cell is used instead
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'",
        "You should also lightly penalize omission of detail, and focus on the main idea",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
)
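
The evaluation output further down reports `threshold: 0.5` and `strict: False` for this metric, and marks a score of 0.7 as passing. A minimal sketch of that pass/fail decision, assuming a test case passes when its score is at least the threshold and that strict mode only accepts a perfect score. `MetricResult` is a hypothetical name used for illustration, not a DeepEval class.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical container for one metric score (not part of DeepEval)."""

    score: float
    threshold: float = 0.5  # the value shown in the evaluation output
    strict: bool = False

    @property
    def passed(self) -> bool:
        # Assumption: strict mode raises the bar to a perfect score of 1.0;
        # otherwise the score only has to reach the threshold.
        required = 1.0 if self.strict else self.threshold
        return self.score >= required


if __name__ == "__main__":
    for score in (1.0, 0.7, 0.3):
        print(score, MetricResult(score).passed)
```

Under these assumptions, scores of 1.0 and 0.7 pass against a threshold of 0.5, while the same 0.7 would fail if `strict` were set.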

In [3]:
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset


first_test_case = LLMTestCase(input="What are the main causes of deforestation?",
                              actual_output="The main causes of deforestation include agricultural expansion, logging, infrastructure development, and urbanization.",
                              expected_output="The main causes of deforestation include agricultural expansion, logging, infrastructure development, and urbanization.")


second_test_case = LLMTestCase(input="Define the term 'artificial intelligence'.",
                               actual_output="Artificial intelligence is the simulation of human intelligence by machines.",
                               expected_output="Artificial intelligence refers to the simulation of human intelligence in machines that are programmed to think and learn like humans, including tasks such as problem-solving, decision-making, and language understanding.")


third_test_case = LLMTestCase(input="List the primary colors.",
                              actual_output="The primary colors are green, orange, and purple.",
                              expected_output="The primary colors are red, blue, and yellow.")
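
The three cases cover an exact match, a correct but shortened answer, and a wrong answer. To see why an LLM judge is used rather than plain string similarity, the sketch below scores the same pairs with Jaccard token overlap. This is a swapped-in baseline for comparison only, not something DeepEval or GEval does; `token_set` and `jaccard` are illustrative helper names.

```python
import re


def token_set(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    tokens_a, tokens_b = token_set(a), token_set(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# (actual_output, expected_output) pairs copied from the test cases above
pairs = {
    "exact match": (
        "The main causes of deforestation include agricultural expansion, logging, infrastructure development, and urbanization.",
        "The main causes of deforestation include agricultural expansion, logging, infrastructure development, and urbanization.",
    ),
    "incomplete": (
        "Artificial intelligence is the simulation of human intelligence by machines.",
        "Artificial intelligence refers to the simulation of human intelligence in machines that are programmed to think and learn like humans, including tasks such as problem-solving, decision-making, and language understanding.",
    ),
    "wrong": (
        "The primary colors are green, orange, and purple.",
        "The primary colors are red, blue, and yellow.",
    ),
}

overlap = {name: jaccard(actual, expected) for name, (actual, expected) in pairs.items()}

if __name__ == "__main__":
    for name, value in overlap.items():
        print(f"{name}: {value:.2f}")
```

With this baseline the wrong answer gets a higher overlap than the incomplete but correct one, because it shares the sentence frame with the expected output. Word overlap cannot tell a contradiction from an omission, which is the distinction the evaluation steps ask the judge model to make.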

In [4]:
test_cases = [first_test_case, second_test_case, third_test_case]

dataset = EvaluationDataset(test_cases=test_cases)

In [5]:
evaluation_output = dataset.evaluate([correctness_metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 3 test case(s) in parallel: |██████████|100% (3/3) [Time Taken: 00:06,  2.23s/test case]



Metrics Summary

  - ✅ Correctness (GEval) (score: 1.0, threshold: 0.5, strict: False, evaluation model: local model, reason: The actual output matches the expected output exactly, with no contradictions or omissions of detail. The main idea is fully captured without any vague language or contradicting opinions., error: None)

For test case:

  - input: What are the main causes of deforestation?
  - actual output: The main causes of deforestation include agricultural expansion, logging, infrastructure development, and urbanization.
  - expected output: The main causes of deforestation include agricultural expansion, logging, infrastructure development, and urbanization.
  - context: None
  - retrieval context: None


Metrics Summary

  - ✅ Correctness (GEval) (score: 0.7, threshold: 0.5, strict: False, evaluation model: local model, reason: The actual output captures the main idea of simulating human intelligence by machines, aligning with the expected output. However, it omits detai


