# **Evaluate LLMs Using LLMs**

### Overview
In this demonstration, you will use one LLM to evaluate the output of another: a simple chat LLM. Here, GPT35-turbo-instruct scores the chat outputs on the following metrics:
1. **Fluency** - Measures how grammatically and linguistically correct the model's predicted answer is.
2. **Coherence** - Measures the quality of all sentences in a model's predicted answer and how they fit together naturally.
3. **Relevance** - Measures how relevant the model's predicted answers are to the questions asked.
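
As a rough illustration of how an LLM-based metric can work, the sketch below builds a grading prompt for one metric and parses a score from the model's reply. The prompt wording, the 1 to 5 scale, and the helper names `build_eval_prompt` and `parse_score` are assumptions made for illustration; they are not taken from the evaluation flow used in this notebook.

```python
import re

# Metric descriptions paraphrased from the list above.
METRICS = {
    "fluency": "how grammatically and linguistically correct the answer is",
    "coherence": "how well the sentences in the answer fit together naturally",
    "relevance": "how relevant the answer is to the question asked",
}


def build_eval_prompt(metric: str, question: str, answer: str) -> str:
    """Build a grading prompt for one metric (hypothetical wording)."""
    description = METRICS[metric]
    return (
        f"Rate the {metric} of the answer on a scale of 1 to 5.\n"
        f"{metric.capitalize()} measures {description}.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer."
    )


def parse_score(reply: str):
    """Return the first standalone digit from 1 to 5 in the reply, or None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

The prompt string would be sent to the evaluator model, and `parse_score` would be applied to whatever text comes back. Returning `None` for unparseable replies keeps malformed outputs from being silently counted as scores.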

After you use Azure PromptFlow to generate and evaluate chat responses, this notebook takes a closer look at the results.

 **_Go Deeper_**  
- [Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?](https://ar5iv.labs.arxiv.org/html/2309.07462)
- [G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment](https://ar5iv.labs.arxiv.org/html/2303.16634)
  
**_Prerequisites_**  
  
Ensure that your environment is set up by completing the steps outlined in [0_setup.ipynb](./0_setup.ipynb).

## 1. Upload Sample Input Data

In [1]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

import os

# authenticate
credential = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id=os.environ.get('SUBSCRIPTION_ID'),
    resource_group_name=os.environ.get('RESOURCE_GROUP_NAME'),
    workspace_name=os.environ.get('WORKSPACE_NAME'),
)

In [2]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
import time

local_path = "../data/inputs/simple_chat_sample_inputs.csv"
# set the version number of the data asset to the current UTC time
v1 = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())


my_data = Data(
    name="simple-chat-sample-inputs",
    version=v1,
    description="Sample inputs for simple chat flow",
    path=local_path,
    type=AssetTypes.URI_FILE,
)

# create data asset
ml_client.data.create_or_update(my_data)

print(f"Data asset created. Name: {my_data.name}, version: {my_data.version}")

Data asset created. Name: simple-chat-sample-inputs, version: 2024.01.08.004214
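
The version string above is simply the current UTC time formatted with `time.strftime`. The sketch below applies the same format string to fixed timestamps so the result is reproducible; the helper name `utc_version` is hypothetical.

```python
import time


def utc_version(epoch_seconds: float) -> str:
    """Format a Unix timestamp as a UTC version string such as 2024.01.08.004214."""
    return time.strftime("%Y.%m.%d.%H%M%S", time.gmtime(epoch_seconds))


# Every field is zero-padded and fixed-width, so a later timestamp
# yields a string that also compares as greater.
older = utc_version(0)
newer = utc_version(86_400)  # exactly one day later
```

Because each run gets a fresh timestamp, re-running the upload cell registers a new version of the data asset rather than reusing an existing version number.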


## 2. Run Simple Chat & GPT Evaluation PromptFlow Jobs
In this section, you will run a [simple chat prompt flow](../src/promptflow/sample_chat_flows/simple_chat) against a small [sample dataset](../data/inputs/simple_chat_sample_inputs.csv).

Then, as part of the same job, you will score the chat outputs on the GPT metrics above using an [evaluation PromptFlow](../src/promptflow/evaluation_flows/gpt_eval/).

Both the simple chat flow and the evaluation flow use the AOAI (Azure OpenAI) connection established during setup and the corresponding GPT4 deployment.

##### **IMPORTANT**: _Please take a moment to analyze in depth the Simple Chat, Evaluation Flow, and the sample dataset linked above_

In [15]:
from promptflow import PFClient

# PFClient can help manage your runs and connections.
pf = PFClient()

# Define Flows and Data
simple_chat_flow = "../src/promptflow/sample_chat_flows/simple_chat" # set the flow directory
eval_flow = "../src/promptflow/evaluation_flows/gpt_eval" # set flow directory
data = "../data/inputs/simple_chat_sample_inputs.csv" # set the data file

# Run chat flow to generate chat results
chat_run = pf.run(
    flow=simple_chat_flow,
    data=data,
    stream=False,
    column_mapping={  # map the input column of the data to the input of the flow
      "input": "${data.input}",
    }
)

# Run evaluation flow to evaluate chat results
eval_run = pf.run(
    flow=eval_flow,
    data=data,
    run=chat_run,
    stream=False,
    column_mapping={  # map the data's input column and the chat run's output to the evaluation flow inputs
      "question": "${data.input}",
      "response": "${run.outputs.output}",
    }
)


Helpful Documentation:  
[Run and Evaluate a PromptFlow](https://microsoft.github.io/promptflow/how-to-guides/run-and-evaluate-a-flow/index.html)  
[PFClient Documentation](https://microsoft.github.io/promptflow/reference/python-library-reference/promptflow.html)
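
The `column_mapping` entries in the cell above use `${data.<column>}` to refer to a column of the input file and `${run.outputs.<name>}` to refer to an output of the referenced run. The sketch below mimics that lookup for a single row in plain Python to make the idea concrete. It is an illustration only, not PromptFlow's implementation; `resolve_mapping` and the sample values are hypothetical.

```python
import re

# Matches references such as ${data.input} or ${run.outputs.output}.
_REF = re.compile(r"^\$\{(data|run\.outputs)\.(\w+)\}$")


def resolve_mapping(column_mapping, data_row, run_outputs):
    """Resolve each mapping entry against one data row and one run output row."""
    resolved = {}
    for flow_input, reference in column_mapping.items():
        match = _REF.match(reference)
        if match is None:
            # In this sketch, anything that is not a reference is kept as a literal.
            resolved[flow_input] = reference
            continue
        source, field = match.groups()
        row = data_row if source == "data" else run_outputs
        resolved[flow_input] = row[field]
    return resolved


inputs = resolve_mapping(
    {"question": "${data.input}", "response": "${run.outputs.output}"},
    data_row={"input": "Why is the sky blue?"},
    run_outputs={"output": "Because of Rayleigh scattering."},
)
```

Here `inputs` ends up holding the question from the data row and the response from the chat run, which is the pairing the evaluation flow receives for each line of the dataset.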

## 3.  View Results  
To view the outputs in detail, inspect the [output data](../data/outputs/gpt_eval_results.json) directly.

In [17]:
import pandas as pd

output_data = "../data/outputs/gpt_eval_results.json"

output_df = pd.read_json(output_data)
display(output_df)

|  | question | response | gpt_relevance | gpt_fluency | gpt_coherence |
|---|---|---|---|---|---|
| 0 | How does the Netherlands manage to keep so muc... | The Netherlands uses a system of dikes, canals... | 5 | 5 | 5 |
| 1 | What are the unique challenges of living in Mo... | Living in Mongolia presents unique challenges ... | 5 | 5 | 5 |
| 2 | Why is Bhutan known for measuring its success ... | Bhutan measures its success in Gross National ... | 5 | 5 | 5 |
| 3 | Can you explain the significance of the cherry... | Cherry blossoms, or "sakura", hold a significa... | 5 | 5 | 5 |
| 4 | How do people in the country of Iceland harnes... | In Iceland, geothermal energy is harnessed thr... | 5 | 5 | 5 |
| 5 | What role did the Silk Road play in the histor... | The Silk Road was crucial in China's history a... | 5 | 5 | 5 |
| 6 | Why is Switzerland often considered a neutral ... | Switzerland is often considered a neutral coun... | 5 | 5 | 5 |
| 7 | What is the impact of the Amazon Rainforest on... | The Amazon Rainforest, primarily located in Br... | 5 | 5 | 5 |
| 8 | How do people celebrate Diwali in India, and w... | Diwali, also known as the Festival of Lights, ... | 5 | 5 | 5 |
| 9 | Why is the Great Barrier Reef in Australia so ... | The Great Barrier Reef, located in Australia, ... | 5 | 5 | 5 |
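
To summarize the results, one option is to average each metric over all rows. Below is a minimal sketch using only the standard library, assuming the rows are available as a list of dictionaries (for example via `output_df.to_dict("records")`). The `mean_scores` helper is hypothetical, and the two sample rows simply mirror the scores shown above.

```python
from statistics import mean

METRIC_COLUMNS = ["gpt_relevance", "gpt_fluency", "gpt_coherence"]


def mean_scores(records, columns=METRIC_COLUMNS):
    """Return the average score for each metric column."""
    return {column: mean(row[column] for row in records) for column in columns}


# Stand-in rows with the same metric columns as the results table.
sample_records = [
    {"gpt_relevance": 5, "gpt_fluency": 5, "gpt_coherence": 5},
    {"gpt_relevance": 5, "gpt_fluency": 5, "gpt_coherence": 5},
]
summary = mean_scores(sample_records)
```

With real results, the per-metric averages give a quick view of which dimension the chat flow is weakest on before reading individual responses.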


## 4. Next Steps
For a comprehensive analysis of how GPT-based metrics compare with human judgment, and of the options for improving their performance, check out DEMO 2 (Development TBD).