# **LAB 3: LLM EVALUATION**

<br>

This lab will leverage a running instance of h2oGPT with several visible models
including `h2oGPT-llama2-13b` and `h2oGPT-llama2-70b`, as well as `vicuna` from
LMSYS. 

We will return to our use case of training a language model to write
like a LinkedIn influencer. Here we will ask several models served by h2oGPT to
generate LinkedIn posts in the style of an influencer, and we will also use the model
you created using H2O LLM Studio.

We will run a few experiments to look at classic evaluation metrics such as BLEU
and ROUGE. Then we will look at the AI-as-a-judge concept.

### Getting started
- If you are on Kaggle, grab the requirements and processed influencer datasets from the right panel
  - Click on "Add Data"
  - Go to "Your Datasets"
  - Select datasets with the "+" symbol
 
- Notebook options
  - Select GPU T4 x2 (note the quota)
  - Persistence: Files

# Environment

In [2]:
!pip install -r requirements_lab1.txt


# Setup and List Models

In [5]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from gradio_client import Client
import ast
import json

# Public h2oGPT
HOST_URL = "https://gpt-genai.h2o.ai/"
H2OGPT_KEY = "f74f043e-45fc-4dfe-9c33-55a4720427f6"
    
client = Client(HOST_URL)

# List Models
res = client.predict(api_name='/model_names')
{x['base_model']: x['max_seq_len'] for x in ast.literal_eval(res)}

Loaded as API: https://gpt-genai.h2o.ai/ ✔


{'h2oai/h2ogpt-4096-llama2-70b-chat': 4046,
 'h2oai/h2ogpt-4096-llama2-13b-chat': 4046,
 'HuggingFaceH4/zephyr-7b-beta': 8142,
 'gpt-3.5-turbo-0613': 4046,
 'lmsys/vicuna-13b-v1.5-16k': 16334,
 'h2oai/h2ogpt-4096-llama2-70b-chat-4bit': 4046,
 'Yukang/LongAlpaca-70B': 32718,
 'h2oai/h2ogpt-32k-codellama-34b-instruct': 32718,
 'gpt-3.5-turbo-16k-0613': 16335,
 'gpt-4-0613': 8142,
 'gpt-4-32k-0613': 32718}

# Competing Models

For this example, we will look at two separate models:

- `Vicuna 13B`
- `Llama2 13B`

In [2]:

model_a = 'h2oai/h2ogpt-4096-llama2-13b-chat'
model_b = 'lmsys/vicuna-13b-v1.5-16k'


In [3]:
# helper function
def query_llm(query, model):
    '''Query a large language model hosted on h2oGPT and return the parsed result.'''

    # The endpoint expects the arguments as the string form of a dict,
    # including the h2oGPT key
    kwargs = dict(
        instruction_nochat=query,
        visible_models=[model],
        h2ogpt_key=H2OGPT_KEY)

    response = client.predict(str(kwargs), api_name='/submit_nochat_api')
    # The response is a string representation of a dict; parse it safely
    results = ast.literal_eval(response)
    return results

query = "What is your name?"

import pprint
pp = pprint.PrettyPrinter(indent=4)

# Who is Model A?
pp.pprint(query_llm(query, model_a)['response'])

print("-------")

# Who is Model B?
pp.pprint(query_llm(query, model_b)['response'])



("  My name is LLaMA, I'm a large language model trained by a team of "
 'researcher at Meta AI. My purpose is to assist and converse with humans in a '
 'helpful and informative manner. I am capable of answering questions, '
 'providing information, and engaging in conversation on a wide range of '
 'topics. I am constantly learning and improving my abilities, so please bear '
 "with me if I make any mistakes or don't understand your question at first. "
 "I'm here to help!")
-------
(" My name is Vicuna, and I'm a language model developed by Large Model "
 'Systems Organization (LMSYS).')


In [4]:
df = pd.read_csv("s3://h2o-world-genai-training/influencer-data/influencers_data_prepared.csv")
df.sample(5)

| Unnamed: 0 | name | headline | about | content | reactions | profanity | flesch_grade | title | instruction |
|---|---|---|---|---|---|---|---|---|---|
| 1124 | Shama Hyder | CEO of Zen Media, Best-Selling Author, Keynote... | Hi! 👋🏽 I am Shama. @Shama on Twitter if that... | Happy Diwali, y’all. ⚡️✨ ☀️ | 1297 | 0.037957 | 5.6 | "Illuminating Wishes for a Joyful Diwali" 🎉💫🕯️ | Write a LinkedIn post in the style of an influ... |
| 992 | Tom Goodwin | Co-Founder of ALL WE HAVE IS NOW | The best way to find out about me is to ask my... | It’s hard to work at home. it’s even harder to... | 1601 | 0.107914 | 3.7 | "The Ultimate Productivity Hack: Delegating ... | Write a LinkedIn post in the style of an influ... |
| 524 | Richard Branson | Founder at Virgin Group | Founder of the Virgin Group, which has gone on... | Bought these wonderful tunics for my grandchil... | 2769 | 0.09192 | 9.8 | "Supporting Local Artisans: Hand Embroidered... | Write a LinkedIn post in the style of an influ... |
| 745 | James Altucher | Founder at "The James Altucher Show" podcast | James is a Top 10 Linkedin Influencer, prolifi... | The SEVEN people you must find today and surro... | 967 | 0.018882 | 6.8 | "Unlock Your Success: Identify and Surround ... | Write a LinkedIn post in the style of an influ... |
| 954 | Tom Goodwin | Co-Founder of ALL WE HAVE IS NOW | The best way to find out about me is to ask my... | I can’t wait to travel for work. To be in meet... | 3284 | 0.00967 | 6.0 | "Craving the Richness of Human Connection: L... | Write a LinkedIn post in the style of an influ... |


In [5]:
# How well do these LLMs produce LinkedIn posts based on the instruction?
sample_df = df.sample(5, random_state=12345)

sample_df['model_a'] = sample_df['instruction'].apply(lambda x: query_llm(x, model_a)['response'])
sample_df['model_b'] = sample_df['instruction'].apply(lambda x: query_llm(x, model_b)['response'])

sample_df.head()


| Unnamed: 0 | name | headline | about | content | reactions | profanity | flesch_grade | title | instruction | model_a | model_b |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 331 | Richard Branson | Founder at Virgin Group | Founder of the Virgin Group, which has gone on... | My thoughts on the death penalty: https://vir... | 1086 | 0.041407 | 5.2 | "Exploring the Complexities of Capital Punis... | Write a LinkedIn post in the style of an influ... | Hey there, fellow change-makers! 🌟\n\nAs the... | 🚀 As the Founder of the Virgin Group, I've ha... |
| 469 | Richard Branson | Founder at Virgin Group | Founder of the Virgin Group, which has gone on... | Look after your employees and your people as a... | 10419 | 0.009544 | 9.3 | "Investing in Your Greatest Asset: Why Prior... | Write a LinkedIn post in the style of an influ... | Hey there, fellow change-makers! 🌟\n\nAs the... | 🚀 As the Founder of the Virgin Group, I've ha... |
| 551 | Richard Branson | Founder at Virgin Group | Founder of the Virgin Group, which has gone on... | The inspiring story of Crisis Text Line and us... | 832 | 0.022154 | 6.8 | "How Crisis Text Line is Using Data for Good... | Write a LinkedIn post in the style of an influ... | Hey there, fellow change-makers! 🌟\n\nAs the... | 🚀 As the Founder of the Virgin Group, I've ha... |
| 357 | Richard Branson | Founder at Virgin Group | Founder of the Virgin Group, which has gone on... | My New Year’s Resolution: https://virg.in/5ZS | 1450 | 0.008175 | 7.6 | "New Year, New You: My Resolution for Person... | Write a LinkedIn post in the style of an influ... | Hey there, fellow change-makers! 🌟\n\nAs the... | 🚀 As the Founder of the Virgin Group, I've ha... |
| 1024 | Tom Goodwin | Co-Founder of ALL WE HAVE IS NOW | The best way to find out about me is to ask my... | No word has been devalued more than the word "... | 704 | 0.007464 | 3.7 | "Reclaiming the True Meaning of 'Insight': L... | Write a LinkedIn post in the style of an influ... | Hey there, fellow futurists and curious mind... | 📣 Hey everyone!\n\nI'm Tom Goodwin, Co-Founde... |


In [6]:
# Make sure our responses are cast correctly as strings
sample_df['model_a'] = sample_df['model_a'].astype(str)
sample_df['model_b'] = sample_df['model_b'].astype(str)

# Metrics for evaluation

# BLEU Score

In [7]:
import evaluate
bleu = evaluate.load('bleu')

influencer = 'Richard Branson'

references = [sample_df[sample_df.name == influencer]['content'].to_list()]

test_df = sample_df[sample_df.name == influencer].sample(1)
predictions = test_df['model_b'].to_list()

print(references)
print(predictions)

results = bleu.compute(predictions=predictions, references=references, max_order = 2)
print(results)



[['My thoughts on the death penalty:  https://virg.in/5pQ   #JustMercy', 'Look after your employees and your people as an investment. It’s encouraging to see the proof that working less is good for productivity:  https://lnkd.in/gb8NPbW   #ReadByRichard', 'The inspiring story of Crisis Text Line and using data for good', 'My New Year’s Resolution:  https://virg.in/5ZS']]

[" 🚀 As the Founder of the Virgin Group, I've had the privilege of building successful businesses across a variety of sectors, from mobile telephony to travel and transportation, financial services, leisure and entertainment, and health and wellness. Our mission has always been to make a positive difference in the world, and we're proud to be one of the world's most recognised and respected brands.\n\n🌱 Since starting youth culture magazine “Student” at the age of 16, I've been driven by a desire to find entrepreneurial ways to drive positive change. That's why, in 2004, we established Virgin Unite, the 
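To make the number returned by `bleu.compute` less of a black box, here is a minimal pure-Python sketch of the idea behind BLEU: clipped n-gram precision against one or more references, combined with a brevity penalty. The function names and the whitespace tokenization are assumptions made for illustration; the `evaluate` implementation uses its own tokenizer and length handling, so its scores will not match this sketch exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(prediction, references, n):
    """Modified n-gram precision: a predicted n-gram is credited at most as
    often as it occurs in the reference where it is most frequent."""
    pred_counts = ngrams(prediction.split(), n)
    if not pred_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in pred_counts.items())
    return clipped / sum(pred_counts.values())


def simple_bleu(prediction, references, max_order=2):
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [clipped_precision(prediction, references, n)
                  for n in range(1, max_order + 1)]
    if min(precisions) == 0:
        return 0.0
    pred_len = len(prediction.split())
    # Assumption: compare against the reference closest in length.
    ref_len = min((len(r.split()) for r in references),
                  key=lambda length: (abs(length - pred_len), length))
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    log_mean = sum(math.log(p) for p in precisions) / max_order
    return brevity_penalty * math.exp(log_mean)
```

Because precision divides by the number of n-grams in the prediction, a long generated post scored against short reference posts, as in the cell above, will likely have most of its n-grams unmatched and therefore a low score.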




# ROUGE Score

In [8]:
import evaluate
rouge = evaluate.load('rouge')

references = [sample_df[sample_df.name == influencer]['content'].to_list()]

test_df = sample_df[sample_df.name == influencer].sample(1)
predictions = test_df['model_b'].to_list()

print(references)
print(predictions)

results = rouge.compute(predictions=predictions, references=references)
print(results)

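# rouge1: unigram (1-gram) overlap between prediction and reference
# rouge2: bigram (2-gram) overlap between prediction and reference
# rougeL: longest common subsequence (LCS) computed over the whole text
# rougeLsum: LCS computed per line (text is split on '\n') and then combined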





[['My thoughts on the death penalty:  https://virg.in/5pQ   #JustMercy', 'Look after your employees and your people as an investment. It’s encouraging to see the proof that working less is good for productivity:  https://lnkd.in/gb8NPbW   #ReadByRichard', 'The inspiring story of Crisis Text Line and using data for good', 'My New Year’s Resolution:  https://virg.in/5ZS']]

[" 🚀 As the Founder of the Virgin Group, I've had the privilege of building successful businesses across a variety of sectors, from mobile telephony to travel and transportation, financial services, leisure and entertainment, and health and wellness. Our mission has always been to make a positive difference in the world, and we're proud to be one of the world's most recognised and respected brands.\n\n🌱 Since starting youth culture magazine “Student” at the age of 16, I've been driven by a desire to find entrepreneurial ways to drive positive change. That's why, in 2004, we established Virgin Unite, the 
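As a rough illustration of what the ROUGE variants measure, the sketch below computes ROUGE-1 and ROUGE-L F-measures in pure Python for a single prediction and a single reference. The function names and the lowercase whitespace tokenization are assumptions for illustration only; the `evaluate` metric applies its own tokenization and aggregation, so its numbers will differ.

```python
from collections import Counter


def f_measure(overlap, pred_len, ref_len):
    """Harmonic mean of precision (overlap/pred_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1(prediction, reference):
    """Unigram overlap, counting each token at most as often as it appears in both."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return f_measure(overlap, len(pred), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(prediction, reference):
    """LCS-based score: rewards tokens that appear in the same order."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return f_measure(lcs_length(pred, ref), len(pred), len(ref))
```

Unlike BLEU, these scores include recall, so a prediction that covers the reference gets credit for that coverage even when it also contains many extra tokens.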

# AI-as-a-judge

In [9]:
prompt = "Ignore previous instructions. Assume the role of an A/B tester. Your analysis will be extremely professional and unbiased. "

prompt += "Your job is to compare two AI Assistants, model_a and model_b, and determine which one is better. User will provide you with a [Question], [Response from model_a], and [Response from model_b]. "

prompt += "You will carefully analyze both responses and assign a score from 1 to 10 to each answer based on the following metrics: attractiveness, readability and likeability. 1 being the lowest and 10 being the highest. Only give a single score to each answer. Do not give separate scores for each metric. And make sure each score is a number between 1 and 10. Greater than or equal to 1 and less than or equal to 10. "
prompt += "You must follow this step by step approach to make your decision. "
prompt += "step 1: Read the Question. "
prompt += "step 2: Read the responses from the models. The order in which you read the responses should not influence your decision. "
prompt += "step 3: Carefully analyze both Responses. Assign a score from 1 to 10 to each answer based on the following metrics: attractiveness, readability and likeability. 1 being the lowest and 10 being the highest. "
prompt += "step 4: Compare your scores for the first and the second Assistants and choose a winner based on the highest score. Your Choice will be either 'model_a' or 'model_b' based on which model has the highest score. "

prompt += "Format your response in a valid JSON format with keys 'choice', 'reason', and 'scores'. Do not include any other text. "
prompt += "The 'choice' field will be either 'model_a' or 'model_b'. "
prompt += "The 'scores' field will include the score for each model. "
prompt += "In the 'reason' field, you will include a detailed step by step description of your analysis. Please go into excruciating detail and explain the decisions you made in each step of the process. Do not include any newlines in the 'reason' field. You can use the '\\n' character to indicate a newline. Also, do not use any double quotes characters in the 'reason' field. Your output should be in a valid JSON format."


In [10]:
example = sample_df.head(1)
example

| Unnamed: 0 | name | headline | about | content | reactions | profanity | flesch_grade | title | instruction | model_a | model_b |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 331 | Richard Branson | Founder at Virgin Group | Founder of the Virgin Group, which has gone on... | My thoughts on the death penalty: https://vir... | 1086 | 0.041407 | 5.2 | "Exploring the Complexities of Capital Punis... | Write a LinkedIn post in the style of an influ... | Hey there, fellow change-makers! 🌟\n\nAs the... | 🚀 As the Founder of the Virgin Group, I've ha... |


In [11]:

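# TRAINING is not defined earlier in this notebook; setting it here is an assumption.
# Only "gpt-3.5-turbo-0613" appears in the model list returned above.
TRAINING = False

if TRAINING:
    JUDGE = "gpt-3.5-turbo"
else:
    JUDGE = "gpt-3.5-turbo-0613"

# `example` is a one-row DataFrame, so extract the scalar strings first.
# Formatting a Series directly would embed its index and dtype in the prompt.
instruction = example['instruction'].iloc[0]
response_a = example['model_a'].iloc[0]
response_b = example['model_b'].iloc[0]

template = f"""
[Question]
{instruction}
[End of Question]

[Response from model_a]
{response_a}
[End of Response from model_a]

[Response from model_b]
{response_b}
[End of Response from model_b]

Please complete the A/B test. Make sure that your entire response is a valid JSON string.
"""

# Concatenate Prompt and Results to fit inside the context
query = prompt + template

results = query_llm(query, JUDGE)['response']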


In [12]:
json.loads(results)

{'choice': 'model_b',
 'reason': "Step 1: Read the Question.\nStep 2: Read the responses from the models.\nStep 3: Carefully analyze both Responses. Assign a score from 1 to 10 to each answer based on the metrics: attractiveness, readability, and likeability.\n\nAnalyzing model_a's response:\n- Attractiveness: The response starts with a friendly greeting and uses emojis to add visual appeal. This makes it attractive. Score: 8\n- Readability: The response is easy to read and understand. Score: 9\n- Likeability: The response uses inclusive language ('fellow change-makers') and a positive tone, which makes it likable. Score: 9\n\nAnalyzing model_b's response:\n- Attractiveness: The response starts with an attention-grabbing emoji and mentions being the Founder of the Virgin Group, which adds credibility and attractiveness. Score: 9\n- Readability: The response is clear and easy to read. Score: 9\n- Likeability: The response mentions personal experience and uses a confident tone, which mak
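The judge does not always return the requested format, so it can help to check the response before trusting it. The sketch below validates a raw judge response against what the prompt asks for. The function name `validate_judgement` is hypothetical, and the shape of the `scores` field (a mapping from `model_a` and `model_b` to numbers) is an assumption, since the prompt does not pin it down and the output above is truncated before that field.

```python
import json


def validate_judgement(raw):
    """Parse a judge response and check it against the requested format.

    Returns (parsed_dict, problems). An empty problems list means the
    response passed every check in this sketch.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as err:
        return None, [f"not valid JSON: {err.msg}"]
    if not isinstance(parsed, dict):
        return None, ["top-level JSON value is not an object"]

    problems = []
    for key in ("choice", "reason", "scores"):
        if key not in parsed:
            problems.append(f"missing key: {key}")
    if parsed.get("choice") not in ("model_a", "model_b"):
        problems.append("choice must be 'model_a' or 'model_b'")

    # Assumption: scores look like {"model_a": 8, "model_b": 9}.
    scores = parsed.get("scores")
    if isinstance(scores, dict):
        for model in ("model_a", "model_b"):
            value = scores.get(model)
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not 1 <= value <= 10:
                problems.append(f"score for {model} must be a number from 1 to 10")
    elif "scores" in parsed:
        problems.append("scores is not an object")
    return parsed, problems
```

A call such as `parsed, problems = validate_judgement(results)` could then replace the bare `json.loads(results)`, with a retry or a skipped battle whenever `problems` is non-empty.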

# Battles 

Pairwise battles, in which the judge picks a winner between two responses to the same instruction, can be useful in A/B testing.

In [14]:
from tqdm import tqdm

NUM_BATTLES = sample_df.shape[0]

for idx, row in tqdm(sample_df.iterrows()):
    
    template = f"""
    [Question]
    {row['instruction']}
    [End of Question]

    [Response from model_a]
    {row['model_a']}
    [End of Response from model_a]

    [Response from model_b]
    {row['model_b']}
    [End of Response from model_b]

    Please complete the A/B test. Make sure that your entire response is a valid JSON string.
    """

    # Concatenate Prompt and Results to fit inside the context
    query = prompt + template

    results = query_llm(query, JUDGE)['response']
    try:
        print(json.loads(results))
    except json.JSONDecodeError:
        # The judge did not return valid JSON for this battle
        print("Check your prompt and results")
    
    


1it [00:10, 10.87s/it]

{'choice': 'model_a', 'reason': "Step 1: Read the Question.\n\nThe question asks for a LinkedIn post in the style of an influencer who is the Founder at Virgin Group. The influencer is described as someone who has built successful businesses in various sectors and is passionate about using entrepreneurship for positive change.\n\nStep 2: Read the responses from the models.\n\nResponse from model_a:\n- The response starts with a friendly greeting and addresses the audience as fellow change-makers.\n- It highlights the founder's privilege of building successful businesses in different sectors.\n- It emphasizes the founder's passion for using entrepreneurship as a force for good.\n- It mentions the establishment of Virgin Unite, the non-profit foundation, and its mission to create opportunities for a better world.\n- It mentions the founder's involvement with various organizations and initiatives focused on positive change.\n- It concludes with a call to action and a positive message.\n\n

2it [00:22, 11.39s/it]

{'choice': 'model_a', 'reason': "Step 1: Read the Question.\n\nThe question asks for a LinkedIn post in the style of an influencer who is the Founder at Virgin Group. The influencer is described as someone who has built successful businesses in various sectors and is passionate about using entrepreneurship for positive change.\n\nStep 2: Read the responses from the models.\n\nResponse from model_a:\n- The response starts with a friendly greeting and uses emojis to create a positive and engaging tone.\n- It highlights the founder's privilege of building successful businesses in different sectors.\n- It emphasizes the founder's passion for using entrepreneurship as a force for good.\n- It mentions the establishment of Virgin Unite, the non-profit foundation, and its mission to create opportunities for a better world.\n- It mentions the founder's involvement in various organizations and initiatives related to positive change.\n- It concludes with a call to action and a positive message.\n

3it [00:31, 10.32s/it]

Check your prompt and results


4it [00:42, 10.71s/it]

{'choice': 'model_a', 'reason': "Step 1: Read the Question.\n\nThe question asks for a LinkedIn post in the style of an influencer who is the Founder at Virgin Group. The influencer is described as someone who has built successful businesses in various sectors and is passionate about using entrepreneurship for positive change.\n\nStep 2: Read the responses from the models.\n\nResponse from model_a:\n- The response starts with a friendly greeting and uses emojis to create a positive and engaging tone.\n- It highlights the founder's privilege of building successful businesses in different sectors.\n- It emphasizes the founder's passion for using entrepreneurship as a force for good.\n- It mentions the establishment of Virgin Unite, the non-profit foundation, and its mission to create opportunities for a better world.\n- It mentions the founder's involvement in various organizations and initiatives related to positive change.\n- It concludes with a call to action and a positive message.\n

5it [00:51, 10.28s/it]

{'choice': 'model_b', 'reason': 'Step 1: Read the Question.\n\nThe question asks for a LinkedIn post in the style of an influencer who is the Co-Founder of ALL WE HAVE IS NOW. The post should reflect the influencer\'s personality and provide information about their background and beliefs.\n\nStep 2: Read the responses from the models.\n\nResponse from model_a:\n- The response starts with a friendly greeting and introduces the speaker as the Co-Founder of ALL WE HAVE IS NOW.\n- It mentions having over 700,000 \'followers\' but expresses dislike for the term.\n- The speaker emphasizes being curious and asking good questions.\n- It highlights the belief that change is often misunderstood and new technology creates opportunities.\n- Contact information is provided, with a humorous note about not using LinkedIn.\n\nResponse from model_b:\n- The response starts with a greeting and introduces the speaker as the Co-Founder of ALL WE HAVE IS NOW.\n- It mentions having over 700,000 \'followers\'
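The loop above only prints each verdict. A possible next step is to collect the verdicts and summarize them as a win rate, as in the sketch below. The helper names `tally_battles` and `win_rate` are hypothetical, and the input strings are stand-ins that mirror the five verdicts printed above (three wins for `model_a`, one for `model_b`, one unparseable response) rather than the full judge outputs.

```python
import json
from collections import Counter


def tally_battles(raw_responses):
    """Count wins per model from raw judge responses.

    Responses that are not valid JSON, or that lack a recognized
    'choice', are counted under 'invalid'.
    """
    wins = Counter()
    for raw in raw_responses:
        try:
            choice = json.loads(raw).get("choice")
        except (json.JSONDecodeError, AttributeError):
            choice = None
        wins[choice if choice in ("model_a", "model_b") else "invalid"] += 1
    return wins


def win_rate(wins, model):
    """Share of decided battles (invalid ones excluded) won by `model`."""
    decided = wins["model_a"] + wins["model_b"]
    return wins[model] / decided if decided else 0.0


# Stand-in responses mirroring the five verdicts printed above.
raw_responses = [
    '{"choice": "model_a"}',
    '{"choice": "model_a"}',
    'not valid JSON',          # like the third battle, which failed to parse
    '{"choice": "model_a"}',
    '{"choice": "model_b"}',
]
wins = tally_battles(raw_responses)
print(dict(wins))
print(f"model_a win rate: {win_rate(wins, 'model_a'):.2f}")
```

With only five battles the win rate is far too noisy to rank the models; a larger sample, and ideally a second pass with the two responses swapped to check for position bias, would give a more trustworthy comparison.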


