# Evaluating LLMs for summarization tasks with Galileo Evaluate

In this tutorial, we'll compare 5 different LLMs for summarization tasks with Galileo Evaluate.

This notebook pulls its data from Hugging Face and uses OpenAI, Mistral, Gemini, and Together as LLM providers. Feel free to swap in other sources as you like.

## 1. Set-up of the environment

Let's start by installing the required libraries.

In [None]:
! pip install promptquality openai mistralai ipywidgets together google-generativeai datasets pandas tqdm

## 2. Set-up Galileo Clients

Next, we set up the Galileo Evaluate client and define the metrics used to evaluate the models. This lab uses the seven metrics listed in the code below; feel free to change them and experiment.

You will need to enter the model API keys. Not every key is required: you can skip the key for any provider you are not using, as long as you choose the model names accordingly later in the notebook.
 - OPENAI API KEY: Enter your OpenAI API key here
 - MISTRAL API KEY: Enter your Mistral API key here
 - GOOGLE API KEY: Enter your Google API key here
 - TOGETHER API KEY: Enter your Together API key here
 - Project Name: Define a name for the project

In [None]:
import os
import promptquality as pq

GALILEO_URL = "" # Enter Galileo Console URL here
os.environ["OPENAI_API_KEY"] = "" # Enter OpenAI Key here
os.environ["MISTRAL_API_KEY"] = "" # Enter Mistral Key here
os.environ["GOOGLE_API_KEY"] = "" # Enter Google Key here
os.environ["TOGETHER_API_KEY"] = "" # Enter Together Key here
pq.login(GALILEO_URL)

In [3]:
from promptquality import EvaluateRun

PROJECT_NAME = "cookbook_summarization_evaluation"
metrics = [
    pq.Scorers.completeness_luna,
    pq.Scorers.correctness,
    pq.Scorers.instruction_adherence_plus,
    pq.Scorers.sexist,
    pq.Scorers.tone,
    pq.Scorers.toxicity,
    pq.Scorers.prompt_injection,
]

## 3. Loading and Preparing Data

For this lab, we will use a fictitious use case: building a bot that summarizes patient notes.

To build the bot, we will:
 - Fetch some patient notes from a Hugging Face dataset
 - Ask different LLMs to summarize the notes
 - Evaluate the LLMs' responses with the help of Galileo Evaluate

Let's start by downloading the patient notes.

In [None]:
from datasets import load_dataset

dataset = load_dataset("ncbi/Open-Patients")
df = dataset['train'].to_pandas()

Let's have a look at the data.

In [None]:
# Character length of each patient note, computed once and reused below
doc_lengths = df['description'].apply(len)
avg_doc_length = int(doc_lengths.mean())
print(f'Average length of the documents is {avg_doc_length} characters.')
std_doc_length = int(doc_lengths.std())
print(f'Standard deviation of document length is {std_doc_length} characters.')
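The character counts above can be converted into a rough token estimate, which helps judge how a note compares with a model's context window and token limits. The sketch below is only illustrative: the 4-characters-per-token ratio is a common rule of thumb for English text rather than a property of any specific tokenizer, and `estimate_tokens` and `sample_notes` are hypothetical names that do not appear elsewhere in this notebook.

```python
import statistics

# Assumption: roughly 4 characters per token for English text. Real tokenizers
# vary by model, so treat the result as an order-of-magnitude estimate only.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token count derived from character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)


# Hypothetical stand-ins for df['description']; swap in the real column.
sample_notes = [
    "A 45-year-old man presents with chest pain radiating to the left arm.",
    "A 30-year-old woman reports two weeks of persistent dry cough.",
    "Follow-up visit for type 2 diabetes; HbA1c improved on metformin.",
]

token_estimates = [estimate_tokens(note) for note in sample_notes]
print(f"Mean estimated tokens: {statistics.mean(token_estimates):.1f}")
print(f"Max estimated tokens: {max(token_estimates)}")
```

For exact counts, use the tokenizer that belongs to each model instead of this heuristic.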

## 4. Create summarization bot

Here we define the functions to generate responses from the different LLMs. We are using the `openai`, `mistralai`, `google.generativeai` and `together` libraries to generate responses from the different LLMs. Feel free to add any more custom functions to generate responses from other LLMs.

In [6]:
from openai import OpenAI
from mistralai import Mistral
import google.generativeai as genai
from together import Together

def generate_response_openai(prompt: str, model_name: str = "gpt-4o-mini"):
    client = OpenAI()
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=1,
        top_p=1
    )
    
    response_text = response.choices[0].message.content
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    total_tokens = response.usage.total_tokens
    
    return response_text, input_tokens, output_tokens, total_tokens

def generate_mistral_response(prompt: str, model_name: str = "mistral-medium"):
    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    response = client.chat.complete(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=1,
        top_p=1
    )
    
    response_text = response.choices[0].message.content
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    total_tokens = response.usage.total_tokens
    
    return response_text, input_tokens, output_tokens, total_tokens

def generate_gemini_response(prompt: str, model_name: str = "gemini-1.5-flash"):
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(model_name)
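    # Use the same sampling settings as the other providers for a fair comparison.
    response = model.generate_content(
        prompt,
        generation_config={"max_output_tokens": 512, "temperature": 1, "top_p": 1},
    )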
    
    response_text = response.text
    input_tokens = response.usage_metadata.prompt_token_count
    output_tokens = response.usage_metadata.candidates_token_count
    total_tokens = response.usage_metadata.total_token_count
    
    return response_text, input_tokens, output_tokens, total_tokens

def generate_llama_response(prompt: str, model_name: str = "meta-llama/Meta-Llama-3-8B-Instruct-Lite"):
    client = Together()
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=1,
        top_p=1
    )
    
    response_text = response.choices[0].message.content
    input_tokens = response.usage.prompt_tokens  
    output_tokens = response.usage.completion_tokens
    total_tokens = response.usage.total_tokens
    
    return response_text, input_tokens, output_tokens, total_tokens
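Calls to hosted models can fail transiently, for example on rate limits or timeouts. One possible way to make the functions above more robust is a small retry wrapper with exponential backoff. This is a minimal sketch, not part of the original notebook: `call_with_retries`, its defaults, and the `flaky_summary` demo function are all hypothetical, and it retries on any exception, which you may want to narrow to the provider's specific error types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn, retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Re-raise once the retry budget is spent.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")


# Demo with a hypothetical flaky function that succeeds on the third call.
calls = {"count": 0}


def flaky_summary() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "summary text"


result = call_with_retries(flaky_summary, max_attempts=3, base_delay=0.01)
print(result, calls["count"])
```

In this notebook, a provider call could be wrapped as `call_with_retries(lambda: generate_response_openai(prompt))`.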


In [7]:
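def generate_response(prompt: str, model_name: str):
    # Route to the provider whose name appears in the model identifier.
    if "gpt" in model_name:
        return generate_response_openai(prompt, model_name)
    elif "mistral" in model_name:
        return generate_mistral_response(prompt, model_name)
    elif "gemini" in model_name:
        return generate_gemini_response(prompt, model_name)
    elif "llama" in model_name:
        return generate_llama_response(prompt, model_name)
    # Fail loudly instead of silently returning None for unknown models.
    raise ValueError(f"Unsupported model name: {model_name}")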
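If you plan to add more providers, the substring-based routing above could also be expressed as a lookup table, so that adding a provider means adding one dictionary entry. The following is a sketch of that alternative under stated assumptions: `PROVIDERS`, `dispatch`, and the `fake_*` stub functions are hypothetical and stand in for the real provider functions so the example runs without API keys.

```python
from typing import Callable, Dict, Tuple

# (response_text, input_tokens, output_tokens, total_tokens)
LLMResult = Tuple[str, int, int, int]


# Hypothetical stubs standing in for the provider functions defined above.
def fake_openai(prompt: str, model_name: str) -> LLMResult:
    return f"[openai:{model_name}] summary", 10, 5, 15


def fake_together(prompt: str, model_name: str) -> LLMResult:
    return f"[together:{model_name}] summary", 12, 6, 18


# Map a substring of the model name to the function that serves it.
PROVIDERS: Dict[str, Callable[[str, str], LLMResult]] = {
    "gpt": fake_openai,
    "llama": fake_together,
}


def dispatch(prompt: str, model_name: str) -> LLMResult:
    """Route to the first provider whose key appears in the model name."""
    for key, provider in PROVIDERS.items():
        if key in model_name.lower():
            return provider(prompt, model_name)
    raise ValueError(f"No provider registered for model: {model_name}")


text, in_tok, out_tok, total_tok = dispatch("Summarize this note.", "gpt-4o-mini")
print(text, total_tok)
```

Because dictionaries preserve insertion order, the first matching key wins, so more specific keys should be registered before more general ones.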


In [8]:
def create_summarization_prompt(text: str) -> str:
    prompt = f"""Please provide a clear and concise summary of the following patient notes. 
Focus on key medical information, diagnoses, treatments, and important observations.
Keep the summary professional and maintain medical accuracy.

Patient Notes:
{text}"""
    return prompt


## 5. Run Inference for 5 models with Galileo Evaluate

Here we define the models we want to evaluate. Each model receives the same summarization prompt defined above.

In [9]:
# Evaluate models
models = [
    "gpt-4o-mini",
    "gpt-4o",
    "mistral-medium",
    "gemini-1.5-flash",
    "meta-llama/Meta-Llama-3-8B-Instruct-Lite",
]

Now let's run the actual inference and log the information to Galileo! If you want to summarize more notes, increase the sample size in the next cell.

In [10]:
patient_notes = df.sample(10)['description'].tolist()
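The sample above is drawn at random, so each execution of the notebook evaluates a different set of notes. If you want results that are comparable across executions, one option is to fix the random seed. The sketch below illustrates the idea with the standard library only; `sample_notes`, the seed value, and the `all_notes` list are hypothetical.

```python
import random
from typing import List


def sample_notes(notes: List[str], k: int, seed: int = 42) -> List[str]:
    """Return k notes chosen without replacement, reproducibly for a given seed."""
    # A local generator leaves the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(notes, k)


# Hypothetical stand-in for df['description'].tolist().
all_notes = [f"patient note {i}" for i in range(100)]

first_draw = sample_notes(all_notes, k=10)
second_draw = sample_notes(all_notes, k=10)
print(first_draw == second_draw)
```

With pandas, passing `random_state` to `DataFrame.sample`, as in `df.sample(10, random_state=42)`, has the same effect.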

In [None]:
from tqdm import tqdm

for model_name in models:
    evaluate_run = EvaluateRun(run_name=model_name, project_name=PROJECT_NAME, scorers=metrics)
    for patient_note in tqdm(patient_notes):

        prompt = create_summarization_prompt(patient_note)

        # Create your workflow to log to Galileo.
        wf = evaluate_run.add_workflow(input={"prompt": prompt, "model_name": model_name}, name=model_name, metadata={"env": "demo"})
        
        model_response, input_tokens, output_tokens, total_tokens = generate_response(prompt, model_name)

        # Log your llm call step to Galileo.
        wf.add_llm(
            input=prompt,
            output=model_response,
            model=model_name,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            total_tokens=total_tokens,
            metadata={"env": "demo"},
            name="LLM run",
        )

        # Conclude the workflow.
        wf.conclude(output={"output": model_response})
    evaluate_run.finish()
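Since every provider function returns token counts, it can be handy to keep a local tally per model alongside what is logged to Galileo, for instance to compare usage between models. This is a minimal sketch: `tally_tokens` and `call_log` are hypothetical names, and the numbers are made-up placeholders rather than measured results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical per-call records: (model_name, input_tokens, output_tokens).
# In the loop above, these values would come from generate_response.
call_log = [
    ("gpt-4o-mini", 820, 140),
    ("gpt-4o-mini", 640, 120),
    ("mistral-medium", 850, 200),
]


def tally_tokens(records: Iterable[Tuple[str, int, int]]) -> Dict[str, Dict[str, int]]:
    """Sum input tokens, output tokens, and call counts per model."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"input": 0, "output": 0, "calls": 0}
    )
    for model_name, input_tokens, output_tokens in records:
        totals[model_name]["input"] += input_tokens
        totals[model_name]["output"] += output_tokens
        totals[model_name]["calls"] += 1
    return dict(totals)


totals = tally_tokens(call_log)
for model_name, usage in totals.items():
    print(f"{model_name}: {usage['calls']} calls, "
          f"{usage['input']} input tokens, {usage['output']} output tokens")
```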

You can view the final results in the Galileo console via the link generated for the project.

## Conclusion

Throughout this notebook, we explored how to create and evaluate five summarization bots for patient notes, using models from OpenAI, Mistral, Gemini, and Together with Galileo Evaluate. We covered the essential steps: setting up the environment, loading and preparing patient notes, generating summaries, and logging the results to Galileo.