This notebook evaluates an LLM on a medical dataset and uses Opik to track the process.
We use `llama3.2:3b`, and the dataset is available here: https://huggingface.co/datasets/FreedomIntelligence/medical-o1-verifiable-problem

### Steps to get started on Opik

1. We'll start by creating an account on [comet.com](https://www.comet.com/site/?ref=dailydoseofds.com).

2. Once you create an account, you are given two options to choose from. Select LLM evaluation (Opik).

3. Once done, you will find yourself in your dashboard, where you can also find your API key on the right.

4. Next, in your current working directory, create a `.env` file. Copy the API key shown in your dashboard and paste it as follows: `COMET_API_KEY="your-api-key-here"`.

5. To configure Opik, run the following code in a new Python file. 

In [None]:
# import opik
# opik.configure(use_local=False)

Executing the code above opens a prompt for the API key obtained earlier. Enter the key there to finish configuring Opik. Alternatively, set the API key and workspace as environment variables in your Python code.

In [None]:
# import os
#
# # Set up Opik environment variables
# os.environ["OPIK_API_KEY"] = "YOUR_API_KEY"  # Replace with your API key
# os.environ["OPIK_WORKSPACE"] = "YOUR_WORKSPACE_HERE"  # Replace with your workspace
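
Step 4 above stores the key in a `.env` file, but nothing in this notebook reads that file. The sketch below is one way to load it using only the standard library. `load_env_file` is a hypothetical helper written for illustration, and copying `COMET_API_KEY` into `OPIK_API_KEY` assumes that both names refer to the same key, which you should confirm for your account.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY="value" lines from a .env file into os.environ.

    Hypothetical helper for illustration: it only understands plain
    KEY=value lines, blank lines, and lines starting with '#'.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)  # keep values that are already set
        loaded[key] = value
    return loaded


env_values = load_env_file()

# Assumption: the Comet key from step 4 is the key Opik should use.
if "COMET_API_KEY" in os.environ:
    os.environ.setdefault("OPIK_API_KEY", os.environ["COMET_API_KEY"])
```

If you already use a dedicated `.env` loader, it can replace this helper; the point is only that the variables must be set before Opik is configured.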

### A step-by-step guide on using Ollama

1. Go to [Ollama.com](https://ollama.com/?ref=dailydoseofds.com) and select your operating system. I'm using macOS, so I download the installer directly.

2. After installing, run the command `ollama serve`.

3. Choose the model you're looking for and, in another terminal, run `ollama run [YOUR_MODEL_HERE]`. This will download the model locally. I'm using `llama3.2:3b`.

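4. Finally, install the open-source Opik framework, LlamaIndex, and LlamaIndex's Ollama integration module, together with the libraries this notebook imports for data loading, HTTP requests, progress bars, and metrics:
```bash
pip install opik
pip install llama-index
pip install llama-index-llms-ollama
# Imported by the cells below; rouge_score is required by evaluate's ROUGE metric
pip install datasets evaluate requests tqdm rouge_score
```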
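
Before running the rest of the notebook, it can help to confirm that the modules it imports are actually installed. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the list of names is taken from the import statements in the cells below.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Top-level modules imported later in this notebook.
required = ["opik", "datasets", "requests", "tqdm", "evaluate"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```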

In [1]:
import json  # used to print the final metrics in main()
import time

import requests
from datasets import load_dataset
from tqdm import tqdm



Construct the function that gets the model's response.

The decorator `@track` is all you need to track the LLM response.
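
To build intuition for what a tracking decorator does, here is a toy stand-in that records each call's inputs, output, and duration in a list. This is not Opik's implementation: `toy_track`, `TRACE_LOG`, and `echo_answer` are hypothetical names, and the real `@track` sends its traces to your Opik project rather than to an in-memory list.

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory stand-in for a tracing backend


def toy_track(project_name):
    """Toy decorator that records inputs, output, and duration of each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            TRACE_LOG.append({
                "project": project_name,
                "function": func.__name__,
                "input": {"args": args, "kwargs": kwargs},
                "output": result,
                "seconds": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@toy_track(project_name="medical_dataset")
def echo_answer(question):
    """Placeholder for a model call."""
    return f"answer to: {question}"


echo_answer(question="What is aspirin?")
print(TRACE_LOG[0]["function"], TRACE_LOG[0]["input"]["kwargs"])
```

The decorated function is called exactly as before; the recording happens as a side effect, which is why adding the decorator requires no other changes to the code.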

In [2]:
from opik import track

@track(project_name="medical_dataset")
def get_llama_response(question, max_retries=3):
    """Get response from locally running Llama model via Ollama"""
    prompt = f"You are a medical knowledge assistant trained to provide information and guidance on various health-related topics.\nGive the direct answer without any explanation. Question: {question}\nAnswer:"
    
    for attempt in range(max_retries):
        try:
            response = requests.post('http://localhost:11434/api/generate',
                                   json={
                                       'model': 'llama3.2:3b',
                                       'prompt': prompt,
                                       'stream': False,
                                       'options': {
                                           'mps': True  # Intended for Apple Silicon GPU; may be ignored if Ollama does not recognize this option
                                       }
                                   },
                                   timeout=30)
            response.raise_for_status()
            return response.json()['response'].strip()
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                # Chain the original error so the root cause stays in the traceback
                raise Exception(f"Failed to get response after {max_retries} attempts: {str(e)}") from e
            time.sleep(2 * (attempt + 1))
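
The retry logic in the function above waits longer after each failed attempt (2 seconds, then 4 seconds). The same pattern can be sketched in isolation, without a running server. `call_with_retries` and `flaky` are hypothetical names, and the `sleep` parameter is an assumption added so the waits can be observed instead of actually slept.

```python
import time


def call_with_retries(func, max_retries=3, base_delay=2, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * attempt number, then retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            if attempt == max_retries - 1:
                raise RuntimeError(
                    f"Failed after {max_retries} attempts: {exc}"
                ) from exc
            sleep(base_delay * (attempt + 1))  # 2, then 4, ...


# Usage with a fake function that fails twice and then succeeds.
calls = {"count": 0}
waits = []


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("server not ready")
    return "ok"


result = call_with_retries(flaky, sleep=waits.append)
print(result, waits)  # prints: ok [2, 4]
```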

We calculate the metrics using BLEU and ROUGE.

`evaluate` is Hugging Face's official library for:

1. Loading and computing NLP metrics with consistent APIs

2. Using standard implementations (same as in papers and benchmarks)

3. Supporting BLEU, ROUGE, METEOR, BERTScore, Exact Match, and more

In [None]:

import evaluate

@track(project_name="medical_dataset")
def calculate_metrics(predictions, references):
    """Calculate BLEU and ROUGE scores for a batch"""
    bleu_eval = evaluate.load("bleu")
    bleu_results = bleu_eval.compute(predictions=predictions, references=references)
    
    rouge_eval = evaluate.load("rouge")
    rouge_results = rouge_eval.compute(predictions=predictions, references=[r[0] for r in references])
    
    return {**bleu_results, **rouge_results}
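
Both metrics are built on word overlap between the prediction and the reference. The sketch below computes unigram precision, recall, and F1 for a single pair to make that concrete. It is a simplification and not what `evaluate` computes: it assumes lowercased whitespace tokenization, while BLEU also combines higher-order n-grams with a brevity penalty and ROUGE has its own tokenization. `unigram_overlap` is a hypothetical helper.

```python
from collections import Counter


def unigram_overlap(prediction, reference):
    """Return clipped unigram precision, recall, and F1 for one text pair."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    # Counter & Counter keeps the minimum count per token (clipping).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = unigram_overlap("acute myocardial infarction", "myocardial infarction")
print(scores)
```

Here two of the three predicted words appear in the reference (precision 2/3) and both reference words are covered (recall 1), which illustrates why a verbose but correct answer can still score below 1.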

In [None]:
@track(project_name="medical_dataset")
def main():
    # Load medical dataset
    try:
        medical_dataset = load_dataset("FreedomIntelligence/medical-o1-verifiable-problem")
    except Exception as e:
        print(f"Failed to load dataset: {str(e)}")
        return

    total_entries = len(medical_dataset['train'])
    batch_size = 200
    
    # Initialize lists for storing responses
    llama_responses = []
    answer_list = []
    error_count = 0
    
    print(f"\nProcessing {total_entries} entries in batches of {batch_size}:")
    print("=" * 80)
    
    # Process all examples with progress bar
    for i, example in enumerate(tqdm(medical_dataset['train'], total=total_entries)):
        if error_count >= 10:
            print("\nToo many errors encountered. Stopping processing.")
            break
            
        try:
            question = example['Open-ended Verifiable Question']
            ground_truth = example['Ground-True Answer']
            
            prediction = get_llama_response(question=question)
            llama_responses.append(prediction)
            answer_list.append([ground_truth])
            
            # Print first 5 Q&As
            if i < 5:
                print(f"\nQ{i+1}: {question}")
                print(f"A{i+1}: {prediction}")
                print(f"Ground Truth: {ground_truth}\n")
            
        except Exception as e:
            error_count += 1
            print(f"\nError processing entry {i}: {str(e)}")
            continue
        
        # Calculate metrics every batch_size entries
        if (i + 1) % batch_size == 0 and llama_responses:
            metrics = calculate_metrics(llama_responses, answer_list)
            
            print(f"\nMetrics after {i + 1} entries:")
            print("BLEU Score:", metrics['bleu'])
            print("ROUGE Scores:")
            print(f"  ROUGE-1: {metrics['rouge1']:.4f}")
            print(f"  ROUGE-2: {metrics['rouge2']:.4f}")
            print(f"  ROUGE-L: {metrics['rougeL']:.4f}")
            print(f"Errors encountered: {error_count}")
            print("-" * 80)
    
    # Calculate final metrics
    if llama_responses:
        print("\nFinal Metrics:")
        final_metrics = calculate_metrics(llama_responses, answer_list)
        final_metrics['total_errors'] = error_count
        print(json.dumps(final_metrics, indent=2))
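
The control flow of `main()` (error budget, skipping failed entries, periodic checkpoints) can be checked without Ollama or the dataset by swapping in a stub model. This is a sketch under that assumption: `run_eval_loop` and `stub_predict` are hypothetical names, and the checkpoint only records counts instead of computing metrics.

```python
def run_eval_loop(examples, predict, batch_size=2, max_errors=10):
    """Mirror the evaluation loop's control flow with an injectable predict()."""
    predictions, references, checkpoints = [], [], []
    error_count = 0
    for i, (question, answer) in enumerate(examples):
        if error_count >= max_errors:
            break  # error budget exhausted
        try:
            predictions.append(predict(question))
            references.append([answer])
        except Exception:
            error_count += 1
            continue  # a failed entry also skips the checkpoint below
        if (i + 1) % batch_size == 0:
            # Record (entries seen, predictions collected so far).
            checkpoints.append((i + 1, len(predictions)))
    return predictions, references, checkpoints, error_count


def stub_predict(question):
    """Hypothetical stand-in for the model that fails on one question."""
    if "fail" in question:
        raise ValueError("simulated model error")
    return question.upper()


examples = [("q1", "a1"), ("q2", "a2"), ("fail", "a3"), ("q4", "a4")]
preds, refs, checkpoints, errors = run_eval_loop(examples, stub_predict)
print(checkpoints, errors)  # prints: [(2, 2), (4, 3)] 1
```

Note that the metrics at each checkpoint are cumulative: they cover every prediction collected so far, not only the most recent batch.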

In [None]:
# if __name__ == "__main__":
#     main()