In [2]:
# Comprehensive Guide to Evaluating Language Models (LLMs) with Python
# ## Introduction
# 
# Evaluating the performance of Large Language Models (LLMs) is an essential step in ensuring they meet user expectations for accuracy, logical reasoning, ethical behavior, and usability. This comprehensive guide combines theoretical understanding, formulas, and Python implementations for a wide array of metrics. By the end of this notebook, you will have all the tools necessary to benchmark and improve your LLM.

# ---
# ## Setting Up the Environment
# Install all required libraries if not already installed
# Uncomment and run the following lines if you haven't installed the required packages.

# !pip install numpy pandas sklearn rouge-score nltk detoxify lm-eval matplotlib

# ### Example Datasets
# Let's define example datasets that will be used across multiple metric evaluations.

# Example dataset for accuracy and logical consistency
gold_standard = [
    {"query": "What is 2 + 2?", "correct_answer": "4"},
    {"query": "Who wrote Macbeth?", "correct_answer": "William Shakespeare"},
    {"query": "What is the boiling point of water?", "correct_answer": "100°C"}
]

model_outputs = [
    {"query": "What is 2 + 2?", "output": "4"},
    {"query": "Who wrote Macbeth?", "output": "Charles Dickens"},
    {"query": "What is the boiling point of water?", "output": "100°C"}
]

# Example dataset for toxicity detection
texts = [
    "This is a friendly and respectful comment.",
    "This is a hateful and offensive comment."
]

# ---
# ## Metrics and Python Implementations

# ### 1. Hallucination Reduction Rate (HRR)
def calculate_hrr(baseline_outputs, validated_outputs):
    hallucinations_reduced = sum(
        1 for base, valid in zip(baseline_outputs, validated_outputs)
        if base["is_hallucination"] and not valid["is_hallucination"]
    )
    total_hallucinations = sum(1 for base in baseline_outputs if base["is_hallucination"])
    hrr = (hallucinations_reduced / total_hallucinations) * 100 if total_hallucinations > 0 else 0
    return hrr

# Example usage for HRR
baseline_outputs = [
    {"query": "What is the boiling point of water?", "output": "50°C", "is_hallucination": True},
    {"query": "Who wrote Hamlet?", "output": "Charles Dickens", "is_hallucination": True}
]
validated_outputs = [
    {"query": "What is the boiling point of water?", "output": "100°C", "is_hallucination": False},
    {"query": "Who wrote Hamlet?", "output": "William Shakespeare", "is_hallucination": False}
]
hrr_score = calculate_hrr(baseline_outputs, validated_outputs)
print(f"Hallucination Reduction Rate (HRR): {hrr_score:.2f}%")

# ---
# ### 2. Logical Consistency Score (LCS)
def calculate_lcs(responses):
    consistent_responses = sum(1 for response in responses if response["is_consistent"])
    return (consistent_responses / len(responses)) * 100

# Example usage for LCS
responses = [
    {"query": "If A > B and B > C, is A > C?", "output": "Yes", "is_consistent": True},
    {"query": "Is it possible for a square to have three sides?", "output": "No", "is_consistent": True}
]
lcs_score = calculate_lcs(responses)
print(f"Logical Consistency Score (LCS): {lcs_score:.2f}%")

# ---
# ### 3. Response Accuracy (RA)
def calculate_ra(gold_standard, model_outputs):
    correct_responses = sum(
        1 for gold, output in zip(gold_standard, model_outputs)
        if gold["correct_answer"] == output["output"]
    )
    return (correct_responses / len(gold_standard)) * 100

# Example usage for RA
ra_score = calculate_ra(gold_standard, model_outputs)
print(f"Response Accuracy (RA): {ra_score:.2f}%")

# ---
# ### 4. Exact Match (EM)
def exact_match(prediction, target):
    return prediction == target

# Example usage for EM
em_score = exact_match("Paris", "Paris")
print(f"Exact Match (EM): {em_score}")

# ---
# ### 5. F1 Score
from sklearn.metrics import f1_score

def calculate_f1(predictions, targets):
    return f1_score(targets, predictions, average="binary")

# Example usage for F1
predictions = [1, 0, 1, 1]
targets = [1, 0, 0, 1]
print(f"F1 Score: {calculate_f1(predictions, targets):.2f}")

# ---
# ### 6. ROUGE
from rouge_score import rouge_scorer

def calculate_rouge(prediction, target):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    return scorer.score(target, prediction)

# Example usage for ROUGE
rouge_scores = calculate_rouge("The cat sat on the mat.", "The cat is on the mat.")
print("ROUGE Scores:", rouge_scores)

# ---
# ### 7. BLEU
from nltk.translate.bleu_score import sentence_bleu

def calculate_bleu(prediction, target):
    reference = [target.split()]
    candidate = prediction.split()
    return sentence_bleu(reference, candidate)

# Example usage for BLEU
bleu_score = calculate_bleu("The cat is on the mat.", "The cat sat on the mat.")
print(f"BLEU Score: {bleu_score:.2f}")

# ---
# ### 8. Toxicity Detection
from detoxify import Detoxify

def detect_toxicity(text):
    model = Detoxify('original')
    return model.predict(text)

# Example usage for Toxicity Detection
for text in texts:
    print(f"Toxicity for '{text}': {detect_toxicity(text)}")

# ---
# ### 9. Using `lm-evaluation-harness`
# Install this package if you haven't:
# !pip install lm-eval
from lm_eval import Evaluator

# Basic Usage of lm-evaluation-harness
evaluator = Evaluator(model="gpt2", tasks=["lambada", "piqa"])
results = evaluator.evaluate()
print("LM Evaluation Results:", results)

# ---
# ## Final Thoughts
# This notebook demonstrated how to evaluate LLMs using Python. By applying these metrics to real-world datasets, you can gain deeper insights into model performance and identify areas for improvement.


Hallucination Reduction Rate (HRR): 100.00%
Logical Consistency Score (LCS): 100.00%
Response Accuracy (RA): 66.67%
Exact Match (EM): True
F1 Score: 0.80
ROUGE Scores: {'rouge1': Score(precision=0.8333333333333334, recall=0.8333333333333334, fmeasure=0.8333333333333334), 'rougeL': Score(precision=0.8333333333333334, recall=0.8333333333333334, fmeasure=0.8333333333333334)}
BLEU Score: 0.00


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  from .autonotebook import tqdm as notebook_tqdm
Downloading: "https://github.com/unitaryai/detoxify/releases/download/v0.1-alpha/toxic_original-c1212f89.ckpt" to C:\Users\rusla/.cache\torch\hub\checkpoints\toxic_original-c1212f89.ckpt
100%|██████████| 418M/418M [00:14<00:00, 31.2MB/s] 
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Toxicity for 'This is a friendly and respectful comment.': {'toxicity': np.float32(0.0005483509), 'severe_toxicity': np.float32(0.00014052661), 'obscene': np.float32(0.00019850969), 'threat': np.float32(0.00013926814), 'insult': np.float32(0.00018051789), 'identity_attack': np.float32(0.00014728199)}
Toxicity for 'This is a hateful and offensive comment.': {'toxicity': np.float32(0.15707134), 'severe_toxicity': np.float32(0.00023562438), 'obscene': np.float32(0.0022060007), 'threat': np.float32(0.00058931776), 'insult': np.float32(0.0024856713), 'identity_attack': np.float32(0.0010868483)}


ImportError: cannot import name 'Evaluator' from 'lm_eval' (c:\Dropbox\23-GITHUB\Projects\Comprehensive-Guide-to-Evaluating-LLMS-with-Python\.venv\Lib\site-packages\lm_eval\__init__.py)