# LLM Evaluation Parameters and Techniques
Evaluating large language models (LLMs) is crucial for understanding their performance, capabilities, and areas for improvement. This notebook gives an overview of common evaluation metrics, when to use each one, and short code examples for different tasks.

---

## Evaluation Parameters

### 1. **Perplexity**
- **What it is**: A measure of how well a probability model predicts a sample. Lower perplexity indicates better performance.
- **When to use**: Suitable for language modeling tasks where the goal is to predict sequences of text.
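- **Formula**:
  $$\text{Perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(x_i)}$$
  where $N$ is the number of tokens and $P(x_i)$ is the probability the model assigns to the $i$-th token.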

In [None]:
import math

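def calculate_perplexity(probabilities):
    """Return the perplexity of a sequence given its per-token probabilities."""
    if not probabilities:
        raise ValueError("probabilities must be non-empty")
    n = len(probabilities)
    # Sum of log2-probabilities; math.log2 raises ValueError if any p is 0.
    log_prob_sum = sum(math.log2(p) for p in probabilities)
    # Perplexity is 2 raised to the average negative log2-probability per token.
    return 2 ** (-log_prob_sum / n)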

# Example usage
probabilities = [0.1, 0.2, 0.3, 0.4]
print("Perplexity:", calculate_perplexity(probabilities))
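Language models usually expose token log-probabilities rather than raw probabilities. The sketch below assumes natural-log values and computes the same quantity as `exp(-mean(logprobs))`; the function name `perplexity_from_logprobs` is illustrative and not part of any library.

```python
import math

def perplexity_from_logprobs(logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean(logprobs))."""
    if not logprobs:
        raise ValueError("logprobs must be non-empty")
    return math.exp(-sum(logprobs) / len(logprobs))

# Same four probabilities as above, expressed as natural-log values.
token_logprobs = [math.log(p) for p in [0.1, 0.2, 0.3, 0.4]]
print("Perplexity from log-probs:", perplexity_from_logprobs(token_logprobs))
```

Because changing the logarithm base only rescales the exponent, this should agree with the base-2 formula up to floating-point error.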

### 2. **BLEU (Bilingual Evaluation Understudy)**
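- **What it is**: A metric that scores a generated sequence by its n-gram precision against one or more reference sequences, with a brevity penalty for outputs shorter than the reference. Scores range from 0 to 1; higher is better.
- **When to use**: Commonly used in machine translation, where reference translations are available.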

In [None]:
from nltk.translate.bleu_score import sentence_bleu

reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'test']
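# The candidate shares no 3-grams or 4-grams with the reference, so default BLEU-4
# would be effectively zero; weight only unigrams and bigrams for this short example.
score = sentence_bleu(reference, candidate, weights=(0.5, 0.5))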
print("BLEU score:", score)
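To make the ingredients of BLEU concrete, here is a minimal sketch of clipped n-gram precision and the brevity penalty for a single reference. The helper names are hypothetical; this illustrates the idea and is not NLTK's implementation.

```python
import math
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Modified n-gram precision of a candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def brevity_penalty(candidate, reference):
    """Penalize candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

ref_tokens = ['this', 'is', 'a', 'test']
cand_tokens = ['this', 'is', 'test']
print("Unigram precision:", clipped_precision(cand_tokens, ref_tokens, n=1))
print("Bigram precision:", clipped_precision(cand_tokens, ref_tokens, n=2))
print("Brevity penalty:", brevity_penalty(cand_tokens, ref_tokens))
```

BLEU combines these as the brevity penalty times a weighted geometric mean of the n-gram precisions.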

### 3. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**
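- **What it is**: Measures n-gram and longest-common-subsequence overlap between generated and reference texts, typically reported as ROUGE-1, ROUGE-2, and ROUGE-L with precision, recall, and F-score.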
- **When to use**: Useful for summarization tasks.

In [None]:
!pip install rouge

In [None]:
from rouge import Rouge

rouge = Rouge()
hypothesis = "The quick brown fox jumps over the lazy dog."
reference = "The fast brown fox leaps over the lazy dog."
scores = rouge.get_scores(hypothesis, reference)
print("ROUGE scores:", scores)
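As a rough illustration of what ROUGE-1 measures, the sketch below counts overlapping unigrams by hand. The tokenization (lowercasing and keeping only word characters) is an assumption made for this example and may differ from what the `rouge` package does internally, so the numbers need not match exactly.

```python
import re
from collections import Counter

def rouge_1(hypothesis, reference):
    """Unigram-overlap precision, recall, and F1 between two strings."""
    hyp = Counter(re.findall(r"\w+", hypothesis.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((hyp & ref).values())
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

manual_scores = rouge_1(
    "The quick brown fox jumps over the lazy dog.",
    "The fast brown fox leaps over the lazy dog.",
)
print("Manual ROUGE-1:", manual_scores)
```

Seven of the nine tokens on each side overlap here, so precision, recall, and F1 all come out to 7/9.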

### 4. **Accuracy**
- **What it is**: Fraction of predictions that match the true labels (often reported as a percentage).
- **When to use**: Suitable for classification tasks, especially when classes are roughly balanced.

In [None]:
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

### 5. **F1 Score**
- **What it is**: Harmonic mean of precision and recall.
- **When to use**: Appropriate for imbalanced datasets, where accuracy alone can be misleading.

In [None]:
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
f1 = f1_score(y_true, y_pred)
print("F1 Score:", f1)
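The harmonic mean can be worked through by hand for the same labels. This is a minimal sketch for the binary case; the function name `precision_recall_f1` is illustrative.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall, and F1 computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

precision, recall, f1_manual = precision_recall_f1([1, 0, 1, 1], [1, 0, 0, 1])
print("Precision:", precision, "Recall:", recall, "F1:", f1_manual)
```

With two true positives, no false positives, and one false negative, precision is 1.0, recall is 2/3, and F1 is 0.8.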

### 6. **Exact Match (EM)**
- **What it is**: Measures whether the generated output matches the reference exactly.
- **When to use**: Best for tasks like question answering.

In [None]:
def exact_match_score(reference, prediction):
    return int(reference == prediction)

# Example usage
reference = "Paris"   # gold answer
prediction = "Paris"  # model answer
print("Exact Match:", exact_match_score(reference, prediction))
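Strict string equality treats differences in casing, punctuation, or articles as errors. A common workaround is to normalize both strings before comparing. The sketch below assumes English text and a simple normalization (lowercase, strip punctuation, drop "a", "an", "the"); the helper names are hypothetical, and the right normalization depends on the dataset.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, remove punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def normalized_exact_match(reference, prediction):
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(reference) == normalize_answer(prediction))

print("Normalized EM:", normalized_exact_match("Paris", "paris."))
print("Normalized EM:", normalized_exact_match("Paris", "The city of Paris"))
```

The first pair matches after normalization; the second does not, because the prediction contains extra tokens.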

### 7. **Human Evaluation**
- **What it is**: Human raters judge model outputs directly, giving subjective feedback on qualities such as fluency, relevance, or helpfulness.
- **When to use**: Ideal for tasks where automated metrics fail to capture nuances, such as creative writing or conversational AI.
- **Drawback**: Time-consuming and subjective.

## Task-Based Evaluation

### Task: **Text Generation**
- **Evaluation Metric**: Perplexity, BLEU, Human Evaluation

### Task: **Summarization**
- **Evaluation Metric**: ROUGE

In [None]:
hypothesis = "The quick brown fox."
reference = "The fast brown fox jumps."
scores = rouge.get_scores(hypothesis, reference)
print("ROUGE for Summarization:", scores)

### Task: **Classification**
- **Evaluation Metric**: Accuracy, F1 Score

In [None]:
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print("Accuracy for Classification:", accuracy_score(y_true, y_pred))
print("F1 Score for Classification:", f1_score(y_true, y_pred))

### Task: **Machine Translation**
- **Evaluation Metric**: BLEU

In [None]:
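# `reference` was reassigned to a plain string in earlier cells, so rebuild the tokenized inputs.
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'test']
# Weight only unigrams and bigrams: the short candidate shares no 3-grams or 4-grams with the reference.
print("BLEU score for Translation:", sentence_bleu(reference, candidate, weights=(0.5, 0.5)))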

### Task: **Question Answering**
- **Evaluation Metric**: Exact Match, F1 Score (computed over answer tokens rather than class labels)

In [None]:
reference = "Paris"
prediction = "Paris"
print("Exact Match for QA:", exact_match_score(reference, prediction))
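For question answering, F1 is usually computed over the tokens shared between the predicted and reference answers, which gives partial credit when the prediction is close but not identical. The following is a minimal sketch assuming lowercase whitespace tokenization; `token_f1` is an illustrative name, not a library function.

```python
from collections import Counter

def token_f1(reference, prediction):
    """Token-overlap F1 between a reference answer and a predicted answer."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(ref_tokens == pred_tokens)
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print("Token F1 (exact):", token_f1("Paris", "Paris"))
print("Token F1 (partial):", token_f1("Paris", "the capital is Paris"))
```

The second prediction contains the right answer among extra words, so it scores 0 on exact match but receives partial credit here (precision 0.25, recall 1.0).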

## Conclusion
Selecting the right evaluation metric is essential for measuring the effectiveness of LLMs. This notebook provides examples and code for various tasks, helping you choose the most appropriate evaluation technique for your specific needs.