Here's a brief write-up of the LLM evaluation libraries and metrics you mentioned, based on their common use cases:

1. deepeval: A Python library for evaluating LLM outputs, particularly useful for RAG (Retrieval-Augmented Generation) systems. It provides metrics such as Faithfulness, Answer Relevancy, Toxicity, and Bias, and supports pytest-style programmatic testing of LLM pipelines.

2. lettucedetect: This tool is designed to detect "hallucinations" in RAG outputs. Given the retrieved context, the question, and the generated answer, it flags the parts of the answer that the context does not support.

3. RAGAS: Specifically focused on evaluating RAG pipelines. RAGAS offers metrics for the quality of the retrieved context (context precision and context recall) and for the generated answer (faithfulness to that context and answer relevancy).

4. ROUGE (Recall-Oriented Understudy for Gisting Evaluation): A set of metrics commonly used for evaluating summarization and machine translation. It measures the overlap of n-grams and longest common subsequences between the generated text and reference texts, focusing on recall.

5. rouge_score: A Python library that provides an implementation of the ROUGE metrics, allowing you to easily compute ROUGE scores between candidate and reference texts.

6. giskard: An open-source platform and library for testing and evaluating AI models, including LLMs. It helps identify vulnerabilities, performance issues, and biases through various testing capabilities.

7. SECTOR: While less commonly known compared to others, SECTOR is a framework focused on evaluating the semantic coherence and textual similarity of generated text. It aims to measure how well the generated text flows and maintains meaning.

8. BLEU (Bilingual Evaluation Understudy): Primarily used for evaluating machine translation, but also applicable to other text generation tasks with reference texts. It measures the precision of n-grams in the generated text compared to reference texts, with a penalty for brevity.

These tools and metrics offer different perspectives on LLM performance, covering aspects from factual correctness and relevance to fluency and overall quality. The choice of which to use depends on the specific task and evaluation goals.
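
To make the precision/recall distinction between BLEU and ROUGE above concrete, here is a minimal sketch in plain Python. It computes clipped n-gram precision (the building block of BLEU) and n-gram recall (as in ROUGE-N recall) for one candidate and one reference. The helper name `ngram_overlap` and the lowercase whitespace tokenization are illustrative assumptions; the real metrics add their own tokenization, several n-gram orders, and, for BLEU, the brevity penalty.

```python
from collections import Counter

def ngram_overlap(candidate: str, reference: str, n: int = 1):
    """Return (clipped precision, recall) of n-grams; a simplified illustration."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count per n-gram ("clipping")
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)  # share of candidate n-grams found in the reference
    recall = overlap / max(sum(ref.values()), 1)      # share of reference n-grams covered by the candidate
    return precision, recall

precision, recall = ngram_overlap("the fox jumped over the dog",
                                  "a quick brown fox jumps over the lazy dog")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

The short candidate scores higher on precision than on recall: most of what it says appears in the reference, but it covers less than half of the reference.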

In [None]:
# -*- coding: utf-8 -*-
"""LLM and Statistical Evaluation Tutorial.ipynb

# LLM and Statistical Evaluation Techniques Tutorial

This notebook provides a hands-on introduction to using Large Language Models (LLMs) and evaluating their performance using standard statistical metrics in Python. We'll use the Hugging Face `transformers` library for the LLM and the `evaluate` library for metrics.

## 1. Setup and Installations

First, let's install the necessary libraries.
"""

# Install required libraries
!pip install transformers datasets evaluate nltk rouge_score

# Import necessary libraries
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline
import evaluate  # metrics come from `evaluate`; `datasets.load_metric` was removed in datasets 3.0
import nltk
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Download necessary NLTK data for some metrics
nltk.download('punkt', quiet=True)

"""## 2. Loading a Pre-trained Language Model

We'll use the smallest GPT-2 checkpoint (`gpt2`, roughly 124M parameters) for demonstration purposes. Loading the model downloads its weights and configuration, along with its associated tokenizer.
"""

# Load a pre-trained model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Set the padding token for GPT-2, which doesn't have one by default
# This is important for batch processing later, though not strictly needed for single-sequence generation
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token # Use the end-of-sequence token as pad token

print(f"Model '{model_name}' loaded successfully.")

"""## 3. Text Generation

Let's generate some text using the loaded model. We can use the `pipeline` for convenience.
"""

# Create a text generation pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1) # Use GPU if available

# Define a prompt
prompt = "The quick brown fox jumps over the lazy"

# Generate text
generated_text = generator(prompt, max_length=50, num_return_sequences=1, do_sample=True, temperature=0.7)[0]['generated_text']

print("\n--- Generated Text ---")
print(f"Prompt: {prompt}")
print(f"Generated: {generated_text}")

"""## 4. Evaluating Text Generation

Evaluating generated text is crucial. We can use various metrics to assess the quality, fluency, and relevance of the output compared to reference text (if available).

### 4.1 Perplexity

Perplexity is a measure of how well a probability model predicts a sample. In language modeling, it measures how well the model predicts a sequence of words. A lower perplexity generally indicates a better model.

Calculating perplexity usually requires evaluating the model's likelihood on a held-out dataset. For a single text, it is the exponential of the average negative log-likelihood per token, which is the same as the inverse probability of the sequence normalized by its length in tokens. The `evaluate` library's `perplexity` metric computes this for a list of strings.

Let's demonstrate calculating perplexity on a simple text string.
"""

# Load the perplexity metric
perplexity_metric = evaluate.load("perplexity", module_type="metric")

# We need to calculate perplexity on some text using the model's likelihood
# Let's calculate the perplexity of the generated text itself under the model.
# Note: A more standard evaluation is on a separate held-out corpus.
# This calculation shows the model's confidence in the generated sequence.

text_to_evaluate = [generated_text] # Perplexity metric expects a list of strings

# Calculate perplexity
# The metric loads the model and its tokenizer from `model_id`,
# tokenizes the text, and computes the model's negative log-likelihood
results = perplexity_metric.compute(model_id=model_name,
                                    predictions=text_to_evaluate)

print("\n--- Perplexity Evaluation ---")
print(f"Text: {text_to_evaluate[0]}")
# The result holds per-text values under 'perplexities' and their average under 'mean_perplexity'
print(f"Perplexity: {results['mean_perplexity']:.2f}")

"""### 4.2 BLEU (Bilingual Evaluation Understudy)

BLEU is a metric for evaluating the quality of text which has been machine-translated from one natural language to another. It's also widely used for other text generation tasks like summarization or image captioning where reference texts are available.

BLEU compares the generated text (candidate) to one or more high-quality reference texts. It measures the precision of n-grams (sequences of n words) in the candidate text relative to the reference texts, with a penalty for short sentences.
"""

# Load the BLEU metric
bleu_metric = evaluate.load("bleu")

# Example: Evaluate a candidate sentence against reference sentences
candidate = "The quick brown fox jumped over the lazy dog."
references = [
    "The quick brown fox jumps over the lazy dog.",
    "A quick brown fox jumps over the lazy dog.",
    "Quick brown fox jumps over lazy dog."
]

# The metric expects one list of references per prediction,
# so all three references for our single candidate go into one inner list
references_formatted = [references]

# Compute BLEU score
results = bleu_metric.compute(predictions=[candidate], references=references_formatted)

print("\n--- BLEU Evaluation ---")
print(f"Candidate: {candidate}")
print("References:")
for ref in references:
    print(f"- {ref}")
print(f"BLEU score: {results['bleu']:.4f}")

"""A higher BLEU score indicates better overlap with the reference translations.

### 4.3 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is commonly used for evaluating automatic summarization and machine translation. Unlike BLEU, which is precision-oriented, ROUGE is recall-oriented: it measures how much of the reference text is covered by the generated text.

Common variants:
- ROUGE-N: Compares n-grams. ROUGE-1 uses unigrams, ROUGE-2 uses bigrams.
- ROUGE-L: Compares based on the Longest Common Subsequence (LCS). This captures sentence-level structure similarity.
- ROUGE-W: Weighted LCS.
- ROUGE-S: Skip-bigram co-occurrence statistics.

Let's compute ROUGE scores.
"""

# Load the ROUGE metric
rouge_metric = evaluate.load("rouge")

# Example: Evaluate a generated summary against a reference summary
candidate_summary = "The fox jumped over the dog."
reference_summary = "A quick brown fox jumps over the lazy dog in the forest."

# Compute ROUGE score
# The metric expects lists for predictions and references
results = rouge_metric.compute(predictions=[candidate_summary], references=[reference_summary])

print("\n--- ROUGE Evaluation ---")
print(f"Candidate Summary: {candidate_summary}")
print(f"Reference Summary: {reference_summary}")
print(f"ROUGE-1 F1: {results['rouge1']:.4f}")
print(f"ROUGE-2 F1: {results['rouge2']:.4f}")
print(f"ROUGE-L F1: {results['rougeL']:.4f}") # F1-score is often reported

"""Higher ROUGE scores indicate better overlap with the reference summary.

## 5. Evaluating LLMs for Classification Tasks

LLMs can also be fine-tuned or used via prompting for classification tasks (e.g., sentiment analysis, topic classification, intent recognition). In such cases, standard classification metrics are used.

Let's simulate a classification scenario where an LLM predicts sentiment (positive/negative) and evaluate using common metrics from `sklearn`.
"""

# Simulate true labels and LLM predictions
# 0: Negative, 1: Positive
true_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predicted_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0] # Some correct, some incorrect

print("\n--- Classification Evaluation Simulation ---")
print(f"True Labels:      {true_labels}")
print(f"Predicted Labels: {predicted_labels}")

# Calculate standard classification metrics

# Accuracy: Proportion of correct predictions
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"\nAccuracy: {accuracy:.4f}")

# Precision: Of all instances predicted as positive, what proportion were actually positive?
# Useful when the cost of False Positives is high.
# average='binary' is for binary classification
precision = precision_score(true_labels, predicted_labels, average='binary')
print(f"Precision (Positive Class): {precision:.4f}")

# Recall (Sensitivity): Of all actual positive instances, what proportion were correctly predicted as positive?
# Useful when the cost of False Negatives is high.
recall = recall_score(true_labels, predicted_labels, average='binary')
print(f"Recall (Positive Class): {recall:.4f}")

# F1-score: The harmonic mean of Precision and Recall. Balances both metrics.
f1 = f1_score(true_labels, predicted_labels, average='binary')
print(f"F1-score (Positive Class): {f1:.4f}")

"""These metrics provide different perspectives on the performance of the classification model.

## 6. Conclusion

This tutorial demonstrated how to load a basic LLM and perform text generation. Crucially, it showed how to use standard statistical metrics (Perplexity, BLEU, ROUGE for generation; Accuracy, Precision, Recall, F1 for classification) to quantitatively evaluate the performance of LLMs on different tasks.

Choosing the right evaluation metric depends heavily on the specific task and the aspects of performance you care most about. Automated metrics are valuable but should often be complemented with human evaluation for a complete picture of an LLM's capabilities.
"""

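Section 4.3 notes that ROUGE-L is based on the Longest Common Subsequence (LCS). The sketch below illustrates that idea under simplifying assumptions: lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge` metric exactly. `lcs_length` and `rouge_l` are illustrative names, not library functions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match diagonally, or carry over the best value seen so far
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate: str, reference: str):
    """Return LCS-based (precision, recall, F1) for one candidate and one reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1

# Same words, different order: every unigram matches, but the LCS is shorter
p, r, f = rouge_l("the dog jumps over the fox", "the fox jumps over the dog")
print(f"LCS-based precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Because the subsequence must respect word order, swapping "dog" and "fox" lowers the score even though the two sentences share all their words.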

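Section 5 relies on `sklearn` for accuracy, precision, recall, and F1. As a sanity check on what those functions compute, this sketch derives the same four numbers from confusion-matrix counts, using the tutorial's simulated labels. `binary_metrics` is a hypothetical helper written for illustration, not part of `sklearn`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    tn = len(pairs) - tp - fp - fn                                    # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

# 0: Negative, 1: Positive (same simulated labels as in Section 5)
true_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predicted_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(true_labels, predicted_labels)
print(metrics)
```

With 4 true positives, 2 false positives, 2 false negatives, and 2 true negatives, accuracy is 6/10 and precision, recall, and F1 all come out to 4/6.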
In [None]:
# deepeval, lettucedetect, RAGAS, rouge_score, giskard, SECTOR

!pip install deepeval lettucedetect ragas rouge_score giskard
# Installed separately so that a failure here does not block the packages above
!pip install sector

import pandas as pd
from datasets import Dataset
from deepeval import assert_test
from deepeval import evaluate as deepeval_evaluate
from deepeval.metrics import (
    BiasMetric,
    ToxicityMetric,
    SummarizationMetric,
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualRelevancyMetric,
    GEval,
)
from deepeval.test_case import LLMTestCase
from lettucedetect.models.inference import HallucinationDetector
from ragas import evaluate as ragas_evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from rouge_score import rouge_scorer
import giskard

print("\n--- Additional LLM Evaluation Libraries Installed and Imported ---")
print("deepeval, lettucedetect, ragas, rouge_score, giskard")

# --- Examples of using the new libraries (simplified) ---

# Example using deepeval (requires a test case)
# Kept commented out: these metrics call an LLM judge, which needs an API key
# (OpenAI by default). Real usage defines many test cases with inputs, outputs, and context.
# test_case = LLMTestCase(
#     input="What is the capital of France?",
#     actual_output="Paris is the capital of France.",
#     expected_output="Paris is the capital of France.",
#     context=["Paris is the capital and most populous city of France."],
#     retrieval_context=["Paris is the capital and most populous city of France."]
# )
# # You would typically run evaluate with a list of test cases
# deepeval_evaluate(test_cases=[test_case],
#                   metrics=[FaithfulnessMetric(), AnswerRelevancyMetric()])


# Example using lettucedetect
# Usage follows the project's README at the time of writing; check the model name
# before relying on it. The model is downloaded from the Hugging Face Hub on first use.
detector = HallucinationDetector(
    method="transformer",
    model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1",
)
contexts = ["Paris is the capital and most populous city of France."]
question = "What is the capital of France?"
# The second sentence is deliberately unsupported by the context
answer_to_check = "Paris is the capital of France. It has 20 million inhabitants."
result = detector.predict(context=contexts, question=question,
                          answer=answer_to_check, output_format="spans")
print(f"\nLettuceDetect check on '{answer_to_check}': {result}")


# Example using ragas (requires a dataset)
# Kept commented out: the metrics call an LLM and an embedding model, which need API keys.
# Column names follow ragas 0.1.x ('question', 'answer', 'ground_truth', 'contexts');
# newer releases may expect different names.
# data = {
#     'question': ["What is the capital of France?"],
#     'answer': ["Paris is the capital of France."],
#     'ground_truth': ["The capital of France is Paris."],
#     'contexts': [["Paris is the capital and most populous city of France."]]
# }
# ragas_dataset = Dataset.from_dict(data)
# ragas_results = ragas_evaluate(ragas_dataset, metrics=[answer_relevancy, faithfulness])
# print(f"\nRagas results: {ragas_results}")


# Example using rouge_score
# score() takes the reference (target) first, then the candidate (prediction)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('The quick brown fox jumps over the lazy dog.',
                      'The quick brown fox jumped over the lazy dog.')
print("\nRouge_score comparison:\nReference: 'The quick brown fox jumps over the lazy dog.'\nCandidate: 'The quick brown fox jumped over the lazy dog.'")
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.4f}")
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.4f}")


# Example using giskard (requires wrapping your model and data)
# Kept commented out: the LLM scan needs an LLM client (for example an OpenAI key).
# def dummy_llm(df: pd.DataFrame):
#     # Simulate LLM output based on the input text column
#     return [f"Generated text based on: {txt}" for txt in df['input_text']]
#
# # Create a dummy dataset
# dummy_data = pd.DataFrame({'input_text': ['Hello model', 'Another input']})
# giskard_dataset = giskard.Dataset(dummy_data, target=None)  # No target for text generation
#
# # Wrap the dummy prediction function as a giskard model
# dummy_model = giskard.Model(
#     model=dummy_llm,
#     model_type="text_generation",
#     name="Dummy Model",
#     description="Echoes the input text; stands in for a real LLM.",
#     feature_names=['input_text'],
# )
#
# # Run the scan (real use needs far more context and configuration)
# scan_results = giskard.scan(dummy_model, giskard_dataset)
# print(f"\nGiskard scan result (dummy): {scan_results}")


# Example using SECTOR
# The package name `sector` is taken from the install line above and is not verified here.
# The import sits inside the try block so a missing package is reported instead of
# stopping the cell. Real usage involves building or loading SECTOR models.
try:
    import sector
    print("\nSECTOR imported successfully. Usage involves building/loading SECTOR models.")
except ImportError as e:
    print(f"\nCould not import SECTOR: {e}")