### Evaluation & Metrics: Assessing model performance with perplexity, Distinct-n, and Self-BLEU

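#### Perplexity:
Measures how “surprised” a language model is by a given text. Lower perplexity means the model assigns higher probability to the text.
>When a model is scored on held-out text from your corpus, lower scores indicate that it models the style/content of that corpus better. In this notebook GPT-2 scores the generated text instead, so a lower score means the output looks more fluent and predictable to GPT-2. Extremely low scores can also hint at overly safe or repetitive outputs.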
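
To make the definition concrete, perplexity can be written as $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$, the exponential of the average negative log-probability per token. The sketch below computes it from a list of per-token probabilities. The helper name `perplexity_from_probs` and the probability values are made up for illustration; they do not come from GPT-2.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-probability per token)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical per-token probabilities, invented for illustration only.
confident = [0.5, 0.5, 0.5, 0.5]   # model gives each token a 50% chance
uncertain = [0.1, 0.1, 0.1, 0.1]   # model gives each token a 10% chance

print(f"{perplexity_from_probs(confident):.2f}")  # 2.00
print(f"{perplexity_from_probs(uncertain):.2f}")  # 10.00
```

A perplexity of 2 can be read as the model being, on average, as unsure as if it were choosing uniformly between 2 tokens at each step; a perplexity of 10 corresponds to 10 equally likely tokens.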

#### Distinct:
Quantifies how many unique n-grams appear in the generated texts relative to the total number of n-grams produced. It is a simple diversity measure.
> - Distinct-1 measures word-level diversity (unigrams).
> - Distinct-2 measures phrase-level diversity (bigrams).
> - Values closer to 1.0 mean high diversity (few repeats); values closer to 0 mean the model is repeating the same words/phrases.
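
As a small worked example of the ratio described above, the sketch below compares a repetitive sentence with a varied one. The helper names `ngrams` and `distinct_ratio` and both sentences are hypothetical, chosen only to show how repeats pull the score down.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_ratio(text, n):
    """Unique n-grams divided by total n-grams for a single text."""
    grams = ngrams(text.split(), n)
    return len(set(grams)) / len(grams) if grams else 0.0

# Toy sentences, invented for illustration only.
repetitive = "the cat sat on the mat the cat sat"
varied = "a quick brown fox jumps over lazy dogs"

print(f"{distinct_ratio(repetitive, 1):.4f}")  # 0.5556 (5 unique words out of 9)
print(f"{distinct_ratio(repetitive, 2):.4f}")  # 0.7500 (6 unique bigrams out of 8)
print(f"{distinct_ratio(varied, 1):.4f}")      # 1.0000 (no word repeats)
```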

#### Self-BLEU:
Evaluates how similar the generated samples are to one another. It turns BLEU on the model's own outputs: each generation is treated in turn as the “hypothesis” and all the others as “references.” It therefore needs at least two samples to be meaningful.
> - Scores range from 0 to 1.
> - Higher Self-BLEU means samples are very similar to each other (low diversity).
> - Lower Self-BLEU means samples are more distinct.
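
The leave-one-out idea can be illustrated without any library. The sketch below is a simplified stand-in, not BLEU itself: it uses only clipped n-gram precision (the core ingredient of BLEU) and omits the brevity penalty, the geometric mean over n-gram orders, and smoothing. The function names `clipped_precision` and `self_overlap` and the sample texts are hypothetical.

```python
from collections import Counter

def clipped_precision(hypothesis, references, n=1):
    """Share of hypothesis n-grams also found in a reference, with clipped counts."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp = grams(hypothesis)
    if not hyp:
        return 0.0
    # For each n-gram, the largest count seen in any single reference
    max_ref = Counter()
    for ref in references:
        for gram, count in grams(ref).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp.items())
    return matched / sum(hyp.values())

def self_overlap(texts, n=1):
    """Average score of each text against all the others (leave-one-out)."""
    tokenized = [t.split() for t in texts]
    scores = [
        clipped_precision(hyp, tokenized[:i] + tokenized[i + 1:], n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)

# Toy samples, invented for illustration only.
similar = ["the rose is red", "the rose is pale", "the rose is red and pale"]
diverse = ["the rose is red", "winter winds howl loudly", "my heart keeps time"]

print(f"{self_overlap(similar):.4f}")  # 0.9444 (samples share almost every word)
print(f"{self_overlap(diverse):.4f}")  # 0.0000 (samples share no words)
```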


In [3]:
import math
import itertools
from collections import Counter
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

In [7]:
# Perplexity, scored with GPT-2

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval().to("cuda" if torch.cuda.is_available() else "cpu")

def calc_perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2: exp of the mean per-token negative log-likelihood."""
    # Note: texts longer than GPT-2's 1024-token context would need truncation or a sliding window.
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(model.device)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean cross-entropy loss
        outputs = model(input_ids, labels=input_ids)
    # The loss is already averaged over tokens, so exponentiating it gives the perplexity
    ppl = torch.exp(outputs.loss)
    return ppl.item()

In [9]:
# Distinct-n (corpus-level diversity)

def distinct_n(texts, n=1):
    """Distinct-n = (#unique n-grams) / (#total n-grams), pooled over all texts."""
    total_ngrams = 0
    unique_ngrams = set()
    for txt in texts:
        tokens = txt.split()  # whitespace tokenization; punctuation stays attached to words
        # zip over shifted copies of the token list yields every run of n consecutive tokens
        for ng in zip(*[tokens[i:] for i in range(n)]):
            unique_ngrams.add(ng)
            total_ngrams += 1
    return len(unique_ngrams) / total_ngrams if total_ngrams > 0 else 0.0

In [13]:
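# Self-BLEU

def self_bleu(texts, n_gram=4):
    """
    Self-BLEU: average BLEU of each text scored against all the other texts.
    Returns 0.0 when fewer than two texts are given, since there is nothing to compare.
    """
    if len(texts) < 2:
        return 0.0
    smoothie = SmoothingFunction().method4
    tokenized = [t.split() for t in texts]
    scores = []
    for i, cand in enumerate(tokenized):
        # Every other sample serves as a reference for the current one
        references = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(
            references=references,  # a list of token lists; no extra nesting
            hypothesis=cand,
            smoothing_function=smoothie,
            weights=tuple([1 / n_gram] * n_gram)
        ))
    return sum(scores) / len(scores)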

In [25]:
if __name__ == "__main__":
    
    generated_texts = ["When first mine eyes beheld thy gentle face, and I saw the light of thylips, which was in the midst of my mind, as it were. I took a little while to look at the picture" ] #input generated poem 

    # 4.1 Perplexity
    ppls = [calc_perplexity(txt) for txt in generated_texts]
    print("Perplexities:", ppls)
    print(f"Avg Perplexity: {sum(ppls)/len(ppls):.2f}")

    # 4.2 Distinct-1 / Distinct-2
    print(f"Distinct-1: {distinct_n(generated_texts, n=1):.4f}")
    print(f"Distinct-2: {distinct_n(generated_texts, n=2):.4f}")

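    # 4.3 Self-BLEU
    # Needs at least two samples: with a single text there are no references,
    # so the reported 0.0 is not informative.
    sb = self_bleu(generated_texts, n_gram=4)
    print(f"Self-BLEU (up to 4-gram): {sb:.4f}")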

Perplexities: [46.377098083496094]
Avg Perplexity: 46.38
Distinct-1: 0.8889
Distinct-2: 1.0000
Self-BLEU (up to 4-gram): 0.0000
