## Evaluation & Metrics: Assess model performance using relevant metrics: perplexity, distinct, Self-BLEU

### Perplexity:
measures how “surprised” a language model is by a given text. Lower perplexity means the model assigns higher probability to the text
> 	•	Lower scores indicate the model is better at modeling the style/content of your corpus. 
	•	Extremely low scores can also hint at overly safe or repetitive outputs.

### Distinct:
quantifies how many unique n-grams appear in the generated texts, relative to the total number of n-grams produced. It’s a simple diversity measure
> 	•	Distinct-1 measures word-level diversity (unigrams).
	•	Distinct-2 measures phrase-level diversity (bigrams).
	•	Values closer to 1.0 mean high diversity (few repeats)
	•	values closer to 0 mean the model is repeating the same words/phrases.

###  Self-BLEU:
evaluates how similar the generated samples are to one another. It’s a reverse of BLEU: treating each generation as a “hypothesis” and all the others as “references.”
> 	•	Scores range from 0 to 1.
	•	Higher Self-BLEU means samples are very similar to each other (low diversity).
	•	Lower Self-BLEU means samples are more distinct.


In [30]:
import math
import itertools
from collections import Counter
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

In [32]:
#perplexity based on GPT2

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval().to("cuda" if torch.cuda.is_available() else "cpu")

def calc_perplexity(text: str) -> float:
    """perplexity"""
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(model.device)
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        neg_log_likelihood = outputs.loss * input_ids.shape[1]
    ppl = torch.exp(neg_log_likelihood / input_ids.shape[1])
    return ppl.item()

In [34]:
#distinct

def distinct_n(texts, n=1):
    """Distinct-n = (#unique n-grams) / (#total n-grams)"""
    total_ngrams = 0
    unique_ngrams = set()
    for txt in texts:
        tokens = txt.split()
        ngrams = zip(*[tokens[i:] for i in range(n)])
        count = 0
        for ng in ngrams:
            unique_ngrams.add(ng)
            count += 1
        total_ngrams += count
    return len(unique_ngrams) / total_ngrams if total_ngrams > 0 else 0.0

In [83]:
#BLEU

smoothie = SmoothingFunction().method4

def self_bleu(texts, n_gram=4):
    """
    Compute the average BLEU score of each text against all the others.
    texts: list of raw strings
    n_gram: maximum n-gram order
    """
    scores = []
    for i, cand in enumerate(texts):
        # Tokenize candidate once
        cand_tokens = cand.split()
        # Build list of reference token lists (exclude the candidate itself)
        references = [
            other.split()
            for j, other in enumerate(texts)
            if j != i
        ]
        # Compute sentence BLEU against the list of references
        score = sentence_bleu(
            references=references,      # list of reference token lists
            hypothesis=cand_tokens,     # candidate token list
            smoothing_function=smoothie,
            weights=tuple([1.0 / n_gram] * n_gram),
            auto_reweigh=False
        )
        scores.append(score)
    return sum(scores) / len(scores)

## Compute style metrics to check how the generated poems style match to training poems

### TTR (Type–Token Ratio)
> 	•	High TTR (closer to 1): lots of different words — strong variety.
	•	Low TTR (closer to 0): repeat the same words more often — less variety.

### Simpson Diversity Index
> 	•	High Simpson (closer to 1): high probability that two randomly picked tokens are different—strong diversity.
	•	Low Simpson (closer to 0): high chance that two picks are the same token—low diversity.

### POS KL Divergence
> 	•	Low KL (near 0): generated poem’s POS mix is very similar to the reference style.
	•	High KL: generated poem’s POS proportions deviate strongly from that style.

In [4]:
import spacy
from spacy.cli import download as spacy_download
from collections import Counter
import math


In [6]:
spacy_download("en_core_web_sm")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m35.7 MB/s[0m eta [36m0:00:00[0m [36m0:00:01[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [8]:
#Calculate reference POS
from datasets import load_dataset

In [10]:
# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Define POS distribution function
def get_reference_pos_distribution(texts):
    all_pos = []
    for text in texts:
        doc = nlp(text)
        all_pos.extend([token.pos_ for token in doc if token.is_alpha])
    total = len(all_pos)
    count = Counter(all_pos)
    return {k: v / total for k, v in count.items()}

In [12]:
# Load poetry dataset
dataset = load_dataset("merve/poetry")

Repo card metadata block was not found. Setting CardData to empty.


In [13]:
dataset

DatasetDict({
    train: Dataset({
        features: ['author', 'content', 'poem name', 'age', 'type'],
        num_rows: 573
    })
})

In [16]:
# Filter only Renaissance poems
renaissance_poems = dataset['train'].filter(lambda x: x['age'] == 'Renaissance')

In [18]:
renaissance_poems

Dataset({
    features: ['author', 'content', 'poem name', 'age', 'type'],
    num_rows: 315
})

In [182]:
# Extract text column 
train_texts = renaissance_poems['content']  

In [184]:
train_texts

['Let the bird of loudest lay\r\nOn the sole Arabian tree\r\nHerald sad and trumpet be,\r\nTo whose sound chaste wings obey.\r\n\r\nBut thou shrieking harbinger,\r\nFoul precurrer of the fiend,\r\nAugur of the fever\'s end,\r\nTo this troop come thou not near.\r\n\r\nFrom this session interdict\r\nEvery fowl of tyrant wing,\r\nSave the eagle, feather\'d king;\r\nKeep the obsequy so strict.\r\n\r\nLet the priest in surplice white,\r\nThat defunctive music can,\r\nBe the death-divining swan,\r\nLest the requiem lack his right.\r\n\r\nAnd thou treble-dated crow,\r\nThat thy sable gender mak\'st\r\nWith the breath thou giv\'st and tak\'st,\r\n\'Mongst our mourners shalt thou go.\r\n\r\nHere the anthem doth commence:\r\nLove and constancy is dead;\r\nPhoenix and the Turtle fled\r\nIn a mutual flame from hence.\r\n\r\nSo they lov\'d, as love in twain\r\nHad the essence but in one;\r\nTwo distincts, division none:\r\nNumber there in love was slain.\r\n\r\nHearts remote, yet not asunder;\r\nDi

In [186]:
# POS distribution
target_pos_dist = get_reference_pos_distribution(train_texts) 
print(target_pos_dist)

{'VERB': 0.13404029366044357, 'DET': 0.06707401536022656, 'NOUN': 0.21866005879365275, 'ADP': 0.09025287427085096, 'ADJ': 0.08058732088713773, 'PROPN': 0.05245255721607437, 'CCONJ': 0.050697982238776106, 'AUX': 0.051898480907453866, 'ADV': 0.05680821264217444, 'PART': 0.021947578224801072, 'SCONJ': 0.032644329182891355, 'PRON': 0.13283979499176582, 'NUM': 0.004740430640419867, 'INTJ': 0.004925122743293368, 'X': 0.00032321118002862727, 'PUNCT': 0.00010773706000954242}


In [172]:
# Load and Clean Texts

from pathlib import Path
import pandas as pd

samples_path = Path("samples.txt")
with open(samples_path, "r", encoding="ISO-8859-1") as f:
    lines = f.readlines()

texts = []
current = []
for line in lines:
    if line.startswith('--- Prompt:'):
        if current:
            texts.append(" ".join(current).strip())
            current = []
    else:
        current.append(line.strip())
if current:
    texts.append(" ".join(current).strip())

generated_texts_input = pd.DataFrame({'text': texts})

generated_texts = generated_texts_input['text'].tolist()

Unnamed: 0,text
0,"My sweet lady, your eyes are like the stars in..."
1,O fairest rose that bloomed in summer's garden...
2,"O, thou my soul's most radiant star, Thou art ..."
3,"Ah, sweet delight, my heart's true guide, To t..."
4,"Come, gentle breeze, that whispers low, Softly..."
5,"Bright as the sun, my dearest love, You shine ..."
6,"O, how thy beauty doth amaze my sight, And mak..."
7,"Soft as the dawn, thy grace appears, And all t..."
8,"My true love's gaze, a heaven's light, And all..."
9,"When first I saw thy heavenly face, Thou dids..."


In [188]:
# Novelty/Divergence via Term-Frequency Cosine Similarity

from collections import Counter
import math
import pandas as pd

def max_cosine_to_corpus(generated_texts, train_texts):
    """
    For each generated text, compute the maximum cosine similarity
    against all training texts using term-frequency vectors.
    Returns a list of max similarities.
    """
    # Build term-frequency counters
    gen_tfs = [Counter(txt.lower().split()) for txt in generated_texts]
    train_tfs = [Counter(txt.lower().split()) for txt in train_texts]
    
    # Precompute norms
    gen_norms   = [math.sqrt(sum(v*v for v in tf.values())) for tf in gen_tfs]
    train_norms = [math.sqrt(sum(v*v for v in tf.values())) for tf in train_tfs]
    
    max_sims = []
    for i, gen_tf in enumerate(gen_tfs):
        best = 0.0
        for j, train_tf in enumerate(train_tfs):
            # dot product
            dot = sum(gen_tf[w] * train_tf.get(w, 0) for w in gen_tf)
            denom = gen_norms[i] * train_norms[j] if gen_norms[i] and train_norms[j] else 1.0
            sim = dot / denom
            if sim > best:
                best = sim
        max_sims.append(best)
    return max_sims



In [190]:
def compute_style_metrics_en(texts, target_pos_dist):
    """
    Compute style metrics :
      - TTR (Type-Token Ratio)
      - Simpson Diversity Index
      - KL Divergence between POS distributions
    """
    results = []
    for text in texts:
        # Tokenize and keep only alphabetic tokens (lowercased)
        doc = nlp(text)
        tokens = [token.text.lower() for token in doc if token.is_alpha]
        N = len(tokens)

        # Compute Type-Token Ratio
        unique_tokens = len(set(tokens))
        ttr = unique_tokens / N if N else 0.0

        # Compute Simpson Diversity Index
        freq = Counter(tokens)
        simpson = (
            1.0 - sum(f * (f - 1) / (N * (N - 1)) for f in freq.values())
            if N > 1 else 0.0
        )

        # Extract POS tags and compute their distribution
        pos_tags = [token.pos_ for token in doc if token.is_alpha]
        total_pos = len(pos_tags)
        pos_count = Counter(pos_tags)
        pos_dist = {
            pos: pos_count.get(pos, 0) / total_pos
            for pos in target_pos_dist
        }

        # Compute KL Divergence between the text's POS distribution and the target
        kl_div = sum(
            p * math.log(p / target_pos_dist[pos])
            for pos, p in pos_dist.items()
            if p > 0 and target_pos_dist.get(pos, 0) > 0
        )

        # Compute novelty/divergence via term-frequency cosine similarity
        novelty_score = max_cosine_to_corpus(text, train_texts)[0]


        results.append({
            'text': text,
            'ttr': ttr,
            'simpson_diversity': simpson,
            'pos_kl_divergence': kl_div,
            'novelty_divergence': novelty_score
        })

    return results



In [194]:
if __name__ == "__main__":
    
    #generated_texts = ["When first mine eyes beheld thy gentle face, and I saw the light of thylips, which was in the midst of my mind, as it were. I took a little while to look at the picture", "O fairest rose that bloomed in summer's garden, The fairest flower that ever grew, And the fairest of them all, the fairest rose, Was the fairest that ever bloomed" ] #input generated poem 

    # 1 Perplexity
    ppls = [calc_perplexity(txt) for txt in generated_texts]
    print("Perplexities:", ppls)
    print(f"Avg Perplexity: {sum(ppls)/len(ppls):.2f}\n")

    # 2 Distinct-1 / Distinct-2
    print(f"Distinct-1: {distinct_n(generated_texts, n=1):.4f}")
    print(f"Distinct-2: {distinct_n(generated_texts, n=2):.4f}\n")

    # 3 Self-BLEU
    avg_self_bleu = self_bleu(generated_texts, n_gram=4)
    print(f"Average Self-BLEU (up to 4-gram): {avg_self_bleu:.4f}\n")

    
    print("Style metrics for each poem: \n")

    
    # 4 Metrics (POS)
    metrics = compute_style_metrics_en(generated_texts, target_pos_dist)
    for m in metrics:
        print(f"Text: {m['text']}")
        print(f"  TTR: {m['ttr']:.4f}")
        print(f"  Simpson Diversity: {m['simpson_diversity']:.4f}")
        print(f"  POS KL Divergence: {m['pos_kl_divergence']:.4f}")
        print(f"  Novelty Score: {m['novelty_divergence']:.4f}\n")

Perplexities: [7.125983238220215, 25.986927032470703, 27.913658142089844, 47.42353820800781, 40.0672721862793, 14.553314208984375, 30.52975845336914, 28.172653198242188, 33.44835662841797, 16.580223083496094, 65.7305679321289, 34.078250885009766]
Avg Perplexity: 30.97

Distinct-1: 0.4958
Distinct-2: 0.8493

Average Self-BLEU (up to 4-gram): 0.1252

Style metrics for each poem: 

Text: My sweet lady, your eyes are like the stars in the sky." "You are so beautiful." "And I love you so much." "I love you, too." "I can't wait to see you again." "I'll be waiting
  TTR: 0.7222
  Simpson Diversity: 0.9746
  POS KL Divergence: 0.2751
  Novelty Score: 0.0000

Text: O fairest rose that bloomed in summer's garden, The fairest flower that ever grew, And the fairest of them all, the fairest rose, Was the fairest that ever bloomed
  TTR: 0.5862
  Simpson Diversity: 0.9458
  POS KL Divergence: 0.2169
  Novelty Score: 0.2887

Text: O, thou my soul's most radiant star, Thou art the light of my life, An