## Evaluation & Metrics: Assess model performance using perplexity, Distinct-n, and Self-BLEU

### Perplexity:
Perplexity measures how “surprised” a language model is by a given text. Lower perplexity means the model assigns higher probability to the text.
> - Lower scores indicate the model is better at modeling the style/content of your corpus.
> - Extremely low scores can also hint at overly safe or repetitive outputs.
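
To make the definition concrete, here is a minimal sketch that computes perplexity directly from per-token probabilities. The helper name `perplexity_from_probs` and the probability lists are hypothetical and chosen only for illustration; they are not part of this notebook's pipeline.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-probability of the tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model might assign to each token of a text
confident = perplexity_from_probs([0.5, 0.5, 0.5, 0.5])  # model is fairly sure
unsure = perplexity_from_probs([0.1, 0.1, 0.1, 0.1])     # model is "surprised"

print(f"confident: {confident:.2f}, unsure: {unsure:.2f}")
```

When every token has probability p, the perplexity comes out as 1/p, which is why the less probable text scores higher.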

### Distinct:
Distinct-n quantifies how many unique n-grams appear in the generated texts, relative to the total number of n-grams produced. It’s a simple diversity measure.
> - Distinct-1 measures word-level diversity (unigrams).
> - Distinct-2 measures phrase-level diversity (bigrams).
> - Values closer to 1.0 mean high diversity (few repeats).
> - Values closer to 0 mean the model is repeating the same words/phrases.
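
As a small worked example, the sketch below counts unique versus total n-grams on a toy sentence. The helper `distinct_ratio` and the sentence are hypothetical, and tokens are split on whitespace for simplicity.

```python
def distinct_ratio(tokens, n):
    """Share of n-grams in `tokens` that are unique (0.0 if there are none)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Toy sentence with deliberate repetition
tokens = "the rose is red the rose is sweet".split()

d1 = distinct_ratio(tokens, 1)  # 5 unique unigrams out of 8
d2 = distinct_ratio(tokens, 2)  # 5 unique bigrams out of 7

print(f"Distinct-1: {d1:.3f}, Distinct-2: {d2:.3f}")
```

The repeated phrase "the rose is" pulls both ratios well below 1.0.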

### Self-BLEU:
Self-BLEU evaluates how similar the generated samples are to one another. It reverses the usual BLEU setup: each generation is treated as the “hypothesis” and all the other generations as the “references.”
> - Scores range from 0 to 1.
> - Higher Self-BLEU means samples are very similar to each other (low diversity).
> - Lower Self-BLEU means samples are more distinct.
> - At least two samples are needed; with a single sample there is nothing to compare against.
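
Written as a formula, the description above amounts to the following, where the symbols are notation assumed here for illustration: $N$ is the number of generated samples and $s_i$ is the $i$-th sample.

$$
\text{Self-BLEU} = \frac{1}{N} \sum_{i=1}^{N} \text{BLEU}\left(s_i,\ \{\, s_j : j \neq i \,\}\right)
$$

Each sample is scored against all the others, and the scores are averaged.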


In [1]:
import math
import itertools
from collections import Counter
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

In [3]:
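# Perplexity scored with pretrained GPT-2

model_name = "gpt2"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval().to(device)

def calc_perplexity(text: str) -> float:
    """Return the GPT-2 perplexity of `text`: exp of the mean per-token negative log-likelihood.

    Texts longer than GPT-2's 1024-token context need truncation or a sliding window.
    """
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(model.device)
    with torch.no_grad():
        # With labels=input_ids the model shifts the labels internally and
        # returns the mean cross-entropy over the predicted tokens.
        outputs = model(input_ids, labels=input_ids)
    # The loss is already a per-token mean, so perplexity is simply exp(loss).
    return torch.exp(outputs.loss).item()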

In [5]:
# Distinct-n

def distinct_n(texts, n=1):
    """Distinct-n = (#unique n-grams) / (#total n-grams), pooled over all texts (whitespace tokens)."""
    total_ngrams = 0
    unique_ngrams = set()
    for txt in texts:
        tokens = txt.split()
        ngrams = zip(*[tokens[i:] for i in range(n)])
        count = 0
        for ng in ngrams:
            unique_ngrams.add(ng)
            count += 1
        total_ngrams += count
    return len(unique_ngrams) / total_ngrams if total_ngrams > 0 else 0.0

In [7]:
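# Self-BLEU

def self_bleu(texts, n_gram=4):
    """
    Self-BLEU: average BLEU of each text scored against all the other texts.
    Needs at least two texts; returns 0.0 otherwise.
    """
    if len(texts) < 2:
        return 0.0  # no other samples to compare against
    smoothie = SmoothingFunction().method4
    weights = tuple([1 / n_gram] * n_gram)  # uniform weights over 1..n_gram
    scores = []
    for i, cand in enumerate(texts):
        # Every other sample is a reference, each given as a list of tokens
        references = [t.split() for j, t in enumerate(texts) if j != i]
        scores.append(sentence_bleu(
            references=references,
            hypothesis=cand.split(),
            smoothing_function=smoothie,
            weights=weights
        ))
    return sum(scores) / len(scores)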

In [9]:
if __name__ == "__main__":
    
    generated_texts = ["When first mine eyes beheld thy gentle face, and I saw the light of thylips, which was in the midst of my mind, as it were. I took a little while to look at the picture" ] #input generated poem 

    # 4.1 Perplexity
    ppls = [calc_perplexity(txt) for txt in generated_texts]
    print("Perplexities:", ppls)
    print(f"Avg Perplexity: {sum(ppls)/len(ppls):.2f}")

    # 4.2 Distinct-1 / Distinct-2
    print(f"Distinct-1: {distinct_n(generated_texts, n=1):.4f}")
    print(f"Distinct-2: {distinct_n(generated_texts, n=2):.4f}")

    # 4.3 Self-BLEU (meaningful only with two or more samples; a single sample yields 0.0)
    sb = self_bleu(generated_texts, n_gram=4)
    print(f"Self-BLEU (up to 4-gram): {sb:.4f}")

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Perplexities: [46.377098083496094]
Avg Perplexity: 46.38
Distinct-1: 0.8889
Distinct-2: 1.0000
Self-BLEU (up to 4-gram): 0.0000


## Compute style metrics to check how closely the style of the generated poems matches the training poems

### TTR (Type–Token Ratio)
TTR is the number of unique tokens divided by the total number of tokens.
> - High TTR (closer to 1): lots of different words, so strong variety.
> - Low TTR (closer to 0): the same words are repeated more often, so less variety.
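
A minimal sketch of the ratio on two toy token lists follows. The helper `type_token_ratio` and both example lines are hypothetical and only illustrate the two ends of the scale.

```python
def type_token_ratio(tokens):
    """Unique tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Toy inputs: one line with no repeats, one dominated by a single word
varied = "shall i compare thee to a summer day".split()
repetitive = "rose rose rose red rose rose".split()

ttr_varied = type_token_ratio(varied)          # 8 unique / 8 total
ttr_repetitive = type_token_ratio(repetitive)  # 2 unique / 6 total

print(f"varied: {ttr_varied:.3f}, repetitive: {ttr_repetitive:.3f}")
```

Note that TTR tends to fall as texts get longer, so it is most comparable between texts of similar length.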

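### Simpson Diversity Index
The Simpson Diversity Index is the probability that two tokens picked at random from the text (without replacement) are different.
> - High Simpson (closer to 1): high probability that two randomly picked tokens are different, so strong diversity.
> - Low Simpson (closer to 0): high chance that two picks are the same token, so low diversity.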
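
The sketch below works through the index on three toy token lists, using the form 1 - Σ f(f-1) / (N(N-1)), where f is each token's count and N the total number of tokens. The helper `simpson_diversity` and the inputs are hypothetical.

```python
from collections import Counter

def simpson_diversity(tokens):
    """Probability that two tokens drawn without replacement differ."""
    n = len(tokens)
    if n < 2:
        return 0.0  # fewer than two tokens: no pair to draw
    freq = Counter(tokens)
    return 1.0 - sum(f * (f - 1) for f in freq.values()) / (n * (n - 1))

all_distinct = simpson_diversity(["a", "b", "c", "d"])  # every pair differs
all_same = simpson_diversity(["a", "a", "a", "a"])      # every pair matches
mixed = simpson_diversity(["a", "a", "b", "c"])         # one repeated token

print(all_distinct, all_same, round(mixed, 4))
```

Unlike TTR, the index weights tokens by how often they occur, so one heavily repeated word lowers it more than several words repeated once.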

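### POS KL Divergence
POS KL divergence compares the part-of-speech (POS) tag distribution of a generated poem with a reference distribution.
> - Low KL (near 0): the generated poem’s POS mix is very similar to the reference style.
> - High KL: the generated poem’s POS proportions deviate strongly from that style.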
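
To illustrate how the divergence behaves, here is a minimal sketch over two small tag distributions. The helper `kl_divergence` and the probabilities are hypothetical, made up for the example rather than measured from any corpus.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for dicts mapping category -> probability.

    Terms where either side has zero probability are skipped.
    """
    return sum(
        pk * math.log(pk / q[k])
        for k, pk in p.items()
        if pk > 0 and q.get(k, 0) > 0
    )

# Hypothetical reference style and a noun-heavy generated text
reference = {"NOUN": 0.5, "VERB": 0.3, "ADJ": 0.2}
noun_heavy = {"NOUN": 0.8, "VERB": 0.1, "ADJ": 0.1}

same = kl_divergence(reference, reference)     # identical distributions
skewed = kl_divergence(noun_heavy, reference)  # mismatch gives a positive value

print(f"same: {same:.4f}, skewed: {skewed:.4f}")
```

KL divergence is not symmetric, so swapping the two arguments generally gives a different value.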

In [20]:
import spacy
from spacy.cli import download as spacy_download
from collections import Counter
import math


In [22]:
spacy_download("en_core_web_sm")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [14]:
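def get_reference_pos_distribution(texts):
    """Return the relative frequency of each POS tag over all alphabetic tokens in `texts`.

    Relies on the global spaCy pipeline `nlp`, which must be loaded before this is called.
    """
    all_pos = []
    for text in texts:
        doc = nlp(text)
        all_pos.extend(token.pos_ for token in doc if token.is_alpha)
    total = len(all_pos)
    if total == 0:
        return {}  # no alphabetic tokens, so there is no distribution to report
    count = Counter(all_pos)
    return {k: v / total for k, v in count.items()}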

In [24]:
# Load the small English model
nlp = spacy.load("en_core_web_sm")

def compute_style_metrics_en(texts, target_pos_dist):
    """
    Compute style metrics :
      - TTR (Type-Token Ratio)
      - Simpson Diversity Index
      - KL Divergence between POS distributions
    """
    results = []
    for text in texts:
        # Tokenize and keep only alphabetic tokens (lowercased)
        doc = nlp(text)
        tokens = [token.text.lower() for token in doc if token.is_alpha]
        N = len(tokens)

        # Compute Type-Token Ratio
        unique_tokens = len(set(tokens))
        ttr = unique_tokens / N if N else 0.0

        # Compute Simpson Diversity Index
        freq = Counter(tokens)
        simpson = (
            1.0 - sum(f * (f - 1) / (N * (N - 1)) for f in freq.values())
            if N > 1 else 0.0
        )

        # Extract POS tags and compute their distribution over the target's tag set.
        # Tags absent from target_pos_dist are ignored, so pos_dist may sum to less than 1.
        pos_tags = [token.pos_ for token in doc if token.is_alpha]
        total_pos = len(pos_tags)
        pos_count = Counter(pos_tags)
        pos_dist = {
            pos: (pos_count.get(pos, 0) / total_pos if total_pos else 0.0)
            for pos in target_pos_dist
        }

        # Compute KL(text || target); terms with zero probability on either side are skipped
        kl_div = sum(
            p * math.log(p / target_pos_dist[pos])
            for pos, p in pos_dist.items()
            if p > 0 and target_pos_dist.get(pos, 0) > 0
        )

        results.append({
            'text': text,
            'ttr': ttr,
            'simpson_diversity': simpson,
            'pos_kl_divergence': kl_div
        })

    return results

# Example usage
if __name__ == "__main__":

    # Placeholder target POS distribution for illustration only. For a real comparison,
    # build it from the training poems with get_reference_pos_distribution(training_poem).
    target_pos_dist = {
        'NOUN': 0.30,
        'VERB': 0.25,
        'ADJ': 0.15,
        'ADV': 0.10,
        'PRON': 0.05,
        'ADP': 0.05,
        'DET': 0.05,
        'CCONJ': 0.05
    }

    metrics = compute_style_metrics_en(generated_texts, target_pos_dist)
    for m in metrics:
        print(f"Text: {m['text']}")
        print(f"  TTR: {m['ttr']:.4f}")
        print(f"  Simpson Diversity: {m['simpson_diversity']:.4f}")
        print(f"  POS KL Divergence: {m['pos_kl_divergence']:.4f}\n")

Text: When first mine eyes beheld thy gentle face, and I saw the light of thylips, which was in the midst of my mind, as it were. I took a little while to look at the picture
  TTR: 0.8889
  Simpson Diversity: 0.9921
  POS KL Divergence: 0.1913

