# Overview of metrics for text generation
## *Lexical-based similarity* (n-gram-based comparison)
1. ROUGE: [Recall-Oriented Understudy for Gisting Evaluation](https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499)
2. BLEU: [BiLingual Evaluation Understudy](https://www.geeksforgeeks.org/nlp-bleu-score-for-evaluating-neural-machine-translation-python/)

In [23]:
reference = ["The sun set behind the hills"]
candidate = ["The moon set behind the hills"]

In [24]:
import evaluate

### ROUGE: [Recall-Oriented Understudy for Gisting Evaluation](https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499)

In [25]:
rouge_metric = evaluate.load("rouge")

# ROUGE expects plain text inputs
rouge_results = rouge_metric.compute(predictions=candidate, references=reference)

# Access ROUGE scores (no need for indexing into the result)
print(f'"{reference[0]}" <<-->> "{candidate[0]}"')
print(f"--> ROUGE-1 F1 Score: {rouge_results['rouge1']:.2f}")
print(f"--> ROUGE-2 F1 Score: {rouge_results['rouge2']:.2f}")
print(f"--> ROUGE-L F1 Score: {rouge_results['rougeL']:.2f}")

"The sun set behind the hills" <<-->> "The moon set behind the hills"
--> ROUGE-1 F1 Score: 0.83
--> ROUGE-2 F1 Score: 0.60
--> ROUGE-L F1 Score: 0.83
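
To make the three scores above less opaque, here is a minimal pure-Python sketch that recomputes them by hand. It assumes lowercasing and whitespace tokenization, which is simpler than what the `rouge` metric does internally; the helper names (`ngrams`, `rouge_n_f1`, `lcs_length`, `rouge_l_f1`) are illustrative and not part of `evaluate`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(reference, candidate, n):
    """F1 over the clipped n-gram overlap of two whitespace-tokenized strings."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # clipped counts of shared n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l_f1(reference, candidate):
    """F1 based on the longest common subsequence of the two token lists."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

ref_text = "The sun set behind the hills"
cand_text = "The moon set behind the hills"
print(f"ROUGE-1 F1: {rouge_n_f1(ref_text, cand_text, 1):.2f}")  # 5 of 6 unigrams match -> 0.83
print(f"ROUGE-2 F1: {rouge_n_f1(ref_text, cand_text, 2):.2f}")  # 3 of 5 bigrams match -> 0.60
print(f"ROUGE-L F1: {rouge_l_f1(ref_text, cand_text):.2f}")     # LCS of length 5 over 6 tokens -> 0.83
```

Swapping a single word breaks two of the five bigrams, which is why ROUGE-2 drops faster than ROUGE-1 and ROUGE-L.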


### BLEU: [BiLingual Evaluation Understudy](https://www.geeksforgeeks.org/nlp-bleu-score-for-evaluating-neural-machine-translation-python/)

In [26]:
bleu_metric = evaluate.load("bleu")

# BLEU expects plain text inputs
bleu_results = bleu_metric.compute(predictions=candidate, references=reference)
print(f'"{reference[0]}" <<-->> "{candidate[0]}"')
print(f"--> BLEU Score: {bleu_results['bleu'] * 100:.2f}")

"The sun set behind the hills" <<-->> "The moon set behind the hills"
--> BLEU Score: 53.73
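
The BLEU score above can be reproduced by hand as the geometric mean of the 1- to 4-gram precisions, multiplied by a brevity penalty. The sketch below assumes a single reference, whitespace tokenization, and no smoothing; `simple_bleu` and `ngram_counts` are illustrative names, not the `evaluate` implementation.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference, candidate, max_order=4):
    """Single-reference, unsmoothed BLEU. Returns (score, per-order precisions)."""
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_order + 1):
        cand = ngram_counts(cand_tokens, n)
        ref = ngram_counts(ref_tokens, n)
        matched = sum((cand & ref).values())  # clipped matches
        total = sum(cand.values())
        precisions.append(matched / total if total else 0.0)
    # Without smoothing, a single zero precision zeroes the geometric mean
    if min(precisions) == 0:
        return 0.0, precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    # Brevity penalty: only candidates shorter than the reference are penalized
    ref_len, cand_len = len(ref_tokens), len(cand_tokens)
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * geo_mean, precisions

score, precisions = simple_bleu("The sun set behind the hills", "The moon set behind the hills")
print([round(p, 2) for p in precisions])  # [0.83, 0.6, 0.5, 0.33]
print(f"BLEU: {score * 100:.2f}")         # 53.73
```

The precisions are 5/6, 3/5, 2/4 and 1/3; their product is 1/12, and its fourth root is about 0.537. The two sentences have the same length, so the brevity penalty is 1.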


## Semantic-based similarity (embedding-based comparison)
1. [Cosine similarity](https://spencerporter2.medium.com/understanding-cosine-similarity-and-word-embeddings-dbf19362a3c)
2. [BERTScore](https://medium.com/@abonia/bertscore-explained-in-5-minutes-0b98553bfb71)


### Cosine similarity 

In [28]:
from sentence_transformers import SentenceTransformer

In [29]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

In [30]:
reference_embedding = embedding_model.encode(reference)
candidate_embedding = embedding_model.encode(candidate)

similarity = embedding_model.similarity(reference_embedding, candidate_embedding)
print(f'"{reference[0]}" <<-->> "{candidate[0]}"')
print(f"--> Cosine Similarity: {similarity[0][0]:.2f}")

"The sun set behind the hills" <<-->> "The moon set behind the hills"
--> Cosine Similarity: 0.71
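
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below uses hypothetical 2-dimensional vectors rather than real sentence embeddings; `cosine` is an illustrative helper, not part of `sentence_transformers`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine([1.0, 2.0], [2.0, 4.0]), 2))   # same direction -> 1.0
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 2))   # orthogonal -> 0.0
print(round(cosine([1.0, 0.0], [-1.0, 0.0]), 2))  # opposite direction -> -1.0
```

Because the vector lengths cancel out, scaling an embedding does not change its score.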


### BERTScore

In [27]:
bertscore = evaluate.load("bertscore")

# BERTScore
bert_result = bertscore.compute(predictions=candidate, references=reference, model_type="bert-base-uncased")
print(f'"{reference[0]}" <<-->> "{candidate[0]}"')
print(f"--> BERTScore (F1): {bert_result['f1'][0]:.2f}")

"The sun set behind the hills" <<-->> "The moon set behind the hills"
--> BERTScore (F1): 0.92
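
Conceptually, BERTScore compares contextual token embeddings: each candidate token is matched to its most similar reference token (precision), and each reference token to its most similar candidate token (recall). The sketch below shows only this greedy matching step on a hypothetical, hand-written similarity matrix; it leaves out the optional IDF weighting and baseline rescaling, and `greedy_match_scores` is an illustrative name.

```python
def greedy_match_scores(sim):
    """sim[i][j] = cosine similarity between candidate token i and reference token j.

    Returns (precision, recall, f1) using greedy best-match per token.
    """
    # Precision: best reference match for every candidate token (row-wise max)
    precision = sum(max(row) for row in sim) / len(sim)
    # Recall: best candidate match for every reference token (column-wise max)
    recall = sum(max(col) for col in zip(*sim)) / len(sim[0])
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical similarities: 2 candidate tokens (rows) x 3 reference tokens (columns)
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.4],
]
p, r, f1 = greedy_match_scores(toy_sim)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.85 R=0.70 F1=0.77
```

The third reference token has no close counterpart in the candidate, so it lowers recall while precision is unaffected.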


#### Information retrieval based on cosine similarity

In [31]:
query = "How many people live in London?"

In [32]:
docs = [
    "London is known for its financial district",
    "London has 9,787,426 inhabitants at the 2011 census",
    "The United Kingdom is the fourth largest exporter of goods in the world",
]

In [33]:
query_embedding = embedding_model.encode(query)
doc_embeddings = embedding_model.encode(docs)

# Compute cosine similarities
similarities = embedding_model.similarity(query_embedding, doc_embeddings).squeeze()
similarities = dict(sorted(zip(docs, similarities), key=lambda x: x[1], reverse=True))

for doc, score in similarities.items():
    print(f"[Score: {score.item():.2f}] {doc}")

[Score: 0.75] London has 9,787,426 inhabitants at the 2011 census
[Score: 0.46] London is known for its financial district
[Score: 0.26] The United Kingdom is the fourth largest exporter of goods in the world
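
The ranking step can be sketched without any model. The example assumes hypothetical 3-dimensional document vectors: every vector is L2-normalized, so a plain dot product equals the cosine similarity, and the `k` best documents are returned. `normalize` and `top_k` are illustrative helpers.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, doc_vecs, docs, k=2):
    """Return the k (score, document) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for doc, vec in zip(docs, doc_vecs):
        d = normalize(vec)
        # For unit vectors the dot product is the cosine similarity
        scored.append((sum(a * b for a, b in zip(q, d)), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical toy vectors, not real sentence embeddings
toy_docs = ["doc about finance", "doc about population", "doc about exports"]
toy_doc_vecs = [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1], [0.1, 0.0, 0.9]]
toy_query_vec = [0.1, 1.0, 0.0]

for score, doc in top_k(toy_query_vec, toy_doc_vecs, toy_docs, k=2):
    print(f"[Score: {score:.2f}] {doc}")
```

Keeping the results as a list of `(score, document)` pairs, rather than a dictionary keyed by document text, preserves duplicate documents and makes it easy to cut the list at `k`.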


# Real-world use cases with pre-trained LLMs
1. Machine Translation (MT)
2. Retrieval-Augmented Generation (RAG)
3. Question-answering (QA)

### Load the pre-trained LLM
* allenai/OLMo-2-0425-1B-Instruct
* microsoft/Phi-4-mini-instruct
* Qwen/Qwen3-4B-Instruct-2507
* HuggingFaceTB/SmolLM3-3B

In [34]:
model_name = "allenai/OLMo-2-0425-1B-Instruct"

#### Load the LLM using the pipeline function (a high-level helper)

In [35]:
from transformers import pipeline
pipe = pipeline(task = "text-generation", model = model_name)

Device set to use cuda:0


### (1) Machine Translation

In [36]:
mt_task = "You are a professional translator. Your task is to translate the following text into Italian. Provide only the translation, without explanations or additional commentary.\nTEXT:"

In [37]:
mt_docs = ['Life is full of surprises.', 'They played soccer last weekend.', 'We are going to the park later.']

outputs = []
for doc in mt_docs:
    
    # Create the prompt by combining the task description with the document
    prompt = f"{mt_task} {doc}\nTRANSLATION: "
    
    # Generate the output using the pipeline
    result = pipe([prompt], return_full_text = False)
    generated_text = result[0][0]['generated_text']
    
    # Keep only the first generated line and strip surrounding whitespace
    generated_text = generated_text.split("\n")[0].strip()
    print(f'Input: "{doc}"\nOutput: "{generated_text}"\n')
        
    # Store the generated output
    outputs.append(generated_text)

Input: "Life is full of surprises."
Output: "La vita è piena di sorprendenti."

Input: "They played soccer last weekend."
Output: "Filiali hanno partecipato al calcio del Weekend ultimo."

Input: "We are going to the park later."
Output: "1. Siamo andate al parco dopo."



#### Evaluate the outputs using the aforementioned metrics 

In [38]:
mt_targets = ["La vita è piena di sorprese.", "Hanno giocato a calcio lo scorso fine settimana.", "Andremo al parco più tardi."]
stats = {}
for output, target in zip(outputs, mt_targets):
    
    print(f'\nLLM OUTPUT: "{output}" <<-->> TARGET: "{target}"')
    
    # Lexical metrics (ngram-based)
    bleu_results = bleu_metric.compute(predictions=[output], references=[target])
    print(f"--> BLEU Score: {bleu_results['bleu']:.1%}")
    
    # Semantic metrics (embedding-based)
    bert_result = bertscore.compute(predictions=[output], references=[target], model_type="bert-base-uncased")
    print(f"--> BERTScore (F1): {bert_result['f1'][0]:.2f}")
     
    cosine_similarity = embedding_model.similarity(embedding_model.encode(target), embedding_model.encode(output)).squeeze()
    print(f"--> Cosine Similarity: {cosine_similarity:.2f}")
    
    stats[target] = {'blue': bleu_results['bleu'], 'bertscore': bert_result['f1'][0], 'cosine': cosine_similarity}
    


LLM OUTPUT: "La vita è piena di sorprendenti." <<-->> TARGET: "La vita è piena di sorprese."
--> BLEU Score: 64.3%
--> BERTScore (F1): 0.91
--> Cosine Similarity: 0.94

LLM OUTPUT: "Filiali hanno partecipato al calcio del Weekend ultimo." <<-->> TARGET: "Hanno giocato a calcio lo scorso fine settimana."
--> BLEU Score: 0.0%
--> BERTScore (F1): 0.67
--> Cosine Similarity: 0.61

LLM OUTPUT: "1. Siamo andate al parco dopo." <<-->> TARGET: "Andremo al parco più tardi."
--> BLEU Score: 0.0%
--> BERTScore (F1): 0.72
--> Cosine Similarity: 0.58
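
The 0.0% BLEU scores above do not mean that the outputs share no words with the targets. Unsmoothed BLEU is a geometric mean of n-gram precisions, so it collapses to zero as soon as one order (typically 3-grams or 4-grams on short sentences) has no match. The sketch below prints the per-order precisions for the third pair; it assumes a simple regex tokenizer that separates words from punctuation, which may differ from the tokenizer used by `evaluate`, and the names `tokenize` and `ngram_precisions` are illustrative.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: words and punctuation marks become separate tokens
    return re.findall(r"\w+|[^\w\s]", text)

def ngram_precisions(reference, candidate, max_order=4):
    """Clipped n-gram precision of the candidate for orders 1..max_order."""
    ref_tokens, cand_tokens = tokenize(reference), tokenize(candidate)
    precisions = []
    for n in range(1, max_order + 1):
        cand = Counter(tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1))
        ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
        total = sum(cand.values())
        matched = sum((cand & ref).values())
        precisions.append(matched / total if total else 0.0)
    return precisions

mt_target_example = "Andremo al parco più tardi."
mt_output_example = "1. Siamo andate al parco dopo."
for order, precision in enumerate(ngram_precisions(mt_target_example, mt_output_example), start=1):
    print(f"{order}-gram precision: {precision:.2f}")
```

Here the only shared bigram is "al parco" and no trigram matches, so the overall score is zero. For sentence-level comparisons like this one, a smoothed variant (the `evaluate` BLEU metric exposes a `smooth` option) or the embedding-based metrics tend to be more informative.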


### (2) Retrieval-Augmented Generation (RAG)
This example uses the [Stanford Question Answering Dataset](https://rajpurkar.github.io/SQuAD-explorer/) (SQuAD), a benchmark for extractive question answering:
* questions are generated by crowdworkers over Wikipedia articles;
* each answer is a text span within the input context.

In [39]:
squad_docs = [
    {
        "question": "How do asset prices generally move in relation to interest rates?",
        "answer": "inversely",
        "context": "The Fed then raised the Fed funds rate significantly between July 2004 and July 2006. This contributed to an increase in 1-year and 5-year adjustable-rate mortgage (ARM) rates, making ARM interest rate resets more expensive for homeowners. This may have also contributed to the deflating of the housing bubble, as asset prices generally move inversely to interest rates, and it became riskier to speculate in housing. U.S. housing and financial assets dramatically declined in value after the housing bubble burst."
    },{
        "question": "How can climate changes be determined from soil?",
        "answer": "fossil pollen deposits in sediments",
        "context": "Plant responses to climate and other environmental changes can inform our understanding of how these changes affect ecosystem function and productivity. For example, plant phenology can be a useful proxy for temperature in historical climatology, and the biological impact of climate change and global warming. Palynology, the analysis of fossil pollen deposits in sediments from thousands or millions of years ago allows the reconstruction of past climates. Estimates of atmospheric CO2 concentrations since the Palaeozoic have been obtained from stomatal densities and the leaf shapes and sizes of ancient land plants. Ozone depletion can expose plants to higher levels of ultraviolet radiation-B (UV-B), resulting in lower growth rates. Moreover, information from studies of community ecology, plant systematics, and taxonomy is essential to understanding vegetation change, habitat destruction and species extinction.",
    },{
        "question": "In what year did Miami's government declare bankruptcy?",
        "answer": "2001",
        "context": "According to the U.S. Census Bureau, in 2004, Miami had the third highest incidence of family incomes below the federal poverty line in the United States, making it the third poorest city in the USA, behind only Detroit, Michigan (ranked #1) and El Paso, Texas (ranked #2). Miami is also one of the very few cities where its local government went bankrupt, in 2001. However, since that time, Miami has experienced a revival: in 2008, Miami was ranked as \"America's Cleanest City\" according to Forbes for its year-round good air quality, vast green spaces, clean drinking water, clean streets and city-wide recycling programs. In a 2009 UBS study of 73 world cities, Miami was ranked as the richest city in the United States (of four U.S. cities included in the survey) and the world's fifth-richest city, in terms of purchasing power.",
    },{
        "question": "Which English philosopher wrote Leviathan in 1651?",
        "answer": "Thomas Hobbes",
        "context": "John Locke, one of the most influential Enlightenment thinkers, based his governance philosophy in social contract theory, a subject that permeated Enlightenment political thought. The English philosopher Thomas Hobbes ushered in this new debate with his work Leviathan in 1651. Hobbes also developed some of the fundamentals of European liberal thought: the right of the individual; the natural equality of all men; the artificial character of the political order (which led to the later distinction between civil society and the state); the view that all legitimate political power must be \"representative\" and based on the consent of the people; and a liberal interpretation of law which leaves people free to do whatever the law does not explicitly forbid.",
    },
]

In [40]:
task = "Answer the question based on the context provided. Extract the text span from the context."

outputs = []
for doc in squad_docs:
    
    # Create the prompt by combining the task description with the document
    prompt = f"{task}\nQUESTION: {doc['question']}\nCONTEXT: {doc['context']}\nANSWER:"
    
    # Generate the output using the pipeline
    result = pipe([prompt], return_full_text = False)
    generated_text = result[0][0]['generated_text']
    
    # Keep only the first generated line and strip surrounding whitespace
    generated_text = generated_text.split("\n")[0].strip()
    print(f'QUESTION: "{doc["question"]}"\nLLM OUTPUT: "{generated_text}"\n')
        
    # Store the generated output
    outputs.append(generated_text)

QUESTION: "How do asset prices generally move in relation to interest rates?"
LLM OUTPUT: "inversely to interest rates"

QUESTION: "How can climate changes be determined from soil?"
LLM OUTPUT: "Plant responses to climate and other environmental changes can inform our understanding of how these changes affect ecosystem function and productivity. For example, plant phenology can be a useful proxy for temperature in historical climatology, and the biological impact of climate change and global warming. Palynology, the analysis of fossil pollen deposits in sediments from thousands or millions of years ago allows the reconstruction of past climates. Estimates of atmospheric CO2 concentrations since the Palaeozoic have been obtained from stomatal densities and the leaf shapes and sizes of ancient land plants. Ozone depletion can expose plants to higher levels of ultraviolet radiation-B (UV-B), resulting in lower growth rates. Moreover, information from studies of community ecology, plant sy

In [41]:
squad_targets = [doc["answer"] for doc in squad_docs]
for output, target in zip(outputs, squad_targets):
    
    print(f'\nLLM OUTPUT: "{output}" <<-->> TARGET: "{target}"')
    
    # Lexical metrics (ngram-based)
    bleu_results = bleu_metric.compute(predictions=[output], references=[target])
    print(f"--> BLEU Score: {bleu_results['bleu'] * 100:.2f}")
    
    exact_match = target == output
    print('--> Exact Match (EM):', exact_match)
    
    mentioned = target in output
    print('--> Is mentioned:', mentioned)
    
    # Semantic metrics (embedding-based)
    bert_result = bertscore.compute(predictions=[output], references=[target], model_type="bert-base-uncased")
    print(f"--> BERTScore (F1): {bert_result['f1'][0]:.2f}")
     
    cosine_similarity = embedding_model.similarity(embedding_model.encode(target), embedding_model.encode(output)).squeeze()
    print(f"--> Cosine Similarity: {cosine_similarity:.2f}")


LLM OUTPUT: "inversely to interest rates" <<-->> TARGET: "inversely"
--> BLEU Score: 0.00
--> Exact Match (EM): False
--> Is mentioned: True
--> BERTScore (F1): 0.64
--> Cosine Similarity: 0.57

LLM OUTPUT: "Plant responses to climate and other environmental changes can inform our understanding of how these changes affect ecosystem function and productivity. For example, plant phenology can be a useful proxy for temperature in historical climatology, and the biological impact of climate change and global warming. Palynology, the analysis of fossil pollen deposits in sediments from thousands or millions of years ago allows the reconstruction of past climates. Estimates of atmospheric CO2 concentrations since the Palaeozoic have been obtained from stomatal densities and the leaf shapes and sizes of ancient land plants. Ozone depletion can expose plants to higher levels of ultraviolet radiation-B (UV-B), resulting in lower growth rates. Moreover, information from studies of community eco
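
Strict string equality is a harsh criterion for extractive QA: "inversely to interest rates" contains the target but still fails the exact match. A common remedy, in the spirit of the SQuAD evaluation, is to normalize both strings before comparing them and to add a token-level F1 that gives partial credit. The sketch below is an approximation under those assumptions; the helper names are illustrative.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_normalized(prediction, target):
    """Exact match after normalization of both strings."""
    return normalize_answer(prediction) == normalize_answer(target)

def token_f1(prediction, target):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    target_tokens = normalize_answer(target).split()
    common = sum((Counter(pred_tokens) & Counter(target_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(target_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match_normalized("Thomas Hobbes.", "thomas hobbes"))      # True
print(f"{token_f1('inversely to interest rates', 'inversely'):.2f}")  # 0.40
```

Token F1 rewards the short, mostly correct first answer while still penalizing outputs that copy the whole context, whose precision approaches zero.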

# Prompt engineering
The prompts used so far are *zero-shot*: the model sees only the task description. This section moves to *in-context learning* (ICL).

### In-context learning: include examples in the prompt to guide the model's behavior

In [42]:
examples = [
  ('Rome is the capital of Italy.','Roma è la capitale d’Italia.'),
  ('Where is the train station?', 'Dove si trova la stazione dei treni?'),
  ('We are going to the market tomorrow, do you want to come with us?', 'Andremo al mercato domani, vuoi venire con noi?'),
]

# Few-shot conversation: the task description is stated once, in the first user turn
main_conversation = [
  {"role": "user", "content": f"{mt_task} {examples[0][0]}\nTRANSLATION:"},
  {"role": "assistant", "content": examples[0][1]},
  {"role": "user", "content": f"{examples[1][0]}\nTRANSLATION:"},
  {"role": "assistant", "content": examples[1][1]},
  {"role": "user", "content": f"{examples[2][0]}\nTRANSLATION:"},
  {"role": "assistant", "content": examples[2][1]},
]

outputs = []
for doc in mt_docs:
    
    # Copy the main conversation and append the new user input
    conversation = main_conversation.copy()
    conversation.append({"role": "user", "content": f"{doc}\nTRANSLATION:"})
    
    result = pipe(conversation, return_full_text = False)
    generated_text = result[0]['generated_text']
    
    # Keep only the first generated line and strip surrounding whitespace
    generated_text = generated_text.split("\n")[0].strip()
        
    # Store the generated output
    outputs.append(generated_text)
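
The hand-written conversation above can also be generated from the list of example pairs, which makes it easy to vary the number of shots. The following is a minimal sketch that assumes the same message layout as above (task description in the first user turn only, `TRANSLATION:` as the cue); `build_few_shot_conversation` is a hypothetical helper, not part of `transformers`.

```python
def build_few_shot_conversation(task, example_pairs, new_input):
    """Build chat messages: solved examples as user/assistant turns, then the new input.

    The task description is prepended to the first user turn only.
    With no examples, the result is a plain zero-shot prompt.
    """
    sources = [source for source, _ in example_pairs] + [new_input]
    replies = [reply for _, reply in example_pairs]
    messages = []
    for index, source in enumerate(sources):
        prefix = f"{task} " if index == 0 else ""
        messages.append({"role": "user", "content": f"{prefix}{source}\nTRANSLATION:"})
        # Every example has a reply; the final (new) input does not
        if index < len(replies):
            messages.append({"role": "assistant", "content": replies[index]})
    return messages

demo_pairs = [
    ("Rome is the capital of Italy.", "Roma è la capitale d'Italia."),
    ("Where is the train station?", "Dove si trova la stazione dei treni?"),
]
demo_messages = build_few_shot_conversation("Translate into Italian.", demo_pairs, "Life is full of surprises.")
for message in demo_messages:
    print(message["role"], "->", repr(message["content"]))
```

The returned list ends with a user turn, so it can be passed to a chat-style text-generation pipeline in the same way as `conversation` above.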

In [43]:
for output, target in zip(outputs, mt_targets):
    print(f'\nLLM OUTPUT: "{output}" <<-->> TARGET: "{target}"')
    
    # Lexical metrics (ngram-based)
    bleu_results = bleu_metric.compute(predictions=[output], references=[target])
    print(f"--> BLEU Score: {bleu_results['bleu']:.1%} --> Δ: {bleu_results['bleu'] - stats[target]['blue']:.1%}")

    # Semantic metrics (embedding-based)
    bert_result = bertscore.compute(predictions=[output], references=[target], model_type="bert-base-uncased")
    print(f"--> BERTScore (F1): {bert_result['f1'][0]:.2f} --> Δ: {bert_result['f1'][0]- stats[target]['bertscore']:.2f}")
     
    cosine_similarity = embedding_model.similarity(embedding_model.encode(target), embedding_model.encode(output)).squeeze()
    print(f"--> Cosine Similarity: {cosine_similarity:.2f}  --> Δ: {cosine_similarity- stats[target]['cosine']:.2f}")


LLM OUTPUT: "La vita è piena di sorprise." <<-->> TARGET: "La vita è piena di sorprese."
--> BLEU Score: 64.3% --> Δ: 0.0%
--> BERTScore (F1): 0.93 --> Δ: 0.02
--> Cosine Similarity: 0.95  --> Δ: 0.00

LLM OUTPUT: "Entrarono a calcio l'ultimo weekend." <<-->> TARGET: "Hanno giocato a calcio lo scorso fine settimana."
--> BLEU Score: 0.0% --> Δ: 0.0%
--> BERTScore (F1): 0.68 --> Δ: 0.00
--> Cosine Similarity: 0.57  --> Δ: -0.04

LLM OUTPUT: "Andremo al parco dopo." <<-->> TARGET: "Andremo al parco più tardi."
--> BLEU Score: 0.0% --> Δ: 0.0%
--> BERTScore (F1): 0.86 --> Δ: 0.15
--> Cosine Similarity: 0.78  --> Δ: 0.20
