#### ROUGE Score

ROUGE measures lexical overlap: it counts how many words (or word sequences) in the generated text also appear in the reference text.
*   ROUGE scores range from 0 (no overlap) to 1 (perfect overlap)
*   Three variants:
  1.   ROUGE-1: Overlap of individual words (unigrams)
  2.   ROUGE-2: Overlap of two-word sequences (bigrams)
  3.   ROUGE-L: Longest common subsequence (words in the same order, not necessarily adjacent)
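
To make the three variants concrete, here is a minimal pure-Python sketch of the underlying counting. It is an illustration only, not the `rouge_score` implementation: the helper names (`tokenize`, `ngrams`, `prf`, `rouge_n`, `lcs_length`, `rouge_l`) are made up for this example, and the tokenizer is a simplifying assumption (lowercasing, no stemming).

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (assumption: no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(matches: int, candidate_total: int, reference_total: int) -> dict:
    """Turn a match count into precision, recall, and F1."""
    precision = matches / candidate_total if candidate_total else 0.0
    recall = matches / reference_total if reference_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def rouge_n(candidate: str, reference: str, n: int) -> dict:
    """ROUGE-N: overlap of n-grams, each counted at most as often as it occurs in both texts."""
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    matches = sum((cand & ref).values())  # Counter intersection = clipped overlap
    return prf(matches, sum(cand.values()), sum(ref.values()))


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (dynamic programming, row by row)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> dict:
    """ROUGE-L: longest common subsequence relative to each text's length."""
    cand, ref = tokenize(candidate), tokenize(reference)
    return prf(lcs_length(cand, ref), len(cand), len(ref))


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
print("ROUGE-1:", rouge_n(candidate, reference, 1))
print("ROUGE-2:", rouge_n(candidate, reference, 2))
print("ROUGE-L:", rouge_l(candidate, reference))
```

In this toy pair, 5 of the 6 unigrams and 3 of the 5 bigrams match, and the longest common subsequence ("the cat on the mat") has 5 words, so changing a single word costs ROUGE-2 more than ROUGE-1 or ROUGE-L.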

In [2]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=820aeefda40293bbd5ee7ca6296608aca3acdad1e2ee8c9ccb4a05f6bc9c2589
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [3]:
from rouge_score import rouge_scorer

In [6]:
# sample answers to evaluate using ROUGE
# A good answer (what we hope the LLM produces)
GOOD_ANSWER = """
This cyber liability policy covers data breaches, ransomware attacks,
and business interruption losses up to $5 million per occurrence.
It excludes war, nuclear events, and intentional acts by the insured.
"""

# A bad answer (vague, incomplete)
BAD_ANSWER = """
The policy covers various cyber risks and has some exclusions
related to certain events. Coverage is available for incidents.
"""

# A hallucinated answer (makes up facts not in the document)
HALLUCINATED_ANSWER = """
This cyber policy covers data breaches and ransomware attacks up to $10 million.
It also covers physical damage to hardware and employee theft.
Exclusions include acts of God and pandemics.
"""

In [22]:
def get_rouge_score(generated_output: str, reference_doc: str) -> dict:
  """
  Computes ROUGE-1, ROUGE-2, and ROUGE-L scores for the LLM's generated output
  against the reference document (ground truth).
  Returns a dictionary containing, for each variant:
  precision: of all words generated by the LLM, the fraction that also appear in the reference
  recall: of all words in the reference document, the fraction that also appear in the generated output
  f1_score: harmonic mean of precision and recall
  """

  scorer = rouge_scorer.RougeScorer(
      ['rouge1', 'rouge2', 'rougeL'],
      use_stemmer=True
      )
  # RougeScorer.score expects (target, prediction): the reference comes first.
  # Passing the arguments the other way round swaps precision and recall.
  scores = scorer.score(reference_doc, generated_output)

  return {
      "ROUGE_1":
       { "precision": scores["rouge1"].precision,
         "recall": scores["rouge1"].recall,
         "f1_score": scores["rouge1"].fmeasure,
       },
      "ROUGE_2":
       { "precision": scores["rouge2"].precision,
         "recall": scores["rouge2"].recall,
         "f1_score": scores["rouge2"].fmeasure,
       },
      "ROUGE_L":
       { "precision": scores["rougeL"].precision,
         "recall": scores["rougeL"].recall,
         "f1_score": scores["rougeL"].fmeasure,
       }
  }


In [23]:
def print_rouge_scorer_response(answers: dict[str, str], reference_doc: str) -> None:
  for label, answer in answers.items():
    print("Evaluating: ", label)
    scores = get_rouge_score(answer, reference_doc)
    print(scores)


In [26]:
if __name__=="__main__":
  answers_to_evaluate = {
      "Good_Answer": GOOD_ANSWER,
      "Bad_Answer": BAD_ANSWER,
      "Hallucinated_Answer": HALLUCINATED_ANSWER,
  }
  print_rouge_scorer_response(answers_to_evaluate, reference_doc=GOOD_ANSWER)

Evaluating:  Good_Answer
{'ROUGE_1': {'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0}, 'ROUGE_2': {'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0}, 'ROUGE_L': {'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0}}
Evaluating:  Bad_Answer
{'ROUGE_1': {'precision': 0.3684210526315789, 'recall': 0.23333333333333334, 'f1_score': 0.2857142857142857}, 'ROUGE_2': {'precision': 0.05555555555555555, 'recall': 0.034482758620689655, 'f1_score': 0.0425531914893617}, 'ROUGE_L': {'precision': 0.2631578947368421, 'recall': 0.16666666666666666, 'f1_score': 0.2040816326530612}}
Evaluating:  Hallucinated_Answer
{'ROUGE_1': {'precision': 0.5, 'recall': 0.5, 'f1_score': 0.5}, 'ROUGE_2': {'precision': 0.20689655172413793, 'recall': 0.20689655172413793, 'f1_score': 0.20689655172413793}, 'ROUGE_L': {'precision': 0.4666666666666667, 'recall': 0.4666666666666667, 'f1_score': 0.4666666666666667}}
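
The raw dictionaries above are hard to compare by eye. As a rough sketch, the F1 values can be laid out side by side and ranked. The numbers below are copied (rounded to four decimals) from the output above; `format_f1_table` and `ranking` are hypothetical helper names introduced only for this illustration.

```python
# F1 scores copied from the printed output above, rounded to 4 decimals.
f1_scores = {
    "Good_Answer": {"ROUGE_1": 1.0, "ROUGE_2": 1.0, "ROUGE_L": 1.0},
    "Bad_Answer": {"ROUGE_1": 0.2857, "ROUGE_2": 0.0426, "ROUGE_L": 0.2041},
    "Hallucinated_Answer": {"ROUGE_1": 0.5, "ROUGE_2": 0.2069, "ROUGE_L": 0.4667},
}

METRICS = ["ROUGE_1", "ROUGE_2", "ROUGE_L"]


def format_f1_table(scores: dict[str, dict[str, float]]) -> str:
    """Render one row per answer and one column per ROUGE variant."""
    header = f"{'answer':<22}" + "".join(f"{m:>10}" for m in METRICS)
    rows = [header]
    for label, values in scores.items():
        rows.append(f"{label:<22}" + "".join(f"{values[m]:>10.4f}" for m in METRICS))
    return "\n".join(rows)


def ranking(scores: dict[str, dict[str, float]], metric: str) -> list[str]:
    """Answer labels sorted from highest to lowest score on one metric."""
    return sorted(scores, key=lambda label: scores[label][metric], reverse=True)


print(format_f1_table(f1_scores))
for metric in METRICS:
    print(metric, "ranking:", ranking(f1_scores, metric))
```

On all three variants the hallucinated answer ranks above the vague one, because it reuses more of the reference wording. ROUGE rewards overlap, not factual correctness, so a high score alone does not rule out invented details.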


#### BERT Score
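BERT score measures semantic similarity, not just word overlap. It uses a BERT-family model to convert the tokens of both texts into contextual embeddings and compares them by cosine similarity, which addresses ROUGE's inability to recognize paraphrases.

*   BERT score is based on cosine similarity; in practice the values fall between 0 and 1
*   Raw values tend to cluster in a narrow band between 0.8 and 1
*   As a rule of thumb, anything above 0.9 is considered good, although the exact threshold depends on the model used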
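
The matching step behind these numbers can be sketched without loading a model. The example below is a simplified illustration, not the `bert_score` implementation: each candidate token is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and the two averages are combined into F1. The 2-D vectors in `toy_embeddings` are invented for the example; a real run would use contextual embeddings from the model, and the helper names are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(cand_vecs, ref_vecs) -> tuple[float, float, float]:
    """Greedy matching: average each token's best similarity to the other text."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up 2-D "embeddings": near-synonyms point in almost the same direction.
toy_embeddings = {
    "policy": [1.0, 0.0],
    "contract": [0.9, 0.1],
    "covers": [0.0, 1.0],
    "includes": [0.2, 0.9],
}

candidate_tokens = ["contract", "includes"]
reference_tokens = ["policy", "covers"]

p, r, f1 = greedy_match_scores(
    [toy_embeddings[t] for t in candidate_tokens],
    [toy_embeddings[t] for t in reference_tokens],
)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

The two token lists share no words, so ROUGE-1 would be 0, yet the embedding-based scores come out high because each candidate token has a close neighbour in the reference.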



In [28]:
!pip install bert-score

Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
   61.1/61.1 kB 2.1 MB/s eta 0:00:00
Installing collected packages: bert-score
Successfully installed bert-score-0.3.13


In [29]:
from bert_score import score as bert_score_fn

In [48]:
def compute_bert_score(generated_output: str, reference_text: str, model_type: str) -> dict:
  precision, recall, f1_measure = bert_score_fn(
      cands=[generated_output],
      refs=[reference_text],
      model_type=model_type,
      verbose=False
  )
  # bert_score returns one tensor per metric, with one entry per candidate
  scores = {
      "precision": precision[0].item(),
      "recall": recall[0].item(),
      "f1_score": f1_measure[0].item(),
  }
  print("BERT Precision Score", scores["precision"])
  print("BERT Recall Score", scores["recall"])
  print("BERT F1 Score", scores["f1_score"])
  return scores


In [49]:
def print_bert_score_value(answers: dict[str, str], reference_doc: str) -> None:
  for label, answer in answers.items():
    print("Evaluating:", label)
    compute_bert_score(answer, reference_doc, model_type="distilbert-base-uncased")

In [50]:
if __name__ == "__main__":
  answers = {
      "Good Answer": GOOD_ANSWER,
      "Bad Answer": BAD_ANSWER,
      "Hallucinated Answer": HALLUCINATED_ANSWER,
  }
  print_bert_score_value(answers, reference_doc=GOOD_ANSWER)

Evaluating: Good Answer


Loading weights:   0%|          | 0/100 [00:00<?, ?it/s]

DistilBertModel LOAD REPORT from: distilbert-base-uncased
Key                     | Status     |  | 
------------------------+------------+--+-
vocab_layer_norm.bias   | UNEXPECTED |  | 
vocab_projector.bias    | UNEXPECTED |  | 
vocab_layer_norm.weight | UNEXPECTED |  | 
vocab_transform.weight  | UNEXPECTED |  | 
vocab_transform.bias    | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


BERT Precision Score 1.0
BERT Recall Score 1.0
BERT F1 Score 1.0
Evaluating: Bad Answer
BERT Precision Score 0.8412926197052002
BERT Recall Score 0.7878788709640503
BERT F1 Score 0.8137101531028748
Evaluating: Hallucinated Answer
BERT Precision Score 0.8860654830932617
BERT Recall Score 0.8730091452598572
BERT F1 Score 0.8794888854026794
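
All three answers land between roughly 0.81 and 1.0, which makes the raw numbers hard to interpret. The `bert-score` package offers a `rescale_with_baseline` option that linearly rescales scores against a precomputed, model-specific baseline `b` as `(x - b) / (1 - b)`. The sketch below applies that formula by hand to the F1 values printed above (the third number for each answer). The baseline of 0.80 is a hypothetical value chosen purely for illustration, not the real baseline for `distilbert-base-uncased`.

```python
def rescale_with_baseline(score: float, baseline: float) -> float:
    """Linear rescaling: the baseline maps to 0 and a perfect score stays at 1."""
    return (score - baseline) / (1.0 - baseline)


# Hypothetical baseline, for illustration only (NOT the library's real value).
HYPOTHETICAL_BASELINE = 0.80

# F1 values copied from the output above.
raw_f1 = {
    "Good Answer": 1.0,
    "Bad Answer": 0.8137101531028748,
    "Hallucinated Answer": 0.8794888854026794,
}

rescaled_f1 = {
    label: rescale_with_baseline(value, HYPOTHETICAL_BASELINE)
    for label, value in raw_f1.items()
}

for label in raw_f1:
    print(f"{label:<20} raw={raw_f1[label]:.4f} rescaled={rescaled_f1[label]:.4f}")
```

Rescaling does not change the ranking of the answers; it only spreads the scores over a wider range so that differences are easier to read. As with ROUGE, the hallucinated answer still scores above the vague one, so these metrics measure similarity to the reference rather than factual accuracy.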
