## 1. BLEU Score (Precision-Focused)
BLEU measures precision. It was designed for translation, where the goal is to ensure the model doesn't hallucinate or add extra, incorrect words.


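How it works: It counts how many n-grams in the generated text also appear in the reference text, then divides by the total number of n-grams in the generated text. Each n-gram is credited at most as many times as it occurs in the reference (clipping), so repeating a matching word does not inflate the score.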

The "Cheating" Problem: If a model outputs only "The" for a long sentence, and the reference contains that word, its unigram precision is 100%. To fix this, BLEU adds a Brevity Penalty (BP) that lowers the score when the output is shorter than the reference.

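Variants: Usually reported as BLEU-4, which combines the 1-, 2-, 3-, and 4-gram precisions through a geometric mean. If any one of them is zero, the whole score is zero unless smoothing is applied.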
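
The mechanics above can be sketched in plain Python. This is a minimal single-reference illustration, not the code behind any particular library: the helper names (`ngram_counts`, `modified_precision`, `brevity_penalty`, `bleu`) are made up for this example, tokenization is assumed to be plain whitespace splitting, and no smoothing is applied.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: matches are capped at the reference count."""
    cand_counts = ngram_counts(candidate, n)
    ref_counts = ngram_counts(reference, n)
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(cand_len, ref_len):
    """1.0 when the candidate is at least as long as the reference, else exp(1 - r/c)."""
    if cand_len == 0:
        return 0.0
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU: brevity penalty times the geometric mean of the precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # a single zero precision zeroes the geometric mean
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)


reference = "the cat sat on the mat".split()
spam = "the the the the".split()

# Clipping: "the" occurs twice in the reference, so only 2 of the 4 copies count.
print(modified_precision(spam, reference, 1))
# A one-word output against a six-word reference is penalized heavily.
print(brevity_penalty(1, 6))
# An exact copy of the reference scores the maximum.
print(bleu(reference, reference))
```

The one-word output "the" has perfect unigram precision here, but its brevity penalty is `exp(-5)` and it has no bigrams at all, so its BLEU-4 is zero.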

## 2. ROUGE Score (Recall-Focused)
ROUGE measures recall. It was designed for summarization, where the goal is to ensure the model captures all the essential points from the source.


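How it works: It counts how many n-grams from the reference (human) text are present in the generated text, then divides by the total number of n-grams in the reference. Libraries often report precision and an F1 score alongside recall.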

Common Variants:

ROUGE-1: Overlap of individual words (unigrams).

ROUGE-2: Overlap of two-word pairs (bigrams).

ROUGE-L: Longest Common Subsequence. This looks at the longest sequence of words that appears in both texts in the same relative order (even if not contiguous), which better captures sentence structure.
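
The variants above can be sketched in a few lines of plain Python. This is an illustrative sketch under simple assumptions (whitespace tokens, one reference, no stemming), not the `rouge_score` implementation; the helper names `prf`, `rouge_n`, `lcs_length`, and `rouge_l` are invented for this example.

```python
from collections import Counter


def prf(matches, cand_total, ref_total):
    """Turn a match count into (precision, recall, F1)."""
    precision = matches / cand_total if cand_total else 0.0
    recall = matches / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def rouge_n(candidate, reference, n):
    """ROUGE-N: n-gram overlap between candidate and reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum((cand & ref).values())  # multiset intersection
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L: the LCS length takes the place of the n-gram match count."""
    return prf(lcs_length(candidate, reference), len(candidate), len(reference))


reference = "the cat sat on the mat".split()
short = "the cat sat".split()                # correct but incomplete
shuffled = "on the mat the cat sat".split()  # same words, different order

print(rouge_n(short, reference, 1))     # precision 1.0, recall 0.5
print(rouge_n(shuffled, reference, 1))  # every unigram is recovered
print(rouge_l(shuffled, reference))     # the LCS is only 3 words long
```

The shuffled candidate gets full ROUGE-1 recall because every word is present, while ROUGE-L recall drops to 0.5 because only three words line up in order. That gap is what the longest common subsequence adds over plain unigram overlap.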

In [None]:
!pip install evaluate rouge_score



In [None]:
# Importing evaluate library
import evaluate

# Load the BLEU and ROUGE metrics
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

# Example sentences (non-tokenized)
reference = ["The policy requires reporting incidents within 24 hours."]
candidate = ["Incidents must be reported within 24 hours according to policy."]

# BLEU expects plain text inputs
bleu_results = bleu_metric.compute(predictions=candidate, references=reference)
print(bleu_results)
print(f"BLEU Score: {bleu_results['bleu'] * 100:.2f}")

# ROUGE expects plain text inputs
rouge_results = rouge_metric.compute(predictions=candidate, references=reference)

# Access ROUGE scores (no need for indexing into the result)
print(f"ROUGE-1 F1 Score: {rouge_results['rouge1']:.2f}")
print(f"ROUGE-L F1 Score: {rouge_results['rougeL']:.2f}")

{'bleu': 0.0, 'precisions': [0.45454545454545453, 0.2, 0.1111111111111111, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 1.2222222222222223, 'translation_length': 11, 'reference_length': 9}
BLEU Score: 0.00
ROUGE-1 F1 Score: 0.56
ROUGE-L F1 Score: 0.44
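
The BLEU score of 0.00 above follows directly from the printed precisions: no 4-gram of the candidate appears in the reference, and one zero is enough to zero a geometric mean. As a quick sanity check, the mean can be recomputed by hand. The helper `geometric_mean` is illustrative only, and the precisions are written as fractions that match the decimals in the output.

```python
import math

# n-gram precisions matching the printed output (orders 1 through 4)
precisions = [5 / 11, 2 / 10, 1 / 9, 0 / 8]
brevity_penalty = 1.0  # the candidate (11 tokens) is longer than the reference (9)


def geometric_mean(values):
    """Geometric mean that is 0.0 as soon as any value is 0."""
    if min(values) <= 0:
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


bleu_4 = brevity_penalty * geometric_mean(precisions)
bleu_3 = brevity_penalty * geometric_mean(precisions[:3])

print(f"BLEU-4: {bleu_4:.4f}")  # zero, because the 4-gram precision is zero
print(f"BLEU-3: {bleu_3:.4f}")  # roughly 0.216 when only orders 1-3 are used
```

This is why sentence-level BLEU is often zero for short paraphrases even when ROUGE finds clear overlap. The `compute` method of the `evaluate` BLEU metric should accept `max_order` and `smooth` arguments for such cases; check the installed version's documentation before relying on them.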


In [None]:
!pip install sentence_transformers



In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

ref = "Report security incidents within 24 hours."
cand = "All security breaches must be reported within a day."

emb = model.encode([ref, cand])

similarity = np.dot(emb[0], emb[1]) / (
    np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])
)

print(similarity)


0.68099535


In [None]:
!pip install bert-score




In [None]:
from bert_score import score

# Reference (ground truth / expected answer)
references = [
    "Security incidents must be reported within 24 hours according to company policy."
]

# Candidate (LLM-generated answer)
candidates = [
    "All security breaches are required to be reported within one day under policy."
]

# Compute BERTScore
P, R, F1 = score(
    candidates,
    references,
    lang="en",
    model_type="roberta-large"
)

print(f"Precision: {P.mean():.3f}")
print(f"Recall:    {R.mean():.3f}")
print(f"F1 Score:  {F1.mean():.3f}")



Precision: 0.945
Recall:    0.953
F1 Score:  0.949


In [None]:
from sentence_transformers import CrossEncoder

# Load a pretrained cross-encoder (named separately so it does not
# overwrite the SentenceTransformer `model` defined earlier)
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")

# Reference (expected answer / ground truth)
reference = "Security incidents must be reported within 24 hours according to policy."

# Candidate (LLM-generated answer)
candidate = "All security breaches are required to be reported within one day."

# Cross-encoder expects text pairs; the result gets its own name so the
# `score` function imported from bert_score is not overwritten
pair_scores = cross_encoder.predict([(reference, candidate)])

print(f"Cross-Encoder Similarity Score: {pair_scores[0]:.3f}")



Cross-Encoder Similarity Score: 0.837


In [None]:
!pip install presidio-analyzer presidio-anonymizer


In [None]:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Sample text containing PII
text = "My name is John Doe, and my phone number is (123) 456-7890. I live in New York City, and my email is john.doe@example.com."

# Initialize the AnalyzerEngine and AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Analyze the text to detect PII entities
results = analyzer.analyze(text=text, language='en')

# Print detected entities
print("Detected entities:")
for result in results:
    print(result)

# Anonymize the detected entities using a replace operator
anonymized_text = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})
    }
).text

print("\nOriginal Text:", text)
print("Anonymized Text:", anonymized_text)






Detected entities:
type: EMAIL_ADDRESS, start: 101, end: 121, score: 1.0
type: PERSON, start: 11, end: 19, score: 0.85
type: LOCATION, start: 70, end: 83, score: 0.85
type: PHONE_NUMBER, start: 44, end: 58, score: 0.75
type: URL, start: 101, end: 108, score: 0.5
type: URL, start: 110, end: 121, score: 0.5

Original Text: My name is John Doe, and my phone number is (123) 456-7890. I live in New York City, and my email is john.doe@example.com.
Anonymized Text: My name is <REDACTED>, and my phone number is <REDACTED>. I live in <REDACTED>, and my email is <REDACTED>.
