#### ROUGE Score  
Recall-Oriented Understudy for Gisting Evaluation  
It is a set of metrics used to estimate the quality of a summary generated by an LLM or another model by comparing it with a reference summary.  
It has several variants; the most commonly used are ROUGE-N and ROUGE-L.  
ROUGE-N counts the overlapping n-grams between the generated text and the reference text, while ROUGE-L is based on the longest common subsequence (LCS) of the two.
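
For reference, the quantities computed in this notebook can be written as below. The notation is introduced here for illustration only: $X$ is the generated token list, $Y$ is the reference token list, $C_X(g)$ and $C_Y(g)$ are the number of times the n-gram $g$ occurs in each, and $\mathrm{LCS}(X,Y)$ is the length of their longest common subsequence. The F-score shown is the plain harmonic mean used in this notebook; other ROUGE implementations may weight precision and recall differently.

$$
P_n=\frac{\sum_{g}\min\big(C_X(g),\,C_Y(g)\big)}{\sum_{g}C_X(g)},\qquad
R_n=\frac{\sum_{g}\min\big(C_X(g),\,C_Y(g)\big)}{\sum_{g}C_Y(g)},\qquad
F_1=\frac{2PR}{P+R}
$$

$$
P_L=\frac{\mathrm{LCS}(X,Y)}{|X|},\qquad
R_L=\frac{\mathrm{LCS}(X,Y)}{|Y|}
$$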

Here I implement the ROUGE score from scratch in Python.

In [1]:
from collections import Counter

In [2]:
# Build the n-grams of a token list: n=1 gives unigrams, n=2 gives bigrams, and so on

def n_grams(tokens,n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens)+1-n)]
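
A quick check of what `n_grams` returns may help here. The sentence below is a made-up example, and the function is repeated only so that the snippet runs on its own.

```python
def n_grams(tokens, n):
    # Same definition as in the cell above, repeated so this snippet is self-contained
    return [tuple(tokens[i:i+n]) for i in range(len(tokens)+1-n)]

tokens = "the cat sat on the mat".split()

unigrams = n_grams(tokens, 1)   # one 1-tuple per token
bigrams = n_grams(tokens, 2)    # sliding window of width 2
too_long = n_grams(tokens, 7)   # n larger than the token count gives an empty list

print(bigrams)
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
print(too_long)
# []
```

Each n-gram is a tuple, which makes it hashable and therefore usable as a `Counter` key in the next step.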


In [3]:
# Implementing ROUGE-N first
def rouge_n(generated,reference,n=1):
    # Count how often each n-gram occurs in each token list
    generated_n_grams=Counter(n_grams(generated,n))
    reference_n_grams=Counter(n_grams(reference,n))
    # Counter & Counter keeps the minimum count per n-gram (clipped overlap)
    overlap_ngrams=generated_n_grams & reference_n_grams
    overlap_count=sum(overlap_ngrams.values())
    reference_count=sum(reference_n_grams.values())
    # Calculate precision, recall and F1 score, guarding against division by zero
    precision=overlap_count/sum(generated_n_grams.values()) if generated_n_grams else 0
    recall=overlap_count/reference_count if reference_count else 0
    f1_score=(2*precision*recall)/(precision+recall) if (precision+recall)>0 else 0
    return {"precision":precision,"recall":recall,"f1_score":f1_score}
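
The `&` between the two counters is what keeps a repeated word from being over-counted. A minimal sketch with made-up token lists (not taken from the test data in this notebook) shows the clipping:

```python
from collections import Counter

generated = "the the the cat".split()
reference = "the cat sat".split()

gen_counts = Counter(generated)    # 'the' occurs 3 times, 'cat' once
ref_counts = Counter(reference)    # 'the', 'cat', 'sat' occur once each

# Intersection keeps min(count_in_generated, count_in_reference) per token,
# so 'the' is credited only once even though it is generated three times
overlap = gen_counts & ref_counts
overlap_count = sum(overlap.values())

precision = overlap_count / sum(gen_counts.values())   # 2 matches / 4 generated tokens
recall = overlap_count / sum(ref_counts.values())      # 2 matches / 3 reference tokens

print(dict(overlap), precision, recall)
```

Without the clipping, the three copies of "the" would all count as matches and precision would be inflated.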


In [4]:
# Scratch check: build an (m+1) x (n+1) table of zeros, the shape used by the LCS computation
m,n=4,4
dp=[[0]*(n+1) for _ in range(m+1)]
dp

[[0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0]]

In [5]:
# Implementing ROUGE-L, which is based on the longest common subsequence (LCS) of the two token lists
def rouge_l(generated,reference):
    def lcs_length(x,y):
        m,n=len(x),len(y)
        # dp[i][j] holds the LCS length of x[:i] and y[:j]
        dp=[[0]*(n+1) for _ in range(m+1)]
        for i in range(1,m+1):
            for j in range(1,n+1):
                if x[i-1]==y[j-1]:
                    dp[i][j]=dp[i-1][j-1]+1
                else:
                    dp[i][j]=max(dp[i-1][j],dp[i][j-1])
        return dp[m][n]

    lcs_len=lcs_length(generated,reference)
    recall=lcs_len/len(reference) if reference else 0
    precision=lcs_len/len(generated) if generated else 0
    f1_score=(2*recall*precision)/(recall+precision) if (recall+precision)>0 else 0
    return {"precision":precision,"recall":recall,"f1_score":f1_score}
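
To see which tokens the LCS actually picks, the same table can be walked backwards from the bottom-right corner. The helper `lcs_tokens` below is a hypothetical addition for illustration, not part of the implementation above, and the two sentences are made-up examples.

```python
def lcs_tokens(x, y):
    # Fill the same DP table as lcs_length: dp[i][j] = LCS length of x[:i] and y[:j]
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Walk back from dp[m][n], collecting tokens wherever the two sequences match
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

generated = "the cat sat on the mat".split()
reference = "the cat lay on a mat".split()

lcs = lcs_tokens(generated, reference)
print(lcs)                          # ['the', 'cat', 'on', 'mat']
print(len(lcs) / len(generated))    # ROUGE-L precision for this pair: 4/6
print(len(lcs) / len(reference))    # ROUGE-L recall for this pair: 4/6
```

Unlike n-gram overlap, the matched tokens only need to appear in the same order; they do not have to be adjacent.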

#### Testing the ROUGE score on sample text found on the internet  
Using ROUGE-1 (unigrams) and ROUGE-L for comparison.

In [11]:
generated="The majestic mountains stood tall against the horizon, their peaks dusted with snow that glistened under the golden rays of the morning sun. The valley below was a patchwork of green fields and winding rivers, dotted with small villages that seemed to be untouched by time. Birds chirped merrily in the trees, and the air was filled with the scent of fresh pine and blooming flowers. It was a scene of perfect tranquility, where nature and humanity existed in harmony. As the day progressed, the sunlight shifted, casting long shadows across the landscape, and the mountains took on a deep purple hue as dusk approached.".split()
reference="The towering mountains dominated the landscape, their snow-capped peaks glowing in the early morning light. Below, the valley stretched out in a mosaic of fields and rivers, with quaint villages nestled among the greenery. The chirping of birds filled the air, blending with the scent of pine trees and blooming flowers. This serene scene was a testament to the peaceful coexistence of nature and mankind. As the day wore on, the light changed, creating dramatic shadows on the mountainside, which turned a rich purple as the evening drew near".split()

rouge_1_result=rouge_n(generated,reference,n=1)
print("ROUGE-1:",rouge_1_result)

rouge_l_result=rouge_l(generated,reference)
print("ROUGE-L:",rouge_l_result)

ROUGE-1: {'precision': 0.47619047619047616, 'recall': 0.5617977528089888, 'f1_score': 0.5154639175257731}
ROUGE-L: {'precision': 0.37142857142857144, 'recall': 0.43820224719101125, 'f1_score': 0.4020618556701031}


#### Using an existing library: rouge-score  
Install it with `pip install rouge-score`.

In [12]:
from rouge_score import rouge_scorer

# Define the two paragraphs
paragraph1=("The majestic mountains stood tall against the horizon, "
              "their peaks dusted with snow that glistened under the golden rays of the morning sun. "
              "The valley below was a patchwork of green fields and winding rivers, dotted with small villages "
              "that seemed to be untouched by time. Birds chirped merrily in the trees, and the air was filled "
              "with the scent of fresh pine and blooming flowers. It was a scene of perfect tranquility, "
              "where nature and humanity existed in harmony. As the day progressed, the sunlight shifted, "
              "casting long shadows across the landscape, and the mountains took on a deep purple hue as dusk approached.")

paragraph2=("The towering mountains dominated the landscape, their snow-capped peaks glowing in the early morning light. "
              "Below, the valley stretched out in a mosaic of fields and rivers, with quaint villages nestled among the greenery. "
              "The chirping of birds filled the air, blending with the scent of pine trees and blooming flowers. "
              "This serene scene was a testament to the peaceful coexistence of nature and mankind. "
              "As the day wore on, the light changed, creating dramatic shadows on the mountainside, "
              "which turned a rich purple as the evening drew near.")

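# Initialize the ROUGE scorer
scorer=rouge_scorer.RougeScorer(['rouge1','rouge2','rougeL'],use_stemmer=True)
# score() takes (target, prediction), so paragraph1 is treated as the reference here and
# paragraph2 as the prediction; swap the arguments to evaluate paragraph1 against paragraph2
scores=scorer.score(paragraph1,paragraph2)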

# Print the results
for rouge_type,score in scores.items():
    print(f"{rouge_type.upper()} - Precision: {score.precision:.6f}, Recall: {score.recall:.6f}, F1-Score: {score.fmeasure:.6f}")


ROUGE1 - Precision: 0.622222, Recall: 0.533333, F1-Score: 0.574359
ROUGE2 - Precision: 0.168539, Recall: 0.144231, F1-Score: 0.155440
ROUGEL - Precision: 0.455556, Recall: 0.390476, F1-Score: 0.420513
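
The library's numbers differ from the from-scratch ones (for example, ROUGE-1 F1 of 0.574 versus 0.515). One likely reason is tokenization: the from-scratch test uses `str.split()`, which keeps case and attached punctuation, whereas `rouge_score`, as far as I understand its default tokenizer, lowercases the text and splits on non-alphanumeric characters, and `use_stemmer=True` additionally stems the tokens. The helper `simple_tokenize` below is a hypothetical sketch of the lowercasing and punctuation handling only. It does not stem, so it should not be expected to reproduce the library's scores exactly.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: lowercase, then keep only runs of letters and digits
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "The snow-capped peaks, glowing!"

whitespace_tokens = sample.split()
normalized_tokens = simple_tokenize(sample)

print(whitespace_tokens)   # ['The', 'snow-capped', 'peaks,', 'glowing!']
print(normalized_tokens)   # ['the', 'snow', 'capped', 'peaks', 'glowing']
```

With whitespace splitting, "peaks," and "peaks" are different tokens and "The" does not match "the", which lowers the overlap counts.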
