# Compute ROUGE Scores
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how much of the reference response reappears in the generated response, based on overlapping n-grams and word subsequences.
- ROUGE-1, ROUGE-2, and ROUGE-L are the commonly used variants:

1. ROUGE-1: Overlap of unigrams (single words).
2. ROUGE-2: Overlap of bigrams (two-word sequences).
3. ROUGE-L: Length of the longest common subsequence (words that appear in the same order in both texts, not necessarily adjacent).

You can use the rouge-score library in Python to compute these scores.
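To make the three definitions concrete, here is a minimal from-scratch sketch. It is not the rouge-score implementation: it assumes lowercased whitespace tokens and no stemming, and the helper names (`rouge_n`, `rouge_l`, `lcs_length`, `prf`) are made up for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prf(overlap, ref_total, gen_total):
    """Turn an overlap count into (precision, recall, F-measure)."""
    precision = overlap / gen_total if gen_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def rouge_n(reference, generated, n):
    """ROUGE-N: n-gram overlap, each n-gram counted at most as often as it occurs in both texts."""
    ref = Counter(ngrams(reference.lower().split(), n))
    gen = Counter(ngrams(generated.lower().split(), n))
    overlap = sum((ref & gen).values())
    return prf(overlap, sum(ref.values()), sum(gen.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, generated):
    """ROUGE-L: precision, recall, and F-measure based on the LCS length."""
    ref, gen = reference.lower().split(), generated.lower().split()
    return prf(lcs_length(ref, gen), len(ref), len(gen))


reference = "the cat sat on the mat"
generated = "the cat lay on the mat"
print("ROUGE-1:", rouge_n(reference, generated, 1))
print("ROUGE-2:", rouge_n(reference, generated, 2))
print("ROUGE-L:", rouge_l(reference, generated))
```

For this pair, five of six words match and three of five bigrams match, so the ROUGE-1 and ROUGE-L F-measures are about 0.83 and the ROUGE-2 F-measure is 0.6. Library scores can differ slightly because of tokenization and stemming.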

# Compute BLEU Scores
- BLEU (Bilingual Evaluation Understudy) measures the precision of n-grams in the generated responses with respect to the reference responses, and applies a brevity penalty to generated responses that are shorter than the reference.
- BLEU-1, BLEU-2, BLEU-3, and BLEU-4 are commonly used metrics, with BLEU-4 being the most popular for overall evaluation.
- You can use the nltk library in Python to compute BLEU scores.
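The BLEU definition above can be sketched from scratch as follows. This is an illustrative, unsmoothed, single-reference version, not nltk's implementation; the function names (`modified_precision`, `brevity_penalty`, `bleu`) are hypothetical, and tokens are assumed to come from a plain whitespace split.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(ref_tokens, gen_tokens, n):
    """Clipped precision: a generated n-gram is credited at most as often as it occurs in the reference."""
    gen_counts = Counter(ngrams(gen_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    total = sum(gen_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in gen_counts.items())
    return clipped / total


def brevity_penalty(ref_len, gen_len):
    """Penalize generated texts that are shorter than the reference."""
    if gen_len == 0:
        return 0.0
    return 1.0 if gen_len > ref_len else math.exp(1 - ref_len / gen_len)


def bleu(ref_tokens, gen_tokens, max_n=4):
    """BLEU-max_n: brevity penalty times the geometric mean of the 1..max_n precisions."""
    precisions = [modified_precision(ref_tokens, gen_tokens, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(ref_tokens), len(gen_tokens)) * math.exp(log_mean)


ref = "the cat sat on the mat".split()
gen = "the cat sat on the rug".split()
for n in range(1, 5):
    print(f"BLEU-{n}: {bleu(ref, gen, max_n=n):.4f}")
```

Clipping is what stops a degenerate response such as "the the the the" from getting full unigram credit: only two of its four tokens are matched, because "the" occurs twice in the reference.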

In [4]:
# pip install pandas rouge-score nltk openpyxl

In [11]:
import pandas as pd
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import numpy as np

# Load the Excel file
file_path = 'Ground Truth.xlsx'
sheet_name = 'Q&A Walkthro'
data = pd.read_excel(file_path, sheet_name=sheet_name)

# Extract the relevant columns
reference_responses = data['Answer'].tolist()
model_columns = [
    'llama3-8b-8192 (OpenAIEmbedding/session_state.docs (all))',
    'llama3-8b-8192 (OllamaEmbedding/session_state.docs (all))',
    'llama3-8b-8192 (GoogleGenAIEmbedding/session_state.docs (all))'
]

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Smoothing function for BLEU (avoids zero scores when an n-gram order has no matches)
smoothie = SmoothingFunction().method4

# List to store aggregated scores
aggregated_scores_list = []

# Compute ROUGE and BLEU scores for each model
for model in model_columns:
    generated_responses = data[model].tolist()
    
    # Lists to store individual scores
    rouge_1_scores = []
    rouge_2_scores = []
    rouge_l_scores = []
    bleu_scores = []

    # Compute ROUGE and BLEU scores
    for ref, gen in zip(reference_responses, generated_responses):
        # Compute ROUGE scores
        rouge_scores = scorer.score(ref, gen)
        rouge_1_scores.append(rouge_scores['rouge1'].fmeasure)
        rouge_2_scores.append(rouge_scores['rouge2'].fmeasure)
        rouge_l_scores.append(rouge_scores['rougeL'].fmeasure)
        
        # Compute BLEU scores
        ref_tokens = ref.split()
        gen_tokens = gen.split()
        bleu_score = sentence_bleu([ref_tokens], gen_tokens, smoothing_function=smoothie)
        bleu_scores.append(bleu_score)

    # Calculate average scores
    avg_rouge_1 = np.mean(rouge_1_scores)
    avg_rouge_2 = np.mean(rouge_2_scores)
    avg_rouge_l = np.mean(rouge_l_scores)
    avg_bleu = np.mean(bleu_scores)

    # Append the scores to the list
    aggregated_scores_list.append({
        'Model': model,
        'Average ROUGE-1': avg_rouge_1,
        'Average ROUGE-2': avg_rouge_2,
        'Average ROUGE-L': avg_rouge_l,
        'Average BLEU': avg_bleu
    })

# Convert the list to a DataFrame
aggregated_scores = pd.DataFrame(aggregated_scores_list)

# Display the aggregated scores DataFrame
aggregated_scores


| | Model | Average ROUGE-1 | Average ROUGE-2 | Average ROUGE-L | Average BLEU |
|---|---|---:|---:|---:|---:|
| 0 | llama3-8b-8192 (OpenAIEmbedding/session_state.... | 0.133607 | 0.046611 | 0.108567 | 0.014828 |
| 1 | llama3-8b-8192 (OllamaEmbedding/session_state.... | 0.103982 | 0.024111 | 0.077843 | 0.006972 |
| 2 | llama3-8b-8192 (GoogleGenAIEmbedding/session_s... | 0.213687 | 0.097963 | 0.178456 | 0.029943 |


### Evaluating the Scores

**Interpretation of ROUGE Scores:**

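- The ROUGE values reported here are F-measures (`fmeasure`), which balance precision (how much of the generated response appears in the reference) and recall (how much of the reference appears in the generated response).
- ROUGE-1: Measures the overlap of unigrams (single words). Higher scores indicate that the generated response uses more of the same words as the reference.
- ROUGE-2: Measures the overlap of bigrams (two-word sequences). Higher scores indicate that the generated response also reproduces the reference's local phrasing and word order.
- ROUGE-L: Measures the longest common subsequence. Higher scores indicate that more of the reference's words appear in the same order in the generated response, even with other words in between.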


**Interpretation of BLEU Scores:**

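- BLEU scores range from 0 to 1, with higher scores indicating better precision of n-grams in the generated responses relative to the reference responses. With its default weights, `sentence_bleu` computes BLEU-4: the geometric mean of the 1-gram to 4-gram precisions, multiplied by a brevity penalty.
- BLEU is often used with a smoothing function (as done here) because a single n-gram order with no matches would otherwise drive the whole score to zero, which is common for short sentences.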
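The effect of smoothing can be seen on a short example. The sketch below uses simple add-one smoothing on the 2-gram to 4-gram counts, which is an assumption made for illustration and not the `method4` used in the code above; `clipped_counts` and `bleu4` are hypothetical helper names.

```python
import math
from collections import Counter


def clipped_counts(ref_tokens, gen_tokens, n):
    """Return (matched, total) n-gram counts for the generated tokens, with clipping."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, gen = grams(ref_tokens), grams(gen_tokens)
    return sum((ref & gen).values()), sum(gen.values())


def bleu4(ref_tokens, gen_tokens, add_one=False):
    """BLEU-4 with uniform weights; optionally add 1 to the counts for n > 1."""
    log_sum = 0.0
    for n in range(1, 5):
        matched, total = clipped_counts(ref_tokens, gen_tokens, n)
        if add_one and n > 1:
            matched, total = matched + 1, total + 1
        if matched == 0 or total == 0:
            return 0.0  # one empty n-gram order zeroes the geometric mean
        log_sum += math.log(matched / total) / 4
    ref_len, gen_len = len(ref_tokens), len(gen_tokens)
    penalty = 1.0 if gen_len > ref_len else math.exp(1 - ref_len / gen_len)
    return penalty * math.exp(log_sum)


ref = "the cat sat on the mat".split()
gen = "the cat is on the mat".split()
unsmoothed = bleu4(ref, gen)
smoothed = bleu4(ref, gen, add_one=True)
print(f"unsmoothed: {unsmoothed:.4f}")
print(f"add-one smoothed: {smoothed:.4f}")
```

The two sentences differ by one word, yet no 4-gram matches, so the unsmoothed score is exactly 0 while the smoothed score stays well above it. Different smoothing methods give different values, so scores are only comparable when the same method is used throughout.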


**Benchmarking:**

- Compare the aggregated scores to benchmarks from similar systems or previous versions of your chatbot.
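- Higher scores generally indicate better performance, but the specific thresholds for "good" scores can vary by application. Scores are only comparable when they are computed with the same references, tokenization, and smoothing settings.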
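As a small illustration of such a comparison, the averages from the results table above can be ranked per metric. The numbers are copied from that table; shortening the labels to the embedding name and using the OpenAIEmbedding run as the baseline are both assumptions made for this sketch.

```python
# Averages copied from the results table; labels shortened to the embedding name.
reported = {
    "OpenAIEmbedding": {"ROUGE-1": 0.133607, "ROUGE-2": 0.046611, "ROUGE-L": 0.108567, "BLEU": 0.014828},
    "OllamaEmbedding": {"ROUGE-1": 0.103982, "ROUGE-2": 0.024111, "ROUGE-L": 0.077843, "BLEU": 0.006972},
    "GoogleGenAIEmbedding": {"ROUGE-1": 0.213687, "ROUGE-2": 0.097963, "ROUGE-L": 0.178456, "BLEU": 0.029943},
}

baseline = "OpenAIEmbedding"  # hypothetical choice of baseline for the comparison

best_by_metric = {}
for metric in reported[baseline]:
    # Pick the configuration with the highest average for this metric.
    best = max(reported, key=lambda name: reported[name][metric])
    best_by_metric[metric] = best
    gain = reported[best][metric] / reported[baseline][metric] - 1
    print(f"{metric}: best = {best} ({gain:+.0%} vs. {baseline})")
```

A relative comparison like this only ranks the configurations against each other; it says nothing about whether the best one is good enough for the application.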