In [1]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge

# Cumulative n-gram weights for BLEU-1 through BLEU-4.
# BLEU-3 uses 1/3 per order so the weights sum to 1 (0.33 * 3 does not).
BLEU_WEIGHTS = [(1, 0, 0, 0), (0.5, 0.5, 0, 0), (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]

def evaluate_explanations(ground_truth, model_outputs):
    """Print BLEU-1..4 and ROUGE-1/2/L for each model output against one reference."""
    rouge = Rouge()

    # Whitespace tokenization: case-sensitive, punctuation stays attached to words
    reference_tokens = ground_truth.split()

    # Apply smoothing only for short references (fewer than 5 tokens);
    # None makes sentence_bleu fall back to its default (no smoothing).
    smoothing = SmoothingFunction().method1 if len(reference_tokens) < 5 else None

    for i, model_output in enumerate(model_outputs):
        candidate_tokens = model_output.split()

        # Calculate BLEU scores, one per weight tuple
        bleu1, bleu2, bleu3, bleu4 = (
            sentence_bleu([reference_tokens], candidate_tokens, weights=w, smoothing_function=smoothing)
            for w in BLEU_WEIGHTS
        )

        # Calculate ROUGE scores (get_scores returns one dict per hypothesis)
        rouge_scores = rouge.get_scores(hyps=model_output, refs=ground_truth)[0]

        # Print BLEU and ROUGE scores
        print(f'For model output {i + 1}:')
        print(f'BLEU-1: {bleu1}, BLEU-2: {bleu2}, BLEU-3: {bleu3}, BLEU-4: {bleu4}')
        print(f"ROUGE-1: {rouge_scores['rouge-1']}, ROUGE-2: {rouge_scores['rouge-2']}, ROUGE-L: {rouge_scores['rouge-l']}")
        print("\n")

# Usage:
ground_truth = "Soap"
model_outputs = ["Great soap", "Gentle, effective soap recommended.", "The user highly recommends this soap for sensitive skin due to its moisturizing and non-irritating qualities, along with a pleasant, soft fragrance."]
evaluate_explanations(ground_truth, model_outputs)


For model output 1:
BLEU-1: 0, BLEU-2: 0, BLEU-3: 0, BLEU-4: 0
ROUGE-1: {'r': 0.0, 'p': 0.0, 'f': 0.0}, ROUGE-2: {'r': 0.0, 'p': 0.0, 'f': 0.0}, ROUGE-L: {'r': 0.0, 'p': 0.0, 'f': 0.0}


For model output 2:
BLEU-1: 0, BLEU-2: 0, BLEU-3: 0, BLEU-4: 0
ROUGE-1: {'r': 0.0, 'p': 0.0, 'f': 0.0}, ROUGE-2: {'r': 0.0, 'p': 0.0, 'f': 0.0}, ROUGE-L: {'r': 0.0, 'p': 0.0, 'f': 0.0}


For model output 3:
BLEU-1: 0, BLEU-2: 0, BLEU-3: 0, BLEU-4: 0
ROUGE-1: {'r': 0.0, 'p': 0.0, 'f': 0.0}, ROUGE-2: {'r': 0.0, 'p': 0.0, 'f': 0.0}, ROUGE-L: {'r': 0.0, 'p': 0.0, 'f': 0.0}




### -----------------> OBSERVATION

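> Both the BLEU and ROUGE scores are 0 for all three model outputs. According to these metrics, no model output shares a single token with the reference (or "ground truth") text. Note that every output does contain the word "soap", but in lowercase, whereas the reference is "Soap". BLEU is computed here on case-sensitive whitespace tokens, so the two strings do not match, and the zero ROUGE scores show that `rouge` did not match them either.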

> BLEU (Bilingual Evaluation Understudy) measures the clipped precision of n-grams in the generated text against the reference text, combined with a brevity penalty. In this case, a BLEU score of 0 means that no unigram in any model output matches the reference exactly. With a one-word reference, no 2-, 3-, or 4-gram match is possible in any case.
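
> The n-gram precision that BLEU builds on can be sketched in plain Python. This is a minimal illustration, not NLTK's implementation: the helper names `ngrams` and `clipped_precision` are hypothetical, and the brevity penalty and smoothing are left out.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference_tokens, candidate_tokens, n):
    """Share of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as often as it appears
    in the reference (the "clipping" step of BLEU).
    """
    candidate_counts = Counter(ngrams(candidate_tokens, n))
    reference_counts = Counter(ngrams(reference_tokens, n))
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0  # candidate is shorter than n tokens
    matched = sum(min(count, reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return matched / total

reference = "Soap".split()
candidate = "Great soap".split()

# Case-sensitive comparison: "soap" != "Soap", so nothing matches
print(clipped_precision(reference, candidate, 1))  # 0.0

# After lowercasing both sides, one of the two candidate unigrams matches
reference_lower = [t.lower() for t in reference]
candidate_lower = [t.lower() for t in candidate]
print(clipped_precision(reference_lower, candidate_lower, 1))  # 0.5

# A one-word reference contains no bigrams, so 2-gram precision stays 0
print(clipped_precision(reference_lower, candidate_lower, 2))  # 0.0
```

> The second call shows that the zero unigram score depends on the exact string comparison, not on the content of the output.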

> ROUGE (Recall-Oriented Understudy for Gisting Evaluation) evaluates a summary by comparing it to reference (typically human-written) summaries. The 'r', 'p', and 'f' keys in the ROUGE output are recall, precision, and F-measure, respectively. All three being zero means that no overlapping n-grams (ROUGE-1, ROUGE-2) and no common subsequence (ROUGE-L) were found between the model outputs and the reference.
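
> To make the 'r', 'p', and 'f' values concrete, here is a minimal sketch of a unigram-overlap score in the spirit of ROUGE-1. It is an illustration under simplifying assumptions (whitespace tokens, multiset overlap, balanced F-measure), not the `rouge` package's own code, and the function name `rouge1_sketch` is hypothetical.

```python
from collections import Counter

def rouge1_sketch(reference, hypothesis):
    """Unigram recall, precision, and F-measure between two strings."""
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())

    # Multiset intersection: each shared token counts min(ref, hyp) times
    overlap = sum((ref_counts & hyp_counts).values())

    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(hyp_counts.values()), 1)
    f_measure = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {'r': recall, 'p': precision, 'f': f_measure}

# Exact strings from the cell above: no shared token, so all three values are 0
print(rouge1_sketch("Soap", "Great soap"))

# Lowercased by hand: the single reference token is found (recall 1.0),
# and one of the two hypothesis tokens matches (precision 0.5)
print(rouge1_sketch("soap", "great soap"))
```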

> However, both BLEU and ROUGE become less informative on extremely short texts. These metrics are typically used for tasks like machine translation or text summarization, where the output texts are generally longer. The 'ground truth' in your case is a single word ("Soap"), which is too short for these metrics to provide a meaningful evaluation: each score hinges on whether that one token appears verbatim in the output.

> Therefore, while the scores report a complete lack of overlap between the model outputs and the reference, they are probably not a fair or useful measure of the quality of the model outputs, given how short the reference text is.
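
> One possible way to check how much of the zero score comes from surface form rather than content is to normalize both sides before comparing tokens. The sketch below assumes a simple normalization (lowercasing and dropping punctuation); the helper `normalize` is hypothetical and is not part of `nltk` or `rouge`.

```python
import re

def normalize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

ground_truth = "Soap"
model_outputs = [
    "Great soap",
    "Gentle, effective soap recommended.",
    "The user highly recommends this soap for sensitive skin due to its "
    "moisturizing and non-irritating qualities, along with a pleasant, soft fragrance.",
]

reference_tokens = set(normalize(ground_truth))

for i, output in enumerate(model_outputs, start=1):
    tokens = normalize(output)
    shared = reference_tokens & set(tokens)
    # Report which reference tokens survive normalization in each output
    print(f"Output {i}: {len(tokens)} tokens, shared with reference: {sorted(shared)}")
```

> With this normalization, the reference token is found in every output. Even then, a one-word reference only tells us whether "soap" is mentioned, so the scores would still say little about the quality of the longer explanations.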




