## Metrics
1. BLEU
2. ROUGE

### Microsoft Paraphrase dataset

Additional dataset<br>
A paraphrase dataset collected from news sources on the web.<br>
Processed dataset from https://github.com/wasiahmad/paraphrase_identification/tree/master/dataset/msr-paraphrase-corpus

I used this dataset to test metrics that compare ground-truth and generated summaries.<br>

In [51]:
import csv
import pandas as pd
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.translate.bleu_score import sentence_bleu as bleu
pd.set_option('display.max_colwidth', 500)
pd.options.display.float_format = '{:,.3f}'.format

In [53]:
para_df = pd.read_csv('MSRParaphraseCorpus.txt', sep='\t', quoting=csv.QUOTE_NONE)
para_df = para_df.drop(['id1', 'id2'], axis=1)
para_df.head(10)

,quality,string1,string2
0,1,"Amrozi accused his brother, whom he called ""the witness"", of deliberately distorting his evidence.","Referring to him as only ""the witness"", Amrozi accused his brother of deliberately distorting his evidence."
1,0,Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.,Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.
2,1,"They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added.","On June 10, the ship's owners had published an advertisement on the Internet, offering the explosives for sale."
3,0,"Around 0335 GMT, Tab shares were up 19 cents, or 4.4%, at A$4.56, having earlier set a record high of A$4.57.","Tab shares jumped 20 cents, or 4.6%, to set a record closing high at A$4.57."
4,1,"The stock rose $2.11, or about 11 percent, to close Friday at $21.51 on the New York Stock Exchange.",PG&E Corp. shares jumped $1.63 or 8 percent to $21.03 on the New York Stock Exchange on Friday.
5,1,Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.,"With the scandal hanging over Stewart's company, revenue the first quarter of the year dropped 15 percent from the same period a year earlier."
6,0,"The Nasdaq had a weekly gain of 17.27, or 1.2 percent, closing at 1,520.15 on Friday.","The tech-laced Nasdaq Composite .IXIC rallied 30.46 points, or 2.04 percent, to 1,520.15."
7,1,The DVD-CCA then appealed to the state Supreme Court.,The DVD CCA appealed that decision to the U.S. Supreme Court.
8,0,"That compared with $35.18 million, or 24 cents per share, in the year-ago period.",Earnings were affected by a non-recurring $8 million tax benefit in the year-ago period.
9,1,He said the foodservice pie business doesn't fit the company's long-term growth strategy.,"""The foodservice pie business does not fit our long-term growth strategy."


In [94]:
from nltk.translate.bleu_score import SmoothingFunction
chencherry = SmoothingFunction()

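# NOTE: sentence_bleu expects a list of tokenized references and a tokenized
# hypothesis. Raw strings are passed here, so NLTK iterates them character by
# character and the resulting scores are character-level, not word-level.

# Unigram precision only.
para_df['bleu-1'] = para_df.apply(lambda row: bleu(row['string1'],
                                                   row['string2'],
                                                   smoothing_function=chencherry.method1,
                                                   weights=(1,)),
                                  axis=1)

# Uniform weights over 1- to 4-grams.
para_df['bleu-w'] = para_df.apply(lambda row: bleu(row['string1'],
                                                   row['string2'],
                                                   smoothing_function=chencherry.method1,
                                                   weights=(0.25, 0.25, 0.25, 0.25)),
                                  axis=1)

# Weights that emphasize unigrams.
para_df['bleu-modi'] = para_df.apply(lambda row: bleu(row['string1'],
                                                      row['string2'],
                                                      smoothing_function=chencherry.method1,
                                                      weights=(0.8, 0.1, 0.05, 0.05)),
                                     axis=1)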
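
NLTK's `sentence_bleu` expects a list of tokenized references and a tokenized hypothesis; a plain string is iterated character by character. A word-level call would look like `bleu([row['string1'].split()], row['string2'].split(), ...)`, where whitespace splitting is an assumed tokenizer. The sketch below is not NLTK's implementation. It is a minimal hand-rolled illustration (the helper names are hypothetical, and there is no brevity penalty or smoothing) of clipped n-gram precision and of how the three weight vectors combine the per-order precisions.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a sequence (a list of words or a str of characters)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of the hypothesis against a single reference."""
    hyp = ngram_counts(hypothesis, n)
    ref = ngram_counts(reference, n)
    clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
    total = sum(hyp.values())
    return clipped / total if total else 0.0


def combine(precisions, weights):
    """Weighted geometric mean of the precisions (BLEU without brevity penalty)."""
    pairs = [(p, w) for p, w in zip(precisions, weights) if w > 0]
    if any(p == 0 for p, _ in pairs):
        return 0.0
    return math.exp(sum(w * math.log(p) for p, w in pairs))


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox leaps over the lazy dog"

# Word-level precisions for n = 1..4.
word_p = [modified_precision(reference.split(), hypothesis.split(), n)
          for n in range(1, 5)]
# Passing the raw strings scores characters instead of words.
char_p1 = modified_precision(reference, hypothesis, 1)

bleu_1 = combine(word_p, (1,))
bleu_w = combine(word_p, (0.25, 0.25, 0.25, 0.25))
bleu_modi = combine(word_p, (0.8, 0.1, 0.05, 0.05))
print(word_p, char_p1, bleu_1, bleu_w, bleu_modi)
```

With uniform weights the low higher-order precisions pull the score down the most, which is why `bleu_w` comes out below `bleu_modi`, and both come out below `bleu_1`.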

In [55]:
stop_words = set(stopwords.words('english'))

def rouge_score(reference, hypothesis, ngrams_count):
    """Simplified ROUGE-like overlap of distinct n-grams (n = 1..ngrams_count)."""

    def get_ngrams(line, max_n=1):
        # Collect the distinct n-grams of whitespace tokens, skipping stop words.
        # In practice only unigrams can match the stop word list; case is kept.
        ngrams_set = set()
        for n in range(1, max_n + 1):
            for gram in ngrams(line.split(), n):
                text = ' '.join(gram)
                if text.lower() not in stop_words:
                    ngrams_set.add(text)
        return ngrams_set

    ref_ngrams = get_ngrams(reference, max_n=ngrams_count)
    hypo_ngrams = get_ngrams(hypothesis, max_n=ngrams_count)

    # Number of distinct n-grams shared by both sentences.
    overlapping_count = len(ref_ngrams & hypo_ngrams)

    # Normalize by the rounded average set size; the +1 avoids division by zero.
    all_ngrams_count = round((len(ref_ngrams) + len(hypo_ngrams)) / 2)

    return overlapping_count / (all_ngrams_count + 1)
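
To make the formula concrete, here is a compact, self-contained variant of the function above for unigrams only. It is a sketch under stated assumptions: `TOY_STOP_WORDS` is a hypothetical stand-in for NLTK's English stop word list, and `overlap_score` is an illustrative name. The score is the number of shared distinct unigrams divided by the rounded average set size plus one, so even two identical sentences score below 1.

```python
# Assumption: a tiny stand-in for NLTK's English stop word list.
TOY_STOP_WORDS = {"the", "a", "on", "of", "to", "in"}


def content_unigrams(line):
    """Distinct whitespace-separated words that are not stop words (case kept)."""
    return {word for word in line.split() if word.lower() not in TOY_STOP_WORDS}


def overlap_score(reference, hypothesis):
    """Shared distinct unigrams over (rounded average set size + 1)."""
    ref = content_unigrams(reference)
    hyp = content_unigrams(hypothesis)
    shared = len(ref & hyp)
    average_size = round((len(ref) + len(hyp)) / 2)
    return shared / (average_size + 1)


same = overlap_score("the cat sat on the mat", "the cat sat on the mat")
close = overlap_score("the cat sat on the mat", "a cat sat on a rug")
print(same, close)
```

Because of the `+ 1` in the denominator, the penalty is largest for short sentences, which is worth keeping in mind when comparing these values with library ROUGE scores.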

In [60]:
para_df['my_rouge1'] = para_df.apply(lambda row: rouge_score(row['string1'],
                                                         row['string2'],
                                                         ngrams_count=1),
                                 axis=1)

In [57]:
para_df.groupby('quality')[['bleu-w', 'my_rouge1']].mean()

quality,bleu-w,my_rouge1
0,0.004,0.427
1,0.004,0.552
Quality,0.04,0.333


According to the ROUGE paper (https://www.aclweb.org/anthology/W04-1013):<br>
ROUGE-1, ROUGE-L, ROUGE-W, ROUGE-SU4, and ROUGE-SU9 performed great in evaluating very short summaries (or headline-like summaries).

In [58]:
import rouge
# nltk stemmer, wordnet

evaluator = rouge.Rouge(metrics=['rouge-n', 'rouge-l', 'rouge-w'], max_n=4)

ROUGE_NAMES = ['rouge-1', 'rouge-2', 'rouge-3', 'rouge-4', 'rouge-l', 'rouge-w']

def rouge_scores(reference, hypothesis):
    """Return a {metric name: F1} dict for one sentence pair."""
    results = evaluator.get_scores(reference, hypothesis)
    return {name: d['f'] for name, d in results.items()}  # keep only F1


# Score every pair once, then spread the metrics into separate columns.
raw_rouges = para_df.apply(lambda row: rouge_scores(row['string1'], row['string2']), axis=1)
for name in ROUGE_NAMES:
    para_df[name] = raw_rouges.map(lambda scores, name=name: scores.get(name))
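
For intuition about what the `'f'` values hold, ROUGE-N can be sketched by hand as an n-gram overlap turned into recall, precision, and their F1. This is not the `rouge` package's implementation: it assumes lowercased whitespace tokens and no stemming, and `rouge_n` is a hypothetical helper name.

```python
from collections import Counter


def rouge_n(reference, hypothesis, n=1):
    """Return (recall, precision, f1) of the n-gram overlap between two sentences."""

    def counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, hyp = counts(reference), counts(hypothesis)
    overlap = sum((ref & hyp).values())  # multiset intersection of n-grams
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * recall * precision / (recall + precision)


reference = "the cat sat on the mat"
hypothesis = "the cat is on the mat today"

r1, p1, f1 = rouge_n(reference, hypothesis, n=1)
r2, p2, f2 = rouge_n(reference, hypothesis, n=2)
print(r1, p1, f1)
print(r2, p2, f2)
```

Recall is measured against the reference and precision against the hypothesis, so F1 stays the same if the two sentences are swapped.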

In [98]:
para_df.head(10)

,quality,string1,string2,my_rouge1,rouge-1,rouge-2,rouge-3,rouge-4,rouge-l,rouge-w,bleu-1,bleu-w,bleu-modi
0,1,"Amrozi accused his brother, whom he called ""the witness"", of deliberately distorting his evidence.","Referring to him as only ""the witness"", Amrozi accused his brother of deliberately distorting his evidence.",0.7,0.733,0.571,0.385,0.25,0.6,0.6,0.243,0.004,0.08
1,0,Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.,Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.,0.273,0.556,0.176,0.0,0.0,0.444,0.444,0.323,0.004,0.102
2,1,"They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added.","On June 10, the ship's owners had published an advertisement on the Internet, offering the explosives for sale.",0.455,0.757,0.571,0.364,0.258,0.595,0.595,0.225,0.004,0.075
3,0,"Around 0335 GMT, Tab shares were up 19 cents, or 4.4%, at A$4.56, having earlier set a record high of A$4.57.","Tab shares jumped 20 cents, or 4.6%, to set a record closing high at A$4.57.",0.538,0.591,0.381,0.2,0.0,0.545,0.545,0.368,0.005,0.12
4,1,"The stock rose $2.11, or about 11 percent, to close Friday at $21.51 on the New York Stock Exchange.",PG&E Corp. shares jumped $1.63 or 8 percent to $21.03 on the New York Stock Exchange on Friday.,0.231,0.524,0.3,0.211,0.167,0.476,0.476,0.295,0.004,0.096
5,1,Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.,"With the scandal hanging over Stewart's company, revenue the first quarter of the year dropped 15 percent from the same period a year earlier.",0.667,0.791,0.732,0.718,0.703,0.791,0.791,0.162,0.003,0.055
6,0,"The Nasdaq had a weekly gain of 17.27, or 1.2 percent, closing at 1,520.15 on Friday.","The tech-laced Nasdaq Composite .IXIC rallied 30.46 points, or 2.04 percent, to 1,520.15.",0.182,0.421,0.111,0.059,0.0,0.421,0.421,0.258,0.004,0.087
7,1,The DVD-CCA then appealed to the state Supreme Court.,The DVD CCA appealed that decision to the U.S. Supreme Court.,0.429,0.727,0.4,0.111,0.0,0.727,0.727,0.344,0.006,0.119
8,0,"That compared with $35.18 million, or 24 cents per share, in the year-ago period.",Earnings were affected by a non-recurring $8 million tax benefit in the year-ago period.,0.2,0.375,0.267,0.214,0.154,0.375,0.375,0.25,0.004,0.085
9,1,He said the foodservice pie business doesn't fit the company's long-term growth strategy.,"""The foodservice pie business does not fit our long-term growth strategy.",0.778,0.643,0.462,0.333,0.182,0.643,0.643,0.329,0.006,0.11


In [104]:
para_df.groupby('quality')[['my_rouge1', 'rouge-1', 'rouge-2',
       'rouge-3', 'rouge-4', 'rouge-l', 'rouge-w', 'bleu-1', 'bleu-w',
       'bleu-modi']].mean()

quality,my_rouge1,rouge-1,rouge-2,rouge-3,rouge-4,rouge-l,rouge-w,bleu-1,bleu-w,bleu-modi
0,0.427,0.574,0.389,0.291,0.225,0.536,0.536,0.241,0.004,0.081
1,0.552,0.715,0.533,0.421,0.341,0.67,0.67,0.226,0.004,0.075
Quality,0.333,0.5,0.0,0.0,0.0,0.5,0.5,0.889,0.04,0.387


In [105]:
# rouge-l and rouge-w are equal for all paraphrases
para_df[para_df['rouge-l'] != para_df['rouge-w']]

,quality,string1,string2,my_rouge1,rouge-1,rouge-2,rouge-3,rouge-4,rouge-l,rouge-w,bleu-1,bleu-w,bleu-modi


### Conclusions:
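1. BLEU metric<br>
I tested three variants of the BLEU metric:<br>
**bleu-1** - BLEU based on unigrams only<br>
A generated title scoring above roughly 0.23 could be considered a good title, but the evidence for this threshold is weak: the mean is 0.226 for paraphrases and 0.241 for non-paraphrases, so bleu-1 barely separates the two classes here.<br>
**bleu-w** - BLEU with uniform weights (0.25, 0.25, 0.25, 0.25) over 1- to 4-grams<br>
Even if a generated title is quite good, the score can be as low as 0.003 or 0.004.<br>
**bleu-modi** - modified version with weights (0.8, 0.1, 0.05, 0.05), which emphasizes unigrams<br>
All three variants were computed on raw strings rather than token lists, so NLTK scored character n-grams; this likely explains the very low and nearly class-independent values.<br>
<br>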

2. ROUGE metric<br>
**my_rouge1** - simple ROUGE-like metric based only on unigrams of raw words, not lemmas<br>
**rouge-n** - based on n-grams of lemmas, n = 1, 2, 3, 4<br>
rouge-1 scores good paraphrases noticeably higher than my_rouge1 (mean 0.715 versus 0.552), which I attribute to matching lemmas rather than exact words. The two metrics also differ in stop word handling and normalization, so the comparison is not exact.<br>
So it is important to use lemmas for evaluation.<br>
**rouge-l** and **rouge-w** are equal for all pairs; ROUGE-W reduces to ROUGE-L when its weighting function is linear, which is presumably the evaluator's default here.<br>
'l' stands for Longest Common Subsequence (LCS); the sentence-level LCS formula was used for this metric.<br>
The intuition is that the longer the LCS of two summary sentences, the more similar the two summaries are.<br>
A score of about 0.5 indicates a reasonable result, although on this dataset even non-paraphrases average 0.536 (paraphrases: 0.67), so a higher threshold may be safer.
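
The sentence-level LCS score can be sketched as follows. This is an illustration rather than the `rouge` package's code: it assumes whitespace tokenization and equal weighting of recall and precision (plain F1), and the helper names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference, hypothesis):
    """Return (recall, precision, f1) based on the sentence-level LCS."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(ref)
    precision = lcs / len(hyp)
    return recall, precision, 2 * recall * precision / (recall + precision)


reference = "police killed the gunman"
in_order = rouge_l(reference, "police kill the gunman")
reordered = rouge_l(reference, "the gunman kill police")
print(in_order, reordered)
```

Both hypotheses share three words with the reference, but only the first keeps them in the same order, so its LCS is longer and its score is higher.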