# ROUGE

**ROUGE** stands for **R**ecall-**O**riented **U**nderstudy for **G**isting **E**valuation. We now compute this metric in Python with the `rouge` package.

In [4]:
from rouge import Rouge

model_out = 'hello to the world'
reference = 'hello world'

# initialize the rouge object
rouge = Rouge()

# get the scores
rouge.get_scores(model_out, reference)

[{'rouge-1': {'r': 1.0, 'p': 0.5, 'f': 0.6666666622222223},
  'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-l': {'r': 1.0, 'p': 0.5, 'f': 0.6666666622222223}}]

The `get_scores` method returns three metrics: ROUGE-N computed on unigrams (ROUGE-1) and on bigrams (ROUGE-2), and ROUGE-L, which is based on the longest common subsequence.

For each of these, we receive the F1 score **f**, precision **p**, and recall **r**.
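To make the three scores concrete, here is a minimal sketch that recomputes them by hand for the same pair of strings. It is not the `rouge` package's implementation: the helper names `ngrams`, `prf`, `rouge_n`, and `rouge_l` are hypothetical, and the sketch assumes plain whitespace tokenization and clipped n-gram counts.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prf(overlap, hyp_total, ref_total):
    """Turn an overlap count into recall, precision, and F1."""
    r = overlap / ref_total if ref_total else 0.0
    p = overlap / hyp_total if hyp_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {'r': r, 'p': p, 'f': f}


def rouge_n(hypothesis, reference, n=1):
    """ROUGE-N on whitespace tokens (assumption), with clipped counts."""
    hyp = Counter(ngrams(hypothesis.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((hyp & ref).values())
    return prf(overlap, sum(hyp.values()), sum(ref.values()))


def rouge_l(hypothesis, reference):
    """ROUGE-L from the longest common subsequence (LCS) of tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    # Dynamic-programming table: table[i][j] is the LCS length of
    # the first i hypothesis tokens and the first j reference tokens.
    table = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i, h in enumerate(hyp, start=1):
        for j, r in enumerate(ref, start=1):
            if h == r:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return prf(table[-1][-1], len(hyp), len(ref))


print(rouge_n('hello to the world', 'hello world', n=1))
print(rouge_n('hello to the world', 'hello world', n=2))
print(rouge_l('hello to the world', 'hello world'))
```

Both reference words appear in the model output, so recall is 1.0, while only two of the four output words are in the reference, so precision is 0.5. No bigram is shared, which is why ROUGE-2 is zero. The package's F1 values differ from this sketch in the last few decimal places, presumably because of a small smoothing term in its formula.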

Let's apply `get_scores` to our set of five answers and see what we get. First, we need to define the `answers` list.

In [5]:
answers = [{'predicted': 'France', 'true': 'France.'},
           {'predicted': 'in the 10th and 11th centuries',
            'true': '10th and 11th centuries'},
           {'predicted': '10th and 11th centuries', 'true': '10th and 11th centuries'},
           {'predicted': 'Denmark, Iceland and Norway',
            'true': 'Denmark, Iceland and Norway'},
           {'predicted': 'Rollo', 'true': 'Rollo,'}]

Next, we split this list into two lists: one for our predictions, `model_out`, and another for the true answers, `reference`:

In [6]:
model_out = [ans['predicted'] for ans in answers]

reference = [ans['true'] for ans in answers]

In [7]:
model_out

['France',
 'in the 10th and 11th centuries',
 '10th and 11th centuries',
 'Denmark, Iceland and Norway',
 'Rollo']

Now we can pass both of these lists to the `rouge.get_scores` method to return a list of results:

In [8]:
rouge.get_scores(model_out, reference)

[{'rouge-1': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
  'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-l': {'r': 1.0, 'p': 1.0, 'f': 0.999999995}},
 {'rouge-1': {'r': 1.0, 'p': 0.6666666666666666, 'f': 0.7999999952000001},
  'rouge-2': {'r': 1.0, 'p': 0.6, 'f': 0.7499999953125},
  'rouge-l': {'r': 1.0, 'p': 0.6666666666666666, 'f': 0.7999999952000001}},
 {'rouge-1': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
  'rouge-2': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
  'rouge-l': {'r': 1.0, 'p': 1.0, 'f': 0.999999995}},
 {'rouge-1': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
  'rouge-2': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
  'rouge-l': {'r': 1.0, 'p': 1.0, 'f': 0.999999995}},
 {'rouge-1': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-l': {'r': 0.0, 'p': 0.0, 'f': 0.0}}]
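The last answer scores zero even though `Rollo` and `Rollo,` differ only by a trailing comma, which suggests that punctuation attached to a word makes it a different token. `France.` still matched `France`, so the package appears to handle a trailing period but not a comma; the exact tokenization rules are not shown here. One possible remedy, sketched below as an assumption rather than something the `rouge` package does for you, is to normalize both strings before scoring. The helper `normalize` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)


def normalize(text):
    """Remove ASCII punctuation and collapse runs of whitespace."""
    return ' '.join(text.translate(_STRIP_PUNCT).split())


# Two of the answer pairs from above that differ only in punctuation.
pairs = [{'predicted': 'France', 'true': 'France.'},
         {'predicted': 'Rollo', 'true': 'Rollo,'}]

for pair in pairs:
    # After normalization the prediction and the true answer are identical.
    print(normalize(pair['predicted']) == normalize(pair['true']))
```

Passing normalized strings to `get_scores` would let such pairs be compared word for word. Whether stripping punctuation is appropriate depends on the task, since it also removes punctuation that carries meaning.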

Ideally, we want a single set of metrics averaged over all answers. We can get this by passing `avg=True` to the `get_scores` method.

In [9]:
rouge.get_scores(model_out, reference, avg=True)

{'rouge-1': {'r': 0.8, 'p': 0.7333333333333333, 'f': 0.7599999960400001},
 'rouge-2': {'r': 0.6, 'p': 0.52, 'f': 0.5499999970625},
 'rouge-l': {'r': 0.8, 'p': 0.7333333333333333, 'f': 0.7599999960400001}}
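The averaged values are consistent with a plain arithmetic mean of the per-answer scores. The sketch below checks this for ROUGE-2 recall and precision, using the numbers copied from the per-answer output above. The helper `average_scores` is hypothetical and is only meant to illustrate the averaging, not to reproduce the package's internals.

```python
def average_scores(scores):
    """Average a list of {'metric': {'r': ..., 'p': ...}} dicts key by key."""
    averaged = {}
    for metric in scores[0]:
        averaged[metric] = {
            key: sum(s[metric][key] for s in scores) / len(scores)
            for key in scores[0][metric]
        }
    return averaged


# ROUGE-2 recall and precision for the five answers, taken from the
# per-answer results shown earlier.
per_answer = [
    {'rouge-2': {'r': 0.0, 'p': 0.0}},
    {'rouge-2': {'r': 1.0, 'p': 0.6}},
    {'rouge-2': {'r': 1.0, 'p': 1.0}},
    {'rouge-2': {'r': 1.0, 'p': 1.0}},
    {'rouge-2': {'r': 0.0, 'p': 0.0}},
]

print(average_scores(per_answer))
```

Three of the five answers have a ROUGE-2 recall of 1.0, giving the mean of 0.6, and the precisions sum to 2.6, giving 0.52. Both zero-scoring answers pull the average down, even though one of them differs from its reference only by punctuation.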