# Model evaluation
To evaluate the quality of the generated summaries, the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score can be used. The ROUGE scoring algorithm evaluates the similarity between a candidate document (i.e. an abstract) and a reference document (i.e. the summary created by a model such as GPT-2).
The ROUGE score assesses the quality of a summary by counting how many n-grams in the reference document(s) match the n-grams in the candidate document.
The rouge library is used to calculate the ROUGE scores.
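
To make the n-gram matching concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization and compares sets of distinct n-grams; the helper names `ngrams` and `overlap_count` and the two example sentences are made up for illustration, and the rouge library's internals may differ in detail.

```python
def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_count(candidate, reference, n=1):
    """Return (shared n-grams, candidate n-grams, reference n-grams)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    return len(cand & ref), len(cand), len(ref)


# Hypothetical example sentences, not taken from the project data
candidate = "the model generates short scientific summaries"
reference = "the model generates summaries"

unigram_overlap = overlap_count(candidate, reference, n=1)
bigram_overlap = overlap_count(candidate, reference, n=2)
print(unigram_overlap, bigram_overlap)
```

The three counts returned per call are all that is needed to derive precision and recall for a given n.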

## Data preparation
To calculate ROUGE scores for a set of abstracts and summaries, the following .txt files are needed:

*   One .txt file containing the abstracts (one abstract per line).
*   One .txt file per model containing the summaries generated by that model (one summary per line; the order of the summaries has to match the order of the abstracts in the abstract document).
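
Because the scores are computed line by line, it can help to check that a pair of files is aligned before scoring. The following is a minimal sketch using only the standard library; the function names and the demo file `gpt2_demo.txt` are hypothetical, and the check only verifies equal line counts and the absence of blank lines, not that the content actually corresponds.

```python
import tempfile
from pathlib import Path


def read_lines(path):
    """Read a text file and return its lines without trailing newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def files_are_aligned(abstracts_path, summaries_path):
    """True if both files have the same number of lines and no blank lines."""
    abstracts = read_lines(abstracts_path)
    summaries = read_lines(summaries_path)
    no_blank = all(line.strip() for line in abstracts + summaries)
    return len(abstracts) == len(summaries) and no_blank


# Demo with throwaway files; in the notebook the real .txt files would be used
with tempfile.TemporaryDirectory() as tmp:
    abstracts_file = Path(tmp) / "abstracts.txt"
    summaries_file = Path(tmp) / "gpt2_demo.txt"
    abstracts_file.write_text("first abstract\nsecond abstract\n", encoding="utf-8")
    summaries_file.write_text("first summary\nsecond summary\n", encoding="utf-8")
    aligned = files_are_aligned(abstracts_file, summaries_file)

print(aligned)
```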




In [1]:
# Mount Google Drive
from google.colab import drive
import os

drive.mount('/content/gdrive')  # Mount Google Drive under /content/gdrive

project_dir = 'NLP_scientific-text-generation'
project_path = os.path.join('/content/gdrive/MyDrive', project_dir)
os.makedirs(project_path, exist_ok=True)  # Create the project folder if it does not exist yet
os.chdir(project_path)  # Change the working directory to the project folder on Google Drive

Mounted at /content/gdrive


## Score calculation

In [2]:
#Install the rouge library
!pip install rouge

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [3]:
# Calculate scores
from rouge import FilesRouge

os.chdir('/content/gdrive/My Drive/NLP_scientific-text-generation/')

files_rouge = FilesRouge()
# get_scores takes the hypothesis (candidate) file first and the reference file second;
# avg=True returns the scores averaged over all line pairs instead of one result per line.
rouge_gpt3_os = files_rouge.get_scores('abstracts.txt', 'gpt3_os.txt', avg=True)
rouge_gpt3_fs = files_rouge.get_scores('abstracts.txt', 'gpt3_fs.txt', avg=True)
rouge_gpt2_a_os = files_rouge.get_scores('abstracts.txt', 'gpt2_a_os.txt', avg=True)
rouge_gpt2_a_fs = files_rouge.get_scores('abstracts.txt', 'gpt2_a_fs.txt', avg=True)
rouge_gpt2_ft_os = files_rouge.get_scores('abstracts.txt', 'gpt2_ft_os.txt', avg=True)
rouge_gpt2_ft_fs = files_rouge.get_scores('abstracts.txt', 'gpt2_ft_fs.txt', avg=True)

In [4]:
import pandas as pd


def to_frame(scores, label):
    """Turn a ROUGE score dict into a DataFrame whose index is named after the model."""
    frame = pd.DataFrame(scores)
    frame.index.name = label
    return frame


rouge1 = to_frame(rouge_gpt3_os, "GPT-3 OS")
rouge2 = to_frame(rouge_gpt3_fs, "GPT-3 FS")
rouge3 = to_frame(rouge_gpt2_a_os, "GPT-2 AB OS")
rouge4 = to_frame(rouge_gpt2_a_fs, "GPT-2 AB FS")
rouge5 = to_frame(rouge_gpt2_ft_os, "GPT-2 FT OS")
rouge6 = to_frame(rouge_gpt2_ft_fs, "GPT-2 FT FS")


In [6]:
#Quick summary
print(rouge1,'\n',rouge2,'\n',rouge3,'\n',rouge4,'\n',rouge5,'\n',rouge6)

           rouge-1   rouge-2   rouge-l
GPT-3 OS                              
f         0.151580  0.097994  0.190932
p         0.085264  0.055060  0.112048
r         0.755006  0.503604  0.710610 
            rouge-1   rouge-2   rouge-l
GPT-3 FS                              
f         0.142686  0.095379  0.189342
p         0.080542  0.053712  0.111353
r         0.766042  0.551002  0.735236 
               rouge-1   rouge-2   rouge-l
GPT-2 AB OS                              
f            0.131404  0.050276  0.143804
p            0.075910  0.029046  0.086602
r            0.554176  0.204743  0.473762 
               rouge-1   rouge-2   rouge-l
GPT-2 AB FS                              
f            0.133295  0.053752  0.148717
p            0.078052  0.031571  0.089502
r            0.545004  0.216439  0.485919 
               rouge-1   rouge-2   rouge-l
GPT-2 FT OS                              
f            0.102876  0.034575  0.128974
p            0.059176  0.019741  0.077252
r            0

The columns contain the different ROUGE scores:

*   rouge-n: Overlap of n-grams between the abstracts and the summaries (rouge-1 uses unigrams, rouge-2 uses bigrams).
*   rouge-l: Based on the longest common subsequence (LCS). It naturally takes sentence-level structure similarity into account and automatically identifies the longest n-grams that co-occur in sequence.
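
The longest common subsequence behind rouge-l can be illustrated with a short dynamic-programming sketch. It assumes whitespace tokenization; the function name `lcs_length` and the example sentences are hypothetical, and the rouge library may compute its LCS-based score differently (for instance by working sentence by sentence).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)  # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # carry over the best so far
        prev = curr
    return prev[-1]


# Hypothetical example sentences
candidate = "the cat sat on the mat".split()
reference = "the cat was on a mat".split()
lcs = lcs_length(candidate, reference)
print(lcs)  # shared in-order tokens: "the cat on mat"
```

Unlike n-gram overlap, the matched tokens do not have to be adjacent; they only have to appear in the same order in both documents.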

Three metrics are reported for each score:

*   precision (p): Proportion of the n-grams in the candidate document that are also present in the reference document (e.g. a rouge-1 precision of 0.21 means that 21% of the unigrams in the candidate are also present in the reference).
*   recall (r): Proportion of the n-grams in the reference document that are also present in the candidate document (e.g. a rouge-1 recall of 0.45 means that 45% of the unigrams in the reference are also present in the candidate).
*   f-score (f): Harmonic mean of precision and recall, which combines both into a single number. It is high only when both precision and recall are high and, for a given sum of the two, greatest when they are equal.
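
As a worked example of how the three metrics relate, the sketch below derives them from raw counts. The function name `precision_recall_f` and the counts are made up for illustration; note that with `avg=True` the library reports averages over all line pairs, so the averaged f-score in the table need not equal the harmonic mean of the averaged precision and recall.

```python
def precision_recall_f(overlap, candidate_count, reference_count):
    """Compute precision, recall and f-score from n-gram counts."""
    precision = overlap / candidate_count if candidate_count else 0.0
    recall = overlap / reference_count if reference_count else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score


# Hypothetical counts: 4 shared unigrams, 6 in the candidate, 4 in the reference
p, r, f = precision_recall_f(overlap=4, candidate_count=6, reference_count=4)
print(round(p, 3), round(r, 3), round(f, 3))
```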




