# Model evaluation
To evaluate the quality of the generated summaries, the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score can be used. The ROUGE scoring algorithm evaluates the similarity between a candidate document (here, an abstract) and a reference document (here, a summary generated by one of the GPT models).
The ROUGE score assesses the quality of a summary by counting how many n-grams in the reference document(s) match the n-grams in the candidate document.
To calculate ROUGE scores, the rouge library is used.
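
To make the n-gram counting concrete, here is a minimal sketch of a ROUGE-N computation in plain Python. It is not the implementation used by the rouge library: the function names `ngrams` and `rouge_n` are hypothetical, tokenization is a simple whitespace split, and overlaps are counted with clipped counts, so the numbers may differ slightly from the library's.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Illustrative ROUGE-N: precision, recall and f-score from n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped overlap)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f}


print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1))
print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2))
```

In the unigram case, five of the six tokens overlap, so precision and recall are both 5/6. In the bigram case only three of the five bigrams match, which shows why rouge-2 is usually lower than rouge-1.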

## Data preparation
To calculate ROUGE scores for a set of abstracts and summaries, a set of .txt files is needed:

*   One .txt file containing the abstracts (one abstract per line).
*   One .txt file per model containing the summaries generated by that model (one summary per line; the order of the summaries has to match the order of the abstracts in the abstract document).
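
The file layout described above could be produced with a small helper like the one below. This is a sketch under assumptions: the helper names `write_one_per_line` and `count_lines` and the example file names are hypothetical, and the texts are assumed to be available as Python lists of strings. Line breaks inside a text are collapsed to spaces so that each text occupies exactly one line.

```python
from pathlib import Path


def write_one_per_line(texts, path):
    """Write each text on its own line, collapsing internal whitespace and line breaks."""
    lines = [" ".join(text.split()) for text in texts]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def count_lines(path):
    """Return the number of lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Hypothetical example data; in practice these lists come from the dataset and the models
abstracts = ["First abstract,\nspread over two lines.", "Second abstract."]
summaries = ["Summary of the first abstract.", "Summary of the second abstract."]

write_one_per_line(abstracts, "example_abstracts.txt")
write_one_per_line(summaries, "example_summaries.txt")

# The two files must have the same number of lines, in the same order
assert count_lines("example_abstracts.txt") == count_lines("example_summaries.txt")
```

Checking the line counts before scoring catches the most common alignment problem, a stray line break inside one abstract or summary that shifts every following pair.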




In [4]:
# Mount Google Drive and switch to the project folder
from google.colab import drive
import os

drive.mount('/content/gdrive')  # Mount Google Drive under /content/gdrive

project_dir = 'NLP_scientific-text-generation'
project_path = os.path.join('/content/gdrive/MyDrive', project_dir)
os.makedirs(project_path, exist_ok=True)  # Create the project folder if it does not exist yet
os.chdir(project_path)  # Make the project folder on Google Drive the working directory

Mounted at /content/gdrive


## Score calculation

In [16]:
#Install the rouge library
!pip install rouge

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [24]:
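# Calculate scores
# FilesRouge.get_scores takes the hypothesis (candidate) file first and the reference file second
os.chdir('/content/gdrive/MyDrive/NLP_scientific-text-generation/')  # Same project folder as in the mounting cell
from rouge import FilesRouge

files_rouge = FilesRouge()
rouge_gpt3 = files_rouge.get_scores('abstracts.txt', 'gpt3.txt', avg=True)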
rouge_gpt2_a_os = files_rouge.get_scores('abstracts.txt', 'gpt2_a_os.txt', avg=True)
rouge_gpt2_a_fs = files_rouge.get_scores('abstracts.txt', 'gpt2_a_fs.txt', avg=True)
rouge_gpt2_ft_os = files_rouge.get_scores('abstracts.txt', 'gpt2_ft_os.txt', avg=True)
rouge_gpt2_ft_fs = files_rouge.get_scores('abstracts.txt', 'gpt2_ft_fs.txt', avg=True)

In [25]:
import pandas as pd

# One DataFrame per model: columns are the ROUGE variants, rows are f, p and r
rouge1 = pd.DataFrame(rouge_gpt3)
rouge1.index.name = "GPT-3"

rouge2 = pd.DataFrame(rouge_gpt2_a_os)
rouge2.index.name = "GPT-2 AB OS"

rouge3 = pd.DataFrame(rouge_gpt2_a_fs)
rouge3.index.name = "GPT-2 AB FS"

rouge4 = pd.DataFrame(rouge_gpt2_ft_os)
rouge4.index.name = "GPT-2 FT OS"

rouge5 = pd.DataFrame(rouge_gpt2_ft_fs)
rouge5.index.name = "GPT-2 FT FS"


In [27]:
#Quick summary
print(rouge1,'\n',rouge2,'\n',rouge3,'\n',rouge4,'\n',rouge5)

        rouge-1   rouge-2   rouge-l
GPT-3                              
f      0.238128  0.187547  0.290337
p      0.147665  0.117890  0.188683
r      0.790589  0.578159  0.764191 
               rouge-1   rouge-2   rouge-l
GPT-2 AB OS                              
f            0.315052  0.135779  0.258778
p            0.250132  0.112231  0.202112
r            0.445498  0.177827  0.374828 
               rouge-1   rouge-2   rouge-l
GPT-2 AB FS                              
f            0.315807  0.131853  0.255497
p            0.249998  0.107336  0.196013
r            0.449041  0.177281  0.383152 
               rouge-1   rouge-2   rouge-l
GPT-2 FT OS                              
f            0.254266  0.084814  0.218212
p            0.194343  0.066174  0.168405
r            0.387685  0.123463  0.340429 
               rouge-1   rouge-2   rouge-l
GPT-2 FT FS                              
f            0.286813  0.125116  0.247005
p            0.217478  0.097511  0.193752
r            0

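The columns contain the different ROUGE scores:

*   rouge-n: Overlap of n-grams between the abstracts and the summaries (rouge-1 uses unigrams, rouge-2 uses bigrams).
*   rouge-l: Based on the longest common subsequence (LCS) of words shared by the two documents. Because the LCS must preserve word order, it takes sentence-level structure into account, and it finds the longest in-sequence matches automatically without requiring a fixed n-gram length.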
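
The longest-common-subsequence idea behind rouge-l can be sketched as follows. This is an illustrative simplification and not the rouge library's implementation: `lcs_length` and `rouge_l` are hypothetical names, each document is treated as a single token sequence, and tokens come from a whitespace split.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Illustrative ROUGE-L: precision, recall and f-score from the LCS length."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"p": precision, "r": recall, "f": f}


# Same words, different order: all unigrams overlap, but the LCS covers only half of them
print(rouge_l("police killed the gunman", "the gunman police killed"))
```

The example shows the difference from rouge-1: both sentences contain the same four words, so the unigram overlap is complete, while the longest in-order match is only two words long.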

Three different statistics are reported in the rows:

*   precision (p): Proportion of the n-grams in the candidate document that are also present in the reference document. In the calls above, abstracts.txt is passed first and is therefore treated as the candidate, so a rouge-1 precision of 0.21 means that 21% of the unigrams in the abstract are also present in the generated summary.
*   recall (r): Proportion of the n-grams in the reference document that are also present in the candidate document. With the same argument order, a rouge-1 recall of 0.45 means that 45% of the unigrams in the generated summary are also present in the abstract.
*   f-score (f): Harmonic mean of precision and recall. It is high only when both precision and recall are high; for a fixed sum of the two, it is greatest when they are equal.
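
The harmonic mean can be checked with a few lines of Python. The sketch below uses made-up precision and recall values, and the helper name `f_score` is hypothetical. It also illustrates one possible reason why the f row in the tables above is not exactly the harmonic mean of the p and r rows, assuming the rouge library averages each statistic over the document pairs when avg=True.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up (precision, recall) values for two document pairs
per_document = [(0.2, 0.8), (0.8, 0.2)]

mean_p = sum(p for p, _ in per_document) / len(per_document)
mean_r = sum(r for _, r in per_document) / len(per_document)

# Averaging the per-document f-scores ...
mean_of_f = sum(f_score(p, r) for p, r in per_document) / len(per_document)
# ... is not the same as taking the f-score of the averaged precision and recall
f_of_means = f_score(mean_p, mean_r)

print(round(mean_of_f, 2), round(f_of_means, 2))
```

Here each pair has an f-score of about 0.32, while the averaged precision and recall are both 0.5 and would give an f-score of 0.5. An unbalanced precision and recall therefore pull the f-score down even when their mean is unchanged.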




