# Model Comparison

Now that the model has been fine-tuned and the summaries have been inspected visually, it is time to measure performance quantitatively. To do this we shall use ROUGE-L, which scores the similarity between a generated summary and its target summary based on the longest common subsequence of words they share.

A rougeL score shall be given to each summary for both models.
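
To make the metric concrete, here is a minimal sketch of ROUGE-L computed from the longest common subsequence (LCS). It is illustrative only: the helper names `lcs_length` and `rouge_l` are hypothetical, the example sentences are made up, and tokens are split on whitespace, whereas the `rouge-score` package used below applies its own tokenization, so its numbers may differ.

```python
from typing import List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length of the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(target: str, prediction: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) over whitespace tokens."""
    ref = target.lower().split()
    hyp = prediction.lower().split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(hyp)  # share of predicted tokens in the LCS
    recall = lcs / len(ref)     # share of target tokens in the LCS
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up sentences, not taken from the MTSamples data.
p, r, f = rouge_l("the patient has dental cavities", "patient has severe cavities")
print(f"precision={p:.3f} recall={r:.3f} fmeasure={f:.3f}")
```

The LCS here is "patient has cavities" (3 tokens), so precision is 3/4 and recall is 3/5. The F-measure is the value this notebook reports as the rougeL score.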

In [1]:
!pip install --quiet rouge-score transformers


In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
import pandas as pd
import torch
import pathlib
from sklearn.model_selection import train_test_split

In [4]:
datapath = pathlib.Path.cwd()/'gdrive/My Drive/Capstone_three/data/mtsamples'
df = pd.read_json(datapath/'preprocessed2.jsonl', lines=True)

In [5]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

## The Models:

Here I compare t5-small fine-tuned on the MTSamples data for 50 epochs against the baseline t5-small. Performance is measured with the rougeL score described above.

In [13]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast as T5Tokenizer
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
base_model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')
ft_model = torch.load('/content/gdrive/MyDrive/Capstone_three/models/tuned_model.pkl')

In [14]:
def summarize(text, model):
    """Generate a summary of `text` with `model` using beam search."""
    # Tokenize with the "summarize:" task prefix, padded/truncated to 512 tokens.
    text_enc = tokenizer(
        "summarize:"+text+tokenizer.eos_token,
        max_length=512,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        add_special_tokens=True,
        return_tensors="pt"
    )
    # Beam search with a strong repetition penalty; output capped at 128 tokens.
    generated_ids = model.generate(
        input_ids=text_enc['input_ids'],
        attention_mask=text_enc['attention_mask'],
        max_length=128,
        num_beams=4,
        repetition_penalty=5.5,
        length_penalty=1.1,
        early_stopping=True
    )
    # Decode the generated ids back to text, dropping special tokens.
    predictions = [
        tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for gen_id in generated_ids
    ]

    return "".join(predictions)
    

Set up the scorer, then generate predictions and scores for each row in the testing set.

In [9]:
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rougeL'])

In [17]:
base_sums = []
base_rouge = []
ft_sums = []
ft_rouge = []

for item in df_test.index:
    row = df_test.iloc[item]
    bsum = summarize(row.input, base_model)
    base_sums.append(bsum)
    ftsum = summarize(row.input, ft_model)
    ft_sums.append(ftsum)

    brg = scorer.score(row.target,bsum)
    base_rouge.append(brg['rougeL'])
    ftrg = scorer.score(row.target,ftsum)
    ft_rouge.append(ftrg['rougeL'])

In [91]:
sum_df = pd.DataFrame({'input':df_test.input.values,'target':df_test.target.values,'base_model':base_sums,'base_rougeL':base_rouge,'fine_tuned':ft_sums,'ft_rougeL':ft_rouge})

In [92]:
sum_df.to_csv(datapath/'summary_results.csv')

In [23]:
sum_df.base_rougeL = sum_df.base_rougeL.apply(lambda x: x.fmeasure)
sum_df.ft_rougeL = sum_df.ft_rougeL.apply(lambda x: x.fmeasure)

In [25]:
sum_df.base_rougeL.mean()

0.16385148262878682

In [26]:
sum_df.ft_rougeL.mean()

0.15551224612953585

The baseline model has a slightly higher mean rougeL (0.164 vs. 0.156), but the gap is small. Let's bootstrap the means to see how confident we can be that they actually differ.

In [34]:
import numpy as np

In [39]:
base_boot = []
ft_boot = []

for _ in range(10000):
    base_boot.append(np.mean(np.random.choice(sum_df.base_rougeL.values, size=100)))
    ft_boot.append(np.mean(np.random.choice(sum_df.ft_rougeL.values, size=100)))

In [52]:
print(f'95% CI of mean rougeL for base: [{np.quantile(base_boot, 0.025):.4f},{np.quantile(base_boot, 0.975):.4f}]')
print(f'95% CI of mean rougeL for fine-tuned: [{np.quantile(ft_boot, 0.025):.4f},{np.quantile(ft_boot, 0.975):.4f}]')

95% CI of mean rougeL for base: [0.1305,0.2006]
95% CI of mean rougeL for fine-tuned: [0.1247,0.1907]
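
The cell above resamples each model's scores independently and draws 100 values per replicate. Because both models are scored on the same test rows, an alternative worth considering is a paired bootstrap that resamples rows and looks at the mean of the per-row differences, drawing as many rows as the test set contains. The sketch below is one way to do that, not the method used to produce the intervals above. The function name `paired_bootstrap_ci` and the toy scores are hypothetical, and it uses only the standard library so it runs without the notebook's data.

```python
import random
from statistics import mean


def paired_bootstrap_ci(base_scores, ft_scores, n_boot=10000, alpha=0.05, seed=0):
    """Percentile CI for mean(ft - base), resampling rows with replacement."""
    if len(base_scores) != len(ft_scores):
        raise ValueError("scores must be paired row by row")
    rng = random.Random(seed)
    # One difference per test row keeps each pair of scores together.
    diffs = [f - b for b, f in zip(base_scores, ft_scores)]
    n = len(diffs)
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy scores for illustration only; these are NOT results from this notebook.
toy_base = [0.10, 0.25, 0.05, 0.30, 0.15, 0.20]
toy_ft = [0.12, 0.20, 0.08, 0.28, 0.18, 0.15]
lo, hi = paired_bootstrap_ci(toy_base, toy_ft, n_boot=2000)
print(f"95% CI of mean(ft - base): [{lo:.4f}, {hi:.4f}]")
```

With the real data the inputs would be `sum_df.base_rougeL` and `sum_df.ft_rougeL`. If the resulting interval contains 0, the data do not support a difference between the two models.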


In [53]:
from scipy.stats import ttest_ind

In [54]:
ttest_ind(sum_df.base_rougeL, sum_df.ft_rougeL)

Ttest_indResult(statistic=0.3944989300886372, pvalue=0.6935173778712012)

With p ≈ 0.69 and heavily overlapping bootstrap intervals, there is no evidence of a significant difference in overall rougeL between the two models on the testing data.
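
One caveat: `ttest_ind` treats the two sets of scores as independent samples, while here both models are scored on the same test rows. A paired test on the per-row differences may be the better fit, and `scipy.stats.ttest_rel` provides one together with a p-value. The sketch below only spells out the statistic that such a test is built on. The function name `paired_t_statistic` and the toy scores are hypothetical, and nothing here was run on the notebook's data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(base_scores, ft_scores):
    """Return (t, degrees of freedom) for the mean per-row difference base - ft."""
    if len(base_scores) != len(ft_scores):
        raise ValueError("scores must be paired row by row")
    diffs = [b - f for b, f in zip(base_scores, ft_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical")
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Toy scores for illustration only; these are NOT results from this notebook.
t_stat, dof = paired_t_statistic([0.2, 0.1, 0.3, 0.4], [0.1, 0.1, 0.1, 0.1])
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

A paired test removes the row-to-row variation that both models share, so it can be more sensitive than an independent-samples test when some inputs are simply harder to summarize than others.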

Let's now visually compare the summaries for a random sample of targets.

In [93]:
sample_df = sum_df.iloc[np.random.choice(sum_df.index, size=int(sum_df.shape[0]/10),replace=False)]

In [94]:
for item in sample_df.index:
    print(f'input: {sample_df.loc[item].input}')
    print(f'target: {sample_df.loc[item].target}')
    print(f'base model summary:\n{sample_df.loc[item].base_model}\nfine-tuned model:\n{sample_df.loc[item].fine_tuned} \n\n')

input: The system review was only positive for molar pain, but rest of the 13 review of systems were negative to date..,MEASUREMENTS: Height 135 cm and weight 28.1 kg.Atraumatic and normocephalic. The pupils are equal, round, and reactive to light. Full EOMs. The conjunctivae and sclerae are clear. The TMs show normal landmarks. The nasal mucosa is pink and moist. The teeth and gums are in good condition. The pharynx is clear
target: This is a 10-year-old with history of:1. Biliary atresia. 2. Status post orthotopic liver transplantation. 3. Dental cavities. 4. Food allergies. 5. History of urinary tract infections
base model summary:
system review was only positive for molar pain, but rest of the 13 systems were negative to date. conjunctivae and sclerae show normal landmarks.
fine-tuned model:
The system review was only positive for molar pain, but rest of the 13 review of systems were negative to date..MEASUREMENTS: Height 135 cm and weight 28.1 kg. 


input: She is awake, alert, an

## Conclusion:

Both models sometimes struggle to capture the important information. However, in the cases where the context is captured, the fine-tuned model seems to do a better job of distilling the important details: it either reduces the information to the necessary components or finds additional information that was missing from the base model summary. This impression is qualitative; the rougeL scores above show no significant difference between the two models.