# Model Comparison

Now that a model has been fine-tuned and the outputs of the two models have been inspected visually, it is time to measure their performance quantitatively. To do this we shall use ROUGE-L, which scores the similarity between a generated summary and its target summary based on the longest common subsequence of words they share.

A rougeL score shall be given to each summary for both models.
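
To make the metric concrete, here is a minimal sketch of an LCS-based ROUGE-L F-measure in plain Python. The helper names `lcs_length` and `rouge_l_f1` are hypothetical, and the lowercase, whitespace-split tokenization is a simplifying assumption; the `rouge-score` package used below applies its own tokenizer, so its numbers can differ from this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length between the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(target, prediction):
    """ROUGE-L F-measure on lowercase, whitespace-split tokens (an assumption)."""
    ref = target.lower().split()
    hyp = prediction.lower().split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)  # share of predicted words that lie on the LCS
    recall = lcs / len(ref)     # share of target words that lie on the LCS
    return 2 * precision * recall / (precision + recall)


# Illustrative strings only, not rows from the dataset.
score = rouge_l_f1("normal coronary angiogram",
                   "the coronary angiogram was normal")
print(f"{score:.2f}")  # prints 0.50
```

In this example the longest common subsequence is "coronary angiogram" (2 words), so precision is 2/5, recall is 2/3, and the F-measure is 0.5. Because word order matters, the trailing "normal" in the prediction does not count as a match.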

In [None]:
!pip install --quiet rouge-score transformers


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import pandas as pd
import torch
import pathlib
from sklearn.model_selection import train_test_split

In [None]:
datapath = pathlib.Path.cwd()/'gdrive/My Drive/Capstone_three/data/mtsamples'
df = pd.read_json(datapath/'preprocessed2.jsonl', lines=True)

In [None]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

## The Models:

Here I shall be comparing the t5-small model fine-tuned on the MTSamples data for 50 epochs against the baseline (pretrained) t5-small. Performance is measured with the rougeL score, computed between each generated summary and its target.

In [None]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast as T5Tokenizer
base_model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')
ft_model = torch.load('/content/gdrive/MyDrive/Capstone_three/models/tuned_model.pkl')


In [None]:
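def summarize(text, model):
    """Generate a summary of `text` with `model` using beam search."""
    # Prefix the input with the summarization task prompt and tokenize it,
    # padding or truncating to 512 tokens.
    text_enc = tokenizer(
        "summarize:" + text + tokenizer.eos_token,
        max_length=512,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        add_special_tokens=True,
        return_tensors="pt"
    )
    # Beam search with a strong repetition penalty; summaries are capped at 128 tokens.
    generated_ids = model.generate(
        input_ids=text_enc['input_ids'],
        attention_mask=text_enc['attention_mask'],
        max_length=128,
        num_beams=4,
        repetition_penalty=5.5,
        length_penalty=1.1,
        early_stopping=True
    )
    # Decode the generated token ids back to text, dropping special tokens.
    predictions = [
        tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for gen_id in generated_ids
    ]

    # A single input yields a single sequence, so the join returns that one string.
    return "".join(predictions)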
    

Set up the scorer, then generate predictions and scores for each row in the testing set.

In [None]:
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rougeL'])

In [None]:
base_sums = []
base_rouge = []
ft_sums = []
ft_rouge = []

for item in df_test.index:
    row = df_test.iloc[item]
    bsum = summarize(row.input, base_model)
    base_sums.append(bsum)
    ftsum = summarize(row.input, ft_model)
    ft_sums.append(ftsum)

    brg = scorer.score(row.target,bsum)
    base_rouge.append(brg['rougeL'].fmeasure)
    ftrg = scorer.score(row.target,ftsum)
    ft_rouge.append(ftrg['rougeL'].fmeasure)








In [None]:
sum_df = pd.DataFrame({'input':df_test.input.values,'target':df_test.target.values,'base_model':base_sums,'base_rougeL':base_rouge,'fine_tuned':ft_sums,'ft_rougeL':ft_rouge})

In [None]:
sum_df.to_csv(datapath/'summary_results.csv')

In [None]:
sum_df.base_rougeL.mean()

0.16385148262878682

In [None]:
sum_df.ft_rougeL.mean()

0.18648782658953966

So preliminary results suggest that the fine-tuned model performs better (a mean rougeL of about 0.186 vs. about 0.164 for the baseline). Let's bootstrap this to see how confident we are that the means are different.

In [None]:
import numpy as np

In [None]:
base_boot = []
ft_boot = []

for _ in range(10000):
    base_boot.append(np.mean(np.random.choice(sum_df.base_rougeL.values, size=100)))
    ft_boot.append(np.mean(np.random.choice(sum_df.ft_rougeL.values, size=100)))

In [None]:
print(f'95% CI of mean rougeL for base: [{np.quantile(base_boot, 0.025):.4f},{np.quantile(base_boot, 0.975):.4f}]')
print(f'95% CI of mean rougeL for fine-tuned: [{np.quantile(ft_boot, 0.025):.4f},{np.quantile(ft_boot, 0.975):.4f}]')

95% CI of mean rougeL for base: [0.1299,0.2015]
95% CI of mean rougeL for fine-tuned: [0.1516,0.2242]


In [None]:
from scipy.stats import ttest_ind

In [None]:
ttest_ind(sum_df.base_rougeL, sum_df.ft_rougeL)

Ttest_indResult(statistic=-1.024233920462916, pvalue=0.3066216603115707)

With a p-value of about 0.31 and largely overlapping bootstrap intervals, we cannot really say that there is a significant difference in the overall performance between the two models at summarizing the testing data.
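
Both models are scored on the same test rows, so the two score columns are paired. One alternative worth considering is to bootstrap the per-row difference directly, resampling rows jointly and at the full sample size. The sketch below is an illustration under assumptions and not the analysis run above: the function name `paired_bootstrap_ci` is hypothetical and the demo scores are synthetic stand-ins, not the notebook's results.

```python
import numpy as np


def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile CI for mean(scores_b - scores_a), resampling rows jointly."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("paired scores must have the same length")
    diffs = b - a
    rng = np.random.default_rng(seed)
    # Each bootstrap replicate draws len(diffs) row indices with replacement.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lower, upper


# Synthetic stand-in scores for demonstration only.
demo_rng = np.random.default_rng(1)
demo_base = demo_rng.uniform(0.0, 0.4, size=200)
demo_ft = demo_base + demo_rng.normal(0.02, 0.05, size=200)

mean_diff, lo, hi = paired_bootstrap_ci(demo_base, demo_ft)
print(f"mean difference: {mean_diff:.4f}, 95% CI: [{lo:.4f}, {hi:.4f}]")
```

With the data from this notebook the call would be `paired_bootstrap_ci(sum_df.base_rougeL, sum_df.ft_rougeL)`. An interval that excludes zero would suggest a consistent per-row difference. `scipy.stats.ttest_rel` is the paired counterpart of the `ttest_ind` call used above.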

Let's now visually compare the summaries for a random sample of targets.

In [None]:
sample_df = sum_df.iloc[np.random.choice(sum_df.index, size=int(sum_df.shape[0]/10),replace=False)]

In [None]:
for item in sample_df.index:
    print(f'input: {sample_df.loc[item].input}')
    print(f'target: {sample_df.loc[item].target}')
    print(f'base model summary:\n{sample_df.loc[item].base_model}\nfine-tuned model:\n{sample_df.loc[item].fine_tuned} \n\n')

input: Urgent cardiac catheterization with coronary angiogram.The patient was brought urgently to the cardiac cath lab from the emergency room with the patient being intubated with an abnormal EKG and a cardiac arrest. The right groin was prepped and draped in usual manner. Under 2% lidocaine anesthesia, the right femoral artery was entered. A 6-French sheath was placed. The patient was already on anticoagulation. Selective coronary angiograms were then performed using a left and a 3DRC catheter. The catheters were reviewed. The catheters were then removed and an Angio-Seal was placed. There was some hematoma at the cath site
target: Normal coronary angiogram.
base model summary:
the patient was brought urgently to the cardiac cath lab from the emergency room. under 2% lidocaine anesthesia, the right femoral artery was entered.
fine-tuned model:
A 2-French sheath catheterization with coronary angiogram. The patient was intubated with an abnormal EKG and a cardiac arrest. 


input: Aort

## Conclusion:

Both models seem to struggle in some cases to truly grasp the important information. However, in the cases where the context is captured, the fine-tuned model seems to do a better job at distilling the important details. It either reduces the information to the necessary components or finds additional information that was missing from the base model summary.