# GPT-4 Summarization

## Creating a Prompt

To create the prompt, I will give you 10 training examples of the original text-summary pairs and 10 validation examples.
I also provide code below to check the performance on the 10 validation examples.
You have to enter the GPT-4 outputs for these examples by hand.
Not all experiments use exactly the same data as the original text-summary pairs (see below), but I think these examples are good enough to get a sense of the performance and to create one prompt for all experiments.
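
One possible way to assemble the training pairs into a few-shot prompt is sketched below. The helper `build_few_shot_prompt`, the instruction wording, and the `Text:`/`Summary:` layout are assumptions for illustration, not a prescribed prompt format. Only the `text` and `summary` keys are taken from the data files.

```python
def build_few_shot_prompt(examples, new_text, instruction="Summarize the following text."):
    """Assemble a few-shot prompt from text-summary pairs (hypothetical helper)."""
    parts = [instruction, ""]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Text: {ex['text']}")
        parts.append(f"Summary: {ex['summary']}")
        parts.append("")
    # The text to summarize goes last, with an open "Summary:" slot for the model
    parts.append(f"Text: {new_text}")
    parts.append("Summary:")
    return "\n".join(parts)


# Toy stand-ins for entries of prompt_train
toy_train = [
    {"text": "First document.", "summary": "First summary."},
    {"text": "Second document.", "summary": "Second summary."},
]
prompt = build_few_shot_prompt(toy_train, "New document.")
print(prompt)
```

With the real data, `toy_train` would be replaced by `prompt_train` and `new_text` by the `text` field of a validation example.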

## Experiments To Run

All other experiments come with their own 10 in-context examples.

### For quantitative performance estimates

1. Summarization of 100 original text-summary pairs
2. Summarization of 100 original text-summary pairs with short text (<4000 chars) and long summaries (>600 chars)
    * I did not mention this to you, but we also have to get the performance on this data.
    * This is a 20% subset of the data that I had to work with to make human annotation feasible. Texts that were too long were impossible to annotate.
    * Basically I just want to show that this subselection makes no difference in performance.
3. Not high priority, but could be useful: Summarization of 100 _cleaned and improved_ text-summary pairs when using 10 cleaned and improved in-context examples (10 validation _cleaned and improved data_)
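
The length criterion of experiment 2 can be sketched as a simple filter. The function name and the toy examples are hypothetical, and the thresholds are read as strict inequalities on character counts, which is an assumption about how the subset was actually built.

```python
MAX_TEXT_CHARS = 4000     # texts must be shorter than this
MIN_SUMMARY_CHARS = 600   # summaries must be longer than this


def is_short_text_long_summary(example,
                               max_text_chars=MAX_TEXT_CHARS,
                               min_summary_chars=MIN_SUMMARY_CHARS):
    """Return True if the example has a short text and a long summary (hypothetical helper)."""
    return (len(example["text"]) < max_text_chars
            and len(example["summary"]) > min_summary_chars)


# Toy examples standing in for text-summary pairs
toy_examples = [
    {"text": "a" * 3000, "summary": "b" * 700},  # kept
    {"text": "a" * 5000, "summary": "b" * 700},  # text too long
    {"text": "a" * 3000, "summary": "b" * 100},  # summary too short
]
subset = [ex for ex in toy_examples if is_short_text_long_summary(ex)]
print(f"Kept {len(subset)} of {len(toy_examples)} examples")
```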

### For annotating hallucinations and determining hallucination rates

4. Summarization of 25 examples when using in-context examples with unsupported facts (10 validation _original data_)
    * I will give you 50 test examples to have some for debugging
5. Summarization of 25 examples when using in-context examples with unsupported facts removed (10 validation _cleaned data_)
    * I will give you 50 test examples to have some for debugging

### For qualitative results with human annotation

6. Summarization of 25 examples when using in-context examples with unsupported facts removed and the text improved, e.g. de-identification removed (10 validation _cleaned and improved data_)
    * I will give you 50 test examples to have some for debugging
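
For experiments 4 to 6, 50 test examples are provided but only 25 are evaluated. One way to keep the two groups apart is a fixed, seeded split, sketched below. The random split, the seed, and the helper `split_debug_eval` are assumptions; the actual choice of the 25 evaluation examples may be made differently.

```python
import random


def split_debug_eval(test_examples, n_eval=25, seed=0):
    """Split examples into a debug part and an evaluation part (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, so the split is reproducible
    indices = list(range(len(test_examples)))
    rng.shuffle(indices)
    eval_idx = sorted(indices[:n_eval])
    debug_idx = sorted(indices[n_eval:])
    debug = [test_examples[i] for i in debug_idx]
    evaluation = [test_examples[i] for i in eval_idx]
    return debug, evaluation


# Toy stand-in for a list such as exp_4_test
toy_test = [{"text": f"text {i}", "summary": f"summary {i}"} for i in range(50)]
debug_examples, eval_examples = split_debug_eval(toy_test)
print(len(debug_examples), len(eval_examples))
```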

In [1]:
# Imports
import json
import random
import numpy as np
from collections import defaultdict
import evaluate
from rouge_score import rouge_scorer

In [9]:
# Read all files
def read_jsonl(file_name):
    with open(file_name, "r") as f:
        return [json.loads(line) for line in f]
    
prompt_train = read_jsonl('summarization_data/prompt_train.json')
prompt_valid = read_jsonl('summarization_data/prompt_valid.json')

exp_1_in_context = read_jsonl('summarization_data/exp_1_in-context.json')
exp_1_test = read_jsonl('summarization_data/exp_1_test.json')
exp_2_in_context = read_jsonl('summarization_data/exp_2_in-context.json')
exp_2_test = read_jsonl('summarization_data/exp_2_test.json')
exp_3_in_context = read_jsonl('summarization_data/exp_3_in-context.json')
exp_3_test = read_jsonl('summarization_data/exp_3_test.json')

exp_4_in_context = read_jsonl('summarization_data/exp_4_in-context.json')
exp_4_test = read_jsonl('summarization_data/exp_4_test.json')
exp_5_in_context = read_jsonl('summarization_data/exp_5_in-context.json')
exp_5_test = read_jsonl('summarization_data/exp_5_test.json')

exp_6_in_context = read_jsonl('summarization_data/exp_6_in-context.json')
exp_6_test = read_jsonl('summarization_data/exp_6_test.json')

assert len(prompt_train) == 10
assert len(prompt_valid) == 10
# Assert length of in-context always 10
assert len(exp_1_in_context) == 10
assert len(exp_2_in_context) == 10
assert len(exp_3_in_context) == 10
assert len(exp_4_in_context) == 10
assert len(exp_5_in_context) == 10
assert len(exp_6_in_context) == 10
# Assert length of test
assert len(exp_1_test) == 100
assert len(exp_2_test) == 100
assert len(exp_3_test) == 100
assert len(exp_4_test) == 50
assert len(exp_5_test) == 50
assert len(exp_6_test) == 50

In [10]:
# Use custom rouge function to obtain rouge 3/4 which are not available in huggingface
def get_rouge_score(gold, pred):
    rouge_scores = ['rouge1', 'rouge2', 'rouge3', 'rouge4', 'rougeL']
    scorer = rouge_scorer.RougeScorer(rouge_scores, use_stemmer=True)
    scores = scorer.score(gold, pred)
    return {k: scores[k].fmeasure * 100 for k in rouge_scores}

def compute_custom_metrics(srcs, golds, preds, device):
    scores = defaultdict(list)
    bertscore = evaluate.load("bertscore")
    sari = evaluate.load("sari")
    
    # For rouge and length go over examples one by one and determine mean
    for gold, pred in zip(golds, preds):
        for k, v in get_rouge_score(gold, pred).items():
            scores[k].append(v)
        scores['words'].append(len(pred.split(' ')))
    for k, v in scores.items():
        scores[k] = np.mean(v)

    # This is the default call using model_type="roberta-large"
    # This is the same as in the paper "Generation of Patient After-Visit Summaries to Support Physicians" (AVS_gen/eval_summarization.py) using the library SummerTime
    scores['bert_score'] = np.mean((bertscore.compute(predictions=preds, references=golds, lang="en", device=device))['f1']) * 100
    # BERTScore authors recommend "microsoft/deberta-large-mnli" (https://github.com/Tiiiger/bert_score)
    scores['bert_score_deberta-large'] = np.mean((bertscore.compute(predictions=preds, references=golds, device=device, model_type="microsoft/deberta-large-mnli"))['f1']) * 100
    scores['sari'] = sari.compute(sources=srcs, predictions=preds, references=[[g] for g in golds])['sari']
    # scores['sari'] = scores['sari'][0]
    # Importing readability for the Dale-Chall score is not working: https://pypi.org/project/py-readability-metrics/

    return {k: round(v, 2) for k, v in scores.items()}
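
To make clear what the ROUGE-3 and ROUGE-4 scores measure, here is a simplified pure-Python sketch of the ROUGE-n F-measure. It uses lowercased whitespace tokens and no stemming, so it is not equivalent to `rouge_scorer.RougeScorer` with `use_stemmer=True` and its numbers can differ. The function names and the example sentences are made up for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(gold, pred, n):
    """Simplified ROUGE-n F-measure in percent (whitespace tokens, no stemming)."""
    gold_ngrams = ngram_counts(gold.lower().split(), n)
    pred_ngrams = ngram_counts(pred.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both
    overlap = sum((gold_ngrams & pred_ngrams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_ngrams.values())
    recall = overlap / sum(gold_ngrams.values())
    return 2 * precision * recall / (precision + recall) * 100


gold = "the patient was discharged home in stable condition"
pred = "the patient was discharged in stable condition"
for n in (1, 2, 3, 4):
    print(f"rouge{n}: {rouge_n_f1(gold, pred, n):.2f}")
```

Dropping a single word from the middle of the sentence barely changes the unigram score but breaks every longer n-gram that spans the gap, which is why the higher-order scores fall much faster.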

In [11]:
# Creating prompt

# To obtain the validation performance on the 10 validation examples,
# paste each GPT-4 output into the slot of its validation example (empty strings are skipped)
prompt_valid_gpt_predicitions = [
    "This is a test prediction.",  # validation example 0
    "",  # validation example 1
    "",  # validation example 2
    "",  # validation example 3
    "",  # validation example 4
    "",  # validation example 5
    "",  # validation example 6
    "",  # validation example 7
    "",  # validation example 8
    "",  # validation example 9
]

srcs = []
golds = []
preds = []
for i, pred in enumerate(prompt_valid_gpt_predicitions):
    if pred != "":
        srcs.append(prompt_valid[i]['text'])
        golds.append(prompt_valid[i]['summary'])
        preds.append(pred)
        
print(f"Evaluate on {len(srcs)} validation examples.")
compute_custom_metrics(srcs, golds, preds, "cuda")

# Model                                    & R-1 & R-2 & R-3 & R-L & BERTScore & Deberta & SARI & Words \\ \midrule
# Llama 2 70B (100 training ex.)           & 43  & 15  & 6   & 25  & 87        & 62      & 44.24 & 125  \\

Evaluate on 1 validation examples.


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'rouge1': 5.19,
 'rouge2': 0.0,
 'rouge3': 0.0,
 'rouge4': 0.0,
 'rougeL': 5.19,
 'words': 5.0,
 'bert_score': 83.34,
 'bert_score_deberta-large': 43.23,
 'sari': 50.95}
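
The returned dictionary can be turned into a row for the LaTeX table in the comment above. The sketch below assumes the column order of that header and the rounding seen in the Llama 2 row (whole numbers except SARI with two decimals). The helper `format_latex_row` is hypothetical.

```python
def format_latex_row(model_name, scores):
    """Format a metrics dict as one LaTeX table row (hypothetical helper)."""
    cells = [
        f"{scores['rouge1']:.0f}",
        f"{scores['rouge2']:.0f}",
        f"{scores['rouge3']:.0f}",
        f"{scores['rougeL']:.0f}",
        f"{scores['bert_score']:.0f}",
        f"{scores['bert_score_deberta-large']:.0f}",
        f"{scores['sari']:.2f}",
        f"{scores['words']:.0f}",
    ]
    return f"{model_name} & " + " & ".join(cells) + r" \\"


# Scores copied from the single-example test output above
test_scores = {
    'rouge1': 5.19, 'rouge2': 0.0, 'rouge3': 0.0, 'rouge4': 0.0, 'rougeL': 5.19,
    'words': 5.0, 'bert_score': 83.34, 'bert_score_deberta-large': 43.23, 'sari': 50.95,
}
row = format_latex_row("GPT-4", test_scores)
print(row)
```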