Long text summarization refers to the task of generating a concise summary of a lengthy document or piece of text. However, a common challenge in performing long text summarization is the limitation on the maximum allowed input, often set at 512 tokens. Tokens can be words, subwords, or characters depending on the tokenization method used.

The maximum token limit is typically imposed because of computational constraints and to keep processing efficient in the models used for text summarization, such as the transformer-based models T5, FLAN-T5, and BART used in our experiments.
In this notebook we explore and evaluate a few strategies for dealing with the problem of long text summarization in our task.

In [1]:
!pip install datasets transformers rouge-score nltk -q
!pip install torch==1.7.1 -q

[0m[31mERROR: Could not find a version that satisfies the requirement torch==1.7.1 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1)[0m[31m
[0m[31mERROR: No matching distribution found for torch==1.7.1[0m[31m
[0m

## Import the Libraries

In [2]:
import numpy as np
import pandas as pd
import nltk

import torch
import datasets
import warnings
import transformers
from tqdm import tqdm
from datasets import Dataset
from datasets import load_metric
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

warnings.filterwarnings("ignore")
print(transformers.__version__)



4.28.1


In [3]:
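DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# get_device_name() raises on CPU-only machines, so query it only when CUDA is available
device_name = torch.cuda.get_device_name(0) if DEVICE == "cuda" else "CPU"
print("[INFO] training using {}".format(device_name))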

[INFO] training using Tesla P100-PCIE-16GB


**Helper functions**

In [4]:
import math
metric = load_metric("rouge")

def split_string(string, n):
    words = string.split()
    total_words = len(words)
    words_per_part = math.ceil(total_words / n)
    parts = []

    current_part = []
    for word in words:
        current_part.append(word)
        if len(current_part) == words_per_part:
            parts.append(" ".join(current_part))
            current_part = []

    if current_part:
        parts.append(" ".join(current_part))

    return parts

def compute_metrics_(preds, labels):
    labels=list(np.array(labels))
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    return {k: round(v, 4) for k, v in result.items()}


## Load the test data

In [5]:
test=pd.read_csv('/kaggle/input/train-val-test-split/test.csv')

## Load the tokenizer and base model

In [6]:
model_checkpoint ='t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
pad_on_right = tokenizer.padding_side == "right"
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)


## Load the fine-tuned weights

In [7]:
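# map_location lets the checkpoint load even when it was saved on a different device
model.load_state_dict(torch.load('/kaggle/input/t5-base-model/t5-base-finetuned-newsarticles/checkpoint-1500/pytorch_model.bin', map_location=DEVICE))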
model.to(DEVICE)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

## Performance on test data using vanilla inference

In [8]:
normal_predictions=[]
for i in tqdm(test['document']):
    text='summarize:'+ i
    inputs = tokenizer(text, return_tensors="pt").input_ids.to(DEVICE)
    outputs = model.generate(inputs, max_new_tokens=128, num_beams=3, do_sample=False)
    summary=tokenizer.decode(outputs[0], skip_special_tokens=True)
    normal_predictions.append(summary)

  0%|          | 0/445 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (958 > 512). Running this sequence through the model will result in indexing errors
100%|██████████| 445/445 [16:16<00:00,  2.19s/it]


In [9]:
normal_result=compute_metrics_(normal_predictions, test['summary'])
normal_result

{'rouge1': 57.9471, 'rouge2': 51.0773, 'rougeL': 43.6278, 'rougeLsum': 48.0644}
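To make the scores above easier to read, here is a simplified sketch of what a ROUGE-1 F-measure counts: unigram overlap between a prediction and its reference. This is not the `rouge` metric used in this notebook; it skips stemming, the package's own tokenization, and the aggregation behind `.mid`. The function name `rouge1_f` is hypothetical.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F-measure based on lowercase whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("the cat sat", "the cat sat on the mat")
print(round(score, 4))  # 0.6667 (precision 1.0, recall 0.5)
```

In this toy case every predicted word appears in the reference, but the prediction covers only half of the reference, so recall pulls the F-measure down.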

## Analyze the data
We look for examples in our test data that have more than 512 tokens after tokenization. This tells us how many test documents exceed the model's 512-token limit. If such documents are truncated to fit, the model can miss important parts of the document while generating the summary, which can hurt the scores.

In [10]:
test['tokenized_length']=test['document'].apply(lambda x:len(tokenizer(x)['input_ids']))
test[test['tokenized_length']>512]

| Unnamed: 0.1 | Unnamed: 0 | document | summary | categories | articles_length | summaries_length | tokenized_length |
|---|---|---|---|---|---|---|---|
| 0 | 314 | blair dismisses quit claim report\n\ntony blai... | former welfare minister frank field mp said th... | politics | 625 | 255 | 957 |
| 1 | 1310 | web helps collect aid donations\n\nthe web is ... | many of the sites that google lists are also t... | tech | 468 | 191 | 573 |
| 3 | 904 | reds sink 10-man magpies\n\ntitus bramble's ow... | given, andrew o'brien, elliott, bramble, berna... | sport | 427 | 206 | 830 |
| 6 | 1561 | charity single for quake relief\n\nsingers inc... | he said the song was a slow ballad and would w... | entertainment | 360 | 135 | 518 |
| 7 | 1862 | giving financial gifts to children\n\nyour chi... | therefore, it may be preferable for parents to... | business | 793 | 340 | 1002 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 438 | 2058 | japan's ageing workforce: built to last\n\nin ... | glen wood, vice president of deutsche securiti... | business | 891 | 388 | 1311 |
| 439 | 463 | scotland v italy (sat)\n\nmurrayfield, edinbur... | and the pressure is on scotland coach matt wil... | sport | 425 | 157 | 816 |
| 442 | 123 | more reforms ahead says milburn\n\nlabour will... | labour will continue to pursue controversial r... | politics | 542 | 220 | 732 |
| 443 | 860 | jones happy with henson heroics\n\nwales fly-h... | jones was happy to hail henson's heroic contri... | sport | 453 | 193 | 714 |


In [11]:
test[test['tokenized_length']>512]['tokenized_length'].describe()

count     217.000000
mean      849.032258
std       553.437560
min       515.000000
25%       604.000000
50%       713.000000
75%       890.000000
max      5077.000000
Name: tokenized_length, dtype: float64

There are 217 rows in the test data (out of 445, about 49%) with a tokenized length above 512. 75% of them fall between 515 and 890 tokens, while the remaining 25% exceed 890 tokens, up to a maximum of 5077.
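As a rough worked example of what these lengths imply for chunking, the minimum number of 512-token chunks a document needs is the ceiling of its token length divided by 512. The sketch below applies that to the statistics reported above. `chunks_needed` is a hypothetical helper, not part of this notebook, and it ignores both the `summarize:` prefix and the fact that the notebook splits by words rather than by tokens.

```python
import math

TOKEN_LIMIT = 512  # model input limit discussed in this notebook


def chunks_needed(token_length: int, limit: int = TOKEN_LIMIT) -> int:
    """Lower bound on the number of chunks so that each holds at most `limit` tokens."""
    return max(1, math.ceil(token_length / limit))


# Token lengths taken from the describe() output above
for label, length in [("min", 515), ("25%", 604), ("50%", 713), ("75%", 890), ("max", 5077)]:
    print(label, length, chunks_needed(length))
# min 515 2
# 25% 604 2
# 50% 713 2
# 75% 890 2
# max 5077 10
```

By this estimate, at least three quarters of the over-length documents fit in two chunks, and only the longest outliers need many more.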

## Methods tried to solve the problem

1. Map reduce
2. Modified Map reduce
3. Refine

## Map Reduce
This method is inspired by the `MapReduceDocumentsChain` implemented in the LangChain library.
The steps followed in this inference method are:
* Calculate the number of tokens in the document to be summarized. If the number of tokens is <= 520, we run vanilla inference on the text. We use a threshold slightly above 512 so that a document only just over the limit is not split into chunks, one of which would hold just a few tokens with little meaningful information. If the number of tokens is > 520, we divide the document into equal parts by word count, choosing the number of parts so that each chunk stays at roughly 512 tokens or fewer. We handle documents of up to about 5000 tokens, as the longest document in our test data has 5077 tokens.
* Next, we pass each chunk to the model and generate a summary for each chunk.
* Then we concatenate the chunk summaries and summarize the concatenated text to produce the final summary.
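The first step above relies on fixed length buckets and an even split by word count, which only approximately keeps chunks under the limit. As an alternative sketch, not the approach used in this notebook, words can be packed greedily into chunks against an explicit token budget. `chunk_by_token_budget` and `word_count` are hypothetical names, and the whitespace counter is a stand-in for a real tokenizer.

```python
from typing import Callable, List


def chunk_by_token_budget(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Greedily pack whole words into chunks that stay within max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the current chunk
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Stand-in token counter (assumption): one token per whitespace-separated word
def word_count(s: str) -> int:
    return len(s.split())


parts = chunk_by_token_budget("a b c d e f g", max_tokens=3, count_tokens=word_count)
print(parts)  # ['a b c', 'd e f', 'g']
```

With the notebook's tokenizer, `count_tokens` could be something like `lambda s: len(tokenizer(s)['input_ids'])`. Two caveats: a single word longer than the budget still ends up in its own over-budget chunk, and re-counting the candidate on every word is slow for long documents.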

The `modified` argument is used for **Modified Map Reduce**, where the only modification is that we omit the last step of the Map Reduce function and return the concatenated chunk summaries as the final summary.
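The control flow of both variants can be sketched independently of the model. In the minimal example below, `map_reduce_summary` and the toy `first_two_words` summarizer are hypothetical stand-ins. Unlike the notebook's function, the sketch joins the partial summaries with a space.

```python
from typing import Callable, List


def map_reduce_summary(
    chunks: List[str],
    summarize: Callable[[str], str],
    modified: bool = False,
) -> str:
    # Map step: summarize each chunk independently
    partial_summaries = [summarize(chunk) for chunk in chunks]
    combined = " ".join(partial_summaries)
    if modified:
        # Modified variant: skip the reduce step and return the concatenation
        return combined
    # Reduce step: summarize the concatenated partial summaries
    return summarize(combined)


# Toy summarizer (assumption): keep the first two words of the input
def first_two_words(text: str) -> str:
    return " ".join(text.split()[:2])


chunks = ["alpha beta gamma", "delta epsilon zeta"]
print(map_reduce_summary(chunks, first_two_words, modified=True))   # alpha beta delta epsilon
print(map_reduce_summary(chunks, first_two_words, modified=False))  # alpha beta
```

The toy output shows the trade-off: the modified variant keeps something from every chunk, while the reduce step can discard material from later chunks.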

In [12]:
def map_reduce(text, modified=True):
    i=text
    token_length=len(tokenizer(i)['input_ids'])
    if token_length<=520:
        text='summarize:'+ i
        inputs = tokenizer(text, return_tensors="pt").input_ids.to(DEVICE)
        outputs = model.generate(inputs, max_new_tokens=128, num_beams=3, do_sample=False)
        summary=tokenizer.decode(outputs[0], skip_special_tokens=True)
        return summary
        
    elif token_length>520 and token_length<=1024:
        input_list=split_string(i, 2)
        
    elif token_length>1024 and token_length<=1536:
        input_list=split_string(i, 3)
        
    elif token_length>1536 and token_length<=2100:
        input_list=split_string(i, 4)
        
    elif token_length>2100 and token_length<=3000:
        input_list=split_string(i, 5)
        
    elif token_length>3000 and token_length<=4000:
        input_list=split_string(i, 7)
        
    else:
        input_list=split_string(i, 8)

    main_summary=''
    for inp in input_list:
        text='summarize:'+ inp
        inputs = tokenizer(text, return_tensors="pt").input_ids.to(DEVICE)
        outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
        summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
        main_summary += summary  # chunk summaries are concatenated without a separator

    if modified:
        return main_summary
    
    text='summarize:'+ main_summary
    inputs = tokenizer(text, return_tensors="pt").input_ids.to(DEVICE)
    outputs = model.generate(inputs, max_new_tokens=128, num_beams=3, do_sample=False)
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary

In [13]:
mapreduce_predictions_=[]
for i in tqdm(test['document']):
    summary= map_reduce(i, modified= False)
    mapreduce_predictions_.append(summary) 
mr_result=compute_metrics_(mapreduce_predictions_, test['summary'])
mr_result

100%|██████████| 445/445 [32:17<00:00,  4.35s/it]


{'rouge1': 54.9131, 'rouge2': 46.3191, 'rougeL': 41.276, 'rougeLsum': 45.3409}

## Modified Map Reduce

In [14]:
mod_mapreduce_predictions=[]
for i in tqdm(test['document']):
    summary= map_reduce(i, modified= True)
    mod_mapreduce_predictions.append(summary) 
result=compute_metrics_(mod_mapreduce_predictions, test['summary'])
result

100%|██████████| 445/445 [24:24<00:00,  3.29s/it]


{'rouge1': 65.7871, 'rouge2': 55.4353, 'rougeL': 45.2729, 'rougeLsum': 51.3515}

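Modified map reduce improved on vanilla inference on every ROUGE metric (for example, ROUGE-1 rose from 57.95 to 65.79), whereas plain map reduce scored below vanilla inference (ROUGE-1 of 54.91).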

## Refine 
This method is inspired by the `RefineDocumentsChain` implemented in the LangChain library.
The steps followed in this inference method are:
* Calculate the number of tokens in the document to be summarized. If the number of tokens is <= 520, we run vanilla inference on the text. We use a threshold slightly above 512 so that a document only just over the limit is not split into chunks, one of which would hold just a few tokens with little meaningful information. If the number of tokens is > 520, we divide the document into equal parts by word count, choosing the number of parts so that each chunk stays at roughly 512 tokens or fewer. We handle documents of up to about 5000 tokens, as the longest document in our test data has 5077 tokens.
* Next, we pass the first chunk to the model and generate a summary. We then pass that summary together with the next chunk, so that the model refines the previous summary based on the new context. We continue this way until the last chunk and return the final generated summary.

Since more context (previous summary + new text) is passed at each step, we divide the text into more chunks than in Map Reduce so that each chunk contains fewer tokens.
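The refine loop can also be sketched without the model. In the minimal example below, `refine_summary` and the toy `last_three_words` summarizer are hypothetical stand-ins. The sketch separates the running summary and the next chunk with a space, which the notebook's function does not do.

```python
from typing import Callable, List


def refine_summary(chunks: List[str], summarize: Callable[[str], str]) -> str:
    summary = ""
    for chunk in chunks:
        # Each step sees the running summary followed by the next chunk
        context = f"{summary} {chunk}".strip()
        summary = summarize(context)
    return summary


# Toy summarizer (assumption): keep the last three words of the input
def last_three_words(text: str) -> str:
    return " ".join(text.split()[-3:])


chunks = ["a b c d", "e f", "g"]
print(refine_summary(chunks, last_three_words))  # e f g
```

Because each call receives the previous summary plus a new chunk, the input at every step is longer than the chunk alone, which is why the chunks need to be smaller here.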

In [15]:
def refine(text):
    i=text
    token_length=len(tokenizer(i)['input_ids'])
    if token_length<=520:
        text='summarize:'+ i
        inputs = tokenizer(text, return_tensors="pt").input_ids.to(DEVICE)
        outputs = model.generate(inputs, max_new_tokens=128, num_beams=3, do_sample=False)
        summary=tokenizer.decode(outputs[0], skip_special_tokens=True)
        return summary
        
    elif token_length>520 and token_length<=1024:
        input_list=split_string(i, 3)
        
    elif token_length>1024 and token_length<=1536:
        input_list=split_string(i, 4)
        
    elif token_length>1536 and token_length<=2100:
        input_list=split_string(i, 5)
        
    elif token_length>2100 and token_length<=3000:
        input_list=split_string(i, 6)
        
    elif token_length>3000 and token_length<=4000:
        input_list=split_string(i, 8)
        
    else:
        input_list=split_string(i, 9)

    res=''
    for inp in input_list:
        text='summarize:'+ res + inp
        inputs = tokenizer(text, return_tensors="pt").input_ids.to(DEVICE)
        outputs = model.generate(inputs, max_new_tokens=128, num_beams=3, do_sample=False)
        summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
        res=summary
    return summary

In [16]:
refine_predictions=[]
for i in tqdm(test['document']):
    summary=refine(i)
    refine_predictions.append(summary) 
refine_result=compute_metrics_(refine_predictions, test['summary'])
refine_result

100%|██████████| 445/445 [33:41<00:00,  4.54s/it]


{'rouge1': 54.99, 'rouge2': 46.5886, 'rougeL': 41.5034, 'rougeLsum': 45.5726}

## Modified Inference Function

In [17]:
def inference_function(text, method="normal"):
    if method=="normal": 
        text='summarize:'+ text
        inputs = tokenizer(text, return_tensors="pt").input_ids.to(DEVICE)
        outputs = model.generate(inputs, max_new_tokens=128, num_beams=3, do_sample=False)
        summary=tokenizer.decode(outputs[0], skip_special_tokens=True)
        
    elif method=="map_reduce":
        summary=map_reduce(text, modified=False)
            
    elif method=="modified_map_reduce":
        summary=map_reduce(text, modified=True)
            
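    elif method=="refine":
        summary=refine(text)

    else:
        # Fail loudly instead of hitting an unbound `summary` below
        raise ValueError("Unknown method: {}".format(method))

    return summary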

In [18]:
summary=inference_function(test['document'][0], method="normal")
summary

'mr blair said the claims were "reheated from six months ago" and that he was concentrating on running the country.but the prime minister said he had discussed these claims with the chancellor and dismissed them as a "load of nonsense".the liberal democrat parliamentary chairman matthew taylor said the personal ambition of mr blair and mr brown was "getting in the way of good government".the book, by sunday telegraph journalist robert peston and serialised in the newspaper, said the'

In [19]:
summary=inference_function(test['document'][0], method="map_reduce")
summary

'the liberal democrat parliamentary chairman matthew taylor said the personal ambition of mr blair and mr brown was "getting in the way of good government".what the book says is there is now a pretty profound mutual mistrust, mutual animosity."what the book says is there is now a pretty profound mutual mistrust, mutual animosity."what the book says is there is now a pretty profound mutual mistrust, mutual animosity."what the book says is there is now a pretty profound mutual'

In [20]:
summary=inference_function(test['document'][0], method="modified_map_reduce")
summary

'according to mr peston the prime minister said: "help me to get through the year and i will then stand down."what the book says is there is now a pretty profound mutual mistrust, mutual animosity."what the book says is there is now a pretty profound mutual mistrust, mutual animosity."what the book says is there is now a pretty profound mutual mistrust, mutual animosity."what the book says is there is now a pretty profound mutual mistrust, mutual animosity."what the book says is there is now a pretty profound mutualbut the prime minister said he had discussed these claims with the chancellor and dismissed them as a "load of nonsense".the liberal democrat parliamentary chairman matthew taylor said the personal ambition of mr blair and mr brown was "getting in the way of good government".the liberal democrat parliamentary chairman matthew taylor said the personal ambition of mr blair and mr brown was "getting in the way of good government".the former welfare minister frank field mp said 

In [21]:
summary=inference_function(test['document'][0], method="refine")
summary

'but the prime minister said he had discussed these claims with the chancellor and dismissed them as a "load of nonsense".the liberal democrat parliamentary chairman matthew taylor said the personal ambition of mr blair and mr brown was "getting in the way of good government".but, in a wide-ranging bbc interview covering issues such as the asian tsunami disaster, the middle east peace process and northern ireland, mr blair said: "when you get to the top in politics you get this'

In [22]:
test['summary'][0]

'former welfare minister frank field mp said the prime minister should sack mr brown, but did not believe mr blair was strong enough to do so.mr blair said the claims were "reheated from six months ago" and that he was concentrating on running the country.the liberal democrat parliamentary chairman matthew taylor said the personal ambition of mr blair and mr brown was "getting in the way of good government".according to mr peston the prime minister said: "help me to get through the year and i will then stand down."according to a new book, brown\'s britain, mr blair went back on a pledge to make way for mr brown after cabinet allies intervened in june 2004.mr blair said: "i\'ve dealt with this six months ago.and that at a dinner hosted by deputy prime minister john prescott he told mr brown of his intention to stand down.during the interview mr blair also said the former home secretary david blunkett would play a "big role" at the general election.tory leader michael howard accused the 