## Text summarization

In this short project I use a pretrained model from the [transformers](https://huggingface.co/transformers/) library for text summarization. The dataset comes from [kaggle](https://www.kaggle.com/Cornell-University/arxiv) and consists of metadata of scientific papers from *arXiv*. Because of limited computational resources, I used the abstracts of the papers (instead of the entire articles) as the text to be summarized and the titles of these papers as targets. The downloaded model was fine-tuned on Google Colab and is hosted on [huggingface](https://huggingface.co/Tymoteusz/optics-abstracts-summarization). Fine-tuning followed this tutorial: [github](https://github.com/huggingface/notebooks/blob/master/examples/summarization.ipynb).

In [130]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import jsonlines
import re
from datasets import list_datasets, load_dataset, load_metric
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoTokenizer
from sklearn.model_selection import train_test_split
import nltk
import pickle

%matplotlib inline

### Data preprocessing

I selected only the papers related to optics, based on the *categories* field. To filter by this condition we can use a *regex*. The selected records are then put into a dataframe and split into a training and a validation set. The most efficient way to fine-tune a pretrained model from the transformers library is to create a *dataset* object. It lets us manipulate the data without reading all of it into memory, and it is cached automatically.
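The filtering step can be tried on a few toy records first. This is only a minimal sketch: the records below are made up, the helper name `is_optics` is hypothetical, and it assumes that the `categories` field is a space-separated string of arXiv category codes.

```python
import re

# Assumption: `categories` is a space-separated string of arXiv category
# codes, e.g. "physics.optics quant-ph". The records below are made up.
OPTICS_PATTERN = re.compile(r"optics")

def is_optics(categories: str) -> bool:
    """Return True if the categories string contains the substring 'optics'."""
    return OPTICS_PATTERN.search(categories) is not None

records = [
    {"title": "A", "categories": "physics.optics quant-ph"},
    {"title": "B", "categories": "hep-th"},
    {"title": "C", "categories": "cond-mat.mes-hall physics.optics"},
]

# Keep only the titles of the records that pass the filter.
selected = [r["title"] for r in records if is_optics(r["categories"])]
print(selected)
```

Note that a plain substring match like this would also accept any other category code that happens to contain the word "optics".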

In [49]:
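abstracts = []
titles = []
# The snapshot is read record by record, one paper's metadata at a time.
with jsonlines.open('arxiv-metadata-oai-snapshot.json') as reader:
    for obj in reader:
        # Keep only papers whose categories string mentions optics.
        if re.search('optics', obj['categories']) is not None:
            titles.append(obj['title'])
            abstracts.append(obj['abstract'])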

In [50]:
df = pd.DataFrame({'titles': titles, 'abstracts': abstracts})

In [51]:
df

| | titles | abstracts |
|---|---|---|
| 0 | Convergence of the discrete dipole approximati... | We performed a rigorous theoretical converge... |
| 1 | Convergence of the discrete dipole approximati... | We propose an extrapolation technique that a... |
| 2 | The discrete dipole approximation for simulati... | In this manuscript we investigate the capabi... |
| 3 | The discrete dipole approximation: an overview... | We present a review of the discrete dipole a... |
| 4 | Some new experimental photonic flame effect fe... | The results of the spectral, energetical and... |
| ... | ... | ... |
| 34057 | Quantum non-demolition (QND) modulation of qua... | We propose an experiment where quantum inter... |
| 34058 | Nonclassical correlations of photon number and... | It is shown that the quantum jumps in the ph... |
| 34059 | Optical Holonomic Quantum Computer | In this paper the idea of holonomic quantum ... |
| 34060 | Solutions to the Optical Cascading Equations | Group theoretical methods are used to study ... |


In [52]:
print('Title: ', df['titles'][35])
print('\n')
print('Abstract: ', df['abstracts'][35])

Title:  Computation and visualization of Casimir forces in arbitrary geometries:
  non-monotonic lateral forces and failure of proximity-force approximations


Abstract:    We present a method of computing Casimir forces for arbitrary geometries,
with any desired accuracy, that can directly exploit the efficiency of standard
numerical-electromagnetism techniques. Using the simplest possible
finite-difference implementation of this approach, we obtain both agreement
with past results for cylinder-plate geometries, and also present results for
new geometries. In particular, we examine a piston-like problem involving two
dielectric and metallic squares sliding between two metallic walls, in two and
three dimensions, respectively, and demonstrate non-additive and non-monotonic
changes in the force due to these lateral walls.



In [53]:
df_train, df_test = train_test_split(df, test_size=0.15)

In [None]:
df_train.to_csv('df_train.csv', index=False)
df_test.to_csv('df_test.csv', index=False)

In [None]:
# from google.colab import drive
# drive.mount('/content/gdrive')

In [3]:
dataset = load_dataset('csv', data_files={'train': 'df_train.csv', 'test': 'df_test.csv'},
                      cache_dir='D:\\ML_projects\\summarization_project')

Using custom data configuration default-1f5f109b43dbc7b4
Reusing dataset csv (D:\ML_projects\summarization_project\csv\default-1f5f109b43dbc7b4\0.0.0\e138af468cb14e747fb46a19c787ffcfa5170c821476d20d5304287ce12bbc23)


In [54]:
dataset

DatasetDict({
    train: Dataset({
        features: ['titles', 'abstracts'],
        num_rows: 28952
    })
    test: Dataset({
        features: ['titles', 'abstracts'],
        num_rows: 5110
    })
})

Now we can specify which kind of pretrained model to use and download the corresponding tokenizer.

In [55]:
model_checkpoint = "t5-small"

In [56]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Abstracts and titles need some preparation before they can be tokenized. We replace every character that is neither a word character nor whitespace, as well as every newline character, with a space. Some abstracts contain citations, but we leave them uncleaned, as most abstracts are composed of words only. The prepared titles and abstracts are then tokenized. The lengths of the tokenized abstracts and titles are limited by the model used, so anything longer is truncated.
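The effect of the cleaning step can be checked on a short string before it is applied to the whole dataset. The sketch below uses the same two regular expressions as the preprocessing function; the helper name `clean_text` and the sample string are made up for illustration.

```python
import re

def clean_text(text: str) -> str:
    """Apply the two substitutions used in the preprocessing step."""
    # Replace every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Replace newline characters with spaces.
    text = re.sub("\n", " ", text)
    return text

# Hypothetical sample with punctuation, a hyphen and a line break.
raw = "Light transmission,\nin photonic-crystal media (2D)."
print(repr(clean_text(raw)))
```

Each removed character becomes one space, so runs of punctuation leave several spaces in a row. Underscores and digits are kept, because `\w` treats them as word characters.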

In [57]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

In [58]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    examples['titles'] = [re.sub(r'[^\w\s]', ' ', title) for title in examples['titles']]
    examples['titles'] = [re.sub('\n', ' ', title) for title in examples['titles']]
    
    examples['abstracts'] = [re.sub(r'[^\w\s]', ' ', abstract) for abstract in examples['abstracts']]
    examples['abstracts'] = [re.sub('\n', ' ', abstract) for abstract in examples['abstracts']]
    
    inputs = [prefix + abstract for abstract in examples['abstracts']]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples['titles'], max_length=max_target_length, truncation=True)
    
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

In [59]:
preprocess_function(dataset['train'][:2])

{'input_ids': [[21603, 10, 1266, 12472, 7, 11, 11110, 13, 3, 9, 1249, 14930, 6772, 16183, 1413, 1010, 14148, 2699, 16, 3, 9, 10851, 14286, 7013, 2175, 10030, 33, 2196, 438, 3, 9, 209, 3538, 2179, 4401, 1391, 3, 9, 743, 14569, 4005, 13, 8, 3911, 11638, 19, 2546, 57, 3, 9, 335, 3, 632, 13944, 4401, 28351, 483, 16, 8, 505, 3, 632, 2179, 4401, 9400, 13, 8, 23471, 515, 291, 5432, 6772, 16183, 100, 3, 25875, 7, 12, 3, 9, 26533, 30024, 13, 108, 51, 17871, 26, 9, 12778, 100, 19, 46, 455, 13, 20722, 2755, 145, 3150, 2196, 772, 21, 48, 607, 13, 1413, 1010, 14148, 1], [21603, 10, 16184, 6849, 12237, 33, 46, 359, 1464, 12, 1848, 1737, 28911, 11638, 7, 13, 5272, 31728, 11423, 611, 7450, 10498, 2434, 686, 12237, 103, 59, 995, 21, 2547, 6772, 6849, 3, 28561, 13, 315, 3, 5628, 4900, 3379, 24, 164, 36, 915, 16, 3, 9, 11638, 84, 6790, 70, 1120, 2176, 2020, 21, 26075, 1427, 19276, 306, 29610, 3381, 454, 21855, 2836, 947, 62, 4277, 3, 9, 6772, 6849, 7824, 24, 3629, 8, 6772, 6849, 7, 13, 66, 8, 29610, 7, 1

Now we can use the *map* function to preprocess both the train and validation data.

In [60]:
prep_dataset = dataset.map(preprocess_function, batched=True)

Loading cached processed dataset at D:\ML_projects\summarization_project\csv\default-1f5f109b43dbc7b4\0.0.0\e138af468cb14e747fb46a19c787ffcfa5170c821476d20d5304287ce12bbc23\cache-54c2967b8b2c869e.arrow
Loading cached processed dataset at D:\ML_projects\summarization_project\csv\default-1f5f109b43dbc7b4\0.0.0\e138af468cb14e747fb46a19c787ffcfa5170c821476d20d5304287ce12bbc23\cache-03c7d3a8df480672.arrow


In [61]:
prep_dataset

DatasetDict({
    train: Dataset({
        features: ['abstracts', 'attention_mask', 'input_ids', 'labels', 'titles'],
        num_rows: 28952
    })
    test: Dataset({
        features: ['abstracts', 'attention_mask', 'input_ids', 'labels', 'titles'],
        num_rows: 5110
    })
})

In [62]:
prep_dataset['train'][:2]

{'abstracts': ['  Predictions and measurements of a multimode waveguide interferometer operating in a fibre coupled    dual mode   regime are reported  With a 1 32 micrometer source  a complete switching cycle of the output beam is produced by a 10 0 nanometer incremental change in the 8 0 micrometer width of the hollow planar mirror waveguide  This equates to a fringe spacing of   sim lambda  130   This is an order of magnitude smaller than previously reported results for this form of interferometer  ',
  '  Wavefront sensors are an important tool to characterize coherent beams of extreme ultraviolet radiation  However  conventional Hartmann type sensors do not allow for independent wavefront characterization of different spectral components that may be present in a beam  which limits their applicability for intrinsically broadband high harmonic generation  HHG  sources  Here we introduce a wavefront sensor that measures the wavefronts of all the harmonics in a HHG beam in a single ca

Now we load the pretrained model from the checkpoint, together with the metric that will be used for evaluation. Here we use the *rouge* set of metrics, which is explained in this [article](https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460). In general, these metrics measure how good the generated summary is with respect to the target summary.
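To get a feel for what such a score means, here is a simplified sketch of a ROUGE-1 style F1 score based on unigram overlap. It is not the implementation behind `load_metric("rouge")`: it assumes lowercasing and whitespace tokenization, uses no stemming, and the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified unigram-overlap F1 (assumption: whitespace tokens, no stemming)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "light transmission in photonic media",
    "transmission of light in photonic media",
)
print(round(score, 4))
```

In this toy case all five predicted words appear in the six-word reference, so precision is 1 and recall is 5/6. The score ignores word order, which is why *rouge2* and *rougeL* are usually reported alongside it.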

In [63]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [64]:
metric = load_metric("rouge")

In [65]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each predictions
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_agregator: Return aggregates if this is set to True
Retu

Training parameters were set according to the tutorial. The *compute_metrics* function was also taken directly from that notebook. *DataCollator* pads the sequences to the maximum sequence length within a batch rather than within the entire dataset. Training was done for 10 epochs.
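The idea of padding per batch can be sketched in plain Python. This is an illustration only, not the code of `DataCollatorForSeq2Seq`: the function name `pad_batch` is hypothetical, and the padding values (0 for inputs, -100 for labels so that padded label positions can be ignored by the loss) are assumptions chosen to match the `-100` handling in *compute_metrics*.

```python
from typing import Dict, List

def pad_batch(features: List[Dict[str, List[int]]],
              pad_token_id: int = 0,      # assumption: padding id for inputs
              label_pad_id: int = -100    # assumption: ignored label positions
              ) -> Dict[str, List[List[int]]]:
    """Pad every example to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = len(f["input_ids"])
        n_lab = len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - n_lab))
    return batch

# Two made-up tokenized examples of different lengths.
batch = pad_batch([
    {"input_ids": [5, 6, 7], "labels": [1]},
    {"input_ids": [8], "labels": [2, 3]},
])
print(batch)
```

Because the target length is the longest sequence in the batch, a batch of short abstracts is not padded up to the length of the longest abstract in the dataset.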

In [66]:
batch_size = 16
args = Seq2SeqTrainingArguments(
    'gdrive/My Drive/trained_models',
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=10,
    predict_with_generate=True
)

In [67]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [68]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=prep_dataset['train'],
    eval_dataset=prep_dataset['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

The fine-tuned model can easily be saved and then pushed to a repo on *huggingface*. Evaluation metrics after each epoch are not shown because training was done on Google Colab, but we can use the downloaded model to evaluate it on the validation set. All *rouge* metrics report the **F1 score**.

In [None]:
trainer.save_model('gdrive/My Drive/final_model/')

In [131]:
# Use a context manager so the file is closed after loading.
with open('evaluation_data.pkl', 'rb') as f:
    evaluation_data = pickle.load(f)

In [132]:
evaluation_data

{'eval_loss': 2.0490682125091553,
 'eval_rouge1': 42.9079,
 'eval_rouge2': 22.6112,
 'eval_rougeL': 38.1581,
 'eval_rougeLsum': 38.1522,
 'eval_gen_len': 15.7039,
 'eval_runtime': 131.1091,
 'eval_samples_per_second': 38.975,
 'eval_steps_per_second': 2.441,
 'epoch': 10.0}

In [122]:
tokenizer = AutoTokenizer.from_pretrained("Tymoteusz/optics-abstracts-summarization")

model = AutoModelForSeq2SeqLM.from_pretrained("Tymoteusz/optics-abstracts-summarization")

To generate summaries we have to preprocess the input abstracts in a similar way as before, with cleaning and tokenization.

In [149]:
def prep_sample(abstract):
    abstract = re.sub(r'[^\w\s]', ' ', abstract)
    abstract = re.sub('\n', ' ', abstract)
    abstract = tokenizer.encode(abstract, truncation=True, return_tensors="pt")
    return abstract

In [150]:
abstract_number = 170
sample_prep = prep_sample(dataset['test'][abstract_number]['abstracts'])

In [151]:
sample_prep

tensor([[ 2892,  5790,    16,    16, 10207,  5255,    15,  1162,  1202,  2532,
           783,    19,  7157,     3, 12913,    57,     8,  3438,    13,     8,
         20624,    52,  6645,  2479,    16,     8,  2768,   947,    34,    19,
          2008, 13605,   120,    24,    16,     3,     9,     3, 12851,  1202,
          2532,  6884,  1809,   659,  5790,    11, 13503,   729,    15,   485,
            13,     8,     3, 12851,  3438,    33,     3, 29604,    57,     3,
             9,   650, 13080,   973,   728,     8,  2769,    13, 13503,   729,
            15,   485,    13,     8,  1202,  2532,  1809,    19,  8413,    57,
             8, 26664,  5538,  5456,  8152,    16,  7475, 29393,    11,   251,
             3,    35, 12395,    63,    37, 11775,  1693,  1267,    24,     8,
          5790,    13,   659,    16,   224,   783,  5619, 13080,   120,    45,
            70, 13503,   729,    15,   485,     8,    72,    19, 13503,   729,
            15,  1162,     8,  1809,     8,    72,  

In [152]:
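# Beam search with 4 beams, returning the 3 best candidate titles.
outputs_0 = model.generate(sample_prep, max_length=50,
                           num_beams=4, early_stopping=True,
                           num_return_sequences=3)
# Default generation settings, for comparison.
outputs_1 = model.generate(sample_prep, max_length=50)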

In [153]:
tokenizer.batch_decode(outputs_0, skip_special_tokens=True)

['Light transmission in inhomogeneous photonic crystal structures',
 'Light transmission in inhomogeneous photonic crystals',
 'Light transmission in inhomogeneous photonic media']

In [154]:
tokenizer.batch_decode(outputs_1, skip_special_tokens=True)

['Light transmission in random photonic crystal structures']

In [155]:
print('Title: ', dataset['test'][abstract_number]['titles'])
print('\n')
print('Abstract: ', dataset['test'][abstract_number]['abstracts'])

Title:  Transmission of Light in Crystals with different homogeneity: Using
  Shannon Index in Photonic Media


Abstract:    Light transmission in inhomogeneous photonic media is strongly influenced by
the distribution of the diffractive elements in the medium. Here it is shown
theoretically that, in a pillar photonic crystal structure, light transmission
and homogeneity of the pillar distribution are correlated by a simple linear
law once the grade of homogeneity of the photonic structure is measured by the
Shannon index, widely employed in statistics, ecology and information entropy.
The statistical analysis shows that the transmission of light in such media
depends linearly from their homogeneity: the more is homogeneous the structure,
the more is the light transmitted. With the found linear relationship it is
possible to predict the transmission of light in random photonic structures.
The result can be useful for the study of electron transport in solids, since
the similarity with 