
Summarization Toolbox

Introduction

This repository provides an end-to-end pipeline to fine-tune a 🤗 summarization model on your own corpus.
It is subdivided into three parts:

  • Data Provider: preprocesses and tokenizes data for training
  • Model Trainer: fine-tunes a selected 🤗 model on the provided data
  • Evaluator: automated evaluation of the fine-tuned model on the validation set

The pipeline supports summarization of German and English texts. For both languages a T5 model is used, which is described in the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

Hugging Face models: the examples below use t5-base and WikinewsSum/t5-base-multi-de-wiki-news.

Process Description

  1. Provide source and target files with matching text/summary pairs. These are split and converted to the right format for the fine-tuning task.
  2. Choose one of the supported languages and set training parameters via a config file. Then run the training on your data.
  3. Evaluate the produced model checkpoints and compare them using either the Rouge-L or the specially developed SemanticSimilarity metric. You can also track the training metrics via TensorBoard.

Example

The following example was produced by one of our German models, which was fine-tuned on our specially scraped Golem corpus.

Original Text:
Tamrons neues Objektiv ist ein Weitwinkelzoom für Canon- und Nikonkameras mit Kleinbildsensor, das über 15 Elemente verfügt, darunter dispersionsarme und asphärische. Der sogenannte Silent Drive Motor ermöglicht laut Hersteller eine hohe Geschwindigkeit beim Scharfstellen und eine niedrige Geräuschentwicklung. Die minimale Fokusdistanz wird mit 28 cm angegeben. Die feuchtigkeitsbeständige Konstruktion und die Fluorbeschichtung des Frontelements sollen dazu beitragen, dass das Objektiv auch bei harschen Wetterbedingungen funktioniert. Das Objektiv misst 84 mm x 93 mm und weist einen Filterdurchmesser von 77 mm auf. Das 17-35 mm F2.8-4 Di OSD von Tamron soll Anfang September 2018 für Nikon FX erhältlich sein, ein Canon-EF-Modell wird später folgen. Der Preis wird mit rund 600 US-Dollar angegeben. Deutsche Daten liegen noch nicht vor.

Produced Summary:
Tamron hat mit dem 17-35 mm F2.8-4 Di OSD ein Weitwinkelzoom für Canon- und Nikon-Kameras vorgestellt, das über 15 Elemente verfügt.

Installation

Please set up a Python 3 environment with your favorite virtual environment tool. We tested with Python 3.7.4.

sh ./install_dependencies.sh

Data Provider

Provides tokenized data for training.

Input

It requires a dataset in the dataProvider/datasets/$DATASETNAME directory. The dataset can be provided either as single files or already split into train, validation and test sets. Each line in a file should represent a single example string.

Providing a Single Dataset

The sources (full texts) for summarization should be provided in a sources.txt file and the target summarizations should be provided in a targets.txt file.

Now the --create_splits flag has to be used to create the train, val and test files in that directory, which then serve as the input for tokenization.
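
For illustration, a minimal sketch of the expected line-aligned layout (the texts are made up; the golem dataset name is taken from the examples in this README; line i of sources.txt must match line i of targets.txt):

import os

# Hypothetical text/summary pairs; one example per line in each file.
pairs = [
    ("Full text of the first article ...", "Its one-line summary."),
    ("Full text of the second article ...", "Another one-line summary."),
]

dataset_dir = "dataProvider/datasets/golem"
os.makedirs(dataset_dir, exist_ok=True)
with open(f"{dataset_dir}/sources.txt", "w", encoding="utf-8") as src, \
     open(f"{dataset_dir}/targets.txt", "w", encoding="utf-8") as tgt:
    for text, summary in pairs:
        src.write(text.replace("\n", " ") + "\n")   # one example per line
        tgt.write(summary.replace("\n", " ") + "\n")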

Providing Train, Val and Test Split Files

If training, validation and test splits are already present, they should be provided in the following format of the 🤗 seq2seq examples:

train.source
train.target
val.source
val.target
test.source
test.target

Usage

Use the Command Line Interface like this:

bin/provide_data $DATASETNAME $TOKENIZERNAME $MODELNAME <flags>

Example:

bin/provide_data golem WikinewsSum/t5-base-multi-de-wiki-news t5-base --create_splits=True --filtering=True

Flags

--size=$SIZE

Limits the number of samples that are taken for tokenization for each split. Defaults to None.

--create_splits=$CREATESPLITS

Splits the dataset into train, validation and test sets. Defaults to False.

$CREATESPLITS has to be a dictionary containing the keys train and val with values between 0 and 1. The value of train represents the ratio of the dataset that is used for training (and not for validation or testing). The value of val represents the ratio between the validation and the test set. Because of shell restrictions the dictionary has to be wrapped in " in the CLI, like this: --create_splits="{'train': 0.7, 'val': 0.66}"

If the value of $CREATESPLITS is True it defaults to {'train': 0.8, 'val': 0.5}, which results in an 80/10/10 split.
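
To make the semantics of the two ratios concrete, here is a small sketch of how they translate into split sizes (my reading of the description above, not code from the repository):

def split_sizes(n_examples, train=0.8, val=0.5):
    # train: share of all examples that goes into the training set.
    # val: share of the *remaining* examples that goes into the
    # validation set; the rest becomes the test set.
    n_train = int(n_examples * train)
    n_rest = n_examples - n_train
    n_val = int(n_rest * val)
    return n_train, n_val, n_rest - n_val

print(split_sizes(1000))                       # (800, 100, 100) -> 80/10/10
print(split_sizes(1000, train=0.7, val=0.66))  # (700, 198, 102)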

--splits2tokenize=$SPLITS2TOKENIZE

Can be set to only tokenize certain splits. Defaults to [train, val, test].

--filtering=$FILTERING

Examples longer than the maximum token size are filtered out; otherwise they are truncated. Defaults to True.
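
The difference between the two modes can be sketched with the 🤗 tokenizer API (this mirrors the behaviour described above, not the module's actual code; the maximum length is an assumed value):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
max_len = 512  # assumed maximum token size, not a value from this README
texts = ["a short example", "a very long article ..."]

# --filtering=True: drop examples that exceed the maximum token size
kept = [t for t in texts if len(tokenizer(t)["input_ids"]) <= max_len]

# --filtering=False: keep every example, but truncate it to max_len
batch = tokenizer(texts, truncation=True, max_length=max_len,
                  padding="max_length", return_tensors="pt")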

Output

The resulting tokenized PyTorch tensors are saved in the dataProvider/datasets/$DATASETNAME/$TOKENIZERNAME[_filtered] directory as the following files:

train_source.pt
train_target.pt
val_source.pt
val_target.pt
test_source.pt
test_target.pt
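
The saved tensors are ordinary PyTorch files and can be inspected directly (a sketch; the path assumes the golem example above with --filtering=True, hence the _filtered suffix):

import torch

base = "dataProvider/datasets/golem/WikinewsSum/t5-base-multi-de-wiki-news_filtered"
train_source = torch.load(f"{base}/train_source.pt")
train_target = torch.load(f"{base}/train_target.pt")
print(type(train_source))  # inspect what the Data Provider stored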

Model Trainer

Performs the training process for the selected model on the previously created datasets.

Input

To execute the Model Trainer you first need to run the Data Provider module to generate training data in the right format, either from your own or from predefined text/summary pairs. The trainer requires files in the output format of the Data Provider module. Since you may have run that module for multiple text/summary sets, you have to provide the $DATASETNAME to train on.
Additionally you can choose a supported 🤗 model with the $MODELNAME parameter (the model will be downloaded to your virtual environment the first time you run the training).

Flags

--filtered=$FILTERED

With the $FILTERED flag you can specify whether filtered or unfiltered data is used for training (if previously created by the Data Provider). It defaults to True.

--config_name=$CONFIGNAME

Since all model and training pipeline configurations are read from a config file (which has to be stored in the ./modelTrainer/config directory), you can select your config file by setting the $CONFIGNAME parameter.
If you don't, this parameter defaults to 'fine_tuning.ini' (which can also be used as a template for your own configurations).

Usage

Use the Command Line Interface like this:

bin/run_training $DATASETNAME $MODELNAME <flags>

Example:

bin/run_training golem WikinewsSum/t5-base-multi-de-wiki-news --filtered=False

Configurations

The pipeline is designed to read all customizable parameters from an '.ini' file. It follows the structure that a component is defined by [COMPONENT] and its parameters are assigned by parameter = parameter_value (as strings). There are two components, the model and the training component. Each component can be configured with multiple parameters (see fine_tuning.ini for a full list). Only the parameters present in the provided fine_tuning.ini file stored in the config folder can be changed.
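
As a sketch, such a file can be read with Python's configparser; the section names follow the description above, but their exact spelling, and which section holds output_directory, are assumptions here:

import configparser

config = configparser.ConfigParser()
config.read("modelTrainer/config/fine_tuning.ini")

model_cfg = config["model"]        # [model] component (name assumed)
training_cfg = config["training"]  # [training] component (name assumed)
# output_directory is described in the Output section below;
# assuming it sits in the training component
output_directory = training_cfg.get("output_directory", "./results")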

Output

In the config file you choose an output_directory; in this directory the following folder structure is created:

output_directory
└── model_shortname
    ├── model_version
    │   ├── checkpoint_folder
    │   └── final_model_files
    └── logs
        └── model_version
            └── tensorboard_file

<model_shortname> = abbreviation for the chosen model
<model_version> = counts the versions of the fine-tuned model (can be seen as an id and makes sure you don't overwrite any previously trained model)
<checkpoint_folder> = contains model files after a certain number of training steps (checkpoints are saved every n training steps)
<tensorboard_file> = saved training metrics for TensorBoard usage

After the training the following final output files are saved in the <model_version> folder:

  • config.json
  • training_args.bin (parameters for the 🤗 Trainer)
  • pytorch_model.bin (model which can then be loaded for inference)
  • model_info.yml (file with information used for evaluation)
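
These final files can be loaded for inference with the standard 🤗 API, e.g. (a sketch; the summarize: prefix is the usual T5 convention and may differ from what the toolbox uses internally):

from transformers import T5ForConditionalGeneration, T5Tokenizer

# <output_directory>/<model_shortname>/<model_version> from the structure above
model_dir = "results/t5-de/0"
model = T5ForConditionalGeneration.from_pretrained(model_dir)
tokenizer = T5Tokenizer.from_pretrained("t5-base")  # or the tokenizer used for training

inputs = tokenizer("summarize: " + "Some long article text ...",
                   return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(inputs["input_ids"], max_length=64, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))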

Evaluator

Performs evaluation on the validation or test set for a fine-tuned model.

Input

To execute the Evaluator you first need to run the Model Trainer module to generate a fine-tuned 🤗 model in the right format, stored in the correct folder structure. These four files are required:

  • config.json
  • pytorch_model.bin
  • training_args.bin
  • model_info.yml

Since the model evaluation uses the validation or test set created from the underlying dataset, you need to specify the $DATASETNAME.
Additionally you can choose the fine-tuned 🤗 model checkpoints to compare by setting the $RUNPATH parameter. This path has to be the directory of the checkpoint_folder defined by the folder structure in the training section.

Flags

--split_name=$SPLIT_NAME

Should be train, val or test. Defaults to val.

--nr_samples=$NR_SAMPLES

Number of samples selected from the dataset to evaluate the checkpoint on. Defaults to 10.

--metric_type=$METRIC_TYPE

One of two metric types can be chosen:

  • Rouge-L: set parameter to "Rouge"
  • Semantic Similarity: set parameter to "SemanticSimilarity"

Defaults to Rouge.
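
For reference, Rouge-L scores the longest common subsequence (LCS) between a produced and a reference summary. A minimal self-contained sketch of the idea (not the evaluator's actual implementation):

def rouge_l_f1(candidate: str, reference: str) -> float:
    # F1 over the longest common subsequence of the two token sequences.
    c, r = candidate.split(), reference.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ct == rt
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("tamron stellt neues objektiv vor",
                 "tamron stellt ein objektiv vor"))  # 0.8

SemanticSimilarity, by contrast, compares the meaning of the two texts rather than their literal token overlap.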

Usage

Use the Command Line Interface like this:

bin/evaluate_with_checkpoints_and_compare $RUN_PATH $DATASET_NAME <flags>

Example:

bin/evaluate_with_checkpoints_and_compare golem WikinewsSum/t5-base-multi-de-wiki-news --split_name=train --nr_samples=1000 --metric_type=SemanticSimilarity

Output

By default the produced Overview.xlsx files are stored in the evaluator directory under the following structure:

evaluator
└── evaluations
    └── model_short_name
        └── model_version
            └── checkpoint_folders
                └── metric_type-sample_name-split_name-folders
                    ├── iteration-folders
                    │   └── Overview.xlsx
                    └── analysis.xlsx
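
The spreadsheets can then be collected and inspected programmatically, e.g. with pandas (a sketch; only the evaluator/evaluations root is taken from the structure above):

import glob

import pandas as pd

# Find every Overview.xlsx the evaluator produced, whatever the nesting.
for path in glob.glob("evaluator/evaluations/**/Overview.xlsx", recursive=True):
    print(path)
    print(pd.read_excel(path).head())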

Graphical User Interface (GUI)

A summarization model can be used for live prediction in a GUI developed with PyQt5.

[GUI screenshot]

Input

All the GUI needs is either a directory containing a pytorch_model.bin file of a fine-tuned T5 summarization model, or the model_status flag set to base, in which case the base T5 model is used.

Flags

--model_language=$MODEL_LANGUAGE

Language of the model to choose. Defaults to german.

--model_status=$MODEL_STATUS

Can be either base or fine-tuned. If it is base, the model_dir will be ignored. Defaults to fine-tuned.

Usage

Use the Command Line Interface like this:

bin/run_gui $MODEL_DIR <flags>

Example:

bin/run_gui path/to/UI-Model/checkpoint-100000/ german fine-tuned

TensorBoard

During the training a TensorBoard file is produced, which can afterwards be used to track your training parameters and metrics on localhost. To access the TensorBoard, the tensorboard library has to be installed (see requirements.txt); then you can activate it with the following CLI:

tensorboard --logdir <tensorboard_log_dir>

In the <tensorboard_log_dir> an events.out.tfevents...-file should exist. The default path is described by the folder structure in the training section.

Example:

tensorboard --logdir ./results/t5-de/logs/0

Development Instructions

pip install pytest

Use fd and entr to execute tests automatically on file changes:

fd . | entr pytest

Use the following command to add a new package (optionally with version number) $pkg to the repository, while keeping requirements.txt orderly:

echo $pkg | sort -o requirements.txt - requirements.txt && pip install $pkg
