GPT-2 Pipeline

This repository contains code for creating an end-to-end TFX pipeline for GPT-2. The pipeline covers the needed data preprocessing, exporting data from MongoDB, training from a pretrained model, and deployment to an MLflow model registry. In the registry, each produced model can be given a label, which can later be used to define that model's production-readiness state. All pipelines can be orchestrated with either Airflow, Beam, or Kubeflow. Tensorboard is supported and can be used to keep track of training.

Which pipelines are available?

  • Fine-tune from the pretrained OpenAI models, i.e. transfer learning (see the sketch after this list)
  • Train from scratch. Transfer learning is not recommended when you want to train on datasets in a different language.
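
For a rough idea of what the fine-tuning path involves, here is a minimal sketch using the standalone gpt-2-simple package (the training recommendations further down are quoted from it); the corpus file name and step count are placeholders, not values from this repository.

import gpt_2_simple as gpt2

# Download the pretrained OpenAI weights once (see the model table below).
gpt2.download_gpt2(model_name="124M")

# Fine-tune on a plain-text corpus; file name and step count are placeholders.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="corpus.txt",
              model_name="124M",
              steps=1000)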

Why TFX?

  • TFX is an open source framework for creating production-grade, end-to-end machine learning pipelines.
  • It handles a lot of useful things like caching and versioning.
  • Support for Airflow and Kubeflow

Why Tensorboard?

  • TensorFlow's native visualization tool
  • Lets you follow loss curves and other training metrics live during training

Why MLflow?

  • Built-in model registry
  • Dashboard for comparing models, their performance, and the produced artifacts
  • mlflow-heroku: Repository for setting up MLFlow on heroku
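
The production-readiness label mentioned in the introduction maps naturally onto MLflow's model registry stages. A minimal sketch, assuming an MLflow server is already reachable and a model version named gpt2-tfx has been registered (both the URI and the name are placeholders):

from mlflow.tracking import MlflowClient

# Point the client at the running MLflow server (placeholder URI).
client = MlflowClient(tracking_uri="http://localhost:5000")

# Mark version 1 of the registered model as production-ready.
client.transition_model_version_stage(
    name="gpt2-tfx",     # registered model name, an assumption
    version="1",
    stage="Production",  # other stages: "Staging", "Archived", "None"
)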

Getting Started

Install package

git clone https://github.com/NewsPipe/gpt2-tfx-pipeline.git
cd gpt2-tfx-pipeline
pip install .

Run pipeline with Apache Beam

cd examples
python beam-local-example.py
# open mlflow and tensorboard
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns
tensorboard --logdir ./outputs
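
The contents of beam-local-example.py are not reproduced here, but a TFX pipeline run with the Beam orchestrator generally boils down to the sketch below; the pipeline name, paths, and the list of GPT-2-specific components (MongoDB export, preprocessing, training, MLflow pusher) are placeholders for whatever the example script actually assembles.

from tfx.orchestration import metadata, pipeline
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

# Placeholder list: in this repo it would hold the MongoDB export,
# preprocessing, GPT-2 training, and MLflow deployment components.
components = []

beam_pipeline = pipeline.Pipeline(
    pipeline_name="gpt2-beam-example",  # placeholder name
    pipeline_root="./outputs",          # placeholder; matches the Tensorboard logdir above
    components=components,
    enable_cache=True,                  # the TFX caching mentioned under "Why TFX?"
    metadata_connection_config=metadata.sqlite_metadata_connection_config("./metadata.db"),
)

BeamDagRunner().run(beam_pipeline)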

List of available models

Look at the GPT-2 paper for performance details. Choose the model depending on your resources. Pass the model name as a string value for model_name in your pipeline.

Model Name     Layers
117M / 124M    12
345M / 355M    24
774M           36
1558M          48
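
In the pipeline this boils down to passing one of the names above as a plain string; the configuration dictionary below is purely illustrative and not the repository's actual API:

# model_name must match one of the names in the table above.
config = {
    "model_name": "124M",  # 12 layers, the smallest model
    "train_steps": 1000,   # placeholder value
}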

Examples

Training recommendations

From gpt-2-simple: the method GPT-2 uses to generate text is slightly different from that of other packages like textgenrnn (specifically, it generates the full text sequence purely on the GPU and decodes it afterwards), which cannot easily be changed without hacking the underlying model code. As a result:

  • In general, GPT-2 is better at maintaining context over its entire generation length, making it good for generating conversational text. The text is also generally grammatically correct, with proper capitalization and few typos.
  • The original GPT-2 model was trained on a very large variety of sources, allowing the model to incorporate idioms not seen in the input text.
  • GPT-2 cannot stop early upon reaching a specific end token. (workaround: pass the truncate parameter to a generate function to only collect text until a specified end token. You may want to reduce length appropriately.)
  • Higher temperatures work better (e.g. 0.7 - 1.0) to generate more interesting text, while other frameworks work better between 0.2 - 0.5.
  • GPT-2 allows you to generate texts in parallel by setting a batch_size that evenly divides nsamples, resulting in much faster generation (see the generation sketch after this list). Works very well with a GPU (batch_size can be set up to 20 on Colaboratory's K80)!
  • Due to GPT-2's architecture, it scales up nicely with more powerful GPUs. For the 124M model, if you want to train for longer periods of time, GCP's P100 GPU is about 3x faster than a K80/T4 for only 3x the price, making it price-comparable (the V100 is about 1.5x faster than the P100 but about 2x the price). The P100 uses 100% of the GPU even with batch_size=1, and about 88% of the V100 GPU.
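
The temperature, truncate, nsamples, and batch_size knobs above map directly onto gpt-2-simple's generate call. A minimal generation sketch, assuming the 124M base model and a placeholder prompt:

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, model_name="124M")  # or run_name="..." for a fine-tuned run

texts = gpt2.generate(sess,
                      model_name="124M",
                      prefix="The pipeline",     # placeholder prompt
                      length=200,
                      temperature=0.7,           # 0.7 - 1.0 recommended above
                      truncate="<|endoftext|>",  # stop collecting at the end token
                      nsamples=10,
                      batch_size=5,              # must evenly divide nsamples
                      return_as_list=True)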

Notes

  • GPT-2 models' robustness and worst case behaviors are not well-understood. As with any machine-learned model, carefully evaluate GPT-2 for your use case, especially if used without fine-tuning or in safety-critical applications where reliability is important.
