Copyright 2023 The TensorFlow Authors.


In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# TFX Pipeline Tutorial for Large Language Model using CNN Daily Dataset


In this codelab, we use KerasNLP to load a pre-trained Large Language Model (LLM), GPT-2, and fine-tune it on a dataset. The dataset used in this demo is the CNN Daily dataset. Note that GPT-2 is used here only to demonstrate the end-to-end process; the techniques and tooling introduced in this codelab are potentially transferable to other generative language models such as Google T5.

<div class="devsite-table-wrapper"><table class="tfo-notebook-buttons" align="left">
<td><a target="_blank" href="https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple">
<img src="https://www.tensorflow.org/images/tf_logo_32px.png"/>View on TensorFlow.org</a></td>
<td><a target="_blank" href="https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/penguin_simple.ipynb">
<img src="https://www.tensorflow.org/images/colab_logo_32px.png">Run in Google Colab</a></td>
<td><a target="_blank" href="https://github.com/tensorflow/tfx/tree/master/docs/tutorials/tfx/penguin_simple.ipynb">
<img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png">View source on GitHub</a></td>
<td><a href="https://storage.googleapis.com/tensorflow_docs/tfx/docs/tutorials/tfx/penguin_simple.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a></td>
</table></div>

# Before You Begin

Colab offers different kinds of runtimes. Go to **Runtime -> Change runtime type** and choose the GPU hardware accelerator runtime (which should have >12 GB of system RAM and ~15 GB of GPU RAM), since you will fine-tune the GPT-2 model.

# Set Up

We first install the TFX Python package.

## Upgrade Pip
To avoid upgrading Pip in a system when running locally, check to make sure that we are running in Colab. Local systems can of course be upgraded separately.
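
One common way to detect Colab without relying on an import failure is to look for the `google.colab` module in `sys.modules`. The sketch below assumes that the Colab kernel has already loaded that module at startup; the helper name `running_in_colab` is hypothetical and is not part of TFX or Colab.

```python
import sys


def running_in_colab() -> bool:
    """Returns True if the `google.colab` module is already loaded.

    Assumption: the Colab kernel imports `google.colab` at startup, so its
    presence in `sys.modules` indicates a Colab runtime.
    """
    return "google.colab" in sys.modules


if running_in_colab():
    print("Running in Colab: upgrading pip here is safe.")
else:
    print("Not running in Colab: leaving the local pip untouched.")
```

A check like this only reports the environment; the actual `!pip install --upgrade pip` call still belongs in a notebook cell.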

In [None]:
try:
  import colab
  !pip install --upgrade pip
except:
  pass

## Install TFX

TFX is currently experiencing issues with Python 3.10 in Colab.
Therefore, simply running the command
```
!pip install -U tfx
```
to install TFX **will fail**. Instead, run the cells below, which switch the runtime to Python 3.8 before installing TFX.

In [None]:
%%shell
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 3
curl -O https://bootstrap.pypa.io/get-pip.py
python get-pip.py

In [None]:
# 1) TFX relies on an old version of google-api-core so we let google-auth float
# for the install. We grep it out below:
!grep -v google-auth /etc/requirements.core.in > requirements.txt

# 2) httplib2 should be included in /etc/requirements.core.in but it is not.
# We ensure it's included:
!grep httplib2 /etc/requirements.user.in >> requirements.txt

# 3) google.colab package is not available as a wheel. We symlink that in so
# it's on the sys.path of Python 3.8:
!mkdir /usr/local/lib/python3.8/dist-packages/google
!ln -s /usr/local/lib/python3.10/dist-packages/google/colab /usr/local/lib/python3.8/dist-packages/google/colab

# Now with those pre-requisites out of the way:
!pip install tfx==1.13.0 -r requirements.txt

## Did you restart the runtime?

If you are using Google Colab, the first time that you run the cell above, you must restart the runtime by clicking the "RESTART RUNTIME" button above or using the "Runtime > Restart runtime ..." menu. This is because of the way that Colab loads packages. After restarting, check the TensorFlow and TFX versions in the cells that follow.


# Imports
Let's first get our imports out of the way.

In [None]:
from tensorflow import keras
from tfx.types import Channel
from absl import logging
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

## Uninstall shapely

TODO(b/263441833) This is a temporary solution to avoid an ImportError. Ultimately, it should be handled by supporting a recent version of BigQuery, instead of uninstalling other extra dependencies.


In [None]:
!pip uninstall shapely -y

Let's check the library versions.

In [None]:
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
from tfx import v1 as tfx
print('TFX version: {}'.format(tfx.__version__))

## Set up variables
There are some variables used to define a pipeline. You can customize these variables as you want. By default all output from the pipeline will be generated under the current directory.
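
As a sketch of what such variables might look like, the cell below defines a pipeline name and derives output locations from it. The names `PIPELINE_NAME`, `PIPELINE_ROOT`, and `METADATA_PATH` and their values are illustrative assumptions, not values prescribed by this tutorial; the relative paths keep all output under the current directory.

```python
import os

# Hypothetical name used to label this pipeline's outputs.
PIPELINE_NAME = "llm-cnn-daily"

# Directory for pipeline artifacts, relative to the current directory.
PIPELINE_ROOT = os.path.join("pipelines", PIPELINE_NAME)

# Path to a SQLite file that could hold ML Metadata for the pipeline.
METADATA_PATH = os.path.join("metadata", PIPELINE_NAME, "metadata.db")

print(PIPELINE_ROOT)
print(METADATA_PATH)
```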

# CSV Downloader
To make the pipeline more efficient and easier to automate, it is useful to have a component that takes a download link and fetches the CSV file itself. Furthermore, one important goal of a TFX production ML pipeline is to collect metadata about the pipeline components, their executions, and the resulting artifacts. This metadata lets you analyze the lineage of pipeline components and debug issues, and the CSV Downloader component helps users log and track where the data came from and which preprocessing steps it went through before entering the pipeline. In this section, we declare a new artifact called CsvDoc and develop a custom component, CSV Downloader, which records information about the dataset and downloads the CSV file to the CsvDoc artifact's URI.
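
The download-and-record idea can be sketched without TFX, using only the standard library. In this sketch the function name `download_csv` and the returned dictionary are hypothetical stand-ins for the component and its artifact properties, and a local `file://` URL replaces the real download link so that the example runs offline.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def download_csv(url: str, out_dir: str, file_name: str) -> dict:
    """Downloads `url` to `out_dir/file_name` and returns lineage info."""
    os.makedirs(out_dir, exist_ok=True)
    file_path = os.path.join(out_dir, file_name)
    with urllib.request.urlopen(url) as response:
        content = response.read()
    with open(file_path, "wb") as csv_file:
        csv_file.write(content)
    # Mirrors the two pieces of information the artifact keeps: the source
    # of the data and the location of the downloaded file.
    return {"url": url, "path": file_path}


# Create a small local CSV that plays the role of the remote file.
work_dir = tempfile.mkdtemp()
source = Path(work_dir) / "source.csv"
source.write_text("article\nHello world.\n")

info = download_csv(
    source.as_uri(), os.path.join(work_dir, "artifact"), "testing_doc.csv")
print(info["path"])
```

Keeping the source URL next to the saved path is what later makes it possible to trace a processed dataset back to where it came from.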

In [None]:
from tfx.types import artifact
from tfx import types

Property = artifact.Property
PropertyType = artifact.PropertyType

URL_PROPERTY = Property(type=PropertyType.STRING)
PATH_PROPERTY = Property(type=PropertyType.STRING)

class CsvDoc(types.Artifact):
  """ Artifact that contains the CSV dataset.

     - 'url' : saves the source of the original data.
     - 'path': saves the path to the CSV file.
  """

  TYPE_NAME = 'CsvDoc'
  PROPERTIES = {
      'url' : URL_PROPERTY,
      'path': PATH_PROPERTY,
  }

In [None]:
from absl import logging
import requests
import os
import tfx.v1 as tfx

@tfx.dsl.components.component
def CsvDownloaderComponent(
    url: tfx.dsl.components.Parameter[str],
    file_name: tfx.dsl.components.Parameter[str],
    saved_file: tfx.dsl.components.OutputArtifact[CsvDoc],
) -> None:
  response = requests.get(url)
  # Record where the data came from so that the source is tracked as metadata.
  saved_file.url = url
  if response.status_code == 200:
    file_path = os.path.join(saved_file.uri, file_name)
    # Record where the file is stored so downstream components can find it.
    saved_file.path = file_path
    url_content = response.content
    with open(file_path, 'wb') as csv_file:
      csv_file.write(url_content)
    logging.info(f"CSV file saved successfully at {file_path}")
  else:
    raise Exception(
        f"CSV file failed to be saved (HTTP status {response.status_code}).")

In [None]:
downloader = CsvDownloaderComponent(
    url='https://drive.google.com/uc?id=1YdZsJlRafqxiNSl0nHQkwR7rzrNlN9LI&export=download',
    file_name='testing_doc.csv')

In [None]:
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
context = InteractiveContext()

In [None]:
context.run(downloader, enable_cache = False)

# CSV ExampleGen

As the second component of the LLM pipeline, this component takes the CsvDoc artifact (the custom artifact created in the previous step) as its input and outputs another CsvDoc artifact that holds the preprocessed text. In this step we also apply the NLTK tokenizer to split each long text into a list of sentences. For MLOps and a parallelized design, users should be able to input CSV files with different columns and file structures, so the component should be able to take in any CSV file and generate examples from it. In the current design, the user specifies the column of the dataset from which the article contents are extracted.


## Imports
Let's do all the necessary imports for our custom CSVExampleGen Component.

In [None]:
!pip install nltk

In [None]:
import nltk
import pandas as pd
from absl import logging
nltk.download('punkt')

In our case, we are performing next word prediction in a language model, so we only need the 'article' feature.

To further prepare our training data, we are going to use an NLP package: NLTK. NLTK lets us download the sentence tokenizer 'punkt', which divides text into a list of sentences.

We define a helper function to merge shorter sentences into longer ones, until they reach a pre-defined max_length.

In [None]:
import tfx.v1 as tfx
from tfx.dsl.component.experimental.decorators import component
from absl import logging
import os
import pandas as pd
from nltk import tokenize

def merge_sentences(sentences, max_length):
    res = []
    cur_len = 0
    cur_sentences = []
    for s in sentences:
        if cur_len + len(s) > max_length:
            # If adding the next sentence exceeds `max_length`, we add the
            # current sentences into collection. Skip this when nothing has
            # been collected yet, to avoid emitting an empty string.
            if cur_sentences:
                res.append(" ".join(cur_sentences))
            cur_len = len(s)
            cur_sentences = [s]
        else:
            cur_len += len(s)
            cur_sentences.append(s)
    # Flush the remaining sentences, if any.
    if cur_sentences:
        res.append(" ".join(cur_sentences))
    return res

def create_csv(data_arr, save_dir):
  column_value = 'preprocessed_text'
  df = pd.DataFrame(data = data_arr, columns = [column_value])
  df.to_csv(save_dir, index = False)

def preprocess_and_save(csv_file_dir, column_name, save_dir):
  with open(csv_file_dir, "r") as f:
    df = pd.read_csv(f, delimiter = ',')
  all_sentences = []
  count = 0
  num_articles_to_process = 500
  max_length = 512
  if column_name in df.columns:
    logging.info(f"{column_name} is found in the dataframe.")
    for index, row in df.iterrows():
        article = row[column_name]
        # Use NLTK tokenize to split articles into sentences
        sentences = tokenize.sent_tokenize(str(article))
        # Merge individual sentences into longer context
        combined_res = merge_sentences(sentences, max_length)
        # Add merged context into collection
        all_sentences.extend(combined_res)
        count += 1
        if count >= num_articles_to_process:
          break
    return create_csv(all_sentences, save_dir)
  else:
    raise Exception(f"{column_name} is not found in the dataframe.")
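
To see what the merging step does on a small input, the sketch below applies a compact restatement of the greedy merge (one that skips empty chunks) to four hand-split sentences. The sentences and the tiny limit of 20 characters are made up for illustration; in the pipeline the sentences come from the NLTK tokenizer and the limit is 512.

```python
def merge_sentences(sentences, max_length):
    """Greedily packs consecutive sentences into chunks (compact restatement)."""
    res, cur_len, cur_sentences = [], 0, []
    for s in sentences:
        if cur_len + len(s) > max_length:
            if cur_sentences:  # never emit an empty chunk
                res.append(" ".join(cur_sentences))
            cur_len, cur_sentences = len(s), [s]
        else:
            cur_len += len(s)
            cur_sentences.append(s)
    if cur_sentences:
        res.append(" ".join(cur_sentences))
    return res


sentences = ["One two.", "Three four.", "Five.", "Six seven eight nine."]
chunks = merge_sentences(sentences, max_length=20)
for chunk in chunks:
    print(len(chunk), repr(chunk))
# 20 'One two. Three four.'
# 5 'Five.'
# 21 'Six seven eight nine.'
```

Two properties are visible here: a single sentence longer than the limit is kept whole rather than split, and the joining spaces are not counted toward the limit, so a merged chunk can be slightly longer than `max_length`.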


Below is the code for the custom CSV example generator component for an LLM. This component takes a CsvDoc artifact as its input and writes a preprocessed CSV file for downstream components.

In [None]:
@tfx.dsl.components.component
def ExampleGenComponent(
    input_csv: tfx.dsl.components.InputArtifact[CsvDoc],
    column_name : tfx.dsl.components.Parameter[str],
    output_csv: tfx.dsl.components.OutputArtifact[CsvDoc]) -> None:
    # CsvDoc declares the properties 'url' and 'path', so the file location
    # is read and written through `path`.
    input_csv_dir = input_csv.path
    output_csv.path = os.path.join(output_csv.uri, "processed_data.csv")
    # Carry the original data source over to the processed artifact.
    output_csv.url = input_csv.url
    preprocess_and_save(input_csv_dir, column_name, output_csv.path)

In [None]:
# from customCSVexampleGenComponent import customCSVexampleGenComponent
ExampleGenerator = ExampleGenComponent(input_csv = downloader.outputs['saved_file'], column_name = 'article')

In [None]:
context.run(ExampleGenerator, enable_cache = False)

# Trainer

First, let's import all the necessary packages and libraries.

In [None]:
import keras_nlp
from tensorflow import keras
import tensorflow as tf

KerasNLP provides a number of pre-trained models, such as Google BERT and GPT-2. You can see the list of available models in the KerasNLP repository.

It's very easy to load the GPT-2 model as you can see below:

In [None]:
gpt2_tokenizer = keras_nlp.models.GPT2Tokenizer.from_preset("gpt2_base_en")
gpt2_preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=256,
    add_end_token=True,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en", preprocessor=gpt2_preprocessor)

We will define a new artifact that represents a trained model. This artifact points to the directory of the saved model and records the parameters used for training the model.

In [None]:
from tfx.types import artifact
from tfx import types

Property = artifact.Property
PropertyType = artifact.PropertyType

EPOCH_PROPERTY = Property(type=PropertyType.INT)

class CustomModel(types.Artifact):
  """ Artifact that contains the trained model.

  * Properties:
     - 'epoch' : saves the number of epochs it took to train the model.

  """
  TYPE_NAME = 'Model'
  PROPERTIES = {
      'epoch' : EPOCH_PROPERTY,
  }

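The Trainer below exports the fine-tuned model with tf.saved_model.save. That function also accepts an optional signatures argument, which can be a dictionary that maps signature keys to either tf.function instances with input signatures or concrete functions. Keys of such a dictionary may be arbitrary strings, but will typically be from the tf.saved_model.signature_constants module.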

In [None]:
import tfx.v1 as tfx
from tfx.dsl.component.experimental.decorators import component


@tfx.dsl.components.component
def Trainer(
    preprocessed_data : tfx.dsl.components.InputArtifact[CsvDoc],
    trained_model : tfx.dsl.components.OutputArtifact[CustomModel],
    num_epochs :tfx.dsl.components.Parameter[int]) -> None:
      # CsvDoc stores the file location in its 'path' property.
      with open(preprocessed_data.path, "r") as f:
        df = pd.read_csv(f, delimiter = ',')
      tf_train_ds = tf.data.Dataset.from_tensor_slices(df['preprocessed_text'])
      # Run the GPT-2 preprocessor on each text, then batch, cache, and prefetch.
      processed_ds = tf_train_ds.map(gpt2_preprocessor, tf.data.AUTOTUNE).batch(20).cache().prefetch(tf.data.AUTOTUNE)

      # Record the training parameter on the output artifact.
      trained_model.epoch = num_epochs
      # The dataset is already preprocessed, so skip preprocessing in the model.
      gpt2_lm.include_preprocessing = False

      # Decay the learning rate linearly from 5e-5 to 0 over all training steps.
      lr = tf.keras.optimizers.schedules.PolynomialDecay(
          5e-5,
          decay_steps=processed_ds.cardinality() * num_epochs,
          end_learning_rate=0.0,
      )
      loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
      gpt2_lm.compile(
          optimizer=keras.optimizers.experimental.Adam(lr),
          loss=loss,
          weighted_metrics=["accuracy"])

      gpt2_lm.fit(processed_ds, epochs=num_epochs)
      # Export the fine-tuned model to the output artifact's directory.
      tf.saved_model.save(gpt2_lm, trained_model.uri)

In [None]:
# Train for a single epoch on the preprocessed CSV produced by ExampleGenerator.
custom_trainer = Trainer(preprocessed_data = ExampleGenerator.outputs['output_csv'] , num_epochs = 1)

In [None]:
context.run(custom_trainer, enable_cache = False)
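
The learning-rate schedule used in the Trainer can be worked through by hand. The sketch below is a plain-Python version of the polynomial decay formula, assuming the schedule's default settings (power of 1.0 and no cycling), which make the decay linear. The function name `polynomial_decay` and the step counts are hypothetical and chosen only for illustration.

```python
def polynomial_decay(step, initial_lr, decay_steps, end_lr=0.0, power=1.0):
    """Plain-Python version of the polynomial decay formula (no cycling)."""
    step = min(step, decay_steps)  # the rate stays at `end_lr` afterwards
    fraction = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction ** power + end_lr


steps_per_epoch = 50  # hypothetical number of batches per epoch
num_epochs = 2
decay_steps = steps_per_epoch * num_epochs

# Learning rate at the start, halfway through, and at the end of training.
for step in (0, 50, 100):
    print(step, polynomial_decay(step, 5e-5, decay_steps))
```

Because `decay_steps` is the number of batches times the number of epochs, the learning rate reaches the end value on the last training step.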