

# An introduction to Trax

Trax is a deep learning library developed and maintained by the **Google Brain team**.

It combines the power and experience of TensorFlow with cleaner, easier-to-understand code, in a way that closely resembles the Keras library (which is now part of TensorFlow itself).

Since it is developed by the Google Brain team, it includes many state-of-the-art algorithms and techniques, especially for *Natural Language Processing*!

For example, it contains code for:


*   Seq2Seq Models
*   Transformer Models
*   BERT Model
*   Reformer Model (the efficient Transformer!)

Aside from this, there are many useful preprocessing tools that help you get your NLP training up and running quickly!

Let us get started!

In [None]:
# Run this cell to set TPU in Colab
import os
import jax
import requests
# Run this to get the TPU address.
if 'TPU_DRIVER_MODE' not in globals():
  url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/tpu_driver0.1-dev20191206'
  resp = requests.post(url)
  TPU_DRIVER_MODE = 1

# The following is required to use TPU Driver as JAX's backend.
from jax.config import config
config.FLAGS.jax_xla_backend = "tpu_driver"
config.FLAGS.jax_backend_target = "grpc://" + os.environ['COLAB_TPU_ADDR']

In [None]:
try:
    import trax
except ModuleNotFoundError:
    # Woa google, didn't you get your own library to your colabs yet!?
    !pip install trax
    import trax
    
trax.fastmath.set_backend('jax')
# Layers used to build the Deep Learning Models
import trax.layers as tl
# Utilities to build and download datasets
import trax.data as data
# Fastmath is Trax's interface to NumPy-like math, backed by JAX or TensorFlow NumPy. JAX runs some NumPy operations much faster, but not all of them, hence the need for both.
from trax.fastmath import numpy as np
# Basics for training
import trax.supervised as ts

import numpy # default numpy


## Preprocessing Pipelines
---

Trax provides us easy to use resources to implement NLP Pipelines (actually, any preprocessing pipeline).

For example, Trax allows us to use many Tensorflow utilities, such as Tensorflow Datasets. We can download any dataset available there very easily.

To do it, we use:

```python 
data.TFDS(dataset_name, keys=(which_value_field_to_get, target_value), train=if_data_is_train_or_eval)
```

We can, for example, get the 'imdb_reviews' dataset to be used for sentiment analysis.

In [None]:
# Get dataset
imdb = data.TFDS('imdb_reviews', keys=('text', 'label'), train=True) #This returns a function that wraps a generator.

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...





Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.




In [None]:
# data.TFDS returns a function; calling it creates the actual generator.
# Each sample is the text in bytes format plus the sentiment (0 = neg, 1 = pos).
# Note that next(imdb()) builds a fresh generator on every call, so it always returns the first sample.
# To advance through the data, keep the generator around: stream = imdb(); next(stream); next(stream); ...
text_in_bytes, sent = next(imdb())
# To work with the text, we have to decode it, because it is in bytes format:
decoded = text_in_bytes.decode('utf-8')
# Slicing is safe even when the text is shorter than 100 characters.
print("The text, up to 100 characters, is:\n{}\nThe sentiment is: {}".format(decoded[:100], "\033[92mPositive" if sent else "\033[91mNegative"))

The text, up to 100 characters, is:
This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. 
The sentiment is: Negative
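
The difference between the dataset *function* and the *generator* it returns is easy to trip over. The sketch below reproduces it with a plain Python generator function; `fake_dataset` is a made-up stand-in for illustration, not part of Trax.

```python
def fake_dataset():
    """Hypothetical stand-in for the function returned by data.TFDS(...)."""
    yield b"first review", 0
    yield b"second review", 1
    yield b"third review", 1

# Calling the function again builds a brand-new generator every time,
# so this always yields the first sample.
always_first = [next(fake_dataset())[0] for _ in range(2)]

# Keeping a reference to one generator lets us advance through the data.
stream = fake_dataset()
advancing = [next(stream)[0] for _ in range(2)]

print(always_first)  # [b'first review', b'first review']
print(advancing)     # [b'first review', b'second review']
```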



---


As mentioned, Trax makes preprocessing pipelines easy to build.

To do it, we use a special type of object: the Serial combinator for data, available as trax.data.Serial. It lets us process data one function at a time, in a sequential manner.

```python
# Remember that we imported trax.data as data.
# function1 and function2 are placeholders for your own steps.
pipeline = data.Serial(
    function1,  # usually loads the data
    function2,  # does some preprocessing
    # ...
)
# Calling pipeline() returns a generator.
```

We can also make a preprocessing pipeline and then feed the data generator to it. This way, we can do many important tasks, such as:

```python
my_data_generator = dataset_generator  # a function with a yield statement
preprocessing_pipeline = data.Serial(
    my_remove_punct_function(puncts_to_remove='.,:;'),
    my_stopword_removal_function(stopword_dict='/data/stopword_dict.json'),
    my_tokenizer(),
    # ...
)
my_pipeline_generator = preprocessing_pipeline(my_data_generator())
```
Just remember that each function takes as input the same format that the previous function outputs (e.g., (1) takes a string and returns a list; (2) takes a list and returns a list, etc.).

(generator_sample)->(1)->(2)->...->(ml_output)
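
To make the chaining concrete, here is a minimal pure-Python sketch of the idea. The `serial` helper and the two steps are hypothetical and written only for illustration; they mimic the generator-in, generator-out contract described above and are not Trax's implementation.

```python
def serial(*steps):
    """Hypothetical helper: chains generator-to-generator functions in order."""
    def pipeline(generator):
        for step in steps:
            generator = step(generator)
        return generator
    return pipeline

def remove_punct(puncts_to_remove=".,:;"):
    """Returns a step that takes strings and yields strings without punctuation."""
    def step(generator):
        for text in generator:
            yield "".join(ch for ch in text if ch not in puncts_to_remove)
    return step

def whitespace_tokenizer():
    """Returns a step that takes strings and yields lists of tokens."""
    def step(generator):
        for text in generator:
            yield text.split()
    return step

# (1) takes a string and returns a string; (2) takes a string and returns a list.
toy_pipeline = serial(remove_punct(), whitespace_tokenizer())
result = list(toy_pipeline(iter(["Hello, world.", "Trax: pipelines; done"])))
print(result)  # [['Hello', 'world'], ['Trax', 'pipelines', 'done']]
```

Each step is built by a small factory function that returns the actual generator transformer, which is the same shape as calls such as `data.Shuffle()` in the pipelines of this notebook.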

To 'plug' your pipeline into Trax algorithms, it should yield data in the expected format: batch_input, batch_expected_output, and loss weights used for masking (if any).

In the following example, we show a pipeline that makes use of some useful preprocessing steps provided by Trax itself.

In [None]:
pipeline_with_data = data.Serial(
  data.TFDS('imdb_reviews', keys=('text', 'label'), train=True), # Loads the dataset generator. You can use it inside the pipeline or outside it, as we'll see in an example below.
  data.Tokenize(vocab_file='en_8k.subword', keys=[0]), # Tokenizes using a subword vocab: it breaks words into common prefixes and suffixes for better generalization.
  data.Shuffle(), # Randomizes the order to avoid bias from the way the data is stored.
  data.FilterByLength(max_length=128, length_keys=[0]), # Makes sure no sentence is longer than 128 tokens, useful for models with size constraints.
  data.BucketByLength(boundaries=[  32, 64, 128], # Bucketing groups sentences of similar length into the same batch, which keeps padding low during batch training.
                      batch_sizes=[128,  64, 32],
                      length_keys=[0]),
  data.AddLossWeights() # Adds loss weights, useful for masking and padding; otherwise padding can affect the model loss.
)

In [None]:
# Testing the pipeline:
batch_token_ids, batch_target_values, batch_loss_weights = next(pipeline_with_data())
# Here's how to detokenize the sentence input:
first_tokenized_sentence = batch_token_ids[0][:10] # Taking 10 tokens is safe: according to our bucketing policy the batches have a length of at least 32, even if some positions are just padding values.
detokenized = data.detokenize(first_tokenized_sentence, vocab_file='en_8k.subword')
print("Example sentence first 10 tokens: {}...\nExample first 10 tokens after detokenization: {}...\nSentiment: {}".format(first_tokenized_sentence, 
                                                                                    detokenized, 
                                                                                    "Positive" if batch_target_values[0] else "Negative"))

Example sentence first 10 tokens: [ 433 1268   96 6300    3  151    4 2156   37 1005]...
Example first 10 tokens after detokenization: As others have mentioned, all the women that go...
Sentiment: Positive
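
The bucketing, padding and loss-weight ideas can be sketched in a few lines of plain Python. This is only a conceptual illustration under stated assumptions: the helper names are made up, padding with id 0 and the inclusive `<=` comparison are choices made for the sketch, and none of this is Trax's actual implementation.

```python
def bucket_index(length, boundaries):
    """Index of the first bucket whose boundary fits the sample length."""
    for i, boundary in enumerate(boundaries):
        if length <= boundary:
            return i
    return len(boundaries)  # longer than every boundary

def pad_batch(samples, pad_id=0):
    """Right-pads token lists to the longest sample in the batch."""
    width = max(len(s) for s in samples)
    return [s + [pad_id] * (width - len(s)) for s in samples]

def loss_weights(padded, pad_id=0):
    """1.0 for real tokens, 0.0 for padding, so padding does not affect the loss."""
    return [[0.0 if tok == pad_id else 1.0 for tok in row] for row in padded]

boundaries = [32, 64, 128]
lengths = [10, 32, 33, 100]
buckets = [bucket_index(n, boundaries) for n in lengths]
print(buckets)  # [0, 0, 1, 2]

padded = pad_batch([[5, 9, 3], [7, 2]])
weights = loss_weights(padded)
print(padded)   # [[5, 9, 3], [7, 2, 0]]
print(weights)  # [[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
```

Grouping samples of similar length means each batch needs little padding, and shorter buckets can afford larger batch sizes, which is why the batch sizes above shrink as the boundaries grow.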


In [None]:
# Here's an example with our pipeline

def mock_generate_samples(mock_data):
    """
    # Change this to use your own file! (format: each line has text \t sent)
    with open(file) as f:
        for line in f:
            values = line.split('\t')
            if len(values) < 2:
                continue
            yield values[0], values[1]
    """
    strings = mock_data.split('.')
    for string in strings:
        values = string.split(';')
        if len(values) < 2:
            continue
        yield values[0], values[1]

# Here's an example of how to add your own functions to the pipeline.

def change_all_to_lowercase(generator):
    for entry in generator:
        next_text, next_sent = entry
        lowercased_text = next_text.lower()
        yield lowercased_text, next_sent

# You can also wrap it in a factory function, in the same style as Trax's built-in steps (e.g. data.Shuffle()).

def Change_All_To_LowerCase(): 
  """Returns function to lowercase the sentence."""
  return lambda g: change_all_to_lowercase(g)

my_generator = mock_generate_samples("This is a positive sentence; 1. This is another positive sentence; 1. This is a negative sentence :(; 0.")
my_pipeline = data.Serial(
    change_all_to_lowercase, #Order matters! Lowercase before tokenizing.
    data.Tokenize(vocab_file='en_8k.subword', keys=[0]),
    data.Shuffle(),
    data.AddLossWeights()
)
# We'll get a warning because our dataset is small, but it works.
# We iterate over the results here only for illustration. In practice, the pipeline is used directly as the input stream.
for tokens, sent, weights in my_pipeline(my_generator):
    print(tokens, sent, data.detokenize(tokens, vocab_file='en_8k.subword'))



[  84  114   25 1066 6567    2 4965  665]  1  this is another positive sentence
[ 114   25   12 6567    2 4965  665]  1 this is a positive sentence
[  84  114   25   12 5780  160 4965  665  192 1304  186]  0  this is a negative sentence :(


## Creating Models


---

Now that we've seen preprocessing, it's time to move on to modeling itself.

Trax allows the use of models in two ways:

<ul>
<li>Predefined models</li>
<ul>
<li>Seq2Seq with Attention</li>
<li>BERT</li>
<li>Transformer</li>
<li>Reformer</li>
</ul>
<li>Self-made models with layers.</li>
</ul>

Let's peek at one of Trax's [pre-made models](https://trax-ml.readthedocs.io/en/latest/trax.models.html), the LSTMSeq2SeqAttn.


In [None]:
# First, I'll download a vocab file to be used as an example.
import os.path
if not os.path.isfile('/content/pt.wiki.bpe.vs10000.vocab'):
    !wget https://nlp.h-its.org/bpemb/pt/pt.wiki.bpe.vs10000.vocab
# Suppose we'll use this model to build a Neural Machine Translator from English to Portuguese with a Seq2Seq model.
# We can do that by instantiating the predefined model LSTMSeq2SeqAttn. You can check all available parameters in the Trax link above.
seq2seq_model = trax.models.LSTMSeq2SeqAttn(input_vocab_size=data.vocab_size(vocab_file='en_8k.subword'), # I'm using data.vocab_size to get the vocab size that is going to be used by the tokenizer
                                            target_vocab_size=data.vocab_size(vocab_type='subword', vocab_file='pt.wiki.bpe.vs10000.vocab', vocab_dir='/content'))
# We can peek its structure:
print(seq2seq_model)

Serial_in2_out2[
  Select[0,1,0,1]_in2_out4
  Parallel_in2_out2[
    Serial[
      Embedding_8183_512
      LSTM_512
      LSTM_512
    ]
    Serial[
      Serial[
        ShiftRight(1)
      ]
      Embedding_10000_512
      LSTM_512
    ]
  ]
  PrepareAttentionInputs_in3_out4
  Serial_in4_out2[
    Branch_in4_out3[
      None
      Serial_in4_out2[
        _in4_out4
        Serial_in4_out2[
          Parallel_in3_out3[
            Dense_512
            Dense_512
            Dense_512
          ]
          PureAttention_in4_out2
          Dense_512
        ]
        _in2_out2
      ]
    ]
    Add_in2
  ]
  Select[0,2]_in3_out2
  LSTM_512
  LSTM_512
  Dense_10000
  LogSoftmax
]


That's a lot of things, right? Each of these is one of the network **layers**.

Layers are the LEGO blocks of Trax!

<img src='https://cdn.britannica.com/48/182648-050-6C20C6AB/LEGO-bricks.jpg' height=300>

With these layers we can create our own models as well.

Each layer is a function (or a whole bunch of functions bundled together) that takes some input and returns output in a well-defined format.

Layers can be combined using what are called "combiner layers".

Three commonly used "combiner" layers are "Serial" (runs layers one after another), "Branch" (applies several layers to copies of the same input, in parallel) and "Residual" (runs layers in sequence and adds the result back to the input).

Example:

```python
# tl is a common shortcut for trax.layers, as np is for numpy and pd is for pandas.
model = tl.Serial(
    layer1,
    layer2,
    layer3
)
# Runs layer1 > layer2 > layer3, each one feeding the next.
model2 = tl.Branch(
    layer1,
    layer2
)
# Applies layer1 and layer2 to copies of the same input, in parallel.
```
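
To build intuition for what the combiners do with their inputs, here is a toy sketch using plain Python functions as "layers". The `serial`, `branch` and `residual` helpers are hypothetical and only imitate the data flow; real Trax layers also manage weights, state and multiple inputs.

```python
def serial(*layers):
    """Feeds the output of each layer into the next one."""
    def run(x):
        for layer in layers:
            x = layer(x)
        return x
    return run

def branch(*layers):
    """Applies every layer to the same input and returns all outputs."""
    def run(x):
        return tuple(layer(x) for layer in layers)
    return run

def residual(*layers):
    """Adds the input back to the output of the serial stack (a skip connection)."""
    inner = serial(*layers)
    def run(x):
        return x + inner(x)
    return run

def double(x):
    return 2 * x

def add_one(x):
    return x + 1

serial_out = serial(double, add_one)(3)
branch_out = branch(double, add_one)(3)
residual_out = residual(double, add_one)(3)
print(serial_out)    # 7
print(branch_out)    # (6, 4)
print(residual_out)  # 10
```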

### Create your own layers

We can use Trax's [predefined layers](https://trax-ml.readthedocs.io/en/latest/trax.layers.html) (such as LSTM cells, Dense layers, etc.), but we can also make our own layers.

There are two ways to create our own layers. The first is to subclass an existing layer and override its methods (typically `forward`). We'll see an example with a modified Dense layer.

The other method is to provide a function to ```trax.layers.PureLayer(forward_fn = your_function_here)``` and add it as a layer.

For the first approach, suppose we wanted to make a Dense layer with an unusual bias pattern in the forward pass. We make a class that inherits from tl.Dense and override the forward method, like so:

In [None]:
from trax.fastmath import numpy as jnp #Numpy on steroids for the computations

class MyDenseLayer(tl.Dense):
    def forward(self, x):
        """Our modified forward pass with half bias term.
        Args:
        x: Tensor of same shape and dtype as the input signature used to
            initialize this layer.
        Returns:
        Tensor of same shape and dtype as the input, except the final dimension
        is the layer's `n_units` value.
        """
        if self._use_bias:
            if not isinstance(self.weights, (tuple, list)):
                raise ValueError(f'Weights should be a (w, b) tuple or list; '
                                f'instead got: {self.weights}')
            w, b = self.weights
            return jnp.dot(x, w) + b*0.5  # Here's where we add our modification, by halving the bias term. You can check the original source at https://github.com/google/trax/blob/master/trax/layers/core.py
        else:
            w = self.weights
            return jnp.dot(x, w)  
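
To see what the halved bias does numerically, here is a small sketch with plain NumPy and made-up weights. It only mirrors the arithmetic of the `forward` method above and does not use Trax.

```python
import numpy

x = numpy.array([[1.0, 2.0]])            # one sample with two features
w = numpy.array([[1.0, 0.0, -1.0],
                 [0.5, 1.0, 2.0]])       # maps 2 features to 3 units
b = numpy.array([2.0, 4.0, 6.0])

standard = numpy.dot(x, w) + b           # what a regular dense layer computes
half_bias = numpy.dot(x, w) + b * 0.5    # the modified forward pass

print(standard)   # [[4. 6. 9.]]
print(half_bias)  # [[3. 4. 6.]]
```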


We can also define a full layer from scratch by inheriting from trax.layers.base.Layer, but I won't get into that.

Let us now build an entire model with this layer that we've built. We'll use a Serial Layer as a combiner.

In [None]:
# Btw this is the same as the sample from trax quickstart guide.
sentiment_analysis_model = tl.Serial(
    tl.Embedding(data.vocab_size(vocab_file='en_8k.subword'), d_feature=256), #Add an embeddings layer to turn tokens into embeddings with 256 dim and vocab size equal to the one used for tokenization.
    tl.Mean(axis=1),  # Average on axis 1 (length of sentence).
    MyDenseLayer(2),      # Classify 2 classes.
    tl.LogSoftmax()   # Produce log-probabilities.
)
# Let us peek into our model
print(sentiment_analysis_model)

Serial[
  Embedding_8183_256
  Mean
  Dense_2
  LogSoftmax
]
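
To make the data flow through this model explicit, the sketch below reproduces the four steps (embedding lookup, mean over the sentence, dense layer, log-softmax) with plain NumPy and random weights. The sizes are toy values chosen for the example, and this is an illustration of the shapes involved, not the Trax code.

```python
import numpy

rng = numpy.random.default_rng(0)
vocab_size, d_feature, n_classes = 50, 8, 2   # toy sizes, far smaller than the real model

embedding_table = rng.normal(size=(vocab_size, d_feature))
dense_w = rng.normal(size=(d_feature, n_classes))
dense_b = numpy.zeros(n_classes)

token_ids = numpy.array([[3, 17, 42, 0],
                         [8, 1, 0, 0]])        # (batch=2, length=4), 0 used as padding

embedded = embedding_table[token_ids]          # (2, 4, 8): one vector per token
averaged = embedded.mean(axis=1)               # (2, 8): average over the sentence length
logits = numpy.dot(averaged, dense_w) + dense_b  # (2, 2): one score per class
# Log-softmax: subtract the log of the summed exponentials.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - numpy.log(numpy.exp(shifted).sum(axis=-1, keepdims=True))

print(embedded.shape, averaged.shape, log_probs.shape)  # (2, 4, 8) (2, 8) (2, 2)
```

Note that the plain mean also averages over padding positions, which is one reason the training pipeline buckets sentences of similar length together.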




---


## Training Models

So we have already seen how pipelines are created and how to create and use models.

Now it is time to put both of them to work together.

Let us put the machine to learn!

<img src='https://images.unsplash.com/photo-1563209259-b2fa97148ce1?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1257&q=80' height=300>



---

Remember, there are two main families of machine learning:

*   Supervised
*   Unsupervised


We're only covering supervised and self-supervised models (those where we get the targets from the inputs).

To do Supervised Training in Trax, we have to define three important 'blocks':

1.   The training task, which takes as input the labeled data, the loss layer, the optimizer and the number of steps between checkpoints.
2.   The eval task, which takes as input the labeled data, the metrics and the number of eval batches.
3.   The training loop, which puts all of this together.

We get all that from trax.supervised. 

Before that, though, let us use the pipelines we learned about earlier to build the labeled data for training and eval.


In [None]:
# First we get the streams from TFDS
train_stream = trax.data.TFDS('imdb_reviews', keys=('text', 'label'), train=True)()
eval_stream = trax.data.TFDS('imdb_reviews', keys=('text', 'label'), train=False)()

# Next, we build the pipeline
data_pipeline = trax.data.Serial(
    trax.data.Tokenize(vocab_file='en_8k.subword', keys=[0]),
    trax.data.Shuffle(),
    trax.data.FilterByLength(max_length=2048, length_keys=[0]),
    trax.data.BucketByLength(boundaries=[  32, 128, 512, 2048],
                             batch_sizes=[512, 128,  32,    8, 1],
                             length_keys=[0]),
    trax.data.AddLossWeights()
  )

# Finally, we get the generators
train_batches_stream = data_pipeline(train_stream)
eval_batches_stream = data_pipeline(eval_stream)

# We should redefine the model with the correct Dense layer, otherwise weird things will happen!
sentiment_analysis_model = tl.Serial(
    tl.Embedding(data.vocab_size(vocab_file='en_8k.subword'), d_feature=256),
    tl.Mean(axis=1),
    tl.Dense(2),
    tl.LogSoftmax()
)



---
Now we can define the training/eval tasks and the loop.

Just some important notes before.

About the training and eval tasks:
* We're using CrossEntropyLoss for the loss layer.
* We're using Adam as the optimizer. It is a de facto standard optimizer for DNNs.
* We're also using accuracy as a metric. There are others which are fit to other situations. Check them here: https://trax-ml.readthedocs.io/en/latest/trax.layers.html#module-trax.layers.metrics
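
As a reminder of what these two numbers measure, here is a NumPy sketch that computes a cross-entropy loss and an accuracy from made-up log-probabilities. It ignores loss weights and is meant as an illustration of the quantities being reported, not as Trax's implementation of either layer.

```python
import numpy

# Hypothetical log-probabilities for 3 samples and 2 classes.
log_probs = numpy.log(numpy.array([[0.9, 0.1],
                                   [0.2, 0.8],
                                   [0.6, 0.4]]))
targets = numpy.array([0, 1, 1])

# Cross-entropy: the average negative log-probability given to the correct class.
picked = log_probs[numpy.arange(len(targets)), targets]
cross_entropy = -picked.mean()

# Accuracy: the fraction of samples whose most likely class is the correct one.
predictions = log_probs.argmax(axis=-1)
accuracy = (predictions == targets).mean()

print(round(float(cross_entropy), 4), round(float(accuracy), 4))
```

The third sample is misclassified (class 0 gets 0.6 while the target is 1), so it contributes the largest term to the loss and lowers the accuracy to two out of three.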

About the training loop:
* We create an output dir for the weights and checkpoints.
* We plug in the model and the train/eval tasks.
* We run it by calling training_loop.run =)

In [None]:
from trax.supervised import training
import os

# Training task.
train_task = training.TrainTask(
    labeled_data=train_batches_stream,
    loss_layer=tl.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam(0.01),
    n_steps_per_checkpoint=200, # Saves a checkpoint and prints the results every 200 training steps.
)

# Evaluation task.
eval_task = training.EvalTask(
    labeled_data=eval_batches_stream,
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
    n_eval_batches=20  # For less variance in eval numbers.
)

# Training loop saves checkpoints to output_dir.
output_dir = os.path.expanduser('~/output_dir/')
!rm -rf {output_dir}
training_loop = training.Loop(sentiment_analysis_model,
                              train_task,
                              eval_tasks=[eval_task],
                              output_dir=output_dir)

# Run 2000 steps (batches).
training_loop.run(2000)


Step      1: Total number of trainable weights: 2095362
Step      1: Ran 1 train steps in 1.40 secs
Step      1: train CrossEntropyLoss |  0.69225746
Step      1: eval  CrossEntropyLoss |  0.68985087
Step      1: eval          Accuracy |  0.54062500

Step    200: Ran 199 train steps in 17.50 secs
Step    200: train CrossEntropyLoss |  0.64294529
Step    200: eval  CrossEntropyLoss |  0.60096635
Step    200: eval          Accuracy |  0.66406250

Step    400: Ran 200 train steps in 16.77 secs
Step    400: train CrossEntropyLoss |  0.46843573
Step    400: eval  CrossEntropyLoss |  0.42378835
Step    400: eval          Accuracy |  0.81562500

Step    600: Ran 200 train steps in 15.25 secs
Step    600: train CrossEntropyLoss |  0.37792361
Step    600: eval  CrossEntropyLoss |  0.37655992
Step    600: eval          Accuracy |  0.83281250

Step    800: Ran 200 train steps in 15.00 secs
Step    800: train CrossEntropyLoss |  0.36868617
Step    800: eval  CrossEntropyLoss |  0.38055726
Step   


Okay, now how do we use this model?



---



## Predicting from new inputs

So we have a trained model. How do we use it?

Simple! **Just feed a tokenized input to the model**!

But a word of caution first: Trax models (like models in all current deep learning frameworks) expect the input to come with a batch dimension in addition to the expected input dimensions. So we have to wrap our sample in a batch of size one.

Let us try it out.

In [None]:
import numpy
example_input = "I loved the way that the actors were cast, also, It is clear that they've put a huge effort in post-production."
#example_input = "Very bad movie"
# Steps explained:
# 1st: tokenize the input. We wrap it in an iterator to mimic a generator.
input_iter = iter([example_input])
input_tokens = data.tokenize(input_iter, vocab_file='en_8k.subword')
# 2nd: cast the results to a list and get the first value (the tokens, not the label or anything else)
tokenized_input = list(input_tokens)[0]
# 3rd: add a fake batch dimension
tokenized_with_batch = tokenized_input[None, :]
# 4th: feed it to the model and get the log-probabilities
sentiment_log_probs = sentiment_analysis_model(tokenized_with_batch)
# 5th: exponentiate the log-probabilities to get values between 0 and 1.
norm_log_probs = numpy.exp(sentiment_log_probs)
# 6th: pick the class with the greatest value for the first (and only) sample in the batch (that's what argmax does)
sentiment = numpy.argmax(norm_log_probs[0])
print('Input review:\n"{}"\nThe sentiment is: {}'.format(example_input, "\033[92mPositive" if sentiment else "\033[91mNegative"))

Input review:
"I loved the way that the actors were cast, also, It is clear that they've put a huge effort in post-production."
The sentiment is: Negative
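
The batch dimension and the exp/argmax steps are easy to check in isolation. The sketch below uses plain NumPy with made-up token ids and made-up log-probabilities, so it runs without a trained model.

```python
import numpy

tokens = numpy.array([12, 7, 431, 9])   # made-up token ids for one sentence
batched = tokens[None, :]               # add a leading batch axis
print(tokens.shape, batched.shape)      # (4,) (1, 4)

# Made-up log-probabilities, shaped like the model output for one sample.
log_probs = numpy.log(numpy.array([[0.25, 0.75]]))
probs = numpy.exp(log_probs)            # back to probabilities that sum to 1
label = int(numpy.argmax(probs[0]))     # index of the largest probability
print(label)                            # 1
```

Since `exp` preserves ordering, taking `argmax` directly on the log-probabilities gives the same label; exponentiating is only needed when you want to read the values as probabilities.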


In [None]:
# Just a hint: make your life easier with a function:
import numpy
def parse_sentiment(text, model, vocab_file = 'en_8k.subword'):
    input_iter = iter([text])  # Use the function argument, not the global example_input.
    input_tokens = data.tokenize(input_iter, vocab_file=vocab_file)
    tokenized_input = list(input_tokens)[0]
    tokenized_with_batch = tokenized_input[None, :]
    sentiment_log_probs = model(tokenized_with_batch)
    norm_log_probs = numpy.exp(sentiment_log_probs)
    sentiment = numpy.argmax(norm_log_probs[0])
    return "Positive" if sentiment else "Negative"




---


## Restoring the Model

Restoring a checkpoint lets us resume training from a certain point, which is useful if you want to train for a really long time, if the running session crashes for some reason, or if you want to try new training parameters.

In [None]:
# This loads a checkpoint. We reuse output_dir, the already expanded path, because '~' is not expanded automatically:
training_loop.load_checkpoint(directory=output_dir, filename="model.pkl.gz")
# Continue training:
training_loop.run(200)


Step   2400: Ran 200 train steps in 13.82 secs
Step   2400: train CrossEntropyLoss |  0.28808481
Step   2400: eval  CrossEntropyLoss |  0.37845760
Step   2400: eval          Accuracy |  0.84023437


We can also load a pretrained model. This allows us to use models that have been trained before and are to be used in production:

In [None]:
# First, we need the same structure:
new_model = tl.Serial(
    tl.Embedding(data.vocab_size(vocab_file='en_8k.subword'), d_feature=256),
    tl.Mean(axis=1),
    tl.Dense(2),
    tl.LogSoftmax()
)
# Then, we load the weights:
new_model.init_from_file(file_name="/root/output_dir/model.pkl.gz", weights_only=True) # Only load weights
# Same result as before (I used a helper function for simplicity)
print("The sentiment is: ", parse_sentiment("Very bad movie", new_model))

The sentiment is:  Negative


### Saving as Keras model and exporting to Tensorflow Serving

Trax models can easily be converted to Keras layers. Entire models can be wrapped in a Keras layer this way and then saved through Keras to be exported to TensorFlow Serving. This is a convenient feature that helps bring your trained models into production.

However, for that to work, the backend must be set to tensorflow-numpy, the same one behind Keras. And **the model has to be trained using tensorflow-numpy** as backend (you cannot train with Jax and deploy with tensorflow-numpy).

In [None]:
import tensorflow as tf
# Setting the backend
trax.fastmath.set_backend("tensorflow-numpy")
print(trax.fastmath.backend_name())

tensorflow-numpy


One hint: train the bulk on Jax (if the promise of efficiency keeps up) and then load tensorflow-numpy for a few epochs to convert the model to tensorflow.

In [None]:
training_loop = training.Loop(sentiment_analysis_model,
                              train_task,
                              eval_tasks=[eval_task],
                              output_dir=output_dir)

# Run 20 steps (batches).
training_loop.run(20)

Converting the model to a Keras layer is very straightforward: just call trax.AsKeras(model) and you get a Keras layer as the output!

In [None]:
# To convert the model to Keras, simply run:
keras_layer = trax.AsKeras(sentiment_analysis_model)
# This will be a trax.trax2keras.AsKeras object
print(keras_layer)

# Run the Keras layer to verify it returns the same result.
example_input = list(data.tokenize(iter(["I loved the way that the actors were cast, also, It is clear that they've put a huge effort in post-production."]), vocab_file="en_8k.subword"))[0]
sentiment_activations = keras_layer(example_input[None, :])
print(f'Keras returned sentiment activations: {numpy.asarray(sentiment_activations)}')

<trax.trax2keras.AsKeras object at 0x7f2ff7a588b0>
Keras returned sentiment activations: [[-0.03070426 -3.498665  ]]


Finally, use it as a layer on a full keras model!

In [None]:
# Create a full Keras  model using the layer from Trax.
inputs = tf.keras.Input(shape=(None,), dtype='int32')
hidden = keras_layer(inputs)
# You can add other Keras layers here operating on hidden.
outputs = hidden
keras_model = tf.keras.Model(inputs=inputs, outputs=outputs)
print(keras_model)

<keras.engine.functional.Functional object at 0x7f2ff6098d00>


Saving the model for deployment works the same as for other Keras models:

In [None]:
model_file = os.path.join("/content/", "saved_model")
# This yields a .pb file in the defined path: /content/saved_model/saved_model.pb
keras_model.save(model_file)





---


# Other NLP Tasks with Trax

Now that we've seen how to do a basic sentiment analysis with Trax, I'll give some examples on how to do other tasks as well.

For some of them, we'll need quite a lot of data, so bear with me during the downloads, ok?


---
## Neural Machine Translation with Transformer model

Suppose we wanted to train a Transformer to translate from English to Portuguese. We could do that! First, we get some data, for example, the English-Portuguese (enpt) pairs from the para_crawl corpus.

Note: this is a huge corpus (about 2.65 GB after decompression), so it takes a good while to download (more than 30 min).

In [None]:
nmt_train_stream = data.TFDS('para_crawl/enpt', keys=['en', 'pt'], train = True, eval_holdout_size=0.2)()
nmt_eval_stream = data.TFDS('para_crawl/enpt', keys=['en', 'pt'], train = False, eval_holdout_size=0.2)()

### Data preparation and Model creation

To train an NMT model, we need a vocab file that (preferably) covers both languages at once. The cell below shows how to build one, and also downloads a predefined one.

If you want to use it for your own tasks, feel free to change the line limit (or write the whole corpus to the file) and increase the vocab_size. Just remember to save this vocab file.

In [None]:
# This downloads a SentencePiece vocab that I trained beforehand. Comment it out to generate the vocab yourself.

!wget https://storage.googleapis.com/dl_models/enpt_32k.model

# The following steps train a SentencePiece model on Portuguese and English text. They are not necessary if you already have one.
import os.path
import io
import sentencepiece as spm
if not os.path.isfile('/content/enpt_32k.model'):
    if not os.path.isfile('/content/lines.txt'):
        def turn_port_to_file(stream):
            linenum = 0
            with open('lines.txt', 'w') as file:
                for en, pt in stream:
                    file.write(pt.decode('utf8')+"\n")
                    file.write(en.decode('utf8')+"\n")
                    linenum+=2
                    if linenum > 100000:
                        break
        turn_port_to_file(nmt_train_stream)
    model = io.BytesIO()
    # We're using SentencePiece with the BPE model type. It builds a 32k-piece vocab with both Portuguese and English words.
    spm.SentencePieceTrainer.train(
        input='/content/lines.txt', model_writer=model, vocab_size=32000, pad_id=0, unk_id=2,
         bos_id=3, eos_id=1, pad_piece='<pad>', unk_piece='<unk>', bos_piece='<bos>', eos_piece='<eos>', num_threads=128, model_type='BPE') 
    # Important: Remember to use these pad, unk, bos and eos ids in custom vocab files, since Trax will expect them in this order.

    # Serialize the model as file.
    with open('/content/enpt_32k.model', 'wb') as f:
        f.write(model.getvalue())

# We set a simple model. Be aware that this is not the most efficient out there!
transformer_nmt = trax.models.Transformer(input_vocab_size=data.vocab_size(vocab_type='sentencepiece', vocab_file='enpt_32k.model', vocab_dir='/content'),
                                            output_vocab_size=data.vocab_size(vocab_type='sentencepiece', vocab_file='enpt_32k.model', vocab_dir='/content'),d_model=512,
            n_encoder_layers=2,
            n_decoder_layers=2,
            n_heads =4)

In [None]:
# Prepare the preprocessing pipeline
nmt_data_pipeline = trax.data.Serial(
    trax.data.Tokenize(vocab_type='sentencepiece', vocab_file='enpt_32k.model', vocab_dir='/content', keys=[0, 1]),
    trax.data.Shuffle(),
    trax.data.FilterByLength(max_length=1024, length_keys=[0, 1]),
    trax.data.BucketByLength(boundaries=[16, 32, 64, 128, 256, 512, 1024], #Let us go only with smaller strings for simplicity
                             batch_sizes=[512, 256, 128, 64, 32, 16, 8],
                             length_keys=[0, 1]),
    trax.data.AddLossWeights()
  )

nmt_train_batches_stream = nmt_data_pipeline(nmt_train_stream)
nmt_eval_batches_stream = nmt_data_pipeline(nmt_eval_stream)

### Model Training

In [None]:
from trax.supervised import training
import os

# Training task, as usual, but with one more step: setting up the lr scale with warmup for better training.
nmt_train_task = training.TrainTask(
    labeled_data=nmt_train_batches_stream,
    loss_layer=tl.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam(0.1),
    lr_schedule= trax.lr.warmup_and_rsqrt_decay(1000, 0.01),
    n_steps_per_checkpoint=200
)

# Evaluation task.
nmt_eval_task = training.EvalTask(
    labeled_data=nmt_eval_batches_stream,
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
    n_eval_batches=20
)

# Training loop saves checkpoints to output_dir.
output_dir = os.path.expanduser('/content/nmt')
nmt_training_loop = training.Loop(transformer_nmt,
                              nmt_train_task,
                              eval_tasks=[nmt_eval_task],
                              output_dir=output_dir)
# Here you can download a pretrained model. I've trained it for more than 50k steps, but it only got to about 55% accuracy.
# !wget --directory-prefix=nmt/ https://storage.googleapis.com/dl_models/model.pkl.gz
if os.path.isfile('/content/nmt/model.pkl.gz'):
    # Pass the directory and the file name separately, as in the earlier checkpoint example.
    nmt_training_loop.load_checkpoint(directory=output_dir, filename='model.pkl.gz')
# This will take a while (I trained for about 30 hours!). It can get to about 56% accuracy. This is a vague metric, and the translations will still be awful (gibberish).
# Unless you're like Google and have 'unlimited' TPUs, you won't be able to do anything useful.
nmt_training_loop.run(15000) 

### Performing Translations
There are different techniques to perform a translation using a NMT model. 

One method is **autoregressive sampling** (currently implemented in trax 1.36), which generates the translation one token at a time, with each new token conditioned on the tokens produced so far.
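
To make the idea concrete, here is a minimal sketch of an autoregressive sampling loop in plain Python. It is not Trax's implementation: `VOCAB`, `toy_logits` and `sample_next` are hypothetical stand-ins, and the toy "model" ignores the source sentence entirely. In this sketch a temperature of 0 means greedy decoding (always take the highest-scoring token), while higher temperatures flatten the distribution and give more varied outputs.

```python
import math
import random

# Hypothetical toy vocabulary; token 0 marks the end of the sequence.
VOCAB = ["<eos>", "tok_a", "tok_b", "tok_c"]
EOS_ID = 0

def toy_logits(generated):
    # Stand-in for a model: scores depend only on the tokens generated so far.
    # It favours emitting tokens 1, 2, 3 in order and then EOS.
    nxt = len(generated) + 1
    target = nxt if nxt < len(VOCAB) else EOS_ID
    return [4.0 if i == target else 0.0 for i in range(len(VOCAB))]

def sample_next(logits, temperature, rng):
    if temperature == 0.0:
        # Greedy choice: index of the largest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over temperature-scaled logits (shifted by the max for stability).
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def autoregressive_sample(temperature=0.0, max_len=10, seed=0):
    rng = random.Random(seed)
    generated = []
    for _ in range(max_len):
        # Each step sees everything generated so far.
        token = sample_next(toy_logits(generated), temperature, rng)
        generated.append(token)
        if token == EOS_ID:
            break
    return generated

greedy = autoregressive_sample(temperature=0.0)
print([VOCAB[t] for t in greedy])
```

With `temperature=0.0` the loop is deterministic. Passing something like `temperature=1.0` makes each run depend on the random seed.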

There's another method that is not available in the current release: BeamSearch. It is in some of the Trax prototypes, so expect it to be available soon. BeamSearch keeps several candidate translations in parallel and chooses the one with the best score, so it usually gives better results than autoregressive sampling.
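
Since beam search is not in the release used here, the following is only a toy sketch of the idea in plain Python, not a Trax API. The `LOG_PROBS` table is a made-up stand-in for a model: it maps a prefix to the log-probabilities of the next token. The numbers are chosen so that the greedy path (beam width 1) ends up with a lower total score than the path a wider beam finds.

```python
import math

EOS_ID = 0

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
LOG_PROBS = {
    (): {1: math.log(0.6), 2: math.log(0.4)},
    (1,): {0: math.log(0.55), 2: math.log(0.45)},
    (2,): {0: math.log(0.9), 1: math.log(0.1)},
    (1, 2): {0: math.log(1.0)},
    (2, 1): {0: math.log(1.0)},
}

def beam_search(beam_width, max_len=5):
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live prefix by every possible next token.
        candidates = []
        for prefix, score in beams:
            for token, logp in LOG_PROBS[prefix].items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the best `beam_width` candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == EOS_ID:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    # Fall back to unfinished beams if nothing reached EOS within max_len.
    return max(finished or beams, key=lambda c: c[1])

greedy_like = beam_search(beam_width=1)
wider = beam_search(beam_width=2)
print(greedy_like[0], round(math.exp(greedy_like[1]), 2))
print(wider[0], round(math.exp(wider[1]), 2))
```

With a width of 1 the search commits to token 1 because it looks best at the first step. With a width of 2 it also keeps token 2 alive, which leads to a higher-probability complete sequence.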

A word of caution: if you use the model that I've pretrained, the output will be gibberish. To really train a translator we'd need more than 30h on a single TPU to achieve something useful. But it is a head start. If you need it and have the budget, it's not that hard to scale!

In [None]:
# We start by tokenizing the input sentence.
sentence = 'this does not yield good results'
tokenized = list(trax.data.tokenize(
    iter([sentence]),  # Operates on streams.
    vocab_type='sentencepiece', vocab_file='enpt_32k.model', vocab_dir='/content'))[0]

# We decode the result of feeding the input to the Transformer using autoregressive sampling.
tokenized = tokenized[None, :]  # Add batch dimension.
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    transformer_nmt, tokenized, temperature=0.2)  # Temperature helps avoid "rigid" translations: the closer to 1, the more diverse the results.

# We de-tokenize the results.
tokenized_translation = tokenized_translation[0][:-1]  # Remove batch and EOS token.
translation = trax.data.detokenize(tokenized_translation,
                                   vocab_type='sentencepiece', vocab_file='enpt_32k.model', vocab_dir='/content')
print(translation)
# You might get a lot of hyphens and other stuff with the model. That's because it's too weak.
# Head over to https://trax-ml.readthedocs.io/en/latest/notebooks/trax_intro.html to find a working translator from English to German.



---


## Named Entity Recognition with Reformer model

This example has been adapted from the original one in the Trax git repository (https://github.com/google/trax/blob/master/trax/examples/NER_using_Reformer.ipynb).

We'll be using the dataset provided on Kaggle at the following link: https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus?select=ner_dataset.csv

I've provided a Google Cloud Storage option for easier usability.

We'll download it and load it into pandas to do some simple preprocessing.

In [None]:
import pandas as pd
import numpy
import random as rnd # For using random functions
!wget https://storage.googleapis.com/dl_models/ner_dataset.csv
df = pd.read_csv("/content/ner_dataset.csv", encoding='ISO-8859-1')
# Forward-fill missing values; fillna(method='ffill') is deprecated in recent pandas.
df = df.ffill()
df.head()

### Data Preprocessing

The following cell performs all steps needed to preprocess the input to be used to train the model.
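
To see what this preprocessing does before running it on the full corpus, here is a small sketch on a made-up DataFrame. The rows and the name `toy` are hypothetical. The sketch only assumes the same column layout as `ner_dataset.csv`, where the sentence id is filled in only on the first word of each sentence. It forward-fills the ids and then groups the rows into one list of (word, POS, tag) triplets per sentence.

```python
import pandas as pd

# Hypothetical rows in the same column layout as ner_dataset.csv.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["London", "is", "big", "Hello", "Ohio"],
    "POS": ["NNP", "VBZ", "JJ", "UH", "NNP"],
    "Tag": ["B-geo", "O", "O", "O", "B-geo"],
})

# Forward fill copies each sentence id down to the rows that follow it.
toy = toy.ffill()

# One list of (word, pos, tag) triplets per sentence, in order of appearance.
toy_sentences = [
    list(zip(group["Word"], group["POS"], group["Tag"]))
    for _, group in toy.groupby("Sentence #", sort=False)
]

print(toy_sentences[0])
print(toy_sentences[1])
```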

In [None]:
# Creates a class to process the corpus.
class Get_sentence(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        # Use an aggregator function to merge together all words, pos and tags with the same sentence ID.
        agg_func = lambda s:[(w,p,t) for w,p,t in zip(s["Word"].values.tolist(),
                                                     s["POS"].values.tolist(),
                                                     s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

# Generate the corpus object by instantiating it with the dataframe
getter = Get_sentence(df)

# Access the merged sentences through the .sentences attribute.
sentence = getter.sentences

# Create a list of unique words
words = list(set(df["Word"].values))

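# Create a list of unique tags. Index 0 is reserved for padding,
# because BucketByLength pads the batches with zeros.
words_tag = ['<PAD>'] + list(set(df["Tag"].values))

# Create a word to idx dict. Word ids start at 1, so 0 stays free for padding.
# (Using len(word_idx) here would collide with the id of the last word.)
word_idx = {w: i + 1 for i, w in enumerate(words)}
word_idx['<PAD>'] = 0

# Create a tag to idx dict.
tag_idx = {t: i for i, t in enumerate(words_tag)}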

# Create a generator function to be used in the data pipeline.
def id_translator(sentences):
    while True: # Simulates a loop around the batcher. No problem since we are shuffling using trax.
        for sentence in sentences: # Loops around sentences.
            words = [word_idx[triplet[0]] for triplet in sentence]
            tags = [tag_idx[triplet[2]] for triplet in sentence]
            yield numpy.array([words, tags])

In the original sample, a data generator was manually created. 

We will simplify that by using Trax's own resources:

In [None]:
batch_size=128
ner_pipeline = data.Serial(
    id_translator, #No need to tokenize, since we did it using the id_translator.
    data.Shuffle(),
    data.BucketByLength(batch_sizes=[batch_size],boundaries=[1024]),
    data.AddLossWeights(id_to_mask=word_idx['<PAD>'])
)

# Let us split the dataset into train and eval
split_size = 0.2
split_limit = len(sentence)-int(len(sentence)*split_size)
sentences_train = sentence[:split_limit]
sentences_eval = sentence[split_limit:]

# We now create the pipeline using the train and eval datasets
ner_train_pipeline = ner_pipeline(sentences_train)
ner_eval_pipeline = ner_pipeline(sentences_eval)

### Creating and training the Model

We create the model by using the predefined Reformer and adding a Dense layer with the number of tags as the number of units and a logsoftmax layer.

You could go further and add some dropouts to avoid overfitting, but I'll leave that as is.
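
If the role of the last two layers is not obvious, here is a NumPy sketch of what a Dense layer followed by a log-softmax computes for each token. This is not Trax code: the random `features` array is just a stand-in for whatever the Reformer outputs, and `n_tags = 17` is an arbitrary illustrative number.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_tags = 4, 50, 17  # n_tags is a made-up tag count

features = rng.normal(size=(seq_len, d_model))      # stand-in for the model output
weights = rng.normal(size=(d_model, n_tags)) * 0.1  # Dense layer parameters
bias = np.zeros(n_tags)

# Dense layer: one raw score per tag for every token.
logits = features @ weights + bias

# Log-softmax: turn the scores into log-probabilities (max-shifted for stability).
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# The predicted tag for each token is the one with the highest log-probability.
predicted_tags = log_probs.argmax(axis=-1)
print(log_probs.shape, predicted_tags.shape)
```

Each row of `log_probs` exponentiates to a probability distribution over the tags, which is the form a cross-entropy loss expects.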


In [None]:
# This will build the NER model: a Reformer followed by a Dense layer and a LogSoftmax layer.
NERModel = tl.Serial(
    trax.models.reformer.Reformer(len(word_idx), d_model=50, ff_activation=tl.LogSoftmax),
    tl.Dense(len(words_tag)),
    tl.LogSoftmax()
)

In [None]:
from trax.supervised import training

ner_train_task = training.TrainTask(
    ner_train_pipeline,  
    loss_layer = tl.CrossEntropyLoss(),
    lr_schedule= trax.lr.warmup_and_rsqrt_decay(100, 0.001),
    optimizer = trax.optimizers.Adam(0.005), 
    n_steps_per_checkpoint=200
)

ner_eval_task = training.EvalTask(
    labeled_data = ner_eval_pipeline, 
    metrics = [tl.CrossEntropyLoss(), tl.Accuracy()], 
    n_eval_batches = 20 
)
# expanduser returns a new string, so its result has to be assigned.
ner_output_dir = os.path.expanduser("/content/ner")

ner_training_loop = training.Loop(
    NERModel, 
    ner_train_task, 
    eval_tasks = [ner_eval_task], 
    output_dir = ner_output_dir) 
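# Only load a checkpoint if one has already been saved.
if os.path.isfile(ner_output_dir + "/model.pkl.gz"):
    ner_training_loop.load_checkpoint(ner_output_dir + "/model.pkl.gz")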
# You'll have to run for a while to get really acceptable results. Don't be fooled by high accuracy, 
# there's a huge disproportion between untagged words and tagged words.
ner_training_loop.run(n_steps = 10000)

In [None]:
ner_training_loop.run(n_steps = 2000)

### Performing tagging on new data

To perform tagging on new data, you could tokenize the inputs manually (I did it by splitting the text and converting tokens to indexes).

Just be aware of out-of-vocabulary words. Since this is a simple example, we're not handling them.
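
For reference, one common way to handle them is to reserve an `<UNK>` id and fall back to it at lookup time. The sketch below uses a made-up `toy_word_idx` and a hypothetical `encode` helper. The notebook's `word_idx` has no `<UNK>` entry, so using this for real would mean adding one and training the model with it.

```python
# Hypothetical toy vocabulary; in the notebook this role is played by word_idx.
toy_word_idx = {"<PAD>": 0, "<UNK>": 1, "I": 2, "went": 3, "to": 4, "Ohio": 5}

def encode(sentence, vocab, unk_token="<UNK>"):
    # dict.get falls back to the <UNK> id instead of raising a KeyError.
    unk_id = vocab[unk_token]
    return [vocab.get(word, unk_id) for word in sentence.split()]

ids = encode("I went to Narnia", toy_word_idx)
print(ids)  # → [2, 3, 4, 1]
```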

After tokenizing, feed the tokens into `autoregressive_sample` and it will suggest the tags sequentially.

* Chances are that, if you did not train for long enough or did not use a weighting scheme, you'll get awful results (the model will just assign O to every word and no named entity will be detected).
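
To get a feeling for why accuracy is misleading here, and for what a weighting scheme could look like, here is a small sketch on made-up tags. The inverse-frequency formula is just one common choice and is my assumption. Neither the formula nor the numbers come from the notebook or the dataset.

```python
from collections import Counter

# Hypothetical tag sequence with a strong imbalance towards "O".
toy_tags = ["O"] * 17 + ["B-geo"] * 2 + ["B-per"]

counts = Counter(toy_tags)
total = len(toy_tags)

# Accuracy of a "model" that always predicts the most frequent tag.
majority_tag, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

# Inverse-frequency weights: total / (n_classes * count), so rare tags weigh more.
class_weights = {tag: total / (len(counts) * n) for tag, n in counts.items()}

print(majority_tag, baseline_accuracy)
print(class_weights)
```

In this toy case, predicting O everywhere already gives 85% accuracy without finding a single entity, which is why the rare tags need extra weight in the loss.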

In [None]:
test_sentence = "I went to Ohio to make a new certificate , its price was 300 USD . Then I moved myself to England ."
split_words = test_sentence.split()
tokens = numpy.array([word_idx[word] for word in split_words])
tokens_with_batch = tokens[None, :]
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    NERModel, tokens_with_batch, temperature=0.0)

In [None]:
for word, tag in zip(split_words, tokenized_translation[0][:len(split_words)]):
    print(word, words_tag[tag])