##### Copyright 2019 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Masking and Padding

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/beta/guide/keras/masking_and_padding">
    <img src="https://www.tensorflow.org/images/tf_logo_32px.png" />
    View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/r2/guide/keras/masking_and_padding.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/r2/guide/keras/masking_and_padding.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/r2/guide/keras/masking_and_padding.ipynb">
    <img src="https://www.tensorflow.org/images/download_logo_32px.png" />
    Download notebook</a>
  </td>
</table>

When working with sequence data, it is very common for individual samples to have different lengths along the sequence dimension. Consider the following example with text input.

```
[
  ["It", "is", "a", "nice", "weather", "today"],
  ["How", "are", "you", "doing", "today"],
  ["Hello", "world", "!"]
]
```

After the vocab lookup, the data might be converted to numerical form like:

```
[
  [83, 91, 1, 645, 1253, 927],
  [73, 8, 3215, 55, 927],
  [71, 1331, 4231]
]
```

The data is a 2D list whose items have lengths 6, 5, and 3. Because the input to the model must have a uniform shape, every item shorter than the longest one needs to be padded with a placeholder value.

Keras provides an API to pad such data to a common length: [tf.keras.preprocessing.sequence.pad_sequences](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences).
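
To make the idea concrete, the following is a plain-Python sketch of what padding to a common length involves. The helper name `pad_to_longest` and its parameters are hypothetical and chosen for illustration; this is not the Keras implementation.

```python
def pad_to_longest(sequences, padding="post", value=0):
    """Pad every sequence with `value` up to the length of the longest one.

    Hypothetical helper for illustration only.
    """
    if padding not in ("pre", "post"):
        raise ValueError("padding must be 'pre' or 'post'")
    max_len = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        fill = [value] * (max_len - len(seq))
        # 'post' appends the placeholder values, 'pre' prepends them.
        padded.append(list(seq) + fill if padding == "post" else fill + list(seq))
    return padded


token_ids = [
    [83, 91, 1, 645, 1253, 927],
    [73, 8, 3215, 55, 927],
    [71, 1331, 4231],
]

post_padded = pad_to_longest(token_ids, padding="post")
pre_padded = pad_to_longest(token_ids, padding="pre")
print(post_padded)
print(pre_padded)
```

Whether the placeholder goes before or after the real tokens is a modeling choice; the sketch only shows that both produce rows of equal length.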


## Setup

In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np

!pip install -q tensorflow==2.0.0-beta1
import tensorflow as tf

from tensorflow.keras import layers

## Pad sequence data

In [3]:
raw_inputs = [
  [83, 91, 1, 645, 1253, 927],
  [73, 8, 3215, 55, 927],
  [711, 632, 71]
]

# By default, the API pads with 0s; use the "value" parameter to change that.
# padding='post' appends the padding; the default, 'pre', prepends it.
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(raw_inputs, padding='post')

print(padded_inputs)

[[  83   91    1  645 1253  927]
 [  73    8 3215   55  927    0]
 [ 711  632   71    0    0    0]]


## Masking

Now that all samples have a uniform sequence length, the model needs to be told that part of the data is padding and should be ignored. This mechanism is called <b>Masking</b>.

There are two ways to introduce a mask in Keras:
- Add a [tf.keras.layers.Masking](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Masking) layer.
- Configure an [Embedding layer](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Embedding) with `mask_zero=True`.

Under the hood, Keras creates a mask tensor (2D, with shape `[batch, sequence_length]`) and attaches it to the tensor output by the masking or embedding layer.
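
Conceptually, with `mask_zero=True` the mask marks every position whose token ID is not 0. The rule can be sketched in plain Python; `zero_mask` is a hypothetical helper operating on nested lists, not a Keras call.

```python
def zero_mask(padded_ids, pad_value=0):
    """Return nested lists of booleans: True for real tokens, False for padding."""
    return [[token != pad_value for token in row] for row in padded_ids]


padded_ids = [
    [83, 91, 1, 645, 1253, 927],
    [73, 8, 3215, 55, 927, 0],
    [711, 632, 71, 0, 0, 0],
]

mask = zero_mask(padded_ids)
for row in mask:
    print(row)
```

One consequence of this rule is that ID 0 has to be reserved for padding: a real token mapped to 0 would be masked out as well.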

In [4]:
embedding = layers.Embedding(5000, 16, mask_zero=True)
masked_output = embedding(padded_inputs)

print(masked_output._keras_mask)


tf.Tensor(
[[ True  True  True  True  True  True]
 [ True  True  True  True  True False]
 [ True  True  True False False False]], shape=(3, 6), dtype=bool)


In [5]:
masking_layer = layers.Masking()
# Simulate the embedding lookup by expanding the 2D input to 3D, with an
# embedding dimension of 10.
unmasked_embedding = tf.cast(
    tf.tile(tf.expand_dims(padded_inputs, axis=-1), [1, 1, 10]),
    tf.float32)

masked_embedding = masking_layer(unmasked_embedding)
print(masked_embedding._keras_mask)

tf.Tensor(
[[ True  True  True  True  True  True]
 [ True  True  True  True  True False]
 [ True  True  True False False False]], shape=(3, 6), dtype=bool)


As the printed result shows, the mask is a 2D boolean tensor with shape `[batch, sequence_length]`. Each `False` entry means the corresponding timestep is padding rather than real input and should be ignored during processing.
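
As one illustration of what "ignored" can mean for a consumer of the mask, the sketch below averages per-timestep values over the real positions only. The `masked_mean` helper and the sample values are hypothetical, and this is not a description of how any particular Keras layer uses the mask.

```python
def masked_mean(values, mask):
    """Average each row of `values` over the positions where `mask` is True."""
    means = []
    for row, row_mask in zip(values, mask):
        kept = [v for v, keep in zip(row, row_mask) if keep]
        # A fully masked row has nothing to average; report 0.0 by convention.
        means.append(sum(kept) / len(kept) if kept else 0.0)
    return means


values = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 4.0, 6.0, 8.0, 10.0, 0.0],
    [3.0, 6.0, 9.0, 0.0, 0.0, 0.0],
]
mask = [
    [True, True, True, True, True, True],
    [True, True, True, True, True, False],
    [True, True, True, False, False, False],
]

row_means = masked_mean(values, mask)
print(row_means)
```

Without the mask, the padded zeros would be counted as data and would pull the averages of the second and third rows down.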

## Handle masks in custom layers

The mask is propagated through the network. For any layer that uses the mask (for example, RNN layers), Keras fetches the `._keras_mask` tensor and passes it to `call()` as a keyword argument.

A layer that produces a tensor whose rank differs from that of its input (for example, an RNN layer may return a 2D tensor for a 3D input) needs to override the `compute_mask()` method to produce a new mask from the input. The same applies to layers with multiple inputs or outputs: `tf.keras.layers.Concatenate`, for example, needs to recompute the mask by applying `tf.concat` to the two input masks.
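
For the concatenation case, the mask bookkeeping can be sketched as follows, assuming two inputs joined along the sequence axis: the output mask is the two input masks joined along that same axis. `concat_masks` is a hypothetical helper operating on nested lists, not the `Concatenate` implementation.

```python
def concat_masks(mask_a, mask_b):
    """Join two [batch, sequence_length] masks along the sequence axis."""
    if len(mask_a) != len(mask_b):
        raise ValueError("both masks must have the same batch size")
    return [row_a + row_b for row_a, row_b in zip(mask_a, mask_b)]


mask_a = [[True, True, False], [True, False, False]]
mask_b = [[True, True], [True, False]]

merged = concat_masks(mask_a, mask_b)
print(merged)
```

Each row of the merged mask has as many entries as the concatenated sequence has timesteps, so every output position still has a matching mask value.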

Most layers do not need to handle the mask; the default behavior is simply to pass it through.

In [6]:
class Split(tf.keras.layers.Layer):
  """Splits the input tensor into 2 tensors along the `axis` parameter."""

  def __init__(self, axis=1, **kwargs):
    # Initialize the base layer before setting attributes on it.
    super(Split, self).__init__(**kwargs)
    self.axis = axis

  def call(self, inputs):
    # Expect the input to be 3D and the mask to be 2D. Split the input tensor
    # into 2 along `axis`.
    return tf.split(inputs, 2, axis=self.axis)

  def compute_mask(self, inputs, mask=None):
    # Also split the mask into 2 if one is present.
    if mask is None:
      return None
    return tf.split(mask, 2, axis=self.axis)

first_half, second_half = Split()(masked_embedding)
print(first_half._keras_mask)
print(second_half._keras_mask)


tf.Tensor(
[[ True  True  True]
 [ True  True  True]
 [ True  True  True]], shape=(3, 3), dtype=bool)
tf.Tensor(
[[ True  True  True]
 [ True  True False]
 [False False False]], shape=(3, 3), dtype=bool)
