# Mask & Padding

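Mask and padding are important concepts in NLP and an indispensable part of any sequence pipeline. Being fluent in the routine mask and padding operations is essential when working with TensorFlow.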

## Concepts

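Mask, as the name suggests, means to cover or hide. When processing sequential (time-series) data, the texts in a batch have different actual lengths, yet all of them must be assembled into a single matrix. Shorter sequences are therefore padded to a common length, and a mask records which positions hold real tokens and which are padding. Consider a batch such as: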

```text
[   ["Hello", "world", "!"],   
    ["How", "are", "you", "doing", "today"],   
    ["The", "weather", "will", "be", "nice", "tomorrow"], 
]
```
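To fit these sentences into one matrix, the shorter ones have to be filled up to a common length. Below is a minimal pure-Python sketch of that idea. The `<pad>` token and the helper name `pad_tokens` are illustrative assumptions, not part of any library:

```python
# Hypothetical helper: pad every token list to the length of the longest one.
PAD = "<pad>"  # assumed padding token, chosen only for illustration


def pad_tokens(batch, pad=PAD):
    """Return a new batch in which every row has the same length."""
    max_len = max(len(tokens) for tokens in batch)
    return [tokens + [pad] * (max_len - len(tokens)) for tokens in batch]


sentences = [
    ["Hello", "world", "!"],
    ["How", "are", "you", "doing", "today"],
    ["The", "weather", "will", "be", "nice", "tomorrow"],
]

padded = pad_tokens(sentences)
for row in padded:
    print(row)
```

After padding, every row has six entries, so the batch can be stored as a rectangular 3 x 6 matrix.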


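In TensorFlow, padding and mask generation look like this:

```python
import tensorflow as tf

# 1. Define the data: three token-ID sequences of lengths 3, 5 and 6.
raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

# 2. Pad at the end ("post") to a fixed length of 4. Sequences longer than 4
#    are also truncated at the end.
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, padding="post", maxlen=4, truncating="post"
)

print(padded_inputs)

# 3. Generate the mask from the original lengths. Note that maxlen=10 here,
#    so the mask is 10 columns wide and does not line up with the 4-column
#    padded batch above.
raw_length = [len(seq) for seq in raw_inputs]
mask = tf.sequence_mask(raw_length, maxlen=10)
print(mask)
```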


Output:

```text
[[ 711  632   71    0]
 [  73    8 3215   55]
 [  83   91    1  645]]
tf.Tensor(
[[ True  True  True False False False False False False False]
 [ True  True  True  True  True False False False False False]
 [ True  True  True  True  True  True False False False False]], shape=(3, 10), dtype=bool)
```
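In the run above, the padded batch has 4 columns while the mask has 10, so the two cannot be combined element-wise. The following library-free sketch shows one way to build a mask that lines up with the padded batch. The helpers `pad_post` and `length_mask` are hypothetical names that only imitate post-padding, post-truncation, and a length-based mask. They are not TensorFlow APIs, and the sketch assumes the ID 0 is reserved for padding:

```python
# Library-free sketch; helper names are hypothetical, not TensorFlow APIs.

def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to at most maxlen items, then pad at the end with value."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


def length_mask(lengths, maxlen):
    """Row i is True at positions 0..lengths[i]-1 and False afterwards."""
    return [[pos < length for pos in range(maxlen)] for length in lengths]


raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]
maxlen = 4

padded = pad_post(raw_inputs, maxlen)

# Clip the lengths to maxlen so that truncated rows are marked as fully real.
lengths = [min(len(seq), maxlen) for seq in raw_inputs]
mask = length_mask(lengths, maxlen)

# Alternative: derive the mask from the padded batch itself.
# This is only valid under the assumption that 0 never occurs as a real token ID.
mask_from_zeros = [[tok != 0 for tok in row] for row in padded]

print(padded)
print(mask)
print(mask == mask_from_zeros)
```

Both masks have the same 3 x 4 shape as the padded batch, so each mask entry refers to exactly one entry of the batch.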


