<a href="https://colab.research.google.com/github/rochdiebold/hello-word/blob/master/TextVectorization_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!pip install tf-nightly

In [0]:
import tensorflow as tf
import numpy as np
import tensorflow_datasets as tfds

tf.__version__

'2.1.0-dev20191111'

In [0]:
# Load the IMDB reviews dataset using tfds. This is the raw data.
imdb_reviews = tfds.load('imdb_reviews', in_memory=True)

# The IMDB dataset contains a train split and a test split; we create a separate
# handle for each.
train_raw = imdb_reviews['train']
test_raw = imdb_reviews['test']

# Once we have our handles, we convert each example into the form that Keras
# `fit` expects: a tuple of (text_data, label).
def format_dataset(input_data):
  return (input_data['text'], input_data['label'])

train_dataset = train_raw.map(format_dataset)
test_dataset = test_raw.map(format_dataset)

# We also create a dataset with only the textual data in it. This will be used
# to build our vocabulary later on.
text_dataset = train_raw.map(lambda data: data['text'])

In [0]:
# It's important to take a look at your raw data to ensure your normalization
# and tokenization will work as expected. We can do that by taking a few
# examples from the training set and looking at them.
# This is one of the places where eager execution shines:
# we can just evaluate these tensors using .numpy()
# instead of needing to evaluate them in a Session/Graph context.
for item in train_raw.take(4):
  print(item['label'].numpy())
  print(item['text'].numpy())

1
b"As a lifelong fan of Dickens, I have invariably been disappointed by adaptations of his novels.<br /><br />Although his works presented an extremely accurate re-telling of human life at every level in Victorian Britain, throughout them all was a pervasive thread of humour that could be both playful or sarcastic as the narrative dictated. In a way, he was a literary caricaturist and cartoonist. He could be serious and hilarious in the same sentence. He pricked pride, lampooned arrogance, celebrated modesty, and empathised with loneliness and poverty. It may be a clich\xc3\xa9, but he was a people's writer.<br /><br />And it is the comedy that is so often missing from his interpretations. At the time of writing, Oliver Twist is being dramatised in serial form on BBC television. All of the misery and cruelty is their, but non of the humour, irony, and savage lampoonery. The result is just a dark, dismal experience: the story penned by a journalist rather than a novelist. It's not real
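The `b"..."` prefix and escapes such as `\xc3\xa9` in the output above indicate that `.numpy()` returns raw UTF-8 bytes rather than a Python `str`. A minimal sketch of decoding such a value for display, using a short fragment copied from the review above (the variable names `raw` and `decoded` are illustrative, not part of the notebook):

```python
# A fragment of the review text as returned by item['text'].numpy().
raw = b"It may be a clich\xc3\xa9, but he was a people's writer."

# Decode the UTF-8 bytes into a Python string for readable printing.
decoded = raw.decode("utf-8")
print(decoded)
```

Decoding is only needed for inspection; the TensorFlow string ops used later operate on the byte strings directly.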

In [0]:
import string
import re

# Having looked at our data above, we see that the raw text contains HTML break
# tags of the form '<br />'. The default standardizer lowercases text and strips
# punctuation but does not strip HTML, so these tags would not be removed
# cleanly. Because of this, we create a custom standardization function.
def custom_standardization(input_data):
  # Lowercase first so the vocabulary is case-insensitive.
  lowercase = tf.strings.lower(input_data)
  # Replace each HTML break tag with a space so adjacent words stay separated.
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  # Remove all ASCII punctuation characters.
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')
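To check what this standardization does to a string without running TensorFlow, the same three steps can be mirrored in plain Python. This is only a sketch: `standardize_py` and the sample text are hypothetical, and it assumes the pattern behaves the same under Python's `re` as under the regex engine behind `tf.strings.regex_replace`.

```python
import re
import string

# Same character class as in custom_standardization: all ASCII punctuation.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))

def standardize_py(text):
    """Plain-Python mirror of the lowercase / strip-<br /> / strip-punctuation steps."""
    lowered = text.lower()
    no_breaks = lowered.replace('<br />', ' ')
    return PUNCTUATION.sub('', no_breaks)

sample = "Great film!<br /><br />It's a CLASSIC."
cleaned = standardize_py(sample)
print(repr(cleaned))
print(cleaned.split())
```

Note that the two break tags leave two consecutive spaces behind; splitting on whitespace absorbs them, so no empty tokens result.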

In [0]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Now that we have our custom standardization, we can instantiate our text
# vectorization layer. We are using this layer to normalize, split, and map
# strings to integers, so we set our 'output_mode' to 'int'.
# Note that we're using the default split function,
# and the custom standardization defined above.
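# We also set an explicit maximum sequence length, since the CNNs later in our
# model won't support ragged sequences.
# 'max_features' caps the vocabulary size. It must be defined before the layer
# is constructed; the model cell below uses the same value.
max_features = 20000

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=400)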

# Now that the vectorization layer has been created, call `adapt` on the
# text-only dataset to build the vocabulary. Batching is optional, but for very
# large datasets it avoids holding extra copies of the data in memory.
vectorize_layer.adapt(text_dataset.batch(64))
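Conceptually, `adapt` counts tokens to build a vocabulary, and the layer then maps each token to an integer and pads or truncates to a fixed length. The plain-Python sketch below illustrates that idea only; it is not the layer's implementation. The helper names, the `num_words` parameter, and the index convention (0 for padding, 1 for out-of-vocabulary tokens) are assumptions made for illustration, and how reserved indices count against `max_tokens` may differ between TensorFlow versions.

```python
from collections import Counter

def build_vocab(texts, num_words):
    """Map the num_words most frequent tokens to indices starting at 2.

    Indices 0 (padding) and 1 (out-of-vocabulary) are reserved by assumption.
    """
    counts = Counter(token for text in texts for token in text.split())
    most_common = [token for token, _ in counts.most_common(num_words)]
    return {token: index + 2 for index, token in enumerate(most_common)}

def vectorize(text, vocab, sequence_length):
    """Convert text to a fixed-length list of vocabulary indices."""
    ids = [vocab.get(token, 1) for token in text.split()]  # 1 = OOV
    ids = ids[:sequence_length]                            # truncate
    return ids + [0] * (sequence_length - len(ids))        # pad with 0

corpus = ["the film was great", "the plot was thin", "the film dragged"]
vocab = build_vocab(corpus, num_words=3)
vector = vectorize("the film was awful", vocab, sequence_length=6)
print(vocab)
print(vector)
```

The unseen word "awful" maps to the OOV index, and the trailing zeros are the padding that gives every example the same length.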

# Next, let's build a model.

In [0]:
from tensorflow.keras import layers

# Model constants.
max_features = 20000
embedding_dims = 50

# A text input.
text_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='text')

# The first layer in our model is the vectorization layer. After this layer,
# we have a tensor of shape (batch_size, 400) containing vocab indices, where
# 400 is the output_sequence_length set on the layer.
x = vectorize_layer(text_input)

# Next, we add a layer to map those vocab indices into a space of dimensionality
# 'embedding_dims'. Note that we're using max_features+1 here, since there's an
# OOV token that gets added to the vocabulary in vectorize_layer.
x = layers.Embedding(max_features + 1, embedding_dims)(x)
x = layers.Dropout(0.2)(x)

# Conv1D + global max pooling
x = layers.Conv1D(128, 7, padding='valid', activation='relu', strides=1)(x)
x = layers.GlobalMaxPooling1D()(x)

# We add a vanilla hidden layer:
x = layers.Dense(128, activation='relu')(x)
x = layers.Dropout(0.2)(x)

# We project onto a single unit output layer, and squash it with a sigmoid:
predictions = layers.Dense(1, activation='sigmoid', name='predictions')(x)

model = tf.keras.Model(text_input, predictions)

# Compile the model with binary crossentropy loss and an adam optimizer.
model.compile(
    loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
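As a sanity check on the architecture above, the sequence length after the `'valid'` convolution and the number of trainable parameters can be worked out by hand. The sketch below does that arithmetic in plain Python from the constants used in this notebook; the helper `conv1d_valid_length` is hypothetical, and the result should agree with `model.summary()` if the layers are configured as shown.

```python
def conv1d_valid_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with padding='valid'."""
    return (input_length - kernel_size) // stride + 1

sequence_length = 400   # output_sequence_length of the vectorization layer
max_features = 20000
embedding_dims = 50
filters, kernel_size, hidden_units = 128, 7, 128

# Sequence length after Conv1D; GlobalMaxPooling1D then reduces it to 1 per filter.
conv_length = conv1d_valid_length(sequence_length, kernel_size)

# Weights plus biases for each trainable layer.
embedding_params = (max_features + 1) * embedding_dims
conv_params = kernel_size * embedding_dims * filters + filters
dense_params = filters * hidden_units + hidden_units
output_params = hidden_units * 1 + 1
total_params = embedding_params + conv_params + dense_params + output_params

print(conv_length)
print(total_params)
```

Almost all of the parameters sit in the embedding table, which is why the vocabulary cap has the largest effect on model size.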

In [0]:
batch_size = 32
epochs = 3

# Fit the model on the training split, using the test split for validation.
model.fit(
    train_dataset.batch(batch_size),
    validation_data=test_dataset.batch(batch_size),
    epochs=epochs)

Train for 782 steps, validate for 782 steps
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f0f8c97bba8>
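The "782 steps" reported in the log above follows from the dataset size: the IMDB train and test splits each contain 25,000 reviews, and the final partial batch still counts as a step. A small sketch of that arithmetic (the variable names are illustrative):

```python
import math

num_examples = 25000  # size of each IMDB split (train and test)
batch_size = 32

# The last batch is smaller than batch_size but still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)
print(last_batch_size)
```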