## Text Classification with RNN

In [23]:
import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.contrib import rnn

In [27]:
tfds.disable_progress_bar()
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

*tfds* includes a set of *TextEncoders* and *Tokenizers*.

[**TextEncoder**](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/TextEncoder) is an abstract base class in TensorFlow Datasets for converting between text and integer IDs. Because text has variable length and must be padded, ID 0 is always reserved for padding.

It exposes a *vocab_size* attribute, and that count includes ID 0.

The *encode()* method encodes text into a list of integers. It never returns ID 0; every ID is 1 or greater.
The *decode()* method decodes a list of integers into text, dropping any 0s in the input IDs.
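
The ID contract described above can be illustrated with a small pure-Python sketch. `ToyWordEncoder` is a hypothetical, whitespace-level stand-in written for this note. It is not part of *tfds* and only mimics the three rules: ID 0 is padding, *encode()* returns IDs of 1 or more, and *decode()* ignores 0.

```python
class ToyWordEncoder:
    """Hypothetical word-level encoder that mimics the TextEncoder ID contract."""

    def __init__(self, vocab):
        # ID 0 is reserved for padding, so real tokens start at ID 1.
        self._id_to_token = [None] + list(vocab)
        self._token_to_id = {tok: i for i, tok in enumerate(self._id_to_token) if i > 0}

    @property
    def vocab_size(self):
        # Counts the padding ID 0 as well as the real tokens.
        return len(self._id_to_token)

    def encode(self, text):
        # Simplification: unknown words raise KeyError instead of being handled.
        return [self._token_to_id[tok] for tok in text.split()]

    def decode(self, ids):
        # Padding IDs (0) are dropped before mapping back to tokens.
        return " ".join(self._id_to_token[i] for i in ids if i != 0)


toy = ToyWordEncoder(["good", "bad", "movie"])
ids = toy.encode("good movie")
padded = ids + [0, 0]          # simulate zero padding added by batching

print(toy.vocab_size)          # 4: three words plus the padding ID
print(ids)                     # [1, 3]
print(toy.decode(padded))      # good movie
```

Because decoding skips 0, padded and unpadded sequences decode to the same text.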


[**SubwordTextEncoder**](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/SubwordTextEncoder) is an invertible TextEncoder using word pieces with a byte-level fallback. This encoding is fully invertible as all out-of-vocab wordpieces are byte-encoded. 

It is built from a *vocab_list*, the list of subwords in the vocabulary, which is available afterwards through the *subwords* attribute. An underscore at the end of a subword marks the end of a word. Underscores in the interior of a subword are disallowed and must be written with the underscore escape sequence.
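
To make the byte-level fallback concrete, here is a minimal sketch of a greedy subword codec. It is an illustration under stated assumptions, not the *tfds* implementation: `build_toy_subword_codec` is a hypothetical helper, matching is plain longest-prefix, and literal underscores in the input are not escaped (so they would not survive a round trip). The ID layout assumed here is 0 for padding, then the subwords, then 256 byte IDs.

```python
def build_toy_subword_codec(subwords):
    """Return (encode, decode) for a hypothetical greedy subword codec."""
    # Assumed ID layout: 0 = padding, 1..N = subwords, N+1.. = the 256 byte values.
    byte_offset = 1 + len(subwords)
    subword_to_id = {s: i + 1 for i, s in enumerate(subwords)}
    max_len = max(len(s) for s in subwords)

    def encode(text):
        marked = text.replace(" ", "_")   # toy convention: "_" stands for a space
        ids, pos = [], 0
        while pos < len(marked):
            # Try the longest candidate piece first, then shorter ones.
            for size in range(min(max_len, len(marked) - pos), 0, -1):
                piece = marked[pos:pos + size]
                if piece in subword_to_id:
                    ids.append(subword_to_id[piece])
                    pos += size
                    break
            else:
                # No subword matched: fall back to the UTF-8 bytes of one character.
                ids.extend(byte_offset + b for b in marked[pos].encode("utf-8"))
                pos += 1
        return ids

    def decode(ids):
        out = bytearray()
        for i in ids:
            if i == 0:
                continue                  # drop padding
            if i >= byte_offset:
                out.append(i - byte_offset)
            else:
                out.extend(subwords[i - 1].encode("utf-8"))
        return out.decode("utf-8").replace("_", " ")

    return encode, decode


encode, decode = build_toy_subword_codec(["the_", "movie", "was_", "good"])
text = "the movie was good!"
toy_ids = encode(text)
print(toy_ids)
print(decode(toy_ids) == text)   # True: out-of-vocabulary characters survive as bytes
```

Anything the vocabulary cannot cover, such as the `!` here, is emitted as byte IDs, which is what keeps the encoding invertible.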

**The dataset *info* includes the SubwordTextEncoder.**

In [13]:
print("info features: ", info.features)
encoder = info.features["text"].encoder
print("\n Vocabulary size: ", encoder.vocab_size)
print("\n Sample Subwords", encoder.subwords[10:20])

info features:  FeaturesDict({'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8185>), 'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2)})

 Vocabulary size:  8185

 Sample Subwords ['in_', 'I_', 'that_', 'this_', 'it_', ' /><', ' />', 'was_', 'The_', 'as_']


As discussed above, the encoding is invertible.

In [17]:
sample_str = "IMDB Review Classification"
print("Original string: ", sample_str)

encoded_str = encoder.encode(sample_str)
print("Encoded string is: ", encoded_str)

decoded_str = encoder.decode(encoded_str)
print("Decoded string is: ", decoded_str)

for index in encoded_str:
    print("%s ----> %s"%(index, encoder.decode([index])))

Original string:  IMDB Review Classification
Encoded string is:  [5469, 7997, 2432, 3621, 739, 656, 2369, 1395, 3203, 757]
Decoded string is:  IMDB Review Classification
5469 ----> IM
7997 ----> D
2432 ----> B 
3621 ----> Rev
739 ----> ie
656 ----> w 
2369 ----> Cla
1395 ----> ssi
3203 ----> fic
757 ----> ation


### Preparing the dataset for training

We batch the encoded strings with **padded_batch**, which combines consecutive elements of the dataset into batches and zero-pads every sequence to the length of the longest sequence in its batch.
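
The padding behaviour can be sketched without TensorFlow. The hypothetical helper `padded_batches` below works on plain Python lists and only illustrates the idea; it is not how `tf.data` implements it.

```python
def padded_batches(sequences, batch_size, pad_id=0):
    """Yield batches of consecutive sequences, each padded to its longest member."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        # Right-pad every sequence in this batch with pad_id.
        yield [seq + [pad_id] * (longest - len(seq)) for seq in batch]


# Made-up encoded reviews of different lengths.
encoded_reviews = [[5, 2, 9], [7], [3, 4], [8, 1, 6, 2]]
batches = list(padded_batches(encoded_reviews, batch_size=2))

for batch in batches:
    print(batch)
# [[5, 2, 9], [7, 0, 0]]
# [[3, 4, 0, 0], [8, 1, 6, 2]]
```

Note that the padded length is decided per batch, so different batches can have different sequence lengths.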

In [18]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

In [22]:
# output_shapes returns the shape of each component of an element of this dataset.
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)

test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)

### Model

In [24]:
n_hidden = 5
basic_cell = rnn.BasicRNNCell(n_hidden)
# output_seqs, states = rnn.static_rnn(basic_cell, train_dataset, dtype=tf.float32)


# TODO: Read about layers: tf.keras.layers.RNN, tf.keras.layers.LSTM, tf.keras.layers.GRU

Instructions for updating:
This class is equivalent as tf.keras.layers.SimpleRNNCell, and will be replaced by that in Tensorflow 2.0.




TypeError: inputs must be a sequence

![alt text](images/SingleRNN.png)
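
The cell above only constructs the RNN cell. As a rough illustration of what a basic (simple) RNN cell computes, the sketch below applies the recurrence `h_t = tanh(x_t · W_xh + h_{t-1} · W_hh + b)` one time step at a time. It is plain Python, not TensorFlow; the function names and the weight values are made up for the example, and the hidden size of 3 is chosen only to keep the numbers readable.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, bias):
    """One step of a simple RNN: returns the new hidden state."""
    n_hidden = len(h_prev)
    h_new = []
    for j in range(n_hidden):
        total = bias[j]
        total += sum(x[i] * w_xh[i][j] for i in range(len(x)))
        total += sum(h_prev[k] * w_hh[k][j] for k in range(n_hidden))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, w_xh, w_hh, bias):
    """Unroll the cell over a list of per-time-step input vectors."""
    h = [0.0] * len(bias)          # initial hidden state is all zeros
    outputs = []
    for x in inputs:               # one input vector per time step
        h = rnn_step(x, h, w_xh, w_hh, bias)
        outputs.append(h)
    return outputs, h


# Hypothetical weights: 2 input features, 3 hidden units.
w_xh = [[0.1, -0.2, 0.3],
        [0.0, 0.5, -0.1]]
w_hh = [[0.2, 0.0, -0.3],
        [0.1, 0.4, 0.0],
        [-0.2, 0.1, 0.3]]
bias = [0.0, 0.1, -0.1]

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # a sequence of 3 time steps
outputs, final_state = run_rnn(inputs, w_xh, w_hh, bias)
print(len(outputs), len(final_state))           # 3 3
```

Here `inputs` is a Python list with one vector per time step, and the final state equals the last output, which is the usual behaviour for this kind of cell.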