In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

![The IMDb logo][1]

[1]: https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/IMDB_Logo_2016.svg/2560px-IMDB_Logo_2016.svg.png

# Natural Language Processing (NLP) Tutorial

**Notes**

- This notebook was built alongside TensorFlow's tutorial: [Text classification with TensorFlow Hub: Movie reviews](https://www.tensorflow.org/tutorials/keras/text_classification_with_hub). It blends what I have learned so far in Natural Language Processing (NLP) and includes comments throughout to help explain the training process.


**Useful Links/Notebooks**

[Tensorflow RNN Tutorial](https://www.tensorflow.org/text/tutorials/text_classification_rnn)

[Tensorflow Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer)

[getting-started-with-text-preprocessing](https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing)

### Loading Data

- 50,000 rows, equal distribution of negative and positive reviews.

In [None]:
#these are the same reviews as the imdb_reviews dataset from TensorFlow Datasets, but they are easier to explore using pandas
df = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
df

### IMDB Movie Review Sentiment Analysis using TFHub and TFDatasets

In the next cell we download the dataset from TensorFlow Datasets so that it can be used for training with tf.keras, a high-level API for building and training models in TensorFlow.

The purpose of the model is to determine whether a movie review is positive or negative.

In [None]:
#Splitting the dataset into training, validation, and test datasets
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:75%]', 'train[75%:]', 'test'),
    as_supervised=True)

print("Training Data: {}".format(len(train_data)))
print("Validation Data: {}".format(len(validation_data)))
print("Test Data: {}".format(len(test_data)))
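
The sizes printed by the cell above follow directly from the split specification. As a rough check, assuming the standard `imdb_reviews` layout of 25,000 training and 25,000 test examples, the arithmetic can be sketched in plain Python (the constant and variable names here are my own, not part of TFDS):

```python
# Assumed sizes of the official imdb_reviews splits.
TRAIN_SPLIT_SIZE = 25_000
TEST_SPLIT_SIZE = 25_000

# 'train[:75%]' keeps the first 75% of the train split for training,
# 'train[75%:]' keeps the remaining 25% for validation.
n_train = TRAIN_SPLIT_SIZE * 75 // 100
n_validation = TRAIN_SPLIT_SIZE - n_train

print("Training Data:", n_train)         # 18750
print("Validation Data:", n_validation)  # 6250
print("Test Data:", TEST_SPLIT_SIZE)     # 25000
```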

In the next cell we take one batch of examples and labels from the tf dataset to look at. There are two possible labels: 0 represents a negative review and 1 represents a positive review.

In [None]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(5)))
print(train_examples_batch, "\n\n", train_labels_batch)

In this example we are using a pre-trained embedding layer from TensorFlow Hub called [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2). This layer maps a string of any length to a vector of 50 floating-point values.

There are many other pretrained embedding layers that can be used in this case, but they all have pros/cons. 

- For example, we could use [nnlm-en-dim128](https://tfhub.dev/google/nnlm-en-dim128/2), which is trained with the same data as google/nnlm-en-dim50/2, yet it maps a string to 128 floating point values rather than 50. This may make the model more accurate, but will take longer to train and make predictions.

- We could also use [nnlm-en-dim128-with-normalization](https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2), which is built with additional features that normalize text, e.g. removing punctuation and other extra characters.

See more text_embedding_models here --> [TF-Hub text-embedding Models](https://tfhub.dev/s?module-type=text-embedding)
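
To make "a string of any length maps to 50 values" more concrete, here is a toy sketch of one way a fixed-size sentence vector can be built: look up a vector per token, sum them, and scale by the square root of the token count. As far as I understand it, the nnlm modules combine word embeddings in a similar way, but the vocabulary, the random vectors, and the `embed_sentence` helper below are all hypothetical and are not the real module.

```python
import numpy as np

EMBEDDING_DIM = 50
rng = np.random.default_rng(seed=0)

# Hypothetical toy vocabulary: each known word gets a random 50-d vector.
# The real module uses learned vectors and a far larger vocabulary.
toy_vocab = {word: rng.normal(size=EMBEDDING_DIM)
             for word in ["the", "movie", "was", "great", "awful"]}
unknown_vector = np.zeros(EMBEDDING_DIM)  # assumption: unknown words map to zeros


def embed_sentence(sentence: str) -> np.ndarray:
    """Split on whitespace, look up each token, and combine into one vector."""
    tokens = sentence.lower().split()
    if not tokens:
        return np.zeros(EMBEDDING_DIM)
    vectors = np.stack([toy_vocab.get(tok, unknown_vector) for tok in tokens])
    # Sum the token vectors and scale by sqrt(number of tokens).
    return vectors.sum(axis=0) / np.sqrt(len(tokens))


short_vec = embed_sentence("great movie")
long_vec = embed_sentence("the movie was great the movie was great the movie was awful")
print(short_vec.shape, long_vec.shape)  # (50,) (50,)
```

Whatever the length of the input, the output has the same shape, which is what lets the dense layers that follow have a fixed input size.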

In [None]:
#downloading and defining the embedding layer from tfhub
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], #input shape is a list
                           dtype=tf.string, trainable=True)

#prints the output when passing three examples through the layer.
hub_layer(train_examples_batch[:3])

Next we define the model structure.

The first layer is the embedding layer, followed by a dense (fully connected) layer with 16 hidden units, and then a final dense layer with 1 unit.

The beauty of Keras is that it requires very little code to create powerful neural networks.
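
As a quick sanity check on the layer sizes described above, the parameter counts of the two dense layers can be worked out by hand and compared with what `model.summary()` reports for them. This sketch only covers the dense layers; the embedding layer contributes far more parameters, which are not counted here. The `dense_param_count` helper is my own, not a Keras function.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


EMBEDDING_DIM = 50  # output size of nnlm-en-dim50

hidden_params = dense_param_count(EMBEDDING_DIM, 16)  # 50*16 + 16
output_params = dense_param_count(16, 1)              # 16*1 + 1

print(hidden_params)  # 816
print(output_params)  # 17
```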

In [None]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu')) #note relu maps all negative values to zero
model.add(tf.keras.layers.Dense(1, activation='sigmoid')) #sigmoid maps values between 0-1

#printing model structure
model.summary()

Finally we compile and train the model. 

We are using the Adam optimizer and the binary cross-entropy loss function (as we are predicting labels that are either 0 or 1), and evaluating the model's performance by its accuracy on the labels.
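
Binary cross-entropy compares the predicted probability with the 0/1 label. Since the last layer of this model already applies a sigmoid, the loss should be computed on probabilities; the `from_logits` flag of `tf.keras.losses.BinaryCrossentropy` tells Keras whether it still has to apply the sigmoid itself. Below is a minimal pure-Python sketch of the idea (my own helper functions, not Keras's actual implementation) showing what goes wrong when the sigmoid is applied twice.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true: int, p: float, eps: float = 1e-7) -> float:
    """Loss for one example, given the predicted probability p of the positive class."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


logit = 2.0            # raw score before the sigmoid
prob = sigmoid(logit)  # what a sigmoid output layer would emit

# Matching case: the loss receives the probability directly.
loss_ok = binary_crossentropy(1, prob)
# Mismatch: treating the probability as a logit applies the sigmoid a second time.
loss_double = binary_crossentropy(1, sigmoid(prob))

print(round(loss_ok, 4), round(loss_double, 4))
```

In the mismatched case the prediction is squashed towards 0.5, so the loss for a confident, correct prediction is larger than it should be.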

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=False), #the final layer already applies a sigmoid, so the model outputs probabilities, not logits
              metrics=['accuracy'])

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

I had to look into the tf.data.Dataset documentation to understand what happens when a tf dataset is batched, as I initially thought that only 512 examples were being evaluated rather than the full dataset.

---

We can see from the following example how tf datasets are arranged when the batch method is called. 

If the dataset cannot be split evenly, the final batch will be smaller than the rest. If for any reason we need every batch to have the same size, we can pass `drop_remainder=True` to the batch method to drop that final, smaller batch.

To get a better understanding of tf.data.Dataset, check out the docs here --> [tf/data/Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset)
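
The batching rule can also be sketched in plain Python to see how many batches a dataset produces and how big the last one is. The `batch_sizes` helper below is my own illustration, not part of tf.data, and the 25,000 figure assumes the standard size of the `imdb_reviews` test split.

```python
def batch_sizes(n_examples: int, batch_size: int, drop_remainder: bool = False) -> list:
    """Return the size of each batch produced from n_examples."""
    full, remainder = divmod(n_examples, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_remainder:
        sizes.append(remainder)  # the final, smaller batch
    return sizes


print(batch_sizes(8, 3))                       # [3, 3, 2]
print(batch_sizes(8, 3, drop_remainder=True))  # [3, 3]

# 25,000 test examples in batches of 64
test_batches = batch_sizes(25_000, 64)
print(len(test_batches), test_batches[-1])     # 391 40
```

So batching never limits how many examples are seen: every example lands in some batch unless `drop_remainder=True` discards the leftovers.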

In [None]:
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3) # to shuffle the elements before batching --- dataset = dataset.shuffle(3).batch(3)
list(dataset.as_numpy_iterator())

Evaluating model performance on the test dataset.

In [None]:
results = model.evaluate(test_data.batch(64), verbose=1)

for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))

We can see that the model performs fairly well on test_data, scoring ~85% accuracy. That's pretty good! Hopefully, with some improvements to the model structure and the embedding layer, we can build a model that scores >90%!

---

Finally, we have a trained model and can pass unseen data into the model to make predictions. As the final layer of the model uses a sigmoid activation function the output will be mapped between 0-1.

The output of the model is a prediction on the likelihood that the text is a positive review.
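
To turn such a score into a hard label, a common convention is to threshold it at 0.5. The sketch below uses a hypothetical `to_label` helper and made-up scores rather than real outputs from the trained model.

```python
def to_label(probability: float, threshold: float = 0.5) -> str:
    """Map a sigmoid output to a sentiment label."""
    return "positive" if probability >= threshold else "negative"


# Hypothetical scores, not actual outputs of the trained model.
hypothetical_scores = [0.97, 0.52, 0.08]
labels = [to_label(p) for p in hypothetical_scores]
print(labels)  # ['positive', 'positive', 'negative']
```

Scores close to 0.5 are worth reading as "unsure" rather than as a firm positive or negative.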

In [None]:
examples = [
    'this is such an amazing movie!',
    'The movie was great!',
    'The movie was decent.',
    'The movie was okish.',
    'The movie was so awful and terrible...'
]

#the model's final sigmoid layer already maps its outputs to the range 0-1
#tf.constant creates a tensor from a tensor-like object (here, a list of strings)
original_results = model(tf.constant(examples))

for score, text in zip(original_results, examples):
    print("Input: {} --- {:.2f}".format(text, score[0]))

Saving the model weights so we can use the model in the future.

In [None]:
model.save_weights("keras_NLP_basic_weights.h5")

### Closing Thoughts

I would like to experiment with TensorFlow's lower-level APIs rather than relying on the higher-level Keras API.

That being said, Keras is great for prototyping and getting started on a wide range of machine learning problems. I think it is beneficial to start with Keras to get a sense of the problem you are working with, and then work down from there. It would be interesting to try other embedding layers, create a custom embedding layer, and experiment with different model structures and training parameters.

I also recently created a blog_application using Flask, and would like to incorporate an NLP model into that. Feel free to check out the code for that here --> [flask_blog_application](https://github.com/brendanartley/Flask_Blog_Application)

--- 

Any thoughts/suggestions on this notebook would be greatly appreciated!