<a href="https://colab.research.google.com/github/saffarizadeh/INSY4054/blob/main/Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="http://saffarizadeh.com/Logo.png" width="300px"/>

# *INSY 4054: Emerging Technologies*

# **Trained Models and Transfer Learning**

Instructor: Dr. Kambiz Saffarizadeh

---

Source: https://www.kaggle.com/models/google/nnlm/frameworks/tensorFlow2/variations/en-dim50/versions/1

Read the complete tutorial: https://www.tensorflow.org/tutorials/keras/text_classification_with_hub

In [1]:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow_hub as hub
import tensorflow_datasets as tfds

We first use `tensorflow_datasets` to load the `imdb_reviews` dataset.
`tensorflow_datasets` returns the data as a TensorFlow Dataset (see https://www.tensorflow.org/api_docs/python/tf/data/Dataset) rather than as NumPy arrays. A Dataset reads examples lazily, in a form that TensorFlow models can consume directly, which works well for large datasets.

In [2]:
# Split the training set into 60% and 40%, so we'll end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
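
The split sizes quoted in the loading cell's comment can be checked with a little arithmetic. The sketch below is only an illustration: `slice_size` is a hypothetical helper, not part of `tensorflow_datasets`, and the totals of 25,000 reviews per split are taken from that comment.

```python
# Hypothetical helper (not a tensorflow_datasets function): number of
# examples in a percentage slice such as 'train[:60%]' or 'train[60%:]'.
def slice_size(total, start_pct, end_pct):
    start = total * start_pct // 100
    end = total * end_pct // 100
    return end - start

TRAIN_TOTAL = 25_000  # labelled training reviews, per the comment in the loading cell
TEST_TOTAL = 25_000   # labelled test reviews, per the same comment

n_train = slice_size(TRAIN_TOTAL, 0, 60)    # 'train[:60%]'
n_val = slice_size(TRAIN_TOTAL, 60, 100)    # 'train[60%:]'

print(n_train, n_val, TEST_TOTAL)
```

The two slices together cover the whole training split, so no review is used for both training and validation.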


We load the pretrained model and extend it with a Sequential model that classifies the input into two classes (positive vs. negative sentiment).
Note that we set the pretrained layer as `trainable`, which means its parameters are updated during training along with the parameters of the new layers. Since the pretrained parameters already work well, the final weights and biases in this part of the model are likely to stay close to their pretrained values.

In [3]:
model = tf.keras.models.Sequential([
    hub.KerasLayer("https://www.kaggle.com/models/google/nnlm/frameworks/tensorFlow2/variations/en-dim50/versions/1", input_shape=[], dtype=tf.string, trainable=True),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [4]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer (KerasLayer)    (None, 50)                48190600  
                                                                 
 dense (Dense)               (None, 16)                816       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 48191433 (183.84 MB)
Trainable params: 48191433 (183.84 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
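
The parameter counts in the summary can be reproduced by hand. The sketch below uses only the numbers printed above; `dense_params` is a hypothetical helper, and treating the hub layer as a single embedding table with 50 columns is an assumption, not something the summary states.

```python
# Numbers copied from the model summary.
EMBED_DIM = 50
EMBED_PARAMS = 48_190_600

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(EMBED_DIM, 16)  # 50 * 16 weights + 16 biases
output = dense_params(16, 1)          # 16 * 1 weights + 1 bias
total = EMBED_PARAMS + hidden + output

# Assumption: the hub layer is one embedding table with EMBED_DIM columns,
# so this would be the number of rows (vocabulary entries) in that table.
implied_rows = EMBED_PARAMS // EMBED_DIM

# float32 takes 4 bytes per parameter; the reported size matches
# if "MB" is read as 2**20 bytes.
size_mb = total * 4 / 2**20

# With trainable=False on the hub layer, only the two Dense layers would train.
trainable_if_frozen = hidden + output

print(hidden, output, total, implied_rows, round(size_mb, 2), trainable_if_frozen)
```

Almost all of the parameters sit in the pretrained layer, which is why the choice of `trainable=True` or `trainable=False` has such a large effect on how much of the model is updated.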


In [5]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),  # the final layer already applies sigmoid
              metrics=['accuracy'])
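
The `from_logits` flag tells the loss whether the model outputs raw scores (logits) or probabilities. A layer that ends in `sigmoid` outputs probabilities, which corresponds to `from_logits=False`. The NumPy sketch below is a simplified illustration of the two formulas, not the Keras implementation; the function names and example values are made up for this demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_probs(y_true, p, eps=1e-7):
    """Binary cross-entropy when the inputs are probabilities in (0, 1)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def bce_from_logits(y_true, z):
    """Binary cross-entropy when the inputs are raw scores (logits).

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return float(np.mean(np.maximum(z, 0) - z * y_true + np.log1p(np.exp(-np.abs(z)))))

y = np.array([1.0, 0.0, 1.0, 0.0])        # illustrative labels
logits = np.array([3.0, -2.0, 0.5, 1.0])  # illustrative raw scores
probs = sigmoid(logits)

correct = bce_from_logits(y, logits)   # logits given to the logits formula
same = bce_from_probs(y, probs)        # probabilities given to the probability formula
mismatch = bce_from_logits(y, probs)   # probabilities wrongly treated as logits

print(correct, same, mismatch)
```

The first two values agree, while the third differs because the sigmoid is effectively applied twice.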

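Since the dataset we are using is large, we feed it to the model in batches and can choose the batch size; here each batch holds 512 reviews. Before batching the training data, we shuffle it. The argument `10000` passed to `shuffle` is the buffer size, not a seed: TensorFlow keeps a buffer of 10,000 examples and draws each next example at random from that buffer. A larger buffer shuffles more thoroughly, and a buffer at least as large as the dataset gives a full shuffle. To get the same shuffling outcome every time we run this code on our machine, we can additionally pass a `seed` argument, for example `shuffle(10000, seed=42)`; the seed can be any integer we want.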
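
In `tf.data`, `shuffle` takes a buffer size as its first argument and an optional `seed`. The pure-Python sketch below imitates that idea on a tiny range of integers so that the roles of the two arguments are easy to see. It is a simplified imitation, not TensorFlow's implementation, and `buffered_shuffle` and `batched` are hypothetical helper names.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in shuffled order using a fixed-size buffer (buffer_size >= 1)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

def batched(items, batch_size):
    """Group items into consecutive lists of at most batch_size elements."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

run_a = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
run_b = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
identity = list(buffered_shuffle(range(20), buffer_size=1, seed=42))
batches = batched(run_a, 8)

print(run_a)
print(batches)
```

With the same seed the two runs produce the same order, and a buffer of size 1 leaves the order unchanged, which shows that the buffer size controls how far an element can move while the seed controls repeatability.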

In [6]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/10




Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [7]:
model.evaluate(test_data.batch(512), verbose=2)

49/49 - 2s - loss: 0.3508 - accuracy: 0.8575 - 2s/epoch - 31ms/step


[0.35084518790245056, 0.857479989528656]
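
The list returned by `evaluate` holds the loss followed by the metrics passed to `compile`, here accuracy. The short sketch below, which uses only numbers already shown in this notebook, checks where the `49/49` step count comes from and roughly how many test reviews the reported accuracy corresponds to.

```python
import math

BATCH_SIZE = 512
split_sizes = {"train": 15_000, "validation": 10_000, "test": 25_000}

# The last batch of each split may be smaller, so round up.
steps = {name: math.ceil(n / BATCH_SIZE) for name, n in split_sizes.items()}
last_test_batch = split_sizes["test"] - (steps["test"] - 1) * BATCH_SIZE

# Values copied from the evaluate output above.
loss, accuracy = [0.35084518790245056, 0.857479989528656]
approx_correct = round(accuracy * split_sizes["test"])

print(steps, last_test_batch, approx_correct)
```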

In [8]:
new_reviews = ["This was a great movie and I enjoyed my time!", "This was a very bad movie", "This movie sucked ass!"]

In [9]:
new_reviews_array = np.array(new_reviews)

In [10]:
model(new_reviews_array)

<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[0.99650156],
       [0.16472302],
       [0.20577902]], dtype=float32)>
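
Each output is the model's estimated probability that the review is positive. A minimal sketch for turning these probabilities into labels follows; the 0.5 threshold is a common default chosen here as an assumption, and the mapping of 1 to positive and 0 to negative follows the labelling used by the `imdb_reviews` dataset.

```python
import numpy as np

# Probabilities copied from the output above.
probs = np.array([[0.99650156],
                  [0.16472302],
                  [0.20577902]], dtype=np.float32)

THRESHOLD = 0.5  # assumption: a common default, not tuned for this model

labels = (probs.ravel() >= THRESHOLD).astype(int)
names = np.where(labels == 1, "positive", "negative")

for p, name in zip(probs.ravel(), names):
    print(f"{p:.3f} -> {name}")
```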