
# Text Classification with TensorFlow Hub: Movie Reviews

## 1- Installing and importing libraries, then printing their versions

In [3]:
from __future__ import absolute_import, division, print_function, unicode_literals

!pip install -q tensorflow==2.0
!pip install -q tensorflow-hub
!pip install -q tensorflow-datasets

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.0.0
Eager mode:  True
Hub version:  0.7.0
GPU is NOT AVAILABLE


## 2- Downloading and exploring the IMDB (Internet Movie Database) data

* Downloading and splitting the data

In [0]:
# the data (50,000 reviews) comes split into 2 equal sets: train (25,000) and test (25,000)
# the train set is then split into train (60% of train, i.e. 15,000) and validation (40% of train, i.e. 10,000)
# final sizes: train=15,000, validation=10,000 and test=25,000
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
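
The split sizes quoted in the comments can be checked with plain arithmetic. This is a minimal sketch that assumes the 25,000/25,000 train/test totals stated above; the constant names are illustrative and it does not query the dataset itself.

```python
# Totals stated in the comments above (assumed, not read from the dataset).
TRAIN_TOTAL = 25_000
TEST_TOTAL = 25_000

train_size = TRAIN_TOTAL * 60 // 100        # corresponds to 'train[:60%]'
validation_size = TRAIN_TOTAL - train_size  # corresponds to 'train[60%:]'

print(train_size, validation_size, TEST_TOTAL)  # 15000 10000 25000
```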

* Exploring the data

In [10]:
# print the first 5 reviews and their labels
train_5_reviews,train_5_labels = next(iter(train_data.batch(5)))
print(train_5_reviews,'\n')
print(train_5_labels)

tf.Tensor(
[b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
 b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot de

## 3- Building the model

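* Get the **pre-trained text embedding model** from **TensorFlow Hub**.
  This model preprocesses the text itself (tokenization and vectorization): it takes raw strings and returns one fixed-length vector per review (20 dimensions here).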

In [0]:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding,input_shape=[],dtype=tf.string,trainable=True) # create the hub layer (the input layer)
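
To give a feel for what "tokenization and vectorization" can mean, here is a toy NumPy sketch. It is not the hub module's implementation: the vocabulary, the random table, the single out-of-vocabulary row, and the sum-over-sqrt(n) combiner are all assumptions made for illustration.

```python
import numpy as np

EMBEDDING_DIM = 20  # same width as the vectors this notebook gets from the hub layer

# Hypothetical toy vocabulary; the real module ships its own, far larger one.
toy_vocab = {"great": 0, "terrible": 1, "movie": 2}
OOV_INDEX = len(toy_vocab)  # one shared row for unknown words (assumption)

rng = np.random.default_rng(0)
toy_table = rng.normal(size=(len(toy_vocab) + 1, EMBEDDING_DIM))

def toy_embed(sentence):
    """Map one sentence to a single fixed-length vector."""
    tokens = sentence.lower().split()  # whitespace tokenization
    if not tokens:
        return np.zeros(EMBEDDING_DIM)
    rows = [toy_vocab.get(tok, OOV_INDEX) for tok in tokens]
    # Assumed combiner: sum of word vectors divided by sqrt(number of tokens).
    return toy_table[rows].sum(axis=0) / np.sqrt(len(tokens))

batch = np.stack([toy_embed(s) for s in ["Great movie", "Terrible terrible movie"]])
print(batch.shape)  # (2, 20)
```

Whatever the sentence length, each review ends up as one row of 20 numbers, which is what lets a plain `Dense` layer sit directly on top of the hub layer.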

In [12]:
# check how the hub layer preprocesses the text on the 5 reviews above
hub_layer(train_5_reviews)

<tf.Tensor: id=754, shape=(5, 20), dtype=float32, numpy=
array([[ 1.765786  , -3.882232  ,  3.9134233 , -1.5557289 , -3.3362343 ,
        -1.7357955 , -1.9954445 ,  1.2989551 ,  5.081598  , -1.1041286 ,
        -2.0503852 , -0.72675157, -0.65675956,  0.24436149, -3.7208383 ,
         2.0954835 ,  2.2969332 , -2.0689783 , -2.9489717 , -1.1315987 ],
       [ 1.8804485 , -2.5852382 ,  3.4066997 ,  1.0982676 , -4.056685  ,
        -4.891284  , -2.785554  ,  1.3874227 ,  3.8476458 , -0.9256538 ,
        -1.896706  ,  1.2113281 ,  0.11474707,  0.76209456, -4.8791065 ,
         2.906149  ,  4.7087674 , -2.3652055 , -3.5015898 , -1.6390051 ],
       [ 0.71152234, -0.6353217 ,  1.7385626 , -1.1168286 , -0.5451594 ,
        -1.1808156 ,  0.09504455,  1.4653089 ,  0.66059524,  0.79308075,
        -2.2268345 ,  0.07446612, -1.4075904 , -0.70645386, -1.907037  ,
         1.4419787 ,  1.9551861 , -0.42660055, -2.8022065 ,  0.43727064],
       [ 1.5165    , -0.71034056,  1.8556767 , -1.2033532 , -1.3

* Building the model

In [13]:
# configure layers
model = tf.keras.Sequential([
                             hub_layer, # input layer
                             tf.keras.layers.Dense(16,activation='relu'), # hidden layer
                             tf.keras.layers.Dense(1,activation='sigmoid') # output layer
])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 16)                336       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________
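
The parameter counts in the summary can be reproduced by hand. The sketch below uses the layer sizes from the model definition; `dense_params` is a helper introduced here for illustration, and reading 400,020 as a 20,001-row table is an assumption about the hub module's internals.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

embedding_params = 400_020            # value reported by model.summary() for the hub layer
hidden_params = dense_params(20, 16)  # 20 * 16 + 16
output_params = dense_params(16, 1)   # 16 * 1 + 1
total = embedding_params + hidden_params + output_params

print(hidden_params, output_params, total)  # 336 17 400373

# 400,020 factors as 20,001 * 20; interpreting that as 20,001 embedding rows
# of width 20 is an assumption, not something the summary states.
print(embedding_params == 20_001 * 20)  # True
```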


In [0]:
# compile the model
# different optimizers and loss functions can be used to tweak the model
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
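
For intuition about the `binary_crossentropy` loss, here is a small NumPy sketch of the standard formula. It is a hand-written illustration, not the Keras implementation; the function name and the `eps` clipping value are choices made here.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) when a probability is exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return losses.mean()

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident wrong answers are penalized far more heavily than confident right ones are rewarded, which is why the loss keeps carrying information even when accuracy stops moving.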

## 4- Training the model

In [15]:
# train the model on the training set and monitor it on the validation set
model.fit(train_data.shuffle(10000).batch(512),
          epochs=20,
          validation_data=validation_data.batch(512),
          verbose=1
          )

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7efdce3e6dd8>

## 5- Evaluating the model

In [21]:
results = model.evaluate(test_data.batch(512), verbose=2)

print('After evaluating the model: \n')
for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

49/49 - 4s - loss: 0.3236 - accuracy: 0.8611
After evaluating the model: 

loss: 0.324
accuracy: 0.861
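
The `49/49` in the evaluation output is the number of batches in one pass over the test set. A quick sketch of that arithmetic, assuming the split sizes given earlier and that `batch(512)` keeps the final partial batch (its default behavior); `batches_per_pass` is a helper introduced here.

```python
import math

BATCH_SIZE = 512

def batches_per_pass(n_examples, batch_size=BATCH_SIZE):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(n_examples / batch_size)

for name, n in [("train", 15_000), ("validation", 10_000), ("test", 25_000)]:
    print(name, batches_per_pass(n))
# train 30
# validation 20
# test 49
```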


## 6- Testing on another data set

* Let's get some reviews on the movie [STAR WARS: THE RISE OF SKYWALKER](https://rottentomatoes.com/m/star_wars_the_rise_of_skywalker/reviews?type=user) and use them to test the model.

In [0]:
# dictionary of 10 hand-picked reviews: 5 five-star reviews (good) and 5 reviews with one star or less (bad)
dic = {'Review':["Great characters, amazing special effects, compelling story line and plot twists.",
                 "Loved the movie! Has a few issues with pacing but all in all I had a great time watching it and I feel like it wrapped up the saga perfectly!",
                 "Loved it! Surprises and good twists.",
                 "Just let go and enjoy the movie. It was a great entry to the Star Wars Skywalker Saga. Just let yourself enjoy it as you would have as a kid.",
                 "If you can drop all the opinions about what you think it should be and view this movie from a child's perspective you'll have a good time!",
                 "Although very rushed it started out strong until the god awful ending. Apparently anyone can have the force & anyone can be a force ghost. Completely goes against the history of Star Wars & the history behind the force. Storyline focused on the wrong character. Kylo Ren/Ben Solo deserved more. Too much of a political agenda. The ending truly destroyed the movie.",
                 "never been so hurt by a film before. Praying that one of the events in the movie will be undone or revisited. It was a terrible film but with some good moments.",
                 "JJ Abrams only good thing was Lost ( until 5 season ) Looks like somebody said to him : Avengers was hit , do the same with star wars . total Non sense",
                 "Just wasnt for me. Story, pacing, ending, ouch...",
                 "Terrible story telling, it's worst then the previous flop... absolute waste of time There is nothing of value in this movie besides memes and jokes"],
       'Author':['Manuel','Scott','Juliette','Ringo','Pedro','Chealse G','Sara M','Anonymous','Tony R','Devin C'],
       'stars':[5,5,5,5,5,0.5,1,0.5,1,1],
       'Good/Bad':[1,1,1,1,1,0,0,0,0,0]
       }

In [0]:
import pandas as pd

In [0]:
df = pd.DataFrame(data=dic, index=['R01','R02','R03','R04','R05','R06','R07','R08','R09','R10'])

In [25]:
df

| | Review | Author | stars | Good/Bad |
|---|---|---|---|---|
| R01 | Great characters, amazing special effects, com... | Manuel | 5.0 | 1 |
| R02 | Loved the movie! Has a few issues with pacing ... | Scott | 5.0 | 1 |
| R03 | Loved it! Surprises and good twists. | Juliette | 5.0 | 1 |
| R04 | Just let go and enjoy the movie. It was a grea... | Ringo | 5.0 | 1 |
| R05 | If you can drop all the opinions about what yo... | Pedro | 5.0 | 1 |
| R06 | Although very rushed it started out strong unt... | Chealse G | 0.5 | 0 |
| R07 | never been so hurt by a film before. Praying t... | Sara M | 1.0 | 0 |
| R08 | JJ Abrams only good thing was Lost ( until 5 s... | Anonymous | 0.5 | 0 |
| R09 | Just wasnt for me. Story, pacing, ending, ouch... | Tony R | 1.0 | 0 |
| R10 | Terrible story telling, it's worst then the pr... | Devin C | 1.0 | 0 |


* Get the data and labels as NumPy arrays (not pandas Series) in order to test the model

In [0]:
# get the data (review sentences).
review_data = np.array(df['Review'])

In [27]:
review_data

array(['Great characters, amazing special effects, compelling story line and plot twists.',
       'Loved the movie! Has a few issues with pacing but all in all I had a great time watching it and I feel like it wrapped up the saga perfectly!',
       'Loved it! Surprises and good twists.',
       'Just let go and enjoy the movie. It was a great entry to the Star Wars Skywalker Saga. Just let yourself enjoy it as you would have as a kid.',
       "If you can drop all the opinions about what you think it should be and view this movie from a child's perspective you'll have a good time!",
       'Although very rushed it started out strong until the god awful ending. Apparently anyone can have the force & anyone can be a force ghost. Completely goes against the history of Star Wars & the history behind the force. Storyline focused on the wrong character. Kylo Ren/Ben Solo deserved more. Too much of a political agenda. The ending truly destroyed the movie.',
       'never been so hurt by a f

In [0]:
# get the labels (1 for good or 0 for bad).
review_label = np.array(df['Good/Bad'])

In [29]:
review_label

array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

* Predictions on the data

In [0]:
review_predictions = model.predict(review_data)

In [31]:
# the result is an array of 10 elements, each the predicted probability that the review is good (label 1).
review_predictions

array([[0.89827377],
       [0.90774465],
       [0.87103516],
       [0.925084  ],
       [0.6942902 ],
       [0.529283  ],
       [0.52531147],
       [0.22333205],
       [0.37560692],
       [0.00660934]], dtype=float32)

In [0]:
# let's convert these probabilities to hard labels with a function: 0 if the value is below 0.5, otherwise 1.

def adjust(pred):
  adjusted_predictions = []
  for elt in pred:
    if elt[0] < 0.5:
      adjusted_predictions.append(0)
    else:
      adjusted_predictions.append(1)
  return np.array(adjusted_predictions)
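
The same thresholding can be written without an explicit loop. This is an equivalent vectorized sketch, not a replacement the notebook depends on; `adjust_vectorized` and its `threshold` parameter are names introduced here, and the sample values are taken from the predictions above plus one edge case.

```python
import numpy as np

def adjust_vectorized(pred, threshold=0.5):
    """Return 1 where the first column is >= threshold, else 0."""
    pred = np.asarray(pred)
    return (pred[:, 0] >= threshold).astype(int)

sample = np.array([[0.89827377], [0.529283], [0.22333205], [0.5]])
print(adjust_vectorized(sample))  # [1 1 0 1]
```

As in `adjust`, a value of exactly 0.5 is mapped to 1.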

In [33]:
# the result is a NumPy array of 0s and 1s
review_predictions = adjust(review_predictions)
review_predictions

array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
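
Comparing these hard predictions with the labels in `review_label` shows which reviews were misclassified. The sketch below hard-codes both arrays and the row index from the outputs in this notebook so that it runs on its own; the variable names are illustrative.

```python
import numpy as np

# Values copied from the notebook outputs (labels, thresholded predictions, row index).
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
predictions = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
index = np.array(['R01', 'R02', 'R03', 'R04', 'R05',
                  'R06', 'R07', 'R08', 'R09', 'R10'])

wrong = index[labels != predictions]
print(wrong)  # ['R06' 'R07']
```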

* Using sklearn to evaluate the model on the new data.

In [0]:
from sklearn.metrics import confusion_matrix,classification_report

In [35]:
print(confusion_matrix(review_label,review_predictions),'\n')
print(classification_report(review_label,review_predictions))

[[3 2]
 [0 5]] 

              precision    recall  f1-score   support

           0       1.00      0.60      0.75         5
           1       0.71      1.00      0.83         5

    accuracy                           0.80        10
   macro avg       0.86      0.80      0.79        10
weighted avg       0.86      0.80      0.79        10
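
The figures in the report follow directly from the confusion matrix, where rows are true labels and columns are predicted labels. A minimal NumPy sketch that recomputes them by hand, using the matrix printed above:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[3, 2],
               [0, 5]])

correct = np.diag(cm)                # correctly classified count per class
precision = correct / cm.sum(axis=0)  # divide by column sums (predicted totals)
recall = correct / cm.sum(axis=1)     # divide by row sums (true totals)
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), accuracy)
```

Class 1 has perfect recall but lower precision because two bad reviews were predicted as good; class 0 is the mirror image.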



In [0]:
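# Note: the model gets 8/10 on these reviews, matching the accuracy of 0.80 reported above.
# Both errors are bad reviews predicted as good: R06 (0.529) and R07 (0.525), each scored just above the 0.5 threshold.
# Both reviews mix positive and negative wording ("started out strong", "some good moments"), which may explain the borderline scores.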