<a href="https://colab.research.google.com/github/https-deeplearning-ai/tensorflow-1-public/blob/master/C3/W2/ungraded_labs/C3_W2_Lab_2_sarcasm_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ungraded Lab: Training a binary classifier with the Sarcasm Dataset

In this lab, you will revisit the [News Headlines Dataset for Sarcasm Detection](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection/home) from last week and proceed to build and train a model on it. The steps will be very similar to the previous lab with IMDB Reviews, with just some minor modifications. You can tweak the hyperparameters and see how they affect the results. Let's begin!

In [1]:
# !pip install tensorflow_hub
# !pip install tensorflow_text

## Download the dataset

You will first load the CSV file into your workspace and put the sentences and labels into lists.

In [17]:
import pandas as pd
from sklearn import preprocessing
import csv
import tensorflow as tf
import pathlib

In [3]:
data_train = 'survey.csv'

In [4]:
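sentences_train = []
labels_train = []
separator = ' '

# Read the CSV: column 1 holds the label, columns 2 onward hold the text
with open(data_train, 'r', encoding='utf8', newline='') as csvfile:
  reader = csv.reader(csvfile, delimiter=',')
  next(reader)  # skip the header row
  for row in reader:
    sentences_train.append(separator.join(row[2:]))
    labels_train.append(str(row[1]))

# Encode the string labels as integers
le = preprocessing.LabelEncoder()
labels_train = le.fit_transform(labels_train)

# Number of distinct classes (encoded labels start at 0)
num_classes = max(labels_train) + 1

# One-hot encode the integer labels for the categorical cross-entropy loss
y = tf.keras.utils.to_categorical(labels_train)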
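To make the label preparation above more concrete, here is a minimal sketch in plain Python of what the two encoding steps do conceptually. It is not the scikit-learn or Keras implementation; the sample labels are taken from the prediction output shown later in this notebook and are used only for illustration.

```python
# Sample labels, used only for illustration
labels = ["Matematika", "Akuntansi", "Matematika", "Teknik Sipil"]

# Like LabelEncoder: sort the unique labels, then map each one to its index
classes = sorted(set(labels))
class_to_index = {name: i for i, name in enumerate(classes)}
encoded = [class_to_index[name] for name in labels]

# Like to_categorical: one row per sample, with a 1.0 at the class index
num_classes = max(encoded) + 1
one_hot = [[1.0 if i == idx else 0.0 for i in range(num_classes)]
           for idx in encoded]

print(classes)     # ['Akuntansi', 'Matematika', 'Teknik Sipil']
print(encoded)     # [1, 0, 1, 2]
print(one_hot[0])  # [0.0, 1.0, 0.0]
```

Each row of the one-hot matrix has the same length as the number of classes, which is why the final `Dense` layer of the model needs `num_classes` units.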



In [5]:
# Use every sentence for training (no validation split is made)
training_sentences = sentences_train[0:]

## Preprocess the Data

In [6]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Parameters for padding and OOV tokens
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=10000, oov_token=oov_tok)

# Generate the word index dictionary
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

# Generate and pad the training sequences
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=100, padding=padding_type, truncating=trunc_type)
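The tokenizing and padding steps above can be imitated in plain Python to see what the padded sequences look like. This is a simplified sketch, not the Keras implementation (for instance, it does not strip punctuation); the sentences and the short `max_len` are hypothetical values chosen to keep the output readable.

```python
from collections import Counter

# Hypothetical sentences, used only for illustration
sentences = ["i like math", "i like computer science a lot"]
max_len = 4  # hypothetical; the notebook uses maxlen=100

# Count words, then give the most frequent words the lowest indices.
# Index 0 is reserved for padding and index 1 for the OOV token.
counts = Counter(word for s in sentences for word in s.lower().split())
word_index = {"<OOV>": 1}
for rank, (word, _) in enumerate(counts.most_common(), start=2):
    word_index[word] = rank

def to_padded(sentence):
    """Convert a sentence to a fixed-length list of word indices."""
    seq = [word_index.get(w, 1) for w in sentence.lower().split()]
    seq = seq[:max_len]                      # truncating='post'
    return seq + [0] * (max_len - len(seq))  # padding='post'

print(to_padded("i like math"))           # [2, 3, 4, 0]
print(to_padded("i like physics a lot"))  # [2, 3, 1, 7]
```

In the second call, `physics` was never seen during fitting, so it maps to the OOV index `1`, and the fifth word is cut off because truncation happens at the end of the sequence.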




## Build and Compile the Model

Next, you will build the model. The architecture is similar to the previous lab. One common option after the Embedding is a [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D) layer instead of `Flatten`, which averages over the sequence dimension before connecting to the dense layers. For three arrays `[10, 2]`, `[1, 3]`, and `[1, 1]`, it computes `(10 + 1 + 1) / 3` and `(2 + 3 + 1) / 3` to arrive at the final output. The model in the cell below instead passes the embeddings through a `Bidirectional` LSTM layer with 32 units per direction, which also collapses the sequence dimension and outputs a 64-element vector.

Either choice reduces the dimensionality fed to the dense layers compared to using `Flatten()`, so the first dense layer has far fewer trainable parameters. See the output of `model.summary()` below and see how it compares if you swap out the `Bidirectional` layer for a simple `Flatten()`.
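The averaging over the sequence dimension described above can be reproduced by hand. The following is a minimal sketch in plain Python rather than the Keras layer itself; the helper names are hypothetical, and the sample values are the ones implied by the two averages quoted in the text.

```python
# One sample with 3 timesteps and 2 features per timestep: shape (1, 3, 2)
sample = [[[10, 2], [1, 3], [1, 1]]]

def global_average_pool_1d(batch):
    """Average each feature over the timesteps of every sequence."""
    pooled = []
    for sequence in batch:
        steps = len(sequence)
        # zip(*sequence) groups the values of each feature across timesteps
        pooled.append([sum(feature) / steps for feature in zip(*sequence)])
    return pooled

def flatten(batch):
    """Concatenate all timesteps of every sequence into one long vector."""
    return [[value for step in sequence for value in step] for sequence in batch]

print(global_average_pool_1d(sample))  # [[4.0, 2.0]]
print(flatten(sample))                 # [[10, 2, 1, 3, 1, 1]]
```

Pooling leaves 2 values per sample while flattening leaves 6, which is why the dense layer that follows needs fewer weights after pooling.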

In [7]:
# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 16, input_length=100),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

# Print the model summary
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 16)           160000    
                                                                 
 bidirectional (Bidirectiona  (None, 64)               12544     
 l)                                                              
                                                                 
 dense (Dense)               (None, 24)                1560      
                                                                 
 dense_1 (Dense)             (None, 10)                250       
                                                                 
Total params: 174,354
Trainable params: 174,354
Non-trainable params: 0
_________________________________________________________________
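The parameter counts in the summary above can be reproduced from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the variable names are chosen here for illustration. It also computes what the first dense layer would cost if the `Bidirectional` layer were replaced by `Flatten()`.

```python
vocab_size, embedding_dim, sequence_len = 10000, 16, 100
lstm_units, dense_units, num_classes = 32, 24, 10

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_one_direction = 4 * (embedding_dim + lstm_units + 1) * lstm_units
bidirectional = 2 * lstm_one_direction

# Dense layers: weights plus one bias per unit
dense = (2 * lstm_units) * dense_units + dense_units
output = dense_units * num_classes + num_classes

total = embedding + bidirectional + dense + output
print(embedding, bidirectional, dense, output, total)
# 160000 12544 1560 250 174354

# With Flatten(), the first dense layer would see 100 * 16 = 1600 inputs
dense_after_flatten = (sequence_len * embedding_dim) * dense_units + dense_units
print(dense_after_flatten)  # 38424
```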


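Because the labels are one-hot encoded over several classes, the model is compiled with the `categorical_crossentropy` loss, together with the `adam` optimizer and the `accuracy` metric.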

In [8]:
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
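As a sanity check on the loss used above, the sketch below computes categorical cross-entropy by hand for a single sample. It is a simplified illustration, not the Keras implementation, and the probability vectors are hypothetical. With 10 classes, an untrained model that predicts roughly uniform probabilities should start near `ln(10)`, which is close to the loss of about 2.30 reported for the first epoch in the training log further down.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

num_classes = 10
y_true = [0.0] * num_classes
y_true[3] = 1.0  # arbitrary true class, for illustration only

# An untrained model: every class gets the same probability
uniform = [1.0 / num_classes] * num_classes
loss_uniform = categorical_crossentropy(y_true, uniform)
print(round(loss_uniform, 4))  # 2.3026

# A hypothetical confident and correct prediction gives a much lower loss
confident = [0.01] * num_classes
confident[3] = 0.91
loss_confident = categorical_crossentropy(y_true, confident)
print(loss_confident < loss_uniform)  # True
```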

## Train the Model

Now you will feed in the prepared data to train the model. No validation data is passed to `model.fit()` here, so only the training loss and accuracy are reported for each epoch.

*Tip: You can set the `verbose` parameter of `model.fit()` to `2` to indicate that you want to print just the results per epoch. Setting it to `1` (default) displays a progress bar per epoch, while `0` silences all displays. It doesn't matter much in this Colab but when working in a production environment, you may want to set this to `2` as recommended in the [documentation](https://keras.io/api/models/model_training_apis/#fit-method).*

In [9]:
num_epochs = 150

# Train the model
history = model.fit(training_padded, y, epochs=num_epochs, verbose=2)

Epoch 1/150
1/1 - 6s - loss: 2.3027 - accuracy: 0.0435 - 6s/epoch - 6s/step
Epoch 2/150
1/1 - 0s - loss: 2.2959 - accuracy: 0.4783 - 49ms/epoch - 49ms/step
Epoch 3/150
1/1 - 0s - loss: 2.2904 - accuracy: 0.5217 - 65ms/epoch - 65ms/step
Epoch 4/150
1/1 - 0s - loss: 2.2849 - accuracy: 0.5217 - 67ms/epoch - 67ms/step
Epoch 5/150
1/1 - 0s - loss: 2.2790 - accuracy: 0.5217 - 61ms/epoch - 61ms/step
Epoch 6/150
1/1 - 0s - loss: 2.2729 - accuracy: 0.5217 - 64ms/epoch - 64ms/step
Epoch 7/150
1/1 - 0s - loss: 2.2663 - accuracy: 0.5217 - 47ms/epoch - 47ms/step
Epoch 8/150
1/1 - 0s - loss: 2.2591 - accuracy: 0.5217 - 49ms/epoch - 49ms/step
Epoch 9/150
1/1 - 0s - loss: 2.2511 - accuracy: 0.5217 - 45ms/epoch - 45ms/step
Epoch 10/150
1/1 - 0s - loss: 2.2421 - accuracy: 0.5217 - 55ms/epoch - 55ms/step
Epoch 11/150
1/1 - 0s - loss: 2.2319 - accuracy: 0.5217 - 59ms/epoch - 59ms/step
Epoch 12/150
1/1 - 0s - loss: 2.2203 - accuracy: 0.5217 - 50ms/epoch - 50ms/step
Epoch 13/150
1/1 - 0s - loss: 2.2070 - ac

In [10]:
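def predict(texts):
    """Return the predicted label name for each string in `texts`."""
    # Tokenize and pad exactly as was done for the training data
    test_sequences = tokenizer.texts_to_sequences(texts)
    data_padded = pad_sequences(test_sequences, maxlen=100, padding=padding_type, truncating=trunc_type)
    # Each row of `result` holds one softmax probability per class
    result = model.predict(data_padded)
    # Pick the most likely class index and map it back to its label name
    return le.inverse_transform(np.argmax(result, axis=1))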
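The last step of `predict()` turns each row of softmax probabilities back into a label name. Here is a minimal sketch of that decoding in plain Python, assuming a hypothetical three-class model; the class names and probabilities are made up for illustration.

```python
# Hypothetical class names, in the sorted order a label encoder would store them
classes = ["Akuntansi", "Matematika", "Teknik Sipil"]

# Hypothetical softmax outputs for two inputs (each row sums to 1)
predictions = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]

def argmax(row):
    """Return the index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)

# Map each row to the name of its most probable class
decoded = [classes[argmax(row)] for row in predictions]
print(decoded)  # ['Matematika', 'Akuntansi']
```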

In [11]:
for result in predict(sentences_train):
    print(result)

Akuntansi
Teknik Informatika
Teknik Informatika
Teknik Sipil
Teknik Informatika
Akuntansi
Matematika
Matematika
Matematika
Matematika
Matematika
Matematika
Matematika
Pendidikan Matematika
Matematika
Hukum Tata Negara
kedokteran
Perpustakaan dan Ilmu Informasi 
Jurusan Hukum Tata Negara
Matematika
Matematika
Matematika
Matematika


## Convert to TFLite

In [16]:
export_dir = 'saved_model/1'

tf.saved_model.save(model, export_dir)



INFO:tensorflow:Assets written to: saved_model/1\assets




In [22]:
converter = tf.lite.TFLiteConverter.from_saved_model(export_dir)

# Set the optimizations
converter.optimizations = [tf.lite.Optimize.DEFAULT]

converter.target_spec.supported_ops = [
  tf.lite.OpsSet.TFLITE_BUILTINS, # enable TensorFlow Lite ops.
  tf.lite.OpsSet.SELECT_TF_OPS # enable TensorFlow ops.
]
tflite_model = converter.convert()


In [23]:
tflite_model_file = pathlib.Path('./model.tflite')
tflite_model_file.write_bytes(tflite_model)

194256