<a href="https://colab.research.google.com/github/shuchimishra/Tensorflow_projects/blob/main/Tensorflow_Code/NLP/Sarcasm_classifier_.w_hyperas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ungraded Lab: Training a binary classifier with the Sarcasm Dataset

In this lab, you will revisit the [News Headlines Dataset for Sarcasm Detection](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection/home) from last week and proceed to build and train a model on it. The steps will be very similar to the previous lab with IMDB Reviews, with just some minor modifications. You can tweak the hyperparameters and see how they affect the results. Let's begin!

Reference article - https://medium.com/@fiona.s.feng/hyperparameter-tuning-for-text-classification-in-keras-tensorflow-with-hyperas-86668a7e732b

In [None]:
# !pip uninstall hyperas -y
# !pip uninstall hyperopt -y
!pip install git+https://github.com/maxpumperla/hyperas.git#egg=hyperas
!pip install hyperopt

Ensure you have the right version of hyperas. Its `optim.py` should contain this line:
**rstate=np.random.default_rng(rseed)**

In [None]:
!nl -b a  /usr/local/lib/python3.10/dist-packages/hyperas/optim.py | grep 142

In [None]:
from google.colab import drive
drive.mount('/gdrive'
            # ,force_remount=True
            )
%ls /gdrive

## Download the dataset

You will first download the JSON file, load it into your workspace and put the sentences and labels into lists.

In [None]:
# # Download the dataset
# !wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json

## Preprocessing the train and test sets

Now you can preprocess the text and labels so they can be consumed by the model. You use the `Tokenizer` class to create the vocabulary and the `pad_sequences` function to generate padded token sequences. You will also need to convert the labels to a NumPy array so they are a valid data type for `model.fit()`.
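
To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of what tokenizing and post-padding do conceptually. The helpers `build_word_index`, `to_sequence`, and `pad_post` are hypothetical stand-ins written for illustration. They are not the Keras implementations, which also handle details such as punctuation filtering and the `num_words` limit.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<OOV>"):
    """Map each word to an integer, most frequent first; index 1 is reserved for the OOV token."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def to_sequence(sentence, word_index, oov_token="<OOV>"):
    """Replace each word with its index, falling back to the OOV index for unseen words."""
    return [word_index.get(w, word_index[oov_token]) for w in sentence.lower().split()]

def pad_post(sequence, maxlen):
    """Truncate at the end, then pad with zeros at the end ('post' truncating and padding)."""
    trimmed = sequence[:maxlen]
    return trimmed + [0] * (maxlen - len(trimmed))

# Toy sentences (made up for this illustration)
train = ["the cat sat", "the dog barked loudly"]
word_index = build_word_index(train)

padded = [pad_post(to_sequence(s, word_index), maxlen=4) for s in train]
unseen = pad_post(to_sequence("the bird sat", word_index), maxlen=4)

print(word_index)
print(padded)   # [[2, 3, 4, 0], [2, 5, 6, 7]]
print(unseen)   # [2, 1, 4, 0]  ("bird" was never seen, so it maps to the OOV index 1)
```

Note that the vocabulary is built from the training sentences only, which is why a word that appears only in the test split ends up as `<OOV>`.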

In [None]:
import json
import os  # needed for the os.path.isfile() check inside data()

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def data():

  filename = './sarcasm.json'

  if os.path.isfile(filename):
    print("skipping the download")
  else:
    # Download the dataset
    !wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json


  ## Opening JSON file
  f = open(filename)

  # Load the json file
  datastore = json.load(f)

  # Initializing the lists
  urls, texts, labels = [],[],[]

  #Iterating through json data
  for row in datastore:
    urls.append(row['article_link'])
    texts.append(row['headline'])
    labels.append(row['is_sarcastic'])

  # Closing file
  f.close()

  # Maximum length of the padded sequences
  # Find the maximum length of texts across all texts
  max_length = max(len(s.split()) for s in texts) #initial hardcode value : max_length = 32

  # Number of examples to use for training
  training_size = int(0.8 * len(labels)) #80% training data split

  # Split the sentences
  train_sentences = texts[0:training_size]
  test_sentences = texts[training_size:]

  # Split the labels
  train_labels = labels[0:training_size]
  test_labels = labels[training_size:]

  # Max_words of the tokenizer
  max_words = 10000

  # Output dimensions of the Embedding layer
  embedding_dim = 16

  # Parameters for padding and OOV tokens
  oov_tok = '<OOV>'
  trunc_type = 'post'
  pad_type = 'post'

  # Initialize the Tokenizer class
  tokenizer = Tokenizer(num_words=max_words, oov_token=oov_tok)

  # Generate the word index dictionary
  tokenizer.fit_on_texts(train_sentences)
  train_word_index = tokenizer.word_index

  # Vocabulary size
  vocab_size = len(tokenizer.word_index)+ 1

  # Generate and pad the training sequences
  train_sequences = tokenizer.texts_to_sequences(train_sentences)
  train_padded_seqs = pad_sequences(train_sequences, maxlen=max_length, padding=pad_type, truncating=trunc_type)

  # Generate and pad the testing sequences
  test_sequences = tokenizer.texts_to_sequences(test_sentences)
  test_padded_seqs = pad_sequences(test_sequences, maxlen=max_length, padding=pad_type, truncating=trunc_type)

  # Convert the labels lists into numpy arrays
  final_train_labels = np.array(train_labels)
  final_test_labels = np.array(test_labels)

  return train_padded_seqs,final_train_labels,test_padded_seqs,final_test_labels

## Build and Compile the Model

Next, you will build the model. The architecture is similar to the previous lab, but you will use a [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D) layer instead of `Flatten` after the Embedding. This layer averages over the sequence dimension before connecting to the dense layers. For example, given a sequence of three 2-dimensional vectors `[10, 2]`, `[1, 3]`, and `[1, 1]`, it averages each feature across the three vectors (i.e. `(10 + 1 + 1) / 3` and `(2 + 3 + 1) / 3`) to arrive at the final output `[4, 2]`.
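
The averaging can be reproduced with plain NumPy. This is a sketch of the same arithmetic, not the Keras layer itself; `sample_array` is an illustrative input chosen to match the numbers quoted above.

```python
import numpy as np

# One batch item holding a sequence of three 2-dimensional vectors: shape (batch, steps, features)
sample_array = np.array([[[10, 2], [1, 3], [1, 1]]], dtype=np.float32)

# Averaging over axis 1 (the sequence axis) collapses the three steps into a single vector
pooled = sample_array.mean(axis=1)

print(sample_array.shape)  # (1, 3, 2)
print(pooled.shape)        # (1, 2)
print(pooled)              # [[4. 2.]]
```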

This pooling step reduces the dimensionality of the input to the dense layers compared to using `Flatten()`, and thus the number of trainable parameters also decreases. Call `model.summary()` to see how the counts compare if you swap out the pooling layer for a simple `Flatten()`.
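
The difference in parameter count can be worked out by hand. The sketch below assumes a sequence length of 32 (the hardcoded value mentioned in the data cell), an embedding dimension of 16, and a hypothetical Dense layer of 16 units; `dense_params` is an illustrative helper, not a Keras function.

```python
def dense_params(inputs, units):
    """A Dense layer has one weight per (input, unit) pair plus one bias per unit."""
    return (inputs + 1) * units

max_length = 32      # assumed sequence length
embedding_dim = 16   # embedding output dimension
units = 16           # hypothetical Dense layer width

# Flatten() concatenates every time step, so the Dense layer sees max_length * embedding_dim inputs
flatten_dense = dense_params(max_length * embedding_dim, units)

# GlobalAveragePooling1D() averages the time steps away, leaving only embedding_dim inputs
pooled_dense = dense_params(embedding_dim, units)

print(flatten_dense)  # 8208
print(pooled_dense)   # 272
```

Neither layer has weights of its own. The saving comes entirely from the smaller input to the first Dense layer.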

In [None]:
import tensorflow as tf
from tensorflow import keras
from hyperas.distributions import choice
from hyperopt import STATUS_OK  # returned in the results dictionary of create_model()
from keras.optimizers import Adam, SGD, RMSprop
from time import time
from pprint import pprint

def create_model(train_padded_seqs,final_train_labels,test_padded_seqs,final_test_labels):

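  # Initialize the hyperparameters
  start_time = time()
  num_epochs = 5
  # The lists below only document the search space; the hyperas templates further down repeat these values
  output_dim_list = [16, 32, 64] # Output dimension of the first (Embedding) layer
  num_units = [16, 32, 64] # Number of units in the third (Dense) layer
  lr = [10**-3, 10**-2, 10**-1] # Learning rate values for the optimizers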

  # Build the model
  model = keras.Sequential()

  #First Embedding layer
  model.add(keras.layers.Embedding(vocab_size, output_dim = {{choice([16, 32, 64])}}, input_length=max_length))

  #Second layer
  model.add(keras.layers.GlobalAveragePooling1D())

  # Third layer
  model.add(keras.layers.Dense({{choice([16, 32, 64])}}, activation='relu'))

  #Fourth layer
  model.add(keras.layers.Dense(1, activation='sigmoid'))

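  # Initialize the candidate optimizers, each with its own tunable learning rate
  adam = Adam(learning_rate={{choice([10**-3, 10**-2, 10**-1])}})
  rmsprop = RMSprop(learning_rate={{choice([10**-3, 10**-2, 10**-1])}})
  sgd = SGD(learning_rate={{choice([10**-3, 10**-2, 10**-1])}})

  # Select the optimizer object; passing only the name string would ignore the learning rates above
  optimizer_name = {{choice(['adam', 'sgd', 'rmsprop'])}}
  if optimizer_name == 'adam':
    optimizer = adam
  elif optimizer_name == 'rmsprop':
    optimizer = rmsprop
  else:
    optimizer = sgd

  # Compile the model
  model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])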

  # Optional to log output from Keras
  csv_logger = keras.callbacks.CSVLogger('./dl_model.log')

  result = model.fit(train_padded_seqs,final_train_labels,
              epochs=num_epochs,
              verbose=2,
              validation_data=(test_padded_seqs,final_test_labels),
              callbacks=[csv_logger])

  pprint(vars(result))

  # # added to collect optimisation results
  # if 'results' not in globals():
  #   global results
  #   results = []

  # val_acc = result.history['val_accuracy']
  # parameters = space
  # parameters["val_acc"] = val_acc
  # parameters["time"] = str(int(time() - start_time)) + "sec"
  # score, val_acc_final = model.evaluate((test_padded_seqs,final_test_labels), verbose=2)
  # parameters["val_acc_final"] = val_acc_final
  # results.append(parameters)
  # print(tabulate(results, headers="keys", tablefmt="fancy_grid", floatfmt=".8f"))

  #get the highest validation accuracy of the training epochs
  validation_acc = np.amax(result.history['val_accuracy'])
  print('Best validation acc of epoch:', validation_acc)

  return {'loss': -validation_acc, 'status': STATUS_OK, 'model': model}

In [None]:
import hyperas
from hyperas import optim
from hyperopt import Trials,STATUS_OK, tpe

import logging
import os

try:
    best_run, best_model, space = optim.minimize(model=create_model,
                                          data=data,
                                          # functions=['parse_json_file'],
                                          algo=tpe.suggest,
                                          max_evals=5,
                                          trials=Trials(),
                                          notebook_name=os.path.join('..','gdrive','My Drive','Colab Notebooks','C3_W2_Lab_2_sarcasm_classifier_v2'),
                                          eval_space=True,   # <-- this is the line that puts real values into 'best_run'
                                          return_space=True,  # <-- this allows you to save the space for later evaluations
                                          )
except Exception:
    # Log the full traceback so a failed search is easier to diagnose
    logging.exception("Hyperas optimization failed")

Refer to this link for details: https://www.linkedin.com/pulse/hyperas-prateek-khanna/


Hyperas is a simple wrapper for hyperparameter optimization using keras and hyperopt. It uses template notation to define hyper-parameter ranges to tune. You can wrap the parameters you want to optimize into double curly brackets {{...}} and choose a distribution over which to run the algorithm. Hyperas translates your script into hyperopt compliant code at runtime.

Hyperas needs a data function to load your data. It returns your X_train, Y_train, X_test and Y_test values. The model function is where you define your model. You can use all the available keras functions and layers to create the model.

Hyperas provides an `optim.minimize` function for minimizing the loss of a keras model for given data and implicit hyperparameters. It takes the following parameters as input:

**model:** A function defining a keras model with hyperas templates, which returns a valid hyperopt results dictionary, e.g. `return {'loss': -acc, 'status': STATUS_OK}`

**data:** A parameter-less function that defines and returns all data needed in the above model definition.

**algo:** A hyperopt algorithm, like `tpe.suggest` or `rand.suggest`. The Tree-structured Parzen Estimator (TPE) is a Bayesian algorithm that explores the search space intelligently while narrowing down to the estimated best parameters. `rand.suggest`, on the other hand, does a random search through the search space.

**max_evals:** Maximum number of optimization runs

**trials:** A hyperopt Trials object, used to store intermediate results for all optimization runs

**rseed:** Integer random seed for experiments

**notebook_name:** If running from an ipython notebook, provide filename (not path)

**verbose:** Print verbose output

**eval_space:** Evaluate the best run in the search space such that 'choice's contain actually meaningful values instead of mere indices

**return_space:** Return the hyperopt search space object (e.g. for further processing) as last return value

**keep_temp:** Keep temp_model.py file on the filesystem

**Return value:** A pair consisting of the results dictionary of the best run and the corresponding keras model. If "return_space" is True, it also returns the hyperopt search space.
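
To illustrate the contract described above, here is a small pure-Python sketch of a random search over discrete choices, which is conceptually what `rand.suggest` does. It does not use hyperas or hyperopt: `toy_objective` and `random_search` are hypothetical helpers, and the accuracy values are made up so the example runs instantly without training a model.

```python
import random

# Discrete search space, mirroring the choice lists used in this lab
space = {
    "output_dim": [16, 32, 64],
    "units": [16, 32, 64],
    "learning_rate": [1e-3, 1e-2, 1e-1],
}

def toy_objective(params):
    """Stand-in for training a model; returns a made-up validation accuracy as a negative loss."""
    acc = 0.8
    acc += 0.05 if params["learning_rate"] == 1e-3 else 0.0
    acc += 0.02 if params["units"] == 32 else 0.0
    # The optimizer minimizes 'loss', so accuracy is negated (same convention as create_model)
    return {"loss": -acc, "status": "ok"}

def random_search(space, objective, max_evals, seed=0):
    """Sample max_evals random configurations and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_result = None, None
    for _ in range(max_evals):
        params = {name: rng.choice(values) for name, values in space.items()}
        result = objective(params)
        if best_result is None or result["loss"] < best_result["loss"]:
            best_params, best_result = params, result
    return best_params, best_result

best_params, best_result = random_search(space, toy_objective, max_evals=5)
print("Best parameters:", best_params)
print("Best accuracy:", -best_result["loss"])
```

TPE differs in that it uses the results of earlier trials to decide which configurations to try next, rather than sampling each one independently.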

In [None]:
X_train, Y_train, X_test, Y_test = data()
print("Evaluation of best performing model:")
print(best_model.evaluate(X_test, Y_test))
print("Best performing model chosen hyper-parameters:")
print(best_run)

In [None]:
best_model.save('./best_model.h5')

## Visualize the Results

You can use the cell below to plot the training results. You may notice some overfitting because your validation accuracy is slowly dropping while the training accuracy is still going up. See if you can improve it by tweaking the hyperparameters. Some example values are shown in the lectures.
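
One simple way to spot this pattern numerically is to find the epoch where validation accuracy peaks and check whether training accuracy kept rising afterwards. The sketch below uses a made-up history dictionary shaped like Keras' `History.history`; `best_epoch` is a hypothetical helper.

```python
def best_epoch(history):
    """Return the zero-based index of the epoch with the highest validation accuracy."""
    val_acc = history["val_accuracy"]
    return max(range(len(val_acc)), key=lambda i: val_acc[i])

# Made-up values for illustration only
history = {
    "accuracy":     [0.80, 0.88, 0.92, 0.95, 0.97],
    "val_accuracy": [0.81, 0.85, 0.86, 0.85, 0.84],
}

epoch = best_epoch(history)
last = len(history["accuracy"]) - 1

# Overfitting signal: validation peaked early while training accuracy continued to improve
overfit = epoch < last and history["accuracy"][last] > history["accuracy"][epoch]

print(epoch)    # 2
print(overfit)  # True
```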

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
def plot_loss_acc(history):
  #-----------------------------------------------------------
  # Retrieve a list of list results on training and test data
  # sets for each training epoch
  #-----------------------------------------------------------
  acc      = history.history[     'accuracy' ]
  val_acc  = history.history[ 'val_accuracy' ]
  loss     = history.history[    'loss' ]
  val_loss = history.history['val_loss' ]
  epochs   = range(len(acc)) # Get number of epochs
  #------------------------------------------------
  # Plot training and validation accuracy per epoch
  #------------------------------------------------
  plt.plot  ( epochs,     acc, label='Training accuracy' )
  plt.plot  ( epochs, val_acc, label='Validation accuracy' )
  plt.title ('Training and validation accuracy')
  plt.grid()
  plt.legend()
  plt.xlabel("Epochs")
  plt.ylabel("Accuracy")
  plt.figure()
  #------------------------------------------------
  # Plot training and validation loss per epoch
  #------------------------------------------------
  plt.plot  ( epochs,     loss, label='Training loss' )
  plt.plot  ( epochs, val_loss, label='Validation loss' )
  plt.grid()
  plt.legend()
  plt.xlabel("Epochs")
  plt.ylabel("Loss")
  plt.title ('Training and validation loss'   )

# Plot training results
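# No standalone `history` variable exists in this notebook; Keras stores the History
# object from the last fit() call on the model, so the best model's copy is used here
plot_loss_acc(best_model.history)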

In [None]:
# model.predict(["Donald Trump is the best thing happened to US",])

## Visualize Word Embeddings

As before, you can visualize the final weights of the embeddings using the [Tensorflow Embedding Projector](https://projector.tensorflow.org/).

In [None]:
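# Get the index-word dictionary
# NOTE: `tokenizer` is created inside data(), so it must be exposed at notebook scope before this cell will run
reverse_word_index = tokenizer.index_word

# Get the embedding layer from the best model (i.e. first layer)
embedding_layer = best_model.layers[0]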

# Get the weights of the embedding layer
embedding_weights = embedding_layer.get_weights()[0]

# Print the shape. Expected is (vocab_size, embedding_dim)
print(embedding_weights.shape)


In [None]:
import io

# Open writeable files
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')

# Initialize the loop. Start counting at `1` because `0` is just for the padding
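# The number of rows in the embedding matrix equals vocab_size, which is local to data()
for word_num in range(1, embedding_weights.shape[0]):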

  # Get the word associated at the current index
  word_name = reverse_word_index[word_num]

  # Get the embedding weights associated with the current index
  word_embedding = embedding_weights[word_num]

  # Write the word name
  out_m.write(word_name + "\n")

  # Write the word embedding
  out_v.write('\t'.join([str(x) for x in word_embedding]) + "\n")

# Close the files
out_v.close()
out_m.close()

In [None]:
# Import files utilities in Colab
try:
  from google.colab import files
except ImportError:
  pass

# Download the files
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')

## Wrap Up

In this lab, you were able to build a binary classifier to detect sarcasm. You saw some overfitting in the initial attempt and hopefully, you were able to arrive at a better set of hyperparameters.

So far, you've been tokenizing datasets from scratch and treating the vocab size as a hyperparameter. Furthermore, you're tokenizing the texts by building a vocabulary of full words. In the next lab, you will make use of a pre-tokenized dataset that uses a vocabulary of *subwords*. For instance, instead of having a unique token for the word `Tensorflow`, it will have a token each for `Ten`, `sor`, and `flow`. You will see the motivation and implications of this design in the next exercise. See you there!