# Exercise 1 - Text Classification using Pretrained Embeddings

Tough real-world problems in Natural Language Processing (NLP) include class imbalance and a shortage of labeled training data. Thanks to recent advances in deep transfer learning for NLP, we have made rapid strides not only in tackling these problems but also in leveraging these models for diverse downstream NLP tasks.

The intent of this notebook is to look at various SOTA models in deep transfer learning for NLP with hands-on examples:

- Pre-trained word embeddings for Deep Learning Models (FastText with CNNs)
- Universal Embeddings (Sentence Encoders, NNLMs)

We will take a benchmark classification dataset, train these models on it, and compare their performance. All examples use Python and TensorFlow 2.x.


In [None]:
!nvidia-smi  

# Load Necessary Dependencies

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub
import tqdm
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

%matplotlib inline

# fix random seed for reproducibility
seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)

In [None]:
print("TF Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("TF Hub version: ", hub.__version__)

## Load Dataset - Hate Speech

Social media is unfortunately rife with hate speech in the form of posts and comments. Building an automated hate speech detection system is a practical application of text classification in NLP.

In this notebook, we will leverage an open-source collection of hate speech posts and comments.

The dataset is available on [Kaggle](https://www.kaggle.com/usharengaraju/dynamically-generated-hate-speech-dataset) and was in turn curated from a wider [data source for hate speech](https://hatespeechdata.com/).

In [None]:
df = pd.read_csv('HateDataset.csv')
df.info()

In [None]:
df = df[['text', 'label']]
df.head()

# Preparing Train and Test Datasets

In [None]:
from sklearn.model_selection import train_test_split

train_reviews, test_reviews, train_labels, test_labels = train_test_split(df.text.values,
                                                                          df.label.values,
                                                                          test_size=0.2, random_state=42)

In [None]:
len(train_reviews), len(test_reviews)

# Basic Text Pre-processing

We do minimal text pre-processing here since we are using deep learning models and not count-based methods. Steps include the following:

- Removing HTML characters
- Converting accented characters
- Fixing contractions
- Removing special characters
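The four steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the notebook itself installs `BeautifulSoup` and the `contractions` package for this job, whereas the sketch below swaps in a regex for tag stripping and a tiny hypothetical `CONTRACTION_MAP`. The function names are made up for the example and are not the ones the exercise expects.

```python
import html
import re
import unicodedata

# Hypothetical, deliberately tiny contraction map; the `contractions`
# package installed by the notebook covers far more cases.
CONTRACTION_MAP = {"can't": "cannot", "won't": "will not",
                   "it's": "it is", "i'm": "i am"}
CONTRACTION_RE = re.compile("|".join(re.escape(k) for k in CONTRACTION_MAP))


def strip_html(text):
    # Drop anything that looks like a tag, then decode entities such as &amp;
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def to_ascii(text):
    # Decompose accented characters and discard the combining marks
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def expand_contractions(text):
    # Expects lowercased text, since the map keys are lowercase
    return CONTRACTION_RE.sub(lambda m: CONTRACTION_MAP[m.group(0)], text)


def normalize(text):
    text = strip_html(text)
    text = to_ascii(text).lower()
    text = expand_contractions(text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


sample = "<p>It's a café &amp; I can't wait!</p>"
print(normalize(sample))  # it is a cafe i cannot wait
```

Contractions are expanded before special characters are removed. Otherwise the apostrophes would already be gone and the map would never match.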

In [None]:
!sudo pip3 install contractions
!sudo pip3 install textsearch
!sudo pip3 install tqdm
!sudo pip3 install nltk
!sudo pip3 install beautifulsoup4

import nltk
nltk.download('punkt')

# **Question 1**: Build the text pre-processing pipeline (3 points)

__Hint:__ You can follow the same sequence of steps as the tutorial

In [None]:
import contractions
from bs4 import BeautifulSoup
import numpy as np
import re
import tqdm
import unicodedata


def strip_html_tags(text):
    <YOUR CODE HERE>

def remove_accented_chars(text):
    <YOUR CODE HERE>

def pre_process_corpus(docs):
    norm_docs = []
    <YOUR CODE HERE>
    return norm_docs

In [None]:
%%time

norm_train_reviews = <YOUR CODE HERE>
norm_test_reviews = <YOUR CODE HERE>

## Label Encode Classes

# **Question 2**: Label Encode Class Labels (2 points)

In [None]:
from sklearn.preprocessing import LabelEncoder
# LabelEncoder assigns integer ids to the class labels in sorted order

le = <YOUR CODE HERE>
num_classes = <YOUR CODE HERE>

In [None]:
y_train = <YOUR CODE HERE>
y_test = <YOUR CODE HERE>

# __Question 3:__ Build Model 0 - Simple Baseline ML Model - Logistic Regression (3 points)

## Feature Extraction with BOW Model

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = <YOUR CODE HERE>

cv_train_features = <YOUR CODE HERE>
cv_test_features = <YOUR CODE HERE>
print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)

## Train the ML Model

In [None]:
%%time

# Logistic Regression model on BOW features
from sklearn.linear_model import LogisticRegression

# instantiate model
lr = <YOUR CODE HERE>

# train model
<YOUR CODE HERE>

# predict on test data
lr_bow_predictions = <YOUR CODE HERE>

## Predict and Test Model Performance

In [None]:
<YOUR CODE HERE>

___Not that great a performance! Can we do better?___

# __Question 4:__ Build Model 1: FastText Embeddings + CNN (4 points)

![](https://i.imgur.com/6Pk3Nrv.png)

Convolutional Neural Networks (CNNs) have proven very effective for text classification as well as for computer vision tasks. The idea is to use embeddings as features for text data and apply convolution and pooling operations to them.
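To make the convolution and pooling idea concrete, here is a minimal NumPy sketch of what a single 1D convolution followed by global max pooling does to one embedded sentence. The sizes and random values are toy assumptions, and this is not the Keras implementation. It only shows the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(42)

seq_len, embed_dim = 6, 4      # toy sizes; the notebook uses 300-d FastText vectors
kernel_size, n_filters = 3, 2  # hypothetical filter configuration

embedded = rng.normal(size=(seq_len, embed_dim))              # one embedded sentence
filters = rng.normal(size=(n_filters, kernel_size, embed_dim))

# Slide every filter over each window of `kernel_size` consecutive tokens
n_windows = seq_len - kernel_size + 1
feature_map = np.empty((n_windows, n_filters))
for i in range(n_windows):
    window = embedded[i:i + kernel_size]                      # (kernel_size, embed_dim)
    responses = (filters * window).sum(axis=(1, 2))           # one value per filter
    feature_map[i] = np.maximum(0.0, responses)               # ReLU activation

# Global max pooling keeps the strongest response of each filter
pooled = feature_map.max(axis=0)
print(feature_map.shape, pooled.shape)
```

Each filter acts as a detector for a short token pattern, and pooling keeps the response without recording where in the sentence it occurred.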

We will leverage the ``tensorflow.keras`` utilities to tokenize the text before we use the FastText embeddings.

## Tokenizing text to create vocabulary

### Tokenize text corpus.
_Hint: Use ``tf.keras.preprocessing.text.Tokenizer``_

In [None]:
t = <YOUR CODE HERE>
# fit the tokenizer on the documents
<YOUR CODE HERE>

## Convert texts (sequences of words) to sequence of numeric ids

In [None]:
<YOUR CODE HERE>

In [None]:
print("Vocabulary size={}".format(len(t.word_index)))
print("Number of Documents={}".format(t.document_count))

## Visualizing sentence length distribution

In [None]:
<YOUR CODE HERE>

## Padding text sequences

We standardize the sentence lengths by defining a maximum length. Sentences longer than this are truncated while shorter ones are padded.

___Use a max sequence length of around 250 based on the above histogram___
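The padding and truncation behaviour described above can be sketched in plain NumPy. The helper name `pad_or_truncate` is hypothetical. Its defaults mirror those of `tf.keras.preprocessing.sequence.pad_sequences`, which pads and truncates at the front of a sequence unless told otherwise.

```python
import numpy as np


def pad_or_truncate(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Return an int32 array of shape (len(sequences), maxlen)."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        # Keep the last `maxlen` ids ("pre") or the first `maxlen` ids ("post")
        trunc = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        if padding == "pre":
            out[row, -len(trunc):] = trunc
        else:
            out[row, :len(trunc)] = trunc
    return out


toy_sequences = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
print(pad_or_truncate(toy_sequences, maxlen=4))
# [[0 5 8 2]
#  [9 4 3 6]]
```

With `padding="post"` and `truncating="post"`, the same input would instead give `[[5, 8, 2, 0], [7, 1, 9, 4]]`.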

In [None]:
MAX_SEQUENCE_LENGTH = <YOUR CODE HERE>

# pad dataset to a maximum review length in words
<YOUR CODE HERE>
X_train.shape, X_test.shape

## Building FastText based Embedding Matrix

Here we will build an embedding matrix based on pre-trained FastText Embeddings available __[here](https://fasttext.cc/docs/en/english-vectors.html)__.

We will be using the __wiki-news-300d-1M.vec.zip__ embedding file which has 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).

![](https://i.imgur.com/5de9N5R.png)
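A `.vec` file is plain text: a header line with the word count and vector dimension, followed by one word and its vector per line. The sketch below shows one possible way to turn such a file into an embedding matrix, using a tiny in-memory stand-in instead of the real download. The names `toy_vec_file`, `toy_word_index` and `build_embedding_matrix` are hypothetical. The sketch also assumes that token ids start at 1 so that row 0 can be reserved for padding, and it leaves out-of-vocabulary rows at zero, which is a design choice rather than the tutorial's method.

```python
import io

import numpy as np

# Toy stand-in for a .vec file: header "<n_words> <dim>", then one word per line
toy_vec_file = io.StringIO(
    "3 4\n"
    "hate 0.1 0.2 0.3 0.4\n"
    "speech 0.5 0.6 0.7 0.8\n"
    "news 0.9 1.0 1.1 1.2\n"
)
toy_word_index = {"hate": 1, "speech": 2, "zzzunknown": 3}  # hypothetical vocabulary


def build_embedding_matrix(word_index, vec_file, dim):
    # Assumption: ids start at 1, so row 0 is kept free for padding
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    next(vec_file)  # skip the header line
    for line in vec_file:
        word, *values = line.rstrip().split(" ")
        idx = word_index.get(word)
        if idx is not None:  # only keep vectors for words in our vocabulary
            matrix[idx] = np.asarray(values, dtype="float32")
    return matrix


emb = build_embedding_matrix(toy_word_index, toy_vec_file, dim=4)
print(emb.shape)  # (4, 4)
```

Words missing from the pretrained file keep an all-zero row here. Initializing those rows randomly is a common alternative.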

## Download Pre-trained FastText Embeddings

We have chosen a slightly less powerful model than the one used in the tutorial, since it should download faster, but feel free to play around with different pretrained embeddings from [here](https://fasttext.cc/docs/en/english-vectors.html).

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip

In [None]:
!unzip wiki-news-300d-1M.vec.zip

## Generate Pre-trained Embedding Matrix

In [None]:
VOCAB_SIZE = len(t.word_index)
EMBED_SIZE = 300

In [None]:
word2idx = t.word_index
FASTTEXT_INIT_EMBEDDINGS_FILE = './wiki-news-300d-1M.vec'


def load_pretrained_embeddings(word_to_index, max_features, embedding_size, embedding_file_path):  
    """
    Utility function to load the pre-trained embeddings
    """  
    
    <YOUR CODE HERE>

    return embedding_matrix

In [None]:
# get FastText embeddings based on our word to index mapping dictionary
ft_embeddings = <YOUR CODE HERE>
ft_embeddings.shape

## Build Model Architecture

We will use the ``tensorflow.keras`` high-level API to build our deep neural network. One slight modification is required for the ``Embedding`` layer: instead of initializing it with random weights (as is usual), we start from the FastText embedding weights by setting the ``weights`` parameter. We also keep the ``trainable`` parameter as ``True`` so that the pretrained weights can be fine-tuned on our corpus. The rest of the model uses the usual ``Conv1D`` and ``MaxPool`` layers.

### Build a 1D-Convolution based classification model. Initialize the embedding layer with FastText weights

___You can use a similar architecture as the tutorial or build your own!___

In [None]:
# create the model
<YOUR CODE HERE>

## Train and Validate Model

### Train the Model

Use a methodology similar to the tutorial's, with the following configuration:
- __`validation_split`__ of __0.02__ i.e. 2%
- 5 epochs
- 128 batch size
- no callbacks needed to keep things simple
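As a rough sanity check of what these settings imply, the arithmetic below works through a hypothetical training set of 10,000 texts (the real count comes from the train/test split earlier). Keras takes the validation samples from the end of the training arrays before shuffling. The exact rounding is up to Keras, so treat the numbers as approximate.

```python
import math

n_train = 10_000         # hypothetical training-set size, not the real count
validation_split = 0.02
batch_size = 128
epochs = 5

n_val = round(n_train * validation_split)         # samples held out for validation
n_fit = n_train - n_val                           # samples actually used for fitting
steps_per_epoch = math.ceil(n_fit / batch_size)   # the last batch may be partial
total_updates = steps_per_epoch * epochs

print(n_val, n_fit, steps_per_epoch, total_updates)  # 200 9800 77 385
```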

In [None]:
# Fit the model
<YOUR CODE HERE>

## Model Performance Evaluation on the Test Dataset

### Evaluate the Model

In [None]:
<YOUR CODE HERE>

___Do you observe a better performance?___

# __Question 5:__ Build Model 2: Neural Network Language Model (4 points)

Bengio et al., in their paper titled [A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf), present a method to learn the joint probability function of sequences of words in a language. This language model learns useful representations of sentences and words, which can be leveraged for other NLP tasks such as classification and translation.

Let us leverage NNLM embeddings to train a classifier on the hate speech dataset.

![](https://i.imgur.com/blaLxUp.png)

## Prepare Datasets

In [None]:
norm_train_reviews = np.array(norm_train_reviews)
norm_test_reviews = np.array(norm_test_reviews)

## Build a NNLM Embedding Layer 

In [None]:
model = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1"
hub_layer = <YOUR CODE HERE>

## Build Model Architecture

### Build a Classification Model using the TF_Hub pretrained model

___Use a similar architecture as the tutorial or try your own!___

In [None]:
<YOUR CODE HERE>

## Train and Validate Model

Use a methodology similar to the tutorial's, with the following configuration:
- __`validation_split`__ of __0.02__ i.e. 2%
- 5 epochs
- 128 batch size
- no callbacks needed to keep things simple


In [None]:
# Fit the model
<YOUR CODE HERE>

## Model Performance Evaluation on the Test Dataset

In [None]:
<YOUR CODE HERE>

# __Question 6:__ Build Model 3: Google's Universal Sentence Encoder (4 points)

The Universal Sentence Encoder models take English strings as input and produce a fixed-dimensional embedding representation of each string as output.

There are two models for encoding sentences into embedding vectors:
- One makes use of the transformer (Vaswani et al., 2017) architecture
- The other is formulated as a deep averaging network (DAN) (Iyyer et al., 2015)

__Methodology 1: Transformers__

- The transformer-based sentence encoding model constructs sentence embeddings using the encoding sub-graph of the transformer architecture (Vaswani et al., 2017).
- This sub-graph uses attention to compute context-aware representations of words in a sentence that take into account both the ordering and identity of all the other words.
- The context-aware word representations are converted to a fixed-length sentence encoding vector by computing the element-wise sum of the representations at each word position.
- The encoder takes as input a lowercased Penn Treebank (PTB) tokenized string and outputs a 512-dimensional vector as the sentence embedding.
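The attention-then-sum recipe in the bullets above can be illustrated with a heavily simplified NumPy sketch. It uses a single attention step with no learned projections and no positional information, and random vectors stand in for real token representations. It is a toy illustration of the pooling idea only, not the actual encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, dim = 5, 512
token_vectors = rng.normal(size=(n_tokens, dim))  # stand-in for token representations


def self_attention(x):
    # Scaled dot-product attention with queries = keys = values = x
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


contextual, attn_weights = self_attention(token_vectors)

# Element-wise sum over word positions gives one fixed-length sentence vector
sentence_embedding = contextual.sum(axis=0)
print(contextual.shape, sentence_embedding.shape)  # (5, 512) (512,)
```

The sentence vector has the same size regardless of how many tokens went in, which is what lets a downstream classifier consume sentences of any length.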


__Methodology 2: Deep Averaging Network (DAN)__

- In the deep averaging network (DAN) (Iyyer et al., 2015) the input embeddings for words and bi-grams are first averaged together and then passed through a feedforward deep neural network (DNN) to produce sentence embeddings.
- Similar to the Transformer encoder, the DAN encoder takes as input a lowercased PTB tokenized string and outputs a 512 dimensional sentence embedding.
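A DAN forward pass is short enough to sketch directly. Everything below is a toy assumption: random embeddings for the words and bi-grams, one hidden layer with made-up sizes, and untrained weights. Only the average-then-feedforward structure and the 512-dimensional output follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

n_units, embed_dim = 7, 16       # 7 word/bi-gram embeddings of a toy size
hidden_dim, out_dim = 32, 512    # hypothetical hidden size; 512-d sentence embedding

unit_embeddings = rng.normal(size=(n_units, embed_dim))

# Step 1: average the word and bi-gram embeddings into a single vector
averaged = unit_embeddings.mean(axis=0)

# Step 2: pass the average through a small feedforward network (untrained weights)
W1 = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, out_dim))
b2 = np.zeros(out_dim)

hidden = np.tanh(averaged @ W1 + b1)
dan_sentence_embedding = hidden @ W2 + b2
print(averaged.shape, dan_sentence_embedding.shape)  # (16,) (512,)
```

Because the inputs are averaged before any further computation, the cost grows only linearly with sentence length, but word order is lost beyond what the bi-grams capture.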

__Training Methodology:__

The encoding model is designed to be as general purpose as possible. This is accomplished by using multi-task learning whereby a single encoding model is used to feed multiple downstream tasks. 

Unsupervised training data for the sentence encoding models are drawn from a variety of web sources: Wikipedia, web news, web question-answer pages and discussion forums. The authors augment unsupervised learning with training on supervised data from the Stanford Natural Language Inference (SNLI) corpus.


![](https://i.imgur.com/HIeb3tY.png)

## Build a USE Embedding Layer

### Using TensorFlow Hub, prepare an instance of ``hub.KerasLayer`` to get sentence embeddings

In [None]:
model = "https://tfhub.dev/google/universal-sentence-encoder/4"
hub_layer = <YOUR CODE HERE>

## Build Model Architecture

___Use a similar architecture as the tutorial or try your own!___

In [None]:
<YOUR CODE HERE>

## Train and Validate Model

Use a methodology similar to the tutorial's, with the following configuration:
- __`validation_split`__ of __0.02__ i.e. 2%
- 5 epochs
- 128 batch size
- no callbacks needed to keep things simple

In [None]:
# Fit the model
<YOUR CODE HERE>

## Model Performance Evaluation on the Test Dataset

### **Question 7**: Get Evaluation Results of the model (1 point)

In [None]:
<YOUR CODE HERE>