# Deep Transfer Learning for Text Classification

Tough real-world problems in Natural Language Processing (NLP) include class imbalance and a shortage of labeled training data. Thanks to recent advances in deep transfer learning for NLP, we can make rapid progress on these problems and also leverage pre-trained models for diverse downstream NLP tasks.

In this exercise you will be writing code to build a Hate Speech Classifier using:

- BERT (fine-tuning)
- DistilBERT (fine-tuning)

# GPU Check

In [None]:
!nvidia-smi

# Load Necessary Dependencies

In [None]:
!pip install transformers

In [None]:
import transformers
transformers.__version__

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub
import transformers
import tqdm
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

%matplotlib inline

# fix random seed for reproducibility
seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)

In [None]:
print("TF Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("TF Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")  # is_gpu_available() is deprecated

## Load Dataset - Hate Speech

Social media is unfortunately rampant with hate speech in the form of posts and comments. Building an automated hate speech detection system is a practical application of NLP in the form of text classification.

In this notebook, we will leverage an open sourced collection of hate speech posts and comments.

The dataset is available on [Kaggle](https://www.kaggle.com/usharengaraju/dynamically-generated-hate-speech-dataset), which in turn has been curated from a wider [data source for hate speech](https://hatespeechdata.com/).

In [None]:
df = pd.read_csv('HateDataset.csv')
df.info()

# Subset Dataset

BERT is a HUGE model which takes a long time to fine-tune!  

So let's subset our data so you can work with a smaller dataset and train faster.

In [None]:
df = df[['text', 'label']]
df = df.sample(10000, random_state=42)
df.head()

# Preparing Train, Validation and Test Datasets


### Prepare Train-Test Split. Keep test set to be 20% of the total

In [None]:
from sklearn.model_selection import train_test_split

train_reviews, test_reviews, train_labels, test_labels = train_test_split(df.text.values,
                                                                          df.label.values,
                                                                          test_size=0.2, random_state=42)

In [None]:
len(train_reviews), len(test_reviews)

# No Text Pre-processing

__Note:__ For some models, like BERT, we don't apply any text pre-processing! The model should be able to handle a wide variety of text in its natural format.

## Label Encode Classes

# **Question 1**: Label Encode Class Labels

`hate` and `nothate` need to be encoded as numbers
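To make the idea concrete, here is a minimal pure-Python sketch of what label encoding does. It is only an illustration, not the solution to the exercise cell below: the toy label lists are made up, and the alphabetical class ordering is an assumption chosen to mirror how scikit-learn's `LabelEncoder` orders classes.

```python
# Toy labels (made up for illustration); the real ones come from the dataset.
toy_train_labels = ["hate", "nothate", "nothate", "hate"]
toy_test_labels = ["nothate", "hate"]

# "Fit": learn a class -> integer mapping from the training labels only.
classes = sorted(set(toy_train_labels))
class_to_id = {name: idx for idx, name in enumerate(classes)}

# "Transform": apply the same mapping to both splits.
toy_y_train = [class_to_id[name] for name in toy_train_labels]
toy_y_test = [class_to_id[name] for name in toy_test_labels]

print(class_to_id)
print(toy_y_train, toy_y_test)
```

The key point is that the mapping is learned once on the training labels and then reused unchanged on the test labels, so both splits share the same encoding.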

In [None]:
from sklearn.preprocessing import LabelEncoder

le = <YOUR CODE HERE>

y_train = <YOUR CODE HERE> # fit - transform on train labels
y_test = <YOUR CODE HERE>  # transform on test labels

# Model 1: BERT (Bi-directional Encoder Representations from Transformers)

We will be using the BERT base model which has already been pre-trained by Google on MLM + Next Sentence Prediction Tasks

## BERT Tokenization

The BERT model we're using expects lowercase data. Here we leverage Huggingface `transformers`' `BertTokenizer`, which breaks words into word pieces.

The WordPiece tokenizer is closely related to [Byte Pair Encoding (BPE)](https://www.aclweb.org/anthology/P16-1162).

WordPiece and BPE are two similar and commonly used techniques for segmenting words into subword units in NLP tasks. In both cases, the vocabulary is initialized with all the individual characters in the language, and then the most frequent/likely combinations of the symbols in the vocabulary are iteratively added to the vocabulary.

Imagine that the model sees the word walking. Unless this word occurs at least a few times in the training corpus, the model can't learn to deal with this word very well. However, it may have the words walked, walker, walks, each occurring only a few times. Without subword segmentation, all these words are treated as completely different words by the model.

However, if these get segmented as walk@@ ing, walk@@ ed, etc., notice that all of them will now have walk@@ in common, which will occur much more frequently during training, and the model might be able to learn more about it.
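The merging idea described above can be sketched with a toy BPE learner in plain Python. This is a simplified illustration, not the tokenizer BERT actually ships with: the word frequencies are made up, the helper names `get_pair_counts` and `merge_pair` are hypothetical, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Made-up corpus frequencies; each word starts as a sequence of characters.
word_freqs = {"walking": 2, "walked": 3, "walker": 1, "walks": 2}
vocab = {tuple(word): freq for word, freq in word_freqs.items()}

# Repeatedly merge the most frequent adjacent pair.
for _ in range(3):
    pair_counts = get_pair_counts(vocab)
    best_pair = max(pair_counts, key=pair_counts.get)
    vocab = merge_pair(best_pair, vocab)

for symbols in vocab:
    print(" ".join(symbols))
```

After three merges every word in this toy corpus begins with the shared symbol `walk`, so the evidence from all four word forms is pooled into one subword.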

Huggingface's transformers library has easy-to-use utilities for each type of model.

# **Question 2**: Load BERT Tokenizer

_Hint: Use ``transformers.BertTokenizer.from_pretrained`` with the right pretrained model (bert base uncased) similar to the tutorial_

In [None]:
tokenizer = <YOUR CODE HERE>

## BERT Data Preparation

We need to preprocess our data so that it matches the data format BERT was trained on. For this, we'll need to do a couple of things.

- Lowercase our text (if we're using a BERT lowercase model)
- Tokenize it (e.g. "sally says hi" -> ["sally", "says", "hi"])
- Break words into WordPieces (e.g. "calling" -> ["call", "##ing"])
- Map our words to indexes using a vocab file that BERT provides
- Add special "CLS" and "SEP" tokens (see the readme)
- Append "mask" and "segment" tokens to each input (see the BERT paper)
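The steps above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not the Huggingface implementation and not the solution to the next question: `TOY_VOCAB` is a made-up nine-entry vocabulary, `wordpiece` and `encode` are hypothetical helpers, whitespace splitting stands in for real tokenization, and `max_seq_length` is assumed to be at least 2.

```python
# Made-up miniature vocabulary; the real BERT vocab has about 30,000 entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "sally": 4, "says": 5, "hi": 6, "call": 7, "##ing": 8}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into WordPieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab, max_seq_length):
    """Lowercase, split, add special tokens, map to IDs, then pad."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    # Truncate so that [CLS] and [SEP] still fit.
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    ids = [vocab[token] for token in tokens]
    mask = [1] * len(ids)                 # 1 = real token
    padding = max_seq_length - len(ids)
    ids += [vocab["[PAD]"]] * padding
    mask += [0] * padding                 # 0 = padding
    segments = [0] * max_seq_length       # single-sentence input: all zeros
    return ids, mask, segments

ids, mask, segments = encode("Sally says hi calling", TOY_VOCAB, max_seq_length=8)
print(ids)       # [2, 4, 5, 6, 7, 8, 3, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The three returned lists correspond to the token IDs, the attention mask, and the segment IDs that the model consumes for each document.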

# **Question 3**: Create function to tokenize and encode text into BERT token IDs

Use a similar strategy as you learnt in the tutorial and fill in the following function

In [None]:
def create_bert_input_features(tokenizer, docs, max_seq_length):
    
    <YOUR CODE HERE>

# **Question 4**: Build Model Architecture

Use a similar model architecture as the tutorial with a max sequence length of 500 tokens to be used.

You can always experiment with aspects like learning rate, number of layers etc.

In [None]:
MAX_SEQ_LENGTH = 500

<YOUR CODE HERE>

model.summary()

## Convert text to BERT input features

We use the utility function we created earlier to convert our text documents into BERT input features.

# **Question 5**: Prepare Feature ID, masks and segments for train and test sets

In [None]:
train_features_ids, train_features_masks, train_features_segments = <YOUR CODE HERE>

test_features_ids, test_features_masks, test_features_segments = <YOUR CODE HERE>

print('Train Features:', train_features_ids.shape, train_features_masks.shape, train_features_segments.shape)
print('Test Features:', test_features_ids.shape, test_features_masks.shape, test_features_segments.shape)

## Train and Validate Model

# **Question 6**: Train the Model

You can train it for around 3 epochs as it takes a fair bit of time to train even on a good GPU

For validation data you can use the test data tokens from the previous cell to keep things simple

In [None]:
<YOUR CODE HERE>

# **Question 7**: Model Performance Evaluation on the Test Dataset

Show the model's performance on the test dataset
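As a reminder of what the usual metrics measure, here is a small pure-Python sketch that turns predicted probabilities into class predictions and computes confusion counts, accuracy, precision, recall and F1. The labels, probabilities and 0.5 threshold are made up for illustration, and treating class `1` as the positive class is an assumption. In the notebook itself you would typically use the `confusion_matrix` and `classification_report` functions imported earlier.

```python
# Made-up ground truth and predicted probabilities for class 1.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]

# Threshold the probabilities to get hard class predictions.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

# Confusion counts, treating class 1 as "positive" (an assumption).
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Confusion counts (tp, fp, fn, tn):", (tp, fp, fn, tn))
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With imbalanced classes, precision and recall per class are usually more informative than accuracy alone.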

In [None]:
<YOUR CODE HERE>

# Model 2: DistilBERT (Distilled BERT)


## DistilBERT Tokenization

The DistilBERT model we're using expects lowercase data. Here we leverage Huggingface `transformers`' `DistilBertTokenizer`, which breaks words into word pieces.

The WordPiece tokenizer is closely related to [Byte Pair Encoding (BPE)](https://www.aclweb.org/anthology/P16-1162).

WordPiece and BPE are two similar and commonly used techniques for segmenting words into subword units in NLP tasks. In both cases, the vocabulary is initialized with all the individual characters in the language, and then the most frequent/likely combinations of the symbols in the vocabulary are iteratively added to the vocabulary.

Imagine that the model sees the word walking. Unless this word occurs at least a few times in the training corpus, the model can't learn to deal with this word very well. However, it may have the words walked, walker, walks, each occurring only a few times. Without subword segmentation, all these words are treated as completely different words by the model.

However, if these get segmented as walk@@ ing, walk@@ ed, etc., notice that all of them will now have walk@@ in common, which will occur much more frequently during training, and the model might be able to learn more about it.

Huggingface's transformers library has easy-to-use utilities for each type of model.

# **Question 8**: Load DistilBERT Tokenizer

_Hint: Use ``transformers.DistilBertTokenizer.from_pretrained`` with the right pretrained model (distilbert base uncased) similar to the tutorial_

In [None]:
tokenizer = <YOUR CODE HERE>

## DistilBERT Data Preparation

We need to preprocess our data so that it matches the data format DistilBERT was trained on. For this, we'll need to do a couple of things.

- Lowercase our text (if we're using a DistilBERT lowercase model)
- Tokenize it (e.g. "sally says hi" -> ["sally", "says", "hi"])
- Break words into WordPieces (e.g. "calling" -> ["call", "##ing"])
- Map our words to indexes using a vocab file that BERT provides
- Add special "CLS" and "SEP" tokens (see the readme)
- Append "mask" tokens to each input (see https://medium.com/huggingface/distilbert-8cf3380435b5)

# **Question 9**: Create function to tokenize and encode text into DistilBERT token IDs

Use a similar strategy as you learnt in the tutorial and fill in the following function

In [None]:
def create_bert_input_features(tokenizer, docs, max_seq_length):
    
    <YOUR CODE HERE>

# **Question 10**: Build Model Architecture

Use a similar model architecture as the tutorial with a max sequence length of 500 tokens to be used.

You can always experiment with aspects like learning rate, number of layers etc.

In [None]:
MAX_SEQ_LENGTH = 500

<YOUR CODE HERE>

model.summary()

# **Question 11**: Prepare Feature IDs and masks for train and test sets

In [None]:
train_features_ids, train_features_masks = <YOUR CODE HERE>
test_features_ids, test_features_masks = <YOUR CODE HERE>

print('Train Features:', train_features_ids.shape, train_features_masks.shape)
print('Test Features:', test_features_ids.shape, test_features_masks.shape)

## Train and Validate Model

# **Question 12**: Train the Model

You can train it for around 3 epochs as it takes a fair bit of time to train even on a good GPU

For validation data you can use the test data tokens from the previous cell to keep things simple

In [None]:
<YOUR CODE HERE>

# **Question 13**: Model Performance Evaluation on the Test Dataset

Show the model's performance on the test dataset

In [None]:
<YOUR CODE HERE>