# Exercise 1: Text Classification using CNNs, LSTMs and Bi-Directional LSTMs

Detecting hate speech from the text of a post or comment is a supervised machine learning problem, more specifically a text classification problem. In the following sections we will build an automated hate speech classification system. The major steps are:

+ Prepare train and test datasets (optionally a validation dataset)
+ Pre-process and normalize text documents
+ Feature Engineering 
+ Model training
+ Model prediction and evaluation

An optional final step is to deploy the trained model on your own server or in the cloud.

We will build the models using deep learning. Our focus will be on Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, including bi-directional LSTMs.

## Load Dataset - Hate Speech

Social media is unfortunately rampant with hate speech in the form of posts and comments. This exercise is a practical example of building an automated hate speech detection system with NLP, framed as text classification.

In this notebook, we will use an open-source collection of hate speech posts and comments.

The dataset is available on [Kaggle](https://www.kaggle.com/usharengaraju/dynamically-generated-hate-speech-dataset), which in turn has been curated from a wider [data source for hate speech](https://hatespeechdata.com/).

In [None]:
!nvidia-smi

## Install Dependencies

In [None]:
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')

## Load Libraries

In [None]:
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf

# fix random seeds for reproducibility (NumPy and TensorFlow)
seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)

## Load Dataset

In [None]:
df = pd.read_csv('HateDataset.csv')
df.info()

In [None]:
df = df[['text', 'label']]
df.head()

## Prepare Train-Test Splits

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_reviews, test_reviews, train_labels, test_labels = train_test_split(df.text.values,
                                                                          df.label.values,
                                                                          test_size=0.2, random_state=42)

In [None]:
len(train_reviews), len(test_reviews)

## Text Preprocessing : Text Wrangling and Normalization

In [None]:
import contractions
from bs4 import BeautifulSoup
import numpy as np
import re
import tqdm
import unicodedata


def strip_html_tags(text):
  soup = BeautifulSoup(text, "html.parser")
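  # drop embedded frames and scripts before extracting the visible text
  for s in soup(['iframe', 'script']):
    s.extract()
  stripped_text = soup.get_text()
  # collapse runs of line breaks into a single newline
  # (the character class must not contain '|', or pipes would be replaced too)
  stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)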
  return stripped_text

def remove_accented_chars(text):
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  return text

def pre_process_corpus(docs):
  norm_docs = []
  for doc in tqdm.tqdm(docs):
    doc = strip_html_tags(doc)
    doc = doc.translate(doc.maketrans("\n\t\r", "   "))
    doc = doc.lower()
    doc = remove_accented_chars(doc)
    doc = contractions.fix(doc)
    # remove special characters, then collapse repeated spaces
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = re.sub(' +', ' ', doc)
    doc = doc.strip()  
    norm_docs.append(doc)
  
  return norm_docs

In [None]:
%%time

norm_train_reviews = pre_process_corpus(train_reviews)
norm_test_reviews = pre_process_corpus(test_reviews)
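
To make the effect of the normalization steps concrete, here is a minimal standard-library sketch applied to a single made-up string. It is an illustration only: the helper name `normalize_demo` and the sample text are hypothetical, and the sketch deliberately leaves out HTML stripping and contraction expansion, which need the third-party `bs4` and `contractions` packages used in the cell above.

```python
import re
import unicodedata


def normalize_demo(text):
    """Simplified stand-in for pre_process_corpus (no HTML or contraction handling)."""
    # replace newlines, tabs and carriage returns with spaces
    text = text.translate(str.maketrans("\n\t\r", "   "))
    text = text.lower()
    # decompose accented characters and drop the non-ASCII combining marks
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # keep only letters, digits and whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # collapse repeated spaces and trim the ends
    text = re.sub(' +', ' ', text)
    return text.strip()


sample = "Café   visitors\tSAID:\n\"Très bien!!\" #2021"
print(normalize_demo(sample))
```

Running the same idea on a few raw posts from `train_reviews` is a quick way to check that the cleaning does not remove information you care about.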

## Feature Engineering

In [None]:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Activation, Dense
from sklearn.preprocessing import LabelEncoder

In [None]:
t = Tokenizer(oov_token='<UNK>')
# fit the tokenizer on the documents
t.fit_on_texts(norm_train_reviews)
t.word_index['<PAD>'] = 0

In [None]:
# transform train set using the tokenizer
train_sequences = t.texts_to_sequences(norm_train_reviews)

In [None]:
# transform test set using the tokenizer
test_sequences = t.texts_to_sequences(norm_test_reviews)

In [None]:
print("Vocabulary size={}".format(len(t.word_index)))
print("Number of Documents={}".format(t.document_count))
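
The sketch below imitates, in plain Python, how a frequency-ranked word index with an out-of-vocabulary token behaves. It is a simplified illustration rather than the Keras implementation: the function names `build_word_index` and `to_sequences` and the toy documents are hypothetical, and it assumes whitespace tokenization. The point to notice is that index 0 stays free for padding, index 1 is the `<UNK>` token, and any word not seen during fitting maps to 1.

```python
from collections import Counter


def build_word_index(docs, oov_token="<UNK>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for doc in docs for word in doc.split())
    word_index = {oov_token: 1}
    # most_common() orders by count, ties keep first-seen order
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequences(docs, word_index, oov_token="<UNK>"):
    """Map each word to its index, falling back to the OOV index."""
    oov = word_index[oov_token]
    return [[word_index.get(w, oov) for w in doc.split()] for doc in docs]


train_docs = ["you are great", "you are not great at all"]
word_index = build_word_index(train_docs)
test_seqs = to_sequences(["you are awful"], word_index)
print(word_index)
print(test_seqs)  # "awful" was never seen, so it becomes the OOV index
```

This is why the tokenizer is fitted on the training reviews only: test-set words that never appeared in training are all represented by the single `<UNK>` index.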

### Visualize Document Lengths

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

train_lens = [len(s) for s in train_sequences]
test_lens = [len(s) for s in test_sequences]

fig, ax = plt.subplots(1,2, figsize=(12, 6))
h1 = ax[0].hist(train_lens)
h2 = ax[1].hist(test_lens)

In [None]:
# 250 tokens is longer than most documents, so it should be a safe maximum length
MAX_SEQUENCE_LENGTH = 250

In [None]:
# pad dataset to a maximum review length in words
X_train = sequence.pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
X_test = sequence.pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
X_train.shape, X_test.shape
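
With its default arguments, `pad_sequences` pads short sequences with zeros at the front and truncates long sequences from the front as well, keeping the last `maxlen` tokens. The NumPy sketch below reproduces that default behaviour for illustration; the function name `pad_pre` and the toy sequences are hypothetical.

```python
import numpy as np


def pad_pre(seqs, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens of each sequence."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for i, seq in enumerate(seqs):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]          # truncate from the front
        out[i, -len(kept):] = kept    # right-align, padding sits on the left
    return out


padded = pad_pre([[5, 6, 7], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

Because padding uses the value 0, index 0 must never be assigned to a real word, which is why `<PAD>` was mapped to 0 in the tokenizer's word index.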

## Label Encode Class Labels

In [None]:
le = LabelEncoder()
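# LabelEncoder assigns integer ids in sorted order of the class names;
# inspect le.classes_ after fitting to see which label maps to 0 and which to 1
num_classes = 2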

In [None]:
y_train = le.fit_transform(train_labels)
y_test = le.transform(test_labels)
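
`LabelEncoder` numbers the classes in sorted order of their names. The NumPy sketch below reproduces that mapping with `np.unique`, assuming the label column holds the strings `hate` and `nothate`; that is an assumption about the CSV, so confirm it with `df.label.unique()` before relying on it.

```python
import numpy as np

# hypothetical label values; check df.label.unique() for the real ones
labels = np.array(["nothate", "hate", "hate", "nothate"])

# np.unique returns the sorted class names and, for each label, its position in that list
classes, encoded = np.unique(labels, return_inverse=True)

print(dict(zip(classes.tolist(), range(len(classes)))))
print(encoded.tolist())
```

Under this assumption the positive class for hate speech is encoded as 0, which matters when you later read precision and recall per class.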

In [None]:
VOCAB_SIZE = len(t.word_index)

## **Question 1**:  Build and Train a CNN Model (4 points)

**Define** a Convolutional Neural Network such that it has:

+ An embedding layer with embedding size of 300
+ 3 pairs of Conv1D and MaxPooling1D layers
+ Dense layers 
+ Choose an appropriate loss function and activation function for the final layer

_Hint: Use a similar config as the tutorial and if you have more time feel free to play around with the layers and necessary hyperparameters_
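
When stacking convolution and pooling pairs it helps to track how the sequence length shrinks, since that determines the size of the tensor reaching the dense layers. The sketch below works through the arithmetic for three pairs, assuming Keras defaults (`padding='valid'`, convolution stride 1, pooling stride equal to the pool size). The kernel size of 4 and pool size of 2 are hypothetical choices for illustration, not values prescribed by the exercise.

```python
def conv1d_len(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1


def maxpool1d_len(length, pool_size):
    """Output length of MaxPooling1D with 'valid' padding and stride == pool_size."""
    return length // pool_size


KERNEL_SIZE = 4   # hypothetical
POOL_SIZE = 2     # hypothetical
length = 250      # MAX_SEQUENCE_LENGTH from the notebook

trace = [length]
for _ in range(3):  # three Conv1D + MaxPooling1D pairs
    length = conv1d_len(length, KERNEL_SIZE)
    trace.append(length)
    length = maxpool1d_len(length, POOL_SIZE)
    trace.append(length)

print(trace)
```

With `padding='same'` on the convolutions, only the pooling layers would shorten the sequence.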

In [None]:
<YOUR CODE HERE>

## Train the CNN Model

In [None]:
<YOUR CODE HERE>

## Evaluate CNN Model
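
As a reference for what evaluation involves, the NumPy sketch below turns predicted probabilities into class decisions and computes accuracy, precision and recall. It assumes the model's final layer emits one probability per post for the class encoded as 1 and uses a 0.5 threshold; both are assumptions, and the function name `binary_report` and the toy numbers are hypothetical.

```python
import numpy as np


def binary_report(y_true, y_prob, threshold=0.5):
    """Accuracy, precision and recall for the class encoded as 1."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


report = binary_report([1, 0, 1, 1, 0], [0.9, 0.4, 0.3, 0.8, 0.6])
print(report)
```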

In [None]:
<YOUR CODE HERE>

## **Question 2**: Build and Train an LSTM-based Model (4 points)

### **Define** an LSTM-based Neural Network such that it has:

+ An embedding layer with embedding size of 300
+ An LSTM layer
+ Dense layers 
+ Choose an appropriate loss function and activation function for the final layer

_Hint: Use a similar config as the tutorial and if you have more time feel free to play around with the layers and necessary hyperparameters_
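
When choosing the LSTM size, it is useful to know how many trainable parameters each layer adds. A standard LSTM layer has four gates, each with an input weight matrix, a recurrent weight matrix and a bias, and an embedding layer has one vector per vocabulary entry. The sketch below computes both counts; the 128 units and the vocabulary size of 20,000 are hypothetical values for illustration.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of a standard LSTM layer with biases (4 gates)."""
    return 4 * ((input_dim + units) * units + units)


def embedding_param_count(vocab_size, embedding_dim):
    """Trainable parameters of an embedding layer."""
    return vocab_size * embedding_dim


EMBEDDING_DIM = 300      # from the exercise
LSTM_UNITS = 128         # hypothetical
EXAMPLE_VOCAB = 20000    # hypothetical; use VOCAB_SIZE in the notebook

lstm_params = lstm_param_count(EMBEDDING_DIM, LSTM_UNITS)
embedding_params = embedding_param_count(EXAMPLE_VOCAB, EMBEDDING_DIM)
print(lstm_params, embedding_params)
```

Comparing these numbers against `model.summary()` is a quick sanity check that the layers are wired as intended.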

In [None]:
<YOUR CODE HERE>

## Train the Model

In [None]:
<YOUR CODE HERE>

## Evaluate the Model

In [None]:
<YOUR CODE HERE>

## **Question 3**: Build and Train a Bi-LSTM based Model (6 points)

### **Define** a Bi-Directional LSTM based Neural Network such that it has:

+ An embedding layer with embedding size of 300
+ 2 bi-directional LSTM layers (hint: remember how to use `return_sequences`)
+ Dense and Dropout layers 
+ Choose an appropriate loss function and activation function for the final layer

_Hint: Use a similar config as the tutorial and if you have more time feel free to play around with the layers and necessary hyperparameters_
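
The role of `return_sequences` is easiest to see through output shapes. A recurrent layer needs a 3-D input of shape `(batch, timesteps, features)`, so any bi-directional LSTM that feeds another recurrent layer must return its full sequence. The sketch below only does the shape bookkeeping, assuming the default `merge_mode='concat'` of Keras `Bidirectional`, which doubles the feature dimension. The unit counts of 64 and 32 are hypothetical.

```python
def bilstm_output_shape(input_shape, units, return_sequences):
    """Output shape of a Bidirectional LSTM that concatenates both directions."""
    if len(input_shape) != 3:
        raise ValueError("a recurrent layer expects (batch, timesteps, features)")
    batch, timesteps, _ = input_shape
    features = 2 * units  # forward and backward outputs are concatenated
    if return_sequences:
        return (batch, timesteps, features)
    return (batch, features)


embedded = (None, 250, 300)  # (batch, MAX_SEQUENCE_LENGTH, embedding size)
first = bilstm_output_shape(embedded, units=64, return_sequences=True)
second = bilstm_output_shape(first, units=32, return_sequences=False)
print(first, second)
```

If the first layer used `return_sequences=False`, its output would be 2-D and the second recurrent layer could not consume it.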

In [None]:
<YOUR CODE HERE>

## Train the Model

In [None]:
<YOUR CODE HERE>

## Evaluate the Model

In [None]:
<YOUR CODE HERE>