# Text Classification using Word Embeddings and Dense Neural Network Models

## Building a Hate Speech Classifier

Understanding text content and predicting whether a post contains hate speech is a form of supervised machine learning. More specifically, we will use classification models to solve this problem. In the subsequent sections we will build an automated hate speech text classification system. The major steps are as follows.

+ Prepare train and test datasets (optionally a validation dataset)
+ Pre-process and normalize text documents
+ Feature Engineering 
+ Model training
+ Model prediction and evaluation

These are the major steps for building our system. Optionally, the last step is to deploy the model on your own server or in the cloud. The following figure shows a detailed workflow for building a standard text classification system with supervised learning (classification) models.

In our scenario, documents are the posts and comments, and classes indicate whether a post is hate speech or not. The label is either ``hate`` or ``nothate``, making this a binary classification problem. We will build models using deep learning in the subsequent sections.

In [39]:
!nvidia-smi

Thu Mar  4 21:52:26 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    37W / 300W |    577MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [40]:
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Load Dataset - Hate Speech

Unfortunately, social media is rife with hate speech in the form of posts and comments. Building an automated hate speech detection system is a practical application of NLP text classification.

In this notebook, we will leverage an open sourced collection of hate speech posts and comments.

The dataset is available on [Kaggle](https://www.kaggle.com/usharengaraju/dynamically-generated-hate-speech-dataset), which in turn was curated from a wider [data source for hate speech](https://hatespeechdata.com/).

In [42]:
import pandas as pd

df = pd.read_csv('HateDataset.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40623 entries, 0 to 40622
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      40623 non-null  int64  
 1   id              40623 non-null  object 
 2   text            40623 non-null  object 
 3   label           40623 non-null  object 
 4   type            40623 non-null  object 
 5   model_wrong     26097 non-null  object 
 6   db.model_preds  26097 non-null  float64
 7   status          40623 non-null  object 
 8   round           40623 non-null  object 
 9   split           40623 non-null  object 
 10  annotator       40623 non-null  object 
dtypes: float64(1), int64(1), object(9)
memory usage: 3.4+ MB


To keep things simple, we will focus on predicting the labels from the text content.

In [43]:
df = df[['text', 'label']]
df.head()

| | text | label |
|---|---|---|
| 0 | It should be legal for a husband to rape his wife | hate |
| 1 | dalits are lowlife scum | hate |
| 2 | dalits are lowlives | hate |
| 3 | It was a better world when women dare not ques... | hate |
| 4 | I don't work this hard so that those immigrant... | hate |


### Split data into train-test datasets

In [44]:
from sklearn.model_selection import train_test_split

In [45]:
train_reviews, test_reviews, train_labels, test_labels = train_test_split(df.text.values,
                                                                          df.label.values,
                                                                          test_size=0.2, random_state=42)

In [46]:
len(train_reviews), len(test_reviews)

(32498, 8125)
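
The sizes above follow directly from the 80/20 split. Here is a quick arithmetic check, assuming scikit-learn rounds the test portion up to a whole number of samples and gives the remainder to the training set (this is my understanding of ``train_test_split`` with a float ``test_size``, not something the notebook states):

```python
import math

n_samples = 40623   # rows reported by df.info()
test_size = 0.2     # fraction passed to train_test_split

# assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Because ``random_state=42`` is fixed, the same rows should land in each partition on every run.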

## Text Wrangling and Normalization

In this section, we will normalize our corpus by removing accented characters, newline characters and so on. Let's get started.

### **Question 1**: **Complete** the following utility functions

__Hint:__ Use the knowledge gained from NLP-1 or the classification tutorial to solve this

In [47]:
import contractions
from bs4 import BeautifulSoup
import numpy as np
import re
from tqdm import tqdm
import unicodedata


def strip_html_tags(text):
    # use BeautifulSoup to remove HTML tags
    soup = BeautifulSoup(text, "html.parser")
    # drop embedded frames and scripts entirely before extracting text
    for s in soup(['iframe', 'script']):
        s.extract()
    stripped_text = soup.get_text()
    # collapse runs of carriage returns / newlines into a single newline
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text

def remove_accented_chars(text):
    # hint use the normalize function from unicodedata
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def pre_process_corpus(docs):
    norm_docs = []
    for doc in tqdm(docs):
        # strip HTML tags
        doc = strip_html_tags(doc)
        # remove extra newlines
        doc = doc.translate(doc.maketrans("\n\t\r", "   "))
        # lower case
        doc = doc.lower()
        # remove accented characters
        doc = remove_accented_chars(doc)
        # fix contractions
        doc = contractions.fix(doc)
        # remove special characters and extra whitespace
        # use regex to keep only letters, numbers and spaces
        doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
        # use regex to remove extra spaces
        doc = re.sub(' +', ' ', doc)
        # remove trailing and leading spaces
        doc = doc.strip()  

        norm_docs.append(doc)
  
    return norm_docs
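
To see what the normalization steps do to a single string, here is a minimal standard-library sketch. It deliberately omits the HTML stripping and contraction expansion performed above (those need ``bs4`` and ``contractions``), and the function name ``normalize_text`` and the sample string are hypothetical, chosen only for illustration:

```python
import re
import unicodedata

def normalize_text(text):
    # replace newlines, tabs and carriage returns with spaces
    text = text.translate(str.maketrans("\n\t\r", "   "))
    # lower case
    text = text.lower()
    # decompose accented characters and drop the non-ASCII accent marks
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # keep only letters, digits and whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # collapse repeated whitespace and trim both ends
    return re.sub(r"\s+", " ", text).strip()

sample = "Café\tcrème,\nDéjà vu!!  100% "
print(normalize_text(sample))  # cafe creme deja vu 100
```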

In [48]:
%%time

norm_train_reviews = pre_process_corpus(train_reviews)
norm_test_reviews = pre_process_corpus(test_reviews)

100%|██████████| 32498/32498 [00:04<00:00, 6981.13it/s]
100%|██████████| 8125/8125 [00:01<00:00, 7505.39it/s]

CPU times: user 5.75 s, sys: 344 ms, total: 6.1 s
Wall time: 5.75 s





## Label Encode Class Labels

Our dataset has labels in the form of ``hate`` and ``nothate`` classes. We transform them into a consumable form by performing label encoding, which assigns a unique numerical value to each class. ``LabelEncoder`` numbers the classes in sorted order, so here:
``hate: 0 and nothate: 1``
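
For the ``hate`` / ``nothate`` labels in this dataset, the idea behind label encoding can be sketched in plain Python. This is only an illustration of the mapping (sorted class names to consecutive integers), not scikit-learn's implementation, and the short ``labels`` list is made up for the example:

```python
labels = ["nothate", "hate", "hate", "nothate"]  # toy labels, not real data

# classes are sorted, then numbered from zero
classes = sorted(set(labels))
class_to_id = {name: idx for idx, name in enumerate(classes)}

# forward transform: label names to integers
encoded = [class_to_id[name] for name in labels]
# inverse transform: integers back to label names
decoded = [classes[idx] for idx in encoded]

print(class_to_id)  # {'hate': 0, 'nothate': 1}
print(encoded)      # [1, 0, 0, 1]
```

The inverse mapping is what lets us turn the network's 0/1 predictions back into readable class names at evaluation time.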

In [49]:
import gensim
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Activation, Dense
from sklearn.preprocessing import LabelEncoder

### **Question 2**: **Complete** the following transformations

In [50]:
le = LabelEncoder()
# tokenize train reviews & encode train labels
tokenized_train = [nltk.word_tokenize(text)
                       for text in tqdm(norm_train_reviews)]
y_train = le.fit_transform(train_labels)
# tokenize test reviews & encode test labels
tokenized_test = [nltk.word_tokenize(text)
                       for text in tqdm(norm_test_reviews)]
y_test = le.transform(test_labels)

100%|██████████| 32498/32498 [00:04<00:00, 7179.25it/s]
100%|██████████| 8125/8125 [00:01<00:00, 7415.09it/s]


## Feature Engineering based on Word2Vec Embeddings

In the previous notebook we discussed different word embedding techniques such as word2vec, GloVe and fastText. In this section we will leverage ``gensim`` to transform our dataset into a word2vec representation.

In [51]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### **Question 3**: **Get** feature vectors using Word2Vec

Build the word2vec model on your tokenized train data

In [63]:
%%time
# build word2vec model
w2v_num_features = 300
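# use a similar config as the tutorial but use a min_count of 2 and train for 10 iterations
# note: size= and iter= are the gensim 3.x names; gensim 4.x renamed them to vector_size= and epochs=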
w2v_model = gensim.models.Word2Vec(tokenized_train, size=w2v_num_features, window=150,
                                   min_count=2, workers=4, iter=10)    

2021-03-04 22:39:51,929 : INFO : collecting all words and their counts
2021-03-04 22:39:51,931 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-04 22:39:51,968 : INFO : PROGRESS: at sentence #10000, processed 191896 words, keeping 12443 word types
2021-03-04 22:39:52,000 : INFO : PROGRESS: at sentence #20000, processed 381698 words, keeping 16843 word types
2021-03-04 22:39:52,037 : INFO : PROGRESS: at sentence #30000, processed 569659 words, keeping 19666 word types
2021-03-04 22:39:52,048 : INFO : collected 20221 word types from a corpus of 616735 raw words and 32498 sentences
2021-03-04 22:39:52,048 : INFO : Loading a fresh vocabulary
2021-03-04 22:39:52,080 : INFO : effective_min_count=2 retains 13578 unique words (67% of original 20221, drops 6643)
2021-03-04 22:39:52,081 : INFO : effective_min_count=2 leaves 610092 word corpus (98% of original 616735, drops 6643)
2021-03-04 22:39:52,127 : INFO : deleting the raw counts dictionary of 20221 items

CPU times: user 40.9 s, sys: 88.3 ms, total: 41 s
Wall time: 22.7 s
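
The log above reports that ``min_count=2`` keeps 13578 of the 20221 word types, which means every word seen only once is dropped. The filtering idea can be sketched on a toy corpus as follows; this is an illustration only, not gensim's code, and ``toy_corpus`` is invented for the example:

```python
from collections import Counter

toy_corpus = [
    ["this", "is", "a", "post"],
    ["this", "is", "another", "post"],
    ["a", "rare", "word"],
]
min_count = 2

# count every token across all tokenized documents
counts = Counter(token for doc in toy_corpus for token in doc)

# keep only words that occur at least min_count times
vocab = {word for word, count in counts.items() if count >= min_count}
dropped = set(counts) - vocab

print(sorted(vocab))    # ['a', 'is', 'post', 'this']
print(sorted(dropped))  # ['another', 'rare', 'word']
```

Words outside the vocabulary get no embedding, which is why the document vectorizer later has to skip unknown tokens.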


## Averaged Document Vectors

A sentence in very simple terms is a collection of words. By now we know how to transform words into vector representation. But how do we transform sentences and documents into vector representation?

A simple and naïve way is to average the vectors of all words in a given sentence to form a sentence vector. In this section, we will use this technique to prepare our sentence/document vectors.
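
As a small worked example of the averaging idea, the sketch below uses a hand-made two-dimensional embedding table in place of a trained model. The names ``toy_vectors`` and ``average_vector`` are hypothetical, and the out-of-vocabulary handling (skip unknown words, return zeros if nothing is known) is one reasonable convention rather than the only one:

```python
import numpy as np

# toy embedding table standing in for a trained word2vec model
toy_vectors = {
    "good": np.array([1.0, 2.0]),
    "day": np.array([3.0, 4.0]),
}

def average_vector(tokens, vectors, num_features):
    # collect vectors only for tokens that have an embedding
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # no known words: fall back to an all-zero vector
        return np.zeros(num_features)
    # element-wise mean across the known word vectors
    return np.mean(known, axis=0)

doc_vec = average_vector(["good", "day", "unseen"], toy_vectors, 2)
empty_vec = average_vector(["unseen"], toy_vectors, 2)

print(doc_vec)    # [2. 3.]
print(empty_vec)  # [0. 0.]
```

Note that averaging discards word order, so two sentences with the same words in a different order get the same vector.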

### **Question 4**: **Complete** the following utility to build a function to generate and obtain averaged document embeddings

In [64]:
def averaged_doc_vectorizer(corpus, model, num_features):
    # gensim 3.x attribute; in gensim 4.x use model.wv.index_to_key instead
    vocabulary = set(model.wv.index2word)
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

In [65]:
# generate averaged word vector features from word2vec model
avg_w2v_train_features = averaged_doc_vectorizer(corpus=tokenized_train, model=w2v_model,
                                                     num_features=w2v_num_features)
avg_w2v_test_features = averaged_doc_vectorizer(corpus=tokenized_test, model=w2v_model,
                                                    num_features=w2v_num_features)

In [66]:
print('Word2Vec model:> Train features shape:', avg_w2v_train_features.shape, 
      ' Test features shape:', avg_w2v_test_features.shape)

Word2Vec model:> Train features shape: (32498, 300)  Test features shape: (8125, 300)


## Define DNN Model

Let us leverage ``tensorflow.keras`` to build our deep neural network for the hate speech classification task.
We will make use of ``Dense`` layers with ``ReLU`` activation and ``Dropout`` to prevent overfitting.

### **Question 5**: **Complete** the following utility to build a deep neural network for classification task

Use an architecture similar to the tutorial's. The key components are listed below for reference:

- 3 Dense Layers
- 512 - 256 - 256 (neurons)
- 20% dropout in each layer
- 1 output layer for binary classification
- binary crossentropy loss 
- adam optimizer

In [67]:
def construct_deepnn_architecture(num_input_features):
    dnn_model = Sequential()
    dnn_model.add(Dense(512, input_shape=(num_input_features,)))
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(256))
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(256))
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(1))
    dnn_model.add(Activation('sigmoid'))

    dnn_model.compile(loss='binary_crossentropy', optimizer='adam',                 
                      metrics=['accuracy'])
    return dnn_model
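
The model ends in a sigmoid unit trained with binary cross-entropy. To make that pairing concrete, here is a minimal standard-library sketch of both functions for a single example. The helper names and the clipping constant ``eps`` are assumptions for illustration and are not taken from Keras:

```python
import math

def sigmoid(z):
    # squash a raw score into a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    # clip the probability to avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# a raw score of 0 means the model is undecided (p = 0.5)
loss_uncertain = binary_crossentropy(1, sigmoid(0.0))
# a confident prediction is cheap when right and expensive when wrong
loss_confident_right = binary_crossentropy(1, sigmoid(4.0))
loss_confident_wrong = binary_crossentropy(0, sigmoid(4.0))

print(loss_uncertain, loss_confident_right, loss_confident_wrong)
```

The loss grows sharply for confident mistakes, which is what pushes the network towards calibrated probabilities during training.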

## Compile and Visualize Model

In [68]:
w2v_dnn = construct_deepnn_architecture(num_input_features=w2v_num_features)

In [69]:
w2v_dnn.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_16 (Dense)             (None, 512)               154112    
_________________________________________________________________
activation_16 (Activation)   (None, 512)               0         
_________________________________________________________________
dropout_12 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_17 (Dense)             (None, 256)               131328    
_________________________________________________________________
activation_17 (Activation)   (None, 256)               0         
_________________________________________________________________
dropout_13 (Dropout)         (None, 256)               0         
_________________________________________________________________
dense_18 (Dense)             (None, 256)              
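
The parameter counts in the summary above can be reproduced by hand: a ``Dense`` layer with ``n_in`` inputs and ``n_out`` units has ``n_in * n_out`` weights plus ``n_out`` biases. The sketch below applies that formula to the layer sizes defined in ``construct_deepnn_architecture``. Only the first two results can be checked against the summary shown here; the remaining values are computed from the formula, not copied from the notebook output:

```python
# input features followed by the units of each Dense layer
layer_sizes = [300, 512, 256, 256, 1]

# weights (n_in * n_out) plus biases (n_out) for each consecutive pair
params = [n_in * n_out + n_out
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(params)

print(params)  # [154112, 131328, 65792, 257]
print(total)   # 351489
```

``Activation`` and ``Dropout`` layers have no trainable parameters, which is why they show 0 in the summary.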

## Train the Model using Word2Vec Features

The first exercise is to use word2vec features as input to our deep neural network to perform hate speech classification.

### **Question 6**: **Train** the model

In [70]:
batch_size = 64
w2v_dnn.fit(avg_w2v_train_features, y_train, epochs=15, batch_size=batch_size, 
            shuffle=True, validation_split=0.1, verbose=1)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7f6b3fa9eb90>

### Evaluate Model

In [71]:
from sklearn.metrics import confusion_matrix, classification_report

In [72]:
# predict_classes() was removed in recent TensorFlow releases; threshold the sigmoid output instead
y_pred = (w2v_dnn.predict(avg_w2v_test_features) > 0.5).astype('int32').ravel()
predictions = le.inverse_transform(y_pred)


### **Question 7**: **Get** evaluation results

In [73]:
labels = le.classes_.tolist()
# print classification report
print(classification_report(test_labels, predictions))
# display confusion matrix
pd.DataFrame(confusion_matrix(test_labels, predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

        hate       0.67      0.83      0.74      4401
     nothate       0.72      0.52      0.60      3724

    accuracy                           0.69      8125
   macro avg       0.70      0.67      0.67      8125
weighted avg       0.69      0.69      0.68      8125



| actual \ predicted | hate | nothate |
|---|---|---|
| hate | 3655 | 746 |
| nothate | 1800 | 1924 |
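
The numbers in the classification report can be recomputed from the confusion matrix above, where rows are actual labels and columns are predicted labels. A short check, treating ``hate`` as the positive class (the variable names are chosen for this example):

```python
# counts taken from the confusion matrix (hate = positive class)
tp, fn = 3655, 746    # actual hate: predicted hate / predicted nothate
fp, tn = 1800, 1924   # actual nothate: predicted hate / predicted nothate

precision = tp / (tp + fp)   # of posts flagged as hate, how many really are
recall = tp / (tp + fn)      # of real hate posts, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.67 0.83 0.74 0.69
```

The model catches most hate posts (recall 0.83) but at the cost of flagging 1800 ``nothate`` posts as hate, which is why ``nothate`` recall is only about 0.52.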


Congratulations, you have built your first hate speech detection model!

We will look at more complex models in the future to see if we can improve this performance, given that this is a fairly complex dataset and domain compared to basic sentiment analysis.