# Text Classification using Word Embeddings and Dense Neural Network Models

## Building a Hate Speech Classifier

Detecting hate speech from the text of a post or comment is a form of supervised machine learning; more specifically, it is a classification problem. In the following sections we will build an automated hate speech text classification system. The major steps are as follows.

+ Prepare train and test datasets (optionally a validation dataset)
+ Pre-process and normalize text documents
+ Feature Engineering 
+ Model training
+ Model prediction and evaluation

Optionally, a final step is to deploy the model on your own server or in the cloud. The following figure shows a detailed workflow for building a standard text classification system with supervised learning (classification) models.

In our scenario, documents are the posts and comments, and classes indicate whether a post is hate speech or not. The label is either `hate` or `nothate`, which makes this a binary classification problem. We will build models using deep learning in the subsequent sections.

__Fill the sections marked with blanks or `<YOUR CODE HERE>`__

In [1]:
!nvidia-smi

Mon Nov 22 18:32:51 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')

Collecting contractions
  Downloading contractions-0.0.58-py2.py3-none-any.whl (8.0 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.2.tar.gz (321 kB)
     |████████████████████████████████| 321 kB 5.2 MB/s 
Collecting anyascii
  Downloading anyascii-0.3.0-py3-none-any.whl (284 kB)
     |████████████████████████████████| 284 kB 39.1 MB/s 
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
  Created wheel for pyahocorasick: filename=pyahocorasick-1.4.2-cp37-cp37m-linux_x86_64.whl size=85440 sha256=6748f7c620be7f02f165daaa404fb5581bed134b2cb6ca493172d3662e358733
  Stored in directory: /root/.cache/pip/wheels/25/19/a6/8f363d9939162782bb8439d886469756271abc01f76fbd790f
Successfully built pyahocorasick
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully install

True

## Load Dataset - Hate Speech

Unfortunately, social media is rife with hate speech in the form of posts and comments. Building an automated hate speech detection system is a practical application of NLP text classification.

In this notebook, we will leverage an open-source collection of hate speech posts and comments.

The dataset is available on [Kaggle](https://www.kaggle.com/usharengaraju/dynamically-generated-hate-speech-dataset) and was curated from a wider [data source for hate speech](https://hatespeechdata.com/).

In [4]:
import pandas as pd

df = pd.read_csv('HateDataset.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40623 entries, 0 to 40622
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      40623 non-null  int64  
 1   id              40623 non-null  object 
 2   text            40623 non-null  object 
 3   label           40623 non-null  object 
 4   type            40623 non-null  object 
 5   model_wrong     26097 non-null  object 
 6   db.model_preds  26097 non-null  float64
 7   status          40623 non-null  object 
 8   round           40623 non-null  object 
 9   split           40623 non-null  object 
 10  annotator       40623 non-null  object 
dtypes: float64(1), int64(1), object(9)
memory usage: 3.4+ MB


To keep things simple, we will focus on predicting the label from the text content alone.

In [5]:
df = df[['text', 'label']]
df.head()

,text,label
0,It should be legal for a husband to rape his wife,hate
1,dalits are lowlife scum,hate
2,dalits are lowlives,hate
3,It was a better world when women dare not ques...,hate
4,I don't work this hard so that those immigrant...,hate


### Split data into train-test datasets

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
train_reviews, test_reviews, train_labels, test_labels = train_test_split(df.text.values,
                                                                          df.label.values,
                                                                          test_size=0.2, random_state=42)

In [8]:
len(train_reviews), len(test_reviews)

(32498, 8125)
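
The two sizes follow from applying `test_size=0.2` to the 40,623 rows reported by `df.info()`. A quick arithmetic check, under the assumption that the test share is rounded up and the remainder goes to the training set:

```python
import math

n_samples = 40623   # rows reported by df.info()
test_size = 0.2

# assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```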

## Text Wrangling and Normalization

In this section, we will normalize our corpus by stripping HTML tags, removing accented characters and newline characters, expanding contractions, and so on. Let's get started.

### **Question 1**: **Complete** the following utility functions (2 points)

__Hint:__ Use the knowledge gained from NLP-1 or the classification tutorial to solve this

In [9]:
import contractions
from bs4 import BeautifulSoup
import numpy as np
import re
from tqdm import tqdm
import unicodedata


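def strip_html_tags(text):
    # use BeautifulSoup to remove HTML tags
    soup = BeautifulSoup(text, "html.parser")
    # drop iframe and script elements together with their contents
    for element in soup(['iframe', 'script']):
        element.extract()
    stripped_text = soup.get_text()
    # collapse runs of carriage returns / newlines into a single newline
    # (the character class must not contain '|', or literal pipes would be replaced too)
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text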

def remove_accented_chars(text):
    # hint use the normalize function from unicodedata
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def pre_process_corpus(docs):
    norm_docs = []
    for doc in tqdm(docs):
        # strip HTML tags
        doc = strip_html_tags(doc)
        # remove extra newlines
        doc = doc.translate(doc.maketrans("\n\t\r", "   "))
        # lower case
        doc = doc.lower()
        # remove accented characters
        doc = remove_accented_chars(doc)
        # fix contractions
        doc = contractions.fix(doc)
        # remove special characters\whitespaces
        # use regex to keep only letters, numbers and spaces
        doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
        # use regex to remove extra spaces
        doc = re.sub(' +', ' ', doc)
        # remove trailing and leading spaces
        doc = doc.strip() 

        norm_docs.append(doc)
  
    return norm_docs

In [10]:
%%time

norm_train_reviews = pre_process_corpus(train_reviews)
norm_test_reviews = pre_process_corpus(test_reviews)

100%|██████████| 32498/32498 [00:05<00:00, 5621.75it/s]
100%|██████████| 8125/8125 [00:01<00:00, 5560.85it/s]

CPU times: user 6.58 s, sys: 680 ms, total: 7.26 s
Wall time: 7.26 s
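
To see what these steps do to a single string, here is a minimal standard-library sketch. It is a simplified stand-in, not the pipeline above: it leaves out HTML stripping and contraction expansion, and `normalize_text` and the sample string are made up for illustration.

```python
import re
import unicodedata


def normalize_text(doc):
    """Simplified normalization: no HTML stripping, no contraction expansion."""
    # replace newlines, tabs and carriage returns with spaces
    doc = doc.translate(str.maketrans("\n\t\r", "   "))
    doc = doc.lower()
    # decompose accented characters and drop the non-ASCII combining marks
    doc = unicodedata.normalize("NFKD", doc).encode("ascii", "ignore").decode("ascii")
    # keep only letters, digits and whitespace
    doc = re.sub(r"[^a-z0-9\s]", "", doc)
    # collapse repeated spaces and trim the ends
    doc = re.sub(r" +", " ", doc).strip()
    return doc


sample = "  Café visitors:\n  100% WELCOME!!  "
print(normalize_text(sample))
```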





## Label Encode Class Labels

Our dataset has two class labels, `hate` and `nothate`. We transform them into a numeric form the model can consume by performing label encoding, which assigns a unique integer to each class. scikit-learn's `LabelEncoder` numbers the classes in sorted order, so here: 
``hate: 0 and nothate: 1``
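
For the two labels in this dataset, the mapping can be sketched in plain Python. This mirrors the sorted-order numbering that `LabelEncoder` uses but is not its implementation; `fit_label_mapping` and the toy labels are hypothetical.

```python
def fit_label_mapping(labels):
    """Map each distinct class name to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {name: index for index, name in enumerate(classes)}


toy_labels = ["nothate", "hate", "hate", "nothate"]
mapping = fit_label_mapping(toy_labels)
encoded = [mapping[name] for name in toy_labels]

print(mapping)
print(encoded)
```

Because `hate` sorts before `nothate`, a sigmoid output close to 1 later in the notebook corresponds to `nothate`.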

In [11]:
import gensim
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Activation, Dense
from sklearn.preprocessing import LabelEncoder

### **Question 2**: **Complete** the following transformations (1 point)

In [12]:
le = LabelEncoder()
# tokenize train reviews & encode train labels
tokenized_train = [nltk.word_tokenize(text)
                       for text in tqdm(norm_train_reviews)]
y_train = le.fit_transform(train_labels)
# tokenize test reviews & encode test labels
tokenized_test = [nltk.word_tokenize(text)
                       for text in tqdm(norm_test_reviews)]
y_test = le.transform(test_labels)

100%|██████████| 32498/32498 [00:05<00:00, 6441.59it/s]
100%|██████████| 8125/8125 [00:01<00:00, 6569.22it/s]


## Feature Engineering based on Word2Vec Embeddings

In the previous notebook we discussed different word embedding techniques such as word2vec, GloVe, and fastText. In this section we will leverage ``gensim`` to transform our dataset into a word2vec representation.

In [13]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### **Question 3**: **Get** feature vectors using Word2Vec (2 points)

Build the word2vec model on your tokenized train data

In [14]:
%%time
# build word2vec model
w2v_num_features = 300
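# use a similar config as the tutorial but use a min_count of 2 and train for 10 iterations
# note: `size` and `iter` are the gensim 3.x argument names; gensim 4.x renamed them to `vector_size` and `epochs`
w2v_model = gensim.models.Word2Vec(tokenized_train, size=w2v_num_features, window=150,
                                   min_count=2, workers=4, iter=10)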

2021-11-22 18:35:11,982 : INFO : collecting all words and their counts
2021-11-22 18:35:11,984 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-11-22 18:35:12,032 : INFO : PROGRESS: at sentence #10000, processed 191882 words, keeping 12443 word types
2021-11-22 18:35:12,078 : INFO : PROGRESS: at sentence #20000, processed 381657 words, keeping 16843 word types
2021-11-22 18:35:12,131 : INFO : PROGRESS: at sentence #30000, processed 569605 words, keeping 19664 word types
2021-11-22 18:35:12,147 : INFO : collected 20219 word types from a corpus of 616684 raw words and 32498 sentences
2021-11-22 18:35:12,148 : INFO : Loading a fresh vocabulary
2021-11-22 18:35:12,187 : INFO : effective_min_count=2 retains 13577 unique words (67% of original 20219, drops 6642)
2021-11-22 18:35:12,188 : INFO : effective_min_count=2 leaves 610042 word corpus (98% of original 616684, drops 6642)
2021-11-22 18:35:12,245 : INFO : deleting the raw counts dictionary of 20219 items
2

CPU times: user 43.5 s, sys: 514 ms, total: 44.1 s
Wall time: 24.7 s
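
The log reports that `min_count=2` retains 13,577 of the 20,219 word types. The idea behind that frequency cutoff can be sketched with a toy corpus; this illustrates the filter only, not gensim's vocabulary builder, and the corpus is invented.

```python
from collections import Counter

toy_corpus = [
    ["this", "is", "a", "post"],
    ["this", "is", "another", "post"],
    ["a", "short", "comment"],
]
min_count = 2

# count how often each token occurs across all tokenized documents
counts = Counter(token for doc in toy_corpus for token in doc)
# keep only tokens that appear at least min_count times
vocabulary = {token for token, count in counts.items() if count >= min_count}

print(sorted(vocabulary))
```

Words below the cutoff get no vector, so they are skipped when document vectors are built.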


## Averaged Document Vectors

A sentence in very simple terms is a collection of words. By now we know how to transform words into vector representation. But how do we transform sentences and documents into vector representation?

A simple and naïve way is to average the vectors of all the words in a sentence to form a single sentence vector. Words that are missing from the embedding vocabulary are skipped. In this section, we will use this technique to prepare our sentence/document vectors.
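
A minimal sketch of the averaging step, using plain Python lists and made-up 3-dimensional embeddings so that it runs without a trained model. `average_vectors` and `toy_embeddings` are hypothetical names used only here.

```python
def average_vectors(tokens, embeddings, num_features):
    """Average the vectors of the tokens that have an embedding."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        # no known words: fall back to an all-zero vector
        return [0.0] * num_features
    # zip(*known) groups the values of each dimension together
    return [sum(values) / len(known) for values in zip(*known)]


# hypothetical 3-dimensional embeddings, for illustration only
toy_embeddings = {
    "good": [1.0, 0.0, 2.0],
    "day": [3.0, 2.0, 0.0],
}

print(average_vectors(["good", "day", "unknown"], toy_embeddings, 3))
print(average_vectors(["unknown"], toy_embeddings, 3))
```

Averaging discards word order, which is one reason this is considered a naïve baseline.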

### **Question 4**: **Complete** the following utility to build a function to generate and obtain averaged document embeddings (3 points)

In [15]:
def averaged_doc_vectorizer(corpus, model, num_features):
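    # set of words the word2vec model has vectors for
    # (gensim 3.x attribute; gensim 4.x renamed it to index_to_key)
    vocabulary = set(model.wv.index2word)
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        # no progress bar here: wrapping this inner loop in tqdm prints one bar per document
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        # documents with no in-vocabulary words keep the all-zero vector
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)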

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in tqdm(corpus)]
    return np.array(features)

In [16]:
# generate averaged word vector features from word2vec model
avg_w2v_train_features = averaged_doc_vectorizer(corpus=tokenized_train, model=w2v_model,
                                                     num_features=w2v_num_features)
avg_w2v_test_features = averaged_doc_vectorizer(corpus=tokenized_test, model=w2v_model,
                                                    num_features=w2v_num_features)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m

100%|██████████| 11/11 [00:00<00:00, 44880.68it/s]
 69%|██████▉   | 5627/8125 [01:39<00:46, 53.34it/s]
100%|██████████| 9/9 [00:00<00:00, 32208.82it/s]

100%|██████████| 8/8 [00:00<00:00, 38926.26it/s]

100%|██████████| 9/9 [00:00<00:00, 41573.50it/s]

100%|██████████| 13/13 [00:00<00:00, 57035.51it/s]

100%|██████████| 50/50 [00:00<00:00, 115418.38it/s]

100%|██████████| 8/8 [00:00<00:00, 42690.12it/s]

100%|██████████| 14/14 [00:00<00:00, 45661.16it/s]
 69%|██████▉   | 5634/8125 [01:40<00:43, 57.48it/s]
100%|██████████| 15/15 [00:00<00:00, 62851.71it/s]

100%|██████████| 12/12 [00:00<00:00, 37117.73it/s]

100%|██████████| 4/4 [00:00<00:00, 20610.83it/s]

100%|██████████| 8/8 [00:00<00:00, 35432.35it/s]

100%|██████████| 59/59 [00:00<00:00, 69375.93it/s]

100%|██████████| 31/31 [00:00<00:00, 52598.47it/s]

100%|██████████| 45/45 [00:00<00:00, 71224.03it/s]
 69%|██████▉   | 5641/8125 [01:40<00:43, 57.62it/s]
100%|███████

In [17]:
print('Word2Vec model:> Train features shape:', avg_w2v_train_features.shape, 
      ' Test features shape:', avg_w2v_test_features.shape)

Word2Vec model:> Train features shape: (32498, 300)  Test features shape: (8125, 300)


## Define DNN Model

Let us leverage ``tensorflow.keras`` to build our deep neural network for the hate speech classification task.
We will make use of ``Dense`` layers with ``ReLU`` activation and ``Dropout`` to reduce overfitting.

### **Question 5**: **Complete** the following utility to build a deep neural network for classification task (3 points)

Use a similar architecture as the tutorial, key components listed below for reference:

- 3 Dense Layers
- 512 - 256 - 256 (neurons)
- 20% dropout in each layer
- 1 output layer for binary classification
- binary crossentropy loss 
- adam optimizer

In [18]:
def construct_deepnn_architecture(num_input_features):
    dnn_model = Sequential()
    dnn_model.add(Dense(512, input_shape=(num_input_features,)))
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(256))
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(256))
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(1))
    dnn_model.add(Activation('sigmoid'))

    dnn_model.compile(loss='binary_crossentropy', optimizer='adam',                 
                      metrics=['accuracy'])
    return dnn_model
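
The parameter counts of this architecture can be worked out by hand: each `Dense` layer has one weight per input-unit pair plus one bias per unit, while `Activation` and `Dropout` layers add none. A small sketch of that arithmetic, where `dense_params` is a hypothetical helper:

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [300, 512, 256, 256, 1]  # input features, three hidden layers, output
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(counts)
print(sum(counts))
```

The first two values can be compared against the `Param #` column that `summary()` prints for `dense` and `dense_1`.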

## Compile and Visualize Model

In [19]:
w2v_dnn = construct_deepnn_architecture(num_input_features=w2v_num_features)

In [20]:
w2v_dnn.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 512)               154112    
                                                                 
 activation (Activation)     (None, 512)               0         
                                                                 
 dropout (Dropout)           (None, 512)               0         
                                                                 
 dense_1 (Dense)             (None, 256)               131328    
                                                                 
 activation_1 (Activation)   (None, 256)               0         
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 256)               6

## Train the Model using Word2Vec Features

The first exercise is to use the averaged word2vec features as input to our deep neural network to perform hate speech classification.

### **Question 6**: **Train** the model (1 point)

In [21]:
batch_size = 64
w2v_dnn.fit(avg_w2v_train_features, y_train, epochs=10, batch_size=batch_size, 
            shuffle=True, validation_split=0.1, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f55965a6a50>

### Evaluate Model

In [22]:
from sklearn.metrics import confusion_matrix, classification_report

In [23]:
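# the sigmoid output is the probability of the class encoded as 1,
# so threshold at 0.5 to get integer class ids before decoding them
y_pred_probs = w2v_dnn.predict(avg_w2v_test_features).ravel()
y_pred = (y_pred_probs > 0.5).astype(int)
predictions = le.inverse_transform(y_pred)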
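
The network's sigmoid output is one probability per document, so it has to be thresholded into class ids before it can be mapped back to class names. A minimal sketch of that step in plain Python, assuming a 0.5 threshold and the sorted class order a label encoder would store; `probabilities_to_labels` and the toy values are hypothetical.

```python
def probabilities_to_labels(probabilities, classes, threshold=0.5):
    """Turn sigmoid outputs into class names; `classes` is in encoder order."""
    return [classes[1] if p > threshold else classes[0] for p in probabilities]


classes = ["hate", "nothate"]          # sorted order, as a label encoder would store them
toy_probabilities = [0.08, 0.91, 0.50, 0.63]

print(probabilities_to_labels(toy_probabilities, classes))
```

The threshold is a design choice: lowering or raising it trades recall on one class against precision on the other.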

### **Question 7**: **Get** evaluation results (1 point)

In [None]:
labels = le.classes_.tolist()
# print classification report
print(classification_report(test_labels, predictions))
# display confusion matrix (rows: true labels, columns: predicted labels)
pd.DataFrame(confusion_matrix(test_labels, predictions, labels=labels), index=labels, columns=labels)
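
To make the report easier to read, here is a small sketch of how a confusion matrix and the precision and recall of one class are derived from paired labels. The toy labels are invented and `confusion_counts` is a hypothetical helper, not the scikit-learn function.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


labels = ["hate", "nothate"]
y_true = ["hate", "hate", "nothate", "nothate", "hate"]
y_pred = ["hate", "nothate", "nothate", "hate", "hate"]

matrix = confusion_counts(y_true, y_pred, labels)
# precision and recall for the "hate" class
true_pos = matrix[0][0]
precision = true_pos / (matrix[0][0] + matrix[1][0])
recall = true_pos / (matrix[0][0] + matrix[0][1])

print(matrix, precision, recall)
```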

Congratulations, you have built your first hate speech detection model!

We will look at more complex models in the future to see whether we can improve on this performance, given that this is a fairly complex dataset and domain compared to basic sentiment analysis.