# Text Classification using Word Embeddings and Dense Neural Network Models

## Building a Hate Speech Classifier

Understanding text content and predicting whether a post contains hate speech is a form of supervised machine learning. More specifically, we will use classification models to solve this problem. In the subsequent sections we will build an automated hate speech text classification system. The major steps are as follows.

+ Prepare train and test datasets (optionally a validation dataset)
+ Pre-process and normalize text documents
+ Feature engineering
+ Model training
+ Model prediction and evaluation

These are the major steps for building our system. Optionally, the last step would be to deploy the model on your own server or in the cloud. The following figure shows a detailed workflow for building a standard text classification system with supervised learning (classification) models.

In our scenario, documents are the posts/comments and classes indicate whether a post is hate speech or not. Each label is either `hate` or `nothate`, which makes this a binary classification problem. We will build models using deep learning in the subsequent sections.

__Fill the sections marked with blanks or `<YOUR CODE HERE>`__

In [None]:
!nvidia-smi

In [None]:
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')

## Load Dataset - Hate Speech

Social media is unfortunately rampant with hate speech in the form of posts and comments. Building an automated hate speech detection system is a practical example of using NLP for text classification.

In this notebook, we will leverage an open-source collection of hate speech posts and comments.

The dataset is available on [Kaggle](https://www.kaggle.com/usharengaraju/dynamically-generated-hate-speech-dataset), which in turn has been curated from a wider [data source for hate speech](https://hatespeechdata.com/).

In [None]:
import pandas as pd

df = pd.read_csv('HateDataset.csv')
df.info()

To keep things simple, we will focus on predicting the labels from the text content.

In [None]:
df = df[['text', 'label']]
df.head()

### Split data into train-test datasets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_reviews, test_reviews, train_labels, test_labels = train_test_split(df.text.values,
                                                                          df.label.values,
                                                                          test_size=0.2, random_state=42)

In [None]:
len(train_reviews), len(test_reviews)
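The split above is purely random. As an optional variation that this notebook does not use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below runs on made-up toy data; all `toy_*` names are hypothetical.

```python
from sklearn.model_selection import train_test_split

# toy data (hypothetical): 4 'hate' posts and 6 'nothate' posts
toy_posts = [f'post {i}' for i in range(10)]
toy_classes = ['hate'] * 4 + ['nothate'] * 6

# stratify=toy_classes keeps the 40/60 class ratio in both halves
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_posts, toy_classes, test_size=0.5, random_state=42, stratify=toy_classes)

print(len(toy_train_x), len(toy_test_x))
print(list(toy_test_y).count('hate'), list(toy_test_y).count('nothate'))
```

Stratification tends to matter most when one class is much rarer than the other.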

## Text Wrangling and Normalization

In this section, we will normalize our corpus by removing accented characters, newline characters and so on. Let's get started.
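To make the idea concrete, here is a minimal sketch of what a few of these normalization steps do to a single string. It uses only the standard library and deliberately skips HTML stripping and contraction handling; the helper name `demo_normalize` is hypothetical and is not one of the utilities you are asked to complete.

```python
import re
import unicodedata

def demo_normalize(text):
    # decompose accented characters, then drop the non-ASCII combining marks
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # replace newlines with spaces and lower-case everything
    text = text.replace('\n', ' ').lower()
    # keep only letters, digits and whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # collapse runs of whitespace and trim both ends
    return re.sub(r'\s+', ' ', text).strip()

print(demo_normalize("Café\nRÉSUMÉ   is  here!!"))
```

The order of the steps matters: lower-casing before the regex lets the pattern use `a-z` only.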

### **Question 1**: **Complete** the following utility functions (2 points)

__Hint:__ Use the knowledge gained from NLP-1 or the classification tutorial to solve this

In [None]:
import contractions
from bs4 import BeautifulSoup
import numpy as np
import re
from tqdm import tqdm
import unicodedata


def strip_html_tags(text):
    # hint use beautifulsoup to remove html tags
    <YOUR CODE HERE>

def remove_accented_chars(text):
    # hint use the normalize function from unicodedata
    <YOUR CODE HERE>

def pre_process_corpus(docs):
    norm_docs = []
    for doc in tqdm(docs):
        # strip HTML tags
        doc = <YOUR CODE HERE>
        # remove extra newlines
        doc = <YOUR CODE HERE>
        # lower case
        doc = <YOUR CODE HERE>
        # remove accented characters
        doc = <YOUR CODE HERE>
        # fix contractions
        doc = <YOUR CODE HERE>
        # remove special characters/whitespaces
        # use regex to keep only letters, numbers and spaces
        doc = <YOUR CODE HERE>
        # use regex to remove extra spaces
        doc = <YOUR CODE HERE>
        # remove trailing and leading spaces
        doc = <YOUR CODE HERE>

        norm_docs.append(doc)
  
    return norm_docs

In [None]:
%%time

norm_train_reviews = pre_process_corpus(train_reviews)
norm_test_reviews = pre_process_corpus(test_reviews)

## Label Encode Class Labels

Our dataset has labels in the form of `hate` and `nothate` classes. We transform them into a consumable form by performing label encoding, which assigns a unique numerical value to each class. For example:
``hate: 0 and nothate: 1``
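A small sketch of label encoding on toy labels is shown here, assuming the two class names used in this dataset. `LabelEncoder` sorts the class names, so the alphabetically first class receives 0. The `demo_*` names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

demo_le = LabelEncoder()
# fit learns the sorted set of classes, transform maps each label to its index
demo_encoded = demo_le.fit_transform(['hate', 'nothate', 'nothate', 'hate'])

print(list(demo_le.classes_))
print(demo_encoded.tolist())
# inverse_transform maps the integers back to the original class names
print(demo_le.inverse_transform([1, 0]).tolist())
```

Fit the encoder on the training labels only and reuse the same fitted encoder for the test labels, so both share one mapping.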

In [None]:
import gensim
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Activation, Dense
from sklearn.preprocessing import LabelEncoder

### **Question 2**: **Complete** the following transformations (1 point)

In [None]:
le = LabelEncoder()
# tokenize train reviews & encode train labels
tokenized_train = <YOUR CODE HERE>
y_train = <YOUR CODE HERE>
# tokenize test reviews & encode test labels
tokenized_test = <YOUR CODE HERE>
y_test = <YOUR CODE HERE>

## Feature Engineering based on Word2Vec Embeddings

In the previous notebook we discussed different word embedding techniques like word2vec, GloVe, fastText, etc. In this section we will leverage ``gensim`` to transform our dataset into a word2vec representation.

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### **Question 3**: **Get** feature vectors using Word2Vec (2 points)

Build the word2vec model on your tokenized train data

In [None]:
%%time
# build word2vec model
w2v_num_features = 300
# use a similar config as the tutorial but use a min_count of 2 and train for 10 iterations
w2v_model = <YOUR CODE HERE>

## Averaged Document Vectors

A sentence in very simple terms is a collection of words. By now we know how to transform words into vector representation. But how do we transform sentences and documents into vector representation?

A simple and naïve way is to average the vectors of all words in a given sentence to form a sentence vector. In this section, we will use this technique to prepare our sentence/document vectors.
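The averaging idea can be sketched with plain NumPy and a hand-made lookup table standing in for a trained embedding model. The names `toy_vectors` and `toy_average` are hypothetical, and returning a zero vector for documents with no known words is one possible convention, assumed here.

```python
import numpy as np

# hand-made 3-dimensional "embeddings" for two words (hypothetical values)
toy_vectors = {
    'good': np.array([1.0, 0.0, 2.0]),
    'day': np.array([3.0, 2.0, 0.0]),
}

def toy_average(tokens, vectors, num_features):
    # keep only tokens that have a vector; unknown words are skipped
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # no known words at all: fall back to a zero vector
        return np.zeros(num_features)
    # element-wise mean across the word vectors
    return np.mean(known, axis=0)

doc_vec = toy_average(['good', 'day', 'unseen'], toy_vectors, 3)
print(doc_vec)
```

Every document ends up with a vector of the same length regardless of how many words it contains, which is what a dense network needs as input.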

### **Question 4**: **Complete** the following utility to build a function to generate and obtain averaged document embeddings (3 points)

In [None]:
def averaged_doc_vectorizer(corpus, model, num_features):
    <YOUR CODE HERE>

In [None]:
# generate averaged word vector features from word2vec model
avg_w2v_train_features = <YOUR CODE HERE>
avg_w2v_test_features = <YOUR CODE HERE>

In [None]:
print('Word2Vec model:> Train features shape:', avg_w2v_train_features.shape, 
      ' Test features shape:', avg_w2v_test_features.shape)

## Define DNN Model

Let us leverage ``tensorflow.keras`` to build our deep neural network for the hate speech classification task.
We will make use of ``Dense`` layers with ``ReLU`` activation, and ``Dropout`` to reduce overfitting.
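To show what these two building blocks compute, here is a NumPy-only sketch of a single dense layer with ReLU followed by dropout at training time. It is an illustration under simplifying assumptions (tiny hand-picked weights, hypothetical `toy_*` helper names), not the Keras implementation itself.

```python
import numpy as np

rng = np.random.default_rng(42)

def toy_dense_relu(x, weights, bias):
    # affine transform followed by ReLU, which zeroes out negative values
    return np.maximum(0.0, x @ weights + bias)

def toy_dropout(x, rate, rng):
    # inverted dropout: zero a random fraction of units, rescale the survivors
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

x = np.array([[1.0, -2.0, 0.5]])                      # one sample, 3 features
weights = np.array([[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]])
bias = np.array([0.0, 0.5])

hidden = toy_dense_relu(x, weights, bias)
print(hidden)
dropped = toy_dropout(hidden, rate=0.2, rng=rng)
print(dropped)
```

Rescaling by `1 / (1 - rate)` keeps the expected activation unchanged, so no adjustment is needed at prediction time, when dropout is switched off.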

### **Question 5**: **Complete** the following utility to build a deep neural network for classification task (3 points)

Use an architecture similar to the tutorial's; the key components are listed below for reference:

- 3 Dense Layers
- 512 - 256 - 256 (neurons)
- 20% dropout in each layer
- 1 output layer for binary classification
- binary crossentropy loss 
- adam optimizer
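For reference, the binary cross-entropy loss listed above can be written out in a few lines of NumPy. This is a sketch of the formula only; the function name `toy_binary_crossentropy` and the clipping constant `eps` are assumptions made for the example.

```python
import numpy as np

def toy_binary_crossentropy(y_true, y_prob, eps=1e-7):
    y_true = np.asarray(y_true, dtype=float)
    # clip probabilities away from 0 and 1 so the logarithms stay finite
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    # mean of -[y*log(p) + (1-y)*log(1-p)] over all samples
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))

loss_correct = toy_binary_crossentropy([1, 0], [0.9, 0.1])  # confident and right
loss_wrong = toy_binary_crossentropy([1, 0], [0.1, 0.9])    # confident and wrong
print(loss_correct, loss_wrong)
```

The loss grows sharply when the model is confident and wrong, which is the behaviour that drives the sigmoid output toward the correct class during training.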

In [None]:
def construct_deepnn_architecture(num_input_features):
    <YOUR CODE HERE>
    return dnn_model

## Compile and Visualize Model

In [None]:
w2v_dnn = construct_deepnn_architecture(num_input_features=w2v_num_features)

In [None]:
w2v_dnn.summary()
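A quick way to sanity-check the summary is to compute the expected number of trainable parameters by hand. The sketch below assumes the 300-dimensional input and the 512 - 256 - 256 - 1 layout listed earlier, with a bias in every dense layer and no other trainable layers; if your architecture differs, the numbers will too.

```python
# input size followed by the width of each dense layer (assumed layout)
layer_sizes = [300, 512, 256, 256, 1]

# a dense layer has n_in * n_out weights plus n_out biases
param_counts = [n_in * n_out + n_out
                for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(param_counts)
print(sum(param_counts))
```

Dropout layers add no parameters, so they should appear in the summary with a count of 0.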

## Train the Model using Word2Vec Features

The first exercise is to leverage word2vec features as input to our deep neural network to perform hate speech classification.

### **Question 6**: **Train** the model (1 point)

In [None]:
batch_size = 64
w2v_dnn.<YOUR CODE HERE>

### Evaluate Model

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
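# threshold the sigmoid probabilities at 0.5 (predict_classes is unavailable in TensorFlow >= 2.6)
y_pred = (w2v_dnn.predict(avg_w2v_test_features) > 0.5).astype('int32').ravel()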
predictions = le.inverse_transform(y_pred) 
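Both metric functions imported above compare a list of true labels with a list of predicted labels. The sketch below runs them on a handful of made-up labels so the output format is easy to read; the `toy_*` names and values are hypothetical.

```python
from sklearn.metrics import classification_report, confusion_matrix

toy_true = ['hate', 'hate', 'nothate', 'nothate', 'hate']
toy_pred = ['hate', 'nothate', 'nothate', 'nothate', 'hate']
toy_label_order = ['hate', 'nothate']

# per-class precision, recall and F1, plus overall accuracy
print(classification_report(toy_true, toy_pred, labels=toy_label_order))

# rows are true classes, columns are predicted classes, in toy_label_order
toy_cm = confusion_matrix(toy_true, toy_pred, labels=toy_label_order)
print(toy_cm)
```

Passing `labels` explicitly fixes the row and column order, which makes the matrix easier to interpret.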

### **Question 7**: **Get** evaluation results (1 point)

In [None]:
labels = <YOUR CODE HERE>
# print classification report
<YOUR CODE HERE>
# display confusion matrix
<YOUR CODE HERE>

Congratulations, you have built your first hate speech detection model!

We will look at more complex models in the future to see if we can improve this performance, given that this is a fairly complex dataset and domain compared to basic sentiment analysis.