# Multi-Class Text Classification with word2vec and CNNs 

In this notebook, I train a classifier that can classify input text into 12 classes

I use the pre-trained word2vec embeddings to generate the tokens from the text 

I then apply 1D convolutions of varying kernel sizes on the tokenized text to combine the pre-trained embeddings from word2vec which can help discover useful patterns in text. 

I then use a Softmax layer to classify the outcome of combining the outputs of several kernel convolutions to output the class probabilities for each instance. 

Reasons why this approach is chosen: 
- Plenty of data is available for training a NN
- An attempt to train end-to-end and not worry about feature engineering
- To use feature vectors of low dimensions such as 300 in this example vs. a few 100k features as in TfIdf vectorization
- Curious how CNNs will learn the spatial depedency in the text


In [1]:
# Imports

import re
import nltk
import torch
import gensim
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TreebankWordTokenizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from gensim.models import KeyedVectors
from torch.utils.data import TensorDataset, DataLoader

# Loading, tidying and sampling data 

In the below block, I perform the following functions. 

- Read the raw data from 'complaints_users.csv' and 'products.csv'
- Merge the two tables on 'PRODUCT_ID' to get the raw text data and the class information into one frame
- Drop columns that I won't be processing or using  
- Merging classes
- Sampling the majority classes to reduce the class imbalance
- That creates a 'new_df' which has the necessary data for further processing

## On merging classes: 
EDA has revealed that there are classes where the class name of one class is a sub-string of another class name. 
The below code snippet is an attempt to merge such classes.

Below are the rules of the merger:
- Class A's name should be fully contained in class B's name 
- Class A will be merged with Class B - this is to keep the longer class name intact as it has Class A's tags and much more 

In [2]:
# Input data into dataframes 
complaints_users = pd.read_csv('../data/complaints_users.csv')
products = pd.read_csv('../data/products.csv')

# Merge tables to create a unified dataset with predictors and response 
df = pd.merge(complaints_users, products, left_on="PRODUCT_ID", right_on="PRODUCT_ID", how="left")

# Drop columns that are not required
df = df[["COMPLAINT_TEXT", "PRODUCT_ID", "MAIN_PRODUCT", "SUB_PRODUCT"]]
df = df.drop_duplicates()
df = df.reset_index()

# Merging classes
df.loc[df["MAIN_PRODUCT"]=="Credit card", "MAIN_PRODUCT"] = "Credit card or prepaid card"
df.loc[df["MAIN_PRODUCT"]=="Prepaid card", "MAIN_PRODUCT"] = "Credit card or prepaid card"
df.loc[df["MAIN_PRODUCT"]=="Payday loan", "MAIN_PRODUCT"] = "Payday loan, title loan, or personal loan"
df.loc[df["MAIN_PRODUCT"]=="Money transfers", "MAIN_PRODUCT"] = "Money transfer, virtual currency, or money service"
df.loc[df["MAIN_PRODUCT"]=="Virtual currency", "MAIN_PRODUCT"] = "Money transfer, virtual currency, or money service"
df.loc[df["MAIN_PRODUCT"]=="Credit reporting", "MAIN_PRODUCT"] = "Credit reporting, credit repair services, or other personal consumer reports"

# groupby "main_products" and perform majaority undersampling 
grouped_complaints = df.groupby("MAIN_PRODUCT")
new_df = pd.DataFrame()
for name, group in grouped_complaints:
    if group.shape[0] > 10000:
        chosen_records = group.sample(n=10000, axis=0, random_state=9)
    else:
        chosen_records = group
    new_df = pd.concat([new_df, chosen_records])

# the new_df is ready
new_df = df
new_df = new_df.reset_index(drop=True)


  interactivity=interactivity, compiler=compiler, result=result)


## Text tidy tasks
Cleaning text so that it is ready for further processing 

In [3]:
# Some basic text tidy job is done here 

# regex to remove anything other than word and space - i.e, punctuations 
remove_punctuation = re.compile('[^\w\s]')

# regex to remove xxxx usually credit card entries - do not use
remove_xxxx = re.compile('\sx+x')

# regex to remove digits - do not use
remove_digits = re.compile('\d')

# stopwords corpora 
stopwords = set(stopwords.words('english'))

# this is a good lemmatizer that reduces nouns to their correct root form but leaves the verbs out
stemmer = WordNetLemmatizer()

# this tokenizer splits not only on space but on punctuation too
tokenizer = TreebankWordTokenizer()

# function to clean the text
def text_cleaning(text):
    text = text.lower()
    text = remove_punctuation.sub('', text)
    #text = remove_xxxx.sub('', text)
    #text = remove_digits.sub('', text)
    text = tokenizer.tokenize(text)
    text = ' '.join(stemmer.lemmatize(word) for word in text if word not in stopwords)
    return text

# Using apply to apply the above function on the COMPLAINT_TEXT series 
new_df["COMPLAINT_TEXT"] = new_df["COMPLAINT_TEXT"].apply(text_cleaning)

# Below is an attempt to remove outlier text snippets that are too short 
lengths = new_df["COMPLAINT_TEXT"].apply(lambda x: len(x))

# short texts are those that have character count less than 100 - only for the purpose of this excercise 
short_texts = lengths[lengths < 100]

# drop rows that have very short texts 
new_df.drop(short_texts.index, inplace=True)

# this here is a spurious entry which has high character count but has absolutely no spaces 
# removing it
new_df.drop([79984], inplace=True)
new_df.reset_index(drop=True)


Unnamed: 0,index,COMPLAINT_TEXT,PRODUCT_ID,MAIN_PRODUCT,SUB_PRODUCT
0,0,xxxx transunion reporting incorrectly 120 day ...,26,"Credit reporting, credit repair services, or o...",Credit reporting
1,1,xxxx transunion reporting incorrectly 120 day ...,26,"Credit reporting, credit repair services, or o...",Credit reporting
2,2,xxxx xxxx experian need remove collection acco...,26,"Credit reporting, credit repair services, or o...",Credit reporting
3,3,3 company inconsistency violation double jeopa...,26,"Credit reporting, credit repair services, or o...",Credit reporting
4,4,personal loan patriot finance incorrectly repo...,26,"Credit reporting, credit repair services, or o...",Credit reporting
...,...,...,...,...,...
346275,383059,used money gram send mother xxxx xxxx xxxx nev...,57,"Money transfer, virtual currency, or money ser...",International money transfer
346276,383060,sent letter got response unfortunately victim ...,35,Debt collection,I do not know
346277,383062,requested credit score paid special reduced fe...,28,"Credit reporting, credit repair services, or o...",
346278,383064,originally attended xxxx xxxx located xxxx xxx...,91,Student loan,Non-federal student loan


# Encoding class labels 

Class names are strings so far. Since I'm going to be using Cross Entropy Loss, I need the labels to be numeric. So I encode them to integers 

In [4]:
# Encoding the class labels to integers 
label_encoder = LabelEncoder()
new_df["classes"] = label_encoder.fit_transform(new_df["MAIN_PRODUCT"])
label_class_mapping = dict(zip(label_encoder.transform(label_encoder.classes_), label_encoder.classes_))

# Loading pre-trained embeddings 

Loading the slim Google news vectors below. I have downloaded them to /word2vec

In [5]:
# loading the word2vec embeddings for lookup 
lookup = KeyedVectors.load_word2vec_format('../word2vec/GoogleNews-vectors-negative300-SLIM.bin', 
                                                 binary=True)

In [6]:
# Making a note of all the available words in this embedding lookup
available_words = []
for word in lookup.vocab:
    available_words.append(word)
    
# Information on embeddings and vocabulory 
print("Size of Vocab: {}\n".format(len(available_words)))
print('Dimension of each word embedding: {}\n'.format(lookup[available_words[0]].shape))

Size of Vocab: 299567

Dimension of each word embedding: (300,)



# Tokenization

The lookup we created earlier holds a lot of words (those stored in "available_words") and their embeddings. Our corpus which is a list of complaints also will surely contain the same words, therefore embeddings are straighaway useful. 

To identify the words we have in our corpus with the words in pre-trained embeddings lookup, we assign to each word in our corpus, the identifier of our word in the pre-trained corpus. 

In [42]:
# function to tokenize our corpus
def tokenize_complaints(lookup, complaint):
    # here we can split on whitespace as previously we have already removed the punctuations 
    complaint_words = complaint.split(' ')
    tokens = []
    for word in complaint_words:
        try:
            idx = lookup.vocab[word].index
        except:
            idx = 0
        tokens.append(idx)    
    return tokens

In [43]:
# apply the tokenization function defined above to the series of our interest 
new_df["tokens"] = new_df["COMPLAINT_TEXT"].apply(lambda x: tokenize_complaints(lookup, x))

In [45]:
# EDA has revealed that the mean length of text in corpus is ~110. The median is ~80 which is less than mean. We have 
# a distribution that is skewed right.  

# To make the lengths of all of our text sequences identical, we pad the sequence with 0. 0 is also used while tokenization
# to use as a token for "unknown words(UNK)" or words not present in "available_words" dictionary

# max sequence length we would like our texts to have 
# texts with length > seq_length are truncated 
seq_length = 200

# min length below which we pad the 0s on the left of the tokens
min_length = 25

# function to perform padding or truncation of tokens 
def pad_or_truncate_tokens(tokens):
    n = len(tokens)
    if n <= min_length:
        pad_sequence = [0] * (seq_length-n)
        pad_sequence.extend(tokens)
        tokens = pad_sequence
    elif n > seq_length:
        tokens = tokens[0:seq_length]
    elif n > min_length and n < seq_length:
        pad_sequence = [0] * (seq_length-n)
        tokens.extend(pad_sequence)
    return tokens

In [46]:
# apply the function defined above for token paddings / truncation 
new_df["padded_tokens"] = new_df["tokens"].apply(lambda x: pad_or_truncate_tokens(x))

# Model training

In [47]:
# we generate a numpy matrix of the tokens below 
features = np.zeros((new_df.shape[0], seq_length), dtype=int)

for idx, row in enumerate(new_df["padded_tokens"]):
    for col, token in enumerate(row):
        features[idx, col] = token

In [48]:
# using sklearn's train_test_split with 'stratified sampling' so that the partitions have roughly the same 
# class balance as the original dataset
X = features
y = new_df["classes"]

# splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify=new_df["classes"])

# test is being split again into validation and test 
X_val, X_test, y_val, y_test =  train_test_split(X_test, y_test, test_size = 0.5)

In [49]:
# Converting numpy arrays to tensors for processing in Pytorch
train_data = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train.to_numpy()))
val_data = TensorDataset(torch.from_numpy(X_val), torch.from_numpy(y_val.to_numpy()))
test_data = TensorDataset(torch.from_numpy(X_test), torch.from_numpy(y_test.to_numpy()))

# batch size needs to be chosen before - keeping it low so that it is memory efficient 
batch_size = 256

# batching data into three dataloaders to ensure data flow during train/val/test
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, drop_last=True)
valid_loader = DataLoader(val_data, shuffle=True, batch_size=batch_size, drop_last=True)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size, drop_last=True)

In [50]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


In [51]:
import torch.nn as nn
import torch.nn.functional as F

class ClassifierCNN(nn.Module):
    """
    The embedding layer + CNN model that will be used to perform classification.
    """

    def __init__(self, word2vec_lookup, vocab_size, output_size, embedding_dim,
                 num_filters=100, kernel_sizes=[3, 4, 5, 6], freeze_embeddings=True, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(ClassifierCNN, self).__init__()

        # set from input parameters 
        self.num_filters = num_filters
        self.embedding_dim = embedding_dim
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # set weights to pre-trained so that training is not needed, but they can be trained too
        self.embedding.weight = nn.Parameter(torch.from_numpy(word2vec_lookup.vectors)) 
        if freeze_embeddings:
            self.embedding.requires_grad = False
        
        # Convolutional layers - with kernel sizes [3,4,5] to cover 3,4,5 grams - totally 300 kernels 
        # padding is used to aid with edges (top & bottom)
        self.convs_1d = nn.ModuleList([
            nn.Conv2d(1, num_filters, (k, embedding_dim), padding=(k-2,0)) 
            for k in kernel_sizes])
        
        # Fully-connected layer for classification
        self.fc = nn.Linear(len(kernel_sizes) * num_filters, output_size) 
        
        # Dropout and Softmax activation 
        self.dropout = nn.Dropout(drop_prob)
        self.softmax = nn.Softmax(dim=1)
        
    
    def convolution_and_pooling(self, x, conv):
        """
        Convolutional + max pooling layer
        """
        # squeeze last dim to get size: (batch_size, num_filters, conv_seq_length)
        # conv_seq_length will be ~ 200
        x = F.relu(conv(x)).squeeze(3)
        
        # 1D pool over conv_seq_length
        # squeeze to get size: (batch_size, num_filters)
        x_max = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x_max

    def forward(self, x):
        """
        Defines how a batch of inputs, x, passes through the model layers.
        Returns a single, sigmoid-activated class score as output.
        """
        # embedded vectors
        embeddings = self.embedding(x) # (batch_size, seq_length, embedding_dim)
        # embeds.unsqueeze(1) creates a channel dimension that conv layers expect
        embeddings = embeddings.unsqueeze(1)
        
        # get output of each convolution-max-pooling layer
        conv_output = [self.convolution_and_pooling(embeddings, conv) for conv in self.convs_1d]
        
        # concatenate results and add dropout
        x = torch.cat(conv_output, 1)
        x = self.dropout(x)
        
        # apply linear layer 
        fc_out = self.fc(x) 
        
        # softmax-activation for num_classes - np.argmax on axis=1 for class output
        return self.softmax(fc_out)
      

In [52]:
# Setting model hyper-parameters 
vocab_size = len(available_words)
output_size = len(y.unique()) # num_classes after merging 
embedding_dim = len(lookup[available_words[0]]) # 300-dim vectors
num_filters = 100
kernel_sizes = [3, 4, 5]

model = ClassifierCNN(lookup, vocab_size, output_size, embedding_dim,
                   num_filters, kernel_sizes)

print(model)

ClassifierCNN(
  (embedding): Embedding(299567, 300)
  (convs_1d): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(3, 300), stride=(1, 1), padding=(1, 0))
    (1): Conv2d(1, 100, kernel_size=(4, 300), stride=(1, 1), padding=(2, 0))
    (2): Conv2d(1, 100, kernel_size=(5, 300), stride=(1, 1), padding=(3, 0))
  )
  (fc): Linear(in_features=300, out_features=12, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (softmax): Softmax(dim=1)
)


In [53]:
# loss and optimization functions
lr=0.001 # chosen empirically - could experiment if time permits

# choosing loss criterion and optimizer 
# cross entropy loss as we have multiple classes
criterion = nn.CrossEntropyLoss() 

# Adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [54]:
# function to train the neural network
def train(model, train_loader, epochs, print_every=100):

    # move model to GPU, if available
    if(train_on_gpu):
        model.cuda()

    counter = 0 # for printing
    
    # mark model for training 
    model.train()
    for e in range(epochs):

        # looping through the batches from train_loader 
        for inputs, labels in train_loader:
            counter += 1

            if(train_on_gpu):
                inputs, labels = inputs.cuda(), labels.cuda()

            # reset grads - this helps reset on each epoch
            model.zero_grad()

            # prediction from the model
            output = model(inputs)

            # calculate loss and back propogate 
            #print(output.shape)
            #print(labels.shape)
            loss = criterion(output.squeeze(), labels.long())
            loss.backward()
            optimizer.step()
            
            # validation 
            if counter % print_every == 0:
                val_losses = []
                
                # mark the model for eval - so no gradients are accumulated or back propagated 
                model.eval()
                for inputs, labels in valid_loader:

                    if(train_on_gpu):
                        inputs, labels = inputs.cuda(), labels.cuda()
                    
                    # prediction on validation sample 
                    output = model(inputs)
                    val_loss = criterion(output.squeeze(), labels.long())

                    val_losses.append(val_loss.item())
                
                # Reset model for train for next epoch
                model.train()
                
                # print stats 
                print("Epoch: {}/{}...".format(e+1, epochs), "Step: {}...".format(counter), "Loss: {:.6f}...".format(loss.item()),
                      "Val Loss: {:.6f}".format(np.mean(val_losses)))

In [55]:
# training params

num_epochs = 1
print_every = 100

train(model, train_loader, num_epochs, print_every=print_every)

Epoch: 1/1... Step: 100... Loss: 2.042921... Val Loss: 2.055054
Epoch: 1/1... Step: 200... Loss: 1.976489... Val Loss: 1.970317
Epoch: 1/1... Step: 300... Loss: 1.975717... Val Loss: 1.953417
Epoch: 1/1... Step: 400... Loss: 1.949370... Val Loss: 1.926835
Epoch: 1/1... Step: 500... Loss: 1.904049... Val Loss: 1.905294
Epoch: 1/1... Step: 600... Loss: 1.915051... Val Loss: 1.904817
Epoch: 1/1... Step: 700... Loss: 1.904727... Val Loss: 1.899374
Epoch: 1/1... Step: 800... Loss: 1.920650... Val Loss: 1.894605
Epoch: 1/1... Step: 900... Loss: 1.900077... Val Loss: 1.893631
Epoch: 1/1... Step: 1000... Loss: 1.874893... Val Loss: 1.892006


In [56]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0


model.eval()
# iterate over test data
for inputs, labels in test_loader:

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output = model(inputs)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.long())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = np.argmax(output.detach().cpu(), axis=1)#torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.detach().cpu().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 1.895
Test accuracy: 0.724


In [57]:
def predict(lookup, model, complaint_text, pad_length=200):
    
    # mark the model for evaluation
    model.eval()
    
    # get the integer tokens from lookup
    integer_tokens = tokenize_complaints(lookup, complaint_text)
    
    # pad or truncate the complaint if required 
    tokens = pad_or_truncate_tokens(integer_tokens)
    
    # convert the integer tokens to numpy features and then tensordata
    features = np.zeros((1, len(tokens)), dtype="int")
    for i in range(len(features)):
        features[i] = tokens[i]
    features = torch.from_numpy(features)
    
    # pass the text through the model
    batch_size = features.shape[0]
    if train_on_gpu:
        features = features.cuda()
    output = model(features)
    
    # find the class predicted 
    pred = np.argmax(output.detach().cpu(), axis=1)
    pred = np.array(pred)[0]
    
    # let's find the name of the class
    predicted_class = label_class_mapping[pred]
    
    return predicted_class
        

In [59]:
# some driver code for using predict 
random_sample = np.random.randint(0, new_df.shape[0])
complaint_text = new_df.iloc[random_sample]["COMPLAINT_TEXT"]
predicted_class = predict(lookup, model, complaint_text, pad_length=200)
print("Complaint text: {}".format(complaint_text), "\nPredicted class: {}".format(predicted_class), 
      "\nTrue class: {}".format(new_df.iloc[random_sample]["MAIN_PRODUCT"]))

Complaint text: received unsolicited credit card business ready activation upon arrival scary part actual card 15000000 limit addressed former board director active organization 15 year xxxx company never heard often receive solicitation credit application never one like would afraid someone could card like company credit headache would ensue thank help matter 
Predicted class: Credit card or prepaid card 
True class: Credit card or prepaid card
