<a href="https://colab.research.google.com/github/sayyed-uoft/TSSA/blob/main/Vector_Institute_Text_Classification_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Institute - TSSA Intro to AI 

### Thank you for joining Day 2 of the Vector Institute, 'Intro to AI' workshop series.

If you have any questions or if you would like to learn more about this program, contact: learn@vectorinstitute.ai

# Case Study 2: Text Classification


In this notebook, we will use a pre-trained deep learning model to process some text. We will then use the output of that model to classify the text. The text is a list of narratives from consumer complaints, and we will classify each narrative into one of the complaint classes.

## Models: Consumer Complaint Classification
Our goal is to create a model that takes the text of a complaint and produces the class code.

Under the hood, the model is actually made up of two models.

* DistilBERT processes the text and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
* The next model, a basic Logistic Regression model from scikit-learn, will take in the result of DistilBERT’s processing and classify the text. We will train both binary and multi-class classifiers and will explain the methods used to evaluate the results.

The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding for the narrative that we can use for classification.

## Problem
Each week, the Consumer Financial Protection Bureau sends thousands of consumers’ complaints about financial products and services to companies for a response. The task is to classify each complaint into the product category it belongs to, using the description of the complaint.

## Dataset
The dataset is a small subset of data extracted from the Data.gov website. We extracted only a very small part because of the memory limits of Google Colab. The data is already clean. It is made up of two columns:

1. **Product:** the complaint class
1. **Consumer complaint narrative:** the text of the complaint

## Installing the transformers library
Let's start by installing the Hugging Face transformers library so we can load our deep learning NLP model. We also import the required Python packages.

In [None]:
!pip install transformers

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix
import torch
import transformers as ppb
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/sayyed-uoft/TSSA/main/customer_complaints_samples.csv')

Let's look at the data:

In [None]:
df

Let's look at the distribution of the products (labels):

In [None]:
df['Product'].value_counts()

In [None]:
df['Product'].value_counts().plot.bar()
plt.show()

## Loading the Pre-trained BERT model
Let's now load a pre-trained DistilBERT model and its tokenizer.

In [None]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Right now, the variable `model` holds a pre-trained DistilBERT model, a version of BERT that is smaller, much faster, and requires a lot less memory.

## Preparing the Dataset
Before we can hand our narratives to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the narratives: break them up into words and subwords in the format BERT is comfortable with. The model accepts at most 512 tokens, so we truncate longer narratives.

In [None]:
# Tokenize the narratives
tokenized = df['Consumer complaint narrative'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True)))

In [None]:
# View a few tokenized samples
tokenized.head()
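To make the idea of words and subwords concrete, here is a toy sketch of greedy longest-match subword splitting, the general approach behind BERT-style WordPiece tokenizers. Everything in it is an assumption for illustration: `TOY_VOCAB`, its IDs, and the helpers `wordpiece` and `toy_encode` are hypothetical, and the real tokenizer also handles punctuation, normalization, and a far larger vocabulary.

```python
# Hypothetical toy vocabulary with made-up IDs (not the real DistilBERT vocabulary)
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "my": 4,
             "credit": 5, "report": 6, "##ing": 7, "is": 8, "wrong": 9}


def wordpiece(word, vocab):
    """Split one lowercase word into subword pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_encode(text, vocab, max_length=8):
    """Tokenize, truncate to max_length (special tokens included), map to IDs."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    return [vocab[t] for t in tokens]


full_ids = toy_encode("My credit reporting is wrong", TOY_VOCAB)
short_ids = toy_encode("My credit reporting is wrong", TOY_VOCAB, max_length=6)
print(full_ids)
print(short_ids)
```

Note how "reporting" is split into `report` and `##ing`, and how truncation drops tokens from the end while keeping the two special tokens.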


### Padding
After tokenization, `tokenized` is a pandas Series of narratives, and each narrative is represented as a list of token IDs. We want BERT to process our examples all at once (as one batch). It's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array rather than a list of lists (of different lengths).

In [None]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

Our dataset is now in the `padded` variable. We can view its dimensions below:

In [None]:
np.array(padded).shape
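To see the padding step on a small scale, here is a minimal sketch with three made-up token-ID lists (the IDs are illustrative, not real tokenizer output). It pads with 0 on the right, using the same approach as the cell above.

```python
import numpy as np

# Made-up token-ID lists of different lengths (illustrative values only)
toy_tokenized = [[101, 2023, 102], [101, 2026, 4923, 3189, 102], [101, 102]]

# Find the longest list, then right-pad every list with 0 up to that length
toy_max_len = max(len(ids) for ids in toy_tokenized)
toy_padded = np.array([ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized])

print(toy_padded.shape)
print(toy_padded)
```

Every row now has the length of the longest list, so the batch fits in a single 2-d array.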

### Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

In [None]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape
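As a small illustration of what the mask looks like, the sketch below applies the same `np.where` call to a made-up padded batch (the token IDs are illustrative). It relies on the assumption that 0 is used only for padding and never for a real token.

```python
import numpy as np

# Made-up padded batch: the first row has two padding positions at the end
toy_padded = np.array([[101, 2023, 102, 0, 0],
                       [101, 2026, 4923, 3189, 102]])

# 1 marks a real token, 0 marks padding that the model should ignore
toy_mask = np.where(toy_padded != 0, 1, 0)
print(toy_mask)
```

The mask has the same shape as the padded input, with one 0/1 flag per token position.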

## Narrative Embeddings
Now that we have our model and inputs ready, let's run our model!

The `model()` function runs our narratives through BERT. The results of the processing will be returned into `last_hidden_states`.

In [None]:
# Note: This will take a while
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Let's slice only the part of the output that we need: the output corresponding to the first token of each narrative. For sentence classification, BERT adds a token called `[CLS]` (for classification) at the beginning of every input. The output corresponding to that token can be thought of as an embedding for the entire narrative.

We'll save those in the `features` variable, as they'll serve as the features for our logistic regression model.

In [None]:
features = last_hidden_states[0][:,0,:].numpy()
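The slicing `[:, 0, :]` can be easier to follow on a stand-in array. The sketch below uses random numbers in place of real model output; the batch size of 4 and sequence length of 6 are arbitrary assumptions, and only the hidden size of 768 comes from the text above.

```python
import numpy as np

# Stand-in for the model output: (narratives, token positions, hidden size)
batch_size, seq_len, hidden_size = 4, 6, 768
rng = np.random.default_rng(0)
toy_hidden = rng.normal(size=(batch_size, seq_len, hidden_size))

# Keep every narrative, only token position 0 ([CLS]), and all hidden units
toy_features = toy_hidden[:, 0, :]
print(toy_features.shape)
```

The result has one 768-dimensional row per narrative, which is the 2-d shape scikit-learn expects for its input features.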

## Classification

The last step is to use the narrative representations as the input of a simple linear classification model. For the output we will use the index of the associated categories.

We will train two models:
1. Binary classification (if the complaint’s type is ‘Credit reporting, credit repair services, or other personal consumer reports’)
1. Multi-class classification


## Binary Classification

The labels indicating whether each narrative belongs to the credit reporting class (positive) or not (negative) now go into the `labels` variable:

In [None]:
labels = df['Product'] == 'Credit reporting, credit repair services, or other personal consumer reports'
labels

In [None]:
labels.value_counts().plot.bar()
plt.show()

### Model Training and Validation
We now train and validate a LogisticRegression model. We will use `cross_val_score` to perform 5-fold cross-validation, and we choose "accuracy" as the score.

In [None]:
lr_clf = LogisticRegression()
scores = cross_val_score(lr_clf, features, labels, scoring='accuracy', cv=5)
print("Score is {:.2f} +- {:.2f}".format(scores.mean(), 2*scores.std()))
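To show what `cross_val_score` returns, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the `max_iter=1000` setting are assumptions made for this illustration and are not part of the workshop pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the (features, labels) pair used in the notebook
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# cv=5 splits the data into 5 folds; each fold is held out once for scoring
toy_scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy,
                             scoring="accuracy", cv=5)

print(toy_scores)  # one accuracy value per held-out fold
print("Score is {:.2f} +- {:.2f}".format(toy_scores.mean(), 2 * toy_scores.std()))
```

The reported score is the mean of the five fold accuracies, and the `+-` part is twice their standard deviation, a rough indication of how much the score varies between folds.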

How good is this score? What can we compare it against? Let's first look at a dummy classifier. A dummy classifier is a classifier that makes predictions using simple rules. By default, it always predicts the class that maximizes the class prior, that is, the most frequent class.

In [None]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, features, labels)
print("Score is {:.2f} +- {:.2f}".format(scores.mean(), 2*scores.std()))
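The link between the dummy score and class imbalance can be checked on made-up labels. In the sketch below, the 80/20 split and the all-zero feature matrix are assumptions chosen to keep the arithmetic simple.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Made-up imbalanced labels: 80 negatives and 20 positives
y_toy = np.array([False] * 80 + [True] * 20)
X_toy = np.zeros((100, 1))  # the dummy classifier ignores the features

# Always predict the most frequent class seen during training
toy_scores = cross_val_score(DummyClassifier(strategy="prior"), X_toy, y_toy, cv=5)

majority_share = max(y_toy.mean(), 1 - y_toy.mean())
print(toy_scores.mean(), majority_share)
```

Because the dummy classifier always predicts the majority class, its accuracy matches the share of that class in the data. A real model on imbalanced data has to beat that share, not 50%, to be doing anything useful.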

So our model clearly does better than a dummy classifier. However, the data is not balanced, so accuracy alone is not a good score. We should use a confusion matrix to analyze the results, along with the precision and recall scores.


In [None]:
# Calculate and print confusion matrix
pred = cross_val_predict(lr_clf, features, labels, cv=5)
conf_mx = confusion_matrix(labels, pred)
conf_mx

In [None]:
# Calculate recall scores
scores = cross_val_score(lr_clf, features, labels, scoring='recall', cv=5)
print("Score is {:.2f} +- {:.2f}".format(scores.mean(), 2*scores.std()))

In [None]:
# Calculate precision scores
scores = cross_val_score(lr_clf, features, labels, scoring='precision', cv=5)
print("Score is {:.2f} +- {:.2f}".format(scores.mean(), 2*scores.std()))
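Precision and recall can both be read off the binary confusion matrix. The sketch below works through the arithmetic on ten made-up labels and predictions (assumed values, not results from the complaint data).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions (1 = positive class)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

In this toy case, accuracy is 7 out of 10 even though half of the actual positives are missed, which is why precision and recall are worth reporting alongside accuracy.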

## Multi-class Classification

In [None]:
# convert classes to class numbers
factorized = df['Product'].factorize()
labels_multi = factorized[0]
labels_text = factorized[1]
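To show what `factorize` returns, here is a minimal sketch on a tiny made-up Series (the product names are illustrative values, not rows from the dataset).

```python
import pandas as pd

# Made-up product labels for illustration
toy_products = pd.Series(["Mortgage", "Credit card", "Mortgage", "Student loan"])

# codes: one integer per row; uniques: the label text for each integer
codes, uniques = toy_products.factorize()

print(codes)
print(list(uniques))
print(uniques[codes[3]])  # map a class number back to its text label
```

Class numbers are assigned in order of first appearance, and indexing `uniques` with a code recovers the original label, which is how `labels_text` can be used to name the classes later.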

In [None]:
# Calculate accuracy scores
scores = cross_val_score(lr_clf, features, labels_multi, scoring='accuracy', cv=5)
print("Score is {:.2f} +- {:.2f}".format(scores.mean(), 2*scores.std()))

In [None]:
# Calculate accuracy scores (Dummy Classifier)
scores = cross_val_score(clf, features, labels_multi)
scores

In [None]:
# Calculate and print confusion matrix
pred = cross_val_predict(lr_clf, features, labels_multi, cv=5)

In [None]:
conf_mx = confusion_matrix(labels_multi, pred)
conf_mx

In [None]:
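# Plot the multi-class confusion matrix, showing only the errors.
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums  # each row becomes the share of that true class
np.fill_diagonal(norm_conf_mx, 0)  # hide correct predictions so errors stand out
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.xticks(range(len(labels_text)), labels_text, rotation=90)
plt.yticks(range(len(labels_text)), labels_text)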
plt.show()

Try to interpret the confusion matrix.
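As a hint for reading the plot, the sketch below applies the same normalization to a made-up 3-class confusion matrix (assumed counts) and locates the largest off-diagonal cell. Rows are true classes and columns are predicted classes.

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predictions
toy_conf_mx = np.array([[8, 2, 0],
                        [1, 7, 2],
                        [0, 3, 7]])

# Divide each row by its total so classes of different sizes are comparable
row_sums = toy_conf_mx.sum(axis=1, keepdims=True)
toy_norm = toy_conf_mx / row_sums

# Zero the diagonal so only the misclassifications remain
np.fill_diagonal(toy_norm, 0)

# The brightest cell in the plot corresponds to the largest remaining value
worst_true, worst_pred = np.unravel_index(toy_norm.argmax(), toy_norm.shape)
print(toy_norm)
print(worst_true, worst_pred)
```

In this toy matrix, the largest value sits in row 2, column 1: narratives whose true class is 2 are most often mislabeled as class 1. Bright cells in the real plot can be read the same way.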

# Contact Information

Congratulations, you have completed the tutorial for Day 2 of the Vector Institute 'Intro to AI' program! Thank you for your time and attention.


*   Instructor: Sayyed Nezhadi 
*   Program Director: Shingai Manjengwa 
*   Contact: learn@vectorinstitute.ai

Never stop learning!