# Setup

Installed needed libraries and dataset from Kaggle for this assignment.
Imports are all located at the top of the notebook.

In [2]:
# install only once
# ! pip install -q kaggle
# ! pip install -q transformers
# ! pip install -q --upgrade keras-nlp
# ! pip install tensorflow --upgrade


# Download the dataset from kaggle (kaggle api key required) or upload manually zip
# ! kaggle datasets download -d shashwatwork/consume-complaints-dataset-fo-nlp

# unzip it
#! tar -zxf consume-complaints-dataset-fo-nlp.zip

tar: This does not look like a tar archive
tar: Skipping to next header

gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now


In [41]:
from transformers import (set_seed,
                          TrainingArguments,
                          Trainer,
                          AdamW,
                          get_linear_schedule_with_warmup,
                          BertTokenizer, TFBertForSequenceClassification, InputExample, InputFeatures,
                          TFDistilBertForSequenceClassification, AutoTokenizer)
# from torch.utils.data import DataLoader, TensorDataset, random_split
import torch
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score
from sklearn.metrics import classification_report
import tensorflow as tf

# import keras_nlp
# import keras_core as keras
# from tensorflow.keras.utils import to_categorical
# from tensorflow.keras.losses import CategoricalCrossentropy
# from tensorflow.keras.metrics import CategoricalAccuracy
# from tensorflow.keras.utils import to_categorical
# from tensorflow.keras.layers import Input, Dense

# from tensorflow import keras
# from tensorflow.keras import layers

# Set seed for reproducibility.
set_seed(58)

# Look for gpu to use. Will use `cpu` by default if no gpu found.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# EDA

I found that the dataset did contain some null values and decided to remove those rows. I also found the the dataset was imbalanced which a majority of the products being credit_reporting. The length of the narratives also varied so i knew some padding and truncation would be needed when tokenizing the data. There was also an extra column that was just a copy of the index so i removed that as well.

In [5]:
# if uploaded zip manually
df = pd.read_csv('/content/consume-complaints-dataset-fo-nlp.zip', index_col=False)

# if downloaded using kaggle and unzipped it
# df = pd.read_csv('complaints_processed.csv', index_col=False)

# remove rows with null values
df = df.dropna()
if 'Unnamed: 0' in df.columns:
    df = df.drop('Unnamed: 0', axis=1)
print("\nPost removal of null values: \n\n", df.isnull().sum())
print("\nDistribution of products: \n\n", df['product'].value_counts())
print("\nUnique product values: \n\n", df['product'].unique().tolist())
print("\nNarative sentence word count describe:\n ", df['narrative'].apply(lambda x: len(str(x).split())).describe())


Post removal of null values: 

 product      0
narrative    0
dtype: int64

Distribution of products: 

 credit_reporting       91172
debt_collection        23148
mortgages_and_loans    18990
credit_card            15566
retail_banking         13535
Name: product, dtype: int64

Unique product values: 

 ['credit_card', 'retail_banking', 'credit_reporting', 'mortgages_and_loans', 'debt_collection']

Narative sentence word count describe:
  count    162411.000000
mean         80.232798
std         108.872213
min           1.000000
25%          27.000000
50%          50.000000
75%          95.000000
max        2685.000000
Name: narrative, dtype: float64


# Data Preprocessing

My biggest struggle was determining which format to use for the labels. I decided to use the one hot encoding method, but went down two different routes. I learned that both routes work, but in the end you need an array of arrays (i.e. [[0,0,1,0,0],[0,1,0,0,0]... ]) coresponding with a 1 for the right feature it belongs to.

In my second cell I define what my global variables are and split the data into train and test. I would want to continue and do a validate set as well, but for the sake of time i did not. From my EDA, I was able to determine that the mean was 80 words and 95 words for the 75%. I thought that by making my batch 128, I would be able to capture a lot of the narratives in one batch. I used 512 for max length because that's what the BERT model was trained on and I was going to add padding and trunction to account for the narratives that were not 512 words long.

I approached the assignment by understanding the data types and reading on the different approaches between tensor and pytorch and which parts were similar in terms of conversion tools and training the models. So a big advantage in starting a project like this, is to create reusable code and quickly testing small portions, which is why I made a test_data and val_data which are just smaller subsets of _train and _test data.

In [6]:
def label_to_int(label):
    """ Convert label to int.
    Returns: List of converted labels.
    """
    global label_dict
    label_dict = {'credit_card': 0, 'retail_banking': 1, 'credit_reporting': 2, 'mortgages_and_loans': 3, 'debt_collection': 4}
    # Convert labels to integers
    return label_dict[label]

# convert product to label as an integer (0-4)
df['label'] = df['product'].apply(label_to_int)

# calculate the number of words in each narrative
# Notice that type in each row didn't defautl to string, so i had to convert to string prior to splitting
df['num_words'] = df['narrative'].apply(lambda x: len(str(x).split(" ")))
product_labels = df['product'].unique().tolist()

# create columns for each product label
for label in product_labels:
    df[label] = df['product'].apply(lambda x: 1 if x == label else 0)

# create columns
display(df.head(5))

Unnamed: 0,product,narrative,label,num_words,credit_card,retail_banking,credit_reporting,mortgages_and_loans,debt_collection
0,credit_card,purchase order day shipping amount receive pro...,0,230,1,0,0,0,0
1,credit_card,forwarded message date tue subject please inve...,0,132,1,0,0,0,0
2,retail_banking,forwarded message cc sent friday pdt subject f...,1,173,0,1,0,0,0
3,credit_reporting,payment history missing credit report speciali...,2,131,0,0,1,0,0
4,credit_reporting,payment history missing credit report made mis...,2,123,0,0,1,0,0


In [52]:
# Initializing multiclasification labels
label_vals = [0,1,2,3,4]
labels = ['credit_card', 'retail_banking', 'credit_reporting', 'mortgages_and_loans', 'debt_collection']
num_labels = len(labels)

# create traing and test sets
X = df['narrative'].values
y = df[labels].values

#  Batch size and the max length of the sequence
batch_size = 128
max_length = 512

# Use this to train the whole dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=58, stratify=y)

# Tried to train the whole data set but took too long
# test_data = X_train
# test_labels = y_train

# val_data = X_test
# val_labels = y_test

# create smaller test and validation data to test working flow with
test_data = X_train[:10000]
test_labels = y_train[:10000]

val_data = X_test[:200]
val_labels = y_test[:200]

In [53]:
print(len(test_labels))
print(len(val_labels))

10000
200


## Model tokenization and conversion of input features

I tested using the BERT-base-uncased because that's what I'm familiar with from the last assignment and I didn't fully get the last assignment to work. What I did do more purposely in this assignment is to use tensorflow and to use the appropriate TF SequenceClassification model for this tokenized data. I later noticed that distilBert also has a normal DistilBertTokenizer and TFDistilBertForSequenceClassification.

In [42]:
model_name = "bert-base-uncased"
model_name_2 = "distilbert-base-uncased"
# tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_2, do_lower_case=True)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [54]:
# Converting test and validation data to InputExamples
train_examples = [InputExample(guid=str(i), text_a=test_data[i], label=test_labels[i]) for i in range(len(test_data))]
print(train_examples[0])

test_examples = [InputExample(guid=str(i), text_a=val_data[i], label=val_labels[i]) for i in range(len(val_data))]

InputExample(guid='0', text_a='diagnosed treatment included removing large portion basically paid remove huge area specialized xxxxxxxx area make look le unfortunately serious problem walked around year upsetting happy alive free notified eventually agreed redo year month two painful week wearing huge bandage satisfied result total bill service huge paid almost original invoice amount refused pay final felt pain suffering brutal opinion took year month two painful week wearing huge bandage reduce total price goodwill gesture refused adding insult injury hired collection agency collection agency listed bill disclosing medical condition everyone violation hippa law gla collection listed item bill major collection agency medical shared major credit reporting agency asked gla collection remove credit report refused terrible many level gla think acceptable knowingly share medical condition would like entire collection account removed credit report especially since credit report year delique

I enjoyed using both InputExample and InputFeatures because it helped me understand the data types and how to convert them. I also learned that the input_ids are the input words encoded and the attention_mask is identification of where the enconding begin and end with the padding and truncation up to the max length. I also learned that the best approach was to have the labels are the one hot encoded labels at this point. This was a big paint point to get the data just right. By making it reusable I was able to test it on the smaller test_data and val_data sets, and can build on it in the future.

In [55]:
# Convert InputExamples to InputFeatures
def convert_examples_to_features(examples, tokenizer, max_length=max_length):
    features = []
    for example in examples:
        inputs = tokenizer.encode_plus(
            example.text_a,
            add_special_tokens=True,
            padding='max_length',
            truncation=True,
            return_attention_mask = True,
            # return_token_type_ids=True,
            max_length=max_length,
        )
        features.append(InputFeatures(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], label=example.label))
    return features

train_features = convert_examples_to_features(train_examples, tokenizer, max_length=max_length)
test_features = convert_examples_to_features(test_examples, tokenizer, max_length=max_length)

This was the final step to convert the features into the larger object of a tensor flow dataset. I was able to use the from_tensor_slices method to convert the features into a dataset. I also used the shuffle method to shuffle the data and the batch method to create the batches.

In [56]:
# Convert InputFeatures to TensorFlow datasets
def convert_features_to_tf_dataset(features):
    all_input_ids = []
    all_attention_masks = []
    all_labels = []

    for feature in features:
        all_input_ids.append(feature.input_ids)
        all_attention_masks.append(feature.attention_mask)
        all_labels.append(feature.label)

    # Convert lists to TF tensors
    tf_ds = tf.data.Dataset.from_tensor_slices(({"input_ids": all_input_ids, "attention_mask": all_attention_masks}, all_labels))
    return tf_ds

train_dataset = convert_features_to_tf_dataset(train_features)
test_dataset = convert_features_to_tf_dataset(test_features)

In [57]:
batch_dataset = train_dataset.shuffle(1000).batch(10, drop_remainder=True).shuffle(100)
batch_test_dataset = test_dataset.shuffle(1000).batch(10, drop_remainder=True).shuffle(100)

print(len(batch_dataset))
print(len(batch_test_dataset))

1000
20


# Model definition, training, and evaluation

## Model definition and intstantiation

The biggest part of the assignment was choosing a model you were comfortable with. Since I've never trained a model end to end, I wanted to look at some of the newer ways to do it, so I could reuse this pattern and potentially deploy the model. Tensor flow is also Keras friendly, and I knew that would help with getting metrics. I quickly learned I could keras could initialize the Adam optimizer and has some built in metrics that are easy to implement. Similar to a pipeline I would say, but there are probably some more advances ways of doing that. One configuration that I had to dig into was what from_logits was and why it was needed. I learned that it was needed because the model was not trained on the softmax function, so it needed to be added to the model. T

he learning_rate of 5e-05 the learning rate for the model was used from hugging face documentation. I also learned that the loss function was categorical_crossentropy because the labels were one hot encoded. Balanced accuracy will take care of our average accuracy for all the classes. (balanced_accuracy). The thresholds needed to be 0 because the logits was true.

In [60]:
#model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=5)
model = TFDistilBertForSequenceClassification.from_pretrained(model_name_2, num_labels=num_labels)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-05, epsilon=1e-08, clipnorm=1.0),
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits = True),
              metrics=[tf.keras.metrics.CategoricalAccuracy('balanced_accuracy'),
                       tf.keras.metrics.Recall(thresholds=0),
                       tf.keras.metrics.Precision(thresholds=0)])

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

## Model training

I was able to use fit the data to the model and use the validation data to see how the model was doing. I was able to get the accuracy and loss metrics, recall, and precision from the model. I noticed ...

In [61]:
train_history = model.fit(batch_dataset, epochs=3, verbose=1)
model.summary()

Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  3845      
                                                                 
 dropout_247 (Dropout)       multiple                  0         
                                                                 
Total params: 66957317 (255.42 MB)
Trainable params: 66957317 (255.42 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


## Model evaluation and Prediction

In [62]:
predicted_raw = model.predict(batch_test_dataset).logits



In [63]:
probs = tf.nn.softmax(predicted_raw, axis=-1)
y_predicted = tf.argmax(probs, axis=1).numpy()
predicted_label = y_predicted[0]
print(f"Predicted label: {predicted_label}")

# Convert y_true to label-encoded format
y_true = np.argmax(val_labels, axis=1)
print("Sample Prediction", y_predicted)
print("True values of Prediction range", y_true)
print(classification_report(y_true, y_predicted, zero_division=1))

Predicted label: 0
Sample Prediction [0 3 2 2 2 3 2 4 3 2 2 3 4 1 4 2 2 2 2 2 0 4 0 2 2 4 0 2 2 2 0 1 2 2 3 3 2
 2 3 2 4 2 4 2 2 2 4 2 2 2 3 2 3 4 2 4 3 2 2 2 1 0 0 2 2 2 2 2 2 0 2 2 1 2
 2 2 1 2 2 2 2 4 2 4 0 0 2 4 2 2 3 1 3 2 3 1 2 4 2 2 1 0 1 2 2 2 2 4 4 2 4
 3 0 2 1 2 2 3 4 3 3 2 2 2 2 4 4 2 2 2 0 4 0 3 2 2 4 2 2 4 4 2 4 1 2 2 2 2
 2 2 2 2 2 2 2 4 2 2 4 3 2 0 4 2 4 2 1 2 4 0 2 4 1 3 2 4 2 1 4 2 4 0 1 3 2
 3 0 2 3 0 4 4 2 2 2 2 2 1 2 1]
True values of Prediction range [3 2 2 2 4 3 0 4 2 2 2 2 2 2 2 2 2 2 2 4 2 1 0 2 2 2 2 3 4 3 2 2 1 1 2 2 4
 4 2 1 2 2 4 2 4 3 4 2 0 1 0 2 2 0 0 0 2 1 0 0 2 0 4 0 2 3 1 2 2 2 2 1 0 2
 0 0 2 3 2 4 4 2 2 2 1 2 2 1 2 3 2 3 2 2 2 1 4 3 2 3 2 2 2 1 4 4 2 4 2 2 2
 1 2 2 2 0 2 2 2 2 3 2 0 2 0 4 4 3 2 1 2 2 4 3 2 2 2 2 2 4 3 4 2 2 0 4 0 3
 3 3 2 2 2 2 4 2 4 2 2 3 2 2 2 4 1 2 2 2 4 2 3 4 0 4 2 2 4 3 2 4 4 2 2 2 2
 2 2 3 4 0 2 2 2 2 0 2 2 2 3 2]
              precision    recall  f1-score   support

           0       0.16      0.14      0.15        22
        