<a href="https://colab.research.google.com/github/utkarshkant/Bank-Complaints-Text-Classification/blob/main/Bank_Complaints_Multiclass_Text_Classifier_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Objective
Build a bank complaints classification model (multi-class text classifier) with Tensorflow.


In [31]:
# imports
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
import tensorflow_hub as hub  # for the pre-trained embedding layer

import os
import datetime
import numpy as np
import pandas as pd
pd.set_option("display.max_colwidth", -1)
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import class_weight

  # This is added back by InteractiveShellApp.init_path()


In [5]:
# download training data
df = pd.read_csv("https://github.com/utkarshkant/Bank-Complaints-Text-Classification/blob/main/consumer_compliants.zip?raw=true", 
                 compression='zip', 
                 sep=",", 
                 quotechar='"')

In [6]:
# preview data
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,4/3/2020,Vehicle loan or lease,Loan,Getting a loan or lease,Fraudulent loan,"This auto loan was opened on XX/XX/2020 in XXXX, NC with BB & T in my name. I have NEVER been to North Carolina and I have NEVER been a resident. I have filed a dispute twice through my credit bureaus but both times BB & T has claimed that this is an accurate loan. Which I wasn't aware of until today. I have tried to contact BB & T multiple times but I have never gotten through to a live person. I do n't drive and I have never owned a car before. I didn't have any knowledge of this account until I checked XXXXXXXX XXXX and noticed it. I've tried twice to dispute it. Additionally I never received any bills or information about this account. This is my last resort in trying to remove this fraudulent loan off of my account.",Company has responded to the consumer and the CFPB and chooses not to provide a public response,TRUIST FINANCIAL CORPORATION,PA,,,Consent provided,Web,4/3/2020,Closed with explanation,Yes,,3591341
1,3/12/2020,Debt collection,Payday loan debt,Attempts to collect debt not owed,Debt is not yours,"In XXXX of 2019 I noticed a debt for {$620.00} on my credit which i believed was mine I thought speedy cash had bought one of my old debts and sold it to XXXX XXXX XXXX XXXX. I contacted XXXX XXXX XXXX XXXX and after several attempts of giving my full name, nothing came up in their system. I gave my social and the rep said the account popped up but DID NOT tell me that the account was under someone elses name and continued to let me make a payment. The payment was for {$120.00}. Confirmation number-XXXX. After realizing it was not my account, I called back to get my money back and inform them of the mistake. I was told i needed to mail them an FTC report and dispute letter to get my money back. I completed all of this and when i called again they said they transferred the account back to speedy cash for fraud review and I would need to contact them. After contacting them i was again told that i can not get my money back. The issue im having is this representative at XXXX XXXX played blind to obvious fraud and let an innocent person make a payment on someone elses debt and i want my money back.",,CURO Intermediate Holdings,CO,806XX,,Consent provided,Web,3/12/2020,Closed with explanation,Yes,,3564184
2,2/6/2020,Vehicle loan or lease,Loan,Getting a loan or lease,Credit denial,"As stated from Capital One, XXXX XX/XX/XXXX and XXXX 2018, My wife and I went to several car dealerships to request for a car loan to get a used car. However, according to their credit requirements unfortunately my credit score was insufficient for the car loan approval at that time. It seemed as though they pulled my credit report multiple times.",,CAPITAL ONE FINANCIAL CORPORATION,OH,430XX,,Consent provided,Web,2/6/2020,Closed with explanation,Yes,,3521949
3,3/6/2020,Checking or savings account,Savings account,Managing an account,Banking errors,"Please see CFPB case XXXX. \n\nCapital One, in the letter they provided ( and attached to that case as their response ) said this : "" The funds were reversed and sent back to XXXX XXXX XXXX on XX/XX/XXXX ''. \n\nXXXX XXXX XXXX ( now XXXX XXXX ) has not received these funds. Staff at XXXX XXXX - and also staff at the account-holder 's business - have looked for return of my money ( {$650.00} ) and find nothing. \n\nCapital One needs to document - actually prove - they returned the funds, as stated in their letter. Capital One must provide electronic information, if the return was made that way, or document the paper check they sent back to XXXX XXXX. \n\nI've left 3 messages about this problem for the person who signed the letter ( XXXX ) from Capital One. I have received no call-backs. \n\nSummary : Capital One said they returned my money on XX/XX/XXXX : they did not. If they continue claim they did, then they need to prove that.",,CAPITAL ONE FINANCIAL CORPORATION,CA,,,Consent provided,Web,3/6/2020,Closed with explanation,Yes,,3556237
4,2/14/2020,Debt collection,Medical debt,Attempts to collect debt not owed,Debt is not yours,"This debt was incurred due to medical malpractice ( XXXX XXXX XXXX, XXXX, TX ). I asked the doctor to turn over my claim to his malpractice insurance company. This has cost me thousands of dollars to XXXX XXXX XXXX. I am still trying to collect damages from this doctor. He never responded and turned over me to collections Merchants and Professional Collection Bureau , Inc. I sent them a letter describing exactly this issue and instead of not contacting me and verifying my debt they start reporting this debt to the credit reporting agencies. They never verified the debt, like I asked and they never stopped it from being reported when I specifically told them not to, due to the circumstances above.",Company believes it acted appropriately as authorized by contract or law,"Merchants and Professional Bureau, Inc.",OH,432XX,,Consent provided,Web,2/14/2020,Closed with explanation,Yes,,3531704


In [7]:
# features
df.columns

Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Consumer consent provided?',
       'Submitted via', 'Date sent to company', 'Company response to consumer',
       'Timely response?', 'Consumer disputed?', 'Complaint ID'],
      dtype='object')

In [8]:
# target feature
df['Product'].value_counts()

Debt collection                21772
Credit card or prepaid card    13193
Mortgage                       9799 
Checking or savings account    7003 
Student loan                   2950 
Vehicle loan or lease          2736 
Name: Product, dtype: int64

The data appears to be imbalanced



In [9]:
# split dataset
X_train, X_test = train_test_split(df, test_size=0.2, random_state=111)

In [10]:
# handling imbalanced dataset
class_weights = list(class_weight.compute_class_weight("balanced",
                                                       np.unique(df['Product']),
                                                       df['Product']))
class_weights.sort()
print(class_weights)   # less represented classes have been assigned higher weights

[0.43980801028844385, 0.7258015614340938, 0.9771915501581794, 1.3673425674710837, 3.2459322033898306, 3.4998172514619883]


In [11]:
# convert weights from list to dictionary with a class label and value
# for processing with Keras
weights = {}
for index, weight in enumerate(class_weights):
  weights[index] = weight

print(weights)

{0: 0.43980801028844385, 1: 0.7258015614340938, 2: 0.9771915501581794, 3: 1.3673425674710837, 4: 3.2459322033898306, 5: 3.4998172514619883}


In [12]:
# train and test data convert into Tensorflow data
data_train = tf.data.Dataset.from_tensor_slices((X_train['Consumer complaint narrative'].values, X_train['Product'].values))
data_test = tf.data.Dataset.from_tensor_slices((X_test['Consumer complaint narrative'].values, X_test['Product'].values))

In [13]:
# preview data
for text, target in data_train.take(5):
  print(f"Complaint: {text}, Target: {target}")

Complaint: b"The below complaint was submitted to the CFPB numerous times prior, the Wells Fargo rep, XXXX XXXX, replies with the same general form response stating numerous attempts at resolution have been made and exhausted which is a lie. He also attempts to state he can not comment due to past litigation which is also a lie, he reverts to this reply so as to avoid detailing any supposed attempt at resolution which he can not as there is none. Further, he should be aware of resolution which is to refund fees totaling {$610.00} and has not done so, no refund received to date. Please see complaint below. The below complaint was submitted prior, a duplicate form response received from XXXX XXXX with Wells Fargo, one of several. Based on this, my complaint was not addressed. Further, a reply in XXXX was never received as was mentioned. XXXX replied by stating numerous attempts at resolution have been made but has not detailed one, there was no contact from anyone at Wells Fargo aside fr

In [14]:
# preview data
for text, target in data_test.take(5):
  print(f"Complaint: {text}, Target: {target}")

Complaint: b'I have a business checking account at BB & T. On XX/XX/2019, I attempted to deposit a check into my account and I received a message stating that I was over my monthly mobile deposit limit. I was confused because it was the first of the month and I had not deposited any checks since the previous month. I called BB & T and they said that I couldnt deposit checks into business accounts via the mobile app even though I had done that before. \n\nI was instructed to open a personal account, into which I could deposit checks via the mobile app. I was told that if I opened the account online I would have immediate access, that I could link my personal and business accounts, and immediately be able to transfer money between them. \n\nOn XX/XX/XXXX, I opened my personal account online. Though I successfully opened online, I did not have online access as I had been promised. Because I was traveling in an area where there were no BB & T branches, I could not go into a branch until XX

In [15]:
# convert target column into numerical representation with StaticHashTable
# Hash Table is a key value pair
table = tf.lookup.StaticHashTable(
    initializer = tf.lookup.KeyValueTensorInitializer(
        keys = tf.constant(['Debt collection','Credit card or prepaid card','Mortgage','Checking or savings account','Student loan','Vehicle loan or lease']),
        values = tf.constant ([0,1,2,3,4,5]),
    ),
    default_value = tf.constant(-1),
    name = "target _encoding"
)

# a tf function for lookup of target into encoded values
@tf.function
def target(x):
  return table.lookup(x)

In [16]:
# preview data
def show_batch(dataset, size=5):
  for batch, label in dataset.take(size):
    print(batch.numpy())
    print(target(label).numpy())

show_batch(data_test, 6)

b'I have a business checking account at BB & T. On XX/XX/2019, I attempted to deposit a check into my account and I received a message stating that I was over my monthly mobile deposit limit. I was confused because it was the first of the month and I had not deposited any checks since the previous month. I called BB & T and they said that I couldnt deposit checks into business accounts via the mobile app even though I had done that before. \n\nI was instructed to open a personal account, into which I could deposit checks via the mobile app. I was told that if I opened the account online I would have immediate access, that I could link my personal and business accounts, and immediately be able to transfer money between them. \n\nOn XX/XX/XXXX, I opened my personal account online. Though I successfully opened online, I did not have online access as I had been promised. Because I was traveling in an area where there were no BB & T branches, I could not go into a branch until XX/XX/2019. I

In [17]:
# apply target encoding to the dataset

# UDF for target class one-hot encoding
def fetch(text, labels):
  return text, tf.one_hot(target(labels), 6)
# apply one-hot encoding to data
train_data_f = data_train.map(fetch)
test_data_f = data_test.map(fetch)  

In [18]:
next(iter(train_data_f))

(<tf.Tensor: shape=(), dtype=string, numpy=b"The below complaint was submitted to the CFPB numerous times prior, the Wells Fargo rep, XXXX XXXX, replies with the same general form response stating numerous attempts at resolution have been made and exhausted which is a lie. He also attempts to state he can not comment due to past litigation which is also a lie, he reverts to this reply so as to avoid detailing any supposed attempt at resolution which he can not as there is none. Further, he should be aware of resolution which is to refund fees totaling {$610.00} and has not done so, no refund received to date. Please see complaint below. The below complaint was submitted prior, a duplicate form response received from XXXX XXXX with Wells Fargo, one of several. Based on this, my complaint was not addressed. Further, a reply in XXXX was never received as was mentioned. XXXX replied by stating numerous attempts at resolution have been made but has not detailed one, there was no contact fro

In [19]:
train_data, train_labels = next(iter(train_data_f.batch(5)))
train_data, train_labels

(<tf.Tensor: shape=(5,), dtype=string, numpy=
 array([b"The below complaint was submitted to the CFPB numerous times prior, the Wells Fargo rep, XXXX XXXX, replies with the same general form response stating numerous attempts at resolution have been made and exhausted which is a lie. He also attempts to state he can not comment due to past litigation which is also a lie, he reverts to this reply so as to avoid detailing any supposed attempt at resolution which he can not as there is none. Further, he should be aware of resolution which is to refund fees totaling {$610.00} and has not done so, no refund received to date. Please see complaint below. The below complaint was submitted prior, a duplicate form response received from XXXX XXXX with Wells Fargo, one of several. Based on this, my complaint was not addressed. Further, a reply in XXXX was never received as was mentioned. XXXX replied by stating numerous attempts at resolution have been made but has not detailed one, there was no 

Build Model

In [20]:
# create an embedding layer

# pretrained layer on Google News
embedding = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1"
hub_layer = hub.KerasLayer(embedding,
                           output_shape=[128],
                           input_shape=[],
                           dtype=tf.string,
                           trainable=True)
hub_layer(train_data[:1])

<tf.Tensor: shape=(1, 128), dtype=float32, numpy=
array([[ 1.92570090e+00,  1.04540564e-01,  1.63910031e-01,
        -1.11070231e-01, -5.49944043e-02,  7.91307315e-02,
         4.86127660e-02,  2.75555283e-01, -8.30120146e-02,
         1.80027023e-01,  1.44885316e-01, -1.59288287e-01,
        -1.94997236e-01, -4.00722414e-01, -1.37285963e-01,
         3.61927778e-01, -3.39035988e-01, -3.47756073e-02,
        -6.16503179e-01,  1.44515026e+00,  3.19949985e-01,
         3.94565135e-01, -3.26071948e-01,  2.14843497e-01,
        -3.63146365e-02, -4.13899511e-01,  1.67138502e-01,
        -5.15444636e-01, -1.83536708e-01,  3.90481912e-02,
         9.09547061e-02, -2.18187377e-01,  1.16193175e-01,
        -2.13140637e-01,  2.62742490e-01,  3.77820075e-01,
        -2.58993626e-01, -5.08686364e-01, -2.61283755e-01,
        -5.75724877e-02,  5.14812469e-02, -1.78749681e-01,
        -8.78547728e-02, -5.78522384e-01,  3.56190205e-01,
         3.41897160e-01, -3.03279817e-01, -3.03550176e-02,
      

In [21]:
# create model
model = tf.keras.Sequential()   # create a sequential layer
model.add(hub_layer)            # add the hub_layer, the embedding layer
for units in [128,128,64,32]:   # number of neurons in hidden layers
  model.add(tf.keras.layers.Dense(units, activation='relu'))
  model.add(tf.keras.layers.Dropout(0.3))   # dropout of 30% in each layer
model.add(tf.keras.layers.Dense(6, activation='softmax'))   # output in 6 classes

model.summary()   # get model summary

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 128)               124642688 
_________________________________________________________________
dense (Dense)                (None, 128)               16512     
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               16512     
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                8256      
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0

In [22]:
# compile model
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])   # evaluating model by accuracy

Since we have an imbalanced dataset, we will also evaluate the model performance with other metrics

In [23]:
# create batches of data after shuffling it
# creates a batch of 512
train_data_f = train_data_f.shuffle(70000).batch(512)
test_data_f = test_data_f.batch(512)    # test data need not be shuffled

In [24]:
# model training
history = model.fit(train_data_f,
                    epochs=15,
                    validation_data=test_data_f,
                    verbose=1,
                    class_weight=weights)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [25]:
# check length of test dataset
len(list(data_test))

11491

In [26]:
# model evaluation - accuracy
results = model.evaluate(data_test.map(fetch).batch(11491), verbose=2)
print(results)

1/1 - 0s - loss: 1.1659 - accuracy: 0.8769
[1.1659382581710815, 0.8768601417541504]


In [29]:
# check length of test dataset
len(list(data_train))

45962

In [30]:
# model evaluation - classification report
test_data, test_labels = next(iter(data_test.map(fetch).batch(45962)))
y_pred = model.predict(test_data)
print(classification_report(test_labels.numpy().argmax(axis=1), y_pred.argmax(axis=1)))

              precision    recall  f1-score   support

           0       0.95      0.89      0.91      4295
           1       0.88      0.83      0.85      2583
           2       0.93      0.93      0.93      2015
           3       0.80      0.90      0.85      1461
           4       0.81      0.84      0.83       611
           5       0.59      0.82      0.69       526

    accuracy                           0.88     11491
   macro avg       0.83      0.87      0.84     11491
weighted avg       0.88      0.88      0.88     11491



- all classes are equally represented


In [32]:
# model evaluation - confusion matrix
confusion_matrix(test_labels.numpy().argmax(axis=1), y_pred.argmax(axis=1))

array([[3805,  162,   79,   67,   66,  116],
       [ 115, 2149,   18,  208,   11,   82],
       [  23,   25, 1865,   36,   28,   38],
       [  21,   84,   18, 1314,    5,   19],
       [  30,   12,   11,    5,  513,   40],
       [  28,   16,   25,   18,    9,  430]])