<a href="https://colab.research.google.com/github/seanmcalevey/cfpb_complaint_clf/blob/master/CFPB_Complaint_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
import pandas as pd
import re

# Start of preprocessing

In [2]:
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


### Store in DataFrame

##### Store dataset in pandas dataframe. Only keep rows with narratives.

In [3]:
master_df = pd.read_csv('/content/drive/My Drive/Consumer_Complaints.csv')



In [4]:
proc_df = master_df.dropna(subset=['Consumer complaint narrative'])

proc_df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
263,09/24/2019,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,this is the final complaint that I am going to...,Company has responded to the consumer and the ...,"Certegy Holdings, LLC",OH,450XX,,Consent provided,Web,09/24/2019,Closed with explanation,Yes,,3384460
291,09/24/2019,Debt collection,Medical debt,Attempts to collect debt not owed,Debt was result of identity theft,This amount is XXXX dollars is not mine and Im...,Company has responded to the consumer and the ...,"American Credit Bureau, Inc.",FL,328XX,,Consent provided,Web,09/24/2019,Closed with explanation,Yes,,3384012
315,09/24/2019,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,After numerous attempts to get verification ab...,Company has responded to the consumer and the ...,"Medical Data Systems, Inc.",SC,,,Consent provided,Web,09/24/2019,Closed with explanation,Yes,,3383985
321,09/24/2019,Debt collection,I do not know,Took or threatened to take negative or legal a...,Threatened or suggested your credit would be d...,I received a letter stating i owned this compa...,Company believes the complaint is the result o...,"CCS Financial Services, Inc.",NY,115XX,,Consent provided,Web,09/24/2019,Closed with explanation,Yes,,3384865
326,09/24/2019,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,"XXXX XXXX XXXX XXXX XXXX XXXX XXXX, GA XXXX CR...",Company believes the complaint is the result o...,"CCS Financial Services, Inc.",GA,305XX,Servicemember,Consent provided,Web,09/24/2019,Closed with explanation,Yes,,3383856


### Count of Company Responses

In [5]:
proc_df['Company response to consumer'].value_counts()

Closed with explanation            359418
Closed with non-monetary relief     54254
Closed with monetary relief         24334
Closed                               3741
Untimely response                    2935
Name: Company response to consumer, dtype: int64
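
The counts above are heavily imbalanced, which is the motivation for sampling an equal number of complaints per class in the next step. A minimal sketch of the class shares, using only the numbers printed above (the variable names are illustrative, not part of the notebook):

```python
# Counts copied from the value_counts() output above.
counts = {
    "Closed with explanation": 359418,
    "Closed with non-monetary relief": 54254,
    "Closed with monetary relief": 24334,
    "Closed": 3741,
    "Untimely response": 2935,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Share of monetary relief among only the two classes modeled in this notebook.
two_class_total = (counts["Closed with explanation"]
                   + counts["Closed with monetary relief"])
monetary_share = counts["Closed with monetary relief"] / two_class_total

for label, share in shares.items():
    print(f"{label:<32} {share:6.1%}")
print(f"Monetary relief vs. explanation only: {monetary_share:.1%}")
```

Without resampling, a model that always predicted "Closed with explanation" would already be right for the large majority of these two-class complaints.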

### Sample the Data

##### Take an even sample of 22,000 complaints from each class: closed with explanation and closed with monetary relief

In [0]:
tmp_df_1 = proc_df[proc_df['Company response to consumer']=='Closed with explanation'].sample(22000, random_state=42)

tmp_df_2 = proc_df[proc_df['Company response to consumer']=='Closed with monetary relief'].sample(22000, random_state=42)

# pd.concat works across pandas versions; DataFrame.append was removed in pandas 2.0
df = pd.concat([tmp_df_1, tmp_df_2])

### Create Dictionary to Replace Target Responses with Binary

In [7]:
replace_dict = {'Closed with explanation': 0, 'Closed with monetary relief': 1}

# Assign the mapped column back instead of relying on a chained inplace replace
df['Company response to consumer'] = df['Company response to consumer'].map(replace_dict)

df['Company response to consumer'].value_counts()

1    22000
0    22000
Name: Company response to consumer, dtype: int64

### Clean Text Data

##### Clean data so that only letters, periods, apostrophes, and whitespace remain. Capital Xs are also removed because they are used to conceal personal information.

In [0]:
# Raw string avoids invalid-escape warnings for \s; the pattern itself is unchanged
clean = [re.sub(r"[^A-WY-Za-z.\s']", '', str(text)) for text in df['Consumer complaint narrative']]

split_word_nars = [nar.split() for nar in clean]

"""Contractions Import"""

import sys
sys.path.append('/content/drive/My Drive')
from english_contractions import replace_contraction

""" Loops """

new_words = []

for nar in split_word_nars:

  nar_words = []

  for word in nar:

    if re.search(r'\w+[.]', word):

      splitted = word.split('.')

      tmp_words = replace_contraction(splitted[0].lower())

      for w in tmp_words.split():

        nar_words.append(w)

      nar_words.append('.')
    
    # Note: the cleaning regex above already strips commas, so this branch is not expected to match
    elif re.search(r'\w+[,]', word):

      splitted = word.split(',')

      tmp_words = replace_contraction(splitted[0].lower())

      for w in tmp_words.split():

        nar_words.append(w)
      
      nar_words.append(',')
    
    else:

      tmp_words = replace_contraction(word)

      for w in tmp_words.split():

        nar_words.append(w)
  
  nar_words = [word for word in nar_words]

  new_words.append(' '.join(nar_words))

df['Cleaned narratives'] = new_words
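
To make the effect of the cleaning pattern concrete, here is a small sketch that applies the same character class to a made-up narrative. The helper name `strip_disallowed` and the sample sentence are illustrative assumptions; the contraction handling from `english_contractions` is not reproduced here.

```python
import re

# Same character class as the cleaning step, written as a raw string.
KEEP_PATTERN = re.compile(r"[^A-WY-Za-z.\s']")


def strip_disallowed(text):
    """Drop every character that is not a kept letter, period, whitespace, or apostrophe."""
    return KEEP_PATTERN.sub("", str(text))


# Made-up narrative, for illustration only.
sample = "On XXXX I paid $300.00, but they didn't refund me."
tokens = strip_disallowed(sample).split()
print(tokens)
```

The redaction marker, the dollar sign, the digits, and the comma all disappear, while the period inside the amount survives as a token of its own.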

## Tokenize

In [9]:
lengths = []

for nar in df['Cleaned narratives']:

  tot_words = len(nar.split())

  lengths.append(tot_words)

np.mean(lengths), np.quantile(lengths, q=0.9)

(223.61736363636365, 477.0)

In [0]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_len_text = 200

max_words = 20000

tokenizer = Tokenizer(num_words=max_words, filters='')

tokenizer.fit_on_texts(df['Cleaned narratives'])

# convert text sequences into integer sequences
X = tokenizer.texts_to_sequences(df['Cleaned narratives'])

# left-pad with zeros up to max_len_text; longer sequences keep only their last max_len_text tokens
X_proc = pad_sequences(X, maxlen=max_len_text, padding='pre')

max_id = max_words + 1
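
The padding step can be pictured with a plain-Python sketch. It assumes the Keras defaults for `pad_sequences` (`padding='pre'`, `truncating='pre'`, pad value 0); `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    if len(seq) >= maxlen:
        return list(seq[-maxlen:])
    return [value] * (maxlen - len(seq)) + list(seq)


short = pad_pre([5, 9, 2], maxlen=5)           # shorter than maxlen: zeros go in front
long_ = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)  # longer than maxlen: the start is dropped
print(short, long_)
```

With `max_len_text = 200`, any narrative longer than 200 tokens would therefore lose its opening words rather than its ending.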

## Train Test Split

In [0]:
from sklearn.model_selection import train_test_split

y = df['Company response to consumer']

# Test split:
X_train_val, X_test, y_train_val, y_test = train_test_split(X_proc, y, stratify=y, test_size=1000, random_state=42)

# Val split:
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, stratify=y_train_val, test_size=4000, random_state=42)
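
A quick arithmetic check of the subset sizes these two splits should produce, assuming the balanced 44,000-row sample built earlier. The per-class figures assume stratification keeps the 50/50 ratio exactly.

```python
n_total = 2 * 22000   # balanced sample from the sampling step
n_test = 1000         # test_size in the first split
n_val = 4000          # test_size in the second split

n_train_val = n_total - n_test
n_train = n_train_val - n_val

sizes = {"train": n_train, "val": n_val, "test": n_test}
# With stratify on a 50/50 target, each subset keeps half of its rows in each class.
per_class = {name: n // 2 for name, n in sizes.items()}
print(sizes)
print(per_class)
```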

## Establish Checkpoints

In [31]:
import tensorflow as tf
from tensorflow import keras

# Early Stopping

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)

# Checkpointing Model Weights

import os

checkpoint_path = 'checkpoints/cp-{epoch:01d}.ckpt'

checkpoint_dir = os.path.dirname(checkpoint_path)

# save_freq='epoch' saves after every epoch; it replaces the deprecated period=1 argument
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path, verbose=1, save_weights_only=True, save_freq='epoch')

latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)

latest_checkpoint



'checkpoints/cp-5.ckpt'
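
The checkpoint name above follows from the path template. As a sketch, assuming Keras fills the template with a 1-based epoch number through ordinary string formatting, five epochs would yield these file prefixes:

```python
checkpoint_path = 'checkpoints/cp-{epoch:01d}.ckpt'

# One formatted path per epoch, numbered from 1.
paths = [checkpoint_path.format(epoch=e) for e in range(1, 6)]
print(paths[-1])
```

Neither `early_stop` nor `checkpoint_cb` has any effect unless it is passed to `model.fit` through the `callbacks` argument.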

## Bidirectional LSTM Model

In [52]:
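import tensorflow as tf
from tensorflow import keras
# Names used below that were not imported in the original cell
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.regularizers import l2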

emb_reg, reg_factor = 0, 0

dropout_rate = 0.5

emb_dim, lstm_dim, dense_dim = 512, 256, 512

model = Sequential()

model.add(Embedding(max_id, emb_dim, input_length=max_len_text, embeddings_regularizer=l2(emb_reg)))

model.add(Bidirectional(LSTM(lstm_dim, return_sequences=True, kernel_regularizer=l2(reg_factor), dropout=dropout_rate)))

model.add(Bidirectional(LSTM(lstm_dim, return_sequences=False, kernel_regularizer=l2(reg_factor), dropout=dropout_rate)))

model.add(Dense(dense_dim, activation='relu', kernel_regularizer=l2(reg_factor)))

model.add(Dropout(dropout_rate))

model.add(Dense(dense_dim, activation='relu', kernel_regularizer=l2(reg_factor)))

model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adamax', loss='binary_crossentropy', metrics=['acc'])

model.summary()

Model: "sequential_18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_17 (Embedding)     (None, 200, 512)          10240512  
_________________________________________________________________
bidirectional_32 (Bidirectio (None, 200, 512)          1574912   
_________________________________________________________________
bidirectional_33 (Bidirectio (None, 512)               1574912   
_________________________________________________________________
dense_42 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_10 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_43 (Dense)             (None, 512)               262656    
_________________________________________________________________
dense_44 (Dense)             (None, 1)                 513       
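
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with the dimensions defined in the cell above; the helper names and dictionary keys are illustrative.

```python
vocab_rows = 20000 + 1   # max_id = max_words + 1
emb_dim, lstm_dim, dense_dim = 512, 256, 512


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units + 1) * units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


params = {
    "embedding": vocab_rows * emb_dim,
    # A bidirectional wrapper holds two LSTMs (forward and backward).
    "bidirectional_1": 2 * lstm_params(emb_dim, lstm_dim),
    # The second wrapper receives the concatenated 2 * lstm_dim outputs.
    "bidirectional_2": 2 * lstm_params(2 * lstm_dim, lstm_dim),
    "dense_1": dense_params(2 * lstm_dim, dense_dim),
    "dense_2": dense_params(dense_dim, dense_dim),
    "output": dense_params(dense_dim, 1),
}

for name, n in params.items():
    print(f"{name:<16} {n:>10,}")
print(f"{'total':<16} {sum(params.values()):>10,}")
```

The embedding table accounts for most of the weights, so `max_words` and `emb_dim` are the settings that most affect model size.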

In [53]:
epochs = 4

batch_size = 256

model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=epochs, batch_size=batch_size)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7fb040713a90>

## Evaluate Model on Test Set

In [54]:
model.evaluate(X_test, y_test)



[0.40066492557525635, 0.8339999914169312]

The final model (after multiple iterations) reached 83.4% accuracy on the held-out test set of 1,000 complaints. It consists of a 512-dimensional embedding layer, two bidirectional LSTM layers (256 units per direction, so 512 outputs each), and two 512-unit dense layers feeding a single sigmoid output unit. The bidirectional layers encode each narrative in both directions; the dense layers then transform that encoding before classification.

Given that only the complaint text was used, 83.4% accuracy on a balanced test set (where guessing would score 50%) is a promising result. It means that, given a complaint that was closed either with an explanation or with monetary relief, the classifier picks the correct outcome 83.4% of the time when the two outcomes are equally likely. In the full dataset monetary relief is far rarer than an explanation (24,334 versus 359,418 complaints), so this figure should not be read as the accuracy to expect on unsampled complaints.
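
As a rough illustration of how the accuracy figure is computed, the sketch below thresholds sigmoid outputs at 0.5, which is assumed here to match the default behavior of the `acc` metric for a binary output. The labels and probabilities are made up; only the accuracy value and the test-set size of 1,000 come from the notebook.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == t for t, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Made-up labels and sigmoid outputs, for illustration only.
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.91, 0.12, 0.40, 0.55, 0.77, 0.08]
acc = binary_accuracy(y_true, y_prob)

# Number of correct predictions implied by the reported accuracy on 1,000 test rows.
correct_on_test = round(0.8339999914169312 * 1000)
print(acc, correct_on_test)
```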