## Complaint Categorization Baseline Model

Fast and efficient handling of complaints on consumer forums is vital to the commerce industry today. This notebook presents a baseline approach to this problem. Consumer complaints about financial products are used as the dataset to establish results.

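The tf-idf (term frequency times inverse document frequency) scheme for weighting individual tokens is often used in information retrieval. One advantage of tf-idf is that it reduces the impact of tokens that occur very frequently and hence carry little to no information.
The tf-idf of term $t$ in document $d$ is $\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$, where $\text{tf}(t, d)$ is the number of times $t$ occurs in $d$ and, with the smoothing that `TfidfVectorizer` applies by default, $\text{idf}(t) = \log\frac{1 + n}{1 + \text{df}(t)} + 1$. Here $n$ is the total number of documents and $\text{df}(t)$ is the number of documents that contain $t$. The resulting document vectors are then L2-normalized.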
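
As a quick check of the weighting, here is a minimal pure-Python sketch on a made-up three-document corpus. It assumes the smoothed idf that scikit-learn's `TfidfVectorizer` uses by default and skips the final L2 normalization; the corpus and the helper names `idf` and `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenized by whitespace.
docs = [
    "credit report error".split(),
    "credit card fee".split(),
    "mortgage payment late fee".split(),
]
n = len(docs)

# Document frequency: number of documents that contain each term.
df_counts = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed idf: log((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n) / (1 + df_counts[term])) + 1

def tfidf(doc):
    # Raw term count times idf, without L2 normalization.
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, "credit" appears in two of the three documents and so receives a lower weight than "report", which appears in only one.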

In [5]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Importing pandas for operating on dataset
import pandas as pd

df = pd.read_csv('/gdrive/MyDrive/Colab Notebooks/NLP/complaints.csv')

### Typical Complaint

In [7]:
df['Consumer complaint narrative'][0]

'I have outdated information on my credit report that I have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements'

### Categories

In [8]:
print(df.Product.unique())

['Credit reporting' 'Consumer Loan' 'Debt collection' 'Mortgage'
 'Credit card' 'Other financial service' 'Bank account or service'
 'Student loan' 'Money transfers' 'Payday loan' 'Prepaid card'
 'Virtual currency'
 'Credit reporting, credit repair services, or other personal consumer reports'
 'Credit card or prepaid card' 'Checking or savings account'
 'Payday loan, title loan, or personal loan'
 'Money transfer, virtual currency, or money service'
 'Vehicle loan or lease']


### Train-test split
15% of the data is held out for validation and the remaining 85% is used for training. This gives 152809 training instances and 26967 validation instances.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    df['Consumer complaint narrative'].values, df['Product'].values, 
    test_size=0.15, random_state=0)
print('Training utterances: {}'.format(X_train.shape[0]))
print('Validation utterances: {}'.format(X_test.shape[0]))

Training utterances: 152809
Validation utterances: 26967


### Calculating tf-idf scores
Fit the vectorizer on the training set to learn the vocabulary and idf weights, then transform each utterance into a tf-idf vector.

In [10]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)

TfidfVectorizer()

In [11]:
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)
X_train, X_test

(<152809x76350 sparse matrix of type '<class 'numpy.float64'>'
 	with 13864799 stored elements in Compressed Sparse Row format>,
 <26967x76350 sparse matrix of type '<class 'numpy.float64'>'
 	with 2447784 stored elements in Compressed Sparse Row format>)

### Naive Bayes
In multinomial naive Bayes, the probability of a document $d$ being in class $c$ is computed as $$P(c|d) \propto P(c) \prod_{1\le k \le n_d}{P(t_k|c)} $$ where $P(c)$ is the prior probability of a document occurring in class $c$, $P(t_k|c)$ is the conditional probability of term $t_k$ occurring in a document of class $c$, and $n_d$ is the number of tokens in $d$.
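
The formula can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not what `MultinomialNB` does internally: it uses raw token counts instead of tf-idf values, add-one smoothing, and a made-up four-document training set. The names `log_posterior` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy training data (hypothetical): (tokens, class label).
train = [
    ("late fee on credit card".split(), "Credit card"),
    ("credit card interest charge".split(), "Credit card"),
    ("mortgage escrow payment".split(), "Mortgage"),
    ("mortgage loan modification".split(), "Mortgage"),
]

class_docs = Counter(label for _, label in train)   # documents per class
term_counts = defaultdict(Counter)                  # token counts per class
for tokens, label in train:
    term_counts[label].update(tokens)
vocab = {t for tokens, _ in train for t in tokens}

def log_posterior(tokens, label, alpha=1.0):
    # log P(c) + sum_k log P(t_k | c), with add-alpha smoothing.
    # Working in log space avoids underflow from multiplying many small terms.
    score = math.log(class_docs[label] / len(train))
    total = sum(term_counts[label].values())
    for t in tokens:
        if t not in vocab:
            continue  # ignore tokens never seen in training
        score += math.log((term_counts[label][t] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    # Pick the class with the highest unnormalized log posterior.
    return max(class_docs, key=lambda c: log_posterior(tokens, c))

print(predict("credit card fee".split()))
```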

In [12]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
clf = MultinomialNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))

0.6822041754737271


# Some extra stuff

## Doing pre-processing on the dataset

In [13]:
from nltk import word_tokenize

In [17]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [25]:
# Split each narrative into tokens.
# word_tokenize splits tokens such as "don't" and "3.14" more sensibly than a plain str.split.
# Only the first 1000 complaints are used here.
data = df['Consumer complaint narrative'].values[0:1000]
tokenized = [word_tokenize(value) for value in data]
del data
tokenized[0]

['I',
 'have',
 'outdated',
 'information',
 'on',
 'my',
 'credit',
 'report',
 'that',
 'I',
 'have',
 'previously',
 'disputed',
 'that',
 'has',
 'yet',
 'to',
 'be',
 'removed',
 'this',
 'information',
 'is',
 'more',
 'then',
 'seven',
 'years',
 'old',
 'and',
 'does',
 'not',
 'meet',
 'credit',
 'reporting',
 'requirements']

In [44]:
lower=[[word.lower() for word in tokens] for tokens in tokenized]
lower[0]

['i',
 'have',
 'outdated',
 'information',
 'on',
 'my',
 'credit',
 'report',
 'that',
 'i',
 'have',
 'previously',
 'disputed',
 'that',
 'has',
 'yet',
 'to',
 'be',
 'removed',
 'this',
 'information',
 'is',
 'more',
 'then',
 'seven',
 'years',
 'old',
 'and',
 'does',
 'not',
 'meet',
 'credit',
 'reporting',
 'requirements']

In [45]:
lower=[' '.join(text) for text in lower]
lower[0]

'i have outdated information on my credit report that i have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements'

In [47]:
# Remove punctuation
import string
punctuations_removed=[text.translate(str.maketrans('','',string.punctuation)) for text in lower]
punctuations_removed[0]

'i have outdated information on my credit report that i have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements'

In [50]:
# Remove stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
# Use a set for fast membership tests, and avoid rebinding the imported `stopwords` module.
stop_words = set(stopwords.words('english'))
cleaned = [[word for word in sentence.split(' ') if word not in stop_words] for sentence in punctuations_removed]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [51]:
cleaned[0]

['outdated',
 'information',
 'credit',
 'report',
 'previously',
 'disputed',
 'yet',
 'removed',
 'information',
 'seven',
 'years',
 'old',
 'meet',
 'credit',
 'reporting',
 'requirements']
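
One detail of the stop-word step is worth noting: `split(' ')` returns an empty string wherever two spaces are adjacent, for example where a punctuation token was removed. This appears to be the source of the `''` entry in the word map further down. A small sketch of the difference, using a made-up sentence:

```python
import string

# Hypothetical sentence in the same shape as the joined tokens:
# punctuation marks are separate, space-delimited tokens.
text = "i disputed it , twice ."
no_punct = text.translate(str.maketrans('', '', string.punctuation))

with_empty = no_punct.split(' ')   # splits on every single space
without_empty = no_punct.split()   # splits on runs of whitespace

print(with_empty)
print(without_empty)
```

Calling `split()` with no argument drops the empty strings, at the cost of changing the vocabulary size reported later in the notebook.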

In [48]:
# Stemming
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()
stemmer.stem('basically')

'basic'

In [52]:
stemmed=[[stemmer.stem(word) for word in sentence] for sentence in cleaned]

In [53]:
stemmed[0]

['outdat',
 'inform',
 'credit',
 'report',
 'previous',
 'disput',
 'yet',
 'remov',
 'inform',
 'seven',
 'year',
 'old',
 'meet',
 'credit',
 'report',
 'requir']

In [54]:
# Build the vocabulary: the set of unique stemmed tokens
vocab = list({word for sentence in stemmed for word in sentence})

In [56]:
len(vocab)

5122

In [58]:
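# Create index map (word -> position in vocab).
# Named word_map to avoid shadowing the built-in map().
word_map = {word: idx for idx, word in enumerate(vocab)}

In [59]:
# Accumulate occurrences per word.
# Each entry starts at the word's index, not at zero,
# so the final value is index + occurrence count.

for sentence in stemmed:
  for word in sentence:
    try:
      word_map[word] += 1
    except KeyError:
      pass

In [60]:
word_map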

{'': 20452,
 '5500': 3,
 'lazi': 3,
 'sherrif': 5,
 'equifaxxxxx': 6,
 'never': 375,
 'exclus': 7,
 'temp': 8,
 'amt': 9,
 'ruin': 24,
 'revers': 41,
 'init': 13,
 'pleaschang': 13,
 'alleg': 50,
 'faith': 37,
 'afni': 17,
 'reisssu': 17,
 '850000': 18,
 'proceed': 56,
 'yield': 20,
 'homein': 21,
 'cfpb': 101,
 'unfair': 53,
 'akin': 24,
 'therefor': 67,
 'back': 379,
 'afford': 68,
 'disappoint': 31,
 'creator': 30,
 'softer': 30,
 'caxxxxxxxx': 31,
 'uneasi': 32,
 'index': 33,
 'agreement': 120,
 'gold': 40,
 'custum': 36,
 'entri': 43,
 'asset': 47,
 'endors': 40,
 '6': 87,
 'taxestwo': 41,
 'latest': 44,
 'threefold': 43,
 'collud': 44,
 'senat': 48,
 'confus': 69,
 'creditorxxxx': 47,
 'trap': 51,
 'reset': 52,
 'tree': 53,
 'wrongli': 53,
 'natrionstar': 52,
 'thiev': 53,
 'scari': 54,
 'greenbtre': 55,
 'crazi': 58,
 'honest': 64,
 'ill': 60,
 'sanction': 60,
 'mileston': 60,
 '58000': 61,
 'certainli': 68,
 'northwest': 63,
 'bake': 64,
 'couldnt': 66,
 'cours': 73,
 'dc': 67,
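
The values shown mix each word's index with its occurrence count, because the counting loop starts from the index rather than from zero. A minimal sketch that keeps the two apart, using a hypothetical two-sentence `stemmed_demo` list in place of the notebook's `stemmed`:

```python
from collections import Counter

# Hypothetical stand-in for the notebook's `stemmed` list.
stemmed_demo = [
    ["credit", "report", "inform"],
    ["credit", "card", "fee"],
]

# Word -> column index, sorted so the mapping is reproducible.
vocab_demo = sorted({word for sentence in stemmed_demo for word in sentence})
word_index = {word: idx for idx, word in enumerate(vocab_demo)}

# Word -> corpus frequency, counted from zero.
word_counts = Counter(word for sentence in stemmed_demo for word in sentence)

def bow_vector(sentence):
    # Bag-of-words vector: one count per vocabulary entry.
    vec = [0] * len(word_index)
    for word in sentence:
        vec[word_index[word]] += 1
    return vec

print(word_counts["credit"], bow_vector(stemmed_demo[0]))
```

Keeping `word_index` and `word_counts` separate means the index can be reused to build one vector per complaint, which is what a bag-of-words feature matrix needs.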

### Feature Selection
The chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification. Here the 5000 highest-scoring features are kept.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

ch2 = SelectKBest(chi2, k=5000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

X_train, X_test

(<152809x5000 sparse matrix of type '<class 'numpy.float64'>'
 	with 10780400 stored elements in Compressed Sparse Row format>,
 <26967x5000 sparse matrix of type '<class 'numpy.float64'>'
 	with 1907878 stored elements in Compressed Sparse Row format>)
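
To make the scoring concrete, here is a rough pure-Python sketch of a chi-square statistic per feature, following the observed-versus-expected form that `chi2` is understood to use. The count matrix, labels, and the function name `chi2_scores` are made up for illustration, and the sketch assumes every term occurs at least once.

```python
# Toy term-count matrix (hypothetical): rows are documents, columns are terms.
terms = ["mortgage", "card", "the"]
X = [
    [3, 0, 2],
    [2, 0, 2],
    [0, 3, 2],
    [0, 2, 2],
]
y = ["Mortgage", "Mortgage", "Credit card", "Credit card"]

def chi2_scores(X, y):
    classes = sorted(set(y))
    n_docs, n_terms = len(X), len(X[0])
    scores = []
    for j in range(n_terms):
        total = sum(row[j] for row in X)  # total count of term j
        score = 0.0
        for c in classes:
            # Observed: count of term j inside class c.
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected under independence: total count times class share.
            expected = total * y.count(c) / n_docs
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

scores = chi2_scores(X, y)
print(dict(zip(terms, scores)))
```

A term spread evenly across classes, like "the" here, scores zero and would be among the first dropped by `SelectKBest`.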