# Brief explanation of the dataset & features

Consumer Complaint Narrative: This is a paragraph (or text) written by the customer explaining their complaint in detail. The data is a string type consisting of text in the form of paragraphs.
Product: This is the category we are to classify each complaint into. The 17 categories the complaints need to be categorized into are:

'Mortgage', 'Student loan', 'Credit card or prepaid card', 'Credit card', 'Debt collection', 'Credit reporting', 'Credit reporting, credit repair services, or other personal consumer reports', 'Bank account or service', 'Consumer Loan', 'Money transfers', 'Vehicle loan or lease', 'Money transfer, virtual currency, or money service', 'Checking or savings account', 'Payday loan', 'Payday loan, title loan, or personal loan', 'Other financial service', 'Prepaid card'

<h3>What do we want as the outcome?</h3>

We want to classify each complaint into its respective category, so that the complaint can be directed to the right vertical.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
nltk.download('wordnet')
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\snake\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\snake\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Loading of dataset
full_data = pd.read_csv('file.csv')
# keeping the relevant columns
data = full_data[["Consumer complaint narrative","Product"]]
data.columns = ['X','y']
data.head()

Unnamed: 0,X,y
0,,Mortgage
1,When my loan was switched over to Navient i wa...,Student loan
2,I tried to sign up for a spending monitoring p...,Credit card or prepaid card
3,,Credit card
4,,Debt collection


In [None]:
data.shape

# Why is it difficult to work with text?
Comprehending language is hard for computers. Some of the unique challenges of working with text are as follows:

Synonymy - Different words can have the same meaning, so a similar intent can be conveyed in many ways. This is one of the main reasons computers have a hard time deciphering the meaning or intent of a statement. "The President of the United States has signed a new decree" and "POTUS has inked a new law" convey essentially the same meaning. However, because the two sentences share almost no words, computers have a hard time recognizing that the intent is the same.

Ambiguity - "The bank deposit rate is quite high" and "He stood near the bank admiring the river". In these statements, the word bank has completely different meanings. In the first case it represents a financial institution, and in the second case it refers to land near the river. Disambiguating the meaning in sentences is quite challenging.

Anaphora Resolution - "George is my friend. He likes football". In the second statement, "he" refers to George. It is difficult for computers to discern which person or entity a pronoun refers to.

Language-related issues - Every language has its own peculiarities. English has clearly delimited words, sentences, paragraphs and so on. Written Thai, by contrast, does not separate words with spaces and has no full stop to mark the end of a sentence, so even finding word and sentence boundaries is a task in itself. The grammar and morphology of languages also differ widely. This is why Google Translate, or any other translation service, struggles to perfectly convert a piece of text from one language to another.

Out-of-vocabulary problem - Machines have a hard time adapting to new constructs that humans come up with. When we humans come across a word we haven't seen before, we might not understand its meaning instantly, but we can adapt: after seeing the word in several different sentences, we pick up its context and meaning. A model, in contrast, can only handle words that were present in the data it was built from, so it does not adapt well to unseen words.

Language generation - While language understanding is hard, language generation has its own set of challenges too. For chatbots to work effectively, they need to communicate in properly constructed, grammatically correct sentences. This is quite a hard problem and a challenge that needs to be overcome.

We now know that working with text is hard. But there are also exciting applications and use cases involved with working on text. We will now take a look at some of the use cases.

<h2>Use cases of NLP</h2>
The use cases of NLP encompass almost anything you can do with language in relation to a problem.

1) Sentiment Analysis - Finding if the text is leaning towards a positive or negative sentiment.

The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral, is called Sentiment Analysis. The information present on the Internet is constantly growing, resulting in a large number of texts expressing opinions on review sites, forums, blogs and social media. Sentiment analysis is therefore a topic of great interest and development, since it has many practical applications. It is immensely useful in figuring out the overall sentiment towards products (Amazon), movies (Netflix), food (Yelp), etc. Its applications include Market Research, Social Monitoring, Customer Support and Product Analytics.

2) Text Classification - Categorizing text into various categories

Text classifiers can be used to organize, structure, and categorize almost any text data we have. For example, news articles can be organized by topic, chat conversations by language, and support tickets by urgency. Other examples of text classification include:

Directing customer queries to the right vertical

Detection of spam and non-spam emails

Auto tagging of customer queries

3) Document Summarization - Compressing a paragraph/document into few words or sentences

Text summarization is the method of compressing a text document in order to create a summary of its major points. The idea of summarization is to find a subset of the data which contains the information of the entire set. Its applications include news summaries (Inshorts app), novel summaries, book summaries (Blinkist), etc. With the overall attention span declining, the need to provide information in the fewest possible words has risen, and summarization helps solve this problem.

4) Parts of Speech Tagging - Figuring out the various nouns, adverbs, verbs, etc. in the text

Identifying part-of-speech tags is much more complicated than it looks. This is because a single word can have different part-of-speech tags in different sentences, depending on the context. This makes it impossible to have a fixed mapping from words to POS tags. A few of its applications include:

Text to speech conversion

Word Sense Disambiguation (teaching a machine the difference in meaning of the word 'bears' in "I saw a couple of bears" and "Hard work always bears fruit")

5) Machine translation - Translate text from one language to another

Machine Translation is the task of automatically translating one natural language into another while retaining the meaning of the original text. Translation from one language to another is complex because some of the words in the original language could have multiple meanings and these words could have different forms in the output language. Its most popular application is Google Translate and it is employed in devices like Google Home as well. Machine translation allows business transactions between partners in different countries without the need of a human interpreter.

6) Named Entity Recognition - Identify the entities present in text

Named Entity Recognition deals with named entity mentions in text and categorizes these entities into person, organization, datetime reference etc. This is used a lot in the field of bioinformatics, molecular biology and other medical NLP applications. It also plays an important role in the overall field of Information Extraction where we try to extract knowledge from unstructured text.

7) Conversational AI - Chat with a machine in natural language and get queries resolved

Conversational AI deals with creating an interface through which machines and humans converse in natural language. Such interfaces are known as chatbots. A user can interact with a chatbot in natural language, the same way they would communicate with a human. For organizations to truly scale customer support, chatbots are increasingly adopted as the first point of contact for resolving customer queries.

So to enable all these NLP use cases, the first challenge is to convert the text into a form that the machine can understand. For that, we need to arrive at a fundamental component of text known as tokens.

# Tokenization

<h3> Motivation for tokenization</h3>

We can see that unlike all the machine learning datasets we have worked with previously, the data isn't boolean, numeric, categorical etc. Usually a text is composed of paragraphs, paragraphs are composed of sentences, and sentences are composed of words. You could also go deeper into letters, but the letters have no meaning. It's only when they are combined into words, that the text starts to make sense. Hence, it is better to work at the word level.

Tokenization is the process of splitting the text into smaller parts called tokens. Tokens are the basic units of a particular dataset. The choice of tokens could be based on the application we are working on.
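
To make the idea of token granularity concrete, here is a minimal sketch that uses only Python's built-in `re` module. The sample text and the regular expressions are assumptions chosen for illustration, and they are much cruder than a real tokenizer such as the NLTK one used later in this notebook.

```python
import re

# Illustrative sample text (not taken from the dataset)
text = "I paid the balance off. My credit score did not recover."

# Sentence-level tokens: split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Word-level tokens: runs of word characters, punctuation kept as separate tokens
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level tokens: every non-whitespace character
char_tokens = [ch for ch in text if not ch.isspace()]

print(sentence_tokens)
print(word_tokens)
print(len(char_tokens), "character tokens")
```

The same text yields 2 sentence tokens or 13 word tokens, so the choice of token decides how long and how fine-grained the resulting sequence is.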

<h3>Introduction to NLTK</h3>

Natural Language Toolkit (NLTK) is a widely used Python library that deals specifically with text. Most common text processing tasks can be done easily with this library. It is a leading platform for building Python programs to work with human language data. It also provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. On top of that, it is completely free and open-source, with a vibrant developer community supporting it. Let us now take the first step towards categorizing the consumer complaints by starting with tokenization.

<h3>Tokenizing with NLTK - The problem intuition</h3>

We first need a way to convert the text to numbers so that an algorithm can be applied to it. This is the same requirement sklearn imposes on any non-numeric data, which must be encoded (label or one-hot) before it enters an sklearn pipeline.

Intuitively, it would make sense to divide each paragraph of text to its basic form (words) and then convert each of those words to numbers. We could assign a particular number to each word, in which case a sentence could look like a set of numbers to us, each number representing a particular word.

The first step towards that is to break the text down into words. That is what tokenization aims to do. NLTK has built-in functions for tokenization, which we will use for our purpose.
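
The intuition of "one number per word" can be sketched in a few lines of plain Python. The helper name `build_vocabulary` and the sample sentence are hypothetical, introduced only for this illustration; the rest of the notebook relies on library vectorizers instead.

```python
def build_vocabulary(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocabulary = {}
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
    return vocabulary

# Illustrative sentence, split on whitespace for simplicity
sentence = "i have been faithful at paying my student loan and my car loan"
tokens = sentence.split()

vocabulary = build_vocabulary(tokens)
encoded = [vocabulary[token] for token in tokens]

print(vocabulary)
print(encoded)
```

Repeated words such as "my" and "loan" map to the same number each time, which is what lets a sentence be read as a sequence of word IDs.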

In [27]:
# Dropping nan values from dataframe
data = data.dropna()

# Storing the first complaint (first row of the 'X' column)
first_complaint = data["X"].iloc[0]


# Printing the first complaint
print("\nFirst Complaint\n")
print(first_complaint)

# Using the split command
print("\nUsing the Split Command\n")
bag_of_words_1 = first_complaint.split(" ")
print(bag_of_words_1)

# Using the tokenize command
print("\nUsing tokenize\n")
bag_of_words_2 = word_tokenize(first_complaint)
print(bag_of_words_2)



First Complaint

When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.

Using the Split Command

['When', 'my', 'loan', 'was', 'switched', 'over', 'to', 'Navient', 'i', 'was', 'never', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'balance', 'because', 'with', 'XXXX', 'i', 'did', 'not.', 'When', 'going', 'to', 'purchase', 'a', 'vehicle', 'i', 'discovered', 'my'

# Sentence Tokenization

In [28]:
# first_complaint is already loaded onto the workspace
from nltk.tokenize import sent_tokenize

# Tokenizing sentences
list_of_sentences = sent_tokenize(first_complaint)

print("List of sentences\n", list_of_sentences)

# Lowering first complaint
first_complaint_lower = first_complaint.lower()

# Tokenizing first complaint lower
bag_of_words_lower = word_tokenize(first_complaint_lower)

print("\n",bag_of_words_lower)

List of sentences
 ['When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not.', 'When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX.', 'I have been faithful at paying my student loan.', 'I was told that Navient was the company i had delinquency with.', 'I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me.', 'I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus.', 'I have had so much trouble bringing my credit score back up.']

 ['when', 'my', 'loan', 'was', 'switched', 'over', 'to', 'navient', 'i', 'was', 'never', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'balance', 'because', 'with', 'xxxx', 'i', 'did', 'not', '.', 'when', 'going', 'to', 'purchase', 'a', 'vehicle', 'i', 'discovered', '

# Stemming

Stemming is the process of reducing each word of a sentence to its non-changing portion, the stem. Stemming a word may therefore produce a string that is not an actual word. Stems are created by removing the suffixes or prefixes attached to a word.

For example: likes, liked, likely, unlike ⇒ like

Many different algorithms have been defined for this process, each with its own set of rules. The popular ones include:

Porter Stemmer (implemented in almost all languages)

Paice Stemmer

Lovins Stemmer
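
As a rough illustration of affix stripping, here is a deliberately naive sketch. It is not the Porter, Paice or Lovins algorithm: the prefix and suffix lists, the minimum stem length and the function name `naive_stem` are all assumptions hand-picked to suit the example words.

```python
# Hand-picked affixes for illustration only; real stemmers use far richer rules
PREFIXES = ("un",)
SUFFIXES = ("ing", "ly", "s", "d")
MIN_STEM_LENGTH = 3

def naive_stem(word):
    """Strip at most one known prefix and one known suffix from a word."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM_LENGTH:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LENGTH:
            stem = stem[:-len(suffix)]
            break
    return stem

words = ["Likes", "liked", "likely", "unlike", "studied"]
stems = [naive_stem(word) for word in words]
print(stems)
```

The first four words collapse to "like", while "studied" becomes "studie", which shows how mechanical stripping can leave a stem that is not a real word.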


In [29]:
import nltk

text="Natural Language Processing is really fun and I want to study it more"
print("The words of text:",text,"\nis stemmed in the following way: ")

#Breaking the sentence to words
tokens=text.split()

#Defining Porter Stemmer object
porter = nltk.PorterStemmer()

#Applying the stemming
stem = [porter.stem(i) for i in tokens]
print(stem)

The words of text: Natural Language Processing is really fun and I want to study it more 
is stemmed in the following way: 
['natur', 'languag', 'process', 'is', 'realli', 'fun', 'and', 'I', 'want', 'to', 'studi', 'it', 'more']


# Lemmatization

This method is a more refined way of reducing words, through the use of a vocabulary and morphological analysis of words. The aim is to always return the base form of a word, known as the lemma.

Consider the following words:

'Studied', 'Studious', 'Studying'

Stemming 'Studied' and 'Studying' results in 'studi', which is not a real word.

Lemmatising them as verbs results in 'study', while the adjective 'Studious' is already in its base form.

As can be seen, lemmatization is more complex than stemming because it requires words to be categorized by part of speech as well as by inflected form.

In languages other than English, it can become quite complicated.
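
To show why the part of speech matters, here is a toy lookup-based sketch. The table `LEMMA_TABLE` is a hypothetical, hand-written stand-in for a real vocabulary such as WordNet, and `toy_lemmatize` is an illustrative helper, not an NLTK function.

```python
# Hypothetical miniature vocabulary: (inflected form, part of speech) -> lemma
LEMMA_TABLE = {
    ("studied", "verb"): "study",
    ("studying", "verb"): "study",
    ("studies", "noun"): "study",
    ("women", "noun"): "woman",
    ("better", "adjective"): "good",
    ("are", "verb"): "be",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lower-cased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(toy_lemmatize("Studied", "verb"))
print(toy_lemmatize("better", "adjective"))
print(toy_lemmatize("better", "noun"))
```

The word "better" maps to "good" only when it is tagged as an adjective; with any other tag the lookup misses and the word is returned unchanged, which is the behaviour a lemmatizer shows when it is given the wrong part of speech.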

In [30]:
from nltk.stem import WordNetLemmatizer


text = "Women in  technology are amazing at coding"
print("The words of text:",text,"\nis lemmatized in the following way: ")

tokens=text.lower().split()
lemma = WordNetLemmatizer()
lemma_result = [lemma.lemmatize(i) for i in tokens]
print(lemma_result)

The words of text: Women in  technology are amazing at coding 
is lemmatized in the following way: 
['woman', 'in', 'technology', 'are', 'amazing', 'at', 'coding']


# Vectorization
Bag of words:
The problem with modeling text is that there are no well-defined, fixed-length inputs.

A bag of words model is a way of extracting features from text for use in modeling. In this approach, we use the tokenized words for each observation and find out the frequency of each token.

Let's take an example to understand it.

Consider the following sentences:

"Hope is a good thing"
"Maybe the best thing"
"No good thing ever dies"
We will treat each sentence as a separate document and make a list of all unique words from the three documents. We get:

"hope", "is", "a", "good", "thing", "maybe", "the", "best", "no", "ever", "dies"

Next, we try to create vectors from it.

We take the first document, "Hope is a good thing", and count how often each of the 11 unique words occurs in it:

"hope" - 1
"is" - 1
"a" - 1
"good" - 1
"thing" - 1
"maybe" - 0
"the" - 0
"best" - 0
"no" - 0
"ever" - 0
"dies" - 0
This is what each document looks like as a vector:

"Hope is a good thing" - [1,1,1,1,1,0,0,0,0,0,0]

"Maybe the best thing" - [0,0,0,0,1,1,1,1,0,0,0]

"No good thing ever dies" - [0,0,0,1,1,0,0,0,1,1,1]

This process of converting text data to numbers is called vectorization.
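
The worked example can be reproduced with a short plain-Python sketch. Lower-casing and splitting on whitespace are simplifying assumptions made here; the vocabulary is kept in order of first appearance so that it matches the word list given earlier.

```python
documents = [
    "Hope is a good thing",
    "Maybe the best thing",
    "No good thing ever dies",
]

# Lower-case and split each document into words
tokenized = [doc.lower().split() for doc in documents]

# Collect unique words in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One count per vocabulary word, for every document
vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for doc, vec in zip(documents, vectors):
    print(doc, "-", vec)
```

Every document ends up as a vector of the same length (11 here), regardless of how many words it contains.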

There are multiple methods to convert words to numbers. We will start by discussing the count vectorizer.

In [31]:
from collections import Counter
count_vectorizer = Counter(bag_of_words_lower)
count_vectorizer

Counter({'when': 2,
         'my': 4,
         'loan': 2,
         'was': 5,
         'switched': 1,
         'over': 1,
         'to': 5,
         'navient': 3,
         'i': 11,
         'never': 1,
         'told': 3,
         'that': 3,
         'had': 4,
         'a': 2,
         'deliquint': 1,
         'balance': 2,
         'because': 1,
         'with': 3,
         'xxxx': 3,
         'did': 1,
         'not': 1,
         '.': 7,
         'going': 1,
         'purchase': 1,
         'vehicle': 1,
         'discovered': 1,
         'credit': 4,
         'score': 2,
         'been': 2,
         'dropped': 1,
         'from': 1,
         'the': 8,
         'into': 1,
         'have': 2,
         'faithful': 1,
         'at': 1,
         'paying': 1,
         'student': 1,
         'company': 1,
         'delinquency': 2,
         'contacted': 1,
         'resolve': 1,
         'this': 1,
         'issue': 1,
         'you': 1,
         'and': 5,
         'kept': 1,
         'bein

### sklearn library

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

#Initialising a CountVectorizer object
cv = CountVectorizer()

#Storing the first row in Text
txt = [data["X"].iloc[0]]

#Printing the first row
print ("\nFirst Row:\n",txt)

#Fitting the CountVectorizer object
cv.fit(txt)

#Transforming the first row
vector = cv.transform(txt)


print ("\nVector Shape:\n", vector.shape)

#Storing the values of vector in array format
vector_values = vector.toarray()

print("\nVector Values:\n",vector_values)

print("These are the counts of the 69 unique words in our first complaint.")

print ("\nCount Vectorizer Vocabulary:\n",cv.vocabulary_) 


First Row:
 ['When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.']

Vector Shape:
 (1, 69)

Vector Values:
 [[1 5 1 1 1 2 1 2 1 1 2 1 1 1 1 4 2 1 1 1 1 1 1 1 1 1 4 2 1 1 1 1 2 1 2 1
  1 1 4 3 1 1 1 1 1 1 1 1 2 1 2 1 1 3 8 1 1 1 5 3 1 1 1 1 5 2 3 3 1]]
These are the counts of the 69 unique words in our first complaint.

Count Vectorizer Vocabulary:
 {'when': 

The vocabulary only specifies the index of each word, not its count.

Comparing the vector values with the vocabulary helps in identifying the word count.

For example, the word with index 1 is "and", and the value at index 1 of the vector is 5. That means the word "and" occurs 5 times.
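
The lookup described above amounts to inverting the word-to-index mapping. The sketch below builds a small vocabulary by hand, with indices assigned in alphabetical order to mirror the layout seen in the output; it does not call scikit-learn, and the sample sentence is made up for illustration.

```python
from collections import Counter

# Illustrative text (not from the dataset)
tokens = "the credit bureaus and the lender and the bank".split()
counts = Counter(tokens)

# Word -> index, with indices following alphabetical order of the words
vocabulary = {word: index for index, word in enumerate(sorted(counts))}

# Count vector laid out in index order
vector = [counts[word] for word in sorted(counts)]

# Invert the mapping so that an index can be turned back into its word
index_to_word = {index: word for word, index in vocabulary.items()}

for index, count in enumerate(vector):
    print(index, index_to_word[index], count)
```

With the inverted mapping, position 5 of the vector can be read directly as "the occurs 3 times" instead of searching the vocabulary by eye.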

In [33]:
# Converting the vector values to a list
vector_values = vector_values.tolist()[0]
print (vector_values)


print ("count value of the word at index 22")
print (vector_values[22]) 

print ("count value of the word at index 34, the word is 'loan'")
print (vector_values[34]) 

[1, 5, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 4, 3, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 3, 8, 1, 1, 1, 5, 3, 1, 1, 1, 1, 5, 2, 3, 3, 1]
count value of the word at index 22
1
count value of the word at index 34, the word is 'loan'
2


### Data Vectorisation
In this task we will vectorise all rows and fit a logistic regression model on the vectorised data.

In [34]:
#Subsetting 'X' (as a copy, so the column can be modified independently of data)
all_text = data[["X"]].copy()

#Converting 'X' to lower case
all_text["X"] = all_text['X'].str.lower()

#Initialising a count vectorizer object
cv = CountVectorizer()

#Creating the count vectorizer of our 'X' column
vector =cv.fit_transform(all_text["X"])

#Converting the count vectoriser to array
X = vector.toarray()

#Subsetting y (as a copy, so the column can be modified independently of data)
labels = data[["y"]].copy()

#Initialising a label encoder object
le = LabelEncoder()

#Label encoding 'y' column
labels["y"] = le.fit_transform(labels["y"])

#Splitting the dataset into train and test
X_train,X_test,y_train,y_test = train_test_split(X,labels["y"],test_size=0.4,random_state=42)

#Initialising Logistic Regression model
log_reg = LogisticRegression(random_state=42)

#Fitting the model on train data
log_reg.fit(X_train,y_train)

#Finding the accuracy score on test data
acc = log_reg.score(X_test,y_test)
print (acc)

0.48507462686567165


# Removing Stopwords
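In the previous task, we saw an accuracy of about 49% in predicting the product category. Now the question we need to ask is: can we improve the accuracy further? One answer lies in dealing with stopwords, the very common words (such as "the", "is" and "and") that appear in almost every document and therefore say little about which category a complaint belongs to.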

In [35]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
print (set(stopwords.words('english')))

{'then', 'below', "aren't", 'each', 'an', 'as', 'will', 'this', 'won', 'y', 'off', 'such', 't', "shan't", 'weren', 'these', 'out', 'yourselves', "don't", 'm', "hadn't", 'ain', 'in', 'by', 'do', 'your', 'themselves', 'him', 'they', 'very', 'why', 'its', 'ours', "you'd", 'few', 'couldn', 'but', 'd', 'now', 'be', 'itself', 'once', 'other', 'it', 'i', 'or', 'again', 'there', 'between', 'when', 'she', 'only', 'own', 'too', 'should', 'while', "couldn't", 'ourselves', 'whom', 'doesn', 'we', "mustn't", 'am', 'himself', 'does', 'no', 'same', 'isn', 'shan', 'of', 'me', 'if', 'mightn', "didn't", "weren't", "it's", "doesn't", 'over', 'up', 'don', 'against', 'shouldn', 'what', 'about', 'before', 'didn', 'ma', 'those', 'theirs', "isn't", 'that', 'until', 'nor', "shouldn't", 'being', 'myself', 'aren', 's', "you'll", 'through', "that'll", 'is', 'any', "you've", "you're", 'some', 'than', 'down', 'a', 'did', "hasn't", 'haven', 'hers', 'at', 'most', "she's", 'all', 're', "mightn't", 'here', 'where', 'and

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [36]:
from string import punctuation
print (list(punctuation))

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


We can also add our own list of stopwords that we want to remove from our body of text. Different domains can have different stopwords. For example, if we are classifying medical articles into subdomains such as orthopedics and neurology, then the word "medicine" would be a stopword in our case. We can add "medicine" to the set of stopwords in the following manner:

In [37]:
custom_set_of_stopwords = set(stopwords.words('english')+list(punctuation)+["medicine"])
print ("medicine" in custom_set_of_stopwords)

True


In [38]:
#Storing the first complaint (first row of the 'X' column)
first_complaint = data["X"].iloc[0]

print("\nFirst Complaint:\n",first_complaint)

bag_of_words = word_tokenize(first_complaint)

print ("\nBag of words of first complaint:\n",bag_of_words)
print("\nLen of bag of words:\n",len(bag_of_words))

#Removing stopwords (compared in lower case, since the stopword list is all lower case)
bow_stopwords_removed = [x for x in bag_of_words if x.lower() not in custom_set_of_stopwords]

print ("\nBag of words with stopwords removed:\n",bow_stopwords_removed)

print("Len of bag of words with stopwords removed:\n",len(bow_stopwords_removed))


First Complaint:
 When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.

Bag of words of first complaint:
 ['When', 'my', 'loan', 'was', 'switched', 'over', 'to', 'Navient', 'i', 'was', 'never', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'balance', 'because', 'with', 'XXXX', 'i', 'did', 'not', '.', 'When', 'going', 'to', 'purchase', 'a', 'vehicle', 'i', 'dis

### Applying to the whole dataset

In [39]:
#Initialising the count vectorizer with stop words parameter
cv_stop = CountVectorizer(stop_words="english")

#Creating the count vectorizer of our 'X' column
vector_stop = cv_stop.fit_transform(all_text["X"])

#Converting the count vectoriser to array
X_stop = vector_stop.toarray()

#Splitting the data to train and test
X_train,X_test,y_train,y_test = train_test_split(X_stop,labels["y"],test_size=0.4,random_state=42)

#Initalising a logistic regression model
log_reg = LogisticRegression(random_state=42)

#Fitting the model on train
log_reg.fit(X_train,y_train)

#Finding the accuracy score on test data
stop_acc = log_reg.score(X_test,y_test)
print (stop_acc)

0.5373134328358209


# TF-IDF
In the cells above, we saw how text was converted to numbers using a count vectorizer.

In other words, a count vectorizer counts the occurrences of the words in a document, and all the documents are considered independent of each other. This is very similar to one-hot encoding or the pandas get_dummies function. Even when multiple documents are involved, the count vectorizer does not assume any interdependence between the documents and treats each of them as a separate entity.

It does not rank the words based on their importance in the document, but only on how many times they occur. This is not a wrong approach, but it intuitively makes more sense to rank words by their importance in the document. In fact, the process of converting text to numbers should essentially be a ranking system for the words, so that each document gets a score based on the words it contains. After all, not all words have the same importance or relevance in a document.

There are two ways to approach document similarity:

TF-IDF Score

Cosine Similarity

Let's look at them one by one.

## TF-IDF!!
TF-IDF, or Term Frequency-Inverse Document Frequency, is kind of the holy grail of ranking metrics for converting text to numbers. Think of the count vectorizer as a metric which just counts the occurrences of words in a document.

TF-IDF takes it a step further and ranks words based not just on their occurrences in one document but across all the documents. The count vectorizer gives more importance to a word simply because it appears many times in a document. TF-IDF ranks a word high if it appears mostly in that one document, meaning it is rare and therefore more informative, and ranks it lower if it appears in all or most documents, because it is common.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

Example
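Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Using a base-10 logarithm, the inverse document frequency (i.e., idf) is log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. With the natural logarithm used in the formula above, idf would instead be ln(10,000) ≈ 9.21 and the weight 0.03 * 9.21 ≈ 0.28; the base of the logarithm only rescales the scores and does not change the ranking of terms.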
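
The arithmetic of this example can be checked with a few lines of Python. The function names are hypothetical helpers that simply transcribe the TF and IDF formulas given above; the `base` parameter is an assumption added so that both logarithm bases can be compared.

```python
import math

def term_frequency(term_count, total_terms):
    """TF(t) = occurrences of t in the document / total terms in the document."""
    return term_count / total_terms

def inverse_document_frequency(total_documents, documents_with_term, base=math.e):
    """IDF(t) = log(total documents / documents containing t), natural log by default."""
    return math.log(total_documents / documents_with_term, base)

tf = term_frequency(3, 100)
idf_base10 = inverse_document_frequency(10_000_000, 1_000, base=10)
idf_natural = inverse_document_frequency(10_000_000, 1_000)

tfidf_base10 = tf * idf_base10
tfidf_natural = tf * idf_natural

print(round(tfidf_base10, 4))   # 0.12
print(round(tfidf_natural, 4))  # 0.2763
```

Both variants rank terms identically within a corpus, because switching the base multiplies every idf value by the same constant.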

# Python Implementation of TF-IDF

In [40]:
complaint_1 = data["X"].iloc[0]
complaint_2 = data["X"].iloc[1]
complaint_3 = data["X"].iloc[2]

print ("Complaint 1: ", complaint_1)

print ("\nComplaint 2: ", complaint_2)

print ("\nComplaint 3: ", complaint_3)

Complaint 1:  When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.

Complaint 2:  I tried to sign up for a spending monitoring program and Capital One will not let me access my account through them

Complaint 3:  My mortgage is with BB & T Bank, recently I have been investigating ways to pay down my mortgage faster and I came across Biweekly Mortgage Calculator

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents called sents
sents = [complaint_1, complaint_2, complaint_3]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab

vectorizer.fit(sents)

vector = vectorizer.transform(sents)

print("Shape of the vectorized sentence:",vector.shape)

vector_values = vector.toarray().tolist()[0]

print("The tf-idf score of first five elements:",vector_values[:5])


# Converting the tf-idf score with the word into a dictionary
import operator
sorted_x = sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))
words = [x[0] for x in sorted_x]
d = dict(zip(words,vector_values))

print("Dictionary of words with tf-idf score:\n", d)

#Sorting this dictionary by value in the descending order to see the ranking
print("Sorted dictionary:\n")
print (sorted(d.items(), key=operator.itemgetter(1), reverse = True))

Shape of the vectorized sentence: (3, 214)
The tf-idf score of first five elements: [0.0, 0.0, 0.0, 0.0, 0.0]
Dictionary of words with tf-idf score:
 {'26': 0.0, '30': 0.0, 'about': 0.0, 'accelerated': 0.0, 'access': 0.0, 'account': 0.0, 'across': 0.0, 'active': 0.0, 'advertising': 0.0, 'after': 0.05253580411952334, 'all': 0.0, 'amount': 0.0, 'and': 0.20399369240069398, 'angry': 0.06907826902804956, 'answer': 0.0, 'applied': 0.0, 'asked': 0.0, 'at': 0.06907826902804956, 'back': 0.05253580411952334, 'balance': 0.13815653805609912, 'bank': 0.0, 'bb': 0.0, 'bbt': 0.0, 'be': 0.0, 'because': 0.06907826902804956, 'been': 0.10507160823904668, 'being': 0.06907826902804956, 'bi': 0.0, 'biweekly': 0.0, 'bringing': 0.06907826902804956, 'bureaus': 0.13815653805609912, 'but': 0.0, 'calculates': 0.0, 'calculator': 0.0, 'call': 0.0, 'called': 0.0, 'calling': 0.0, 'came': 0.0, 'can': 0.0, 'capital': 0.0, 'center': 0.0, 'checking': 0.0, 'collected': 0.0, 'com': 0.0, 'company': 0.06907826902804956, 'con

##### We can see that the model learns to give less importance to words like is, it, in, etc. Unfortunately, it also gives low importance to important words like financial and mortgage, and fairly high importance to unwanted words like the and was. It does give higher importance to words such as credit. This is because TF-IDF works better with larger corpora. Just like a machine learning model, the larger the data, the better the model. With a larger corpus, these issues would be resolved, because words like the would appear in almost every document while words like financial would appear in only some of them.

Rerunning this for about 100 documents, we see that the ranking is completely different.

In [42]:
sents=[]
for x in range(100):
    sents.append(data["X"].iloc[x])

from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents called sents
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(sents)
vector = vectorizer.transform(sents)
vector.shape 
vector_values = vector.toarray().tolist()[0]

sorted_x = sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))
words = [x[0] for x in sorted_x]
d = dict(zip(words,vector_values))
print("Sorted dictionary: \n")
print ((sorted(d.items(), key=operator.itemgetter(1), reverse = True))[:20])

Sorted dictionary: 

[('navient', 0.35431369455512246), ('the', 0.24041090745878946), ('delinquency', 0.23620912970341496), ('had', 0.20963581996093744), ('told', 0.19801036084460227), ('bureaus', 0.19189624902422267), ('was', 0.18441012909485335), ('score', 0.1637072393027124), ('credit', 0.15907030235158917), ('just', 0.15203704157649714), ('balance', 0.14549112699503114), ('to', 0.14437952896171652), ('and', 0.14013869297167467), ('loan', 0.1344398602659683), ('angry', 0.12870728663065903), ('deliquint', 0.12870728663065903), ('expalin', 0.12870728663065903), ('faithful', 0.12870728663065903), ('hurried', 0.12870728663065903), ('switched', 0.12870728663065903)]


##### Notice that "the" has moved down from 0.42 to 0.24, while navient has increased from 0.20 to 0.35 and bureaus from 0.13 to 0.19. As we include more and more documents, words that appear frequently across all of them, such as "the", move down in value, while words like bureaus and navient, which appear in far fewer documents, start to increase. This reiterates the earlier point: TF-IDF works better with larger corpora. Just like a machine learning model, the larger the data, the better the model.
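
The absolute scores printed by scikit-learn differ from the textbook formula because `TfidfVectorizer`, with its default settings, uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit length. The sketch below uses a made-up three-sentence corpus (not taken from the complaints data) to show that a word present in every document receives the lowest IDF, while a word present in only one document receives the highest. The helper `idf_of` is hypothetical and exists only for this illustration.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus: "the" is in all 3 documents, "was" in 2, "navient" in 1
toy_corpus = [
    "the loan was sold to navient",
    "the bank closed the account",
    "the mortgage payment was late",
]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_corpus)

def idf_of(word):
    """Look up the learned IDF of a word through the fitted vocabulary."""
    return toy_vectorizer.idf_[toy_vectorizer.vocabulary_[word]]

n_docs = len(toy_corpus)
# Smoothed IDF used by default: ln((1 + n) / (1 + df)) + 1
expected_the = math.log((1 + n_docs) / (1 + 3)) + 1
expected_navient = math.log((1 + n_docs) / (1 + 1)) + 1

print("idf('the')     =", idf_of("the"), "| by formula:", expected_the)
print("idf('was')     =", idf_of("was"))
print("idf('navient') =", idf_of("navient"), "| by formula:", expected_navient)
```

The more documents a word appears in, the closer its IDF gets to the minimum of 1, which is the effect observed for "the" as the corpus grows.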

## Applying to the dataset

In [43]:
#Initialising the tf-idf model
tfidf = TfidfVectorizer(stop_words="english")

#Vectorizing the 'X' column
vector =tfidf.fit_transform(all_text["X"])

#Converting the vector to array
X_tfidf = vector.toarray()

#Splitting the dataset into train and test
X_train,X_test,y_train,y_test = train_test_split(X_tfidf,labels["y"],test_size=0.4,random_state=42)

#Initialising the logistic regression model
log_reg = LogisticRegression(random_state=42)

#Fitting the model with train data
log_reg.fit(X_train,y_train)

#Finding the accuracy score of model on test data
tfidf_acc = log_reg.score(X_test,y_test)
print (tfidf_acc)

0.44029850746268656


# Applying Naive Bayes Classifier
The Naive Bayes classifier is a linear classifier based on Bayes' theorem. The term naive comes from the assumption that all features in a dataset are mutually independent given the class. This independence assumption is generally violated in real datasets, but the Naive Bayes classifier still tends to perform very well.
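
As a small illustration of the idea, the sketch below trains a multinomial Naive Bayes model on four invented complaints (they are not from the dataset) and inspects the class probabilities it assigns to a new sentence. The `Pipeline` wrapper is an assumption made for brevity; the notebook itself calls the vectorizer and the classifier separately.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy complaints and labels, for illustration only
toy_texts = [
    "my mortgage payment is late",
    "mortgage escrow problem with bank",
    "student loan servicer navient",
    "student loan deferment denied",
]
toy_labels = ["Mortgage", "Mortgage", "Student loan", "Student loan"]

# Vectorize with TF-IDF, then fit multinomial Naive Bayes on the weights
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
nb_pipeline.fit(toy_texts, toy_labels)

new_complaint = ["problem with my mortgage escrow"]
nb_prediction = nb_pipeline.predict(new_complaint)
nb_probabilities = nb_pipeline.predict_proba(new_complaint)

print("classes:      ", nb_pipeline.classes_)
print("probabilities:", nb_probabilities)
print("prediction:   ", nb_prediction)
```

Every word of the new sentence occurs only in the mortgage examples, so the per-word likelihoods, multiplied together under the independence assumption, all favour the same class.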

In this task, instead of our usual dataset, we will use a larger dataset with the same features but 10,000 rows. After loading it from a CSV file, we will apply TF-IDF vectorization and then train the Naive Bayes classifier.

In [44]:
from sklearn.naive_bayes import MultinomialNB


# reading the data
data = pd.read_csv('file2.csv')

# keeping the relevant columns
data = data[["Consumer complaint narrative", "Product"]]

# renaming the columns
data.columns = ["X", "y"]

# dropping the nan values
data = data.dropna()

# X

# Subsetting 'X' column (copied so the assignment below does not act on a view of `data`)
all_text = data[["X"]].copy()

# Converting the 'X' column to lower case
all_text["X"] = all_text['X'].str.lower()

# Initialising a tfidf vectorizer object with stopwords
tfidf = TfidfVectorizer(stop_words="english")

# Vectorizing the 'X' column
vector = tfidf.fit_transform(all_text["X"])

# Converting vector to array
X_tfidf = vector.toarray()

# y

# Subsetting 'y' column (copied so the assignment below does not act on a view of `data`)
labels = data[["y"]].copy()

# Initialising label encoder object
le = LabelEncoder()

# Label encoding 'y' column
labels["y"] = le.fit_transform(labels["y"])

# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, labels["y"], test_size=0.4, random_state=42)

# Initialising a naive bayes classifier
nb = MultinomialNB()

# Fitting the model on train data
nb.fit(X_train, y_train)

# Finding the accuracy score of model on test data
nb_acc = nb.score(X_test, y_test)
print(nb_acc)

#Code ends here


0.42461964038727523


In [45]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.naive_bayes import MultinomialNB

#Code starts here
#Initialising a random over sampler object
ros = RandomOverSampler(random_state=0)

#Sampling the train data (fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

#Initialising multinomial naive bayes model
nb = MultinomialNB()

#Fitting the sampled train data
nb.fit(X_ros,y_ros)

#Finding the accuracy score of model on test data
ros_score=nb.score(X_test,y_test)
print(ros_score)

#Code ends here

0.6016597510373444


# Applying SVM
Support Vector Machines are based on the concept of decision planes that define decision boundaries. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which can help categorize new examples.

### Why do SVMs work for text classification?

High-dimensional input space:
When dealing with text data, we need to handle many features (usually more than 10,000). Since SVMs (particularly linear SVMs) use overfitting protection that does not depend directly on the number of features, they can handle large feature spaces.

Few irrelevant features:
As an extension of the above point, one can't really do rigorous feature selection during text classification. Research has shown that even the lowest-ranked features still contain considerable information. SVMs are therefore well suited to this large feature space, in which feature selection or reduction can't be achieved satisfactorily.

Most text categorisation problems are linearly separable:
Many experiments have concluded that text categorisation problems are usually linearly separable. Since the idea behind SVMs is to find such linear separators, SVMs tend to work better than most other models on these problems.
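
A rough sketch of these points, assuming scikit-learn's `LinearSVC` as the linear SVM and a four-sentence invented corpus: the TF-IDF matrix has more features than documents, it stays sparse, and a linear separator still fits the training data. The notebook itself uses `SVC(kernel='linear')`; `LinearSVC` is swapped in here only as a lighter-weight linear SVM for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented toy complaints and labels, for illustration only
svm_texts = [
    "my mortgage payment is late",
    "mortgage escrow problem with bank",
    "student loan servicer navient",
    "student loan deferment denied",
]
svm_labels = ["Mortgage", "Mortgage", "Student loan", "Student loan"]

svm_vectorizer = TfidfVectorizer()
X_toy = svm_vectorizer.fit_transform(svm_texts)   # scipy sparse matrix

# Even this tiny corpus has more features (distinct words) than documents
print("documents x features:", X_toy.shape)

# Fit a linear SVM directly on the sparse TF-IDF matrix
svm_model = LinearSVC(C=1.0, random_state=0)
svm_model.fit(X_toy, svm_labels)
svm_train_accuracy = svm_model.score(X_toy, svm_labels)
print("training accuracy:", svm_train_accuracy)

# Classify an unseen sentence with the fitted vocabulary and model
svm_prediction = svm_model.predict(
    svm_vectorizer.transform(["mortgage escrow problem"]))
print("prediction:", svm_prediction)
```

With one dimension per distinct word, each class has words the other never uses, which is what makes a separating hyperplane easy to find.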

# Comparison between ML text classifiers


### Advantages
Naive Bayes

- Performs well with a small amount of training data (e.g. spam filtering and email categorisation) when estimating the parameters for classification

- Works well on numeric and textual data and is easy to implement

SVM

- Captures the inherent characteristics of the data better and handles missing data well

- When you are building the model from the features' point of view, SVM looks at the interactions between the features to a certain degree.

- SVM efficiently handles the presence of very few irrelevant features and the linear separability of data

### Disadvantages

Naive Bayes

- Performs very poorly when features are highly correlated and does not consider the frequency of word occurrences

SVM

- Difficult to do parameter tuning and kernel selection

In [46]:
from sklearn.svm import SVC

#Code starts here

svc = SVC(random_state = 0, kernel = 'linear')
svc.fit(X_ros,y_ros)

print(svc.score(X_test,y_test))

0.648686030428769
