<a href="https://colab.research.google.com/github/vatsal2210/Topic-Modeling-Using-LDA/blob/main/Topic_Modeling_using_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import  TfidfVectorizer, CountVectorizer
from sklearn import decomposition
import matplotlib.pylab as plt
import numpy as np
import re
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

1. Fetch Consumer Complaints Database - [Link](https://github.com/vatsal2210/Topic-Modeling-Using-LDA/blob/main/dataset/consumer_compliants.zip)

In [3]:
df = pd.read_csv('https://github.com/vatsal2210/Topic-Modeling-Using-LDA/blob/main/dataset/consumer_compliants.zip?raw=true', compression='zip', sep=',', quotechar='"')

In [6]:
df.shape

(57453, 18)

In [8]:
df.tail(5)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
57448,2/29/2020,Student loan,Federal student loan servicing,Dealing with your lender or servicer,Trouble with how payments are being handled,I am attempting to make a payment toward my st...,,"Nelnet, Inc.",KS,,,Consent provided,Web,2/29/2020,Closed with explanation,Yes,,3549178
57449,2/11/2020,Debt collection,Other debt,Attempts to collect debt not owed,Debt was paid,Received letter for {$480.00}. Original credit...,Company has responded to the consumer and the ...,"The Receivable Management Services LLC, New Yo...",AZ,853XX,,Consent provided,Web,2/18/2020,Closed with explanation,Yes,,3527928
57450,2/29/2020,Debt collection,Other debt,Communication tactics,"Used obscene, profane, or other abusive language",entire time 10 years until XX/XX/2020. XXXX ma...,Company has responded to the consumer and the ...,"Convergent Resources, Inc.",NJ,8101,,Consent provided,Web,2/29/2020,Closed with explanation,Yes,,3549238
57451,1/16/2020,Checking or savings account,Checking account,Problem with a lender or other company chargin...,Transaction was not authorized,I am a customer with Wells Fargo Bank. Recentl...,Company has responded to the consumer and the ...,WELLS FARGO & COMPANY,AZ,852XX,,Consent provided,Web,1/22/2020,Closed with explanation,Yes,,3498566
57452,1/16/2020,Debt collection,Auto debt,Took or threatened to take negative or legal a...,Threatened or suggested your credit would be d...,I spoken with them several times in a year. An...,,Exeter Finance Corp.,SC,,,Consent provided,Web,1/16/2020,Closed with explanation,Yes,,3498524


2. Understand the dataset

In [10]:
df['Product'].value_counts()

Debt collection                21772
Credit card or prepaid card    13193
Mortgage                        9799
Checking or savings account     7003
Student loan                    2950
Vehicle loan or lease           2736
Name: Product, dtype: int64

In [11]:
df['Company'].value_counts()

CITIBANK, N.A.                           3226
CAPITAL ONE FINANCIAL CORPORATION        2711
BANK OF AMERICA, NATIONAL ASSOCIATION    2580
JPMORGAN CHASE & CO.                     2409
WELLS FARGO & COMPANY                    2001
                                         ... 
JAMES B. NUTTER & COMPANY                   1
WEBER & ASSOCIATES, INC.                    1
MedShield, Inc.                             1
BANC OF CALIFORNIA, INC.                    1
ABR Recovery Services LLC.                  1
Name: Company, Length: 2197, dtype: int64

3. Keep only the three columns needed to study the complaint <> company <> product relationship, and rename the narrative column to `complaints`


In [13]:
complaints_df=df[['Consumer complaint narrative','Product','Company']].rename(columns={'Consumer complaint narrative':'complaints'})

4. To identify topics, train a model on the `complaints` column. First split the dataset: `test_size=0.6` holds out 60% of the rows and leaves 40% for training

In [18]:
X_train, Y_train = train_test_split(complaints_df ,test_size=0.6, random_state=111)

In [19]:
X_train['Product'].value_counts()

Debt collection                8720
Credit card or prepaid card    5297
Mortgage                       3809
Checking or savings account    2822
Student loan                   1236
Vehicle loan or lease          1097
Name: Product, dtype: int64
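As a quick sanity check on the split, the arithmetic below reproduces the row counts. It is a sketch that assumes scikit-learn rounds the test share up and gives the remainder to the training split; the total of 57,453 rows comes from `df.shape`, and the per-product counts are copied from the output above.

```python
import math

n_rows = 57453      # from df.shape
test_size = 0.6     # value passed to train_test_split

# Assumption: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

# Per-product counts printed by X_train['Product'].value_counts()
train_counts = [8720, 5297, 3809, 2822, 1236, 1097]

print(n_train, n_test, sum(train_counts))
```

Note that `Y_train` here is the held-out 60% of the rows, not a label vector, despite its name.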

In [28]:
# Set up the stemmer and the English stop-word list
stemmer = PorterStemmer()
# stemmer = nltk.stem.SnowballStemmer('english')
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [49]:
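# Drop tokens of 3 characters or fewer, stop words, and redaction placeholders such as XXXX or XX/XX/XXXX

def tokenize(text):
  # Keep tokens longer than 3 characters that still have more than 2 characters once X, x and / are stripped from both ends
  tokens = [word for word in nltk.word_tokenize(text) if (len(word) > 3 and len(word.strip('Xx/')) > 2)]
  # Lowercase before the stop-word check, then stem what remains
  tokens = map(str.lower, tokens)
  stems = [stemmer.stem(item) for item in tokens if (item not in stop_words)]
  return stems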
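To see what the length and redaction filter does in isolation, here is a minimal sketch of just that condition. It uses a plain whitespace split as a stand-in for `nltk.word_tokenize` (an assumption made so the example needs no downloads), and `keep_token` and the sample sentence are hypothetical.

```python
def keep_token(word):
    """Same condition as in tokenize(): long enough, and not mostly X / x / slash."""
    return len(word) > 3 and len(word.strip('Xx/')) > 2

# Hypothetical complaint fragment with redacted date and name
sample = "On XX/XX/XXXX I paid XXXX the fee for my account"

# Whitespace split is only a stand-in for nltk.word_tokenize
kept = [w for w in sample.split() if keep_token(w)]
print(kept)
```

Short words such as "the" and "fee" fail the first test, while "XXXX" and "XX/XX/XXXX" fail the second because nothing is left after stripping.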

In [50]:
# TfidfVectorizer --> converts text to numeric vectors
# LDA needs raw word counts per document, so use_idf=False and norm=None switch off the TF-IDF weighting
# max_features --> keep at most 10,000 terms
# max_df --> ignore terms that appear in more than 75% of documents
# min_df --> ignore rare terms that appear in fewer than 50 documents

vectorizer_tf = TfidfVectorizer(tokenizer=tokenize, stop_words=None, max_df=0.75, min_df=50, max_features=10000, use_idf=False, norm=None) # lowercase=False, ngram_range=(1,2)
tf_vectors = vectorizer_tf.fit_transform(X_train.complaints)
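The `min_df` and `max_df` thresholds are easier to reason about on a toy corpus. The sketch below is a plain-Python approximation of document-frequency filtering, not the library's implementation; `filter_by_document_frequency` and the toy documents are hypothetical, and the thresholds are scaled down to fit four documents.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """docs: list of token lists.
    Keep terms found in at least min_df documents and in at most
    a max_df fraction of all documents."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return sorted(t for t, c in df.items() if c >= min_df and c / n_docs <= max_df)

# Hypothetical, already-stemmed documents
toy_docs = [
    ["payment", "late", "fee"],
    ["payment", "card", "fraud"],
    ["payment", "loan", "late"],
    ["payment", "card", "charg"],
]

kept_terms = filter_by_document_frequency(toy_docs, min_df=2, max_df=0.75)
print(kept_terms)
```

Here "payment" is dropped for being too common (it appears in every document) and the single-occurrence terms are dropped for being too rare, which mirrors the intent of `max_df=0.75` and `min_df=50` in the notebook.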

In [51]:
# Document-term matrix as a dense array (one row per document, one column per vocabulary term, at most 10,000 columns)
tf_vectors.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [52]:
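# Vocabulary terms kept by the vectorizer (at most 10,000)
# Note: scikit-learn 1.2 removed get_feature_names(); newer versions use get_feature_names_out()
vectorizer_tf.get_feature_names()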

['0.00',
 '1.00',
 '10.00',
 '100.00',
 '1000.00',
 '10000.00',
 '100000.00',
 '110.00',
 '1100.00',
 '11000.00',
 '12.00',
 '120.00',
 '1200.00',
 '12000.00',
 '130.00',
 '1300.00',
 '13000.00',
 '140.00',
 '1400.00',
 '14000.00',
 '15.00',
 '150.00',
 '1500.00',
 '15000.00',
 '160.00',
 '1600.00',
 '16000.00',
 '1681c-2',
 '1692',
 '1692g',
 '170.00',
 '1700.00',
 '180.00',
 '1800.00',
 '18000.00',
 '190.00',
 '1900.00',
 '2.00',
 '20.00',
 '200.00',
 '2000.00',
 '20000.00',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019',
 '2019.',
 '2020',
 '210.00',
 '2100.00',
 '220.00',
 '2200.00',
 '230.00',
 '2300.00',
 '240.00',
 '2400.00',
 '25.00',
 '250.00',
 '2500.00',
 '25000.00',
 '260.00',
 '2600.00',
 '27.00',
 '270.00',
 '2700.00',
 '28.00',
 '280.00',
 '2800.00',
 '29.00',
 '290.00',
 '2900.00',
 '3.00',
 '30-day',
 '30.00',
 '300.00',
 '3000.00',
 '30000.00',
 '310.00',
 '320.00',
 '3200.00',
 '3300.00',
 '340.00',
 '3400.00',
 '35.00',
 '350.00',
 '3500.00',
 '36.00',
 '360.00',
 '360

In [53]:
# Apply LDA --> describe each document as a mixture of latent (unlabeled) topics
# n_components --> 6 topics (chosen based on the domain)
# max_iter --> 3 passes over the data (more passes usually give better-converged topics)
# learning_method --> 'batch' or 'online' ('online' updates on mini-batches and is generally preferred for large datasets)

lda = decomposition.LatentDirichletAllocation(n_components=6, max_iter=3, learning_method='online', learning_offset=50, n_jobs=1, random_state=111)

# W1 --> document-topic matrix; H1 = lda.components_ --> topic-word matrix
W1 = lda.fit_transform(tf_vectors)
H1 = lda.components_
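To get a feel for how little training three online passes amount to, the sketch below counts mini-batch updates and shows how the update weight decays. It assumes scikit-learn's defaults of `batch_size=128` and `learning_decay=0.7`, neither of which the notebook sets, and it assumes the weight follows the `(learning_offset + t) ** -learning_decay` schedule used in online variational LDA. The 22,981 figure is the sum of the `X_train` product counts.

```python
import math

n_docs = 22981          # training rows (sum of the X_train product counts)
max_iter = 3            # from the notebook
learning_offset = 50.0  # from the notebook
learning_decay = 0.7    # assumption: scikit-learn default
batch_size = 128        # assumption: scikit-learn default

batches_per_pass = math.ceil(n_docs / batch_size)
total_updates = batches_per_pass * max_iter

def update_weight(t):
    """Weight given to the t-th mini-batch update (t starts at 1)."""
    return (learning_offset + t) ** (-learning_decay)

first, last = update_weight(1), update_weight(total_updates)
print(batches_per_pass, total_updates, round(first, 4), round(last, 4))
```

A larger `learning_offset` shrinks the early weights, so the first few mini-batches influence the topics less.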

In [54]:
# W1 --> weight of each of the 6 topics in every document (each row sums to 1)

W1

array([[0.1036939 , 0.23654372, 0.00099128, 0.65679507, 0.00098766,
        0.00098837],
       [0.001034  , 0.14970793, 0.12708838, 0.72010606, 0.00103304,
        0.00103059],
       [0.00482538, 0.44654118, 0.00477829, 0.53425531, 0.00480577,
        0.00479408],
       ...,
       [0.33211016, 0.13613353, 0.00274475, 0.28016976, 0.24606311,
        0.00277869],
       [0.01401685, 0.01402635, 0.01391855, 0.6112595 , 0.33265038,
        0.01412837],
       [0.00122595, 0.0012292 , 0.73695656, 0.25813306, 0.0012208 ,
        0.00123442]])

In [55]:
# Find the top 15 words in each topic to understand what each topic is about
# top_words --> map the 15 largest weights in a topic row back to vocabulary terms

num_words = 15

vocab = np.array(vectorizer_tf.get_feature_names())

top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_words-1:-1]]
topic_words = ([top_words(t) for t in H1])
topics = [' '.join(t) for t in topic_words]
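The slice `np.argsort(t)[:-num_words-1:-1]` sorts the indices by ascending weight and then walks backwards from the end, so it yields the indices of the `num_words` largest weights, largest first. A NumPy-free sketch of the same idea follows; `top_indices`, the toy vocabulary, and the toy topic row are all hypothetical.

```python
def top_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:n]

toy_vocab = ["account", "card", "debt", "loan", "payment"]  # hypothetical vocabulary
toy_topic = [0.5, 2.0, 0.1, 3.5, 1.2]                       # hypothetical topic-word weights

top = [toy_vocab[i] for i in top_indices(toy_topic, 3)]
print(top)
```

With tied weights the two approaches may order the tied terms differently, which is harmless for inspecting topics.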

In [56]:
# Top words for each of the 6 topics

topics

['call time told would money year back said help compani work never tri want even',
 'call receiv would told email payment inform phone time contact ask back servic said number',
 'loan mortgag home servic payment insur properti document request state compani provid escrow modif date',
 'payment account credit balanc check charg month paid amount late interest bank fee made statement',
 'card account bank credit charg fraud chase disput transact open close claim citi use america',
 'debt report credit collect account inform compani letter disput provid request agenc valid receiv remov']

In [57]:
# Find the dominant topic for each document
# argmax --> index of the topic with the largest weight in each row

colnames = ["Topic" + str(i) for i in range(lda.n_components)]
docnames = ["Doc" + str(i) for i in range(len(X_train.complaints))]
df_doc_topic = pd.DataFrame(np.round(W1, 2), columns=colnames, index=docnames)
significant_topic = np.argmax(df_doc_topic.values, axis=1)
df_doc_topic['dominant_topic'] = significant_topic

In [58]:
df_doc_topic

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,dominant_topic
Doc0,0.10,0.24,0.00,0.66,0.00,0.00,3
Doc1,0.00,0.15,0.13,0.72,0.00,0.00,3
Doc2,0.00,0.45,0.00,0.53,0.00,0.00,3
Doc3,0.28,0.20,0.00,0.00,0.03,0.49,5
Doc4,0.00,0.00,0.00,0.00,0.21,0.77,5
...,...,...,...,...,...,...,...
Doc22976,0.05,0.28,0.00,0.57,0.09,0.00,3
Doc22977,0.23,0.00,0.05,0.43,0.13,0.16,3
Doc22978,0.33,0.14,0.00,0.28,0.25,0.00,0
Doc22979,0.01,0.01,0.01,0.61,0.33,0.01,3
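The dominant topic alone hides how close the runner-up is. The sketch below recomputes the dominant topic for three rows copied from the table above and adds the gap to the second-largest weight; `dominant_topic` and `margin` are hypothetical helper names, and plain Python stands in for `np.argmax`.

```python
# Rows copied from the document-topic table above
rows = {
    "Doc0": [0.10, 0.24, 0.00, 0.66, 0.00, 0.00],
    "Doc3": [0.28, 0.20, 0.00, 0.00, 0.03, 0.49],
    "Doc22978": [0.33, 0.14, 0.00, 0.28, 0.25, 0.00],
}

def dominant_topic(weights):
    """Index of the largest weight; the first index wins on ties."""
    return max(range(len(weights)), key=lambda i: weights[i])

def margin(weights):
    """Gap between the largest and second-largest topic weight."""
    top, second = sorted(weights, reverse=True)[:2]
    return round(top - second, 2)

summary = {doc: (dominant_topic(w), margin(w)) for doc, w in rows.items()}
print(summary)
```

Doc22978 is assigned Topic0 by a small gap over Topic3, so a single label for it should be read with caution. Because the notebook applies `argmax` to weights rounded to two decimals, near-ties like this can also resolve to the lower topic index.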


In [61]:
X_train.head()

Unnamed: 0,complaints,Product,Company
36524,"On XX/XX/2020. Checks for the full amounts owed were sent to each of these accounts from XXXX to Capital One : Capital One ending in # XXXX {$3700.00} Capital One ending in # XXXX {$680.00} Capital One ending in # XXXX {$2600.00} Capital One ending in # XXXX {$2100.00} Capital One ending in # XXXX {$3400.00} Total payment on these accounts : {$12000.00}. \nPrior to these payoff checks sent, I made my monthly payments due XX/XX/XXXX to avoid any late fees while these checks were received and processed by Capital One. \nI later received statements showing I owed interest on each one after payment in full was received. After sending them a letter priority mail on XX/XX/2020, asking why I was paying interest on top of interest, I received a letter on XX/XX/2020 letting me know the accounts were closed per my request, which I had requested they close since I was paying in full BUT no answer to my question. I attempted to call several times prior to mailing this letter but had not success in getting anyone on the line after waiting over an hour. I went ahead and paid the interest on these accounts which added to {$280.00}, out of fear I may be hit with a late fee. I attempted to call again XX/XX/2020. After waiting over an hour for a rep on phone, I was told I had to pay the interest on interest and also a residual fee after for each one. ( how do you pay interest on interest? ) Residual fee? On accounts paid in full? They said, that was their policy. \nShortly after making full payments on these accounts, On XX/XX/XXXX, my husband lost his job due to COVID and is currently unemployed. I have struggled to pay these additional fees that I find to be an illegal act by this credit card company. I was told if I did not pay the additional fees, a late fee will be added on top. I should not have to pay interest or residual fees on a payoff amount that had not outstanding balance. I also requested a copy of their agreement stating they are allowed to do this. 
I never received a response or any response to any of my questions. Again, we are in no financial position at this time due to loss of unemployment to be slammed with extra fees. It is unfortunate that the timing on these payments made in full and my husband losing his job came at a worse time.",Credit card or prepaid card,CAPITAL ONE FINANCIAL CORPORATION
46281,"On or about XX/XX/XXXX I opened a Citibank Basic Checking Account after seeing an offer described by Citibank that if I opened a Basic Checking Account, deposited at least {$5000.00} within 30 days of accounting opening, and held at least that amount in the account for 60 days that within 90 days after fulfilling the conditions of the agreement I would receive a {$200.00} checking account bonus. I soon deposited approximately {$5000.00} on XX/XX/XXXX into the account. On XX/XX/XXXX I received another email update from Citibank about the offer and thanking me for enrolling in the Citibank {$200.00} checking offer and repeating the terms of the agreement. \n\nOn or about XX/XX/XXXX I utilized the Citibank Mobile Application 's chat function to inquire as to when my bonus might be depositing, as it had been well over 60 days since when I opened the account and deposited the qualifying funds on XX/XX/XXXX. 60 days since my deposit would have been around early XX/XX/XXXX. The Mobile Application representative stated they could not resolve the issue and to call their One Stop Sales Unit, which I did on XX/XX/XXXX. The representative at the One Stop Sales Unit was confused and did not know how to search for the offer clearly, saying too that often the offers are for holding money in the account for 90 days or more, and provided essentially no clarity or guidance. \n\nI withdrew {$3000.00} of the {$5000.00} I had held in the account since XX/XX/XXXX on XX/XX/XXXX, my first withdrawl from the account. I am concerned as Citibank has been very unclear on when my deposit bonus will be depositing, as that is {$5000.00} I could be storing in other accounts that actually provided material interest income ( which are generally well over 1.7 % at the moment compared to Citibank Basic Checking 's essential zero interest ). 
Furthermore Citibank makes it extremely difficult to get any clear, definite answers on the matter as neither their chat representative nor their One Stop Sales Unit personnel have any meaningful knowledge on how to handle the matter.",Checking or savings account,"CITIBANK, N.A."


In [62]:
# Run on unseen data: the first 5 complaints of Y_train (the held-out split, not labels)

W_hold = lda.transform(vectorizer_tf.transform(Y_train.complaints[:5]))
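Calling `transform` rather than `fit_transform` on the vectorizer reuses the vocabulary learned from the training split, so words that never appeared in training contribute nothing to the held-out vectors. The sketch below illustrates that behaviour in plain Python; `count_vector`, the vocabulary, and the document are hypothetical.

```python
def count_vector(tokens, vocab):
    """Count tokens against a fixed vocabulary; unknown tokens are skipped."""
    index = {term: i for i, term in enumerate(vocab)}
    vector = [0] * len(vocab)
    for tok in tokens:
        if tok in index:
            vector[index[tok]] += 1
    return vector

train_vocab = ["account", "card", "loan", "payment"]      # hypothetical fitted vocabulary
new_doc = ["payment", "overdraft", "account", "payment"]  # "overdraft" is not in the vocabulary

vec = count_vector(new_doc, train_vocab)
print(vec)
```

The vector length always matches the training vocabulary, which is what lets `lda.transform` score new complaints against the topics it already learned.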

In [65]:
colnames = ["Topic" + str(i) for i in range(lda.n_components)]
docnames = ["Doc" + str(i) for i in range(len(Y_train.complaints[:5]))]
df_doc_topic = pd.DataFrame(np.round(W_hold, 2), columns=colnames, index=docnames)
significant_topic = np.argmax(df_doc_topic.values, axis=1)
df_doc_topic['dominant_topic'] = significant_topic

In [66]:
df_doc_topic

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,dominant_topic
Doc0,0.05,0.21,0.0,0.44,0.3,0.0,3
Doc1,0.0,0.29,0.45,0.0,0.0,0.25,2
Doc2,0.2,0.01,0.01,0.28,0.5,0.01,4
Doc3,0.25,0.34,0.33,0.08,0.0,0.0,1
Doc4,0.23,0.0,0.0,0.76,0.0,0.0,3


In [67]:
Y_train.head()

Unnamed: 0,complaints,Product,Company
30060,"I have a business checking account at BB & T. On XX/XX/2019, I attempted to deposit a check into my account and I received a message stating that I was over my monthly mobile deposit limit. I was confused because it was the first of the month and I had not deposited any checks since the previous month. I called BB & T and they said that I couldnt deposit checks into business accounts via the mobile app even though I had done that before. \n\nI was instructed to open a personal account, into which I could deposit checks via the mobile app. I was told that if I opened the account online I would have immediate access, that I could link my personal and business accounts, and immediately be able to transfer money between them. \n\nOn XX/XX/XXXX, I opened my personal account online. Though I successfully opened online, I did not have online access as I had been promised. Because I was traveling in an area where there were no BB & T branches, I could not go into a branch until XX/XX/2019. In the intervening time, my business account became overdrawn. I had the money in my personal account to bring it back into the black, but BB & T could not make the transfer until I went into a branch. During this time, I incurred an astonishing {$320.00} in overdraft fees in my business account because I had no way of transferring money between the two accounts. \n\nWhen I went into the branch, I met with XXXX XXXX in XXXX XXXX County, Florida. She investigated the accounts and told me she would link them and that I would have online access in XXXX hours. The online access still never material, I was not able to transfer money and I received an additional {$36.00} overdraft fee. This brought the total to {$360.00} in overdraft fees. \n\nI had only opened the second account on BB & Ts advice ; I could have had my customer send a bank wire directly into business account in the first week of XXXX to keep the account positive. 
My TOTAL deposits between the two accounts was always positive. \n\nI requested a refund and was told they would not refund any of the fees.",Checking or savings account,BB&T CORPORATION
53473,"To who it may concern, My concern is regarding Shellpoint Mortgage Servicing Company. This company has received my monthly mortgage payments and have failed to account it as payments received. My bank has sent me proof the account was cleared and paid. This company is trying to foreclosed on my house and has sent me a letter indicating they sent my file to their Loss Mitigation dept. The person who they assigned it to is XXXX XXXX XXXX x XXXX. Their corporate number is XXXX. \n\nI have can send proof of my claim from above should you need it. Please assist.",Mortgage,"Shellpoint Partners, LLC"
35879,I contacted XXXX about fraudulent charges that were made on my account and the customer service representative told me that they wouldnt be able to issue anew card or remove my fraudulent charges since the account was closed due to non-payment. I was unaware of this these charging that were made.. To my knowledge I didnt owe a payment because the account wasnt being used.,Credit card or prepaid card,"EQUIFAX, INC."
20993,"I first applied for the Fedloan Serving program in XXXX in hopes of getting loan forgiveness since I've been paying on my student loans since late XXXX or early XXXX and I have been a public servant since XXXX, XXXX. I have never missed a payment to any of the various loan companies that owned my loans. The Fedloan Program turned me down in XXXX, saying my loans, which were originally Federal Stafford Loans, did not qualify me for the Program. In XXXX, XXXX, I was solicited by XXXX XXXX and told I could get into this program to reduce my loan payments. I was charged {$690.00} by XXXX XXXX to get me into the Fedloan Serving program. My loan payments went down to $ XXXX/month for the first year. In XXXX, XXXX, Fedloan Servicing informed me via email that I needed to recertify my "" income-driven repayment plan. Their customer service helped me submit the form XXXX XXXX ) via the internet and it involved the IRS. For reasons I don't understand, Fedloan is now increasing my monthly payments to {$380.00}, effective XX/XX/XXXX. This is a 60 % increase. On XX/XX/XXXX, I called Fedloan regarding this issue. The customer service person suggested I ask for a recalculation and provide pay-stubs rather than IRS information. On the same call, I was transferred to a second agent who listened to my concern and suggested I get out of the Fedloan Program since I really didn't qualify for it anyway and I could save money by going to a regular payment plan with a different servicer such as XXXX which had serviced my loan previously. I feel like I've been miss-lead by the Fedloan Serving Program and by Mr. XXXX XXXX, the case manager at XXXX XXXX ( http : //www.nexum-servicing.com ). After receiving my {$690.00}, I was never able to talk to XXXX on the phone again. He had convinced me I could save money with Fedloan and he told me my payments would go down to {$0.00} when I retire from State Service in XXXX. 
He told me I would receive loan forgiveness once my payments went to {$0.00}. I have concluded his advise was incorrect and or miss-leading. I regret having my loan transferred to the Fedloan Program because its representatives have indicated I will not receive loan forgiveness for my 20 years of public service. I am very concerned about the cost of my student loan debt, especially since I intend to retire from State service in XXXX. I am also concerned about how I will be able to afford these student loan payments once I am receiving a pension. I am also very disappointed that I haven't been able to "" qualify '' for loan forgiveness based on my long years of public service. I think this whole Fedloan Program is complete scam and I have been duped.",Student loan,AES/PHEAA
53,"On several occasions ( XXXX ) I have tried to reach someone in costumer service today XX/XX/2020 I tried again to get through and was on hold for over 20 minutes and hung up. \n\n3 Years ago I have made no more than {$600.00} in purchases, and while making payments on my account with a {$9000.00} limit the interest charges are deducted from the credit limit and the interest was based on the balance each month. this left my interest and charge fess higher than my monthly payment which amounted to around {$240.00} each month. \n\nIn all of this time the interest was so big that it completely eat up the limit balance on the card, which left in me in a vicious cycle of never paying down the money owed. with in 2 years I was able to get the balance down {$2000.00} and I asked Walmart Capital One to reduce the balance down. Now it is back up to the high bank fees and interest charges that have again gone beyond what I can afford monthly the {$240.00}. This is a unfair and infantile game that is forced on me as an consumer.",Credit card or prepaid card,CAPITAL ONE FINANCIAL CORPORATION
