# Lending Club Description Topic Modeling

In this notebook, I perform topic modeling on the 'desc' field for Lending Club loans categorized as 'other'.

Topic modeling is used to discover abstract "topics" that occur in a collection of documents. The topics produced are clusters of similar words.

I use two techniques, Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA), and two input matrices, one of raw n-gram counts and one weighted by term frequency-inverse document frequency (tf-idf), for a total of four model fits.

In [1]:
%%time
import time
import numpy as np
import pandas as pd
start_time = time.perf_counter()

Wall time: 920 ms


### 0. Data Load

In [2]:
%%time
# Read in csv containing 'other' loans with populated 'desc' fields
df = pd.read_csv('data/other_loans.csv', header=0, low_memory=False)
print('There are {} observations and {} variables in this dataset.'.format(*df.shape))

# Pull out descriptions
descriptions = df.loc[:, 'desc']

There are 7063 observations and 152 variables in this dataset.
Wall time: 457 ms


### 1. Text Analysis

Now that we have the observations we're interested in, we can extract features.

We begin by using CountVectorizer to count word n-grams of size 1 to 5.

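An example of an n-gram of size 1 would be "consolidation" while an n-gram of size 4 would be "credit card debt consolidation". The words must be contiguous in the text once stop words have been removed, which is why an n-gram such as "loan paid bills" can appear even though the original sentence had other words in between.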

CountVectorizer converts a collection of text documents to a matrix of token counts. 
Here, each document is a user-provided loan description. 
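
To make the n-gram and count-matrix ideas concrete, here is a minimal pure-Python sketch on two made-up descriptions. It only mimics the behaviour described above; the helper names (`word_ngrams`, `count_matrix`), the whitespace tokenizer, and the tiny stop-word list are assumptions for illustration and are not how scikit-learn is implemented.

```python
from collections import Counter

def word_ngrams(text, ngram_range=(1, 2), stop_words=()):
    """Return word n-grams of the given sizes, built after dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

def count_matrix(docs, **kwargs):
    """Build a (documents x n-grams) matrix of counts as nested lists."""
    counts = [Counter(word_ngrams(doc, **kwargs)) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[g] for g in vocab] for c in counts]

# Two made-up loan descriptions
docs = ['pay off my credit card debt', 'credit card debt consolidation']
vocab, matrix = count_matrix(docs, stop_words={'pay', 'off', 'my'})
print(vocab)   # one column per n-gram
print(matrix)  # one row per description
```

Each row of `matrix` plays the role of one loan description and each column one n-gram, which is the shape of input the topic models below expect.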

TfidfTransformer converts a count matrix to a normalized tf-idf (term frequency-inverse document frequency) representation. 
Tf-idf is explained in section 2b.

In [3]:
%%time
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Our features are word n-grams
analyzer = 'word'

# Create list of stop words
stop_nums = [str(i) for i in range(101)]
stop_chars = [
    'br', 'added', 'borrower', 'pay',
    '000', '00', '01', '02', '03', '04', '05',
    '06', '07', '08', '09', '200', '2000',
    'to', 'on', 'and', 'the', 'my', 'for',
    'have', 'of', 'in', 'this', 'is', 'will',
    'off', 'with', 'be', 'am', 'that', 'it',
    'need', 'as', 'some', 'you', 'me', 'we',
    'would', 'been', 'at', 'so', 'very',
    'thank', 'you', 'an', 'get', 'all', 'are',
    'from', 'like', 'our', 'help', 'but', 'if',
    'not', 'was', 'has', 'just', 'do', 'up',
    'over', 'by', 'any', 'right', 'now',
    'month', 'year', 'xc2', 'x80', 'xc3', 'xa2'
]
stop_words = stop_chars + stop_nums

# Upper and lower bound of the range of n-values for n-grams to be extracted
ngram_range = (1, 5)

# Only interested in the 100000 most frequent n-grams
max_features = 100000

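# Custom regex for tokens. Currently unused: the ^ and $ anchors only match
# when the whole document is a single token, so it needs reworking first
token_pattern = r'^([a-zA-Z]+|\d+|\W)$'  # TODO: fix this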

# Initialize and fit CountVectorizer
cv = CountVectorizer(analyzer=analyzer, stop_words=stop_words,
                    ngram_range=ngram_range, max_features=max_features)
cv_fit = cv.fit_transform(descriptions)

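# Convert sparse matrix of n-gram counts to a dense array - this is our raw input.
# Note: a dense array of roughly 7,000 x 100,000 counts is memory-hungry;
# NMF and LDA should also accept the sparse matrix (cv_fit) directly.
X_raw = cv_fit.toarray()

# Weight the n-gram counts by tf-idf (the result is a sparse matrix)
transformer = TfidfTransformer()
X_tfidf = transformer.fit_transform(X_raw)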

Wall time: 12.3 s


### 2. Non-Negative Matrix Factorization (NMF)

From docs: "Find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. This factorization can be used for example for dimensionality reduction, source separation or topic extraction."

We're interested in topic extraction for topic modeling.
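
As a rough illustration of what "two non-negative matrices whose product approximates X" means, here is a small NumPy sketch using the classic multiplicative-update rules. This is an assumption-laden toy: scikit-learn's `NMF` uses a different solver by default (coordinate descent), and the function name `nmf_multiplicative` and the 4 x 4 matrix are made up for the example.

```python
import numpy as np

def nmf_multiplicative(X, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factor non-negative X (n_docs x n_terms) into W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    for _ in range(n_iter):
        # Multiplicative updates never flip a sign, so W and H stay non-negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy count matrix: rows are documents, columns are terms
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
], dtype=float)

W, H = nmf_multiplicative(X, n_components=2)
error = np.linalg.norm(X - W @ H)
print(np.round(H, 2))  # each row is a topic: one weight per term
print(np.round(W, 2))  # each row is a document: one weight per topic
print(error)           # reconstruction error (Frobenius norm)
```

The rows of `H` correspond to what `model.components_` holds in the cells below: one row per topic, one weight per n-gram.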

In [4]:
%%time
# Define a function for putting our topic modeling output into a readable format
def get_topics(model, num_topics=5, num_words=30):
    """
    Pulls out the `num_words` highest-weighted n-grams for each
    topic, maps them to their respective topics in a dictionary,
    and returns a DataFrame with one column per topic.
    """
    # Get feature names for mapping. get_feature_names() was removed in
    # scikit-learn 1.2, so fall back to it only on older versions.
    try:
        feat_names = cv.get_feature_names_out()
    except AttributeError:
        feat_names = cv.get_feature_names()
    word_dict = {}
    # For each topic, get the indices of the largest weights, in descending order
    for i in range(num_topics):
        words_ids = model.components_[i].argsort()[::-1][:num_words]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = words
    return pd.DataFrame(word_dict)

Wall time: 0 ns
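
The `argsort` slicing inside `get_topics` is compact but easy to misread. The sketch below walks through it on a made-up five-word vocabulary and a made-up row of weights; none of these values come from the fitted models.

```python
import numpy as np

feat_names = np.array(['car', 'credit', 'home', 'loan', 'debt'])  # made-up vocabulary
topic_weights = np.array([0.1, 0.7, 0.0, 0.4, 0.9])               # made-up row of components_

n_top = 3
# argsort() returns indices in ascending order of weight; the slice walks
# backwards from the end, keeping the n_top largest in descending order
top_ids = topic_weights.argsort()[:-n_top - 1:-1]
top_words = feat_names[top_ids].tolist()
print(top_words)  # ['debt', 'credit', 'loan']
```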


### 2a. Plain NMF

First, I use the raw counts matrix from CountVectorizer as an input for the NMF model.

In [5]:
%%time
from sklearn.decomposition import NMF

# Number of topics
n_components = 5

# Instantiate NMF model
nmf_model = NMF(n_components=n_components)

# Fit model
nmf_model.fit(X_raw)

Wall time: 2min 23s


In [6]:
nmf_topics = get_topics(nmf_model)
nmf_topics

| | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 |
|---|---|---|---|---|---|
| 0 | loan | credit | time | bills | per |
| 1 | payment | cards | years | paid | mortgage |
| 2 | personal | credit cards | job | loan | loan |
| 3 | lending | card | money | loan paid | credit |
| 4 | car | credit card | can | paid bills | income |
| 5 | interest | debt | good | economic | interest |
| 6 | club | interest | bills | hardship | payment |
| 7 | monthly | high | also | economic hardship | card |
| 8 | expenses | payment | work | facing | credit card |
| 9 | lending club | one | business | loan paid bills | about |


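The table appears to show only the first 10 of the 30 n-grams returned per topic, so some of the n-grams discussed below are not visible in it.

Topic #1 seems to be related to **personal loans**. Some n-grams in the topic suggest student or car loans.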

Topic #2 is clearly related to **credit card debt**. High interest rates are also mentioned.

Topic #3 is less clear, but looks like it may be related to **small businesses**. Business, expenses, and company are mentioned.

Topic #4 shows that there are a lot of larger n-grams related to **economic hardship** and paying bills. Notably, it has the largest n-grams and seems to be the most focused topic.

Topic #5 is also not very clear, but it looks to be somewhat related to **mortgages**. The mentions of social security, credit card, and community college fall outside of that scope, though. It is also the only topic containing numbers. 

Next, I run NMF again, this time on the tf-idf-weighted counts.

### 2b. NMF with TfIdf

Tf-idf stands for term frequency-inverse document frequency. It weights a term by how often it occurs in a document, discounted by how many documents in the corpus contain it.

>The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

"Typically, the tf-idf weight is composed by two terms: the first computes the normalized Term Frequency (TF), aka. the number of times a word appears in a document, divided by the total number of words in that document; the second term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of the documents in the corpus divided by the number of documents where the specific term appears.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization: 

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However it is known that certain terms, such as "is", "of", and "that", may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scale up the rare ones, by computing the following: 

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

See below for a simple example.

Example:

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12."

Note that the worked example uses a base-10 logarithm, whereas the IDF formula above is written with the natural logarithm. The choice of base only rescales every weight by the same constant.
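
The arithmetic in the quoted example can be checked in a few lines. The sketch below reproduces it and also shows a smoothed idf variant. To the best of my knowledge that variant matches the default of scikit-learn's `TfidfTransformer`, which additionally uses raw counts for tf and L2-normalizes each row, so its output will not equal the textbook numbers. The function names here are made up for the example.

```python
import math

def tf(term_count, doc_length):
    """Term frequency: occurrences of the term divided by document length."""
    return term_count / doc_length

def idf(n_docs, n_docs_with_term, log=math.log):
    """Textbook inverse document frequency with a configurable logarithm."""
    return log(n_docs / n_docs_with_term)

def smoothed_idf(n_docs, n_docs_with_term):
    """Smoothed variant ln((1 + n) / (1 + df)) + 1 (assumed TfidfTransformer default)."""
    return math.log((1 + n_docs) / (1 + n_docs_with_term)) + 1

tf_cat = tf(3, 100)
idf_base10 = idf(10_000_000, 1_000, log=math.log10)
idf_natural = idf(10_000_000, 1_000)

print(tf_cat * idf_base10)                         # approximately 0.12, as in the quote
print(round(tf_cat * idf_natural, 4))              # 0.2763 with the natural logarithm
print(round(smoothed_idf(10_000_000, 1_000), 4))   # smoothed idf for the same counts
```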

In [7]:
%%time

# Number of topics
n_components = 5

# Instantiate NMF model
nmf_tfidf = NMF(n_components=n_components)

# Fit model
nmf_tfidf.fit(X_tfidf)

Wall time: 1.75 s


In [8]:
# Get topics
nmf_tfidf_topics = get_topics(nmf_tfidf)
nmf_tfidf_topics

| | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 |
|---|---|---|---|---|---|
| 0 | cards | personal | consolidation | bills | loan |
| 1 | credit cards | personal loan | debt consolidation | medical bills | card |
| 2 | credit | loan | debt | medical | credit card |
| 3 | interest | personal use | bill consolidation | consolidate | car |
| 4 | high interest | use | consolidate debt | consolidate bills | home |
| 5 | high | loan personal | consolidate | bills bills | credit |
| 6 | interest credit | personal personal | consolidation debt | paying bills | repairs |
| 7 | high interest credit | personal expenses | consolidation loan | loan bills | expenses |
| 8 | interest credit cards | personal debts | bill | paying | debt |
| 9 | high interest credit cards | loan personal use | debt consolidation loan | card bills | years |


These topics look more focused compared to using the non-TfIdf data. 

There is overlap between topics (credit card vs cards, consolidate vs consolidation) that can be fixed by proper stemming and/or lemmatization.

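Credit cards are mentioned in nearly every topic. Stemming would merge variants such as "card" and "cards", but it probably won't stop the term from showing up across topics on its own; treating it as a stop word or capping document frequency (CountVectorizer's `max_df`) may be needed for that.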

Topic #1 is clearly related to **high interest credit cards** and consolidation.

Topic #2 is related to **personal loans** but credit cards also make it in there. Personal family and personal trip hint at vacation related expenses.

Topic #3 is about **debt consolidation** with mentions of credit card debt, home improvement and wedding debt.

Topic #4 is about **bills**, specifically medical bills. Credit cards and vacation are also mentioned.

Topic #5 is less focused than the rest, but some terms that stand out are **car, home, repairs, and improvement**.

Next, I use Latent Dirichlet Allocation (LDA) for topic modeling.

### 3. Latent Dirichlet Allocation (LDA)

From docs: "Latent Dirichlet Allocation with online variational Bayes algorithm."

The fitting process starts from a random initialization, so the topics can differ between runs. `random_state` can be set for reproducibility, as is done in the cells that follow.

### 3a. Plain LDA

First, I use the raw counts matrix from CountVectorizer as an input for the LDA model.

In [9]:
%%time
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Number of topics
n_components = 5

# Use all available cores
n_jobs = -1

# Instantiate LatentDirichletAllocation model
lda_model = LDA(n_components=n_components, n_jobs=n_jobs, verbose=1, random_state=0)

# Fit model
lda_model.fit(X_raw)



iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10
Wall time: 4min 17s


In [10]:
lda_topics = get_topics(lda_model)
lda_topics

| | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 |
|---|---|---|---|---|---|
| 0 | taxes | car | loan | credit | business |
| 1 | loan | repairs | credit | cards | loan |
| 2 | property | home | time | credit cards | money |
| 3 | your | consolidation | years | interest | personal |
| 4 | irs | loan | job | loan | her |
| 5 | what | repair | debt | bills | family |
| 6 | federal | debt | payment | card | home |
| 7 | im | debt consolidation | bills | credit card | she |
| 8 | your loan | new | monthly | high | wife |
| 9 | real | vacation | good | rate | his |


Results don't seem as good as with NMF.

Topic #1: **Taxes?** Weird results with "tell your story what your loan"

Topic #2: All over the place with car repairs, debt consolidation, and home improvement. **Personal loans?**

Topic #3: Vaguely about **credit cards** but with a positive spin - good, stable, etc.

Topic #4: **High interest credit cards** with a few mentions of medical bills.

Topic #5: **Business** is the top term but there are a lot of connected terms like family, wife, son and daughter. Also mentions personal loans and medical.

Overall not as useful. Let's see what happens when we use the Tf-Idf weighted inputs.

### 3b. LDA with TfIdf

In [11]:
%%time
# Number of topics
n_components = 5

# Use all available cores
n_jobs = -1

# Instantiate LatentDirichletAllocation model
lda_tfidf = LDA(n_components=n_components, n_jobs=n_jobs, verbose=1, random_state=0)

# Fit model
lda_tfidf.fit(X_tfidf)



iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10
Wall time: 2min 27s


In [12]:
lda_tfidf_topics = get_topics(lda_tfidf)
lda_tfidf_topics

| | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 |
|---|---|---|---|---|---|
| 0 | taxes | loan | camping | cards medical bills | credit |
| 1 | state taxes | bills | camping membership | credit cards medical bills | cards |
| 2 | property taxes | personal | membership | cards medical | credit cards |
| 3 | back taxes | debt | dr | credit cards medical | home |
| 4 | state | credit | home renovations | wedding | card |
| 5 | federal | car | purchase bills | daughters wedding | interest |
| 6 | property | consolidation | occasion | honeymoon | credit card |
| 7 | federal state | expenses | use personal use stable | consolation | improvement |
| 8 | federal state taxes | years | cut expenses creditcard rates | planing | home improvement |
| 9 | federal taxes | time | use personal use stable job | wedding expenses wedding | high |


We get much longer n-grams here, most likely because long n-grams occur in very few documents and the inverse document frequency portion of tf-idf therefore weights them up.

Topic #1: Dominated by **taxes** (state, federal, property, back) with some mentions of **paying attorney fees**.

Topic #2: Not as focused, but looks like **credit cards** and **debt consolidation**. Also mentions wedding, personal loan, medical, and repairs.

Topic #3: **Camping** is at the top of the list, which is strange. The rest of the list suggests the topic is about having a stable job and more than 15 years of experience. Not very useful.

Topic #4: **Credit cards, medical bills, and wedding expenses** dominate this topic. 

Topic #5: Looks similar to topic #2 because of the mentions of **high interest credit cards**, but also has terms that hint at **home improvement**.

These results are also not that great. Credit cards make an appearance in 3/5 of the topics, and a lot of the longer n-grams seem irrelevant but are weighted relatively heavily because of their low document frequency. LDA is also a generative model of word counts, so tf-idf weights are not its natural input, which may contribute to the weaker topics.
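
One possible way to keep rare n-grams from dominating is to drop those that occur in too few documents before weighting. The pure-Python sketch below illustrates the idea on made-up data; the helper names are hypothetical. `CountVectorizer` has a `min_df` parameter intended for this kind of filtering, but whether it would actually improve the topics above is untested here.

```python
from collections import Counter

def document_frequency(docs_as_ngrams):
    """Count, for each n-gram, the number of documents that contain it."""
    df = Counter()
    for grams in docs_as_ngrams:
        df.update(set(grams))  # set(): count each n-gram once per document
    return df

def filter_rare(docs_as_ngrams, min_df=2):
    """Drop n-grams that occur in fewer than min_df documents."""
    df = document_frequency(docs_as_ngrams)
    return [[g for g in grams if df[g] >= min_df] for grams in docs_as_ngrams]

# Made-up documents, already split into n-grams
docs_as_ngrams = [
    ['credit', 'cards', 'credit cards', 'camping membership'],
    ['credit', 'cards', 'credit cards', 'medical bills'],
    ['medical bills', 'credit'],
]
filtered = filter_rare(docs_as_ngrams, min_df=2)
print(filtered)  # 'camping membership' occurs in one document only, so it is dropped
```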

TODO: 
* Understand regex in order to come up with a custom token_pattern
* Stemming and lemmatization? 
  * See https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
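
As a starting point for the first TODO item, the sketch below uses the standard `re` module to show why the anchored pattern from the feature-extraction cell matches nothing on a normal description, and tries one hypothetical alternative that keeps only alphabetic tokens. The sample text and the `letters_only` pattern are assumptions for illustration and have not been run against the loan data.

```python
import re

# Made-up description in the style of the 'desc' field
text = 'Borrower added on 01/12/12 > Pay off 2 credit cards'.lower()

# Pattern from the notebook: ^ and $ anchor the match to the whole string,
# so findall() only succeeds when the entire text is a single token
anchored = r'^([a-zA-Z]+|\d+|\W)$'
print(re.findall(anchored, text))    # []
print(re.findall(anchored, 'loan'))  # ['loan']

# Hypothetical alternative: alphabetic tokens of two or more letters,
# delimited by word boundaries instead of string anchors
letters_only = r'(?u)\b[a-zA-Z][a-zA-Z]+\b'
tokens = re.findall(letters_only, text)
print(tokens)  # ['borrower', 'added', 'on', 'pay', 'off', 'credit', 'cards']
```

A pattern along these lines would drop purely numeric tokens at tokenization time, which might make the numeric entries in the stop-word list unnecessary.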

In [13]:
end_time = (time.perf_counter() - start_time)/60
print("This notebook took {:0.2f} minutes to run".format(end_time))

This notebook took 9.41 minutes to run
