# Topic Modeling - LDA

### Written by: Rodrigo Escandon

# Executive Summary

An unsupervised Natural Language Processing model was developed to evaluate text messages. The model separates the messages into two topics, and its topic assignment was then compared to the actual target. The intent is to show that this unsupervised model could be used as a soft labeller for this type of message when there is no automated labelling process and supervised modelling is the final goal. Human supervision would still be required during labelling, but the topic scores produce a ranked list that could speed up the labelling process. This model was created using Python (Pandas and Scikit-Learn) to structure, analyze and visualize the data set.

## Model Performance

The model's accuracy in assigning each message to its actual class (ham or spam) was 95% on the "Top Scores" subset (around 75% of the messages), using a cut-off of 0.8 and above on the topic score. The model used in this example is a topic model based on Latent Dirichlet Allocation (LDA).
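Both headline figures can be checked from the counts in the confusion matrix for the high-score subset reported at the end of this notebook. A minimal sketch, with variable names chosen here purely for illustration:

```python
# Counts copied from the high-score (Topic_Score >= 0.8) confusion matrix in this notebook
spam_as_topic0, spam_as_topic1 = 572, 24
ham_as_topic0, ham_as_topic1 = 185, 3396

total_messages = 5572  # support reported for the full data set
top_score_messages = spam_as_topic0 + spam_as_topic1 + ham_as_topic0 + ham_as_topic1

# Accuracy: messages whose dominant topic matches the label, over the subset size
accuracy = (spam_as_topic0 + ham_as_topic1) / top_score_messages
# Coverage: share of all messages that clear the 0.8 cut-off
coverage = top_score_messages / total_messages

print(f"accuracy={accuracy:.2f}, coverage={coverage:.2f}")
```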

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import confusion_matrix, classification_report

In [3]:
#Loading data using pandas 
messages =pd.read_csv('SMSSpamCollection',sep='\t',names=['labels','message'])

In [4]:
messages.head()

|   | labels | message |
| --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [20]:
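#Creating function to process the loaded text
import string
from nltk.corpus import stopwords  #requires the NLTK 'stopwords' corpus to be downloaded

def text_process(mess):
    #Remove punctuation, then drop English stop words
    stop_words=set(stopwords.words('english'))
    nopunc=''.join(char for char in mess if char not in string.punctuation)
    return [word for word in nopunc.split() if word.lower() not in stop_words]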

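In [21]:
#The default tokenizer is used here (the outputs below include stop words); pass
#analyzer=text_process to apply the custom function instead, which changes the vocabulary and results
cv=CountVectorizer()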
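CountVectorizer takes a custom tokenizing function through its `analyzer` keyword argument; a callable passed this way replaces the default preprocessing and tokenization. The sketch below compares the default tokenizer with a simplified analyzer on two made-up messages. The stop-word set and the name `simple_analyzer` are assumptions for illustration (unlike `text_process`, it lowercases the tokens and avoids the NLTK dependency).

```python
import string
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tiny stop-word list, standing in for the NLTK English list
STOP_WORDS = {"the", "to", "is", "a"}

def simple_analyzer(text):
    """Strip punctuation, lowercase, and drop stop words."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [w.lower() for w in no_punct.split() if w.lower() not in STOP_WORDS]

docs = ["Call now to claim the prize!", "Is the meeting today?"]  # made-up messages

default_cv = CountVectorizer()                          # built-in tokenizer
custom_cv = CountVectorizer(analyzer=simple_analyzer)   # custom function as analyzer
default_cv.fit(docs)
custom_cv.fit(docs)

# vocabulary_ maps each term to its column index in the document-term matrix
default_vocab = sorted(default_cv.vocabulary_)
custom_vocab = sorted(custom_cv.vocabulary_)
print(default_vocab)
print(custom_vocab)
```

With the default tokenizer, stop words such as "the" and "to" stay in the vocabulary; with the custom analyzer they are removed before counting.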

In [22]:
%%time
#Document Term Matrix (DTM) creation
dtm=cv.fit_transform(messages['message'])

Wall time: 146 ms


In [40]:
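#Grabbing the word at index 4000 of the DTM vocabulary
#(on scikit-learn 1.0+ use cv.get_feature_names_out(); get_feature_names() was removed in 1.2)
cv.get_feature_names()[4000]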

'huge'

In [24]:
#Assigning number of components in the Latent Dirichlet Allocation
LDA=LatentDirichletAllocation(n_components=2,evaluate_every=500,n_jobs=10,verbose=1,random_state=1234)

In [25]:
%%time
LDA.fit(dtm)



iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10
Wall time: 23.4 s


LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=500, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=2, n_jobs=10,
             n_topics=None, perp_tol=0.1, random_state=1234,
             topic_word_prior=None, total_samples=1000000.0, verbose=1)

In [26]:
#Verifying number of topics
len(LDA.components_)

2
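To make the shapes involved concrete, here is a minimal sketch that fits LDA on a made-up document-term matrix. The counts and variable names are hypothetical; only the shapes and the row normalization are the point.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term counts: 4 documents, 5 vocabulary terms
toy_dtm = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 1, 0],
    [0, 0, 4, 2, 1],
    [0, 1, 3, 3, 0],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(toy_dtm)

print(toy_lda.components_.shape)  # (n_components, n_terms): one weight row per topic
print(doc_topics.shape)           # (n_documents, n_components)
print(doc_topics.sum(axis=1))     # each row is a distribution over topics
```

Each row of `components_` holds one topic's weights over the vocabulary, which is why `len(LDA.components_)` returns the number of topics.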

In [27]:
first_topic=LDA.components_[0]

In [28]:
#Grabbing the indices of the 10 highest-weighted words for this topic (in ascending order of weight)
top_ten_words=first_topic.argsort()[-10:]

In [29]:
#Displaying top 10 words for this topic
for i in top_ten_words:
    print(cv.get_feature_names()[i])

is
the
for
or
now
free
your
ur
call
to
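Because `argsort()` sorts in ascending order, the slice `[-10:]` lists the strongest word last, so "to" above is the highest-weighted word for this topic. A small sketch with a made-up vocabulary and weights shows the ordering and how to reverse it:

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only
vocab = np.array(["call", "free", "ok", "prize", "today"])
weights = np.array([5.0, 9.0, 1.0, 7.0, 2.0])

top3_ascending = weights.argsort()[-3:]   # indices of the 3 largest weights, smallest first
top3_descending = top3_ascending[::-1]    # reversed so the largest weight comes first

print(list(vocab[top3_ascending]))
print(list(vocab[top3_descending]))
```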


In [30]:
#Displaying top 50 words for all topics
for i, topic in enumerate(LDA.components_):
    print(f"The Top 50 Words for Topic #{i}")
    print([cv.get_feature_names()[index] for index in topic.argsort()[-50:]])
    print("\n")

The Top 50 Words for Topic #0
['min', '150p', 'urgent', 'com', 'nokia', 'we', 'win', 'uk', 'in', 'just', 'please', 'new', 'won', 'cash', 'send', 'and', 'prize', 'wat', 'this', 'get', 'our', 'www', 'go', 'msg', 'no', 'claim', 'with', 'of', 'only', 'text', 'reply', 'stop', 'have', 'mobile', 'da', 'from', 'lor', 'txt', 'on', 'you', 'is', 'the', 'for', 'or', 'now', 'free', 'your', 'ur', 'call', 'to']


The Top 50 Words for Topic #1
['out', 'day', 'there', 'ok', 'come', 'he', 'its', 'was', 'all', 'like', 'good', 'this', 'know', 'no', 'with', 'll', 'what', 'up', 'when', 'just', 'get', 'how', 'lt', 'gt', 'we', 'on', 'be', 'your', 'if', 'at', 'will', 'do', 'are', 'have', 'not', 'can', 'but', 'so', 'for', 'of', 'that', 'is', 'my', 'it', 'me', 'in', 'and', 'the', 'to', 'you']




In [41]:
%%time
#Getting topic results for each document
topic_results=LDA.transform(dtm)

Wall time: 6.58 s


In [42]:
#Getting the topic results for the first row
topic_results[0].round(2)

array([0.67, 0.33])

In [50]:
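#Including topic result into data set
messages['Topic']=topic_results.argmax(axis=1)
#Encoding ham as 1 and spam as 0 so the target lines up with the topics (topic 0 is spam-like, topic 1 is ham-like per the top words)
messages['Target']=np.where(messages['labels']=='ham',1,0)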
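Each row of the transform output is a distribution over topics, so `argmax(axis=1)` picks the dominant topic and `max(axis=1)` gives its probability, which can serve as a confidence score. A minimal sketch on made-up values (the array below is hypothetical, apart from the first row, which mirrors the output shown above):

```python
import numpy as np

# Hypothetical document-topic probabilities for three messages
doc_topics = np.array([
    [0.67, 0.33],
    [0.10, 0.90],
    [0.45, 0.55],
])

dominant_topic = doc_topics.argmax(axis=1)  # column index of the largest probability per row
topic_score = doc_topics.max(axis=1)        # that largest probability

print(dominant_topic)
print(topic_score)
```

With only two topics the score can never fall below 0.5, so a value close to 0.5 marks a message the model is unsure about.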

In [51]:
messages.head(10)

|   | labels | message | Topic | Target |
| --- | --- | --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 | 1 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 | 1 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 0 | 0 |
| 3 | ham | U dun say so early hor... U c already then say... | 1 | 1 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 1 | 1 |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | 1 | 0 |
| 6 | ham | Even my brother is not like to speak with me. ... | 1 | 1 |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | 0 | 1 |
| 8 | spam | WINNER!! As a valued network customer you have... | 0 | 0 |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | 0 | 0 |


In [52]:
#Confusion Matrix to compare topic prediction to actual result
TM=confusion_matrix(messages['Target'],messages['Topic'])
TM

array([[ 700,   47],
       [ 671, 4154]], dtype=int64)
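In scikit-learn's confusion matrix, rows are the actual classes and columns are the predicted ones. The per-class precision and recall in the classification report can be recovered from the matrix directly; a minimal sketch using the counts shown above:

```python
import numpy as np

# Counts copied from the confusion matrix above (rows: actual Target, columns: predicted Topic)
cm = np.array([[700, 47],
               [671, 4154]])

correct = np.diag(cm)                  # correctly assigned messages per class
precision = correct / cm.sum(axis=0)   # divide by column totals (predicted counts)
recall = correct / cm.sum(axis=1)      # divide by row totals (actual counts)
accuracy = correct.sum() / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), round(float(accuracy), 2))
```

The low precision for class 0 comes from the 671 ham messages whose dominant topic is the spam-like topic 0.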

In [53]:
#Classification Report gives precision scores, recall scores and f1-scores
print(classification_report(messages['Target'],messages['Topic']))

             precision    recall  f1-score   support

          0       0.51      0.94      0.66       747
          1       0.99      0.86      0.92      4825

avg / total       0.92      0.87      0.89      5572



In [54]:
#Including topic score into data set
messages['Topic_Score']=topic_results.max(axis=1)

In [56]:
messages.head(10)

|   | labels | message | Topic | Target | Topic_Score |
| --- | --- | --- | --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 | 1 | 0.667222 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 | 1 | 0.901983 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 0 | 0 | 0.979152 |
| 3 | ham | U dun say so early hor... U c already then say... | 1 | 1 | 0.62445 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 1 | 1 | 0.960161 |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | 1 | 0 | 0.655139 |
| 6 | ham | Even my brother is not like to speak with me. ... | 1 | 1 | 0.966941 |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | 0 | 1 | 0.975897 |
| 8 | spam | WINNER!! As a valued network customer you have... | 0 | 0 | 0.976921 |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | 0 | 0 | 0.978171 |


In [81]:
#Creating a data set with only high scores (>=0.8)
data_nlp_best=messages[messages.Topic_Score>=0.8]
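Raising the cut-off trades coverage (how many messages are kept) for accuracy on the kept messages. The sketch below illustrates that trade-off on made-up arrays standing in for the `Topic`, `Target` and `Topic_Score` columns; the values and the helper name `coverage_and_accuracy` are hypothetical.

```python
import numpy as np

# Hypothetical toy values standing in for the Topic, Target and Topic_Score columns
topic = np.array([0, 0, 1, 1, 1, 0])
target = np.array([0, 1, 1, 1, 0, 0])
topic_score = np.array([0.95, 0.62, 0.97, 0.88, 0.66, 0.91])

def coverage_and_accuracy(cutoff):
    """Return (share of rows kept, accuracy on the kept rows) for a score cut-off."""
    keep = topic_score >= cutoff
    coverage = keep.mean()
    accuracy = (topic[keep] == target[keep]).mean()
    return float(coverage), float(accuracy)

for cutoff in (0.5, 0.8):
    coverage, accuracy = coverage_and_accuracy(cutoff)
    print(f"cutoff={cutoff}: coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```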

In [82]:
data_nlp_best.head(10)

|   | labels | message | Topic | Target | Topic_Score |
| --- | --- | --- | --- | --- | --- |
| 1 | ham | Ok lar... Joking wif u oni... | 0 | 1 | 0.901983 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 0 | 0 | 0.979152 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 1 | 1 | 0.960161 |
| 6 | ham | Even my brother is not like to speak with me. ... | 1 | 1 | 0.966941 |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | 0 | 1 | 0.975897 |
| 8 | spam | WINNER!! As a valued network customer you have... | 0 | 0 | 0.976921 |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | 0 | 0 | 0.978171 |
| 10 | ham | I'm gonna be home soon and i don't want to tal... | 1 | 1 | 0.971605 |
| 11 | spam | SIX chances to win CASH! From 100 to 20,000 po... | 0 | 0 | 0.97573 |
| 12 | spam | URGENT! You have won a 1 week FREE membership ... | 0 | 0 | 0.975055 |


In [83]:
#Confusion Matrix for the Top Scores (Topic_Score >= 0.8)
TM_best=confusion_matrix(data_nlp_best['Target'],data_nlp_best['Topic'])
TM_best

array([[ 572,   24],
       [ 185, 3396]], dtype=int64)

In [84]:
#Classification Report gives precision scores, recall scores and f1-scores for the Top Scores
print(classification_report(data_nlp_best['Target'],data_nlp_best['Topic']))


             precision    recall  f1-score   support

          0       0.76      0.96      0.85       596
          1       0.99      0.95      0.97      3581

avg / total       0.96      0.95      0.95      4177
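One way the topic scores could support a labelling workflow is to sort messages by score so that the least certain ones reach a human reviewer first. This is only a sketch under stated assumptions: the messages, the `suggested_label` and `needs_review` columns, and the topic-to-label mapping (topic 0 as spam, topic 1 as ham, inferred from the top words) are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical messages with topic assignments and scores
review = pd.DataFrame({
    "message": ["Free prize, call now", "See you at lunch", "Ok, on my way", "Call me when free"],
    "Topic": [0, 1, 1, 0],
    "Topic_Score": [0.98, 0.62, 0.96, 0.66],
})

# Assumed mapping from topic to a provisional label
review["suggested_label"] = review["Topic"].map({0: "spam", 1: "ham"})
# Flag low-confidence rows using the same 0.8 cut-off as above
review["needs_review"] = review["Topic_Score"] < 0.8

# Least certain messages first, so human effort goes where the model is unsure
queue = review.sort_values("Topic_Score").reset_index(drop=True)
print(queue[["message", "suggested_label", "Topic_Score", "needs_review"]])
```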

