Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives. In NLP, topic modeling seeks to uncover the hidden semantic structure of a document collection.
It can help with the following:
* discovering the hidden themes in the collection.
* classifying the documents into the discovered themes.
* using the classification to organize/summarize/search the documents.

Latent Dirichlet Allocation (LDA): One of the most popular topic modeling methods. Each document is treated as a mixture of topics, and each topic as a distribution over words. The aim of LDA is to find the topics a document belongs to, based on the words in it.

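LDA can be fitted with the following iterative procedure:

1. Go through each document and randomly assign each word in the document to one of k topics (k is chosen beforehand).
2. For each document d, go through each word w and compute:
* p(topic t | document d): the proportion of words in document d that are currently assigned to topic t.
* p(word w | topic t): the proportion of all assignments to topic t, across every document, that come from word w.
3. Reassign word w to a topic t with probability proportional to p(topic t | document d) * p(word w | topic t), and repeat step 2 until the assignments stabilize.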
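
The steps above can be sketched as a tiny collapsed Gibbs sampler. This is a minimal illustration, not how gensim trains the model used later in this notebook (gensim relies on online variational Bayes). The toy documents, the smoothing constants `alpha` and `beta`, and the function name `gibbs_lda` are assumptions made for the example.

```python
import random

def gibbs_lda(docs, k, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)

    # Step 1: randomly assign every word occurrence to one of k topics.
    z = [[rng.randrange(k) for _ in doc] for doc in docs]

    # Count tables derived from the assignments.
    doc_topic = [[0] * k for _ in docs]                       # words of doc d in topic t
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(k)]  # times word w is in topic t
    topic_total = [0] * k                                     # all words in topic t
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    # Steps 2 and 3: resample the topic of each word, one word at a time.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # p(topic j | document d) * p(word w | topic j), up to a constant.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta) / (topic_total[j] + v * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return z, doc_topic, topic_word

toy_docs = [
    ["speed", "slow", "download", "speed"],
    ["bill", "charge", "price", "bill"],
    ["slow", "speed", "connection"],
    ["price", "bill", "charge"],
]
assignments, doc_topic, topic_word = gibbs_lda(toy_docs, k=2)
for d, counts in enumerate(doc_topic):
    print(d, counts)  # number of words in document d assigned to each topic
```

The smoothing constants keep every topic possible for every word, so a topic that currently holds no words of a document can still be sampled.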

**Use Case:** Mapping Customer Complaints into Predefined Complaint Categories

Flow: Gather Data -> Text Normalization -> Extract Topics (LDA) -> Classify Data

In [1]:
import numpy as np
import pandas as pd
import re
import warnings
warnings.filterwarnings('ignore')
import spacy
nlp = spacy.load('en_core_web_sm')
from sklearn.feature_extraction.text import TfidfVectorizer

import gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel,CoherenceModel
import pyLDAvis.gensim  # in recent pyLDAvis releases (3.x) this module is named pyLDAvis.gensim_models

In [2]:
data = pd.read_csv('../input/comcastcomplaints/comcast_fcc_complaints_2015.csv')
data.head()

Unnamed: 0,Ticket #,Customer Complaint,Date,Time,Received Via,City,State,Zip code,Status,Filing on Behalf of Someone,Description
0,250635,Comcast Cable Internet Speeds,4/22/2015,3:53:50 PM,Internet,Abingdon,Maryland,21009,Closed,No,I have been contacting Comcast Internet Techni...
1,223441,Payment disappear - service got disconnected,4/8/2015,10:22:56 AM,Internet,Acworth,Georgia,30102,Closed,No,Back in January 2015 I made 2 payments: One fo...
2,242732,Speed and Service,4/18/2015,9:55:47 AM,Internet,Acworth,Georgia,30101,Closed,Yes,Our home is located at in Acworth Georgia 3010...
3,277946,Comcast Imposed a New Usage Cap of 300GB that ...,5/7/2015,11:59:35 AM,Internet,Acworth,Georgia,30101,Open,Yes,Comcast in the Atlanta area has just put into ...
4,307175,Comcast not working and no service to boot,5/26/2015,1:25:26 PM,Internet,Acworth,Georgia,30101,Solved,No,I have been a customer of Comcast of some sort...


In [3]:
data.tail()

Unnamed: 0,Ticket #,Customer Complaint,Date,Time,Received Via,City,State,Zip code,Status,Filing on Behalf of Someone,Description
2220,213550,Service Availability,4/2/2015,9:13:18 AM,Internet,Youngstown,Florida,32466,Closed,No,I am a deaf guy. I have asked ATT or Comcast t...
2221,318775,Comcast Monthly Billing for Returned Modem,6/2/2015,1:24:39 PM,Internet,Ypsilanti,Michigan,48197,Solved,No,We purchased our own modem and returned the Co...
2222,331188,complaint about comcast,6/9/2015,5:28:41 PM,Internet,Ypsilanti,Michigan,48197,Solved,No,i had an agreement with comcast agent 1 year f...
2223,360489,Extremely unsatisfied Comcast customer,6/23/2015,11:13:30 PM,Internet,Ypsilanti,Michigan,48197,Solved,No,A few months ago I was forced to finally call ...
2224,363614,"Comcast, Ypsilanti MI Internet Speed",6/24/2015,10:28:33 PM,Internet,Ypsilanti,Michigan,48198,Open,Yes,My Internet disconnects all of the time and I ...


In [4]:
data['Customer Complaint'].value_counts()

Comcast                                          83
Comcast Internet                                 18
Comcast Data Cap                                 17
comcast                                          13
Comcast Data Caps                                11
                                                 ..
Improper Billing and non resolution of issues     1
Deceptive trade                                   1
intermittent internet                             1
Internet Speed on Wireless Connection             1
Comcast, Ypsilanti MI Internet Speed              1
Name: Customer Complaint, Length: 1842, dtype: int64

In [5]:
customize_stop_words = ['comcast', 'i', 'fcc', 'hello', 'service', 'services', 'issue', 'issues', 'problem', 'problems', 'xfinity', 'customer', 'complaint', '$']
for word in customize_stop_words:
    nlp.vocab[word].is_stop = True
    
def preprocess(text):
    text = text.split('\n')[0].lower()
    doc = nlp(text)
    temp = []
    for word in doc:
        # Keep the token only if it is not a stop word, punctuation mark, or number
        if word.text != 'n' and not word.is_stop and not word.is_punct and not word.like_num:
            # Add the lemmatized version of the word
            temp.append(word.lemma_.lower())
    return temp

# Tokenize each complaint
docs = data['Description'].apply(lambda text: preprocess(text))

In [6]:
docs

0       [contact, internet, technical, support, month,...
1       [january, payment, january, february, advance,...
2       [home, locate, acworth, georgia, sign, year, c...
3       [atlanta, area, effect, unprecendented, usage,...
4                                      [sort, year, like]
                              ...                        
2220    [deaf, guy, ask, att, provide, cable, dsl, are...
2221    [purchase, modem, return, modem, 8/12/13, rece...
2222    [agreement, agent, year, mg, bite, hbo, tv, ge...
2223    [month, ago, force, finally, extremely, slow, ...
2224    [internet, disconnect, time, rarely, 106mbit, ...
Name: Description, Length: 2225, dtype: object

In [7]:
dictionary = Dictionary(docs)
print('Distinct words in initial documents:', len(dictionary))

# Filter out words that occur in fewer than 10 documents, or in more than 40% of the documents.
dictionary.filter_extremes(no_below=10, no_above=0.4)
print('Distinct words after removing rare and common words:', len(dictionary))

Distinct words in initial documents: 5697
Distinct words after removing rare and common words: 925
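
The document-frequency filter applied above can be sketched in plain Python. This is a simplified stand-in for `filter_extremes`, assuming only the `no_below` and `no_above` thresholds matter; the function name `filter_by_doc_freq` and the toy documents are made up for the example.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["internet", "slow", "speed"],
    ["internet", "bill", "charge"],
    ["internet", "speed", "cap"],
    ["internet", "bill", "modem"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['bill', 'speed']
```

Here `internet` is dropped because it appears in every document, and `slow`, `charge`, `cap`, and `modem` are dropped because each appears in only one.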


In [8]:
#id2word is an optional dictionary that maps the word_id to a token
corpus = [dictionary.doc2bow(doc) for doc in docs]
num_topics = 5
model = gensim.models.LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, passes=20, workers=2)
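
The bag-of-words conversion in the cell above turns each token list into `(word_id, count)` pairs. A minimal sketch of that idea follows; `build_vocab` and `to_bow` are hypothetical helpers, and the ids gensim actually assigns may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; tokens missing from the vocabulary are ignored."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

vocab = build_vocab([["speed", "slow", "speed"], ["bill", "charge"]])
print(to_bow(["speed", "slow", "speed", "outage"], vocab))  # [(0, 2), (1, 1)]
```

Word order is discarded in this representation, which is why LDA only sees how often each word occurs in a document.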

In [9]:
for i in range(num_topics):
    print('\nTopic {}\n'.format(str(i)))
    for term, frequency in model.show_topic(i, topn=10):
        print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))


Topic 0

call                 0.030
time                 0.026
tell                 0.023
phone                0.017
day                  0.016
work                 0.016
say                  0.014
home                 0.013
come                 0.013
tech                 0.011

Topic 1

speed                0.085
pay                  0.039
time                 0.020
slow                 0.019
month                0.017
get                  0.014
download             0.014
mbps                 0.013
connection           0.013
high                 0.011

Topic 2

cap                  0.046
datum                0.032
data                 0.028
gb                   0.026
usage                0.022
month                0.020
use                  0.019
limit                0.018
charge               0.018
area                 0.015

Topic 3

cable                0.047
price                0.039
month                0.031
package              0.025
bill                 0.022
tv             

In [11]:
top_labels = {0: 'Customer Services', 1:'Internet Speed', 2:'Data Caps', 3: 'Pricing', 4:'Billing'}

In [12]:
vector = TfidfVectorizer(input='content', analyzer = 'word', lowercase=True, stop_words='english',tokenizer=preprocess)
desc = vector.fit_transform(data['Description']).toarray()

In [13]:
from collections import OrderedDict
def get_doc_topic_dist(model, corpus, kwords=False): 
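    '''
    Find the dominant LDA topic of each document.
    model[d] only returns topics with non-negligible weight, so the
    remaining topics are filled in with a weight of 0 before taking the argmax.
    
    model: the trained LDA model
    corpus: the documents in bag-of-words form
    kwords: if True, collects and returns the index of the dominant topic
            of each document; if False, an empty list is returned
    '''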
    
    keys = []
    for d in corpus:
        tmp = {i:0 for i in range(num_topics)}
        tmp.update(dict(model[d]))
        vals = list(OrderedDict(tmp).values())
        if kwords:
            keys += [np.asarray(vals).argmax()]

    return keys

In [14]:
# Terms (features) the vectorizer selected from the raw documents.
# On scikit-learn >= 1.0 use vector.get_feature_names_out(); get_feature_names() was removed in 1.2.
features = vector.get_feature_names()
lda_keys = get_doc_topic_dist(model, corpus, True)

In [15]:
top_words = []
for n in range(len(desc)):
    # Indices of the 5 highest TF-IDF weights (argsort already returns integers; np.int0 is gone in NumPy 2.0)
    inds = np.argsort(desc[n])[::-1][:5]
    top_words += [', '.join([features[i] for i in inds])]
    
data['Description Top Words'] = top_words
data['Topic'] = lda_keys
data.head(10)

Unnamed: 0,Ticket #,Customer Complaint,Date,Time,Received Via,City,State,Zip code,Status,Filing on Behalf of Someone,Description,Description Top Words,Topic
0,250635,Comcast Cable Internet Speeds,4/22/2015,3:53:50 PM,Internet,Abingdon,Maryland,21009,Closed,No,I have been contacting Comcast Internet Techni...,"permanent, hardware, residence, technical, rep...",1
1,223441,Payment disappear - service got disconnected,4/8/2015,10:22:56 AM,Internet,Acworth,Georgia,30102,Closed,No,Back in January 2015 I made 2 payments: One fo...,"investigation, bank, payment, confirmation, ja...",4
2,242732,Speed and Service,4/18/2015,9:55:47 AM,Internet,Acworth,Georgia,30101,Closed,Yes,Our home is located at in Acworth Georgia 3010...,"acworth, partner, compliant, mr, georgia",4
3,277946,Comcast Imposed a New Usage Cap of 300GB that ...,5/7/2015,11:59:35 AM,Internet,Acworth,Georgia,30101,Open,Yes,Comcast in the Atlanta area has just put into ...,"unprecendented, effect, atlanta, gb, usage",2
4,307175,Comcast not working and no service to boot,5/26/2015,1:25:26 PM,Internet,Acworth,Georgia,30101,Solved,No,I have been a customer of Comcast of some sort...,"sort, like, year, dropping, drilling",3
5,338519,ISP Charging for arbitrary data limits with ov...,6/12/2015,9:59:40 PM,Internet,Acworth,Georgia,30101,Solved,No,To whom it may concern:\n I am a Comcast custo...,"concern, €, drill, drink, drink($5",1
6,361148,Throttling service and unreasonable data caps,6/24/2015,10:13:55 AM,Internet,Acworth,Georgia,30101,Pending,No,"Good morning,\n Comcast has been throttling my...","morning, good, dsl, drink($5, drive",0
7,359792,Comcast refuses to help troubleshoot and corre...,6/23/2015,6:56:14 PM,Internet,Adrian,Michigan,49221,Solved,No,When I moved to Michigan I contacted Comcast r...,"gaming, perfect, forwarding, michigan, host",0
8,318072,Comcast extended outages,6/1/2015,11:46:30 PM,Internet,Alameda,California,94502,Closed,No,Comcast Xfinity cable service was interrupted ...,"pt, minute, pm, estimate, interrupt",0
9,371214,Comcast Raising Prices and Not Being Available...,6/28/2015,6:46:31 PM,Internet,Alameda,California,94501,Open,Yes,"All of a sudden our ""bundle discount"" dropped ...","busy, sudden, type, discount, notify",4
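
The `Description Top Words` column comes from ranking each description's TF-IDF weights. The sketch below shows the weighting in plain Python, assuming the `TfidfVectorizer` defaults (raw term counts, smoothed idf, L2 normalization); `tfidf_rows`, `top_terms`, and the toy documents are made up for the example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF weights per document: raw counts * smoothed idf, L2-normalized."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    rows = []
    for doc in docs:
        weights = {tok: count * idf[tok] for tok, count in Counter(doc).items()}
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows

def top_terms(row, n=5):
    """Terms with the highest TF-IDF weight, best first."""
    ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
    return [tok for tok, _ in ranked[:n]]

toy_docs = [
    ["payment", "bank", "internet"],
    ["internet", "speed", "slow", "payment"],
    ["internet", "cap", "usage"],
]
rows = tfidf_rows(toy_docs)
print(top_terms(rows[0], n=2))  # ['bank', 'payment']
```

Words that are rare across the collection rank highest, which is why `internet` loses to `bank` here. One likely explanation for entries such as `drilling` or `drink($5` in the table above is that the notebook always takes five indices from a dense row, so a description with fewer than five weighted terms is padded with zero-weight vocabulary entries; the sketch only ranks terms that occur in the document.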


In [16]:
# Topic coherence measures score a single topic by measuring the degree of semantic similarity between its high-scoring words.
# These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference.
CoherenceModel(model = model, texts = docs, dictionary = dictionary, coherence='c_v').get_coherence()

0.5376588428971818
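
To give a feel for what a coherence score rewards, here is a simplified co-occurrence score in the style of the UMass measure (gensim's `coherence='u_mass'` option). It is not the `c_v` measure computed above, which is more involved; the function name, the averaging over pairs, and the toy documents are assumptions made for the example.

```python
import math
from itertools import combinations

def umass_style_coherence(top_words, docs):
    """Average log conditional co-occurrence of a topic's top words.

    `top_words` should be ordered from most to least probable, and every
    word must occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        scores.append(math.log((doc_count(later, earlier) + 1) / doc_count(earlier)))
    return sum(scores) / len(scores)

toy_docs = [
    ["speed", "slow", "download"],
    ["speed", "slow"],
    ["bill", "charge"],
    ["bill", "charge", "price"],
]
coherent = umass_style_coherence(["speed", "slow"], toy_docs)
mixed = umass_style_coherence(["speed", "bill"], toy_docs)
print(coherent > mixed)  # True: words that co-occur score higher
```

A topic whose top words tend to appear in the same documents scores higher than one that mixes unrelated words, which is the intuition behind using coherence to compare models or choose the number of topics.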

In [17]:
test = 'internet speed is slow, call customer service now!!'
tokens = preprocess(test)
model[dictionary.doc2bow(tokens)]

[(0, 0.066689454),
 (1, 0.7330982),
 (2, 0.06674157),
 (3, 0.066795915),
 (4, 0.06667482)]
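
To complete the last step of the flow (Classify Data), the most probable topic can be looked up in `top_labels`. A minimal sketch, using the distribution printed above; the helper `label_for` and its `min_prob` threshold are hypothetical additions, not part of the notebook.

```python
top_labels = {0: 'Customer Services', 1: 'Internet Speed', 2: 'Data Caps', 3: 'Pricing', 4: 'Billing'}

def label_for(topic_dist, labels, min_prob=0.0):
    """Return (label, probability) of the most probable topic,
    or (None, probability) when it falls below `min_prob`."""
    topic_id, prob = max(topic_dist, key=lambda pair: pair[1])
    if prob < min_prob:
        return None, prob
    return labels[topic_id], prob

# Distribution copied from the output above.
topic_dist = [(0, 0.066689454), (1, 0.7330982), (2, 0.06674157), (3, 0.066795915), (4, 0.06667482)]
print(label_for(topic_dist, top_labels))  # ('Internet Speed', 0.7330982)
```

Setting `min_prob` above zero is one way to leave ambiguous complaints unlabeled instead of forcing them into a category.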

In [18]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(model, corpus, dictionary)



The left panel displays the topics and the distances between them: topics that are close in meaning appear close together, and dissimilar topics appear far apart.

The right panel displays a bar chart showing how frequent each term is in the corpus and how "distinctive" it is in distinguishing between topics.