Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives. In NLP, it seeks to uncover the hidden semantic structure of a collection of documents.
It can help with the following:
* discovering the hidden themes in the collection.
* classifying the documents into the discovered themes.
* using the classification to organize/summarize/search the documents.

Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling methods. It treats each document as a mixture of topics and each topic as a distribution over words. The aim of LDA is to infer which topics a document belongs to from the words it contains.

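The model can be fitted iteratively:

1. Go through each document and randomly assign each word in the document to one of k topics (k is chosen beforehand).
2. For each document d, go through each word w and compute:
   * p(topic t | document d): the proportion of words in document d that are currently assigned to topic t.
   * p(word w | topic t): the proportion of assignments to topic t, across all documents, that come from word w.
3. Reassign w to a new topic, choosing topic t with probability proportional to the product of p(topic t | document d) and p(word w | topic t). Repeat until the assignments stabilize.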
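
The steps above can be sketched as a toy collapsed Gibbs sampler. This is a minimal illustration under stated assumptions, not the implementation used later in this notebook (gensim fits LDA with online variational Bayes). The function name `toy_lda_gibbs`, the smoothing values `alpha` and `beta`, and the tiny corpus are all made up for illustration.

```python
import random

def toy_lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)

    # Step 1: randomly assign every word occurrence to one of k topics.
    assign = [[rng.randrange(k) for _ in doc] for doc in docs]

    # Count tables behind p(topic | document) and p(word | topic).
    doc_topic = [[0] * k for _ in docs]                       # words of doc d in topic t
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(k)]  # word w assigned to topic t
    topic_total = [0] * k                                     # all words assigned to topic t
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assign[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    # Steps 2 and 3: repeatedly resample the topic of each word.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weights proportional to p(topic | document) * p(word | topic),
                # smoothed by alpha and beta so no topic has zero weight.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta) / (topic_total[j] + beta * v)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                # Record the new assignment.
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return assign, doc_topic, topic_word

toy_docs = [
    ["slow", "internet", "speed", "slow"],
    ["bill", "charge", "refund", "bill"],
    ["internet", "speed", "outage"],
    ["charge", "refund", "bill"],
]
assign, doc_topic, topic_word = toy_lda_gibbs(toy_docs, k=2)
for d, counts in enumerate(doc_topic):
    print(d, counts)  # number of words in document d assigned to each topic
```

Each row of `doc_topic` shows how the words of one document are split across the topics; dividing a row by its sum gives an estimate of p(topic t | document d).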

**Use Case:** Mapping customer complaints to predefined complaint categories

Flow: Gather Data -> Text Normalization -> Extract Topics LDA -> Classify Data

In [1]:
import numpy as np
import pandas as pd
import re
import warnings
warnings.filterwarnings('ignore')
import spacy
nlp = spacy.load('en_core_web_sm')
from sklearn.feature_extraction.text import TfidfVectorizer

import gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel,CoherenceModel
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models

In [2]:
data = pd.read_csv('../input/comcastcomplaints/comcast_fcc_complaints_2015.csv')
data.head()

In [3]:
data.tail()

In [4]:
data['Customer Complaint'].value_counts()

In [5]:
customize_stop_words = ['comcast', 'i', 'fcc', 'hello', 'service', 'services', 'issue', 'issues', 'problem', 'problems', 'xfinity', 'customer', 'complaint', '$']
for word in customize_stop_words:
    nlp.vocab[word].is_stop = True
    
def preprocess(text):
    text = text.split('\n')[0].lower()
    doc = nlp(text)
    temp = []
    for word in doc:
        # Keep tokens that are not stop words, punctuation, numbers, or a stray 'n'
        if word.text != 'n' and not word.is_stop and not word.is_punct and not word.like_num:
            # Store the lemmatized, lowercased form of the token
            temp.append(word.lemma_.lower())
    return temp

# Tokenize each complaint
docs = data['Description'].apply(preprocess)

In [6]:
docs

In [7]:
dictionary = Dictionary(docs)
print('Distinct words in initial documents:', len(dictionary))

# Filter out words that occur in fewer than 10 documents or in more than 40% of the documents.
dictionary.filter_extremes(no_below=10, no_above=0.4)
print('Distinct words after removing rare and common words:', len(dictionary))
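
To make the filtering step concrete, here is a rough pure-Python sketch of pruning a vocabulary by document frequency. It is not gensim's code: `filter_by_doc_freq` and the sample documents are hypothetical, and the sketch only mirrors the idea behind `no_below` and `no_above` (keep tokens that appear in at least `no_below` documents and in no more than a fraction `no_above` of them). Rounding at the boundary may differ from gensim.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (hypothetical helper)."""
    # Count each token once per document (document frequency).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

sample_docs = [
    ["internet", "slow", "speed"],
    ["internet", "bill", "charge"],
    ["internet", "slow", "outage"],
    ["internet", "bill", "refund"],
]
kept = filter_by_doc_freq(sample_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['bill', 'slow']
```

Here "internet" is dropped because it appears in every document, and "speed", "charge", "outage", and "refund" are dropped because each appears in only one.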

In [8]:
#id2word is an optional dictionary that maps the word_id to a token
corpus = [dictionary.doc2bow(doc) for doc in docs]
num_topics = 5
model = gensim.models.LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, passes=20, workers=2)

In [9]:
for i in range(num_topics):
    print('\nTopic {}\n'.format(str(i)))
    # show_topic returns (term, probability of the term within the topic) pairs
    for term, weight in model.show_topic(i, topn=10):
        print('{:20} {:.3f}'.format(term, weight))

In [11]:
top_labels = {0: 'Customer Services', 1: 'Internet Speed', 2: 'Data Caps', 3: 'Pricing', 4: 'Billing'}

In [12]:
vector = TfidfVectorizer(input='content', analyzer = 'word', lowercase=True, stop_words='english',tokenizer=preprocess)
desc = vector.fit_transform(data['Description']).toarray()

In [13]:
from collections import OrderedDict
def get_doc_topic_dist(model, corpus, kwords=False):
    '''
    Project each document into the topic space.

    model: the trained LDA model
    corpus: the documents in bag-of-words form
    kwords: if True, return the dominant topic id of each document;
            otherwise return the full document-topic matrix
    '''
    top_dist = []
    keys = []
    for d in corpus:
        # model[d] omits low-probability topics, so start every topic at 0
        tmp = {i: 0 for i in range(model.num_topics)}
        tmp.update(dict(model[d]))
        vals = [tmp[i] for i in range(model.num_topics)]
        top_dist.append(vals)
        if kwords:
            keys.append(int(np.argmax(vals)))

    return keys if kwords else np.asarray(top_dist)

In [14]:
# Terms selected from the raw documents. get_feature_names() was removed in
# scikit-learn 1.2; on versions older than 1.0, call get_feature_names() instead.
features = vector.get_feature_names_out()
lda_keys= get_doc_topic_dist(model, corpus, True)

In [15]:
top_words = []
for n in range(len(desc)):
    # Indices of the five highest TF-IDF terms in this complaint
    # (np.int0 was removed in NumPy 2.0; argsort already returns integers)
    inds = np.argsort(desc[n])[::-1][:5]
    top_words += [', '.join([features[i] for i in inds])]

data['Description Top Words'] = top_words
data['Topic'] = lda_keys
data.head(10)

In [16]:
# Topic coherence measures score a single topic by measuring the degree of semantic similarity between its high-scoring words.
# These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference.
CoherenceModel(model = model, texts = docs, dictionary = dictionary, coherence='c_v').get_coherence()
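
To give a feel for what a coherence score rewards, here is a small sketch of a simpler, UMass-style measure. It is deliberately not the `c_v` measure computed above, and it is not gensim's implementation. The helper `umass_coherence` and the sample documents are hypothetical, and the sketch assumes every top word occurs in at least one document.

```python
import math

def umass_coherence(top_words, docs):
    """UMass-style coherence for one topic: sum over word pairs of
    log((documents containing both words + 1) / documents containing the earlier word)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents containing all the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log(
                (doc_count(top_words[m], top_words[l]) + 1) / doc_count(top_words[l])
            )
    return score

sample_docs = [
    ["internet", "slow", "speed"],
    ["internet", "slow", "outage"],
    ["bill", "charge", "refund"],
    ["bill", "refund", "charge"],
]
coherent = umass_coherence(["internet", "slow", "speed"], sample_docs)
mixed = umass_coherence(["internet", "refund", "speed"], sample_docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents push the score up, so the topic made of related words scores higher than the mixed one.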

In [17]:
test = 'internet speed is slow, call customer service now!!'
tokens = preprocess(test)
model[dictionary.doc2bow(tokens)]
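
The model returns a list of (topic id, probability) pairs for the new text. A minimal sketch of turning that list into one of the complaint categories follows. The helper `label_for` is hypothetical, the labels are the ones assigned earlier in this notebook, and the probabilities are made up for illustration rather than taken from a real run.

```python
def label_for(topic_probs, labels):
    """Return (label, probability) of the most probable topic.

    `topic_probs` is a list of (topic_id, probability) pairs, the shape
    returned by indexing a gensim LDA model with a bag-of-words vector.
    """
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    return labels[topic_id], prob

# Labels from this notebook; the probabilities below are invented for illustration.
top_labels = {0: 'Customer Services', 1: 'Internet Speed', 2: 'Data Caps', 3: 'Pricing', 4: 'Billing'}
example_output = [(0, 0.21), (1, 0.68), (3, 0.11)]
print(label_for(example_output, top_labels))  # ('Internet Speed', 0.68)
```

Because topic ids are assigned arbitrarily during training, the mapping in `top_labels` only holds for the trained model whose topics were inspected when the labels were written.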

In [18]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(model, corpus, dictionary)

The left panel displays the topics and the distances between them. Topics that are close in meaning appear close together, while dissimilar topics appear farther apart.

The right panel displays a bar chart showing how frequent each term is in the corpus and how "distinctive" it is in distinguishing between topics.