# <font color='#31708f'><center>Zendesk Tickets Word Cloud</center></font>

* [Obtaining data from APIs](#first-bullet)

* [Preprocessing](#second-bullet)

* [WordCloud](#third-bullet)

* [CountVectorizer](#fourth-bullet)

* [LDA Model](#fifth-bullet)

Set up the development environment: 
- Install Git
- Install Anaconda distribution (bookmarks)

- Install Node.js
- Install Postman native app

Create new Python environment and check if we are in the right environment:
- create the new environment without asking for confirmation (conda create --name testenv pandas jupyter ipykernel tabulate -y)
- activate the environment (conda activate testenv)
- use ipykernel to register your new environment as a kernel named testenv (python -m ipykernel install --user --name testenv --display-name "testenv")
- import tabulate

In the new environment install the following packages:
- Install ijson (conda install)
- Install natsort (conda install)
- Install package gensim (pip install)
- Install package spacy (conda install)
- Download package en_core_web_sm (python -m spacy download en_core_web_sm)
- Install matplotlib (conda install)
- Install package scikit-learn, imported as sklearn (pip install scikit-learn)

To fix the Jupyter "unresponsive" error, install the following in the test environment:
- Install package nbstripout (pip install nbstripout)
- To clear the error, run in the terminal: nbstripout filename.ipynb

In [None]:
!mkdir diagrams

In [None]:
%%file diagrams/block_diagram
blockdiag {
    orientation = portrait

  A -> B -> C -> D -> E -> F;

  // Set labels to nodes.
   A [label = "Obtaining data from APIs"];
   B [label = "Preprocessing"];
   C [label = "WordCloud"];
   D [label = "CountVectorizer"];
   E [label = "LDA Model"];
   F [label = "Test"];
 
  // Set border-style, background-color and text-color to nodes.
  
   A,B,C,D,E,F [color = "#31708f", textcolor="#FFFFFF"];

  // Set width and height to nodes.
   A,B,C,D,E,F [width = 300]; // default value is 128
    
  // Set fontsize
  default_fontsize = 20;  // default value is 11
        
}

In [None]:
!blockdiag diagrams/block_diagram

In [None]:
from IPython.display import Image
Image("diagrams/block_diagram.png")

Set up jekyll for gh pages:
- Install Ruby (verify installation: ruby -v; gem -v)
- Install jekyll bundler (gem install; verify installation: jekyll -v)
- cd Documents (in git bash)
- jekyll new mynewsite
- cd mynewsite
- bundle exec jekyll serve
- gem "jekyll-theme-hydeout" (in the gemfile)
- bundle install (in git bash)
- theme: jekyll-theme-hydeout (in _config file)

In the test environment for word cloud install the following: 
- Install package wordcloud (conda install -c conda-forge wordcloud)

In the test environment for word cloud TF-IDF install the following:
- Install package numpy (conda install)

In the new environment for topic modelling install the following:
- Install package pandas (conda install)
- Install package nltk (conda install)



- Install package pyLDAvis (pip install)
- Install package scipy (conda install)

In the test environment for block diagram install the following:
- Install package blockdiag (pip install)

Check if a package is installed:
- In Anaconda Prompt (run as Admin)
- conda list

<div class="alert alert-block alert-info">In Anaconda Prompt</div>

List all environments. The active environment is marked with *. The list includes the base environment.

In [None]:
conda env list

Create a new environment named testenv:

In [None]:
conda create -n testenv

View the new environment created:

In [None]:
conda env list

View Jupyter Lab: http://localhost:8888/lab?

# <font color='#31708f'><center>Obtaining data from APIs<a class="anchor" id="first-bullet"></a></center></font>

 <div class="alert alert-block alert-info">In Git Bash</div>

Set the working directory:

In [None]:
cd Desktop/JUPYTER_NOTEBOOKS

- Install Newman (command-line collection runner for Postman):

In [None]:
npm install -g newman

<div class="alert alert-block alert-info">In Base64</div>

AUTHORIZATION - Basic Authentication and API tokens<br> 
Use your company email address and Zendesk API key. The credentials must be sent in an Authorization header in the HTTP request. 

Authenticate a request with basic authentication and API token:
- Join your email address followed by `/token` and your Zendesk API key with a colon (no spaces):
```svetlana.staneva@eventsforce.com/token:{zendesk_api_key}```
- Base64-encode the resulting string, for example:
```amRvZUBleGFtcGxlLmNvbTpwYSQkdzByZA==```
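
The encoding step can also be done in Python instead of on a website. A minimal sketch using only the standard library; the function name `build_basic_auth_header` and the sample address and token are placeholders, not part of the Zendesk API:

```python
import base64

def build_basic_auth_header(email, api_token):
    """Return the Authorization header value for basic auth with an API token."""
    # The credentials string is "{email}/token:{api_token}", with no space after the colon.
    credentials = f"{email}/token:{api_token}"
    encoded = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return f"Basic {encoded}"

# Placeholder values; substitute your own address and API token.
header_value = build_basic_auth_header("jdoe@example.com", "your_api_token")
print("Authorization:", header_value)
```

The printed value is what goes into the Authorization header preset in Postman.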

<div class="alert alert-block alert-info">In TextMechanic</div>

- Generate lists of numbers (1-3000, 3001-5000, etc.) up to 59046
- Create runner.csv files with a column ticket_id listing the numbers from 1 to 3000, etc.
- Move the files to the working directory JUPYTER_NOTEBOOKS
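
The number lists and CSV files can also be produced with a short script instead of TextMechanic. A sketch, assuming a fixed chunk size of 3000 and a hypothetical file-name pattern `runner_<last id>.csv`; the helper `write_runner_files` is illustrative only:

```python
import csv

def write_runner_files(last_ticket_id, chunk_size, prefix="runner"):
    """Write CSV files with a single ticket_id column, chunk_size IDs per file."""
    paths = []
    for start in range(1, last_ticket_id + 1, chunk_size):
        end = min(start + chunk_size - 1, last_ticket_id)
        path = f"{prefix}_{end}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["ticket_id"])          # header expected by the collection
            writer.writerows([i] for i in range(start, end + 1))
        paths.append(path)
    return paths

# Ticket numbers 1 to 59046, as in the steps above; the chunk size is an assumption.
paths = write_runner_files(last_ticket_id=59046, chunk_size=3000)
print(paths)
```

Run it from the working directory JUPYTER_NOTEBOOKS so the files land next to the exported collection.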

<div class="alert alert-block alert-info">In Postman</div>

Postman Comments Collection<br>
Comments is a Postman collection that lists the comments on Zendesk tickets numbered 1 to 59046.

Create a new environment:
- Click on the Cog icon
- Click Add
- In Add Environment enter a name for the environment - for instance, Test 
- Click Add

Select an active environment:
- Click the dropdown menu in the upper right corner of the Postman app to select an active environment (Test)

Request data via GET request: https://eventsforce.zendesk.com/api/v2/tickets/{ticket_id}/comments.json

In Headers include the base64-encoded string:
- In Headers go to Presets > Manage Presets
- Add the Authorization and click Add:
```Authorization: Basic {base64-encoded-string}```

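In Tests add the following line of code. It stores the response body as the test name, so the body is written to the `assertion` field of the Newman JSON report:
```tests[responseBody] = true;```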

Create Comments Collection
- Click on Collections > Create New Collection
- Enter a name - for instance, Comments and click Create

Save the Request in the Collection:
- Click on Save As
- Select the Collection and click Save

Run the collection locally in the Postman Collection Runner with the runner_head.csv file.

Export the Comments collection: 
- Click on Collections 
- Hover over the Comments collection and click on the dots '...' that appear
- Click Export and again Export
- Save the file on the Desktop
- The exported file is COMMENTS.postman_collection.json
- Move the file COMMENTS.postman_collection.json to the working directory JUPYTER_NOTEBOOKS

<div class="alert alert-block alert-info">In Git Bash</div>

Run the Comments collection with the additional runner.csv file of key values (by 3000) and generate report in json.<br>
```newman run COMMENTS.postman_collection.json -d runner.csv -r cli,json```<br>
In case it gives an error 'JavaScript heap out of memory' run the code in the following format:<br>
```NODE_OPTIONS=--max_old_space_size=2048 newman run COMMENTS.postman_collection.json -d runner.csv -r cli,json```<br>
or<br>
```NODE_OPTIONS=--max_old_space_size=3072 newman run COMMENTS.postman_collection.json -d runner.csv -r cli,json```<br>
or run the Comments collection with the additional runner.csv file of key values (by 1000) and generate report in json.

In order to run multiple runs with different runner.csv files, create a name.sh file listing the commands:<br>
```newman run COMMENTS.postman_collection.json -d runner_1000.csv -r cli,json```<br>
```newman run COMMENTS.postman_collection.json -d runner_2000.csv -r cli,json```

Run the .sh file:<br>
```bash name.sh```
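
With many runner files, the .sh file can be generated rather than typed. A sketch that writes one `newman run` line per runner file, using the same command as above; the helper name `write_newman_script` and the runner file names are assumptions:

```python
from pathlib import Path

def write_newman_script(runner_files,
                        collection="COMMENTS.postman_collection.json",
                        script_path="name.sh"):
    """Write a shell script with one newman command per runner file."""
    lines = [f"newman run {collection} -d {runner} -r cli,json"
             for runner in runner_files]
    Path(script_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines

# Hypothetical runner files; list the ones that exist in the working directory.
commands = write_newman_script(["runner_1000.csv", "runner_2000.csv"])
print("\n".join(commands))
```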

<div class="alert alert-block alert-info">In Jupyter Notebook</div>

Check the working environment:

In [None]:
conda list

In [None]:
import sys
print(sys.version)

<div class="alert alert-block alert-info">In Jupyter Notebook</div>

# <font color='#31708f'><center>Preprocessing<a class="anchor" id="second-bullet"></a></center></font>

In [None]:
import glob 

#Make sure glob.glob returns a list of files. List all json files in newman folder.
file_list = glob.glob('/home/smsta/Desktop/zendesk_tag_cloud/newman/*.json')
for filename in file_list:
    print(filename)

In [None]:
import ijson

def parse_json(json_filename):
    with open(json_filename, 'r', encoding="utf8") as file:
        # load json iteratively
        parser = ijson.parse(file)
        for prefix, event, value in parser:
            print('prefix={}, event={}, value={}'.format(prefix, event, value))

if __name__ == '__main__':
    # filename is the last file listed by the loop in the previous cell
    parse_json(filename)

In [None]:
import ijson
import re
from string import punctuation

def extract_ticket_text_generator(json_filenames):
    """This function takes a list of files with tickets and extracts text from each ticket. It yields, for each ticket, a one-item list holding the merged comment text."""
    for filename in json_filenames:
        # When you open the file specify the encoding.
        with open(filename, 'r', encoding="utf8") as input_file:
            # Extract specific items from the file
            tickets = ijson.items(input_file, 'run.executions.item.assertions.item.assertion')
            for ticket in tickets:
                # Extract the substring between two markers
                comments = re.findall('plain_body(.+?)public', ticket)
                # Replace escaped newline '\\n' and non-breaking space 'nbsp' markers with a space
                comments = [re.sub(r'\\n|nbsp', ' ', c) for c in comments]
                # Remove any URL within a string
                comments = [re.sub(r'http\S+|www\S+', '', c) for c in comments]
                # Remove all of the punctuation in any item in the list. The result is for each ticket a list of comments.
                comments = [''.join(ch for ch in c if ch not in punctuation) for c in comments]
                # Join the comments with a single space. The result is for each ticket a one-item list with the merged comments.
                yield [' '.join(comments)]

In [None]:
def create_txt_files():
    """This function takes a list of text strings and saves each ticket in a .txt file."""
    data = extract_ticket_text_generator(file_list)
    # Make a flat list out of list of lists.
    flat_list = [item for sublist in data for item in sublist]
    for i in range(len(flat_list)):
        with open("ticket_%d.txt" % (i+1), 'w', encoding="utf-8") as f:
            f.write(flat_list[i])

In [None]:
create_txt_files()

(Optional) Create a .txt file with all tickets in it:

In [None]:
def create_tickets_all_txt_file():
    """This function takes a list of text strings and saves all tickets in a .txt file."""
    data = extract_ticket_text_generator(file_list)
    # Make a flat list out of list of lists.
    flat_list = [item for sublist in data for item in sublist]
    with open('ticket_all.txt', 'w', encoding="utf-8") as filehandle:
        # Save all elements of a list as a text file:
        for listitem in flat_list:
            filehandle.write('%s\n\n' % listitem)

(Optional)

In [None]:
create_tickets_all_txt_file()
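
The clean-up steps used in `extract_ticket_text_generator` can be checked on a small string before running them over all report files. A sketch with a made-up assertion string; `clean_comment` is a hypothetical helper that applies the same three substitutions to one comment:

```python
import re
from string import punctuation

def clean_comment(text):
    """Apply the clean-up steps to a single extracted comment."""
    # Replace escaped newlines and 'nbsp' markers with a space
    text = re.sub(r'\\n|nbsp', ' ', text)
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove punctuation
    return ''.join(ch for ch in text if ch not in punctuation)

# Made-up sample in the shape the regular expression expects
sample = r'"plain_body":"Hello,\nplease see https://example.com/help for details.","public":true'
comments = re.findall('plain_body(.+?)public', sample)
cleaned = [clean_comment(c) for c in comments]
print(cleaned)
```

Removing the URL leaves a double space behind; the later tokenization step ignores extra whitespace.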

In [None]:
from pathlib import Path

all_txt_files = []
for file in Path("zendesk_txt").rglob("*.txt"):
    all_txt_files.append(file)

# count the files found
n_files = len(all_txt_files)
print(n_files)

In [None]:
all_docs = []
for txt_file in all_txt_files:
    with open(txt_file, encoding="utf-8") as f:
        txt_file_as_string = f.read()
        all_docs.append(txt_file_as_string)

# <font color='#31708f'><center>WordCloud<a class="anchor" id="third-bullet"></a></center></font>

# <font color='#576675'>Load the packages:</font>

In [None]:
# Run in terminal or command prompt
# python3 -m spacy download en_core_web_sm

import numpy as np
import pandas as pd
import re, nltk, spacy, gensim

#Wordcloud
from wordcloud import WordCloud, STOPWORDS

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline

# <font color='#576675'>Tokenize and Clean-up using gensim’s simple_preprocess()</font>

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # tokenizing drops punctuation; deacc=True strips accent marks

data_words = list(sent_to_words(all_docs))

print(data_words[:1])

# <font color='#576675'>Lemmatization</font>

In [None]:
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only Noun, Adj, Verb, Adverb
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:2])

In [None]:
#Create custom list of English stopwords
custom_stopwords = ["a","about","above","after","again","against","ain","all","am","an","and","any","are","aren","aren't","as","at","be","because","been","before","being","below","between","both","but","by","can","couldn","couldn't","d","did","didn","didn't","do","does","doesn","doesn't","doing","don","don't","down","during","each","few","for","from","further","had","hadn","hadn't","has","hasn","hasn't","have","haven","haven't","having","he","her","here","hers","herself","him","himself","his","how","i","if","in","into","is","isn","isn't","it","it's","its","itself","just","ll","m","ma","me","mightn","mightn't","more","most","mustn","mustn't","my","myself","needn","needn't","no","nor","not","now","o","of","off","on","once","only","or","other","our","ours","ourselves","out","over","own","re","s","same","shan","shan't","she","she's","should","should've","shouldn","shouldn't","so","some","such","t","than","that","that'll","the","their","theirs","them","themselves","then","there","these","they","this","those","through","to","too","under","until","up","ve","very","was","wasn","wasn't","we","were","weren","weren't","what","when","where","which","while","who","whom","why","will","with","won","won't","wouldn","wouldn't","y","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves","could","he'd","he'll","he's","here's","how's","i'd","i'll","i'm","i've","let's","ought","she'd","she'll","that's","there's","they'd","they'll","they're","they've","we'd","we'll","we're","we've","what's","when's","where's","who's","why's","would","able","abst","accordance","according","accordingly","across","act","actually","added","adj","affected","affecting","affects","afterwards","ah","almost","alone","along","already","also","although","always","among","amongst","announce","another","anybody","anyhow","anymore","anyone","anything","anyway","anyways","anywhere","apparently","approximately","arent","arise","around","aside","ask","asking","auth","available","away","awfully","b",
                    "back","became","become","becomes","becoming","beforehand","begin","beginning","beginnings","begins","behind","believe","beside","besides","beyond","biol","brief","briefly","c","ca","came","cannot","can't","cause","causes","certain","certainly","co","com","come","comes","contain","containing","contains","couldnt","date","different","done","downwards","due","e","ed","edu","effect","eg","eight","eighty","either","else","elsewhere","end","ending","enough","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","except","f","far","ff","fifth","first","five","fix","followed","following","follows","former","formerly","forth","found","four","furthermore","g","gave","get","gets","getting","give","given","gives","giving","go","goes","gone","got","gotten","h","happens","hardly","hed","hence","hereafter","hereby","herein","heres","hereupon","hes","hi","hid","hither","home","howbeit","however","hundred","id","ie","im","immediate","immediately","importance","important","inc","indeed","index","information","instead","invention","inward","itd","it'll","j","k","keep","keeps","kept","kg","km","know","known","knows","l","largely","last","lately","later","latter","latterly","least","less","lest","let","lets","like","liked","likely","line","little","'ll","look","looking","looks","ltd","made","mainly","make","makes","many","may","maybe","mean","means","meantime","meanwhile","merely","mg","might","million","miss","ml","moreover","mostly","mr","mrs","much","mug","must","n","na","name","namely","nay","nd","near","nearly","necessarily","necessary","need","needs","neither","never","nevertheless","new","next","nine","ninety","nobody","non","none","nonetheless","noone","normally","nos","noted","nothing","nowhere","obtain","obtained","obviously","often","oh","ok","okay","old","omitted","one","ones","onto","ord","others","otherwise","outside","overall","owing","p","page","pages","part","particular","particularly","past","per","perhaps","placed","please",
                    "plus","poorly","possible","possibly","potentially","pp","predominantly","present","previously","primarily","probably","promptly","proud","provides","put","q","que","quickly","quite","qv","r","ran","rather","rd","readily","really","recent","recently","ref","refs","regarding","regardless","regards","related","relatively","research","respectively","resulted","resulting","results","right","run","said","saw","say","saying","says","sec","section","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sent","seven","several","shall","shed","shes","show","showed","shown","showns","shows","significant","significantly","similar","similarly","since","six","slightly","somebody","somehow","someone","somethan","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specifically","specified","specify","specifying","still","stop","strongly","sub","substantially","successfully","sufficiently","suggest","sup","sure","take","taken","taking","tell","tends","th","thank","thanks","thanx","thats","that've","thence","thereafter","thereby","thered","therefore","therein","there'll","thereof","therere","theres","thereto","thereupon","there've","theyd","theyre","think","thou","though","thoughh","thousand","throug","throughout","thru","thus","til","tip","together","took","toward","towards","tried","tries","truly","try","trying","ts","twice","two","u","un","unfortunately","unless","unlike","unlikely","unto","upon","ups","us","use","used","useful","usefully","usefulness","uses","using","usually","v","value","various","'ve","via","viz","vol","vols","vs","w","want","wants","wasnt","way","wed","welcome","went","werent","whatever","what'll","whats","whence","whenever","whereafter","whereas","whereby","wherein","wheres","whereupon","wherever","whether","whim","whither","whod","whoever","whole","who'll","whomever","whos","whose","widely","willing","wish","within","without","wont","words","world","wouldnt","www","x","yes","yet","youd","youre","z","zero","a's","ain't",
                    "allow","allows","apart","appear","appreciate","appropriate","associated","best","better","c'mon","c's","cant","changes","clearly","concerning","consequently","consider","considering","corresponding","course","currently","definitely","described","despite","entirely","exactly","example","going","greetings","hello","help","hopefully","ignored","inasmuch","indicate","indicated","indicates","inner","insofar","it'd","keep","keeps","novel","presumably","reasonably","second","secondly","sensible","serious","seriously","sure","t's","third","thorough","thoroughly","three","well","wonder"]

In [None]:
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=custom_stopwords,
                      background_color="white",
                      width = 4000,
                      height = 2000,
                      max_words=200, 
                      collocations = False,   # do not include bigrams (avoids repeated words)
                      min_word_length = 3
                         ).generate(' '.join(data_lemmatized))
    
plt.figure( figsize=(20,10) )
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
# write to file
wordcloud.to_file("word_cloud.png")

# <font color='#31708f'><center>CountVectorizer<a class="anchor" id="fourth-bullet"></a></center></font>

# <font color='#576675'>Create the Document-Word matrix</font>

In [None]:
#Convert a collection of text documents to a matrix of token counts
vectorizer=CountVectorizer(analyzer='word',   
                           token_pattern='[a-zA-Z]{3,}', # only alphabetic tokens of 3 or more letters
                           stop_words=custom_stopwords,  # remove stop words
                           lowercase=True,               # convert all words to lowercase
                           # min_df=10,                  # minimum required occurrences of a word
                           # max_features=50000,         # max number of unique words
                           
                       )
    
# this step generates word counts for the words in your docs 
data_vectorized=vectorizer.fit_transform(data_lemmatized)

6 rows (6 tickets), 54 columns (unique words)

In [None]:
#check rows (docs) and columns (unique words of three or more letters)
#Each cell holds the raw frequency of a word in a document
data_vectorized.shape

(Optional)

In [None]:
print(vectorizer.get_feature_names())

(Optional)

How many times a word has been used in a ticket

In [None]:
print(data_vectorized.toarray())

# <font color='#576675'>Count</font>

Get top_n_words:

In [None]:
#Total count of each word across all tickets (column sums); matches the Count column in the Excel spreadsheet
np.asarray(data_vectorized.sum(axis=0))

In [None]:
sum_words = np.asarray(data_vectorized.sum(axis=0))

In [None]:
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
sorted_words_freq =sorted(words_freq, 
                          key = lambda x: x[1], 
                          reverse=True)
sorted_words_freq[:10]

In [None]:
dataframe = pd.DataFrame(sorted_words_freq[:200],
                         columns=['words', 'count'])


dataframe.head(201)

dataframe.style.set_properties(subset=['words', 'count'], **{'width': '200px'})

In [None]:
# select rows in pandas dataframe
sliced = dataframe.iloc[[1,2,7,8,9,10,11,13,14,16,19,20,25,26,27,29,30,35,38,39,40,41,49,55,59,64,65,66,67,69,71,76,78,79,80,81,84,87,88,92,93,96,99,104,108,119,126,137,140,146,147,148,149,150,162,164,168,185,186,197], [0,1]]

sliced.style.set_properties(subset=['words', 'count'], **{'width': '200px'})

In [None]:
#create horizontal barplot
ax = sliced.head(10).plot.barh(x='words', y='count')

(Optional)

In [None]:
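#Verification for count of specific word
!grep -o -i support *.txt | wc -l

Got 393559. Check where the discrepancy comes from. One likely source: `grep -o -i` counts every case-insensitive substring match in the raw text (including matches inside longer words such as "supported"), whereas CountVectorizer counts whole tokens in the lemmatized text.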
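
The difference between substring matching and token matching can be seen on a few made-up strings. A sketch using only `re`; the sample documents are invented, and lemmatization (which would map "supports" to "support") is not modelled here:

```python
import re

# Invented sample documents
docs = ["Support replied", "Unsupported browser; support ticket", "supports SSO"]

# Substring matching, as with grep -o -i: every case-insensitive occurrence counts
substring_hits = sum(len(re.findall("support", d, flags=re.IGNORECASE)) for d in docs)

# Token matching: split into alphabetic tokens of 3+ letters, then compare whole tokens
token_hits = sum(
    1
    for d in docs
    for tok in re.findall(r"[a-zA-Z]{3,}", d.lower())
    if tok == "support"
)

print(substring_hits, token_hits)
```

Here the substring count is twice the token count, because "Unsupported" and "supports" contain the substring but are different tokens.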

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
    
# Plot horizontal bar graph
dataframe.head(10).sort_values(by='count').plot.barh(x='words',
                                                     y='count',
                                                     ax=ax,
                                                     color="#a3e8e5",
                                                     fontsize=16,
                                                     xlabel='Words')

ax.set_title("Top ten words in Zendesk tickets", fontsize=18)

plt.show()

# <font color='#576675'>Document Frequency</font>

In [None]:
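#Verification for document frequency of a specific word
!grep -i support *.txt | wc -l

Got 34654. Check where the discrepancy comes from. Note that this command counts matching lines, not files; `grep -l -i support *.txt | wc -l` counts the files that contain a match.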

In [None]:
#Count strings with substring via string list
#Verification for document frequency of a specific word
subs = 'support'
res = len([i for i in all_docs if subs in i]) 
print ("All strings count with given substring are : " + str(res))

In [None]:
#Count in how many strings within a list a specific substring appears
#Verification for document frequency of a specific word
substring = "support"
l = []
for x in all_docs:
    if substring in x:
        l.append(1)
# print once, after the loop has finished
print(substring, len(l))

# <font color='#31708f'><center>TfidfTransformer</center></font>

# <font color='#576675'>Smoothed-IDF</font>

In [None]:
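from sklearn.feature_extraction.text import TfidfTransformer
cv, word_count_vector = vectorizer, data_vectorized  # assumption: reuse the CountVectorizer and count matrix built above
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True) 
tfidf_transformer.fit(word_count_vector)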
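
With `smooth_idf=True`, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t. A pure-Python sketch of that formula on made-up tokenized documents; the helper `smoothed_idf` is illustrative and is not the library code:

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Invented, already tokenized documents
tokenized_docs = [["login", "error"], ["login", "page"], ["invoice"]]
n_docs = len(tokenized_docs)

# Document frequency: in how many documents each term occurs
vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
doc_freq = {term: sum(term in doc for doc in tokenized_docs) for term in vocabulary}

idf = {term: smoothed_idf(n_docs, df) for term, df in doc_freq.items()}
for term in vocabulary:
    print(term, doc_freq[term], round(idf[term], 3))
```

Terms that occur in fewer documents get a larger weight, which is why "invoice" ends up above "login" here.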

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
    
# Plot horizontal bar graph
dataframe.head(10).sort_values(by='count').plot.barh(x='words',
                                                     y='count',
                                                     ax=ax,
                                                     color="#a3e8e5",
                                                     fontsize=16, 
                                                     xlabel='Words')
    
ax.set_title("Ten most frequent words in Zendesk tickets", fontsize=18)

plt.show()

In [None]:
# print idf values:
df_idf = pd.DataFrame(tfidf_transformer.idf_, 
                      index=cv.get_feature_names(),
                      columns=["idf_weights"]) 
    
# sort ascending 
df_idf.sort_values(by=['idf_weights'])

# <font color='#576675'>TF_IDF</font>

In [None]:
# count matrix
count_vector=cv.transform(all_docs)

# tf-idf scores
tf_idf_vector=tfidf_transformer.transform(count_vector)
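
`TfidfTransformer.transform` multiplies each term count by its idf weight and, with the default `norm='l2'`, scales every document row to unit length. A pure-Python sketch of that step for one document; the counts and idf values are made up and `tfidf_row` is a hypothetical helper:

```python
import math

def tfidf_row(term_counts, idf):
    """Return L2-normalized tf-idf weights for one document."""
    weights = {term: count * idf[term] for term, count in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights
    return {term: w / norm for term, w in weights.items()}

# Invented idf weights and term counts for a single ticket
idf = {"login": 1.29, "error": 1.69, "invoice": 1.69}
counts = {"login": 2, "error": 1}

row = tfidf_row(counts, idf)
print(row)
```

Because of the normalization, scores are comparable across tickets of different lengths.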

In [None]:
feature_names = cv.get_feature_names() 
    
#get tfidf vector for first document 
first_document_vector=tf_idf_vector[0] 
    
#print the scores 
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"]) 
df.sort_values(by=["tfidf"],ascending=False)

In [None]:
# settings that you use for count vectorizer will go here 
tfidf_vectorizer=TfidfVectorizer(use_idf=True) 
 
# just send in all your docs here 
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(all_docs)

In [None]:
# get the first vector out (for the first document) 
first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0] 
 
# place tf-idf values in a pandas data frame 
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names(), columns=["tfidf"]) 
df.sort_values(by=["tfidf"],ascending=False)

In [None]:
tfidf_vectorizer=TfidfVectorizer(use_idf=True)
 
# just send in all your docs here
fitted_vectorizer=tfidf_vectorizer.fit(all_docs)
tfidf_vectorizer_vectors=fitted_vectorizer.transform(all_docs)

Check the Sparsicity:

In [None]:
# Compute Sparsicity = Percentage of Non-Zero cells.
# nnz is the number of stored entries, so the sparse matrix does not have to be converted to a dense one.
n_cells = data_vectorized.shape[0] * data_vectorized.shape[1]
print("Sparsicity: ", (data_vectorized.nnz / n_cells) * 100, "%")

Build LDA model with sklearn:

To do: study in more detail what the code below does.

In [None]:
# Build LDA Model
lda_model = LatentDirichletAllocation(n_components=10,               # Number of topics
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',   
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                      )
lda_output = lda_model.fit_transform(data_vectorized)

print(lda_model)  # Model attributes

In [None]:
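# Print the 10 highest-weighted words of each topic (scikit-learn's LDA has no get_topic_terms method)
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
for i, topic in enumerate(lda_model.components_):
    top_words = [feature_names[j] for j in topic.argsort()[::-1][:10]]
    print(str(i)+": "+ ", ".join(top_words))
    print()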

10 topics

Diagnose model performance with perplexity and log-likelihood:

A higher log-likelihood is better. A lower perplexity is better.

In [None]:
# Log Likelihood: Higher the better
print("Log Likelihood: ", lda_model.score(data_vectorized))

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(data_vectorized))

# See model parameters
pprint(lda_model.get_params())
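
The relationship in the comment above, perplexity = exp(-log-likelihood per word), can be checked with plain numbers. A sketch with made-up values; `perplexity_from_log_likelihood` is a hypothetical helper, not a scikit-learn function:

```python
import math

def perplexity_from_log_likelihood(total_log_likelihood, n_words):
    """Perplexity = exp(-1 * log-likelihood per word)."""
    return math.exp(-total_log_likelihood / n_words)

# Made-up example: 10 words, each assigned probability 1/8 by the model
n_words = 10
total_log_likelihood = n_words * math.log(1 / 8)

perplexity = perplexity_from_log_likelihood(total_log_likelihood, n_words)
print(total_log_likelihood, perplexity)
```

A model that gives every word probability 1/8 has a perplexity of about 8, so a less negative log-likelihood always means a lower perplexity.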

How to GridSearch the best LDA model?

In [None]:
# Define Search Param
search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# Init the Model
lda = LatentDirichletAllocation()

# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search
model.fit(data_vectorized)

How to see the best topic model and its parameters?

In [None]:
# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))

Compare LDA Model Performance Scores

In [None]:
# Get Log Likelihoods from Grid Search Output
n_topics = [10, 15, 20, 25, 30]

log_likelyhoods_5 = [round(model.cv_results_['mean_test_score'][index]) for index, gscore in enumerate(model.cv_results_['params']) if gscore['learning_decay']==0.5]
log_likelyhoods_7 = [round(model.cv_results_['mean_test_score'][index]) for index, gscore in enumerate(model.cv_results_['params']) if gscore['learning_decay']==0.7]
log_likelyhoods_9 = [round(model.cv_results_['mean_test_score'][index]) for index, gscore in enumerate(model.cv_results_['params']) if gscore['learning_decay']==0.9]

# Show graph
plt.figure(figsize=(12, 8))
plt.plot(n_topics, log_likelyhoods_5, label='0.5')
plt.plot(n_topics, log_likelyhoods_7, label='0.7')
plt.plot(n_topics, log_likelyhoods_9, label='0.9')
plt.title("Choosing Optimal LDA Model")
plt.xlabel("Num Topics")
plt.ylabel("Log Likelihood Scores")
plt.legend(title='Learning decay', loc='best')
plt.show()

How to see the dominant topic in each document?

In [None]:
# Create Document - Topic Matrix
lda_output = best_lda_model.transform(data_vectorized)

# column names
topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]

# index names
docnames = ["Doc" + str(i) for i in range(lda_output.shape[0])]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

# Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)

# Apply Style
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics
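
The dominant topic is the column with the largest weight in each row of the document-topic matrix. A pure-Python sketch of that selection on made-up weights:

```python
from collections import Counter

# Made-up document-topic weights: one row per document, one column per topic
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.40, 0.30],
    [0.60, 0.30, 0.10],
]

# Index of the largest weight in each row (what np.argmax(..., axis=1) returns)
dominant_topic = [row.index(max(row)) for row in doc_topic]

# Number of documents per dominant topic
topic_counts = Counter(dominant_topic)
print(dominant_topic, topic_counts)
```

Counting the dominant topics per document gives the same kind of summary as the topic distribution table.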

Review topics distribution across documents

In [None]:
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution

How to visualize the LDA model with pyLDAvis?

Enable the automatic display of visualizations in the IPython Notebook.<br>
Transform and prepare a LDA model’s data for visualization.

In [None]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(best_lda_model, data_vectorized, vectorizer, mds='tsne')
panel

pyLDAvis - Python library for interactive topic model visualization. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

Save the visualization to a stand-alone HTML file for easy sharing:

In [None]:
# best_lda_model is a scikit-learn model, so prepare it with pyLDAvis.sklearn rather than pyLDAvis.gensim
p = pyLDAvis.sklearn.prepare(best_lda_model, data_vectorized, vectorizer)
pyLDAvis.save_html(p, 'lda.html')