# 04 Topic Modeling

> "Language shapes the way we think, and determines what we can think about." ~ Benjamin Lee Whorf

![word_cloud](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fspecials-images.forbesimg.com%2Fimageserve%2F491732087%2F960x0.jpg%3Ffit%3Dscale&f=1&nofb=1)

## Table of Contents

1. What is Natural Language Processing and What is Topic Modeling?
2. Key Concepts
3. What is Latent Dirichlet Allocation?
4. Analysis
5. Automated Topic Search

## 1. What is Natural Language Processing and What is Topic Modeling?

**Natural Language Processing**

> "Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can." ~ [IBM](https://www.ibm.com/cloud/learn/natural-language-processing)

**Topic Modeling**

> "In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both." ~ [Wikipedia](https://en.wikipedia.org/wiki/Topic_model)

## 2. Key Concepts

Here is a non-exhaustive list of Natural Language Processing concepts that you should be familiar with.

- **Corpus** - A collection of documents made up of words, sentences, paragraphs, numbers, punctuation, etc. For example, a collection of letters is a corpus.
- **Corpora** - The plural of corpus. For example, the job descriptions of one company form a corpus, and the job-description collections of several companies, taken together, are corpora.
- **Token** - Element inside a piece of text. This may be a word, a number, a space, any kind of punctuation, etc.
- **Tokenization** - Splitting a string (i.e. text) into its smallest components, or tokens.
- **Document** - A block of text of varying sizes. For example, a document might be a tweet, a menu, a book, a review, etc. 
- **Bag of Words** - A numerical representation of textual information that a statistical model can understand, process, and make inferences from. In a bag of words, the rows represent the documents in your corpus and the columns represent all of the unique words from all of your documents.
- **Topic** - A representation of similar information based on words and, sometimes, context as well.
- **Stop Words** - The most common words used in a language. These words appear so often that, in many applications of NLP, they are removed before the modeling stage.
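
To make a few of these concepts concrete, here is a minimal sketch in plain Python that tokenizes a made-up two-document corpus and builds a bag of words from it. The corpus, the `tokenize` helper, and the regular expression are assumptions for illustration only; later in this session spaCy and sklearn do this work for us.

```python
import re
from collections import Counter

# A tiny, made-up corpus: each string is one document.
corpus = [
    "Great pay and great people.",
    "The people are smart.",
]

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Tokenize every document in the corpus.
tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of unique tokens across all documents.
vocabulary = sorted({token for doc in tokenized for token in doc})

# Bag of words: one row per document, one column per vocabulary term.
bag_of_words = []
for doc in tokenized:
    counts = Counter(doc)
    bag_of_words.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bag_of_words:
    print(row)
```

Each row has as many entries as there are words in the vocabulary, and most entries are 0 because any single document uses only a few of those words.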

## 3. What is Latent Dirichlet Allocation?

> "Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document." ~ [David M. Blei, Andrew Y. Ng and Michael I. Jordan (2003)](https://jmlr.org/papers/volume3/blei03a/blei03a.pdf)

**Assumptions**
- There is some sort of structure in these documents, and LDA will try to collapse or separate this structure among your pre-defined number of topics.
- Each topic comes from, and can be represented as, a distribution of words or term frequencies.
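
The generative story behind these assumptions can be sketched in a few lines of plain Python. This is only a toy illustration of how LDA imagines documents being written, not how sklearn fits the model. The vocabulary, the number of topics, and the concentration value `0.5` are all made-up assumptions.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def sample_dirichlet(alphas):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]

vocabulary = ["pay", "benefit", "culture", "freedom"]  # hypothetical vocabulary
n_topics = 2                                           # hypothetical number of topics

# Each topic is a probability distribution over the vocabulary.
topic_word = [sample_dirichlet([0.5] * len(vocabulary)) for _ in range(n_topics)]

def generate_document(n_words):
    """Pick topic proportions for the document, then a topic and a word per position."""
    doc_topic = sample_dirichlet([0.5] * n_topics)  # this document's topic mixture
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(n_topics), weights=doc_topic)[0]
        word = rng.choices(vocabulary, weights=topic_word[topic])[0]
        words.append(word)
    return doc_topic, words

doc_topic, words = generate_document(8)
print(doc_topic)
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the word distributions that most plausibly produced them.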

## 4. Analysis

We will first look at how topic modeling is done with one company and with some base functions, and we will then look at the automated way of searching for a topic.

Let's start by importing the packages we will use throughout this session.

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import spacy # one of the best NLP libraries available in any programming language
from pprint import pprint # pprint stands for "pretty-print"

pd.options.display.max_columns = None # this allows us to see all columns displayed after a .head() or .tail() on our dataframes

In [2]:
df = pd.read_csv('data/netflix.csv') # let's read our dataframe
df.head() # show the first 5 rows

Unnamed: 0,reviewID,employerID,userID,gender,birthYear,highestEducation,metroID,metroName,stateID,stateName,countryID,jobTitleID,JobTitle,GOC,GOCconfidence,MGOC,MGOCconfidence,reviewDateTime,isCurrentJobFlag,jobEndingYear,OverallRating,CareerOpps,CompensationBenefits,SeniorLeadership,Worklife,CultureValues,RecommendFriend,BusinessOutlook,CEO,employerName,stockTicker,employerTypeCode,numberEmployees,annualRevenue,industry,sector,pros,cons,feedback
0,4151950,11891,24353329,FEMALE,1984.0,BACHELORS,0,,0,,1,0,,,,,,2014-04-30 23:52:26.027,1,,4.0,3.0,5.0,3.0,2.0,3.0,YES,Same,Approve,Netflix,NFLX,COMPANY_PUBLIC,4700,8830669000,Internet,Information Technology,You will be working with the most talented ppl...,Little bit politics in some teams.,
1,1863,11891,-1,,,,761,San Jose,2280,CA,1,35739,"Director, Product Management",product manager,0.913,product manager,0.913,2008-04-23 23:42:17.157,1,,5.0,4.0,4.5,5.0,4.5,,YES,,Approve,Netflix,NFLX,COMPANY_PUBLIC,4700,8830669000,Internet,Information Technology,Freedom and responsibility. You're treated lik...,"Netflix is not for everyone. You don't get ""di...",I have none. Senior management is fantastic. s...
2,4991,11891,2076,,,,761,San Jose,2280,CA,1,13321,Marketing Manager,marketing manager,1.0,marketing manager,1.0,2008-06-11 00:03:28.907,1,,5.0,5.0,5.0,5.0,4.5,,YES,,Approve,Netflix,NFLX,COMPANY_PUBLIC,4700,8830669000,Internet,Information Technology,Great colleagues -- incredible really,Domestic not global business -- wish we did eu...,"Focus on the customer, not on Apple"
3,53799,11891,68043,,,,700,Portland,3163,OR,1,64668,Support Staff,support staff,1.0,retail representative,1.0,2008-08-07 23:30:14.267,0,2008.0,2.0,1.0,4.5,4.0,5.0,,NO,,Approve,Netflix,NFLX,COMPANY_PUBLIC,4700,8830669000,Internet,Information Technology,The upper management of Netflix really does se...,"Specific to the Hillsboro location, the middle...","To the senior-most management in Los Gatos, I ..."
4,53937,11891,68207,,,,0,,0,,1,36451,Does IT Matter?,,0.0,,0.0,2008-08-08 09:12:42.493,0,2008.0,2.0,2.0,2.5,3.5,1.0,,NO,,Approve,Netflix,NFLX,COMPANY_PUBLIC,4700,8830669000,Internet,Information Technology,"The people there are fantastic, the service is...",It's frustrating to work for direct management...,"Stop being so secretive, just be upfront and h..."


In [3]:
df.shape

(693, 39)

Notice that we have quite a few columns. Since we are only interested in the **pros** reviews, let's extract that column.

In [4]:
pros_reviews = df['pros'].copy() # take the reviews column out of the dataframe
pros_reviews.head()

0    You will be working with the most talented ppl...
1    Freedom and responsibility. You're treated lik...
2                Great colleagues -- incredible really
3    The upper management of Netflix really does se...
4    The people there are fantastic, the service is...
Name: pros, dtype: object

Because we will be extracting words that will form a topic, we'll need to do some text preprocessing to get rid of whatever is not a word. We will also want all letters in lowercase, and we might want to reduce words to their root, if any. To do this, we will use `spacy`, which has an English language model ready to use. **Note**: the English language model gives us access to several functionalities on top of English text, such as tokenization, lemmatization, and entity recognition. The details are not important, but note that we now have a tool that will help us wrangle English text.

In [6]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
Collecting smart-open<4.0.0,>=2.2.0
  Downloading smart_open-3.0.0.tar.gz (113 kB)
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py) ... done
  Created wheel for smart-open: filename=smart_open-3.0.0-py3-none-any.whl size=107097 sha256=ad340f85746224aff6b8218db4fc133ff903261b90ae4995435f5a4103842401
  Stored in directory: /home/jovyan/.cache/pip/wheels/83/a6/12/bf3c1a667bde4251be5b7a3368b2d604c9af2105b5c1cb1870
Successfully built smart-open
Installing collected packages: en-core-web-sm, smart-open
  Attempting uninstall: smart-open
    Found existing 

In [8]:
nlp = spacy.load('en_core_web_sm') # we first load our English language model

Let's look at a review.

In [9]:
one_review = pros_reviews[418]

In [10]:
pprint(one_review)

('"Freedom and Responsibility" rings mostly true, although you\'d have to be '
 "stupid to not realize it isn't 100% true always and in everything. Simple, "
 "indisputable example: you can't decide how much you make, yourself. That's "
 'an obvious statement, yet it proves there definitely are no-goes and as '
 "such, gray areas. In practice, it hasn't been an issue for me--just "
 'clarifying. You want X number of monitors, Y super-amazing-computer, come in '
 'at 11am, leave at 4pm, take numerous vacations, ask (and receive a real '
 "answer!) dang near anything about what movies/shows they're secretly bidding "
 'on in Hollywood, etc, etc. Not only things a reasonable person would want to '
 'do/have, more like a well spoiled person. This all assumes you are '
 'communicating this with your team and performing well above average--Netflix '
 'does *not* knowingly hire junior, mid-level, or wannabe-senior engineers. I '
 "haven't personally seen a bunch of people get fired, but of th

Notice how the review above is quite messy and it has a lot of characters that, for all intents and purposes, will not be useful for our analysis. Let's examine a cleaner version of the review above by running it through our tokenizer.

In [11]:
parsed_review = nlp(one_review)

In [12]:
parsed_review

"Freedom and Responsibility" rings mostly true, although you'd have to be stupid to not realize it isn't 100% true always and in everything. Simple, indisputable example: you can't decide how much you make, yourself. That's an obvious statement, yet it proves there definitely are no-goes and as such, gray areas. In practice, it hasn't been an issue for me--just clarifying. You want X number of monitors, Y super-amazing-computer, come in at 11am, leave at 4pm, take numerous vacations, ask (and receive a real answer!) dang near anything about what movies/shows they're secretly bidding on in Hollywood, etc, etc. Not only things a reasonable person would want to do/have, more like a well spoiled person. This all assumes you are communicating this with your team and performing well above average--Netflix does *not* knowingly hire junior, mid-level, or wannabe-senior engineers. I haven't personally seen a bunch of people get fired, but of the one or two I've seen, it was not surprising. I ha

Much better and easier to read. Can we examine the sentences as well? You bet we can by using spaCy's many features.

Below we will use a loop to go over each sentence of our single review, along with its index.

In [57]:
for i in range(10):
    print(i * 10)

0
10
20
30
40
50
60
70
80
90


In [59]:
next(enumerate(parsed_review.sents))

(0,
 "Freedom and Responsibility" rings mostly true, although you'd have to be stupid to not realize it isn't 100% true always and in everything.)

In [16]:
# the temporary variable num will represent the index
# and the temporary variable sentence will represent each sentence of the review
# enumerate is a built-in Python function

for num, sentence in enumerate(parsed_review.sents):
    print(f"Sentence #{num}:\n {sentence}\n")

Sentence #0:
 "Freedom and Responsibility" rings mostly true, although you'd have to be stupid to not realize it isn't 100% true always and in everything.

Sentence #1:
 Simple, indisputable example: you can't decide how much you make, yourself.

Sentence #2:
 That's an obvious statement, yet it proves there definitely are no-goes and as such, gray areas.

Sentence #3:
 In practice, it hasn't been an issue for me--just clarifying.

Sentence #4:
 You want X number of monitors, Y super-amazing-computer, come in at 11am, leave at 4pm, take numerous vacations, ask (and receive a real answer!)

Sentence #5:
 dang near anything about what movies/shows they're secretly bidding on in Hollywood, etc, etc.

Sentence #6:
 Not only things a reasonable person would want to do/have, more like a well spoiled person.

Sentence #7:
 This all assumes you are communicating this with your team and performing well above average--Netflix does *not* knowingly hire junior, mid-level, or wannabe-senior enginee

Let's look at the named entities spaCy found in our single review, using the same approach as above.

In [17]:
for num, entity in enumerate(parsed_review.ents):
    print(f"Entity #{num}: {entity} -- {entity.label_}\n")

Entity #0: Freedom and Responsibility -- WORK_OF_ART

Entity #1: 100% -- PERCENT

Entity #2: 11am -- TIME

Entity #3: 4pm -- TIME

Entity #4: Hollywood -- GPE

Entity #5: one -- CARDINAL

Entity #6: two -- CARDINAL

Entity #7: BS -- ORG

Entity #8: Netflix -- GPE



We will now use additional functionalities to showcase more characteristics of our review. We will do so using list comprehensions. Think of these as cousins of loops with two main differences: the action comes first, and they always return a list.
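
If list comprehensions are new to you, the small sketch below compares a regular loop with its list-comprehension equivalent. The `words` list is a made-up example, not part of our reviews.

```python
words = ["Freedom", "and", "Responsibility"]

# A regular loop: set up an empty list, then append inside the loop body.
lowered_loop = []
for word in words:
    lowered_loop.append(word.lower())

# The equivalent list comprehension: the action (word.lower()) comes first,
# followed by the loop, and the result is always a new list.
lowered_comp = [word.lower() for word in words]

# An optional condition at the end filters the items.
long_words = [word for word in words if len(word) > 3]

print(lowered_comp)
print(long_words)
```

Both approaches build the same list; the comprehension simply does it in one line.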

In [20]:
# here we are taking out of the parsed review each token
token_text = [token.text for token in parsed_review]

# here we are lemmatizing each word possible
token_lemmas = [token.lemma_ for token in parsed_review]

# stopwords are very common so here we will extract a variable that will tell us whether
# a word is a stopword or not
token_stop = [token.is_stop for token in parsed_review]

# we will now add all three to a dataframe and display it without assigning it to a variable
pd.DataFrame(zip(token_text, token_lemmas, token_stop), columns=['Original Text', 'Lemmatized Text', 'stopwords']).head(40)

Unnamed: 0,Original Text,Lemmatized Text,stopwords
0,"""","""",False
1,Freedom,freedom,False
2,and,and,True
3,Responsibility,Responsibility,False
4,"""","""",False
5,rings,ring,False
6,mostly,mostly,True
7,true,true,False
8,",",",",False
9,although,although,True


Notice the middle column above, Lemmatized Text. This column holds the root (lemma) of each word in our review. Think of this as reducing words that share a meaning but differ in conjugation or number to their lowest common denominator. For example, related and relate, reasons and reason, considered and consider, etc. This step helps us assign one word and meaning to the same topic, instead of spreading differently spelled words with the same meaning across different topics.
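
To see why this matters for counting, here is a toy sketch. The `TOY_LEMMAS` lookup table is a hypothetical stand-in for illustration only; spaCy's lemmatizer is considerably more sophisticated than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical lookup table, for illustration only.
TOY_LEMMAS = {
    "reasons": "reason",
    "considered": "consider",
    "related": "relate",
}

tokens = ["reason", "reasons", "considered", "consider", "related"]

# Without lemmatization, every spelling is counted separately.
raw_counts = Counter(tokens)

# With lemmatization, variants collapse onto a single entry.
lemma_counts = Counter(TOY_LEMMAS.get(token, token) for token in tokens)

print(raw_counts)
print(lemma_counts)
```

After lemmatizing, five distinct spellings shrink to three entries, so a model sees more evidence for each underlying word.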

Let's now define a function that tells us whether a token is punctuation or whitespace, so that we can filter those tokens out.

In [26]:
def puncs_out(token): 
    return token.is_punct or token.is_space

We will also need to import our stopwords from spaCy to be able to filter them out from our reviews. We don't want `the`, `a`, `so`, etc. influencing our topics.

In [21]:
from spacy.lang.en.stop_words import STOP_WORDS
STOP_WORDS

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [22]:
len(STOP_WORDS)

326

Lastly, let's create a function that will remove punctuation, spaces, and stopwords, and also lemmatize the words in our reviews at the same time.

In [23]:
def lemma_in_stopw_out(doc):
    """
    This function takes in a piece of text, tokenizes it,
    lemmatizes it, removes punctuations and spaces, takes
    all stopwords out, and returns the clean piece of text.
    """
    
    tokens = nlp(doc)
    tokens_lemma = [token.lemma_ for token in tokens if not puncs_out(token)]
    tokens_clean = [token for token in tokens_lemma if token not in STOP_WORDS]
    return ' '.join(tokens_clean)

We will use pandas' convenient `.apply()` method to pass in our function above to each of the reviews we have. But first we will make every word in our reviews lowercase.

In [24]:
ready_revs = pros_reviews.str.lower()
ready_revs.head()

0    you will be working with the most talented ppl...
1    freedom and responsibility. you're treated lik...
2                great colleagues -- incredible really
3    the upper management of netflix really does se...
4    the people there are fantastic, the service is...
Name: pros, dtype: object

In [27]:
%%time 

# %%time is a cell magic that reports how long this cell took to run

ready_revs = ready_revs.apply(lemma_in_stopw_out)
ready_revs.head()

CPU times: user 7.38 s, sys: 19 ms, total: 7.4 s
Wall time: 7.4 s


0                                    work talented ppl
1    freedom responsibility treat like adult pro te...
2                           great colleague incredible
3    upper management netflix want different approa...
4    people fantastic service great facility terrif...
Name: pros, dtype: object

Because we don't want the company we are analysing reviews for to appear in the topics, we will remove it from our corpus.

In [28]:
# we access the string methods and create a mask of True's and False's marking where the company name appears
netflix_mask = ready_revs.str.contains('netflix')
netflix_mask.head()

0    False
1     True
2    False
3     True
4     True
Name: pros, dtype: bool

In [29]:
# we can filter a dataset by passing in the mask through square brackets []
# notice the index
ready_revs[netflix_mask].head()

1     freedom responsibility treat like adult pro te...
3     upper management netflix want different approa...
4     people fantastic service great facility terrif...
7     surround brilliant competent mature hard work ...
10    favorite thing netflix surround ridiculously t...
Name: pros, dtype: object

In [30]:
# a ~ in front of the mask gives us the opposite results
ready_revs[~netflix_mask].head()

0                                    work talented ppl
2                           great colleague incredible
5    benefit terrific I perfect shift free movie wo...
6    transparent corporate culture opportunity lear...
8    innovation intelligent people incredible brand...
Name: pros, dtype: object

In [32]:
# notice how the word netflix has now disappeared from the reviews
ready_revs[netflix_mask] = ready_revs[netflix_mask].str.replace('netflix', '', regex=False)
ready_revs.head(10)

0                                    work talented ppl
1    freedom responsibility treat like adult pro te...
2                           great colleague incredible
3    upper management  want different approach hand...
4    people fantastic service great facility terrif...
5    benefit terrific I perfect shift free movie wo...
6    transparent corporate culture opportunity lear...
7    surround brilliant competent mature hard work ...
8    innovation intelligent people incredible brand...
9    fast pace dynamic afraid try different company...
Name: pros, dtype: object
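
Notice that replacing the word with an empty string leaves a double space behind (see row 3 above). That does no harm here, but if you prefer tidy text, one possible alternative is sketched below with Python's built-in `re` module. The `drop_word` helper is hypothetical. Unlike the plain `str.replace` above, it only removes whole words, so it is not an exact equivalent.

```python
import re

def drop_word(text, word):
    """Remove every whole-word occurrence of `word`, then tidy the whitespace."""
    without_word = re.sub(rf"\b{re.escape(word)}\b", " ", text)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", without_word).strip()

print(drop_word("upper management netflix want different approach", "netflix"))
```

A function like this could be passed to `.apply()` in the same way we passed `lemma_in_stopw_out` earlier.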

Let's examine the differences before and after our preprocessing stage.

In [33]:
print('Original:')
print('-' * 30)
print(nlp(df.loc[148, 'pros']))
print()
print('Processed Text:')
print('-' * 30)
print(ready_revs[148])

Original:
------------------------------
Amazing experience, very competitive salary, great office space cutting edge technologies.
If you like fast moving environment and challenging task then this is a place for you. Company treats employees like adults

Processed Text:
------------------------------
amazing experience competitive salary great office space cut edge technology like fast environment challenging task place company treat employee like adult


Let's now create a bag of words with sklearn's `CountVectorizer` class. We will remove words that appear in fewer than 3 reviews (`min_df=3`), as well as those that appear in more than 95% of the reviews (`max_df=0.95`).
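
Document-frequency thresholds such as `min_df` and `max_df` count the number of documents a word appears in, not its total number of occurrences. The plain-Python sketch below mimics that filtering on a made-up corpus. The `filter_vocabulary` helper is hypothetical and only approximates what the vectorizer does internally.

```python
from collections import Counter

# Hypothetical, already-cleaned documents.
docs = [
    "great pay great people",
    "great culture",
    "great pay",
    "great freedom",
]

def filter_vocabulary(docs, min_df, max_df):
    """Keep terms found in at least min_df documents and at most max_df * n_docs documents."""
    n_docs = len(docs)
    # Count in how many documents each term appears (not how many times overall).
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(
        term for term, freq in doc_freq.items()
        if min_df <= freq <= max_df * n_docs
    )

print(filter_vocabulary(docs, min_df=2, max_df=0.95))
```

Here `great` is dropped because it appears in every document, and `people`, `culture`, and `freedom` are dropped because each appears in only one.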

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

In [37]:
# first we instantiate the vectorizer
vectorizer = CountVectorizer(min_df=3, max_df=0.95)

In [38]:
# then we fit and transform our clean reviews
bow = vectorizer.fit_transform(ready_revs)
bow

<693x658 sparse matrix of type '<class 'numpy.int64'>'
	with 8400 stored elements in Compressed Sparse Row format>

Notice the output of our bag of words. This is called a sparse matrix, and it is an efficient way of holding a matrix that is mostly 0's: only the non-zero counts (8,400 of them here) and their positions are stored.
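
In our case only 8,400 of the 693 x 658 cells hold a value, which is fewer than 2% of them. To get a feel for the Compressed Sparse Row idea, here is a rough plain-Python sketch that converts a small made-up dense matrix into the three arrays the format relies on. The `to_csr` helper is hypothetical; in practice SciPy builds and manages these arrays for us.

```python
# A small dense bag of words: 3 documents x 5 vocabulary terms.
dense = [
    [0, 0, 2, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 3, 0],
]

def to_csr(rows):
    """Convert a dense list of rows into the three CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero count itself
                indices.append(col)   # the column it belongs to
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

data, indices, indptr = to_csr(dense)
print(data)     # non-zero values, row by row
print(indices)  # column index of each value
print(indptr)   # row boundaries inside `data`
```

Only four values are stored instead of fifteen, and the saving grows quickly as the vocabulary gets larger.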

In [60]:
# select the number of topics
topics = 10

We will now instantiate our LDA model with the number of topics selected above and then fit it to our sparse matrix.

In [61]:
from sklearn.decomposition import LatentDirichletAllocation

In [62]:
lda_model = LatentDirichletAllocation(n_components=topics, # number of topics
                                      max_iter=100, # maximum number of passes over the data
                                      learning_method='online', 
                                      random_state=42, # setting a seed for reproducible results
                                      n_jobs=-1) # this parameter makes sure we use all of the cores in our machine

In [63]:
# pass in the bag of words
lda_model.fit(bow)

LatentDirichletAllocation(learning_method='online', max_iter=100, n_jobs=-1,
                          random_state=42)

Awesome, we just ran our first model so let's go ahead and create a function to evaluate the topics we extracted and see if these make sense.

In [64]:
def show_topics(vectorizer, lda_model, n_words=15):
    """
    This function takes our vectorizer, our model, and a
    number of words to display the topics from our model.
    """
    # on scikit-learn versions older than 1.0, use vectorizer.get_feature_names() instead
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
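
The key step in the function above is `(-topic_weights).argsort()[:n_words]`: negating the weights and sorting in ascending order gives the positions of the largest weights first. The plain-Python sketch below shows the same ranking idea on made-up numbers. The `top_terms` helper, the terms, and the weights are all hypothetical.

```python
def top_terms(weights, terms, n_words=3):
    """Return the n_words terms with the largest weights, highest first."""
    # Sort term positions by descending weight, then keep the first n_words.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [terms[i] for i in order[:n_words]]

# Hypothetical weights for one topic over a five-term vocabulary.
toy_terms = ["benefit", "coffee", "culture", "free", "pay"]
toy_weights = [4.0, 0.5, 0.1, 7.5, 6.0]

print(top_terms(toy_weights, toy_terms))
```

In `show_topics` this ranking is repeated once per row of `lda_model.components_`, that is, once per topic.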

In [65]:
# let's evaluate the topics
show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20)

[array(['free', 'pay', 'work', 'decent', 'food', 'benefit', 'balance',
        'life', 'health', 'time', 'coffee', 'good', 'plan', 'easy',
        'training', 'great', 'schedule', 'hour', 'insurance', 'fun'],
       dtype='<U15'),
 array(['dedicate', 'mail', 'year', 'salary', 'easy', 'lead', 'economy',
        'bonus', 'expense', 'ask', 'innovation', 'structure', '000', 'add',
        'film', 'grow', 'home', 'question', 'problem', 'subscriber'],
       dtype='<U15'),
 array(['lot', 'place', 'smart', 'culture', 'freedom', 'autonomy', 'great',
        'fast', 'idea', 'feedback', 'people', 'help', 'responsibility',
        'creative', 'pace', 'self', 'grow', 'thing', 'opportunity',
        'excellent'], dtype='<U15'),
 array(['pay', 'good', 'employee', 'great', 'environment', 'company',
        'salary', 'benefit', 'free', 'unlimited', 'room', 'perk',
        'vacation', 'excellent', 'work', 'competitive', 'stock',
        'friendly', 'nice', 'break'], dtype='<U15'),
 array(['company', 'f

To finish up we will create a variable with the names of the words in our vocabulary.

In [66]:
# sorted() gives us a list of the vocabulary words in alphabetical order, which matches the column order of the bag of words
terms = sorted(vectorizer.vocabulary_.keys())

In [67]:
# let's now create a dataframe with our bag of words
bow_docs = pd.DataFrame(bow.toarray(), columns=terms)
bow_docs.head()

Unnamed: 0,000,10,14,20,24,401k,ability,able,absolutely,accept,accomplishment,account,action,actually,add,adult,advance,advantage,advice,affordable,afraid,agency,agent,agile,ahead,allow,amazing,amenity,annual,answer,apple,apply,appreciate,appreciation,approach,approval,area,ask,aspect,assume,atmosphere,attendance,attitude,autonomy,available,average,away,awesome,bad,balance,banana,bar,base,bay,beat,beautiful,benefit,benifit,big,bike,bit,bold,bonus,book,brand,break,breakfast,breakroom,bright,brilliant,bring,bs,buck,build,building,bureaucracy,business,buzz,cafe,cafeteria,candid,capable,care,career,cash,casual,cater,cell,center,ceo,challenge,challenging,change,chat,cheap,check,cheese,choice,choose,class,clean,clear,close,co,code,coffee,collaborate,collaboration,collaborative,colleague,college,come,comfortable,communicate,communication,comp,company,compare,compensation,competent,competitive,complete,completely,complicated,con,consider,consistent,constant,constantly,content,context,continue,contribute,control,cool,corporate,couple,course,coworker,create,creative,cross,culture,cup,customer,cut,cutting,daily,date,datum,day,dead,deal,decent,decide,decision,deck,dedicate,definitely,degree,deliver,demand,dental,department,depend,development,device,difference,different,difficult,direct,direction,directly,director,distraction,door,dress,drink,drive,dvd,dvds,dynamic,early,easily,easy,economy,edge,effective,ego,emphasis,employ,employee,employer,employment,empower,encourage,end,engage,engineer,engineering,enjoy,entertainment,environment,equal,especially,etc,everybody,everyday,exactly,example,excel,excellent,exceptional,exciting,execute,execution,expect,expectation,expense,experience,expression,extra,extremely,face,facility,fact,fail,failure,fair,fairly,family,fantastic,far,fast,fear,feedback,feel,fellow,field,fill,film,find,fine,fire,fit,flat,flexibility,flexible,focus,focused,folk,food,form,forward,free,freedom,fresh,friend,friendly,fruit,fully,fun,function,future,gain,game,gatos,
gear,general,generally,generous,genuinely,gets,goal,good,great,group,grow,growth,hand,handle,happen,happy,hard,hasting,having,head,health,healthy,hear,help,helpful,high,highly,hire,hiring,hit,holiday,home,honest,hot,hour,hourly,hr,huge,idea,impact,important,impressed,impressive,improve,include,incredible,incredibly,independence,individual,industry,information,initiative,innovate,innovation,innovative,inspire,inspiring,instead,insurance,intelligent,interesting,internal,interview,issue,jerk,job,join,judgement,kind,kitchen,know,knowledge,lack,large,lay,lead,leader,leadership,learn,learning,leave,let,level,life,like,likely,line,little,live,location,long,look,los,lot,love,low,lucky,lunch,mac,machine,mail,maintain,manage,management,manager,market,match,matter,mature,meal,mean,medical,medium,meet,meeting,membership,merit,metric,micro,micromanagement,minimal,miss,model,money,month,motivate,motivated,movie,multiple,near,need,negative,new,nice,non,noodle,notch,number,oatmeal,offer,office,ok,online,open,operation,opinion,opportunity,option,organization,orient,outside,outstanding,overall,overtime,ownership,pace,package,parking,passionate,past,path,pay,paycheck,peer,people,perfect,perform,performance,performer,period,perk,permission,person,personal,perspective,phone,pick,place,plan,play,plenty,plus,policy,politic,popcorn,position,positive,possible,potential,practice,pretty,price,pro,problem,process,product,productive,professional,program,progressive,project,promote,provide,pto,purchase,push,quality,question,quickly,raise,raman,rank,rapid,rare,rate,reach,read,ready,real,realize,reason,reasonable,receive,recognize,reed,refreshing,regard,relate,relationship,relatively,relaxed,rental,rep,respect,respected,respectful,responsibility,responsible,result,resume,review,reward,rewarding,right,risk,rockstar,role,room,rule,run,salaried,salary,scale,schedule,scheduling,script,season,self,senior,sense,seriously,service,set,share,sharing,shift,short,silicon,simple,single,sit,site,skill,slide,sm
all,smart,snack,soda,solid,solution,solve,somewhat,space,speak,spend,staff,stand,standard,start,starter,startup,statement,status,stay,stock,strategic,strategy,streaming,stress,strong,structure,student,stuff,stunning,subscriber,subscription,subsidize,succeed,success,successful,super,superior,supervisor,support,supportive,sure,surround,swag,talent,talented,talk,task,tea,team,teamwork,tech,technical,technology,tell,temp,tend,term,terrific,test,thing,think,thoughtful,time,tolerate,ton,tool,total,track,training,transparency,transparent,treat,true,truly,trust,try,tv,type,understand,understanding,unique,unlike,unlimited,unnecessary,upper,use,user,usually,vacation,valley,value,vision,vp,wage,walk,want,watch,way,wear,week,weekend,willing,win,wish,wo,wonder,wonderful,work,worker,workforce,working,workplace,world,worth,year,young
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,4,0,0,0,0,0,0,1,1,2,0,0,0,0,0,1,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2,3,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,1,2,0,2,0,1,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,5,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0


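We can also examine the weight of each word within each topic. Note that `components_` holds unnormalized pseudo-counts, so a value becomes a proportion only after dividing by the total of its topic column.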

In [68]:
components = pd.DataFrame(lda_model.components_.T, index=terms, columns=['topic_' + str(i) for i in range(topics)])
components.head(20)

| term | topic_0 | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 | topic_9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 000 | 0.100038 | 0.977075 | 0.1 | 0.1 | 0.100003 | 0.100004 | 0.100003 | 0.100007 | 0.1 | 2.75117 |
| 10 | 0.10005 | 0.1 | 0.1 | 0.100075 | 1.044369 | 0.100011 | 0.1 | 0.100002 | 0.1 | 4.427836 |
| 14 | 2.282222 | 0.1 | 0.1 | 0.100025 | 0.100004 | 0.100024 | 0.977048 | 0.100003 | 0.1 | 2.323097 |
| 20 | 0.100007 | 0.1 | 0.10002 | 3.619879 | 0.10001 | 2.799783 | 0.100031 | 0.100018 | 0.1 | 0.1 |
| 24 | 2.619751 | 0.1 | 0.100019 | 0.100029 | 1.16903 | 0.100026 | 0.1 | 0.100006 | 0.100013 | 0.1 |
| 401k | 0.100021 | 0.1 | 0.1 | 7.348347 | 0.1 | 0.100005 | 0.1 | 0.100008 | 0.100005 | 0.100012 |
| ability | 0.100013 | 0.1 | 2.047246 | 0.100004 | 0.100012 | 0.1 | 0.100006 | 6.012648 | 3.12691 | 2.610968 |
| able | 0.10005 | 0.1 | 0.100011 | 0.100024 | 5.877869 | 0.100018 | 0.189227 | 4.948823 | 0.1 | 0.100001 |
| absolutely | 0.1 | 0.1 | 0.100026 | 0.100007 | 1.140983 | 2.344943 | 0.1 | 1.279872 | 0.1 | 0.1 |
| accept | 0.100002 | 0.1 | 0.1 | 5.002541 | 0.100002 | 0.100002 | 0.1 | 0.100007 | 0.1 | 0.1 |
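
The values in `components_` can be turned into per-topic word proportions by normalizing each topic column. Below is a minimal sketch of that step, using a small made-up matrix in place of `lda_model.components_.T`; the terms, numbers, and variable names are illustrative assumptions, not output from the model above.

```python
import pandas as pd

# Toy stand-in for lda_model.components_.T: rows are terms, columns are topics.
toy_terms = ["ability", "able", "accept"]
toy_components = pd.DataFrame(
    [[1.0, 6.0], [2.0, 3.0], [1.0, 1.0]],
    index=toy_terms,
    columns=["topic_0", "topic_1"],
)

# Divide each topic column by its total so that every column sums to 1.
word_given_topic = toy_components / toy_components.sum(axis=0)

# Rank the terms of each topic from highest to lowest proportion.
top_terms = {
    topic: word_given_topic[topic].sort_values(ascending=False).index.tolist()
    for topic in word_given_topic.columns
}
```

With the real model, the same division applied to `components` would give a table in which each topic column can be read as a distribution over words.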


Now that we know how to get topics given a model, let's automate the search for the best one.

## 5. Automated Topic Search

In [None]:
import nltk, re, math, csv
# nltk.download('wordnet')
# nltk.download('punkt')
# nltk.download('stopwords')  # needed for nltk.corpus.stopwords further down

import koolture as kt

from string import punctuation # string of all ASCII punctuation characters
from functools import partial # fixes some arguments of a function and returns a new callable
import concurrent.futures as cf # parallel processing module
from collections import defaultdict

Read in the new dataset.

In [None]:
df = pd.read_csv('data/clean_gs.csv')
df.head()

How many columns and rows do we have?

In [None]:
df.shape

First range of topics we will search through.

In [None]:
our_range = 2, 10, 50, 100, 150, 200, 250, 300

Let's look at the companies and the number of reviews we have for each of them.

In [None]:
comps_of_interest = df.employer.value_counts()
comps_of_interest.head(8)

Let's extract the companies with exactly 48 reviews. You can change this threshold however you please.

In [None]:
comps_of_interest = (comps_of_interest[(comps_of_interest == 48)]).index
len(comps_of_interest), comps_of_interest
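
If you want to experiment with a different selection rule, the sketch below shows the same filtering pattern on a made-up table. The data, the `target` value, and the looser `>=` variant are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical reviews table; only the `employer` column matters here.
toy = pd.DataFrame(
    {"employer": ["A"] * 3 + ["B"] * 2 + ["C"] * 3 + ["D"] * 4}
)

target = 3  # plays the role of 48 in the cell above
counts = toy["employer"].value_counts()

# Employers with exactly `target` reviews (the rule used above).
exact = counts[counts == target].index

# A looser alternative: employers with at least `target` reviews.
at_least = counts[counts >= target].index
```

Keeping the number of reviews equal across companies, as the exact match does, means every company contributes the same number of documents to the comparison.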

We will now create a dataframe with only those companies selected above.

In [None]:
cond2 = df['employer'].isin(comps_of_interest) # create the condition
df_interest = df[cond2].copy() # get the new dataset
unique_ids = df_interest['employer'].unique() # get the unique IDs or unique employers in the dataset
unique_ids

We will need a small dataframe with only the name of each company and the number of reviews it has.

In [None]:
reviews_nums = df_interest['employer'].value_counts().reset_index()
reviews_nums.columns = ['employerID', 'reviews_nums']
reviews_nums.head()

## Fix Custom Stopwords List Before Cleaning

Select our reviews and create a customized list of stopwords.

In [None]:
data_pros = df_interest['pros'].values
stopwords = nltk.corpus.stopwords.words('english') + [token.lower() for token in unique_ids]
stopwords[-10:]

The text preprocessing of the corpus takes place in parallel. You first normalize the reviews and then take the root of the words.

In [None]:
normalize_doc = partial(kt.normalize_doc, stopwords=stopwords)

Clean and process the data. Assign cleaned reviews to a new variable.

In [None]:
%%time

with cf.ProcessPoolExecutor() as e:
    data_pros_cleaned = e.map(normalize_doc, data_pros)
    data_pros_cleaned = list(e.map(kt.root_of_word, data_pros_cleaned))

df_interest['pros_clean'] = data_pros_cleaned
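
The pattern above (fix the extra arguments with `partial`, then hand a one-argument function to the executor's `map`) can be tried out on a small scale. The sketch below is an illustration under stated assumptions: `toy_normalize` is a hypothetical stand-in for `kt.normalize_doc`, and a `ThreadPoolExecutor` is swapped in for the `ProcessPoolExecutor` so the example runs anywhere, including interactive sessions. Both executors expose the same `map` method.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def toy_normalize(doc, stopwords):
    """Lowercase, keep alphabetic tokens, and drop stopwords (hypothetical helper)."""
    tokens = [tok for tok in doc.lower().split() if tok.isalpha()]
    return " ".join(tok for tok in tokens if tok not in stopwords)


docs = ["Great pay and great people", "The hours are long"]
toy_stopwords = {"and", "the", "are"}

# partial() fixes the keyword argument, leaving a one-argument callable for map().
normalize = partial(toy_normalize, stopwords=toy_stopwords)

# Executor.map is lazy; list() forces evaluation and preserves the input order.
with ThreadPoolExecutor() as pool:
    cleaned = list(pool.map(normalize, docs))
```

With a process pool, the mapped function and its arguments must be picklable, which is one reason to build the callable with `partial` rather than a lambda.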

## Create Vectorizers Container

We will need a bag of words for each of the companies whose reviews we are analyzing.

In [None]:
%%time

vectorizers_dicts = kt.get_vectorizers(data=df_interest, unique_ids=unique_ids,
                                       company_col='employer', reviews_col='pros', 
                                       vrizer=CountVectorizer())
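
The internals of `kt.get_vectorizers` are not shown here. As a rough idea of what a per-company bag of words looks like, the sketch below fits one `CountVectorizer` per employer on a made-up table; the tuple layout `(company, matrix, vectorizer)` and all names are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical reviews for two employers.
toy = pd.DataFrame({
    "employer": ["A", "A", "B", "B"],
    "pros": ["great pay", "great team", "flexible hours", "flexible schedule"],
})

bags = {}
for company, group in toy.groupby("employer"):
    vec = CountVectorizer()                  # a fresh vectorizer per company
    dtm = vec.fit_transform(group["pros"])   # sparse document-term matrix
    bags[company] = (company, dtm, vec)

# Each company ends up with its own vocabulary.
vocab_a = sorted(bags["A"][2].vocabulary_)
vocab_b = sorted(bags["B"][2].vocabulary_)
```

Fitting a separate vectorizer per company keeps each vocabulary limited to the words that company's reviewers actually used.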

The following block runs the models in parallel over the available companies, using the topic counts specified in `our_range`, and collects the output of the `get_models` function for each company. It is used to identify the interval to search further for the optimal number of topics.

In [None]:
%%time

partial_func = partial(kt.get_models, topics=our_range, vrizer_dicts=vectorizers_dicts, unique_ids=unique_ids)

with cf.ProcessPoolExecutor() as e:
    output = list(e.map(partial_func, unique_ids))

The next function iterates over the output from above, adds each dataset to a list, and concatenates them all into a single dataframe. `output_df` contains exactly the same information in a more readable form, and it is used in the following blocks.

In [None]:
output_df = kt.build_dataframe(output)
output_df.head()

The following function iterates over the new dataframe, searches for the top two topic counts by coherence for each company, and appends to a list a tuple containing the company, a tuple with the top two topic numbers, and the fitted vectorizer from the original `vectorizers_dicts`.

In [None]:
%%time

topics_sorted, comps, tops = kt.top_two_topics(data=output_df, companies_var='company',
                               coherence_var='coherence', topics_var='topics',
                               unique_ids=unique_ids, vrizers_list=vectorizers_dicts.values())
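
To make the selection step concrete, here is a minimal sketch of picking the two topic counts with the highest coherence per company. It is not the implementation of `kt.top_two_topics`; the toy dataframe only borrows the column names `company`, `topics`, and `coherence`, and the coherence values are made up.

```python
import pandas as pd

# Hypothetical coarse-search results: one row per (company, topic count).
toy_output = pd.DataFrame({
    "company":   ["A", "A", "A", "B", "B", "B"],
    "topics":    [2, 10, 50, 2, 10, 50],
    "coherence": [0.31, 0.48, 0.44, 0.52, 0.40, 0.47],
})

top_two = {}
for company, group in toy_output.groupby("company"):
    # Keep the two rows with the highest coherence for this company.
    best = group.nlargest(2, "coherence")["topics"]
    # Store the pair in ascending order so it can serve as search bounds.
    top_two[company] = tuple(sorted(int(t) for t in best))
```

The resulting pairs bracket the interval in which the finer search for each company takes place.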

Now run the `get_models` function again over the new space of topics. You will need to:
1. Sort the tuple with the top two topics.
2. Create a linearly spaced array of 10 elements between the top two topics, cast it to integers, convert it to a set to drop any duplicates (these arise, for example, when one of the top two topics is 2), and turn the set back into a list.
3. Build the fixed partial function again, this time without the `topics` argument, since each company now gets its own range.
4. Collect the output in the same form as before.
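
Steps 1 and 2 can be sketched as follows. The helper name `refine_range` is hypothetical, and sorting the final list is an extra convenience that the steps above do not require.

```python
import numpy as np


def refine_range(top_two, num=10):
    """Build a finer grid of integer topic counts between the two best candidates."""
    low, high = sorted(top_two)                     # step 1: sort the pair
    grid = np.linspace(low, high, num).astype(int)  # step 2: space, then truncate to int
    return sorted(set(grid.tolist()))               # drop duplicates, back to a list


wide = refine_range((50, 10))
narrow = refine_range((2, 10))
```

For the pair `(50, 10)` this yields ten distinct values, while for `(2, 10)` truncation maps the first two grid points to 2, so the set removes one of them and nine values remain.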

In [None]:
%%time


partial_func = partial(kt.get_models, vrizer_dicts=vectorizers_dicts, unique_ids=unique_ids)

with cf.ProcessPoolExecutor() as e:
    output2 = list(e.map(partial_func, comps, tops))

Again, build one dataframe per dictionary and concatenate them into a single dataframe.

In [None]:
output_df2 = kt.build_dataframe(output2)
output_df2.head()

Search for the best topic based on the new output, and get the top 10 words per topic. At the moment, you are only adding 1 of the topics for each company but you can change this by removing the indexing in `top_topics` below.

In [None]:
%%time

best_topics = kt.absolute_topics(output_df2, 'company', 'coherence', 
                                 'topics', 'models', vectorizers_dicts.values())

In [None]:
best_topics

Check out your output. Get the probabilities dataframes for each company and add them to a dictionary.

In [None]:
# Generate a matrix summarizing the distribution of docs (reviews) over topics
docs_of_probas = defaultdict(pd.DataFrame)

for tup in vectorizers_dicts.values():
    docs_of_probas[tup[0]] = pd.DataFrame(best_topics[tup[0]][1].transform(tup[1]))
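
For reference, this is what a document-topic matrix looks like for a single scikit-learn LDA model. The sketch uses three made-up reviews and two topics; none of it comes from the dataset above.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus.
docs = ["great pay great benefits", "flexible hours", "great team flexible hours"]
dtm = CountVectorizer().fit_transform(docs)

# Fit a small model; random_state makes the run repeatable.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# transform() returns one row per document and one column per topic.
doc_topics = pd.DataFrame(
    lda.transform(dtm),
    columns=[f"topic_{i}" for i in range(2)],
)
row_sums = doc_topics.sum(axis=1)
```

Each row is a probability distribution over topics, so the rows sum to 1. That property is what makes entropy-style measures on these matrices meaningful.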

## Calculate Measures of Interest

In [None]:
%%time

comP_h_results = defaultdict(float)
comT_h_results = defaultdict(float)
entropy_avg_results = defaultdict(float)
cross_entropy_results = defaultdict(float)

for company, proba_df in docs_of_probas.items():
    comP_h_results[company] = kt.comph(proba_df.values)
    comT_h_results[company] = kt.conth(proba_df)
    entropy_avg_results[company] = kt.ent_avg(proba_df.values)
    cross_entropy_results[company] = kt.avg_crossEnt(proba_df.values)
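
The definitions of the `kt` measures are not shown in this notebook. As a rough illustration of the kind of quantity an average entropy over documents could be, the sketch below computes the Shannon entropy of each row of a document-topic matrix and averages it. This is an assumption about the general idea, not the code behind `kt.ent_avg`.

```python
import numpy as np


def row_entropy(probas):
    """Shannon entropy (in nats) of each row of a document-topic matrix."""
    p = np.asarray(probas, dtype=float)
    # Treat 0 * log(0) as 0 by substituting 1 (log 1 = 0) wherever p is zero.
    safe = np.where(p > 0, p, 1.0)
    return -(p * np.log(safe)).sum(axis=1)


# Hypothetical matrix: one evenly split document, one fully concentrated document.
toy = np.array([[0.5, 0.5], [1.0, 0.0]])
entropies = row_entropy(toy)
average_entropy = entropies.mean()
```

A document spread evenly over topics has high entropy, and a document concentrated on a single topic has entropy 0.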

In [None]:
# Turn each {company: value} dictionary into a two-column dataframe (columns 0 and 1)
comph_df = pd.DataFrame.from_dict(comP_h_results.items())
conth_df = pd.DataFrame.from_dict(comT_h_results.items())
crossEnt_df = pd.DataFrame.from_dict(cross_entropy_results.items())

# Join the measures on the company column (column 0)
cultureMetrics = comph_df.merge(conth_df, how='inner', on=0)
cultureMetrics = cultureMetrics.merge(crossEnt_df, how='inner', on=0)
cultureMetrics.columns = ['employerID', 'comph', 'conth', 'avgCrossEnt']
cultureMetrics.head()

In [None]:
df_best_topics = pd.DataFrame.from_records(best_topics).T.reset_index()
df_best_topics.columns = ['employerID', 'best_topic', 'model', 'coherence']
df_best_topics.head()