# 04 Topic Modeling

> "Language shapes the way we think, and determines what we can think about." ~ Benjamin Lee Whorf

![word_cloud](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fspecials-images.forbesimg.com%2Fimageserve%2F491732087%2F960x0.jpg%3Ffit%3Dscale&f=1&nofb=1)

## Table of Contents

1. What is Natural Language Processing and What is Topic Modeling?
2. Key Concepts
3. What is Latent Dirichlet Allocation?
4. Analysis
5. Automated Topic Search

## 1. What is Natural Language Processing and What is Topic Modeling?

**Natural Language Processing**

> "Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can." ~ [IBM](https://www.ibm.com/cloud/learn/natural-language-processing)

**Topic Modeling**

> "In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both." ~ [Wikipedia](https://en.wikipedia.org/wiki/Topic_model)

## 2. Key Concepts

Here is a non-exhaustive list of concepts that you should understand, and be familiar with, from the field of Natural Language Processing.

- **Corpus** - Collection of documents filled with words, sentences, paragraphs, numbers, punctuation, etc. For example, a collection of letters is a corpus.
- **Corpora** - The plural of corpus. For example, the collection of job descriptions for one company would be a corpus, and the job-description collections of several companies together would be corpora.
- **Token** - Element inside a piece of text. This may be a word, a number, a space, any kind of punctuation, etc.
- **Tokenization** - Separating strings (i.e. text) into their smallest components, or tokens.
- **Document** - A block of text of varying size. For example, a document might be a tweet, a menu, a book, a review, etc.
- **Bag of Words** - A numerical representation of textual information that a statistical model can understand, process, and make inferences from. In a bag of words, the rows represent the documents in your corpus and the columns represent all of the unique words from all of your documents.
- **Topic** - A representation of similar information based on words and sometimes context as well.
- **Stop Words** - The most common words used in a language. These words appear so often that in many applications of NLP they get removed before the modeling stage.
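
To make a few of these concepts concrete, here is a minimal sketch of a corpus, its tokens, and the resulting bag of words, written in plain Python. The three documents are made up for illustration, and splitting on whitespace is a simplifying assumption rather than how a real tokenizer works.

```python
from collections import Counter

# Toy corpus: each string is one document (assumed data, for illustration only).
corpus = [
    "the dog chased the bone",
    "the cat said meow",
    "the dog and the cat",
]

# Tokenization: here we simply split on whitespace.
tokenized = [doc.split() for doc in corpus]

# Vocabulary: every unique token in the corpus, in sorted order.
vocabulary = sorted({token for doc in tokenized for token in doc})

# Bag of words: one row per document, one column per vocabulary word.
bag_of_words = []
for doc in tokenized:
    counts = Counter(doc)
    bag_of_words.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bag_of_words:
    print(row)
```

Each row has as many entries as there are unique words, which is why a bag of words grows wide very quickly on real data.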

## 3. What is Latent Dirichlet Allocation?

> "Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document." ~ [David M. Blei, Andrew Y. Ng and Michael I. Jordan (2003)](https://jmlr.org/papers/volume3/blei03a/blei03a.pdf)

**Assumptions**
- There is some sort of structure in these documents, and LDA will try to collapse or separate this structure among your pre-defined set of topics.
- Each topic comes from, and can be represented as, a distribution of words or term frequencies.
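
One way to build intuition for these assumptions is to run the generative story forward: pick a topic according to the document's topic mixture, then pick a word according to that topic's word distribution. The sketch below does this with made-up topics and probabilities. It fixes the document's mixture by hand, whereas LDA treats that mixture as a random draw from a Dirichlet distribution, so take it as an illustration and not as the model itself.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Two hypothetical topics, each a probability distribution over words.
topics = {
    "pets": {"dog": 0.5, "cat": 0.4, "bone": 0.1},
    "work": {"salary": 0.5, "manager": 0.3, "team": 0.2},
}

def generate_document(topic_mixture, n_words):
    """Draw words by first picking a topic, then a word from that topic."""
    words = []
    topic_names = list(topic_mixture)
    topic_probs = list(topic_mixture.values())
    for _ in range(n_words):
        topic = random.choices(topic_names, weights=topic_probs)[0]
        vocab = list(topics[topic])
        word_probs = list(topics[topic].values())
        words.append(random.choices(vocab, weights=word_probs)[0])
    return words

# A document that is 80% about pets and 20% about work.
document = generate_document({"pets": 0.8, "work": 0.2}, n_words=10)
print(document)
```

Fitting LDA goes in the opposite direction: given only the words, it estimates the topics and the mixtures that could plausibly have produced them.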

## 4. Analysis

We will first look at how topic modeling is done with one company and with some base functions, and we will then look at the automated way of searching for a topic.

Let's start by importing the packages we will use throughout this session.

In [None]:
import pandas as pd
import numpy as np
import scipy as sp
import spacy # one of the best NLP libraries available in any programming language
from pprint import pprint # the extra p stands for pretty, as in pretty-print

pd.options.display.max_columns = None # this allows us to see all columns displayed after a .head() or .tail() on our dataframes

In [None]:
df = pd.read_csv('data/netflix.csv') # let's read our dataframe
df.head() # show the first 5 rows

Notice how we have quite a few columns but, since we are only interested in the **pros** reviews, let's extract that column out.

In [None]:
pros_reviews = df['pros'].copy() # take the reviews column out of the dataframe
pros_reviews.head()

Because we will be extracting words that will form a topic, we'll need to do some text preprocessing to get rid of whatever is not a word. We will also want to have all letters in lowercase, and we might want to reduce words to their root, if any. To do this, we will use `spacy`, which has an English language model ready to use. **Note**: the English language model gives us a range of functionality on top of English words. The details are not important, but note that we now have a tool that will help us wrangle English text.

In [None]:
nlp = spacy.load('en_core_web_sm') # we first load our English language model

Let's look at a review

In [None]:
one_review = pros_reviews[418]

In [None]:
pprint(one_review)

Notice how the review above is quite messy and it has a lot of characters that, for all intents and purposes, will not be useful for our analysis. Let's examine a cleaner version of the review above by running it through our tokenizer.

In [None]:
parsed_review = nlp(one_review)

In [None]:
parsed_review

Much better and easier to read. Can we examine the sentences as well? You bet we can by using spaCy's many features.

Below we will use a loop to go over the index of each sentence of our single review, plus the sentence.

In [None]:
# the temporary variable num will represent the index
# and the temporary variable sentence will represent each sentence of the review
# enumerate is a built-in Python function

for num, sentence in enumerate(parsed_review.sents):
    print(f"Sentence #{num}:\n {sentence}\n")

Let's look at the named entities found in our single review using the same approach as above.

In [None]:
for num, entity in enumerate(parsed_review.ents):
    print(f"Entity #{num}: {entity} -- {entity.label_}\n")

We will now use additional functionality to showcase more characteristics of our review. We will do so using list comprehensions. Think of these as cousins of loops, with two main differences: the action comes first, and they always return a list.
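
If list comprehensions are new to you, the short sketch below shows the same operation written as a loop and as a comprehension. The word list is made up for illustration.

```python
words = ["Great", "Benefits", "And", "Culture"]

# Loop version: create an empty list, then append inside the loop.
lowercase_loop = []
for word in words:
    lowercase_loop.append(word.lower())

# List comprehension: the action (word.lower()) comes first,
# followed by the loop, and the result is always a list.
lowercase_comp = [word.lower() for word in words]

# An optional condition at the end filters the elements.
long_words = [word.lower() for word in words if len(word) > 3]

print(lowercase_comp)
print(long_words)
```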

In [None]:
# here we are taking out of the parsed review each token
token_text = [token.text for token in parsed_review]

# here we are lemmatizing each word possible
token_lemmas = [token.lemma_ for token in parsed_review]

# stopwords are very common so here we will extract a variable that will tell us whether
# a word is a stopword or not
token_stop = [token.is_stop for token in parsed_review]

# we will now add all three to a dataframe and display it without assigning it to a variable
pd.DataFrame(zip(token_text, token_lemmas, token_stop), columns=['Original Text', 'Lemmatized Text', 'stopwords']).tail(20)

Notice the middle column above, Lemmatized Text. This column represents the root of some of the words in our review. Think about this as reducing words that share a meaning but are spelled with a different conjugation to their lowest common denominator. For example, related and relate, reasons and reason, considered and consider, etc. This step helps us assign words with the same meaning to the same topic, rather than assigning differently spelled variants to different topics.

Let's now define a function that tells us whether a token is punctuation or whitespace.

In [None]:
def puncs_out(token):
    # True when the token is punctuation or whitespace
    return token.is_punct or token.is_space

We will also need to import our stopwords from spaCy to be able to filter them out from our reviews. We don't want `the`, `a`, `so`, etc. influencing our topics.

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS
STOP_WORDS

Lastly, let's create a function that will remove punctuation, spaces, and stop words, and also lemmatize the words in our reviews at the same time.

In [None]:
def lemma_in_stopw_out(doc):
    """
    This function takes in a piece of text, tokenizes it,
    lemmatizes it, removes punctuations and spaces, takes
    all stopwords out, and returns the clean piece of text.
    """
    
    tokens = nlp(doc)
    tokens_lemma = [token.lemma_ for token in tokens if not puncs_out(token)]
    tokens_clean = [token for token in tokens_lemma if token not in STOP_WORDS]
    return ' '.join(tokens_clean)
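
To see the order of operations without loading a language model, here is a toy version of the same pipeline in plain Python. The stop word set and the lemma lookup table are tiny hypothetical stand-ins, not spaCy's actual stop word list or lemmatizer.

```python
import string

# Hypothetical, tiny stand-ins for spaCy's stop word list and lemmatizer.
TOY_STOP_WORDS = {"the", "a", "and", "is", "are", "so"}
TOY_LEMMAS = {"benefits": "benefit", "managers": "manager", "considered": "consider"}

def toy_clean(text):
    """Lowercase, strip punctuation, map words to a root form, drop stop words."""
    tokens = text.lower().split()
    tokens = [token.strip(string.punctuation) for token in tokens]
    tokens = [TOY_LEMMAS.get(token, token) for token in tokens if token]
    tokens = [token for token in tokens if token not in TOY_STOP_WORDS]
    return " ".join(tokens)

print(toy_clean("The benefits are great, and the managers are considered fair!"))
```

As in the function above, the stop words are filtered after the words have been reduced to their root form.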

We will use pandas' convenient `.apply()` method to pass in our function above to each of the reviews we have. But first we will make every word in our reviews lowercase.

In [None]:
ready_revs = pros_reviews.str.lower()
ready_revs.head()

In [None]:
%%time 

# %%time is a cell magic command that shows us how long this cell took to run

ready_revs = ready_revs.apply(lemma_in_stopw_out)
ready_revs.head()

Because we don't want the name of the company whose reviews we are analysing to appear in the topics, we will remove it from our corpus.

In [None]:
# we access the string and create a mask of Trues and Falses showing where the company name appears
netflix_mask = ready_revs.str.contains('netflix')
netflix_mask.head()

In [None]:
# we can filter a dataset by passing in the mask through square brackets []
# notice the index
ready_revs[netflix_mask].head()

In [None]:
# a ~ in front of the mask gives us the opposite results
ready_revs[~netflix_mask].head()

In [None]:
# notice how the word netflix has now disappeared from the reviews
ready_revs[netflix_mask] = ready_revs[netflix_mask].str.replace('netflix', '', regex=False)
ready_revs.head()

Let's examine the differences before and after our preprocessing stage.

In [None]:
print('Original:')
print('-' * 30)
print(nlp(df.loc[148, 'pros']))
print()
print('Processed Text:')
print('-' * 30)
print(ready_revs[148])

Let's now create a bag of words with sklearn's `CountVectorizer()` class. We will remove words that appear in fewer than 3 reviews, as well as those that appear in more than 95% of the reviews.
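
In scikit-learn, `min_df` and `max_df` are thresholds on document frequency, meaning the number (or fraction) of documents a word appears in, not its total count. The sketch below illustrates that filtering idea in plain Python on made-up reviews. The helper names are hypothetical and this is not the library's implementation.

```python
# Toy cleaned reviews (assumed data, for illustration only).
reviews = [
    "great pay great culture",
    "great culture freedom",
    "great pay",
    "great freedom responsibility",
]

def document_frequency(docs):
    """Count in how many documents each word appears (not total occurrences)."""
    df = {}
    for doc in docs:
        for word in set(doc.split()):
            df[word] = df.get(word, 0) + 1
    return df

def filter_vocabulary(docs, min_df, max_df):
    """Keep words found in at least min_df docs and in at most a max_df fraction of docs."""
    n_docs = len(docs)
    df = document_frequency(docs)
    return sorted(
        word for word, count in df.items()
        if count >= min_df and count <= max_df * n_docs
    )

print(filter_vocabulary(reviews, min_df=2, max_df=0.95))
```

Here "great" is dropped because it appears in every review, and "responsibility" is dropped because it appears in only one.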

In [None]:
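# CountVectorizer lives in scikit-learn's text feature extraction module
from sklearn.feature_extraction.text import CountVectorizer

# first we instantiate the vectorizer
vectorizer = CountVectorizer(min_df=3, max_df=0.95)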

In [None]:
# then we fit and transform our clean reviews
bow = vectorizer.fit_transform(ready_revs)
bow

Notice the output of our bag of words. This is called a sparse matrix, and it is an efficient way of holding a large matrix of counts in which most entries are 0.
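
The idea behind a sparse matrix can be sketched in a few lines: instead of storing every cell, store only the positions and values of the nonzero ones. The dictionary below is a simplified illustration with made-up counts. SciPy's sparse matrices use more compact layouts, but the principle is the same.

```python
# A small dense count matrix with mostly zeros (assumed data).
dense = [
    [0, 2, 0, 0, 1],
    [0, 0, 0, 3, 0],
    [1, 0, 0, 0, 0],
]

# Sparse form: store only (row, column) -> value for nonzero entries.
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

total_cells = len(dense) * len(dense[0])
print(sparse)
print(f"stored {len(sparse)} of {total_cells} cells")
```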

In [None]:
# select the number of topics
topics = 50

We will now instantiate our LDA model with the number of topics selected above and then fit the model to our sparse matrix.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(n_components=topics, # number of topics
                                      max_iter=100, # maximum number of passes over the data
                                      learning_method='online', # update the model using mini-batches of documents
                                      random_state=42, # setting a seed for reproducible results
                                      n_jobs=-1) # this parameter makes sure we use all of the cores in our machine

In [None]:
# pass in the bag of words
lda_model.fit(bow)

Awesome, we just ran our first model so let's go ahead and create a function to evaluate the topics we extracted and see if these make sense.

In [None]:
def show_topics(vectorizer, lda_model, n_words=15):
    """
    This function takes our vectorizer, our model, and a
    number of words to display the topics from our model.
    """
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

In [None]:
# let's evaluate the topics
show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20)
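
The core of `show_topics` is ranking the vocabulary by weight within each topic and keeping the heaviest words. The sketch below shows that ranking step on its own in plain Python, with made-up words and weights standing in for one row of `lda_model.components_`.

```python
vocabulary = ["benefit", "culture", "freedom", "pay", "team"]
topic_weights = [0.5, 3.2, 1.1, 2.7, 0.1]  # one row of weights (assumed values)

def top_words(weights, words, n_words):
    """Return the n_words with the largest weights, heaviest first."""
    # Sort the positions by their weight in descending order,
    # which mirrors (-topic_weights).argsort() in the function above.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [words[i] for i in order[:n_words]]

print(top_words(topic_weights, vocabulary, n_words=3))
```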

To finish up we will create a variable with the names of the words in our vocabulary.

In [None]:
# this creates a sorted list with the words (keys) of the vocabulary
terms = sorted(vectorizer.vocabulary_.keys())

In [None]:
# let's now create a dataframe with our bag of words
bow_docs = pd.DataFrame(bow.toarray(), columns=terms)
bow_docs.head()

We can also examine the weight of each word in each topic. The higher the value, the more strongly the word is associated with that topic.

In [None]:
components = pd.DataFrame(lda_model.components_.T, index=terms, columns=['topic_' + str(i) for i in range(topics)])
components.head(20)
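
The values in `components_` are unnormalized weights, so they do not add up to 1 within a topic. Dividing each topic's weights by their sum turns them into a probability distribution over words. The sketch below shows that normalization in plain Python with made-up numbers.

```python
# Rows are topics, columns are words; values mimic unnormalized weights (assumed).
terms = ["culture", "freedom", "pay"]
weights = [
    [6.0, 3.0, 1.0],   # topic_0
    [1.0, 1.0, 8.0],   # topic_1
]

def normalize_rows(matrix):
    """Divide each row by its sum so that every row adds up to 1."""
    return [[value / sum(row) for value in row] for row in matrix]

probabilities = normalize_rows(weights)
for i, row in enumerate(probabilities):
    print(f"topic_{i}", dict(zip(terms, row)))
```

After this step, each row can be read as "given this topic, how likely is each word".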

Now that we know how to get topics given a model, let's automate the search of the best one.

## 5. Automated Topic Search

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import nltk, re, math, csv
# nltk.download('wordnet')
# nltk.download('punkt')
# nltk.download('stopwords') # needed for nltk.corpus.stopwords below

import koolture as kt

from string import punctuation
from functools import partial
import concurrent.futures as cf
from collections import defaultdict

In [None]:
df = pd.read_csv('data/clean_gs.csv')
df.head()

In [None]:
df.shape

In [None]:
our_range = 2, 10, 50, 100, 150, 200, 250, 300

In [None]:
comps_of_interest = df.employer.value_counts()
comps_of_interest.head(8)

In [None]:
comps_of_interest = (comps_of_interest[(comps_of_interest == 48)]).index
len(comps_of_interest), comps_of_interest

In [None]:
cond2 = df['employer'].isin(comps_of_interest) # create the condition
df_interest = df[cond2].copy() # get the new dataset
unique_ids = df_interest['employer'].unique() # get the unique IDs or unique employers in the dataset
unique_ids

In [None]:
reviews_nums = df_interest['employer'].value_counts().reset_index()
reviews_nums.columns = ['employerID', 'reviews_nums']
reviews_nums.head()

## Fix Custom Stopwords List Before Cleaning

The text preprocessing of the corpus takes place in parallel. You first normalize the reviews and then take the root of the words.

In [None]:
data_pros = df_interest['pros'].values
stopwords = nltk.corpus.stopwords.words('english') + [token.lower() for token in unique_ids]
stopwords[-10:]

In [None]:
normalize_doc = partial(kt.normalize_doc, stopwords=stopwords)

In [None]:
%%time

with cf.ProcessPoolExecutor() as e:
    data_pros_cleaned = e.map(normalize_doc, data_pros)
    data_pros_cleaned = list(e.map(kt.root_of_word, data_pros_cleaned))

df_interest['pros_clean'] = data_pros_cleaned
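
The pattern used above has two parts: `partial` fixes the extra arguments of a function so that only the document is left to supply, and the executor's `map` applies that function to every document. The sketch below shows the pattern with a hypothetical cleaning function. It uses a thread pool only so that it runs anywhere without extra setup, whereas the cell above uses processes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def remove_words(text, stopwords):
    """Hypothetical cleaner: drop every word that is in stopwords."""
    return " ".join(word for word in text.split() if word not in stopwords)

docs = ["the pay is great", "the culture is open"]

# partial() fixes the stopwords argument, leaving a one-argument function
# that executor.map can apply to each document.
clean = partial(remove_words, stopwords={"the", "is"})

# Threads are used here only so the sketch runs anywhere;
# the notebook itself uses ProcessPoolExecutor.
with ThreadPoolExecutor() as executor:
    cleaned = list(executor.map(clean, docs))

print(cleaned)
```

`map` returns results in the same order as the inputs, which is what lets the cleaned reviews be assigned straight back to the dataframe column.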

## Create Vectorizers Container

In [None]:
%%time

vectorizers_dicts = kt.get_vectorizers(data=df_interest, unique_ids=unique_ids,
                                       company_col='employer', reviews_col='pros', 
                                       vrizer=CountVectorizer())

The following block runs the models in parallel over the available companies, using the numbers of topics specified in the `our_range` variable, and returns a dictionary with the output of the `get_models` function for each company. It is used to identify the interval in which to search further for the optimal number of topics.

In [None]:
%%time

partial_func = partial(kt.get_models, topics=our_range, vrizer_dicts=vectorizers_dicts, unique_ids=unique_ids)

with cf.ProcessPoolExecutor() as e:
    output = list(e.map(partial_func, unique_ids))

The next function iterates over the dictionary output from above, adds each dataset to a list, and then concatenates them all into one dataset. The resulting `output_df` contains exactly the same information in a more readable form, and it is used in the next blocks.

In [None]:
output_df = kt.build_dataframe(output)
output_df.head()

The following loop iterates over the new dataframe, searches for the top two topic numbers based on highest coherence, and appends to a list a tuple containing the company, a tuple with the top two topic numbers, and the fitted vectorizer from the original `vectorizers_dicts`.
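
Since the `koolture` helper is not shown here, the following is only a sketch of the selection idea, not the library's implementation: group the results by company and keep the two topic counts with the highest coherence. The company names and scores are made up.

```python
# Hypothetical results: (company, number of topics, coherence score).
results = [
    ("acme", 2, 0.31), ("acme", 10, 0.42), ("acme", 50, 0.47), ("acme", 100, 0.40),
    ("globex", 2, 0.35), ("globex", 10, 0.33), ("globex", 50, 0.29), ("globex", 100, 0.38),
]

def top_two_by_coherence(rows):
    """Map each company to the two topic counts with the highest coherence."""
    by_company = {}
    for company, n_topics, coherence in rows:
        by_company.setdefault(company, []).append((coherence, n_topics))
    return {
        company: tuple(n for _, n in sorted(scores, reverse=True)[:2])
        for company, scores in by_company.items()
    }

print(top_two_by_coherence(results))
```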

In [None]:
%%time

topics_sorted, comps, tops = kt.top_two_topics(data=output_df, companies_var='company',
                               coherence_var='coherence', topics_var='topics',
                               unique_ids=unique_ids, vrizers_list=vectorizers_dicts.values())

Now run the `get_models` function again over the new space of topics. You will need to:
1. Sort the tuple with the top two topics.
2. Create a linearly spaced array with 10 elements between the top two topics, turn it into integers, make the array a set to eliminate any duplicates (which can arise, for example, if one of the top two topics is 2), and then turn that into a list.
3. Get your fixed partial function again.
4. The output is the same as before.
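
Steps 1 and 2 can be sketched in plain Python as follows. The function name is hypothetical, and truncating to integers with `int` is an assumption about how the values are rounded.

```python
def refined_topic_range(top_two, n_points=10):
    """Integer topic counts evenly spaced between the two best candidates."""
    low, high = sorted(top_two)  # step 1: sort the pair
    # step 2: n_points linearly spaced values from low to high (inclusive)
    spaced = [low + (high - low) * i / (n_points - 1) for i in range(n_points)]
    # truncate to integers, drop duplicates with a set, and return a sorted list
    return sorted({int(value) for value in spaced})

print(refined_topic_range((50, 10)))
print(refined_topic_range((2, 10)))
```

When the two candidates are close together, as in the second call, several spaced values collapse to the same integer, which is why the set is needed.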

In [None]:
%%time


partial_func = partial(kt.get_models, vrizer_dicts=vectorizers_dicts, unique_ids=unique_ids)

with cf.ProcessPoolExecutor() as e:
    output2 = list(e.map(partial_func, comps, tops))

Create multiple dataframes from the dictionaries again and collapse them into one.

In [None]:
output_df2 = kt.build_dataframe(output2)
output_df2.head()

Search for the best topic based on the new output, and get the top 10 words per topic. At the moment, you are only adding 1 of the topics for each company but you can change this by removing the indexing in `top_topics` below.

In [None]:
%%time

best_topics = kt.absolute_topics(output_df2, 'company', 'coherence', 
                                 'topics', 'models', vectorizers_dicts.values())

In [None]:
best_topics

Check out your output. Get the probabilities dataframes for each company and add them to a dictionary.

In [None]:
# generate a matrix summarizing the distribution of docs (reviews) over topics
docs_of_probas = defaultdict(pd.DataFrame)

for tup in vectorizers_dicts.values():
    docs_of_probas[tup[0]] = pd.DataFrame(best_topics[tup[0]][1].transform(tup[1]))

## Calculate Measures of Interest

In [None]:
%%time

comP_h_results = defaultdict(float)
comT_h_results = defaultdict(float)
entropy_avg_results = defaultdict(float)
cross_entropy_results = defaultdict(float)

for company, proba_df in docs_of_probas.items():
    comP_h_results[company] = kt.comph(proba_df.values)
    comT_h_results[company] = kt.conth(proba_df)
    entropy_avg_results[company] = kt.ent_avg(proba_df.values)
    cross_entropy_results[company] = kt.avg_crossEnt(proba_df.values)
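
The `koolture` functions are not shown in this notebook, so their exact definitions are not assumed here. As a rough illustration of what an entropy-based measure over document-topic probabilities looks like, the sketch below computes the Shannon entropy of each document's topic distribution and averages it. The function names and the probabilities are hypothetical.

```python
import math

def shannon_entropy(probabilities):
    """Entropy in nats of one probability distribution; zero terms are skipped."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def average_entropy(doc_topic_rows):
    """Mean entropy across documents (rows of topic probabilities)."""
    return sum(shannon_entropy(row) for row in doc_topic_rows) / len(doc_topic_rows)

# Hypothetical document-topic probabilities: each row sums to 1.
doc_topics = [
    [1.0, 0.0, 0.0],      # entirely one topic: entropy 0
    [1/3, 1/3, 1/3],      # evenly spread over three topics: entropy log(3)
]

print(average_entropy(doc_topics))
```

Low entropy means a review concentrates on a single topic, while high entropy means it is spread across many.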

In [None]:
comph_df = pd.DataFrame.from_dict(comP_h_results.items())
conth_df = pd.DataFrame.from_dict(comT_h_results.items())
crossEnt_df = pd.DataFrame.from_dict(cross_entropy_results.items())
cultureMetrics = comph_df.merge(conth_df, how = 'inner', right_on = 0, left_on = 0)
cultureMetrics = cultureMetrics.merge(crossEnt_df, how = 'inner', right_on = 0, left_on = 0)
cultureMetrics.columns = ['employerID', 'comph', 'conth', 'avgCrossEnt']
cultureMetrics.head()

In [None]:
df_best_topics = pd.DataFrame.from_records(best_topics).T.reset_index()
df_best_topics.columns = ['employerID', 'best_topic', 'model', 'coherence']
df_best_topics.head()