# Topic Modeling

## Introduction

Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

In this notebook, we will be covering the steps on how to do **Latent Dirichlet Allocation (LDA)**, which is one of many topic modeling techniques. It was specifically designed for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.
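
As a rough illustration of the first input, the sketch below builds a tiny document-term matrix by hand using only the standard library. The three toy documents and all variable names are made up for this example; the notebook itself loads a pickled matrix that was built earlier.

```python
from collections import Counter

# Toy corpus (hypothetical, not the comedy transcripts)
docs = {
    "doc_a": "dogs chase cats",
    "doc_b": "cats chase mice and cats nap",
    "doc_c": "mice nap",
}

# Vocabulary: every distinct term, sorted so the column order is stable
vocab = sorted({word for text in docs.values() for word in text.split()})

# One row per document, one column per term, each cell a raw count
dtm = {}
for name, text in docs.items():
    counts = Counter(text.split())
    dtm[name] = [counts[term] for term in vocab]

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

The second input, the number of topics, is just an integer you choose and then revise after looking at the results.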

Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic makes sense. If it doesn't, you can try changing the number of topics, the terms in the document-term matrix, or the model parameters, or even try a different model.

## Topic Modeling - Attempt #1 (All Text)

In [19]:
# Let's read in our document-term matrix
import pandas as pd
import pickle
import nltk
from nltk import word_tokenize, pos_tag

data = pd.read_pickle('dtm_stop.pkl')
data

Unnamed: 0,aaaaah,aaah,aah,aaras,abandon,abandoned,abbott,abby,abc,abcs,...,zip,zippers,zombie,zombies,zones,zoning,zoo,zoom,zoomed,zucker
ali,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
anthony,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bill,1,0,0,0,0,0,0,0,0,1,...,0,0,1,1,0,1,0,0,0,0
chris,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
dave,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
deon,0,0,0,0,0,1,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
fortune,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
gabriel,0,1,0,1,0,0,0,0,1,0,...,0,0,0,0,1,0,0,4,0,0
iliza,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
jim,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [21]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,ali,anthony,bill,chris,dave,deon,fortune,gabriel,iliza,jim,joe,john,louis,mike,moses,patton,ricky,tom,trevor
aaaaah,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
aaah,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
aah,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0
aaras,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
abandon,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [22]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
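
For intuition about what this conversion produces: a gensim corpus is a stream of documents, each represented as a list of (term_id, count) pairs for the terms that occur in it. The sketch below mimics that layout in plain Python on a made-up matrix of 3 terms by 2 documents. It is not gensim's implementation, and the helper name `columns_to_bow` is hypothetical.

```python
# Toy term-document matrix: rows are terms, columns are documents (made-up counts)
tdm_toy = [
    [2, 0],  # term 0
    [0, 1],  # term 1
    [1, 3],  # term 2
]

def columns_to_bow(matrix):
    """Turn each column (document) into a list of (term_id, count) pairs, skipping zeros."""
    n_docs = len(matrix[0])
    return [
        [(term_id, row[doc]) for term_id, row in enumerate(matrix) if row[doc] != 0]
        for doc in range(n_docs)
    ]

bow = columns_to_bow(tdm_toy)
print(bow)
```

This is also why the matrix is transposed first: by default `Sparse2Corpus` reads each column as one document.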

In [23]:
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
with open("cv_stop.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
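
The inversion is easier to see on a small case. A minimal sketch, assuming a made-up vocabulary in the same term-to-index shape as `cv.vocabulary_`:

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_: term -> column index
vocabulary = {"zoo": 2, "abc": 0, "mom": 1}

# Invert it so the term can be looked up from its column index
id2word_toy = {index: term for term, index in vocabulary.items()}

print(id2word_toy[0], id2word_toy[1], id2word_toy[2])
```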

Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term), we need to specify two other parameters - the number of topics and the number of passes. Let's start the number of topics at 2, see if the results make sense, and increase the number from there.

In [24]:
# Fit LDA with 2 topics and 10 passes over the corpus, then print the top words per topic
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

[(0,
  '0.007*"man" + 0.006*"women" + 0.005*"theres" + 0.004*"come" + 0.004*"thing" + 0.004*"didnt" + 0.004*"day" + 0.004*"life" + 0.004*"okay" + 0.004*"little"'),
 (1,
  '0.006*"goes" + 0.004*"little" + 0.004*"okay" + 0.004*"look" + 0.004*"didnt" + 0.004*"day" + 0.004*"way" + 0.004*"tell" + 0.004*"thing" + 0.004*"make"')]

In [25]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.008*"man" + 0.004*"didnt" + 0.004*"aint" + 0.004*"black" + 0.004*"little" + 0.004*"make" + 0.004*"kids" + 0.004*"way" + 0.004*"women" + 0.004*"look"'),
 (1,
  '0.006*"goes" + 0.005*"theres" + 0.005*"guy" + 0.005*"okay" + 0.005*"look" + 0.004*"come" + 0.004*"life" + 0.004*"didnt" + 0.004*"day" + 0.004*"shes"'),
 (2,
  '0.005*"thing" + 0.005*"went" + 0.005*"day" + 0.005*"theres" + 0.005*"little" + 0.005*"come" + 0.004*"didnt" + 0.004*"man" + 0.004*"ive" + 0.004*"okay"')]

In [26]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.006*"okay" + 0.005*"little" + 0.005*"way" + 0.005*"look" + 0.004*"thing" + 0.004*"tell" + 0.004*"shes" + 0.004*"thank" + 0.004*"didnt" + 0.004*"day"'),
 (1,
  '0.008*"man" + 0.007*"women" + 0.007*"day" + 0.006*"aint" + 0.005*"gotta" + 0.005*"went" + 0.005*"come" + 0.005*"tell" + 0.005*"thing" + 0.004*"motherfucker"'),
 (2,
  '0.006*"goes" + 0.006*"theres" + 0.005*"life" + 0.005*"look" + 0.004*"didnt" + 0.004*"man" + 0.004*"thing" + 0.004*"make" + 0.004*"guy" + 0.004*"little"'),
 (3,
  '0.006*"okay" + 0.005*"didnt" + 0.004*"women" + 0.004*"theres" + 0.004*"way" + 0.004*"look" + 0.004*"shes" + 0.004*"goes" + 0.004*"says" + 0.004*"guy"')]

These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as well.
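
One way to put a number on "not too great" is to check how much the top words of different topics overlap. The sketch below does this for the two-topic model, using the top-10 words copied from its output above. The Jaccard measure is my choice for illustration, not something the notebook itself uses.

```python
# Top-10 words copied from the two-topic output above
topic_0 = {"man", "women", "theres", "come", "thing", "didnt", "day", "life", "okay", "little"}
topic_1 = {"goes", "little", "okay", "look", "didnt", "day", "way", "tell", "thing", "make"}

# Words that appear in both topics, and their share of all distinct top words
shared = topic_0 & topic_1
jaccard = len(shared) / len(topic_0 | topic_1)

print(sorted(shared))
print(round(jaccard, 2))
```

Half of each topic's top words are shared with the other topic, which supports the impression that the two topics are hard to tell apart.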

## Topic Modeling - Attempt #2 (Nouns Only)

One popular trick is to look only at terms from a single part of speech (only nouns, only adjectives, etc.). See the Penn Treebank tag set for the full list of tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [27]:
# Let's create a function to pull out nouns from a string of text
def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)
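
To see what the `'NN'` prefix test matches without downloading any NLTK data, here is a small sketch on tokens tagged by hand. The tags follow the Penn Treebank set, but this sentence and its tags are written manually, not produced by `pos_tag`.

```python
# Hand-tagged tokens in the (word, tag) shape that pos_tag returns
tagged = [
    ("the", "DT"), ("comedians", "NNS"), ("told", "VBD"),
    ("a", "DT"), ("funny", "JJ"), ("joke", "NN"),
    ("in", "IN"), ("cleveland", "NNP"),
]

# NN, NNS, NNP and NNPS all start with "NN", so one prefix test covers every noun tag
nouns_only = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns_only)
```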

In [28]:
# Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,transcript
ali,ladies and gentlemen please welcome the stage ...
anthony,thank you thank you thank you san francisco th...
bill,all right thank you thank you very much thank...
chris,ladies and gentlemen live from the worldfamous...
dave,this dave tells dirty jokes for living that st...
deon,this water good dont know why was thirsty but...
fortune,please welcome fortune feimster powerful woma...
gabriel,can you please state your name martin moreno ...
iliza,cleveland ohio thank you thank you much this ...
jim,ladies and gentlemen please welcome the stage...


In [29]:
# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))
data_nouns

Unnamed: 0,transcript
ali,ladies gentlemen stage ali thank hello na shit...
anthony,thank thank people told tape francisco city wo...
bill,thank thank pleasure georgia area oasis nice n...
chris,ladies gentlemen apollo theater york chris roc...
dave,dave jokes living work signifies train alchemi...
deon,water dont thirsty see gon cole seminar cole c...
fortune,fortune woman dont way cause woman dont way ca...
gabriel,state name martin moreno gabriel iglesias year...
iliza,cleveland home likes thats history people free...
jim,ladies gentlemen stage jim jefferies thank tha...


In [30]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
# Recent scikit-learn versions expect stop_words to be a list, so convert the frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.transcript)
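# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())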
data_dtmn.index = data_nouns.index
data_dtmn



Unnamed: 0,aaaaah,aaah,aah,abbott,abby,abc,abcs,ability,abortion,abortions,...,zeppelin,zhoosh,zillion,zip,zippers,zombie,zombies,zones,zoo,zoom
ali,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
anthony,0,0,0,0,0,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
bill,1,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,1,1,0,0,0
chris,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
dave,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
deon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
fortune,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
gabriel,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,4
iliza,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
jim,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [32]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.010*"thing" + 0.009*"day" + 0.009*"cause" + 0.008*"hes" + 0.007*"way" + 0.007*"life" + 0.007*"guy" + 0.007*"man" + 0.006*"shit" + 0.006*"theyre"'),
 (1,
  '0.013*"man" + 0.011*"shit" + 0.010*"women" + 0.009*"fuck" + 0.006*"motherfucker" + 0.006*"way" + 0.006*"woman" + 0.006*"hes" + 0.005*"gon" + 0.005*"day"')]

In [33]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.010*"man" + 0.008*"cause" + 0.008*"shit" + 0.007*"day" + 0.007*"way" + 0.007*"hes" + 0.006*"gon" + 0.006*"thing" + 0.006*"women" + 0.006*"fuck"'),
 (1,
  '0.010*"man" + 0.010*"thing" + 0.009*"day" + 0.009*"shit" + 0.008*"hes" + 0.008*"life" + 0.007*"guy" + 0.007*"cause" + 0.007*"women" + 0.007*"way"'),
 (2,
  '0.006*"car" + 0.006*"way" + 0.005*"mom" + 0.005*"lot" + 0.004*"pool" + 0.004*"camp" + 0.004*"food" + 0.004*"ice" + 0.004*"cream" + 0.004*"kids"')]

In [34]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.019*"shit" + 0.019*"man" + 0.011*"fuck" + 0.010*"women" + 0.009*"gon" + 0.008*"kids" + 0.007*"motherfucker" + 0.007*"cause" + 0.007*"life" + 0.007*"everybody"'),
 (1,
  '0.011*"thing" + 0.010*"life" + 0.009*"cause" + 0.008*"day" + 0.007*"hes" + 0.006*"guy" + 0.006*"way" + 0.006*"things" + 0.005*"lot" + 0.005*"kids"'),
 (2,
  '0.010*"day" + 0.009*"cause" + 0.009*"thing" + 0.009*"way" + 0.008*"hes" + 0.006*"shes" + 0.006*"things" + 0.006*"house" + 0.005*"man" + 0.005*"guy"'),
 (3,
  '0.008*"hes" + 0.007*"man" + 0.007*"fuck" + 0.007*"way" + 0.007*"day" + 0.007*"women" + 0.007*"shit" + 0.006*"guy" + 0.006*"thing" + 0.006*"years"')]

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [17]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)

In [18]:
# Apply the nouns_adj function to the transcripts to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,transcript
ali,ladies gentlemen stage ali wong hello welcome ...
anthony,thank san francisco thank good people told tap...
bill,right thank thank pleasure greater atlanta geo...
chris,ladies gentlemen worldfamous apollo theater ne...
dave,dave dirty jokes living hard work signifies tr...
deon,water good dont thirsty comfortable real good ...
fortune,welcome fortune powerful woman dont way cause ...
gabriel,state name martin moreno ive gabriel iglesias ...
iliza,cleveland great nice home ten likes thats real...
jim,ladies gentlemen stage jim jefferies thank tha...


In [19]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna



Unnamed: 0,aaaaah,aaah,aah,abandoned,abbott,abby,abc,abcs,abercrombie,ability,...,zeppelin,zhoosh,zillion,zip,zippers,zombie,zombies,zones,zoo,zoom
ali,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
anthony,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bill,1,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,1,1,0,0,0
chris,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
dave,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
deon,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
fortune,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
gabriel,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,4
iliza,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,2
jim,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
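
The `max_df=.8` setting drops terms that appear in more than 80% of the documents. The sketch below imitates that rule in plain Python on five made-up documents. It reflects my understanding of how a float `max_df` behaves and is not scikit-learn's own code.

```python
# Five made-up documents, each reduced to its set of distinct terms
docs = [
    {"thing", "day", "dog"},
    {"thing", "day", "joke"},
    {"thing", "day", "food"},
    {"thing", "day", "dog"},
    {"thing", "joke"},
]

max_df = 0.8
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = {}
for doc in docs:
    for term in doc:
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Keep terms whose document frequency does not exceed max_df * n_docs
kept = sorted(t for t, df in doc_freq.items() if df <= max_df * n_docs)
dropped = sorted(t for t, df in doc_freq.items() if df > max_df * n_docs)

print(kept)
print(dropped)
```

Only "thing", which occurs in all five documents, is removed; "day" occurs in exactly 80% of them and stays.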


In [20]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [21]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.005*"fuck" + 0.004*"joke" + 0.003*"fucking" + 0.003*"fck" + 0.003*"wrong" + 0.003*"girls" + 0.003*"dude" + 0.002*"different" + 0.002*"men" + 0.002*"everybody"'),
 (1,
  '0.009*"fuck" + 0.005*"motherfucker" + 0.004*"everybody" + 0.004*"aint" + 0.003*"ass" + 0.003*"mom" + 0.003*"bitch" + 0.003*"dog" + 0.003*"men" + 0.003*"different"')]

In [22]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.006*"fuck" + 0.004*"men" + 0.003*"ahah" + 0.003*"jenny" + 0.003*"ass" + 0.003*"friends" + 0.003*"different" + 0.003*"fucking" + 0.003*"president" + 0.003*"everybody"'),
 (1,
  '0.005*"fuck" + 0.004*"mom" + 0.004*"joke" + 0.003*"story" + 0.003*"dude" + 0.003*"fck" + 0.003*"dad" + 0.003*"different" + 0.003*"kid" + 0.003*"everybody"'),
 (2,
  '0.011*"fuck" + 0.008*"motherfucker" + 0.008*"aint" + 0.005*"bitch" + 0.004*"everybody" + 0.004*"ass" + 0.004*"men" + 0.004*"yall" + 0.003*"girls" + 0.003*"ngga"')]

In [23]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.005*"fck" + 0.005*"motherfucker" + 0.004*"president" + 0.004*"mom" + 0.004*"ass" + 0.004*"fuck" + 0.004*"everybody" + 0.004*"different" + 0.004*"bitch" + 0.003*"aint"'),
 (1,
  '0.012*"fuck" + 0.005*"everybody" + 0.005*"fucking" + 0.004*"dude" + 0.004*"aint" + 0.003*"kid" + 0.003*"fuckin" + 0.003*"joke" + 0.003*"dead" + 0.003*"food"'),
 (2,
  '0.008*"fuck" + 0.004*"ahah" + 0.004*"joke" + 0.003*"wrong" + 0.003*"men" + 0.003*"mad" + 0.003*"anthony" + 0.003*"girls" + 0.003*"jokes" + 0.003*"friend"'),
 (3,
  '0.003*"jenny" + 0.003*"story" + 0.003*"mom" + 0.003*"fuck" + 0.003*"special" + 0.003*"ass" + 0.003*"girls" + 0.003*"comedy" + 0.003*"security" + 0.003*"guns"')]

## Identify Topics in Each Document

Of the nine topic models we looked at, the four-topic model built on nouns and adjectives made the most sense. Let's refit it here with more passes to get more fine-tuned topics.

In [32]:
# Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=200)
ldana.print_topics()

[(0,
  '0.016*"motherfucker" + 0.011*"fuck" + 0.010*"bitch" + 0.009*"ngga" + 0.008*"aint" + 0.007*"ass" + 0.006*"yall" + 0.005*"older" + 0.005*"young" + 0.005*"motherfuckers"'),
 (1,
  '0.011*"fuck" + 0.006*"everybody" + 0.005*"dude" + 0.004*"fucking" + 0.004*"men" + 0.004*"fck" + 0.004*"ass" + 0.003*"kid" + 0.003*"different" + 0.003*"dick"'),
 (2,
  '0.007*"mom" + 0.004*"grandma" + 0.004*"parents" + 0.004*"joke" + 0.003*"wife" + 0.003*"clinton" + 0.003*"jax" + 0.003*"family" + 0.003*"anthony" + 0.003*"dad"'),
 (3,
  '0.004*"joke" + 0.003*"fuck" + 0.003*"jenny" + 0.003*"dogs" + 0.003*"story" + 0.003*"girls" + 0.003*"fact" + 0.003*"wrong" + 0.002*"different" + 0.002*"idea"')]

These four topics look pretty decent. Let's settle on these for now.
* Topic 0: profanity
* Topic 1: men
* Topic 2: family
* Topic 3: women
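
When labeling topics by hand, it can help to pull the words out of the `print_topics()` strings. A minimal sketch, using the Topic 2 string copied from the output above and a regular expression of my own (gensim also offers other ways to inspect topics):

```python
import re

# Topic 2 string copied from the output above
topic_2 = ('0.007*"mom" + 0.004*"grandma" + 0.004*"parents" + 0.004*"joke" + 0.003*"wife" + '
           '0.003*"clinton" + 0.003*"jax" + 0.003*"family" + 0.003*"anthony" + 0.003*"dad"')

# Each entry has the form weight*"word"
pairs = [(word, float(weight)) for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic_2)]
words = [word for word, _ in pairs]

print(words)
```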

In [33]:
# Let's take a look at which topics each transcript contains
corpus_transformed = ldana[corpusna]
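# Each document comes back as a list of (topic, probability) pairs, so keep the most probable topic
list(zip([max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed], data_dtmna.index))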

[(1, 'ali'),
 (2, 'anthony'),
 (1, 'bill'),
 (1, 'chris'),
 (1, 'dave'),
 (0, 'deon'),
 (2, 'fortune'),
 (3, 'gabriel'),
 (3, 'iliza'),
 (1, 'jim'),
 (1, 'joe'),
 (2, 'john'),
 (1, 'louis'),
 (3, 'mike'),
 (2, 'moses'),
 (3, 'patton'),
 (3, 'ricky'),
 (1, 'tom'),
 (1, 'trevor')]

For a first pass of LDA, these kind of make sense to me, so we'll call it a day for now.
* Topic 0: profanity [Deon]
* Topic 1: men [Ali, Bill, Chris, Dave, Jim, Joe, Louis, Tom, Trevor]
* Topic 2: family [Anthony, Fortune, John, Moses]
* Topic 3: women [Gabriel, Iliza, Mike, Patton, Ricky]
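
The grouping above can be produced from the assignment list rather than by hand. A small sketch, with the (topic, comedian) pairs and the topic labels copied from this section:

```python
from collections import defaultdict

# (topic_id, comedian) pairs copied from the output above
assignments = [
    (1, 'ali'), (2, 'anthony'), (1, 'bill'), (1, 'chris'), (1, 'dave'),
    (0, 'deon'), (2, 'fortune'), (3, 'gabriel'), (3, 'iliza'), (1, 'jim'),
    (1, 'joe'), (2, 'john'), (1, 'louis'), (3, 'mike'), (2, 'moses'),
    (3, 'patton'), (3, 'ricky'), (1, 'tom'), (1, 'trevor'),
]
labels = {0: 'profanity', 1: 'men', 2: 'family', 3: 'women'}

# Collect the comedians under the label of their assigned topic
by_topic = defaultdict(list)
for topic_id, name in assignments:
    by_topic[labels[topic_id]].append(name.capitalize())

for label in labels.values():
    print(f"{label}: {by_topic[label]}")
```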

### Assignment:
1. Try further modifying the parameters of the topic models above and see if you can get better topics.
2. Create a new topic model that includes terms from a different [part of speech](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and see if you can get better topics.

## Topic Modeling - Nouns and Adverbs

In [55]:
# Let's create a function to pull out nouns and adverbs from a string of text
def nouns_adverb(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adverbs.'''
    is_noun_adverb = lambda pos: pos[:2] == 'NN' or pos[:2] == 'RB'
    tokenized = word_tokenize(text)
    nouns_adverb = [word for (word, pos) in pos_tag(tokenized) if is_noun_adverb(pos)] 
    return ' '.join(nouns_adverb)

In [56]:
# Apply the function to the transcripts to filter only on nouns and adverbs
data_nouns_adverb = pd.DataFrame(data_clean.transcript.apply(nouns_adverb))
data_nouns_adverb

Unnamed: 0,transcript
ali,ladies gentlemen stage ali thank hello na shit...
anthony,thank thank much here people told tape francis...
bill,thank very much thank pleasure here georgia ar...
chris,ladies gentlemen apollo theater york chris roc...
dave,dave jokes living most work signifies train al...
deon,water dont thirsty anyway now here see gon col...
fortune,fortune feimster woman always dont way now not...
gabriel,state name martin moreno martinnnnn gabriel ig...
iliza,cleveland much here not home likes thats right...
jim,ladies gentlemen stage jim jefferies thank tha...


In [57]:
# Create a new document-term matrix using only nouns and adverbs, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adverb.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adverb.index
data_dtmna



Unnamed: 0,aaaaah,aaah,aah,abbott,abby,abc,abcs,ability,abortion,abortions,...,zeppelin,zhoosh,zillion,zip,zippers,zombie,zombies,zones,zoo,zoom
ali,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
anthony,0,0,0,0,0,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
bill,1,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,1,1,0,0,0
chris,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
dave,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
deon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
fortune,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
gabriel,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,4
iliza,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
jim,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [58]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [59]:
# Let's try 5 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=5, id2word=id2wordna, passes=300)
ldana.print_topics()

[(0,
  '0.013*"fuck" + 0.006*"men" + 0.005*"ahah" + 0.005*"son" + 0.005*"ass" + 0.004*"guns" + 0.004*"dude" + 0.003*"fucking" + 0.003*"business" + 0.003*"friends"'),
 (1,
  '0.012*"fuck" + 0.008*"motherfucker" + 0.008*"everybody" + 0.007*"aint" + 0.006*"ass" + 0.006*"fck" + 0.006*"dude" + 0.005*"bitch" + 0.005*"clinton" + 0.004*"wife"'),
 (2,
  '0.007*"joke" + 0.005*"nuts" + 0.005*"food" + 0.005*"mom" + 0.004*"pool" + 0.004*"jenner" + 0.004*"camp" + 0.004*"bob" + 0.003*"ice" + 0.003*"cream"'),
 (3,
  '0.005*"fact" + 0.004*"girls" + 0.004*"dogs" + 0.004*"course" + 0.004*"dog" + 0.004*"jax" + 0.003*"story" + 0.003*"idea" + 0.003*"candy" + 0.003*"food"'),
 (4,
  '0.005*"president" + 0.005*"joke" + 0.005*"fuck" + 0.004*"phone" + 0.004*"jenny" + 0.004*"jokes" + 0.003*"family" + 0.003*"morning" + 0.003*"anthony" + 0.003*"pubes"')]

These five topics look pretty decent.
* Topic 0: business (Because of words like business, men)
* Topic 1: profanity (Because of words like fuck, motherfucker, ass, bitch)
* Topic 2: food (Because of words like nuts, food, ice, cream)
* Topic 3: pets (Because of words like dogs, dog, food)
* Topic 4: politics and family (Because of words like president, family)

In [60]:
# Let's take a look at which topics each transcript contains
corpus_transformed = ldana[corpusna]
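# Each document comes back as a list of (topic, probability) pairs, so keep the most probable topic
list(zip([max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed], data_dtmna.index))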

[(0, 'ali'),
 (4, 'anthony'),
 (1, 'bill'),
 (1, 'chris'),
 (0, 'dave'),
 (1, 'deon'),
 (3, 'fortune'),
 (3, 'gabriel'),
 (3, 'iliza'),
 (0, 'jim'),
 (1, 'joe'),
 (1, 'john'),
 (3, 'louis'),
 (4, 'mike'),
 (2, 'moses'),
 (4, 'patton'),
 (2, 'ricky'),
 (0, 'tom'),
 (4, 'trevor')]

* Topic 0: business [Ali, Dave, Jim, Tom]
* Topic 1: profanity [Bill, Chris, Deon, Joe, John]
* Topic 2: food [Moses, Ricky]
* Topic 3: pets [Fortune, Gabriel, Iliza, Louis]
* Topic 4: politics and family [Anthony, Mike, Patton, Trevor]

## Topic Modeling - Nouns, Verbs, and Adjectives

In [35]:
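# Let's create a function to pull out nouns, adjectives, and verbs from a string of text
# (the nouns_adverb name is carried over from the previous section)
def nouns_adverb(text):
    '''Given a string of text, tokenize the text and pull out only the nouns, adjectives, and verbs.'''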
    is_noun_adverb = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ' or pos[:2] == 'VB'
    tokenized = word_tokenize(text)
    nouns_adverb = [word for (word, pos) in pos_tag(tokenized) if is_noun_adverb(pos)] 
    return ' '.join(nouns_adverb)

In [36]:
# Apply the function to the transcripts to keep only nouns, adjectives, and verbs
data_nouns_adverb = pd.DataFrame(data_clean.transcript.apply(nouns_adverb))
data_nouns_adverb

Unnamed: 0,transcript
ali,ladies gentlemen please welcome stage ali wong...
anthony,thank thank thank san francisco thank good peo...
bill,right thank thank thank thank are going thank ...
chris,ladies gentlemen live worldfamous apollo theat...
dave,dave tells dirty jokes living stare hard work ...
deon,water good dont know was thirsty feel comforta...
fortune,please welcome fortune powerful woman get want...
gabriel,please state name martin moreno know ive been ...
iliza,cleveland thank thank great nice public were s...
jim,ladies gentlemen please welcome stage jim jeff...


In [37]:
# Create a new document-term matrix using only nouns, adjectives, and verbs, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adverb.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adverb.index
data_dtmna



Unnamed: 0,aaaaah,aaah,aah,aaras,abandon,abandoned,abbott,abby,abc,abcs,...,zillion,zip,zippers,zombie,zombies,zones,zoning,zoo,zoom,zoomed
ali,0,0,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
anthony,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bill,1,0,0,0,0,0,0,0,0,1,...,1,0,0,1,1,0,1,0,0,0
chris,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
dave,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
deon,0,0,0,0,0,1,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
fortune,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
gabriel,0,1,0,1,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,4,0
iliza,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2,0
jim,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [40]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=300)
ldana.print_topics()

[(0,
  '0.008*"fucking" + 0.006*"fuck" + 0.003*"joke" + 0.002*"story" + 0.002*"tonight" + 0.002*"friend" + 0.002*"fuckin" + 0.002*"son" + 0.002*"hell" + 0.002*"wrong"'),
 (1,
  '0.005*"mom" + 0.004*"camp" + 0.004*"cheer" + 0.004*"pool" + 0.004*"poor" + 0.004*"ice" + 0.003*"cream" + 0.003*"grandma" + 0.003*"boys" + 0.003*"food"'),
 (2,
  '0.008*"fucking" + 0.006*"dude" + 0.006*"fcking" + 0.006*"fck" + 0.005*"fuck" + 0.004*"kid" + 0.003*"wan" + 0.003*"sleep" + 0.003*"bed" + 0.003*"husband"'),
 (3,
  '0.011*"fuck" + 0.007*"aint" + 0.007*"fucking" + 0.005*"motherfucker" + 0.003*"bitch" + 0.003*"gay" + 0.003*"clinton" + 0.003*"kid" + 0.003*"mom" + 0.003*"dog"')]

These four topics look pretty decent.
* Topic 0: friends (Because of words like friend, story)
* Topic 1: vacation (Because of words like camp, cheer, pool, ice, cream, grandma, boys, food)
* Topic 2: family (Because of words like kid, sleep, bed, husband)
* Topic 3: profanity (Because of words like fuck, fucking, motherfucker, bitch, gay, dog)

In [41]:
# Let's take a look at which topics each transcript contains
corpus_transformed = ldana[corpusna]
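# Each document comes back as a list of (topic, probability) pairs, so keep the most probable topic
list(zip([max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed], data_dtmna.index))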

[(2, 'ali'),
 (3, 'anthony'),
 (2, 'bill'),
 (3, 'chris'),
 (3, 'dave'),
 (3, 'deon'),
 (3, 'fortune'),
 (0, 'gabriel'),
 (2, 'iliza'),
 (0, 'jim'),
 (2, 'joe'),
 (3, 'john'),
 (3, 'louis'),
 (0, 'mike'),
 (1, 'moses'),
 (0, 'patton'),
 (0, 'ricky'),
 (0, 'tom'),
 (0, 'trevor')]

* Topic 0: friends [Gabriel, Jim, Mike, Patton, Ricky, Tom, Trevor]
* Topic 1: vacation [Moses]
* Topic 2: family [Ali, Bill, Iliza, Joe]
* Topic 3: profanity [Anthony, Chris, Dave, Deon, Fortune, John, Louis]

## Analysis

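Putting the three final models side by side, each comedian's labels are listed below in this order: the nouns, verbs, and adjectives model; the nouns and adverbs model; the nouns and adjectives model. A label that repeats across models is listed once.

* Ricky: friends, food, women
* Moses: vacation, food, family

* Jim: friends, business, men
* Tom: friends, business, men
* Dave: profanity, business, men
* Ali: family, business, men

* Gabriel: friends, pets, women
* Louis: profanity, pets, men
* Iliza: family, pets, women
* Fortune: profanity, pets, family

* Mike: friends, politics, family, women
* Patton: friends, politics, family, women
* Trevor: friends, politics, family, men
* Anthony: profanity, politics, family

* Bill: family, profanity, men
* Joe: family, profanity, men
* Chris: profanity, men

* Deon: profanity
* John: profanity, family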
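
A per-comedian summary like this can also be assembled in code. The sketch below copies the dominant-topic assignments and topic labels from the three final models above and merges them. The variable names are made up, and "politics and family" stays as a single label here rather than being split in two.

```python
names = ['ali', 'anthony', 'bill', 'chris', 'dave', 'deon', 'fortune', 'gabriel', 'iliza',
         'jim', 'joe', 'john', 'louis', 'mike', 'moses', 'patton', 'ricky', 'tom', 'trevor']

# (topic labels, dominant topic per comedian) for each model, copied from the outputs above
models_summary = [
    # nouns, verbs, and adjectives (4 topics)
    ({0: 'friends', 1: 'vacation', 2: 'family', 3: 'profanity'},
     [2, 3, 2, 3, 3, 3, 3, 0, 2, 0, 2, 3, 3, 0, 1, 0, 0, 0, 0]),
    # nouns and adverbs (5 topics)
    ({0: 'business', 1: 'profanity', 2: 'food', 3: 'pets', 4: 'politics and family'},
     [0, 4, 1, 1, 0, 1, 3, 3, 3, 0, 1, 1, 3, 4, 2, 4, 2, 0, 4]),
    # nouns and adjectives (4 topics)
    ({0: 'profanity', 1: 'men', 2: 'family', 3: 'women'},
     [1, 2, 1, 1, 1, 0, 2, 3, 3, 1, 1, 2, 1, 3, 2, 3, 3, 1, 1]),
]

combined = {}
for i, name in enumerate(names):
    seen = []
    for labels, topics in models_summary:
        label = labels[topics[i]]
        if label not in seen:  # list a repeated label only once
            seen.append(label)
    combined[name] = seen

for name, topic_labels in combined.items():
    print(f"{name.capitalize()}: {', '.join(topic_labels)}")
```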