**This is a document on Topic Modeling**

https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

In [9]:
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."

# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

**Cleaning**

Data cleaning is absolutely crucial for generating a useful topic model. The steps below are common to most natural language processing methods:

- Tokenizing: converting a document to its atomic elements (here, words).
- Stopping: removing common words that carry little meaning, such as "is" or "the".
- Stemming: merging words that are equivalent in meaning by reducing them to a common stem.
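
The three steps above can be sketched with nothing but the standard library. This is only an illustration under stated assumptions: `TOY_STOP_WORDS`, `toy_stem`, and `clean` are hypothetical names, the stop list is deliberately tiny, and the suffix stripper is a crude stand-in for a real stemmer, not the Porter algorithm.

```python
import re

# Hypothetical, deliberately tiny stop list; a real one has far more entries.
TOY_STOP_WORDS = {"is", "to", "my", "but", "not", "the", "a"}

def toy_stem(word):
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # keep at least three characters so short words survive intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(document):
    tokens = re.findall(r"\w+", document.lower())           # tokenizing
    kept = [t for t in tokens if t not in TOY_STOP_WORDS]   # stopping
    return [toy_stem(t) for t in kept]                      # stemming

cleaned = clean("My brother likes to eat good brocolli, but not my mother.")
print(cleaned)  # ['brother', 'like', 'eat', 'good', 'brocolli', 'mother']
```

A real stemmer produces different stems for many words, so treat the output here as a picture of the data flow rather than of the final vocabulary.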

**Tokenization**

Tokenization can be performed in many ways. Here, we use NLTK’s tokenize.regexp module. Another option is scikit-learn's CountVectorizer, which converts a collection of text documents into a matrix of token counts.

In [10]:
from nltk.tokenize import RegexpTokenizer

# The pattern below matches runs of word characters and stops at any non-word character, such as a space.
# This is a simple solution, but it splits words like "don't" into two tokens, "don" and "t".
# NLTK provides a number of pre-constructed tokenizers, such as those in nltk.tokenize.simple.
# It's better to use a regex and iterate until your document is accurately tokenized.
tokenizer = RegexpTokenizer(r'\w+')
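
The "don't" problem mentioned in the comments is easy to reproduce. The sketch below uses the standard library's `re.findall` rather than NLTK, on the assumption that it treats the `\w+` pattern the same way here. The sample sentence and the apostrophe-aware pattern are illustrative choices, not part of the original tutorial.

```python
import re

sentence = "I don't like brocolli."

# Same pattern as above: runs of word characters only.
simple = re.findall(r"\w+", sentence)

# Variant that allows apostrophe-joined pieces inside a single token.
with_apostrophes = re.findall(r"\w+(?:'\w+)*", sentence)

print(simple)            # ['I', 'don', 't', 'like', 'brocolli']
print(with_apostrophes)  # ['I', "don't", 'like', 'brocolli']
```

Whether keeping "don't" whole is desirable depends on the stop word list used afterwards, so the pattern is worth revisiting once the later steps are in place.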

In [11]:
# create English stop words list
from stop_words import get_stop_words

en_stop = get_stop_words('en')

In [12]:
# For example, "stemming", "stemmer", and "stemmed" all have similar meanings;
# stemming reduces those terms to "stem".
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()

In [13]:
# list of all stemmed tokens
texts = []

In [20]:
# loop through document list
for doc in doc_set:

    # clean and tokenize document string
    # convert the document to lower-case
    raw = doc.lower()
    # tokens is a list containing each word in the document
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens; words such as "that", "is", "for", "your" are removed
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # stem tokens, e.g. "pressure" becomes "pressur"
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

    # texts is a list of stemmed_tokens lists
    texts.append(stemmed_tokens)

In [24]:
from gensim import corpora, models
import gensim

# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix:
# each document becomes a bag-of-words list of (term id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
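
To make the two structures above concrete, here is a standard-library sketch of an id <-> term mapping and the bag-of-words vectors built from it. `build_vocab`, `to_bow`, and the toy documents are hypothetical. The ids gensim assigns will generally differ, so this only shows the shape of the data.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

toy_texts = [["brocolli", "good", "eat", "good"], ["mother", "drive", "brother"]]
toy_vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(t, toy_vocab) for t in toy_texts]

print(toy_vocab)
print(toy_corpus)  # [[(0, 1), (1, 2), (2, 1)], [(3, 1), (4, 1), (5, 1)]]
```

Each inner list is sparse: only terms that occur in the document appear, which keeps the representation small for large vocabularies.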

In [36]:
# generate LDA model with 3 topics; passes is the number of passes over the corpus during training
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=20)

# examine the results
# num_words: number of words shown for each topic
print(ldamodel.print_topics(num_topics=3, num_words=3))

[(0, u'0.083*"mother" + 0.083*"brother" + 0.083*"drive"'), (1, u'0.031*"mother" + 0.031*"brother" + 0.031*"drive"'), (2, u'0.105*"good" + 0.105*"brocolli" + 0.105*"health"')]


In [37]:
# refit the model with 2 topics for comparison
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

print(ldamodel.print_topics(num_topics=2, num_words=4))

[(0, u'0.094*"health" + 0.066*"drive" + 0.065*"pressur" + 0.036*"tension"'), (1, u'0.107*"brocolli" + 0.107*"good" + 0.083*"brother" + 0.083*"mother"')]
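
Each printed topic is a string of `weight*"term"` pieces, where the weight is the term's probability within that topic. As a small sketch, the string can be split into (term, weight) pairs with a regular expression. `parse_topic` is a hypothetical helper, and the input is copied from the two-topic output above.

```python
import re

# Topic string copied from the two-topic output above.
topic = '0.094*"health" + 0.066*"drive" + 0.065*"pressur" + 0.036*"tension"'

def parse_topic(topic_string):
    """Split a printed topic into (term, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(term, float(weight)) for weight, term in pairs]

terms = parse_topic(topic)
print(terms)
# [('health', 0.094), ('drive', 0.066), ('pressur', 0.065), ('tension', 0.036)]
```

Because LDA training is randomized, rerunning the notebook will likely produce different weights and possibly different top terms, so the numbers here are specific to the run shown.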
