# Latent Dirichlet Allocation Example

Latent Dirichlet Allocation (LDA) is a commonly used algorithm for topic modeling and, more broadly, can be viewed as a dimensionality reduction technique. For example, given a set of documents, LDA can group texts on similar topics together based on whether they contain similar words. LDA is an unsupervised algorithm: the groups are derived from the documents' similarity to one another, rather than by comparing them to a labeled or standardized dataset.

- Read more about [LDA on Wikipedia](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
- See the [SparkTK documentation](https://github.com/trustedanalytics/spark-tk) for more information about the API

In [1]:
# First, let's verify that the SparkTK libraries are installed
import sparktk
print "SparkTK installation path = %s" % (sparktk.__path__)

SparkTK installation path = ['/opt/anaconda2/lib/python2.7/site-packages/sparktk']


In [2]:
from sparktk import TkContext
tc = TkContext()

In [3]:
# Create a new frame by uploading rows
data = [ ['nytimes','harry',3], 
        ['nytimes','economy',35], 
        ['nytimes','jobs',40], 
        ['nytimes','magic',1],     
        ['nytimes','realestate',15], 
        ['nytimes','movies',6], 
        ['economist','economy',50], 
        ['economist','jobs',35], 
        ['economist','realestate',20], 
        ['economist','movies',1], 
        ['economist','harry',1], 
        ['economist','magic',1], 
        ['harrypotter','harry',40], 
        ['harrypotter','magic',30], 
        ['harrypotter','chamber',20], 
        ['harrypotter','secrets',30] ]

schema = [ ('doc_id', str),
          ('word_id', str),
          ('word_count', int) ]

frame = tc.frame.create(data, schema)

In [4]:
# Inspect the frame. It has three columns: doc_id, word_id, and word_count.
# Only the first 10 of the 16 rows are shown below.
frame.inspect()

[#]  doc_id     word_id     word_count
[0]  nytimes    harry                3
[1]  nytimes    economy             35
[2]  nytimes    jobs                40
[3]  nytimes    magic                1
[4]  nytimes    realestate          15
[5]  nytimes    movies               6
[6]  economist  economy             50
[7]  economist  jobs                35
[8]  economist  realestate          20
[9]  economist  movies               1
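
The frame above holds one row per (document, word) pair together with that word's count. As a rough sketch of how rows in this shape could be derived from raw text, the snippet below counts words with `collections.Counter`. The sample texts and the helper name `to_word_count_rows` are hypothetical and are not part of SparkTK; real text would also need proper tokenization.

```python
from collections import Counter

def to_word_count_rows(docs):
    """Turn {doc_id: text} into [doc_id, word_id, word_count] rows (hypothetical helper)."""
    rows = []
    for doc_id in sorted(docs):
        # Naive tokenization: lowercase and split on whitespace
        counts = Counter(docs[doc_id].lower().split())
        for word in sorted(counts):
            rows.append([doc_id, word, counts[word]])
    return rows

# Made-up sample texts, for illustration only
sample_docs = {
    'harrypotter': 'harry magic harry chamber secrets magic harry',
    'economist': 'economy jobs economy realestate',
}

rows = to_word_count_rows(sample_docs)
for row in rows:
    print(row)
```

Rows built this way have the same three fields as the `schema` defined earlier, so they could be passed to `tc.frame.create` in the same manner.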

### Create a new model and train it

In [5]:
# LDA model is trained using the frame above.
model = tc.models.clustering.lda.train(frame, 'doc_id', 'word_id', 'word_count',
                                       max_iterations=3, num_topics=2)
print model.report

Number of vertices: 11} (doc: 3, word: 8})
Number of edges: 16

numTopics: 2
alpha: 26.0
beta: 1.100000023841858
maxIterations: 3



### Compute topic probabilities for document

In [6]:
print model.topics_given_doc_frame.inspect()

[#]  doc_id       topic_probabilities             
[0]  harrypotter  [0.242264796494, 0.757735203506]
[1]  nytimes      [0.691637481778, 0.308362518222]
[2]  economist    [0.745181512941, 0.254818487059]
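
Each row above is a probability distribution over the two topics. One simple way to read it is to assign every document to its most probable topic. The sketch below does that in plain Python, using the values copied from the output above; the helper name `dominant_topic` is introduced here for illustration only.

```python
# Topic probabilities copied from the output above
topics_given_doc = {
    'harrypotter': [0.242264796494, 0.757735203506],
    'nytimes': [0.691637481778, 0.308362518222],
    'economist': [0.745181512941, 0.254818487059],
}

def dominant_topic(probabilities):
    """Return the index of the largest probability (hypothetical helper)."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

dominant = {doc: dominant_topic(p) for doc, p in topics_given_doc.items()}
for doc in sorted(dominant):
    print("%s -> topic %d" % (doc, dominant[doc]))
```

With these values, `nytimes` and `economist` share topic 0, while `harrypotter` falls under topic 1.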


In [7]:
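# Note: there is no comma between 'chamber' and 'test', so Python concatenates the two
# literals into the single unseen word 'chambertest'. The list therefore has 5 words,
# which is consistent with the output below (1 new word out of 5 = 20%).
prediction = model.predict(['harry', 'secrets', 'magic', 'harry', 'chamber' 'test'])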
print(prediction)

{u'topics_given_doc': [0.19531110732553258, 0.6046888926744676], u'new_words_percentage': 20.0, u'new_words_count': 1}
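
The `new_words_count` and `new_words_percentage` fields can be reproduced by hand. The sketch below assumes that a "new" word is any input word absent from the training vocabulary and that the percentage is its share of all input words; this is an interpretation of the output above, not SparkTK's actual implementation.

```python
# Vocabulary taken from the word_id column of the training data above
vocabulary = set(['harry', 'economy', 'jobs', 'magic', 'realestate',
                  'movies', 'chamber', 'secrets'])

# Same list literal as in the predict call: adjacent string literals without a
# comma are joined by Python, so the last element is 'chambertest'
words = ['harry', 'secrets', 'magic', 'harry', 'chamber' 'test']

# Assumption: a "new" word is any input word missing from the training vocabulary
new_words = [w for w in words if w not in vocabulary]
new_words_count = len(new_words)
new_words_percentage = 100.0 * new_words_count / len(words)

print("words: %s" % words)
print("new_words_count: %d" % new_words_count)
print("new_words_percentage: %.1f" % new_words_percentage)
```

In the recorded output, the two `topics_given_doc` values sum to roughly 0.8, which is consistent with 80% of the input words being in the vocabulary.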


### Compute LDA score

In [8]:
model.topics_given_doc_frame.rename_columns({'topic_probabilities' : 'lda_topic_given_doc'})
model.word_given_topics_frame.rename_columns({'topic_probabilities' : 'lda_word_given_topic'})

frame = frame.join_left(model.topics_given_doc_frame, left_on="doc_id", right_on="doc_id")
frame = frame.join_left(model.word_given_topics_frame, left_on="word_id", right_on="word_id")

frame.dot_product(['lda_topic_given_doc'], ['lda_word_given_topic'], 'lda_score')
print frame.inspect()

[#]  doc_id_L     word_id_L   word_count  lda_topic_given_doc
[0]  nytimes      realestate          15  None
[1]  economist    realestate          20  None
[2]  nytimes      harry                3  None
[3]  economist    harry                1  None
[4]  harrypotter  harry               40  None
[5]  harrypotter  chamber             20  None
[6]  nytimes      movies               6  None
[7]  economist    movies               1  None
[8]  nytimes      economy             35  None
[9]  economist    economy             50  None

[#]  lda_word_given_topic  lda_score      
[0]  None                   0.110764190642
[1]  None                   0.112209620192
[2]  None                   0.107699286945
[3]  None                  0.0980817303387
[4]  None                   0.188415421764
[5]  None                  0.0830315849033
[6]  None                  0.0252590301573
[7]  None                  0.0265027647554
[8]  None                   0.311066329816
[9]  None                   0.3302161
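
The `lda_score` column is the dot product of the two vector columns for each row. The sketch below shows that calculation in plain Python for a single row. The `topic_given_doc` values are copied from the earlier `harrypotter` output, while the `word_given_topic` values are hypothetical, since the table above does not display them.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

# Topic distribution for 'harrypotter', copied from the output of In [6]
topic_given_doc = [0.242264796494, 0.757735203506]

# Hypothetical per-topic probabilities for one word (not taken from the model)
word_given_topic = [0.05, 0.25]

lda_score = dot(topic_given_doc, word_given_topic)
print(lda_score)
```

A word scores highly for a document when it is probable under the same topics that dominate that document.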

### Compute histogram of scores

In [9]:
word_hist = frame.histogram('word_count')
lda_hist = frame.histogram('lda_score')
group_frame = frame.group_by('word_id_L', 
                             {'word_count': tc.agg.histogram(word_hist.cutoffs), 
                              'lda_score':  tc.agg.histogram(lda_hist.cutoffs)})
group_frame.inspect()

[#]  word_id_L   lda_score_HISTOGRAM                                  
[0]  jobs                                         [0.0, 0.0, 1.0, 0.0]
[1]  realestate                                   [0.0, 1.0, 0.0, 0.0]
[2]  economy                                      [0.0, 0.0, 0.0, 1.0]
[3]  magic                  [0.666666666667, 0.333333333333, 0.0, 0.0]
[4]  secrets                                      [0.0, 1.0, 0.0, 0.0]
[5]  harry       [0.333333333333, 0.333333333333, 0.333333333333, 0.0]
[6]  movies                                       [1.0, 0.0, 0.0, 0.0]
[7]  chamber                                      [1.0, 0.0, 0.0, 0.0]

[#]  word_count_HISTOGRAM                      
[0]                        [0.0, 0.0, 0.5, 0.5]
[1]                        [0.0, 1.0, 0.0, 0.0]
[2]                        [0.0, 0.0, 0.5, 0.5]
[3]  [0.666666666667, 0.0, 0.333333333333, 0.0]
[4]                        [0.0, 0.0, 1.0, 0.0]
[5]  [0.666666666667, 0.0, 0.0, 0.333333333333]
[6]                     