# Latent Dirichlet Allocation Example

LDA, short for Latent Dirichlet Allocation, is a commonly used algorithm for topic modeling and, more broadly, is considered a dimensionality reduction technique. For example, given a collection of documents, LDA can group texts on similar topics together based on whether they contain similar words. LDA is an unsupervised algorithm, meaning that the groups are formed from the documents' similarity to one another, rather than by comparing them to an idealized or standardized dataset.

- Read more about [LDA in Wikipedia](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
- See [ATK Documentation](http://trustedanalytics.github.io/atk/) for more information about the APIs

In [None]:
# First, let's verify that the ATK client libraries are installed
import trustedanalytics as ta
print "ATK installation path = %s" % (ta.__path__)

In [None]:
# Next, look up your ATK server URI in the TAP Console and enter it below.
# This setting will be needed in every ATK notebook so that the client knows what server to communicate with.

# E.g. ta.server.uri = 'demo-atk-c07d8047.demotrustedanalytics.com'
ta.server.uri = 'ENTER URI HERE'

In [None]:
# This notebook assumes you have already created a credentials file.
# Enter the path here to connect to ATK
ta.connect('myuser-cred.creds')

In [None]:
# Create a new frame by uploading rows
data = [ ['nytimes','harry',3], 
        ['nytimes','economy',35], 
        ['nytimes','jobs',40], 
        ['nytimes','magic',1],     
        ['nytimes','realestate',15], 
        ['nytimes','movies',6], 
        ['economist','economy',50], 
        ['economist','jobs',35], 
        ['economist','realestate',20], 
        ['economist','movies',1], 
        ['economist','harry',1], 
        ['economist','magic',1], 
        ['harrypotter','harry',40], 
        ['harrypotter','magic',30], 
        ['harrypotter','chamber',20], 
        ['harrypotter','secrets',30] ]

schema = [ ('doc_id', str),
          ('word_id', str),
          ('word_count', ta.int64) ]

frame = ta.Frame(ta.UploadRows(data, schema))

In [None]:
# Inspect the frame, which contains three columns: doc_id, word_id, and word_count.
frame.inspect()
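
The frame stores one row per (document, word) pair. As a rough illustration of what that layout encodes, the following plain-Python sketch regroups a few of the same rows by document and computes each word's share of a document's counted words. It does not use ATK; the helper names `counts_by_document` and `word_share` are hypothetical and exist only for this example.

```python
from collections import defaultdict

# A subset of the (doc_id, word_id, word_count) rows uploaded above.
rows = [
    ('nytimes', 'economy', 35),
    ('nytimes', 'harry', 3),
    ('economist', 'economy', 50),
    ('economist', 'harry', 1),
    ('harrypotter', 'harry', 40),
    ('harrypotter', 'magic', 30),
]

def counts_by_document(rows):
    """Group long-format rows into {doc_id: {word_id: count}}."""
    table = defaultdict(dict)
    for doc_id, word_id, count in rows:
        table[doc_id][word_id] = count
    return dict(table)

def word_share(doc_counts, doc_id, word_id):
    """Fraction of a document's counted words that are word_id."""
    counts = doc_counts[doc_id]
    return counts.get(word_id, 0) / float(sum(counts.values()))

doc_counts = counts_by_document(rows)
harry_share = word_share(doc_counts, 'harrypotter', 'harry')
```

Documents whose shares concentrate on the same words are the ones a topic model would be expected to place close together.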

### Create a new model and train it

In [None]:
model = ta.LdaModel()

# Train the LDA model on the frame above.
results = model.train(frame, 'doc_id', 'word_id', 'word_count', 
                      max_iterations = 3, num_topics = 2)

### Compute topic probabilities for each document

In [None]:
topics_given_doc = results['topics_given_doc']
word_given_topics = results['word_given_topics']
topics_given_word = results['topics_given_word']
report = results['report']

print topics_given_doc.inspect()

print "\n %s" %(report)

prediction = model.predict(['harry', 'economy', 'magic', 'harry', 'test'])
print(prediction)

### Compute LDA score

In [None]:
topics_given_doc.rename_columns({'topic_probabilities' : 'lda_topic_given_doc'})
word_given_topics.rename_columns({'topic_probabilities' : 'lda_word_given_topic'})

frame = frame.join(topics_given_doc, left_on="doc_id", right_on="doc_id", how="left")
frame = frame.join(word_given_topics, left_on="word_id", right_on="word_id", how="left")

frame.dot_product(['lda_topic_given_doc'], ['lda_word_given_topic'], 'lda_score')
print frame.inspect()
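
To make the score concrete, here is a minimal plain-Python sketch of the dot product computed for a single row. It assumes the two vectors hold P(topic | document) and P(word | topic) for the same ordered topics, in which case their dot product can be read as the model's estimate of P(word | document). The numbers are made up for illustration and are not output from the model trained above.

```python
def dot_product(left, right):
    """Sum of element-wise products of two equal-length vectors."""
    if len(left) != len(right):
        raise ValueError("vectors must have the same length")
    return sum(l * r for l, r in zip(left, right))

# Hypothetical values for one (document, word) row with two topics.
topic_given_doc = [0.9, 0.1]     # P(topic k | document)
word_given_topic = [0.02, 0.30]  # P(word | topic k)

# 0.9 * 0.02 + 0.1 * 0.30
lda_score = dot_product(topic_given_doc, word_given_topic)
```

A word scores high for a document when it is probable under the topics that dominate that document.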

### Compute histogram of scores

In [None]:
word_hist = frame.histogram('word_count')
lda_hist = frame.histogram('lda_score')
group_frame = frame.group_by('word_id_L', 
                             {'word_count': ta.agg.histogram(word_hist.cutoffs), 
                              'lda_score':  ta.agg.histogram(lda_hist.cutoffs)})
group_frame.inspect()
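
The `cutoffs` passed to the aggregation are bin boundaries. The sketch below shows, in plain Python, one way such boundaries and the resulting counts could be computed with equal-width bins. It is an assumption-laden illustration, not ATK's implementation: the helpers `equal_width_cutoffs` and `bin_counts` are hypothetical, the number of bins is chosen by hand, and the last bin is treated as including its upper boundary.

```python
def equal_width_cutoffs(values, num_bins):
    """Return num_bins + 1 boundaries spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(num_bins)
    return [lo + i * width for i in range(num_bins)] + [hi]

def bin_counts(values, cutoffs):
    """Count values per bin; bins are [lower, upper), last bin is [lower, upper]."""
    counts = [0] * (len(cutoffs) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = cutoffs[i] <= v < cutoffs[i + 1]
            at_top = (i == last and v == cutoffs[-1])
            if in_bin or at_top:
                counts[i] += 1
                break
    return counts

# word_count values of the six nytimes rows from the uploaded data.
word_counts = [3, 35, 40, 1, 15, 6]
cutoffs = equal_width_cutoffs(word_counts, 3)
counts = bin_counts(word_counts, cutoffs)
```

Reusing one set of cutoffs for every group, as the `group_by` call does, keeps the per-word histograms comparable because all of them share the same bin boundaries.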