In [1]:
from gensim import corpora, models, similarities
from collections import defaultdict
from pprint import pprint

#This is recommended when using gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

---
Latent Dirichlet Allocation (LDA)
=====
***

(Adapted from http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/)

### Objective

Automatically discover **topics** within a set of **documents**.

### Example

Consider the following sentences:

```
1. I like to eat broccoli and bananas.
2. I ate a banana and spinach smoothie for breakfast.
3. Chinchillas and kittens are cute.
4. My sister adopted a kitten yesterday.
5. Look at this cute hamster munching on a piece of broccoli.
```

Suppose you want to find the top two topics in the sentences above. LDA might produce a result like:

```
Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
```

[The gensim tutorial](https://radimrehurek.com/gensim/tut1.html)

#### From Wikipedia, Latent Dirichlet Allocation

1. Tell the algorithm how many topics you think there are
 - intuitively
 - statistically

2. Assign every word to a topic in a semi-random manner (according to a Dirichlet distribution)
 - a word can appear in more than one topic

3. Iterate: loop through every word in every document and update its topic assignment, according to:

a. how prevalent the word is across topics,

b. how prevalent the topics are in the document.

Looking at each topic: what proportion of the topic is down to each word? Certain words will favor certain topics.

Looking at each document: how prevalent are the topics? Divide up the document into the topics.
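The three steps above can be sketched as a toy collapsed Gibbs sampler. This is only one common way to realize them and is not what gensim runs internally (gensim's `LdaModel` uses online variational Bayes). The function name `toy_lda_gibbs` and the parameters `alpha`, `beta`, `sweeps`, and `seed` are illustrative assumptions, not part of any library.

```python
import random

def toy_lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    """Toy collapsed Gibbs sampler; all names here are illustrative."""
    rng = random.Random(seed)
    vocab_size = len(set(w for doc in docs for w in doc))

    # Step 2: give every word occurrence a random starting topic.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables needed by the update rule.
    doc_topic = [[0] * num_topics for _ in docs]   # topic counts per document
    topic_word = [{} for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                 # words assigned to each topic
    for d, doc in enumerate(docs):
        for w, z in zip(doc, assignments[d]):
            doc_topic[d][z] += 1
            topic_word[z][w] = topic_word[z].get(w, 0) + 1
            topic_total[z] += 1

    # Step 3: revisit each word occurrence and resample its topic.
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assignments[d][i]
                # Remove the current occurrence from the counts.
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # (a) prevalence of the word in topic k, times
                # (b) prevalence of topic k in document d.
                weights = [
                    (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                # Draw the new topic in proportion to the weights.
                r = rng.random() * sum(weights)
                new = num_topics - 1
                for k, weight in enumerate(weights):
                    r -= weight
                    if r <= 0:
                        new = k
                        break
                # Put the occurrence back under its new topic.
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] = topic_word[new].get(w, 0) + 1
                topic_total[new] += 1
    return assignments, doc_topic, topic_word

docs = [["eat", "fish", "vegetables"], ["fish", "pets"], ["kitten", "eat", "fish"]]
assignments, doc_topic, topic_word = toy_lda_gibbs(docs)
print(doc_topic)  # per-document topic counts; values depend on the seed
```

With a corpus this small the resulting split is noisy and changes with the seed; the point is the shape of the update, not the specific topics it finds.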

- I eat fish and vegetables
- Fish are pets
- My kitten eats fish

##### Ask for 2 topics:

Topic A: eat fish, eats fish, vegetables
    
Topic B: Fish, pets, kitten

##### Infer the content spread of each sentence by word count

- Sentence 1: 100% Topic A

- Sentence 2: 100% Topic B

- Sentence 3: 33% Topic B and 66% Topic A

##### The proportion that each word contributes to a given topic can then be derived

- Topic A might comprise words in the following proportions: 40% eat, 40% fish, 20% vegetables
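The percentages in this example are plain word counts. The sketch below reproduces them, assuming hand-picked topic labels for each word (a real model would infer them), that filler words such as "I", "and", "are", and "my" have been dropped, and that "eats" has been normalized to "eat". The names `labelled`, `topic_shares`, and `word_shares` are hypothetical.

```python
from collections import Counter

# Hand-picked (word, topic) labels mirroring the three example sentences.
labelled = [
    [("eat", "A"), ("fish", "A"), ("vegetables", "A")],   # I eat fish and vegetables
    [("fish", "B"), ("pets", "B")],                       # Fish are pets
    [("kitten", "B"), ("eat", "A"), ("fish", "A")],       # My kitten eats fish
]

def topic_shares(sentence):
    """Fraction of a sentence's words assigned to each topic."""
    counts = Counter(topic for _, topic in sentence)
    total = float(len(sentence))
    return {topic: n / total for topic, n in counts.items()}

def word_shares(sentences, topic):
    """Fraction of a topic's word occurrences contributed by each word."""
    counts = Counter(w for s in sentences for w, t in s if t == topic)
    total = float(sum(counts.values()))
    return {word: n / total for word, n in counts.items()}

for number, sentence in enumerate(labelled, 1):
    print("Sentence {}: {}".format(number, topic_shares(sentence)))
print("Topic A: {}".format(word_shares(labelled, "A")))
```

Under these labels, sentence 3 has two of its three words in Topic A and one in Topic B, and Topic A is made up of two occurrences of "eat", two of "fish", and one of "vegetables".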

---
## Documents represented as strings
---

In [2]:
documents = ["Human machine computer interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

---
## Tokenize the documents, remove stop words and words that only appear once in the corpus
---

##### First, tokenize the documents and remove stop words using a 'toy' stop-word list

In [6]:
stop_list = set(['for', 'a', 'of', 'the', 'and', 'to', 'in'])
print stop_list

set(['a', 'and', 'for', 'of', 'to', 'in', 'the'])


> tip: Consider using https://pypi.python.org/pypi/stop-words for more robust stop-word analysis

```python
from stop_words import get_stop_words

en_stop = get_stop_words('en')
# `tokens` stands for the lowercased words of one document
tokens = documents[0].lower().split()
stopped_tokens = [i for i in tokens if i not in en_stop]
```

In [7]:
documents_without_stops = []
for docs in documents:
    t = [word for word in docs.lower().split() if word not in stop_list]
    documents_without_stops.append(t)

In [8]:
print documents_without_stops

[['human', 'machine', 'computer', 'interface', 'lab', 'abc', 'computer', 'applications'], ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'], ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'], ['graph', 'minors', 'survey']]


In [9]:
frequency = defaultdict(int)

In [10]:
for text in documents_without_stops:
    for token in text:
        frequency[token] += 1

In [11]:
texts = []
for text in documents_without_stops:
    t = [token for token in text if frequency[token] > 1]
    texts.append(t)

In [12]:
pprint(texts)

[['human', 'computer', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


##### Now convert the documents to vectors using a bag-of-words representation

In [14]:


dictionary = corpora.Dictionary(texts)
print "dictionary is ", dictionary

 dictionary is  Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...)


##### There are 12 distinct words, so each document will be represented by a 12-D vector

##### It is also possible to display the token IDs that the words have been mapped to

In [18]:
print "Dictionary to token:" 
pprint(dictionary.token2id) # pprint is short for "pretty print"

Dictionary to token:
{u'computer': 1,
 u'eps': 8,
 u'graph': 10,
 u'human': 2,
 u'interface': 0,
 u'minors': 11,
 u'response': 3,
 u'survey': 5,
 u'system': 6,
 u'time': 4,
 u'trees': 9,
 u'user': 7}


##### The function doc2bow is similar to scikit-learn's CountVectorizer. It counts how often each word occurs in a document and returns a sparse vector of (token id, count) pairs

In [19]:
for text in texts:
    print text
    print dictionary.doc2bow(text)
    
print "\n\n"
corpus = [dictionary.doc2bow(text) for text in texts]

# Remember there are 12 tokens; use dictionary.token2id to map each id back to its word

for c in corpus:
    print c

['human', 'computer', 'interface', 'computer']
[(0, 1), (1, 2), (2, 1)]
['survey', 'user', 'computer', 'system', 'response', 'time']
[(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
['eps', 'user', 'interface', 'system']
[(0, 1), (6, 1), (7, 1), (8, 1)]
['system', 'human', 'system', 'eps']
[(2, 1), (6, 2), (8, 1)]
['user', 'response', 'time']
[(3, 1), (4, 1), (7, 1)]
['trees']
[(9, 1)]
['graph', 'trees']
[(9, 1), (10, 1)]
['graph', 'minors', 'trees']
[(9, 1), (10, 1), (11, 1)]
['graph', 'minors', 'survey']
[(5, 1), (10, 1), (11, 1)]



[(0, 1), (1, 2), (2, 1)]
[(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(0, 1), (6, 1), (7, 1), (8, 1)]
[(2, 1), (6, 2), (8, 1)]
[(3, 1), (4, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(5, 1), (10, 1), (11, 1)]
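To make the sparse encoding easier to read, the counting can be reproduced in plain Python. This is a sketch of the idea rather than gensim's implementation: the mapping is copied from the `dictionary.token2id` output shown earlier, and the helper names `bow`, `decode`, and `to_dense` are hypothetical.

```python
from collections import Counter

# Word-to-id mapping copied from the dictionary.token2id output.
token2id = {
    'interface': 0, 'computer': 1, 'human': 2, 'response': 3,
    'time': 4, 'survey': 5, 'system': 6, 'user': 7,
    'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11,
}
id2token = {i: t for t, i in token2id.items()}

def bow(tokens):
    """Sorted (token_id, count) pairs; words outside the dictionary are dropped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

def decode(bow_vector):
    """Replace each token id with its word."""
    return [(id2token[i], n) for i, n in bow_vector]

def to_dense(bow_vector, size=len(token2id)):
    """Expand the sparse pairs into the full 12-D count vector."""
    dense = [0] * size
    for i, n in bow_vector:
        dense[i] = n
    return dense

vec = bow(['human', 'computer', 'interface', 'computer'])
print(vec)            # [(0, 1), (1, 2), (2, 1)]
print(decode(vec))    # [('interface', 1), ('computer', 2), ('human', 1)]
print(to_dense(vec))  # [1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Only non-zero counts are stored in the sparse form, which is why the first document needs three pairs instead of twelve numbers.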


##### The LDA model converts the bag-of-words representation into a topic-space of lower dimensionality
##### LDA's topics are probability distributions over words
##### The distributions are inferred automatically from the corpus
##### Documents are then interpreted as a mixture of these topics

In [20]:
lda_model = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2, passes = 10, iterations=1000)

In [21]:
for i, topic in enumerate(lda_model.print_topics(num_topics = 2, num_words = 2)):
    print "\n\ntopic {:d}:\n".format(i), topic



topic 0:
(0, u'0.171*graph + 0.159*computer')


topic 1:
(1, u'0.203*system + 0.157*user')
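Each topic is printed as a string of `weight*word` terms joined by ` + `. If the weights are needed as numbers, the string can be split apart. The helper below is a hypothetical sketch written against the output format shown here; some gensim versions wrap each word in double quotes, so the sketch strips those as well.

```python
def parse_topic(description):
    """Turn '0.171*graph + 0.159*computer' into [('graph', 0.171), ('computer', 0.159)]."""
    pairs = []
    for term in description.split(' + '):
        weight, word = term.split('*', 1)
        # Drop surrounding whitespace and optional double quotes around the word.
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs

print(parse_topic('0.171*graph + 0.159*computer'))
print(parse_topic('0.203*system + 0.157*user'))
```

The weights are the probabilities of the words within the topic; only the top `num_words` are printed, so they do not sum to 1.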


---
## Classification to topic, with accompanying probability
---

In [15]:
new_doc = 'the grass is greener'
new_doc1 = 'Human Computer Interaction'
new_doc2 = 'Graphs are excellent data structures and are related to trees'

new_vec = dictionary.doc2bow(new_doc.lower().split())
new_vec1 = dictionary.doc2bow(new_doc1.lower().split())
new_vec2 = dictionary.doc2bow(new_doc2.lower().split())

##### The model assigns each new document a mixture of topics

In [16]:
print lda_model[new_vec]
print lda_model[new_vec1]
print lda_model[new_vec2]

[(0, 0.5), (1, 0.5)]
[(0, 0.20654388799196441), (1, 0.79345611200803567)]
[(0, 0.74678800888043906), (1, 0.25321199111956094)]
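Two things help when reading these mixtures: which topic dominates, and how many words of the new document the dictionary actually knows. The sketch below uses only plain Python; `vocabulary` lists the 12 dictionary words, and the helper names `known_words` and `dominant_topic` are hypothetical.

```python
# The 12 words kept in the dictionary.
vocabulary = set(['interface', 'computer', 'human', 'response', 'time', 'survey',
                  'system', 'user', 'eps', 'trees', 'graph', 'minors'])

def known_words(text):
    """Words of a new document that the dictionary can encode."""
    return [w for w in text.lower().split() if w in vocabulary]

def dominant_topic(mixture):
    """(topic_id, probability) pair with the highest probability; ties go to the first."""
    return max(mixture, key=lambda pair: pair[1])

print(known_words('the grass is greener'))        # []
print(known_words('Human Computer Interaction'))  # ['human', 'computer']
print(known_words('Graphs are excellent data structures and are related to trees'))  # ['trees']

print(dominant_topic([(0, 0.20654388799196441), (1, 0.79345611200803567)]))
```

The first document shares no words with the dictionary, so its bag-of-words vector is empty and the model has nothing to go on, which is consistent with the even 0.5/0.5 split. In the third document "graphs" is not matched because only the singular "graph" is in the dictionary.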


In [24]:
lda_model?

### Review:

*Why use LDA?*

**Automatically discover *topics* within a set of *documents***
*LDA represents documents as mixtures of topics that spit out words with certain probabilities.*

### Additional Resources
* [Gensim tutorial](https://radimrehurek.com/gensim/tut1.html)
* [Cosine Similarity Exercise](http://blog.christianperone.com/?p=1589)
* [Tutorial on using NLTK](http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk)
* [Lecture notes on NLP](http://cs.nyu.edu/courses/spring04/G22.2591-001/lecture3.html)
* [Information about the 20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/)
* [Useful intro to LDA](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/)