### LDA Feature Model

Here I use the gensim implementation of Latent Dirichlet Allocation (LDA) to extract features.

In [1]:
from gensim import corpora
from gensim.models import ldamodel
import numpy as np

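`test_model` takes a list of tokenized documents (each document is a list of tokens), trains an LDA model on them, and prints the dictionary, the bag-of-words corpus, and the learned topics.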

In [2]:
def test_model(documents, num_topics):
    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    np.random.seed(0)  # Fix the seed so results are reproducible; LDA training is stochastic
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, 
                            num_topics=num_topics, iterations=100000,
                            alpha='symmetric', gamma_threshold=0.001,
                            eta=None)
    
    print("dictionary.items(): %s\n" % list(dictionary.items()))
    # Each document in bag-of-words format: a list of (token_id, token_count) tuples
    print("corpus: %s\n" % corpus)
    print("LDA Topics:")
    for topic in range(num_topics):
        print("\nTopic %s" % topic)
        print(lda.print_topic(topic))
    return lda, dictionary

test_documents = [["top", "pokemon", "incredible"], 
                  ["incredible"], 
                  ["pictures"]]

num_topics = 3
test_lda, test_dict = test_model(test_documents, num_topics)

dictionary.items(): [(0, 'top'), (1, 'pokemon'), (3, 'pictures'), (2, 'incredible')]

corpus: [[(0, 1), (1, 1), (2, 1)], [(2, 1)], [(3, 1)]]

LDA Topics:

Topic 0
0.406*incredible + 0.257*top + 0.255*pokemon + 0.082*pictures

Topic 1
0.490*pictures + 0.200*incredible + 0.157*pokemon + 0.153*top

Topic 2
0.303*incredible + 0.266*pictures + 0.217*pokemon + 0.214*top
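
To make the `corpus` output above more concrete, here is a minimal pure-Python sketch of the bag-of-words conversion. It is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and ids are assigned in first-seen order, which is an assumption and may differ from the ids gensim assigns depending on the version.

```python
from collections import Counter

def build_vocab(documents):
    """Map each distinct token to an integer id, in first-seen order (an assumption)."""
    vocab = {}
    for doc in documents:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, token_count) pairs, skipping tokens not in vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

documents = [["top", "pokemon", "incredible"], ["incredible"], ["pictures"]]
vocab = build_vocab(documents)
corpus = [to_bow(doc, vocab) for doc in documents]
print(corpus)
```

With this toy data the sketch prints the same `corpus` as shown above. Tokens that were never seen during vocabulary building are simply dropped, which is why a new document made entirely of unknown words ends up as an empty bag of words.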


**Evaluating new data**

In [3]:
from operator import itemgetter

def extract_features(docs):
    res = []
    for doc in docs:
        np.random.seed(0)  # Fix the seed so results are reproducible; inference is stochastic
        topic_distribution = test_lda[test_dict.doc2bow(doc)]
        print(topic_distribution)
        if len(topic_distribution) == 0:
            print("Empty")
        else:
            most_probable = max(topic_distribution, key=itemgetter(1))
            print("Most probable topic: {}".format(most_probable))
            print("Most probable word in topic: {}\n".format(
                    test_lda.show_topic(most_probable[0], 1)[0]
                ))

            # Keep only the probabilities, dropping the topic ids
            res.append([prob for _, prob in topic_distribution])
    return res

extract_features(test_documents)

[(0, 0.82723213947096941), (1, 0.085512135625795294), (2, 0.08725572490323534)]
Most probable topic: (0, 0.82723213947096941)
Most probable word in topic: ('incredible', 0.40624497455174713)

[(0, 0.65069628706313964), (1, 0.17177125240199564), (2, 0.17753246053486471)]
Most probable topic: (0, 0.65069628706313964)
Most probable word in topic: ('incredible', 0.40624497455174713)

[(0, 0.16791207285071336), (1, 0.65878304940081323), (2, 0.17330487774847336)]
Most probable topic: (1, 0.65878304940081323)
Most probable word in topic: ('pictures', 0.48988096504845463)



[[0.82723213947096941, 0.085512135625795294, 0.08725572490323534],
 [0.65069628706313964, 0.17177125240199564, 0.17753246053486471],
 [0.16791207285071336, 0.65878304940081323, 0.17330487774847336]]

**These per-document topic distributions are commonly used as feature vectors for downstream models.**
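
One caveat when using these as features: gensim can omit topics whose probability falls below its `minimum_probability` threshold, so the per-document lists are not guaranteed to have the same length. Below is a minimal sketch of one way to get fixed-length vectors. `to_dense` is a hypothetical helper, not part of gensim, and the input values are illustrative rather than taken from the model above.

```python
def to_dense(topic_distribution, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list.

    Topics missing from the input are filled with 0.0.
    """
    dense = [0.0] * num_topics
    for topic_id, prob in topic_distribution:
        dense[topic_id] = prob
    return dense

# Illustrative sparse output in which topic 1 was filtered out
sparse = [(0, 0.9), (2, 0.1)]
features = to_dense(sparse, num_topics=3)
print(features)  # [0.9, 0.0, 0.1]
```

In `extract_features`, appending `to_dense(topic_distribution, num_topics)` instead of the raw probabilities would keep every column aligned with its topic id, at the cost of treating filtered-out topics as exactly zero.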