## NIPS Topic Model using Expectation Maximization (EM)

The UCI Machine Learning Repository hosts several bag-of-words datasets that record word counts per document [here](https://archive.ics.uci.edu/ml/datasets/Bag+of+Words). We use the NIPS dataset.

It provides (a) a table of word counts per document and (b) a vocabulary list.

We implement the multinomial mixture of topics model using our own EM clustering code.
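For reference, the model and the EM updates can be written out as below. This is the standard statement of a mixture of multinomials, not text taken from the notebook. The symbols mirror the code's variable names: $x_{ik}$ is the count of word $k$ in document $i$ (`x`), $\pi_j$ is the probability of topic $j$ (`pi`), $p_{jk}$ is the probability of word $k$ under topic $j$ (`p`), and $w_{ij}$ is the responsibility of topic $j$ for document $i$ (`w`). The multinomial coefficient is dropped because it is the same for every topic.

$$
p(x_i) \propto \sum_{j=1}^{J} \pi_j \prod_{k=1}^{W} p_{jk}^{x_{ik}}
$$

E-step:

$$
w_{ij} = \frac{\pi_j \prod_{k} p_{jk}^{x_{ik}}}{\sum_{l=1}^{J} \pi_l \prod_{k} p_{lk}^{x_{ik}}},
\qquad
Q = \sum_{i=1}^{D} \sum_{j=1}^{J} w_{ij} \left( \log \pi_j + \sum_{k} x_{ik} \log p_{jk} \right)
$$

M-step:

$$
p_{jk} = \frac{\sum_{i} w_{ij}\, x_{ik}}{\sum_{i} w_{ij} \sum_{k'} x_{ik'}},
\qquad
\pi_j = \frac{1}{D} \sum_{i} w_{ij}
$$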

### Cluster to 30 topics using a simple mixture-of-multinomials topic model

In [36]:
# import libs
import numpy as np
import matplotlib.pyplot as plt
import sys
import csv

from math import log
from scipy.sparse import csr_matrix
# logsumexp lives in scipy.special; scipy.misc.logsumexp was removed from recent SciPy releases
from scipy.special import logsumexp as LSE

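# dataset dimensions, as given in the three header lines of docword.nips.txt
D = 1500      # number of documents
W = 12419     # vocabulary size
NNZ = 746316  # number of nonzero (document, word) counts
J = 30        # number of topics / clusters

# read the (docID, wordID, count) triples, skipping the three header lines
data = np.loadtxt(r'data/docword.nips.txt', dtype=int, delimiter=' ', skiprows=3)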
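The three skipped header lines hold the same values that are hard-coded as `D`, `W`, and `NNZ`. As a sketch, they could be read from the file instead of typed in. The helper name `read_docword` and the tiny in-memory sample are assumptions for illustration, not part of the notebook.

```python
import io
import numpy as np


def read_docword(handle):
    """Read a UCI bag-of-words file: three header lines, then (docID, wordID, count) rows."""
    n_docs = int(handle.readline())
    n_words = int(handle.readline())
    nnz = int(handle.readline())
    triples = np.loadtxt(handle, dtype=int, ndmin=2)
    if triples.shape != (nnz, 3):
        raise ValueError("header NNZ does not match the number of rows read")
    return n_docs, n_words, nnz, triples


# made-up stand-in for data/docword.nips.txt: 2 documents, 3 words, 3 nonzero counts
sample = io.StringIO("2\n3\n3\n1 1 4\n1 3 1\n2 2 5\n")
n_docs, n_words, nnz, triples = read_docword(sample)
print(n_docs, n_words, nnz, triples.shape)
```

With the real file, `read_docword(open('data/docword.nips.txt'))` would play the same role, and the shape check catches a truncated download early.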

We store the counts in a CSR (compressed sparse row) matrix because the data matrix is sparse: only 746,316 of its 1500 × 12419 entries (about 4%) are nonzero.

In [27]:
# store data as a SciPy sparse matrix
# subtract 1 because document and word IDs in the file are 1-based
row = data[:, 0] - 1
col = data[:, 1] - 1
values = data[:, 2]

x = csr_matrix((values, (row, col)), shape=(D, W))
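To see what this construction does, here is the same pattern on a made-up three-entry table. The triples are invented for illustration and the variable names are chosen so they do not clash with the notebook's.

```python
import numpy as np
from scipy.sparse import csr_matrix

# made-up (docID, wordID, count) triples with 1-based IDs, as in the UCI files
toy_triples = np.array([[1, 1, 4],
                        [1, 3, 1],
                        [2, 2, 5]])
toy_rows = toy_triples[:, 0] - 1      # zero-based document index
toy_cols = toy_triples[:, 1] - 1      # zero-based word index
toy_vals = toy_triples[:, 2]

toy = csr_matrix((toy_vals, (toy_rows, toy_cols)), shape=(2, 3))

toy_dense = toy.toarray()                              # dense view, fine at this size
toy_lengths = np.asarray(toy.sum(axis=1)).ravel()      # tokens per document
toy_density = toy.nnz / (toy.shape[0] * toy.shape[1])  # fraction of stored entries
print(toy_dense, toy_lengths, toy_density)
```

Only the three stored counts take memory; every other cell is an implicit zero.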

In [29]:
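# p[j, k]: probability of word k under topic j.
# Rows are drawn at random rather than all set to 1/W: with identical rows every
# topic receives the same update, so EM could never separate the topics.
# Any random start works; a flat Dirichlet with a fixed seed is one simple choice.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(W), size=J)

# pi[j]: prior probability that a document belongs to topic j
pi = np.full(J, 1.0 / J)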

In [35]:
# EM helpers

# w[i, j]: responsibility of topic j for document i, refreshed by each E-step
w = np.full((D, J), 1.0 / J)


# E-step: update w and return Q, the expected complete-data log likelihood
def expectation():
    global w
    # ll[i, j] = log pi[j] + sum_k x[i, k] * log p[j, k]
    with np.errstate(divide='ignore'):  # log(0) = -inf just marks an impossible topic
        ll = x.dot(np.log(p).T) + np.log(pi)
    # normalize each row in log space so that long documents do not underflow
    w = np.exp(ll - LSE(ll, axis=1, keepdims=True))
    # where w is 0 the term contributes nothing, even if ll is -inf
    return np.sum(w * np.where(w > 0, ll, 0.0))


# M-step
def max_p(j):
    # responsibility-weighted word counts for topic j, normalized over the vocabulary
    numer = x.T.dot(w[:, j])
    return numer / numer.sum()


def max_pi(j):
    return np.sum(w[:, j]) / D
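Multiplying thousands of word probabilities together, as the responsibility `w[i, j]` formally requires, underflows double precision for documents of realistic length. For that reason the E-step is usually evaluated in log space with `logsumexp`. The sketch below illustrates this with made-up numbers: one synthetic 2000-token document and two synthetic topics. Only the vocabulary size comes from the NIPS data.

```python
import numpy as np
from scipy.special import logsumexp

vocab_size = 12419                       # vocabulary size of the NIPS dataset
doc = np.zeros(vocab_size)               # one synthetic document
doc[:200] = 10                           # 200 distinct words, 10 occurrences each

# two synthetic topics: uniform, and one that favours the first 200 words
topics = np.full((2, vocab_size), 1.0 / vocab_size)
topics[1, :200] *= 2.0
topics[1] /= topics[1].sum()
weights = np.array([0.5, 0.5])

# direct product: both likelihoods underflow to 0, so normalizing would give 0/0
naive = weights * np.prod(topics ** doc, axis=1)

# log space: the same quantities stay finite and normalize cleanly
log_joint = np.log(weights) + np.log(topics) @ doc
resp = np.exp(log_joint - logsumexp(log_joint))
print(naive, log_joint, resp)
```

The naive products are both exactly zero, while the log-space responsibilities sum to one and put essentially all mass on the second topic.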

In [106]:
# EM
prev_expectation = sys.maxsize
t = 0

while True:
    e = expectation()
    if abs(e - prev_expectation) < 100:
        break
    prev_expectation = e
    for j in range(J):
        p[j,] = max_p(j)
        pi[j] = max_pi(j)
    print(t, e)
    t += 1

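The loop above stops when the change in Q falls below an absolute threshold of 100. As an alternative sketch, the same E- and M-steps can be wrapped in one function that stops on a relative change and caps the number of iterations. Everything specific here is an assumption for illustration: the function name `fit_mixture`, its `p0`, `tol`, `max_iter`, and `seed` parameters, the rough starting point, and the six made-up documents.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.special import logsumexp


def fit_mixture(x, J, p0=None, tol=1e-6, max_iter=200, seed=0):
    """EM for a mixture of multinomials on a (D, W) count matrix (hypothetical helper)."""
    D, W = x.shape
    if p0 is None:
        # random rows, since identical rows would never separate
        p = np.random.default_rng(seed).dirichlet(np.ones(W), size=J)
    else:
        p = np.asarray(p0, dtype=float)
    pi = np.full(J, 1.0 / J)
    prev_q = -np.inf
    for it in range(max_iter):
        # E-step in log space; log(0) = -inf marks an impossible topic
        with np.errstate(divide="ignore"):
            ll = x.dot(np.log(p).T) + np.log(pi)
        w = np.exp(ll - logsumexp(ll, axis=1, keepdims=True))
        q = np.sum(w * np.where(w > 0, ll, 0.0))
        # stop once Q changes by less than tol relative to its size
        if abs(q - prev_q) <= tol * abs(q):
            break
        prev_q = q
        # M-step: weighted word counts per topic, then topic weights
        counts = x.T.dot(w).T
        p = counts / counts.sum(axis=1, keepdims=True)
        pi = w.sum(axis=0) / D
    return p, pi, w, it


# six made-up documents: three use only words 0-2, three use only words 3-5
toy_counts = csr_matrix(np.array([[20, 15, 15, 0, 0, 0],
                                  [18, 17, 15, 0, 0, 0],
                                  [22, 14, 14, 0, 0, 0],
                                  [0, 0, 0, 15, 20, 15],
                                  [0, 0, 0, 17, 18, 15],
                                  [0, 0, 0, 14, 22, 14]]))
# rough starting guess for the two topics (each row sums to 1)
rough_p0 = [[0.30, 0.30, 0.30, 0.03, 0.03, 0.04],
            [0.03, 0.03, 0.04, 0.30, 0.30, 0.30]]
toy_p, toy_pi, toy_w, n_iter = fit_mixture(toy_counts, J=2, p0=rough_p0)
print(n_iter, toy_pi, toy_w.argmax(axis=1))
```

A relative tolerance does not depend on the scale of Q, which grows with the number of tokens, and `max_iter` guarantees termination even if the tolerance is never met.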

### Graph showing, for each topic, the probability with which the topic is selected.
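One possible way to draw this graph, as a sketch: a bar chart of the mixing weights `pi`. The helper name `plot_topic_probabilities` is an assumption, and the weights below are uniform placeholders rather than fitted values. In the notebook, the fitted `pi` would be passed instead.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_topic_probabilities(topic_probs):
    """Bar chart of the probability with which each topic is selected."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(np.arange(len(topic_probs)), topic_probs)
    ax.set_xlabel("topic")
    ax.set_ylabel("probability")
    ax.set_title("Probability with which each topic is selected")
    return fig, ax


# placeholder weights only, NOT results: uniform over 30 topics
example_pi = np.full(30, 1.0 / 30)
fig, ax = plot_topic_probabilities(example_pi)
```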


### Table showing, for each topic, the 10 words with the highest probability for that topic.
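A sketch of how such a table could be produced from the topic-word matrix `p` and the dataset's vocabulary list, assuming the vocabulary has been read into a Python list whose index matches the zero-based word index. The helper name `top_words` and the four-word vocabulary are made up for illustration.

```python
import numpy as np


def top_words(topic_word_probs, vocab, n=10):
    """Return, for each topic (row), the n most probable words in descending order."""
    # argsort is ascending, so reverse each row and keep the first n indices
    order = np.argsort(topic_word_probs, axis=1)[:, ::-1][:, :n]
    return [[vocab[k] for k in row] for row in order]


# made-up example: 2 topics over a 4-word vocabulary
example_vocab = ["neuron", "kernel", "policy", "bayes"]
example_p = np.array([[0.5, 0.1, 0.1, 0.3],
                      [0.1, 0.6, 0.2, 0.1]])
example_table = top_words(example_p, example_vocab, n=2)
for j, words in enumerate(example_table):
    print(j, " ".join(words))
```

With the fitted model, `top_words(p, vocab)` would give the 30 rows of 10 words that this section calls for.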