# Topic Modeling/NMF
### Jack Bennetto
#### February 14, 2017

## Objectives

 * Write down and explain NMF equation
 * Compare and contrast NMF, SVD, PCA and K-means
 * Implement Alternating-Least-Squares algorithm
 * Use NMF to find and interpret latent topics

## Outline

 * Discussion of topic modeling
 * Problems with SVD for topic analysis
 * Introduce NMF
 * Review solving linear equations
 * Alternating Least Squares
 * NMF for topic analysis example
 * Pair exercise

In [None]:
import pandas as pd
import numpy as np
import random
from IPython.display import display
%matplotlib inline

## Topic Modeling

### Motivation and connections

For most of this program we've talked about $X$ as a matrix of samples and features, which are very different things. With SVD yesterday, NMF today, and recommender systems over the next few days, the rows and columns look a lot more like each other. When we figure out how they fit together, we care about clustering not just the rows but the columns as well. These 'topics' (a.k.a. 'concepts', 'latent features' or 'archetypes') apply to both axes.

Examples:
 * Recommender systems mapping items to users
 * NLP mapping words to documents
 * Image processing mapping pixels to pictures

With SVD the relation to a topic is purely numeric, but with NMF we require that all entries of the mapping be non-negative, so we can see each row (or column!) as being part of a topic or not (if zero). That said, it's still a 'soft clustering', as rows (or columns) can be part of more than one topic.

Requiring entries to be non-negative provides more interpretable mappings and leads to a parts-based representation of the whole (https://github.com/zipfian/DSI_Lectures/blob/master/topicmodeling/isaac_laughlin/nmf_nature.pdf).


### Example

Let's look at users' ratings of different movies. The ratings are from 1 to 5. A rating of 0 means the user hasn't watched the movie.

In [None]:
movies = ['Matrix','Alien','StarWars','Casablanca','Titanic']
users = ['Alice','Bob','Cindy','Dan','Emily','Frank','Greg']
M = pd.DataFrame([[1, 2, 2, 0, 0],
                  [3, 5, 5, 0, 0],
                  [4, 4, 4, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 2, 0, 4, 4],
                  [0, 0, 0, 5, 5],
                  [0, 1, 0, 2, 2]],
                 index=users, columns=movies)
display(M)

We should be able to group together these movies and find topics with math.

In [None]:
from numpy.linalg import svd
k = 2

# Compute SVD
U, sigma, VT = svd(M)

U = pd.DataFrame(U, index=users)
VT = pd.DataFrame(VT, columns=movies)

# Keep top two concepts
U = U.iloc[:,:k]
sigma = sigma[:k]
VT = VT.iloc[:k,:]

# Make pretty
U, sigma, VT = (np.around(x,2) for x in (U,sigma,VT))

display(U)
display(pd.DataFrame(np.diag(sigma)))
display(VT)

**Discussion**
1. What do the concepts mean?
2. To which concept(s) does each user/document belong?

## Problems with SVD for topic analysis

**Recall:** $M = U S V^T$

1. Values in $U$ and $V^T$ can be negative, which is weird and hard to interpret. For example, suppose a latent feature is the genre 'Science fiction'. This feature can be positive (makes sense), zero (makes sense), or negative (what does that mean?).

2. The number of columns in $U$ can differ from the number of rows in $V^T$, i.e. the number of latent features differs between $U$ and $V^T$, which is weird.

3. SVD forces us to fill in missing values, then SVD models those missing values, which is bad.
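
The first two points can be checked directly on the ratings matrix from the example. This is a minimal sketch that re-enters the matrix as a plain NumPy array so it stands alone; the name `has_negative` is introduced here for illustration only.

```python
import numpy as np

# Ratings matrix from the example above: 7 users by 5 movies
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

# np.linalg.svd returns the full decomposition by default (full_matrices=True)
U, sigma, VT = np.linalg.svd(M)

# Point 2: U has 7 columns while VT has 5 rows
print(U.shape, sigma.shape, VT.shape)

# Point 1: the top two concepts contain negative entries
k = 2
has_negative = (U[:, :k] < 0).any() or (VT[:k, :] < 0).any()
print(has_negative)
```

The shapes come out as 7 by 7 for `U` and 5 by 5 for `VT`, so the two factors only agree on the number of latent features after we truncate both to the same $k$.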

How can we avoid these?

## Non-negative Matrix Factorization (NMF)



Suppose we have a matrix $V \in \mathbb{R}^{m \times n}$. With NMF, we try to write it as the product of two smaller matrices, $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$, 

$$ V = W H$$

or, to be more graphical,

$$\begin{bmatrix}
    v_{11}       & v_{12} & v_{13} & \dots & v_{1n} \\
    v_{21}       & v_{22} & v_{23} & \dots & v_{2n} \\
    v_{31}       & v_{32} & v_{33} & \dots & v_{3n} \\
    v_{41}       & v_{42} & v_{43} & \dots & v_{4n} \\
    \vdots       & \vdots & \vdots & \ddots& \vdots \\
    v_{m1}       & v_{m2} & v_{m3} & \dots & v_{mn}
\end{bmatrix}
=
\begin{bmatrix}
    w_{11}       & \dots & w_{1r} \\
    w_{21}       & \dots & w_{2r} \\
    w_{31}       & \dots & w_{3r} \\
    w_{41}       & \dots & w_{4r} \\
    \vdots       & \ddots& \vdots \\
    w_{m1}       & \dots & w_{mr}
\end{bmatrix}
\cdot
\begin{bmatrix}
    h_{11}       & h_{12} & h_{13} & \dots & h_{1n} \\
    \vdots       & \vdots & \vdots & \ddots& \vdots \\
    h_{r1}       & h_{r2} & h_{r3} & \dots & h_{rn}
\end{bmatrix}
$$

with the constraint that all $w_{ik} \ge 0$ and $h_{kj} \ge 0$.

In general an exact factorization isn't possible, but we'll do the best we can, minimizing

$$||V - WH||^2 = \sum_{ij} (V_{ij} - \sum_k W_{ik} H_{kj})^2$$

again with the constraint that the components are non-negative.

Note that the number of topics $r$ is a hyperparameter; we can choose how many topics we want.
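
To make the objective concrete, here is a small sketch that evaluates it two ways for random non-negative factors: once with the matrix norm and once with the explicit double sum from the formula. The sizes `m`, `n`, `r` and the random matrices are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
m, n, r = 6, 4, 2          # arbitrary sizes for illustration

V = rng.rand(m, n)         # matrix to approximate
W = rng.rand(m, r)         # non-negative by construction
H = rng.rand(r, n)

# Squared Frobenius norm of the residual
loss_matrix = np.linalg.norm(V - W.dot(H)) ** 2

# The same quantity written as the explicit sum over i, j and k
loss_sum = sum(
    (V[i, j] - sum(W[i, k] * H[k, j] for k in range(r))) ** 2
    for i in range(m) for j in range(n)
)

print(loss_matrix, loss_sum)
```

Both numbers agree up to floating-point error, which is all the formula says: the norm is shorthand for the sum of squared entrywise errors.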

**Question** What would the value of $r$ have to do with the bias-variance tradeoff?

**Question** What would we lose compared to SVD?

## Solving NMF

There are a couple approaches to solving NMF.

**Alternating Least Squares** involves solving first for one matrix while holding the other constant, then the other, back and forth until it converges.

**Stochastic Gradient Descent** minimizes the reconstruction error with respect to the components of the matrices, using the previously discussed algorithm.
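
As a rough illustration of the gradient-based approach, the sketch below uses full-batch projected gradient descent rather than true stochastic updates: it takes a gradient step on $||V - WH||^2$ for both factors, then clips negative entries to zero. The step size, iteration count and matrix sizes are assumptions chosen for this toy example, not tuned values.

```python
import numpy as np

rng = np.random.RandomState(0)
V = rng.rand(6, 8)           # toy data
r = 3                        # number of topics
W = rng.rand(6, r)
H = rng.rand(r, 8)
step = 0.01                  # assumed fixed step size for this toy problem

initial_error = np.linalg.norm(V - W.dot(H)) ** 2
errors = []
for _ in range(200):
    R = V - W.dot(H)                 # residual
    grad_W = -2 * R.dot(H.T)         # gradient of ||V - WH||^2 w.r.t. W
    grad_H = -2 * W.T.dot(R)         # gradient of ||V - WH||^2 w.r.t. H
    # Gradient step followed by projection onto the non-negative entries
    W = np.maximum(W - step * grad_W, 0)
    H = np.maximum(H - step * grad_H, 0)
    errors.append(np.linalg.norm(V - W.dot(H)) ** 2)

print(initial_error, errors[-1])
```

A stochastic version would update from one entry (or a small batch of entries) of $V$ at a time instead of the whole residual.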

## Alternating Least Squares

First, some review.

### Exact solution for a system of linear equations
$$ Ax = b$$

$$ \begin{bmatrix} 1 & 2 \\ -3 & 4 \end{bmatrix} \left[ \begin{array}{c} x_1 \\ x_2 \end{array} \right] = \left[ \begin{array}{cc} 7 \\ -9 \end{array} \right] $$

There are two unknowns ($x_1$ and $x_2$) and two equations ($x_1 + 2x_2 = 7$ and $-3x_1 + 4x_2 = -9$) so (usually) there is one solution.

In [None]:
A = np.array([[1, 2], [-3, 4]])
b = np.array([7, -9])

print(np.linalg.solve(A, b))

### Least-squares solver

What if we have an overdetermined system of linear equations? E.g.

$$ \begin{bmatrix} 1 & 2 \\ -3 & 4 \\ 1 & -4 \end{bmatrix} \left[ \begin{array}{c} x_1 \\ x_2 \end{array} \right] = \left[ \begin{array}{cc} 7 \\ -9 \\ 17 \end{array} \right] $$

An exact solution is not guaranteed, so we must do something else. Least Squares dictates that we find the $x$ that minimizes the residual sum of squares (RSS).

(Note: This is the solver we use when doing Linear Regression!)

In [None]:
A = np.array([[1, 2], [-3, 4], [1, -4]])
b = np.array([7, -9, 17])

# lstsq returns (solution, residual sum of squares, rank, singular values)
x, rss = np.linalg.lstsq(A, b)[:2]
print(x)
print("Residual sum of squares (error): {}".format(rss))
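
For reference, the least-squares solution also satisfies the normal equations $A^T A x = A^T b$, which is the form usually derived for linear regression. The sketch below checks that both routes give the same answer on the system above; `x_normal` and `x_lstsq` are names introduced here.

```python
import numpy as np

A = np.array([[1., 2.], [-3., 4.], [1., -4.]])
b = np.array([7., -9., 17.])

# Solve the normal equations A^T A x = A^T b directly
x_normal = np.linalg.solve(A.T.dot(A), A.T.dot(b))

# Compare with the least-squares solver
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(x_normal)
print(x_lstsq)
```

In practice `lstsq` is preferred, since forming $A^T A$ explicitly can lose numerical precision when $A$ is poorly conditioned.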

### Non-negative least-squares solver

What if you want to constrain the solution to be non-negative?

We have optimizers for that too!

In [None]:
from scipy.optimize import nnls

A = np.array([[1, 2], [-3, 4], [1, -4]])
b = np.array([7, -9, 17])

# nnls returns the solution and the residual norm (not squared)
x, rnorm = nnls(A, b)
print(x)
print("Residual sum of squares (error): {}".format(rnorm ** 2))
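
It can be tempting to get a non-negative answer by solving the unconstrained problem and zeroing out the negative entries. The sketch below compares that shortcut with `nnls` on the same system; the helper `rss` and the variable names are introduced here for illustration.

```python
import numpy as np
from scipy.optimize import nnls

A = np.array([[1., 2.], [-3., 4.], [1., -4.]])
b = np.array([7., -9., 17.])

def rss(x):
    """Residual sum of squares for a candidate solution x."""
    return np.sum((A.dot(x) - b) ** 2)

x_free = np.linalg.lstsq(A, b, rcond=None)[0]   # unconstrained solution
x_clipped = np.clip(x_free, 0, None)            # negatives set to zero
x_nnls = nnls(A, b)[0]                          # constrained optimum

print(x_free, rss(x_free))
print(x_clipped, rss(x_clipped))
print(x_nnls, rss(x_nnls))
```

The unconstrained solution has a negative second component. Clipping it gives a feasible point, but `nnls` finds a different non-negative point with a lower error, so clipping is only an approximation to the constrained problem.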

### Alternating Least Squares

**Question** Given matrices $A$ and $B$, least squares and non-negative least squares find the solution $X$ that minimizes the error (RSS) in $A \cdot X = B$. So can you guess what alternating least squares is, and how might we apply it to NMF?

In [None]:
# Implement the first two steps of alternating least squares

# Set up our matrix V we want to decompose
V = np.random.rand(10,15)

# Initialize a random matrix W
W = np.random.rand(10,5)

# Solve for H using a least squares solver
H = np.linalg.lstsq(W, V)[0]

# Clip H so there are no negative values
H[H < 0] = 0

# Print the current error. Why did the error go up?
print(np.linalg.norm(V - np.dot(W,H)))

In [None]:
# Solve for W using H
W = np.linalg.lstsq(H, V)

Dang, that blew up... Let's do some math to figure out why and what we could have done.

**Question** `np.linalg.lstsq(W, V)` solves $W \cdot H = V$ for $H$. To solve for $W$ the same way we would need an equation of the form $H \cdot W = V$, which is invalid given the dimensions of $H$ and $W$. What can we do to our matrices to fix this problem?

**Exercise** Using the answer provided, go ahead and solve for W and print out the new error

In [None]:
# Your code goes here

With ALS, we continue alternating, solving for $H$ in terms of $W$ and then $W$ in terms of $H$, until
we are "satisfied" with our result (low enough error) or reach some maximum number of iterations.

### General vs. non-negative least squares solver

Non-negative least squares solver:
    
 * Returns result with least squares error given non-negativity constraint
 * While alternating, converges to a local minimum
 * Orders of magnitude slower than general least squares solver

General least squares solver:
    
 * Returns result with least squares error with no constraints
 * While alternating, converges to a stationary point (saddle point or minimum)
 * Much much faster
 * Have to clip the matrix at every iteration to ensure non-negativity
   
In industry the general least squares solver is commonly used. The tradeoff between speed and strong convergence seems to be worth it. For more information check out: http://users.wfu.edu/plemmons/papers/BBLPP-rev.pdf
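
The general-solver variant with clipping might be sketched as below. This is one possible implementation under simple assumptions (random initialization, a fixed number of iterations, no convergence test); the function name `als_nmf` and its parameters are made up for this example. Note that it includes the transpose step from the exercise above.

```python
import numpy as np

def als_nmf(V, r, n_iter=50, seed=0):
    """Clipped alternating least squares for V ~ W.dot(H) (a sketch)."""
    rng = np.random.RandomState(seed)
    W = rng.rand(V.shape[0], r)          # random non-negative start
    errors = []
    for _ in range(n_iter):
        # Solve W H = V for H, then clip negatives
        H = np.linalg.lstsq(W, V, rcond=None)[0]
        H[H < 0] = 0
        # Solve H^T W^T = V^T for W^T, transpose back, then clip
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T
        W[W < 0] = 0
        errors.append(np.linalg.norm(V - W.dot(H)))
    return W, H, errors

V = np.random.RandomState(1).rand(10, 15)
W, H, errors = als_nmf(V, r=5)
print(errors[0], errors[-1])
```

Because of the clipping, the error is not guaranteed to decrease at every iteration, which is the weaker convergence behavior noted in the list above.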

## NMF for topic analysis

### Example

Let's look at users' ratings of different movies. The ratings are from 1 to 5. A rating of 0 means the user hasn't watched the movie.

In [None]:
display(M)

And again, we'll try to find the topics.

In [None]:
# Compute NMF
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF

def fit_nmf(r):
    """Fit NMF with r topics and return the reconstruction error."""
    nmf = NMF(n_components=r)
    nmf.fit(M)
    return nmf.reconstruction_err_

error = [fit_nmf(i) for i in range(1,6)]
plt.plot(range(1,6), error)
plt.xticks(range(1, 6))
plt.xlabel('r')
plt.ylabel('Reconstruction Error')

**Question** What might be the optimal r (number of topics) value and why?

In [None]:
# Fit using 2 hidden concepts
nmf = NMF(n_components=2)
nmf.fit(M)
W = nmf.transform(M)
H = nmf.components_
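# reconstruction_err_ is the Frobenius norm of M - WH (the square root of the RSS)
print('Reconstruction error = %.2f' % nmf.reconstruction_err_)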

In [None]:
# Make interpretable
W, H = (np.around(x,2) for x in (W,H))
W = pd.DataFrame(W,index=users)
H = pd.DataFrame(H,columns=movies)

display(W) 
display(H)

**Discussion**
1. What do the concepts (clusters) mean?
2. To which concept(s) does each user/document belong?

In [None]:
# Verify reconstruction
display(np.around(W.dot(H),2))
display(pd.DataFrame(M, index=users, columns=movies))

#### What is concept 0?

In [None]:
# Top 3 movies in concept 0
top_movies = H.iloc[0].sort_values(ascending=False).index[:3]
top_movies

#### Which users align with concept 0?

In [None]:
# Top 2 users for concept 0
top_users = W.iloc[:,0].sort_values(ascending=False).index[:2]
top_users

#### What concepts does Emily align with?

In [None]:
W.loc['Emily']

#### What are all the movies in each concept?

In [None]:
# Movies in each concept
thresh = .2  # movie is included if at least 20% of max weight
for g in range(2):
    all_movies = H.iloc[g,:]
    included = H.columns[all_movies >= (thresh * all_movies.max())]
    print("Concept %i contains: %s" % (g, ', '.join(included)))

#### Which users are associated with each concept?

In [None]:
# Users in each concept
thresh = .2  # user is included if at least 20% of max weight
for g in range(2):
    all_users = W.iloc[:,g]
    included = W.index[all_users >= (thresh * all_users.max())]
    print("Concept %i contains: %s" % (g, ', '.join(included)))

## Additional notes

We can use **regularization** with NMF, adding a cost term for large values in $W$ and $H$, with either an L1 or L2 penalty.
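
As a sketch of what this looks like, one common way to write the L2-regularized objective is shown below. The weight $\lambda$ is a symbol introduced here, and implementations differ in how they scale and split the penalty between $W$ and $H$.

$$||V - WH||^2 + \lambda \left( \sum_{ik} W_{ik}^2 + \sum_{kj} H_{kj}^2 \right)$$

With an L1 penalty the squared entries are replaced by $\sum_{ik} W_{ik} + \sum_{kj} H_{kj}$. No absolute values are needed because the entries are already non-negative. The L1 form tends to push more entries to exactly zero.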

Some implementations (not sklearn) allow ignoring **missing values** in the original matrix. In a recommender these would correspond to unrated items.
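
To illustrate what ignoring missing values means, the sketch below computes the error only over observed entries, treating 0 as "unrated" as in the movie example. The tiny matrices are made up for illustration and are not the output of any fitting routine.

```python
import numpy as np

# Toy ratings: 0 marks an unrated item (made-up values)
V = np.array([[4., 0.],
              [0., 2.]])

# A rank-1 non-negative factorization chosen by hand
W = np.array([[2.],
              [1.]])
H = np.array([[2., 2.]])

residual = V - W.dot(H)
mask = V > 0                      # True where a rating was observed

full_rss = np.sum(residual ** 2)            # penalizes the unrated cells too
masked_rss = np.sum(residual[mask] ** 2)    # only the observed ratings

print(full_rss)    # 20.0
print(masked_rss)  # 0.0
```

Here the factorization fits every observed rating exactly, yet the unmasked error is large because it treats the zeros as real ratings. The values that $WH$ produces in the unrated cells are what a recommender would use as predictions.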

## A scatterplot

Because I like them.

This is meant to simulate ratings of five different items. The users come in two groups with different (Poisson) distributions.

In [None]:
import numpy as np
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# generate cluster data
#X, y = make_blobs(centers=2, n_features=2, center_box=(1,3),random_state=1)
#X = np.exp(X)
import scipy.stats as scs
npts = 100
X = np.zeros((npts*2, 5))
y = np.zeros((npts*2,), dtype=int)
X[:npts,0] = scs.poisson(2).rvs(npts)
X[:npts,1] = scs.poisson(3).rvs(npts)
X[:npts,2] = scs.poisson(0).rvs(npts)
X[:npts,3] = scs.poisson(1).rvs(npts)
X[:npts,4] = scs.poisson(1).rvs(npts)

X[npts:,0] = scs.poisson(1).rvs(npts)
X[npts:,1] = scs.poisson(1).rvs(npts)
X[npts:,2] = scs.poisson(4).rvs(npts)
X[npts:,3] = scs.poisson(3).rvs(npts)
X[npts:,4] = scs.poisson(0).rvs(npts)
y[npts:] = 1

In [None]:
nmf_model = NMF(n_components=2)
U = nmf_model.fit_transform(X)
V = nmf_model.components_.T

In [None]:
fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(U[:,0],U[:,1], c=np.array(['r','b'])[y], s=100, alpha=0.3)
ax.set_xlim(xmin=0)
ax.set_ylim(ymin=0)
ax.set_xlabel("topic 1")
ax.set_ylabel("topic 2")

**Extra bonus question** What would be a better distribution for simulating ratings?