# Topic modeling

The topic of today is topic modeling with non-negative matrix factorization. The plan for this document is:

1. Introduce non-negative matrix factorization (NMF for short).

2. Motivate it with topic modeling.

3. Go through the alternating least squares algorithm for NMF.

4. Discuss nuances of NMF and topic modeling.

## Non-negative matrix factorization (NMF)

Today we will be concerned with factorizing a non-negative data matrix $X$ with $m$ features into two non-negative matrices, a weights matrix $W$ and a feature matrix $F$, each with rank less than $m$, such that $X\approx WF$. A matrix $A$ is non-negative if its matrix elements are non-negative, that is, if $A_{ij}\geq 0 \ \ \forall i,j$.

<img src='nmf.png' width = 40% />
Note: this image uses $V$ instead of $X$ for the data matrix and $H$ instead of $F$ for the feature matrix.

To measure how close two matrices are, we need a notion of distance between matrices. For this purpose we use the Frobenius norm. The **Frobenius norm** of a matrix $A$, denoted by $||A||_F$ (the subscript $F$ is not to be confused with the matrix $F$), is defined by $||A||_F = \sqrt{\sum_{ij} A_{ij}^2}$, so that $||A||_F^2 = \sum_{ij} A_{ij}^2$. That is, the squared norm is just the sum of squares of the elements of the matrix $A$.

**Non-negative matrix factorization** is concerned with the problem of minimizing $||X-WF||_F^2$ subject to the constraint that $W$ and $F$ are non-negative. **$||X-WF||_F^2$ is referred to as the reconstruction error**. It is called that because we are reconstructing $X$ with the matrices $W$ and $F$, and $||X-WF||_F^2$ is the amount of error in our reconstruction. It is essentially the residual sum of squares.
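As a minimal sketch of the reconstruction error, the snippet below computes the sum of squared residuals directly and compares it with NumPy's built-in Frobenius norm. The function name `reconstruction_error` and the small matrices are made up for illustration and do not come from any library.

```python
import numpy as np

def reconstruction_error(X, W, F):
    """Squared Frobenius norm of the residual X - WF (hypothetical helper)."""
    residual = X - W @ F
    return float(np.sum(residual ** 2))

# Toy matrices chosen by hand for illustration only.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
W = np.array([[1.0],
              [2.0]])
F = np.array([[1.0, 2.0]])

# W @ F = [[1, 2], [2, 4]], so the residual is [[0, 0], [1, 0]].
error = reconstruction_error(X, W, F)

# The same quantity via NumPy's Frobenius norm, squared.
error_via_norm = float(np.linalg.norm(X - W @ F, ord="fro") ** 2)

print(error, error_via_norm)
```

Both values agree because the squared Frobenius norm is exactly the sum of squared matrix elements.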



### Comparison to the singular value decomposition (SVD)
You should have seen the singular value decomposition by now. There we found out how to approximate a data matrix $X$ of rank $m$ by expressing it as a product $X \approx UDV$ of three matrices $U$, $D$ and $V$ of rank less than $m$. That is, $U$ is a thinner matrix than $X$. NMF is also a matrix factorization, but it differs in two ways: it factorizes into two matrices rather than three, and it requires the matrix elements of the factors to be non-negative, whereas SVD has no such restriction.



## Topic modeling

We have not yet talked about how to find the matrices $W$ and $F$ such that $X\approx WF$. We will. Take it for granted that we have a way of finding $W$ and $F$ by minimizing the reconstruction error. We'll motivate nmf with topic modeling in this section and show how to minimize the reconstruction error later.

The idea behind topic modeling is that observations are often well described by topics. For instance, a book may be a sci-fi book, in which case we should expect a lot of scientific-sounding words. If a book is an adventure, we should expect a lot of adventure-oriented words. This is the first assumption of topic modeling.

[1.] Each topic corresponds to a different distribution of words. Let's denote the frequency of a word in a topic by $tf(word|topic)$. 

The second assumption of topic modeling is that the distribution of words within a document can be represented by a distribution of topics. For instance, a book may be half adventure and half romance, say because half the scenes are adventure scenes and half are romance scenes. In that case, we would expect the distribution of words for that book to lie in between the separate distributions for adventure and romance.

[2.] The term frequencies of a particular document can be expressed as a mixture of topics, where each topic is given some non-negative weight $w(top \ | \ doc)$ for a document $doc$. That is, $tf(word \ | \ doc) = \sum_{top} tf(word \ | \ top)\times w(top \ | \ doc)$.

Assumptions [1.] and [2.] are equivalent to saying that a document term matrix $X$ can be factorized into two non-negative matrices $W$ and $F$, i.e. $X=WF$. 

### Example

Consider the following three documents and their term frequency vectors. The vocabulary is $\{'buffalo','cats','eat'\}$; the words "the" and "and" are not counted.

<table width = 70%>
<tr><td></td><td>(buffalo,cats,eat)</td></tr>
<tr><td>"the buffalo eat"</td><td> $\rightarrow(1,0,1)$</td></tr>
<tr><td>"the cats eat"</td><td> $\rightarrow(0,1,1)$</td></tr>
<tr><td>"the buffalo eat and the cats eat"</td><td> $\rightarrow(1,1,2)$</td></tr>
</table>

This corpus corresponds to the following document term matrix. 

$ X=  \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix} $ 

The third row is just the sum of the first two, which makes sense since the third document is the first two documents concatenated together. We can express this matrix as a product of two matrices.

$ X = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} = WF$ 

We can identify $W$ and $F$ as:

$W = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}$

and 

$F = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}$. 
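The factorization above can be checked numerically. The following is a small sketch using NumPy; the variable names `product`, `is_exact` and `is_non_negative` are chosen here for illustration.

```python
import numpy as np

# Document-term matrix: one row per document, one column per vocabulary word.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 2]])

# Candidate factors from the text: W is 3x2 and F is 2x3.
W = np.array([[1, 0],
              [0, 1],
              [1, 1]])
F = np.array([[1, 0, 1],
              [0, 1, 1]])

product = W @ F

# The factorization is exact and both factors are non-negative.
is_exact = np.array_equal(product, X)
is_non_negative = bool((W >= 0).all() and (F >= 0).all())

print(product)
print(is_exact, is_non_negative)
```

Note the shapes: $W$ has one row per document and one column per latent feature, while $F$ has one row per latent feature and one column per word.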

The rows of $F$ are the distribution of words for each topic. The rows of $W$ are the amount that each document is about each topic. For instance, the first row of $W$ says that the first row of the data corresponds only to topic 1 and thus to row 1 of $F$; the second row of $W$ says that the second row of the data corresponds only to topic 2 and thus to row 2 of $F$; the third row of $W$ says that the third row of the data corresponds to both topic 1 and topic 2, and thus that row 3 of the data is the sum of row 1 and row 2 of $F$.

In this example we applied non-negative matrix factorization to topic modeling. In this case the columns of $W$ and the rows of $F$ correspond to topics. In general, when doing matrix factorization, these rows and columns are called **latent features**. For instance, the first column of $W$ holds each document's weight on the first latent feature, and the first row of $F$ is the distribution of words corresponding to the first latent feature.

We can inspect the term frequencies in the rows of $F$ to interpret the latent features. In this case, it seems that the first latent feature could be about large mammals, e.g. buffalo, and the second latent feature might correspond to small animals, e.g. cats. The interpretation of these latent features is up to us.

In the previous example, there was an exact factorization in terms of two latent features, large and small mammals, whereas there were three original features, "buffalo", "cats" and "eat". Thus topic modeling with non-negative matrix factorization is a way of reducing the number of features that we have in our model.

In general, we will not be able to factor the data matrix exactly. For instance, if the third document was "the buffalo and the cats eat" instead of "the buffalo eat and the cats eat", then the third row of the document matrix would be $\begin{bmatrix} 1 & 1 & 1 \end{bmatrix}$ instead of $\begin{bmatrix} 1 & 1 & 2 \end{bmatrix}$ and the data matrix would no longer be exactly factorizable in terms of two latent features (or topics). We can see this because the three rows are now linearly independent (no linear combination of row 1 and row 2 gives row 3), and thus $X$ is a full-rank matrix.

$  X = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} $.
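The rank argument can be confirmed numerically. This is a short sketch using `np.linalg.matrix_rank`; the names `X_exact` and `X_modified` are labels introduced here for the two versions of the data matrix.

```python
import numpy as np

# Original corpus: the third row is the sum of the first two.
X_exact = np.array([[1, 0, 1],
                    [0, 1, 1],
                    [1, 1, 2]])

# Modified corpus: the third document is "the buffalo and the cats eat".
X_modified = np.array([[1, 0, 1],
                       [0, 1, 1],
                       [1, 1, 1]])

rank_exact = np.linalg.matrix_rank(X_exact)
rank_modified = np.linalg.matrix_rank(X_modified)

print(rank_exact, rank_modified)  # prints: 2 3
```

A rank of 3 means no exact factorization with two latent features exists, so the best we can do with $k=2$ is an approximation.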

In general we will seek to get as close as possible to the data matrix by minimizing the reconstruction error. We can use the figure below to interpret how this is done. The two topics correspond to the term frequency vectors labeled with black dots. The term frequency vectors for the documents are labeled with circles. The matrix factorization will seek to approximate these documents by finding a vector within the span of the two topic vectors such that the reconstructed document vector is as close as possible to the original document. We discuss this in the next section.

### Interpretation of topics

<img src='geometric_interpretation.png' width = 50% /> 
Note: if we normalize the document term frequencies, then the documents lie on a plane. If we also normalize the topic term frequencies and topic affinities with the 1-norm, then the reconstructed vectors will lie between the two topic vectors as shown in the figure. However, to be clear, the discussion so far has not assumed such a normalization.


Once we have a factorization, we can visualize our latent features by highlighting each word with a color that corresponds to the topic that has the highest term frequency for that word. In the figure below, notice how each color corresponds to a different theme. Green is about government, creation and people. Red is about dedication, devotion, work and freedom. Yellow is about war, causes and death.

<img src='http://www.themacroscope.org/wp-content/uploads/2013/08/gettysburg-markup1.png' width=70% />





## The alternating least squares algorithm for non-negative matrix factorization

We now describe the alternating least squares algorithm for non-negative matrix factorization. Just to remind you, we are trying to minimize $\| X - WF \|_F^2$ for a fixed non-negative $X$ subject to the constraint that $W$ and $F$ are non-negative. The algorithm involves the following steps.

[-1.] Choose $k$, the number of latent features.

One way to do this is to look at the reconstruction error for each $k$ and see when raising $k$ no longer decreases the reconstruction error much. Another way is to choose different values for $k$ and see if you can interpret the latent features by inspection.

[0.] Initialize the matrix $W$ with positive random values. 

[1.] Minimize the reconstruction error with respect to $F$, holding $W$ constant: $F = [W^TW]^{-1} W^T X$

* [1.5] Set any values of $F$ to zero if they are negative. That is, if $F_{ij}<0$, set $F_{ij}=0$. 

If you inspect the cost function, you will see that for fixed $W$, the cost function is quadratic in the matrix elements $F_{ij}$. Setting the first derivative of the cost function with respect to $F_{ij}$ to zero leads to the equation in [1.]. We then cut off any negative matrix elements.

[2.] Minimize the reconstruction error with respect to $W$, holding $F$ constant: $W = X F^T [FF^T]^{-1}$.

* [2.5] Set any values of $W$ to zero if they are negative. That is, if $W_{ij}<0$, set $W_{ij}=0$. 

[3.] Go back to [1.] until some stopping criterion has been met, e.g. a maximum number of iterations has been reached, the cost has stopped decreasing, or the gradient of the cost function has a magnitude that is close to zero.
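The steps above can be sketched in NumPy as follows. This is an illustrative sketch, not a production implementation: the function names `nmf_als` and `reconstruction_error`, the fixed iteration count used as the only stopping criterion, and the seed are all assumptions made here. It also uses `np.linalg.lstsq` instead of forming $[W^TW]^{-1}$ explicitly; the two agree when $W$ has full column rank, and `lstsq` stays well defined when it does not. Because of the clipping in steps [1.5] and [2.5], the error is not guaranteed to decrease at every iteration.

```python
import numpy as np

def reconstruction_error(X, W, F):
    """Squared Frobenius norm of X - WF (hypothetical helper)."""
    return float(np.sum((X - W @ F) ** 2))

def nmf_als(X, k, n_iter=200, seed=0):
    """Clipped alternating least squares for X ~ WF with k latent features."""
    rng = np.random.default_rng(seed)
    # Step [0.]: initialize W with positive random values.
    W = rng.random((X.shape[0], k)) + 1e-3
    F = np.zeros((k, X.shape[1]))
    for _ in range(n_iter):
        # Step [1.]: least squares solution of W F = X for F, with W fixed.
        F = np.linalg.lstsq(W, X, rcond=None)[0]
        # Step [1.5]: set negative entries of F to zero.
        F = np.clip(F, 0.0, None)
        # Step [2.]: least squares solution of F^T W^T = X^T for W, with F fixed.
        W = np.linalg.lstsq(F.T, X.T, rcond=None)[0].T
        # Step [2.5]: set negative entries of W to zero.
        W = np.clip(W, 0.0, None)
    return W, F

# The buffalo/cats/eat document-term matrix from the example.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

# Step [-1.]: compare the reconstruction error for several values of k.
for k in (1, 2, 3):
    W, F = nmf_als(X, k)
    print(k, reconstruction_error(X, W, F))
```

Running the loop for several values of $k$ is one way to look for the point where adding latent features stops reducing the error much. Different seeds can give different factors.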

## Random topics

### Sparseness

Non-negative matrix factorization tends to lead to sparse matrices. This is because whenever the solution corresponds to a component being negative, we just set it equal to zero.

<img src='sparseness.png' width=50% />



### Regularization

There are many possible solutions for $W$ and $F$. For instance, we can replace $W$ with $W/1000$ and replace $F$ with $1000F$ and still have a minimum. One way to handle this is to add regularization terms so that large values of $F$ and large values of $W$ are penalized. L2 and L1 regularization are two natural choices:


L2: $\phi = \| X - WF \|_F^2 + \frac{\lambda}{2} \left( \|W\|_F^2 + \|F\|_F^2 \right)$

L1: $\phi = \| X - WF \|_F^2 + \frac{\lambda}{2} \left( \sum_{ij} |W_{ij}| + \sum_{ij} |F_{ij}| \right)$

However, there are still many solutions for $W$ and $F$ that minimize the cost function. For instance, if we have a solution $W$ and $F$, we can swap column $i$ and column $j$ of $W$ while simultaneously swapping row $i$ and row $j$ of $F$. This is equivalent to changing our labeling of the latent features. The take-away is that minimizing the cost function will likely lead to a different factorization each time.
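Both kinds of non-uniqueness can be illustrated on the exact factorization from the earlier example, written here with the data matrix as $X$ and the factors as $W$ and $F$. The helper `l2_objective` is a hypothetical name for the L2-regularized cost, and $\lambda = 1$ is an arbitrary choice for the demonstration.

```python
import numpy as np

def l2_objective(X, W, F, lam):
    """Reconstruction error plus (lam / 2) times the squared Frobenius norms."""
    fit = np.sum((X - W @ F) ** 2)
    penalty = 0.5 * lam * (np.sum(W ** 2) + np.sum(F ** 2))
    return float(fit + penalty)

X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
F = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# Rescaling leaves the product unchanged but changes the L2 penalty.
W_scaled, F_scaled = W / 1000, F * 1000
same_fit = np.allclose(W_scaled @ F_scaled, X)
objective_original = l2_objective(X, W, F, lam=1.0)
objective_scaled = l2_objective(X, W_scaled, F_scaled, lam=1.0)

# Relabeling the latent features leaves both the fit and the penalty unchanged.
order = [1, 0]
W_swapped, F_swapped = W[:, order], F[order, :]
same_fit_after_swap = np.allclose(W_swapped @ F_swapped, X)
objective_swapped = l2_objective(X, W_swapped, F_swapped, lam=1.0)

print(same_fit, objective_original, objective_scaled)
print(same_fit_after_swap, objective_swapped)
```

The regularizer distinguishes the rescaled solution from the original one, but it cannot distinguish between relabelings of the latent features.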

### New data

You can use non-negative matrix factorization to reduce the number of features before training a model. That is, you do matrix factorization (unsupervised learning) and then use the rows of $W$ as inputs for supervised learning with another model $M$. For this to work, we have to be able to find the latent feature affinities for new rows of data. If we factorize again with the new data included, the latent features, and hence our training set, will have changed. One way to handle this is to keep the same $F$. Then if we have a new row of data $x$, we find a row vector $w$ such that $\|x-wF\|_2^2$ is minimized with the constraint that $w_i\geq 0$. This $w$ can then serve as an input into our model $M$.
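One way to solve this small constrained problem is projected gradient descent, sketched below. The function name `project_new_row`, the iteration count and the step size rule are choices made for this illustration; a dedicated non-negative least squares solver would work as well.

```python
import numpy as np

def project_new_row(x, F, n_iter=500):
    """Find w >= 0 that minimizes ||x - w F||_2^2 for a fixed F.

    Uses projected gradient descent with step size 1 / L, where
    L = 2 * sigma_max(F)^2 is the Lipschitz constant of the gradient.
    """
    step = 1.0 / (2.0 * np.linalg.norm(F, ord=2) ** 2)
    w = np.zeros(F.shape[0])
    for _ in range(n_iter):
        gradient = 2.0 * (w @ F - x) @ F.T
        # Take a gradient step, then project back onto w >= 0.
        w = np.maximum(w - step * gradient, 0.0)
    return w

# Topic matrix from the buffalo/cats/eat example.
F = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# A new document equal to the sum of both topics.
w_mixed = project_new_row(np.array([1.0, 1.0, 2.0]), F)

# A new document whose unconstrained fit would need a negative weight.
w_clipped = project_new_row(np.array([2.0, 0.0, 1.0]), F)

print(w_mixed, w_clipped)
```

For the first document the result is approximately $(1, 1)$. For the second, the unconstrained least squares solution is $(5/3, -1/3)$, so the constraint is active and the result is approximately $(1.5, 0)$.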

### Soft clustering

The k-means algorithm involves clustering data into disjoint sets. A point is entirely in one cluster or entirely in another cluster. Non-negative matrix factorization is a way of doing **soft clustering** where a point can be slightly in one set and slightly in another. 

In the k-means algorithm we are trying to minimize $\|X-WF\|_F^2$, where the rows of $F$ are the $k$ means and the rows of $W$ are binary indicator vectors: each row has a single one marking the cluster that the corresponding row of data belongs to, and zeros elsewhere. For instance, if there were two clusters, $k=2$, and the means were $u_1$ and $u_2$, then the factorization would look like the following:

$X \approx \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \end{bmatrix} \begin{bmatrix} [ \cdots u_1 \cdots ] \\ [ \cdots u_2 \cdots ]  \end{bmatrix}$

The difference between k-means and non-negative matrix factorization is that $W$ is a binary matrix for k-means and a non-negative real-valued matrix for non-negative matrix factorization. In NMF, the rows of $W$ tell us to what degree a row of data belongs to one or another cluster.
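To make the k-means factorization concrete, here is a small sketch that builds the binary $W$ and the matrix of means $F$ from a given cluster assignment. The data and the `labels` array are invented for illustration; the labels are supplied by hand rather than computed by running k-means.

```python
import numpy as np

# Three data points in two dimensions (made-up values).
X = np.array([[1.0, 1.0],
              [1.0, 3.0],
              [5.0, 5.0]])

# Hypothetical hard assignment: the first two rows are in cluster 0, the last in cluster 1.
labels = np.array([0, 0, 1])
k = 2

# W: one indicator row per data point, with a single 1 in the column of its cluster.
W = np.eye(k)[labels]

# F: one row per cluster, holding the mean of the points assigned to it.
F = np.vstack([X[labels == j].mean(axis=0) for j in range(k)])

# Each row of W @ F is the mean of the cluster that row belongs to.
approximation = W @ F
error = float(np.sum((X - approximation) ** 2))

print(W)
print(F)
print(error)
```

Here $F$ has rows $(1, 2)$ and $(5, 5)$ and the squared error is 2. Replacing the indicator rows of $W$ with arbitrary non-negative weights is what turns this hard clustering into the soft clustering of NMF.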



### NMF vs. SVD vs. PCA vs. K-means

<table>
<tr><td></td><th>NMF</th><th>SVD</th><th>PCA</th><th>K-means</th></tr>
<tr><td>Reduced features space</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
<tr><td>Interpretation as ranking of topics</td><td>Yes</td><td>No</td><td>No</td><td>No</td></tr>
<tr><td>Categorical</td><td>No</td><td>No</td><td>No</td><td>Yes</td></tr>
<tr><td>Sparse</td><td>Yes</td><td>No</td><td>No</td><td>No</td></tr>
<tr><td>Orthogonal</td><td>No</td><td>Yes</td><td>Yes</td><td>N/A</td></tr>
</table>

## Summary and discussion

Non-negative matrix factorization is when you approximate a data matrix $X$ with a product of two non-negative matrices, a weights matrix $W$ and a feature matrix $F$, such that $X\approx WF$. This factorization arises naturally in topic modeling. We can find these factors using alternating least squares, among many alternative methods. NMF tends to lead to sparse matrices. Regularization penalizes large matrix elements, which removes the freedom to rescale $W$ and $F$ against each other. We can use NMF as a way to do feature reduction. NMF is also a way of doing soft clustering, where a row of data is partially in more than one cluster. NMF, SVD, PCA and k-means are all ways of doing feature reduction. NMF is very interpretable, but SVD and PCA are convenient to work with mathematically because they lead to orthogonal representations of latent features.

Geometric-interpretation and highlighted-words diagrams from:

https://courses.engr.illinois.edu/cs598jhm/sp2010/Slides/Lecture06.pdf