In [23]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline

# Topic Modeling: Non-negative Matrix Factorization (NMF)

## Objectives

 * define "topic modeling"
 * use matrix factorization to find a reduced-dimension approximation to the data
 * interpret the new dimensions as vectors that group together the original features into "latent topics" or "latent features"
 * implement an alternating-least-squares algorithm to solve NMF

## What is topic modeling?

Say we have a matrix of word count vectors for a large corpus of **unlabeled documents**:

|doc_id | word_1 | word_2 | ... | word_10000 |
|--|--|--|--|--|
|**1**|0|11|...|0|
|**2**|0|0|...|0|
|**3**|0|3|...|2|

How could we find out what these documents are _about_? Even though we don't have labels for these documents, is there a way to sort out documents about sports from documents about music? Is there a way to discover **latent (hidden) topics** in the corpus?

Topic modeling is a kind of **unsupervised learning**: an attempt to distinguish structure in unlabeled data. 

Today we'll see a mathematical transformation that turns the feature matrix above into two smaller matrices, something like the following:

- a new, reduced-dimension feature matrix where the features are "topic weights"

|doc_id | topic_A | topic_B | topic_C |
|--|--|--|--|
|**1**|0.2|0.1|1.1|
|**2**|1.3|0|0.4|
|**3**|0.4|1.4|0.2|

- a matrix of word (original feature) to topic (latent feature) weights

| &nbsp; | word_1 | word_2 | ... | word_10000 |
|--|--|--|--|--|
|**topic_A**|0.01|0|...|0|
|**topic_B**|0.2|0.1|...|0.3|
|**topic_C**|0|1.3|...|0|

#### What does this give us?
- Each document has a strength of association with each topic. This is a **soft clustering** of documents into topics.
- Each word has a weight associated with each topic; the topics also represent clusters of features (words).

#### What does this lack?
- We still don't know what each topic **is**. This still requires a human to read the words associated with each topic and assign some kind of label.
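Reading the two matrices can be sketched in a few lines of NumPy. The document-topic weights below are copied from the example table above; the five-word vocabulary and the topic-word weights are hypothetical toy values invented for illustration, since the tables elide most columns.

```python
import numpy as np

# Document-to-topic weights, copied from the example table above.
doc_topic = np.array([[0.2, 0.1, 1.1],
                      [1.3, 0.0, 0.4],
                      [0.4, 1.4, 0.2]])
topic_names = ["topic_A", "topic_B", "topic_C"]

# Hypothetical toy vocabulary and topic-to-word weights (not from the tables).
vocab = np.array(["goal", "team", "guitar", "album", "vote"])
topic_word = np.array([[1.2, 0.9, 0.0, 0.1, 0.0],
                       [0.0, 0.1, 1.4, 1.1, 0.0],
                       [0.1, 0.0, 0.0, 0.2, 1.5]])

# Soft clustering: every document has a weight for every topic,
# but we can still ask which topic is strongest for each document.
dominant = [topic_names[i] for i in doc_topic.argmax(axis=1)]

# The two highest-weighted words per topic, for a human to read and label.
top_words = [list(vocab[np.argsort(-row)[:2]]) for row in topic_word]

print(dominant)
print(top_words)
```

The labeling step stays manual: the code only surfaces the heaviest words, and a person still has to decide that "goal, team" means sports.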

# Example: unbaking a cake

Say you have the following information:
- a cake recipe that says "2 cups flour, 1 cup brown sugar"
- nutritional facts for flour and sugar:
  - "1 cup of flour contains 90g carbohydrates, 13g protein, 1g fat"
  - "1 cup of brown sugar contains 200g carbohydrates, 1g protein, 0g fat"

How do you get the nutritional content of the cake? Easy: linear algebra. The cake is a linear combination of ingredients, and each ingredient is a linear combination of nutrients.

Let's call the recipe vector $\vec{w} = \begin{bmatrix} w_{flour} & w_{sugar} \end{bmatrix} = \begin{bmatrix} 2 & 1 \end{bmatrix} $

And let's make a matrix of the nutritional content of each ingredient:
$$ H = \begin{bmatrix} f_{carb} & f_{prot} & f_{fat} \\ s_{carb} & s_{prot} & s_{fat} \end{bmatrix} = 
\begin{bmatrix} 90 & 13 & 1 \\ 200 & 1 & 0 \end{bmatrix}
$$

Then the cake vector is
$$ \vec{v} = \begin{bmatrix} v_{carb} & v_{prot} & v_{fat} \end{bmatrix} 
= \vec{w}H 
=\begin{bmatrix} 2 & 1 \end{bmatrix} 
\begin{bmatrix} 90 & 13 & 1 \\ 200 & 1 & 0 \end{bmatrix}
= \begin{bmatrix} 380 & 27 & 2 \end{bmatrix} 
$$

I know this seems trivial so far, but bear with me. Say we had many different recipes: $\vec{w_1}, \vec{w_2}, \ldots$. Each would give a different cake: $\vec{v_1}, \vec{v_2}, \ldots$. We can write all this in matrix form: $V = WH$

$$\begin{bmatrix}
    v_{1,carb}   & v_{1,prot} & v_{1,fat}\\
    v_{2,carb}   & v_{2,prot} & v_{2,fat}\\
    v_{3,carb}   & v_{3,prot} & v_{3,fat}\\
    \vdots       & \vdots & \vdots
\end{bmatrix}
=
\begin{bmatrix}
    w_{1,flour}  & w_{1,sugar} \\
    w_{2,flour}  & w_{2,sugar} \\
    w_{3,flour}  & w_{3,sugar} \\
    \vdots       & \vdots \\
\end{bmatrix}
\cdot
\begin{bmatrix} 
f_{carb} & f_{prot} & f_{fat} \\ s_{carb} & s_{prot} & s_{fat} 
\end{bmatrix} 
$$
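As a quick numerical check of the products above, here is a small sketch. The single recipe and the nutrition matrix come from the text; the extra rows of `W_demo` are hypothetical recipes made up for illustration.

```python
import numpy as np

# Nutrition per cup: rows are flour and sugar, columns are carb, protein, fat.
H_demo = np.array([[90, 13, 1],
                   [200, 1, 0]])

# One recipe: 2 cups flour, 1 cup brown sugar.
w_demo = np.array([2, 1])
v_demo = w_demo @ H_demo
print(v_demo)  # [380  27   2]

# Several recipes stacked as rows (the second and third are hypothetical).
W_demo = np.array([[2.0, 1.0],
                   [1.0, 1.0],
                   [0.5, 2.0]])
V_demo = W_demo @ H_demo  # each row is one cake's nutrition vector
print(V_demo)
```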

## problem 1: unknown recipes
Say we know the nutritional information for each cake, but we don't know the recipes. In fact, the true recipes might even contain other ingredients. Can we solve for a "best guess" flour & sugar recipe for each cake?

$$\begin{bmatrix}
    290   & 14 & 1\\
    380   & 27 & 1\\
    120   & 7 & 0.5\\
    \vdots       & \vdots & \vdots
\end{bmatrix}
\approx
\begin{bmatrix}
    w_{1,flour}  & w_{1,sugar} \\
    w_{2,flour}  & w_{2,sugar} \\
    w_{3,flour}  & w_{3,sugar} \\
    \vdots       & \vdots \\
\end{bmatrix}
\cdot
\begin{bmatrix} 
90 & 13 & 1 \\ 
200 & 1 & 0
\end{bmatrix} 
$$

The impossibility of an exact solution becomes apparent when we try to solve for $\vec{w_2}$:
$$
\begin{align}
380 &= 90w_{2f} + 200w_{2s} \\
27 &= 13w_{2f} + 1w_{2s} \\
1 &= 1w_{2f} + 0w_{2s} \\
\end{align}
$$

This is an overdetermined system of equations. There is no set of values for $w_{2f},w_{2s}$ that will make all three equations true. So let's get as close as possible using a "least squares" estimate.

Let $$\vec{v_2}_{true} = \begin{bmatrix} 380 \\ 27 \\ 1 \end{bmatrix}$$

and our estimate 
$$\hat{\vec{v_2}} = 
\begin{bmatrix} 
90 & 200  \\ 
13 & 1  \\
1 & 0
\end{bmatrix}
\begin{bmatrix} 
w_{2,flour}  \\ 
w_{2,sugar} \\
\end{bmatrix}$$


The least squares estimate for $\vec{w_2}$ is the vector that minimizes the squared error
$$ ||\vec{v_2}_{true} - \hat{\vec{v_2}}||^2 = \sum_{i\in\{carb,prot,fat\}}(v_{2true, i} - \hat{v}_{2,i})^2$$
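
For reference, this minimization has a closed-form solution when the coefficient matrix has full column rank. Writing $A = H^T$ (a name introduced here only for brevity) and treating $\vec{w_2}$ as a column vector, setting the gradient of the squared error to zero gives the normal equations:

$$ A^T A \, \vec{w_2} = A^T \vec{v_2}_{true}
\quad\Rightarrow\quad
\vec{w_2} = (A^T A)^{-1} A^T \vec{v_2}_{true} $$

In practice a library routine is usually preferable to forming $(A^T A)^{-1}$ explicitly, because it is more numerically stable.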

Lucky for us, least squares optimization is an ancient problem. Let's use the implementation in `numpy.linalg.lstsq`

In [64]:
np.linalg.lstsq?

Signature: np.linalg.lstsq(a, b, rcond='warn')
Docstring:
Return the least-squares solution to a linear matrix equation.

Solves the equation `a x = b` by computing a vector `x` that
minimizes the Euclidean 2-norm `|| b - a x ||^2`.  The equation may
be under-, well-, or over- determined (i.e., the number of
linearly independent rows of `a` can be less than, equal to, or
greater than its number of linearly independent columns).  If `a`
is square and of full rank, then `x` (but for round-off error) is
the "exact" solution of the equation.

Parameters
----------
a : (M, N) array_like
    "Coefficient" matrix.
b : {(M,), (M, K)} array_like
    Ordinate or "dependent variable" values. If `b` is two-dimensional,
    the least-squares solution is calculated for each of the `K` columns
    of `b`.
rcond : float,

In [65]:
vtrue = np.array([[380],[27],[1]])
vtrue

array([[380],
       [ 27],
       [  1]])

In [66]:
vtrue.shape

(3, 1)

In [67]:
H = np.array([[90,13,1],[200,1,0]])
H

array([[ 90,  13,   1],
       [200,   1,   0]])

In [68]:
H.T

array([[ 90, 200],
       [ 13,   1],
       [  1,   0]])

In [69]:
np.linalg.lstsq(H.T, vtrue, rcond=None)

(array([[1.99369079],
        [1.00284112]]),
 array([0.99369079]),
 2,
 array([219.40669262,  11.47620288]))

In [70]:
w_t, resid, rank, singular_values = np.linalg.lstsq(H.T, vtrue, rcond=None)

In [71]:
w_t

array([[1.99369079],
       [1.00284112]])

In [72]:
w_t.T

array([[1.99369079, 1.00284112]])

In [73]:
resid

array([0.99369079])

In [74]:
vhat = H.T.dot(w_t)
vhat

array([[380.00039589],
       [ 26.92082145],
       [  1.99369079]])

In [75]:
vtrue

array([[380],
       [ 27],
       [  1]])
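The `resid` value returned by `lstsq` should be the sum of squared differences between `vtrue` and `vhat`. A small self-contained check, using fresh variable names (`H_chk`, `v_chk`, and so on) so it does not overwrite anything defined above:

```python
import numpy as np

H_chk = np.array([[90, 13, 1],
                  [200, 1, 0]], dtype=float)
v_chk = np.array([[380.0], [27.0], [1.0]])

# Least-squares recipe for cake 2, as in the cells above.
w_chk, resid_chk, _, _ = np.linalg.lstsq(H_chk.T, v_chk, rcond=None)

# Reconstruct the cake and compute the squared error by hand.
vhat_chk = H_chk.T @ w_chk
sse = float(np.sum((v_chk - vhat_chk) ** 2))

print(sse, resid_chk[0])
```

Nearly all of the error comes from the fat entry (1 versus roughly 1.99), which is the equation the fit sacrifices to match carbohydrates and protein.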

We can do the same for the whole matrix of cakes at once, since `lstsq` solves for each column of `b` separately:

In [76]:
V = np.array([[290, 14, 1],[380,27,1],[120,7,0.5]])
V

array([[290. ,  14. ,   1. ],
       [380. ,  27. ,   1. ],
       [120. ,   7. ,   0.5]])

In [77]:
H

array([[ 90,  13,   1],
       [200,   1,   0]])

In [78]:
WT, _, _, _ = np.linalg.lstsq(H.T, V.T, rcond=None)

In [79]:
WT.T

array([[1.        , 1.        ],
       [1.99369079, 1.00284112],
       [0.50989732, 0.37054623]])
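An equivalent way to write the same solve, without the transposes, is to multiply by the Moore-Penrose pseudoinverse of the nutrition matrix. This is offered as a cross-check rather than a replacement for `lstsq`; the variable names are new and chosen not to clash with the cells above.

```python
import numpy as np

V_all = np.array([[290, 14, 1],
                  [380, 27, 1],
                  [120, 7, 0.5]])
H_known = np.array([[90, 13, 1],
                    [200, 1, 0]], dtype=float)

# Least-squares W for V ≈ W H, written as W = V H^+.
W_pinv = V_all @ np.linalg.pinv(H_known)

# The same thing via lstsq on the transposed system H^T W^T ≈ V^T.
W_lstsq = np.linalg.lstsq(H_known.T, V_all.T, rcond=None)[0].T

print(np.round(W_pinv, 4))
print(np.allclose(W_pinv, W_lstsq))
```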

#### _moral of story: yes, we can solve for unknown recipes_

## problem 2: known recipes, unknown ingredient nutrition
We know the cake nutritional content and the recipes. Can we deduce the ingredient nutritional content?

$$\begin{bmatrix}
    290   & 14 & 1\\
    380   & 27 & 1\\
    120   & 7 & 0.5\\
    \vdots       & \vdots & \vdots
\end{bmatrix}
\approx
\begin{bmatrix}
    2  & 2 \\
    2  & 1 \\
    1  & 2 \\
    \vdots       & \vdots \\
\end{bmatrix}
\cdot
\begin{bmatrix} 
f_{carb} & f_{prot} & f_{fat} \\ s_{carb} & s_{prot} & s_{fat} 
\end{bmatrix} 
$$

Again we'll have a big system of equations needing a least squares estimate.

In [80]:
V

array([[290. ,  14. ,   1. ],
       [380. ,  27. ,   1. ],
       [120. ,   7. ,   0.5]])

In [81]:
W = np.array([[2,2],[2,1],[1,2]])
W

array([[2, 2],
       [2, 1],
       [1, 2]])

In [82]:
H, _, _, _ = np.linalg.lstsq(W, V, rcond=None)
H

array([[208.23529412,  14.64705882,   0.5       ],
       [-51.76470588,  -5.35294118,   0.        ]])

Hm, negative nutritional content doesn't really make sense, does it? Let's hold that thought for a minute.

## problem 3 (the cool one): both unknown
We know the nutritional content of the cake, but we don't know the recipes or the ingredient nutrition.

$$\begin{bmatrix}
    290   & 14 & 1\\
    380   & 27 & 1\\
    120   & 7 & 0.5\\
    \vdots       & \vdots & \vdots
\end{bmatrix}
\approx
\begin{bmatrix}
    w_{1,flour}  & w_{1,sugar} \\
    w_{2,flour}  & w_{2,sugar} \\
    w_{3,flour}  & w_{3,sugar} \\
    \vdots       & \vdots \\
\end{bmatrix}
\cdot
\begin{bmatrix} 
f_{carb} & f_{prot} & f_{fat} \\ s_{carb} & s_{prot} & s_{fat} 
\end{bmatrix} 
$$

This seems like too many unknowns. But here's an approach that turns out to work pretty well: **alternating least squares (ALS)**:

- Initialize $W$ and $H$ with random non-negative numbers.
- Hold $H$ constant and solve for $W$ using least squares. Set any negative values to 0.
- Now hold $W$ constant and solve for $H$, again clipping negative values to 0.
- Repeat the last two steps until you have converged on a solution.
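
The steps above can be sketched with `np.linalg.lstsq` and `np.clip`. This is a minimal illustration, not a production implementation: the function name `als_nmf` and the parameters `n_iter` and `seed` are assumptions introduced here, and a fixed number of iterations stands in for a real convergence test.

```python
import numpy as np

def als_nmf(V, k, n_iter=200, seed=0):
    """Approximate V (n x p) by W (n x k) @ H (k x p), both non-negative."""
    rng = np.random.default_rng(seed)
    n, p = V.shape
    # Random non-negative starting point.
    W = rng.random((n, k))
    H = rng.random((k, p))
    for _ in range(n_iter):
        # Hold H fixed. W @ H ≈ V is the same as H.T @ W.T ≈ V.T,
        # which is the form lstsq expects. Then clip negatives to 0.
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T
        W = np.clip(W, 0, None)
        # Hold W fixed and solve W @ H ≈ V for H, clipping again.
        H = np.linalg.lstsq(W, V, rcond=None)[0]
        H = np.clip(H, 0, None)
    return W, H

V_cakes = np.array([[290, 14, 1],
                    [380, 27, 1],
                    [120, 7, 0.5]])
W_als, H_als = als_nmf(V_cakes, k=2)

print(np.round(W_als @ H_als, 1))
print("reconstruction error:", np.linalg.norm(V_cakes - W_als @ H_als))
```

The recovered "ingredients" in `H_als` need not match real flour and sugar; they are simply two non-negative rows whose non-negative combinations reproduce the cakes well.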

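The objective is not jointly convex in $W$ and $H$ (it is only convex in one while the other is held fixed), so ALS can settle in different local minima: if you run it twice with different random initializations, you may get different values for $W$ and $H$.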
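
One concrete reason two runs can disagree, even when they fit the data equally well, is that the factorization is not unique: rescaling the columns of $W$ and the rows of $H$ in opposite directions leaves the product unchanged. A short sketch, where the matrices and the scaling matrix `D` are hypothetical illustration values:

```python
import numpy as np

W_a = np.array([[2.0, 1.0],
                [1.0, 1.0],
                [0.5, 2.0]])
H_a = np.array([[90.0, 13.0, 1.0],
                [200.0, 1.0, 0.0]])

# Any positive diagonal scaling gives another valid non-negative pair.
D = np.diag([10.0, 0.5])
W_b = W_a @ D
H_b = np.linalg.inv(D) @ H_a

same_product = np.allclose(W_a @ H_a, W_b @ H_b)
same_factors = np.allclose(W_a, W_b)
print(same_product, same_factors)  # True False
```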

## Interpreting the factor matrices

$$ V \approx WH $$


$V$ is the data: an $(n \times p)$ matrix ($n$ rows, $p$ features)

$W$ is an $(n \times k)$ matrix mapping *rows* to *topics (latent features)*

$H$ is a $(k \times p)$ matrix mapping *topics (latent features)* to *features*

Note that $k$, the number of topics (latent features), is a **hyperparameter**: you have to pick it.
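
A quick shape check makes the "reduced-dimension" claim concrete. The sizes below are hypothetical, and the factors are random placeholders rather than a fitted model:

```python
import numpy as np

n, p, k = 100, 2000, 3  # hypothetical: 100 documents, 2000 words, 3 topics
rng = np.random.default_rng(0)

W_shape_demo = rng.random((n, k))   # rows -> topics
H_shape_demo = rng.random((k, p))   # topics -> features
V_hat = W_shape_demo @ H_shape_demo # same shape as the data, (n, p)

full_size = n * p                   # numbers stored in V
factored_size = n * k + k * p       # numbers stored in W and H together
print(V_hat.shape, full_size, factored_size)
```

A larger `k` lets the product fit the data more closely but stores more numbers and gives more topics to interpret.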

In the cake example above, we could observe the nutritional information for each cake (carbs, protein, fat), but we knew that there was some other process for constructing cakes, some smaller set of latent features ("ingredients"), where each latent feature is a linear combination of the raw features.

In the text example, we observe the word counts for each document, but we know that documents tend to be about certain topics, and different topics will have different counts for certain groups of words. NMF gives us topics as linear combinations of words.

### other uses
- image processing: treat each image as a "bag of pixels", a single long vector where each pixel is a feature and its value is its intensity. Running NMF on a corpus of images of faces can find the latent features (which, remember, are just linear combinations of raw features) that represent noses, ears, etc. [See this paper](references/1999-Lee-Seung-Learning-Parts-of-Objects-by-NMF.pdf)
- bioinformatics
- "music structure": [see this paper](http://www.mirlab.org/conference_papers/international_conference/ISMIR%202010/ISMIR_2010_papers/ismir2010-73.pdf)
- recommendation systems