## Objectives

    · Write down and explain NMF equation
    · Compare and contrast NMF, SVD, PCA and K-means
    · Implement Alternating Least Squares algorithm
    · Use NMF to find and interpret latent topics

## Outline

    · Problems with SVD for topic analysis
    · Introduce NMF
    · Review solving linear equations
    · Alternating Least Squares
    · NMF for topic analysis example
    · Pair exercise

## Topic analysis

### Example

Let's look at users' ratings of different movies. Ratings range from 1 to 5; a rating of 0 means the user hasn't watched the movie.

|       | Matrix | Alien | StarWars | Casablanca | Titanic |
| ----- | ------ | ----- | -------- | ---------- | ------ |
| **Alice** |      1 |     2 |        2 |          0 |      0 |
|   **Bob** |      3 |     5 |        5 |          0 |      0 |
| **Cindy** |      4 |     4 |        4 |          0 |      0 |
|   **Dan** |      5 |     5 |        5 |          0 |      0 |
| **Emily** |      0 |     2 |        0 |          4 |      4 |
| **Frank** |      0 |     0 |        0 |          5 |      5 |
|  **Greg** |      0 |     1 |        0 |          2 |      2 |

Note that the first three movies (Matrix, Alien, StarWars) are Sci-fi movies and the last two (Casablanca, Titanic) are Romance. We will be able to mathematically pull out these topics!

In [None]:
import pandas as pd
import numpy as np
import random

%matplotlib inline

In [None]:
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

In [None]:
# Let's try and pull out the topics with SVD
from numpy.linalg import svd
k = 2

movies = ['Matrix','Alien','StarWars','Casablanca','Titanic']
users = ['Alice','Bob','Cindy','Dan','Emily','Frank','Greg']

# Compute SVD
U, sigma, VT = svd(M)

# Make pretty
U, sigma, VT = (np.around(x,2) for x in (U,sigma,VT))
U = pd.DataFrame(U, index=users)
VT = pd.DataFrame(VT, columns=movies)

# Keep top two concepts
U = U.iloc[:,:k]
sigma = sigma[:k]
VT = VT.iloc[:k,:]

print(U)
print(sigma)
print(VT)
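
To make "keep top two concepts" concrete, here is a minimal sketch that rebuilds the rating matrix from only the first `k` singular triplets. It recomputes the SVD on the raw array instead of the rounded DataFrames, and the names `M_k`, `err`, and `tail` are illustrative assumptions, not part of the lecture code. For a truncated SVD, the Frobenius norm of the reconstruction error should equal the root of the sum of the squared discarded singular values.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)
k = 2

# Thin SVD: U is 7x5, sigma has 5 entries, VT is 5x5
U, sigma, VT = np.linalg.svd(M, full_matrices=False)

# Rank-k reconstruction from the top k singular triplets
M_k = U[:, :k].dot(np.diag(sigma[:k])).dot(VT[:k, :])

# Frobenius norm of what the top k concepts fail to capture
err = np.linalg.norm(M - M_k)

# The same quantity computed from the discarded singular values
tail = np.sqrt(np.sum(sigma[k:] ** 2))

print("reconstruction error: {:.4f}".format(err))
print("from discarded sigma: {:.4f}".format(tail))
```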

**Discussion**
1. What do the concepts mean?
2. To which concept(s) does each user/document belong?

In [None]:
# print(answers[5])

## Problems with SVD for topic analysis

**Recall:** $M = U S V^T$

1. The number of columns in $U$ can differ from the number of rows in $V^T$, i.e., the number of latent features differs between $U$ and $V^T$, which is weird.

2. Values in $U$ and $V^T$ can be negative, which is weird and hard to interpret. For example, suppose a latent feature is the genre 'Sci-fi'. This feature can be positive (makes sense), zero (makes sense), or negative (what does that mean?).

3. SVD forces us to fill in missing values and then models those filled-in values as if they were real data, which is bad.

**Discussion:** Can you think of a way to factor a matrix that addresses issues #1 and #2?

We won't cover issue #3 today, but Jack may talk about it tomorrow.

In [None]:
# Your answer goes here

# Potential answers are at the bottom
# print(answers[0])

## Non-negative Matrix Factorization (NMF)

<img src='nmf.png' width = 40% />

<p style="text-align: center;">$r$ is the number of latent features</p>
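
Written out as an equation, and assuming the notation used in the code later in this notebook ($V$ for the data matrix, $W$ and $H$ for the two factors), the factorization pictured above is usually stated as follows. The dimension names $n$ and $m$ are introduced here only for illustration.

$$ V \approx W H, \qquad V \in \mathbb{R}^{n \times m}, \quad W \in \mathbb{R}^{n \times r}, \quad H \in \mathbb{R}^{r \times m}, \qquad W \ge 0, \; H \ge 0 $$

A common way to make "$\approx$" precise is to minimize the squared reconstruction error subject to the non-negativity constraints:

$$ \min_{W \ge 0, \, H \ge 0} \; \lVert V - W H \rVert_F^2 $$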

**Discussion** What convenient properties of SVD do we lose when using non-negative matrix factorization?

In [None]:
# Your answer goes here

# Potential answers are at the bottom
# print(answers[1])

## Review

### System of Linear Equations Exact Solver
$$ Ax = b$$

$$ \begin{bmatrix} 1 & 2 \\ -3 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 7 \\ -9 \end{bmatrix} $$

In [None]:
A = np.array([[1, 2], [-3, 4]])
b = np.array([7, -9])

print(np.linalg.solve(A, b))

### Least Squares Solver

What if we have an overdetermined system of linear equations? E.g.

$$ \begin{bmatrix} 1 & 2 \\ -3 & 4 \\ 1 & -4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 7 \\ -9 \\ 17 \end{bmatrix} $$

An exact solution is not guaranteed, so we must do something else. Least Squares dictates that we find the $x$ that minimizes the residual sum of squares (RSS).

(Note: This is the solver we use when doing Linear Regression!)
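
As a hedged illustration of what "minimizes the RSS" means: setting the gradient of $\lVert Ax - b \rVert_2^2$ to zero gives the normal equations $A^T A x = A^T b$. The sketch below solves them directly for the system above and compares the result with NumPy's least squares solver. The variable names `x_normal`, `x_lstsq`, and `rss` are illustrative, and solving the normal equations this way is only for intuition; it is not necessarily how `np.linalg.lstsq` works internally.

```python
import numpy as np

A = np.array([[1, 2], [-3, 4], [1, -4]], dtype=float)
b = np.array([7, -9, 17], dtype=float)

# Normal equations: (A^T A) x = A^T b
x_normal = np.linalg.solve(A.T.dot(A), A.T.dot(b))

# NumPy's least squares solver for comparison
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

# Residual sum of squares at the least squares solution
rss = np.sum((A.dot(x_normal) - b) ** 2)

print(x_normal)
print(x_lstsq)
print("RSS: {:.4f}".format(rss))
```

At the minimizer the residual $Ax - b$ is orthogonal to every column of $A$, which is exactly what the normal equations state.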

In [None]:
A = np.array([[1, 2], [-3, 4], [1, -4]])
b = np.array([7, -9, 17])

# Solve once and unpack: solution, residual sum of squares, rank, singular values
x, rss, rank, sv = np.linalg.lstsq(A, b)
print(x)
print("Residual sum of squares (error): {}".format(rss[0]))

### Non-negative Least Squares Solver

What if you want to constrain the solution to be non-negative?

We have optimizers for that too!

In [None]:
from scipy.optimize import nnls

A = np.array([[1, 2], [3, 4], [1, 4]])
b = np.array([7, 2, 4])

# A = np.array([[1, 2], [-3, 4], [1, -4]])
# b = np.array([7, -9, 17])

# nnls returns the solution and the residual norm (not squared)
x, rnorm = nnls(A, b)
print(x)
print("Residual sum of squares (error): {}".format(rnorm ** 2))

### Alternating Least Squares

**Question** Given matrices $A$ and $B$, least squares and non-negative least squares find the solution $X$ that minimizes the error (RSS) in $AX = B$. Can you guess what alternating least squares is, and how we might apply it to NMF?

In [None]:
random.choice(students)

In [None]:
# print(answers[2])

In [None]:
# Implement the first two steps of alternating least squares

# Set up our matrix V we want to decompose
V = np.random.rand(10,15)

# Initialize a random matrix W
W = np.random.rand(10,5)

# Solve for H using a least squares solver
H = np.linalg.lstsq(W, V)[0]

# Clip H so there are no negative values
H[H < 0] = 0

# Print the current error. Why did the error go up?
print(np.linalg.norm(V - np.dot(W,H)))

In [None]:
# Solve for W using H
W = np.linalg.lstsq(H, V)

Dang, that blew up... Let's do some math to figure out why, and what we could have done instead.

**Question** `np.linalg.lstsq(W, V)` solves $WH = V$ for $H$. Calling `np.linalg.lstsq(H, V)` instead tries to solve $HW = V$, which is invalid because of the dimensions of $H$ and $W$. What can we do to our matrices to fix this problem?

In [None]:
print(random.choice(students))

In [None]:
# print(answers[3])

**Exercise** Using the answer provided, solve for $W$ and print the new error.

In [None]:
# Your code goes here

In [None]:
# Error should be lower than the error we established above
W = solve_for_w(H, V)

In [None]:
# Repeat solving for H in terms of W and W in terms of H until
# you are "satisfied" with your result (low enough error) or reach some
# maximum number of iterations.
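
One way the loop described above might look, as a minimal sketch. The function name `als_nmf` and its parameters `r`, `max_iter`, `tol`, and `seed` are illustrative assumptions, not part of the lecture code. It uses the general least squares solver and clips negatives after every solve, and it stops when the error changes by less than `tol` or after `max_iter` iterations.

```python
import numpy as np

def als_nmf(V, r, max_iter=100, tol=1e-4, seed=0):
    """Approximate V with W.dot(H), W and H non-negative, via clipped ALS."""
    rng = np.random.RandomState(seed)
    n, m = V.shape
    W = rng.rand(n, r)  # random non-negative starting point
    errors = []
    for _ in range(max_iter):
        # Fix W, solve W H = V for H, then clip negatives
        H = np.linalg.lstsq(W, V, rcond=None)[0]
        H[H < 0] = 0

        # Fix H, solve H.T W.T = V.T for W.T, transpose back, then clip
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T
        W[W < 0] = 0

        # Track the Frobenius norm of the reconstruction error
        errors.append(np.linalg.norm(V - W.dot(H)))
        if len(errors) > 1 and abs(errors[-2] - errors[-1]) < tol:
            break
    return W, H, errors

# Usage on a random non-negative matrix of the same shape as above
V = np.random.RandomState(1).rand(10, 15)
W, H, errors = als_nmf(V, r=5)
print("first error: {:.4f}".format(errors[0]))
print("last error:  {:.4f}".format(errors[-1]))
```

Because of the clipping step, the error is not guaranteed to decrease at every iteration, which is why the sketch keeps the whole error history instead of only the last value.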

### General vs. non-negative least squares solver

Non-negative least squares solver:
    
    · Returns result with least squares error given non-negativity constraint
    · While alternating, converges to a local minimum
    · Orders of magnitude slower than general least squares solver

General least squares solver:
    
    · Returns result with least squares error with no constraints
    · While alternating, converges to a stationary point (saddle point or minimum)
    · Much much faster
    · Have to clip the matrix at every iteration to ensure non-negativity
   
In industry, the general least squares solver is commonly used: the gain in speed seems to be worth giving up the stronger convergence guarantee. For more information check out: http://users.wfu.edu/plemmons/papers/BBLPP-rev.pdf
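
To make the comparison concrete, here is a minimal sketch of a single update of $H$ done both ways. It assumes `scipy.optimize.nnls`, which accepts one right-hand-side vector per call, so the non-negative version loops over the columns of $V$. The names `H_clip`, `H_nnls`, `err_clip`, and `err_nnls` are illustrative.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.RandomState(0)
V = rng.rand(10, 15)
W = rng.rand(10, 5)

# General solver: solve W H = V for all columns at once, then clip negatives
H_clip = np.linalg.lstsq(W, V, rcond=None)[0]
H_clip[H_clip < 0] = 0

# Non-negative solver: one call per column of V
H_nnls = np.zeros((W.shape[1], V.shape[1]))
for j in range(V.shape[1]):
    H_nnls[:, j], _ = nnls(W, V[:, j])

err_clip = np.linalg.norm(V - W.dot(H_clip))
err_nnls = np.linalg.norm(V - W.dot(H_nnls))
print("clipped lstsq error: {:.4f}".format(err_clip))
print("nnls error:          {:.4f}".format(err_nnls))
```

For a fixed $W$, the clipped solution is one feasible non-negative $H$, while NNLS returns the best one, so the NNLS error for this single step should be no larger than the clipped error. The price is one solver call per column.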

## NMF for topic analysis

### Example

Let's look at users' ratings of different movies. Ratings range from 1 to 5; a rating of 0 means the user hasn't watched the movie.

|       | Matrix | Alien | StarWars | Casablanca | Titanic |
| ----- | ------ | ----- | -------- | ---------- | ------ |
| **Alice** |      1 |     2 |        2 |          0 |      0 |
|   **Bob** |      3 |     5 |        5 |          0 |      0 |
| **Cindy** |      4 |     4 |        4 |          0 |      0 |
|   **Dan** |      5 |     5 |        5 |          0 |      0 |
| **Emily** |      0 |     2 |        0 |          4 |      4 |
| **Frank** |      0 |     0 |        0 |          5 |      5 |
|  **Greg** |      0 |     1 |        0 |          2 |      2 |

Note that the first three movies (Matrix, Alien, StarWars) are Sci-fi movies and the last two (Casablanca, Titanic) are Romance. We will be able to mathematically pull out these topics!

In [None]:
# Compute NMF
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt

def fit_nmf(r):
    # Fit NMF with r latent features and return the reconstruction error
    nmf = NMF(n_components=r)
    nmf.fit(M)
    return nmf.reconstruction_err_

# Plot the reconstruction error for r = 1..5
error = [fit_nmf(i) for i in range(1,6)]
plt.plot(range(1,6), error)
plt.xticks(range(1, 6))
plt.xlabel('r')
plt.ylabel('Reconstruction Error')

**Question** What might be the optimal r (number of topics) value and why?

In [None]:
print(random.choice(students))

In [None]:
# print(answers[4])

In [None]:
# Fit using 2 hidden concepts
nmf = NMF(n_components=2)
nmf.fit(M)
W = nmf.transform(M)
H = nmf.components_
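# reconstruction_err_ is the Frobenius norm of M - WH, not the squared error
print('Reconstruction error (Frobenius norm) = %.2f' % nmf.reconstruction_err_)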

In [None]:
# Make interpretable
W, H = (np.around(x,2) for x in (W,H))
W = pd.DataFrame(W,index=users)
H = pd.DataFrame(H,columns=movies)

print(W)
print(H)

## Interpreting Concepts

**Rethink this part**
#### Think of NMF like 'fuzzy clustering'
- The concepts are like clusters
- Each row (document, user, etc...) can belong to more than one concept
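
One hedged way to make the "fuzzy clustering" reading concrete is to normalize each row of $W$ so that it sums to 1, which turns the weights into rough membership shares. The matrix `W_example` below is hypothetical illustration data, not the output of the fit above, and the names `membership` and `hard_labels` are likewise illustrative.

```python
import numpy as np

# Hypothetical row-by-concept weights (illustrative only)
W_example = np.array([[1.2, 0.0],
                      [0.5, 1.5],
                      [0.0, 2.0]])

# Divide each row by its sum; guard against all-zero rows
row_sums = W_example.sum(axis=1, keepdims=True)
membership = W_example / np.where(row_sums == 0, 1, row_sums)

# A hard assignment keeps only the strongest concept per row
hard_labels = membership.argmax(axis=1)

print(membership)
print(hard_labels)
```

The second row belongs mostly, but not entirely, to the second concept, which is the kind of partial membership that a hard clustering such as K-means cannot express.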

**Discussion**
1. What do the concepts (clusters) mean?
2. To which concept(s) does each user/document belong?

In [None]:
# print(answers[6])

In [None]:
# Verify reconstruction
print(np.around(W.dot(H),2))
print(pd.DataFrame(M, index=users, columns=movies))

#### What is concept 0?

In [None]:
# Top 3 movies in concept 0
top_movies = H.iloc[0].sort_values(ascending=False).index[:3]
top_movies

#### Which users align with concept 0?

In [None]:
# Top 2 users for concept 0
top_users = W.iloc[:,0].sort_values(ascending=False).index[:2]
top_users

#### What concepts does Emily align with?

In [None]:
W.loc['Emily']

#### What are all the movies in each concept?

In [None]:
# Movies in each concept
thresh = .2  # movie is included if at least 20% of max weight
for g in range(2):
    all_movies = H.iloc[g,:]
    included = H.columns[all_movies >= (thresh * all_movies.max())]
    print("Concept %i contains: %s" % (g, ', '.join(included)))

#### Which users are associated with each concept?

In [None]:
# Users in each concept
thresh = .2  # user is included if at least 20% of max weight
for g in range(2):
    all_users = W.iloc[:,g]
    included = W.index[all_users >= (thresh * all_users.max())]
    print("Concept %i contains: %s" % (g, ', '.join(included)))

# Pair programming

https://github.com/zipfian/topicmodeling/blob/master/pair.md

# Helper functions and lists
(Not part of the material)

In [None]:
answers = ["One approach we could try is to limit S to be a square matrix and " +
           "enforce positivity requirements on all matrices (including M). If this were " +
           "possible, then we could multiply S into U or V which may result in the " +
           "decomposition M = W * H with everything being positive.",

           "SVD is nice because it has a closed form solution that will always exist. " +
           "NMF on the other hand does not. NMF can be approximated through various " +
           "numerical techniques. " +
           "Today we will be using alternating least squares.",

           "Alternating least squares solves the least squares equation for H while " +
           "leaving W fixed, then solves the least squares equation for W while " +
           "leaving H fixed. This process continues and alternates until either " +
           "the error is reduced to some amount or you hit some number of max iterations.",

           "We can solve the equation (W * H).T = V.T or H.T * W.T = V.T for W.T. We just " +
           "need to remember to transpose our final answer.",

           "At r = 2 there is a significant change in the amount of benefit we are " +
           "gaining from adding additional features. At higher r values, we will end up " +
           "adding dimensions that don't necessarily help much. In a modeling sense " +
           "this would be similar to overfitting our data. This is typically not a problem " +
           "though since we are using NMF in the first place to do dimensionality reduction.",

           "There are no perfect answers here. The problem is that there are positive and " +
           "negative values and it is really hard to interpret what they mean exactly. " +
           "Should we be looking at the absolute value of the numbers, or maybe just the " +
           "max without considering the sign? There seem to be issues with both of those " +
           "approaches. Maybe something more sophisticated would work, or maybe we should " +
           "just use a different tool...",

           "Here all of the values are positive. A large positive number can be " +
           "interpreted as an item belonging to a certain topic. We can then look at the " +
           "items that belong to a topic and make some sort of best guess as to what the " +
           "topics are. These guesses will be heavily influenced by the type of data that " +
           "populated the original matrix as well as the business question we are trying to " +
           "answer."
           ]

In [None]:
students = ["Suresh","Sally","Corbin","Danius","David","Dustin","Eduardo","Evan","Grier","Jane","Jared","Jonathan T.","Jonathan B.","Kathy","Morgan","Nicholas ","Rob D.","Rob W.","Rob F. ","Sal","Scott","Sean ","Stefanie","Sydney","Zane",]

In [None]:
def solve_for_w(H, V):
    # Solve H.T * W.T = V.T for W.T with H fixed, then transpose back
    W = np.linalg.lstsq(H.T, V.T)[0].T

    # Clip W so there are no negative values
    W[W < 0] = 0

    # Print the reconstruction error after this update
    print(np.linalg.norm(V - np.dot(W,H)))
    return W