## SVD for topic analysis

We can use SVD to uncover what we call ***latent features***: hidden dimensions (here, movie genres) that explain the patterns in the data. This is best demonstrated with an example.

### Example

Let's look at users' ratings of different movies. The ratings range from 1 to 5. A rating of 0 means the user hasn't watched the movie.

|       | Matrix | Alien | StarWars | Casablanca | Titanic |
| ----- | ------ | ----- | -------- | ---------- | ------ |
| **Alice** |      1 |     2 |        2 |          0 |      0 |
|   **Bob** |      3 |     5 |        5 |          0 |      0 |
| **Cindy** |      4 |     4 |        4 |          0 |      0 |
|   **Dan** |      5 |     5 |        5 |          0 |      0 |
| **Emily** |      0 |     2 |        0 |          4 |      4 |
| **Frank** |      0 |     0 |        0 |          5 |      5 |
|  **Greg** |      0 |     1 |        0 |          2 |      2 |

Note that the first three movies (Matrix, Alien, StarWars) are Sci-fi movies and the last two (Casablanca, Titanic) are Romance. We will be able to mathematically pull out these topics!

Let's do the computation with Python.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

In [None]:
# Compute SVD
from numpy.linalg import svd

U, sigma, VT = svd(M)
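
As a quick sanity check on the decomposition, here is a minimal sketch that rebuilds the ratings matrix from its three factors. It assumes NumPy's `full_matrices=False` option (the "thin" SVD), which the cell above does not use, and the names `U_thin`, `s`, `VT_full` and `M_rebuilt` are illustrative only.

```python
import numpy as np

# Same ratings matrix as above (rows: users, columns: movies)
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

# Thin SVD: U is 7x5 instead of 7x7, so the shapes line up for multiplication
U_thin, s, VT_full = np.linalg.svd(M, full_matrices=False)

# Multiply the factors back together: U * diag(s) * V^T
M_rebuilt = U_thin @ np.diag(s) @ VT_full

print(U_thin.shape, s.shape, VT_full.shape)
print(np.allclose(M, M_rebuilt))
```

With the full SVD used in the notebook, `U` is 7x7, so you would need to pad `np.diag(sigma)` with zero rows before multiplying. The thin form avoids that step.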

In [None]:
# Make interpretable
movies = ['Matrix','Alien','StarWars','Casablanca','Titanic']
users = ['Alice','Bob','Cindy','Dan','Emily','Frank','Greg']

U, sigma, VT = (np.around(x,2) for x in (U,sigma,VT))

U = pd.DataFrame(U, index=users)
VT = pd.DataFrame(VT, columns=movies)

print(U)
print()
print(np.diag(sigma))
print()
print(VT)

In [None]:
print(U.shape)
print(sigma.shape)
print(VT.shape)

In [None]:
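# Variance ("energy") captured by each concept
# The squared singular values are the eigenvalues of M^T M
total_variance = np.sum(sigma**2)
print(total_variance)

# Cumulative fraction of the total captured by the first k concepts
fraction_variance = np.cumsum(sigma**2) / total_variance
fraction_variance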
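
The cumulative fraction can also be used to pick the number of concepts automatically. The sketch below keeps the smallest `k` whose concepts capture at least a chosen share of the total. The 90% `threshold` is an assumption for illustration, not a value taken from this notebook.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

# Only the singular values are needed here
s = np.linalg.svd(M, compute_uv=False)

# Cumulative share of the squared singular values
energy = np.cumsum(s**2) / np.sum(s**2)

threshold = 0.90  # assumed cutoff; tune for your data

# searchsorted returns the first index whose cumulative share reaches the threshold
k = int(np.searchsorted(energy, threshold)) + 1

print(np.around(energy, 3))
print(k)
```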

In [None]:
# Keep only top two concepts
U = U.iloc[:,:2]
sigma = sigma[:2]
VT = VT.iloc[:2,:]

print(U)
print()
print(sigma)
print()
print(VT)

In [None]:
# Check the reconstruction

np.around(U.dot(np.diag(sigma)).dot(VT), 1)

### What we had:

|       | Matrix | Alien | StarWars | Casablanca | Titanic |
| ----- | ------ | ----- | -------- | ---------- | ------ |
| **Alice** |      1 |     2 |        2 |          0 |      0 |
|   **Bob** |      3 |     5 |        5 |          0 |      0 |
| **Cindy** |      4 |     4 |        4 |          0 |      0 |
|   **Dan** |      5 |     5 |        5 |          0 |      0 |
| **Emily** |      0 |     2 |        0 |          4 |      4 |
| **Frank** |      0 |     0 |        0 |          5 |      5 |
|  **Greg** |      0 |     1 |        0 |          2 |      2 |
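
To put a number on how close the two-concept reconstruction is, we can measure the Frobenius norm of the difference. For a truncated SVD this error equals the square root of the sum of the squared singular values that were dropped. The sketch below checks that on unrounded factors (the notebook rounds its factors to two decimals, which adds a little extra error); `M_k`, `error` and `expected` are illustrative names.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, VT = np.linalg.svd(M, full_matrices=False)

k = 2  # number of concepts kept

# Rank-k reconstruction from the top k concepts
M_k = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]

# np.linalg.norm on a 2-D array defaults to the Frobenius norm
error = np.linalg.norm(M - M_k)

# The error should match the discarded singular values
expected = np.sqrt(np.sum(s[k:] ** 2))

print(np.around(M_k, 1))
print(error, expected)
```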


## Queries

In [None]:
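# Which movies are most similar to Matrix?
from scipy.spatial.distance import cosine

# Column of V^T for Matrix: its coordinates in the two-concept space
matrix = VT['Matrix']

print(matrix)
print()

# cosine() returns a distance (1 - cosine similarity), so smaller means more similar
distances = [cosine(matrix, VT[col]) for col in VT]
pd.Series(distances, index=movies)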
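
The same query can be written with NumPy alone and sorted so the closest movies come first. This is a minimal sketch: `cosine_distance` is a hypothetical helper written for this example (not SciPy's function), and it works on unrounded factors.

```python
import numpy as np

movies = ['Matrix', 'Alien', 'StarWars', 'Casablanca', 'Titanic']
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

_, _, VT = np.linalg.svd(M, full_matrices=False)

# One row per movie: its coordinates in the top two concepts
movie_vecs = VT[:2, :].T


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two 1-D vectors (hypothetical helper)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


target = movie_vecs[movies.index('Matrix')]
dist = {name: cosine_distance(target, vec)
        for name, vec in zip(movies, movie_vecs)}

# Smallest distance first, so the most similar movies lead the list
for name in sorted(dist, key=dist.get):
    print(f"{name:12s}{dist[name]:.3f}")
```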

In [None]:
# Make recommendations for a new user
my_ratings = np.array([[5, 0, 4, 0, 3]])

# Translate to weighted concept space
my_weighted_concept = my_ratings.dot(VT.T)
print(my_weighted_concept)
print()

# Translate back to rating space
new_rating = my_weighted_concept.dot(VT)
print(movies)
new_rating

Of the two movies I haven't rated (Alien and Casablanca), Alien gets the higher predicted rating, so it looks like the best recommendation for me.
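
Picking the recommendation can be automated by ignoring movies that already have a rating. The sketch below is one way to do it, assuming a rating of 0 means "not watched" as in the table above. The names `predicted`, `unseen` and `best` are illustrative.

```python
import numpy as np

movies = ['Matrix', 'Alien', 'StarWars', 'Casablanca', 'Titanic']
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

_, _, VT = np.linalg.svd(M, full_matrices=False)
V_k = VT[:2, :].T  # movies x concepts, top two concepts only

my_ratings = np.array([5, 0, 4, 0, 3], dtype=float)

# Project onto the concepts, then back to rating space
predicted = my_ratings @ V_k @ V_k.T

# Only consider movies the user has not rated yet
unseen = my_ratings == 0
scores = np.where(unseen, predicted, -np.inf)

best = movies[int(np.argmax(scores))]
print(np.around(predicted, 2))
print(best)
```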

#### Which user am I most similar to?

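Since $M = U \Sigma V^T$ and the columns of $V$ are orthonormal, we have $U = M V \Sigma^{-1}$. So we translate a row of ratings to user space by multiplying by $V \Sigma^{-1}$ _on the right_.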

In [None]:
sigma_inv = np.diag(1/sigma)

# Translate to concept space; np.ravel() flattens the 1x2 result
# because cosine() expects 1-D vectors
my_concept = np.ravel(my_ratings.dot(VT.T).dot(sigma_inv))
print(my_concept)
print()

# Find distance to other users (smaller means more similar)
distances = [cosine(my_concept, row) for name, row in U.iterrows()]
pd.Series(distances, index=users)
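
To see why multiplying by $V \Sigma^{-1}$ lands in user space, the sketch below first checks that applying it to the original ratings matrix gives back the truncated $U$, then applies the same map to the new user. It uses unrounded factors and NumPy only; `to_user_space` and `closest` are illustrative names.

```python
import numpy as np

users = ['Alice', 'Bob', 'Cindy', 'Dan', 'Emily', 'Frank', 'Greg']
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, VT = np.linalg.svd(M, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], VT[:k, :].T

# The map from rating space to user space: V * Sigma^{-1}
to_user_space = V_k @ np.diag(1.0 / s_k)

# Existing users map exactly onto their rows of U
print(np.allclose(M @ to_user_space, U_k))

# Map the new user the same way
my_ratings = np.array([5, 0, 4, 0, 3], dtype=float)
my_concept = my_ratings @ to_user_space

# Cosine distance from the new user to every existing user
cos_sim = (U_k @ my_concept) / (
    np.linalg.norm(U_k, axis=1) * np.linalg.norm(my_concept))
distances = 1.0 - cos_sim

closest = users[int(np.argmin(distances))]
print(dict(zip(users, np.around(distances, 3))))
print(closest)
```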

In [None]:
# Inspect each user's coordinates in concept space
for name, row in U.iterrows():
    print(row)