# Lecture 26: 
- Machine Learning (Scikit-learn)
- Non-negative Matrix Factorization

__Optional Reading Material:__
- [Scikit-learn: Non-negative Matrix Factorization](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html)

### Non-negative Matrix Factorization
Suppose that we have a matrix that represents users and movie ratings:
$$
M=
\begin{matrix}
  \\
u_0\\
u_1\\
u_2\\
u_3\\
u_4\\
\end{matrix}
\begin{array}{c}
\begin{matrix}
m_0 & m_1 & m_2 & m_3
\end{matrix} \\
\begin{pmatrix}
1&3&1&2\\
5&5&1&1\\
1&2&5&4\\
4&3&1&1\\
5&4&2&2
\end{pmatrix}.
\end{array}
$$

The goal of NMF is to approximately factorize this non-negative $n \times m$ matrix into two smaller
non-negative matrices of size $n \times r$ and $r \times m$, respectively. The underlying idea is that we can describe our
data in terms of $r$ “features”, such that each user and each movie is described as a (non-negative)
linear combination of the features, and each rating is the dot product of the corresponding user and movie
feature vectors.

For illustration, movie genres could be features. Suppose that user A likes comedies
a lot and thrillers somewhat, and user B does not like comedies but loves thrillers. If movie C is
mostly a comedy with some thriller elements, then user A will like this movie a lot more than user B.
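With the features ordered as (comedy, thriller), their ratings might look like
$\begin{pmatrix}2\\1\end{pmatrix}\cdot \begin{pmatrix}2\\1\end{pmatrix}=5$ for user A and $\begin{pmatrix}0\\2\end{pmatrix}\cdot \begin{pmatrix}2\\1\end{pmatrix}=2$ for user B, where the second vector in each product describes movie C.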
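The same arithmetic can be checked in NumPy. This is only a small sketch: the variable names and the (comedy, thriller) feature order are assumptions made for illustration.

```python
import numpy as np

# Feature order is assumed to be (comedy, thriller).
user_a = np.array([2, 1])   # likes comedies a lot, thrillers somewhat
user_b = np.array([0, 2])   # does not like comedies, loves thrillers
movie_c = np.array([2, 1])  # mostly a comedy, with some thriller elements

# A rating is the dot product of a user vector and a movie vector.
rating_a = int(np.dot(user_a, movie_c))  # 2*2 + 1*1 = 5
rating_b = int(np.dot(user_b, movie_c))  # 0*2 + 2*1 = 2
print(rating_a, rating_b)
```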


In practice, the learned features are unlikely to coincide exactly with recognizable genres. It is also almost never possible to write $M$ exactly as a product of two such matrices, because describing the data with only $r$ features is a simplification (and potentially a noise reduction). Therefore, we look for two matrices whose product is “as close as possible” to the original matrix. This is done by minimizing the Frobenius norm of the difference between $M$ and the product, that is, the square root of the sum of squares of all the elements of the difference matrix (the L2-norm of the matrix viewed as one long vector).
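Written out, with $W$ denoting the $n \times r$ user-feature matrix and $H$ the $r \times m$ feature-movie matrix (the same names the code in this lecture uses), the problem described above is roughly:

$$
\min_{W \ge 0,\; H \ge 0} \; \lVert M - WH \rVert_F,
\qquad
\lVert A \rVert_F = \sqrt{\sum_{i,j} A_{ij}^2}.
$$

Here $W \ge 0$ and $H \ge 0$ mean that every entry is non-negative.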

In __sklearn__ we implement this as follows:

In [1]:
import numpy as np
from sklearn.decomposition import NMF
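# Rows are users u_0..u_4, columns are movies m_0..m_3
M = np.array([[1,3,1,2],[5,5,1,1],[1,2,5,4],[4,3,1,1],[5,4,2,2]])
model = NMF(n_components=2)
# fit_transform fits the model and returns the user-feature matrix W,
# so a separate call to model.fit is not needed
W = model.fit_transform(M)
# components_ holds the feature-movie matrix H
H = model.components_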

In [141]:
W

array([[ 0.73187392,  0.75167863],
       [ 2.09408547,  0.        ],
       [ 0.34119281,  2.62041194],
       [ 1.47426902,  0.12132919],
       [ 1.84565285,  0.60341817]])

In [142]:
H

array([[  2.51186087e+00,   2.23168317e+00,   4.53271270e-01,
          5.61520045e-01],
       [  1.37765499e-05,   5.33379161e-01,   1.78497633e+00,
          1.50778417e+00]])

In [113]:
np.matmul(W,H)

array([[ 1.83837581,  2.03424042,  1.67346599,  1.54433101],
       [ 5.26005135,  4.6733353 ,  0.94918878,  1.17587097],
       [ 0.85706498,  2.15910738,  4.8320262 ,  4.14260223],
       [ 3.70316033,  3.35481582,  0.88481352,  1.01076984],
       [ 4.63603147,  4.44076307,  1.91366857,  1.94619543]])
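To put a number on how close the product is to $M$, one option is to compute the Frobenius norm of the residual directly. The sketch below assumes a fixed `init`, `random_state` and `max_iter`, which are choices made here for reproducibility and are not part of the lecture code, so the factors may differ slightly from the ones printed above.

```python
import numpy as np
from sklearn.decomposition import NMF

M = np.array([[1, 3, 1, 2],
              [5, 5, 1, 1],
              [1, 2, 5, 4],
              [4, 3, 1, 1],
              [5, 4, 2, 2]], dtype=float)

# init, random_state and max_iter are assumptions made for this sketch.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(M)
H = model.components_

# Residual between the original ratings and the low-rank reconstruction.
residual = M - W @ H
frobenius_error = np.sqrt(np.sum(residual ** 2))
max_abs_error = np.max(np.abs(residual))

print(frobenius_error, max_abs_error)
# With the default Frobenius loss, sklearn stores the same quantity here.
print(model.reconstruction_err_)
```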

Inspecting the product $WH$ shows that it is a reasonable approximation of the original matrix $M$: every entry of $WH$ is within 1 of the corresponding rating in $M$.

This method is powerful because we can use only partial data from $M$ to approximate $W$ and $H$. For example, we do not need all of the movies (columns) to approximate $W$, or all of the users (rows) to approximate $H$. Intuitively, you can guess quite well what kind of movie fan somebody is (and predict how much they will like new movies) by knowing how they have rated some number of past movies. You do not need to know how they have rated every movie in the history of filmmaking.
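As a small illustration of working with partial data, the sketch below learns $H$ from the first four users only and then uses `transform` to find feature weights for the remaining user. The `init`, `random_state` and `max_iter` settings are assumptions made for reproducibility, not part of the lecture code.

```python
import numpy as np
from sklearn.decomposition import NMF

M = np.array([[1, 3, 1, 2],
              [5, 5, 1, 1],
              [1, 2, 5, 4],
              [4, 3, 1, 1],
              [5, 4, 2, 2]], dtype=float)

# Learn the movie features H from users u_0..u_3 only.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
model.fit(M[:-1, :])
H = model.components_

# Describe the held-out user u_4 in terms of the learned features.
w_new = model.transform(M[-1:, :])   # shape (1, 2)

# Reconstructed ratings for u_4, to compare with the last row of M.
print(w_new @ H)
```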

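Let's assume that the entry $(u_4, m_3)$ is missing, and we would like to use NMF to approximate it. We factorize the matrix without the row $u_4$ to obtain features for every movie, factorize the matrix without the column $m_3$ to obtain features for every user, and then take the dot product of the feature vectors of $u_4$ and $m_3$. Note that this assumes the two factorizations recover comparable features in the same order, which is not guaranteed in general.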
$$
M=
\begin{matrix}
  \\
u_0\\
u_1\\
u_2\\
u_3\\
u_4\\
\end{matrix}
\begin{array}{c}
\begin{matrix}
m_0 & m_1 & m_2 & m_3
\end{matrix} \\
\begin{pmatrix}
1&3&1&2\\
5&5&1&1\\
1&2&5&4\\
4&3&1&1\\
5&4&2&\text{N/A}
\end{pmatrix}.
\end{array}
$$

In [162]:
# Drop the last row (u_4) so that every remaining rating is known
T1 = M[:-1,:]
model = NMF(n_components=2)
W1 = model.fit_transform(T1)
H1 = model.components_
# Feature vector of the last movie (m_3)
H1[:,-1]

array([ 0.51218055,  1.47754213])

In [163]:
print(H1)

[[ 2.25591372  2.18773609  0.40047802  0.51218055]
 [ 0.          0.54037258  1.74961287  1.47754213]]


In [164]:
# Drop the last column (m_3) so that every remaining rating is known
T2 = M[:,:-1]
model = NMF(n_components=2)
W2 = model.fit_transform(T2)
H2 = model.components_
# Feature vector of the last user (u_4)
W2[-1,:]

array([ 1.86167187,  0.44335129])

In [165]:
print(W2)

[[ 0.76042577  0.42367443]
 [ 2.10264556  0.        ]
 [ 0.36718755  2.27120935]
 [ 1.48173727  0.06893127]
 [ 1.86167187  0.44335129]]


In [168]:
# This is the approximated value for the missing data
np.dot(H1[:,-1],W2[-1,:])

1.608582327478336
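An alternative that keeps a single set of features is sketched below. This is not the procedure used above: it swaps in non-negative least squares (`scipy.optimize.nnls`) to fit the feature weights of $u_4$ against the three ratings that are known, reusing the movie features learned from the other users. The `init`, `random_state` and `max_iter` settings and the variable names are assumptions made for this sketch.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.decomposition import NMF

M = np.array([[1, 3, 1, 2],
              [5, 5, 1, 1],
              [1, 2, 5, 4],
              [4, 3, 1, 1],
              [5, 4, 2, 2]], dtype=float)

# Learn movie features from the users with complete ratings (u_0..u_3).
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
model.fit(M[:-1, :])
H1 = model.components_                      # shape (2, 4)

# Fit non-negative feature weights for u_4 using only the known ratings.
known = [0, 1, 2]                           # columns m_0, m_1, m_2
w4, _ = nnls(H1[:, known].T, M[-1, known])  # solves min ||A w - b|| with w >= 0

# Estimate for (u_4, m_3), using the same features throughout.
estimate = float(w4 @ H1[:, 3])
print(estimate)
```

Because both vectors in the final dot product come from the same factorization, the features are aligned by construction.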

### Exercise:

Use NMF to approximate the missing value in the following data. The data represents movie ratings from different users: each row corresponds to a user, and each column to a movie. The data is given in a more convenient form on the page. Assume that there are 3 features:

|          | Black Panther | The Post | Game Night | Peter Rabbit | Red Sparrow | Death Wish | Three Billboards |
| -------- | ------------- | -------- | ---------- | ------------ | ----------- | ---------- | ---------------- |
| Viewer1 | 4 | 4 | 2 | 2 | 3 | 1 | 1 |
| Viewer2 | 1 | 5 | 5 | 2 | 1 | 4 | 5 |
| Viewer3 | 1 | 5 | 1 | 1 | 4 | 1 | 4 |
| Viewer4 | 5 | 4 | 3 | 1 | 1 | 1 | 2 |
| Viewer5 | 1 | 4 | 4 | 1 | 1 | 5 | 5 |
| Viewer6 | 5 | 5 | 3 | 5 | 5 | 1 | 2 |
| Viewer7 | 1 | 5 | 3 | 5 | N/A | 5 | 5 |




In [1]:
# %load movie_data.py
X=[[4,4,2,2,3,1,1],[1,5,5,2,1,4,5],[1,5,1,1,4,1,4],[5,4,3,1,1,1,2],
   [1,4,4,1,1,5,5],[5,5,3,5,5,1,2],[1,5,3,5,None,5,5]]

In [2]:
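import numpy as np
from sklearn.decomposition import NMF
# dtype=float turns the None placeholder into NaN
X = np.array(X, dtype=float)
# Movie features: drop the last row (Viewer7), which contains the missing rating
X1 = X[:-1,:]
model = NMF(n_components=3)
W1 = model.fit_transform(X1)
H1 = model.components_

# User features: drop column 4 (Red Sparrow), which contains the missing rating
X2 = X[:,[0,1,2,3,5,6]]
model = NMF(n_components=3)
W2 = model.fit_transform(X2)
H2 = model.components_

# Dot product of Red Sparrow's features and Viewer7's features
missing_data = np.dot(H1[:,4],W2[-1,:])
print(missing_data)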

3.95673582704
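The two-factorization procedure can also be wrapped in a small helper so that it works for any single missing entry. This is a sketch only: the function name `estimate_missing` and the `init`, `random_state` and `max_iter` settings are assumptions made here, so the result may differ from the value printed above.

```python
import numpy as np
from sklearn.decomposition import NMF

def estimate_missing(X, row, col, n_components):
    """Estimate X[row, col] from two NMF fits that each avoid the missing entry."""
    X = np.asarray(X, dtype=float)          # None placeholders become NaN

    # Movie features: factorize the data without the incomplete user.
    rows_kept = np.delete(X, row, axis=0)
    model_h = NMF(n_components=n_components, init="nndsvda",
                  random_state=0, max_iter=1000)
    model_h.fit(rows_kept)
    H = model_h.components_

    # User features: factorize the data without the incomplete movie.
    cols_kept = np.delete(X, col, axis=1)
    model_w = NMF(n_components=n_components, init="nndsvda",
                  random_state=0, max_iter=1000)
    W = model_w.fit_transform(cols_kept)

    # Rating estimate: dot product of the user's and the movie's features.
    return float(np.dot(W[row, :], H[:, col]))

ratings = [[4, 4, 2, 2, 3, 1, 1], [1, 5, 5, 2, 1, 4, 5], [1, 5, 1, 1, 4, 1, 4],
           [5, 4, 3, 1, 1, 1, 2], [1, 4, 4, 1, 1, 5, 5], [5, 5, 3, 5, 5, 1, 2],
           [1, 5, 3, 5, None, 5, 5]]

estimate = estimate_missing(ratings, row=6, col=4, n_components=3)
print(estimate)
```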
