### NMF applied to Wikipedia articles

Nonnegative matrix factorization (NMF) has become a widely used tool for the analysis of high-dimensional data. Given a matrix X whose columns are nonnegative data vectors, NMF computes a low-rank approximation X ≈ WH in which both factors W and H are also nonnegative.

NMF is popular because this nonnegativity constraint tends to produce sparse and easily interpretable factors: each data vector is rebuilt as a purely additive combination of a small number of parts.
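To make the factorization concrete, here is a minimal NumPy sketch of one classical way to compute it: the Lee and Seung multiplicative updates for the squared Frobenius error. The function name `nmf_multiplicative`, its parameters, and the synthetic test matrix are illustrative assumptions; scikit-learn's `NMF` uses a coordinate descent solver by default, so this is not necessarily what runs in the code later in this notebook.

```python
import numpy as np

def nmf_multiplicative(X, rank, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative X (p x n) by W @ H, with W (p x rank) and H (rank x n)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    # Strictly positive random start, so the multiplicative updates can move every entry
    W = rng.random((p, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H stay nonnegative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic nonnegative matrix of rank 3 (hypothetical data, for illustration only)
rng = np.random.default_rng(42)
X = rng.random((20, 3)) @ rng.random((3, 15))

W, H = nmf_multiplicative(X, rank=3)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(W.shape, H.shape, rel_error)
```

The small constant `eps` only guards against division by zero; it is a convenience of this sketch rather than part of the method.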

#### Image processing
Say we take a gray-level image of a face containing p pixels and flatten it into a single vector whose ith entry is the value of the ith pixel. Stacking n such vectors as columns gives a matrix $\mathbf{X} \in \mathbb{R}^{p \times n}$: the rows correspond to the p pixels, and each of the n columns represents one image.
 
 NMF will produce two matrices W and H. The columns of W can be interpreted as images (the basis images), and H tells us how to sum up the basis images in order to reconstruct an approximation to a given face.
 
 <img src="imagenmf.png" >
 
In the case of facial images, the basis images are features such as eyes, noses, moustaches, and lips, while the columns of H indicate which feature is present in which image.
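The following sketch shows how the image-as-column convention used here maps onto scikit-learn, which expects one sample per row. The data are random stand-ins rather than real faces, and the sizes (8 x 8 pixel images, 30 images, 4 basis images) are arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
p, n, r = 64, 30, 4   # 64 pixels (8 x 8), 30 images, 4 basis images

# Synthetic nonnegative "images" stored as columns, as in the text (not real face data)
X = rng.random((p, r)) @ rng.random((r, n))

# scikit-learn wants samples as rows, so the images are passed in as X.T
model = NMF(n_components=r, init="nndsvda", random_state=0, max_iter=500)
H = model.fit_transform(X.T).T   # (r, n): how much of each basis image is in each image
W = model.components_.T          # (p, r): one basis image per column

# Each column of W can be reshaped back into an 8 x 8 picture for display
basis_images = W.T.reshape(r, 8, 8)

# Approximate reconstruction of the first image as a nonnegative sum of basis images
approx_first_image = W @ H[:, 0]
print(W.shape, H.shape, basis_images.shape)
```

With real face data, each slice `basis_images[k]` could be passed to an image viewer to see which facial part the kth basis image has picked up.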

#### Text mining
In text mining, consider the bag-of-words matrix representation, in which each row corresponds to a word and each column to a document.

NMF will produce two matrices W and H. The columns of W can be interpreted as basis documents (bags of words). Each basis document represents a topic: a set of words that tend to occur together across different documents. H tells us how to sum contributions from different topics to reconstruct the word mix of a given original document.

<img src="textnmf.png">

Therefore, given a set of documents, NMF identifies topics and simultaneously classifies the documents among these different topics.
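As a small illustration, the sketch below runs NMF on a hand-made word-by-document count matrix. The vocabulary and the counts are invented for this example (three documents about cinema, three about football), so it shows the mechanics only and says nothing about real corpora.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical vocabulary and counts: rows are words, columns are documents
words = ["film", "actor", "oscar", "goal", "league", "striker"]
X = np.array([
    [3, 2, 4, 0, 0, 0],   # film
    [2, 3, 1, 0, 0, 0],   # actor
    [1, 1, 2, 0, 0, 0],   # oscar
    [0, 0, 0, 3, 1, 2],   # goal
    [0, 0, 0, 2, 3, 1],   # league
    [0, 0, 0, 1, 2, 2],   # striker
], dtype=float)

# scikit-learn expects documents as rows, hence the transposes
model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=1000)
H = model.fit_transform(X.T).T   # (topics, documents)
W = model.components_.T          # (words, topics)

# The heaviest words in each column of W describe the topic
for k in range(W.shape[1]):
    top = [words[i] for i in np.argsort(W[:, k])[::-1][:3]]
    print(f"topic {k}: {top}")

# The largest entry in each column of H assigns a document to a topic
doc_topic = H.argmax(axis=0)
print(doc_topic)
```

Taking the `argmax` over topics is the simplest possible classification rule; a document that mixes several topics would be better described by its whole column of H.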

#### Hyperspectral unmixing
A hyperspectral image typically has 100 to 200 wavelength-indexed bands, recording for each pixel the fraction of incident light reflected at each of those wavelengths. Given such an image, we want to identify the different materials present in it (e.g. grass, roads, metallic surfaces); these are called the endmembers. Then we want to know which endmembers are present in each pixel, and in what proportion. For example, the spectrum of a pixel might be 0.3 × the spectral signal of grass plus 0.7 × the spectral signal of a road surface.

NMF will produce two matrices W and H. The columns of W can be interpreted as basis endmembers. H tells us how to sum contributions from different endmembers to reconstruct the spectral signal observed at a pixel.

<img src="Hyperspectral.png">

Given a hyperspectral image, NMF is able to compute the spectral signatures of the endmembers, and simultaneously the abundance of each endmember in each pixel.
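The mixing model can be sketched on synthetic data as follows. The two "spectra" are made-up smooth curves, not measured signatures, and the band and pixel counts are arbitrary assumptions. Normalising the columns of H to sum to one is one simple way to read them as proportions. Because an NMF is generally not unique, the recovered factors need not coincide with the true endmembers.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
n_bands, n_pixels = 150, 400

# Hypothetical endmember spectra (made-up curves, not real measurements)
wavelengths = np.linspace(0.0, 1.0, n_bands)
grass = 0.2 + 0.6 * np.exp(-((wavelengths - 0.7) ** 2) / 0.02)
road = 0.3 + 0.2 * wavelengths
W_true = np.column_stack([grass, road])           # (bands, endmembers)

# The pixel from the text: 0.3 x grass plus 0.7 x road
pixel = W_true @ np.array([0.3, 0.7])

# A synthetic image: every pixel is a random mixture whose proportions sum to one
H_true = rng.dirichlet([1.0, 1.0], size=n_pixels).T   # (endmembers, pixels)
X = W_true @ H_true                                    # (bands, pixels)

# scikit-learn expects one sample (pixel) per row, hence the transposes
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
H = model.fit_transform(X.T).T   # (endmembers, pixels)
W = model.components_.T          # (bands, endmembers)

# Rescale each pixel's weights so that they can be read as proportions
abundances = H / H.sum(axis=0, keepdims=True)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(W.shape, abundances.shape, rel_error)
```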

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

# Load the vectors; the columns of the CSV are the article titles
df = pd.read_csv('Wikipedia_articles/wikipedia-vectors.csv', index_col=0)

# Transpose so that each row is one article, and store the result as a sparse matrix
articles = csr_matrix(df.transpose())
titles = list(df.columns)

# Create an NMF instance: model
model = NMF(n_components=6)

# Fit the model to articles
model.fit(articles)

# Transform the articles: nmf_features
nmf_features = model.transform(articles)

# Print the NMF features
# print(nmf_features)
```

```python
# Create a pandas DataFrame: df (one row per article, one column per NMF feature)
df = pd.DataFrame(nmf_features, index=titles)

# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])

# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])
```

```
0    0.003845
1    0.000000
2    0.000000
3    0.575687
4    0.000000
5    0.000000
Name: Anne Hathaway, dtype: float64
0    0.000000
1    0.005601
2    0.000000
3    0.422363
4    0.000000
5    0.000000
Name: Denzel Washington, dtype: float64
```
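One way to read this output is to ask which NMF feature dominates each article. The sketch below re-enters the printed values by hand so that it runs without the CSV file; the variable names `features`, `dominant`, and `share` are illustrative. Both articles load almost entirely on feature 3, which is consistent with the two actors' articles sharing a topic, although the printed numbers alone do not say what that topic is.

```python
import pandas as pd

# NMF feature values copied from the printed output above
features = pd.DataFrame(
    [
        [0.003845, 0.0, 0.0, 0.575687, 0.0, 0.0],
        [0.0, 0.005601, 0.0, 0.422363, 0.0, 0.0],
    ],
    index=["Anne Hathaway", "Denzel Washington"],
)

# Column label of the largest feature in each row
dominant = features.idxmax(axis=1)
print(dominant)

# Fraction of each article's total feature weight carried by its dominant feature
share = features.max(axis=1) / features.sum(axis=1)
print(share)
```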
