# Latent semantic analysis, no tears

[Latent semantic analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) (LSA) is a natural language processing (NLP) technique that relates terms and documents through hidden (latent) concepts. The core of LSA is the application of [singular value decomposition](https://en.wikipedia.org/wiki/Singular-value_decomposition) (SVD) to a term-document matrix, which here has one row per term and one column per document, with term counts as entries. In this tutorial, we apply SVD to a small set of documents and their terms to tease out the latent concepts, then map a query into the same concept space.
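As a rough guide to the notation, the decomposition and the two derived quantities computed in the cells below can be written as follows. The symbols are chosen to match the variable names in the code: $A$ is the term-document matrix, $U$ maps terms to concepts, $\Sigma$ is the diagonal matrix of singular values, $V$ maps documents to concepts, $k$ is the number of concepts kept, and $q$ is a query expressed as a column vector of term counts.

```latex
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T}, \qquad
q_k = q^{T} U_k \Sigma_k^{-1}
```

Here $U_k$ and $V_k$ keep the first $k$ columns of $U$ and $V$, and $\Sigma_k$ keeps the top-left $k \times k$ block of $\Sigma$.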

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={
        'd1': [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0],
        'd2': [1, 1, 0, 1, 0, 0, 1, 1, 0, 2, 1],
        'd3': [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1]
    }, 
    index=['a', 'arrived', 'damaged', 'delivery', 'fire', 'gold', 'in', 'of', 'shipment', 'silver', 'truck'])

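# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
A = df.to_numpy()
# query vector over the same 11 terms: 'gold', 'silver', 'truck'
q = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1]).reshape(-1, 1)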

print('data shape is {}'.format(A.shape))

data shape is (11, 3)


In [2]:
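from numpy.linalg import svd

# thin SVD: U is 11x3, S is a length-3 vector of singular values, VT is 3x3
U, S, VT = svd(df.to_numpy(), full_matrices=False)
S = np.diag(S)        # turn the singular values into a diagonal matrix
V = VT.transpose()    # rows of V correspond to documents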
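A quick sanity check on the decomposition is to multiply the three factors back together and compare the result with the original matrix. The sketch below is self-contained and rebuilds the same counts as a plain NumPy array; the names `A_rebuilt`, `reconstruction_ok`, and `columns_orthonormal` are illustrative and not part of the original notebook.

```python
import numpy as np

# Same counts as the DataFrame above: one row per document, then transposed
# so that rows are terms and columns are documents d1, d2, d3.
A = np.array([
    [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0],  # d1
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 2, 1],  # d2
    [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1],  # d3
], dtype=float).T

# Thin SVD: s is returned as a 1-D array of singular values, largest first.
U, s, VT = np.linalg.svd(A, full_matrices=False)

# Multiplying the factors back together should recover A up to rounding error.
A_rebuilt = U @ np.diag(s) @ VT
reconstruction_ok = np.allclose(A, A_rebuilt)

# The columns of U and the rows of VT should each be orthonormal.
columns_orthonormal = (
    np.allclose(U.T @ U, np.eye(3)) and np.allclose(VT @ VT.T, np.eye(3))
)

print(reconstruction_ok, columns_orthonormal)
```

The signs of individual singular vectors can differ between NumPy or LAPACK builds, so the printed factors may not match the cell outputs sign for sign even though the product is the same.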

In [3]:
U

array([[-0.42012157, -0.07479925, -0.04597244],
       [-0.29948676,  0.20009226,  0.40782766],
       [-0.12063481, -0.27489151, -0.4538001 ],
       [-0.157561  ,  0.30464762, -0.2006467 ],
       [-0.12063481, -0.27489151, -0.4538001 ],
       [-0.26256057, -0.37944687,  0.15467426],
       [-0.42012157, -0.07479925, -0.04597244],
       [-0.42012157, -0.07479925, -0.04597244],
       [-0.26256057, -0.37944687,  0.15467426],
       [-0.315122  ,  0.60929523, -0.40129339],
       [-0.29948676,  0.20009226,  0.40782766]])

In [4]:
S

array([[4.09887197, 0.        , 0.        ],
       [0.        , 2.3615708 , 0.        ],
       [0.        , 0.        , 1.27366868]])

In [5]:
V

array([[-0.49446664, -0.64917576, -0.57799098],
       [-0.64582238,  0.71944692, -0.25555741],
       [-0.58173551, -0.24691489,  0.77499473]])

In [6]:
VT

array([[-0.49446664, -0.64582238, -0.58173551],
       [-0.64917576,  0.71944692, -0.24691489],
       [-0.57799098, -0.25555741,  0.77499473]])

In [18]:
from numpy.linalg import inv

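k = 2                    # number of latent concepts to keep
U_k = U[:, 0:k]          # term-to-concept matrix (11 x k)
S_k = inv(S[0:k, 0:k])   # note: S_k holds the INVERSE of the top-k singular value matrix
V_k = V[:, 0:k]          # document-to-concept matrix (one row per document)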

In [19]:
V_k

array([[-0.49446664, -0.64917576],
       [-0.64582238,  0.71944692],
       [-0.58173551, -0.24691489]])
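Keeping only the first `k` singular values and vectors gives a rank-`k` approximation of the term-document matrix. The sketch below is a self-contained illustration of that idea under the same data; `A_k` and `error` are names introduced here, and the sketch uses the singular values directly rather than the inverted `S_k` from the cell above.

```python
import numpy as np

# Term-document counts (rows = terms, columns = d1, d2, d3), as in the first cell.
A = np.array([
    [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0],  # d1
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 2, 1],  # d2
    [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1],  # d3
], dtype=float).T

U, s, VT = np.linalg.svd(A, full_matrices=False)

k = 2
# Rank-k approximation: keep the k largest singular values and their vectors.
A_k = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]

# Frobenius norm of what was discarded. With three singular values and k = 2,
# this equals the single dropped singular value s[2].
error = np.linalg.norm(A - A_k)

print(np.linalg.matrix_rank(A_k), np.isclose(error, s[k]))
```

The approximation has the same shape as `A` but only two independent directions, which is what lets terms that never co-occur in a document still end up close together.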

In [17]:
# fold the query into the k-dimensional concept space (S_k is already inverted)
q.transpose().dot(U_k).dot(S_k)

array([[-0.21400262,  0.18205705]])

In [23]:
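# element-wise product of the folded query with each document row (broadcast);
# summing each row gives the dot product between the query and that document
q.transpose().dot(U_k).dot(S_k) * V_k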

array([[ 0.10581716, -0.11818703],
       [ 0.13820768,  0.13098038],
       [ 0.12449292, -0.0449526 ]])
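The element-wise products above are one step short of a similarity score. A common way to finish, and the one assumed here, is to rank documents by cosine similarity between the folded query and each document's row in `V_k`. The sketch below is self-contained; the helper `cosine` and the names `q_k`, `scores`, and `ranking` are illustrative and not part of the original notebook.

```python
import numpy as np

# Term-document counts (rows = terms, columns = d1, d2, d3), as in the first cell.
A = np.array([
    [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0],  # d1
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 2, 1],  # d2
    [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1],  # d3
], dtype=float).T

# Query containing the terms 'gold', 'silver', 'truck'.
q = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1], dtype=float)

U, s, VT = np.linalg.svd(A, full_matrices=False)

k = 2
U_k = U[:, :k]                    # terms -> concepts
S_k_inv = np.diag(1.0 / s[:k])    # inverse of the top-k singular value matrix
V_k = VT[:k, :].T                 # one row per document in concept space

# Fold the query into concept space: q_k = q^T U_k S_k^{-1}
q_k = q @ U_k @ S_k_inv


def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


scores = {name: cosine(q_k, V_k[i]) for i, name in enumerate(['d1', 'd2', 'd3'])}
ranking = sorted(scores, key=scores.get, reverse=True)

print(ranking)
print(scores)
```

Under these assumptions `d2` ranks first, followed by `d3` and then `d1`, which fits the intuition that `d2` is the only document containing both 'silver' and 'truck'. Cosine similarity is unaffected by the sign ambiguity of the singular vectors, because a flipped column of `U_k` comes with a flipped column of `V_k`.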

# References

* [Latent Semantic Indexing (LSI) An Example](http://www1.se.cuhk.edu.hk/~seem5680/lecture/LSI-Eg.pdf)
* [Latent Semantic Analysis (LSA) Tutorial](https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/)
* [pyLDAvis](http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb)
* [Using Gensim for LDA](http://christop.club/2014/05/06/using-gensim-for-lda/)
* [LSA / PLSA / LDA](https://cs.stanford.edu/~ppasupat/a9online/1140.html)