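Full name: [Latent Semantic Indexing (or Latent Semantic Analysis)](https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing). LSI applies singular value decomposition (SVD) to the term-document matrix and then keeps only the largest singular values together with their corresponding singular vectors. As with truncated SVD in general, LSI is typically used to remove noise from text. The SVD formula is: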
$$
A_{m \times n}=U_{m \times m}S_{m \times n}V^\mathrm {T}_{n \times n}
$$
$A$ is the term-document matrix (rows are terms, columns are documents).

In practice, we keep only the $k$ largest singular values ($k < m$, $k < n$), and the formula becomes the rank-$k$ approximation of $A$:
$$
A_{m \times n}^*=U_{m \times k}S_{k \times k}V^\mathrm {T}_{k \times n}
$$
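
The shapes in the truncated formula can be checked with a small NumPy sketch. The matrix below and the choice $k = 2$ are illustrative assumptions, not part of any particular corpus; the snippet only shows that the rank-$k$ product has the same shape as $A$ and that its error comes from the discarded singular values.

```python
import numpy as np

# Illustrative term-document matrix: 3 terms (rows) x 4 documents (columns).
A = np.array([[1, 2, 3, 3],
              [2, 0, 2, 1],
              [2, 2, 0, 1]], dtype=float)

# Thin SVD: U is 3x3, s holds 3 singular values (descending), VT is 3x4.
U, s, VT = np.linalg.svd(A, full_matrices=False)

k = 2  # number of singular values to keep (assumed for this sketch)
U_k = U[:, :k]         # m x k
S_k = np.diag(s[:k])   # k x k
VT_k = VT[:k, :]       # k x n

# Rank-k approximation: same shape as A (m x n).
A_k = U_k @ S_k @ VT_k
print(U_k.shape, S_k.shape, VT_k.shape, A_k.shape)

# The Frobenius error equals the norm of the discarded singular values.
err = np.linalg.norm(A - A_k)
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))
```

Note that $V^\mathrm{T}$ is truncated to $k \times n$, so the product keeps one column per document.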

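- Each column vector of $U_{m \times k}$ can be viewed as a topic. Each topic has its own distribution over terms, and the topics are mutually orthogonal (the columns of $U$ are orthonormal), which is the sense in which they are "independent" here.
- For a new document $a$, $U_{m \times k}^\mathrm {T}a$ can be interpreted as the projection of $a$ onto each topic. The magnitude of each projection indicates which topic $a$ belongs to.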
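
A minimal sketch of the projection step, assuming a hypothetical new document `a` given as raw term counts over the same three terms. The matrix, the value of `k`, and the name `dominant` are illustrative. Because singular vectors are only defined up to sign, the sketch compares absolute values of the coordinates.

```python
import numpy as np

# Illustrative term-document matrix: 3 terms (rows) x 4 documents (columns).
A = np.array([[1, 2, 3, 3],
              [2, 0, 2, 1],
              [2, 2, 0, 1]], dtype=float)

U, s, VT = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]  # each column is one topic direction

# Hypothetical new document, expressed as counts of the 3 terms.
a = np.array([1.0, 2.0, 2.0])

# Coordinates of a on the k topics.
coords = U_k.T @ a

# Topic with the largest projection magnitude (sign is arbitrary).
dominant = int(np.argmax(np.abs(coords)))
print("topic coordinates:", coords)
print("dominant topic index:", dominant)
```

Taking the largest absolute coordinate is only one simple way to read off a dominant topic; other criteria are possible.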

### gensim.models.LsiModel

The relevant Gensim code is as follows:

In [None]:
from gensim.test.utils import common_corpus, common_dictionary
from gensim.models import LsiModel

print('-'*25, "corpus", '-'*25, sep="") 
print(common_dictionary)
corpus = common_corpus[0:9]
print("corpus number =", len(corpus))
for doc in corpus:
    print([(i, round(w,4)) for i, w in doc])

print('-'*25, "lsi", '-'*25, sep="")  
model = LsiModel(corpus, id2word=common_dictionary, num_topics=3)  # train model

print("U =\n", model.projection.u)
print("S =\n", model.projection.s)
print("lsi[corpus] =")
for doc in model[corpus]:
    print("", [(i, round(w,4)) for i, w in doc])

print('-'*25, "show topic", '-'*25, sep="")    
topics = model.get_topics()
print("topics =")
for topic in topics: 
    print("", topic)
print("\nshow topics:")    
print("", model.show_topics(3, num_words=12))

![image-20200612164118539](images/image-20200612164118539.png)

As shown, each row of `topics` is one column vector of $U$.

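### Comparison with singular value decomposition

The code below compares singular value decomposition with LSI, which helps clarify how LSI works.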

In [39]:
import gensim
import numpy as np
from gensim import corpora
from gensim import models

A = np.array([[1,2,3,3], [2,0,2, 1], [2,2,0,1]])
print("A =\n", A)

print('-'*25,  "SVD", '-'*25, sep='')
U, S, VT = np.linalg.svd(A, full_matrices=False) 
print("U =\n", U)
print("S =\n", S)
print("VT = \n", VT)

# A expressed in the basis U: the coordinates of each column of A in the basis formed by the columns of U
UTA = U.T @ A
print("U.T @ A =\n", UTA)

print('-'*25,  "LSI", '-'*25, sep='')
corpus = gensim.matutils.Dense2Corpus(A)
lsi = models.LsiModel(corpus, num_topics=2)

print("U =\n", lsi.projection.u)
print("S =\n", lsi.projection.s)

# Recover the truncated V^T from the identity U^T A = S V^T
vt =  (lsi.projection.u.T @ A) / lsi.projection.s.reshape(len(lsi.projection.s),1)
print("VT =\n", vt)

# Rank-2 approximation of A
print(lsi.projection.u @ np.diag(lsi.projection.s) @ vt)
# (U.T @ A).T @ S @ VT = V S^2 V^T, the rank-2 approximation of A.T @ A
print((lsi.projection.u.T @ A).T @ np.diag(lsi.projection.s) @ vt)

# lsi[corpus] is equivalent to U.T @ A
new_a = gensim.matutils.corpus2dense(lsi[corpus], num_terms=len(lsi.projection.s))
print("lsi[corpus] =\n", new_a)



A =
 [[1 2 3 3]
 [2 0 2 1]
 [2 2 0 1]]
-------------------------SVD-------------------------
U =
 [[ 0.80725695  0.35449015  0.47188235]
 [ 0.44390091  0.16223863 -0.88126648]
 [ 0.38895783 -0.9208775   0.02639024]]
S =
 [5.77799333 2.15744631 1.72052854]
VT = 
 [[ 0.42799884  0.4140589   0.57278929  0.56328026]
 [-0.53896479 -0.52505348  0.64332896  0.14129277]
 [-0.71947085  0.57920874 -0.20161591  0.32592938]]
U.T @ A =
 [[ 2.47297444  2.39242956  3.30957268  3.2546296 ]
 [-1.16278759 -1.1327747   1.3879477   0.30483158]
 [-1.23787013  0.99654517 -0.34688592  0.5607708 ]]
-------------------------LSI-------------------------
U =
 [[-0.80725695  0.35449015]
 [-0.44390091  0.16223863]
 [-0.38895783 -0.9208775 ]]
S =
 [5.77799333 2.15744631]
VT =
 [[-0.42799884 -0.4140589  -0.57278929 -0.56328026]
 [-0.53896479 -0.52505348  0.64332896  0.14129277]]
[[1.58412906 1.52974793 3.16368934 2.73538216]
 [0.90910655 0.87822186 1.69430106 1.49418851]
 [2.03266769 1.97370093 0.0091544  0.985201

In [30]:
from gensim import similarities

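lsi_corpus = lsi[corpus]
# num_features is the vector dimensionality; a value larger than the actual size only adds empty dimensions
index = similarities.SparseMatrixSimilarity(corpus, num_features=12)
lsi_index = similarities.SparseMatrixSimilarity(lsi_corpus, num_features=3)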

a = np.array([[1], [2], [2]])
print(a)
query_bow = gensim.matutils.Dense2Corpus(a)



[[1]
 [2]
 [2]]


In [31]:
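# Despite the variable names, no TF-IDF weighting is applied: the raw bag-of-words query is used directly
tfidf_query = query_bow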
tfidf_sims = index[tfidf_query]

lsi_query = lsi[tfidf_query]
lsi_sims = lsi_index[lsi_query]

print(tfidf_sims)
print(lsi_sims)



[[1.         0.70710677 0.6471503  0.7035265 ]]
[[1.         0.70710677 0.6471502  0.7035264 ]]
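
The two printed rows agree to about six decimal places. One possible explanation, which is an inference and not something the output itself states, is that the model used for that run kept all three components: projecting onto a full orthonormal basis preserves cosine similarity, while a truncated projection generally does not. The NumPy-only sketch below illustrates this with the same matrix and query; the helper `cosine_to_columns` is a hypothetical name introduced here.

```python
import numpy as np

def cosine_to_columns(q, M):
    """Cosine similarity between vector q and every column of M."""
    return (q @ M) / (np.linalg.norm(q) * np.linalg.norm(M, axis=0))

# Same term-document matrix and query vector as in the cells above.
A = np.array([[1, 2, 3, 3],
              [2, 0, 2, 1],
              [2, 2, 0, 1]], dtype=float)
a = np.array([1.0, 2.0, 2.0])

U, s, VT = np.linalg.svd(A, full_matrices=False)

# Similarities in the original term space.
raw = cosine_to_columns(a, A)

# Similarities after projecting onto all 3 components (orthogonal change of basis).
full = cosine_to_columns(U.T @ a, U.T @ A)

# Similarities after projecting onto only the first 2 components.
U_2 = U[:, :2]
trunc = cosine_to_columns(U_2.T @ a, U_2.T @ A)

print("raw      :", np.round(raw, 4))
print("full rank:", np.round(full, 4))
print("rank 2   :", np.round(trunc, 4))
```

With the rank-2 projection the similarities change, which is the intended effect of LSI: documents are compared in topic space instead of term space.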


![image-20200612105123078](images/image-20200612105123078.png)

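Notes:

- The vectors produced by LSI sometimes point in the opposite direction from those produced by SVD. Singular vectors are only determined up to sign, so both results are valid.
- *lsi[corpus]* is equivalent to $U^TA$. It can be viewed as $A$ expressed in the basis $U$, that is, the coordinates of each column vector of $A$ in that basis.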
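
Both notes can be checked numerically. Since $U^\mathrm{T}U = I$, we have $U^\mathrm{T}A = U^\mathrm{T}USV^\mathrm{T} = SV^\mathrm{T}$, and flipping the sign of a left singular vector together with its matching right singular vector leaves the product unchanged. The sketch below uses NumPy only and the same illustrative matrix; the names `U2` and `VT2` are introduced here for the flipped copies.

```python
import numpy as np

A = np.array([[1, 2, 3, 3],
              [2, 0, 2, 1],
              [2, 2, 0, 1]], dtype=float)

U, s, VT = np.linalg.svd(A, full_matrices=False)

# Flip the sign of the first left and right singular vectors together.
U2, VT2 = U.copy(), VT.copy()
U2[:, 0] *= -1
VT2[0, :] *= -1

# Both factorizations reconstruct the same matrix A.
same_product = np.allclose(U @ np.diag(s) @ VT, U2 @ np.diag(s) @ VT2)
print("sign flip keeps U S V^T unchanged:", same_product)

# The coordinates of A in the basis U equal S V^T.
coords_match = np.allclose(U.T @ A, np.diag(s) @ VT)
print("U^T A equals S V^T:", coords_match)
```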