## Topic Modeling
LSA (Latent Semantic Analysis), also known as LSI (Latent Semantic Indexing), starts from a bag-of-words (BoW) representation of the corpus. This yields a term-document matrix that records how often each term occurs in each document.
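
To make the term-document matrix concrete, here is a minimal sketch in plain Python. It assumes five short example documents and a naive tokenizer (lowercase, split on whitespace); the names `docs`, `vocab` and `term_doc` are illustrative only.

```python
from collections import Counter

# Five short example documents (assumed here for illustration).
docs = ['Seven continent planet',
        'Five ocean planet',
        'Asia largest continent',
        'Pacific Ocean largest',
        'Ocean saline water']

# Naive tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]
counts = [Counter(tokens) for tokens in tokenized]

# Sorted vocabulary, so every term gets a stable row index.
vocab = sorted({term for tokens in tokenized for term in tokens})

# Rows are terms, columns are documents, cells are raw counts.
term_doc = [[doc_counts[term] for doc_counts in counts] for term in vocab]

for term, row in zip(vocab, term_doc):
    print(f"{term:<10}{row}")
```

Each row shows in which documents a term occurs. Transposing the matrix gives the document-term layout, with one row per document.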

At its core, LSA applies singular value decomposition (SVD) to that matrix.

In [1]:
import nltk

In [2]:
# Assign the five documents below to topics
TextCorpus = ['Seven continent planet',
'Five ocean planet', 
'Asia largest continent', 
'Pacific Ocean largest', 
'Ocean saline water']

text_tokens = [sent.split() for sent in TextCorpus]
print(text_tokens)

[['Seven', 'continent', 'planet'], ['Five', 'ocean', 'planet'], ['Asia', 'largest', 'continent'], ['Pacific', 'Ocean', 'largest'], ['Ocean', 'saline', 'water']]


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
transformer = TfidfVectorizer()
tfidf = transformer.fit_transform(TextCorpus) 

In [5]:
print(tfidf)

  (0, 6)	0.5317722537280788
  (0, 1)	0.5317722537280788
  (0, 8)	0.6591180018251055
  (1, 4)	0.46220770413113277
  (1, 2)	0.6901592662889633
  (1, 6)	0.5568161504458247
  (2, 3)	0.5317722537280788
  (2, 0)	0.6591180018251055
  (2, 1)	0.5317722537280788
  (3, 5)	0.6901592662889633
  (3, 3)	0.5568161504458247
  (3, 4)	0.46220770413113277
  (4, 9)	0.6390704413963749
  (4, 7)	0.6390704413963749
  (4, 4)	0.42799292268317357


`Inference:` Each output line has the format (document index, term index), followed by the TF-IDF weight of that term in that document.
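
The weights can be reproduced by hand, which also shows which word each term index refers to. The sketch below is plain Python and assumes the default `TfidfVectorizer` settings: lowercasing, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document row. The helper `tfidf_row` is a hypothetical name used only for this illustration.

```python
import math

docs = ['Seven continent planet',
        'Five ocean planet',
        'Asia largest continent',
        'Pacific Ocean largest',
        'Ocean saline water']

tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = {term: sum(term in tokens for tokens in tokenized) for term in vocab}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    raw = {term: tokens.count(term) * idf[term] for term in set(tokens)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]

# Print (document index, term index), the term itself, and its weight.
for doc_id, row in enumerate(rows):
    for term, weight in sorted(row.items()):
        print(f"({doc_id}, {vocab.index(term)})  {term:<10}{weight:.4f}")
```

Under these assumptions, term index 8 is `seven`, which occurs in only one document and therefore gets a higher weight in document 0 than `continent` or `planet`, which each occur in two.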

## Dimensionality Reduction
LSA learns latent topics by decomposing the document-term matrix with singular value decomposition (SVD). Here the matrix is the TF-IDF matrix computed above, and `TruncatedSVD` keeps only the top `n_components` components.

In [8]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components = 3)
lsa = svd.fit_transform(tfidf)

In [9]:
lsa

array([[ 5.69995606e-01, -5.21026572e-01,  4.81700519e-01],
       [ 6.29788097e-01,  2.47716942e-01,  5.41216825e-01],
       [ 5.69995606e-01, -5.21026572e-01, -4.81700519e-01],
       [ 6.29788097e-01,  2.47716942e-01, -5.41216825e-01],
       [ 4.08516626e-01,  6.90173499e-01, -2.83275574e-16]])

Shape of output = (number of documents × number of topics), here 5 × 3.
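
The shape follows from the decomposition itself: SVD factors the matrix as A = U Σ Vᵀ, and keeping the top k singular values gives document coordinates U_k Σ_k and topic-term weights V_kᵀ. The sketch below illustrates this with a full SVD from NumPy (swapped in for `TruncatedSVD`), using the TF-IDF weights copied from the sparse printout above. The names `entries`, `doc_topics` and `topic_terms` are illustrative. Up to the sign of each column, which SVD leaves undetermined, `doc_topics` is expected to agree with the `lsa` array.

```python
import numpy as np

# Nonzero TF-IDF weights copied from the printout: (document, term, weight).
entries = [
    (0, 6, 0.5317722537280788), (0, 1, 0.5317722537280788), (0, 8, 0.6591180018251055),
    (1, 4, 0.46220770413113277), (1, 2, 0.6901592662889633), (1, 6, 0.5568161504458247),
    (2, 3, 0.5317722537280788), (2, 0, 0.6591180018251055), (2, 1, 0.5317722537280788),
    (3, 5, 0.6901592662889633), (3, 3, 0.5568161504458247), (3, 4, 0.46220770413113277),
    (4, 9, 0.6390704413963749), (4, 7, 0.6390704413963749), (4, 4, 0.42799292268317357),
]

# Dense 5 x 10 document-term matrix.
A = np.zeros((5, 10))
for doc, term, weight in entries:
    A[doc, term] = weight

# Full (thin) SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
doc_topics = U[:, :k] * s[:k]   # documents in topic space, shape (5, 3)
topic_terms = Vt[:k]            # term weights per topic, shape (3, 10)

print(doc_topics.shape, topic_terms.shape)
print(np.round(doc_topics, 4))
```

Equivalently, `doc_topics` is the projection `A @ topic_terms.T`, which is how unseen documents would be mapped into the same topic space.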

In [13]:
# Check the topic weights of the first document.
# Note that the values do not sum to 1: in LSA they are not probabilities of a topic in a document.
l=lsa[0]
print("Document 0 :")
for i,topic in enumerate(l):
  print("Topic ",i," : ",topic*100)

Document 0 :
Topic  0  :  56.99956057265075
Topic  1  :  -52.1026571569287
Topic  2  :  48.17005191418757


## Determining the most important words for each topic
For simplicity, only the three highest-weighted words are printed for each topic.

In [18]:
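# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, use get_feature_names() instead.
vocab = transformer.get_feature_names_out()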

for i, comp in enumerate(svd.components_):
    vocab_comp = zip(vocab, comp)
    sorted_words = sorted(vocab_comp, key= lambda x:x[1], reverse=True)[:3]
    print("Topic "+str(i)+": ")
    for t in sorted_words:
        print(t[0],end=" ")
    print("\n")

Topic 0: 
ocean largest planet 

Topic 1: 
ocean saline water 

Topic 2: 
planet five seven 

