# LSI Text Representation Model

*Gensim software example.*

**Prerequisites:** Tokenization skills with NLTK and a conceptual understanding of the LSI text representation model.

**Data:** Gutenberg Corpus

## Outline

**Main Goal:** To practice creating a Latent Semantic Indexing (LSI) model with Gensim and NLTK, then to extract information from this text representation model, and finally to measure word and sentence similarity using the result.

- Acquiring and wrangling data for model initialization
- Gensim LSI model generation/loading example
- Sklearn and Scipy text similarity measure examples
- Gensim built-in similarity measure examples

**Note: The Sklearn LSI code is still pending.**

## What is LSI?

... [(xxx)](#xxx).

**Note**: For an introduction to the Gensim and NLTK software, please read the notes in [2.01-TfIdf Notebook](2.01-TfIdf.ipynb).

In [1]:
from gensim.models import LsiModel, TfidfModel
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
import os
import nltk
import time

## Wrangling Data Step 1 

From a collection of txt files to a list of strings (one per document), and from that list of strings to a list of sentences, each represented as a list of words.

In [2]:
doc_collection = []
file_path = 'gutenberg/'
file_list = sorted(os.listdir(file_path))  # portable replacement for shelling out to `ls`
for file in file_list:
    if file:
        with open(os.path.join(file_path,file)) as doc:
            doc_collection.append(doc.read())

#Wrangling the data: list of doc-strings -> list of sentences, each a list of word tokens
sentences = []
for doc in doc_collection:
    for sent in nltk.sent_tokenize(doc):
        sentences.append(nltk.word_tokenize(sent))
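
To make the target shape concrete, here is a minimal sketch of the same two-level transformation on an in-memory string. It assumes a naive regex splitter (`toy_sentences`, a hypothetical helper) in place of `nltk.sent_tokenize` and `nltk.word_tokenize`, so it runs without the NLTK data files; its tokenization is cruder than NLTK's.

```python
import re

def toy_sentences(text):
    """Split text into sentences, then each sentence into word tokens.

    Naive stand-in for nltk.sent_tokenize + nltk.word_tokenize (an assumption
    for illustration): sentences end at '.', '!' or '?', and tokens are runs
    of word characters.
    """
    raw_sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [re.findall(r'\w+', sent) for sent in raw_sentences if sent]

toy_doc = "Alice was tired. She ran into the hall!"
toy_result = toy_sentences(toy_doc)
print(toy_result)
# [['Alice', 'was', 'tired'], ['She', 'ran', 'into', 'the', 'hall']]
```

The result has the same structure as `sentences`: one inner list of tokens per sentence.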

## Generating the LSI Model

In [3]:
init = time.time()
dictionary = Dictionary(sentences)
corpus = [dictionary.doc2bow(text) for text in sentences]
#tfidf = TfidfModel(corpus, dictionary)
lsi = LsiModel(corpus,id2word=dictionary,num_topics=300)
end = time.time()-init
print('Total time:', end)

Total time: 0.0003159046173095703
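
The `Dictionary` and `doc2bow` steps turn each token list into `(token_id, count)` pairs before the LSI model sees it. The sketch below imitates that idea in plain Python; `build_vocab` and `toy_doc2bow` are hypothetical helpers written for illustration, not Gensim's implementation, and the ids Gensim assigns may differ.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_docs = [['the', 'girl', 'run'], ['the', 'hall']]
toy_vocab = build_vocab(toy_docs)
print(toy_vocab)  # {'the': 0, 'girl': 1, 'run': 2, 'hall': 3}
print(toy_doc2bow(['the', 'the', 'hall', 'unicorn'], toy_vocab))  # [(0, 2), (3, 1)]
```

Note that a word missing from the vocabulary (`'unicorn'` here) simply disappears from the vector, so a sentence made only of unseen words yields an empty bag-of-words.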


In [66]:
lsi[dictionary.doc2bow(['Alice'])][:10]

[(0, 0.00056955806905143347),
 (1, 0.00067110806665921202),
 (2, 0.0019692485159702447),
 (3, 0.0014663607785970927),
 (4, -0.00070440171189507632),
 (5, 0.0025261036712473928),
 (6, -0.00022047840204233559),
 (7, 0.0030931735846703359),
 (8, -0.0023823150419961716),
 (9, 0.0021076307016137039)]

## Sklearn LSI-Cosine sentence similarity

### Wrangling Data

From string-sentences to numerical vectors.

With this model no extra preprocessing is needed: once the sentences are converted to bow vectors with the dictionary, the LSI model can project those numerical vectors directly into topic space.

In [40]:
import numpy as np

sentence1 = 'the girl run into the hall'
sentence2 = 'Here Alice run to the hall'

sent1 = sentence1.split()
sent2 = sentence2.split()

vec_sent1 = dictionary.doc2bow(sent1)
vec_sent2 = dictionary.doc2bow(sent2)

vec_sent1_lsi = np.asarray(lsi[vec_sent1])
vec_sent2_lsi = np.asarray(lsi[vec_sent2])

In [41]:
print(len(vec_sent1_lsi))
vec_sent2_lsi[:10]

300


array([[ 0.        ,  0.58373211],
       [ 1.        , -0.68532563],
       [ 2.        ,  0.18886818],
       [ 3.        ,  0.22626765],
       [ 4.        , -0.44856283],
       [ 5.        ,  0.58655806],
       [ 6.        ,  0.55979944],
       [ 7.        , -0.07446109],
       [ 8.        , -0.30344566],
       [ 9.        ,  0.0765748 ]])

Because the LSI model works with topics, each of the 300 elements of the generated vector has two components: the id of the topic and its value.
The last step in wrangling this data is to drop the id, writing the vector as a 1D array of float values only.

In [39]:
vec_sent1_lsi = vec_sent1_lsi[...,1]
vec_sent2_lsi = vec_sent2_lsi[...,1]

In [45]:
print(len(vec_sent1_lsi))
vec_sent2_lsi[:10]

300


array([ 0.58373211, -0.68532563,  0.18886818,  0.22626765, -0.44856283,
        0.58655806,  0.55979944, -0.07446109, -0.30344566,  0.0765748 ])
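
Slicing column 1 assumes that every topic id is present in the model output. A more defensive conversion is sketched below, assuming some pairs can be missing (for example, an empty bag-of-words produces an empty list). `to_dense` is a hypothetical helper; Gensim ships `gensim.matutils.sparse2full` for the same purpose.

```python
def to_dense(sparse_vec, num_topics):
    """Turn [(topic_id, value), ...] into a list of num_topics floats.

    Topics that do not appear in sparse_vec keep the value 0.0.
    """
    dense = [0.0] * num_topics
    for topic_id, value in sparse_vec:
        dense[int(topic_id)] = float(value)
    return dense

# Hypothetical LSI output with 5 topics where topics 1 and 3 are missing.
toy_sparse = [(0, 0.58), (2, 0.19), (4, -0.45)]
toy_dense = to_dense(toy_sparse, num_topics=5)
print(toy_dense)  # [0.58, 0.0, 0.19, 0.0, -0.45]
```

With this layout two vectors always have the same length and position `i` always refers to topic `i`, which is what the cosine computations in the following cells rely on.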

### Applying Similarity

In [47]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(vec_sent1_lsi.reshape(1,-1),vec_sent2_lsi.reshape(1,-1))[0][0]

0.62991053090495575

In [51]:
#Filtering stopwords
sent1s = ['girl','run','hall']
sent2s = ['Alice','run','hall']

vec_sent1s_lsi = lsi[dictionary.doc2bow(sent1s)]
vec_sent2s_lsi = lsi[dictionary.doc2bow(sent2s)]

vec_sent1s_lsi = np.asarray(vec_sent1s_lsi)
vec_sent2s_lsi = np.asarray(vec_sent2s_lsi)

#Keep only the topic weights (column 1); the topic ids must not enter the cosine
vec_sent1_lsi = vec_sent1s_lsi[...,1]
vec_sent2_lsi = vec_sent2s_lsi[...,1]

print(cosine_similarity(vec_sent1_lsi.reshape(1,-1),vec_sent2_lsi.reshape(1,-1))[0][0])

0.291186591928
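
A pitfall worth checking when flattening LSI output: if an array of `(topic_id, value)` pairs is reshaped with `reshape(1, -1)` before the ids are removed, the ids enter the cosine and push it toward 1. The toy numbers below are made up for illustration, and `cosine` is a small pure-Python helper, not the Sklearn function.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two toy (topic_id, value) vectors whose values point in nearly opposite directions.
pairs_a = [(0, 0.5), (1, -0.5), (2, 0.1)]
pairs_b = [(0, -0.5), (1, 0.5), (2, 0.1)]

values_a = [value for _, value in pairs_a]
values_b = [value for _, value in pairs_b]
flat_a = [x for pair in pairs_a for x in pair]  # ids and values interleaved
flat_b = [x for pair in pairs_b for x in pair]

print(round(cosine(values_a, values_b), 4))  # -0.9608
print(round(cosine(flat_a, flat_b), 4))      # 0.8185
```

The identical ids dominate the dot product, so two almost opposite vectors look similar. With 300 topic ids instead of 3 the effect is far stronger.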

## Scipy Cosine Similarity

In [55]:
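from scipy.spatial.distance import cosine as cosine_scipy

# cosine() is a distance (1 - similarity) on 1-D arrays: pass only the topic weights
unfiltered = [np.asarray(lsi[v])[...,1] for v in (vec_sent1, vec_sent2)]
filtered = [np.asarray(lsi[dictionary.doc2bow(s)])[...,1] for s in (sent1s, sent2s)]
print(cosine_scipy(*unfiltered))
print(cosine_scipy(*filtered)) #Filtering stopwords

0.370089469095
0.708813408072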
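
Keep in mind that `scipy.spatial.distance.cosine` returns a cosine distance, that is, 1 minus the cosine similarity, so its numbers are not directly comparable with the Sklearn similarity values. The relation is sketched below on made-up vectors with two hypothetical pure-Python helpers.

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_distance_1d(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity_1d(u, v)

u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]
print(round(cosine_similarity_1d(u, v), 4))  # 0.7303
print(round(cosine_distance_1d(u, v), 4))    # 0.2697
```

To compare a Scipy result with a Sklearn or Gensim one, convert it first with `1 - distance`.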


## Gensim LSI sentence similarity

In [56]:
from gensim.matutils import kullback_leibler, jaccard, hellinger, cossim

In [58]:
vec_sent1_lsi = np.asarray(lsi[vec_sent1])
vec_sent2_lsi = np.asarray(lsi[vec_sent2])
cossim(vec_sent1_lsi,vec_sent2_lsi)

0.62991053090495497

In [60]:
#Filtering stopwords
sent1s = ['girl','run','hall']
sent2s = ['Alice','run','hall']
vec_sent1s_lsi = lsi[dictionary.doc2bow(sent1s)]
vec_sent2s_lsi = lsi[dictionary.doc2bow(sent2s)]
cossim(vec_sent1s_lsi,vec_sent2s_lsi)

0.29118659192809249

In [59]:
hellinger(vec_sent1_lsi,vec_sent2_lsi)

  sim = np.sqrt(0.5 * sum((np.sqrt(value) - np.sqrt(vec2.get(index, 0.0)))**2 for index, value in iteritems(vec1)))


nan

In [61]:
kullback_leibler(vec_sent1_lsi,vec_sent2_lsi)

inf

In [62]:
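jaccard(vec_sent1_lsi,vec_sent2_lsi)

-3.4701464250118317

**Note**: Hellinger and Kullback-Leibler are defined for probability distributions, and Jaccard for bags with non-negative weights. LSI vectors contain negative components, which explains the `nan`, `inf` and negative results above; these three measures are not meaningful for raw LSI vectors.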
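
The warning line printed by `hellinger` above shows the formula it evaluates, which takes square roots of the vector weights. Below is a minimal pure-Python sketch of that formula, assuming dense lists as input; `hellinger_distance` is a hypothetical helper that rejects negative weights instead of returning `nan`.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if min(p) < 0 or min(q) < 0:
        # Square roots of negative weights are undefined over the reals.
        raise ValueError("Hellinger distance needs non-negative weights")
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

print(hellinger_distance([0.5, 0.5], [0.5, 0.5]))            # 0.0
print(round(hellinger_distance([1.0, 0.0], [0.0, 1.0]), 4))  # 1.0
```

For proper distributions the result stays between 0 (identical) and 1 (disjoint), which is the range the measure is designed for.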

## Best Pair Word Overlap

Let's try a different way to compose a sentence similarity, based on the idea of WordNet-augmented word overlap similarity. Here $df[w][w']$ denotes the similarity between the words $w$ and $w'$ (in the code below, the cosine similarity of their LSI vectors).

$p = {\sum_{w \in sent_1} \max_{w' \in sent_2} df[w][w'] \over len(sent_1)}$

$q = {\sum_{w' \in sent_2} \max_{w \in sent_1} df[w][w'] \over len(sent_2)}$

$sim = \left\{ \begin{array}{ll}
0 & \text{if } p+q = 0\\
{2 p q \over p+q} & \text{otherwise}\\
\end{array}
\right.$

In [73]:
sent1 = ['the','girl','run','into','the','hall']
sent2 = ['Here','Alice','run','to','the','hall']

def harmonic_best_pair_word_sim(sent1,sent2, lsi):
    p=0
    for wi in sent1:
        m = 0
        winp = np.asarray(lsi[dictionary.doc2bow([wi])])[...,1].reshape(1,-1)
        for wc in sent2:
            wcnp = np.asarray(lsi[dictionary.doc2bow([wc])])[...,1].reshape(1,-1)
            m = max(m, cosine_similarity(winp,wcnp))
        p += m
    p = p/len(sent1)

    q=0
    for wc in sent2:
        m = 0
        wcnp = np.asarray(lsi[dictionary.doc2bow([wc])])[...,1].reshape(1,-1)
        for wi in sent1:
            winp = np.asarray(lsi[dictionary.doc2bow([wi])])[...,1].reshape(1,-1)
            m = max(m, cosine_similarity(winp,wcnp))
        q += m
    q = q/len(sent2)

    sim = 2*p*q/(p+q or 1)
    return sim

harmonic_best_pair_word_sim(sent1,sent2, lsi)

array([[ 0.60306345]])

In [74]:
#If we replace sent1 with sent3, a sentence with a very different meaning
sent3 = ['the','boy','eat','a','red','apple']

print(harmonic_best_pair_word_sim(sent3,sent2,lsi))

#With stopword filtering (sent1s and sent2s are the filtered sentences defined earlier)
print(harmonic_best_pair_word_sim(sent1s,sent2s,lsi))

[[ 0.25150628]]
[[ 0.70268849]]
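
The same measure can be written independently of the LSI model by passing the word similarity as a function. This is a compact sketch with a made-up similarity table; `best_pair_harmonic` and `toy_word_sim` are hypothetical names, and unlike the notebook function this version does not clamp negative similarities to 0.

```python
def best_pair_harmonic(sent_a, sent_b, word_sim):
    """Harmonic mean of the two directed best-match averages (p and q)."""
    if not sent_a or not sent_b:
        return 0.0
    p = sum(max(word_sim(w, w2) for w2 in sent_b) for w in sent_a) / len(sent_a)
    q = sum(max(word_sim(w, w2) for w in sent_a) for w2 in sent_b) / len(sent_b)
    return 0.0 if p + q == 0 else 2 * p * q / (p + q)

# Hypothetical similarity: 1.0 for identical words, a small lookup table otherwise.
TOY_TABLE = {frozenset(('girl', 'Alice')): 0.5}

def toy_word_sim(w1, w2):
    return 1.0 if w1 == w2 else TOY_TABLE.get(frozenset((w1, w2)), 0.0)

toy_score = best_pair_harmonic(['girl', 'run'], ['Alice', 'run', 'hall'], toy_word_sim)
print(toy_score)  # p = 0.75, q = 0.5, score = 0.6
```

Here `p` averages the best matches of `girl` (0.5) and `run` (1.0), while `q` averages those of `Alice` (0.5), `run` (1.0) and `hall` (0.0); the unmatched word `hall` is what pulls `q` and the final score down.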


# Conclusions

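* As you can test, the LSI model is generated quickly. Gensim trains it incrementally over the corpus and supports an optional distributed mode (`distributed=True`), which was not used here.
* LSI returns a bow-like vector: because it works with topics, the result is a list of `(topic_id, value)` pairs rather than `(word_id, count)` pairs.
* Filtering stopwords changes the cosine similarity considerably: with Gensim's `cossim`, the same sentence pair goes from 0.63 to 0.29. When computing the cosine with Sklearn or Scipy, pass only the topic weights; if the topic ids are left in the arrays they dominate the result and the similarity comes out close to 1.

As you can verify yourself, the Gensim and Sklearn cosine similarities give the same result for the same vectors.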

# Recommendations

* Repeat the same example with Wikipedia dump data to test how the similarity values change with the training data.