# LSI Text Representation using Wikipedia Corpus

*Gensim software example.*

**Prerequisites:** Tokenization skills with NLTK and knowledge of the LSI text representation model.

## Outline

**Main Goal:** Practice creating a Latent Semantic Indexing (LSI) model with Gensim and NLTK, then see how to extract information from this text representation, and finally how to measure word and sentence similarity with the model.

- Gensim Corpus Initialization
- LSI model example

**Note: the Sklearn LSI code is still pending.**

In [1]:
import os
import nltk
from gensim.models import LsiModel, TfidfModel
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
from gensim.models.word2vec import Text8Corpus
import time
import numpy as np

In [2]:
#Change these settings to point to your Wiki corpus
corpus_path = '/media/DATA/wiki_es/'
#Load resources previously generated with the Gensim package
dictionary = Dictionary.load_from_text(corpus_path+'_wordids.txt.bz2')
bow_corpus = MmCorpus(corpus_path+'_bow.mm')



## Generating the LSI Model

In [3]:
try:
    #Reuse the saved model if it already exists
    lsi = LsiModel.load(corpus_path+'wiki-lsi.model', mmap='r')
    print('LSI Model Generated in 6329 seconds')
except OSError:
    #No saved model found: train it from the bow corpus and save it
    init = time.time()
    lsi = LsiModel(bow_corpus, id2word=dictionary, num_topics=300)
    end = time.time()-init
    lsi.save(corpus_path+'wiki-lsi.model')
    print('Total time:', end)

LSI Model Generated in 6329 seconds


In [5]:
lsi[dictionary.doc2bow(['run'])][:10]

[(0, 0.000714352979674119),
 (1, 0.0007150450110253348),
 (2, 0.00034488263827122745),
 (3, 0.0036861730666044525),
 (4, 0.002268878648457448),
 (5, -0.00021203173328911284),
 (6, 0.0008200432054711397),
 (7, -0.001305789733660302),
 (8, -0.0007860860529528039),
 (9, -0.00043901833517678027)]

## Wrangling Data for Similarity Measures

The data goes from a text collection to a list of strings, and from that list of strings to a list of word lists, one per sentence.

The sentence strings are then converted to numerical vectors.

For this model no further preprocessing is needed: once we create the bow vectors of the sentences, the LSI model can work with those numerical vectors directly.
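To make the bow step concrete, here is a minimal pure-Python sketch of what a `doc2bow`-style conversion does. The `token2id` mapping and the helper name `toy_doc2bow` are hypothetical stand-ins for illustration, not Gensim's implementation; the point is that tokens missing from the vocabulary are simply dropped.

```python
from collections import Counter

# Hypothetical toy vocabulary (token -> integer id); a real Gensim
# Dictionary is built from the corpus instead.
token2id = {"corrió": 0, "hueco": 1, "niña": 2}

def toy_doc2bow(tokens, token2id):
    """Return a sorted list of (token_id, count) pairs for known tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Words such as 'la', 'hacia' and 'el' are not in the toy vocabulary,
# so they do not appear in the result.
bow = toy_doc2bow("la niña corrió hacia el hueco".split(), token2id)
print(bow)
```

Under this assumption, removing out-of-vocabulary words by hand before the conversion would not change the resulting vector.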

In [6]:
sentence1 = 'la niña corrió hacia el hueco'
sentence2 = 'Alicia corrió hacia el hueco'
sent1 = sentence1.split() #sentence in list of words format
sent2 = sentence2.split()
#Filtering stopwords by hand
sent1s = 'niña corrió hueco'
sent2s = 'Alicia corrió hueco'
sent1sl = sent1s.split()
sent2sl = sent2s.split()

In [7]:
vec_sent1 = dictionary.doc2bow(sent1)
vec_sent2 = dictionary.doc2bow(sent2)

bowvec_sent1_lsi = np.asarray(lsi[vec_sent1])
bowvec_sent2_lsi = np.asarray(lsi[vec_sent2])

In [8]:
print(len(bowvec_sent1_lsi))
bowvec_sent2_lsi[:10]

300


array([[0.00000000e+00, 2.12801488e-03],
       [1.00000000e+00, 2.02432313e-03],
       [2.00000000e+00, 3.05863651e-04],
       [3.00000000e+00, 2.14804254e-03],
       [4.00000000e+00, 1.78819965e-04],
       [5.00000000e+00, 5.01868029e-04],
       [6.00000000e+00, 6.10981640e-04],
       [7.00000000e+00, 1.54455533e-03],
       [8.00000000e+00, 1.43817691e-03],
       [9.00000000e+00, 1.66228135e-03]])

Since the LSI model works with topics, each of the 300 elements of the generated vector has 2 components: the topic id and its value.
The last step in wrangling this data is to drop the id, leaving a 1D array of float values only.

In [9]:
vec_sent1_lsi = bowvec_sent1_lsi[...,1]
vec_sent2_lsi = bowvec_sent2_lsi[...,1]

In [10]:
print(len(vec_sent1_lsi))
vec_sent2_lsi[:10]

300


array([0.00212801, 0.00202432, 0.00030586, 0.00214804, 0.00017882,
       0.00050187, 0.00061098, 0.00154456, 0.00143818, 0.00166228])
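Slicing column 1 assumes that every topic is present in the output. As a more defensive alternative, the sketch below fills a dense vector by topic id, so any topic missing from the sparse list defaults to 0. The helper name `sparse_to_dense` and the toy values are illustrative assumptions; Gensim's `matutils.sparse2full` offers a similar conversion.

```python
import numpy as np

def sparse_to_dense(sparse_vec, num_topics):
    """Turn a list of (topic_id, value) pairs into a 1D float array."""
    dense = np.zeros(num_topics, dtype=float)
    for topic_id, value in sparse_vec:
        dense[int(topic_id)] = value
    return dense

# Toy sparse vector with topic 2 missing (values are made up)
toy_sparse = [(0, 0.5), (1, -0.25), (3, 0.125)]
dense = sparse_to_dense(toy_sparse, num_topics=4)
print(dense)
```

With this approach two vectors always have the same length, which keeps later similarity computations from failing on mismatched shapes.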

### Applying Sklearn Cosine Similarity

In [11]:
vec_sent1_lsi.shape

(300,)

In [12]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(vec_sent1_lsi.reshape(1,-1),vec_sent2_lsi.reshape(1,-1))[0][0]

0.5766073355368637
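For reference, cosine similarity is the dot product of the two vectors divided by the product of their norms. The following sketch computes it directly with NumPy on made-up toy vectors (not the LSI vectors above); the function name `cosine_sim` is used only for illustration.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity of two 1D vectors; returns 0.0 if either is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine_sim(a, b), cosine_sim(a, c))
```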

In [13]:
#Filtering stopwords
vec_sent1s_lsi = lsi[dictionary.doc2bow(sent1sl)]
vec_sent2s_lsi = lsi[dictionary.doc2bow(sent2sl)]

bowvec_sent1s_lsi = np.asarray(vec_sent1s_lsi)
bowvec_sent2s_lsi = np.asarray(vec_sent2s_lsi)

vec_sent1s_lsi = bowvec_sent1s_lsi[...,1]
vec_sent2s_lsi = bowvec_sent2s_lsi[...,1]

cosine_similarity(vec_sent1s_lsi.reshape(1,-1),vec_sent2s_lsi.reshape(1,-1))[0][0]

0.5766073355368637

## Scipy Cosine Distance

In [14]:
from scipy.spatial.distance import cosine as cosine_scipy

#Pass the 1-D vectors directly (no reshape needed)
print(cosine_scipy(vec_sent1_lsi, vec_sent2_lsi))
print(cosine_scipy(vec_sent1s_lsi, vec_sent2s_lsi)) #Filtering stopwords

0.42339266446313617
0.42339266446313617
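Note that `scipy.spatial.distance.cosine` returns the cosine distance, defined as 1 minus the cosine similarity, which is why these values differ from the Sklearn result. A quick arithmetic check using the numbers printed in this notebook:

```python
sklearn_similarity = 0.5766073355368637   # output of In [12]
scipy_distance = 0.42339266446313617      # output of In [14]

# distance = 1 - similarity, up to floating-point rounding
difference = abs((1.0 - sklearn_similarity) - scipy_distance)
print(difference < 1e-12)
```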


## Gensim LSI sentence similarity

In [15]:
from gensim.matutils import kullback_leibler, jaccard, hellinger, cossim
print('Cosine similarity:',cossim(bowvec_sent1_lsi,bowvec_sent2_lsi))
print('Cosine similarity without stopwords',cossim(bowvec_sent1s_lsi,bowvec_sent2s_lsi))

Cosine similarity: 0.5766073355368642
Cosine similarity without stopwords 0.5766073355368642


In [16]:
hellinger(vec_sent1_lsi,vec_sent2_lsi)

  sim = np.sqrt(0.5 * ((np.sqrt(vec1) - np.sqrt(vec2))**2).sum())


nan

In [17]:
kullback_leibler(vec_sent1_lsi,vec_sent2_lsi)

inf

In [18]:
jaccard(vec_sent1s_lsi,vec_sent2s_lsi)

1.0
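Results such as `nan` and `inf` are not surprising here: Hellinger and Kullback-Leibler are defined for probability distributions, while LSI vectors can contain negative components (see the output of In [5]). The sketch below applies the Hellinger formula shown in the warning above to made-up toy vectors to illustrate the effect; it is not Gensim's code, and the function name `hellinger_formula` is an assumption.

```python
import numpy as np

def hellinger_formula(vec1, vec2):
    """Hellinger formula from the warning message, applied to dense vectors."""
    vec1 = np.asarray(vec1, dtype=float)
    vec2 = np.asarray(vec2, dtype=float)
    # Silence the "invalid value" warning raised by sqrt of negative numbers
    with np.errstate(invalid="ignore"):
        return float(np.sqrt(0.5 * ((np.sqrt(vec1) - np.sqrt(vec2)) ** 2).sum()))

valid = hellinger_formula([0.5, 0.5], [0.5, 0.5])    # proper distributions
broken = hellinger_formula([0.5, -0.1], [0.4, 0.2])  # one negative component
print(valid, broken)
```

The square root of a negative component is `nan`, and it propagates through the sum, so these measures are not meaningful for raw LSI vectors.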

## Best Pair Word Overlap

Let's try a different way to compose a sentence similarity, based on the WordNet-Augmented Word Overlap similarity idea.

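$p = {\sum_{w \in sent_1} \max_{w' \in sent_2} df[w][w'] \over len(sent_1)}$

$q = {\sum_{w' \in sent_2} \max_{w \in sent_1} df[w][w'] \over len(sent_2)}$

$sim = \left\{ \begin{array}{ll}
0 & \text{if } p+q = 0\\
{2pq \over p+q} & \text{otherwise}\\
\end{array}
\right.$

Here $df[w][w']$ denotes the similarity between words $w$ and $w'$; in the function `harmonic_best_pair_word_sim` it is the cosine similarity of their LSI vectors.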
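As a worked example of the formulas above, the sketch below computes $p$, $q$ and $sim$ from a small hand-made similarity table. The table `df` and its values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical word-to-word similarities df[w][w'] (made-up values)
df = {
    "niña":   {"Alicia": 0.6, "corrió": 0.1},
    "corrió": {"Alicia": 0.1, "corrió": 1.0},
    "hueco":  {"Alicia": 0.0, "corrió": 0.2},
}
sent_1 = ["niña", "corrió", "hueco"]
sent_2 = ["Alicia", "corrió"]

# p: for each word of sent_1, keep its best match in sent_2, then average
p = sum(max(df[w][w2] for w2 in sent_2) for w in sent_1) / len(sent_1)
# q: for each word of sent_2, keep its best match in sent_1, then average
q = sum(max(df[w][w2] for w in sent_1) for w2 in sent_2) / len(sent_2)
# Harmonic mean of p and q, defined as 0 when p + q = 0
sim = 2 * p * q / (p + q) if p + q else 0.0
print(p, q, sim)
```

Here the best matches are 0.6, 1.0 and 0.2 for `sent_1` (so $p$ is about 0.6) and 0.6 and 1.0 for `sent_2` (so $q$ is about 0.8).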

In [19]:
def harmonic_best_pair_word_sim(sent1, sent2, lsi):
    def word_vec(word):
        #LSI values of a single word; raises IndexError if the word is not in the dictionary
        return np.asarray(lsi[dictionary.doc2bow([word])])[...,1].reshape(1,-1)

    def mean_best_match(words, others):
        #Average, over words, of the best cosine similarity against any word in others
        total = 0.0
        for w in words:
            best = 0.0
            for o in others:
                try:
                    best = max(best, cosine_similarity(word_vec(w), word_vec(o))[0][0])
                except (IndexError, ValueError):
                    pass #word missing from the dictionary or vectors of different length
            total += best
        return total/len(words)

    p = mean_best_match(sent1, sent2)
    q = mean_best_match(sent2, sent1)
    #Harmonic mean of p and q; defined as 0 when p+q = 0
    return 2*p*q/(p+q) if p+q else 0.0

harmonic_best_pair_word_sim(sent1,sent2, lsi)

0.3636363636363637

In [20]:
#Replace sent1 with sent3, a sentence with a very different meaning
sent3 = ['el','niño','comió','una','manzana','roja']

print(harmonic_best_pair_word_sim(sent3,sent2,lsi))

#Original pair (sent1, sent2) again for comparison; these lists still contain stopwords
print(harmonic_best_pair_word_sim(sent1,sent2,lsi))

0.01150097850428035
0.3636363636363637


# Conclusions

* As you can verify, the LSI model is generated quickly, because parallel computing is built into the Gensim implementation.
* LSI produces a bow-like vector: because it works with topic vectors, it returns an array of (topic_id, value) pairs.
* In this example, filtering stopwords did not change the cosine similarity: both versions of the sentence pair give 0.5766, presumably because the removed stopwords are not in the dictionary, so `doc2bow` already ignores them.

As you can verify yourself, the Gensim and Sklearn cosine similarities give the same result.

# Recommendations

* Repeat the same example with Wikipedia dump data, to test how the similarity values change with the data.