# Tf-Idf with Wikipedia

*Gensim, Scipy, Sklearn software examples.*

**Note**: The following code samples use the data obtained by transforming a Wikipedia dump into a Bag-of-Words model with the `gensim.scripts.make_wikicorpus` script.

**Prerequisites:** Tokenization skills with nltk and knowledge of the Word2Vec text representation model.

## Outline

**Main Goal:** To practice creating a _tfidf_ model from the Wikipedia corpus. As in the previous notebook, the roadmap is to generate the model, then learn how to extract information from it, and finally measure text similarity using the model as a base.

- Wrangling Data to init the model generation
- Gensim Corpus Initialization
- TfIdf model example generation/loading
- Text similarity measures examples

In [1]:
from gensim.models import TfidfModel
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
import numpy as np
import time

In [None]:
corpus_path = '/media/DATA/wiki_es/'

# 1 Acquiring & Wrangling Data

In [2]:
#Loading resources previously generated with the Gensim package
dictionary = Dictionary.load_from_text(corpus_path+'_wordids.txt.bz2')
bow_corpus = MmCorpus(corpus_path+'_bow.mm')



# 2 Generating the Tf-Idf model

In [8]:
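try:
    # Reuse a previously saved model if one exists
    tfidf = TfidfModel.load(corpus_path+'wiki-tfidf.model')
    print('TfIdf Model Loaded')
except OSError:
    # No saved model found: build it from the bow corpus and time the run
    init = time.time()
    tfidf = TfidfModel(bow_corpus, id2word=dictionary)
    end = time.time()-init
    print('TfIdf Model Generated in', end, 'seconds')
    # Use the public save method rather than the private _smart_save
    tfidf.save(corpus_path+'wiki-tfidf.model')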


TfIdf Model Generated in 658.466673374176 seconds


After about 11 minutes, my first-generation i7 laptop with 8 GB of RAM finished building the model.

In [None]:
#print the word with index 1000 in the dictionary
print(dictionary[1000])

#print the bow vector of the sentence 1
sent = "Yo como pescado"
vec_sent = dictionary.doc2bow(sent.lower().split())
print(vec_sent)

#print the TFIDF vector of the sentence 1
vec_sent_tfidf = tfidf[vec_sent]
print(vec_sent_tfidf)
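
To make the numbers printed by `tfidf[vec_sent]` less opaque, here is a minimal pure-Python sketch of the weighting that Gensim's `TfidfModel` applies with its default settings, as far as I understand them: raw term frequency times `log2(N / df)`, followed by L2 normalization. The toy document frequencies and the helper name `tfidf_weights` are assumptions made up for illustration; they are not part of Gensim.

```python
import math

def tfidf_weights(bow, doc_freqs, num_docs):
    """Weight a sparse bow vector [(word_id, count), ...] with tf * log2(N/df), then L2-normalize."""
    weighted = [(wid, count * math.log2(num_docs / doc_freqs[wid]))
                for wid, count in bow if wid in doc_freqs]
    # Words that occur in every document get weight 0 and are dropped
    weighted = [(wid, w) for wid, w in weighted if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0.0:
        return []
    return [(wid, w / norm) for wid, w in weighted]

# Hypothetical toy statistics: 8 documents; word 0 appears in 4, word 1 in 2, word 2 in all 8
toy_doc_freqs = {0: 4, 1: 2, 2: 8}
toy_vec = tfidf_weights([(0, 1), (1, 1), (2, 3)], toy_doc_freqs, num_docs=8)
print(toy_vec)
```

In this toy case word 2 disappears from the result even though it is the most frequent term in the sentence, because it appears in every document and therefore carries no discriminating information.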

# 3 Measuring Similarity between Pair of Sentences

This section shows the utility of the _tfidf_ model in an applied example. It also shows some of Gensim's native similarity functions applied to the vectors produced by the `gensim.models.TfidfModel` class.

## Data

In [6]:
sentence1 = 'the girl run into the hall'
sentence2 = 'Here Alice run to the hall'

sent1 = sentence1.lower().split()
sent2 = sentence2.lower().split()

sentence1_ws = 'girl run hall'
sentence2_ws = 'Alice run hall'

sent1s = sentence1_ws.lower().split()
sent2s = sentence2_ws.lower().split()

#sent3 replaces sent1 with a sentence of very different meaning
sent3 = ['the','boy','eat','a','red','apple']
sent3s = ['boy','eat','red','apple']

[(11023, 0.4817709603680243), (20504, 0.5599326515327857), (28174, 0.35634410125465255), (46995, 0.5721809582593161)]
[(28174, 0.49053568087890215), (46995, 0.787652089532131), (72558, 0.3727987817044736)]


## 3.1 Wrangling Data

* First: From string-sentences to bow representation of a sentence.
* Second: From bow representation to numerical-list representation of a sentence.
* Third: From numerical-list vector to numerical-vector (numpy) representation.

In [None]:
# doc2bow expects a list of tokens, so use the tokenized sentences
vec_sent1 = dictionary.doc2bow(sent1)
vec_sent2 = dictionary.doc2bow(sent2)

vec_sent1_tfidf = tfidf[vec_sent1]
vec_sent2_tfidf = tfidf[vec_sent2]

print(vec_sent1_tfidf)
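
The three wrangling steps listed above can be sketched without the Wikipedia resources. This is only an illustration under stated assumptions: `toy_token2id` is a hypothetical four-word vocabulary, and `to_bow` / `to_dense` are made-up helpers that mimic, in a simplified way, what `dictionary.doc2bow` and the dense conversion do.

```python
from collections import Counter

# Hypothetical toy vocabulary; the real ids come from the Wikipedia dictionary
toy_token2id = {'girl': 0, 'run': 1, 'hall': 2, 'alice': 3}

def to_bow(tokens, token2id):
    """Step 1: tokens -> sorted sparse [(word_id, count), ...], skipping unknown words."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

def to_dense(bow, ids):
    """Step 2: sparse vector -> plain list aligned on the given word ids."""
    lookup = dict(bow)
    return [lookup.get(i, 0) for i in ids]

bow_a = to_bow('the girl run into the hall'.split(), toy_token2id)
bow_b = to_bow('here alice run to the hall'.split(), toy_token2id)

# Both sentences must be aligned on the same ids before they can be compared
shared_ids = sorted(set(dict(bow_a)) | set(dict(bow_b)))
dense_a = to_dense(bow_a, shared_ids)
dense_b = to_dense(bow_b, shared_ids)
print(bow_a, dense_a)
print(bow_b, dense_b)
```

The third step is then a single call such as `np.asarray(dense_a)`, which turns the aligned list into a numpy vector.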

## 3.2 Sklearn TfIdf-Cosine sentence similarity

The last experiment uses the TfIdf matrix from Gensim.
Unfortunately, loading the Wikipedia dump to build a tf-idf index is too much for this computer.

In [26]:
import numpy as np
from textsim.tokendists import cosine_similarity_sklearn
from sklearn.metrics.pairwise import cosine_similarity

#Sklearn cosine for raw sentences implemented in textsim
cosine_similarity_sklearn(sent1,sent2)

0.5773502691896258

In [27]:
# nvec1 and nvec2 are the dense tf-idf vectors built in section 3.3
A = np.asarray(nvec1).reshape((1,-1))
B = np.asarray(nvec2).reshape((1,-1))
cosine_similarity(A,B)[0][0]

0.625479023699579

## 3.3 Scipy TfIdf-Cosine sentence similarity

Testing similarity with Scipy's distance functions. The sparse tf-idf vectors cannot be passed to them directly, so they are first converted to aligned dense vectors.

In [10]:
from scipy.spatial.distance import cosine
from scipy.spatial.distance import jaccard as jaccard_scipy

In [11]:
#cosine(vec_sent1_tfidf,vec_sent2_tfidf)

The line above raises an error because the vectors are sparse bow vectors in the format `list((word_id, word_tfidf_coef))`. They must first be converted to 1D numerical vectors.

In [12]:
# Turn the sparse (word_id, weight) lists into dicts for fast lookup.
# Note: vec1 holds sentence 2 and vec2 holds sentence 1; the measures used below are symmetric.
vec2 = dict(vec_sent1_tfidf)
vec1 = dict(vec_sent2_tfidf)
nvec1,nvec2 = [],[]
# Align both vectors on the union of word ids, filling missing words with 0.0
words = set(vec1.keys()).union(vec2.keys())
for word in words:
    nvec1.append(vec1.get(word,0.0))
    nvec2.append(vec2.get(word,0.0))
print(nvec1,'\n',nvec2)

[0.0, 0.787652089532131, 0.0, 0.3727987817044736, 0.49053568087890215] 
 [0.5599326515327857, 0.5721809582593161, 0.4817709603680243, 0.0, 0.35634410125465255]


In [13]:
print('Scipy Cosine:',cosine(nvec1,nvec2))
print('Scipy Jaccard:',jaccard_scipy(nvec1,nvec2))

Scipy Cosine: 0.37452097630042125
Scipy Jaccard: 1.0
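
Note that `scipy.spatial.distance.cosine` returns a distance, that is, 1 minus the cosine similarity. This is why the 0.3745 printed here and the 0.6255 returned by Sklearn's `cosine_similarity` describe the same pair of vectors. A minimal pure-Python check on the dense vectors printed above (the helper name `cosine_sim` is only illustrative):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Dense tf-idf vectors copied from the output above
nvec1 = [0.0, 0.787652089532131, 0.0, 0.3727987817044736, 0.49053568087890215]
nvec2 = [0.5599326515327857, 0.5721809582593161, 0.4817709603680243, 0.0, 0.35634410125465255]

similarity = cosine_sim(nvec1, nvec2)
distance = 1.0 - similarity
print('similarity:', similarity)
print('distance:', distance)
```

Only the two positions where both vectors are non-zero contribute to the dot product, so the shared words alone drive the score.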


## 3.4 Textsim TfIdf-Jaccard sentence similarity

Measuring similarity with the textsim package.

In [21]:
import sys
sys.path.append('/home/abelm')

from textsim.tokendists import jaccard_distance
from textsim.tokendists import cosine_similarity_sklearn

In [22]:
print('Textsim Jaccard', jaccard_distance(sent1,sent2))
print('TfIdf Textsim Jaccard', jaccard_distance(nvec1,nvec2))
#Preprocessed sentences
print('Textsim Cosine Sklearn',cosine_similarity_sklearn('girl run hall','Alice eat hall'))

Textsim Jaccard 0.625
TfIdf Textsim Jaccard 0.875
Textsim Cosine Sklearn 0.3333333333333334
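
The source of `textsim` is not shown here, so the following is only a sketch of the usual set-based Jaccard distance, `1 - |A ∩ B| / |A ∪ B|`, under the assumption that tokens are compared as sets. The function name `jaccard_distance_sets` is hypothetical. On the two tokenized sentences it gives 3 shared tokens out of 8 distinct ones, which agrees with the 0.625 printed above.

```python
def jaccard_distance_sets(tokens_a, tokens_b):
    """Jaccard distance between two token lists, treated as sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Two empty inputs are considered identical
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

sent1 = 'the girl run into the hall'.lower().split()
sent2 = 'Here Alice run to the hall'.lower().split()
print(jaccard_distance_sets(sent1, sent2))
```

Because the inputs are reduced to sets, repeated tokens such as the second "the" in the first sentence do not change the result.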


In [24]:
A = np.asarray(nvec1).reshape((1,-1))
B = np.asarray(nvec2).reshape((1,-1))

print('TfIdf Textsim Cosine Sklearn',cosine_similarity_sklearn(A,B))

Both values need to be string objects or numerical vectors!
TfIdf Textsim Cosine Sklearn 0.625479023699579


## 3.5 Best Pair Word Overlap

## 3.6 Gensim TfIdf-Hellinger sentence similarity

Testing similarity with Gensim equations.

In [7]:
from gensim.matutils import kullback_leibler, jaccard, hellinger, cossim

print('Gensim Hellinger',hellinger(vec_sent1_tfidf,vec_sent2_tfidf))
print('Gensim Cosine:',cossim(vec_sent1_tfidf,vec_sent2_tfidf))
print('Gensim Jaccard:',jaccard(vec_sent1_tfidf,vec_sent2_tfidf))

One problem with the Hellinger function in Gensim is that it iterates over only one of the two vectors, so in the above example the word 74333 (eat) will never affect the result.
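
One way to avoid that asymmetry is to iterate over the union of word ids from both vectors. The sketch below applies the standard Hellinger formula, `sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2))`, to sparse `(word_id, weight)` lists. The name `hellinger_sparse` and the toy vectors are assumptions for illustration and this is not Gensim's implementation. Keep in mind that Hellinger distance is defined for probability distributions, so values obtained from L2-normalized tf-idf vectors should be read as rough indicators only.

```python
import math

def hellinger_sparse(vec_a, vec_b):
    """Hellinger distance over the union of ids of two sparse [(id, weight), ...] vectors."""
    a, b = dict(vec_a), dict(vec_b)
    ids = set(a) | set(b)
    total = sum((math.sqrt(a.get(i, 0.0)) - math.sqrt(b.get(i, 0.0))) ** 2 for i in ids)
    return math.sqrt(0.5 * total)

# Hypothetical toy distributions sharing only word id 1
p = [(0, 0.5), (1, 0.5)]
q = [(1, 0.5), (2, 0.5)]
print(hellinger_sparse(p, q))
print(hellinger_sparse(q, p))
```

Since every id from either vector is visited, a word that appears in only one sentence always contributes to the distance, and swapping the arguments gives the same value.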

# Conclusions

* 0.659 input = bowvec, Hellinger, Gensim
* 0.267 input = bowvec, Cosine, Gensim
* 0.839 input = bowvec, Jaccard, Gensim
* 0.732 input = tfidf vec, Cosine, Scipy
* 1.000 input = tfidf vec, Jaccard, Scipy
* 0.777 input = str, Jaccard, Textsim, stopwords_filter=no
* 0.800 input = str, Jaccard, Textsim, stopwords_filter=yes
* 0.333 input = str, Cosine, Textsim-sklearn, stopwords_filter=yes
* 0.433 input = str, Cosine, Textsim-sklearn, stopwords_filter=no
* 0.267 input = tfidf vec, Cosine, Sklearn

As you can see, the Gensim and Sklearn cosine measures give the same result (0.267). The sentences have words in common and, in the context of "Alice's Adventures in Wonderland" by Lewis Carroll, have the same meaning. This book is part of the Gutenberg collection, but it appears in the Wikipedia dump only in articles of minor importance.