# Tf-Idf with Wikipedia & Gensim Package

*Gensim, Scipy, Sklearn software examples.*

**Note**: The sample code below uses data obtained by transforming the Wikipedia dump into a bag-of-words model with the `gensim.scripts.make_wikicorpus.py` script.

**Prerequisites:** Tokenization skills with NLTK and knowledge of the Word2Vec text representation model.

## Outline

**Main Goal:** To practice creating a _tfidf_ model from the Wikipedia corpus. As in the previous notebook, the roadmap is to generate the model, then learn how to extract information from it, and finally measure word similarity using the model as a base.

- Acquiring and wrangling data for model initialization. 
- Gensim TfIdf model generation/loading example
- Sklearn, Scipy text similarity measures examples
- Gensim original measures examples

This notebook includes results obtained with `gensim.models.TfidfModel`, but in order to standardize results with other word-embedding methods, a solution based on the `sklearn.feature_extraction.text.TfidfVectorizer` class is taken as the main approach. In the examples below, TfidfModel only builds a vector of shape (2,) for a sentence, whereas the TfidfVectorizer pipeline (followed by an SVD with k=300) produces vectors of shape (300,).

**Note**: For background on the Gensim and NLTK software, please read the introductory notes in the [2.01-TfIdf Notebook](2.01-TfIdf.ipynb).

In [1]:
from gensim.models import TfidfModel
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
import numpy as np
import time

In [2]:
corpus_path = '/opt/wiki_es/'

# 1 Acquiring & Wrangling Data

In [3]:
#Loading resources generated previously with the Gensim package
dictionary = Dictionary.load_from_text(corpus_path+'_wordids.txt.bz2')
bow_corpus = MmCorpus(corpus_path+'_bow.mm')



# 2 Generating the Tf-Idf model

In [4]:
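try:
    tfidf = TfidfModel.load(corpus_path+'wiki-tfidf.model')
    # Hard-coded message: the timing comes from the run that first built the model
    print('TfIdf Model Generated in 658.466673374176 seconds')
except OSError:
    # No saved model found: build it from the bag-of-words corpus and time the run
    init = time.time()
    tfidf = TfidfModel(bow_corpus, id2word=dictionary)
    end = time.time()-init
    print(end)
    # Use the public save() method instead of the private _smart_save()
    tfidf.save(corpus_path+'wiki-tfidf.model')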


TfIdf Model Generated in 658.466673374176 seconds


After about 11 minutes, my first-generation i7 laptop with 8 GB of RAM finished building the model.

In [5]:
#print the word with index 1000 in the dictionary
print(dictionary[1000])

#print the bow vector of the sentence 1
sent = "Yo como pescado"
vec_sent = dictionary.doc2bow(sent.lower().split())
print(vec_sent)

#print the TFIDF vector of the sentence 1
vec_sent_tfidf = tfidf[vec_sent]
print(vec_sent_tfidf)

australianos
[(11768, 1), (26452, 1)]
[(11768, 0.8147309863935951), (26452, 0.5798391326998545)]
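
The two weights printed above have unit Euclidean length, which is consistent with Gensim's default weighting: raw term frequency times `log2(N / df)`, followed by L2 normalization. The following is a minimal pure-Python sketch of that formula. It does not call Gensim, and the document frequencies and corpus size are hypothetical values chosen for illustration only.

```python
import math

def tfidf_weights(bow, doc_freq, num_docs):
    """Weight a bag-of-words vector as tf * log2(N / df), then L2-normalize.

    bow      -- list of (token_id, count) pairs, as produced by doc2bow
    doc_freq -- dict token_id -> number of documents containing the token
    num_docs -- total number of documents in the corpus
    """
    raw = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in raw))
    if norm == 0.0:
        return []
    return [(tid, w / norm) for tid, w in raw]

# Hypothetical document frequencies for the two token ids of the example sentence.
toy_doc_freq = {11768: 1_000, 26452: 10_000}
toy_weights = tfidf_weights([(11768, 1), (26452, 1)], toy_doc_freq, num_docs=1_000_000)
print(toy_weights)

# The weights printed by the notebook also have (approximately) unit length.
notebook_weights = [0.8147309863935951, 0.5798391326998545]
print(math.sqrt(sum(w * w for w in notebook_weights)))
```

In this sketch the rarer token (lower document frequency) receives the larger weight, which matches the ordering of the two weights in the notebook output.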


# 3 Measuring Similarity between Pair of Sentences

This section shows the utility of the _tfidf_ model in an applied example, as well as some native similarity methods that work with the `gensim.models.TfidfModel` class.

## Data

In [70]:
sentence1 = 'la niña corrió hacia el hueco'
sentence2 = 'Alicia corrió hacia el hueco'

sent1 = sentence1.lower().split()
sent2 = sentence2.lower().split()

#Filtering stopwords by hand
sentence1_ws = 'niña corrió hueco'
sentence2_ws = 'Alicia corrió hueco'

sent1s = sentence1_ws.lower().split()
sent2s = sentence2_ws.lower().split()

#Replacing sent1 with sent3, a sentence with a very different meaning
sentence3 = 'El niño comió una manzana roja'
sentence3_ws = 'niño comió manzana roja'
sent3 = ['el','niño','comió','una','manzana','roja']
sent3s = ['niño','comió','manzana','roja']

## 3.1 Wrangling Data

* First: From string-sentences to bow representation of a sentence.
* Second: From bow representation to numerical-list representation of a sentence.
* Third: From numerical-list vector to numerical-vector (numpy) representation.
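
The `wrang_tfidf` helper used in this section comes from the notebook's own `scripts.preprocess` module, and its source is not shown here. As a rough illustration of the second step, the sketch below expands two sparse `(token_id, weight)` lists into dense lists over a shared set of axes. The function name `align_bow_vectors` and the sample weights are hypothetical, and the real helper may work differently.

```python
def align_bow_vectors(bow_a, bow_b):
    """Expand two sparse (token_id, weight) lists into dense lists over shared axes."""
    ids = sorted({tid for tid, _ in bow_a} | {tid for tid, _ in bow_b})
    weights_a, weights_b = dict(bow_a), dict(bow_b)
    dense_a = [weights_a.get(tid, 0.0) for tid in ids]
    dense_b = [weights_b.get(tid, 0.0) for tid in ids]
    return ids, dense_a, dense_b

# Hypothetical tf-idf vectors; token ids and weights are illustrative only.
bow_1 = [(3, 0.6), (7, 0.8)]
bow_2 = [(7, 0.5), (9, 0.5)]
ids, vec_1, vec_2 = align_bow_vectors(bow_1, bow_2)
print(ids)    # [3, 7, 9]
print(vec_1)  # [0.6, 0.8, 0.0]
print(vec_2)  # [0.0, 0.5, 0.5]
```

The third step would then be a call such as `np.asarray(vec_1).reshape((1, -1))`, which gives the 2-D row vector that the Sklearn functions expect.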

In [63]:
from scripts.preprocess import wrang_tfidf
#A = np.asarray(nvec1).reshape((1,-1))
bowvec_sent1_tfidf,bowvec_sent2_tfidf,nvec1,nvec2, A, B = wrang_tfidf(sent1,sent2,tfidf, dictionary)

## 3.2 Sklearn TfIdf-Cosine sentence similarity

This experiment uses the TfIdf model built with Gensim.
Unfortunately, loading the whole Wikipedia dump to build a tf-idf index is too much for this computer.

In [46]:
from sklearn.metrics.pairwise import cosine_similarity

print(float(cosine_similarity(A,B)))

#Filtering stopwords
bowvec_sent1s_tfidf,bowvec_sent2s_tfidf,nvec1s,nvec2s, As, Bs = wrang_tfidf(sent1s,sent2s,tfidf,dictionary)
print(float(cosine_similarity(As,Bs)))

0.7240525705998468
0.7240525705998468
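
For reference, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below implements that definition in pure Python on illustrative weights (not taken from the Wikipedia model). It also shows that dimensions in which both vectors are zero do not change the result. This may be why the filtered and unfiltered pairs score the same, for instance if the removed stopwords were absent from the dictionary, but that explanation is an assumption and is not verified here.

```python
import math

def cosine_similarity_dense(u, v):
    """Cosine similarity between two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Illustrative weights, not taken from the Wikipedia model.
u = [0.6, 0.8, 0.0]
v = [0.0, 0.5, 0.5]
base = cosine_similarity_dense(u, v)

# Appending dimensions where both vectors are zero changes nothing.
padded = cosine_similarity_dense(u + [0.0, 0.0], v + [0.0, 0.0])
print(base, padded)
```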


## 3.3 Scipy TfIdf-Cosine sentence similarity

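Testing similarity with the SciPy distance functions. Note that `scipy.spatial.distance.cosine` returns a distance (one minus the cosine similarity), so the values below are the complement of those in section 3.2. A normalized vector is used to correct the problem noted above.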

In [60]:
from scipy.spatial.distance import cosine as cosine_scipy
from scipy.spatial.distance import jaccard as jaccard_scipy

print(cosine_scipy(A,B))
print(cosine_scipy(As,Bs))

0.27594742940015304
0.27594742940015304
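
The value printed here and the one from section 3.2 can be checked against each other, since SciPy's `cosine` is a distance equal to one minus the cosine similarity. A short check using only the numbers printed in this notebook:

```python
sklearn_similarity = 0.7240525705998468   # printed in section 3.2
scipy_distance = 0.27594742940015304      # printed in this section

# Converting the SciPy distance back into a similarity
recovered_similarity = 1.0 - scipy_distance
print(recovered_similarity)
print(abs(recovered_similarity - sklearn_similarity) < 1e-12)
```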


## 3.4 Best Pair Word Overlap

For the mathematical equations behind this proposal, read the section _3.4 Best Pair Word Overlap_ in the [2.1-TfIdf Notebook](02.1-TfIdf.ipynb).

This implementation does not receive a TfidfModel-derived variable; it only works with vectors such as those produced by TfidfVectorizer.

In [23]:
from scripts.distances import best_pair_word_overlap

print('Dissimilar sentences tfidf_harmonic_best_pair_word similarity', 
      best_pair_word_overlap(sent3, sent2, pdTfIdf))
print('Dissimilar sentences without stopwords tfidf_harmonic_best_pair_word similarity',
      best_pair_word_overlap(sent3s, sent2s, pdTfIdf))
print('Similar sentences tfidf_harmonic_best_pair_word', 
      best_pair_word_overlap(sent1, sent2, pdTfIdf))
print('Similar sentences tfidf_harmonic_best_pair_word without stopwords',
      best_pair_word_overlap(sent1s, sent2s, pdTfIdf))

1.0

## 3.5 Textsim TfIdf-Jaccard sentence similarity

Computing similarity with the textsim package, including a test with a dissimilar sentence.

In [69]:
bowvec_sent4_tfidf,bowvec_sent3_tfidf,nvec4,nvec3, D, C = wrang_tfidf(sent1,sent3,tfidf, dictionary)
bowvec_sent4s_tfidf,bowvec_sent3s_tfidf,nvec4s,nvec3s, Ds, Cs = wrang_tfidf(sent1s,sent3s,tfidf, dictionary)

In [72]:
from textsim.tokendists import jaccard_distance
from textsim.tokendists import cosine_similarity_sklearn

print(float(cosine_similarity(A,B)),'TfIdf original Cosine Sklearn')
print(cosine_similarity_sklearn(A,B),'TfIdf Textsim Cosine Sklearn')
print(cosine_similarity_sklearn(sentence1,sentence2),'String Textsim Cosine Sklearn')
print(jaccard_distance(sentence1,sentence2), 'String Textsim Jaccard')
print(jaccard_distance(nvec1,nvec2), 'TfIdf Textsim Jaccard')
print('----------------------------------')
#Filtering stopwords
print(float(cosine_similarity(As,Bs)),'TfIdf original Cosine Sklearn without stopwords')
print(cosine_similarity_sklearn(As,Bs),'TfIdf Textsim Cosine Sklearn without stopwords')
print(cosine_similarity_sklearn(sentence1_ws,sentence2_ws),'String Textsim Cosine Sklearn without stopwords')
print(jaccard_distance(sentence1_ws,sentence2_ws), 'String Textsim Jaccard without stopwords')
print(jaccard_distance(nvec1s,nvec2s), 'TfIdf Textsim Jaccard without stopwords')

0.7240525705998468 TfIdf original Cosine Sklearn
0.7240525705998468 TfIdf Textsim Cosine Sklearn
0.7302967433402215 String Textsim Cosine Sklearn
0.42857142857142855 String Textsim Jaccard
0.8571428571428571 TfIdf Textsim Jaccard
----------------------------------
0.7240525705998468 TfIdf original Cosine Sklearn without stopwords
0.7240525705998468 TfIdf Textsim Cosine Sklearn without stopwords
0.6666666666666669 String Textsim Cosine Sklearn without stopwords
0.5 String Textsim Jaccard without stopwords
0.8571428571428571 TfIdf Textsim Jaccard without stopwords


In [73]:
#Dissimilar sentence
print(float(cosine_similarity(D,C)),'TfIdf original Cosine Sklearn')
print(cosine_similarity_sklearn(D,C),'TfIdf Textsim Cosine Sklearn')
print(cosine_similarity_sklearn(sentence1,sentence3),'String Textsim Cosine Sklearn')
print(jaccard_distance(sentence1,sentence3), 'String Textsim Jaccard')
print(jaccard_distance(nvec4,nvec3), 'TfIdf Textsim Jaccard')
print(float(cosine_similarity(Ds,Cs)),'TfIdf original Cosine Sklearn')
print(cosine_similarity_sklearn(Ds,Cs),'TfIdf Textsim Cosine Sklearn')
print(cosine_similarity_sklearn(sentence1_ws,sentence3_ws),'String Textsim Cosine Sklearn')
print(jaccard_distance(sentence1_ws,sentence3_ws), 'String Textsim Jaccard')
print(jaccard_distance(nvec4s,nvec3s), 'TfIdf Textsim Jaccard')

0.0 TfIdf original Cosine Sklearn
0.0 TfIdf Textsim Cosine Sklearn
0.1666666666666667 String Textsim Cosine Sklearn
1.0 String Textsim Jaccard
0.875 TfIdf Textsim Jaccard
0.0 TfIdf original Cosine Sklearn
0.0 TfIdf Textsim Cosine Sklearn
0.0 String Textsim Cosine Sklearn
1.0 String Textsim Jaccard
0.875 TfIdf Textsim Jaccard
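
The source of the textsim functions is not shown in this notebook. The string-level numbers above are nevertheless consistent with a Jaccard distance over case-sensitive whitespace tokens and a cosine similarity over lowercase token counts. The sketch below is an assumption about the underlying computation, not textsim's actual code, and the function names are hypothetical.

```python
import math
from collections import Counter

def jaccard_distance_tokens(text_a, text_b):
    """One minus (shared tokens / all tokens), on case-sensitive whitespace tokens."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

def cosine_similarity_counts(text_a, text_b):
    """Cosine similarity between lowercase token-count vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[tok] * counts_b[tok] for tok in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

sentence1 = 'la niña corrió hacia el hueco'
sentence2 = 'Alicia corrió hacia el hueco'
sentence3 = 'El niño comió una manzana roja'

print(jaccard_distance_tokens(sentence1, sentence2))   # 4 shared tokens out of 7
print(jaccard_distance_tokens(sentence1, sentence3))   # 'el' and 'El' differ by case
print(cosine_similarity_counts(sentence1, sentence2))  # 4 / sqrt(6 * 5)
print(cosine_similarity_counts(sentence1, sentence3))  # 'el' matches after lowercasing
```

Under these assumptions, the case handling explains why the dissimilar pair gets a Jaccard distance of 1.0 but a small non-zero cosine similarity when stopwords are kept.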


# 4 Gensim Original Measures

Gensim has some native similarity measures; some of them are only implemented for certain models.

## 4.1 Gensim TfIdf-Hellinger sentence similarity

Testing similarity with the Gensim equations. Here are all possible calculations:

* Kullback-Leibler does not accept bow vectors.
* Cossim does not accept numpy arrays.

In [58]:
from gensim.matutils import kullback_leibler, hellinger, cossim
from gensim.matutils import jaccard as gjaccard

print(kullback_leibler(A, B),'Gensim Kullback_leibler')
print(hellinger(A,B), 'Gensim Hellinger')
print(hellinger(bowvec_sent1_tfidf,bowvec_sent2_tfidf), 'Gensim Hellinger')
print(cossim(bowvec_sent1_tfidf,bowvec_sent2_tfidf),'Gensim Cosine')
print(gjaccard(nvec1,nvec2),'Gensim Jaccard')
print(gjaccard(bowvec_sent1_tfidf,bowvec_sent2_tfidf),'Gensim Jaccard')
print('----------------------------------')
#Filtering stopwords
print(kullback_leibler(As, Bs),'Gensim Kullback_leibler')
print(hellinger(As,Bs), 'Gensim Hellinger')
print(hellinger(bowvec_sent1s_tfidf,bowvec_sent2s_tfidf), 'Gensim Hellinger')
print(cossim(bowvec_sent1s_tfidf,bowvec_sent2s_tfidf),'Gensim Cosine')
print(gjaccard(nvec1s,nvec2s),'Gensim Jaccard')
print(gjaccard(bowvec_sent1s_tfidf,bowvec_sent2s_tfidf),'Gensim Jaccard')

inf Gensim Kullback_leibler
0.7242183179867516 Gensim Hellinger
0.7242183179867516 Gensim Hellinger
0.724052570599847 Gensim Cosine
0.8571428571428572 Gensim Jaccard
0.6578982610464937 Gensim Jaccard
----------------------------------
inf Gensim Kullback_leibler
0.7242183179867516 Gensim Hellinger
0.7242183179867516 Gensim Hellinger
0.724052570599847 Gensim Cosine
0.8571428571428572 Gensim Jaccard
0.6578982610464937 Gensim Jaccard
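
The `inf` reported for Kullback-Leibler is what the textbook definition gives whenever the first vector has weight on a term that the second vector lacks, because the term `p * log(p / q)` diverges when `q` is zero. The sketch below implements that textbook definition in pure Python on illustrative vectors. It is not Gensim's implementation.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two non-negative dense vectors."""
    total_p, total_q = sum(p), sum(q)
    divergence = 0.0
    for p_i, q_i in zip(p, q):
        p_i, q_i = p_i / total_p, q_i / total_q  # normalize to probabilities
        if p_i == 0.0:
            continue            # 0 * log(0 / q) is taken as 0
        if q_i == 0.0:
            return math.inf     # p has mass where q has none
        divergence += p_i * math.log(p_i / q_i)
    return divergence

# Illustrative dense vectors: the first term occurs only in p.
p = [0.6, 0.8, 0.0]
q = [0.0, 0.5, 0.5]
print(kl_divergence(p, q))  # inf
print(kl_divergence(p, p))  # 0.0
```

Two sentences that differ in even one word therefore produce an infinite divergence, which makes this measure uninformative for short sentence pairs unless the vectors are smoothed.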


One problem with the Hellinger equation in Gensim is that it iterates over the larger vector only, so in the above example the word 74333 (eat) will never affect the result.
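
To illustrate the effect described above, the sketch below compares a Hellinger distance summed over the union of token ids with a variant summed over the ids of one vector only. Both functions are pure-Python illustrations with hypothetical names and weights. Whether a given Gensim version behaves like the one-sided variant should be checked against its source.

```python
import math

def hellinger_sparse(bow_a, bow_b):
    """Hellinger distance between two sparse (token_id, weight) vectors.

    The sum runs over the union of token ids, so a term present in only
    one vector still contributes.
    """
    weights_a, weights_b = dict(bow_a), dict(bow_b)
    ids = set(weights_a) | set(weights_b)
    total = sum(
        (math.sqrt(weights_a.get(tid, 0.0)) - math.sqrt(weights_b.get(tid, 0.0))) ** 2
        for tid in ids
    )
    return math.sqrt(0.5 * total)

def hellinger_one_sided(bow_a, bow_b):
    """Same formula, but summing only over the token ids of bow_a."""
    weights_b = dict(bow_b)
    total = sum(
        (math.sqrt(w) - math.sqrt(weights_b.get(tid, 0.0))) ** 2 for tid, w in bow_a
    )
    return math.sqrt(0.5 * total)

# Hypothetical weights; token id 9 appears only in the second vector.
bow_1 = [(3, 0.5), (7, 0.5)]
bow_2 = [(3, 0.25), (7, 0.25), (9, 0.5)]
print(hellinger_sparse(bow_1, bow_2))
print(hellinger_one_sided(bow_1, bow_2))  # smaller: the term with id 9 is ignored
```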

# TfidfVectorizer Example Code

    import pickle
    import gensim
    from gensim.corpora import Dictionary
    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.sparse import coo_matrix
    from scipy.sparse.linalg import svds

    corpus_path = '/media/abelma/SSD2/wiki_es/'
    wiki_corpus = corpus_path+'dump/eswiki-20161201-pages-articles-multistream.xml.bz2'
    dictionary = Dictionary.load_from_text(corpus_path+'_wordids.txt.bz2')
    article_min_tokens=50
    article_max_tokens=5000

    def read_wikipedia_corpus(filename, 
                              dictionary=dictionary, 
                              article_min_tokens=article_min_tokens, 
                              article_max_tokens=article_max_tokens):
        # Stream the dump and yield each article as one whitespace-joined string
        corpus = gensim.corpora.WikiCorpus(filename, 
                                           dictionary=dictionary,
                                           article_min_tokens=article_min_tokens,
                                           article_max_tokens=article_max_tokens,
                                           lemmatize=None)

        for text in corpus.get_texts():
            yield ' '.join(word for word in text)

    vectorizer = TfidfVectorizer(min_df=20,max_df=0.8, max_features=2000000, use_idf=True)#, ngram_range=(1,2))
    TfIdfMatrix = vectorizer.fit_transform(read_wikipedia_corpus(wiki_corpus))
    # Transpose so that rows are terms, then keep 300 singular vectors
    coGTfMatrix = coo_matrix(TfIdfMatrix.T)
    coGTfMatrix.data = coGTfMatrix.data*1.0
    Ug, Sg, Vg = svds(coGTfMatrix, k=300)
    # Ug holds one 300-dimensional row per term; it is the matrix loaded in the next cell
    with open('/media/abelma/SSD2/wiki_es/UgTfIdfMatrix.pkl','wb') as f:
        pickle.dump(Ug,f)
    with open('/media/abelma/SSD2/wiki_es/vectorizer.pkl','wb') as f:
        pickle.dump(vectorizer,f)
    
This code only works on computers with a large amount of RAM (>=16 GB). During the process, every huge matrix was serialized to a pickle object in order to free the RAM.

In [74]:
import pickle
# Load the fitted vectorizer and the term matrix saved by the code above
with open('/opt/wiki_es/vectorizer.pkl','rb') as f:
    vectorizer = pickle.load(f)
with open('/opt/wiki_es/UgTfIdfMatrix.pkl','rb') as f:
    Ug = pickle.load(f)

In [77]:
A = Ug[vectorizer.vocabulary_['rey']].reshape((1,-1))
B = Ug[vectorizer.vocabulary_['reina']].reshape((1,-1))
float(cosine_similarity(A,B))

0.5560445857905273

# Conclusions

See [3.1-Playfull-Experiments-with-MSRPC](3.1-Playfull-Experiments-with-MSRPC) for a more detailed analysis of the results and for feature selection based on a paraphrase recognition problem.

Results analysis:

| Similarity | Vector type | Measure | Library | Similar/Dissimilar | Stopword filter |
|------------|-------------|---------|---------|--------------------|-----------------|
| inf      | numpy       | kull-Le | gensim  | sim                | yes/no          |
| 1.0000   | str         | jaccard | textsim | diss               | yes/no          |
| 0.8571   | numpy       | jaccard | textsim | sim/diss           | yes/no          |
|          |             | jaccard | gensim  | sim                | yes/no          |
| 0.7302   | str         | kcosine | textsim | sim                | no              |
| 0.7240   | numpy       | kcosine | textsim | sim                | yes/no          |
|          |             | cosine  | sklearn | sim                | yes/no          |
|          |             | helling | gensim  | sim                | yes/no          |
|          | bow         | helling | gensim  | sim                | yes/no          |
|          | bow         | cosine  | gensim  | sim                | yes/no          |
| 0.6666   | str         | kcosine | textsim | sim                | yes             |
| 0.6578   | bow         | jaccard | gensim  | sim                | yes/no          |
| 0.5000   | str         | jaccard | textsim | sim                | yes             |
| 0.4285   | str         | jaccard | textsim | sim                | no              |
| 0.2759   | numpy       | cosine  | scipy   | sim                | yes/no          |
| 0.1666   | str         | kcosine | textsim | diss               | no              |
| 0.0000   | numpy       | kcosine | textsim | diss               | yes/no          |
| 0.0000   |             | cosine  | sklearn | diss               | yes/no          |
| 0.0000   | str         | kcosine | textsim | diss               | yes             |


* As you can see for yourself, the Gensim and Sklearn cosine measures give the same result.
* The example sentences have words in common and, in the context of "Alice's Adventures in Wonderland" by Lewis Carroll, have the same meaning. This book is part of the Gutenberg collection, but it only appears in the Wikipedia dump in articles of minor importance.
* Only the textsim package with str input, for the Jaccard and cosine distances, can differentiate between stopword-filtered and unfiltered sentences.
* The combinations that classify similar and dissimilar sentences well are:
    + numpy - cosine - sklearn -> diff = 0.72
    + str - jaccard - textsim -> diff = 0.6
    + str - cosine - textsim -> diff = 0.57
    