## TF-IDF

sklearn's implementation receives an array of strings as documents. It tokenizes them, counts every word, weights and normalizes those counts, and stores the result in a sparse (compressed) matrix.

First, using TfidfVectorizer directly:

In [1]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

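def test_model(documents, min_df=1):
    # min_df is an absolute count (int >= 1) or a proportion (float in [0.0, 1.0]);
    # recent scikit-learn releases reject the integer 0.
    vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                        min_df=min_df, max_df=1.0)
    # fit_transform learns the vocabulary and returns a SciPy sparse matrix
    compressed_matrix = vectorizer.fit_transform(documents)
    return compressed_matrix, vectorizer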

documents = ["top top pokemon incredible", 
             "incredible", 
             "pictures pokemon"]

compressed_matrix, vectorizer = test_model(documents)
print(compressed_matrix)

  (0, 0)	0.334906702661
  (0, 2)	0.334906702661
  (0, 3)	0.880724134463
  (1, 0)	1.0
  (2, 1)	0.795960541568
  (2, 2)	0.605348508106


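Each tuple has the form (document_id, vocabulary_id) and is followed by the TF-IDF weight of that term in that document; both indices start from 0. The vocabulary is built from the documents and sorted alphabetically: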

In [3]:
list(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

['incredible', 'pictures', 'pokemon', 'top']

In [4]:
X = compressed_matrix.toarray()
print(X)
X.shape

[[ 0.3349067   0.          0.3349067   0.88072413]
 [ 1.          0.          0.          0.        ]
 [ 0.          0.79596054  0.60534851  0.        ]]


(3, 4)
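
To see where these numbers come from, the matrix above can be recomputed by hand. The sketch below uses only the standard library and assumes TfidfVectorizer's default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row). The whitespace tokenizer and the variable names are illustrative simplifications, not sklearn internals.

```python
import math
from collections import Counter

documents = ["top top pokemon incredible",
             "incredible",
             "pictures pokemon"]

# Splitting on whitespace is enough for this toy corpus; sklearn's default
# tokenizer also lowercases and drops single-character tokens.
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(documents)

# Document frequency: the number of documents that contain each term.
df = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Smoothed inverse document frequency (assumes smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    # Raw term count times idf, one weight per vocabulary entry.
    weights = [counts[term] * idf[term] for term in vocabulary]
    # Scale the row to unit Euclidean length (assumes norm='l2').
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([w / norm for w in weights])

print(vocabulary)
for row in rows:
    print([round(w, 4) for w in row])
```

Up to rounding, the rows agree with the output of `compressed_matrix.toarray()`. "top" gets the largest weight in the first document because it appears twice there and in no other document.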

## TF-IDF as a Feature Model

Here I use CountVectorizer, which counts every word (its output can be fed directly to a Multinomial Naive Bayes classifier, for example), and TfidfTransformer, which applies the TF-IDF weighting and normalization to those counts as a separate step.
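
A minimal sketch of the two-step approach, reusing the three toy documents from the first section. The variable names are illustrative, and the final check assumes both routes are left at their default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

documents = ["top top pokemon incredible",
             "incredible",
             "pictures pokemon"]

# Step 1: raw term counts, one row per document and one column per word.
count_vectorizer = CountVectorizer(ngram_range=(1, 1))
counts = count_vectorizer.fit_transform(documents)
print(counts.toarray())

# Step 2: TF-IDF weighting and row normalization applied to those counts.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray())

# With default settings the result should match the one-step TfidfVectorizer.
direct = TfidfVectorizer().fit_transform(documents)
print(np.allclose(tfidf.toarray(), direct.toarray()))
```

Keeping the steps separate leaves the raw counts available for models that expect them, while the weighted matrix can be used wherever TF-IDF features are wanted.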

We could also use n-gram features (a 2-gram is two consecutive words), which give the bag-of-words model some information about word order. The range of n-gram sizes is set through the `ngram_range` parameter:

`ngram_range=(self.ngram_min, self.ngram_max)`
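
As a small illustration of what `ngram_range` does, the sketch below sets it to `(1, 2)` so that both single words and pairs of consecutive words become features. It reuses the toy documents from the first section; the literal range and the variable names are chosen for the example and stand in for `self.ngram_min` and `self.ngram_max`.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["top top pokemon incredible",
             "incredible",
             "pictures pokemon"]

# (1, 2) keeps the unigrams and adds every pair of consecutive words.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_counts = bigram_vectorizer.fit_transform(documents)

# vocabulary_ maps each feature to its column index; sort features by column.
feature_names = sorted(bigram_vectorizer.vocabulary_,
                       key=bigram_vectorizer.vocabulary_.get)
for index, name in enumerate(feature_names):
    print(index, name)

print(bigram_counts.shape)
```

The one-word document contributes no 2-grams, and the vocabulary grows from 4 to 8 features. On a real corpus the growth is much larger, which is the usual trade-off of adding n-grams.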