## TF-IDF

scikit-learn's implementation takes an array of strings as documents. It tokenizes each document, counts every token, normalizes those counts, and stores the result in a sparse (compressed) matrix.

The first example uses `TfidfVectorizer` directly.

In [1]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def test_model(documents, min_df=1):
    # min_df=1 keeps every term; recent scikit-learn releases reject an integer 0
    vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                                 min_df=min_df, max_df=1.0)
    compressed_matrix = vectorizer.fit_transform(documents)
    return compressed_matrix, vectorizer

documents = ["top top pokemon incredible", 
             "incredible", 
             "pictures pokemon"]

compressed_matrix, vectorizer = test_model(documents)
print(compressed_matrix)

  (0, 0)	0.334906702661
  (0, 2)	0.334906702661
  (0, 3)	0.880724134463
  (1, 0)	1.0
  (2, 1)	0.795960541568
  (2, 2)	0.605348508106


Each tuple has the form (document_id, vocabulary_id), with both indices starting at 0. The vocabulary is built from the documents and sorted alphabetically.

In [2]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out().tolist()

['incredible', 'pictures', 'pokemon', 'top']

In [3]:
X = compressed_matrix.toarray()
print(X)
X.shape

[[ 0.3349067   0.          0.3349067   0.88072413]
 [ 1.          0.          0.          0.        ]
 [ 0.          0.79596054  0.60534851  0.        ]]


(3, 4)
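
The values above can be reproduced by hand. The sketch below assumes scikit-learn's default settings (a smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and starts from the raw counts of the three documents. The variable names are only illustrative.

```python
import numpy as np

# Raw term counts; columns follow the sorted vocabulary:
# ['incredible', 'pictures', 'pokemon', 'top']
counts = np.array([[1, 0, 1, 2],
                   [1, 0, 0, 0],
                   [0, 1, 1, 0]], dtype=float)

n_docs = counts.shape[0]
# Document frequency: number of documents that contain each term
df = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency (assumed scikit-learn default)
idf = np.log((1 + n_docs) / (1 + df)) + 1
# Weight the counts, then scale each row to unit Euclidean length
weighted = counts * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)
print(np.round(tfidf, 6))
```

Under these assumptions the result agrees with the dense matrix printed above, up to rounding. "top" gets the largest weight in the first document because it appears twice there and in no other document.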

## TF-IDF as a Feature Model

Here I use `CountVectorizer`, which counts every word (the counts can be fed to a Multinomial Naive Bayes classifier, for example), and `TfidfTransformer`, which normalizes those counts in a separate step.

We could also use n-gram features (a 2-gram is two consecutive words), which give the bag-of-words model some information about word order. In the class below, the range is set with:

`ngram_range=(self.ngram_min, self.ngram_max)`
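
To make the effect of a wider range concrete, here is a plain-Python sketch of the word n-grams that a range of (1, 2) would produce for the example documents. The helper `word_ngrams` is hypothetical and splits on whitespace only, so it mirrors the idea rather than scikit-learn's actual tokenizer.

```python
def word_ngrams(text, n_min=1, n_max=1):
    """Return all word n-grams of length n_min..n_max from a whitespace-split text."""
    tokens = text.split()
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the document
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

documents = ["top top pokemon incredible",
             "incredible",
             "pictures pokemon"]

# Collect unigrams and bigrams from every document into a sorted vocabulary
vocabulary = sorted({g for doc in documents for g in word_ngrams(doc, 1, 2)})
print(vocabulary)
```

With bigrams included, "pictures pokemon" and "pokemon incredible" become distinct features, which is the word-order information a unigram model lacks.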

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

class TFIDFModel():
    """
    TF-IDF model
    """

    def __init__(self, ngram_min=1, ngram_max=1, min_df=1, max_df=1.0,
                 tfidf=True, trace=False):
        self.vectorizer = None
        self.ngram_min = ngram_min
        self.ngram_max = ngram_max
        self.min_df = min_df
        self.max_df = max_df
        self.tfidf = tfidf
        self.trace = trace
        
    def make_x(self, data):
        # Build the vocabulary and count every token
        self.vectorizer = CountVectorizer(ngram_range=(self.ngram_min, self.ngram_max),
                                          min_df=self.min_df, max_df=self.max_df)
        compressed_matrix = self.vectorizer.fit_transform(data)
        if self.trace:
            # The vocabulary only exists once the vectorizer has been fitted
            print("vocabulary: {}".format(
                self.vectorizer.get_feature_names_out().tolist()))

        if self.tfidf:
            # Re-weight the raw counts with TF-IDF
            compressed_matrix = TfidfTransformer().fit_transform(compressed_matrix)
        X = pd.DataFrame(compressed_matrix.toarray(), index=data.index)
        return X
    
    def extract_features(self, data):
        # Reuse the fitted vocabulary on new data (returns raw counts,
        # without the TF-IDF step)
        return self.vectorizer.transform(data)

**`make_x` receives a pandas Series**

In [5]:
import pandas as pd
docs = pd.Series(documents)
docs

0    top top pokemon incredible
1                    incredible
2              pictures pokemon
dtype: object

**Counting**

In [6]:
tfidf_model = TFIDFModel(ngram_min=1, ngram_max=1, tfidf=False, trace=True)
tfidf_model.make_x(docs)

vocabulary: ['incredible', 'pictures', 'pokemon', 'top']


   0  1  2  3
0  1  0  1  2
1  1  0  0  0
2  0  1  1  0


**Normalizing**

In [7]:
feature_model = TFIDFModel(ngram_min=1, ngram_max=1, min_df=0.2, max_df=0.7, tfidf=True, trace=False)
feature_model.make_x(docs)

          0         1         2         3
0  0.334907  0.000000  0.334907  0.880724
1  1.000000  0.000000  0.000000  0.000000
2  0.000000  0.795961  0.605349  0.000000
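
These weights are the same as in the unfiltered example, so `min_df=0.2` and `max_df=0.7` evidently removed no terms here. The sketch below shows why, assuming fractional thresholds are read as proportions of the number of documents and that only terms whose document frequency lies outside the resulting bounds are dropped. It uses plain Python with whitespace tokenization, not the vectorizer itself.

```python
documents = ["top top pokemon incredible",
             "incredible",
             "pictures pokemon"]
min_df, max_df = 0.2, 0.7

n_docs = len(documents)
# Document frequency: in how many documents each term occurs
doc_freq = {}
for doc in documents:
    for term in set(doc.split()):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Turn the proportions into document-count bounds (assumed interpretation)
low, high = min_df * n_docs, max_df * n_docs
kept = sorted(t for t, df in doc_freq.items() if low <= df <= high)
print(doc_freq)
print(kept)
```

With three documents the bounds are roughly 0.6 and 2.1, and every term appears in one or two documents, so the full vocabulary survives. A term present in all three documents would be dropped by `max_df=0.7`.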
