# TF-IDF

In a large text corpus, some words are very frequent (e.g. “the”, “a”, “is” in English) and therefore carry very little meaningful information about the actual contents of a document. If we fed the raw counts directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more informative terms.

To re-weight the count features into floating-point values suitable for a classifier, it is very common to use the tf–idf transform.

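**Tf** means **term-frequency**, while tf–idf means term-frequency times **inverse document-frequency**. A term's weight therefore grows with how often it appears in a document and shrinks with the number of documents that contain it.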

* http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
* https://www.coursera.org/learn/language-processing/lecture/vlmT5/feature-extraction-from-text
* https://en.wikipedia.org/wiki/Tf%E2%80%93idf
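
To make the weighting concrete, here is a minimal sketch that computes tf–idf by hand with the standard library only. It assumes the smoothed idf that scikit-learn documents for its default settings, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each document vector. The function name `tfidf`, the whitespace tokenisation and the toy English corpus are illustrative assumptions, not part of the notebook, so the numbers will not necessarily match `TfidfVectorizer` on text with punctuation or one-letter words.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows); rows[i][j] is the weight of vocabulary[j] in docs[i]."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # raw term frequency
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # guard against empty documents
        rows.append([w / norm for w in raw])
    return vocabulary, rows


# Hypothetical toy corpus: "the" occurs in every document, "cat" in only one.
toy_docs = ["the cat sat", "the dog sat", "the bird flew"]
vocabulary, rows = tfidf(toy_docs)
for term, weight in zip(vocabulary, rows[0]):
    print(f"{term:>5}: {weight:.3f}")
```

In the first toy document, "cat" (present in one document) ends up with a larger weight than "sat" (two documents), which in turn outweighs "the" (all three), even though each word occurs exactly once there.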
    

In [None]:
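from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import nltk
from nltk import tokenize
from nltk.corpus import stopwords
from string import punctuation

# One-time download of the NLTK data used below (stop-word lists and Punkt tokenizer models).
# Assumption: recent NLTK releases may require 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords')
nltk.download('punkt')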

In [None]:
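# Portuguese toy corpus, roughly: "This is the first document", "This is the second document",
# "This is the third third document", "is this a document?", "and the fifth."
corpus = ["Este é o primeiro documento", "Este é o segundo documento", "Este é o terceiro terceiro documento",
          "isso é um documento?", "e o quinto."]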

In [None]:
tf_idf_vectorizer = TfidfVectorizer()
bow = tf_idf_vectorizer.fit_transform(corpus)
# Learned vocabulary, in the same order as the matrix columns.
tf_idf_vectorizer.get_feature_names_out()

In [None]:
# One row per document, one column per term; each cell holds the tf-idf weight.
pd.DataFrame(bow.toarray(), columns=tf_idf_vectorizer.get_feature_names_out(), index=corpus)

In [None]:
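# Portuguese stop words plus punctuation marks, de-duplicated and kept as a list
# (recent scikit-learn versions reject a set for stop_words).
stopwords_punctuation = list(set(stopwords.words('portuguese') + list(punctuation)))

def tokenizer(document):
    # NLTK's Portuguese tokenizer splits punctuation into separate tokens.
    return tokenize.word_tokenize(document, language='portuguese')

# The custom tokenizer replaces scikit-learn's default token pattern.
tf_idf_vectorizer = TfidfVectorizer(tokenizer=tokenizer, stop_words=stopwords_punctuation)
bow = tf_idf_vectorizer.fit_transform(corpus)
pd.DataFrame(bow.toarray(), columns=tf_idf_vectorizer.get_feature_names_out(), index=corpus)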