# Bag of Words

Vectorization is the general process of turning a collection of text documents into numerical feature vectors. The specific strategy of tokenizing, counting, and normalizing is called the Bag of Words, or “Bag of n-grams”, representation. Each document is described by how often each word occurs in it, and the relative position of the words within the document is ignored entirely.

* http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation
* https://www.coursera.org/learn/language-processing/lecture/vlmT5/feature-extraction-from-text
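
To make the definition concrete, here is a minimal plain-Python sketch of the tokenize-and-count steps. The helper name `bag_of_words`, the whitespace tokenizer, and the two English sample sentences are illustrative assumptions, not part of any library. The two sentences contain the same words in a different order, so they end up with identical vectors, which is exactly the loss of position information described above.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_vectors) for whitespace-tokenized documents."""
    # Tokenize: lowercase and split on whitespace (a deliberately naive tokenizer).
    tokenized = [doc.lower().split() for doc in documents]
    # Vocabulary: sorted list of the distinct tokens across all documents.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    # One count vector per document, with one column per vocabulary entry.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = ["the dog bites the man", "the man bites the dog"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['bites', 'dog', 'man', 'the']
print(vectors)     # [[1, 1, 1, 2], [1, 1, 1, 2]]
```

A real vectorizer adds better tokenization, sparse storage, and optional normalization, but the core idea is this word-by-document count table.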


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
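# The NLTK cells need the 'stopwords' corpus and the Punkt tokenizer models,
# e.g. nltk.download('stopwords') and nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).
import nltk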
from nltk import tokenize
from nltk.corpus import stopwords
from string import punctuation

In [None]:
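# Sample corpus in Portuguese. In English: "This is the first document",
# "This is the second document", "This is the third third document",
# "is this a document?", "and the fifth."
corpus = ["Este é o primeiro documento", "Este é o segundo documento", "Este é o terceiro terceiro documento",
          "isso é um documento?", "e o quinto."]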

In [None]:
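# Learn the vocabulary and build the document-term count matrix.
# The default token pattern keeps only tokens of two or more characters,
# so one-letter words such as "é" and "o" do not appear in the vocabulary.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
vectorizer.get_feature_names_out()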

In [None]:
# One row per document, one column per vocabulary term.
pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out(), index=corpus)
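
The “Bag of n-grams” variant counts short sequences of adjacent words instead of single words, which recovers a little of the local word order. The following is a rough plain-Python sketch of word bigrams for one document from the corpus; the helper `word_ngrams` is a hypothetical name, and the whitespace tokenization is a simplification. In scikit-learn, `CountVectorizer` exposes the same idea through its `ngram_range` parameter.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return the contiguous n-token sequences in `tokens`, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "este é o terceiro terceiro documento".split()
bigram_counts = Counter(word_ngrams(tokens, 2))
print(bigram_counts["terceiro terceiro"])  # 1
print(len(bigram_counts))                  # 5
```

Larger values of `n` capture more context but make the vocabulary grow quickly, so unigrams and bigrams are a common starting point.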

## Removing stopwords and punctuation

In [None]:
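# Portuguese stopwords plus punctuation marks, deduplicated.
# Recent scikit-learn versions require stop_words to be a list, not a set.
stopwords_punctuation = sorted(set(stopwords.words('portuguese') + list(punctuation)))

def tokenizer(document):
    # Split the document into word and punctuation tokens with NLTK.
    return tokenize.word_tokenize(document, language='portuguese')

# token_pattern=None silences the warning that the pattern is unused with a custom tokenizer.
vectorizer = CountVectorizer(tokenizer=tokenizer, token_pattern=None, stop_words=stopwords_punctuation)
bow = vectorizer.fit_transform(corpus)
pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out(), index=corpus)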