# Examples of text vectorization techniques

## Import dependencies

In [4]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

## Bag of words (BoW) 
The idea behind the bag-of-words model is quite simple and can be summarized as follows:

1. Create a vocabulary of unique tokens—for example, words—from the entire set of documents.

2. Construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.


The bag-of-words model represents raw term frequencies, $tf(t, d)$: the number of times a term $t$ occurs in a document $d$.

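The two steps above can be sketched in plain Python. This is a minimal illustration rather than scikit-learn's implementation: the helper names `tokenize`, `build_vocabulary`, and `count_vector` are hypothetical, and the tokenizer only approximates the default behaviour of `CountVectorizer` (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

def tokenize(doc):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def build_vocabulary(docs):
    # Step 1: map each unique token to a column index (alphabetical order)
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(doc, vocabulary):
    # Step 2: tf(t, d) for every term t, ordered by column index
    counts = Counter(tokenize(doc))
    return [counts[term] for term in sorted(vocabulary, key=vocabulary.get)]

docs = ["The sun is shining",
        "The weather is sweet",
        "The sun is shining, the weather is sweet, and one and one is two"]
vocabulary = build_vocabulary(docs)
vectors = [count_vector(doc, vocabulary) for doc in docs]
```

Under these assumptions, `vocabulary` and `vectors` should agree with the `CountVectorizer` results for the same three documents.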
### Unigrams
In the case of unigrams, each term is a single word.

In [5]:
count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, '
                 'and one and one is two'])
bag = count.fit_transform(docs)
count.vocabulary_

{'the': 6,
 'sun': 4,
 'is': 1,
 'shining': 3,
 'weather': 8,
 'sweet': 5,
 'and': 0,
 'one': 2,
 'two': 7}

In [6]:
bag.toarray()

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]])
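
Each column of the array corresponds to the term with that index in `vocabulary_`. The sketch below copies the vocabulary and counts printed above as literals so that it runs on its own, then inverts the mapping to list the nonzero term counts per document. `describe_row` is a hypothetical helper name.

```python
vocabulary = {'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8,
              'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
counts = [[0, 1, 0, 1, 1, 0, 1, 0, 0],
          [0, 1, 0, 0, 0, 1, 1, 0, 1],
          [2, 3, 2, 1, 1, 1, 2, 1, 1]]

# Invert term -> column index into column index -> term
index_to_term = {idx: term for term, idx in vocabulary.items()}

def describe_row(row):
    # Keep only the terms that actually occur in the document
    return {index_to_term[i]: n for i, n in enumerate(row) if n > 0}

for row in counts:
    print(describe_row(row))
```

With a fitted vectorizer, `count.vocabulary_` and `bag.toarray()` would take the place of the literals.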

### Bigrams
In the case of bigrams, each term is a pair of consecutive words. More generally, the choice of n in the n-gram model depends on the particular application.
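
To make the notion of an n-gram concrete, the following sketch slides a window of n tokens over an already tokenized document. It is a plain-Python illustration, not scikit-learn's implementation, and the `ngrams` helper is a hypothetical name.

```python
def ngrams(tokens, n):
    # Join every run of n consecutive tokens into one term
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the sun is shining".split()
unigrams = ngrams(tokens, 1)  # ['the', 'sun', 'is', 'shining']
bigrams = ngrams(tokens, 2)   # ['the sun', 'sun is', 'is shining']
trigrams = ngrams(tokens, 3)  # ['the sun is', 'sun is shining']
```

Larger values of n capture more word order but produce a larger and sparser vocabulary, which is one reason the choice depends on the application.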

In [7]:
count = CountVectorizer(ngram_range=(2, 2))
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, '
                 'and one and one is two'])
bag = count.fit_transform(docs)
count.vocabulary_

{'the sun': 9,
 'sun is': 7,
 'is shining': 1,
 'the weather': 10,
 'weather is': 11,
 'is sweet': 2,
 'shining the': 6,
 'sweet and': 8,
 'and one': 0,
 'one and': 4,
 'one is': 5,
 'is two': 3}

In [8]:
bag.toarray()

array([[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

## Term Frequency-Inverse Document Frequency (TF-IDF)
