# Examples of text vectorization techniques

This notebook describes NLP techniques that can be used for text vectorization and gives examples of each.

Based on [Python Machine Learning (2nd edition)](https://www.amazon.com/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1787125939) by Sebastian Raschka and Vahid Mirjalili, [Applied Text Analysis with Python](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html) by Tony Ojeda, Rebecca Bilbro, Benjamin Bengfort, and [scikit-learn documentation](https://scikit-learn.org/stable/modules/feature_extraction.html).

## Import dependencies

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

## Bag of words (BoW) 

### Unigrams
With unigrams, each term in the vocabulary is an individual word.

In [5]:
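# Default settings: each term is a single lowercased word (unigram)
count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, '
                 'and one and one is two'])
# Learn the vocabulary and build the sparse document-term count matrix
bag = count.fit_transform(docs)
# Mapping from each term to its column index in the matrix
count.vocabulary_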

{'the': 6,
 'sun': 4,
 'is': 1,
 'shining': 3,
 'weather': 8,
 'sweet': 5,
 'and': 0,
 'one': 2,
 'two': 7}

In [6]:
bag.toarray()

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]])
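
To make the mapping between the vocabulary and the matrix columns explicit, the counts above can be reproduced with the standard library alone. This is a minimal sketch, not scikit-learn's implementation: the `tokenize` helper is hypothetical, and it assumes the default `CountVectorizer` behaviour of lowercasing the text, keeping tokens of two or more word characters, and assigning column indices in alphabetical order.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, '
        'and one and one is two']


def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of 2+ word characters
    return re.findall(r'\b\w\w+\b', text.lower())


tokenized = [tokenize(doc) for doc in docs]

# Alphabetically sorted unique terms, each mapped to a column index
unique_terms = sorted({term for doc in tokenized for term in doc})
vocabulary = {term: idx for idx, term in enumerate(unique_terms)}

# One row per document, one column per term, each cell a raw count
rows = []
for doc in tokenized:
    counts = Counter(doc)
    rows.append([counts[term] for term in vocabulary])

for row in rows:
    print(row)
# [0, 1, 0, 1, 1, 0, 1, 0, 0]
# [0, 1, 0, 0, 0, 1, 1, 0, 1]
# [2, 3, 2, 1, 1, 1, 2, 1, 1]
```

Column 1 corresponds to `'is'`, which is why the third row holds a 3 there: the word occurs three times in the third document.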

### n-grams
An n-gram is a contiguous sequence of n items (words, letters, or symbols) in a text.
The choice of n depends on the particular application.

Character n-grams can also be used to represent words without relying on a tokenizer. This can be beneficial for applications such as email anti-spam filtering, where spammers attempt to confuse tokenizers, even though it increases the dimensionality of the problem, as shown in the study by [Kanaris et al.](https://www.researchgate.net/publication/220160318_Words_versus_Character_n-Grams_for_Anti-Spam_Filtering).

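
As a rough illustration of character n-grams, the sketch below slides a window of n characters over a string. The `char_ngrams` helper and the obfuscated spelling `'disc0unt'` are hypothetical examples chosen for illustration; they are not taken from the cited study.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams('sun is', 3))
# ['sun', 'un ', 'n i', ' is']

# A word and a hypothetical obfuscated spelling still share some trigrams,
# even though a word-level tokenizer would treat them as unrelated tokens.
plain = set(char_ngrams('discount', 3))
obfuscated = set(char_ngrams('disc0unt', 3))
shared = sorted(plain & obfuscated)
print(shared)
# ['dis', 'isc', 'unt']
```
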
### Word bigrams

In [7]:
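# ngram_range=(2, 2) extracts only word bigrams (pairs of adjacent words)
count = CountVectorizer(ngram_range=(2, 2))
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, '
                 'and one and one is two'])
# Learn the bigram vocabulary and build the sparse count matrix
bag = count.fit_transform(docs)
# Mapping from each bigram to its column index in the matrix
count.vocabulary_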

{'the sun': 9,
 'sun is': 7,
 'is shining': 1,
 'the weather': 10,
 'weather is': 11,
 'is sweet': 2,
 'shining the': 6,
 'sweet and': 8,
 'and one': 0,
 'one and': 4,
 'one is': 5,
 'is two': 3}

In [8]:
bag.toarray()

array([[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
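
The third row of the matrix can be checked by hand with a small sliding-window sketch. The `word_ngrams` helper is hypothetical and assumes the same tokenization as the default `CountVectorizer` (lowercased tokens of two or more word characters, with punctuation dropped before the n-grams are formed).

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return contiguous word n-grams of `text` as space-joined strings."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


doc = 'The sun is shining, the weather is sweet, and one and one is two'
bigram_counts = Counter(word_ngrams(doc, 2))

print(len(bigram_counts))        # 12 distinct bigrams
print(bigram_counts['and one'])  # 2
```

Because punctuation is removed before the window is applied, bigrams such as `'shining the'` span the comma, and `'and one'` is the only bigram that occurs twice, which matches the 2 in column 0.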

## Term Frequency-Inverse Document Frequency (TF-IDF)

When analyzing text data, it is common to encounter words that occur in many documents regardless of their class.
Such frequently occurring words typically don't contain useful or discriminatory information.

The term frequency-inverse document frequency (tf-idf) method can be used to downweight these frequently occurring words in the feature vectors. The tf-idf is defined as the product of the term frequency and the inverse document frequency:

$tf\_idf(t, d) = tf(t, d) \times idf(t, d)$

Here $tf(t, d)$ is the term frequency, that is, the raw count of term $t$ in document $d$ as computed in the Bag of Words section, and $idf(t, d)$ is the inverse document frequency, which can be calculated as follows:

$idf(t, d) = \log \frac{n_d}{1 + df(d, t)}$
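
The formula can be evaluated by hand on the three example documents. The sketch below rests on assumptions: $n_d$ is taken to be the total number of documents, $df(d, t)$ the number of documents that contain term $t$, and the logarithm is taken to be the natural log. The helper names `idf` and `tf_idf` are hypothetical.

```python
import math
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, '
        'and one and one is two']

# Same tokenization assumption as the bag-of-words examples
tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in docs]
n_d = len(tokenized)  # assumed meaning of n_d: total number of documents

# Document frequency: number of documents that contain each term
df = Counter(term for doc in tokenized for term in set(doc))


def idf(term):
    # idf = log(n_d / (1 + df)), using the natural logarithm (assumption)
    return math.log(n_d / (1 + df[term]))


def tf_idf(term, doc_index):
    # tf is the raw count of the term in the chosen document
    tf = tokenized[doc_index].count(term)
    return tf * idf(term)


for term in ('is', 'sun', 'two'):
    print(term, round(tf_idf(term, 2), 3))
# is -0.863
# sun 0.0
# two 0.405
```

With this definition, a term that appears in every document (`'is'`) receives a negative weight, a term in two of the three documents (`'sun'`) receives zero, and only the rare term `'two'` receives a positive weight.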