<a href="https://colab.research.google.com/github/wenxuan0923/My-notes/blob/master/CountVectorizer_TfidfVectorizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Feature Extraction
## CountVectorizer & TfidfVectorizer

**Vectorization** is the general process of turning a collection of text documents into numerical feature vectors. The specific strategy of tokenizing, counting, and normalizing is called the **Bag of Words** or "Bag of n-grams" representation: documents are described by word occurrences, and the relative position of the words in the document is ignored.

This note covers two of the most popular feature extraction techniques for text data, **CountVectorizer** and **TfidfVectorizer** from the `sklearn.feature_extraction.text` module, using a toy document set.
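
To make the "position is ignored" point concrete, here is a minimal sketch in plain Python. The helper `bag_of_words` is a hypothetical name used only for illustration, and its regular expression is an assumption that loosely mimics a "words of two or more characters" tokenizer; it is not scikit-learn's implementation.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, extract word tokens of 2+ characters, and count them."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

# Two sentences that use the same words in a different order
bag_a = bag_of_words("Is love the greatest thing?")
bag_b = bag_of_words("The greatest thing is love.")

print(bag_a)
print(bag_a == bag_b)  # True: word order is not part of the representation
```

Both sentences map to the same bag, which is exactly the information a bag-of-words model keeps and the information it throws away.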


In [0]:
import pandas as pd
from sklearn.feature_extraction.text import (CountVectorizer, 
                                             TfidfVectorizer,
                                             TfidfTransformer)

In [0]:
corpus = ["The greatest thing of life is love.",
          "Love is great, it's great to be loved.",
          "Is love the greatest thing?",
          "I love lasagna for 1000 times"]

## CountVectorizer

**CountVectorizer** converts a collection of text documents to a matrix of token counts: **the occurrences of tokens in each document**.

This implementation produces a sparse representation of the counts.
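
As a rough illustration of what "sparse representation" means here, the sketch below inspects the object returned by `fit_transform` on the same toy corpus. It assumes SciPy is available (scikit-learn depends on it).

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The greatest thing of life is love.",
          "Love is great, it's great to be loved.",
          "Is love the greatest thing?",
          "I love lasagna for 1000 times"]

counts = CountVectorizer().fit_transform(corpus)

print(sparse.issparse(counts))  # True: a SciPy sparse matrix, not a NumPy array
print(counts.shape)             # (4, 16): 4 documents, 16 distinct tokens
print(counts.nnz)               # 24: only the non-zero counts are stored

# Convert to a dense array only when the matrix is small enough to fit in memory
dense = counts.toarray()
```

On a real corpus with tens of thousands of features, most entries are zero, so calling `toarray()` is usually reserved for small examples like this one.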

In [45]:
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1))
vectorized = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(vectorized.toarray(),
             index=['sentence '+str(i) for i in range(1, 1+len(corpus))],
             columns=vectorizer.get_feature_names_out())

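| | 1000 | be | for | great | greatest | is | it | lasagna | life | love | loved | of | the | thing | times | to |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sentence 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| sentence 2 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| sentence 3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| sentence 4 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |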


Note that for each sentence in the corpus, the position of the tokens (words in our case) is completely ignored. When constructing this bag-of-words representation, the default configuration lowercases the text and tokenizes it by extracting words of at least 2 alphanumeric characters; punctuation is ignored and always treated as a token separator. This is why the single-character word `I` in the last sentence does not appear as a feature.
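
The default tokenization can be inspected directly with `build_analyzer()`, which returns the callable the vectorizer applies to each document. The sketch below runs it on two sentences from the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() bundles preprocessing (lowercasing), tokenization and n-gram extraction
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("Love is great, it's great to be loved.")
print(tokens)
# ['love', 'is', 'great', 'it', 'great', 'to', 'be', 'loved']

short_tokens = analyzer("I love lasagna for 1000 times")
print(short_tokens)
# ['love', 'lasagna', 'for', '1000', 'times']
```

The apostrophe in `it's` acts as a separator, so `it` is kept and the lone `s` is dropped, just like the single-character `I`.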

### Consider only certain patterns

We can also specify the desired pattern for our tokens with the `token_pattern` argument, a regular expression that defines what counts as a token.


In [46]:
vectorizer = CountVectorizer(analyzer='word',
                             token_pattern=r'\b[a-zA-Z]{3,}\b',  # letters only, at least 3 characters
                             ngram_range=(1, 1))
vectorized = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(vectorized.toarray(),
             index=['sentence '+str(i) for i in range(1, 1+len(corpus))],
             columns=vectorizer.get_feature_names_out())

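| | for | great | greatest | lasagna | life | love | loved | the | thing | times |
|---|---|---|---|---|---|---|---|---|---|---|
| sentence 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| sentence 2 | 0 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| sentence 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| sentence 4 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |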


Note that `1000`, `be`, `is`, `it`, `of`, and `to` are removed from the original feature space.
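
One way to double-check which tokens a custom `token_pattern` drops is to compare the two fitted vocabularies. This is a small sketch on the same toy corpus; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The greatest thing of life is love.",
          "Love is great, it's great to be loved.",
          "Is love the greatest thing?",
          "I love lasagna for 1000 times"]

# vocabulary_ maps each token to its column index; only the keys are needed here
default_vocab = set(CountVectorizer().fit(corpus).vocabulary_)
letters_vocab = set(
    CountVectorizer(token_pattern=r'\b[a-zA-Z]{3,}\b').fit(corpus).vocabulary_
)

removed = sorted(default_vocab - letters_vocab)
print(removed)  # ['1000', 'be', 'is', 'it', 'of', 'to']
```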

### Consider only unigram/bigram/... tokens

The `ngram_range=(min_n, max_n)` argument sets the range of n-gram sizes to extract:

- `ngram_range=(1, 1)`: unigrams only
- `ngram_range=(2, 2)`: bigrams only
- `ngram_range=(1, 2)`: both unigrams and bigrams
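
To see what these settings produce on a single sentence, the sketch below applies an analyzer with `ngram_range=(1, 2)` and splits its output into unigrams and bigrams. The letters-only `token_pattern` is the one used elsewhere in this note.

```python
from sklearn.feature_extraction.text import CountVectorizer

analyzer = CountVectorizer(token_pattern=r'\b[a-zA-Z]{3,}\b',
                           ngram_range=(1, 2)).build_analyzer()

ngrams = analyzer("Is love the greatest thing?")

# Bigrams are joined with a single space, so a space identifies them
unigrams = [g for g in ngrams if " " not in g]
bigrams = [g for g in ngrams if " " in g]

print(unigrams)  # ['love', 'the', 'greatest', 'thing']
print(bigrams)   # ['love the', 'the greatest', 'greatest thing']
```

Note that bigrams are built from the tokens that survive tokenization: `is` is dropped by the pattern first, so no bigram contains it.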

In [48]:
vectorizer = CountVectorizer(analyzer='word',
                             token_pattern=r'\b[a-zA-Z]{3,}\b',  # letters only, at least 3 characters
                             ngram_range=(2, 2))  # only bigrams
vectorized = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(vectorized.toarray(),
             index=['sentence '+str(i) for i in range(1, 1+len(corpus))],
             columns=vectorizer.get_feature_names_out())

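| | for times | great great | great loved | greatest thing | lasagna for | life love | love great | love lasagna | love the | the greatest | thing life |
|---|---|---|---|---|---|---|---|---|---|---|---|
| sentence 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| sentence 2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| sentence 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| sentence 4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |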


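### Consider only tokens that occur a certain number of times
We can also make the vectorizer ignore terms based on their **document frequency**, the number of documents that contain the term. Setting `min_df = threshold` drops terms whose document frequency is strictly lower than the threshold, and `max_df = threshold` drops terms whose document frequency is strictly higher. An integer threshold is an absolute document count; a float in the range [0.0, 1.0] is a proportion of documents.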
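
Document frequency is easy to confuse with the raw count, so here is a minimal sketch that computes it by hand from the count matrix and then applies both thresholds. The thresholds `min_df=2, max_df=3` are chosen only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The greatest thing of life is love.",
          "Love is great, it's great to be loved.",
          "Is love the greatest thing?",
          "I love lasagna for 1000 times"]
pattern = r'\b[a-zA-Z]{3,}\b'

vectorizer = CountVectorizer(token_pattern=pattern)
counts = vectorizer.fit_transform(corpus)

# Document frequency: number of documents in which a term appears at least once
doc_freq = np.asarray((counts > 0).sum(axis=0)).ravel()
df_by_term = {term: int(doc_freq[idx])
              for term, idx in vectorizer.vocabulary_.items()}
print(df_by_term)

# Keep only terms that appear in at least 2 and at most 3 documents
bounded = CountVectorizer(token_pattern=pattern, min_df=2, max_df=3).fit(corpus)
print(sorted(bounded.vocabulary_))  # ['greatest', 'the', 'thing']
```

`great` occurs twice in the corpus but only in one sentence, so its document frequency is 1 and `min_df=2` removes it; `love` appears in all 4 documents and is removed by `max_df=3`.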

In [49]:
vectorizer = CountVectorizer(analyzer='word',
                             token_pattern=r'\b[a-zA-Z]{3,}\b',  # letters only, at least 3 characters
                             ngram_range=(1, 2),  # both unigrams and bigrams
                             min_df=2)  # keep terms that appear in at least 2 documents
vectorized = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(vectorized.toarray(),
             index=['sentence '+str(i) for i in range(1, 1+len(corpus))],
             columns=vectorizer.get_feature_names_out())

| | greatest | greatest thing | love | the | the greatest | thing |
|---|---|---|---|---|---|---|
| sentence 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| sentence 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| sentence 3 | 1 | 1 | 1 | 1 | 1 | 1 |
| sentence 4 | 0 | 0 | 1 | 0 | 0 | 0 |



## TF-IDF

We can further transform a count matrix into a normalized **tf: term-frequency** or **tf-idf: term-frequency times inverse document-frequency** representation using **TfidfTransformer**. The formula used to compute the tf-idf of a **term t** in a **document d** of a document set is: <br><br>

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t), \qquad \text{idf}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$$
<br>

when `smooth_idf=True`, which is the default setting in `sklearn.feature_extraction.text.TfidfTransformer`.

> **tf(t, d)** is the number of times a term occurs in the given document. This is the same as what we got from the CountVectorizer.
>
> **n** is the total number of documents in the document set
>
> **df(t)** is the number of documents in the document set that contain the term t

With `smooth_idf=True`, 1 is added to both the numerator and the denominator inside the logarithm, as if one extra document contained every term exactly once, which prevents division by zero. The 1 added to $\text{idf}$ outside the logarithm means that terms whose logarithm is zero, i.e., terms that occur in all documents in a training set, are not entirely ignored. At the end, each row is normalized to unit Euclidean norm (divided by its own l2 norm).
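
As a sanity check, the computation can be reproduced by hand with NumPy. The sketch below assumes the smoothed idf documented by scikit-learn for `smooth_idf=True`, namely `ln((1 + n) / (1 + df(t))) + 1`, followed by l2 normalization of each row, and compares the result with `TfidfTransformer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["The greatest thing of life is love.",
          "Love is great, it's great to be loved.",
          "Is love the greatest thing?",
          "I love lasagna for 1000 times"]

counts = CountVectorizer(token_pattern=r'\b[a-zA-Z]{3,}\b').fit_transform(corpus)
tf = counts.toarray().astype(float)       # tf(t, d): raw counts

n = tf.shape[0]                           # number of documents
df = (tf > 0).sum(axis=0)                 # df(t): documents containing each term
idf = np.log((1 + n) / (1 + df)) + 1      # smoothed idf (smooth_idf=True)

tfidf = tf * idf                          # tf-idf before normalization
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # unit l2 norm per row

reference = TfidfTransformer(smooth_idf=True).fit_transform(counts).toarray()
print(np.allclose(tfidf, reference))      # True
print(tfidf[0].round(6))
```

For sentence 1, `life` (found in 1 document) ends up with the largest weight and `love` (found in all 4) with the smallest, even though each occurs once in that sentence.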

The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to **scale down the impact of tokens that occur very frequently** in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
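
The down-weighting of frequent tokens is visible in the learned idf weights, which a fitted `TfidfVectorizer` exposes as the `idf_` attribute. A small sketch on the toy corpus, sorted from least to most informative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The greatest thing of life is love.",
          "Love is great, it's great to be loved.",
          "Is love the greatest thing?",
          "I love lasagna for 1000 times"]

vectorizer = TfidfVectorizer(token_pattern=r'\b[a-zA-Z]{3,}\b').fit(corpus)

# idf_ is aligned with the column indices stored in vocabulary_
idf_by_term = {term: float(vectorizer.idf_[idx])
               for term, idx in vectorizer.vocabulary_.items()}

for term, weight in sorted(idf_by_term.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} {weight:.4f}")
```

`love`, which appears in every document, gets the minimum weight of 1.0, while terms found in a single document such as `life` or `lasagna` get the highest weight.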

## TfidfTransformer vs. TfidfVectorizer

Both the **TfidfTransformer** and **TfidfVectorizer** classes produce a matrix of TF-IDF features, but they start from different inputs:

- With **TfidfTransformer**, you first compute word counts with CountVectorizer, then fit the transformer on the count matrix to compute the Inverse Document Frequency (IDF) values and the tf-idf scores.

- With **TfidfVectorizer**, you do all three steps at once on the raw documents. Under the hood, it computes the word counts, IDF values, and tf-idf scores using the same dataset.

The following two code snippets produce the same result:

In [56]:
vectorizer = CountVectorizer(analyzer='word',
                             token_pattern=r'\b[a-zA-Z]{3,}\b',  # letters only, at least 3 characters
                             ngram_range=(1, 1))  # only unigrams
count_vectorized = vectorizer.fit_transform(corpus)
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
vectorized = tfidf_transformer.fit_transform(count_vectorized)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(vectorized.toarray(),
             index=['sentence '+str(i) for i in range(1, 1+len(corpus))],
             columns=vectorizer.get_feature_names_out())

| | for | great | greatest | lasagna | life | love | loved | the | thing | times |
|---|---|---|---|---|---|---|---|---|---|---|
| sentence 1 | 0.0 | 0.0 | 0.445132 | 0.0 | 0.564594 | 0.294628 | 0.0 | 0.445132 | 0.445132 | 0.0 |
| sentence 2 | 0.0 | 0.871022 | 0.0 | 0.0 | 0.0 | 0.227268 | 0.435511 | 0.0 | 0.0 | 0.0 |
| sentence 3 | 0.0 | 0.0 | 0.539313 | 0.0 | 0.0 | 0.356966 | 0.0 | 0.539313 | 0.539313 | 0.0 |
| sentence 4 | 0.552805 | 0.0 | 0.0 | 0.552805 | 0.0 | 0.288477 | 0.0 | 0.0 | 0.0 | 0.552805 |


This is equivalent to:

In [54]:
vectorizer = TfidfVectorizer(analyzer='word',
                             token_pattern=r'\b[a-zA-Z]{3,}\b',  # letters only, at least 3 characters
                             ngram_range=(1, 1))  # only unigrams
vectorized = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(vectorized.toarray(),
             index=['sentence '+str(i) for i in range(1, 1+len(corpus))],
             columns=vectorizer.get_feature_names_out())

| | for | great | greatest | lasagna | life | love | loved | the | thing | times |
|---|---|---|---|---|---|---|---|---|---|---|
| sentence 1 | 0.0 | 0.0 | 0.445132 | 0.0 | 0.564594 | 0.294628 | 0.0 | 0.445132 | 0.445132 | 0.0 |
| sentence 2 | 0.0 | 0.871022 | 0.0 | 0.0 | 0.0 | 0.227268 | 0.435511 | 0.0 | 0.0 | 0.0 |
| sentence 3 | 0.0 | 0.0 | 0.539313 | 0.0 | 0.0 | 0.356966 | 0.0 | 0.539313 | 0.539313 | 0.0 |
| sentence 4 | 0.552805 | 0.0 | 0.0 | 0.552805 | 0.0 | 0.288477 | 0.0 | 0.0 | 0.0 | 0.552805 |
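
Rather than comparing the two tables by eye, the equivalence of the two approaches can be checked numerically. A minimal sketch, assuming default settings for everything except the shared `token_pattern`:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

corpus = ["The greatest thing of life is love.",
          "Love is great, it's great to be loved.",
          "Is love the greatest thing?",
          "I love lasagna for 1000 times"]
pattern = r'\b[a-zA-Z]{3,}\b'

# Two steps: counts first, then tf-idf weighting
counts = CountVectorizer(token_pattern=pattern).fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One step: raw documents straight to tf-idf
one_step = TfidfVectorizer(token_pattern=pattern).fit_transform(corpus)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```

The two-step form is mainly useful when the count matrix is needed on its own, for example to inspect raw frequencies before weighting.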


After getting the numerical representation of the text data, you are ready to build ML & DL models using it.
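
As a rough sketch of that next step, the vectorizer can be chained with an estimator in a scikit-learn pipeline. The labels below are hypothetical, invented purely for illustration (they are not part of the original note), and `LogisticRegression` is just one possible choice of model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

corpus = ["The greatest thing of life is love.",
          "Love is great, it's great to be loved.",
          "Is love the greatest thing?",
          "I love lasagna for 1000 times"]

# Hypothetical labels for illustration only: 1 if the sentence mentions "greatest"
labels = [1, 0, 1, 0]

# The pipeline fits the vectorizer and the classifier on the same training data
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(corpus, labels)

# New text goes through the same fitted vocabulary and idf weights
predictions = model.predict(["What is the greatest thing?"])
print(predictions)
```

Wrapping both steps in one pipeline keeps the vocabulary and idf weights learned on the training set, so unseen text is transformed consistently at prediction time.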