# TF-IDF

## Term Frequency-Inverse Document Frequency based Vectorizer

**Documentation**: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [1]:
%%capture
%run bag_words_example.ipynb

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

## TfidfVectorizer

A basic TF-IDF vectorizer is created by instantiating the `TfidfVectorizer` class.
Calling `fit_transform()` on the corpus learns the vocabulary and IDF weights, and returns the TF-IDF matrix.

In [3]:
vectorizer = TfidfVectorizer()
tf_idf_matrix = vectorizer.fit_transform(preprocessed_corpus)

### Let's see which features were obtained and the corresponding TF-IDF matrix

In [4]:
print(vectorizer.get_feature_names_out())
print(tf_idf_matrix.toarray())
print("\nThe shape of the TF-IDF matrix is: ", tf_idf_matrix.shape)

['comprehend' 'computers' 'data' 'everyday' 'evolve' 'field' 'language'
 'make' 'natural' 'process' 'read']
[[0.         0.         0.         0.         0.         0.
  0.41285857 0.         0.41285857 0.41285857 0.69903033]
 [0.40512186 0.40512186 0.40512186 0.         0.         0.
  0.478543   0.40512186 0.2392715  0.2392715  0.        ]
 [0.         0.         0.         0.49711994 0.49711994 0.49711994
  0.29360705 0.         0.29360705 0.29360705 0.        ]]

The shape of the TF-IDF matrix is:  (3, 11)


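The third column from the end corresponds to the term `natural`. It occurs once in each document, and its `IDF` is the same everywhere because `IDF` is computed over the whole corpus.
The `TF-IDF` weight for the term still differs across the documents. In scikit-learn, `TF` is the raw term count and is not divided by the document length; the difference comes from the final step, where each row is scaled to unit `l2` norm. The resulting value therefore depends on how many other terms the document contains and how heavily they are weighted.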
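
The first row of the matrix can be reproduced by hand. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, raw counts for `TF`, `l2` normalization). The document frequencies and term counts are read off the non-zero entries of the printed matrix rather than taken from the corpus itself, and all variable names are illustrative only.

```python
import math

n_docs = 3

# Number of documents containing each term, inferred from the printed matrix
# (assumption: a non-zero entry means the term occurs in that document).
doc_freq = {"language": 3, "natural": 3, "process": 3, "read": 1}

# Raw term counts in the first document (assumption: one occurrence each).
counts = {"language": 1, "natural": 1, "process": 1, "read": 1}


def smoothed_idf(df, n):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1


# Unnormalized weight: raw count times IDF.
raw = {term: counts[term] * smoothed_idf(doc_freq[term], n_docs) for term in counts}

# Scale the row to unit l2 norm.
l2_norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {term: w / l2_norm for term, w in raw.items()}

for term, w in weights.items():
    print(f"{term}: {w:.8f}")
```

Under these assumptions, `language`, `natural` and `process` each come out at about 0.4129 and `read` at about 0.6990, which agrees with the first row printed above.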

## Changing the norm to l1 (the default, l2, was used above)

Each output row is scaled to unit norm. The `norm` parameter selects which one:

**l2**: Sum of squares of vector elements is 1 (the default).

**l1**: Sum of absolute values of vector elements is 1.
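
As a rough illustration of the two options, the sketch below normalizes a single row by hand in plain Python. The row holds the unnormalized TF-IDF values assumed for the four non-zero terms of the first document (three terms with IDF 1 and one with IDF ln 2 + 1 under scikit-learn's default smoothing). `normalize` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import math


def normalize(row, norm="l2"):
    """Scale a row so that its l1 or l2 norm equals 1."""
    if norm == "l2":
        scale = math.sqrt(sum(x * x for x in row))
    elif norm == "l1":
        scale = sum(abs(x) for x in row)
    else:
        raise ValueError(f"unsupported norm: {norm!r}")
    return [x / scale for x in row]


# Assumed unnormalized weights for the first document's non-zero terms.
row = [1.0, 1.0, 1.0, math.log(2) + 1]

row_l2 = normalize(row, "l2")
row_l1 = normalize(row, "l1")

print("l2:", [round(x, 8) for x in row_l2], "sum of squares =", sum(x * x for x in row_l2))
print("l1:", [round(x, 8) for x in row_l1], "sum =", sum(row_l1))
```

The relative ordering of the weights within a row is the same under both norms; only the scale changes.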

In [5]:
vectorizer_l1_norm = TfidfVectorizer(norm="l1")
tf_idf_matrix_l1_norm = vectorizer_l1_norm.fit_transform(preprocessed_corpus)

In [6]:
# get_feature_names_out() returns a NumPy array; .tolist() prints it as a plain list
print(vectorizer_l1_norm.get_feature_names_out().tolist())
print(tf_idf_matrix_l1_norm.toarray())
print("\nThe shape of the TF-IDF matrix is: ", tf_idf_matrix_l1_norm.shape)

['comprehend', 'computers', 'data', 'everyday', 'evolve', 'field', 'language', 'make', 'natural', 'process', 'read']
[[0.         0.         0.         0.         0.         0.
  0.21307663 0.         0.21307663 0.21307663 0.3607701 ]
 [0.1571718  0.1571718  0.1571718  0.         0.         0.
  0.1856564  0.1571718  0.0928282  0.0928282  0.        ]
 [0.         0.         0.         0.2095624  0.2095624  0.2095624
  0.12377093 0.         0.12377093 0.12377093 0.        ]]

The shape of the TF-IDF matrix is:  (3, 11)




## N-grams and Max features with TfidfVectorizer

The `TF-IDF` vectorizer can build features from `n-grams` through the `ngram_range` parameter, and `max_features` caps the size of the vocabulary.

`analyzer` selects whether the features are built from word or character n-grams.

In [7]:
vectorizer_n_gram_max_features = TfidfVectorizer(norm="l2", analyzer="word", ngram_range=(1, 3), max_features=6)
tf_idf_matrix_n_gram_max_features = vectorizer_n_gram_max_features.fit_transform(preprocessed_corpus)

In [8]:
# get_feature_names_out() returns a NumPy array; .tolist() prints it as a plain list
print(vectorizer_n_gram_max_features.get_feature_names_out().tolist())
print(tf_idf_matrix_n_gram_max_features.toarray())
print("\nThe shape of the TF-IDF matrix is: ", tf_idf_matrix_n_gram_max_features.shape)

['language', 'language process', 'natural', 'natural language', 'natural language process', 'process']
[[0.40824829 0.40824829 0.40824829 0.40824829 0.40824829 0.40824829]
 [0.66666667 0.33333333 0.33333333 0.33333333 0.33333333 0.33333333]
 [0.40824829 0.40824829 0.40824829 0.40824829 0.40824829 0.40824829]]

The shape of the TF-IDF matrix is:  (3, 6)


With `max_features=6`, the vectorizer kept the six terms with the highest frequency across the corpus among all unigrams, bigrams and trigrams, and used them to represent the `TF-IDF` vectors.
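
To make the selection step more concrete, here is a minimal plain-Python sketch of word n-gram extraction followed by a frequency-based cut-off. It only mimics the idea and is not scikit-learn's implementation. The two-document corpus and the helper `word_ngrams` are hypothetical stand-ins, since `preprocessed_corpus` is defined in another notebook.

```python
from collections import Counter


def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all word n-grams of the token list for n in the given inclusive range."""
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Hypothetical, already preprocessed documents (not the notebook's corpus).
corpus = ["natural language process", "data natural language process"]

# Count every n-gram across the whole corpus.
term_counts = Counter()
for document in corpus:
    term_counts.update(word_ngrams(document.split(), ngram_range=(1, 3)))

# Keep the six most frequent terms, then sort them alphabetically.
max_features = 6
vocabulary = sorted(term for term, _ in term_counts.most_common(max_features))
print(vocabulary)
```

In this stand-in corpus, the six n-grams built from `natural language process` each occur twice, while every n-gram containing `data` occurs once, so the cut-off keeps exactly those six.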