In this post, I will go over the TF-IDF model and its implementation with scikit-learn.


## Traditional Feature Engineering Models

Traditional (count-based) feature engineering strategies for textual data belong to a family of models popularly known as the Bag of Words model. This includes term frequencies, TF-IDF (term frequency-inverse document frequency), N-grams, topic models, and so on. While they are effective methods for extracting features from text, the model treats each document as just a bag of unstructured words, so we lose additional information such as the semantics, structure, sequence, and context around nearby words in each text document.

## Bag of Words Model

This is perhaps the simplest vector space representational model for unstructured text. A vector space model is simply a mathematical model to represent unstructured text (or any other data) as numeric vectors, such that each dimension of the vector is a specific feature/attribute. The Bag of Words model represents each text document as a numeric vector where each dimension is a specific word from the corpus and the value could be its frequency in the document, occurrence (denoted by 1 or 0), or even weighted values. The model gets its name because each document is represented literally as a bag of its own words, disregarding word order, sequences, and grammar.

```python
doc_a = 'this document is first document'
doc_b = 'this document is the second document'

bag_of_words_a = doc_a.split(' ')
bag_of_words_b = doc_b.split(' ')

unique_words_set = set(bag_of_words_a).union(set(bag_of_words_b))
print(unique_words_set)

# Now create a dictionary of words and their occurrence counts for each document
# in the corpus (collection of documents).

# Every word in the vocabulary starts with a count of 0.
dict_a = dict.fromkeys(unique_words_set, 0)

for word in bag_of_words_a:
    dict_a[word] += 1

print(dict_a)

# Similarly for the second document

dict_b = dict.fromkeys(unique_words_set, 0)

for word in bag_of_words_b:
    dict_b[word] += 1

print(dict_b)
```

Output (the order of the set and of the dictionary keys may vary between runs):

```
{'this', 'document', 'is', 'first', 'the', 'second'}
{'this': 1, 'document': 2, 'is': 1, 'first': 1, 'the': 0, 'second': 0}
{'this': 1, 'document': 2, 'is': 1, 'first': 0, 'the': 1, 'second': 1}
```
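
As a minimal sketch of the vector space idea described above, the same word counts can be laid out as numeric vectors over a shared, sorted vocabulary. The helper name `to_count_vector` is hypothetical, and `collections.Counter` is swapped in here for the manual dictionary loop.

```python
from collections import Counter

doc_a = 'this document is first document'
doc_b = 'this document is the second document'

# Sorting the vocabulary gives every document the same, reproducible column order.
vocabulary = sorted(set(doc_a.split()) | set(doc_b.split()))

def to_count_vector(document, vocabulary):
    """Return the term counts of `document` in vocabulary order (hypothetical helper)."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)                          # ['document', 'first', 'is', 'second', 'the', 'this']
print(to_count_vector(doc_a, vocabulary))  # [2, 1, 1, 0, 0, 1]
print(to_count_vector(doc_b, vocabulary))  # [2, 0, 1, 1, 1, 1]
```

Each position in a vector corresponds to one word of the vocabulary, so the two documents become directly comparable.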


## TF-IDF Model

Some potential problems can arise when the Bag of Words model is used on large corpora. Since the feature vectors are based on absolute term frequencies, some terms may occur frequently across all documents and tend to overshadow other terms in the feature set, especially words that occur less frequently but might be more interesting and effective as features for identifying specific categories. This is where TF-IDF comes into the picture. TF-IDF stands for term frequency-inverse document frequency. It’s a combination of two metrics, term frequency (tf) and inverse document frequency (idf). This technique was originally developed as a metric for ranking search engine results based on user queries and has come to be a part of information retrieval and text feature extraction.

#### Suppose the query is “albert einstein”. Then score(article) = TFIDF(article, “albert”) + TFIDF(article, “einstein”). Our search engine will display the articles sorted by score.
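
The scoring rule above can be sketched as follows. The `tfidf_scores` table holds made-up illustrative weights, and `score_article` is a hypothetical helper rather than part of any library.

```python
# Made-up TF-IDF weights per article (illustrative values only).
tfidf_scores = {
    'article_1': {'albert': 0.30, 'einstein': 0.45, 'physics': 0.10},
    'article_2': {'albert': 0.20, 'camus': 0.50},
    'article_3': {'relativity': 0.40},
}

def score_article(term_weights, query):
    """Sum the TF-IDF weights of the query terms; missing terms contribute 0."""
    return sum(term_weights.get(term, 0.0) for term in query.lower().split())

query = 'albert einstein'
ranking = sorted(
    tfidf_scores,
    key=lambda name: score_article(tfidf_scores[name], query),
    reverse=True,
)
print(ranking)  # ['article_1', 'article_2', 'article_3']
```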

Let’s formally define TF-IDF now and look at the mathematical representations before diving into its implementation.

#### Mathematically, TF-IDF is the product of two metrics and can be represented as follows:


### $tfidf = tf \times idf$


where term frequency (tf) and inverse-document frequency (idf) represent the two metrics we just talked about.

## Term Frequency (TF)

Term frequency in any document vector is denoted by the raw frequency value of that term in a particular document. Mathematically it can be represented as follows:

### $tf(w, D) = f_{wD}$

where $f_{wD}$ denotes the frequency of word w in document D, which becomes the term frequency (tf). The raw frequency is sometimes normalized, for example by taking its logarithm or by dividing it by the length of the document.

The code in this post uses the length-normalized variant: the number of times a word appears in a document divided by the total number of words in the document. Every document has its own term frequencies.

![img](https://i.imgur.com/zdAyxl8.png)

$$TF(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}}.$$

Inverse document frequency (IDF) measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

$$IDF(t) = \log_{e}\frac{\text{Total  number of documents}} {\text{Number of documents with term t in it}}.$$

For numerical stability, we will change this formula slightly by adding 1 to the denominator:

$$IDF(t) = \log_{e}\frac{\text{Total  number of documents}} {\text{Number of documents with term t in it}+1}.$$

The following Python code computes the **term frequency**:

```python
def compute_term_frequency(word_dictionary, bag_of_words):
    term_frequency_dictionary = {}
    length_of_bag_of_words = len(bag_of_words)

    # Divide each word's count by the total number of words in the document.
    for word, count in word_dictionary.items():
        term_frequency_dictionary[word] = count / float(length_of_bag_of_words)

    return term_frequency_dictionary

# Usage

print(compute_term_frequency(dict_a, bag_of_words_a))
```

Output (key order may vary between runs):

```
{'this': 0.2, 'document': 0.4, 'is': 0.2, 'first': 0.2, 'the': 0.0, 'second': 0.0}
```


## Inverse document frequency
Inverse document frequency, denoted by idf, is the inverse of the document frequency for each term. It is computed by dividing the total number of documents in our corpus by the document frequency of each term and then applying logarithmic scaling to the result.


Some implementations add 1 to the document frequency of each term, as if the corpus contained one more document that includes every term in the vocabulary. This prevents potential division-by-zero errors and smooths the inverse document frequencies. Some implementations also add 1 to the idf value itself, so that terms with zero idf are not ignored entirely.


Mathematically, our implementation for idf can be represented as follows:

![img](https://i.imgur.com/LBoEWr8.png)

That is, idf is the log of the total number of documents divided by the number of documents that contain the word w. Inverse document frequency determines the weight of rare words across all documents in the corpus.

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, using a base-10 logarithm, the inverse document frequency (i.e., idf) is calculated as log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. (With the natural logarithm used in the formulas above, the idf would instead be about 9.21.)
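
The arithmetic of this example can be checked with a few lines of Python. This is only a sketch with hypothetical variable names. It also shows that the value 4 comes from a base-10 logarithm, while the natural logarithm gives a different weight.

```python
import math

term_count = 3                 # occurrences of "cat" in the document
document_length = 100          # total words in the document
total_documents = 10_000_000
documents_with_term = 1_000

tf = term_count / document_length
idf_base10 = math.log10(total_documents / documents_with_term)
idf_natural = math.log(total_documents / documents_with_term)

print(tf)                          # 0.03
print(round(idf_base10, 6))        # 4.0
print(round(tf * idf_base10, 6))   # 0.12
print(round(tf * idf_natural, 3))  # 0.276
```

The choice of base only rescales every idf by a constant factor, so it does not change how terms rank against each other.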

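The following Python code computes the **inverse document frequency**:

```python
import math

def compute_inverse_document_frequency(full_doc_list):
    length_of_doc_list = len(full_doc_list)

    # Document frequency: the number of documents that contain each word.
    idf_dict = dict.fromkeys(full_doc_list[0].keys(), 0)
    for word_dictionary in full_doc_list:
        for word, count in word_dictionary.items():
            if count > 0:
                idf_dict[word] += 1

    # idf(t) = ln(N / (df(t) + 1)), i.e. the formula with 1 added to the denominator.
    for word, document_frequency in idf_dict.items():
        idf_dict[word] = math.log(length_of_doc_list / (float(document_frequency) + 1))

    return idf_dict

final_idf_dict = compute_inverse_document_frequency([dict_a, dict_b])
print(final_idf_dict)
```

With only two documents, this formula gives ln(2/3) ≈ -0.405 for words that occur in both documents and 0 for words that occur in only one of them.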

#### The IDF is computed once for the whole corpus, not once per document:

### `idfs = compute_inverse_document_frequency([dict_a, dict_b])`

## TF-IDF is simply the TF multiplied by the IDF

![img](https://i.imgur.com/OCRidLj.png)

#### So then TF-IDF is a score which is applied to every word in every document in our dataset. And for every word, the TF-IDF value increases with every appearance of the word in a document, but is gradually decreased with every appearance in other documents.

Suppose the word Python appears dozens of times in post #2 of our blog but in none of our other blog posts. We draw the conclusion that the word Python isn’t very relevant to most of our blog posts, but very relevant to post #2. The inverse document frequency (IDF) tells us how important a term is to a collection of documents. A good example of how IDF comes into play is the word “the.” We know that just about every document contains “the,” so the term isn’t really special anymore, thereby producing a very low IDF. Now let’s contrast “the” with “Python” in our example. “Python” is absent from the other posts, so its IDF should be high. In fact, “Python” now carries a weight signaling that in any document in which it appears, it is important to that document.

#### When we multiply TF and IDF, we observe that the larger the number, the more important a term in a document is to that document. We can then compute the TF-IDF for each word in each document and create a vector.
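
A minimal sketch of that multiplication step is shown below. The function name `compute_tf_idf` and the input values are assumptions made for illustration. In practice the inputs would come from term-frequency and idf computations like the ones earlier in this post.

```python
def compute_tf_idf(term_frequencies, inverse_document_frequencies):
    """Multiply each term's TF by its IDF (hypothetical helper)."""
    return {
        word: tf * inverse_document_frequencies[word]
        for word, tf in term_frequencies.items()
    }

# Illustrative inputs: TF values for one document and IDF values for the corpus.
term_frequencies = {'python': 0.05, 'the': 0.08, 'code': 0.02}
inverse_document_frequencies = {'python': 2.0, 'the': 0.1, 'code': 1.5}

tf_idf = compute_tf_idf(term_frequencies, inverse_document_frequencies)
top_term = max(tf_idf, key=tf_idf.get)
print(top_term)  # python
```

Although "the" has the highest term frequency here, its low IDF pulls its score below that of "python".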

---

## [sk-learn's implementation of tf-idf](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

- Sklearn makes a couple of tweaks in its implementation of the TF-IDF vectorizer, so to replicate its exact results our custom TF-IDF would need to account for the following:

- Sklearn formula of idf is different from the standard textbook formula. Here the constant "1" is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions.

- Sklearn sorts its vocabulary (and therefore the corresponding idf values) in alphabetical order

- The final output of sklearn tf-idf vectorizer is a sparse matrix

![img](https://i.imgur.com/U14jzLo.png)

[Source](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

[The below from Sklearn's source code](https://github.com/scikit-learn/scikit-learn/blob/9780abda5fd54a491a6a98cd542da02094f912a5/sklearn/feature_extraction/text.py#L1314)

"The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

    The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as **idf(t) = log [ n / df(t) ] + 1** (if ``smooth_idf=False``), where
    n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding "1" to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely
    ignored.

    (Note that the idf formula above differs from the standard textbook
    notation that defines the idf as  idf(t) = log [ n / (df(t) + 1) ]).

    If ``smooth_idf=True`` (the default), the constant "1" is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents
    zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1."
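
The two idf formulas quoted above can be sketched in plain Python. This is not sklearn's code. The function names are hypothetical, and the natural logarithm is assumed for "log".

```python
import math

def idf_unsmoothed(n_documents, document_frequency):
    """idf(t) = ln(n / df(t)) + 1, the quoted formula for smooth_idf=False."""
    return math.log(n_documents / document_frequency) + 1

def idf_smoothed(n_documents, document_frequency):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the quoted formula for smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

n_documents = 3
for document_frequency in (1, 3):
    print(document_frequency,
          round(idf_unsmoothed(n_documents, document_frequency), 4),
          round(idf_smoothed(n_documents, document_frequency), 4))
# 1 2.0986 1.6931
# 3 1.0 1.0
```

A term that occurs in all three documents gets an idf of exactly 1 under both formulas, so it is down-weighted but not ignored.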

---

## A note on normalization - and why sklearn normalizes the final tf-idf matrix derived from its TfidfVectorizer class

![img](https://i.imgur.com/DxlL7sm.png)

#### What does it mean to normalize an array?

Data normalization is used in machine learning to make model training less sensitive to the scale of features. This allows our model to converge to better weights and, in turn, leads to a more accurate model. Normalization makes the features more consistent with each other, which allows the model to predict outputs more accurately.

To normalize a vector in math means to divide each of its elements by some value V so that the length/norm of the resulting vector is 1. It turns out that the required V is the length (norm) of the vector itself.

So these are basically the norm calculations:

![img](https://i.imgur.com/4kPSrDI.png)

For a vector x having N components, the L¹ norm adds up the absolute values of the components; the absolute value is taken because we want the magnitude to always be non-negative. The L² norm takes the sum of the squared values and then the square root of that sum.

Say you have this array.

`[-3, +4]`

Its length (in Euclid metric) is: `V = sqrt((-3)^2 + (+4)^2) = 5`

So its corresponding normalized vector is:

`[-3/5, +4/5]`

Its length is now: `sqrt ( (-3/5)^2 + (+4/5)^2 )` which is 1.

You can use another norm (e.g. the L1 norm, which corresponds to the Manhattan distance), but the idea is the same: divide each element of your array by `V`, where `V = || your_vector || = norm(your_vector)`.
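
The calculation above can be sketched in a few lines of plain Python. The function names are hypothetical, and the vector is assumed to be non-zero, since a zero vector cannot be normalized.

```python
import math

def l1_norm(vector):
    """Sum of the absolute values of the components."""
    return sum(abs(x) for x in vector)

def l2_norm(vector):
    """Square root of the sum of the squared components."""
    return math.sqrt(sum(x * x for x in vector))

def normalize(vector, norm=l2_norm):
    """Divide every component by the chosen norm of the vector."""
    length = norm(vector)
    return [x / length for x in vector]

v = [-3, 4]
print(l2_norm(v))                                # 5.0
print(normalize(v))                              # [-0.6, 0.8]
print(l1_norm(v))                                # 7
print(math.isclose(l2_norm(normalize(v)), 1.0))  # True
```

Passing `norm=l1_norm` to `normalize` divides by 7 instead of 5, which makes the absolute values of the components sum to 1.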

[Read further](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-normalization)

While deriving TF-IDF, the final metric that we will be using is a normalized version of the tfidf matrix that we get from the product of tf and idf. We normalize each document's row of the tfidf matrix by dividing it by its L2 norm, also known as the Euclidean norm, which is the square root of the sum of the squares of the tfidf weights in that row. Mathematically we can represent the final tfidf feature vector as follows:

### $tfidf = \frac{tfidf}{\lVert tfidf \rVert}$

where ∥tfidf∥ represents the Euclidean L2 norm of a document's tfidf vector. There are multiple variants of this model, but they all end up with similar results.
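
As a rough sketch of how this applies to a whole matrix, the snippet below scales every row (one row per document) to unit Euclidean length. The helper `l2_normalize_rows` and the weights are made up for illustration. Leaving all-zero rows unchanged is an assumption of this sketch.

```python
import math

def l2_normalize_rows(matrix):
    """Scale each row to unit Euclidean length; all-zero rows are left unchanged."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(value * value for value in row))
        normalized.append([value / norm for value in row] if norm > 0 else list(row))
    return normalized

# Illustrative unnormalized tf-idf weights: one row per document, one column per term.
tfidf_matrix = [
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
]

for row in l2_normalize_rows(tfidf_matrix):
    print(row)
# [0.6, 0.8, 0.0]
# [0.0, 0.0, 1.0]
# [0.0, 0.0, 0.0]
```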

Informally speaking, the norm is a generalization of the concept of (vector) length; from the Wikipedia entry:

> In linear algebra, functional analysis, and related areas of mathematics, a **norm** is a function that assigns a strictly positive *length* or *size* to each vector in a vector space.

The L2-norm is the usual Euclidean length, i.e. the square root of the sum of the squared vector elements.

The L1-norm is the sum of the absolute values of the vector elements.

---

## Normalizing a Vector

First, the absolute basics of what a norm is:

![img](https://i.imgur.com/AgFmtNY.png)

So, now, calculate the norm of the vector u⃗ =(3,4).

We first note that u⃗ ∈ $\mathbb{R}^2$, and we will thus use the formula

![img](https://i.imgur.com/3MH9JXD.png)

When we substitute our values in, we obtain that ∥u⃗ ∥=√25 = 5. Thus our vector u⃗  has length / norm of 5.

![img](https://i.imgur.com/bbYMkwl.png)

Mathematically, a norm assigns a size or length to each vector (or matrix) in a vector space. Once we have calculated the norm, we can normalize a vector. By definition, a norm on a vector space (over the real or complex field) is an assignment of a non-negative real number to each vector. The norm of a vector is its length, and the length of a vector must always be positive (or zero). A negative length makes no sense.

When we think of geometric vectors, i.e., directed line segments that start at the origin, then intuitively the length of a vector is the distance of the “end” of this directed line segment from the origin.

Taking any vector and reducing its magnitude to 1.0 while keeping its direction is called normalization. Normalization is performed by dividing the x and y (and z in 3D) components of a vector by its magnitude:

For any vector V = (x, y, z),

we know the magnitude |V| = sqrt(x*x + y*y + z*z) which gives the length of the vector.

When we normalize a vector, we actually calculate

#### V/|V| = (x/|V|, y/|V|, z/|V|).

#### Let's look at an example

![img](https://i.imgur.com/2PCb8nh.png)

We can do a basic calculation to see that a normalized vector has length 1:

(In the first line below, each component is squared by multiplying x/|V| with x/|V| and so on; the common factor 1/|V| is then brought outside the sqrt.)

```
| V/|V| | = sqrt((x/|V|)*(x/|V|) + (y/|V|)*(y/|V|) + (z/|V|)*(z/|V|))
          = sqrt(x*x + y*y + z*z) / |V|
          = |V| / |V|
          = 1
```


## Differences between Norm of a Vector and distance between two points

#### Key point to remember - a distance is always between two points, and a norm is always of a vector.

#### That means the Euclidean distance between 2 points x1 and x2 is nothing but the L2 norm of the vector (x1 - x2)

By definition, the L2 norm of a vector is the Euclidean distance of that point vector from the origin.

In other words, the distance (metric) between any two vectors can be defined as the norm of the difference of those vectors.

![img](https://i.imgur.com/cYxlXSv.png)
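
The relationship between distance and norm can be sketched as follows. The helper names are hypothetical; `math.dist` from the standard library (Python 3.8+) is used only as a cross-check.

```python
import math

def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))

def euclidean_distance(point_a, point_b):
    """Distance between two points = L2 norm of their difference vector."""
    difference = [a - b for a, b in zip(point_a, point_b)]
    return l2_norm(difference)

x1 = (1, 2)
x2 = (4, 6)
print(euclidean_distance(x1, x2))                     # 5.0
print(math.dist(x1, x2))                              # 5.0
print(euclidean_distance(x1, (0, 0)) == l2_norm(x1))  # True
```

The last line illustrates that the L2 norm of a vector is its distance from the origin.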

## TF-IDF Applications

- Information retrieval: by calculating the TF-IDF score of a user query against the whole document set, we can figure out how relevant a document is to that query. It is often said that most search engines use some form of TF-IDF.

- Keyword extraction: the highest-ranking words for a document in terms of TF-IDF score can represent the keywords of that document well (as they make that document stand out from the other documents). So we can use some form of TF-IDF score computation to extract the keywords from a text.
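
As a small sketch of the keyword extraction idea, the snippet below picks the highest-scoring words from a dictionary of TF-IDF scores. The helper `top_keywords` and the scores are made up for illustration.

```python
def top_keywords(tfidf_by_word, k=3):
    """Return the k words with the highest TF-IDF score (hypothetical helper)."""
    ranked = sorted(tfidf_by_word.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]

# Made-up TF-IDF scores for a single document.
document_scores = {
    'the': 0.01,
    'python': 0.42,
    'tutorial': 0.27,
    'is': 0.02,
    'vectorizer': 0.35,
}
print(top_keywords(document_scores, k=3))  # ['python', 'vectorizer', 'tutorial']
```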


## In general, what exactly do the fit and transform methods do in scikit-learn?

The `fit()` method creates a model that extracts the parameters it needs from your training samples so that the necessary transformation can be applied later on. `transform()`, on the other hand, applies the actual transformation to the data itself, returning a standardized or scaled form.

`fit_transform()` is just a convenient way of performing `fit()` and `transform()` consecutively.

Let us take an example for Scaling values in a dataset:

Here the fit method, when applied to the training dataset, learns the model parameters (for example, mean and standard deviation). We then need to apply the transform method on the training dataset to get the transformed (scaled) training dataset. We could also perform both of these steps in one step by applying fit_transform on the training dataset.

Hence, the fit() of every sklearn transformer just calculates the relevant parameters (e.g. μ and σ in the case of StandardScaler) and saves them as the object's internal state. Afterwards, you can call its transform() method to apply the transformation to any particular set of examples.

#### Then why do we need 2 separate methods - fit and transform ?

##### We use fit_transform() on the train data so that we learn the scaling parameters on the train data and at the same time scale the train data. We only use transform() on the test data because we use the scaling parameters learned on the train data to scale the test data.

And here is why we do it that way, in detail.

In practice we need to have separate training and testing datasets, and that is where having separate fit and transform methods helps. We apply fit on the training dataset and use the transform method on both the training dataset and the test dataset. Thus the training as well as the test dataset are transformed (scaled) using the model parameters that were learnt by applying the fit method to the training dataset.

The important thing here is that when you divide your dataset into train and test sets, what you are trying to achieve is to simulate a real-world application. In a real-world scenario you will only have training data; you will develop a model from it and predict unseen instances of similar data.

If you transform the entire data with fit_transform() and then split it into train and test sets, you violate that simulation and perform the transformation using the unseen examples as well. This will inevitably result in an optimistic model, as you have already partly prepared your model using the statistics of the unseen samples.

If you split the data into train and test sets and apply fit_transform() to both, you will also be mistaken, as your first transformation (of the train data) will be based on the train split's statistics only and your second on the test split's statistics only.

The right way to do this preprocessing is to fit any transformer on the train data only and then apply the transformation to the test data. Only then can you be sure that your resulting model represents a real-world solution.

Example Code:

```python
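from sklearn import preprocessing

# X_train and X_test are assumed to be existing feature arrays.
scaler = preprocessing.StandardScaler().fit(X_train)  # learn mean and std from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the parameters learned on the training set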
```

Note that the mean and std obtained from the training set are used for scaling all training dataset values. We should not compute a separate mean and std on the test set to scale the test set values; instead, we have to use the ones obtained using fit on the training set. We have to ensure an identical operation on the test set.

The idea is, once we executed `t.fit(train_data)`, t is fitted, so you can safely use

`t.transform(test_data)`
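
To make the split between the two methods concrete, here is a toy one-feature scaler written in plain Python. `SimpleStandardScaler` is a hypothetical class for illustration only, not sklearn's implementation. It uses the population standard deviation, and the data values are made up.

```python
import statistics

class SimpleStandardScaler:
    """Toy one-feature scaler (hypothetical, not sklearn's implementation)."""

    def fit(self, values):
        # Learn the parameters from the data passed in (the training set).
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Apply the stored parameters; nothing is re-learned here.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 0.0]

scaler = SimpleStandardScaler()
train_scaled = scaler.fit_transform(train)  # fit on train, then scale train
test_scaled = scaler.transform(test)        # reuse the train mean and std

print(scaler.mean_)                         # 3.0
print([round(v, 4) for v in test_scaled])   # [2.1213, -2.1213]
```

The test values are scaled with the mean and standard deviation of the training data, which is why `transform()` alone is called on them.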

## Implementation with sklearn

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

corpus_1 = [
     'this is sunny morning',
     'yesterday was a gloomy morning',
     'tomorrow will be rainy morning'
]

vectorizer.fit(corpus_1)
```

```python
skl_tf_idf_vectorized = vectorizer.transform(corpus_1)

# The final output of the sklearn tf-idf vectorizer is a sparse matrix, which saves storage space.
# To inspect the output visually, convert the sparse matrix to a dense array with toarray().
print(skl_tf_idf_vectorized.toarray())
# print(skl_tf_idf_vectorized[0])

# An even clearer way to inspect the output is to convert it to a pandas DataFrame.
# Below, the first document's row is transposed, made dense with todense(), and wrapped in a DataFrame.
skl_tfdf_output = skl_tf_idf_vectorized[0]
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
df_tfdf_sklearn = pd.DataFrame(skl_tfdf_output.T.todense(), index=vectorizer.get_feature_names_out(), columns=['tf-idf'])
# sort_values() returns a new DataFrame, so assign the result back.
df_tfdf_sklearn = df_tfdf_sklearn.sort_values(by=["tf-idf"], ascending=True)
df_tfdf_sklearn
```