<a href="https://colab.research.google.com/github/rahiakela/nlp-research-and-practice/blob/main/practical-natural-language-processing/3-text-representation/4_tf_idf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TF-IDF: Term frequency–inverse document frequency

In all three approaches we’ve seen so far, **all the words in the text are treated as equally important; there’s no notion of some words in the document being more important than others. TF-IDF, or term frequency–inverse document frequency, addresses this issue. It aims to quantify the importance of a given word relative to other words in the document and in the corpus.** It’s a commonly used representation scheme in information-retrieval systems, where it helps extract the documents in a corpus that are relevant to a given text query.

The intuition behind TF-IDF is as follows: if a word $w$ appears many times in a document $d_i$ but does not occur much in the rest of the documents $d_j$ in the corpus, then the word $w$ must be of great importance to the document $d_i$. The importance of $w$ should increase in proportion to its frequency in $d_i$, but at the same time, its importance should decrease in proportion to the word’s frequency in other documents $d_j$ in the corpus. **Mathematically, this is captured using two quantities: TF and IDF. The two are then combined to arrive at the TF-IDF score.**

TF (term frequency) measures how often a term or word occurs in a given document. Since different documents in the corpus may be of different lengths, a term may occur more often in a longer document as compared to a shorter document. To normalize these counts, we divide the number of occurrences by the length of the document. TF of a term t in a document d is defined as:

`TF(t, d) = (Number of occurrences of term t in document d) / (Total number of terms in the document  d)`

IDF (inverse document frequency) measures the importance of the term across a corpus. In computing TF, all terms are given equal importance (weightage). However, it’s a well-known fact that stop words like is, are, am, etc., are not important, even though they occur frequently. To account for such cases, IDF weighs down the terms that are very common across a corpus and weighs up the rare terms. IDF of a term t is calculated as follows:

`IDF(t) = log((Total number of documents in the corpus) / (Number of documents with term t in them))`


**Our toy corpus**

|  |  |
| --- | --- |
| D1 | Dog bites man. |
| D2 | Man bites dog. |
| D3 | Dog eats meat. |
| D4 | Man eats food. |

The TF-IDF score is the product of these two quantities: TF-IDF score = TF × IDF. Let’s compute TF-IDF scores for our toy corpus. Some terms appear in only one document, some appear in two, while others appear in three documents. The size of our corpus is $N=4$. The corresponding TF-IDF values for each term, computed with a base-2 logarithm, are given in the following table.

| **Word** | **TF Score** | **IDF Score** | **TF-IDF Score** |
| --- | --- | --- | --- |
| dog | $\frac{1}{3} = 0.33$ | $\log_2(\frac{4}{3})=0.4114$ | $0.4114*0.33=0.136$ |
| bites | $\frac{1}{6} = 0.17$ | $\log_2(\frac{4}{2})=1$ | $1*0.17=0.17$ |
| man | $\frac{1}{3} = 0.33$ | $\log_2(\frac{4}{3})=0.4114$ | $0.4114*0.33=0.136$ |
| eats | $\frac{1}{6} = 0.17$ | $\log_2(\frac{4}{2})=1$ | $1*0.17=0.17$ |
| meat | $\frac{1}{12} = 0.083$ | $\log_2(\frac{4}{1})=2$ | $2*0.083=0.17$ |
| food | $\frac{1}{12} = 0.083$ | $\log_2(\frac{4}{1})=2$ | $2*0.083=0.17$ |
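The TF and IDF definitions given earlier translate into a few lines of plain Python. The following is only a minimal sketch: the function names `tf`, `idf`, and `tf_idf` are made up for illustration, the base-2 logarithm is an assumption taken from the table, and TF is computed per document exactly as the TF formula states. Because of that, and because it works at full precision, the numbers it prints need not coincide with the rounded values tabulated above.

```python
import math

# Toy corpus from this section, lowercased with the final period removed.
corpus = {
    "D1": "dog bites man",
    "D2": "man bites dog",
    "D3": "dog eats meat",
    "D4": "man eats food",
}
tokenized = {name: text.split() for name, text in corpus.items()}


def tf(term, tokens):
    """Occurrences of `term` divided by the total number of terms in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, docs):
    """log2(N / df). Assumes `term` occurs in at least one document."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log2(len(docs) / df)


def tf_idf(term, tokens, docs):
    """Product of the two quantities for one term in one document."""
    return tf(term, tokens) * idf(term, docs)


all_docs = list(tokenized.values())
d1 = tokenized["D1"]
for term in d1:
    print(f"{term:6s} tf={tf(term, d1):.3f} "
          f"idf={idf(term, all_docs):.3f} "
          f"tf-idf={tf_idf(term, d1, all_docs):.3f}")
```

Terms that appear in fewer documents receive a larger IDF, so in this sketch "bites" outweighs "dog" and "man" within D1 even though all three occur once.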


The TF-IDF vector representation for a document is then simply the TF-IDF score for each term in that document. So, for D1 we get:

| **dog** | **bites** | **man** | **eats** | **meat** | **food** |
| --- | --- | --- | --- | --- | --- |
| 0.136 | 0.17 | 0.136 | 0 | 0 | 0 |

Finally, repeating this for every document gives the **TF-IDF** matrix.

**Documents**

|  |  |
| --- | --- |
| D1 | Dog bites man. |
| D2 | Man bites dog. |
| D3 | Dog eats meat. |
| D4 | Man eats food. |

**TF-IDF Matrix**

|   | dog | bites | man | eats | meat | food |
| --- | --- | --- | --- | --- | --- | --- |
| D1 | 0.136 | 0.17 | 0.136 | 0 | 0 | 0 |
| D2 | 0.136 | 0.17 | 0.136 | 0 | 0 | 0 |
| D3 | 0.136 | 0 | 0 | 0.17 | 0.17 | 0 |
| D4 | 0 | 0 | 0.136 | 0.17 | 0 | 0.17 |

In [1]:
documents = [
  "Dog bites man.",
  "Man bites dog.",
  "Dog eats meat.",
  "Man eats food."
]

processed_docs = [doc.lower().replace('.', '') for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

The following is a simple example of how to get the TF-IDF representation of a document using sklearn's `TfidfVectorizer`.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# vectorization example with TfidfVectorizer
tfidf = TfidfVectorizer()

In [6]:
# Learn the vocabulary and IDF weights from the corpus, then build the TF-IDF matrix
bow_rep_tfidf = tfidf.fit_transform(processed_docs)
print('IDF for all words in the vocabulary: \n', tfidf.idf_)
print('-' * 10)
print('All words in the vocabulary:\n', tfidf.get_feature_names_out())
print('-' * 10)

IDF for all words in the vocabulary: 
 [1.51082562 1.22314355 1.51082562 1.91629073 1.22314355 1.91629073]
----------
All words in the vocabulary:
 ['bites' 'dog' 'eats' 'food' 'man' 'meat']
----------
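The IDF values printed by `TfidfVectorizer` differ from the $\log_2(N/df)$ values computed by hand because, with its default settings (`smooth_idf=True`), scikit-learn uses a smoothed natural-log variant: $\ln\frac{1+N}{1+df} + 1$. The sketch below recomputes those values in plain Python to make the difference concrete; the helper name `smoothed_idf` is hypothetical.

```python
import math

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# Alphabetically sorted vocabulary, matching the feature order printed above.
vocab = sorted({word for doc in docs for word in doc.split()})
n_docs = len(docs)


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for doc in docs if term in doc.split())
    return math.log((1 + n_docs) / (1 + df)) + 1


for term in vocab:
    print(f"{term:6s} {smoothed_idf(term):.8f}")
```

The added 1 inside the ratio avoids division by zero for unseen terms, and the trailing `+ 1` keeps terms that occur in every document from being zeroed out entirely.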


In [7]:
# TFIDF representation for all documents in our corpus
print('TFIDF representation for all documents in our corpus: \n', bow_rep_tfidf.toarray())
print('-' * 70)

TFIDF representation for all documents in our corpus: 
 [[0.65782931 0.53256952 0.         0.         0.53256952 0.        ]
 [0.65782931 0.53256952 0.         0.         0.53256952 0.        ]
 [0.         0.44809973 0.55349232 0.         0.         0.70203482]
 [0.         0.         0.55349232 0.70203482 0.44809973 0.        ]]
----------------------------------------------------------------------
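Each row of this matrix has unit length because `TfidfVectorizer` applies L2 normalization by default (`norm='l2'`) after multiplying raw term counts by the smoothed IDF. The following sketch rebuilds the rows by hand under those default settings; `smoothed_idf` and `tfidf_row` are illustrative names, not part of scikit-learn.

```python
import math

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
vocab = sorted({word for doc in docs for word in doc.split()})


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for doc in docs if term in doc.split())
    return math.log((1 + len(docs)) / (1 + df)) + 1


def tfidf_row(doc):
    """Raw count * smoothed IDF for each vocabulary term, then L2-normalized."""
    tokens = doc.split()
    raw = [tokens.count(term) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]


for doc in docs:
    print([round(value, 8) for value in tfidf_row(doc)])
```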


Let's show the TF-IDF vectors in a DataFrame.

In [9]:
import pandas as pd

bow_indexs = ['D1', 'D2', 'D3', 'D4']
pd.DataFrame(bow_rep_tfidf.toarray(), columns=tfidf.get_feature_names_out(), index=bow_indexs)

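|  | bites | dog | eats | food | man | meat |
| --- | --- | --- | --- | --- | --- | --- |
| D1 | 0.657829 | 0.53257 | 0.0 | 0.0 | 0.53257 | 0.0 |
| D2 | 0.657829 | 0.53257 | 0.0 | 0.0 | 0.53257 | 0.0 |
| D3 | 0.0 | 0.4481 | 0.553492 | 0.0 | 0.0 | 0.702035 |
| D4 | 0.0 | 0.0 | 0.553492 | 0.702035 | 0.4481 | 0.0 |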


In [10]:
# Get the representation using this vocabulary, for a new text
temp = tfidf.transform(['dog and man are friends'])
print("Tfidf representation for 'dog and man are friends':\n", temp.toarray())

Tfidf representation for 'dog and man are friends':
 [[0.         0.70710678 0.         0.         0.70710678 0.        ]]
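Only "dog" and "man" contribute to this vector: "and", "are", and "friends" are not in the fitted vocabulary, so they are silently dropped. Since the two remaining terms have the same IDF, L2 normalization leaves each at $1/\sqrt{2} \approx 0.7071$. The sketch below mimics that behavior with a hypothetical `transform` helper, using whitespace tokenization as a simplifying assumption and the IDF weights copied from the `tfidf.idf_` output printed earlier.

```python
import math

# IDF weights copied from the `tfidf.idf_` output earlier in this section.
idf = {
    "bites": 1.51082562, "dog": 1.22314355, "eats": 1.51082562,
    "food": 1.91629073, "man": 1.22314355, "meat": 1.91629073,
}
vocab = sorted(idf)


def transform(text):
    """Vectorize `text` against the fixed vocabulary; unknown tokens are ignored."""
    tokens = [tok for tok in text.lower().split() if tok in idf]
    raw = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]


print([round(v, 8) for v in transform("dog and man are friends")])
# A text made up entirely of unseen words maps to the all-zero vector.
print(transform("cats chase mice"))
```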


Similar to BoW, **we can use the TF-IDF vectors to calculate similarity between two texts using a similarity measure like Euclidean distance or cosine similarity. TF-IDF is a commonly used representation in application scenarios such as information retrieval and text classification. However, although TF-IDF captures similarities between words better than the vectorization methods we saw earlier, it still suffers from the curse of high dimensionality.**
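As a rough illustration of the similarity computation, the sketch below applies cosine similarity to the four document vectors printed by `TfidfVectorizer` earlier in this section. The `cosine_similarity` function here is a hand-written helper for illustration, not the scikit-learn function of the same name.

```python
import math

# Rows copied from the TF-IDF matrix printed earlier
# (column order: bites, dog, eats, food, man, meat).
vectors = {
    "D1": [0.65782931, 0.53256952, 0.0, 0.0, 0.53256952, 0.0],
    "D2": [0.65782931, 0.53256952, 0.0, 0.0, 0.53256952, 0.0],
    "D3": [0.0, 0.44809973, 0.55349232, 0.0, 0.0, 0.70203482],
    "D4": [0.0, 0.0, 0.55349232, 0.70203482, 0.44809973, 0.0],
}


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


for other in ["D2", "D3", "D4"]:
    print(f"D1 vs {other}: {cosine_similarity(vectors['D1'], vectors[other]):.4f}")
```

D1 ("Dog bites man.") and D2 ("Man bites dog.") come out as identical even though their meanings differ, because TF-IDF, like BoW, ignores word order.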

> **Tip**: Even today, TF-IDF continues to be a popular representation scheme for many NLP tasks, especially for the initial versions of a solution.

If we look back at all the representation schemes we’ve discussed so far, we notice three fundamental drawbacks:

* They’re discrete representations, i.e., they treat language units (words, n-grams, etc.) as atomic units. This discreteness hampers their ability to capture relationships between words.

* The feature vectors are sparse and high-dimensional. The dimensionality increases with the size of the vocabulary, and most values in any vector are zero. This hampers learning capability. Further, their high dimensionality makes them computationally inefficient.

* They cannot handle out-of-vocabulary (OOV) words.