<a href="https://colab.research.google.com/github/sagar9926/Natural-Language-Processing/blob/main/TFIDF_in_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Credits : https://towardsdatascience.com/a-gentle-introduction-to-calculating-the-tf-idf-values-9e391f8a13e5

## What is TF IDF?

Intuitively, to understand what a text is about, we look for the words that occur frequently. Term frequency (TF) covers that aspect by counting the number of times each word occurs in the text. Frequency alone can mislead, though: in this article there are 200+ occurrences of the word ‘the’ and only 55 occurrences of the word TF (counting the code as well). To scale down words that occur in almost every document, an inverse weighting is introduced, referred to as Inverse Document Frequency (IDF). Together, TF-IDF captures the relative importance of words in a set of documents or a collection of texts.
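To make the two ingredients concrete, here is one common formulation. To my understanding it is also the default used by sklearn's TfidfVectorizer (smoothed IDF followed by L2 normalization of each document), but treat that as an assumption to check against your installed version. The symbols are introduced here for illustration: $n$ is the number of documents, $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t),
\qquad
\mathrm{tfidf}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
$$

A word that appears in every document gets the smallest possible IDF of 1, while a word that appears in a single document gets the largest.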

In an applied business context, text classification is one of the most common NLP problems: the algorithm has to predict the topic of a text from a predefined set of topics it has been trained on. In 2018, Google released a text classification framework based on 450K experiments on a few different text data sets. Based on those experiments, Google found that when the ratio of the number of samples to the number of words per sample is below 1500, TF-IDF was the best way to represent text. When you have a smallish sample size for a relatively common problem, it helps to try out TF-IDF.
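The rule of thumb above can be sketched in a few lines. This is only an illustration: the helper name `samples_to_words_ratio` and the toy texts are hypothetical, and using the median words per sample as the denominator is my reading of Google's guide rather than something stated in this article.

```python
from statistics import median


def samples_to_words_ratio(texts):
    """Return (number of samples) / (median number of words per sample)."""
    # Assumption: words are approximated by whitespace-separated tokens
    words_per_sample = [len(text.split()) for text in texts]
    return len(texts) / median(words_per_sample)


# Hypothetical toy data set, only to show the calculation
texts = [
    "the service was excellent",
    "terrible food and slow service",
    "would visit again",
]

ratio = samples_to_words_ratio(texts)
print(f"samples / words per sample = {ratio:.2f}")

# Threshold taken from the rule of thumb quoted above
if ratio < 1500:
    print("Small-data regime: TF-IDF is a reasonable first representation to try")
```

On a real data set you would pass the full list of training texts and would probably want a proper tokenizer instead of `str.split`.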



## Overview

We will be using a beautiful poem by the mystic poet and scholar Rumi as our example corpus. First, we will calculate the TF-IDF values for the poem using TfidfVectorizer from the sklearn package. Then, we will pull apart the components and work through the steps involved in calculating TF-IDF values. Mathematical calculations and Python code will be provided for each step.
So, let’s go!

In [1]:
corpus =  ["you were born with potential",
"you were born with goodness and trust",
"you were born with ideals and dreams",
"you were born with greatness",
"you were born with wings",
"you are not meant for crawling, so don't",
"you have wings",
"learn to use them and fly"
]

## Looking ahead to the final output

We will be decimating the beautiful poem into mysterious decimals in this step. But, hey, after all, we are trying to demystify these decimals by understanding the calculations involved in TF-IDF. As mentioned before, they are quite easy to obtain with the sklearn package.

In [2]:
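# Fit the TF-IDF vectorizer on the corpus and transform it into a sparse matrix
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer()
X_train_tf_idf = tf_idf_vect.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = list(tf_idf_vect.get_feature_names_out())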

In the matrix below, each row represents a sentence from the above poem. Each column represents a unique word in the poem, in alphabetical order. As you can see, there are a lot of zeros in the matrix, so sklearn stores it as a memory-efficient sparse matrix. I have converted it to a data frame for ease of visualization.

In [18]:
import pandas as pd
# Densify the sparse matrix and label the columns with the vocabulary terms
tfidf_matrix = pd.DataFrame(X_train_tf_idf.toarray(), columns=terms)
# Show each sentence next to its row of TF-IDF values
tfidf_matrix.insert(0, "sentence", corpus)


tfidf_matrix

| doc | sentence | and | are | born | crawling | don | dreams | fly | for | goodness | greatness | have | ideals | learn | meant | not | potential | so | them | to | trust | use | were | wings | with | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | you were born with potential | 0.0 | 0.0 | 0.383289 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.682895 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.383289 | 0.0 | 0.383289 | 0.304834 |
| 1 | you were born with goodness and trust | 0.37764 | 0.0 | 0.293087 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.522185 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.522185 | 0.0 | 0.293087 | 0.0 | 0.293087 | 0.233096 |
| 2 | you were born with ideals and dreams | 0.37764 | 0.0 | 0.293087 | 0.0 | 0.0 | 0.522185 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.522185 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.293087 | 0.0 | 0.293087 | 0.233096 |
| 3 | you were born with greatness | 0.0 | 0.0 | 0.383289 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.682895 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.383289 | 0.0 | 0.383289 | 0.304834 |
| 4 | you were born with wings | 0.0 | 0.0 | 0.413022 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.413022 | 0.616716 | 0.413022 | 0.328481 |
| 5 | you are not meant for crawling, so don't | 0.0 | 0.372697 | 0.0 | 0.372697 | 0.372697 | 0.0 | 0.0 | 0.372697 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.372697 | 0.372697 | 0.0 | 0.372697 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.166366 |
| 6 | you have wings | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.725164 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.607744 | 0.0 | 0.323703 |
| 7 | learn to use them and fly | 0.307727 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.425512 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.425512 | 0.0 | 0.0 | 0.0 | 0.0 | 0.425512 | 0.425512 | 0.0 | 0.425512 | 0.0 | 0.0 | 0.0 | 0.0 |
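To see where the 25 columns and all those zeros come from, the vocabulary can be rebuilt without sklearn. This is a minimal sketch: the regular expression is assumed to mirror the vectorizer's default tokenization (lowercase, keep runs of two or more word characters), and the helper name `tokenize` is my own.

```python
import re

corpus = [
    "you were born with potential",
    "you were born with goodness and trust",
    "you were born with ideals and dreams",
    "you were born with greatness",
    "you were born with wings",
    "you are not meant for crawling, so don't",
    "you have wings",
    "learn to use them and fly",
]

# Assumption: same behaviour as the vectorizer's default token pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


tokenized = [tokenize(sentence) for sentence in corpus]

# One column per unique token, sorted alphabetically
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# A cell is non-zero exactly when the word occurs in that sentence
non_zero = sum(len(set(tokens)) for tokens in tokenized)
total_cells = len(corpus) * len(vocabulary)

print(tokenize(corpus[5]))
print(len(vocabulary), "columns:", vocabulary)
print(f"{non_zero} of {total_cells} cells are non-zero ({non_zero / total_cells:.0%})")
```

For this poem the sketch finds 25 columns and 46 non-zero cells out of 200, so roughly three quarters of the matrix is zeros. It also shows why there is a column named don: the apostrophe splits "don't", and the single-letter remainder is dropped.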


Let us interpret the numbers we have obtained so far. As you may have noticed, the words “you were born” are repeated throughout the poem, so we do not expect them to get high TF-IDF scores. If you look at the values for those three words, they range from about 0.17 to 0.41 and are always lower than the score of the most distinctive word in the same sentence.

Let us look at Document 0: You were born with potential. The word potential stands out. If you look at the TF-IDF values in the first row of the matrix, you will see that the word potential has the highest value (0.682895).

Let us look at Document 4 (row 5): You were born with wings. Same as before, the word “wings” has the highest value in that sentence (0.616716).
Notice that the word “wings” also appears in Document 6, where its TF-IDF value (0.607744) is different from its value in Document 4. With sklearn's default settings each row is normalized by the weights of the other words in the same sentence, so the same word can score differently in different documents.

In Document 6, the word “wings” (0.607744) is deemed less important than the word “have” (0.725164), because “have” appears in no other sentence of the poem.

We will focus on applying the calculations to the words __wings__ and __potential__ in particular, to derive their values in the matrix displayed above (0.616716 and 0.607744 for wings in Documents 4 and 6, and 0.682895 for potential in Document 0).
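As a compact check of where we are heading, the values for these words can be reproduced by hand in plain Python. This is a sketch under stated assumptions, not sklearn's actual implementation: it assumes smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each sentence, and the helper names `tokenize`, `idf` and `tfidf` are my own.

```python
import math
import re

corpus = [
    "you were born with potential",
    "you were born with goodness and trust",
    "you were born with ideals and dreams",
    "you were born with greatness",
    "you were born with wings",
    "you are not meant for crawling, so don't",
    "you have wings",
    "learn to use them and fly",
]


def tokenize(text):
    # Assumed to mirror the vectorizer's default tokenization
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [tokenize(sentence) for sentence in corpus]
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf(term, doc_index):
    """TF-IDF of `term` in one document, L2-normalized over that document."""
    doc = docs[doc_index]
    weights = {t: doc.count(t) * idf(t) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return weights.get(term, 0.0) / norm


for term, doc_index in [("potential", 0), ("wings", 4), ("wings", 6), ("have", 6)]:
    print(f"{term!r} in Document {doc_index}: {tfidf(term, doc_index):.6f}")
```

If the assumptions hold, the printed numbers should agree with the matrix above to six decimals. Note that “wings” has the same IDF in Documents 4 and 6; only the normalization over the other words in each sentence differs.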