# **What is TF-IDF?**

TF-IDF is a method that assigns each word a numerical weight reflecting how important that word is to a document in a corpus, where a corpus is a collection of documents. TF stands for term frequency and IDF for inverse document frequency. The method is widely used in information retrieval and text mining.

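TF (Term Frequency): Term frequency measures how often a word ‘w’ occurs in a document ‘d’. Words that occur frequently in a document are given more importance. The formula for term frequency, as it is computed in the transform function later in this notebook, is:

$$\mathrm{tf}(w, d) = \frac{\text{number of times } w \text{ occurs in } d}{\text{total number of words in } d}$$
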
IDF (Inverse Document Frequency): Some words, such as ‘the’, occur very often yet tell us little about a document. IDF reduces the weight of terms that appear in many documents and increases the weight of terms that appear in only a few. In other words, IDF measures how rare a word is across all documents in the corpus: the rarer the word, the higher its IDF value.
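
The variant implemented later in this notebook is a smoothed form of IDF. It can be written as follows, where $N$ is the number of documents in the corpus and $\mathrm{df}(w)$ is the number of documents that contain $w$ (both symbols are notation introduced here for illustration):

$$\mathrm{idf}(w) = \ln\left(\frac{1 + N}{1 + \mathrm{df}(w)}\right) + 1$$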

Multiplying these two values gives the TF-IDF score of a word in a document.
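
As a rough illustration of how the two parts combine, here is a minimal sketch for a single word. The function name `tf_idf`, its parameters, and the numbers are all hypothetical; the idf part assumes the smoothed form that the notebook implements later.

```python
import math

def tf_idf(count_in_doc, doc_length, n_docs, doc_freq):
    """TF-IDF of one word in one document, using a smoothed idf."""
    tf = count_in_doc / doc_length
    idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return tf * idf

# Hypothetical numbers: the word occurs 3 times in a 100-word document
# and appears in 10 of the 1000 documents in the corpus.
score = tf_idf(count_in_doc=3, doc_length=100, n_docs=1000, doc_freq=10)
print(round(score, 4))  # 0.1653
```

A word that is frequent in the document (high tf) but rare in the corpus (high idf) ends up with the largest score.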

## **Building TF-IDF from Scratch**

The first step is to install and import all the libraries we need.

In [None]:
!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn nltk gensim tqdm --user -q --no-warn-script-location

import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy as np 

Basic libraries imported. 

‘from sklearn.preprocessing import normalize’: As the documentation says, normalization here means scaling each sample to unit length, so we also have to specify which length (i.e. which norm) to use. scikit-learn's TfidfVectorizer applies L2 normalization to its output matrix by default, i.e. every row is scaled to unit Euclidean length, and we use ‘normalize’ to do the same.
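
To make the idea of unit Euclidean length concrete, here is a minimal plain-Python sketch of L2 normalization for a single vector. The helper `l2_normalize` is a hypothetical name used only for illustration; it is not part of scikit-learn.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]

unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

The vector [3, 4] has length 5, so dividing by 5 gives a vector of length 1 that points in the same direction.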

In [None]:
corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
] 

For simplicity, we take four short documents as our corpus and store them in a list.

In [None]:
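def IDF(corpus, unique_words):
  idf_dict={}
  N=len(corpus)  # total number of documents
  for i in unique_words:
    # Count the documents that contain the word i at least once.
    count=0
    for sen in corpus:
      if i in sen.split():
        count=count+1
    # Smoothed idf, assigned once per word after all documents are counted.
    idf_dict[i]=(math.log((1+N)/(count+1)))+1
  return idf_dict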

We define a function IDF that takes the corpus and the unique words as parameters.

We add ‘1’ to the numerator and the denominator to keep the calculation numerically stable: it is as if one extra document containing every word had been seen, so the denominator can never be zero and we avoid a division-by-zero error. We also add ‘1’ to the whole expression assigned to ‘idf_dict[i]’ so that words occurring in every document, whose logarithm is zero, still receive a non-zero weight.

This function generates the idf values of all the unique words when the ‘fit’ function is called.
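
To see why the ‘+1’ terms matter, here is a small sketch that compares an unsmoothed idf with the smoothed form used by the IDF function above. The helper names `plain_idf` and `smoothed_idf` are hypothetical and exist only for this comparison.

```python
import math

def plain_idf(n_docs, doc_freq):
    # Unsmoothed textbook form: log(N / df).
    return math.log(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    # Same expression as in the IDF function above.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 4
print(plain_idf(n_docs, 4))     # 0.0 -> a word found in every document is ignored
print(smoothed_idf(n_docs, 4))  # 1.0 -> the same word keeps a small weight

try:
    plain_idf(n_docs, 0)        # a word that occurs in no document
except ZeroDivisionError:
    print("plain idf fails for doc_freq = 0")
print(smoothed_idf(n_docs, 0))  # finite value: log(5) + 1
```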

In [None]:
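def fit(whole_data):
  # Only a list of document strings is supported.
  if not isinstance(whole_data, (list,)):
    raise TypeError("whole_data must be a list of strings")
  unique_words = set()
  for x in whole_data:
    for y in x.split():
      # Skip tokens shorter than two characters.
      if len(y)<2:
        continue
      unique_words.add(y)
  unique_words = sorted(unique_words)
  # Map each word to its column index in the tf-idf matrix.
  vocab = {j:i for i,j in enumerate(unique_words)}
  Idf_values_of_all_unique_words=IDF(whole_data,unique_words)
  return vocab, Idf_values_of_all_unique_words
Vocabulary, idf_of_vocabulary=fit(corpus)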

Here we initialise ‘unique_words’ as a set so that it holds only unique values (a set stores each element at most once, so duplicates are dropped automatically).

We then check whether ‘whole_data’ is a list; in our case, the corpus is a list. We iterate over the documents in the list, split each one into words, and add every word to the set.

All words having a length of less than two are discarded.

We call the IDF function inside the fit function; it gives us the idf values of all the unique words, which are stored in ‘Idf_values_of_all_unique_words’.

The fit function returns the vocabulary (each word mapped to its column index) and the idf value of each word.

We assign the returned values to ‘Vocabulary’ and ‘idf_of_vocabulary’.
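
The vocabulary-building steps described above can also be sketched more compactly with a set comprehension. The toy corpus `docs` below is hypothetical and is chosen so that a one-character token (‘a’) gets dropped.

```python
docs = ['a cat sat', 'a dog sat down']  # hypothetical toy corpus

# Collect every token of length >= 2; the set removes duplicates.
tokens = {word for doc in docs for word in doc.split() if len(word) >= 2}

# Sort alphabetically and map each word to its column index.
vocab = {word: index for index, word in enumerate(sorted(tokens))}
print(vocab)  # {'cat': 0, 'dog': 1, 'down': 2, 'sat': 3}
```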

In [None]:
print(list(Vocabulary.keys())) 

In [None]:
print(list(idf_of_vocabulary.values()))

These two cells print the vocabulary and the idf values produced by the fit function.
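
For reference, the values those two cells should print can be recomputed with a self-contained sketch that depends only on the standard library. It repeats the same corpus and the same smoothed idf expression; the variable names are chosen here for illustration.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

# Sorted vocabulary of all tokens with at least two characters.
words = sorted({w for doc in corpus for w in doc.split() if len(w) >= 2})
n_docs = len(corpus)

idf = {}
for w in words:
    # Number of documents that contain the word at least once.
    doc_freq = sum(1 for doc in corpus if w in doc.split())
    idf[w] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

for w in words:
    print(f"{w:10s}{idf[w]:.4f}")
```

Words that occur in all four documents (‘is’, ‘the’, ‘this’) get an idf of 1.0, ‘document’ gets about 1.2231, ‘first’ about 1.5108, and words that occur in a single document (‘and’, ‘one’, ‘second’, ‘third’) get about 1.9163.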

We are re-implementing the fit and transform methods of scikit-learn's TfidfVectorizer.

Now let's move on to the transform function.

In [None]:
def transform(dataset,vocabulary,idf_values):
    # Collect the non-zero entries first; assigning into a csr_matrix one
    # element at a time is slow and triggers a SparseEfficiencyWarning.
    rows, cols, values = [], [], []
    for row, document in enumerate(dataset):
        words = document.split()
        word_counts = Counter(words)
        # Visit each distinct word once so that no (row, column) pair repeats.
        for word, count in word_counts.items():
            if word in vocabulary:
                # tf = count / document length, weighted by the word's idf.
                rows.append(row)
                cols.append(vocabulary[word])
                values.append((count/len(words))*idf_values[word])
    sparse_matrix = csr_matrix((values, (rows, cols)),
                               shape=(len(dataset), len(vocabulary)), dtype=np.float64)
    # L2-normalize every row so that each document vector has unit length.
    output = normalize(sparse_matrix, norm='l2', axis=1, copy=True, return_norm=False)
    print("NORM FORM\n", output)
    return output
final_output=transform(corpus,Vocabulary,idf_of_vocabulary)
print(final_output.shape)

Here we use the transform function to get a sparse matrix representation of the corpus.

We use the TF-IDF formula to calculate the value of every vocabulary word in each document.

As mentioned earlier, scikit-learn L2-normalizes its TF-IDF output, so we call ‘normalize’ with norm='l2' to obtain the same values.
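
As a sanity check, the first row of the matrix can be worked out by hand. The sketch below uses only the standard library; the `doc_freq` counts are read off the four-document corpus, and the variable names are chosen here for illustration.

```python
import math

# First document of the corpus, plus the number of documents (out of 4)
# that contain each of its words.
document = 'this is the first document'.split()
doc_freq = {'this': 4, 'is': 4, 'the': 4, 'first': 2, 'document': 3}
n_docs = 4

raw = {}
for word in set(document):
    tf = document.count(word) / len(document)
    idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
    raw[word] = tf * idf

# L2 normalization: divide every weight by the Euclidean length of the row.
norm = math.sqrt(sum(v * v for v in raw.values()))
row = {word: value / norm for word, value in raw.items()}

for word in sorted(row):
    print(word, round(row[word], 4))
```

This prints roughly 0.4698 for ‘document’, 0.5803 for ‘first’, and 0.3841 for each of ‘is’, ‘the’ and ‘this’. Because every tf in a row is divided by the same document length, that division cancels out during normalization.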

We want a sparse matrix representation, so we pass the sparse ‘sparse_matrix’ to ‘normalize’, which also returns a sparse matrix.

A sparse matrix is a matrix in which most of the values are zero. A sparse representation stores only the non-zero values, so it is worthwhile only when the matrix contains many zeros.
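
To give a feel for what a sparse representation stores, here is a plain-Python sketch of the compressed sparse row (CSR) layout, the format behind scipy's ‘csr_matrix’, which keeps three arrays named data, indices and indptr. The small `dense` matrix is a hypothetical example.

```python
dense = [
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.2],
]

data, indices, indptr = [], [], [0]
for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)    # the non-zero value itself
            indices.append(col)   # its column index
    indptr.append(len(data))      # where the next row starts in data

print(data)     # [0.5, 0.7, 0.2]
print(indices)  # [1, 0, 3]
print(indptr)   # [0, 1, 1, 3]
```

Only three of the twelve values are stored; row i occupies the slice data[indptr[i]:indptr[i+1]], which is empty for the all-zero second row.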

We can convert the sparse representation into a dense matrix with ‘toarray’.

In [None]:
print(final_output[0].toarray())
