#Term Frequency - Inverse Document Frequency (TF-IDF)


* TF-IDF is a **statistical** measure.
* It reflects how important/relevant a word is to a document in a collection or corpus.
*   It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.
*   A [survey conducted in 2015](https://kops.uni-konstanz.de/handle/123456789/32348) showed that 83% of text-based recommender systems in digital libraries use tf–idf.
*   The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
* Applications: search engines (scoring the relevance of documents to a query) and stop-word removal (especially in text summarization and document classification).
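
One common way to write this weighting, for a term $t$ in a document $d$ drawn from a corpus $D$, is the product below. The symbols are assumed here for illustration, and several variants of both factors are in use.

$$
\text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)
$$

The first factor grows with the number of occurrences of $t$ in $d$; the second shrinks as more documents in $D$ contain $t$.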





##Term Frequency (TF)


*   The first form of term weighting is due to [Hans Peter Luhn (1957)](https://ieeexplore.ieee.org/document/5392697).
*   The number of times a term occurs in a document is called its term frequency.
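
A minimal sketch of term frequency, assuming whitespace tokenization and counts divided by document length (the same variant the `transform` function later in this notebook uses). The helper name `term_frequency` is hypothetical and only serves this illustration.

```python
from collections import Counter


def term_frequency(document):
    """Relative frequency of each whitespace-separated token (hypothetical helper)."""
    tokens = document.split()
    counts = Counter(tokens)
    # Divide each raw count by the document length so long documents
    # do not get larger values simply because they contain more words.
    return {word: count / len(tokens) for word, count in counts.items()}


tf = term_frequency("this document is the second document")
print(tf["document"])  # 2 occurrences out of 6 tokens
print(tf["second"])    # 1 occurrence out of 6 tokens
```

Raw counts (without the division) are an equally valid definition; the choice does not matter once each row is L2-normalized.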



##Inverse Document Frequency (IDF)


*   Motivation: TF alone tends to overemphasize documents that happen to use very common words such as "the" more frequently.
*   [Karen Spärck Jones](https://en.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones) (1972) conceived a statistical interpretation of term-specificity called Inverse Document Frequency (idf), which became a cornerstone of term weighting
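
As a sketch, with $N$ the number of documents in the corpus and $\text{df}(t)$ the number of documents containing term $t$ (notation assumed here), the classic definition and the smoothed variant computed by the `IDF` function in this notebook can be written as:

$$
\text{idf}(t) = \log\frac{N}{\text{df}(t)}
\qquad\qquad
\text{idf}_{\text{smooth}}(t) = \ln\frac{1 + N}{1 + \text{df}(t)} + 1
$$

The smoothed form avoids division by zero and keeps every weight strictly positive. For the four-document corpus used below, "this" occurs in all four documents, so $\text{idf}_{\text{smooth}} = \ln(5/5) + 1 = 1$, while "first" occurs in two, so $\text{idf}_{\text{smooth}} = \ln(5/3) + 1 \approx 1.51$.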



In [None]:
from collections import Counter
from scipy.sparse import csr_matrix
import math
from sklearn.preprocessing import normalize
import numpy as np 

In [None]:
corpus = [
      'this is the first document',
      'this document is the second document',
      'and this is the third one',
      'is this the first document',
 ] 

In [None]:
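def IDF(corpus, unique_words):
    """Return the smoothed IDF of every word in unique_words."""
    idf_dict = {}
    N = len(corpus)  # total number of documents
    # Tokenize each document once instead of once per word.
    tokenized = [set(sen.split()) for sen in corpus]
    for word in unique_words:
        # Document frequency: number of documents that contain the word.
        count = sum(1 for tokens in tokenized if word in tokens)
        # Adding 1 to numerator and denominator avoids division by zero.
        idf_dict[word] = math.log((1 + N) / (count + 1)) + 1
    return idf_dict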

In [None]:
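def fit(whole_data):
    """Build a sorted vocabulary and the IDF of each word from a list of documents."""
    if not isinstance(whole_data, list):
        raise TypeError("fit expects a list of strings")
    unique_words = set()
    for x in whole_data:
        for y in x.split():
            if len(y) < 2:  # skip single-character tokens
                continue
            unique_words.add(y)
    unique_words = sorted(unique_words)
    # Map each word to its column index in the TF-IDF matrix.
    vocab = {j: i for i, j in enumerate(unique_words)}
    Idf_values_of_all_unique_words = IDF(whole_data, unique_words)
    return vocab, Idf_values_of_all_unique_words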


In [None]:
Vocabulary, idf_of_vocabulary=fit(corpus) 

In [None]:
print(list(Vocabulary.keys())) 

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [None]:
def transform(dataset, vocabulary, idf_values):
    # Collect (row, column, value) triples and build the matrix in one step;
    # assigning into a csr_matrix one element at a time is slow and raises
    # a SparseEfficiencyWarning.
    rows, cols, values = [], [], []
    for row, sentence in enumerate(dataset):
        words = sentence.split()
        # Iterate over distinct words so each (row, column) pair appears once.
        for word, count in Counter(words).items():
            if word in vocabulary:
                rows.append(row)
                cols.append(vocabulary[word])
                # TF (count / document length) times IDF.
                values.append((count / len(words)) * idf_values[word])
    sparse_matrix = csr_matrix((values, (rows, cols)),
                               shape=(len(dataset), len(vocabulary)), dtype=np.float64)
    # L2-normalize each row so every document vector has unit length.
    output = normalize(sparse_matrix, norm='l2', axis=1, copy=True, return_norm=False)
    print("NORM FORM\n", output)
    return output


In [None]:
final_output=transform(corpus,Vocabulary,idf_of_vocabulary)
print(final_output.shape) 

NORM FORM
   (0, 1)	0.4697913855799205
  (0, 2)	0.580285823684436
  (0, 3)	0.3840852409148149
  (0, 6)	0.3840852409148149
  (0, 8)	0.3840852409148149
  (1, 1)	0.6876235979836937
  (1, 3)	0.2810886740337529
  (1, 5)	0.5386476208856762
  (1, 6)	0.2810886740337529
  (1, 8)	0.2810886740337529
  (2, 0)	0.511848512707169
  (2, 3)	0.267103787642168
  (2, 4)	0.511848512707169
  (2, 6)	0.267103787642168
  (2, 7)	0.511848512707169
  (2, 8)	0.267103787642168
  (3, 1)	0.4697913855799205
  (3, 2)	0.580285823684436
  (3, 3)	0.3840852409148149
  (3, 6)	0.3840852409148149
  (3, 8)	0.3840852409148149
(4, 9)




In [None]:
print(final_output[0].toarray())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
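
The row above can be checked by hand. The following is a minimal standard-library sketch, assuming the same whitespace tokenization, length-normalized TF, smoothed IDF and L2 normalization as the notebook's functions; the names `smoothed_idf`, `raw` and `row0` are hypothetical and introduced only for this check.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
N = len(corpus)
doc = corpus[0].split()


def smoothed_idf(word):
    """Smoothed IDF, mirroring the formula in the notebook's IDF function."""
    df = sum(1 for sen in corpus if word in sen.split())
    return math.log((1 + N) / (1 + df)) + 1


# Unnormalized TF-IDF weight of every distinct word in the first document.
raw = {w: (doc.count(w) / len(doc)) * smoothed_idf(w) for w in set(doc)}

# L2 normalization: divide by the Euclidean length of the weight vector.
l2_norm = math.sqrt(sum(v * v for v in raw.values()))
row0 = {w: v / l2_norm for w, v in raw.items()}

for w in sorted(row0):
    print(w, round(row0[w], 8))
```

The printed weights should agree with the nonzero entries of `final_output[0]`: words that occur in every document ("this", "is", "the") share the lowest weight, and the rarer "first" gets the highest.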
