<font face='georgia'>
    
   <h4><strong>What does tf-idf mean?</strong></h4>

   <p>    
Tf-idf stands for <em>term frequency-inverse document frequency</em>, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
</p>
    

<font face='georgia'>
    <h4><strong>How to Compute:</strong></h4>

Typically, the tf-idf weight is composed of two terms: the first is the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.

 <ul>
    <li>
<strong>TF:</strong> Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e. the total number of terms in the document) as a form of normalization: <br>

$TF(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}}.$
</li>
<li>
<strong>IDF:</strong> Inverse Document Frequency, which measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones, by computing the following: <br>

$IDF(t) = \log_{e}\frac{\text{Total  number of documents}} {\text{Number of documents with term t in it}}.$
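<br>
For numerical stability (so that the denominator can never be zero), we change this formula slightly: <br>
$IDF(t) = \log_{e}\frac{\text{Total  number of documents}} {\text{Number of documents with term t in it}+1}.$
<br>
scikit-learn's <code>TfidfVectorizer</code> goes one step further by default and computes $IDF(t) = 1 + \log_{e}\frac{1 + \text{Total number of documents}}{1 + \text{Number of documents with term t in it}}$; the custom implementation in this notebook uses that smoothed form so that its output matches sklearn's.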
</li>
</ul>

<br>
<h4><strong>Example</strong></h4>
<p>

Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Using a base-10 logarithm to keep the numbers round, the inverse document frequency (i.e., idf) is log10(10,000,000 / 1,000) = 4. The tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. (With the natural logarithm used in the formulas above, the idf would be ln(10,000) ≈ 9.21 and the weight ≈ 0.28.)
</p>
</font>
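The arithmetic in the example can be checked with a few lines of Python. This is only a sketch: the helper names `term_frequency` and `inverse_document_frequency` are made up for illustration, and the `log` parameter is an assumption added so that both the base-10 logarithm (which yields the idf of 4 in the example) and the natural logarithm from the formulas can be compared.

```python
import math

def term_frequency(term_count, total_terms):
    """TF: occurrences of the term divided by the document length."""
    return term_count / total_terms

def inverse_document_frequency(n_docs, docs_with_term, log=math.log):
    """IDF: logarithm of (corpus size / documents containing the term)."""
    return log(n_docs / docs_with_term)

# Numbers from the example: 'cat' appears 3 times in a 100-word document,
# and in 1,000 of 10,000,000 documents.
tf_cat = term_frequency(3, 100)
idf_cat_log10 = inverse_document_frequency(10_000_000, 1_000, log=math.log10)
idf_cat_ln = inverse_document_frequency(10_000_000, 1_000)  # natural log

print(f"tf             = {tf_cat:.2f}")
print(f"idf (log10)    = {idf_cat_log10:.2f}  ->  tf-idf = {tf_cat * idf_cat_log10:.3f}")
print(f"idf (natural)  = {idf_cat_ln:.2f}  ->  tf-idf = {tf_cat * idf_cat_ln:.3f}")
```

The choice of base only rescales every idf by the same constant, so it does not change how terms rank against each other.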

### Corpus

In [28]:
# Collection of string documents

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

### SkLearn Implementation

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
skl_output = vectorizer.transform(corpus)

In [30]:
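# sklearn feature names; they are sorted in alphabetical order by default.
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out() instead.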

print(vectorizer.get_feature_names())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [31]:
# Here we will print the sklearn tfidf vectorizer idf values after applying the fit method
# After using the fit function on the corpus the vocab has 9 words in it, and each has its idf value.

print(vectorizer.idf_)

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]
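These idf values can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed formula, `1 + ln((1 + N) / (1 + df))`, and uses plain whitespace splitting as a stand-in for the vectorizer's tokenizer, which is a simplification that happens to be sufficient for this small corpus.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

n_docs = len(corpus)
# One set of tokens per document, so each document counts at most once per term.
tokenized = [set(doc.split()) for doc in corpus]
vocab = sorted(set().union(*tokenized))

# Document frequency: in how many documents does each term occur?
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocab}

# Smoothed idf (assumed to match the vectorizer's default settings).
idf_by_hand = {
    term: 1 + math.log((1 + n_docs) / (1 + doc_freq[term])) for term in vocab
}

for term in vocab:
    print(f"{term:>9}  df={doc_freq[term]}  idf={idf_by_hand[term]:.8f}")
```

Terms that occur in all four documents ("is", "the", "this") get an idf of exactly 1, because the logarithm of (1 + 4) / (1 + 4) is zero.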


In [32]:
# shape of sklearn tfidf vectorizer output after applying transform method.

skl_output.shape

(4, 9)

In [None]:
# sklearn tfidf values for first line of the above corpus.
# Here the output is a sparse matrix

print(skl_output[0])

  (0, 8)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045


In [33]:
# sklearn tfidf values for first line of the above corpus.
# To understand the output better, here we are converting the sparse output matrix to dense matrix and printing it.
# Notice that this output is normalized using L2 normalization. sklearn does this by default.

print(skl_output[0].toarray())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
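To see where these numbers come from, the first row can be rebuilt from the idf values printed earlier. This is a sketch that assumes the row is computed as raw term count times idf, followed by L2 normalization; the idf values are copied from the `vectorizer.idf_` output, so the result agrees only to the printed precision.

```python
import math

# idf values for the five terms of 'this is the first document',
# copied from the vectorizer.idf_ output above.
idf = {"document": 1.22314355, "first": 1.51082562, "is": 1.0, "the": 1.0, "this": 1.0}

# Each of these terms occurs exactly once in the first document.
counts = {term: 1 for term in idf}

# Unnormalized weights: count * idf.
raw = {term: counts[term] * idf[term] for term in idf}

# L2 normalization: divide by the Euclidean length of the row.
norm = math.sqrt(sum(value * value for value in raw.values()))
normalized = {term: value / norm for term, value in raw.items()}

for term, value in normalized.items():
    print(f"{term:>9}  {value:.8f}")
print("sum of squares:", round(sum(v * v for v in normalized.values()), 8))
```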


In [34]:
from collections import Counter
import math

from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize
from tqdm import tqdm


def fit(dataset):
    """
    Return a dictionary mapping each unique word in the corpus to its column index.
    """
    unique_words = set()  # A set stores each word only once.
    if isinstance(dataset, list):  # Check that dataset is a list (of sentence strings)
        for row in dataset:  # For every sentence in the corpus
            for word in row.split(" "):
                if len(word) < 2:  # Skip single-character tokens, as sklearn does by default
                    continue
                unique_words.add(word)
        unique_words = sorted(unique_words)  # Sort the words alphabetically
        vocab = {j: i for i, j in enumerate(unique_words)}  # word -> column index
        return vocab
    else:
        print("Provide a list of sentences")
        
def idf(dataset):
    """
    Return a dictionary with each vocab word as key and its idf as value.
    """
    N = len(dataset)
    vocab = list(fit(dataset))
    # Tokenize each document once so that only whole words are matched
    # (str.count would also find 'is' inside 'this').
    tokenized = [set(sentence.split(" ")) for sentence in dataset]
    idfDict = {}
    for word in vocab:
        # Document frequency: number of documents that contain the word
        doc_freq = sum(1 for tokens in tokenized if word in tokens)
        # Smoothed idf, the same formula sklearn uses by default
        idfDict[word] = 1 + (math.log((1 + N) / (1 + doc_freq)))
    return idfDict


def transform(dataset, vocab, idfDict):
    """
    Return a sparse matrix representation of the tf-idf values.
    """
    rows = []
    columns = []
    values = []
    if isinstance(dataset, list):
        for idx, row in enumerate(tqdm(dataset)):
            tokens = row.split(" ")
            word_freq = Counter(tokens)
            for word, freq in word_freq.items():
                if len(word) < 2:
                    continue
                col_index = vocab.get(word, -1)
                if col_index != -1:  # Ignore words that are not in the vocab
                    rows.append(idx)
                    columns.append(col_index)
                    # Divide by the number of words in the document, not by len(row),
                    # which would be the number of characters in the string.
                    tf = freq / len(tokens)
                    tfidf = tf * idfDict[word]
                    values.append(tfidf)
        return csr_matrix((values, (rows, columns)), shape=(len(dataset), len(vocab)))




dataset= ['this is the first document','this document is the second document','and this is the third one','is this the first document']
vocab=fit(dataset)
#print(vocab)
print("The vocab of the given dataset is")
print(list(vocab))
print("---"*30)
idfDict =idf(dataset)
print("The idf values of each of words is ")
print(idfDict)
print("---"*30)
output = transform(dataset,vocab,idfDict).toarray()
#print(output)
final_output = normalize(output, norm='l2', axis=1, copy=True, return_norm=False)
print("The tfidf of the given corpus in array form is")
print(final_output)
print("---"*30)
print("The shape of the tfidf vector is  ",final_output.shape)



100%|████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<?, ?it/s]

The vocab of the given dataset is
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
------------------------------------------------------------------------------------------
The idf values of each of words is 
{'and': 1.916290731874155, 'document': 1.2231435513142097, 'first': 1.5108256237659907, 'is': 1.0, 'one': 1.916290731874155, 'second': 1.916290731874155, 'the': 1.0, 'third': 1.916290731874155, 'this': 1.0}
------------------------------------------------------------------------------------------
The tfidf of the given corpus in array form is
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
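------------------------------------------------------------------------------------------
The shape of the tfidf vector is   (4, 9)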
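The custom `transform` divides each term count by a per-document constant, whereas sklearn works with raw counts, yet both produce the same array. A short sketch of why, assuming both pipelines finish with L2 normalization: scaling every entry of a row by the same positive factor leaves the normalized row unchanged. The helper `l2_normalize` below is written for this illustration only.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

# Second document: 'this document is the second document' (6 words).
# Term order: document, is, second, the, this.
counts = [2, 1, 1, 1, 1]
idf = [1.2231435513142097, 1.0, 1.916290731874155, 1.0, 1.0]

raw_count_weights = [c * w for c, w in zip(counts, idf)]            # count * idf
length_scaled_weights = [(c / 6) * w for c, w in zip(counts, idf)]  # (count / length) * idf

from_counts = l2_normalize(raw_count_weights)
from_scaled = l2_normalize(length_scaled_weights)

print([round(v, 8) for v in from_counts])
print(all(math.isclose(a, b) for a, b in zip(from_counts, from_scaled)))
```

The length normalization of TF therefore only matters when the rows are not L2-normalized afterwards.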




In [35]:
# Below is the code to load the cleaned_strings pickle file provided
# Here corpus is of list type

import pickle
with open('cleaned_strings', 'rb') as f:
    corpus = pickle.load(f)
    
# printing the length of the corpus loaded
print("Number of documents in corpus = ",len(corpus))
print(corpus[3:10])

Number of documents in corpus =  746
['little music anything speak', 'best scene movie gerardo trying find song keeps running head', 'rest movie lacks art charm meaning emptiness works guess empty', 'wasted two hours', 'saw movie today thought good effort good messages kids', 'bit predictable', 'loved casting jimmy buffet science teacher']


In [38]:
#For Dict sorting methods, used https://stackabuse.com/how-to-sort-dictionary-by-value-in-python/

from collections import Counter
import math
import operator

from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize
from tqdm import tqdm



def fit(dataset):
    """
    Return a dictionary mapping each unique word in the corpus to its column index.
    """
    unique_words = set()  # A set stores each vocab word only once.
    if isinstance(dataset, list):  # Check that dataset is a list (of sentence strings)
        for row in dataset:  # For every sentence in the corpus
            for word in row.split(" "):
                if len(word) < 2:  # Skip single-character tokens
                    continue
                unique_words.add(word)
        unique_words = sorted(unique_words)  # Sort the words alphabetically
        vocab = {j: i for i, j in enumerate(unique_words)}
        return vocab
    else:
        print("Provide a list of sentences")
        
def idf(dataset):
    """
    Return a dict of words and their idf values, restricted to the 50 highest idf values.
    This was modified from the idf function above.
    """
    N = len(dataset)
    vocab = list(fit(dataset))
    # Start with a dictionary of document frequencies and use it to build the idf dictionary
    df = dict.fromkeys(vocab, 0)
    for word in vocab:
        token = 0  # Number of documents in which the word occurs
        for sentence in dataset:
            # Caveat: str.count() matches substrings, so a word such as 'art' is also
            # counted inside 'part'; the printed output below reflects this matching.
            if sentence.count(word) >= 1:
                token = token + 1
        df[word] = token
    idfDict = df.copy()
    for word, value in idfDict.items():
        idfDict[word] = 1 + (math.log((1 + N) / (1 + value)))
    # Sort the (word, idf) pairs by idf in descending order and keep the first 50
    sorted_idfDic_tuple = sorted(idfDict.items(), key=operator.itemgetter(1), reverse=True)
    top_50_values = dict(sorted_idfDic_tuple[0:50])
    return top_50_values


def transform(dataset, vocab, idfDict):
    """
    Return a sparse matrix representation of the tf-idf vectors for the dataset.
    """
    rows = []
    columns = []
    values = []
    if isinstance(dataset, list):
        for idx, row in enumerate(tqdm(dataset)):
            tokens = row.split(" ")
            word_freq = Counter(tokens)
            for word, freq in word_freq.items():
                if len(word) < 2:
                    continue
                col_index = vocab.get(word, -1)
                if col_index != -1:  # Ignore words that are not in the vocab
                    rows.append(idx)
                    columns.append(col_index)
                    # Divide by the number of words in the document, not by len(row),
                    # which would be the number of characters in the string.
                    tf = freq / len(tokens)
                    tfidf = tf * idfDict[word]
                    values.append(tfidf)
        return csr_matrix((values, (rows, columns)), shape=(len(dataset), len(vocab)))



dataset = corpus  # The functions above use the name dataset, so alias the corpus instead of renaming everything
vocab=fit(dataset)
#print("The list of the whole vocab in the corpus is")
#print(vocab)
#print(list(vocab))
print("---"*30)
idfDict =idf(dataset)
print("The top 50 idf values for the given corpus is:")
print(idfDict)
print("---"*30)
#Create a new vocab with only top 50 idf values
new_vocab ={j:i for i,j in enumerate(list(idfDict))}
print("The words with top 50 idf values are ")
print(list(new_vocab))
print("---"*30)
vocab=new_vocab
output = transform(dataset,new_vocab,idfDict).toarray()
#print(output)
print("---"*30)
final_output = normalize(output, norm='l2', axis=1, copy=True, return_norm=False)
print("The final tfidf vector in dense form is")
print(final_output)
print("---"*30)
print("The shape of the final tfidf vector is")
print(final_output.shape)


------------------------------------------------------------------------------------------


100%|█████████████████████████████████████████████████████████████████████████████| 746/746 [00:00<00:00, 81191.31it/s]

The top 50 idf values for the given corpus is:
{'aailiyah': 6.922918004572872, 'abandoned': 6.922918004572872, 'abroad': 6.922918004572872, 'abstruse': 6.922918004572872, 'academy': 6.922918004572872, 'accents': 6.922918004572872, 'accessible': 6.922918004572872, 'acclaimed': 6.922918004572872, 'accolades': 6.922918004572872, 'accurately': 6.922918004572872, 'achille': 6.922918004572872, 'ackerman': 6.922918004572872, 'adams': 6.922918004572872, 'added': 6.922918004572872, 'admins': 6.922918004572872, 'admiration': 6.922918004572872, 'admitted': 6.922918004572872, 'adrift': 6.922918004572872, 'adventure': 6.922918004572872, 'aesthetically': 6.922918004572872, 'affected': 6.922918004572872, 'affleck': 6.922918004572872, 'afternoon': 6.922918004572872, 'agreed': 6.922918004572872, 'aimless': 6.922918004572872, 'aired': 6.922918004572872, 'akasha': 6.922918004572872, 'alert': 6.922918004572872, 'alike': 6.922918004572872, 'allison': 6.922918004572872, 'allowing': 6.922918004572872, 'along
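All of the listed words share the same idf and appear in alphabetical order. One likely explanation is that Python's `sorted` is stable, also with `reverse=True`: entries with equal keys keep their original relative order, and the vocab was already sorted alphabetically. The sketch below uses made-up words and scores purely to illustrate this tie-breaking behavior.

```python
import operator

# Hypothetical idf scores, inserted in alphabetical order; three terms tie for the top.
idf_scores = {"apple": 6.9, "banana": 2.0, "cherry": 6.9, "date": 6.9, "elder": 4.5}

# Same sorting call as in idf(): descending by value.
ranked = sorted(idf_scores.items(), key=operator.itemgetter(1), reverse=True)

# Keep only the first three entries, mirroring the top-50 slice.
top_3 = dict(ranked[:3])
print(list(top_3))  # ['apple', 'cherry', 'date']
```

When many words tie for the highest idf, as the rarest words in a corpus typically do, a top-k cut taken this way is effectively an alphabetical prefix of those tied words.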


