# TF-IDF Vectorization and Cosine Similarity

TF-IDF stands for term frequency-inverse document frequency. It is commonly used to represent text as numbers, which are then fed to machine learning algorithms for prediction. In this article we will use the scikit-learn package to calculate these numbers, then use basic math to calculate term frequency and tf-idf.

Before learning TF-IDF, let's understand count vectorization. Count vectorization finds the term frequency in a document; once the term frequency is calculated, we use this metric to calculate the term frequency-inverse document frequency.

## Terminology

- **Term Frequency tf(t,d)** - Raw count of a term in a document: the number of times term t appears in document d. This is also known as **Count Vectorization**.
- **Inverse Document Frequency** - It diminishes the weight of terms that appear very frequently across the document set and increases the weight of terms that occur rarely.
- **Cosine similarity** - A metric that helps determine how similar data objects are, irrespective of their size. We can measure the similarity between two sentences using cosine similarity.

## Scikit Sneak Peek

Let's look briefly at how scikit-learn calculates these values. **Disclaimer:** we will first use these packages and later use basic math to calculate the same values. As per the [scikit-learn TfidfTransformer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html):

The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding “1” to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. (Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ]).

If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.
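To make the two idf formulas quoted above concrete, here is a minimal sketch that evaluates both for a corpus of three documents. The function names `smooth_idf` and `plain_idf` are made up for this illustration, and the natural logarithm is assumed.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smooth_idf=True formula."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def plain_idf(n_docs, doc_freq):
    """idf(t) = ln(n / df(t)) + 1, the smooth_idf=False formula."""
    return math.log(n_docs / doc_freq) + 1

n_docs = 3
for doc_freq in (1, 2, 3):
    # A term found in every document (doc_freq == n_docs) gets idf 1.0 in both variants.
    print(doc_freq, round(smooth_idf(n_docs, doc_freq), 4), round(plain_idf(n_docs, doc_freq), 4))
```

In both variants the rarer a term is, the larger its idf, and a term that occurs in every document still keeps a weight of 1 rather than 0.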

For reference, below is a standard scheme used to measure tf-idf. Our calculated tf-idf values may not match it, because the diagram uses a different normalization technique.

![](https://nlp.stanford.edu/IR-book/html/htmledition/img462.png)

Let's get started by importing the packages we are going to use:

In [2]:
import numpy as np
import pandas as pd
import collections
import math

# scikit packages
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer,TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity,cosine_distances

## Test Data

We will use 3 sentences as our corpus of documents for testing. These sentences are alcohol warnings for 3 drugs.

In [4]:
test_data = ['It is unsafe to consume alcohol with 2 Dep 30mg Tablet',
          'Abroxy 100mg Capsule may cause excessive drowsiness with alcohol',
          'Caution is advised when consuming alcohol with Abrophyll DM Tablet Please consult your doctor']
test_data_array = np.array(test_data, dtype = 'object')
test_data_array = test_data_array.astype('str')
test_data_array

array(['It is unsafe to consume alcohol with 2 Dep 30mg Tablet',
       'Abroxy 100mg Capsule may cause excessive drowsiness with alcohol',
       'Caution is advised when consuming alcohol with Abrophyll DM Tablet Please consult your doctor'],
      dtype='<U93')

## Term Frequency

We will first use the scikit-learn vectorizer to calculate term frequency, that is, a count of the dictionary words appearing in each document.

In [5]:
# initialize count vectorizer
countvectorizer = CountVectorizer(analyzer= 'word', stop_words='english')
# the output is a document-term matrix: as the name suggests, it holds
# term counts per document
document_term_matrix = countvectorizer.fit_transform(test_data_array)
document_term_matrix

<3x19 sparse matrix of type '<class 'numpy.int64'>'
	with 22 stored elements in Compressed Sparse Row format>

The document-term matrix is stored as a sparse matrix to save space, since each document usually contains only a few of the corpus's terms. Let's convert it to a dense matrix to see the full matrix as stored per document.

In [6]:
dense = document_term_matrix.todense()
dense

matrix([[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1],
        [1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
        [0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0]],
       dtype=int64)

To understand the above matrix, let's print the bag of words, i.e. the terms that make up its columns.

In [14]:
countvectorizer.get_feature_names_out()

array(['100mg', '30mg', 'abrophyll', 'abroxy', 'advised', 'alcohol',
       'capsule', 'cause', 'caution', 'consult', 'consume', 'consuming',
       'dep', 'dm', 'doctor', 'drowsiness', 'excessive', 'tablet',
       'unsafe'], dtype=object)

As you can see from both arrays, it is merely a count of how often each word appears per document. For example, the word 30mg appears once in the first document.

## Dense and Sparse Matrix

- A **sparse matrix** is a matrix that is comprised of mostly zero values.
- Sparse matrices are distinct from matrices with mostly non-zero values, which are referred to as **dense matrices**.
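As a rough illustration of why sparse storage saves space, the sketch below keeps only the non-zero cells of the dense matrix shown earlier. It uses a plain dictionary keyed by (row, column), which is a simplification chosen for readability and not the Compressed Sparse Row layout that scikit-learn actually returns.

```python
# Dense rows copied from the document-term matrix printed earlier.
dense_rows = [
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
]

# Coordinate-style storage: keep (row, column) -> value for non-zero cells only.
sparse_cells = {
    (r, c): value
    for r, row in enumerate(dense_rows)
    for c, value in enumerate(row)
    if value != 0
}

total_cells = len(dense_rows) * len(dense_rows[0])
stored = len(sparse_cells)
print(stored, total_cells - stored, total_cells)  # 22 35 57
```

Even in this tiny corpus more than half of the cells are zero; with thousands of documents and terms the share of zeros is typically far larger.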

In [15]:
# no of non zero values
document_term_matrix.nnz

22

In [16]:
# no of zero values
len(np.argwhere(dense == 0))

35

In [17]:
# total number of values = number of documents * number of features
total_values = document_term_matrix.shape[0]* document_term_matrix.shape[1]
total_values

57

In [18]:
# non-zero values stored in the sparse matrix + zeros in the dense matrix = total number of values
document_term_matrix.nnz + len(np.argwhere(dense == 0))

57

## Let's see everything visually

In [20]:
terms = countvectorizer.get_feature_names_out()
# create data frame
df_countvect = pd.DataFrame(data = document_term_matrix.toarray(),index = ['Doc1','Doc2','Doc3'],columns = terms)
print("Count Vectorizer\n")
print(df_countvect)

Count Vectorizer

      100mg  30mg  abrophyll  abroxy  advised  alcohol  capsule  cause  \
Doc1      0     1          0       0        0        1        0      0   
Doc2      1     0          0       1        0        1        1      1   
Doc3      0     0          1       0        1        1        0      0   

      caution  consult  consume  consuming  dep  dm  doctor  drowsiness  \
Doc1        0        0        1          0    1   0       0           0   
Doc2        0        0        0          0    0   0       0           1   
Doc3        1        1        0          1    0   1       1           0   

      excessive  tablet  unsafe  
Doc1          0       1       1  
Doc2          1       0       0  
Doc3          0       1       0  


## Let's see TF-IDF Vectorization

In [21]:
tfidfvectorizer = TfidfVectorizer(analyzer= 'word', stop_words='english',norm='l2')
# returns array of shape (n_samples, n_features) = Document-term matrix.
tfid_document_term_matrix = tfidfvectorizer.fit_transform(test_data)
terms = tfidfvectorizer.get_feature_names_out()
# create data frame
df_tfidvect = pd.DataFrame(data = tfid_document_term_matrix.toarray(),index = ['Doc1','Doc2','Doc3'],columns = terms)
print("TF-IDF Vectorizer\n")
print(df_tfidvect)

TF-IDF Vectorizer

         100mg      30mg  abrophyll    abroxy   advised   alcohol   capsule  \
Doc1  0.000000  0.450504   0.000000  0.000000  0.000000  0.266075  0.000000   
Doc2  0.396875  0.000000   0.000000  0.396875  0.000000  0.234400  0.396875   
Doc3  0.000000  0.000000   0.355173  0.000000  0.355173  0.209771  0.000000   

         cause   caution   consult   consume  consuming       dep        dm  \
Doc1  0.000000  0.000000  0.000000  0.450504   0.000000  0.450504  0.000000   
Doc2  0.396875  0.000000  0.000000  0.000000   0.000000  0.000000  0.000000   
Doc3  0.000000  0.355173  0.355173  0.000000   0.355173  0.000000  0.355173   

        doctor  drowsiness  excessive    tablet    unsafe  
Doc1  0.000000    0.000000   0.000000  0.342620  0.450504  
Doc2  0.000000    0.396875   0.396875  0.000000  0.000000  
Doc3  0.355173    0.000000   0.000000  0.270118  0.000000  


## Basic Math to calculate term frequency

We will use the test data to create a dictionary of words, then go over each document and count how many times each term appears in it.

In [22]:
# for the first document.
unique_words_doc1 = test_data[0].lower().split()
unique_words_doc1 = set(unique_words_doc1)
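# remove english stop words; '2' is not a stop word, but scikit-learn's
# default tokenizer drops single-character tokens, so we remove it as well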
unique_words_doc1.remove('2')
unique_words_doc1.remove('is')
unique_words_doc1.remove('it')
unique_words_doc1.remove('to')
unique_words_doc1.remove('with')
unique_words_doc1

{'30mg', 'alcohol', 'consume', 'dep', 'tablet', 'unsafe'}
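The manual removal above can also be sketched with a regular expression. The pattern below is the default `token_pattern` of `CountVectorizer`, which keeps only tokens of two or more word characters (this is why `2` disappears). The `STOP_WORDS` set and the `tokenize` helper are hypothetical: the set is a hand-picked subset that covers just our three sentences, while scikit-learn's built-in English list is much longer.

```python
import re

# Hypothetical, hand-picked stop words; enough for the three test sentences only.
STOP_WORDS = {"it", "is", "to", "with", "may", "when", "your", "please"}

def tokenize(text):
    """Lowercase, keep tokens of two or more word characters, drop stop words."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

doc1 = "It is unsafe to consume alcohol with 2 Dep 30mg Tablet"
print(sorted(set(tokenize(doc1))))
# ['30mg', 'alcohol', 'consume', 'dep', 'tablet', 'unsafe']
```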

In [24]:
# for the second document.
unique_words_doc2 = test_data[1].lower().split()
unique_words_doc2 = set(unique_words_doc2)

# remove english stop words
unique_words_doc2.remove('may')
unique_words_doc2.remove('with')
unique_words_doc2

{'100mg', 'abroxy', 'alcohol', 'capsule', 'cause', 'drowsiness', 'excessive'}

In [25]:
# for the third document.
unique_words_doc3 = test_data[2].lower().split()
unique_words_doc3 = set(unique_words_doc3)

# remove english stop words
unique_words_doc3.remove('is')
unique_words_doc3.remove('when')
unique_words_doc3.remove('with')
unique_words_doc3.remove('your')
unique_words_doc3.remove('please')
unique_words_doc3

{'abrophyll',
 'advised',
 'alcohol',
 'caution',
 'consult',
 'consuming',
 'dm',
 'doctor',
 'tablet'}

In [27]:
# all unique words
all_unique_words = set()
all_unique_words.update(unique_words_doc1,unique_words_doc2,unique_words_doc3)
all_unique_words = sorted(all_unique_words)
all_unique_words

['100mg',
 '30mg',
 'abrophyll',
 'abroxy',
 'advised',
 'alcohol',
 'capsule',
 'cause',
 'caution',
 'consult',
 'consume',
 'consuming',
 'dep',
 'dm',
 'doctor',
 'drowsiness',
 'excessive',
 'tablet',
 'unsafe']

In [28]:
# bag of words: map each term to its column index
bag_of_words = {word: i for i, word in enumerate(all_unique_words)}

bag_of_words

{'100mg': 0,
 '30mg': 1,
 'abrophyll': 2,
 'abroxy': 3,
 'advised': 4,
 'alcohol': 5,
 'capsule': 6,
 'cause': 7,
 'caution': 8,
 'consult': 9,
 'consume': 10,
 'consuming': 11,
 'dep': 12,
 'dm': 13,
 'doctor': 14,
 'drowsiness': 15,
 'excessive': 16,
 'tablet': 17,
 'unsafe': 18}

In [29]:
# order by key such that it matches with the scikit array
bag_of_words = collections.OrderedDict(sorted(bag_of_words.items()))
bag_of_words

OrderedDict([('100mg', 0),
             ('30mg', 1),
             ('abrophyll', 2),
             ('abroxy', 3),
             ('advised', 4),
             ('alcohol', 5),
             ('capsule', 6),
             ('cause', 7),
             ('caution', 8),
             ('consult', 9),
             ('consume', 10),
             ('consuming', 11),
             ('dep', 12),
             ('dm', 13),
             ('doctor', 14),
             ('drowsiness', 15),
             ('excessive', 16),
             ('tablet', 17),
             ('unsafe', 18)])

In [30]:
# initialize one count vector per document, plus a list for the column headers
one_freq = [0]*len(bag_of_words)
two_freq = [0]*len(bag_of_words)
three_freq = [0]*len(bag_of_words)
all_words = ['']*len(bag_of_words)

In [31]:
for word in unique_words_doc1:
    word_index = bag_of_words[word]
    one_freq[word_index] +=1
one_freq

[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1]

In [32]:
for word in unique_words_doc2:
    word_index = bag_of_words[word]
    two_freq[word_index] +=1
two_freq

[1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

In [33]:
for word in unique_words_doc3:
    word_index = bag_of_words[word]
    three_freq[word_index] +=1
three_freq

[0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
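Note that the loops above iterate over sets of unique words, so every count is either 0 or 1. That matches scikit-learn here only because no term repeats within a document. A minimal sketch of counting that also handles repeated terms, assuming a list of tokens per document, could look like this (the `term_counts` helper and the mini vocabulary are made up for illustration):

```python
from collections import Counter

# Hypothetical mini vocabulary: term -> column index.
vocabulary = {"alcohol": 0, "tablet": 1, "unsafe": 2}

def term_counts(tokens, vocabulary):
    """Return a count vector aligned with the vocabulary's column indices."""
    counts = Counter(tokens)
    row = [0] * len(vocabulary)
    for term, index in vocabulary.items():
        row[index] = counts[term]  # Counter returns 0 for missing terms
    return row

# "alcohol" appears twice in this made-up token list.
print(term_counts(["alcohol", "unsafe", "alcohol"], vocabulary))  # [2, 0, 1]
```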

In [34]:
# create header for each term
for word in bag_of_words:
    word_index = bag_of_words[word]
    all_words[word_index] = word
all_words

['100mg',
 '30mg',
 'abrophyll',
 'abroxy',
 'advised',
 'alcohol',
 'capsule',
 'cause',
 'caution',
 'consult',
 'consume',
 'consuming',
 'dep',
 'dm',
 'doctor',
 'drowsiness',
 'excessive',
 'tablet',
 'unsafe']

In [35]:
# add each document word frequency in a numpy matrix
matrix = np.array([one_freq,two_freq,three_freq])

In [36]:
# create a data frame, matrix of word frequency document with the headers and index
manually_create_df = pd.DataFrame(matrix,index = ['Doc1','Doc2','Doc3'],columns=bag_of_words,dtype='float64')
manually_create_df

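| | 100mg | 30mg | abrophyll | abroxy | advised | alcohol | capsule | cause | caution | consult | consume | consuming | dep | dm | doctor | drowsiness | excessive | tablet | unsafe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Doc2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| Doc3 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |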


In [37]:
# check against the original count vectorizer
pd.DataFrame(data = document_term_matrix.toarray(),index = ['Doc1','Doc2','Doc3'],columns = terms)

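| | 100mg | 30mg | abrophyll | abroxy | advised | alcohol | capsule | cause | caution | consult | consume | consuming | dep | dm | doctor | drowsiness | excessive | tablet | unsafe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| Doc2 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| Doc3 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |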


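So far we have reproduced the count vectorization by hand, and the result matches the `CountVectorizer` output. Keep in mind that we counted over sets of unique words, so the match holds because no term appears more than once in any of our documents.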

## Define a function to calculate TF-IDF

In [39]:
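# our simplified tf-idf function (it does not reproduce scikit-learn's values exactly)
def ski_tfid(value,samples):
    """Weight a raw term count with the smoothed idf formula."""
    # NOTE: this is the term count divided by the number of documents, not the
    # document frequency df(t) (the number of documents containing the term)
    df = (value/samples) 
    #  idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1
    idf = np.log( (1 + samples) / (1 + df)) + 1
    return df*idf

Apply this function to every element of the pandas data frame we built by hand. Because the function works on each cell in isolation, it cannot tell a term that occurs in one document from a term that occurs in all of them, and it applies no L2 normalization. As a result every non-zero cell receives the same weight (0.699537), which differs from the `TfidfVectorizer` output shown earlier.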

In [40]:
res = manually_create_df.apply(lambda x: ski_tfid(x,manually_create_df.shape[0]))
res

| | 100mg | 30mg | abrophyll | abroxy | advised | alcohol | capsule | cause | caution | consult | consume | consuming | dep | dm | doctor | drowsiness | excessive | tablet | unsafe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc1 | 0.0 | 0.699537 | 0.0 | 0.0 | 0.0 | 0.699537 | 0.0 | 0.0 | 0.0 | 0.0 | 0.699537 | 0.0 | 0.699537 | 0.0 | 0.0 | 0.0 | 0.0 | 0.699537 | 0.699537 |
| Doc2 | 0.699537 | 0.0 | 0.0 | 0.699537 | 0.0 | 0.699537 | 0.699537 | 0.699537 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.699537 | 0.699537 | 0.0 | 0.0 |
| Doc3 | 0.0 | 0.0 | 0.699537 | 0.0 | 0.699537 | 0.699537 | 0.0 | 0.0 | 0.699537 | 0.699537 | 0.0 | 0.699537 | 0.0 | 0.699537 | 0.699537 | 0.0 | 0.0 | 0.699537 | 0.0 |
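To line the manual calculation up with the `TfidfVectorizer` output shown earlier, here is a minimal sketch in plain Python. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, where df(t) is the number of documents containing the term, followed by L2 normalization of each row (the `norm='l2'` setting used above). The function name `tfidf_l2` is made up for this illustration.

```python
import math

# Term counts per document, copied from the count matrix above (19 terms each).
counts = [
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
]

def tfidf_l2(counts):
    """Return tf * smoothed idf for every cell, with each row scaled to unit length."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents that contain each term.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_t)) + 1 for df_t in doc_freq]
    result = []
    for row in counts:
        weighted = [tf * weight for tf, weight in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        result.append([x / norm if norm else 0.0 for x in weighted])
    return result

weights = tfidf_l2(counts)
# Doc1: '30mg' (column 1), 'alcohol' (column 5), 'tablet' (column 17)
print(round(weights[0][1], 6), round(weights[0][5], 6), round(weights[0][17], 6))
```

For Doc1 this gives roughly 0.450504 for `30mg`, 0.266075 for `alcohol` and 0.342620 for `tablet`, in line with the scikit-learn table: `alcohol` occurs in all three documents, so it receives the lowest weight.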


## Cosine Similarity

Let's say we have 4 documents as below:

- document 1, in which the term **'iphone'** appears **3** times and the term **'galaxy'** appears 1 time
- document 2, in which the term **'iphone'** appears **2** times and the term **'galaxy'** appears 0 times
- document 3, in which the term **'iphone'** appears **1** time and the term **'galaxy'** appears 3 times
- document 4, in which the term **'iphone'** appears **1** time and the term **'galaxy'** appears 2 times

Let's create a dataframe with these values

In [76]:
df = pd.DataFrame([
        {'iphone':3, 'galaxy':1} , # iphone document
        {'iphone':2, 'galaxy':0} , # iphone document
        {'iphone':1, 'galaxy':3} , # galaxy document
        {'iphone':1, 'galaxy':2}   # galaxy document
    ],
    index = [
        'document-1',
        'document-2',
        'document-3',
        'document-4',
    ]
)
df

| | iphone | galaxy |
|---|---|---|
| document-1 | 3 | 1 |
| document-2 | 2 | 0 |
| document-3 | 1 | 3 |
| document-4 | 1 | 2 |


- document-1 (iphone) and document-2 (iphone) - they are similar
- document-1 (iphone) and document-3 (galaxy) - they are not similar
- document-1 (iphone) and document-4 (galaxy) - they are not similar
- document-2 (iphone) and document-3 (galaxy) - they are not similar
- document-2 (iphone) and document-4 (galaxy) - they are not similar
- document-3 (galaxy) and document-4 (galaxy) - they are similar

Only two pairs of documents are similar:

- document-1 and document-2 
- document-3 and document-4

Let's check the similarity score and distance for these documents

In [78]:
## comparing document-1 and document-2
sim = cosine_similarity(df['document-1':'document-1'],df['document-2':'document-2'])
dis = cosine_distances(df['document-1':'document-1'],df['document-2':'document-2'])
print('cosine similarity',sim)
print('cosine distance',dis)

[[0.9486833]]
[[0.0513167]]


In [86]:
## comparing document-3 and document-4
sim = cosine_similarity(df['document-3':'document-3'],df['document-4':'document-4'])
dis = cosine_distances(df['document-3':'document-3'],df['document-4':'document-4'])
print('cosine similarity',sim)
print('cosine distance',dis)

[[0.98994949]]
[[0.01005051]]


In [81]:
## comparing document-1 and document-3
sim = cosine_similarity(df['document-1':'document-1'],df['document-3':'document-3'])
dis = cosine_distances(df['document-1':'document-1'],df['document-3':'document-3'])
print('cosine similarity',sim)
print('cosine distance',dis)

[[0.6]]
[[0.4]]


In [82]:
## comparing document-1 and document-4
sim = cosine_similarity(df['document-1':'document-1'],df['document-4':'document-4'])
dis = cosine_distances(df['document-1':'document-1'],df['document-4':'document-4'])
print('cosine similarity',sim)
print('cosine distance',dis)

[[0.70710678]]
[[0.29289322]]


In [83]:
## comparing document-2 and document-3
sim = cosine_similarity(df['document-2':'document-2'],df['document-3':'document-3'])
dis = cosine_distances(df['document-2':'document-2'],df['document-3':'document-3'])
print('cosine similarity',sim)
print('cosine distance',dis)

[[0.31622777]]
[[0.68377223]]


In [84]:
## comparing document-2 and document-4
sim = cosine_similarity(df['document-2':'document-2'],df['document-4':'document-4'])
dis = cosine_distances(df['document-2':'document-2'],df['document-4':'document-4'])
print('cosine similarity',sim)
print('cosine distance',dis)

[[0.4472136]]
[[0.5527864]]
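The scores above can be reproduced from the definition of cosine similarity, cos(a, b) = (a · b) / (|a| |b|), with cosine distance being 1 minus that value. The sketch below is a plain-Python version; the helper name `cosine` is made up for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# (iphone, galaxy) counts from the data frame above.
docs = {
    "document-1": (3, 1),
    "document-2": (2, 0),
    "document-3": (1, 3),
    "document-4": (1, 2),
}

names = list(docs)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        similarity = cosine(docs[first], docs[second])
        print(first, second, round(similarity, 4), round(1 - similarity, 4))

# Scaling a vector does not change the angle, so the similarity stays at 1.
print(round(cosine((3, 1), (6, 2)), 4))  # 1.0
```

The last line illustrates why cosine similarity works irrespective of document size: a document with twice as many mentions of each term points in the same direction.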


Hope you enjoyed this...