
**Are TF-IDF and Word2Vec the same?**
Not at all. TF-IDF is a word-document mapping (with some normalization). It ignores word order and produces an n×m matrix (or m×n, depending on the implementation), where n is the number of words in the vocabulary and m is the number of documents. Word2Vec, on the other hand, learns a dense vector for each word from the words that appear around it. TF-IDF is computed directly from word counts with simple arithmetic; Word2Vec vectors are the hidden-layer weights of a shallow two-layer neural network. TF-IDF can assign vectors to either words or documents (the columns or rows of the matrix). Word2Vec assigns a vector to each word directly, but a document vector needs further processing, such as averaging the vectors of the document's words. Unlike TF-IDF, Word2Vec takes the placement of words in a document into account (to some extent), because it is trained on a context window around each word.
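
One common way to do the "further processing" that turns word vectors into a document vector is to average them. The sketch below only illustrates that idea: the three-dimensional vectors are made-up values standing in for a trained Word2Vec model, and `document_vector` is a hypothetical helper, not part of any library.

```python
# Toy word vectors: hypothetical values standing in for a trained Word2Vec model.
word_vectors = {
    "data":    [0.9, 0.1, 0.0],
    "science": [0.8, 0.3, 0.1],
    "park":    [0.0, 0.2, 0.9],
}

def document_vector(text, vectors):
    """Average the vectors of the words in `text` that have a vector."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None  # no word of the document has a vector
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

doc_vec = document_vector("Data science", word_vectors)
print(doc_vec)  # roughly [0.85, 0.2, 0.05], the element-wise mean
```

Averaging discards word order again, so it is only one option; weighting each word vector by its TF-IDF score before averaging is another.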

https://www.youtube.com/watch?v=_RhHA_tYYXI&ab_channel=CodeWithAarohi

In [None]:


# importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer


In [None]:

tfidf = TfidfVectorizer()
doc_1="piford provide trainings to working prefessionals"
doc_2="piford provide trainings to students"


In [None]:

response = tfidf.fit_transform([doc_1,doc_2])


In [None]:

# print how many unique words doc_1 and doc_2 contain in total
print(len(tfidf.vocabulary_))


7


In [None]:

# a model cannot work with words or raw text, only with numbers,
# so each word is mapped to a column index (assigned in alphabetical order, not at random)
tfidf.vocabulary_


{'piford': 0,
 'provide': 2,
 'trainings': 5,
 'to': 4,
 'working': 6,
 'prefessionals': 1,
 'students': 3}

In [None]:

# in each (row, column) pair, row 0 is doc_1
# and row 1 is doc_2; the column is the word's index in tfidf.vocabulary_
# the number on the far right is the word's TF-IDF weight (not a raw frequency)
print(response)


  (0, 1)	0.49844627974580596
  (0, 6)	0.49844627974580596
  (0, 4)	0.35464863330313684
  (0, 5)	0.35464863330313684
  (0, 2)	0.35464863330313684
  (0, 0)	0.35464863330313684
  (1, 3)	0.5749618667993135
  (1, 4)	0.40909010368335985
  (1, 5)	0.40909010368335985
  (1, 2)	0.40909010368335985
  (1, 0)	0.40909010368335985
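
The weights printed above can be checked by hand. This is a worked sketch that assumes TfidfVectorizer's default settings (smoothed IDF and L2 normalization), with n = 2 documents and every word occurring once per document:

```latex
\begin{aligned}
\mathrm{idf}(t) &= \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \\
\mathrm{idf}(\text{piford}) &= \ln\frac{3}{3} + 1 = 1 \quad (\text{appears in both documents}) \\
\mathrm{idf}(\text{working}) &= \ln\frac{3}{2} + 1 \approx 1.4055 \quad (\text{appears in doc\_1 only}) \\
\lVert \text{doc\_1} \rVert_2 &= \sqrt{4 \cdot 1^2 + 2 \cdot 1.4055^2} \approx 2.8197 \\
w(\text{piford}, \text{doc\_1}) &= \frac{1}{2.8197} \approx 0.3546 \\
w(\text{working}, \text{doc\_1}) &= \frac{1.4055}{2.8197} \approx 0.4984
\end{aligned}
```

The same steps with the five words of doc_2 give a norm of about 2.4444 and weights of about 0.4091 and 0.5750.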


In [None]:

feature_names = tfidf.get_feature_names_out()
# nonzero() returns matching row and column indices; use both,
# so that each weight is read from the right document
for row, col in zip(*response.nonzero()):
    print(row, feature_names[col], "  :  ", response[row, col])

0 prefessionals   :   0.49844627974580596
0 working   :   0.49844627974580596
0 to   :   0.35464863330313684
0 trainings   :   0.35464863330313684
0 provide   :   0.35464863330313684
0 piford   :   0.35464863330313684
1 students   :   0.5749618667993135
1 to   :   0.40909010368335985
1 trainings   :   0.40909010368335985
1 provide   :   0.40909010368335985
1 piford   :   0.40909010368335985




https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
dataset = [
    "I enjoy reading about Machine Learning and Machine Learning is my PhD subject",
    "I would enjoy a walk in the park",
    "I was reading in the library"
]

In [None]:

tfIdfVectorizer = TfidfVectorizer(use_idf=True)
tfIdf = tfIdfVectorizer.fit_transform(dataset)

In [None]:
tfIdf[0]

<1x17 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [None]:

# https://stackoverflow.com/questions/30416695/numpy-and-scipy-difference-between-todense-and-toarray
# toarray returns an ndarray; todense returns a matrix. If you want a matrix,
# use todense; otherwise, use toarray.
# Show the TF-IDF weights of the first document, highest first
df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names_out(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print(df.head(25))


            TF-IDF
machine   0.513720
learning  0.513720
about     0.256860
subject   0.256860
phd       0.256860
and       0.256860
my        0.256860
is        0.256860
reading   0.195349
enjoy     0.195349
library   0.000000
park      0.000000
in        0.000000
the       0.000000
walk      0.000000
was       0.000000
would     0.000000


**TfidfVectorizer vs TfidfTransformer: what is the difference?**

If you’ve seen other implementations of TF-IDF, you may have noticed that scikit-learn offers two ways to compute it. One uses the TfidfVectorizer class (as we just did) and the other uses the TfidfTransformer class. Let’s look at how they differ.

In terms of results, there is no difference between the two: TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer. In practice, TfidfTransformer needs a little more code. TfidfVectorizer takes raw text and computes both the term counts and the inverse document frequency weighting for you, whereas TfidfTransformer expects a matrix of term counts, which you first have to build with scikit-learn's CountVectorizer class.

In [None]:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer


In [None]:

tfIdfTransformer = TfidfTransformer(use_idf=True)
countVectorizer = CountVectorizer()
# Step 1: raw term counts per document
wordCount = countVectorizer.fit_transform(dataset)
# Step 2: reweight the counts by IDF and normalize each row
newTfIdf = tfIdfTransformer.fit_transform(wordCount)
df = pd.DataFrame(newTfIdf[0].T.todense(), index=countVectorizer.get_feature_names_out(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print(df.head(25))


            TF-IDF
machine   0.513720
learning  0.513720
about     0.256860
subject   0.256860
phd       0.256860
and       0.256860
my        0.256860
is        0.256860
reading   0.195349
enjoy     0.195349
library   0.000000
park      0.000000
in        0.000000
the       0.000000
walk      0.000000
was       0.000000
would     0.000000




https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/

**Python Implementation**

Some popular Python libraries can calculate TF-IDF for you. The machine learning library scikit-learn, for example, provides the TfidfVectorizer class.

We will write a TF-IDF function from scratch using the standard formula, tf-idf(t, d) = tf(t, d) × idf(t). Here tf(t, d) is the number of times the term t occurs in document d divided by the number of words in d, and idf(t) = log10(N / df(t)), where N is the number of documents and df(t) is the number of documents that contain t. We will not apply any preprocessing such as stop word removal, stemming, punctuation removal, or lowercasing. Note that the result may differ from what a library's built-in implementation returns.

In [None]:

import pandas as pd
import numpy as np


In [None]:

corpus = ['data science is one of the most important fields of science', # document 1 (row 0 in the output)
          'this is one of the best data science courses',                # document 2 (row 1 in the output)
          'data scientists analyze data' ]                               # document 3 (row 2 in the output)


In [None]:

words_set = set()

for doc in  corpus:
    words = doc.split(' ')
    # all unique word
    words_set = words_set.union(set(words))
    
print('Number of words in the corpus:',len(words_set))
print('The words in the corpus: \n', words_set)


Number of words in the corpus: 14
The words in the corpus: 
 {'best', 'science', 'most', 'this', 'important', 'fields', 'data', 'scientists', 'one', 'of', 'is', 'courses', 'the', 'analyze'}


In [None]:

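n_docs = len(corpus)         # Number of documents in the corpus
n_words_set = len(words_set) # Number of unique words in the corpus

# One row per document, one column per word
# (list() because recent pandas versions reject a set for columns)
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))

# Compute Term Frequency (TF)
for i in range(n_docs):
    words = corpus[i].split(' ') # Words in the document
    for w in words:
        # .loc avoids chained assignment, which may write to a temporary copy
        df_tf.loc[i, w] += 1 / len(words)

df_tf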


| | best | science | most | this | important | fields | data | scientists | one | of | is | courses | the | analyze |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.181818 | 0.090909 | 0.0 | 0.090909 | 0.090909 | 0.090909 | 0.0 | 0.090909 | 0.181818 | 0.090909 | 0.0 | 0.090909 | 0.0 |
| 1 | 0.111111 | 0.111111 | 0.0 | 0.111111 | 0.0 | 0.0 | 0.111111 | 0.0 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 |


The dataframe above has a column for each word and a row for each document. Each cell holds the word's relative frequency in that document: its count divided by the document's length in words.
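
The same term frequencies can also be computed without pandas. This is a minimal sketch using `collections.Counter` from the standard library; `term_frequencies` is a hypothetical helper name, and the corpus is repeated so the block runs on its own.

```python
from collections import Counter

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

def term_frequencies(document):
    """Relative frequency of each word: its count / number of words in the document."""
    words = document.split(' ')
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

tf = [term_frequencies(doc) for doc in corpus]
print(tf[2])  # {'data': 0.5, 'scientists': 0.25, 'analyze': 0.25}
```

Unlike the dataframe, each dictionary only stores the words that occur in its document; a missing word simply has a term frequency of 0.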

**Computing Inverse Document Frequency**

Now, we'll compute the inverse document frequency (IDF):

In [None]:

print("IDF of: ")

idf = {}

for w in words_set:
    k = 0    # number of documents in the corpus that contain this word
    
    for i in range(n_docs):
        if w in corpus[i].split():
            k += 1
            
    idf[w] =  np.log10(n_docs / k)
    
    print(f'{w:>15}: {idf[w]:>10}' )
    

IDF of: 
           best: 0.47712125471966244
        science: 0.17609125905568124
           most: 0.47712125471966244
           this: 0.47712125471966244
      important: 0.47712125471966244
         fields: 0.47712125471966244
           data:        0.0
     scientists: 0.47712125471966244
            one: 0.17609125905568124
             of: 0.17609125905568124
             is: 0.17609125905568124
        courses: 0.47712125471966244
            the: 0.17609125905568124
        analyze: 0.47712125471966244


**Putting it Together: Computing TF-IDF**

Since we have TF and IDF now, we can compute TF-IDF:

In [None]:

df_tf_idf = df_tf.copy()

for w in words_set:
    for i in range(n_docs):
        # .loc writes into df_tf_idf directly instead of a possible temporary copy
        df_tf_idf.loc[i, w] = df_tf.loc[i, w] * idf[w]
        
df_tf_idf


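| | best | science | most | this | important | fields | data | scientists | one | of | is | courses | the | analyze |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.032017 | 0.043375 | 0.0 | 0.043375 | 0.043375 | 0.0 | 0.0 | 0.016008 | 0.032017 | 0.016008 | 0.0 | 0.016008 | 0.0 |
| 1 | 0.053013 | 0.019566 | 0.0 | 0.053013 | 0.0 | 0.0 | 0.0 | 0.0 | 0.019566 | 0.019566 | 0.019566 | 0.053013 | 0.019566 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.11928 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.11928 |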


Notice that "data" has an IDF of 0 because it appears in every document, so log10(3/3) = 0. As a result, it is not considered an important term in this corpus. This changes in the sklearn implementation that follows, where the weight of "data" is non-zero.
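
To see why a word that occurs in every document can still get a non-zero weight, compare the plain IDF used above with a smoothed variant. The smoothed formula below is the one scikit-learn documents for its default `smooth_idf=True` setting; the two function names are hypothetical and only serve this illustration.

```python
import math

def plain_idf(n_docs, doc_freq):
    """IDF as computed manually above: log10(N / df)."""
    return math.log10(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1, which never drops to zero."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 3
# (word, number of documents in the corpus that contain it)
for word, doc_freq in [("data", 3), ("science", 2), ("best", 1)]:
    print(f"{word:>8}: plain={plain_idf(n_docs, doc_freq):.4f}  "
          f"smoothed={smoothed_idf(n_docs, doc_freq):.4f}")
```

With the plain formula, "data" gets 0 and vanishes from the TF-IDF table; with the smoothed one it gets 1, the lowest possible value, so it is down-weighted rather than removed.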

**TF-IDF Using scikit-learn**

First, we need to import sklearn's TfidfVectorizer:

In [None]:

from sklearn.feature_extraction.text import TfidfVectorizer


We need to instantiate the class first; then we can call the fit_transform method on our test corpus. This performs the same kind of calculation as above (term frequency weighted by inverse document frequency), although sklearn's defaults differ in the details.

In [None]:

tr_idf_model  = TfidfVectorizer()
tf_idf_vector = tr_idf_model.fit_transform(corpus)


Vectorizing the corpus returns a sparse matrix with one row per document and one column per word.

Here's the current shape of the matrix:

In [None]:

print(type(tf_idf_vector), tf_idf_vector.shape)


<class 'scipy.sparse.csr.csr_matrix'> (3, 14)


And we can convert it to a regular array to get a better idea of the values:

In [None]:


tf_idf_array = tf_idf_vector.toarray()

print(tf_idf_array)

[[0.         0.         0.         0.18952581 0.32089509 0.32089509
  0.24404899 0.32089509 0.48809797 0.24404899 0.48809797 0.
  0.24404899 0.        ]
 [0.         0.40029393 0.40029393 0.23642005 0.         0.
  0.30443385 0.         0.30443385 0.30443385 0.30443385 0.
  0.30443385 0.40029393]
 [0.54270061 0.         0.         0.64105545 0.         0.
  0.         0.         0.         0.         0.         0.54270061
  0.         0.        ]]


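It's now straightforward to obtain the terms that the columns correspond to by using get_feature_names_out (which replaces get_feature_names, removed in scikit-learn 1.2):

In [None]:


# .tolist() turns the returned NumPy array into a plain list of strings
words_set = tr_idf_model.get_feature_names_out().tolist()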

print(words_set)


['analyze', 'best', 'courses', 'data', 'fields', 'important', 'is', 'most', 'of', 'one', 'science', 'scientists', 'the', 'this']




Finally, we'll create a dataframe to better show the TF-IDF scores of each document:

In [None]:


df_tf_idf = pd.DataFrame(tf_idf_array, columns = words_set)

df_tf_idf

As the array output above shows, the TF-IDF scores differ from the ones we obtained manually. This is because sklearn's implementation uses a slightly different formula: by default it smooths the IDF, computing ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and scales each document's row to unit Euclidean length. It also tokenizes and lowercases the text itself. For more details, see sklearn's documentation on TF-IDF term weighting:

https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
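
To make the difference concrete, the sketch below recomputes the scores in plain Python under the assumption that TfidfVectorizer runs with its defaults: raw term counts, smoothed IDF, and L2 normalization of each row. `sklearn_style_tfidf` is a hypothetical helper, not sklearn code, and its whitespace tokenizer is a simplification that happens to give the same tokens for this corpus.

```python
import math
from collections import Counter

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

def sklearn_style_tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows (assumed sklearn defaults)."""
    tokenized = [doc.lower().split() for doc in docs]  # simplified tokenizer
    vocabulary = sorted(set(word for words in tokenized for word in words))
    n_docs = len(docs)
    # Document frequency: the number of documents each word occurs in
    doc_freq = Counter(word for words in tokenized for word in set(words))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

    rows = []
    for words in tokenized:
        counts = Counter(words)
        weights = [counts[w] * idf[w] for w in vocabulary]  # raw count times IDF
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0  # guard for empty documents
        rows.append([x / norm for x in weights])
    return vocabulary, rows

vocabulary, rows = sklearn_style_tfidf(corpus)
# Print the non-zero weights of the third document
for word, weight in zip(vocabulary, rows[2]):
    if weight > 0:
        print(f"{word:>12}: {weight:.8f}")
```

For the third document this should reproduce the 0.54270061 and 0.64105545 shown in the array above, which suggests that smoothing and normalization account for the whole difference from the manual scores.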
