<a href="https://colab.research.google.com/github/ravadhani/NLP/blob/main/BOWandTF_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Topics in the workbook:**



1.   Implementation of BOW from scratch
2.   BOW using the Sklearn tool CountVectorizer
3.   TF-IDF from scratch
4.   TF-IDF using the Sklearn tool TfidfVectorizer
5.   Summary and drawbacks








In [4]:
!pip install nltk



# Implementation of BOW from scratch

In [12]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [9]:
#Custom implementation of BOW

#sample data
corpus = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom and the age of foolishness"
]

#function to get the BOW representation as one {word: value} dict per sentence
#frequency=True stores word counts; frequency=False stores only presence (0 or 1)
def get_bow_representation(corpus, frequency = True):
  vocabulary = set([x for x in " ".join(corpus).lower().split(" ")])

  bow_rep = []
  for sentence in corpus:
    sentence_rep = dict([(v,0) for v in vocabulary])
    for word in word_tokenize(sentence):
      if frequency:
        sentence_rep[word] += 1
      else:
        sentence_rep[word] = 1
    bow_rep.append(sentence_rep)

  return bow_rep

bow_representation = get_bow_representation(corpus, True)
df = pd.DataFrame(bow_representation)
df.index = corpus
display(df)


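| document | foolishness | best | worst | it | of | times | wisdom | age | and | the | was |
|---|---|---|---|---|---|---|---|---|---|---|---|
| it was the best of times | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| it was the worst of times | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| it was the age of wisdom and the age of foolishness | 1 | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 1 | 2 | 1 |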
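
The `frequency` flag above switches between word counts and a plain presence indicator, but only the count output is shown. As a minimal self-contained sketch of the binary case (it uses `str.split` instead of `word_tokenize`, an assumption that holds for this punctuation-free corpus, and `binary_bow` is a hypothetical helper name, not part of the notebook):

```python
corpus = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom and the age of foolishness",
]

def binary_bow(corpus):
    """Return one {word: 0 or 1} dict per document (presence, not counts)."""
    # A sorted vocabulary gives a stable column order
    vocabulary = sorted({word for doc in corpus for word in doc.lower().split()})
    rows = []
    for doc in corpus:
        present = set(doc.lower().split())
        rows.append({word: int(word in present) for word in vocabulary})
    return rows

binary_rows = binary_bow(corpus)
# "age" occurs twice in the third sentence but is stored as 1
print(binary_rows[2]["age"])
```

The binary form can be preferable when only the presence of a word matters and repeated words should not dominate the vector.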


# Implementation of BOW using Sklearn CountVectorizer



*   The Sklearn CountVectorizer converts text documents into a matrix of token counts.
*   It creates a document-term matrix in which each row represents a document from the corpus and each column represents a unique word (or token) in the corpus. The value in each cell is the frequency of that word in that document.
*   This is what we implemented from scratch above. Now we can use the sklearn tool and compare the two implementations.






In [11]:
#Using CountVectorizer from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
#learn the vocabulary dictionary and return document-term matrix
bow_rep = cv.fit_transform(corpus).todense() #todense() returns a matrix
#create dataframe
df = pd.DataFrame(bow_rep)
#get output feature names for dataframe columns
df.columns = cv.get_feature_names_out()
df.index = corpus
display(df)

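| document | age | and | best | foolishness | it | of | the | times | was | wisdom | worst |
|---|---|---|---|---|---|---|---|---|---|---|---|
| it was the best of times | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| it was the worst of times | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| it was the age of wisdom and the age of foolishness | 2 | 1 | 0 | 1 | 1 | 2 | 2 | 0 | 1 | 1 | 0 |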




*   The counts match our from-scratch implementation; only the column order differs, because CountVectorizer sorts its vocabulary alphabetically.
*   CountVectorizer also has many built-in text processing features.
*   For example, it can remove stop words from the corpus directly.
*   The code below uses the `stop_words` option of CountVectorizer.





In [15]:
#using CountVectorizer to remove stop words from the corpus
cv1 = CountVectorizer(stop_words="english")
bow_rep1 = cv1.fit_transform(corpus).todense()
df1 = pd.DataFrame(bow_rep1)
df1.columns = cv1.get_feature_names_out()
df1.index = corpus
display(df1)

| document | age | best | foolishness | times | wisdom | worst |
|---|---|---|---|---|---|---|
| it was the best of times | 0 | 1 | 0 | 1 | 0 | 0 |
| it was the worst of times | 0 | 0 | 0 | 1 | 0 | 1 |
| it was the age of wisdom and the age of foolishness | 2 | 0 | 1 | 0 | 1 | 0 |




*  As the matrix above shows, the stop words no longer appear among the vocabulary columns.
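
To make the filtering step concrete, here is a rough pure-Python sketch of the same idea. It assumes a tiny hand-picked stop-word set (`STOP_WORDS` and `bow_without_stop_words` are hypothetical names chosen for illustration; scikit-learn's built-in English list is far longer) and whitespace tokenization:

```python
from collections import Counter

corpus = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom and the age of foolishness",
]

# Hypothetical stop-word set, hand-picked for this corpus only
STOP_WORDS = {"it", "was", "the", "of", "and"}

def bow_without_stop_words(corpus, stop_words):
    """Count the words of each document after dropping stop words."""
    counts = [
        Counter(w for w in doc.lower().split() if w not in stop_words)
        for doc in corpus
    ]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing words, so every row has the same columns
    return vocabulary, [[c[w] for w in vocabulary] for c in counts]

vocabulary, matrix = bow_without_stop_words(corpus, STOP_WORDS)
print(vocabulary)
for row in matrix:
    print(row)
```

For this corpus the remaining vocabulary and counts should line up with the CountVectorizer table above.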




Drawbacks of simple BOW:



*   It cannot distinguish rare but important words from common ones, because every occurrence counts equally.
*   Weighting the words appropriately can solve this problem.
*   TF-IDF is a popular weighting scheme for the BOW representation.
*   It evaluates the importance of a term in a document relative to the whole collection of documents, using weights computed from Term Frequency (TF) and Inverse Document Frequency (IDF).
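
The weighting described above can be written out as follows. This is the plain textbook form, and it is the form the from-scratch code in the next section computes. The symbols are introduced here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

A term that occurs in every document has $\mathrm{idf}(t) = \ln 1 = 0$, so its TF-IDF weight is zero regardless of how often it occurs.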





# Implementing TF-IDF from scratch

In [17]:
#Computing Term Frequency (TF)

#sample data
corpus = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom and the age of foolishness"
]

def get_term_frequency(corpus):
  vocabulary = set([x for x in " ".join(corpus).lower().split(" ")])

  term_freq = []
  for sentence in corpus:
    sentence_tf = dict([(v,0) for v in vocabulary])
    for word in word_tokenize(sentence):
      #count how often the word occurs in the sentence
      sentence_tf[word] += 1
    for v in vocabulary:
      #tf = count of the term in the sentence / total number of terms in the sentence
      sentence_tf[v] /= len(word_tokenize(sentence))
    term_freq.append(sentence_tf)

  return term_freq

term_freq = get_term_frequency(corpus)
df_tf = pd.DataFrame(term_freq)
df_tf.index = corpus
display(df_tf)


| document | foolishness | best | worst | it | of | times | wisdom | age | and | the | was |
|---|---|---|---|---|---|---|---|---|---|---|---|
| it was the best of times | 0.0 | 0.166667 | 0.0 | 0.166667 | 0.166667 | 0.166667 | 0.0 | 0.0 | 0.0 | 0.166667 | 0.166667 |
| it was the worst of times | 0.0 | 0.0 | 0.166667 | 0.166667 | 0.166667 | 0.166667 | 0.0 | 0.0 | 0.0 | 0.166667 | 0.166667 |
| it was the age of wisdom and the age of foolishness | 0.090909 | 0.0 | 0.0 | 0.090909 | 0.181818 | 0.0 | 0.090909 | 0.181818 | 0.090909 | 0.181818 | 0.090909 |


In [18]:
#Computing Inverse Document Frequency (IDF)
import numpy as np

def get_inverse_document_frequency(corpus):
  vocabulary = set([x for x in " ".join(corpus).lower().split(" ")])
  n = len(corpus)

  inverse_document_frequency = {}
  for v in vocabulary:
    num_docs = 0
    for sentence in corpus:
      if v in word_tokenize(sentence):
        num_docs += 1
    #IDF of each word = log(number of documents / number of documents containing the word)
    #np.log is the natural logarithm
    inverse_document_frequency[v] = np.log(n/num_docs)

  return inverse_document_frequency

inverse_document_frequency = get_inverse_document_frequency(corpus)
inverse_document_frequency



{'foolishness': 1.0986122886681098,
 'best': 1.0986122886681098,
 'worst': 1.0986122886681098,
 'it': 0.0,
 'of': 0.0,
 'times': 0.4054651081081644,
 'wisdom': 1.0986122886681098,
 'age': 1.0986122886681098,
 'and': 1.0986122886681098,
 'the': 0.0,
 'was': 0.0}

**Calculating TF-IDF from the above two functions.**

In [19]:
def get_tf_idf(corpus):
  tf = get_term_frequency(corpus)
  idf = get_inverse_document_frequency(corpus)

  tf_idf = []
  for tf_dict in tf:
    tf_idf_sentence = {}
    for t, term_freq in tf_dict.items():
      #for each term, tf-idf = term frequency in the sentence * IDF of the term
      tf_idf_sentence[t] = term_freq * idf[t]
    tf_idf.append(tf_idf_sentence)

  return tf_idf

tf_idf = get_tf_idf(corpus)
df2 = pd.DataFrame(tf_idf)
df2.index = corpus
display(df2)


| document | foolishness | best | worst | it | of | times | wisdom | age | and | the | was |
|---|---|---|---|---|---|---|---|---|---|---|---|
| it was the best of times | 0.0 | 0.183102 | 0.0 | 0.0 | 0.0 | 0.067578 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| it was the worst of times | 0.0 | 0.0 | 0.183102 | 0.0 | 0.0 | 0.067578 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| it was the age of wisdom and the age of foolishness | 0.099874 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.099874 | 0.199748 | 0.099874 | 0.0 | 0.0 |
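
A few cells of this table can be checked by hand. The sketch below redoes the arithmetic for the first sentence with the standard library only; the variable names are chosen here for illustration and the token and document counts are read off the corpus:

```python
import math

n_docs = 3  # documents in the corpus

# "best" in "it was the best of times": 1 of 6 tokens, found in 1 of 3 documents
tf_best = 1 / 6
idf_best = math.log(n_docs / 1)
print(round(tf_best * idf_best, 6))  # 0.183102

# "times": 1 of 6 tokens, found in 2 of 3 documents
tf_times = 1 / 6
idf_times = math.log(n_docs / 2)
print(round(tf_times * idf_times, 6))  # 0.067578

# "the" appears in all 3 documents, so its IDF is log(1) = 0
idf_the = math.log(n_docs / 3)
print(idf_the)  # 0.0
```

This also explains the zero columns: "it", "was", "the" and "of" occur in every sentence, so this unsmoothed IDF removes them entirely.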




*   The above implementation of TF-IDF is from scratch.
*   There is a sklearn tool TfidfVectorizer which can do the same.




**TF-IDF using Sklearn tool TfidfVectorizer**

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

#using the built-in TfidfVectorizer class to calculate TF-IDF
tf_idf_vectorizer = TfidfVectorizer()
tf_idf_rep = tf_idf_vectorizer.fit_transform(corpus).todense()
df_tf_idf_inbuilt = pd.DataFrame(tf_idf_rep)
df_tf_idf_inbuilt.columns = tf_idf_vectorizer.get_feature_names_out()
df_tf_idf_inbuilt.index = corpus
display(df_tf_idf_inbuilt)

| document | age | and | best | foolishness | it | of | the | times | was | wisdom | worst |
|---|---|---|---|---|---|---|---|---|---|---|---|
| it was the best of times | 0.0 | 0.0 | 0.579897 | 0.0 | 0.342496 | 0.342496 | 0.342496 | 0.441027 | 0.342496 | 0.0 | 0.0 |
| it was the worst of times | 0.0 | 0.0 | 0.0 | 0.0 | 0.342496 | 0.342496 | 0.342496 | 0.441027 | 0.342496 | 0.0 | 0.579897 |
| it was the age of wisdom and the age of foolishness | 0.617558 | 0.308779 | 0.0 | 0.308779 | 0.18237 | 0.36474 | 0.36474 | 0.0 | 0.18237 | 0.308779 | 0.0 |


*   The results differ from the from-scratch implementation because sklearn's defaults differ: it uses raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and scales each row to unit L2 norm.
*   Like CountVectorizer, TfidfVectorizer has many built-in text processing options, such as removing stop words and setting the n-gram range.
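
To see where the sklearn numbers come from, the following is a minimal pure-Python sketch of the smoothing and normalization mentioned above. It assumes the documented default settings of TfidfVectorizer (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) and uses whitespace tokenization, which happens to agree with sklearn's tokenizer on this corpus; `sklearn_style_tfidf` is a hypothetical helper name:

```python
import math
from collections import Counter

corpus = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom and the age of foolishness",
]

def sklearn_style_tfidf(corpus):
    """TF-IDF with raw counts, smoothed IDF and L2 row normalization."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        # Raw count times IDF, then scale the row to unit length
        weights = {t: c * idf[t] for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = sklearn_style_tfidf(corpus)
# Compare with the "best" and "it" cells of the first row in the table above
print(round(rows[0]["best"], 6), round(rows[0]["it"], 6))
```

Because of the `+ 1` in the smoothed IDF, words that occur in every document keep a non-zero weight here, unlike in the from-scratch version.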




In [25]:
#Unigrams and bigrams using TfidfVectorizer (ngram_range=(1,2))

tf_idf_vectorizer_bigram = TfidfVectorizer(ngram_range=(1,2))
tf_idf_rep1 = tf_idf_vectorizer_bigram.fit_transform(corpus).todense()
df_tf_idf1 = pd.DataFrame(tf_idf_rep1)
df_tf_idf1.columns = tf_idf_vectorizer_bigram.get_feature_names_out()
df_tf_idf1.index = corpus
display(df_tf_idf1)

| document | age | age of | and | and the | best | best of | foolishness | it | it was | of | ... | the age | the best | the worst | times | was | was the | wisdom | wisdom and | worst | worst of |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| it was the best of times | 0.0 | 0.0 | 0.0 | 0.0 | 0.400008 | 0.400008 | 0.0 | 0.236251 | 0.236251 | 0.236251 | ... | 0.0 | 0.400008 | 0.0 | 0.304216 | 0.236251 | 0.236251 | 0.0 | 0.0 | 0.0 | 0.0 |
| it was the worst of times | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.236251 | 0.236251 | 0.236251 | ... | 0.0 | 0.0 | 0.400008 | 0.304216 | 0.236251 | 0.236251 | 0.0 | 0.0 | 0.400008 | 0.400008 |
| it was the age of wisdom and the age of foolishness | 0.415353 | 0.415353 | 0.207677 | 0.207677 | 0.0 | 0.0 | 0.207677 | 0.122657 | 0.122657 | 0.245314 | ... | 0.415353 | 0.0 | 0.0 | 0.0 | 0.122657 | 0.122657 | 0.207677 | 0.207677 | 0.0 | 0.0 |
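
The column names above mix single words and word pairs because `ngram_range=(1,2)` asks for both unigrams and bigrams. As an illustrative sketch of how such features can be generated from a token list (`word_ngrams` is a hypothetical helper, not the sklearn implementation):

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

features = word_ngrams("it was the best of times".split())
print(features)
```

A six-word sentence yields six unigrams and five bigrams, which is why the vocabulary grows quickly as the n-gram range widens.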


# Summary



*  Both BOW and TF-IDF represent text as numeric vectors.
*  BOW (CountVectorizer) records how often each word occurs, whereas TF-IDF builds on those counts and down-weights words that appear in many documents, so that rarer, more distinctive words get higher weights.
*  TF-IDF can therefore reduce the influence of common, uninformative words, which can help a model train and converge faster.

**Drawbacks:**

*  Both fail to capture the position of words in the text.
*  On a large corpus, both can become computationally heavy and hard to store (memory constraints), because the vocabulary grows with the corpus.
*  Both only look at the presence or count of words and treat each word independently, so they cannot tell that two different words have similar meanings.
*  The vectors are highly sparse, with few non-zero values.
*  Neither captures the context or semantics of a word.
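
Two of these drawbacks are easy to demonstrate with the standard library. The sketch below uses two illustrative sentences that are not part of the corpus above to show the loss of word order, and then counts the zero cells of the count matrix for the three-sentence corpus (whitespace tokenization is assumed):

```python
from collections import Counter

# Illustrative sentences: same words, opposite meaning
first = Counter("dog bites man".split())
second = Counter("man bites dog".split())
print(first == second)  # True: the bag-of-words vectors are identical

corpus = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom and the age of foolishness",
]

# Sparsity: cells of the document-term matrix that hold a zero
docs = [set(doc.split()) for doc in corpus]
vocabulary = set().union(*docs)
zero_cells = sum(len(vocabulary) - len(doc) for doc in docs)
total_cells = len(vocabulary) * len(docs)
print(zero_cells, "of", total_cells, "cells are zero")
```

Even in this tiny corpus a sizeable share of the matrix is empty, and the share grows as more documents add new words to the vocabulary.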








