<a href="https://colab.research.google.com/github/scsanjay/ml_from_scratch/blob/main/01.%20Bag%20of%20Words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

BoW (Bag of Words) is one of the simplest techniques for converting documents into vectors. These documents can be text messages, reviews, emails, etc. We cannot apply machine learning algorithms to data unless it is in numeric form.


---


Each BoW vector has a length equal to the number of unique words in the corpus (a corpus is a collection of documents). We represent every document with a vector of this same length, and each cell in the vector holds the number of occurrences of the corresponding word in that document. In Boolean BoW, a cell is 1 if the word is present in the document and 0 otherwise. Since the vectors are very sparse, we will store them in a compressed sparse row (CSR) matrix.
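To make the description above concrete, here is a minimal pure-Python sketch that builds count and Boolean BoW vectors for a tiny made-up corpus. The helper name `bow_vectors` and the two example sentences are assumptions for illustration only, and the sketch uses dense lists rather than a sparse matrix.

```python
from collections import Counter

def bow_vectors(corpus, binary=False):
    """Return (vocabulary, vectors) for whitespace-tokenised documents."""
    # Vocabulary: sorted unique words across the whole corpus.
    vocabulary = sorted({word for doc in corpus for word in doc.split()})
    vectors = []
    for doc in corpus:
        counts = Counter(doc.split())
        if binary:
            # Boolean BoW: 1 if the word is present, otherwise 0.
            row = [1 if counts[word] else 0 for word in vocabulary]
        else:
            # Count BoW: number of occurrences of each word.
            row = [counts[word] for word in vocabulary]
        vectors.append(row)
    return vocabulary, vectors

# Hypothetical toy corpus, not the one used later in this notebook.
toy_corpus = ["the cat sat", "the cat saw the dog"]
vocab, counts = bow_vectors(toy_corpus)
_, flags = bow_vectors(toy_corpus, binary=True)
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(counts)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
print(flags)   # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 1]]
```

Only the last cell of the second document differs between the two variants, because "the" occurs twice there.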

In [60]:
from scipy.sparse import csr_matrix,lil_matrix
import numpy as np

class Bow:
  """
  Converts a corpus into vector representation
  
  Parameters
  ----------
  binary : bool, default=False
      If True it will return Boolean BoW.

  Attributes
  ----------
  vocabulary_ : dict
      Dictionary mapping each feature (word) to its column index.
  
  Methods
  -------
  fit(corpus) : return self;
  transform(corpus) : return scipy.sparse.csr_matrix;
  fit_transform(corpus) : return scipy.sparse.csr_matrix;
  get_feature_names() : return list;

  Note
  -----
  It assumes the data is already preprocessed.
  """

  def __init__(self, binary=False):
    self.binary = binary
  
  def fit(self, corpus):
    """
    It will learn the vocabulary from the given corpus.

    Parameters
    ----------
    corpus : iterable
        A list of documents.

    Returns
    -------
    self
    """
    if len(corpus)==0:
      raise ValueError('Empty corpus provided.')
    # Collect the unique whitespace-separated words across all documents.
    vocabulary = set()
    for document in corpus:
      vocabulary.update(document.split())
    # Sort so that the column order is deterministic.
    self.vocabulary = sorted(vocabulary)
    self.no_of_features = len(self.vocabulary)
    self.vocabulary_ = {j:i for i,j in enumerate(self.vocabulary)}
    return self

  def transform(self, corpus):
    """
    It will transform the corpus into a sparse matrix and return it.

    Parameters
    ----------
    corpus : iterable
        A list of documents.

    Returns
    -------
    scipy.sparse.csr_matrix
    """
    if not hasattr(self, 'vocabulary_'):
      raise NotImplementedError('fit method not called yet.')
    self.no_of_documents = len(corpus)
    # int64 avoids overflow when a word occurs more than 127 times in a document.
    corpus_array = lil_matrix((self.no_of_documents, self.no_of_features), dtype=np.int64)
    for i,document in enumerate(corpus):
      document = document.split()
      for feature in set(document):
        feature_index = self.vocabulary_.get(feature)
        if feature_index is not None:
          count = document.count(feature)
          if self.binary and count:
            count = 1
          corpus_array[i,feature_index] = count
    corpus_array = corpus_array.tocsr()
    corpus_array.sort_indices()
    return corpus_array

  def fit_transform(self, corpus):
    """
    It will learn the vocabulary, transform the corpus into a sparse matrix and return it.

    Parameters
    ----------
    corpus : iterable
        A list of documents.

    Returns
    -------
    scipy.sparse.csr_matrix
    """
    self.fit(corpus)
    corpus_array = self.transform(corpus)
    return corpus_array

  def get_feature_names(self):
    """
    It will return the learned feature names (words) in column order.

    Returns
    -------
    list
        Sorted list of the words in the vocabulary.
    """
    if not hasattr(self, 'vocabulary'):
      raise NotImplementedError('fit or fit_transform method not called yet.')
    return self.vocabulary
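
The `transform` method above returns its result in CSR format. To show what that format stores, the following sketch builds the three CSR arrays (`data`, `indices`, `indptr`) by hand for a small dense matrix. `dense_to_csr` is a hypothetical helper written for this illustration, not part of SciPy; it only mirrors the layout that the CSR format uses.

```python
def dense_to_csr(rows):
    """Return (data, indices, indptr) for a list of equal-length rows."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for column, value in enumerate(row):
            if value != 0:
                data.append(value)      # the non-zero value itself
                indices.append(column)  # the column it sits in
        indptr.append(len(data))        # offset where the next row starts
    return data, indices, indptr

# Hypothetical 3 x 4 count matrix with an all-zero middle row.
dense = [
    [0, 2, 0, 1],
    [0, 0, 0, 0],
    [3, 0, 0, 1],
]
data, indices, indptr = dense_to_csr(dense)
print(data)     # [2, 1, 3, 1]
print(indices)  # [1, 3, 0, 3]
print(indptr)   # [0, 2, 2, 4]
```

The non-zero entries of row `i` are `data[indptr[i]:indptr[i+1]]`, so only 4 of the 12 cells are stored here. For BoW matrices, where most cells are zero, the saving is much larger.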


## Compare Bow with sklearn's CountVectorizer

In [52]:
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

In [58]:
model = Bow()
model.fit(corpus)
X = model.transform(corpus)
print(model.get_feature_names())
print(model.vocabulary_)
print(X.toarray())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [65]:
model = Bow()
X = model.fit_transform(corpus)
print(model.get_feature_names())
print(model.vocabulary_)
print(X.toarray())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


We get the same results from fit_transform as from fit followed by transform.

In [63]:
model = Bow()
X = model.fit_transform(corpus)
print(model.get_feature_names())
print(model.vocabulary_)
print(X.toarray())
print('-'*50)
model2 = Bow(binary=True)
X = model2.fit_transform(corpus)
print(X.toarray())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
--------------------------------------------------
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [64]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
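# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())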
print(vectorizer.vocabulary_)
print(X.toarray())
print('-'*50)
vectorizer2 = CountVectorizer(binary=True)
X = vectorizer2.fit_transform(corpus)
print(X.toarray())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
--------------------------------------------------
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


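The results from our implementation and from sklearn's CountVectorizer are identical: the same feature names, the same count matrix and the same binary matrix. The two vocabulary_ dictionaries hold the same word-to-index mapping and differ only in the order of their keys.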
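
This agreement depends on the corpus. To the best of our knowledge, CountVectorizer by default lowercases the text and keeps only tokens of two or more word characters (its default `token_pattern` is `(?u)\b\w\w+\b`), whereas `Bow` only splits on whitespace. The sketch below imitates both tokenisers with the standard library to show where they would diverge. `whitespace_tokens` and `sklearn_like_tokens` are hypothetical helpers, and the second is an approximation of sklearn's default behaviour, not its actual code.

```python
import re

# Assumed to match CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def whitespace_tokens(text):
    """Tokenise the way Bow does: split on whitespace only."""
    return text.split()

def sklearn_like_tokens(text):
    """Approximate CountVectorizer's defaults: lowercase, then regex tokens."""
    return TOKEN_PATTERN.findall(text.lower())

same = "this is the first document"  # a document from the corpus above
different = "I am a Cat!"            # made-up text with case and punctuation
print(whitespace_tokens(same) == sklearn_like_tokens(same))  # True
print(whitespace_tokens(different))    # ['I', 'am', 'a', 'Cat!']
print(sklearn_like_tokens(different))  # ['am', 'cat']
```

On text with capital letters, punctuation or one-letter words, the two implementations would therefore learn different vocabularies unless the text is preprocessed first, which is what the class docstring assumes.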