# Bag of Words

Bag of words (BoW) is a classical text representation technique that has been used commonly in NLP, especially in text classification problems. The key idea behind it is as follows: **represent the text under consideration as a bag (collection) of words while ignoring the order and context.**

The intuition is that text belonging to a given class in the dataset is characterized by a distinctive set of words. If two pieces of text contain nearly the same words, they are assumed to belong to the same bag (class). Thus, by analyzing the words present in a piece of text, one can identify the class (bag) it belongs to.

Similar to one-hot encoding, BoW maps each word in the vocabulary $V$ to a unique integer ID between 1 and $|V|$. Each document in the corpus is then converted into a vector of $|V|$ dimensions, where the $i$th component, with $i = w_{id}$, is the number of times the word $w$ occurs in the document. In other words, we score each word in $V$ by its occurrence count in the document.
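
One way to write this definition down more compactly is sketched below. The symbols $\mathbf{v}_d$ (the vector for document $d$) and $c(w_i, d)$ (the count of the word with ID $i$ in $d$) are notation introduced here for illustration only.

$$
\mathbf{v}_d = \big(c(w_1, d),\; c(w_2, d),\; \dots,\; c(w_{|V|}, d)\big), \qquad c(w_i, d) = \text{number of times } w_i \text{ occurs in } d
$$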

**Our toy corpus**

|  |  |
| --- | --- |
| D1 | Dog bites man. |
| D2 | Man bites dog. |
| D3 | Dog eats meat. |
| D4 | Man eats food. |

For our toy corpus, with the word IDs `dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6`, D1 becomes `[1 1 1 0 0 0]`: the first three words in the vocabulary appear exactly once in D1, and the last three do not appear at all. D4 becomes `[0 0 1 0 1 1]`.
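
To make the counting step concrete, here is a minimal sketch in plain Python that reproduces the two vectors above. The helper name `bow_vector` and the very simple tokenization (lowercase, strip periods, split on whitespace) are assumptions made for this illustration; the word IDs are the ones given in the text.

```python
from collections import Counter

# Word IDs as given in the text (1-based).
word_ids = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}


def bow_vector(text, word_ids):
    """Return the count vector of `text` over the vocabulary in `word_ids`."""
    # Simplistic tokenization (an assumption): lowercase, drop periods, split on whitespace.
    tokens = text.lower().replace(".", "").split()
    counts = Counter(tokens)
    vector = [0] * len(word_ids)
    for word, idx in word_ids.items():
        vector[idx - 1] = counts[word]  # the word with ID i goes to position i - 1
    return vector


d1 = bow_vector("Dog bites man.", word_ids)
d4 = bow_vector("Man eats food.", word_ids)
print(d1)  # [1, 1, 1, 0, 0, 0]
print(d4)  # [0, 0, 1, 0, 1, 1]
```

A `Counter` returns 0 for words it has not seen, so words absent from a document need no special handling.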

Finally, we get the following **Bag-of-Words** matrix for the whole corpus.

**Bag-of-Words Matrix**

|   | dog | bites | man | meat | food | eats |
| --- | --- | --- | --- | --- | --- | --- |
| D1 | 1 | 1 | 1 | 0 | 0 | 0 |
| D2 | 1 | 1 | 1 | 0 | 0 | 0 |
| D3 | 1 | 0 | 0 | 1 | 0 | 1 |
| D4 | 0 | 0 | 1 | 0 | 1 | 1 |

In [1]:
documents = [
  "Dog bites man.",
  "Man bites dog.",
  "Dog eats meat.",
  "Man eats food."
]

processed_docs = [doc.lower().replace('.', '') for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

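Now for the main task: building the bag-of-words representation. We will use `CountVectorizer` from scikit-learn. Note that `CountVectorizer` assigns 0-based column indices in alphabetical order of the words, so the IDs printed below differ from the ones used in the example above.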

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
count_vect = CountVectorizer()

In [4]:
print('Our corpus: ', processed_docs)

Our corpus:  ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']


In [5]:
# Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)
print('Our vocabulary: ', count_vect.vocabulary_)

Our vocabulary:  {'dog': 1, 'bites': 0, 'man': 4, 'eats': 2, 'meat': 5, 'food': 3}


In [6]:
# see the BOW rep for first 2 documents
print("BoW representation for 'dog bites man': ", bow_rep[0].toarray())
print("BoW representation for 'man bites dog': ", bow_rep[1].toarray())

BoW representation for 'dog bites man':  [[1 1 0 0 1 0]]
BoW representation for 'man bites dog':  [[1 1 0 0 1 0]]


In [7]:
bow_rep.toarray()

array([[1, 1, 0, 0, 1, 0],
       [1, 1, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 1],
       [0, 0, 1, 1, 1, 0]])

Let's show the BoW vectors in a DataFrame.

In [8]:
import pandas as pd

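# Label the columns in vocabulary-index order. Iterating over vocabulary_
# yields insertion order, which does not match the matrix columns.
bow_cols = count_vect.get_feature_names_out()  # requires scikit-learn >= 1.0
bow_index = ['D1', 'D2', 'D3', 'D4']
pd.DataFrame(bow_rep.toarray(), columns=bow_cols, index=bow_index)

|   | bites | dog | eats | food | man | meat |
| --- | --- | --- | --- | --- | --- | --- |
| D1 | 1 | 1 | 0 | 0 | 1 | 0 |
| D2 | 1 | 1 | 0 | 0 | 1 | 0 |
| D3 | 0 | 1 | 1 | 0 | 0 | 1 |
| D4 | 0 | 0 | 1 | 1 | 1 | 0 |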


In [9]:
# Get the representation using this vocabulary, for a new text
temp = count_vect.transform(['dog and dog are friends'])
print("Bow representation for 'dog and dog are friends':", temp.toarray())

Bow representation for 'dog and dog are friends': [[0 2 0 0 0 0]]


In the code above, the representation takes word frequency into account. **Sometimes, however, we do not care much about frequency and only want to know whether a word appears in a text or not.** Researchers have shown that such **a representation that ignores frequency is useful for sentiment analysis**.

That is, each document is represented as a vector of 0s and 1s. We will use the option `binary=True` in `CountVectorizer` for this purpose.

In [10]:
# BoW with binary vectors
count_vect = CountVectorizer(binary=True)
bow_rep_bin = count_vect.fit_transform(processed_docs)
temp = count_vect.transform(['dog and dog are friends'])
print("Bow representation for 'dog and dog are friends':", temp.toarray())

Bow representation for 'dog and dog are friends': [[0 1 0 0 0 0]]


Let’s look at some of the advantages of this encoding:

* Like one-hot encoding, BoW is fairly simple to understand and implement.
* With this representation, documents that share words have vectors that are closer to each other in Euclidean space than documents with completely different words. The distance between D1 and D2 is 0, whereas the distance between D1 and D4 is 2. Thus, the vector space resulting from the BoW scheme captures document similarity at the level of shared vocabulary: if two documents have similar vocabulary, they’ll be closer to each other in the vector space, and vice versa.
* We have a fixed-length encoding for any sentence of arbitrary length.
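
The distances quoted in the list above can be checked with a few lines of Python. This is only a sketch: the vectors are copied by hand from the Bag-of-Words matrix shown earlier, and `math.dist` requires Python 3.8 or newer.

```python
import math

# Count vectors copied from the Bag-of-Words matrix
# (columns: dog, bites, man, meat, food, eats).
d1 = [1, 1, 1, 0, 0, 0]  # Dog bites man.
d2 = [1, 1, 1, 0, 0, 0]  # Man bites dog.
d4 = [0, 0, 1, 0, 1, 1]  # Man eats food.

dist_d1_d2 = math.dist(d1, d2)  # Euclidean distance
dist_d1_d4 = math.dist(d1, d4)
print(dist_d1_d2)  # 0.0
print(dist_d1_d4)  # 2.0
```

D1 and D4 differ in four positions by exactly 1 each, so the distance is the square root of 4.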

However, it has its share of disadvantages, too:

* The size of the vector increases with the size of the vocabulary. Thus, sparsity continues to be a problem. One way to control it is to limit the vocabulary to the n most frequent words.
* It does not capture the similarity between different words that mean the same thing. Say we have three documents: “I run”, “I ran”, and “I ate”. The BoW vectors of all three documents will be equally far apart, even though “run” and “ran” are closer in meaning.
* This representation does not have any way to handle out-of-vocabulary (OOV) words, i.e., new words that were not seen in the corpus that was used to build the vectorizer.
* As the name indicates, it is a “bag” of words: word order information is lost in this representation. Both D1 and D2 will have the same representation in this scheme.
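
The four limitations above can be reproduced with a small hand-rolled vectorizer. This is a sketch under stated assumptions, not how `CountVectorizer` is implemented: the helpers `tokenize`, `build_vocab`, and `vectorize` and the `max_words` parameter are hypothetical names invented for this example. In scikit-learn, the `max_features` parameter of `CountVectorizer` caps the vocabulary in the same way.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer for illustration: lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()


def build_vocab(docs, max_words=None):
    """Map the `max_words` most frequent words (all words if None) to column indices."""
    freq = Counter(tok for doc in docs for tok in tokenize(doc))
    kept = sorted(word for word, _ in freq.most_common(max_words))
    return {word: i for i, word in enumerate(kept)}


def vectorize(text, vocab):
    """Count vector of `text`; words missing from `vocab` are silently dropped."""
    vec = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


# 1) Related words get no credit: all three pairs are equally far apart.
verb_docs = ["I run", "I ran", "I ate"]
verb_vocab = build_vocab(verb_docs)
run, ran, ate = (vectorize(doc, verb_vocab) for doc in verb_docs)
print(round(math.dist(run, ran), 3), round(math.dist(run, ate), 3), round(math.dist(ran, ate), 3))

# 2) Out-of-vocabulary words ("and", "are", "friends") leave no trace.
corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
vocab = build_vocab(corpus)
oov_vec = vectorize("dog and dog are friends", vocab)
print(oov_vec)  # [0, 2, 0, 0, 0, 0]

# 3) Word order is lost: D1 and D2 map to the same vector.
print(vectorize(corpus[0], vocab) == vectorize(corpus[1], vocab))  # True

# 4) Capping the vocabulary at the 2 most frequent words shrinks every vector.
small_vocab = build_vocab(corpus, max_words=2)
print(small_vocab)  # {'dog': 0, 'man': 1}
```

Capping the vocabulary trades information for size: with only `dog` and `man` kept, D4 shrinks to a two-dimensional vector and the words "eats" and "food" are no longer represented.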

Despite these shortcomings, BoW remains a commonly used text representation scheme thanks to its simplicity and ease of implementation, especially for text classification among other NLP problems.