# Bag of Words (BoW)

### Steps involved in BoW:

#### 1. Construction of a d-dimensional dictionary 

- Here, we create an array of all the unique words in the document corpus.
- Let there be 'd' unique words.
- Every unique word is a dimension.

> **Note 1:** A text, which could be a word or a sentence, is known as a document in NLP.

> **Note 2:** A collection of such documents is known as a document corpus.

#### 1.1 Example:

Let there be two documents in the document corpus as given below:
1. This car drives good and is expensive.
2. This car is not expensive and drives good.

We create a dictionary (or an array) of all the unique words in the document corpus. There are 8 unique words, so d = 8:

`[This, car, drives, good, and, is, expensive, not]`
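The dictionary construction above can be sketched in plain Python. This is a minimal illustration, not a library routine: the helper name `build_vocabulary` is hypothetical, and it assumes that words are separated by spaces and that the only punctuation is a trailing period.

```python
def build_vocabulary(corpus):
    """Return the unique words of the corpus in order of first appearance."""
    vocabulary = []
    for document in corpus:
        # Assumption: whitespace tokenization, trailing period stripped.
        for word in document.rstrip(".").split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


corpus = [
    "This car drives good and is expensive.",
    "This car is not expensive and drives good.",
]

vocabulary = build_vocabulary(corpus)
print(vocabulary)
print("d =", len(vocabulary))
```

Each position in `vocabulary` is one dimension of the vectors built in the next step.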

#### 2. Creating vector for each document

- For every document we create a d-dimensional vector.
- Every dimension of a vector corresponds to a unique word.
- The value of each dimension is the number of times the corresponding unique word occurs in the given document.

> **Note 3:** BoW generally produces sparse vectors, in which most dimensions are 0, because any single document uses only a small fraction of the words in the dictionary.

#### 2.1 Example:

Let vectors v<sub>1</sub> and v<sub>2</sub> correspond to document 1 and document 2 respectively. Then these vectors are represented as:

v<sub>1</sub> = `[1 1 1 1 1 1 1 0]`

v<sub>2</sub> = `[1 1 1 1 1 1 1 1]`
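As a rough sketch, the two vectors can be reproduced with `collections.Counter`. The helper name `to_count_vector` is hypothetical, and the tokenization (split on spaces, strip the trailing period) is an assumption made for this example only.

```python
from collections import Counter

# Dictionary from the example above; each position is one dimension.
vocabulary = ["This", "car", "drives", "good", "and", "is", "expensive", "not"]


def to_count_vector(document, vocabulary):
    """Count how often each dictionary word occurs in the document."""
    counts = Counter(document.rstrip(".").split())
    # Counter returns 0 for words that do not occur in the document.
    return [counts[word] for word in vocabulary]


v1 = to_count_vector("This car drives good and is expensive.", vocabulary)
v2 = to_count_vector("This car is not expensive and drives good.", vocabulary)

print("v1 =", v1)
print("v2 =", v2)
```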

#### 3. Calculating the distance between vectors

- We compute the Euclidean distance between the vectors of two documents.
- A smaller Euclidean distance between two vectors indicates greater similarity between the corresponding documents.
- Conversely, a larger Euclidean distance indicates less similarity between the corresponding documents.

#### 3.1 Example:

We calculate the Euclidean distance between v<sub>1</sub> and v<sub>2</sub> as:

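|v<sub>1</sub> - v<sub>2</sub>| = √((1-1)<sup>2</sup> + (1-1)<sup>2</sup> + (1-1)<sup>2</sup> + (1-1)<sup>2</sup> + (1-1)<sup>2</sup> + (1-1)<sup>2</sup> + (1-1)<sup>2</sup> + (0-1)<sup>2</sup>) = √1 = 1

The vectors differ only in the last dimension, which corresponds to the word "not".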
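The same calculation can be checked in a few lines of Python. This sketch spells out the formula (the square root of the sum of squared differences) and compares it with the standard library's `math.dist`, which is available from Python 3.8.

```python
import math

v1 = [1, 1, 1, 1, 1, 1, 1, 0]
v2 = [1, 1, 1, 1, 1, 1, 1, 1]

# Euclidean distance written out: sqrt of the sum of squared differences.
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
print(distance)

# The standard library computes the same quantity directly.
print(math.dist(v1, v2))
```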

### Binary BoW: a variation of BoW

- Here, the value of a dimension of a vector is either 0 or 1.
- The value is 1 if the unique word corresponding to the dimension occurs at least once in the document, and 0 otherwise.
- Binary BoW is also known as boolean BoW.
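The difference between count BoW and binary BoW only shows up when a word repeats. Below is a minimal sketch: the document string is a made-up example in which "is" occurs twice, and the helper names `to_count_vector` and `to_binary_vector` are hypothetical.

```python
vocabulary = ["This", "car", "drives", "good", "and", "is", "expensive", "not"]

# Hypothetical document in which "is" occurs twice.
document = "This car is good and is not expensive"
words = document.split()


def to_count_vector(words, vocabulary):
    """Plain BoW: number of occurrences of each dictionary word."""
    return [words.count(word) for word in vocabulary]


def to_binary_vector(words, vocabulary):
    """Binary BoW: 1 if the word occurs at least once, 0 otherwise."""
    present = set(words)
    return [1 if word in present else 0 for word in vocabulary]


count_vector = to_count_vector(words, vocabulary)
binary_vector = to_binary_vector(words, vocabulary)

print("count :", count_vector)
print("binary:", binary_vector)
```

In scikit-learn, passing `binary=True` to `CountVectorizer` produces the binary variant.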

# Implementation of BoW through Sklearn

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer()

corpus = ['This car drives good and is expensive',
          'This car is better than the other car and is less expensive and drives good']

cv.fit(corpus)  # fit() builds the dictionary of all the unique words in the corpus (lowercased by default).
# vocabulary_ maps each word to its column index in the count matrix.
print('Dictionary of all the unique words in the corpus:', cv.vocabulary_)


Dictionary of all the unique words in the corpus: {'this': 11, 'car': 2, 'drives': 3, 'good': 5, 'and': 0, 'is': 6, 'expensive': 4, 'better': 1, 'than': 9, 'the': 10, 'other': 8, 'less': 7}


In [None]:
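# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(cv.get_feature_names_out().tolist())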

['and', 'better', 'car', 'drives', 'expensive', 'good', 'is', 'less', 'other', 'than', 'the', 'this']


In [None]:
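X = cv.transform(corpus)  # transform() returns a sparse document-term matrix of word counts.
print(X.toarray())  # toarray() converts it to a dense array for printing.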

[[1 0 1 1 1 1 1 0 0 0 0 1]
 [2 1 2 1 1 1 2 1 1 1 1 1]]


In [None]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
df = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
df

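|   | and | better | car | drives | expensive | good | is | less | other | than | the | this |
|---|-----|--------|-----|--------|-----------|------|----|------|-------|------|-----|------|
| 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 |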


The output above shows how many times each unique word occurs in each document: one row per document, one column per word.

## Conclusion:

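- With Euclidean distance, BoW effectively measures how many words differ between two documents. For binary BoW, the squared distance equals the number of words that appear in one document but not in the other.
- BoW doesn't work well when a small change in wording changes the meaning. In the example above, the two documents differ only by the word "not", which reverses what is said about the price, yet their distance is just 1.
- BoW also ignores the semantic meaning of words. For example, the words 'tasty' and 'delicious' are synonyms, yet BoW treats them as two unrelated dimensions.
- BoW vectors include many stop words (such as 'is', 'and' and 'this'), which carry little meaning.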
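The synonym limitation can be illustrated with a small sketch. The three sentences are made up for this illustration, the helper name `bow_vectors` is hypothetical, and tokenization is assumed to be lowercasing plus splitting on spaces.

```python
import math
from collections import Counter


def bow_vectors(corpus):
    """Build a sorted dictionary and one count vector per document."""
    tokenized = [document.lower().split() for document in corpus]
    vocabulary = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Made-up corpus: the first two sentences are synonymous, the third is opposite.
corpus = [
    "the pasta is tasty",
    "the pasta is delicious",
    "the pasta is awful",
]

vocabulary, (tasty, delicious, awful) = bow_vectors(corpus)
print(vocabulary)

# BoW places the synonym exactly as far away as the antonym.
print(math.dist(tasty, delicious))
print(math.dist(tasty, awful))
```

Both distances come out equal, because 'tasty', 'delicious' and 'awful' are simply three different dimensions. The stop words 'the' and 'is' take up two of the six dimensions even though they do nothing to distinguish the documents.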