# Bag of Words

The Bag of Words model transforms text into numerical data. It represents each document as a vector of word counts over a shared vocabulary, ignoring grammar, word order, and semantics.

Example:
Consider two review sentences:

- Review 1: “The food was amazing, and the service was great.”
- Review 2: “The food was bad, but the ambiance was excellent.”
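
Before turning to a library, the counting itself can be sketched by hand. The snippet below is only an illustration under stated assumptions: `tokenize` is a hypothetical helper, and its rule (lowercase the text, keep runs of two or more word characters) is meant to approximate CountVectorizer's default behaviour, not reproduce its implementation.

```python
import re
from collections import Counter

reviews = [
    "The food was amazing, and the service was great.",
    "The food was bad, but the ambiance was excellent.",
]

def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of two or more
    # word characters, which drops punctuation and one-letter tokens.
    return re.findall(r"\b\w\w+\b", text.lower())

token_lists = [tokenize(review) for review in reviews]

# The vocabulary is every distinct token across all documents, sorted.
vocabulary = sorted({tok for toks in token_lists for tok in toks})

# Each document becomes a list of counts, one entry per vocabulary word.
vectors = []
for toks in token_lists:
    counts = Counter(toks)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each review ends up as a vector with one entry per vocabulary word, so both vectors have the same length even though the sentences differ. Words that a review does not contain get a count of 0.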

Here we implement Bag of Words using the CountVectorizer class from scikit-learn (sklearn).

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
reviews = [
    "The food was amazing, and the service was great.",
    "The food was bad, but the ambiance was excellent."
]

In [6]:
reviews

['The food was amazing, and the service was great.',
 'The food was bad, but the ambiance was excellent.']

In [7]:
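# Create the Bag of Words model.
# fit_transform learns the vocabulary from the reviews and returns
# the document-term count matrix (one row per review).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)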

In [8]:
X

<2x11 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [11]:
# Vectorized representation as a dense array:
# one row per review, one column per vocabulary word
X.toarray()

array([[1, 0, 1, 0, 0, 0, 1, 1, 1, 2, 2],
       [0, 1, 0, 1, 1, 1, 1, 0, 0, 2, 2]], dtype=int64)

In [12]:
# Vocabulary: the word that each column of the matrix counts
vectorizer.get_feature_names_out()

array(['amazing', 'ambiance', 'and', 'bad', 'but', 'excellent', 'food',
       'great', 'service', 'the', 'was'], dtype=object)