## Applying the bag-of-words model

Python Machine Learning 2nd Edition by Sebastian Raschka, Packt Publishing Ltd. 2017

Code Repository: https://github.com/rasbt/python-machine-learning-book-2nd-edition


In [1]:
# Import Python libraries

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Create the vectorizer and the six example sentences
# (note: bow_model holds the raw sentences, not a fitted model)

vectorizer = CountVectorizer()
bow_model = np.array([
    'AWS Summit Sydney 2024 at International Convention & Exhibition Centre, Sydney',
    'What magic will you build?',
    'There was Builders Day on Wednesday 10 April',
    'There was Innovation Day on Thursday 11 April',
    'I am excited for AWS Summit London on 24 April',
    'Will you be attending AWS Summit New York on 10 July?'])

In [3]:
# Build the vocabulary of the bag-of-words model and transform the six sentences into sparse feature vectors
bag = vectorizer.fit_transform(bow_model)

In [4]:
# Print the vocabulary, a Python dictionary that maps each word to its column index
print(vectorizer.vocabulary_)

{'aws': 8, 'summit': 25, 'sydney': 26, '2024': 2, 'at': 6, 'international': 19, 'convention': 13, 'exhibition': 16, 'centre': 12, 'what': 31, 'magic': 22, 'will': 32, 'you': 34, 'build': 10, 'there': 27, 'was': 29, 'builders': 11, 'day': 14, 'on': 24, 'wednesday': 30, '10': 0, 'april': 5, 'innovation': 18, 'thursday': 28, '11': 1, 'am': 4, 'excited': 15, 'for': 17, 'london': 21, '24': 3, 'be': 9, 'attending': 7, 'new': 23, 'york': 33, 'july': 20}


In [5]:
# Print the feature vectors as a dense array (one row per sentence, one column per vocabulary word)
print(bag.toarray())

[[0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1]
 [1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0]
 [0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0]
 [0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 1 1]]
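
To make the link between the vocabulary and the rows above concrete, here is a minimal pure-Python sketch of the counting step. It is not scikit-learn's implementation; it assumes the `CountVectorizer` defaults (lowercasing, and tokens of two or more word characters, which is why "I" and "&" do not appear in the vocabulary). The names `tokenize` and `to_vector` are hypothetical helpers introduced only for this illustration.

```python
import re
from collections import Counter

# Assumption: mimic CountVectorizer's default tokenization
# (lowercase the text, keep runs of 2+ word characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Hypothetical helper: split a sentence into lowercase tokens."""
    return TOKEN_PATTERN.findall(text.lower())

docs = [
    'AWS Summit Sydney 2024 at International Convention & Exhibition Centre, Sydney',
    'What magic will you build?',
    'There was Builders Day on Wednesday 10 April',
    'There was Innovation Day on Thursday 11 April',
    'I am excited for AWS Summit London on 24 April',
    'Will you be attending AWS Summit New York on 10 July?',
]

# Vocabulary: every distinct token, sorted, mapped to a column index
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

def to_vector(text):
    """Hypothetical helper: count how often each vocabulary word occurs."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in all_tokens]

vectors = [to_vector(doc) for doc in docs]

print(vocabulary['aws'], vocabulary['sydney'])   # column indices of two words
print(vectors[0][vocabulary['sydney']])          # count of 'sydney' in sentence 1
```

Reading the first row this way, the value 2 in column 26 reflects that "sydney" occurs twice in the first sentence, while every other word in that sentence occurs once.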


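The sequence of items in the bag-of-words model above is called the **1-gram** or **unigram** model: each item (token) in the vocabulary is a single word. More generally, an **n-gram** is a contiguous sequence of n items, so a bigram is a pair of adjacent words.
For example, the first sentence begins with these n-grams: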

* unigram: "aws", "summit", "sydney"
* bigram: "aws summit", "summit sydney", "sydney 2024"
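
The unigram and bigram examples above can be reproduced with a short sketch. The `ngrams` function below is a hypothetical helper written for illustration, assuming the same tokenization as the `CountVectorizer` defaults (lowercasing, tokens of two or more word characters). In scikit-learn itself, the `ngram_range` parameter of `CountVectorizer` controls this, for example `CountVectorizer(ngram_range=(2, 2))` for bigrams only.

```python
import re

def ngrams(text, n):
    """Hypothetical helper: return the contiguous n-token sequences in text."""
    # Assumption: tokenize like CountVectorizer's defaults
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    # Slide a window of n tokens across the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc = 'AWS Summit Sydney 2024 at International Convention & Exhibition Centre, Sydney'

unigrams = ngrams(doc, 1)
bigrams = ngrams(doc, 2)

print(unigrams[:3])  # ['aws', 'summit', 'sydney']
print(bigrams[:3])   # ['aws summit', 'summit sydney', 'sydney 2024']
```

Because punctuation is dropped before the window slides, "convention & exhibition" contributes the bigram "convention exhibition" in this sketch.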