In [None]:
'''
What is a Bag-of-Words?

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, 
such as with machine learning algorithms.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
1. A vocabulary of known words.
2. A measure of the presence of known words.

It is called a “bag” of words because any information about the order or structure of words in the document is discarded.
The model is only concerned with whether known words occur in the document, not where in the document they occur.

'''
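
The two ingredients described above, a vocabulary and a presence measure, can be sketched in a few lines of plain Python. This is only an illustration, not the notebook's own code: the helper `presence_vector` and the toy vocabulary are hypothetical names chosen for the example. It shows that two documents containing the same words in a different order get the same vector.

```python
# Hypothetical helper for illustration; not part of the original notebook.
def presence_vector(document, vocabulary):
    """Return 1 for each vocabulary word that occurs in the document, else 0."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Toy vocabulary (an assumption made for this example).
vocabulary = ["dog", "bites", "man", "cat"]

a = presence_vector("dog bites man", vocabulary)
b = presence_vector("man bites dog", vocabulary)

print(a)       # [1, 1, 1, 0]
print(b)       # [1, 1, 1, 0]
print(a == b)  # True: word order is discarded
```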

In [None]:
'''
Example of the Bag-of-Words Model

Step 1: Collect Data

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,

For this small example, each line is treated as a separate document, and the four lines together form the corpus.

Step 2: Design the Vocabulary

Now we can make a list of all of the words in our model vocabulary.
The unique words here (ignoring case and punctuation) are:

“it”
“was”
“the”
“best”
“of”
“times”
“worst”
“age”
“wisdom”
“foolishness”

That is a vocabulary of 10 words from a corpus containing 24 words.

Step 3: Create Document Vectors

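Using the fixed word order of the vocabulary above, mark each word with 1 if it is present in the document and 0 if it is absent.
The scoring of the first document ("It was the best of times") would look as follows: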

“it” = 1
“was” = 1
“the” = 1
“best” = 1
“of” = 1
“times” = 1
“worst” = 0
“age” = 0
“wisdom” = 0
“foolishness” = 0

As a binary vector, this would look as follows:

[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

The other three documents would look as follows:

"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
'''
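
The three steps above can be reproduced with the standard library alone. The sketch below assumes each line is one document and uses a simple regular-expression tokenizer; `tokenize` is a hypothetical helper written for this example, not an NLTK function. The vocabulary is built in order of first appearance so that the vector positions match the word list shown above.

```python
import re

# Step 1: the corpus, one document per line.
corpus = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

def tokenize(line):
    """Lowercase the line and keep only runs of letters (drops punctuation)."""
    return re.findall(r"[a-z]+", line.lower())

# Step 2: collect unique words in order of first appearance.
vocabulary = []
total_words = 0
for line in corpus:
    for word in tokenize(line):
        total_words += 1
        if word not in vocabulary:
            vocabulary.append(word)

# Step 3: one binary vector per document.
vectors = []
for line in corpus:
    present = set(tokenize(line))
    vectors.append([1 if word in present else 0 for word in vocabulary])

print(len(vocabulary), total_words)  # 10 24
for line, vector in zip(corpus, vectors):
    print(vector, line)
```

Running it should give the same four vectors listed in Step 3.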

In [9]:
# Preprocessing
import nltk 
import re 
import numpy as np 
  
# Sample text to turn into a bag-of-words model:
text = """Beans. I was trying to explain to somebody as we were flying in, that’s corn. 
That’s beans. And they were very impressed at my agricultural knowledge. 
Please give it up for Amaury once again for that outstanding introduction. 
I have a bunch of good friends here today, including somebody who I served with, 
who is one of the finest senators in the country, and we’re lucky to have him, your Senator, 
Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven’t seen in a long time, 
and somehow he has not aged and I have. And it’s great to see you, Governor. I want to thank President Killeen 
and everybody at the U of I System for making it possible for me to be here today. 
And I am deeply honored at the Paul Douglas Award that is being given to me. 
He is somebody who set the path for so much outstanding public service here in Illinois.
Now, I want to start by addressing the elephant in the room. I know people are still 
wondering why I didn’t speak at the commencement."""

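# Split the text into sentences; each sentence is treated as one document.
# Note: sent_tokenize requires NLTK's Punkt tokenizer data to be downloaded beforehand.
dataset = nltk.sent_tokenize(text)
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()               # ignore case
    # Replace every non-word character with a space. This also splits
    # contractions at the apostrophe, so "that’s" becomes "that s".
    dataset[i] = re.sub(r'\W', ' ', dataset[i])
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])  # collapse runs of whitespace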
                        
# Creating the Bag of Words model
# Count how often each word occurs across all sentences.
# The keys of word2count, in insertion order, form the vocabulary.
word2count = {}
for data in dataset:
    words = nltk.word_tokenize(data)
    for word in words:
        word2count[word] = word2count.get(word, 0) + 1
print(word2count)

                        
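# Build one binary vector per sentence:
# 1 if the vocabulary word occurs in the sentence, 0 otherwise.
X = []
for data in dataset:
    tokens = set(nltk.word_tokenize(data))  # tokenize once per sentence
    vector = [1 if word in tokens else 0 for word in word2count]
    X.append(vector)
X = np.asarray(X)
print(X)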

{'beans': 2, 'i': 12, 'was': 1, 'trying': 1, 'to': 8, 'explain': 1, 'somebody': 3, 'as': 1, 'we': 2, 'were': 2, 'flying': 1, 'in': 5, 'that': 4, 's': 3, 'corn': 1, 'and': 7, 'they': 1, 'very': 1, 'impressed': 1, 'at': 4, 'my': 1, 'agricultural': 1, 'knowledge': 1, 'please': 1, 'give': 1, 'it': 3, 'up': 1, 'for': 5, 'amaury': 1, 'once': 1, 'again': 1, 'outstanding': 2, 'introduction': 1, 'have': 3, 'a': 2, 'bunch': 1, 'of': 3, 'good': 1, 'friends': 1, 'here': 5, 'today': 2, 'including': 1, 'who': 4, 'served': 1, 'with': 1, 'is': 4, 'one': 1, 'the': 9, 'finest': 1, 'senators': 1, 'country': 1, 're': 1, 'lucky': 1, 'him': 1, 'your': 1, 'senator': 1, 'dick': 1, 'durbin': 1, 'also': 1, 'noticed': 1, 'by': 2, 'way': 1, 'former': 1, 'governor': 2, 'edgar': 1, 'haven': 1, 't': 2, 'seen': 1, 'long': 1, 'time': 1, 'somehow': 1, 'he': 2, 'has': 1, 'not': 1, 'aged': 1, 'great': 1, 'see': 1, 'you': 1, 'want': 2, 'thank': 1, 'president': 1, 'killeen': 1, 'everybody': 1, 'u': 1, 'system': 1, 'making'