### Preliminaries


In [1]:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

### Create Text Data

In [2]:
# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])
'''
#task
'It was the best of times',
'it was the worst of times',
'it was the age of wisdom',
'it was the age of foolishness'

'''

### Create Bag Of Words

In [3]:
# Create the bag of words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

# Show feature matrix
bag_of_words.toarray()

array([[0, 0, 0, 2, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 0]], dtype=int64)

### View Bag Of Words Matrix Column Headers

In [4]:
# Get feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = count.get_feature_names_out().tolist()

# View feature names
feature_names

['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love', 'sweden']

### View As A Data Frame

In [5]:
# Create data frame
pd.DataFrame(bag_of_words.toarray(), columns=feature_names)

   beats  best  both  brazil  germany  is  love  sweden
0      0     0     0       2        0   0     1       0
1      0     1     0       0        0   1     0       1
2      1     0     1       0        1   0     0       0


### Code to generate bag of word vectors in Python

It is always good to understand how libraries and frameworks work, and the methods behind them. The better you understand the concepts, the better use you can make of those tools.

<b>The input to our code will be multiple sentences and the output will be the vectors.</b>

<b>Input text:</b>

In [6]:
["Joe waited for the train", "The train was late", "Mary and Samantha took the bus",
"I looked for Mary and Samantha at the bus station",
"Mary and Samantha arrived at the bus station early but waited until noon for the bus"]

['Joe waited for the train',
 'The train was late',
 'Mary and Samantha took the bus',
 'I looked for Mary and Samantha at the bus station',
 'Mary and Samantha arrived at the bus station early but waited until noon for the bus']

In [7]:
# Import statements
import numpy
import re

# Tokenize each sentence, for example:
# Input : "John likes to watch movies. Mary likes movies too"
# Output: "John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too"

<b>Step 1: Tokenize a sentence</b><br>
We start by splitting a sentence into lowercase words and removing stopwords.

In [8]:
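def word_extraction(sentence):
    # Stopwords to drop; the check below is case-sensitive and runs before
    # lowercasing, so a capitalized "The" is kept
    ignore = ['a', "the", "is"]
    # Replace every non-word character with a space, then split on whitespace
    words = re.sub(r"[^\w]", " ", sentence).split()
    cleaned_text = [w.lower() for w in words if w not in ignore]
    return cleaned_text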
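
The stopword check in `word_extraction` compares each word before it is lowercased, so "The" at the start of a sentence is not removed. The following is a minimal sketch of a case-insensitive variant. The names `STOPWORDS` and `word_extraction_ci` are hypothetical and not part of the original notebook, and switching to this variant would remove 'the' from the vocabulary printed further down.

```python
import re

# Same stopwords as in word_extraction (assumed list, kept as a set for fast lookup)
STOPWORDS = {"a", "the", "is"}

def word_extraction_ci(sentence):
    """Hypothetical variant: lowercase each word before the stopword check."""
    words = re.sub(r"[^\w]", " ", sentence).split()
    return [w.lower() for w in words if w.lower() not in STOPWORDS]

print(word_extraction_ci("The train was late"))
# ['train', 'was', 'late']
```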
    

<b>Step 2: Apply tokenization to all sentences</b><br>
The function iterates over all the sentences, collects the extracted words in a list, and returns the unique words in sorted order.


In [9]:
def tokenize(sentences):
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)
        
    # Deduplicate and sort so the vocabulary order is stable
    words = sorted(set(words))
    return words

In [32]:
'''
sentences = ["Joe waited for the train", "The train was late", "Mary and Samantha took the bus",
"I looked for Mary and Samantha at the bus station",
"Mary and Samantha arrived at the bus station early but waited until noon for the bus"]

print("The output of this method will be:")
tokenize(sentences)

'''


<b>Step 3: Build vocabulary and generate vectors</b><br>
Use the methods defined in steps 1 and 2 to create the document vocabulary and extract the words from the sentences.

In [10]:
def generate_bow(allsentences):
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))

    # Map each vocabulary word to its column index for direct lookup
    index = {word: i for i, word in enumerate(vocab)}

    for sentence in allsentences:
        words = word_extraction(sentence)
        # One count per vocabulary word, initialized to zero
        bag_vector = numpy.zeros(len(vocab))
        for w in words:
            bag_vector[index[w]] += 1

        print("{0} \n{1}\n".format(sentence, bag_vector))
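
`generate_bow` only prints its vectors. If the counts are needed for further processing, a variant that returns them may be more convenient. The sketch below uses only the standard library; `extract` and `bow_vectors` are hypothetical names, and `extract` is assumed to follow the same rule as `word_extraction` (filter first, then lowercase).

```python
import re
from collections import Counter

def extract(sentence, ignore=("a", "the", "is")):
    # Assumed to mirror word_extraction: case-sensitive filter, then lowercase
    words = re.sub(r"[^\w]", " ", sentence).split()
    return [w.lower() for w in words if w not in ignore]

def bow_vectors(sentences):
    """Hypothetical helper: return (vocab, one count vector per sentence)."""
    tokenized = [extract(s) for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)  # missing words count as 0
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bow_vectors(["Joe waited for the train", "The train was late"])
print(vocab)    # ['for', 'joe', 'late', 'the', 'train', 'waited', 'was']
print(vectors)  # [[1, 1, 0, 0, 1, 1, 0], [0, 0, 1, 1, 1, 0, 1]]
```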

Here is the input:

In [11]:
allsentences = ["Joe waited for the train", "The train was late", "Mary and Samantha took the bus", 
            "I looked for Mary and Samantha at the bus station", 
            "Mary and Samantha arrived at the bus station early but waited until noon for the bus"]

The output vectors for each of the sentences are:



In [12]:
generate_bow(allsentences)

Word List for Document 
['and', 'arrived', 'at', 'bus', 'but', 'early', 'for', 'i', 'joe', 'late', 'looked', 'mary', 'noon', 'samantha', 'station', 'the', 'took', 'train', 'until', 'waited', 'was'] 

Joe waited for the train 
[0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0.]

The train was late 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1.]

Mary and Samantha took the bus 
[1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0.]

I looked for Mary and Samantha at the bus station 
[1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0.]

Mary and Samantha arrived at the bus station early but waited until noon for the bus 
[1. 1. 1. 2. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0.]



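Our previous code can be replaced with scikit-learn's CountVectorizer. The result differs slightly from ours: by default CountVectorizer lowercases the text, ignores single-character tokens such as "I", and removes no stopwords, so the matrix has 20 columns instead of 21 and every "the" is counted: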



In [13]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(allsentences)

print(X.toarray())

[[0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 1 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1]
 [1 0 0 1 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0]
 [1 0 1 1 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0]
 [1 1 1 2 1 1 1 0 0 0 1 1 1 1 2 0 0 1 1 0]]
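
To see where the difference from the hand-written version comes from, here is a rough standard-library approximation of CountVectorizer's default behavior: lowercase the text, then keep tokens of two or more word characters. It is a sketch under those assumptions, not scikit-learn's implementation, and `count_vectorize` is a hypothetical name.

```python
import re
from collections import Counter

# Tokens of two or more word characters, matching CountVectorizer's default pattern
TOKEN_RE = re.compile(r"\b\w\w+\b")

def count_vectorize(sentences):
    """Hypothetical approximation of CountVectorizer().fit_transform(...)."""
    tokenized = [TOKEN_RE.findall(s.lower()) for s in sentences]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocab])
    return vocab, rows

sentences = ["Joe waited for the train", "The train was late",
             "Mary and Samantha took the bus",
             "I looked for Mary and Samantha at the bus station",
             "Mary and Samantha arrived at the bus station early but waited until noon for the bus"]

vocab, rows = count_vectorize(sentences)
print(len(vocab))    # 20
print("i" in vocab)  # False: single-character tokens are dropped
print(rows[0])
```

The single-character "I" is dropped and "the" is always counted, which accounts for the 20 columns in the matrix above.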
