# Example of Bag of Words Vectorization

#### Bag of Words
* A way to convert unstructured text into structured numerical data by mapping every document into a common vector space.

#### A bit about what Bag of Words is:
* Vectorization is the general process of turning a collection of text documents into numerical feature vectors.
* Each document is treated as a literal bag of words: the model ignores word order and sentence structure.
* It looks only at word occurrences and discards relative position, grammar, and punctuation.
* Usually you convert everything to lowercase and remove punctuation, markup, stop words, etc.
* Corpus = the collection of text documents.
* Each unique word is treated as a feature.
* It is a 3-step process:
    * Tokenization: split each document into individual tokens (words).
    * Vocabulary building: collect every unique word in the corpus and assign each one an index.
    * Encoding: transform the sequence of documents into a document-term matrix.
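The three steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it: the toy corpus and the helper names `tokenize` and `encode` are hypothetical, and the regular expression is chosen to resemble `CountVectorizer`'s default token pattern (lowercased runs of two or more word characters).

```python
import re
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
corpus = ["The cat sat.", "The cat saw the dog."]

def tokenize(text):
    """Step 1: lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]

# Step 2: vocabulary building. Unique terms, sorted, each mapped to a column index.
unique_terms = sorted({term for tokens in tokenized for term in tokens})
vocabulary = {term: index for index, term in enumerate(unique_terms)}

def encode(tokens, vocabulary):
    """Step 3: count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    row = [0] * len(vocabulary)
    for term, index in vocabulary.items():
        row[index] = counts[term]
    return row

# One row per document, one column per vocabulary term.
matrix = [encode(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)  # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Sorting the unique terms before assigning indices is a design choice that makes the column order reproducible from run to run.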


Using the BoW vocabulary given below, create a feature vector (vectorized representation) for each of the following text documents:
* John likes to watch movies. Mary likes movies too.
* John also likes to watch football games.



{'john': 3, 'likes': 4, 'to': 7, 'watch': 9, 'movies': 6, 'mary': 5, 'too': 8, 'also': 0, 'football': 1, 'games': 2}
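Before reaching for a library, the exercise can be worked by hand. The sketch below is one possible way to do it, assuming a simple regex tokenizer; the names `given_vocabulary`, `documents`, and `vectorize` are hypothetical helpers introduced for illustration. Each vocabulary entry maps a word to its position in the feature vector.

```python
import re

# The vocabulary given above: word -> position in the feature vector.
given_vocabulary = {'john': 3, 'likes': 4, 'to': 7, 'watch': 9, 'movies': 6,
                    'mary': 5, 'too': 8, 'also': 0, 'football': 1, 'games': 2}

documents = [
    'John likes to watch movies. Mary likes movies too.',
    'John also likes to watch football games.',
]

def vectorize(text, vocabulary):
    """Count occurrences of each vocabulary word in one document."""
    vector = [0] * len(vocabulary)
    # Lowercase, then keep runs of two or more word characters (drops punctuation).
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        if token in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[token]] += 1
    return vector

vectors = [vectorize(doc, given_vocabulary) for doc in documents]
for vector in vectors:
    print(vector)
# [0, 0, 0, 1, 2, 1, 2, 1, 1, 1]
# [1, 1, 1, 1, 1, 0, 0, 1, 0, 1]
```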

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
```

```python
# Learn the vocabulary from the documents and count term occurrences.
count = CountVectorizer()
docs = np.array([
    'John likes to watch movies. Mary likes movies too.',
    'John also likes to watch football games.'])
bag = count.fit_transform(docs)  # sparse document-term matrix
print(count.vocabulary_)         # term -> column index
print(bag.toarray())             # dense array of counts
```

```
{'john': 3, 'likes': 4, 'to': 7, 'watch': 9, 'movies': 6, 'mary': 5, 'too': 8, 'also': 0, 'football': 1, 'games': 2}
[[0 0 0 1 2 1 2 1 1 1]
 [1 1 1 1 1 0 0 1 0 1]]
```
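To read the matrix, it can help to attach the vocabulary terms to the columns. The following is a minimal sketch that assumes the vocabulary and counts printed above have been copied in by hand as plain Python objects; the names `terms` and `labelled` are hypothetical. Recent scikit-learn versions also provide `get_feature_names_out()` on the vectorizer for the same purpose, which this sketch does not rely on.

```python
# Values copied from the printed output above.
vocabulary = {'john': 3, 'likes': 4, 'to': 7, 'watch': 9, 'movies': 6,
              'mary': 5, 'too': 8, 'also': 0, 'football': 1, 'games': 2}
bag = [
    [0, 0, 0, 1, 2, 1, 2, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 0, 1, 0, 1],
]

# Order the terms by their column index so that terms[i] labels column i.
terms = sorted(vocabulary, key=vocabulary.get)

# For each document, keep only the terms that actually occur in it.
labelled = [
    {term: count for term, count in zip(terms, row) if count > 0}
    for row in bag
]

for doc_counts in labelled:
    print(doc_counts)
# {'john': 1, 'likes': 2, 'mary': 1, 'movies': 2, 'to': 1, 'too': 1, 'watch': 1}
# {'also': 1, 'football': 1, 'games': 1, 'john': 1, 'likes': 1, 'to': 1, 'watch': 1}
```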


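Each row of the matrix is one document and each column is one vocabulary term, ordered by the indices in `vocabulary_`. Each entry is the number of times that term occurs in that document: for example, the first row has a 2 at index 4 because "likes" appears twice in the first document. Because both documents are now vectors in the same 10-dimensional space, you can compare them with each other in a structured way, for instance with a similarity measure.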
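As an illustration of such a comparison, the sketch below computes cosine similarity between the two count vectors. Cosine similarity is an assumption here, chosen as one common measure; the text above does not prescribe a particular one, and the function name `cosine_similarity` is a hypothetical helper written in plain Python.

```python
import math

# Count vectors for the two documents, copied from the output above.
doc1 = [0, 0, 0, 1, 2, 1, 2, 1, 1, 1]
doc2 = [1, 1, 1, 1, 1, 0, 0, 1, 0, 1]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 3))  # 0.524
```

The two documents share the terms "john", "likes", "to", and "watch", which is what drives the score above zero; a score of 1.0 would mean the count vectors point in exactly the same direction.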