# Bag of Words Representation

The bag-of-words representation is a way of transforming text into numerical vectors. This transformation is needed because raw text cannot be fed directly into most algorithms, which expect numerical feature vectors of a fixed size rather than raw text documents of variable length.

The most common way to extract numerical feature vectors from text is:

- Tokenization: splitting each sentence into individual words (tokens)
- Counting: counting the occurrences of each token in each document
- Normalizing: normalizing and down-weighting tokens that occur in the majority of samples/documents
    
The corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
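
The tokenization and counting steps can be sketched in plain Python before reaching for a library. This is only an illustration: the helper names `tokenize` and `bag_of_words` are hypothetical, and the regular expression is an approximation of a typical word tokenizer (lowercased tokens of at least two word characters), not a definitive implementation.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and keep runs of two or more
    # word characters, dropping punctuation and single letters.
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(corpus):
    tokenized = [tokenize(doc) for doc in corpus]
    # Vocabulary: every distinct token, sorted so the column order is stable.
    vocabulary = sorted(set(tok for doc in tokenized for tok in doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # One row per document, one column per vocabulary token.
        matrix.append([counts[tok] for tok in vocabulary])
    return vocabulary, matrix


corpus = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]
vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length regardless of how long the original sentence was, which is what gives the fixed-size feature vectors mentioned earlier.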

In [1]:
import sklearn
# CountVectorizer implements both tokenization and occurrence counting in a single class
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
vectorizer = CountVectorizer()
data_corpus = ["John likes to watch movies. Mary likes movies too.", 
"John also likes to watch football games."]
X = vectorizer.fit_transform(data_corpus) 
# The corpus has two documents (sentences) and 10 unique tokens in total,
# so the resulting sparse matrix is 2 by 10
X

<2x10 sparse matrix of type '<type 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [8]:
# To check the count of each token in each sentence
# The columns follow the vocabulary order, not the word order of the sentence
print("count of token in each document \n ", X.toarray())

count of token in each document 
  [[0 0 0 1 2 1 2 1 1 1]
 [1 1 1 1 1 0 0 1 0 1]]


In [9]:
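# To see the order of the tokens (the column order of the count matrix)
# Note: scikit-learn 1.0 and later provide get_feature_names_out() instead

print("features order \n ", vectorizer.get_feature_names())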

features order 
  [u'also', u'football', u'games', u'john', u'likes', u'mary', u'movies', u'to', u'too', u'watch']


# Verifying on unseen text

In [12]:
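# Only two words, "football" and "games", match the learned vocabulary
# (CountVectorizer lowercases the text by default, so "Football" matches "football")
# Again, word order is not preserved, which is a drawback of bag of words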
vectorizer.transform(['Football is one of the most famous games']).toarray()

array([[0, 1, 1, 0, 0, 0, 0, 0, 0, 0]])
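
The loss of word order can be made concrete with a tiny sketch. It uses only the standard library, and the helper `word_counts` is a hypothetical name introduced for illustration: two sentences with opposite meanings end up with identical counts.

```python
from collections import Counter


def word_counts(text):
    # Hypothetical helper: lowercase, split on whitespace, count each word.
    return Counter(text.lower().split())


a = word_counts("John likes Mary")
b = word_counts("Mary likes John")

# Both sentences contain the same words the same number of times,
# so their bag-of-words counts cannot be told apart.
print(a == b)
```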

In a large text corpus, it is sometimes useful to discard or down-weight very frequent terms such as stop words ("the", "a", "or", "an"), as they might overshadow the lower-frequency terms that are often the most informative. For this we can use another important concept, **term frequency-inverse document frequency (TF-IDF) term weighting**, which re-weights the count features into floating-point values suitable for use by a classifier.
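
As a rough sketch of what this re-weighting does, the snippet below applies TF-IDF to the 2 by 10 count matrix obtained earlier. The function name `tfidf` is hypothetical, and the smoothed idf formula, ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization is one common variant assumed here for illustration; scikit-learn offers `TfidfTransformer` and `TfidfVectorizer` for real use.

```python
import math


def tfidf(counts):
    """Re-weight a list of count rows (documents x terms) with TF-IDF."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Assumed variant: smoothed inverse document frequency.
    idf = [math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for d in df]
    weighted = []
    for row in counts:
        scores = [tf * w for tf, w in zip(row, idf)]
        # L2-normalize each row so document length does not dominate.
        norm = math.sqrt(sum(s * s for s in scores)) or 1.0
        weighted.append([s / norm for s in scores])
    return weighted


# Count matrix from the two example sentences (columns in vocabulary order).
counts = [
    [0, 0, 0, 1, 2, 1, 2, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 0, 1, 0, 1],
]
weights = tfidf(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

In the second document, "also" and "john" both occur once, but "also" appears in only one document while "john" appears in both, so "also" receives the larger weight.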