In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
corpus = [
"Authman ran faster than Harry because he is an athlete.",
"Authman and Harry ran faster and faster.",
]

In [3]:
bow = CountVectorizer()

In [4]:
X = bow.fit_transform(corpus)
X

<2x11 sparse matrix of type '<class 'numpy.int64'>'
	with 15 stored elements in Compressed Sparse Row format>

In [5]:
bow.get_feature_names()  # removed in scikit-learn 1.2; newer versions use get_feature_names_out()

['an',
 'and',
 'athlete',
 'authman',
 'because',
 'faster',
 'harry',
 'he',
 'is',
 'ran',
 'than']

In [6]:
X.toarray()

array([[1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 2, 0, 1, 0, 2, 1, 0, 0, 1, 0]], dtype=int64)

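> NOTE: this **is** already a frequency matrix: each row is a sentence, each column is a word from the vocabulary, and each cell counts how often that word occurs in that sentence.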

## Sparse matrix and Pandas Sparse DataFrame

A few new points of interest. X is not the regular [n_samples, n_features] dataframe you are used to. Rather, it is a SciPy compressed sparse row (CSR) matrix. SciPy is a collection of mathematical algorithms and convenience functions that extends NumPy. X is a sparse matrix rather than a classical dataframe because even this small example of two sentences produced 11 features. The average English speaker knows around 8000 unique words. If each sentence were stored as an 8000-element sample in your dataframe, consisting mostly of 0's, it would be a poor use of memory.
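
To make the memory argument concrete, here is a minimal sketch that compares how many cells the sparse matrix actually stores with how many a dense array of the same shape would need. The variable names (`n_cells`, `n_stored`, `density`) are illustrative only and do not come from the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Authman ran faster than Harry because he is an athlete.",
    "Authman and Harry ran faster and faster.",
]

X = CountVectorizer().fit_transform(corpus)

# A dense array would hold one cell per (sentence, word) pair.
n_cells = X.shape[0] * X.shape[1]

# The sparse matrix only stores the non-zero counts.
n_stored = X.nnz

# Fraction of cells that are non-zero.
density = n_stored / n_cells

print(X.shape, n_stored, n_cells)
print(f"density: {density:.2f}")
```

With only two short sentences the matrix is still fairly full. The savings show up once the vocabulary grows to thousands of words and each sentence uses only a handful of them.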

To circumvent this, SciPy implements its sparse matrices much like Python implements its dictionaries: only the keys that have a value are stored, and everything else is assumed to be empty. You can always convert X to a regular NumPy array with the .toarray() method, but the result is dense, which might not be desirable for memory reasons. To use your compressed sparse row matrix in Pandas, you will want to convert it to a **Pandas SparseDataFrame** (removed in pandas 1.0; newer versions provide pd.DataFrame.sparse.from_spmatrix() for the same purpose). More notes on that in the Dive Deeper section.
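
The two conversions can be sketched as follows. This assumes a pandas version that provides `pd.DataFrame.sparse.from_spmatrix` (the older `SparseDataFrame` class is not available in pandas 1.0 and later). The column labels are recovered from the fitted `vocabulary_` dictionary, which is one way to get them that does not depend on which `get_feature_names*` method your scikit-learn version offers.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Authman ran faster than Harry because he is an athlete.",
    "Authman and Harry ran faster and faster.",
]

bow = CountVectorizer()
X = bow.fit_transform(corpus)

# vocabulary_ maps each term to its column index; sorting the terms
# by that index gives the column labels in matrix order.
terms = sorted(bow.vocabulary_, key=bow.vocabulary_.get)

# Dense conversion: a NumPy ndarray with every zero materialized.
dense = X.toarray()

# Sparse conversion: a DataFrame whose columns use a sparse dtype.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=terms)

print(type(dense).__name__)
print(df)
```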

Bag of words has other configurable parameters, such as taking word order into account. In such implementations, pairs or longer tuples of successive words (n-grams) are used to build the vocabulary instead of individual words:

    >>> bow.get_feature_names()
    ['authman ran', 'ran faster', 'faster than', 'than harry', 'harry because', 'because he', 'he is', 'is an', 'an athlete', 'authman and', 'and harry', 'harry ran', 'faster and', 'and faster']
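
One way to get word pairs like these from scikit-learn is the `ngram_range` parameter of `CountVectorizer`. The sketch below assumes `ngram_range=(2, 2)`, meaning bigrams only; the names `bigram_bow`, `X_bigrams`, and `bigrams` are illustrative. Note that scikit-learn sorts its vocabulary alphabetically, so the order of the features may differ from the listing above.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Authman ran faster than Harry because he is an athlete.",
    "Authman and Harry ran faster and faster.",
]

# ngram_range=(2, 2) keeps only pairs of successive words (bigrams).
bigram_bow = CountVectorizer(ngram_range=(2, 2))
X_bigrams = bigram_bow.fit_transform(corpus)

# Recover the bigrams in column order from the fitted vocabulary.
bigrams = sorted(bigram_bow.vocabulary_, key=bigram_bow.vocabulary_.get)

print(len(bigrams))
print(bigrams)
```

Passing `ngram_range=(1, 2)` instead would keep the individual words alongside the pairs.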

Another option is to use frequencies rather than raw counts. This is useful when your documents have different lengths: it allows direct comparisons even though the raw counts for a longer document would naturally be higher. Dive deeper into the feature extraction section of scikit-learn's documentation to learn more about how you can best represent your textual features!
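
As a rough sketch of the frequency idea, the counts can be rescaled so that each row sums to 1. The example below uses `TfidfTransformer` with `use_idf=False` and `norm="l1"`, which is one possible way to do this in scikit-learn and not necessarily the approach the text above has in mind. The names `counts` and `tf` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    "Authman ran faster than Harry because he is an athlete.",
    "Authman and Harry ran faster and faster.",
]

# Raw word counts, one row per sentence.
counts = CountVectorizer().fit_transform(corpus)

# use_idf=False skips inverse-document-frequency weighting;
# norm="l1" divides each row by its total word count.
tf = TfidfTransformer(use_idf=False, norm="l1").fit_transform(counts)

print(tf.toarray().round(2))
```

After this step a word that makes up a tenth of a short sentence and a tenth of a long one gets the same value in both rows, which is what makes documents of different lengths comparable.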