In [3]:
from sklearn.feature_extraction.text import CountVectorizer

## Basic vectorization

Vectorizing text is a fundamental step in analyzing documents. You can think of it as turning the words in a text document into counts or weights, arranged in what's called a term-document matrix.

The library [scikit-learn](http://scikit-learn.org/stable/) has some handy tools for this, called vectorizers. Another popular library for this kind of work is called [NLTK](http://www.nltk.org/). 

Here's an example of how a [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) works, using a simple string representing a piece of legislation:

In [1]:
bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.']

In [4]:
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(bill_titles).toarray()

In [5]:
print(features)
print(vectorizer.get_feature_names_out().tolist())

[[1 1 1 1 1 1 1 1 1 1 1 2]]
['44277', 'act', 'amend', 'an', 'code', 'education', 'of', 'relating', 'section', 'teachers', 'the', 'to']


Think of this as a spreadsheet with one row and 12 columns. The row corresponds to our document above. Each column corresponds to a word in that document (the first is "44277", the second is "act", and so on), and each number is how many times that word appears. The vectorizer lowercases the text and drops punctuation, which is why "Education" shows up as "education". Every word appears once except the last one, "to", which appears twice.
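To see where those numbers come from, here is a minimal pure-Python sketch of the counting step. It is not scikit-learn's implementation; it only assumes the vectorizer's default behavior of lowercasing the text and keeping tokens of two or more word characters. The helper name `count_terms` is made up for illustration.

```python
import re
from collections import Counter


def count_terms(text):
    """Hypothetical helper: lowercase the text, then count tokens
    made of two or more word characters (letters, digits, underscore)."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


title = 'An act to amend Section 44277 of the Education Code, relating to teachers.'
counts = count_terms(title)

# Columns follow the sorted vocabulary, so digit strings sort before letters.
vocabulary = sorted(counts)
row = [counts[term] for term in vocabulary]

# Print each column label next to its count.
for term, n in zip(vocabulary, row):
    print(term, n)
```

The `vocabulary` list plays the role of the column headers and `row` plays the role of the single spreadsheet row.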

Now what happens if we add another bill and run it again?

In [6]:
bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',
               'An act relative to health care coverage']
features = vectorizer.fit_transform(bill_titles).toarray()

print(features)
print(vectorizer.get_feature_names_out().tolist())

[[1 1 1 1 0 1 0 1 0 1 1 0 1 1 1 2]
 [0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 1]]
['44277', 'act', 'amend', 'an', 'care', 'code', 'coverage', 'education', 'health', 'of', 'relating', 'relative', 'section', 'teachers', 'the', 'to']



Now we've got two rows, each corresponding to a document. The columns cover every word that appears in either document, and each cell holds that word's count in that row's document. For example, the word in the first column, "44277", appears once in the first document but zero times in the second. This is the concept of a term-document matrix. (scikit-learn's documentation calls this layout, with documents as rows and terms as columns, a document-term matrix.)
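The two-row layout can be sketched by hand as well. The example below is a rough illustration, not scikit-learn's code: it builds a shared vocabulary from all documents, fills one row per document, and then reads the columns to find words that occur in every document. The helper names `tokenize` and `build_matrix` are hypothetical, and the tokenization again assumes the vectorizer's defaults.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical helper: lowercase, keep tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def build_matrix(documents):
    """Hypothetical helper: return (vocabulary, matrix) with one row per
    document and one column per word in the combined, sorted vocabulary."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words a document does not contain.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',
               'An act relative to health care coverage']
vocabulary, matrix = build_matrix(bill_titles)

# zip(*matrix) walks the matrix column by column.
totals = {term: sum(col) for term, col in zip(vocabulary, zip(*matrix))}
shared = [term for term, col in zip(vocabulary, zip(*matrix)) if all(col)]

print(matrix)
print(shared)
```

Only a few words are nonzero in both rows, which is why `shared` is much shorter than `vocabulary`. The column totals in `totals` are the kind of per-word frequency you would use to rank or size words across a whole collection.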

## So what?

After you extract text from documents, converting it into a structured format like this is the first step in any kind of sophisticated analysis. At the most basic level, a [word cloud](http://www.nytimes.com/interactive/2012/09/04/us/politics/democratic-convention-words.html) is just a term-document matrix visualized. But there are other interesting examples as well:

  - The LA Times used this technique to BLAH