### Prologue: In Which the Author Explains Why He's Doing What He's Doing

Books are fun. <sup>[[citation needed](https://xkcd.com/285)]</sup> 

What's even more fun are vector space models, clustering algorithms, and dimensionality reduction techniques. In this blog post, we're going to combine them all by playing around with a small set of texts from Project Gutenberg. With a bit of luck, Python, and lots of trial and error, we might just learn something interesting.

### Chapter One: In Which Books are Fetched and Puns are Made
We should start by fetching some books. There are many ways to do it, but for starters let's just use what NLTK has to offer:

In [2]:
from nltk.corpus import gutenberg
titles = gutenberg.fileids()
print(titles)

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


This rather eclectic collection will serve as our dataset. How about we weed out some boring books and get the full text for the rest (your definition of boring may vary):

In [5]:
boring = {'bible-kjv.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt'} 
titles = [t for t in titles if t not in boring] 
texts = [gutenberg.raw(t) for t in titles]
print(titles)

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


Conveniently (and completely coincidentally) the remaining titles fall into five categories I spent far too much time naming:
- Novel and Novelty: `emma`, `persuasion`, `sense` 
- Bard's Tales: `caesar`, `macbeth`, `hamlet`
- Chestertomes: `ball`, `brown`, `thursday`
- BMW (Blake, Milton, Whitman): `poems`, `paradise`, `leaves`
- BBC (Bryant, Burgess, Carroll): `stories`, `buster`, `alice`

In other words, our modest library contains three Jane Austen novels, three Shakespeare plays, three novels by Gilbert K. Chesterton, three poetry collections, and three children's books (I'm sorry, Mr. Carroll). Let's find out if this classification is equally intuitive to a machine.
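
To make that comparison concrete later on, it may help to write the expected grouping down as data. The sketch below is not part of the original notebook: the names `categories` and `expected_label` are made up for illustration, and the file ids are copied from the `titles` list printed above.

```python
# Hypothetical helper: the five hand-made categories as a mapping from
# category name to the NLTK file ids listed above.
categories = {
    "Novel and Novelty": ["austen-emma.txt", "austen-persuasion.txt", "austen-sense.txt"],
    "Bard's Tales": ["shakespeare-caesar.txt", "shakespeare-hamlet.txt", "shakespeare-macbeth.txt"],
    "Chestertomes": ["chesterton-ball.txt", "chesterton-brown.txt", "chesterton-thursday.txt"],
    "BMW": ["blake-poems.txt", "milton-paradise.txt", "whitman-leaves.txt"],
    "BBC": ["bryant-stories.txt", "burgess-busterbrown.txt", "carroll-alice.txt"],
}

# Invert the mapping so each file id points at its expected category.
expected_label = {
    fileid: name for name, fileids in categories.items() for fileid in fileids
}

print(len(expected_label))                  # 15
print(expected_label["carroll-alice.txt"])  # BBC
```

Keeping the labels in a plain dictionary means they can be lined up against whatever grouping an algorithm produces, without relying on the order of `titles`.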

### Chapter Two: In Which Books are Turned into Numbers and What Happens Then
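There are a couple of ways to represent a collection of documents as a set of numerical vectors. We're going to use a tf-idf matrix, in which each row is a document, each column is a term, and each entry grows with the term's frequency in that document and shrinks with the number of documents that contain the term.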

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

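# The corpus goes to fit_transform, not to the constructor
# (the constructor's first parameter is `input`, not the documents).
vectorizer = TfidfVectorizer(max_df=.5, min_df=1, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(texts)
# get_feature_names() was removed in scikit-learn 1.2; use it only on versions before 1.0
terms = vectorizer.get_feature_names_out()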

The `TfidfVectorizer` does all the work for us: it filters out English stop words, normalizes every row of the tf-idf matrix to unit length, and even lets us cap the document frequency. With `max_df=.5`, any term that appears in more than half of the books is dropped.
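
To see what the filtering, weighting, and normalizing described above amount to, here is a hand-rolled sketch on a toy corpus. It is not how the notebook computes anything. It assumes scikit-learn's default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 row normalization), splits on whitespace instead of using the real tokenizer, and uses a made-up three-word stop list. The names `toy_tfidf`, `docs`, and `STOP_WORDS` are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "and"}  # made-up stand-in for stop_words='english'

def toy_tfidf(docs, max_df=0.5):
    """Return (terms, rows) where rows[i][j] is the tf-idf of terms[j] in docs[i]."""
    n = len(docs)
    # Term counts per document, with stop words filtered out
    counts = [Counter(w for w in d.lower().split() if w not in STOP_WORDS) for d in docs]
    # Document frequency: in how many documents each term occurs
    df = Counter(term for c in counts for term in c)
    # Keep only terms that occur in at most max_df * n documents
    terms = sorted(t for t in df if df[t] <= max_df * n)
    rows = []
    for c in counts:
        # Raw count times smoothed inverse document frequency
        row = [c[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in terms]
        # Scale the row to unit Euclidean length (all-zero rows are left alone)
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm for x in row] if norm else row)
    return terms, rows

docs = [
    "the whale and the sea",
    "the sea and the ship",
    "the ship and the sea",
    "a rabbit and a tea party",
]
terms, rows = toy_tfidf(docs)
print(terms)  # ['party', 'rabbit', 'ship', 'tea', 'whale']
```

Here `sea` occurs in three of the four toy documents, which is more than half, so it never makes it into the vocabulary. This is the same effect `max_df=.5` has on words shared by most of our books.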

In [14]:
import numpy as np

# For each book (row), find the column index of its highest-scoring term
maxmatrix = np.argmax(tfidf_matrix.toarray(), axis=1)
for index, title in enumerate(titles):
    print("{}: {}".format(title, terms[maxmatrix[index]]))

austen-emma.txt: emma
austen-persuasion.txt: anne
austen-sense.txt: elinor
blake-poems.txt: weep
bryant-stories.txt: jackal
burgess-busterbrown.txt: buster
carroll-alice.txt: alice
chesterton-ball.txt: turnbull
chesterton-brown.txt: flambeau
chesterton-thursday.txt: syme
milton-paradise.txt: hath
shakespeare-caesar.txt: bru
shakespeare-hamlet.txt: ham
shakespeare-macbeth.txt: macb
whitman-leaves.txt: states
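
A single top term per book is a narrow view, and `bru`, `ham`, and `macb` look like abbreviated speaker labels rather than content words (a guess from the output, not something checked here). One possible extension, not in the original notebook, is to list the top few terms per row with `np.argsort`. The sketch below uses a tiny made-up matrix so that it runs on its own. `top_terms`, `toy_matrix`, and `toy_terms` are hypothetical names; on the real data you would pass `tfidf_matrix.toarray()` and `terms` instead.

```python
import numpy as np

def top_terms(matrix, terms, k=3):
    """Return, for each row of a dense matrix, its k highest-scoring terms."""
    terms = np.asarray(terms)
    # argsort is ascending, so reverse each row and keep the first k columns
    order = np.argsort(matrix, axis=1)[:, ::-1][:, :k]
    return [terms[row].tolist() for row in order]

toy_terms = ["alice", "rabbit", "queen", "whale"]
toy_matrix = np.array([
    [0.8, 0.5, 0.3, 0.0],
    [0.1, 0.0, 0.2, 0.9],
])
for row in top_terms(toy_matrix, toy_terms, k=2):
    print(row)
# ['alice', 'rabbit']
# ['whale', 'queen']
```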
