### Prologue: In Which the Author Explains Why He's Doing What He's Doing

Books are fun.

What's even more fun are vector space models, clustering algorithms, and dimensionality reduction techniques. In this blog post, we're going to combine all three by playing around with a small set of texts from Project Gutenberg. With a bit of luck, Python, and lots of trial and error, we might just learn something interesting.

### First, the data
Let's get some books using NLTK:

In [9]:
from nltk.corpus import gutenberg

fileids = gutenberg.fileids()
print(fileids)

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


This rather eclectic collection will serve as our dataset. We can weed out some boring books (your definition of boring may vary) and fetch the full text for the rest. Let’s also be pedantic and strip the extension from the titles:

In [10]:
boring = {'bible-kjv.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt'} 
fileids = [f for f in fileids if f not in boring] 
texts = [gutenberg.raw(f) for f in fileids]
titles = [f.replace('.txt', '') for f in fileids]
print(titles)

['austen-emma', 'austen-persuasion', 'austen-sense', 'blake-poems', 'bryant-stories', 'burgess-busterbrown', 'carroll-alice', 'chesterton-ball', 'chesterton-brown', 'chesterton-thursday', 'milton-paradise', 'shakespeare-caesar', 'shakespeare-hamlet', 'shakespeare-macbeth', 'whitman-leaves']


Conveniently, and completely coincidentally, the remaining titles fall into five categories I spent far too much time naming:
- Novel and Novelty: `emma`, `persuasion`, `sense` 
- Bard's Tales: `caesar`, `macbeth`, `hamlet`
- Chestertomes: `ball`, `brown`, `thursday`
- BMW (Blake, Milton, Whitman): `poems`, `paradise`, `leaves`
- BBC (Bryant, Burgess, Carroll): `stories`, `busterbrown`, `alice`

In other words, our modest library contains three Jane Austen novels, three Shakespeare plays, three novels by Gilbert K. Chesterton, three poetry collections, and three children's books (I'm sorry, Mr. Carroll). Let's find out if this classification is equally intuitive to a machine.
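To compare a machine's grouping against ours later on, it helps to have the five categories from the list above in code. This is only a minimal sketch; the names `categories` and `labels` are my own and not something NLTK provides.

```python
# Hand-assigned categories from the list above, keyed by the full titles
# (the Gutenberg file ids without the .txt extension).
categories = {
    "Novel and Novelty": ["austen-emma", "austen-persuasion", "austen-sense"],
    "Bard's Tales": ["shakespeare-caesar", "shakespeare-macbeth", "shakespeare-hamlet"],
    "Chestertomes": ["chesterton-ball", "chesterton-brown", "chesterton-thursday"],
    "BMW": ["blake-poems", "milton-paradise", "whitman-leaves"],
    "BBC": ["bryant-stories", "burgess-busterbrown", "carroll-alice"],
}

# Invert the mapping so each title points at its category.
labels = {title: name for name, members in categories.items() for title in members}
```

With `labels` in hand, `labels["carroll-alice"]` gives back `"BBC"`, and every one of the fifteen titles has exactly one category.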

### Chapter Two: In Which Books Are Turned into Numbers and What Happens Then
There are a couple of ways we could represent a collection of documents as a set of numerical vectors. We're going to use a tf-idf matrix.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

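# The corpus goes to fit_transform, not to the constructor.
# max_df=.5 drops terms that occur in more than half of the books.
vectorizer = TfidfVectorizer(max_df=.5, min_df=1, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(texts)
# scikit-learn >= 1.0; older releases call this get_feature_names().
terms = vectorizer.get_feature_names_out()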

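The `TfidfVectorizer` does all the work for us: it filters out English stop words, normalizes every row of the tf-idf matrix to unit length, and lets us cap the document frequency of a term. With `max_df=.5`, any term that appears in more than half of the books is discarded, which leaves the words that set a book apart from the rest.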
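To make those steps less magical, here is a rough pure-Python sketch of the same idea: a document-frequency cap, smoothed idf weights, and unit-length rows. It assumes the documents are already tokenized, and the function name `tfidf_rows` and the toy sentences are made up for illustration. The real vectorizer also handles tokenizing, lowercasing, and stop words, so treat this as an approximation rather than a reimplementation.

```python
import math
from collections import Counter

def tfidf_rows(docs, max_df=1.0):
    """Return (vocabulary, rows) for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    # Keep only terms whose share of documents does not exceed max_df.
    vocab = sorted(t for t, d in df.items() if d / n <= max_df)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        # Scale the row to unit Euclidean length (leave all-zero rows alone).
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

# Toy corpus, invented for this example.
docs = [
    "the whale and the sea".split(),
    "the rabbit and the hatter".split(),
    "the ghost and the crown".split(),
    "the sea and the crown".split(),
]
vocab, rows = tfidf_rows(docs, max_df=0.5)
```

Here "the" and "and" occur in every toy document, so the cap removes them, and within the first document the rarer "whale" ends up with a higher weight than "sea".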

In [24]:
import numpy as np

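# For each book (row), find the column index of its highest tf-idf score
# and print the corresponding term.
maxmatrix = np.argmax(tfidf_matrix.toarray(), axis=1)
for index, title in enumerate(titles):
    print("{}: {}".format(title, terms[maxmatrix[index]]))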

austen-emma: emma
austen-persuasion: anne
austen-sense: elinor
blake-poems: weep
bryant-stories: jackal
burgess-busterbrown: buster
carroll-alice: alice
chesterton-ball: turnbull
chesterton-brown: flambeau
chesterton-thursday: syme
milton-paradise: hath
shakespeare-caesar: bru
shakespeare-hamlet: ham
shakespeare-macbeth: macb
whitman-leaves: states
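A single top term per book is a thin summary, so it can be handy to look at the few best-scoring terms instead. The helper below is a small sketch under my own naming (`top_terms`), and the scores in `toy_scores` are invented; a real dictionary could be built from one row of the matrix with `dict(zip(terms, tfidf_array[index]))`.

```python
import heapq

def top_terms(scores, k=3):
    """Return the k highest-scoring (term, score) pairs, skipping zero scores."""
    nonzero = ((term, s) for term, s in scores.items() if s > 0)
    return heapq.nlargest(k, nonzero, key=lambda pair: pair[1])

# Made-up scores for illustration only.
toy_scores = {"alice": 0.9, "hatter": 0.3, "gryphon": 0.25, "whale": 0.0}
best = top_terms(toy_scores, k=2)
```

Skipping the zero entries matters here because most terms never occur in any given book, so most of each row is zero.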


In [25]:
# Row -2 is shakespeare-macbeth; print every term with its tf-idf score.
tfidf_array = tfidf_matrix.toarray()
for i, score in enumerate(tfidf_array[-2, :]):
    print(terms[i], score)

00 0.0
000 0.0
00021053 0.0
...
16 0.0
1603 0.00387264828829
1667 0.0
...

In [139]:
from wordcloud import WordCloud, ImageColorGenerator
from matplotlib import pyplot as plt
from PIL import Image


alice_coloring = np.array(Image.open("data/alice.png"))
image_colors = ImageColorGenerator(alice_coloring)
wc = WordCloud(background_color="white",
               max_words=2000,
               mask=alice_coloring,
               max_font_size=40,
               random_state=42,
               color_func=image_colors,
               collocations=False,
               relative_scaling=0.3)

# Row 6 is carroll-alice; map each term to its tf-idf score.
tfidf_alice = {terms[i]: score for i, score in enumerate(tfidf_array[6, :])}
wc.generate_from_frequencies(tfidf_alice)
wc.to_file("alice_wordcloud.png")

<wordcloud.wordcloud.WordCloud at 0x10c51af60>

In [37]:
from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
from matplotlib import pyplot as plt
from PIL import Image

macbeth_coloring = np.array(Image.open("data/hamlet.png"))
colors = ImageColorGenerator(macbeth_coloring)

wc = WordCloud(background_color="white",
               max_words=2000,
               mask=macbeth_coloring,
               max_font_size=40,
               random_state=42,
               color_func=colors,
               collocations=False,
               relative_scaling=0.6)

# Row -2 is shakespeare-macbeth; map each term to its tf-idf score.
tfidf_macbeth = {terms[i]: score for i, score in enumerate(tfidf_array[-2, :])}
# Drop stage directions and speaker tags so they don't dominate the cloud.
for tag in ('exeunt', 'scena', 'macb', 'macd', 'banq'):
    tfidf_macbeth.pop(tag, None)

wc.generate_from_frequencies(tfidf_macbeth)
wc.to_file("macbeth.png")

<wordcloud.wordcloud.WordCloud at 0x10d42beb8>