# Topic modeling: *The Human Side of Animals*

For a homework assignment on Latent Dirichlet Allocation (LDA), I decided to try topic modeling on *The Human Side of Animals* (1918) by Royal Dixon (wow, real name), author of *The Human Side of Plants* (1914) and *The Human Side of Trees* (1917). In the anthropomorphizing spirit of Mr. Dixon, we'll take a look at the human side of machines and see if we can generate a tidy gist of his work without reading it ourselves. The text of the book comes from Project Gutenberg.

### Preprocessing

In [8]:
import string
from nltk.corpus import stopwords
from collections import OrderedDict
import re
import pandas as pd

In [9]:
pd.set_option('display.max_colwidth', 200)

In [10]:
### First, let's extract the table of contents
### so that we can pull each chapter from the book

with open('data-files/animals.txt', 'r') as book:
    t = book.readlines()

contents = []
contents_table = False

for line in t:

    # Start collecting at the CONTENTS heading...
    if 'CONTENTS' in line:
        contents_table = True

    # ...and stop at the ILLUSTRATIONS heading that follows it
    if 'ILLUSTRATIONS\r\n' in line:
        contents_table = False

    if contents_table:
        contents.append(line)

contents

['  CONTENTS\r\n',
 '\r\n',
 '  CHAPTER                                                  PAGE\r\n',
 '\r\n',
 '  FOREWORD                                                 \r\n',
 '\r\n',
 '  1 ANIMALS THAT PRACTISE CAMOUFLAGE                          1\r\n',
 '\r\n',
 '  2 ANIMAL MUSICIANS                                        18\r\n',
 '\r\n',
 '  3 ANIMALS AT PLAY                                        32\r\n',
 '\r\n',
 '  4 ARMOUR-BEARING AND MAIL-CLAD ANIMALS                    46\r\n',
 '\r\n',
 '  5 MINERS AND EXCAVATORS                                    61\r\n',
 '\r\n',
 '  6 ANIMAL MATHEMATICIANS                                   88\r\n',
 '\r\n',
 '  7 THE LANGUAGE OF ANIMALS                                99\r\n',
 '\r\n',
 '  8 IN THEIR BOUDOIRS, HOSPITALS AND CHURCHES            120\r\n',
 '\r\n',
 '  9 SELF-DEFENCE AND HOME-GOVERNMENT                       130\r\n',
 '\r\n',
 '  10 ARCHITECTS, ENGINEERS, AND HOUSE-BUILDERS             150\r\n',
 '\r\n',
 '  11 FOOD CONS

In [11]:
### Okay, neaten up table of contents

non_chapters = ['CONTENTS', 'CHAPTER']        

contents = [c.strip() for c in contents]
contents = [c for c in contents if c != '']
contents = [c for c in contents if
            not any(n in c for n in non_chapters)]
# Raw string, so \s reaches the regex engine as written
contents = [re.sub(r"\s\s+", " ", c) for c in contents]

chapter_names = []

for c in contents:
    
    name = []
    for w in c.split():
        try:
            int(w)
        except ValueError:
            name.append(w)
            
    chapter_names.append(' '.join(name))
    
chapter_names

['FOREWORD',
 'ANIMALS THAT PRACTISE CAMOUFLAGE',
 'ANIMAL MUSICIANS',
 'ANIMALS AT PLAY',
 'ARMOUR-BEARING AND MAIL-CLAD ANIMALS',
 'MINERS AND EXCAVATORS',
 'ANIMAL MATHEMATICIANS',
 'THE LANGUAGE OF ANIMALS',
 'IN THEIR BOUDOIRS, HOSPITALS AND CHURCHES',
 'SELF-DEFENCE AND HOME-GOVERNMENT',
 'ARCHITECTS, ENGINEERS, AND HOUSE-BUILDERS',
 'FOOD CONSERVERS',
 'TOURISTS AND SIGHT-SEERS',
 'ANIMAL SCAVENGERS AND CRIMINALS',
 'AS THE ALLIES OF MAN',
 'THE FUTURE LIFE OF ANIMALS']
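
The digit-filtering loop above can also be written as a single regular expression. This is an alternative sketch, not the code that produced the output above: `parse_title` and `ENTRY` are hypothetical names, and the pattern assumes every entry has the shape "optional chapter number, title, optional page number".

```python
import re

# Optional leading chapter number, the title, optional trailing page number.
ENTRY = re.compile(r"^\s*(?:\d+\s+)?(?P<title>.*?)(?:\s+\d+)?\s*$")

def parse_title(line):
    """Return the chapter title from one table-of-contents line."""
    return ENTRY.match(line).group("title")

sample = [
    '  FOREWORD                                                 \r\n',
    '  1 ANIMALS THAT PRACTISE CAMOUFLAGE                          1\r\n',
    '  10 ARCHITECTS, ENGINEERS, AND HOUSE-BUILDERS             150\r\n',
]
titles = [parse_title(line) for line in sample]
print(titles)
```

Unlike the `int(w)` test, this only strips numbers at the two ends of the line, so a title containing a number in the middle would be left alone.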

In [12]:
def extract_chapters(path_to_book, table_contents_end):
    '''Return a dataframe holding each chapter
        of the book, indexed by chapter title.
        Relies on the global chapter_names list built above.
        
        ---- Params ----
        path_to_book : path to the plain-text book
        table_contents_end : hard-coded line from the 
            book that comes right after the table of contents
        '''
    
    with open(path_to_book, 'r') as book:
        b = book.read()

    chapter_dict = OrderedDict()

    end_contents = b.find(table_contents_end) + len(table_contents_end)
    b = b[end_contents:]

    chapter_idxs = []

    for chapter in chapter_names:
        idx = b.find(chapter) + len(chapter)
        chapter_idxs.append(idx)
        
    end_of_book = '*** END OF THIS PROJECT GUTENBERG EBOOK'
    end_idx = b.find(end_of_book) + len(end_of_book)

    chapter_idxs.append(end_idx)
    # list() so the pairs can be indexed below (zip is lazy in Python 3)
    chapter_bounds = list(zip(chapter_idxs, chapter_idxs[1:]))

    for i in range(len(chapter_names)):
        start, end = chapter_bounds[i]
        chapter_name = chapter_names[i]
        curr_chap = b[start : end]
        chapter_dict[chapter_name] = curr_chap

    df = pd.DataFrame.from_dict(chapter_dict, orient='index')
    df.columns = ['raw_text']

    return df
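
The slicing idea in `extract_chapters` is easier to see on a toy string. The sketch below is a variant rather than a copy of that function: `split_chapters` is a hypothetical helper, it assumes each title occurs exactly once in the text, it stops each chapter just before the next title, and it uses `str.index` so that a missing title raises an error instead of silently returning -1.

```python
def split_chapters(text, titles, end_marker):
    """Map each title to the text between it and the next title."""
    # str.index raises ValueError for a missing title, whereas
    # str.find would quietly return -1 and shift the slice.
    starts = [text.index(title) for title in titles]
    starts.append(text.index(end_marker))

    chapters = {}
    for title, start, end in zip(titles, starts, starts[1:]):
        # Skip past the title itself; stop just before the next one.
        chapters[title] = text[start + len(title):end].strip()
    return chapters

toy = "ONE first chapter. TWO second chapter. *** END"
chapters = split_chapters(toy, ["ONE", "TWO"], "*** END")
print(chapters)
```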

In [13]:
def clean_df(df):
    '''Clean up the text contained in df,
        making new columns with progressively
        tidier text'''

    # Establish stopwords
    sw = set(stopwords.words('english'))

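    # Remove punctuation & make lowercase
    # (re.sub works on Python 2 and 3 alike; the
    # str.translate(None, ...) form is Python 2 only)
    punctuation = '[%s]' % re.escape(string.punctuation)
    df['no_punctuation'] = df.raw_text.map(
        lambda x: re.sub(punctuation, '', x).lower())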

    # Remove stopwords from lowercased, punc-free text
    df['no_stopwords'] = df.no_punctuation.map(
        lambda x: [word for word in x.split() if word not in sw])

    df['no_stopwords'] = df.no_stopwords.map(
        lambda x: " ".join(x))

    return df
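
The `clean_df` steps can be tried on a single string without pandas or NLTK. A minimal sketch, assuming Python 3 (where `str.maketrans` builds the deletion table) and a tiny made-up stopword set standing in for NLTK's list; `clean`, `TOY_STOPWORDS`, and `STRIP_PUNCT` are names invented for illustration.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list.
TOY_STOPWORDS = {"the", "of", "and", "in", "or", "a", "he", "all"}

# Python 3 spelling of "delete every punctuation character".
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def clean(text, stopwords=TOY_STOPWORDS):
    """Lowercase, drop punctuation, then drop stopwords."""
    no_punct = text.translate(STRIP_PUNCT).lower()
    return " ".join(w for w in no_punct.split() if w not in stopwords)

cleaned = clean("In ass and peacock, stork and dog,\r\n He read similitudes of men.")
print(cleaned)
```

Calling `split()` with no arguments also swallows the `\r\n` line endings, which is why they never show up in the `no_stopwords` column.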

In [16]:
book = extract_chapters('data-files/animals.txt', 'ILLUSTRATIONS\r\n')
book = clean_df(book)

book.head()

| | raw_text | no_punctuation | no_stopwords |
|---|---|---|---|
| FOREWORD | `\r\n\r\n _"And in the lion or the frog--\r\n In all the life of moor or fen--\r\n In ass and peacock, stork and dog,\r\n He read similitudes of men."_\r\n\r\nMore and more science is being tau...` | `\r\n\r\n and in the lion or the frog\r\n in all the life of moor or fen\r\n in ass and peacock stork and dog\r\n he read similitudes of men\r\n\r\nmore and more science is being taught in a ne...` | `lion frog life moor fen ass peacock stork dog read similitudes men science taught new way men beginning discard lumber brains workshop get real facts real conclusions laboratories experiments tabl...` |
| ANIMALS THAT PRACTISE CAMOUFLAGE | `\r\n\r\n _"She was a gordian shape of dazzling line,\r\n Vermilion-spotted, golden, green and blue;\r\n Striped like a zebra, freckled like a pard,\r\n Eyed like a peacock, and all crimson bar...` | `\r\n\r\n she was a gordian shape of dazzling line\r\n vermilionspotted golden green and blue\r\n striped like a zebra freckled like a pard\r\n eyed like a peacock and all crimson barrd\r\n an...` | `gordian shape dazzling line vermilionspotted golden green blue striped like zebra freckled like pard eyed like peacock crimson barrd full silver moons breathed dissolved brighter shone interwreath...` |
| ANIMAL MUSICIANS | `\r\n\r\n _"Nay, what is Nature's self,\r\n But an endless strife towards\r\n Music, euphony, rhyme?"_\r\n\r\n --WATSON.\r\n\r\n\r\nThe great thinkers of the age believe that the world is one m...` | `\r\n\r\n nay what is natures self\r\n but an endless strife towards\r\n music euphony rhyme\r\n\r\n watson\r\n\r\n\r\nthe great thinkers of the age believe that the world is one marvellous\r\n...` | `nay natures self endless strife towards music euphony rhyme watson great thinkers age believe world one marvellous blending innumerable varied voices unison sound forms great music spheres poets p...` |
| ANIMALS AT PLAY | `\r\n\r\n _"... _About them frisking, played\r\n All beasts of the earth, since wild, and of all chase\r\n In wood or wilderness, forest or den;\r\n Sporting the lion romped, and in his paw\r\n...` | `\r\n\r\n about them frisking played\r\n all beasts of the earth since wild and of all chase\r\n in wood or wilderness forest or den\r\n sporting the lion romped and in his paw\r\n dandled th...` | `frisking played beasts earth since wild chase wood wilderness forest den sporting lion romped paw dandled kid bears tigers ounces pards gambled unwieldy elephant make mirth used might wreathed lig...` |
| ARMOUR-BEARING AND MAIL-CLAD ANIMALS | `\r\n\r\n _"The spectacle of Nature is always new, for she is always\r\n renewing the spectators. Life is her most exquisite invention;\r\n and death is her expert contrivance to get plenty of l...` | `\r\n\r\n the spectacle of nature is always new for she is always\r\n renewing the spectators life is her most exquisite invention\r\n and death is her expert contrivance to get plenty of life\r...` | `spectacle nature always new always renewing spectators life exquisite invention death expert contrivance get plenty life goethes aphorisms trans huxley civilised nations throughout world different...` |


### LDA modeling

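First I'll fit an LDA model with 10 topics to the entire corpus, treating each of the book's 16 chapters (the foreword plus 15 numbered chapters) as one document, and print the 10 highest-weighted words in each topic.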

In [17]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

n_features = 1000
n_topics = 10
n_top_words = 10

lda = LatentDirichletAllocation(n_components=n_topics, 
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                             max_features=n_features)

data_vect = vectorizer.fit_transform(book.no_stopwords)
lda.fit(data_vect)

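# On scikit-learn >= 1.2 this method is get_feature_names_out()
feature_names = vectorizer.get_feature_names()

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % topic_idx)
    # argsort is ascending, so the reversed slice walks the
    # n_top_words largest weights from highest to lowest
    print(" ".join([feature_names[i] for i in topic.argsort()
                    [:-n_top_words - 1:-1]]))

print("")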

Topic 0:
young play great man life home ground long dog two
Topic 1:
man small life death dog old could beasts large said
Topic 2:
young language man among white colour enemies life long home
Topic 3:
language life something dogs polar understood habitat leaves come claimed
Topic 4:
among body great live means man language two enemies food
Topic 5:
great home music small food water large feet young long
Topic 6:
beaver heart red around home long scientists day america tiny
Topic 7:
man dog life among two great language time dogs could
Topic 8:
man great use life language small food water play much
Topic 9:
among dog food two home time enemies man escape fox
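
The slice `topic.argsort()[:-n_top_words - 1:-1]` in the cell above is compact but easy to misread: `argsort` returns indices in ascending order of weight, and the reversed slice keeps the last `n_top_words` of them, largest first. Here is a plain-Python sketch of the same selection, using made-up weights and a hypothetical `top_words` helper (neither comes from the fitted model):

```python
def top_words(weights, vocabulary, n):
    """Return the n vocabulary entries with the largest weights."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i])
    # ranked is ascending, like argsort; take the last n, reversed.
    return [vocabulary[i] for i in ranked[:-n - 1:-1]]

vocab = ["beaver", "dog", "home", "music", "play"]
weights = [0.5, 3.0, 2.0, 0.1, 4.5]  # made-up numbers, not model output
best = top_words(weights, vocab, 3)
print(best)
```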



Touching. I'm curious whether we can get more granular and get a better sense of each chapter's topic breakdown. I'm going to try treating each document (each chapter) as a corpus unto itself, fitting a single topic to it, and printing that topic's top words.

In [18]:
n_topics = 1

for i in range(len(book)):
    
    print('Chapter {}: {}'.format(i, book.index[i]))

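    vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                 max_features=n_features)

    # .iloc picks the chapter by position; .split() then hands the
    # vectorizer one word per "document"
    data_vect = vectorizer.fit_transform(book.no_stopwords.iloc[i].split())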

    lda = LatentDirichletAllocation(n_components=n_topics, 
                                    max_iter=5,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0)

    lda.fit(data_vect)

    feature_names = vectorizer.get_feature_names()
    
    for topic in lda.components_:
        # j, not i, so the chapter counter is not shadowed
        print(" ".join([feature_names[j] for j in topic.argsort()
                        [:-n_top_words - 1:-1]]))
        
    print("")

Chapter 0: FOREWORD
animals man love animal new life things one may long

Chapter 1: ANIMALS THAT PRACTISE CAMOUFLAGE
animals young colour one white colouration animal among even upon

Chapter 2: ANIMAL MUSICIANS
music musical us animals would one great time earth come

Chapter 3: ANIMALS AT PLAY
play one would young animals like playing delight games little

Chapter 4: ARMOUR-BEARING AND MAIL-CLAD ANIMALS
armour like animals protection enemies one mail coats spines animal

Chapter 5: MINERS AND EXCAVATORS
one home like animal animals little underground usually many food

Chapter 6: ANIMAL MATHEMATICIANS
one animals dog dogs time two man would animal knowledge

Chapter 7: THE LANGUAGE OF ANIMALS
language animals one dog words human animal ideas man understood

Chapter 8: IN THEIR BOUDOIRS, HOSPITALS AND CHURCHES
animals seek many animal one like water lying bear wound

Chapter 9: SELF-DEFENCE AND HOME-GOVERNMENT
animals one many use cases strong animal form man human

Chapter 10: ARCHI
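
Because the cell above splits each chapter into single words before vectorizing, every "document" the model sees is one word long, and with only one topic there is little left for LDA to infer. It seems likely that the printed words track each chapter's raw word counts closely, though nothing in this notebook checks that. A quick sanity check along those lines, with a hypothetical `most_common_words` helper and a made-up snippet of text, might look like:

```python
from collections import Counter

def most_common_words(text, n=10, min_count=2):
    """Return up to n words that occur at least min_count times."""
    counts = Counter(text.split())
    # min_count mirrors min_df=2 when every document is a single word
    return [w for w, c in counts.most_common(n) if c >= min_count]

snippet = "play young play animals play young games animals play"
frequent = most_common_words(snippet, n=3)
print(frequent)
```

Running this on `book.no_stopwords` for a chapter and comparing the result with the LDA line for that chapter would show how much the model adds over simple counting.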

### A human pretending to be a computer pretending to be a human 
might write some counterfeit text, using the information provided from the LDA, like this:

> At home underground, animals such as dogs delight in little games. One kind of animal is a dog. This is something that man has little understood. With their armor-like protection, animals usually find many kinds of food. One animal may seek another animal underground. Animals play music — they are very musical — while they wear their little coats and spines, and one day the time will come on this great earth that man and animal join together to play a little music.