Motive

Inspired by Vonnegut's statement, "There is no reason why the simple shapes of stories can’t be fed into computers, they are beautiful shapes”, I'm interested in seeing if NLP can reveal these "shapes of stories". Specifically, I'm investigating if literary texts' sentiment time series can indicate certain literary archetypes (e.g. "Rags to riches" (rise), "Tragedy" (fall), "Cinderella" (rise-fall-rise)).

I have since discovered a relevant paper and sought to employ different techniques for classification & clustering (see Methodology)

Methodology

Training data set/How are words assigned a happiness score? To compare my findings with the paper, I used the same training data, and recategorizing floating happiness scores to 0-9 integer scores (e.g. 8.6-> 9).

"To quantify the happiness of the atoms of language, we merged the 5,000 most frequent words from a collection of four corpora: Google Books, New York Times articles, Music Lyrics, and Twitter messages, resulting in a composite set of roughly 10,000 unique words."

A suitable corpus: Literary texts are obtained from Project Gutenberg. Following Regan's method, books are English, between 20k-100k owrds,and exlude those with fewer than 40 downloads. We process 1106 texts.

Scraped with this web scraper,with some modifications

preprocessing: removing Gutenberg markups, tokenizing, removing whitespace & stop words & punctuation & numbers & common names.

sentiment score: To determine text sentiment, words are rated on a 1-9 scale from most negative to most positive. See data source. I generate a sentiment score for every 5% of a text to get the same number of datapoints for each text. This means for each text we get an array of 20 sentiment scores. To get a time series of datapoints, we use sliding window (window size 10k words).

classification: Multinomial Naive Bayes

clustering: KMeans -- I group texts with similar "shapes" into clusters, determining a good K by elbow method, which recommends 2-3 clusters.

Finally we're at the fun part: generating clusters that represent archetypical emotion journeys in literature.

So was Kurt Vonnegut right?

It's unclear. After processing 1106 texts, there does not appear to be major differences between the 3 clusters --i.e. we cannot group texts into literary archetypes.

How to improve

Expand features selection: to include negation, humour

Find a better training data set: the current training dataset contains more "modern" words and modern sentiments, as a good part of them are sourced from NYT articles and Twitter. However our testing dataset, books from Project Gutenberg are old enough that their US copyright protection has expired (age of author+70 years)

Inherent issue with Naive Bayes classification: assumption of independence in features -> experiment with other classification algos

How to run

ensure that you've pip installed relevant packages e.g. nltk, matplotlib, numpy, pickle

run: python3 main.py

This program with call 3 other processes to download texts, apply Naive Bayes, and cluster

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
img		img
src		src
.gitignore		.gitignore
Hedonometer.csv		Hedonometer.csv
README.md		README.md
bookids.txt		bookids.txt
dataForClustering.csv		dataForClustering.csv
my_classifier.pickle		my_classifier.pickle
names.csv		names.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

img

img

src

src

.gitignore

.gitignore

Hedonometer.csv

Hedonometer.csv

README.md

README.md

bookids.txt

bookids.txt

dataForClustering.csv

dataForClustering.csv

my_classifier.pickle

my_classifier.pickle

names.csv

names.csv

Repository files navigation

Motive

Methodology

So was Kurt Vonnegut right?

How to improve

How to run

About

Releases

Packages

Languages

yumiobuchi/shape-of-stories

Folders and files

Latest commit

History

Repository files navigation

Motive

Methodology

So was Kurt Vonnegut right?

How to improve

How to run

About

Topics

Resources

Stars

Watchers

Forks

Languages