# Project: Literature Analysis

### Reading is great. And with so many amazing books out there come great movies, reviews, and summaries. Reading those reviews and watching those films often gives us only a partial picture of what the book is actually like, though. With the power of data science and natural language processing, I can bring another dimension to how we understand literature.

For this project, I am looking at the following eight writings:
* **Foundation by Isaac Asimov** - a book I am currently reading, by my favorite sci-fi writer
* **A Clockwork Orange by Anthony Burgess** - the writing behind a famous extravagant horror movie by Stanley Kubrick, a book with a unique writing style and vocabulary
* **Comments on the Society of the Spectacle by Guy Debord** - a continuation of a book I was taught in university about the influence of capitalist media on society
* **A Brief History of Time by Stephen Hawking** - a book that excited millions about the workings of our universe
* **For Whom the Bell Tolls by Ernest Hemingway** - a novel with a unique writing style and themes specific to American writers
* **Carrie by Stephen King** - one of the most well-known horror novels out there
* **The Hobbit by J.R.R. Tolkien** - a very long journey by very short people, one that so many people and communities hold dear to their hearts
* **Slaughterhouse Five by Kurt Vonnegut** - a book highly recommended to me

# Sentiment Analysis

## Outline

**3. Sentiment Analysis**
1. Sentiment of books **overall**
    - Create lambda functions for polarity and subjectivity using TextBlob
    - Plot the data using matplotlib based on the new data frame
    
    
2. Sentiment of books **over time**
    - Create a function to split each writing into 40 pieces using numpy and math
        - *40 pieces ended up being a good balance between too little and too much detail*
    - Plot the data using matplotlib


A few key concepts I will be using with sentiment analysis:

1. **TextBlob Module:** Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary based on where it appears in a sentence. The TextBlob module allows us to take advantage of these labels.
2. **Sentiment Labels:** Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus's sentiment is the average of these word-level labels.
   * **Polarity**: How positive or negative a word is. -1 is very negative. +1 is very positive. In literature, this is very useful for determining whether the story's events and characters' feelings are at a low or a high.
   * **Subjectivity**: How subjective, or opinionated a word is. 0 is fact. +1 is an opinion or feeling. In literature, this could address the difference between feelings or judgements of a character or author vs actual events of the story.

For more info, see how TextBlob coded up its [sentiment function](https://planspace.org/20150607-textblob_sentiment/).
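
To make the "average of word labels" idea concrete, here is a minimal sketch that does not use TextBlob at all. The lexicon `TOY_LEXICON`, its scores, and the helper `toy_sentiment` are all invented for illustration; TextBlob's real implementation uses a much larger labeled lexicon and does more than a plain average (the article linked above covers how it treats negation and intensifiers).

```python
# Toy lexicon: these scores are made up for illustration and are NOT TextBlob's labels.
# Each word maps to (polarity, subjectivity).
TOY_LEXICON = {
    "great": (0.75, 0.75),
    "horrible": (-1.0, 1.0),
    "long": (-0.25, 0.5),
}

def toy_sentiment(text):
    """Average the (polarity, subjectivity) of every word found in the toy lexicon."""
    words = text.lower().split()
    scored = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scored:
        return (0.0, 0.0)  # no labeled words: treat the text as neutral and factual
    polarity = sum(p for p, _ in scored) / len(scored)
    subjectivity = sum(s for _, s in scored) / len(scored)
    return (polarity, subjectivity)

# "great" and "long" are the only labeled words, so both scores are their averages
print(toy_sentiment("a great but long journey"))  # (0.25, 0.625)
```

Words without a label ("a", "but", "journey") are simply skipped, which is why a long text full of unlabeled words can still end up with a strong score from just a few labeled ones.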


## Sentiment of books overall

In [None]:
# We'll start by reading in the corpus, which preserves word order
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

In [None]:
# Create quick lambda functions to find the polarity and subjectivity of each writing
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity # lambda function for polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity # lambda function for subjectivity

data['polarity'] = data['writing'].apply(pol) # apply the polarity function on each writing and add its value into a new column in the data frame
data['subjectivity'] = data['writing'].apply(sub) # apply the subjectivity function on each writing and add its value into a new column in the data frame
data

In [None]:
# Run this cell twice to see the graph
# Let's plot the results
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8]

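for writer in data.index:
    x = data.polarity.loc[writer]
    y = data.subjectivity.loc[writer]
    plt.scatter(x, y, color='blue')
    plt.text(x+.001, y+.001, data['full_name'].loc[writer], fontsize=10) # label each point using the row label rather than a positional lookup

plt.xlim(-.01, .12) # set the x-axis range once, after all points are plotted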
    
plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Feelings -->', fontsize=15)

plt.show()

- **Sentimental Hobbit** - It was surprising to see The Hobbit at the very top of the subjectivity scale. I would have expected the hobbit’s journey to be heavily plot-driven or full of fact-like descriptions. However, this finding could hint that The Hobbit does the best job of portraying its world through the eyes of its characters.


- **Facts?** - Even though we found earlier that Guy Debord uses the word “fact” a lot, his book was the 2nd most subjective one, which is especially notable for a non-fiction work.


- **Fiction or Life?** - While it is not that surprising to see Stephen Hawking’s writing come out as the most factual, it was interesting to see Kurt Vonnegut’s writing land very close to it on that scale. This could mean that the book relies more heavily on plot than on description. It could also be related to Vonnegut’s writing being based on real-life events like WWII.

## Sentiment of books over time

In [None]:
# Split each writing into n parts
import numpy as np
import math

def split_text(text, n=10):
    '''Takes in a string of text and splits it into n equally sized parts, with a default of 10 parts. Any leftover characters at the end (fewer than n) are dropped.'''

    # Calculate length of text, the size of each chunk of text and the starting points of each chunk of text
    length = len(text)
    size = math.floor(length / n) # calculate the size of each piece, rounding down
    start = np.arange(0, length, size) # calculate the starting points based on previously calculated parameters of pieces
    
    # Pull out equally sized pieces of text and put it into a list
    split_list = [] # create the list of pieces you will be returning
    for piece in range(n): # for each piece out of n pieces
        split_list.append(text[start[piece]:start[piece]+size]) # add the text from one starting point up to (but not including) the next starting point
    return split_list
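
Because `size` is rounded down, `split_text` leaves out the last `len(text) % n` characters, which is fewer than 40 characters per book here and makes no practical difference. If keeping every character matters, one possible variant is sketched below. The name `split_text_even` is hypothetical and not part of the original notebook; it uses plain integer arithmetic instead of numpy.

```python
def split_text_even(text, n=10):
    """Split text into n contiguous pieces whose lengths differ by at most one character."""
    length = len(text)
    # Boundary i sits at i * length // n, so the final boundary is exactly len(text)
    bounds = [i * length // n for i in range(n + 1)]
    return [text[bounds[i]:bounds[i + 1]] for i in range(n)]

# 13 characters into 4 pieces: the extra character ends up in the last piece
print(split_text_even("abcdefghijklm", 4))  # ['abc', 'def', 'ghi', 'jklm']
```

Joining the returned pieces gives back the original text, so nothing is lost at the end of the book.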

In [None]:
# Let's take a look at our data again
data

In [None]:
# Let's create a list to hold all of the pieces of text
list_pieces = []
for t in data.writing: # for each writing
    split = split_text(t, 40) # split the text into 40 pieces, as that ends up being a good balance between too little vs too much detail
    list_pieces.append(split) 
    
list_pieces

In [None]:
# As we can see, the list has 8 elements, one for each book
len(list_pieces)

In [None]:
# Also as we can see, each book has been split into 40 pieces of text
len(list_pieces[0])

In [None]:
# Calculate the polarity for each piece of text

polarity_writing = [] # create empty list to hold polarity values for all books
for lp in list_pieces: # for each book
    polarity_piece = [] # create empty list to hold polarity values for the book
    for p in lp: # for each piece per that book
        polarity_piece.append(TextBlob(p).sentiment.polarity) # add the polarity to the per-book list
    polarity_writing.append(polarity_piece) # add the per-book list to overall polarity list
    
polarity_writing

In [None]:
# Show the plot for one writer
plt.plot(polarity_writing[0])
plt.title(data['full_name'].iloc[0]) # use the writer's full name (the first value), not the index label
plt.show()

In [None]:
# Show the plot for all writers
plt.rcParams['figure.figsize'] = [16, 12]

for index, writer in enumerate(data.index):
    plt.subplot(3, 4, index+1)
    plt.plot(polarity_writing[index])
    plt.plot(np.arange(0, 40), np.zeros(40)) # zero line spanning all 40 pieces
    plt.title(data['full_name'].loc[writer])
    plt.ylim(-.2, .3) # positional limits work across matplotlib versions
    
plt.show()

## Findings

- **Positivity in outer space** - As we can see, Asimov's and Hawking's graphs look similar to each other. Both open with a big positive spike followed by a fall, both generally stay on the positive side, and both change sentiment in small fluctuations. The only exception is Asimov's plot twist around piece 30 on the x-axis. This overall similarity is surprising given how differently the two books rated on the facts-vs-feelings scale. However, the shared positivity could be influenced by the topic of the universe and its wonders, about which both authors were passionate.


- **What happened in *A Clockwork Orange*?** - Evidently, something horrible happened in the very middle of the book. We can see a huge spike into negativity. I guess we'll have to read to find out!


- **Carrie** - It was surprising to see *Carrie* above the zero line. However, the book is still more negative overall than the others, which matches its position on the overall-sentiment scatterplot.


- **Rethinking literature?** - While learning literature in middle school and high school, I was always taught that each plot is like a pyramid: leading uphill towards one culmination in the middle, and falling downhill into a resolution. However, as we can see on these graphs, spikes of narrative happen all over the place. For example, the middle of *The Hobbit* has even and consistent spikes, with the key spikes at the beginning and end instead. Asimov, Hemingway, and Vonnegut follow a relatively similar pattern. The only writings that shift dramatically in the middle are horror writings like *Carrie* and *A Clockwork Orange*, of which the latter is arguably the most extravagant of all writings included here.
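
Observations like "a spike in the very middle of the book" are read off the graphs by eye. A small helper can locate such a spike directly in a per-book list of polarities. This is only a sketch: the function name `most_negative_piece` is hypothetical, and the example values are made up rather than taken from any of the books.

```python
def most_negative_piece(polarities):
    """Return (piece_index, polarity, position) for the lowest-polarity piece.

    position is the piece's midpoint as a fraction of the book (0.0 = start, 1.0 = end).
    """
    idx = min(range(len(polarities)), key=lambda i: polarities[i])
    position = (idx + 0.5) / len(polarities)
    return idx, polarities[idx], position

# Made-up polarity values for illustration only, not results from the analysis
example = [0.10, 0.05, -0.20, 0.08]
print(most_negative_piece(example))  # (2, -0.2, 0.625)
```

Passing one of the per-book lists in `polarity_writing` would return which of the 40 pieces is the most negative and roughly how far into the book it falls.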


# Next up - Topic Modeling!