# Project: Literature Analysis

### Reading is great. And with so many amazing books out there also come great movies, reviews, and summaries. Those reviews and films often give us only a partial picture of what the book is actually like, though. With the power of data science and natural language processing, I can bring another dimension to how we understand literature.

For this project, I am looking at the following eight writings:
* **Foundation by Isaac Asimov** - a book I am currently reading, by my favorite sci-fi writer
* **A Clockwork Orange by Anthony Burgess** - the novel behind a famous, extravagant horror movie by Stanley Kubrick, a book with a unique writing style and vocabulary
* **Comments on the Society of the Spectacle by Guy Debord** - a continuation of a book I was taught at university about the influence of capitalist media on society
* **A Brief History of Time by Stephen Hawking** - a book that excited millions about the workings of our universe
* **For Whom the Bell Tolls by Ernest Hemingway** - a novel with a distinctive writing style and themes specific to American writers
* **Carrie by Stephen King** - one of the most well-known horror novels out there
* **The Hobbit by J.R.R. Tolkien** - a very long journey by very short people, one that so many people and communities hold dear to their hearts
* **Slaughterhouse-Five by Kurt Vonnegut** - a book highly recommended to me

# Exploratory Data Analysis

Now that we have our clean data, we will use it to do some exploratory data analysis.


We are going to look at the following for each writer:

1. **Most common words** - find these and create word clouds
2. **Size of vocabulary** - look at the number of unique words and compare the authors' vocabularies and book lengths


## Outline

1. Most common words - **Word Clouds**
    - Find top 30 words said by each author
    - Exclude words that appear in more than 50% of the books
    - Update the document-term matrix - add these common words to stop words
    - Create word cloud using WordCloud and matplotlib libraries
    
    
2. **Vocabulary and Length - Unique and Total** words
    - Find non-zero items in the document-term matrix and input the numbers into a new dataframe
    - Find the total number of words that a writer uses 
    
    
3. **Bar-plot** and **Scatter-plot** findings
    - Make a bar-plot of each author's unique and total words using numpy and matplotlib
    - Make a scatter-plot of Book-Length vs Vocabulary using matplotlib
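
To make steps 1 and 2 of the outline concrete, here is a minimal sketch on toy data. The writers `writer_a` and `writer_b`, the words, and the counts are all made up for illustration; the real matrix is read from `dtm.pkl` in the cells that follow.

```python
import pandas as pd

# Toy term-document matrix: rows are words, columns are (made-up) writers.
toy = pd.DataFrame(
    {"writer_a": [5, 0, 2, 1], "writer_b": [3, 4, 0, 0]},
    index=["like", "planet", "said", "ring"],
)

# Step 1: the n most frequent words per writer (n=2 here, 30 in the notebook).
top_words = {
    writer: toy[writer].sort_values(ascending=False).head(2).index.tolist()
    for writer in toy.columns
}

# Step 2: vocabulary size = number of words with a non-zero count,
# length = sum of all word counts.
summary = pd.DataFrame({
    "unique_words": (toy != 0).sum(),
    "total_words": toy.sum(),
})

print(top_words)
print(summary)
```

The same two column-wise operations, a non-zero count and a sum, are what the later cells apply to the real matrix.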
    

## Most common words

In [None]:
# Read in the document-term matrix
import pandas as pd

data = pd.read_pickle('dtm.pkl')
data = data.transpose() # transpose into a term-document matrix
data.head()

In [None]:
# Find the top 30 words said by each writer
top_dict = {} 
for c in data.columns: # for each writer (represented by a column)
    top = data[c].sort_values(ascending=False).head(30) # sort the words (values of rows) in descending order and take first 30
    top_dict[c] = list(zip(top.index, top.values)) # put the top 30 words into a dictionary

top_dict

In [None]:
# Print the top 15 words said by each writer
for writer, top_words in top_dict.items(): # for each writer and their top words in the dictionary
    print(writer)
    print(', '.join([word for word, count in top_words[:15]])) # format and print the top 15 (the slice end is exclusive)
    print('---')

**NOTE:** At this point, we could go on and create word clouds. However, by looking at these top words, we can see that some of them have very little meaning and could be added to a stop words list, so let's do just that.
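
As a minimal sketch of that idea, assuming four made-up books with hypothetical top-word lists (the names `toy_top`, `book_counts`, and `shared` are mine, not part of the analysis), the words shared by more than half of the books can be found like this:

```python
from collections import Counter

# Hypothetical top-word lists for four made-up books.
toy_top = {
    "book_a": ["said", "like", "planet"],
    "book_b": ["said", "like", "ring"],
    "book_c": ["said", "time", "universe"],
    "book_d": ["like", "war", "said"],
}

# Count in how many books each word shows up as a top word.
# Each list holds a word at most once, so the count equals the number of books.
book_counts = Counter(word for words in toy_top.values() for word in words)

# Keep the words that are a top word in more than half of the books.
threshold = len(toy_top) / 2
shared = sorted(word for word, n in book_counts.items() if n > threshold)

print(shared)
```

Here 'said' and 'like' are top words in most of the toy books, so they say little about any single one and are good stop word candidates.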

In [None]:
# Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 30 words for each writer - basically reformat into a list
words = [] # create a new blank list
for writer in data.columns: 
    top = [word for (word, count) in top_dict[writer]] # take from our previously created top_dict
    for t in top: # take the words from previously created word list called top
        words.append(t) 
        
words

In [None]:
# Let's aggregate this list and identify the most common words along with how many books they occur in
Counter(words).most_common()

In [None]:
# If more than half of the writers have it as a top word, exclude it from the list
add_stop_words = [word for word, count in Counter(words).most_common() if count > 4]
add_stop_words

In [None]:
# Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data
data_clean = pd.read_pickle('data_clean.pkl')

# Add new stop words (converted to a list, the documented type for CountVectorizer's stop_words)
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Recreate document-term matrix
cv = CountVectorizer(stop_words=stop_words) # create a new instance of CountVectorizer object
data_cv = cv.fit_transform(data_clean.writing) # fit and transform to learn the vocabulary and encode each document with a vector
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out()) # actually recreate the matrix (get_feature_names_out needs scikit-learn >= 1.0)
data_stop.index = data_clean.index # reuse the writers as the index

# Pickle it for later use
import pickle
with open("cv_stop.pkl", "wb") as f: # the with-block closes the file when done
    pickle.dump(cv, f)
data_stop.to_pickle("dtm_stop.pkl")

In [None]:
# Let's make some word clouds!
# In order to run need - Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)

In [None]:
# Reset the output dimensions
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [16, 6]

full_names = ['Isaac Asimov', 'Anthony Burgess', 'Guy Debord', 'Stephen Hawking', 'Ernest Hemingway', 
              'Stephen King', 'J.R.R. Tolkien', 'Kurt Vonnegut']


# Create subplots for each writer
for index, writer in enumerate(data.columns):
    wc.generate(data_clean.writing[writer])
    
    plt.subplot(2, 4, index+1) # 2 rows x 4 columns fits the eight writers
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(full_names[index])
    
plt.show()

## Findings

### Isaac Asimov
- **Hardin vs Mallow** - The story seems to be divided very evenly between Hardin and Mallow
- **Anacreon vs Terminus** - Anacreon, a smaller but rapidly developing planet, seems to be much more of a concern or interest than Terminus, its more developed neighbor that is seemingly central to the story

### Anthony Burgess
- **Root of confusion** - The word cloud is about as confusing as the actual book (no offense, Anthony). One reason could be that, other than Brother, the most-used words are not English

### Guy Debord
- **Facts** - The word 'fact' appears a lot, while critics find little to no facts in Debord's writings
- **New vs History** - Debord seems to focus more on 'new' than 'history'

### Stephen Hawking
- **Citing himself?** - Stephen Hawking seems to mention himself often enough to appear in the word cloud! And he has a reason for that... After all, he was arguably the greatest astrophysicist in the world. So is he admitting that too?

### Ernest Hemingway
- **Balance** - Hemingway seems to have the most balance among his most-used words - no single word or character dominates the others. What does that say about his writing?

### Stephen King
- In contrast, Stephen King seems to be pretty straight-forward with his themes
- Yet his writing is so great!

### J.R.R. Tolkien
- **Sorry Gandalf...** But the goblin seems to be more popular!
- There seems to be **more dark than light**
- And the way seems to be **long, great, and far**

### Kurt Vonnegut
- **Billy** seems to be the star of the story, along with **America** and **War**

### Although it may feel like we know these popular authors from the book-based movies, summaries, and reports, a relatively simple data analysis can show that there is much more to these authors. 
### Evidently, even in fantasy, horror, and science fiction authors show their personalities and quirks.

## Number of words - Unique and Total

In [None]:
# choose whichever option of full_names below makes the graphs easier to understand
# full_names = ['Asimov - Foundation', 'Burgess - A Clockwork Orange', 'Debord - Comments on the Society of the Spectacle', 'Hawking - A Brief History of Time', 'Hemingway - For Whom the Bell Tolls', 'King - Carrie', 'Tolkien - The Hobbit', 'Vonnegut - Slaughterhouse-Five']
full_names = ['Isaac Asimov', 'Anthony Burgess', 'Guy Debord', 'Stephen Hawking', 'Ernest Hemingway', 'Stephen King', 'J.R.R. Tolkien', 'Kurt Vonnegut']


# Find the number of unique words that each writer uses

# Identify the non-zero items in the document-term matrix, meaning that the word occurs at least once
unique_list = []
for writer in data.columns:
    uniques = int((data[writer] != 0).sum()) # number of unique words = number of rows with a non-zero count (Series.nonzero() was removed in pandas 1.0)
    unique_list.append(uniques) # add this writer's unique word count to unique_list

# Create a new dataframe that contains this unique word count
data_words = pd.DataFrame(list(zip(full_names, unique_list)), columns=['writer', 'unique_words']) # create the data frame
data_unique_sort = data_words.sort_values(by='unique_words') # sort by vocabulary size in ascending order
data_unique_sort = data_unique_sort.set_index('writer') # make writer name the index instead of the writers' order number
data_unique_sort

In [None]:
# Find the total number of words that a writer uses
total_list = []
for writer in data.columns: # for each writer (represented by columns of data)
    totals = data[writer].sum() # sum the counts of all words (rows of data) for this writer
    total_list.append(totals) # add each writer's total to total_list

data_words['total_words'] = total_list # add the totals as a new column next to unique_words

# Sort by the length of the writing
data_total_sort = data_words.sort_values(by='total_words') # sort by length of the writing in ascending order
data_total_sort = data_total_sort.set_index('writer') # set writer as the index of the dataframe to avoid confusion
data_total_sort

In [None]:
# Let's plot our findings
import numpy as np

y_pos = np.arange(len(data_words))

plt.subplot(1, 2, 1)
plt.barh(y_pos, data_unique_sort.unique_words, align='center')
plt.yticks(y_pos, data_unique_sort.index)
plt.title('Number of Unique Words', fontsize=20)

plt.subplot(1, 2, 2)
plt.barh(y_pos, data_total_sort.total_words, align='center')
plt.yticks(y_pos, data_total_sort.index)
plt.title('Number of Total Words', fontsize=20)

plt.tight_layout()
plt.show()

In [None]:
# Let's create a scatter plot of our findings
plt.rcParams['figure.figsize'] = [10, 8]

for writer in data_total_sort.index:
    x = data_total_sort.total_words.loc[writer] # total words found in data_total_sort table, total_words column
    y = data_total_sort.unique_words.loc[writer] # unique words found in data_total_sort table, unique_words column
    plt.scatter(x, y, color='blue')
    plt.text(x+1.5, y+0.5, writer, fontsize=10) # label each dot with the writer's name

plt.xlim(0, 40000) # set the x-axis range once, outside the loop
    
plt.title('Vocabulary to Book Length Ratios', fontsize=20) # name the graph
plt.xlabel('Book Length', fontsize=15) # set name for x axis
plt.ylabel('Vocabulary', fontsize=15) # set name for y axis

plt.show()

## Findings

- **Big universe and small vocabulary** - Surprisingly, Stephen Hawking is on the low end of vocabulary sizes. I expected his book 'A Brief History of Time' to have many technical terms, but it ended up having a low number of unique words. However, it does make sense! The purpose of this book was to explain complex astrophysics concepts in simple and understandable language, which Stephen Hawking indeed did a good job of!


- **Not enough Russian words** - It was also surprising to me to see Anthony Burgess' A Clockwork Orange on the slightly lower end of vocabulary size compared to the others. Even with so many extravagant Russian-derived slang words in the book, its vocabulary still could not top Hemingway's.


- **Horror & Sci-fi Champions** - Again surprisingly, Carrie (by Stephen King) and Foundation (by Isaac Asimov) had similarly wide vocabularies and lengths despite such different genres. Also, such a high vocabulary count for King was a contrast to his word cloud. Looking at the word cloud (with leading words Carrie, momma, eye, hand), I thought his vocabulary would be pretty narrow, but, in reality, Stephen King got to the very top, almost beating Asimov's space-travel vocabulary.


- **Long journey, medium vocab** - Despite having the longest journey, The Hobbit had a relatively medium-sized vocabulary. Just like in A Clockwork Orange, I imagined that The Hobbit's lexicon would be full of world-specific words, but it ended up being just a tiny bit above the average, ranking almost equal to Kurt Vonnegut's vocabulary size.
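
Since longer books have more room for new words, one way to compare vocabularies more fairly is to divide unique words by total words. Below is a minimal sketch of that ratio. The two rows and their numbers are made up for illustration and are not results from this analysis; `toy_words` and `vocab_ratio` are names I picked for the example.

```python
import pandas as pd

# Made-up counts for illustration only; not results from this analysis.
toy_words = pd.DataFrame(
    {"unique_words": [3000, 4500], "total_words": [20000, 36000]},
    index=["short_book", "long_book"],
)

# Share of distinct words among all words (often called the type-token ratio).
toy_words["vocab_ratio"] = toy_words["unique_words"] / toy_words["total_words"]

print(toy_words)
```

In this toy case the longer book has more unique words in absolute terms but a lower ratio, which is the kind of difference the scatter plot only hints at.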


### As we can see, more data analysis can not only bring new insights but also alter previous judgements.
### To me, this was a good reminder to not judge a book by its... wordcloud. 
### To keep learning and to never stop questioning.

# Next up - Sentiment Analysis!