# Project: Literature Analysis

### Reading is great. And with so many amazing books out there also come great movies, reviews, and summaries. Reading those reviews and watching those films often gives us only a partial picture of what the book is actually like, though. With the power of data science and natural language processing, I am able to bring another dimension to how we understand literature.

For this project, I am looking at the following eight writings:
* **Foundation by Isaac Asimov** - a book I am currently reading, by my favorite sci-fi writer
* **A Clockwork Orange by Anthony Burgess** - the novel behind a famous, extravagant horror movie by Stanley Kubrick, a book with a unique writing style and vocabulary
* **Comments on the Society of the Spectacle by Guy Debord** - a continuation of a book I was taught at university about the influence of capitalist media on society
* **A Brief History of Time by Stephen Hawking** - a book that excited millions about the workings of our universe
* **For Whom the Bell Tolls by Ernest Hemingway** - a novel with a distinctive writing style and themes specific to American writers
* **Carrie by Stephen King** - one of the best-known horror novels out there
* **The Hobbit by J.R.R. Tolkien** - a very long journey by very short people, one that so many people and communities hold dear to their heart
* **Slaughterhouse-Five by Kurt Vonnegut** - a book highly recommended to me

# Topic Modeling

In this notebook, we will cover the steps for running **Latent Dirichlet Allocation (LDA)**, one of many topic modeling techniques and one that was specifically designed for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.
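
To make the first input concrete, here is a minimal sketch of a document-term matrix built with only the standard library. The three toy documents and the helper name `build_dtm` are made up for illustration; the notebook itself loads a matrix that was prepared earlier with scikit-learn's `CountVectorizer`.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))  # alphabetical column order
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Toy documents (hypothetical, not taken from the eight books)
docs = ["war love war", "love road", "road bridge road"]
vocabulary, dtm = build_dtm(docs)

print(vocabulary)
for row in dtm:
    print(row)
```

Each row counts how often every vocabulary term occurs in one document. The second input, the number of topics, is simply an integer chosen by the analyst.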

Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic makes sense.

- *I found LDA to be a good choice of tool for this project due to the interpretability of its topics. Working with large and complex pieces of data like literary works, I wanted to see if I as a reader could pick up latent/hidden topics as a result of my data analysis, which was successful (as shown in the Insights section).*


1. Topic modeling based on the **entire** original document-term matrix (all parts of speech)
    - Input the document-term matrix, transpose it, and transform it into the gensim corpus format required by LDA (Latent Dirichlet Allocation)
    - Create a dictionary of all terms and their respective locations in the term-document matrix
    - Specify the number of topics and passes, and run the LDA model
    
    
2. **Extract Parts of Speech** for topic model
    - *I chose to use parts-of-speech extraction mainly to improve the results of my LDA model*
    - **Nouns only**
        - *Filtering and using only nouns for our LDA model could be an improvement simply because topics themselves (like “war” or “love”) are nouns. For example, in the sentence “The war was dreadful and unforgiving, but it could not destroy their love” we could simply extract the nouns “war” and “love”, showing the sentence’s themes much more clearly.*

        - Create a function to tokenize given text and extract only the nouns
        - Input the clean data
        - Apply the noun-filtering function
        - Create a new document-term matrix using only nouns
        - Create a new gensim corpus (based on the new document-term matrix)
        - Create a new vocabulary dictionary
        - Test the LDA model, gradually increasing the number of topics
            - *Starting off with fewer topics and gradually increasing their number helped me track down at what point my LDA model starts giving redundant results (at what point and how its performance deteriorates) so that I can see what to fix*
     
    - **Nouns and adjectives**
        - Repeat the above process with nouns & adjectives
            - *At the same time, adjectives could also be useful for making a better term-document matrix for our LDA model. For example, in the sentence “Her face was bloody and demonic”, words like “bloody” and “demonic” hint at the themes much better than the word “face”.*
        
    - **Final Model**
        - Take the most recent noun+adjective function, set the topic number to 5 and pass number to 100
            - *After gradually increasing the number of topics and the number of passes, 5 topics and 100 passes ended up giving the topic clusters that made the most sense in connection to the books’ known themes*
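
The part-of-speech filtering described in step 2 can be sketched on the example sentence from the outline. To keep the sketch runnable without NLTK's tagger data, the Penn Treebank tags below are written by hand as a stand-in for what `pos_tag` would return (an assumption), and `keep_tags` is a hypothetical helper name.

```python
def keep_tags(tagged_tokens, prefixes):
    """Keep words whose Penn Treebank tag starts with one of the prefixes."""
    return [word for word, tag in tagged_tokens if tag[:2] in prefixes]

# Hand-written tags standing in for nltk.pos_tag output (an assumption)
tagged = [
    ("The", "DT"), ("war", "NN"), ("was", "VBD"), ("dreadful", "JJ"),
    ("and", "CC"), ("unforgiving", "JJ"), ("but", "CC"), ("it", "PRP"),
    ("could", "MD"), ("not", "RB"), ("destroy", "VB"), ("their", "PRP$"),
    ("love", "NN"),
]

nouns_only = keep_tags(tagged, {"NN"})
nouns_and_adjectives = keep_tags(tagged, {"NN", "JJ"})
print(nouns_only)
print(nouns_and_adjectives)
```

Matching on the first two characters of the tag means that plural and proper nouns (`NNS`, `NNP`, `NNPS`) and comparative or superlative adjectives (`JJR`, `JJS`) are kept as well.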

## Topic Modeling - Attempt #1 (All Text)

In [None]:
# Let's read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm_stop.pkl')
data

In [None]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

# Optional:
# import logging # helps debug - logs every step
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
# One of the required inputs is a term-document matrix
tdm = data.transpose() 
tdm.head()

In [None]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm) # create a sparse matrix from the term-document matrix
corpus = matutils.Sparse2Corpus(sparse_counts) # create a gensim corpus from the sparse matrix

In [None]:
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
with open("cv_stop.pkl", "rb") as f: # open the file in a context manager so it is closed afterwards
    cv = pickle.load(f) # load the pickled count vectorizer object
id2word = dict((v, k) for k, v in cv.vocabulary_.items()) # create a dictionary with location keys and word values

Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term), we need to specify two other parameters - the number of topics and the number of passes. Let's start the number of topics at 2, see if the results make sense, and increase the number from there.
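
As a rough illustration of the two inputs named above, the sketch below inverts a word-to-column vocabulary into an `id2word` mapping and turns a small term-document matrix into the bag-of-words form that gensim iterates over, where each document is a list of `(term_id, count)` pairs. The vocabulary and counts are invented, and `to_bow_corpus` is a hypothetical stand-in for what `Sparse2Corpus` provides, not its implementation.

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_ (word -> column)
vocabulary = {"bridge": 0, "love": 1, "road": 2, "war": 3}
id2word = {index: word for word, index in vocabulary.items()}

# Term-document matrix: rows are terms, columns are documents
tdm = [
    [0, 0, 1],  # bridge
    [1, 1, 0],  # love
    [0, 1, 2],  # road
    [2, 0, 0],  # war
]

def to_bow_corpus(term_document_matrix):
    """Convert each column (document) into a list of (term_id, count) pairs."""
    n_docs = len(term_document_matrix[0])
    return [
        [(term_id, row[doc])
         for term_id, row in enumerate(term_document_matrix)
         if row[doc] > 0]
        for doc in range(n_docs)
    ]

corpus = to_bow_corpus(tdm)
for doc in corpus:
    print([(id2word[term_id], count) for term_id, count in doc])
```

This is also why the matrix is transposed first: with terms as rows, each column corresponds to one document.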

In [None]:
# Specify two other parameters - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

# Oh wow! I expected it to do much worse!

In [None]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

In [None]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

I found this model to be a decent start. However, the results sometimes contain quite a few verbs (*come, thought, dont, going, went, got*) that clutter up the word groups. These verbs carry little thematic meaning, and without them the topics might have been easier to interpret. That is why I chose to filter by part of speech and see whether that improves the models.
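
One possible way to put a number on that clutter is to parse the topic strings and measure what share of each topic's listed words are unwanted verbs. The sketch below assumes topic strings shaped like the `print_topics()` output; the example string is made up, and `parse_topic` and `clutter_share` are hypothetical helper names.

```python
import re

# Matches pairs such as 0.010*"come", with or without spaces around the asterisk
TERM_PATTERN = re.compile(r'([\d.]+)\s*\*\s*"([^"]+)"')

def parse_topic(topic_string):
    """Split a topic string into (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

def clutter_share(topic_string, clutter_words):
    """Fraction of a topic's listed words that appear in clutter_words."""
    words = [word for word, _ in parse_topic(topic_string)]
    return sum(word in clutter_words for word in words) / len(words)

# Made-up topic string, not an actual result from this notebook
example = '0.010*"come" + 0.008*"thought" + 0.007*"bilbo" + 0.005*"went"'
verbs = {"come", "thought", "dont", "going", "went", "got"}

print(parse_topic(example))
print(clutter_share(example, verbs))
```

A lower share after part-of-speech filtering would support the impression gained from reading the topics by eye.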

## Topic Modeling - Attempt #2 (Nouns only)

In [None]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN' # lambda function determining if part of speech is noun
    tokenized = word_tokenize(text) # tokenize the text
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] # if the word is noun in the tokenized text, add the word to the list of nouns
    return ' '.join(all_nouns) # return a string with all elements of noun list joined

In [None]:
# Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

In [None]:
# Apply the nouns function to the writings to filter only on nouns
data_nouns = pd.DataFrame(data_clean.writing.apply(nouns))
data_nouns

In [None]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['chapter', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'think', 'yeah', 'said', 'i']
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words)) # recent scikit-learn releases expect stop_words as a list, not a frozenset

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words) # create a count vectorizer for nouns excluding stop words
data_cvn = cvn.fit_transform(data_nouns.writing) # fit the vectorizer onto the data
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out()) # convert into a DataFrame; get_feature_names() was removed in scikit-learn 1.2
data_dtmn.index = data_nouns.index # label the rows with the original document index
data_dtmn

In [None]:
# Re-create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Re-create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [None]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

In [None]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

# My first attempt to run this cell yielded pretty good results

In [None]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

# However, after 3 topics the program started getting a bit more confused again

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [None]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ' # lambda function determining if part of speech is noun or adjective
    tokenized = word_tokenize(text) # tokenize the text
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] # if the word is noun/adjective in the tokenized text, add the word to the list of nouns&adjectives
    return ' '.join(nouns_adj) # return a string with all elements of noun&adjective list joined

In [None]:
# Apply the function to the books to filter nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.writing.apply(nouns_adj))
data_nouns_adj

In [None]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.writing)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
data_dtmna.index = data_nouns_adj.index
data_dtmna

In [None]:
# Re-create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Re-create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [None]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

In [None]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

In [None]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

In [None]:
# Let's try 5 topics, this time with 30 passes
ldana = models.LdaModel(corpus=corpusna, num_topics=5, id2word=id2wordna, passes=30)
ldana.print_topics()

Out of the 10 topic models we looked at, the 5-topic model built on nouns and adjectives made the most sense. So let's pull that down here and run it through more passes to get more fine-tuned topics.

In [None]:
# Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=5, id2word=id2wordna, passes=100)
ldana.print_topics()

### Result I got:
    
 (0,
  '0.009 * "carrie" + 0.006 * "brothers" + 0.004 * "momma" + 0.004 * "veck" + 0.004 * "door" + 0.004 * "school" + 0.003 * "dim" + 0.003 * "bed" + 0.003 * "horrorshow" + 0.003 * "malenky"'),
  
 (1,
  '0.013 * "jordan" + 0.012 * "bilbo" + 0.010 * "robert" + 0.009 * "dwarves" + 0.006 * "pilar" + 0.006 * "road" + 0.006 * "thee" + 0.006 * "pablo" + 0.006 * "bridge" + 0.006 * "thorin"'),
  
 (2,
  '0.010 * "spectacle" + 0.006 * "spectacular" + 0.003 * "social" + 0.003 * "media" + 0.003 * "disinformation" + 0.003 * "false" + 0.002 * "mafia" + 0.002 * "services" + 0.002 * "information" + 0.002 * "example"'),
  
 (3,
  '0.018 * "universe" + 0.009 * "theory" + 0.008 * "foundation" + 0.007 * "brief" + 0.006 * "particles" + 0.006 * "hardin" + 0.005 * "energy" + 0.005 * "mallow" + 0.004 * "seldon" + 0.004 * "star"'),
  
 (4,
  '0.000 * "carrie" + 0.000 * "jordan" + 0.000 * "robert" + 0.000 * "road" + 0.000 * "thee" + 0.000 * "momma" + 0.000 * "universe" + 0.000 * "bilbo" + 0.000 * "door" + 0.000 * "brothers"')
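
Because the listing above prints weights rounded to three decimals, a topic whose weights all show as 0.000 (topic 4 here) puts very little probability on any single word. A small check like the following can flag such topics. It assumes the `(topic_id, [(word, weight), ...])` shape that gensim's `show_topics(formatted=False)` returns; the data is a shortened, hand-copied excerpt, and `flat_topics` is a hypothetical helper name.

```python
def flat_topics(topics, threshold=0.0005):
    """Return ids of topics whose largest listed weight is below the threshold."""
    return [
        topic_id
        for topic_id, terms in topics
        if max(weight for _, weight in terms) < threshold
    ]

# Shortened excerpt of the result above; the exact weights for topic 4 are
# unknown, so 0.0 stands in for the printed 0.000 (an assumption)
topics = [
    (3, [("universe", 0.018), ("theory", 0.009)]),
    (4, [("carrie", 0.0), ("jordan", 0.0)]),
]

print(flat_topics(topics))
```

The default threshold of 0.0005 is the largest value that would still be printed as 0.000 at three decimals.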

Indeed, the final model with 5 topics and 100 passes yielded much better results than the earlier noun-and-adjective runs with fewer passes, and even the noun-only model. As we can see above, the results include more meaningful words (including adjectives), draw words from more books, and cover a much wider variety of topics.
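
The comparison between models in this notebook was made by reading the topics. As a complement, one possible way to quantify how redundant a model's topics are is the average pairwise Jaccard overlap between their top-word sets. This is a sketch of a swapped-in heuristic rather than anything the notebook computes, and the word sets below are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of words two sets have in common, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b)

def mean_topic_overlap(topics):
    """Average pairwise Jaccard overlap across a model's top-word sets."""
    pairs = list(combinations(topics, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical top words for two candidate models
distinct_model = [{"universe", "theory"}, {"bilbo", "dwarves"}, {"spectacle", "media"}]
redundant_model = [{"door", "road"}, {"door", "road"}, {"door", "bed"}]

print(mean_topic_overlap(distinct_model))
print(mean_topic_overlap(redundant_model))
```

A value near 0 suggests the topics are distinct, while a rising value as the number of topics grows would point to the redundancy described in the outline.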

## Findings

After a long time iterating on the LDA model, I ended up with five topics that look pretty decent. Let's settle on these for now.
* **Topic 1: horror:** carrie, brothers, momma, school, horrorshow
* **Topic 2: hobbit's world:** bilbo, dwarves, three, goblin, mountain, hobbit
* **Topic 3: media:** spectacle, disinformation, false, mafia, services
* **Topic 4: outer space:** universe, theory, foundation, particles, hardin, energy, star
* **Topic 5: the way somewhere:** bridge, road, door

Overall, the topics seem to make sense, as they are mainly divided by genres like sci-fi, horror, fantasy, and non-fiction. However, in addition to books overlapping in their genres, one major visible overlap between all books seems to be “the way somewhere”. I find this interesting because, while the concept of a journey in stories makes sense to humans, a computer may not start from the assumption that books (even non-fiction or scientific reports) are overall descriptions of some kind of journey (*road*) through obstacles (*bridge, door*) rather than random collections of facts or events. Through such relatively simple data analysis, a program has a chance to gain a deeper understanding of literature - its essence and (in the future) possibly even its appeal.


# Next up - Text Generation