# Breaking Apart Shakespeare

## Executive Summary

For this project, I use the Natural Language Toolkit (NLTK) to break down Shakespeare's plays Macbeth and Hamlet. The primary purpose is to demonstrate how to use NLTK on text documents. Along the way, I create a model to identify Shakespeare's writing style, specifically the types of words used in these works. This is done with two NLTK techniques, part-of-speech tagging and chunking. In addition to model creation, K-means clustering is performed on both plays. The end goal is to compare model creation and clustering as methods for working with text files.

In [39]:
import io
import re
import warnings
from collections import Counter, OrderedDict
from string import punctuation

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import gutenberg, stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

%matplotlib inline
warnings.filterwarnings("ignore", category=DeprecationWarning)

### Data Cleaning

In [40]:
from nltk.corpus import gutenberg
nltk.download('gutenberg')
import re
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package gutenberg to C:\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [54]:
mac = gutenberg.raw('shakespeare-macbeth.txt')
ham = gutenberg.raw('shakespeare-hamlet.txt')

In [56]:
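# Raw string so the backslashes reach the regex engine unchanged.
# Matches any text enclosed in square brackets, such as the title header.
pattern = r"\[.*?\]"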
mac = re.sub(pattern, "", mac)
ham = re.sub(pattern, "", ham)

In [57]:
mac = re.sub(r'Chapter \d+', '', mac)
ham = re.sub(r'Chapter \d+', '', ham)

In [58]:
mac = ' '.join(mac.split())
ham = ' '.join(ham.split())
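
As a rough illustration of what the cleaning cells above do, the sketch below applies the same bracket removal and whitespace collapsing to a short made-up string. The helper name `clean_text` and the sample string are assumptions for illustration and are not part of the notebook.

```python
import re

def clean_text(raw):
    """Remove bracketed spans, then collapse runs of whitespace (hypothetical helper)."""
    no_brackets = re.sub(r"\[.*?\]", "", raw)
    # split() with no arguments splits on any whitespace, including newlines.
    return " ".join(no_brackets.split())

# Made-up sample in the style of the Gutenberg file header.
sample = "[The Tragedie of Macbeth]\n\nActus Primus.   Scoena Prima."
cleaned = clean_text(sample)
print(cleaned)
```

Collapsing whitespace this way also removes line breaks, so any later step that relies on line structure would need the raw text instead.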

### Tokenization

Tokenization extracts meaningful segments from the text by breaking it up, first into sentences and then into words.
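
To make the sentence-to-word breakdown concrete, here is a minimal sketch that uses only regular expressions. It is a simplified stand-in for NLTK's tokenizers, not a reimplementation of them; the function names `split_sentences` and `split_words` are hypothetical.

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at '.', '?' or '!' followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]

def split_words(sentence):
    # Runs of word characters or apostrophes are words;
    # any other non-space character becomes its own token.
    return re.findall(r"[\w']+|[^\w\s]", sentence)

text = "When shall we three meet againe? In Thunder, Lightning, or in Raine?"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]
print(tokens)
```

This naive rule would break on abbreviations such as speaker labels ("Mal."), which is one reason a trained sentence tokenizer is preferable for real text.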

In [53]:
# Tokenize Macbeth into words, then keep only the tokens that are not stop words.
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(mac)

filtered_sentence = [w for w in word_tokens if w not in stop_words]

#print(word_tokens)
#print(filtered_sentence)

In [50]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
macbeth_sentences
# [['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare',
#   '1603', ']'], ['Actus', 'Primus', '.'], ...]
macbeth_sentences[1116]
# ['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';',
#  'Fire', 'burne', ',', 'and', 'Cauldron', 'bubble']
longest_len = max(len(s) for s in macbeth_sentences)
# Sentences whose token count equals the maximum.
[s for s in macbeth_sentences if len(s) == longest_len]

[['Doubtfull',
  'it',
  'stood',
  ',',
  'As',
  'two',
  'spent',
  'Swimmers',
  ',',
  'that',
  'doe',
  'cling',
  'together',
  ',',
  'And',
  'choake',
  'their',
  'Art',
  ':',
  'The',
  'mercilesse',
  'Macdonwald',
  ...]]

### Stopwords

The following process removes stop words, the very common English words (such as "the", "and", and "of") that carry little meaning on their own.

In [13]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [14]:
stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(mac)

# Keep only the tokens that are not in the stop-word set.
filtered_sentence = [w for w in word_tokens if w not in stop_words]
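
One detail worth noting is that NLTK's English stop-word list is lowercase, so a capitalized token such as "The" passes through a case-sensitive filter. The sketch below shows a case-insensitive variant. The tiny `STOP_WORDS` set and the helper `remove_stop_words` are assumptions for illustration; the real list is much longer.

```python
# Tiny illustrative stop list (an assumption); NLTK's English list is much longer.
STOP_WORDS = {"the", "and", "of", "in", "or", "we", "shall", "when"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Lowercase before the membership test so "When" and "The" are also dropped.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["When", "shall", "we", "three", "meet", "againe", "?"]
kept = remove_stop_words(tokens)
print(kept)
```

Punctuation tokens such as "?" survive this filter; they would need a separate check if they are unwanted.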

### Part of Speech Tagging

Part-of-speech tagging labels each word in a sentence as a noun, verb, adjective, and so on. I used Hamlet to train a sentence tokenizer, applied it to Macbeth, and then wrote a function that tags the part of speech of every word in each sentence. The tagged output shows the types of words commonly found in Shakespeare's work.

In [15]:
train_text = gutenberg.raw('shakespeare-hamlet.txt')
sample_text = gutenberg.raw('shakespeare-macbeth.txt')

In [16]:
from nltk.tokenize import PunktSentenceTokenizer

# Train the Punkt sentence tokenizer on Hamlet.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [17]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [18]:
def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))


process_content()

[('[', 'IN'), ('The', 'DT'), ('Tragedie', 'NNP'), ('of', 'IN'), ('Macbeth', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Shakespeare', 'NNP'), ('1603', 'CD'), (']', 'NNP'), ('Actus', 'NNP'), ('Primus', 'NNP'), ('.', '.')]
[('Scoena', 'NNP'), ('Prima', 'NNP'), ('.', '.')]
[('Thunder', 'NN'), ('and', 'CC'), ('Lightning', 'NNP'), ('.', '.')]
[('Enter', 'NNP'), ('three', 'CD'), ('Witches', 'NNP'), ('.', '.')]
[('1', 'CD'), ('.', '.')]
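
One way to see which tags dominate is to count them. The sketch below tallies the tags of three of the tagged sentences printed above with `collections.Counter`; the variable names are assumptions, and a sample this small only illustrates the counting step rather than supporting any claim about the full plays.

```python
from collections import Counter

# Three of the tagged sentences from the output above.
tagged_sentences = [
    [("Scoena", "NNP"), ("Prima", "NNP"), (".", ".")],
    [("Thunder", "NN"), ("and", "CC"), ("Lightning", "NNP"), (".", ".")],
    [("Enter", "NNP"), ("three", "CD"), ("Witches", "NNP"), (".", ".")],
]

# Flatten the sentences and count each tag.
tag_counts = Counter(tag for sent in tagged_sentences for _, tag in sent)
print(tag_counts.most_common())
```

Running the same count over every tagged sentence, rather than the first few, would be needed to rank tag frequencies for a whole play.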


### Chunking

After establishing the parts of speech, I created groups based on the commonly found tags. I mainly focused on grouping together words that were coordinating conjunctions, cardinal digits, proper nouns, or nouns.

In [60]:
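def process_content():
    try:
        # Chunk grammar: optional conjunctions and cardinal numbers, then one or
        # more proper nouns, optionally followed by a common noun.
        chunkGram = r"""Chunk: {<CC.?>*<CD.?>*<NNP>+<NN>?}"""
        chunkParser = nltk.RegexpParser(chunkGram)
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunked = chunkParser.parse(tagged)
            chunked.draw()  # opens one tree window per sentence

    except Exception as e:
        print(str(e))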

process_content()
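
To show how the tag pattern `<CC.?>*<CD.?>*<NNP>+<NN>?` reads, here is a plain-regex sketch that matches the same pattern against a string of tags. It is a simplified stand-in for `nltk.RegexpParser`, written under the assumption that tags never contain angle brackets; the helper `find_chunks` is hypothetical.

```python
import re

# Plain-regex reading of the tag pattern <CC.?>*<CD.?>*<NNP>+<NN>?
CHUNK_RE = re.compile(r"(?:<CC[^<>]?>)*(?:<CD[^<>]?>)*(?:<NNP>)+(?:<NN>)?")

def find_chunks(tagged):
    """Return the word groups whose tag sequence matches CHUNK_RE."""
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    chunks = []
    for match in CHUNK_RE.finditer(tag_string):
        # Count '<' characters to convert string offsets back into token indices.
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

tagged = [("Enter", "NNP"), ("three", "CD"), ("Witches", "NNP"), (".", ".")]
chunks = find_chunks(tagged)
print(chunks)
```

Because every chunk must contain at least one proper noun (`<NNP>+`), a conjunction or number on its own is never chunked.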

### Clustering

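Below is K-means clustering, first for Macbeth and then for Hamlet. The sentences of each play are split into two clusters. Because K-means is a hard-clustering method, every sentence is assigned to exactly one cluster, and the top terms of each cluster indicate which words group together.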
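
For readers unfamiliar with hard clustering, the following is a minimal sketch of the basic K-means loop (Lloyd's algorithm) on four made-up 2-D points. It is an illustration only: the fixed starting centroids and the function name `kmeans` are assumptions, and it does not reproduce scikit-learn's `k-means++` initialization.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on lists of floats; returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to exactly one (the nearest) centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        centroids = new_centroids
    return labels, centroids

# Made-up points forming two obvious groups.
points = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
labels, centroids = kmeans(points, centroids=[[0.0, 0.0], [1.0, 1.0]])
print(labels)
```

In the notebook the "points" are TF-IDF vectors of sentences rather than 2-D coordinates, but the assignment and update steps are the same idea.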

In [23]:
tok = sent_tokenize(mac)

for x in range(100):
    print(tok[x])

Actus Primus.
Scoena Prima.
Thunder and Lightning.
Enter three Witches.
1.
When shall we three meet againe?
In Thunder, Lightning, or in Raine?
2.
When the Hurley-burley's done, When the Battaile's lost, and wonne 3.
That will be ere the set of Sunne 1.
Where the place?
2.
Vpon the Heath 3.
There to meet with Macbeth 1.
I come, Gray-Malkin All.
Padock calls anon: faire is foule, and foule is faire, Houer through the fogge and filthie ayre.
Exeunt.
Scena Secunda.
Alarum within.
Enter King Malcome, Donalbaine, Lenox, with attendants, meeting a bleeding Captaine.
King.
What bloody man is that?
he can report, As seemeth by his plight, of the Reuolt The newest state Mal.
This is the Serieant, Who like a good and hardie Souldier fought 'Gainst my Captiuitie: Haile braue friend; Say to the King, the knowledge of the Broyle, As thou didst leaue it Cap.
Doubtfull it stood, As two spent Swimmers, that doe cling together, And choake their Art: The mercilesse Macdonwald (Worthie to be a Rebell, fo

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

In [42]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(tok)
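
The vectorizer turns each sentence into a row of TF-IDF weights. The sketch below computes such weights by hand for three made-up token lists, assuming the scikit-learn default settings (raw term counts, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row). The function name `tfidf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in docs:
        weights = {t: tf * idf[t] for t, tf in Counter(doc).items()}
        # L2-normalize the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = [["enter", "three", "witches"], ["enter", "king"], ["thunder", "lightning"]]
rows = tfidf(docs)
print(rows[0])
```

A term that appears in many sentences, such as "enter" here, receives a lower weight than a term confined to one sentence.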

In [43]:
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=2, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [49]:
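print("Top terms per cluster:")
# Sort each centroid's term weights in descending order.
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# On scikit-learn 1.0 or later, use vectorizer.get_feature_names_out() instead.
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])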

print("\n")
print("Prediction")

Y = vectorizer.transform(["Actus Primus."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["Scoena Prima."])
prediction = model.predict(Y)
print(prediction)

Top terms per cluster:
Cluster 0:
 haue
 thou
 wife
 king
 mal
 thee
 lenox
 scena
 thy
 shall
Cluster 1:
 macb
 enter
 exeunt
 lady
 macd
 rosse
 macbeth
 lord
 banquo
 lenox


Prediction
[0]
[0]
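
The "top terms" listing comes from sorting each cluster centroid's weights in descending order and reading off the matching vocabulary entries. A minimal sketch of that lookup follows; the centroid weights are made up for illustration and the helper `top_terms` is hypothetical.

```python
def top_terms(centroid, terms, n=10):
    """Return the n terms with the largest centroid weights, highest first."""
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in order[:n]]

# Made-up weights for a four-word vocabulary (not taken from the fitted model).
terms = ["banquo", "macb", "enter", "lady"]
centroid = [0.05, 0.40, 0.10, 0.30]
best = top_terms(centroid, terms, n=2)
print(best)
```

Since `n_init=1` and no `random_state` is set in the notebook, the cluster numbering and top terms can differ from run to run.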


In [62]:
tok2 = sent_tokenize(ham)

for x in range(100):
    print(tok2[x])

Actus Primus.
Scoena Prima.
Enter Barnardo and Francisco two Centinels.
Barnardo.
Who's there?
Fran.
Nay answer me: Stand & vnfold your selfe Bar.
Long liue the King Fran.
Barnardo?
Bar.
He Fran.
You come most carefully vpon your houre Bar.
'Tis now strook twelue, get thee to bed Francisco Fran.
For this releefe much thankes: 'Tis bitter cold, And I am sicke at heart Barn.
Haue you had quiet Guard?
Fran.
Not a Mouse stirring Barn.
Well, goodnight.
If you do meet Horatio and Marcellus, the Riuals of my Watch, bid them make hast.
Enter Horatio and Marcellus.
Fran.
I thinke I heare them.
Stand: who's there?
Hor.
Friends to this ground Mar.
And Leige-men to the Dane Fran.
Giue you good night Mar.
O farwel honest Soldier, who hath relieu'd you?
Fra.
Barnardo ha's my place: giue you goodnight.
Exit Fran.
Mar.
Holla Barnardo Bar.
Say, what is Horatio there?
Hor.
A peece of him Bar.
Welcome Horatio, welcome good Marcellus Mar.
What, ha's this thing appear'd againe to night Bar.
I haue seene no

In [63]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(tok2)

In [64]:
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=2, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [65]:
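print("Top terms per cluster:")
# Sort each centroid's term weights in descending order.
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# On scikit-learn 1.0 or later, use vectorizer.get_feature_names_out() instead.
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])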

print("\n")
print("Prediction")

Y = vectorizer.transform(["Actus Primus."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["Scoena Prima."])
prediction = model.predict(Y)
print(prediction)

Top terms per cluster:
Cluster 0:
 ham
 lord
 king
 enter
 haue
 ophe
 hamlet
 qu
 laer
 come
Cluster 1:
 hor
 speake
 tis
 reueale
 puh
 strooke
 beene
 strange
 question
 follow


Prediction
[0]
[0]


## Conclusion

Part-of-speech tagging of Shakespeare's plays Macbeth and Hamlet suggests that nouns, proper nouns, coordinating conjunctions, and cardinal digits are the most commonly occurring types of words in Shakespeare's writing. Chunking the texts on these four tags is therefore a logical way to group them. In comparison, K-means clustering of both plays grouped mainly nouns and proper nouns. While both techniques generate similar results, the methodology is quite different. With part-of-speech tagging and chunking, both texts were combined into one unit: Hamlet trained the sentence tokenizer that was then applied to Macbeth. K-means is an excellent unsupervised learning method for grouping data, but the text files proved too big for the notebook to process and had to be broken apart.