# Introduction

This Python notebook walks through several typical NLP analysis steps, applied to 
Shakespeare's plays. Two leading questions motivate the analysis:


- Is there an intrinsic, language-driven classification of the plays into, say, three categories?
    - Shakespeare's plays have traditionally been sorted into histories, tragedies and comedies
- Do any traditionally-understood central themes of the plays emerge from language analysis?


In [2]:
from collections import Counter
from collections import defaultdict
import glob, os
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt 
from gensim import corpora, models
import string
import warnings

warnings.filterwarnings('ignore')      # suppresses warning output (use at own risk)

## Breaking down Shakespeare


Shakespeare's plays reside in the 'shakespeare' sub-directory. Stage directions,  
act/scene breaks and speaker identifications have all been removed. 
What remains are the spoken lines.


This code breaks the plays down into one or more versions of a corpus. 
A corpus is a table of two columns by some number of rows. 
In each row, the first column is a source identifier for that row and the second column
is a sequence of words from that source. 


In the first round of analysis, each word sequence is 500 words long; 
leftover words at the end of a play that do not fill a sequence are dropped. 
A typical play yields 45 or so sequences. Since there are 
36 plays this implies about 1600 sequences of text in total, which can be indexed 
in two dimensions: the play along one axis and the location within the play along 
the other.
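
The chunking step described above can be sketched as follows. This is a minimal illustration, not the notebook's own cell: the function name `chunk_words` and the placeholder word list are assumptions made for the example.

```python
def chunk_words(words, chunk_size=500):
    """Split a list of words into consecutive chunks of exactly chunk_size words.

    Leftover words that do not fill a final chunk are dropped.
    """
    return [words[i:i + chunk_size]
            for i in range(0, len(words) - chunk_size + 1, chunk_size)]

# Hypothetical stand-in for a play: 1,234 placeholder words.
words = ['word{}'.format(i) for i in range(1234)]
chunks = chunk_words(words)

# Two full chunks of 500 words; the trailing 234 words are discarded.
print(len(chunks), [len(c) for c in chunks])
```

At roughly 45 chunks per play, 36 plays give 36 × 45 = 1620 chunks, which is where the "about 1600" figure comes from.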


After a corpus is built we proceed to build a document-term matrix (DTM). 
This is also a table, but with many columns. The first column of a DTM is 
the same as for the corpus: an identifier recording where the text in that
row originated. Each remaining column corresponds to a token, that is, some element 
that appears in the text. For our purposes tokens are single words. 
The entries in each row (one unique text source) are the number of
times that column's token appears in that source. Here is a simple example
using two sentences as sources. Notice that the text is converted to lowercase
and the punctuation is removed.


```
This is the first sentence.
This is the second sentence and it is slightly longer.

Corpus:
source          text
sentence 1      this is the first sentence
sentence 2      this is the second sentence and it is slightly longer

Document-Term Matrix:
source       this    is   the    first   sentence  second   and   it   slightly   longer
sentence 1      1     1     1        1          1       0     0    0          0        0  
sentence 2      1     2     1        0          1       1     1    1          1        1
```
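
As a rough sketch of how such a table can be computed, the snippet below builds the corpus and the document-term matrix for the same two sentences using only the standard library. The names `to_tokens`, `vocabulary` and `dtm` are chosen for this illustration and are not part of the notebook. Note that "is" occurs twice in the second sentence, so its count there is 2.

```python
from collections import Counter
import string

sources = {
    'sentence 1': 'This is the first sentence.',
    'sentence 2': 'This is the second sentence and it is slightly longer.',
}

def to_tokens(text):
    """Lowercase the text, remove punctuation, split on whitespace."""
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return text.split()

# Corpus: one (source label, word sequence) pair per row.
corpus = [(label, to_tokens(text)) for label, text in sources.items()]

# Vocabulary: every distinct token, in order of first appearance.
vocabulary = []
for _, tokens in corpus:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# DTM: for each source, the count of every vocabulary token.
dtm = {}
for label, tokens in corpus:
    counts = Counter(tokens)
    dtm[label] = [counts[token] for token in vocabulary]

print(vocabulary)
for label, row in dtm.items():
    print(label, row)
```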


Source label format is **`<play>_<sequence>`**, e.g. **`merchant_of_venice_006`**.


One aim here is to use the **`gensim`** library for topic modeling.
The per-source topic distributions are then treated as vectors and clustered using K-Means.
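
To make the clustering idea concrete, here is a minimal sketch that clusters a few made-up topic vectors with scikit-learn's `KMeans`. The numbers are hypothetical and chosen so that two groups are obvious; real topic distributions from a topic model would take their place.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical topic distributions: one row per text source, one column per topic.
topic_vectors = np.array([
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.05, 0.90],
    [0.10, 0.05, 0.85],
])

# n_init is set explicitly because its default differs between scikit-learn versions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(topic_vectors)
cluster_ids = kmeans.labels_.tolist()

# Rows 0 and 1 share one cluster id, rows 2 and 3 share the other.
print(cluster_ids)
```

Which group receives id 0 and which receives id 1 is arbitrary, so only the grouping should be interpreted.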

In [3]:
def preprocess(s, lowercase=True, strip_punctuation=True):
    punctuation = '.,?<>:;"\'!%'
    if isinstance(s, str):
        s = s.split()                              # split a raw string into whitespace-separated tokens
    if lowercase:
        s = [t.lower() for t in s]
    if strip_punctuation:
        s = [t.strip(punctuation) for t in s]      # strips leading/trailing punctuation only
    return s

def token_frequency(tokens=None, tf=None, relative=False):
    """
    Input:
        tokens = list of strings or None
        tf = dict of existing counts to update, or None to start fresh
        relative = boolean; if True return relative frequencies instead of counts
    Return:
        dictionary of token frequencies
    """
    tf = {} if tf is None else tf       # avoid a shared mutable default argument
    for t in tokens or []:
        tf[t] = tf.get(t, 0) + 1
    if relative:
        total = sum(tf.values())
        tf = {t: tf[t]/total for t in tf}
    return tf


In [3]:
filepath = os.getcwd() + '/shakespeare/*.txt'
files = glob.glob(filepath)
print(len(files), 'plays')

36 plays


In [4]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [5]:
t = ['crum&bum', 'mini', "0acc'id$-ental"]
print(t)
n = []
for w in t:
    for c in string.punctuation: 
        w = w.replace(c, '')
    n.append(w)
n

['crum&bum', 'mini', "0acc'id$-ental"]


['crumbum', 'mini', '0accidental']

In [6]:
def RedactPunctuation(s):
    '''
    Strips out most everything except single apostrophes
    '''
    puncts = "!\"#$%&()*+,./:;<=>?@^_`-{|}~"
    for c in puncts: s = s.replace(c, '')
    return s

for f in files:
    with open(f, "r") as playfile:
        p = playfile.read()              # p is the whole play as one string
    p = p.split()                        # p becomes a list of tokens (whitespace-separated words)
    p = [word.lower() for word in p]
    p = [RedactPunctuation(word) for word in p]
    chunk_size = 500                                                # words per chunk
    chunks = []                                                     # empty list; reset for every play
    for wordnum in range(0, len(p) - chunk_size + 1, chunk_size):   # if p is a list of 1000 words: this gives us 2 chunks
        chunks.append(p[wordnum : wordnum + chunk_size])            # now chunks will be a list of lists
    
    chunk_labels = ['{}_{:03}'.format(os.path.split(f)[1][:-4], i) for i, j in enumerate(chunks)] # get chunk labels
    # print(f.split('/')[5].split('.')[0], len(t), 'tokens', len(chunks), 'chunks', chunks[0][0:5])
    
    if 'julius' in f: print(chunks[0], chunks[1])
    
    # Check that the punctuation strip has worked: look for question marks at the start of Julius Caesar.
    # RedactPunctuation deliberately keeps apostrophes, so the "I'll" that opens Taming of the Shrew
    # becomes "i'll" rather than 'ill'.
    # starting topic modeling for chunks using 5, 10, 25, 50 topics
    # topic_5 = make_topic_model(chunks, 5)
    # kmean_topics(topic_5, chunk_labels, 2) # plot n=5       (last arg was 5... error)

['hence', 'home', 'you', 'idle', 'creatures', 'get', 'you', 'home', 'is', 'this', 'a', 'holiday', 'what', 'know', 'you', 'not', 'being', 'mechanical', 'you', 'ought', 'not', 'walk', 'upon', 'a', 'labouring', 'day', 'without', 'the', 'sign', 'of', 'your', 'profession', 'speak', 'what', 'trade', 'art', 'thou', 'why', 'sir', 'a', 'carpenter', 'where', 'is', 'thy', 'leather', 'apron', 'and', 'thy', 'rule', 'what', 'dost', 'thou', 'with', 'thy', 'best', 'apparel', 'on', 'you', 'sir', 'what', 'trade', 'are', 'you', 'truly', 'sir', 'in', 'respect', 'of', 'a', 'fine', 'workman', 'i', 'am', 'but', 'as', 'you', 'would', 'say', 'a', 'cobbler', 'but', 'what', 'trade', 'art', 'thou', 'answer', 'me', 'directly', 'a', 'trade', 'sir', 'that', 'i', 'hope', 'i', 'may', 'use', 'with', 'a', 'safe', 'conscience', 'which', 'is', 'indeed', 'sir', 'a', 'mender', 'of', 'bad', 'soles', 'what', 'trade', 'thou', 'knave', 'thou', 'naughty', 'knave', 'what', 'trade', 'nay', 'i', 'beseech', 'you', 'sir', 'be', 'not'
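
The comments in the cell above ask whether the punctuation strip behaves as intended. One way to check is to run the same character set over a few sample words. The sketch below is an alternative formulation using `str.translate`, not the notebook's own function; the name `redact_punctuation` and the sample words are assumptions for illustration.

```python
# Same characters that RedactPunctuation removes. The apostrophe is not in
# the set, so contractions such as "I'll" keep it.
PUNCTS = "!\"#$%&()*+,./:;<=>?@^_`-{|}~"
STRIP_TABLE = str.maketrans('', '', PUNCTS)

def redact_punctuation(word):
    """Remove every character listed in PUNCTS from word in a single pass."""
    return word.translate(STRIP_TABLE)

samples = ["I'll", "holiday?", "home,", "crum&bum", "0acc'id$-ental"]
cleaned = [redact_punctuation(w.lower()) for w in samples]
print(cleaned)
```

Question marks and commas disappear while the apostrophe survives, which matches the absence of question marks in the Julius Caesar output above.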

In [3]:
# labels = [os.path.split(f)[1][:-4].replace('_', ' ').title() for f in files]

# 2. chunk all shakespeare and make labels
def make_topic_model(chunks, num):
    dictionary = corpora.Dictionary(chunks) 
    corpus = [dictionary.doc2bow(text) for text in chunks]
    # lda model
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num)
    corpus_lda = lda[corpus]
    return corpus_lda

def kmean_topics(topics, labels, num):
    '''K-means the topic vectors, k = 3'''
    # Put labels, features, vectors into a single dataframe   
    vectors_df = pd.DataFrame(topics, index=labels, columns=range(num)).fillna(0)
    # 4. Use K-means clustering from Scikit Learn to find three clusters. 
    n_clusters=3
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(vectors_df)
    plot_clusters(kmeans, vectors_df, num) # plot topic clustering
    return 

def plot_clusters(kmeans, df, topic_num):
    '''Plot the K-means clustering of the topic vectors'''
    labels = df.index                   # one label per chunk
    features = df.columns               # one feature per topic
    pca = PCA(n_components=2)
    transformed = pca.fit_transform(df) # transform topic_num features to 2D
    x = transformed[:,0]
    y = transformed[:,1]
    col_dict = {0:'red', 1:'blue', 2:'green'}
    cols = [col_dict[l] for l in kmeans.labels_]
    plt.figure(figsize=(15,10))
    plt.scatter(x,y, c=cols, s=100, alpha=.5)
    for i, l in enumerate(labels):
        plt.text(x[i]+.0003,y[i]-.0001, l)
    for i, c in enumerate(pca.components_.transpose()):
        plt.arrow(0,0, c[0]/50, c[1]/50, alpha=.3, width=.0001)
        plt.text(c[0]/50, c[1]/50, features[i])
    plt.xlabel('PCA1')
    plt.ylabel('PCA2')
    plt.title('Shakespeare works for Topic {}'.format(topic_num))
    plt.savefig("shakespeare-kmeans-{}.png".format(topic_num))   # save before show(), which may leave the figure empty
    plt.show()
    return

found 36 files starting with ./shakespeare/1_king_henry_iv.txt


In [16]:
chunks[0][0:10]

['so', 'shaken', 'as', 'we', 'are', 'so', 'wan', 'with', 'care', 'find']

In [17]:
import pickle
with open('chunks.pkl', 'wb') as picklefile: pickle.dump(chunks, picklefile)

In [18]:
!ls

 LICENSE			     chunks.pkl
 README.md			     democratic_nominees_quotes.csv
 Text_Analytics_NLP_Workshop.ipynb  'jupyterize shakespeare.ipynb'
 austen-kmeans.py		     shakespeare
 austen_alcott			     shakespeare-kmeans.py


In [22]:
# dir()
# globals()
# locals()

In [23]:
chunk_labels = ['{}_{:03}'.format(os.path.split(f)[1][:-4], i) for i, j in enumerate(chunks)] # get chunk labels

```
chunk_labels

'1_king_henry_iv_000', ..., '1_king_henry_iv_041'
```

In [25]:
dictionary = corpora.Dictionary(chunks) 

In [26]:
len(dictionary)

3851

In [27]:
dictionary[0], dictionary[1], dictionary[2], dictionary[3], dictionary[4], dictionary[5], dictionary[6]

('a', 'accents', 'acquaintance', 'acres', 'advantage', 'afar', 'against')

In [45]:
# dir(dictionary)

In [46]:
print(dictionary.token2id['butchered'])             # d.token2id[] is a reversed dictionary: keys are words

42


In [32]:
dictionary[42]

'butchered'

In [47]:
corpus = [dictionary.doc2bow(this_chunk) for this_chunk in chunks]       # word count across the dictionary for each chunk

In [49]:
corpus[0][0:42]

[(0, 8),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 2),
 (7, 1),
 (8, 3),
 (9, 1),
 (10, 1),
 (11, 19),
 (12, 1),
 (13, 1),
 (14, 1),
 (15, 2),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 5),
 (20, 1),
 (21, 1),
 (22, 3),
 (23, 1),
 (24, 1),
 (25, 1),
 (26, 2),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 1),
 (33, 1),
 (34, 1),
 (35, 1),
 (36, 1),
 (37, 1),
 (38, 2),
 (39, 1),
 (40, 1),
 (41, 2)]

In [22]:
# corpus is a list with one bag-of-words (bow) per chunk (so 4 here); each bow is a list of (token id, count) pairs
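
The dictionary and bag-of-words structures used above can be imitated with plain Python dictionaries. The sketch below is a stand-in for illustration only and does not call gensim; in particular, assigning ids in sorted token order is an arbitrary choice made here, and the two short chunks are invented from the opening words shown earlier.

```python
from collections import Counter

# Two tiny hypothetical chunks (lists of words).
chunks = [
    ['so', 'shaken', 'as', 'we', 'are', 'so', 'wan'],
    ['as', 'we', 'are', 'so', 'wan', 'with', 'care'],
]

# Forward mapping (token -> id) and its reverse (id -> token).
all_tokens = sorted({token for chunk in chunks for token in chunk})
token2id = {token: i for i, token in enumerate(all_tokens)}
id2token = {i: token for token, i in token2id.items()}

def to_bow(chunk):
    """Return the chunk as a sorted list of (token id, count) pairs."""
    counts = Counter(token2id[token] for token in chunk)
    return sorted(counts.items())

corpus = [to_bow(chunk) for chunk in chunks]

print(token2id)
print(corpus[0])   # 'so' appears twice in the first chunk
```

Only tokens that actually occur in a chunk appear in its bag-of-words, which is why different chunks produce lists of different lengths.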

In [23]:
original_corpus = corpus.copy()

In [24]:
# lda model: Latent Dirichlet Allocation
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=5)

In [25]:
type(lda)

gensim.models.ldamodel.LdaModel

In [26]:
if corpus == original_corpus: print('same')

same


In [27]:
original_corpus

[[(0, 1),
  (1, 96),
  (2, 1),
  (3, 3),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 3),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 6),
  (20, 6),
  (21, 4),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 21),
  (32, 1),
  (33, 1),
  (34, 3),
  (35, 3),
  (36, 1),
  (37, 6),
  (38, 1),
  (39, 1),
  (40, 2),
  (41, 17),
  (42, 194),
  (43, 2),
  (44, 1),
  (45, 1),
  (46, 3),
  (47, 2),
  (48, 2),
  (49, 1),
  (50, 2),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1),
  (55, 1),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 9),
  (61, 1),
  (62, 3),
  (63, 12),
  (64, 1),
  (65, 50),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 20),
  (72, 1),
  (73, 1),
  (74, 2),
  (75, 1),
  (76, 1),
  (77, 2),
  (78, 3),
  (79, 2),
  (80, 1),
  (81, 1),
  (82, 1),
  (83, 1),
  (84, 1),
  (85, 2),
  (86, 1),
  (87, 1),
  (88, 2),
  (89, 3),
  (90, 1),
 

In [28]:
topic_5 = lda[corpus]

In [29]:
type(topic_5)

gensim.interfaces.TransformedCorpus

In [30]:
topic_5[0]

[(0, 0.5334106), (1, 0.42531976), (3, 0.041188113)]

In [31]:
topic_5[1]

[(0, 0.060195986), (1, 0.939593)]

In [32]:
topic_5[3][0]

(1, 0.99026656)

In [33]:
len(topic_5[0]), type(topic_5[0])

(3, list)

In [34]:
topic_5[0][0]

(0, 0.56814337)

In [35]:
type(topic_5[0][0])

tuple

In [36]:
topic_5[0][0][1]

0.55228126

In [37]:
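# First index selects the chunk (0 1 2 3). Second index selects one (topic id, weight) tuple from that chunk's list,
# which omits topics with very small weight, so the lists differ in length. Third index picks the topic id (0) or the weight (1).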

In [38]:
len(topic_5), len(topic_5[0]), len(topic_5[1]), len(topic_5[2]), len(topic_5[3])

(4, 3, 2, 3, 1)

In [39]:
chunk_labels

['1_king_henry_iv_000',
 '1_king_henry_iv_001',
 '1_king_henry_iv_002',
 '1_king_henry_iv_003']

In [43]:
# Put labels, features, vectors into a single dataframe
vectors_df = pd.DataFrame(topic_5, index=chunk_labels, columns=range(4))     # .fillna(0)

In [44]:
# 4. Use K-means clustering from Scikit Learn to find three clusters. 
# n_clusters=3
# kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(vectors_df)
# plot_clusters(kmeans, vectors_df, num) # plot topic clustering

In [46]:
# This loop fails, so the cells above are diagnostics

filecount = 0
filelimit = 1

for f in files:
    print("working on file/play", f)
    filecount += 1
    if filecount > filelimit: break
        
    chunks = chunk(preprocess(open(f, "r").read()), 5000) # get chunks 
    chunk_labels = ['{}_{:03}'.format(os.path.split(f)[1][:-4], i) for i, j in enumerate(chunks)] # get chunk labels
    
    # starting topic modeling for chunks using 5, 10, 25, 50 topics
    topic_5 = make_topic_model(chunks, 5)
    
    kmean_topics(topic_5, chunk_labels, 2) # plot n=5       (last arg was 5... error)

    
    # topic_10 = make_topic_model(chunks, 10)
    # kmean_topics(topic_10, chunk_labels, 10) # plot n=10
    # topic_25 = make_topic_model(chunks, 25)
    # kmean_topics(topic_25, chunk_labels, 25) # plot n=25
    # topic_50 = make_topic_model(chunks, 50)
    # kmean_topics(topic_50, chunk_labels, 50) # plot n=50

working on file/play ./shakespeare/1_king_henry_iv.txt


ValueError: setting an array element with a sequence.
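
The `ValueError` is consistent with K-Means receiving a table whose cells hold `(topic id, weight)` tuples rather than plain numbers, since each chunk's topic distribution is a variable-length list of such tuples. This diagnosis is a likely explanation, not a confirmed one. One possible remedy is to expand each sparse list into a fixed-length row of weights first. The helper name `to_dense_rows` is an assumption for this sketch; the sample values are copied from the `topic_5[0]` and `topic_5[1]` outputs shown earlier.

```python
def to_dense_rows(sparse_docs, num_topics):
    """Expand lists of (topic id, weight) pairs into fixed-length weight rows.

    Topics missing from a document's list get weight 0.0.
    """
    rows = []
    for doc in sparse_docs:
        row = [0.0] * num_topics
        for topic_id, weight in doc:
            row[topic_id] = weight
        rows.append(row)
    return rows

# Sparse topic distributions for two chunks, as printed earlier in the notebook.
sparse = [
    [(0, 0.5334106), (1, 0.42531976), (3, 0.041188113)],
    [(0, 0.060195986), (1, 0.939593)],
]

dense = to_dense_rows(sparse, num_topics=5)
for row in dense:
    print(row)
```

Every row now has one numeric entry per topic, so a table built from `dense` with the chunk labels as its index and `range(num_topics)` as its columns should be acceptable input for K-Means and PCA.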