# SHAKESPEARE RESEARCH

## For exhaustive research on Shakespeare's literary works, we need to know how many times the author uses each word in three of his masterpieces:

1. King Lear:  
  * https://storage.cloud.google.com/apache-beam-samples/shakespeare/kinglear.txt   
 
2. Othello: 
  * https://storage.cloud.google.com/apache-beam-samples/shakespeare/othello.txt   
 
3. Romeo and Juliet: 
  * https://storage.cloud.google.com/apache-beam-samples/shakespeare/romeoandjuliet.txt 
 
 
The expected output should be a single file.  

In [1]:
import nltk
from nltk.probability import FreqDist
import string
import os

In [2]:
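def get_vocab(txt):
    # Return the sorted set of lowercased tokens found in the file
    with open(txt, "r") as f:
        raw = f.read()
    tokens = nltk.word_tokenize(raw)
    words = [w.lower() for w in tokens]  # call lower(); the bare attribute is a method object
    vocab = sorted(set(words))
    return vocab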

In [3]:
punctuations = list(string.punctuation)

In [4]:
def get_frequencies(txt, include_punct=False):
    # Count how often each token occurs in the file (case-sensitive)
    fdist = FreqDist()
    with open(txt, "r") as f:  # close the file once it has been read
        raw = f.read()
    for sent in nltk.tokenize.sent_tokenize(raw):  # first tokenize sentences
        for word in nltk.tokenize.word_tokenize(sent):  # then tokenize words for each sentence
            if include_punct or word not in punctuations:
                fdist[word] += 1
    return fdist
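
The counting step above can be tried on a short string without any input files. This is only a minimal sketch: it assumes a simple regular expression in place of `nltk.word_tokenize` (so contractions such as `'s` are split differently) and `collections.Counter` in place of `FreqDist`. The helper name `count_words` is hypothetical.

```python
import re
import string
from collections import Counter

PUNCTUATIONS = set(string.punctuation)

def count_words(text, include_punct=False):
    """Count tokens in text, case-sensitively."""
    # Assumption: a token is a run of word characters or one punctuation mark.
    # This regex is a stand-in for nltk.word_tokenize, not an equivalent.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if not include_punct:
        tokens = [t for t in tokens if t not in PUNCTUATIONS]
    return Counter(tokens)

sample = "To be, or not to be: that is the question."
counts = count_words(sample)
counts_with_punct = count_words(sample, include_punct=True)
print(counts["be"], counts["To"], counts["to"])
```

Because no lowercasing is applied, `To` and `to` are counted as separate words, which matches the behaviour of `get_frequencies`.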

In [5]:
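def write_append(txt):
    # Append the uppercased file name and its word counts to the shared output file
    with open("shakespeare_word_count.txt", "a") as output_file:
        output_file.write("%s\n" % txt.upper())  # use the parameter, not the global loop variable f
        freqs = get_frequencies(txt)
        for (k, v) in freqs.items():
            output_file.write("%s:%s\n" % (k, v))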

## Write output to a single file named 'shakespeare_word_count.txt'

In [6]:
for f in ['shakespeare_kinglear.txt', 'shakespeare_othello.txt', 'shakespeare_romeoandjuliet.txt']:
    write_append(f)
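
Because `write_append` opens the output file in append mode, running this cell a second time adds every section to the file again. One possible alternative, sketched here under assumptions, is to open the output once in write mode and loop over the inputs inside it. The names `write_counts` and `count_fn` are hypothetical, and the demo uses a whitespace split on a temporary file rather than NLTK on the plays.

```python
import os
import tempfile
from collections import Counter

def write_counts(paths, output_path, count_fn):
    """Write one section per input file, overwriting any earlier output."""
    # Mode "w" truncates the file, so rerunning does not duplicate sections.
    with open(output_path, "w") as output_file:
        for path in paths:
            output_file.write("%s\n" % os.path.basename(path).upper())
            with open(path, "r") as f:
                freqs = count_fn(f.read())
            for word, count in freqs.items():
                output_file.write("%s:%s\n" % (word, count))

# Demo on a throwaway file (made-up content, not one of the plays)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w") as f:
    f.write("never never never")

output_path = os.path.join(tmp_dir, "word_count.txt")
simple_count = lambda text: Counter(text.split())  # assumption: whitespace tokenizer
write_counts([sample_path], output_path, simple_count)
write_counts([sample_path], output_path, simple_count)  # rerun leaves a single section

with open(output_path) as f:
    result = f.read()
print(result)
```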

## Show top 10 most frequent words for each masterpiece

In [9]:
for f in ['shakespeare_kinglear.txt', 'shakespeare_othello.txt', 'shakespeare_romeoandjuliet.txt']:
    print("%s"%f.upper())
    freqs = get_frequencies(f)
    for wf in freqs.most_common(10):
        print("%s:\t%s"%(wf[0],wf[1]))
    print('\n')
    

SHAKESPEARE_KINGLEAR.TXT
the:	786
I:	697
and:	593
of:	447
to:	429
you:	404
my:	402
a:	355
not:	287
in:	270


SHAKESPEARE_OTHELLO.TXT
I:	884
the:	674
and:	596
to:	477
you:	452
of:	420
a:	394
my:	368
not:	331
is:	316


SHAKESPEARE_ROMEOANDJULIET.TXT
I:	648
the:	614
and:	490
to:	456
a:	404
of:	367
is:	334
my:	314
in:	290
's:	284
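
The counts above are case-sensitive, since `get_frequencies` never lowercases its tokens, so a word that appears both capitalised and in lower case is counted under two keys. If case-insensitive totals are wanted, one option is to merge the counts afterwards. The sketch below assumes a hypothetical helper `fold_case` and uses made-up numbers, not figures from the plays.

```python
from collections import Counter

def fold_case(freqs):
    """Return a new Counter with case variants merged under the lowercase key."""
    folded = Counter()
    for word, count in freqs.items():
        folded[word.lower()] += count
    return folded

# Made-up counts for illustration only
case_sensitive = Counter({"The": 2, "the": 5, "I": 3})
folded = fold_case(case_sensitive)
print(folded.most_common(2))
```

The helper only relies on `.items()`, so it accepts any mapping from word to count.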


