# Crunch-Shake


## Table of Contents

<strong>
1. Introduction
2. Preliminaries
3. Parsing
4. Processing
</strong>

## Introduction

**crunch-shake** is a library for analyzing plays and scripts for gender disparities. Given a script, you first parse it into the format the library expects. Then you can do fun stuff like finding the most common words used by female or male characters, running network analysis to see who the most important characters are, creating graphs of plays, and even running the [Bechdel test](https://en.wikipedia.org/wiki/Bechdel_test).

## Preliminaries

First, let's take a look at the play we will be parsing, *Romeo and Juliet* by William Shakespeare. Ever wanted to know who was the more important of the romantic duo, Romeo or Juliet? (Hint: it does not dispel any notions that we live in a patriarchy.) I've taken the play from [MIT's website](http://shakespeare.mit.edu/romeo_juliet/full.html).

In [1]:
from utils import file_to_list

romeo_juliet_raw = file_to_list("plays/romeo_and_juliet_entire_play.html")

# Showing the beginning
for line in romeo_juliet_raw[:10]:
    print(line, end="")

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
 "http://www.w3.org/TR/REC-html40/loose.dtd">
 <html>
 <head>
 <title>Romeo and Juliet: Entire Play
 </title>
 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
 <LINK rel="stylesheet" type="text/css" media="screen"
       href="/shake.css">
 </HEAD>


Obviously there's some stuff here that's not really relevant to us; let's look at some lines from the middle of the play.

In [2]:
# Showing the middle portion
for line in romeo_juliet_raw[2992:3007]:
    print(line, end="")

<A NAME=speech81><b>ROMEO</b></a>
<blockquote>
<A NAME=2.4.177>And stay, good nurse, behind the abbey wall:</A><br>
<A NAME=2.4.178>Within this hour my man shall be with thee</A><br>
<A NAME=2.4.179>And bring thee cords made like a tackled stair;</A><br>
<A NAME=2.4.180>Which to the high top-gallant of my joy</A><br>
<A NAME=2.4.181>Must be my convoy in the secret night.</A><br>
<A NAME=2.4.182>Farewell; be trusty, and I'll quit thy pains:</A><br>
<A NAME=2.4.183>Farewell; commend me to thy mistress.</A><br>
</blockquote>

<A NAME=speech82><b>Nurse</b></a>
<blockquote>
<A NAME=2.4.184>Now God in heaven bless thee! Hark you, sir.</A><br>
</blockquote>


In addition to dialogue, we also have to watch out for act and scene information.

In [3]:
for line in romeo_juliet_raw[3315:3317]:
    print(line, end="")

<H3>ACT III</h3>
<h3>SCENE I. A public place.</h3>


We also need information regarding **when characters enter and exit**. These stage directions can appear between speeches or within a speech (indicating that a character should enter or exit while another is speaking).

In [4]:
# in between dialogue
for line in romeo_juliet_raw[1778:1782]:
    print(line, end="")

<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>
<p><blockquote>
<i>Enter ROMEO</i>
</blockquote>


In [5]:
# within a dialogue
for line in romeo_juliet_raw[3257:3273]:
    print(line, end="")

<A NAME=speech3><b>FRIAR LAURENCE</b></a>
<blockquote>
<A NAME=2.6.9>These violent delights have violent ends</A><br>
<A NAME=2.6.10>And in their triumph die, like fire and powder,</A><br>
<A NAME=2.6.11>Which as they kiss consume: the sweetest honey</A><br>
<A NAME=2.6.12>Is loathsome in his own deliciousness</A><br>
<A NAME=2.6.13>And in the taste confounds the appetite:</A><br>
<A NAME=2.6.14>Therefore love moderately; long love doth so;</A><br>
<A NAME=2.6.15>Too swift arrives as tardy as too slow.</A><br>
<p><i>Enter JULIET</i></p>
<A NAME=2.6.16>Here comes the lady: O, so light a foot</A><br>
<A NAME=2.6.17>Will ne'er wear out the everlasting flint:</A><br>
<A NAME=2.6.18>A lover may bestride the gossamer</A><br>
<A NAME=2.6.19>That idles in the wanton summer air,</A><br>
<A NAME=2.6.20>And yet not fall; so light is vanity.</A><br>
</blockquote>


So this is the text I'm aiming to parse. Luckily, [regular expressions](http://www.w3schools.com/jsref/jsref_obj_regexp.asp) are well suited to this task. For this particular play, I've prepared the matchers, found in `mit_shakespeare_regex.py`. Let's go ahead and try it out.

In [6]:
from mit_shakespeare_regex import matcher

line1 = romeo_juliet_raw[1943]
print(line1)

<A NAME=2.2.46>By any other name would smell as sweet;</A><br>



In [7]:
# Since line1 is a piece of dialogue, matcher.dialogue should return a match object when it searches the line
matcher.dialogue.search(line1)

<_sre.SRE_Match object; span=(0, 62), match='<A NAME=2.2.46>By any other name would smell as s>

In [8]:
# Since this line does not indicate which character is speaking, the search should return None (so nothing is displayed)
matcher.character.search(line1)

In [9]:
# A line that matcher.character will match
line2 = romeo_juliet_raw[1935]
matcher.character.search(line2)

<_sre.SRE_Match object; span=(0, 33), match='<A NAME=speech6><b>JULIET</b></a>'>
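
The contents of `mit_shakespeare_regex.py` are not shown here, so the following is only a rough sketch of how matchers of this kind could be built with Python's `re` module. The name `sketch_matcher`, its `act`, `scene` and `direction` attributes, and every pattern are assumptions modelled on the HTML excerpts above, not the library's actual definitions.

```python
import re
from types import SimpleNamespace

# Hypothetical patterns modelled on the HTML shown above; the real
# matchers in mit_shakespeare_regex.py may be written differently.
sketch_matcher = SimpleNamespace(
    # Speaker line, e.g. <A NAME=speech6><b>JULIET</b></a>
    character=re.compile(
        r"<A NAME=speech\d+><b>(?P<name>[^<]+)</b></a>", re.IGNORECASE
    ),
    # Dialogue line, e.g. <A NAME=2.2.46>By any other name ...</A><br>
    dialogue=re.compile(
        r"<A NAME=(?P<act>\d+)\.(?P<scene>\d+)\.(?P<line>\d+)>(?P<text>.*?)</A><br>",
        re.IGNORECASE,
    ),
    # Act heading, e.g. <H3>ACT III</h3>
    act=re.compile(r"<h3>ACT (?P<numeral>[IVXLC]+)</h3>", re.IGNORECASE),
    # Scene heading, e.g. <h3>SCENE I. A public place.</h3>
    scene=re.compile(r"<h3>SCENE (?P<numeral>[IVXLC]+)\.", re.IGNORECASE),
    # Stage direction, e.g. <i>Enter JULIET</i>
    direction=re.compile(r"<i>(?P<text>[^<]+)</i>", re.IGNORECASE),
)

line1 = "<A NAME=2.2.46>By any other name would smell as sweet;</A><br>"
line2 = "<A NAME=speech6><b>JULIET</b></a>"

print(sketch_matcher.dialogue.search(line1).group("text"))
print(sketch_matcher.character.search(line1))  # no speaker on this line
print(sketch_matcher.character.search(line2).group("name"))
```

Named groups make it easy to pull out the act, scene, and line numbers later, which is one reason to prefer them over plain positional groups.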

The last thing we need before we begin is a gender file specifying the gender of each character in the play. This has to be done by hand.

In [10]:
from utils import json_file_to_dict

gender = json_file_to_dict("plays/romeo_and_juliet_entire_play_gender.json")
print(gender)

{'FIRST CITIZEN': 'M', 'JULIET': 'F', 'LADY CAPULET': 'F', 'SECOND CAPULET': 'M', 'SECOND SERVANT': 'M', 'THIRD MUSICIAN': 'M', 'PETER': 'M', 'PARIS': 'M', 'PAGE': 'M', 'MUSICIAN': 'M', 'SAMPSON': 'M', 'ROMEO': 'M', 'SECOND MUSICIAN': 'M', 'SERVANT': 'M', 'CAPULET': 'M', 'PRINCE': 'M', 'THIRD WATCHMAN': 'M', 'GREGORY': 'M', 'ABRAHAM': 'M', 'APOTHECARY': 'M', 'LADY MONTAGUE': 'F', 'FRIAR JOHN': 'M', 'TYBALT': 'M', 'BENVOLIO': 'M', 'BALTHASAR': 'M', 'CHORUS': 'N', 'FRIAR LAURENCE': 'M', 'FIRST MUSICIAN': 'M', 'MERCUTIO': 'M', 'FIRST WATCHMAN': 'M', 'SECOND WATCHMAN': 'M', 'NURSE': 'F', 'MONTAGUE': 'M', 'FIRST SERVANT': 'M'}
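
Because the gender file is written by hand, a quick sanity check can catch typos before any analysis. The sketch below is not part of crunch-shake: `check_gender_map` and `ALLOWED` are hypothetical names, and it assumes that `'M'`, `'F'`, and `'N'` (the only codes visible in the output above) are the only valid values.

```python
from collections import Counter

# Assumption: these are the only valid codes, as seen in the output above.
ALLOWED = {"M", "F", "N"}

def check_gender_map(gender_map):
    """Return the entries whose gender code is not in ALLOWED."""
    return {name: code for name, code in gender_map.items() if code not in ALLOWED}

# A few entries copied from the dictionary printed above.
sample = {"JULIET": "F", "ROMEO": "M", "NURSE": "F", "MERCUTIO": "M", "CHORUS": "N"}

# Tally how many characters fall under each code.
print(Counter(sample.values()))

# A lowercase code (a deliberately introduced typo) is reported as invalid.
print(check_gender_map({**sample, "TYBALT": "m"}))
```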


## Parsing

Now we have everything necessary to start using crunch-shake to parse the text. First we need to get the speaking characters in the text. (I get them directly from the play. You might be wondering why I don't just use the gender file; well, I actually used `get_speaking_characters` to generate the gender file.)

In [11]:
from parse import get_speaking_characters

speaking = get_speaking_characters(romeo_juliet_raw, matcher.character)
print(speaking)

{'FIRST CITIZEN', 'JULIET', 'PRINCE', 'FRIAR JOHN', 'PAGE', 'FIRST WATCHMAN', 'SECOND MUSICIAN', 'CAPULET', 'GREGORY', 'ABRAHAM', 'FIRST MUSICIAN', 'SECOND SERVANT', 'TYBALT', 'BENVOLIO', 'SECOND WATCHMAN', 'NURSE', 'APOTHECARY', 'LADY CAPULET', 'ROMEO', 'SECOND CAPULET', 'FRIAR LAURENCE', 'THIRD MUSICIAN', 'PETER', 'MUSICIAN', 'SERVANT', 'CHORUS', 'PARIS', 'LADY MONTAGUE', 'SAMPSON', 'BALTHASAR', 'THIRD WATCHMAN', 'MERCUTIO', 'MONTAGUE', 'FIRST SERVANT'}
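
The source of `get_speaking_characters` is not shown here, so the following is only a guess at how such a function could work, plus one way a gender file template might be produced from its result. `collect_speakers`, `gender_template`, and the `CHARACTER` pattern are hypothetical names; the upper-casing step is an assumption based on the raw HTML containing `Nurse` while the set above contains `NURSE`.

```python
import json
import re

# Hypothetical stand-in for matcher.character.
CHARACTER = re.compile(r"<A NAME=speech\d+><b>([^<]+)</b></a>", re.IGNORECASE)

def collect_speakers(lines, character_re):
    """Collect the upper-cased name from every line that introduces a speaker."""
    speakers = set()
    for line in lines:
        match = character_re.search(line)
        if match:
            speakers.add(match.group(1).strip().upper())
    return speakers

def gender_template(speakers, placeholder="?"):
    """Build a JSON skeleton with one entry per speaker, to be filled in by hand."""
    return json.dumps({name: placeholder for name in sorted(speakers)}, indent=2)

sample = [
    "<A NAME=speech81><b>ROMEO</b></a>\n",
    "<A NAME=2.4.183>Farewell; commend me to thy mistress.</A><br>\n",
    "<A NAME=speech82><b>Nurse</b></a>\n",
]
speakers = collect_speakers(sample, CHARACTER)
print(gender_template(speakers))
```

Generating the skeleton from the play itself guarantees that the hand-edited file has exactly one key per speaking character.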


In [13]:
from parse import parse_raw_text

play_lines = parse_raw_text(romeo_juliet_raw, speaking, matcher)
for line in play_lines[:20]:
    print(line)

Act(act=1)
Scene(scene=1)
Enter SAMPSON and GREGORY, of the house of Capulet, armed with swords and bucklers : ["Enter - ['SAMPSON', 'GREGORY', 'CAPULET']"] : None
SAMPSON
1.1 :Gregory, o' my word, we'll not carry coals. : None
GREGORY
1.1 :No, for then we should be colliers. : None
SAMPSON
1.1 :I mean, an we be in choler, we'll draw. : None
GREGORY
1.1 :Ay, while you live, draw your neck out o' the collar. : None
SAMPSON
1.1 :I strike quickly, being moved. : None
GREGORY
1.1 :But thou art not quickly moved to strike. : None
SAMPSON
1.1 :A dog of the house of Montague moves me. : None
GREGORY
1.1 :To move is to stir; and to be valiant is to stand: : None
1.1 :therefore, if thou art moved, thou runn'st away. : None
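
The first stage direction in the output above lists `'CAPULET'` among the entering characters, which suggests that names in stage directions are resolved against the set of speaking characters (here picking up "the house of Capulet"). How `parse_raw_text` actually does this is not shown; the sketch below is one possible approach, and `characters_in_direction` is a hypothetical helper, not a crunch-shake function.

```python
import re

def characters_in_direction(direction, speakers):
    """Return the known speaker names mentioned in a stage direction, in order."""
    if not speakers:
        return []
    # Try longer names first so "LADY CAPULET" is preferred over "CAPULET".
    names = sorted(speakers, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    # Upper-case the direction so "Nurse" and "Capulet" match the speaker set.
    return pattern.findall(direction.upper())

speakers = {"SAMPSON", "GREGORY", "CAPULET", "LADY CAPULET", "NURSE"}

print(characters_in_direction(
    "Enter SAMPSON and GREGORY, of the house of Capulet, "
    "armed with swords and bucklers",
    speakers,
))
print(characters_in_direction("Enter LADY CAPULET and Nurse", speakers))
```

Case-insensitive matching of this kind is simple but over-eager: it treats a house name as a character, which is the same behaviour visible in the parsed output above.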


## Processing