## Mini-assignment 2: Out Out Damn Speaker!
- This notebook does the following:
    - Loads the plain text version of <i>Macbeth</i> on Project Gutenberg.
    - Isolates the speeches only; removes stage directions and other non-speech lines (act and scene headings).
    - Further removes the speaker names.

In [1]:
# Imports:
from urllib import request
from bs4 import BeautifulSoup
import re

In [2]:
# Function to extract the content from the given URL:
def extract(url):
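    resp = request.urlopen(url).read() # raw bytes; this URL serves plain text, not HTML
    html = BeautifulSoup(resp) # no parser named, so bs4 picks the best one installed
    body = html.find('body') # lxml and html5lib wrap bare text in <body>; html.parser does not
    script = body.text if body is not None else html.get_text() # fall back to all text if there is no <body>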
    return script

# Macbeth URL:
url = "http://www.gutenberg.org/cache/epub/1533/pg1533.txt"
script = extract(url)
print(script[:100]) # print preview

Project Gutenberg Etext of Macbeth by Shakespeare
PG has multiple editions of William Shakespeare's
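
Because the URL points at a `.txt` file, the downloaded bytes could also be decoded directly, without an HTML parser. The sketch below is only an illustration: `to_lines` is a hypothetical helper, the UTF-8 encoding is an assumption about the file, and the sample bytes stand in for the real download. It also shows that `splitlines()` drops the `'\r'` that `split('\n')` leaves at the end of every line in the previews further down.

```python
def to_lines(raw):
    """Decode downloaded bytes and split them into lines without line endings."""
    # 'utf-8-sig' is an assumed encoding; it also drops a leading BOM if present.
    text = raw.decode('utf-8-sig')
    # splitlines() treats '\r\n' as a single line break, so no '\r' is left behind.
    return text.splitlines()

# Stand-in for request.urlopen(url).read(); not the real file contents.
sample = b'\xef\xbb\xbfFIRST WITCH.\r\nWhen shall we three meet again?\r\n'
print(to_lines(sample))
```

Note that the regexes later in this notebook expect the trailing `'\r'`, so switching to this approach would mean adjusting them as well.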


## Part 1: Isolate only speeches in the play (not stage directions or other indications)

NOTE: I get rid of the Gutenberg Intro and Conclusion in the loop.

In [3]:
inp = script.split('\n') # Split on newline; returns a list
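# regex for non-speech lines: act/scene headings and bracketed stage directions
# (lines still end in '\r' here, so the `.*\]` branch cannot match before `$`)
exp1 = r'^(SCENE.*|ACT.*|\[.*|.*\])$'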

flag = False # flag variable I use to get rid of the Gutenberg intro

speeches = [] # list to store actual speeches

for line in inp[:-5]: # drop the last 5 lines (the Gutenberg conclusion, not speeches)
    if 'FIRST WITCH.' in line: # the first line of the script proper
        flag = True

    if flag: # everything before "FIRST WITCH." is skipped because flag is still False
        if not re.match(exp1, line): # filter out stage directions and act/scene headings
            speeches.append(line) # Add to the speeches
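
The hard-coded `[:-5]` depends on the exact length of the footer. A marker-based slice is one possible alternative, sketched below. `slice_play` is a hypothetical helper, and the default `end_marker` string is an assumption: it would need to be checked against the actual closing lines of this edition.

```python
def slice_play(lines, start_marker='FIRST WITCH.', end_marker='End of Project Gutenberg'):
    """Return the lines from the first start_marker up to (not including) end_marker."""
    # Index of the first line containing the start marker, or None if absent.
    start = next((i for i, line in enumerate(lines) if start_marker in line), None)
    if start is None:
        return []
    # First line at or after `start` containing the end marker; default to end of list.
    end = next((i for i in range(start, len(lines)) if end_marker in lines[i]), len(lines))
    return lines[start:end]

# Made-up sample lines, only to show the slicing behaviour.
demo = ['Project Gutenberg header', '', 'FIRST WITCH.',
        'When shall we three meet again?', '', 'End of Project Gutenberg footer', 'license text']
print(slice_play(demo))
```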

In [4]:
# Previewing the first and last 50 lines only, but it works!
print(speeches[:50])
print(speeches[-50:])

['FIRST WITCH.\r', 'When shall we three meet again?\r', 'In thunder, lightning, or in rain?\r', '\r', 'SECOND WITCH.\r', "When the hurlyburly's done,\r", "When the battle's lost and won.\r", '\r', 'THIRD WITCH.\r', 'That will be ere the set of sun.\r', '\r', 'FIRST WITCH.\r', 'Where the place?\r', '\r', 'SECOND WITCH.\r', 'Upon the heath.\r', '\r', 'THIRD WITCH.\r', 'There to meet with Macbeth.\r', '\r', 'FIRST WITCH.\r', 'I come, Graymalkin!\r', '\r', 'ALL.\r', 'Paddock calls:--anon:--\r', 'Fair is foul, and foul is fair:\r', 'Hover through the fog and filthy air.\r', '\r', '\r', '\r', '\r', '\r', 'with Attendants, meeting a bleeding Soldier.]\r', '\r', 'DUNCAN.\r', 'What bloody man is that? He can report,\r', 'As seemeth by his plight, of the revolt\r', 'The newest state.\r', '\r', 'MALCOLM.\r', 'This is the sergeant\r', 'Who, like a good and hardy soldier, fought\r', "'Gainst my captivity.--Hail, brave friend!\r", 'Say to the king the knowledge of the broil\r', 'As thou didst leave 
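
The preview above still contains `'with Attendants, meeting a bleeding Soldier.]\r'`, the tail of a stage direction that spans two lines. A line-by-line regex cannot know that it is inside an open bracket, so a small stateful filter is one way to handle this. The sketch below is an assumption-laden illustration, not the notebook's method: `drop_stage_directions` is a hypothetical helper, it only treats a line as a direction when the line starts with `[`, and the sample lines are partly made up.

```python
import re

# Act/scene headings; \b keeps words such as "ACTION" from matching (a tighter variant of exp1).
HEADING = re.compile(r'^(SCENE|ACT)\b')

def drop_stage_directions(lines):
    """Keep speech lines; skip headings and bracketed directions, including multi-line ones."""
    kept = []
    inside = False  # True while between an opening '[' and its closing ']'
    for raw in lines:
        line = raw.rstrip('\r\n')  # remove the line ending before testing
        if inside:
            if ']' in line:
                inside = False  # the direction ends on this line
            continue
        if HEADING.match(line):
            continue
        if line.lstrip().startswith('['):
            if ']' not in line:
                inside = True  # the direction continues on the next line
            continue
        kept.append(line)
    return kept

# Illustrative sample; the opening line of the long direction is abbreviated.
sample_lines = ['Hover through the fog and filthy air.\r', '\r', '[Exeunt.]\r',
                'SCENE II. A camp.\r', '[Alarum within. Enter Duncan and others,\r',
                'with Attendants, meeting a bleeding Soldier.]\r', '\r',
                'DUNCAN.\r', 'What bloody man is that? He can report,\r']
print(drop_stage_directions(sample_lines))
```

Unlike `exp1`, this version keeps a speech line that merely ends with a bracketed direction; whether that is preferable depends on how such lines should be treated.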

## Part 2: Removing speaker names

In [5]:
# regex for speaker names: up to two all-caps words, then '.', then the trailing '\r'
exp2 = r'^[A-Z]*\s?[A-Z]*\.\r$'

speaker_names = []

final_speeches = []

# Iterating through the speeches:
for speech in speeches:
    if re.match(exp2, speech): # removes speaker names
        speaker_names.append(speech.strip()) # strips the '\r' character at the end
    elif speech != '\r' and speech != '': # removes empty lines
        final_speeches.append(speech)

In [6]:
print(set(speaker_names))

{'MENTEITH.', 'SERVANT.', 'CAITHNESS.', 'BANQUO.', 'SOLDIER.', 'BOTH MURDERERS.', 'FIRST WITCH.', 'PORTER.', 'ROSS.', 'ATTENDANT.', 'ALL.', 'LADY MACDUFF.', 'MESSENGER.', 'MACDUFF.', 'OLD MAN.', 'SIWARD.', 'LORD.', 'HECATE.', 'SON.', 'GENTLEWOMAN.', 'ANGUS.', 'DUNCAN.', 'SOLDIERS.', 'LORDS.', 'APPARITION.', 'LENNOX.', 'SECOND WITCH.', 'FLEANCE.', 'THIRD MURDERER.', 'YOUNG SIWARD.', 'MACBETH.', 'MALCOLM.', 'LADY MACBETH.', 'SEYTON.', 'DOCTOR.', 'FIRST MURDERER.', 'THIRD WITCH.', 'SECOND MURDERER.', 'MURDERER.', 'DONALBAIN.'}
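
Since every part of `exp2` before the period is optional, it would also accept a line consisting of just `'.\r'`. That does not seem to matter for this text, but a slightly stricter check is sketched below for comparison. `SPEAKER` and `is_speaker` are hypothetical names, and the pattern assumes speaker labels are one or two all-caps words followed by a period, as in the set printed above.

```python
import re

# One all-caps word, optionally a second one, then a period.
SPEAKER = re.compile(r'[A-Z]+(?: [A-Z]+)?\.')

def is_speaker(line):
    """Return True if the line, minus its line ending, looks like a speaker label."""
    return SPEAKER.fullmatch(line.rstrip('\r\n')) is not None

for candidate in ['LADY MACBETH.\r', 'ALL.\r', 'The newest state.\r', '.\r']:
    print(repr(candidate), is_speaker(candidate))

# The pattern from the cell above, repeated here for comparison; it accepts a bare '.'.
exp2 = r'^[A-Z]*\s?[A-Z]*\.\r$'
print(bool(re.match(exp2, '.\r')))
```

A one-word, all-caps speech line ending in a period would still be taken for a speaker label by either pattern.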


In [7]:
# Previewing the first and last 50 lines only, but it works!
print(final_speeches[:50])
print(final_speeches[-50:])

['When shall we three meet again?\r', 'In thunder, lightning, or in rain?\r', "When the hurlyburly's done,\r", "When the battle's lost and won.\r", 'That will be ere the set of sun.\r', 'Where the place?\r', 'Upon the heath.\r', 'There to meet with Macbeth.\r', 'I come, Graymalkin!\r', 'Paddock calls:--anon:--\r', 'Fair is foul, and foul is fair:\r', 'Hover through the fog and filthy air.\r', 'with Attendants, meeting a bleeding Soldier.]\r', 'What bloody man is that? He can report,\r', 'As seemeth by his plight, of the revolt\r', 'The newest state.\r', 'This is the sergeant\r', 'Who, like a good and hardy soldier, fought\r', "'Gainst my captivity.--Hail, brave friend!\r", 'Say to the king the knowledge of the broil\r', 'As thou didst leave it.\r', 'Doubtful it stood;\r', 'As two spent swimmers that do cling together\r', 'And choke their art. The merciless Macdonwald,--\r', 'Worthy to be a rebel,--for to that\r', 'The multiplying villainies of nature\r', 'Do swarm upon him,--from the W

## fin.