# Parsing HTML
This notebook converts the HTML Friends transcripts at https://fangj.github.io/friends/ into a Python-readable format.
It writes a single CSV file containing every utterance that could be parsed (covering 166 of the 236 episodes), along with some metadata. The CSV uses the same columns as the MELD dataset.

In [None]:
from bs4 import BeautifulSoup
import requests
# https://fangj.github.io/friends/
# https://fangj.github.io/friends/season/SSEE.html
'''
filepath = "../Friends_html_data/S4E20.html"

with open(filepath, "r", encoding="utf-8") as infile:
    soup = BeautifulSoup(infile, "html.parser")
'''
url = "https://fangj.github.io/friends/season/0101.html"
r = requests.get(url)
# check r.status_code == 200
soup = BeautifulSoup(r.text, "html.parser") # r.text is an html string in this case, so pass directly to BS

In [None]:
texts = soup.find_all("p")
for line in texts:
    parsed_line = line.get_text().replace(u'\xa0', u' ').replace("\n", " ")
    #if "(" in parsed_line:
    print("Scene" in parsed_line)

In [None]:
get_dialogues(soup)

In [None]:
### Treat [] (and "Commercial Break") as breaks between dialogues (I think); also look for "Opening Credits"
### Get rid of anything between () first? There's some within dialogue, "Paul: (entering from Monica's room) Morning.",
### and others as description-type stuff "(There's a knock on the door and it's Paul.)"


### S9E8 gets a little weird, but looks pretty uniform before that... On and after S9E8:
### Some <> for direction-type stuff, too (later seasons only?) "Rachel: Amy! <pause> Yes I do.. I really do. <grabs   Ross' hand for support>"
### Also keep track of lines like "Monica to Emma: Hey you.", and "Amy coming out of the bathroom: Hey. Hey where's the baby?"
### Also unmarked direction:
# "Amy: <points to Chandler> This guy? Seriously?
# Later in the day.
# Monica: Okay! It's time for dinner..."
### Also "-Cut to Rachel (Phone ringing)"
### Also extra : "Monica:: what's the big deal, you f..."



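lines = []
text = line.get_text() # `line` is the last <p> tag left over from the loop above
if "[" in text:
    prev_line = text[:text.index("[")] # keep only what precedes the first "["
    if prev_line != "":
        lines.append(prev_line)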
    

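The notes in the cell above list several irregularities in the later transcripts, such as `<...>` stage directions and speaker labels like "Monica to Emma". The sketch below shows one way these two cases could be handled. It is not part of the pipeline in this notebook: `strip_angle_directions` and `normalise_speaker` are hypothetical helpers, and labels such as "Amy coming out of the bathroom" are deliberately left untouched.

```python
import re

# Hypothetical helpers; the rest of this notebook does not call them.
ANGLE_DIRECTION = re.compile(r"\s*<[^<>]*>")

def strip_angle_directions(text):
    """Remove <...> stage directions and tidy the whitespace left behind."""
    without = ANGLE_DIRECTION.sub("", text)
    return re.sub(r"\s{2,}", " ", without).strip()

def normalise_speaker(label):
    """Turn "Monica to Emma" into "Monica"; other labels are only stripped."""
    return label.split(" to ")[0].strip()

print(strip_angle_directions("Amy: <points to Chandler> This guy? Seriously?"))
print(normalise_speaker("Monica to Emma"))
```

The unmarked directions ("Later in the day.") cannot be detected by pattern alone, so they would still need a separate rule.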
In [None]:
def cut_bracketed_text(string):
    '''
    Given a string, eliminates any text between pairs of round brackets () (returns new string)
    '''
    new_string = ""
    in_bracket = False
    for char in string:
        if not in_bracket:
            if char == "(":
                in_bracket = True
            else:
                new_string += char
        else: #in_bracket currently
            if char == ")":
                in_bracket = False
                if len(new_string)>0 and new_string[-1]==" ":
                    new_string = new_string[:-1] # eliminate double space (hopefully)
    return new_string

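`cut_bracketed_text` tracks a single open/closed flag, so a nested pair such as "(a (b) c)" leaves " c)" behind. If nested directions turn out to matter, a depth counter is one possible alternative. This is a sketch under that assumption; `cut_nested_brackets` is a hypothetical name, and unlike the function above it also collapses every run of whitespace and strips the ends.

```python
def cut_nested_brackets(text):
    """Drop everything inside (...) pairs, including nested pairs."""
    kept = []
    depth = 0  # number of currently open round brackets
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(char)  # an unmatched ")" is kept as ordinary text
    # Collapse the whitespace left where a bracketed span was removed.
    return " ".join("".join(kept).split())

print(cut_nested_brackets("Paul: (entering from Monica's room) Morning."))
```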
In [None]:
def find_source(string):
    '''
    Given a string, returns a (speaker, utterance) tuple if the string has the form
    "character: utterance" (exactly one colon); otherwise returns a category string.
    Anything that does not look like "character: utterance" is skipped, so the only
    special handling needed is for a) lines with that format that aren't actually dialogue
    (like "[Scene: ...]"), and b) non-dialogue lines, which may mark the end of a dialogue.
    '''
    string_split = string.split(":")
    if len(string_split) != 2:
        # zero colons, or more than one: not a plain "character: utterance" line
        return parse_non_utterance(string)
    if "Scene" in string or "Written by" in string:
        return "dialogue_break"
    speaker, utterance = string_split
    parsed_utt = cut_bracketed_text(utterance)
    # strip the whitespace left around the colon and around removed brackets
    return (speaker.strip(), parsed_utt.strip())
    
# Next function will look for categories returned:
# if cat=="dialogue_break":
#     perform_dialogue_break
# else: 
#     do_nothing (discard line)

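Because `find_source` requires exactly one colon, a line such as "Ross: It's 9:30 already." is treated as a non-utterance and dropped. Splitting on the first colon only is one way to keep such lines. The sketch below assumes that approach; `split_speaker_line` is a hypothetical helper, and a caller would still need to test for "[Scene: ...]" lines first, since those also contain a colon.

```python
def split_speaker_line(line):
    """Return (speaker, utterance) split at the first colon, or None."""
    speaker, colon, utterance = line.partition(":")
    if not colon or not speaker.strip():
        return None  # no colon at all, or nothing before it
    # lstrip(":") also absorbs a doubled colon such as "Monica:: ..."
    return speaker.strip(), utterance.lstrip(":").strip()

print(split_speaker_line("Ross: It's 9:30 already."))
```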
In [None]:
def parse_non_utterance(string):
    '''
    This function takes a string that does not contain exactly one colon and returns
    a 'type' according to those seen in the Friends files. Different 'types' of strings
    require different actions: mostly we just discard them, but some indicate a break
    in dialogue that must be handled. We might also keep an eye out for utterances
    mislabelled as non-utterances (when an utterance has a colon embedded in it)
    '''
    cat = None
    if "Scene" in string:
        cat = "dialogue_break"
    elif "Time Lapse" in string:
        cat = "dialogue_break"
    else:
        cat = ""
    return cat
    
    
    

In [None]:
string = "(Chandler acts disgusted, but is happy that Joey has stopped snoring. However, just as he is about to leave, Joey starts snoring again. So to get him to stop, he slams the door shut, waking Joey.)"
string=cut_bracketed_text(string)
string

#### To do:
* Find way to work with HTML files one by one (retrieve from https://fangj.github.io/friends/)
* For each line, find_source first, then cut_bracketed_text on utterances
* Deal with types of returns from find_source: create dialogues (lists of source, utterance pairs), and start new dialogue when a dialogue_break is found
* Analyse output, cross fingers tightly
* If time, do stuff with post S9E7 files (okay probably don't bother with this)

##### Format for output:
Sr No.,Utterance,Speaker,Emotion,Sentiment,Dialogue_ID,Utterance_ID,Season,Episode,StartTime,EndTime
1,also I was the point person on my companys transition from the KL-5 to GR-6 system.,Chandler,neutral,neutral,0,0,8,21,"00:16:16,059","00:16:21,731"

In [None]:
def get_dialogues(soup):
    '''
    Takes a beautiful soup html object from https://fangj.github.io/friends/
    and extracts dialogues: lists of utterances paired with speakers
    '''
    dialogues = []
    current_dialogue = []
    texts = soup.find_all("p")
    for line in texts:
        # gets the text of each line. If it's an utterance, adds to dialogue.
        # If it's a dialogue break (as defined by parse_non_utterance()), finishes current
        # dialogue and starts new one. If it's neither, just skips it.
        parsed_line = line.get_text().replace(u'\xa0', u' ').replace("\n", " ")
        return_value = find_source(parsed_line)
        if isinstance(return_value, tuple): # line is an utterance, return is (speaker, utt)
            current_dialogue.append(return_value)
        elif return_value == "dialogue_break":
            if current_dialogue: # close the current dialogue, if there is one
                dialogues.append(current_dialogue)
                current_dialogue = []
        elif return_value != "": # no other values currently valid
            print(f"Error: unrecognized line category {return_value!r}")

    if current_dialogue: # keep the last dialogue when the episode does not end on a break
        dialogues.append(current_dialogue)
    return dialogues
    

In [None]:
print(f"{2:02d}")

In [None]:
import requests
from bs4 import BeautifulSoup
num_eps = [24, 24, 25, 23, 23, 24, 24, 23] # eps per season, index(+1) is season number
seasons = list(range(1,9))
Sr_No = 1
Emotion = None
Sentiment = None
StartTime = None
EndTime = None
rows = []

for season in seasons:
    eps = num_eps[season-1]
    for ep in range(1, eps+1):
        url = f"https://fangj.github.io/friends/season/{season:02d}{ep:02d}.html"
        print(f"working on {season:02d}{ep:02d}...")
        r = None
        for attempt in range(10):
            try:
                r = requests.get(url, timeout=10)
            except requests.RequestException:
                r = None # network error or timeout, try again
            if r is not None and r.status_code == 200:
                break # request was good, move on
        else:
            # all 10 attempts failed
            print(f"retrieving {season:02d}{ep:02d} failed. Moving on.")
            continue
        soup = BeautifulSoup(r.text, "html.parser")
        parsed_dialogues = get_dialogues(soup) # list of lists of (speaker, utterance)
        print(f"Num dialogues, {season}-{ep}: {len(parsed_dialogues)}")
        Dialogue_ID = 0
        for dialogue in parsed_dialogues:
            Utterance_ID = 0
            for speaker, utterance in dialogue:
                utt_dict = {"Sr No.":Sr_No, "Utterance":utterance, "Speaker":speaker, "Emotion":None, "Sentiment":None,\
                            "Dialogue_ID":Dialogue_ID, "Utterance_ID":Utterance_ID, "Season":season,\
                            "Episode":ep, "StartTime":None, "EndTime":None}
                rows.append(utt_dict)
                Utterance_ID += 1
                Sr_No += 1
            Dialogue_ID += 1
            
  

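The download loop above retries a failed request immediately. A short pause that grows between attempts is usually kinder to the server. The sketch below illustrates that idea separately from `requests`; `fetch_with_retries` is a hypothetical helper that takes any zero-argument callable returning the page text, or `None` on failure.

```python
import time

def fetch_with_retries(fetch, max_tries=10, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it returns a non-None result or max_tries is reached."""
    for attempt in range(max_tries):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_tries - 1:
            # wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    return None  # every attempt failed
```

In this notebook, `fetch` could wrap `requests.get(url, timeout=10)` and return `r.text` when `r.status_code == 200`, and `None` otherwise. The `sleep` parameter exists only so the delay can be replaced in tests.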
In [None]:
episodes = set()
for row in rows:
    season, ep, utt = row["Season"], row["Episode"], row["Utterance"]
    episodes.add((season, ep))
print(sorted(episodes))
print(len(episodes))

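The cell above shows which episodes produced at least one row. To see which ones produced none, the same set can be compared against every expected (season, episode) pair. A minimal sketch; `missing_episodes` is a hypothetical helper that assumes the per-season counts are passed in the same form as `num_eps`.

```python
def missing_episodes(parsed, episodes_per_season):
    """Return the sorted (season, episode) pairs that are expected but not in parsed."""
    expected = {
        (season, ep)
        for season, count in enumerate(episodes_per_season, start=1)
        for ep in range(1, count + 1)
    }
    return sorted(expected - set(parsed))

# Toy example: two seasons with 3 and 2 episodes, three of them parsed.
print(missing_episodes({(1, 1), (1, 3), (2, 1)}, [3, 2]))
```

With the variables from this notebook, the call would be `missing_episodes(episodes, num_eps)`.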
In [None]:
import csv

# newline='' stops the csv module from writing blank rows between records on Windows
with open('friends_html_data.csv', 'w', newline='', encoding="utf-8") as outfile:
    headers = ["Sr No.","Utterance","Speaker","Emotion","Sentiment","Dialogue_ID","Utterance_ID","Season",
               "Episode","StartTime","EndTime"]
    writer = csv.DictWriter(outfile, fieldnames=headers)
    writer.writeheader()
    writer.writerows(rows)

# HTML Data preprocessing

This portion converts the scraped HTML Friends transcript data into a pkl file for use in our other code.

This code was originally written to preprocess the MELD data, which is why we don't just save the HTML data as an appropriate pkl file in the first place.

The code below replaces stray Windows-1252 characters (curly quotes, dashes, ellipses) with ASCII equivalents, tokenizes each utterance, and saves the result to a pickle file.

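The character clean-up in this section can also be expressed as a single translation table, which avoids chaining intermediate variables. The sketch below uses the same replacements as the cell that follows; `CP1252_FIXES` and `fix_cp1252` are hypothetical names. These characters most likely come from Windows-1252 text that was decoded as Latin-1, which is an assumption about the source pages.

```python
# Same replacements as the preprocessing cell, gathered in one table.
CP1252_FIXES = str.maketrans({
    "\x85": "...",  # ellipsis
    "\x91": "'",    # left single quote
    "\x92": "'",    # right single quote / apostrophe
    "\x93": "'",    # left double quote (mapped to a single quote, as in the cell below)
    "\x94": "'",    # right double quote (same mapping)
    "\x96": "-",    # en dash
    "\x97": "-",    # em dash
})

def fix_cp1252(text):
    """Replace the stray characters listed in CP1252_FIXES."""
    return text.translate(CP1252_FIXES)

print(fix_cp1252("It\x92s fine\x85 \x96 ok"))
```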
In [None]:
import pandas as pd
import spacy
import pickle
nlp = spacy.load("en_core_web_sm")
from nltk.tokenize import word_tokenize
import csv

data = "./friends_html_data.csv"
df = pd.read_csv(data)

i_text = df['Utterance'].tolist()
sources = df['Speaker'].tolist()
seasons = df['Season'].tolist()
episodes = df['Episode'].tolist()
sr_numbers = df['Sr No.'].tolist()

list_of_dicts = []
for i, item in enumerate(i_text):
    if isinstance(item, str): # skip empty utterances, which pandas reads as NaN
        # replace stray Windows-1252 characters with ASCII equivalents
        processed_text = (item.replace('\x92', "'")
                              .replace('\x96', "-")
                              .replace('\x97', "-")
                              .replace('\x85', "...")
                              .replace('\x91', "'")
                              .replace('\x93', "'")
                              .replace('\x94', "'"))
        tokens = word_tokenize(processed_text)
        processed_dict = {}
        processed_dict['source'] = sources[i]
        processed_dict['utt'] = processed_text
        processed_dict['tok_utt'] = tokens
        processed_dict['season'] = seasons[i]
        processed_dict['episode'] = episodes[i]
        processed_dict['sr_number'] = sr_numbers[i]
        list_of_dicts.append(processed_dict)
    
with open("Preprocessed_html_data.pkl", "wb") as outfile:
    pickle.dump(list_of_dicts, outfile)