This notebook parses the movie scripts and saves the well-structured dialogues extracted from them.

In the end we obtained 945 labelled files.

In [1]:
import os
import re

import pandas as pd
from bs4 import BeautifulSoup  # HTML parsing in Parser

## Parse the scripts and segment dialogues

The scripts are poorly formatted: although they are HTML files, the lines are not wrapped in meaningful tags, so we cannot parse them by tag structure. Fortunately, every character name and every scene change is bolded with the tag 'b', and we have files listing the characters of each script. So we first extract all bolded lines and subtract the character names; what remains are the scene changes, which give us the boundaries of the dialogues.
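
The idea can be sketched on toy data. This is only an illustration under stated assumptions: the scene headings, character names and lines below are made up, and the real notebook collects the bolded strings with BeautifulSoup rather than from a hand-written set.

```python
# Toy stand-ins for the real inputs (hypothetical values, not taken from any script).
bolded = {"INT. KITCHEN - DAY", "ALICE", "BOB", "EXT. STREET - NIGHT"}  # text of every <b> tag
speakers = {"ALICE", "BOB"}  # character names, as read from the speakers file

# Whatever is bolded but is not a character name is treated as a scene change.
scene_headings = bolded - speakers

# Flattened text of the script, in document order.
texts = [
    "INT. KITCHEN - DAY", "ALICE", "Coffee?", "BOB", "Please.",
    "EXT. STREET - NIGHT", "ALICE", "Taxi!",
]

dialogues = []  # one list of (speaker, line) pairs per scene
for i, text in enumerate(texts):
    if text in scene_headings:
        dialogues.append([])  # a scene heading opens a new dialogue
    elif text in speakers and i + 1 < len(texts):
        if not dialogues:
            dialogues.append([])  # lines that appear before the first heading
        dialogues[-1].append((text, texts[i + 1]))

print(dialogues)
```

Each inner list is one dialogue; the string that follows a character name is taken as that character's line.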

In [234]:
folder_path = '/Users/yan/Documents/document/EPFL/MA2/semesterprj/datasets/scripts/'
path = folder_path+'scripts'

files = []
# r=root, d=directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        if file.endswith('.html') and not file.startswith(('fd', 'tbbt', 'friends')):
            test_path = folder_path+'sentences/'+file.split('.html')[0]+'-speakers.txt'
            if os.path.exists(test_path):
                files.append(file.split('.html')[0])


In [329]:
def Parser(files,j):
    """
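    Parse one .html script and save its dialogue lines as a .txt file in 'parsed/'.
    Inputs:
        files: list of script names (file names without the '.html' extension).
        j: index in files of the script to parse; also used to build the MovieID.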
    """
    file_path = folder_path+'sentences/'+files[j]+'-speakers.txt' # read the file with speakers in that script

    with open(file_path) as f:
        lines = f.readlines()
    tags_speakers = []
    for line in lines: 
        tags_speakers.append(line.replace('continued','').upper().rstrip())
    tags_speakers = set(tags_speakers) # characters in the scripts
    
    file_path = folder_path+'scripts/'+files[j]+'.html'
    soup = BeautifulSoup(open(file_path, errors='ignore'))
    
    tags = []
    for a in soup.find_all('b'):
        tags.append(a.string.rstrip().lstrip())
    tags = set(tags) # all bolded tags
    tags_background = tags - tags_speakers # only changes of scenes are left

    texts = [' '.join(x.rstrip().lstrip().split('\n\n')[0].split()) for x in soup.strings if str.strip(x) != '']

    idxs = []
    idxs_bg = []
    speaker = []
    lines = []
    for i in range(len(texts)):
        if texts[i] in tags_background:
            idxs_bg.append(i)
        if texts[i] not in tags_speakers:
            continue
        else:
            speaker.append(texts[i])
            line = re.sub(u"\\(.*?\\)|\\{.*?}|\\[.*?]|\\♪.*?♪|\\#.*?#|\\=.*?=|\\¶.*?¶", "", texts[i+1])
            lines.append(line)
            idxs.append(i)
            
    s = pd.Series(idxs)
    boundaries = pd.cut(s,idxs_bg, labels=False, retbins=False, right=False).to_numpy()
    boundaries = [1]+list((boundaries[1:] != boundaries[:-1])*1)
    
    MovieID = ['m%s'%(str(j))] * len(lines)
    MovieName = [files[j]] * len(lines)

    dialogue = pd.DataFrame([speaker,lines,boundaries,MovieID,MovieName]).T
    dialogue.columns = ['Speaker','Line','Label','MovieID','MovieName']
    dialogue.to_csv(folder_path+'parsed/'+MovieID[0]+'.txt',sep=',', index=False, header=False)
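
The labelling step in `Parser` can be illustrated without pandas. The sketch below is a stand-in, not the notebook's code: `boundary_labels` is a hypothetical helper that uses the standard-library `bisect` module instead of `pd.cut`, and the positions are made up. One difference worth noting: `pd.cut` leaves values outside the given bins as NaN, whereas this version also assigns a scene to lines before the first heading or after the last one.

```python
from bisect import bisect_right


def boundary_labels(line_idxs, heading_idxs):
    """Return 1 for the first line of each scene and 0 otherwise.

    line_idxs: sorted positions of speaker lines in the flattened text.
    heading_idxs: sorted positions of scene headings in the same text.
    """
    # Scene number of a line = how many headings occur before it.
    scenes = [bisect_right(heading_idxs, i) for i in line_idxs]
    labels = []
    prev = None
    for scene in scenes:
        labels.append(1 if scene != prev else 0)
        prev = scene
    return labels


# Headings at positions 0, 5 and 10; speaker lines at 1, 3, 6, 8 and 11.
print(boundary_labels([1, 3, 6, 8, 11], [0, 5, 10]))  # [1, 0, 1, 0, 1]
```

As in `Parser`, the first line always gets label 1, and every later 1 marks a change of scene.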


In [333]:
for j in range(len(files)):
    try:
        Parser(files,j)
    except Exception:  # report the scripts that fail to parse
        print(j,files[j])

9 Crow-Salvation,-The
12 Star-Trek-First-Contact
23 i-walked_with_a_zombie
41 Who-Framed-Roger-Rabbit%3f
42 Platoon
43 the-x-files_production
51 Pitch-Black
56 pet-sematary
62 thethinman
93 Leaving-Las-Vegas
134 Stepmom
136 oneflewover
142 Star-Trek-Generations
147 Bones
149 Minority-Report
150 natural-born-killers_early
170 Buffy-the-Vampire-Slayer
177 Star-Trek-The-Motion-Picture
179 fivefeetandrising
181 Crying-Game
193 Clueless
197 natural-born-killers_shoot
215 Tremors
243 Sixth-Sense,-The
256 John-Q
273 Orgy-of-the-Dead
284 Almost-Famous
288 Blast-from-the-Past,-The
293 Anastasia
296 Memento
298 Aladdin
303 fabulous_baker_boys_final
305 Blade-II
308 hellraiser_ii
313 mission-impossible-2_shoot
315 English-Patient,-The
322 Independence-Day
340 halloween
364 Apartment,-The
368 kundun
373 Shampoo
374 True-Romance
378 Star-Trek-II-The-Wrath-of-Khan
390 Life-As-A-House
413 Little-Mermaid,-The
427 Red-Planet
432 Withnail-and-I
434 Heavy-Metal
436 Pearl-Harbor
442 thetimemachine_1959
45

In [10]:
folder_path = '/Users/yan/Documents/document/EPFL/MA2/semesterprj/datasets/scripts/parsed'

In [13]:
# load all files in that folder
info = [os.path.join(folder_path,file) for file in os.listdir(folder_path) if file.endswith('.txt') ]
script_data_set = pd.concat((pd.read_csv(f,header=None) for f in info))

In [15]:
script_data_set.columns = ['Speaker','Line','Label','MovieID','MovieName']

In [16]:
script_data_set.to_csv('/Users/yan/Documents/document/EPFL/MA2/semesterprj/datasets/scripts/script_data_set.csv',index=None)

In [17]:
script_data_set.head()

Unnamed: 0,Speaker,Line,Label,MovieID,MovieName
0,MODERATOR,Tonight we'll discuss a subject most of us see...,1,m478,Midnight-Cowboy
1,IRATE WOMAN,"They always put it that way, but well, all it ...",1,m478,Midnight-Cowboy
2,COOL WOMAN,"This, this image of the, the man eating woman....",1,m478,Midnight-Cowboy
3,SAD WOMAN,"No, I never had, well, whatever it is you call...",1,m478,Midnight-Cowboy
4,SAD WOMAN'S VOICE,... but it's a problem. A big problem. With so...,1,m478,Midnight-Cowboy


In [19]:
len(script_data_set)

718524

In [20]:
script_data_set.Label.sum()

162832

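So we have 718524 sentences (one row per speaker line) and 162832 dialogues: Label is 1 on the first line of each dialogue, so summing it counts the dialogues.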
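
The counting can be checked on a small example. The `labels` list below is hypothetical; it only mimics the shape of the Label column.

```python
from itertools import accumulate

# Hypothetical Label column for five lines: 1 marks the first line of a dialogue.
labels = [1, 0, 0, 1, 0]

n_lines = len(labels)      # number of lines
n_dialogues = sum(labels)  # number of dialogues

# A running sum turns the boundary flags into one dialogue id per line.
dialogue_ids = list(accumulate(labels))

print(n_lines, n_dialogues, dialogue_ids)  # 5 2 [1, 1, 1, 2, 2]
```

On the real data the same running sum would presumably be `script_data_set.Label.cumsum()`, which could then be used to group lines by dialogue.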