## Table of Contents <a class="anchor" id="top"></a>
* [Data Preparation](#Data Prep)
* [Entity Resolution](#Entity)
* [Relation Extraction](#Relation)
* [Query System](#Query)

## Data Prep <a class="anchor" id="Data Prep"></a>
[[back to top]](#top)

In [1]:
%load_ext autoreload
%autoreload 2

# standard-library and third-party imports
import re
import nltk
import numpy as np
import pandas as pd
import os
from collections import Counter, defaultdict

#modeling functions & utilities
from pronounResolution import pronResolution_base, pronResolution_nnMod, pronResolution_nn, pronEval
from relationExtract import simpleRE, REEval, getRelations, extract_relation_categories

In [2]:
files = [x for x in os.listdir('prep_scripts') if '_gapi' in x]
for file in files:
    df = pd.read_csv('prep_scripts/' + file)[['speaker']]
    print(list(df.speaker.unique()))
    print('***')
    print('***')

['narrator', 'Dr. Hank Pym', 'Mitchell Carson', 'Howard Stark', 'Peggy Carter', 'Peachy', 'Scott Lang', 'Luis', 'Ice Cream Store Customer', 'Dale', 'Dave', 'Kurt', 'Pym Tech Gate Guard', 'Pym Tech Security Guard', 'Pym Tech Employee', 'Hope van Dyne', 'Darren Cross', 'Carson', 'Frank', 'Cassie Lang', 'Paxton', 'Hideous Rabbit', 'Maggie Lang', 'Scott', 'Cab Driver', 'Cop on Speaker', 'Detective', 'Voice over Radio', 'Sam Wilson', 'Scot Lang', 'Alpha Guard', 'Gale', 'Computer', 'Cell Phone', 'Pool BBQ Dad', 'Police Radio', 'Steve Rogers']
***
***
['narrator', 'Dr. Hank Pym', 'Mitchell Carson', 'Howard Stark', 'Peggy Carter', 'Peachy', 'Scott Lang', 'Luis', 'Ice Cream Store Customer', 'Dale', 'Dave', 'Kurt', 'Pym Tech Gate Guard', 'Pym Tech Security Guard', 'Pym Tech Employee', 'Hope van Dyne', 'Darren Cross', 'Carson', 'Frank', 'Cassie Lang', 'Paxton', 'Hideous Rabbit', 'Maggie Lang', 'Scott', 'Cab Driver', 'Cop on Speaker', 'Detective', 'Voice over Radio', 'Sam Wilson', 'Scot Lang', 'Al

Helper functions to load and annotate dataset

In [3]:
# files = [x for x in os.listdir('prep_scripts') if '_gapi' in x]
# df = pd.read_csv('prep_scripts/' + files[1])[['speaker', 'dialogue', 'sentences', 'sentiment', 'entities', 'tokens']]
# df['tokens'] = df['tokens'].apply(lambda x: eval(x))
# df['sentiment'] = df['sentiment'].apply(lambda x: eval(x))
# df['speaker'] = df['speaker'].apply(lambda x: x.strip())
# df['entities'] = df['entities'].apply(lambda x: eval(x))
# df.head()

# returns dataframe with script annotations
def loadScript(file_name):
    # read file
    df = pd.read_csv('prep_scripts/' + file_name)[['speaker', 'dialogue', 'sentences', 'sentiment', 'entities', 'tokens']]

    # evaluate strings for lists/dicts of tokens, sentiment, entities
    df['tokens'] = df['tokens'].apply(lambda x: eval(x))
    df['sentiment'] = df['sentiment'].apply(lambda x: eval(x))
    df['speaker'] = df['speaker'].apply(lambda x: x.strip())
    df['entities'] = df['entities'].apply(lambda x: eval(x))
    
    return df

# cList = list(df.speaker.unique())
# cCount = Counter(df.speaker)
# df['total_sent'] = df['sentiment'].apply(lambda x: x['score'] * x['magnitude'])
# cDict = dict(df.groupby('speaker').total_sent.sum())

# # number of pronouns for each line
# df['num_pron'] = df['tokens'].apply(lambda x: sum([int(t['pos'] == 'PRON') for t in x]))

# # total sentiment score for each line
# df['total_sent'] = df['sentiment'].apply(lambda x: x['score'] * x['magnitude'])

# #set nearby speakers
# charRange = 10
# nearbyList = np.dstack((df.shift(i).speaker.values for i in range(-charRange, charRange+1)))[0]
# df['nearbyChars'] = None
# for i, nearbyChars in enumerate(nearbyList):
#     df.set_value(i, 'nearbyChars', nearbyChars)

# df.head()


# enhances annotations with pronoun counts, nearby speakers, and sentiments for each line
def annotateScript(df):
    
    
    
    # number of pronouns for each line
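    # NOTE: isPersonPron is not defined or imported in this notebook; it must already be in scope when this runs
    df['num_pron'] = df['tokens'].apply(lambda x: sum([int((t['pos'] == 'PRON') and isPersonPron(t['lemma'].lower())) for t in x]))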

    # total sentiment score for each line
    df['total_sent'] = df['sentiment'].apply(lambda x: x['score'] * x['magnitude'])

    # previous and next speaker for each line
    df['speaker_prev'] = df.speaker.shift(1)
    df['speaker_next'] = df.speaker.shift(-1)

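    # set nearby speakers: for each line, the speakers within charRange lines before and after
    charRange = 10
    # np.dstack requires a sequence (recent NumPy versions reject a bare generator)
    nearbyList = np.dstack([df.shift(i).speaker.values for i in range(-charRange, charRange+1)])[0]
    # DataFrame.set_value was removed in pandas 1.0; assign one array of nearby speakers per row
    df['nearbyChars'] = list(nearbyList)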

    return df

# selects random lines to evaluate in annotated script with unknown entities (pronouns) resolved
def selectEvalLines(df, numExamples):
    
    # indexes for lines of dialogue with resolved pronouns
    pronIndex = list(df[df.num_pron > 0].index)
    
    # sample random line to evaluate resolved pronoun
    evalLines = np.random.choice(pronIndex, min(len(pronIndex), numExamples), replace=False)
    
    return evalLines
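
The loader above turns the stringified `tokens`, `sentiment`, and `entities` columns back into Python objects with `eval`, which executes whatever the CSV contains. A minimal sketch of the same step using `ast.literal_eval`, which only accepts literals, is shown below. The helper name `parse_literal_columns` and the two-line sample CSV are hypothetical and only mimic the layout of the `*_gapi.csv` files.

```python
import ast
import io

import pandas as pd

# Hypothetical two-line sample in the same stringified layout as the *_gapi.csv files
csv_text = '''speaker,sentiment,tokens
Steve Rogers ,"{'magnitude': 0.1, 'score': 0}","[{'content': 'Language', 'pos': 'NOUN'}]"
JARVIS,"{'magnitude': 1.5, 'score': 0.7}","[{'content': 'The', 'pos': 'DET'}]"
'''

def parse_literal_columns(df, columns):
    """Parse stringified Python literals in the given columns without executing code."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].apply(ast.literal_eval)
    return out

raw = pd.read_csv(io.StringIO(csv_text))
parsed = parse_literal_columns(raw, ['sentiment', 'tokens'])
parsed['speaker'] = parsed['speaker'].str.strip()   # drop stray whitespace around names

print(parsed.loc[1, 'sentiment']['score'])
print(parsed.loc[0, 'tokens'][0]['pos'])
```

If a cell holds anything other than a plain literal, `ast.literal_eval` raises an exception instead of running it, which is usually the preferable failure mode for files read from disk.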

View files for annotated movie scripts.

In [4]:
# get files for annotated scripts
files = [x for x in os.listdir('prep_scripts') if '_gapi.csv' in x]

print('annotated scripts:')
for i, f in enumerate(files):
    print(i, f)

annotated scripts:
0 ant-man_tw_gapi.csv
1 avengers_age_of_ultron_tw_gapi.csv
2 captain_america_civil_war_tw_gapi.csv
3 captain_america_the_first_avenger_tw_gapi.csv
4 captain_america_the_winter_soldier_tw_gapi.csv
5 fantastic_four_imsdb_gapi.csv
6 iron_man_3_tw_gapi.csv
7 lego_marvel_super_heroes_tw_gapi.csv
8 spider-man_imsdb_gapi.csv
9 the_amazing_spider-man_2_tw_gapi.csv
10 the_amazing_spider-man_tw_gapi.csv
11 the_avengers_tw_gapi.csv
12 the_wolverine_tw_gapi.csv
13 thor_the_dark_world_tw_gapi.csv
14 thor_tw_gapi.csv
15 x-men_apocalypse_tw_gapi.csv
16 x-men_days_of_future_past_tw_gapi.csv
17 x-men_imsdb_gapi.csv
18 x-men_the_last_stand_tw_gapi.csv


Load the raw annotated scripts, add features for speakers, sentiment, and pronouns, and select lines to evaluate.

In [5]:
# list of file indexes for Avengers (1, 11) and X-Men movies (15, 16, 18)
fileIndex = [1, 11, 15, 16, 18]

# dict to hold name, annotations, characters, and other info for scripts
scripts = defaultdict(lambda: defaultdict())

for i in fileIndex:
    # load annotated script
    df = loadScript(files[i])
    
    # add features to annotated script
    df = annotateScript(df)
    
    # number of lines spoken by each character
    cCount = Counter(df.speaker)
    
    # script name for printing
    scripts[i]['name'] = files[i]
    
    # annotated script data
    scripts[i]['df'] = df
    
    # unique characters and counts in script
    scripts[i]['chars'] = cCount
    
    # lines to evaluate for each script
    scripts[i]['eval'] = selectEvalLines(scripts[i]['df'], numExamples=20)
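
The nearby-speaker feature added during annotation can be illustrated on a toy script. This is a minimal sketch, assuming a default integer index; the function name `add_nearby_speakers` and the `window` parameter are hypothetical stand-ins for the `charRange` logic in `annotateScript`.

```python
import numpy as np
import pandas as pd

def add_nearby_speakers(df, window=2):
    """Attach, for every line, the speakers from `window` lines after to `window` lines before."""
    out = df.copy()
    # shift(k) with k > 0 yields the speaker k lines earlier; k < 0 yields the speaker |k| lines later
    shifted = [out['speaker'].shift(k).to_numpy() for k in range(-window, window + 1)]
    nearby = np.column_stack(shifted)     # shape: (num_lines, 2 * window + 1)
    out['nearbyChars'] = list(nearby)     # one array per row
    return out

toy = pd.DataFrame({'speaker': ['A', 'B', 'C', 'D']})
result = add_nearby_speakers(toy, window=1)
print(list(result.loc[1, 'nearbyChars']))   # next, current, previous speaker for line 1
```

Lines near the start or end of a script get `NaN` entries where the window runs past the script boundary, so downstream code should be prepared to skip missing values.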

## Task 1. Entity Resolution <a class="anchor" id="Entity"></a>
[[back to top]](#top)

1.1. Base Model (pronResolution_base): resolves each pronoun to a character chosen at random from the script

In [None]:
# copy scripts (note: dict.copy() is shallow, so the DataFrames are shared with `scripts`)
scripts0 = scripts.copy()

# apply model to all scripts
for i in fileIndex:
    charCounter = scripts0[i]['chars']
    scripts0[i]['df'].apply(lambda x: pronResolution_base(charCounter, x), axis=1)
    
# manually evaluate results for all scripts
pronEval(scripts0)


******** line 131 ********
129. Soldier:
Atten-hut!

130. narrator:
the soldiers all rise

=> 131. Colonel:
=> At ease. [as Sanders walks through the tent he looks at the mutant soldiers, he notices Alex Summers/Havok and winks at him, Alex looks at him with confusion, Strand goes over the doctor who's collecting blood samples] What's all this?

132. Soldier:
Lab reports, blood tests. It's all getting packing up and shipped back.

133. Colonel:
Where is it going?

******** evaluate line 131 in x-men_days_of_future_past_tw_gapi.csv ********
6 pronouns resolved
1. he => Kitty
2. he => Colonel
3. him => Storm
4. him => Senator Brickman
5. who => Senator Brickman
6. What => Gwen


1.2. Nearest Speaker Model (pronResolution_nn)
* sets entity for first-person pronouns to speaker
* sets entity for second-person pronouns to random choice between previous and next speaker
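
The two rules above can be sketched as follows. The function name `resolve_nearest` and the pronoun sets are illustrative assumptions, not the contents of `pronResolution_nn`.

```python
import random

# Illustrative pronoun sets (assumed, not taken from the pronounResolution module)
FIRST_PERSON = {'i', 'me', 'my', 'mine', 'myself'}
SECOND_PERSON = {'you', 'your', 'yours', 'yourself'}

def resolve_nearest(pronoun, speaker, prev_speaker, next_speaker, rng=random):
    """Apply the two rules; return None for pronouns they do not cover."""
    word = pronoun.lower()
    if word in FIRST_PERSON:
        return speaker
    if word in SECOND_PERSON:
        # pick at random between the adjacent speakers that actually exist
        candidates = [s for s in (prev_speaker, next_speaker) if s is not None]
        return rng.choice(candidates) if candidates else None
    return None

print(resolve_nearest('I', 'Scott Lang', 'narrator', 'Luis'))
print(resolve_nearest('you', 'Ramone', 'Past Wolverine', 'Past Wolverine'))
```

Third-person pronouns are left unresolved in this sketch, since the rules listed above only cover first and second person.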

In [6]:
# copy scripts (note: dict.copy() is shallow, so the DataFrames are shared with `scripts`)
scripts1 = scripts.copy()

# apply model to all scripts
for i in fileIndex:
    charCounter = scripts1[i]['chars'].keys()  
    scripts1[i]['df'].apply(lambda x: pronResolution_nn(charCounter, x), axis=1)
    
# manually evaluate results for all scripts
pronEval(scripts1)


******** line 766 ********
764. Kurt:
Oh, no.

765. narrator:
back with Scott and the ants

=> 766. Scott Lang:
=> I'm employing the bullet ants. Hapanera-clamda-mana-merna. I don't remember what it's called but I feel bad for this guy. [using the ants Scott takes down one of the security guards with Luis also punching him]

767. Luis:
See, that's what I'm talkin’ bout. That's what I call it, an unfortunate casualty, in a very serious operation, you know? [Hope then comes along and enters the room and places the signal decoy]

768. Kurt:
Signal decoy in place. Mean pretty lady did good, Scott.

******** test model 1: line 766 ********
6 pronouns resolved
1. I => ['Peggy Carter']
2. I => ['Pym Tech Employee']
3. what => ['Hope van Dyne']
4. it => ['Cop on Speaker']
5. I => ['Hideous Rabbit']
6. him => ['Pym Tech Employee']

how many are correctly identified? 2

1.3. Probability-Weighted Nearby Entities (pronResolution_nnMod):
* Set entity for first-person pronouns to speaker
* Set entity for second- and third-person pronouns by choosing among nearby characters, weighted by the distribution of person entities
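
A rough sketch of probability-weighted selection is given below. It weights candidates by how often they appear in the nearby-speaker window, which is a simplifying assumption; `pronResolution_nnMod` may build its distribution differently. The name `resolve_weighted` and the `exclude` parameter are hypothetical.

```python
import random
from collections import Counter

def resolve_weighted(nearby_chars, speaker, rng=random, exclude=('narrator',)):
    """Sample a referent in proportion to how often each character appears nearby."""
    # skip missing values (non-strings), the current speaker, and excluded names
    counts = Counter(c for c in nearby_chars
                     if isinstance(c, str) and c != speaker and c not in exclude)
    if not counts:
        return None
    names = list(counts)
    weights = [counts[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

nearby = ['Hank', 'Hank', 'Past Wolverine', float('nan'), 'narrator']
print(resolve_weighted(nearby, speaker='Past Wolverine'))
```

Excluding the current speaker and the narrator is a design choice made for this sketch so that a second- or third-person pronoun never resolves to the person speaking.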

In [None]:
# copy scripts (note: dict.copy() is shallow, so the DataFrames are shared with `scripts`)
scripts2 = scripts.copy()

# apply model to all scripts
for i in fileIndex:
    charCounter = scripts2[i]['chars'] 
    scripts2[i]['df'].apply(lambda x: pronResolution_nnMod(charCounter, x, absolute=False), axis=1)
    
# manually evaluate results for all scripts
pronEval(scripts2)


******** line 97 ********
95. Ramone:
Gwen get dressed.

96. Past Wolverine:
Who the hell are you? Hey, I don't know what's going on.

=> 97. Ramone:
=> What's going on you're supposed to be guarding the boss's daughter not screwing her.

98. Past Wolverine:
Well, I didn't sleep with her.

99. Ramone:
No?

******** evaluate line 97 in x-men_days_of_future_past_tw_gapi.csv ********
2 pronouns resolved
1. you => ['Past Wolverine']
2. her => ['Professor X']

how many are correctly identified? 1

******** line 246 ********
244. Past Wolverine:
What the hell happened to him?

245. Hank:
He lost everything. Erik, Raven... his legs. We built this school, the labs, this whole place... then, just after the first semester... the war in Vietnam got worse. Many of the teachers... and older students were drafted. It broke him. He retreated himself. I wanted to help, do something... so I designed a serum to treat his spine... derived from the same formula that helps me control my mutation. I take j

## Task 2. Relation Extraction <a class="anchor" id="Relation"></a>
[[back to top]](#top)

In [48]:
df.apply(lambda x: pronResolution_nnMod(cCount, x), axis=1)
df.head()

Unnamed: 0,speaker,dialogue,sentences,sentiment,entities,tokens,total_sent,num_pron,nearbyChars
0,Announcer,[first lines; announcement over speaker] Repor...,[{'content': u'[first lines; announcement over...,"{'magnitude': 1.6, 'score': -0.2}","[{'salience': 0.35250518, 'type': 'OTHER', 'me...","[{'content': '[', 'pos': 'PUNCT', 'label': 'P'...",-0.32,3,"[Tony Stark, Clint Barton, narrator, Natasha R..."
1,narrator,the Avengers are in the process of infiltratin...,[{'content': u'the Avengers are in the process...,"{'magnitude': 0.1, 'score': 0.1}","[{'salience': 0.47595453, 'type': 'PERSON', 'm...","[{'content': 'the', 'pos': 'DET', 'label': 'DE...",0.01,0,"[Steve Rogers, Tony Stark, Clint Barton, narra..."
2,Tony Stark,Shit!,"[{'content': u'Shit!', 'begin': 0, 'score': -0...","{'magnitude': 0.6, 'score': -0.6}",[],"[{'content': 'Shit', 'pos': 'X', 'label': 'ROO...",-0.36,0,"[narrator, Steve Rogers, Tony Stark, Clint Bar..."
3,Steve Rogers,"Language! JARVIS, what's the view from upstairs?","[{'content': u'Language!', 'begin': 0, 'score'...","{'magnitude': 0.1, 'score': 0}","[{'salience': 0.7599061, 'type': 'OTHER', 'men...","[{'content': 'Language', 'pos': 'NOUN', 'label...",0.0,1,"[Steve Rogers, narrator, Steve Rogers, Tony St..."
4,JARVIS,The central building is protected by some kind...,[{'content': u'The central building is protect...,"{'magnitude': 1.5, 'score': 0.7}","[{'salience': 0.47500995, 'type': 'LOCATION', ...","[{'content': 'The', 'pos': 'DET', 'label': 'DE...",1.05,1,"[Strucker, Steve Rogers, narrator, Steve Roger..."


In [33]:
df['relations'] = df.apply(lambda x:extract_relation_categories(x), axis=1)
df.head()

Unnamed: 0,speaker,dialogue,sentences,sentiment,entities,tokens,total_sent,num_pron,nearbyChars,relations
0,Announcer,[first lines; announcement over speaker] Repor...,[{'content': u'[first lines; announcement over...,"{'magnitude': 1.6, 'score': -0.2}","[{'salience': 0.35250518, 'type': 'OTHER', 'me...","[{'lemma': '[', 'begin': 0, 'label': 'P', 'ind...",-0.32,3,"[Tony Stark, Clint Barton, narrator, Natasha R...",[{'relation': '[first lines; announcement over...
1,narrator,the Avengers are in the process of infiltratin...,[{'content': u'the Avengers are in the process...,"{'magnitude': 0.1, 'score': 0.1}","[{'salience': 0.47595453, 'type': 'PERSON', 'm...","[{'lemma': 'the', 'begin': 0, 'label': 'DET', ...",0.01,0,"[Steve Rogers, Tony Stark, Clint Barton, narra...",
2,Tony Stark,Shit!,"[{'content': u'Shit!', 'begin': 0, 'score': -0...","{'magnitude': 0.6, 'score': -0.6}",[],"[{'lemma': 'Shit', 'begin': 0, 'label': 'ROOT'...",-0.36,0,"[narrator, Steve Rogers, Tony Stark, Clint Bar...",
3,Steve Rogers,"Language! JARVIS, what's the view from upstairs?","[{'content': u'Language!', 'begin': 0, 'score'...","{'magnitude': 0.1, 'score': 0}","[{'salience': 0.7599061, 'type': 'OTHER', 'met...","[{'lemma': 'Language', 'begin': 0, 'label': 'R...",0.0,1,"[Steve Rogers, narrator, Steve Rogers, Tony St...",
4,JARVIS,The central building is protected by some kind...,[{'content': u'The central building is protect...,"{'magnitude': 1.5, 'score': 0.7}","[{'salience': 0.47500995, 'type': 'LOCATION', ...","[{'lemma': 'The', 'begin': 0, 'label': 'DET', ...",1.05,1,"[Strucker, Steve Rogers, narrator, Steve Roger...",[{'relation': 'The central building is protect...


In [34]:
REEval([df], 5)


******** line 271 ********
269. Natasha Romanoff:
Fella done me wrong.

270. Bruce Banner:
You got a lousy taste in men, kid.

=> 271. Natasha Romanoff:
=> He's not so bad. Well, he has a temper. Deep down he's all fluff. Fact is, he's not like anybody I've ever known. All my friends are fighters. And here comes this guy, spends his life avoiding the fight because he knows he'll win.

272. Bruce Banner:
Sounds amazing.

273. Natasha Romanoff:
He's also a huge dork. [Banner looks embarrassed] Chicks dig that. So what do you think should I fight this, or run with it?

******** test model 1: line 271 ********
4 relations identified
entities: Natasha Romanoff => fighters-['friends', 'fighters']
relation: He's not so bad. Well, he has a temper. Deep down he's all fluff. Fact is, he's not like anybody I've ever known. All my friends are fighters. And here comes this guy, spends his life avoiding the fight because he knows he'll win.
category: 4. mixed mentioning
entities: Natasha Romanoff =

## Putting Everything Together: A Simple Query System <a class="anchor" id="Query"></a>
[[back to top]](#top)
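
The free-form search in the cell below ranks lines with a token-overlap score: the number of tokens shared by the query and a relation, divided by the sum of the two token-set sizes. A minimal standalone sketch of that formula follows. The name `overlap_score` is hypothetical, and this version lowercases both sides before comparing.

```python
def overlap_score(relation_text, query):
    """Shared tokens divided by the summed sizes of the two token sets."""
    relation_tokens = set(relation_text.lower().split())
    query_tokens = set(query.lower().split())
    if not relation_tokens and not query_tokens:
        return 0.0
    return len(relation_tokens & query_tokens) / (len(relation_tokens) + len(query_tokens))

# 1 shared token ('loki'), 5 relation tokens, 3 query tokens
print(overlap_score('Loki jumped into a vortex', 'thor fights loki'))   # → 0.125
```

Note that this score tops out at 0.5 for identical token sets, unlike Jaccard similarity, which divides by the size of the union and reaches 1.0.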

In [68]:
def checkQuery(relationList, ent1, ent2, relationClass):
    for relation in relationList:
        if ent1 in relation['ent1'] and ent2 in relation['ent2'] and relationClass == relation['class']:
            return True
    return False

def printAnswer(row):
    print('Movie: {}, Line {}'.format(row.movie, row.lineNum))
    print('{}: {}'.format(row.speaker, row.dialogue))
    print()
    
def queryScore(relationList, query, relationClass):
    # lowercase the query so it matches the lowercased relation tokens below
    querySet = set(query.lower().split())
    resultScore = 0
    
    for relation in relationList:
        relationSet = set()
        if type(relation['ent1']) == str:
            relationSet |= set(relation['ent1'].lower().split())
        else:
            for ent in relation['ent1']:
                #print(set(ent.split()))
                relationSet |= set(ent.lower().split())
            
        if type(relation['ent2']) == str:
            #print(relation['ent2'])
            relationSet |= set(relation['ent2'].lower().split())
        else:
            for ent in relation['ent2']:
                relationSet |= set(ent.lower().split())
        
        relationSet |= set(relation['relation'].lower().split())
        relationSet |= set(relationClass[relation['class']].lower().split())
        tempScore = len(relationSet & querySet) / (len(relationSet) + len(querySet))
        
        if tempScore > resultScore:
            resultScore = tempScore
        
    return resultScore

#Simple Query System

print('Select the movies of your interest:')
print('***Enter all to use all movies')
print('***Enter n, m, x, y (numbers separated by commas) for specific selections')
print('***Enter random, n for n random selections\n')

files = [x for x in os.listdir('prep_scripts') if '_gapi' in x]
for i, fileName in enumerate(files):
    print('{}. {}'.format(i+1, re.split(r'_tw_|_imsdb_', fileName)[0]))


x = input()


#random selection
try:
    if 'random' in x:
        queryFiles = np.random.choice(files, int(x.split(',')[-1]), replace=False)
    elif x != 'all':
        queryFiles = np.array(files)[[int(select) - 1 for select in x.split(',')]]
    #use all files
    else:
        queryFiles = files    
        
except (ValueError, IndexError):
    print('\nunexpected input, will use all movie files\n')
    queryFiles = files    

#print(queryFiles)
df_data = None
charSet = set()

for i, fileName in enumerate(queryFiles):    
    df = pd.read_csv('prep_scripts/'+fileName)[['speaker', 'dialogue', 'sentences', 'sentiment', 'entities', 'tokens']]
    df['tokens'] = df['tokens'].apply(lambda x: eval(x))
    df['sentiment'] = df['sentiment'].apply(lambda x: eval(x))
    df['total_sent'] = df['sentiment'].apply(lambda x: x['score'] * x['magnitude'])
    df['entities'] = df['entities'].apply(lambda x: eval(x))
    df['movie'] = re.split(r'_tw_|_imsdb_', fileName)[0]
    df['lineNum'] = df.index + 1
    
    charRange = 10
    # np.dstack requires a sequence (recent NumPy versions reject a bare generator)
    nearbyList = np.dstack([df.shift(k).speaker.values for k in range(-charRange, charRange+1)])[0]
    # DataFrame.set_value was removed in pandas 1.0; assign one array of nearby speakers per row
    df['nearbyChars'] = list(nearbyList)
    
    cList = list(df.speaker.unique())
    cDict = dict(df.groupby('speaker').total_sent.sum())
    
    #resolve entities
    df.apply(lambda x:pronResolution_nnMod(cList, x), axis=1)
    
    #extract relations
    df['relations'] = df.apply(lambda x:extract_relation_categories(x), axis=1)
    
    if i == 0:
        df_data = df[df.relations.notnull()]        
        
    else:
        df_data = pd.concat((df_data, df[df.relations.notnull()]))
    
    charSet |= set(df.speaker.unique())

relationClasses = getRelations()
    
print('Type end to finish at any time')
print('Choose one of the following:')
print('1. Structured search')
print('2. Free form query')
searchType = int(input()) - 1

#relationList = df_data[df_data.hasRelation == True]['relations'].values

if not searchType:
    
    while True:
        print('Characters: ')
        print(charSet)
        print('\nRelations:')
        for k, v in relationClasses.items():
            print('{}. {}'.format(k+1, v))
        print('What relation are you looking for?')
        ent1 = input('Entity 1:')
        if ent1 == 'end':
            break
        ent2 = input('Entity 2:')
        if ent2 == 'end':
            break
        relationClass = int(input('Relation category: '))-1

        qMatch = df_data.relations.apply(lambda x: checkQuery(x, ent1, ent2, relationClass))
        if sum(qMatch) == 0:
            print('nothing found\n')
        else:
            df_data[qMatch].apply(lambda x: printAnswer(x), axis=1)

else:
    while True:
        query = input('Enter query')
        if query == 'end':
            break
        df = df_data.copy()
        df['queryScore'] = df.relations.apply(lambda x: queryScore(x, query, relationClasses))
        df = df.sort_values(by='queryScore', ascending=False).head().copy()
        df.apply(lambda x: printAnswer(x), axis=1)

Select the movies of your interest:
***Enter all to use all movies
***Enter n, m, x, y (numbers separated by commas) for specific selections
***Enter random, n for n random selections

1. ant-man
2. avengers_age_of_ultron
3. captain_america_civil_war
4. captain_america_the_first_avenger
5. captain_america_the_winter_soldier
6. fantastic_four
7. iron_man_3
8. lego_marvel_super_heroes
9. spider-man
10. the_amazing_spider-man_2
11. the_amazing_spider-man
12. the_avengers
13. the_wolverine
14. thor_the_dark_world
15. thor
16. x-men_apocalypse
17. x-men_days_of_future_past
18. x-men
19. x-men_the_last_stand
all
Type end to finish at any time
Choose one of the following:
1. Structured search
2. Free form query
2
Enter querycaptain america helps thor fight loki and iron man
Movie: lego_marvel_super_heroes, Line 348
Captain America: Colonel Fury, sir, Loki jumped into a Vortex and vanished.

Movie: lego_marvel_super_heroes, Line 386
Loki: Oh and so am I, brother! I intend to get my revenge on 