## Table of Contents <a class="anchor" id="top"></a>
* [Data Preparation](#Data%20Prep)
* [Entity Resolution](#Entity)
* [Relation Extraction](#Relation)
* [Query System](#Query)

## [Data Prep](#top)  <a class="anchor" id="Data Prep"></a>

In [37]:
%load_ext autoreload
%autoreload 2

#standard library imports
import os
import re

#third-party imports
import nltk
import numpy as np
import pandas as pd

#modeling functions & utilities
from pronounResolution import pronResolution_base, pronResolution_nnMod, pronResolution_nn, pronEval
from relationExtract import simpleRE, REEval, getRelations, extract_relation_categories

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [38]:
files = [x for x in os.listdir('prep_scripts') if '_gapi' in x]
for file in files:
    df = pd.read_csv('prep_scripts/' + file)[['speaker']]
    print(list(df.speaker.unique()))
    print('***')
    print('***')

['narrator', 'Dr. Hank Pym', 'Mitchell Carson', 'Howard Stark', 'Peggy Carter', 'Peachy', 'Scott Lang', 'Luis', 'Ice Cream Store Customer', 'Dale', 'Dave', 'Kurt', 'Pym Tech Gate Guard', 'Pym Tech Security Guard', 'Pym Tech Employee', 'Hope van Dyne', 'Darren Cross', 'Carson', 'Frank', 'Cassie Lang', 'Paxton', 'Hideous Rabbit', 'Maggie Lang', 'Scott', 'Cab Driver', 'Cop on Speaker', 'Detective', 'Voice over Radio', 'Sam Wilson', 'Scot Lang', 'Alpha Guard', 'Gale', 'Computer', 'Cell Phone', 'Pool BBQ Dad', 'Police Radio', 'Steve Rogers']
***
***
['Announcer', 'narrator', 'Tony Stark', 'Steve Rogers', 'JARVIS', 'Thor', 'Natasha Romanoff', 'Clint Barton', 'Strucker', 'Fortress Soldier', 'Dr. List', 'Jarvis', 'Iron Legion', 'Soldiers', 'Pietro Maximoff', 'Bruce Banner', 'Maria Hill', 'Dr. Helen Cho', 'Ultron', 'James Rhodes', 'Sam Wilson', 'Party Guest', 'Stan Lee', 'Wanda Maximoff', 'Ulysses Klaue', "Klaue's Mercenary", 'Ballet Instructor', 'Madame B', 'Peggy Carter', 'Heimdall', 'Laura B

In [39]:
import ast  #literal_eval parses the stringified columns without executing code

files = [x for x in os.listdir('prep_scripts') if '_gapi' in x]
df = pd.read_csv('prep_scripts/' + files[0])[['speaker', 'dialogue', 'sentences', 'sentiment', 'entities', 'tokens']]
#the tokens, sentiment and entities columns are stored as Python literal strings
for col in ['tokens', 'sentiment', 'entities']:
    df[col] = df[col].apply(ast.literal_eval)
df.head(20)

Unnamed: 0,speaker,dialogue,sentences,sentiment,entities,tokens
0,narrator,1989 – Hank Pym enters a SHIELD facility,[{'content': u'1989 \u2013 Hank Pym enters a S...,"{u'score': 0.3, u'magnitude': 0.3}","[{u'type': u'PERSON', u'meta': {}, u'salience'...","[{u'index': 0, u'begin': 0, u'pos': u'NUM', u'..."
1,Dr. Hank Pym,Stark.,"[{'content': u'Stark.', 'begin': 0, 'score': 0...","{u'score': 0.1, u'magnitude': 0.1}","[{u'type': u'WORK_OF_ART', u'meta': {}, u'sali...","[{u'index': 0, u'begin': 0, u'pos': u'NOUN', u..."
2,Mitchell Carson,He doesn't seem happy.,"[{'content': u""He doesn't seem happy."", 'begin...","{u'score': -0.6, u'magnitude': 0.6}",[],"[{u'index': 3, u'begin': 0, u'pos': u'PRON', u..."
3,Howard Stark,"Hello, Hank. You're supposed to be in Moscow.","[{'content': u'Hello, Hank.', 'begin': 0, 'sco...","{u'score': -0.1, u'magnitude': 1}","[{u'type': u'PERSON', u'meta': {}, u'salience'...","[{u'index': 2, u'begin': 0, u'pos': u'X', u'la..."
4,Dr. Hank Pym,I took a detour.[he places a vial containing a...,[{'content': u'I took a detour.[he places a vi...,"{u'score': 0.4, u'magnitude': 0.4}","[{u'type': u'OTHER', u'meta': {}, u'salience':...","[{u'index': 1, u'begin': 0, u'pos': u'PRON', u..."
5,Peggy Carter,Tell me that isn't what I think it is.,"[{'content': u""Tell me that isn't what I think...","{u'score': -0.6, u'magnitude': 0.6}",[],"[{u'index': 0, u'begin': 0, u'pos': u'VERB', u..."
6,Dr. Hank Pym,"It depends, if you think it's a poor attempt t...","[{'content': u""It depends, if you think it's a...","{u'score': -0.5, u'magnitude': 1}","[{u'type': u'OTHER', u'meta': {}, u'salience':...","[{u'index': 1, u'begin': 0, u'pos': u'PRON', u..."
7,Mitchell Carson,You were instructed to go to Russia. May I rem...,[{'content': u'You were instructed to go to Ru...,"{u'score': 0, u'magnitude': 0.7}","[{u'type': u'PERSON', u'meta': {}, u'salience'...","[{u'index': 2, u'begin': 0, u'pos': u'PRON', u..."
8,Dr. Hank Pym,I'm a scientist.,"[{'content': u""I'm a scientist."", 'begin': 0, ...","{u'score': 0.3, u'magnitude': 0.3}","[{u'type': u'PERSON', u'meta': {}, u'salience'...","[{u'index': 1, u'begin': 0, u'pos': u'PRON', u..."
9,Howard Stark,Then act like one. The Pym Particle is the mos...,"[{'content': u'Then act like one.', 'begin': 0...","{u'score': 0.1, u'magnitude': 1.2}","[{u'type': u'OTHER', u'meta': {}, u'salience':...","[{u'index': 1, u'begin': 0, u'pos': u'ADV', u'..."


In [40]:
df.entities[17]

[{'mentions': [u'Pym Particle', u'miracle'],
  'meta': {},
  'name': u'Pym Particle',
  'salience': 0.81080335,
  'type': u'CONSUMER_GOOD'},
 {'mentions': [u'Hank'],
  'meta': {},
  'name': u'Hank',
  'salience': 0.18919668,
  'type': u'PERSON'}]

In [41]:
df.dialogue[17]

"We don't accept it. Formally. Hank, we need you. The Pym Particle is a miracle. Please, don't let your past determine the future."
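Each entry in the `entities` column is a list of dicts with `name`, `type`, `salience` and `mentions` keys, as the output for line 17 shows. A minimal sketch of how such a list could be summarized follows; the helper names `summarize_entities` and `names_of_type` are hypothetical and not part of the notebook's modules. Note that the type labels are noisy (the table above tags "Stark." as `WORK_OF_ART`), so filtering by `PERSON` alone may miss characters.

```python
# Entity list copied from the df.entities[17] output (u'' prefixes dropped).
entities = [
    {'mentions': ['Pym Particle', 'miracle'], 'meta': {}, 'name': 'Pym Particle',
     'salience': 0.81080335, 'type': 'CONSUMER_GOOD'},
    {'mentions': ['Hank'], 'meta': {}, 'name': 'Hank',
     'salience': 0.18919668, 'type': 'PERSON'},
]


def summarize_entities(entity_list):
    """Return (name, type, salience, mention count) tuples, most salient first."""
    rows = [(e['name'], e['type'], e['salience'], len(e['mentions']))
            for e in entity_list]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def names_of_type(entity_list, entity_type):
    """Return the names of all entities carrying the given type label."""
    return [e['name'] for e in entity_list if e['type'] == entity_type]


summary = summarize_entities(entities)
people = names_of_type(entities, 'PERSON')

for row in summary:
    print(row)
print(people)
```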

In [42]:
cList = list(df.speaker.unique())

# total sentiment score for each line
df['total_sent'] = df['sentiment'].apply(lambda x: x['score'] * x['magnitude'])
cDict = dict(df.groupby('speaker').total_sent.sum())

# number of pronouns for each line
df['num_pron'] = df['tokens'].apply(lambda x: sum(t['pos'] == 'PRON' for t in x))

# set nearby speakers: for each line, the speakers from charRange lines ahead to charRange lines back
charRange = 10
nearbyList = np.column_stack([df.speaker.shift(i).values for i in range(-charRange, charRange + 1)])
# one window (array of 2 * charRange + 1 speakers) per row; replaces the removed DataFrame.set_value
df['nearbyChars'] = list(nearbyList)

df.head()['tokens'][3]

[{'begin': 0,
  'content': u'Hello',
  'index': 2,
  'label': u'DISCOURSE',
  'lemma': u'Hello',
  'pos': u'X'},
 {'begin': 5,
  'content': u',',
  'index': 2,
  'label': u'P',
  'lemma': u',',
  'pos': u'PUNCT'},
 {'begin': 7,
  'content': u'Hank',
  'index': 2,
  'label': u'ROOT',
  'lemma': u'Hank',
  'pos': u'NOUN'},
 {'begin': 11,
  'content': u'.',
  'index': 2,
  'label': u'P',
  'lemma': u'.',
  'pos': u'PUNCT'},
 {'begin': 13,
  'content': u'You',
  'index': 6,
  'label': u'NSUBJPASS',
  'lemma': u'You',
  'pos': u'PRON'},
 {'begin': 16,
  'content': u"'re",
  'index': 6,
  'label': u'AUXPASS',
  'lemma': u'be',
  'pos': u'VERB'},
 {'begin': 20,
  'content': u'supposed',
  'index': 6,
  'label': u'ROOT',
  'lemma': u'suppose',
  'pos': u'VERB'},
 {'begin': 29,
  'content': u'to',
  'index': 8,
  'label': u'AUX',
  'lemma': u'to',
  'pos': u'PRT'},
 {'begin': 32,
  'content': u'be',
  'index': 6,
  'label': u'XCOMP',
  'lemma': u'be',
  'pos': u'VERB'},
 {'begin': 35,
  'conten
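
The `nearbyChars` column built in the cell above holds, for each line, the speakers of the surrounding lines. Because `shift(-k)` looks `k` lines ahead and `shift(k)` looks `k` lines back, each window runs from the speaker `charRange` lines ahead down to the speaker `charRange` lines behind. A pure-Python sketch of the same windowing, on a shortened speaker list and with a smaller range, may make the ordering easier to see. The function name `nearby_speakers` is hypothetical, and positions outside the script are filled with `None` here, whereas pandas fills them with `NaN`.

```python
def nearby_speakers(speakers, char_range):
    """For each line, list the speakers from char_range lines ahead
    to char_range lines behind (the line's own speaker sits in the middle)."""
    n = len(speakers)
    windows = []
    for i in range(n):
        window = []
        # Offsets run from +char_range (ahead) down to -char_range (behind),
        # matching shift(-char_range) ... shift(char_range).
        for offset in range(char_range, -char_range - 1, -1):
            j = i + offset
            window.append(speakers[j] if 0 <= j < n else None)
        windows.append(window)
    return windows


# First five speakers from the Ant-Man table above.
speakers = ['narrator', 'Dr. Hank Pym', 'Mitchell Carson',
            'Howard Stark', 'Dr. Hank Pym']
windows = nearby_speakers(speakers, char_range=2)

print(windows[0])
print(windows[2])
```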

## Task 1. [Entity Resolution](#top) <a class="anchor" id="Entity"></a>

In [43]:
df['tokens'] = df.apply(lambda x:pronResolution_nnMod(cList, x), axis=1)
df.head()

Unnamed: 0,speaker,dialogue,sentences,sentiment,entities,tokens,total_sent,num_pron,nearbyChars
0,narrator,1989 – Hank Pym enters a SHIELD facility,[{'content': u'1989 \u2013 Hank Pym enters a S...,"{u'score': 0.3, u'magnitude': 0.3}","[{u'type': u'PERSON', u'meta': {}, u'salience'...","[{u'index': 0, u'begin': 0, u'pos': u'NUM', u'...",0.09,0,"[Dr. Hank Pym, Howard Stark, Dr. Hank Pym, Mit..."
1,Dr. Hank Pym,Stark.,"[{'content': u'Stark.', 'begin': 0, 'score': 0...","{u'score': 0.1, u'magnitude': 0.1}","[{u'type': u'WORK_OF_ART', u'meta': {}, u'sali...","[{u'index': 0, u'begin': 0, u'pos': u'NOUN', u...",0.01,0,"[Mitchell Carson, Dr. Hank Pym, Howard Stark, ..."
2,Mitchell Carson,He doesn't seem happy.,"[{'content': u""He doesn't seem happy."", 'begin...","{u'score': -0.6, u'magnitude': 0.6}",[],"[{u'index': 3, u'begin': 0, u'pos': u'PRON', u...",-0.36,1,"[Dr. Hank Pym, Mitchell Carson, Dr. Hank Pym, ..."
3,Howard Stark,"Hello, Hank. You're supposed to be in Moscow.","[{'content': u'Hello, Hank.', 'begin': 0, 'sco...","{u'score': -0.1, u'magnitude': 1}","[{u'type': u'PERSON', u'meta': {}, u'salience'...","[{u'index': 2, u'begin': 0, u'pos': u'X', u'la...",-0.1,1,"[Peggy Carter, Dr. Hank Pym, Mitchell Carson, ..."
4,Dr. Hank Pym,I took a detour.[he places a vial containing a...,[{'content': u'I took a detour.[he places a vi...,"{u'score': 0.4, u'magnitude': 0.4}","[{u'type': u'OTHER', u'meta': {}, u'salience':...","[{u'index': 1, u'begin': 0, u'pos': u'PRON', u...",0.16,3,"[Dr. Hank Pym, Peggy Carter, Dr. Hank Pym, Mit..."


In [None]:
pronEval([df, df], numExamples=2)


******** line 771 ********
769. Dave:
Looks like Pym's getting arrested.

770. Kurt:
Scott, we have problem.

=> 771. Scott Lang:
=> Problem? What's the problem? [just then Dave gets out of the can]

772. Kurt:
Dave! Dave, that's not part of plan!

773. Dr. Hank Pym:
[as Paxton and Gale are trying to arrest Pym] Listen to me, if I don't get into this building people will die.

******** test model 1: line 771 ********
0 pronouns resolved

how many are correctly identified? 0

******** line 771 ********
769. Dave:
Looks like Pym's getting arrested.

770. Kurt:
Scott, we have problem.

=> 771. Scott Lang:
=> Problem? What's the problem? [just then Dave gets out of the can]

772. Kurt:
Dave! Dave, that's not part of plan!

773. Dr. Hank Pym:
[as Paxton and Gale are trying to arrest Pym] Listen to me, if I don't get into this building people will die.

******** test model 2: line 771 ********
0 pronouns resolved
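
The `pronResolution_*` functions are imported from a module that is not shown here, so their logic is unknown. Purely as an illustration of what a baseline resolver might do, the sketch below maps first-person pronouns to the line's speaker. The function name, the `referent` key and the hand-written token list are all assumptions; only the `content` and `pos` token fields come from the outputs above.

```python
# Hypothetical baseline: a first-person pronoun refers to whoever is speaking.
FIRST_PERSON = {'i', 'me', 'my', 'mine', 'myself'}


def resolve_first_person(speaker, tokens):
    """Return copies of the tokens, adding an assumed 'referent' key
    to every first-person pronoun."""
    resolved = []
    for tok in tokens:
        tok = dict(tok)  # copy so the input tokens are left untouched
        if tok['pos'] == 'PRON' and tok['content'].lower() in FIRST_PERSON:
            tok['referent'] = speaker
        resolved.append(tok)
    return resolved


# Hand-written tokens for line 8 ("I'm a scientist."); the real API output
# carries more fields and may tokenize differently.
tokens = [
    {'content': 'I', 'pos': 'PRON'},
    {'content': "'m", 'pos': 'VERB'},
    {'content': 'a', 'pos': 'DET'},
    {'content': 'scientist', 'pos': 'NOUN'},
    {'content': '.', 'pos': 'PUNCT'},
]

resolved = resolve_first_person('Dr. Hank Pym', tokens)
num_resolved = sum('referent' in tok for tok in resolved)
print('{} pronouns resolved'.format(num_resolved))
```

Second- and third-person pronouns need context such as the `nearbyChars` window, which this sketch deliberately ignores.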


## Task 2. [Relation Extraction](#top) <a class="anchor" id="Relation"></a>

In [49]:
df['relations'] = df.apply(lambda x:extract_relation_categories(x), axis=1)
df.head(50)

Unnamed: 0,speaker,dialogue,sentences,sentiment,entities,tokens,total_sent,num_pron,nearbyChars,relations
0,narrator,1989 – Hank Pym enters a SHIELD facility,[{'content': u'1989 \u2013 Hank Pym enters a S...,"{u'score': 0.3, u'magnitude': 0.3}","[{u'type': u'PERSON', u'meta': {}, u'salience'...","[{u'index': 0, u'begin': 0, u'pos': u'NUM', u'...",0.09,0,"[Dr. Hank Pym, Howard Stark, Dr. Hank Pym, Mit...",
1,Dr. Hank Pym,Stark.,"[{'content': u'Stark.', 'begin': 0, 'score': 0...","{u'score': 0.1, u'magnitude': 0.1}","[{u'type': u'WORK_OF_ART', u'meta': {}, u'sali...","[{u'index': 0, u'begin': 0, u'pos': u'NOUN', u...",0.01,0,"[Mitchell Carson, Dr. Hank Pym, Howard Stark, ...",
2,Mitchell Carson,He doesn't seem happy.,"[{'content': u""He doesn't seem happy."", 'begin...","{u'score': -0.6, u'magnitude': 0.6}",[],"[{u'index': 3, u'begin': 0, u'pos': u'PRON', u...",-0.36,1,"[Dr. Hank Pym, Mitchell Carson, Dr. Hank Pym, ...",
3,Howard Stark,"Hello, Hank. You're supposed to be in Moscow.","[{'content': u'Hello, Hank.', 'begin': 0, 'sco...","{u'score': -0.1, u'magnitude': 1}","[{u'type': u'PERSON', u'meta': {}, u'salience'...","[{u'index': 2, u'begin': 0, u'pos': u'X', u'la...",-0.1,1,"[Peggy Carter, Dr. Hank Pym, Mitchell Carson, ...",
4,Dr. Hank Pym,I took a detour.[he places a vial containing a...,[{'content': u'I took a detour.[he places a vi...,"{u'score': 0.4, u'magnitude': 0.4}","[{u'type': u'OTHER', u'meta': {}, u'salience':...","[{u'index': 1, u'begin': 0, u'pos': u'PRON', u...",0.16,3,"[Dr. Hank Pym, Peggy Carter, Dr. Hank Pym, Mit...",
5,Peggy Carter,Tell me that isn't what I think it is.,"[{'content': u""Tell me that isn't what I think...","{u'score': -0.6, u'magnitude': 0.6}",[],"[{u'index': 0, u'begin': 0, u'pos': u'VERB', u...",-0.36,4,"[Howard Stark, Dr. Hank Pym, Peggy Carter, Dr....",
6,Dr. Hank Pym,"It depends, if you think it's a poor attempt t...","[{'content': u""It depends, if you think it's a...","{u'score': -0.5, u'magnitude': 1}","[{u'type': u'OTHER', u'meta': {}, u'salience':...","[{u'index': 1, u'begin': 0, u'pos': u'PRON', u...",-0.5,4,"[Dr. Hank Pym, Howard Stark, Dr. Hank Pym, Peg...",
7,Mitchell Carson,You were instructed to go to Russia. May I rem...,[{'content': u'You were instructed to go to Ru...,"{u'score': 0, u'magnitude': 0.7}","[{u'type': u'PERSON', u'meta': {}, u'salience'...","[{u'index': 2, u'begin': 0, u'pos': u'PRON', u...",0.0,4,"[Howard Stark, Dr. Hank Pym, Howard Stark, Dr....",
8,Dr. Hank Pym,I'm a scientist.,"[{'content': u""I'm a scientist."", 'begin': 0, ...","{u'score': 0.3, u'magnitude': 0.3}","[{u'type': u'PERSON', u'meta': {}, u'salience'...","[{u'index': 1, u'begin': 0, u'pos': u'PRON', u...",0.09,1,"[Dr. Hank Pym, Howard Stark, Dr. Hank Pym, How...",
9,Howard Stark,Then act like one. The Pym Particle is the mos...,"[{'content': u'Then act like one.', 'begin': 0...","{u'score': 0.1, u'magnitude': 1.2}","[{u'type': u'OTHER', u'meta': {}, u'salience':...","[{u'index': 1, u'begin': 0, u'pos': u'ADV', u'...",0.12,2,"[Mitchell Carson, Dr. Hank Pym, Howard Stark, ...",


In [50]:
REEval([df, df])


******** line 862 ********
860. narrator:
outside

861. Paxton:
[into his radio] All the chaos in here! Multiple shots fired. [suddenly a tank bursts out through the building] And there's a tank. [Luis then walks out of the building with the guard]

=> 862. Luis:
=> A little help. [someone takes hold of the guard, at the same time Hope helps Pym out of the tank] I got him. [Luis helps Pym]

863. Hope van Dyne:
We need a doctor! [a medic comes over to help Pym]

864. narrator:
Cross is in his helicopter

******** test model 1: line 862 ********
1 relations identified
entities: Luis => , someone, guard, Pym
relation: A little help. [someone takes hold of the guard, at the same time Hope helps Pym out of the tank] I got him. [Luis helps Pym]
category: 3

how many are correctly identified? 0

******** line 862 ********
860. narrator:
outside

861. Paxton:
[into his radio] All the chaos in here! Multiple shots fired. [suddenly a tank bursts out through the building] And there's a tank. [Lu

## Putting Everything Together, a [Simple Query System](#top) <a class="anchor" id="Query"></a>

In [10]:
def checkQuery(relationList, ent1, ent2, relationClass):
    for relation in relationList:
        if ent1 in relation['ent1'] and ent2 in relation['ent2'] and relationClass == relation['class']:
            return True
    return False

def printAnswer(row):
    print('Movie: {}, Line {}'.format(row.movie, row.lineNum))
    print(row.dialogue)
    print()

#Simple Query System

print('Select the movies of your interest:')
print('***Enter all to use all movies')
print('***Enter n, m, x, y (numbers separated by commas) for specific selections')
print('***Enter random, n for n random selections\n')

files = [x for x in os.listdir('prep_scripts') if '_gapi' in x]
for i, fileName in enumerate(files):
    print('{}. {}'.format(i+1, re.split(r'_tw_|_imsdb_', fileName)[0]))


x = input()


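#normalize the answer to a stripped string before parsing
x = str(x).strip()
try:
    #random selection
    if 'random' in x:
        queryFiles = np.random.choice(files, int(x.split(',')[-1]), replace=False)
    #specific selections
    elif x != 'all':
        queryFiles = np.array(files)[[int(select) - 1 for select in x.split(',')]]
    #use all files
    else:
        queryFiles = files

#catch only parsing and indexing errors instead of every exception
except (ValueError, IndexError):
    print('\nunexpected input, will use all movie files\n')
    queryFiles = files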

import ast  #literal_eval parses the stringified columns without executing code

df_data = None
charSet = set()

for i, fileName in enumerate(queryFiles):
    df = pd.read_csv('prep_scripts/'+fileName)[['speaker', 'dialogue', 'sentences', 'sentiment', 'entities', 'tokens']]
    df['tokens'] = df['tokens'].apply(ast.literal_eval)
    df['sentiment'] = df['sentiment'].apply(ast.literal_eval)
    df['total_sent'] = df['sentiment'].apply(lambda x: x['score'] * x['magnitude'])
    df['movie'] = re.split(r'_tw_|_imsdb_', fileName)[0]
    df['lineNum'] = df.index + 1
    
    cList = list(df.speaker.unique())
    cDict = dict(df.groupby('speaker').total_sent.sum())
    
    #resolve entities
    df['tokens'] = df.apply(lambda x:pronResolution_base(cList, x), axis=1)
    
    #extract relations
    df['relations'] = df.apply(lambda x:simpleRE(x), axis=1)
    
    #keep only lines with an extracted relation (there is no hasRelation column)
    df_rel = df[df.relations.notnull()]
    df_data = df_rel if df_data is None else pd.concat((df_data, df_rel))
    
    charSet |= set(df.speaker.unique())

relationClasses = getRelations()
    
print('Type end to finish at any time')

#relationList = df_data[df_data.hasRelation == True]['relations'].values

while True:
    print('Characters: ')
    print(charSet)
    print('\nRelations:')
    for k, v in relationClasses.items():
        print('{}. {}'.format(k+1, v))
    print('What relation are you looking for?')
    ent1 = input('Entity 1:')
    if ent1 == 'end':
        break
    ent2 = input('Entity 2:')
    if ent2 == 'end':
        break
    relationClass = int(input('Relation category: '))-1
    
    qMatch = df_data.relations.apply(lambda x: checkQuery(x, ent1, ent2, relationClass))
    if sum(qMatch) == 0:
        print('nothing found\n')
    else:
        df_data[qMatch].apply(lambda x: printAnswer(x), axis=1)
    

Select the movies of your interest:
***Enter all to use all movies
***Enter n, m, x, y (numbers separated by commas) for specific selections
***Enter random, n for n random selections

1. ant-man
2. avengers_age_of_ultron
3. captain_america_civil_war
4. captain_america_the_first_avenger
5. captain_america_the_winter_soldier
6. fantastic_four
7. iron_man_3
8. lego_marvel_super_heroes
9. spider-man
10. the_amazing_spider-man_2
11. the_amazing_spider-man
12. the_avengers
13. the_wolverine
14. thor_the_dark_world
15. thor
16. x-men_apocalypse
17. x-men_days_of_future_past
18. x-men
19. x-men_the_last_stand
1

unexpected input, will use all movie files



AttributeError: ("'Series' object has no attribute 'token'", u'occurred at index 0')
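
In the session above, entering `1` still produced the "unexpected input" fallback. One likely explanation, given the `u'...'` strings in the outputs, is a Python 2 kernel: there `input()` evaluates the typed text, so `1` arrives as an integer, `'random' in x` raises `TypeError`, and the bare `except` hides it. The sketch below shows one way the menu parsing could be isolated and made robust to that. The function `select_files` and the file names are hypothetical, and `random.sample` stands in for `np.random.choice` to keep the example in the standard library.

```python
import random


def select_files(raw, files, rng=random):
    """Map a menu answer to a list of file names.

    Accepts 'all', 'random, n', or comma-separated 1-based numbers.
    Falls back to every file when the answer cannot be parsed.
    """
    text = str(raw).strip().lower()  # str() guards against non-string answers
    try:
        if text == 'all':
            return list(files)
        if text.startswith('random'):
            count = int(text.split(',')[-1])
            # sample() raises ValueError if count is negative or too large
            return rng.sample(list(files), count)
        picks = [int(part) - 1 for part in text.split(',')]
        if any(p < 0 or p >= len(files) for p in picks):
            raise ValueError('selection out of range')
        return [files[p] for p in picks]
    except ValueError:
        print('\nunexpected input, will use all movie files\n')
        return list(files)


# Placeholder names; the real files live in prep_scripts/.
files = ['movie_a', 'movie_b', 'movie_c']

print(select_files('1', files))
print(select_files(1, files))        # integer answer, as Python 2 input() would give
print(select_files('1, 3', files))
print(select_files('random, 2', files))
print(select_files('banana', files))
```

Catching only `ValueError` keeps genuine bugs visible instead of silently switching to all movies.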