### Tokenizer

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize
import tqdm

In [36]:
te = "A team of international observers that came under fire in northern Syria was apparently stranded in the area overnight, and United Nations officials said Wednesday they hope to evacuate the group within hours. A roadside bomb struck the team's vehicles Tuesday during a mission in the northern town of Khan Sheikhoun, but none of the observers was wounded. The attack, which came minutes after witnesses said regime forces gunned down mourners at a funeral procession nearby, dealt a fresh blow to international envoy Kofi Annan's peace plan. Activists said the violence continued Wednesday with regime forces opening fire from the outskirts of Khan Sheikhoun. Rami Abdul-Rahman, who heads the Britain-based Syrian Observatory for Human Rights activist group, said the heavy machine-gun fire has so far prevented people from holding funerals for some of the 20 mourners who were killed at the funeral a day earlier. The U.N. said rebel forces had given the observers shelter in the town, which has witnessed anti-government protests since an uprising against President Bashar Assad's regime began in March last year. Maj. Gen. Robert Mood, the Norwegian head of the U.N. team, told reporters Wednesday that he had spoken with the observers in Khan Sheikhoun by telephone and that they \"told us that they are happy and safe where they are.\" Ahmad Fawzi, Annan's spokesman, said in a statement that the mission will recover the six staff members later Wednesday. \"The U.N. staff members are co-located with opposition elements ... and are reportedly being treated well.\" Fawzi said the observers were caught up in the explosion as they met with the rebel Free Syrian Army. He said three vehicles were damaged. More than 200 U.N. observers have been deployed throughout Syria to monitor a cease-fire agreement that has been repeatedly violated by both sides since it took effect on April 12. Tuesday's attack was at least the second time the U.N. observers have been caught up in Syria's violence. Last week, a roadside bomb struck a Syrian military truck in the south of the country just seconds after Mood rode by in a convoy. It was not clear who was behind the blast and no one claimed responsibility. A video posted by activists online appeared to show the exact moment the U.N. vehicle was struck. The video shows two white vehicles clearly marked \"U.N\" with people milling around it, and two others parked a few meters (yards) behind. Slippers apparently left behind by the mourners running away from the shooting earlier are seen strewn about on the ground. The blast blew off the front of the first vehicle and sent up a plume of smoke as people screamed and frantically ran for cover. The four cars are then seen slowly driving away. It was not clear how close the observers were to the funeral shootings, but if confirmed, a regime attack on civilians directly in front of the observer mission could put pressure on them to describe publicly what they are seeing in Syria. They report back to the U.N. but have not publicized their findings. Syria's state-run TV, meanwhile, reported Wednesday that authorities released 250 people who were involved in the uprising. Assad has issued several pardons releasing thousands of detainees since the crisis began. The Observatory also said Syrian forces opened fire at the Naziheen Palestinian refugee camp in the southern city of Daraa, killing four people. The pro-government TV station Ikhbariyah blamed members of \"an armed terrorist group,\" saying they fired two rocket-propelled grenades at the camp, killing a 4-year-old girl and wounding 15 other people. The Syrian uprising began with mostly peaceful protests calling for change, but a relentless government crackdown led many in the opposition to take up arms. Some soldiers also have switched sides and joined forces with the rebels. The U.N. estimates the conflict has killed more than 9,000 people."
print (word_tokenize(te))

['A', 'team', 'of', 'international', 'observers', 'that', 'came', 'under', 'fire', 'in', 'northern', 'Syria', 'was', 'apparently', 'stranded', 'in', 'the', 'area', 'overnight', ',', 'and', 'United', 'Nations', 'officials', 'said', 'Wednesday', 'they', 'hope', 'to', 'evacuate', 'the', 'group', 'within', 'hours', '.', 'A', 'roadside', 'bomb', 'struck', 'the', 'team', "'s", 'vehicles', 'Tuesday', 'during', 'a', 'mission', 'in', 'the', 'northern', 'town', 'of', 'Khan', 'Sheikhoun', ',', 'but', 'none', 'of', 'the', 'observers', 'was', 'wounded', '.', 'The', 'attack', ',', 'which', 'came', 'minutes', 'after', 'witnesses', 'said', 'regime', 'forces', 'gunned', 'down', 'mourners', 'at', 'a', 'funeral', 'procession', 'nearby', ',', 'dealt', 'a', 'fresh', 'blow', 'to', 'international', 'envoy', 'Kofi', 'Annan', "'s", 'peace', 'plan', '.', 'Activists', 'said', 'the', 'violence', 'continued', 'Wednesday', 'with', 'regime', 'forces', 'opening', 'fire', 'from', 'the', 'outskirts', 'of', 'Khan', 'She

### HiEve

In [3]:
from os import listdir
from os.path import isfile, join
mypath = 'hievents/'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
print (onlyfiles[:3])

['article-10901.xml', 'article-1126.xml', 'article-11554.xml']


In [5]:
import xml.etree.ElementTree as ET
countevent, countsub, counttext = 0, 0, 0
for fname in tqdm.tqdm(onlyfiles):
    tree = ET.parse(mypath+fname)
    root = tree.getroot()
    this_text = None
    with open('hieve_processed/' + fname.replace('.xml', '.tsvx'), 'w', encoding='utf8') as fp:
        id2name = {}
        counttext += 1
        name_record = {}
        offset2index = {}
        passed_id = set()
        for child in root:
            if child.tag == 'Text':
                # Collapse newlines so the whole text is tokenized as a single line
                child.text = child.text.replace('\n', ' ')
                this_text = word_tokenize(child.text.strip())
                fp.write('Text\t' + ' '.join(this_text) + '\n')
            elif child.tag == 'Events':
                for event in child:
                    id2name[event[0].text] = event[1].text
                    index = name_record.get(event[1].text)
                    if index is None:
                        # First mention of this name: look for its first matching token
                        try:
                            this_index = this_text.index(event[1].text)
                        except (ValueError, AttributeError):
                            # No single token matches the name (or no Text element was seen yet)
                            this_index = -1
                        if this_index == -1:
                            name_record[event[1].text] = this_index
                            continue
                        offset2index[event[3].text] = this_index
                        name_record[event[1].text] = {this_index}
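                    elif index != -1:
                        # Repeated name: take the next matching token that has not been assigned yet
                        this_index = None
                        try:
                            this_index = [i for i, n in enumerate(this_text) if n == event[1].text][len(name_record[event[1].text])]
                        except IndexError:
                            # More mentions than matching tokens: mark the name as exhausted
                            name_record[event[1].text] = -1
                            continue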
                        offset2index[event[3].text] = this_index
                        name_record[event[1].text].add(this_index)
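                    # Note: a name already marked -1 skips both branches above and is written with the previous this_index
                    fp.write('Event\t')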
                    fp.write('\t'.join([e.text for e in event]) + '\t' + str(this_index) + '\n')
                    passed_id.add(event[0].text)
            elif child.tag == 'Relations':
                local_events, sup_set, coref_set = set(), set(), set()
                coref, trans = {}, {}
                sup_write = []
                for relation in child:
                    if relation[0].text not in passed_id and relation[1].text not in passed_id:
                        continue
                    #fp.write('Relation\t')
                    #fp.write('\t'.join([e.text for e in relation[:4]]))
                    #fp.write('\t' + id2name[relation[0].text] + '\t' + id2name[relation[1].text] + '\n')
                    if relation[2].text.lower() == 'supersub':
                        local_events.add(relation[0].text)
                        local_events.add(relation[1].text)
                        sup_write.append([e.text for e in relation[:4]])
                        sup_set.add((relation[0].text, relation[1].text))
                        if trans.get(relation[0].text) is None:
                            trans[relation[0].text] = [relation[1].text]
                        else:
                            trans[relation[0].text].append(relation[1].text)
                    elif relation[2].text.lower() == 'coref':
                        if coref.get(relation[0].text) is None:
                            coref[relation[0].text] = set([relation[1].text])
                        else:
                            coref[relation[0].text].add(relation[1].text)
                        if coref.get(relation[1].text) is None:
                            coref[relation[1].text] = set([relation[0].text])
                        else:
                            coref[relation[1].text].add(relation[0].text)
                        coref_set.add(( relation[0].text , relation[1].text ))
                        coref_set.add(( relation[1].text , relation[0].text ))
                        fp.write('Relation\t')
                        fp.write('\t'.join([e.text for e in relation[:4]]))
                        fp.write('\t' + id2name[relation[0].text] + '\t' + id2name[relation[1].text] + '\n')
                # Add coref
                this_add = []
                for line in sup_write:
                    if coref.get(line[0]) is not None:
                        for e in coref[line[0]]:
                            if (e, line[1]) not in coref_set:
                                this_add.append([e, line[1], line[2], 'true'])
                                sup_set.add((e, line[1]))
                    if coref.get(line[1]) is not None:
                        for e in coref[line[1]]:
                            if (line[0], e) not in coref_set:
                                this_add.append([line[0], e, line[2], 'true'])
                                sup_set.add((line[0], e))
                sup_write += this_add
                this_add = []
                # Add transitive
                # Bound the expansion to a fixed number of passes for simplicity (three here)
                for _ in range(3):
                    for line in sup_write:
                        if trans.get(line[1]) is not None:
                            this_append = []
                            for e in trans[line[1]]:
                                if (line[0], e) not in sup_set and (line[0], e) not in coref_set and id2name[line[0]] != id2name[e]:
                                    this_add.append([line[0], e, line[2], 'true'])
                                    sup_set.add((line[0], e))
                                    this_append.append(e)
                            for e in this_append:
                                if trans.get(e) is not None:
                                    trans[line[1]] += trans[e]
                    # Note: this_add is never cleared, so pairs added in earlier passes are appended again here
                    sup_write += this_add
                for relation in sup_write:
                    fp.write('Relation\t')
                    fp.write('\t'.join(relation))
                    fp.write('\t' + id2name[relation[0]] + '\t' + id2name[relation[1]] + '\n')
                    countsub += 1
                countevent += len(local_events)
print (countevent, countsub, counttext)

100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 73.26it/s]


1259 3925 100
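
The relation-expansion step in the cell above (propagate SuperSub links across coreferent events, then add transitive links for a bounded number of passes) can be isolated as a small pure function. The sketch below is a simplified illustration, not the notebook's exact logic: the function name `expand_subevents`, the `max_passes` parameter, and the toy event IDs are all hypothetical, results are kept in a set so no pair is emitted twice, and the original check that the two events have different surface names is omitted.

```python
from collections import defaultdict


def expand_subevents(supersub, coref, max_passes=3):
    """Return (parent, child) pairs after coref propagation and bounded transitive expansion.

    supersub: iterable of (parent_id, child_id) pairs
    coref:    iterable of (id_a, id_b) pairs, treated as symmetric
    """
    # Symmetric coreference lookup
    coref_of = defaultdict(set)
    for a, b in coref:
        coref_of[a].add(b)
        coref_of[b].add(a)

    pairs = set(supersub)

    # Step 1: an event coreferent with the parent (or child) inherits the relation,
    # unless the two ends of the new pair are themselves coreferent.
    for parent, child in list(pairs):
        for p in coref_of.get(parent, ()):
            if p != child and child not in coref_of.get(p, ()):
                pairs.add((p, child))
        for c in coref_of.get(child, ()):
            if c != parent and c not in coref_of.get(parent, ()):
                pairs.add((parent, c))

    # Step 2: add grandchildren as children, for at most max_passes passes.
    for _ in range(max_passes):
        children = defaultdict(set)
        for parent, child in pairs:
            children[parent].add(child)
        new_pairs = {
            (parent, grandchild)
            for parent, child in pairs
            for grandchild in children.get(child, ())
            if grandchild != parent and grandchild not in coref_of.get(parent, ())
        }
        if new_pairs <= pairs:  # nothing new: stop early
            break
        pairs |= new_pairs
    return pairs


# Toy example with hypothetical event IDs
example = expand_subevents(
    supersub=[("war", "battle"), ("battle", "attack")],
    coref=[("attack", "strike")],
)
print(sorted(example))
```

Keeping the pairs in a set is a design choice of this sketch; it means a count taken over the result reflects distinct relations only.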

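The event loop in the cell above assigns each event mention to a token position by taking the first match for a new name and the next unused match for a repeated name. A compact way to express that idea is sketched here, assuming (as the cell does) that each mention is a single token and that mentions are listed in document order. The helper name `nth_occurrence_indices` and the sample tokens are hypothetical.

```python
def nth_occurrence_indices(tokens, mentions):
    """Map each mention to the index of its next unused occurrence in tokens, or -1 if none is left."""
    # Positions of every token, in order of appearance
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)

    used = {}      # mention text -> number of occurrences already assigned
    result = []
    for mention in mentions:
        k = used.get(mention, 0)
        occurrences = positions.get(mention, [])
        if k < len(occurrences):
            result.append(occurrences[k])
            used[mention] = k + 1
        else:
            result.append(-1)  # mention not found, or all occurrences already used
    return result


tokens = ["fire", "came", "fire", "attack"]
print(nth_occurrence_indices(tokens, ["fire", "attack", "fire", "fire"]))
```

Unlike the cell above, this sketch always appends an index for every mention, so the caller decides whether to drop the `-1` entries.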

### NLTK

In [11]:
import nltk
nltk.download('propbank')
nltk.download('framenet_v17')

[nltk_data] Downloading package propbank to
[nltk_data]     C:\Users\ccolo\AppData\Roaming\nltk_data...
[nltk_data]   Package propbank is already up-to-date!
[nltk_data] Downloading package framenet_v17 to
[nltk_data]     C:\Users\ccolo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\framenet_v17.zip.


True

In [12]:
from nltk.corpus import framenet as fn

In [4]:
from nltk.corpus import propbank
pb_instances = propbank.instances()

In [7]:
print (propbank.verbs())

['abandon', 'abate', 'abdicate', 'abet', 'abide', ...]


In [40]:
print (fn.lu(len(fn.frames())-1).lexemes[0]['name'])

skulk
[]


In [43]:
## overall
import nltk
nltk.download('propbank')
nltk.download('framenet_v17')
from nltk.corpus import propbank
from nltk.corpus import framenet as fn
verbs = [x.lower() for x in propbank.verbs()]
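for i in range(len(fn.lus())):
    try:
        # fn.lu() looks up a lexical unit by ID, not by position; failed lookups are skipped below
        x = fn.lu(i).name[:-2].lower()
        # rindex raises ValueError when no '.' remains (e.g. 'skulk.v' -> 'skulk'), so such names are skipped too
        x = x[:x.rindex('.')]
        verbs.append(x)
    except Exception:
        pass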
verbs = set(verbs)
print (len(verbs))
print (verbs)

[nltk_data] Downloading package propbank to
[nltk_data]     C:\Users\ccolo\AppData\Roaming\nltk_data...
[nltk_data]   Package propbank is already up-to-date!
[nltk_data] Downloading package framenet_v17 to
[nltk_data]     C:\Users\ccolo\AppData\Roaming\nltk_data...
[nltk_data]   Package framenet_v17 is already up-to-date!
3395
{'cake', 'formulate', 'sip', 'ambush', 'clamber', 'come', 'spirit', 'surround', 'take', 'join', 'flirt', 'want', 'hook', 'impress', 'bloody', 'deregulate', 'spoil', 'blaze', 'bless', 'haunt', 'drink', 'unite', 'indict', 'back', 'deepen', 'eat', 'atone', 'discharge', 'endorse', 'coat', 'exhaust', 'find', 'ruffle', 'shroud', 'unravel', 'redden', 'place weight', 'symbolize', 'scoff', 'cancel', 'smart', 'relax', 'fine', 'banish', 'remove', 'read', 'cast', 'design', 'budget', 'anticipate', 'crown', 'black', 'scatter', 'entrance', 'faint', 'sign', 'ally', 'enact', 'befall', 'reiterate', 'three', 'out', 'humble', 'expand', 'cheapen', 'declassify', 'savor', 'eradicate', 
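
If the goal of the cell above is to add FrameNet verb lemmas to the PropBank verb list, one possible approach is to iterate over the lexical units directly and keep only names ending in `.v`. This is a sketch under that assumption, relying on FrameNet LU names having the form `lemma.pos`; the helpers `verb_lemma` and `framenet_verb_lemmas` are hypothetical names, and the resulting set will not match the count printed above.

```python
def verb_lemma(lu_name):
    """Return the lowercased lemma if lu_name looks like 'lemma.v', otherwise None."""
    lemma, sep, pos = lu_name.rpartition('.')
    if sep and lemma and pos.lower() == 'v':
        return lemma.lower()
    return None


def framenet_verb_lemmas():
    """Collect verb lemmas from FrameNet (requires the framenet_v17 corpus to be downloaded)."""
    # Imported lazily so verb_lemma can be used without NLTK data installed
    from nltk.corpus import framenet as fn
    lemmas = (verb_lemma(lu.name) for lu in fn.lus())
    return {lemma for lemma in lemmas if lemma is not None}


print(verb_lemma('skulk.v'))
print(verb_lemma('three.num'))
```

The set returned by `framenet_verb_lemmas()` could then be merged with the lowercased `propbank.verbs()` list using a set union.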