This notebook explores relation extraction by finding the most common dependency paths between two entities that hold a given relation to each other. The example here is the relation "born_in" between a PER entity and a GPE entity, using data from Wikipedia biographies.

In [1]:
import re
import spacy
import neuralcoref
from collections import Counter

In [2]:
# The 'en' shortcut link only exists in spaCy 2.x, which neuralcoref also requires.
nlp = spacy.load('en')
# Workaround if you get an error loading the spaCy 'en' model:
# nlp = spacy.load('en_core_web_sm')
coref = neuralcoref.NeuralCoref(nlp.vocab)
nlp.add_pipe(coref, name='neuralcoref')

In [3]:
def get_path(one, two):
    
    """ Get dependency path between two tokens in a sentence; return None if not reachable """
    
    one_heads=[x for x in one.ancestors]
    two_heads=[x for x in two.ancestors]
    
    up_path=[]
    down_path=[]
    up_path.append(one)
    down_path.append(two)
    
    lca=None
    for head in one_heads:
        if head in two_heads:
            lca=head
            break
            
        up_path.append(head)

    for head in two_heads:
        if head == lca:
            break
    
        down_path.append(head)
   
    if lca is None:
        return None
    
    # dependency labels from `one` up to the LCA, then from the LCA down to `two`
    up='->'.join(x.dep_ for x in up_path)
    down='<-'.join(x.dep_ for x in reversed(down_path))
    path="%s->%s<-%s" % (up, lca.text, down)
    return path
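
As a rough illustration of the path format that `get_path` produces, the sketch below rebuilds the same idea on a hand-made tree, so it runs without spaCy or a model. `FakeToken` and `dependency_path` are hypothetical names introduced here, and the dependency labels for "Petermann was born in Bleicherode" are typed in by hand (chosen to match the parse shown in the training output further down) rather than produced by a parser.

```python
class FakeToken:
    """Hypothetical stand-in for a spaCy Token: only text, dep_, head and ancestors."""

    def __init__(self, text, dep_, head=None):
        self.text = text
        self.dep_ = dep_
        self.head = head  # None marks the root of the sentence

    @property
    def ancestors(self):
        # Walk up the tree from the immediate head to the root, nearest first.
        node = self.head
        while node is not None:
            yield node
            node = node.head


def dependency_path(one, two):
    """Climb from `one` to the lowest common ancestor, then descend to `two`."""
    one_heads = list(one.ancestors)
    two_heads = list(two.ancestors)
    # First ancestor of `one` that is also an ancestor of `two`.
    lca = next((h for h in one_heads if h in two_heads), None)
    if lca is None:
        return None
    up_path = [one] + one_heads[:one_heads.index(lca)]
    down_path = [two] + two_heads[:two_heads.index(lca)]
    return "%s->%s<-%s" % (
        "->".join(t.dep_ for t in up_path),
        lca.text,
        "<-".join(t.dep_ for t in reversed(down_path)),
    )


# Hand-built tree for "Petermann was born in Bleicherode" (labels are assumptions).
born = FakeToken("born", "ROOT")
petermann = FakeToken("Petermann", "nsubjpass", head=born)
was = FakeToken("was", "auxpass", head=born)
in_ = FakeToken("in", "prep", head=born)
bleicherode = FakeToken("Bleicherode", "pobj", head=in_)

example_path = dependency_path(petermann, bleicherode)
print(example_path)  # nsubjpass->born<-prep<-pobj
```

The half joined with `->` climbs from the person to the lowest common ancestor, and the half joined with `<-` is read from that ancestor down to the place.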

In [4]:
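def get_closest_coref(entity1, clusters, target_entity):
    
    """ Given a mention entity1, the document's coreference clusters, and a mention target_entity of another entity,
    return the mention coreferent with entity1 that is closest to target_entity (None if entity1 is in no cluster) """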
    
    targetCluster=None
    for chain in clusters:
        for mention in chain.mentions:
            if mention.start <= entity1.start and mention.end >= entity1.end:
                targetCluster=chain
                break

    if targetCluster is None:
        return None

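    closestMention=None
    # only consider mentions whose sentence starts fewer than 100 tokens from the target's sentence
    dist=100
    for mention in targetCluster:
        # distance between sentence start offsets (in tokens); 0 means same sentence
        sentDist=abs(target_entity.sent.start-mention.sent.start)
        if sentDist < dist:
            dist=sentDist
            closestMention=mention
        elif sentDist == dist and closestMention is not None:
            # tie on sentence distance: prefer the mention with the nearer token offset
            if abs(target_entity.start-mention.start) < abs(target_entity.start-closestMention.start):
                closestMention=mention
    return closestMention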
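
The selection rule in `get_closest_coref` (nearest sentence first, then nearest token offset to break ties) can also be written as a single `min` over a two-part key. The sketch below is one possible restatement, not the notebook's implementation: `Mention` is a hypothetical stand-in for a spaCy span that carries only the offsets the rule needs, and the offsets in the example are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span: token offset of the mention and of its sentence.
Mention = namedtuple("Mention", ["text", "start", "sent_start"])


def closest_mention(mentions, target, max_sent_dist=100):
    """Return the mention nearest to `target`, or None if none is within max_sent_dist."""
    def distance(mention):
        return (abs(target.sent_start - mention.sent_start),  # sentence distance first
                abs(target.start - mention.start))            # token distance breaks ties
    candidates = [m for m in mentions if distance(m)[0] < max_sent_dist]
    return min(candidates, key=distance, default=None)


# Made-up offsets: the place and two pronouns share a sentence starting at token 10.
place = Mention("India", start=14, sent_start=10)
chain = [
    Mention("Joseph Rudyard Kipling", start=0, sent_start=0),
    Mention("He", start=10, sent_start=10),
    Mention("his", start=19, sent_start=10),
]
best = closest_mention(chain, place)
print(best.text)  # He
```

Because tuples compare element by element, the token distance only matters when two mentions are equally far away at the sentence level.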
                

Q1. In-class activity: here's [a Google spreadsheet](https://docs.google.com/spreadsheets/d/1PNDInP5JIqad9mOXwRUxGDZntvoUerX22QQcgFCJDxY/edit?usp=sharing) with the first 5 sentences from ~500 Wikipedia biographies.  Pick 10 rows of this spreadsheet and put your student ID in the "Student ID" column; then go through those rows and read each document. If you can infer that a person from the People column was born in a place in the Places column, list that person in the "PER BORN" column and the place in the "PLACE BORN" column.

In [5]:
def read_training(filename):
    
    """ Read in training data for <person, place> tuples that express the "born_in" relation.
    
    -- Use coreference resolution to identify the person mention closest to the place mention.
    -- Use dependency parsing to extract the syntactic path from that person mention to the place.
    
    """
    
    data=[]
    with open(filename) as file:
        for line in file:
            cols=line.split("\t")
            idd=cols[0]
            doc=cols[1]
            pers=cols[4]
            place=cols[5].rstrip()
            
            if pers != "" and place != "":
                doc=nlp(doc)

                target_person=None
                target_place=None
                
                # Annotations are at the type level, so let's anchor them to specific mentions
                for entity in doc.ents:
                    if entity.text == pers:
                        target_person=entity
                    elif entity.text == place:
                        target_place=entity
                
                if target_person is not None and target_place is not None:
                    
                    # Use coreference to get person mention that's closest to the place (ideally in the same sentence).
                    closest_person_mention=get_closest_coref(target_person, doc._.coref_clusters, target_place)
                    if closest_person_mention is None:
                        closest_person_mention=target_person
                    
                    path=get_path(closest_person_mention.root, target_place.root)
                    
                    # if a path can be found between the two
                    if path is not None:
                        data.append((pers, place, path, target_place.sent ))
    return data     
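
`read_training` indexes the columns of each line directly, so a short or unannotated row is worth guarding against. The sketch below shows one way to isolate that step; `parse_row` is a hypothetical helper, the column positions (0, 1, 4, 5) are copied from `read_training`, and the sample rows are invented placeholders rather than real spreadsheet content.

```python
def parse_row(line):
    """Split one TSV line into (doc_id, text, person, place).

    Returns None for rows with fewer than 6 columns or with an empty
    person/place annotation. Column positions follow read_training.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 6:
        return None
    doc_id, text = cols[0], cols[1]
    person, place = cols[4].strip(), cols[5].strip()
    if not person or not place:
        return None
    return doc_id, text, person, place


# Invented rows for illustration; columns 2 and 3 are filled with placeholders.
annotated = "42\tPetermann was born in Bleicherode.\tx\ty\tPetermann\tBleicherode\n"
unannotated = "43\tSome other biography.\tx\ty\t\t\n"
truncated = "44\tToo few columns\n"

print(parse_row(annotated))
print(parse_row(unannotated))
print(parse_row(truncated))
```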

Save this Google sheet as a TSV file in `data/born.tsv` (read below as `../data/born.tsv`, relative to this notebook) and execute the `read_training` function on it to read in the <person, place> tuples.

In [7]:
trainingData=read_training("../data/born.tsv")
for data in trainingData:
    print ('\t'.join([str(x) for x in data]))

Petermann	Bleicherode	nsubjpass->born<-prep<-pobj	Petermann was born in Bleicherode , Germany .
Omar Ahmed Sayid Khadr	Canada	nsubjpass->taken<-advcl<-prep<-pobj	Born in Canada , Khadr was taken to Afghanistan by his father , who was affiliated with Al-Qaeda and other terrorist organizations .
Marx	Trier	npadvmod->amod->dep->Born<-prep<-pobj	Born in Trier , Germany , to a Jewish middle-class family , Marx studied law and philosophy at university .
Joseph Rudyard Kipling	India	nsubjpass->born<-prep<-pobj	He was born in India , which inspired much of his work .
Jeffrey Koons	Pennsylvania	nsubj->lives<-conj<-prep<-pobj<-conj<-prep<-pobj<-appos	He lives and works in both New York City and his hometown of York , Pennsylvania .
Sonia Maria Sotomayor	Bronx	nsubjpass->born<-prep<-pobj	Sotomayor was born in The Bronx , New York City , to Puerto Rican-born parents .
Margaret Alice Murray	Calcutta	nsubj->divided<-advcl<-prep<-pobj<-prep<-pobj	Born to a wealthy middle-class English family in Calcu

Q2: Count the syntactic paths identified in the training data.  What are the two that are most frequently attested?

In [8]:
counts=Counter()
for _, _, path, _ in trainingData:
    counts[path]+=1
for k, v in counts.most_common(2):
    print("%s\t%s" % (k,v))

nsubjpass->born<-prep<-pobj	16
nsubjpass->born<-prep<-pobj<-prep<-pobj	2


Q3: Write a function to read in a target file (containing one document per line) and a syntactic path and identify all people/places that are joined by that path. Hint: you can use the `get_path` function defined above to retrieve the syntactic path between two tokens.

In [9]:
def extract_relations(filename, target_path):
    
    """ Extract new relations from a file.
    Input: 
        - filename containing one document per line
        - target_path: the syntactic dependency path connecting the person entity to the place entity
    Output:
        - a list of (pers, place, path, sentence) tuples in the same format returned from `read_training`.
    
    """
    data=[]
    with open(filename) as file:
        for line in file:
            text=line.rstrip()

            doc=nlp(text)

            for sent in doc.sents:
                people=[]
                places=[]

                for entity in sent.ents:
                    if entity.label_ == "PERSON":
                        people.append(entity)
                    elif entity.label_ == "GPE":
                        places.append(entity)

                for person in people:
                    for place in places:

                        path=get_path(person.root, place.root)

                        if path is not None and path == target_path:
                            data.append((person, place, path, place.sent ))
                           # print("%s\t%s\t%s\t%s" % (person, place, path, sent))
    return data

In [10]:
new_examples=extract_relations("../data/wiki.bio.born.test.txt", "nsubjpass->born<-prep<-pobj")
for data in new_examples:
    print ('\t'.join([str(x) for x in data]))
    print()

Joel	Bronx	nsubjpass->born<-prep<-pobj	Joel was born in 1949 in The Bronx , New York , and grew up on Long Island , New York , both places that influenced his music .

John Whitfield Bunn	Hunterdon County	nsubjpass->born<-prep<-pobj	John Whitfield Bunn was born June 21 , 1831 , in Hunterdon County , New Jersey .

Hanna	New Lisbon	nsubjpass->born<-prep<-pobj	Hanna was born in New Lisbon ( today Lisbon ) , Ohio , in 1837 .

Foraker	Ohio	nsubjpass->born<-prep<-pobj	Foraker was born in rural Ohio in 1846 , and enlisted at age 16 in the Union Army during the American Civil War .

Bonnet	Bassillac	nsubjpass->born<-prep<-pobj	Bonnet was born in Bassillac , Dordogne , the son of a lawyer .

Agnew	Baltimore	nsubjpass->born<-prep<-pobj	Agnew was born in Baltimore , to an American-born mother and a Greek immigrant father .

Bush	New Haven	nsubjpass->born<-prep<-pobj	Bush was born in New Haven , Connecticut , and grew up in Texas .

Biden	Scranton	nsubjpass->born<-prep<-pobj	Biden was born in Scra

Q4: Execute `extract_relations` on `../data/wiki.bio.born.test.txt` and the two most frequent paths identified in the training data above.

In [11]:
new_examples=extract_relations("../data/wiki.bio.born.test.txt", "nsubjpass->born<-prep<-pobj<-prep<-pobj")
for data in new_examples:
    print ('\t'.join([str(x) for x in data]))
    print()

Joel Stephen Kovel	Brooklyn	nsubjpass->born<-prep<-pobj<-prep<-pobj	Joel Stephen Kovel was born on August 27 , 1936 , in Brooklyn , New York .
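
Calling `extract_relations` once per path parses the test file once for each path. If the candidate (person, place, path) triples were collected in a single pass instead, they could be filtered against several target paths at once. The sketch below illustrates only that filtering step on plain tuples; `group_by_target_path` is a hypothetical helper, and the three candidates are copied from outputs shown earlier in this notebook.

```python
from collections import defaultdict


def group_by_target_path(candidates, target_paths):
    """Keep the (person, place, path) candidates whose path is a target, grouped by path."""
    wanted = set(target_paths)
    grouped = defaultdict(list)
    for person, place, path in candidates:
        if path in wanted:
            grouped[path].append((person, place))
    return dict(grouped)


# The two most frequent paths from the training data.
top_paths = [
    "nsubjpass->born<-prep<-pobj",
    "nsubjpass->born<-prep<-pobj<-prep<-pobj",
]

# Candidates taken from outputs printed above; the last one matches neither path.
candidates = [
    ("Bush", "New Haven", "nsubjpass->born<-prep<-pobj"),
    ("Joel Stephen Kovel", "Brooklyn", "nsubjpass->born<-prep<-pobj<-prep<-pobj"),
    ("Jeffrey Koons", "Pennsylvania",
     "nsubj->lives<-conj<-prep<-pobj<-conj<-prep<-pobj<-appos"),
]

matches = group_by_target_path(candidates, top_paths)
for path, pairs in matches.items():
    print("%s\t%s" % (path, pairs))
```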

