# Entity Extraction Workshop Exercise 2: Looking for Relations

In this exercise we will use spaCy's entity extraction algorithm to find relations between different entities in the Brown corpus.

## Part 1: Basic entity extraction

The Brown corpus is a well-known corpus of English developed at Brown University, containing text from many different sources. We will use entity extraction on a subset of the Brown corpus covering a few categories.

We can use spaCy to find entities in a basic sentence as follows:

In [45]:
import spacy
nlp = spacy.load('en_core_web_sm')
sample_sentence = "The White House is located in Washington D.C."
sample_doc = nlp(sample_sentence)
print([(ent.text, ent.label_) for ent in sample_doc.ents])

[('The White House', 'ORG'), ('Washington D.C.', 'GPE')]


To see what an entity label means:

In [46]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

And to display the entities in a document using displaCy:

In [47]:
from spacy import displacy
displacy.render(sample_doc, style='ent', jupyter = True)

Now let's load sentences from the Brown corpus for a few categories:

In [48]:
from nltk.corpus import brown
import nltk
nltk.download('brown')
sentences = brown.sents(categories = ['news', 'editorial', 'reviews'])

[nltk_data] Downloading package brown to
[nltk_data]     /Users/jeremybensoussan/nltk_data...
[nltk_data]   Package brown is already up-to-date!


**Questions:**
  1. Use displaCy to display the entities in the first three sentences of this corpus. What are some of the entities that are tagged, and what do their entity labels mean?
  2. Who are the five most frequently mentioned people in the corpus for these categories? What are the five most frequently mentioned buildings? (Hint: See [this page](https://spacy.io/usage/linguistic-features#section-named-entities) under "built-in entity types")

In [49]:
displacy.render(nlp(' '.join(' '.join(sentences[i]) for i in range(3))), style='ent', jupyter=True)

The entity labels that appear in this example are: <div class="alert alert-block alert-info">DATE, GPE, ORG, PERSON</div>

In [50]:
entities = ['DATE', 'GPE', 'ORG', 'PERSON']
print('\n'.join('{}: {}'.format(label, spacy.explain(label)) for label in entities))

DATE: Absolute or relative dates or periods
GPE: Countries, cities, states
ORG: Companies, agencies, institutions, etc.
PERSON: People, including fictional


In [51]:
people_list = []
building_list = []

for sentence in sentences:
    for ent in nlp(' '.join(sentence)).ents:
        if ent.label_ == 'PERSON':
            people_list.append(ent.text)
        elif ent.label_ == 'FAC':
            building_list.append(ent.text)

In [52]:
from collections import Counter
print('Most common people are: {}'.format(Counter(people_list).most_common(5)))
print('Most common buildings are: {}'.format(Counter(building_list).most_common(5)))

Most common people are: [('Kennedy', 115), ('Khrushchev', 72), ('Maris', 34), ('Podger', 23), ('Eisenhower', 22)]
Most common buildings are: [('Broadway', 16), ('the White House', 9), ('Capitol', 5), ('Pennsylvania Avenue', 4), ('Lewisohn Stadium', 4)]
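
The counts above are taken over raw entity strings, so variants such as 'the White House' and 'White House' are tallied as separate entries. One possible way to merge such variants before counting is sketched below. The helper `normalize_entity` is hypothetical (it is not part of spaCy), and the toy list is made up for illustration rather than taken from the corpus.

```python
from collections import Counter

def normalize_entity(text):
    """Hypothetical helper: collapse whitespace and drop a leading 'the'."""
    words = text.split()
    if words and words[0].lower() == 'the':
        words = words[1:]
    return ' '.join(words)

# Toy stand-in for building_list; these strings are illustrative only.
toy_buildings = ['the White House', 'White House', 'Broadway', 'The  White House']

# Count the normalized names instead of the raw entity strings.
counts = Counter(normalize_entity(name) for name in toy_buildings)
print(counts.most_common(2))
```

The same idea could be applied to `people_list` and `building_list` before calling `most_common`. Whether dropping the article is appropriate depends on the entity type, so treat it as a starting point.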


## Part 2: Finding relations

Now we will look at pairs of entities in sentences in the corpus and try to identify relations between them.

**Questions**:
  3. We would like to know where organizations are located. Try to find all occurrences of organization-location pairs where the organization (ORG) comes before the location (GPE) in the sentence, and the word "in" appears somewhere between them. Put these in a pandas DataFrame with three columns: ORG (organization name), GPE (location name), and context (the words between the organization and the location). How many of these are there?
  
  Hint: use entity.start and entity.end to get the starting and ending indices for an entity in the sentence.
  
  4. How much does this data tell us about what organizations are located where? In what cases can we be more or less certain?
  
  5. What is another example of a pair of entity labels and context word that would give us useful information? Try running your code to find this new relation.
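
The hint in question 3 works because `entity.start` is the index of an entity's first token and `entity.end` is the index just past its last token, so the tokens between two entities are `tokens[first.end:second.start]`. The sketch below shows only that slicing step. It uses a plain token list and hand-written `(text, label, start, end)` tuples as a stand-in for spaCy entities; the sentence and the indices are set by hand, not produced by the model.

```python
# Hand-tokenized example sentence (illustrative, not from the corpus).
tokens = ['The', 'White', 'House', 'aides', 'in', 'Washington', 'declined', 'comment']

# Stand-ins for spaCy entities: (text, label, start, end), with end exclusive.
org = ('The White House', 'ORG', 0, 3)
gpe = ('Washington', 'GPE', 5, 6)

org_end = org[3]      # index just past the last ORG token
gpe_start = gpe[2]    # index of the first GPE token

# Tokens strictly between the two entities.
context_tokens = tokens[org_end:gpe_start]
context = ' '.join(context_tokens)

# Check for the keyword as a whole token, ignoring case.
has_in = 'in' in [tok.lower() for tok in context_tokens]
print(repr(context), has_in)
```

Checking whole tokens rather than substrings matters here: a substring test would also match words such as "inside" or "winter".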

In [113]:
import pandas as pd

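# Collect (ORG, GPE, context) triples where "in" appears between the two entities.
org_loc_dict = {'ORG':[], 'GPE':[], 'context':[]}
for sentence in sentences:
    org_index = -1  # token index just past the most recent ORG
    org = ''        # text of the most recent ORG; '' means none is pending
    tokens = nlp(' '.join(sentence))
    for entity in tokens.ents:
        if org != '' and entity.label_ == 'GPE':
            # Words between the end of the ORG and the start of the GPE
            context = ' '.join([tok.text for tok in tokens[org_index:entity.start]])
            if 'in' in context.lower().split(' '):
                org_loc_dict['ORG'].append(org)
                org_loc_dict['GPE'].append(entity.text)
                org_loc_dict['context'].append(context)
                # Reset so each ORG is paired with at most one GPE
                org = ''
                org_index = -1
        elif entity.label_ == 'ORG':
            org = entity.text
            org_index = entity.end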

In [115]:
org_locations = pd.DataFrame.from_dict(org_loc_dict, dtype='str')
org_locations

Unnamed: 0,ORG,GPE,context
0,the State Welfare Department,Fulton County,`` has seen fit to distribute these funds thro...
1,GOP,Blue Ridge,"chairman , said a meeting held Tuesday night in"
2,State Party,Savannah,Chairman James W. Dorsey added that enthusiasm...
3,TEA,Dallas County,estimated there would be 182 scholastics to at...
4,ADC,Cook county,program in
5,White House,Washington,aids in
6,Customary Senate,Negro,rules were ignored in order to speed approval ...
7,NATO,Angola,committee has been set up so that in the futur...
8,State,the United States,"himself , in his first speech , gave some idea..."
9,State,Berlin,has also solemnly repeated a warning to the So...


#### 4. The context column helps a lot in understanding the link between an organization and its location.
The organization/location pairs whose context is just 'in' or 'in the' are the most reliable: the organization is very likely to be in that location.

After those, I would say organizations whose context ends with the word 'in' are likely to be in that location.

Finally, pairs where the word 'in' appears anywhere other than at the end of the context are more prone to confusion.
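
The three tiers described in this answer can be written down as a small rule. The sketch below is one possible encoding, not a validated classifier. The function name `context_confidence` and the tier names 'high', 'medium' and 'low' are assumptions introduced here for illustration.

```python
def context_confidence(context, keyword='in'):
    """Hypothetical tiering of a context string, following the answer to question 4."""
    words = context.lower().split()
    # Tier 1: the context is exactly 'in' or 'in the'.
    if words in ([keyword], [keyword, 'the']):
        return 'high'
    # Tier 2: the keyword is the last word of a longer context.
    if words and words[-1] == keyword:
        return 'medium'
    # Tier 3: the keyword is missing or appears somewhere before the end.
    return 'low'

# 'program in' and 'aids in' appear in the table above; the last string is made up.
for example in ['in', 'in the', 'program in', 'aids in', 'was set up so that in future']:
    print('{!r}: {}'.format(example, context_confidence(example)))
```

With the DataFrame built earlier, the rule could be applied as `org_locations['context'].map(context_confidence)` to add a rough confidence column.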

In [119]:
author_dict = {'author':[], 'artpiece':[], 'context':[]}
for sentence in sentences:
    author_index = -1
    author = ''
    tokens = nlp(' '.join(sentence))
    for entity in tokens.ents:
        if author != '' and entity.label_ == 'WORK_OF_ART':
            context = ' '.join([tok.text for tok in tokens[author_index:entity.start]])
            if 'of' in context.lower().split(' ') or 'from' in context.lower().split(' '):
                author_dict['author'].append(author)
                author_dict['artpiece'].append(entity.text)
                author_dict['context'].append(context)
                author = ''
                author_index = -1
        elif entity.label_ == 'PERSON':
            author = entity.text
            author_index = entity.end

In [120]:
authors = pd.DataFrame.from_dict(author_dict, dtype='str')
authors

Unnamed: 0,author,artpiece,context
0,T. F. Zimmerman,Bible,", general superintendent , commented , `` The ..."
1,Alva W. Vernava,Maple Ave,", 21 , of 23"
2,Monte Carlo,Francesca Da Rimini,", will present a program of four ballets inclu..."
3,Brevard,Gallery,group of 85 arrived at the
4,Richard Preston,the Governors Conference on Industrial Develop...,", executive director of the New Hampshire Stat..."
5,Geroge Bragg,`` Stabat Mater '',", the twenty - six boys made some lovely sound..."
6,Rudy Bond,`` Little Tin Box '',and his band of tuneful ward - heelers made
7,Brown,`` Where The Boys Are '',"senior , while two Yalies are cast as virtual ..."
8,Owen Wister,`` The Virginian '',", author of"
9,Frederick Fuller,East Meets West,", baritone , presented a program of folksongs ..."


In [123]:
nationalities_dict = {'person':[], 'nationality':[], 'context':[]}
for sentence in sentences:
    person_index = -1
    person = ''
    tokens = nlp(' '.join(sentence))
    for entity in tokens.ents:
        if person != '' and entity.label_ == 'NORP':
            context = ' '.join([tok.text for tok in tokens[person_index:entity.start]])
            if 'of' in context.lower().split(' ') or 'from' in context.lower().split(' '):
                nationalities_dict['person'].append(person)
                nationalities_dict['nationality'].append(entity.text)
                nationalities_dict['context'].append(context)
                person = ''
                person_index = -1
        elif entity.label_ == 'PERSON':
            person = entity.text
            person_index = entity.end

In [124]:
nationalities = pd.DataFrame.from_dict(nationalities_dict, dtype='str')
nationalities

Unnamed: 0,person,nationality,context
0,J. M. Cheshire,Griffin,of
1,Kennedy,Americans,today proposed a mammoth new medical care prog...
2,Kennedy,Communist,administration would be held responsible if th...
3,Castro,Communist,regime from the
4,Souvanna Phouma,Communists,", whom it felt was too trusting of"
5,James J. Sheeran,Republican,"of West Orange , for the"
6,Wagner,Democratic,has the confidence of the
7,Souvanna Phouma,Communist,", leader of the nation 's neutralists and reco..."
8,Pathet Lao,Communist,"rebel attacks had been launched , heavily supp..."
9,Anderson,Cuban,", a Seattle ex - marine and Havana businessman..."
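
The three loops above differ only in the pair of entity labels and the context keywords, so the shared logic can be factored into a single function. The sketch below is one way to do that, assuming entities are passed as plain `(text, label, start, end)` tuples in sentence order. The function name `find_relations` and the tuple layout are introduced here for illustration, and the example indices are set by hand.

```python
def find_relations(tokens, entities, first_label, second_label, keywords):
    """Pair each first_label entity with a later second_label entity
    when one of the keywords appears among the tokens between them."""
    relations = []
    pending = None  # (text, end) of the most recent first_label entity
    for text, label, start, end in entities:
        if pending is not None and label == second_label:
            between = tokens[pending[1]:start]
            if any(tok.lower() in keywords for tok in between):
                relations.append((pending[0], text, ' '.join(between)))
                pending = None  # each first entity is used at most once
        elif label == first_label:
            pending = (text, end)
    return relations

# Hand-built example modeled on the 'Owen Wister' row in the authors table.
tokens = ['Owen', 'Wister', ',', 'author', 'of', '``', 'The', 'Virginian', "''"]
entities = [('Owen Wister', 'PERSON', 0, 2),
            ("`` The Virginian ''", 'WORK_OF_ART', 5, 9)]

result = find_relations(tokens, entities, 'PERSON', 'WORK_OF_ART', {'of', 'from'})
print(result)
```

For a spaCy document `doc`, the inputs could be built as `[tok.text for tok in doc]` and `[(ent.text, ent.label_, ent.start, ent.end) for ent in doc.ents]`.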
