In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

from standard_libs import *

<a id="ch07"></a>
# Chapter 7: Extracting Information from Text
Questions to be answered:
1. How can we build a system that extracts structured data, such as tables, from unstructured text?
2. What are some robust methods for identifying the entities and relationships described in a text?
3. Which corpora are appropriate for this work, and how do we use them for training and evaluating our models?


* [Section 1: Information Extraction](#section1)
* [Section 2: Chunking](#section2)
    - [Section 2.1 Noun Phrase Chunking](#section2_1)
    - [Section 2.3 Chunking with Regular Expressions](#section2_3)
    - [Section 2.4 Exploring Corpora](#section2_4)
    - [Section 2.5 Chinking](#section2_5)
    - [Section 2.6 Representing Chunks: Tags vs Trees](#section2_6)
* [Section 3: Developing Chunkers](#section3)
    - [Section 3.1 Reading data](#section3_1)
    - [Section 3.2 Simple Baselines](#section3_2)
    - [Section 3.3 Training classifier-based chunkers](#section3_3)
* [Section 4: Recursion in Linguistic Structure](#section4)
    - [Section 4.1 Building Nested Structure](#section4_1)
    - [Section 4.2 Trees](#section4_2)
    - [Section 4.3 Tree traversal](#section4_3)
* [Section 5: Named Entity Recognition](#section5)
* [Section 6: Relation Extraction](#section6)

<a id='section1'></a>
## 1. Information Extraction
We take the approach of converting the **unstructured data** of natural language sentences into structured data (e.g. a table). Then we can reap the benefits of powerful query tools such as SQL. This method of getting meaning from text is called **Information Extraction**.
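
As a rough illustration of that payoff, the sketch below loads a handful of organization/location pairs into an in-memory SQLite table and queries it. The pairs are copied from the relation-extraction output in Section 6; the table name `located_in` and its column names are made up for this example.

```python
import sqlite3

# (organization, location) pairs taken from the Section 6 output
rows = [
    ("WHYY", "Philadelphia"),
    ("Omnicom", "New York"),
    ("DDB Needham", "New York"),
    ("BBDO South", "Atlanta"),
    ("Georgia-Pacific", "Atlanta"),
]

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per extracted "ORG in LOC" relation
conn.execute("CREATE TABLE located_in (org TEXT, loc TEXT)")
conn.executemany("INSERT INTO located_in VALUES (?, ?)", rows)

# Which organizations are located in Atlanta?
query = "SELECT org FROM located_in WHERE loc = ? ORDER BY org"
atlanta_orgs = [org for (org,) in conn.execute(query, ("Atlanta",))]
print(atlanta_orgs)
conn.close()
```

Once the relations are in tabular form, questions like this no longer require any language processing at all.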
<a id="section11"></a>
### 1.1. Information Extraction Architecture
<img src="figures/ie-architecture.png" width=800>

1. Raw text of the document is split into sentences using a sentence segmenter
2. Each sentence is subdivided into words using a tokenizer
3. Each sentence is tagged with part-of-speech tags
4. **Named entity detection** - search for mentions of potentially interesting entities in each sentence
5. **Relation detection** to search for likely relations between different entities in the text.

In [2]:
def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)                      # split into sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split into words
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # add part-of-speech tags
    return sentences

<a id="section2"></a>
## 2. Chunking
[Back](#ch07)

**Chunking** segments and labels multi-token sequences.
<img src="figures/chunk-segmentation.png" width=600>
Segmentation and labelling at both the token and chunk levels.
<a id="section2_1"></a>
### 2.1 Noun Phrase Chunking or NP-Chunking
We search for chunks corresponding to individual noun phrases. 

In order to create an `NP`-chunker, we will first define a **chunk grammar**, consisting of rules that indicate how sentences should be chunked. In this case, we will define a simple grammar with a single regular-expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (`DT`) followed by any number of adjectives (`JJ`) and then a noun (`NN`).

Using this grammar, we create a chunk parser, and test it on our example sentence. The result is a tree.



In [3]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]

grammar = "NP: {<DT>?<JJ>*<NN>}"

cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)

print(result)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


<img src="figures/chunker_example.png">

<a id="section2_3"></a>
### 2.3 Chunking with Regular Expressions
[Back](#ch07)

The rules that make up a chunk grammar use **tag patterns** to describe sequences of tagged words. A tag pattern is a sequence of part-of-speech tags delimited using angle brackets.

To find the chunk structure for a given sentence, the `RegexpParser` chunker begins with a flat structure in which no tokens are chunked. The chunking rules are applied in turn, successively updating the chunk structure. Once all of the rules have been invoked, the resulting chunk structure is returned.

In [5]:
grammar = r"""
    NP: {<DT|PP\$>?<JJ>*<NN>}  # chunk determiner/possessive, adjectives and noun
    {<NNP>+}                   # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

In [6]:
print(cp.parse(sentence))

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))


If a tag pattern matches at overlapping locations, the leftmost match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then only the first two nouns will be chunked.
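
The leftmost-match behaviour can be imitated with Python's `re` module alone. The sketch below is not NLTK's implementation; it assumes a sentence can be flattened into a string of angle-bracketed tags, and the helper name `chunk_spans` is hypothetical.

```python
import re

def chunk_spans(tags, tag_pattern):
    """Return (start, end) token index pairs for non-overlapping,
    leftmost matches of tag_pattern over the tag sequence."""
    tag_string = "".join(f"<{t}>" for t in tags)
    spans = []
    for m in re.finditer(tag_pattern, tag_string):
        # Count opening brackets to convert character offsets to token indices
        start = tag_string[:m.start()].count("<")
        end = start + m.group().count("<")
        spans.append((start, end))
    return spans

# Three consecutive nouns, e.g. money/NN market/NN fund/NN
spans = chunk_spans(["NN", "NN", "NN"], r"<NN><NN>")
print(spans)
```

Only the span covering the first two nouns is returned: once they have been consumed, the single remaining noun cannot satisfy a two-noun pattern.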

<a id="section2_4"></a>
### 2.4  Exploring Text Corpora
[Back](#ch07)


In [12]:
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown

# Only the first 100 sentences are inspected, so slice instead of parsing the whole corpus
for sent in brown.tagged_sents()[:100]:
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK':
            print(subtree)

(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
(CHUNK expected/VBN to/TO approve/VB)
(CHUNK expected/VBN to/TO make/VB)
(CHUNK intends/VBZ to/TO make/VB)
(CHUNK seek/VB to/TO set/VB)
(CHUNK like/VB to/TO see/VB)


In [21]:
def find_chunks(pattern):
    c_type = pattern.split(':')[0]   # chunk label is the text before the colon
    cp = nltk.RegexpParser(pattern)
    brown = nltk.corpus.brown

    # Only the first 100 sentences are inspected, so slice instead of parsing the whole corpus
    for sent in brown.tagged_sents()[:100]:
        tree = cp.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == c_type:
                print(subtree)

In [22]:
find_chunks('NOUNS: {<N.*>{4,}}')

(NOUNS Court/NN-TL Judge/NN-TL Durwood/NP Pye/NP)
(NOUNS Mayor-nominate/NN-TL Ivan/NP Allen/NP Jr./NP)
(NOUNS Georgia's/NP$ automobile/NN title/NN law/NN)
(NOUNS State/NN-TL Welfare/NN-TL Department's/NN$-TL handling/NN)
(NOUNS Fulton/NP-TL Tax/NN-TL Commissioner's/NN$-TL Office/NN-TL)
(NOUNS Mayor/NN-TL William/NP B./NP Hartsfield/NP)
(NOUNS Mrs./NP J./NP M./NP Cheshire/NP)
(NOUNS E./NP Pelham/NP Rd./NN-TL Aj/NN)
(NOUNS
  State/NN-TL
  Party/NN-TL
  Chairman/NN-TL
  James/NP
  W./NP
  Dorsey/NP)
(NOUNS Texas/NP Sen./NN-TL John/NP Tower/NP)
(NOUNS Lt./NN-TL Gov./NN-TL Garland/NP Byrd's/NP$ campaign/NN)
(NOUNS Schley/NP County/NN-TL Rep./NN-TL B./NP D./NP Pelham/NP)
(NOUNS Colquitt/NP-TL Policeman/NN-TL Tom/NP Williams/NP)


<a id="section2_5"></a>
### 2.5 Chinking
[Back](#ch07)

We define a **chink** to be a sequence of tokens that is not included in a chunk. Here `barked/VBD at/IN` is a chink:
```
[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]
```
Chinking is the process of removing a sequence of tokens from a chunk.

In [24]:
grammar = r"""
    NP:
        {<.*>+}          # Chunk everything
        }<VBD|IN>+{     # Chink sequences of VBD and IN
"""
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)

In [25]:
print(cp.parse(sentence))

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
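
The effect of the chink rule can be mimicked in plain Python, which may help clarify what the grammar above does. This is only a sketch under the assumption that everything starts out in one chunk; the helper name `chink` is hypothetical and this is not how `RegexpParser` is implemented.

```python
def chink(tagged_sentence, chink_tags):
    """Split a tagged sentence into chunks by removing every token
    whose tag is in chink_tags."""
    chunks, current = [], []
    for word, tag in tagged_sentence:
        if tag in chink_tags:
            # A chinked token closes the chunk built so far
            if current:
                chunks.append(current)
                current = []
        else:
            current.append((word, tag))
    if current:
        chunks.append(current)
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
chunks = chink(sentence, {"VBD", "IN"})
for chunk in chunks:
    print(chunk)
```

Removing `barked/VBD at/IN` from the middle of the single all-inclusive chunk leaves two chunks, matching the two `NP` subtrees printed above.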


<a id="section2_6"></a>
### 2.6 Representing Chunks: Tags vs Trees
[Back](#ch07)

Chunk structures can be represented using either tags or trees. The most widespread file representation uses **IOB tags**. In this scheme, each token is tagged with one of three special chunk tags, `I` (inside), `O` (outside) or `B` (begin). A token is tagged as `B` if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged `I`, and all other tokens are tagged `O`.
<img src="figures/chunk-tagrep.png" width=700>
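
To make the tagging scheme concrete, here is a small sketch that flattens a chunked sentence into IOB triples. The nested-list input format and the helper name `chunks_to_iob` are assumptions made for this example; NLTK provides its own conversion functions for real chunk trees.

```python
def chunks_to_iob(chunked_sentence):
    """Flatten a chunked sentence into (word, pos, iob_tag) triples.

    Each element is either a (word, pos) tuple for a token outside any
    chunk, or a (label, [(word, pos), ...]) pair for one chunk.
    """
    triples = []
    for item in chunked_sentence:
        if isinstance(item[1], list):
            label, tokens = item
            for k, (word, pos) in enumerate(tokens):
                prefix = "B" if k == 0 else "I"  # first token begins the chunk
                triples.append((word, pos, f"{prefix}-{label}"))
        else:
            word, pos = item
            triples.append((word, pos, "O"))
    return triples

chunked = [
    ("NP", [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN")]),
    ("barked", "VBD"),
    ("at", "IN"),
    ("NP", [("the", "DT"), ("cat", "NN")]),
]
iob = chunks_to_iob(chunked)
for triple in iob:
    print(triple)
```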

<a id="section3"></a>
## 3. Developing and Evaluating Chunkers
[Back](#ch07)
<a id="section3_1"></a>
### 3.1 Reading IOB Format and the CoNLL 2000 Corpus

In [3]:
text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
'''
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

<img src="figures/chunk_tree.png">

In [4]:
from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt')[99])

(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)


In [5]:
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])

(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)


<a id="section3_2"></a>
### 3.2 Simple Evaluation and Baselines
[Back](#ch07)

Now that we can access a chunked corpus, we can evaluate chunkers. We start off by establishing a baseline for the trivial chunk parser `cp` that creates no chunks:

In [6]:
from nltk.corpus import conll2000
cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  43.4%%
    Precision:      0.0%%
    Recall:         0.0%%
    F-Measure:      0.0%%


Next we try a naive regular expression chunker that looks for tags beginning with letters that are characteristic of noun phrase tags (e.g. `CD`, `DT`, and `JJ`):

In [7]:
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     70.6%%
    Recall:        67.8%%
    F-Measure:     69.2%%
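
To see which part-of-speech tags the pattern `<[CDJNP].*>` admits, the expression inside the angle brackets can be tried against a handful of tags with Python's `re` module. This is only an approximation of how NLTK interprets tag patterns, and the list of tags is a small sample chosen for illustration.

```python
import re

# The part of the tag pattern between the angle brackets
tag_re = re.compile(r"[CDJNP].*")

sample_tags = ["CD", "DT", "JJ", "NN", "NNP", "PRP", "PRP$",
               "VBD", "IN", "RB", "CC", "POS", "PDT"]
matching = [t for t in sample_tags if tag_re.fullmatch(t)]
print(matching)
```

Besides the intended noun-phrase tags, the pattern also lets through tags such as `CC` and `POS` simply because of their first letter, which is one reason this chunker is only a rough baseline.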


This approach achieves decent results, but we can improve on it by adopting a data-driven approach, where we use the training corpus to find the chunk tag (`I`, `O` or `B`) that is most likely for each part-of-speech tag. In other words, we can build a chunker using a *unigram tagger*. But rather than trying to determine the correct part-of-speech tag for each word, we are trying to determine the correct chunk tag, given each word's part-of-speech tag.
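
The idea behind such a unigram model can be sketched without NLTK: count how often each part-of-speech tag occurs with each chunk tag, then always predict the most frequent one. The training pairs below come from the example sentence in Section 3.1, restricted to `NP` chunks (so tokens in other chunk types count as `O`); the helper name `train_unigram` is hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram(pairs):
    """Map each POS tag to the chunk tag it most often occurs with."""
    counts = defaultdict(Counter)
    for pos, chunk_tag in pairs:
        counts[pos][chunk_tag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

# (POS tag, chunk tag) pairs for "he accepted the position of vice
# chairman of Carlyle Group, a merchant banking concern."
pairs = [
    ("PRP", "B-NP"), ("VBD", "O"), ("DT", "B-NP"), ("NN", "I-NP"),
    ("IN", "O"), ("NN", "B-NP"), ("NN", "I-NP"), ("IN", "O"),
    ("NNP", "B-NP"), ("NNP", "I-NP"), (",", "O"), ("DT", "B-NP"),
    ("NN", "I-NP"), ("NN", "I-NP"), ("NN", "I-NP"), (".", "O"),
]
model = train_unigram(pairs)
print(model["DT"], model["NN"], model["VBD"])
```

Even on one sentence the model picks up that determiners tend to begin noun phrases and that common nouns tend to continue them.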

We define the `UnigramChunker` class, which uses a unigram tagger to label sentences with chunk tags. Most of the code in this class is simply used to convert back and forth between the chunk tree representation used by NLTK's `ChunkParserI` interface, and the IOB representation used by the embedded tagger. 

In [12]:
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                     for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)
        
    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                    in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [13]:
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.8%%
    F-Measure:     83.2%%
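
The `F-Measure` line in these reports is the harmonic mean of precision and recall (NLTK's chunk scorer weights the two equally by default). A minimal sketch, using the hypothetical helper name `f_measure`, reproduces the figure above from the reported precision and recall:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    if precision + recall == 0:
        return 0.0  # e.g. the baseline chunker that creates no chunks
    return 2 * precision * recall / (precision + recall)

unigram_f = f_measure(79.9, 86.8)
print(round(unigram_f, 1))
```

Because the inputs are the rounded percentages from the report, the result can differ from NLTK's own figure in the last digit.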


The chunker does reasonably well, achieving an overall F-measure score of 83%. Let's take a look at what it has learned by using its unigram tagger to assign a tag to each of the part-of-speech tags that appear in the corpus:

In [14]:
postags = sorted(set(pos for sent in train_sents
                    for (word, pos) in sent.leaves()))
print(unigram_chunker.tagger.tag(postags))

[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'), ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]


Bigram Chunker: the same class, with `nltk.BigramTagger` in place of the unigram tagger.

In [15]:
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                     for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)
        
    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                    in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [17]:
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.3%%
    Precision:     82.3%%
    Recall:        86.8%%
    F-Measure:     84.5%%


<a id="section3_3"></a>
### 3.3 Training Classifier-Based Chunkers
[Back](#ch07)

In [29]:
import os
cwd = os.path.abspath(os.path.curdir)
os.environ["MEGAM"] = os.path.join(cwd, 'MEGAM/megam-64')

In [35]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}

In [19]:
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)
    
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))  # a list, so the result can be reused

In [32]:
class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                        nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)
        
    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

Add a feature for the previous part-of-speech tag. Adding this feature allows the classifier to model interactions between adjacent tags, and results in a chunker that is closely related to the bigram chunker.

In [34]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {'pos': pos, 'prevpos': prevpos}

In [33]:
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.6%%
    Precision:     81.9%%
    Recall:        87.1%%
    F-Measure:     84.4%%


Now we'll try adding a feature for the current word, since we hypothesized that word content should be useful for chunking. 

In [36]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {'pos': pos, 'word': word, 'prevpos': prevpos}

In [37]:
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  94.1%%
    Precision:     82.9%%
    Recall:        87.9%%
    F-Measure:     85.3%%


Finally, we can try extending the feature extractor with a variety of additional features, such as lookahead features, paired features, and a complex contextual feature. This last feature, called `tags-since-dt`, creates a string describing the set of all part-of-speech tags that have been encountered since the most recent determiner, or since the beginning of the sentence if there is no determiner before index `i`.

In [38]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    if i == len(sentence)-1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i+1]
    return {"pos": pos,
             "word": word,
             "prevpos": prevpos,
             "nextpos": nextpos, 
             "prevpos+pos": "%s+%s" % (prevpos, pos),  
             "pos+nextpos": "%s+%s" % (pos, nextpos),
             "tags-since-dt": tags_since_dt(sentence, i)} 

In [39]:
def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))
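
To see what this feature looks like in practice, the function can be applied to the example sentence from Section 2.1. The function is repeated here only so that the snippet is self-contained.

```python
def tags_since_dt(sentence, i):
    # Collect the POS tags seen before index i, resetting at each determiner
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

for i, (word, pos) in enumerate(sentence):
    print(i, word, repr(tags_since_dt(sentence, i)))
```

For `dog` (index 3) the feature is `'JJ'`, and for `cat` (index 7) it is the empty string, because the determiner immediately before it resets the set.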

In [41]:
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  96.0%%
    Precision:     88.8%%
    Recall:        91.1%%
    F-Measure:     89.9%%


<a id="section4"></a>
## 4. Recursion in Linguistic Structure
[Back](#ch07)
<a id="section4_1"></a>
### 4.1 Building Nested Structure with Cascaded Chunkers

In [42]:
grammar = r"""
   NP: {<DT|JJ|NN.*>+}          # chunk sequences of DT, JJ, NN
   PP: {<IN><NP>}               # chunk prepositions followed by NP
   VP: {<VB.*><NP|PP|CLAUSE>+$} # chunk verbs and their arguments
   CLAUSE: {<NP><VP>}           # chunk NP, VP
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
    ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

In [43]:
print(cp.parse(sentence))

(S
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))


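This grammar has trouble with a sentence that has deeper nesting. With a single pass over the rules, the chunker fails to identify the `VP` chunk starting at `saw`; building the parser with `loop=2`, as in the second cell below, makes it run through its patterns twice so that the missing structure is found.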

In [44]:
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
     ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
     ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))

(S
  (NP John/NNP)
  thinks/VBZ
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))


In [45]:
cp = nltk.RegexpParser(grammar, loop=2)
print(cp.parse(sentence))

(S
  (NP John/NNP)
  thinks/VBZ
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))


<a id="section4_2"></a>
### 4.2 Trees
[Back](#ch07)

In NLTK, we create a tree by giving a node label and a list of children:

In [46]:
tree1 = nltk.Tree('NP', ['Alice'])
print(tree1)
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
print(tree2)

(NP Alice)
(NP the rabbit)


We can incorporate these into successively larger trees as follows:

In [47]:
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
print(tree4)

(S (NP Alice) (VP chased (NP the rabbit)))


In [48]:
tree4[1].label()

'VP'

In [49]:
tree4.leaves()

['Alice', 'chased', 'the', 'rabbit']

In [50]:
tree4[1][1][1]

'rabbit'

In [51]:
tree3.draw()

<a id="section4_3"></a>
### 4.3 Tree Traversal
[Back](#ch07)

It is standard to use a recursive function to traverse a tree.

In [52]:
def traverse(t):
    try:
        t.label()
    except AttributeError:
        print(t, end=" ")
    else:
        # t has a label, so it is a Tree: print the label, then recurse into its children
        print('(', t.label(), end=" ")
        for child in t:
            traverse(child)
        print(')', end=" ")

In [58]:
t = nltk.Tree('S', [nltk.Tree('NP', ['Alice']), nltk.Tree('VP', ['chased', nltk.Tree('NP', ['the rabbit'])])])
traverse(t)

( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) ) 

<a id="section5"></a>
## 5. Named Entity Recognition
Named entity recognition (NER) can be broken down into two sub-tasks:
- Identifying the boundaries of the NE
- Identifying its type

In [61]:
sent = nltk.corpus.treebank.tagged_sents()[20]
print(nltk.ne_chunk(sent, binary=True)) # Named entities are tagged as NE

(S
  The/DT
  plant/NN
  ,/,
  which/WDT
  *T*-1/-NONE-
  is/VBZ
  owned/VBN
  *-4/-NONE-
  by/IN
  (NE Hollingsworth/NNP)
  &/CC
  (NE Vose/NNP Co./NNP)
  ,/,
  was/VBD
  under/IN
  contract/NN
  *ICH*-2/-NONE-
  with/IN
  (NE Lorillard/NN)
  */-NONE-
  to/TO
  make/VB
  the/DT
  cigarette/NN
  filters/NNS
  ./.)


In [62]:
print(nltk.ne_chunk(sent)) # Classifier adds category labels such as PERSON, ORGANIZATION and GPE

(S
  The/DT
  plant/NN
  ,/,
  which/WDT
  *T*-1/-NONE-
  is/VBZ
  owned/VBN
  *-4/-NONE-
  by/IN
  (ORGANIZATION Hollingsworth/NNP)
  &/CC
  (PERSON Vose/NNP Co./NNP)
  ,/,
  was/VBD
  under/IN
  contract/NN
  *ICH*-2/-NONE-
  with/IN
  (PERSON Lorillard/NN)
  */-NONE-
  to/TO
  make/VB
  the/DT
  cigarette/NN
  filters/NNS
  ./.)


<a id="section6"></a>
## 6. Relation Extraction
[Back](#ch07)

In [63]:
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
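
The `IN` pattern used above accepts the text between two entities when it contains the word `in`, unless that `in` is followed later by a word ending in `-ing` (the negative lookahead). The sketch below checks the same regular expression against a few filler strings; the first three are taken from the output above, while the last is an invented example of the kind of phrase the lookahead is meant to reject.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

fillers = [
    "in",
    "firm in",
    ", based in",
    "success in supervising the transition of",  # invented counter-example
]
# The pattern is anchored at the start of each filler, hence match()
accepted = [f for f in fillers if IN.match(f)]
print(accepted)
```

The last filler is rejected because `supervising` follows `in`, so a phrase like that would not be reported as an organization-in-location relation.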


The `conll2002` Dutch corpus contains not just named entity annotation but also part-of-speech tags. This allows us to devise patterns that are sensitive to these tags:

In [64]:
from nltk.corpus import conll2002
vnv = """
(
is/V|       # 3rd sing present and
was/V|      # past forms of the verb zijn ('be')
werd/V|     # and also present
wordt/V     # past of worden ('become')
)
.*          # followed by anything
van/Prep    # followed by van ('of')
"""

In [65]:
VAN = re.compile(vnv, re.VERBOSE)
for doc in conll2002.chunked_sents('ned.train'):
    for rel in nltk.sem.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
        print(nltk.sem.clause(rel, relsym="VAN"))

VAN("cornet_d'elzius", 'buitenlandse_handel')
VAN('johan_rottiers', 'kardinaal_van_roey_instituut')
VAN('annie_lennox', 'eurythmics')


Show the actual words between the two NEs, along with their left and right context:

In [67]:
VAN = re.compile(vnv, re.VERBOSE)
for doc in conll2002.chunked_sents('ned.train'):
    for rel in nltk.sem.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
        print(nltk.rtuple(rel, lcon=True, rcon=True))

...'')[PER: "Cornet/V d'Elzius/N"] 'is/V op/Prep dit/Pron ogenblik/N kabinetsadviseur/N van/Prep staatssecretaris/N voor/Prep' [ORG: 'Buitenlandse/N Handel/N'](''...
...'')[PER: 'Johan/N Rottiers/N'] 'is/V informaticacoördinator/N van/Prep het/Art' [ORG: 'Kardinaal/N Van/N Roey/N Instituut/N']('in/Prep'...
...'Door/Prep rugproblemen/N van/Prep zangeres/N')[PER: 'Annie/N Lennox/N'] 'wordt/V het/Art concert/N van/Prep' [ORG: 'Eurythmics/N']('vandaag/Adv in/Prep'...
