# FLIP(01):  Advanced Data Science
**(Module 03: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au)

---


# Session 06 - Extracting Information from Text
### Information Extraction

## Information Extraction Architecture

A simple information extraction system works as follows. First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity recognition. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation recognition to search for likely relations between different entities in the text.

To perform the first three tasks, we can define a function that simply connects together NLTK’s default sentence segmenter, word tokenizer, and part-of-speech tagger:

In [None]:
import nltk

# Chain NLTK’s default sentence segmenter, word tokenizer, and part-of-speech tagger
def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)                      # split the text into sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split each sentence into words
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # tag each word with its part of speech
    return sentences

# Chunking

## Noun Phrase Chunking

One of the most useful sources of information for NP-chunking is part-of-speech tags. This is one of the motivations for performing part-of-speech tagging in our information extraction system. In order to create an NP-chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked. In this case, we will define a simple grammar with a single regular expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Using this grammar, we create a chunk parser and test it on our example sentence. The result is a tree, which we can either print or display graphically.

In [None]:
# Example of a simple regular expression–based NP chunker
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

In [None]:
grammar = "NP: {<DT>?<JJ>*<NN>}"

In [None]:
import nltk
cp = nltk.RegexpParser(grammar)

In [None]:
result = cp.parse(sentence)

In [None]:
print(result)

In [None]:
result.draw()
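
To make the rule above more concrete, here is a rough standard-library sketch of the idea behind a tag pattern. It is not NLTK's implementation: the helper name `np_chunk_spans` is hypothetical, and the sketch simply assumes that the tags of a sentence can be written as one string such as `<DT><JJ><NN>` and searched with an ordinary regular expression.

```python
import re

def np_chunk_spans(tagged_sentence):
    """Return (start, end) token spans that match DT? JJ* NN (hypothetical helper)."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><JJ><NN>..."
    tag_string = "".join("<%s>" % tag for _, tag in tagged_sentence)
    # Optional determiner, any number of adjectives, then a noun
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    spans = []
    for m in pattern.finditer(tag_string):
        # Every tag starts with "<", so counting "<" converts
        # character offsets back into token indices
        start = tag_string.count("<", 0, m.start())
        end = start + m.group().count("<")
        spans.append((start, end))
    return spans

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

spans = np_chunk_spans(sentence)
chunks = [[word for word, _ in sentence[s:e]] for s, e in spans]
print(chunks)
```

The two spans found this way correspond to the two NP subtrees in the printed result; the verb and the preposition stay outside any chunk.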

## Chunking with Regular Expressions

To find the chunk structure for a given sentence, the RegexpParser chunker begins with a flat structure in which no tokens are chunked. The chunking rules are applied in turn, successively updating the chunk structure. Once all of the rules have been invoked, the resulting chunk structure is returned.

In [None]:
# Chunking with Regular Expressions
grammar = r"""
    NP: {<DT|PP\$>?<JJ>*<NN>} # chunk determiner/possessive, adjectives and nouns
        {<NNP>+}              # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), 
                    ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

In [None]:
print(cp.parse(sentence))

In [None]:
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]

In [None]:
grammar = "NP: {<NN><NN>} # Chunk two consecutive nouns"

In [None]:
cp = nltk.RegexpParser(grammar)

In [None]:
print(cp.parse(nouns))

## Exploring Text Corpora

In [None]:
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')

In [None]:
brown = nltk.corpus.brown

In [None]:
# Chinking
grammar = r"""
    NP:
    {<.*>+} # Chunk everything
    }<VBD|IN>+{ # Chink sequences of VBD and IN
    """

In [None]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
    ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

In [None]:
cp = nltk.RegexpParser(grammar)

In [None]:
print(cp.parse(sentence))

# Developing and Evaluating Chunkers

## Reading IOB Format and the CoNLL-2000 Chunking Corpus

The conversion function nltk.chunk.conllstr2tree() builds a tree representation from a multiline string in IOB format, where each line holds one token, its part-of-speech tag, and its chunk tag. Moreover, it permits us to choose any subset of the three chunk types (NP, VP, and PP) to use, here just for NP chunks:

In [None]:
text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
'''

In [None]:
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
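
As a complement to the tree view, the following plain-Python sketch shows how the B/I/O tags themselves encode chunk boundaries. The function name `iob_to_chunks` is an assumption made for illustration and is not part of NLTK; it treats `B-` as the start of a chunk, `I-` as a continuation of the open chunk with the same label, and `O` as outside any chunk.

```python
def iob_to_chunks(lines):
    """Group 'word POS chunktag' lines into (label, phrase) pairs (illustrative helper)."""
    chunks = []
    current = None  # (label, [words]) for the chunk being built, if any
    for line in lines:
        word, pos, tag = line.split()
        if tag == "O":
            # Outside any chunk: close whatever chunk was open
            if current:
                chunks.append(current)
            current = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "B" or current is None or current[0] != label:
            # A B- tag (or a stray I- tag) starts a new chunk
            if current:
                chunks.append(current)
            current = (label, [word])
        else:
            # An I- tag continues the open chunk of the same type
            current[1].append(word)
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

sample = [
    "he PRP B-NP",
    "accepted VBD B-VP",
    "the DT B-NP",
    "position NN I-NP",
    "of IN B-PP",
    "vice NN B-NP",
    "chairman NN I-NP",
]
iob_chunks = iob_to_chunks(sample)
print(iob_chunks)
```

Keeping only the pairs labeled NP gives the same grouping that chunk_types=['NP'] selects in the cell above.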

## Simple Evaluation and Baselines

Now that we can access a chunked corpus, we can evaluate chunkers. We start off by establishing a baseline for the trivial chunk parser cp that creates no chunks:

In [None]:
from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt')[99])

In [None]:
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])

In [None]:
# Simple Evaluation and Baselines
from nltk.corpus import conll2000

In [None]:
cp = nltk.RegexpParser("")

In [None]:
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])

In [None]:
print(cp.evaluate(test_sents))
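
To help interpret the chunk-level figures in such a report, here is a minimal sketch of precision, recall, and F-measure computed over sets of chunk spans. The function `chunk_scores` and the toy spans are assumptions for illustration only; NLTK's own scoring may differ in detail, and the per-token IOB accuracy it also reports is a separate figure.

```python
def chunk_scores(gold_spans, guessed_spans):
    """Precision, recall and F-measure over exact-match chunk spans (illustrative)."""
    gold, guessed = set(gold_spans), set(guessed_spans)
    correct = len(gold & guessed)
    # Precision: what fraction of the guessed chunks are correct
    precision = correct / len(guessed) if guessed else 0.0
    # Recall: what fraction of the gold chunks were found
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy gold standard: two NP chunks given as (start, end) token spans
gold = [(0, 4), (6, 8)]

baseline = chunk_scores(gold, [])                # a chunker that creates no chunks
partial = chunk_scores(gold, [(0, 4), (5, 8)])   # one exact match, one wrong boundary
print(baseline)
print(partial)
```

A chunker that proposes no chunks scores zero on all three measures, which is why the empty grammar is a useful lower bound.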

In [None]:
grammar = r"NP: {<[CDJNP].*>+}"

In [None]:
cp = nltk.RegexpParser(grammar)

In [None]:
print(cp.evaluate(test_sents))

In [None]:
# Noun phrase chunking with a unigram tagger.
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): 
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)
    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)    

In [None]:
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])

In [None]:
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

In [None]:
unigram_chunker = UnigramChunker(train_sents)

In [None]:
print(unigram_chunker.evaluate(test_sents))

In [None]:
postags = sorted(set(pos for sent in train_sents
                     for (word,pos) in sent.leaves()))

In [None]:
print(unigram_chunker.tagger.tag(postags))

## Training Classifier-Based Chunkers

Both the regular expression–based chunkers and the n-gram chunkers decide what chunks to create entirely based on part-of-speech tags.

In [None]:
# Noun phrase chunking with a consecutive classifier.
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
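        # Note: algorithm='megam' requires the external MEGAM binary to be installed
        # and configured; NLTK's built-in 'IIS' or 'GIS' trainers are slower alternatives.
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)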
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))

# Wrapper that turns the tagger above into a chunker (defined at module level)
class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w,t),c) for (w,t,c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)
    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

In [None]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}

In [None]:
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

In [None]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}

In [None]:
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

In [None]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    if i == len(sentence)-1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i+1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "tags-since-dt": tags_since_dt(sentence, i)}

In [None]:
def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

In [None]:
chunker = ConsecutiveNPChunker(train_sents)

In [None]:
print(chunker.evaluate(test_sents))

# Recursion in Linguistic Structure

## Building Nested Structure with Cascaded Chunkers
So far, our chunk structures have been relatively flat. Trees consist of tagged tokens, optionally grouped under a chunk node such as NP. However, it is possible to build chunk structures of arbitrary depth, simply by creating a multistage chunk grammar containing recursive rules.

In [None]:
# A chunker that handles NP, PP, VP, and S.
grammar = r"""
    NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN
    PP: {<IN><NP>} # Chunk prepositions followed by NP
    VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
    CLAUSE: {<NP><VP>} # Chunk NP, VP
"""

In [None]:
cp = nltk.RegexpParser(grammar)

In [None]:
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
    ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

In [None]:
print(cp.parse(sentence))

In [None]:
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
            ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]

In [None]:
print(cp.parse(sentence))

In [None]:
cp = nltk.RegexpParser(grammar, loop=2)

In [None]:
print(cp.parse(sentence))

## Trees

A tree is a set of connected labeled nodes, each reachable by a unique path from a distinguished root node.

In [None]:
# In NLTK, we create a tree by giving a node label and a list of children:
tree1 = nltk.Tree('NP', ['Alice'])

In [None]:
print(tree1)

In [None]:
tree2 = nltk.Tree('NP', ['the', 'rabbit'])

In [None]:
print(tree2)

In [None]:
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
print(tree4)

In [None]:
print(tree4[1])

In [None]:
tree4.leaves()

In [None]:
# Tree Traversal
def traverse(t):
    try:
        t.label()
    except AttributeError:
        # t is a leaf (a plain string), so just print it
        print(t, end=" ")
    else:
        # Now we know that t.label() is defined, so t is a subtree
        print('(', t.label(), end=" ")
        for child in t:
            traverse(child)
        print(')', end=" ")

t = nltk.Tree.fromstring('(S (NP Alice) (VP chased (NP the rabbit)))')

In [None]:
traverse(t)

# Named Entity Recognition
NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). If we set the parameter binary=True, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

In [None]:
sent = nltk.corpus.treebank.tagged_sents()[22]

In [None]:
print(nltk.ne_chunk(sent, binary=True))

In [None]:
print(nltk.ne_chunk(sent))

# Relation Extraction
Once named entities have been identified in a text, we then want to extract the relations that exist between them. As indicated earlier, we will typically be looking for relations between specified types of named entity. One way of approaching this task is to initially look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y. We can then use regular expressions to pull out just those instances of α that express the relation that we are looking for. The following example searches for strings that contain the word "in". The special regular expression (?!\b.+ing\b) is a negative lookahead assertion that allows us to disregard strings such as "success in supervising the transition of", where "in" is followed by a gerund.

In [None]:
import re
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

In [None]:
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
                                     corpus='ieer', pattern = IN):
        print(nltk.sem.rtuple(rel))  # NLTK 3 name for the older show_raw_rtuple()
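
The effect of the negative lookahead can be checked in isolation with the standard `re` module. This is a small sketch: the filler strings below are made-up examples of the text α between two entities, not output from the IEER corpus, and matching from the start of the string with `match()` is an assumption about how such a pattern would typically be applied.

```python
import re

# Same pattern as above: the word "in", not followed later by a word ending in "ing"
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Hypothetical intervening strings between an organization and a location
fillers = [
    ", based in",
    "'s office in",
    "success in supervising the transition of",
]

# Record which fillers the pattern accepts
accepted = {filler: bool(IN.match(filler)) for filler in fillers}
for filler, ok in accepted.items():
    print(repr(filler), "->", ok)
```

The first two fillers are accepted, while the third is rejected because "in" is followed by the gerund "supervising".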

In [None]:
from nltk.corpus import conll2002
vnv = """
    (
    is/V| # 3rd sing present and
    was/V| # past forms of the verb zijn ('be')
    werd/V| # and also present
    wordt/V # past of worden ('become')
    )
    .* # followed by anything
    van/Prep # followed by van ('of')
    """

In [None]:
VAN = re.compile(vnv, re.VERBOSE)

In [None]:
for doc in conll2002.chunked_sents('ned.train'):
    for r in nltk.sem.extract_rels('PER', 'ORG', doc,
                                   corpus='conll2002', pattern=VAN):
        print(nltk.sem.clause(r, relsym="VAN"))  # NLTK 3 name for the older show_clause()

In [None]:
# small test:
# Replace the last line with print(nltk.sem.rtuple(r, lcon=True, rcon=True)). This will show you the actual words that intervene
# between the two NEs and also their left and right context, within a default 10-word window. With the help of a Dutch dictionary,
# you might be able to figure out why the result VAN('annie_lennox', 'eurythmics') is a false hit.