### Task 1

Objectives

Define, train and evaluate uni-gram and bi-gram HMM chunkers
- load the conll2000 corpus
- split the corpus into train and test sets (given)
- define a class for unigram chunker (given)
- define a class for bi-gram chunker. The bi-gram chunker should back off to the unigram chunker.
- train a unigram and a bi-gram chunker on the train corpus.
- evaluate and compare both chunkers on the test corpus
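
The backoff idea in the objectives can be illustrated on taggers alone, before any chunker class is written. The sketch below is a minimal illustration, not the assignment solution: the training sequences and the test sequence are hypothetical toy data, and only `nltk.UnigramTagger` and `nltk.BigramTagger` (with its `backoff` argument) are taken from NLTK.

```python
import nltk

# Hypothetical toy training data: each sentence is a list of
# (POS tag, IOB chunk tag) pairs, the same shape the chunkers train on.
train_data = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "B-VP"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBD", "B-VP")],
]

# A unigram tagger maps each POS tag to its most frequent chunk tag.
unigram = nltk.UnigramTagger(train_data)

# A bigram tagger also conditions on the previous chunk tag.
# With backoff, contexts it has no entry for are passed to the unigram tagger.
bigram_with_backoff = nltk.BigramTagger(train_data, backoff=unigram)
bigram_only = nltk.BigramTagger(train_data)

# No training sentence starts with VBD, so the bigram context is unseen.
pos_sequence = ["VBD", "DT", "NN"]

tags_with_backoff = bigram_with_backoff.tag(pos_sequence)
tags_bigram_only = bigram_only.tag(pos_sequence)

print(tags_with_backoff)
print(tags_bigram_only)
```

Without backoff, an unseen context yields `None`, and that `None` then becomes part of the next context, so the rest of the sequence tends to stay untagged as well.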

In [2]:
# Import section
import nltk
from nltk.corpus import conll2000

In [3]:
# Class for unigram chunker
# Trained on a corpus of chunked sentences (POS-tagged, with IOB chunk tags)
# Parses a POS-tagged sentence with the parse function
# Given in class
class unigram_chunker(nltk.ChunkParserI):
    
    # Initialize and train the chunker
    def __init__(self, train_sents):
        # Take the pos and the iob tags of the corpus
        # Ignore the actual words, we map from pos tag to iob tag
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        # Train a unigram tagger on the train data
        self.tagger = nltk.UnigramTagger(train_data)
    
    # Parse function
    # Takes a single sentence as a list of (word, pos) pairs
    def parse(self,sentence):
        # Take the pos tags
        pos_tags = [pos for (word,pos) in sentence]
        # Use the tagger to tag the modified corpus
        tagged_pos_tags = self.tagger.tag(pos_tags)
        # Take the chunks from the tagged corpus
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        # Convert the output
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag) in zip(sentence, chunktags)]
        
        # Return the tagged sentence
        return nltk.chunk.conlltags2tree(conlltags)             
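
To make the data flow in the class above easier to follow, here is a small sketch of the two conversion functions it relies on, `nltk.chunk.tree2conlltags` and `nltk.chunk.conlltags2tree`. The sentence is a hypothetical hand-built tree, assumed to have the same shape as the trees returned by `conll2000.chunked_sents()`, so no corpus download is needed.

```python
import nltk
from nltk.tree import Tree

# Hypothetical chunked sentence: chunks are subtrees, leaves are (word, pos) pairs.
sent_tree = Tree("S", [
    Tree("NP", [("the", "DT"), ("cat", "NN")]),
    Tree("VP", [("sat", "VBD")]),
    (".", "."),
])

# Flatten the tree into (word, pos, iob) triples.
conlltags = nltk.chunk.tree2conlltags(sent_tree)
print(conlltags)

# The chunker trains on (pos, iob) pairs only; the words are dropped.
pos_to_iob = [(pos, iob) for _word, pos, iob in conlltags]
print(pos_to_iob)

# conlltags2tree is the inverse step used at the end of parse().
rebuilt = nltk.chunk.conlltags2tree(conlltags)
print(rebuilt)
```

The first token of each chunk gets a `B-` tag, later tokens of the same chunk get `I-`, and tokens outside any chunk get `O`.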

In [5]:
# Class for bigram chunker
# Trained on a corpus of chunked sentences (POS-tagged, with IOB chunk tags)
# Parses a POS-tagged sentence with the parse function
class bigram_chunker(nltk.ChunkParserI):
    
    # Initialize and train the chunker
    def __init__(self, train_sents):
        # YOUR CODE HERE
        pass  # placeholder; __init__ must not return a value
        
    # Parse function
    # Takes a single sentence as a list of (word, pos) pairs
    def parse(self,sentence):
        # YOUR CODE HERE
        return()


In [11]:
# Dummy function for exercise 1
def ex1():
    # Get the corpus
    train = conll2000.chunked_sents("train.txt")
    test = conll2000.chunked_sents("test.txt")
    
    # Train the two chunkers:
    # Train unigram chunker (given)
    uni_chunker = unigram_chunker(train)
    # Train the bigram chunker HERE
    
    # Evaluate and print the results:
    print ("The performance of unigram chunker is: {}".format(uni_chunker.evaluate(test)))    
    # Evaluate bigram chunker HERE


In [12]:
ex1()

  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  print ("The performance of unigram chunker is: {}".format(uni_chunker.evaluate(test)))


The performance of unigram chunker is: ChunkParse score:
    IOB Accuracy:  86.5%%
    Precision:     74.3%%
    Recall:        86.4%%
    F-Measure:     79.9%%


### Task 2

Objectives

Create and use a simple context free grammar for syntactic parsing
- extend the given CFG
- load the grammar in an nltk.RecursiveDescentParser
- use the parser to parse the given corpus
- for each sentence, print the number of possible parses (correct answers below)

Correct number of parses for each sentence:

- “a young woman walks in the park” <- 1 parse
- “two young men smile” <- 1 parse
- “a young woman sees two men” <- 1 parse
- “sees two men a young woman” <- 0 parses
- “a young woman sees two old men in the park with a telescope” <- AT LEAST 3 parses
- “a young woman two old men in the park with a telescope sees” <- 0 parses
- “two angry men chase a woman with a telescope” <- 2 parses
- “a woman I know owns a telescope” <- 1 parse
- “a woman I know a telescope” <- 0 parses

In [13]:
# Dummy function for exercise 2
def ex2():
    """Function for exercise 2"""
    # corpus (given)
    corpus = [
        ['a', 'young', 'woman', 'walks', 'in', 'the', 'park'],
        ['two', 'young', 'men', 'smile'],
        ['a', 'young', 'woman', 'sees', 'two', 'men'],
        ['sees', 'two', 'men', 'a', 'young', 'woman'],
        ['a', 'young', 'woman', 'sees', 'two', 'old', 'men', 'in', 'the', 'park', 'with', 'a', 'telescope'],
        ['a', 'young', 'woman', 'two', 'old', 'men', 'in', 'the', 'park', 'with', 'a', 'telescope', 'sees'],
        ['two', 'angry', 'men', 'chase', 'a', 'woman', 'with', 'a', 'telescope'],
        ['a', 'woman', 'I', 'know', 'owns', 'a', 'telescope'],
        ['a', 'woman', 'I', 'know', 'a', 'telescope'],
    ]
    
    # Grammar (in a string format)
    grammar_string = """
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate" | "walked"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "man" | "dog" | "cat" | "telescope" | "park"
    P -> "in" | "on" | "by" | "with"
    """
    
    # Grammar (in nltk CFG format)
    grammar = nltk.CFG.fromstring(grammar_string)
    
    # Parse the corpus, 
    # count the number of parses for each sentence,
    # and print the sentence and the number of parses
    
    # YOUR CODE HERE
    

In [14]:
# Example of using CFG

# A simple CFG
grammar1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)

# Test sentence
sent = "Mary saw Bob".split()

# Parse the sentence using the grammar
rd_parser = nltk.RecursiveDescentParser(grammar1)

# Print all possible trees
for tree in rd_parser.parse(sent):
    print(tree)

(S (NP Mary) (VP (V saw) (NP Bob)))
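
Building on the example above, the number of parses for a sentence can be obtained by exhausting the parser's generator. The sketch below uses the same toy grammar; the helper `count_parses` and the test sentences are hypothetical names and data introduced for illustration. Returning `None` for uncovered words is an assumption of this sketch, made to distinguish "grammar does not know a word" from "zero parses".

```python
import nltk

# Same toy grammar as in the example above.
toy_grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)

parser = nltk.RecursiveDescentParser(toy_grammar)


def count_parses(parser, tokens):
    """Hypothetical helper: number of parse trees, or None if a word is not covered."""
    try:
        # parse() yields trees lazily; list() collects all of them.
        return len(list(parser.parse(tokens)))
    except ValueError:
        # NLTK raises ValueError when the grammar lacks a rule for some input word.
        return None


for text in ["Mary saw Bob", "Mary saw a man in the park", "Mary saw a young man"]:
    print(text, "->", count_parses(parser, text.split()))
```

The second sentence is ambiguous because the prepositional phrase can attach either to the verb phrase (`VP -> V NP PP`) or to the noun phrase (`NP -> Det N PP`). The third contains "young", which the unextended grammar does not cover.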


### Optional Task 3

Experiment with a bottom-up (shift-reduce) parser:

nltk.app.srparser()

Try to get multiple correct parses using the given sentence and grammar
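
As a non-interactive complement to the demo app, the sketch below runs `nltk.ShiftReduceParser` on the toy grammar from Task 2 and compares it with the recursive descent parser. The test sentences are hypothetical. The programmatic shift-reduce parser applies reductions greedily without backtracking, so it returns at most one parse and can miss parses that exist; in the interactive app you choose the shift and reduce steps yourself, which is how different parses can be reached.

```python
import nltk

# Toy grammar from Task 2.
toy_grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)

sr_parser = nltk.ShiftReduceParser(toy_grammar)      # bottom-up, greedy
rd_parser = nltk.RecursiveDescentParser(toy_grammar)  # top-down, exhaustive

simple = "Mary saw Bob".split()
ambiguous = "Mary saw a man in the park".split()

# Collect all trees each parser yields.
sr_simple = list(sr_parser.parse(simple))
sr_ambiguous = list(sr_parser.parse(ambiguous))
rd_ambiguous = list(rd_parser.parse(ambiguous))

print("shift-reduce, simple:", len(sr_simple))
print("shift-reduce, ambiguous:", len(sr_ambiguous))
print("recursive descent, ambiguous:", len(rd_ambiguous))
```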