## Chapter 6

### 1. Using Naive Bayes classifier described in this chapter, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. 

In [1]:
import nltk
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
from nltk.corpus import names
import random
from nltk.classify import NaiveBayesClassifier
from nltk.classify import DecisionTreeClassifier
from nltk.classify import MaxentClassifier
from nltk.classify import apply_features
from nltk.classify import accuracy
from nltk import ConditionalFreqDist

In [2]:
# Label each name; a separate variable avoids shadowing the imported `names` corpus reader
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# Split the data into test, dev-test and training sets:
test, devtest, training = labeled_names[:500], labeled_names[500:1000], labeled_names[1000:]
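
Because the shuffle above is unseeded, the split, and therefore every accuracy reported below, will differ from run to run. A minimal sketch of a reproducible three-way split follows; the seed value, the helper name `split_names`, and the stand-in data are all hypothetical and not part of the original notebook.

```python
import random

def split_names(labeled, seed=2020, n_test=500, n_devtest=500):
    """Shuffle a copy of `labeled` with a fixed seed and split it three ways."""
    rng = random.Random(seed)      # private generator; leaves the global RNG untouched
    shuffled = list(labeled)       # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    training = shuffled[n_test + n_devtest:]
    return test, devtest, training

# Stand-in data in place of the Names Corpus (hypothetical entries).
labeled = [("name{}".format(i), "female" if i % 2 else "male") for i in range(7900)]
test, devtest, training = split_names(labeled)
print(len(test), len(devtest), len(training))
```

Calling `split_names` again with the same seed returns the same three lists, so results can be compared across runs.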

In [3]:
# Build the example name gender classifier as we did in class:
def gender_features1(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [4]:
# Verify one example name:
gender_features1('Mike')

{'first_letter': 'm',
 'last_letter': 'e',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 1,
 'has(e)': True,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 0,
 'has(h)': False,
 'count(i)': 1,
 'has(i)': True,
 'count(j)': 0,
 'has(j)': False,
 'count(k)': 1,
 'has(k)': True,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 1,
 'has(m)': True,
 'count(n)': 0,
 'has(n)': False,
 'count(o)': 0,
 'has(o)': False,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [5]:
# Check the classifier for training and dev test for gender_features_1:
train_set = [(gender_features1(n), g) for (n,g) in training]
devtest_set = [(gender_features1(n), g) for (n,g) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Accuracy of Devtest for gender_features1:", round((nltk.classify.accuracy(classifier, devtest_set)*100),2),"%")

Accuracy of Devtest for gender_features1: 74.0 %


In [6]:
# Error analysis: list the dev-test names that were classified wrongly.
# Note: this relies on the global `classifier` and `devtest` defined above.
def error_analysis(gender_features):
    errors = []
    for (name, tag) in devtest:
        guess = classifier.classify(gender_features(name))
        if guess != tag:
            errors.append((tag, guess, name))
    print('no. of errors: ', len(errors))

    for (tag, guess, name) in sorted(errors):
        print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

In [7]:
# Error Analysis for first combination of features from gender_features_1:
error_analysis(gender_features1)

no. of errors:  130
correct=female   guess=male     name=Ardis                         
correct=female   guess=male     name=Astrix                        
correct=female   guess=male     name=Bel                           
correct=female   guess=male     name=Bess                          
correct=female   guess=male     name=Bev                           
correct=female   guess=male     name=Brunhilde                     
correct=female   guess=male     name=Cathryn                       
correct=female   guess=male     name=Coreen                        
correct=female   guess=male     name=Correy                        
correct=female   guess=male     name=Darb                          
correct=female   guess=male     name=Dode                          
correct=female   guess=male     name=Donny                         
correct=female   guess=male     name=Elsbeth                       
correct=female   guess=male     name=Farah                         
correct=female   guess=male 

In [8]:
# Final performance on the test set for gender_features1:
test_set = [(gender_features1(n), g) for (n,g) in test]
print("Accuracy of Test set for gender_features1:", round((nltk.classify.accuracy(classifier, test_set)*100),2),"%")

Accuracy of Test set for gender_features1: 77.0 %


In [9]:
# Taking an incremental approach, I now add more features.
# gender_features2 adds features for the two- and three-letter suffixes:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    return features

In [10]:
# Check the classifier for training and dev test for gender_features_2:
train_set = [(gender_features2(n), g) for (n,g) in training]
devtest_set = [(gender_features2(n), g) for (n,g) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Accuracy of Devtest for gender_features2:", round((nltk.classify.accuracy(classifier, devtest_set)*100),2),"%")

Accuracy of Devtest for gender_features2: 79.8 %


In [11]:
# Error analysis for gender_features2:
error_analysis(gender_features2)

no. of errors:  101
correct=female   guess=male     name=Bamby                         
correct=female   guess=male     name=Bess                          
correct=female   guess=male     name=Bev                           
correct=female   guess=male     name=Constance                     
correct=female   guess=male     name=Correy                        
correct=female   guess=male     name=Darb                          
correct=female   guess=male     name=Darell                        
correct=female   guess=male     name=Debby                         
correct=female   guess=male     name=Devan                         
correct=female   guess=male     name=Devin                         
correct=female   guess=male     name=Dode                          
correct=female   guess=male     name=Donny                         
correct=female   guess=male     name=Eran                          
correct=female   guess=male     name=Fancy                         
correct=female   guess=male 

In [12]:
# Final performance on the test set for gender_features2:
test_set = [(gender_features2(n), g) for (n,g) in test]
print("Accuracy of Test set for gender_features2:", round((nltk.classify.accuracy(classifier, test_set)*100),2),"%")

Accuracy of Test set for gender_features2: 81.4 %


In [13]:
# Taking an incremental approach, I now add more features.
# gender_features3 adds a feature for the three-letter prefix:
def gender_features3(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    features["prefix3"] = name[:3].lower()
    return features

In [14]:
# Check the classifier for training and dev test for gender_features_3:
train_set = [(gender_features3(n), g) for (n,g) in training]
devtest_set = [(gender_features3(n), g) for (n,g) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Accuracy of Devtest for gender_features3:", round((nltk.classify.accuracy(classifier, devtest_set)*100),2),"%")

Accuracy of Devtest for gender_features3: 81.4 %


In [15]:
# Error analysis for gender_features3:
error_analysis(gender_features3)

no. of errors:  93
correct=female   guess=male     name=Bess                          
correct=female   guess=male     name=Bev                           
correct=female   guess=male     name=Constance                     
correct=female   guess=male     name=Correy                        
correct=female   guess=male     name=Darb                          
correct=female   guess=male     name=Darell                        
correct=female   guess=male     name=Devan                         
correct=female   guess=male     name=Devin                         
correct=female   guess=male     name=Eran                          
correct=female   guess=male     name=Florry                        
correct=female   guess=male     name=Fran                          
correct=female   guess=male     name=Gerry                         
correct=female   guess=male     name=Glad                          
correct=female   guess=male     name=Gretchen                      
correct=female   guess=male  

In [16]:
# Final performance of the test set for gender_features_3:
test_set = [(gender_features3(n), g) for (n,g) in test]
print("Accuracy of Test set for gender_features3:", round((nltk.classify.accuracy(classifier, test_set)*100),2),"%")

Accuracy of Test set for gender_features3: 82.6 %


In [17]:
# Taking an incremental approach, I now add more features.
# gender_features4 adds the first and last two letters, a double-letter count, and vowel features:
def gender_features4(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["first_two_letters"] = name[:2].lower()
    features["last_two_letters"] = name[-2:].lower()
    features["double_letters"] = (sum([1 for ch in range(len(name) - 1) if name[ch] == name[ch + 1]]))
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    features["prefix3"] = name[:3].lower()
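    # Index of the first vowel; -1 if the name has none (the original list indexing would raise IndexError)
    features["first_letter_vowel"] = next((i for i, ch in enumerate(name) if ch in 'AEIOUaeiouy'), -1)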
    features["num_vowels"] = len([letter for letter in name if letter in 'AEIOUaeiouy'])
    return features

In [18]:
# Check the classifier for training and dev test for gender_features_4:
train_set = [(gender_features4(n), g) for (n,g) in training]
devtest_set = [(gender_features4(n), g) for (n,g) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Accuracy of Devtest for gender_features4:", round((nltk.classify.accuracy(classifier, devtest_set)*100),2),"%")

Accuracy of Devtest for gender_features4: 82.6 %


In [19]:
# Error analysis for gender_features4:
error_analysis(gender_features4)

no. of errors:  87
correct=female   guess=male     name=Bamby                         
correct=female   guess=male     name=Bess                          
correct=female   guess=male     name=Bev                           
correct=female   guess=male     name=Constance                     
correct=female   guess=male     name=Correy                        
correct=female   guess=male     name=Darb                          
correct=female   guess=male     name=Darell                        
correct=female   guess=male     name=Del                           
correct=female   guess=male     name=Devan                         
correct=female   guess=male     name=Devin                         
correct=female   guess=male     name=Eran                          
correct=female   guess=male     name=Florry                        
correct=female   guess=male     name=Fran                          
correct=female   guess=male     name=Gerry                         
correct=female   guess=male  

In [20]:
# Final performance of the test set for gender_features_4:
test_set = [(gender_features4(n), g) for (n,g) in test]
print("Accuracy of Test set for gender_features4:", round((nltk.classify.accuracy(classifier, test_set)*100),2),"%")

Accuracy of Test set for gender_features4: 83.2 %


After four rounds of feature extraction and accuracy testing, I come to the following conclusion:

With the final feature set, accuracy is slightly higher on the test set (83.2%) than on the dev-test set (82.6%). I built the feature extractors incrementally, and dev-test accuracy rose with each addition: 74.0%, 79.8%, 81.4%, and 82.6%, with the error count falling from 130 to 87. Many of the remaining errors are genuinely ambiguous names (for example Devin, Fran, or Gerry) that a human reader could also get wrong, so the residual error is largely due to the lack of any spelling-based feature that reliably separates such names. More features could be tried to tune the classifier further, but this is a good overall result, and the exercise shows how to build and evaluate features step by step.

----

### 2. Using the movie review document classifier discussed in Chapter 6, Section 1.3 (constructing a list of the 2500 most frequent words as features and using the first 150 documents as the test dataset), generate a list of the 10 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?

In [21]:
from nltk.corpus import movie_reviews

In [22]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

In [23]:
random.seed(2020)
random.shuffle(documents)

In [24]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

In [25]:
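# Note: in NLTK 3, FreqDist is a Counter, so list(all_words) lists words in first-seen
# order rather than by frequency. This slice therefore takes the first 2500 distinct
# words encountered, not the 2500 most frequent ones; all_words.most_common(2500)
# would give the frequency-ranked list the question asks for.
word_features = list(all_words)[:2500]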
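
In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so iterating over it yields words in first-seen order rather than by frequency, which is why the list displayed next begins with the opening words of one review. The sketch below uses a plain `Counter` and a made-up token list to show the difference; `most_common(n)` is the call that returns the n most frequent items.

```python
from collections import Counter

# Made-up token stream standing in for movie_reviews.words().
tokens = ["plot", "the", "film", "the", "film", "the", "film", "plot", "the"]
counts = Counter(tokens)

first_seen = list(counts)[:2]                          # insertion (first-seen) order
most_frequent = [w for w, _ in counts.most_common(2)]  # frequency order

print(first_seen)
print(most_frequent)
```

The same pattern applied to the notebook's data would be `[w for w, _ in all_words.most_common(2500)]`.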

In [26]:
word_features

['plot',
 ':',
 'two',
 'teen',
 'couples',
 'go',
 'to',
 'a',
 'church',
 'party',
 ',',
 'drink',
 'and',
 'then',
 'drive',
 '.',
 'they',
 'get',
 'into',
 'an',
 'accident',
 'one',
 'of',
 'the',
 'guys',
 'dies',
 'but',
 'his',
 'girlfriend',
 'continues',
 'see',
 'him',
 'in',
 'her',
 'life',
 'has',
 'nightmares',
 'what',
 "'",
 's',
 'deal',
 '?',
 'watch',
 'movie',
 '"',
 'sorta',
 'find',
 'out',
 'critique',
 'mind',
 '-',
 'fuck',
 'for',
 'generation',
 'that',
 'touches',
 'on',
 'very',
 'cool',
 'idea',
 'presents',
 'it',
 'bad',
 'package',
 'which',
 'is',
 'makes',
 'this',
 'review',
 'even',
 'harder',
 'write',
 'since',
 'i',
 'generally',
 'applaud',
 'films',
 'attempt',
 'break',
 'mold',
 'mess',
 'with',
 'your',
 'head',
 'such',
 '(',
 'lost',
 'highway',
 '&',
 'memento',
 ')',
 'there',
 'are',
 'good',
 'ways',
 'making',
 'all',
 'types',
 'these',
 'folks',
 'just',
 'didn',
 't',
 'snag',
 'correctly',
 'seem',
 'have',
 'taken',
 'pretty',


In [27]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [28]:
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[150:], featuresets[:150]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [29]:
print(nltk.classify.accuracy(classifier, test_set))

0.8333333333333334


In [30]:
classifier.show_most_informative_features(10)

Most Informative Features
     contains(atrocious) = True              neg : pos    =     11.0 : 1.0
        contains(turkey) = True              neg : pos    =     10.6 : 1.0
       contains(frances) = True              pos : neg    =      9.0 : 1.0
        contains(annual) = True              pos : neg    =      9.0 : 1.0
      contains(bothered) = True              neg : pos    =      8.3 : 1.0
 contains(unimaginative) = True              neg : pos    =      8.3 : 1.0
        contains(stinks) = True              neg : pos    =      7.7 : 1.0
    contains(schumacher) = True              neg : pos    =      7.0 : 1.0
        contains(shoddy) = True              neg : pos    =      6.3 : 1.0
          contains(mena) = True              neg : pos    =      6.3 : 1.0


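Words such as atrocious, unimaginative, stinks, shoddy, and bothered carry a negative connotation, so it is not surprising that they point to the neg label. "Turkey" also fits once you recall that it is slang for a flop. Other features are less obvious: annual is ambiguous on its own, and frances, schumacher, and mena are proper names, so whether they signal a positive or negative review depends on context that single-word features cannot capture; their high ratios most likely reflect a handful of reviews in the training set. The classifier reaches 83.33% accuracy on the 150 test documents, which is a good result for such simple features.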
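
Each ratio in the table compares how likely the feature is under one label versus the other, so a word seen in only a few training reviews can still rank very high. The sketch below reproduces a ratio of 11.0 from made-up document counts, assuming add-0.5 (expected likelihood) smoothing over the two values of a boolean feature, which is my understanding of NLTK's default estimator; the counts and the function name are hypothetical.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """P(feature=True | label) with add-gamma smoothing over `bins` feature values."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts: training reviews per label that contain the word.
neg_with_word, neg_total = 16, 925
pos_with_word, pos_total = 1, 925

p_neg = smoothed_prob(neg_with_word, neg_total)
p_pos = smoothed_prob(pos_with_word, pos_total)
ratio = p_neg / p_pos
print("neg : pos = {:.1f} : 1.0".format(ratio))
```

In this made-up case only 17 of 1,850 reviews contain the word, so one or two extra occurrences would move the ratio considerably, which is one reason rare words and names can dominate the list.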

----

### 3. Select one of the classification tasks described in this chapter, such as name gender detection, document classification, part-of-speech tagging, or dialog act classification. Using the same training and test data, and the same feature extractor, build three classifiers for the task: a decision tree, a naive Bayes classifier, and a  Maximum Entropy classifier. Compare the performance of the three classifiers on your selected task.

In [31]:
# I will use the gender classification case and use the features I built in question 1 above
from nltk.corpus import names
import random

names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)

In [32]:
# Feature extraction for gender classification
def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["first_two_letters"] = name[:2].lower()
    features["last_two_letters"] = name[-2:].lower()
    features["double_letters"] = (sum([1 for ch in range(len(name) - 1) if name[ch] == name[ch + 1]]))
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    features["prefix3"] = name[:3].lower()
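    # Index of the first vowel; -1 if the name has none (the original list indexing would raise IndexError)
    features["first_letter_vowel"] = next((i for i, ch in enumerate(name) if ch in 'AEIOUaeiouy'), -1)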
    features["num_vowels"] = len([letter for letter in name if letter in 'AEIOUaeiouy'])
    return features

In [33]:
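# 1. Evaluating decision tree:
# Note: train_set and test_set are not rebuilt for the gender task until the maximum entropy
# cell, so as run here they still hold the movie-review featuresets from question 2.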
classifier_dt = nltk.DecisionTreeClassifier.train(train_set)
print("Accuracy of Train set for Decision Tree:", round((nltk.classify.accuracy(classifier_dt, train_set)*100),2),"%")
print("Accuracy of Test set for Decision Tree:", round((nltk.classify.accuracy(classifier_dt, test_set)*100),2),"%")

Accuracy of Train set for Decision Tree: 92.86 %
Accuracy of Test set for Decision Tree: 60.0 %


In [34]:
# 2. Evaluating naive bayes classifier
classifier_nbc = nltk.NaiveBayesClassifier.train(train_set)
print("Accuracy of Train set for Naive Bayes:", round((nltk.classify.accuracy(classifier_nbc, train_set)*100),2),"%")
print("Accuracy of Test set for Naive Bayes:", round((nltk.classify.accuracy(classifier_nbc, test_set)*100),2),"%")

Accuracy of Train set for Naive Bayes: 88.7 %
Accuracy of Test set for Naive Bayes: 83.33 %


In [74]:
# 3. Evaluating maximum entropy classifier
featuresets = [(gender_features(n), gender) for (n, gender) in names]
train_set, devtest_set, test_set = featuresets[1000:], featuresets[500:1000], featuresets[:500]
classifier_mec = nltk.classify.MaxentClassifier.train(train_set, max_iter = 100)
print("Accuracy of Train set for Maximum Entropy:", round((nltk.classify.accuracy(classifier_mec, train_set)*100),2),"%")
print("Accuracy of Test set for Maximum Entropy:", round((nltk.classify.accuracy(classifier_mec, test_set)*100),2),"%")

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.58532        0.629
             3          -0.54989        0.638
             4          -0.51888        0.702
             5          -0.49183        0.761
             6          -0.46824        0.797
             7          -0.44762        0.815
             8          -0.42954        0.828
             9          -0.41360        0.837
            10          -0.39949        0.843
            11          -0.38693        0.848
            12          -0.37569        0.853
            13          -0.36558        0.857
            14          -0.35645        0.859
            15          -0.34815        0.860
            16          -0.34059        0.863
            17          -0.33367        0.864
            18          -0.32730        0.866
            19          -0.32143        0.867
 

Taking the recorded test accuracies at face value, the three models rank as follows:

Maximum Entropy Classifier (83.8%) > Naive Bayes Classifier (83.33%) > Decision Tree (60.0%)

These figures are not directly comparable, however. As recorded, the decision tree and naive Bayes cells (In [33] and In [34]) use `train_set` and `test_set` without rebuilding them, so they still held the movie-review data from question 2: 83.33% is exactly the question 2 accuracy (125 of 150 documents) and cannot be obtained on 500 test names. Only the maximum entropy result comes from the gender split, so all three classifiers need to be retrained on the same gender featuresets for a like-for-like comparison.

A maximum entropy classifier calculates the likelihood of each label for a given input by multiplying together the parameters that apply to that input and label. It fits those parameters iteratively, and the training log above shows log likelihood and training accuracy improving with each iteration. It reached 83.8% accuracy on the test set.

A naive Bayes classifier defines a parameter for each label, specifying its prior probability, and a parameter for each (feature, label) pair, specifying the contribution of that feature towards the label's likelihood. It is a strong, fast baseline, but because it treats features as independent it double-counts highly correlated features (in the gender extractor, for example, `last_two_letters` and `suffix2` are identical).

Lastly, the decision tree's recorded 60.0% test accuracy against 92.86% on its training data points to overfitting, which is typical when a tree is grown over many sparse features. Decision trees work best with a small number of informative features.
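
When comparing reported accuracies, it helps to confirm that each percentage is attainable for the stated test-set size. The sketch below, with a hypothetical helper `attainable`, checks the three figures above against a 500-name test set and against the 150-document test set from question 2.

```python
def attainable(pct, n_items, ndigits=2):
    """Return the correct-counts k for which round(100*k/n_items, ndigits) == pct."""
    return [k for k in range(n_items + 1)
            if round(100.0 * k / n_items, ndigits) == pct]

reported = [("Decision Tree", 60.0), ("Naive Bayes", 83.33), ("Maximum Entropy", 83.8)]
for label, pct in reported:
    print(label, pct, "| n=500:", attainable(pct, 500), "| n=150:", attainable(pct, 150))
```

An accuracy of 83.33% is reachable with 150 items (125 correct) but not with 500, which suggests the naive Bayes figure was measured on the question 2 test set rather than on 500 names.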

----

### 4. Identify the NPS Chat Corpus, which was demonstrated in Chapter 2, consists of over 15,000 posts from instant messaging sessions. These posts have all been labeled with one of 15 dialogue act types, such as "Statement," "Emotion," "ynQuestion", and "Continuer." We can therefore use this data to build a classifier that can identify the dialogue act types for new instant messaging posts. Build a simple feature extractor that checks what words the post contains. Construct the training and testing data by applying the feature extractor to each post and create a Naïve Bayes classifier. Please print the accuracy of this classifier. We use the first 15,000 messages from these instant messages as our dataset and use 8% data as our test data.

In [36]:
len(nltk.corpus.nps_chat.xml_posts())
# There are fewer than 15,000 posts in the corpus, so we use all of them and hold out 8% of the available posts for testing.
# I double-checked with the professor to confirm this approach.

10567

In [37]:
posts = nltk.corpus.nps_chat.xml_posts()[:15000]

In [38]:
# Define a feature extractor that checks what words the post contains:
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

In [39]:
featuresets = [(dialogue_act_features(post.text), post.get('class'))
              for post in posts]

In [40]:
# Defining training and test sets:
size = int(len(featuresets) * 0.08)

train_set, test_set = featuresets[size:], featuresets[:size]

In [41]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [42]:
# Check accuracy of the classifier:
print("Accuracy:", round((nltk.classify.accuracy(classifier, test_set)*100),2),"%")

Accuracy: 67.69 %


----

### 5. Given the following confusion matrix, please calculate: a) Accuracy Rate; b) Precision; c) Recall; d) F-Measure.

| | No | Yes |
|---|---|---|
| No | 104 | 33 |
| Yes | 13 | 50 |


In [43]:
# Create a confusion matrix:
column_names = ['predicted_no', 'predicted_yes']
row_names    = ['actual_no', 'actual_yes']
matrix = np.reshape((104,33,13,50), (2,2))
df = pd.DataFrame(matrix, columns=column_names, index=row_names)
df

            predicted_no  predicted_yes
actual_no            104             33
actual_yes            13             50


In [44]:
# a) Accuracy Rate
acc = (df['predicted_no']['actual_no'] + df['predicted_yes']['actual_yes'])/ matrix.sum()
print("Accuracy Rate:",round((acc*100),2),"%")

Accuracy Rate: 77.0 %


In [45]:
# b) Precision: of all instances predicted "yes", the share that are actually "yes" = TP / (TP + FP)
pre = df['predicted_yes']['actual_yes'] / (df['predicted_yes']['actual_no'] + df['predicted_yes']['actual_yes'])
print("Precision:",round((pre*100),2),"%")

Precision: 60.24 %


In [46]:
# c) Recall: of all instances that are actually "yes", the share predicted "yes" = TP / (TP + FN)
rec = df['predicted_yes']['actual_yes'] / (df['predicted_no']['actual_yes'] + df['predicted_yes']['actual_yes'])
print("Recall:",round((rec*100),2),"%")

Recall: 79.37 %


In [47]:
# d) F-measure: harmonic mean of precision and recall
fm = 2*pre*rec / (pre + rec)
print("F - measure:",round((fm*100),2),"%")

F - measure: 68.49 %
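
As a cross-check, the four measures can be computed directly from the cell counts. This sketch assumes, as the DataFrame above does, that rows are actual classes and columns are predictions, and it takes "Yes" as the positive class; the variable names are my own.

```python
# Cell counts from the confusion matrix (rows = actual, columns = predicted).
tn, fp = 104, 33   # actual No:  predicted No, predicted Yes
fn, tp = 13, 50    # actual Yes: predicted No, predicted Yes

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of everything predicted Yes, how much was actually Yes
recall = tp / (tp + fn)      # of everything actually Yes, how much was predicted Yes
f_measure = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F-measure", f_measure)]:
    print("{}: {:.2f} %".format(name, value * 100))
```

Under these assumptions precision is 50/83 and recall is 50/63; the F-measure is symmetric in the two, so it is unaffected by which of them is which.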


----

## Chapter 7

### 6. Write a tag pattern to match noun phrases containing plural head nouns in the following sentence: "Many researchers discussed this project for two weeks." Try to do this by generalizing the tag pattern that handled singular noun phrases too. Please 1) pos-tag this sentence 2) write a tag pattern (i.e. grammar); 3) use RegexpParser to parse the sentence and 4) print out the result containing NP (noun phrases).

In [48]:
# singular noun phrases: <DT>?<JJ.*>*<NN.*>+
# plural noun phrases: <DT>?<JJ.*>*<NN.*>*<NNS>+

In [49]:
# Step 1 - pos-tag the given sentence:
sentence  =  "Many researchers discussed this project for two weeks."
sentence_tokenized =  nltk.word_tokenize(sentence)
tagged_sentence = nltk.pos_tag(sentence_tokenized)
tagged_sentence

[('Many', 'JJ'),
 ('researchers', 'NNS'),
 ('discussed', 'VBD'),
 ('this', 'DT'),
 ('project', 'NN'),
 ('for', 'IN'),
 ('two', 'CD'),
 ('weeks', 'NNS'),
 ('.', '.')]

In [50]:
# Step 2 - define the tag pattern (grammar):
grammar = "NP: {<(JJ|CD|DT).*>+<NNS?>}"
cp = nltk.RegexpParser(grammar)

In [51]:
# Step 3 - use regex parser to parse the sentence:
result = cp.parse(tagged_sentence)

In [52]:
# Step 4 - print out the result that contains the NPs:
print(result)

(S
  (NP Many/JJ researchers/NNS)
  discussed/VBD
  (NP this/DT project/NN)
  for/IN
  (NP two/CD weeks/NNS)
  ./.)
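
A tag pattern behaves much like a regular expression applied to the sequence of POS tags. As a rough illustration only (this is not how `RegexpParser` is implemented), the sketch below joins the tags from step 1 into one string and applies an equivalent plain `re` pattern; the three matches correspond to the three NP chunks printed above.

```python
import re

tagged = [("Many", "JJ"), ("researchers", "NNS"), ("discussed", "VBD"),
          ("this", "DT"), ("project", "NN"), ("for", "IN"),
          ("two", "CD"), ("weeks", "NNS"), (".", ".")]

# One "<TAG>" token per word, e.g. "<JJ><NNS><VBD>..."
tag_string = "".join("<{}>".format(tag) for _, tag in tagged)

# Plain-regex analogue of the tag pattern <(JJ|CD|DT).*>+<NNS?>
np_pattern = re.compile(r"(?:<(?:JJ|CD|DT)[^<>]*>)+<NNS?>")

matches = np_pattern.findall(tag_string)
print(matches)
```

The `<NNS?>` element is what lets one pattern cover both singular (`NN`) and plural (`NNS`) head nouns.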


In [53]:
# Step 5 - print the result graphically using a tree
from nltk.tree import Tree

# Rebuild the tree from the parser's bracketed output (rather than a hand-copied string)
# so that each leaf prints as word/TAG
parsetree = Tree.fromstring(str(result))
parsetree.pretty_print()

                                            S                                                   
       _____________________________________|_______________________________________             
      |         |     |           NP                          NP                    NP          
      |         |     |      _____|_________             _____|______          _____|______      
discussed/VBD for/IN ./. Many/JJ     researchers/NNS this/DT     project/NN two/CD     weeks/NNS



----

### 7. Write a tag pattern to cover noun phrases that contain gerunds, e.g. "the/DT receiving/VBG end/NN", "assistant/NN managing/VBG editor/NN". Add these patterns to the grammar, one per line. Test your work using some tagged sentences of your own devising.

In [106]:
grammar = """
    NP: {<DT><VBG><NN.*>}    # chunk determiner, gerund, and noun
        {<NN.*><VBG><NN.*>}   # chunk noun, gerund, and noun
"""

In [107]:
cp = nltk.RegexpParser(grammar)
sentences = [[("the", "DT"), ("receiving", "VBG"), ("end", "NN")], 
             [("assistant", "NN"),  ("managing", "VBG"),  ("editor", "NN")]]

for sent in sentences:
    print(cp.parse(sent))

(S (NP the/DT receiving/VBG end/NN))
(S (NP assistant/NN managing/VBG editor/NN))


In [108]:
# Testing the work using my own sentences - 1
my_sentence = "American citizens will vote for the new president"
my_sentence_tokenized =  nltk.word_tokenize(my_sentence)
tagged_my_sentence = nltk.pos_tag(my_sentence_tokenized)
tagged_my_sentence

[('American', 'NNP'),
 ('citizens', 'NNS'),
 ('will', 'MD'),
 ('vote', 'VB'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('new', 'JJ'),
 ('president', 'NN')]

In [110]:
result = cp.parse(tagged_my_sentence)
print(result)

(S
  American/NNP
  citizens/NNS
  will/MD
  vote/VB
  for/IN
  the/DT
  new/JJ
  president/NN)


In [113]:
# Testing the work using my own sentences - 2
my_sentence_2 = "I am excited for the upcoming olympic games in 2021"
my_sentence_tokenized_2 =  nltk.word_tokenize(my_sentence_2)
tagged_my_sentence_2 = nltk.pos_tag(my_sentence_tokenized_2)
tagged_my_sentence_2

[('I', 'PRP'),
 ('am', 'VBP'),
 ('excited', 'VBN'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('upcoming', 'JJ'),
 ('olympic', 'NN'),
 ('games', 'NNS'),
 ('in', 'IN'),
 ('2021', 'CD')]

In [114]:
result_2 = cp.parse(tagged_my_sentence_2)
print(result_2)

(S
  I/PRP
  am/VBP
  excited/VBN
  for/IN
  the/DT
  upcoming/JJ
  olympic/NN
  games/NNS
  in/IN
  2021/CD)


In [115]:
my_sentence_tags = [[("the", "DT"), ("new", "JJ"), ("president", "NN")], 
             [("upcoming", "JJ"),  ("olympic", "NNS"),  ("games", "NNS")]]

for sent in my_sentence_tags:
    print(cp.parse(sent))

(S the/DT new/JJ president/NN)
(S upcoming/JJ olympic/NNS games/NNS)


----

### 8. Use the Brown Corpus and the cascaded chunkers that has patterns for noun phrases, prepositional phrases, verb phrases, and clauses to print out all the verb phrases in the Brown corpus.

In [59]:
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """

In [60]:
cp = nltk.RegexpParser(grammar, loop=2)

In [61]:
for sent in nltk.corpus.brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'VP': print(subtree)

(VP Ask/VB-HL (NP jail/NN-HL deputies/NNS-HL))
(VP revolving/VBG-HL (NP fund/NN-HL))
(VP Issue/VB-HL (NP jury/NN-HL subpoenas/NNS-HL))
(VP Nursing/VBG-HL (NP home/NN-HL care/NN-HL))
(VP pay/VB-HL (NP doctors/NNS-HL))
(VP nursing/VBG-HL (NP homes/NNS))
(VP Asks/VBZ-HL (NP research/NN-HL funds/NNS-HL))
(VP Regrets/VBZ-HL (NP attack/NN-HL))
(VP Decries/VBZ-HL (NP joblessness/NN-HL))
(VP Underlying/VBG-HL (NP concern/NN-HL))
(VP bar/VB-HL (NP vehicles/NNS-HL))
(VP loses/VBZ-HL (NP pace/NN-HL))
(VP hits/VBZ-HL (NP homer/NN-HL))
(VP attend/VB-HL (NP races/NNS-HL))
(VP follows/VBZ-HL (NP ceremonies/NNS-HL))
(VP Noted/VBN-HL (NP artist/NN-HL))
(VP Cites/VBZ-HL (NP discrepancies/NNS-HL))
(VP calls/VBZ-HL (NP police/NNS-HL))
(VP held/VBN-HL (NP key/NN-HL))
(VP grant/VB-HL (NP bail/NN-HL))
(VP Held/VBD-HL (NP candle/NN-HL))
(VP Expresses/VBZ-HL (NP thanks/NNS-HL))
(VP Gets/VBZ-HL (NP car/NN-HL number/NN-HL))
(VP Attacks/VBZ-HL (NP officer/NN-HL))
(VP oks/VBZ-HL (NP pact/NN-HL))
(VP report/VB-HL (

----

### 9. The bigram chunker scores about 90% accuracy. Study its errors and try to work out why it doesn't get 100% accuracy. Experiment with trigram chunking. Are you able to improve the performance any more?

In [62]:
from nltk.corpus import conll2000

# Define train and test sets
test_sents = conll2000.chunked_sents('test.txt')
train_sents = conll2000.chunked_sents('train.txt')

In [63]:
# Create bigram chunker
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                       for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)       
    
    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [64]:
# Evaluate the bigram chunker
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  89.3%%
    Precision:     81.2%%
    Recall:        86.2%%
    F-Measure:     83.6%%
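
The scores printed above mix two kinds of measurement: IOB accuracy is computed per token, while precision, recall and F-measure are computed per chunk. As a minimal sketch of the chunk-level part, assuming the usual balanced F-measure (the harmonic mean of precision and recall), the relationship can be reproduced on toy data. The `chunk_scores` helper and the `(start, end, label)` chunk triples are hypothetical names for illustration, not part of NLTK.

```python
def chunk_scores(gold, guessed):
    """Return (precision, recall, f_measure) for two sets of chunks.

    Each chunk is a hypothetical (start, end, label) triple. A guessed chunk
    counts as correct only if it matches a gold chunk exactly.
    """
    correct = len(gold & guessed)
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: four gold chunks, three guessed chunks, two exact matches.
gold = {(0, 2, 'NP'), (2, 3, 'VP'), (3, 5, 'NP'), (5, 6, 'PP')}
guessed = {(0, 2, 'NP'), (2, 4, 'VP'), (5, 6, 'PP')}

precision, recall, f_measure = chunk_scores(gold, guessed)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

A chunk with the wrong boundary, such as the guessed VP here, lowers both precision and recall, which is why the chunk-level scores sit below the token-level IOB accuracy.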


In [65]:
grammar = r"VP: {<VB.>?<RB>*<MD>?<VB.>?<TO>?<MD>?<RB>*<VB.>}"
cp = nltk.RegexpParser(grammar)

In [66]:
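# Study the errors. Note: this evaluates the regexp VP chunker `cp` defined above,
# not bigram_chunker. The grammar only defines VP, so NP and PP chunks are always missed.
# a) chunkscore.missed()
cp.evaluate(test_sents).missed()[:10]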

[ImmutableTree('VP', [('rebound', 'NN'), ('to', 'TO'), ('close', 'VB')]),
 ImmutableTree('NP', [('Mr.', 'NNP'), ('Edelman', 'NNP')]),
 ImmutableTree('NP', [("'s", 'POS'), ('chief', 'JJ'), ('retail', 'JJ'), ('banking', 'NN'), ('officer', 'NN')]),
 ImmutableTree('NP', [('McNally', 'NNP')]),
 ImmutableTree('PP', [('with', 'IN')]),
 ImmutableTree('NP', [('next', 'JJ'), ('year', 'NN')]),
 ImmutableTree('VP', [('were', 'VBD'), ('both', 'DT'), ('hired', 'VBN')]),
 ImmutableTree('NP', [('any', 'DT'), ('alternative', 'NN')]),
 ImmutableTree('PP', [('to', 'TO')]),
 ImmutableTree('NP', [('a', 'DT'), ('one-time', 'JJ'), ('$', '$'), ('16', 'CD'), ('million', 'CD'), ('gain', 'NN')])]

In [67]:
# Study the errors of the regexp VP chunker `cp` (not bigram_chunker):
# b) chunkscore.incorrect()
cp.evaluate(test_sents).incorrect()[:10]

[ImmutableTree('VP', [('scrambled', 'VBD')]),
 ImmutableTree('VP', [('does', 'VBZ')]),
 ImmutableTree('VP', [('proposed', 'VBN')]),
 ImmutableTree('VP', [('recently', 'RB'), ('launched', 'VBN')]),
 ImmutableTree('VP', [('pushed', 'VBN')]),
 ImmutableTree('VP', [('holding', 'VBG')]),
 ImmutableTree('VP', [('swelling', 'VBG')]),
 ImmutableTree('VP', [('get', 'VBP'), ('is', 'VBZ')]),
 ImmutableTree('VP', [('shares', 'VBZ')]),
 ImmutableTree('VP', [('have', 'VBP'), ('been', 'VBN'), ('forced', 'VBN')])]

The error listings above do not point to a single clear cause. I looked at the 10 incorrect chunks, and they suggest that chunking goes wrong where past tense and past participle forms (VBD, VBN) are hard for the system to tag.

However, one thing that stands out is that the chunker sometimes marks a single VBN or VBG as a VP chunk. This could be a direction for improving accuracy.
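
To make the observation about single-token chunks concrete, here is a small sketch that tallies the ten incorrect chunks printed above by their tag pattern. The chunks are copied by hand as plain `(word, tag)` lists so the snippet runs without the corpus. The names `incorrect`, `pattern_counts` and `single_participles` are made up for this illustration.

```python
from collections import Counter

# The ten incorrect chunks printed above, copied as plain (word, tag) lists.
incorrect = [
    [('scrambled', 'VBD')],
    [('does', 'VBZ')],
    [('proposed', 'VBN')],
    [('recently', 'RB'), ('launched', 'VBN')],
    [('pushed', 'VBN')],
    [('holding', 'VBG')],
    [('swelling', 'VBG')],
    [('get', 'VBP'), ('is', 'VBZ')],
    [('shares', 'VBZ')],
    [('have', 'VBP'), ('been', 'VBN'), ('forced', 'VBN')],
]

# Reduce each chunk to its sequence of POS tags and count the patterns.
pattern_counts = Counter(' '.join(tag for _, tag in chunk) for chunk in incorrect)

# Chunks that consist of exactly one VBN or one VBG token.
single_participles = pattern_counts['VBN'] + pattern_counts['VBG']

for pattern, count in pattern_counts.most_common():
    print(f"{count:2d}  {pattern}")
print("single VBN/VBG chunks:", single_participles)
```

Ten chunks are too few to draw firm conclusions. The same tally could be run over the full `incorrect()` list to see whether lone participles really dominate the errors.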

In [68]:
# Create trigram chunker
class TrigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                       for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)
        
    
    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [69]:
# Evaluate the trigram chunker
trigram_chunker = TrigramChunker(train_sents)
print(trigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     81.0%%
    Recall:        84.4%%
    F-Measure:     82.6%%


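No, the trigram chunker does not improve on the bigram chunker: IOB accuracy drops from 89.3% to 87.7%, and F-measure from 83.6% to 82.6%. As we studied in class (quoting the NLTK book), as n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off in information retrieval). Neither chunker here uses a backoff tagger, so an unseen context yields no chunk tag at all. Giving the trigram tagger a bigram backoff is one thing that could be tried next, though it was not tested here.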
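
The sparse data effect can be illustrated on toy data. The sketch below assumes a simplified context definition modeled on n-gram tagging: the previous n-1 chunk tags plus the current POS tag. It also uses gold chunk tags as history, whereas a real tagger uses its own predictions. The helpers `contexts` and `unseen_rate` and the tiny `train`/`test` sets are invented for this example.

```python
def contexts(tagged_sent, n):
    """Yield (previous n-1 chunk tags, current POS tag) for each position."""
    for i, (pos, _chunk) in enumerate(tagged_sent):
        start = max(0, i - n + 1)
        history = tuple(chunk for _pos, chunk in tagged_sent[start:i])
        yield history, pos


def unseen_rate(train, test, n):
    """Fraction of test positions whose order-n context never occurs in train."""
    seen = {ctx for sent in train for ctx in contexts(sent, n)}
    test_contexts = [ctx for sent in test for ctx in contexts(sent, n)]
    unseen = sum(ctx not in seen for ctx in test_contexts)
    return unseen / len(test_contexts)


# Toy sentences as (POS tag, chunk tag) pairs; not taken from CoNLL-2000.
train = [
    [('DT', 'B-NP'), ('NN', 'I-NP'), ('VBD', 'B-VP'), ('DT', 'B-NP'), ('NN', 'I-NP')],
    [('NNP', 'B-NP'), ('VBZ', 'B-VP'), ('JJ', 'B-NP'), ('NNS', 'I-NP')],
]
test = [
    [('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP'), ('VBD', 'B-VP'), ('NNP', 'B-NP')],
]

bigram_rate = unseen_rate(train, test, 2)
trigram_rate = unseen_rate(train, test, 3)
print(f"unseen bigram contexts:  {bigram_rate:.2f}")
print(f"unseen trigram contexts: {trigram_rate:.2f}")
```

Under this definition the trigram rate can never be lower than the bigram rate, because every seen trigram context implies a seen bigram context. On this toy data the rates are 0.6 and 0.8.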

----

### 10. Explore the Brown Corpus to print out all the FACILITIES (one of the commonly used types of named entities).

In [70]:
brown = nltk.corpus.brown.tagged_sents()

In [71]:
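# Collect FACILITY entities found by NLTK's named-entity chunker.
# Note: subtree[0][0] keeps only the first word of each entity,
# so multi-word names are truncated to their first token.
brown_facilities = {subtree[0][0] for sent in brown
                    for subtree in nltk.ne_chunk(sent).subtrees()
                    if subtree.label() == 'FACILITY'}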

In [72]:
print(sorted(brown_facilities))

['Baltimore', 'Bari', 'Berlin', 'Boron', 'Caltech', 'Caracas', 'Clayton', 'Francie', 'Franklin', 'Grafton', 'Hilo', 'Israelite', 'Jack', 'Jenks', 'Kremlin', 'Lublin', 'Madison', 'Marston', 'Ninth', 'Northfield', 'Pennsylvania', 'Penny', 'Pensacola', 'Phil', 'Raymondville', 'Rome', 'Teheran', 'Versailles', 'White', 'Whiteleaf', 'Whitemarsh', 'Winston']


----