**In this assignment you will be guided to add more features in order to get better performance!**

In [1]:
import os
path = os.getcwd() + "/"

In [2]:
import matplotlib.pyplot as plt

In [3]:
from nltk.corpus import stopwords

In [4]:
from itertools import chain

import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

A simple sentence NER example:

[**ORG** U.N. ] official [**PER** Ekeus ] heads for [**LOC** Baghdad ] 

We will concentrate on four types of named entities:
 * persons (**PER**), 
 * locations (**LOC**) 
 * organizations (**ORG**)
 * Others (**O**)

In [5]:
def _generate_examples(filepath):
        with open(filepath, encoding="utf-8") as f:
            sent = []
            for line in f:
                if line.startswith("-DOCSTART-") or line == "" or line == "\n":
                    if sent:
                        yield sent
                        sent = []
                else:
                    splits = line.split(" ")
                    token = splits[0]
                    pos_tag = splits[1]
                    ner_tag = splits[3].rstrip()
                    if 'MISC' in ner_tag:
                        ner_tag = 'O'
                    
                    sent.append((token, pos_tag, ner_tag))

In [7]:
%%time
train_sents = list(_generate_examples(path + 'train.txt'))
test_sents = list(_generate_examples(path + 'test.txt'))

CPU times: user 242 ms, sys: 19.1 ms, total: 261 ms
Wall time: 262 ms


Here we have succesfully loaded the trianing and test data.
_________________

Here is a list of english stopwords, and we would like to include it as a feature

In [8]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/qlr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Here is a list of names, and we would like to include it as a feature

In [9]:
names = set()
with open(path + 'names.txt') as f:
    for l in f:
        names.add(l[:-1].lower())

_____________________________________________________

You are asked to change the `word2features` function to add the following features:

**For the current word**:
1. Add a feature named `word.isupper()` that tells if the word is in upper case (you can your the `isupper()` function in python)
2. Add a feature named `word.isdigit()` that tells if the word is all digits (similarility you can use the `isdigit()` built-in python function)
3. Add a feature named `word.l1_is_capital` that tells if the word starts with a capital letter
4. Add a feature named `word.ends_in_dot` that tells if the word has lenght > 1 and ends with a dot (`.`)
5. Add a feature named `word.is_stop_word` that tells is the word belongs to the list of stop words defined previously `stop_words` (don't forget to convert the word into lower case before testing, just to be case insensitive)
6. Add a feature named `word.constains_digits` that tells if the word contains a digit or not
7. Add a feature names `word.figures_in_names_list` that tells if the word belongs to the list of names we defined previously `names`. Again don't forget to change the word into lower case first.

**For the previous word**: (BE CAREFUL, YOU SHOULD NOT USE `word`, USE `word1` instead):

Add the same features. Just prepend the name of features with `-1:` (It's important for the different features to be of different names)

**Add infomration about nextword**: (BE CAREFUL, YOU SHOULD NOT USE `word`, USE `word1` instead):

* Add the same features. Just prepend the name of features with `+1:` (It's important for the different features to be of different names)

* **PS**: If the word is the last one in the sentence (no next word), just add a feature named `EOS` = True to tell that the word is in the last position. JUst as we've did with `BOS`



In [30]:
import re

def contains(d, shape):
    _digits = re.compile(shape)
    return bool(_digits.search(d))

In [41]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.isupper()': word.isupper(),
        'word.isdigit()': word.isdigit(),
        'word.l1_is_capital': word[0].isupper(),
        'word.ends_in_dot': (len(word)>1 and word[-1]=='.'),
        'word.is_stop_word': word in stop_words,
        'word.contains_digits': contains(word, '\d'),
        'word.figures_in_names_list': word in names,
        'postag': postag,
        # Add features of the current word here
    }
    
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.isupper()': word1.isupper(),
            '-1:word.isdigit()': word1.isdigit(),
            '-1:word.l1_is_capital': word1[0].isupper(),
            '-1:word.ends_in_dot': (len(word1)>1 and word1[-1]=='.'),
            '-1:word.is_stop_word': word1 in stop_words,
            '-1:word.contains_digits': contains(word1, '\d'),
            '-1:word.figures_in_names_list': word1 in names,
            '-1:postag': postag1,
            # Add features of previous word here
        })
    else:
        features['BOS'] = True
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.isupper()': word1.isupper(),
            '+1:word.isdigit()': word1.isdigit(),
            '+1:word.l1_is_capital': word1[0].isupper(),
            '+1:word.ends_in_dot': (len(word1)>1 and word1[-1]=='.'),
            '+1:word.is_stop_word': word1 in stop_words,
            '+1:word.contains_digits': contains(word1, '\d'),
            '+1:word.figures_in_names_list': word1 in names,
            '+1:postag': postag1,
            # Add features of previous word here
        })
    else:
        features['EOS'] = True
        
    return features

In [42]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

In [43]:
test_sents[0]

[('SOCCER', 'NN', 'O'),
 ('-', ':', 'O'),
 ('JAPAN', 'NNP', 'B-LOC'),
 ('GET', 'VB', 'O'),
 ('LUCKY', 'NNP', 'O'),
 ('WIN', 'NNP', 'O'),
 (',', ',', 'O'),
 ('CHINA', 'NNP', 'B-LOC'),
 ('IN', 'IN', 'O'),
 ('SURPRISE', 'DT', 'O'),
 ('DEFEAT', 'NN', 'O'),
 ('.', '.', 'O')]

In [44]:
[word2features(test_sents[0], i) for i in range(len(test_sents[0]))]

[{'bias': 1.0,
  'word.lower()': 'soccer',
  'word.isupper()': True,
  'word.isdigit()': False,
  'word.l1_is_capital': True,
  'word.ends_in_dot': False,
  'word.is_stop_word': False,
  'word.contains_digits': False,
  'word.figures_in_names_list': False,
  'postag': 'NN',
  'BOS': True,
  '+1:word.lower()': '-',
  '+1:word.isupper()': False,
  '+1:word.isdigit()': False,
  '+1:word.l1_is_capital': False,
  '+1:word.ends_in_dot': False,
  '+1:word.is_stop_word': False,
  '+1:word.contains_digits': False,
  '+1:word.figures_in_names_list': False,
  '+1:postag': ':'},
 {'bias': 1.0,
  'word.lower()': '-',
  'word.isupper()': False,
  'word.isdigit()': False,
  'word.l1_is_capital': False,
  'word.ends_in_dot': False,
  'word.is_stop_word': False,
  'word.contains_digits': False,
  'word.figures_in_names_list': False,
  'postag': ':',
  '-1:word.lower()': 'soccer',
  '-1:word.isupper()': True,
  '-1:word.isdigit()': False,
  '-1:word.l1_is_capital': True,
  '-1:word.ends_in_dot': False,


Construct the features for the training and test test
_________________________

In [45]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

CPU times: user 1.55 s, sys: 39 ms, total: 1.59 s
Wall time: 1.6 s


Train your CRF
______________________________

In [46]:
%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

CPU times: user 26.9 s, sys: 78 ms, total: 27 s
Wall time: 27.2 s




CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)

In [47]:
labels = list(crf.classes_)

In [48]:
labels

['B-ORG', 'O', 'B-PER', 'I-PER', 'B-LOC', 'I-ORG', 'I-LOC']

In [49]:
labels.remove('O')

Compute F1 score for different labels. Remove the 'O' label before that
______________________

In [50]:
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)

0.8252704236583717

In [51]:
# group B and I results
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)

In [52]:
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))



              precision    recall  f1-score   support

       B-LOC      0.867     0.853     0.860      1667
       I-LOC      0.823     0.724     0.770       257
       B-ORG      0.827     0.719     0.769      1660
       I-ORG      0.740     0.733     0.736       834
       B-PER      0.834     0.846     0.840      1615
       I-PER      0.876     0.950     0.912      1156

   micro avg      0.836     0.818     0.827      7189
   macro avg      0.828     0.804     0.815      7189
weighted avg      0.835     0.818     0.825      7189



In [53]:
len(crf.transition_features_)

49

Show the 20 top and 20 least likely transitions between labels 
_________________

In [54]:
from collections import Counter

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])

Top likely transitions:
B-ORG  -> I-ORG   5.827193
B-PER  -> I-PER   4.684174
I-ORG  -> I-ORG   4.547670
B-LOC  -> I-LOC   3.857031
I-LOC  -> I-LOC   3.762733
O      -> O       3.723489
I-PER  -> I-PER   3.021489
O      -> B-PER   1.652605
O      -> B-ORG   1.266949
O      -> B-LOC   0.780464
B-ORG  -> O       0.293434
B-LOC  -> O       0.004557
I-PER  -> O       -0.287651
I-LOC  -> O       -0.324062
B-PER  -> O       -0.341121
I-ORG  -> O       -0.941013
I-LOC  -> B-LOC   -1.430664
I-LOC  -> B-ORG   -1.680629
I-PER  -> I-ORG   -1.954587
B-LOC  -> B-ORG   -1.968002

Top unlikely transitions:
B-PER  -> I-LOC   -2.609454
I-PER  -> B-LOC   -2.703711
B-ORG  -> I-LOC   -2.880424
I-LOC  -> B-PER   -2.893446
I-ORG  -> I-LOC   -2.940391
B-ORG  -> I-PER   -3.300197
B-LOC  -> B-PER   -3.323175
B-ORG  -> B-LOC   -3.359997
I-PER  -> B-PER   -3.391003
B-LOC  -> I-ORG   -3.459021
B-LOC  -> I-PER   -3.583470
I-ORG  -> B-ORG   -3.634410
I-ORG  -> B-LOC   -3.695969
B-PER  -> B-ORG   -3.742814
I-ORG  ->

In [55]:
len(crf.state_features_)

17919

Show the 50 top and 50 least likely state features (compatibility between features and labels)
_______________

In [56]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))

print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(50))

print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-50:])

Top positive:
6.780963 O        word.lower():september
6.733413 B-LOC    word.lower():england
6.639222 O        word.lower():june
6.436755 B-LOC    word.lower():pakistan
6.431660 O        word.lower():august
6.408326 O        word.lower():july
6.204839 O        +1:word.lower():open
6.166702 O        word.lower():tuesday
5.996229 O        word.lower():friday
5.991452 B-LOC    word.lower():germany
5.989684 O        word.lower():may
5.968940 O        word.lower():thursday
5.886973 O        word.lower():aug
5.874153 O        word.lower():monday
5.842614 O        word.lower():wednesday
5.838268 B-ORG    -1:word.lower():v
5.756147 I-ORG    word.lower():newsroom
5.671975 O        word.lower():minister
5.646289 B-PER    word.lower():clinton
5.618795 B-LOC    word.lower():iraq
5.572124 B-ORG    word.lower():u.n.
5.571144 O        word.lower():march
5.562797 B-LOC    word.lower():britain
5.541021 O        word.lower():october
5.513678 I-LOC    -1:word.lower():colo
5.498038 I-LOC    -1:word.lower

See if you can spot some interesting features (both with positive and negative coefficients)