<span style="color:red">**Team members / emails**</span> --> *To be sent on 8 January, 8 pm (Ex. 6)*
- (1)
- (2)
- (3)

#### For source code, see
1. https://eli5.readthedocs.io/en/latest/tutorials/sklearn_crfsuite.html#
2. https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

#### For CRF theory, see
1. https://medium.com/ml2vec/overview-of-conditional-random-fields-68a2a20fa541
2. https://www.cs.upc.edu/~aquattoni/AllMyPapers/crf_tutorial_talk.pdf
3. http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/

#### More detailed theoretical reference
1. https://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [None]:
!pip3 install sklearn_crfsuite

Collecting sklearn_crfsuite
  Downloading https://files.pythonhosted.org/packages/25/74/5b7befa513482e6dee1f3dd68171a6c9dfc14c0eaa00f885ffeba54fe9b0/sklearn_crfsuite-0.3.6-py2.py3-none-any.whl
Collecting python-crfsuite>=0.8.3
  Downloading https://files.pythonhosted.org/packages/95/99/869dde6dbf3e0d07a013c8eebfb0a3d30776334e0097f8432b631a9a3a19/python_crfsuite-0.9.7-cp36-cp36m-manylinux1_x86_64.whl (743kB)
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.7 sklearn-crfsuite-0.3.6


In [None]:
!pip3 install eli5

Collecting eli5
  Downloading https://files.pythonhosted.org/packages/97/2f/c85c7d8f8548e460829971785347e14e45fa5c6617da374711dec8cb38cc/eli5-0.10.1-py2.py3-none-any.whl (105kB)
Installing collected packages: eli5
Successfully installed eli5-0.10.1


In [None]:
import nltk
# Do this once to get CoNLL2002: nltk.download('conll2002')
# (without it, loading the corpus below raises a LookupError)
import sklearn_crfsuite
import eli5
import scipy.stats

from sklearn.model_selection import train_test_split
from sklearn_crfsuite.metrics import flat_f1_score
from sklearn_crfsuite.metrics import flat_classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV



## Training data

In [None]:
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
train_sents[0]

LookupError: ignored

## Feature extraction

**Features**: word identity, word suffix, word shape and word POS tag, for the current token and for its immediate neighbours
<br><br>
NB: *`str.istitle()` returns True if every word in the string starts with an uppercase letter AND the remaining letters of each word are lowercase; otherwise it returns False.*
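
A quick check of the `istitle()` behaviour described above. The example words are chosen here purely for illustration and do not come from the corpus:

```python
# str.istitle() is True only when each word starts with an uppercase letter
# and every other cased character in that word is lowercase.
examples = ["Madrid", "MADRID", "madrid", "El País", "McDonald"]

title_flags = {word: word.istitle() for word in examples}

for word, flag in title_flags.items():
    print(f"{word!r:12} istitle={flag}")
```

Note that a mixed-case word such as "McDonald" is not title-cased according to this method, so the `word.istitle()` feature will not fire on it.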

In [None]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

In [None]:
# Features of the token at index 10 (i.e. the 11th word) of sentence 0
X_train[0][10]

## Train a CRF model

**L-BFGS**: Limited-memory BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm <br>--> a quasi-Newton optimizer used to fit the model weights (an alternative to plain gradient descent)
<br><br>**c1, c2**: Regularization weights (c1 for the L1 penalty, c2 for the L2 penalty)
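
To make the role of the two weights concrete, here is a minimal sketch of a combined L1+L2 penalty added to the training loss. The function name `elastic_net_penalty` and the toy weight vector are hypothetical, and the exact scaling CRFsuite applies internally is assumed here rather than taken from its source:

```python
def elastic_net_penalty(weights, c1, c2):
    """Return c1 * sum(|w|) + c2 * sum(w^2) for a list of feature weights."""
    l1_term = sum(abs(w) for w in weights)   # pushes weights to exactly zero
    l2_term = sum(w * w for w in weights)    # shrinks large weights smoothly
    return c1 * l1_term + c2 * l2_term

# Toy weight vector (illustrative values only).
weights = [0.5, -2.0, 0.0, 1.5]
penalty = elastic_net_penalty(weights, c1=0.1, c2=0.1)
print(penalty)
```

A larger c1 tends to produce a sparser model (more weights at zero), while a larger c2 keeps all weights small without necessarily zeroing them.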

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train);

We want to pick good c1 and c2 parameters. We run a randomized search with 3-fold cross-validation over a given parameter space and retain the c1 and c2 that yield the best F1-score.
<br><br>
**Caution**: this can take **a while**, so I give you my search results below!
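
To get a feel for what the search explores, the sketch below draws candidate (c1, c2) pairs from exponential distributions with scales 0.5 and 0.05, the same scales used in the search cell. It uses the standard library's `random.expovariate` as a stand-in for `scipy.stats.expon`; the helper name `sample_params` and the seed are assumptions made for this illustration:

```python
import random

def sample_params(rng, c1_scale=0.5, c2_scale=0.05):
    """Draw one (c1, c2) candidate; expovariate takes the rate, i.e. 1/scale."""
    return {
        'c1': rng.expovariate(1.0 / c1_scale),
        'c2': rng.expovariate(1.0 / c2_scale),
    }

rng = random.Random(0)  # fixed seed so the draw is reproducible
candidates = [sample_params(rng) for _ in range(50)]

mean_c1 = sum(c['c1'] for c in candidates) / len(candidates)
mean_c2 = sum(c['c2'] for c in candidates) / len(candidates)
print("mean c1: {:.3f}, mean c2: {:.3f}".format(mean_c1, mean_c2))
```

Each of the 50 candidates would then be scored by cross-validation, which is why the search is slow: with 3 folds this amounts to 150 model fits.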

In [None]:
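#%%time
## labels to score: every class of the CRF fitted above, except 'O'
#labels = list(crf.classes_)
#labels.remove('O')
#
## define fixed parameters and parameters to search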
#crf = sklearn_crfsuite.CRF(
#    algorithm='lbfgs', 
#    max_iterations=100, 
#    all_possible_transitions=True
#)
#params_space = {
#    'c1': scipy.stats.expon(scale=0.5),
#    'c2': scipy.stats.expon(scale=0.05),
#}
#
## use the same metric for evaluation
#f1_scorer = make_scorer(flat_f1_score, 
#                        average='weighted', labels=labels)
#
## search
#rs = RandomizedSearchCV(crf, params_space, 
#                        cv=3, 
#                        verbose=1, 
#                        n_jobs=-1, 
#                        n_iter=50, 
#                        scoring=f1_scorer)
#rs.fit(X_train, y_train)

In [None]:
## crf = rs.best_estimator_
#print('best params:', rs.best_params_)
#print('best CV score:', rs.best_score_)
#print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))

## MY RESULTS
## ----
#best params: {'c1': 0.03802979579044823, 'c2': 0.0624535538687852}
#best CV score: 0.7492315130798647
#model size: 1.79M
## ----

## Display parameter space

A chart showing which c1 and c2 values RandomizedSearchCV checked. Red means better results, blue means worse.

In [None]:
#rs.cv_results_

In [None]:
#_x = [s['c1'] for s in rs.cv_results_['params']]
#_y = [s['c2'] for s in rs.cv_results_['params']]
#_c = [s for s in rs.cv_results_['mean_test_score']]
#
#fig = plt.figure()
#fig.set_size_inches(12, 12)
#ax = plt.gca()
#ax.set_yscale('log')
#ax.set_xscale('log')
#ax.set_xlabel('C1')
#ax.set_ylabel('C2')
#ax.set_title("Randomized Hyperparameter Search CV Results (min={:0.3}, max={:0.3})".format(
#   min(_c), max(_c)
#))
#
#ax.scatter(_x, _y, c=_c, s=60, alpha=0.9, edgecolors=[0,0,0])
#
#print("Dark blue => {:0.4}, dark red => {:0.4}".format(min(_c), max(_c)))

In [None]:
# Results with the best estimator

#crf = rs.best_estimator_

## USING MY RESULTS
## ----
#best params: {'c1': 0.03802979579044823, 'c2': 0.0624535538687852}
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.03802979579044823,
    c2=0.0624535538687852,
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train);
## ----

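# Report on entity labels only (drop 'O'), with the B- and I- tags
# of the same entity type listed next to each other
labels = list(crf.classes_)
labels.remove('O')
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))

y_pred = crf.predict(X_test)
print(flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))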

## Inspect model weights

In [None]:
from collections import Counter

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])

After training the model, we can display its learned weights:

In [None]:
eli5.show_weights(crf, top=30)

Some observations:

- **9.338 B-ORG word.lower()**:psoe-progresistas: the model remembered the names of some entities; maybe it overfits, or maybe our features are not adequate, or maybe remembering is indeed helpful;
- **4.970 I-LOC -1:word.lower()**:calle: “calle” means “street” in Spanish; the model learns that if the previous word was “calle” then the token is likely part of a location;
- **-7.343 O word.isupper(), -8.461 O word.istitle()**: UPPERCASED or TitleCased words are likely entities of some kind;
- **-2.097561 O postag:NP**: proper nouns (NP is a proper noun in the Spanish tagset) are often entities.
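
To see how such weights favour one label over another, the sketch below sums the state-feature weights that are active for a token. It only uses three of the weights quoted above and ignores all other features and all transition weights, so it is a simplified illustration and not the full CRF score; the helper `state_score` is hypothetical:

```python
# Weights quoted in the observations above, keyed by (label, feature).
state_weights = {
    ('I-LOC', '-1:word.lower():calle'): 4.970,
    ('O', 'word.isupper()'): -7.343,
    ('O', 'word.istitle()'): -8.461,
}

def state_score(label, active_features):
    """Sum the weights of the active features for one candidate label."""
    return sum(state_weights.get((label, feat), 0.0) for feat in active_features)

# A TitleCased token whose previous word is "calle".
active = ['-1:word.lower():calle', 'word.istitle()']
scores = {label: state_score(label, active) for label in ('O', 'I-LOC')}
print(scores)
```

With only these features, `I-LOC` receives a positive score and `O` a strongly negative one, which matches the intuition given in the list.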


In [None]:
crf_sparse = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=20,
    c2=0.0624535538687852,
    max_iterations=100,
    all_possible_transitions=True,
)
crf_sparse.fit(X_train, y_train);

eli5.show_weights(crf_sparse, top=30)

## Customize weights visualization

In [None]:
eli5.show_weights(crf, top=10, targets=['O', 'B-ORG', 'I-ORG'])

In [None]:
# Check if a feature function works as intended
# (raw string so that the regex escape \. is passed through unchanged)
eli5.show_weights(crf, top=10, feature_re=r'^word\.is',
                  horizontal_layout=False, show=['targets'])

In [None]:
expl = eli5.explain_weights(crf, top=5, targets=['O', 'B-LOC', 'I-LOC'])
print(eli5.format_as_text(expl))

#### <span style="color:red">Exercise 6</span>

#### Comparative study between NLTK NER, SpaCy NER and CRF NER on the GMB dataset

We consider the GMB (Groningen Meaning Bank) corpus.
<br> **(1)** Load the sentences of the corpus (see below)
<br> **(2)** Provide counts on the chunk tags for the train and test sets
<br> **(3)** Apply the NLTK NER, SpaCy NER and CRF NER to a set of test sentences from the GMB corpus.
<br> **(4)** Evaluate the precision, recall and F-measure for each entity. Provide separate metrics and their average.
<br> **(5)** Also give these metrics over all classes.
<br> **(6)** Compare the three NER approaches
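
As a reminder of what step (4) asks for, here is a minimal token-level sketch of per-label precision, recall and F-measure. The function `per_label_prf` and the toy label sequences are hypothetical; a real evaluation would use the flattened gold and predicted tags of the test set (or a library routine such as `flat_classification_report`):

```python
from collections import Counter

def per_label_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed token by token."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[gold] += 1   # gold label was missed
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        report[label] = (p, r, f)
    return report

# Toy example: one I-per token is wrongly predicted as O.
gold = ['B-geo', 'O', 'B-per', 'I-per', 'O']
pred = ['B-geo', 'O', 'B-per', 'O', 'O']
report = per_label_prf(gold, pred)
for label, (p, r, f) in report.items():
    print("{:6s} P={:.2f} R={:.2f} F={:.2f}".format(label, p, r, f))
```

Averaging the per-label values gives a macro average; weighting them by the number of gold tokens per label gives a weighted average.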

## Read GMB data

In [None]:
import pandas as pd

In [None]:
# Reading the csv file
df = pd.read_csv('../../../data/GMB/ner_dataset.csv', 
                 encoding = "ISO-8859-1")
df.head(10)

In [None]:
df.describe()

In [None]:
# Displaying the unique tags
df['Tag'].unique()

In [None]:
# Checking the null values if any
#print(df.head(10))
#print(df.isnull().sum())
# Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
df = df.ffill()
df.head(10)
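
What the forward fill does can be shown without pandas. The sketch below assumes, as in the usual layout of this CSV, that the sentence identifier is only present on the first row of each sentence; the helper `forward_fill` and the toy rows are hypothetical:

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value."""
    filled, last_seen = [], None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled

# Toy column: the identifier appears only where a sentence starts.
sentence_ids = ['Sentence: 1', None, None, 'Sentence: 2', None]
words = ['London', 'is', 'big', 'Paris', 'too']

filled_ids = forward_fill(sentence_ids)

# Once every row carries its identifier, rows can be grouped by sentence.
grouped = {}
for sent_id, word in zip(filled_ids, words):
    grouped.setdefault(sent_id, []).append(word)
print(grouped)
```

This is why the fill has to happen before grouping by the `Sentence #` column: otherwise only the first token of each sentence would belong to a group.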

In [None]:
# A class that groups the rows into sentences (each sentence = list of (word, POS, tag) tuples)
class sentence(object):
    def __init__(self, df):
        self.n_sent = 1
        self.df = df
        self.empty = False
        agg = lambda s : [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                       s['POS'].values.tolist(),
                                                       s['Tag'].values.tolist())]
        self.grouped = self.df.groupby("Sentence #").apply(agg)
        self.sentences = [s for s in self.grouped]

    def get_text(self):
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:  # no sentence with this number
            return None

In [None]:
getter = sentence(df)

# Displaying one full sentence
sentences = [" ".join([s[0] for s in sent]) for sent in getter.sentences]
sentences[0]

# Sentence with its pos and tag
sent = getter.get_text()
print(sent)

# Getting all the sentences in the dataset
sentences = getter.sentences