# Evaluation of Spatial Nominal Entity Recognition models

This notebook evaluates the models trained for Spatial Nominal Entity Recognition, as proposed in:

> Amine Medad, Mauro Gaio, Ludovic Moncla, Sébastien Mustière, and Yannick Le Nir. Comparing supervised learning algorithms for Spatial Nominal Entity recognition. The 23rd AGILE International Conference on Geographic Information Science. 2020

The paper presents a methodology for comparing five supervised machine learning algorithms for the automatic identification of Spatial Nominal Entities (SNoE) in raw text. Following the transfer learning (TL) principle, the algorithms take pre-trained word embeddings (WEs) as input. These WEs come from the FastText model pre-trained on a large corpus of generic French texts. FastText was chosen because it produced better results than other equivalent WE models on so-called morphologically rich languages such as French.

The experimental results demonstrate (1) the feasibility of our approach for the SNoE recognition task and (2) the importance of context for this kind of task. Thanks to transfer learning, we were able to show that methodological and algorithmic choices can be tested on small corpora.

In [1]:
import random
import pandas as pd
import numpy as np
import treetaggerwrapper
from keras.models import load_model
from gensim.models import fasttext
from joblib import load
from sklearn.decomposition import PCA
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score



In [2]:
def sentences_to_ngrams(sentences, ngram_size, fr_nouns_file):

    ngrams = []
    context_size = int(ngram_size / 2)
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr', TAGINENC='utf-8', TAGOUTENC='utf-8')

    with open(fr_nouns_file, "r") as file:
        fr_nouns = file.readlines()

    for s in sentences:
        s = s.replace(';', '')
        s = s.replace("’", chr(39))  # normalise typographic apostrophes to the ASCII apostrophe
        s = s.replace('\'', chr(39))
        s = s.replace("d\'", " deeee ")
        s = s.replace("l\'", " leeee ")

        sentence_tagged = treetaggerwrapper.make_tags(tagger.tag_text(s))

        try:
            sentence = list(np.array(sentence_tagged)[:, 0])  # getting only the token (not lemmas and POS)
        except IndexError:
            pass
            
        for i, token in enumerate(sentence):
            if token == "leeee":
                sentence[i] = "l\'"
            if token == 'deeee':
                sentence[i] = "d\'"

        index_left = sentence.index('[')
        index_right = sentence.index(']')

        phrase_ngram = []

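        # add left context
        # NOTE: a negative index does not raise IndexError in Python; it wraps around
        # to the end of the sentence. The random-noun fallback below is therefore only
        # reached when the offset exceeds the sentence length.
        for i in range(context_size):
            try:
                phrase_ngram.append(sentence[index_left - context_size + i])
            except IndexError:
                # when there are not enough words (e.g. pivot word starting the sentence)
                phrase_ngram.append(random.choice(fr_nouns).rstrip())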

        # add pivot token(s) (can contain several tokens)
        phrase_ngram.append(' '.join(sentence[index_left + 1:index_right]))

        # add right context
        for i in range(context_size):
            try:
                phrase_ngram.append(sentence[index_right + 1 + i])
            except IndexError:
                # when there are not enough words (e.g. pivot word ending the sentence)
                phrase_ngram.append(random.choice(fr_nouns).rstrip())

        ngrams.append(phrase_ngram)

    return ngrams
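
To make the windowing logic easier to follow, here is a minimal sketch of the same idea on tokens that are already split. It is an illustration under assumptions, not the function used for the results: the name `window_around_pivot` is hypothetical, tokens are split on whitespace instead of being tagged with TreeTagger, and padding is done by counting the available tokens rather than by catching `IndexError`.

```python
import random


def window_around_pivot(tokens, context_size, filler_nouns, rng=random):
    """Return [left context..., pivot, right context...] for a bracketed pivot.

    Hypothetical helper: `tokens` is a list of strings in which the pivot
    expression is delimited by the tokens '[' and ']'.
    """
    left = tokens.index('[')
    right = tokens.index(']')

    # The pivot may span several tokens; they are joined into one string.
    pivot = ' '.join(tokens[left + 1:right])

    before = tokens[:left]
    after = tokens[right + 1:]

    # Guard against context_size == 0, because before[-0:] is the whole list.
    left_ctx = before[-context_size:] if context_size else []
    right_ctx = after[:context_size]

    # Pad missing positions with random nouns, as the notebook does.
    left_pad = [rng.choice(filler_nouns) for _ in range(context_size - len(left_ctx))]
    right_pad = [rng.choice(filler_nouns) for _ in range(context_size - len(right_ctx))]

    return left_pad + left_ctx + [pivot] + right_ctx + right_pad


tokens = "la balade autour du [ lac ] ou plus loin".split()
print(window_around_pivot(tokens, 2, ["maison"]))
```

With `ngram_size = 5` the context size is 2, so the window holds two tokens on each side of the pivot. With `ngram_size = 1` the context size is 0 and only the pivot is kept.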

In [3]:
def vectorization(ngram_size, input_data, we_vector_size, fasttext_wv):

    data_vec = np.array([])

    for phrase in input_data:
        phrase_vec = np.array([])

        for word in phrase:
            word = word.replace("’", "\'")
            vec = fasttext_wv[word]
            phrase_vec = np.append(phrase_vec, vec)

        data_vec = np.append(data_vec, phrase_vec)

    data_vec = np.reshape(data_vec, (len(input_data), ngram_size, we_vector_size))

    return data_vec
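
The shape of the vectorised data matters for the models evaluated later: the GRU consumes a 3D array, while the other algorithms are given a flattened 2D array. The sketch below illustrates both shapes. The function `toy_lookup` is a hypothetical stand-in for the fastText lookup (it does not produce real embeddings), and the vector size of 4 is chosen only to keep the example small.

```python
import numpy as np


def toy_lookup(word, vector_size=4):
    """Hypothetical embedding: every component equals the word length."""
    return np.full(vector_size, float(len(word)))


def vectorize(ngrams, lookup):
    """Stack word vectors into shape (n_samples, ngram_size, vector_size)."""
    return np.array([[lookup(word) for word in phrase] for phrase in ngrams],
                    dtype=np.float64)


ngrams = [
    ['autour', 'du', 'lac', 'ou', 'plus'],
    ['vers', 'le', 'col', 'puis', 'descendre'],
]

x = vectorize(ngrams, toy_lookup)
print(x.shape)      # 3D input, as expected by a recurrent model

# Flattened view: one row of ngram_size * vector_size features per sample.
flat = x.reshape(len(x), -1)
print(flat.shape)
```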

In [4]:
input_data = './data/corpus_validation.csv'
train_corpus_filepath = './data/corpus_train.csv'
we_vector_size = 300
fr_nouns_file = './data/French_nouns.txt'
model_fasttext = './data/cc.fr.300.bin'

keras_models = ['GRU', 'MLP_PCA', 'MLP_AE']

np.random.seed(1)

In [5]:
print('** Load input data... \n')
df = pd.read_csv(input_data, delimiter=';', names=['idf', 'labels', 'sentences', 'pivot_words', 'src', 'alea'])

print(df.head(5))

y_test = df['labels']

** Load input data... 

   idf  labels                                          sentences  \
0  166       1  la balade peut se poursuivre autour du lac ou ...   
1  303       1  le sentier grimpe au-dessus du hameau avec un ...   
2  199       1  ( 9 ) poursuivre la descente vers la droite en...   
3  394       1  continuer un petit peu sur l'arête puis descen...   
4  313       1  (3) À la [patte d'oie], laisser le départ à dr...   

         pivot_words                src               alea  
0               cols  corpus_validation  0,085618299510875  
1            passage  corpus_validation  0,295723408093251  
2             église  corpus_validation  0,942577511847241  
3  sentier de montée  corpus_validation  0,543919977290256  
4        patte d'oie  corpus_validation    0,8779708338724  


In [6]:
print("** Loading fastText model...\n")
fasttext_model = fasttext.load_facebook_vectors(model_fasttext)

** Loading fastText model...



In [7]:
def preprocess_input(sentences, ngram_size, fr_nouns_file, fasttext_model, we_vector_size):
    print('** Transform sentences to ' + str(ngram_size) + ' ngrams... \n')
    ngrams_list = sentences_to_ngrams(sentences, ngram_size, fr_nouns_file)
    #print(ngrams_list)

    print('** Vectorisation of inputs... \n')
    x_test = vectorization(ngram_size, ngrams_list, we_vector_size, fasttext_model)
    
    return x_test

In [13]:
def loadmodel(model_path, algorithm, keras_models):
    print('** Loading model ' + model_path + ' \n')

    if algorithm in keras_models:
        clf = load_model(model_path)
    else:
        clf = load(model_path)
        
    return clf


def prediction(clf, x_test, y_test, ngram_size, we_vector_size, algorithm):
    print('** Predicting... \n')

    if algorithm in ('RF', 'SVM', 'MLP_AE', 'MLP_PCA'):
        x_test = np.reshape(x_test, (len(x_test), ngram_size * we_vector_size))

    if algorithm == 'MLP_PCA':
        df_train = pd.read_csv(train_corpus_filepath, delimiter=';', names=['idf', 'labels', 'sentences', 'pivot_words', 'alea'])
        
        x_train = preprocess_input(df_train['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
        x_train = np.reshape(x_train, (len(x_train), ngram_size * we_vector_size))

        # number of principal components retained for each n-gram size
        #pca = PCA(0.99)
        n_components = {1: 87, 5: 295, 7: 369}[ngram_size]
        pca = PCA(n_components=n_components, random_state=1)
        
        pca.fit(x_train)
        
        x_test = pca.transform(x_test)

    if algorithm in keras_models:
        #y_pred = clf.predict_classes(x_test)
        score = clf.evaluate(x_test, y_test)
        accuracy = score[1]

    if algorithm in ('RF', 'SVM'):
        #y_pred = clf.predict(x_test)
        accuracy = clf.score(x_test, y_test)
        
    #precision = precision_score(y_test, y_pred)
    #recall = recall_score(y_test, y_pred)
    #f1 = f1_score(y_test, y_pred)
    #accuracy = accuracy_score(y_test, y_pred)

    return accuracy
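
The commented-out lines in `prediction` hint at precision, recall and F1 in addition to accuracy. The sketch below shows how these quantities can be derived from binary predictions in plain Python. The helper names `threshold` and `binary_metrics` are hypothetical, and the labels and probabilities are toy values, not taken from the validation corpus. For binary labels, scikit-learn's `precision_score`, `recall_score` and `f1_score` should give the same quantities with their default settings. For a Keras model with a sigmoid output, class labels can be obtained by thresholding the output of `predict`, which is what the first helper assumes.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn sigmoid outputs into 0/1 labels (assumed cutoff of 0.5)."""
    return [1 if p > cutoff else 0 for p in probabilities]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)

    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Toy data for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.7, 0.1, 0.8, 0.6]

y_pred = threshold(y_prob)
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```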

# GRU

For GRU models, the hyper-parameters include the number of GRU units (5, 10, 100, 1000, 100), the GRU activation function (hyperbolic tangent), the recurrent activation function (hyperbolic tangent), dropout (0.0, 0.3, 0.5, 0.8, 0.9), recurrent dropout (0.0, 0.3, 0.5, 0.8, 0.9), the dense activation function (hyperbolic tangent, sigmoid), the number of training epochs (500, 1000, 2000), and the optimiser (adam) with a learning rate of 0.001.

In [14]:
model_path = './models/GRU_1gram.h5'
algorithm = 'GRU'
ngram_size = 1

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 1 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/GRU_1gram.h5 

** Predicting... 

accuracy:  0.6701030731201172


In [15]:
model_path = './models/GRU_5grams.h5'
algorithm = 'GRU'
ngram_size = 5

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 5 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/GRU_5grams.h5 

** Predicting... 

accuracy:  0.7628865838050842


In [16]:
model_path = './models/GRU_7grams.h5'
algorithm = 'GRU'
ngram_size = 7

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 7 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/GRU_7grams.h5 

** Predicting... 

accuracy:  0.7938144207000732


# MLP + PCA

For MLP+PCA, the hyper-parameters include the activation function for each layer (Exponential Linear Unit, Rectified Linear Unit, Softplus), the output layer activation function (sigmoid), dropout (0.0, 0.3, 0.5), PCA information, and the optimiser (adam) with a learning rate of 0.001.
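
In the `prediction` function, the PCA projection is fitted on the training corpus and then applied to the validation data, which is why the training sentences are vectorised again in the cells below. The sketch illustrates this fit-on-train, transform-on-test pattern. It is an assumption-laden illustration: it uses a plain NumPy SVD instead of scikit-learn's `PCA`, the helper names `fit_pca` and `apply_pca` are hypothetical, and the data are random toy arrays.

```python
import numpy as np


def fit_pca(x_train, n_components):
    """Learn the mean and the leading principal axes from training data."""
    mean = x_train.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(x, mean, components):
    """Project data with parameters learned on the training set only."""
    return (x - mean) @ components.T


rng = np.random.default_rng(1)
x_train = rng.normal(size=(50, 20))   # toy stand-in for flattened n-grams
x_test = rng.normal(size=(10, 20))

mean, components = fit_pca(x_train, n_components=5)
x_test_reduced = apply_pca(x_test, mean, components)
print(x_test_reduced.shape)
```

The test data never influence the projection, so the reduced features stay comparable with those the classifier saw during training.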

In [17]:
model_path = './models/MLP_PCA_1gram.h5'
algorithm = 'MLP_PCA'
ngram_size = 1

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 1 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/MLP_PCA_1gram.h5 

** Predicting... 

** Transform sentences to 1 ngrams... 

** Vectorisation of inputs... 

accuracy:  0.48969072103500366


In [18]:
model_path = './models/MLP_PCA_5grams.h5'
#model_path = 'MLP_ACP_models/MLP_PCA_relu_relu_sigmoid_32_6_adam_dynamique_0.5_5.h5'
algorithm = 'MLP_PCA'
ngram_size = 5

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 5 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/MLP_PCA_5grams.h5 

** Predicting... 

** Transform sentences to 5 ngrams... 

** Vectorisation of inputs... 

accuracy:  0.6546391844749451


In [14]:
model_path = './models/MLP_PCA_7grams.h5'
algorithm = 'MLP_PCA'
ngram_size = 7

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 7 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/MLP_PCA_7grams.h5 

** Predicting... 

** Transform sentences to 7 ngrams... 

** Vectorisation of inputs... 

accuracy:  0.592783510684967


# MLP + AE

For MLP+AE, the hyper-parameters are the same as for MLP+PCA, except for the dropout (0.0, 0.5, 0.9) and the dimension of the encoding layer (500).

In [19]:
model_path = './models/MLP_AE_1gram.h5'
algorithm = 'MLP_AE'
ngram_size = 1

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 1 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/MLP_AE_1gram.h5 

** Predicting... 

accuracy:  0.6855670213699341


In [20]:
model_path = './models/MLP_AE_5grams.h5'
algorithm = 'MLP_AE'
ngram_size = 5

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 5 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/MLP_AE_5grams.h5 

** Predicting... 

accuracy:  0.7474226951599121


In [21]:
model_path = './models/MLP_AE_7grams.h5'
algorithm = 'MLP_AE'
ngram_size = 7

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 7 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/MLP_AE_7grams.h5 

** Predicting... 

accuracy:  0.7680412530899048


# Random Forest

For the RF, the hyper-parameters include the number of trees in the forest (50, 60, 70, 80, 100, 200, 300, 500), the maximum depth of the tree (1, 3, 6, 12, 15, 20, 22, 25, 27, 29, 32, 34, 36, 38, 40, 43, 46, 48, 50, 60, 65, 70, 75, 80), and the function used to measure the quality of a split (Gini impurity, Entropy).
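
To give a sense of the size of this search space, the sketch below enumerates every combination of the values listed above. Whether the models were tuned with an exhaustive grid search is an assumption. The keys `n_estimators`, `max_depth` and `criterion` follow scikit-learn's `RandomForestClassifier` parameter names, and `expand_grid` is a hypothetical helper.

```python
from itertools import product

rf_grid = {
    'n_estimators': [50, 60, 70, 80, 100, 200, 300, 500],
    'max_depth': [1, 3, 6, 12, 15, 20, 22, 25, 27, 29, 32, 34,
                  36, 38, 40, 43, 46, 48, 50, 60, 65, 70, 75, 80],
    'criterion': ['gini', 'entropy'],
}


def expand_grid(grid):
    """Return one dict per combination of hyper-parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]


candidates = expand_grid(rf_grid)
print(len(candidates))   # 8 * 24 * 2 combinations
print(candidates[0])
```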

In [22]:
model_path = './models/RF_1gram.joblib'
algorithm = 'RF'
ngram_size = 1

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 1 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/RF_1gram.joblib 





** Predicting... 

accuracy:  0.7061855670103093




In [23]:
model_path = './models/RF_5grams.joblib'
algorithm = 'RF'
ngram_size = 5

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 5 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/RF_5grams.joblib 

** Predicting... 

accuracy:  0.7268041237113402




In [20]:
model_path = './models/RF_7grams.joblib'
algorithm = 'RF'
ngram_size = 7

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 7 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/RF_7grams.joblib 

** Predicting... 

accuracy:  0.7474226804123711




# SVM

For the SVM, the hyper-parameters include the kernel type (Polynomial, Linear, Sigmoid, Radial Basis Function), the regularisation parameter (1e-3, 1e-2, 1e-1, 0.5, 1, 10, 100), and the kernel coefficient gamma (1e-3, 1e-2, 1e-1, 1, 10, 100, 1000, scale).

In [24]:
model_path = './models/SVM_1gram.joblib'
algorithm = 'SVM'
ngram_size = 1

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 1 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/SVM_1gram.joblib 

** Predicting... 

accuracy:  0.6907216494845361




In [25]:
model_path = './models/SVM_5grams.joblib'
algorithm = 'SVM'
ngram_size = 5

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 5 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/SVM_5grams.joblib 

** Predicting... 

accuracy:  0.7525773195876289




In [26]:
model_path = './models/SVM_7grams.joblib'
algorithm = 'SVM'
ngram_size = 7

x_test = preprocess_input(df['sentences'], ngram_size, fr_nouns_file, fasttext_model, we_vector_size)
model = loadmodel(model_path, algorithm, keras_models)

score = prediction(model, x_test, y_test, ngram_size, we_vector_size, algorithm)
print('accuracy: ', score)

** Transform sentences to 7 ngrams... 

** Vectorisation of inputs... 

** Loading model ./models/SVM_7grams.joblib 

** Predicting... 





accuracy:  0.7268041237113402
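
The accuracies printed in the cells above can be gathered in one structure to compare algorithms and n-gram sizes. The sketch below copies the values reported in this notebook, rounded to four decimals. The variable and function names are hypothetical.

```python
# Validation accuracies reported above, rounded to four decimals.
accuracies = {
    'GRU':     {1: 0.6701, 5: 0.7629, 7: 0.7938},
    'MLP_PCA': {1: 0.4897, 5: 0.6546, 7: 0.5928},
    'MLP_AE':  {1: 0.6856, 5: 0.7474, 7: 0.7680},
    'RF':      {1: 0.7062, 5: 0.7268, 7: 0.7474},
    'SVM':     {1: 0.6907, 5: 0.7526, 7: 0.7268},
}


def best_ngram(scores):
    """Return the n-gram size with the highest accuracy."""
    return max(scores, key=scores.get)


best_per_algorithm = {name: best_ngram(scores)
                      for name, scores in accuracies.items()}
best_algorithm = max(accuracies, key=lambda name: max(accuracies[name].values()))

for name, scores in accuracies.items():
    size = best_per_algorithm[name]
    print(f'{name:8s} best n-gram size: {size} (accuracy {scores[size]:.4f})')
print('highest accuracy overall:', best_algorithm)
```

Every algorithm scores higher with context (5 or 7 tokens) than with the pivot alone, which is consistent with the importance of context stated in the introduction.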
