# DeepEurovoc

The goal of this experiment is to predict Eurovoc codes from the abstracts of expressions published by the Publications Office of the European Union (PO).

The model used in this notebook closely follows the 1-dimensional convolutional neural network described in the Keras blog article https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html, with dropout layers added and a sigmoid output layer for multi-label prediction. It uses pre-trained GloVe word embeddings for text classification.

The input is retrieved from PO's public SPARQL endpoint, available at http://publications.europa.eu/webapi/rdf/sparql. The following query does the job:

The input data is stored in data.csv. Before loading the data, we import the modules we will need throughout this notebook.

In [1]:
import pandas as pd
import numpy as np
import os
import sys
import xml.etree.ElementTree as ET
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPooling1D, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
from keras.initializers import Constant

from sklearn.preprocessing import MultiLabelBinarizer

Definition of some global variables used in this notebook:

In [2]:
CLEANUP_DATA = True 
MAX_NUM_WORDS = 20000 # size of vocabulary
EMBEDDING_DIM = 100 # dimension of GloVe word embeddings
MAX_SEQUENCE_LENGTH = 1000 # truncate abstracts after MAX_SEQUENCE_LENGTH words

Load the data and get some numbers...

In [3]:
data_df = pd.read_csv("data.csv")
print(data_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35472 entries, 0 to 35471
Data columns (total 3 columns):
exp         35472 non-null object
abstract    35472 non-null object
concepts    35472 non-null object
dtypes: object(3)
memory usage: 831.5+ KB
None


We define a function that extracts the abstracts as plain text from the XMLLiterals returned by Virtuoso, drops stop words and any token that is not alphabetic or not known to WordNet (a rough filter for non-English words), lower-cases the remaining tokens, and lemmatizes them.

In [4]:
def cleanup_abstract(xmlstring):
    # collapse doubled double quotes before parsing
    xmlstring = xmlstring.replace('""', '"')
    try:
        tree = ET.ElementTree(ET.fromstring(xmlstring))
        xpath_result = tree.findall(".//description")
        text = xpath_result[0].text
    except (ET.ParseError, IndexError):
        # not parseable as XML, or no <description> element: use the raw value
        text = xmlstring
    # remove stopwords, punctuation and words unknown to WordNet. lower case everything
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    tokens = [w.lower() for w in tokens if w not in stop_words and w.isalpha() and wordnet.synsets(w)]
    # lemmatize
    lemma = WordNetLemmatizer()
    final_tokens = [lemma.lemmatize(word) for word in tokens]
    return " ".join(final_tokens)
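The extraction step on its own can be tried without NLTK. The following is a minimal sketch using only the standard library; the helper name `extract_description` and the sample literal are made up for illustration, and real literals returned by the endpoint may look different.

```python
import xml.etree.ElementTree as ET


def extract_description(xmlstring):
    """Return the text of the first <description> element, or the input as a fallback."""
    # collapse doubled double quotes before parsing
    xmlstring = xmlstring.replace('""', '"')
    try:
        root = ET.fromstring(xmlstring)
    except ET.ParseError:
        # not well-formed XML: treat the value as plain text
        return xmlstring
    node = root.find(".//description")
    if node is None or node.text is None:
        return xmlstring
    return node.text


# Hypothetical sample values, not taken from data.csv.
sample = '<abstract lang=""en""><description>Statistics in focus.</description></abstract>'
plain = extract_description(sample)
fallback = extract_description("Already plain text & more")
print(plain)
print(fallback)
```

Unlike the notebook function, this sketch also falls back to the raw string when the `description` element is empty, which avoids passing `None` to the tokenizer.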

Cleaning the data involves two steps:

1. Clean the abstracts.
2. Transform the ";"-separated Eurovoc codes in the concepts column into lists of codes (here shortened to the first two characters of each code).

In [5]:
if CLEANUP_DATA:
    data_df["cleaned_abstract"] = data_df["abstract"].apply(cleanup_abstract)
    #data_df["concept_list"] = data_df["concepts"].apply(lambda x: [c for c in x.split(";")])
    data_df["concept_list"] = data_df["concepts"].apply(lambda x: list({c[c.rfind("/")+1:c.rfind("/")+3] for c in x.split(";")}))
    concepts_list = data_df["concepts"].tolist()
    data_df = data_df.drop(["abstract"], axis=1)
    data_df.to_pickle("data_df.pkl")
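To make the label transformation concrete, here is a small sketch of the same slicing logic in plain Python. The function name `concept_prefixes` and the URIs are hypothetical; only the rule "take the first two characters after the last slash and de-duplicate" comes from the cell above.

```python
def concept_prefixes(concepts, width=2):
    """Map a ';'-separated string of concept URIs to sorted, de-duplicated prefixes."""
    prefixes = set()
    for uri in concepts.split(";"):
        local_id = uri.rsplit("/", 1)[-1]  # part after the last "/"
        prefixes.add(local_id[:width])
    return sorted(prefixes)


# Hypothetical URIs, made up for illustration only.
row = ";".join([
    "http://example.org/concept/3711",
    "http://example.org/concept/3742",
    "http://example.org/concept/4016",
])
prefix_list = concept_prefixes(row)
print(prefix_list)
```

Sorting the set is a small deviation from the notebook: it makes the order of each list deterministic, which does not matter for the binarizer but makes printed output reproducible.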

The next cell loads the cleaned-up data directly from file instead of recomputing it every time the notebook is executed.

In [6]:
data_df = pd.read_pickle("data_df.pkl")
print(data_df['cleaned_abstract'][:5])
print(data_df['concept_list'][:5])

0    at institute reference material measurement di...
1    statistic focus describes preliminary result b...
2    short document aim provide summary main issue ...
3    briefing note intended provide european parlia...
4    exterior de la sus la de para de en la en para...
Name: cleaned_abstract, dtype: object
0            [37, 40, 11, 53, 25, 34]
1                        [42, 46, 63]
2    [53, 52, 18, 43, 24, 19, 44, 17]
3        [11, 52, 36, 10, 93, 13, 44]
4             [14, 42, 27, 13, 31, 9]
Name: concept_list, dtype: object


We must convert the data set labels to numeric vectors so that they can be processed by Keras. The approach is described in https://www.pyimagesearch.com/2018/05/07/multi-label-classification-with-keras/.

In [None]:
labels = data_df["concept_list"].tolist()

mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(labels)
 
# loop over each of the possible class labels and show them
for (i, label) in enumerate(mlb.classes_):
    print("{}. {}".format(i + 1, label))
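What the binarizer computes can be illustrated with a few lines of plain Python. This is a re-implementation for illustration only, not scikit-learn's code; the helper names and the toy label lists are assumptions.

```python
def fit_classes(label_lists):
    """Collect the sorted set of all labels seen in the data."""
    return sorted({label for labels in label_lists for label in labels})


def to_multi_hot(labels, classes):
    """Encode one list of labels as a 0/1 vector over `classes`."""
    wanted = set(labels)
    return [1 if c in wanted else 0 for c in classes]


def from_multi_hot(vector, classes):
    """Decode a 0/1 vector back into the labels it marks."""
    return [c for c, flag in zip(classes, vector) if flag]


toy_labels = [["37", "40"], ["40", "53", "11"]]  # made-up label lists
classes = fit_classes(toy_labels)
encoded = [to_multi_hot(labels, classes) for labels in toy_labels]
print(classes)
print(encoded)
```

Each row may contain several ones, which is what distinguishes a multi-label (multi-hot) encoding from a one-hot encoding.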

In [None]:
#labels2 = [["a","b","c"],["c","d","e"]]

#mlb2 = MultiLabelBinarizer()
#labels2 = mlb2.fit_transform(labels2)

#print(list(mlb2.classes_))
#print(mlb2.transform([["a","b"],["c","d","e"]]))
#print(mlb2.inverse_transform(mlb2.transform([["a","b"],["c","d","e"]])))

#y = data_df["concept_list"].tolist()[0]
#print("y = ", y)
#print("len(y) = ", len(y))
#print("mlb.transform([y]) = ", mlb.transform([y]))
#print("len(mlb.transform([y])[0]) = ", len(mlb.transform([y])[0]))
#print("mlb.transform([y])[0].tolist().count(1) = ", mlb.transform([y])[0].tolist().count(1))
#print("mlb.inverse_transform(mlb.transform([y])) = ", mlb.inverse_transform(mlb.transform([y])))
#print("len(mlb.classes_) = ", len(mlb.classes_))

A small test to make sure that we are really getting multi-label vectors and not just one-hot vectors.

In [None]:
print("Labels of the 2nd training example: " + str(mlb.inverse_transform(np.array([labels[1]]))))

Next we need to transform the input into an array of numbers:

In [None]:
data = data_df["cleaned_abstract"].tolist()

#print(len(data[0]))
#print(type(data[0]))
#print(data[0])
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(data)
sequences = tokenizer.texts_to_sequences(data)

#print(len(sequences[0]))
#print(sequences[0])

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
#labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
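The two steps above (indexing words by frequency, then padding to a fixed length) can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it splits on whitespace only and breaks frequency ties alphabetically, and all helper names are hypothetical. It does mirror the Keras defaults of padding and truncating at the front of the sequence.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is the padding value)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequence(text, word_index, num_words):
    """Replace words by their index, dropping unknown words and indices >= num_words."""
    ids = (word_index.get(word) for word in text.split())
    return [i for i in ids if i is not None and i < num_words]


def pad_pre(sequence, maxlen):
    """Left-pad with zeros, or keep only the last maxlen entries if the sequence is too long."""
    trimmed = sequence[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed


texts = ["report on energy policy", "energy market report", "energy"]  # toy corpus
toy_word_index = build_word_index(texts)
padded = [pad_pre(to_sequence(t, toy_word_index, num_words=4), maxlen=4) for t in texts]
print(toy_word_index)
print(padded)
```

Note that with `num_words=4` only the indices 1 to 3 survive, so the vocabulary limit keeps one word fewer than its value suggests.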


Split the data into a training set and a validation set (the validation set is stored in testX and testY).

In [None]:
VALIDATION_SPLIT = 0.2

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
print(num_validation_samples)
trainX = data[:-num_validation_samples]
trainY = labels[:-num_validation_samples]
testX = data[-num_validation_samples:]
testY = labels[-num_validation_samples:]

print(data.shape)
print(labels.shape)
print(trainX.shape)
print(trainY.shape)
print(testX.shape)
print(testY.shape)
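The split above depends on NumPy's global random state, so it changes between runs. A minimal sketch of a seeded variant in plain Python follows; the function name `split_indices` and the seed value are assumptions, not part of the notebook.

```python
import random


def split_indices(n_samples, validation_split, seed):
    """Shuffle 0..n_samples-1 with a fixed seed and hold out the tail for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(validation_split * n_samples)
    cut = n_samples - n_val
    return indices[:cut], indices[cut:]


train_idx, val_idx = split_indices(n_samples=10, validation_split=0.2, seed=42)
print(len(train_idx), len(val_idx))
```

Slicing at `n_samples - n_val` instead of `-n_val` also behaves sensibly when the validation set is empty, because a slice ending at `-0` would return no training rows at all.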

Load the pre-computed GloVe word embeddings from file and create an embeddings_index:

In [None]:
embeddings_index = {}
with open(os.path.join('glove.6B', 'glove.6B.100d.txt'), 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
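The GloVe text format stores one token per line, followed by its space-separated vector components. The parsing logic can be checked on an in-memory file, as in this sketch; the three-dimensional toy vectors and the helper `load_vectors` are invented for the example, and the extra check for malformed lines is an addition that the notebook cell does not perform.

```python
import io


def load_vectors(handle, dim):
    """Parse 'word v1 v2 ... v_dim' lines into a dict, skipping malformed lines."""
    index = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # wrong number of fields for the expected dimension
        index[parts[0]] = [float(v) for v in parts[1:]]
    return index


# Toy data standing in for glove.6B.100d.txt.
toy_file = io.StringIO("energy 0.1 -0.2 0.3\nmarket 0.5 0.5 0.5\nbroken 1.0\n")
toy_vectors = load_vectors(toy_file, dim=3)
print(sorted(toy_vectors))
```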

Use word_index and embeddings_index to compute the embedding_matrix. Row i of embedding_matrix stores the embedding vector of the word with index i in word_index.

In [None]:
num_words = min(MAX_NUM_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
    #else:
    #    print("Not in embedding index: " + word)
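It can be useful to know how many vocabulary words actually receive a pre-trained vector. The sketch below rebuilds the matrix with nested lists and also returns that coverage; the function name, the toy inputs, and the coverage figure itself are illustrative additions, not part of the notebook.

```python
def build_embedding_matrix(word_index, vectors, max_words, dim):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    num_rows = min(max_words, len(word_index) + 1)
    matrix = [[0.0] * dim for _ in range(num_rows)]
    hits = 0
    for word, i in word_index.items():
        if i >= num_rows:
            continue  # word falls outside the vocabulary limit
        vector = vectors.get(word)
        if vector is not None:
            matrix[i] = list(vector)
            hits += 1
    # fraction of kept words (indices 1..num_rows-1) that have a vector
    coverage = hits / max(1, num_rows - 1)
    return matrix, coverage


toy_index = {"energy": 1, "report": 2, "market": 3, "eurovoc": 4}
toy_vecs = {"energy": [0.1, -0.2, 0.3], "market": [0.5, 0.5, 0.5]}
matrix, coverage = build_embedding_matrix(toy_index, toy_vecs, max_words=4, dim=3)
print(len(matrix), round(coverage, 2))
```

A low coverage would suggest that many tokens are absent from GloVe and are fed to the network as zero vectors.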

Build a Keras embedding layer. Note that trainable=False, i.e., the pre-trained weights are not updated during training.

In [None]:
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Let's define the 1D convolutional model.

In [None]:
from keras.layers import Dropout

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Dropout(0.25)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Dropout(0.25)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Dropout(0.25)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(labels.shape[1], activation='sigmoid')(x)

model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['categorical_accuracy'])
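The pool size of 35 in the last pooling layer follows from the sequence length at that point. The arithmetic can be traced with a short sketch, assuming the Keras defaults of `padding='valid'`, a convolution stride of 1, and a pooling stride equal to the pool size; the helper names are hypothetical.

```python
def conv1d_len(length, kernel_size):
    """Output length of a 1D convolution with 'valid' padding and stride 1."""
    return length - kernel_size + 1


def maxpool1d_len(length, pool_size):
    """Output length of 1D max pooling with 'valid' padding and stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1


length = 1000  # MAX_SEQUENCE_LENGTH
trace = [length]
for kernel_size, pool_size in [(5, 5), (5, 5), (5, 35)]:
    length = conv1d_len(length, kernel_size)
    trace.append(length)
    length = maxpool1d_len(length, pool_size)
    trace.append(length)
print(trace)  # [1000, 996, 199, 195, 39, 35, 1]
```

Because the third convolution leaves exactly 35 time steps, pooling over 35 reduces the sequence to a single step. If MAX_SEQUENCE_LENGTH changes, the hard-coded 35 no longer matches, and `GlobalMaxPooling1D` (already imported above) would be the more robust choice.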

Print a summary of the model, then train it.

In [None]:
model.summary()

In [None]:
history = model.fit(trainX, trainY, validation_data=(testX, testY),
          epochs=1, batch_size=128)


In [None]:
import matplotlib.pyplot as plt

acc = history.history['categorical_accuracy']
val_acc = history.history['val_categorical_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

In [None]:
print(data_df['cleaned_abstract'].tolist()[:2])
ex = data_df['cleaned_abstract'][0]
seq = tokenizer.texts_to_sequences([ex])  # wrap in a list; a bare string would be split into characters
seq = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
print("len(ex) = ", len(ex))
print("seq.shape = ", seq.shape)
print("Example: " + str(ex) + "\n")
print("Sequence: " + str(seq) + "\n")
prediction = model.predict(np.array(seq))[0]
prediction[prediction>=0.25] = 1
prediction[prediction<0.25] = 0
print("Prediction: " + str(prediction))
print(type(prediction))
print(prediction.shape)

mlb.inverse_transform(np.array([prediction]))
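Thresholding and decoding can also be done in one step that keeps the probabilities, which helps when choosing the cut-off. The sketch below uses the same threshold of 0.25 as the cell above; the function name and the toy probabilities are made up for illustration.

```python
def decode_prediction(probabilities, classes, threshold=0.25):
    """Return (label, probability) pairs that reach the threshold, most probable first."""
    picked = [(c, p) for c, p in zip(classes, probabilities) if p >= threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


toy_classes = ["11", "37", "40", "53"]
toy_probs = [0.05, 0.80, 0.30, 0.10]  # made-up sigmoid outputs
decoded = decode_prediction(toy_probs, toy_classes)
print(decoded)
```

In the notebook, `mlb.classes_` and the output of `model.predict` would take the place of the toy inputs.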

#proba = model.predict(np.array(seq))[0]
#idxs = np.argsort(proba)[::-1][:2]

