Veyn is a system for automatic identification of multiword expressions in running text submitted to the PARSEME shared task 2018. The model is first trained on a MWE-annotated corpus, and then can be applied to any new text to identify MWEs that are similar to those in the training corpus.
Veyn is based on a sequence tagger using recurrent neural networks. As input features it takes the lemmas and POS tags of words. We represent the output MWEs using a variant of the begin-inside-outside encoding scheme combined with the MWE category tag.
Veyn is implemented using Python's keras library
For more details, check the following scientific article:
Veyn was developed with python3 using the free libraries keras(2.4.0), tensorflow(1.8.0) and keras-contrib(2.0.8) (only to use --activationCRF). Download all the required libraries, and then simply clone this git repository.
When training the models, we used a directory named Model
to stock our system models. However, this directory is too large and we cannot push it on the github directory. You can create this repository with this command : mkdir Model
and then train the models.
We used the shared task corpora to create/tune this system. You can show data on the official website of the PARSEME shared task 2018 or directly on there gitlab repository.
./bin/Veyn.py -h
to show all commands in a terminal.
Command to create and train a model:
./bin/Veyn.py --file fileTest/trial-train.cupt --mode train --model Model/trial-model -cat
Command to load and test a model:
./bin/Veyn.py --file fileTest/trial-test.cupt --mode test --model Model/trial-model
To use Veyn in test mode, only options (--file, --mode, --model) are required.
Commands | Required | Definition |
---|---|---|
-h, --help | False | Helpers and print all commands in stdout |
-feat, --featureColumns | False | To treat columns as features. The first column is number 1, the second 2... By default, features are LEMME and POS, e.g 3 4 |
--mweTags | False | To give the number of the column containing tags (default 11) Careful! The first column is number 1, the second number 2, ... |
--embeddings | False | To give some files containing embeddings. First, you give the path of the file containing embeddings, and separate with a "," you gave the column concern by this file. eg: file1,2 file2,5 ... You could have only column match with featureColumns. |
--file | True | Give a file in the Extended CoNLL-U (.cupt) format. You can only give one file to train/test a model. You can give a CoNLL file to only test it. |
--mode | True | To choice the mode of the system : train/test. If the file is a train file and you want to create a model use 'train'. If the file is a test/dev file and you want to load a model use 'test'. In test mode the system doesn't need params RNN. |
--model | True | Name of the model which you want to save/load without extension. e.g 'nameModel' , and the system save/load files nameModel.h5, nameModel.json and nameModel.voc. nameModel.h5 is the model file. nameModel.voc is the vocabulary file. nameModel.args is the arguments file which train your model. |
--io | False | Option to use the representation of IO. You can combine with other options like --nogap or/and --cat. By default, the representation is BIO. |
-ng, --ngap | False | Option to use the representation of BIO/IO without gap. By default, the gap it is using to the representation of BIO/IO. |
-cat, --category | False | Option to use the representation of BIO/IO with categories. By default, the representation of BIO/IO is without categories. |
--sentences_per_batch | False | Option to initialize the size of mini batch for the RNN. By default, batch_size is 128. |
--max_sentence_size | False | Option to initialize the size of sentence for the RNN. By default, max_sentence_size is 200. |
--overlaps | False | Option to use the representation of BIO/IO with overlaps. We can't load a file test with overlaps, if option test and overlaps are activated, only the option test is considered. By default, the representation is without overlaps. |
--validation_split | False | Option to configure the validation_split to train the RNN. By default 0.3(30%) of train file is use to validation data. |
--validation_data | False | Give a file in the Extended CoNLL-U (.cupt) format to loss function for the RNN. |
--epochs | False | Number of epochs to train RNN. By default, RNN trains on 10 epochs. |
--recurrent_unit | False | This option allows choosing the type of recurrent units in the recurrent layer. By default it is biGRU. You can choice GRU, LSTM, biGRU, biLSTM. |
--number_recurrent_layer | False | This option allows choosing the numbers of recurrent layer. By default it is 2 recurrent layers. |
--size_recurrent_layer | False | This option allows choosing the size of recurrent layer. By default it is 512. |
--feat_embedding_size | False | Option that takes as input a sequence of integers corresponding to the dimension/size of the embeddings layer of each column given to the --feat option. By default, all embeddings have the same size, use the current default value (64) |
--early_stopping_mode | False | Option to save the best model training in function of acc/loss value, only if you use validation_data or validation_split. By default, it is in function of the loss value. |
--patience_early_stopping | False | Option to choice patience for the early stopping. By default, it is 5 epochs. |
--numpy_seed | False | Option to initialize manually the seed of numpy. By default, it is initialized to 42. |
--tensorflow_seed | False | Option to initialize manually the seed of tensorflow. By default, it is initialized to 42. |
--random_seed | False | Option to initialize manually the seed of random library. By default, it is initialized to 42. |
--dropout | False | Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs. |
--recurrent_dropout | False | Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state. |
--no_fine_tuning_embeddings | False | Option to no tune embeddings in train. We can't used its option without --embeddings. |
--activationCRF | False | Option to replace activation('softmax') by a CRF layer. |