Lemmatization using Deep Learning

Author: Tsolak Ghukasyan
Project advisor: Adam Mathias Bittlingmayer

Introduction

Even in today's top NLP libraries, lemmatization is still usually implemented with rules and lookup tables, which require linguistic knowledge of each language to build.

d-lemma is developing simple universal models for learning lemmatization, using only annotated text datasets and word embeddings.

d-lemma models support a growing set of languages: lemma-annotated UD treebanks and fastText embeddings are publicly available for over 60 different languages.

Approaches

In this project, 6 different approaches were considered.

To put the evaluation of the developed learning models in context, 2 baseline approaches are used:

  • Identity baseline
    The identity function, i.e. returning the input token unchanged as its lemma, serves as a weak baseline for the main models.

  • Most common lemma with identity backoff
    Returning each token's most common lemma in the training data serves as a stronger baseline for the developed models. This baseline backs off to identity for unknown words.
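
Below is a minimal Python sketch of the two baselines. It is illustrative only; the function names and data layout are assumptions, not the repository's code.

from collections import Counter, defaultdict

def identity_lemmatize(tokens):
    # weak baseline: every token is returned unchanged as its own lemma
    return list(tokens)

def train_most_common(token_lemma_pairs):
    # token_lemma_pairs: (FORM, LEMMA) pairs taken from the training treebank
    counts = defaultdict(Counter)
    for token, lemma in token_lemma_pairs:
        counts[token][lemma] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}

def most_common_lemmatize(tokens, lemma_table):
    # stronger baseline: most frequent training lemma, identity for unknown words
    return [lemma_table.get(token, token) for token in tokens]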

The 4 learning models are:

  • Linear regression
    A linear regressor with cosine proximity loss that, given each input token's embedding, tries to produce the embedding of its lemma. This lemmatizer does not use contextual information during prediction.

  • Regression with LSTM
    A recurrent neural network with a single LSTM unit that receives the sequence of input tokens' embeddings and produces the embeddings of their lemmas (see the sketch after this list).

  • Seq2seq
    A word level sequence-to-sequence model using LSTM cells. This model receives a sequence of tokens as input and produces the sequence of their lemmas.

  • Transformer
    An encoder-decoder model based on the self-attention mechanism, introduced by Google in Attention Is All You Need (Vaswani et al., 2017). Similar to seq2seq, it processes a sequence of input tokens to output the sequence of their lemmas.
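
As a concrete illustration of the regression-with-LSTM approach, here is a minimal Keras sketch. It is an assumed implementation rather than the repository's exact notebooks; the embedding dimension, sequence length, layer size, and variable names are illustrative.

import tensorflow as tf
from tensorflow.keras import layers, models

EMB_DIM = 300   # fastText vectors are 300-dimensional
MAX_LEN = 40    # assumed maximum sentence length after padding/truncation

model = models.Sequential([
    layers.Input(shape=(MAX_LEN, EMB_DIM)),
    # one LSTM layer producing a hidden state for every input token
    layers.LSTM(256, return_sequences=True),
    # project each hidden state back into the word embedding space
    layers.TimeDistributed(layers.Dense(EMB_DIM)),
])

# cosine proximity loss: minimizing it maximizes the cosine similarity
# between the predicted vector and the gold lemma's embedding
model.compile(optimizer="adam", loss=tf.keras.losses.CosineSimilarity(axis=-1))

# X: (n_sentences, MAX_LEN, EMB_DIM) token embeddings
# Y: (n_sentences, MAX_LEN, EMB_DIM) corresponding lemma embeddings
# model.fit(X, Y, epochs=10, validation_data=(X_val, Y_val))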

Other model ideas, such as LSTM networks with softmax output layers, were also considered; however, these were rejected because of their memory and performance requirements.

Training and Evaluation

Two languages were selected for training and evaluation of the aforementioned models: English as a relatively low-morphology language and Finnish as a high-morphology language.

Since one of this project's goals is to develop lemmatization models for low-resource languages, the models were trained on only a 10,000-token subset of the respective UD treebanks, with another 2,000 tokens for validation. Below are the evaluation results on 2,000-token test sets:

Results for English:

Model          Accuracy   BLEU
identity       78.15%     0.579
most common    91.40%     0.773
linear reg.    87.55%     0.685
LSTM           93.0%      -
transformer    -          0.439

Results for Finnish:

Model          Accuracy   BLEU
identity       47.35%     0.128
most common    66.50%     0.285
linear reg.    73.15%     0.389
LSTM           75.07%     -
transformer    -          -

*Word-level seq2seq without attention did not produce any meaningful results.

Because the output of the transformer and seq2seq models is of variable length, it may contain a different number or order of tokens than the input, so a token-level accuracy score cannot be computed for them, and BLEU is reported instead.
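
For reference, here is a hedged sketch of how the two metrics can be computed. It is not the repository's evaluation code; it assumes NLTK for BLEU and tokenized gold/predicted lemma sequences.

from nltk.translate.bleu_score import corpus_bleu

def token_accuracy(predicted_sents, gold_sents):
    # only meaningful when predictions align one-to-one with gold tokens
    pairs = [(p, g)
             for pred, gold in zip(predicted_sents, gold_sents)
             for p, g in zip(pred, gold)]
    return sum(p == g for p, g in pairs) / len(pairs)

def lemma_bleu(predicted_sents, gold_sents):
    # one reference per sentence, as in machine translation evaluation
    return corpus_bleu([[gold] for gold in gold_sents], predicted_sents)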

Sample output of the learned LSTM lemmatizer for English:

>>> lemmatize("I knew him because he had attended my school .".split(' '))
['I', 'know', 'he', 'because', 'he', 'have', 'attend', 'my', 'school', '.']
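
The words in this output presumably come from a nearest-neighbour lookup that maps each predicted lemma embedding back to a vocabulary word. A possible sketch using gensim follows; this is an assumption about the pipeline, not necessarily how the repository implements it, and the embeddings file path is only an example.

from gensim.models import KeyedVectors

# vectors = KeyedVectors.load_word2vec_format("wiki.en.vec")  # fastText text format (illustrative path)

def vector_to_word(vectors, predicted_embedding):
    # similar_by_vector returns [(word, cosine_similarity), ...]
    word, _score = vectors.similar_by_vector(predicted_embedding, topn=1)[0]
    return word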

Training Lemmatizers for New Languages

The linear and LSTM regressors can be easily adapted for new languages.

To train and evaluate a new model, you can use the linear_models.ipynb and lstm_model.ipynb Jupyter notebooks. All you need to do is set the paths to the CoNLL-U treebank and word embedding files at the beginning of the notebook (n.b. only the FORM and LEMMA columns of the treebank are used).
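
For example, the FORM and LEMMA columns can be read from a CoNLL-U file roughly as follows. This is an illustrative sketch, not the notebooks' exact code, and the file path in the usage comment is only an example.

def read_conllu(path):
    """Yield (tokens, lemmas) pairs, one per sentence."""
    tokens, lemmas = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                              # blank line ends a sentence
                if tokens:
                    yield tokens, lemmas
                    tokens, lemmas = [], []
                continue
            if line.startswith("#"):                  # sentence-level comments
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:      # skip multiword/empty tokens
                continue
            tokens.append(cols[1])                    # FORM column
            lemmas.append(cols[2])                    # LEMMA column
    if tokens:
        yield tokens, lemmas

# sentences = list(read_conllu("UD_English-EWT/en_ewt-ud-train.conllu"))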

Conclusion

The results clearly show that the more advanced deep learning models (seq2seq and the transformer) do not perform well on this task; the main reasons are the limited training data and the difficulty of hyperparameter tuning.

At the same time, the simple linear regression model demonstrates results very close to the strong baseline, and for Finnish even outperforms it. Among the considered approaches, the highest accuracy was achieved by the LSTM network, which beat both baselines for both languages.

The regressors learn to lemmatize not only very common words such as 'are', 'got', 'was', etc., but also seem to learn certain relations (e.g. 'killed'-'kill', 'said'-'say', 'years'-'year'). In addition, these models demonstrate the capability to lemmatize unseen word forms (e.g. 'submitted'-'submit', 'replacing'-'replace').

Future Work

To further investigate the effectiveness of advanced deep learning approaches, it could be useful to experiment with the following models:

  • word-level seq2seq with attention
  • char-level seq2seq with attention
  • DeepMind's relation networks

It could also be useful to slice the evaluation metrics by word frequency or length, to understand how the approaches differ.

Datasets

For training and evaluation:

UD treebanks: universaldependencies.org

Word embeddings: fastText pre-trained vectors, fasttext.cc

Related Work

A Neural Lemmatizer for Bengali
Abhisek Chakrabarty et al.

Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks
Abhisek Chakrabarty et al.