@alanakbik alanakbik released this Dec 19, 2018 · 66 commits to master since this release

Release 0.4 with new models, lots of new languages, experimental multilingual models, hyperparameter selection methods, BERT and ELMo embeddings, etc.

New Features

Support for new languages

Flair embeddings

We now include Flair language models for further languages in addition to English and German. For instance, you can load FlairEmbeddings for Dutch with:

flair_embeddings = FlairEmbeddings('dutch-forward')

Word Embeddings

We now include pre-trained FastText Embeddings for 30 languages: English, German, Dutch, Italian, French, Spanish, Swedish, Danish, Norwegian, Czech, Polish, Finnish, Bulgarian, Portuguese, Slovenian, Slovakian, Romanian, Serbian, Croatian, Catalan, Russian, Hindi, Arabic, Chinese, Japanese, Korean, Hebrew, Turkish, Persian, Indonesian.

Each language has embeddings trained over Wikipedia or web crawls. Instantiate them like this:

# German embeddings computed over Wikipedia
german_wikipedia_embeddings = WordEmbeddings('de-wiki')

# German embeddings computed over web crawls
german_crawl_embeddings = WordEmbeddings('de-crawl')

Named Entity Recognition

Thanks to the Flair community, we now include NER models for further languages, next to the previous models for English and German.

Part-of-Speech Tagging

Thanks to the Flair community, we now include PoS models for further languages.

Multilingual models

As a major new feature, we now include models that can tag text in various languages.

12-language Part-of-Speech Tagging

We include a PoS model trained over 12 different languages (English, German, Dutch, Italian, French, Spanish, Portuguese, Swedish, Norwegian, Danish, Finnish, Polish, Czech).

# load model
tagger = SequenceTagger.load('pos-multi')

# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort kaufte er einen Hut .')

# predict PoS tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())

4-language Named Entity Recognition

We include a NER model trained over 4 different languages (English, German, Dutch, Spanish).

# load model
tagger = SequenceTagger.load('ner-multi')

# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort traf er Thomas Jefferson .')

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())

This model also works to some extent on other languages, such as French.

Pre-trained classification models (issue 70)

Flair now also includes two pre-trained classification models:

  • de-offensive-language: detecting offensive language in German text (GermEval 2018 Task 1)
  • en-sentiment: detecting positive and negative sentiment in English text (IMDB)

Simply load the TextClassifier using the preferred model, such as

TextClassifier.load('en-sentiment')

BERT and ELMo embeddings

We added both BERT and ELMo embeddings so you can try them out, and mix and match them with Flair embeddings or any other embedding types. We hope this will enable the research community to better compare and combine approaches.

BERT Embeddings (issue 251)

We added BERT embeddings to Flair, using the implementation from Hugging Face. The embeddings can be used like any other embedding type in Flair:

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# init embedding
embedding = BertEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)

ELMo Embeddings (issue 260)

Flair now also includes ELMo embeddings. We use the implementation of AllenNLP. As this implementation comes with a lot of sub-dependencies, you need to first install the library via pip install allennlp before you can use it in Flair. Using the embeddings is as simple as using any other embedding type:

from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

# init embedding
embedding = ELMoEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)

Multi-Dataset Training (issue 232)

You can now train a model on multiple datasets with the MultiCorpus object. We use this to train our multilingual models.

Just create multiple corpora and put them into MultiCorpus:

english_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
german_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_GERMAN)
dutch_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_DUTCH)

multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])

The multi_corpus can now be used for training, just as any other corpus before. Check the tutorial for more details.

Parameter Selection using Hyperopt (issue 242)

We built a wrapper around hyperopt to allow you to search for the best hyperparameters for your downstream task.

Define your search space and start training using several different parameter settings. The results are written to a specific file called param_selection.txt in the result directory. Check the tutorial for more details.
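To give a feel for what searching over a space of settings means, here is a minimal sketch in plain Python. It uses simple random search as a stand-in rather than hyperopt itself, and it is not Flair's wrapper: the names random_search, search_space and toy_objective, as well as the listed parameter values, are made up for illustration.

```python
import random
from typing import Any, Callable, Dict, List, Optional, Tuple


def random_search(search_space: Dict[str, List[Any]],
                  evaluate: Callable[[Dict[str, Any]], float],
                  max_evals: int,
                  seed: int = 0) -> Tuple[Dict[str, Any], float]:
    """Sample parameter settings at random and return the one with the lowest score."""
    if max_evals < 1:
        raise ValueError('max_evals must be at least 1')
    rng = random.Random(seed)
    best_params: Optional[Dict[str, Any]] = None
    best_score = float('inf')
    for _ in range(max_evals):
        # draw one value per parameter
        params = {name: rng.choice(choices) for name, choices in search_space.items()}
        score = evaluate(params)
        if best_params is None or score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params: Dict[str, Any]) -> float:
    """Made-up score (lower is better); a real run would train and evaluate a model here."""
    return abs(params['learning_rate'] - 0.1) + abs(params['hidden_size'] - 256) / 1000


# hypothetical search space, for illustration only
search_space = {
    'learning_rate': [0.05, 0.1, 0.15, 0.2],
    'hidden_size': [128, 256, 512],
}
best_params, best_score = random_search(search_space, toy_objective, max_evals=20)
```

In a real setup, each evaluation is a full training run, so the number of evaluations is usually the main cost to budget for.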

NLP Dataset Downloader (issue 243)

To make it as easy as possible to start training models, we have a new feature for automatically downloading publicly available NLP datasets. For instance, by running this code:

corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)

you download the Universal Dependencies corpus for English and can immediately start training models. The list of available datasets can be found in the tutorial.

Model training features

We added various other features to model training.

Saving training log (issue 212)

The training log output is now automatically saved to training.log in the result directory you provide for training.

Resuming training (issue 217)

It is now possible to stop training at any point in time and to resume it later by training with checkpoint set to True. Check the tutorial for more details.

Custom Optimizers (issue 220)

You can now choose other optimizers besides SGD, i.e. any PyTorch optimizer, plus our own modified implementations of SGD and Adam, namely SGDW and AdamW.

Learning Rate Finder (issue 228)

We added a new helper method to assist you in finding a good learning rate for model training.
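The notes do not describe how the helper works. One common approach, assumed here purely for illustration, is a learning rate range test: train briefly while the learning rate grows exponentially and record the loss at each step. The sketch below is plain Python and not Flair's implementation; exponential_schedule, lr_range_test and ToyModel are hypothetical names.

```python
from typing import Callable, List, Tuple


def exponential_schedule(start_lr: float, end_lr: float, steps: int) -> List[float]:
    """Learning rates growing by a constant factor from start_lr to end_lr."""
    if steps < 2:
        raise ValueError('need at least two steps')
    factor = (end_lr / start_lr) ** (1.0 / (steps - 1))
    return [start_lr * factor ** i for i in range(steps)]


def lr_range_test(train_step: Callable[[float], float],
                  learning_rates: List[float],
                  stop_factor: float = 4.0) -> List[Tuple[float, float]]:
    """Run one training step per learning rate and record (lr, loss) pairs.

    Stops early once the loss exceeds stop_factor times the best loss seen so far.
    """
    history = []
    best_loss = float('inf')
    for lr in learning_rates:
        loss = train_step(lr)
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > stop_factor * best_loss:
            break
    return history


class ToyModel:
    """Toy stand-in for a model: minimizes loss(w) = w ** 2 by gradient descent."""

    def __init__(self) -> None:
        self.w = 1.0

    def train_step(self, lr: float) -> float:
        self.w -= lr * 2 * self.w  # the gradient of w ** 2 is 2 * w
        return self.w ** 2


rates = exponential_schedule(1e-3, 10.0, steps=5)
history = lr_range_test(ToyModel().train_step, rates)
```

Plotting the recorded losses against the learning rates on a log scale typically shows a region where the loss falls fastest; a value from that region is a common starting point.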

Breaking Changes

This release introduces breaking changes. The most important are:

Unified Model Trainer (issue 189)

Instead of maintaining two separate trainer classes for sequence labeling and text classification, we now have one model training class, namely ModelTrainer. This replaces the earlier classes SequenceTaggerTrainer and TextClassifierTrainer.

Downstream task models now implement the new flair.nn.Model interface. So, both the SequenceTagger and TextClassifier now inherit from flair.nn.Model. This allows both models to be trained with the ModelTrainer, like this:

# Training sequence tagger
tagger = SequenceTagger(512, embeddings, tag_dictionary, 'ner')
trainer = ModelTrainer(tagger, corpus)
trainer.train('results')

# Training text classifier
classifier = TextClassifier(document_embedding, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus)
trainer.train('results')

The advantage is that all training parameters and training procedures are now the same for sequence labeling and text classification, which reduces redundancy and hopefully makes training easier to understand.

Metric class

The metric class is now refactored to compute micro and macro averages for F1 and accuracy. There is also a new enum EvaluationMetric which you can pass to the ModelTrainer to tell it what to use for evaluation.

Updates and Bug Fixes

Torch 1.0 (issue 176)

Flair now builds on torch 1.0.

Use Pathlib (issue 176)

Flair now uses Path wherever possible to allow easier operations on files/directories. However, our interfaces still allow you to pass a string, which will then be transformed into a Path by Flair.
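The pattern of accepting either a string or a Path can be sketched as follows. This is a generic illustration using only the standard library, not Flair's actual code; the function name resolve_output_file is hypothetical.

```python
from pathlib import Path
from typing import Union


def resolve_output_file(base_path: Union[str, Path], file_name: str) -> Path:
    """Accept a string or a Path and return the path of a file inside that directory."""
    if isinstance(base_path, str):
        base_path = Path(base_path)  # coerce plain strings into Path objects
    return base_path / file_name


# both calls yield the same Path object
from_string = resolve_output_file('results', 'training.log')
from_path = resolve_output_file(Path('results'), 'training.log')
```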

Bug Fixes

  • Fix: Non-whitespaced tokenized text results in an infinite loop (issue 226)
  • Fix: Getting IndexError: list index out of range error (issue 233)
  • Do not reset cache directory always to None (issue 249)
  • Filter sentences with zero tokens (issue 266)

@alanakbik alanakbik released this Nov 12, 2018 · 387 commits to master since this release

This is an update over release 0.3.1 with some critical bug fixes, a few new features and a lot more pre-packaged embeddings.

New Features

Embeddings

More word embeddings (#194 )

We added FastText embeddings for 10 languages ('en', 'de', 'fr', 'pl', 'it', 'es', 'pt', 'nl', 'ar', 'sv'). Load them using the two-letter language code, like this:

french_embedding = WordEmbeddings('fr')

More character LM embeddings (#204 #187 )

Thanks to a contribution by @stefan-it, we added CharLMEmbeddings for Bulgarian and Slovenian. Load them like this:

flm_embeddings = CharLMEmbeddings('slovenian-forward')
blm_embeddings = CharLMEmbeddings('slovenian-backward')

Custom embeddings (#170 )

Added an explanation of how to use your own custom word embeddings. Simply convert them to gensim.KeyedVectors and point the embedding class to the file:

custom_embedding = WordEmbeddings('path/to/your/custom/embeddings.gensim')

New embeddings type: DocumentPoolEmbeddings (#191 )

Add a new embedding class for document-level embeddings. You can now choose between different pooling options, e.g. min, max and average. Create the new embeddings like this:

word_embeddings = WordEmbeddings('glove')
pool_embeddings = DocumentPoolEmbeddings([word_embeddings], mode='min')
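To make the pooling options concrete, here is a small sketch in plain Python of what min, max and average pooling over word vectors compute. It is only an illustration of the idea, not the DocumentPoolEmbeddings implementation; pool_vectors and the mode name 'mean' are assumptions of this sketch.

```python
from typing import List


def pool_vectors(vectors: List[List[float]], mode: str = 'mean') -> List[float]:
    """Pool equally sized word vectors into one document vector, dimension by dimension."""
    if not vectors:
        raise ValueError('need at least one vector to pool')
    columns = list(zip(*vectors))  # regroup the values by dimension
    if mode == 'min':
        return [min(column) for column in columns]
    if mode == 'max':
        return [max(column) for column in columns]
    if mode == 'mean':
        return [sum(column) / len(column) for column in columns]
    raise ValueError('unknown pooling mode: {}'.format(mode))


# three made-up 2-dimensional word vectors
word_vectors = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
min_pooled = pool_vectors(word_vectors, mode='min')
```

Whatever the number of words, the result always has the dimensionality of a single word vector, which is what makes it usable as a fixed-size document representation.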

Language model

New method: generate_text() (#167 )

The LanguageModel class now has an in-built generate_text() method to sample the LM. Run code like this:

# load your language model
model = LanguageModel.load_language_model('path/to/your/lm')

# generate 20000 characters
text = model.generate_text(20000)
print(text)

Metrics

Class-based metrics in Metric class (#164 )

Refactored Metric class to provide class-based metrics, as well as micro and macro averaged F1 scores.
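The difference between the two averages can be sketched in a few lines of plain Python. This is a generic illustration of the standard definitions, not the code of the Metric class; the function names and the example counts are made up.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_f1(per_class: Dict[str, Counts]) -> float:
    """Sum the counts over all classes first, then compute a single F1."""
    tp = sum(counts[0] for counts in per_class.values())
    fp = sum(counts[1] for counts in per_class.values())
    fn = sum(counts[2] for counts in per_class.values())
    return f1(tp, fp, fn)


def macro_f1(per_class: Dict[str, Counts]) -> float:
    """Compute F1 per class, then take the unweighted mean over classes."""
    scores = [f1(*counts) for counts in per_class.values()]
    return sum(scores) / len(scores) if scores else 0.0


# made-up counts for two classes
counts = {'PER': (8, 2, 2), 'LOC': (1, 1, 3)}
```

Micro averaging weights every prediction equally, so frequent classes dominate; macro averaging weights every class equally, so a poorly handled rare class pulls the score down more.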

Bug Fixes

Fix serialization error for MacOS and Windows (#174 )

On these setups, we got errors when serializing or loading large models. We've put in place a workaround that limits model size so it works on those systems. An added bonus is that models are smaller now.

"Frozen" dropout (#184 )

Fixed a potentially big issue in which dropout was frozen in the first epoch in embeddings produced from the character LM, meaning that the same dimensions stayed dropped throughout training.

Testing step in language model trainer (#178 )

Previously, the language model was never applied to test data during training. A final testing step has been added back in.

Testing

Distinguish between unit and integration tests (#183)

Instructions on how to run tests with pipenv (#161 )

Optimizations

Disable autograd during testing and prediction (#175)

Since autograd is unused here this gives us minor speedups.

@alanakbik alanakbik released this Oct 19, 2018 · 460 commits to master since this release

This is a stability update over release 0.3.0 with small optimizations, refactorings and bug fixes. For a list of new features, refer to 0.3.0.

Optimizations

Retain Token embeddings in memory by default (#146 )

Allow for faster training of text classifier on large datasets by keeping token embeddings in memory.

Always clear embeddings after prediction (#149 )

After prediction, remove embeddings from memory to avoid filling up memory.

Refactorings

Aligned TextClassifierTrainer and SequenceTaggerTrainer (#148 )

Align signatures and features of the two training classes to make it easier to understand training options.

Updated DocumentLSTMEmbeddings (#150 )

Remove unused flag and code from DocumentLSTMEmbeddings

Removed unneeded AWS and Jinja2 dependencies (#158 )

Some dependencies are no longer required.

Bug Fixes

Fixed error when predicting over empty sentences. (#157)

Serialization: reset cache settings when saving a model. (#153 )

@alanakbik alanakbik released this Oct 16, 2018 · 489 commits to master since this release

Breaking Changes

New Label class with confidence score (#38)

A tag prediction is not a simple string anymore but a Label, which holds a value and a confidence score.
To obtain the tag name you need to call tag.value. To get the score call tag.score. This can help you build
applications in which you only want to use predictions that lie above a specific confidence threshold.
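As a rough illustration of such threshold filtering, here is a minimal sketch in plain Python. The Label tuple below is a hypothetical stand-in with value and score fields, not Flair's own class, and the 0.8 cutoff is an arbitrary choice.

```python
from typing import List, NamedTuple


class Label(NamedTuple):
    """Hypothetical stand-in for a predicted label (not Flair's class)."""
    value: str
    score: float


def filter_confident(labels: List[Label], threshold: float) -> List[Label]:
    """Keep only the labels whose confidence score is at least `threshold`."""
    return [label for label in labels if label.score >= threshold]


# made-up predictions, for illustration only
predictions = [Label('PER', 0.97), Label('LOC', 0.55), Label('ORG', 0.81)]
confident = filter_confident(predictions, threshold=0.8)
```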

LockedDropout moved to the new flair.nn module (#48)

New Features

Multi-token spans (#54, #97)

Entities can now be wrapped into multi-token spans (type: Span). This is helpful for entities that span multiple words, such as "George Washington". A Span contains the position of the entity in the original text, the tag, a confidence score, and its text. You can get spans from a sentence by using the get_spans() method, like so:

from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('George Washington went to Washington .')

# load and run NER
tagger = SequenceTagger.load('ner')
tagger.predict(sentence)

# get span entities, together with tag and confidence score
for entity in sentence.get_spans('ner'):
    print('{} {} {}'.format(entity.text, entity.tag, entity.score))

Predictions with confidence score (#38)

Predicted tags are no longer simple strings, but objects of type Label that contain a value and a confidence score. These scores are extracted during prediction from the sequence tagger or text classifier and indicate how confident the model is of a prediction. Print confidence scores of tags like this:

from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('George Washington went to Washington .')

# load the POS tagger
tagger = SequenceTagger.load('pos')

# run POS over sentence
tagger.predict(sentence)

# print token, predicted POS tag and confidence score
for token in sentence:
    print('{} {} {}'.format(token.text, token.get_tag('pos').value, token.get_tag('pos').score))

Visualization routines (#61)

flair now includes visualizations for plotting training curves and weights when training a sequence tagger or text classifier. We also added visualization routines for plotting embeddings and highlighting tags in a sentence. For instance, to visualize contextual string embeddings, do this:

from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import CharLMEmbeddings
from flair.visual import Visualizer

# get a list of Sentence objects
corpus = NLPTaskDataFetcher.fetch_data(NLPTask.CONLL_03).downsample(0.1)
sentences = corpus.train + corpus.test + corpus.dev

# init embeddings (can also be a StackedEmbedding)
embeddings = CharLMEmbeddings('news-forward-fast')

# embed corpus batch-wise
batches = [sentences[x:x + 8] for x in range(0, len(sentences), 8)]
for batch in batches:
    embeddings.embed(batch)

# visualize
visualizer = Visualizer()
visualizer.visualize_word_emeddings(embeddings, sentences, 'data/visual/embeddings.html')

Implementation of different dropouts (#48)

Different dropout possibilities (Locked Dropout and Word Dropout) were added and can be used during training.

Memory management for training on large data sets (#137)

flair now stores contextual string embeddings on disk to speed up training and allow for training on larger datasets.

Pre-trained language models for Polish

Added pre-trained language models for Polish, donated by (Borchmann et al., 2018). Load the Polish embeddings like this:

flm_embeddings = CharLMEmbeddings('polish-forward')
blm_embeddings = CharLMEmbeddings('polish-backward')

Bug Fixes

Fix evaluation of sequence tagger (#79, #75)

The script eval.pl for sequence tagger contained bugs. flair now uses its own evaluation methods.

Fix bugs in text classifier (#108)

Fixed bugs in single label training and out-of-memory errors during evaluation.

Others

Standardize logging output (#16)

Logging output for sequence tagger and text classifier is improved and standardized.

Update torch version (#34, #106)

flair now uses torch version 0.4.1

Updated documentation (#138, #89)

Expanded documentation and tutorials.

@alanakbik alanakbik released this Aug 3, 2018 · 700 commits to master since this release

Breaking Changes

Reorganized package structure #12

There are now two packages: flair.models and flair.trainers for the models and model trainers respectively.

Models package

flair.models contains 3 model classes: SequenceTagger, TextClassifier and LanguageModel.

Trainers package

flair.trainers contains 3 model trainer classes: SequenceTaggerTrainer, TextClassifierTrainer and LanguageModelTrainer.

Direct import from package

You can now import these classes directly from the packages. For instance, the SequenceTagger is now loaded as:

from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner')

Reorganized embeddings #12

Clear distinction between token-level and document-level embeddings by adding two classes, namely TokenEmbeddings and DocumentEmbeddings from which respective embeddings need to inherit.

New Features

LanguageModelTrainer #24 #17

Added LanguageModelTrainer class to train your own LM embeddings.

Document Classification #10

Added experimental TextClassifier model for document-level text classification. Also added corresponding model trainer class, i.e. TextClassifierTrainer.

Batch prediction #7

Added batching into prediction method for faster sequence tagging

CPU-friendly pre-trained models #29

Added pre-trained models with smaller LM embeddings for faster inference on CPU.

You can load them by adding '-fast' to the model name. Only for English at present.

from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner-fast')

Learning Rate Scheduling #19

Added learning rate schedulers to all trainer classes for improved learning rate annealing functionality and control.
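The notes do not say which scheduling strategy is used. One common form of annealing, assumed here only for illustration, lowers the learning rate once the loss has stopped improving for a number of epochs. The class below is a plain-Python sketch with a hypothetical name and defaults, not the scheduler used by the trainer classes.

```python
class AnnealOnPlateau:
    """Multiply the learning rate by `factor` after more than `patience` epochs without improvement."""

    def __init__(self, learning_rate: float, factor: float = 0.5, patience: int = 2) -> None:
        self.learning_rate = learning_rate
        self.factor = factor
        self.patience = patience
        self.best_loss = float('inf')
        self.bad_epochs = 0

    def step(self, loss: float) -> float:
        """Report the loss of the latest epoch and get the learning rate for the next one."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.learning_rate *= self.factor  # anneal and start counting again
            self.bad_epochs = 0
        return self.learning_rate


# made-up loss curve that stops improving after the second epoch
scheduler = AnnealOnPlateau(learning_rate=0.1, factor=0.5, patience=1)
rates = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.96, 0.97]]
```

With a patience of 1, the rate stays at its initial value through the first stalled epoch and is halved after the second one.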

Auto-spawn on GPUs #19

All model classes now automatically spawn on GPUs if available. The separate .cuda() call is no longer necessary.

Bug Fixes

Retagging error #23

Fixed error that occurred when using multiple pre-trained taggers on the same sentence.

Empty sentence error #33

Fixed error that caused data fetchers to sometimes create empty sentences.

Other

Unit Tests #15

Added a large set of automated unit tests for better stability.

Documentation #15

Expanded documentation and tutorials. Also expanded descriptions of APIs.

Code Simplifications in sequence tagger #19

A number of code simplifications all around, hopefully making the code easier to understand.

@alanakbik alanakbik released this Jul 13, 2018 · 802 commits to master since this release

First release of Flair Framework

Static word embeddings:

  • includes prepared word embeddings from GloVe, FastText, Numberbatch and Extvec
  • includes prepared word embeddings for English, German and Swedish

Contextual string embeddings:

  • includes pre-trained models for English and German

Text embeddings:

  • Two experimental methods for full-text embeddings (LSTM and Mean)

Sequence labeling:

  • pre-trained models for English (PoS-tagging, chunking and NER)
  • pre-trained models for German (PoS-tagging and NER)
  • experimental semantic frame detector for English