GH-563: prepare 0.4.2 release
aakbik committed May 29, 2019
1 parent 637d6b8 commit 4928d53
Showing 8 changed files with 208 additions and 163 deletions.
2 changes: 1 addition & 1 deletion flair/__init__.py
@@ -18,7 +18,7 @@

import logging.config

__version__ = "0.4.1"
__version__ = "0.4.2"

logging.config.dictConfig(
{
4 changes: 4 additions & 0 deletions flair/embeddings.py
@@ -1043,9 +1043,13 @@ def __init__(
# Slovenian
"sl-forward": f"{aws_path}/embeddings-stefan-it/lm-sl-opus-large-forward-v0.1.pt",
"sl-backward": f"{aws_path}/embeddings-stefan-it/lm-sl-opus-large-backward-v0.1.pt",
"sl-v0-forward": f"{aws_path}/embeddings-v0.3/lm-sl-large-forward-v0.1.pt",
"sl-v0-backward": f"{aws_path}/embeddings-v0.3/lm-sl-large-forward-v0.1.pt",
# Swedish
"sv-forward": f"{aws_path}/embeddings-stefan-it/lm-sv-opus-large-forward-v0.1.pt",
"sv-backward": f"{aws_path}/embeddings-stefan-it/lm-sv-opus-large-backward-v0.1.pt",
"sv-v0-forward": f"{aws_path}/embeddings-v0.4/lm-sv-large-forward-v0.1.pt",
"sv-v0-backward": f"{aws_path}/embeddings-v0.4/lm-sv-large-backward-v0.1.pt",
}

# load model if in pretrained model map
75 changes: 36 additions & 39 deletions resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md
@@ -38,48 +38,45 @@

```python
flair_embedding_forward.embed(sentence)
```

You choose which embeddings you load by passing the appropriate string to the constructor of the `FlairEmbeddings` class.
Currently, the following contextual string embeddings are provided (more coming):
Currently, the following contextual string embeddings are provided (note: replace '*X*' with either '*forward*' or '*backward*'; see the short example after the table):

| ID | Language | Embedding |
| ------------- | ------------- | ------------- |
| 'multi-forward' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News) |
| 'multi-backward' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News) |
| 'multi-forward-fast' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News) |
| 'multi-backward-fast' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News) |
| 'news-forward' | English | Forward LM embeddings over 1 billion word corpus |
| 'news-backward' | English | Backward LM embeddings over 1 billion word corpus |
| 'news-forward-fast' | English | Smaller, CPU-friendly forward LM embeddings over 1 billion word corpus |
| 'news-backward-fast' | English | Smaller, CPU-friendly backward LM embeddings over 1 billion word corpus |
| 'mix-forward' | English | Forward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |
| 'mix-backward' | English | Backward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |
| 'german-forward' | German | Forward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |
| 'german-backward' | German | Backward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |
| 'polish-forward' | Polish | Added by [@borchmann](https://github.com/applicaai/poleval-2018): Forward LM embeddings over web crawls (Polish part of CommonCrawl) |
| 'polish-backward' | Polish | Added by [@borchmann](https://github.com/applicaai/poleval-2018): Backward LM embeddings over web crawls (Polish part of CommonCrawl) |
| 'slovenian-forward' | Slovenian | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Forward LM embeddings over various sources (Europarl, Wikipedia and OpenSubtitles2018) |
| 'slovenian-backward' | Slovenian | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Backward LM embeddings over various sources (Europarl, Wikipedia and OpenSubtitles2018) |
| 'bulgarian-forward' | Bulgarian | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Forward LM embeddings over various sources (Europarl, Wikipedia or SETimes) |
| 'bulgarian-backward' | Bulgarian | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Backward LM embeddings over various sources (Europarl, Wikipedia or SETimes) |
| 'dutch-forward' | Dutch | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Forward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'dutch-backward' | Dutch | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Backward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'swedish-forward' | Swedish | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Forward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'swedish-backward' | Swedish | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Backward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'french-forward' | French | Added by [@mhham](https://github.com/mhham): Forward LM embeddings over French Wikipedia |
| 'french-backward' | French | Added by [@mhham](https://github.com/mhham): Backward LM embeddings over French Wikipedia |
| 'czech-forward' | Czech | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Forward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'czech-backward' | Czech | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Backward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'portuguese-forward' | Portuguese | Added by [@ericlief](https://github.com/ericlief/language_models): Forward LM embeddings |
| 'portuguese-backward' | Portuguese | Added by [@ericlief](https://github.com/ericlief/language_models): Backward LM embeddings |
| 'basque-forward' | Basque | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Forward LM embeddings |
| 'basque-backward' | Basque | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Backward LM embeddings |
| 'spanish-forward' | Spanish | Added by [@iamyihwa](https://github.com/zalandoresearch/flair/issues/80): Forward LM embeddings over Wikipedia |
| 'spanish-backward' | Spanish | Added by [@iamyihwa](https://github.com/zalandoresearch/flair/issues/80): Backward LM embeddings over Wikipedia |
| 'spanish-forward-fast' | Spanish | Added by [@iamyihwa](https://github.com/zalandoresearch/flair/issues/80): CPU-friendly forward LM embeddings over Wikipedia |
| 'spanish-backward-fast' | Spanish | Added by [@iamyihwa](https://github.com/zalandoresearch/flair/issues/80): CPU-friendly backward LM embeddings over Wikipedia |
| 'japanese-forward' | Japanese | Added by [@frtacoa](https://github.com/zalandoresearch/flair/issues/527): Forward LM embeddings over 439M words of Japanese Web crawls (2048 hidden states, 2 layers)|
| 'japanese-backward' | Japanese | Added by [@frtacoa](https://github.com/zalandoresearch/flair/issues/527): Backward LM embeddings over 439M words of Japanese Web crawls (2048 hidden states, 2 layers)|
| 'pubmed-forward' | English | Added by [@jessepeng](https://github.com/zalandoresearch/flair/pull/519): Forward LM embeddings over 5% of PubMed abstracts until 2015 (1150 hidden states, 3 layers)|
| 'pubmed-backward' | English | Added by [@jessepeng](https://github.com/zalandoresearch/flair/pull/519): Backward LM embeddings over 5% of PubMed abstracts until 2015 (1150 hidden states, 3 layers)|
| 'multi-X' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News) |
| 'multi-X-fast' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News), CPU-friendly |
| 'news-X' | English | Trained with 1 billion word corpus |
| 'news-X-fast' | English | Trained with 1 billion word corpus, CPU-friendly |
| 'mix-X' | English | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'ar-X' | Arabic | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'bg-X' | Bulgarian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'bg-X-fast' | Bulgarian | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Trained with various sources (Europarl, Wikipedia or SETimes) |
| 'cs-X' | Czech | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'cs-v0-X' | Czech | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): LM embeddings (earlier version) |
| 'de-X' | German | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'es-X' | Spanish | Added by [@iamyihwa](https://github.com/zalandoresearch/flair/issues/80): Trained with Wikipedia |
| 'es-X-fast' | Spanish | Added by [@iamyihwa](https://github.com/zalandoresearch/flair/issues/80): Trained with Wikipedia, CPU-friendly |
| 'eu-X' | Basque | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'eu-v0-X' | Basque | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): LM embeddings (earlier version) |
| 'fa-X' | Farsi | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'fi-X' | Finnish | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'fr-X' | French | Added by [@mhham](https://github.com/mhham): Trained with French Wikipedia |
| 'he-X' | Hebrew | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'hi-X' | Hindi | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'hr-X' | Croatian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'id-X' | Indonesian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'it-X' | Italian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'ja-X' | Japanese | Added by [@frtacoa](https://github.com/zalandoresearch/flair/issues/527): Trained with 439M words of Japanese Web crawls (2048 hidden states, 2 layers)|
| 'nl-X' | Dutch | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'nl-v0-X' | Dutch | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): LM embeddings (earlier version) |
| 'no-X' | Norwegian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'pl-X' | Polish | Added by [@borchmann](https://github.com/applicaai/poleval-2018): Trained with web crawls (Polish part of CommonCrawl) |
| 'pl-opus-X' | Polish | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'pt-X' | Portuguese | Added by [@ericlief](https://github.com/ericlief/language_models): LM embeddings |
| 'sl-X' | Slovenian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'sl-v0-X' | Slovenian | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Trained with various sources (Europarl, Wikipedia and OpenSubtitles2018) |
| 'sv-X' | Swedish | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'sv-v0-X' | Swedish | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Trained with various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'pubmed-X' | English | Added by [@jessepeng](https://github.com/zalandoresearch/flair/pull/519): Trained with 5% of PubMed abstracts until 2015 (1150 hidden states, 3 layers)|
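
For example, substituting '*X*' in the German entry 'de-X' gives the following (a minimal sketch):

```python
from flair.embeddings import FlairEmbeddings

# 'de-X' from the table becomes 'de-forward' or 'de-backward'
german_forward = FlairEmbeddings('de-forward')
german_backward = FlairEmbeddings('de-backward')
```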


So, if you want to load embeddings from the English news backward LM model, instantiate the embeddings as follows:
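
A minimal sketch of that instantiation:

```python
from flair.embeddings import FlairEmbeddings

# load the English news backward LM embeddings
flair_embedding_backward = FlairEmbeddings('news-backward')
```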
27 changes: 25 additions & 2 deletions resources/docs/TUTORIAL_5_DOCUMENT_EMBEDDINGS.md
@@ -28,7 +28,7 @@ The resulting embedding is taken as document embedding.

To create a mean document embedding simply create any number of `TokenEmbeddings` first and put them in a list.
Afterwards, initiate the `DocumentPoolEmbeddings` with this list of `TokenEmbeddings`.
So, if you want to create a document embedding using GloVe embeddings together with CharLMEmbeddings,
So, if you want to create a document embedding using GloVe embeddings together with `FlairEmbeddings`,
use the following code:

```python
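# a minimal sketch of the mean-pooling example, assuming the GloVe and
# news Flair embeddings from the previous tutorial:
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

# initialize the word embeddings to pool
glove_embedding = WordEmbeddings('glove')
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

# initialize the document embeddings (pooling defaults to mean)
document_embeddings = DocumentPoolEmbeddings([glove_embedding,
                                              flair_embedding_forward,
                                              flair_embedding_backward])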
```

@@ -68,7 +68,30 @@

Pass the pooling operation you wish to use to the initialization of the `DocumentPoolEmbeddings`:

```python
document_embeddings = DocumentPoolEmbeddings([glove_embedding,
                                              flair_embedding_forward,
                                              flair_embedding_backward],
                                             mode='min')
                                             pooling='min')
```

You can also choose which fine-tuning operation you want, i.e. which transformation to apply before the word embeddings get
pooled. The default is a 'linear' transformation, but if you only use simple word embeddings that are not task-trained,
you should probably use a 'nonlinear' transformation instead:

```python
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

# instantiate pre-trained word embeddings
embeddings = WordEmbeddings('glove')

# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='nonlinear')
```

If, on the other hand, you use word embeddings that are task-trained (such as simple one-hot encoded embeddings), you
are often better off doing no transformation at all. Do this by passing 'none':

```python
from flair.embeddings import OneHotEmbeddings, DocumentPoolEmbeddings

# instantiate one-hot encoded word embeddings ('corpus' is a Corpus object, e.g. from the earlier tutorials)
embeddings = OneHotEmbeddings(corpus)

# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='none')
```


