This repository has been archived by the owner on Jul 15, 2022. It is now read-only.
Thanks for the write-up. In general, I don't mind adding a new feature
namespace if you think it makes more sense in terms of the API and ease of
use. Some things to consider:
- I was also thinking about stacking word features with different
pretrained weights, although I think stacking spacy features would be more
useful and thus of higher priority.
- For deciding between option 1 and option 2, please consider the impact on
HPO definitions and the readability of logging in wandb/mlflow. I am not
sure which one is better. I remember another option we discussed, a dict
with the spacy features as keys; have you considered that one too?
- For custom features, we need further discussion. I fail to see how this
automatic feature creation would work and how the head config would look.
Also, even if this is done under the hood, is it conceptually a
spacy feature? In general I would prefer more explicit configuration of
features, not ones added automatically under the hood. Conceptually,
entities is a feature you will be passing and will want to configure like
the other features. It might also be useful outside the relation
classifier, for example for a text classifier that can leverage entities or
other custom features; I am unsure about the design if this responsibility
goes to the head.
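For reference, the dict-keyed variant mentioned above could look roughly like this (a sketch only; the exact key names are assumptions, not a settled API):

```python
# Sketch of the third option discussed: spacy attributes as dict keys.
# The "spacy" key and the per-attribute sub-dicts are hypothetical.
config = {
    "features": {
        "word": {"embedding_dim": 300, "weights_file": "fasttext"},
        "spacy": {
            "pos": {"embedding_dim": 16},
            "dep": {"embedding_dim": 32},
        },
    }
}

# One advantage of this shape: each attribute can appear only once,
# and lookup is direct.
pos_dim = config["features"]["spacy"]["pos"]["embedding_dim"]
```

This shape might also log more readably in wandb/mlflow, since each attribute becomes a distinct config key rather than an index into a list.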
On Mon, Jan 4, 2021, 16:34, David Fidalgo <notifications@github.com>
wrote:
This is a follow-up idea to the discussion we had before Christmas
regarding the spacy token features.
I think in the end we decided to go with the following scheme:
"features": {
    "word": [
        {"embedding_dim": 300, "weights_file": "fasttext"},  # default is the "text" feature
        {"embedding_dim": 16, "feature": "pos"},
        {"embedding_dim": 32, "feature": "dep"},
        {"embedding_dim": 8, "feature": "shape"},  # the token's orthographic string features, like Xxxx
    ]  # note that "word" is no longer the vocab namespace; each feature will have its own namespace
}
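The "each feature gets its own namespace" note above could be sketched like this (a hypothetical helper; the namespace naming scheme is an assumption):

```python
def feature_namespaces(word_features):
    """Derive one vocab namespace per stacked word feature.

    Hypothetical sketch: entries without a "feature" key default to the
    raw "text" feature; every entry then gets its own namespace instead
    of all sharing the former "word" namespace.
    """
    return [entry.get("feature", "text") for entry in word_features]


word_features = [
    {"embedding_dim": 300, "weights_file": "fasttext"},
    {"embedding_dim": 16, "feature": "pos"},
    {"embedding_dim": 32, "feature": "dep"},
    {"embedding_dim": 8, "feature": "shape"},
]
```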
There are two things that still feel a bit itchy to me:
- the WordFeatures options weights_file, trainable and, to some extent,
lowercase_tokens only really make sense for the text feature_name
- when providing pre-tokenized text, all the spacy pipeline features
(pos, dep, ...) are not available.
I thought that maybe we should be more explicit about where the features
come from, that is, create a new *spacy* feature:
"features": {
    "word": {"embedding_dim": 300, "weights_file": "fasttext"},
    "spacy": {"attributes": ["pos", "dep"], "embedding_dims": [16, 32]}  # I chose "attributes" in order not to repeat "features" inside the "features" key ...
}
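One caveat of this first shape is that the two parallel lists must stay aligned; a config loader would likely have to validate that (a sketch with a hypothetical function name):

```python
def parse_spacy_feature(spacy_config):
    """Pair up the parallel "attributes"/"embedding_dims" lists.

    Hypothetical validation sketch for option 1: the two lists must
    have the same length, otherwise the config is ambiguous.
    """
    attributes = spacy_config["attributes"]
    dims = spacy_config["embedding_dims"]
    if len(attributes) != len(dims):
        raise ValueError(
            f"Got {len(attributes)} attributes but {len(dims)} embedding dims"
        )
    return dict(zip(attributes, dims))
```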
or
"features": {
    "word": {"embedding_dim": 300, "weights_file": "fasttext"},
    "spacy": [
        {"attribute": "pos", "embedding_dim": 16},
        {"attribute": "dep", "embedding_dim": 32},
    ],
}
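Both options carry the same information and can be normalized to the same internal mapping, so the choice is mostly about readability and HPO/logging ergonomics (a sketch; the normalizer is hypothetical):

```python
def normalize_spacy_feature(spacy_config):
    """Normalize both proposed shapes to {attribute: embedding_dim}.

    Hypothetical sketch: option 1 is a dict of parallel lists,
    option 2 is a list of one-attribute dicts.
    """
    if isinstance(spacy_config, dict):  # option 1
        return dict(zip(spacy_config["attributes"], spacy_config["embedding_dims"]))
    # option 2
    return {entry["attribute"]: entry["embedding_dim"] for entry in spacy_config}


option_1 = {"attributes": ["pos", "dep"], "embedding_dims": [16, 32]}
option_2 = [
    {"attribute": "pos", "embedding_dim": 16},
    {"attribute": "dep", "embedding_dim": 32},
]
```

Option 2's list-of-dicts shape only buys something if the per-attribute entries grow more options (e.g. stacking), which is exactly the question raised below.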
I would only choose the latter configuration type if we also want to
support stacking the other features, that is, several transformers or word
features (with different weights_file). @dvsrepo I mention this since I do
not know if, for you, the main motivation for stacking features was
actually stacking spacy features, was it?
The spacy feature could also accept a custom attribute, but I think in the
use case of the RelationClassification head we should actually add this
feature automatically to the config. Meaning, the PipelineConfiguration
class, for example, automatically adds a
"spacy": {"attributes": ["bilou_from_relation_entity"], "embedding_dims": [32]}  # the embedding_dim is taken from the `RelationClassification` head config
feature if a RelationClassification head is detected. I think this is
less error-prone; if a head absolutely requires a certain feature, it
should be added automatically.
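The proposed injection could be sketched roughly as follows (all names here, the head "type" key, "bilou_from_relation_entity", and the dim fallback of 32, come from the discussion above and are assumptions, not a settled API):

```python
def add_required_features(config):
    """Sketch of the proposed automatic feature injection.

    Hypothetical: if a RelationClassification head is detected, inject
    the custom spacy attribute the head depends on, taking the
    embedding dim from the head's own config (key name assumed).
    """
    head = config.get("head", {})
    if head.get("type") == "RelationClassification":
        config.setdefault("features", {})["spacy"] = {
            "attributes": ["bilou_from_relation_entity"],
            "embedding_dims": [head.get("entities_embedding_dim", 32)],
        }
    return config
```

The counterpoint raised in the reply above still applies: doing this under the hood makes the feature invisible in the user's explicit config.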
What do you think, @dvsrepo @frascuchon?