🚀 Feature
Currently, when we call get_tokenizer() for any language other than English, we must rely on third-party libraries such as spaCy or NLTK to provide the tokenizers. The developers' current stance is that these are "optional dependencies": they are not official dependencies of torchtext, but they are required if users want to use this feature.
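For concreteness, here is roughly what that looks like today (a minimal sketch; the exact spaCy model string passed as the language argument is an assumption and varies across torchtext and spaCy versions):

```python
from torchtext.data.utils import get_tokenizer

# English works out of the box with the built-in "basic_english" tokenizer.
en_tokenizer = get_tokenizer("basic_english")
print(en_tokenizer("You can now install torchtext using pip!"))

# Any other language currently goes through an optional third-party backend,
# e.g. spaCy, which must be installed (and its model downloaded) separately.
de_tokenizer = get_tokenizer("spacy", language="de_core_news_sm")
print(de_tokenizer("Das ist ein Beispielsatz."))
```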
Motivation
Making these libraries optional feels like a halfway solution to the problem, especially for something as important as tokenization. Since we don't need the whole libraries, only the tokenization pieces, it makes sense to implement those ourselves.
Pitch
I propose one of two solutions: either we include these libraries as hard dependencies, which I believe would unnecessarily bloat the package and does not seem to be what the developers want, or we implement our own tokenizers for a set of supported languages. A rough sketch of the second option follows.
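To illustrate the second option, here is a hypothetical, dependency-free sketch of a small built-in tokenizer registry; none of these names exist in torchtext today, and real per-language rules would need more care than a single regex:

```python
import re

# Split on whitespace and keep punctuation as separate tokens; the same rule
# is a reasonable baseline for many European languages (assumption).
_PATTERNS = {
    "de": re.compile(r"\w+|[^\w\s]"),
    "fr": re.compile(r"\w+|[^\w\s]"),
}

def basic_tokenizer(language="de"):
    """Return a simple rule-based tokenizer for a supported language."""
    if language not in _PATTERNS:
        raise ValueError(f"Unsupported language: {language!r}")
    pattern = _PATTERNS[language]
    return lambda text: pattern.findall(text.lower())

tokenize = basic_tokenizer("de")
print(tokenize("Das ist ein Beispielsatz."))
# ['das', 'ist', 'ein', 'beispielsatz', '.']
```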
I would be happy to work on this.
Additional context
There has been previous discussion of this in issue #178.