non-breaking space does not appear in token stream #135
Comments
Hi @retorquere, it looks like Unicode handling needs a rapid solution. We will take it up as soon as possible. We appreciate your input; please keep posting such cases as you uncover them.
references #135 Co-authored-by: Rachna <rachna@graype.in>
references winkjs/wink-nlp#135 Co-authored-by: Rachna <rachna@graype.in>
… spaces references #135 Co-authored-by: Rachna <rachna@graype.in>
Published new versions with the feature to handle nbsps:
Thanks @retorquere for your contribution.
Cool, thanks. Is it true there's a way to use wink for tokenization only (and get a faster and/or smaller solution that way)? I thought I saw that somewhere and would like to test it.
Yes! You can do it by passing an empty pipe, that is, an empty array as the second argument (as in winkNLP(model, [])), during model instantiation.
This should give you roughly a 4x improvement in speed. You can read more about it at Processing Pipeline.
Will this also bring down the bundle size? My previous attempt gave me a 15 MB bundle, whereas compromise/one was about a tenth of that.
The bundle size will not shrink, but winkNLP plus the model should contribute less than 4 MB to the total bundle size when using browserify.
I'm using esbuild; so the 4 MB includes the model? Then I've been doing something wrong. But I'm afraid I'm pulling this issue off-topic.
Yes, the contribution of winkNLP +
I found my mistake and the bundle size is as you indicate. Thanks for your help!
The token stream does not return \u00a0 as either precedingSpaces or token.out().
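To make the expected behaviour concrete, here is a minimal sketch of a whitespace tokenizer that keeps a non-breaking space as the preceding-space value of the next token. It is a hypothetical illustration only, not winkNLP's implementation or API: the names tokenizeWithSpaces and detokenize are made up for this example, and it assumes tokens are simply runs of non-whitespace characters.

```javascript
// Hypothetical sketch; this is NOT how winkNLP tokenizes internally.
// Each token records the exact whitespace that preceded it, so a
// non-breaking space (\u00a0) survives a round trip through the tokens.
function tokenizeWithSpaces(text) {
  const tokens = [];
  // In JavaScript regular expressions, \s matches \u00a0 as well as
  // the ordinary ASCII space, tab, and line terminators.
  const pattern = /(\s*)(\S+)/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    tokens.push({ precedingSpaces: match[1], value: match[2] });
  }
  return tokens;
}

// Rebuilds the text from the tokens (trailing whitespace is not kept).
function detokenize(tokens) {
  return tokens.map((t) => t.precedingSpaces + t.value).join('');
}

const sample = 'price:\u00a042 USD';
const tokens = tokenizeWithSpaces(sample);

// Show each token with its preceding whitespace as a code point list.
for (const t of tokens) {
  const codes = Array.from(t.precedingSpaces, (c) => c.codePointAt(0).toString(16));
  console.log(t.value, codes);
}
```

With this approach the second token carries the non-breaking space as its preceding whitespace, and detokenize(tokens) reproduces the sample string exactly, which is the round-trip property the report asks for.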