non-breaking space does not appear in token stream #135

Closed
retorquere opened this issue Mar 25, 2024 · 9 comments
@retorquere

const winkNLP = require( 'wink-nlp' );
const model = require( 'wink-eng-lite-web-model' );
const nlp = winkNLP( model );

const text = 'Hello\u00a0World';
const doc = nlp.readDoc(text);

// Print each token together with the whitespace that preceded it.
doc.sentences().each( ( sentence ) => {
  sentence.tokens().each( ( token ) => {
    console.log( [ token.out( nlp.its.precedingSpaces ), token.out() ] );
  } );
} );

The \u00a0 never appears in the output, neither in precedingSpaces nor in token.out().
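For what it's worth, the kind of behavior I'm seeing is easy to reproduce with plain JavaScript: \u00a0 is matched by the \s regex class, so a whitespace-based splitter silently consumes it, while any bookkeeping that only looks for ASCII spaces (U+0020) records nothing. This is a standalone sketch of that failure mode, not a claim about wink-nlp's internals:

```javascript
// Hypothetical sketch: why a non-breaking space can vanish from a token stream.
const text = 'Hello\u00a0World';

// \s matches U+00A0, so the NBSP is consumed as an ordinary separator...
const tokens = text.split(/\s+/);

// ...but a tracker that only counts ASCII spaces (U+0020) sees none of it.
const asciiSpaces = (text.match(/ /g) || []).length;

console.log(tokens);      // [ 'Hello', 'World' ]
console.log(asciiSpaces); // 0 — the NBSP left no trace
```

The net effect matches what I observe above: the text tokenizes cleanly, but the \u00a0 itself is unrecoverable from the output.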

@rachnachakraborty
Member

Hi @retorquere

It looks like Unicode whitespace handling needs a prompt fix.

We will take it up as soon as possible.

We appreciate your input; please keep reporting such cases as you uncover them.

Best,
Rachna

@sanjayaksaxena sanjayaksaxena self-assigned this Mar 27, 2024
sanjayaksaxena added a commit that referenced this issue Mar 27, 2024
references #135

Co-authored-by: Rachna <rachna@graype.in>
sanjayaksaxena added a commit that referenced this issue Mar 29, 2024
references #135

Co-authored-by: Rachna <rachna@graype.in>
sanjayaksaxena added a commit to winkjs/wink-eng-lite-web-model that referenced this issue Mar 31, 2024
references winkjs/wink-nlp#135

Co-authored-by: Rachna <rachna@graype.in>
sanjayaksaxena added a commit that referenced this issue Mar 31, 2024
… spaces

references #135

Co-authored-by: Rachna <rachna@graype.in>
sanjayaksaxena added a commit to winkjs/wink-eng-lite-web-model that referenced this issue Apr 3, 2024
sanjayaksaxena added a commit that referenced this issue Apr 3, 2024
references #135

Co-authored-by: Rachna <rachna@graype.in>
@sanjayaksaxena
Member

Published new versions with support for non-breaking spaces (NBSPs):

  1. wink-nlp version 2.2.0
  2. wink-eng-lite-web-model version 1.6.0

Thanks @retorquere for your contribution.

@retorquere
Author

Cool, thanks. Is it true that wink can be used for tokenization only (and be faster and/or smaller that way)? I thought I saw that somewhere and would like to test it.

@sanjayaksaxena
Member

Yes! You can do it by passing an empty pipe during model instantiation:

nlp = winkNLP( model, [] )

This should give you roughly a 4x speed improvement.

You can read more about it in the Processing Pipeline documentation.

@retorquere
Author

Will this also reduce the bundle size? My previous attempt produced a 15 MB bundle, whereas compromise/one was about a tenth of that.

@sanjayaksaxena
Member

The bundle size itself will not shrink, but winkNLP plus the model should add less than 4 MB to the total bundle size when using browserify.

@retorquere
Author

retorquere commented Apr 4, 2024

I'm using esbuild; so the 4 MB includes the model? Then I must have been doing something wrong. But I'm afraid I'm pulling this issue off-topic.

@sanjayaksaxena
Member

Yes, winkNLP and the model together should contribute less than 4 MB. For any further off-topic discussion, please email wink@graype.in or open a separate thread.

@retorquere
Author

I found my mistake and the bundle size is as you indicated. Thanks for your help!
