non-breaking space does not appear in token stream #135

Closed
retorquere opened this issue Mar 25, 2024 · 9 comments
@retorquere

const winkNLP = require( 'wink-nlp' );
const model = require( 'wink-eng-lite-web-model' );
const nlp = winkNLP( model );

const text = 'Hello\u00a0World';
const doc = nlp.readDoc(text);

// Print each token together with the whitespace that preceded it.
doc.sentences().each( ( sentence ) => {
  sentence.tokens().each( ( token ) => {
    console.log( [ token.out( nlp.its.precedingSpaces ), token.out() ] );
  } );
} );

The \u00a0 never appears in the output, neither in precedingSpaces nor in token.out().
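For what it's worth, the kind of behavior I'm seeing is easy to reproduce with plain JavaScript: \u00a0 is matched by the \s regex class, so a whitespace-based splitter silently consumes it, while any bookkeeping that only looks for ASCII spaces (U+0020) records nothing. This is a standalone sketch of that failure mode, not a claim about wink-nlp's internals:

```javascript
// Hypothetical sketch: why a non-breaking space can vanish from a token stream.
const text = 'Hello\u00a0World';

// \s matches U+00A0, so the NBSP is consumed as an ordinary separator...
const tokens = text.split(/\s+/);

// ...but a tracker that only counts ASCII spaces (U+0020) sees none of it.
const asciiSpaces = (text.match(/ /g) || []).length;

console.log(tokens);      // [ 'Hello', 'World' ]
console.log(asciiSpaces); // 0 — the NBSP left no trace
```

The net effect matches what I observe above: the text tokenizes cleanly, but the \u00a0 itself is unrecoverable from the output.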

@rachnachakraborty
Member

Hi @retorquere

It looks like Unicode whitespace handling needs a prompt fix.

We will take it up as soon as possible.

We appreciate your input; please keep reporting such cases as you uncover them.

Best,
Rachna

@sanjayaksaxena sanjayaksaxena self-assigned this Mar 27, 2024
sanjayaksaxena added a commit that referenced this issue Mar 27, 2024
references #135

Co-authored-by: Rachna <rachna@graype.in>
sanjayaksaxena added a commit that referenced this issue Mar 29, 2024
references #135

Co-authored-by: Rachna <rachna@graype.in>
sanjayaksaxena added a commit to winkjs/wink-eng-lite-web-model that referenced this issue Mar 31, 2024
references winkjs/wink-nlp#135

Co-authored-by: Rachna <rachna@graype.in>
sanjayaksaxena added a commit that referenced this issue Mar 31, 2024
… spaces

references #135

Co-authored-by: Rachna <rachna@graype.in>
sanjayaksaxena added a commit to winkjs/wink-eng-lite-web-model that referenced this issue Apr 3, 2024
sanjayaksaxena added a commit that referenced this issue Apr 3, 2024
references #135

Co-authored-by: Rachna <rachna@graype.in>
@sanjayaksaxena
Member

Published new versions with support for non-breaking spaces (NBSPs):

  1. wink-nlp version 2.2.0
  2. wink-eng-lite-web-model version 1.6.0

Thanks @retorquere for your contribution.

@retorquere
Author

Cool, thanks. Is it true that wink can be used for tokenization only (and be faster and/or smaller that way)? I thought I saw that somewhere and would like to test it.

@sanjayaksaxena
Member

Yes! You can do it by passing an empty pipe during model instantiation:

nlp = winkNLP( model, [] )

This should give you roughly a 4x speed improvement.

You can read more about it in the Processing Pipeline documentation.

@retorquere
Author

Will this also reduce the bundle size? My previous attempt produced a 15 MB bundle, whereas compromise/one was about a tenth of that.

@sanjayaksaxena
Member

The bundle size itself will not shrink, but winkNLP plus the model should add less than 4 MB to the total bundle size when using browserify.

@retorquere
Author

retorquere commented Apr 4, 2024

I'm using esbuild; so the 4 MB includes the model? Then I must have been doing something wrong. But I'm afraid I'm pulling this issue off-topic.

@sanjayaksaxena
Member

Yes, winkNLP and the model together should contribute less than 4 MB. For any further off-topic discussion, please email wink@graype.in or open a separate thread.

@retorquere
Author

I found my mistake and the bundle size is as you indicated. Thanks for your help!
