Add alexScanB and alexScanUserB, fix ByteString wrappers w.r.t. unicode #32

Closed
wants to merge 2 commits into from

@yinguanhao

The ByteString wrappers return truncated lexemes when the input contains non-ASCII characters. See tests/tokens_bytestring_unicode.x.

This happens because alexScan returns the token length in characters, while the ByteString wrappers need it in bytes.

I have added two functions, alexScanB and alexScanUserB, which return the length in bytes, and modified the ByteString wrappers to use alexScanB.
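
To illustrate the mismatch described above, here is a minimal sketch using only `base`. The names `encodeChar`, `encodeString`, `lexeme`, and `truncated` are hypothetical helpers written for this example; they are not part of Alex or its wrappers. It assumes the input is UTF-8 encoded and shows that slicing a byte buffer by a character count drops trailing bytes.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper (not Alex's code): UTF-8 encode a single character.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)                     -- leading bits
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F) -- continuation byte

encodeString :: String -> [Word8]
encodeString = concatMap encodeChar

-- "h", e-acute (U+00E9), "llo": 5 characters, 6 bytes in UTF-8.
lexeme :: String
lexeme = "h\233llo"

charLength, byteLength :: Int
charLength = length lexeme
byteLength = length (encodeString lexeme)

-- Slicing the byte buffer with the character count loses the final byte,
-- which is the kind of truncation reported above.
truncated :: [Word8]
truncated = take charLength (encodeString lexeme)
```

Here `charLength` is 5 and `byteLength` is 6, so `truncated` is one byte short of the full lexeme.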

yinguanhao added some commits Oct 29, 2013
@yinguanhao yinguanhao Fix bytestring wrappers, add alexScanB and alexScanUserB
alexScanB and alexScanUserB count every bytes, instead of just first bytes of characters
abc860c
@yinguanhao yinguanhao Doc: alexScanB and alexScanUserB a124ec1
@simonmar
Owner

Fixed. I think we always want the number of bytes for the bytestring wrappers, so I've done it slightly differently.

@simonmar simonmar closed this Nov 11, 2013
@yinguanhao

In some cases the number of bytes is desirable when not using wrappers. That is why I proposed alexScanB and alexScanUserB.

If the new APIs sound bad, maybe we can add an option or directive for this behavior?

@simonmar
Owner

The change I made lets you choose whether you want the number of bytes or chars by defining incrLength appropriately. However, this is a change to the API, so I need to think about it some more - we want something that is backwards compatible too.

@simonmar simonmar reopened this Nov 11, 2013
@simonmar simonmar added a commit that referenced this pull request Nov 11, 2013
@simonmar The fix for #32 implied an API change, so document it
Also follow the API change in Alex's own lexer, so that it bootstraps
again.
4f74772
@simonmar simonmar closed this Nov 11, 2013
@simonmar simonmar added a commit that referenced this pull request Nov 11, 2013
@simonmar Revert "The fix for #32 implied an API change, so document it"
This reverts commit 4f74772.
52bcf7c
@simonmar simonmar added a commit that referenced this pull request Nov 11, 2013
@simonmar Revert "Fix the token length for ByteString wrappers (#32)"
This reverts commit a325dde.
3fc0d63
@simonmar simonmar added a commit that referenced this pull request Nov 11, 2013
@simonmar On second thoughts, fix #32 without an API change.
The length returned in AlexReturn is really bogus, we should be moving
towards clients managing the token length themselves as part of
AlexInput.  GHC itself has always done this, and now the ByteString
wrappers do it too.  We should make the other wrappers keep track of
their own token length, and then we could remove the length field of
AlexToken/AlexSkip.  Maybe we should make these constructors into
records first.
3487dcc
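
The idea in the commit message above, that the client tracks token length as part of its input state, could be sketched roughly as follows. This is an assumption-laden illustration, not the code in Alex's ByteString wrappers: the `Input` type, its field names, `tokenBytes`, and `consume` are all hypothetical. Only the shape of `getByte` is modelled on the `alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)` function that an Alex 3 lexer expects the user to supply.

```haskell
import Data.Word (Word8)

-- Hypothetical input state; the field names are illustrative only.
data Input = Input
  { inputOffset :: !Int     -- bytes consumed since the start of the input
  , inputRest   :: [Word8]  -- bytes not yet consumed
  } deriving (Eq, Show)

-- Same shape as a user-supplied alexGetByte: return the next byte and
-- the advanced input, bumping the byte offset by one.
getByte :: Input -> Maybe (Word8, Input)
getByte (Input _   [])       = Nothing
getByte (Input off (b : bs)) = Just (b, Input (off + 1) bs)

-- Token length in bytes, derived from the input states before and after
-- a scan, without relying on a length reported by the scanner.
tokenBytes :: Input -> Input -> Int
tokenBytes before after = inputOffset after - inputOffset before

-- Stand-in for a scanner: consume up to n bytes, stopping at end of input.
consume :: Int -> Input -> Input
consume n inp
  | n <= 0    = inp
  | otherwise = maybe inp (consume (n - 1) . snd) (getByte inp)
```

With this arrangement the length is always in bytes because it is computed from offsets the client maintains itself, regardless of what unit the scanner counts in.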
@mamash mamash pushed a commit to joyent/pkgsrc-wip that referenced this pull request Nov 12, 2013
szptvlfn Update to 3.1.2
Changes in 3.1.2:
    Add missing file to extra-source-files
Changes in 3.1.1:
    Bug fixes (#24, #30, #31, #32)

( #32 => simonmar/alex#32 )
( #31 => simonmar/alex#31 )
( #30 => simonmar/alex#30 )
( #24 => simonmar/alex#24 )
757deed