Add alexScanB and alexScanUserB, fix ByteString wrappers w.r.t. unicode #32

Closed
wants to merge 2 commits into from

2 participants

@yinguanhao

The ByteString wrappers return truncated lexemes when the input contains non-ASCII characters. See tests/tokens_bytestring_unicode.x.

This happens because alexScan returns the token length in characters, while the ByteString wrappers use that length as a byte count when slicing the input.

I have added two functions, alexScanB and alexScanUserB, which return the length in bytes, and modified the ByteString wrappers to use alexScanB.
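To make the mismatch concrete, here is a minimal standalone sketch (not code from Alex itself). It reuses the lead-byte test from templates/GenericTemplate.hs (`c < 0x80 || c >= 0xC0`) to count characters, then slices a ByteString by that count. The names `isLeadByte`, `charLength`, `byteLength`, `sample` and `truncated` are made up for illustration.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- True when the byte starts a UTF-8 encoded character, i.e. it is not a
-- continuation byte (0x80..0xBF). Same test as in the scanner template.
isLeadByte :: Word8 -> Bool
isLeadByte c = c < 0x80 || c >= 0xC0

-- Length in characters: count only the first byte of each character.
charLength :: B.ByteString -> Int
charLength = B.length . B.filter isLeadByte

-- Length in bytes: count every byte.
byteLength :: B.ByteString -> Int
byteLength = B.length

-- "αββ" encoded as UTF-8 (the same bytes the new test file uses).
sample :: B.ByteString
sample = B.pack [206, 177, 206, 178, 206, 178]

-- Slicing by the character count keeps only half of the bytes,
-- which is the truncation described above.
truncated :: B.ByteString
truncated = B.take (charLength sample) sample
```

Loaded in GHCi, `charLength sample` is 3 and `byteLength sample` is 6, so `truncated` holds only the first three bytes and ends in the middle of the second character.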

yinguanhao added some commits
@yinguanhao yinguanhao Fix bytestring wrappers, add alexScanB and alexScanUserB
alexScanB and alexScanUserB count every byte, instead of just the first byte of each character
abc860c
@yinguanhao yinguanhao Doc: alexScanB and alexScanUserB a124ec1
@simonmar
Owner

Fixed. I think we always want the number of bytes for the bytestring wrappers, so I've done it slightly differently.

@simonmar simonmar closed this
@yinguanhao

In some cases the number of bytes is desirable when not using wrappers. That is why I proposed alexScanB and alexScanUserB.

If the new APIs sound bad, maybe we can add an option or directive for this behavior?

@simonmar
Owner

The change I made lets you choose whether you want the number of bytes or chars by defining incrLength appropriately. However, this is a change to the API, so I need to think about it some more - we want something that is backwards compatible too.

@simonmar simonmar reopened this
@simonmar simonmar referenced this pull request from a commit
@simonmar The fix for #32 implied an API change, so document it
Also follow the API change in Alex's own lexer, so that it bootstraps
again.
4f74772
@simonmar simonmar closed this
@simonmar simonmar referenced this pull request from a commit
@simonmar Revert "The fix for #32 implied an API change, so document it"
This reverts commit 4f74772.
52bcf7c
@simonmar simonmar referenced this pull request from a commit
@simonmar On second thoughts, fix #32 without an API change.
The length returned in AlexReturn is really bogus, we should be moving
towards clients managing the token length themselves as part of
AlexInput.  GHC itself has always done this, and now the ByteString
wrappers do it too.  We should make the other wrappers keep track of
their own token length, and then we could remove the length field of
AlexToken/AlexSkip.  Maybe we should make these constructors into
records first.
3487dcc
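The idea in this commit message, that the client tracks the token length inside its own input type, can be sketched roughly as follows. This is an illustration under assumptions and not the code from the commit: `Input`, `tokenBytes`, `rest`, `getByte`, `scanWhile` and `lexeme` are hypothetical names, and `scanWhile` is only a stand-in for the generated scanner.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Hypothetical input type: the bytes consumed so far for the current
-- token, plus the remaining input.
data Input = Input
  { tokenBytes :: !Int
  , rest       :: !B.ByteString
  }

-- Has the same shape as an alexGetByte-style function: return the next
-- byte and the updated input. The byte counter is bumped here, so the
-- length never has to come back from the scanner.
getByte :: Input -> Maybe (Word8, Input)
getByte (Input n bs) =
  case B.uncons bs of
    Nothing       -> Nothing
    Just (b, bs') -> Just (b, Input (n + 1) bs')

-- Stand-in for the scanner: consume bytes while the predicate holds.
scanWhile :: (Word8 -> Bool) -> Input -> Input
scanWhile p inp =
  case getByte inp of
    Just (b, inp') | p b -> scanWhile p inp'
    _                    -> inp

-- Extract the lexeme using the length recorded in the final input.
lexeme :: B.ByteString -> Input -> B.ByteString
lexeme original end = B.take (tokenBytes end) original
```

With this layout the client decides what "length" means (bytes here), which is why a length field in the scanner's result becomes redundant.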
@mamash mamash referenced this pull request from a commit in joyent/pkgsrc-wip
szptvlfn Update to 3.1.2
Changes in 3.1.2:
    Add missing file to extra-source-files
Changes in 3.1.1:
	Bug fixes (#24, #30, #31, #32)

( #32 => simonmar/alex#32 )
( #31 => simonmar/alex#31 )
( #30 => simonmar/alex#30 )
( #24 => simonmar/alex#24 )
757deed
Commits on Oct 29, 2013
  1. @yinguanhao

    Fix bytestring wrappers, add alexScanB and alexScanUserB

    yinguanhao authored
    alexScanB and alexScanUserB count every byte, instead of just the first byte of each character
  2. @yinguanhao
5 doc/alex.xml
@@ -1216,6 +1216,11 @@ data AlexReturn action
<para>The extra argument, of some type <literal>user</literal>,
is passed to each predicate.</para>
+
+ <para>There are also <literal>alexScanB</literal> and
+ <literal>alexScanUserB</literal> which calculate token length in
+ bytes instead of characters. Consider using them if you are
+ using ByteString input.</para>
</section>
<section id="wrappers">
20 templates/GenericTemplate.hs
@@ -116,8 +116,14 @@ data AlexReturn a
alexScan input IBOX(sc)
= alexScanUser undefined input IBOX(sc)
-alexScanUser user input IBOX(sc)
- = case alex_scan_tkn user input ILIT(0) input sc AlexNone of
+alexScanB input IBOX(sc)
+ = alexScanUserB undefined input IBOX(sc)
+
+alexScanUser = alexScanUser1 False
+alexScanUserB = alexScanUser1 True
+
+alexScanUser1 lenb user input IBOX(sc)
+ = case alex_scan_tkn lenb user input ILIT(0) input sc AlexNone of
(AlexNone, input') ->
case alexGetByte input of
Nothing ->
@@ -143,11 +149,10 @@ alexScanUser user input IBOX(sc)
#endif
AlexToken input''' len k
-
-- Push the input through the DFA, remembering the most recent accepting
-- state it encountered.
-alex_scan_tkn user orig_input len input s last_acc =
+alex_scan_tkn lenb user orig_input len input s last_acc =
input `seq` -- strict in the input
let
new_acc = (check_accs (alex_accept `quickIndex` IBOX(s)))
@@ -173,8 +178,9 @@ alex_scan_tkn user orig_input len input s last_acc =
ILIT(-1) -> (new_acc, input)
-- on an error, we want to keep the input *before* the
-- character that failed, not after.
- _ -> alex_scan_tkn user orig_input (if c < 0x80 || c >= 0xC0 then PLUS(len,ILIT(1)) else len)
- -- note that the length is increased ONLY if this is the 1st byte in a char encoding)
+ _ -> alex_scan_tkn lenb user orig_input (if lenb || c < 0x80 || c >= 0xC0 then PLUS(len,ILIT(1)) else len)
+ -- note that the length is increased ONLY if this is the 1st byte in a char encoding
+ -- unless every byte should be counted
new_input new_s new_acc
}
where
@@ -230,7 +236,7 @@ alexPrevCharIsOneOf arr _ input _ _ = arr ! alexInputPrevChar input
--alexRightContext :: Int -> AlexAccPred _
alexRightContext IBOX(sc) user _ _ input =
- case alex_scan_tkn user input ILIT(0) input sc AlexNone of
+ case alex_scan_tkn True user input ILIT(0) input sc AlexNone of
(AlexNone, _) -> False
_ -> True
-- TODO: there's no need to find the longest
8 templates/wrappers.hs
@@ -295,7 +295,7 @@ alexSetStartCode sc = Alex $ \s -> Right (s{alex_scd=sc}, ())
alexMonadScan = do
inp <- alexGetInput
sc <- alexGetStartCode
- case alexScan inp sc of
+ case alexScanB inp sc of
AlexEOF -> alexEOF
AlexError ((AlexPn _ line column),_,_) -> alexError $ "lexical error at line " ++ (show line) ++ ", column " ++ (show column)
AlexSkip inp' len -> do
@@ -362,7 +362,7 @@ alexGetByte (_,[],(c:s)) = case utf8Encode c of
-- alexScanTokens :: String -> [token]
alexScanTokens str = go ('\n',str)
where go inp@(_,str) =
- case alexScan inp 0 of
+ case alexScanB inp 0 of
AlexEOF -> []
AlexError _ -> error "lexical error"
AlexSkip inp' len -> go inp'
@@ -376,7 +376,7 @@ alexScanTokens str = go ('\n',str)
-- alexScanTokens :: String -> [token]
alexScanTokens str = go (AlexInput '\n' str)
where go inp@(AlexInput _ str) =
- case alexScan inp 0 of
+ case alexScanB inp 0 of
AlexEOF -> []
AlexError _ -> error "lexical error"
AlexSkip inp' len -> go inp'
@@ -409,7 +409,7 @@ alexScanTokens str = go (alexStartPos,'\n',[],str)
--alexScanTokens :: ByteString -> [token]
alexScanTokens str = go (alexStartPos,'\n',str)
where go inp@(pos,_,str) =
- case alexScan inp 0 of
+ case alexScanB inp 0 of
AlexEOF -> []
AlexError ((AlexPn _ line column),_,_) -> error $ "lexical error at line " ++ (show line) ++ ", column " ++ (show column)
AlexSkip inp' len -> go inp'
2  tests/Makefile
@@ -10,7 +10,7 @@ else
HS_PROG_EXT = .bin
endif
-TESTS = unicode.x simple.x tokens.x tokens_posn.x tokens_gscan.x tokens_bytestring.x tokens_posn_bytestring.x tokens_strict_bytestring.x
+TESTS = unicode.x simple.x tokens.x tokens_posn.x tokens_gscan.x tokens_bytestring.x tokens_bytestring_unicode.x tokens_posn_bytestring.x tokens_strict_bytestring.x
TEST_ALEX_OPTS = --template=..
43 tests/tokens_bytestring_unicode.x
@@ -0,0 +1,43 @@
+{
+{-# LANGUAGE OverloadedStrings #-}
+module Main (main) where
+import System.Exit
+import Data.ByteString.Lazy.Char8 (unpack)
+}
+
+%wrapper "basic-bytestring"
+
+$digit = 0-9 -- digits
+$alpha = [a-zA-Zαβ] -- alphabetic characters
+
+tokens :-
+
+ $white+ ;
+ "--".* ;
+ let { \s -> Let }
+ in { \s -> In }
+ $digit+ { \s -> Int (read (unpack s)) }
+ [\=\+\-\*\/\(\)] { \s -> Sym (head (unpack s)) }
+ $alpha [$alpha $digit \_ \']* { \s -> Var (unpack s) }
+
+{
+-- Each right-hand side has type :: ByteString -> Token
+
+-- The token type:
+data Token =
+ Let |
+ In |
+ Sym Char |
+ Var String |
+ Int Int |
+ Err
+ deriving (Eq,Show)
+
+main = if test1 /= result1 then exitFailure
+ else exitWith ExitSuccess
+
+-- \206\177\206\178\206\178 is "αββ" utf-8 encoded
+test1 = alexScanTokens " let in 012334\n=+*foo \206\177\206\178\206\178 bar__'"
+result1 = [Let,In,Int 12334,Sym '=',Sym '+',Sym '*',Var "foo",Var "\206\177\206\178\206\178",Var "bar__'"]
+
+}