Unicode appears to be broken on Windows String wrappers #19

Closed
dagit opened this Issue Mar 3, 2013 · 6 comments

@dagit
dagit commented Mar 3, 2013

Given this Alex file, Main.x:

{
{-# OPTIONS -w  #-}
module Main where
}

%wrapper "basic"

tokens :-
  $white+                               ;
  λ                                     { const TokenLambda  }

{
-- The token type:
data Token = TokenLambda
  deriving Show

main :: IO ()
main = do
  i <- readFile "input"
  print (alexScanTokens i)

}

And a single Unicode character in the input file:

$ cat input
λ
[dagit@mango:~/Documents/Repos/testing/alex-unicode]
$ alex Main.x
[dagit@mango:~/Documents/Repos/testing/alex-unicode]
$ ghc --make Main.hs
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main.exe ...
[dagit@mango:~/Documents/Repos/testing/alex-unicode]
$ ./Main.exe
Main.exe: lexical error

I tried setting the encoding of the file to UTF-8 before reading it, but that resulted in a lexical error looking for EOF.

This happens on Windows 7 64-bit, GHC 7.6.1 (32-bit build), Alex 3.0.2. I tried the above on OS X 10.8.2 using GHC 7.6.2 and Alex 3.0.2 and it worked as expected, so I believe this to be a Windows-specific GHC bug.

@dagit
dagit commented Mar 3, 2013

Setting the code page to UTF-8 (65001) fixes the problem:

$ chcp.com 65001
Active code page: 65001
[dagit@mango:~/Documents/Repos/testing/alex-unicode]
$ ./Main.exe
[TokenLambda]
@simonmar
Owner
simonmar commented Mar 4, 2013

readFile decodes the file according to the current code page setting, so to read a UTF-8 file you need to set your code page appropriately or use hSetEncoding. I don't think this is an Alex issue.

@simonmar simonmar closed this Mar 4, 2013
@dagit
dagit commented Mar 4, 2013

I tried to use hSetEncoding and I thought it failed in a surprising way (my comment about lexical error on EOF). I'll try it again soon just to be certain.

@simonmar
Owner
simonmar commented Mar 5, 2013

Ok, you can check whether the file is being read correctly with print (map ord str) or similar. If you definitely have the right characters in the string and Alex still fails, please re-open the ticket and I'll take a look.
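
As a rough sketch of that check (assuming the file is named input as in the report; codePoints is a hypothetical helper name, not part of Alex), something like the following prints the code point of every character that readFile produced. A correctly decoded λ is the single code point 955 (U+03BB). If the file was decoded with the wrong code page, its UTF-8 bytes will typically show up as two or more other values instead.

```haskell
import Data.Char (ord)

-- Hypothetical helper: the code point of each character in a decoded string.
codePoints :: String -> [Int]
codePoints = map ord

main :: IO ()
main = do
  -- readFile decodes using the default (locale / code page) encoding,
  -- which is exactly what we want to inspect here.
  str <- readFile "input"
  print (codePoints str)
```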

@dagit
dagit commented Mar 5, 2013

You are right. When I tested previously I must have done something wrong, because hSetEncoding is definitely working today. It's unfortunate that it's needed on Windows, but I can live with that.

Thanks for your help!

@simonmar
Owner
simonmar commented Mar 5, 2013

If you have a file in a specific encoding (e.g. UTF-8) then you should always use hSetEncoding, rather than relying on the user's locale/codepage being set correctly. This is the case on every platform.
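
A minimal sketch of that advice, assuming a UTF-8 file. readFileUtf8 and the demo file name are hypothetical names chosen for illustration; hSetEncoding and utf8 come from System.IO. The demo writes its own sample file so that it does not depend on the code page or on an existing input file.

```haskell
import Data.Char (ord)
import System.IO

-- Hypothetical helper: read a whole file, always decoding it as UTF-8
-- instead of using the locale / code page default.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = do
  h <- openFile path ReadMode
  hSetEncoding h utf8
  contents <- hGetContents h
  -- hGetContents is lazy: force the whole string before closing the handle.
  length contents `seq` hClose h
  return contents

main :: IO ()
main = do
  let demoFile = "utf8-demo-input.txt"  -- hypothetical file name
  -- Write a lambda followed by a newline, encoded as UTF-8.
  withFile demoFile WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h "\955\n"
  str <- readFileUtf8 demoFile
  print (map ord str)  -- prints [955,10]
```

In the Main.x from the report, a helper like this could stand in for readFile, so that main passes the result of readFileUtf8 "input" to alexScanTokens.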
