unicode char (emoji) considered illegal #898

tittoassini · 2018-11-08T14:53:54Z

Description

Some valid Unicode characters (for example, emoji) cause a runtime error.

Steps to Reproduce

module Main where
emoji = "\x1F600"
main = return ()

When run, this causes:
Exception in thread "main" java.lang.ClassFormatError: Illegal UTF8 string in constant pool in class file main/Main$emoji

Your Environment

Did you install an older version of Eta/Etlas before?
Yes, but I removed it before installing the latest version.

Current Eta & Etlas version:
eta-0.8.6b2.

etlas version 1.5.0.0
compiled using version 2.1.0.0 of the etlas-cabal library

Operating System and version:
macOS Sierra

rahulmutt · 2018-11-08T17:16:41Z

I was able to reproduce this. It looks like a fairly straightforward fix: it requires debugging how Unicode characters are transformed down the compilation pipeline.

We can work backwards step-by-step to see where the incorrect encoding is being produced.

Do a successful source installation of eta.
The part of the codegen that deals with string literals is here:
https://github.com/typelead/eta/blob/master/compiler/Eta/CodeGen/Utils.hs#L41-L78
The code is rather long because it has to get around Java's string size limitations by breaking up the string into chunks if it exceeds the limit. I'll make it easier for you by telling you what you can debug.
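To illustrate the chunking idea only (this is not the code in Eta/CodeGen/Utils.hs): a class-file string constant stores its length in a 16-bit field, so one constant can hold at most 65535 encoded bytes. The sketch below greedily splits a string so that each chunk stays under a byte budget. The names charCost and chunkByBytes are hypothetical, and the per-character costs assume Modified UTF-8 encoding.

```haskell
import Data.Char (ord)

-- | Number of bytes one character occupies in Modified UTF-8.
--   NUL takes two bytes; supplementary characters take six
--   (two surrogates, three bytes each).
charCost :: Char -> Int
charCost c
  | cp == 0      = 2
  | cp < 0x80    = 1
  | cp < 0x800   = 2
  | cp < 0x10000 = 3
  | otherwise    = 6
  where cp = ord c

-- | Greedily split a string so that every chunk encodes to at most
--   'limit' bytes. The limit is clamped to at least 6 so that any
--   single character always fits in a chunk (assumption of this sketch).
chunkByBytes :: Int -> String -> [String]
chunkByBytes limit = go
  where
    cap = max 6 limit

    go [] = []
    go s  = let (chunk, rest) = takeChunk 0 s in chunk : go rest

    -- Take characters while the running byte count stays within 'cap'.
    takeChunk _ [] = ([], [])
    takeChunk used (c:cs)
      | used + charCost c > cap = ([], c:cs)
      | otherwise =
          let (more, rest) = takeChunk (used + charCost c) cs
          in (c : more, rest)
```

Splitting on whole characters rather than raw bytes means a surrogate pair is never cut in half across two chunks.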

Replace

cgLit (MachStr s)           = (jlong, genCode)

with

cgLit (MachStr s)           = 
  traceShow (s, string, isLatin1, BL.unpack $ encodeModifiedUtf8 string) (jlong, genCode)

(Note: You'll need to add import Debug.Trace to make this code compile)

s is the raw ByteString version of the string literal that has been transformed from the original code.
string is a decoded Text - this should show a valid emoji if your terminal supports it.
isLatin1 decides whether we're dumping the string in latin1 form (very space inefficient, but required in some cases) or utf-8 form.
BL.unpack $ encodeModifiedUtf8 string will be the actual sequence of bytes that are written straight into the classfile. This is what needs to be paid close attention.

After you've added the traces, you can simply run stack install eta rather than a full cleaninstall since you've only changed the compiler. You may have to run etlas clean in your test project and try running it again to get the trace output.

Let me know if you need further assistance.

rahulmutt · 2018-11-08T17:17:33Z

For this particular program, the expected byte sequence should be:
www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=1F600&mode=hex

rahulmutt · 2018-12-12T08:38:10Z

This was an issue in codec-jvm - it was not handling supplementary characters in Modified UTF-8 properly. It has been fixed and a new test has been added.
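For readers unfamiliar with the encoding, here is a minimal sketch of how Modified UTF-8 treats supplementary characters. It is not the codec-jvm implementation; the function names toUtf16, encodeUnit, and modifiedUtf8 are made up for illustration. The class-file format does not use the four-byte UTF-8 form: a code point above U+FFFF is first split into a UTF-16 surrogate pair, and each surrogate is then written as its own three-byte sequence. NUL is written as the two bytes C0 80.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Split a character into UTF-16 code units (one unit, or a surrogate pair).
toUtf16 :: Char -> [Int]
toUtf16 c
  | cp < 0x10000 = [cp]
  | otherwise    = [0xD800 + (v `shiftR` 10), 0xDC00 + (v .&. 0x3FF)]
  where
    cp = ord c
    v  = cp - 0x10000

-- | Encode a single 16-bit code unit using the 1-, 2-, or 3-byte form.
--   Zero deliberately falls through to the 2-byte form (C0 80).
encodeUnit :: Int -> [Word8]
encodeUnit u
  | u /= 0 && u < 0x80 = [fromIntegral u]
  | u < 0x800 =
      [ fromIntegral (0xC0 .|. (u `shiftR` 6))
      , fromIntegral (0x80 .|. (u .&. 0x3F)) ]
  | otherwise =
      [ fromIntegral (0xE0 .|. (u `shiftR` 12))
      , fromIntegral (0x80 .|. ((u `shiftR` 6) .&. 0x3F))
      , fromIntegral (0x80 .|. (u .&. 0x3F)) ]

-- | Modified UTF-8 bytes for a whole string.
modifiedUtf8 :: String -> [Word8]
modifiedUtf8 = concatMap encodeUnit . concatMap toUtf16
```

With this sketch, modifiedUtf8 "\x1F600" yields the six bytes ED A0 BD ED B8 80, whereas standard UTF-8 for the same character is the four bytes F0 9F 98 80.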

rahulmutt added the bug label Nov 8, 2018

rahulmutt added the inprogress label Dec 10, 2018

rahulmutt closed this as completed in 14f1286 Dec 12, 2018

rahulmutt removed the inprogress label Dec 12, 2018
