unicode char (emoji) considered illegal #898

Closed
tittoassini opened this issue Nov 8, 2018 · 3 comments
@tittoassini

Description

Some valid Unicode characters (for example, emoji) cause a runtime error.

Steps to Reproduce

module Main where
emoji = "\x1F600"
main = return ()

Running this program fails with:
Exception in thread "main" java.lang.ClassFormatError: Illegal UTF8 string in constant pool in class file main/Main$emoji

Your Environment

Did you install an older version of Eta/Etlas before?
Yes, but I removed it before installing the latest version.

Current Eta & Etlas version:
eta-0.8.6b2.

etlas version 1.5.0.0
compiled using version 2.1.0.0 of the etlas-cabal library

Operating System and version:
macOS Sierra

@rahulmutt rahulmutt added the bug label Nov 8, 2018
@rahulmutt

rahulmutt commented Nov 8, 2018

I was able to reproduce this. The fix looks fairly straightforward: it requires debugging how Unicode characters are transformed down the compilation pipeline.

We can work backwards step-by-step to see where the incorrect encoding is being produced.

  1. Do a successful source installation of eta.
  2. The part of the codegen that deals with string literals is here:
    https://github.com/typelead/eta/blob/master/compiler/Eta/CodeGen/Utils.hs#L41-L78
    The code is rather long because it has to get around Java's string size limitations by breaking up the string into chunks if it exceeds the limit. I'll make it easier for you by telling you what you can debug.

Replace

cgLit (MachStr s)           = (jlong, genCode)

with

cgLit (MachStr s)           = 
  traceShow (s, string, isLatin1, BL.unpack $ encodeModifiedUtf8 string) (jlong, genCode)

(Note: You'll need to add import Debug.Trace to make this code compile)

  • s is the raw ByteString version of the string literal that has been transformed from the original code.
  • string is a decoded Text - this should show a valid emoji if your terminal supports it.
  • isLatin1 decides whether we're dumping the string in latin1 form (very space inefficient, but required in some cases) or utf-8 form.
  • BL.unpack $ encodeModifiedUtf8 string will be the actual sequence of bytes that are written straight into the classfile. This is the value to pay close attention to.
  3. After you've added the traces, you can simply run stack install eta rather than a full cleaninstall since you've only changed the compiler. You may have to run etlas clean in your test project and try running it again to get the trace output.
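
For anyone unfamiliar with Debug.Trace, the tracing pattern used in the steps above can be sketched in isolation. This is only an illustration: `literalSize` is a hypothetical stand-in for a pure function and is not Eta's `cgLit`.

```haskell
import Debug.Trace (traceShow)
import Data.Word (Word8)

-- Hypothetical stand-in for a pure codegen function (NOT Eta's cgLit).
-- traceShow prints its first argument to stderr when the result is
-- evaluated, then returns its second argument unchanged.
literalSize :: [Word8] -> Int
literalSize bytes =
  traceShow ("literal bytes", bytes) (length bytes)
```

Because Haskell is lazy, the trace is printed only when the traced result is actually demanded, so no output means the value was never evaluated (or the module was not rebuilt).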

Let me know if you need further assistance.

@rahulmutt

For this particular program, the expected byte sequence should be:
www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=1F600&mode=hex

@rahulmutt

This was an issue in codec-jvm - it was not handling supplementary characters in Modified UTF-8 properly. It has been fixed and a new test has been added.
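
To illustrate what handling supplementary characters involves: the JVM class-file format stores strings in Modified UTF-8, where a code point above U+FFFF is first split into a UTF-16 surrogate pair and each surrogate is then written as its own three-byte sequence. A four-byte standard UTF-8 sequence is rejected by the class-file verifier because bytes in the range 0xF0 to 0xFF are not allowed. The following is a minimal sketch of that encoding, not the codec-jvm implementation; the names `sketchModifiedUtf8`, `toUtf16` and `encodeUnit` are hypothetical.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Split a Unicode code point into UTF-16 code units.
-- Code points above U+FFFF become a high/low surrogate pair.
toUtf16 :: Int -> [Int]
toUtf16 cp
  | cp < 0x10000 = [cp]
  | otherwise =
      let v = cp - 0x10000
      in [0xD800 .|. (v `shiftR` 10), 0xDC00 .|. (v .&. 0x3FF)]

-- Encode a single UTF-16 code unit as Modified UTF-8.
encodeUnit :: Int -> [Word8]
encodeUnit u
  | u >= 0x0001 && u <= 0x007F = [fromIntegral u]
  | u <= 0x07FF =
      -- Also covers u == 0, which Modified UTF-8 writes as C0 80.
      [ fromIntegral (0xC0 .|. (u `shiftR` 6))
      , fromIntegral (0x80 .|. (u .&. 0x3F)) ]
  | otherwise =
      -- Three-byte form; surrogates are encoded here individually.
      [ fromIntegral (0xE0 .|. (u `shiftR` 12))
      , fromIntegral (0x80 .|. ((u `shiftR` 6) .&. 0x3F))
      , fromIntegral (0x80 .|. (u .&. 0x3F)) ]

-- Hypothetical helper: encode a whole String as Modified UTF-8 bytes.
sketchModifiedUtf8 :: String -> [Word8]
sketchModifiedUtf8 = concatMap (concatMap encodeUnit . toUtf16 . ord)
```

Under this sketch, `sketchModifiedUtf8 "\x1F600"` yields the six bytes ED A0 BD ED B8 80 (the surrogates D83D and DE00, three bytes each), whereas standard UTF-8 for the same character is the four bytes F0 9F 98 80.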
