Fix #1801: Store the already processed byte string in Val.Chars #1855
Wojciech
The Hello World case fails, which is fair enough and answers my motivating question. Thank you for considering this (and thank you for the underlying PR).
It's strange, since the two cases you mentioned work on my machine. Could it be an OS-related issue? I'm using Fedora 32 and LLVM 10.0.0.
nscplugin/src/main/scala/scala/scalanative/nscplugin/NirGenExpr.scala
Re: failing test cases. I'll probe around at my end to see if something is wonky in my Went one step further and it has something to do with the c interpolation. I think In particular, I want to see what happens with those test cases with both current I have some things I need to accomplish, like food in the house before pending
Update:
Short story
A nasty and astonishing bug should be fixed once NirGenExpr.scala line 1283 Fixing this bug results in a reduction in astonishment that should be prominently One would expect
Long story
I have been gored by this bug several times over the past few years but have On my system, using both SN 0.4.0-M2 and the current commit of this PR,
Whereas `assertEquals("\t", fromCString(c"\t"))` passes. The culprit appears to be the 0.4.0-M2 code at NirGenExpr.scala line 1282 When I try the block below in scastie, I get the expected 2 characters.
If authorship issues can be worked out with @lolgab, I recommend that those
The obligatory complication.
Scastie reports that toCString() does not mumbleEscapes() the string it is passed. It converts it to bytes I think the economic solution here is to use
…s with additional processing of hex values (up to 0xff)
@@ -357,6 +357,9 @@ strings (similarly to C):
val msg: CString = c"Hello, world!"
stdio.printf(msg)

It does not allow any octal values or escape characters not supported by Scala compiler, like ``\a`` or ``\?``.
It is possible to use C-style hex values up to value 0xFF, eg. ``c"Hello \x61\x62\x63"``
Well said! I like the examples, they clarify.
I hope that you are not getting frustrated by the review cycles.
Whack a mole time! I can create an independent issue for the concern below
so that the current PR can proceed.
I think this section exposes a weakness in the section which follows it.
Absent any documentation, it is reasonable to believe that the c interpolator
and toCString() have the same results. One knows that someone is going
to do toCString() and report that it allows escapes which the quote-equivalent-unquote
interpolator does not.
Additionally, we also expose two helper functions unsafe.toCString and unsafe.fromCString to convert between C-style and Java-style strings.
I think that creating some test cases to figure out what toCString() is actually
doing with regards to escapes and documenting it in this .rst is worthwhile.
I think it is converting from Java strings to bytes and then copying those bytes to allocated memory. The creation of the Java string would have done escape
processing but toCString() would not. Thus one would need a Java "\\a" to pass
the two bytes '\' & 'a' to C.
The same consideration and need for explicit test cases applies to fromCString().
For the sake of this discussion, consider a byte array read in from user input.
I believe that each byte gets visited one by one and converted to a Java 16 bit Character without escape processing. Thus a CString holding the two bytes
'\' and 'n' becomes two Java Characters, not one newline.
I propose that one or more test cases be created and the behavior documented
in this section of the .rst.
The key insight is that:
- A `String` is a UCS string, i.e., a sequence of 2-byte `Char`s. It is usually interpreted as UTF-16.
- A `CString` is a byte string, i.e., a sequence of `Byte`s. It is usually interpreted as UTF-8, but it can also be ASCII or latin1 or whatever.
- The `c"..."` interpolator describes a byte string, without any specific encoding. Because the source code is text, some encoding needs to be chosen when encoding the text at compile-time into a byte string. The natural choice is latin1 aka ISO-8859-1, because it uses 1-byte code points, and its valid code point range (0-255) corresponds to the same range in UTF-16 `Char`s and in abstract Unicode code points.
- `fromCString` and `toCString` are charset-aware. They will always assume that the `String` is UTF-16 (as do the `java.nio.charset` APIs) and take a `Charset` parameter to know what encoding to assume for the byte string. If not specified, it is UTF-8. They are very similar to `String.getBytes` and `new String(Array[Byte])`, except that they work with a `CString` instead of an `Array[Byte]`.

`fromCString` and `toCString` don't care about escapes; escapes are a source-level programming language concern, not a run-time concern. They don't copy byte-by-byte because that would not respect the charset-aware decoding and encoding.
unit-tests/src/test/scala/scala/scalanative/unsafe/CStringSuite.scala
Good! I think this is the right approach now. I have left two comments.
nir/src/main/scala/scala/scalanative/nir/serialization/BinaryDeserializer.scala
nscplugin/src/main/scala/scala/scalanative/nscplugin/NirGenExpr.scala
This reverts commit 0bd6147
…mplementation handling hex values
Sorry @WojciechMazur, I had forgotten about this PR. I think this is mostly in good shape. I just have a few minor comments left.
final case class Chars(value: Seq[scala.Byte]) extends Val {
  lazy val byteCount: scala.Int = value.length + 1
  lazy val bytes: Array[scala.Byte] = value.toArray
  lazy val stringValue = new java.lang.String(bytes, "UTF-8")
Since we identified that this is more a byte string than a character string, using UTF-8 could blow up or give unusable results if the byte string contains sequences of bytes that are not valid UTF-8. I believe that `stringValue` should use an encoding that can represent, somehow, any byte string. A typical choice for that is latin1, aka `StandardCharsets.ISO_8859_1`. That will of course display "garbage" for multi-byte characters encoded in UTF-8 by the compiler, but that's OK because this is basically a debug string.
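A small sketch of why latin1 suits a debug string: every byte value maps to exactly one `Char`, so decoding and re-encoding preserves arbitrary bytes, whereas decoding invalid UTF-8 substitutes replacement characters. The byte values and names below are chosen for illustration only.

```scala
import java.nio.charset.StandardCharsets

object DebugStringSketch {
  // 0xFF and 0xFE can never appear in valid UTF-8; 0x41 is 'A'.
  val raw: Array[Byte] = Array(0xff.toByte, 0xfe.toByte, 0x41.toByte)

  // latin1 maps each byte 0..255 to the Char with the same code point.
  val viaLatin1: String = new String(raw, StandardCharsets.ISO_8859_1)
  val latin1RoundTrips: Boolean =
    viaLatin1.getBytes(StandardCharsets.ISO_8859_1).sameElements(raw)

  // The UTF-8 decoder replaces malformed input, so the original bytes are lost.
  val viaUtf8: String = new String(raw, StandardCharsets.UTF_8)
  val utf8RoundTrips: Boolean =
    viaUtf8.getBytes(StandardCharsets.UTF_8).sameElements(raw)
}
```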
To make it even clearer that it is a debug string that is only used by `Show`, it would be even better to simply move this to `Show`, actually. That way we're sure that we're not going to use this method in a "meaningful" way.
@@ -112,7 +115,7 @@ trait NirGenExpr { self: NirGenPhase =>

def translateMatch(last: LabelDef) = {
  val (prologue, cases) = stats.span(s => !isCaseLabelDef(s))
  val labels = cases.map { case label: LabelDef => label }
  val labels = cases.map { case label: LabelDef => label }
It seems something wrong happened with the formatting of this file.
}
loop(0)
bytes.foreach {
  case '\\' => str("\\" * 2)
Consider constant-folding this. `* 2` is a method call that is relatively expensive:
case '\\' => str("\\" * 2)
case '\\' => str("\\\\")
/**
 * Custom implementation of StringContext.processEscapes which also parses hex values
 * @param str UTF-8 encoded input string optionally containing literal escapes and hex values
A `String` is never UTF-8. In all generality it is UCS-2, a superset of UTF-16.
 * @param str UTF-8 encoded input string optionally containing literal escapes and hex values
 * @return UTF-8 representation of escaped ByteString
 */
def processEscapes(str: String): String = {
Since this method really parses a byte string, it should directly return an `Array[Byte]`. It should not try to encode the byte string into a String, only to then re-decode that String into a byte string by calling `getBytes("UTF-8")`. As such, it should use an `Array.newBuilder[Byte]` instead of a `StringBuilder`.
This will avoid many indirections, and would allow parsing byte strings that are not valid UTF-8 sequences (but could be valid in other interpretations, which we must not prevent).
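The suggestion could look roughly like the sketch below. It is not the PR's actual `processEscapes`: the name `processEscapesToBytes`, the set of supported escapes, the choice of UTF-8 for plain characters, and the rule of reading at most two hex digits after `\x` are all assumptions made for illustration.

```scala
import java.nio.charset.StandardCharsets

object EscapeSketch {
  private def isHex(ch: Char): Boolean =
    "0123456789abcdefABCDEF".indexOf(ch.toInt) >= 0

  // Hypothetical helper: parses source-level escapes straight into bytes.
  def processEscapesToBytes(str: String): Array[Byte] = {
    val out = Array.newBuilder[Byte]
    var i = 0
    while (i < str.length) {
      if (str.charAt(i) != '\\') {
        // Assumption: plain characters are encoded as UTF-8.
        // Copy a whole code point so surrogate pairs are not split.
        val n = Character.charCount(str.codePointAt(i))
        out ++= str.substring(i, i + n).getBytes(StandardCharsets.UTF_8)
        i += n
      } else {
        if (i + 1 >= str.length)
          throw new IllegalArgumentException("dangling backslash")
        str.charAt(i + 1) match {
          case 'x' =>
            // Assumption: one or two hex digits, so the value is at most 0xFF.
            val digits = str.substring(i + 2).takeWhile(isHex).take(2)
            if (digits.isEmpty)
              throw new IllegalArgumentException("\\x needs hex digits")
            out += Integer.parseInt(digits, 16).toByte
            i += 2 + digits.length
          case c =>
            val b: Byte = c match {
              case 'n'  => 0x0a
              case 't'  => 0x09
              case 'r'  => 0x0d
              case 'b'  => 0x08
              case 'f'  => 0x0c
              case '"'  => 0x22
              case '\'' => 0x27
              case '\\' => 0x5c
              case _ =>
                throw new IllegalArgumentException(s"unsupported escape: $c")
            }
            out += b
            i += 2
        }
      }
    }
    out.result()
  }
}
```

Because the builder holds bytes rather than characters, an input such as `\xff` yields the single byte 0xFF, which would not survive a detour through a UTF-8 encoded `String`.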
…[Byte] instead of StringBuilder
val const = Val.Const(chars)
buf.box(nir.Rt.BoxedPtr, const, unwind)
// format: on
val chars = Val.Chars(StringUtils.processEscapes(str))
The indentation is wrong here.
In general, there are zillions of formatting changes that are unrelated to this PR in this file. They could have come from a transient use of another version of scalafmt or some misconfiguration. Could you please revert all the changes in this file, then reintroduce this specific change (which is only 1 line)?
Sure, it should be fixed now
….Chars (scala-native#1855) Previously, the `Val.Chars` IR node contained a `String` coming more or less directly from the source code, notably still containing escape sequences. The CodeGen was responsible for processing the escape sequences and emitting an escaped LLVM byte string from the string. This was very fragile, and left in the NIR specification concerns that are specific to Scala syntax. It was fragile enough to cause issues like scala-native#1801. The spec of `Val.Chars` is now changed to contain a byte string, without any notion of escape sequences. The compiler back-end is responsible for processing the escape sequences and creating a byte string from the source string. The CodeGen is responsible for encoding this byte string in the LLVM syntax. A deserialization hack is introduced so as not to break backward binary compatibility.
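As an illustration of the last step (encoding a byte string in LLVM syntax), here is a minimal sketch. It assumes the usual LLVM textual form for string constants, where a backslash is written as `\\` (as in the reviewed diff) and any other byte can be written as `\` followed by two hex digits. The helper name `encodeLlvm` is hypothetical and this is not the project's CodeGen code.

```scala
object LlvmStringSketch {
  // Hypothetical helper: renders bytes as the body of an LLVM c"..." constant.
  def encodeLlvm(bytes: Array[Byte]): String = {
    val sb = new StringBuilder
    bytes.foreach { b =>
      val u = b & 0xff // treat the byte as unsigned
      if (u == 0x5c) sb.append("\\\\") // backslash becomes two backslashes
      else if (u >= 0x20 && u < 0x7f && u != 0x22) sb.append(u.toChar) // printable ASCII except the quote
      else sb.append('\\').append("%02X".format(u)) // everything else as \HH
    }
    sb.toString
  }
}
```

With this split, the encoder only ever sees bytes, so it needs no knowledge of Scala escape syntax.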
Resolves #1801
This PR may break NIR compatibility. I think there is no need to update the version, since other breaking changes have already bumped it and there has been no release in between.
Each string in Val.Chars is an already escaped literal.