Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1801: Store the already processed byte string in Val.Chars #1855

Merged

Conversation

WojciechMazur
Copy link
Contributor

Resolves #1801
This PR may break NIR compability. I think there is no need to updated versioning due to other breaking changes which already updated version, since there was no release in between.

Each string in Val.Chars is already escaped literal

@LeeTibbert
Copy link
Contributor

Wojciech
Please pardon me if I am missing something. I was curious if this PR also fixed
Issue #1737, so I cloned & built your branch and ran it with some unit-test cases from
PR #1744.

 test("""the value of c"..." literals""") {
    assertEquals("\t", fromCString(c"\t"))
    assertEquals("\\t", fromCString(c"\\t"))
    assertEquals("\u0020\u0020\u0061\u0062", fromCString(c"\040\40\141\x62"))

//    assertEquals("\\n", fromCString(c"\\n"))                                 
    assertEquals("\n", fromCString(c"\n"))
//    assertEquals("\\r", fromCString(c"\\r"))                                 
    assertEquals("\r", fromCString(c"\r"))
    assertEquals("\u0020\u0020\u0061\u0062", fromCString(c"\040\40\141\x62"))

    // assertEquals("""                                                        
    // {                                                                       
    //   "greeting": "Hello world!"                                            
    // }""",                                                                   
    //              fromCString(c"""                                           
    // {                                                                       
    //   "greeting": "Hello world!"                                            
    // }"""))                                                                  
  }

The Hello World case fails, which is fair enough and answers my motivating question.
However, the two other commented out cases fail. It looks to me like they should work.

Thank you for considering this (and thank you for the underlying PR.)

@WojciechMazur
Copy link
Contributor Author

WojciechMazur commented Jul 16, 2020

However, the two other commented out cases fail. It looks to me like they should work.

It's strange since those two cases you mentioned works on my machine. May it be OS related issue? I'm using Fedora 32 and LLVM 10.0.0.
I may dig into triple quote case though.

@LeeTibbert
Copy link
Contributor

Re: failing test cases. I'll probe around at my end to see if something is wonky in my
environment. I did a second debug round and was able to replicate my result.

Went one step further and it has something to do with the c interpolation. I think
I have been hosed by that before. Something about the c interpolate seeing &
translating the \n earlier mumble, mumble. Let me refresh & clear my memory.
mumble, mumble is not good enough for action.

In particular, I want to see what happens with those test cases with both current
SN mainline and 0.4.0-M2.

I have some things I need to accomplish, like food in the house before pending
lockdown, so it probably be tomorrow, Friday, before I have anything useful.

@LeeTibbert
Copy link
Contributor

LeeTibbert commented Jul 16, 2020

Update:

Short story

A nasty and astonishing bug should be fixed once NirGenExpr.scala line 1283
is changed to use the recommended idiom StringContext.treatEscapes(str).

Fixing this bug results in a reduction in astonishment that should be prominently
mentioned in the release notes.

One would expect toCString("\\n") == c"\\n. Currently & historically this is
surprisingly not true. People in the wild probably have code to workaround this
bug, so there should be a prominent mention of the fix in the release notes.

Long story

I have been gored by this bug several times over the past few years but have
never defined it well enough to track down. I have always mumbled a few
words I can not repeat here, written it off to
"Oh! quoting code is always quirky!", and then coded around it.

On my system, using both SN 0.4.0-M2 and the current commit of this PR,
both of the following definitely hard fail. I believe the same would be true
on Travis CI.

assertEquals("\\n", fromCString(c"\\n"))
assertEquals("\\r", fromCString(c"\\r"))

Whereas `assertEquals("\t", fromCString(c"\t")) passes.

The culprit appears to be the 0.4.0-M2 code at NirGenExpr.scala line 1282
val chars = Val.Chars(str.replace("\\n", "\n").replace("\\r", "\r")). The code in this
PR at 1283 carries that bug forward in its private implementation of treatEscapes().

When I try the block below in scastie, I get the expected 2 characters.

val	left = "\\n"
val sc = StringContext.treatEscapes(left)
System.err.printf("|" + sc + "|" + s" len: $sc.length")

If authorship issues can be worked out with @lolgab, I recommend that those
two test cases be added to CStringSuite.scala for this PR.

The obligatory complication.

Scastie reports that StringContext.treatEscapes has been deprecated in Scala 2.13.0
and recommends processEscapes. The latter is documented to disallow octal escape
sequences. Doing the obvious and using processEscapes means that toCString and
the c interpolater will handle octal escapes differently. The former will pass them, the
latter will not.

toCString() does not mumbleEscapes() the string it is passed. It converts it to bytes
and copies it pretty straightforwardly to the memory it allocates. (For some reason, it
uses an open coded loop rather than a memcpy or some form of array copy. Perhaps
optimized for one or byte strings????),

I think the economic solution here is to use processEscapes and add a small note
to the .rst which describes toCString and the c interpolater to say that the former
allows and the latter disallows octal escapes. Octal escapes are still used
in C land.

@@ -357,6 +357,9 @@ strings (similarly to C):
val msg: CString = c"Hello, world!"
stdio.printf(msg)

It does not allow any octal values or escape characters not supported by Scala compiler, like ``\a`` or ``\?``.
It is possible to use C-style hex values up to value 0xFF, eg. ``c"Hello \x61\x62\x63"``

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well said! I like the examples, they clarify.

I hope that you are not getting frustrated by the review cycles.

Whack a mole time! I can create an independent issue for the concern below
so that the current PR can proceed.

I think this section exposes a weakness in the section which follows it.

Absent any documentation, it is reasonable to believe that the c interpolator
and toCString() have the same results. One knows that someone is going
to do toCString() and report that it allows escapes which the quote-equivalent-unquote
interpolator does not.

Additionally, we also expose two helper functions unsafe.toCString and unsafe.fromCString to convert between C-style and Java-style strings.

I think that creating some test cases to figure out what toCString() is actually
doing with regards to escapes and documenting it in this .rst is worthwhile.
I think it is converting from Java strings to bytes and then copying those bytes to allocated memory. The creation of the Java string would have done escape
processing but toCString() would not. Thus one would need a Java "\a" to pass
the two bytes '' & 'a' to C.

The same consideration and need for explicit test cases applies to fromCString().
For the sake of this discussion, consider a byte array read in from user input.
I believe that each byte gets visited one by one and converted to a Java 16 bit Character without escape processing. Thus a CString holding the two bytes
'' and 'n' become two Java Characters, not one newline.

I propose that one or more test cases be created and the behavior documented
in this section of the .rst.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key insight is that:

  • A String is a UCS string, i.e., a sequence of 2-byte Chars. It is usually interpreted as UTF-16.
  • A CString is a byte string, i.e., a sequence of Bytes. It is usually interpreted as UTF-8, but it can also be ASCII or latin1 or whatever.
  • The c"..." interpolator describes a byte string, without any specific encoding. Because the source code is text, some encoding needs to be chosen when encoding the text at compile-time into a byte string. The natural choice is latin1 aka ISO-8859-1, because it uses 1-byte code points, and its valid code point range (0-255) corresponds to the same range in UTF-16 Chars and in abstract Unicode code point.
  • fromCString and toCString are charset-aware. They will always assume that the String is UTF-16 (as do the java.nio.charset APIs) and take an Charset parameter to know what encoding to assume for the byte string. If not specified, it is UTF-8. They are very similar to String.getBytes and new String(Array[Byte]), except that they work with a CString instead of an Array[Byte].

fromCString and toCString don't care about escapes; escapes are a source-level programming language concern, not a run-time concern. They don't copy byte-by-byte because that would not respect the charset-aware decoding and encoding.

@sjrd sjrd self-requested a review July 23, 2020 14:03
Copy link
Collaborator

@sjrd sjrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good! I think this is the right approach now. I have left two comments.

@WojciechMazur WojciechMazur requested a review from sjrd July 27, 2020 10:44
@WojciechMazur WojciechMazur deleted the fix/1801_val.chars_store_processed_string branch August 18, 2020 08:22
@WojciechMazur WojciechMazur restored the fix/1801_val.chars_store_processed_string branch August 18, 2020 08:22
@WojciechMazur WojciechMazur reopened this Aug 18, 2020
Copy link
Collaborator

@sjrd sjrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @WojciechMazur, I had forgotten about this PR. I think this is mostly in good shape. I just have a few minor comments left.

final case class Chars(value: Seq[scala.Byte]) extends Val {
lazy val byteCount: scala.Int = value.length + 1
lazy val bytes: Array[scala.Byte] = value.toArray
lazy val stringValue = new java.lang.String(bytes, "UTF-8")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we identified that this is more a byte string than a character string, using UTF-8 could blow up or give unusable results if the byte string contains sequences of bytes that are not valid UTF-8. I believe that stringValue should use an encoding that can represent, somehow, any byte string. A typical choice for that is latin1, aka StandardCharsets.ISO_8859_1. That will of course display "garbage" for multi-byte characters encoded in UTF-8 by the compiler, but that's OK because this is basically a debug string.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it even clearer that it is a debug string that is only used by Show, it would be even better to simply move this to Show, actually. That way we're sure that we're not going to use this method in a "meaningful" way.

@@ -112,7 +115,7 @@ trait NirGenExpr { self: NirGenPhase =>

def translateMatch(last: LabelDef) = {
val (prologue, cases) = stats.span(s => !isCaseLabelDef(s))
val labels = cases.map { case label: LabelDef => label }
val labels = cases.map { case label: LabelDef => label }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems something wrong happened with the formatting of this file.

}
loop(0)
bytes.foreach {
case '\\' => str("\\" * 2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider constant-folding this. * 2 is a method call that is relatively expensive:

Suggested change
case '\\' => str("\\" * 2)
case '\\' => str("\\\\")


/**
* Custom implementation of StringContext.processEscapes which also parses hex values
* @param str UTF-8 encoded input string optionally containing literal escapes and hex values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A String is never UTF-8. In all generality it is UCS-2, a superset of UTF-16.

* @param str UTF-8 encoded input string optionally containing literal escapes and hex values
* @return UTF-8 representation of escaped ByteString
*/
def processEscapes(str: String): String = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this method really parses a byte string, it should directly return an Array[Byte]. It should not try to encode the byte string into a String, only to then re-decode that String into a byte string by calling getBytes("UTF-8"). As such, it should use an Array.newBuilder[Byte] instead of a StringBuilder.

This will avoid many indirections, and would allow to parse byte strings that are not valid UTF-8 sequences (but could be valid in other interpretations, which we must not prevent).

val const = Val.Const(chars)
buf.box(nir.Rt.BoxedPtr, const, unwind)
// format: on
val chars = Val.Chars(StringUtils.processEscapes(str))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentation is wrong here.

In general, there are zillions of formatting changes that are unrelated to this PR in this file. They could have come from a transient use of another version of scalafmt or some misconfiguration. Could you please revert all the changes in this file, then reintroduce this specific change (which is only 1 line)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it should be fixed now

@sjrd sjrd changed the title Fix #1801 - Val.Chars stores already processed string Fix #1801: Store the already processed byte string in Val.Chars Aug 19, 2020
@sjrd sjrd merged commit e1d2f52 into scala-native:master Aug 19, 2020
1 check passed
@WojciechMazur WojciechMazur deleted the fix/1801_val.chars_store_processed_string branch February 15, 2021 09:07
ekrich pushed a commit to ekrich/scala-native that referenced this pull request May 21, 2021
….Chars (scala-native#1855)

Previously, the `Val.Chars` IR node contained a `String` coming
more or less directly from the source code, notably still
containing escape sequences. The CodeGen was responsible for
processing the escape sequences and emitting an escaped LLVM byte
from the string.

This was very fragile, and left in the NIR specification concerns
that are specific to Scala syntax. It was fragile enough to cause
issues like scala-native#1801.

The spec of `Val.Chars` is now changed to contain a byte string,
without any notion of escape sequences. The compiler back-end is
responsible for treat the escape sequences and creating a byte
string from the source string. The CodeGen is responsible for
encoding this byte string in the LLVM syntax.

A deserialization hack is introduced not to break backward binary
compatibility.
WojciechMazur added a commit to WojciechMazur/scala-native that referenced this pull request Aug 25, 2021
….Chars (scala-native#1855)

Previously, the `Val.Chars` IR node contained a `String` coming
more or less directly from the source code, notably still
containing escape sequences. The CodeGen was responsible for
processing the escape sequences and emitting an escaped LLVM byte
from the string.

This was very fragile, and left in the NIR specification concerns
that are specific to Scala syntax. It was fragile enough to cause
issues like scala-native#1801.

The spec of `Val.Chars` is now changed to contain a byte string,
without any notion of escape sequences. The compiler back-end is
responsible for treat the escape sequences and creating a byte
string from the source string. The CodeGen is responsible for
encoding this byte string in the LLVM syntax.

A deserialization hack is introduced not to break backward binary
compatibility.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Val.Chars() should store the processed string literal, not the Scala syntax
4 participants