-
Notifications
You must be signed in to change notification settings - Fork 332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove The Unicode Escape \u000E In Scaladoc Code Comment Parsing #2336
Remove The Unicode Escape \u000E In Scaladoc Code Comment Parsing #2336
Conversation
The regular expressions for `CodeBlockStartRegex` and `CodeBlockEndRegex` both contain two instances of the Unicode escape `\u000E`. This is the "Shift In" character. I expect that it was inserted as part of a copy/paste error. Unicode escapes in triple quote strings are deprecated as of 2.13.2 (scala/scala#8282). Further, this character actually makes the regular expression invalid if it is interpreted. This isn't a big deal right now, as it appears to be ignored on Scala 2.12.x, but on Scala 2.13.x this will cause the regular expressions to fail for Scaladoc using the `<pre>` tag. For example, ```scala import scala.util.matching._ object Main { val doc0: String = """ | /** A foo is a bar, for example. | * | * {{{ | * val foo: String = "bar" | * }}} | * | * <pre> | * val bar: String = "baz | * </pre> | */""".stripMargin val CodeBlockStartRegex = new Regex("""(.*?)((?:\{\{\{)|(?:\u000E<pre(?: [^>]*)?>\u000E))(.*)""") val CodeBlockStartRegex0 = new Regex("""(.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)""") def matchInfo(regex: Regex, value: CharSequence): Unit = { println(s"\nTarget: ${value}") println(s"Regex: ${regex}") val matches: List[Regex.Match] = regex.findAllMatchIn(value).toList println(s"Match Count: ${matches.size}") println(s"Matches: ${matches}") } def main(args: Array[String]): Unit = { matchInfo(CodeBlockStartRegex, doc0) matchInfo(CodeBlockStartRegex0, doc0) } } ``` When run with 2.13.4 yields this result, ```shell warning: 1 deprecation (since 2.13.2); re-run with -deprecation for details 1 warning Picked up JAVA_TOOL_OPTIONS: -Dsbt.supershell=false Target: /** A foo is a bar, for example. * * {{{ * val foo: String = "bar" * }}} * * <pre> * val bar: String = "baz * </pre> */ Regex: (.*?)((?:\{\{\{)|(?:�<pre(?: [^>]*)?>�))(.*) Match Count: 1 Matches: List( * {{{) Target: /** A foo is a bar, for example. * * {{{ * val foo: String = "bar" * }}} * * <pre> * val bar: String = "baz * </pre> */ Regex: (.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*) Match Count: 2 Matches: List( * {{{, * <pre>) ``` Note how the first output only found one match, the `{{{` based one, but the second one found both. Finally, a small test was added to ensure that the change does not break comment parsing.
just removing the control characters didn't work over at scala/scala#9359 , it turned out they were there for a reason -- but maybe the context here is somehow different, idk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olafurpg do you remember why that unicode character could have been added?
I think this should be good to merge, probably it would be best to use the scalameta scaladoc parser itself and remove the one here, but that would require much more work than this.
It looks like it was copied from scala/scala, I am up for removing it, I don't really understand what that doesn, but I feel like this should not be needed in Metals. |
The upstream Scalameta Scaladoc parser is less accurate than the one in Metals so I would keep it unchanged. I think the code was copied verbatim from scala/scala and the special unicode character is still present in the latest version of the parser (see here https://github.com/scala/scala/blob/0f64691c952bddc37e3196fd8ef7f7cd23214ccb/src/scaladoc/scala/tools/nsc/doc/base/CommentFactoryBase.scala#L169-L186). @isomarcte do you think it's possible to keep the special character but interpolate it similar to how the scala/scala implementation does? I think it's important that we avoid diverging from the upstream scaladoc parser. |
Yeah, but I want to dig in a bit and understand what this character is being used for upstream, no matter what we do here. I'll get back to everyone here shortly. Thanks for the reviews! |
I'm going to go ahead and close this seeing that it's pretty stale. @isomarcte if you plan on digging back into this, feel free to just ping me and I can reopen. |
The regular expressions for
CodeBlockStartRegex
andCodeBlockEndRegex
both contain two instances of the Unicode escape\u000E
. This is the "Shift Out" character. I expect that it was inserted as part of a copy/paste error.Unicode escapes in triple quote strings are deprecated as of 2.13.2 (scala/scala#8282). Further, this character actually makes the regular expression invalid if it is interpreted. This isn't a big deal right now, as it appears to be ignored on Scala 2.12.x, but on Scala 2.13.x this will cause the regular expressions to fail for Scaladoc using the
<pre>
tag. For example,When run with 2.13.4 yields this result,
Note how the first output only found one match, the
{{{
based one, but the second one found both.Finally, a small test was added to ensure that the change does not break comment parsing.