MySQL ` (backtick) highlighting #1551

Closed
jord1e opened this issue Sep 23, 2020 · 10 comments

@jord1e commented Sep 23, 2020

Hello,
I am trying to highlight everything between backticks exactly like in MySQL Workbench:
[Image: MySQL Workbench]

Code

This is my style and example SQL:

# -*- coding: utf-8 -*-

from pygments.style import Style
from pygments.token import Keyword, Name, Comment, String, Error, \
    Number, Operator, Punctuation, Generic, Whitespace, Text, Literal


class WorkbenchStyle(Style):
    styles = {
        Whitespace: '#a89028',

        Text: '#000000',
        Punctuation: '#000000',

        Comment.Single: '#0987cb',
        Comment.Special: '#0987cb',
        Comment.Multiline: '#0987cb',
        Comment.Preproc: '#0987cb',

        Number.Hex: '#cc6c00',
        Number.Bin: '#cc6c00',
        Number.Float: '#cc6c00',
        Number.Integer: '#cc6c00',

        Literal.Date: '#cc6c00',

        String.Single: '#dd7a00',
        String.Double: '#dd7a00',
        String.Escape: '#dd7a00',

        Name: '#993a3e', # <<<<<<<<<<<<<
        Name.Variable: '#000000',
        Name.Constant: 'bold #007FBF',
        Name.Function: '#7d7d63',

        Operator: '#000000',

        Keyword: 'bold #007FBF',
        Keyword.Type: 'bold #007FBF',
    }

SELECT MIN(date) FROM medw
WHERE afd = (SELECT anr FROM afd WHERE name = 'Verkoop\%');
/**
abc
\%
*/
-- abc
# lollll lol
create user bob@localhost identified by 'Secure1pass!';
SELECT * FROM abc GROUP BY a;
use bobdb;
PREPARE stmt1 FROM 'SELECT SQRT(POW(?,2) + POW(?,2)) AS hypotenuse';
SET @a = 3;
SET @b = 4; ?
EXECUTE stmt1 USING @a, @b;
SELECT * FROM abc WHERE x = ' %s a' OR a = 0xA111 or c = 0b0011
AND x IS NULL OR c = true
AND f = '1983-09-05 13:28:00' OR `xbc`.`a` >= 567;
'x'
SELECT /*+ MAX_EXECUTION_TIME(1000) */ * FROM t1
PROCEDURE
MAX_EXECUTION_TIME(3)
CREATE TABLE IF NOT EXISTS tasks (
    task_id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    start_date DATE,
    due_date DATE,
    status TINYINT NOT NULL,
    priority TINYINT NOT NULL,
    description TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)  ENGINE=INNODB;

I compile everything using

pygmentize -f html -l mysql -O style=workbench -O noclasses=True -o abc.html test.sql

My attempts at solving the issue

The problem is that Name: '#993a3e' makes everything red:
[Image: comparison]

I have tried solving it with noinherit, but to no avail.

Backtick identification seems to be happening here, here or here. A solution would be appreciated.

@Anteru (Collaborator) commented Sep 23, 2020

@kurtmckee has been recently doing a lot of fantastic work on the MySQL lexer -- maybe he can help :)

@kurtmckee (Contributor) commented Sep 23, 2020

@jord1e (Author) commented Sep 24, 2020

Since it's very MySQL-specific, we could use one of the predefined tokens (e.g. Name.Attribute) so as not to break backwards compatibility.

The autumn and trac styles, for example, already have an entry for this. We should probably check for keywords between backticks though, so as not to highlight them.

The biggest problem then is choosing between the different predefined tokens (Name.Entity, Name.Attribute, Name.Namespace, etc.); maybe @Anteru knows the best one?

The formatter does seem to inherit from parents (hence the noinherit style attribute exists), see here.
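
For illustration, here is a minimal sketch of how style inheritance and noinherit behave in a Pygments style. The class name InheritDemoStyle and the colors are made up for this example. It also shows why noinherit alone cannot separate quoted from unquoted names: it only changes how a token type inherits from its parent, and both kinds of names were emitted as the same Name token at the time.

```python
from pygments.style import Style
from pygments.token import Name


class InheritDemoStyle(Style):
    # Hypothetical style used only to demonstrate inheritance rules.
    styles = {
        Name: 'bold #993a3e',
        Name.Function: '#7d7d63',            # keeps "bold" inherited from Name
        Name.Variable: 'noinherit #000000',  # ignores Name, starts from the root style
    }


function_style = InheritDemoStyle.style_for_token(Name.Function)
variable_style = InheritDemoStyle.style_for_token(Name.Variable)

print(function_style['bold'], function_style['color'])
print(variable_style['bold'], variable_style['color'])
```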

Edit: String.Backtick exists, but that seems to be for strings and not names, thoughts?

@kurtmckee (Contributor) commented Sep 24, 2020

@Anteru, I have a serious question for you below. If you can help answer this it will allow me to both improve the MySQL lexer as well as to resolve @jord1e's need for a unique token type for quoted schema object names.

@jord1e, the MySQL lexer does its best to correctly tokenize all of the input. Quoted and unquoted schema object names have no semantic difference that I'm aware of in MySQL so in both cases they are tokenized as "Name". If the token type is changed for quoted schema object names it will introduce a new semantic distinction between quoted and unquoted names. If that's the case, the semantic distinction still needs to be meaningful. Name.Attribute would be unique, but it wouldn't be meaningful. Neither would String.Backtick.

I actually wanted to use a custom type for quoted schema object names. I seriously considered creating the Name.Quoted token that I mentioned above. It would have also allowed me to introduce a Name.Quoted.Escape sub-token so that I could even tokenize escape sequences in quoted schema object names. But I didn't, and it was because I anticipated it would have unforeseen consequences.

I had thought that if I introduced custom token types then I would have to add new CSS class names in token.py. I would have to modify all of the existing color schemes to at least map the new custom token types to the existing Name token color. Then I would have to confirm that none of the formatters that I don't use or understand, like LaTeX, broke because I overlooked something.

My big question for @Anteru is: What actually has to happen if I introduce a custom token type? I anticipated this apocalyptic scenario where I would have to touch 20 to 30 files, but now that I'm checking some of the formatters it appears that they do exactly what I was hoping: if the formatter doesn't recognize the token type then it follows the token hierarchy until it finds a token type that it recognizes (like in html.py or latex.py). This is encouraging but I would really like some guidance here.
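
The fallback described here can be modeled in a few lines. This is a simplified, standard-library-only sketch and not the code in html.py or latex.py: token types are represented as plain tuples, and KNOWN_CLASSES, css_class_for and resolve_style are hypothetical names chosen for the illustration.

```python
# Simplified model: a token type is a tuple of name parts,
# so ("Name", "Quoted") stands for Name.Quoted.
KNOWN_CLASSES = {
    (): "",                       # root token
    ("Name",): "n",
    ("Name", "Variable"): "nv",
}


def css_class_for(ttype):
    """Walk up the hierarchy until a known type is found, then
    append the unknown parts as a suffix (e.g. "n-Quoted")."""
    suffix = ""
    while ttype not in KNOWN_CLASSES:
        suffix = "-" + ttype[-1] + suffix
        ttype = ttype[:-1]
    return KNOWN_CLASSES[ttype] + suffix


def resolve_style(ttype, styles):
    """Return the style of the nearest ancestor that defines one, else ''."""
    while ttype and ttype not in styles:
        ttype = ttype[:-1]
    return styles.get(ttype, "")


styles = {("Name",): "#993a3e"}
print(css_class_for(("Name", "Quoted")))
print(resolve_style(("Name", "Quoted", "Escape"), styles))
```

Under this model a style that only defines Name still colors Name.Quoted, while a style that wants to treat quoted names differently can add its own entry.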

I think that my original goal to uniquely tokenize quoted schema object names, as well as the fate of this ticket, primarily hinges on creating custom types.

@birkenfeld (Contributor) commented Sep 24, 2020

@kurtmckee don't forget that practicality beats purity. The token type names assigned by Pygments don't necessarily have to match the semantic meaning assigned by the language - they're not used by a parser. Usually it's a good idea since similar things will have a similar color for different lexers, but for the use case here it's much easier to use some other token type that already has useful assigned attributes in the various styles.

I don't think it makes sense to introduce new styling definitions in all style classes for a token type that exclusively appears in the MySQL lexer.

@Anteru (Collaborator) commented Sep 24, 2020

I haven't checked yet if the class falls back to the next item in the hierarchy, but it would make sense. @birkenfeld Do you know? Does Name.Foo try Name.Foo, and fall back to Name?

@birkenfeld (Contributor) commented Sep 24, 2020

That's the idea, yes.

@Anteru (Collaborator) commented Sep 24, 2020

In which case @kurtmckee I think your question is answered. If we spot a formatter down the line which doesn't behave like this it'll be considered a bug and fixed. If you can get away with some other pre-existing token that's probably the easiest solution though.

@kurtmckee (Contributor) commented Sep 24, 2020

kurtmckee added a commit to kurtmckee/pygments that referenced this issue Sep 27, 2020
…iquely

Changes in this patch:

* Name.Quoted and Name.Quoted.Escape are introduced as non-standard tokens
* HTML and LaTeX formatters were confirmed to provide default formatting
  if they encounter these two non-standard tokens. They also add style
  classes based on the token name, like "n-Quoted" (HTML) or "nQuoted"
  (LaTeX) so that users can add custom styles for these.
* Removed "\`" and "\\" as schema object name escapes. These are relics
  of the previous regular expression for backtick-quoted names and are
  not treated as escape sequences. The behavior was confirmed in the
  MySQL documentation as well as by running queries in MySQL Workbench.
* Prevent "123abc" from being treated as an integer followed by a schema
  object name. MySQL allows leading numbers in schema object names as long
  as 0-9 are not the only characters in the schema object name.
* Add ~10 more unit tests to validate behavior.

Closes pygments#1551

@kurtmckee (Contributor) commented Sep 27, 2020

@jord1e, I've created a pull request to fix this. I have also fixed two bugs in the lexer that I overlooked previously, involving escape characters in quoted schema object names as well as unquoted schema object names that start with leading digits.
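
The leading-digit rule can be sketched with two regular expressions. This is an illustration only, not the lexer's actual patterns: the character set in NAME_CHARS and the classify helper are assumptions made for the example, and the negative lookahead is one possible way to express "a digit run that is not part of a longer name".

```python
import re

# Characters assumed to be valid in an unquoted name (assumption for this sketch).
NAME_CHARS = r"0-9a-zA-Z$_\u0080-\uffff"

# An integer literal is a run of digits NOT followed by another name character.
INTEGER = re.compile(rf"[0-9]+(?![{NAME_CHARS}])")
NAME = re.compile(rf"[{NAME_CHARS}]+")


def classify(text):
    """Return (kind, lexeme) for the token at the start of text."""
    match = INTEGER.match(text)
    if match:
        return ("integer", match.group())
    match = NAME.match(text)
    if match:
        return ("name", match.group())
    return ("other", text[:1])


print(classify("123abc"))
print(classify("567;"))
```

With the integer pattern tried first, "123abc" fails the lookahead at every digit and falls through to the name pattern as a whole, instead of being split into an integer followed by a name.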

I successfully tested adding the new token names in the "friendly" scheme. You will need to add "Name.Quoted" and whatever color definition you want. If you want to highlight escape sequences in quoted schema object names, add "Name.Quoted.Escape" with a custom color definition.
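
As a rough sketch of that suggestion, a style could add the two entries as shown below. The class name WorkbenchQuotedStyle and the colors are placeholders, and the entries only take effect with a Pygments version whose MySQL lexer emits Name.Quoted (the change discussed in this issue).

```python
from pygments.style import Style
from pygments.token import Keyword, Name


class WorkbenchQuotedStyle(Style):
    # Hypothetical style: unquoted names stay black, quoted names turn red.
    styles = {
        Name: '#000000',
        Name.Quoted: '#993a3e',              # `backtick`-quoted schema object names
        Name.Quoted.Escape: 'bold #993a3e',  # escape sequences inside quoted names
        Keyword: 'bold #007FBF',
    }


quoted = WorkbenchQuotedStyle.style_for_token(Name.Quoted)
escape = WorkbenchQuotedStyle.style_for_token(Name.Quoted.Escape)
print(quoted['color'], escape['bold'])
```

Defining Name separately keeps unquoted names on the default color, so only the quoted names pick up the red used in the original style.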

Please note that you won't be able to get 100% highlighting parity with MySQL Workbench. For example, Workbench incorrectly highlights schema object names with leading digits, and I've fixed this problem in Pygments with the same PR:

[Image]

Anteru closed this in #1555 Oct 27, 2020
Anteru pushed a commit that referenced this issue Oct 27, 2020
…iquely (#1555)

* MySQL: Tokenize quoted schema object names, and escape characters, uniquely

* Remove an end-of-line regex match that triggered a lint warning

Also, add tests that confirm correct behavior. No tests failed before
or after removing the '$' match in the regex, but now regexlint isn't
complaining.

Removing the '$' matching probably depends on the fact that Pygments
adds a newline at the end of the input text, so there is always something
after a bare integer literal.
Anteru added this to the 2.7.3 milestone Oct 27, 2020