feat: SQL comments support and new tokenizer #141

Merged Oct 4, 2023 (2 commits)

@Qtax Qtax commented Sep 11, 2023

Add SQL comments support.

Faster and simpler tokenization.

Refs: #133

@scriptcoded
Owner

Thanks, looks like interesting changes! Haven't had time to read it through properly yet, but I've updated the commit lint rules on master so when you get the time, please feel free to rebase from master and that check should pass.

Quick look at the comment regex tells me it would be trivial to add the # comment format used by MySQL I mentioned in #133 (comment), right? https://dev.mysql.com/doc/refman/8.0/en/comments.html

# Comment
-- Comment
/* Comment */
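
For illustration, here is a minimal sketch of a single pattern that recognizes all three comment styles. The pattern mirrors the comment alternative in the snippet posted later in this thread; `findComments` and `sample` are hypothetical names, and the code merged in the PR may differ.

```javascript
// Hypothetical helper for illustration; not the library's API.
// Alternatives: "-- ..." to end of line, "# ..." to end of line, "/* ... */" block.
const commentPattern = /--[^\n\r]*|#[^\n\r]*|\/\*(?:[^*]|\*(?!\/))*\*\//g;

function findComments(sql) {
  // String.prototype.match returns null when nothing matches.
  return sql.match(commentPattern) ?? [];
}

const sample = [
  'SELECT 1; # MySQL-style comment',
  'SELECT 2; -- standard line comment',
  '/* block',
  '   comment */ SELECT 3;'
].join('\n');

console.log(findComments(sample));
```

Used on its own like this, the pattern would also match `#` or `--` inside string literals, so in a highlighter it needs to sit alongside a string pattern in one combined tokenizer.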

@Qtax Qtax force-pushed the tokenizer-and-comments branch 2 times, most recently from d0a760e to 21f6dec Compare September 12, 2023 08:30
@Qtax
Contributor Author

Qtax commented Sep 12, 2023

Added # comments support and rebased.

Also improved the escapeHtml function.
Edit: https://jsben.ch/oFiBQ https://jsben.ch/rqaQ0 (regex d0a760e is faster when there are "fewer" replacements)

@Qtax Qtax force-pushed the tokenizer-and-comments branch 2 times, most recently from 8824a8a to b15224d Compare September 12, 2023 09:04
@parallels999

\x1b[2m doesn't work

@Qtax
Contributor Author

Qtax commented Sep 12, 2023

@parallels999 in what way does it not work? Works fine for me, in both dark and light themes.

[Screenshots: dark theme, light theme]

@erikn69
Contributor

erikn69 commented Sep 12, 2023

Why did you change it?? \x1b[90m was great, now, there's no difference
image
image
Maybe better with italic \x1b[90m\x1b[3m, or \x1b[2m\x1b[90m\x1b[3m
image
image

Depends on where you use it, \x1b[2m is not supported in some scenarios

@PaolaRuby

PaolaRuby commented Sep 12, 2023

\x1b[2m doesn't work

+1 8824a8

@Qtax
Contributor Author

Qtax commented Sep 12, 2023

Why did you change it?? \x1b[90m was great, now, there's no difference

Changed it because 90 doesn't work on the default light theme in VS Code:

image

So apparently 90 doesn't work in some cases, and 2 doesn't in some others.

\x1b[2m\x1b[90m does work in VS Code. Does that work for you guys?

@parallels999

parallels999 commented Sep 12, 2023

\x1b[2m\x1b[90m does work in VS Code. Does that work for you guys?

it works for me 👍, but with italic looks awesome \x1b[2m\x1b[90m\x1b[3m

Also works in browsers, cmd console, and PowerShell
image

@Qtax
Contributor Author

Qtax commented Sep 12, 2023

it works for me 👍

Great!

italic looks awesome \x1b[2m\x1b[90m\x1b[3m

Italic is harder to read in smaller fonts, so I personally would like to avoid it.

Owner

@scriptcoded scriptcoded left a comment

I've read through the code now, and wow, good job! You really managed to squeeze together that getSegments quite a bit! I also really like the use of named groups instead of my overly complicated solution 👍

Split the commits

Could you please split the comment support and the performance improvements into two separate commits when we're done? I'll rebase this PR onto master in the end and since they're separate features I'd like them clearly separated in the commit history and the changelog.

Escaper performance

I tried it out with some more realistic real world data. The switch solution seems to be the most performant one. The difference in performance between escSwitch and escSwitch2 can probably be considered within margin of error, so I say let's go for the escSwitch implementation with inline replacer function.

My benchmark: https://jsben.ch/eGHEk
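
As a rough sketch of what "switch with an inline replacer function" can look like: the function name `escapeHtml` comes from the thread, but the set of escaped characters and the exact entities below are assumptions, not the library's confirmed implementation.

```javascript
// Hypothetical sketch of a switch-based escaper; the escaped character set
// (& < > " ') is an assumption made for this example.
function escapeHtml(str) {
  // One pass over the string; the replacer is defined inline.
  return str.replace(/[&<>"']/g, (ch) => {
    switch (ch) {
      case '&': return '&amp;';
      case '<': return '&lt;';
      case '>': return '&gt;';
      case '"': return '&quot;';
      default: return '&#39;'; // the only remaining match is the single quote
    }
  });
}

console.log(escapeHtml('<a href="x">Tom & \'Jerry\'</a>'));
```

A single regex pass with a switch avoids chaining several `replace` calls, which is one plausible reason it benchmarks well on inputs with few characters to escape.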

Comment color

\x1b[2m\x1b[90m, albeit being a bit chaotic, seems to work in multiple scenarios for me too. I say we go with this. As for italic I'd rather keep it simple and bare bones, and if someone wants italic they can use the options for that. Thanks for the input everybody!
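
To make the escape codes concrete: in ANSI SGR terms, `2` is faint/dim, `90` is the bright-black (gray) foreground, `3` is italic, and `0` resets all attributes. The sketch below wraps a comment in the agreed combination; `colorComment` and the constant names are hypothetical and not part of the library.

```javascript
// Hypothetical helper for illustration; not the library's API.
const DIM = '\x1b[2m';           // SGR 2: faint/dim (not honored by every terminal)
const BRIGHT_BLACK = '\x1b[90m'; // SGR 90: bright black foreground (gray)
const RESET = '\x1b[0m';         // SGR 0: reset all attributes

function colorComment(text) {
  // Emitting both codes means terminals that ignore one still apply the other.
  return DIM + BRIGHT_BLACK + text + RESET;
}

console.log('SELECT 1; ' + colorComment('-- the answer'));
```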

@parallels999

This works too, no highlighters const, no getRegexString function, no regex spaces

tokenizer = new RegExp('(.*?)(' + [
    '\\b(?<keyword>' + keywords.join('|') + ')\\b',

    /\b(?<number>\d+(?:\.\d+)?)\b/,

    // Note: Repeating string escapes like 'sql''server' will also work as they are just repeating strings
    /(?<string>'(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*"|`(?:[^`\\]|\\.)*`)/,

    /(?<comment>--[^\n\r]*|#[^\n\r]*|\/\*(?:[^*]|\*(?!\/))*\*\/)/,

    // Future improvement: Comments should be allowed between the function name and the opening parenthesis
    /\b(?<function>\w+)(?=\s*\()/,

    /(?<bracket>[()])/,

    /(?<special>!=|[=%*/\-+,;:<>])/,
].map(v => v.toString().replace(/^\/|\/\w*$|[\t ]+/g, '')).join('|') + '|$)', 'isy'),

@scriptcoded
Owner

This works too, no highlighters const, no getRegexString function, no regex spaces

@parallels999 Yes, that would probably work. And even though it's more compact and could maybe be considered "cleaner", I think it adds to the obfuscation and makes the code harder to follow. I'd like to prioritize readability over compactness.

@Qtax
Contributor Author

Qtax commented Sep 13, 2023

Could you please split the comment support and the performance improvements into two separate commits when we're done? I'll rebase this PR onto master in the end and since they're separate features I'd like them clearly separated in the commit history and the changelog.

So you want it split into commits feat: SQL comments support and perf: improved tokenizer? Sure.

@Qtax
Contributor Author

Qtax commented Sep 13, 2023

I'd like to prioritize readability over compactness.

Agreed!

And FYI @parallels999, you'd want to remove |[\t ]+ from the replace there if you remove the spaces in the expression manually.
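
A small sketch of what that `replace` does, assuming the same expression as in the snippet above (`toSource` is a hypothetical name). It strips the leading slash, the trailing slash plus flags, and every run of spaces or tabs, which is one plausible reason to drop the `|[\t ]+` part once the patterns no longer contain layout whitespace: it would also delete spaces that are meant literally.

```javascript
// Hypothetical helper mirroring the replace() call from the snippet above.
const toSource = (v) => v.toString().replace(/^\/|\/\w*$|[\t ]+/g, '');

// Layout spaces inside the literal are stripped, leaving a compact pattern.
const spaced = / (?<number> \d+ (?: \. \d+ )? ) /;
console.log(toSource(spaced)); // (?<number>\d+(?:\.\d+)?)

// Caveat: an intentional literal space is stripped as well.
console.log(toSource(/(?<kw>ORDER BY)/i)); // (?<kw>ORDERBY)
```

When no whitespace stripping is needed, the built-in `RegExp.prototype.source` property returns the pattern text without slashes or flags.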

@parallels999

@Qtax it's just an example, I do not have access to modify your PR

@scriptcoded
Owner

Hey again! Just letting you know that I'll have to let this sit another few days as I've got quite a lot to do at the moment. Sorry for the holdup!

@Qtax
Contributor Author

Qtax commented Sep 28, 2023

@scriptcoded np, I've split the commits. Also wanted to add the improved escape function, but jsben.ch has been down all day for me. :-/

@wkeese

wkeese commented Oct 3, 2023

Glad to see this change because it looks like you're fixing a bug with the old tokenizer: AFAICT, the old tokenizer would (incorrectly) highlight keywords/symbols inside of strings, for example

select * from EMP where DEPT="select * from EMP";

It was an architectural problem because the tokenizer ran the regexps one by one, whereas you need to tokenize with a single regexp.

I hadn't seen that named capturing groups feature before, took me a while to figure it out but it works nicely.
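
The idea can be sketched with a deliberately tiny token set. Everything below (the three patterns, `tokenize`, the token shape) is a simplified assumption for illustration and not the library's actual code. Because all alternatives live in one regex, a string literal is consumed as a whole before the keyword alternative can match inside it.

```javascript
// Simplified, hypothetical token set; the real patterns in the PR are richer.
const tokenizer = new RegExp(
  [
    /(?<string>'(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*")/,
    /\b(?<keyword>select|from|where)\b/,
    /(?<special>[=*;])/
  ].map((re) => re.source).join('|'),
  'gi'
);

function tokenize(sql) {
  const tokens = [];
  for (const match of sql.matchAll(tokenizer)) {
    // Exactly one named group is defined per match; its name is the token type.
    const [name, content] = Object.entries(match.groups)
      .find(([, value]) => value !== undefined);
    tokens.push({ name, content });
  }
  return tokens;
}

console.log(tokenize('select * from EMP where DEPT="select * from EMP";'));
```

In this sketch the quoted text comes back as a single `string` token, so the `select` and `from` inside it are never reported as keywords.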

@scriptcoded
Owner

@Qtax So I've read through everything one more time now and ran it past some colleagues as well. So as soon as #141 (comment) is fixed I think we're ready to merge! I would push a commit myself but since you've been force pushing I'd rather not mess things up for you.

@scriptcoded
Owner

So you want it split into commits feat: SQL comments support and perf: improved tokenizer? Sure.

@Qtax and just to explain I don't necessarily need the feature split into two commits specifically, but at least commits that make sense. So that when this gets merged each commit is contained and releasable. In this specific case I think the easiest is therefore to just keep it as two commits.

wkeese added a commit to wkeese/sql-highlight that referenced this pull request Oct 3, 2023
wkeese added a commit to wkeese/sql-highlight that referenced this pull request Oct 4, 2023
Faster and simpler tokenization.

Refs: scriptcoded#133
Add SQL comments support, including MySQL # comments.

Refs: scriptcoded#133
@Qtax
Contributor Author

Qtax commented Oct 4, 2023

@Qtax So I've read through everything one more time now and ran it past some colleagues as well. So as soon as #141 (comment) is fixed I think we're ready to merge! I would push a commit myself but since you've been force pushing I'd rather not mess things up for you.

Fixed.

Force pushing since I don't want to make/keep unnecessary/unwanted commits, and prefer to update the commits instead. Especially if it's gonna be fast forwarded when merged. (Unless there is some other way to do that that I'm not aware of?)

@scriptcoded scriptcoded merged commit 9dad7b2 into scriptcoded:master Oct 4, 2023
8 checks passed
@scriptcoded
Owner

There we go, finally merged! Sorry for the holdup and huge thanks for the PR and all input! 🎉

Closes #133

@scriptcoded scriptcoded mentioned this pull request Oct 4, 2023
6 participants