Syntax autodetect based on file content #2699

Open
ruihe774 opened this issue Oct 6, 2023 · 3 comments
Labels
feature-request New feature or request

Comments

ruihe774 commented Oct 6, 2023

Hi!
In the current implementation, bat detects the language of a file from its extension and its first line, so detection can fail for files without an extension and for input read from stdin. One solution for such cases is to guess the language from the file content. Editors such as VSCode use this approach.

I've tried implementing this autodetection feature for bat; see https://github.com/ruihe774/bat/tree/guesslang. In this implementation, if both extension-based and first-line detection fail, bat probes the first few (kilo)bytes and detects the language using the model from guesslang, which VSCode also uses. It works fairly well, and you are welcome to try it. I'm wondering whether you are interested in this feature and whether it could be merged upstream.
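The detection order described above can be sketched roughly as follows. This is only an illustration, not the code from the linked branch: the function names, the `PROBE_BYTES` value, and the toy heuristics standing in for the real extension table, first-line rules, and guesslang model are all assumptions.

```rust
use std::path::Path;

/// Hypothetical probe size; the post only says "the first few (kilo)bytes".
const PROBE_BYTES: usize = 4096;

/// Step 1: look the extension up in a (toy) table.
fn by_extension(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "rs" => Some("Rust"),
        "py" => Some("Python"),
        _ => None,
    }
}

/// Step 2: inspect the first line, here only for a shebang.
fn by_first_line(text: &str) -> Option<&'static str> {
    let first = text.lines().next()?;
    let interpreter = first.strip_prefix("#!")?;
    if interpreter.contains("python") {
        Some("Python")
    } else if interpreter.ends_with("sh") {
        Some("Shell")
    } else {
        None
    }
}

/// Step 3: stand-in for the content-based model (not the real guesslang model).
fn by_content(sample: &str) -> Option<&'static str> {
    if sample.contains("fn ") && sample.contains("let ") {
        Some("Rust")
    } else {
        None
    }
}

/// Return at most PROBE_BYTES of the input, cut at a char boundary.
fn probe(text: &str) -> &str {
    let mut end = text.len().min(PROBE_BYTES);
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

/// Try the cheap methods first; fall back to content only if both fail.
fn detect(path: Option<&Path>, text: &str) -> Option<&'static str> {
    path.and_then(by_extension)
        .or_else(|| by_first_line(text))
        .or_else(|| by_content(probe(text)))
}
```

Because `or_else` is lazy, the content-based step runs only when the two cheaper steps return `None`, which matches the fallback behaviour described in the post.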

ruihe774 added the feature-request label Oct 6, 2023
keith-hall (Collaborator) commented

Looks interesting, thanks for sharing. So, if I understand correctly, the new asset file is 738KiB, so the bat binary would grow at least that much larger? I wonder how much it would affect startup time in the case where guesslang isn't needed, and I'm curious how long it takes when guesslang is used.

Also, how is the onnx file generated? Probably if we were to integrate something like this, we'd want instructions on how to update the model etc - I guess we'd have to read up on it in the guesslang documentation, right?

I see that, the way it is trained, it supports just that static list of LABELS, and when invoking it, it returns indexes from that array with probabilities? It's a little hard for me to mentally map those labels/"tokens" to the relevant syntax - especially as GitHub search doesn't search submodule content. Are they all file extensions? Actually, I think I partially answered my own question: it's taken directly from https://github.com/yoeo/guesslang/blob/f67e7b1bc963d06fc244304dba7f0f0ce39a0d4c/guesslang/data/languages.json. I wonder how many of those bat supports, and how many can/can't often have the syntax detected from the first line with syntect - i.e. how much benefit would it really bring?

ruihe774 (Author) commented Oct 21, 2023

> So, if I understand correctly, the new asset file is 738KiB, so the bat binary would grow at least that much larger?

After zlib compression, which is also used for themes.bin and syntaxes.bin, guesslang.onnx is 549K.

Meanwhile, we can link onnxruntime dynamically or statically. There are three situations. In all cases, ORT_STRATEGY should be set to system at build time (see the ort docs).

  • A dynamic onnxruntime library built without --use_extensions is installed on the system. In this case, the onnxruntime-extensions library must also be installed, and bat will depend on both dynamic libraries. ORT_LIB_LOCATION should be set to the onnxruntime library directory at build time, and bat will be dynamically linked against it. OCOS_LIB_PATH should be set to the path of onnxruntime-extensions at build time; ort will dlopen() it at run time.
  • A dynamic onnxruntime library built with --use_extensions (see the onnxruntime docs) is installed on the system. In this case, bat will depend only on onnxruntime.
  • A minimal static build of onnxruntime with selected ops is used (see the docs; it is somewhat complicated). ORT_LIB_LOCATION should be set to the build directory at build time, and bat will be statically linked against it. No system-wide dependencies are required. In this case, the bat binary grows by another 1.7M.

> I wonder how much it would affect startup time in the case where guesslang isn't needed, and I'm curious how long it takes when guesslang is used.

I use OnceCell to initialize the ONNX runtime and session on the first call to guesslang(), so nothing is initialized if it is never called. Also, take a look at assets.rs#L286-L294: guesslang() is only called when the other methods cannot infer the language. When guesslang is used, it takes only some tens of milliseconds. We could also provide a way to enable or disable guesslang through the command line and the library interface.
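A minimal sketch of that lazy-initialization pattern, for readers unfamiliar with it. It uses the standard library's `std::sync::OnceLock` rather than whichever `OnceCell` type the branch actually uses, and the `Session` struct, its contents, and the toy matching logic are hypothetical stand-ins for the real ONNX runtime session and model.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for the expensive ONNX runtime session.
struct Session {
    labels: Vec<&'static str>,
}

static SESSION: OnceLock<Session> = OnceLock::new();

/// Counts how often the expensive setup ran (for demonstration only).
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Build the session on first use; later calls reuse the same instance.
fn session() -> &'static Session {
    SESSION.get_or_init(|| {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        Session {
            labels: vec!["py", "rs"],
        }
    })
}

/// Toy replacement for the real model call: it only touches the session
/// when it is actually invoked.
fn guesslang(sample: &str) -> Option<&'static str> {
    let s = session();
    if sample.contains("fn ") {
        s.labels.get(1).copied()
    } else if sample.contains("def ") {
        s.labels.first().copied()
    } else {
        None
    }
}

fn init_count() -> usize {
    INIT_COUNT.load(Ordering::SeqCst)
}
```

If `guesslang()` is never called, `init_count()` stays at zero, which is the property that keeps the cost off the path where extension or first-line detection already succeeds.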

I have run some benchmarks. On my computer, vanilla 0.24.0 gives:

bat benchmark results

Startup time

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat` | 5.5 ± 0.3 | 5.0 | 7.0 | 1.00 |

Startup time with syntax highlighting

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … small-CpuInfo-file.cpuinfo` | 8.6 ± 0.3 | 8.1 | 10.6 | 1.00 |

Startup time with syntax with dependencies

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … small-Markdown-file.md` | 12.0 ± 0.9 | 11.4 | 20.0 | 1.00 |

Plain-text speed

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … --language=txt numpy_test_multiarray.py` | 9.3 ± 0.3 | 8.9 | 10.8 | 1.00 |

Syntax highlighting speed --wrap=character: grep-output-ansi-sequences.txt

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … grep-output-ansi-sequences.txt` | 24.6 ± 4.2 | 23.4 | 69.3 | 1.00 |

Syntax highlighting speed --wrap=character: jquery.js

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … jquery.js` | 335.2 ± 5.0 | 332.5 | 349.0 | 1.00 |

Syntax highlighting speed --wrap=character: miniz.c

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … miniz.c` | 28.8 ± 1.2 | 27.7 | 36.1 | 1.00 |

Syntax highlighting speed --wrap=character: numpy_test_multiarray.py

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … numpy_test_multiarray.py` | 442.7 ± 5.6 | 436.4 | 452.3 | 1.00 |

Syntax highlighting speed --wrap=never: grep-output-ansi-sequences.txt

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … grep-output-ansi-sequences.txt` | 20.8 ± 0.6 | 20.2 | 25.6 | 1.00 |

Syntax highlighting speed --wrap=never: jquery.js

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … jquery.js` | 330.6 ± 1.1 | 329.0 | 332.6 | 1.00 |

Syntax highlighting speed --wrap=never: miniz.c

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … miniz.c` | 28.3 ± 1.0 | 27.6 | 37.8 | 1.00 |

Syntax highlighting speed --wrap=never: numpy_test_multiarray.py

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … numpy_test_multiarray.py` | 437.7 ± 2.9 | 434.3 | 442.0 | 1.00 |

Many small files speed (overhead of metadata)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … --language=txt *.txt` | 6.7 ± 0.4 | 6.2 | 8.5 | 1.00 |

bat with guesslang has the following startup times:

bat benchmark results

Startup time

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat` | 6.2 ± 0.3 | 5.8 | 7.5 | 1.00 |

Startup time with syntax highlighting

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … small-CpuInfo-file.cpuinfo` | 9.4 ± 0.6 | 8.9 | 17.8 | 1.00 |

Startup time with syntax with dependencies

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … small-Markdown-file.md` | 13.5 ± 3.9 | 12.4 | 64.9 | 1.00 |

I have also benchmarked a modified version that enables guesslang for all inputs. Its results are:

bat benchmark results

Startup time

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat` | 6.1 ± 0.3 | 5.7 | 7.4 | 1.00 |

Startup time with syntax highlighting

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … small-CpuInfo-file.cpuinfo` | 38.4 ± 0.6 | 37.6 | 40.9 | 1.00 |

Startup time with syntax with dependencies

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … small-Markdown-file.md` | 42.3 ± 0.6 | 41.3 | 46.0 | 1.00 |

Plain-text speed

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … --language=txt numpy_test_multiarray.py` | 10.2 ± 0.3 | 9.8 | 11.7 | 1.00 |

Syntax highlighting speed --wrap=character: grep-output-ansi-sequences.txt

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … grep-output-ansi-sequences.txt` | 61.5 ± 0.7 | 60.9 | 65.4 | 1.00 |

Syntax highlighting speed --wrap=character: jquery.js

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … jquery.js` | 372.6 ± 1.4 | 370.7 | 375.7 | 1.00 |

Syntax highlighting speed --wrap=character: miniz.c

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … miniz.c` | 65.8 ± 0.5 | 65.3 | 68.2 | 1.00 |

Syntax highlighting speed --wrap=character: numpy_test_multiarray.py

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … numpy_test_multiarray.py` | 477.2 ± 1.8 | 474.7 | 480.9 | 1.00 |

Syntax highlighting speed --wrap=never: grep-output-ansi-sequences.txt

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … grep-output-ansi-sequences.txt` | 58.1 ± 0.9 | 57.4 | 63.3 | 1.00 |

Syntax highlighting speed --wrap=never: jquery.js

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … jquery.js` | 369.3 ± 2.9 | 366.7 | 376.7 | 1.00 |

Syntax highlighting speed --wrap=never: miniz.c

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … miniz.c` | 65.4 ± 0.5 | 64.9 | 68.6 | 1.00 |

Syntax highlighting speed --wrap=never: numpy_test_multiarray.py

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … numpy_test_multiarray.py` | 472.7 ± 1.7 | 470.9 | 476.8 | 1.00 |

Many small files speed (overhead of metadata)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bat … --language=txt *.txt` | 7.3 ± 0.3 | 6.9 | 9.0 | 1.00 |
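As a rough cross-check of the "some tens of milliseconds" estimate, the mean times reported above for vanilla 0.24.0 and for the build that forces guesslang on every input can be subtracted pairwise. The sketch below does only that arithmetic on the published means (startup and `--wrap=character` rows); the function and constant names are hypothetical, and the differences ignore the reported standard deviations.

```rust
/// (benchmark, vanilla mean [ms], forced-guesslang mean [ms]),
/// copied from the tables in this comment.
const MEANS: [(&str, f64, f64); 6] = [
    ("small-CpuInfo-file.cpuinfo", 8.6, 38.4),
    ("small-Markdown-file.md", 12.0, 42.3),
    ("grep-output-ansi-sequences.txt", 24.6, 61.5),
    ("jquery.js", 335.2, 372.6),
    ("miniz.c", 28.8, 65.8),
    ("numpy_test_multiarray.py", 442.7, 477.2),
];

/// Extra time per invocation when guesslang always runs.
fn overheads() -> Vec<(&'static str, f64)> {
    MEANS
        .iter()
        .map(|&(name, vanilla, forced)| (name, forced - vanilla))
        .collect()
}

/// Smallest and largest overhead across the listed benchmarks.
fn overhead_range() -> (f64, f64) {
    overheads()
        .iter()
        .fold((f64::INFINITY, f64::NEG_INFINITY), |(lo, hi), &(_, d)| {
            (lo.min(d), hi.max(d))
        })
}
```

On these numbers the difference is roughly 30 to 37 ms per invocation, largely independent of file size, which is consistent with only the first few kilobytes being probed. These are single-machine measurements, so the figures are indicative only.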

> Also, how is the onnx file generated?

You can refer to my script. (I have not polished it yet.)

> I see that, the way it is trained, it supports just that static list of LABELS, and when invoking it, it returns indexes from that array with probabilities?

Yes, your link to https://github.com/yoeo/guesslang/blob/f67e7b1bc963d06fc244304dba7f0f0ce39a0d4c/guesslang/data/languages.json is right. We can use the keys as language names or the values as extensions. I'm a bit lazy and didn't want to translate the language names into the ones bat uses, so I just use the extensions.

The model outputs an array of probabilities for the 54 languages (summing to 1). I pick the language with the largest probability and select it only if that probability is greater than 0.5. The threshold can be tuned further.
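The selection rule described here (take the most probable label, accept it only above a threshold) could look roughly like this. It is a sketch, not the code from the branch: `pick_label` and its parameters are hypothetical names, and the three-element label list in the usage note is a shortened stand-in for the 54 entries.

```rust
/// Return the label with the highest probability, but only if that
/// probability is strictly greater than `threshold`.
fn pick_label<'a>(labels: &[&'a str], probs: &[f32], threshold: f32) -> Option<&'a str> {
    // Index and value of the largest probability; None for an empty slice.
    let (idx, &p) = probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))?;
    if p > threshold {
        labels.get(idx).copied()
    } else {
        None
    }
}
```

For example, with labels `["py", "rs", "js"]` and a threshold of 0.5, probabilities `[0.1, 0.7, 0.2]` select `"rs"`, while `[0.4, 0.35, 0.25]` select nothing, so the input would stay unhighlighted rather than risk a wrong guess.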

> I wonder how many of those bat supports, and how many can/can't often have the syntax detected from the first line with syntect - i.e. how much benefit would it really bring?

Well, that is a difficult question for me to answer. How many kinds of first lines do these 54 languages have...

Or we can ask how many languages do not have a distinctive first line. I think many do not.

keith-hall (Collaborator) commented Oct 27, 2023

Thanks for the detailed explanations and benchmarks. It will be interesting to see what the other maintainers think about this.

> > I wonder how many of those bat supports, and how many can/can't often have the syntax detected from the first line with syntect - i.e. how much benefit would it really bring?
>
> Well, that is a difficult question for me to answer. How many kinds of first lines do these 54 languages have...

It was more like food for thought than something I expected you to answer, sorry for not making that clearer.
