Syntax autodetect based on file content #2699

ruihe774 · 2023-10-06T09:40:37Z

Hi!
Currently, bat detects the language of a file from its extension and its first line, so it may fail to detect and highlight files without an extension, or input from stdin. One solution for such cases is to guess the language from the file content. This approach is used in editors like VSCode.

I've tried implementing this autodetection feature for bat; see https://github.com/ruihe774/bat/tree/guesslang. In this implementation, if both extension-based detection and first-line detection fail, bat probes the first few (kilo)bytes and detects the language using the model from guesslang, which is also used in VSCode. It works fairly well, and you are welcome to give it a try. I'm wondering whether you are interested in this feature and whether it could be merged upstream.
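To illustrate the probing step, here is a minimal sketch of reading a bounded prefix of the input. It is not taken from the linked branch: `read_probe` and `PROBE_LIMIT` are hypothetical names, and the 4096-byte limit is an assumption, since the post only says "the first few (kilo)bytes".

```rust
use std::io::{self, Read};

/// Hypothetical probe size; the actual value used in the branch is not stated here.
const PROBE_LIMIT: u64 = 4096;

/// Read at most `limit` bytes from `reader`, stopping early at end of input.
fn read_probe<R: Read>(reader: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // `take` caps how many bytes `read_to_end` can pull from the reader.
    reader.take(limit).read_to_end(&mut buf)?;
    Ok(buf)
}
```

Because stdin cannot be rewound, a caller would have to keep the returned bytes and print them before the rest of the stream.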

keith-hall · 2023-10-20T20:23:17Z

Looks interesting, thanks for sharing. So, if I understand correctly, the new asset file is 738KiB, so the bat binary would grow at least that much larger? I wonder how much it would affect startup time in the case where guesslang isn't needed, and I'm curious how long it takes when guesslang is used.

Also, how is the onnx file generated? Probably if we were to integrate something like this, we'd want instructions on how to update the model etc - I guess we'd have to read up on it in the guesslang documentation, right?

I see that, the way it is trained, it supports just that static list of LABELS, and when invoking it, it returns indexes from that array with probabilities? It's a little hard for me to mentally map those labels/"tokens" to the relevant syntax, especially as GitHub search doesn't search submodule content. Are they all file extensions? Actually, I think I partially answered my own question: it's taken directly from https://github.com/yoeo/guesslang/blob/f67e7b1bc963d06fc244304dba7f0f0ce39a0d4c/guesslang/data/languages.json. I wonder how many of those bat supports, and how many can/can't often have the syntax detected from the first line with syntect, i.e. how much benefit would it really bring?

ruihe774 · 2023-10-21T06:36:59Z

> So, if I understand correctly, the new asset file is 738KiB, so the bat binary would grow at least that much larger?

After compression with zlib, which is also used for themes.bin and syntaxes.bin, guesslang.onnx will be 549K.

Meanwhile, we can link onnxruntime dynamically or statically. There are three situations. In all cases, ORT_STRATEGY should be set to system at build time (see the ort docs).

1. A dynamic onnxruntime library built without --use_extensions is installed on the system. In this case, the onnxruntime-extensions library must also be installed, and bat will depend on these two dynamic libraries. ORT_LIB_LOCATION should be set to the onnxruntime library directory at build time; bat will be dynamically linked with it. OCOS_LIB_PATH should be set to the path of onnxruntime-extensions at build time; ort will dlopen() it at run time.
2. A dynamic onnxruntime library built with --use_extensions (see the onnxruntime docs) is installed on the system. In this case, bat will depend only on onnxruntime.
3. A minimal static build of onnxruntime with selected ops is used (see the docs; it is somewhat complicated). ORT_LIB_LOCATION should be set to the build directory at build time; bat will be statically linked with it. No system-wide dependencies are required. In this case, bat will grow by another 1.7M.

> I wonder how much it would affect startup time in the case where guesslang isn't needed, and I'm curious how long it takes when guesslang is used.

I use OnceCell to initialize the onnx runtime and session on the first call to guesslang(), so if it is never called, nothing is initialized. Also, take a look at assets.rs#L286-L294: guesslang() is only called when the other methods cannot infer the language. When guesslang is used, it takes only some tens of milliseconds. We could also provide a way to enable or disable guesslang through the command line and the library interface.
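The lazy-initialization and fallback pattern described above can be sketched roughly as follows. This is an illustration, not the code in assets.rs: it uses `std::sync::OnceLock` from the standard library in place of OnceCell, and `Session`, `INIT_COUNT`, `session`, and `detect` are hypothetical names. The `guesslang` body is a placeholder that performs no inference.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for the ONNX runtime session (hypothetical type).
struct Session;

/// Counts initializations so the lazy behaviour can be observed.
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);
static SESSION: OnceLock<Session> = OnceLock::new();

/// Returns the shared session, creating it on first use only.
fn session() -> &'static Session {
    SESSION.get_or_init(|| {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        Session // the expensive runtime and model setup would happen here
    })
}

/// Placeholder for model inference; always returns None in this sketch.
fn guesslang(_probe: &[u8]) -> Option<&'static str> {
    let _session = session();
    None
}

/// Try the cheap detectors first; fall back to guesslang only if both fail.
fn detect(
    by_extension: Option<&'static str>,
    by_first_line: Option<&'static str>,
    probe: &[u8],
) -> Option<&'static str> {
    by_extension
        .or(by_first_line)
        .or_else(|| guesslang(probe))
}
```

With this shape, a call such as `detect(Some("rs"), None, b"")` never touches the session, so inputs that are resolved by extension or first line pay no initialization cost.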

I have done some benchmarks. On my computer, vanilla 0.24.0 gives:

`bat` benchmark results

| Benchmark | Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|:---|---:|---:|---:|---:|
| Startup time | `bat` | 5.5 ± 0.3 | 5.0 | 7.0 | 1.00 |
| Startup time with syntax highlighting | `bat … small-CpuInfo-file.cpuinfo` | 8.6 ± 0.3 | 8.1 | 10.6 | 1.00 |
| Startup time with syntax with dependencies | `bat … small-Markdown-file.md` | 12.0 ± 0.9 | 11.4 | 20.0 | 1.00 |
| Plain-text speed | `bat … --language=txt numpy_test_multiarray.py` | 9.3 ± 0.3 | 8.9 | 10.8 | 1.00 |
| Syntax highlighting speed --wrap=character: `grep-output-ansi-sequences.txt` | `bat … grep-output-ansi-sequences.txt` | 24.6 ± 4.2 | 23.4 | 69.3 | 1.00 |
| Syntax highlighting speed --wrap=character: `jquery.js` | `bat … jquery.js` | 335.2 ± 5.0 | 332.5 | 349.0 | 1.00 |
| Syntax highlighting speed --wrap=character: `miniz.c` | `bat … miniz.c` | 28.8 ± 1.2 | 27.7 | 36.1 | 1.00 |
| Syntax highlighting speed --wrap=character: `numpy_test_multiarray.py` | `bat … numpy_test_multiarray.py` | 442.7 ± 5.6 | 436.4 | 452.3 | 1.00 |
| Syntax highlighting speed --wrap=never: `grep-output-ansi-sequences.txt` | `bat … grep-output-ansi-sequences.txt` | 20.8 ± 0.6 | 20.2 | 25.6 | 1.00 |
| Syntax highlighting speed --wrap=never: `jquery.js` | `bat … jquery.js` | 330.6 ± 1.1 | 329.0 | 332.6 | 1.00 |
| Syntax highlighting speed --wrap=never: `miniz.c` | `bat … miniz.c` | 28.3 ± 1.0 | 27.6 | 37.8 | 1.00 |
| Syntax highlighting speed --wrap=never: `numpy_test_multiarray.py` | `bat … numpy_test_multiarray.py` | 437.7 ± 2.9 | 434.3 | 442.0 | 1.00 |
| Many small files speed (overhead of metadata) | `bat … --language=txt *.txt` | 6.7 ± 0.4 | 6.2 | 8.5 | 1.00 |

bat with guesslang has the following startup times:

`bat` benchmark results

| Benchmark | Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|:---|---:|---:|---:|---:|
| Startup time | `bat` | 6.2 ± 0.3 | 5.8 | 7.5 | 1.00 |
| Startup time with syntax highlighting | `bat … small-CpuInfo-file.cpuinfo` | 9.4 ± 0.6 | 8.9 | 17.8 | 1.00 |
| Startup time with syntax with dependencies | `bat … small-Markdown-file.md` | 13.5 ± 3.9 | 12.4 | 64.9 | 1.00 |

Also, I have benchmarked a modified version that enables guesslang for all inputs. Its results are:

`bat` benchmark results

| Benchmark | Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|:---|---:|---:|---:|---:|
| Startup time | `bat` | 6.1 ± 0.3 | 5.7 | 7.4 | 1.00 |
| Startup time with syntax highlighting | `bat … small-CpuInfo-file.cpuinfo` | 38.4 ± 0.6 | 37.6 | 40.9 | 1.00 |
| Startup time with syntax with dependencies | `bat … small-Markdown-file.md` | 42.3 ± 0.6 | 41.3 | 46.0 | 1.00 |
| Plain-text speed | `bat … --language=txt numpy_test_multiarray.py` | 10.2 ± 0.3 | 9.8 | 11.7 | 1.00 |
| Syntax highlighting speed --wrap=character: `grep-output-ansi-sequences.txt` | `bat … grep-output-ansi-sequences.txt` | 61.5 ± 0.7 | 60.9 | 65.4 | 1.00 |
| Syntax highlighting speed --wrap=character: `jquery.js` | `bat … jquery.js` | 372.6 ± 1.4 | 370.7 | 375.7 | 1.00 |
| Syntax highlighting speed --wrap=character: `miniz.c` | `bat … miniz.c` | 65.8 ± 0.5 | 65.3 | 68.2 | 1.00 |
| Syntax highlighting speed --wrap=character: `numpy_test_multiarray.py` | `bat … numpy_test_multiarray.py` | 477.2 ± 1.8 | 474.7 | 480.9 | 1.00 |
| Syntax highlighting speed --wrap=never: `grep-output-ansi-sequences.txt` | `bat … grep-output-ansi-sequences.txt` | 58.1 ± 0.9 | 57.4 | 63.3 | 1.00 |
| Syntax highlighting speed --wrap=never: `jquery.js` | `bat … jquery.js` | 369.3 ± 2.9 | 366.7 | 376.7 | 1.00 |
| Syntax highlighting speed --wrap=never: `miniz.c` | `bat … miniz.c` | 65.4 ± 0.5 | 64.9 | 68.6 | 1.00 |
| Syntax highlighting speed --wrap=never: `numpy_test_multiarray.py` | `bat … numpy_test_multiarray.py` | 472.7 ± 1.7 | 470.9 | 476.8 | 1.00 |
| Many small files speed (overhead of metadata) | `bat … --language=txt *.txt` | 7.3 ± 0.3 | 6.9 | 9.0 | 1.00 |
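As a rough check of the "some tens of milliseconds" figure, the per-invocation cost of guesslang can be estimated by subtracting the vanilla mean from the forced-guesslang mean for the same benchmark. The sketch below does this for four rows copied from the results above; `MEANS` and `overheads` are names made up for this illustration, and the differences ignore measurement noise.

```rust
/// (benchmark, vanilla mean in ms, forced-guesslang mean in ms),
/// copied from the benchmark results in this comment.
const MEANS: [(&str, f64, f64); 4] = [
    ("small-CpuInfo-file.cpuinfo", 8.6, 38.4),
    ("small-Markdown-file.md", 12.0, 42.3),
    ("miniz.c (--wrap=never)", 28.3, 65.4),
    ("jquery.js (--wrap=never)", 330.6, 369.3),
];

/// Difference of the two means for each benchmark, in milliseconds.
fn overheads() -> Vec<(&'static str, f64)> {
    MEANS
        .iter()
        .map(|&(name, vanilla, forced)| (name, forced - vanilla))
        .collect()
}
```

For these rows the differences come out at roughly 30 to 39 ms. Since every benchmark run is a fresh process, that figure includes the one-time runtime and session initialization as well as the inference itself.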

> Also, how is the onnx file generated?

You can refer to my script. (I have not polished it yet.)

> I see that, the way it is trained, it supports just that static list of LABELS, and when invoking it, it returns indexes from that array with probabilities?

Yes, your link to https://github.com/yoeo/guesslang/blob/f67e7b1bc963d06fc244304dba7f0f0ce39a0d4c/guesslang/data/languages.json is right. We can use the keys as language names, or the values as extensions. I'm a bit lazy and don't want to translate the language names into the ones bat uses, so I just use the extensions.

The model outputs an array of probabilities for the 54 languages (they sum to 1). I just pick the one with the largest probability and select it if that probability is greater than 0.5. The threshold can be tuned further.
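The selection rule can be sketched as an argmax followed by a threshold check. This is an illustration rather than the code in the branch: `pick_language` is a hypothetical name, and `labels` stands for the list of extensions taken from languages.json.

```rust
/// Return the label with the highest probability, but only if that
/// probability is strictly greater than `threshold`.
fn pick_language<'a>(probs: &[f32], labels: &[&'a str], threshold: f32) -> Option<&'a str> {
    // Find the index and value of the largest probability; None if `probs` is empty.
    let (best_idx, best_prob) = probs
        .iter()
        .copied()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))?;

    if best_prob > threshold {
        labels.get(best_idx).copied()
    } else {
        None // not confident enough; the caller falls back to plain text
    }
}
```

With a threshold of 0.5, at most one label can pass, because the probabilities sum to 1. Raising the threshold trades fewer wrong guesses for more files left unhighlighted.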

> I wonder how many of those bat supports, and how many can/can't often have the syntax detected from the first line with syntect - i.e. how much benefit would it really bring?

Well, that is a difficult question for me to answer. How many kinds of first lines do these 54 languages have...

Or we can ask how many languages do not have a recognizable first line. I think many.

keith-hall · 2023-10-27T01:58:19Z

Thanks for the detailed explanations and benchmarks. It will be interesting to see what the other maintainers think about this.

> > I wonder how many of those bat supports, and how many can/can't often have the syntax detected from the first line with syntect - i.e. how much benefit would it really bring?

> Well, it is difficult question for me to answer. How many kinds of first lines these 54 languages have...

It was more like food for thought than something I expected you to answer, sorry for not making that clearer.

ruihe774 added the feature-request New feature or request label Oct 6, 2023
