Syntax autodetect based on file content #2699
Comments
Looks interesting, thanks for sharing. So, if I understand correctly, the new asset file is 738KiB, so the [...] Also, how is the [...] I see that, the way it is trained, it supports just that static list of [...]
After it is compressed by zlib, which is also used by themes.bin and syntaxes.bin, the size of guesslang.onnx will be 549K. Meanwhile, we can link onnxruntime dynamically or statically. There are three situations. In all cases,
I use some [...] I have done some benchmarks. On my computer, the result for vanilla 0.24.0 is:
The bat with guesslang has a startup time of:
Also, I have benchmarked a modified version that enables guesslang for all inputs. Its speed is:
You can refer to my script. (I have not polished it yet.)
Yes, your link to https://github.com/yoeo/guesslang/blob/f67e7b1bc963d06fc244304dba7f0f0ce39a0d4c/guesslang/data/languages.json is right. We can use the keys for language names, or we can use the values for extensions. I'm a bit lazy and I don't want to translate the language names to the ones bat uses, so I just use the extensions. The model outputs an array of probabilities for the 54 languages (they sum to 1). I pick the language with the largest probability and select it only if that probability is greater than 0.5. The threshold can be tuned further.
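As a rough sketch of the selection rule described above (take the most probable language, accept it only above a threshold), here is an illustration in Python. It is not the code from the branch: the function name `pick_language`, the three-entry extension list, and the example probabilities are made up for illustration. Only the 0.5 threshold comes from the comment.

```python
from typing import Optional, Sequence

THRESHOLD = 0.5  # value mentioned in the comment; it may need tuning


def pick_language(
    probabilities: Sequence[float],
    extensions: Sequence[str],
    threshold: float = THRESHOLD,
) -> Optional[str]:
    """Return the extension with the highest probability, or None when
    the best score does not exceed the threshold."""
    if len(probabilities) != len(extensions):
        raise ValueError("one probability per language is required")
    if not probabilities:
        return None
    # Index of the largest probability (argmax).
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] > threshold:
        return extensions[best]
    return None


# Hypothetical three-language example (the real model outputs 54 values).
EXAMPLE_EXTENSIONS = ["py", "rs", "sh"]
confident = pick_language([0.7, 0.2, 0.1], EXAMPLE_EXTENSIONS)
uncertain = pick_language([0.4, 0.35, 0.25], EXAMPLE_EXTENSIONS)
```

With a strict "greater than" comparison, a best score of exactly 0.5 is rejected, so the caller falls back to plain text.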
Well, it is a difficult question for me to answer. How many kinds of first lines do these 54 languages have? Or we can ask how many languages do not have a first line that identifies them. I think many.
Thanks for the detailed explanations and benchmarks. It will be interesting to see what the other maintainers think about this.
It was more like food for thought than something I expected you to answer, sorry for not making that clearer.
Hi!
In the current implementation, bat detects the language of a file from its extension and its first line, so it may fail to detect and highlight files without an extension or input read from stdin. A solution for such cases is to guess the language from the file content. This approach is used in editors like VSCode.
I've tried implementing this autodetection feature for bat. See https://github.com/ruihe774/bat/tree/guesslang. In this implementation, if both file extension detection and first-line detection fail, bat probes the first few (kilo)bytes and detects the language using the model from guesslang, which is also used in VSCode. It works fairly well, and you are welcome to try it. I'm wondering whether you are interested in this feature and whether it can be merged upstream.
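The fallback order described above could be sketched roughly like this. It is a simplified Python illustration under stated assumptions, not the code from the branch: the detector functions, the tiny lookup tables, and the `PROBE_BYTES` value are hypothetical, and `by_content` is only a stand-in for the guesslang model.

```python
import os
from typing import Optional

PROBE_BYTES = 4096  # assumed probe size; the post only says "a few (kilo)bytes"


def by_extension(path: Optional[str]) -> Optional[str]:
    """Look up the file extension in a (hypothetical) table."""
    known = {"rs": "Rust", "py": "Python"}
    if path is None:  # e.g. input from stdin
        return None
    ext = os.path.splitext(path)[1].lstrip(".")
    return known.get(ext)


def by_first_line(head: bytes) -> Optional[str]:
    """Recognize a shebang on the first line (hypothetical, one rule only)."""
    first = head.split(b"\n", 1)[0]
    if first.startswith(b"#!") and b"python" in first:
        return "Python"
    return None


def by_content(head: bytes) -> Optional[str]:
    """Stand-in for the model: a real implementation would run inference."""
    return "Rust" if b"fn main()" in head else None


def detect(path: Optional[str], data: bytes) -> Optional[str]:
    """Try extension, then first line, then content; None means plain text."""
    head = data[:PROBE_BYTES]
    return by_extension(path) or by_first_line(head) or by_content(head)
```

Because the content-based step runs last, files that the existing detection already handles never pay the cost of model inference.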