Incorrect language detected (C++ as C, XML as TypeScript, etc.) #26

@aaronfranke

Description

@aaronfranke

https://github.com/github/linguist

Linguist is a tool developed by GitHub for the specific purpose of detecting languages. It's a very mature tool that gets it right the majority of the time by using complex rules.

Activity

o2sh

o2sh commented on Feb 17, 2019

@o2sh
Owner

But why?
https://github.com/Aaronepower/tokei is written in Rust and does a great job detecting languages.

aaronfranke

aaronfranke commented on Feb 17, 2019

@aaronfranke
Author

Is that what Onefetch currently uses? It detects C++ as C in the case of Godot, and it didn't detect anything for the repo of a Godot project (while GitHub detects GDScript).

o2sh

o2sh commented on Feb 17, 2019

@o2sh
Owner

It only detects the languages that are currently supported by onefetch (WIP):

C
Clojure
C++
C#
Go
Haskell
Java
Lisp
Lua
Python
R
Ruby
Rust
Scala
Shell
TypeScript
JavaScript
PHP

Also, tokei ignores all comment lines, which is why the language distribution sometimes differs from the one GitHub shows.

Languages supported by tokei: https://github.com/Aaronepower/tokei#supported-languages
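
To illustrate why skipping comment lines shifts the language distribution, here is a toy sketch. It is not tokei's actual counting logic: the two source strings, `count_code_lines`, and `shares` are made up for the example, and the "all lines" count is only a stand-in for any counter that does not skip comments.

```python
def count_code_lines(source, line_comment):
    """Count non-blank lines that are not pure line comments."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(line_comment):
            count += 1
    return count


def shares(totals):
    """Convert per-language totals into percentages, rounded to one decimal."""
    grand_total = sum(totals.values())
    return {lang: round(100 * n / grand_total, 1) for lang, n in totals.items()}


# Two made-up files: a heavily commented C file and a short Python file.
C_SOURCE = "// header\n// more docs\n// even more docs\nint main() {\n  return 0;\n}\n"
PY_SOURCE = "import sys\nprint(sys.argv)\n"

# Distribution when every line counts.
by_all_lines = shares({
    "C": len(C_SOURCE.splitlines()),
    "Python": len(PY_SOURCE.splitlines()),
})

# Distribution when comment lines are skipped.
by_code_lines = shares({
    "C": count_code_lines(C_SOURCE, "//"),
    "Python": count_code_lines(PY_SOURCE, "#"),
})

print(by_all_lines)
print(by_code_lines)
```

The same two files yield a different split depending on whether comments are counted, which is the kind of gap described above.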

aaronfranke

aaronfranke commented on Mar 10, 2019

@aaronfranke
Author

Upstream issues: XAMPPRocky/tokei#305 and XAMPPRocky/tokei#67

We can leave this closed though if you want.

changed the title from "Use GitHub Linguist to detect languages" to "Improve language detection system to recognize C++ headers" on Mar 10, 2019
reopened this on Mar 10, 2019
o2sh

o2sh commented on Mar 10, 2019

@o2sh
Owner

Ok, with the new title it makes more sense to keep this open.

We'll wait for tokei to fix it then.

Thx @aaronfranke

stale

stale commented on Aug 21, 2020

@stale

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

aaronfranke

aaronfranke commented on Aug 21, 2020

@aaronfranke
Author

This issue still exists, though it is likely seen by the devs as low priority, so I'll probably have to bump this again later to please the stale bot.

spenserblack

spenserblack commented on Aug 14, 2023

@spenserblack
Collaborator

Hey everyone following this 👋

There's been a bit of discussion here, but to keep you all up to date: I went ahead and started a project called gengo that should be more linguist-like, to hopefully improve our language detection eventually. Unlike tokei, there can be file extension collisions, and gengo will try to pick the right language using heuristics. For example, for this comment, it would need to register ts as an XML file extension, and include a heuristic to be confident that the .ts file is actually XML.
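
A minimal sketch of the collision idea described above, assuming a hypothetical registry and heuristic table. None of these names (`CANDIDATES`, `HEURISTICS`, `detect`) come from gengo, and the pattern is only an illustrative guess at what an XML check for `.ts` files could look like.

```python
import re

# Hypothetical registry: one extension may map to several candidate languages.
CANDIDATES = {
    "rs": ["Rust"],
    "ts": ["TypeScript", "XML"],
}

# Hypothetical heuristics: if the pattern matches the file contents,
# the paired language wins the collision.
HEURISTICS = {
    "ts": [(re.compile(r"\A\s*(<\?xml|<TS\b)"), "XML")],
}


def detect(filename, contents):
    """Return a language name, or None if the extension is not registered."""
    extension = filename.rsplit(".", 1)[-1] if "." in filename else ""
    candidates = CANDIDATES.get(extension, [])
    if not candidates:
        return None
    if len(candidates) > 1:
        for pattern, language in HEURISTICS.get(extension, []):
            if pattern.match(contents):
                return language
    # No collision, or no heuristic matched: fall back to the first candidate.
    return candidates[0]


print(detect("app_de.ts", '<?xml version="1.0"?>\n<TS version="2.1"></TS>'))
print(detect("main.ts", "const x: number = 1;\n"))
```

Falling back to the first candidate keeps the common case (TypeScript) as the default and only switches to XML when the contents give a reason to.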

But right now, gengo doesn't support nearly enough languages. While I can just grab the data from linguist (and maybe I eventually will), right now I'm hoping that language support grows more organically, with discussion for each added language. So if you'd like to contribute, please do! I'll definitely need help with languages that I'm unfamiliar with, especially when it comes to adding heuristics, for example for C and C++ .h header files.
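
As a rough sketch of what a `.h` heuristic might involve, the snippet below looks for constructs that appear in C++ but not in C. The pattern and the `classify_header` helper are assumptions made for illustration, not gengo's or linguist's rules, and a real heuristic would need to handle many more cases (comments, macros, `extern "C"` blocks).

```python
import re

# Hypothetical pattern: line-leading constructs that are C++-only.
CPP_ONLY = re.compile(
    r"^\s*(class\s+\w+|namespace\s+\w+|template\s*<|"
    r"(public|private|protected)\s*:|"
    r"#\s*include\s*<(string|vector|memory|map)>)",
    re.MULTILINE,
)


def classify_header(contents):
    """Guess whether a .h file is C or C++; default to C when nothing matches."""
    return "C++" if CPP_ONLY.search(contents) else "C"


cpp_header = "#pragma once\nclass Node {\npublic:\n  void ready();\n};\n"
c_header = "#ifndef FOO_H\n#define FOO_H\nint add(int a, int b);\n#endif\n"

print(classify_header(cpp_header))
print(classify_header(c_header))
```

Defaulting to C when nothing matches is itself a design choice: a header with only plain function declarations is valid in both languages, so any default will misclassify some files.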

Edit: See spenserblack/gengo#34

linked a pull request that will close this issue on Aug 25, 2023
linked a pull request that will close this issue on Apr 8, 2024
unpinned this issue on Sep 1, 2024
Incorrect language detected (C++ as C, XML as TypeScript, etc.) · Issue #26 · o2sh/onefetch