Add option to merge embedded languages #105

Closed
2 tasks
mathben opened this issue Feb 11, 2023 · 2 comments · Fixed by #147

Comments

@mathben

mathben commented Feb 11, 2023

Story

As a user, I want to see a single count for a base language even if there are source files with various embedded languages, so that I can get a general idea of how much the base language is used, independent of the embedded languages.

Example languages this is useful for: HTML, XML, JavaScript.

Goals

  • When Pygments detects a language whose name contains a plus between two non-plus characters, the part left of the plus is the base language and the part right of it is the embedded (sub) language. Examples:
    • JavaScript → base: JavaScript, no sub
    • JavaScript+Lasso → base: JavaScript, sub: Lasso
    • C++ → base: C++, no sub; reason: a + must be followed by a non-plus character to start a sub-language
  • When the command line option --merge-embedded is specified, all source files from an embedded language count only towards the base language.
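The splitting rule from the goals above can be sketched as follows. This is only an illustration of the rule as worded here, not pygount's actual implementation; the name `split_embedded` and the regex are assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper, not part of pygount: a "+" that sits between two
# non-plus characters separates the base language from the embedded one.
_EMBEDDED_SEPARATOR_REGEX = re.compile(r"(?<=[^+])\+(?=[^+])")


def split_embedded(language: str) -> Tuple[str, Optional[str]]:
    """Return (base, sub); sub is None if there is no embedded language."""
    # Split only at the first qualifying "+", so the rest stays in the sub part.
    parts = _EMBEDDED_SEPARATOR_REGEX.split(language, maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return language, None


print(split_embedded("JavaScript+Lasso"))
print(split_embedded("C++"))
```

With this rule, names such as C++ stay intact because neither plus has a non-plus character on both sides.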

Original request: Option to merge sub language

Example of output
┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Language                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Python                  │
│ XML                     │
│ XML+Django/Jinja        │
│ JavaScript+Lasso        │
│ JavaScript              │
│ Genshi                  │
│ SCSS                    │
│ JavaScript+Genshi Text  │
│ HTML                    │
│ JavaScript+Django/Jinja │
│ CSS+Lasso               │
empty

Can we have an option to merge the results of:
XML with XML+Django/Jinja + Genshi
JavaScript with JavaScript+Lasso + JavaScript+Genshi Text + JavaScript+Django/Jinja

Or maybe remove the sub-language classification from the analysis?

@roskakori
Owner

The classification is done by pygments, so this would need to be an extra step performed by pygount.

Seemingly pygments uses the convention base_language + "+" + other_languages, so pygount could split before the + and just use the base language as actual language.

Not high on my list of priorities, but I'll leave it in the backlog.

@roskakori
Owner

roskakori commented Feb 13, 2023

Note to self: We still need to detect C++ to be a full language, not the "+" sub-language of C.

Here's a first code snippet to derive the base language, if any.

import re

_BASE_LANGUAGE_REGEX = re.compile(r"^(?P<base_language>[^+]+)\+[^+].*$")


def base_language(language: str) -> str:
    base_language_match = _BASE_LANGUAGE_REGEX.match(language)
    return language if base_language_match is None else base_language_match.group("base_language")


assert base_language("JavaScript") == "JavaScript"
assert base_language("JavaScript+Lasso") == "JavaScript"
assert base_language("JavaScript+") == "JavaScript+"  # no actual language 
assert base_language("C++") == "C++"
assert base_language("++C") == "++C"  # no actual language
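
To illustrate how this could feed into merged totals, here is a rough sketch that sums per-language counts under their base language. The function `merged_counts`, its `merge_embedded` parameter, and the plain dictionary of counts are assumptions for this example and do not reflect pygount's internal data structures; `base_language` is repeated from the snippet above so the block runs on its own.

```python
import re
from collections import Counter
from typing import Dict, Mapping

_BASE_LANGUAGE_REGEX = re.compile(r"^(?P<base_language>[^+]+)\+[^+].*$")


def base_language(language: str) -> str:
    base_language_match = _BASE_LANGUAGE_REGEX.match(language)
    return language if base_language_match is None else base_language_match.group("base_language")


def merged_counts(counts: Mapping[str, int], merge_embedded: bool = True) -> Dict[str, int]:
    """Hypothetical helper: add up counts of embedded languages under their base language."""
    if not merge_embedded:
        # Without merging, every detected language keeps its own count.
        return dict(counts)
    merged: Counter = Counter()
    for language, count in counts.items():
        merged[base_language(language)] += count
    return dict(merged)


# Made-up example counts, only for demonstration.
example_counts = {"XML": 3, "XML+Django/Jinja": 2, "JavaScript+Lasso": 1, "C++": 4}
print(merged_counts(example_counts))
```

Note that this only merges names following the plus convention; a standalone language such as Genshi would still be counted separately.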

@roskakori roskakori self-assigned this May 12, 2024
@roskakori roskakori added this to the v1.7.0 milestone May 12, 2024
@roskakori roskakori changed the title Option to merge sub language Add option to merge embedded languages May 12, 2024
roskakori added a commit that referenced this issue May 12, 2024
roskakori added a commit that referenced this issue May 12, 2024
roskakori added a commit that referenced this issue May 12, 2024
…ed-languages

#105 Add option to merge embedded languages
roskakori added a commit that referenced this issue May 12, 2024
roskakori added a commit that referenced this issue May 12, 2024
…ed-languages

#105 Clean up deprecation warnings
Projects
Status: Done

2 participants