Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find syntax ignoring known backup/template filename suffixes #1687

Merged
merged 4 commits into from
Jul 9, 2021

Conversation

scop
Copy link
Contributor

@scop scop commented Jun 13, 2021

For example, fall back to foo.extension for foo.extension~, foo.extension.orig, foo.extension.in.in etc.

Rust newbie alert, likely clumsy/not idiomatic, but appears to work. Critique/guidance very much welcome, for example redundant/repeated conversion between OsStr, Path, and str doesn't seem that clean to me.

For example, fall back to `foo.extension` for `foo.extension~`,
`foo.extension.orig`, `foo.extension.in.in` etc.
@Enselic
Copy link
Collaborator

Enselic commented Jun 15, 2021

Thanks a lot for taking the time to make a contribution!

I like this idea in general. Since you first try to find a supported syntax with an unmodified file extension, it should also be future proof in terms of support for new syntaxes.

For example, right now we do not have any language that uses the .bak file extension. But it looks like the code handles the addition of such a syntax in a good way.

Not sure when I will get the time for a detailed code review (maybe someone else will soon), but just wanted to say that on a high level it looks good to me.

@scop
Copy link
Contributor Author

scop commented Jun 15, 2021

Yes, that's the intent here. For the same reason we strip the minimum amount of info on each round and try each result in turn -- one suffix at a time even if there would be many, and retaining rest of the full file path as we go.

Enselic added a commit to Enselic/bat that referenced this pull request Jun 28, 2021
To make it easier to make further enhancements. See e.g. sharkdp#1687.
@Enselic
Copy link
Collaborator

Enselic commented Jun 28, 2021

I have taken a more detailed look now, and noticed that you use recursion. It would be nice to get rid of the recursion.

Just for brainstorming purposes, what do you think of basing a solution on this base sketch implementation of get_extension_syntax() instead?

    fn get_extension_syntax(&self, file_name: &OsStr) -> Option<&SyntaxReference> {
        // file_name can be e.g. '.bashrc' in which case .extension() will
        // return the empty string, so we need to check both variants
        let extensions = vec![
            file_name,
            Path::new(file_name).extension().unwrap_or_default(),
        ];

        extensions.iter().find_map(|ext| {
            self.syntax_set
                .find_syntax_by_extension(ext.to_str().unwrap_or_default())
        })
    }

A commit with this is found here: Enselic@311f561. This commit passes all CI checks.

The idea is that instead of using recursion, you can add code that will simply add more items to extensions.

It would also be great to add regression tests for this. You can find a bunch of tests in tests/integration_tests.rs, and it should be pretty straightforward to add a new tests for your change. Do you want to take a shot at it? Otherwise I can help out when I get the time to do so.

@scop
Copy link
Contributor Author

scop commented Jun 28, 2021

The idea is that instead of using recursion, you can add code that will simply add more items to extensions.

Unless I'm missing something, this approach couldn't be used to strip arbitrary combinations of ignored file suffixes, including arbitrary number of them. But it's certainly debatable whether that's something we want/"need" to support.

Practical use cases I have in mind are .in.in (doubly processed template files), and everything ending with a ~, including ones that might have another ignored suffix preceding it, for example .orig~. I think .in.in could be handled by adding it as a separate ignored suffix that is processed before .in, and the mentioned ~ behavior by first constructing a list of filenames to process including other ignored suffixes, then adding a variant for each of them with ~ appended.

I have a nagging feeling that if we're only working with extensions, ~ might cause a problem in some cases because it's not an extension but rather a "plain suffix". But maybe that's not a problem if a separate extension with that appended gets added for all other existing ones.

But what about globs involving file paths, they're not dealing with extensions, and yet I think we'd want the ignored suffix behavior to work with them, too? Didn't check the code, just saying as I happened to think of it.

I believe the recursive approach in my current revision doesn't have any of the above issues. Not that I'm at all insisting on covering everything it does, and eliminating recursion would be nice I guess, but it would be cool to get the above mentioned special cases working too.

It would also be great to add regression tests for this. You can find a bunch of tests in tests/integration_tests.rs, and it should be pretty straightforward to add a new tests for your change. Do you want to take a shot at it? Otherwise I can help out when I get the time to do so.

I certainly can, but not sure exactly when; possibly sometime this week. I don't mind if you or someone else beats me to it though, let's drop a note here whoever starts working on it first to avoid duplicate work though.

@scop
Copy link
Contributor Author

scop commented Jun 28, 2021

BTW adding "real" extensions or filenames for the ignored suffixes statically would make the --list-languages output ugly though. But maybe you meant doing it on the fly under the hood so that it wouldn't cause that.

@Enselic
Copy link
Collaborator

Enselic commented Jun 28, 2021

Aha ok. I didn’t think of arbitrarily deep “backup extension on backup extension” scenarios. For that recursion makes more sense.

It would actually be good to get tests in place before we iterate on the implementation. Because we also need to iterate on how this should behave from a user point of view.

I’m thinking we don’t want to list backup extensions with --list-languages.

@scop
Copy link
Contributor Author

scop commented Jun 30, 2021

I took a brief look how would it look like to add some tests, but tests/integration_tests.rs doesn't strike me as a natural place to add tests for this stuff -- I'm not sure how I'd use the executable to test this along the lines of already existing things there. If bat would add let's say the file type it ended up using in the header or something in the output, we could look for that, but it doesn't seem to have anything of the sort.

But along those lines, I suppose we could add some files with suffixes to ignore in their names to the tests/syntax-tests mix, let's say a Ignored suffixes dir containing some simple file of the "parent" type that gets highlighted in a way that is distinctive enough to avoid false positives, and use the existing create and compare scripts as is. But this too feels a bit clumsy to me.

Any other ideas how to go about testing this? Did I miss something?

@Enselic
Copy link
Collaborator

Enselic commented Jul 1, 2021

I took a fresh look at tests/integration_tests.rs and agree, that's not the primary place to add tests for your feature.

Your proposal to base tests on tests/syntax-tests instead sounds good. Your Ignored suffixes approach sounds like one possible way forward. Another way would perhaps be to add backup suffixed "siblings" of existing test files. For example, create a file tests/syntax-tests/source/Kotlin/test.kt~.

Feel free to experiment to find a way that feels right. Also feel free to frequently iterate on your ideas here so that you don't spend a lot of effort on an approach that we end up not using.

@scop
Copy link
Contributor Author

scop commented Jul 5, 2021

Added some tests using the Ignored suffixes approach, using a Rust file with just oneliner comment in it. Opted for this approach rather than the sibling one because I wanted to include at least one case that would not highlight on purpose, and so that these tests are easier to locate.

Copy link
Owner

@sharkdp sharkdp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your contribution! This looks great. I added two minor inline comments.

src/assets.rs Outdated
Comment on lines 296 to 303
if file_str.ends_with(suffix) {
return self.get_extension_syntax(
OsStr::new(
file_str
.strip_suffix(suffix)
.unwrap_or_default()
));
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using .ends_with(suffix), .strip_suffix(suffix) and finally .unwrap*, I think you could simply call strip_suffix and decide based on the result. Something like:

if let Some(stripped_filename) = file_str.strip_suffix(suffix) {
    return self.get_extension_syntax(OsStr::new(stripped_filename));
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, TIL, done.

src/assets.rs Outdated
.extension()
.and_then(|x| x.to_str())
.unwrap_or_default(),
)
).or_else(|| {
let file_str = file_path.to_str().unwrap_or_default();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will lead to an empty string if file_path contains invalid UTF-8 (which is possible). Instead of directly aborting here, we would still run the full for loop below. I think the behavior would still be okay, but it feels a bit strange. And could possibly lead to errors in the future if the code below is changed. I think I would prefer something like:

if let Some(file_str) = file_path.to_str() {
  // …
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM this one too, done.

Thanks a bunch for the guidance and opportunity to learn a bit of Rust 👍

Copy link
Owner

@sharkdp sharkdp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@sharkdp sharkdp merged commit 21338ed into sharkdp:master Jul 9, 2021
sharkdp added a commit that referenced this pull request Jul 9, 2021
@scop scop deleted the ignored-suffixes branch July 9, 2021 09:50
@sharkdp
Copy link
Owner

sharkdp commented Jul 13, 2021

Released in v0.18.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants