Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop submodule support #174

Closed
spenserblack opened this issue Sep 11, 2023 · 10 comments
Closed

Drop submodule support #174

spenserblack opened this issue Sep 11, 2023 · 10 comments
Assignees
Milestone

Comments

@spenserblack
Copy link
Owner

spenserblack commented Sep 11, 2023

Submodule support TBH was kind of rushed in without proper discussion or planning, being an unrelated change wrapped up in a PR to improve performance.

Right now submodule support is not great IMO.

  • Overrides can be confusing and obscure.
  • Most commonly users won't even see the submodule stats, as all submodules are vendored.
  • Detecting submodules as vendored is hacky, as its not organized with Vendored, but sits in the main implementation as its own boolean value is_submodule.
  • Submodules can't even be analyzed unless they are initialized, potentially confusing for users.
  • Because of the above, repositories with clean histories checked out at the same commit can have different results.

For these reasons, I think that submodules should be skipped. I'd rather drop submodule support completely and potentially add it back in with proper support, than try and patch in proper support to the current implementation.

If a user is determined to analyze submodules it's fairly easy to do without support.

  • As a library, discover each submodule and run Gengo::analyze on it.
  • As a binary, iterate over submodules (this should be trivial with git submodule foreach).
@spenserblack spenserblack added this to the v1 milestone Sep 11, 2023
@spenserblack spenserblack self-assigned this Sep 11, 2023
@spenserblack spenserblack linked a pull request Sep 11, 2023 that will close this issue
1 task
@Byron
Copy link
Contributor

Byron commented Sep 12, 2023

  • As a binary, iterate over submodules (this should be trivial with git submodule foreach).

There is support for that in gix as well.

@spenserblack spenserblack removed a link to a pull request Sep 12, 2023
1 task
@spenserblack
Copy link
Owner Author

There is support for that in gix as well.

Awesome! I'm still considering the best way to support submodules, bub knowing that gix supports this adds some possibilities:

  • Re-export one or more types from gix
  • Document suggestions for how one could handle submodules
  • A combination of the above

@Byron
Copy link
Contributor

Byron commented Sep 13, 2023

My feeling is that submodule handling should be an integrated part of gengo because it is information that should be leveraged when analysing work trees (i.e. the checkout of a commit on disk). Otherwise, while traversing files, one would traverse into submodules and not know about it.

These do have .git files typically (or .git directories) so they can be excluded from traversal, even without explicitly querying them using gix.

Maybe that submodule-ignoring behaviour, by default, would also be in the interest of onefetch. Over time, it can be expended once it becomes clearer on how to handle and/or classify submodules.

@spenserblack
Copy link
Owner Author

it is information that should be leveraged when analysing work trees (i.e. the checkout of a commit on disk). Otherwise, while traversing files, one would traverse into submodules and not know about it.

I think you've touched on something important here (and also read my mind a bit 😆).

When analyzing a revision, I think gengo shouldn't analyze submodules because the usage can be obscure and confusing. However, if gengo analyzed the files on disk, it should treat a submodule just like any other directory and analyze it. It's not necessarily obvious that the submodules must be initialized (git submodule update --init) when reading from a git revision like HEAD. However, it's very obvious when reading from the disk -- if the files aren't in the work tree, they don't get analyzed when reading from disk.

So my idea is that submodule support should be wrapped up with work tree support. In fact, it would be confusing if submodules weren't supported when reading files from the disk. 😆 It's one of those scenarios where it will be harder to not do it than to do it.

This does raise thoughts and concerns that should probably go in an "Add support for analyzing worktrees" issue. But to document them while they're on my mind

  • Like tokei, worktrees should be generic, not require a git repository
  • Read from .gitignore or another ignore file to know which files/dirs to exclude? I think not, because, unlike tokei, we mark files as generated, vendored, etc. That does mean we read files that might be excluded, but opens up granular behavior like --include-generated.
  • How should overrides be handled? With the git implementation we use .gitattributes, but for a worktree I don't think we should. Too git-specific for a generic worktree implementation. There will probably be a lot of back-and-forth on this, so I'll save my ideas on this for another issue.
  • As far as the binary goes, this should perhaps be divided into subcommands gengo git and gengo worktree (and git-gengo can be an alias for gengo git).

@spenserblack spenserblack changed the title Rethink submodule support Drop submodule support Sep 13, 2023
@spenserblack
Copy link
Owner Author

Renaming issue to be more of an action, and closing as this action is completed.

@Byron
Copy link
Contributor

Byron commented Sep 14, 2023

As I am not sure if it came across, let me state my primary concern: If git is available, even when reading the work tree directly, I think gengo should use the available .gitattribute information, instead of ignoring it. However, it should treat this as optional.

The conclusions presented sound a lot like ignoring .gitattributes even if present, which wouldn't be what I expect as a user that went through the trouble of adding gengo specific attributes.

As far as the binary goes, this should perhaps be divided into subcommands gengo git and gengo worktree (and git-gengo can be an alias for gengo git).

If gengo is a developer tool (mainly) for debugging the library and running it in the real-world, maybe that's the right choice. But for maximum convenience, it's my hunch that one can make it so that 'it does the right thing' while allowing users to be specific about what they want. For instance, gengo HEAD or gengo @~1 is clearly a revspec, but just gengo . is clearly something related to the current directory. Not all options are unambiguous here, but I think that could be dealt with decently as well. Of course, this isn't a fully fleshed out description of possible UX, it's merely a description of my 'feeling' for how I would want to use it.

@spenserblack
Copy link
Owner Author

As I am not sure if it came across, let me state my primary concern: If git is available, even when reading the work tree directly, I think gengo should use the available .gitattribute information, instead of ignoring it. However, it should treat this as optional.

First, keep in mind that any overrides should be the exception, not the rule. GitHub has about 40k .gitattributes with linguist-* attributes, and 132k .gitattributes without any linguist-* attributes. On top of that, the vast majority probably don't have any .gitattributes at all. So I don't think there will be a lot of wasted hard work if in some circumstances we don't read from .gitattributes.

I get that it might be confusing if the git implementation uses git attributes, while a generic implementation doesn't. However, I believe a generic implementation should be fully generic, not a "maybe-git" implementation. If the generic implementation reads gitattributes, then I believe this can lead to some awkwardness because we're then left with these choices:

  • Ask users to put .gitattributes in non-git directories (Mercurial, SVN, or a plain old directory)
    • If we do this we'd also have to search up directories to find .gitattributes, with possibly no clear stopping point until /
  • Implement multiple ways to override (.gitattributes + config file) which leads to code complexity and questions about which should take precedence

Also, IMO setting overrides in .gitattributes is, TBH, a bit obscure. It's used for reading from a rev, and IMO the best way to do it since we can support reading from bare repos (thanks again for getting .gitattributes to be read from the rev!). But when it comes to reading from a work tree I don't think a tool that uses this library should say "create a .gitattributes file and read the gengo's usage." In onefetch's case, assuming we read from a work tree and not a rev, I believe the best thing is to allow onefetch to configure overrides itself. Here's the basic idea:

let file_source = Directory::new(".").with_overrides(overrides);
let gengo = Builder::new(file_source);
do_stuff_with(gengo.analyze());

Now onefetch is free to set overrides as it sees fit, whether it's from CLI options, a .onefetchrc file (or whatever we decide to call it), or something else.
cc @o2sh

Besides a fully generic implementation, I have been toying with the idea of a hybrid file source. The hybrid would be git-specific, use .gitattributes, and search down directories, taking advantage of .gitignore. That idea might alleviate your concerns, and I believe that this is what you had in mind by reading a git worktree. Yet, I'm still concerned about submodules. If .gitattributes is used to set the attributes of submodules, then we are, IMO, asking a user to make invalid .gitattributes. AFAIK .gitattributes cannot affect submodule contents for any of git's built-in attributes. I don't want to turn .gitattributes into our personal config file with additional, gengo-specific behavior. If git wouldn't do anything with the attribute's filepath, gengo shouldn't either.


As far as CLI usage goes, automatically detecting if an arg is a revspec or a directory is certainly a fun idea. However, I think there's a lot of risk for ambiguity. I often use branch names like test/javascript, for example, which looks a lot like a directory. Of course, we could check if a rev with that name exists, and if not try to read from a directory with that name. However, that magic can lead to confusion.

  • If I misspell and type tests/javascript, did I misspell a branch name or a directory? The error message would lose specificity ("tests/javascript is not a valid rev or directory"), and we wouldn't be able to just pass along errors returned by the library.
  • If the name matches both a revspec and a directory, what happens? Even git itself can't figure this out, and will require usage like git checkout tests -- tests to disambiguate.
  • Right now gengo can be run a specified directory (repository) with -R REPOSITORY and a specified rev with -r REV. This is a chance to disambiguate (previous point), but at the same time can complicate the CLI, as we'll then have to check for gengo FOO -r BAR -R BAZ.

Right now, at least, I don't really want to guess the user's intent, and would rather require explicitness. I also feel like subcommands are a great way to group CLI options that are only valid for specific usage.


So, just to summarize, here are my opinions:

  • When usage is tied to git, use git tools (git attributes, git ignore, git config), and act like git
  • When usage isn't tied to git, don't use git tools, and make behavior more configurable

@Byron
Copy link
Contributor

Byron commented Sep 15, 2023

First, keep in mind that any overrides should be the exception, not the rule. GitHub has about 40k .gitattributes with linguist-* attributes, and 132k .gitattributes without any linguist-* attributes. On top of that, the vast majority probably don't have any .gitattributes at all. So I don't think there will be a lot of wasted hard work if in some circumstances we don't read from .gitattributes.

I love these queries, thanks for bringing data to the discussion, it definitely helps. Following that line of argumentation, then .gitattribute support should also not be worth it for the git source though, especially since it's even less likely to see gengo-* attributes.

  • Ask users to put .gitattributes in non-git directories (Mercurial, SVN, or a plain old directory)

    • If we do this we'd also have to search up directories to find .gitattributes, with possibly no clear stopping point until /
  • Implement multiple ways to override (.gitattributes + config file) which leads to code complexity and questions about which should take precedence

Also, IMO setting overrides in .gitattributes is, TBH, a bit obscure. It's used for reading from a rev, and IMO the best way to do it since we can support reading from bare repos (thanks again for getting .gitattributes to be read from the rev!).

As a technical note, placing bare .gitattributes files would actually work even if no actual git repository is there if plumbing crates are used directly. And I was about to write the opposite until I was surprised by the flexibility of my own creation 😅.

But reading all of the above makes me think you don't actually want .gitattributes support at all, unless there is also a way to set overrides via API. This probably is due to the idea that it has to be generic.

Besides a fully generic implementation, I have been toying with the idea of a hybrid file source. The hybrid would be git-specific, use .gitattributes, and search down directories, taking advantage of .gitignore. That idea might alleviate your concerns, and I believe that this is what you had in mind by reading a git worktree. Yet, I'm still concerned about submodules. If .gitattributes is used to set the attributes of submodules, then we are, IMO, asking a user to make invalid .gitattributes. AFAIK .gitattributes cannot affect submodule contents for any of git's built-in attributes. I don't want to turn .gitattributes into our personal config file with additional, gengo-specific behavior. If git wouldn't do anything with the attribute's filepath, gengo shouldn't either.

I liked the previous idea of leaving submodule handling to the caller entirely. That way they can decide how to deal with .gitattributes, and whether or not the superproject's .gitattribute file gets to affect the submodule. Probably doing something like that isn't necessary either, and right now this train of thought lacks an anchor for a real-world usecase which makes it difficult to reason about. Thus, removing submodules from the equation would help.

Right now, at least, I don't really want to guess the user's intent, and would rather require explicitness. I also feel like subcommands are a great way to group CLI options that are only valid for specific usage.

I absolutely understand. Please note that all my comments aren't attempts to apply direction, but merely share my perspective, being well aware that everything comes with tradeoffs. As a user, I like gits behaviour of being fuzzy around its inputs while still being safe around ambiguity, while understanding that it's more complex on the implementation side. It's fair for gengo to be explicit and it won't be less usable for that.

So, just to summarize, here are my opinions:

  • When usage is tied to git, use git tools (git attributes, git ignore, git config), and act like git
  • When usage isn't tied to git, don't use git tools, and make behavior more configurable

I think its about making the decision if gengo has an opinion on the storage format of these overrides or not. If not, making the API for overrides based on callbacks is easy, but then each tool using gengo needs to sort this out by themselves. Most simply won't use overrides, maybe onefetch wouldn't or would implement them in terms of linguist which makes sense for compatibility (as much as possible).
The good thing is that one day, you could provide default implementations for these callbacks (or this API, to be less specific), and that's how users of gengo can opt-in to the default behaviour which unifies attribute handling, while also allowing them to add their own twist or maintain backwards compatibility to what was before.
This sounds so good to me that I'd even advocate for making overrides API based, possibly with a default implementation with .gitattributes if this is what you want.

@spenserblack
Copy link
Owner Author

Alright, thanks for the feedback!

I'd even advocate for making overrides API based, possibly with a default implementation with .gitattributes if this is what you want.

I hadn't thought too much about this, but this makes me think that something like Override::from_gitattributes could be the way to go. Right now I think I'm going to leave the current git-based implementation as a biased, special case, while other implementations are more open-ended. Frankly, I love the git-based behavior so much I'm reluctant to change it 😆
At a later point, I can consider making the git-based implementation more generic, after seeing how other implementations end up.

@Byron
Copy link
Contributor

Byron commented Sep 18, 2023

Frankly, I love the git-based behavior so much I'm reluctant to change it 😆
At a later point, I can consider making the git-based implementation more generic, after seeing how other implementations end up.

That sounds very feasible, and seems like a good way forward. Alternative implementations create facts which in turn make it easy to see which abstractions make sense to have more unification among them. Over all of this, I'd love it if it was clear what onefetch needs, as it's a candidate to be the first major 'client' if you will. Submodules in worktrees, yes or no, would be my primary question. And based on that answer one would know what should be done when looking at bare repositories as well. Maybe that was already clarified somewhere and I just forgot though 😅.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

2 participants