
track "projects" not repositories #1233

Merged
merged 5 commits into from Jul 23, 2019

Conversation

@teamdandelion teamdandelion commented Jul 17, 2019

This re-organizes SourceCred to track data based on the "project" it is a part of, rather than the GitHub repository it came from. This is motivated by a desire to add first-class support for multi-repository projects in SourceCred.

Currently, we have a very hacky approach: we merge multiple repositories' worth of data into a combined git repository (via mergeRepository) and a combined GitHub RelationalView, and then project graphs from the combined data structure.

This approach has a number of deficiencies:

This complexity was motivated by the fact that Graph didn't use to store enough info for cred analysis and running the UI. Now that #1136 has merged, all we need is graphs, and Graph.merge is already well defined (and a lot simpler).

Therefore, this PR starts from the assumption that each plugin will create atomic Graphs for whatever its "atom" of data is (e.g. a repository for GitHub, but it might be an instance for a future Discourse plugin, etc.). The graphs for that plugin then get merged together. The scope of a "project" includes many plugins, each with their own atoms, and they all get merged into the project-level graph. Then SourceCred computes timeline cred for the project-level graph.
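The two-level merge described above can be sketched with a toy graph type (hypothetical; SourceCred's real Graph.merge operates on typed nodes and edges, not bare sets):

```javascript
// Toy stand-in for SourceCred's Graph.merge: union the node sets.
function mergeGraphs(graphs) {
  const merged = new Set();
  for (const g of graphs) for (const node of g) merged.add(node);
  return merged;
}

// Each plugin produces one graph per "atom" (e.g. per GitHub repository).
const githubRepoGraphs = [new Set(["issue-1"]), new Set(["pull-2"])];
const discourseGraphs = [new Set(["topic-3"])];

const githubGraph = mergeGraphs(githubRepoGraphs); // plugin-level merge
const discourseGraph = mergeGraphs(discourseGraphs); // plugin-level merge
// The project-level graph merges across all plugins; timeline cred is
// then computed on this graph.
const projectGraph = mergeGraphs([githubGraph, discourseGraph]);
```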

In this PR, I haven't tried to get the config to anything like its final, fully-fleshed-out form. The project config doesn't even have a field for plugins, since we're only loading GitHub data at the moment. I don't think there's a big advantage to adding the extra abstraction just yet; I'd rather experiment with this in its simplest form for a while. Just effecting the one change of indexing by project rather than by repo is a big change. :)

The new load command is patterned off @wchargin's suggestions in #945 around making an API-ified SourceCred. Specifically, the load command is implemented in api/load.js and takes a declarative config; cli/load.js is then just a wrapper that produces one of those configs, with a bit of sugar.

This will close #1119, close #1118, and close #701.

@Beanow Beanow (Contributor) left a comment

Some thoughts going through this.

if (!fs.existsSync(credPath)) {
return [];
}
const filenames = fs.readdirSync(credPath).filter((x) => x.endsWith(".json"));
Contributor:

Why should this function be synchronous? Scanning arbitrarily large directory structures can cause noticeable slowdowns here.

Contributor Author:

I pulled this logic out to #1238. The new version is still synchronous; the reason is that our webpack config depends on it, and I didn't feel like re-writing the webpack config to accept async code. If it winds up being a perf issue, we could do the re-write then, or keep the sync version for use in webpack and add an async version for regular use.

Contributor Author:

It turned out to be pretty easy to have webpack support async config (just export a promise). So I refactored #1238 to be async.


export async function load(options: LoadOptions): Promise<void> {
const {project, params, sourcecredDirectory, githubToken} = options;
const taskReporter = new TaskReporter();
Contributor:

If we're talking about future refactoring ideas: I believe a dependency-injection style would be preferable to directly referencing and initializing dependencies. Injected dependencies would also make great stubs for testing.
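A sketch of the dependency-injection style being suggested (hypothetical names, not the actual api/load signature): the caller supplies the TaskReporter, so tests can pass a recording stub instead of a console-printing one:

```javascript
// load() receives its reporter instead of constructing one internally.
function makeLoad(taskReporter) {
  return async function load(options) {
    const task = `load-${options.project.id}`;
    taskReporter.start(task);
    // ... the actual loading work would go here ...
    taskReporter.finish(task);
  };
}

// In tests, a stub reporter records calls instead of printing.
const calls = [];
const stubReporter = {
  start: (t) => calls.push(["start", t]),
  finish: (t) => calls.push(["finish", t]),
};
const load = makeLoad(stubReporter);
```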

const loadTask = `load-${options.project.id}`;
taskReporter.start(loadTask);
const cacheDirectory = path.join(sourcecredDirectory, "cache");
mkdirp.sync(cacheDirectory);
Contributor:

This could be async, since the enclosing function is already declared async.

taskReporter.start("compute-cred");
const cred = await TimelineCred.compute(graph, params, DEFAULT_CRED_CONFIG);
const credDir = path.join(sourcecredDirectory, "data");
mkdirp(credDir);
Contributor:

This invocation is inconsistent with the previous one (mkdirp here vs. mkdirp.sync above). It could also be async.

+githubToken: string,
|};

export async function load(options: LoadOptions): Promise<void> {
Contributor:

I find the naming a little misleading. As functionality in an analysis namespace, I wouldn't naturally assume it performs I/O-heavy tasks. Additionally, this function is called load, but it does not produce any in-memory loaded data; instead, it is a cred pre-compute task that produces the cred.json file.

/**
* Convert a project filename back into an id.
*
* Converts all '$' to '/'. Errors if the filename contains '/'.
Contributor:

Encoding-scheme-wise, if you insist on using filenames to store the ID, you'll want the usual encoding suspects: hex-encoded UTF-8, or base64 in URL mode. However, different filesystems have different constraints on case-sensitivity and maximum filename length, so a safer option would be hashing (sha1, for instance) and using the file's contents to store the ID.

Contributor Author:

Yeah, take a look at #1238 where I've done it using base-64. Is there a particular reason we need base64-url mode?

Could do hashing, but I like having a bijection. Although, the way it's implemented in #1238, there's also a project.json file stored in every project directory, so we could just use a hash and have getProjectIds read the project.json file to get the id. It's just a bit more complexity.

Member:

+1; this should be base-64. Can’t really be hex-of-UTF-8 because you can’t assume that the input filenames are UTF-8.

The mirror cache uses base-64.

Contributor:

I suggested base64-url because regular base64 contains / in its character set, which seems like it could cause issues for filesystems.

About not being able to assume UTF-8: there's no difference there between hex and base64 encoding. You always need to know how to convert between binary and the correct characters, regardless of how you encode the binary format. We can assume it here, however, because we control the filenames and their encoding.

The problem is data corruption, or users (or "smart" filesystems) changing filenames. Since we will attempt to decode those, it may result in some very weird strings, as they would be able to circumvent filesystem protections against illegal characters.

Using hashing removes this entire class of issues, because you'll be forced to work with the assumption you can't decode it.
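For comparison, a filename-safe bijective encoding along the lines discussed can be sketched as follows (Node's Buffer supports the "base64url" encoding since v15; this is a sketch, not the #1238 implementation):

```javascript
// base64url avoids "/" and "+", so the result is safe as a filename,
// and the mapping is a bijection: ids round-trip exactly.
function projectIdToFilename(id) {
  return Buffer.from(id, "utf8").toString("base64url");
}

function filenameToProjectId(filename) {
  return Buffer.from(filename, "base64url").toString("utf8");
}
```

As noted above, this round-trips cleanly but cannot detect a filename that was altered after the fact; a hash plus a project.json would trade the bijection for that robustness.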

);
if (!fs.existsSync(credFile)) {
Contributor:

This can be async.

@Beanow Beanow (Contributor) commented Jul 18, 2019

> We should refactor so that the load command is responsible for getting and merging the graphs, and the analyze command is responsible for computing timeline cred with the chosen weights and settings. The "meta" sourcecred tool that @Beanow suggests here can abstract over these pieces.

I'm definitely not suggesting that the normal workflow for people who use the meta package should be different from, or simplified relative to, that of people who use individual packages. I would instead consider the meta package a way to install a known-to-work combination of commonly used package versions. One of the packages that is not the meta package (probably core) should be responsible for acting as the integration layer.

@teamdandelion teamdandelion changed the base branch from master to core-project July 19, 2019 10:42
@teamdandelion teamdandelion force-pushed the core-project branch 3 times, most recently from da12de1 to 4df5969 Compare July 19, 2019 12:38
teamdandelion added a commit that referenced this pull request Jul 21, 2019
This module builds on the project logic added in #1238, and makes it
easy to create projects based on a simple string configuration.
Basically, the spec `foo/bar` creates a project containing just the repo
foo/bar, and the spec `@foo` creates a project containing all of the
repos from the user/organization named foo.

This is pulled out of #1233, but I've enhanced it to support
organizations out of the box.

The method is thoroughly tested.
teamdandelion added a commit that referenced this pull request Jul 22, 2019
This adds a new module, `api/load`, which implements the logic that will
underlie the new `sourcecred load` command. The `api` package is a new
folder that will contain the logic that powers the CLI (but will be
callable directly as we improve SourceCred). As a heuristic, nontrivial
logic in `cli/` should be factored out to `api/`.

In the future, we will likely want to refactor these APIs to
make them more atomic/composable. `api/load` does "all the things" in
terms of loading data, computing cred, and writing it to disk. I'm going
with the simplest approach here (mirroring existing functionality) so
that we can merge #1233 and realize its many benefits more easily.

This work is factored out of #1233. Thanks to @Beanow for [review]
of the module, which resulted in several changes (e.g. organizing it
under api/, having the TaskReporter be dependency injected).

[review]: #1233 (review)

Test plan: `api/load` is tested (via mocking unit tests). Run `yarn test`
@teamdandelion teamdandelion mentioned this pull request Jul 22, 2019
@teamdandelion teamdandelion changed the base branch from core-project to master July 22, 2019 23:53
teamdandelion added a commit that referenced this pull request Jul 22, 2019
@teamdandelion teamdandelion marked this pull request as ready for review July 22, 2019 23:59
I'm re-organizing SC data to be oriented on the graph, rather than on
plugin-specific data structures. So there is no longer a need for the
git loading logic which orients around saving a repository.json file
that's been potentially merged across repos, or indeed the logic for
merging repositories at all. So I'm removing `git/loadGitData`,
`git/mergeRepository`, and relatives.

Test plan: `yarn test --full` passes.
This commit deprecates `cli/load` so that we can write a new
implementation, and then make an atomic switch.

Test plan: `yarn test --full`
The new implementation wraps `api/load`.

Test plan: I've ported over the tests from the old `cli/load`. Run `yarn
test`.
This commit swaps usage over to the new implementation of `cli/load`
(the one that wraps `api/load`) and makes changes throughout the project
to accommodate that we now track instances by Project rather than by
RepoId.

Test plan: Unit tests updated; run `yarn test --full`. Also, for safety:
actually load a project (by whole org, why not) and verify that the
frontend still works.
@teamdandelion teamdandelion merged commit b4c2846 into master Jul 23, 2019
@teamdandelion teamdandelion deleted the dev-load-org branch July 23, 2019 00:01
teamdandelion added a commit that referenced this pull request Jul 23, 2019
This commit removes the `pagerank` and `analyze` commands (both of which
never saw real usage), removes the outdated adapter-based `loadGraph`
method, and removes all traces of the analysis adapters.

It builds on work in #1233 and #1136.

Test plan: `yarn test --full` passes.
teamdandelion added a commit that referenced this pull request Jul 23, 2019
I moved sourcecred/example-git{,hub} to the @sourcecred-test org.

This commit fixes the build given that move.

I've realized that in #1233 I inadvertently made some Git tests that
depend on a snapshot impossible to update. I'm going to compound that slight
technical debt by skipping the tests that depended on that snapshot. I
recognize and accept that I'll need to pay this down when I resuscitate
the git plugin.

Test plan: `yarn test --full`.
wchargin added a commit to sourcecred/cred-action that referenced this pull request Feb 17, 2020
This should be removed upstream, too; it’s no longer used after
<sourcecred/sourcecred#1233>.

wchargin-branch: unused-repos
Development

Successfully merging this pull request may close these issues.

- Unify / Deduplicate the GitHub cache
- Issues with reference detection logic
- Project-level cred attribution
3 participants