
track "projects" not repositories #1233

Merged
merged 5 commits into from Jul 23, 2019

Conversation

@teamdandelion teamdandelion commented Jul 17, 2019

This re-organizes SourceCred to track data based on the "project" it is a part of, rather than the GitHub repository it came from. This is motivated by a desire to add first-class support for multi-repository projects in SourceCred.

Currently, we have a very hacky approach: we merge multiple repositories' worth of data into a combined git repository (via mergeRepository) and a combined GitHub RelationalView, and then project graphs from the combined data structure.

This approach has a number of deficiencies:

This complexity was motivated by the fact that Graph didn't use to store enough info for cred analysis and running the UI. Now that #1136 has merged, all we need is graphs, and Graph.merge is already well defined (and a lot simpler).

Therefore, this PR starts from the assumption that each plugin will create atomic Graphs for whatever its "atom" of data is (e.g. a repository for GitHub, but it might be an instance for a future Discourse plugin, etc.). The graphs for that plugin then get merged together. The scope of a "project" includes many plugins, each with their own atoms, and they all get merged into the project-level graph. Then SourceCred computes timeline cred for the project-level graph.
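The two-level merge described above can be sketched with a toy graph type (hypothetical; SourceCred's real Graph.merge operates on typed nodes and edges, not bare sets):

```javascript
// Toy stand-in for SourceCred's Graph.merge: union the node sets.
function mergeGraphs(graphs) {
  const merged = new Set();
  for (const g of graphs) for (const node of g) merged.add(node);
  return merged;
}

// Each plugin produces one graph per "atom" (e.g. per GitHub repository).
const githubRepoGraphs = [new Set(["issue-1"]), new Set(["pull-2"])];
const discourseGraphs = [new Set(["topic-3"])];

const githubGraph = mergeGraphs(githubRepoGraphs); // plugin-level merge
const discourseGraph = mergeGraphs(discourseGraphs); // plugin-level merge
// The project-level graph merges across all plugins; timeline cred is
// then computed on this graph.
const projectGraph = mergeGraphs([githubGraph, discourseGraph]);
```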

In this PR, I haven't tried to get the config to anything like its final, fully-fleshed-out form. The project config doesn't even have a field for plugins, since we're only loading GitHub data at the moment. I don't think there's a big advantage to adding the extra abstraction just yet; I'd rather experiment with this in its simplest form for a while. Just effecting the one change of indexing by project rather than by repo is a big change. :)

The new load command is patterned off @wchargin's suggestions in #945 around making an API-ified SourceCred. Specifically, the load command is implemented in api/load.js and takes a declarative config; cli/load.js is then just a wrapper that produces one of those configs, with a bit of sugar.

This will close #1119, close #1118, and close #701.

@Beanow Beanow (Contributor) left a comment

Some thoughts going through this.

if (!fs.existsSync(credPath)) {
return [];
}
const filenames = fs.readdirSync(credPath).filter((x) => x.endsWith(".json"));
Contributor:

Why should this function be synchronous? Scanning arbitrarily large directory structures can cause noticeable slowdowns here.

Contributor Author:

I pulled this logic out to #1238. The new version is still synchronous; the reason is that our webpack config depends on it, and I didn't feel like re-writing the webpack config to accept async code. If it winds up being a perf issue, we could do the re-write then, or keep the sync version for use in webpack and add an async version for regular use.

Contributor Author:

It turned out to be pretty easy to have webpack support async config (just export a promise). So I refactored #1238 to be async.


export async function load(options: LoadOptions): Promise<void> {
const {project, params, sourcecredDirectory, githubToken} = options;
const taskReporter = new TaskReporter();
Contributor:

If we're talking about future refactoring ideas: I believe a dependency-injection style would be preferable to directly referencing and initializing dependencies. Injected dependencies would also make great stubs for testing.
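A sketch of the dependency-injection style being suggested (hypothetical names, not the actual api/load signature): the caller supplies the TaskReporter, so tests can pass a recording stub instead of a console-printing one:

```javascript
// load() receives its reporter instead of constructing one internally.
function makeLoad(taskReporter) {
  return async function load(options) {
    const task = `load-${options.project.id}`;
    taskReporter.start(task);
    // ... the actual loading work would go here ...
    taskReporter.finish(task);
  };
}

// In tests, a stub reporter records calls instead of printing.
const calls = [];
const stubReporter = {
  start: (t) => calls.push(["start", t]),
  finish: (t) => calls.push(["finish", t]),
};
const load = makeLoad(stubReporter);
```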

const loadTask = `load-${options.project.id}`;
taskReporter.start(loadTask);
const cacheDirectory = path.join(sourcecredDirectory, "cache");
mkdirp.sync(cacheDirectory);
Contributor:

This could be async, since the enclosing function is already declared async.

taskReporter.start("compute-cred");
const cred = await TimelineCred.compute(graph, params, DEFAULT_CRED_CONFIG);
const credDir = path.join(sourcecredDirectory, "data");
mkdirp(credDir);
Contributor:

This invocation is inconsistent with the previous one (mkdirp here vs. mkdirp.sync above). It could also be async.

+githubToken: string,
|};

export async function load(options: LoadOptions): Promise<void> {
Contributor:

I find the naming a little misleading. As functionality in an analysis namespace, I wouldn't naturally assume it performs I/O-heavy tasks. Additionally, this function is called load, but it does not produce any in-memory loaded data; instead, it is a cred pre-compute task that produces the cred.json file.

/**
* Convert a project filename back into an id.
*
* Converts all '$' to '/'. Errors if the filename contains '/'.
Contributor:

Encoding-scheme-wise, if you insist on using filenames to store the ID, you'll want the usual encoding suspects: hex-encoded UTF-8, or base64 in URL mode. However, different filesystems have different constraints on case-sensitivity and maximum filename length, so a safer option would be hashing (sha1, for instance) and using the file's contents to store the ID.

Contributor Author:

Yeah, take a look at #1238 where I've done it using base-64. Is there a particular reason we need base64-url mode?

Could do hashing, but I like having a bijection. Although, the way it's implemented in #1238, there's also a project.json file stored in every project directory, so we could just use a hash and have getProjectIds read the project.json file to get the id. It's just a bit more complexity.

Member:

+1; this should be base-64. Can’t really be hex-of-UTF-8 because you can’t assume that the input filenames are UTF-8.

The mirror cache uses base-64.

Contributor:

I suggested base64-url because regular base64 contains / in its character set, which seems like it could cause issues for filesystems.

About not being able to assume UTF-8: there's no difference there between hex and base64 encoding. You always need to know how to convert between binary and the correct characters, regardless of how you encode the binary format. We can assume it here, however, because we control the filenames and their encoding.

The problem is data corruption, or users (or "smart" filesystems) changing filenames. Since we will attempt to decode those, it may result in some very weird strings, as they would be able to circumvent filesystem protections against illegal characters.

Using hashing removes this entire class of issues, because you'll be forced to work with the assumption you can't decode it.
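For comparison, a filename-safe bijective encoding along the lines discussed can be sketched as follows (Node's Buffer supports the "base64url" encoding since v15; this is a sketch, not the #1238 implementation):

```javascript
// base64url avoids "/" and "+", so the result is safe as a filename,
// and the mapping is a bijection: ids round-trip exactly.
function projectIdToFilename(id) {
  return Buffer.from(id, "utf8").toString("base64url");
}

function filenameToProjectId(filename) {
  return Buffer.from(filename, "base64url").toString("utf8");
}
```

As noted above, this round-trips cleanly but cannot detect a filename that was altered after the fact; a hash plus a project.json would trade the bijection for that robustness.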

);
if (!fs.existsSync(credFile)) {
Contributor:

This can be async.

@Beanow Beanow (Contributor) commented Jul 18, 2019

> We should refactor so that the load command is responsible for getting and merging the graphs, and the analyze command is responsible for computing timeline cred with the chosen weights and settings. The "meta" sourcecred tool that @Beanow suggests here can abstract over these pieces.

I'm definitely not suggesting that the normal workflow for people who use the meta package should be different from, or simplified relative to, that of people who use individual packages. I would instead consider the meta package a way to install a known-to-work combination of commonly used package versions. One of the packages that is not the meta package (probably core) should be responsible for acting as the integration layer.

@teamdandelion teamdandelion changed the base branch from master to core-project July 19, 2019 10:42
@teamdandelion teamdandelion force-pushed the core-project branch 3 times, most recently from da12de1 to 4df5969 Compare July 19, 2019 12:38
teamdandelion added a commit that referenced this pull request Jul 21, 2019
This module builds on the project logic added in #1238, and makes it
easy to create projects based on a simple string configuration.
Basically, the spec `foo/bar` creates a project containing just the repo
foo/bar, and the spec `@foo` creates a project containing all of the
repos from the user/organization named foo.

This is pulled out of #1233, but I've enhanced it to support
organizations out of the box.

The method is thoroughly tested.
teamdandelion added a commit that referenced this pull request Jul 22, 2019
This adds a new module, `api/load`, which implements the logic that will
underlie the new `sourcecred load` command. The `api` package is a new
folder that will contain the logic that powers the CLI (but will be
callable directly as we improve SourceCred). As a heuristic, nontrivial
logic in `cli/` should be factored out to `api/`.

In the future, we will likely want to refactor these APIs to
make them more atomic/composable. `api/load` does "all the things" in
terms of loading data, computing cred, and writing it to disk. I'm going
with the simplest approach here (mirroring existing functionality) so
that we can merge #1233 and realize its many benefits more easily.

This work is factored out of #1233. Thanks to @Beanow for [review]
of the module, which resulted in several changes (e.g. organizing it
under api/, having the TaskReporter be dependency injected).

[review]: #1233 (review)

Test plan: `api/load` is tested (via mocking unit tests). Run `yarn test`
@teamdandelion teamdandelion mentioned this pull request Jul 22, 2019
@teamdandelion teamdandelion changed the base branch from core-project to master July 22, 2019 23:53
teamdandelion added a commit that referenced this pull request Jul 22, 2019
@teamdandelion teamdandelion marked this pull request as ready for review July 22, 2019 23:59
I'm re-organizing SC data to be oriented on the graph, rather than on
plugin-specific data structures. So there is no longer a need for the
git loading logic which orients around saving a repository.json file
that's been potentially merged across repos, or indeed the logic for
merging repositories at all. So I'm removing `git/loadGitData`,
`git/mergeRepository`, and relatives.

Test plan: `yarn test --full` passes.
This commit deprecates `cli/load` so that we can write a new
implementation, and then make an atomic switch.

Test plan: `yarn test --full`
The new implementation wraps `api/load`.

Test plan: I've ported over the tests from the old `cli/load`. Run `yarn
test`.
This commit swaps usage over to the new implementation of `cli/load`
(the one that wraps `api/load`) and makes changes throughout the project
to accommodate that we now track instances by Project rather than by
RepoId.

Test plan: Unit tests updated; run `yarn test --full`. Also, for safety:
actually load a project (by whole org, why not) and verify that the
frontend still works.
@teamdandelion teamdandelion merged commit b4c2846 into master Jul 23, 2019
@teamdandelion teamdandelion deleted the dev-load-org branch July 23, 2019 00:01
teamdandelion added a commit that referenced this pull request Jul 23, 2019
This commit removes the `pagerank` and `analyze` commands (both of which
never saw real usage), removes the outdated adapter-based `loadGraph`
method, and removes all traces of the analysis adapters.

It builds on work in #1233 and #1136.

Test plan: `yarn test --full` passes.
teamdandelion added a commit that referenced this pull request Jul 23, 2019
I moved sourcecred/example-git{,hub} to the @sourcecred-test org.

This commit fixes the build given that move.

I've realized that in #1233 I inadvertently made some Git tests that
depend on a snapshot impossible to update. I'm going to compound that slight
technical debt by skipping the tests that depended on that snapshot. I
recognize and accept that I'll need to pay this down when I resuscitate
the git plugin.

Test plan: `yarn test --full`.
wchargin added a commit to sourcecred/cred-action that referenced this pull request Feb 17, 2020
This should be removed upstream, too; it’s no longer used after
<sourcecred/sourcecred#1233>.

wchargin-branch: unused-repos
Development

Successfully merging this pull request may close these issues.

- Unify / Deduplicate the GitHub cache
- Issues with reference detection logic
- Project-level cred attribution
3 participants