track "projects" not repositories #1233

Merged
decentralion merged 5 commits into master from dev-load-org on Jul 23, 2019

@decentralion
Member

commented Jul 17, 2019

This re-organizes SourceCred to track data based on the "project" it is a part of, rather than the GitHub repository it came from. This is motivated by a desire to add first-class support for multi-repository projects in SourceCred.

Currently, we have a very hacky approach: we merge multiple repositories' worth of data into a combined git repository (via mergeRepository) and a combined GitHub RelationalView, and then project graphs from the combined data structure.

This approach has a number of deficiencies:

  • it leads to bugs when combining projects, e.g. #1118
  • it leads to bad cache performance, #1119
  • it requires writing complex merging logic for every plugin's internal data structures
  • it just doesn't model the world well

This complexity was motivated by the fact that Graph didn't use to store enough info for cred analysis and running the UI. Now that #1136 has merged, all we need is graphs, and Graph.merge is already well defined (and a lot simpler).

Therefore, this PR starts from the assumption that each plugin will create atomic Graphs for whatever their "atom" of data is (e.g. a repository for GitHub, but it might be an instance for a future Discourse plugin, etc). Then the graphs for that plugin can get merged together. The scope of a "project" includes many plugins, each with their own atoms, and they all get merged for the project level graph. Then, sourcecred computes timeline cred for the project level graph.
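The layering described above (atoms merged per plugin, then plugins merged per project) can be sketched as follows. This is a minimal illustration that assumes a toy graph made of plain `Map`s; `emptyGraph`, `mergeGraphs`, `projectGraph`, and the sample addresses are hypothetical and are not SourceCred's actual `Graph` class or `Graph.merge` signature.

```javascript
// Toy graph: {nodes: Map<address, node>, edges: Map<address, edge>}.
// This is NOT SourceCred's Graph; it only illustrates the merge layering.
function emptyGraph() {
  return {nodes: new Map(), edges: new Map()};
}

// Merge any number of graphs. Entries sharing an address must carry the
// same payload; otherwise the inputs conflict and we fail loudly.
function mergeGraphs(graphs) {
  const result = emptyGraph();
  for (const graph of graphs) {
    for (const kind of ["nodes", "edges"]) {
      for (const [address, value] of graph[kind]) {
        const existing = result[kind].get(address);
        if (
          existing !== undefined &&
          JSON.stringify(existing) !== JSON.stringify(value)
        ) {
          throw new Error(`conflicting ${kind} entry at ${address}`);
        }
        result[kind].set(address, value);
      }
    }
  }
  return result;
}

// pluginAtoms maps a plugin name to its list of per-atom graphs.
// First merge within each plugin, then merge across plugins.
function projectGraph(pluginAtoms) {
  const perPlugin = Object.values(pluginAtoms).map(mergeGraphs);
  return mergeGraphs(perPlugin);
}

const repoA = emptyGraph();
repoA.nodes.set("github/foo/a/issue/1", {type: "ISSUE"});
repoA.nodes.set("github/user/alice", {type: "USER"});
const repoB = emptyGraph();
repoB.nodes.set("github/foo/b/pull/2", {type: "PULL"});
repoB.nodes.set("github/user/alice", {type: "USER"});

const merged = projectGraph({github: [repoA, repoB]});
console.log(merged.nodes.size); // 3: the shared user node appears once
```

The point of the sketch is that no plugin-specific merging logic is needed: once every atom is a graph, one generic merge covers both levels.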

In this PR, I haven't made an attempt to get the config into anything like its final, fully-fleshed-out form. So the project config doesn't even have a field for plugin, since we're only loading GitHub data at the moment. I don't think there's a big advantage to adding the extra abstraction just yet; I'd rather experiment with this in its simplest form for a while. Just effecting the one change of indexing by project rather than repo is a big change. :)

The new load command is patterned off @wchargin's suggestions in #945 around making an API-ified SourceCred. Specifically, the load command is implemented in api/load.js and takes a declarative config. Then cli/load.js is just a wrapper for producing one of those configs with a bit of sugar.
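As a rough sketch of that split, assuming a simplified options shape: the field names `project`, `params`, `sourcecredDirectory`, and `githubToken` appear in the reviewed code, but `buildLoadOptions`, the environment variable names, the default directory, and the stubbed body of `load` are assumptions made for illustration only.

```javascript
// Hypothetical API layer: takes one declarative options object.
// The body is a stub that only reports what it was asked to do.
async function load(options) {
  const {project, sourcecredDirectory, githubToken} = options;
  if (!githubToken) {
    throw new Error("githubToken is required");
  }
  return `would load ${project.id} into ${sourcecredDirectory}`;
}

// Hypothetical CLI layer: turns argv and environment variables into the
// declarative options object, then hands it to the API layer.
function buildLoadOptions(argv, env) {
  if (argv.length !== 1) {
    throw new Error("usage: load PROJECT_ID");
  }
  return {
    project: {id: argv[0]},
    params: {}, // placeholder; real params are out of scope here
    sourcecredDirectory: env.SOURCECRED_DIRECTORY || "/tmp/sourcecred",
    githubToken: env.SOURCECRED_GITHUB_TOKEN,
  };
}

const options = buildLoadOptions(["foo/bar"], {
  SOURCECRED_GITHUB_TOKEN: "token",
});
load(options).then((message) => console.log(message));
```

Keeping argument parsing out of `load` means other callers can build the same options object directly, without going through the command line.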

This will close #1119 and close #1118 and close #701

@decentralion decentralion requested review from wchargin and Beanow Jul 17, 2019

@decentralion decentralion force-pushed the dev-load-org branch from 3ff7724 to a4195a6 Jul 17, 2019

@Beanow
Member

left a comment

Some thoughts going through this.

if (!fs.existsSync(credPath)) {
  return [];
}
const filenames = fs.readdirSync(credPath).filter((x) => x.endsWith(".json"));

@Beanow

Beanow Jul 18, 2019

Member

Why should this function be synchronous? Scanning arbitrarily large directory structures can cause noticeable slowdowns here.

@decentralion

decentralion Jul 19, 2019

Author Member

I pulled this logic out to #1238. The new version is still synchronous; the reason is that our webpack config depends on it, and I didn't feel like re-writing the webpack config to accept async code. If it winds up being a perf issue, we could do the re-write then, or just keep the sync version for use in webpack and make an async version for regular use.

@decentralion

decentralion Jul 19, 2019

Author Member

It turned out to be pretty easy to have webpack support async config (just export a promise). So I refactored #1238 to be async.


export async function load(options: LoadOptions): Promise<void> {
  const {project, params, sourcecredDirectory, githubToken} = options;
  const taskReporter = new TaskReporter();

@Beanow

Beanow Jul 18, 2019

Member

If we're talking about future refactoring ideas, I believe a dependency-injection style would be preferable to directly referencing and initializing dependencies. Injected dependencies would also be easy to replace with stubs in tests.
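A minimal sketch of the suggestion, assuming the reporter exposes `start` and `finish` methods (only `start` appears in the reviewed code; `finish` and the `RecordingReporter` class are assumptions made for illustration):

```javascript
// A reporter that records events instead of printing them, so a test
// can inspect what happened.
class RecordingReporter {
  constructor() {
    this.events = [];
  }
  start(taskId) {
    this.events.push(`start:${taskId}`);
  }
  finish(taskId) {
    this.events.push(`finish:${taskId}`);
  }
}

// The reporter arrives as an argument instead of being constructed inside
// the function, so callers (and tests) choose the implementation.
async function load(options, taskReporter) {
  const loadTask = `load-${options.project.id}`;
  taskReporter.start(loadTask);
  // ... the real work would happen here ...
  taskReporter.finish(loadTask);
}

const reporter = new RecordingReporter();
load({project: {id: "foo/bar"}}, reporter).then(() => {
  console.log(reporter.events);
});
```

With this shape, a unit test can pass a recording stub and assert on the sequence of tasks without touching the console.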

const loadTask = `load-${options.project.id}`;
taskReporter.start(loadTask);
const cacheDirectory = path.join(sourcecredDirectory, "cache");
mkdirp.sync(cacheDirectory);

@Beanow

Beanow Jul 18, 2019

Member

Could be async as the function already is defined as such.

taskReporter.start("compute-cred");
const cred = await TimelineCred.compute(graph, params, DEFAULT_CRED_CONFIG);
const credDir = path.join(sourcecredDirectory, "data");
mkdirp(credDir);

@Beanow

Beanow Jul 18, 2019

Member

This invocation is inconsistent with the previous one. Also could be async.
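One way to make the two directory-creation calls consistent and asynchronous is sketched below. It swaps `mkdirp` for Node's built-in `fs.promises.mkdir` with `{recursive: true}`, which is a substitution chosen for this example and not what the PR uses; `ensureDirectories` is a hypothetical helper name.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Create both directories asynchronously and wait for each to exist
// before continuing. `{recursive: true}` behaves like `mkdir -p`.
async function ensureDirectories(sourcecredDirectory) {
  const cacheDirectory = path.join(sourcecredDirectory, "cache");
  const credDir = path.join(sourcecredDirectory, "data");
  await fs.promises.mkdir(cacheDirectory, {recursive: true});
  await fs.promises.mkdir(credDir, {recursive: true});
  return {cacheDirectory, credDir};
}

// Usage: create the layout under a fresh temporary directory.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "sourcecred-"));
ensureDirectories(root).then(({cacheDirectory, credDir}) => {
  console.log(fs.existsSync(cacheDirectory), fs.existsSync(credDir));
});
```

Awaiting each call matters: an un-awaited asynchronous directory creation can still be in flight when the next step tries to write into that directory.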

+githubToken: string,
|};

export async function load(options: LoadOptions): Promise<void> {

@Beanow

Beanow Jul 18, 2019

Member

I find the naming a little misleading. For functionality in an analysis namespace, I wouldn't naturally assume it performs I/O-heavy tasks. Additionally, this function is called load, but it does not produce any in-memory loaded data. Instead, it is a cred pre-compute task that produces the cred.json file.

/**
 * Convert a project filename back into an id.
 *
 * Converts all '$' to '/'. Errors if the filename contains '/'.

@Beanow

Beanow Jul 18, 2019

Member

Encoding-scheme-wise, if you insist on using filenames to store the ID, you'll want the usual encoding suspects: hex encoding of UTF-8, or base64-url mode. However, since different filesystems have different constraints on case sensitivity and maximum filename length, a safer option would be hashing (SHA-1, for instance) and using a file's contents to store the ID.

@decentralion

decentralion Jul 19, 2019

Author Member

Yeah, take a look at #1238 where I've done it using base-64. Is there a particular reason we need base64-url mode?

Could do hashing, but I like having a bijection. Although the way it's implemented in #1238 there's also a project.json file stored in every project directory, so we could just use a hash, and then getProjectIds can read the project.json file to get the id. It's just a little bit more complexity.
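The hash-plus-`project.json` alternative described above might look roughly like this. It is a sketch under assumptions: `directoryNameForId` and `writeProject` are hypothetical helpers, SHA-1 is used only because it was the example hash named in the review, and the `getProjectIds` here is an illustration rather than the implementation in #1238. Synchronous file calls are used for brevity.

```javascript
const crypto = require("crypto");
const fs = require("fs");
const os = require("os");
const path = require("path");

// Directory name derived from the id: fixed length, filesystem-safe,
// and deliberately not reversible.
function directoryNameForId(projectId) {
  return crypto.createHash("sha1").update(projectId, "utf8").digest("hex");
}

// Store the project (including its id) inside the hashed directory.
function writeProject(projectsDir, project) {
  const dir = path.join(projectsDir, directoryNameForId(project.id));
  fs.mkdirSync(dir, {recursive: true});
  fs.writeFileSync(path.join(dir, "project.json"), JSON.stringify(project));
}

// Recover ids from file contents rather than by decoding directory names.
function getProjectIds(projectsDir) {
  return fs
    .readdirSync(projectsDir)
    .map((name) => path.join(projectsDir, name, "project.json"))
    .filter((file) => fs.existsSync(file))
    .map((file) => JSON.parse(fs.readFileSync(file, "utf8")).id)
    .sort();
}

const projectsDir = fs.mkdtempSync(path.join(os.tmpdir(), "projects-"));
writeProject(projectsDir, {id: "foo/bar"});
writeProject(projectsDir, {id: "@foo"});
console.log(getProjectIds(projectsDir)); // [ '@foo', 'foo/bar' ]
```

The trade-off is the one named in the comment: the mapping from directory name to id is no longer a bijection that can be computed from the name alone, so listing projects costs one file read per directory.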

@wchargin

wchargin Jul 20, 2019

Member

+1; this should be base-64. Can’t really be hex-of-UTF-8 because you
can’t assume that the input filenames are UTF-8.

The mirror cache uses base-64.

@Beanow

Beanow Aug 6, 2019

Member

I suggested base64-url because regular base64 contains `/` in its character set, which seems like it could cause issues on filesystems.

As for not being able to assume UTF-8: there's no difference there between hex and base64 encoding. You always need to know how to convert between the binary form and the correct characters, regardless of how you encode the binary form. We can assume UTF-8, however, because we control the filenames and their encoding.

The problem is data corruption, or users (or "smart" filesystems) changing filenames. Because we will attempt to decode whatever filename we find, the result may be some very weird strings, and those strings could circumvent filesystem protections against illegal characters.

Using hashing removes this entire class of issues, because you'll be forced to work with the assumption you can't decode it.
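To illustrate the `/` concern with standard base64 and what the URL-safe variant changes, here is a small sketch. `idToFilename` and `filenameToId` are hypothetical helpers, and the id `"a??"` is a contrived input chosen only because its standard base64 form contains a slash.

```javascript
// Encode a project id as a filename-safe string: base64 of the UTF-8
// bytes, with '+' and '/' swapped for '-' and '_' and padding removed.
function idToFilename(projectId) {
  return Buffer.from(projectId, "utf8")
    .toString("base64")
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, "");
}

// Inverse of idToFilename. Rejects names outside the URL-safe alphabet
// and names that do not re-encode to themselves (e.g. after corruption).
function filenameToId(filename) {
  if (!/^[A-Za-z0-9_-]*$/.test(filename)) {
    throw new Error(`not a base64url filename: ${filename}`);
  }
  const standard = filename.replace(/-/g, "+").replace(/_/g, "/");
  const id = Buffer.from(standard, "base64").toString("utf8");
  if (idToFilename(id) !== filename) {
    throw new Error(`non-canonical filename: ${filename}`);
  }
  return id;
}

// Standard base64 can emit '/', which is a path separator on disk.
console.log(Buffer.from("a??", "utf8").toString("base64")); // YT8/
console.log(idToFilename("a??")); // YT8_
console.log(filenameToId(idToFilename("foo/bar"))); // foo/bar
```

This keeps the bijection, and the re-encoding check catches some renamed or corrupted filenames. It does not address case-insensitive filesystems or filename length limits, which are the cases where hashing is the more robust choice.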

);
if (!fs.existsSync(credFile)) {

@Beanow

Beanow Jul 18, 2019

Member

This can be async.

@Beanow

Member

commented Jul 18, 2019

> We should refactor so that the load command is responsible for getting and merging the graphs, and the analyze command is responsible for computing timeline cred with the chosen weights and settings. The "meta" sourcecred tool that @Beanow suggests here can abstract over these pieces.

I'm definitely not suggesting that the normal workflow for people who use the meta package should be different from, or simpler than, the workflow for those who use individual packages. I would instead consider the meta package a way to install a bundle of commonly used, known-to-work combinations of versions. One of the packages other than the meta package (probably core) should be responsible for acting as the integration layer.

@decentralion decentralion force-pushed the dev-load-org branch from a4195a6 to 1ad628f Jul 19, 2019

@decentralion decentralion changed the base branch from master to core-project Jul 19, 2019

@decentralion decentralion force-pushed the core-project branch 3 times, most recently from da12de1 to 4df5969 Jul 19, 2019

@decentralion decentralion force-pushed the dev-load-org branch from 1ad628f to 207c454 Jul 19, 2019

decentralion added a commit that referenced this pull request Jul 21, 2019

add github/specToProject
This module builds on the project logic added in #1238, and makes it
easy to create projects based on a simple string configuration.
Basically, the spec `foo/bar` creates a project containing just the repo
foo/bar, and the spec `@foo` creates a project containing all of the
repos from the user/organization named foo.

This is pulled out of #1233, but I've enhanced it to support
organizations out of the box.

The method is thoroughly tested.
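The two spec forms described in this commit message can be told apart with a small parser. The sketch below is an assumption-laden illustration, not the `github/specToProject` module itself: `parseSpec`, the result shape, and the character classes in the regular expressions are all hypothetical, and listing an organization's repositories (which needs the GitHub API) is left out.

```javascript
// Classify a spec string. `foo/bar` names one repository; `@foo` names a
// user or organization whose repositories would be looked up separately.
function parseSpec(spec) {
  const repo = /^([A-Za-z0-9-]+)\/([A-Za-z0-9._-]+)$/.exec(spec);
  if (repo) {
    return {type: "REPO", owner: repo[1], name: repo[2]};
  }
  const owner = /^@([A-Za-z0-9-]+)$/.exec(spec);
  if (owner) {
    return {type: "OWNER", login: owner[1]};
  }
  throw new Error(`invalid spec: ${spec}`);
}

console.log(parseSpec("foo/bar")); // { type: 'REPO', owner: 'foo', name: 'bar' }
console.log(parseSpec("@foo")); // { type: 'OWNER', login: 'foo' }
```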

@decentralion decentralion referenced this pull request Jul 22, 2019

decentralion added a commit that referenced this pull request Jul 22, 2019

add `api/load` (#1251)
This adds a new module, `api/load`, which implements the logic that will
underlie the new `sourcecred load` command. The `api` package is a new
folder that will contain the logic that powers the CLI (but will be
callable directly as we improve SourceCred). As a heuristic, nontrivial
logic in `cli/` should be factored out to `api/`.

In the future, we will likely want to refactor these APIs to
make them more atomic/composable. `api/load` does "all the things" in
terms of loading data, computing cred, and writing it to disk. I'm going
with the simplest approach here (mirroring existing functionality) so
that we can merge #1233 and realize its many benefits more easily.

This work is factored out of #1233. Thanks to @Beanow for [review]
of the module, which resulted in several changes (e.g. organizing it
under api/, having the TaskReporter be dependency injected).

[review]: #1233 (review)

Test plan: `api/load` is tested (via mocking unit tests). Run `yarn test`

@decentralion decentralion force-pushed the dev-load-org branch from 207c454 to afdbb0b Jul 22, 2019

@decentralion decentralion changed the base branch from core-project to master Jul 22, 2019

decentralion added a commit that referenced this pull request Jul 22, 2019

@decentralion decentralion force-pushed the dev-load-org branch from 12663ae to ba1a22c Jul 22, 2019

decentralion added a commit that referenced this pull request Jul 22, 2019

@decentralion decentralion marked this pull request as ready for review Jul 22, 2019

decentralion added some commits Jul 16, 2019

change the world: track projects not repos
This commit swaps usage over to the new implementation of `cli/load`
(the one that wraps `api/load`) and makes changes throughout the project
to accommodate that we now track instances by Project rather than by
RepoId.

Test plan: Unit tests updated; run `yarn test --full`. Also, for safety:
actually load a project (by whole org, why not) and verify that the
frontend still works.

remove old-style git loading, and its testing
I'm re-organizing SC data to be oriented on the graph, rather than on
plugin-specific data structures. So there is no longer a need for the
git loading logic which orients around saving a repository.json file
that's been potentially merged across repos, or indeed the logic for
merging repositories at all. So I'm removing `git/loadGitData`,
`git/mergeRepository`, and relatives.

Test plan: `yarn test --full` passes.

deprecate cli/load
This commit deprecates `cli/load` so that we can write a new
implementation, and then make an atomic switch.

Test plan: `yarn test --full`

re-implement src/cli/load
The new implementation wraps `api/load`.

Test plan: I've ported over the tests from the old `cli/load`. Run `yarn
test`.

@decentralion decentralion force-pushed the dev-load-org branch from ba1a22c to cdd3085 Jul 22, 2019

@decentralion decentralion merged commit b4c2846 into master Jul 23, 2019

1 check passed

ci/circleci: test Your tests passed on CircleCI!

@decentralion decentralion deleted the dev-load-org branch Jul 23, 2019

decentralion added a commit that referenced this pull request Jul 23, 2019

Remove deprecated commands and adapters
This commit removes the `pagerank` and `analyze` commands (both of which
never saw real usage), removes the outdated adapter-based `loadGraph`
method, and removes all traces of the analysis adapters.

It builds on work in #1233 and #1136.

Test plan: `yarn test --full` passes.

decentralion added a commit that referenced this pull request Jul 23, 2019

Fixup project for move of example repos
I moved sourcecred/example-git{,hub} to the @sourcecred-test org.

This commit fixes the build given that move.

I've realized that in #1233 I inadvertently made some Git tests that
depend on a snapshot un-updateable. I'm going to compound on that slight
technical debt by skipping the tests that depended on that snapshot. I
recognize and accept that I'll need to pay this down when I resuscitate
the git plugin.

Test plan: `yarn test --full`.
