track "projects" not repositories #1233
Some thoughts from going through this.
src/analysis/getProjectIds.js
if (!fs.existsSync(credPath)) {
  return [];
}
const filenames = fs.readdirSync(credPath).filter((x) => x.endsWith(".json"));
Why should this function be synchronous? Scanning arbitrarily large directory structures can cause noticeable slowdowns here.
I pulled this logic out to #1238. The new version is still synchronous; the reason is that it is depended on by our webpack config, and I didn't feel like rewriting the webpack config to accept async code. If it winds up being a perf issue we could do the rewrite then, or just keep the sync version for use in webpack and make an async version for regular use.
It turned out to be pretty easy to have webpack support async config (just export a promise). So I refactored #1238 to be async.
src/analysis/load.js
export async function load(options: LoadOptions): Promise<void> {
  const {project, params, sourcecredDirectory, githubToken} = options;
  const taskReporter = new TaskReporter();
Since we're talking about future refactoring ideas: I believe a dependency-injection style would be preferable to directly referencing and initializing dependencies. Injected dependencies would also make great stubs for testing.
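The dependency-injection style suggested here could look like the following sketch: the caller constructs the TaskReporter and passes it in, so tests can pass a stub that records calls instead of printing. The `load` body and the reporter API shown are illustrative, not the actual implementation.

```javascript
// An injectable reporter with the console behavior as the default.
class ConsoleTaskReporter {
  start(taskId) {
    console.error(`  GO   ${taskId}`);
    return this;
  }
  finish(taskId) {
    console.error(`  DONE ${taskId}`);
    return this;
  }
}

// `load` receives its reporter instead of constructing one.
async function load(options, taskReporter) {
  const taskId = `load-${options.project.id}`;
  taskReporter.start(taskId);
  // ...actual loading work would happen here...
  taskReporter.finish(taskId);
}

// In tests, a stub records calls rather than printing:
function makeStubReporter() {
  const calls = [];
  return {
    calls,
    start(taskId) { calls.push(["start", taskId]); return this; },
    finish(taskId) { calls.push(["finish", taskId]); return this; },
  };
}
```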
src/analysis/load.js
const loadTask = `load-${options.project.id}`;
taskReporter.start(loadTask);
const cacheDirectory = path.join(sourcecredDirectory, "cache");
mkdirp.sync(cacheDirectory);
Could be async as the function already is defined as such.
src/analysis/load.js
taskReporter.start("compute-cred");
const cred = await TimelineCred.compute(graph, params, DEFAULT_CRED_CONFIG);
const credDir = path.join(sourcecredDirectory, "data");
mkdirp(credDir);
This invocation is inconsistent with the previous one (`mkdirp` here vs. `mkdirp.sync` above). It could also be async.
src/analysis/load.js
  +githubToken: string,
|};

export async function load(options: LoadOptions): Promise<void> {
I find the naming a little misleading. For functionality in an `analysis` namespace, I wouldn't naturally expect I/O-heavy tasks. Additionally, this function is called `load`, but it does not produce any in-memory loaded data; instead, it is a cred pre-compute task that produces the cred.json file.
src/analysis/getProjectIds.js
/**
 * Convert a project filename back into an id.
 *
 * Converts all '$' to '/'. Errors if the filename contains '/'.
Encoding scheme wise, if you insist on using filenames to store the ID, you'll want the usual encoding suspects. hex encoding of utf-8, or base64-url mode. However since different filesystems have different constraints on case-sensitivity and maximum length of filenames. A safer option would be hashing (sha1 for instance) and using a file's contents to store the ID.
Yeah, take a look at #1238 where I've done it using base-64. Is there a particular reason we need base64-url mode?

Could do hashing, but I like having a bijection. Although the way it's implemented in #1238, there's also a project.json file stored in every project directory, so we could just use a hash, and then `getProjectIds` can read the project.json file to get the id. It's just a little bit more complexity.
+1; this should be base-64. Can't really be hex-of-UTF-8 because you can't assume that the input filenames are UTF-8. The mirror cache uses base-64.
I suggested base64-url because regular base64 contains `/` in its character set, which seems like it could cause issues for filesystems.

About not being able to assume UTF-8: there's no difference there between hex and base64 encoding. You always need to know how to convert between binary and the correct characters, regardless of how you encode the binary format. We can assume it, however, because we control the filenames and their encoding.

The problem is data corruption / users (or "smart" filesystems) changing filenames. As we will attempt to decode those, it may result in some very weird strings, as they would be able to circumvent filesystem protections against illegal characters.

Using hashing removes this entire class of issues, because you'll be forced to work with the assumption you can't decode it.
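A minimal sketch of the base64-url bijection under discussion: plain base64 uses `+` and `/`, so we substitute `-` and `_` (the RFC 4648 "URL and filename safe" alphabet). Encoding and decoding are exact inverses, preserving the bijection. Function names are illustrative, not the #1238 implementation.

```javascript
// Encode a project id into a filesystem-safe name: base64, then swap
// the two characters that are unsafe in filenames/paths.
function encodeProjectId(id) {
  return Buffer.from(id, "utf8")
    .toString("base64")
    .replace(/\+/g, "-")
    .replace(/\//g, "_");
}

// Invert the substitution, then decode; exact inverse of the above.
function decodeProjectId(filename) {
  const b64 = filename.replace(/-/g, "+").replace(/_/g, "/");
  return Buffer.from(b64, "base64").toString("utf8");
}
```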
);
if (!fs.existsSync(credFile)) {
This can be async.
I'm definitely not suggesting the normal workflow for people who use the meta package should be different / simplified from those who use individual packages. I would instead consider the meta package as a way to install a bundle of commonly used and known-to-work combinations of versions. One of the packages that is not the meta package (probably core) should be responsible for acting as the integration layer.
This module builds on the project logic added in #1238, and makes it easy to create projects based on a simple string configuration. Basically, the spec `foo/bar` creates a project containing just the repo foo/bar, and the spec `@foo` creates a project containing all of the repos from the user/organization named foo. This is pulled out of #1233, but I've enhanced it to support organizations out of the box. The method is thoroughly tested.
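The spec behavior described above could be sketched as follows: `foo/bar` yields a project with just that repo, and `@foo` yields a project with all of foo's repos. The GitHub lookup is injected so the org case can be exercised without network access; the project shape and function names are illustrative, not the actual module.

```javascript
// Resolve a string spec into a hypothetical project descriptor.
async function projectFromSpec(spec, fetchOwnerRepos) {
  if (spec.startsWith("@")) {
    // `@foo`: all repos for the user/organization `foo`.
    const owner = spec.slice(1);
    const repoIds = await fetchOwnerRepos(owner);
    return {id: spec, repoIds};
  }
  // `foo/bar`: a project containing exactly that repo.
  const parts = spec.split("/");
  if (parts.length === 2 && parts.every((p) => p.length > 0)) {
    return {id: spec, repoIds: [spec]};
  }
  throw new Error(`invalid spec: ${spec}`);
}
```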
This adds a new module, `api/load`, which implements the logic that will underlie the new `sourcecred load` command. The `api` package is a new folder that will contain the logic that powers the CLI (but will be callable directly as we improve SourceCred). As a heuristic, nontrivial logic in `cli/` should be factored out to `api/`. In the future, we will likely want to refactor these APIs to make them more atomic/composable. `api/load` does "all the things" in terms of loading data, computing cred, and writing it to disk. I'm going with the simplest approach here (mirroring existing functionality) so that we can merge #1233 and realize its many benefits more easily. This work is factored out of #1233. Thanks to @Beanow for [review] of the module, which resulted in several changes (e.g. organizing it under api/, having the TaskReporter be dependency-injected). [review]: #1233 (review) Test plan: `api/load` is tested (via mocking unit tests). Run `yarn test`.
I'm re-organizing SC data to be oriented on the graph, rather than on plugin-specific data structures. So there is no longer a need for the git loading logic which orients around saving a repository.json file that's been potentially merged across repos, or indeed the logic for merging repositories at all. So I'm removing `git/loadGitData`, `git/mergeRepository`, and relatives. Test plan: `yarn test --full` passes.
This commit deprecates `cli/load` so that we can write a new implementation, and then make an atomic switch. Test plan: `yarn test --full`
The new implementation wraps `api/load`. Test plan: I've ported over the tests from the old `cli/load`. Run `yarn test`.
This commit swaps usage over to the new implementation of `cli/load` (the one that wraps `api/load`) and makes changes throughout the project to accommodate that we now track instances by Project rather than by RepoId. Test plan: Unit tests updated; run `yarn test --full`. Also, for safety: actually load a project (by whole org, why not) and verify that the frontend still works.
I moved sourcecred/example-git{,hub} to the @sourcecred-test org. This commit fixes the build given that move. I've realized that in #1233 I inadvertently made some Git tests that depend on a snapshot un-updateable. I'm going to compound on that slight technical debt by skipping the tests that depended on that snapshot. I recognize and accept that I'll need to pay this down when I resuscitate the git plugin. Test plan: `yarn test --full`.
This should be removed upstream, too; it’s no longer used after <sourcecred/sourcecred#1233>. wchargin-branch: unused-repos
This re-organizes SourceCred to track data based on the "project" it is a part of, rather than the GitHub repository it came from. This is motivated by a desire to add first-class support for multi-repository projects in SourceCred.
Currently, we have a very hacky approach: we merge multiple repositories' worth of data into a combined git repository (via `mergeRepository`) and a combined GitHub `RelationalView`, and then project graphs from the combined data structure. This approach has a number of deficiencies.

This complexity was motivated by the fact that `Graph` didn't use to store enough info for cred analysis and running the UI. Now that #1136 has merged, all we need is graphs, and `Graph.merge` is already well defined (and a lot simpler). Therefore, this PR starts from the assumption that each plugin will create atomic `Graph`s for whatever their "atom" of data is (e.g. a repository for GitHub, but it might be an instance for a future Discourse plugin, etc.). Then the graphs for that plugin can get merged together. The scope of a "project" includes many plugins, each with their own atoms, and they all get merged for the project-level graph. Then, SourceCred computes timeline cred for the project-level graph.

In this PR, I haven't made an attempt to get the config to anything like its final, fully fleshed-out form. So the project config doesn't even have a field for plugin, since we're only loading GitHub data at the moment. I don't think there's a big advantage to adding the extra abstraction just yet; I'd rather experiment with this in its simplest form for a while. Just effecting the one change of indexing by project rather than repo is a big change. :)
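The per-plugin merge pipeline described above can be sketched as a toy model: each plugin emits one graph per "atom" (a repo for GitHub, an instance for a hypothetical Discourse plugin), those merge per plugin, and the plugin graphs merge into the project-level graph. Graphs are plain node/edge sets here purely for illustration; the real `Graph` class is much richer.

```javascript
// Union-merge a list of toy graphs (node and edge sets).
function mergeGraphs(graphs) {
  const nodes = new Set();
  const edges = new Set();
  for (const g of graphs) {
    for (const n of g.nodes) nodes.add(n);
    for (const e of g.edges) edges.add(e);
  }
  return {nodes, edges};
}

// graphsByPlugin: one array of atomic graphs per plugin. Merge each
// plugin's atoms first, then merge across plugins for the project.
function projectGraph(graphsByPlugin) {
  const perPlugin = graphsByPlugin.map(mergeGraphs);
  return mergeGraphs(perPlugin);
}
```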
The new load command is patterned off @wchargin's suggestions in #945 around making an API-ified SourceCred. Specifically, the load command is implemented in `api/load.js` and takes a declarative config. Then `cli/load.js` is just a wrapper for producing one of those configs with a bit of sugar.

This will close #1119 and close #1118 and close #701