Timeline Cred! #1212

Merged
decentralion merged 8 commits into master from clean-timelines on Jul 11, 2019

@decentralion
Member

commented Jul 10, 2019

This pull request has a productionized version of timeline cred. See this post and #862 for some context.

The general approach is:

  • We slice the history of the graph into week-long intervals.
  • We iterate through the intervals, keeping track of 'node weight' and 'edge weight' as we go.
  • Each node/edge has full weight when it is added, and then decays by a constant proportion (intervalDecay) every week.
  • For each interval, we run PageRank, using the current node weights as the seed vector and the current edge weights to create the Markov chain. (Any edges that have not yet been created are given a weight of 0.)
  • We then normalize the PageRank scores into 'cred' according to the rule that the total amount of cred accumulated by users in that week is equal to the total node weight across all nodes.
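
The decay step in the list above can be sketched as follows. This is a minimal illustration and not the code in this PR: `NodeInfo`, `decayedWeights`, the node ids, and the example values are all hypothetical, and the sketch assumes that `intervalDecay` is the fraction of weight retained from one week to the next.

```typescript
// Hypothetical sketch of time-decayed node weights, one entry per weekly interval.
// Assumption: intervalDecay is the fraction of weight kept each week.
interface NodeInfo {
  id: string;
  weight: number; // full weight, applied in the interval where the node is created
  createdInterval: number; // index of the interval in which the node first appears
}

function decayedWeights(
  nodes: NodeInfo[],
  numIntervals: number,
  intervalDecay: number
): Array<{[id: string]: number}> {
  const result: Array<{[id: string]: number}> = [];
  for (let i = 0; i < numIntervals; i++) {
    const weights: {[id: string]: number} = {};
    for (const node of nodes) {
      if (node.createdInterval > i) {
        continue; // not created yet, so it contributes nothing in this interval
      }
      const age = i - node.createdInterval;
      weights[node.id] = node.weight * Math.pow(intervalDecay, age);
    }
    result.push(weights);
  }
  return result;
}

// A pull request with weight 4 created in week 0, an issue with weight 2 in week 1.
const weeklyWeights = decayedWeights(
  [
    {id: "pull/1", weight: 4, createdInterval: 0},
    {id: "issue/7", weight: 2, createdInterval: 1},
  ],
  3,
  0.5
);
```

In this sketch each interval's weights would then serve as the PageRank seed for that week; the PageRank run itself is omitted.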

What this means intuitively: if pull requests have a weight of 4, then a total of 4 cred will be created for each pull request, and it will flow to the people who did work associated with that pull request. Most of that 4 cred will be created and "released" in the week when the pull request was made, but some will continue to flow out of the pull request for a few weeks afterwards.

Compared to the implementation shown in the Discourse post above, this implementation is 'productionized'. That basically means I gave more thought to the APIs and added more testing. The performance is also a little better (about 12% better in my tests).

It's also factored so that future performance wins will be easy to get, by using a mutable Markov chain rather than re-creating a Markov chain for each interval.

Test plan: The individual commits have testing, and I've used this branch along with a yet-to-be-merged UI for displaying timeline cred; the results it gives are reasonable.

Thanks to @mzargham for help choosing the algorithm.

@decentralion decentralion force-pushed the clean-timelines branch from 7e0ad8b to 903e8cd Jul 11, 2019

@decentralion decentralion referenced this pull request Jul 11, 2019
decentralion added 8 commits Jul 7, 2019
add `analysis/timeline/interval`
This commit adds an `interval` module which defines intervals (time
ranges), and methods for slicing up a graph into its constituent time
intervals. This is prerequisite work for #862.

I've added a dep on d3-array.

Test plan: Unit tests added; run `yarn test`
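
As a rough illustration of slicing a history into week-long intervals, here is a minimal sketch. It is not the `interval` module itself: `WeekInterval`, `weekIntervals`, and the field names are hypothetical, and for simplicity it aligns boundaries to fixed 7-day multiples of the Unix epoch rather than to calendar weeks.

```typescript
// Hypothetical sketch: cover a set of timestamps with consecutive week-long intervals.
// Each interval is half-open: startTimeMs <= t < endTimeMs.
interface WeekInterval {
  startTimeMs: number;
  endTimeMs: number;
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function weekIntervals(timestampsMs: number[]): WeekInterval[] {
  if (timestampsMs.length === 0) {
    return [];
  }
  const min = Math.min(...timestampsMs);
  const max = Math.max(...timestampsMs);
  // Assumption: snap the first boundary down to a multiple of WEEK_MS.
  const first = Math.floor(min / WEEK_MS) * WEEK_MS;
  const intervals: WeekInterval[] = [];
  for (let start = first; start <= max; start += WEEK_MS) {
    intervals.push({startTimeMs: start, endTimeMs: start + WEEK_MS});
  }
  return intervals;
}

// Timestamps spanning three consecutive weeks.
const exampleIntervals = weekIntervals([2 * WEEK_MS + 5, 4 * WEEK_MS]);
```

Every timestamp falls in exactly one interval, and weeks with no activity between the first and last timestamp still get an interval.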
add `analysis/weightEvaluator`
This commit adds new weight evaluators for nodes and edges. Unlike the
previous evaluator, edges and nodes are handled as separate concerns,
rather than composing the node weights into the edge weights. I think
this separation is cleaner.

Both evaluators use only the address, not the full (Node or Edge)
object. We may later want to give the edge evaluator access to the
full Edge, if we decide we want node-type-differentiated edge
weights (e.g. if a hasParent edge has a different weight depending on
whether it is connected to an Issue or a Repository).

weightsToEdgeEvaluator has been refactored to use the new evaluators,
and has been given a deprecation notice.

Test plan: `yarn test`
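
To illustrate what address-only evaluators kept as separate concerns might look like, here is a minimal sketch. Everything in it is an assumption for illustration: the `Address` representation as an array of strings, the `makeEvaluator` helper, the longest-prefix rule, and the example addresses and weights are hypothetical and are not taken from the `weightEvaluator` module.

```typescript
// Hypothetical sketch: weight evaluators that look only at an address.
// Assumption: an address is an array of string parts, and types are address prefixes.
type Address = string[];

interface PrefixWeight {
  prefix: Address;
  weight: number;
}

function matchesPrefix(address: Address, prefix: Address): boolean {
  if (prefix.length > address.length) {
    return false;
  }
  for (let i = 0; i < prefix.length; i++) {
    if (address[i] !== prefix[i]) {
      return false;
    }
  }
  return true;
}

// Returns an evaluator that uses the weight of the longest matching prefix,
// falling back to defaultWeight when nothing matches.
function makeEvaluator(
  table: PrefixWeight[],
  defaultWeight: number
): (address: Address) => number {
  return (address: Address): number => {
    let bestLength = -1;
    let weight = defaultWeight;
    for (const entry of table) {
      if (matchesPrefix(address, entry.prefix) && entry.prefix.length > bestLength) {
        bestLength = entry.prefix.length;
        weight = entry.weight;
      }
    }
    return weight;
  };
}

// Node and edge weights are configured and evaluated independently.
const nodeEvaluator = makeEvaluator(
  [
    {prefix: ["github", "pull"], weight: 4},
    {prefix: ["github", "issue"], weight: 2},
  ],
  1
);
const edgeEvaluator = makeEvaluator([{prefix: ["github", "hasParent"], weight: 0.25}], 1);
```

Because neither evaluator sees the other's table, changing a node weight in this sketch never changes an edge weight, which is the separation the commit message describes.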
Modify `sourcecred load` to save timeline cred
Test plan: Observe changes to the snapshot for example-github-load.
`yarn test --full` passes.
Add `analysis/timeline/timelinePagerank`
As the name would suggest, this module allows computing timeline
PageRank on a graph. See documentation in the module for details on the
algorithm.

Test plan: The module has incomplete testing. The timelinePagerank
function calls out to iterators for getting the time-decayed node
weights and the time-decayed markov chain; these iterators are tested.
However, the wrapper logic that composes these pieces together,
calculates seed vectors, and runs PageRank is not tested. I felt it
would be a pain to test and settled for reviewing the code carefully,
and putting a cautionary note at the top of the function.
add `analysis/timeline/distributionToCred`
This module takes the timeline distributions created by
`timelinePagerank`, and re-normalizes the scores into cred. For details
on the algorithm, read comments and docstrings in the module.

Test plan: Unit tests added.
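
The normalization rule from the PR description (the cred accumulated by users in an interval equals that interval's total node weight) can be sketched for a single interval as follows. This is an illustration under stated assumptions, not the module's code: `scoresToCred`, the parallel-array layout, and the `isUser` flags are hypothetical.

```typescript
// Hypothetical sketch: rescale one interval's PageRank scores into cred.
// scores[i] is the score of node i; isUser[i] marks nodes that count as users.
function scoresToCred(
  scores: number[],
  isUser: boolean[],
  totalNodeWeight: number
): number[] {
  let userScore = 0;
  for (let i = 0; i < scores.length; i++) {
    if (isUser[i]) {
      userScore += scores[i];
    }
  }
  if (userScore === 0) {
    // Assumption: with no user score there is nothing to distribute.
    return scores.map(() => 0);
  }
  // Scale every score so that the user scores sum to the interval's total node weight.
  const scale = totalNodeWeight / userScore;
  return scores.map((score) => score * scale);
}

// Nodes 0 and 2 are users; the interval's total node weight is 6.
const intervalCred = scoresToCred([0.5, 0.25, 0.25], [true, false, true], 6);
```

Non-user nodes keep a scaled score as well, but only the user entries are constrained to sum to the total node weight.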
add `analysis/timeline/filterTimelineCred`
This adds the `filterTimelineCred` module, which dramatically reduces
the size of timeline cred by throwing away all nodes that are not a user
or repository. It also supports serialization / deserialization.

Test plan: unit tests included
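
A minimal sketch of the filtering idea, assuming nodes are identified by array-of-string addresses and that user and repository nodes can be recognized by an address prefix. The names `NodeCred`, `filterCred`, and the example prefixes are hypothetical, and plain JSON stands in for whatever serialization format the module actually uses.

```typescript
// Hypothetical sketch: keep only cred for nodes whose address starts with a kept prefix.
interface NodeCred {
  address: string[];
  cred: number[]; // one cred value per interval
}

function addressHasPrefix(address: string[], prefix: string[]): boolean {
  if (prefix.length > address.length) {
    return false;
  }
  for (let i = 0; i < prefix.length; i++) {
    if (address[i] !== prefix[i]) {
      return false;
    }
  }
  return true;
}

function filterCred(nodes: NodeCred[], keepPrefixes: string[][]): NodeCred[] {
  return nodes.filter((node) =>
    keepPrefixes.some((prefix) => addressHasPrefix(node.address, prefix))
  );
}

const allCred: NodeCred[] = [
  {address: ["github", "user", "alice"], cred: [4, 2]},
  {address: ["github", "pull", "example/repo", "1"], cred: [1, 0.5]},
  {address: ["github", "repo", "example/repo"], cred: [3, 3]},
];

// Keep users and repositories; drop everything else.
const filteredCred = filterCred(allCred, [
  ["github", "user"],
  ["github", "repo"],
]);

// Round-trip through JSON as a stand-in for serialization and deserialization.
const restoredCred: NodeCred[] = JSON.parse(JSON.stringify(filteredCred));
```

Since most nodes in a typical graph are neither users nor repositories, a filter like this is what makes the serialized cred much smaller.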
add `analysis/timeline/timelineCred`
This adds a TimelineCred class which serves several functions:
- acts as a view on timeline cred data
  - (lets you get highest scoring nodes, etc)
- has an interface for computing timeline cred
- lets you serialize cred along with the graph and parameter settings
  that generated it in a single object

One upshot of this design is that if we let the user provide weights
(or other config) at load time in the CLI, those weights will get
carried over to the frontend, since they are included along with the
cred results.

TimelineCred has 'Parameters' and 'Config'. The parameters are
user-specified and may change within a given instance. The config is
essentially codebase-level configuration around what types are used for
scoring, etc; I don't expect users to be changing this. To keep the
analysis module decoupled from the plugins module, I put a default
config in `src/plugins/defaultCredConfig`; I expect all users of
TimelineCred to use this config. (At least for a while!)

Test plan: I've added some tests to `TimelineCred`. Run `yarn test`. I
also have a yet-unmerged branch that builds a functioning cred display
UI using the `TimelineCred` class.

fixup tlc
Update CHANGELOG.md
Test plan: Visual inspection

@decentralion decentralion force-pushed the clean-timelines branch from 903e8cd to ea38e49 Jul 11, 2019

@decentralion decentralion merged commit 2d16afe into master Jul 11, 2019

1 check passed

ci/circleci: test Your tests passed on CircleCI!

@decentralion decentralion deleted the clean-timelines branch Jul 11, 2019

decentralion added a commit that referenced this pull request Jul 11, 2019
Decrease GitHub TTL from 7 days to 12 hours
As described in #987, we use a single TTL across GitHub types. Right
now, the TTL is set to 7 days. This means that it's possible to run
`sourcecred load`, but still be missing the last 7 days' worth of issues.
Now that we're doing timeline cred (cf #1212), this is not acceptable.

As a workaround until we fix #987, I'm decreasing the TTL to 12 hours.
That's still long enough to make a good experience for someone who is
tweaking config and calling `sourcecred load` a lot, but ensures that
freshly-loaded results still have recent activity.

Test plan: `yarn test`