Commit classification #188

vmarkovtsev · 2019-02-06T13:40:57Z

Add a Hercules analysis to classify commits. There must be a "bugfix" class; the other classes are up for discussion.

Available features:

Regular diffs
go-cloc
Structured diffs via https://github.com/quinor/sdk/tree/master/uast/diff
Commit messages

Relates to https://github.com/src-d/poc-pluralsight/issues/6#issuecomment-457708560

TODO for @Jan21: add here the list of relevant papers

m09 · 2019-02-06T16:05:44Z

For the potential classes, the ones I have in mind would be:

bugfix
documentation
refactoring
feature
ci
packaging

Then an interesting point to discuss is: do we want a single class per commit, one class per line/block, or multiple classes per commit (without line/block tagging)?

EgorBu · 2019-02-06T18:13:54Z

About classification - it's possible to collect some labels from PRs in huge repositories like

https://github.com/tensorflow/tensorflow/labels and bug related
https://github.com/pytorch/pytorch/labels and bug related
and so on

Pattern mining

I like the idea of pattern mining.
I would suggest doing commit deduplication + community detection (or clustering / topic modeling) instead of classification.
Why:

it's not clear how many classes we have (50+ labels in each of the repositories I mentioned above).
there could be a lot of variability even within one class.
the number of commits is enormous.

`deduplication + community detection`

The approach is the same as in apollo:

extract features: textual, structural, etc -> bag-of-something.
fuzzy deduplication + hyperopt (it may require manual labeling or automatic calculating similarity score and selecting threshold).
connected components + community detection.
descriptive statistics of each community / labeling.

After that, it will be possible to query the communities and get their descriptions / labels.
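A minimal sketch of the fuzzy deduplication and connected components steps above, assuming each commit has already been reduced to a set of tokens. The function names, the toy commits and the 0.5 threshold are illustrative assumptions, not what apollo does. The all-pairs Jaccard comparison stands in for a scalable scheme such as MinHash, and community detection inside each component is left out.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def connected_components(bags, threshold=0.5):
    """Group commits whose token sets are at least `threshold` similar.

    `bags` maps a commit id to a set of tokens. Pairs at or above the
    threshold become graph edges; union-find yields the components.
    """
    parent = {cid: cid for cid in bags}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Quadratic in the number of commits: fine for a sketch only.
    for a, b in combinations(bags, 2):
        if jaccard(bags[a], bags[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for cid in bags:
        groups.setdefault(find(cid), []).append(cid)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: two near-duplicate fixes, two README edits.
bags = {
    "c1": {"fix", "null", "pointer", "check"},
    "c2": {"fix", "null", "pointer", "crash"},
    "c3": {"update", "readme", "docs"},
    "c4": {"update", "readme", "typo"},
}
print(connected_components(bags))
```

The threshold is the value that the hyperopt step would tune, either against manual labels or against an automatically computed similarity score.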

`topic modeling`

How to - standard topic modeling pipeline:

extract features: textual, structural, etc -> bag-of-something.
hierarchical or simple topic modeling using bigartm and vis

After that, it will be possible to query the topics of each commit.

One more point: the first step (feature extraction) could be reused across several different approaches.
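As a rough illustration of that shared first step, here is one possible "bag-of-something" built from a commit message and a unified diff. The `msg:` / `add:` / `del:` prefixes and the `bag_of_tokens` name are assumptions made for this sketch. Structural features (for example from UAST diffs) would need a separate extractor.

```python
import re
from collections import Counter

# Identifier-like tokens; a deliberately naive tokenizer.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def bag_of_tokens(message, diff):
    """Return a Counter of prefixed tokens for one commit."""
    bag = Counter()
    for tok in TOKEN.findall(message.lower()):
        bag["msg:" + tok] += 1
    for line in diff.splitlines():
        # Skip file headers such as '--- a/file.py' and '+++ b/file.py'.
        # Naive: a changed line whose content starts with '--' or '++'
        # is skipped as well.
        if line.startswith(("+++", "---")):
            continue
        if line.startswith("+"):
            prefix = "add:"
        elif line.startswith("-"):
            prefix = "del:"
        else:
            continue  # context lines and hunk headers carry no change
        for tok in TOKEN.findall(line[1:]):
            bag[prefix + tok] += 1
    return bag


# Hypothetical commit used only to exercise the function.
message = "Fix null check in parser"
diff = (
    "--- a/parser.py\n"
    "+++ b/parser.py\n"
    "@@ -1,2 +1,2 @@\n"
    "-if node:\n"
    "+if node is not None:\n"
    "     parse(node)\n"
)
bag = bag_of_tokens(message, diff)
print(sorted(bag))
```

The same bag can feed deduplication (through its key set) or a topic model (through its counts), which is the reuse described above.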

vmarkovtsev · 2019-02-06T18:50:08Z

@EgorBu Can you please transplant this comment (great job btw!) to #187
This is the issue for commit classification.

Jan21 · 2019-02-06T20:53:07Z

Thanks a lot!
@m09 I've read a few papers about commit classification, and what people did in the past was to classify commits into 5 categories:

Corrective (bug fixes, corrections of wrong implementation, etc.)
Adaptive (Adapting to external changes, but I'm not 100% sure about this category)
Perfective (Performance enhancements, Maintainability)
Feature addition
Non functional (Legal, code cleanup, documentation)

Here are links to the papers which are using this taxonomy (to be precise, the last three use only the first 3 categories):

@EgorBu I was also thinking that I will first try to do something unsupervised like topic modeling. From what I have read it seems that the feature engineering is most important here. For example, what people use instead of raw diffs is some kind of taxonomy of code changes (class added, statement deleted, etc.) which should be described in this paper: Classifying Change Types for Qualifying Change Couplings.
In one paper they also reported that even the info about the author has some predictive power.
I will now look at what kind of data would be easy to get for every commit.

m09 · 2019-02-07T08:54:51Z

@Jan21 interesting taxonomy. It maps quite well to the intuitive categories we came up with. I have to say I still like the last two though (Feature addition and Non functional), because if we try to fit a feature addition into the first 3 it should probably end up in Adaptive, which doesn't feel right, and if we try to fit documentation or CI into Perfective it's a bit better but not great either.

r0mainK · 2019-05-14T14:01:06Z

So I'll be taking over this issue at some point - I expect in a month or two. I read the literature over the past couple of days and thought about it. Regarding taxonomy, I think using the one these guys established might be a good idea, as it separates the different kinds of commits perhaps more accurately than academics do, and it also conveys more information. The types would thus be:

feature: a new feature
fix: a bug fix
refactor: changes that are neither a bug fix nor a new feature
performance: changes that improve performance
docs: documentation changes
style: changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
test: adding/removing/correcting tests
build: changes that affect the build system or external dependencies
ci: changes to CI configuration files and scripts
revert: reversion of previous commits
chore: anything else that doesn't modify the code or the tests

With this taxonomy, I think that we will mostly need to work on the first 4 types of commits, as the others can be inferred relatively easily using file names and types, or simple parsing in the case of the style commits. I am not sure about separating the refactor and performance commits - it makes sense but will probably be hard to do. I'm also not sure the revert type is really useful.
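To make the "file names and types" idea concrete, here is a small rule-based sketch that assigns docs, test, build or ci when every file touched by a commit agrees. The specific file names, suffixes and directory names are illustrative assumptions, not an agreed rule set, and a real analysis would need a much longer list.

```python
from pathlib import PurePosixPath

# Illustrative, incomplete rule tables (assumptions for this sketch).
DOC_SUFFIXES = {".md", ".rst", ".txt"}
CI_NAMES = {".travis.yml", "appveyor.yml", "Jenkinsfile", ".gitlab-ci.yml"}
BUILD_NAMES = {"Makefile", "setup.py", "go.mod", "package.json",
               "Dockerfile", "requirements.txt"}


def classify_path(path):
    """Map one file path to a commit type, or None if no rule matches."""
    p = PurePosixPath(path)
    dirs = set(p.parts[:-1])
    if p.name in CI_NAMES or ".circleci" in dirs:
        return "ci"
    # Checked before docs so that requirements.txt is not seen as prose.
    if p.name in BUILD_NAMES:
        return "build"
    if (dirs & {"test", "tests"} or p.name.startswith("test_")
            or p.stem.endswith("_test")):
        return "test"
    if p.suffix.lower() in DOC_SUFFIXES or "docs" in dirs:
        return "docs"
    return None


def classify_commit(paths):
    """Return a type only when all touched files agree, else None."""
    labels = {classify_path(p) for p in paths}
    if len(labels) == 1:
        return labels.pop()
    return None


print(classify_commit(["README.md", "docs/usage.rst"]))
print(classify_commit(["main.go", "README.md"]))
```

Commits that come back as `None` are the ones that would still need the learned model, that is, mostly the feature, fix, refactor and performance types.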

Thoughts @m09 ?

m09 · 2019-05-20T12:14:12Z

I also find your taxonomy easier to interpret for end users. If we want to go beyond unsupervised learning, we need to find a way to get data (preferably automatically), likely from GitHub issues as @EgorBu mentioned in a previous comment. It might prove hard/impossible to get data for this precise taxonomy though.

vmarkovtsev added the new-analysis label Feb 6, 2019

vmarkovtsev assigned Jan21 Feb 6, 2019

vmarkovtsev mentioned this issue Feb 6, 2019

Bugfix pattern mining #187

Open

5 tasks

r0mainK assigned r0mainK and unassigned Jan21 May 10, 2019

r0mainK removed their assignment Nov 11, 2019
