Commit classification #188
Comments
For the potential classes, the ones I have in mind would be:
Then an interesting point to discuss is: do we want a single class per commit, one class per line/block, or multiple classes per commit (without line/block tagging)?
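To make the three options concrete, here is a minimal sketch of what each labeling shape could look like as a data structure. All class and field names are hypothetical and only illustrate the trade-off; the labels `bugfix` and `refactor` are borrowed from elsewhere in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class SingleLabel:
    """Option 1: exactly one class per commit."""
    sha: str
    label: str


@dataclass
class MultiLabel:
    """Option 3: several classes per commit, no line/block tagging."""
    sha: str
    labels: set = field(default_factory=set)


@dataclass
class BlockLabels:
    """Option 2: one class per changed block of a commit."""
    sha: str
    # Maps (path, start_line, end_line) -> label (hypothetical layout).
    blocks: dict = field(default_factory=dict)

    def to_multi(self) -> MultiLabel:
        # Dropping the block positions collapses option 2 into option 3.
        return MultiLabel(self.sha, set(self.blocks.values()))


tagged = BlockLabels("abc123", {("a.py", 1, 5): "bugfix", ("b.py", 10, 12): "refactor"})
print(tagged.to_multi().labels)
```

Per-block labels are the most informative shape, since they can always be collapsed into the multi-label form, but they are also the most expensive to annotate.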
About classification - it's possible to collect some labels from PRs in huge repositories like
Pattern mining
I like the idea of
Thanks a lot!
Here are links to the papers which are using this taxonomy (to be precise, the last three use only the first 3 categories):
@EgorBu I was also thinking that I would first try something unsupervised, like topic modeling. From what I have read, feature engineering seems to be the most important part here. For example, instead of raw diffs, people use some kind of taxonomy of code changes (class added, statement deleted, etc.), which should be described in this paper: Classifying Change Types for Qualifying Change Couplings.
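As a rough illustration of that feature-engineering idea, a commit could be represented as a bag of change types rather than as a raw diff. This is only a sketch: the vocabulary below is made up for the example and is not the taxonomy from the paper, and the change events are assumed to come from some upstream differ that is not shown.

```python
from collections import Counter

# Hypothetical change-type vocabulary; the real taxonomy is defined in the paper.
CHANGE_TYPES = ["class_added", "class_deleted", "statement_added", "statement_deleted"]


def commit_features(change_events):
    """Turn a list of change-type strings into a fixed-length count vector.

    Events outside CHANGE_TYPES are ignored.
    """
    counts = Counter(change_events)
    return [counts.get(change_type, 0) for change_type in CHANGE_TYPES]


# Toy input: commit id -> change events extracted from its diff (assumed given).
commits = {
    "c1": ["class_added", "statement_added", "statement_added"],
    "c2": ["statement_deleted"],
}
matrix = {sha: commit_features(events) for sha, events in commits.items()}
print(matrix)
```

A count matrix like this is the kind of input a topic model or a clustering algorithm would consume, with commits playing the role of documents and change types the role of words.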
@Jan21 interesting taxonomy. It maps quite well onto the intuitive categories we came up with. I have to say I still like the last two though (
So I'll be taking over this issue at some point, I expect in a month or two. I read the literature over the past couple of days and thought about it. Regarding taxonomy, I think using the one these guys established might be a good idea, as it separates the different kinds of commits perhaps more accurately than academics do, and also conveys more information. The types would thus be:
With this taxonomy, I think we will mostly need to work on the first 4 types of commits, as the others can be inferred relatively easily from file names and types, or from simple parsing in the case of the style commits. I am not sure about separating the refactor and performance commits: it makes sense but will probably be hard to do. I am also not sure the revert types are really useful. Thoughts @m09?
I also find your taxonomy easier to interpret for end users. If we want to go beyond unsupervised learning, we need to find a way to get data (preferably automatically), likely from GitHub issues as @EgorBu mentioned in a previous comment. It might prove hard/impossible to get data for this precise taxonomy though.
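To illustrate the "inferred from file names and types" part, here is a minimal rule-based sketch. The labels `docs` and `test`, the extension list, and the directory names are assumptions chosen for the example, not the taxonomy's actual definitions. The `revert` rule relies on the fact that `git revert` generates a message starting with `Revert` by default.

```python
import os

# Hypothetical rules: extensions and directory names are illustrative only.
DOC_EXTENSIONS = {".md", ".rst", ".txt"}
TEST_DIRS = {"test", "tests"}


def label_for_path(path):
    """Return a coarse label for one touched file, or None if no rule applies."""
    parts = path.lower().split("/")
    name = parts[-1]
    ext = os.path.splitext(name)[1]
    if ext in DOC_EXTENSIONS or "docs" in parts:
        return "docs"
    if TEST_DIRS & set(parts) or name.startswith("test_"):
        return "test"
    return None


def guess_class(paths, message=""):
    """Label a commit only when every touched file agrees, else return None."""
    if message.startswith("Revert"):
        return "revert"
    labels = {label_for_path(p) for p in paths}
    if len(labels) == 1:
        return labels.pop()  # may itself be None when no rule matched
    return None


print(guess_class(["docs/index.rst", "README.md"]))
print(guess_class(["src/model.py", "README.md"]))
```

Commits for which `guess_class` returns `None` are the ones that would still need a learned model, which matches the idea of focusing the real work on the first few types.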
Add Hercules analysis to classify commits. There must be a "bugfix" class. Other classes are a subject of discussion.
Available features:
Relates to https://github.com/src-d/poc-pluralsight/issues/6#issuecomment-457708560
TODO for @Jan21: add here the list of relevant papers