-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new threshold calculation for network-based classifications and other small fixes #209
Conversation
When updating the data tests to use deep clones, the comparison of two Conf objects brakes. This happens since a conf parameter gets a new field 'value' when updated. When this parameter is then changed back to the default value, this value field stays present. But since we compare Conf objects with the comparison of their string representation, this now brakes. The fix is to delete the 'value' field when the parameter is set to default. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
When classifying authors into core and peripheral groups using network-based metrics, we now use the 80% quantile as threshold instead of the sum-based threshold. Fix tests related to this. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @hechtlC!
I only spotted some lines where we could improve code style or comprehension, see detailed comments below.
Btw, this PR is suited to also add further count-based classification metrics, such as "mail.count" or "issue.(comments/created/commented/...).count". But we can also postpone this one if you would like to 😉
Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
The count-based classification metrics `mail.count`, `issue.count`, `issue.comment.count`, `issue.commented.in.count`, and `issue.created.count` can now be used for the classification of authors into core and peripheral groups as according to se-sic#70 . Add further tiny fixes. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
Add roxygen skeleton comments to the uncommented methods in util-core-peripheral.R as according to se-sic#70. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the additional changes. As we aim at closing issue #70 soon, I would propose the to add the following additional changes (for which I cannot add individual comments, as the corresponding lines are not changed in this PR so far... all of them address util-core-peripheral.R
)
- Please also add a classification metric "mail.thread.count". Just to not forget about it, as we have the metric function already. So, make sure that everywhere where "mail.count" is stated the additional option "mail.thread.count" is also added and also an additional
get.author.class. ...
function is necessary. And also a test would be good. - Please adjust the
type
parameter of functionget.author.class.overview
to contain the new types (and also its documentation). - Add missing roxygen2 documentation for function
get.class.turnover.overview
(line 377) - Add missing roxygen2 documentation for function
get.unstable.authors.overview
(line 455) - Coding conventions (I): Please replace
1:length
byseq_along
(lines 383, 461, 923, 951) - Coding conventions (II): Please replace the
$
operator in lines 400-444, 474-481, 523-525, 933-947, 975, and 1005, 1038-1046, 1119-1129. (Please also try out the functions to make sure that they still don't break afterwards.) - Fix typo in log statement: "choens" ➡️ "cohens"
- Many of the functions which had missing documentation use (inline) comments talking about "version range". But I am not sure about the wording - in other functions and modules of
coronet
we are using justrange
orrevision range
. Please check and replace "version range" so that we avoid having multiple names for the same concepts.
So also some other comments below.
util-core-peripheral.R
Outdated
"issue.count" = "issue.count", | ||
"issue.comment.count" = "issue.comment.count", | ||
"issue.commented.in.count" = "issue.commented.in.count", | ||
"issue.created.count" = "issue.created.count") | ||
|
||
if (startsWith(type, "network")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use the new constant CLASSIFICATION.METRIC.TO.TYPE
here, instead of using the startsWith
check.
The same holds for a similar check around line 333 in function get.author.class.overview
. Please also check for further occurrences.
❓ In general, I am a little bit confused: The parameter "type" contains the metric name and not the the classification type. I would like to rename the type
parameter of all the author.class....
functions to classification.metric
. This would be less confusing, but would be a breaking change. Otherwise, CLASSIFICATION.METRIC.TO.TYPE
has to be renamed. I would go for renaming the parameter, but let's discuss this first before changing...
util-core-peripheral.R
Outdated
#' [default: "network.degree"] | ||
#' @param issue.type which issue type to consider for count-based metrics using issues | ||
#' (see \code{preprocess.issue.data}). One of \code{"issues"}, | ||
#' \code{"pull.requests"} or \code{"all"}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe let's add "This parameter is ignored for non-issue-related classification metrics."
util-core-peripheral.R
Outdated
#' their respective LOC counts | ||
get.author.loc.count = function(proj.data) { | ||
logging::logdebug("get.author.loc.count: starting.") | ||
#' @param proj.data the \code{ProjectData} containing the authors' commit data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit data ➡️ mail data
util-core-peripheral.R
Outdated
#' | ||
#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}. | ||
#' | ||
#' @param proj.data the \code{ProjectData} containing the authors' commit data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit data ➡️ issue data
util-core-peripheral.R
Outdated
#' | ||
#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}. | ||
#' | ||
#' @param proj.data the \code{ProjectData} containing the authors' commit data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit data ➡️ issue data
util-core-peripheral.R
Outdated
#' | ||
#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}. | ||
#' | ||
#' @param proj.data the \code{ProjectData} containing the authors' commit data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit data ➡️ issue data
util-core-peripheral.R
Outdated
#' | ||
#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}. | ||
#' | ||
#' @param proj.data the \code{ProjectData} containing the authors' commit data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit data ➡️ issue data
util-core-peripheral.R
Outdated
#' @return the cohens kappa coefficient describing the agreement of the two classification | ||
#' lists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd bet that the word "lists" still fits into the previous line... looks kind of weird...
expected.core = data.frame(author.name = c("Olaf", "Björn"), vertex.degree = c(4, 2)) | ||
expected.peripheral = data.frame(author.name = c("Darth Sidious"), vertex.degree = c(NA)) | ||
expected.core = data.frame(author.name = c("Olaf"), vertex.degree = c(4)) | ||
expected.peripheral = data.frame(author.name = c("Björn", "Darth Sidious"), vertex.degree = c(2, NA)) | ||
expected = list(core = expected.core, peripheral = expected.peripheral) | ||
|
||
row.names(result$core) = NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we use the [[ ]] operator instead of $ here? Holds also for all these copy-pasted lines below 😉
- Add `mail.thread.count` to the list of possible classification metrics and add a test for that - Rename `CLASSIFICATION.METRIC.TO.TYPE` to `CLASSIFICATION.TYPE.TO.CATEGORY` as this fits better. Also change the corresponding parameter names in the functions the category is passed to - Change `$` notation to bracket notation as according to the coding conventions - Fix documentation errors and add missing documentation All of these fixes and changes are made according to se-sic#70. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
With se-sic#209 we changed the threshold calculation for network-based classifications. But the use of the new threshold was still the old one. So now change the classification using the new threshold so that all authors with a centrality value greater than the threshold are considered core. This is documented in se-sic#205. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
With se-sic#209 we changed the threshold calculation for network-based classifications. But the use of the new threshold was still the old one. So now change the classification using the new threshold so that all authors with a centrality value greater than the threshold are considered core. This is documented in se-sic#205. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
With se-sic#209 we changed the threshold calculation for network-based classifications. But the use of the new threshold was still the old one. So now change the classification using the new threshold so that all authors with a centrality value greater than the threshold are considered core. This is documented in se-sic#205. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
With se-sic#209 we changed the threshold calculation for network-based classifications. But the use of the new threshold was still the old one. So now change the classification using the new threshold so that all authors with a centrality value greater than the threshold are considered core. This is documented in se-sic#205. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
Prerequisites
dev
.Description
With this pull request we add a new threshold calculation for classifications of authors using network-based centrality values. This relates to Issue #205. The new threshold calculation uses the
CORE.THRESHOLD
% quantile of the centrality values to better represent the core and peripheral groups of the corresponding project.When fixing this issue, we noticed that in some other unrelated tests we are cloning
ProjectData
objects using theclone()
method of R6 but without the parameterdeep
set toTRUE
. So we only cloned theProjectData
object but not theProjectConf
object within. Fixing this led to the problem that aConf
object of any kind that gets a parameter changed once and then changed back to the default value, is not the same as aConf
object that never changed the default. This is unintended behaviour so it gets fixed with this pull request.Moreover, when touching
util-core-peripheral.R
anyways, we decided to work on fixing the open issues from #70. This includes changes to the code corresponding to the coding conventions, documentation fixes, and the addition of new count-based classification types for the classification of developers.Changelog
Added:
Changed/Improved:
util-core-peripheral.R
by adding roxygen skeleton documentation to undocumented functions: a3d5ca7, 6f737c8$
notation to the bracket notation inutil-core-peripehral.R
: 6f737c8Fixed:
test-data.R
to use deep clones ofProjectData
objects: d75373aupdate.values()
function inutil-conf.R
to delete thevalue
field if the new value is equal to the default value as the comparison of two otherwise equalConf
objects fails without this: d75373a