Add new threshold calculation for network-based classifications and other small fixes #209

hechtlC · 2021-07-23T13:06:39Z

Prerequisites

I adhere to the coding conventions (described here) in my code.
I have updated the copyright headers of the files I have modified.
I have written appropriate commit messages, i.e., I have recorded the goal, the need, the needed changes, and the location of my code modifications for each commit. This includes also, e.g., referencing to relevant issues.
I have put signed-off tags in all commits.
I have updated the changelog file NEWS.md appropriately.
The pull request is opened against the branch dev.

Description

With this pull request we add a new threshold calculation for classifications of authors using network-based centrality values. This relates to Issue #205. The new threshold calculation uses the CORE.THRESHOLD% quantile of the centrality values to better represent the core and peripheral groups of the corresponding project.

When fixing this issue, we noticed that in some other unrelated tests we are cloning ProjectData objects using the clone() method of R6 but without the parameter deep set to TRUE. So we only cloned the ProjectData object but not the ProjectConf object within. Fixing this led to the problem that a Conf object of any kind that gets a parameter changed once and then changed back to the default value, is not the same as a Conf object that never changed the default. This is unintended behaviour so it gets fixed with this pull request.

Moreover, when touching util-core-peripheral.R anyways, we decided to work on fixing the open issues from #70. This includes changes to the code corresponding to the coding conventions, documentation fixes, and the addition of new count-based classification types for the classification of developers.

Changelog

Added:

Add support for classifying developers on the basis of more count-based classification metrics, including mail-count, mail-thread count, issue-count, issue-comment count, issue-commented-in count, and issue-created count: d7b2455, 6f737c8

Changed/Improved:

Change the threshold calculation for the classification of developers to use a quantile approach when classifying on the basis of network centrality metrics: 5128252
Improve the documentation in util-core-peripheral.R by adding roxygen skeleton documentation to undocumented functions: a3d5ca7, 6f737c8
Change the $ notation to the bracket notation in util-core-peripehral.R: 6f737c8

Fixed:

Fix the data tests in test-data.R to use deep clones of ProjectData objects: d75373a
Fix the update.values() function in util-conf.R to delete the value field if the new value is equal to the default value as the comparison of two otherwise equal Conf objects fails without this: d75373a

When updating the data tests to use deep clones, the comparison of two Conf objects brakes. This happens since a conf parameter gets a new field 'value' when updated. When this parameter is then changed back to the default value, this value field stays present. But since we compare Conf objects with the comparison of their string representation, this now brakes. The fix is to delete the 'value' field when the parameter is set to default. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

When classifying authors into core and peripheral groups using network-based metrics, we now use the 80% quantile as threshold instead of the sum-based threshold. Fix tests related to this. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

bockthom

Looks good @hechtlC!
I only spotted some lines where we could improve code style or comprehension, see detailed comments below.

Btw, this PR is suited to also add further count-based classification metrics, such as "mail.count" or "issue.(comments/created/commented/...).count". But we can also postpone this one if you would like to 😉

tests/test-data.R

util-conf.R

util-core-peripheral.R

tests/test-core-peripheral.R

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

The count-based classification metrics `mail.count`, `issue.count`, `issue.comment.count`, `issue.commented.in.count`, and `issue.created.count` can now be used for the classification of authors into core and peripheral groups as according to se-sic#70 . Add further tiny fixes. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

Add roxygen skeleton comments to the uncommented methods in util-core-peripheral.R as according to se-sic#70. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

bockthom

Thanks for the additional changes. As we aim at closing issue #70 soon, I would propose the to add the following additional changes (for which I cannot add individual comments, as the corresponding lines are not changed in this PR so far... all of them address util-core-peripheral.R)

Please also add a classification metric "mail.thread.count". Just to not forget about it, as we have the metric function already. So, make sure that everywhere where "mail.count" is stated the additional option "mail.thread.count" is also added and also an additional get.author.class. ... function is necessary. And also a test would be good.
Please adjust the type parameter of function get.author.class.overview to contain the new types (and also its documentation).
Add missing roxygen2 documentation for function get.class.turnover.overview (line 377)
Add missing roxygen2 documentation for function get.unstable.authors.overview (line 455)
Coding conventions (I): Please replace 1:length by seq_along (lines 383, 461, 923, 951)
Coding conventions (II): Please replace the $ operator in lines 400-444, 474-481, 523-525, 933-947, 975, and 1005, 1038-1046, 1119-1129. (Please also try out the functions to make sure that they still don't break afterwards.)
Fix typo in log statement: "choens" ➡️ "cohens"
Many of the functions which had missing documentation use (inline) comments talking about "version range". But I am not sure about the wording - in other functions and modules of coronet we are using just range or revision range. Please check and replace "version range" so that we avoid having multiple names for the same concepts.

So also some other comments below.

bockthom · 2021-07-27T14:44:51Z

util-core-peripheral.R

+                         "issue.count" = "issue.count",
+                         "issue.comment.count" = "issue.comment.count",
+                         "issue.commented.in.count" = "issue.commented.in.count",
+                         "issue.created.count" = "issue.created.count")

    if (startsWith(type, "network")) {


Let's use the new constant CLASSIFICATION.METRIC.TO.TYPE here, instead of using the startsWith check.

The same holds for a similar check around line 333 in function get.author.class.overview. Please also check for further occurrences.

❓ In general, I am a little bit confused: The parameter "type" contains the metric name and not the the classification type. I would like to rename the type parameter of all the author.class.... functions to classification.metric. This would be less confusing, but would be a breaking change. Otherwise, CLASSIFICATION.METRIC.TO.TYPE has to be renamed. I would go for renaming the parameter, but let's discuss this first before changing...

bockthom · 2021-07-27T14:47:05Z

util-core-peripheral.R

 #'             [default: "network.degree"]
+#' @param issue.type which issue type to consider for count-based metrics using issues
+#'                   (see \code{preprocess.issue.data}). One of \code{"issues"},
+#'                   \code{"pull.requests"} or \code{"all"}.


Maybe let's add "This parameter is ignored for non-issue-related classification metrics."

bockthom · 2021-07-27T14:49:48Z

util-core-peripheral.R

-#'         their respective LOC counts
-get.author.loc.count = function(proj.data) {
-    logging::logdebug("get.author.loc.count: starting.")
+#' @param proj.data the \code{ProjectData} containing the authors' commit data


commit data ➡️ mail data

bockthom · 2021-07-27T14:50:52Z

util-core-peripheral.R

+#'
+#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}.
+#'
+#' @param proj.data the \code{ProjectData} containing the authors' commit data


commit data ➡️ issue data

bockthom · 2021-07-27T14:52:44Z

util-core-peripheral.R

+#'
+#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}.
+#'
+#' @param proj.data the \code{ProjectData} containing the authors' commit data


commit data ➡️ issue data

bockthom · 2021-07-27T14:53:04Z

util-core-peripheral.R

+#'
+#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}.
+#'
+#' @param proj.data the \code{ProjectData} containing the authors' commit data


commit data ➡️ issue data

bockthom · 2021-07-27T14:53:39Z

util-core-peripheral.R

+#'
+#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}.
+#'
+#' @param proj.data the \code{ProjectData} containing the authors' commit data


commit data ➡️ issue data

bockthom · 2021-07-27T15:05:00Z

util-core-peripheral.R

+#' @return the cohens kappa coefficient describing the agreement of the two classification
+#'         lists


I'd bet that the word "lists" still fits into the previous line... looks kind of weird...

bockthom · 2021-07-27T15:39:52Z

tests/test-core-peripheral.R

-    expected.core = data.frame(author.name = c("Olaf", "Björn"), vertex.degree = c(4, 2))
-    expected.peripheral = data.frame(author.name = c("Darth Sidious"), vertex.degree = c(NA))
+    expected.core = data.frame(author.name = c("Olaf"), vertex.degree = c(4))
+    expected.peripheral = data.frame(author.name = c("Björn", "Darth Sidious"), vertex.degree = c(2, NA))
    expected = list(core = expected.core, peripheral = expected.peripheral)

    row.names(result$core) = NULL


Shouldn't we use the [[ ]] operator instead of $ here? Holds also for all these copy-pasted lines below 😉

- Add `mail.thread.count` to the list of possible classification metrics and add a test for that - Rename `CLASSIFICATION.METRIC.TO.TYPE` to `CLASSIFICATION.TYPE.TO.CATEGORY` as this fits better. Also change the corresponding parameter names in the functions the category is passed to - Change `$` notation to bracket notation as according to the coding conventions - Fix documentation errors and add missing documentation All of these fixes and changes are made according to se-sic#70. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

With se-sic#209 we changed the threshold calculation for network-based classifications. But the use of the new threshold was still the old one. So now change the classification using the new threshold so that all authors with a centrality value greater than the threshold are considered core. This is documented in se-sic#205. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

hechtlC added 2 commits July 22, 2021 16:31

hechtlC requested a review from bockthom July 23, 2021 13:06

bockthom requested changes Jul 26, 2021

View reviewed changes

hechtlC added 4 commits July 26, 2021 16:08

Add changes from review by @bockthom

b3b2c91

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

Add tests for new count-based classification metrics

32c277c

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

Add documentation to undocumented methods

a3d5ca7

Add roxygen skeleton comments to the uncommented methods in util-core-peripheral.R as according to se-sic#70. Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

bockthom requested changes Jul 27, 2021

View reviewed changes

hechtlC force-pushed the master branch from 653faeb to 773357e Compare July 28, 2021 14:51

hechtlC force-pushed the master branch from 773357e to 6f737c8 Compare July 29, 2021 09:32

Add entries in the changelog for the pull request

1ec501e

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>

hechtlC force-pushed the master branch from fa7b04f to 1ec501e Compare July 30, 2021 15:14

bockthom mentioned this pull request Jul 30, 2021

Update core-peripheral module #70

Closed

10 tasks

bockthom approved these changes Jul 30, 2021

View reviewed changes

bockthom merged commit 99fdb01 into se-sic:dev Jul 30, 2021

bockthom mentioned this pull request Aug 26, 2021

Apply core threshold correctly using 'quantile' #205

Closed

hechtlC mentioned this pull request Aug 30, 2021

Fix bug in author classification #210

Merged

6 tasks

bockthom mentioned this pull request Sep 1, 2021

Version 4.0 #212

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add new threshold calculation for network-based classifications and other small fixes #209

Add new threshold calculation for network-based classifications and other small fixes #209

hechtlC commented Jul 23, 2021 •

edited

Loading

bockthom left a comment

bockthom left a comment

bockthom Jul 27, 2021 •

edited

Loading

bockthom Jul 27, 2021

bockthom Jul 27, 2021

bockthom Jul 27, 2021

bockthom Jul 27, 2021

bockthom Jul 27, 2021

bockthom Jul 27, 2021

bockthom Jul 27, 2021

bockthom Jul 27, 2021

		#' @return the cohens kappa coefficient describing the agreement of the two classification
		#' lists

Add new threshold calculation for network-based classifications and other small fixes #209

Add new threshold calculation for network-based classifications and other small fixes #209

Conversation

hechtlC commented Jul 23, 2021 • edited Loading

Prerequisites

Description

Changelog

bockthom left a comment

Choose a reason for hiding this comment

bockthom left a comment

Choose a reason for hiding this comment

bockthom Jul 27, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hechtlC commented Jul 23, 2021 •

edited

Loading

bockthom Jul 27, 2021 •

edited

Loading