Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new threshold calculation for network-based classifications and other small fixes #209

Merged
merged 8 commits into from
Jul 30, 2021

Conversation

hechtlC
Copy link
Contributor

@hechtlC hechtlC commented Jul 23, 2021

Prerequisites

  • I adhere to the coding conventions (described here) in my code.
  • I have updated the copyright headers of the files I have modified.
  • I have written appropriate commit messages, i.e., I have recorded the goal, the need, the needed changes, and the location of my code modifications for each commit. This includes also, e.g., referencing to relevant issues.
  • I have put signed-off tags in all commits.
  • I have updated the changelog file NEWS.md appropriately.
  • The pull request is opened against the branch dev.

Description

With this pull request we add a new threshold calculation for classifications of authors using network-based centrality values. This relates to Issue #205. The new threshold calculation uses the CORE.THRESHOLD% quantile of the centrality values to better represent the core and peripheral groups of the corresponding project.

When fixing this issue, we noticed that in some other unrelated tests we are cloning ProjectData objects using the clone() method of R6 but without the parameter deep set to TRUE. So we only cloned the ProjectData object but not the ProjectConf object within. Fixing this led to the problem that a Conf object of any kind that gets a parameter changed once and then changed back to the default value, is not the same as a Conf object that never changed the default. This is unintended behaviour so it gets fixed with this pull request.

Moreover, when touching util-core-peripheral.R anyways, we decided to work on fixing the open issues from #70. This includes changes to the code corresponding to the coding conventions, documentation fixes, and the addition of new count-based classification types for the classification of developers.

Changelog

Added:

  • Add support for classifying developers on the basis of more count-based classification metrics, including mail-count, mail-thread count, issue-count, issue-comment count, issue-commented-in count, and issue-created count: d7b2455, 6f737c8

Changed/Improved:

  • Change the threshold calculation for the classification of developers to use a quantile approach when classifying on the basis of network centrality metrics: 5128252
  • Improve the documentation in util-core-peripheral.R by adding roxygen skeleton documentation to undocumented functions: a3d5ca7, 6f737c8
  • Change the $ notation to the bracket notation in util-core-peripehral.R: 6f737c8

Fixed:

  • Fix the data tests in test-data.R to use deep clones of ProjectData objects: d75373a
  • Fix the update.values() function in util-conf.R to delete the value field if the new value is equal to the default value as the comparison of two otherwise equal Conf objects fails without this: d75373a

When updating the data tests to use deep clones, the comparison of two
Conf objects brakes. This happens since a conf parameter gets a new
field 'value' when updated. When this parameter is then changed back to
the default value, this value field stays present. But since we compare
Conf objects with the comparison of their string representation, this
now brakes. The fix is to delete the 'value' field when the parameter is
set to default.

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
When classifying authors into core and peripheral groups using
network-based metrics, we now use the 80% quantile as threshold
instead of the sum-based threshold.
Fix tests related to this.

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
@hechtlC hechtlC requested a review from bockthom July 23, 2021 13:06
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @hechtlC!
I only spotted some lines where we could improve code style or comprehension, see detailed comments below.

Btw, this PR is suited to also add further count-based classification metrics, such as "mail.count" or "issue.(comments/created/commented/...).count". But we can also postpone this one if you would like to 😉

tests/test-data.R Outdated Show resolved Hide resolved
tests/test-data.R Outdated Show resolved Hide resolved
util-conf.R Outdated Show resolved Hide resolved
util-core-peripheral.R Outdated Show resolved Hide resolved
util-core-peripheral.R Outdated Show resolved Hide resolved
util-core-peripheral.R Outdated Show resolved Hide resolved
util-core-peripheral.R Outdated Show resolved Hide resolved
util-core-peripheral.R Outdated Show resolved Hide resolved
tests/test-core-peripheral.R Outdated Show resolved Hide resolved
Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
The count-based classification metrics `mail.count`, `issue.count`,
`issue.comment.count`, `issue.commented.in.count`, and
`issue.created.count` can now be used for the classification of
authors into core and peripheral groups as according to se-sic#70 .
Add further tiny fixes.

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
Add roxygen skeleton comments to the uncommented methods in
util-core-peripheral.R as according to se-sic#70.

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional changes. As we aim at closing issue #70 soon, I would propose the to add the following additional changes (for which I cannot add individual comments, as the corresponding lines are not changed in this PR so far... all of them address util-core-peripheral.R)

  • Please also add a classification metric "mail.thread.count". Just to not forget about it, as we have the metric function already. So, make sure that everywhere where "mail.count" is stated the additional option "mail.thread.count" is also added and also an additional get.author.class. ... function is necessary. And also a test would be good.
  • Please adjust the type parameter of function get.author.class.overview to contain the new types (and also its documentation).
  • Add missing roxygen2 documentation for function get.class.turnover.overview (line 377)
  • Add missing roxygen2 documentation for function get.unstable.authors.overview (line 455)
  • Coding conventions (I): Please replace 1:length by seq_along (lines 383, 461, 923, 951)
  • Coding conventions (II): Please replace the $ operator in lines 400-444, 474-481, 523-525, 933-947, 975, and 1005, 1038-1046, 1119-1129. (Please also try out the functions to make sure that they still don't break afterwards.)
  • Fix typo in log statement: "choens" ➡️ "cohens"
  • Many of the functions which had missing documentation use (inline) comments talking about "version range". But I am not sure about the wording - in other functions and modules of coronet we are using just range or revision range. Please check and replace "version range" so that we avoid having multiple names for the same concepts.

So also some other comments below.

"issue.count" = "issue.count",
"issue.comment.count" = "issue.comment.count",
"issue.commented.in.count" = "issue.commented.in.count",
"issue.created.count" = "issue.created.count")

if (startsWith(type, "network")) {
Copy link
Collaborator

@bockthom bockthom Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the new constant CLASSIFICATION.METRIC.TO.TYPE here, instead of using the startsWith check.

The same holds for a similar check around line 333 in function get.author.class.overview. Please also check for further occurrences.

❓ In general, I am a little bit confused: The parameter "type" contains the metric name and not the the classification type. I would like to rename the type parameter of all the author.class.... functions to classification.metric. This would be less confusing, but would be a breaking change. Otherwise, CLASSIFICATION.METRIC.TO.TYPE has to be renamed. I would go for renaming the parameter, but let's discuss this first before changing...

#' [default: "network.degree"]
#' @param issue.type which issue type to consider for count-based metrics using issues
#' (see \code{preprocess.issue.data}). One of \code{"issues"},
#' \code{"pull.requests"} or \code{"all"}.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe let's add "This parameter is ignored for non-issue-related classification metrics."

#' their respective LOC counts
get.author.loc.count = function(proj.data) {
logging::logdebug("get.author.loc.count: starting.")
#' @param proj.data the \code{ProjectData} containing the authors' commit data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit data ➡️ mail data

#'
#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}.
#'
#' @param proj.data the \code{ProjectData} containing the authors' commit data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit data ➡️ issue data

#'
#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}.
#'
#' @param proj.data the \code{ProjectData} containing the authors' commit data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit data ➡️ issue data

#'
#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}.
#'
#' @param proj.data the \code{ProjectData} containing the authors' commit data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit data ➡️ issue data

#'
#' The details of the classification algorithm are explained in the documentation of \code{get.author.class.by.type}.
#'
#' @param proj.data the \code{ProjectData} containing the authors' commit data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit data ➡️ issue data

Comment on lines 1104 to 1105
#' @return the cohens kappa coefficient describing the agreement of the two classification
#' lists
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd bet that the word "lists" still fits into the previous line... looks kind of weird...

expected.core = data.frame(author.name = c("Olaf", "Björn"), vertex.degree = c(4, 2))
expected.peripheral = data.frame(author.name = c("Darth Sidious"), vertex.degree = c(NA))
expected.core = data.frame(author.name = c("Olaf"), vertex.degree = c(4))
expected.peripheral = data.frame(author.name = c("Björn", "Darth Sidious"), vertex.degree = c(2, NA))
expected = list(core = expected.core, peripheral = expected.peripheral)

row.names(result$core) = NULL
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we use the [[ ]] operator instead of $ here? Holds also for all these copy-pasted lines below 😉

- Add `mail.thread.count` to the list of possible classification metrics
and add a test for that
- Rename `CLASSIFICATION.METRIC.TO.TYPE` to
`CLASSIFICATION.TYPE.TO.CATEGORY` as this fits better. Also change the
corresponding parameter names in the functions the category is passed to
- Change `$` notation to bracket notation as according to the coding
conventions
- Fix documentation errors and add missing documentation

All of these fixes and changes are made according to se-sic#70.

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
@bockthom bockthom merged commit 99fdb01 into se-sic:dev Jul 30, 2021
hechtlC added a commit to hechtlC/coronet that referenced this pull request Aug 30, 2021
With se-sic#209 we changed the threshold calculation for network-based
classifications. But the use of the new threshold was still the old one.
So now change the classification using the new threshold so that all
authors with a centrality value greater than the threshold are
considered core. This is documented in se-sic#205.

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
@hechtlC hechtlC mentioned this pull request Aug 30, 2021
6 tasks
hechtlC added a commit to hechtlC/coronet that referenced this pull request Aug 31, 2021
With se-sic#209 we changed the threshold calculation for network-based
classifications. But the use of the new threshold was still the old one.
So now change the classification using the new threshold so that all
authors with a centrality value greater than the threshold are
considered core. This is documented in se-sic#205.

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
hechtlC added a commit to hechtlC/coronet that referenced this pull request Aug 31, 2021
With se-sic#209 we changed the threshold calculation for network-based
classifications. But the use of the new threshold was still the old one.
So now change the classification using the new threshold so that all
authors with a centrality value greater than the threshold are
considered core. This is documented in se-sic#205.

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
hechtlC added a commit to hechtlC/coronet that referenced this pull request Aug 31, 2021
With se-sic#209 we changed the threshold calculation for network-based
classifications. But the use of the new threshold was still the old one.
So now change the classification using the new threshold so that all
authors with a centrality value greater than the threshold are
considered core. This is documented in se-sic#205.

Signed-off-by: Christian Hechtl <hechtl@cs.uni-saarland.de>
@bockthom bockthom mentioned this pull request Sep 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants