Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to specify split.basis by stating multiple data sources and start splitting according to the earliest data source #254

Open
3 tasks
bockthom opened this issue Mar 6, 2024 · 4 comments

Comments

@bockthom
Copy link
Collaborator

bockthom commented Mar 6, 2024

Description

The splitting function split.data.time.based (and possibly also similar splitting functions) takes a parameter split.basis which defines the data source that is used to calculate the time ranges. Currently, it is only allowed to set one specific data source as a split basis. However, it might also be the case that one wants to split the data based on the earliest of two or more data sources, which is currently not possible.

Motivation

This way, we could automatically determine the appropriate (i.e., earliest data source) split basis from a set of data sources.

Ideas for the Implementation

To enable this, there are three things to change:

  • Change the argument validation of the split.basis parameter, such that it allows multiple values simultaneously but keeps the current default value (which is "commits") if no value is given. Therefore, we need to differentiate between the case that one explicitly sets all data source (e.g., c("commits", "mails", "issues") and the case that nothing is specified and just the choices are retrieved from the method signature. This can be done using the hasArg(...) function as the very first statement inside the splitting function, in order to identify whether the caller has set the parameter or not, as in the following snippet:

    if(!hasArg(split.basis)) {
        split.basis = match.arg.or.default(split.basis, several.ok = FALSE, default = "commits")
    } else {
        split.basis = match.arg.or.default(split.basis, several.ok = TRUE)
    }
  • If multiple data sources are set, identify the earliest data source, that is, the data source whose earliest date is earlier.

    if (length(split.basis) > 1) {
        # implement lookup of earliest date
        earliest.split.basis = ...
    }

    Here it might be helpful to define a helper function get.earliest.data.source(data.sources = c("commits", "mails", "issues")) in the ProjectData class.

  • Use earliest.split.basis as basis for actual splitting and also take care of missing data at the end (i.e., increase last range to include later data from other data sources or add missing ranges at the end).

@MaLoefUDS
Copy link
Contributor

I have finally came to have a look at this issue and I have gathered some ideas I want to talk about.

First of all, it is crucial that we get on the same definition of what it should mean to split by multiple basis. Splitting boils down to first obtaining all elements of the data source given as the splitting basis, then grouping these elements into bins either based by time or activity (or by starting with predefined bins in the first place) and then grouping all element (including elements from different data sources to our splitting basis) into the just constructed bins. In your description you say:

However, it might also be the case that one wants to split the data based on the earliest of two or more data sources, which is currently not possible.

This to me sounds like given data sources of c("commits", "mails") you want to obtain only the elements of the data source that starts earlier, lets assume that is mails, to build bins upon and disregard the elements of commits for bin construction. Under this assumption it is correct that we need to first identify the one data source that contains the earliest element (which is not hard, I have implemented it just in case).

However, I do believe, and im not sure if that is just what you meant initially, that it would make more sense to consider the elements of mails and commits when constructing bins, which would then automatically include the earliest elements of the union of both and further also circumvents the problem of creating more bins for excluded tail elements like you suggest to do here:

... also take care of missing data at the end (i.e., increase last range to include later data from other data sources or add missing ranges at the end).

Further, this approach would mean that identifying the earliest element of both data sources would not be necessary anymore. In addition to adapting the validation as you described, the only other change would be when the bin construction is called in split.data.by.time.or.bins to include elements of all specified data sources:

...
if (is.null(bins)) {                       |||||||||||
    ## get bins based on split.basis       vvvvvvvvvvv                    
    bins = split.get.bins.time.based(data[[split.basis]][["date"]], splitting.length, number.windows)$bins
    bins.labels = head(bins, -1)
    ## logging
    logging::loginfo("Splitting data '%s' into time ranges of %s based on '%s' data.",
                     project.data$get.class.name(), splitting.length, split.basis)
}
...

Let me know if I misinterpreted something and your opinion on the matter!

@bockthom
Copy link
Collaborator Author

First of all, it is crucial that we get on the same definition of what it should mean to split by multiple basis.

Agreed 😄

Splitting boils down to first obtaining all elements of the data source given as the splitting basis, then grouping these elements into bins either based by time or activity (or by starting with predefined bins in the first place) and then grouping all element (including elements from different data sources to our splitting basis) into the just constructed bins.

Right. This is exactly what I have meant.

In your description you say:

However, it might also be the case that one wants to split the data based on the earliest of two or more data sources, which is currently not possible.

This to me sounds like given data sources of c("commits", "mails") you want to obtain only the elements of the data source that starts earlier, lets assume that is mails, to build bins upon and disregard the elements of commits for bin construction. Under this assumption it is correct that we need to first identify the one data source that contains the earliest element (which is not hard, I have implemented it just in case).

However, I do believe, and im not sure if that is just what you meant initially, that it would make more sense to consider the elements of mails and commits when constructing bins, which would then automatically include the earliest elements of the union of both and further also circumvents the problem of creating more bins for excluded tail elements [...]

I completely agree with you. This would be exactly what we need. However, I am not sure if this is easy to implement given the complex splitting functions we already have in coronet. That's why I have suggested to determine the earliest data source, split according to the earliest data source, and add the excluded tail elements of the other data source(s) as additional bins. If you find an easy way to implement splitting such that we can directly pass multiple data sources and all activities (commits/mails/issue events/etc.) are considered at once to determine the splitting bins, then of course I would prefer such a solution. But I am not sure how many functions need to be immensely changed in order to achieve such a solution. To the best of my knowledge, the only data-related function that considers multiple data sources at once is the function get.data.timestamps which accesses the data.timestamps field. So, I am not sure about the actual implementation effort of your suggestion. But I'd be glad if it would be easily implementable 😄

In addition to adapting the validation as you described, the only other change would be when the bin construction is called in split.data.by.time.or.bins to include elements of all specified data sources:

...
if (is.null(bins)) {                       |||||||||||
    ## get bins based on split.basis       vvvvvvvvvvv                    
    bins = split.get.bins.time.based(data[[split.basis]][["date"]], splitting.length, number.windows)$bins
    bins.labels = head(bins, -1)
    ## logging
    logging::loginfo("Splitting data '%s' into time ranges of %s based on '%s' data.",
                     project.data$get.class.name(), splitting.length, split.basis)
}
...

That's the tricky part. How to actually achieve considering multiple data sources here? I am not sure if this is the only code location where this needs to be adjusted or whether there need to be additional adjustments at similar code.

Let me know if I misinterpreted something and your opinion on the matter!

I completely agree with your interpretation, but I am not sure if it is as easy to implement as you have sketched here. Let's try it...

@MaLoefUDS
Copy link
Contributor

I implemented a prototype and a first test for it. In the test we have mail data with dates:

Browse[1]> mail.data[["date"]]
[1] "2016-07-12 15:58:40 UTC" "2016-07-12 15:58:50 UTC" "2016-07-12 16:04:40 UTC" "2016-07-12 16:05:37 UTC"

... issue data with dates:

Browse[1]> issue.data[["date"]]
[1] "2016-07-12 15:59:25 UTC" "2016-07-12 15:59:59 UTC" "2016-07-12 16:01:01 UTC" "2016-07-12 16:02:02 UTC" "2016-07-12 16:02:30 UTC" "2016-07-12 16:03:59 UTC"
[7] "2016-07-12 16:06:01 UTC"

... and commit data with dates:

Browse[1]> commit.data[["date"]]
[1] "2016-07-12 15:58:59 UTC" "2016-07-12 16:00:45 UTC" "2016-07-12 16:06:32 UTC"

I then split based on mails and issues and obtain splits for the following regions:

attr(,"bins")
[1] "2016-07-12 15:58:40 UTC" "2016-07-12 15:59:40 UTC" "2016-07-12 16:00:40 UTC" "2016-07-12 16:01:40 UTC" "2016-07-12 16:02:40 UTC" "2016-07-12 16:03:40 UTC"
[7] "2016-07-12 16:04:40 UTC" "2016-07-12 16:05:40 UTC" "2016-07-12 16:06:02 UTC"

We observe that the start of the first bin corresponds with the earliest mail element (which occurs before all commit and issue elements) and the last bin corresponds with the last issue element (which occurs after the last mail element but before the last commit element). Therefore, we see that this method builds bins upon the union of mail and issue elements. 😄

@bockthom
Copy link
Collaborator Author

From my quick look, the example looks good! 😄
When writing more advanced tests later on, please also don't forget about using sliding windows 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants