How to handle incomplete ranges resp. missing data? #38

Closed
bockthom opened this issue Jun 23, 2017 · 4 comments

Comments

@bockthom
Collaborator

Since we extract commit data and e-mail data (and later also issue data) from different sources, the time ranges for available data are different.
For example, there may be a large gap between the first extracted commit and the first extracted e-mail (and analogously between the last commit and the last e-mail). This especially affects multi-networks, as they can be constructed on commit and e-mail data simultaneously.
This may cause problems for some of the analyses, even for analyses that do not use multi-networks themselves but need to stay comparable to analyses that do. As different analyses use different kinds of networks and have different prerequisites, this question is not easily answered.

In the following, I will mention some different use cases and possible solutions:

  1. Global networks:
  • To build project-level networks, we could restrict the network generation to complete data, i.e., only consider the time range for which all data sources are available (globally cut incomplete time periods at the beginning and end of the time series). However, some of the analyses may not care whether data is incomplete at the beginning or end, so we should make global cutting configurable.
  2. Range-level networks:
    How to deal with incomplete ranges?
  • Remove ranges for which not all needed data are available?
  • Remove ranges for which all data sources are partly available, but not for the whole range?
  • Only cut incomplete parts of a range? (This would also cut time periods without activity...)
  • Define a threshold for identifying incomplete ranges?
  • Globally cut incomplete time periods (analogously to 1.) before splitting?
  3. Comparability of different analyses:
  • How to get the same analyzed global time period for global networks and range-level networks?
  • How to deal with artifact/author/bipartite networks? Also skip the time ranges with incomplete data even if the missing data is not used? Make that configurable? Too many (conflicting) configuration options may confuse the users...

There are many options, but it is difficult to make all the analyses and networks compatible with each other.

Any ideas on that?

@clhunsen
Collaborator

  1. Global networks:
  • To build project-level networks, we could restrict the network generation to consider complete data, i.e., only consider the time range for which all data sources are available (globally cut incomplete time periods at beginning and end of the time series). However, some of the analyses may not care if there is incomplete data in the beginning or end, so we should make globally cutting configurable.

This is easy to implement and makes sense to have. However, there are two ways to do it:

  1. Get the mail/commit/... data and take the latest start date across all data sources as the start of the project data (and, analogously, the earliest end date as its end), or
  2. define a basis for cutting (as for the splitting), get the start/end date from this, and just remove all earlier/later data from the other data sources.

Do we want to support both ways?
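
A minimal sketch of the two ways, written in Python purely for illustration. The function names, the dictionary layout, and the sample dates are assumptions and not part of the project; each data source is reduced to a plain list of timestamps.

```python
from datetime import datetime

# Hypothetical sample data: data-source name -> timestamps of its items.
sample = {
    "mails": [datetime(2017, 6, 23), datetime(2017, 7, 27)],
    "commits": [datetime(2017, 5, 22), datetime(2017, 7, 1), datetime(2017, 8, 12)],
}


def window_from_overlap(data):
    """Way 1: latest first date and earliest last date across all sources."""
    start = max(min(dates) for dates in data.values())
    end = min(max(dates) for dates in data.values())
    return start, end


def window_from_basis(data, basis):
    """Way 2: take the first and last date of one chosen basis source."""
    return min(data[basis]), max(data[basis])


def cut(data, window):
    """Keep only the items of every source that lie inside the window."""
    start, end = window
    return {
        source: [d for d in dates if start <= d <= end]
        for source, dates in data.items()
    }


overlap_cut = cut(sample, window_from_overlap(sample))
basis_cut = cut(sample, window_from_basis(sample, "commits"))
```

With the overlap window, only the commit from July survives next to both mails. With "commits" as basis, nothing is removed in this sample because the mails already lie inside the commit window. If the sources do not overlap at all, way 1 yields a window whose start lies after its end, which a real implementation would have to handle.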

  2. Range-level networks:
    How to deal with incomplete ranges?

For clarification: We are only talking about incomplete ranges at the front and at the end of the list of ranges, up to the first range that is not to be removed/cut/... This way, no range in the middle is modified or removed for any reason. Right?

  • Remove ranges for which not all needed data are available?

This is easy to implement, but we would need to define mandatory data sources (a non-empty subset of "mails", "commits", "github", ...) and remove any ranges where any of the mandatory data sources is empty.
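
A minimal sketch of this idea, in Python for illustration only. It assumes that a range can be represented as a dictionary from data-source name to a list of items and that only ranges at the front and the end of the list are dropped; `trim_incomplete_ranges` and the sample values are hypothetical.

```python
def trim_incomplete_ranges(ranges, mandatory):
    """Drop leading and trailing ranges in which a mandatory source is empty.

    `ranges` is a list of dicts mapping a data-source name to its items.
    Ranges in the middle of the list are never removed.
    """
    def is_complete(rng):
        return all(len(rng.get(source, [])) > 0 for source in mandatory)

    first = 0
    while first < len(ranges) and not is_complete(ranges[first]):
        first += 1

    last = len(ranges)
    while last > first and not is_complete(ranges[last - 1]):
        last -= 1

    return ranges[first:last]


# Hypothetical ranges; the integers stand in for commits and mails.
sample_ranges = [
    {"commits": [1], "mails": []},
    {"commits": [2], "mails": [3]},
    {"commits": [], "mails": []},
    {"commits": [4], "mails": [5]},
    {"commits": [6], "mails": []},
]

both_mandatory = trim_incomplete_ranges(sample_ranges, ["commits", "mails"])
commits_only = trim_incomplete_ranges(sample_ranges, ["commits"])
```

In this sample, requiring both sources drops the first and the last range but keeps the empty range in the middle, whereas requiring only commits keeps all five ranges.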

  • Remove ranges for which all data sources are partly available but not for the [w]hole range?
  • Only cut incomplete parts of a range? (This would also cut time periods without activity...)

I would not apply these two ideas to every range, but only to those at the start or the end of the list of ranges. Then, the idea behind both is either to cut data from a range as described for project-level data above, but applied to a single range and only from one direction (start or end), or to remove the range completely, right?

When cutting data here, then we would either cut data at the start or the end of the range, depending on its location in the list of ranges.

Additionally, what's the threshold here? If there is a time distance of only two days between the first commit in a range and the first e-mail, we may not want to cut any data or remove the range completely. Do we want to use the threshold idea from below here?

  • Define a threshold for identifying incomplete ranges?

As proposed in our talk, we could compute the amount of activity for all data sources and remove any range at the start or the end whose amount of activity does not fall within the standard deviation (or a multiple of it).

Still, is this really the way to identify an incomplete range? What does it mean that data is only partly available? (see above)
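
One possible reading of the threshold idea, sketched in Python for illustration. The assumptions here are mine: activity is a single count per range, "within the standard deviation" is taken to mean within `k` sample standard deviations of the mean over all ranges, and only edge ranges are trimmed. `trim_by_activity` is a hypothetical name.

```python
from statistics import mean, stdev


def trim_by_activity(counts, k=1.0):
    """Return slice bounds (first, last) that exclude atypical edge ranges.

    `counts` holds one activity count per range. An edge range counts as
    atypical if it deviates from the mean by more than k standard deviations.
    """
    if len(counts) < 2:
        return 0, len(counts)  # a standard deviation needs two values

    mu, sd = mean(counts), stdev(counts)

    def is_typical(count):
        return abs(count - mu) <= k * sd

    first = 0
    while first < len(counts) and not is_typical(counts[first]):
        first += 1

    last = len(counts)
    while last > first and not is_typical(counts[last - 1]):
        last -= 1

    return first, last


# Hypothetical activity per range: sparse data at both ends.
activity = [1, 40, 45, 50, 42, 2]
first, last = trim_by_activity(activity, k=1.0)
kept = activity[first:last]
```

Note that the sparse edge ranges themselves influence the mean and the standard deviation in this sketch, so the result depends heavily on `k`: with `k=2.0`, no range is removed from the sample above.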

  • Globally cut incomplete time periods (analogously to 1.) before splitting?

Then, we can just do that by calling the project-level functionality of that first, and construct ranges afterwards as wanted. 😉

  3. Comparability of different analyses:
  • How to get the same analyzed global time period for global networks and range-level networks?

This can be achieved by using the same set of data to construct networks etc. We basically do that all the time.

  • How to deal with artifact/author/bipartite networks? Also skip the time ranges with incomplete data even if the missing data is not used? Make that configurable? Too many (conflicting) configuration options may confuse the users...

I think the options for the user (and also for us!) should be strictly limited to a small list of choices, with all options properly designed and documented. And it should also be very concise.

What options do we need to implement everything concisely? cut.which (at least, one of "mails", "commits", "github", ...), cut.method (e.g., "incomplete.remove" for ranges, "cut.incomplete" for ranges and project-level data), and cut.basis (optional, for "cut.incomplete")?

Anyway, there are two ways we need to consider for implementing any of this:

  1. When we only want to remove ranges, we can return a list of logicals, each indicating the inclusion (or non-removal) of the respective range. The subsetting has to be performed by the users themselves.
  2. We directly modify and remove ranges, so that we can return an adapted list of data objects.

What are your thoughts?
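
The difference between the two ways can be sketched as follows, in Python for illustration. The predicate `has_mails` and both function names are hypothetical; the point is only who performs the subsetting.

```python
def inclusion_mask(ranges, keep):
    """Way 1: return one boolean per range; the caller subsets on their own."""
    return [bool(keep(rng)) for rng in ranges]


def removed_ranges(ranges, keep):
    """Way 2: return the adapted list of ranges directly."""
    return [rng for rng in ranges if keep(rng)]


# Hypothetical ranges and predicate.
sample_ranges = [{"mails": []}, {"mails": ["a"]}, {"mails": ["b", "c"]}]


def has_mails(rng):
    return len(rng["mails"]) > 0


mask = inclusion_mask(sample_ranges, has_mails)

# With way 1, the subsetting happens on the caller's side:
subset_by_caller = [rng for rng, ok in zip(sample_ranges, mask) if ok]

# With way 2, the caller receives the final list:
subset_direct = removed_ranges(sample_ranges, has_mails)
```

The mask keeps the original indices intact, which helps when several parallel lists (data, networks, range names) have to be subset consistently; the direct variant is more convenient but hides which ranges were dropped.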

@clhunsen
Collaborator

Okay, after a long discussion also about implementation details (@Roger1995, @hechtlC, @bockthom), we decided to implement the idea as follows.

Foreword

We have (currently) two main data sources which can be split and, therefore, are affected by this issue and its solution: mails and raw commit data (see also here). For each of the data sources, we have a corresponding getter (for v2.x, in CodefaceProjectData; for v3.x, in ProjectData) to obtain the raw data. This idea should be reused for the current issue (the split functionality can then probably use this later).

Synchronicity data or PaStA data are orthogonal and do not have timestamps on the data items. Therefore, these data sources do not have to be considered here (as for the splitting).

Decisions and implementation

  1. We create a timestamp data structure in the data object (e.g., data.timestamps holding a data.frame) that holds the first and the last date of each data source.
    The data needs to be updated or, rather, inserted when the getter of a data source is called. This way, the timestamp data is always up-to-date.

    For example:

             mails                commits              issues
    start    2017-06-23 01:00:00  2017-05-22 02:00:00  2017-06-17 05:00:00
    end      2017-07-27 12:00:00  2017-08-12 13:00:00  2017-06-29 17:01:02
  2. For data objects, we provide two methods (names are not final):

    1. get.data.timestamps = function(data.sources = c("mails", "commits", "issues"), simple = FALSE) which returns the subset of data.timestamps for the given list of data sources (i.e., columns of the data frame). If the data is not available yet, the corresponding getters need to be called (that's why I added the foreword).
      If simple == TRUE, we would like to return not the complete data.frame, but only the latest start date and the earliest end date. This could ease the workflow in other implementation parts.
    2. get.data.cut.to.same.date = function(data.sources = c("mails", "commits", "issues")): First, we call the first method get.data.timestamps; then, we retrieve the latest start date and the earliest end date (we can also call get.data.timestamps(..., simple = TRUE) and skip this step); last, we cut all selected data sources to the dates we extracted before and return them.
      The data are not to be stored in the object!
  3. For the network construction, in general, we first need to obtain the data and then construct the networks from that. Now, we need to adjust this outline in the following way:

    1. Data retrieval needs to use the method described in Point 2.ii; otherwise, everything stays as before.
    2. To make Point 3.i configurable, we need to introduce a network-configuration parameter, e.g., cut.data.to.same.dates or unify.date.ranges.for.data.
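
Points 1 and 2 above can be sketched as follows. This is a Python illustration under stated assumptions, not the project's implementation: the class `ProjectDataSketch`, the tuple layout of the items, and the payload labels are hypothetical, and the sample dates are taken from the example table in Point 1.

```python
from datetime import datetime


class ProjectDataSketch:
    """Hypothetical stand-in for the data object."""

    def __init__(self, raw):
        # raw: data-source name -> list of (timestamp, payload) tuples
        self._raw = raw
        self._loaded = {}
        self.timestamps = {}  # data-source name -> (first date, last date)

    def get(self, source):
        """Getter: loads a source lazily and records its first/last date."""
        if source not in self._loaded:
            items = sorted(self._raw[source], key=lambda item: item[0])
            self._loaded[source] = items
            if items:
                self.timestamps[source] = (items[0][0], items[-1][0])
        return self._loaded[source]

    def get_data_timestamps(self, sources, simple=False):
        for source in sources:
            self.get(source)  # make sure the timestamps have been recorded
        stamps = {s: self.timestamps[s] for s in sources if s in self.timestamps}
        if not simple:
            return stamps
        # simple: only the latest start date and the earliest end date
        return (max(start for start, _ in stamps.values()),
                min(end for _, end in stamps.values()))

    def get_data_cut_to_same_date(self, sources):
        start, end = self.get_data_timestamps(sources, simple=True)
        # The cut data are returned only; nothing is stored in the object.
        return {s: [item for item in self.get(s) if start <= item[0] <= end]
                for s in sources}


project = ProjectDataSketch({
    "mails": [(datetime(2017, 6, 23, 1), "m1"), (datetime(2017, 7, 27, 12), "m2")],
    "commits": [(datetime(2017, 5, 22, 2), "c1"), (datetime(2017, 8, 12, 13), "c2")],
    "issues": [(datetime(2017, 6, 17, 5), "i1"), (datetime(2017, 6, 29, 17, 1, 2), "i2")],
})
all_sources = ["mails", "commits", "issues"]
window = project.get_data_timestamps(all_sources, simple=True)
cut_data = project.get_data_cut_to_same_date(all_sources)
```

For the example table, the window runs from the first mail (latest start) to the last issue (earliest end). The sketch raises an error if all selected sources are empty, which a real implementation would need to handle.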

Workflow (for (range-level) data objects)

  1. Obtain the data object.
  2. Get the timestamp information via times = get.data.timestamps(..., simple = TRUE) (see Point 2.i above).
  3. Call data.cut = split.data.time.based(data, bins = times, ...).
  4. Split the data into ranges as wanted.

We can provide a function for this workflow (recommended) and then store the data-cut information in the ProjectConf, as we do for the splitting information.
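
The workflow can be sketched end to end as follows, in Python for illustration. The helper `split_time_based` below is a hypothetical stand-in that bins plain timestamps into half-open intervals; it does not claim to reproduce the behavior of the project's split.data.time.based. The one-second padding of the end date and the 14-day period are assumptions as well.

```python
from datetime import datetime, timedelta


def split_time_based(timestamps, bins):
    """Group timestamps into half-open intervals [bins[i], bins[i + 1])."""
    return [[t for t in timestamps if lower <= t < upper]
            for lower, upper in zip(bins, bins[1:])]


def make_bins(start, end, period):
    """Bin boundaries from start to end in steps of period (last bin may be shorter)."""
    bins = [start]
    while bins[-1] + period < end:
        bins.append(bins[-1] + period)
    bins.append(end)
    return bins


# Step 1: obtain the data (hypothetical commit timestamps).
commits = [datetime(2017, 5, 22), datetime(2017, 6, 25),
           datetime(2017, 7, 10), datetime(2017, 8, 12)]

# Step 2: timestamp information, e.g. the latest start and earliest end date.
start, end = datetime(2017, 6, 23), datetime(2017, 7, 27)

# Step 3: cut the data; the end is padded so that an item at `end` is kept.
end_exclusive = end + timedelta(seconds=1)
commits_cut = split_time_based(commits, [start, end_exclusive])[0]

# Step 4: split the cut data into ranges as wanted.
range_bins = make_bins(start, end_exclusive, timedelta(days=14))
ranges = split_time_based(commits_cut, range_bins)
```

Reusing the same splitting helper for the cut (a single bin) and for the ranges (several bins) mirrors the idea in step 3 of calling the split functionality with the cut dates as bins.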

Possible difficulties

As we call several high-level functions during the network construction (e.g., get.commit2artifact()), we likely need to store the cut data in the data object somehow (e.g., construct a new object or clone the original); otherwise, we are not able to construct the network properly (we do not want to work on the raw data again).

My idea: We have a data object in the network builder (or, for version v2.x, the raw data themselves in the joint object). When we enable the cutting for the network building, we can (lazily!) clone the original object, cut the data of the clone, and store the cut clone next to the original. Then, we can just use the cut data.
For cutting the data object, we could use the function outlined under Workflow.
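
A minimal sketch of the lazy-clone idea, in Python for illustration. The class name `NetworkBuilderSketch`, its constructor parameters, and the integer stand-ins for timestamps are hypothetical; the sketch only shows that the clone is created on first use and stored next to the untouched original.

```python
import copy


class NetworkBuilderSketch:
    """Hypothetical builder that holds the original data and a lazy cut clone."""

    def __init__(self, data, cut_function, cut_enabled=False):
        self._data_original = data
        self._cut_function = cut_function  # takes a data object, returns cut data
        self._cut_enabled = cut_enabled
        self._data_cut = None  # created lazily on first access

    def get_data(self):
        if not self._cut_enabled:
            return self._data_original
        if self._data_cut is None:
            # Clone first, so that cutting never touches the original data.
            clone = copy.deepcopy(self._data_original)
            self._data_cut = self._cut_function(clone)
        return self._data_cut


def cut_to_window(data, start=2, end=8):
    """Hypothetical cut: keep items between start and end (integers as timestamps)."""
    return {source: [t for t in items if start <= t <= end]
            for source, items in data.items()}


original = {"commits": [1, 5, 9], "mails": [3, 4]}
builder_plain = NetworkBuilderSketch(original, cut_to_window, cut_enabled=False)
builder_cut = NetworkBuilderSketch(original, cut_to_window, cut_enabled=True)
cut_view = builder_cut.get_data()
```

All high-level functions used during network construction would then go through `get_data()`, so they see the cut data when cutting is enabled and the original data otherwise, without recomputing the cut on every call.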


Any further suggestions or thoughts? Especially regarding the identified difficulty...

@clhunsen
Collaborator

See 545850a#commitcomment-24284658 for a valuable discussion.

@clhunsen
Collaborator

As PR #78 is merged into the dev branch, this can be regarded as closed.

fehnkera pushed a commit to fehnkera/coronet that referenced this issue Sep 23, 2020
Add information on the cutting information to the README file. This
explicitly includes the handling of 'ProjectData' object in
'NetworkBuilder' instances, i.e., the cloning process based on the
originally given data (see issues se-sic#38 and se-sic#116 as well as commit
2b327a9 for more details).

Cross-references and the table of contents are update accordingly.

Signed-off-by: Claus Hunsen <hunsen@fim.uni-passau.de>