How to handle incomplete ranges resp. missing data? #38
This is basically easy to implement, and it makes sense to have it. However, there are two ways to do this:
Do we want to support both ways?
For clarification: we only talk about incomplete ranges at the front and at the end of the list of ranges, until we find a range that is not to be removed/cut/... This way, no range in the middle is modified or removed for any reason. Right?
This is easy to implement, but we would need to define mandatory data sources (a non-empty subset of "mails", "commits", "github", ...) and remove any ranges where any of the mandatory data sources is empty.
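To make the mandatory-data-source idea concrete, here is a minimal sketch in Python (not the project's own code). The `RangeData` container and the function name `trim_incomplete_ranges` are hypothetical and exist only for this illustration. The sketch assumes that "empty" means a count of zero for a data source and that, as clarified above, only ranges at the front and at the end of the list are dropped.

```python
from dataclasses import dataclass, field


@dataclass
class RangeData:
    """Hypothetical container: one time range with item counts per data source."""
    name: str
    counts: dict = field(default_factory=dict)  # e.g. {"commits": 12, "mails": 0}


def trim_incomplete_ranges(ranges, mandatory):
    """Drop ranges from the front and the back of the list until a range is found
    that has data for every mandatory data source. Ranges in the middle are kept."""
    if not mandatory:
        raise ValueError("mandatory must be a non-empty set of data sources")

    def complete(r):
        # A range is complete if no mandatory data source is empty.
        return all(r.counts.get(source, 0) > 0 for source in mandatory)

    start, end = 0, len(ranges)
    while start < end and not complete(ranges[start]):
        start += 1
    while end > start and not complete(ranges[end - 1]):
        end -= 1
    return ranges[start:end]


ranges = [
    RangeData("r1", {"commits": 5, "mails": 0}),  # incomplete, at the front
    RangeData("r2", {"commits": 3, "mails": 2}),
    RangeData("r3", {"commits": 4, "mails": 0}),  # incomplete, but in the middle
    RangeData("r4", {"commits": 1, "mails": 1}),
    RangeData("r5", {"commits": 0, "mails": 3}),  # incomplete, at the end
]
kept = trim_incomplete_ranges(ranges, {"commits", "mails"})
print([r.name for r in kept])
```

Note that `r3` survives even though it has no mails, because it lies between two complete ranges.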
I would not apply these two ideas to every range, but only to those at the start or the end of the list of ranges. The idea behind both is then either to cut data from a range, as described for project-level data above but applied to a single range and only from one direction (start or end), or to remove the range completely, right? When cutting data here, we would cut at either the start or the end of the range, depending on its position in the list of ranges. Additionally, what is the threshold here? If there is a time distance of only two days between the first commit in a range and the first e-mail, we may not want to cut any data or remove the range at all. Might we want to use the threshold idea from below here?
As proposed in our talk, we could compute the amount of activity for all data sources and remove any range at the start or the end whose amount of activity does not fall within the standard deviation (or a multiple of it). Still, is this really the way to identify an incomplete range? What does it mean that data is only partly available? (see above)
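One possible reading of the activity threshold, sketched in Python: a range counts as "typical" if its activity lies within `k` standard deviations of the mean activity over all ranges, and untypical ranges are only dropped from the front and the back. The function name `trim_by_activity`, the parameter `k`, and the use of one combined activity number per range are assumptions for this sketch; the discussion above does not fix any of them.

```python
from statistics import mean, stdev


def trim_by_activity(activity, k=1.0):
    """Return slice bounds (start, end) so that activity[start:end] no longer
    begins or ends with a range whose activity is outside mean +/- k * stdev.

    `activity` holds one number per range (e.g. commits plus mails)."""
    if len(activity) < 2:
        # The sample standard deviation is undefined for fewer than two values.
        return 0, len(activity)

    mu = mean(activity)
    sigma = stdev(activity)
    low, high = mu - k * sigma, mu + k * sigma

    def typical(value):
        return low <= value <= high

    start, end = 0, len(activity)
    while start < end and not typical(activity[start]):
        start += 1
    while end > start and not typical(activity[end - 1]):
        end -= 1
    return start, end


activity = [1, 40, 42, 5, 45, 43, 2]
start, end = trim_by_activity(activity, k=1.0)
print(activity[start:end])
```

With these numbers, the first and the last range are dropped, while the low-activity range in the middle is never touched. Whether low activity really indicates an incomplete range is exactly the open question raised above.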
Then, we can just do that by calling the project-level functionality of that first, and construct ranges afterwards as wanted. 😉
This can be achieved by using the same set of data to construct networks etc. We basically do that all the time.
I think the options for the user (and also for us!) should be strictly limited to a small list of choices, with all options properly designed and documented. The list should also be very concise. What options do we need to implement everything concisely? In any case, there are two ways we need to consider for implementing any of this:
What are your thoughts?
Okay, after a long discussion, also about implementation details (@Roger1995, @hechtlC, @bockthom), we decided to implement the idea as follows.
Foreword
We (currently) have two main data sources that can be split and are therefore affected by this issue and its solution: mails and raw commit data (see also here). For each of the data sources, we have a corresponding getter (for v2.x, in Synchronicity data or PaStA data are orthogonal and do not have timestamps on the data items. Therefore, these data sources do not have to be respected here (as for the splitting).
Decisions and implementation
Workflow (for (range-level) data objects)
We can provide a function for this workflow (recommended) and then store the data-cut information in the
Possible difficulties
As we call several high-level functions during the network construction (e.g.,
My idea: We have a data object in the network builder (or, for version v2.x, the raw data themselves in the joint object). When we enable the cutting for the network building, we can (lazily!) clone the original object, cut the data of the clone, and store the cut clone next to the original. Then, we can just use the cut data.
Any further suggestions or thoughts? Especially regarding the identified difficulty...
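The lazy-cloning idea could be sketched roughly as follows. This is a Python illustration only: the classes `ProjectData` and `NetworkBuilder` here are simplified stand-ins, and `cut_to_common_window`, `cut_enabled`, and `data()` are hypothetical names, not the project's API. The sketch also assumes that "cutting" means restricting commits and mails to the time window covered by both data sources.

```python
import copy
from datetime import datetime


class ProjectData:
    """Simplified stand-in: only the timestamps of commits and mails."""

    def __init__(self, commits, mails):
        self.commits = sorted(commits)
        self.mails = sorted(mails)

    def cut_to_common_window(self):
        """Keep only items inside the time window covered by both data sources."""
        if not self.commits or not self.mails:
            self.commits, self.mails = [], []
            return
        start = max(self.commits[0], self.mails[0])   # latest first item
        end = min(self.commits[-1], self.mails[-1])   # earliest last item
        self.commits = [t for t in self.commits if start <= t <= end]
        self.mails = [t for t in self.mails if start <= t <= end]


class NetworkBuilder:
    """Simplified stand-in that stores the cut clone next to the original."""

    def __init__(self, data, cut_enabled=False):
        self._original = data
        self._cut = None  # created lazily on first access
        self.cut_enabled = cut_enabled

    def data(self):
        if not self.cut_enabled:
            return self._original
        if self._cut is None:
            # Clone first, so the originally given data stay untouched.
            self._cut = copy.deepcopy(self._original)
            self._cut.cut_to_common_window()
        return self._cut


def day(n):
    return datetime(2017, 1, n)


original = ProjectData(commits=[day(1), day(5), day(10)],
                       mails=[day(4), day(6), day(8), day(12)])
builder = NetworkBuilder(original, cut_enabled=True)
cut = builder.data()
print(len(cut.commits), len(cut.mails), len(original.commits))
```

Every high-level function that goes through `data()` then sees the same cut clone, which is one way to address the difficulty of several functions being called during network construction.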
See 545850a#commitcomment-24284658 for a valuable discussion.
As PR #78 is merged into the
Add information on data cutting to the README file. This explicitly includes the handling of 'ProjectData' objects in 'NetworkBuilder' instances, i.e., the cloning process based on the originally given data (see issues se-sic#38 and se-sic#116 as well as commit 2b327a9 for more details). Cross-references and the table of contents are updated accordingly. Signed-off-by: Claus Hunsen <hunsen@fim.uni-passau.de>
Since we extract commit data and e-mail data (and later also issue data) from different sources, the time ranges for available data are different.
For example, there may be a large gap in time between the first extracted commit and the first extracted e-mail (and, analogously, between the last commit and the last e-mail). This especially affects multi-networks, as they can be constructed on both commit and e-mail data simultaneously.
This may cause problems for some of the analyses, even for those that do not use multi-networks themselves but need to be comparable to analyses that do. As different analyses use different kinds of networks and have different prerequisites, this question is not easily answered.
In the following, I will mention some different use cases and possible solutions:
How to deal with incomplete ranges?
There are many options, but it is difficult to make all the analyses and networks compatible with each other.
Any ideas on that?