
length field in core #79

Closed
bhilburn opened this issue Nov 1, 2017 · 8 comments

Comments

@bhilburn
Contributor

bhilburn commented Nov 1, 2017

So, over in #60, a number of people have commented on the need for a length field. I just hit a case in my own SigMF work, completely independent of streaming where having a length field would make things significantly easier.

We decided not to include a length field in the captures segment, originally, because we thought it would complicate streaming captures segments to-disk. Some thoughts after working with our current design for a bit:

  1. I'm not convinced that providing a length field in each captures segment would actually impede writing the segment to disk. If a Writer application realizes it needs to write a new capture segment, it can add a length field to the current segment before moving on to writing the next one. Since the captures and annotations arrays pragmatically have to be buffered and written independently anyway to support streaming, I don't think this causes any additional seeking.
  2. Another option, which would be useful both on its own and in combination with (1) above, is providing a total length field in the global object. We've already made decisions that would require modifying a metadata file if a data file is split, so I don't think this adds any more burden than already exists.

You could argue that it's up to the Reader application to derive the global length from the datatype and the size of the data file, but (a) that's not possible for metadata-only distributions (per #76), and (b) it would only give you the global length, not the length of each captures segment.
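To make (b) concrete, here is a minimal sketch of how a Reader could derive the global length when the data file is present. The datatype-to-bytes map covers only a few illustrative SigMF datatypes, and the helper name is hypothetical:

```python
import os

# Bytes per sample for a few SigMF datatypes (illustrative subset;
# complex types store two components per sample).
BYTES_PER_SAMPLE = {
    "cf32_le": 8,  # complex float32
    "ci16_le": 4,  # complex int16
    "rf32_le": 4,  # real float32
    "ri16_le": 2,  # real int16
}

def sample_count_from_file(data_path, datatype):
    """Derive the recording's total sample count from the data file size."""
    return os.path.getsize(data_path) // BYTES_PER_SAMPLE[datatype]
```

This works only when you actually have the data file on disk, which is exactly what the metadata-only case takes away.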

Thoughts on the above? Anyone agree? Anyone disagree? What problems exist that I'm not thinking about?

@kpreid
Contributor

kpreid commented Nov 2, 2017

On the one hand, I dislike the idea because it is redundant with the data file, and that means it could be incorrect. On the other, we already have sample indexes for various positions, so there are already possible consistency issues, of course.

But I'm thinking of a simple recorder that creates a metadata file, starts a thing that streams samples to disk, then exits (or does not have the ability to ask the data-streamer when it is done, or the whole process/machine is killed abruptly such as by the battery running out in an 'embedded' recorder). This would be unable to know what the final length is.

In the streaming-SigMF applications, presumably length could be required for that purpose in particular but not necessarily in all recordings. Does the additional case you noted also need length to be a required field in all recordings, or could it be optional?

@bhilburn
Contributor Author

bhilburn commented Nov 2, 2017

@kpreid - You have a good point regarding redundancy with the data file. I also have an intense distaste for duplicated data, as it inevitably leads to errors. The one issue I see with not having it, though, is in the MD-only distribution case, where you don't have access to the datafile. That said, if we put it in captures, you perhaps wouldn't need it as you could just look at the final captures segment.

And great point regarding Writers that don't have the ability to ask the data-streamer when it's done. When this happens, though, wouldn't some other process need to come back around later and clean up the metadata to make it a compliant recording? For example, it may need to concatenate the written captures and annotations streams and append closing JSON syntax (a }, perhaps) to make it truly compliant. In the event of a hard failure like this, it seems like some clean-up would need to happen anyway - so why not have it add the lengths as part of that process?

That said, if the above is too burdensome, I think making it optional is a good midpoint, allowing Writers to remain very basic if they need to be, while also providing the extra utility for those applications that can handle it.

Thanks for your continued input and feedback, @kpreid!

@kpreid
Contributor

kpreid commented Nov 2, 2017

> And great point regarding Writers that don't have the ability to ask the data-streamer when it's done. When this happens, though, wouldn't some other process need to come back around later and clean up the metadata to make it a compliant recording?

If the recorder is not doing any annotation (because it is just a signal recorder with no smarts) and there is no required length field, then it only needs to write one capture segment and other metadata which can all be known at the start.

(I am tangentially reminded of the issue of clock skew between RF sample clock and wall clock, which is a reason one would want to have an end timestamp for long recordings. But that should be optional, is not the same as a sample length value, and is another issue entirely.)

The use case I am imagining is like: "Hi, I am a black-box recorder who was turned on at [GPS position and timestamp], here is what I recorded since that instant." Such devices could be useful for a situation when you want to record RF in a large area from multiple points. (Drone competition? Foxhunt? Wildlife tracking? I dunno but it seems like something to do.)

(Oh, another slight complication to going back to fix stuff later: writing until your disk is literally full. It would be easy as such things go to reserve a block, but it's still more things to get right.)

Certainly a cleanup-later tool could add lengths, but it makes the process less "just drag the files off the SD card" and creates an extra opportunity for Technically Invalid But The Information Is All There files.

@bhilburn
Contributor Author

bhilburn commented Nov 2, 2017

@kpreid - Great point regarding a Writer just being a recorder. Also, disk space.

Okay, let's talk about core:sample_count as an optional field, then. This would solve my problem (which is a unique post-processing problem), the split-MD use case, and the streaming use case in #60. The major question here, I think, is: do we put it just in captures, or also in global?

While it would be more convenient in many cases (including mine) to have it in both global and captures segments, @kpreid raised the accurate point that putting it in global (where it would represent the length of the entire dataset) would be duplicated data; you could get this either from the dataset file, or from the final capture segment's start+count.
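To illustrate that last point, recovering the global total from a final capture segment carrying the proposed count field is a one-liner. The field names here are illustrative (mirroring core:sample_start plus the proposed per-capture count):

```python
def total_from_captures(md):
    """Total dataset length from the final capture segment's start index
    plus its (proposed, optional) per-segment sample count."""
    last = md["captures"][-1]
    return last["core:sample_start"] + last["core:sample_count"]
```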

In #60, @drewlio originally requested a length in the global segment. I would be interested to hear his opinion here, too. Also, @djanderson.

@kpreid - What are your thoughts? I have a feeling I know what your opinion will be, but guessing is for when you can't get canonical answers =)

@kpreid
Contributor

kpreid commented Nov 2, 2017

Assuming capture segments are still intended to be a partition of the dataset (exactly one capture segment applies to any sample), then the least-redundant schema is to have a global length and capture segments only have start points.

@drewlio

drewlio commented Nov 2, 2017

Love it.
Optional is good since some use cases won't care.
My use cases don't care whether it's in capture or global. You could include both (optionally), since they have different meanings. (I'm thinking of a fictitious case where a global length tells you the ultimate size of all captures, and then each capture tells you its length. This might be an oddball corner case. Lots of redundancy.) If you were only going to pick one location for sample_count, then I would say capture makes the most sense.

@djanderson
Contributor

In principle I agree with @kpreid's argument that we should minimize redundancy and the chances for error, but consider it in purely practical terms:

Getting the length of the nth capture array if we only have the total length in global:

# The last segment's length must be derived from the global total;
# any other segment's length comes from the next segment's start.
if n == len(md['captures']) - 1:
    len_nth_capture = md['global']['core:sample_count'] - md['captures'][n]['sample_start']
else:
    len_nth_capture = md['captures'][n + 1]['sample_start'] - md['captures'][n]['sample_start']

compared to if each capture segment has a sample_count:

len_nth_capture = md['captures'][n]['sample_count']

So we pay for reducing redundancy in the spec with seriously complicating each instance of what will be a very common computation.
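In practice that branching would presumably get wrapped in a helper anyway; a sketch, using the same (hypothetical) field names as the snippets above:

```python
def capture_length(md, n):
    """Length of the n-th capture segment when only 'global' carries a
    total sample count and segments carry only start indices."""
    captures = md["captures"]
    if n == len(captures) - 1:
        end = md["global"]["core:sample_count"]  # last segment: use global total
    else:
        end = captures[n + 1]["sample_start"]    # otherwise: next segment's start
    return end - captures[n]["sample_start"]
```

Even wrapped up, every Reader implementation has to carry this logic, which is the practical cost described above.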

@bhilburn
Contributor Author

I have added a length field to the captures segment and pushed it to master. Thanks so much, everyone, for your input in this discussion!
