Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple Recordings Extension #151

Closed
wants to merge 4 commits into from
Closed

Conversation

bhilburn
Copy link
Contributor

@bhilburn bhilburn commented Aug 25, 2021

This PR replaces #99

New draft of the multi-recordings extension based on all of the discussion we have had to-date. I have a few issues with current draft, but sharing it for general comment / discussion so we can converge on the best solution.

My issues with it:

  • There is duplicated information implied in the primary, channel, and streams fields. For example, the existence of the streams array implies that the current recording is the primary.
  • I'm not certain the channel field is still useful when paired with the streams array.
  • Right now, linking recordings is based solely around filenames. Should we also enable referencing the sha512 hashes?

Here's my biggest concern:

The way the previous draft of this extension worked was by specifying exactly which fields in core and extension namespaces could have a multirecording link. That was clunky and painful. The new draft allows you to use it in place of any non-string datatype, which I think is cleaner and realistically the right move for future extensions. That means, though, that anytime this extension is included, parsers cannot assume that fields that are normally ints or bools are actually those -- they have to be checked to see if they are strings, first. This seems painful to me. You could actually consider this a violation of the "don't modify core" rule for extensions, since it enables a new datatype for core fields.

I would especially love to hear thoughts on this from the broader group. As always, we're trying to figure out how to strike the right balance of not simplicity in the design and simplicity for applications working with those recordings.

Would be great to get comments / thoughts / perspectives, @jacobagilbert , @Teque5 , @storborg , @pwicks86 , @n-west , @gmabey

@bhilburn bhilburn self-assigned this Aug 25, 2021
@bhilburn bhilburn added this to PR Filed in v1.0.0 Release via automation Aug 25, 2021
@bhilburn bhilburn added this to the Release v1.0.0 milestone Aug 25, 2021
@bhilburn bhilburn marked this pull request as draft August 25, 2021 02:42
Copy link
Member

@jacobagilbert jacobagilbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an improvement, and is significantly simplified in that it does not modify the core specification, which I believe we have agreed is not something any extension should do.

My suggestions are as follows:

  1. Make the global primary field required (see inline comment).
  2. I agree, while this field was originally added at my suggestion I believe it largely serves as a convenient way for users to shoot themselves in the foot.
  3. We can add something like the following to cover the intent of the channel field:
### The `streams` field

The `multirecordings:streams` field is a JSON array
of `recording` strings, as defined by this extension, that indicate multiple
streams of data that were captured as part of the same event.

This field MUST only appear in the `primary` recording's metadata file. The
primary recording must appear first in this list, and the order of the array
can be used to infer channel order.

If we do suggestion 2 above, then the logic to determine what channel a recording is a little clunky but explicit:

primary_recording = this_file[`global`:`multirecordings:primary`]
load primary_recording_metadata
streams = primary_recording_metadata[`global`:`multirecordings:streams`]
channel_num = index(streams[this_file])

Thanks Ben!


| name | required | type | description |
| ------------- | -------- | --------- | ----------------------------------- |
| `primary` | false | recording | The primary recording this recording is linked to.|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this field should be required if using this extension, and if the data is the primary channel it is its own filename. This will enable code implementing multirecordings to always be capable of reliably referencing the primary channel, which may be the only file holding certain metadata.

| ------------- | -------- | --------- | ----------------------------------- |
| `primary` | false | recording | The primary recording this recording is linked to.|
| `channel` | false | uint | The channel index of this recording.|
| `streams` | false | object | List of SigMF recordings that represent multiple streams.|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be slightly reworded to be more specific:

| streams | false | object | List of SigMF recording objects that represent multiple streams.|

Comment on lines +95 to +96
"example-channel-0",
"example-channel-1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines should be:

            "example-0",
            "example-1"

@bhilburn
Copy link
Contributor Author

I'd really like to get @n-west, @mormj, and/or @willcode's opinion on my concern re: string parse checking in the original description.

@bhilburn bhilburn requested a review from n-west August 27, 2021 00:32
@jacobagilbert
Copy link
Member

I had misread and misunderstood that part so my apologies. I do think that is somewhat complicating. Is the primary purpose of this to avoid duplicate copies of annotations and captures? If so, it may be simpler to just add global fields such as use_annotations and use_captures to link these.

This certainly does not cover every use case, and there are edge cases we might want to address but what do you think about this general idea?

@willcode
Copy link
Contributor

If I'm understanding this right, you want to bundle recordings together by including metadata about the bundling in the individual recordings. If bundling is a goal, then I think there should be higher level metadata object, rather than an extension object, that describes this, leaving the individual recordings unmodified. Then, you can remap channels, put together recordings from different sources, etc., without modifying anything.

@bhilburn
Copy link
Contributor Author

@jacobagilbert - Yes, exactly. So, if normally you would describe a field, say, geolocation, with a single value (in this case a Geopoint), but instead want to describe it with a recording of Geopoints because it's a time-varying field (i.e., the receiver is moving), how do you go about that? In my mind, that's the toughest problem this PR has to solve.

We have discussed three potential options thus far:

  1. This extension specifies specific fields in core that can be recordings instead of their original values (this was the approach in previous drafts).
  2. This extension allows any field's value to be replaced by a recording (the current draft's approach).
  3. This extension specifies new fields, duplicating existing core fields, but pointing to recordings instead of individual values (we have never seriously considered this).

@willcode raises an interesting idea... it's actually kind of related to an idea that @dbouquin and I discussed at Hat Creek Observatory a couple of years ago regarding a top-level metadata file showing relationships between recordings. We had never talked about using it for this specific use-case, but now that Jeff calls it out, it does seem super relevant.

So, in this context, would the value in the specific recording (again, let's say geolocation) be the "start point", and then in a top-level metadata file, we say "the geolocation field in recording A is described by recording B"?

@bhilburn
Copy link
Contributor Author

bhilburn commented Aug 29, 2021

Been thinking about this. I think @willcode's suggestion is the right path, and it's actually the same core concept of #108 from Daina about having a top-level metadata file that describes relationships between recordings. In addition to other benefits, the fact that each individual Recording remains the same as it would have been otherwise is huge.

I'm experimenting a bit with how to make this work, working name is a "collection". Right now, the working model is that fields from other namespaces that can be described by recordings get added to the "collections" namespace, and in the key/value pair, the value is a [base filename, dataset hash] JSON array. There are a couple of new fields that need to get added, such as "streams" for MIMO systems.

toplevel

(note: in above image, the values of ["string", "hash"], the first String is meant to be a base filename like N.sigmf-[meta|data].)

The downside of this is that every field that can be used at the top-level must get added to the extension. BUT, if we don't do that, then we have to overwrite the definition of those fields to allow pointing to a Recording as the value for a key, which I think is worse. Better to re-define the key in a new namespace attached to a new value datatype that create confusion in the original namespace, IMO.

Posting this to gather thoughts / feedback on the approach.

@mormj
Copy link

mormj commented Aug 30, 2021

Certainly from a practical perspective, this seems much more manageable than having individual collects optionally reference each other. Collection seems like a reasonable name.

@willcode
Copy link
Contributor

A parallel application is video/audio editing MLT/EDL
https://mltframework.org/docs/mltxml/
https://shotcut.org/notes/mltxml-annotations/

@jacobagilbert
Copy link
Member

jacobagilbert commented Aug 30, 2021

I do believe allowing arbitrary type overloading will be a parsing burden (mostly in libsigmf), and I think that something like this is reasonable. Adding a file to manage simpler more common use cases (like 2 phase coherent channels) does seem a bit heavy handed, but I've been thinking about this and there are drawbacks to all approaches i can come up with and I do not find this latest suggestion is any more onerous than others.

leaving the individual recordings unmodified.

This is a reasonable idea, but I think we should require an unambiguous way to identify that a file is part of a collection, and to identify and access the other files in that collection given any file that is part of said collection. This could be a required (or possibly optional...) global object which defines the top level collection file (and the presence of this field indicates it is part of a collection). If there is the objective that specific files be part of multiple independent collections then this won't work (though at that point they are all part of the same collection, right? its entirely possible I am missing a use case here).

Final thought: @willcode was your suggestion that this be implemented as part of the core spec (and not as an extension)? It sounds like that but I am not sure.

@bhilburn
Copy link
Contributor Author

Final thought: @willcode was your suggestion that this be implemented as part of the core spec (and not as an extension)? It sounds like that but I am not sure.

I actually didn't see this in Jeff's note, but I was thinking about this this morning and I think that's exactly what we should do - especially since it preserves existing backwards compatibility. BUT, where I think this breaks is that we need the fields in their own namespace, rather than core, for the field collision reasons mentioned in my last message. We've been pretty firm that the main spec only defines core, not other namespaces, so I think ultimately it should remain an extension.

@willcode
Copy link
Contributor

Hmm, I don't have a real opinion about core vs ext for collections, but it's a vertical extension, where "extensions" are horizonal. So does it really fit the concept of an extension?

@jacobagilbert
Copy link
Member

jacobagilbert commented Aug 30, 2021

I also do not think this fits the concept of an extension for the same reason Jeff mentioned. We could implement something like the following structure:

{
  "global":
  {
    "core:collections":
    {
      "geolocation": "geo_data_basefile",
      "hagl": "hagl_data_basefile",
      "streams": ["channel-0", "channel-1", "channel-2", "channel-3"]
    }
  },
  ...?
}

And define the top level collections file as just the appropriate namespace core:collections object.

Of course at that point we can probably just put this object directly in the 'primary' collection data file and avoid the need for a top level file altogether (full circle to your idea 3 above that has not been seriously considered). I also think we should be explicit about what values can take on a recording object... there are not very many currently and there are a lot of integer/double/object types that really should never change within a sigmf dataset. For example, we should certainly not permit a user to assign a recording object to the global core:num_channels or core:offset fields.

Do you envision use of this concept be confined to global objects? Objects at the captures and annotation level already have a way of specifying discrete changes, though maybe it is desirable for this to cover continuously changing features within a captures segment or even annotation objects.

@bhilburn
Copy link
Contributor Author

bhilburn commented Sep 3, 2021

Responding to specific points from @willcode and @jacobagilbert -

  1. There are great points in this discussion about why collections shouldn't be an extension - I think you've convinced me.
  2. I think a top-level collections file is still the right path. It makes more sense from the perspective of how recordings are currently structured, IMO. Requiring metadata that's not about a specific recording to be carried by a Recording's sigmf-meta is a little bit of an abuse of the structure, I think.
  3. I agree regarding specifying each field that can be re-defined - per my comment above, otherwise, you are overwriting datatypes in existing fields defined elsewhere, which I don't think is a good approach.
  4. My thinking right now is that there is nothing in core outside of Global that we would allow in collections, but I kind of expect there will be in user extensions. Actually, the antenna extension has azimuth_angle and elevation_angle in annotations, which are examples of things that could be defined by a collection. Thoughts?

Okay, so, let's say these are our requirements:

  • Fields that can be used with collections must be specified.
  • It is not defined in an extension.
  • It must be contained in a separate sigmf-collection file

Thinking about it, right now we've got three top-level objects: global, captures, annotations. It sort of feels like we are describing a fourth, honestly, but one that is optional.

So, if we pull on that thread, the next question is -- how do you specify the fields in that object? Other than top-level objects, all other fields are namespaced, and breaking consistency there seems confusing. Also, I would argue we /have/ to namespace them, because user extensions will certainly add new fields to the collection object under their own namespace.

I think what this distills down to is that the main spec now specifies two namespaces, core and collection, and four top-level objects. It defines the fields from core that are part of collection. Users made add new fields under extension namespaces to the collection object.

Mock-up file:
File: top-level.sigmf-collection

{
    "collection": {
        "collection:extensions" : {
            "antenna": "v0.1.0"
        },

        "collection:geolocation": ["latlong", "hash"],
        "collection:hagl": ["hagl", "hash"],

        "collection:azimuth": ["antenna", "hash"],

        "collection:streams": [
            ["example-channel-0", "hash"],
            ["example-channel-1", "hash"]
        ],

        // this one is from a user extension
        "userdef:somefield": ["foobar", "hash"]
    }
}

I think I like this... only problem I have with it is calling the top-level object "collection" and the namespace "collection" is confusing. Maybe we change the namespace defined by the primary spec to something else to be less confusing?

Otherwise, thoughts overall?

@jacobagilbert
Copy link
Member

jacobagilbert commented Sep 3, 2021

I agree this is effectively a 4th top level object.

You raise some very good points, but I am still not sure it is necessary to specify these in a namespace other than core, and actually I think it may be slightly simpler to do it within core. Keys for the specified information remain the same, and if a key exists in (and is retrieved out of) the collection top level object then it will be a collection type object, if it exists and is retrieved from the global object, it's a fixed value.

User extensions could then also define fields as valid in collection and/or global if that is appropriate just as in the base spec. The way it is above, core SigMF fields would have collection:param/core:param in collection/global respectively, and user extensions would have extension:param in both (correct me if thats wrong).

Another possibility for example.sigmf-collection:

{
    "collection": {
        "core:extensions" : {
            "antenna": "v0.1.0"
        }

        "core:geolocation": ["latlong", "hash"],
        "core:hagl": ["hagl", "hash"],

        "antenna:azimuth": ["antenna_az", "hash"],

        "core:streams": [
            ["example-channel-0", "hash"],
            ["example-channel-1", "hash"]
        ],

        // this one is from a user extension
        "extension:field": ["datafile_for_field", "hash"]
    }
}

I may be missing the point of confusion with doing something like the above, so please let me know if so.

@bhilburn
Copy link
Contributor Author

bhilburn commented Sep 9, 2021

I don't think you are missing anything -- I think I'm just more worried about having the same "key" name in the "core" extension but with two different datatypes for the "value" conditionalized on whether it's in "global" or "collection".

I would love to get inputs from other folks, especially if it's more people that disagree with me. @n-west, @storborg, @willcode, @mormj, @Teque5?

This is really, I think, the final detail before we hammer out the last draft of this, which is itself the last outstanding feature of v1.0.0 😊

@bhilburn bhilburn mentioned this pull request Sep 17, 2021
@bhilburn
Copy link
Contributor Author

Continued in #157.

@bhilburn bhilburn closed this Sep 17, 2021
@jacobagilbert jacobagilbert moved this from PR Filed to Done in v1.0.0 Release Feb 2, 2022
@jacobagilbert jacobagilbert deleted the multiple-recordings-ext-2021 branch January 17, 2023 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants