Clean up incoming ID3C data #32

Open
trvrb opened this issue Dec 15, 2019 · 11 comments

@trvrb
Member

trvrb commented Dec 15, 2019

@joverlee521 ---

There is a small handful of upstream fixes we need to make to the shipping views.

  1. The date field in v2/shipping/augur-build-metadata was formatted as 2019-09-25T19:37:35.483+00:00. This should just read 2019-09-25. For the time being, I've fixed this on the augur side here: https://github.com/seattleflu/augur-build/blob/master/scripts/download_sfs_metadata.py#L25
  2. Our strain names should match those used by the rest of the world rather than just being a long UUID. I'd like to match the existing format as closely as possible. Strains in the US are geographically labeled by state, like B/Washington/2/2019. This means that sample UUID fe1a1206-21ef-45ff-8be0-9d7643eef879 would be strain A/Washington/43eef879/2019, i.e. taking A or B depending on whether the virus is flu A or flu B, and taking the year from the date.
  3. We need neighborhood (within Seattle proper) / puma (outside Seattle proper) for location. I believe that @kairstenfay may have started on this already in ID3C.
  4. Include age_range_coarse as a field in the shipping view.
  5. Restrict rows in shipping.augur-build-metadata to only those samples that have sequencing data.

Edited to update format for strain name in item 2 and to include items 4 and 5.
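
Items 1 and 2 above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names `to_date` and `strain_name` are hypothetical and do not come from augur-build or ID3C, and the state is passed in directly rather than derived from location data.

```python
from datetime import datetime


def to_date(timestamp: str) -> str:
    """Reduce an ISO 8601 timestamp to its YYYY-MM-DD date part."""
    # fromisoformat also validates the input, unlike a plain timestamp[:10]
    return datetime.fromisoformat(timestamp).date().isoformat()


def strain_name(flu_type: str, state: str, sample_uuid: str, date: str) -> str:
    """Build a strain name from flu type, state, sample UUID and date."""
    # Use the last 8 hex digits of the UUID as the disambiguation string
    short_id = sample_uuid.replace("-", "")[-8:]
    # The year is the first four characters of a YYYY-MM-DD date
    year = date[:4]
    return f"{flu_type}/{state}/{short_id}/{year}"


date = to_date("2019-09-25T19:37:35.483+00:00")
print(date)  # 2019-09-25
print(strain_name("A", "Washington", "fe1a1206-21ef-45ff-8be0-9d7643eef879", date))
# A/Washington/43eef879/2019
```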

@tsibley
Member

tsibley commented Dec 16, 2019

> This means that sample UUID fe1a1206-21ef-45ff-8be0-9d7643eef879 would be strain A/Washington/SFS-43eef879/2019

I would really prefer to keep the entire UUID in the strain name. The whole reason for using the UUIDs in the first place is that they are universally unique; a property that we lose if we truncate them. If we don't use the UUID, then we've lost all its benefits and shouldn't have used them from the start.

I would also caution against using opaque acronyms like SFS, since they're meaningless outside of the study. Can we use something like

A/Washington/seattleflu.org/fe1a1206-21ef-45ff-8be0-9d7643eef879/2019

instead?

@trvrb
Member Author

trvrb commented Dec 17, 2019

@tsibley --- I'm afraid I don't agree. We should aim to be as consistent as possible with how the entire flu field treats strain names. It will be super weird if there are canonical names like B/Washington/2/2019 while we name things like B/Washington/seattleflu.org/fe1a1206-21ef-45ff-8be0-9d7643eef879/2019. It's far outside standard naming.

The strain name itself is meant to be unique, but short enough to be usable. Even A/Singapore/Infimh-16-0019/2016 was quite unwieldy. Keep in mind that each strain is tied to a unique accession provisioned by GenBank or by GISAID that gives detailed provenance information. Strain names are meant to:

  1. Provide broad virus information, i.e. A vs B
  2. Provide broad geo information, i.e. Washington
  3. Provide a short disambiguation string (traditionally 1, 2, 3)
  4. Provide broad time information, i.e. 2019

(Field order is important too; extra slashes are non-standard and would break parsing.)

I might even say to just name this as A/Washington/43eef879/2019. There is no way that the 8-digit hex will conflict with the CDC's 1, 2, 3 naming. (The SFS- was there for additional disambiguation, not for provenance)
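
To illustrate why extra slashes matter, here is a minimal sketch of a parser that expects exactly the four fields listed above. The `Strain` type and `parse_strain` function are hypothetical names made up for this example, and the checks assume human flu A/B names of the form discussed in this thread; they are not how any particular tool parses strain names.

```python
from typing import NamedTuple


class Strain(NamedTuple):
    virus_type: str   # "A" or "B"
    location: str     # e.g. "Washington"
    identifier: str   # short disambiguation string
    year: int         # e.g. 2019


def parse_strain(name: str) -> Strain:
    """Split a strain name into its four slash-separated fields."""
    parts = name.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {name!r}")
    virus_type, location, identifier, year = parts
    if virus_type not in ("A", "B"):
        raise ValueError(f"unknown virus type {virus_type!r} in {name!r}")
    if not (len(year) == 4 and year.isdigit()):
        raise ValueError(f"expected a 4-digit year, got {year!r} in {name!r}")
    return Strain(virus_type, location, identifier, int(year))


print(parse_strain("A/Washington/43eef879/2019"))
```

Under these assumptions, a name with an extra path segment such as B/Washington/seattleflu.org/fe1a1206-21ef-45ff-8be0-9d7643eef879/2019 has five fields and raises `ValueError`.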

@tsibley
Member

tsibley commented Dec 18, 2019

> …but short enough to be usable. Even A/Singapore/Infimh-16-0019/2016 was quite unwieldy.

Ok! It seems like I don't understand how these names are used in practice, if that's considered unwieldy. (It doesn't, from my naive, outside perspective, seem unwieldy to me.)

Are these names regularly spoken, as opposed to copied/programmatically processed?

@trvrb
Member Author

trvrb commented Dec 18, 2019

Yes. Regularly spoken aloud and used to point people around a tree or around a titer table.

If you'd like to keep the UUID, we can provide it as a "sample ID" in the flat-file data download, paired with the strain name.

@tsibley
Member

tsibley commented Dec 18, 2019

I think it would be smart to keep the full UUID linked one way or another. It is an identifier equivalent in utility to the GenBank accession.
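
One way to keep that link might look like the sketch below: a tab-separated table pairing each strain name with its full UUID, plus a check that flags UUIDs whose last 8 hex digits collide. The column names `strain` and `sample_id` and all function names are assumptions for illustration only. The collision check is deliberately conservative, since it ignores the type, state and year fields that also distinguish strain names.

```python
import csv
import io
from collections import defaultdict


def short_id(sample_uuid: str) -> str:
    """Last 8 hex digits of a UUID, as used in the short strain name."""
    return sample_uuid.replace("-", "")[-8:]


def find_collisions(sample_uuids):
    """Map each short ID shared by more than one distinct UUID to those UUIDs."""
    by_short = defaultdict(set)
    for sample_uuid in sample_uuids:
        by_short[short_id(sample_uuid)].add(sample_uuid)
    return {s: sorted(u) for s, u in by_short.items() if len(u) > 1}


def id_table(pairs) -> str:
    """Render (strain, sample UUID) pairs as a TSV with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["strain", "sample_id"])
    writer.writerows(pairs)
    return out.getvalue()


uuid = "fe1a1206-21ef-45ff-8be0-9d7643eef879"
print(find_collisions([uuid]))  # {} when no short IDs collide
print(id_table([("A/Washington/43eef879/2019", uuid)]), end="")
```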

@trvrb
Member Author

trvrb commented Dec 31, 2019

One additional request here: just using age_category, e.g. adult vs child, is too coarse for analysis. I'd like to additionally have age_range_coarse, e.g. ["5 years","18 years"). I think age_range_coarse will be the right resolution for the genomic work, and we won't be able to use the fine age range.

I've added this as request number 4 above.
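
For downstream scripts, a range written like the example above could be read with a small parser such as the sketch below. It assumes the value always has exactly the form shown, with two bounds in whole years and bracket characters marking inclusive or exclusive ends. Unbounded or otherwise different ranges are not handled, and the function names are hypothetical.

```python
import re

# Matches e.g. ["5 years","18 years") with optional double quotes around each bound
RANGE_RE = re.compile(r'^([\[(])"?(\d+) years?"?,"?(\d+) years?"?([\])])$')


def parse_age_range(text: str):
    """Return (lower, upper, lower_inclusive, upper_inclusive) in years."""
    match = RANGE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized age range: {text!r}")
    open_bracket, lower, upper, close_bracket = match.groups()
    return int(lower), int(upper), open_bracket == "[", close_bracket == "]"


def in_age_range(text: str, age: int) -> bool:
    """Check whether an age in years falls inside the parsed range."""
    lower, upper, lower_inc, upper_inc = parse_age_range(text)
    above = age >= lower if lower_inc else age > lower
    below = age <= upper if upper_inc else age < upper
    return above and below


print(parse_age_range('["5 years","18 years")'))  # (5, 18, True, False)
```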

@trvrb
Member Author

trvrb commented Jan 2, 2020

Yet one more request. Can we restrict rows in shipping.augur-build-metadata to only those samples that have sequencing data? There are two reasons for this:

  1. We want to protect data privacy in these shipping views, so rather than downloading a dataset of ~20k rows with all encounters, it's safer to download a dataset of ~2k rows with just encounters that were sequenced.
  2. Dealing with the extra large metadata table is somewhat unwieldy given how scripts like select_strains.py are written.

I've added this as request number 5 above.
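
Until the view itself is restricted, a downstream stopgap could filter the metadata against the IDs present in the sequence file, as in this sketch. It assumes the metadata rows are dictionaries with a `strain` key that matches the first word of each FASTA header. Both that key and the function names are assumptions for illustration, not the behavior of select_strains.py. It also does not address the privacy point in item 1, since the full table is still downloaded.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first word after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def restrict_to_sequenced(rows, sequenced, key="strain"):
    """Keep only the metadata rows whose ID has sequencing data."""
    return [row for row in rows if row.get(key) in sequenced]


fasta = [">A/Washington/43eef879/2019", "ATGAAGACT", ">B/Washington/2/2019", "ATGAAGGCA"]
metadata = [
    {"strain": "A/Washington/43eef879/2019", "date": "2019-09-25"},
    {"strain": "A/Washington/00000000/2019", "date": "2019-10-01"},
]
print(restrict_to_sequenced(metadata, fasta_ids(fasta)))
```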

@kairstenfay

> Yet one more request. Can we restrict rows in shipping.augur-build-metadata to only those samples that have sequencing data?

@trvrb do you still only want the new shipping.metadata_for_augur_build to include samples with sequencing data? If so, is there a separate desire for a view similar to what Mike requested that contains all samples regardless of encounter or sequence data?

kairstenfay added a commit to seattleflu/id3c-customizations that referenced this issue Jan 22, 2020
In a [GitHub
issue](seattleflu/augur-build#32), Trevor
requested that encountered date no longer be formatted as a timestamp
but rather a date in YYYY-MM-DD format for the
`shipping.metadata_for_augur_build_v*` views.
kairstenfay added a commit to seattleflu/id3c-customizations that referenced this issue Jan 22, 2020
In a [GitHub
issue](seattleflu/augur-build#32), Trevor
requested that encountered date no longer be formatted as a timestamp
but rather a date in YYYY-MM-DD format for the
`shipping.metadata_for_augur_build_v2` view.
kairstenfay added a commit to seattleflu/id3c-customizations that referenced this issue Jan 22, 2020
In seattleflu/augur-build#32, Trevor
requested that encountered date no longer be formatted as a timestamp
but rather a date in YYYY-MM-DD format for the
`shipping.metadata_for_augur_build_v2` view.
kairstenfay added a commit to seattleflu/id3c-customizations that referenced this issue Jan 22, 2020
In seattleflu/augur-build#32, Trevor requested
that we include `age_range_coarse` as a column in the view.
kairstenfay added a commit to seattleflu/id3c-customizations that referenced this issue Jan 23, 2020
In seattleflu/augur-build#32, Trevor
requested that encountered date no longer be formatted as a timestamp
but rather a date in YYYY-MM-DD format for the
`shipping.metadata_for_augur_build_v2` view.
kairstenfay added a commit to seattleflu/id3c-customizations that referenced this issue Jan 23, 2020
In seattleflu/augur-build#32, Trevor requested
that we include `age_range_coarse` as a column in the view.
tsibley pushed a commit to seattleflu/id3c-customizations that referenced this issue Jan 25, 2020
In seattleflu/augur-build#32, Trevor
requested that encountered date no longer be formatted as a timestamp
but rather a date in YYYY-MM-DD format for the
`shipping.metadata_for_augur_build_v2` view.
tsibley pushed a commit to seattleflu/id3c-customizations that referenced this issue Jan 25, 2020
In seattleflu/augur-build#32, Trevor requested
that we include `age_range_coarse` as a column in the view.
kairstenfay added a commit to seattleflu/id3c-customizations that referenced this issue Jan 28, 2020
In seattleflu/augur-build#32, Trevor
requested that encountered date no longer be formatted as a timestamp
but rather a date in YYYY-MM-DD format for the
`shipping.metadata_for_augur_build_v2` view.

Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
kairstenfay added a commit to seattleflu/id3c-customizations that referenced this issue Jan 28, 2020
In seattleflu/augur-build#32, Trevor requested
that we include `age_range_coarse` as a column in the view.
@kairstenfay

> There are a small handful of upstream fixes we need to shipping views.
>
> 1. The `date` field in `v2/shipping/augur-build-metadata` was formatted as `2019-09-25T19:37:35.483+00:00`. This should just read `2019-09-25`. I've fixed this on the augur side here: https://github.com/seattleflu/augur-build/blob/master/scripts/download_sfs_metadata.py#L25 for the time being.

This is now fixed on master.

> 4. Include `age_range_coarse` as a field in the shipping view.

This column is now present on master.
