Pilot conversion of MIxS TSV into a GBIF/OBIS system #53

Open
pbuttigieg opened this issue Mar 16, 2021 · 13 comments
Labels
DwC Mapping The issue is about the mapping of a MIxS term to a Darwin Core term or terms. DwC-MIxS TG This issue is related to the work of the Sustainable DarwinCore MIxS Interoperability Task Group

Comments

@pbuttigieg
Collaborator

@timrobertson100 @thomasstjerne @cmungall @pieterprovoost @79-6d

Following our meeting today, would you mind scoping out how you'll test exchanging a MIxS TSV for an attempt to auto-convert it into a DwC Archive?

@pbuttigieg pbuttigieg added DwC-MIxS TG This issue is related to the work of the Sustainable DarwinCore MIxS Interoperability Task Group DwC Mapping The issue is about the mapping of a MIxS term to a Darwin Core term or terms. labels Mar 16, 2021
@pieterprovoost
Collaborator

@79-6d Do you have something that is in MIxS already? If so, I suggest we work with that. It would be great if we could come up with a notebook that does most of the conversion automatically based on what's in the spreadsheet.
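
As a starting point for such a notebook, here is a minimal sketch of what an automatic conversion could look like. It is not the task group's mapping: the MIxS-to-DwC pairs in `MIXS_TO_DWC`, the columns listed in `EXTENSION_COLUMNS`, the `sample_name` identifier column, and the assumption that `lat_lon` holds two space-separated decimal values are all placeholders chosen for illustration. The function names are hypothetical too.

```python
import csv
import io

# Placeholder mapping from MIxS column names to Darwin Core terms.
# These pairs are assumptions for illustration, not an agreed mapping.
MIXS_TO_DWC = {
    "collection_date": "eventDate",
    "geo_loc_name": "locality",
}

# MIxS columns copied unchanged into the extension table (also an assumption).
EXTENSION_COLUMNS = ["target_gene", "seq_meth", "env_broad_scale"]


def split_lat_lon(value):
    """Split a 'lat lon' string such as '-62.1 -58.4' into two strings."""
    parts = value.split()
    return (parts[0], parts[1]) if len(parts) == 2 else ("", "")


def convert_mixs_tsv(text, id_column="sample_name"):
    """Return (core_rows, extension_rows) built from tab-separated MIxS text."""
    core_rows, extension_rows = [], []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        record_id = row[id_column]
        lat, lon = split_lat_lon(row.get("lat_lon") or "")
        core = {"occurrenceID": record_id,
                "decimalLatitude": lat,
                "decimalLongitude": lon}
        for mixs_name, dwc_name in MIXS_TO_DWC.items():
            core[dwc_name] = row.get(mixs_name) or ""
        core_rows.append(core)

        # The extension rows point back to the core record by its identifier.
        extension = {"occurrenceID": record_id}
        for name in EXTENSION_COLUMNS:
            extension[name] = row.get(name) or ""
        extension_rows.append(extension)
    return core_rows, extension_rows


def to_tsv(rows):
    """Serialise a list of dicts as tab-separated text with a header row."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Calling `convert_mixs_tsv(text)` on the spreadsheet exported as tab-separated text gives two lists of rows, and `to_tsv` turns each into a table that could then be mapped in an IPT. Keeping the mapping in a plain dictionary means it can be swapped for the real term mapping once that is agreed.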

@ymgan
Collaborator

ymgan commented Mar 17, 2021

Just realised that the dataset we have is from MIxS v2. I will write to our data provider to ask if she has something from MIxS v5.

Great idea about working on it in a notebook!

@thomasstjerne
Collaborator

For reference, here are two examples of test datasets in the GBIF test environment that use the DNA derived data extension.
One is marine and one is terrestrial; the latter uses the extended measurement or fact extension in addition to the DNA extension.
Beware that these were prepared as tests and have very minimal EML metadata.

SMHI Baltic Picoplankton (Marine) - about: https://www.ebi.ac.uk/ena/browser/view/PRJEB12362

  1. Dataset in GBIF
  2. A Sampling Event (scroll down to see taxonomic breakdown)
  3. An occurrence (scroll down to see the MIxS data)

Insect mobile (Terrestrial) - about: https://www.biorxiv.org/content/10.1101/2020.11.19.389742v1

  1. Dataset in GBIF
  2. A Sampling Event (scroll down to see taxonomic breakdown)
  3. An occurrence (scroll down to see the MIxS data)

@ymgan
Collaborator

ymgan commented Mar 17, 2021

Awesome!! Would you mind sharing the link to the repo showing how the conversion is done, if that's available?

How would the EML part be addressed? Do you get the data provider to fill in that information, or is it something that will be extracted from the data?

@thomasstjerne
Collaborator

@79-6d both of these datasets were uploaded by the publishers through IPTs.

The extension is of course not in production, but IPTs running in test mode detect the extension and can map to it.
The EML can also be filled in through a form in the IPT - I guess they just didn't spend time on this yet, as this is pre-production.

@msweetlove

I found a suitable marine 'omics dataset that we can use to look at the conversion from MIxS to DwC in our biodiversity.aq/POLA3R database. It is a microbial dataset where the authors used 16S rDNA amplicon sequencing to profile the community composition of Bacteria and Archaea in marine sediments. I think it's a good representative of a typical (small) microbial DNA-based dataset.

Here is the .xlsx file showing how we formatted it as MIxS; I adapted it to MIxS v5:
MIxS_testdataset_PRJNA335729.xlsx

This dataset was published in:
Franco, D. C., Signori, C. N., Duarte, R. T., Nakayama, C. R., Campos, L. S., & Pellizari, V. H. (2017). High prevalence of Gammaproteobacteria in the sediments of Admiralty Bay and North Bransfield Basin, Northwestern Antarctic Peninsula. Frontiers in Microbiology, 8, 153.

Here is the link to the IPT: https://ipt.biodiversity.aq/resource?r=antarctic_marine_sediment_microbes

The sequences can be retrieved from here: https://www.ebi.ac.uk/ena/browser/view/PRJNA335729

@thomasstjerne
Collaborator

Thanks @msweetlove ,
I can't open the xlsx file in either Excel or Google Sheets. Can you check that the file is as expected?

@msweetlove

I can open it just fine... Here is a version saved as tab separated txt, does this work?

MIxS_testdataset_PRJNA335729.txt

The original is a csv, but for some reason GitHub doesn't allow that format.
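
For anyone who wants to confirm that a tab-separated export like this is intact before trying a conversion, a quick check along these lines could help. This is only a sketch: `check_tsv` is a hypothetical helper, and the UTF-8 encoding in the commented usage is an assumption about the file.

```python
import csv
import io


def check_tsv(text):
    """Return (header, problems) for tab-separated text.

    problems lists (row_number, field_count) for every row whose number of
    fields differs from the header's. Rows are counted from 1, so the header
    is row 1 and the first data row is row 2.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return [], []
    header = rows[0]
    problems = [(number, len(row))
                for number, row in enumerate(rows[1:], start=2)
                if len(row) != len(header)]
    return header, problems


# Usage with the attached file (encoding is an assumption):
# with open("MIxS_testdataset_PRJNA335729.txt", encoding="utf-8") as handle:
#     header, problems = check_tsv(handle.read())
```

An empty `problems` list means every row has the same number of fields as the header, which is usually the first thing to rule out when a spreadsheet tool refuses to open a file.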

@thomasstjerne
Collaborator

Yes - the txt file works. Thanks!

@thomasstjerne
Collaborator

Is there any taxonomic annotation of the sequences available?
The paper says:

"At the phylum level, all OTUs could be classified and belonged to 22 formally described bacterial phyla and 18 candidate phyla"

But I can't seem to find any information on the classification step (database, thresholds, etc.).

@msweetlove

I don't have any more information than that... Like most microbial studies, these authors only provide the raw sequence data, because the methods to bin/cluster sequences, detect errors and taxonomically annotate sequences vary widely from lab to lab and the techniques evolve very fast over time... You can always try contacting the authors to ask whether they still have the original OTU tables, or use your own pipeline to annotate the sequences, or you can also request an analysis at MGnify: https://www.ebi.ac.uk/metagenomics/

@thomasstjerne
Collaborator

OK, thanks. I just wanted to be sure that I didn't overlook anything.

@thomasstjerne
Collaborator

thomasstjerne commented Mar 25, 2021

@msweetlove the dataset is now in the GBIF test environment here:

  1. Dataset in GBIF
  2. A Sampling Event (scroll down to see taxonomic breakdown)
  3. An occurrence (scroll down to see the MIxS data)

This one is using the extension from this repo with the MIxS IRIs.

5 participants