Adding Genbase sequences to usher #337

Closed
xz-keg opened this issue Apr 10, 2023 · 6 comments

Comments

@xz-keg

xz-keg commented Apr 10, 2023

Starting in March 2023, China began uploading most of its sequences not to GISAID but to a self-developed platform called GenBase. The sequences are free to download from NGDC by selecting the dataset "GenBase".

GenBase

I wonder whether the sequences in GenBase could be included in the UShER database.

@russcd
Collaborator

russcd commented Apr 10, 2023

@AngieHinrichs can you take a look? Just glancing at the database now, it looks relatively straightforward.

Two considerations that we should look into:

  1. How much data is unique to GenBase?
  2. Are any of these data also posted elsewhere, so that additional deduplication would be required?

@AngieHinrichs
Contributor

Yes, it does look really straightforward and it's easy to form a URL to download metadata for all GenBase sequences. However, at the moment I am not able to download any sequences from the website; even if I select only one or two sequences, I'm getting an empty file. I will try to contact the operators of ngdc.cncb.ac.cn.

@xz-keg
Author

xz-keg commented Apr 11, 2023

> Yes, it does look really straightforward and it's easy to form a URL to download metadata for all GenBase sequences. However, at the moment I am not able to download any sequences from the website; even if I select only one or two sequences, I'm getting an empty file. I will try to contact the operators of ngdc.cncb.ac.cn.

It seems there was a bug yesterday, and it has been fixed today.

However, there seems to be an upper limit of 2000 sequences per download.

@AngieHinrichs
Contributor

Yes, manual download with a limit of 2000 is working for me too today. I will download sequences that way for now, but I hope there is an automated solution. The site has some download files but they are either outdated (2022 & earlier) or mostly GenBank with very few GenBase sequences, as far as I can tell. I emailed the Contact addresses for the search page and for GenBase asking if there could be compressed fasta downloads or an API to query sequences.

@xz-keg
Author

xz-keg commented Apr 13, 2023

> Yes, manual download with a limit of 2000 is working for me too today. I will download sequences that way for now, but I hope there is an automated solution. The site has some download files but they are either outdated (2022 & earlier) or mostly GenBank with very few GenBase sequences, as far as I can tell. I emailed the Contact addresses for the search page and for GenBase asking if there could be compressed fasta downloads or an API to query sequences.

The system works very poorly, and I can't find an API either.

Select "GenBase" in the database option to show all GenBase sequences. Sequences that have also been submitted to other platforms have a "related_ID" giving their ID on GISAID or GenBank. So if you sort by related_ID and exclude sequences with any related_ID other than None, you get the sequences that are unique to GenBase.
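A minimal sketch of that filter, assuming the metadata has been exported as a tab-separated file. The column names `accession` and `related_ID`, the empty-value conventions (`None` or blank), and the sample rows are all assumptions for illustration, not the actual NGDC export format.

```python
import csv
import io


def unique_genbase_rows(tsv_text, related_col="related_ID"):
    """Keep rows whose related-ID field is blank or the literal 'None'.

    The column name and the 'None' convention are assumptions about the export.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    unique = []
    for row in reader:
        related = (row.get(related_col) or "").strip()
        if related in ("", "None"):
            unique.append(row)
    return unique


# Made-up rows: every accession and ID below is a placeholder.
sample = (
    "accession\trelated_ID\n"
    "C_EXAMPLE0001.1\tNone\n"
    "C_EXAMPLE0002.1\tEPI_ISL_0000000\n"
    "C_EXAMPLE0003.1\t\n"
)

if __name__ == "__main__":
    for row in unique_genbase_rows(sample):
        print(row["accession"])
```

Filtering on the field directly avoids having to sort the table first; sorting is only needed when doing this by hand in the web interface.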

However, it seems there is still no way to query the "create date" of sequences, only a "view the latest data" option that shows the sequences with the most recent create date.

After the initial build, either download the "view the latest data" results daily, or download all GenBase sequences weekly and de-duplicate against the previous week's result. I guess these two are the best options in the current situation...
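The weekly option could be sketched roughly as follows, assuming both downloads are FASTA text and that the first token of each header line is a stable sequence ID. That header layout is an assumption; the real GenBase headers may need different parsing.

```python
def parse_fasta(text):
    """Return {record_id: sequence}; the ID is assumed to be the first header token."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {rid: "".join(chunks) for rid, chunks in records.items()}


def new_records(previous_fasta, current_fasta):
    """Records present in this week's download but not in last week's."""
    previous_ids = set(parse_fasta(previous_fasta))
    current = parse_fasta(current_fasta)
    return {rid: seq for rid, seq in current.items() if rid not in previous_ids}
```

Comparing by ID rather than by sequence content keeps identical genomes from different samples as separate records.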


@AngieHinrichs
Contributor

I emailed the contact listed for GenBase on https://ngdc.cncb.ac.cn/databasecommons/database/id/8197 and he replied that there is an API to fetch one GenBase sequence at a time (e.g. https://ngdc.cncb.ac.cn/genbase/api/file/fasta?acc=C_AA004835.1). So I wrote a script (in production for the first time in today's build) that fetches metadata for all sequences in GenBase, GWH, CNGBdb and NMDC from CNCB, compares it to the previous day's metadata, and fetches new GenBase sequences one at a time with delays so I don't DoS the server. I updated my script that combines sequences from all sources, identifies sequences not already in the tree that pass quality filters, and makes input for UShER, to also check for new CNCB sequences. The deduplication could still use a little work -- there are some sequences in both GISAID and GenBase that may appear twice in the tree.
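For anyone wanting to do something similar, here is a rough sketch of the "compare to the previous day, then fetch new sequences one at a time with delays" step. It is not the production script: the function names, the 5-second delay, and the timeout are assumptions, and only the FASTA endpoint URL comes from the reply quoted above.

```python
import time
import urllib.parse
import urllib.request

# Endpoint as quoted above; everything else in this sketch is illustrative.
FASTA_URL = "https://ngdc.cncb.ac.cn/genbase/api/file/fasta"


def fetch_fasta(acc, timeout=60):
    """Download one GenBase sequence as FASTA text."""
    url = FASTA_URL + "?" + urllib.parse.urlencode({"acc": acc})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def fetch_new(today_accs, yesterday_accs, fetch=fetch_fasta,
              delay_s=5.0, sleep=time.sleep):
    """Fetch accessions listed today but not yesterday, pausing between requests.

    delay_s is an assumed value; pick whatever the server operators are happy with.
    """
    new_accs = sorted(set(today_accs) - set(yesterday_accs))
    results = {}
    for i, acc in enumerate(new_accs):
        if i > 0:
            sleep(delay_s)  # be polite: one request at a time, with a gap
        results[acc] = fetch(acc)
    return results
```

The `fetch` and `sleep` parameters are there so the logic can be exercised without touching the server, for example by passing `fetch=lambda acc: ">" + acc` and a no-op `sleep`.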

If all goes well, then hopefully by tomorrow the 2023-04-13 tree will be available including GenBase sequences.
