Databases to convert to tables #2
Are you planning to add each resource (i.e. its data) to the package?
That was my plan, if it is feasible without violating licenses. Parsing, for example, the JSON from LipidBlast is very slow and people need to download a 1.6GB file. It is not very clear to me what the license situation is.
The idea is to match compounds by (adduct) m/z, right?
Right. I was hoping not to have DB-specific columns though, to be able to easily mix and match.
An example to explain the "hide the internals" idea: this is the concept we were following in the
An example here would be to have something like a
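A minimal sketch of the "hide the internals" idea; the class and accessor names below are illustrative, not the package's actual API. Callers get compound annotations via an accessor and never see how the data are stored.

```r
library(methods)

## Hypothetical CompDb class: the storage backend (a data.frame here,
## SQLite later) stays hidden behind the compounds() accessor.
setClass("CompDb", representation(data = "data.frame"))

setGeneric("compounds", function(object, columns) standardGeneric("compounds"))
setMethod("compounds", "CompDb",
          function(object, columns = colnames(object@data)) {
    ## backend-specific extraction happens here; callers only see a data.frame
    object@data[, columns, drop = FALSE]
})

db <- new("CompDb", data = data.frame(compound_id = "HMDB0000001",
                                      exactmass = 169.0851))
compounds(db, "exactmass")
```

Swapping the internal representation later would then not change any user-facing code.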
Regarding HMDB parsing: I did implement a simple parser to extract fields from HMDB's XML file(s):
Yeah, I saw that just now when poking around your repo. I actually also wrote one some years ago that I was planning to add. I have a suspicion that yours is smarter though...
If you don't enforce column names in databases, won't it become difficult to mix them if, for example, you want data from HMDB and LipidMaps for the annotation?
Re: code from
Re: enforcing column names - let's wait for your use case. I agree that common column names should be used.
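To illustrate why common column names matter, here is a sketch assuming an agreed-upon schema; the column names ("compound_id", "name", "exactmass") are illustrative, not a fixed decision from this thread.

```r
## Two tables from different databases using the same shared column names.
hmdb <- data.frame(compound_id = "HMDB0000001", name = "1-Methylhistidine",
                   exactmass = 169.0851)
lmaps <- data.frame(compound_id = "LMFA01010001", name = "Palmitic acid",
                    exactmass = 256.2402)

## Because the column names match, mixing sources is a plain row-bind.
combined <- rbind(hmdb, lmaps)
```

Without enforced names, every consumer of the tables would need per-database mapping code before a bind like this works.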
Are you already working on the HMDB import function? Otherwise I could do that to start getting my hands dirty...
Nope, I am working on PubChem, so that would be great.
@wilsontom Thanks! But you don't supply functions to actually generate the HMDB table? I'd like to have that in the package too, so it is easy to update. I basically have PubChem working. Trying to generate the table now. It takes a while though, since PubChem is enormous. I wonder what the final size is gonna look like.
HMDB parsing is also on its way - I've just updated to use the
I have been trying to write something feasible to handle PubChem using SQLite intermediates. The problem is that it is enormous: 130 million structures, supposedly. Holding the final table in memory requires about 60GB by my estimates. An RDS file would be ~7.5GB, an SQLite file ~40GB. So, two problems:
With an SQLite file, as far as I understand, you could subset it before it is read into R. I guess that might make it useful for something. Thoughts?
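A sketch of the "subset before reading into R" idea, using an in-memory database as a stand-in for the huge on-disk PubChem SQLite file; the table and column names are illustrative.

```r
library(DBI)
library(RSQLite)

## In-memory stand-in for the (43GB) on-disk file.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "compounds",
             data.frame(compound_id = c("CID1", "CID2"),
                        exactmass = c(169.0851, 256.2402)))

## SQLite evaluates the WHERE clause on disk; only the matching rows
## ever cross into R's memory.
hits <- dbGetQuery(con,
    "SELECT * FROM compounds WHERE exactmass BETWEEN 169.0 AND 170.0")
dbDisconnect(con)
```

For m/z-based annotation, a mass-window query like this would keep the working set tiny even against 130 million rows (given an index on the mass column).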
- Add the generate_hmdb_tbl function.
- Add related documentation and unit tests.
The HMDB is added (see #10).
Re PubChem: for these large files a tibble-based approach is not feasible. SQL might do it - or on-the-fly access? Do they have a web API that could be queried? The approach I have in mind might also work here: define a
Regarding the adducts - I had a thought about that too: I wouldn't create all adducts for all compounds that are in the database, but rather go the other way round: calculate adducts from the identified chromatographic peaks instead. That would be more efficient, because supposedly there are always fewer peaks to annotate than compounds in the database.
Thanks for HMDB. Re PubChem: PubChem does have an API: https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html#_Toc458584424 Re: `CompoundDb <- generate_CompoundDb(dbs = c("HMDB", "LipidBlast"))`
Re adducts: Yes, you are right. That makes much more sense.
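The "calculate adducts from the peaks" direction could look like the sketch below. The adduct masses are the standard proton and sodium adduct masses; the peak values, tolerance and matching code are illustrative.

```r
## Instead of pre-computing adduct m/z for every database compound,
## derive candidate neutral masses from the measured peak m/z values.
adduct_mass <- c("[M+H]+" = 1.007276, "[M+Na]+" = 22.989218)

peak_mz <- c(170.0924, 192.0743)  # measured peaks (illustrative values)

## one candidate neutral mass per peak/adduct combination
candidates <- outer(peak_mz, adduct_mass, FUN = "-")
dimnames(candidates) <- list(paste0("peak", seq_along(peak_mz)),
                             names(adduct_mass))

## match candidates against database exact masses within a small tolerance
db_mass <- c(`1-Methylhistidine` = 169.0851)
hits <- which(abs(candidates - db_mass[1]) < 0.005, arr.ind = TRUE)
```

Here both peaks map to the same neutral mass (one as [M+H]+, one as [M+Na]+), and the work scales with the number of peaks rather than the number of database compounds.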
Re AnnotationHub:

```r
> library(AnnotationHub)
> ah <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%
snapshotDate(): 2017-10-24
> ## Look for a specific resource, like gene annotations from Ensembldb, in our
> ## case we could then search e.g. for "CompoundDb", "HMDB"
> query(ah, "EnsDb.Hsapiens.v90")
AnnotationHub with 1 record
# snapshotDate(): 2017-10-24
# names(): AH57757
# $dataprovider: Ensembl
# $species: Homo Sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2017-08-31
# $title: Ensembl 90 EnsDb for Homo Sapiens
# $description: Gene and protein annotations for Homo Sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
#   "Annotation", "90", "AHEnsDbs")
# retrieve record with 'object[["AH57757"]]'
> ## retrieve the resource:
> edb <- ah[["AH57757"]]
require("ensembldb")
loading from cache '/Users/jo//.AnnotationHub/64495'
```

This means users could fetch the resource they want from AnnotationHub. Now, I'd also like to keep the individual databases as separate resources. This means also that you can't query multiple resources at the same time, but that shouldn't be a problem, should it?
That sounds very reasonable. It is a bit of a learning curve for me with the S4 objects, so I hope you have patience with me while I try to wrap my head around that. I am wondering if it could also make sense to split the database stuff from the peak browser. It is getting more comprehensive than originally envisioned. For the last Q: it would probably be nice to be able to annotate with multiple databases at the same time. The objective is the browser in the end, where you'd want a single table with all the suggested annotations. Any idea what to do with the very big databases? The PubChem SQLite file ended up being 43GB.
Re annotation with multiple databases: one could annotate with a
Re very big database: the only thing I could think of here is to use a central
Ah, OK. If nothing prevents `bind_rows` then it is all good. Re very big database: I guess we can put that on the back-burner for now and just provide the parser. PubChem is rarely really useful for annotation anyway.
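Stacking per-database annotation results with dplyr's `bind_rows` could look like this; the result columns are illustrative. Columns present in only one source simply become NA, so database-specific extras don't block the merge.

```r
library(dplyr)

## Sketch: annotate against each database separately, then stack the results.
hits_hmdb <- data.frame(compound_id = "HMDB0000001",
                        name = "1-Methylhistidine",
                        source = "HMDB")
hits_lm <- data.frame(compound_id = "LMFA01010001",
                      name = "Palmitic acid",
                      core = "Fatty Acyls",   # LipidMaps-specific column
                      source = "LipidMaps")

## bind_rows() aligns by column name; "core" is NA for the HMDB row.
all_hits <- bind_rows(hits_hmdb, hits_lm)
```

The `source` column keeps track of which database each suggested annotation came from, which is what the browser's single combined table would need.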
Thinking it all over - eventually that might not be a bad idea. I could focus on the database/data import stuff (with your help) and you could focus on the annotation, matching and browsing stuff. Pros for splitting:
@stanstrup, what do you think?
In the end this is probably the most efficient way to do this, so go ahead if you want.
OK, I'll make a repo and add you as a collaborator.
Do you want to move this issue and the other db-related ones using https://github-issue-mover.appspot.com?
Or we just link to this issue? Whatever you prefer.
This issue was moved to rformassspectrometry/CompoundDb#6 |
Functions added to package:
License situation clarified.
Please suggest.