
Databases to convert to tables #2

Closed · 4 of 12 tasks
stanstrup opened this issue Oct 19, 2017 · 28 comments

@stanstrup
Owner

stanstrup commented Oct 19, 2017

Functions added to package:

  • LipidMaps
  • LipidBlast
  • HMDB
  • MyCompoundDB
  • PhenolExplorer
  • PubChem. Too big? Not really useful?

License situation clarified

  • LipidMaps
  • LipidBlast - Confirmed CC BY. So OK with attribution.
  • HMDB
  • MyCompoundDB
  • PhenolExplorer
  • PubChem. Too big? Not really useful?

Please suggest.

@jorainer

Are you planning to add each resource (i.e. its data) to the package?

@stanstrup
Owner Author

stanstrup commented Oct 19, 2017

That was my plan, if it is feasible without violating licenses. Parsing, for example, the JSON from LipidBlast is very slow and people need to download a 1.6 GB file,
whereas the parsed table is only 1-2 MB in RDS format.

It is not very clear to me what the license situation is.
As far as I know, simple data cannot be copyrighted; for example, a simple table from a paper should always be copyright-free. But I am not sure what applies here.

@jorainer

The idea is to match compounds by (adduct) m/z, right?
So you'll have some columns (like mass, id and name) that are common and have to be present in all data resources, and you might have some resource-specific columns.
In that case I would change from a data.frame approach to an S4 class approach (see also issue #6). This would also hide internals (like the actual column names etc.) from the user.

@stanstrup
Owner Author

Right. I was hoping not to have DB-specific columns, though, to be able to easily mix and match.
What do you mean by hide internals?

@jorainer

An example to explain hiding the internals: this is the concept we were following in the AnnotationFilter and ensembldb packages:

  • define a common name for a filter or database attribute that the user is used to, such as genename.
  • define a filter that can be used to search in a (any) database for a certain gene by its name: GenenameFilter.
  • now, no matter which database the user is querying, they can always use the GenenameFilter to search for entries matching a certain gene name. The methods that access the data in a database have to translate it to the correct column name, so it does not matter whether the column in the database table is called gene_name, GeneName, genename, etc. The user doesn't have to bother with what the column might be called or use different column names across different databases.

An example here would be to have something like an InchiFilter that can be used to search for InChIs in the database...
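
As a rough illustration only (not the actual AnnotationFilter/ensembldb code; class, slot and column names here are made up), such a filter could be sketched in S4 like this:

## Minimal sketch of the filter idea; names are illustrative only.
setClass("InchiFilter",
         representation(value = "character", condition = "character"),
         prototype(condition = "=="))

InchiFilter <- function(value, condition = "==")
    new("InchiFilter", value = value, condition = condition)

## Each backend maps the filter to its own column name, so the user never
## needs to know whether the column is called "inchi", "InChI" or "inchi_code".
setGeneric("dbColumn", function(object, db) standardGeneric("dbColumn"))
setMethod("dbColumn", "InchiFilter", function(object, db)
    switch(db, hmdb = "inchi", lipidblast = "InChI", "inchi"))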

@jorainer

Regarding HMDB parsing: I did implement a simple parser to extract fields from HMDB's XML file(s):
https://github.com/jotsetung/xcmsExtensions/blob/master/R/hmdb-utils.R use whatever you want/need.

@stanstrup
Owner Author

stanstrup commented Oct 20, 2017

Yeah, I saw that just now when poking around your repo. I actually also wrote one some years ago that I was planning to add. I have a suspicion that yours is smarter though...
So should I eventually import from your package or copy?

@stanstrup
Owner Author

If you don't enforce column names in the databases, won't it become difficult to mix them if, for example, you want data from both HMDB and LipidMaps for the annotation?

@jorainer

Re: code from xcmsExtensions, please copy what you need - I won't update/use that package anymore. Yours will be much better!

Re: enforcing column names - let's wait for your use case. I agree that common column names should be used.

@jorainer

Are you already working on the HMDB import function? Otherwise I could do that to start getting my hands dirty...

@stanstrup
Owner Author

Nope, I am working on pubchem so that would be great.

@wilsontom

Hi Jan,

I parsed HMDB into a package a while ago if it's any help, and a colleague did something similar for PubChem. Some of it may be of some use to you.

Thanks

Tom

@stanstrup
Owner Author

@wilsontom Thanks! But you don't supply functions to actually generate the HMDB table? I'd like to have that in the package too so it is easy to update.

I basically have PubChem working. Trying to generate the table now. It takes a while though since PubChem is enormous. I wonder what the final size is gonna look like.

@jorainer

HMDB parsing is also on its way; I've just updated it to use the xml2 package instead of the XML package.

@stanstrup
Owner Author

I have been trying to write something feasible to handle PubChem using SQLite intermediates. The problem is that it is enormous: supposedly 130 million structures. Holding the final table in memory would require about 60 GB by my estimates. An RDS file would be ~7.5 GB, an SQLite file ~40 GB.

So three problems:

  1. Would any of the usual solutions even allow such a large file?
  2. People cannot use it on a regular computer without loads of memory.
  3. Expanding it to adducts would balloon it even more.

With an SQLite file, as far as I understand, you could subset it before it is read into R (rough sketch below). I guess that might make it useful for something.
I still don't know if that is feasible, and I wouldn't know where to host a 40 GB SQLite file.
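
Just to make the subsetting idea concrete, a minimal sketch with DBI/RSQLite (the file name and the table/column names are made up):

library(DBI)
library(RSQLite)

## Pull only compounds within a given mass window instead of reading the
## whole (potentially 40 GB) table into memory.
con <- dbConnect(SQLite(), "pubchem.sqlite")
hits <- dbGetQuery(con,
                   "SELECT cid, name, monoisotopic_mass FROM compounds
                    WHERE monoisotopic_mass BETWEEN ? AND ?",
                   params = list(180.05, 180.07))
dbDisconnect(con)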

Thoughts?

jorainer added a commit to jorainer/PeakABro that referenced this issue Oct 23, 2017
- Add the generate_hmdb_tbl function.
- Add related documentation and unit tests.
@jorainer

The HMDB is added (see #10).

@jorainer

Re PubChem: for these large files a tibble-based approach is not feasible. SQL might do it - or on-the-fly access? Do they have a web API that could be queried? The approach I have in mind might also work here: define a CompoundDb S4 object and implement all of the required methods (select etc.) for it. For smaller databases these can access the internal SQLite database. We could then also implement a PubChemDb class that extends CompoundDb, and its select method could e.g. query the database online (if they provide an API) and return the results.
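
Very roughly, what I have in mind (just a sketch; the slot name and the compounds generic below are placeholders, not a finished API):

## Sketch of the class hierarchy idea; not a finished API.
setClass("CompoundDb", representation(dbfile = "character"))
setClass("PubChemDb", contains = "CompoundDb")

setGeneric("compounds", function(x, ...) standardGeneric("compounds"))

## Small/medium resources: read from the local SQLite file.
setMethod("compounds", "CompoundDb", function(x, ...) {
    con <- DBI::dbConnect(RSQLite::SQLite(), x@dbfile)
    on.exit(DBI::dbDisconnect(con))
    DBI::dbGetQuery(con, "SELECT * FROM compounds")
})

## PubChem: same user-facing call, but this method could instead query
## the PUG REST web API and return the results.
setMethod("compounds", "PubChemDb", function(x, ...) {
    stop("not implemented yet; would query the PubChem web API here")
})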

Regarding the adducts - I had a thought about that too: I wouldn't create all adducts for all compounds that are in the database but rather go the other way round and calculate adducts from the identified chromatographic peaks instead. That would be more efficient, because supposedly there are always fewer peaks to annotate than compounds in the database.
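
For example (a small sketch; the adduct mass shifts are the commonly tabulated positive-mode values and the peak m/z values are made up):

## Compute candidate neutral masses from observed peak m/z values for a few
## common positive-mode adducts, instead of expanding every database
## compound to all of its adducts.
adducts <- c("[M+H]+" = 1.007276, "[M+Na]+" = 22.989218, "[M+NH4]+" = 18.033823)
peak_mz <- c(181.0707, 203.0526)

candidate_mass <- sapply(adducts, function(shift) peak_mz - shift)
## Each column holds the neutral masses to look up in the compound database.
candidate_mass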

@stanstrup
Owner Author

Thanks for HMDB.

Re PubChem: PubChem does have an API: https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html#_Toc458584424
But it will be way too slow for this purpose, in my opinion. It would mean thousands of queries if you attempt to annotate a whole peak list. I specifically wanted to get away from the whole "look up one at a time" approach, so that once you have created your annotated peak list you can just browse around and see everything.
I suggest we change to SQLite databases in general so that larger databases can be accommodated in the same framework.
I say we supply the function to generate the PubChem SQLite database but don't host the file anywhere. To me, annotating with all of PubChem is not very useful anyway; you always get too many irrelevant hits.

Re: CompoundDb: I think it makes sense to have such an object.
Do you know if it is possible to cache generated data in the installed package folder?
What would be nice is if there was:

CompoundDb <- generate_CompoundDb(dbs=c("HMDB","LipidBlast"))

--> The LipidBlast database has not been generated (initialized is a better word?) yet. Please run generate_db_lipidblast to create a cached database

generate_CompoundDb would read the included SQLite files if they exist. If generate_db_lipidblast and friends could simply add the SQLite file for the specific database to the package folder, you'd only need to generate each one once.
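
Something along these lines is what I imagine (pure sketch; generate_CompoundDb and generate_db_lipidblast are the hypothetical names from above, and the cache location is only for illustration):

## Look for pre-generated SQLite files in a cache directory and point the
## user to the generator function for any that are missing.
generate_CompoundDb <- function(dbs = c("HMDB", "LipidBlast"),
                                cache = path.expand("~/.PeakABro")) {
    files <- file.path(cache, paste0(tolower(dbs), ".sqlite"))
    missing <- dbs[!file.exists(files)]
    if (length(missing))
        stop("The ", paste(missing, collapse = ", "),
             " database(s) have not been generated yet. Please run the ",
             "corresponding generate_db_* function to create a cached database.")
    files  ## in reality: build and return a CompoundDb object from these files
}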

Re adducts: Yes you are right. That makes much more sense.

@jorainer

Re CompoundDb and caching - no, I don't think it's possible to cache anything in the package folder. I would keep the annotation data separate from PeakABro. What I would propose is the following: in the initial phase, provide some CompoundDb objects/SQLite databases within dedicated annotation packages (e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0). In the longer run, distribute them via AnnotationHub; check the following:

> library(AnnotationHub)
> ah <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%

snapshotDate(): 2017-10-24
> ## Look for a specific resource, like gene annotations from Ensembldb, in our
> ## case we could then search e.g. for "CompoundDb", "HMDB"
> query(ah, "EnsDb.Hsapiens.v90")
AnnotationHub with 1 record
# snapshotDate(): 2017-10-24 
# names(): AH57757
# $dataprovider: Ensembl
# $species: Homo Sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2017-08-31
# $title: Ensembl 90 EnsDb for Homo Sapiens
# $description: Gene and protein annotations for Homo Sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
#   "Annotation", "90", "AHEnsDbs") 
# retrieve record with 'object[["AH57757"]]' 
> ## retrieve the resource:
> edb <- ah[["AH57757"]]
require("ensembldb")
loading from cache '/Users/jo//.AnnotationHub/64495'

This means users could fetch the resource they want from AnnotationHub and this will be cached locally. Does that make sense?

Now, I'd also like to keep separate CompoundDb objects/databases for different resources (e.g. HMDB, LipidBlast). Reason: that way you can version the resources and the corresponding packages (see e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0). Different resources will never have the same release cycles - and versioning annotation resources is key to reproducible research.

This also means that you can't query multiple resources at the same time, but that shouldn't be a problem, should it?

@stanstrup
Owner Author

That sounds very reasonable. It is a bit of a learning curve for me with the S4 objects, so I hope you have patience with me while I try to wrap my head around that.

I am wondering if it could also make sense to split the database stuff from the peak browser. It is getting more comprehensive than originally envisioned.

For the last question: it would probably be nice to be able to annotate with multiple databases at the same time. The objective, in the end, is the browser, where you'd want a single table with all the suggested annotations.

Any idea what to do with the very big databases? The PubChem SQLite file ended up being 43 GB.

@jorainer

Re annotation with multiple databases: one could annotate with a CompoundDb for each resource and bind_rows the results. Then you'll have the final table.
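
E.g. (a sketch; annotate_peaklist, peaks and the cmp_db_* objects are hypothetical placeholders):

library(dplyr)

## Annotate the same peak list against several CompoundDb objects and stack
## the per-resource results into one table for the browser.
res_hmdb  <- annotate_peaklist(peaks, cmp_db_hmdb)
res_lipid <- annotate_peaklist(peaks, cmp_db_lipidblast)

annotations <- bind_rows(hmdb = res_hmdb, lipidblast = res_lipid,
                         .id = "source")  ## keep track of the resource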

Re very big database: the only thing I could think of here is to use a central MySQL server hosted somewhere (eventually I could do that, not sure though). And here comes the power of the S4 objects: we simply define a PubChemDb object that extends CompoundDb. We would only have to implement the compounds or src_cmpdb (or the annotating function/method) accordingly. For the user it would be just like using a simple local SQLite-based CompoundDb object.

@stanstrup
Owner Author

stanstrup commented Oct 25, 2017

Ah, OK. If nothing prevents bind_rows then it is all good.
EDIT: now I understand: bind the results. Yes, that works too.

Re very big database: I guess we can put that on the back-burner for now and just provide the parser. PubChem is rarely really useful for annotation anyway.

@jorainer

jorainer commented Oct 26, 2017

I am wondering if it could also make sense to split the database stuff from the peak browser. It is getting more comprehensive than originally envisioned.

Thinking it all over - eventually that might not be such a bad idea. I could focus on the database/data import stuff (with your help) and you can focus on the annotation, matching and browsing stuff.

Pros for splitting:

  • keep the database and the creation of the database separate from the browser and annotator - easier to maintain.
  • we would not run into the Bioconductor style <-> tidyverse coding style clash. Something that I find very ugly would be e.g. create_CompoundDb, i.e. mixing CamelCase with snake_case.
  • You don't have to go through my pull requests ;)

Cons:

  • PeakABro will become very slim (is that a con?)
  • Changes in one of the two packages will have to be reflected/fixed in the other too.

@stanstrup , what do you think?

@stanstrup
Owner Author

In the end this is probably the most efficient way to do this so go ahead if you want.

@jorainer

OK, I'll make a repo and add you as a collaborator.

@stanstrup
Owner Author

Do you want to move this issue and the other db-related ones using https://github-issue-mover.appspot.com?

@jorainer

Or we just link to this issue? Whatever you prefer.

@stanstrup
Owner Author

This issue was moved to rformassspectrometry/CompoundDb#6
