
Databases to convert to tables #2

Closed · 4 of 12 tasks
stanstrup opened this issue Oct 19, 2017 · 28 comments

@stanstrup
Owner

stanstrup commented Oct 19, 2017

Functions added to package:

  • LipidMaps
  • LipidBlast
  • HMDB
  • MyCompoundDB
  • PhenolExplorer
  • PubChem. Too big? Not really useful?

License situation clarified

  • LipidMaps
  • LipidBlast - Confirmed CC BY. So OK with attribution.
  • HMDB
  • MyCompoundDB
  • PhenolExplorer
  • PubChem. Too big? Not really useful?

Please suggest.

@jorainer

Are you planning to add each resource (i.e. its data) to the package?

@stanstrup
Owner Author

stanstrup commented Oct 19, 2017

That was my plan, if it is feasible without violating licenses. Parsing, for example, the JSON from LipidBlast is very slow and people need to download a 1.6 GB file,
whereas the parsed table is only 1-2 MB in RDS format.

It is not very clear to me what the license situation is.
As far as I know, simple data cannot be copyrighted; for example, a simple table from a paper should always be copyright-free. But I am not sure what applies here.

@jorainer

The idea is to match compounds by (adduct) m/z, right?
So you'll have some columns (like mass, id and name) that are common and have to be present in all data resources, and you might have some resource-specific columns.
In that case I would change from a data.frame approach to an S4 class approach (see also issue #6). This would also hide internals (like the actual column names etc.) from the user.

@stanstrup
Owner Author

Right. I was hoping not to have DB-specific columns, though, to be able to easily mix and match.
What do you mean by hide internals?

@jorainer

An example to explain hiding the internals: this is the concept we were following in the AnnotationFilter and ensembldb packages:

  • define a common name for a filter or database attribute that the user is used to, such as genename.
  • define a filter that can be used to search in a (any) database for a certain gene by its name: GenenameFilter.
  • now, no matter which database the user is querying, they can always use the GenenameFilter to search for entries matching a certain gene name. The methods that access the data in a database have to translate it to the correct column name, so it does not matter whether the column in the database table is called gene_name, GeneName, genename, etc. The user doesn't have to bother with what the column might be called or use different column names across different databases.

An example here would be to have something like an InchiFilter that can be used to search for InChIs in the database...
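
As a rough illustration only (not the actual AnnotationFilter/ensembldb code; class, slot and column names here are made up), such a filter could be sketched in S4 like this:

## Minimal sketch of the filter idea; names are illustrative only.
setClass("InchiFilter",
         representation(value = "character", condition = "character"),
         prototype(condition = "=="))

InchiFilter <- function(value, condition = "==")
    new("InchiFilter", value = value, condition = condition)

## Each backend maps the filter to its own column name, so the user never
## needs to know whether the column is called "inchi", "InChI" or "inchi_code".
setGeneric("dbColumn", function(object, db) standardGeneric("dbColumn"))
setMethod("dbColumn", "InchiFilter", function(object, db)
    switch(db, hmdb = "inchi", lipidblast = "InChI", "inchi"))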

@jorainer

Regarding HMDB parsing: I did implement a simple parser to extract fields from HMDB's XML file(s):
https://github.com/jotsetung/xcmsExtensions/blob/master/R/hmdb-utils.R use whatever you want/need.

@stanstrup
Owner Author

stanstrup commented Oct 20, 2017

Yeah, I saw that just now when poking around your repo. I actually also wrote one some years ago that I was planning to add. I have a suspicion that yours is smarter though...
So should I eventually import from your package or copy?

@stanstrup
Owner Author

If you don't enforce column names in the databases, won't it become difficult to mix them if, for example, you want data from both HMDB and LipidMaps for the annotation?

@jorainer

Re: code from xcmsExtensions, please copy what you need - I won't update/use that package anymore. Yours will be much better!

Re: enforcing column names - let's wait for your use case. I agree that common column names should be used.

@jorainer

Are you already working on the HMDB import function? Otherwise I could do that to start getting my hands dirty...

@stanstrup
Owner Author

Nope, I am working on pubchem so that would be great.

@wilsontom

Hi Jan,

I parsed HMDB into a package a while ago if it's any help, and a colleague did something similar for PubChem. Some of it may be of some use to you.

Thanks

Tom

@stanstrup
Owner Author

@wilsontom Thanks! But you don't supply functions to actually generate the HMDB table? I'd like to have that in the package too so it is easy to update.

I basically have PubChem working. Trying to generate the table now. It takes a while though since PubChem is enormous. I wonder what the final size is gonna look like.

@jorainer

HMDB parsing is also on its way; I've just updated it to use the xml2 package instead of the XML package.

@stanstrup
Owner Author

I have been trying to write something feasible to handle PubChem using SQLite intermediates. The problem is that it is enormous: supposedly 130 million structures. Holding the final table in memory would require about 60 GB by my estimates. An RDS file would be ~7.5 GB, an SQLite file ~40 GB.

So three problems:

  1. Would any of the usual solutions even allow such a large file?
  2. People cannot use it on a regular computer without loads of memory.
  3. Expanding it to adducts would balloon it even more.

With an SQLite file, as far as I understand, you could subset it before it is read into R (rough sketch below). I guess that might make it useful for something.
I still don't know if that is feasible, and I wouldn't know where to host a 40 GB SQLite file.
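
Just to make the subsetting idea concrete, a minimal sketch with DBI/RSQLite (the file name and the table/column names are made up):

library(DBI)
library(RSQLite)

## Pull only compounds within a given mass window instead of reading the
## whole (potentially 40 GB) table into memory.
con <- dbConnect(SQLite(), "pubchem.sqlite")
hits <- dbGetQuery(con,
                   "SELECT cid, name, monoisotopic_mass FROM compounds
                    WHERE monoisotopic_mass BETWEEN ? AND ?",
                   params = list(180.05, 180.07))
dbDisconnect(con)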

Thoughts?

jorainer added a commit to jorainer/PeakABro that referenced this issue Oct 23, 2017
- Add the generate_hmdb_tbl function.
- Add related documentation and unit tests.
@jorainer

The HMDB is added (see #10).

@jorainer

Re PubChem: for these large files a tibble-based approach is not feasible. SQL might do it - or on-the-fly access? Do they have a web API that could be queried? The approach I have in mind might also work here: define a CompoundDb S4 object and implement all of the required methods (select etc.) for it. For smaller databases these can access the internal SQLite database. We could then also implement a PubChemDb class that extends CompoundDb, and its select method could e.g. query the database online (if they provide an API) and return the results.
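
Very roughly, what I have in mind (just a sketch; the slot name and the compounds generic below are placeholders, not a finished API):

## Sketch of the class hierarchy idea; not a finished API.
setClass("CompoundDb", representation(dbfile = "character"))
setClass("PubChemDb", contains = "CompoundDb")

setGeneric("compounds", function(x, ...) standardGeneric("compounds"))

## Small/medium resources: read from the local SQLite file.
setMethod("compounds", "CompoundDb", function(x, ...) {
    con <- DBI::dbConnect(RSQLite::SQLite(), x@dbfile)
    on.exit(DBI::dbDisconnect(con))
    DBI::dbGetQuery(con, "SELECT * FROM compounds")
})

## PubChem: same user-facing call, but this method could instead query
## the PUG REST web API and return the results.
setMethod("compounds", "PubChemDb", function(x, ...) {
    stop("not implemented yet; would query the PubChem web API here")
})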

Regarding the adducts - I had a thought about that too: I wouldn't create all adducts for all compounds that are in the database but rather go the other way round and calculate adducts from the identified chromatographic peaks instead. That would be more efficient, because supposedly there are always fewer peaks to annotate than compounds in the database.
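
For example (a small sketch; the adduct mass shifts are the commonly tabulated positive-mode values and the peak m/z values are made up):

## Compute candidate neutral masses from observed peak m/z values for a few
## common positive-mode adducts, instead of expanding every database
## compound to all of its adducts.
adducts <- c("[M+H]+" = 1.007276, "[M+Na]+" = 22.989218, "[M+NH4]+" = 18.033823)
peak_mz <- c(181.0707, 203.0526)

candidate_mass <- sapply(adducts, function(shift) peak_mz - shift)
## Each column holds the neutral masses to look up in the compound database.
candidate_mass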

@stanstrup
Owner Author

Thanks for HMDB.

Re PubChem: PubChem does have an API: https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html#_Toc458584424
But it will be way too slow for this purpose, in my opinion. It would mean thousands of queries if you attempt to annotate a whole peak list. I specifically wanted to get away from the whole "look up one at a time" approach, so that once you have created your annotated peak list you can just browse around and see everything.
I suggest we change to SQLite databases in general so that larger databases can be accommodated in the same framework.
I say we supply the function to generate the PubChem SQLite database but don't host the file anywhere. To me, annotating with all of PubChem is not very useful anyway; you always get too many irrelevant hits.

Re: CompoundDb: I think it makes sense to have such an object.
Do you know if it is possible to cache generated data in the installed package folder?
What would be nice is if there was:

CompoundDb <- generate_CompoundDb(dbs=c("HMDB","LipidBlast"))

--> The LipidBlast database has not been generated (initialized is a better word?) yet. Please run generate_db_lipidblast to create a cached database

generate_CompoundDb would read the included SQLite files if they exist. If generate_db_lipidblast and friends could simply add the SQLite file for the specific database to the package folder, you'd only need to generate each one once.
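
Something along these lines is what I imagine (pure sketch; generate_CompoundDb and generate_db_lipidblast are the hypothetical names from above, and the cache location is only for illustration):

## Look for pre-generated SQLite files in a cache directory and point the
## user to the generator function for any that are missing.
generate_CompoundDb <- function(dbs = c("HMDB", "LipidBlast"),
                                cache = path.expand("~/.PeakABro")) {
    files <- file.path(cache, paste0(tolower(dbs), ".sqlite"))
    missing <- dbs[!file.exists(files)]
    if (length(missing))
        stop("The ", paste(missing, collapse = ", "),
             " database(s) have not been generated yet. Please run the ",
             "corresponding generate_db_* function to create a cached database.")
    files  ## in reality: build and return a CompoundDb object from these files
}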

Re adducts: Yes you are right. That makes much more sense.

@jorainer

Re CompoundDb and caching - no, I don't think it's possible to cache anything in the package folder. I would keep the annotation data separate from PeakABro. What I would propose is the following: in the initial phase, provide some CompoundDb objects/SQLite databases within dedicated annotation packages (e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0). In the longer run, distribute them via AnnotationHub; check the following:

> library(AnnotationHub)
> ah <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%

snapshotDate(): 2017-10-24
> ## Look for a specific resource, like gene annotations from Ensembldb, in our
> ## case we could then search e.g. for "CompoundDb", "HMDB"
> query(ah, "EnsDb.Hsapiens.v90")
AnnotationHub with 1 record
# snapshotDate(): 2017-10-24 
# names(): AH57757
# $dataprovider: Ensembl
# $species: Homo Sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2017-08-31
# $title: Ensembl 90 EnsDb for Homo Sapiens
# $description: Gene and protein annotations for Homo Sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
#   "Annotation", "90", "AHEnsDbs") 
# retrieve record with 'object[["AH57757"]]' 
> ## retrieve the resource:
> edb <- ah[["AH57757"]]
require("ensembldb")
loading from cache '/Users/jo//.AnnotationHub/64495'

This means users could fetch the resource they want from AnnotationHub and this will be cached locally. Does that make sense?

Now, I'd also like to keep separate CompoundDb objects/databases for different resources (e.g. HMDB, LipidBlast). Reason: that way you can version the resources and the corresponding packages (see e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0). Different resources will never have the same release cycles - and versioning annotation resources is key to reproducible research.

This also means that you can't query multiple resources at the same time, but that shouldn't be a problem, should it?

@stanstrup
Owner Author

That sounds very reasonable. It is a bit of a learning curve for me with the S4 objects, so I hope you have patience with me while I try to wrap my head around that.

I am wondering if it could also make sense to split the database stuff from the peak browser. It is getting more comprehensive than originally envisioned.

For the last question: it would probably be nice to be able to annotate with multiple databases at the same time. The objective, in the end, is the browser, where you'd want a single table with all the suggested annotations.

Any idea what to do with the very big databases? The PubChem SQLite file ended up being 43 GB.

@jorainer

Re annotation with multiple databases: one could annotate with a CompoundDb for each resource and bind_rows the results. Then you'll have the final table.
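
E.g. (a sketch; annotate_peaklist, peaks and the cmp_db_* objects are hypothetical placeholders):

library(dplyr)

## Annotate the same peak list against several CompoundDb objects and stack
## the per-resource results into one table for the browser.
res_hmdb  <- annotate_peaklist(peaks, cmp_db_hmdb)
res_lipid <- annotate_peaklist(peaks, cmp_db_lipidblast)

annotations <- bind_rows(hmdb = res_hmdb, lipidblast = res_lipid,
                         .id = "source")  ## keep track of the resource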

Re very big database: the only thing I could think of here is to use a central MySQL server hosted somewhere (eventually I could do that, not sure though). And here comes the power of the S4 objects: we simply define a PubChemDb object that extends CompoundDb. We would only have to implement the compounds or src_cmpdb (or the annotating function/method) accordingly. For the user it would be just like using a simple local SQLite-based CompoundDb object.

@stanstrup
Owner Author

stanstrup commented Oct 25, 2017

Ah, OK. If nothing prevents bind_rows then it is all good.
EDIT: now I understand: bind the results. Yes, that works too.

Re very big database: I guess we can put that on the back-burner for now and just provide the parser. PubChem is rarely really useful for annotation anyway.

@jorainer

jorainer commented Oct 26, 2017

I am wondering if it could also make sense to split the database stuff from the peak browser. It is getting more comprehensive than originally envisioned.

Thinking it all over - eventually that might not be such a bad idea. I could focus on the database/data import stuff (with your help) and you can focus on the annotation, matching and browsing stuff.

Pros for splitting:

  • keep the database and the creation of the database separate from the browser and annotator - easier to maintain.
  • we would not run into the Bioconductor style <-> tidyverse coding style clash. Something that I find very ugly would be e.g. create_CompoundDb, i.e. mixing CamelCase with snake_case.
  • You don't have to go through my pull requests ;)

Cons:

  • PeakABro will become very slim (is that a con?)
  • Changes in one of the two packages will have to be reflected/fixed in the other too.

@stanstrup , what do you think?

@stanstrup
Owner Author

In the end this is probably the most efficient way to do this so go ahead if you want.

@jorainer

OK, I'll make a repo and add you as a collaborator.

@stanstrup
Owner Author

Do you want to move this issue and the other db-related ones using https://github-issue-mover.appspot.com?

@jorainer

Or we just link to this issue? Whatever you prefer.

@stanstrup
Owner Author

This issue was moved to rformassspectrometry/CompoundDb#6
