This repository has been archived by the owner on May 19, 2021. It is now read-only.

A package for higher-level R metadata extraction #41

Open
vsbuffalo opened this issue Mar 26, 2015 · 11 comments

Comments

@vsbuffalo

Basically there should be a higher-level way to extract metadata from directory and filenames. For example:

150202_HS2A/Project_XXX/Sample_OEP02_N712_S508/OEP02_N712_S508_GTAGAGGA-CTAAGCCT_L002_R1_001.fastq.gz

There's a ton of metadata in this file that we should be able to quickly extract and work with (e.g. in a dataframe). The (initial) goals of this package are:

  1. Load in and extract file metadata.
  2. Queries on this metadata — think list.files() that's way more powerful.
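
A rough base-R sketch of goal 1 for the example path above, splitting on `/` and `_` rather than using a regular expression. The column names (`run`, `index1`, `barcode`, `chunk`, and so on) are guesses at what each field means, not an agreed schema:

```r
# Illustrative only: the meaning assigned to each field is an assumption.
path <- "150202_HS2A/Project_XXX/Sample_OEP02_N712_S508/OEP02_N712_S508_GTAGAGGA-CTAAGCCT_L002_R1_001.fastq.gz"

# Directory components, then the underscore-delimited fields of the file name.
parts <- strsplit(path, "/", fixed = TRUE)[[1]]
stem <- sub("\\.fastq\\.gz$", "", basename(path))
fields <- strsplit(stem, "_", fixed = TRUE)[[1]]

md <- data.frame(
  run     = parts[1],
  project = sub("^Project_", "", parts[2]),
  sample  = fields[1],
  index1  = fields[2],
  index2  = fields[3],
  barcode = fields[4],
  lane    = fields[5],
  read    = fields[6],
  chunk   = fields[7],
  path    = path,
  stringsAsFactors = FALSE
)
```

With one such row per file, goal 2 reduces to ordinary data frame subsetting, e.g. `md[md$lane == "L002" & md$read == "R1", ]`.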

@karthik
Member

karthik commented Mar 26, 2015

\o/

@vsbuffalo
Author

Should we wrap list.files() at all? Should we use something like Python's groupdict()?

Some example usage:

files <- list.files(recursive=TRUE, full.names=TRUE)
md <- extract_metadata(files, "data/(?P<samples>\\w+)/(?P<replicates>\\w+)/file_(?P<file_number>\\d+)\\.txt")

Is md a dataframe? If so, we can use dplyr and purrr to process each entry. Or we could use base functions, e.g.:

lapply(split(md, md$samples), read_function)

Just some ideas...
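
One way the `extract_metadata()` call above could work, as a minimal sketch rather than a settled design. It assumes the pattern is matched with `perl = TRUE`, since the PCRE engine accepts Python-style `(?P<name>...)` groups, and it reads the `capture.start`, `capture.length`, and `capture.names` attributes that `regexpr()` returns in that mode. Returning a data frame with `NA` for non-matching paths is my assumption:

```r
# Hypothetical implementation: one row per path, one column per named group.
extract_metadata <- function(paths, pattern) {
  m <- regexpr(pattern, paths, perl = TRUE)
  starts <- attr(m, "capture.start")
  lens <- attr(m, "capture.length")
  group_names <- attr(m, "capture.names")
  matched <- m != -1

  cols <- lapply(seq_along(group_names), function(i) {
    out <- rep(NA_character_, length(paths))  # NA where the pattern did not match
    out[matched] <- substring(
      paths[matched],
      starts[matched, i],
      starts[matched, i] + lens[matched, i] - 1
    )
    out
  })
  names(cols) <- group_names

  data.frame(path = paths, cols, stringsAsFactors = FALSE)
}

# Made-up paths for illustration.
files <- c(
  "./data/ctrl/rep1/file_1.txt",
  "./data/ctrl/rep2/file_2.txt",
  "./data/treated/rep1/file_10.txt",
  "./README.md"
)
md <- extract_metadata(
  files,
  "data/(?P<samples>\\w+)/(?P<replicates>\\w+)/file_(?P<file_number>\\d+)\\.txt"
)

# Group rows by sample, ready for lapply(); non-matching paths drop out here.
by_sample <- split(md, md$samples)
```

Every captured column comes back as character, so something like `file_number` would still need `as.integer()` before sorting numerically.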

@vsbuffalo
Author

Should we have hooks that validate certain metadata?
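
A possible shape for such hooks, purely as a sketch: a named list of vectorized predicate functions, one per metadata column, checked by a hypothetical `validate_metadata()` that reports offending rows instead of stopping. Both the function name and the report format are assumptions:

```r
# Hypothetical: `validators` maps column names to functions returning TRUE/FALSE per value.
validate_metadata <- function(md, validators) {
  problems <- lapply(names(validators), function(col) {
    ok <- validators[[col]](md[[col]])
    bad <- which(!ok)
    data.frame(
      row = bad,
      column = rep(col, length(bad)),
      value = as.character(md[[col]][bad]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, problems)
}

# Made-up metadata with one bad lane and one bad read label.
md <- data.frame(
  lane = c("L001", "L002", "lane3"),
  read = c("R1", "R2", "R3"),
  stringsAsFactors = FALSE
)
validators <- list(
  lane = function(x) grepl("^L[0-9]{3}$", x),
  read = function(x) x %in% c("R1", "R2")
)
problems <- validate_metadata(md, validators)
```

Returning the failures as a data frame keeps the decision (warn, stop, or filter) with the caller.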

@vsbuffalo
Author

A place to get started: https://github.com/vsbuffalo/pathfindr (fork, branch, code, and be merry!)

EDIT: the name is terrible, but it's a placeholder.

@HenrikBengtsson

This is rather topic-specific, but in the Aroma Framework we use such "meta" data encoded in file and directory names, cf. http://aroma-project.org/docs/HowDataFilesAndDataSetsAreLocated/. It's been in place since 2006. It also supports on-the-fly reparsing, e.g. filtering out character sequences that carry no information, reordering fields, etc. The essence of it is in the R.filesets package. It's been a long-term wish to extend this to a generic framework based on regular expressions, but there's never been an urgent need for it.

@tracykteal

How about we use a different data format for genomic data that encodes the metadata within the file? :) I can't really see it happening, but I'm not really joking either. HDF5?

@drisso

drisso commented Mar 28, 2015

We were talking about this with @jarrodmillman but I personally don't see that happening unless Illumina (and/or another big player) pushes for it. It would be cool, though...

@jennybc
Member

jennybc commented Mar 28, 2015

This reminds me of two other thoughts I've had in the past, related to what sort of larger package this functionality might fit in.

[1] A package that implements some standard unix commands but gives the result in the most sensible R native format and anticipates piping them together with %>%. In the current case, the idea would be to emulate ls and to offer some of the most expected arguments and to get the result back as a data.frame. If memory serves, the current task of parsing metadata in file names was exactly my own motivating case, as this comes up often for me as well.

[2] It also feels like R needs a package for operating on files and paths. Google and the hotel wifi are keeping me from pointing to a great example from another language, but I know such examples exist. Something that goes beyond file.path(), basename(), etc. I even feel like someone has taken a stab at this but can't remember who/when. Anyone else remember?
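
To make idea [1] concrete, here is a minimal sketch of an `ls` that returns a data.frame, built only on `list.files()` and `file.info()`. The name `ls_df()` and the choice of columns are placeholders, not a proposal for the final interface:

```r
# Hypothetical ls-like helper: one row per directory entry.
ls_df <- function(path = ".", pattern = NULL, recursive = FALSE) {
  files <- list.files(path, pattern = pattern, recursive = recursive,
                      full.names = TRUE)
  info <- file.info(files)
  data.frame(
    path     = files,
    name     = basename(files),
    size     = info$size,
    is_dir   = info$isdir,
    modified = info$mtime,
    stringsAsFactors = FALSE
  )
}
```

Because the result is a plain data frame, it can go straight into a `%>%` chain with the usual dplyr verbs, e.g. filtering on `size` or arranging by `modified`.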

@gaborcsardi

E.g. https://docs.python.org/2/library/os.path.html in Python.

node.js has a bunch of them, and they are pretty basic. I guess this is what you need most of the time. E.g. https://www.npmjs.com/package/fs-extra

Edit: also, http://cran.r-project.org/web/packages/pathological/index.html

@richfitz
Member

I started a direct port of os.path and will push that up to gh - it's mostly just going to be grunt work to get all the bits filled in, so if other people have a need we'd get a naive R version done pretty quickly.

@karthik
Member

karthik commented Mar 31, 2015

@richfitz That sounds great. Happy to help fill in bits.

8 participants