Retain the effective refgenie build inputs #260

Open
pinin4fjords opened this issue Jul 6, 2021 · 6 comments

Comments

@pinin4fjords

Hello!

When we run analysis and present results to users, we report the reference files that were used, as provided by the reference resource (e.g. Ensembl).

Switching to refgenie makes that quite difficult, because assets are stored relative to the SHAs, under whatever tags we provide. As I start to use our refgenie instance in anger, I'm finding myself having to recover the original filename by grep'ing for the first cp in the build log:

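function find_orig_refgenie_asset_name() {
    # Print the basename of the source file in the first `cp` command of the asset's build log.
    local asset_path=$1
    # Quote every expansion so paths containing spaces survive word splitting.
    basename "$(grep "cp " "$(dirname "$asset_path")/_refgenie_build/refgenie_commands.sh" | head -n 1 | awk '{print $2}')"
}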

Is there a less hacky way?

@nsheff
Contributor

nsheff commented Jul 6, 2021

Can you give me an example of what exact file name you're looking for?

@pinin4fjords
Author

e.g. here's how I've set up some human assets:

> refgenie list -g homo_sapiens--GRCh38 
                                                                Local refgenie assets                                                                 
                                                 Server subscriptions: http://refgenomes.databio.org                                                  
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ genome                                          ┃ asset (seek_keys)                              ┃ tags                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ homo_sapiens--GRCh38, homo_sapiens,             │ fasta (fasta, fai, chrom_sizes, dir)           │ genome                                          │
│ homo_sapiens--newest, homo_sapiens--current     │                                                │                                                 │
│ homo_sapiens--GRCh38, homo_sapiens,             │ fasta_txome (fasta_txome, fai, chrom_sizes,    │ cdna_current, cdna_ensembl104, cdna_newest,     │
│ homo_sapiens--newest, homo_sapiens--current     │ dir)                                           │ cdna_ensembl95                                  │
│ homo_sapiens--GRCh38, homo_sapiens,             │ ensembl_gtf (ensembl_gtf, ensembl_tss,         │ newest, ensembl104, ensembl95, current          │
│ homo_sapiens--newest, homo_sapiens--current     │ ensembl_gene_body, dir)                        │                                                 │
│ homo_sapiens--GRCh38, homo_sapiens,             │ salmon_index (salmon_index, dir)               │ cdna_newest--salmon_v1.2.0,                     │
│ homo_sapiens--newest, homo_sapiens--current     │                                                │ cdna_current--salmon_v1.2.0,                    │
│                                                 │                                                │ cdna_ensembl95--salmon_v1.2.0,                  │
│                                                 │                                                │ cdna_ensembl104--salmon_v1.2.0                  │
│ homo_sapiens--GRCh38, homo_sapiens,             │ kallisto_index (kallisto_index, dir)           │ cdna_newest--kallisto_v0.45.0,                  │
│ homo_sapiens--newest, homo_sapiens--current     │                                                │ cdna_current--kallisto_v0.45.0                  │
│ homo_sapiens--GRCh38, homo_sapiens,             │ hisat2_index (hisat2_index, dir)               │ genome--hisat2_v2.1.0                           │
│ homo_sapiens--newest, homo_sapiens--current     │                                                │                                                 │
└─────────────────────────────────────────────────┴────────────────────────────────────────────────┴─────────────────────────────────────────────────┘

So e.g. there's a human gtf file I've tagged as 'newest':

> refgenie seek homo_sapiens--GRCh38/ensembl_gtf:newest
/path/to/refgenie/alias/homo_sapiens--GRCh38/ensembl_gtf/newest/homo_sapiens--GRCh38.gtf.gz

The resulting file name is generic, reflecting only the assembly, and wouldn't help a user work out which source file the asset was built from. How do I find out which file I used to generate it, besides checking the logs like:

>  basename $(grep "cp " $(dirname /path/to/refgenie/alias/homo_sapiens--GRCh38/ensembl_gtf/newest/homo_sapiens--GRCh38.gtf.gz)/_refgenie_build/refgenie_commands.sh | head -n 1 | awk '{print $2}')
Homo_sapiens.GRCh38.104.gtf.gz

?
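
The log lookup above can be made a little more defensive. What follows is a minimal sketch rather than anything refgenie provides: it assumes, as the grep-based command does, that the build log sits at _refgenie_build/refgenie_commands.sh next to the asset file and that the copy is recorded as `cp <source> <destination>`. The function name orig_asset_input_name is hypothetical.

```shell
# Sketch only: recover the original input filename from an asset's build log.
# Assumptions (not confirmed by refgenie): the log is at
# <asset dir>/_refgenie_build/refgenie_commands.sh and the first copy is
# recorded as `cp <source> <destination>`.
orig_asset_input_name() {
    local asset_path=$1
    local log src
    log="$(dirname "$asset_path")/_refgenie_build/refgenie_commands.sh"

    # Fail loudly instead of letting basename run on an empty string.
    if [ ! -r "$log" ]; then
        echo "build log not found: $log" >&2
        return 1
    fi

    # Match only lines whose first word is exactly `cp`, then stop reading.
    src=$(awk '$1 == "cp" { print $2; exit }' "$log")
    if [ -z "$src" ]; then
        echo "no cp command found in $log" >&2
        return 1
    fi

    basename "$src"
}
```

Matching on the first field avoids picking up lines that merely contain "cp " somewhere (for example inside another command), and the explicit checks make a missing or unexpected log an error rather than an empty result.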

@stolarczyk
Contributor

There is no refgenie command to retrieve the inputs.

Broadly speaking, we need to retain the effective build inputs (the defaults, or the values passed to the --files, --assets, and --params options of the refgenie build command) and make them accessible via the CLI and the server API.

A similar issue has been raised before: #170 (comment)

stolarczyk changed the title from "Recovery of original filename" to "Retain the effective refgenie build inputs" on Jul 6, 2021
@pinin4fjords
Author

Okay, thanks for the response @stolarczyk

stolarczyk added a commit that referenced this issue Jul 21, 2021
- retain effective recipe inputs: #260
- use Recipe and AssetClass classes, flexible recipes implementation: #198
- enable automatic software version-based tag setting: #195
@pinin4fjords
Author

pinin4fjords commented Sep 29, 2021

@stolarczyk I notice the 'likely solved' tag. Could you expand? E.g. how would I modify the above hack to get the information more cleanly?

I also notice that there hasn't been a release since the commit above, so is the fix not yet released?

@nsheff
Contributor

nsheff commented Nov 2, 2021

This has been fixed, but only on the dev branch. If you run refgenie build with refgenie@dev, a JSON file is produced that records the build inputs, and it's accessible through refgenieserver.

Here's an example of the endpoint that serves this on a development server:

http://igenomes.databio.org/v3/assets/build_inputs/059d8eb13b5a40a4cf4d9044cd0d08456c806ce07c69a3ba/bowtie1_index?tag=1.3.0
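
For scripting, the URL above can be assembled from its parts and fetched with curl. This is a hedged sketch: the path layout is copied from the example URL only, the helper names build_inputs_url and fetch_build_inputs are hypothetical, and the shape of the returned JSON is not shown in this thread, so nothing here parses it.

```shell
# Sketch only: compose a build_inputs URL in the same layout as the example
# endpoint: <server>/v3/assets/build_inputs/<genome>/<asset>?tag=<tag>
# The layout is inferred from that single URL and may differ on other servers.
build_inputs_url() {
    local server=$1 genome=$2 asset=$3 tag=$4
    # ${server%/} drops one trailing slash so the path has no double slash.
    printf '%s/v3/assets/build_inputs/%s/%s?tag=%s\n' \
        "${server%/}" "$genome" "$asset" "$tag"
}

# Fetch the JSON document for an asset. Takes the same four arguments.
# curl -f fails on HTTP errors; -sS hides progress but still prints errors.
fetch_build_inputs() {
    curl -fsS "$(build_inputs_url "$@")"
}
```

Calling `fetch_build_inputs http://igenomes.databio.org 059d8eb13b5a40a4cf4d9044cd0d08456c806ce07c69a3ba bowtie1_index 1.3.0` would request the same URL as the example, provided that development server is still reachable.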
