suggestion: aliases in @SQ section #100

Closed
lindenb opened this issue Sep 10, 2015 · 6 comments
@lindenb

lindenb commented Sep 10, 2015

A suggestion for the SAM sequence dictionary: would it be possible to add an optional list of aliases to the @SQ header line? Software could use those aliases to reconcile the various nomenclatures (UCSC, ENSEMBL).

Something like:

@SQ SN:MT LN:12345 AL:chrMT|chrM
@SQ SN:X LN:1234567 AL:chrX|chr0X|0X

then

samtools view my.bam MT:1-100

would return the same output as

samtools view my.bam chrM:1-100
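
A minimal sketch of how a tool might do this lookup, assuming the hypothetical AL: tag and | separator from the example above. The function names are illustrative only and are not part of samtools or any existing library.

```python
def build_alias_map(header_lines):
    """Map every SN, and every alias in the hypothetical AL tag, to its SN."""
    alias_to_sn = {}
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue
        tags = dict(f.split(":", 1) for f in fields[1:])
        sn = tags["SN"]
        alias_to_sn[sn] = sn
        for alias in tags.get("AL", "").split("|"):
            if alias:
                alias_to_sn[alias] = sn
    return alias_to_sn


def resolve_region(region, alias_to_sn):
    """Rewrite 'alias:beg-end' (or a bare 'alias') to use the canonical SN."""
    name, sep, interval = region.rpartition(":")
    if not sep:
        # No ':' present, so the whole string is the sequence name.
        name = region
    canonical = alias_to_sn.get(name, name)  # unknown names pass through
    return f"{canonical}:{interval}" if sep else canonical


header = [
    "@SQ\tSN:MT\tLN:12345\tAL:chrMT|chrM",
    "@SQ\tSN:X\tLN:1234567\tAL:chrX|chr0X|0X",
]
aliases = build_alias_map(header)
print(resolve_region("chrM:1-100", aliases))
```

With this mapping, a query for chrM:1-100 and one for MT:1-100 both resolve to the same SN before the index lookup.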
@atks

atks commented Sep 10, 2015

I think we should have this for the contigs in VCF too.

@jmarshall
Member

This is IMHO a rather good idea, although we'd have to come up with a different separator as e.g. NCBI likes to use | characters within sequence names. I might prefer AN: or so ("alternative/alias names") to parallel SN:.

We'd also have to think about what rules are necessary around distinctness, and to be explicit whether alignment records in SAM files would be allowed to use these aliases in RNAME/RNEXT (surely not!).
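
As a rough sketch of that last restriction: a checker could flag alignment records whose RNAME or RNEXT column holds an alias rather than an SN. The helper name below is hypothetical; the column positions (RNAME third, RNEXT seventh) are those of the SAM alignment line.

```python
def alias_fields(record_line, aliases):
    """Return the names of the reference fields whose value is an alias."""
    fields = record_line.rstrip("\n").split("\t")
    found = []
    # RNAME is column 3 and RNEXT is column 7 (0-based indices 2 and 6).
    # '*' and '=' are never aliases, so they pass automatically.
    for label, idx in (("RNAME", 2), ("RNEXT", 6)):
        if fields[idx] in aliases:
            found.append(label)
    return found


aliases = {"chrM", "chrMT"}
good = "r1\t0\tMT\t1\t60\t4M\t=\t0\t0\tACGT\tIIII"
bad = "r2\t0\tchrM\t1\t60\t4M\tchrMT\t5\t0\tACGT\tIIII"
print(alias_fields(good, aliases), alias_fields(bad, aliases))
```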

@jmarshall
Member

jmarshall commented Mar 10, 2016

It seems NCBI will be phasing out their | characters in sequence names, though we probably still can't use | here as people will still have older NCBI files lying around.

@vadimzalunin
Contributor

Are aliases known at write time? Aliases seem like a read-time decision to me.

@dpryan79

Most programs don't currently know about them when they make BAM files, but that could be changed.

I have to mention that dealing with different naming schemes is a huge annoyance, particularly on things like Galaxy where you have novice users who are often not aware of this issue and inevitably need to be walked through properly munging things.

@jmarshall jmarshall added this to the June 2016 meeting milestone Jun 1, 2016
@jmarshall jmarshall removed this from the June 2016 meeting milestone Dec 1, 2016
@jmarshall
Member

jmarshall commented Dec 1, 2016

We discussed this at some length back at our September meeting (and I'm finally writing up my notes from then). We continue to mostly like the idea in principle, but need to spell out rules around uniqueness and so on.

We noted that we can use this as an opportunity to define the alias regexp to be the excellently tight regexp (disallowing especially : and perhaps other punctuation too; cf #167) that we wish we'd always had for SN. Doing so will solve the what-separator-to-use problem, and provide a path towards restricting SN to the same regexp — at which point chr:beg-end notation would finally be unambiguous.
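
To illustrate why a tight name regexp would make chr:beg-end unambiguous, here is a small sketch. The character set in NAME is an assumption chosen for illustration; the exact regexp was not settled in this discussion.

```python
import re

# Hypothetical tight name pattern: letters, digits, '_', '.', '-' only.
# Crucially, ':' is not allowed inside a name.
NAME = r"[0-9A-Za-z][0-9A-Za-z_.-]*"
REGION_RE = re.compile(rf"({NAME})(?::(\d+)(?:-(\d+))?)?")


def parse_region(region):
    """Return (name, beg, end), or None if the region does not fit the pattern."""
    m = REGION_RE.fullmatch(region)
    if m is None:
        return None
    name, beg, end = m.groups()
    return name, int(beg) if beg else None, int(end) if end else None


print(parse_region("chr1:100-200"))
print(parse_region("chr1"))
print(parse_region("HLA-A*01:01:01:01"))  # rejected: '*' and ':' in the name
```

Because a name can never contain a colon under this pattern, the first colon always marks the start of the coordinates, so there is only one way to read any accepted region string.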

There was concern expressed that adding this reduces pressure on GRC to agree on a “chr” vs “” prefix. The opposing view to that is to admit that the world has not yet agreed on which end of the prefix to crack open and providing tools to reduce the ensuing pain is useful, as espoused by others on this thread.

In PR #103, @lindenb proposes the following text (thanks!):

AN Reference sequence Alternative Names. A semicolon separated list of alternative names for this sequence e.g: 1;01;CM000663;NC_000001.10. Tools are free to use this information to decode the users' positions but the name displayed in the output is always SN.

I've taken this on, and will propose additions to this that spell out distinctness requirements etc.
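
For concreteness, one possible distinctness check is sketched below. It assumes the semicolon separator from the proposed text and an assumed rule that no alias may equal the SN or an alias of a different sequence; the actual rules had not been written at this point, and the function name is hypothetical.

```python
def find_alias_clashes(sq_records):
    """sq_records: list of (SN, AN string) pairs, one per @SQ line.

    Returns (alias, first_owner_sn, clashing_sn) tuples for every alias that
    collides with the SN or an alias of a different sequence.
    """
    owner = {sn: sn for sn, _ in sq_records}  # every SN owns its own name
    clashes = []
    for sn, an in sq_records:
        for alias in an.split(";"):
            if not alias:
                continue
            previous = owner.get(alias)
            if previous is not None and previous != sn:
                clashes.append((alias, previous, sn))
            else:
                owner[alias] = sn
    return clashes


ok = [("chr1", "1;01;CM000663;NC_000001.10"), ("chr2", "2;02")]
dup = [("chr1", "1"), ("chr2", "1")]
print(find_alias_clashes(ok), find_alias_clashes(dup))
```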

@jmarshall jmarshall self-assigned this Dec 1, 2016
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jun 1, 2017
Enables tools to allow users to make queries with e.g. "1" or "chr1"
interchangeably.  Also allows for the possibility of tools using an alias
when displaying sequence names to the user.  Hat tip @lindenb, fixes samtools#100.

However aliases must not appear elsewhere within the SAM file, in
particular not in RNAME/RNEXT fields.  This ensures that files will
still be parsed correctly by non-@SQ-AN-aware tools.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jun 29, 2017
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jul 27, 2017