Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storing base modifications (methylation) #362

Closed
jkbonfield opened this issue Nov 29, 2018 · 41 comments
Closed

Storing base modifications (methylation) #362

jkbonfield opened this issue Nov 29, 2018 · 41 comments
Labels

Comments

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 29, 2018

This is just an issue to track progress on the idea.

The basic proposal so far is some form of side-channel in a SAM/BAM/CRAM tag which is used in conjunction with the sequence field to characterise the base modification. The two combined together form the actual modification. This works around issues with BAM having a limited (and already defined) 16 character set and also potentially problems with more than 20 different base modifications.

The likely solutions for this will either be an MD style tag with a mixture of lengths and modification symbols, or a BQ style tag with the same length as the sequence.

Some futher experiments and discussions can be found here:

https://docs.google.com/document/d/1LNVELOMU4ZlKASPoWxuUtgk7vtOZsU-ogERa_IeMeiI/edit?usp=sharing

See also the table in Appendix A (last page) of https://www.biorxiv.org/content/early/2016/03/15/043794

@jkbonfield
Copy link
Contributor Author

I'll leave this here with no comment... ;-)

http://modomics.genesilico.pl/modifications/

@jkbonfield jkbonfield changed the title Storing base modifications Storing base modifications (methylation) Feb 6, 2019
@jkbonfield
Copy link
Contributor Author

Revisiting this I now understand it better. Strand is important as it's the way we indicate on a double-stranded template that we have observed the methylation on the opposite strand. Some sequencing methods do use both strands (eg PacBio?) so this is a genuine issue.

My favoured simple proposal for now is to use a style which is a halfway house between MD and SA tags. I rejected the BQ style notation (one value per base, with most being NOP) because it's just inefficient and bloated in SAM, BAM and memory and the frequency of modifications is low.

We store methylation codes and distances between them (MD style), but our codes are numeric IDs to permit a wider range as there are already over 100 known for RNA and growing. We also need to cope with the potential for a single site having a modification on both top and bottom strand simultaneously (vanishingly rare, but we cannot rule it out). Thus we combine pos,type(,type*) lists, separated by a semi-colon (SA style).

An example (tag name is arbitrary!):

MM:Z:105,1;27,2;150,1,2;412,3

(Where "1" is 5mC, "2" is 5mC on complementary strand, and "3" is 5hmC. These are "m", "1" and "h" in Coby's slides respectively.)

It's not optimal for compression as we're mixing multiple types of data into the single field - types, strand (part of type) and position-deltas and also not taking into account data duplication (knowledge of primary base type). However it's simple, in keeping with the rest of SAM, and frankly it's unlikely to be a HUGE hit whatever we do.

If we wish to separate out strand, we could do so explicitly. Eg above converted to:

MM:Z:105,+,1;27,-,1;150,+,1,-,1;412,3

Any thoughts?

@dpryan79
Copy link

dpryan79 commented Feb 6, 2019

If you add this to the specification you'll have to update it every time someone looks at a new modification (what about 5-fC and 5-caC?)...

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Feb 6, 2019

We'd populate the standard set of modification types with a long list of known ones, so we wouldn't envisage a change happening regularly. Coby Viner has already given a good talk on this and there are papers explaining it too. See the above links.

The second bit which I forgot to add is that the ChEBI registery of small molecular chemicals can also be used to give a more complex description of anything we haven't thought of. It was also advocated by the paper. This is probably best done by simply hacking the name space. Eg either positive types are ours (see the defined table) and negative types are theirs, or better still letters are our shorthand (m, h, f, etc) and numbers are ChEBI..

We don't want to define everything using ChEBI because even the common modifications are 5 digit numbers which just bloats everything.

So maybe this:

MM:Z:105,+,m;27,-,m;150,+,m,-,h;412,+,f;50,+,76397

I'm still unsure about whether strand should be explicit as it is here, or encoded within the type as Coby suggests. Both work. Explicit strand is cleaner and so I'm leaning towards that, but not all technologies (most?) are going to be reporting double-stranded methylation.

Edit: or shorter form MM:Z:105+m;27-m;150+m-h;412+f;50+76397

Edit: shorter still, only use semicolon for ChEBI IDs: MM:Z:105+m27-m150+m-h412+f50+76397;
There's also the option of 4 (or 5) MM tags, eg Ma Mc Mg Mt that describe these edits only for their respective source base types. This reduces the data complexity considerably. Either that or we have one table but the codes as looked up in conjunction with the input base type. Both work well I think for compression of this data (in BAM and CRAM). Figures incoming later.

@nodrogluap
Copy link

Do you also plan a code for "There's a modification, but I don't know what it is"? I have a method that detects statistically significant variance from the unmodified nucleotide kmer models in Oxford Nanopore data, and it would be great to be able to note this in the alignment. I'd originally thought of hacking the MD string with an 'm' in those positions, which would easily allow me to make a standards-conforming file by conversion of the MD string to upper case, but real support for encoding an unknown mod would be better :-)

@jkbonfield
Copy link
Contributor Author

It was in the paper linked above. I'm not sure which are practically likely to be useful but some form of modification ambiguity code seems sensible.

@ahcm
Copy link

ahcm commented Feb 11, 2019

I like: MM:Z:105+m;27-m;150+m-h;412+f;50+76397;

But modifications are already part of the input sequence of mappers. Basecallers start to extend the alphabet. Will they encode an MM:Z tag in the FASTQ @ line? What are the ideas/suggestions there?

@jkbonfield
Copy link
Contributor Author

That's a good question. FASTQ is already operating outside it's own weak specification, eg via the aux tag elements, but it seems to be a common method and I don't see why MM tags couldn't be encoded that way.

Alternatively, for the common modification types FASTQ doesn't have the alphabet limitations that BAM does so it may be justified to store the data in the sequence portion instead. It'd need an aligner that understands it though. If it doesn't then an MM (or whatever) style tag permits us to use simple alignment methods while having a sidechannel for the methylation data. Ugly though!

@ahcm
Copy link

ahcm commented Feb 27, 2019

Btw. methylated Cs make a high proportion of the bases, e.g. in a human sample that we have it's about 25% of the Cs.
So a tag might be expensive.
Maybe some modifications deserve expanding the bases alphabet.

@yfarjoun
Copy link
Contributor

yfarjoun commented Feb 27, 2019 via email

@ahcm
Copy link

ahcm commented Feb 27, 2019

Information theory. Things that are common are represented compactly.

@jmarshall
Copy link
Member

There aren't any bits available in the BAM format for expanding the alphabet of bases, so for BAM the choices are encoding this in a tag or making an incompatible BAMv2 format.

IMHO it'd be useful (at least for viewing!) if methylated-C (and a handful of others) could also be represented directly in the SEQ field in SAM, where there is scope for expanding the alphabet. Note though that IUPAC ambiguity characters are already used in that field, so there's more letters already taken than you might guess.

@yfarjoun
Copy link
Contributor

yfarjoun commented Feb 27, 2019 via email

@jmarshall
Copy link
Member

Yeah, by “viewing” I meant human beings looking at textual SAM files, and that transformation that you just described rather well is exactly what I had in mind. Seems like it would be useful for tools to be able to both read and write such SAM files (perhaps not by default!).

@jkbonfield
Copy link
Contributor Author

For clarification, I'll use the term "fundamental seq" to mean the unmethylated original sequence.

I'm all in favour of a transform request that tags "fundamental seq" plus methylation tag to produce methylated seq, but the limitations of BAM are what dictate splitting this into seq+tag. I also think it is better for most analysis tools if the sequence they read is the "fundamental seq". If we start storing other base types there then most software will interpret this as N, which is a huge problem. Having the option to transform sequence only for specific tools gives us the most flexibility.

As for size, tag doesn't have to be large. Some forms are larger in memory than others, but we can do data compression efficiently on the tags still. Potentially more so than putting this data into the sequence field, which will cause havoc for reference based compressors such as CRAM.

@lh3
Copy link
Member

lh3 commented Feb 28, 2019

we could have a SAM-only representation that has well-defined transformation into the "traditional" SEQ+tag representation.

There are 15 letters in IUB. U is used. X should be reserved. We have 9 letters left in the alphabet. For each type of modification, we need two letters, one for forward strand and the other for reverse strand to store in SAM. We can only introduce 4 major classes of modifications unless we get symbols or lowercase involved. Is that enough?

As to the tag, another option is to use a structured array like:

XX:B:i,27551,2,10,20,-27552,1,30

"+27551" indicates methylation on the forward "strand". There are "2" modifications at offset 10 and 20, respectively. There is "1" modification of 27552 on the reverse "strand" at 30. I quoted "strand" because there are two types of strands: read strand and modification strand, which may not be the same (e.g. in case of Illumina bisulfite sequencing). I am not sure what is the best to use. This is an intricate issue.

Such a B-typed array is easier for programs to parse, though is not good for eyeballing. I am also good with @jkbonfield's proposal.

@jkbonfield
Copy link
Contributor Author

I need to get back to my evaluation of what encodings work well and what don't. [I've been rather side-tracked recently by the CRAMv4 prototyping (in Javascript Heng!), but that's for a different issue.] From a compression point all can be morphed into each other of course, but I'm talking about existing methods in both BAM and CRAM. Several worked OK though (and several not so great).

Anything in ASCII is doomed to failure due to alphabet sizes. It's growing all the time and while currently we're dominated by just a few symbols, there are hundreds of modifications known about and over time we'll start to see instruments that can distinguish them, or distinguish to one of several. That means ambiguity codes, which further expands our alphabet. I think just kill the idea of embedding in the sequence and use a side channel from day one so we don't have to retrofit it in at a later stage.

@jkbonfield
Copy link
Contributor Author

Information theory. Things that are common are represented compactly.

More precisely it's not "common", but "accurately predicted". Eg "e" is more common than "u" in English text, but we can predict the presence of "u" following a "q" with very high accuracy so those typically take up less space. PPM for the win. :-)

It's also true whether they are embedded within the SEQ field or whether they are common elements of a string or B array, although some standard compression tools will do better on one than the other despite being equivalent from an entropy perspective.

We also need to consider memory footprint. It's quite efficient to compress a long string of characters with low information content, eg the BQ array which is mostly "@" (@=64, so no +/- change); probably more so than attempting a CIGAR / MD style encoding. However the BAM record in memory is still large as it'll almost certainly be the uncompressed form.

@ahcm
Copy link

ahcm commented Mar 1, 2019

If something compresses well, it just meins that it is encoded
inefficienly.

And while you want efficiency you also want readability and room for all the modifications.

How about UTF-8?

@jkbonfield
Copy link
Contributor Author

How about UTF-8?

Like http://modomics.genesilico.pl/modifications/ ?

Please no! The fact is we need the sequence to be unadulterated so that BAM works (it cannot deal with more than is currently used) and so that many other existing tools don't interpret unknown symbols as Ns.

Methylation support is a new feature, so it should be a new field without changing the meaning of the existing one.

@jkbonfield
Copy link
Contributor Author

I haven't got time to give the full summary here, but some conclusions from experiments so far.

  • Dotty syntax (".....m...h......1......"), similar to BQ tag encoding, generally take up less storage, especially when modifications are frequent than CIGAR style syntax (4m3h5M etc). Saving depends on format and other factors, but it's not a huge factor and probably under 10% saved.

  • Per base type tags offer better compression, mainly due to context. Eg instead of seq ACGTCCGAGC, with methylation AmGTCC1AGC as mm:Z:.m....1... we could have mC:Z:m... mG:Z:.1. with mC and mG only recording the C and G bases (we'd need xn for others/ambiguities). The same also applies if we use a CIGAR based encoding. The reason is simply a reduced alphabet and exploitation of the correlation between methylation and the SEQ field. It's more work to process though. Saving here is around 40%.

  • If we want to handle strand and links to larger alphabets via ChEBI (which I think we do), this is probably cleaner in a CIGAR style syntax. It adds a bit of overhead, but is still reasonable. Again best done split into 4 or 5 tags. E.g. mC:Z:2m25m5m57m11... mG:Z:1M1M4M4M3H3-...

@marcus1487
Copy link

This is really great discussion on storage in the SAM/BAM/CRAM format, but I wanted to raise a point for analysis downstream from this storage point. When aggregating methylated/modified bases per-reference site there does not seem to be a standard format. The ENCODE bedMethyl format is the closest thing I have found, but this format is not reproducibly extensible to multiple modifications (e.g. 5mC, 5hmC, 5fmC) called on the same set of reads. Adding custom modified base format tags to the VCF spec (modVCF?) seems like a pretty decent option to me (though the underlying ploidy assumptions would be broken a bit).

This may not be the right forum for this point, so please let me know if there is a more appropriate location to raise this point (or if this should be moved to a separate issue).

P.S. Background: I am working on methods at ONT to skip FASTQ and SAM/BAM/CRAM to give higher accuracy modified base calls anchored to the reference. There is a lot of information (e.g. confidence metrics for a set of modified bases per-reference site, per-read) to be carried through from signal processing to per-reference site read aggregation that would seem to be onerous on the SAM/BAM/CRAM specs (as discussed in this thread). As such I am looking for the correct per-reference site multiple-modified base format as a direct output.

@lh3
Copy link
Member

lh3 commented Mar 20, 2019

@marcus1487 bedMethyl is easier to parse, but it is less extensible. There is also no standard way to collate data from different samples. VCF may already do the job with symbolic alleles.

@marcus1487
Copy link

@lh3 I like that bedMethyl is easier to parse as well. And I am less worried about collating data from different samples (though it is a nice bonus). I am currently more concerned with including multiple modification types in the same file from a single sample. For example calling 6mA, 5mC and 5hmC and collating results across reads from a single sample. It seems like some version of VCF might be the best solution.

@jkbonfield
Copy link
Contributor Author

Apologies for the slow progress of this. I'm also juggling several other things and haven't managed to put the time on this it deserves. However it is also vital to get people closer to methylation work involved so I value all these inputs greatly. Thank you.

I don't see it as onerous to have to store methylation data in SAM/BAM/CRAM. Rather I see it as a way, if done correctly, of bringing uniformity and consistency with storing all data in the same place instead of suffering a proliferation of extra side-channels to hold the missing information. We've already seen several different letters being used to describe the same 5mC in fastq.

@marcus1487
Copy link

@jkbonfield to expand on my point on the onerous storage in [S|B|CR]AM, I am thinking more about storing probabilities for a number of modifications in the same read at the same reference/read position. This could improve downstream results when aggregating over reads. For example it is conceivable to output probabilities for canonical C, 5mC, 5hmC, 5fmC and more for each read at each position. Then when aggregating over reads these probabilities could be used to produce better estimate for the total fraction of each modification at a reference position than would hard thresholding values to pick just the most likely of those modifications within each read. This is why I think storage of methylation in the [S|B|CR]AM format might be onerous.

On the FASTQ storage issue, that is at least partially my fault (I arbitrarily added the Z base output for 5mC to the ONT mod base caller). I'd love to standardize that as well (especially since more mods are coming). Is there a consensus on how/with what single letter alphabet methylated/modified bases should be stored in a FASTQ?

@marcus1487
Copy link

Are there any thoughts/pointers for

  1. Standard alphabets for modified/methylated bases in FASTA/Q
  2. Storage format for aggregated (over reads) modification/methylation, that can include multiple modification types at a single genomic locus.
    a. Currently planning to use VCF with a couple non-standard fields (strand and estimated proportion of reads modified).

@marcus1487
Copy link

Has there been any development on the issues raised here? We are looking to release a tool which would output modified base calls anchored to a reference for many modifications simultaneously. As far as I can tell there is no format that supports this right now. As noted above, we are currently planning to use VCF format with some (hopefully quite logical) tweaks in order to support storage of multiple modified bases (include strand and mod specific format fields). If there are better alternatives that would be accepted by the wider field it would be great to get these integrated soon (if likely after release at this point).

The other major point for which I have not seen definitive conclusions is the storage of modified bases in a FASTA/FASTQ output (though I personally think the application of this data type should be quite limited, potentially not even supporting mapping). We are currently outputting these as alternative letters (for now these letters are arbitrary and even configurable by any user training a modified base model; see here for details). If there is an attempt to create a uniform "modified bases" alphabet it would be great to integrate this into the ONT toolchain to support wider adoption.

@jkbonfield
Copy link
Contributor Author

Unfortunately I just haven't had time to work on this issue myself recently. However I wouldn't be the right person to think about modifications in VCF anyway.

FASTA/FASTQ is challenging long term as the limited alphabet is going to cause problems at some point and there are no headers associated with those file formats to act as a dictionary. Are your default letters compatible with https://dnamod.hoffmanlab.org/? If so perhaps we have an emerging consensus for FAST[AQ].

@jkbonfield
Copy link
Contributor Author

jkbonfield commented May 31, 2019

Ok, current thinking is tags like:

MM:Z:Go+,28,Cm,23,14,2,12,12,18,13,14,38,42,13,23,10,95,6,39,79,Ch+,234

This string is (<unmodified-base><modification><strand>(,<pos-delta>)+)+

<pos-delta> here is a count of the number of unmodified-base of that type that have been since the last <modification>.

Ie the above has the 28th G modified to "o". It has the 23rd C modified to m, along with the 37th, 39th, etc. Finally is has the 234th C modified to Ch. (These may all be +1 actually, I think I count from zero..) Note we permit "N" as the unmodified base to count on which means it's the nth base rather than the nth "N" base. That should capture awkward scenarios.

Note we can encode modifications on the opposite strand too. Eg:

mC:Z:Gh-,189,Gm-,72,6,41,2,8,63,7

This is saying the 189th G in the sequence has a modification "h" (which applies to C so it's inherently the other strand anyway) on the opposite strand. This permits us to record modifications on either strand, either all for one strand or all for the other (single stranded) or for genuine double stranded sensing, including mods at the same site (vanishingly rare I expect).

Finally, the codes m, h, f, c, o, etc could be numerical ChEBI values. As they only occur once per code type it won't bloat the file significantly.

PS. The reason we do delta vs the last base of type <unmodified-base> rather than just position as a whole, is it harms compression to do the latter. We're having duplicate sequence information encoded given specific modifications only apply to specific base types (an extra 50-70% more storage cost for this tag).

Edit: this doesn't mean we're wedded to specific one letter codes. We should probably adopt the ones currently in use and recommended (see above), but maybe there should be some way (header?) of adding assigning new codes via ChEBI values. However that can come later as it'll be an easy extension.

(Similarly also for a quality score tag, should they be required.)

@jkbonfield
Copy link
Contributor Author

jkbonfield commented May 31, 2019

Here's some horrid perl code to experiment with base modifications.

add_meth2.pl method file.sam > new_file.sam

Methods are numbers:

  1. Do nothing.
  2. Store in SEQ field (fails for BAM)
  3. Store as MM tag, dotty style
  4. Store as MM tag, CIGAR style
    ...
    (stuff that's not relevant)
    ...
  5. Store as mC tag using the above (more or less) definition.

An example usage:

$ samtools view -h data/MinION_Targeted.cram |head -1000 | samtools view -O cram - |wc -c
813557

# Demonstrate methylation signal only added ~40Kb of extra data
$ expr `samtools view -h data/MinION_Targeted.cram |head -1000 | ./add_meth2.pl 25 | samtools view -O cram - |tee meth.cram | wc -c` - 813557
39633

# Extract simple methylation sequence (method 1) and convert back from method 25.
$ samtools view -h data/MinION_Targeted.cram |head -1000 | ./add_meth2.pl 1 | awk '!/^@/ {print $10}' > _meth_seq
$ samtools view meth.cram | ./m25-dot.pl > _meth25_seq
$ cmp _meth_seq _meth25_seq
(matches)

Code is here:

https://gist.github.com/jkbonfield/bb600c57e710f60cbfd79407230c4a4c

Note the $scale=10 bit which is stress testing with 10x the expected(?) rate. Most of my testing was scale=1.

Example content from tags:

C+m,30,80,25,10,76,86,1,25,C+h,211,G+o,163
G-h,50,562,G-m,34,20,10,7,33,9,58,4,47,106,53,32,34,18,48,25,9,5,2,26,10,17,7,64,10

So - pick it apart! What's wrong with it? Bound to be something :-)

Note: I know it's not hugely trivial to work with, but once we have a decent API that becomes irrelevant and the perl code demonstrates it's not overly problematic.

@avilella
Copy link

avilella commented Jun 4, 2019

Nice to see some code/prototyping in this ticket.
Q: how would we go from FASTQ to annotated SAM via alignment?

In the Dotty world, the aligner is given a pure ATCG(N) sequence string, and the SAM output is post-processed to push the remaining information in the FASTQ into the SAM tags area (as Dotty e.g. Bismark or other tagging methods, e.g. bwameth).

Would this be the same procedure for the option (3), MM tags?

@jkbonfield
Copy link
Contributor Author

Good question, but it's kind of not our problem. This group is focused on providing ways of storing the data in SAM/BAM/CRAM and VCF/BCF. We don't support FASTA/FASTQ within this GA4GH team. Maybe that's a mistake, but our current philosophy is to encourage people to use unaligned SAM (etc) instead of FASTA/FASTQ as it's so much more flexible and has file headers too.

Unofficially though, there are two possibilities I guess.

  1. If the sequence uses a very limited alphabet, then it could be put in-place in the sequence and the aligners would need updating to understand these letters. This wouldn't be my preferred solution as it's likely to break a LOT of tools.

  2. We could add auxiliary tags to the @fastq or >fasta header lines and use an aligner which already supports migration of that data to SAM auxiliary tags. The most likely outcome here is the methylation is just silently lost if using a non aux-aware aligner, so it fails safe-ish. The mechanism is already in use by some aligners (eg bwa).

Either way, it needs a compliant aligner.

@avilella
Copy link

avilella commented Jun 4, 2019

I think I understand: we currently would carry the information in the @fasta or >fasta header lines, give it to a compliant aligner (or wrapper around an aligner) that would align read by read and carry that information into the QNAME field and then a post-processing step (i.e the modcaller) could take the output of the alignment, apply whichever multi-read pileup-style set of rules apply to the aligned reads, combining the alignment information and the per-read information in the QNAME auxiliary tags, and convert that information into an MM-tagged SAM output.

Is my understanding correct @jkbonfield ?

Having re-read this ticket, I think I understand why people suggest this could be done at the VCF/BCF level (as @marcus1487 suggests), given that some of this mod-calling has similarities to the pileup-style way of calling variants from multiple aligned reads from a SAM. I guess it's a theoretical question as to which one is preferred: SAM with mod-calls (or per-read mod-info) in the MM-tags, or VCF/BCF with mod-calls from a BAM file that contains aligned reads (with per-read mod-info).

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 4, 2019

Yes that wrapper strategy should work.

As for SAM vs VCF, I don't see this as either-or, but more of an "and". With Fastq -> SAM -> VCF the same base modification data needs to be represented in each and migrated as appropriate.

Although for VCF it's likely we'll be working in counts or consensus and ambiguity code level. That's someone elses problem through ;-)

Edit: ie we can keep this as discussion, but ultimately we need separate PRs for SAMtags and VCF specs for the actual definitions.

@avilella
Copy link

avilella commented Jun 4, 2019

Thanks for the explanation. It does sound like there are separate PRs:

  1. SAMtags for people wanting to add per-read mod-info that's extendable into multiple mods (e.g. ONT).
  2. VCF specs for mod-calling that takes several counts and makes a call per position (CpG or CHH, etc.)

Then tools to convert from modVCF to bedMethyl or bedGraph or other formats.

@cjw85
Copy link

cjw85 commented Aug 5, 2020

@jkbonfield

Have you (or anyone else you know) tried extending the pileup API, or prototyped any other functionality, for htslib to be able to provide access to this data in a convenient fashion? I can see that a common question would be, "what's the probability that the base at reference position i of alignment A modified to e.g. mC?". Further to aggregate those numbers over reads.

@jkbonfield
Copy link
Contributor Author

Sorry, not yet, but we need to do it.

This is getting off-topic for hts-specs being more htslib, but do you have a suggestion for what the API should look like? Pileup is pretty crufty as an interface as the structures are public and non-changeable without breaking ABI. However there's room to exploit the "pileup_cd" member perhaps although it wasn't what it was intended for.

The alternative is a function that given a bam structure an a query-position (qpos) within it, returns the relative modification possibilities. It'd ideally need to cache state that between calls to avoid repeated linear scanning, and again there's no obvious way to do that. The htslib pileup interface is simply crufty. I'm open to ideas though!

@cjw85
Copy link

cjw85 commented Aug 5, 2020

Without dragging this too far off topic, your comments are in line with my own ideas. I do think something like the pileup API is warranted such that just as we can access bases from alignments and their qualities at a reference position we can get modified bases information. If you want an example use case to orient ideas see fast5mod which first converts Guppy's Fast5 data tables into sam with the mod-base scores in array-valued tags, and then builds a per reference position data table.

I should say that my initial question here was motivated by the fact that the Guppy basecaller will soon be outputting data in the form described in #418.

@jkbonfield
Copy link
Contributor Author

An aside and just a FYI (and not really hts-specs): I've started on adding these to htslib. First thing first is a general "I have a read, show me the modifications" style interface. Once that's done I'll consider whether we can integrate it somewhere into pileup.

@jkbonfield
Copy link
Contributor Author

Closing as this was solved in #418, but we may need a new issue / PR for tracking the inevitable next step of representing these in VCF.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants