Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IUPAC codes in VCF REF field #268

Closed
ACEnglish opened this issue May 25, 2024 · 1 comment
Closed

IUPAC codes in VCF REF field #268

ACEnglish opened this issue May 25, 2024 · 1 comment
Labels

Comments

@ACEnglish
Copy link

Hello,

First of all, thank you for supporting this package. I've had fun using it.

I received an issue on my tool that uses noodles-vcf in which variant records with IUPAC codes in the REF were being thrown out. In an attempt to fix the problem I started by moving to the latest version of noodles-vcf (0.57) and found that the reading error was no longer being thrown.

However, when attempting to write the record back out, the error is being thrown. This would be fine except noodles writes partial lines in the output VCF which are truncated at the offending bases. For example:

chr20	24408074	.	GTATTTCCAGCCTTGGCCATAGchr20	24408077	.	TTTCCAGCCTTGGCC

(note that the rest of the entry @ 24408077 was written, just truncated here for readability)

Is it possible for noodles-vcf to raise the error before writing in order to prevent the corrupted lines? My idea is that I can catch the error, attempt to fix the VCF entry, and retry writing.

I think no is a more than fair answer to this request. If so, I'd just need to validate all entries before trying to write, which would incur some overhead doing redundant checks to noodles.

If the answer is yes, but you don't have time, I'm willing to try creating a pull request. My idea would be to edit io/writer/record so that the write_* calls are sent a temporary buffer instead of the final writer. Then, at the end of the method the buffer is sent to the final writer. However, I'm not familiar enough with the code base to know if this would be sufficient. So any guidance on what all would need to change would be helpful.

@zaeleus zaeleus added the vcf label May 25, 2024
@zaeleus
Copy link
Owner

zaeleus commented May 26, 2024

As you noted in ACEnglish/kanpig#1, the error is being triggered because the input is invalid. Reference bases must be {A, C, G, T, N}. In previous versions of noodles, this was validated when parsed, but it was changed to when serializing to allow at least reading such cases.

I think the solution is to integrate an ambiguity base resolver into noodles-vcf's record reference bases writer. This likely covers the vast majority of "invalid" reference bases. Although this behavior is undefined in VCF < 4.3, it is valid to apply to those versions.

Is it possible for noodles-vcf to raise the error before writing in order to prevent the corrupted lines?

This isn't really ideal because the given record's in-memory format is opaque, i.e., records are implementations of vcf::variant::Record. It's the reason why validation is done inline; the record is not guaranteed to have been previously parsed (e.g., vcf::Record and bcf::Record).

My idea would be to edit io/writer/record so that the write_* calls are sent a temporary buffer instead of the final writer.

For this approach, my recommendation would be to create a wrapper around vcf::io::Writer, e.g.,

main.rs
// cargo add noodles@0.74.0 --features core,vcf

use std::io::{self, Write};

use noodles::{
    core::Position,
    vcf::{
        self as vcf,
        header::record::value::{map::Contig, Map},
        variant::io::Write as _,
    },
};

fn main() -> io::Result<()> {
    let stdout = io::stdout().lock();
    let mut writer = VcfLineWriter::new(vcf::io::Writer::new(stdout));

    let header = vcf::Header::builder()
        .add_contig("sq0", Map::<Contig>::new())
        .build();

    writer.write_header(&header)?;

    let record = vcf::variant::RecordBuf::builder()
        .set_reference_sequence_name("sq0")
        .set_variant_start(Position::MIN)
        .set_reference_bases("R")
        .build();

    writer.write_variant_record(&header, &record)?;

    Ok(())
}

struct VcfLineWriter<W> {
    inner: vcf::io::Writer<W>,
    buf: Vec<u8>,
}

impl<W> VcfLineWriter<W>
where
    W: Write,
{
    fn new(inner: vcf::io::Writer<W>) -> Self {
        Self {
            inner,
            buf: Vec::new(),
        }
    }

    fn write_header(&mut self, header: &vcf::Header) -> io::Result<()> {
        self.buf.clear();

        let mut writer = vcf::io::Writer::new(&mut self.buf);
        writer.write_header(header)?;

        self.inner.get_mut().write_all(&self.buf)
    }

    fn write_variant_record<R>(&mut self, header: &vcf::Header, record: &R) -> io::Result<()>
    where
        R: vcf::variant::Record,
    {
        self.buf.clear();

        let mut writer = vcf::io::Writer::new(&mut self.buf);
        writer.write_variant_record(header, record)?;

        self.inner.get_mut().write_all(&self.buf)
    }
}

Thanks for the report, and let me know if you have further questions or other issues with migrating to noodles-vcf >= 0.52.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants