List the cases where tataki's result differs from our expectations #6

Open
inutano opened this issue Jul 11, 2024 · 2 comments
Labels
documentation Improvements or additions to documentation invalid This doesn't seem right

inutano commented Jul 11, 2024

case: a file with only header lines

##fileformat=VCFv4.2
##nanopolish_window=MN908947.3:1-29902
##INFO=<ID=TotalReads,Number=1,Type=Integer,Description="The number of event-space reads used to call the variant">
##INFO=<ID=SupportFraction,Number=1,Type=Float,Description="The fraction of event-space reads that support the variant">
##INFO=<ID=SupportFractionByStrand,Number=2,Type=Float,Description="Fraction of event-space reads that support the variant for each strand">
##INFO=<ID=BaseCalledReadsWithVariant,Number=1,Type=Integer,Description="The number of base-space reads that support the variant">
##INFO=<ID=BaseCalledFraction,Number=1,Type=Float,Description="The fraction of base-space reads that support the variant">
##INFO=<ID=AlleleCount,Number=1,Type=Integer,Description="The inferred number of copies of the allele">
##INFO=<ID=StrandSupport,Number=4,Type=Integer,Description="Number of reads supporting the REF and ALT allele, by strand">
##INFO=<ID=StrandFisherTest,Number=1,Type=Integer,Description="Strand bias fisher test">
##INFO=<ID=SOR,Number=1,Type=Float,Description="StrandOddsRatio test from GATK">
##INFO=<ID=RefContext,Number=1,Type=String,Description="The reference sequence context surrounding the variant call">
##INFO=<ID=Pool,Number=1,Type=String,Description="The pool name">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

This file is detected as a BED file because it contains only header lines and no variant records.
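One way to catch this case is to check for the `##fileformat=VCF` line before falling back to other parsers. The sketch below is only an illustration in Python, not tataki's actual detection logic; the function name `is_header_only_vcf` is hypothetical.

```python
def is_header_only_vcf(text: str) -> bool:
    """Return True if the text is a VCF header with no record lines.

    Hypothetical helper for illustration; not part of tataki.
    """
    # Ignore blank lines so a trailing newline does not affect the result.
    lines = [line for line in text.splitlines() if line.strip()]
    # A VCF file is expected to begin with the ##fileformat line.
    if not lines or not lines[0].startswith("##fileformat=VCF"):
        return False
    # Header-only means every line is a meta line or the column header line.
    return all(line.startswith("#") for line in lines)


sample = (
    "##fileformat=VCFv4.2\n"
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
)
print(is_header_only_vcf(sample))  # True
```

A check like this would have to run before the BED parser for it to change the result reported above.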

case: gzipped binary files

$ tataki tiny.bam.gz --yaml -v
[2024-07-11T07:26:40Z INFO  tataki::module] tataki started
[2024-07-11T07:26:40Z DEBUG tataki::module] Args: Args { input: ["tiny.bam.gz"], output: None, output_format: Csv, yaml: true, cache_dir: None, conf: None, tidy: false, no_decompress: false, num_records: 100000, dry_run: false, verbose: true, quiet: false }
[2024-07-11T07:26:40Z DEBUG tataki::module] Output format: Yaml
[2024-07-11T07:26:40Z INFO  tataki::module] Created temporary directory: /tmp/tataki_2024-0711-162640_BgiSCI
[2024-07-11T07:26:40Z INFO  tataki::module] Processing input: tiny.bam.gz
[2024-07-11T07:26:40Z DEBUG tataki::source] Provided input is in GZ format
Error: stream did not contain valid UTF-8

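The file is gzipped, but tataki (specifically its internal Rust GZ decoder) expects the decompressed content to be a plain-text file. The payload here is a binary BAM file, so decoding it fails with the UTF-8 error above.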
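The failure can be reproduced outside tataki by decompressing a gzip stream as raw bytes and only then testing whether it is valid UTF-8. This is a minimal Python sketch of that idea, not the Rust code tataki uses; `gzip_payload_kind` is a hypothetical name, and the byte strings are made-up stand-ins for real files.

```python
import gzip


def gzip_payload_kind(data: bytes) -> str:
    """Classify a gzip payload as 'text' or 'binary' (hypothetical helper)."""
    # Decompress to bytes first; no text decoding happens at this step.
    payload = gzip.decompress(data)
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        # Binary formats such as BAM end up here instead of aborting the run.
        return "binary"
    return "text"


# Stand-ins for real inputs: a few non-UTF-8 bytes and a VCF header line.
binary_gz = gzip.compress(b"BAM\x01\xff\xfe")
text_gz = gzip.compress(b"##fileformat=VCFv4.2\n")
print(gzip_payload_kind(binary_gz), gzip_payload_kind(text_gz))
```

Handling the payload as bytes would let the binary parsers run on the decompressed content instead of stopping at the decode step.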

case: BGZF

$ tataki SAMPLE_01.pass.vcf.gz --yaml
[2024-07-11T07:36:49Z INFO  tataki::module] tataki started
[2024-07-11T07:36:49Z INFO  tataki::module] Created temporary directory: /tmp/tataki_2024-0711-163649_HL6qcv
[2024-07-11T07:36:49Z INFO  tataki::module] Processing input: SAMPLE_01.pass.vcf.gz
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser empty
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser bam
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser bcf
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser bed
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser cram
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser fasta
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser fastq
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser gff3
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser gtf
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser sam
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser vcf
[2024-07-11T07:36:49Z INFO  tataki::module] Detected!! vcf
[2024-07-11T07:36:49Z INFO  tataki::module] Deleting temporary directory: /tmp/tataki_2024-0711-163649_HL6qcv
SAMPLE_01.pass.vcf.gz:
  id: http://edamontology.org/format_3016
  label: VCF
  decompressed:
    label: null
    id: null

The file SAMPLE_01.pass.vcf.gz looks like a normal GZIP file, but it is actually a BGZF (Blocked GNU Zip Format) file. Because it has a header showing that the file inside is VCF, tataki reports it as a normal VCF file.
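BGZF blocks are valid gzip members that carry an extra `BC` subfield in the gzip header, so the two can be told apart by inspecting the first bytes. The sketch below follows the BGZF layout described in the SAM/BAM specification; it is written in Python for illustration, and `is_bgzf` is a hypothetical name, not a tataki function.

```python
import gzip
import struct


def is_bgzf(data: bytes) -> bool:
    """Return True if data starts with a BGZF block header (hypothetical helper)."""
    # gzip magic (1f 8b) followed by the deflate compression method (08).
    if len(data) < 12 or data[:3] != b"\x1f\x8b\x08":
        return False
    # BGZF requires the FEXTRA flag (bit 2) to be set.
    if not data[3] & 0x04:
        return False
    # XLEN sits at offset 10 and gives the size of the extra field.
    xlen = struct.unpack_from("<H", data, 10)[0]
    extra = data[12:12 + xlen]
    pos = 0
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False


# The 28-byte BGZF end-of-file marker block from the SAM/BAM specification.
bgzf_eof = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)
print(is_bgzf(bgzf_eof), is_bgzf(gzip.compress(b"plain gzip")))
```

Only the first block header is inspected, so the check does not need to decompress anything.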

Collaborator

fmaccha commented Jul 11, 2024

I will describe these in the README.

@fmaccha fmaccha self-assigned this Jul 11, 2024
@fmaccha fmaccha added the documentation Improvements or additions to documentation label Jul 11, 2024
Author

inutano commented Jul 11, 2024

note:
tataki may need to distinguish these formats: compressed BCF, uncompressed BCF, compressed VCF, uncompressed VCF
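As a rough illustration of how the four cases could be separated, the sketch below peeks at the first bytes, decompressing just enough of a gzip or BGZF stream to read the magic. It assumes that BCF content starts with the bytes `BCF` and VCF content with `##fileformat=VCF`; the function name `classify_variant_file` and the sample inputs are hypothetical, and this is not how tataki currently works.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def classify_variant_file(data: bytes) -> str:
    """Label data as compressed/uncompressed BCF/VCF (hypothetical helper)."""
    compressed = data.startswith(GZIP_MAGIC)
    if compressed:
        # wbits=31 selects gzip framing; stop after 64 decompressed bytes.
        head = zlib.decompressobj(31).decompress(data, 64)
    else:
        head = data[:64]
    if head.startswith(b"BCF"):
        kind = "BCF"
    elif head.startswith(b"##fileformat=VCF"):
        kind = "VCF"
    else:
        return "unknown"
    return ("compressed " if compressed else "uncompressed ") + kind


# Made-up minimal inputs standing in for real files.
vcf = b"##fileformat=VCFv4.2\n"
bcf = b"BCF\x02\x02" + b"\x00" * 8
for blob in (vcf, gzip.compress(vcf), bcf, gzip.compress(bcf)):
    print(classify_variant_file(blob))
```

Real compressed VCF and BCF files are normally BGZF rather than plain gzip; since BGZF blocks are gzip members, the same decompression step should apply to them.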

@fmaccha fmaccha added the invalid This doesn't seem right label Jul 13, 2024