Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trouble generating images from a VCF file: Doesn't work on my data or sample data #101

Closed
moldach opened this issue Apr 15, 2020 · 15 comments

Comments

@moldach
Copy link

moldach commented Apr 15, 2020

I'm able to run the samplot.py on the example data and on my own data but am having issues when I try the samplot_vcf.py script.

As far as I know there is no Repeat Masker track for C. elegans (at least not on UCSC) so the --important_regions regions.bed has been omitted.

python ~/bin/samplot/src/samplot_vcf.py \
    --vcf Pindel.vcf\
    -d test/\
    -O png\
    -b 470.sorted.dedupped.bam > samplot_commands.sh

[W::bcf_hdr_check_sanity] PL should be declared as Number=G

I see an empty samplot_commands.sh file and when I go to open the index.html I see zero records.

Because this did not work I want to validate on the example data; however, I get similar results using all of the .bam files provided with samplot:

cd ~/bin/samplot/test/data/
python ~/bin/samplot/src/samplot_vcf.py --vcf test.vcf -d vcf-test/ -O png -b NA12878_restricted.bam > samplot_commands.sh

Your help would be greatly appreciated

@jbelyeu
Copy link
Collaborator

jbelyeu commented Apr 15, 2020 via email

@moldach
Copy link
Author

moldach commented Apr 15, 2020

To install with conda, just run "conda install samplot". To run the samplot vcf command: "samplot vcf" instead of "python samplot_vcf.py".

I've installed viz conda and run into the same issue (both on my data and test data from samplot repo)

samplot vcf \
    --vcf Delly.vcf\
    -d test/\
    -O png\
    -b 470.sorted.dedupped.bam > samplot_commands.sh
cd ~/bin/samplot/test/data
samplot vcf \
    --vcf test.vcf\
    -d test/\
    -O png\
    -b NA12878_restricted.bam > samplot_commands.sh

@jbelyeu
Copy link
Collaborator

jbelyeu commented Apr 15, 2020

My bad, I forgot to mention an additional change. The default behavior is now to autorun instead of printing out the commands. Did it create the images?

You can also specify that you want to run manually by using the --manual_run option. This will write the commands to samplot_vcf_cmds.tmp (and you can change the command file name using the --command_file option.

Again, my apologies

@moldach
Copy link
Author

moldach commented Apr 15, 2020

Hi @jbelyeu

So when you say autorun instead of printing out the commands I assume you meant to omit the > samplot_commands.sh part?

samplot vcf \
    --manual_run \
    --vcf Delly.vcf\
    -d test/\
    -O png\
    -b 470.sorted.dedupped.bam

The output directory test/ only contain index.html and when I open this firefox index.html it says

No data available in table. Showing 1-0 of 0 records

@jbelyeu
Copy link
Collaborator

jbelyeu commented Apr 15, 2020

Yes, and I'm not sure what the issue is now. Can you cd into the samplot directory and run the tests (bash runtests.sh) and see if PNGs are created in test/func/test_vcf_auto_dir/?

Also check that the version is 1.0.12 (samplot -v)

@moldach
Copy link
Author

moldach commented Apr 15, 2020

Hi @jbelyeu okay so bash runtests.sh creates PNGs and peering into the source code I also tried the following:

samplot vcf \
    -d test/ \
    --vcf ~/bin/samplot-master/test/data/test.vcf \
    --sample_ids HG002 \
    -b ~/bin/samplot-master/test/data/HG002_Illumina.bam \
    --manual_run \
    --command_file test_samplot.sh

bash test_samplot.sh
firefox test/index.html

This also works.
Now I'll try it on my data:

Pindel

samplot vcf \
    -d YOLO/ \
    --vcf Pindel.vcf \
    --sample_ids maddog \
    -b  470.sorted.dedupped.bam \
    --manual_run \
    --command_file test_samplot.sh

Here I don't see anything in test_samplot.sh or index.html.
Luckily I have a number of (non-empty) call sets we can troubleshoot with: CNVnator, Delly, GRIDSS, Lumpy, Manta, MindTheGap, NGSep, Pindel, Tardis.

NGSep

(base) mtg@pop-os:~/MADDOG$ samplot vcf \
    -d YOLO/ \
    --vcf NGSep.vcf \
    --sample_ids maddog  \
   -b  470.sorted.dedupped.bam  \
   --manual_run  \
   --command_file test_samplot.sh

Traceback (most recent call last):
  File "/home/mtg/anaconda3/bin/samplot", line 10, in <module>
    sys.exit(main())
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/__main__.py", line 31, in main
    args.func(parser)
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/samplot_vcf.py", line 444, in vcf
    svtype = variant.info.get("SVTYPE", "SV")
  File "pysam/libcbcf.pyx", line 2676, in pysam.libcbcf.VariantRecordInfo.get
ValueError: Invalid header

Lumpy

Trying Lumpy i see the following:

(base) mtg@pop-os:~/MADDOG$ samplot vcf  \
   -d YOLO/  \
   --vcf Lumpy.vcf \
   --sample_ids maddog \
   -b  470.sorted.dedupped.bam \
   --manual_run  \
   --command_file test_samplot.sh

[W::vcf_parse] Contig 'I' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'II' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'III' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'IV' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'V' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'X' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'MtDNA' is not defined in the header. (Quick workaround: index the file with tabix.)

Okay let's follow the prompt:

(base) mtg@pop-os:~/MADDOG$ bgzip -c Lumpy.vcf > Lumpy.vcf.gz
(base) mtg@pop-os:~/MADDOG$ tabix -p vcf Lumpy.vcf.gz 
[E::hts_idx_push] Unsorted positions on sequence #1: 670648 followed by 230637
tbx_index_build failed: Lumpy.vcf.gz

All of the other call sets produce either no-output like Pindel or the error like NGSep.

@jbelyeu
Copy link
Collaborator

jbelyeu commented Apr 15, 2020

Ok, well, that narrows it down to data-specific problems. Samplot vcf doesn't plot small events by default (for more details run samplot vcf -h and check out the --min_bp argument), but if the variants in your pindel file are larger than 20 bp that shouldn't be an issue.

The other vcfs have different isssues. Lumpy.vcf probably just needs to be sorted before indexing (check this out: https://www.biostars.org/p/299659/). I'm not familiar with ngsep personally, but it appears not to include the SVTYPE field, which samplot vcf requires.

@jbelyeu
Copy link
Collaborator

jbelyeu commented Apr 15, 2020

Is sharing a subset of your pindel data an option? If so I'd be happy to try to reproduce the issue myself

@moldach
Copy link
Author

moldach commented Apr 15, 2020

Ok, well, that narrows it down to data-specific problems. Samplot vcf doesn't plot small events by default (for more details run samplot vcf -h and check out the --min_bp argument), but if the variants in your pindel file are larger than 20 bp that shouldn't be an issue.

There are SVs larger than 20 bp in these files.

The other vcfs have different isssues. Lumpy.vcf probably just needs to be sorted before indexing (check this out: https://www.biostars.org/p/299659/). I'm not familiar with ngsep personally, but it appears not to include the SVTYPE field, which samplot vcf requires.

Earlier I ran into similar problems for the .gff3 annotations; I got errors about needing to sort and index

samplot plot \
    -n MADDOG \
    -b 470.sorted.dedupped.bam \
    -o chrI_464950_465071_genes.png \
    -c chrI \
    -s 464950 \
    -e 465071 \
    -t DEL \
    -T Caenorhabditis_elegans.WBcel235.99.gff3.gz

Window size is under 1.5x the estimated fragment length and will be resized to 502. Rerun with -w 60 to override
Traceback (most recent call last):
  File "/home/mtg/anaconda3/bin/samplot", line 10, in <module>
    sys.exit(main())
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/__main__.py", line 31, in main
    args.func(parser)
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/samplot.py", line 3456, in plot
    options.xaxis_label_fontsize,
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/samplot.py", line 3244, in plot_transcript
    transcript_plan = get_transcript_plan(ranges, transcript_file)
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/samplot.py", line 3147, in get_transcript_plan
    itr = get_tabix_iter(r.chrm, r.start, r.end, transcript_file)
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/samplot.py", line 183, in get_tabix_iter
    tbx = pysam.TabixFile(datafile)
  File "pysam/libctabix.pyx", line 351, in pysam.libctabix.TabixFile.__cinit__
  File "pysam/libctabix.pyx", line 383, in pysam.libctabix.TabixFile._open
OSError: index `Caenorhabditis_elegans.WBcel235.99.gff3.gz.tbi` not found

After sorting and indexing the samplot plot code above works:

bedtools sort -i Caenorhabditis_elegans.WBcel235.99.gff3.gz | bgzip -c > Caenorhabditis_elegans.WBcel235.99.sorted.gff3.gz
samplot plot \
    -n MADDOG \
    -b 470.sorted.dedupped.bam \
    -o chrI_464950_465071_genes.png \
    -c chrI \
    -s 464950 \
    -e 465071 \
    -t DEL \
    -T Caenorhabditis_elegans.WBcel235.99.sorted.gff3.gz

However, following the biostars advice to sort/index the .vcf did not solve the issue:

cat Lumpy.vcf  | awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' > Lumpy_sorted.vcf
bgzip -c Lumpy_sorted.vcf > Lumpy_sorted.vcf.gz
tabix -p vcf Lumpy_sorted.vcf.gz 

samplot vcf  \
   -d YOLO/  \
   --vcf Lumpy_sorted.vcf.gz \
   --sample_ids maddog \
   -b  470.sorted.dedupped.bam \
   --manual_run  \
   --command_file test_samplot.sh

@jbelyeu
Copy link
Collaborator

jbelyeu commented Apr 15, 2020

What error did that last command generate?

@moldach
Copy link
Author

moldach commented Apr 15, 2020

This is a tricky one that passes with no error message

@jbelyeu
Copy link
Collaborator

jbelyeu commented Apr 15, 2020

As I understand it, the first issue with the lumpy data was a failure to read the vcf, which threw an error. Looks like the sort/index steps solved that and we're back to the original problem that also occurred with the pindel vcf. That's what I call progress!

Another thing you might check is that there's an exact match between the maddog sample id and the column name in the vcf

@moldach
Copy link
Author

moldach commented Apr 15, 2020

As I understand it, the first issue with the lumpy data was a failure to read the vcf, which threw an error. Looks like the sort/index steps solved that and we're back to the original problem that also occurred with the pindel vcf. That's what I call progress!

Science typically progresses funeral by funeral so were on a better track!

Another thing you might check is that there's an exact match between the maddog sample id and the column name in the vcf

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT maddog

Looks like it's the same name in the column as was used in the sample-id.

What's your e-mail? I have permission to share the Pindel.vcf. I can share a dropbox link with you.

@jbelyeu
Copy link
Collaborator

jbelyeu commented Apr 15, 2020

jrbelyeu@gmail.com

I'll also need part of the alignment file. You can use samtools to extract a region (samtools view -hb 470.sorted.dedupped.bam {}:{}-{}> partial.bam). I'll just need enough of the bam to fully encompass a variant or two that should be plotted.

@jbelyeu
Copy link
Collaborator

jbelyeu commented Apr 16, 2020

Ok, the issue is that in this Pindel vcf, all of the the genotype for this sample, for all variants, is homref. Samplot vcf only plots when it finds a sample recorded as het/homalt. Is this also the case for the Lumpy vcf? You may want to look into https://github.com/hall-lab/svtyper

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants