SampCompDB error with bed file #132

samir-watson · 2020-04-29T15:16:07Z

Hi,
I am getting the following error when trying to use SampCompDB to save a bed file:

nanocompore.common.NanocomporeError: Some references are missing from the BED file provided

I am using gencode.v33.transcripts.fa for transcript alignment, and I am making the bed file from gencode.v33.annotation.gtf with the gtf2bed tool.

I'm not sure what the problem is here, as I am using the same gencode release (v33) for both the gtf and the fasta file.

Here is my code:

from nanocompore.SampCompDB import SampCompDB

db = SampCompDB(
    db_fn="/home/samirwatson/faststorage/NAT10/basecalled/Nanocompore_results/out_SampComp.db",
    fasta_fn="/home/samirwatson/faststorage/NAT10/gencode.v33.transcripts.fa",
    log_level="warning",
    bed_fn="/home/samirwatson/faststorage/NAT10/gencode.v33.annotation.gtf.bed",
)

Here is the traceback:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/home/samirwatson/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py", line 108, in __init__
    self.results = self.__calculate_results(adjust=True)
  File "/home/samirwatson/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py", line 166, in __calculate_results
    raise NanocomporeError("Some references are missing from the BED file provided")
nanocompore.common.NanocomporeError: Some references are missing from the BED file provided

tleonardi · 2020-04-29T17:17:59Z

Hi @samir-watson,
there must be some mismatch in the transcript IDs between the fasta file and the BED file.
You can check what ref_ids are used in the SampCompDB (by loading it without passing a BED file) and compare those to the ref_ids used in the BED file (col 4). Do they match?
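One way to run that comparison without loading the database is to compare the FASTA headers directly against column 4 of the BED file. The sketch below is only an illustration: the helper names are made up, and it assumes the ref_id is the header text up to the first whitespace, which is how most aligners name references.

```python
def fasta_ids(lines):
    """Collect sequence IDs from FASTA lines.

    Assumption: the ID is the header text after '>' up to the first whitespace.
    """
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def bed_names(lines):
    """Collect the name field (column 4) from tab-separated BED lines."""
    names = set()
    for line in lines:
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            names.add(fields[3])
    return names


def missing_from_bed(fasta_lines, bed_lines):
    """Return the FASTA IDs that have no matching name in the BED file."""
    return sorted(fasta_ids(fasta_lines) - bed_names(bed_lines))
```

To use it, open both files and pass the handles, for example `missing_from_bed(open("transcripts.fa"), open("annotation.bed"))` (file names here are placeholders). An empty list means every reference has a BED entry; anything else shows exactly which IDs differ and how.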

The way I usually do it is to convert the Gencode/Ensembl GTF to BED (with bedparse), then convert the BED to fasta with bedtools getfasta and fix the fasta headers with a regex:

bedtools getfasta -fi ${genome_fasta} -s -split -name -bed reference_transcriptome.bed -fo - | perl -pe 's/>(.+)\([+-]\)/>$1/'

As a reference, you can also check the code of our nextflow pipeline.

samir-watson · 2020-04-30T21:34:46Z

Hi @tleonardi,

I have had a good look at what was happening, and it seems to be a problem with pipes in the fasta headers. They are carried over from the fasta I used during SampComp, so when I compare the SampCompDB database to the bed file there is no longer a match.
I will try to redo the alignment with the pipes and extra IDs removed and see if that works.
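For anyone hitting the same thing, here is a minimal sketch of stripping the extra pipe-separated fields from the headers before aligning. The function names are hypothetical, and it assumes the first field before the first pipe is the transcript ID you want to keep; check that it matches column 4 of your bed file before redoing the alignment.

```python
def clean_header(line):
    """Keep only the text before the first '|' on FASTA header lines.

    Sequence lines and headers without a pipe are returned unchanged.
    """
    if line.startswith(">") and "|" in line:
        return line.rstrip("\n").split("|", 1)[0] + "\n"
    return line


def clean_fasta(lines):
    """Yield FASTA lines with simplified headers."""
    for line in lines:
        yield clean_header(line)
```

Writing the result back out is then just a loop over `clean_fasta(open(in_path))` that writes each line to a new file (`in_path` being whatever your input fasta is).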

On another note, I found a tiny issue in SampCompDB.py line 421: it says "red_if" and I think it's meant to be ref_id.

Cheers,
Samir

tleonardi · 2020-05-01T11:05:19Z

Hi @samir-watson,
thanks for spotting the typo. It's getting fixed.

tleonardi mentioned this issue May 1, 2020

Typo in header of shift_stats file #136

Closed

tleonardi closed this as completed May 1, 2020

nemitheasura mentioned this issue Jan 7, 2021

Error with parsing bed file #176

Closed
