Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SampCompDB error with bed file #132

Closed
samir-watson opened this issue Apr 29, 2020 · 3 comments
Closed

SampCompDB error with bed file #132

samir-watson opened this issue Apr 29, 2020 · 3 comments

Comments

@samir-watson
Copy link

Hi,
I am getting the following error when trying to use SampCompDB to save a bed file:

nanocompore.common.NanocomporeError: Some references are missing from the BED file provided

I am using gencode.v33.transcripts.fa for transcript alignment and I am making a bed file using the gft2bed tool using the gencode.v33.annotation.gtf file

Im not sure what the problem is here as I am using the same gencode release (v33) to download both the gtf and the fasta file.

here is my code:

db = SampCompDB (
...     db_fn = "/home/samirwatson/faststorage/NAT10/basecalled/Nanocompore_results/out_SampComp.db",
...     fasta_fn = "/home/samirwatson/faststorage/NAT10/gencode.v33.transcripts.fa",
...     log_level= "warning",
...     bed_fn = "/home/samirwatson/faststorage/NAT10/gencode.v33.annotation.gtf.bed"
... )

here is the traceback:

Traceback (most recent call last):
File "", line 5, in
File "/home/samirwatson/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py", line 108, in init
self.results = self.__calculate_results(adjust=True)
File "/home/samirwatson/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py", line 166, in __calculate_results
raise NanocomporeError("Some references are missing from the BED file provided")
nanocompore.common.NanocomporeError: Some references are missing from the BED file provided

@tleonardi
Copy link
Owner

Hi @samir-watson,
there must be some mismatch in the transcript IDs between the fasta file and the BED file.
You can check what ref_ids are used in the SampCompDB (by loading it without passing a BED file) and compare those to the ref_ids used in the BED file (col 4). Do they match?

The way I usually do it is to convert the Gencode/Ensemble GTF to bed (with bedparse) and then convert the BED to fasta with bedtools getfasta and fix the fasta headers with a regex:

bedtools getfasta -fi ${genome_fasta} -s -split -name -bed reference_transcriptome.bed -fo - | perl -pe 's/>(.+)\\([+-]\\)/>\$1/'

As a reference, you can also check the code of our nextflow pipeline

@samir-watson
Copy link
Author

Hi @tleonardi,

I have had a good look at what was happening with and it seems to a problem with pipes in the fasta headers that are being carried over from the fasta I used during SampComp so when I compare the SampCompDB database to the bed file there is no longer a match.
I will try and redo the alignment with the pipes and extra IDs removed and see if that works.

On another note I found a tiny issue in SampCompDB.py line 421, it says "red_if" and I think its meand to be ref_id.

Cheers,
Samir

@tleonardi
Copy link
Owner

Hi @samir-watson,
thanks for spotting the typo. It's getting fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants