Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getting spliced coding sequences instead of the gene sequence #17

Open
terrimporter opened this issue Dec 13, 2019 · 3 comments
Open

getting spliced coding sequences instead of the gene sequence #17

terrimporter opened this issue Dec 13, 2019 · 3 comments
Labels
enhancement New feature or request

Comments

@terrimporter
Copy link

Session Info
When I use BioPerl to parse GenBank records there is a way to target coding sequences (CDS) and retrieve the spliced sequence (without introns).  I would like to switch to restez for this task, but I can't seem to find a way to do this?

See for example GenBank record X55026.1 where the COI gene range is 38501..63007 but the CDS is join(38501..38669,41209..41297,42536..42576,43837..43868,
                     46369..46451,47959..48055,49512..49633,51885..51978,
                     53386..53396,54384..54470,55738..55797,56925..56927,
                     58328..58357,59363..59519,60813..60862,62103..62120,
                     62525..63007)

Thanks
@DomBennett
Copy link
Contributor

Hi!

Unfortunately there's no readily available method to retrieve the spliced regions. The package does contain record extraction tools which can be coupled to do what you want.

# for example, download single whole sequence using rentrez
record <- rentrez::entrez_fetch(db='nuccore', id='X55026', rettype='gb')
# to check, run
# cat(record)
# eqv. to  restez::gb_record_get() if restez database existed

# manually pull out spliced sequence
# 1/ look up exons
features <- restez::gb_extract(record, 'features')
are_exons <- vapply(features, function(x) tolower(x[['type']]) == 'exon', logical(1))
exons <- features[are_exons]
length(exons)
# 2/ get locations
exon_locations <- vapply(exons, '[[', character(1), 'location')
exon_locations <- strsplit(x = exon_locations, split = '\\.\\.')
# 3/ assemble sequence
sequence <- restez::gb_extract(record, 'sequence')
sequence <- strsplit(x = sequence, split = '')[[1]]
exon_sequences <- unlist(lapply(X = exon_locations, function(x) {
  sequence[x[[1]]:x[[2]]]
}))
assembled_sequence <- paste0(exon_sequences, collapse = '')
nchar(assembled_sequence)

^This code is untested plus I'm not 100% sure it does exactly what you want.

Also, if you do come up with a good function(s), I welcome pull-requests!

@maelle

This comment was marked as outdated.

@joelnitta
Copy link
Contributor

I am the new maintainer. A PR would still be welcomed for this issue.

@joelnitta joelnitta added the enhancement New feature or request label Aug 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants