getting spliced coding sequences instead of the gene sequence #17

terrimporter · 2019-12-13T16:38:03Z

Session Info

When I use BioPerl to parse GenBank records there is a way to target coding sequences (CDS) and retrieve the spliced sequence (without introns).  I would like to switch to restez for this task, but I can't seem to find a way to do this?

See for example GenBank record X55026.1 where the COI gene range is 38501..63007 but the CDS is join(38501..38669,41209..41297,42536..42576,43837..43868,
                     46369..46451,47959..48055,49512..49633,51885..51978,
                     53386..53396,54384..54470,55738..55797,56925..56927,
                     58328..58357,59363..59519,60813..60862,62103..62120,
                     62525..63007)

Thanks

DomBennett · 2020-01-03T10:48:52Z

Hi!

Unfortunately there's no readily available method to retrieve the spliced regions. The package does contain record extraction tools which can be coupled to do what you want.

# for example, download single whole sequence using rentrez
record <- rentrez::entrez_fetch(db='nuccore', id='X55026', rettype='gb')
# to check, run
# cat(record)
# eqv. to  restez::gb_record_get() if restez database existed

# manually pull out spliced sequence
# 1/ look up exons
features <- restez::gb_extract(record, 'features')
are_exons <- vapply(features, function(x) tolower(x[['type']]) == 'exon', logical(1))
exons <- features[are_exons]
length(exons)
# 2/ get locations
exon_locations <- vapply(exons, '[[', character(1), 'location')
exon_locations <- strsplit(x = exon_locations, split = '\\.\\.')
# 3/ assemble sequence
sequence <- restez::gb_extract(record, 'sequence')
sequence <- strsplit(x = sequence, split = '')[[1]]
exon_sequences <- unlist(lapply(X = exon_locations, function(x) {
  sequence[x[[1]]:x[[2]]]
}))
assembled_sequence <- paste0(exon_sequences, collapse = '')
nchar(assembled_sequence)

^This code is untested plus I'm not 100% sure it does exactly what you want.

Also, if you do come up with a good function(s), I welcome pull-requests!

joelnitta · 2022-07-07T07:11:23Z

I am the new maintainer. A PR would still be welcomed for this issue.

This comment was marked as outdated.

Sign in to view

joelnitta added the enhancement New feature or request label Aug 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

getting spliced coding sequences instead of the gene sequence #17

getting spliced coding sequences instead of the gene sequence #17

terrimporter commented Dec 13, 2019

DomBennett commented Jan 3, 2020

This comment was marked as outdated.

joelnitta commented Jul 7, 2022

getting spliced coding sequences instead of the gene sequence #17

getting spliced coding sequences instead of the gene sequence #17

Comments

terrimporter commented Dec 13, 2019

DomBennett commented Jan 3, 2020

This comment was marked as outdated.

joelnitta commented Jul 7, 2022