readgff fails with protein sequences #20

Open
@diegozea

Description

readgff tries to parse the sequences in the ##FASTA section as DNA, so it fails when reading files that contain protein sequences.

Input

GFF file containing a protein sequence downloaded from the ELM database: http://elm.eu.org/downloads.html
Link to the file: http://elm.eu.org/instances.gff?q=SRC_HUMAN

##gff-version 3
P12931	ELM	sequence_feature	530	534	.	.	.	ID=LIG_SH2_SFK_CTail_3
P12931	ELM	sequence_feature	252	259	.	.	.	ID=LIG_SH3_4
P12931	ELM	sequence_feature	72	78	.	.	.	ID=MOD_CDK_SPxK_1
P12931	ELM	sequence_feature	1	7	.	.	.	ID=MOD_NMyristoyl
P12931	ELM	sequence_feature	526	534	.	.	.	ID=MOD_TYR_CSK
##FASTA
>P12931
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL

Output

julia> p = readgff("elm_instances.gff")
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
ERROR: Cannot encode byte 0x50 (char 'P') at index 8 to BioSequences.DNAAlphabet{4}()
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:35
  [2] throw_encode_error(A::BioSequences.DNAAlphabet{4}, src::Base.CodeUnits{UInt8, String}, soff::Int64)
    @ BioSequences C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\copying.jl:164
  [3] encode_chunk
    @ C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\copying.jl:178 [inlined]
  [4] encode_chunks!(dst::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, startindex::Int64, src::Base.CodeUnits{UInt8, String}, soff::Int64, N::Int64)
    @ BioSequences C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\copying.jl:189
  [5] LongSequence
    @ C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\constructors.jl:97 [inlined]
  [6] LongSequence
    @ C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\constructors.jl:85 [inlined]
  [7] parsechromosome!(input::TranscodingStreams.NoopStream{IOStream}, record::GenomicAnnotations.Record{Gene})
    @ GenomicAnnotations.GFF C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\GFF\reader.jl:133
  [8] tryread!
    @ C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\GFF\reader.jl:49 [inlined]
  [9] iterate(reader::GenomicAnnotations.GFF.Reader{TranscodingStreams.NoopStream{IOStream}}, nextone::GenomicAnnotations.Record{Gene})
    @ GenomicAnnotations.GFF C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\GFF\reader.jl:42
 [10] _collect(cont::UnitRange{Int64}, itr::GenomicAnnotations.GFF.Reader{TranscodingStreams.NoopStream{IOStream}}, ::Base.HasEltype, isz::Base.SizeUnknown)
    @ Base .\array.jl:727
 [11] collect
    @ .\array.jl:716 [inlined]
 [12] readgff(input::String)
    @ GenomicAnnotations C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\utils.jl:87
 [13] top-level scope
    @ REPL[5]:1
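
The error position can be checked by hand. The first seven residues (M, G, S, N, K, S, K) are also valid IUPAC nucleotide ambiguity codes, so encoding only fails at the 'P' at index 8. Below is a minimal sketch, using only Base Julia, of how a reader could guess the alphabet before building the sequence. The names `NUCLEOTIDE_CODES`, `guess_alphabet` and `first_non_nucleotide` are hypothetical and are not part of GenomicAnnotations or BioSequences; this is an illustration of one possible check, not the package's implementation.

```julia
# Hypothetical helpers, not part of GenomicAnnotations or BioSequences.
# IUPAC nucleotide codes, including ambiguity codes and the gap character.
const NUCLEOTIDE_CODES = Set("ACGTUNRYSWKMBDHV-")

# Return :nucleotide if every character is an IUPAC nucleotide code,
# otherwise :protein.
function guess_alphabet(seq::AbstractString)
    all(c -> uppercase(c) in NUCLEOTIDE_CODES, seq) ? :nucleotide : :protein
end

# Index of the first character that is not a nucleotide code,
# or `nothing` if there is none.
first_non_nucleotide(seq::AbstractString) =
    findfirst(c -> !(uppercase(c) in NUCLEOTIDE_CODES), seq)

guess_alphabet("MGSNKSKPKDASQ")        # :protein
first_non_nucleotide("MGSNKSKPKDASQ")  # 8, matching the index in the error
guess_alphabet("ACGTNNRY")             # :nucleotide
```

A check like this is only a heuristic: a short peptide made up entirely of letters that are also nucleotide codes (for example "GATC") would be classified as nucleotide, so an explicit option to select the alphabet would be more robust.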

Version

julia> versioninfo()
Julia Version 1.11.3
Commit d63adeda50 (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × 13th Gen Intel(R) Core(TM) i7-1365U
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, goldmont)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)

(Downloads) pkg> st
Status `C:\Users\dz272503\Downloads\Project.toml`
  [4f8a0a0a] GenomicAnnotations v0.4.5
