# Chapter ‍4 Sequence annotation objects

Chapter ‍3 introduced the sequence classes. Immediately “above” the Seq class is the Sequence Record or SeqRecord class, defined in the Bio.SeqRecord module. This class allows higher level features such as identifiers and features (as SeqFeature objects) to be associated with the sequence, and is used throughout the sequence input/output interface Bio.SeqIO described fully in Chapter ‍5.

If you are only going to be working with simple data like FASTA files, you can probably skip this chapter for now. If on the other hand you are going to be using richly annotated sequence data, say from GenBank or EMBL files, this information is quite important.

While this chapter should cover most things to do with the SeqRecord and SeqFeature objects, you may also want to read the SeqRecord wiki page [http://biopython.org/wiki/SeqRecord](http://biopython.org/wiki/SeqRecord) and the built-in documentation (also available online for SeqRecord and SeqFeature):

In [2]:
from Bio.SeqRecord import SeqRecord
help(SeqRecord)

Help on class SeqRecord in module Bio.SeqRecord:

class SeqRecord(builtins.object)
 |  SeqRecord(seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=None, features=None, annotations=None, letter_annotations=None)
 |  
 |  A SeqRecord object holds a sequence and information about it.
 |  
 |  Main attributes:
 |   - id          - Identifier such as a locus tag (string)
 |   - seq         - The sequence itself (Seq object or similar)
 |  
 |  Additional attributes:
 |   - name        - Sequence name, e.g. gene name (string)
 |   - description - Additional text (string)
 |   - dbxrefs     - List of database cross references (list of strings)
 |   - features    - Any (sub)features defined (list of SeqFeature objects)
 |   - annotations - Further information about the whole sequence (dictionary).
 |     Most entries are strings, or lists of strings.
 |   - letter_annotations - Per letter/symbol annotation (restricted
 |     dictionary). This holds Pyt

## 4.1 The SeqRecord object

The SeqRecord (Sequence Record) class is defined in the Bio.SeqRecord module. This class allows higher level features such as identifiers and features to be associated with a sequence (see Chapter ‍3), and is the basic data type for the Bio.SeqIO sequence input/output interface (see Chapter ‍5).

The SeqRecord class itself is quite simple, and offers the following information as attributes:

- .seq – The sequence itself, typically a Seq object.
- .id – The primary ID used to identify the sequence – a string. In most cases this is something like an accession number.
- .name – A “common” name/id for the sequence – a string. In some cases this will be the same as the accession number, but it could also be a clone name. I think of this as being analogous to the LOCUS id in a GenBank record.
- .description – A human readable description or expressive name for the sequence – a string.
- .letter_annotations – Holds per-letter-annotations using a (restricted) dictionary of additional information about the letters in the sequence. The keys are the name of the information, and the information is contained in the value as a Python sequence (i.e. a list, tuple or string) with the same length as the sequence itself. This is often used for quality scores (e.g. Section ‍20.1.6) or secondary structure information (e.g. from Stockholm/PFAM alignment files).
- .annotations – A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more “unstructured” information to the sequence.
- .features – A list of SeqFeature objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence). The structure of sequence features is described below in Section ‍4.3.
- .dbxrefs – A list of database cross-references as strings.
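
As a rough illustration of how these attributes fit together, the sketch below fills in each of them on a small record. Every value (the identifier, name, cross-reference and annotation entries) is invented for the example; only the keyword names are taken from the SeqRecord signature shown in the help output above.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# All values below are made up for illustration.
example = SeqRecord(
    Seq("GATC"),
    id="AC12345",
    name="example_clone",
    description="Hypothetical four-base sequence",
    dbxrefs=["Project:00000"],
    annotations={"source": "synthetic"},
    letter_annotations={"phred_quality": [40, 40, 38, 30]},
)

# Print each attribute from the list above in turn.
for attribute in ("seq", "id", "name", "description", "dbxrefs",
                  "annotations", "letter_annotations", "features"):
    print(attribute, "->", getattr(example, attribute))
```

No features were supplied, so the features attribute is simply an empty list.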

## 4.2 Creating a SeqRecord

Using a SeqRecord object is not very complicated, since all of the information is presented as attributes of the class. Usually you won’t create a SeqRecord “by hand”, but instead use Bio.SeqIO to read in a sequence file for you (see Chapter 5 and the examples below). However, creating a SeqRecord can be quite simple.

### 4.2.1 SeqRecord objects from scratch
To create a SeqRecord at a minimum you just need a Seq object:

In [4]:
from Bio.Seq import Seq
simple_seq = Seq("GATC")

from Bio.SeqRecord import SeqRecord
simple_seq_r = SeqRecord(simple_seq)

You can also pass the id, name and description to the initialization function; if you do not, they are set to strings indicating that they are unknown, and can be modified subsequently:

In [5]:
simple_seq_r.id = "AC12345"
simple_seq_r.id


'AC12345'

In [6]:
simple_seq_r.description = "Made up sequence I wish I could write a paper about"
print(simple_seq_r.description)

Made up sequence I wish I could write a paper about


In [7]:
simple_seq_r.seq


Seq('GATC')
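
For comparison, here is a small sketch of the two routes: leaving the fields at their placeholder defaults, or supplying them when the record is created. The placeholder strings are the defaults listed in the SeqRecord signature in the help output earlier in this chapter; the description text is made up.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Nothing but a sequence: the text fields fall back to placeholder strings.
bare = SeqRecord(Seq("GATC"))
print(bare.id)
print(bare.name)
print(bare.description)

# The same fields supplied at creation time (values invented for the example).
named = SeqRecord(Seq("GATC"), id="AC12345", description="Made up sequence")
print(named.id, "|", named.description)
```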

Including an identifier is very important if you want to output your SeqRecord to a file. You would normally include this when creating the object.

In [8]:
from Bio.Seq import Seq
simple_seq = Seq("GATC")
from Bio.SeqRecord import SeqRecord
simple_seq_r = SeqRecord(simple_seq, id="AC12345")
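
To see where the identifier ends up in output, a record can be rendered as text with its format method. This is only a sketch with a made-up description; writing to an actual file through Bio.SeqIO is covered in Chapter 5.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

rec = SeqRecord(Seq("GATC"), id="AC12345", description="Made up sequence")

# Render the record as FASTA text; the identifier leads the title line.
fasta_text = rec.format("fasta")
print(fasta_text)
```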

As mentioned above, the SeqRecord has a dictionary attribute called annotations. This is used for any miscellaneous annotations that don’t fit under one of the other more specific attributes. Adding annotations is easy, and just involves dealing directly with the annotations dictionary:

In [9]:
simple_seq_r.annotations["evidence"] = "None. I just made it up."
print(simple_seq_r.annotations)

{'evidence': 'None. I just made it up.'}


In [10]:
print(simple_seq_r.annotations["evidence"])

None. I just made it up.


Working with per-letter annotations is similar: letter_annotations is a dictionary-like attribute which lets you assign any Python sequence (i.e. a string, list or tuple) that has the same length as the sequence:

In [11]:
simple_seq_r.letter_annotations["phred_quality"] = [40, 40, 38, 30]
print(simple_seq_r.letter_annotations)

{'phred_quality': [40, 40, 38, 30]}


In [12]:
print(simple_seq_r.letter_annotations["phred_quality"])

[40, 40, 38, 30]
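
The length restriction can be probed directly. The sketch below assumes a mismatched value is rejected with a TypeError (ValueError is also caught, in case the exception type differs between versions), and it also shows per-letter annotations following the sequence when the record is sliced. The key name too_short is made up.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

rec = SeqRecord(Seq("GATC"), id="AC12345")
rec.letter_annotations["phred_quality"] = [40, 40, 38, 30]

# A value whose length differs from the sequence length is rejected.
try:
    rec.letter_annotations["too_short"] = [40, 40]
    rejected = False
except (TypeError, ValueError) as err:
    rejected = True
    print("rejected:", err)

# Slicing the record slices the per-letter annotations along with it.
first_two = rec[:2]
print(first_two.seq, first_two.letter_annotations["phred_quality"])
```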


### 4.2.2 SeqRecord objects from FASTA files

This example uses a fairly large FASTA file containing the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI. This file is included with the Biopython unit tests under the GenBank folder, or online as NC_005816.fna from our website.

The file contains only one record (i.e. only one line starting with a greater-than symbol). The following cell downloads it into the data/ directory:

In [13]:
# Download NC_005816.fna (this cell relies on the third-party wget package)
import wget

url = 'https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna'
_path = 'data/'
wget.download(url, out=_path)

'data//NC_005816.fna'

Back in Chapter 2 you will have seen the function Bio.SeqIO.parse(...) used to loop over all the records in a file as SeqRecord objects. The Bio.SeqIO module has a sister function for use on files which contain just one record, which we’ll use here (see Chapter 5 for details):

In [17]:
from Bio import SeqIO
record = SeqIO.read(_path +"NC_005816.fna", "fasta")
record

SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG'), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])
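
The difference between the two functions can be tried without any file on disk. In this sketch the FASTA text is made up and fed in through io.StringIO; it shows read returning a single SeqRecord, parse iterating over several, and read refusing input that holds more than one record.

```python
from io import StringIO
from Bio import SeqIO

# Made-up FASTA text, used here instead of a file on disk.
one_record = ">seq1 first example\nGATC\n"
two_records = one_record + ">seq2 second example\nTTAA\n"

# read() returns the single record directly.
single = SeqIO.read(StringIO(one_record), "fasta")
print(single.id, len(single.seq))

# parse() yields the records one at a time.
ids = [rec.id for rec in SeqIO.parse(StringIO(two_records), "fasta")]
print(ids)

# read() raises ValueError when the input does not hold exactly one record.
try:
    SeqIO.read(StringIO(two_records), "fasta")
    read_failed = False
except ValueError as err:
    read_failed = True
    print("read() refused:", err)
```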

Now, let’s have a look at the key attributes of this SeqRecord individually – starting with the seq attribute which gives you a Seq object:

In [18]:
record.seq

Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG')

Next, the identifiers and description:

In [19]:
record.id

'gi|45478711|ref|NC_005816.1|'

In [20]:
record.name

'gi|45478711|ref|NC_005816.1|'

In [21]:
record.description

'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'

As you can see above, the first word of the FASTA record’s title line (after removing the greater-than symbol) is used for both the id and name attributes. The whole title line (after removing the greater-than symbol) is used for the record description. This is deliberate, partly for backwards compatibility reasons, but it also makes sense for FASTA files whose title line carries no separate identifier. The other annotation attributes are not populated when reading a FASTA file:

In [22]:
record.dbxrefs

[]

In [23]:
record.annotations

{}

In [24]:
record.letter_annotations

{}

In [25]:
record.features

[]
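
The mapping from title line to attributes can be reproduced on a shortened stand-in for the record. The sketch below reuses the title line shown above, keeps only the first 20 bases of the sequence, and reads the text from memory rather than from the downloaded file.

```python
from io import StringIO
from Bio import SeqIO

title = ("gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus "
         "str. 91001 plasmid pPCP1, complete sequence")
# Only the first 20 bases are used; this is not the full plasmid sequence.
fasta_text = ">" + title + "\nTGTAACGAACGGTGCAATAG\n"

rec = SeqIO.read(StringIO(fasta_text), "fasta")

print(rec.id == title.split()[0])   # id is the first word of the title line
print(rec.name == rec.id)           # name is set to the same value
print(rec.description == title)     # description is the whole title line
print(rec.dbxrefs, rec.features)    # nothing else is filled in
```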

In this case our example FASTA file was from the NCBI, and they have a fairly well defined set of conventions for formatting their FASTA lines. This means it would be possible to parse this information and extract the GI number and accession, for example. However, FASTA files from other sources vary, so this isn’t possible in general.
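
For identifiers that do follow this layout, plain string splitting may be enough. The helper below is hypothetical (it is not part of Biopython) and assumes the gi|number|database|accession| pattern seen in this record, returning None for anything else.

```python
def split_ncbi_id(identifier):
    """Split an identifier such as 'gi|45478711|ref|NC_005816.1|'.

    Hypothetical helper: assumes the 'gi|<number>|<db>|<accession>|'
    layout and returns None when the identifier does not match it.
    """
    fields = identifier.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        return None
    return {"gi": fields[1], "database": fields[2], "accession": fields[3]}


parsed = split_ncbi_id("gi|45478711|ref|NC_005816.1|")
print(parsed)

# An identifier from another source does not fit the pattern.
unparsed = split_ncbi_id("some_other_id")
print(unparsed)
```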

### 4.2.3 SeqRecord objects from GenBank files

As in the previous example, we’re going to look at the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI, but this time as a GenBank file. Again, this file is included with the Biopython unit tests under the GenBank folder, or online `NC_005816.gb` from our website.

In [26]:
# Download the GenBank file NC_005816.gb
import wget

url = 'https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb'
_path = 'data/'
wget.download(url, out=_path)

'data//NC_005816.gb'

In [27]:
from Bio import SeqIO
record = SeqIO.read(_path + "NC_005816.gb", "genbank")
record

SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG'), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['Project:58037'])

In [28]:
record.seq

Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG')

The name comes from the LOCUS line, while the id includes the version suffix. The description comes from the DEFINITION line:


In [29]:
record.id, record.name, record.description

('NC_005816.1',
 'NC_005816',
 'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence')
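
The same mapping can be explored without the downloaded file by writing a hand-built record out as GenBank text and reading it back. This is only a sketch under two assumptions: that a round trip through the format method behaves like parsing a real GenBank file, and that a molecule_type annotation is needed for GenBank output (recent Biopython versions require it). The four-base sequence and the description are made up.

```python
from io import StringIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# A made-up stand-in record; only the id and name mirror the real plasmid.
rec = SeqRecord(
    Seq("GATC"),
    id="NC_005816.1",
    name="NC_005816",
    description="Made up four-base stand-in",
)
# Assumption: GenBank output needs to know the molecule type.
rec.annotations["molecule_type"] = "DNA"

genbank_text = rec.format("genbank")
print(genbank_text)  # look for the LOCUS, DEFINITION and VERSION lines

# Reading the text back recovers the id and name from those lines.
back = SeqIO.read(StringIO(genbank_text), "genbank")
print(back.id, back.name)
```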

In [31]:
# GenBank files do not carry per-letter annotations, so this dictionary is empty
record.letter_annotations

{}

Most of the annotation information gets recorded in the annotations dictionary, for example:

In [33]:
len(record.annotations)

13

In [34]:
record.annotations["source"]

'Yersinia pestis biovar Microtus str. 91001'

The dbxrefs list gets populated from any PROJECT or DBLINK lines:

In [35]:
record.dbxrefs

['Project:58037']

Finally, and perhaps most interestingly, all the entries in the features table (e.g. the genes or CDS features) get recorded as SeqFeature objects in the features list.

In [36]:
len(record.features)

41
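
A quick way to get an overview of a features list is to count the entries by type. The sketch below uses a few made-up features on a made-up sequence as a stand-in for the parsed GenBank record; the same loop would apply to record.features.

```python
from collections import Counter
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Made-up features on a made-up 21-base sequence.
rec = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGC"),
    id="example",
    features=[
        SeqFeature(FeatureLocation(0, 21), type="source"),
        SeqFeature(FeatureLocation(0, 12), type="gene"),
        SeqFeature(FeatureLocation(0, 12), type="CDS"),
        SeqFeature(FeatureLocation(12, 21), type="gene"),
    ],
)

# Tally how many features of each type the record holds.
type_counts = Counter(feature.type for feature in rec.features)
print(type_counts)
```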

## 4.3 Feature, location and position objects