# 6. Sequence Record objects
Until now, we've been working with sequence (`Seq`) objects, which hold only the sequence itself. 
Biopython lets us attach additional information to a sequence, such as an identifier, a name, a description, features and a set of annotations. All of this information is stored in a `SeqRecord` object, which wraps a `Seq` object together with its metadata. 

In [None]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

In [None]:
help(SeqRecord)

Content:
- 6.1 The SeqRecord object
- 6.2 Features
- 6.3 Slicing a SeqRecord

## 6.1 The SeqRecord object

As an example, we'll read in a GenBank file, *NC_005816.gb*, which is available from [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/NC_005816), using the `SeqIO` module. The next chapter discusses `SeqIO` in detail; here we only use it to read a `SeqRecord` object from a file. 

In [None]:
from Bio import SeqIO

# Read in a SeqRecord object & print it out
record = SeqIO.read("data/NC_005816.gb","gb")
print(record)

The following elements are present (amongst others):
- **ID**: usually the accession number of the sequence
- **Name**: the more commonly used name of the sequence (often the same as accession number)
- **Description**: a description or expressive name for the sequence
- **Features**: a list of SeqFeature objects with more structured information about the sequence (discussed below)
- **Annotations**: a dictionary of additional information about the sequence. 
- **Seq**: the sequence itself

We can retrieve the methods and properties of this record using the `dir()` function as well. 

In [None]:
dir(record)

The `seq` property allows us to fetch the sequence. Do you recognize the format of the output?

In [None]:
# Access the Seq object from a SeqRecord
record.seq

The `annotations` attribute gives access to a dictionary of annotations for this record. It typically holds information such as the organism, accession numbers, date and taxonomy.  

In [None]:
# Access the annotations as part of the record
record.annotations

In [None]:
# Other possibilities are:
# ID
print(record.id)
# Name
print(record.name)
# Description
print(record.description)

---
### 6.1.1 Exercise
Find the titles of all the articles related to the GenBank entry 'NC_005816'. Load the file using the code block below.  

Extra: Create a list of URLs that link directly to each article. For this, combine the PubMed ID with `https://pubmed.ncbi.nlm.nih.gov/` (e.g.: https://pubmed.ncbi.nlm.nih.gov/15368893/). 


Hint: look at the *references* entry of the annotations ([some more information](https://biopython.readthedocs.io/en/latest/chapter_seq_annot.html))

In [None]:
from Bio import SeqIO
record = SeqIO.read("data/NC_005816.gb","gb")

----

## 6.2 Features
Features and their `SeqFeature` objects are fairly complex in their own right. They hold more structured, detailed information about the `SeqRecord` object (and thus the sequence). A feature attempts to capture as much information about the sequence as possible by describing a region on the parent sequence. For completeness we give a short example here, but a full treatment of features belongs in a more advanced course. If you're interested, have a look at the official documentation [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec38). 

A `SeqFeature` or location object doesn't directly contain a sequence; instead, the location describes how to get it from the parent sequence. For example, consider a (short) gene sequence with location 5:18 on the reverse strand, which in GenBank/EMBL notation using 1-based counting would be complement(6..18), like this:

In [None]:
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Create a sequence and describe a part of it as a SeqFeature with a location, strand and type (e.g. gene, intron, exon)
example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
# The strand belongs to the location; recent Biopython releases no longer accept strand= on SeqFeature itself
example_feature = SeqFeature(FeatureLocation(5, 18, strand=-1), type="gene")

You could take the parent sequence, slice it to extract 5:18, and then take the reverse complement.

In [None]:
# Use the start and end attributes of the feature's location to slice the parent manually
start = example_feature.location.start
end = example_feature.location.end
feature_seq = example_parent[start:end].reverse_complement()
feature_seq

Or you could simply use the extract method:

In [None]:
# Alternatively, making life easier for yourself
feature_seq = example_feature.extract(example_parent)
feature_seq
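
The relationship between the Python-style location 5:18 and the GenBank notation complement(6..18) can also be sketched in plain Python, without Biopython. The helper names `to_genbank_notation` and `reverse_complement` below are hypothetical, chosen for illustration only; they are not part of Biopython. The sketch assumes an unambiguous DNA sequence made of upper-case A, C, G and T.

```python
# Illustrative sketch only: these helpers are NOT part of Biopython.

def to_genbank_notation(start, end, strand):
    """Convert a 0-based, end-exclusive Python slice to GenBank 1-based, end-inclusive notation."""
    # Only the start shifts by one: the exclusive Python end equals the inclusive 1-based end.
    text = f"{start + 1}..{end}"
    return f"complement({text})" if strand == -1 else text


def reverse_complement(dna):
    """Reverse-complement an upper-case DNA string containing only A, C, G and T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna))


parent = "ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC"

notation = to_genbank_notation(5, 18, -1)
manual_feature_seq = reverse_complement(parent[5:18])

print(notation)            # complement(6..18)
print(manual_feature_seq)  # AGCCTTTGCCGTC
```

The printed sequence should match what `example_feature.extract(example_parent)` returns in the cell above.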

A record usually holds multiple features, and we can pick one out of the `features` list by its index:

In [None]:
# Read in the same GenBank file and select its 6th feature (index 5)
record = SeqIO.read("data/NC_005816.gb", "genbank")
print(record.features[5])

We can also check whether a SNP at a specific position falls within a feature, and extract some information from that feature:

In [None]:
# Loop over the features and report those that contain the SNP position of interest (4350)
my_snp = 4350
record = SeqIO.read("data/NC_005816.gb", "genbank")

for feature in record.features:
    if my_snp in feature: 
        print("{} {}".format(feature.type, feature.qualifiers.get("db_xref")))

Note that gene and CDS features from GenBank or EMBL files defined with joins are the union of the exons - they do not cover any introns.
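
To see why a position inside an intron is not reported as part of a joined feature, here is a minimal plain-Python sketch. It models a location as a list of 0-based, end-exclusive `(start, end)` parts. This representation, the function name `position_in_parts` and the coordinates are all assumptions made for illustration; they do not reflect how Biopython stores locations internally.

```python
# Illustrative sketch only: a joined location modelled as a list of (start, end) parts.
# Coordinates are made up and 0-based, end-exclusive, like Python slices.

def position_in_parts(position, parts):
    """Return True if position falls inside any of the (start, end) parts."""
    return any(start <= position < end for start, end in parts)


# Hypothetical gene with two exons separated by an intron covering positions 200 to 299
exons = [(100, 200), (300, 400)]

in_first_exon = position_in_parts(150, exons)
in_intron = position_in_parts(250, exons)
at_exclusive_end = position_in_parts(400, exons)

print(in_first_exon)     # True
print(in_intron)         # False: the intron is not part of the join
print(at_exclusive_end)  # False: the end coordinate is exclusive
```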

## 6.3 Slicing a SeqRecord

As briefly shown above, a `SeqRecord` can be sliced much like a string or a sequence. The following cells give some examples using the record we loaded earlier:

In [None]:
# Print the feature at index 20 (the pim gene in this record), including its location and qualifiers
print(record.features[20])

In [None]:
# Slice the record; the sub-record keeps the ID, name and description, plus the features that fall within the slice
sub_record = record[4300:4800]
print(sub_record)

In [None]:
# Length of the sequence held by the sub-record
len(sub_record)

In [None]:
# Number of features on this sub-sequence (also shown in the printed sub-record above)
len(sub_record.features)

In [None]:
print(sub_record.features[0])

In [None]:
print(sub_record.features[1])

Notice that their locations have been adjusted to reflect the new parent sequence!
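
The coordinate adjustment can be pictured with a small plain-Python sketch. The function `shift_features` is hypothetical and the feature names and coordinates are illustrative assumptions. The sketch also assumes that only features lying completely within the slice are carried over, which is how the Biopython documentation describes `SeqRecord` slicing.

```python
# Illustrative sketch only: mimics how feature coordinates shift when a record is sliced.
# Features are modelled as (name, start, end) tuples with 0-based, end-exclusive coordinates.

def shift_features(features, slice_start, slice_end):
    """Keep features fully inside [slice_start, slice_end) and re-base their coordinates."""
    shifted = []
    for name, start, end in features:
        if slice_start <= start and end <= slice_end:
            shifted.append((name, start - slice_start, end - slice_start))
    return shifted


# Hypothetical features on a parent sequence
parent_features = [
    ("inside_gene", 4342, 4780),       # completely inside the slice 4300:4800
    ("overlapping_gene", 4700, 4900),  # extends past the slice end, so it is dropped
    ("upstream_gene", 1000, 2000),     # entirely outside the slice
]

sub_features = shift_features(parent_features, 4300, 4800)
print(sub_features)  # [('inside_gene', 42, 480)]
```

Each surviving feature keeps its length; only its offset relative to the start of the new parent sequence changes.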

In [None]:
# The annotations are dropped when slicing (recent Biopython versions keep only the molecule type)
sub_record.annotations

In [None]:
# List the attributes and methods of a reference object (taken from the full record's annotations)
r = record.annotations['references']
dir(r[0])

In [None]:
# The database cross-references are dropped as well
sub_record.dbxrefs

In [None]:
# The ID, name and description are inherited from the parent record
print(sub_record.id)
print(sub_record.name)
print(sub_record.description)

In [None]:
# However, we can overwrite them
sub_record.description = 'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial' #partial!

In [None]:
# We can produce FASTA-formatted output by calling the format method on this sub-record
print(sub_record.format("fasta"))
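
For reference, the FASTA layout itself is simple: a header line starting with `>`, followed by the sequence on one or more lines. The plain-Python sketch below builds such a string by hand. The function `to_fasta` is hypothetical, the example identifier and sequence are made up, and the 60-character line width is an assumption chosen to resemble typical FASTA output. In practice, the `format("fasta")` call shown above is the simpler route.

```python
# Illustrative sketch only: builds a FASTA-formatted string by hand.

def to_fasta(identifier, description, sequence, width=60):
    """Return a FASTA string: a '>' header line, then the sequence wrapped at `width` characters."""
    header = f">{identifier} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# Made-up 70-base sequence: wraps into one line of 60 and one line of 10
fasta_text = to_fasta("example_id", "hypothetical partial sequence", "ACGT" * 17 + "AC")
print(fasta_text)
```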

You could also create your own `SeqRecord` object from scratch. This is not covered in this course, but some preliminary reading is available in the further reading section ([here](further_reading/06_Biopython_sequence_annotation_Extra.ipynb)).

## 6.4 Next session
Click here to go to the [next session](07_Biopython_SeqIO.ipynb). 