
In [None]:
!pip install biopython
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord




In Biopython, SeqRecord is a class used to store a biological sequence and its associated information, such as an identifier, name, description, and features. It acts like a container for sequence data and makes it easy to handle and manipulate sequences with their metadata.

Here's a simple breakdown of its main components:

Sequence (seq): The actual biological sequence (DNA, RNA, or protein).
ID (id): A unique identifier for the sequence.
Name (name): A common name or short description of the sequence.
Description (description): A longer description providing more details about the sequence.
Features (features): Annotations about specific parts of the sequence, like genes or regulatory regions.

# 1. Create a SeqRecord
## SeqRecord objects from scratch

In [None]:
from Bio.Seq import Seq  # Seq holds the raw sequence
from Bio.SeqRecord import SeqRecord  # SeqRecord wraps a Seq together with its metadata

simple_seq = Seq("ATGCATGC")
# The standard way to create a record: pass the Seq plus an optional id, name and description
simple_seq_r = SeqRecord(simple_seq, id="simple_seq", name="Simple Sequence", description="A simple sequence")
print(simple_seq_r)

ID: simple_seq
Name: Simple Sequence
Description: A simple sequence
Number of features: 0
Seq('ATGCATGC')


In [None]:
# Create one more record
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

sequence = Seq("GTAGTAGATGACCCGATGCGATCGA")
sequence_record = SeqRecord(sequence, id="One", name="DNA Sequence", description="Human DNA")
print(sequence_record)


ID: One
Name: DNA Sequence
Description: Human DNA
Number of features: 0
Seq('GTAGTAGATGACCCGATGCGATCGA')


In [None]:
sequence_record.annotations['Evidence'] = 'There is no evidence, I just made it up'
print(sequence_record.annotations) # Use this command to view all the annotations

{'Evidence': 'There is no evidence, I just made it up'}


In [None]:
print(sequence_record.annotations['Evidence']) # Look up a particular annotation by its key

There is no evidence, I just made it up
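
A record built by hand can also be turned back into text. The sketch below assumes the `format` method of `SeqRecord`, which returns the record as a string in the requested format; it rebuilds the record from the cells above so that it runs on its own. FASTA carries only the ID and description, so the `annotations` dictionary is not written out.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Rebuild the hand-made record from the earlier cells
sequence_record = SeqRecord(
    Seq("GTAGTAGATGACCCGATGCGATCGA"),
    id="One",
    name="DNA Sequence",
    description="Human DNA",
)
sequence_record.annotations["Evidence"] = "There is no evidence, I just made it up"

# format() returns a string; nothing is written to disk
fasta_text = sequence_record.format("fasta")
print(fasta_text)
```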


## SeqRecord object from a FASTA file, read without saving it to disk

In [14]:
import requests
from Bio import SeqIO
from io import StringIO

# URL of the FASTA file
url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna"

# Fetch the content of the file
response = requests.get(url)
response.raise_for_status()  # Raise an exception if the server returned an HTTP error

# Read the content using SeqIO
file_content = StringIO(response.text)
record = SeqIO.read(file_content, "fasta")

record

SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG'), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])
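
`SeqIO.read` expects the handle to contain exactly one record, which holds for this plasmid file. As a small offline sketch (the two-record FASTA text below is made up for illustration), `SeqIO.parse` is the call to reach for when a file may hold several records:

```python
from io import StringIO
from Bio import SeqIO

# Hypothetical two-record FASTA text, standing in for a downloaded file
fasta_text = ">seq1 first example\nATGCATGC\n>seq2 second example\nGGGCCC\n"

# SeqIO.parse yields one SeqRecord per entry
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for rec in records:
    print(rec.id, len(rec.seq))

# SeqIO.read raises ValueError when the handle holds more than one record
try:
    SeqIO.read(StringIO(fasta_text), "fasta")
    read_failed = False
except ValueError as err:
    read_failed = True
    print("SeqIO.read refused the file:", err)
```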

In [15]:
record.id

'gi|45478711|ref|NC_005816.1|'

In [16]:
record.name

'gi|45478711|ref|NC_005816.1|'

In [17]:
record.description

'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'

In [18]:
record.dbxrefs # If a sequence is referenced in other databases, those references would be listed here.

[]

In [19]:
record.annotations # This can include various annotations such as the source of the sequence, references to literature, and more.

{}

In [20]:
record.features # This attribute holds a list of SeqFeature objects, which describe various features of the sequence, such as genes, regulatory regions, and other functional elements.

[]

In [21]:
record.letter_annotations # This dictionary contains per-letter annotations. This means annotations that apply to each individual letter (nucleotide or amino acid) in the sequence, such as quality scores or secondary structure information.

{}
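
A FASTA file carries no per-letter data, which is why the dictionary is empty here. As a rough illustration with invented quality scores (`phred_quality` is the key Biopython uses for FASTQ data; the toy record and its values are made up), per-letter annotations must match the sequence length and are sliced along with the record:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy record, not taken from the plasmid file
toy = SeqRecord(Seq("ATGCATGC"), id="toy")

# One value per letter; a list of the wrong length is rejected
toy.letter_annotations["phred_quality"] = [40, 38, 35, 30, 30, 25, 20, 10]

# Slicing the record slices the per-letter annotations too
head = toy[:3]
print(head.seq, head.letter_annotations["phred_quality"])
```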

## SeqRecord object from GenBank file

In [34]:
import requests
from Bio import SeqIO
from io import StringIO

# URL of the GenBank file
url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb"

# Fetch the content of the file
response = requests.get(url)
response.raise_for_status()  # Ensure we notice bad responses

# Read the content using SeqIO
file_content = StringIO(response.text)
record = SeqIO.read(file_content, "genbank")

print(record)


ID: NC_005816.1
Name: NC_005816
Description: Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
Database cross-references: Project:58037
Number of features: 41
/molecule_type=DNA
/topology=circular
/data_file_division=BCT
/date=21-JUL-2008
/accessions=['NC_005816']
/sequence_version=1
/gi=45478711
/keywords=['']
/source=Yersinia pestis biovar Microtus str. 91001
/organism=Yersinia pestis biovar Microtus str. 91001
/taxonomy=['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia']
/references=[Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)]
/comment=PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The 

In [24]:
record.id

'NC_005816.1'

In [25]:
record.name

'NC_005816'

In [26]:
print(record.annotations, record.description)

{'molecule_type': 'DNA', 'topology': 'circular', 'data_file_division': 'BCT', 'date': '21-JUL-2008', 'accessions': ['NC_005816'], 'sequence_version': 1, 'gi': '45478711', 'keywords': [''], 'source': 'Yersinia pestis biovar Microtus str. 91001', 'organism': 'Yersinia pestis biovar Microtus str. 91001', 'taxonomy': ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia'], 'references': [Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)], 'comment': 'PROVISIONAL REFSEQ: This record has not yet been subject to final\nNCBI review. The reference sequence was derived from AE017046.\nCOMPLETENESS: full length.'} Yersinia pestis biovar Microtus str. 91001 plasmid pPC

In [27]:
record.annotations


{'molecule_type': 'DNA',
 'topology': 'circular',
 'data_file_division': 'BCT',
 'date': '21-JUL-2008',
 'accessions': ['NC_005816'],
 'sequence_version': 1,
 'gi': '45478711',
 'keywords': [''],
 'source': 'Yersinia pestis biovar Microtus str. 91001',
 'organism': 'Yersinia pestis biovar Microtus str. 91001',
 'taxonomy': ['Bacteria',
  'Proteobacteria',
  'Gammaproteobacteria',
  'Enterobacteriales',
  'Enterobacteriaceae',
  'Yersinia'],
 'references': [Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...),
  Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...),
  Reference(title='Direct Submission', ...),
  Reference(title='Direct Submission', ...)],
 'comment': 'PROVISIONAL REFSEQ: This record has not yet been subject to final\nNCBI review. The reference sequence was derived from AE017046.\nCOMPLETENESS: full length.'}

In [28]:
record.annotations['source']

'Yersinia pestis biovar Microtus str. 91001'

In [29]:
record.dbxrefs

['Project:58037']

In [30]:
record.features

[SeqFeature(SimpleLocation(ExactPosition(0), ExactPosition(9609), strand=1), type='source', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(0), ExactPosition(1954), strand=1), type='repeat_region'),
 SeqFeature(SimpleLocation(ExactPosition(86), ExactPosition(1109), strand=1), type='gene', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(86), ExactPosition(1109), strand=1), type='CDS', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(86), ExactPosition(959), strand=1), type='misc_feature', qualifiers=...),
 SeqFeature(SimpleLocation(BeforePosition(110), ExactPosition(209), strand=1), type='misc_feature', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(437), ExactPosition(812), strand=1), type='misc_feature', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(1105), ExactPosition(1888), strand=1), type='gene', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(1105), ExactPosition(1888), strand=1), type='CDS', qualifiers=...),
 SeqFeatu

In [31]:
len(record.features)

41
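
With 41 features, a tally by type gives a quicker overview than the full list. The sketch below uses a small hand-made record (its sequence and features are invented) so that it runs offline; the same `Counter` line should work on the GenBank `record` loaded above.

```python
from collections import Counter

from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqRecord import SeqRecord

# Toy record with three invented features
toy = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"),
    id="toy",
    features=[
        SeqFeature(FeatureLocation(0, 39), type="source"),
        SeqFeature(FeatureLocation(0, 30), type="gene"),
        SeqFeature(FeatureLocation(0, 30), type="CDS"),
    ],
)

# Count how many features of each type the record holds
type_counts = Counter(feature.type for feature in toy.features)
print(type_counts)

# Keep only the features of one type
genes = [feature for feature in toy.features if feature.type == "gene"]
print(len(genes))
```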

# Location Testing



In [39]:
import requests
from Bio import SeqIO
from io import StringIO

# Position of the SNP we are looking for; if a feature contains it, we print that feature's type and db_xref
my_snp = 4350

# URL of the GenBank file
url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb"

# Fetch the content of the file
response = requests.get(url)
response.raise_for_status()  # Ensure we notice bad responses

# Read the content using SeqIO
file_content = StringIO(response.text)
record = SeqIO.read(file_content, "genbank")

print(record)



ID: NC_005816.1
Name: NC_005816
Description: Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
Database cross-references: Project:58037
Number of features: 41
/molecule_type=DNA
/topology=circular
/data_file_division=BCT
/date=21-JUL-2008
/accessions=['NC_005816']
/sequence_version=1
/gi=45478711
/keywords=['']
/source=Yersinia pestis biovar Microtus str. 91001
/organism=Yersinia pestis biovar Microtus str. 91001
/taxonomy=['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia']
/references=[Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)]
/comment=PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The 

In [40]:
for feature in record.features:  # Iterate through every feature in the record
    if my_snp in feature:  # True when the position lies inside the feature's location
        print("%s %s" % (feature.type, feature.qualifiers.get("db_xref")))

source ['taxon:229193']
gene ['GeneID:2767712']
CDS ['GI:45478716', 'GeneID:2767712']
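
The `in` test treats a feature's location as a Python-style half-open interval: the start position is included and the end position is not. A minimal offline sketch with an invented feature:

```python
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Invented gene covering positions 5 up to (but not including) 18
toy_gene = SeqFeature(FeatureLocation(5, 18), type="gene")

# Check positions just outside and just inside each boundary
for position in (4, 5, 17, 18):
    print(position, position in toy_gene)
```

Because these positions are zero-based, a SNP reported in one-based coordinates would need 1 subtracted before the test.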


# Sequence described by feature and location

In [41]:
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Sample DNA sequence
sequence = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Annotating coding region
coding_region = SeqFeature(FeatureLocation(start=0, end=30), type="CDS")
print(f"Annotated Coding Region: {coding_region.location}")


Annotated Coding Region: [0:30]
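
A location only becomes a sequence once it is applied to its parent. A short sketch, assuming the `extract` method of `SeqFeature`, which slices the parent sequence and reverse-complements the slice for features on the minus strand (the second feature below is invented for illustration):

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

sequence = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Same coding region as above: the first 30 letters of the sequence
coding_region = SeqFeature(FeatureLocation(start=0, end=30), type="CDS")
coding_seq = coding_region.extract(sequence)
print(coding_seq)

# Invented minus-strand feature: extract returns the reverse complement
minus_feature = SeqFeature(FeatureLocation(5, 18, strand=-1), type="misc_feature")
minus_seq = minus_feature.extract(sequence)
print(minus_seq)
```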
