## Extract a Gene Sequence from a GenBank file using Python

- **Task**
    - Extract the rpoB gene from a bacterial genome
- **Python library**
    - Biopython
- **Information required**
    - feature type (gene)
    - gene name (rpoB)

In [18]:
from Bio import SeqIO

In [19]:
file_path='/home/kobina/Desktop/sequence.gb'

In [20]:
gb_obj=SeqIO.read(file_path,'gb')

In [21]:
genes=[]

In [22]:
for feature in gb_obj.features:
    if feature.type=='gene':
        genes.append(feature)

In [23]:
len(genes)

2872

In [24]:
gene_of_interest='rpoB'

In [25]:
hits=[]

In [26]:
for gene in genes:
    # not every gene feature carries a 'gene' qualifier, so check before indexing
    if 'gene' in gene.qualifiers:
        if gene_of_interest == gene.qualifiers['gene'][0]:
            hits.append(gene)
            print('gene found')

gene found


In [27]:
len(hits)

1
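The filtering done in cells 21 to 26 can be folded into one helper. The following is a minimal sketch, not part of the original notebook: `find_gene_features` is a hypothetical name, and it assumes only that the record exposes `.features` whose items have a `.type` and a `.qualifiers` mapping, as Biopython's `SeqRecord` and `SeqFeature` do. Unlike the loop above, it matches the name against every value of the `gene` qualifier rather than only the first.

```python
def find_gene_features(record, gene_name, feature_type='gene'):
    """Return all features of `feature_type` whose 'gene' qualifier contains `gene_name`."""
    hits = []
    for feature in record.features:
        if feature.type != feature_type:
            continue
        # qualifier values are lists; a missing 'gene' qualifier gives an empty list
        if gene_name in feature.qualifiers.get('gene', []):
            hits.append(feature)
    return hits
```

With the objects defined above, `find_gene_features(gb_obj, 'rpoB')` should return the same single-element list as `hits`.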

In [28]:
rpoB=hits[0]

In [29]:
extracted_sequence=rpoB.extract(gb_obj)

In [30]:
len(extracted_sequence.seq)

3552
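`extract()` slices the parent sequence at the feature's location and reverse-complements the slice when the feature lies on the minus strand, so the result reads in the gene's own 5' to 3' direction. A toy sketch of that behaviour; the sequence, coordinates, and gene names below are made up for illustration and do not come from the genome used here:

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

# made-up 12 bp "genome"
toy_seq = Seq('AAACCCGGGTTT')

# hypothetical plus-strand gene covering positions 3..9 (0-based, end-exclusive)
fwd_gene = SeqFeature(FeatureLocation(3, 9, strand=1), type='gene',
                      qualifiers={'gene': ['fwdA']})

# hypothetical minus-strand gene covering positions 0..6
rev_gene = SeqFeature(FeatureLocation(0, 6, strand=-1), type='gene',
                      qualifiers={'gene': ['revB']})

fwd_extracted = str(fwd_gene.extract(toy_seq))  # plain slice: 'CCCGGG'
rev_extracted = str(rev_gene.extract(toy_seq))  # reverse complement of 'AAACCC': 'GGGTTT'

print(fwd_extracted, rev_extracted)
```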

In [31]:
extracted_sequence.id='rpoB'

In [32]:
extracted_sequence.description='NCTC_8325'

In [33]:
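# GC() returns a percentage; it was deprecated in Biopython 1.80 and later removed.
# On newer releases use gc_fraction(), which returns a fraction between 0 and 1.
from Bio.SeqUtils import GC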

In [34]:
gc_content=GC(extracted_sequence.seq)

In [35]:
gc_content

37.556306306306304

In [36]:
round(gc_content,2)

37.56
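The value above is simply (G + C) / length × 100. A plain-Python sketch of that arithmetic follows; `gc_percent` is a hypothetical helper, and it counts only G and C without attempting to handle IUPAC ambiguity codes, so it may differ from Biopython on sequences that contain them.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in `sequence` (case-insensitive)."""
    seq = str(sequence).upper()
    if not seq:
        # avoid dividing by zero on an empty sequence
        return 0.0
    gc_count = seq.count('G') + seq.count('C')
    return 100.0 * gc_count / len(seq)
```

For example, `round(gc_percent(extracted_sequence.seq), 2)` can be compared against the rounded value shown above.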

In [37]:
outputfile='/home/kobina/Desktop/rpob.fasta'

In [38]:
SeqIO.write(extracted_sequence,outputfile,'fasta')

1

## Extract Multiple Gene Sequences from a GenBank file

* Task: extract gene sequences from an *E. coli* genome
* Requirements:
    * a text file listing the genes of interest, one name per line
    * Biopython

In [1]:
genome_file='/home/kobina/Desktop/sakai.gb'
gene_list_file='/home/kobina/Desktop/vir_factors.txt'

In [2]:
from Bio import SeqIO

In [4]:
with open(gene_list_file,'r') as input_file:
    # strip all surrounding whitespace (including '\r') and skip blank lines
    gene_names=[line.strip() for line in input_file if line.strip()]

In [5]:
len(gene_names)

28

In [6]:
gene_names[0:5]

['eae', 'ecpA', 'ecpB', 'ecpC', 'ecpD']

In [7]:
gb_object=SeqIO.read(genome_file,'gb')

In [8]:
allgenes=[feature for feature in gb_object.features if feature.type =='gene']

In [9]:
len(allgenes)

5329

In [10]:
gene_sequences=[]

In [11]:
for gene in allgenes:
    if 'gene' in gene.qualifiers:
        gene_name=gene.qualifiers['gene'][0]
        if gene_name in gene_names:
            extract=gene.extract(gb_object)
            extract.id=gene_name
            extract.description=''
            gene_sequences.append(extract)
            print('gene %s has been found'%gene_name)

gene ecpE has been found
gene ecpD has been found
gene ecpC has been found
gene ecpB has been found
gene ecpA has been found
gene ecpR has been found
gene paa has been found
gene lpfB has been found
gene espF has been found
gene escF has been found
gene cesD has been found
gene espB has been found
gene espD has been found
gene espA has been found
gene sepL has been found
gene escD has been found
gene eae has been found
gene tir has been found
gene escN has been found
gene escV has been found
gene escJ has been found
gene escC has been found
gene cesD has been found
gene escU has been found
gene escT has been found
gene escS has been found
gene ler has been found
gene lpfA has been found


In [12]:
len(gene_sequences)

28
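Matching counts (28 names requested, 28 sequences collected) do not by themselves show that every name was found exactly once: the log above prints `cesD` twice. A minimal sketch for cross-checking the two lists; `check_hits` is a hypothetical helper, not part of Biopython:

```python
from collections import Counter

def check_hits(wanted_names, found_names):
    """Return (names never found, {name: count} for names found more than once)."""
    counts = Counter(found_names)
    missing = [name for name in wanted_names if counts[name] == 0]
    duplicated = {name: n for name, n in counts.items() if n > 1}
    return missing, duplicated
```

Here it could be called as `check_hits(gene_names, [rec.id for rec in gene_sequences])`. Any duplicated name will also produce repeated record IDs in the FASTA output, which some downstream tools reject.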

In [13]:
outputfile='/home/kobina/Desktop/vir_factors.fasta'

In [14]:
SeqIO.write(gene_sequences,outputfile,'fasta')

28