## Calculate GC content of a sequence in a single FASTA file

**Activities**
- Download a FASTA file from the NCBI database
- Read the FASTA file
- Calculate GC content

**Python Library**
- Biopython

### Calculate GC content of a single-sequence FASTA file

In [7]:
filename="/home/kobina/Desktop/sequence.fasta"

In [8]:
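from Bio import SeqIO
# Note: GC is deprecated in newer Biopython releases in favor of gc_fraction,
# which returns a fraction (0-1) rather than a percentage.
from Bio.SeqUtils import GC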

In [9]:
# SeqIO.read expects exactly one record in the file
seq_object=SeqIO.read(filename,"fasta")
sequence=seq_object.seq


In [10]:
len(sequence)

5631606

In [11]:
print(sequence[0:20])

TTGACCAATGACCCCGGTTC


In [12]:
gc_content=GC(sequence)

In [13]:
print(gc_content)

65.47269109380166


In [14]:
round(gc_content,2)

65.47

In [15]:
sequence2="AGCCTAC"
GC(sequence2)

57.142857142857146
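
To make the arithmetic behind that number explicit, here is a minimal pure-Python sketch of the same percentage calculation. The helper name `gc_percent` is hypothetical and not part of Biopython; it counts only G and C (case-insensitive), so it may differ from Biopython's function on sequences that contain ambiguity codes.

```python
def gc_percent(sequence):
    """Return the percentage of G and C bases in a DNA string (hypothetical helper)."""
    seq = str(sequence).upper()
    if not seq:
        # Avoid division by zero on an empty sequence
        return 0.0
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


# "AGCCTAC" has 1 G and 3 C out of 7 bases: 4 / 7 = 57.14 %
print(round(gc_percent("AGCCTAC"), 2))
```

The result agrees with the value returned for `sequence2` above once it is rounded to two decimals.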

## GC content of sequences in multiple FASTA files

- Download multiple FASTA files
- Use loops to
   - read the files
   - calculate GC content
- Save the results to a file

- Required Libraries
   -  Biopython
   -  Pandas


In [63]:
from Bio import SeqIO
from Bio.SeqUtils import GC
import pandas as pd
import os
import glob

In [64]:
file_directory='/home/kobina/Desktop/sequences'

In [65]:
fasta_files=glob.glob(os.path.join(file_directory,'*.fasta'))

In [66]:
len(fasta_files)

5

In [67]:
fasta_files[0]

'/home/kobina/Desktop/sequences/AR465.fasta'

In [68]:
print(fasta_files)

['/home/kobina/Desktop/sequences/AR465.fasta', '/home/kobina/Desktop/sequences/R50.fasta', '/home/kobina/Desktop/sequences/M48.fasta', '/home/kobina/Desktop/sequences/P10.fasta', '/home/kobina/Desktop/sequences/V521.fasta']
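
The order of the list returned by `glob.glob` depends on the file system, so it may differ between machines. A small sketch of one way to get a reproducible order, assuming alphabetical order is acceptable; the temporary directory and file names below are made up for illustration.

```python
import glob
import os
import tempfile

# Build a throwaway directory with a few empty files (names are made up)
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("R50.fasta", "AR465.fasta", "notes.txt"):
        open(os.path.join(tmp_dir, name), "w").close()

    pattern = os.path.join(tmp_dir, "*.fasta")
    # sorted() makes the order reproducible across runs and platforms
    found = sorted(glob.glob(pattern))
    found_names = [os.path.basename(path) for path in found]

print(found_names)  # ['AR465.fasta', 'R50.fasta']
```

Only files matching the `*.fasta` pattern are returned, so `notes.txt` is ignored.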


In [69]:
def calculate_gc(fasta):
    seq_obj=SeqIO.read(fasta,'fasta')
    sequence=seq_obj.seq
    
    gc_content=GC(sequence)
    gc_content=round(gc_content,2)
    
    # Take the file name without its directory or extension;
    # str.strip('.fasta') would remove characters, not the suffix
    filename=os.path.splitext(os.path.basename(fasta))[0]
    
    print(filename)
    print(gc_content)
    

In [70]:
for fasta in fasta_files:
    calculate_gc(fasta)

AR465
32.92
R50
32.91
M48
32.81
P10
32.71
V521
32.81
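
When deriving a label from a file path, note that `str.strip('.fasta')` removes any of the characters `.`, `f`, `a`, `s`, `t` from both ends of the string rather than removing the suffix. The sketch below contrasts it with `os.path.splitext`; the helper name `sample_name` and the path are hypothetical.

```python
import os


def sample_name(path):
    """Return the file name without directory or final extension (hypothetical helper)."""
    return os.path.splitext(os.path.basename(path))[0]


path = "/data/sequences/fast_sample.fasta"  # made-up path for illustration

# str.strip treats '.fasta' as a set of characters and trims them from both ends
stripped = os.path.basename(path).strip(".fasta")

print(stripped)           # _sample
print(sample_name(path))  # fast_sample
```

Names such as `AR465` happen to survive `strip` because they neither start nor end with one of those characters, which is why the problem can go unnoticed.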


In [71]:
filenames=[]
gc_contents=[]

In [72]:
def calculate_gc(fasta):
    seq_obj=SeqIO.read(fasta,'fasta')
    sequence=seq_obj.seq
    
    gc_content=GC(sequence)
    gc_content=round(gc_content,2)
    
    # File name without directory or extension (splitext removes only the suffix)
    filename=os.path.splitext(os.path.basename(fasta))[0]
    
    print('GC content has been calculated')
    
    return filename,gc_content

In [73]:
for fasta in fasta_files:
    filename,gc_content=calculate_gc(fasta)
    
    filenames.append(filename)
    gc_contents.append(gc_content)

GC content has been calculated
GC content has been calculated
GC content has been calculated
GC content has been calculated
GC content has been calculated


In [74]:
len(filenames)

5

In [75]:
len(gc_contents)

5

In [76]:
filenames[0]

'AR465'

In [77]:
gc_contents[0]

32.92

In [78]:
dataframe=pd.DataFrame()
dataframe['filename']=filenames
dataframe['gc_content']=gc_contents

In [79]:
dataframe.shape

(5, 2)

In [80]:
dataframe.head()

  filename  gc_content
0    AR465       32.92
1      R50       32.91
2      M48       32.81
3      P10       32.71
4     V521       32.81


In [81]:
outputfile='/home/kobina/Desktop/sequences/gc_content.csv'

In [82]:
dataframe.to_csv(outputfile,index=False)
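
As an alternative sketch, assuming pandas is not installed, the same two-column table can be written with the standard-library `csv` module. The values are copied from the output above, and the temporary output path is an assumption made so the example has no side effects on the Desktop folder.

```python
import csv
import os
import tempfile

# Values copied from the results shown earlier
filenames = ["AR465", "R50", "M48", "P10", "V521"]
gc_contents = [32.92, 32.91, 32.81, 32.71, 32.81]

# Write to a temporary location (assumed path, for illustration only)
output_path = os.path.join(tempfile.mkdtemp(), "gc_content.csv")
with open(output_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["filename", "gc_content"])    # header row
    writer.writerows(zip(filenames, gc_contents))  # one row per file

# Read the file back to confirm what was written
with open(output_path, newline="") as handle:
    rows = list(csv.reader(handle))

print(rows[0])        # ['filename', 'gc_content']
print(len(rows) - 1)  # 5
```

Values read back with `csv.reader` are strings, so they need converting with `float` before any further arithmetic.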


## Calculate GC content of sequences in a multi-FASTA file

- Python Libraries:
  - Biopython (_pip install biopython --user_)
  - Pandas (_pip install pandas --user_)

In [2]:
from Bio import SeqIO
from Bio.SeqUtils import GC
import pandas as pd

In [3]:
filepath='/home/kobina/Desktop/multi-fasta.fasta'

In [4]:
seq_objects=SeqIO.parse(filepath,'fasta')

In [5]:
# SeqIO.parse returns an iterator; store the records in a list so they can be counted and reused
sequences=list(seq_objects)

In [6]:
number_of_sequences=len(sequences)
print(number_of_sequences)

3


In [7]:
for seq in sequences:
    seq_id=seq.id
    sequence=seq.seq
    gc_content=GC(sequence)
    gc_content=round(gc_content,2)
    print(seq_id,gc_content)
    

SeqID_01 34.26
SeqID_02 34.21
SeqID_03 40.91
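
To illustrate what each record in that loop carries, here is a simplified pure-Python sketch that splits FASTA text into (ID, sequence) pairs and computes the same percentage. The function `iter_fasta` is hypothetical and is not how Biopython is implemented; it assumes plain text input with `>` header lines, and the toy records are made up.

```python
import io


def iter_fasta(handle):
    """Yield (record_id, sequence) pairs from FASTA-formatted text (simplified sketch)."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # The ID is the first whitespace-delimited word after '>'
            words = line[1:].split()
            record_id = words[0] if words else ""
            chunks = []
        elif record_id is not None:
            # Sequence lines are joined, so wrapped sequences are handled
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


# Made-up records, unrelated to the file used above
example = io.StringIO(">toy_1 first record\nATGC\nGGCC\n>toy_2\nATATAT\n")
for record_id, sequence in iter_fasta(example):
    gc = 100.0 * sum(sequence.count(base) for base in "GC") / len(sequence)
    print(record_id, round(gc, 2))
```

For real data, `SeqIO.parse` remains the better choice because it also handles descriptions, other formats, and malformed input.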


In [8]:
seq_ids=[]
gc_contents=[]

for seq in sequences:
    seq_id=seq.id
    sequence=seq.seq
    gc_content=GC(sequence)
    gc_content=round(gc_content,2)
    
    seq_ids.append(seq_id)
    gc_contents.append(gc_content)
    print('GC content has been computed')
    

GC content has been computed
GC content has been computed
GC content has been computed


In [9]:
print(seq_ids)

['SeqID_01', 'SeqID_02', 'SeqID_03']


In [10]:
print(gc_contents)

[34.26, 34.21, 40.91]


In [11]:
print(seq_ids[0])
print(gc_contents[0])

SeqID_01
34.26


In [12]:
dataframe=pd.DataFrame()
dataframe['Sequence_ID']=seq_ids
dataframe['GC_Content']=gc_contents

In [13]:
print(dataframe.shape)

(3, 2)


In [14]:
print(dataframe)

  Sequence_ID  GC_Content
0    SeqID_01       34.26
1    SeqID_02       34.21
2    SeqID_03       40.91


In [15]:
outputfile='/home/kobina/Desktop/gc_content.csv'
dataframe.to_csv(outputfile,index=False)