# DNA Composition: GC and AT Content


## GC Content Of DNA

+ GC content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either Guanine (G) or Cytosine (C)

A T = 2 hydrogen bonds

G C = 3 hydrogen bonds

## Usefulness
+ In polymerase chain reaction (PCR) experiments, GC content helps us predict the annealing temperature
+ A higher GC content indicates a relatively higher melting temperature
+ DNA with a low GC content is less stable than DNA with a high GC content
+ High GC content can make PCR amplification difficult, because it is hard to design a primer long enough to provide good specificity
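
The stability points above follow from the bond counts: every G or C contributes to a 3-bond pair and every A or T to a 2-bond pair. A minimal sketch of that arithmetic is shown here. The function name `hydrogen_bonds` and the sequence `high_gc` are made up for illustration, and the count is only a crude proxy for stability (it ignores base stacking and salt effects).

```python
def hydrogen_bonds(seq):
    """Count hydrogen bonds if every base in seq were Watson-Crick paired."""
    seq = str(seq).upper()
    at_pairs = seq.count("A") + seq.count("T")  # 2 bonds per A-T pair
    gc_pairs = seq.count("G") + seq.count("C")  # 3 bonds per G-C pair
    return 2 * at_pairs + 3 * gc_pairs


low_gc = "ATGATCTCGTAA"   # 4 of 12 bases are G or C
high_gc = "ATGCGCGCGTAA"  # hypothetical sequence, 7 of 12 bases are G or C

print(hydrogen_bonds(low_gc), hydrogen_bonds(high_gc))  # 28 31
```

For two sequences of the same length, the one with more G and C has more hydrogen bonds to break, which is consistent with its higher melting temperature.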

## AT Contents In DNA

+ AT content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either Adenine (A) or Thymine (T)
+ AT base pairing yields only 2 hydrogen bonds

In [4]:
from Bio.Seq import Seq

In [5]:
from Bio.SeqUtils import GC

In [7]:
dna_seq = Seq('ATGATCTCGTAA')


In [9]:
# Let's find the GC content percentage
GC(dna_seq)

33.333333333333336

In [11]:
# Method 2
# Seq objects support count(), which we can use to build a custom GC function
# For example, the number of A's in the sequence:

dna_seq.count('A')

4

In [15]:
# Let's make our own function


def gc_content(seq):
    result = float(seq.count('G') + seq.count('C')) / len(seq) * 100
    return result



In [16]:
gc_content(dna_seq)

33.33333333333333

In [22]:
# Method 3
def gc_content2(seq):
    gc = [B for B in seq if B in 'GC']
    result = float(len(gc)) / len(seq) * 100
    return result

In [23]:
gc_content2(dna_seq)

33.33333333333333

In [24]:
dna_seq.lower()

Seq('atgatctcgtaa')

In [25]:
gc_content2('atgatctcgtaa')
# Our function doesn't work with lower case: it only matches upper-case 'G' and 'C'

0.0

In [27]:
GC('atgatctcgtaa')
# Biopython's GC function does handle lower case

33.333333333333336

In [31]:
# We can convert the input to upper case inside our own function
def gc_content3(seq):
    gc = [B for B in seq.upper() if B in 'GC']
    result = float(len(gc)) / len(seq) * 100
    return result

In [33]:
gc_content3('atgatctcgtaa')
# Our function now works with both upper and lower case, because the input is forced to upper case

33.33333333333333
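
The custom functions above still raise `ZeroDivisionError` on an empty sequence. A minimal sketch of a more defensive version follows. The name `gc_percent` is hypothetical, and returning 0.0 for empty input is a design choice, not something Biopython prescribes. It is also worth knowing that recent Biopython releases replace `GC` with `gc_fraction` in `Bio.SeqUtils`, which returns a value between 0 and 1 rather than a percentage, so the import used in this notebook may fail on newer installations.

```python
def gc_percent(seq):
    """Return the GC percentage of seq, ignoring case.

    Returns 0.0 for an empty sequence instead of raising ZeroDivisionError
    (an assumption made for this sketch).
    """
    seq = str(seq).upper()  # str() lets this accept Seq objects as well as strings
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq) * 100


print(gc_percent("atgatctcgtaa"))
print(gc_percent(""))
```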

In [34]:
# AT Content

# Making our own function

def at_content(seq):
    result = float(seq.count('A') + seq.count('T')) / len(seq) * 100
    return result


In [35]:
at_content(dna_seq)

66.66666666666666
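
For a sequence containing only A, C, G and T, AT content and GC content always add up to 100%. A minimal sketch that computes the frequency of every base in one pass and checks this is shown here. The helper name `base_composition` is made up for illustration, and the function assumes a non-empty sequence.

```python
from collections import Counter


def base_composition(seq):
    """Return the percentage of each of A, C, G, T in a non-empty sequence."""
    seq = str(seq).upper()
    counts = Counter(seq)  # bases that never occur simply count as 0
    total = len(seq)
    return {base: counts[base] / total * 100 for base in "ACGT"}


comp = base_composition("ATGATCTCGTAA")
at = comp["A"] + comp["T"]
gc = comp["G"] + comp["C"]
print(comp)
print(at + gc)  # approximately 100 for an unambiguous sequence
```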

## Melting Point Of DNA

+ Higher GC means higher melting point
+ Tm_Wallace: 'Rule of thumb'
+ Tm_GC: Empirical formulas based on GC content. Salt and mismatch corrections can be included.
+ Tm_NN: Calculations based on nearest neighbor thermodynamics. Several tables for DNA/DNA, DNA/RNA and RNA/RNA hybridizations are included. Correction for mismatches, dangling ends, salt concentration and other additives are available.
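
The Wallace "rule of thumb" is simple enough to write by hand: add 2 °C for every A or T and 4 °C for every G or C. A minimal sketch follows. The function name `tm_wallace` is our own, and unlike Biopython's `Tm_Wallace` it makes no attempt to handle ambiguous bases.

```python
def tm_wallace(seq):
    """Wallace rule: Tm = 2 * (A + T) + 4 * (G + C), in degrees Celsius."""
    seq = str(seq).upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2.0 * at + 4.0 * gc


print(tm_wallace("ATGATCTCGTAA"))  # 8 A/T and 4 G/C bases give 32.0
```

The rule is generally considered useful only for short oligonucleotides such as primers.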


In [37]:
import Bio.SeqUtils

In [38]:
dir(Bio.SeqUtils)

['GC',
 'GC123',
 'GC_skew',
 'IUPACData',
 'Seq',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'cos',
 'molecular_weight',
 'nt_search',
 'pi',
 're',
 'seq1',
 'seq3',
 'sin',
 'six_frame_translations',
 'xGC_skew']

In [39]:
from Bio.SeqUtils import MeltingTemp as mt

In [40]:
dna_seq


Seq('ATGATCTCGTAA')

In [41]:
GC(dna_seq)

33.333333333333336

In [42]:
# Check the melting point using the Wallace rule

mt.Tm_Wallace(dna_seq)

32.0

In [43]:
# Checking for the melting point using GC content

mt.Tm_GC(dna_seq)

23.32155893208184
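
To make the "empirical formula based on GC content" idea concrete, here is a hedged sketch of one such formula. The constants (81.5, 0.41, 600), the salt term and the 50 mM sodium default were chosen because they reproduce the values printed in this notebook. Treat them as an assumption about what `mt.Tm_GC` does with its default arguments, not as a statement of Biopython's API. The name `tm_gc_sketch` is hypothetical.

```python
import math


def tm_gc_sketch(seq, na_molar=0.05):
    """Empirical GC-based Tm estimate in degrees Celsius (illustrative only).

    Tm = 81.5 + 16.6 * log10(Na / (1 + 0.7 * Na)) + 0.41 * %GC - 600 / N
    where Na is the sodium concentration in mol/L and N is the sequence length.
    """
    seq = str(seq).upper()
    n = len(seq)
    gc_percent = (seq.count("G") + seq.count("C")) / n * 100
    salt_term = 16.6 * math.log10(na_molar / (1.0 + 0.7 * na_molar))
    return 81.5 + salt_term + 0.41 * gc_percent - 600.0 / n


print(tm_gc_sketch("ATGATCTCGTAA"))
```

The formula shows why the estimate rises with GC percentage and with sequence length, and why it differs from the Wallace value for the same sequence.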

## Exercise

+ Which of the following has the higher GC content?
+ ex1 = 'ATGCATGGTGCGCGA'
+ ex2 = 'ATTTGTGCTCCTGGA'

In [44]:
ex1 = 'ATGCATGGTGCGCGA'
ex2 = 'ATTTGTGCTCCTGGA'

In [45]:
GC(ex1)

60.0

In [46]:
GC(ex2)

46.666666666666664

In [None]:
# ex1 has the highest GC content

In [47]:
# One way to get the GC content, AT content and melting temperature together

def get_metrics(seq):
    gc = GC(seq)
    at = at_content(seq)
    melting = mt.Tm_GC(seq)
    result = "GC:{} , AT:{}, Temp{}".format(gc, at, melting)
    return result

In [48]:
get_metrics(ex1)

'GC:60.0 , AT:40.0, Temp44.254892265415165'

In [49]:
get_metrics(ex2)

'GC:46.666666666666664 , AT:53.333333333333336, Temp38.7882255987485'

## GC Skew

+ Checks whether the nucleotides G and C are over- or under-abundant in a particular region of DNA or RNA
+ Helps to indicate the DNA leading strand or lagging strand
+ GC skew positive = leading
+ GC skew negative = lagging



In [51]:
from Bio.SeqUtils import GC123, GC_skew, xGC_skew

In [53]:
# GC content first, second, third position


dna_seq


Seq('ATGATCTCGTAA')

In [55]:
GC(dna_seq)

33.333333333333336

In [54]:
GC123(dna_seq)

(33.333333333333336, 0.0, 25.0, 75.0)
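
The four numbers are the overall GC percentage followed by the GC percentage at the first, second and third codon positions. A minimal sketch of that calculation, using string slicing with a step of 3, is shown here. The function name `gc_by_codon_position` is made up, and the handling of an incomplete trailing codon may differ from Biopython's `GC123`.

```python
def gc_by_codon_position(seq):
    """Return (overall, first, second, third) GC percentages for codon positions."""
    seq = str(seq).upper()

    def pct(bases):
        # Percentage of G or C in a string of bases; 0.0 if the string is empty
        if not bases:
            return 0.0
        return sum(base in "GC" for base in bases) / len(bases) * 100

    # seq[0::3] picks every first codon base, seq[1::3] every second, and so on
    return (pct(seq), pct(seq[0::3]), pct(seq[1::3]), pct(seq[2::3]))


print(gc_by_codon_position("ATGATCTCGTAA"))
```

For `ATGATCTCGTAA` the codons are ATG, ATC, TCG and TAA, so the first positions (A, A, T, T) contain no G or C, the second positions (T, T, C, A) contain one, and the third positions (G, C, G, A) contain three.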

In [56]:
# GC skew
# By default the window is 100
# GC_skew(sequence, window)
GC_skew(dna_seq)

[0.0]

In [57]:
# Let's make the window 10

GC_skew(dna_seq,10)

[0.0, 0.0]

In [58]:
GC_skew('ATGGGGTCCCGCTC')

[0.0]
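
Every result so far is 0.0 because each example sequence (or window) contains equal numbers of G and C, or none at all. GC skew for a window is (G - C) / (G + C). A minimal sketch of the windowed calculation follows. The name `gc_skew_windows` is our own, and treating a window with no G or C as 0.0 is an assumption that matches the outputs shown above.

```python
def gc_skew_windows(seq, window=100):
    """Return (G - C) / (G + C) for consecutive, non-overlapping windows."""
    seq = str(seq).upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # A window with no G or C has an undefined skew; report 0.0
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


print(gc_skew_windows("ATGATCTCGTAA", 10))  # [0.0, 0.0]
print(gc_skew_windows("GGGCCCCG", 4))       # [0.5, -0.5]
```

The second call uses a made-up sequence whose first window is G-rich (positive skew) and whose second window is C-rich (negative skew).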

In [59]:
# xGC_skew plots the GC skew in a graphical (Tkinter) window, so there is no text output
xGC_skew(dna_seq)

## Subsequences
+ Search for a DNA subsequence in a sequence; nt_search returns a list containing the search pattern followed by the positions where it was found

In [64]:
from Bio.SeqUtils import nt_search

In [65]:
main_s1 = Seq('ACTATT')

In [66]:
subseq = Seq('ACT')

In [69]:
nt_search(str(main_s1), str(subseq))
# It is found at position 0

['ACT', 0]

In [70]:
subseq = Seq('ATT')
nt_search(str(main_s1), str(subseq))
# It is found at position 3

['ATT', 3]
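
For plain A, C, G, T patterns the same positions can be found without Biopython. A minimal sketch using `str.find` is shown here. The helper name `find_all` is hypothetical, it reports overlapping matches, and unlike `nt_search` it does not understand IUPAC ambiguity codes such as `N`.

```python
def find_all(sequence, pattern):
    """Return the 0-based start positions of every (overlapping) match."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one base after the last hit so overlapping matches are found
        start = sequence.find(pattern, start + 1)
    return positions


print(find_all("ACTATT", "ACT"))  # [0]
print(find_all("ACTATT", "ATT"))  # [3]
print(find_all("AAAA", "AA"))     # [0, 1, 2]
```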

In [None]:
# This is how to find subsequences

In [None]:
# This is all Section 2 Module 2  - Bioinformatics With Biopython - Part 8. BioPython - DNA Composition - GC Content,AT Content and Frequency