# Biopython: Sequence objects

* TATA box

The TATA box is a conserved sequence within the promoter, the region where transcription of DNA into RNA begins; its sequence is nearly identical across species.
Transcription starts when TBP (TATA-box binding protein) and transcription factors bind to the TATA box.

In [1]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC  # Bio.Alphabet was removed in Biopython 1.78; this notebook needs an older release
from Bio.Data import CodonTable 

In [2]:
# Part of a TATA box sequence
tatabox_seq = Seq("tataaaggcAATATGCAGTAG")
print(len(tatabox_seq))
print(type(tatabox_seq))

21
<class 'Bio.Seq.Seq'>


### Distinguishing DNA, RNA, and protein
```
IUPAC.IUPACProtein          # the 20 standard amino acids
IUPAC.ExtendedIUPACProtein  # the 20 standard amino acids plus 6 extra letters

IUPAC.unambiguous_dna       # basic DNA: A, C, G, T only
IUPAC.ambiguous_dna         # DNA including ambiguity codes

IUPAC.unambiguous_rna       # basic RNA: A, C, G, U only
IUPAC.ambiguous_rna         # RNA including ambiguity codes
```

In [3]:
tatabox_seq = Seq("tataaaggcAATATGCAGTAG", IUPAC.unambiguous_dna)  # DNA
print(type(tatabox_seq))
print(tatabox_seq.alphabet)

<class 'Bio.Seq.Seq'>
IUPACUnambiguousDNA()


In [4]:
count_a = tatabox_seq.count("A")
count_a

5

In [5]:
# GC content (%). count() is case-sensitive, so the lowercase "tataaaggc" prefix is not counted here.
g_count = tatabox_seq.count("G") 
c_count = tatabox_seq.count("C") 
gc_contents = (g_count + c_count) / len(tatabox_seq) * 100 
print(gc_contents)

19.047619047619047
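
Because `count()` is case-sensitive, the value above reflects only the uppercase bases of the mixed-case sequence. A minimal plain-Python sketch of a case-insensitive calculation follows; the helper name `gc_percent` is hypothetical and not part of Biopython.

```python
def gc_percent(sequence: str) -> float:
    """Return the GC content of a sequence as a percentage, ignoring case."""
    if not sequence:
        return 0.0
    upper = sequence.upper()  # normalise case so lowercase g/c are counted too
    return (upper.count("G") + upper.count("C")) / len(upper) * 100


# 7 of the 21 bases are G or C once the lowercase prefix is included
print(round(gc_percent("tataaaggcAATATGCAGTAG"), 2))  # 33.33
```

Normalising the case once, before counting, keeps the numerator and the denominator consistent with each other.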


In [7]:
count_at = tatabox_seq.count("AT")  # non-overlapping count
count_at

2
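
"Non-overlapping" matters when a motif can overlap itself: after a match, the search resumes past the end of that match. The following plain-Python sketch contrasts the two ways of counting; `count_overlapping` is a hypothetical helper written for illustration.

```python
def count_overlapping(sequence: str, motif: str) -> int:
    """Count occurrences of motif, allowing matches to overlap."""
    if not motif:
        return 0
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Resume one position after the last match start, not after its end
        start = sequence.find(motif, start + 1)
    return count


print("AAAA".count("AA"))             # 2 (non-overlapping)
print(count_overlapping("AAAA", "AA"))  # 3 (overlapping)
```

For a motif such as "AT", which cannot overlap itself, both approaches give the same number.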

In [8]:
# Change case
print(tatabox_seq.upper())
print(tatabox_seq.lower())

TATAAAGGCAATATGCAGTAG
tataaaggcaatatgcagtag


### Transcription & Translation
```
A-T pair: 2 hydrogen bonds
G-C pair: 3 hydrogen bonds

Start codon : ATG
Stop codons : TAA, TAG, TGA

Direction of transcription: 5' (five prime) to 3'

coding strand  : 5'-ATGCAGTAG-3'
template strand: 3'-TACGTCATC-5'

--> transcription --> mRNA : 5'-AUGCAGUAG-3'

--> translation --> protein: Met-Gln-stop(*)
```
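
The relationship between the strands in the notes above can be sketched in plain Python: the mRNA is built by pairing each base of the template strand with its RNA partner, which reproduces the coding strand with U in place of T. The names `DNA_TO_RNA_PAIR` and `transcribe_template` are hypothetical and chosen for this illustration.

```python
# RNA base that pairs with each DNA base on the template strand
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_template(template_3_to_5: str) -> str:
    """Build the mRNA (read 5'->3') from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3_to_5.upper())


template = "TACGTCATC"  # 3'-TACGTCATC-5'
coding = "ATGCAGTAG"    # 5'-ATGCAGTAG-3'

mrna = transcribe_template(template)
print(mrna)                               # AUGCAGUAG
print(mrna == coding.replace("T", "U"))   # True
```

This is why Biopython's `transcribe()` can work directly on the coding strand: it only needs to swap T for U.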

In [9]:
dna = Seq('ATGCAGTAG')
mRna = dna.transcribe()
protein = dna.translate()

print(mRna)
print(protein)

AUGCAGUAG
MQ*


In [10]:
# With several stop codons, to_stop=True ends translation at the first one
# complete coding sequence (CDS)

mRNA = Seq("AUGAACUAAGUUUAGAAU")  

ptn = mRNA.translate() 
print(ptn)

ptn = mRNA.translate(to_stop=True) 
print(ptn)

MN*V*N
MN
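
To make the effect of `to_stop` concrete, here is a plain-Python sketch of codon-by-codon translation using the standard genetic code. It is an illustration under simplifying assumptions (unambiguous bases only, trailing partial codons ignored), not Biopython's implementation; `CODON_TABLE` and this `translate` function are hypothetical names.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(sequence: str, to_stop: bool = False) -> str:
    """Translate DNA or RNA; with to_stop=True, halt before the first stop codon."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored
    for i in range(0, len(dna) - len(dna) % 3, 3):
        residue = CODON_TABLE[dna[i:i + 3]]
        if residue == "*" and to_stop:
            break
        protein.append(residue)
    return "".join(protein)


print(translate("AUGAACUAAGUUUAGAAU"))                # MN*V*N
print(translate("AUGAACUAAGUUUAGAAU", to_stop=True))  # MN
```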


In [11]:
# Split the protein at stop codons

mrna = Seq("AUGAACUAAGUUUAGAAU") 
ptn = mrna.translate() 

for seq in ptn.split("*"): 
    print(seq)

MN
V
N


In [12]:
from Bio.Alphabet import generic_dna

In [13]:
gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
           "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
           generic_dna)

bac_pro = gene.translate(table="Bacterial")
bac_pro

Seq('VKKMQSIVLALSLVLVAPMAKKAPHDHHGGHGPGKHHR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

In [14]:
# Generate the complementary sequence, method 1 (plain Python)

seq = "TATAAAGGCAATATGCAGTAG" 
comp_dic = { 'A':'T', 'C':'G', 'G':'C', 'T':'A' }
comp_seq = ""

for s in seq:
    comp_seq += comp_dic[s]
    
revcomp_seq = comp_seq[::-1]  # reverse the string

print(comp_seq)     # complement
print(revcomp_seq)  # reverse complement

ATATTTCCGTTATACGTCATC
CTACTGCATATTGCCTTTATA
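
The dictionary lookup above raises a `KeyError` for lowercase bases such as those in the original mixed-case TATA box string. One possible alternative, sketched here with the built-in `str.maketrans` and `str.translate`, handles both cases and preserves them; the function names are hypothetical.

```python
# Translation table covering upper- and lowercase bases
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(sequence: str) -> str:
    """Return the complementary strand, preserving letter case."""
    return sequence.translate(COMPLEMENT)


def reverse_complement(sequence: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(sequence)[::-1]


seq = "tataaaggcAATATGCAGTAG"
print(complement(seq))          # atatttccgTTATACGTCATC
print(reverse_complement(seq))  # CTACTGCATATTgcctttata
```

Characters missing from the table (for example `N`) pass through unchanged, which may or may not be the desired behaviour.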


In [15]:
# Generate the complementary sequence, method 2 (Biopython)
seq = Seq("TATAAAGGCAATATGCAGTAG") 
comp_seq = seq.complement() 
rev_comp_seq = seq.reverse_complement()

print(comp_seq)      # complement
print(rev_comp_seq)  # reverse complement

ATATTTCCGTTATACGTCATC
CTACTGCATATTGCCTTTATA


In [16]:
# Print a codon table
codon_table = CodonTable.unambiguous_dna_by_name["Standard"] 
print(codon_table) 

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [17]:
codon_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"] 
print(codon_table)

Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [18]:
# Find an ORF (open reading frame: start codon through stop codon)

tatabox_seq = Seq("tataaaggcAATATGCAGTAG")

start_idx = tatabox_seq.find("ATG")   
end_idx = tatabox_seq.find("TAG", start_idx)  # only TAG is searched, for simplicity

orf = tatabox_seq[start_idx:end_idx+3]
print(orf)

ATGCAGTAG
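
The search above takes the first "TAG" after the start codon regardless of reading frame, which works for this sequence but can cut an ORF short when a stop-like triplet appears out of frame. A minimal plain-Python sketch that steps codon by codon and accepts all three stop codons is shown below; `find_first_orf` is a hypothetical helper, not a Biopython function.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_first_orf(sequence: str) -> str:
    """Return the first ATG...stop stretch read in frame, or '' if none exists."""
    seq = sequence.upper()
    start = seq.find("ATG")
    while start != -1:
        # Walk forward three bases at a time so stops are only seen in frame
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                return seq[start:i + 3]
        start = seq.find("ATG", start + 1)
    return ""


print(find_first_orf("tataaaggcAATATGCAGTAG"))  # ATGCAGTAG
# "TAG" occurs at index 4 here, but out of frame, so it is skipped
print(find_first_orf("ATGATAGCCTAA"))           # ATGATAGCCTAA
```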


### MutableSeq

In [19]:
# Seq objects are immutable
my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

```
my_seq[5] = "G"   # error. 'Seq' object does not support item assignment
```

In [20]:
from Bio.Seq import MutableSeq

In [21]:
mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
mutable_seq[5] = "C"
mutable_seq.remove("T")
mutable_seq

MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())

In [22]:
new_seq = mutable_seq.toseq()   # immutable
new_seq

Seq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
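
The split between `Seq` and `MutableSeq` mirrors the one between Python's immutable `str` and mutable `bytearray`. The sketch below is only an analogy and does not use Biopython; it repeats the two edits from the cell above on a `bytearray` and then converts back to an immutable string.

```python
dna = "GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"

# str is immutable: item assignment raises TypeError, as it does for Seq
try:
    dna[5] = "C"
except TypeError as error:
    print("str:", error)

# bytearray is mutable, so the same edits can be applied in place
mutable = bytearray(dna, "ascii")
mutable[5] = ord("C")       # replace the base at index 5
mutable.remove(ord("T"))    # delete the first remaining "T"

edited = mutable.decode("ascii")  # back to an immutable str
print(edited)  # GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA
```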

### The Bio.SeqUtils module

In [23]:
from Bio.SeqUtils import GC
from Bio.SeqUtils import molecular_weight
from Bio.SeqUtils import six_frame_translations
from Bio.SeqUtils import MeltingTemp as mt

In [24]:
# Compute GC content (%)

exon_seq = Seq("ATGCAGTAG")
gc_contents = GC(exon_seq)
print(gc_contents)

44.44444444444444


In [25]:
# Molecular weight
# The same letters give a different molecular weight depending on the molecule type.

seqStr = "ATGCAGTAG"

seq1 = Seq(seqStr) 
seq2 = Seq(seqStr, IUPAC.unambiguous_dna) 
seq3 = Seq(seqStr, IUPAC.protein) 

print(molecular_weight(seq1))
print(molecular_weight(seq2))
print(molecular_weight(seq3))

2842.8206999999993
2842.8206999999993
707.7536


In [26]:
# The six possible translation frames of a DNA sequence

seq1 = Seq("AGTCTGGGACGGCGCGGCAATCGCA") 
print(six_frame_translations(seq1))

GC_Frame: a:5 t:3 g:10 c:7 
Sequence: agtctgggac ... ggcaatcgca, 25 nt, 68.00 %GC


1/1
  S  G  T  A  R  Q  S
 V  W  D  G  A  A  I  A
S  L  G  R  R  G  N  R
agtctgggacggcgcggcaatcgca   68 %
tcagaccctgccgcgccgttagcgt
T  Q  S  P  A  A  I  A 
 D  P  V  A  R  C  D  C
  R  P  R  R  P  L  R




In [27]:
# Melting temperature (Tm) of a DNA sequence
# The temperature at which double-stranded DNA separates into single strands; it rises with GC content.

myseq = Seq("AGTCTGGGACGGCGCGGCAATCGCA")
print(mt.Tm_Wallace(myseq))

84.0
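
`Tm_Wallace` applies the Wallace rule of thumb, Tm = 4 °C x (G + C) + 2 °C x (A + T), which is why GC-rich sequences melt at higher temperatures. The plain-Python sketch below reproduces the figure above; `wallace_tm` is a hypothetical helper that assumes an unambiguous DNA sequence.

```python
def wallace_tm(sequence: str) -> float:
    """Estimate Tm in degrees Celsius with the Wallace rule: 4*(G+C) + 2*(A+T)."""
    upper = sequence.upper()
    gc = upper.count("G") + upper.count("C")
    at = upper.count("A") + upper.count("T")
    return 4.0 * gc + 2.0 * at


# 17 G/C bases and 8 A/T bases: 4 * 17 + 2 * 8 = 84
print(wallace_tm("AGTCTGGGACGGCGCGGCAATCGCA"))  # 84.0
```

The rule is a rough estimate intended for short oligonucleotides; `Bio.SeqUtils.MeltingTemp` also provides more detailed models.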


In [28]:
# Convert amino acids between 3-letter and 1-letter codes
from Bio.SeqUtils import seq1

amino_acid = "LeuLysMetValIleThrTrpPhe"

amino_acid_1 = seq1(amino_acid)
print(amino_acid_1)

LKMVITWF


In [29]:
from Bio.SeqUtils import seq3

amino_acid_3 = seq3(amino_acid_1)
print(amino_acid_3)

LeuLysMetValIleThrTrpPhe
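
Conceptually, both conversions are dictionary lookups over the 20 standard amino acids. The following plain-Python sketch shows the idea; the names `THREE_TO_ONE`, `three_to_one`, and `one_to_three` are hypothetical, and unlike the Biopython helpers this version does not handle ambiguous or non-standard residues.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def three_to_one(residues: str) -> str:
    """Convert a run of 3-letter codes (e.g. 'LeuLys') to 1-letter codes."""
    codes = [residues[i:i + 3] for i in range(0, len(residues), 3)]
    return "".join(THREE_TO_ONE[code.capitalize()] for code in codes)


def one_to_three(residues: str) -> str:
    """Convert 1-letter codes (e.g. 'LK') to a run of 3-letter codes."""
    return "".join(ONE_TO_THREE[code] for code in residues.upper())


print(three_to_one("LeuLysMetValIleThrTrpPhe"))  # LKMVITWF
print(one_to_three("LKMVITWF"))                  # LeuLysMetValIleThrTrpPhe
```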
