# 1. Where in the Genome Does Replication Begin? (part 2)

```
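During DNA replication, the forward and reverse half-strands are replicated at different speeds.

Cytosine (C) tends to mutate into thymine (T) through deamination, so the forward half-strand,
which spends more time single-stranded, ends up with fewer C.

Locate ori by tracking how the balance between G and C changes along the genome.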
```

In [1]:
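def PatternCount(Text, Pattern):
    # Count every occurrence of Pattern in Text, including overlapping ones.
    # (re.findall skips overlapping matches and treats Pattern as a regex.)
    count = 0
    for i in range(len(Text) - len(Pattern) + 1):
        if Text[i:i+len(Pattern)] == Pattern:
            count += 1
    return count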

```
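The sliding window is large (half the genome) and the genome is circular, so copy the first half of the genome and append it to the original genome so that windows can wrap around the end.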
ExtendedGenome = Genome + Genome[0:n//2]
```
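As a minimal sketch of why the extension is needed, the snippet below lists every half-genome window of a small circular genome. The helper name `circular_windows` is hypothetical and not part of the original notebook.

```python
def circular_windows(genome):
    """Return the half-length window starting at each position of a circular genome."""
    n = len(genome)
    half = n // 2
    # Appending the first half lets windows that start near the end wrap around.
    extended = genome + genome[:half]
    return [extended[i:i + half] for i in range(n)]


print(circular_windows("AAAAGGGG"))
```

Windows starting at positions 5, 6 and 7 pick up the leading `A`s again, which is what a plain slice of the original string could not provide.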

In [2]:
def SymbolArray(Genome, symbol):
    array = {}
    n = len(Genome)
    ExtendedGenome = Genome + Genome[0:n//2]
    
    for i in range(n):
        array[i] = PatternCount(ExtendedGenome[i:i+(n//2)], symbol)  # slow: recounts the whole window at every position
        
    return array


print(SymbolArray('AAAAGGGG', 'A'))

{0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 5: 1, 6: 2, 7: 3}


In [3]:
def FasterSymbolArray(Genome, symbol):
    array = {}
    n = len(Genome)
    ExtendedGenome = Genome + Genome[0:n//2]

    # look at the first half of Genome to compute first array value
    array[0] = PatternCount(Genome[0:n//2], symbol)

    for i in range(1, n):
        # start by setting the current array value equal to the previous array value
        array[i] = array[i-1]

        # the current array value can differ from the previous array value by at most 1
        if ExtendedGenome[i-1] == symbol:
            array[i] = array[i]-1
        if ExtendedGenome[i+(n//2)-1] == symbol:
            array[i] = array[i]+1
            
    return array

print(FasterSymbolArray('AAAAGGGG', 'A'))

{0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 5: 1, 6: 2, 7: 3}
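One way to gain confidence that the incremental update matches the brute-force count is to compare both on random genomes. The sketch below re-implements the two approaches compactly under the hypothetical names `slow_symbol_array` and `fast_symbol_array`, so it runs on its own; it is a sanity check, not a proof.

```python
import random


def slow_symbol_array(genome, symbol):
    """Brute force: recount the symbol in every half-genome window."""
    n = len(genome)
    half = n // 2
    extended = genome + genome[:half]
    return {i: extended[i:i + half].count(symbol) for i in range(n)}


def fast_symbol_array(genome, symbol):
    """Incremental: update the previous count as the window slides by one."""
    n = len(genome)
    half = n // 2
    extended = genome + genome[:half]
    array = {0: genome[:half].count(symbol)}
    for i in range(1, n):
        array[i] = array[i - 1]
        if extended[i - 1] == symbol:         # symbol leaving the window
            array[i] -= 1
        if extended[i + half - 1] == symbol:  # symbol entering the window
            array[i] += 1
    return array


random.seed(0)  # fixed seed so the check is reproducible
for _ in range(100):
    length = random.randint(2, 50)
    genome = "".join(random.choice("ACGT") for _ in range(length))
    assert slow_symbol_array(genome, "C") == fast_symbol_array(genome, "C")
print("both versions agree on 100 random genomes")
```

The brute-force version does work proportional to the window length at every position, while the incremental version does constant work per position after the first window.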


### Skew Diagram

skew[i] = #G - #C among the first i nucleotides of the genome (so skew[0] = 0)

In [4]:
genome_1 = 'CATGGGCATCGGCCATACGCC'
genome_2 = 'TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT'

In [5]:
def SkewArray(Genome):
    skew = [0]
    score = {"C": -1, "G": 1, "A": 0, "T": 0}
    for i in range(1,len(Genome)+1):
        skew.append(score[Genome[i-1]] + skew[i-1])
        
    return skew


# positions where the skew reaches its minimum
def MinimumSkew(Genome):
    skew = SkewArray(Genome)
    min_skew = min(skew)  # compute once instead of once per element
    positions = [i for i, x in enumerate(skew) if x == min_skew]

    return positions

In [6]:
skew = SkewArray(genome_1)
print(skew)
print(min(skew))

indexes = MinimumSkew(genome_1)
print(indexes)

[0, -1, -1, -1, 0, 1, 2, 1, 1, 1, 0, 1, 2, 1, 0, 0, 0, 0, -1, 0, -1, -2]
-2
[21]


In [7]:
skew = SkewArray(genome_2)
print(skew)
print(min(skew))

indexes = MinimumSkew(genome_2)
print(indexes)

[0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, -1, 0, 0, 1, 1, 2, 3, 2, 1, 1, 1, 0, 0, -1, 0, 0, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 6, 5, 6, 6, 6, 6, 6, 5, 6, 5, 6, 7, 8, 8, 7, 6, 7, 7, 7]
-1
[11, 24]
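For a full bacterial genome the skew list holds millions of entries. A possible alternative, sketched below under the hypothetical name `minimum_skew_positions`, tracks the running skew and the minimum positions in a single pass without storing the whole array. Treating any character other than G or C as contributing 0 is an assumption of this sketch.

```python
def minimum_skew_positions(genome):
    """Return all positions where the running G - C skew is minimal."""
    score = {"G": 1, "C": -1}
    running = 0       # skew of the prefix seen so far
    best = 0          # skew[0] is 0 by definition
    positions = [0]
    for i, base in enumerate(genome, start=1):
        running += score.get(base, 0)  # A, T (and anything else) count as 0
        if running < best:
            best = running
            positions = [i]            # new minimum: discard earlier positions
        elif running == best:
            positions.append(i)        # tie with the current minimum
    return positions


print(minimum_skew_positions("CATGGGCATCGGCCATACGCC"))
```

The result uses the same indexing as the skew list above: position i refers to the skew after the first i nucleotides.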


### Hamming distance (the total number of mismatches between two strings p and q)

In [8]:
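def HammingDistance(p, q):
    # Hamming distance is only defined for strings of equal length.
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Count the positions at which the two strings differ.
    return sum(1 for a, b in zip(p, q) if a != b)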

In [9]:
genome_1 = 'GGGCCGTTGGT'
genome_2 = 'GGACCGTTGAC'

h_dist = HammingDistance(genome_1, genome_2)
h_dist

3

Find every starting position in Text of a substring Pattern' that matches the k-mer Pattern with at most d mismatches.

In [10]:
def ApproximatePatternMatching(Text, Pattern, d):
    positions = []
    for i in range(len(Text)-len(Pattern)+1):
        if HammingDistance(Text[i:i+len(Pattern)], Pattern) <= d:
            positions.append(i)
    
    return positions

In [11]:
genome = 'CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT'
pattern = 'ATTCTGGA'
dist = 3

positions = ApproximatePatternMatching(genome, pattern, dist)
positions

[6, 7, 26, 27]
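To inspect why each position qualifies, it can help to list the matched window and its mismatch count next to the position. The sketch below is self-contained; the names `hamming` and `approximate_matches` are hypothetical helpers, not functions from the notebook.

```python
def hamming(p, q):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))


def approximate_matches(text, pattern, d):
    """Return (position, window, distance) for every window within d mismatches."""
    k = len(pattern)
    matches = []
    for i in range(len(text) - k + 1):
        window = text[i:i + k]  # always exactly k characters long
        dist = hamming(window, pattern)
        if dist <= d:
            matches.append((i, window, dist))
    return matches


genome = 'CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT'
for pos, window, dist in approximate_matches(genome, 'ATTCTGGA', 3):
    print(pos, window, dist)
```

Returning the distance alongside the position also makes it easy to see which matches would survive a stricter threshold.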