# Find a Median String

[ba2b](https://rosalind.info/problems/ba2b/)

Given a k-mer Pattern and a longer string Text, we use d(Pattern, Text) to denote the minimum Hamming distance between Pattern and any k-mer in Text:

$$d(\text{Pattern}, \text{Text}) = \min_{\text{all } k\text{-mers Pattern}' \text{ in Text}} \text{HammingDistance}(\text{Pattern}, \text{Pattern}')$$

Given a k-mer Pattern and a set of strings $\text{Dna} = \{\text{Dna}_1, \ldots, \text{Dna}_t\}$, we define d(Pattern, Dna) as the sum of the distances between Pattern and each string in Dna:

$$d(\text{Pattern}, \text{Dna}) = \sum_{i=1}^{t} d(\text{Pattern}, \text{Dna}_i)$$

Our goal is to find a k-mer Pattern that minimizes d(Pattern, Dna) over all k-mers Pattern, which is the same task posed by the Equivalent Motif Finding Problem. We call such a k-mer a median string for Dna.
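
To make the two definitions concrete, here is a minimal sketch that evaluates them on the sample dataset from the problem statement. The function names `hamming_distance`, `distance_to_text`, and `distance_to_dna` are chosen for illustration only and are not part of the original notebook.

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))


def distance_to_text(pattern, text):
    """d(Pattern, Text): minimum Hamming distance to any k-mer of text."""
    k = len(pattern)
    return min(hamming_distance(pattern, text[i:i + k])
               for i in range(len(text) - k + 1))


def distance_to_dna(pattern, dna):
    """d(Pattern, Dna): sum of d(Pattern, Dna_i) over all strings in dna."""
    return sum(distance_to_text(pattern, text) for text in dna)


sample_dna = [
    "AAATTGACGCAT",
    "GACGACCACGTT",
    "CGTCAGCGCCTG",
    "GCTGAGCACCGG",
    "AGTACGGGACAG",
]

# Per-string distances for the sample answer GAC, then their sum.
per_string = [distance_to_text("GAC", text) for text in sample_dna]
print(per_string)                          # [0, 0, 1, 1, 0]
print(distance_to_dna("GAC", sample_dna))  # 2
```

GAC occurs exactly in three of the five strings and is one mismatch away from a 3-mer in each of the other two, so its total distance is 2.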

## Median String Problem

Find a median string.

Given: An integer k and a collection of strings Dna.

Return: A k-mer Pattern that minimizes d(Pattern, Dna) over all k-mers Pattern. (If multiple answers exist, you may return any one.)

Sample Dataset

    3
    AAATTGACGCAT
    GACGACCACGTT
    CGTCAGCGCCTG
    GCTGAGCACCGG
    AGTACGGGACAG

Sample Output

    GAC

# Pseudocode

    MEDIANSTRING(Dna, k)
        distance <-- inf
        for each k-mer Pattern from AA...AA to TT...TT
            if distance > d(Pattern, Dna)
                distance <-- d(Pattern, Dna)
                Median <-- Pattern
        return Median

In [2]:
import numpy as np
from itertools import product
from sys import path
path.append("../")
from common import get_hamming_distance
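
The `common` module is not shown here. As an assumption about what `get_hamming_distance` does, a minimal stand-in could count mismatched positions between two equal-length strings; the length check is an addition for illustration and may not match the actual helper.

```python
def get_hamming_distance(p, q):
    """Hypothetical stand-in for the helper imported from `common`.

    Counts the positions at which two equal-length strings differ.
    """
    if len(p) != len(q):
        # zip would silently truncate, so reject unequal lengths explicitly
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(p, q))


print(get_hamming_distance("GAC", "GTC"))  # 1
```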

In [3]:
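def median_string(dnas, k):
    """Return a k-mer minimizing d(pattern, dnas), by brute force over all 4**k k-mers."""
    dist = np.inf
    median = None
    # product yields tuples in lexicographic order, from 'AA...A' to 'TT...T'
    for pattern in product('ACGT', repeat=k):
        pattern = ''.join(pattern)
        # d(pattern, dnas): sum of the per-string minimum Hamming distances
        total = sum(min_hamming_distance(pattern, dna, k) for dna in dnas)
        # the strict '>' keeps the first minimizer found when several k-mers tie
        if dist > total:
            dist = total
            median = pattern
    return median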


In [4]:
def min_hamming_distance(pattern, dna, k):
    """Return d(pattern, dna): the minimum Hamming distance from pattern to any k-mer of dna."""
    return min(get_hamming_distance(pattern, dna[i:i+k]) for i in range(len(dna)-k+1))
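
Because the problem accepts any minimizer, a brute-force search may report a different k-mer than the sample output. The sketch below is a variant, not the notebook's method, that collects every k-mer tied for the minimum. It is self-contained, and the names `hamming`, `total_distance`, and `all_median_strings` are illustrative.

```python
from itertools import product


def hamming(p, q):
    """Number of mismatched positions between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))


def total_distance(pattern, dnas):
    """d(Pattern, Dna): sum over strings of the minimum k-mer distance."""
    k = len(pattern)
    return sum(
        min(hamming(pattern, dna[i:i + k]) for i in range(len(dna) - k + 1))
        for dna in dnas
    )


def all_median_strings(dnas, k):
    """Return (best distance, list of all k-mers achieving it)."""
    best, medians = None, []
    for letters in product("ACGT", repeat=k):
        pattern = "".join(letters)
        total = total_distance(pattern, dnas)
        if best is None or total < best:
            best, medians = total, [pattern]   # new strict minimum
        elif total == best:
            medians.append(pattern)            # tie with the current minimum
    return best, medians


sample = ["AAATTGACGCAT", "GACGACCACGTT", "CGTCAGCGCCTG",
          "GCTGAGCACCGG", "AGTACGGGACAG"]
best, medians = all_median_strings(sample, 3)
print(best, medians)
```

On the sample dataset, the sample output GAC has total distance 2, and ACG reaches the same total, so which one a single-answer search returns depends on iteration order and tie-breaking.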

In [5]:
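file = "MedianString/inputs/rosalind_ba2b.txt"
with open(file, 'r') as f:
    lines = f.readlines()
k = int(lines[0])
# skip blank lines (e.g. a trailing newline), which would otherwise become empty strings
dna = [line.strip() for line in lines[1:] if line.strip()]
print(median_string(dna, k))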



CGATCA
