# Combing Through the Haystack

Finding the same interval of DNA in the genomes of two different organisms (often taken from different species) strongly suggests that the interval has the same function in both organisms.

We define a motif as such a commonly shared interval of DNA. A common task in molecular biology is to search an organism's genome for a known motif.

The situation is complicated by the fact that genomes are riddled with intervals of DNA that occur multiple times (possibly with slight modifications), called repeats. These repeats occur far more often than random chance would dictate, which indicates that genomes are anything but random and illustrates how powerful the language of DNA must be (compare the frequent reuse of common words in any human language).

The most common repeat in humans is the Alu repeat, which is approximately 300 bp long and recurs around a million times throughout every human genome (see Figure 1). However, Alu has not been found to serve a positive purpose, and appears in fact to be parasitic: when a new Alu repeat is inserted into a genome, it frequently causes genetic disorders.

[Link to Rosalind](https://rosalind.info/problems/subs/)

# Problem

Given two strings $s$ and $t$, $t$ is a substring of $s$ if $t$ is contained as a contiguous collection of symbols in $s$ (as a result, $t$ must be no longer than $s$).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at position $i$ of $s$ is denoted by $s[i]$.
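Python indexes strings from 0, while the definition above counts from 1. A quick sketch (the variable names are illustrative, not from the notebook) confirms the example positions:

```python
# Positions in the problem statement are 1-based, so start enumerate at 1.
s = "AUGCUUCAGAAAGGUCUUACG"
u_positions = [i for i, symbol in enumerate(s, start=1) if symbol == "U"]
print(u_positions)  # [2, 5, 6, 15, 17, 18]
```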

A substring of $s$ can be represented as $s[j:k]$, where $j$ and $k$ represent the starting and ending positions of the substring in $s$; for example, if $s$ = "AUGCUUCAGAAAGGUCUUACG", then $s[2:5]$ = "UGCU".
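This notation differs from Python slicing: here both ends are 1-based and inclusive, whereas Python slices are 0-based and exclude the end index. A minimal sketch of the conversion, using a hypothetical helper `substring` that is not part of the original notebook:

```python
def substring(s, j, k):
    """Return s[j:k] in the problem's 1-based, inclusive notation (hypothetical helper)."""
    # Shift the start down by one; the inclusive end k equals Python's exclusive end.
    return s[j - 1:k]

s = "AUGCUUCAGAAAGGUCUUACG"
print(substring(s, 2, 5))  # UGCU
```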

The location of a substring $s[j:k]$ is its beginning position $j$; note that $t$ will have multiple locations in $s$ if it occurs more than once as a substring of $s$ (see the Sample below).
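Occurrences of $t$ may overlap, which matters when collecting locations. As an illustrative sketch (the function name `find_locations` is an assumption, not from the notebook), repeatedly calling `str.find` and resuming one symbol past each hit reports every location, whereas `str.count` only counts non-overlapping matches:

```python
def find_locations(s, t):
    """Return all 1-based locations of t in s, including overlapping ones."""
    locations = []
    start = s.find(t)
    while start != -1:
        locations.append(start + 1)    # convert 0-based index to 1-based location
        start = s.find(t, start + 1)   # resume one symbol later to allow overlaps
    return locations

s, t = "GATATATGCATATACTT", "ATAT"
print(find_locations(s, t))  # [2, 4, 10]
print(s.count(t))            # 2, because count() skips the overlapping match at location 4
```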

<span style="color:rgba(70,165,70,255); font-weight:bold">Given</span>: Two DNA strings $s$ and $t$ (each of length at most 1 kbp).

<span style="color:rgba(70,165,70,255); font-weight:bold">Return</span>: All locations of $t$ as a substring of $s$.

# Read Example Input and Output Files

In [8]:
%run ../../functions/read_files.ipynb

In [9]:
input = read_text('sample_input.txt')
print(input)

output = read_text('sample_output.txt')
print(output)

GATATATGCATATACTT
ATAT
2 4 10


# Problem Solving Logic

In [10]:
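def find_substring_positions(text):
    # First line is the DNA string s, second line is the motif t.
    # splitlines() also copes with Windows-style line endings.
    dna_string, dna_substring = text.splitlines()[:2]
    
    # Slide a window of len(t) across s and record 1-based start positions,
    # so overlapping occurrences are all reported.
    positions = []
    for i in range(len(dna_string) - len(dna_substring) + 1):
        if dna_string[i:i+len(dna_substring)] == dna_substring:
            positions.append(i+1)
            
    # Rosalind expects the locations separated by single spaces.
    return " ".join(str(p) for p in positions)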
    
find_substring_positions(input)


'2 4 10'

In [11]:
find_substring_positions(input) == output

True

# Run Real Input

In [12]:
real_input = read_text('rosalind_subs.txt')

find_substring_positions(real_input)

'64 82 97 140 197 222 229 236 251 258 287 294 359 423 438 455 491 671 678 919 926 971'