# Background

The war between viruses and bacteria has been waged for over a billion years. Viruses called bacteriophages (or simply phages) require a bacterial host to propagate, and so they must somehow infiltrate the bacterium; such deception can only be achieved if the phage understands the genetic framework underlying the bacterium's cellular functions. The phage's goal is to insert DNA that will be replicated within the bacterium and lead to the reproduction of as many copies of the phage as possible, which sometimes also involves the bacterium's demise.

To defend itself, the bacterium must either obfuscate its cellular functions so that the phage cannot infiltrate it, or better yet, go on the counterattack by calling in the air force. Specifically, the bacterium employs aerial scouts called restriction enzymes, which operate by cutting through viral DNA to cripple the phage. But what kind of DNA are restriction enzymes looking for?

The restriction enzyme is a homodimer, which means that it is composed of two identical substructures. Each of these structures separates from the restriction enzyme in order to bind to and cut one strand of the phage DNA molecule; both substructures are pre-programmed with the same target string containing 4 to 12 nucleotides to search for within the phage DNA (see Figure 1). The chance that both strands of phage DNA will be cut (thus crippling the phage) is greater if the target is located on both strands of phage DNA, as close to each other as possible. By extension, the best chance of disarming the phage occurs when the two target copies appear directly across from each other along the phage DNA, a phenomenon that occurs precisely when the target is equal to its own reverse complement. Eons of evolution have made sure that most restriction enzyme targets now have this form.

# Problem

A DNA string is a `reverse palindrome` if it is equal to its reverse complement. For instance, GCATGC is a reverse palindrome because its reverse complement is GCATGC. See Figure 2.
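
The definition can be checked directly in a few lines of plain Python. This is only a minimal sketch: the helper names `reverse_complement` and `is_reverse_palindrome` are hypothetical, and the code assumes an uppercase string containing only A, C, G, and T.

```python
# Complement table for the four DNA bases (assumes uppercase A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the result."""
    return dna.translate(_COMPLEMENT)[::-1]


def is_reverse_palindrome(dna: str) -> bool:
    """True if the string equals its own reverse complement."""
    return dna == reverse_complement(dna)


print(reverse_complement("GCATGC"))      # GCATGC
print(is_reverse_palindrome("GCATGC"))   # True
print(is_reverse_palindrome("GATTACA"))  # False
```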

Given: A DNA string of length at most 1 kbp in FASTA format.

Return: The position and length of every reverse palindrome in the string having length between 4 and 12. You may return these pairs in any order.


# Solution

We are given a DNA string and must search it for reverse palindromes, which represent candidate restriction sites. Our goal is to return the start position of each reverse palindrome along with its length. This can be accomplished with nested for loops. The outer loop iterates through each position in the DNA string. The inner loop takes the substrings of length 4 to 12 that start at that position and checks whether each one is a reverse palindrome. Each time a match is found, the code prints the restriction site's position and length, along with the reverse palindrome itself.

The Biopython package is used to read the FASTA file and to compute the reverse complement of substrings of the DNA sequence.
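
For comparison, the same nested-loop search can be sketched without Biopython, which makes the logic easy to test in isolation. This is an illustrative sketch rather than the notebook's own code: the function name `find_reverse_palindromes`, its `min_len` and `max_len` parameters, and the short input string are assumptions made for the example, and the input is assumed to be uppercase A, C, G, and T only.

```python
def find_reverse_palindromes(dna, min_len=4, max_len=12):
    """Return 1-based (position, length) pairs for every reverse palindrome."""
    complement = str.maketrans("ACGT", "TGCA")
    hits = []
    # Outer loop: every possible start position.
    for start in range(len(dna)):
        # Inner loop: every allowed substring length at that position.
        for length in range(min_len, max_len + 1):
            # Stop once the substring would run past the end of the sequence.
            if start + length > len(dna):
                break
            sub = dna[start:start + length]
            # A reverse palindrome equals its own reverse complement.
            if sub == sub.translate(complement)[::-1]:
                hits.append((start + 1, length))
    return hits


# Hypothetical short input, used here only for illustration.
for position, length in find_reverse_palindromes("TCAATGCATGCGGGTCTATATGCAT"):
    print(position, length)
```

Because no nucleotide is its own complement, a reverse palindrome always has even length, so the odd lengths could be skipped. The sketch checks every length from 4 to 12 to mirror the approach described above.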

In [1]:
# Import necessary packages
from Bio.Seq import Seq
from Bio import SeqIO

In [2]:
fileName = 'rosalind_revp.txt'

with open(fileName,'r') as f:
    
    # Read the first record of the FASTA file and assign its sequence to a variable
    iterator = SeqIO.parse(f, 'fasta')
    dna = next(iterator).seq

    # Loop through each start position in the DNA sequence
    for nuc in range(len(dna)):
        
        # Our palindromes will be between 4 and 12 nucleotides in length
        for i in range(4,13):
        
            # Ensures that we do not go past the max length of our DNA sequence
            if i > len(dna) - nuc:
                break 
            
            # Creates a substring from the DNA sequence 
            # The length of the substring will be 4 nucleotides on the first iteration of
            # the loop, then 5 nucleotides, then 6, until we reach 12 nucleotides
            temp_seq = Seq(dna[nuc:(nuc + i)])
            
            # Check if the substring is a reverse palindrome
            # If the sequence of nucleotides is a reverse palindrome, print out the starting
            # location on the DNA sequence, the length, and the sequence itself
            # Location index is 1-based
            if temp_seq == temp_seq.reverse_complement():
                print(f'Position: {nuc + 1}', f'Length: {len(temp_seq)}',
                      f'Sequence: {temp_seq}')

Position: 1 Length: 4 Sequence: TATA
Position: 2 Length: 4 Sequence: ATAT
Position: 24 Length: 4 Sequence: GGCC
Position: 27 Length: 6 Sequence: CCATGG
Position: 28 Length: 4 Sequence: CATG
Position: 56 Length: 6 Sequence: CGATCG
Position: 57 Length: 4 Sequence: GATC
Position: 59 Length: 4 Sequence: TCGA
Position: 104 Length: 4 Sequence: CCGG
Position: 117 Length: 4 Sequence: CTAG
Position: 118 Length: 8 Sequence: TAGTACTA
Position: 119 Length: 6 Sequence: AGTACT
Position: 120 Length: 4 Sequence: GTAC
Position: 128 Length: 6 Sequence: AATATT
Position: 129 Length: 4 Sequence: ATAT
Position: 137 Length: 10 Sequence: GAGAGCTCTC
Position: 138 Length: 8 Sequence: AGAGCTCT
Position: 139 Length: 6 Sequence: GAGCTC
Position: 140 Length: 4 Sequence: AGCT
Position: 155 Length: 4 Sequence: TATA
Position: 156 Length: 4 Sequence: ATAT
Position: 163 Length: 8 Sequence: GTTATAAC
Position: 164 Length: 6 Sequence: TTATAA
Position: 165 Length: 4 Sequence: TATA
Position: 178 Length: 4 Sequence: CTAG
Posi