# Generate your own random DNA sequences #

Let's do some math and make some random strings. It's often useful to have some random DNA sequences around whose characteristics you know because you generated them that way. This is a problem you can actually solve with just the things we've learned so far (although there are other, better ways to solve it with Python modules). We'll also use the new built-ins we just learned for numbers: round() and abs().

To solve this, I used:

- importing a module
- input statements (assigned their values to variables)
- math -- multiplication, division, abs(), switching from fractions to percentages, round()
- string concatenation and string multiplication
- converting a string to a list
- shuffling a list in place with random.shuffle()
- joining a sequence with .join()
- and of course, the major elements of a function: def line with arguments, return line with one value, and a function call

In the first cell below, write this as a function that takes two parameters (desired_length, and desired_gc as a percentage from 0 to 100) and returns the randomized sequence. In the second cell, collect the user input and call the function on it. In the third cell, use your GC function (from class) to check your results.

In [3]:
# your functions here
import random

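def randomDNA(desired_length, desired_gc):
    # number of G/C bases needed to hit the requested percentage
    numGC = round(desired_length * (desired_gc/100))
    # split the G/C bases randomly between G and C
    g = random.randint(0,numGC)
    c = numGC - g
    # the remaining bases are split randomly between A and T
    numAT = desired_length - numGC
    a = random.randint(0,numAT)
    t = numAT - a
    # build an ordered string, then shuffle it as a list
    dna = "G"*g + "C"*c + "A"*a + "T"*t
    dna_list = list(dna)
    random.shuffle(dna_list)
    dna = "".join(dna_list)
    return dna


def gc_content(dna):
    # fraction of bases that are G or C, converted to a whole-number percentage
    dec = (dna.count("C")+dna.count("G")) / len(dna)
    percent = round((dec) * 100)
    return percent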

In [4]:
# get user input
length = int(input("Enter length: "))
gcPercent = int(input("Enter gc percent (0-100): "))
seq = randomDNA(length, gcPercent)

Enter length:  100
Enter gc percent (0-100):  50


In [5]:
# check gc
print(seq)
print(gc_content(seq),"%")
print("len:",len(seq))

AGGGGCAGAGTGAAGAACGAGAGAGAGCACGAATCAACCAAGGTGAAGCGGAATGAAGGAATGGGCCATGCGAAGGGAGAGAAAATCTCGAGTACATTGA
50 %
len: 100
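
Because the number of G and C bases has to be a whole number, the requested percentage cannot always be hit exactly. The sketch below shows what the rounding step does to the target; the helper name `achievable_gc` is hypothetical and only mirrors the arithmetic used in `randomDNA`. Note that Python 3's `round()` sends exact halves to the nearest even integer.

```python
def achievable_gc(desired_length, desired_gc):
    """Return (number of G/C bases, actual GC percent) for a requested target.

    Hypothetical helper for illustration; it repeats the rounding in randomDNA.
    """
    num_gc = round(desired_length * (desired_gc / 100))
    actual = 100 * num_gc / desired_length
    return num_gc, actual

print(achievable_gc(100, 50))  # (50, 50.0): the target is hit exactly
print(achievable_gc(10, 25))   # (2, 20.0): 2.5 rounds to the even integer 2
print(achievable_gc(7, 50))    # 3.5 rounds to 4, so the actual GC is above 50
```

For short sequences the gap between requested and actual GC content can be several percentage points, so it is worth checking the result with the GC function, as the third cell does.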


# Restriction Enzymes #

In this exercise, we'll take a large DNA sequence string and cut it with restriction enzymes, then slice out the pieces and append them to a list of fragments.

The very first cell below is pre-filled with some code that opens and interprets the sequence file I distributed along with this exercise. Download that file into the same directory where you are running Jupyter, and then run the first cell. For now, just take a look at the code and the explanation; we'll learn to do this soon enough.

At the end, the file contents will be stored as one big long sequence string in the variable DNASeq. You can print it out if you like just to be sure.

In [6]:
def genomic_fasta(genome):
    """genomic_fasta: parses the sequence lines out of a genomic DNA FASTA file
    parameters: expects an open file object
    return: a single DNA sequence string
    """
    lines = genome.readlines() # converts the open file object to a list
    DNASeq = [] # creates a new empty list
    for i in range (0, len(lines)): # iterates over the lines from the file
        if lines[i][0:1] != ">": # filters out the header line that starts with >
            DNASeq.append(lines[i].strip("\n")) # appends the remaining lines to the empty list after stripping
        
    return ''.join(DNASeq)    # joins and returns the DNA sequence lines

DNAFile = open("NC_007898.fasta") # creates a file object from a stored file
DNASeq = genomic_fasta(DNAFile) # passes the file object to the fasta parser function
DNAFile.close() # closes the file object
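
To see what the parser does without needing the sequence file, here is a minimal sketch that runs the same idea on a made-up record held in memory. The function name `parse_fasta_text` and the toy record are assumptions for illustration, not part of the distributed code.

```python
import io

def parse_fasta_text(handle):
    """Join all non-header lines of a single-record FASTA into one string."""
    return "".join(line.strip() for line in handle if not line.startswith(">"))

# A made-up record standing in for a real FASTA file.
fake_file = io.StringIO(">toy_record example header\nACGTAC\nGGTTAA\n")
toy_seq = parse_fasta_text(fake_file)
print(toy_seq)       # ACGTACGGTTAA
print(len(toy_seq))  # 12
```

With a real file, opening it in a `with open("NC_007898.fasta") as handle:` block would close the file automatically when the block ends.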

# The Pst1 Restriction Enzyme #

Recall that restriction enzymes bind to the DNA double helix and cut at a specific, palindromic sequence of DNA.

The Pst1 enzyme recognizes the DNA sequence CTGCAG. It cuts leaving an overhanging single strand. On the strand that you read 5' to 3' in a DNA sequence file, the cut happens at 5'...CTGCA^G...3'. If you're cutting DNA at this site, every fragment should end with CTGCA and start with G.
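
"Palindromic" here means that the site reads the same on both strands when each is read 5' to 3', in other words the site equals its own reverse complement. A small sketch to check this for CTGCAG (the helper `reverse_complement` is written here for illustration):

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string made of A, C, G, T."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq))

site = "CTGCAG"
print(reverse_complement(site))          # CTGCAG
print(reverse_complement(site) == site)  # True
```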

What we want to do is cut the string stored in DNASeq at this restriction enzyme cut site and save a list that contains the correct fragments, along with a second list of coordinates for where each fragment begins in the original sequence.

First try counting instances of the pattern to see what information that gives you about this genome.

In [7]:
DNASeq.count("CTGCAG")

15

In the cell below, try cutting the sequence on the pattern using .split() and see if that gives you enough information and if it gives you correct sequences (remember they have to start and end with fragments of the restriction pattern).

In [8]:
cuts = DNASeq.split("CTGCAG")
print(len(cuts))
print(cuts[0])

16
TGGGCGAACGACGGGAATTGAACCCGCGCATGGTGGATTCACAATCCACTGCCTTGATCCACTTGGCTACATCCGCCCCCTCGCCTACTTACATTCCATTTTTACATTATTTAAATTAGAAAACAAAAGATTCAAGTTCGAATATTTCTCTTCTTTCTTAGTTCAATGATATTATTATTTTGATTATTATTTCAAAAATAAGAATATGAAGTCAAAATTTTATTTTTTGTGAAATGAAATAAAAAAGATATAGTAACATTAGTAACAAGAGGAACACGTTATATTTCTACAATTTTCAATAAATAGAAAATAAAACATAGAATACTCAATCATGAATAAATATCATGAATAAATGCAAGCAAATACCCTCTCTTTCTTTTTCGATAATGTAAACAAAAAAGTCTATGTCAGTAAAATACTAGGAAATTAGTAAAGAAAAAAAAAAAAAGAAAGGAGCAATAGCACCCTCTTGACAGAACAAGAAAATGATTATTGCTCCTTTCTTTTCAAAACCTCCTATAGACTAGGCTAGGATCTTATCCATTTGTAGATGGAGCTTCGATAGCAGCTAGGTCTAGAGGGAAGTTATGAGCATTACGTTCATGCATAACTTCCATACCAAGGTTAGCACGGTTGATGATATCAGCCCAAGTGTTAATTACACGACCCTGACTGTCAACTACAGATTGGTTGAAATTGAAACCATTTAGGTTGAAAGCCATAGTGCTGATACCTAAAGCGGTAAACCAGATACCTACTACAGGCCAAGCAGCTAGGAAGAAGTGTAACGAACGAGAGTTGTTGAAACTAGCATATTGGAAGATCAATCGGCCAAAATAACCATGAGCGGCTACGATATTATAAGTTTCTTCCTCTTGACCGAATCTGTAACCTTCATTAGCAGATTCATTTTCTGTGGTTTCCCTGATCAAACTAGAAGTTACCAAGGAACCATGCATAGCACTGAATAGGGAGCCGCCGAATACCCCAGCTACGC

The challenge is that the .split() method will completely remove the nucleotides in the CTGCAG pattern if you split on that pattern, leaving the sequences incorrect, and it won't give you coordinates. 
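
The loss is easy to see on a short made-up sequence. The sketch below splits a toy string with two sites, shows that 12 nucleotides go missing, and then puts them back by hand; the re-attachment loop is one possible workaround, not necessarily the intended solution, and it still yields no coordinates.

```python
toy = "AACTGCAGTTCTGCAGGG"  # made-up sequence containing two CTGCAG sites

pieces = toy.split("CTGCAG")
print(pieces)                                 # ['AA', 'TT', 'GG']
print(sum(len(p) for p in pieces), len(toy))  # 6 18

# Re-attach what .split() removed: CTGCA ends a fragment, G starts the next.
repaired = []
for n, piece in enumerate(pieces):
    if n > 0:
        piece = "G" + piece
    if n < len(pieces) - 1:
        piece = piece + "CTGCA"
    repaired.append(piece)

print(repaired)                 # ['AACTGCA', 'GTTCTGCA', 'GGG']
print("".join(repaired) == toy) # True
```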

Try using the .find() method and see what information it gives you about the sequence:

In [9]:
DNASeq.find("CTGCAG")

1336

On the other hand, the .find() method would find you the coordinates of a pattern, but only the first instance, so you might have to use it iteratively, or find some other tricksy way to use it. I can think of a few. 
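
Using .find() iteratively relies on its optional second argument, which tells it where to start searching. A minimal sketch on a toy string (the function name `find_all` is hypothetical):

```python
def find_all(seq, pattern):
    """Return the start index of every occurrence of pattern in seq."""
    positions = []
    start = seq.find(pattern)
    while start != -1:  # .find() returns -1 when nothing more is found
        positions.append(start)
        start = seq.find(pattern, start + 1)  # resume just past the last hit
    return positions

toy = "AACTGCAGTTCTGCAGGG"  # made-up sequence with two sites
print(find_all(toy, "CTGCAG"))  # [2, 10]
```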

Finally, the DNA we give you comes from a linearized circular chromosome, so the first and last fragments will have to be combined into a single fragment in the correct order.

Put on your python thinking hat, and see if you can figure it out! You can solve this with only methods that we've learned so far. Use functions where you can, but don't stress about making absolutely everything into a function.

In [22]:
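index = 0      # start of the fragment currently being collected
cuts = []      # fragments, in order along the linear sequence
indices = []   # start position of each CTGCAG site in DNASeq

i = DNASeq.find("CTGCAG",index)
while (i != -1):
    cuts.append(DNASeq[index:i+5])   # fragment ends with CTGCA
    indices.append(i)
    index = i+5                      # next fragment starts at the G
    i = DNASeq.find("CTGCAG",index+1)

cuts.append(DNASeq[index:])          # whatever follows the last site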

#Showing that all characters from DNASeq are in cuts
l = 0
for cut in cuts:
    l += len(cut)
print(l) #number of nucleotides in cuts
print(len(DNASeq)) #number of nucleotides in DNASeq
#the 15 CTGCAG sites start at these indices (each cut falls 5 bases later)
print("indices: ", indices)
print("len(indices): ", len(indices))
#resulting in these 16 sub-sequences
print("len(cuts): ", len(cuts))

# circular fragment checks the last 5 --> first 5 nucleotides for a site spanning the junction
circularFragment = DNASeq[-5:]+DNASeq[:5]
print(circularFragment)
i = circularFragment.find("CTGCAG")
if i != -1:
    print(circularFragment[i:i+5])
    indices.append(i)
    cuts.append(circularFragment[i:i+5])


155461
155461
indices:  [1336, 19876, 34954, 35941, 57409, 76539, 82565, 85455, 87034, 94652, 99163, 123165, 142168, 146679, 154297]
len(indices):  15
len(cuts):  16
AATAATGGGC
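
The junction string printed above contains no CTGCAG, so no site spans the point where the circle was opened. What remains is to join the last linear fragment to the first one, as the exercise asks. The sketch below shows one way to do that on toy data; `merge_circular` is a hypothetical helper, and it assumes that no site spans the junction.

```python
def merge_circular(fragments):
    """Join the last and first fragments of a linearized circular sequence.

    Assumes `fragments` lists the linear pieces in order and that no
    recognition site spans the junction between the end and the start.
    """
    if len(fragments) < 2:
        return list(fragments)
    # On the circle, the last linear piece runs straight into the first one.
    return [fragments[-1] + fragments[0]] + fragments[1:-1]

toy_fragments = ["AACTGCA", "GTTCTGCA", "GGG"]  # made-up linear fragments
print(merge_circular(toy_fragments))  # ['GGGAACTGCA', 'GTTCTGCA']
```

After merging, the number of fragments equals the number of sites, and every fragment starts with G and ends with CTGCA. Applied to the 16 entries in cuts, this would leave 15 fragments for the 15 sites found.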
