# Control flow

Control flow statements such as `continue` and `break` let you control how your loops proceed: you can skip the rest of an iteration or leave a loop early.

One of the simplest, but also one of the best, uses of control flow is reducing the level of indentation in your loops. Less indentation is generally clearer. 

Consider the following example (we'll start with something simple and contrived and then use something familiar and sensible): imagine we want to iterate over some numbers and print the square of each number unless the number is 2. We can do that as follows:

In [1]:
numbers = [1, 2, 3, 4, 5, 6, 7]

for n in numbers:
    if n != 2:
        print(n**2)

1
9
16
25
36
49


While that loop isn't complicated yet, if we were to have more conditions, it could become several levels indented. Instead of that `if` check that only proceeds if `n != 2`, we could instead use a `continue` to simply skip that iteration. Once a `continue` statement is reached within a loop, none of the rest of the current iteration is executed. Instead, the next iteration is started.

In [2]:
for n in numbers:
    if n == 2:
        continue

    print(n**2)

1
9
16
25
36
49


As you can see, even though the `print(n**2)` is no longer indented inside an `if` block, it is still not executed when `n == 2`, because `continue` skips the rest of the loop body for that iteration when the condition is satisfied.
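To make the indentation point concrete, here is a minimal sketch with more than one condition. The extra conditions (odd numbers only, and less than 7) are invented purely for illustration. The nested version gains a level of indentation per condition, while the `continue` version keeps every check at the same level.

```python
numbers = [1, 2, 3, 4, 5, 6, 7]

# Nested version: each extra condition adds another level of indentation
nested = []
for n in numbers:
    if n != 2:
        if n % 2 == 1:  # illustrative extra condition: odd numbers only
            if n < 7:   # illustrative extra condition: less than 7
                nested.append(n**2)

# Flat version: each condition becomes its own check that skips the iteration
flat = []
for n in numbers:
    if n == 2:
        continue
    if n % 2 == 0:
        continue
    if n >= 7:
        continue

    flat.append(n**2)

print(nested)
print(flat)
```

Both loops collect the same values, `[1, 9, 25]`, but in the flat version the work done for accepted numbers sits at a single level of indentation no matter how many checks come before it.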

What about an example where we actually might achieve something by using control flow statements? The homolog identification script is a good case in which control flow statements can improve our processing of BLAST hits and BED features. I've included the BLAST output and BED file for *Vibrio cholerae* in the Canvas folder with these notebooks, which we will use here. 

In [3]:
blastfile = "Vibrio_cholerae_N16961_blastout.txt"
bedfile = "Vibrio_cholerae_N16961.bed"

# read blast file
hits = []
with open(blastfile) as fin:
    for line in fin: # iterating over an open file yields one line at a time
        
        # unpack and convert types of desired columns. This is ugly. We'll revisit later...
        _, sid, pcnt, matchlen, _, _, _, _, sstart, send, _, _, qlen = line.split()
        pcnt = float(pcnt)
        matchlen = int(matchlen)
        sstart = int(sstart)
        send = int(send)
        qlen = int(qlen)
    
        # Keep hits that could be homologs
        if pcnt > 30 and matchlen > 0.9*qlen:
            # We could store matches as a list or tuple.
            # We won't want to modify the elements so a tuple is "safer" in that we then can't modify it by mistake
            hits.append((sid, sstart, send))

# Now read the bed file
feats = []
with open(bedfile) as fin:
    for line in fin:
        bed_sid, bed_start, bed_end, gene, *_ = line.split() # an asterisk before a variable name when unpacking makes that variable store remaining elements
        bed_start = int(bed_start)
        bed_end = int(bed_end)
        
        feats.append((bed_sid, bed_start, bed_end, gene))

# Now we have our two datasets read in, we can loop over them to find matches
homologs = []
for blast_sid, blast_sstart, blast_send in hits: # unpack our blast data
    for bed_sid, bed_start, bed_end, gene in feats:
        # Don't bother checking the rest if the sid doesn't match
        if blast_sid != bed_sid:
            continue
        
        # This check assumes the features are sorted by start position.
        # Once a feature starts at or beyond either end of our hit, no later feature can contain the hit, so go to the next hit (break loop over feats)
        if blast_sstart <= bed_start or blast_send <= bed_start:
            break
        
        # Otherwise, check if the hit is inside the feature
        if (blast_sstart > bed_start
            and blast_sstart <= bed_end
            and blast_send > bed_start
            and blast_send <= bed_end
        ):
            homologs.append(gene)
            break # Each BLAST hit will only be in one feature so move to next hit once you've found it

# Get the unique homologs using a set()
unique_homologs = set(homologs)

print(len(unique_homologs))

34


The control flow statements in that code helped us to separate out our different checks. Separating the checks lets us space them apart with whitespace to improve clarity. It also allows us to organize them better into logical groupings. Finally, using `break` allowed us to skip useless operations, speeding up our loop.
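As a rough illustration of how `break` skips useless operations, the sketch below counts how many features are examined with and without it. The feature coordinates, gene names, and the `find_feature` helper are all hypothetical and are not taken from the *Vibrio cholerae* files. The sketch also assumes the features are sorted by start position, which is what makes stopping early safe.

```python
# Hypothetical features as (start, end, name), sorted by start position
features = [
    (10, 50, "geneA"),
    (60, 120, "geneB"),
    (130, 200, "geneC"),
    (210, 300, "geneD"),
]

def find_feature(hit_start, hit_end, use_break):
    """Return (name of the feature containing the hit, number of features examined)."""
    checked = 0
    found = None
    for start, end, name in features:
        checked += 1

        # Features are sorted, so once one starts at or after the hit start,
        # neither it nor any later feature can contain the hit
        if use_break and hit_start <= start:
            break

        # Check whether both ends of the hit fall inside this feature
        if start < hit_start <= end and start < hit_end <= end:
            found = name
            if use_break:
                break  # a hit lies in at most one feature, so stop looking

    return found, checked

print(find_feature(70, 110, use_break=True))
print(find_feature(70, 110, use_break=False))
```

With `break`, the search stops after the second feature; without it, all four features are examined even though the answer is the same. The saving grows with the number of features and hits.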