# Scaling Up: using for loops to compare multiple genomes

Part of the beauty of bioinformatics is that once you figure out how to do something once, you can often ask the computer to repeat the process hundreds or thousands or millions of times. It usually takes some work to organize the results carefully, but far less time than if you had to run extensive comparisons by hand.

## In this section

- Looping external software over directories full of genomes or other files
- Using BLAST to compare gene content conservation between genomes
- Using a double for loop to perform pairwise comparisons

## Prerequisites

- Familiar with basic BASH (cd, ls, mkdir, etc.)
- Familiar with basic python (variables, for loops, if statements)
- Have command-line BLAST+ installed (see Duck vs. Yeast exercise)
- Familiar with loading `.csv` files as `pandas` `DataFrames`

## Table of Contents
* [In this section](#In-this-section)
* [Prerequisites](#Prerequisites)
* [Downloading files using python](#Downloading-files-using-python)
* [Looping over files in a folder](#Looping-over-files-in-a-folder)
* [Running external software with `subprocess`](#Running-external-software-with-subprocess)
* [Using a double `for` loop to compare pairs of genomes](#Using-a-double-for-loop-to-compare-pairs-of-genomes)

## Running software in a loop

### Creating a directory to hold our data



In [7]:
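from os import makedirs
from os.path import exists

data_dir = "./resources/genomes"

#Check if our data directory already
#exists. If not, make it (makedirs also creates
#the parent 'resources' folder if it is missing,
#whereas mkdir would raise an error)
if not exists(data_dir):
    makedirs(data_dir)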

### Downloading files using python

### Looping over files in a folder

We can use the `listdir` function to get a list of everything in a directory. If we then loop over that list using a `for` loop, we can get the name of each file in turn. Often our directory may have some files we don't want (e.g. random hidden system files that are not data files). We can address this by using an if statement to check whether each file matches some criterion for being a valid data file, and just skip those that aren't part of our analysis. 

#### Getting the contents of a directory

Before we can run software on a collection of data, we need to organize it into a folder. I made a folder called 'resources' and inside that a folder called 'genomes'. So relative to my current working directory, my data are in './resources/genomes'. You can put your data anywhere, as long as you can figure out the relative path to that place from your current working directory. If you forget your current working directory, you can always check it using the `getcwd` function, which you can import from the `os` module.
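
As a quick illustration of checking the working directory, here is a minimal sketch. It assumes the same example path used in this section (`./resources/genomes`); `abspath` is used only to show where that relative path points, and it works whether or not the folder exists yet.

```python
from os import getcwd
from os.path import abspath

#Get the directory that relative paths are resolved against
current_dir = getcwd()
print(f"Current working directory: {current_dir}")

#Show the absolute location that our relative data path
#points to from the current working directory
data_dir = "./resources/genomes"
print(f"{data_dir} resolves to: {abspath(data_dir)}")
```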

In [8]:
from os import listdir

#First use a string to set the directory where our data is
#Defining this in a variable makes it easy to change later.
#(Replace my string with the path to wherever your data is, 
# relative to where you started the jupyter notebook)

data_dir = "./resources/genomes/" 

#Save data files into a variable
data_files = listdir(data_dir)

#Use a format string to print out our list of data files
print(f"Files in data dir {data_dir}:{data_files}")

Files in data dir ./resources/genomes/:['genome3.fna', 'genome2.fna', 'genome1.fna', 'genome4.fna']
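
The filtering idea described at the start of this section (loop over the directory listing and skip anything that isn't a data file) might look something like the sketch below. The function name `list_genome_files` and the choice of `.fna` as the criterion for a valid data file are assumptions made for illustration; adjust the check to match your own files.

```python
from os import listdir
from os.path import exists

def list_genome_files(data_dir, extension=".fna"):
    """Return a sorted list of the data files in data_dir."""
    genome_files = []
    for filename in listdir(data_dir):
        #Skip hidden system files (their names start with a dot)
        if filename.startswith("."):
            continue
        #Skip anything that doesn't have the expected extension
        if not filename.endswith(extension):
            continue
        genome_files.append(filename)
    #listdir returns names in arbitrary order, so sort them
    return sorted(genome_files)

data_dir = "./resources/genomes/"
#Only list the files if the data directory is actually there
if exists(data_dir):
    print(list_genome_files(data_dir))
```

Using `continue` to skip unwanted files keeps the main body of the loop unindented, which becomes more helpful as the loop grows.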


### Running external software with `subprocess`

We want to do BLAST comparisons between each pair of genomes. To do that, we'll have to learn how to run external software in general, then apply what we learned to BLAST. By learning this in a general way, you should be able to run *any* command line software you want from inside python. Cue mad cackling like some sort of cartoon villain, drunk with power!

Before the mad cackling, let's start simple. How can we run `ls` from inside python?

One general way to do this is with python's `subprocess` module. The module can do a lot, but we'll start by covering a simple approach that should work for many common cases.

The main input to subprocess is a `command`. This can be a list of strings, with each string being one "word" of a command line command. So instead of saying `ls -l ./`, we'll write `command = ['ls','-l','./']`, then pass that `command` variable to subprocess. Here's how that would look:

In [9]:

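#Import the subprocess module
import subprocess

#Define our command as a list of strings
#(so far this is just basic python)
command = ['ls','-l','./']

#Pass the command to subprocess.run, capturing
#the text it prints so we can use it in python
result = subprocess.run(command, capture_output=True, text=True)
print(result.stdout)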





### Running external software in a loop
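
One way to combine a `for` loop with `subprocess` is sketched below. This is an illustration rather than a finished pipeline: the function name `run_on_each_file` is hypothetical, and it assumes the external program takes the file path as its last argument.

```python
import subprocess
from os import listdir
from os.path import join

def run_on_each_file(data_dir, base_command):
    """Run base_command once per file in data_dir.

    Returns a dict mapping each filename to the text
    that the command printed for that file."""
    results = {}
    for filename in sorted(listdir(data_dir)):
        filepath = join(data_dir, filename)
        #Add the file path as the last "word" of the command
        command = base_command + [filepath]
        completed = subprocess.run(command, capture_output=True, text=True)
        #By convention, a non-zero return code signals an error
        if completed.returncode != 0:
            print(f"Command failed for {filename}: {completed.stderr}")
            continue
        results[filename] = completed.stdout
    return results
```

For example, `run_on_each_file("./resources/genomes/", ["ls", "-l"])` would run `ls -l` on each genome file in turn and collect the output for each one.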

### Using a double for loop to compare pairs of genomes
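
A minimal sketch of the double `for` loop idea: the outer loop picks a query genome, the inner loop picks a subject genome, and we build one BLAST command per pair. Several details here are assumptions for illustration: the function name, the use of `blastn` (nucleotide vs. nucleotide search), tabular output (`-outfmt 6`), and the output file naming scheme. The commands are only built, not run, so you can inspect them before passing each one to `subprocess`.

```python
from os.path import join, splitext

def build_pairwise_blast_commands(genome_files, data_dir="./resources/genomes/"):
    """Build one blastn command (as a list of strings) per pair of genomes."""
    commands = []
    for query in genome_files:
        for subject in genome_files:
            #Don't compare a genome against itself
            if query == subject:
                continue
            #Hypothetical naming scheme, e.g. genome1_vs_genome2.tsv
            output = f"{splitext(query)[0]}_vs_{splitext(subject)[0]}.tsv"
            command = ["blastn",
                       "-query", join(data_dir, query),
                       "-subject", join(data_dir, subject),
                       "-outfmt", "6",
                       "-out", output]
            commands.append(command)
    return commands

genome_files = ["genome1.fna", "genome2.fna", "genome3.fna", "genome4.fna"]
commands = build_pairwise_blast_commands(genome_files)
#4 genomes x 3 possible partners each = 12 ordered pairs
print(f"Built {len(commands)} commands, starting with: {commands[0]}")
```

Note that this compares each pair in both directions (genome1 vs. genome2 and genome2 vs. genome1). Whether you need both directions depends on the analysis.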

### Parsing data

## Showing Images
An image **with an `alt` description for screen readers**

<img src="./resources/card_back_tree_canopy-01.png" width="400" alt="A picture of a tree canopy, as seen from below. The canopy is thick with intertwining branches, and blue sky can be seen peeking out from between the backlit leaves.">

## Writing Mathematical Symbols
A simple math symbol (see https://sites.psu.edu/symbolcodes/codehtml/#math and https://www.keynotesupport.com/internet/special-characters-greek-letters-symbols.shtml):

y = &beta;<sub>0</sub>x<sub>0</sub>  + &beta;<sub>1</sub>x<sub>1</sub> + &beta;<sub>2</sub>x<sub>2</sub> + &beta;<sub>3</sub>x<sub>3</sub>

## Jupyter Notebook Tricks and Tips
Split cells with Ctrl-Shift-Minus

[https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/)

## 4-color palette
- Bright green (RGB 146,249,11)
- Orange (RGB 253,173,1)
- Magenta (RGB 255,19,170)
- Deep Blue (RGB 11,79,175)

## Quickly generate tags for all .png images in the resources folder with names like 1_first_diagram.png

In [None]:
from os import listdir

#Keep only .png files whose names start with a number
#followed by an underscore (e.g. 1_first_diagram.png)
resources = [r for r in listdir("./resources/")
             if r.endswith(".png") and "_" in r and r.split("_")[0].isdecimal()]

#Sort by the numeric prefix, so 10_ comes after 9_
resources.sort(key=lambda x: int(x.split("_")[0]))

#Print an img tag for each image, with placeholder alt text to fill in
for f in resources:
    print(f'<img src="./resources/{f}" width="400"  alt="TODO: describe {f}">')

## Exercises

## Reading Responses & Feedback 

^Make this a hyperlink using Google Forms

## Further Reading

In [1]:
import sys
print(sys.maxsize)

9223372036854775807


## References