# Exploring Python


## Overview of the python programming language

Many programming languages, including Python, the R statistical programming language, C, and Java, are commonly used in bioinformatics research. This project introduces bioinformatics research using python. Python is a general-purpose programming language that was created by Guido van Rossum and first released in 1991. It runs on Windows, macOS, and Linux.

If you are just getting started with coding, python is an excellent language to learn. If you are already experienced with other programming languages, you will find much that is familiar.

So what distinguishes these different languages? One way programming languages are distinguished is by how they fall into certain common categories. Python, for example, is an interpreted, high-level, general-purpose, dynamically-typed programming language. Let's unpack what those terms mean in order to better understand python.

**interpreted**. Python is described as an interpreted language. This means that python itself is a program on your computer. When you write a python script, that script is *simply a text file*. By running the python program on your text file, you can tell it to do all sorts of things - run calculations on a genome, generate a graph, etc. This is in contrast to *compiled* languages like C. In those languages, a compiler converts your code into a runnable program. Generally, compiled languages can run faster than interpreted languages. However, being able to rapidly test each version of your script without an extra compiling step can make developing new programs faster. Folks sometimes refer to this as a tradeoff between development time (time spent writing your code) and run time (time the computer spends actually running your code). In many scientific applications, where the number of users is small and new code must be continuously developed, development time is very long compared to run time, and therefore interpreted languages are widely used. (It should also be noted that there are ways to partially achieve some of the speed advantages of compiled languages - check out e.g. PyPy if interested)

**high-level**. A high-level language automatically handles many of the details of how the code you write is carried out by your computer. For example, when we define a variable in python we don't have to worry about allocating memory to store the contents of that variable - python does it for us. This contrasts with other languages like C where memory allocation must be performed manually. Generally speaking, it is much easier to quickly write a working program in a high-level language. However, low-level languages can provide fine-grained control that can help optimize very computationally-intensive programs.

**dynamically-typed**. In python, a name like x or my_genome can be associated with objects of any type. This contrasts with statically-typed languages like C or Java. In those languages we must decide that the variable x will always be an integer when we create the variable. 
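As a small illustration of dynamic typing, here is a sketch in which the same name is bound first to an integer and later to a string. The variable name and values are made up for the example. The built-in `type()` function reports what type of object a name currently refers to.

```python
# A name can be bound to objects of different types over time.
my_genome = 4600000             # an integer (a made-up genome length)
first_type = type(my_genome)
print(first_type)               # <class 'int'>

my_genome = "ATGCGTTAG"         # now the same name refers to a string
second_type = type(my_genome)
print(second_type)              # <class 'str'>
```

In a statically-typed language, the second assignment would typically be rejected, because the name was already declared as an integer.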

So this is what we mean when we say python is an interpreted, high-level, dynamically-typed language. You may notice that many of these features make python scripts fast (and I think *fun*) to write, but slower for your computer to run than some other languages. This is largely why you will find many bioinformatic programs written in python, but few best-selling video games written in the language.

## Installing python

There are many different versions of python available. This project is written in the Anaconda Python distribution of Python 3.7. This version of python is designed with bioinformatics and other types of data analytics in mind. It comes with many libraries (extensions to the core language) that allow for making custom graphs, using matrices to perform large-scale calculations rapidly, etc. 

A python installer can be found at https://www.anaconda.com/distribution/

The installer can be opened in the standard way to install python (e.g. by double-clicking on it).

During installation, a dialogue box with red text will appear asking if you would like to make this version of python your default, and another asking if you would like to add python to your path. If you have never used python before, I strongly recommend checking both boxes.

(If you are already an experienced programmer / systems administrator, you will be able to make your own determination.)

Once installed, if you checked the box to add python to your path, python can be run from the command line. (See the previous chapter if you need to learn how to use the command line, or need a refresher). 

After installing, open Terminal (Mac) or PowerShell (PC). Then simply type:

python

After hitting enter, you may see something like this:

Python 3.7.2 (default, Dec 29 2018, 00:00:04) 
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda custom (64-bit) on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>>

That indicates that an interactive session has begun! You can now type python code and hit enter, and get either a result, or an error message (don't worry if you get errors - they're just feedback to help fix your code, and we'll talk about how to interpret them down below). 

**Troubleshooting common installation issues**

If you instead get an error message indicating that python wasn't found, you may need to restart your Terminal or PowerShell and try again (this allows the changes the installer made to be detected).

If that doesn't work, you might try reinstalling from scratch, being sure to check the two dialog boxes to make Anaconda Python your default python and to add it to your path.

If that *still* doesn't work, you should get in touch with a computer-savvy friend, instructor, or colleague who can help sort out any issues you might be having. 

Python normally installs easily, but if you have trouble, the most important thing to remember is: 

DON'T PUT THIS OFF OR GIVE UP! 

Getting set up to *start* programming is often the hardest part for folks who are just learning. It's very normal, and a very good idea, to ask for lots of help at the start of the process. Don't worry - it will get much easier to learn on your own as you get further along. 

# Taking python for a spin

Before diving into a more formal discussion of how python is structured, I'd like us to try a few simple examples. You can follow along by typing these into your python interpreter and hitting enter after each line.

Note that spelling, capitalization and spacing matter (python will interpret x as a totally different variable from X).

These will start off very general and non-biological, but by the end of the page we will apply some basics to calculate some common quantities used in biology.

## Hello World!

It is traditional to start learning any new programming language by printing the phrase "Hello World!" to the screen. We can do that in python as follows:

In [2]:
print("Hello World!")

Hello World!


Let's break down what we just did.

###  Calling Python Functions
When we typed print, python interpreted that as a *function*. Functions take input and do things with it. They may in some cases give us back a result. The way they take input is based on what we put inside the parentheses after we call the function.

What about the "Hello World!" part? When we put that text inside quotation marks, we told python to create a new string object. String objects hold text within python. We'll consider them in much more depth shortly.

We can print a number instead of "Hello World!" as follows:

In [3]:
print(42)

42


We used the same function -- print -- but now changed the input we gave it to a number. This caused the number to be printed to the screen. The input that we have given the print function inside its parentheses (42 or "Hello World!") is called a **positional argument** to that function. The whole thing - the function plus any arguments inside parentheses - is called the **function call**. 

print() is just one of many [built in functions](https://docs.python.org/3/library/functions.html) that come with python. 

If we like, we can pass an **expression** rather than just a number or string to python functions. Here's an example:

In [6]:
print(42 + 42)

84


Here 42 + 42 is our expression. Python has first *evaluated* our expression by adding these numbers. The result, 84, was then passed to the print function and shown on screen. 
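The same evaluate-first rule applies when the expression contains another function call. As a quick sketch with made-up inputs, here are three other built-in functions (`len`, `abs`, and `max`). In each case python evaluates the inner call first, then passes the result to print.

```python
# len() counts the characters in a string
print(len("ATGC"))        # 4

# abs() gives the absolute value of a number; the result can be used in further math
print(abs(-42) + 1)       # 43

# max() accepts several positional arguments and returns the largest
print(max(3, 10, 7))      # 10
```

Note that `max(3, 10, 7)` takes three positional arguments, separated by commas.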

We can combine print statements and mathematical expressions to use python as a simple calculator. 

Here are some common mathematical symbols used in these expressions:
```
+ add two numbers
* multiply two numbers
** exponentiate the first number by the second. So x**2 is x squared.
/ divide the first number by the second
// divide the first number by the second, ignoring any remainder (e.g. 3//2 = 1)
```
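The difference between `/` and `//` is easy to miss, so here is a short sketch comparing them (the numbers are arbitrary). One detail worth knowing: in Python 3, `/` always gives back a number with a decimal point, even when the division comes out even.

```python
print(7 / 2)     # 3.5
print(7 // 2)    # 3
print(6 / 2)     # 3.0 (a decimal number, even though 6 divides evenly by 2)
print(-7 // 2)   # -4  (// rounds down, so negative results move away from zero)
```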

Here are some examples:

In [11]:
print("Multiply 2 times 3:")
print(2*3)

print("Divide 100 by 10")
print(100/10)

print("Square 12:")
print(12**2)

print("2 to the 10th power")
print(2**10)

Multiply 2 times 3:
6
Divide 100 by 10
10.0
Square 12:
144
2 to the 10th power
1024


We can combine multiple mathematical operations within the same expression.

In [14]:
print("Raise 2 to the 10th power and add 3")
2**10+3

Raise 2 to the 10th power and add 3


1027


## Order of operations

The last example above raises an important point: when evaluating more complex expressions, the **order of operations** matters. The order of operations in python is similar to that in mathematics. For example, exponentiation is evaluated before multiplication, which is in turn evaluated before addition. We have already seen that the entire expression passed to a function is evaluated before being sent off to the function.
Like in mathematics, we can use parentheses () to change or clarify the order of operations if we need to. Operations within parentheses will be performed before those outside parentheses.

Let's say we really want to add 10+3, and then raise 2 to that power. We could accomplish this as follows:


In [18]:
print("Order of operations matters!")
print(2**(10+3))

Order of operations matters!
8192


Note that this is very different from the number obtained above, because our use of parentheses has caused python to first add 10+3 (getting 13), *then* raise 2 to the 13th power, rather than raising 2 to the 10th power and adding 3.
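To see the two versions side by side, along with a second pair of examples using multiplication and addition, you could try something like the following sketch (the numbers in the second pair are arbitrary):

```python
# Exponentiation happens before addition...
print(2 ** 10 + 3)      # 1027
# ...unless parentheses force the addition to happen first
print(2 ** (10 + 3))    # 8192

# Multiplication happens before addition...
print(2 + 3 * 4)        # 14
# ...unless parentheses say otherwise
print((2 + 3) * 4)      # 20
```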

## Defining and using variables 

In the above examples, we told python exactly what to print to the screen. Of course, that isn't very useful on its own. Usually we will want to write a bunch of code, store some important result in a variable, and *then* do things with that variable like printing it to the screen.

We can assign information to python variables using an equals sign (=), as follows:

In [19]:
x = 7 
y = x + 2
print(y)

9


## Rules for naming python variables

We can call our python variables whatever we want, subject to a couple of rules:

- python variables can only be one word long. That means that no whitespace is allowed in a variable name. If we want to have a longer variable name, python programmers will often use an underscore character ```_``` to represent a space. For example, a variable holding the length of a genome might be called ```genome_size```.

- python variables can be any length, and can include any upper or lowercase letters, numbers, and underscores (```_```). However, variables are not allowed to start with a number.

- care should be taken to avoid calling your variables by names that are already used by [built in](https://docs.python.org/3/library/functions.html) python functions. Names you might want to use, but shouldn't, include ```type``` and ```input```. (You don't need to worry too much about this restriction when you are starting out).
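Here is a short sketch of these rules in action. All of the names and values are invented for illustration. The invalid names are left as comments, since python would stop with a SyntaxError if they were run.

```python
# Valid names: letters, numbers, and underscores, not starting with a number
genome_size = 4600000
sample2_reads = 150

# Names are case-sensitive, so these are two different variables
x = 1
X = 2
print(x, X)    # 1 2

# Invalid names (removing the leading # from either line causes a SyntaxError):
# 2nd_sample = 10     # starts with a number
# genome size = 10    # contains a space
```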

## Suggestions for naming python variables

While there are few *rules* for naming python variables, there are many suggestions that will save headaches down the road. Ideally python variable names should:

- be clear. If you're recording the length of a genome use genome_length rather than ar123_len.
- be as short as possible while still being clear. The variable name genome_length is preferred to length_of_the_genome_I_am_currently_studying_to_understand_its_antibiotic_production_genes.
- explain any numbers used in your script.


### Some simple biological calculations in python

Let's try putting together what we've learned to do some simple biological calculations. In these slightly more complicated examples I will use comments inside the python code to note what each line does. Python comments are anything on a line following a pound sign ```#```. Python simply ignores everything after the # sign.
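As a minimal sketch of how comments behave (the variable name and value are made up), a comment can take up a whole line or follow code on the same line:

```python
# This whole line is a comment, so python ignores it
genome_length = 4600000  # a comment can also follow code on the same line

# print("This line is commented out, so nothing is printed")
print(genome_length)     # 4600000
```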




### Example: One *Big* Ball of Bacteria

A key reason that evolutionary biologists understand natural selection to be inevitable in many real-world situations is that the reproductive rate of many species rapidly outstrips the carrying capacity of the earth. Thus, in most real ecosystems many individuals either die before reproducing or otherwise fail to reproduce. 

In *On the Origin of Species* Charles Darwin wrote:

>“There is no exception to the rule that every organic being increases at so high a rate, that if not destroyed, the earth would soon be covered by the progeny of a single pair. Even slow-breeding man has doubled in twenty-five years, and at this rate, in a few thousand years, there would literally not be standing room for his progeny. Linnaeus has calculated that if an annual plant produced only two seeds – and there is no plant so unproductive as this – and their seedlings next year produced two, and so on, then in twenty years there would be a million plants. The elephant is reckoned to be the slowest breeder of all known animals, and I have taken some pains to estimate its probable minimum rate of natural increase: it will be under the mark to assume that it breeds when thirty years old, and goes on breeding till ninety years old, bringing forth three pairs of young in this interval; if this be so, at the end of the fifth century there would be alive fifteen million elephants, descended from the first pair.” 

(via https://digitalminds2016.wordpress.com/2016/10/06/darwin-and-the-elephants/)
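The figure Darwin attributes to Linnaeus is easy to check with what we have covered so far. This is a sketch under my own simplifying assumptions: we start from a single annual plant, every seed grows into a plant, and each plant is replaced by its two offspring the following year. The variable names are mine, not Darwin's.

```python
starting_plants = 1
seeds_per_plant = 2
years = 20

# The population doubles once per year for 20 years
total_plants = starting_plants * seeds_per_plant ** years
print(total_plants)    # 1048576
```

That is just over a million plants, in line with the quote.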

Let's use the python we've already learned to calculate something similar for bacteria to explore what would happen if there were no constraints on population growth. Then we'll return to Darwin's example of the elephants.

Imagine for a moment that a single bacterium inhabits the earth, and reproduces by division every 2 hours. Let's further imagine (just for a moment) that no bacteria die or fail to reproduce.

If we start with one cell, then after two hours that first cell will split into 2 cells (1\*2).
After 4 hours, those 2 cells will split into 4 total cells (1\*2\*2).
After 6 hours, those 4 cells will split into 8 total cells (1\*2\*2\*2)

As you've probably guessed, this is a case of exponential growth. The number of bacteria around after $g$ 2-hour generations, given starting population $p$, is given by $p\times2^g$. Since generations happen every two hours, this is the same as $p\times2^{h/2}$, if we let $h$ represent the number of hours that have passed.
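Before scaling up, it can be reassuring to check the formula against the small case worked out by hand above. This is a sketch with variable names of my own choosing. After 6 hours we expect 8 cells.

```python
starting_population = 1    # p in the formula
division_time = 2          # hours per generation
hours = 6                  # h in the formula

generations = hours / division_time                     # g = h/2
population = starting_population * 2 ** generations     # p * 2^g
print(population)          # 8.0
```

The answer is printed as 8.0 rather than 8 because `/` always produces a number with a decimal point.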




To start with let's use python to calculate how many bacteria will be present after just two weeks of unchecked growth. We'll have to figure out how many 2-hour generations will pass in the two week span, and then use the exponential growth equation to figure out how many bacteria we end up with.

Let's try writing some code to calculate this with and without informative variable names.

**Without variable names we might write:**

In [39]:
total_bacteria = 1 * 2 **(2*7*24/2)
print(total_bacteria)

3.7414441915671115e+50


This gets us the right answer, and is quite short, but doesn't give a reader a lot of clues about how we got there. In this very simple example we may understand why we're calculating $1*2^{2*7*24/2}$ *right now* but if we come back in several months - or if we give the code to someone else - this may be less clear. The more complex code gets, the more of an issue this becomes.

Therefore it is often better to use short but informative variable names to document what each number you use means. 

Numbers that appear without an informative variable name are sometimes called **magic numbers**. Using lots of magic numbers in your code is considered bad form, and should be avoided.

Below we can modify our code using named variables and comments to make it a bit easier to follow:

In [43]:
#Record the information given in the problem
division_time = 2 #division time in hours
starting_bacteria = 1
offspring_per_division = 2 #Each time a bacterium divides, it makes 2 new bacteria
total_hours = 2 * 7 * 24 #14 days, and 24 hours per day.
total_divisions = total_hours/division_time

#Now we can calculate how many bacteria will be present after two weeks.
total_bacteria = starting_bacteria * offspring_per_division ** total_divisions

print("Total bacteria:",total_bacteria)


Total bacteria: 3.7414441915671115e+50


#### Compare the mass of our bacteria to the mass of the Sun

The mass of the Sun is about 1.989 × 10^33 g.
A single bacterial cell might weigh about 1 picogram, or 1 × 10^-12 g.

Let's use python to compare the mass of our big ball of bacteria to the mass of the sun.



In [47]:
sun_mass = 1.989 * 10**33 
bacterium_mass = 1*10**-12
total_bacteria_mass = total_bacteria * bacterium_mass
print("Our ball of bacteria weighs ",total_bacteria_mass,"g")
n_sun_masses = round(total_bacteria_mass / sun_mass) 
print("This is equivalent to roughly ", n_sun_masses, "Suns!")

Our ball of bacteria weighs  3.7414441915671114e+38 g
This is equivalent to roughly  188107 Suns!


That's ... a lot of bacteria.

In real life, of course, our solar system is not overrun by an ever-growing mound of bacteria. One reason is that resources (space, food, etc) rapidly become limiting when bacteria reproduce. This can either cause some bacteria to die (e.g. due to the conditions created by overcrowding) or fail to reproduce (e.g. because they can't gather enough resources).

So in reality, bacteria may reproduce extremely quickly under favorable conditions. But real conditions rapidly become so unfavorable that most fail to do so. This becomes relevant to evolution when we consider some trait that lets some bacteria survive better in a particular condition. If that trait varies among the bacteria in the population (variation), can be passed down from a parent to its offspring (heritability), and bacteria with the trait succeed at producing more offspring (increased fitness), then that trait will spread through the population by natural selection. However, it is worth keeping in mind that a trait that 


#### Extending the Example

Darwin conducted his example using elephants rather than bacteria. As you might imagine there are some important differences between the two that might influence the calculation. These include:

- sexual reproduction, which requires a male and a female elephant to mate to produce offspring. Since only females can give birth, this implies that the ratio of males to females will influence how fast the population increases. To see why, consider a starting population of 5 females and 1 male (who can mate with more than one female) vs. 5 males and 1 female.
- multiple reproductions per organism. A single elephant can reproduce more than once.
- delayed reproduction. Elephants cannot reproduce until a certain age.
- intrinsic mortality due to aging. We must account for elephants that die due to old age.


Overall, calculating the reproductive output of elephants requires a lot more information than for our bacteria, and although elephants reproducing unchecked would indeed cover the earth quite quickly, Darwin's exact number may only hold for certain narrow assumptions.
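To give a rough feel for the extra bookkeeping involved, here is a sketch of a population split into three age classes. Every number in it is hypothetical and chosen only to keep the arithmetic simple. None of it is real elephant data, and it ignores the male-to-female ratio entirely. It also uses lists and a `for` loop, which we have not covered yet, so don't worry if the details are unfamiliar.

```python
# Hypothetical parameters for three age classes (NOT real elephant data)
fecundity = [0.0, 1.0, 0.5]   # offspring per individual per time step, by age class
survival = [0.5, 0.8]         # fraction surviving from class i into class i + 1

population = [10.0, 10.0, 10.0]   # individuals currently in each age class

for step in range(3):
    # Newborns come from every age class, weighted by that class's fecundity
    newborns = sum(f * n for f, n in zip(fecundity, population))
    # Survivors move up one age class; zip stops after two pairs,
    # so individuals in the oldest class are not carried forward
    older = [s * n for s, n in zip(survival, population)]
    population = [newborns] + older
    print(step + 1, population)
```

Even in this toy version, the result depends on several separate numbers (how many offspring each age class produces, and how many individuals survive to the next class), whereas our bacteria needed only a division time.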

If you're interested in reading a recent academic treatment of this problem using Leslie matrices (which track the survival and reproduction of different ages of elephants), and an extended discussion of the (surprisingly tricky) math behind Darwin's example, this paper may be of interest:

"How Fast Does Darwin's Elephant Population Grow?" Podani *et al.* 2017. Journal of the History of Biology volume 51, pages 259–281(2018). A free preprint is available here: [Podani *et al* 2018](http://real.mtak.hu/72652/1/Podani_etal_DarwinElephants.pdf)



## Further reading

    


- A complete list of the order of operations in python is available [here](https://docs.python.org/3/reference/expressions.html#operator-precedence). It will refer to many operations you haven't encountered yet, but might be useful for reference later on.