# Why Python

Python is probably the most widely used all-purpose programming language in science. Although many people will use other languages for specific applications (e.g. R for statistical analysis), the beauty of Python is that it can be used for anything, from analyzing and comparing genome sequences, to training deep neural networks, producing beautiful interactive graphs, creating web apps, or designing an interface to interact with experimental equipment. Python is also increasingly used outside academia; take a look at these Google Trends:

![Trends in the use of programming languages](Media/languages.png)

For the purpose of this class, Python is useful because there are Python **modules** for all the topics we will be discussing (genomics, proteomics, metabolomics, networks, and machine learning). Additionally, Python is intuitive and well designed, so when your intuition tells you that something should work---well, it typically works.



## Jupyter notebooks

Last but not least, Python allows us to work with **notebooks**. The document you are reading (with an `.ipynb` extension) is a Jupyter notebook. In notebooks, content is organized in **cells**, which may contain text or code. For example, the following cell contains a simple piece of Python code, which can be executed by selecting it and pressing `Ctrl+Enter`. Let's try it:

In [None]:
print('Hello world!')

There you go! You may have just run your first Python program! :)

Using the bar at the top of this page you can add, delete, move cells, change cell type to make them code or text, and many other things. You can also use the following shortcuts to do things faster:

![Notebook shortcuts](Media/jupyter-shortcuts.png)

# How to complete the practical work

Here is how this works. During each session in the lab, you are expected to go through the whole notebook for that session, reading the explanations and executing the code in it. We recommend that, **before executing any piece of code**, you stop to think about what result you expect from that code, and then verify that you get what you expected. If you do not get what you expected, stop and think about what happened until you understand what is going on. In any case, **you are encouraged to change any of the example pieces of code, and to create new code cells to experiment and learn**.

From time to time, you will encounter an exercise in the notebook. Some of these may consist of writing some text, while others (probably most of them) consist of writing some code to make sure that you are keeping pace with the lecture. Additionally, at the end of each session you will find a **Final Exercise, which will carry a significant portion of the grade of the session** (see at the end of the notebook).

When you are done with all exercises, save the notebook and upload it (that is the completed `.ipynb` file) to the corresponding session in Moodle.

### Exercise

In the text cell below, type "I understand what I need to do and this is going to be fun!"

*Type your answer here.*

# Basic types and data structures in Python

Like any other programming language, Python uses some basic types and data structures. For example, some variables in Python are integers, others are floats, or strings, or lists...

Unlike in statically typed languages such as `C` or `Java`, however, in Python the types of variables are not fixed and do not need to be declared beforehand. Rather, the type of a variable is defined when a value is assigned to it. Let's see how this works:

In [None]:
# Let's make "v" a variable of type int (integer) 
v = 40
print(type(v))
print('The value of v is', v)

OK, lots of new things going on in this little code snippet. The first line is a **comment**: in Python, anything preceded by the `#` symbol is interpreted as a comment and, therefore, ignored. Adding comments to code is very useful: **ALWAYS comment your code generously!** Comments will also help when we grade your work, as they will help us understand what you meant to do, even if what you did is not entirely correct.

In the second line of code we create a variable `v` and assign to it the value 40, so it automatically becomes an `int`. In the third line we print this type to make sure that it is, indeed, an `int`, and in the fourth we print the value of `v`.

Note that, compared to other programming languages, Python is less structured. For example, as we have seen, you do not need to declare variables and can even change the type of a variable in the course of a program (although this is probably a bad idea). Also, notice that we do not need to specify the end of a line (in `C`, for example, every statement must end with `;`). This is good in some ways but it forces you to be extra-careful with format in others. We will see some examples when we discuss flow control below.
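
As a minimal sketch of the point above, the following snippet assigns values of different types to the same variable, one after another. The name `w` is only illustrative and is not used elsewhere in this notebook.

```python
# "w" is an illustrative name chosen for this sketch
w = 40           # w now holds an int
print(type(w))
w = 40.5         # the same name now holds a float
print(type(w))
w = 'forty'      # and now a str (string)
print(type(w))
```

Python accepts this without complaint, which is precisely why changing the type of a variable halfway through a program can be confusing for whoever reads the code later.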

But, before that, let's take a look at the basic data types in Python. 

## Numbers

As we have seen, Python variables can take numerical values. Numerical values are easy to deal with, and operating with them is as straightforward as one could imagine:

In [None]:
v1 = 45
v2 = 23
# Sum
print(v1 + v2)
# Product
print(v1 * 3)
# Power 2 (square)
print(v2**2)
# Division
print(v1 / v2)
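
One detail that is easy to miss in the cell above is the type of each result. The following sketch, with the illustrative names `a` and `b`, prints both the value and the type of a few operations.

```python
# "a" and "b" are illustrative names chosen for this sketch
a = 6
b = 3
print(a + b, type(a + b))       # adding two ints gives an int
print(a / b, type(a / b))       # "/" gives a float, even when the result is a whole number
print(a + 0.5, type(a + 0.5))   # mixing an int and a float gives a float
```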

## Strings

Besides numbers, Python can also manipulate strings, which can be defined by enclosing any text in single quotes ('...') or in double quotes ("...") with the same result. Let's see some examples:

In [None]:
s1 = "This is one string"
print(s1)

Strings can be concatenated (glued together) with the + operator, and repeated with *:

In [None]:
s2 = 'This is another string'
print(s1 + s2)
print(s1 * 3)

Strings can be indexed, that is, accessed by position, with the first character having index 0. For example:

In [None]:
# Print the 1st, and the 3rd character of the string s1 defined above
print(s1[0])
print(s1[2])

In addition to indexing, *slicing* is also supported. While indexing is used to obtain individual characters, slicing allows you to obtain substrings:

In [None]:
# Print the characters at positions 2 to 6 (position 6 not included) of the string s1 defined above
print(s1[2:6])
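
To make the slicing rule concrete, here is a small sketch on a short string, where positions are easy to count by hand. The variable name `word` is just an illustration.

```python
# "word" is an illustrative name chosen for this sketch
word = 'Python'
# Positions: P=0, y=1, t=2, h=3, o=4, n=5
print(word[2:5])   # characters at positions 2, 3 and 4
print(word[0:2])   # characters at positions 0 and 1
```

In both cases the slice starts at the first position given and stops just before the second one, so the number of characters obtained is the difference between the two positions.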

The built-in function `len()` returns the length of a string, that is, the number of characters in it:

In [None]:
print(len(s1))

A very useful string method is `.split()`, which takes a string and breaks it into pieces wherever it encounters whitespace (for example, a space):

In [None]:
s1.split()

The resulting object is a `list`... but, hold on a second, we have not seen lists yet!

## Lists

Python knows a number of compound data types, used to group together other values. The most versatile is the list, which can be written as a list of comma-separated values (items) between square brackets `[]`. Lists might contain items of different types, but usually the items all have the same type.

In [None]:
squares = [1, 4, 9, 16, 25]
squares

Like strings (and all other built-in sequence types), lists can be *indexed* and *sliced*. By indexing, we mean that an element of a list can be accessed by its position in the list:

In [None]:
print(squares[0])
print(squares[2])

By slicing, we mean that several elements of a list can be accessed together. A slice is a *sublist* of the original list:

In [None]:
# Slice the list from element 0 to element 3 (not included in the slice)
print(squares[0:3])

Lists also support operations like concatenation:

In [None]:
print(squares + [36, 49, 64, 81, 100])

Unlike strings, lists are a mutable type, that is, it is possible to change their content:

In [None]:
cubes = [1, 8, 27, 65, 125]  # something's wrong here: the cube of 4 is 64, not 65!
cubes[3] = 64  # replace the wrong value
print(cubes)
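
For contrast, the sketch below tries the same kind of replacement on a string. The names `word` and `fixed` are illustrative, and the `try`/`except` construct, which is not covered in this notebook, is used only so that the cell keeps running after the error.

```python
# "word" and "fixed" are illustrative names chosen for this sketch
word = 'Pithon'
try:
    word[1] = 'y'   # strings are immutable, so this raises a TypeError
except TypeError as error:
    print('Cannot modify a string in place:', error)

# Instead, we build a new string from pieces of the old one
fixed = word[0] + 'y' + word[2:6]
print(fixed)
```

The original string is left untouched; the corrected text lives in a new string.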

Similar to strings, the built-in function `len()` also applies to lists:

In [None]:
print(len(cubes))

Note that it is possible to nest lists (create lists containing other lists), for example:

In [None]:
l1 = ['a', 'b', 'c']
l2 = [1, 2, 3]
lol = [l1, l2]
print(lol)
print(lol[0])
print(lol[0][1])

### Exercise

Create a list containing the first 6 prime numbers. Then print the slice containing elements 2 to 5 (not included).

In [None]:
# Write your code here

### Exercise

Build a string with your name and surname(s). Then split it and count the number of characters in each of the resulting "words".

In [None]:
# Write your code here

# Flow control in Python

Python mostly uses the usual flow control statements known from other languages. Let's look at them one by one.

## `if` statements

Perhaps the most well-known statement type is the `if` statement. For example:

In [None]:
v = 15

if v < 10:
    print('v is smaller than 10')
elif v <= 20:
    print('v is between 10 and 20')
else:
    print('v is larger than 20')

In this example, Python evaluates the first comparison and if it is true (that is, if `v < 10`) it prints the first output and finishes. If this condition is not fulfilled, then it jumps to the second condition; if this alternative condition is fulfilled, then the program prints the second output and finishes. Finally, if none of the preceding conditions are fulfilled, the program prints the third output and finishes.

Going back to the issue of Python not being a very structured programming language, **note that indenting becomes very important in flow control**. In the case of `if` statements, all lines that are executed if a given condition is fulfilled need to be indented, typically by 4 spaces (although in notebooks you can just press `Tab` to indent a line). This is unlike most other languages. For example, in `C` lines to be executed conditionally need to be enclosed in `{}`. Python does not require that but, if the indentation is not correct, it will raise an error:

In [None]:
v = 15

if v < 10:
print('v is smaller than 10')

Here is a complete list of comparison, logical, identity and membership operators that one can use in `if` and other flow control statements:

![Python comparison, logical, identity and membership operators](Media/operators.png)
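
As a quick sketch of how these operators behave, the cell below evaluates one or two examples from each family, assuming the table covers the standard Python operators. The variables `v` and `pets` are illustrative. Each expression evaluates to either `True` or `False`, which is exactly what an `if` statement tests.

```python
# "v" and "pets" are illustrative names chosen for this sketch
v = 15
pets = ['dog', 'cat']

print(v == 15)               # comparison: equal to
print(v != 15)               # comparison: not equal to
print(v >= 10 and v <= 20)   # logical and: both conditions must hold
print(v < 10 or v > 20)      # logical or: at least one condition must hold
print(not v < 10)            # logical not: negates the condition
print('dog' in pets)         # membership: is the item in the list?
print('fish' not in pets)    # membership: is the item absent from the list?
print(v is None)             # identity: is v the special object None?
```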

## `for` statements

The `for` statement in Python differs a bit from what you may be used to in C or other languages. Rather than always iterating over an arithmetic progression of numbers, or giving the user the ability to define both the iteration step and halting condition (as in C), Python’s `for` statement **iterates over the items of any sequence** (for example, a list or a string), in the order that they appear in the sequence. For example (no pun intended):

In [None]:
# Iterate over elements in a list
for e in ['dog', 'cat', 34, [1, 2]]:
    print(e)

In [None]:
# Iterate over a sequence of integers. Note that it starts at 0!
for i in range(5):
    print(i)

In [None]:
# Iterate over letters in a string
for e in 'Hello':
    print(e)

### Exercise

Consider the following text:

In [None]:
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque a maximus tortor, eu dictum risus. Morbi sit amet ornare purus. Nullam viverra euismod commodo. Sed dapibus sodales ante id lacinia. Maecenas porta diam vel libero consectetur auctor. Cras facilisis laoreet ligula a rutrum. Donec ac molestie felis, in lacinia dolor. Nunc pellentesque odio sit amet pulvinar dictum. Nullam ligula tortor, consectetur pretium ex a, pharetra porta odio. Cras consequat, turpis at aliquam egestas, orci purus fermentum libero, ac aliquam dui nisl vel ex. Ut molestie tempor neque, eu vehicula sem luctus eu. Vestibulum ac tincidunt odio. Phasellus lacus enim, lobortis."

Now, write a piece of code that looks for the words 'sit' and 'Caesar' in this text. In each case, if it finds the word, the program should count and print the number of times the word appears in the text; otherwise, the program should print a sentence indicating that the word was not found:

In [None]:
# Write your code here

## `while` statements

`while` statements are a "mixture" of `for` and `if` statements. As in `if` statements, a condition is evaluated; as in `for` statements, the program loops over the lines inside the statement. In this case, the program loops while the condition evaluates to `True`. Let's see an example:

In [None]:
a = 1
while a < 50:
    print(a)
    a = a * 2

# Functions

So far, we have only dealt with very simple pieces of code. Often, however, we need to write much more complex programs. Then, it becomes absolutely necessary (mandatory in this class!) to encapsulate pieces of code into **functions**, especially if these pieces of code will be used repeatedly. Python functions are, in this sense, similar to functions in `C`, `Java` or most other programming languages. Let's start by defining a simple function that calculates the factorial of an integer n:

In [None]:
def factorial(n):
    result = 1
    for number in range(n):
        result = result * (number + 1)
    return result

It seems that nothing happened, but this code defines a new function `factorial`. From now on, we can use the function `factorial` to calculate the factorial of any number we want. To do this, we just need to **call** the function:

In [None]:
factorial(4)

So what just happened? When we call the function `factorial` with **argument value** 4, Python executes the operations specified in the definition of the function, using the assignment n=4. Inside the function, it creates a variable `result` and assigns to it a value of 1; then it multiplies `result` by each integer in turn, from 1 up to n (4, in this case). After this, Python exits the `for` loop and **returns** the value of `result`, in this case 24. The beauty of functions, of course, is that they can be used as many times as desired without having to repeat the code every time:

In [None]:
print(factorial(10))
print(factorial(20))
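
To see the steps described above as they happen, one option is a variant of the function that prints the running value of `result` inside the loop. The name `factorial_verbose` is a hypothetical helper introduced only for this sketch.

```python
# Hypothetical helper: same computation as factorial, printing each step
def factorial_verbose(n):
    result = 1
    for number in range(n):
        result = result * (number + 1)
        print('after multiplying by', number + 1, 'result is', result)
    return result

print(factorial_verbose(4))
```

Printing intermediate values like this is a simple way to check that a function does what you expect before you start relying on it.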

The factorial function only takes one argument, n. Other functions, however, may need several arguments to work. Consider, for example, the following function `power(x, exponent)`, which calculates $x^{exponent}$: 

In [None]:
def power(x, exponent):
    result = 1
    for repetition in range(exponent):
        result = result * x
    return result
print(power(10, 5))

Now imagine that, most of the time, we are interested in calculating the square of `x`, and only from time to time do we want to use a different exponent. In that case, we could set a **default value for the argument** `exponent`:

In [None]:
def powernew(x, exponent=2):
    result = 1
    for repetition in range(exponent):
        result = result * x
    return result

The new function `powernew` can be called in two different ways. First, it can be called just as the old `power`, that is, specifying the values of `x` and `exponent`, and it will do the exact same thing as `power`:

In [None]:
powernew(10, 5)

Now, additionally, since the function was defined with a default value for `exponent`, it can be called without specifying `exponent`. In this case, it will use the default value `exponent=2`: 

In [None]:
powernew(10)

In Python, arguments with default values always need to come after regular arguments. Other than this, one is free to use as many of them as desired. For functions with many arguments, one may want to specify some values and not others. In those cases, it is better to identify each argument by name, as follows: 

In [None]:
powernew(10, exponent=3)
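
Naming arguments becomes more useful as the number of default values grows. The sketch below uses a hypothetical function, `scaled_power`, with three default arguments (`exponent`, `scale` and `offset` are names made up for this illustration).

```python
# Hypothetical function, defined only to illustrate named arguments
def scaled_power(x, exponent=2, scale=1, offset=0):
    # Computes scale * x**exponent + offset
    return scale * x**exponent + offset

print(scaled_power(3))                         # all defaults are used
print(scaled_power(3, offset=1))               # only offset is changed
print(scaled_power(3, scale=10, exponent=3))   # named arguments can go in any order
```

Without names, changing only `offset` would require spelling out `exponent` and `scale` as well, just to reach the fourth position.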

### Exercise

Define a function that takes a string of text as input and counts the number of words in it. Apply it to the same text as before:

In [None]:
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque a maximus tortor, eu dictum risus. Morbi sit amet ornare purus. Nullam viverra euismod commodo. Sed dapibus sodales ante id lacinia. Maecenas porta diam vel libero consectetur auctor. Cras facilisis laoreet ligula a rutrum. Donec ac molestie felis, in lacinia dolor. Nunc pellentesque odio sit amet pulvinar dictum. Nullam ligula tortor, consectetur pretium ex a, pharetra porta odio. Cras consequat, turpis at aliquam egestas, orci purus fermentum libero, ac aliquam dui nisl vel ex. Ut molestie tempor neque, eu vehicula sem luctus eu. Vestibulum ac tincidunt odio. Phasellus lacus enim, lobortis."

In [None]:
# Write your code here

# Python modules

One of the strongest points of Python is that there is much code already written, for doing all sorts of things from parsing text or making plots, to comparing genomes and implementing machine learning algorithms. These extensions to basic Python functionality are called *modules*. In each of the following practical sessions we will be learning about one particular Python module. Here, we introduce some modules that are generally useful for dealing with data.

## Pandas

Pandas is a Python module that allows you to work with objects that resemble Excel spreadsheets or, if you wish, `R` data frames. Let's start by importing Pandas:

In [None]:
import pandas as pd

This command imports the `pandas` module and renames it as `pd`. From this point on, whenever we want to use a `pandas` method or data structure we just need to write `pd.your_whatever`. Renaming `pandas` to `pd` is not necessary (that is, you can just do `import pandas` and then `pandas.your_whatever` whenever you need to), but renaming to `pd` saves a lot of typing when writing long programs.

The basic Pandas data type is the `DataFrame`, which, again, can be thought of as an Excel spreadsheet or an `R` data frame. A `DataFrame` can be created in many different ways: let's start by creating a `DataFrame` from a dictionary (a collection of key-value pairs, written between curly braces `{}`):

In [None]:
names = ['Harry', 'Hermione', 'Ron', 'Luna']
colleges = ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Ravenclaw']
ages = [15, 15, 15, 14]
genders = ['male', 'female', 'male', 'female']
data_dict = {'name' : names, 'college' : colleges, 'age' : ages, 'gender' : genders}
df = pd.DataFrame(data_dict)
display(df)

There you go. When you create a Pandas `DataFrame` from a dictionary whose values are lists of the same size, dictionary keys become column titles and the lists become the columns themselves. `DataFrame` objects are useful, among other things, because we can easily filter and extract partial information from them. Let's see some examples, from simple to complex: 

In [None]:
# Extract some rows
df.iloc[1:3]

In [None]:
# Extract some columns
df[['name', 'age']]

In [None]:
# Extract rows according to a simple condition
display(df[df['age'] <= 14])
display(df[df['college'] == 'Gryffindor'])
display(df[df['gender'] == 'female'])

Filtering Pandas data frames with more complex searches involving multiple conditions is a bit confusing at first, because the syntax for the comparisons is not the same as in Python in general. For example, in regular Python if you wanted to combine conditions `c1` and `c2` with an `and` operator you would write `c1 and c2`; however, within a Pandas filtering statement, you would need to write `(c1) & (c2)`. Similarly, instead of `or` you need to use `|`. Let's see a couple of examples: 

In [None]:
# Extract rows with more complex conditions
display(df[(df['age'] > 14) & (df['gender'] == 'female')])
display(df[(df['age'] > 14) | (df['gender'] == 'female')])


To learn more about complex filtering conditions, take a look at [this tutorial](https://www.ritchieng.com/pandas-multi-criteria-filtering/).
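
To see why the regular `and` does not work here, the following sketch builds a small table and tries both spellings. The table `pets` and the names `is_dog`, `is_old` and `old_dogs` are illustrative, and the `try`/`except` construct, which is not covered in this notebook, is used only so that the cell keeps running after the error.

```python
import pandas as pd

# Small illustrative table (not the one used elsewhere in this notebook)
pets = pd.DataFrame({'species': ['dog', 'cat', 'dog'], 'age': [3, 7, 9]})

is_dog = pets['species'] == 'dog'   # a column of True/False values, one per row
is_old = pets['age'] > 5

try:
    pets[is_dog and is_old]   # plain "and" cannot combine two columns
except ValueError as error:
    print('Using "and" fails:', error)

# "&" combines the two conditions row by row
old_dogs = pets[is_dog & is_old]
print(old_dogs)
```

The parentheses in `(c1) & (c2)` are needed because Python evaluates `&` before comparisons such as `>` and `==`. Storing each condition in its own variable first, as done here, is another way to avoid the problem.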

### Obtaining column statistics

Once our information is stored in a Pandas `DataFrame` it is really easy to get statistics for its columns. For example:

In [None]:
# Print the mean and the standard deviation of a column
print(df['age'].mean(), df['age'].std())

### Exercise

Try to obtain the mean of the `name` column and explain what happens.

In [None]:
# Write your code here

*Here, explain what happened.*

### Writing and reading files in Pandas

Another convenient feature of Pandas is that it allows you to easily read/write data from/to files, including Excel (or Excel-like) files. Let's first take a look at how to save the `DataFrame` we just created into an Excel file:

In [None]:
df.to_excel('Files/test.xls')

If you now go to the `Files` directory (folder) you will see that the file `test.xls` has been created, and if you open it you will see that the content is as expected.

Of course, we can do the opposite and read an Excel file into a `DataFrame`. Let's now read the Excel file we just created into a new `DataFrame` named `df2`:

In [None]:
# We specify that the row indices are in the first column with index_col=0
df2 = pd.read_excel('Files/test.xls', index_col=0)
display(df2)

### Exercise

See what happens if you read the same file **without specifying** that the first column contains the row indices:

In [None]:
# Write your code here

*Here, explain in words what happened.*

## Matplotlib

[Matplotlib](https://matplotlib.org/) is a powerful Python module for making graphs. Basically, any plot you could ever dream of could be made in Matplotlib (although, of course, some will be very hard). In future sessions we will explore more possibilities of Matplotlib. Here, we just show how to draw the simplest possible XY graph:

In [None]:
# Import the basic functionalities of Matplotlib
import matplotlib.pyplot as plt

# Define the x values for our graph
x = [0, 1, 2, 3, 4]
# Define the y=x*x values
y = []
for value in x:
    y = y + [value * value]
print('x:', x)
print('y:', y)

# Make the plot
plt.plot(x, y)
plt.show()

`.plot()` is a versatile function that takes many parameters. For more details, read the documentation [here](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html).

The `.scatter()` function is similar to `.plot()`, but by default it does not draw the lines between the points:

In [None]:
plt.scatter(x, y)
plt.show()

Interestingly, Matplotlib can plot lists, as we have just seen, but it can also plot, for example, columns of Pandas data frames! Let's try this:

In [None]:
plt.plot(df['name'], df['age'])
plt.show()

### Exercise

Using the same values of x as in the previous plots, use `.plot()` and `.scatter()` to plot y=x\*x\*x.

In [None]:
# Write your code here

# Final exercise (5/10 points)

In this final exercise, you will carry out a preliminary characterization of the [Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data#data.csv). Columns in this data set are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. Each column describes a characteristic of the cell nuclei present in the image. The data are available from the file `Files/breast_cancer_kaggle.xls`.

Use a different code cell for each of the following tasks:

1. Load the Breast Cancer Data into a Pandas DataFrame, and display the DataFrame (or, even better, the first few lines of the DataFrame using the `.head()` method).
2. Calculate (with a program, of course) the number of "malignant" and "benign" cases in the data set.
3. Obtain a list of all columns in the dataset, other than "id" and "diagnosis".
4. Calculate the minimum, the maximum, the mean, and the standard deviation of each of the columns you obtained in 3.
5. Calculate the mean and standard deviation of each column, separately for "malignant" samples and for "benign" samples (that is, filtering the dataframe by the value in the diagnosis column).
6. For each column, plot the values in that column against the diagnosis (for example, the radius_mean against the diagnosis). For **extra credit**, take a look at the [Seaborn module for Python](https://seaborn.pydata.org/) and use `seaborn.boxplot()` or `seaborn.violinplot()` instead of `plt.plot()`.
7. Based on the results of 5 and 6, choose two columns that you think are good candidates to distinguish between benign and malignant breast cancers. Justify your answer. 

In [None]:
# Answer 1

In [None]:
# Answer 2

In [None]:
# Answer 3

In [None]:
# Answer 4

In [None]:
# Answer 5

In [None]:
# Answer 6

*Write answer 7 here.*

# Further reading

* In forthcoming sessions you will be using several Python modules to address specific problems related to genomics, proteomics, and so on. The materials in this first session are designed to get you up to speed with Python but, of course, the more fluent you are with this language the better. Therefore, it is **strongly recommended** that you take a tour of the official [Python Tutorial](https://docs.python.org/3/tutorial/).