# Python for Beginners

This course is intended to be run with [Jupyter Notebook](https://jupyter.org/).

For ease of installation, especially under Mac and Windows, we recommend installing Python 3 using the [Anaconda](https://www.anaconda.com/distribution/) bundle, which will include many commonly-used scientific libraries.


## Basic code structure

A very simple Python program is illustrated below. Note that all the commands start at the beginning of their line, i.e. they have no spaces or indentation before them. Indentation is a very important part of the Python language syntax and must be consistent within each block of code; that is how blocks are defined, as illustrated later. Spaces that are internal to each line, such as those next to the equals sign here, are much less critical; indeed, you could use more spaces or none at all. The program consists of the lines:

In [None]:
mass = 3.4
volume = 1.8
density = mass/volume       # Division operation

What this tells the Python interpreter to do is to assign a value of 3.4 to the label called `mass` and assign 1.8 to `volume`. The `density` is a new value that is calculated from the other values: one divided by the other, as indicated by the `/` symbol. Note that there are comments, from the `#` symbol onwards, which are only for a human to read and will not be interpreted as Python. Each labeled item, which is more correctly called a *variable*, is a way of attaching a name to a piece of data. As the term suggests, the actual value that is being referred to can change while the remainder of the program remains the same, i.e. we could calculate the density in the same way for different masses and volumes.

In [None]:
density = round(density, 3) # 3 : decimal places
print(density)

Next the density is re-assigned to a new value, which is its old value rounded to three decimal places: the old value and the number of decimal places are stated in the parentheses of `round()`. Finally we use the `print()` function, which has the effect of displaying the underlying numeric value of the density on screen.

Note that we are free to choose any name for our variables within certain rules: names can contain only letters, numbers and underscores (`_`), and may not start with a number. The use of `round()` and `print()` are examples of *functions*: named tasks that perform a specific set of operations on input data, specified inside the parentheses, and which often generate some output. Many functions are inbuilt into Python (see [python.org](https://www.python.org) for full documentation), but it is possible to write new functions inside a Python program.

# Simple Data Types

There are several basic data types in Python. The simplest of these include numbers, as already illustrated, values for true or false and a special `None` value:

Data type	|Description	|Example	|Converter
:---|:---|:---|:---
Integer	|Whole numbers	|x = 128|int()
Floating point	|Numbers with decimal points and/or power of ten exponent (scientific notation)	|x = 5e-3<br>y = 12.00 	|float()
Complex	|Numbers with real and imaginary parts	|x = 1.0-2j<br>y = 1j	|complex()
Boolean	|Truth values `True` and `False`	|x = True<br>y = False	|bool()
None	|A special value None for nothingness/undefined	|x = None|
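
As a minimal sketch of the converter functions listed in the table, the following creates one value of each numeric and Boolean type. The variable names `n`, `f`, `c` and `b` are chosen here purely for illustration.

```python
# Create values with the converter functions from the table above
n = int('128')            # text converted to an integer
f = float('5e-3')         # text converted to a floating point number
c = complex(1.0, -2.0)    # built from real and imaginary parts
b = bool(0)               # zero is treated as false

print(n, f, c, b)
print(c.real, c.imag)     # the two parts of the complex number
print(c * 1j)             # complex arithmetic: (1-2j) * j is 2+1j
```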

### Numbers

Integer numbers are specified without any decimal point or exponent and represent the exact whole number. 

In [None]:
a = -7
b = 123
print(a, b)

Floating point numbers represent a number of significant digits and an exponent, though the latter is often implicit. An explicit `e` suffix states which power of ten the digits operate at.

In [None]:
c = 3.1415926536
d = 2.0e7 # Twenty million: two times ten to the seven
print(c)
print(d)

The scientific `e` notation is often optional, and a given number might be written in many different ways, e.g. `0.001`, `1e-3` and `0.1e-2` are the same value. Floating point numbers have limited precision, i.e. they cannot represent all fractions exactly, as one might expect with ⅓ etc. Less obviously, because the underlying representation is binary, even some short decimal values cannot be stored exactly, so some calculations show an error in the least significant digits. This can occasionally cause problems for the programmer if not expected:

In [None]:
print(3.0 * 7.1)  # Not exactly 21.3 !
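
One common way to cope with such small errors is to compare floating point values within a tolerance rather than testing exact equality. The sketch below uses `math.isclose()` from the standard library and `round()`; the variable name `total` is only illustrative.

```python
import math

total = 0.1 + 0.2
print(total)                      # Close to, but not exactly, 0.3
print(total == 0.3)               # False - exact comparison fails
print(math.isclose(total, 0.3))   # True - equal within a small tolerance
print(round(total, 10) == 0.3)    # True - equal after rounding
```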

Calculations with integers often give integer results, though division of two integers gives a floating point result (this was different in Python 2). Calculations involving any floating point number tend to give a floating point result.  

In [None]:
x = 2     # Integer
y = 5     # Integer
z = 3.0   # Floating point

print(x * y) # 10  - Integer
print(x / y) # 0.4 - Floating point due to division
print(x + z) # 5.0 - Floating point

As shown below, data type can be inspected directly using the `type()` function.

In [None]:
t1 = type(1)  # Get a value's data type
t2 = type(2.9)
print(t1) # 'int' - integer number
print(t2) # 'float' - floating point number

In [None]:
print(type(x * y - z))  # float

Objects can be converted between one data type and another with the relevant conversion function, though naturally this can lose information when creating an integer from a float: the fractional part is discarded, e.g. `int(7.3)` gives integer 7.

In [None]:
x = 1.3
y = 3.8
z = 7

print(int(x))    #  Remove non-integer part (truncate toward zero)
print(int(y))
print(float(z))
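
Because `int()` simply discards the fractional part, it behaves differently from rounding down or rounding to the nearest whole number, which is most visible for negative values. A short sketch, using an illustrative variable `v`:

```python
import math

v = -3.8
print(int(v))         # -3 : fractional part discarded (toward zero)
print(math.floor(v))  # -4 : rounded down
print(round(v))       # -4 : rounded to the nearest integer
```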

Each entity in Python is an 'object' of a given type; it is a rich description of the data rather than just the plain underlying value. The object-oriented nature of Python is revealed whenever the dot syntax is used. This accesses values and functions (i.e. named operations run with `()` parentheses) that belong to an object, i.e. as `my_object.value` or `my_object.function()`.

For example, rather than using the usual `+` symbol, here we perform addition of two numbers using `x.__add__()`; the addition function belongs to the integer object we have called `x`. The double underscores are a hint that this is a special inbuilt Python function:

In [None]:
x = 7            
y = x + 5         # Addition
z = x.__add__(2)  # Also, addition
print(y, z)

Similarly, Python functions are also Python objects. For example, here documentation text is accessed for the integer creation/conversion function `int()` using `.__doc__`:

In [None]:
print(int.__doc__) # Print Python documentation for the int() function

## Mathematical operations


Standard arithmetic operations can be done on numeric data with the appropriate symbol: 

In [None]:
x = 5.7
y = 1.2
print(x + y)  # Addition
print(x - y)  # Subtraction
print(x * y)  # Multiplication
print(x / y)  # Division

Similarly using double asterisks `**` means raise to the power:

In [None]:
print(2 ** 3)   # Two cubed
print(3 ** 0.5) # Square root three

The modulus operation, i.e. the remainder after division, is performed using the `%` symbol.

In [None]:
print(19 % 12)     # The remainder after division by 12
print(378 % 360)
print(537 % 2)

When multiple operations are combined, the usual precedence applies: powers are evaluated first, then multiplication and division, then addition and subtraction. Round brackets can be used to specify the operation order explicitly.

In [None]:
x = 2
y = 7
print(x + y * 5)    # Multiplication first
print((x + y) * y)  # Addition first

When performing mathematical operations, all the usual rules apply, e.g.:

In [None]:
print(x / 0.7)  #  No problem
print(x / 0.0)  #  Gives an error: ZeroDivisionError

Interestingly, Python actually does have the concept of (plus and minus) infinity.

In [None]:
x = float('inf')
print(x)
print(x + 7)
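
A brief sketch of how infinite values behave in comparisons and arithmetic; the names `pos` and `neg` are illustrative, and `math.inf` is the standard library's ready-made infinity.

```python
import math

pos = float('inf')
neg = -pos               # Minus infinity, same as float('-inf')
print(neg)
print(pos > 1e308)       # True - larger than any ordinary float
print(neg < -1e308)      # True - smaller than any ordinary float
print(pos == math.inf)   # True - same value as math.inf
print(1 / pos)           # 0.0
```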

The symbols used above are just some of the inbuilt mathematical symbols (operators) in Python. The full complement of these is:

|Operation	|Description	|Example |
|:----------|:--------------|:-------|
|x + y 	|Addition	|`revenue = profit + expenses`|
|x - y 	|Subtraction	|`income = profit - taxes`|
|x * y 	|Multiplication	|`volume = area * height`|
|x / y 	|Division 	|`mean = (x + y + z) / 3.0`|
|x // y |Floor division: divide and round down.	|`a = 13.0 // 5.0`<br>*a is 2.0*|
|x % y 	|Modulus: remainder after division.	|`a = 13 % 5`<br>*a is 3*|
|-x 	|Negate the value of x	|`a = -5 * 3`<br>*a is -15*|
|x ** y |Raise to the power, i.e. x<sup>y</sup>.|`a = 2 ** 3`<br>*a is 8*|

Note that there are also related operators that modify the value of a variable in place. For example, `x *= 3` means x is assigned a new value which is triple its old value: equivalent to `x = x * 3`. Similarly, `x += 3` means add 3 to x and `x -= 3` means subtract 3 from x.

In [None]:
x = 11
x = x * 3  # Set x to be triple the previous value
print(x)   # 33

x *= 3     # Also, set x to be triple the previous value
print(x)   # 99

x += 3     # Add three to x
print(x)   # 102

## <font color=purple>Exercise 1: Taylor series terms</font>

<font color=purple>Calculate an approximation to $e^x$ using the first few terms of the Taylor series:<br></font>

$1 + x + \dfrac{x^{2}}{2} + \dfrac{x^{3}}{6} + \dfrac{x^{4}}{24} + \dfrac{x^{5}}{120} + \dfrac{x^{6}}{720} + \dfrac{x^{7}}{5040}$

<font color=purple><br>Print estimates for $e$, $e^{\sqrt{2}}$ and $e^{i\pi}$ by using values of x=1, x=$\sqrt{2}$ and x=3.14159265$i$ with the above equation. 
</font>

In [None]:
# Exercise code
pi = 3.141592653589793

### Booleans

Boolean values represent truth or falsehood, for example as used in logical operations. The Boolean data type can be created directly by using the special words `True` and `False` or by using the `bool()` function. This treats zeros, empty containers (see below) and the `None` object as false, and everything else as true.

In [None]:
x = True
print(x)
print(bool(0.0)) # False
print(bool(7))   # True

Many `True` and `False` values in Python arise as a result of a comparison operation. For example, when testing if one number is larger or smaller than another:

In [None]:
x = 2
print(x > 5) # False
y = x < 3
print(y)     # True

The Boolean operations `and`, `or` and `not` can be used to combine multiple comparisons. Note, as shown below, that testing equality uses **double equals** signs `==`, which is easily confused with the single sign used for assignment, and that `!=` is the test for not equals.

In [None]:
x = 1
y = -1
print(x > y and y > 1)   # False - second fails
print(x != 2 or y == 1)  # True - first succeeds
print(not (y > x))       # True - not False is True


The complement of general comparison operators in Python is specified in the below table. In Python 3 inequality comparisons can only be made on values of a comparable type (Python 2 is different in this regard; you could compare anything, whether it was meaningful or not).  


Operator|Description|Example
---|---|---
`==`|Tests whether two values are equal.|`x = 3`<br>`x == 3.0`<br>*True*
`!=`|Tests whether two values are not equal.|`x = 3`<br>`x != 3`<br>*False*
`>`|Tests whether the first value is greater than the second.|`x = 2**10`<br>`x > 1024`<br>*False; <br>2<sup>10</sup> is equal, not more*
`<`|Tests whether the first value is smaller than the second.|`x = 2**10`<br>`x < 1025`<br>*True*
`>=`|Tests whether the first value is greater than or equal to the second.|`x = 2**10`<br>`x >= 1024`<br>*True*
`<=`|Tests whether the first value is less than or equal to the second.|`x = 2**10`<br>`x <= 512`<br>*False*
`is`|Tests whether two values represent the same Python object.|`3 == 3.0`<br>*True*<br>`3 is 3.0`<br>*False*
`is not`|Tests whether two values represent different Python objects.|`3 is not 3.0`<br>*True*

The keyword operators `is` and `is not` are of note because they compare whether two items are the same Python object, not whether their values are the same. As we illustrate in the table, an integer and floating point number can be equal in **value** but they are two different **objects**, with different data types. The `is` comparison is often used with `None`, to detect whether something has been defined, regardless of whether its value is zero or otherwise logically false.
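
The following sketch shows why `is None` is used rather than a truth test: zero and `None` are both logically false, but only one of them is undefined. The names `count` and `missing` are illustrative.

```python
count = 0          # A defined value that happens to be zero
missing = None     # Nothing defined

print(bool(count), bool(missing))   # False False - both logically false
print(count is None)                # False - zero is still a value
print(missing is None)              # True
print(count is not None)            # True
```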

## <font color=purple>Exercise 2: XOR</font>

<font color=purple>
Write a combination of Python logical operations for two input variables that are equivalent to a single *exclusive or* (XOR) operation. The truth table for XOR is:</font>
    
In A| In B|Out
---|---|---
False|False|False
True|False|True
False|True|True
True|True|False

<font color=purple>Test your code on various inputs.
</font>

In [None]:
# Exercise code

### The None object

The `None` object is a special built-in value which can be thought of as representing **nothingness** or that something is **undefined**. For example, it can be used to indicate that a variable exists, but has not yet been set to anything specific.

In [None]:
z = None
print(z)

### Text strings

Text in Python is represented by the string data type, `str` (i.e. a string of characters), and can be specified in code using single or double quotes, which distinguish the characters from unquoted parts of Python.

In [None]:
x = 'Hello'
y = "world"
print(x,y)  # Print both values to screen

There is also a triple-quote syntax which allows text strings to flow over multiple lines:

In [None]:
x = '''This method lets text flow from one line
to the next line inside triple quotes and is
very handy for adding documentation.
'''
print(x)       

Text strings can be created from any value with `str()`, which is the same conversion that `print()` applies before display. Where appropriate, strings may also be converted to other data types:

In [None]:
x = str(1.23)             # x is text '1.23'
y = int('     007    ' )  # y is the integer 7
print(x, type(x))
print(y, type(y))

Some symbolic operators work with strings, though naturally in a textual way, even if the characters happen to represent digits:

In [None]:
x = '1' + '99'  # concatenation
y = 'abc' * 3   # repeat concatenation
print(x, y)

Many operations are performed on strings using inbuilt functions that belong to the Python object, as accessed using the dot notation, noting that the original data remains unaffected:

In [None]:
x = 'abc'      # x is a string object
y = x.upper()  # x not changed 
print(x, y)

Strings can be considered to be arrays of characters, and accordingly have a length which can be found with the inbuilt `len()` function. 

In [None]:
x = 'abcde'
print(len(x)) 

Some elements of textual data are not printable as normal character glyphs, such as new lines or tab stops. These are represented in Python strings using special escape codes that start with `\`, for example `\n` means new line:

In [None]:
print('Hello\nworld') # \n means new line

The most commonly used character escape codes are listed in the following table. It is notable that Python has a concept of raw strings where these codes are not used: raw strings have an `r` before their quotes, e.g. `r'Hello\nworld'` has actual `\` and `n` characters in the middle, not a new line.

Code|Description|Example|Notes
---|---|---|---
`\\`|A backslash character, which must be escaped when the following character would otherwise form an escape code.|`text = '\\title'`|Text is `\title` and does not have a tab code (`\t`) inside.
`\'`|A single quote, which may need to be escaped in situations where it should not be considered as the start or end of a string.|`text = 'Don\'t do that!'`|Not required when a string is defined with double quotes.
`\"`|A double quote, which may need to be escaped in situations where it should not be considered as the start or end of a string.|`text = "Shout \"Help!\" loudly."`|Not required when a string is defined with single quotes.
`\n`|A newline (linefeed) control character. Used to separate lines of text on UNIX and Linux based computers.|`text = 'Line A\nLine B\n'`|Text is split into two lines on Linux and UNIX machines.
`\r`|A carriage return control character. Used in combination with `\n` on Windows based systems to separate lines of text.|`text = "Line A\r\nLine B\r\n"`|Text is split into two lines on Windows machines.
`\t`|A tab character, providing indentation with white space to pre-set stop points.|`text = 'Col 1\tCol 2\tCol 3\n'`|Tabs indent to form three columns.
`\u····`|Specifies a Unicode character using a 16-bit (four-digit) hexadecimal value.|`text = u'\u03b1-helix'`|Creates 'α-helix', e.g. for graphical displays.
`\x··`|Specifies a character using a two-digit hexadecimal value.|`text = '\x48\x65\x6C\x6C\x6f'`|Text is 'Hello'.
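
To see the difference between a normal and a raw string, compare their lengths. In this sketch the path-like text is a made-up example: in the normal string `\n` and `\t` each become a single control character, while the raw string keeps every backslash.

```python
normal = 'C:\new\table'    # \n and \t are interpreted as escape codes
raw = r'C:\new\table'      # Raw string: backslashes kept as written

print(len(normal))             # 10 characters
print(len(raw))                # 12 characters
print(raw == 'C:\\new\\table') # True - same as escaping each backslash
print(raw)
```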

## <font color=purple>Exercise 3: Text escape codes</font>

<font color=purple>
Create a single string variable that represents a table of one- and three-letter DNA codes (i.e. `T` - `Thy`, `A` - `Ade` etc.) so that each base goes on a different row. Separate the elements of each pair with a tab stop and each row with a new line. Finally print the string variable.
</font>

In [None]:
# Exercise code

### String methods

There are many inbuilt functions (methods) that are associated with strings, as listed in the full [Python documentation](https://docs.python.org/3/library/stdtypes.html#string-methods), and many of these will be used in later examples. A simple list of all the inbuilt values and functions for strings (and indeed any kind of Python object) can be accessed using `dir()`:

In [None]:
x = 'ABC'
print(dir(x))

This reveals the `.lower` function, amongst many, which as you might guess creates a lowercase version of a string. We can get a description of this using `help()`:

In [None]:
help(x.lower)   # help() prints the description itself

In [None]:
print(x.lower())

There are many other handy inbuilt string functions, such as:

In [None]:
x = '  Hello\n'
y = x.strip()        # Remove edge whitespace characters
print(y)
print(x.count('l'))  # Count characters/sub-string

A very useful string function, especially when reading data from file, is `split()`, which chops a string on a sub-string (or on whitespace by default).

In [None]:
x = 'Val Gly Lys'
y = x.split()  # Make a list of separate words
z = x.split('Gly')
print(y)
print(z)

One of the most important of these functions is `format()`, which allows variable values to be inserted inside a text string. The values are inserted into the string at the `{}` positions, replacing the bracket section.

In [None]:
name = 'Lisa'
t = 'Hello {}.'.format(name)    # name inserted into brackets
print(t)

In [None]:
name1 = 'Homer'
name2 = 'Marge'
t = 'Hello {} and {}.'.format(name1, name2)    # inserting two names
print(t)

The brackets can optionally have a number, specifying which item to insert where, and also a format specification after `:` that specifies how to represent the item. For example here the names are padded to 12 characters width using spaces:

In [None]:
t = 'Hello "{:12}" and "{:>12}".'.format(name1, name2)    # Padded to 12 characters wide, left and right (using >)
print(t)
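
The optional position numbers mentioned above can be sketched as follows: each number refers to an argument of `format()` by its index, so items can be reordered or repeated. The variable names here are illustrative.

```python
first = 'Homer'
second = 'Marge'

t = 'Hello {1} and {0}.'.format(first, second)   # Items swapped
print(t)                                         # Hello Marge and Homer.

u = '{0}, {0}, {1}!'.format(first, second)       # Item 0 used twice
print(u)                                         # Homer, Homer, Marge!

v = '"{0:>8}"'.format(first)                     # Number plus format code
print(v)                                         # "   Homer"
```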

There are many special formatting codes, as described in the [`format()` documentation](https://docs.python.org/3.4/library/string.html#formatspec).
This system is especially useful for specifying decimal places and scientific/exponent formatting for numbers.
In the next example the first number is formatted as floating-point with five decimal places using the code `.5f` and the second number is in exponent format using the code `.2e`.

In [None]:
x = 0.12
y = 34121.0
s = 'X {:.5f} Y {:.2e}'.format(x, y)   # 5 dp float, 2 dp sci
print(s)

It is very common to combine both an overall character width and a number of decimal places. Here two numbers are formatted to be nine characters wide and use four decimal places, with code `9.4f`:

In [None]:
x = 2 ** 10
y = 2 ** 0.5
t = 'A;{:9.4f} B;{:9.4f}'.format(x, y)   #  Pad to 9 characters wide and use 4 d.p.
print(t) 

## <font color=purple>Exercise 4: Formatted text</font>

<font color=purple>Create and print a single, formatted text string indicating the masses of the Earth ($5.97237×10^{24} kg$) and the Moon ($7.342×10^{22} kg$) next to the ratio of their masses (Earth/Moon). Display the masses in yottagrams ($10^{21}$ kg), padded to a field-width of eight characters and to two decimal places. Format the ratio to 5 decimal places.</font>

In [None]:
# Exercise code

### Indices and sub-strings

Text strings can be thought of as lists of individual characters. As such, each character has a position within the string. This position is accessed using square brackets and an index number. It is notable that these index numbers **start at zero** (as in many other programming languages).

In [None]:
x = 'abcde'
print(x[0])  # First character
print(x[2])  # Third character

Negative indices count back from the end of the string.

In [None]:
print(x[-1]) # Last character
print(x[-2]) # Second last character

Internal substrings can also be accessed using square brackets, by specifying an appropriate index range using a `start:limit` notation. Note that the limit index is **not included** in the range.

In [None]:
x = 'abcdefghijklmnopqrstuvwxyz'
print(x[1:3])   # index 1, up to, but not including 3

In [None]:
print(x[2:-2])  # Omit the first and last two characters

Unspecified range indices default to the start and end of the string.

In [None]:
print(x[2:])   # From index 2 to end
print(x[:7])   # From start to < index 7
print(x[:])    # All characters (a copy)

A step size may also be included in the index range, specified after the start and limit using a second `:`. For example, the following starts at `b`, goes to the end of the string, and selects every other character (i.e. step = 2):

In [None]:
print(x[1::2]) 
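
For completeness, here is a small sketch that uses all three parts of the `start:limit:step` notation together; the variable name `letters` is illustrative.

```python
letters = 'abcdefghijklmnopqrstuvwxyz'

print(letters[0:10:3])   # adgj   - indices 0, 3, 6, 9
print(letters[::5])      # afkpuz - every fifth character from the start
```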

### Membership

Given that text strings are sequences of characters, they can be tested for the presence or absence (i.e. membership) of characters using the `in` keyword:

In [None]:
x = 'Bananarama'
print('a' in x)
print('b' in x)

This can also be used to test the presence of sub-strings:

In [None]:
print('Banana' in x)

The `index()` method of strings is handy to find where a character, or sub-string, is first found:

In [None]:
print(x.index('r'))
print(x.index('an'))   #  Smallest index

# Collections

Collections are Python objects that can contain other Python objects; they are containers for organizing data. For example, a list of various objects is specified with square brackets:

In [None]:
x = True
y = ['GCAT', 4.9, x] # a list of 3 items, which includes the value of x
z = []               # an empty list
print(y)

In [None]:
x = False
print(y)    # Unaltered

Containers can contain the simple data types already discussed and also other containers (within certain rules) and they can also be empty. The basic inbuilt types of container in Python are:

Type|Description|Example|Converter
----|-----------|-------|---------
List|A modifiable, ordered list of items|`x = ['cat', 'dog', 'pig']`<br>*A list of three strings*|`list()`
Tuple|An unmodifiable, ordered list of items|`x = (0.742, 0.159)`<br>*A tuple of two floats*|`tuple()`
Set|A modifiable, unordered collection of unique items|`x = {9, 1, 7, 2, 5}`<br>*A set of five numbers*|`set()`
Frozenset|An unmodifiable, unordered collection of unique items|`x = frozenset([1,2,3])`|`frozenset()`
Dictionary|A modifiable collection of items accessed by unique keys (kept in insertion order since Python 3.7)|`x = {'A':1, 'C':3, 'E':5}`<br>*A dictionary with three key:value pairs*|`dict()`

### Lists

Lists are specified with square brackets and can be accessed by numeric indices and ranges, like strings, with the first item being at index `0`.

In [None]:
x = ['a', 'b', 'c', 'd']
print(x[2])   # index 2 : third position
print(x[1:])  # from index 1 to the end

The value of a specific index or range can also be set, and items in the list can be removed.

In [None]:
x[0] = 'z'    # index 0 is set to 'z'
print(x)
del x[2]      # deletes item at index 2
print(x)      
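
Setting a whole range works in the same way as setting a single index, and the replacement need not be the same length as the range it replaces. A brief sketch with an illustrative list called `items`:

```python
items = ['a', 'b', 'c', 'd', 'e']

items[1:3] = ['X', 'Y']   # Replace indices 1 and 2
print(items)              # ['a', 'X', 'Y', 'd', 'e']

items[1:3] = ['Q']        # Replace two items with one; list shrinks
print(items)              # ['a', 'Q', 'd', 'e']

del items[:2]             # Delete a range of items
print(items)              # ['d', 'e']
```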

Lists can be added together (concatenated) using the `+` operation:

In [None]:
x = [1,4,9,16]
y = [25]       # A list with one item
z = x + y
print(z)

However, you cannot add non-list items in this manner:

In [None]:
z = x + 9   # Gives an error
print(z)

In [None]:
z = x + [9] # Works: added a list with one item
print(z)

A list can contain internal sub-lists, the elements of which can be accessed using multiple bracketed indices:

In [None]:
y = [[4,7], [9,6]] # A list containing two other lists
print(y[1])        # sub-list at y index 1
print(y[1][0])     # item 0 from sub-list 1  

Collections can be made from other collections using the appropriate creation function, e.g. `list()` to make a list and `set()` to make a set (see below). More generally, however, they can be created from Python objects that are iterable: things that can generate a sequence of items. For example, as shown below, `range()` generates a sequence of integer numbers up to a specified limit (a start number and step could also be specified), which is then used to make a list.

In [None]:
x = list('PQRST')     # Create list of characters from text string
print(x)

In [None]:
y = list(range(10))   # A list of the range from 0 up to <10 
print(y)
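
The optional start and step values of `range()` can be sketched as below. As with string indices, the limit itself is never included; the variable names are illustrative.

```python
a = list(range(3, 8))        # Start at 3, up to <8
b = list(range(0, 20, 5))    # Start, limit and a step of 5
c = list(range(10, 0, -2))   # Negative step counts downwards
t = tuple(range(3))          # Other collections work the same way

print(a)   # [3, 4, 5, 6, 7]
print(b)   # [0, 5, 10, 15]
print(c)   # [10, 8, 6, 4, 2]
print(t)   # (0, 1, 2)
```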

Collections can be unpacked into individual variables if the number of items being assigned matches the collection size:

In [None]:
x = [4, 1, 0]  # A list containing three items 
a, b, c = x    # Assign to each sub-element
print(a)
print(b)
print(c)

As with all the main Python collections, lists have a size, accessible with `len()`, and membership tests are performed using the `in` operator.

In [None]:
x = ['a', 'b', 'c', 'd']
print(len(x))     # number of items in list
print('c' in x)   # string 'c' is in the list
print(2 in x)     # number 2 is not in list

### Tuples

Tuples, specified with round brackets, are similar to lists and have indexed items etc. 

In [None]:
x = (9, 3, 1, 0)  # Create a tuple
y = (2,)          # Tuple with one item (needs trailing comma)
z = ()            # Empty tuple
print(x)          # Print whole tuple
print(x[-1])      # Print the last item

However, they cannot be modified: their contents are defined when they are created. This is useful because tuples can be used as keys in a dictionary (see below) while lists cannot.

In [None]:
x[0] = 6          # Does not work! Tuples cannot be changed!

Although tuples cannot be modified, they are easily converted to lists if needed:

In [None]:
w = list(x)       # Create equivalent list from a tuple
w[0] = 6          # List copy can be changed
print(w)

## <font color=purple>Exercise 5: Splitting and joining lists</font>

<font color=purple>Convert the string of one-letter amino acid codes given below into a list containing every third letter. Use the `string.join(list)` function to combine the list's characters with commas into a single string.</font>

In [None]:
# Exercise code
a = 'MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK'

### Sets

Sets are specified with curly brackets. They do not contain repeated values and the items do not have an order, and so cannot be accessed by index.

In [None]:
x = {1,2,3,4,3,2,1}  # duplicates ignored
y = set([3,4,5,6])   # created from a list
print(y)             # print a whole set
print(len(x))        # size of the set

Sets have helpful associated set operations (intersection, union, difference, disjoint test, subset test etc.), and allow for membership tests.

In [None]:
print(1 in y)        # 1 is not in y
print(x & y)         # intersection, items in both
print(x | y)         # union, items in either
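
The other set operations mentioned above can be sketched in the same style. The sets `a` and `b` are illustrative; the operators and the `isdisjoint()` method are inbuilt.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

print(a - b)                  # {1, 2} - difference, in a but not b
print(a ^ b)                  # Symmetric difference, in one set only
print({1, 2} <= a)            # True  - subset test
print(a.isdisjoint({7, 8}))   # True  - no items in common
print(a.isdisjoint(b))        # False - 3 and 4 are shared
```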

### Dictionaries

Dictionaries are like lookup tables containing `key:value` pairs. They are also specified with curly brackets but can be distinguished from sets by the `:` joining the pairs. The values in a dictionary are accessed by key, not by index, and each key is used only once.

In [None]:
d = {"G":329.21, "C":289.18, "A":313.21, "T":314.19}
print(d['A'])      # value associated with 'A'

Dictionary keys are typically strings or numbers, but more generally they are restricted to items that are not modifiable (so you cannot use lists, sets or other dictionaries as keys). The values being referred to can be any data type.
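
Because tuples are unmodifiable, they can serve as dictionary keys, which is handy for things like coordinates. This is a minimal sketch; the `grid` dictionary and its contents are invented for illustration.

```python
grid = {(0, 0): 'origin', (1, 0): 'east'}   # Tuples as keys
print(grid[(1, 0)])                         # east

grid[(0, 1)] = 'north'                      # Add another tuple key
print(len(grid))                            # 3

# grid[[1, 1]] = 'x'   would give an error: lists cannot be keys
```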

The length of a dictionary is the number of key-value pairs.

In [None]:
print(len(d))      # number of key:value pairs

Dictionaries have various inbuilt functions to access their component parts.

In [None]:
keys = d.keys()     # Just keys
values = d.values() # Just values
items = d.items()   # Pairs
print(keys)
print(items)

These give special iterable objects, which may be converted to lists etc. as required: 

In [None]:
print(list(keys))  # A list of keys

If a key is already present in the dictionary then a simple assignment of the form `dict[key] = value` is used to change the value associated with that key.

In [None]:
d = {"X":99, "Y":121, "Z":14}
print('Original:', d)
d['Y'] = 201     # Change the value of an existing item
print('Updated:', d)

However, if a key was not already present this kind of assignment will add a new key:value pair. Existing keys cannot be changed directly but it is possible to remove a key:value pair using `del` and add the same value back again with a different key.

In [None]:
d['W'] = 256    # Add a new key:value pair
print('Bigger:', d)
del d['Z']        # Delete a key and its value from the dictionary
print('Smaller:', d)
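
The key replacement described above might look like the following sketch, where the dictionary `masses` and its keys are illustrative. The inbuilt `pop()` method offers a shorter alternative, as it removes a key and returns its value in one step.

```python
masses = {'X': 99, 'Y': 121}

value = masses['Y']     # Remember the value
del masses['Y']         # Remove the old key:value pair
masses['Q'] = value     # Add the value back under a new key
print(masses)           # {'X': 99, 'Q': 121}

masses['R'] = masses.pop('Q')   # Same idea in a single line
print(masses)                   # {'X': 99, 'R': 121}
```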

### Empty collections

As already hinted at above, empty collections (with size zero) are specified using the correct kind of brackets with no contents, or using the conversion function with no inputs. However, a slight complication here is that the curly brackets `{}` make an empty dictionary, not a set. Although both these collections use the same kind of brackets, dictionaries were introduced into Python first.

In [None]:
a = []
b = ()
c = {}
d = dict()
e = set()

print(len(e))
print(bool(a))   # Logically false
print(type(a), type(b), type(c), type(d), type(e))

### Collection operations

The collection types have a number of inbuilt functions (methods) that are accessed with the dot syntax, some of which are described below for lists and sets. Naturally the functions available to a given collection are appropriate to its type, e.g. sets do not have functions that refer to positional indices. All of the inbuilt functions and attributes of a Python object may be listed with `dir()`, noting that those with `__` around the name are generally only for internal use. For example to see what is inbuilt for lists:

In [None]:
print(dir([]))   #  Any list

The `append()` and `extend()` methods are commonly used to add items to the **end** of lists:

In [None]:
x = ['Mon', 'Tue', 'Wed'] # A list of strings
y = ['Fri', 'Sat', 'Sun'] # And another
print(x)
x.append('Thu')    # Add a single new item to end
x.extend(y)        # Extend with items from another collection
print(x)

The `insert()` method is used to add items at any position:

In [None]:
x.insert(0, 'Sun') # Insert an item at an index
print(x)

The position (first occurrence) of an item can be found using `index()` and the number of occurrences found with `count()`:

In [None]:
print(x.index('Sat'))     # Positional index of an item
print(x.count('Sun'))     # Number of occurrences of an item

Items can be removed (first occurrence) with `remove()`:

In [None]:
x.remove('Sun')    # Remove an item
print(x)

Another handy function is to sort a list internally:

In [None]:
x.sort()           # Sort contents alphabetically
print(x)

Similarly, sets can have their contents changed. Note here that there is no positional (index) information.

In [None]:
s = {'G', 'C', 'A', 'T'}   # A set with 4 strings
t = {'N', 'R', 'Y'}
s.add('U')          # Add a single item (if not present) 
print(s)
s.update(t)         # Add any new items from another collection
print(s)
s.remove('N')
print(s)

## <font color=purple>Exercise 6: Dictionary construction</font>

<font color=purple>Construct a dictionary containing counts for the different DNA letters in the sequence given below. Use the one-letter code as the dictionary key, and the count as the corresponding value. Using the dictionary, calculate the percentage G+C content of the sequence.</font>

In [None]:
# Exercise code
seq = """ATTAATTAATTCTGAGAGCTGCTGAGTTGTGTTTACTGAGAGATTGTGTATCTGCGAGAGAAGTCTGTAGCAAGTAGCTAGACTGTGCTTGACCTAGGAACATATACAGTAGATTGCTAAAATGTCTCACTTGGGGAATTTTAGACTAAACAGTAGAGCATGTATAAAAATACTCTAGTC"""

# Control code

Lines of Python code are generally executed in sequential order, one after the other. However, there are situations where we wish to deviate from this, for example to repeat a section of code several times in a loop, or to only execute a block of code under certain conditions. Accordingly, there is a set of keyword commands including `if`, `for`, `while`, `try` and `def` that are used to control the execution of a subsequent code block, which is indented relative to the keyword and comes **after a colon**.

## Conditional code

The `if` statement is used to only perform operations if a particular condition is met (if the value of an expression is logically true).

In [None]:
x = 3
if x < -1 or x > 1:    # Run indented lines when true. ** Note colon at end ** 
    x *= 2             # Start of indented lines   
    print('Value was doubled')  

print('Value is:', x)  # Always executed, not in indented block

It should be noted that any number of spaces can be used for indentation of the controlled block, but it must be consistent. Four spaces are generally recommended, though sometimes two may be used in cramped situations.

The `if` statement can also have a number of `elif` clauses and a final `else` clause, each with their own block: `elif` does further conditional tests if all preceding ones failed, and `else` marks the block that is run if no condition is met.

In [None]:
if x > 0:
    print('Positive')
elif x < 0:            # Checked if first condition was false
    print('Negative')
else:                  # If all fails
    print('Zero')

When testing whether the expression after an `if` statement is true it is obvious what happens in situations like `x < 10` or `y == 5`, given that these comparisons generate `True` or `False` (Boolean) objects. However, any object in Python can be tested for truth. A few specific objects are deemed to be logically false: `False`, `None`, `0`, `0.0`, the empty string `''` and empty collections, while almost everything else is deemed to be true.

In [None]:
x = 'abc'
if x:              # Test innate truth of x
    print('true')  # This prints; any non-empty string is true
else:
    print('false')
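The same rules can be checked directly with the inbuilt `bool()` function, which converts any object to `True` or `False`. This is only a quick sketch; the particular values tested are arbitrary choices.

```python
# Objects that are deemed logically false
falsy = [bool(0), bool(0.0), bool(''), bool([]), bool(None)]

# Non-zero numbers and non-empty strings or collections are true,
# even when their contents look "false"
truthy = [bool(-1), bool(' '), bool([0]), bool('False')]

print(falsy)    # Every entry is False
print(truthy)   # Every entry is True
```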

## Loops

Repetitive loops can be created with a `for` statement or a `while` statement. The `for` loop extracts items from a collection (or other iterable object, like a string of characters) and assigns a loop variable to each value in turn, repeating execution of the indented block of code each time. Here a list of numbers `data` is defined and `x` is repeatedly assigned to the value of each item in the list before printing.

In [None]:
data = [1,4,9,25,36]
for x in data:        # x is first 1, then 4, then 9 etc.
    print(x)
    

The example is next modified so that the value of `x` is added to a running total every cycle. This total is initially defined as zero before the loop:

In [None]:
total = 0             # Starting value
data = [1,4,9,25,36]
for x in data:        
    total += x        # Add current value of x to total
    print(x, total)   # Current values in this cycle
    
print('Final:', total)    

It is often convenient to use the `enumerate()` function with a `for` loop. This allows the loop to iterate over both item numbers (usually the positional indices) and their actual values. 

In [None]:
text = 'AGCAGTAGACGAACAT'     # String of characters
for pair in enumerate(text):  # (index, item) pairs
    print(pair)               

As illustrated below, it is common for the number and value (`i` and `x` respectively) to be stated as separate variables in the `for` line, using the unpacking syntax illustrated previously:

In [None]:
text = 'AGCAGTAGACGAACAT'     # String of characters
for i, x in enumerate(text):  # separate index and item
    print(i, x)               
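When the numbering should not begin at zero, for instance to report sequence positions counted from one, `enumerate()` accepts an optional `start` value. A brief sketch, where the variable names `pos` and `base` are just illustrative:

```python
text = 'AGCAGT'                               # Short string of characters
for pos, base in enumerate(text, start=1):    # Count from 1 rather than 0
    print(pos, base)
```

Note that `start` only changes the numbers that are generated; the items themselves are still taken from the beginning of the string.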

A `while` loop repeats a block of code while a certain condition evaluates to be true, and so it is important to make sure that the condition is eventually false (on the command-line Ctrl+C keys can be used to stop an ‘infinite’ loop).

In [None]:
x = 1
while x < 1000:   # Repeat the indented block while this is true
    print(x) 
    x *= 2        # Double the value

print('Final:', x)  # final value stopped the loop: not less than 1000

## <font color=purple>Exercise 7: Taylor series reprised</font>

<font color=purple>Using a `for` loop, repeat Exercise 1, summing the Taylor series of $e^x$, i.e:

$1 + x + x^2/2! + x^3/3! + x^4/4! + x^5/5! + x^6/6! + x^7/7!$ + ...

Print estimates for $e$ and $e^2$, do this for both eight and 16 terms. Hint: use the previous factorial to make a new factorial each time.
</font>

In [None]:
# Exercise code


Loops can appear inside other loops, as long as each uses a different iterating variable. Here the inner loop is indented relative to the first, hence the inner-most `print()` is indented twice, relative to the start.

In [None]:
for i in range(3):       # i is first 0, then 1, then 2
    for j in range(3):   # for each value of i, j is 0, then 1, then 2
        print(i, j, i*j)

Loops (both `while` and `for`) can be skipped, for the remainder of their block, using `continue` and stopped entirely with `break`. In this example an inner `if` statement is used to trigger these in specific circumstances. Note the `print()` is in the block of the `for` statement and so is indented once.

In [None]:
data = [3, -1, 2, -5, 0, 9, -2]
for x in data:  
    if x < 0:
        continue      # Skip the remainder of 'for' loop
    elif x == 0:
        break         # Quit entirely

    print(x)          # Only prints positive values before zero

There is another kind of loop that does not have an indented syntax. It is not a general purpose loop like the ones described above, rather it is a means of constructing a collection (list, set, dictionary). In essence, a kind of `for` loop is specified inside the collection’s brackets. Consider the following example for constructing a list:

In [None]:
squares = []
for x in range(1, 10):
    squares.append(x*x)
print(squares)

This could be equivalently written as a list comprehension, effectively building a list from the inside, where the item that enters the list (here `x*x`) appears before the `for`:

In [None]:
squares = [x*x for x in range(1,10)]
print(squares)

Changing the bracket type changes the type of collection constructed. Curly brackets `{}` can be used to construct a set or a dictionary, depending on whether a `key:value` dictionary specification is made.

In [None]:
s = {x*x for x in range(1,10)}              # Make a set
d = {i:x for i,x in enumerate('ABCDEF')}    # Make a dictionary
print('Set:', s)
print('Dict:', d)

There is also an option to add a conditional filter when making a collection. Here an internal `if` comes after the `for` section(s). In the following example a list is made using only odd values:

In [None]:
odd_sq = [x*x for x in range(1,10) if x % 2 == 1]    # Only odd x, then squared
print(odd_sq)

# Example: Calculate sequence identity

The following example calculates the percentage of sequence identity for two input sequences `seq1` and `seq2`. The sequences should be aligned and carry ‘-’ characters to represent alignment gaps.

In [None]:
seq1 = 'ALIGDPVENTS'
seq2 = 'ALIGN-MENTS'
n = len(seq1)

count = 0.0                   # Starting identity count is zero

for i in range(n):            # Loop through  position indices
    a = seq1[i]               # Letter at index i for first seq
    b = seq2[i]               # Letter at index i for second seq
    if a == b and a != '-':   # Test if letters are same and not a gap
        count += 1.0          # .. if they are increase count by one

ident = 100.0 * count/n       # Calculate identity as a % of the total length

print('Identity {:.2f}%'.format(ident)) # Format to 2 dp
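The same calculation can be written without positional indices by pairing up the letters with the inbuilt `zip()` function. This is an alternative sketch rather than a replacement for the version above; note that `zip()` stops at the end of the shorter sequence, so it assumes the two aligned sequences have equal length.

```python
seq1 = 'ALIGDPVENTS'
seq2 = 'ALIGN-MENTS'

count = 0
for a, b in zip(seq1, seq2):      # Pair up letters at the same position
    if a == b and a != '-':       # Same letter and not a gap
        count += 1

ident = 100.0 * count / len(seq1)
print('Identity {:.2f}%'.format(ident))
```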

## <font color=purple>Exercise 8: Dictionary counting</font>

<font color=purple>Using the protein sequence given below, count the number of occurrences of each amino acid letter using a dictionary and a `for` loop. Use an `if` statement inside the loop to initialise the count for letters that have not been seen before i.e. `if letter not in count_dict:`. Use a second loop to print each letter and its corresponding count.</font>

In [None]:
# Exercise code
seq = "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAHEVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSPDTDVPLSSVEIENPETSDQ"
count_dict = {}

## Catching errors

A `try`, `except` block is used to catch and deal with illegal circumstances. The code in a `try` block is run and if a problem occurs an `except` block of code may be run if a particular kind of error (a type of `Exception` object) is detected. Consider the following error-generating code:

In [None]:
x = 1
y = 0
w = x/y

We can prevent the program from failing and sensibly handle an error. In the jargon we catch a particular *exception*: detect error objects at the `except` keywords. Here the error object is a `ZeroDivisionError` which is present in standard Python.

In [None]:
x = 1
y = 0
try:                 # Run the following block and check for failure
    w = x / y

except ZeroDivisionError as err:         # Catch only this kind of error
    print('divided by zero, continuing') # warn, but otherwise ignore

An error of almost any kind may be detected using the general `Exception` class. Also, the original error can be re-triggered using the `raise` statement if required. For example, here we extend the above with a second `except` to catch any additional errors, which we then re-trigger so that the program fails.

In [None]:
x = 1
y = '0'
try:                 # Run the following block and check for failure
    w = x / y

except ZeroDivisionError as err:        # This specific error is OK
    print('divided by zero, continuing')

except Exception as err:   # Any other error is not OK
    raise                  # Re-trigger the error, do not continue
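The `as err` part gives a name to the error object, which can then be inspected, for example to report its message. The sketch below catches the specific `TypeError` that arises from dividing a number by a string; the `message` variable is only there for illustration.

```python
x = 1
y = '0'                    # A string, not a number
try:
    w = x / y
except ZeroDivisionError as err:
    message = 'Zero division: {}'.format(err)
except TypeError as err:   # Dividing an int by a str gives a TypeError
    message = 'Wrong type: {}'.format(err)

print(message)             # Includes the text carried by the error object
```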

## Functions

The keyword `def` is used to define functions, i.e. user created subroutines. In essence these are a specification for a named bit of code. Defining a function is distinct from running (or calling) a function, but once defined a function can be called into action any number of times; achieved by using its name with parentheses (which convey any input data). Naturally the general idea is that functions represent reusable code, performing the same operation in many different places, albeit for different input data. 

Here is a very simple function that prints some text, noting that when defining the function no actual print operation is done:

In [None]:
def demo_func():
    print('Hello')

    

The function only works, after it is defined, by invoking its name with brackets. In the jargon the function is *called*:

In [None]:
demo_func()    

Next this function is redefined, so that it takes a single input, which we call `text`. This input is referred to as an *argument* and is a label that will only be filled with a definite value when the function is run.

In [None]:
def demo_func(text):
    print(text)

demo_func('abcdefg')
demo_func('GCAT')

Note that the `text` variable is internal to the function and not defined outside:

In [None]:
print(text)

The following function called `my_calc`, takes two input values (two arguments) which are labelled as `x` and `y` inside the function and which do not have any specific values inside the definition. The definition involves specifying a calculation, the result of which, `z` is passed back from the function at the `return` statement. 

In [None]:
def my_calc(x, y):      # Two inputs
    z = x * x - y * y   # The operations to perform
    return z            # Send back the result

Once the definition exists the name `my_calc` can be used on two input values, which fill the `x` and `y` slots in the function, generating a value for `z`, which is then what is output from the function and, in this case, printed:

In [None]:
r = my_calc(4,5)    # internal x set to 4, y to 5
print(r)            # the result
print(my_calc(3,2)) # run on different values

There is significant flexibility with the input arguments of Python functions. They can have defaults, for when they are not explicitly specified, and there is freedom to use named or unnamed arguments. Named input arguments can appear in any order though they must be stated after any unnamed ones, which fill slots in order, as shown above. The next example is a modification of the previous function which uses a default value of `1` for `y` on the `def` line.

In [None]:
def my_calc(x, y=1):     # x is mandatory, y defaults to 1 if not given
    z = x * x - y * y
    return z

Calling the function is then illustrated with and without specifying an explicit value for `y`, using the named and unnamed conventions:

In [None]:
a = my_calc(7)           # x is 7, y defaults to 1  
b = my_calc(x=2, y=2)    # name both arguments
c = my_calc(y=9, x=-1)   # name arguments in a different order
d = my_calc(3, y=-2)     # unnamed arguments (x is 3) come first
print(a, b, c, d)

Decorators allow a modifying statement to be added immediately before a function definition, using an ‘@’ syntax. In essence this wraps one function with another, which can modify and inspect both its input and output. In the illustration here `decorator_func` stands for some previously defined decorator.

In [None]:
@decorator_func 
def my_calc(x, y=1):
    z = x * x - y * y
    return z

For the sake of brevity, creating decorator functions will not be discussed; however decorator use will be shown in later examples.

## <font color=purple>Exercise 9: Reverse-complement function</font>

<font color=purple>Write a function to find the reverse complement of a DNA sequence, pairing G with C, A with T and their corresponding reciprocals (this mapping can be stored as a Python dictionary). The function should accept an input sequence `seq` and give back (return) a reverse complement sequence. The sequence can be assumed to be an iterable collection of characters like a list or a string. After defining the function, call it with some test data to check it works properly.</font>

In [None]:
# Exercise code
def rev_complement(seq):
    rc_seq = ''
    
    # fill function contents

test_seq = 'AGCATAAGAATAGCAGCAGCGCGA'


# Modules

Some functions, like `len()` or `int()`, are available at any time in Python. However, functions must often be imported into a program from a separate module before they can be used. There are three basic sources of modules: those that automatically come as part of every Python installation (the Standard Library), those that require a separate installation (such as NumPy or BioPython) and those that are specific to the user. If a module is accessible to Python it can be used via the `import` keyword and the various components of the module are referred to with dot syntax:

In [None]:
import math           # Import the inbuilt mathematics module
print(math.e)         # An attribute representing e
print(math.exp(2.0))  # Use the exponent function

It is also possible to locally use a different name for the module using the `import .. as ..` syntax:

In [None]:
import math as m      # Import as a different name
print(m.log(2.0))     # Use the logarithm function

Alternatively, specific components of a module may be imported using a `from .. import ..` syntax:

In [None]:
from math import sqrt, cos   # Import named module components
x = sqrt(3/2)                # Use the square root function
print(cos(x))                # Use the cosine function

The `math` module used above contains several commonly used mathematical constants and functions. There are various other commonly used libraries in the standard set, some of which we illustrate below. However, for a full module listing see the documentation at [python.org](https://docs.python.org/3/py-modindex.html). The `sys` module relates to the run-time Python environment:

In [None]:
import sys
print(sys.argv) # List of words typed at the command line after "python"
print(sys.path) # Directory search path used to locate Python modules 
sys.exit()      # Quit the Python program

The `time` module has some simple functions for handling time:

In [None]:
import time
time.sleep(5)       # Pause for 5 seconds
print(time.time())  # Number of seconds since the start of the Unix epoch (00:00:00 UTC, 1 January 1970)

However, it is the `datetime` module that has the more complex functionality that allows you to work with calendar days.

In [None]:
import datetime
now = datetime.date.today()
print(now)
print(now.month)

This is especially useful when subtracting dates to calculate timespans:

In [None]:
c20 = datetime.date(1900,1,1)
delta = now - c20
print(delta.days)
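The result of subtracting two dates is a `timedelta` object, and such an object can also be added to a date to get another date. A short sketch with fixed, arbitrarily chosen dates so that the results are reproducible:

```python
import datetime

start = datetime.date(2020, 1, 1)
end = datetime.date(2020, 3, 1)

delta = end - start                    # A timedelta object
print(delta.days)                      # 60, as 2020 is a leap year (31 + 29)

later = start + datetime.timedelta(days=100)   # Date 100 days after start
print(later)                           # 2020-04-10
```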

The `random` module is used to generate pseudo-random numbers:

In [None]:
import random
random.seed(3)                   # Set the random number seed
print(random.randint(1,10))      # A random integer from 1 to 10 inc.
print(random.uniform(0.0, 2.0))  # A random float between 0.0 and 2.0

In [None]:
data = [1,2,3,4,5]
random.shuffle(data)             # Shuffle list order
print(data)
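Because the numbers are only pseudo-random, setting the same seed reproduces the same sequence, which is handy for testing. The sketch below demonstrates this, together with `random.choice()` and `random.sample()` for picking items from a collection; the list of bases is just example data.

```python
import random

random.seed(3)
first = [random.randint(1, 10) for i in range(5)]
random.seed(3)                        # Reset to the same seed
second = [random.randint(1, 10) for i in range(5)]
print(first == second)                # True: same seed, same sequence

bases = ['G', 'C', 'A', 'T']
pick = random.choice(bases)           # One item picked at random
pair = random.sample(bases, 2)        # Two different items picked at random
print(pick, pair)
```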

The `os` module is for things that depend on which particular operating system (e.g. Windows, OSX, Linux) is running. Much of this relates to use of file systems, i.e. dealing with file paths, directories, permissions etc. Note that locations in the file system (directories and file names) are simply specified as text strings.

In [5]:
import os
os.chdir('/home/user/')        # Change the current working directory
dir_name = 'temp'
os.mkdir(dir_name)             # Make a new directory
os.listdir(dir_name)           # Get a list of directory contents

FileNotFoundError: [Errno 2] No such file or directory: '/home/user/'

In [None]:
file_path = '/home/user/test.py'
new_path = '/home/user/test_old.py'   # Hypothetical destination path
os.rename(file_path, new_path)        # Move a file to a new location
os.remove(new_path)                   # Delete a file

It is notable that `os` doesn’t handle copying files; this is done with the `shutil` module. Within the `os` module of particular importance is the `os.path` submodule which handles the text strings that represent locations within the file system.  

In [None]:
file_path = 'LMB_Python_Basics.ipynb'
print(os.path.exists(file_path))   # True if the file path exists, else False

In [None]:
file_path = os.path.abspath(file_path)
print(file_path)

In [None]:
print(os.path.split(file_path))    # Split into (leading directories, final name) parts

In [None]:
print(os.path.isdir(file_path))    # True if a directory, else False
print(os.path.join('usr','local')) # Join with the directory separator, e.g. 'usr/local'
print(os.path.splitext(file_path)) # Split off the file extension, e.g. ('folder/file', '.ipynb')
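Returning to the point that copying is handled by `shutil` rather than `os`, the following is a minimal sketch of copying a file. To keep it self-contained it works in a scratch directory made with the standard `tempfile` module; the file names are hypothetical and the small file is written using `open()`, which is covered later.

```python
import os
import shutil
import tempfile

work_dir = tempfile.mkdtemp()                  # Scratch directory for the demo
src = os.path.join(work_dir, 'original.txt')   # Hypothetical file names
dst = os.path.join(work_dir, 'copy.txt')

with open(src, 'w') as file_obj:               # Make a small file to copy
    file_obj.write('some text\n')

shutil.copy(src, dst)                          # Copy the file to a new path
names = sorted(os.listdir(work_dir))
print(names)                                   # ['copy.txt', 'original.txt']

shutil.rmtree(work_dir)                        # Remove the directory and contents
```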

The `re` module is used for regular expressions; pattern matching in text strings. Often the functions generate a special match object that can be interrogated, for example to find where the pattern was found and what the actual substring was.

In [None]:
import re
pattern = re.compile(r'\d+') # Make a regular expression object (one or more digits); r'' keeps the backslash literal
text = 'A 123 B 456'        # A string to look in 

match_obj = pattern.search(text)  # Match inside string
print(match_obj)

In [None]:
print(match_obj.group())          # '123' – matching substring
print(match_obj.start())          # 2 – position of match

In [None]:
text_2 = pattern.sub('**', text)  # Substitute all matches with '**'
print(text_2)                     # 'A ** B **' 

In [None]:
hits = pattern.findall(text)      # List of matching sub-strings
print(hits)                       # ['123', '456']
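If the positions of every match are needed, and not just the matching text, one option is `finditer()`, which gives a match object for each hit in turn. A sketch using the same pattern and text as above:

```python
import re

pattern = re.compile(r'\d+')      # One or more digits
text = 'A 123 B 456'

spans = []
for match_obj in pattern.finditer(text):   # One match object per hit
    spans.append((match_obj.group(), match_obj.start(), match_obj.end()))

print(spans)                      # [('123', 2, 5), ('456', 8, 11)]
```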

## <font color=purple>Exercise 10: Using the `math` module</font>

<font color=purple>Using a loop construct a list of sine and cosine pairs for every ten degrees in the range zero to 180. Hint: use `math.radians()` to convert from degrees.</font>

In [1]:
# Exercise code
import math
p = []
for a in range(0, 181, 10):
    r = math.radians(a)
    s = math.sin(r)
    c = math.cos(r)
    p.append((s,c))
print(p)

[(0.0, 1.0), (0.17364817766693033, 0.984807753012208), (0.3420201433256687, 0.9396926207859084), (0.49999999999999994, 0.8660254037844387), (0.6427876096865393, 0.766044443118978), (0.766044443118978, 0.6427876096865394), (0.8660254037844386, 0.5000000000000001), (0.9396926207859083, 0.3420201433256688), (0.984807753012208, 0.17364817766693041), (1.0, 6.123233995736766e-17), (0.984807753012208, -0.1736481776669303), (0.9396926207859084, -0.3420201433256687), (0.8660254037844387, -0.4999999999999998), (0.766044443118978, -0.6427876096865394), (0.6427876096865395, -0.7660444431189779), (0.49999999999999994, -0.8660254037844387), (0.3420201433256689, -0.9396926207859083), (0.17364817766693028, -0.984807753012208), (1.2246467991473532e-16, -1.0)]


A table listing a selection of commonly used modules from the Python Standard Library is given below. Full documentation of these and the other modules mentioned above can be found at [python.org](https://docs.python.org/3/py-modindex.html).

Module|Description
------|------------
argparse|A module that helps interpret command-line options/arguments, i.e. information typed after the name of a program, as available in sys.argv.
copy|Create a new Python object by copying an existing object. Can create shallow or deep copies; whether any object contained by an object is itself also copied.
datetime|Provides date, time, timedelta and datetime objects to represent temporal information. Deals with daylight savings, date formatting, time string interpretation etc.
glob|Provides file name fetching using UNIX-like wild cards, i.e. patterns that include “*” and “?” rather than regular expressions.
ftplib|Used to send and receive files using the File Transfer Protocol. 
gzip, bz2, zipfile, tarfile|Modules that deal with creating and extracting compressed and/or archived files. 
http |Used to send and receive information across the Internet using the Hypertext Transfer Protocol. A lower level library than urllib.
multiprocessing|Run Python code as separate, parallel, processes/jobs on multiple core/processor systems.
pickle|Converts Python object data to and from a byte string (serialization) which may be saved to or loaded from a file system.
platform|Used to get information about the current computer and its architecture.
shutil|Performs higher level file operations, such as copying files, copying trees of files and finding executable files.
sqlite3|Interaction with SQLite : a lightweight file-based SQL database.
string|Provides some functions not directly available to string objects. Useful for accessing particular sets of characters such as whitespace, punctuation, digits etc. 
subprocess|Run an external program as a separate job/process and connect any input/output data streams.
time|Basic time related functions using numbers and strings. Can be used to time program execution and to pause execution (time.sleep()) .
threading|Run Python code in separate threads. In standard CPython the global interpreter lock generally prevents threads from executing Python code in parallel on multiple cores (use multiprocessing for that). Can be useful to process intermittent data streams.
urllib |Used to send and receive information across the Internet. Higher level and often more convenient than http. Handles web proxies, redirection, passwords, cookies etc. Often used to interact with web services and databases.
zlib|Used to compress data into more compact representation, using the zlib algorithm. Can be useful for caches and undo functions.
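To illustrate the distinction between shallow and deep copies mentioned for the `copy` module, here is a small sketch using a list of lists as arbitrary example data. A shallow copy makes a new outer list that still shares the inner lists, whereas a deep copy duplicates the inner lists as well.

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)       # New outer list, same inner lists
deep = copy.deepcopy(original)      # Inner lists are copied too

original[0].append(99)              # Change an inner list of the original

print(shallow[0])                   # [1, 2, 99] : shared with the original
print(deep[0])                      # [1, 2]     : independent copy
```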

Writing custom modules is easily achieved in Python: in general a normal file containing Python code can be imported as a module. There are a few caveats to this, but most importantly the Python system needs to know where to look in the file system to find a module. Modules are found by looking in a series of directories, the *search path*, for file names that match the attempted `import`. Some of these will be in standard locations for the Python installation, but search directories can be added at any time via `sys.path`.

In [2]:
import sys
print(sys.path)                        # Current module search path

['', '/usr/lib/python36.zip', '/usr/lib/python3.6', '/usr/lib/python3.6/lib-dynload', '/home/tjs23/.local/lib/python3.6/site-packages', '/usr/local/lib/python3.6/dist-packages', '/usr/local/lib/python3.6/dist-packages/ViTables-3.0.0-py3.6.egg', '/usr/local/lib/python3.6/dist-packages/tables-3.5.2-py3.6-linux-x86_64.egg', '/usr/local/lib/python3.6/dist-packages/QtPy-1.9.0-py3.6.egg', '/usr/local/lib/python3.6/dist-packages/numexpr-2.7.0-py3.6-linux-x86_64.egg', '/usr/local/lib/python3.6/dist-packages/mock-3.0.5-py3.6.egg', '/usr/lib/python3/dist-packages', '/usr/lib/python3.6/dist-packages', '/usr/lib/python3/dist-packages/IPython/extensions', '/home/tjs23/.ipython']


In [3]:
sys.path.append('/home/tim/my_modules')
print(sys.path)                        # Extended search path

['', '/usr/lib/python36.zip', '/usr/lib/python3.6', '/usr/lib/python3.6/lib-dynload', '/home/tjs23/.local/lib/python3.6/site-packages', '/usr/local/lib/python3.6/dist-packages', '/usr/local/lib/python3.6/dist-packages/ViTables-3.0.0-py3.6.egg', '/usr/local/lib/python3.6/dist-packages/tables-3.5.2-py3.6-linux-x86_64.egg', '/usr/local/lib/python3.6/dist-packages/QtPy-1.9.0-py3.6.egg', '/usr/local/lib/python3.6/dist-packages/numexpr-2.7.0-py3.6-linux-x86_64.egg', '/usr/local/lib/python3.6/dist-packages/mock-3.0.5-py3.6.egg', '/usr/lib/python3/dist-packages', '/usr/lib/python3.6/dist-packages', '/usr/lib/python3/dist-packages/IPython/extensions', '/home/tjs23/.ipython', '/home/tim/my_modules']


Consider a file called `my_module.py` that resides in the current working directory and which has the following contents:<br>
`
CONSTANT = 1.0545718e-34       # Example constant
def my_calc(x, y):             # Example function
    return x*x - 2*x*y - y*y`

Once the directory containing the Python file is present in `sys.path` we can access it with `import my_module` in a different Python program and use its variables and functions. It is notable that the file extension (`.py` here) is omitted from the module name when it is imported.

In [6]:
sys.path.append(os.getcwd())    # Directory with my_module.py
import my_module                # Import custom module (no .py)

print(my_module.CONSTANT)       # Use variable from other Python file
print(my_module.my_calc(9,-1))  # Use function from other Python file

1.0545718e-34
98


It is often convenient to add the locations of custom Python modules to the search path more permanently by adding them to the PYTHONPATH environment variable (generally set in the operating system), so that `sys.path` automatically contains the required locations whenever any Python program is run. If instead the module is located inside a sub-directory of one of the `sys.path` locations, such as `examples/` relative to the current working directory, then the module can be imported as follows, using dot syntax.

In [7]:
import examples.demo_module     # Import from sub-directory

**************************
** H E L L O  W O R L D **
**************************


(Before Python version 3.3, to make this sub-directory import work, a file called `__init__.py`, which is usually blank or contains only `pass`, must be present in the `examples` directory.)

When a module is imported its contents are run as Python. While this is usually no problem for defining constants and functions etc. sometimes the file may contain code that should only be run when the file is used directly and not when it is imported. To overcome this, the special internal Python `__name__` variable can be inspected. This will be the string `__main__` if the code is run as the main program, but will otherwise be the name of the module. Hence this can simply be checked, to make sure the code is not imported as a module.

In [None]:
def my_calc(x, y):             # Example function
    return x*x - 2*x*y - y*y

if __name__ == '__main__':     # Is the code run as a main program?
    print(my_calc(2,3))        # Run test code only when it is main
    print(my_calc(-1,0))       # Module imports do not run this block

Though the standard libraries are extensive and custom modules allow great flexibility, there are a large number of external modules that are really useful for molecular biology, and which mean you can build programs using pre-written and well-tested code. A small selection of popular general purpose external modules useful for molecular biology is given in the table below.

Module|Description
------|-----------
NumPy|Numeric Python with a highly-functional multidimensional array object and associated functionality for linear algebra, pseudo-random numbers, Fourier transforms etc.
BioPython|A collection of tools for computational biology. Provides modules to work with and manipulate many common bioinformatics format files relating to sequences, alignments, phylogenetics, molecular structures etc.
PySam|The Python interface to SAMtools, which allows reading, writing and manipulation of BAM and SAM format files used in high-throughput DNA sequence mapping. Also deals with variant call VCF/BCF files.
HTSeq|A module for the analysis of high-throughput DNA sequencing data. Deals with the common informatics formats including FASTQ, SAM, BAM, BED and GFF. Has specialized data structures to handle large genomic array and genomic interval data.
PIL|The Python Imaging Library which handles loading and saving image data in a large number of different file formats. Provides many functions to manipulate image data: enhance, filter and mask etc.
Pandas|Manipulation and analysis of data tables with an organization similar to spreadsheets and relational databases. Provides many advanced functions for data manipulation and file access (including Excel, CSV and general text).
Pymol, UCSF Chimera|Popular molecular structure viewers that may be imported as Python modules to produce graphics for protein structures etc.
SciPy|Extensive scientific and engineering library, building upon NumPy with modules for statistics, integration, optimization, signal processing etc.
matplotlib|A plotting library to create many different types of charts and graphs from numeric data, with customizable graphical styles.
scikit-learn|A library for machine learning, clustering, regression, dimensionality reduction etc. that is built upon SciPy and NumPy.
keras|Deep neural networks using a TensorFlow or Theano back-end

# Stored data

A key operation in most programs is reading and saving data to and from disk, or other storage filesystem. In Python we can choose to handle all of the reading and writing explicitly in a program. However, if there is a module which handles a particular data format then that is often a good choice instead. For example, we can use BioPython modules to read most common bioinformatics formats and use Pandas to read and write CSV files and Excel spreadsheets.

File operations revolve around a file object. This is typically generated using the inbuilt `open()` function, given the location of the data on the file system. Once the file object is created various functions can be used to read and write data; the object is the Python interface to the stored data. In the next example a file object is created, and all of its data read. The use of `'r'` is to specify that the file is opened for reading, though this is the default when not specified. The data may be read in its entirety using `read()`: 

In [8]:
path = 'examples/demo_file.txt'   # Location of data in file system
file_obj = open(path, 'r')        # Create file object for reading
data = file_obj.read()            # Read all the data (as a string)
print(data)

Fuzzy Wuzzy was a bear.
Fuzzy Wuzzy had no hair.
Fuzzy Wuzzy wasn't fuzzy, was he? 



If the data is re-read then nothing results because the file object remembers the last read point and this refers to the end of the data after the first complete read.

In [9]:
data = file_obj.read()            # Nothing remaining
print(data)




If we want to change this we can use `seek()` to go to a particular position (in bytes, with `0` being the start). When the file is no longer needed it is closed, after which no further operations on it are possible.

In [10]:
print(file_obj.read())   # Empty, nothing to read at end of file
file_obj.seek(0)         # Point at start of data      
print(file_obj.read())   # Read everything
file_obj.close()         # Close file, no further operations


Fuzzy Wuzzy was a bear.
Fuzzy Wuzzy had no hair.
Fuzzy Wuzzy wasn't fuzzy, was he? 



It is now common practice to use the `with .. as ..` syntax to deal with file objects. The `with` block creates a managed context, which in this case means that the file object is only open within the block and is closed automatically afterwards.

In [11]:
with open(path) as file_obj:
    print(file_obj.read())

Fuzzy Wuzzy was a bear.
Fuzzy Wuzzy had no hair.
Fuzzy Wuzzy wasn't fuzzy, was he? 
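
That the file really is closed once the `with` block ends can be checked using the file object's `closed` attribute. This is a small sketch using a scratch file of its own (the path `with_demo.txt` is hypothetical):

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration
demo_path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

with open(demo_path, 'w') as file_obj:
    file_obj.write('Fuzzy Wuzzy had no hair.\n')

with open(demo_path) as file_obj:
    text = file_obj.read()
    open_inside = not file_obj.closed   # Still open inside the block

closed_after = file_obj.closed          # Closed once the block has ended

try:
    file_obj.read()                     # Reading a closed file fails
    error_text = None
except ValueError as err:
    error_text = str(err)

print(open_inside, closed_after)
print(error_text)
```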



There are other functions to read the file in terms of lines, separated with `\n`. In Python 3 universal newline handling is the default when reading text, so `\r` and `\r\n` line endings are also recognised and translated to `\n`; the older `'rU'` mode is deprecated and was removed in Python 3.11. All lines can be read at once using `readlines()`, or `readline()` can fetch a single line.

In [12]:
with open(path) as file_obj:        # Text mode; universal newlines by default
    print(file_obj.readline())      # First line
    print(file_obj.readline())      # Second line
    print(file_obj.readlines())     # All remaining lines (as a list)

Fuzzy Wuzzy was a bear.

Fuzzy Wuzzy had no hair.

["Fuzzy Wuzzy wasn't fuzzy, was he? \n"]


Subsequent readline() calls give each line in turn; the pointer to the file data picks up at the end of the previous line. However, it is often more convenient to treat the file object as an iterator and loop through it, i.e. as if it were a list of lines:

In [13]:
with open(path, 'r') as file_obj:   # File is closed when the block ends
    for line in file_obj:           # Loop through all lines in file
        print(line)

Fuzzy Wuzzy was a bear.

Fuzzy Wuzzy had no hair.

Fuzzy Wuzzy wasn't fuzzy, was he? 
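
The blank lines in this output arise because each line read from the file keeps its trailing newline character and `print()` then adds another. A common remedy is to strip the newline before use. The sketch below uses an in-memory `io.StringIO` object as a stand-in for the demo file (an assumption, made to keep the example self-contained); a real file object is looped over in the same way.

```python
import io

# In-memory stand-in for a text file, used only for this illustration
file_obj = io.StringIO('Fuzzy Wuzzy was a bear.\nFuzzy Wuzzy had no hair.\n')

clean_lines = []
for line in file_obj:                       # Each line still ends in '\n'
    clean_lines.append(line.rstrip('\n'))   # Remove the trailing newline

for line in clean_lines:
    print(line)                             # No extra blank lines now
```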



Many files that are commonly read by Python are essentially stored as text. Consequently, any numbers must be properly interpreted as such, appropriately creating proper `int` or `float` objects, if we want to use the data for calculations etc. It is fairly common to have file data formatted so that each line represents a different item and each line contains multiple columns, separated by commas, spaces or tabs, such as:

`
chr  length          prot_genes
1	248956422	2058
2	242193529	1309
3	198295559	1078`

The following code will read this kind of file, given a file name, and put the data into a list. Note how the first header line (which contains no numeric data) is read before the `for` loop and that the lines are split (in this instance on the default whitespace) to make sub-lists.

In [14]:
file_path = 'examples/chromo_stats.tsv'

data = []
with open(file_path) as file_obj:
    head = file_obj.readline()    # Read first header line; not used
    for line in file_obj:         # Go through each remaining line
        row = line.split()        # Split line at whitespace into a list
        data.append(row)

print(data)

[['1', '248956422', '2058'], ['2', '242193529', '1309'], ['3', '198295559', '1078'], ['4', '190214555', '752'], ['5', '181538259', '876'], ['6', '170805979', '1048'], ['7', '159345973', '989'], ['8', '145138636', '677'], ['9', '138394717', '786'], ['10', '133797422', '733'], ['11', '135086622', '1298'], ['12', '133275309', '1034'], ['13', '114364328', '327'], ['14', '107043718', '830'], ['15', '101991189', '613'], ['16', '90338345', '873'], ['17', '83257441', '1197'], ['18', '80373285', '270'], ['19', '58617616', '1472'], ['20', '64444167', '544'], ['21', '46709983', '234'], ['22', '50818468', '488'], ['X', '156040895', '842'], ['Y', '57227415', '71']]
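
Note that every item in these sub-lists is still a string. As a brief sketch of the conversion step mentioned earlier, the first row is handled by hand below; the variable names are chosen only for illustration.

```python
row = ['1', '248956422', '2058']     # One sub-list, as produced by line.split()

chromo = row[0]                      # Left as text: names such as 'X' are not numbers
length = int(row[1])                 # String converted to a whole number
n_genes = int(row[2])
density = n_genes / (length / 1e6)   # Genes per million base pairs (a float)

try:
    int('X')                         # Text that is not a number cannot be converted
    convertible = True
except ValueError:
    convertible = False

print(type(row[1]), type(length))
print(round(density, 2))
print(convertible)
```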


## <font color=purple>Exercise 11: </font>

<font color=purple>Extend the above example that reads a tab-separated value (TSV) file line by line. Extract the numerical value from the second column and calculate the sum over the whole file, i.e. the total of the chromosome lengths. </font>

In [15]:
# Exercise code
file_path = 'examples/chromo_stats.tsv'

To write to a file, mode `'w'` or mode `'a'` must be used: opening with `'w'` creates a blank file (deleting any previous data), while `'a'` appends to the end of an existing file. To do the actual writing, data is simply passed to the `write()` function, in one or more parts. If we want to save the data as lines, the appropriate newline characters (`'\n'` etc.) need to be added.

In [None]:
file_path = 'demo_out.txt'
file_obj = open(file_path, 'w')  # Open file object in writing mode

x = 1
while x < 1000:               # A loop which will generate many lines
  line = '{}\n'.format(x)     # Create the string for each line
  file_obj.write(line)        # Write each line
  x *= 2

file_obj.close()
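
The difference between the `'w'` and `'a'` modes can be seen in a short sketch that uses the `with` syntax introduced earlier, so that no explicit `close()` is needed. The output location is a hypothetical temporary directory, chosen so that no existing file is overwritten.

```python
import os
import tempfile

# Hypothetical scratch location, used only for this illustration
out_path = os.path.join(tempfile.mkdtemp(), 'demo_out.txt')

with open(out_path, 'w') as file_obj:     # 'w' starts from an empty file
    for x in (1, 2, 4):
        file_obj.write('{}\n'.format(x))  # Newline added explicitly

with open(out_path, 'a') as file_obj:     # 'a' keeps the existing lines
    file_obj.write('8\n')

with open(out_path) as file_obj:          # Read back to check the result
    lines = file_obj.read().split()

print(lines)
```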

When dealing with files it is common practice to accept the name of the file to use at the time when the program is run. On the command line this is easily achieved by specifying the name of the file after the Python script name. What was entered at the command line is then accessed using the `sys.argv` list. For example, if the following is typed at the operating system command prompt (>):

> python programFile.py data/inputFile.txt

then the name of the file can be captured in the following way, noting that the first item in the `sys.argv` list is the Python script itself, so it is the second item (index 1) that we usually want.

In [16]:
import sys                 # Access the sys module
py_script = sys.argv[0]    # 'programFile.py'   
data_file = sys.argv[1]    # 'data/inputFile.txt'

data = open(data_file).read()  # Make file object and read its data

FileNotFoundError: [Errno 2] No such file or directory: '-f'
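
The `FileNotFoundError` shown here appears because, inside a notebook, `sys.argv` holds the options used to launch the kernel (here `-f`) rather than the name of a data file. One way to guard against this is sketched below; the helper `get_data_path()` and its `default` parameter are hypothetical names introduced for illustration, not part of the `sys` module.

```python
import sys

def get_data_path(argv, default=None):
    """Give back the first command-line argument unless it is missing or an option."""
    if len(argv) > 1 and not argv[1].startswith('-'):
        return argv[1]
    return default                # Nothing usable was given

# As run from the command line
print(get_data_path(['programFile.py', 'data/inputFile.txt']))

# As seen inside a notebook, where the second item is an option flag
print(get_data_path(['ipykernel_launcher.py', '-f', 'kernel.json']))

# With the real argument list of the current session
print(get_data_path(sys.argv, default='examples/demo_file.txt'))
```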

When dealing with common data file formats there may be a module that already deals with file access, handling all of the reading and writing for you and interpreting the data properly. For example, the BioPython package can read and write many bioinformatics formats. In the next example we import its `SeqIO` module. This can read from an open file object using its `parse()` function to generate sequence record objects (named `protein` below) from which data is accessed via the dot syntax. In this case the data format is specified as FASTA, which is common for protein or nucleic acid sequences:

In [None]:
from Bio import SeqIO    # Load BioPython module; must be installed
 
file_name = "examples/demo_sequences.fasta"   # Location of data
file_obj = open(file_name)

for protein in SeqIO.parse(file_obj, 'fasta'): # Go through each entry
  print(protein.id)                            # The ID of seq record
  print(protein.seq)                           # The actual sequence

file_obj.close()

Next the Pandas module is demonstrated to show its ability to read delimiter-separated text files, such as comma-separated value (CSV) files, using the `read_csv()` function; here `sep='\t'` selects the tab character as the separator. Similar `read_excel()` and `read_sql_table()` functions, for Excel spreadsheets and SQL databases, also exist. The Pandas module stores data in `DataFrame` objects, which print nicely.

In [None]:
import pandas
file_path = 'examples/chromo_stats.tsv'
data_set = pandas.read_csv(file_path, sep='\t', header=0)

print(type(data_set))    
print(data_set)

Here the `DataFrame` columns are accessed in a similar way to dictionaries:


In [None]:
for value in data_set['chr']:
    print(value)

# Download an informatics file from a URL

The following function illustrates how data can be fetched over the Internet using the `urllib` module, which does the hard work of finding the requested file and downloading its data. The `url` variable contains most of the Internet address location but needs a specific database entry identifier (`db_id`) to be inserted into the address (using `format()`) before the connection can be opened to the correct file. There is a slight complication in this function as the data is downloaded in its raw binary form and we have to specifically convert it to a Python string object via `decode()`, which interprets the data using a standard (UTF-8) character encoding scheme.

In [None]:
def download_db_id(db_id, url, file_name=None): # Save file is optional

  from urllib.request import urlopen     # Import urlopen() function  
  response = urlopen(url.format(db_id))  # Makes object to handle link
  data = response.read()                 # Read data from URL
  data = data.decode('utf-8')            # Interpret data as plain text
  
  if file_name:                       # If save file was specified...
    file_obj = open(file_name, 'w')   # Create file object for writing
    file_obj.write(data)              # Save all data to file
    file_obj.close()

  return data                         # Hand back data from function

The function is tested by downloading a protein structure from the PDB. The URL for this database is given below and includes `{}` to indicate where the identifier code for the specific database entry will be added. Note that this variable uses the uppercase naming convention because it acts as a constant and lies outside the functions.

In [None]:
PDB_URL = 'https://files.rcsb.org/download/{}.pdb'

data = download_db_id('1AFO', PDB_URL, '1AFO.pdb')  # Save .pdb file
print(data)                                   
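
The two steps inside `download_db_id()` that do not need a network connection, namely inserting the identifier into the URL and decoding the raw bytes, can be tried in isolation. In this sketch the `raw` bytes are a made-up stand-in for what `response.read()` would give back; they are not real PDB data.

```python
PDB_URL = 'https://files.rcsb.org/download/{}.pdb'

full_url = PDB_URL.format('1AFO')         # Identifier replaces the {} marker
print(full_url)

# Made-up stand-in for downloaded data; '\u00c5' is the angstrom symbol
raw = 'Resolution 2.5 \u00c5\n'.encode('utf-8')
text = raw.decode('utf-8')                # Bytes interpreted as plain text

print(type(raw), type(text))
print(len(raw), len(text))                # Some characters need more than one byte
```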

The next example combines the file download in the above example with use of BioPython. The SeqIO module from BioPython will be used to handle the FASTA format data obtained from the UniProt database, which saves us from having to interpret the data format ourselves. The function takes a database entry identifier db_id and downloads the FASTA format data, as illustrated above, before automatically parsing the data and sending back the (first or only) sequence as a sequence record object created by BioPython. This object has handy attributes that can be accessed via the dot syntax to get at the actual sequence etc.

In [None]:
UNIPROT_FASTA_URL = 'http://www.uniprot.org/uniprot/{}.fasta'

def get_uniprot_seq_record(db_id):

  import os                     # Needed for the file existence check
  from Bio import SeqIO

  file_name = db_id + '.fasta'  # Add extension to ID to make file name

  if not os.path.exists(file_name):   # Download only if not already saved
    download_db_id(db_id, UNIPROT_FASTA_URL, file_name)

  with open(file_name) as file_obj:   # Open file for reading
    for seq_record in SeqIO.parse(file_obj, 'fasta'):
      return seq_record         # Give back first record encountered

sr = get_uniprot_seq_record('P18754')
print(sr.id, sr.seq)

# Object classes

All the Python objects (items of data) used thus far have been of the standard types. However, it is possible to create custom Python objects using the `class` keyword. This creates a named prototype which connects data values and bound functions (methods) together in an organized way. A class can often be thought of as equivalent to a table in a database. For reasons of brevity this aspect of Python will not be discussed in great detail in this chapter. However, a simple example is provided below which illustrates a rudimentary `Person` object. The class block contains the definitions of two functions. It defines the `__init__()` method which, because it has this particular name, is run any time an object of that type is created: here its task is to associate the input name and age with `self`, a special variable that represents the run-time object. A second method, `get_first_name()`, is defined which extracts and passes back the first part of the full name, which was stored as `self.name`.

In [None]:
class Person():         # Next indented block is in the class definition

  def __init__(self, name, age):  # Values specified when object is made
    self.name = name              # Link input values to the object
    self.age = age

  def get_first_name(self):       # A second, custom function
    names = self.name.split()     # self refers to the run-time object
    return names[0]               # Give back first word

p1 = Person('Lisa Simpson', 8)    # Make object of Person class
p2 = Person('Bart Simpson', 10)   # Make another
print(p1.age, p2.age)             # Values linked to objects - 8, 10 
print(p1.get_first_name())        # Run a linked function - gives 'Lisa'

As illustrated above, two objects were made using the `Person` prototype, thus creating two different instances of that class, stored as the variables `p1` and `p2`. The methods and simple attributes that were associated with `self` are available to each instance using the dot syntax, e.g. `p1.name` and `p1.get_first_name()`. Here, the `self` value stated in the class definition represents the object stated before the dot, i.e. `p1`, and is not passed via the parentheses.
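
The role of `self` can be made more concrete by calling the method through the class itself, in which case the object has to be passed explicitly. The sketch below repeats a compact version of the `Person` class so that it runs on its own; the two calls are equivalent.

```python
class Person():

  def __init__(self, name, age):
    self.name = name
    self.age = age

  def get_first_name(self):
    return self.name.split()[0]           # First word of the full name

p1 = Person('Lisa Simpson', 8)

via_instance = p1.get_first_name()        # Usual form: p1 becomes self
via_class = Person.get_first_name(p1)     # Same call with self passed explicitly

p1.age += 1                               # Attributes of an instance can be changed
print(via_instance, via_class, p1.age)
```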