<a href="https://colab.research.google.com/github/cohmathonc/biosci670/blob/master/IntroductionComputationalMethods/02_IntroPythonForScientificComputing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Selected Packages for Scientific Computing in Python

In the last lecture, we covered the basics of the Python language. This lecture focuses on the use of Python for scientific computing.
Specifically, we will introduce libraries commonly used for numerical computation and plotting.

[SciPy](https://www.scipy.org) is a Python-based ecosystem of open-source software packages for mathematics, science and engineering. The [Scipy lecture notes](https://www.scipy-lectures.org/index.html) provide a good overview of its components and basic usage.

For this course, we will mostly focus on two components:

*   [NumPy](http://www.numpy.org) is the most widely used package for numerical computing in Python. 
*   [matplotlib](https://matplotlib.org) is a Python library for 2d plotting.

We will also briefly introduce the *Python Data Analysis Library* [pandas](https://pandas.pydata.org), which provides convenient data structures and analysis tools for working with datasets.


## The NumPy Package

[NumPy](http://www.numpy.org) provides data structures and methods for handling and manipulating multi-dimensional arrays. If you have prior experience with Matlab, check out [NumPy for Matlab users](https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html); otherwise, the
[NumPy tutorial](https://docs.scipy.org/doc/numpy/user/quickstart.html) is a good starting point for a more in-depth introduction.

Unlike Python lists that can contain objects of different types, NumPy is meant for creating and manipulating multi-dimensional arrays of homogeneous type.

Let's compare a pure Python array (list of numbers) to a NumPy array:

In [None]:
import numpy as np    # we import numpy

python_array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
numpy_array  = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(type(python_array))
print(type(numpy_array))


NumPy arrays are *typed*, i.e. all objects in an array are of the same type.
This restriction allows for more efficient memory usage and support for mathematical operations on arrays.
The type of a NumPy array can be inspected using the `dtype` attribute:

In [None]:
print( numpy_array.dtype )
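
A minimal sketch of what the single-type rule means in practice: when the input mixes integers and floats, NumPy converts everything to one common type. The variable names are illustrative and not part of the course material.

```python
import numpy as np

mixed_list = [1, 2.5, 3]               # a Python list may mix int and float
mixed_array = np.array(mixed_list)     # NumPy converts all elements to one common type

print([type(v).__name__ for v in mixed_list])  # element types in the list
print(mixed_array.dtype)                       # a single dtype for the whole array
```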


Since all elements in a NumPy array are guaranteed to be of the same type, arithmetic operations can be applied to whole arrays, element by element.

In [None]:
# NumPy arrays
numpy_array = np.array(python_array) # numpy arrays can be created from lists!
print(numpy_array)                   # the original array
print(numpy_array + 1)               # add 1 to each element of the array
print(numpy_array + numpy_array)     # sum of two numpy arrays
print(numpy_array * 2)               # each array element multiplied by 2

Now try this with a python list.

---
**Exercise (1):**

Use the `python_array` defined above and apply the same arithmetic operations as for the NumPy array.

- What do you observe?
- What would you need to do to
  - increment each element in the list by 1
  - compute the element-wise sum of 2 python lists
  - multiply each element in the list by 2
---



NumPy supports multi-dimensional arrays, similar to nested lists. 

In [None]:
np_2d_arr = np.array ( [                          # define 2d NumPy array
                        [1, 2, 3, 4, 5],
                        [6, 7, 8, 9, 10]
                        ] )
print(np_2d_arr)
print(np_2d_arr + np_2d_arr)

print('original shape: ', np_2d_arr.shape)        # inspect 'shape' of array

np_2d_arr_reshaped = np_2d_arr.reshape(5, 2)      # change 'shape' of array
print('reshaped:       ',np_2d_arr_reshaped.shape)             
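
Reshaping is easy to confuse with transposing. The following sketch, using an illustrative array `arr`, suggests that `reshape` keeps the elements in their original row-by-row order, whereas `.T` swaps rows and columns.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5],
                [6, 7, 8, 9, 10]])

reshaped = arr.reshape(5, 2)    # same elements, read row by row into a new shape
transposed = arr.T              # rows and columns swapped

print(reshaped)
print(transposed)
print(np.array_equal(reshaped, transposed))   # the two results differ
```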

Elements of multi-dimensional arrays are accessed by `array[index_axis0, index_axis1, ..., index_axisN]`.  
An 'axis' in NumPy corresponds to an array dimension.
Axes in which no index is specified are treated as if all indices in that axis were selected.

In each axis, the slice operator `:` can be used to specify ranges of indices.

In [None]:
print(np_2d_arr[1, 1])   # second element (index==1) in axes 0, 1
  
print(np_2d_arr[:, 1])   # all elements along axis0 where index of axis1 ==1

print(np_2d_arr[1, :])   # all elements along axis1 where index of axis0 ==1

print(np_2d_arr[1])      # this is equivalent to [1, :]

print(np_2d_arr[1, 1:4]) # selection of elements ('slice') of axis1, with axis0==1 fixed 

And identically for higher dimensions:

In [None]:
np_md_arr = np.zeros( (3,3,3,2) )  # create 4D array with 
                                   # 3 elements along the 0th, 1st, 2nd
                                   # 2 elements along the 3rd axis                              
#print(np_md_arr)
#print(np_md_arr[:,:,1, 0])
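
As a small sketch of how indexing changes the shape of the result, the following uses a separate illustrative array `arr_4d` with the same dimensions as above. Each axis fixed to a single index is removed from the result.

```python
import numpy as np

arr_4d = np.zeros((3, 3, 3, 2))

print(arr_4d[:, :, 1, 0].shape)   # fixing two axes leaves a 2D array
print(arr_4d[0].shape)            # unspecified trailing axes are kept in full
print(arr_4d[0, 1, 2, 1])         # fixing all four axes gives a single element
```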



---

**Exercise (2):**

Suppose you want to track the position of a point in 3D space over time.

- How would you dimension a NumPy array to store this positional data for 10 time steps?
- What if you need to track two positions?


---



We saw previously that *objects* of a specific *type* may have *attributes* and *methods* associated to this type.

Objects of type `numpy.ndarray` have *attributes* that provide information about the specific array and help understanding its structure:

In [None]:
print("ndim:     ", np_md_arr.ndim)    # number of axes / dimensions
print("shape:    ", np_md_arr.shape)   # tuple of integers, i-th integer corresponds to size 
                                       # of array in i-th dimension
print("size:     ", np_md_arr.size)    # total number of elements in array
print("type:     ", np_md_arr.dtype)   # type of elements in array
print("itemsize: ",np_md_arr.itemsize) # size in bytes of each element of the array
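
These attributes are related to each other. A short sketch, using an illustrative array `arr` and the additional `nbytes` attribute (not used elsewhere in this notebook):

```python
import numpy as np

arr = np.zeros((3, 3, 3, 2))

print(arr.size == np.prod(arr.shape))           # size is the product of the shape entries
print(arr.nbytes == arr.size * arr.itemsize)    # total bytes used by the array data
print(arr.ndim == len(arr.shape))               # one shape entry per axis
```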

They also provide *methods* for common operations:

In [None]:
numpy_array = np.array([1,9,2,8,3,7,4,6,5])
print("sum:          ", numpy_array.sum())
print("mean:         ", numpy_array.mean())
print("max:          ", numpy_array.max())

print("array:        ", numpy_array)
numpy_array.sort()
print("sorted array: ", numpy_array)

As for any other object, you can inspect all associated *attributes* and *methods* with `dir(object)`.
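
The output of `dir()` is long. One possible way to make it more readable, sketched here with an illustrative array, is to hide names that start with an underscore:

```python
import numpy as np

arr = np.array([1, 9, 2])

# names starting with an underscore are internal or "special" methods
public_names = [name for name in dir(arr) if not name.startswith("_")]

print(len(public_names))
print("mean" in public_names, "shape" in public_names)
```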

## A note on numerical precision

[Numbers in digital systems](https://en.wikipedia.org/wiki/Computer_number_format) are represented by sets of binary digits called *bit*s. Each bit represents one of two possible states, such as 1 or 0.
While a single bit can only encode two different values, the number of possible combinations in a sequence of bits doubles with each additional bit. Thus, a sequence of $N$ bits is able to express $2^N$ different value combinations.

An [*octet*](https://en.wikipedia.org/wiki/Octet_(computing)) is a sequence of 8 bits that allows representing [*binary numbers*](https://en.wikipedia.org/wiki/Binary_number) with 256 distinct values. Today, [*byte*](https://en.wikipedia.org/wiki/Byte) is the more commonly used name for such an 8-bit sequence in the computing context. However, historically, *byte* simply denoted the number of bits used to encode a single character of text. This used to be hardware dependent.
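
The doubling rule can be checked directly in plain Python. This is only an illustrative sketch; the chosen bit widths are arbitrary.

```python
for n_bits in (1, 2, 8, 16):
    print(n_bits, "bits ->", 2 ** n_bits, "values")

largest_octet = 2 ** 8 - 1              # largest unsigned value that fits in 8 bits
print(format(largest_octet, "08b"))     # its binary representation
```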

Previously, we used NumPy's `dtype` attribute to inspect the datatype of the elements in the array, and the `itemsize` attribute to see their size. 
By default, NumPy uses 64-bit representations (at least on 64-bit architectures).

In [None]:
import struct                               # numpy.distutils is deprecated and unavailable on Python >= 3.12
platform_bits = struct.calcsize("P") * 8    # size of a pointer in bits
print("This is a %i-bit architecture."%platform_bits)

print("NumPy array data type: ", np_md_arr.dtype)   
print("Size of each element in bytes: ", np_md_arr.itemsize)

A 64-bit (8 byte) binary number can express $2^{64}$ (more than $1.8\times 10^{19}$) different values.
To see the actual achievable value ranges for specific data types, you can use [`finfo`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.finfo.html#numpy-finfo) and [`iinfo`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.iinfo.html).

In [None]:
print(np.iinfo(np.int64))
print(np.finfo(np.float64))
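
A brief sketch of how the $2^{64}$ bit patterns map to integer ranges, assuming the standard signed (`int64`) and unsigned (`uint64`) NumPy integer types:

```python
import numpy as np

print(2 ** 64)                                  # number of distinct 64-bit patterns
int64_info = np.iinfo(np.int64)
uint64_info = np.iinfo(np.uint64)

# signed integers split the patterns between negative and non-negative values
print(int64_info.min == -2 ** 63, int64_info.max == 2 ** 63 - 1)
print(uint64_info.max == 2 ** 64 - 1)
```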

NumPy supports various [datatypes](https://docs.scipy.org/doc/numpy/user/basics.types.html) of different precision:

In [None]:
array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# 8 bit integer
numpy_array  = np.array(array, dtype=np.int8)
print("bytes per int8: ",numpy_array.itemsize)

# 16 bit integer
numpy_array  = np.array(array, dtype=np.int16)
print("bytes per int16: ",numpy_array.itemsize)

# 64 bit integer
numpy_array  = np.array(array, dtype=np.int64)
print("bytes per int64: ",numpy_array.itemsize)

# 16 bit float
numpy_array  = np.array(array, dtype=np.float16)
print("bytes per float16: ",numpy_array.itemsize)

# extended-precision float (np.longdouble; the alias np.float128 exists only on some platforms)
numpy_array  = np.array(array, dtype=np.longdouble)
print("bytes per longdouble: ",numpy_array.itemsize)

Depending on the datatype, values of different ranges and resolutions can be represented.
The more bits are available for encoding a given number, the higher the *precision* with which this number can be represented.
However, with any finite number of bits, precision will remain finite.

Such *finite precision* is a source of computational [errors](https://en.wikipedia.org/wiki/Round-off_error), in particular when *rounding errors* accumulate over many arithmetic operations.
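
A minimal sketch of accumulating rounding errors, using plain Python floats (64-bit). `math.fsum` from the standard library is brought in here only for comparison, as a summation routine that tracks the lost digits. It is not used elsewhere in this notebook.

```python
import math

total = 0.0
for _ in range(10):
    total += 0.1             # 0.1 has no exact binary representation

print(total == 1.0)          # the accumulated sum is not exactly 1.0
print(abs(total - 1.0))      # the error is tiny, but not zero

print(math.fsum([0.1] * 10) == 1.0)   # compensated summation recovers 1.0
```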

In [None]:
info = np.iinfo(np.int8)
print(info)

array = [info.min, info.max, info.min-1, info.max+1, 2*info.max, 1e8]
numpy_array  = np.array(array)
print("original array: ", numpy_array)
print("array as int8:  ", numpy_array.astype(np.int8))

In [None]:
info = np.finfo(np.float16)
print(info)

array = [info.min, info.max, info.max+10, info.max+100, 1e-8]
numpy_array  = np.array(array)
print("original array:    ", numpy_array)
print("array as float16:  ", numpy_array.astype(np.float16))

We will face some of the consequences of the inherently finite precision of computer arithmetic in the following lectures on *numerical methods*. 
From these, you may get the impression that higher precision datatypes are always the *better* choice.

This is not the case.
As we have seen here, higher precision also implies that more memory needs to be allocated for each single data item. For the analysis of large data sets or for large simulations, this may be a limiting factor.
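
To make the memory trade-off concrete, here is a small sketch comparing the same one million values stored in two precisions. The array size is an arbitrary choice for illustration.

```python
import numpy as np

n_items = 1_000_000
as_float64 = np.zeros(n_items, dtype=np.float64)
as_float32 = as_float64.astype(np.float32)

print(as_float64.nbytes / 1e6, "MB")   # 8 bytes per element
print(as_float32.nbytes / 1e6, "MB")   # 4 bytes per element
```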

## The matplotlib package


[Matplotlib](https://matplotlib.org/index.html) is a Python 2D plotting library. For basic plotting, it provides the `pyplot` module with a MATLAB-like interface. Besides the official [Pyplot tutorial](https://matplotlib.org/tutorials/introductory/pyplot.html), [this](http://www.labri.fr/perso/nrougier/teaching/matplotlib/#other-types-of-plots) and [this](https://github.com/matplotlib/AnatomyOfMatplotlib) tutorial also provide a nice overview of the various plot types and options. 
If you are interested in the finer details of some of the plotting approaches that matplotlib provides, head over [here](https://realpython.com/python-matplotlib-guide/). When you need inspiration for what plot type to use for your data or numerical results, this [gallery](https://matplotlib.org/gallery/index.html) may be helpful.
See [this](https://colab.research.google.com/notebooks/charts.ipynb) for plotting support and examples in Colab.



A 'plot' consists of a 'figure' and one or multiple 'axes' (subplots). Each 'axis' has an x-axis and a y-axis, with labels, a range, etc., and a title.
Here is an example of an empty plot.
See matplotlib's [anatomy of a figure](https://matplotlib.org/examples/showcase/anatomy.html) example for a more fine-grained view of figure components.

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=plt.figaspect(0.5))
ax = fig.add_subplot(111) 
ax.set(xlim=[0.5, 4.5], ylim=[-2, 8], 
       title='An empty figure',
       ylabel='Y-Axis', xlabel='X-Axis')
plt.show()

One or multiple data representations can be attached to an axis. Here, two plot statements (`plot`, `scatter`) are attached to a single axis.

In [None]:
# Plotting on axes

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot([1, 2, 3, 4], 
        [10, 20, 25, 30], 
        color='lightblue', linewidth=3)
ax.scatter([0.3, 3.8, 1.2, 2.5], 
           [11, 25, 9, 26], 
           c=[1, 2, 3, 5], marker='^')
ax.set_xlim(0.5, 4.5)
plt.show()

Instead of specifying figures and axes explicitly, `pyplot` allows plots to be defined more concisely by automatically creating an 'axis' and then reusing the current axis for any further `pyplot` command.


In [None]:
# Plotting with pyplot
plt.plot([1, 2, 3, 4], [10, 20, 25, 30], color='lightblue', linewidth=3)
plt.scatter([0.3, 3.8, 1.2, 2.5], [11, 25, 9, 26], c=[1, 2, 3, 5], marker='^')
plt.xlim(0.5, 4.5)
plt.show()

# `plt` simply uses the current axis:
#ax = plt.gca()                           # gca -> get current axis
#ax.set_xlim(1, 4.5)

This is useful for simple plots. For more complicated or customized plots, explicit declaration of figure and axes is preferable.

For example, this figure with subfigures has multiple axes.

In [None]:
# multiple axes, i.e. subplots

fig, axes = plt.subplots(nrows=2, ncols=2)
plt.show()

print("The 'axes' object is of type %s with shape "%type(axes), axes.shape)

In this case,  explicit declaration becomes necessary:

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2)
axes[0,0].set(title='Upper Left')
axes[0,1].set(title='Upper Right')
axes[1,0].set(title='Lower Left')
axes[1,1].set(title='Lower Right')

for ax in axes.flat:
    ax.set(xticks=[], yticks=[])            # Remove all xticks and yticks...

plt.show()

Now, let's add some data...
This example shows how a figure with two subplots could be created.

In [None]:
# generate some data
x = np.random.randint(low=1, high=11, size=50)
y = x + np.random.randint(1, 3, size=x.size)
data = np.column_stack((x, y))

# create figure with 2 subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
# define content of 1st subplot
axes[0].scatter(x=x, y=y, marker='o', c='r', edgecolor='b')
axes[0].set_title('Scatter: $x$ versus $y$')
axes[0].set_xlabel('$x$')
axes[0].set_ylabel('$y$')
# define content of 2nd subplot
axes[1].hist(data, bins=np.arange(data.min(), data.max() + 2), label=('x', 'y'))  # + 2 so that the largest value gets its own bin
axes[1].legend(loc='best')
axes[1].set_title('Frequencies of $x$ and $y$')
axes[1].yaxis.tick_right()




---

**Exercise (3):**

Plot $\sin(x)$ for $x\in [0, 4\pi]$.

*Hint:* See the [`np.arange()`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html) and [`np.linspace()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html) functions.

---
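
The two functions named in the hint differ in how the sample points are specified. A short sketch with arbitrary example values:

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)     # fixed step size, stop value excluded
by_count = np.linspace(0, 1, 5)     # fixed number of points, stop value included

print(by_step)
print(by_count)
```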



Suppose you want to save this figure. Normally, `fig.savefig()` is sufficient, but since this notebook is not running on your local machine, we need some additional Colab functionality.
See this [page](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=p2E4EKhCWEC5) for more information about downloading files from colab to your local system. 

In [None]:
# this only works when executed in colab
from google.colab import files    # we need this to interact 
                                  # with the remote file system
figure_name = 'test_fig.png'
fig.savefig(figure_name, dpi=300) # this is the standard matplotlib save command
                                  # Note that this is a command of a 'figure' object
files.download(figure_name)       # download file to your local machine
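
Outside of Colab, `fig.savefig()` alone writes the file. The following is a sketch under the assumption that the code runs locally as a script. The output location (a temporary directory) is a hypothetical choice, and the `Agg` backend line is only needed when no display is available.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")                   # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [10, 20, 25, 30])

# hypothetical output location: a temporary directory on the local machine
out_path = os.path.join(tempfile.mkdtemp(), "test_fig.png")
fig.savefig(out_path, dpi=300)
print(os.path.exists(out_path))
```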

You know the basics now! We will go into more detail as we face specific plotting needs during the course.

##### About 
This notebook is part of the *biosci670* course on *Mathematical Modeling and Methods for Biomedical Science*.
See https://github.com/cohmathonc/biosci670 for more information and material.

## The pandas package

The Python Data Analysis Library [pandas](https://pandas.pydata.org) provides data structures and data analysis tools that are convenient for working with datasets.
We will only introduce the most basic functions of this package here. For more details, see the official [package overview](http://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html) and [10 minutes introduction](http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html).

One of the data structures that pandas provides is the `DataFrame`. This corresponds to a *table* in which each row is identified by an *index* and each *column* by a column name.

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 3),columns=['A', 'B', 'C'])

print(df)

print("columns: ", df.columns)
print("index:   ", df.index)
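
The index does not have to consist of integers. As a sketch with made-up example data (the column and row names are hypothetical), a `DataFrame` can also be created with explicit row labels:

```python
import pandas as pd

# hypothetical example data, not from the course material
table = pd.DataFrame({"mass": [1.2, 3.4, 5.6], "count": [10, 20, 30]},
                     index=["sample_a", "sample_b", "sample_c"])

print(table)
print(list(table.columns))   # column names
print(list(table.index))     # row labels
print(table.shape)           # (number of rows, number of columns)
```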

Each column in a DataFrame can be accessed by its name. Pandas calls the object that forms a 'column' a *Series*.
Any Series has an *index*, a *name*, and *values* of a specific *data type*.

In [None]:
colA = df.A     # or df['A']

print( type(colA) )

print("Name of Series: ", colA.name)
print("Index of Series: ", colA.index)
print("values of Series: ", colA.values)
print("DataType of Series: ", colA.dtype)

The values of a pandas *Series* are stored as a NumPy array!

In [None]:
print( type(colA.values) )

And we can use them for computations like any other NumPy arrays. For example, we can compute a new column from existing columns and add it to the *DataFrame*.

In [None]:
df['new'] = df['A'] + df['B']**2

print(df)
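
A small sketch, using a separate illustrative frame `df_demo`, suggesting that arithmetic on columns gives the same numbers as the corresponding NumPy computation on their values:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [2.0, 3.0, 4.0]})

via_pandas = df_demo["A"] + df_demo["B"] ** 2                 # result is a pandas Series
via_numpy = df_demo["A"].values + df_demo["B"].values ** 2    # result is a NumPy array

print(type(via_pandas).__name__, type(via_numpy).__name__)
print(np.array_equal(via_pandas.values, via_numpy))
```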

It is also possible to access specific elements in a *DataFrame*, or change their value:

In [None]:
print( df.loc[3, 'new'] )

df.loc[3, 'new'] = 10

print( df.loc[3, 'new'] )

An existing *DataFrame* can be extended, for example by assigning a value to the next index label:

In [None]:
df.loc[8, 'new'] = 1

print(df)
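
When a row is added this way, only the assigned column receives a value. A sketch with a small illustrative frame `df_demo` shows that the remaining entries of the new row are marked as missing (`NaN`):

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, 2.0], "new": [3.0, 4.0]})
df_demo.loc[2, "new"] = 1          # label 2 does not exist yet, so a row is added

print(df_demo)
print(df_demo.loc[2].isna())       # which entries of the new row are missing
```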

There are different ways to create *DataFrame*s.
If you have various numpy arrays that you would like to collect, display or process, the following approach is convenient:

In [None]:
array_1 = np.random.randn( 10 )
array_2 = np.random.randn( 10 )
array_3 = np.random.randn( 10 )
array_4 = np.random.randn( 10 )

array_dict = {"my_first_array" : array_1,
              "my_second_array" : array_2,
              "my_third_array" : array_3,
              "my_fourth_array" : array_4}

df = pd.DataFrame(array_dict)

df

Pandas also supports plotting functionalities from matplotlib, directly from DataFrames. See the [user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) for an introduction:

In [None]:
df.plot.scatter(x="my_first_array", y="my_third_array")
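
The plotting methods of a DataFrame return a matplotlib axes object, so the result can be customized with the matplotlib commands introduced earlier. A sketch with illustrative data follows. The `Agg` backend line is only needed when running without a display.

```python
import matplotlib
matplotlib.use("Agg")     # only needed when running without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) ** 2})

ax = df_demo.plot.scatter(x="x", y="y")   # returns a matplotlib Axes object
ax.set_title("y versus x")
ax.set_xlabel("$x$")
print(type(ax))
```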

# Exercises

- In the [approximate Euler's number](https://github.com/cohmathonc/biosci670/blob/master/IntroductionComputationalMethods/exercises/03_ApproximateEulerNumber.ipynb) exercise you will explore the effect of working with different numeric types on the precision of your results, and visualize results using `matplotlib`. (**optional**)
