# Scientific Programming with NumPy

The most fundamental third-party package for scientific computing in Python is NumPy, which provides a multidimensional array (`ndarray`) data structure, along with associated functions and methods to manipulate it. While Python comes with several container types (`list`, `tuple`, `dict`), NumPy's arrays are implemented closer to the hardware, and are therefore more efficient for numerical work than the built-in types.
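
As a rough illustration of the efficiency claim, the sketch below sums one million integers stored in a list and the same values stored in an array, timing both. The variable names and the problem size are arbitrary choices made for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)  # explicit 64-bit integers to avoid overflow

# Both containers hold the same values, so the totals must agree
list_total = sum(py_list)
array_total = int(np_array.sum())

# Time 10 repetitions of each summation
list_time = timeit.timeit(lambda: sum(py_list), number=10)
array_time = timeit.timeit(lambda: np_array.sum(), number=10)

print('list: {0:.4f} s, array: {1:.4f} s'.format(list_time, array_time))
```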


## Basics of NumPy arrays

We now turn our attention to the NumPy library, which forms the base layer for the entire scientific Python ecosystem.  Once you have installed NumPy, you can import it as

In [None]:
import numpy

though here we will use the common shorthand

In [None]:
import numpy as np

As noted, the fundamental structure provided by NumPy is a powerful `array` object.  We'll begin by exploring how the `array` differs from Python lists, creating a simple `list` and an `array` with the same contents:

In [None]:
lst = list(range(1000))
arr = np.arange(1000)

# Here's what the array looks like
arr[:10]

In [None]:
type(arr)

Elements of a one-dimensional array are indexed with square brackets, as with lists:

In [None]:
arr[5:10]

In [None]:
arr[-1]

As with Python's built-in data structures, the membership of particular values can be tested with the `in` expression:

In [None]:
7 in arr

In [None]:
-15 in arr

The first difference to note between lists and arrays is that arrays are *homogeneous*; i.e. all elements of an array must be of the same type.  In contrast, lists can contain elements of mixed, arbitrary types. For example, we can change the first element in our list above to be a string:

In [None]:
lst[0] = 'a string inside a list'
lst[:10]

but the same cannot be done with an array, as we get an error message:

In [None]:
arr[0] = 'a string inside an array'

The information about the type of an array is contained in its `dtype` attribute:

In [None]:
arr.dtype

Once an array has been created, its `dtype` is fixed and it can only store elements of that type.  For this example, where the `dtype` is an integer type (64-bit on most platforms), a floating-point number that we store will be automatically converted into an integer, discarding its fractional part:

In [None]:
arr[0] = 1.234
arr[:10]
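
The short sketch below makes the conversion explicit. It uses new, illustrative variable names (`ints`, `floats`) and shows that the fractional part is dropped rather than rounded, and that `astype` can be used to obtain a floating-point copy when fractional values need to be kept.

```python
import numpy as np

ints = np.arange(5)           # integer dtype
ints[0] = 1.999               # fractional part is discarded, not rounded
print(ints)

floats = ints.astype(float)   # astype returns a converted copy
floats[0] = 1.999             # a float array keeps the full value
print(floats)
```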

Above, we created an `array` with the `arange` function, but there are many other ways to create arrays. In some applications, we may want to have an array initialized with a constant value, and very often this value is 0 or 1 (suitable as starting values for additive and multiplicative loops, respectively); `zeros` creates arrays of all zeros, with any desired dtype:

In [None]:
np.zeros(5, float)

In [None]:
np.zeros(3, int)

In [None]:
np.zeros(3, complex)

and similarly for `ones`:

In [None]:
print('5 ones: {0}'.format(np.ones(5)))

If we want an array initialized with an arbitrary value, we can create an empty array and then use its `fill` method to insert the value we want into the array:

In [None]:
a = np.empty(4)
a

In [None]:
a.fill(5.5)
a

### Exercise

What is another way to create an array of a particular value?

In [None]:
# Write your answer here

We have seen how the `arange` function generates an array for a range of integers. Relatedly, the `linspace` and `logspace` functions create linearly and logarithmically spaced grids, respectively, with a fixed number of points that includes both ends of the specified interval:

In [None]:
np.linspace(0, 1, num=5)

In [None]:
np.logspace(1, 4, num=4)
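
To make the endpoint behavior concrete, the following sketch compares the two grid functions with a step-based `arange` call. The variable names are chosen here for illustration only.

```python
import numpy as np

lin = np.linspace(0, 1, num=5)      # 5 points, both 0 and 1 included
log = np.logspace(1, 4, num=4)      # exponents 1..4, i.e. 10**1 up to 10**4
half_open = np.arange(0, 1, 0.25)   # step-based; the stop value is excluded

print(lin)
print(log)
print(half_open)
```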

Finally, it is often useful to create arrays with random numbers that follow a specific distribution.  The `np.random` module contains a number of functions for this purpose. For example, this will produce an array of 5 random samples taken from a standard normal distribution (mean 0 and variance 1):

In [None]:
np.random.randn(5)

whereas this will give 10 samples from a normal distribution with a mean of 10 and a standard deviation of 3:

In [None]:
norm10 = np.random.normal(loc=10, scale=3, size=10)

norm10

### Exercise

Generate a sample of 10 values drawn from a **binomial** distribution, with `n=5` and probability `p=0.5`.

In [None]:
# Write your answer here

Computers cannot generate *true* random numbers, but rather use an algorithm to generate pseudo-random numbers that are difficult to distinguish from a true set of random numbers. The `np.random` functions used here employ the [Mersenne Twister algorithm](https://en.wikipedia.org/wiki/Mersenne_Twister). For a given **seed** value, the algorithm generates a deterministic sequence of numbers that can be shown to have an extremely long period.

In [None]:
np.random.seed(42)

In [None]:
np.random.random(10)

To illustrate that this is a deterministic algorithm, we can re-seed the pseudo-random number generator and draw values again:

In [None]:
np.random.seed(42)

In [None]:
np.random.random(10)
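
Recent NumPy releases (1.17 and later) also provide a `Generator` interface, created with `np.random.default_rng`, which carries its own state instead of relying on a global seed and uses a different default algorithm (PCG64). The sketch below assumes such a NumPy version is installed; it shows the same reproducibility property with two independently seeded generators.

```python
import numpy as np

# Two generators created with the same seed produce the same stream
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

draws_a = rng_a.random(5)
draws_b = rng_b.random(5)
print(np.array_equal(draws_a, draws_b))

# Distribution methods live on the generator object itself
normal_draws = rng_a.normal(loc=10, scale=3, size=4)
print(normal_draws.shape)
```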

## Indexing with other arrays

Above we saw how to index arrays with single numbers and slices, just like Python lists.  But arrays allow for a more sophisticated ("fancy") kind of indexing which is very powerful: you can index an array with another array:

In [None]:
arr[[5, 10, 15]]

This is particularly useful for extracting information from an array that matches a certain condition, using an array of **boolean** values.

Imagine that in the array `norm10` we want to replace all values larger than 9 with the value 0.  We can do this by first creating a *mask* that indicates where this condition is true or false:

In [None]:
mask = norm10 > 9
mask

Notice that the mask is just an array of `bool` elements; we can use it to either index those values or to reset them to 0:

In [None]:
norm10[mask]

In [None]:
norm10[mask] = 0

norm10
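
Masks can also be combined. The sketch below, which uses a small hand-written array so that the result is predictable, joins two conditions with the element-wise `&` operator and uses `np.where` to build a modified copy without changing the original.

```python
import numpy as np

values = np.array([1, 5, 9, 3, 7, 2])

# Parentheses are required: & binds more tightly than the comparisons
mask = (values > 2) & (values < 8)

selected = values[mask]               # elements where the mask is True
replaced = np.where(mask, 0, values)  # new array; `values` is left untouched

print(selected)   # [5 3 7]
print(replaced)   # [1 0 9 0 0 2]
```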

### Exercise

Create a 4 x 4 matrix of ones, then change the first two columns of the first two rows to zeros.

In [None]:
# Write your answer here

## Multidimensional Arrays

The NumPy type `ndarray` stands for *n-dimensional array*. Hence, arrays can be constructed with an arbitrary number of dimensions. For example, a list of lists can be used to initialize a two-dimensional array:

In [None]:
lst2 = [[1, 2], [3, 4]]
arr2 = np.array([[1, 2], [3, 4]])
arr2.shape

With two-dimensional arrays we start seeing the power of NumPy arrays: while a nested list can be indexed by chaining the `[ ]` operator, `ndarray` supports a much more natural indexing syntax with a single `[ ]` and a set of indices separated by commas:

In [None]:
lst2[0][1]


In [None]:
arr2[0,1]

Most of the array creation functions listed above can be used with more than one dimension, for example:

In [None]:
np.zeros((2,3))

In [None]:
np.random.normal(10, 3, size=(2, 4))

Using the `reshape` method, the shape of an array can be changed at any time, as long as the total number of elements is unchanged.  

For example, if we want a 2x4 array with numbers increasing from 0:

In [None]:
arr = np.arange(8).reshape(2,4)

arr

Or a 4x2 array:

In [None]:
arr.reshape(4,2)

Notice that the `reshape` method returns a new array object and does not change the shape of the original array (*i.e.* it is not an *in-place* operation).

In [None]:
arr

This array can be reverted to its original one-dimensional shape with the `flatten` method:

In [None]:
arr.flatten()
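
One detail worth checking is whether these operations copy the data. In the sketch below (with illustrative names), `reshape` on a freshly created contiguous array returns a *view* that shares memory with the original, whereas `flatten` always returns an independent copy.

```python
import numpy as np

base = np.arange(8)
reshaped = base.reshape(2, 4)     # new array object, same underlying data here
flat_copy = reshaped.flatten()    # independent copy of the data

reshaped[0, 0] = 99               # write through the view

print(base.shape)                 # (8,)  the original shape is unchanged
print(base[0])                    # 99    the data is shared with the view
print(flat_copy[0])               # 0     the copy was taken before the write
```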

With multidimensional arrays, you can also index with slices, and you can mix and match slices and single indices among the axes of an array:

In [None]:
arr[1, 2:4]

In [None]:
arr[:, 2]

If you only provide one index, then you will index the first dimension (*i.e.* row):

In [None]:
arr[1]

Now that we have seen how to create arrays with more than one dimension, let's look at some of the most useful attributes and methods that arrays have. 

The following provide basic information about the size, shape and data in the array:

In [None]:
print('Data type                :', arr.dtype)
print('Total number of elements :', arr.size)
print('Number of dimensions     :', arr.ndim)
print('Shape (dimensionality)   :', arr.shape)
print('Memory used (in bytes)   :', arr.nbytes)

Arrays also have many useful summarization methods, including:

In [None]:
print('Minimum and maximum             :', arr.min(), arr.max())
print('Sum and product of all elements :', arr.sum(), arr.prod())
print('Mean and standard deviation     :', arr.mean(), arr.std())

The above operations are computed over *all* the elements of the array.  But for a multidimensional array, it is possible to restrict the computation to a single dimension, by passing the `axis` parameter:

In [None]:
arr.sum(axis=0)

In [None]:
arr.sum(axis=1)

As you can see in this example, the value of the `axis` parameter is the dimension which will be *consumed* once the operation has been carried out.  Thus, `axis=0` sums over the rows, leaving one total per column.
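
The sketch below checks the resulting shapes on a small 2x4 array (named `grid` here for illustration), and also shows the `keepdims=True` option, which retains the consumed axis with length 1.

```python
import numpy as np

grid = np.arange(8).reshape(2, 4)

col_totals = grid.sum(axis=0)                     # axis 0 consumed: shape (4,)
row_totals = grid.sum(axis=1)                     # axis 1 consumed: shape (2,)
row_totals_2d = grid.sum(axis=1, keepdims=True)   # axis kept with length 1: shape (2, 1)

print(col_totals)            # [ 4  6  8 10]
print(row_totals)            # [ 6 22]
print(row_totals_2d.shape)   # (2, 1)
```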

### Exercise

Create an array with 4 dimensions and shape `(3,4,5,6)`. Then, sum along the axis number 2 (i.e. the *third* axis).  Check the shape of the resulting array.

In [None]:
# Write your answer here

Another widely used property of arrays with multiple dimensions is the `.T` attribute, which accesses the transpose of the array:

In [None]:
arr.T

Notice that `T` is an attribute, and not a method call. Equivalently,

In [None]:
arr.transpose()

In [None]:
np.transpose(arr)

Arrays have many more methods and attributes than those shown here; the following lists all of the public ones:

In [None]:
[attr for attr in dir(arr) if not attr.startswith('__')]

## Sorting

There are several ways of sorting the values of an array. If we take a random array of integers (via the `randint` function), we can sort it using the `sort` function:

In [None]:
random_integers = np.random.randint(10, size=10)
random_integers

In [None]:
np.sort(random_integers)

The `sort` method will perform the sort *in place*:

In [None]:
random_integers.sort()

Or, we can use Python's built-in `sorted` function, but it will return a `list` rather than an array:

In [None]:
sorted(random_integers)

Notice that the random integer array contains duplicates of certain values. The `unique` function returns an array of just the unique values:

In [None]:
np.unique(random_integers)

It's also possible to sort array values **indirectly** using `argsort`. This returns the array of index values that would sort the array if used as an index. For example, take the heights of 5 individuals:

In [None]:
height = np.array([71, 71, 63, 74, 65])

This yields the sequence of index values:

In [None]:
np.argsort(height)

To verify this, try using them to index `height`.

In [None]:
# Write your answer here

Notice that for the tied values, the sorting algorithm kept them in their original order in this case.
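
The default sorting algorithm does not promise to preserve the order of ties in general. When that matters, `argsort` accepts `kind='stable'`, as in this sketch on a separate, made-up array of scores:

```python
import numpy as np

scores = np.array([3, 1, 3, 2, 1])

# A stable sort keeps tied elements in their original relative order
order = np.argsort(scores, kind='stable')

print(order)           # [1 4 3 0 2]
print(scores[order])   # [1 1 2 3 3]
```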

It is possible to sort multiple arrays of variables as a group, as you would sort columns in a spreadsheet. Suppose, for example, we had a second array of values corresponding to the weights of the individuals for which we already have the heights. We can use the `lexsort` function to sort them lexically, with the last array being used as the primary sort index.

`lexsort` expects a tuple or array of sequences.

In [None]:
weight = np.array([201, 186, 164, 198, 170])

In [None]:
sort_index = np.lexsort((weight, height))

We can confirm that these sort the values correctly:

In [None]:
height[sort_index]

Notice that the tied height values were ordered according to weight:

In [None]:
weight[sort_index]
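
Because the key order in `lexsort` is easy to get backwards, here is a minimal sketch with made-up arrays named after their roles. The *last* key in the tuple is the primary one; earlier keys only break ties.

```python
import numpy as np

primary = np.array([2, 1, 2, 1])
secondary = np.array([9, 8, 7, 6])

# Keys are listed from least to most significant
order = np.lexsort((secondary, primary))

print(order)              # [3 1 2 0]
print(primary[order])     # [1 1 2 2]
print(secondary[order])   # [6 8 7 9]
```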

## Array Operations

Arrays support all regular arithmetic operators, and the NumPy library also includes a rich collection of basic and advanced mathematical functions that operate on arrays.  It is important to remember that, in general, all operations with arrays are applied *element-wise*, i.e., are applied to all the elements of the array at the same time.  Consider for example:

In [None]:
arr1 = np.arange(4)
arr2 = np.arange(10, 14)
arr_sum = arr1 + arr2

print('{0} + {1} = {2}'.format(arr1, arr2, arr_sum))

Importantly, the multiplication operator is also applied element-wise; it is *not* the matrix multiplication from linear algebra (as it is in Matlab, for example):

In [None]:
print('{0} X {1} = {2}'.format(arr1, arr2, arr1*arr2))
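
For comparison, Python 3.5 and later provide the `@` operator, which NumPy arrays implement as matrix multiplication. The sketch below contrasts the two operators on a small matrix and on two vectors; the names are illustrative.

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])

elementwise = M * M    # each entry squared
matrix_prod = M @ M    # rows-times-columns matrix product

u = np.arange(4)
w = np.arange(10, 14)
inner = u @ w          # for 1-d arrays, @ gives the scalar inner product

print(elementwise)     # [[ 1  4] [ 9 16]]
print(matrix_prod)     # [[ 7 10] [15 22]]
print(inner)           # 74
```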

While this means that arrays to which arithmetic operators are applied must have matching dimensions, NumPy will *broadcast* dimensions when possible.  For example, suppose that you want to add the number 1.5 to `arr1`; the following would be a valid way to do it:

In [None]:
arr1 + 1.5*np.ones(4)

But thanks to NumPy's broadcasting rules, the following is equally valid:

In [None]:
arr1 + 1.5

In this case, NumPy looked at both operands and saw that the first (`arr1`) was a one-dimensional array of length 4 and the second was a scalar (considered a zero-dimensional object), and expanded the second to match the dimension of the first. 

The broadcasting rules allow NumPy to:

* *create* new dimensions of length 1 (since this doesn't change the size of the array)
* 'stretch' a dimension of length 1 that needs to be matched to a dimension of a different size.

So in the above example, the scalar 1.5 is effectively:

* first 'promoted' to a 1-dimensional array of length 1
* then, this array is 'stretched' to length 4 to match the dimension of `arr1`.

After these two operations are complete, the addition can proceed as now both operands are one-dimensional arrays of length 4.

This broadcasting behavior is powerful, especially because it doesn't require that the data be replicated in order to match the larger dimensions.  In the example above the operation is carried out *as if* the 1.5 was a 1-d array with 1.5 in all of its entries, but no such array is created in memory.  This is important for cases when the arrays in question are large and replication can have significant performance implications.

The general rule is: when operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward, creating dimensions of length 1 as needed. Two dimensions are considered compatible when

* they are equal to begin with, or
* one of them is 1; in this case NumPy will do the 'stretching' to make them equal.

If these conditions are not met, a `ValueError: operands could not be broadcast together` exception is raised, indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays.
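
The compatibility rule can be checked directly by looking at result shapes. The sketch below uses arrays of ones purely as placeholders and catches the exception raised for an incompatible pair.

```python
import numpy as np

a = np.ones((2, 4))

# (2, 4) with (4,): trailing dimensions are equal, a leading 1 is created
shape_vec = (a + np.ones(4)).shape

# (2, 4) with (2, 1): the length-1 dimension is stretched to 4
shape_col = (a + np.ones((2, 1))).shape

# (2, 4) with (2,): trailing dimensions 4 and 2 differ and neither is 1
message = None
try:
    a + np.ones(2)
except ValueError as err:
    message = str(err)

print(shape_vec, shape_col)
print(message)
```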

This shows how the broadcasting rules work in several dimensions:

In [None]:
b = np.array([2, 3, 4, 5])
bcast_sum = arr + b

print('{0}\n\n+ {1}\n{2}\n{3}'.format(arr, b, '-'*12, bcast_sum))

Now, how could you use broadcasting to, say, add `[4, 6]` along the rows to `arr` above?  Simply performing the direct addition will produce the error we previously mentioned:

In [None]:
c = np.array([4, 6])
arr + c

According to the rules above, the array `c` would need to have a *trailing* dimension of 1 for the broadcasting to work.  NumPy allows you to 'inject' new dimensions anywhere into an array on the fly, by indexing it with `None` (or the special object `np.newaxis`):

In [None]:
c

In [None]:
cplus = c[:, None]
cplus

This is exactly what we need, and indeed it works:

In [None]:
arr + cplus
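
As a self-contained check of the shapes involved, the sketch below rebuilds a 2x4 array under the illustrative name `table` and turns a length-2 vector into a column of shape `(2, 1)`, either with `np.newaxis` or with `reshape(-1, 1)`.

```python
import numpy as np

table = np.arange(8).reshape(2, 4)
offsets = np.array([4, 6])

column = offsets[:, np.newaxis]        # shape (2, 1)
same_column = offsets.reshape(-1, 1)   # equivalent; -1 infers the length

result = table + column                # each row gets its own offset

print(column.shape)   # (2, 1)
print(result)         # [[ 4  5  6  7] [10 11 12 13]]
```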

For the full broadcasting rules, please see the official NumPy docs, which describe them in detail and with more complex examples.

### Exercises

Generate the following structure as a NumPy array, without typing the values by hand. Then, create another array containing just the 2nd and 4th rows.

        [[1,  6, 11],
         [2,  7, 12],
         [3,  8, 13],
         [4,  9, 14],
         [5, 10, 15]]

In [None]:
# Write your answer here

Generate a 10 x 3 array of random numbers (in range [0,1]). For each row, pick the number closest to 0.5.

*Hints*:

* NumPy functions/methods that may be useful here include `abs`, `choose` and `argsort`

In [None]:
# Write your answer here

## Linear Algebra

NumPy ships with a basic linear algebra library, and all arrays have a `dot` method whose behavior is that of the scalar dot product when its arguments are vectors (one-dimensional arrays) and the traditional matrix multiplication when one or both of its arguments are two-dimensional arrays:

In [None]:
v1 = np.array([2, 3, 4])
v2 = np.array([1, 0, 1])
dprod = v1.dot(v2)

print(v1, '.', v2, '=', dprod)

There is an equivalent `dot` function:

In [None]:
np.dot(v1, v2)

Here is a regular matrix-vector multiplication. Note that the array `v1` should be viewed as a **column** vector in traditional linear algebra notation; NumPy makes no distinction between row and column vectors and simply verifies that the dimensions match the required rules of matrix multiplication. In this case we have a $2 \times 3$ matrix multiplied by a 3-vector, which produces a 2-vector:

In [None]:
A = np.arange(6).reshape(2, 3)

In [None]:
print(A.shape, v1.shape)

In [None]:
A.dot(v1)

For matrix-matrix multiplication, the same dimension-matching rules must be satisfied, e.g. consider the difference between $A \times A^T$:

In [None]:
A.dot(A.T)

and $A^T \times A$:

In [None]:
A.T.dot(A)

Furthermore, the `numpy.linalg` module includes additional functionality such as determinants, matrix norms, Cholesky, eigenvalue and singular value decompositions, etc.  For even more linear algebra tools, `scipy.linalg` contains the majority of the tools in the classic LAPACK libraries, and `scipy.sparse.linalg` provides functions to operate on sparse matrices.  We refer the reader to the [NumPy](http://docs.scipy.org/doc/numpy/reference/) and [SciPy](http://docs.scipy.org/doc/scipy/reference/) documentation for additional details on these.
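
A few of the `numpy.linalg` routines are sketched below on a small diagonal matrix chosen so that the answers are easy to verify by hand. The matrix and vector names are illustrative.

```python
import numpy as np

M = np.array([[2.0, 0.0],
              [0.0, 3.0]])
rhs = np.array([2.0, 3.0])

det = np.linalg.det(M)                        # product of the diagonal: 6
x = np.linalg.solve(M, rhs)                   # solves M x = rhs, giving [1, 1]
length = np.linalg.norm(np.array([3.0, 4.0])) # Euclidean norm: 5
eigenvalues = np.linalg.eigvalsh(M)           # symmetric matrix; ascending order

print(det, x, length, eigenvalues)
```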

## Reading and writing arrays to disk

NumPy lets you read and write arrays into files in a number of ways.  In order to use these tools well, it is critical to understand the difference between a *text* and a *binary* file containing numerical data.  In a text file, the number $\pi$ could be written as "3.141592653589793", for example: a string of digits that a human can read, in this case with 15 decimal digits.

In contrast, that same number written to a binary file would be encoded as 8 characters (bytes) that are not readable by a human but which contain the exact same data that the variable `pi` had in the computer's memory.  

The tradeoffs between the two modes are thus:

* **Text mode**: occupies more space, precision can be lost (if not all digits are written to disk), but is readable and editable by hand with a text editor.  Can *only* be used for one- and two-dimensional arrays.

* **Binary mode**: compact and exact representation of the data in memory, can't be read or edited by hand.  Arrays of any size and dimensionality can be saved and read without loss of information.
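
The precision tradeoff can be seen in a small round trip. The sketch below writes the same two values in both modes to a temporary directory (the file names are arbitrary), deliberately using a short text format so that digits are lost.

```python
import os
import tempfile

import numpy as np

data = np.array([np.pi, np.e])

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, 'values.txt')
    binary_path = os.path.join(tmp, 'values.npy')

    np.savetxt(text_path, data, fmt='%.2e')   # only 3 significant digits written
    np.save(binary_path, data)                # exact bytes of the array

    from_text = np.loadtxt(text_path)
    from_binary = np.load(binary_path)

print(np.array_equal(data, from_text))     # False: digits were dropped
print(np.array_equal(data, from_binary))   # True: exact round trip
```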

First, let's see how to read and write arrays in text mode.  The `np.savetxt` function saves an array to a text file, with options to control the precision, separators and even adding a header:

In [None]:
arr = np.arange(10).reshape(2, 5)
np.savetxt('test.out', arr, fmt='%.2e', header="My dataset")

In [None]:
!cat test.out

And this same type of file can then be read with the matching `np.loadtxt` function:

In [None]:
arr2 = np.loadtxt('test.out')
arr2

For binary data, NumPy provides the `np.save` and `np.savez` routines.  `np.save` saves a single array to a file with `.npy` extension, while `np.savez` can be used to save a **group** of arrays into a single file with `.npz` extension.  The files created with these routines can then be read with the `np.load` function.

Let us first see how to use the simpler `np.save` function to save a single array:

In [None]:
np.save('test.npy', arr2)

Now we can read this back.

In [None]:
arr2n = np.load('test.npy')

Let's see if any element is non-zero in the difference. A value of True would be a problem.

In [None]:
np.any(arr2 - arr2n)

Now let us see how the `np.savez` function works.  You give it a filename and either a sequence of arrays or a set of keywords.  In the first mode, the function will automatically name the saved arrays in the archive as `arr_0`, `arr_1`, etc:

In [None]:
np.savez('test.npz', arr, arr2)
arrays = np.load('test.npz')
arrays.files

Alternatively, we can explicitly name the arrays we save using keyword arguments for `savez`:

In [None]:
np.savez('test.npz', array1=arr, array2=arr2)
arrays = np.load('test.npz')
arrays.files

The object returned by `np.load` from an `.npz` file works like a dictionary:

In [None]:
# First row of array
arrays['array1'][0]

Equivalently, you can access its constituent files by attribute using its special `.f` field:

In [None]:
arrays.f.array1[0]
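
The object returned for an `.npz` file reads each array from disk when it is accessed and keeps the archive open until it is closed. A minimal sketch of using it as a context manager, with a temporary file and illustrative array names, could look like this:

```python
import os
import tempfile

import numpy as np

first = np.arange(3)
second = np.linspace(0, 1, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bundle.npz')
    np.savez(path, first=first, second=second)

    # The with-block closes the archive once we are done reading
    with np.load(path) as bundle:
        names = sorted(bundle.files)
        loaded_second = bundle['second']   # array is read into memory here

print(names)
print(loaded_second)
```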

This `.npz` format is a very convenient way to package compactly and without loss of information, into a single file, a group of related arrays that pertain to a specific problem.  At some point, however, the complexity of your dataset may be such that the optimal approach is to use one of the standard formats in scientific data processing that have been designed to handle complex datasets, such as NetCDF or HDF5.  

## References

[Scientific Python Lecture Notes](http://scipy-lectures.github.io)