# NumPy

### Questions:

  * What is vectorization?
  * Why use NumPy instead of pure Python?
  * How to use basic NumPy?

### Objectives

*    Understand the NumPy array object
*    Be able to use basic NumPy functionality
*    Understand enough of NumPy to search for answers to the rest of your questions ;)


## What is an array?

For example, consider 
``` python 
[1, 2.5, 'asdf', False, [1.5, True]]
```

This is a Python list, but its elements all have different types. When you do math on such a list, every element has to be handled separately.

NumPy is the most used library for scientific computing. Even if you are not using it directly, chances are high that some library uses it in the background. NumPy provides the high-performance multidimensional array object and tools to use it.

An array is a 'grid' of values that all have the same type. It is indexed by tuples of non-negative integers and provides the framework for multiple dimensions. An array has:

  *  <font color=blue>dtype</font> - data type. Arrays always contain one type

  *  <font color=blue>shape</font> - shape of the data, for example (3, 2) or (3, 2, 500) or even (500,) (one-dimensional) or () (zero-dimensional).

  *  <font color=blue>data</font> - raw data storage in memory. This can be passed to C or Fortran code for efficient calculations.
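
The attributes listed above can be inspected directly on any array. The following is a minimal sketch with arbitrary illustration values; the variable names `a` and `mixed` are chosen only for this example.

```python
import numpy as np

# A small 3x2 array of floats; the values are arbitrary illustration data.
a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

print(a.dtype)   # float64: one data type for every element
print(a.shape)   # (3, 2): three rows, two columns
print(a.ndim)    # 2: number of dimensions

# Mixed Python numbers are converted to one common dtype.
mixed = np.array([1, 2.5, True])
print(mixed.dtype)  # float64
```

Unlike the Python list shown earlier, the array does not keep a separate type per element: the integer and the boolean are both stored as floats.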

For example, in a Jupyter notebook we can create a small array and look at its raw data buffer:

In [6]:
import numpy as np
a = np.array([[1, 2], [3, 4]])
print(a)
a.data

[[1 2]
 [3 4]]

<memory at 0x000001B0BD0A5040>

To compare the performance of pure Python and NumPy, first create one list and one 'empty' list to store the result in:

In [12]:
a = list(range(10000))
b = [ 0 ] * 10000

In a new cell starting with *%%timeit*, loop through the list *a* and fill the second list *b* with the squares of the elements of *a*:

In [13]:
%%timeit
for i in range(len(a)):
    b[i] = a[i]**2

2.55 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


That looks and feels quite fast. But let’s take a look at how NumPy performs for the same task.

For the NumPy example, create one array. No 'empty' result array is needed, because the operation itself creates the array that holds the result:

In [14]:
import numpy as np
a = np.arange(10000)

In a new cell starting with *%%timeit*, fill *b* with *a* squared

In [15]:
%%timeit
b = a ** 2

7.21 µs ± 333 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


We see that working with traditional Python lists is much slower than working with NumPy arrays: in this measurement, the loop took roughly 350 times as long (2.55 ms versus 7.21 µs).
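
Outside a notebook, where *%%timeit* is not available, a similar comparison can be sketched with the standard-library `timeit` module. The function names `square_with_loop` and `square_vectorized` are hypothetical, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 10_000
values_list = list(range(n))
values_array = np.arange(n)

def square_with_loop():
    # Pure Python: one interpreted operation per element.
    return [v ** 2 for v in values_list]

def square_vectorized():
    # NumPy: a single call, the loop over elements runs in compiled code.
    return values_array ** 2

# Both approaches produce the same numbers.
same_result = square_with_loop() == square_vectorized().tolist()
print(same_result)

loop_time = timeit.timeit(square_with_loop, number=100)
numpy_time = timeit.timeit(square_vectorized, number=100)
print(f"loop: {loop_time:.4f} s, NumPy: {numpy_time:.4f} s for 100 runs")
```

Replacing an explicit loop with one operation on the whole array is what is meant by vectorization.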

## Creating arrays

The basic way to create an array is from a Python list with *numpy.array()*. The attributes *numpy.ndarray.shape* and *numpy.ndarray.size* then report its shape and its number of elements:

In [35]:
# a = np.array([1,2,3])               # 1-dimensional array (rank 1)
# b = np.array([[1,2,3],[4,5,6]])     # 2-dimensional array (rank 2)

# b.shape                             # the shape (rows,columns)
# b.size                              # number of elements

In addition to the above, there are many other ways of creating arrays, depending on the desired content (*numpy.zeros()*, *numpy.ones()*, *numpy.full()*, *numpy.eye()*, *numpy.arange()*, *numpy.linspace()*):

In [45]:
## examples
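
One possible set of examples for the creation functions named above is sketched here. The shapes and fill values are arbitrary, and the variable names are chosen only for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))        # 2x3 array filled with 0.0
ones = np.ones(4)               # 1-dimensional array of four 1.0 values
sevens = np.full((2, 2), 7)     # 2x2 array filled with the value 7
identity = np.eye(3)            # 3x3 identity matrix
steps = np.arange(0, 10, 2)     # start, stop (exclusive), step: 0, 2, 4, 6, 8
points = np.linspace(0, 1, 5)   # 5 evenly spaced values from 0 to 1, both ends included

print(steps)
print(points)
```

Note the difference between the last two: *numpy.arange()* takes a step size and excludes the stop value, while *numpy.linspace()* takes the number of points and includes it.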


Arrays can also be saved to and loaded from a *.npy* file (*numpy.save()*, *numpy.load()*):

In [47]:
# np.save('x.npy', a)           # save the array a to a .npy file
# x = np.load('x.npy')          # load an array from a .npy file and store it in variable x
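
A self-contained round trip might look like the following sketch. It writes into a temporary directory (an assumption made here so that the example leaves no files behind); the file name `x.npy` follows the cell above.

```python
import os
import tempfile
import numpy as np

a = np.arange(6).reshape(2, 3)

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "x.npy")
    np.save(path, a)     # store array a in the binary .npy format
    x = np.load(path)    # read it back into a new array

# Values, shape, and dtype all survive the round trip.
print(np.array_equal(a, x), x.shape, x.dtype == a.dtype)
```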

On many occasions (especially when something behaves differently than expected) it is useful to check and control the datatype of the array (*numpy.ndarray.dtype*, *numpy.ndarray.astype()*):

In [49]:
# d.dtype                    # datatype of the array
# d.astype('int')            # change datatype from boolean to integer

In the last example, *.astype('int')* makes a copy of the array and allocates new memory for the data. By default it does so even if the dtype is exactly the same as before; only with *copy=False* is the copy skipped when no conversion is needed. Understanding and minimizing copies is one of the most important things to do for speed.
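
The copying behaviour of *astype()* can be checked with a short sketch. The boolean array `d` below is a hypothetical stand-in, since the cell above does not show how `d` was created.

```python
import numpy as np

d = np.array([True, False, True])   # hypothetical boolean array standing in for d
print(d.dtype)                      # bool

as_int = d.astype(int)              # new array: True -> 1, False -> 0
print(as_int)                       # [1 0 1]

# By default astype() copies even when the dtype does not change.
same_dtype_copy = d.astype(bool)
print(np.shares_memory(d, same_dtype_copy))   # False

# With copy=False the original array is returned when no conversion is needed.
no_copy = d.astype(bool, copy=False)
print(no_copy is d)                 # True
```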

## Array maths

Clearly, you can do math on arrays. Math in *NumPy* is very fast because it is implemented in C or Fortran, just as in most other high-level languages such as R, Matlab, etc.

By default, all math in *NumPy* is element-by-element. This is unlike Matlab, where most things are element-by-element, but * becomes matrix multiplication. NumPy values consistency and does not treat 2-dimensional arrays specially (*numpy.add*):

In [68]:
# a = np.array([[1,2],[3,4]])
# b = np.array([[5,6],[7,8]])

# c = a + b
# d = np.add(a,b)

Also: - (*numpy.subtract*), * (*numpy.multiply*), / (*numpy.divide*), *numpy.sqrt*, *numpy.sum()*, *numpy.mean()*, …
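
To make the element-by-element rule concrete, the sketch below reuses the two small arrays from the cell above. It also shows the `@` operator, which is how NumPy spells the matrix product that Matlab assigns to `*`.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b     # element-by-element product, same as np.multiply(a, b)
matrix_product = a @ b  # matrix product, same as np.matmul(a, b)

print(elementwise)      # [[ 5 12] [21 32]]
print(matrix_product)   # [[19 22] [43 50]]
print(np.sqrt(a))       # element-by-element square root
print(a.sum(), a.mean())  # reductions over all elements: 10 2.5
```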

## Indexing and Slicing

**See also [Numpy basic indexing docs](https://numpy.org/doc/stable/user/basics.indexing.html#basics-indexing)**

NumPy has many ways to extract values out of arrays:

*    You can select a single element
*    You can select rows or columns
*    You can select ranges where a condition is true.

Clever and efficient use of these operations is key to NumPy’s speed: you should try to use these selectors (written in C) to extract data to be used with other NumPy functions written in C or Fortran. This will give you the benefits of Python with most of the speed of C.

In [64]:
# a = np.arange(16).reshape(4, 4)  # 4x4 matrix from 0 to 15
# a[0]                             # first row
# a[:,0]                           # first column
# a[1:3,1:3]                       # middle 2x2 array
# a[(0, 1), (1, 1)]                # second element of first and second row as array

Boolean indexing on the array created above:

In [65]:
# idx = (a > 0)      # creates boolean array of the same shape as a
# a[idx]             # array with matching values of above criterion
# a[a > 0]           # same as above in one line
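
The selections above can be combined into one runnable sketch. The conditions `a % 2 == 0` and `b > 10` are arbitrary choices made for illustration; the last step shows that a boolean mask can also be used on the left-hand side of an assignment.

```python
import numpy as np

a = np.arange(16).reshape(4, 4)

first_row = a[0]             # 0, 1, 2, 3
first_column = a[:, 0]       # 0, 4, 8, 12
middle = a[1:3, 1:3]         # middle 2x2 block
picked = a[(0, 1), (1, 1)]   # elements a[0, 1] and a[1, 1]

# A boolean mask selects the matching values as a 1-dimensional array.
even = a[a % 2 == 0]
print(even)

# The same kind of mask can be used for assignment.
b = a.copy()
b[b > 10] = 0                # set every value above 10 to zero
print(b)
```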

## Types of operations

There are different types of standard operations in NumPy:

*ufuncs*, "universal functions": These are element-by-element functions with standardized arguments:

 *   One, two, or three input arguments
 *   For example, a + b is similar to np.add(a, b) but the ufunc has more control.
 *   out= output argument, store output in this array (rather than make a new array) - saves copying data!
 *   See the [full reference](https://numpy.org/doc/stable/reference/ufuncs.html)
 *   They also do broadcasting ([ref](https://numpy.org/doc/stable/user/basics.broadcasting.html#basics-broadcasting)). Can you add a 1-dimensional array of shape (3,) to a 2-dimensional array of shape (2, 3)? With broadcasting you can! Broadcasting is smart and consistent about what it does; the manual page on broadcasting explains the rules in full. The basic idea is that it expands the dimensions of the smaller array so that the two shapes are compatible.

In [66]:
# a = np.array([[1, 2, 3],[4, 5, 6]])
# b = np.array([10, 10, 10])
# a + b
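
The following sketch illustrates both broadcasting and the *out=* argument. The values are arbitrary, and the names `result` and `broadcast_failed` are introduced only for this example.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
b = np.array([10, 20, 30])             # shape (3,)

# b is treated as shape (1, 3) and repeated along the first axis.
print(a + b)

# Shapes are compared starting from the last dimension: (2, 3) and (2,) do not match.
broadcast_failed = False
try:
    a + np.array([10, 20])
except ValueError as err:
    broadcast_failed = True
    print("cannot broadcast:", err)

# out= writes the result into an existing array instead of allocating a new one.
result = np.empty_like(a)
np.add(a, b, out=result)
print(result)
```

Two dimensions are compatible when they are equal or when one of them is 1, which is why shape (3,) works with (2, 3) but shape (2,) does not.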

Other functions: there are countless other functions covering linear algebra, scientific functions, etc.

Array methods do something to one array. Some of these are the same as ufuncs:

In [67]:
# x = np.arange(12)
# x.shape = (3, 4)
# x
# x.max()
# x.max(axis=0)
# x.max(axis=1)

## Linear algebra and other advanced math

In general, you use arrays (n-dimensional), not matrices (specialized 2-dimensional), in NumPy.

Internally, NumPy doesn’t invent its own math routines: it relies on BLAS and LAPACK to do this kind of math - the same as many other languages.

*    [Linear algebra in numpy](https://numpy.org/doc/stable/reference/routines.linalg.html)
*    [Many, many other array functions](https://numpy.org/doc/stable/reference/routines.html)
*    [Scipy](https://docs.scipy.org/doc/scipy/reference/) has even more functions
*    Many other libraries use NumPy arrays as the standard data structure: they take data in this format, and return it similarly. Thus, all the other packages you may want to use are compatible.
*    If you need to write your own fast code in C, NumPy arrays can be used to pass data. This is known as [extending Python](https://docs.python.org/3/extending/).
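
As a small taste of the linear algebra routines linked above, this sketch solves a 2×2 linear system with *numpy.linalg.solve*. The matrix `A` and right-hand side `rhs` are arbitrary illustration data.

```python
import numpy as np

# Solve the linear system A x = rhs; the numbers are arbitrary illustration data.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
rhs = np.array([9.0, 8.0])

x = np.linalg.solve(A, rhs)      # delegates the work to a LAPACK routine
print(x)                         # approximately [2. 3.]

# Check the solution with a matrix-vector product.
print(np.allclose(A @ x, rhs))   # True
```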

## See also

* [NumPy manual](https://numpy.org/doc/stable/reference/)
    *  [Basic array class reference](https://numpy.org/doc/stable/reference/arrays.html)
    *  [Indexing](https://numpy.org/doc/stable/reference/arrays.indexing.html)
    *  [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html)
* [2020 Nature paper on NumPy’s role and basic concepts](https://www.nature.com/articles/s41586-020-2649-2)
