# DAML 02 - `NumPy` Basics

Michal Grochmal <michal.grochmal@city.ac.uk>

`NumPy` is an array-oriented Python library meant for *numerical computing*.

But what actually is numerical computing?
It is a set of techniques based on numerical approximation,
i.e. instead of attempting to find the exact result of an expression we build algorithms
which operate on approximations and find answers which are approximations themselves.
Since numbers on digital computers have limited precision anyway,
computing on approximations gives results that are almost indistinguishable from the exact ones.
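A minimal sketch of what limited precision means in practice, assuming ordinary double precision floats. The tolerance-based comparison with `np.isclose` is just one common way of working with approximations, used here for illustration.

```python
import numpy as np

# 0.1, 0.2 and 0.3 cannot be represented exactly in binary floating point,
# so the sum is only an approximation of 0.3.
total = 0.1 + 0.2
exactly_equal = total == 0.3

# Comparing within a small tolerance treats the two approximations as equal.
close_enough = bool(np.isclose(total, 0.3))

print(exactly_equal, close_enough)
```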

Computing on arrays of numbers existed, and was heavily optimized, long before Python.
`NumPy` inherits a great deal of the numeric computing optimizations of the past.
After all, if it ain't broke don't fix it.
If you run the same vectorial code (more on this in a moment) in `NumPy`, `MATLAB` or `R`,
you will find that it takes pretty much the same time.
This is because all these implementations actually use the same old,
and well tested, libraries.

`NumPy` really is a glue that binds the old optimizations with Python,
and allows Python's dynamic syntax, well suited to quick prototyping,
to sit on top of the optimized vectorial computing.

## Vectorial Computing

Vectorial computing is the idea that we can perform a computation on several values at once.
When a computer performs an operation it moves words (several bytes)
into CPU registers and only then performs the operation.
To figure out what operation needs to be performed (e.g. addition),
the computer in turn performs other operations on memory addresses.
For example, the CPU adds together memory offsets to figure out
where in memory the operands for the addition are.
But that does mean that we cannot really perform things _"at once"_!

What we can do is to **minimize the overhead of adjunct processing** during
the computation of several operations "at once".
We can do that by aligning the operands and then telling the CPU to always
move just a single word (or double/triple word) before performing the next operation.
The resulting speedup can be astonishing.
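A rough sketch of how one might measure the difference, assuming a plain Python list comprehension as the baseline. The array size, the repeat count and the function names are arbitrary choices for illustration, and the actual timings depend on the machine.

```python
import timeit

import numpy as np

values = list(range(100_000))   # Python list: one full object per number
array = np.arange(100_000)      # NumPy array: contiguous block of integers


def python_loop():
    # The interpreter handles each element separately.
    return [v * 2 for v in values]


def numpy_vectorized():
    # A single call; the loop runs inside optimized compiled code.
    return array * 2


loop_time = timeit.timeit(python_loop, number=20)
vector_time = timeit.timeit(numpy_vectorized, number=20)
print(f"python loop: {loop_time:.4f}s, numpy: {vector_time:.4f}s")

# Both approaches compute the same values.
same_result = python_loop() == numpy_vectorized().tolist()
```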

Unfortunately, it turns out that Python is terrible at aligning values in memory.
This is mostly because Python's data types are sparse,
i.e. they use pointers and *metadata* to define the actual value of a type
and its extra properties.
Thanks to these properties Python is easy to use, easy to learn,
and programmer friendly - think duck typing.
Enter the `NumPy` array, which attempts to bring the best of both worlds.

![memory-usage.svg](attachment:memory-usage.svg)

<div style="text-align:right"><sup>Image available as `*memory-usage.svg`</sup></div>

In a NumPy array the heavy per-value metadata of Python types is reduced
to a single block of metadata for the whole array.
This has its trade-offs of course.
For example, Python lists allow mixed types; NumPy arrays do not.
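A small sketch of that trade-off: a list keeps every element as its own Python object, whilst an array built from the same values converts them all to one common type. The variable names are only for illustration.

```python
import numpy as np

# A Python list happily stores values of different types side by side.
mixed_list = [1, 2.5, True]
list_types = [type(v).__name__ for v in mixed_list]

# NumPy picks one data type able to hold every value, here a 64-bit float.
arr = np.array(mixed_list)

print(list_types)
print(arr, arr.dtype)
```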

To be fair, [fixed-type arrays exist in Python][arrtyp] but are not nearly as powerful,
or as widely used, as NumPy arrays.
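For comparison, a short sketch using the standard library `array` module. It stores values of a single type compactly, but it has no element-wise arithmetic: multiplying by a number repeats the array, as it would a list.

```python
from array import array

import numpy as np

# "d" is the type code for C doubles; all items must be floats.
std = array("d", [1.0, 2.0, 3.0])

try:
    std.append("four")
    rejected = False
except TypeError:
    # A value of the wrong type is refused.
    rejected = True

# With the standard array, * means repetition, just as for lists.
repeated = std * 2

# With a NumPy array, * multiplies every element.
doubled = np.array(std) * 2

print(len(repeated), doubled)
```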

[arrtyp]: https://docs.python.org/3/library/array.html

## NumPy Arrays

Let's try this out: we import NumPy and create a bunch of arrays.
We can use Python lists to create them, but there are more efficient methods:

In [None]:
import numpy as np

In [None]:
np.array([7, 6, 9, 11, 12])

In [None]:
np.zeros(6)

In [None]:
np.arange(2, 17, 2)  # from 2 up to, but excluding, 17 in steps of 2

In [None]:
np.linspace(1, 3, 10)  # 10 evenly spaced values from 1 to 3, both ends included

In [None]:
np.ones((2, 3))

That last one is more than a flat list: it has two dimensions.
This leads us to the shape, one of the most important array attributes.

## Array Attributes

The array carries a single block of metadata that describes the whole array, including:

- memory consumption estimates
- data types (more on this later)
- methods that can be executed over all elements of the array ("at once")
- static metadata, where the *shape* of the array is the most important
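As a quick illustration of the methods mentioned in the list above, here is a sketch using three common aggregation methods. Each one is a single call that works over every element of the array.

```python
import numpy as np

x = np.arange(6)          # the values 0, 1, 2, 3, 4, 5

total = x.sum()           # adds up all elements
largest = x.max()         # largest element
average = x.mean()        # arithmetic mean of all elements

print(total, largest, average)
```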

The shape is important because without it the arrays would always have just one dimension.
In memory the NumPy array is simply a very long string of values one after another.
The *shape* is a tuple of numbers, one per dimension, from which NumPy works out the *offsets*
at which one row (or plane) ends and the next one starts.
It is easier if we see it, so let's create some arrays of different shapes.

In [None]:
x = np.arange(6)
x, x.shape

![1d-array.svg](attachment:1d-array.svg)

<div style="text-align:right"><sup>Image available as `*1d-array.svg`</sup></div>

In [None]:
x = np.arange(18).reshape((3, 6))
x, x.shape

![2d-array.svg](attachment:2d-array.svg)

<div style="text-align:right"><sup>Image available as `*2d-array.svg`</sup></div>

In [None]:
x = np.arange(36).reshape((2, 3, 6))
x, x.shape

![3d-array.svg](attachment:3d-array.svg)

<div style="text-align:right"><sup>Image available as `*3d-array.svg`</sup></div>

The `reshape` method we have been using returns an array with different `shape` metadata.
Where possible it gives a view over the same memory,
i.e. it changes the shape without moving any items around.
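A minimal sketch to check this, assuming a contiguous one-dimensional array as the starting point. `np.shares_memory` and the `strides` attribute are standard NumPy; the variable names are only for illustration.

```python
import numpy as np

flat = np.arange(18)
grid = flat.reshape((3, 6))

# Both arrays look at the same block of memory.
shared = np.shares_memory(flat, grid)

# Strides are the byte offsets used to step along each dimension:
# one row down skips 6 items, one column right skips 1 item.
row_stride, col_stride = grid.strides

# Writing through one array is therefore visible through the other.
flat[8] = 100
seen_in_grid = grid[1, 2]   # row 1, column 2 is item 1 * 6 + 2 = 8

print(shared, grid.strides, seen_in_grid)
```

Note that `flat` itself keeps its original shape of `(18,)`; only the returned array has the new one.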

Other useful static metadata are:

- **size** - the flattened length of the array
- **ndim** - number of dimensions, equivalent to `len(shape)`
- **itemsize** - bytes occupied by each item in the array
- **nbytes** - memory use, estimated, equivalent to `itemsize * size`
- **dtype** - the data type of items in the array
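The attributes listed above can be inspected directly. In this sketch the `dtype` is fixed to `np.int64`, an assumption made so that the numbers are the same on every platform.

```python
import numpy as np

x = np.arange(36, dtype=np.int64).reshape((2, 3, 6))

print(x.size)       # 36 items in total
print(x.ndim)       # 3 dimensions
print(x.itemsize)   # 8 bytes per 64-bit integer
print(x.nbytes)     # 288 bytes, i.e. itemsize * size
print(x.dtype)      # int64
```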

## Data Types

Not only numbers can be placed in an array,
although you will want to use arrays with numbers most of the time.
Even then, there are several ways to encode a number.
As the discussion about numerical computing suggests,
we often need to decide on the precision of the numbers we use.
Higher precision gives better answers, a smaller memory footprint gives more speed,
and choosing a data type is a trade-off between the two.
Some data types in NumPy:

| Data type  | Description |
|:---------- |:----------- |
| bool\_     | Boolean (True or False) stored as a byte |
| int\_      | Default integer type (same as C long; normally either int64 or int32) |
| intc       | Identical to C int (normally int32 or int64) |
| intp       | Integer used for indexing (same as C ssize_t; normally either int32 or int64) |
| int8       | Byte (-128 to 127) |
| int16      | Integer (-32768 to 32767) |
| int32      | Integer (-2147483648 to 2147483647) |
| int64      | Integer (-9223372036854775808 to 9223372036854775807) |
| uint8      | Unsigned integer (0 to 255) |
| uint16     | Unsigned integer (0 to 65535) |
| uint32     | Unsigned integer (0 to 4294967295) |
| uint64     | Unsigned integer (0 to 18446744073709551615) |
| float\_    | Shorthand for float64 (the alias was removed in NumPy 2.0) |
| float16    | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa |
| float32    | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa |
| float64    | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa |
| complex\_  | Shorthand for complex128 (the alias was removed in NumPy 2.0) |
| complex64  | Complex number, represented by two 32-bit floats (real and imaginary components) |
| complex128 | Complex number, represented by two 64-bit floats (real and imaginary components) |

<sup>[Table from NumPy documentation][nptypes]</sup>

[nptypes]: https://docs.scipy.org/doc/numpy/user/basics.types.html

In [None]:
x = np.arange(18).reshape((3, 6))
x.dtype

In [None]:
x = np.arange(18, dtype=np.uint8).reshape((3, 6))
x, x.dtype

In [None]:
x = np.linspace(1, 5, 9, dtype=np.float16).reshape((3, 3))
x, x.dtype
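To make the trade-off between precision and memory concrete, here is a small sketch. The array of 1001 values and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# The same values stored with double and with half precision.
a64 = np.linspace(0, 1, 1001)       # float64 by default
a16 = a64.astype(np.float16)

# Half precision needs a quarter of the memory...
bytes64, bytes16 = a64.nbytes, a16.nbytes

# ...but the stored values drift away from the float64 ones.
max_error = np.abs(a64 - a16.astype(np.float64)).max()

# Small integer types have a limited range too: 250 + 10 does not fit
# in a uint8, so the result silently wraps around to 260 - 256 = 4.
wrapped = np.array([250], dtype=np.uint8) + np.array([10], dtype=np.uint8)

print(bytes64, bytes16, max_error, wrapped)
```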