# Array processing and _numpy_

Szymon Talaga, 11.12.2019

<hr>

## Scientific computing in Python

Python is a convenient language and computing environment: its syntax is simple and natural, and it supports interactive work in a Python shell or in other environments such as Jupyter Notebook.
This is possible largely because Python has so-called dynamic typing.

### Dynamic typing

A language has dynamic typing when it does not force the user to declare the types of variables by hand. Instead, the type of a value is determined at runtime, each time the value is accessed.
This makes writing code much easier and facilitates interactivity, but it also means that Python (and similar languages) is usually much slower than statically typed languages such as C/C++.
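
As a minimal illustration of dynamic typing (the variable names are made up for this sketch), the same name can be bound first to an integer and then to a string without any type declaration:

```python
# The same name can be bound to values of different types;
# the type is looked up at runtime, not declared in advance.
value = 10
type_before = type(value)   # int

value = "ten"
type_after = type(value)    # str

print(type_before.__name__, type_after.__name__)
```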

### Static typing

Static typing means that the user has to specify the types of all variables when defining them. This makes the code much faster, as types do not have to be checked at runtime. However, it usually also means that the code has to be compiled before it can be run.

### The problem of scientific computing

In scientific computing we need to have both ease-of-use of a dynamic language like Python (since we do not want to force scientists to be full-fledged software engineers) as well as performance provided by compiled languages with static typing such as C/C++.

<hr>

## Numpy

Numpy is not part of the standard library, so it has to be installed first (for instance with `pip`). However, it is installed by default in the Anaconda distribution of Python.

In [2]:
# Import numpy: alias np is traditional; everyone uses it
import numpy as np

Numpy is the standard Python solution for the problem of scientific computing. This is possible thanks to the specific design of Numpy as a software package. It consists of two main parts:

1. Core library written in C/C++. It implements all the crucial computationally intensive procedures.
2. Python interface. It allows us to communicate with the core library from within Python. It also implements a lot of helper procedures that are not computationally complex.

Thus, Numpy in some sense _translates_ our Python code into calls to very efficient compiled C/C++ code.
However, there are two additional constraints that have to be satisfied for this to happen.

1. Data passed from Python to Numpy core has to be of fixed data type (i.e. int, float, bool), so it can be used by a program with static typing.
2. Data has to be passed between Python and Numpy core as rarely as possible as the transfer inflicts a significant additional computational cost.

The above constraints can be satisfied because the central data structure in Numpy is an array.
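
The first constraint can be seen in a small sketch (the names `mixed_list` and `mixed_array` are only illustrative): a Python list may mix types, while an array built from it ends up with a single data type for all elements.

```python
import numpy as np

# A Python list can hold values of different types.
mixed_list = [1, 2.5, True]
list_types = [type(v).__name__ for v in mixed_list]

# An array built from it has exactly one dtype for all elements;
# here Numpy converts everything to 64-bit floats.
mixed_array = np.array(mixed_list)

print(list_types)          # ['int', 'float', 'bool']
print(mixed_array.dtype)   # float64
print(mixed_array.tolist())
```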

### Arrays

In general, an array is a multidimensional generalization of a list.

```python
# 1D array with shape (6,)
[0, 1, 11, 5, 7, 0]

# 2D array with shape (2, 3)
[ [0, 5, 10], 
  [11, 6, 7] ]

# 3D array with shape (2, 2, 3)
[ [ [1, 2, 3], [4, 0, 7] ],
  [ [1, 4, 6], [1, 1, 3] ] ]
```

Every array is defined not only by its content but also by its shape. The shape is defined by:

1. Zero or more axes. Axes are dimensions along which data is arranged. One axis gives a vector, two axes give a matrix, and three or more axes give a general array (sometimes also called a tensor). An array with zero axes holds a single scalar value.
2. The number of elements along each axis (given as a tuple of non-negative integers).

Thus, in general in Numpy we store our data in arrays.
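
Two edge cases of the shape definition can be checked directly. This is a small sketch with illustrative names: an array with zero axes holds one value, and an array with an axis of length zero holds no values.

```python
import numpy as np

scalar_array = np.array(5)        # zero axes, one element
empty_array = np.zeros((0, 3))    # two axes, no elements

print(scalar_array.ndim, scalar_array.shape, scalar_array.size)  # 0 () 1
print(empty_array.ndim, empty_array.shape, empty_array.size)     # 2 (0, 3) 0
```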

In [5]:
# Standard array creation
# (often people use uppercase letters to denote arrays)

# 1D
X = np.array([1, 2, 3])
print("1D\n", X, "\n")

# 2D
X = np.array([
    [1, 2],
    [3, 4]
])
print("2D\n", X, "\n")

# 3D
X = np.array([
    [ [1, 2], [3, 4] ],
    [ [4, 5], [5, 6] ]
])
print("3D\n", X)

1D
 [1 2 3] 

2D
 [[1 2]
 [3 4]] 

3D
 [[[1 2]
  [3 4]]

 [[4 5]
  [5 6]]]


Every array has two important attributes defined on it.

In [6]:
print("1D array")
X = np.array([1, 2, 3])
print(X.ndim)     # number of axes
print(X.shape)    # number of elements along axes

print("\n2D array")
X = np.array([ [1, 2], [3, 4] ])
print(X.ndim)     # number of axes
print(X.shape)    # number of elements along axes

1D array
1
(3,)

2D array
2
(2, 2)


#### Data types

The second defining feature of every array is its data type. As we already mentioned, high performance computing requires static typing. That is why our arrays need to be of a fixed, homogeneous type (that is, all elements have to be of the same type). Thanks to this, they can be passed to the core library written in C/C++ and handled in an optimal manner.

In Numpy we can define our arrays to be of one of the standard Python types. To do so, we use the `dtype` argument of the array creation functions.

In [8]:
X = np.array([1, 2, 3], dtype=int)
print("int\n", X)

X = np.array([1, 2, 3], dtype=float)
print("float (note decimal dots)\n", X)

X = np.array([0, 1, 2], dtype=bool)
print("bool\n", X)

int
 [1 2 3]
float (note decimal dots)
 [1. 2. 3.]
bool
 [False  True  True]


However, Numpy provides us with more control over the types of our data. It is quite a complex subject (you can read more about it [here](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)).
For now, let us just note that we have a lot of control over the amount of memory consumed by a single number in a given representation. Below are some examples of Numpy data types (`dtypes`). The name of each numeric type ends with a number indicating how many bits a single value of this type consumes.

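* `np.float128` (extended precision; available only on some platforms)
* `np.float64`
* `np.float32`
* `np.float16`
* `np.int64`
* `np.int32`
* `np.int16`
* `np.bool_` (boolean, stored in 8 bits; the alias `np.bool` is not available in every Numpy version)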

On most 64-bit platforms, `dtype=int` and `dtype=float` correspond to `np.int64` and `np.float64`. Note that the built-in Python `int` itself has arbitrary precision, while `np.int64` does not.
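
The link between a type name and its memory use can be checked with the `itemsize` attribute of a `dtype` object, which gives the size of one value in bytes. A short sketch (the dictionary `bits` is just an illustrative helper):

```python
import numpy as np

dtypes = (np.float64, np.float32, np.float16, np.int64, np.int32, np.int16, np.bool_)

# Map each dtype name to the number of bits used by a single value
bits = {np.dtype(t).name: np.dtype(t).itemsize * 8 for t in dtypes}

for name, n_bits in bits.items():
    print(name, n_bits)
```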

But why might we care about `dtypes` and memory? One reason is that we may need our floating-point arithmetic to be extremely precise. In such a case it may be worthwhile to use, for instance, `np.float128` instead of the default `np.float64`. Let us see what this means in a simple example.

In [9]:
small_x1 = 1 / 422310513
small_x2 = 2 / 422310513

print(f"x1 = {small_x1}\nx2 = {small_x2}")

x1 = 2.3679258962705483e-09
x2 = 4.735851792541097e-09


Standard Python is quite precise, as it uses 64-bit floats.

In [12]:
print(small_x1 == np.float64(small_x1))
print(small_x1 == np.float32(small_x1))
np.float32(small_x1)

True
False


2.3679259e-09

Below we see why precision may matter.

In [13]:
print(np.float32(small_x1))
print(np.float32(small_x2))
print(np.float32(small_x1) == np.float32(small_x2))

2.3679259e-09
4.7358517e-09
False


In [14]:
print(np.float16(small_x1))
print(np.float16(small_x2))
print(np.float16(small_x1) == np.float16(small_x2))

0.0
0.0
True
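
Why both numbers collapse to `0.0` as 16-bit floats can be seen from the limits of each type, which `np.finfo` reports. The sketch below reuses `small_x1` from above and prints, for each type, the machine epsilon (`eps`) and the smallest positive normal number (`tiny`).

```python
import numpy as np

small_x1 = 1 / 422310513

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # eps: gap between 1.0 and the next representable number
    # tiny: smallest positive normal number
    print(dtype.__name__, info.bits, info.eps, info.tiny)

# small_x1 is far below anything float16 can represent, so it rounds to zero
rounds_to_zero = np.float16(small_x1) == 0
print(rounds_to_zero)  # True
```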


However, this does not mean that we should always use as much precision as we can. In fact, we usually do not need to. Quite often we may prefer to use less precise data types in order to lower the memory footprint of our application. However, we have to understand our problem well in order to choose the `dtype` properly, as a lack of precision may result in corruption of our data.
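
The effect on the memory footprint can be measured with the `nbytes` attribute of an array. A minimal sketch with an arbitrary array size of one million elements:

```python
import numpy as np

n = 1_000_000
as_float64 = np.zeros(n, dtype=np.float64)
as_float32 = np.zeros(n, dtype=np.float32)

# Halving the number of bits per value halves the memory used by the data
print(as_float64.nbytes, as_float32.nbytes)  # 8000000 4000000
```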

Below we show how this may happen for integers.

In general, an integer type with $b$ bits can represent exactly $2^b$ different values. Moreover, integers are usually signed (they can represent both positive and negative numbers), so a signed $b$-bit integer takes values from $-2^{b-1}$ to $2^{b-1} - 1$.
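
The limits of a given integer type do not have to be computed by hand; `np.iinfo` reports them. The sketch below collects them in an illustrative dictionary `ranges` and compares them with the formula above.

```python
import numpy as np

ranges = {}
for dtype in (np.int8, np.int16, np.int32):
    info = np.iinfo(dtype)
    ranges[info.bits] = (info.min, info.max)
    # Compare with the formula: -2**(b-1) and 2**(b-1) - 1
    print(info.bits, info.min, info.max, -2 ** (info.bits - 1), 2 ** (info.bits - 1) - 1)
```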

In [15]:
# The case of signed 8 bit integer
# It can take values from -128 to 127 (256 unique values in total)
print(np.int8(127))
print(np.int8(-128))

127
-128


But what happens if we try to represent a number that is too large or too small? This leads to integer overflow or underflow.

In [16]:
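# Integer overflow
# We get the third lowest value
# (recent Numpy versions raise an error for `np.int8(130)`,
# so we cast an array instead, which still wraps around)
print(np.array(130).astype(np.int8))

# Integer underflow
# We get the second highest value
print(np.array(-130).astype(np.int8))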

-126
126


#### Axes hierarchy

Axes in an array are ordered to form a hierarchy. The ordering is defined in the `.shape` attribute.

In [17]:
X = np.array([ [1,2,3], [4,5,6] ])
X.shape

(2, 3)

The array above has two elements along the first (main) axis and three elements along the second axis.
This means that it has a form of a sequence with two elements (main axis), each of which is a sequence of three elements (second axis).
This interpretation of course generalizes to more than two dimensions.

Numpy prints arrays in a way that stresses the shape of the elements along the main axis.

In [18]:
X = np.array([
    [ [1, 2], [3, 4] ],
    [ [5, 6], [5, 6] ],
    [ [7, 8], [8, 9] ] 
]) 
print(X.shape)
X

(3, 2, 2)


array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [5, 6]],

       [[7, 8],
        [8, 9]]])

When we iterate over an array with a standard for-loop, we always iterate over the elements of the first (main) axis.

In [19]:
for chunk in X:
    print("Shape:", chunk.shape, "\n", chunk)

Shape: (2, 2) 
 [[1 2]
 [3 4]]
Shape: (2, 2) 
 [[5 6]
 [5 6]]
Shape: (2, 2) 
 [[7 8]
 [8 9]]


### Creating arrays

Arrays can also be created in other ways than with the `np.array()` constructor function. Below we present some other useful array creation functions.

In [23]:
# Create array of ones with shape (5, 2)
np.ones((5, 2))

array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

In [24]:
# Create array of zeros with shape (10,)
np.zeros((10,))

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [25]:
# Create 5-by-5 identity matrix
np.eye(5, dtype=int)

array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])

**NOTE.** An identity matrix is a square matrix with ones along the diagonal and zeros everywhere else. Its main mathematical property is that it is the matrix equivalent of the number $1$. If we have an $n$-by-$n$ identity matrix $I$ and another matrix $X$ with $n$ rows, we have that:

$$IX = X$$
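
This property can be checked numerically. The sketch below uses the `@` operator, which performs matrix multiplication between arrays (it is not discussed elsewhere in this notebook), and an arbitrary example matrix.

```python
import numpy as np

I = np.eye(3, dtype=int)            # 3-by-3 identity matrix
X = np.arange(12).reshape(3, 4)     # any matrix with 3 rows

product = I @ X                     # matrix product IX
print(np.array_equal(product, X))   # True
```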

In [26]:
# Create a range of integers from 0 to 10 (not including 10)
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [27]:
# Create a range of integers from -5 to 5 (not including 5)
np.arange(-5, 5)

array([-5, -4, -3, -2, -1,  0,  1,  2,  3,  4])

In [28]:
# Create a range of integers from 0 to 10 with step size of 3
np.arange(0, 10, 3)

array([0, 3, 6, 9])

In [29]:
# Create a range of 11 equally spaced floating point numbers between 0 and 1
np.linspace(0, 1, 11)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

The standard `np.array` function can be used with anything that can be interpreted as an array. It can be a list, a list of lists (tuples work too), but also another sequence such as a `range` object.

Every array creation function also accepts the `dtype` argument, which takes any valid Numpy `dtype` as well as standard types such as `int`, `float` and `bool`.
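
A few quick checks of these two points, with illustrative variable names: different kinds of input passed to `np.array`, and the `dtype` argument passed to other creation functions.

```python
import numpy as np

from_range = np.array(range(5))                        # range object
from_tuples = np.array(((1, 2), (3, 4)), dtype=float)  # tuple of tuples
int_zeros = np.zeros((2, 3), dtype=int)
bool_ones = np.ones(4, dtype=bool)

print(from_range.tolist())   # [0, 1, 2, 3, 4]
print(from_tuples.dtype)     # float64
print(int_zeros.dtype.kind)  # i (signed integer)
print(bool_ones.tolist())    # [True, True, True, True]
```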

#### Type conversion

Moreover, any array can be at any time converted to any other type (or at least we may try to do so).

In [30]:
X = np.arange(10)
X.astype(float)

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
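
Conversion may lose information, so it is worth knowing what happens to the values. A small sketch with made-up numbers: converting floats to integers discards the fractional part, and converting to booleans maps every non-zero value to `True`.

```python
import numpy as np

values = np.array([1.7, -1.7, 0.0])

as_int = values.astype(int)    # fractional part is discarded (no rounding)
as_bool = values.astype(bool)  # non-zero -> True, zero -> False

print(as_int.tolist())   # [1, -1, 0]
print(as_bool.tolist())  # [True, True, False]
```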

However, there is one special `dtype` called `object`. It is used to represent any type of Python object. This means that this type **does not give us any significant performance gain**, as data of this type cannot be processed directly by the Numpy core.

In [31]:
# Array of Python objects
X = np.array([ {'a': 1}, 'a string', 10 ])
X

array([{'a': 1}, 'a string', 10], dtype=object)

Object arrays also often do not support common operations such as type conversion, as these may not be well defined for arbitrary objects.

In [32]:
X.astype(int)

TypeError: int() argument must be a string, a bytes-like object or a number, not 'dict'
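
For contrast, here is a sketch of an object array for which the conversion is well defined, assuming that every element is something Python's `int()` accepts (the array content is made up for illustration):

```python
import numpy as np

# Every element here can be converted to an integer
numbers_as_objects = np.array([1, 2.7, True], dtype=object)
print(numbers_as_objects.dtype)   # object

converted = numbers_as_objects.astype(int)
print(converted.tolist())         # [1, 2, 1]
```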

#### Reshaping, copies and views

One of the main features of Numpy is that it allows us to easily reshape our arrays.

In [33]:
X = np.arange(16)
X

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [34]:
X.reshape(4, 4)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

**IMPORTANT.** By default, reshaping in Numpy fills values in such a way that the last axis is filled first (the so-called row-major or C order).

In [35]:
X = np.arange(12)
X.reshape(2, 2, 3)

array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]]])
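
The default can be changed with the `order` argument of `.reshape()`. The sketch below compares the default order with `order="F"` (column-major), in which the first axis is filled first.

```python
import numpy as np

flat = np.arange(6)

row_major = flat.reshape(2, 3)                # default, same as order="C"
column_major = flat.reshape(2, 3, order="F")  # first axis is filled first

print(row_major.tolist())     # [[0, 1, 2], [3, 4, 5]]
print(column_major.tolist())  # [[0, 2, 4], [1, 3, 5]]
```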

The crucial thing is that a new, reshaped array usually does not copy the data. It still points to the same data and only **views** it differently.

This means that a reshaped array is a new Python object, but it still uses the same underlying data.
This can be revealed by using the `.base` attribute defined on every Numpy array. It points to the array from which a given array was derived.
The base of an original array is ``None``.

In [37]:
X = np.arange(16)
Y = X.reshape(4, 4)

print(X is Y)       # they are different objects
print(X is Y.base)  # but point to the same data

False
True


In [38]:
print(X.base)
print(Y.base)

None
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]


It is important to be aware of when you create new data and when you create just a new view of the same data.
Below we see why this matters. When we change an element of one of the arrays, it is changed in all other arrays that view the same data.

In [39]:
X = np.arange(16)
Y = X.reshape(4, 4)
X[5] = -999
print(X)
print("\n", Y)

[   0    1    2    3    4 -999    6    7    8    9   10   11   12   13
   14   15]

 [[   0    1    2    3]
 [   4 -999    6    7]
 [   8    9   10   11]
 [  12   13   14   15]]


However, if we want we can copy any array explicitly.

In [40]:
X = np.ones((3, 3))
Y = X.copy()

print(X is Y)
print(X.base, Y.base)

False
None None
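
The difference between a view and a copy can also be checked with `np.shares_memory`. In this sketch (variable names are illustrative), a change to the original array is visible through the view but not in the copy.

```python
import numpy as np

original = np.arange(6)
view = original.reshape(2, 3)   # new object, same data
copy = original.copy()          # new object, new data

original[0] = -1

print(view[0, 0], copy[0])                 # -1 0
print(np.shares_memory(original, view))    # True
print(np.shares_memory(original, copy))    # False
```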


There is also a special shape-changing operation called the transpose. The idea of the transpose is very important in linear algebra, where it corresponds to swapping rows with columns in a matrix. In the more general context of arrays, it refers to reversing the order of axes, so the first becomes the last, the second becomes the second to last, and so on.

To get the transpose of an array we use the special `.T` attribute defined on every array object.

In [42]:
X = np.arange(5)
# Transpose does nothing for a 1D array
print(X.shape, X.T.shape)

# 2D transpose
X = np.arange(20).reshape(5, 4)
print(X.shape, X.T.shape)

# 4D transpose
X = np.arange(3*2*5*4).reshape(3, 2, 5, 4)
print(X.shape, X.T.shape)

(5,) (5,)
(5, 4) (4, 5)
(3, 2, 5, 4) (4, 5, 2, 3)
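
Note that a transpose is not the same as a plain `.reshape()` to the swapped shape: it reorders the elements so that element `[i, j]` of the transpose is element `[j, i]` of the original. A small sketch with an illustrative 2-by-3 matrix:

```python
import numpy as np

M = np.arange(6).reshape(2, 3)

transposed = M.T             # rows become columns
reshaped = M.reshape(3, 2)   # same shape as M.T, but elements keep their order

print(transposed.tolist())   # [[0, 3], [1, 4], [2, 5]]
print(reshaped.tolist())     # [[0, 1], [2, 3], [4, 5]]

# Like reshaping, transposing gives a view of the same data
print(np.shares_memory(M, transposed))  # True
```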


### Aggregation

Numpy also provides many built-in methods, defined on every array object, that allow us to aggregate our data in different ways.

In [43]:
X = np.arange(10)
X.sum()

45

Below are some of the most useful aggregating functions.

* `sum` (sum of all elements)
* `prod` (product of all elements)
* `mean` (mean value / average value)
* `var` (variance)
* `std` (standard deviation)
* `min` (minimum)
* `max` (maximum)
* `all` (boolean and between all values)
* `any` (boolean or between all values)
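
A quick tour of these methods on a tiny example array (the array `data` is made up for illustration):

```python
import numpy as np

data = np.array([[1, 2], [3, 4]])

print(data.sum(), data.prod())        # 10 24
print(data.mean(), data.var())        # 2.5 1.25
print(data.std())                     # square root of the variance
print(data.min(), data.max())         # 1 4

# `all` and `any` are most useful on boolean arrays
print((data > 0).all(), (data > 3).any())   # True True
```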

The crucial thing, however, is that we can aggregate along specific axes.

In [44]:
X = np.arange(20).reshape(5, 4)
X

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [45]:
X.sum(0)   # aggregate along the first axis

array([40, 45, 50, 55])

In [46]:
X.sum(1)   # aggregate along the second axis

array([ 6, 22, 38, 54, 70])

In [47]:
X.sum(-1)  # aggregate along the last axis

array([ 6, 22, 38, 54, 70])

With multidimensional arrays we can also aggregate along multiple axes.

In [48]:
X = np.arange(36).reshape(4, 3, 3)
X

array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]],

       [[27, 28, 29],
        [30, 31, 32],
        [33, 34, 35]]])

In [51]:
X.sum((1, 2))   # aggregate along the second and the third axes

array([ 36, 117, 198, 279])
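
To see what aggregating along several axes means, the sketch below compares it with two equivalent formulations: an explicit loop over the first axis, and two consecutive single-axis sums. The name `X3` is used only to keep this example separate from the cells above.

```python
import numpy as np

X3 = np.arange(36).reshape(4, 3, 3)

vectorized = X3.sum((1, 2))                        # sum over the second and third axes
looped = np.array([chunk.sum() for chunk in X3])   # sum of each 3-by-3 chunk
two_steps = X3.sum(2).sum(1)                       # last axis first, then the remaining one

print(vectorized.tolist())   # [36, 117, 198, 279]
print(np.array_equal(vectorized, looped), np.array_equal(vectorized, two_steps))  # True True
```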

### Exercises, part A: array creation, reshaping and aggregation

#### A1.

Compute the sum of all multiples of 7 which are greater than or equal to 0 and lower than 100.

HINT. Use `np.arange` and the `sum` aggregation method.

In [7]:
# Your code
np.arange(7, 100, 7).sum()

735

#### A2.

You are provided with repeated measurements from a test for 5 subjects. There are 7 measurements for each subject.
Compute average values and standard deviations for all subjects.

The problem is that the data you are provided with is arranged in a one-dimensional array.
However, you know that measurements for any given subject are next to each other, so you should use this fact to your advantage.

In [8]:
X = np.array([
    9.13277641, 14.41078704,  9.22040494,  8.58801853,  7.74778277, 11.92951325,  8.35103319,  
    5.5999831 ,  9.58651896, 10.87977671,  8.22024318,  9.39598417, 10.74455952,  9.18794327,  
    9.43863215,  7.06870742,  9.49590352,  8.91426591, 14.68357244, 11.13252626,  9.91102564, 
    9.25597556, 12.78596713, 11.49586057, 13.84790885,  9.6170427 ,  8.28066311, 10.76219817,  
    9.4020769 , 11.30056981,  6.53012513, 12.36805772, 11.83046264,  8.34725296,  9.26385409
])

In [9]:
# Your code
Y = X.reshape(5, 7)
Y.mean(1)
Y.std(1)

array([ 9.91147373,  9.08785842, 10.09209048, 10.86365944,  9.86319989])

array([2.21257175, 1.65733353, 2.18787438, 1.84474369, 1.93426989])

### Standard indexing

Arrays can of course be indexed like any other ordered collection that supports numerical indexing with integer indices (such as lists and tuples).
However, since they can have multiple axes, they can be indexed with multiple indexers.

In [10]:
# 1D arrays work exactly like lists etc.
X = np.arange(10)
print(X[2])
print(X[-2])
print(X[2:5])

2
8
[2 3 4]


In [11]:
# 2D arrays can be indexed with one or two indexers
X = np.arange(12).reshape(3, 4)
X

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [12]:
X[1]    # second row (second element of the first axis)
X[:, 1] # all elements along the first axis and the second element along the second axis
X[1, 1] # second element along the first axis and the second element along the second axis

array([4, 5, 6, 7])

array([1, 5, 9])

5

In [13]:
# Indexing of course generalizes to more dimensions
X = np.arange(18).reshape(3, 3, 2)
X

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15],
        [16, 17]]])

In [14]:
X[0]

array([[0, 1],
       [2, 3],
       [4, 5]])

In [15]:
X[2, 1]

array([14, 15])

In [16]:
X[:, :, 1]

array([[ 1,  3,  5],
       [ 7,  9, 11],
       [13, 15, 17]])

Indices can also be passed as tuples of integers and `slice` objects.

In [17]:
X = np.arange(12).reshape(4, 3)
X

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [18]:
X[3, 2] == X[(3, 2)]

True

In [19]:
X[:3, 1] == X[(slice(None, 3), 1)]

array([ True,  True,  True])

### Vectorization & broadcasting

Vectorization is the crucial idea that allows us to fully appreciate the power of arrays with fixed data types. It makes it possible to drastically limit the number of times we have to pass data between Python and Numpy core.

The main idea is to avoid explicit loops at all costs and instead reshape arrays properly and express complex computations as simple arithmetic computations between different arrays.

#### Array-scalar vectorization

The simplest case is the vectorization of array-scalar operations.

Below we see two approaches to the same operation of multiplying all numbers in a one-dimensional array by a constant (a scalar).

In [20]:
X = np.ones((10,), dtype=int)
X

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [21]:
# Naive, non-vectorized approach based on a for-loop
Y = np.array([ x * 5 for x in X ])
Y

array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5])

In [22]:
# Vectorized approach
Y = X * 5
Y

array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5])

In general all standard arithmetic operators when called with Numpy arrays are vectorized. In the case of array-scalar operations it means that they are carried out **element-wise**.

In [23]:
X = np.arange(10000)

Below we see a comparison of computation time for the two approaches.
Clearly, the vectorized Numpy version is much faster: in this run by more than two orders of magnitude (about 330 times).

In [24]:
%%timeit
[ x * 5 for x in X ]

1.95 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [25]:
%%timeit
X * 5

5.92 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Arithmetic operators work element-wise in the same way for multidimensional arrays.

In [26]:
X = np.ones((3, 2, 2), dtype=int)
X * 10

array([[[10, 10],
        [10, 10]],

       [[10, 10],
        [10, 10]],

       [[10, 10],
        [10, 10]]])

All standard arithmetic operators are supported:

* `+` (addition)
* `-` (subtraction)
* `*` (multiplication)
* `/` (division)
* `//` (integer division)
* `%` (modulo operator; remainder of integer division)
* `**` (exponentiation; i.e. $x^a$)

Moreover, logical and comparison operators are also supported. However, the `or` and `and` operators do not work with arrays. Instead, their bitwise counterparts have to be used.

* `==` (equality test)
* `>` / `>=` / `<` / `<=` (comparisons)
* `&` (bitwise `and`)
* `|` (bitwise `or`)
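
A short sketch of combining conditions (the array and thresholds are made up): the bitwise operators work element-wise, while the plain `and` raises a `ValueError`, because the truth value of a whole array is ambiguous. Note the parentheses around each comparison; they are needed because `&` and `|` bind more tightly than comparisons.

```python
import numpy as np

X = np.array([0, 3, 1, 5])

in_range = (X > 0) & (X < 5)     # element-wise AND
outside = (X == 0) | (X >= 5)    # element-wise OR

print(in_range.tolist())   # [False, True, True, False]
print(outside.tolist())    # [True, False, False, True]

# The standard `and` does not work with arrays
try:
    (X > 0) and (X < 5)
    and_failed = False
except ValueError:
    and_failed = True
print(and_failed)          # True
```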

Vectorized logical operators applied to Numpy arrays return boolean arrays of the same size in which ``True`` values indicate elements that passed the test.

In [27]:
X = np.array([0, 3, 1])

In [28]:
# Equality test
X == 3

array([False,  True, False])

In [29]:
# Comparison
X > 0

array([False,  True,  True])

Bitwise logical operators `|` and `&` can operate on booleans and integers. However, their behavior on integers is quite complex and is beyond the scope of this course.
So here we will focus on their behavior for boolean values (or `0` and `1` integers).

In [30]:
X = np.array([0, 0, 1])

In [31]:
# Bitwise AND
X & 1

array([0, 0, 1])

In [32]:
# Bitwise OR
X | 1

array([1, 1, 1])

Moreover, Numpy also provides vectorized implementations of all the standard mathematical functions (and many more exotic ones). They are defined as functions in the main Numpy package. Below are a few examples:

* `np.sqrt` (square root; $\sqrt{x}$)
* `np.exp` (natural exponentiation; $e^{x}$)
* `np.log` (natural logarithm; $\log(x)$)
* `np.log2` (logarithm with base $2$; $\log_2(x)$)
* `np.log10` (logarithm with base $10$; $\log_{10}(x)$)
* `np.sin` (sine; $\sin(x)$)
* `np.cos` (cosine; $\cos(x)$)

Numpy also defines several important mathematical constants such as:

* `np.pi` (The number $\pi$)

In [33]:
X = np.array([1, 2, 4, 5])
print(np.exp(X))
print(np.log(X))

[  2.71828183   7.3890561   54.59815003 148.4131591 ]
[0.         0.69314718 1.38629436 1.60943791]
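
The trigonometric functions and `np.pi` can be combined in the same way. Because floating-point results are only approximate (for example, `np.sin(np.pi)` is a tiny number rather than exactly zero), the sketch compares results with `np.allclose` instead of `==`.

```python
import numpy as np

angles = np.array([0, np.pi / 2, np.pi])

sines = np.sin(angles)
cosines = np.cos(angles)
roots = np.sqrt(np.array([1, 4, 9]))

print(np.allclose(sines, [0, 1, 0]))     # True
print(np.allclose(cosines, [1, 0, -1]))  # True
print(roots.tolist())                    # [1.0, 2.0, 3.0]
```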


#### Array-array vectorization

The simplest case of an array-array operation is when both arrays have the same shape. As with array-scalar vectorization, it can always be done.
When two arrays have the same shape, the operations are done element-wise.

In [34]:
X = np.arange(12).reshape(4, 3)
#Y = np.ones((4, 3), dtype=int) * 3
Y = np.arange(12, 24).reshape(4, 3)
print(X)
print(Y)

X + Y

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
[[12 13 14]
 [15 16 17]
 [18 19 20]
 [21 22 23]]


array([[12, 14, 16],
       [18, 20, 22],
       [24, 26, 28],
       [30, 32, 34]])

Logical operations work in the same way.

In [35]:
X = np.array([2, 4, 5])
Y = np.array([3, 4, 1])

X == Y

array([False,  True, False])

In [36]:
X > Y

array([False, False,  True])

In [37]:
((X == Y) | (X > Y)) == (X >= Y)

array([ True,  True,  True])

The second simplest case of array-array vectorization is when one array has fewer axes, but its shape is the same as the shape of the last $n$ axes of the larger array.

For instance, assume that array `A` has shape `(5, 2, 3)`.

* Assume that `B` has shape `(3,)`. Then, operations between them can be vectorized as `3` in the shape of `B` matches with the last `3` in the shape of `A`.
* Assume that `B` has shape `(2, 3)`. Then, operations between them can be vectorized as `(2, 3)` in the shape of `B` matches with the last `2, 3` in the shape of `A`.
* Assume that `B` has shape `(2,)`. Then, operations between them **cannot** be vectorized as `2` in the shape of `B` does not match with the last `3` in the shape of `A`.

These rules are called the **broadcasting rules**. Broadcasting is very powerful, but it takes quite a lot of time to really understand it.
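
The three cases listed above can be verified directly. This is a minimal sketch using arrays of zeros and ones, since only the shapes matter here:

```python
import numpy as np

A = np.zeros((5, 2, 3))

# (5, 2, 3) with (3,) and with (2, 3): shapes match along the last axes
shape_with_3 = (A + np.ones(3)).shape
shape_with_2_3 = (A + np.ones((2, 3))).shape
print(shape_with_3, shape_with_2_3)   # (5, 2, 3) (5, 2, 3)

# (5, 2, 3) with (2,): the last axes do not match
try:
    A + np.ones(2)
    failed = False
except ValueError:
    failed = True
print(failed)                         # True
```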

In [38]:
X = np.arange(8).reshape(4, 2)
print(X.shape)
X

(4, 2)


array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

In [39]:
# Here we will remove mean values from columns
X_mean = X.mean(0)
print(X_mean.shape)
X_mean

(2,)


array([3., 4.])

In [40]:
# Note that shape of `X_mean` matches with the last axis of `X`
# Hence, we can vectorize
X - X_mean

array([[-3., -3.],
       [-1., -1.],
       [ 1.,  1.],
       [ 3.,  3.]])

In [41]:
for chunk in X:
    chunk - X_mean

array([-3., -3.])

array([-1., -1.])

array([1., 1.])

array([3., 3.])

Can we use the same trick to subtract row means? Let us check.

In [42]:
# We compute row means
X_mean = X.mean(1)
print(X_mean.shape)
X_mean

(4,)


array([0.5, 2.5, 4.5, 6.5])

In [43]:
# Note that the shape of `X_mean` does not match with the last axis of `X` this time.
# So we are bound to fail.
X - X_mean

ValueError: operands could not be broadcast together with shapes (4,2) (4,) 

But do not worry. It can be done, but it will require us to work around the standard broadcasting rules and reshape our arrays, so they can work together.

In order to do so, we will use one of the simplest but also most powerful tricks in Numpy. We will start to add dummy axes.

What is a dummy axis? It is just an axis with only one element along it. As such, it can be added to any array, as it does not increase its `size` (the number of elements).

In [44]:
# Below we reshape an array by adding dummy axes
X = np.arange(10)
X

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [45]:
# We can add one dummy axis
X.reshape(10, 1)

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

In [46]:
# Or as many as we want and wherever we want
X.reshape(1, 1, 10)

array([[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]])

In [47]:
# Note that this of course does not change the size of an array
# (also because otherwise the reshape method would throw an error)
X.size == X.reshape(10, 1).size == X.reshape(1, 1, 10).size

True

However, reshaping this way is not really convenient as we have to specify the number of elements. Luckily, we have a better syntax for adding dummy axes.

In [48]:
print(X.shape)
X

(10,)


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [49]:
print(X[:, None].shape)
X[:, None]

(10, 1)


array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

In [50]:
X = np.ones((5, 3), dtype=int)
Y = X[None, :, None, :]
Y.base

array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])

In [51]:
X[None, :, None, :].shape

(1, 5, 1, 3)
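
There are several equivalent ways of adding a dummy axis. The sketch below compares indexing with `None`, indexing with `np.newaxis` (which is simply an alias for `None`), the function `np.expand_dims`, and `.reshape()` with `-1`, which asks Numpy to infer the number of elements along that axis.

```python
import numpy as np

V = np.arange(10)

print(np.newaxis is None)          # True

with_none = V[:, None]
with_newaxis = V[:, np.newaxis]
with_expand = np.expand_dims(V, 1)     # insert a new axis at position 1
with_reshape = V.reshape(-1, 1)        # -1: infer the size of this axis

shapes = [a.shape for a in (with_none, with_newaxis, with_expand, with_reshape)]
print(shapes)   # [(10, 1), (10, 1), (10, 1), (10, 1)]
```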

Now, we can go back to the problem of removing row means, because we can add dummy axes to reshape our arrays so they can work together. This will also allow us to better understand the rules of broadcasting.

In [52]:
X = np.arange(8).reshape(4, 2)
print(X.shape)
X

(4, 2)


array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

In [53]:
# Compute row means
X_mean = X.mean(1)
print(X_mean.shape)
X_mean

(4,)


array([0.5, 2.5, 4.5, 6.5])

We know that this shape will not do. However, we can add a dummy axis to reshape `X_mean` so that it has shape `(4, 1)`.

In [54]:
# Add dummy axis to X_mean
X_mean_r = X_mean[:, None]
print(X_mean_r.shape)
X_mean_r

(4, 1)


array([[0.5],
       [2.5],
       [4.5],
       [6.5]])

In [55]:
# Now we can carry out the subtraction
print(X)
X - X_mean_r

[[0 1]
 [2 3]
 [4 5]
 [6 7]]


array([[-0.5,  0.5],
       [-0.5,  0.5],
       [-0.5,  0.5],
       [-0.5,  0.5]])

In [56]:
for chunk, row_mean in zip(X, X_mean):
    chunk - row_mean

array([-0.5,  0.5])

array([-0.5,  0.5])

array([-0.5,  0.5])

array([-0.5,  0.5])
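
As an alternative to adding the dummy axis by hand, aggregation methods accept a `keepdims=True` argument, which keeps the aggregated axis as an axis of length one. A sketch of the same row-centering done this way:

```python
import numpy as np

X = np.arange(8).reshape(4, 2)

# The aggregated axis is kept with length 1, so the result has shape (4, 1)
row_means = X.mean(1, keepdims=True)
print(row_means.shape)    # (4, 1)

centered = X - row_means
print(centered.tolist())  # [[-0.5, 0.5], [-0.5, 0.5], [-0.5, 0.5], [-0.5, 0.5]]
```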

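Why did it work? It worked because two arrays can be combined when their shapes are compatible. Shapes are compared axis by axis, starting from the last axis:

* Along each compared axis, both arrays have the same number of elements, or one of them has only 1 element. If one array has fewer axes, its missing leading axes are treated as axes with 1 element.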

For instance:

* (4, 2) can be vectorized with (2,)
* (4, 2) can be vectorized with (4, 1)
* (4, 3, 2) can be vectorized with (2,)
* (4, 3, 2) can be vectorized with (3, 2)
* (4, 3, 2) can be vectorized with (3, 1)
* (4, 3, 2) can be vectorized with (4, 3, 1)
* (4, 3, 2) can be vectorized with (4, 1, 2)
* (4, 3, 2) can be vectorized with (4, 1, 1)
* (4, 3, 2) can be vectorized with (1, 1, 1)

This fully defines the rules of broadcasting.
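
The rule can be written down as a few lines of plain Python. The function `broadcast_shape` below is a hypothetical helper written only for this illustration (it is not part of Numpy); its results are compared with the shapes that Numpy itself computes through `np.broadcast`.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes or raise ValueError."""
    # Pad the shorter shape with leading 1s so both have the same length
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for n, m in zip(a, b):
        if n == m or m == 1:
            result.append(n)
        elif n == 1:
            result.append(m)
        else:
            raise ValueError(f"shapes {shape_a} and {shape_b} are not compatible")
    return tuple(result)

pairs = [((4, 2), (2,)), ((4, 2), (4, 1)), ((4, 3, 2), (3, 1)), ((4, 3, 2), (4, 1, 1))]
for shape_a, shape_b in pairs:
    # Shape computed by Numpy for the same pair of shapes
    expected = np.broadcast(np.empty(shape_a), np.empty(shape_b)).shape
    print(shape_a, shape_b, broadcast_shape(shape_a, shape_b), expected)
```

Calling `broadcast_shape((4, 2), (4,))` raises a `ValueError`, which mirrors the failed subtraction of row means shown earlier.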

In [57]:
# Array with shape (4, 2)
X = np.arange(8).reshape(4, 2)
X

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

In [58]:
# (4, 2) and (2,)
Y = np.array([1, 2])
print(Y.shape)
X - Y

(2,)


array([[-1, -1],
       [ 1,  1],
       [ 3,  3],
       [ 5,  5]])

In [59]:
# (4, 2) and (4, 1)
Y = X.sum(1)[:, None]
print(Y.shape)
X - Y

(4, 1)


array([[-1,  0],
       [-3, -2],
       [-5, -4],
       [-7, -6]])

In [60]:
# Array with shape (4, 3, 2)
X = np.arange(24).reshape(4, 3, 2)
X

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15],
        [16, 17]],

       [[18, 19],
        [20, 21],
        [22, 23]]])

In [61]:
# (4, 3, 2) and (3, 2)
Y = X.sum(0)
print(Y.shape)
X - Y

(3, 2)


array([[[-36, -39],
        [-42, -45],
        [-48, -51]],

       [[-30, -33],
        [-36, -39],
        [-42, -45]],

       [[-24, -27],
        [-30, -33],
        [-36, -39]],

       [[-18, -21],
        [-24, -27],
        [-30, -33]]])

In [62]:
# (4, 3, 2) and (3, 1)
Y = X.sum((0, 2))[:, None]
print(Y.shape)
print(Y)
X - Y

(3, 1)
[[ 76]
 [ 92]
 [108]]


array([[[ -76,  -75],
        [ -90,  -89],
        [-104, -103]],

       [[ -70,  -69],
        [ -84,  -83],
        [ -98,  -97]],

       [[ -64,  -63],
        [ -78,  -77],
        [ -92,  -91]],

       [[ -58,  -57],
        [ -72,  -71],
        [ -86,  -85]]])

In [63]:
# (4, 3, 2) and (4, 3, 1)
Y = X.sum(-1)[:, :, None]
print(Y.shape)
X - Y

(4, 3, 1)


array([[[ -1,   0],
        [ -3,  -2],
        [ -5,  -4]],

       [[ -7,  -6],
        [ -9,  -8],
        [-11, -10]],

       [[-13, -12],
        [-15, -14],
        [-17, -16]],

       [[-19, -18],
        [-21, -20],
        [-23, -22]]])

In [64]:
# (4, 3, 2) and (4, 1, 2)
Y = X.sum(1)[:, None, :]
print(Y.shape)
X - Y

(4, 1, 2)


array([[[ -6,  -8],
        [ -4,  -6],
        [ -2,  -4]],

       [[-18, -20],
        [-16, -18],
        [-14, -16]],

       [[-30, -32],
        [-28, -30],
        [-26, -28]],

       [[-42, -44],
        [-40, -42],
        [-38, -40]]])

In [65]:
# (4, 3, 2) and (4, 1, 1)
Y = X.sum((1, 2))[:, None, None]
print(Y.shape)
X - Y

(4, 1, 1)


array([[[ -15,  -14],
        [ -13,  -12],
        [ -11,  -10]],

       [[ -45,  -44],
        [ -43,  -42],
        [ -41,  -40]],

       [[ -75,  -74],
        [ -73,  -72],
        [ -71,  -70]],

       [[-105, -104],
        [-103, -102],
        [-101, -100]]])

### Exercises, part B: vectorization & broadcasting

#### B1.

You are provided with three different IQ indicators ($M = 100$; $SD = 15$) for 10 subjects and you need to standardize them so that each indicator has mean 0 and standard deviation 1.

HINT. Remember the aggregation methods `.mean()` and `.std()`.


Let $X$ be a random variable and $\bar{X}$ the sample mean of $X$.
Define $X_c$, a centered version of $X$, as $X_c = X - \bar{X}$.

The standardized variable is then $Z = X_c / s_X$, where $s_X$ is the standard deviation of $X$.

In [66]:
np.random.seed(101)
X = np.random.normal(100, 15, (10, 3))

array([[140.60274759, 109.42199063, 113.6195417 ],
       [107.55738631, 109.76676922,  95.21022933],
       [ 87.27884525, 109.08948024,  69.72747634],
       [111.10183086, 107.93220241,  91.164992  ],
       [102.83042964,  88.61691916,  86.00144176],
       [114.32584764, 102.86191484, 129.68135986],
       [139.0895092 , 110.25263328, 104.53998173],
       [125.40584388,  74.40871104,  82.61320877],
       [ 97.97738919, 105.85791764, 102.50356954],
       [102.76752789, 112.11558871, 101.09439513]])

(10, 3)

(3,)

In [69]:
# Your code
Z = (X - X.mean(0)) / X.std(0)

# Check results
Z.mean(0)
Z.std(0)

array([-5.55111512e-17, -4.10782519e-16, -1.52655666e-16])

array([1., 1., 1.])
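
One detail worth keeping in mind: by default `.std()` divides by $n$ (the population formula). If the sample standard deviation (dividing by $n - 1$) is wanted, the `ddof=1` argument can be passed. The sketch below uses its own random data with an arbitrary seed, not the exercise data.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for illustration only
X = rng.normal(100, 15, (10, 3))

# ddof=0 (the default) divides by n; ddof=1 divides by n - 1
Z_pop = (X - X.mean(0)) / X.std(0)
Z_sample = (X - X.mean(0)) / X.std(0, ddof=1)

# Each check uses the same ddof as the standardization,
# so both print ones (up to floating-point error)
print(Z_pop.std(0))
print(Z_sample.std(0, ddof=1))
```

Which version is appropriate depends on the analysis; the point is only that the standardization and the check should use the same `ddof`.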

#### B2.

You are provided with 5 repeated measurements for 10 subjects. Standardize within-subject measurements so the lowest score for each subject is $0$ and the highest score for each subject is $1$ (this is called min-max scaling).

In [70]:
np.random.seed(102)
X = np.random.normal(10, 2, (10, 5))

In [71]:
# Your code
X_min = X - X.min(1)[:, None]
X_min / X_min.max(1)[:, None]

array([[1.        , 0.71325746, 0.76430621, 0.        , 0.85770407],
       [1.        , 0.67267362, 0.51961998, 0.82050214, 0.        ],
       [1.        , 0.52106431, 0.29981168, 0.        , 0.26425368],
       [1.        , 0.        , 0.67100469, 0.45298642, 0.64009693],
       [1.        , 0.61209228, 0.        , 0.05829331, 0.74410704],
       [0.82484247, 0.        , 0.90160817, 1.        , 0.78350719],
       [0.07175082, 0.22281792, 0.39323828, 1.        , 0.        ],
       [0.50232717, 0.        , 1.        , 0.68584151, 0.35534415],
       [0.87802735, 0.        , 0.75951478, 0.31977867, 1.        ],
       [0.78779121, 0.92783191, 0.4221443 , 0.        , 1.        ]])
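
The same scaling can also be sketched with the formula $(X - \min) / (\max - \min)$ computed per row. The example uses small hand-made data (not the exercise data) so the result is easy to check by eye, and it assumes every subject has at least two distinct scores; otherwise the denominator would be zero.

```python
import numpy as np

# Hand-made data: 2 subjects, 3 measurements each
X = np.array([[2.0, 4.0, 6.0],
              [10.0, 10.5, 12.0]])

# keepdims=True gives shape (2, 1), which broadcasts along the rows
lo = X.min(1, keepdims=True)
hi = X.max(1, keepdims=True)

S = (X - lo) / (hi - lo)
print(S)
```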

### Fancy indexing

Numpy also provides non-standard kinds of indexing, which are jointly called _fancy indexing_.

#### Integer indexing

We can index with integers by providing sequences of positions along given axes.
Integer indices may be provided as lists or as numpy arrays, but not as tuples.

Tuples cannot be used for integer indexing, since a tuple is interpreted as a set of separate indexers, one per axis, as in standard indexing.

In [72]:
import numpy as np
X = np.arange(10, 20)
X[[2, 7, 0, 7]]

array([12, 17, 10, 17])

In [73]:
X[np.array([1, 5, 0])]

array([11, 15, 10])

In [74]:
X[(1, 5, 0)]

IndexError: too many indices for array
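
The error above arises because `X[(1, 5, 0)]` is read as `X[1, 5, 0]`, that is, three indices for a one-dimensional array. A minimal sketch of the difference, using a separate three-dimensional array chosen only for illustration:

```python
import numpy as np

X = np.arange(12).reshape(3, 2, 2)

# A tuple is unpacked into one index per axis
a = X[(2, 1, 0)]     # same as X[2, 1, 0]
b = X[2, 1, 0]

# A list is a single integer index along the first axis
c = X[[2, 1, 0]]     # picks sub-arrays 2, 1 and 0

print(a, b)
print(c.shape)
```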

In the general (multidimensional) case, integer indices have to specify the _coordinates_ of the elements to be extracted: one sequence of indices per axis, matched element by element.

In [75]:
X = np.arange(12).reshape(4, 3)
X

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [76]:
# Extract elements (0, 0) and (2, 1)
X[[0, 2], [0, 1]]

array([0, 7])

In [77]:
# Extract elements (0, 0) and (2, 0)
X[[0, 2], 0]

array([0, 6])

In [78]:
X = np.arange(12).reshape(3, 2, 2)
X

array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]]])

In [79]:
# Extract elements (0, 0, 1) and (2, 1, 1)
X[[0, 2], [0, 1], 1]
X[0,0,1]
X[2,1,1]

array([ 1, 11])

1

11

In [80]:
# Extract elements 0 and 2 along the main axis, element 1 along the second axis,
# and all elements along the third axis
X[[0, 2], 1]

array([[ 2,  3],
       [10, 11]])
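
Because index lists are matched element by element, they select individual coordinates rather than a rectangular block. When the full grid of rows and columns is needed, one option is the helper `np.ix_`. A short sketch comparing the two:

```python
import numpy as np

X = np.arange(12).reshape(4, 3)

# Paired index lists select individual coordinates: (0, 0) and (2, 1)
pairs = X[[0, 2], [0, 1]]

# np.ix_ builds index arrays that broadcast to the full grid:
# rows 0 and 2 crossed with columns 0 and 1
grid = X[np.ix_([0, 2], [0, 1])]

print(pairs)
print(grid)
```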

#### Boolean indexing

The second type of _fancy indexing_ is boolean indexing. The idea is to provide a _boolean mask_ that specifies which elements should be extracted (`True` values) and which should be discarded (`False` values).

In [81]:
[ 1, 2, 3 ]
[ True, False, True ]

[1, 2, 3]

[True, False, True]

In [82]:
X = np.arange(12).reshape(4, 3)
X

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [83]:
# We can provide a single mask along the main axis
X[[True, False, True, False]]

array([[0, 1, 2],
       [6, 7, 8]])

In [84]:
# We can mix standard indexing and pass a boolean mask to a second (or any later) axis
X[:, [True, False, False]]

array([[0],
       [3],
       [6],
       [9]])

In [85]:
# We can use boolean indexing in a similar way as integer indexing
# by masking coordinates along particular axes
X[[True, False, False, True], [True, False, False]]

array([0, 9])

In [86]:
# We can also pass a full boolean mask
# with the same shape as the original array
# in order to extract any elements we want
X
X > 7
X[~((X > 7) & (X < 11))]

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

array([[False, False, False],
       [False, False, False],
       [False, False,  True],
       [ True,  True,  True]])

array([ 0,  1,  2,  3,  4,  5,  6,  7, 11])
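
A boolean mask can be thought of as shorthand for the integer coordinates of its `True` values. The sketch below makes that link explicit with `np.nonzero`, using the same kind of `(4, 3)` array as above.

```python
import numpy as np

X = np.arange(12).reshape(4, 3)
mask = (X > 7) & (X < 11)

# np.nonzero returns one integer index array per axis
rows, cols = np.nonzero(mask)
print(rows, cols)

# Indexing with the mask and with the coordinates selects the same elements
print(X[mask])
print(X[rows, cols])
```

This also explains why masks passed along several axes behave like paired integer indices: each mask is first converted to the positions of its `True` values.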

### Indexing & updating

One of the main applications of indexing is to update particular elements of an array.
The general syntax for this is based on composition of an indexing operation and an assignment.

In [87]:
X = np.arange(12).reshape(6, 2)
X

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

In [88]:
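# Switch all even numbers to zero
X[X % 2 == 0] = 0
# Set elements (0, 1) and (2, 1) to 999
X[[0, 2], [1, 1]] = 999
# Set the whole first column to 7777
X[:, [True, False]] = 7777
# Set the whole third row (index 2) to 555
X[2, :] = 555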
X

array([[7777,  999],
       [7777,    3],
       [ 555,  555],
       [7777,    7],
       [7777,    9],
       [7777,   11]])
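
Two caveats are worth a small sketch. First, assigning through a fancy index changes the array in place, but the result of a fancy-indexing expression is a copy, so modifying that copy leaves the original untouched. Second, with repeated indices `+=` is applied only once per position; `np.add.at` is one way to accumulate every occurrence.

```python
import numpy as np

X = np.zeros(5, dtype=int)

# Assignment through a fancy index modifies X in place
X[[0, 2]] = 1

# A fancy-indexed result is a copy: changing Y does not change X
Y = X[[0, 2]]
Y[:] = 99

# With repeated indices, += is applied only once per position ...
X[[4, 4, 4]] += 1

# ... while np.add.at accumulates every occurrence
np.add.at(X, [3, 3, 3], 1)

print(X)
print(Y)
```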

### Exercises, part C: Indexing

#### C1.

You have an array of positions of some objects in 3-dimensional space. Moreover, you are provided with a position of a special landmark object.

Your task is to create a 1D boolean mask and use it to keep only the objects that lie within $0.5$ units of the landmark (that is, to filter out the objects that are farther away).

Remember that the (Euclidean) distance between two objects is defined as follows:

$$d(\text{object}_i, \text{object}_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$$
$$(x_i - x_j), (y_i - y_j), (z_i - z_j)$$

HINT. You may want to use `np.sqrt`. Alternatively, remember that square root is equivalent to $x^{1/2}$.

In [89]:
np.random.seed(103)
# Positions
P = np.random.uniform(0, 1, (1000, 3))
# Position of the landmark
landmark = np.array([0, 0, .7])

In [90]:
# Your code
D = np.sqrt(((P - landmark)**2).sum(1))
P[D <= 0.5].shape

(108, 3)
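
As a possible alternative to writing the formula by hand, `np.linalg.norm` with the `axis` argument computes the same row-wise Euclidean distances. The sketch uses three hand-made positions (not the exercise data) so the result can be checked by eye.

```python
import numpy as np

# Hand-made positions: two close to the landmark, one far away
P = np.array([[0.0, 0.0, 0.7],
              [0.3, 0.0, 0.7],
              [1.0, 1.0, 1.0]])
landmark = np.array([0, 0, .7])

# Row-wise Euclidean norm of the differences
D = np.linalg.norm(P - landmark, axis=1)
near = P[D <= 0.5]

print(D)
print(near.shape)
```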

#### C2.

You have a set of three measurements for 5000 subjects. 

Your task is to create a boolean mask extracting subjects with at least two measurements either lower than 6 or greater than 14. 
Use the mask to create a filtered dataset.

Count the number of subjects in the filtered dataset.

Use the same (or a similar) mask to extract all the values lower than 6 or greater than 14. Find the minimum and the maximum value.

HINT. Remember that you can aggregate boolean values. `True` is interpreted as `1` and `False` as `0`.

In [91]:
np.random.seed(104)
X = np.random.normal(10, 2, (5000, 3))

In [92]:
# Your code
extreme = ((X < 6) | (X > 14)).sum(1)
extreme
X[extreme >= 2].shape

array([0, 0, 1, ..., 0, 0, 0])

(24, 3)

In [93]:
E = X[(X < 6) | (X > 14)]
E.max()
E.min()

18.755788670932652

1.2317588137562616
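
The hint about aggregating boolean values extends beyond `.sum()`: `.any()` and `.all()` answer "at least one" and "every" questions directly. A small sketch on hand-made data (not the exercise data), with the same thresholds as in the exercise:

```python
import numpy as np

# Hand-made data: 3 subjects, 3 measurements each
X = np.array([[ 5.0, 15.0, 10.0],
              [10.0, 10.0, 10.0],
              [ 3.0, 10.0, 10.0]])

is_extreme = (X < 6) | (X > 14)

n_extreme = is_extreme.sum(1)      # number of extreme values per subject
any_extreme = is_extreme.any(1)    # at least one extreme value?
all_extreme = is_extreme.all(1)    # only extreme values?

print(n_extreme)
print(any_extreme)
print(all_extreme)
```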