![](img/logo.png)

# Numerical Python: NumPy

[NumPy](http://www.numpy.org/) is the fundamental package for scientific computing with Python. 
It contains arrays, math functions, linear algebra, random number capabilities and much more.

[![Numpy logo](https://numfocus.org/wp-content/uploads/2016/07/numpy-logo-300.png)](https://matplotlib.org/gallery/mplot3d/voxels_numpy_logo.html)

# Importing NumPy

The convention is `import numpy as np`. This loads the entire NumPy package once, and uses an alias `np` so that we don't pollute our code with too much `numpy`.

Some people like to do `from numpy import *`. This is frowned upon because it pollutes the namespace: it shadows the built-in `sum` and hides the fact that we are using specific `numpy` functions.

If you only need specific NumPy objects you can load them using `from numpy import array, ones` etc.
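
To make the name clash concrete, here is a small sketch (the variable names are illustrative and not part of the original notebook). The built-in `sum` and `np.sum` interpret their second positional argument differently, so silently replacing one with the other can change a result:

```python
import builtins

import numpy as np

values = [0, 1, 2, 3, 4]

# Built-in sum: the second argument is the start value, so -1 + 10 = 9.
builtin_total = builtins.sum(values, -1)

# np.sum: the second argument is the axis, so this sums along the last axis.
numpy_total = np.sum(values, -1)

print(builtin_total, numpy_total)  # 9 10
```

After `from numpy import *`, the bare name `sum` would refer to the NumPy version, which is why the explicit `np.` prefix is preferred.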

In [1]:
import numpy as np
print("Numpy version:", np.__version__)

Numpy version: 1.19.5


NumPy’s main object is the _homogeneous_ multidimensional array.

Dimensions are called axes. 

* A vector is an array with one dimension (one axis).
* A matrix is an array with two dimensions (two axes).

# Analyzing Patient Data

We are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets. 

The data sets are stored in comma-separated values (CSV) format: each row holds information for a single patient, and the columns represent successive days. 

The first few rows of our data file look like this:

> 0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1


## Loading data from file

We read the file into a NumPy array - the new data structure which is the center of all scientific Python.

In [2]:
fname = "../data/inflammation-01.csv"
data = np.loadtxt(fname, delimiter=',')
print(data)

[[0. 0. 1. ... 3. 0. 0.]
 [0. 1. 2. ... 1. 0. 1.]
 [0. 1. 1. ... 2. 1. 1.]
 ...
 [0. 1. 1. ... 1. 1. 1.]
 [0. 0. 0. ... 0. 2. 0.]
 [0. 0. 1. ... 1. 1. 0.]]


The expression `np.loadtxt(...)` is a function call that asks Python to run the function `loadtxt` that belongs to the `numpy` library. 

Here we pass two arguments to `np.loadtxt`: the name of the file we want to read, and the delimiter that separates values on a line. Both are given as strings.

We saved the output of `loadtxt` to the variable `data`.
When we `print(data)`, only a few rows and columns are shown (with `...` to omit elements when displaying big arrays).
To save space, NumPy displays numbers as `1.` instead of `1.0` when there's nothing interesting after the decimal point.
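
If the data file is not at hand, the same call can be tried on an in-memory string. This is a minimal sketch that assumes `io.StringIO` as a stand-in for the file; the two short rows are the first five values of the first two rows shown above:

```python
from io import StringIO

import numpy as np

# The first five values of the first two rows of the data file.
csv_text = "0,0,1,3,1\n0,1,2,1,2\n"

# loadtxt accepts a file name or a file-like object.
sample = np.loadtxt(StringIO(csv_text), delimiter=",")

print(sample.shape)  # (2, 5)
print(sample.dtype)  # float64
```

Note that the values are read as floats by default, which is why they are displayed with a trailing dot.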

### Other ways to load data from files

- [np.load](https://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html): loads arrays or [pickled](https://docs.python.org/3/library/pickle.html?highlight=pickle#module-pickle) objects from `.npy` files (saved with `np.save`) or `.npz` archives (saved with `np.savez`, or with `np.savez_compressed` for compressed archives).
- [np.genfromtxt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html): provides more sophisticated handling of, e.g. lines with missing values.

There are some more special I/O functions in [scipy.io](https://docs.scipy.org/doc/scipy/reference/io.html), for example for reading MATLAB data files and audio files, and [imageio](http://imageio.github.io) for reading image files.
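
The following sketch shows a round trip through `.npy` and `.npz` files, and how `np.genfromtxt` treats a missing value. The temporary directory, file names, and sample values are assumptions made for illustration:

```python
import os
import tempfile
from io import StringIO

import numpy as np

arr = np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    # A single array goes into a .npy file.
    npy_path = os.path.join(tmp, "arr.npy")
    np.save(npy_path, arr)
    loaded = np.load(npy_path)

    # Several named arrays go into a (compressed) .npz archive.
    npz_path = os.path.join(tmp, "arrays.npz")
    np.savez_compressed(npz_path, first=arr, second=arr * 2)
    with np.load(npz_path) as archive:
        second = archive["second"]

# genfromtxt fills the empty field in the second row with nan.
with_gap = np.genfromtxt(StringIO("1,2,3\n4,,6"), delimiter=",")
print(with_gap)
```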

## Manipulating Data

Now that our data is in memory, we can start doing things with it. First, let's ask what type of thing data refers to:

In [3]:
print(type(data))

<class 'numpy.ndarray'>


The output tells us that data currently refers to an N-dimensional array created by the NumPy library. We can see what its shape is like this:

In [4]:
print('number of dimensions:', data.ndim)
print('shape of array:', data.shape)
print('size of array:', data.size)

n_patients, n_days = data.shape

number of dimensions: 2
shape of array: (60, 40)
size of array: 2400


This tells us that `data` has 60 rows and 40 columns, which are 60 patients and 40 days. 

`data.shape` is a member of `data`, i.e. a value that is stored as part of an object, an attribute.

If we want to get a single value from the matrix, we must provide an index in square brackets, just as we do with a `list`, but with as many indices as the number of dimensions - `ndim` (two in this case):

In [5]:
print("first value in data", data[0,0])
print("middle value in data:", data[30, 20])
print("last value in data:", data[-1, -1])

first value in data 0.0
middle value in data: 13.0
last value in data: 0.0


Just like with `list` and `str`, if we have an M×N array in Python, its indices go from 0 to M-1 on the first axis and 0 to N-1 on the second. 

It takes a bit of getting used to, but one way to remember the rule is that the index is how many steps we have to take from the start to get the item we want.

> **In the Corner.**
> What may also surprise you is that when Python displays an array, it shows the element with index [0, 0] in the upper left corner rather than the lower left. This is consistent with the way mathematicians draw matrices, but different from the Cartesian coordinates. The indices are (row, column) instead of (column, row) for the same reason, which can be confusing when plotting data.


### __Slicing__
An index like `[30, 20]` selects a single element of an array, but we can select whole sections as well. 

For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:

In [6]:
print(data[0:4, 0:10])

[[0. 0. 1. 3. 1. 2. 4. 7. 8. 3.]
 [0. 1. 2. 1. 2. 1. 3. 2. 2. 6.]
 [0. 1. 1. 3. 3. 2. 6. 2. 5. 9.]
 [0. 0. 2. 0. 4. 2. 2. 1. 6. 7.]]


The slice `0:4` means, "Start at index 0 and go up to, but not including, index 4." Again, the up-to-but-not-including takes a bit of getting used to, but the rule is that the difference between the upper and lower bounds is the number of values in the slice.

We don't have to start slices at 0:

In [7]:
print(data[5:10, 0:10])

[[0. 0. 1. 2. 2. 4. 2. 1. 6. 4.]
 [0. 0. 2. 2. 4. 2. 2. 5. 5. 8.]
 [0. 0. 1. 2. 3. 1. 2. 3. 5. 3.]
 [0. 0. 0. 3. 1. 5. 6. 5. 5. 8.]
 [0. 1. 1. 2. 1. 3. 5. 3. 5. 8.]]


We also don't have to include the upper and lower bound on the slice. If we don't include the lower bound, Python uses 0 by default; if we don't include the upper, the slice runs to the end of the axis, and if we don't include either (i.e. if we just use ':' on its own), the slice includes everything:

In [8]:
small = data[:3, 36:].copy()
print('small is:')
print(small)

small is:
[[2. 3. 0. 0.]
 [1. 1. 0. 1.]
 [2. 2. 1. 1.]]


In [9]:
# the 5th column:
print(data[:, 4])
print(data[:, 4].shape)

[1. 2. 3. 4. 3. 2. 4. 3. 1. 1. 4. 3. 4. 1. 1. 1. 2. 2. 2. 1. 1. 2. 4. 1.
 2. 3. 4. 1. 3. 4. 3. 3. 2. 3. 4. 1. 2. 3. 2. 1. 2. 3. 4. 1. 3. 4. 4. 1.
 2. 3. 1. 3. 4. 1. 1. 2. 2. 4. 4. 3.]
(60,)


In [10]:
# the 7th row:
print(data[6, :])
print(data[6, :].shape)

[ 0.  0.  2.  2.  4.  2.  2.  5.  5.  8.  6.  5. 11.  9.  4. 13.  5. 12.
 10.  6.  9. 17. 15.  8.  9.  3. 13.  7.  8.  2.  8.  8.  4.  2.  3.  5.
  4.  1.  1.  1.]
(40,)


## Exercise 1

1. Print the last value of the array, that is, the value at the last row and last column.
1. Print the entire last row.
1. Print the entire last column.

**Tips**
- Get autocompletion by pressing _Tab_
- Get documentation by pressing _Shift+Tab_

## __Creating arrays__

There are [5 general mechanisms for creating arrays](https://docs.scipy.org/doc/numpy-dev/user/basics.creation.html):

1. Reading arrays from disk, either from standard or custom formats
1. Conversion from other Python structures (e.g., lists, tuples)
1. Intrinsic numpy array creation objects (e.g., `arange`, `ones`, `zeros`, etc.)
1. Creating arrays from raw bytes through the use of strings or buffers
1. Use of special library functions (e.g., `numpy.random`)
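
As a rough sketch, the five mechanisms can each be illustrated with one line. The variable names are made up for this example, and an in-memory `StringIO` stands in for a file on disk:

```python
from io import StringIO

import numpy as np

from_file = np.loadtxt(StringIO("1 2 3"))                     # 1. reading (here from an in-memory "file")
from_list = np.array([1, 2, 3])                               # 2. conversion from a Python list
from_factory = np.arange(3)                                   # 3. intrinsic creation function
from_bytes = np.frombuffer(b"\x01\x02\x03", dtype=np.uint8)   # 4. raw bytes
from_random = np.random.default_rng(0).random(3)              # 5. special library function

for arr in (from_file, from_list, from_factory, from_bytes, from_random):
    print(arr.dtype, arr.shape)
```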


Let's start by passing a list or a list of lists to the `np.array` function:

In [11]:
a = np.array([0, 1, 2, 3])
print(a)
print(type(a), a.dtype)

[0 1 2 3]
<class 'numpy.ndarray'> int64


The `dtype` attribute gives the data-type. 

We can force a specific data-type:

In [12]:
a = np.array([0, 1, 2, 3], dtype=np.uint64)
print(a)
print(type(a), a.dtype)

[0 1 2 3]
<class 'numpy.ndarray'> uint64


In [13]:
a = np.float16([0, 1, 2, 3])
print(a)
print(type(a), a.dtype)

[0. 1. 2. 3.]
<class 'numpy.ndarray'> float16


__Remember that the arrays must be homogeneous:__

In [14]:
np.array([1,'a'])

array(['1', 'a'], dtype='<U21')

NumPy has many [data-types](https://docs.scipy.org/doc/numpy-dev/user/basics.types.html); we'll focus on the default `int` and `float` today.
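
Homogeneity also affects numeric data. In this small sketch (with made-up values), mixing an `int` and a `float` upcasts the whole array, and assigning a `float` into an integer array truncates it:

```python
import numpy as np

mixed = np.array([1, 2.5])   # int and float: everything becomes float
print(mixed.dtype)           # float64

ints = np.array([1, 2, 3])
ints[0] = 2.7                # stored in an integer array, so the fraction is dropped
print(ints)                  # [2 2 3]
```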

We can create a 2D array from nested lists - make sure that all nested lists have the same length.

In [15]:
b = np.array(
    [
        [0, 1, 2], 
        [3, 4, 5]
    ]
)

# same as writing: b = np.array([[0, 1, 2], [3, 4, 5]])

print(b)
print(b.shape)

[[0 1 2]
 [3 4 5]]
(2, 3)


In [16]:
c = np.array(
    [
        [
            [1], 
            [2]
        ], 
        [
            [3], 
            [4]
        ]
    ]
)
print(c)
print(c.shape)

[[[1]
  [2]]

 [[3]
  [4]]]
(2, 2, 1)


Arrays are N-dimensional, so you can nest the lists as deeply as the number of dimensions you need:

In [17]:
d = np.array(
    [
        [
            [1, 2],
            [3, 4],
        ],
        [
            [5, 6],
            [7, 8],
        ],        
    ]
)
print(d)
print(d.shape)

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
(2, 2, 2)


Check the number of dimensions and the shape:

In [18]:
print(d.ndim)
print(d.shape)

3
(2, 2, 2)


__Sequence of numbers:__

Use `np.arange`, which is similar to `range`, but also accepts `float`s:

In [19]:
a = np.arange(10)  # similar to range(10)
print(a)

[0 1 2 3 4 5 6 7 8 9]


In [20]:
b = np.arange(-1.5, 9.5, 0.2) # start, end (exclusive), step
print(b)

[-1.5 -1.3 -1.1 -0.9 -0.7 -0.5 -0.3 -0.1  0.1  0.3  0.5  0.7  0.9  1.1
  1.3  1.5  1.7  1.9  2.1  2.3  2.5  2.7  2.9  3.1  3.3  3.5  3.7  3.9
  4.1  4.3  4.5  4.7  4.9  5.1  5.3  5.5  5.7  5.9  6.1  6.3  6.5  6.7
  6.9  7.1  7.3  7.5  7.7  7.9  8.1  8.3  8.5  8.7  8.9  9.1  9.3]


`np.linspace` is similar, but it accepts the required number of points rather than the required step.

In [21]:
c = np.linspace(0, 1, 6)   # start (inclusive), end (inclusive), num-points
print(c)

[0.  0.2 0.4 0.6 0.8 1. ]
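
The two functions are closely related. As a small sketch, `retstep=True` asks `np.linspace` to also return the step it used, and `endpoint=False` excludes the end value, which mimics the half-open behaviour of `np.arange`:

```python
import numpy as np

# Six points from 0 to 1 inclusive, plus the spacing between them.
points, step = np.linspace(0, 1, 6, retstep=True)
print(points, step)

# Five points from 0 up to, but not including, 1.
half_open = np.linspace(0, 1, 5, endpoint=False)
print(half_open)
```

With a non-integer step, rounding can make the length of an `np.arange` result hard to predict, so `np.linspace` is often the safer choice when the number of points matters.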


__Initializing an array with zeros / ones:__

In [22]:
print(np.zeros(7))
print()
print(np.ones((2,3)))
print()
print(np.zeros((1,7)))

[0. 0. 0. 0. 0. 0. 0.]

[[1. 1. 1.]
 [1. 1. 1.]]

[[0. 0. 0. 0. 0. 0. 0.]]


## __Operations__

We can also perform common arithmetic operations on arrays: add, subtract, multiply, divide, etc.

When you perform these operations on arrays, the operation is done on each individual element of the array, i.e. elementwise.

A new array is created and filled with the result.

Back to `data`, which we loaded above with `np.loadtxt`:

In [23]:
doubledata = data * 2

This creates a new array `doubledata` in which each element is twice the value of the corresponding element in `data`.

In [24]:
print('original:')
print(data[:3, 36:])
print('doubledata:')
print(doubledata[:3, 36:])

original:
[[2. 3. 0. 0.]
 [1. 1. 0. 1.]
 [2. 2. 1. 1.]]
doubledata:
[[4. 6. 0. 0.]
 [2. 2. 0. 2.]
 [4. 4. 2. 2.]]


Some more examples:

In [25]:
x = np.array([1,2,3])
print(x+1)
print(x*2)
print(x-5)
print(x/3)
print(x%2)

[2 3 4]
[2 4 6]
[-4 -3 -2]
[0.33333333 0.66666667 1.        ]
[1 0 1]


In [26]:
x = np.array([1,2,3])
y = np.array([4,5,6])
print(x+y)
print(x*y)

[5 7 9]
[ 4 10 18]


Note that the shapes of the arrays must be suitable:

In [27]:
x = np.array([1,2,3])
z = np.array([7,8,9,10])
print(x+z)

ValueError: operands could not be broadcast together with shapes (3,) (4,) 

In order to apply matrix multiplication (as in linear algebra, and not elementwise), we can use the `dot` method:

In [28]:
# note the reshape command that creates a new array in the given dimensions:
a = np.arange(1,10).reshape((3,3))
b = np.arange(1,4).reshape(3,1)
c = np.arange(4,7).reshape(1,3)
for elem in (a,b,c):
    print(elem, end='\n\n')

[[1 2 3]
 [4 5 6]
 [7 8 9]]

[[1]
 [2]
 [3]]

[[4 5 6]]



In [29]:
print(a.dot(b), end='\n\n')
print(c.dot(b))

[[14]
 [32]
 [50]]

[[32]]
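
As an alternative to the `dot` method, Python 3.5 and later provide the `@` operator for matrix multiplication. The sketch below rebuilds `a` and `b` from the cell above and contrasts `@` with the elementwise `*`:

```python
import numpy as np

a = np.arange(1, 10).reshape(3, 3)
b = np.arange(1, 4).reshape(3, 1)

via_dot = a.dot(b)       # matrix product, shape (3, 1)
via_operator = a @ b     # the same matrix product
elementwise = a * b      # NOT a matrix product: row i of a is multiplied by b[i]

print(via_operator)
print(elementwise)
```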


### __The speed of numpy__

These operations on arrays are much faster than the equivalent operations on, e.g., lists.

(`%timeit` is a magic command for measuring running time of single lines; use `%%timeit` to measure time of a whole cell).

Say we want to produce an array of the squares of the numbers from 0 to `n-1`, with `n = 100000`.
There are several ways to do it, using list comprehensions or `np.arange`, and using different approaches for squaring the numbers. 

Let's compare the different approaches in terms of running time:

In [30]:
n = 100000

%timeit [x**2 for x in range(n)]

%timeit np.array([x**2 for x in range(n)])

%timeit np.array([x**2 for x in np.arange(n)])

%timeit np.power(range(n), 2)

%timeit np.power(np.arange(n), 2)

%timeit np.arange(n)**2

28.5 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
35.2 ms ± 989 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
36.5 ms ± 950 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
10.1 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
254 µs ± 6.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
69.3 µs ± 731 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Just another comparison:

In [31]:
def add_lists(a,b):
    n = len(a)
    res = []
    for i in range(n):
        res.append(a[i] + b[i])
    return res
    
n = 10**5
a = range(n)
b = range(0, 2*n, 2)
%timeit add_lists(a,b) # same operation as list comprehension: [a[i] + b[i] for i in range(n)]

a = np.arange(n)
b = np.arange(0, 2*n, 2)
%timeit a + b

23.9 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.5 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
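
`%timeit` is only available inside IPython and Jupyter. Outside a notebook, a similar comparison can be sketched with the standard library's `timeit` module. The function names and the smaller `n` are choices made for this example, and the timings will differ between machines:

```python
import timeit

import numpy as np

n = 10_000  # smaller than in the cells above so the sketch runs quickly


def squares_list():
    """Square the numbers 0..n-1 with a list comprehension."""
    return [x**2 for x in range(n)]


def squares_numpy():
    """Square the numbers 0..n-1 with a vectorized NumPy operation."""
    return np.arange(n) ** 2


# Total time, in seconds, for 20 calls of each function.
list_seconds = timeit.timeit(squares_list, number=20)
numpy_seconds = timeit.timeit(squares_numpy, number=20)

print(f"list comprehension: {list_seconds:.4f} s for 20 runs")
print(f"NumPy:              {numpy_seconds:.4f} s for 20 runs")
```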


## Exercise 2

Calculate the square root of the data using `numpy`. 
Print the result for the first 5 columns of the first 3 rows.

## __Descriptive statistics__

Often, we want to do more than add, subtract, multiply, and divide values of data. 
We can also do descriptive statistics on arrays.
If we want to find the average inflammation for all patients on all days, for example, we can just calculate the mean value of the array.

In [32]:
data.mean()

6.14875

`mean` is a method of the array, i.e. a function that belongs to it in the same way that the member `shape` does. If variables are nouns, methods are verbs: they describe operations that can be performed on the object.
This is why `data.shape` doesn't need to be called (it's a member, not a method) but `data.mean()` does (it's a method).

It is also why we need empty parentheses for `data.mean()`: even when we're not passing in any arguments, parentheses are how we tell Python to call a function (what would happen if you just use `data.mean`?).

NumPy arrays have lots of useful methods:

In [33]:
print('maximum inflammation:', data.max())
print('minimum inflammation:', data.min())
print('standard deviation:', data.std())

maximum inflammation: 20.0
minimum inflammation: 0.0
standard deviation: 4.613833197118566


When analyzing data, though, we often want to look at marginal statistics, such as the maximum value per patient or the average value per day.
One way to do this is to select the data we want as a new temporary array, then ask it to do the calculation:

In [34]:
patient_0 = data[0, :] # 0 on the first axis, everything on the second
print('maximum inflammation for patient 0:', patient_0.max())

maximum inflammation for patient 0: 18.0


What if we need the maximum inflammation for all patients, or the average for each day? As the diagram below shows, we want to perform the operation across an axis:

![axis example](https://github.com/swcarpentry/python-novice-inflammation/raw/gh-pages/fig/python-operations-across-axes.png)

To support this, most array methods allow us to specify the axis we want to work on.
If we ask for the average across axis 0, we get:

In [35]:
print(data.mean(axis=0))

[ 0.          0.45        1.11666667  1.75        2.43333333  3.15
  3.8         3.88333333  5.23333333  5.51666667  5.95        5.9
  8.35        7.73333333  8.36666667  9.5         9.58333333 10.63333333
 11.56666667 12.35       13.25       11.96666667 11.03333333 10.16666667
 10.          8.66666667  9.15        7.25        7.33333333  6.58333333
  6.06666667  5.95        5.11666667  3.6         3.3         3.56666667
  2.48333333  1.5         1.13333333  0.56666667]


As a quick check, we can look at the shape of the result:

In [36]:
print(data.mean(axis=0).shape)

(40,)


The expression `(40,)` tells us we have a 1D array of length 40, so this is the average inflammation per day for all patients. If we average across axis 1, we get:

In [37]:
print(data.mean(axis=1))

[5.45  5.425 6.1   5.9   5.55  6.225 5.975 6.65  6.625 6.525 6.775 5.8
 6.225 5.75  5.225 6.3   6.55  5.7   5.85  6.55  5.775 5.825 6.175 6.1
 5.8   6.425 6.05  6.025 6.175 6.55  6.175 6.35  6.725 6.125 7.075 5.725
 5.925 6.15  6.075 5.75  5.975 5.725 6.3   5.9   6.75  5.925 7.225 6.15
 5.95  6.275 5.7   6.1   6.825 5.975 6.725 5.7   6.25  6.4   7.05  5.9  ]


which is the average inflammation per patient across all days.
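
The axis argument is easier to follow on an array small enough to check by hand. This sketch uses made-up readings for 2 patients over 3 days:

```python
import numpy as np

# Made-up readings: 2 patients (rows) x 3 days (columns).
readings = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])

per_day = readings.mean(axis=0)      # collapses the rows: one value per day
per_patient = readings.mean(axis=1)  # collapses the columns: one value per patient

print(per_day, per_day.shape)          # [2.5 3.5 4.5] (3,)
print(per_patient, per_patient.shape)  # [2. 5.] (2,)
```

The axis given to the method is the one that disappears from the shape of the result.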

## Exercise 3

On which day did each patient have the most inflammation?
Use `data.argmax` to find out.

# NumPy reference

We now go through some extra NumPy features worth mentioning.

### Exercise: creating arrays

Create an array with the inverse ($1/x$) of the even numbers lower than or equal to 100.

## Creating arrays - continued
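You can create an empty array of a certain shape (which is given as a `tuple`) or with the same shape as another array. "Empty" means uninitialized: the array holds whatever values happened to be in memory, so the zeros printed below are not guaranteed.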

In [38]:
d = np.empty((2, 4))
print(d)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [39]:
f = np.empty_like(d)
print(f)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]]


You can create an array full of 1s or 0s, or any single number:

In [40]:
a = np.ones((3, 3))
print(a)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


In [41]:
a = np.ones((3, 3)) * 5.134 
print(a)

[[5.134 5.134 5.134]
 [5.134 5.134 5.134]
 [5.134 5.134 5.134]]
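
Multiplying `np.ones` by a constant works, but `np.full` expresses the same intent directly. A brief sketch, with illustrative variable names:

```python
import numpy as np

filled = np.full((3, 3), 5.134)          # shape first, then the fill value
same_shape = np.full_like(filled, 7.0)   # same shape and dtype as `filled`

print(filled)
print(same_shape)
```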


In [42]:
b = np.zeros((2, 2))
print(b)

[[0. 0.]
 [0. 0.]]


Create the identity matrix:

In [43]:
c = np.eye(3)
print(c)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


Create matrices by specifying the diagonals:

In [44]:
d = np.diag([1, 2, 3, 4])
print(d)

[[1 0 0 0]
 [0 2 0 0]
 [0 0 3 0]
 [0 0 0 4]]


In [45]:
d = np.diag([1, 2, 3], 1) + np.diag([4, 5, 6], -1)
print(d)

[[0 1 0 0]
 [4 0 2 0]
 [0 5 0 3]
 [0 0 6 0]]
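
`np.diag` also works in the other direction: given a 2D array, it extracts a diagonal. The sketch below rebuilds the matrix from the cell above and reads its diagonals back:

```python
import numpy as np

d = np.diag([1, 2, 3], 1) + np.diag([4, 5, 6], -1)

upper = np.diag(d, 1)    # first diagonal above the main one
lower = np.diag(d, -1)   # first diagonal below the main one
main = np.diag(d)        # the main diagonal

print(upper)  # [1 2 3]
print(lower)  # [4 5 6]
print(main)   # [0 0 0 0]
```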


Create matrices by reshaping another matrix or array:

In [46]:
f = d.reshape((2, 8))
print(f)

[[0 1 0 0 4 0 2 0]
 [0 5 0 3 0 0 6 0]]
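
Reshaping only works when the total number of elements stays the same. As a small sketch, one dimension can be given as `-1` so that NumPy infers it, and an incompatible shape raises a `ValueError`:

```python
import numpy as np

values = np.arange(12)

three_rows = values.reshape(3, -1)   # -1 asks NumPy to infer the 4 columns
print(three_rows.shape)              # (3, 4)

try:
    values.reshape(5, -1)            # 12 elements cannot be split into 5 rows
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("cannot reshape:", err)
```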


### Exercise: new function

Skim through the documentation for `np.ravel`, and use this function to construct the array:
```py
[1 0 0 0 1 0 0 0 1]
```

## Random arrays

We'll set the random seed for reproducibility (i.e. to get the same result every time), but in a real-life application you should consider whether you want to set the seed.

In [47]:
np.random.seed(1231410) 
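
This notebook uses the global `np.random` functions. NumPy 1.17 and later also offer a `Generator` object, created with `np.random.default_rng`, which keeps the seed local instead of global. The sketch below is an alternative, not what the cells in this notebook do, and it produces different numbers than `np.random.seed` with the same seed:

```python
import numpy as np

# Two generators created with the same seed produce the same stream.
rng_a = np.random.default_rng(1231410)
rng_b = np.random.default_rng(1231410)

draws_a = rng_a.random(4)
draws_b = rng_b.random(4)
print(np.array_equal(draws_a, draws_b))  # True

# The generator has methods matching the distributions used in this section.
normal_draws = rng_a.normal(1, 0.5, size=(3, 3))
poisson_draws = rng_a.poisson(5, size=(3, 2, 4))
```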

Start with drawing a single random number uniformly between 0 and 1:

In [48]:
np.random.random()

0.09823820032265207

An array of four random numbers between 0 and 1:

In [49]:
a = np.random.random(size=4)
print(a)

[0.21876966 0.40108675 0.26202072 0.66656901]


A 3x3 matrix of random numbers drawn from a normal distribution with mean 1 and standard deviation 0.5:

In [50]:
b = np.random.normal(1, 0.5, size=(3, 3))
print(b)

[[1.26338629 1.14548389 1.13313127]
 [0.26813755 1.01098859 0.7670454 ]
 [0.8370115  0.4950699  1.3018399 ]]


Now draw a 3x2x4 array from a Poisson distribution with mean 5:

In [51]:
c = np.random.poisson(5, size=(3, 2, 4))
print(c)

[[[6 5 8 7]
  [0 3 4 3]]

 [[4 5 6 8]
  [7 4 6 5]]

 [[3 7 4 4]
  [5 6 1 6]]]


[Shuffle](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.shuffle.html) an array in place:

In [52]:
vec = np.arange(10)
np.random.shuffle(vec)
print(vec)

[5 7 9 1 6 2 4 8 0 3]


[Choose](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html) from an array with a specific probability for each value:

In [53]:
np.random.choice(a=[3, 5, 1, 2], size=15, replace=True,
                 p=[0.1, 0.2, 0.3, 0.4])

array([1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2])

### Exercise: random arrays

1) Create a 4x5x6 array with numbers drawn from a [geometric](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.geometric.html) distribution with `p=0.1` (the number of trials until success, where the probability of success is `p`).

2) Normalize a 5x5 random matrix - that is, first subtract the minimum and then divide by the new maximum.

## Broadcasting

A very powerful mechanism of NumPy arrays is [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
Broadcasting takes place when an operation is applied to two arrays of different shapes.
The rules are:

1. If the arrays differ in their number of dimensions, left-pad the shape of the array with fewer dimensions with 1s.
1. If the shapes differ in some dimension, stretch the array whose size is 1 in that dimension to match the other array.
1. If shapes still differ, raise an error.
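
The three rules can be written out as a short function that works on shapes alone. This is an illustrative sketch, not NumPy's actual implementation, and `broadcast_shape` is a hypothetical helper name; its result is compared against the shape NumPy reports through `np.broadcast`:

```python
import numpy as np


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes, following the three rules."""
    # Rule 1: left-pad the shorter shape with 1s.
    ndim = max(len(shape_a), len(shape_b))
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(padded_a, padded_b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)   # sizes agree, or b is stretched (rule 2)
        elif dim_a == 1:
            result.append(dim_b)   # a is stretched (rule 2)
        else:
            # Rule 3: sizes differ and neither is 1.
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)


print(broadcast_shape((3, 1), (3,)))                    # (3, 3)
print(np.broadcast(np.empty((3, 1)), np.empty(3)).shape)  # (3, 3)
print(broadcast_shape((3, 3, 1), (3, 1, 3)))            # (3, 3, 3)
```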

Some examples:

![broadcasting examples](http://www.astroml.org/_images/fig_broadcast_visual_1.png)

In [54]:
np.arange(3) + 5

array([5, 6, 7])

In [55]:
np.ones((3,3)) + np.arange(3)

array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])

In [56]:
np.arange(3).reshape((3, 1)) + np.arange(3)

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

In [57]:
np.ones((3,3)) + np.ones((3,2))

ValueError: operands could not be broadcast together with shapes (3,3) (3,2) 

In [58]:
np.ones((3,3,1)) + np.ones((3,1,3))

array([[[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]],

       [[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]],

       [[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]]])

### Exercise: Broadcasting

Given a 1D array `X`, calculate the differences between every pair of elements of `X` using broadcasting and save the result to an array `D`.

In [None]:
X = np.linspace(0, 1, 50)

In [None]:
# your code here

In [None]:
assert D.shape == (50, 50)
assert (D.diagonal() == 0).all()
assert (D[5,5] == D[-5,-5])

## Indexing and slicing

[Indexing and slicing](https://docs.scipy.org/doc/numpy-dev/user/basics.indexing.html) on 1D arrays is similar to Python lists:

In [59]:
a = np.arange(1, 10)
print(a)
print(a[3])
print(a[2:5])
print(a[-2:-5:-1])

[1 2 3 4 5 6 7 8 9]
4
[3 4 5]
[8 7 6]


As with lists, the values in an array can be changed in place.

However, the size of the array cannot be changed this way.

In [60]:
print(a)
a[3] = 0
print(a)

[1 2 3 4 5 6 7 8 9]
[1 2 3 0 5 6 7 8 9]


However, unlike list slicing, array slicing returns a **view** rather than a copy, so **changing a slice changes the original array**:

In [61]:
print("Lists:")
a = [1, 2, 3, 4, 5, 6]
b = a[2:5]
print('a is b?', a is b)
print(a)
b[0] = 0
print(a)

Lists:
a is b? False
[1, 2, 3, 4, 5, 6]
[1, 2, 3, 4, 5, 6]


In [62]:
print("Arrays:")
a = np.arange(1, 7)
b = a[2:5]
print('a is b?', a is b)
print(a)
b[0] = 0
print(a)

Arrays:
a is b? False
[1 2 3 4 5 6]
[1 2 0 4 5 6]


This is useful if you want to save the memory and CPU time required for copying arrays, but it can also be dangerous.
If you explicitly want a **copy** rather than a **view**, call the `copy` method.

In [63]:
print("Arrays:")
a = np.arange(1, 7)
b = a[2:5].copy()
print('a is b?', a is b)
print(a)
b[0] = 0
print(a)

Arrays:
a is b? False
[1 2 3 4 5 6]
[1 2 3 4 5 6]
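
The `is` operator prints `False` in all three cases above, so it cannot tell a view from a copy. One way to check, sketched here, is `np.shares_memory` or the `base` attribute of the array:

```python
import numpy as np

a = np.arange(1, 7)
view = a[2:5]            # basic slicing: shares memory with `a`
copied = a[2:5].copy()   # explicit copy: owns its own memory

print(np.shares_memory(a, view))    # True
print(np.shares_memory(a, copied))  # False
print(view.base is a)               # True: the view is based on `a`
print(copied.base is None)          # True: the copy has no base array
```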


Arrays support multidimensional indexing and slicing:

In [64]:
a = np.zeros((3,3))
a[1,1] = 1
print(a)

[[0. 0. 0.]
 [0. 1. 0.]
 [0. 0. 0.]]


In [65]:
a = np.diag([1,2,3])
print(a)
print()
print(a[:2,1:])

[[1 0 0]
 [0 2 0]
 [0 0 3]]

[[0 0]
 [2 0]]


In [66]:
y = np.arange(35).reshape(5,7)
print(y)

[[ 0  1  2  3  4  5  6]
 [ 7  8  9 10 11 12 13]
 [14 15 16 17 18 19 20]
 [21 22 23 24 25 26 27]
 [28 29 30 31 32 33 34]]


In [67]:
print(y[0])
print(y[0,:])

[0 1 2 3 4 5 6]
[0 1 2 3 4 5 6]


In [68]:
print(y[:,1])

[ 1  8 15 22 29]


Notice the difference between slicing by mentioning a specific column, and a range of columns:

In [69]:
print(y[:,1])

[ 1  8 15 22 29]


In [70]:
print(y[:,1:2])

[[ 1]
 [ 8]
 [15]
 [22]
 [29]]


Arrays can also be indexed using an array or list of indices; this is called **fancy indexing**:

In [71]:
a = np.arange(10, 30, 1)
print(a)
b = a[[1, 6, 9]]
print(b)
b = a[[1, 6, 9, 9, 9, 9, 9, 9]]
print(b)

[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29]
[11 16 19]
[11 16 19 19 19 19 19 19]


In [72]:
y = np.arange(35).reshape(5,7)
print(y, end='\n\n')
print(y[[1,2,2,3]])

[[ 0  1  2  3  4  5  6]
 [ 7  8  9 10 11 12 13]
 [14 15 16 17 18 19 20]
 [21 22 23 24 25 26 27]
 [28 29 30 31 32 33 34]]

[[ 7  8  9 10 11 12 13]
 [14 15 16 17 18 19 20]
 [14 15 16 17 18 19 20]
 [21 22 23 24 25 26 27]]


In [73]:
print(y[ [0, 2], [1, 2] ])

[ 1 16]


If one of the indexing lists is shorter than the other, NumPy will attempt broadcasting:

In [74]:
print(y[ [0, 2], [1] ])

[ 1 15]


But broadcasting isn't always possible:

In [75]:
y[[0,2,4], [0,1]]

IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,) 

The broadcasting mechanism permits index arrays to be combined with scalars for other indices. The effect is that the scalar value is used for all the corresponding values of the index arrays:

In [76]:
print(y[[0,2,4], 1])

[ 1 15 29]
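
Two points about fancy indexing are worth a small sketch. First, unlike slicing, it returns a copy. Second, two index lists are paired element by element; to select every combination of the given rows and columns instead, one option is `np.ix_`:

```python
import numpy as np

a = np.arange(10, 30)
picked = a[[1, 6, 9]]    # fancy indexing returns a copy, not a view
picked[0] = -1
print(a[1])              # 11: the original array is unchanged

y = np.arange(35).reshape(5, 7)
pairs = y[[0, 2], [1, 2]]            # elements (0, 1) and (2, 2)
block = y[np.ix_([0, 2], [1, 2])]    # rows 0 and 2 crossed with columns 1 and 2

print(pairs)   # [ 1 16]
print(block)
```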


## Boolean or mask arrays

You can create boolean arrays by using the comparison operators:

In [77]:
a = np.random.random(size=(4, 4))
print(a, end='\n\n')
b = a < 0.5
print(b, end='\n\n')
print(b.sum())

[[0.99935785 0.38591005 0.68143347 0.03346332]
 [0.73650571 0.29572174 0.04476133 0.2207402 ]
 [0.78225989 0.20521312 0.34295919 0.04949095]
 [0.31758071 0.19293918 0.57631051 0.77131526]]

[[False  True False  True]
 [False  True  True  True]
 [False  True  True  True]
 [ True  True False False]]

10


These boolean arrays can be used for indexing:

In [78]:
c = np.arange(16).reshape((4, 4))
np.random.shuffle(c)
print(c, end='\n\n')
print(c > 5, end='\n\n')
print(c[c > 5])

[[12 13 14 15]
 [ 0  1  2  3]
 [ 8  9 10 11]
 [ 4  5  6  7]]

[[ True  True  True  True]
 [False False False False]
 [ True  True  True  True]
 [False False  True  True]]

[12 13 14 15  8  9 10 11  6  7]
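
Besides indexing, a boolean mask is handy for counting and locating the elements that satisfy a condition. A short sketch with made-up values:

```python
import numpy as np

values = np.array([3, 8, 1, 9, 5, 7])
mask = values > 4

count = np.count_nonzero(mask)     # how many elements satisfy the condition
fraction = mask.mean()             # True counts as 1, so the mean is the fraction
positions = np.flatnonzero(mask)   # indices of the True entries

print(count)      # 4
print(positions)  # [1 3 4 5]
```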


### Exercise: mask arrays

Given a 1D array, negate (i.e. turn to negative) all elements that are greater than 3 and less than 8, in place (i.e. without creating a new array).

Note 1: you cannot use `and` because it acts on two booleans, whereas we want to do an elementwise "and". This is done using the operator `&`. 

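Note 2: `&` is a Python bitwise operator that binds more tightly than the comparison operators (such as `<` and `>`), so each comparison must be wrapped in parentheses before the two are combined with `&`.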

In [80]:
Z = np.arange(11)
print(Z)

[ 0  1  2  3  4  5  6  7  8  9 10]


In [82]:
# Your code here
print(Z)

[ 0  1  2  3 -4 -5 -6 -7  8  9 10]


# "Losing Your Loops": Fast Numerical Computing with NumPy 

From the PyCon 2015 conference, a [presentation](https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015) by [Jake VanderPlas](http://vanderplas.com).

Also available on [YouTube](https://www.youtube.com/watch?v=EEUXKG97YRw).

# References

- [NumPy MedKit](http://mentat.za.net/numpy/numpy_advanced_slides/) for much more on indexing, slicing, and other advanced NumPy tricks.
- [NumPy tutorial](https://github.com/rougier/numpy-tutorial) with some exercises.
- [NumPy basics](https://docs.scipy.org/doc/numpy/user/basics.html)
- [NumPy for MATLAB users](https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html)
- [100 NumPy exercises](https://github.com/rougier/numpy-100)

# Colophon
This notebook was written by [Yoav Ram](http://python.yoavram.com).

The notebook was written using [Python](http://python.org/) 3.7.
Dependencies listed in [environment.yml](../environment.yml).

This work is licensed under a CC BY-NC-SA 4.0 International License.

![Python logo](https://www.python.org/static/community_logos/python-logo.png)