# NumPy tutorial

[Numerical Python](https://numpy.org) is one of the most fundamental tools in every data miner's toolbox. Serious data pre-processing and transformation is hardly possible without understanding `NumPy` and its most commonly used methods. The goal of this tutorial is to familiarize students with this awesome library.

## Introduction

Python is among the slower general-purpose languages, and data mining requires a lot of processing power. Why would we ever want to use Python for data mining? The answer is really simple: we do not. We use Python only as an interface to packages that contain very efficient and optimized programs, such as numpy, scipy, sklearn, tensorflow, pytorch, keras... The list is long. Python is very flexible and simple. We like its syntax and the ease of combining it with programs written in other languages, not necessarily its efficiency. In the previous tutorial we have already seen some numpy structures. When we imported the toy data sets, the data was stored in something called `ndarray`. In fact, 99% of `NumPy` is about this very data structure and the operations on it. `NumPy` stores everything in multidimensional arrays and vectorizes all operations on these arrays.

Contrary to Python lists, NumPy arrays represent tensors (e.g. a 1st rank tensor is a vector, a 2nd rank tensor is a matrix; these are tensors in the mathematical sense, not to be confused with the TensorFlow tensor data structure). A Python list is just a list of things; in particular it can be a list of lists, with no further restrictions. A tensor has some limitations:
- every element must be of the same type and size
- if an array has arrays, they must match as well

After all, this:

\begin{matrix} 1 & 2 & 3 \\ 4 & 5 & 6 & 8 \\ 7 & 8 &  \end{matrix}

is not a matrix.

While this:



In [None]:
lst = [[1,2,3],[4,5,6,8],[7,8]]

is a list.

For a motivating example, let's compare the speed of computing the average of 10 million random numbers stored in a list vs. an array.

First, we are going to generate a list of 10 million random numbers. We will use a function from the numpy package named `randint`. As you might suspect, the function can be used to generate a random integer. In fact, this function is flexible enough to create a tensor of any shape filled with random integers.

In [2]:
import numpy as np
from numpy.random import randint

randoms = randint(low=0, high=1000, size=(1000,1000))
randoms = randint(low=0, high=1000, size=(100,100,100))
randoms = randint(low=0, high=1000, size=(10,10000))
randoms = randint(low=0, high=1000, size=(10000000))
lst = list(randoms)

I hope you're not surprised that the size is not just a single number. An n-th rank tensor requires n numbers to describe its shape. A vector has just its length. A matrix has a number of rows and a number of columns. A cuboid has three dimensions, and so on. So, in the second to last line we generated a vector of 10 million random numbers. In the last line we converted it to a list. We can get the length of this list with the built-in `len` function.

In [None]:
len(lst)

10000000

We expect the len function to return an integer. So it does. If we try to check the length of a numpy array we are going to get only the size of the first dimension. If we want to get a full description of the tensor shape, we can use the `shape` attribute.

In [None]:
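# this array nominally needs about 8 GB of memory (10**9 float64 values);
# use a smaller shape if you get a MemoryError
sample = np.zeros((1000,1000,1000))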

print(len(sample))
print(sample.shape)

1000
(1000, 1000, 1000)


Let's start the calculation of an average - we will use the cell magic here to compare the solutions.

In [None]:
%%time

# old-school iteration
summ = 0
for i in range(len(lst)):
    summ += lst[i]
    
print(f'Average = {summ/len(lst)}')

Average = [499.5150326]
CPU times: user 15.6 s, sys: 130 ms, total: 15.7 s
Wall time: 15.7 s


In [None]:
%%time

# using built-ins sum() and len()
print(f'Average = {sum(lst)/len(lst)}')

Average = [499.5150326]
CPU times: user 7.89 s, sys: 120 ms, total: 8.01 s
Wall time: 8.01 s


In [None]:
%%time

# using NumPy
print(f'Average = {np.mean(randoms)}')
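One detail worth knowing about the comparison above: `list(randoms)` produces a list of NumPy scalar objects rather than plain Python integers, which tends to make the pure-Python versions slower than they need to be. The sketch below illustrates this on a smaller array, using the standard `timeit` module instead of the cell magic (both choices are assumptions made to keep the example quick and runnable outside a notebook). The timings are only indicative and will differ between machines.

```python
import timeit
import numpy as np

values = np.random.randint(low=0, high=1000, size=100_000)

as_numpy_scalars = list(values)     # elements are NumPy integer scalars
as_python_ints = values.tolist()    # elements are plain Python ints

print(type(as_numpy_scalars[0]), type(as_python_ints[0]))

# Both routes compute the same average
mean_list = sum(as_python_ints) / len(as_python_ints)
mean_numpy = values.mean()
print(mean_list, mean_numpy)

# Rough timings (10 repetitions each); numbers vary between machines
for label, stmt in [
    ("sum over NumPy scalars", lambda: sum(as_numpy_scalars)),
    ("sum over Python ints", lambda: sum(as_python_ints)),
    ("np.sum on the array", lambda: np.sum(values)),
]:
    print(f"{label}: {timeit.timeit(stmt, number=10):.4f} s")
```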

Let's see how to create an array and what happens when we start messing with the types and sizes of objects.

In the previous examples we already saw how to create an array of random numbers (and an array that consists of zeros).
We can also create a numpy array from a Python list, like this:

In [None]:
a = np.array([1, 2, 3, 4, 5])







We have also seen how to get the exact shape of an array. In fact, there are more descriptors for an array.

In [None]:
print(f"""
Shape (sizes of dimensions): {a.shape}
Number of dimensions: {a.ndim}
Length (number of elements): {len(a)}
Size (number of nested elements): {a.size}
Type : {type(a)}
Data type (type of array elements): {a.dtype}
""")


Shape (sizes of dimensions): (5,)
Number of dimensions: 1
Length (number of elements): 5
Size (number of nested elements): 5
Type : <class 'numpy.ndarray'>
Data type (type of array elements): int64



Now let's see how the same descriptors apply to two- and three-dimensional arrays. Note that `len(a)` reports only the size of the first dimension, while `a.size` counts every element.

In [None]:
a = np.array([
    [1, 2, 3, 4, 5],
    [1, 4, 9, 16, 25]
])

print(f"""
Shape (sizes of dimensions): {a.shape}
Number of dimensions: {a.ndim}
Length (number of elements): {len(a)}
Size (number of nested elements): {a.size}
Type : {type(a)}
Data type (type of array elements): {a.dtype}
""")


Shape (sizes of dimensions): (2, 5)
Number of dimensions: 2
Length (number of elements): 2
Size (number of nested elements): 10
Type : <class 'numpy.ndarray'>
Data type (type of array elements): int64



In [3]:
a = np.array([
    [
    [1, 2, 3, 4, 5],
    [1, 1, 2, 3, 5]
    ],
    [
    [1, 2, 3, 4, 5],
    [1, 4, 9, 16, 25]
    ]
])

print(f"""
Shape (sizes of dimensions): {a.shape}
Number of dimensions: {a.ndim}
Length (number of elements): {len(a)}
Size (number of nested elements): {a.size}
Type : {type(a)}
Data type (type of array elements): {a.dtype}
""")


Shape (sizes of dimensions): (2, 2, 5)
Number of dimensions: 3
Length (number of elements): 2
Size (number of nested elements): 20
Type : <class 'numpy.ndarray'>
Data type (type of array elements): int64
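The relationship between these descriptors can be checked directly. The following is a small sketch on an array of zeros with the same shape as the last example (the array contents are irrelevant here, only the shape matters).

```python
import numpy as np

a = np.zeros((2, 2, 5))

print(len(a))             # size of the first dimension only
print(a.shape[0])         # the same number
print(a.size)             # total number of elements
print(np.prod(a.shape))   # product of all dimension sizes, equal to a.size
print(a.ndim == len(a.shape))
```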



Array elements should be of the same type. Let's see what happens if we mix two or more types.

In [4]:
a = np.array([1, 2, 'mary', 'had', 2.5, 'lambs'])
a

array(['1', '2', 'mary', 'had', '2.5', 'lambs'], dtype='<U32')

In [None]:
a.dtype

dtype('<U21')

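Strings stored in an array have a fixed maximum length, determined when the array is created. Let's see what happens when we assign a longer string to one of the elements.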

In [None]:
a = np.array(['mary', 'had', 'a', 'little', 'lamb'])
a.dtype

dtype('<U6')

In [None]:
a[4] = 'and very very very very long snake'
a

array(['mary', 'had', 'a', 'little', 'and ve'], dtype='<U6')
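The long string was silently cut to six characters because `<U6` leaves no room for more. If truncation is not acceptable, two possible workarounds are sketched below: converting to a wider string type, or to `object` dtype. The width of 40 is an arbitrary choice for this example.

```python
import numpy as np

a = np.array(['mary', 'had', 'a', 'little', 'lamb'])
long_text = 'and very very very very long snake'

# Option 1: widen the fixed-width string type before assigning
wide = a.astype('<U40')
wide[4] = long_text

# Option 2: store references to ordinary Python strings
objects = a.astype(object)
objects[4] = long_text

print(wide[4])
print(objects[4])
print(a[4])   # the original array is untouched, astype made copies
```

Arrays of `object` dtype give up most of the speed benefits of `NumPy`, so the first option is usually preferable when a sensible maximum length is known.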

After an array has been created, it can be reshaped to any shape with the same total number of elements. The `T` attribute transposes an array (changes rows into columns and vice versa).

In [None]:
a = np.array(list(range(12)))

a.shape = (3,4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [None]:
a = a.reshape(6, 2)
a

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

In [None]:
a.T

array([[ 0,  2,  4,  6,  8, 10],
       [ 1,  3,  5,  7,  9, 11]])
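A few properties of reshaping are easy to miss, so here is a short sketch of them: one dimension can be given as `-1` and is then inferred, a shape with the wrong number of elements is rejected, and in this example the reshaped and transposed arrays share their data with the original.

```python
import numpy as np

a = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size
b = a.reshape(-1, 2)
print(b.shape)

# A shape whose product is not 12 is rejected
try:
    a.reshape(5, 3)
except ValueError as err:
    print(f"ValueError: {err}")

# Here reshape and .T return views, so a write through b is visible in a
b[0, 0] = 100
print(a[0])
print(np.shares_memory(a, b.T))
```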

## Creating arrays

The easiest way to create a 1-d array is to use a list. If you want a 2-d array, you use a list of lists. 3-d arrays are created using a list of lists of lists. You get the gist.

In [None]:
a_1d = np.array([1, 2, 3, 4])

a_2d = np.array([
    [1, 2, 3, 4],
    [1, 4, 9, 16],
    [1, 8, 27, 64]
])

a_3d = np.array([
    [
        [0, 0],
        [0, 1],
    ],
    [
        [1, 0],
        [1, 1],
    ],
])

There are utility functions in the `np` module for creating popular types of arrays:
- an array filled with zeros
- an array filled with ones
- an array filled with any value
- an array of consecutive (or stepped) values
- an array filled with random values
- a diagonal array

In [None]:
np.zeros(shape=(3,3))

In [None]:
np.zeros(shape=(3,5))

In [None]:
np.ones(10)

In [None]:
np.ones(10, dtype=np.int32)

In [None]:
np.full(shape=(4,4), fill_value='empty')

In [5]:
??np.arange

In [None]:
np.arange(-2, 2, 0.5)

In [None]:
np.random.randn(3, 3, 2)

In [None]:
np.random.randint(low=1, high=7, size=10)

In [None]:
np.eye(5)

In [None]:
np.eye(5, 8)

## Indexing arrays

Arrays in `NumPy` are 0-indexed. Indexing 1-d arrays is very easy: just follow the pattern *start*:*end*:*step*, where *end* is exclusive.

In [None]:
a = np.arange(0, 10)

print(f"""{a}

First element: {a[0]}
First three elements: {a[0:3]}
Last element: {a[len(a)-1]} and {a[-1]}
Even elements: {a[::2]}
""")

[0 1 2 3 4 5 6 7 8 9]

First element: 0
First three elements: [0 1 2]
Last element: 9 and 9
Even elements: [0 2 4 6 8]



Indexing n-dim arrays is a bit trickier. Keep in mind that for a 2-d array axis 0 refers to rows and axis 1 refers to columns. For higher-dimensional arrays try to build the following intuition:
- 1-d: a row of values
- 2-d: a matrix (rows and columns) of values
- 3-d: a row of matrices
- 4-d: a matrix of matrices
- and so on...

In [None]:
a = np.array([
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
])

print(f"""{a}

Element at second row, third column: {a[1,2]}
Entire first row: {a[0]}
Entire first row as 2-d array: {a[0, None]}
First and second rows, last column: {a[:2,-1]}
""")

The same goes for all n-dim arrays. For instance, let's extract the first column (all rows) of the first matrix. You can also use indexing to assign a value to multiple array cells at once.

In [None]:
a_3d

In [None]:
a_3d[0, :,0]

In [None]:
a_3d[1, :, 1] = -1

print(a_3d)
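Since the cells above are shown without their output, here is a self-contained sketch of the same operations on a copy of `a_3d`. It also illustrates that an integer index removes an axis while a slice keeps it.

```python
import numpy as np

a_3d = np.array([
    [[0, 0], [0, 1]],
    [[1, 0], [1, 1]],
])

# first matrix, all rows, first column
first_column = a_3d[0, :, 0]
print(first_column)

# an integer index removes an axis, a slice keeps it
print(a_3d[0].shape)
print(a_3d[0:1].shape)

# assigning a scalar to a slice fills every selected cell
a_3d[1, :, 1] = -1
print(a_3d[1])
```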

## Basic operations on arrays

Array operations are vectorized, so they tend to be very quick. By default, `NumPy` performs element-wise array operations, so `a * b` multiplies corresponding elements. For matrix multiplication, use the `@` operator as shown below.

In [None]:
a = np.arange(0, 12)
b = np.arange(12, 24)

a.shape = b.shape = 3, 4

In [None]:
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [None]:
b

array([[12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [None]:
a + b

array([[12, 14, 16, 18],
       [20, 22, 24, 26],
       [28, 30, 32, 34]])

In [None]:
b - a

array([[12, 12, 12, 12],
       [12, 12, 12, 12],
       [12, 12, 12, 12]])

In [None]:
a * 10

array([[  0,  10,  20,  30],
       [ 40,  50,  60,  70],
       [ 80,  90, 100, 110]])

In [None]:
a @ b.T

array([[ 86, 110, 134],
       [302, 390, 478],
       [518, 670, 822]])
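To make the difference between `*` and `@` concrete, the sketch below rebuilds the matrix product with explicit dot products of rows and compares it with the result of `@`. It uses `reshape` instead of assigning to `shape`, which is only a stylistic choice.

```python
import numpy as np

a = np.arange(0, 12).reshape(3, 4)
b = np.arange(12, 24).reshape(3, 4)

elementwise = a * b      # same shape as the inputs
matrix = a @ b.T         # (3, 4) @ (4, 3) gives (3, 3)

# entry [i, j] of a @ b.T is the dot product of row i of a and row j of b
manual = np.array([[np.sum(a[i] * b[j]) for j in range(3)] for i in range(3)])

print(elementwise.shape, matrix.shape)
print(np.array_equal(matrix, manual))

# a @ b fails because the inner dimensions (4 and 3) do not match
try:
    a @ b
except ValueError as err:
    print("ValueError:", err)
```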

## Homework

### Calculating sliding averages

Given an array of daily measurements, create a new array with averages computed over each pair of consecutive days. Compare the execution time of various solutions.

In the first solution, use a list comprehension to create an array of pairs of measurements. Then use the numpy `average` function to calculate the averages over the list of pairs.

In [64]:
measurements = np.arange(100)

In [51]:
??np.average

In [112]:
%%timeit




In the second solution you are still supposed to create an array of pairs of measurements. This time use the numpy vstack to stack two duplicates of the measurements array. Cut the last day from the first duplicate, cut the first day from the second duplicate. Remember to check the validity of the solution, use transposition if necessary.

In [52]:
??np.vstack

In [54]:
??np.transpose

In [56]:
a = np.array([[1,2,3],[2,3,4]])
print(a)
print(a.T)

[[1 2 3]
 [2 3 4]]
[[1 2]
 [2 3]
 [3 4]]


In [111]:
%%timeit




In your third solution use the discrete convolution function. This time you calculate the averages with the convolution directly, you don't need to use average function here.

An example of discrete convolution:

Given a filter $G=[g_1,g_2,g_3]$ and a vector (1st rank tensor) $V=[v_1,v_2,v_3,v_4,v_5,v_6]$, the convolution of the vector $V$ and the filter $G$ is calculated as follows:

$$\begin{aligned}
V * G = [\,& g_1\cdot v_1+g_2\cdot v_2+g_3\cdot v_3, \\
& g_1\cdot v_2+g_2\cdot v_3+g_3\cdot v_4, \\
& g_1\cdot v_3+g_2\cdot v_4+g_3\cdot v_5, \\
& g_1\cdot v_4+g_2\cdot v_5+g_3\cdot v_6\,]
\end{aligned}$$

Keep in mind that the filter can be of any size, as long as each component of its shape is less than or equal to the corresponding component of the tensor's shape. The dimensionality of the filter has to match the dimensionality of the tensor (a filter of lower dimensionality is interpreted as if it had the same dimensionality as the tensor). Note that `np.convolve` reverses the filter before sliding it and that the formula above corresponds to `mode='valid'`, so the formula matches its output exactly only for symmetric filters, such as one whose weights are all equal. Notice the similarity between a sliding window and the convolution: in both cases you put a "window" or a "filter" on the data and move it through the data. In the example above the window moves like this:

[**_1,2,3_**,4,5,6]

[1,**_2,3,4_**,5,6]

[1,2,**_3,4,5_**,6]

[1,2,3,**_4,5,6_**]
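The sliding-window formula can be compared with what `np.convolve` actually returns. The sketch below uses a deliberately non-symmetric filter (the values of `V` and `G` are made up for illustration) to show that `np.convolve` reverses the filter, and that `np.correlate`, or convolving with the reversed filter, reproduces the formula as written.

```python
import numpy as np

V = np.array([1, 2, 3, 4, 5, 6])
G = np.array([1, 2, 3])

# the sliding window exactly as written in the formula
manual = np.array([np.sum(G * V[i:i + len(G)]) for i in range(len(V) - len(G) + 1)])

# np.convolve reverses the filter before sliding it ...
flipped = np.convolve(V, G, mode='valid')

# ... so reversing G first (or using np.correlate) reproduces the formula
unflipped = np.convolve(V, G[::-1], mode='valid')
correlated = np.correlate(V, G, mode='valid')

print(manual)
print(flipped)
print(unflipped, correlated)
```

For a symmetric filter the reversal changes nothing, which is why it does not matter for an averaging window.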


In [115]:
??np.convolve

In [110]:
%%timeit




In the fourth solution use the numpy `insert` function to add a 0 at the beginning of the measurements array (for padding purposes). Then calculate the cumulative sums of the measurements. Finally, use the cumulative sum array to calculate the averages (it is a combination of picking elements in the correct order and a simple numerical operation).

hint:

measurements = [1,**2,3**,4,5,6]

cumulative_sums = [0,**1**,3,**6**,10,15,21]

$$c_1 = x_0+x_1$$
$$c_3=x_0+x_1+x_2+x_3$$
$$c_3-c_1 = x_2+x_3$$

measurements = [1,2,**3,4**,5,6]

cumulative_sums = [0,1,**3**,6,**10**,15,21]

$$c_2 = x_0+x_1+x_2$$
$$c_4=x_0+x_1+x_2+x_3+x_4$$
$$c_4-c_2 = x_3+x_4$$



In [70]:
??np.insert

In [94]:
??np.cumsum

In [109]:
%%timeit




## Broadcasting

This is by far the most important concept in `NumPy`. Broadcasting is the automatic expansion of arrays so that their shapes match in an element-wise operation.

Let's start with the simplest example.

In [None]:
a = np.arange(10)

a + 10

The same happens for 2-d arrays

In [None]:
a = np.array([
    [1, 2, 3, 4],
    [1, 4, 9, 16],
])

b = np.array([
    [0.1, 0.2, 0.3, 0.4]
])

In [None]:
a

In [None]:
b

In [None]:
a + b

In [None]:
a.shape, b.shape

The simple rule for broadcasting is the following:

If we want to operate on two arrays `a` and `b`:
- moving backwards from the last dimension of each array, we check whether the two dimensions are equal or one of them equals 1 (a dimension missing from the shorter shape counts as 1)
- if every pair of dimensions passes this check, arrays `a` and `b` are compatible.
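The rule can be written down as a few lines of plain Python. The helper below, `broadcast_shape`, is a hypothetical function written for this tutorial (it is not part of `NumPy`); it follows the rule as stated, treating dimensions missing from the shorter shape as 1, and is checked against what `NumPy` does for several pairs of shapes.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or None if the shapes are incompatible."""
    result = []
    # walk both shapes from the last dimension backwards
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        dim_a = shape_a[-i] if i <= len(shape_a) else 1   # missing dims count as 1
        dim_b = shape_b[-i] if i <= len(shape_b) else 1
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return None
        # a dimension of size 1 is stretched to match the other one
        result.append(dim_b if dim_a == 1 else dim_a)
    return tuple(reversed(result))

pairs = [((2, 4), (1, 4)), ((3, 4), (3, 1)), ((3, 1, 4), (2, 1)), ((3, 4), (3,))]
for shape_a, shape_b in pairs:
    expected = broadcast_shape(shape_a, shape_b)
    if expected is None:
        print(shape_a, shape_b, "are incompatible")
    else:
        actual = (np.empty(shape_a) + np.empty(shape_b)).shape
        print(shape_a, shape_b, "give", expected, "matches NumPy:", actual == expected)
```

The last pair fails because the trailing dimensions are 4 and 3; reshaping the second array to `(3, 1)` would make it compatible.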

In [None]:
np.random.seed(1234)

a = np.random.randint(low = 1, high = 10, size = (3, 4))
a

In [None]:
b = np.random.randint(low = 1, high = 10, size = (3, 1))
b

In [None]:
a + b

In [None]:
np.random.seed(1234)

a = np.random.randint(low = 1, high = 10, size = (3, 1, 4))
a

In [None]:
b = np.random.randint(low = 1, high = 10, size = (2, 1))
b

In [None]:
a + b

Sometimes it is useful to manually modify the shape of an array. This can be done by indexing with `np.newaxis` (which is simply an alias for `None`).

In [None]:
a = np.array([1, 2, 3, 5, 7, 11, 13])

In [None]:
a

array([ 1,  2,  3,  5,  7, 11, 13])

In [None]:
a[:, np.newaxis]

array([[ 1],
       [ 2],
       [ 3],
       [ 5],
       [ 7],
       [11],
       [13]])

In [None]:
a[None, :]

array([[ 1,  2,  3,  5,  7, 11, 13]])

This can be very useful if one wants to build an array containing the results of a cross-join operation on two vectors. Suppose we are trying to create $c_{ij} = a_i - b_j$.

In [None]:
b = np.arange(7)
c = a[:, None] - b[None, :]
c

array([[ 1,  0, -1, -2, -3, -4, -5],
       [ 2,  1,  0, -1, -2, -3, -4],
       [ 3,  2,  1,  0, -1, -2, -3],
       [ 5,  4,  3,  2,  1,  0, -1],
       [ 7,  6,  5,  4,  3,  2,  1],
       [11, 10,  9,  8,  7,  6,  5],
       [13, 12, 11, 10,  9,  8,  7]])
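As a sanity check, the broadcast result can be compared with an explicit double loop and with `np.subtract.outer`, which computes the same table. This is only a sketch to confirm that $c_{ij} = a_i - b_j$ holds for every pair.

```python
import numpy as np

a = np.array([1, 2, 3, 5, 7, 11, 13])
b = np.arange(7)

# shapes (7, 1) and (1, 7) broadcast to (7, 7)
c = a[:, None] - b[None, :]

# the same table built with an explicit double loop
loops = np.array([[a_i - b_j for b_j in b] for a_i in a])

print(c.shape)
print(np.array_equal(c, loops))
print(np.array_equal(c, np.subtract.outer(a, b)))
```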

## Homework

### Battleships

Given a 10x10 playing field with hidden battleships and a list of shooting targets, compute the number of hits. You are only allowed to use the following methods and attributes:

 - ndarray.take
 - ndarray.T
 - ndarray.shape



In [None]:
sea = np.random.randint(low=0, high=2, size=(10,10))

sea


In [None]:

targets = np.array([
    [0,3],
    [1,7],
    [2,2],
    [3,5],
    [8,2]
])

## Boolean indexing

Anytime you have a boolean array, you can use it to mask entries in another array.

In [None]:
a = np.random.randint(0, 100, size=(5,5))
a

In [None]:
mask = a > 80
mask

In [None]:
a[mask]

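Boolean masks may be applied not only to individual values, but to whole rows and columns as well. Note that when two masks are passed at once, NumPy pairs up the selected row and column indices element by element instead of returning the full rows-by-columns block. The general pattern is: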

*array*[*row_mask*,*col_mask*]

In [None]:
rows_2_and_4 = np.array([False, True, False, True, False])
cols_1_and_2 = np.array([True, True, False, False, False])

In [None]:
a[rows_2_and_4]

In [None]:
a[rows_2_and_4, cols_1_and_2]
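The result of the last cell may be surprising, so here is a sketch on a deterministic array (`np.arange(25)` reshaped to 5x5, chosen so the values are easy to trace). Passing both masks at once selects individual elements, while `np.ix_` or two chained selections return the block at the intersection of the chosen rows and columns.

```python
import numpy as np

a = np.arange(25).reshape(5, 5)
rows_2_and_4 = np.array([False, True, False, True, False])
cols_1_and_2 = np.array([True, True, False, False, False])

# two masks in one index are paired up: elements (1, 0) and (3, 1)
paired = a[rows_2_and_4, cols_1_and_2]
print(paired)

# np.ix_ builds the full rows-by-columns selection
block = a[np.ix_(rows_2_and_4, cols_1_and_2)]
print(block)

# chaining two selections gives the same block
chained = a[rows_2_and_4][:, cols_1_and_2]
print(np.array_equal(block, chained))
```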

In [None]:
names = np.array(["Dennis", "Dee", "Charlie", "Mac", "Frank"])
ages = np.array([43, 44, 43, 42, 74])
genders = np.array(['male', 'female', 'male', 'male', 'male'])

In [None]:
names[(genders == 'male') & (ages > 43)]

In [None]:
names[~(genders == 'male') & (ages % 2 == 0)]
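The parentheses in the two cells above are required, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`. Naming the masks first, as in the sketch below, avoids the problem and also shows that Python's own `and` cannot be used on arrays.

```python
import numpy as np

names = np.array(["Dennis", "Dee", "Charlie", "Mac", "Frank"])
ages = np.array([43, 44, 43, 42, 74])
genders = np.array(['male', 'female', 'male', 'male', 'male'])

is_male = genders == 'male'
older = ages > 43

print(names[is_male & older])   # element-wise "and"
print(names[is_male | older])   # element-wise "or"
print(names[~is_male])          # element-wise "not"

# Python's `and` asks for a single truth value, which an array does not have
try:
    names[is_male and older]
except ValueError as err:
    print("ValueError:", err)
```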

## `Random` module

One of the most frequently used parts of `NumPy` is random number generation. Below you can see examples of different kinds of samples:
- normal sample
- uniform sample
- choosing from a set with/without replacement

In [None]:
np.random.normal(loc=10.0, scale=1.0, size=10)

In [None]:
np.random.randint(low=10, high=20, size=(3,3))

In [None]:
np.random.uniform(low=0, high=1, size=5)

In [None]:
np.random.choice(
    a=[1,2,3,4,5,6],
    replace=True,
    size=5
)

In [None]:
np.random.choice(
    a=['this','is','sampling','without','replacement'],
    replace=False,
    size=3
)

Although most people use the `random` module as above, these functions form the legacy interface. They all draw from a single hidden global generator, so any other code that calls `np.random` changes the numbers your code receives, which makes results harder to reproduce.
The approach recommended by `NumPy` is to create an explicit `Generator` object and pass it around.

In [None]:
generator = np.random.default_rng(seed=123)

In [None]:
generator.integers(low=1, high=100, size=10)

In [None]:
generator.normal(loc=0, scale=1, size=10)

In [None]:
generator.choice(a=[1,2,3], replace=True, size=10)
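The practical benefit of an explicit generator is that reproducibility becomes local: two generators created with the same seed produce the same sequence regardless of what other code does. A minimal sketch follows; the helper `roll_dice` is a hypothetical function used only for illustration.

```python
import numpy as np

def roll_dice(rng, n):
    """Roll a six-sided die n times using the generator passed in."""
    return rng.integers(low=1, high=7, size=n)

first = np.random.default_rng(seed=123)
second = np.random.default_rng(seed=123)

x = roll_dice(first, 10)
y = roll_dice(second, 10)

print(x)
print(np.array_equal(x, y))   # same seed, same sequence
```

Passing the generator as an argument, rather than relying on global state, keeps every function's source of randomness explicit.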

## Homework

### Two reviewers

You are given two arrays representing ratings assigned to 100 movies by two reviewers. Identify movies such that the reviewers differ in their rating by at most 1.

In [None]:
movies = np.arange(100)
reviewer_a = np.random.choice(a=[1,2,3,4,5], size=100)
reviewer_b = np.random.choice(a=[1,2,3,4,5], size=100)

In [None]:
movies_with_similar_review = ...

## Using `where`

`np.where` is a very useful function that lets you quickly select elements of arrays based on a condition. Imagine you have two large arrays and you want to create a third array that contains, for each cell, the larger value of the two. First, let's do it the traditional way.

In [None]:
a = np.random.randint(1, 6, size=10**5)
b = np.random.randint(1, 6, size=10**5)

In [None]:
%%time
c = np.zeros(a.size)

for i in range(a.size):
    if a[i] > b[i]:
        c[i] = a[i]
    else:
        c[i] = b[i]


In [None]:
%%time
d = np.where(a > b, a, b)

In [None]:
np.array_equal(c,d)
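On a small, hand-checkable input the behaviour of `np.where` is easier to follow. The sketch below (with made-up values) shows the three-argument form used above, notes that `np.maximum` solves this particular task as well, and shows the one-argument form, which returns the indices where the condition holds.

```python
import numpy as np

a = np.array([3, 1, 4, 1, 5])
b = np.array([2, 7, 1, 8, 2])

# three-argument form: take from a where the condition holds, else from b
larger = np.where(a > b, a, b)
print(larger)

# for this particular task np.maximum gives the same result
print(np.array_equal(larger, np.maximum(a, b)))

# one-argument form: indices where the condition is True
(indices,) = np.where(a > b)
print(indices)
```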

## Homework

### First to finish the assignment

Given an array of students' assignment scores ordered by increasing date of submission, you want to reward the first 3 students who submitted their work and got at least 75 points. Increase their scores by 5 points.

In [None]:
grades = np.random.randint(low=0, high=100, size=50)
...

## Math functions

`NumPy` contains highly optimized implementations of many math functions. Whenever possible, use them instead of your own implementations. Remember that these functions generalize easily to n-dim arrays, usually through the `axis` argument.

In [None]:
a = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
], dtype=np.float64)

In [None]:
np.sum(a)

In [None]:
np.sum(a, axis=0)

In [None]:
np.sum(a, axis=1)

But beware of `nan`s, as they tend to destroy all math!

In [None]:
a[2,2] = np.nan

In [None]:
np.sum(a)

In [None]:
np.isnan(a)

In [None]:
np.sum(a, where=~np.isnan(a))

In [None]:
np.nansum(a)

In [None]:
np.sum(np.nan_to_num(a))

Let's see what else we can do with `nan`s

In [None]:
a[0,1] = np.nan
a[1,3] = np.nan

In [None]:
np.isnan(a)

In [None]:
np.any(np.isnan(a), axis=1)

In [None]:
mask = np.any(np.isnan(a), axis=1)
a[mask]
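The same mask can be inverted to keep only the complete rows, which is a common clean-up step. The sketch below rebuilds the array from this section and also uses `np.nanmean` to compute column means that skip the missing values.

```python
import numpy as np

a = np.arange(1, 17, dtype=np.float64).reshape(4, 4)
a[2, 2] = a[0, 1] = a[1, 3] = np.nan

# True for every row that contains at least one nan
has_nan = np.any(np.isnan(a), axis=1)
print(has_nan)

# keep only the rows without a nan
complete_rows = a[~has_nan]
print(complete_rows)

# column means computed over the non-nan entries only
column_means = np.nanmean(a, axis=0)
print(column_means)
```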

## Concatenation & sorting

Concatenation means joining arrays by rows or by columns. An array may be concatenated with itself or with another array. Four functions help with concatenation: `concatenate`, `vstack`, `hstack` and `stack`.

In [None]:
a = np.zeros(shape=(3,2))
b = np.ones(shape=(2,2))

In [None]:
np.concatenate([a, a, a, a], axis=0)

In [None]:
np.concatenate([a, b], axis=0)

In [None]:
np.vstack([a,b])

In [None]:
np.hstack([a.T,b])

In [None]:
np.stack([a[:2,:2], b], axis=0)
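What matters most with these functions is how the shapes combine, so the sketch below prints only the resulting shapes for the calls used above, plus one call that fails because the arrays disagree on a dimension other than the one being joined.

```python
import numpy as np

a = np.zeros(shape=(3, 2))
b = np.ones(shape=(2, 2))

shapes = {
    "concatenate axis=0": np.concatenate([a, b], axis=0).shape,
    "vstack": np.vstack([a, b]).shape,
    "hstack with a.T": np.hstack([a.T, b]).shape,
    "stack": np.stack([a[:2, :2], b], axis=0).shape,   # adds a new axis
}
for name, shape in shapes.items():
    print(f"{name}: {shape}")

# joining along axis 1 requires the same number of rows (3 vs 2 here)
try:
    np.concatenate([a, b], axis=1)
except ValueError as err:
    print("ValueError:", err)
```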

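Unfortunately, `NumPy` does not provide a direct way to sort in descending order; the usual workaround is to reverse the sorted result. Sorting is done with two functions: the in-place `ndarray.sort` method and `np.sort`, which returns a sorted copy.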

In [None]:
a = np.random.randint(1, 100, size=50)
a

In [None]:
a.sort()

In [None]:
a

In [None]:
np.sort(a)[::-1]
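The difference between the two sorting functions, and two common ways of getting a descending order, can be seen on a small fixed array (the values are arbitrary). Negating before and after sorting is a trick that works for numeric data only.

```python
import numpy as np

a = np.array([42, 7, 19, 7, 88])

ascending = np.sort(a)            # returns a sorted copy
descending = np.sort(a)[::-1]     # reversed view of the sorted copy
also_descending = -np.sort(-a)    # numeric data only

print(ascending)
print(descending)
print(a)                          # np.sort left the original unchanged

result = a.sort()                 # sorts in place and returns None
print(result, a)
```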

If you want to reorder the rows of an array according to the values in one of its columns (here the second one), you need to use `np.argsort`.

In [None]:
a = np.random.randint(1, 100, size=20)
a.shape = 5,4

a

In [None]:
np.sort(a, axis=1)

In [None]:
a

In [None]:
a[np.argsort(a[:,1])]
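Because the array above is random, the effect is easier to see on a small fixed example (values chosen for illustration). `np.argsort` returns the row order, and indexing with it moves whole rows together, whereas sorting along axis 0 would sort each column independently and break the rows apart.

```python
import numpy as np

a = np.array([
    [3, 30, 1],
    [1, 10, 2],
    [2, 20, 0],
])

# row indices that put the second column in ascending order
order = np.argsort(a[:, 1])
print(order)

# whole rows are reordered together
by_second_column = a[order]
print(by_second_column)

# sorting along axis 0 sorts every column separately instead
columnwise = np.sort(a, axis=0)
print(columnwise)
```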