# NumPy/SciPy Tutorial

In this lab session, we'll go through [NumPy](https://numpy.org/). When you complete this notebook, you'll have a better understanding of most of the use cases that you'll need throughout the course.

* This is a self-paced tutorial, but make sure you take some time this week to get familiar with it. 
* If you find any part of this tutorial complicated, please prepare a list of questions for the beginning of the next lab. 
* I am aware that this may seem obvious and boring, but the answer to any question is probably contained in the [NumPy User Guide](https://numpy.org/doc/stable/user/index.html) and [Documentation](https://numpy.org/doc/stable/). 
* If you're serious about data science in Python, I strongly suggest going through the [NumPy Fundamentals](https://numpy.org/doc/stable/user/basics.html#numpy-fundamentals) at least once in your career. In fact, this very tutorial is mostly a condensed version of the NumPy fundamentals plus some SciPy. Links to the relevant documentation pages are provided throughout the tutorial if you feel like taking a deep dive into NumPy.

## Table of contents:

1. [The ndarray object](#The-ndarray-object)
2. [Array indexing](#Array-indexing)
3. [Broadcasting](#Broadcasting)
4. [Random number generators](#Random-number-generators)
5. [Copies and Views](#Copies-and-Views)
6. [Data Types](#Data-Types)

Before we start, let's import the libraries we'll need throughout the course, as well as some utility functions that will help us visualize what's going on.

In [None]:
# Only run this cell if you're using Google Colab

!git clone https://github.com/torresmateo/fgv-class-2022.git
!cp -r fgv-class-2022/images .
!cp -r fgv-class-2022/tutorials .

In [None]:
import numpy as np
from tutorials.utils import *

## The ndarray object

In this section we will interact with (arguably) the most popular data structure in data science: the [NumPy ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html). 

### Creating a new array

The most intuitive way to create a NumPy array is from an existing Python list or tuple. We do this by calling [`np.array`](https://numpy.org/doc/stable/reference/generated/numpy.array.html) and passing a list to the function. The parameter can be any "array-like" object in Python. To understand what is and isn't "array-like", check the documentation.

In [None]:
a = np.array([1,2,3,4,5])
a
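As a small illustration of what "array-like" can mean in practice, the sketch below passes a list, a tuple, and a nested list to `np.array`. The variable names are only for this example; the documentation remains the reference for the full definition.

```python
import numpy as np

# a flat list and a tuple both produce a unidimensional array
from_list = np.array([1, 2, 3])
from_tuple = np.array((1, 2, 3))

# a nested list with rows of equal length produces a 2D array
from_nested = np.array([[1, 2, 3], [4, 5, 6]])

# mixing integers and floats upcasts every element to float
mixed = np.array([1, 2.5, 3])

print(f"from_list.shape: {from_list.shape}")      # (3,)
print(f"from_tuple.shape: {from_tuple.shape}")    # (3,)
print(f"from_nested.shape: {from_nested.shape}")  # (2, 3)
print(f"mixed.dtype: {mixed.dtype}")              # float64
```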

NumPy also provides functions for intrinsic array creation. Here we look at some options that will be very useful throughout the course, but there are many more. The first group creates arrays from scratch:

In [None]:
a = np.empty(5) # array of uninitialized data
print(f"np.empty: {a}")

a = np.zeros(5) # array initialized with zeros
print(f"np.zeros: {a}")

a = np.arange(5) # array with the integers 0, 1, ..., 4
print(f"np.arange: {a}")

a = np.eye(5, 8) # array with ones in the "main" diagonal
print(f"np.eye:\n{a}")

a = np.identity(5) # array with ones in the main diagonal
print(f"np.identity:\n{a}")

We can also create an array from existing arrays:

Let's start with 2 arrays created with `np.arange`:

In [None]:
a = np.arange(10)
b = np.arange(10, 20)

c = np.vstack([a, b])
print(f"np.vstack:\n{c}")

d = np.hstack([a, b])
print(f"np.hstack:\n{d}")

e = np.array(a, ndmin=3)
print(f"np.array with ndmin=3:\n{e}")

Finally, we can derive new arrays by transforming existing ones:

In [None]:
c_t = c.transpose()
print(f"c.transpose():\n{c_t}")

D = np.diag(a)
print(f"np.diag(a):\n{D}")

f = np.diag(D)
print(f"np.diag(D):\n{f}")



### Array attributes and methods


The ndarray data structure has some convenient attributes and methods we can use to interact with the array in a way that is more convenient, safer, and faster than implementing our own structure. For example, we use the `shape` attribute to get the size of each dimension of the array, the `ndim` attribute to tell how many dimensions the array has, and the `T` attribute to get a transposed version of the array:

In [None]:
print(f"shape: {a.shape}")
print(f"ndim: {a.ndim}")
print(f"T:\n{c.T}")

**IMPORTANT: unidimensional vectors in numpy**

Notice how the shape of the array is `(10,)`. This means that the array is 
neither a column vector nor a row vector. This can cause complications when dealing with vector operations. For standard vector algebra, we need to explicitly create a 2D array.

NumPy's `dot` and `matmul` (or `@` operator) will work with unidimensional arrays by appending or prepending a dimension to the arguments. Let's see some examples:

In [None]:
# weirdness of unidimensional arrays:

print(f"np.array_equal(a, a.T): {np.array_equal(a, a.T)}")
print(f"np.matmul(a, a.T): {np.matmul(a, a.T)}")
print(f"np.matmul(a.T, a): {np.matmul(a.T, a)}")
print(f"np.matmul(a, a): {np.matmul(a, a)}")

# we transform the unidimensional array into a 2D column vector by adding
# a new axis of length 1
a_2d = a[:, np.newaxis]

print("\n2D array\n")

print(f"a_2d.shape: {a_2d.shape}")
print(f"np.array_equal(a_2d, a_2d.T): {np.array_equal(a_2d, a_2d.T)}")
print(f"np.matmul(a_2d, a_2d.T):\n{np.matmul(a_2d, a_2d.T)}")
print(f"np.matmul(a_2d.T, a_2d): {np.matmul(a_2d.T, a_2d)}")

# this one will generate an error, 
# but you're encouraged to uncomment and try
# print(f"np.matmul(a_2d, a_2d): {np.matmul(a_2d, a_2d)}")
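The sketch below restates the same idea with a shorter vector so the results are easy to check by hand. It uses `reshape(-1, 1)` and `np.outer` as alternatives to `np.newaxis`; the variable names are illustrative only.

```python
import numpy as np

v = np.arange(4)            # shape (4,)

inner = v @ v               # 1D @ 1D gives a scalar (inner product)
outer = np.outer(v, v)      # explicit outer product, shape (4, 4)

col = v.reshape(-1, 1)      # column vector, shape (4, 1)
row = v.reshape(1, -1)      # row vector, shape (1, 4)

print(f"inner: {inner}")                          # 14
print(f"(col @ row).shape: {(col @ row).shape}")  # (4, 4)
print(f"(row @ col).shape: {(row @ col).shape}")  # (1, 1)
print(f"col @ row equals np.outer: {np.array_equal(col @ row, outer)}")  # True
```

With explicit 2D shapes, the order of the operands decides whether we get a matrix or a 1x1 result, which is the behaviour we expect from standard vector algebra.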



Other useful methods are aggregations and index lookups. Here are a few examples:

In [None]:
# let's start with a new, unsorted array
a = np.array([2,6,3,8,9,1,5,7,3])
print(f"original array: {a}")

# get the maximum element
print(f"a.max(): {a.max()}")

# get the index of the maximum element
print(f"a.argmax(): {a.argmax()}")

# get the minimum element
print(f"a.min(): {a.min()}")

# get the index of the minimum element
print(f"a.argmin(): {a.argmin()}")

# get the mean of all values
print(f"a.mean(): {a.mean()}")

# get the sum of all values
print(f"a.sum(): {a.sum()}")

# get the indices that would sort the array
print(f"a.argsort(): {a.argsort()}")
print(f"a[a.argsort()]: {a[a.argsort()]}")

# get a 2D array with the data from the array
print(f"a.reshape(3,3):\n{a.reshape(3,3)}")
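Most aggregations also accept an `axis` argument, which becomes relevant as soon as the array has more than one dimension. A minimal sketch, using the same values as the reshaped array above (the name `m` is just for this example):

```python
import numpy as np

m = np.array([[2, 6, 3],
              [8, 9, 1],
              [5, 7, 3]])

print(f"m.sum(): {m.sum()}")                    # 44, over all elements
print(f"m.sum(axis=0): {m.sum(axis=0)}")        # [15 22  7], one value per column
print(f"m.sum(axis=1): {m.sum(axis=1)}")        # [11 18 15], one value per row
print(f"m.max(axis=1): {m.max(axis=1)}")        # [6 9 7]
print(f"m.argmax(axis=0): {m.argmax(axis=0)}")  # [1 1 0]
```

The axis you pass is the one that gets collapsed: `axis=0` aggregates down the rows and leaves one value per column.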


## Array indexing

Array indexing in numpy is compatible with the standard Python `my_array[my_selection]` syntax, but allows for more complex indexing and slicing operations in higher dimensions.

### Basic indexing

If you've programmed using Python lists before, this way of indexing will feel very natural to you.

#### Single element indexing

To get a single element from a unidimensional ndarray, you use the exact same syntax as getting single elements from Python lists:

In [None]:
print(f"original array: {a}")

# get the first element
print(f"first element: {a[0]}")

# get the last element
print(f"last element: {a[-1]}")

# get the second to last element
print(f"second to last element: {a[-2]}")

For 2D arrays, we can use the "list of lists" way of indexing, but the "NumPy way" is to use a tuple, where each element of the tuple indexes a dimension: 

In [None]:
a_2d = a.reshape(3, 3)
print(f"2D array:\n{a_2d}")

# these two notations are equivalent, but the latter is preferred,
# especially for higher dimensional arrays
print(f"a_2d[1][1]: {a_2d[1][1]}, a_2d[1, 1]: {a_2d[1,1]}")

# of course, you can use negative indexing to start from the end of each
# dimension
print(f"a_2d[1, -1]: {a_2d[1,-1]}")

# Get the first row (as a unidimensional array)
print(f"first row (1D): {a_2d[0]}")

# Get the last column (as a unidimensional array)
# the : in the first dimension means "all elements" of that dimension
print(f"last column (1D): {a_2d[:,-1]}")

# Get the first row (as a 2D array with a single row)
print(f"first row (2D): {a_2d[[0]]}")

# Get the last column (as a 2D array with a single column)
# the : in the first dimension means "all elements" of that dimension
print(f"last column (2D):\n{a_2d[:,[-1]]}")

#### Slicing and striding

Slicing in NumPy is also compatible and extends Python's basic concept of slicing to N dimensions. In this tutorial, we're going to work with unidimensional and 2-dimensional arrays, but I strongly encourage you to get familiar with array manipulation in higher dimensions, especially 3-dimensional arrays (often used for colored images), and 4-dimensional arrays (a "batch" of colored images). 

The basic slice is `start:stop:step`. This selects the elements in the array with indices `start`, `start + step`, `start + 2 * step`, `...`, `start + m * step`, where `m` is the largest integer such that `start + m * step < stop`. For a positive `step`, the defaults are `start = 0`, `stop` equal to the length of the dimension, and `step = 1`.

Below we have some common useful examples:

In [None]:
# let's start with a sorted array, where every value matches its index
a = np.arange(10)

# select every element with an even index
print(f"every element with an even index: {a[::2]}")

# select every element with an odd index
print(f"every element with an odd index: {a[1::2]}")

# reverse the array 
print(f"reversed array: {a[::-1]}")

# let's create a bigger 2d array
b = np.arange(25).reshape(5, 5)

print(f"original 2D array:\n{b}")

# get every even row, and all columns
print(f"even rows:\n{b[::2,:]}")

# get all rows, and every odd column
print(f"odd columns:\n{b[:,1::2]}")

# get center square
print(f"center square:\n{b[1:4,1:4]}")
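To check the `start:stop:step` rule by hand, the sketch below compares a slice with the indices produced by Python's built-in `range`, which follows the same rule. The specific values of `start`, `stop`, and `step` are arbitrary.

```python
import numpy as np

a = np.arange(10)
start, stop, step = 1, 8, 3

# indices produced by the rule: start, start + step, ... while below stop
indices = list(range(start, stop, step))

print(f"indices: {indices}")                         # [1, 4, 7]
print(f"a[start:stop:step]: {a[start:stop:step]}")   # [1 4 7]

# a slice object is the same selection in a reusable form
s = slice(start, stop, step)
print(f"same selection: {np.array_equal(a[s], a[indices])}")  # True
```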

### Advanced indexing

On top of the Python slicing, we can use more complex selection objects for indexing. Getting used to all the variants of advanced indexing takes a lot of practice, and it's always useful to have the documentation open in a browser tab nearby.

Here I will simply write some examples that I think will be useful during the course.

In [None]:
# let's start with a fresh array
a = np.arange(10, 0, -1)
print(f"unidimensional array: {a}")

# we can use an ndarray of integers to make a complex selection of indices, and even repeat the indices:
print(f"integer ndarray: {a[np.array([3, 3, 1, 1, -8, 0])]}")

print(f"integer list: {a[[3, 3, 1, 1, -8, 0]]}")

#### Multidimensional indexing

In [None]:
# let's start with a fresh 2-dimensional array
a = np.arange(25).reshape(5,5)
print(f"2-dimensional array:\n{a}")

# let's get the elements at specific coordinates
# we simply pass an integer list of the components
# to each dimension, so, for elements with coordinates:
#     (0, 0)
#     (1, 4)
#     (1, 2)
#     (3, 0)
print(f"elements at specific coordinates: {a[[0, 1, 1, 3], [0, 4, 2, 0]]}")

# if all the elements share the same index along one dimension, we can mix
# advanced and simple indexing, so for element with coordinates:
#     (0, 1)
#     (1, 1)
#     (1, 1)
#     (3, 1)
print(f"elements at specific rows, same column: {a[[0, 1, 1, 3], 1]}")

# get the corner elements 
# this example uses broadcasting, which will be explained in a section below,
# but notice how we use a 2D column vector to indicate the rows, and 
# a unidimensional vector to indicate the columns
rows = np.array([0, -1]) # first and last
columns = np.array([0, -1]) # first and last
print(f"corner elements:\n{a[rows[:, np.newaxis], columns]}")
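If adding the new axis by hand feels error-prone, `np.ix_` builds the same pair of broadcastable index arrays. The sketch below also shows what happens without the extra axis, where the two index arrays are paired element by element instead.

```python
import numpy as np

a = np.arange(25).reshape(5, 5)
rows = np.array([0, -1])     # first and last
columns = np.array([0, -1])  # first and last

# np.ix_ returns index arrays of shape (2, 1) and (1, 2)
corners = a[np.ix_(rows, columns)]
print(f"corners:\n{corners}")
# [[ 0  4]
#  [20 24]]

# without the extra axis we get the elements at (0, 0) and (-1, -1) only
paired = a[rows, columns]
print(f"paired: {paired}")   # [ 0 24]
```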

#### Boolean indexing

One of the most useful ways to index a multidimensional array is the boolean index. It allows us to select elements that hold true when subject to a condition. 

In [None]:
# let's keep the same array from the previous example:
print(f"2-dimensional array:\n{a}")

# let's get all elements that are below 7
idx = a < 7
print(f"indices below 7:\n{idx}")

print(f"elements below 7:\n{a[idx]}")
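A few more patterns that tend to come up with boolean masks are sketched below: combining conditions, assigning through a mask, and choosing between two values with `np.where`. The names `mask`, `clipped`, and `labels` are only for this example.

```python
import numpy as np

a = np.arange(25).reshape(5, 5)

# combine conditions with & (and), | (or), ~ (not); the parentheses are required
mask = (a > 5) & (a % 2 == 0)
print(f"even elements above 5: {a[mask]}")   # [ 6  8 10 12 14 16 18 20 22 24]

# a boolean mask also works on the left-hand side of an assignment
clipped = a.copy()
clipped[clipped > 20] = 20
print(f"clipped.max(): {clipped.max()}")     # 20

# np.where picks between two values element by element
labels = np.where(a < 7, "low", "high")
print(f"labels[0, 0]: {labels[0, 0]}, labels[-1, -1]: {labels[-1, -1]}")  # low, high
```

Note that selecting with a boolean mask always returns a unidimensional array, because the selected elements generally do not form a rectangle.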

To demonstrate how useful boolean indexes are, let's look at a practical example:

We have the image of a dog with a white background. 

Let's pretend that we need to save some space in our hard-drive, and 
we want to keep only the relevant part of this image.

In [None]:
# Let's load the image of a dog in grayscale
dog = get_dog_image()

# the representation of this image is, in fact, 
# a 2-dimensional ndarray of integers
dog

In [None]:
# Let's plot the image, just to see what we're dealing with
plot_image(dog)

In [None]:
# If you look closely at the values in the array, you can see that 
# the higher the value, the whiter the pixel. 
# We can use this fact to figure out the silhouette of the dog:

white_pixels = dog > 250

# we can print the mask to see if we're right
plot_image(white_pixels)

In [None]:
# Now that we have a clear idea of where the dog is, 
# we can select just the rectangle that contains a part of the silhouette

# we can achieve this using boolean conditions again.

# our mask now has True where the pixel represents the background, 
# and False where the pixel represents the dog. This is counterintuitive, 
# so we can flip it around with the ~ (logical not) operator

silhouette = ~white_pixels

plot_image(silhouette)

In [None]:
# now, we want to keep only the rows and columns where at least one pixel 
# contributes to the silhouette.

rows = silhouette.any(axis=1)
cols = silhouette.any(axis=0)

cropped_dog = dog[rows][:, cols]
plot_image(cropped_dog)
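The same cropping steps can be tried without the course utilities. The sketch below uses a small synthetic array in place of the dog image (the sizes and pixel values are made up for illustration) and uses `np.ix_` as a single-step alternative to the chained `[rows][:, cols]` selection.

```python
import numpy as np

# synthetic 6x7 "image": white background (255) with a dark 2x3 blob
image = np.full((6, 7), 255, dtype=np.uint8)
image[2:4, 1:4] = 40

foreground = ~(image > 250)
rows = foreground.any(axis=1)
cols = foreground.any(axis=0)

# np.ix_ accepts boolean vectors and selects the cross product of rows and columns
cropped = image[np.ix_(rows, cols)]
chained = image[rows][:, cols]

print(f"cropped.shape: {cropped.shape}")   # (2, 3)
print(f"same result: {np.array_equal(cropped, chained)}")   # True
```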

## Broadcasting

Broadcasting describes how NumPy treats arrays with different shapes during some operations.

Normally, operators in NumPy work in an element-wise fashion, which requires the operands to have the same shape. This limitation is removed when the operands have "broadcastable" shapes. 

The simplest form of broadcasting is multiplying a scalar by a unidimensional array:

In [None]:
a = np.arange(10)
print(f"a: {a}")

b = 5
print(f"b: {b}")

print(f"a * b: {a * b}")

c = np.full_like(a, b)
print(f"c (array filled with the scalar b): {c}")

print(f"a * c: {a * c}")

In the example above, we did not need to manually repeat the scalar to multiply it. NumPy takes care of such expansion under the hood.

Let's see a less trivial example:

In [None]:
a = np.arange(25).reshape(5,5)
print(f"a:\n{a}")

b = np.arange(5)
print(f"b:\n{b}")

print(f"a + b:\n{a + b}")

c = b[:, np.newaxis]

print(f"c:\n{c}")

print(f"a + c:\n{a + c}")


In the example above, we added a unidimensional array to a 2-dimensional one. The unidimensional array was broadcast along the rows of the 2-dimensional array, and the element-wise sum was performed for each row.

Then, we added a new dimension to the smaller array, turning it into a column vector that was broadcast over the columns of the bigger array.

Let's look at an example where both arrays will be broadcast:

In [None]:
b + c

In this example, we're adding a unidimensional `(5,)` vector to a column `(5, 1)` vector, resulting in a `(5, 5)` matrix whose entry `(i, j)` is `c[i] + b[j]`. This is the additive analogue of the outer product of the 2 vectors.
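To predict the result shape without building any arrays, we can ask NumPy directly. The sketch below assumes NumPy 1.20 or newer, where `np.broadcast_shapes` is available; the shapes are chosen to match the examples above plus one incompatible pair.

```python
import numpy as np

# shapes are compared from the last dimension backwards; two dimensions are
# compatible if they are equal or if one of them is 1
print(np.broadcast_shapes((5, 5), (5,)))        # (5, 5)
print(np.broadcast_shapes((5, 1), (5,)))        # (5, 5)
print(np.broadcast_shapes((3, 1, 4), (2, 1)))   # (3, 2, 4)

# incompatible shapes raise a ValueError
try:
    np.broadcast_shapes((5, 5), (4,))
    compatible = True
except ValueError as err:
    compatible = False
    print(f"incompatible shapes: {err}")
```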

## Random number generators

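In the [array creation](#Creating-a-new-array) section above, we saw an example of how to create an uninitialized array with `np.empty`. While this might seem enough to create a random array for rapid prototyping, this is not quite true: the contents of an uninitialized array are whatever happened to be in memory, not samples from any distribution, and they are not reproducible.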

In most cases when we deal with random number generators, some level of control is useful and often necessary to achieve repeatable results.

This matter is so important that a massive effort was put into creating [`numpy.random`](https://numpy.org/doc/stable/reference/random/index.html): the NumPy module of random number routines.

Below are some useful examples using the new Random Generator API:

In [None]:
# we import the default random number generator
from numpy.random import default_rng

By default, this uses a fresh, unpredictable seed. Run this cell many times and you will generate a different set of matrices every time.

In [None]:

rng = default_rng() 

a = rng.random((5,5))
b = rng.random((5,5))
c = rng.random((5,5))
print(f"a:\n{a}")
print(f"b:\n{b}")
print(f"c:\n{c}")

If we set a known seed, we can make sure we're using the same data.
This is extremely useful for prototyping!


In [None]:
rng = default_rng(0) 

a = rng.random((5,5))
b = rng.random((5,5))
c = rng.random((5,5))
print(f"a:\n{a}")
print(f"b:\n{b}")
print(f"c:\n{c}")
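To confirm the reproducibility programmatically rather than by eye, the sketch below builds two generators with the same seed (the value 42 is arbitrary) and compares their output.

```python
import numpy as np
from numpy.random import default_rng

rng_1 = default_rng(42)
rng_2 = default_rng(42)

x = rng_1.random((2, 3))
y = rng_2.random((2, 3))
print(f"same seed, same numbers: {np.array_equal(x, y)}")   # True

# the generator is stateful: a second draw continues the stream
z = rng_1.random((2, 3))
print(f"second draw equals first: {np.array_equal(x, z)}")  # False
```

This is why the seeded cell above prints the same `a`, `b`, and `c` on every run, while the three matrices still differ from each other.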

Among the most useful tools is the ability to draw samples from a known distribution. This is extremely useful for generating test data.

In [None]:
# Sample some data from the normal distribution
rng = default_rng(0)

a = rng.normal(size=10000)
plot_distribution(a, title="Normal distribution")

a = rng.power(50, size=10000)
plot_distribution(a, title="Power distribution")

a = rng.uniform(size=10000)
plot_distribution(a, title="Uniform distribution")



## Copies and Views

The ndarray is ultimately a data structure, and it consists mainly of two parts: 
* a data buffer with the actual elements of the array
* metadata that contains information about the data buffer, such as the data type, strides, and all the information necessary to manipulate the array

### Views

Many NumPy operations can be achieved by simply modifying some of the metadata while using the same underlying data buffer. This saves memory and ensures good performance, but to avoid bugs it is important to be aware that we're not dealing with a copy of the values. 

Let's test this:

In [None]:
a = np.arange(10)
print(f"a: {a}")

b = a[:5]
print(f"b: {b}")

print(f"\nsetting b[0] = 25")
b[0] = 25

print(f"b: {b}")
print(f"a: {a}")

print(f"\nsetting a[1] = 25")
a[1] = 25

print(f"a: {a}")
print(f"b: {b}")


We can see that the two arrays are "linked" together, because changing one affects the other. This is because `b` is a view of the original array `a`.
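When this link is not what we want, we can request an independent buffer explicitly with the `copy` method. A minimal sketch:

```python
import numpy as np

a = np.arange(10)
b = a[:5].copy()   # b gets its own data buffer

b[0] = 25
print(f"a[0]: {a[0]}")   # 0, the original is untouched
print(f"b[0]: {b[0]}")   # 25
```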

### Copies

As the name suggests, when we create a copy, we are not just creating new metadata for the same data buffer: we're actually making a copy of the entire buffer, and the two variables will not be connected.

Let's try it:

In [None]:
a = np.arange(16).reshape(4,4)

print(f"a:\n{a}")

# adding a value to every element forces a copy
b = a + 1 

print(f"b:\n{b}")

print(f"\nsetting b[0,0] = 25")

b[0,0] = 25

print(f"b:\n{b}")
print(f"a:\n{a}")

print(f"\nsetting a[-1,-1] = 25")

a[-1,-1] = 25

print(f"b:\n{b}")
print(f"a:\n{a}")


To tell whether a particular array is a view or a copy, we can check if the `base` attribute is set to something other than `None`:

In [None]:
a = np.arange(10)
print(f"a: {a}")

b = a[:5]
print(f"b: {b}")
print(f"b.base: {b.base}")

a = np.arange(16).reshape(4,4)

print(f"\na:\n{a}")

# adding a value to every element forces a copy
b = a + 1 

print(f"b:\n{b}")
print(f"b.base: {b.base}")
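Another way to check is `np.shares_memory`, which reports whether two arrays overlap in memory. The sketch below uses it to show that basic slicing returns a view, while the advanced (integer and boolean) indexing from the earlier sections returns a copy.

```python
import numpy as np

a = np.arange(10)

sliced = a[2:6]          # basic slicing gives a view
fancy = a[[2, 3, 4, 5]]  # integer (advanced) indexing gives a copy
masked = a[a > 5]        # boolean indexing gives a copy

print(f"sliced shares memory: {np.shares_memory(a, sliced)}")  # True
print(f"fancy shares memory: {np.shares_memory(a, fancy)}")    # False
print(f"masked shares memory: {np.shares_memory(a, masked)}")  # False
print(f"sliced.base is a: {sliced.base is a}")                 # True
```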


## Data Types

To wrap up the tutorial, I want to remind everybody that even though we're using a state-of-the-art piece of software, when we're dealing with real numbers [it is generally impossible to avoid precision errors](https://en.wikipedia.org/wiki/Floating-point_arithmetic). 
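A classic illustration of such a precision error is sketched below, together with `np.isclose`, which compares floats up to a tolerance, and `np.finfo`, which reports the machine epsilon of a floating-point type.

```python
import numpy as np

x = np.float64(0.1) + np.float64(0.2)

exact = bool(x == 0.3)
close = bool(np.isclose(x, 0.3))
print(f"x == 0.3: {exact}")              # False
print(f"np.isclose(x, 0.3): {close}")    # True

# smaller float types have a larger machine epsilon (coarser precision)
eps32 = np.finfo(np.float32).eps
eps64 = np.finfo(np.float64).eps
print(f"float32 eps: {eps32}")
print(f"float64 eps: {eps64}")
```

For whole arrays, `np.allclose` applies the same tolerance-based comparison to every element.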

With this in mind, let's have a look at some of the main data types in NumPy, and why they are relevant to us.

In [None]:
rng = default_rng()
integer_array = rng.integers(100, size=(3,3))

print(f"integer array:\n{integer_array}")

float_array = rng.random(size=(3,3))

print(f"float array:\n{float_array}")

We just created two arrays with different types. By default, both use 64-bit representations, so we should get the same number of bytes for each matrix:

In [None]:
print(f"size of integer array: {integer_array.nbytes} bytes")

print(f"size of float array: {float_array.nbytes} bytes")

For such small matrices, this is quite trivial, but remember that things can grow very quickly and become quadratic very easily. For example, keeping track of connections in a social network of $N$ people will result in an array with shape $N \times N$. If the links are not directed and there are no self-links, we can store all the data in a data buffer with $\frac{N(N-1)}{2}$ items, which is still pretty large.
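To put numbers on the $\frac{N(N-1)}{2}$ formula, the sketch below computes the storage for a network of 100,000 nodes with two element types. The helper `link_storage_bytes` is a hypothetical function written for this example, not part of the course utilities.

```python
import numpy as np

def link_storage_bytes(n_nodes, dtype):
    """Bytes needed to store N(N-1)/2 link entries of the given dtype.

    Hypothetical helper for illustration; uses Python integers, so the
    arithmetic cannot overflow.
    """
    n_links = n_nodes * (n_nodes - 1) // 2
    return n_links * np.dtype(dtype).itemsize

n = 100_000
for dtype in (np.int64, np.bool_):
    gib = link_storage_bytes(n, dtype) / 2**30
    print(f"{np.dtype(dtype).name}: {gib:.1f} GiB")
```

With these assumptions the 64-bit version needs roughly 37 GiB, and the one-byte version needs one eighth of that.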

Let's check how much memory we need to store a single graph if we use the defaults:

In [None]:
plot_graph_links_in_bits(64)

It's apparent that we run out of memory rather quickly. Considering that big data regularly operates on networks with several hundred thousand nodes, we can't ignore this if we're serious about big data.

Since our example only needs to store links, and we can do that using boolean values, let's see the same picture after moving to a smaller representation:

In [None]:
nbits = np.bool_(True).nbytes * 8

In [None]:
x = np.arange(1000000)
y = (x * (x-1) // 2) * (nbits)

In [None]:
y

In [None]:
plot_graph_links_in_bits(nbits)

In [None]:
a = np.arange(1000)

In [None]:
a.dtype

In [None]:
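# plt may already be available through the star import at the top of the
# notebook; importing it explicitly keeps this cell self-contained
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize = (10,10), dpi=100)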
    
x = np.arange(100000, dtype=np.float64)
y = (x * (x-1) // 2) * (nbits)
ax.plot(x, y, label="undirected")

# avg_lap = 6.4e+10
# ax.axhline(y=avg_lap, color="green", label="8GB (average laptop)")
# nnodes = np.argmin(np.abs(y - avg_lap))
# ax.axvline(x=nnodes, color="green", ls=":", label=f"{nnodes:,} nodes")

# exp_lap = 2.56e+11
# ax.axhline(y=exp_lap, color="orange", label="64GB (expensive personal computer)")
# nnodes = np.argmin(np.abs(y - exp_lap))
# ax.axvline(x=nnodes, color="orange", ls=":", label=f"{nnodes:,} nodes")

# server = 2.4e+12
# ax.axhline(y=server, color="red", label="300GB (dedicated server ~30USD+/day)")
# nnodes = np.argmin(np.abs(y - server))
# ax.axvline(x=nnodes, color="red", ls=":", label=f"{nnodes:,} nodes")

ax.set_xlabel("nodes")
ax.set_ylabel("bits")

ax.set_title(f'Memory requirements using {nbits} bits')
ax.legend()

plt.show()
plt.close("all")

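With that simple change alone, we can now hold almost three times as many nodes in memory (in a dense array): memory grows with $N^2$, so shrinking each entry from 64 bits to 8 bits lets $N$ grow by a factor of $\sqrt{8} \approx 2.8$.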