# `numpy`

`numpy` is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

`numpy` is the foundation for so much in Python and data science. Its multidimensional array object allows for tremendous performance gains when compared against typical Python structures and operations. For this reason, when dealing with large amounts of numerical data, `numpy` is used almost exclusively. We will focus on learning about `numpy`'s array type and common utility functions for working with numerical data.

Let's compare some basic operations in standard Python vs `numpy`. Let's begin by creating a list of 1 million random numbers uniformly distributed from 0.0 to 100.0.

In [1]:
import random

random.seed(0)
random_data = [random.uniform(0.0, 100.0) for _ in range(1000000)]

Let's perform and time a few simple operations on this list: sum and sort.

In [None]:
%%timeit
sum(random_data)

In [None]:
%%timeit
sorted(random_data)

Summing a regular Python list of 1 million elements takes a few milliseconds, and sorting it takes a few hundred milliseconds (exact numbers depend on the machine). Now let us convert this to a *`numpy` array* and perform the same operations:

In [None]:
import numpy as np

numpy_random_data = np.array(random_data)
numpy_random_data

In [None]:
%%timeit
numpy_random_data.sum()

In [None]:
%%timeit
numpy_random_data.sort()
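One caveat about the cell above: `ndarray.sort()` sorts the array in place, and `%%timeit` runs the cell many times, so every run after the first is sorting data that is already sorted. A minimal sketch of a fairer measurement uses `np.sort`, which returns a sorted copy and leaves its input alone. The names `unsorted_data`, `sorted_copy` and `seconds` are made up for this illustration, and the array is smaller than the one above to keep the run short.

```python
import timeit

import numpy as np

# 100,000 uniformly distributed values, kept unsorted for every timing run
unsorted_data = np.random.default_rng(0).uniform(0.0, 100.0, 100_000)
snapshot = unsorted_data.copy()

# np.sort returns a new sorted array and does not modify its argument
sorted_copy = np.sort(unsorted_data)

# time 10 sorts of the same unsorted input
seconds = timeit.timeit(lambda: np.sort(unsorted_data), number=10)
print(f"average np.sort time: {seconds / 10:.6f} s")
```

The measured time will vary from machine to machine; the point is only that each run starts from the same unsorted input.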

Using `numpy` arrays these times are reduced to a few hundred *microseconds* and a few milliseconds respectively for summing and sorting. This is a tremendous improvement! These times may not seem significant, but this is only summing and sorting - for larger and more complex operations the performance gains are even more substantial. Let's revisit how we even created the `numpy` array:

```python
import numpy as np
numpy_random_data = np.array(random_data)
```

First, we need to import `numpy` into Python so that we can use it. It is conventional to alias the `numpy` package as `np`. Once we have imported `numpy` we can call `np.array`, which constructs an array from a Python list. Note that constructing an array this way copies the input list! A `numpy` array can have any number of dimensions: a flat list produces a 1-dimensional array, while nested lists produce multidimensional arrays (e.g. matrices). We can observe the shape of an array using its `shape` attribute:

In [None]:
numpy_random_data.shape

This tuple contains a single element, telling us that the array is 1-dimensional and that its one dimension has 1000000 elements. By default `numpy` stores arrays in row-major order, and in many operations a 1-dimensional array behaves like a single row, so it is often convenient to think of our 1000000 element array as a 1x1000000 matrix (1 row, 1000000 columns). Strictly speaking, though, the shape `(1000000,)` is not the same as the 2-dimensional shape `(1, 1000000)`. Creating a multidimensional array is as simple as providing a *nested list* when creating the array:

In [None]:
data = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
sample_matrix = np.array(data)
sample_matrix

In [None]:
sample_matrix.shape
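To make the relationship between nesting and shape concrete, here is a small sketch that builds the same six numbers as a 1-dimensional array, as a single-row matrix, and as a 2x3 matrix. The variable names are chosen for this illustration only; `ndim` is the number of dimensions, i.e. the length of the `shape` tuple.

```python
import numpy as np

flat = np.array([0, 1, 2, 3, 4, 5])        # one level of nesting: shape (6,)
row = np.array([[0, 1, 2, 3, 4, 5]])       # a list holding one row: shape (1, 6)
grid = np.array([[0, 1, 2], [3, 4, 5]])    # two rows of three: shape (2, 3)

for name, arr in [("flat", flat), ("row", row), ("grid", grid)]:
    print(name, arr.shape, arr.ndim)
```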

An important requirement of multidimensional `numpy` arrays is that every row must have the same number of elements! Python lists of lists do not enforce this requirement, but `numpy` does!

This is an important use case for lists (compared to arrays). If we need to store data with a variable number of elements per record, a list might be more convenient.
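If it is unclear whether some nested list is rectangular, the row lengths can be checked before handing the data to `numpy`. The helper below, `is_rectangular`, is a hypothetical function written for this illustration and is not part of `numpy`.

```python
import numpy as np


def is_rectangular(rows):
    """Return True when every inner list has the same length."""
    return len({len(row) for row in rows}) <= 1


regular = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
ragged = [[0, 1, 2], [3, 4, 5], [6, 7]]

print(is_rectangular(regular), is_rectangular(ragged))

# only convert when the check passes; otherwise keep the plain list
matrix = np.array(regular) if is_rectangular(regular) else None
```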

In [None]:
data = [[0, 1, 2], [3, 4, 5], [6, 7]]
sample_matrix = np.array(data)
sample_matrix

This error is a little obscure, but it is telling us that the dimensions across all rows must be the same (it was expecting 3 elements in the last row). We have a number of ways of creating arrays and multidimensional arrays with preset sizes and shapes. It is common to want to create a default array or matrix containing all 0s or 1s, and `numpy` gives us helper functions to do exactly that:  

In [None]:
all_zeros = np.zeros(shape=(5, 10))
all_zeros

In [None]:
all_ones = np.ones(shape=(5, 10))
all_ones

We can also create arrays and matrices with other default values; we just need to use the array's `fill` method to fill the object with that value:

In [None]:
all_twos = np.zeros(shape=(10, 2))
all_twos.fill(2)
all_twos

Note here that these values are *floating point* numbers - what if we want an array of integers? We need to tell `numpy` what datatype we want to use, otherwise it will assume a floating point value type.

In [None]:
all_twos.dtype

We can tell `numpy` the type of the array elements by specifying the `dtype` of the array when creating it. Below we are going to tell `numpy` to make an array of integers, using `numpy`'s special integer type (it has a few of these!).

In [None]:
all_int_twos = np.zeros(shape=(10, 2), dtype=np.int32)
all_int_twos.fill(2)
all_int_twos

In [None]:
all_int_twos.dtype
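As an aside, the two steps above (allocate, then fill) can also be written as a single call to `np.full`, which takes the shape, the fill value, and optionally a `dtype`. This is an alternative sketch rather than a replacement for the cells above, and the name `full_int_twos` is made up here.

```python
import numpy as np

# allocate a 10x2 array of 32-bit integers, every element set to 2
full_int_twos = np.full(shape=(10, 2), fill_value=2, dtype=np.int32)

# the same result built the two-step way, for comparison
two_step = np.zeros(shape=(10, 2), dtype=np.int32)
two_step.fill(2)

print(np.array_equal(full_int_twos, two_step), full_int_twos.dtype)
```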

The "32" in the name of the type indicates that the integers should be 32-bit, or 4-byte, integers. We can use 32-bit integers over 64-bit integers to save memory, and often processing time as well! Out of the box `numpy` provides many integer and floating point types; we are largely only concerned with the following:

* `np.int8` - 8-bit signed integer
* `np.int16` - 16-bit signed integer
* `np.int32` - 32-bit signed integer
* `np.int64` - 64-bit signed integer
* `np.uint8` - 8-bit unsigned integer
* `np.uint16` - 16-bit unsigned integer
* `np.uint32` - 32-bit unsigned integer
* `np.uint64` - 64-bit unsigned integer
* `np.float32` - 32-bit single precision floating point number
* `np.float64` - 64-bit double precision floating point number

There are 128-bit and 256-bit variants of some of these on certain platforms, but they are largely unnecessary for what we will be doing in this course. If we try using Python's types when creating an array, `numpy` will replace them with the appropriate `numpy` type:

In [None]:
np.array([], dtype=int).dtype, np.array([], dtype=float).dtype
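To put numbers on the memory savings mentioned earlier, the sketch below allocates one million zeros with three different integer types and reports the size of each array through the `itemsize` and `nbytes` attributes. It also prints the value range of `np.int8` using `np.iinfo`, as a reminder that smaller types hold a smaller range of values.

```python
import numpy as np

count = 1_000_000
as_int64 = np.zeros(count, dtype=np.int64)
as_int32 = np.zeros(count, dtype=np.int32)
as_int8 = np.zeros(count, dtype=np.int8)

for arr in (as_int64, as_int32, as_int8):
    print(arr.dtype, arr.itemsize, "bytes per element,", arr.nbytes, "bytes total")

# the price of a smaller type is a smaller range of representable values
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)
```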

We can also create identity matrices:

In [None]:
np.identity(5, dtype=np.int32)
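As a quick illustration of why the identity matrix is useful, multiplying any compatible matrix by it (with the matrix multiplication operator `@`) gives back the original matrix. The sketch also shows `np.eye`, which produces the same result for a square matrix. The names `identity_3` and `sample` are made up for this example.

```python
import numpy as np

identity_3 = np.identity(3, dtype=np.int32)
sample = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

# matrix multiplication by the identity leaves the matrix unchanged
product = identity_3 @ sample
print(np.array_equal(product, sample))

# np.eye builds the same square identity matrix
print(np.array_equal(np.eye(3, dtype=np.int32), identity_3))
```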

Many utility functions in `numpy` that generate data return it as a 1-dimensional array, which we can then *reshape* into a matrix. Take for example generating random numbers in `numpy`:

In [None]:
np.random.seed(1)
random_matrix = np.random.randint(0, 100, 144).reshape((12, 12))
random_matrix

Here we are asking `numpy` to generate 144 random integers from 0 up to (but not including) 100. `numpy` returns this to us as a large 1-dimensional array. If we want to turn this into a matrix, we can use the `reshape` method. It returns an array with the requested shape in which the data is broken into rows; where possible this is a view onto the same underlying data rather than a copy. We asked `numpy` to reshape the large array into a 12x12 matrix (remember that 12*12=144!).
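The sketch below looks more closely at what `reshape` returns, using a small array of the integers 0 through 11 (built with `np.arange`) instead of random data so the result is easy to follow. For a freshly created contiguous array like this one, the reshaped array shares memory with the original, so a write through one is visible through the other. Passing `-1` for one dimension asks `numpy` to infer it from the total size. All variable names are chosen for this illustration.

```python
import numpy as np

flat_values = np.arange(12)                 # the integers 0..11, shape (12,)
as_matrix = flat_values.reshape((3, 4))     # 3 rows of 4

# True when both arrays are backed by the same memory
shares = np.shares_memory(flat_values, as_matrix)
print(shares)

# writing through the reshaped array also changes the original
as_matrix[0, 0] = 100
print(flat_values[0])

# -1 lets numpy work out the missing dimension: 12 / 4 = 3 columns
inferred = flat_values.reshape((4, -1))
print(inferred.shape)
```

If an independent array is needed, calling `.copy()` on the reshaped result gives one.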

We can also have `numpy` generate evenly spaced data using either ranges or linear-spaces:

In [None]:
# Create an array of floating point numbers in the range [0, 100), taking
# every 4th number, reshaped into a 5x5 matrix
np.arange(0.0, 100.0, 4.0).reshape((5, 5))

In [None]:
# create an array of 30 evenly spaced floating point numbers in the range
# [0, 100] (both endpoints included), reshaped into a 5x6 matrix
np.linspace(0, 100, 30).reshape((5, 6))
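The difference between the two functions is easiest to see side by side: `np.arange` takes a step size and excludes the stop value, while `np.linspace` takes a number of points and (by default) includes the stop value. A short sketch, with variable names made up for the comparison:

```python
import numpy as np

stepped = np.arange(0.0, 100.0, 4.0)    # step of 4, stop value excluded
spaced = np.linspace(0.0, 100.0, 30)    # 30 points, stop value included

print(len(stepped), stepped[0], stepped[-1])
print(len(spaced), spaced[0], spaced[-1])

# the gap between linspace points is (stop - start) / (num - 1)
gap = spaced[1] - spaced[0]
print(gap)
```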

## Selecting Data

Selecting and indexing data from a `numpy` array is effectively the same as with a Python list, but there are some bonuses with `numpy`! Let's use the randomized matrix that we just created.

In [None]:
random_matrix

In [None]:
random_matrix[2]  # get the 3rd row

In [None]:
random_matrix[2, 7]  # get the 3rd row, 8th column

In [None]:
random_matrix[:, 3]  # slice the 4th column

In [None]:
# slice both axes, taking the first 5 rows and the first 5 columns, giving us
# the top-left corner of the matrix
random_matrix[0:5, 0:5]
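A few of the "bonuses" beyond list-style indexing are sketched below on a small 3x4 matrix, so the results can be checked by eye: negative indices, slices with a step, selecting with a boolean condition, and selecting several rows at once with a list of indices. The matrix and variable names are made up for this illustration.

```python
import numpy as np

small_matrix = np.arange(12).reshape((3, 4))
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

last_row = small_matrix[-1]                     # negative index: last row
even_columns = small_matrix[:, ::2]             # every 2nd column (0 and 2)
large_values = small_matrix[small_matrix > 8]   # boolean mask: values above 8
picked_rows = small_matrix[[0, 2]]              # list of indices: rows 0 and 2

print(last_row)
print(even_columns)
print(large_values)
print(picked_rows)
```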

## Operations

We can use arithmetic operations on arrays *if and only if the operation is well defined for the shapes of the arrays*. There are both scalar and matrix operations that we can apply. Here are a few examples of scalar operations:

In [None]:
data = np.ones(shape=(10, 10))
data

In [None]:
# scalar addition
data + 1

In [None]:
# scalar multiplication
data * 10

In [None]:
# we can combine operations
data * 10 + 1

In [None]:
# we can even raise elements to some power!
(data + 1) ** 4
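It is worth contrasting these scalar operations with what the same operators do to a plain Python list. In the sketch below (variable names are illustrative), multiplying a list by 2 repeats the list, whereas multiplying an array by 2 doubles every element; getting the element-wise result from a list takes an explicit loop or comprehension.

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array(values_list)

repeated = values_list * 2                    # list repetition
doubled = values_array * 2                    # element-wise multiplication
doubled_list = [v * 2 for v in values_list]   # the list equivalent

print(repeated)
print(doubled)
print(doubled_list)
```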

We also have matrix operations, some of which require the other matrix to have a compatible shape!

In [None]:
# for these examples let's set up a 1-dimensional array [1, 2, 3] and two 3x3
# matrices, one with the numbers 0 through 8 and another with 10 through 18
array_1 = np.array([1, 2, 3])
data_1 = np.arange(0, 9).reshape((3, 3))
data_2 = np.arange(10, 19).reshape((3, 3))

array_1, data_1, data_2

We can perform element-wise addition, subtraction, multiplication, etc. using the standard operators:

In [None]:
data_1 + data_2

In [None]:
data_1 * data_2

In [None]:
# adding an array and a matrix can only be done if they have compatible shapes!
# 1x3 and 3x3 are compatible shapes! but, say, 1x4 or 2x3 are not compatible with 3x3.
# this compatibility is called broadcasting, and is only successful when all dimensions
# either match or one of them is 1
array_1 + data_1
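The compatibility rule in the comments above can be tried out directly. The sketch below uses a small hypothetical helper, `can_add`, that attempts the addition and reports whether `numpy` accepted the shapes; incompatible shapes raise a `ValueError`.

```python
import numpy as np

matrix = np.arange(9).reshape((3, 3))


def can_add(left, right):
    """Return True when numpy can broadcast the two arrays together."""
    try:
        left + right
    except ValueError:
        return False
    return True


row_ok = can_add(np.array([1, 2, 3]), matrix)            # (3,) with (3, 3)
column_ok = can_add(np.array([[1], [2], [3]]), matrix)   # (3, 1) with (3, 3)
too_long = can_add(np.array([1, 2, 3, 4]), matrix)       # (4,) with (3, 3)
wrong_rows = can_add(np.ones((2, 3)), matrix)            # (2, 3) with (3, 3)

print(row_ok, column_ok, too_long, wrong_rows)

# the 1-dimensional array is added to every row of the matrix
row_sum = np.array([1, 2, 3]) + matrix
print(row_sum)
```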

We can also perform common matrix operations, like computing dot products, cross products, and even determinants:

In [None]:
np.dot(array_1, data_1)

In [None]:
np.cross(array_1, data_1)

In [None]:
np.linalg.det(data_1)
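A point that is easy to miss is that `*` on two matrices is element-wise, while the matrix product is written with `np.dot` or the `@` operator. The sketch below shows both on a 3x3 matrix holding 0 through 8 (the same values as `data_1` above, rebuilt here so the block is self-contained). It also shows that this particular matrix is singular, so its determinant is zero up to floating point rounding, and a cross product of two unit vectors.

```python
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.arange(9).reshape((3, 3))

vector_times_matrix = np.dot(vector, matrix)   # vector-matrix product
same_with_operator = vector @ matrix           # identical, using @

elementwise = matrix * matrix                  # each element squared
matrix_product = matrix @ matrix               # true matrix product

# the rows of this matrix are linearly dependent, so the determinant is ~0
determinant = np.linalg.det(matrix)

# x-axis cross y-axis gives the z-axis
unit_z = np.cross([1, 0, 0], [0, 1, 0])

print(vector_times_matrix)
print(elementwise)
print(matrix_product)
print(determinant, unit_z)
```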

## Exercise

`numpy` is really great at applying functions to arrays and matrices in large batches, so long as the function is *vectorized*. This means that the function is designed to operate on entire arrays. We want to use `numpy` to generate the points on the edge of a circle. We can do this by generating a linear spacing of values from 0 to $2\pi$ and then applying trigonometric functions to that array to build the vertices.

Take a moment to:

1. define a variable named `angles` using a `numpy` linear space from 0 to $2\pi$ with 100 elements
2. define a variable named `x_vector` using the `numpy` function `np.cos` and passing in `angles`
3. define a variable named `y_vector` using the `numpy` function `np.sin` and passing in `angles`

In [39]:
angles = np.linspace(0.0, 2 * np.pi, 100)
x_vector = np.cos(angles)
y_vector = np.sin(angles)

In [None]:
import matplotlib.pyplot as plt

plt.scatter(x_vector, y_vector)

# Make the axes square so the circle is perfect
ax = plt.gca()
ax.set_aspect("equal", adjustable="box")
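One way to sanity-check the exercise without plotting is to confirm that every generated point is at distance 1 from the origin. The sketch below repeats the three definitions so it runs on its own, then checks the radii. It also notes that `np.linspace` includes both endpoints, so the first and last points coincide; passing `endpoint=False` avoids the duplicate if that matters.

```python
import numpy as np

angles = np.linspace(0.0, 2 * np.pi, 100)
x_vector = np.cos(angles)
y_vector = np.sin(angles)

# every point should satisfy x^2 + y^2 = 1, up to floating point rounding
radii = np.sqrt(x_vector**2 + y_vector**2)
on_circle = np.allclose(radii, 1.0)

# angle 0 and angle 2*pi describe the same point on the circle
closed_loop = np.allclose(
    [x_vector[0], y_vector[0]], [x_vector[-1], y_vector[-1]]
)

# 100 distinct angles: the stop value 2*pi is left out
open_angles = np.linspace(0.0, 2 * np.pi, 100, endpoint=False)

print(on_circle, closed_loop, len(open_angles))
```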