# Introduction to NumPy

This chapter outlines techniques for loading, storing, and manipulating in-memory data in Python. Datasets come from many sources and in many formats, including collections of documents, images, sound clips, and numerical measurements. Despite this variety, it helps to think of all data as arrays of numbers.

**NumPy** (short for Numerical Python) is usually the go-to library for efficient storage and manipulation of numerical arrays.

> ✏️ The example is inspired by {cite}`tomasbeuzen` and {cite}`vanderplas_2017`.

Let's import NumPy under the alias np, so that we don't have to type out n-u-m-p-y every time we use it:

In [1]:
import numpy as np

np.__version__

'1.23.3'

## What are NumPy Arrays?

NumPy arrays ("ndarrays") are **homogeneous data structures**: they can hold any of the basic Python data types (e.g., floats, integers, strings), but every item in a given array has the same type.

A NumPy array looks a lot like a list:

In [2]:
my_list = [1, 2, 3, 4, 5]
my_list

[1, 2, 3, 4, 5]

In [3]:
my_array = np.array([1, 2, 3, 4, 5])
my_array

array([1, 2, 3, 4, 5])

Notice that, unlike a list, an array can hold only a single type (it is homogeneous). Here the integer 1 is converted to the string '1' so that both items share one type:

In [4]:
my_array = np.array([1, "hi"])
my_array

array(['1', 'hi'], dtype='<U21')
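The same conversion happens with purely numeric data. As a minimal sketch (the variable names `mixed` and `as_float` are illustrative and not part of the original examples), mixing integers and floats makes NumPy choose one common type for every element:

```python
import numpy as np

# Mixing ints and floats: every element is stored as a float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)        # float64

# A type can also be requested explicitly when the array is created
as_float = np.array([1, 2, 3], dtype=float)
print(as_float.tolist())  # [1.0, 2.0, 3.0]
```

The _.dtype_ attribute is a quick way to check which type an array ended up with.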

## Creating arrays

Arrays are typically created using two main methods:

- From existing data (usually lists or _tuples_) using _np.array()_, like we saw above;
- Using built-in functions such as _np.arange()_, _np.linspace()_, _np.zeros()_, etc.

We have tried the first option; let's try the second one:

In [5]:
# from 1 inclusive to 5 exclusive
np.arange(1, 5)

array([1, 2, 3, 4])

In [6]:
# step by 2 from 0 inclusive to 11 exclusive
np.arange(0, 11, 2)

array([ 0,  2,  4,  6,  8, 10])

In [7]:
# 5 equally spaced points between 0 and 10
np.linspace(0, 10, 5)

array([ 0. ,  2.5,  5. ,  7.5, 10. ])

In [8]:
# an array of zeros with size 2 x 3
np.zeros((2, 3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [9]:
# an array of the number 3.14 with size 3 x 3 x 3
np.full((3, 3, 3), 3.14)

array([[[3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14]],

       [[3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14]],

       [[3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14]]])

In [10]:
# random numbers uniformly distributed from 0 to 1 with size 5 x 2
np.random.rand(5, 2)

array([[0.44566357, 0.66566482],
       [0.12479558, 0.35027656],
       [0.80644454, 0.67887849],
       [0.67785986, 0.56672146],
       [0.44970969, 0.61699536]])
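The random values above change on every run. One possible way to make them repeatable, assuming NumPy 1.17 or later, is to draw them from a seeded generator created with _np.random.default_rng()_. The seed value 42 and the variable names are arbitrary choices for illustration:

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng = np.random.default_rng(42)
sample = rng.random((5, 2))            # uniform values in [0, 1), size 5 x 2

rng_again = np.random.default_rng(42)
same_sample = rng_again.random((5, 2))

print(sample.shape)                          # (5, 2)
print(np.array_equal(sample, same_sample))   # True
```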

## Array Shapes

As you have just seen, arrays can be of any dimension, shape, and size. Three main attributes describe the characteristics of an array:

- _.ndim:_ the number of dimensions of an array
- _.shape:_ the number of elements in each dimension (like calling len() on each dimension)
- _.size:_ the total number of elements in an array (i.e., the product of .shape)

In [11]:
array_2d = np.ones((3, 2))
print(f"Dimensions: {array_2d.ndim}")
print(f"Shape: {array_2d.shape}")
print(f"Size: {array_2d.size}")

Dimensions: 2
Shape: (3, 2)
Size: 6
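To see how the three attributes relate to each other, here is a small sketch with a three-dimensional array (the name `array_3d` is illustrative):

```python
import numpy as np

array_3d = np.zeros((2, 3, 4))

print(array_3d.ndim)    # 3
print(array_3d.shape)   # (2, 3, 4)
print(array_3d.size)    # 24

# len() only reports the length of the first dimension
print(len(array_3d))    # 2

# .size is the product of the entries of .shape
print(array_3d.size == np.prod(array_3d.shape))  # True
```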


## Comparing Arrays

Two arrays can contain the same values and still differ in shape, which makes comparing them a common source of confusion. Let's define our arrays first:

In [12]:
x = np.ones(5)
x

array([1., 1., 1., 1., 1.])

In [13]:
y = np.ones((1, 5))
y

array([[1., 1., 1., 1., 1.]])

In [14]:
z = np.ones((5, 1))
z

array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]])

In [15]:
np.array_equal(x, x)

True

In [16]:
np.array_equal(x, y)

False

In [17]:
np.array_equal(x, z)

False

In [18]:
np.array_equal(y, z)

False

It is crucial to distinguish between different shapes as they affect operations.
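As a rough illustration of how much the shape matters, the sketch below adds the same three arrays `x`, `y`, and `z` in pairs and prints the shape of each result. The arrays hold identical values, yet the results differ in shape:

```python
import numpy as np

x = np.ones(5)        # shape (5,)
y = np.ones((1, 5))   # shape (1, 5)
z = np.ones((5, 1))   # shape (5, 1)

print((x + y).shape)  # (1, 5)
print((x + z).shape)  # (5, 5)
print((y + z).shape)  # (5, 5)

# Once y is flattened to one dimension, it compares equal to x
print(np.array_equal(x, y.ravel()))  # True
```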

## Computation on Arrays

Now, we will dive into why NumPy is so important in Python data science. Namely, it provides an easy and flexible interface for fast computation on arrays of data, through:

- Elementwise Operations
- Aggregations
- Broadcasting

Before we show you some basics, let's look at a motivating example. Computation on NumPy arrays can be speedy, or it can be slow. The key to making it fast is to use _vectorized_ operations, which apply an operation to a whole array at once instead of looping over its elements in Python.

In [19]:
# DON'T DO THIS
array = np.array(range(5))
for i, element in enumerate(array):
    array[i] = element ** 2
array

array([ 0,  1,  4,  9, 16])

In [20]:
# DO THIS
array = np.array(range(5))
array **= 2
array

array([ 0,  1,  4,  9, 16])

Let's actually compare those approaches:

In [21]:
# loop method
array = np.array(range(5))
time_loop = %timeit -q -o -r 3 for i, element in enumerate(array): array[i] = element ** 2

# vectorized method
time_vec = %timeit -q -o -r 3 array ** 2
print(f"Vectorized operation is {time_loop.average / time_vec.average:.2f}x faster than looping here.")

Vectorized operation is 3.56x faster than looping here.


We are using a tiny array here; on a large dataset, the gap between the two approaches is typically much wider.
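The _%timeit_ magic only works inside IPython or Jupyter. A possible way to run a similar comparison in a plain Python script is the standard-library _timeit_ module. The sketch below is an assumption-laden variant of the example above: the helper names `square_with_loop` and `square_vectorized` and the array size of 10,000 are illustrative choices, and the measured times depend on your machine:

```python
import timeit

import numpy as np

values = np.arange(10_000, dtype=np.int64)

def square_with_loop(arr):
    """Square each element one at a time in a Python loop."""
    out = np.empty_like(arr)
    for i, element in enumerate(arr):
        out[i] = element ** 2
    return out

def square_vectorized(arr):
    """Square all elements with a single vectorized operation."""
    return arr ** 2

# Total time for 3 calls of each function
time_loop = timeit.timeit(lambda: square_with_loop(values), number=3)
time_vec = timeit.timeit(lambda: square_vectorized(values), number=3)
print(f"loop: {time_loop:.5f} s, vectorized: {time_vec:.5f} s")

# Both approaches produce the same result
print(np.array_equal(square_with_loop(values), square_vectorized(values)))  # True
```

Unlike the in-place loop above, these helpers write into a new array, so repeated timing runs always start from the same input.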

### Elementwise Operations

Elementwise operations refer to operations applied to each element of an array or between the paired elements of two arrays:

In [22]:
x = np.ones(4)
x

array([1., 1., 1., 1.])

In [23]:
y = x + 1
y

array([2., 2., 2., 2.])

In [24]:
x * y

array([2., 2., 2., 2.])

In [25]:
x == y

array([False, False, False, False])

In [26]:
np.array_equal(x, y)

False
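Because the arrays above contain repeated values, the pairing of elements is easy to miss. Here is a short sketch with distinct values (the arrays `a` and `b` are made up for illustration):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Arithmetic pairs up elements at the same position
print((a + b).tolist())   # [11, 22, 33, 44]
print((a * b).tolist())   # [10, 40, 90, 160]

# Comparisons are elementwise too and return a boolean array
print((b > 25).tolist())  # [False, False, True, True]

# Functions such as np.sqrt are applied to each element
print(np.sqrt(np.array([1, 4, 9])).tolist())  # [1.0, 2.0, 3.0]
```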

### Aggregations

When faced with a large amount of data, a common first step is to compute summary statistics for the data in question.

There are many aggregation functions, for example:

| Function Name | Description                            |
|:--------------|:---------------------------------------|
| np.sum        | Compute sum of elements                |
| np.prod       | Compute product of elements            |
| np.mean       | Compute mean of elements               |
| np.std        | Compute standard deviation             |
| np.min        | Find minimum value                     |
| np.max        | Find maximum value                     |
| np.argmin     | Find index of minimum value            |
| np.argmax     | Find index of maximum value            |
| np.any        | Evaluate whether any elements are true |
| np.all        | Evaluate whether all elements are true |
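As a brief sketch of a few of these functions in use, the example below aggregates a small 2 x 3 array (the array `data` is made up for illustration). Most aggregations accept an _axis_ argument that selects the dimension to collapse:

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(data.sum())                 # 21, the sum of all elements
print(data.sum(axis=0).tolist())  # [5, 7, 9], one sum per column
print(data.sum(axis=1).tolist())  # [6, 15], one sum per row

print(data.mean())                # 3.5
print(data.argmax())              # 5, the position in the flattened array

print((data > 5).any())           # True, at least one element is greater than 5
print((data > 0).all())           # True, every element is positive
```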

Let's do some practice. Consider working out the hypotenuse of a right triangle with sides _3m_ and _4m_:

In [27]:
sides = np.array([3, 4])

There are several ways we could solve this problem. We could directly use Pythagoras's Theorem:

In [28]:
np.sqrt(np.sum([np.power(sides[0], 2), np.power(sides[1], 2)]))

5.0

Or apply a "vectorized" operation to the whole vector at one time:

In [29]:
(sides ** 2).sum() ** 0.5

5.0
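NumPy also ships functions that perform this calculation directly. The sketch below shows two alternatives, _np.hypot()_ and _np.linalg.norm()_, which should give the same result as the expressions above:

```python
import numpy as np

sides = np.array([3, 4])

# np.hypot computes sqrt(a**2 + b**2) for the two sides
print(np.hypot(sides[0], sides[1]))  # 5.0

# The Euclidean norm of the vector [3, 4] is the same quantity
print(np.linalg.norm(sides))         # 5.0
```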

### Broadcasting

Elementwise arithmetic normally requires arrays of the same shape. Broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations: when the shapes are compatible, the smaller array is conceptually stretched so that the operation can still occur elementwise.

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:

- If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
- If the shape of the two arrays does not match in any dimension, the array with a shape equal to 1 in that dimension is stretched to match the other shape.
- If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
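The three rules can be traced on a small example. This is a sketch with made-up arrays `a`, `b`, and `c`, chosen so that one pair broadcasts and the other raises an error:

```python
import numpy as np

a = np.ones((10, 3))
b = np.arange(3)    # shape (3,): padded to (1, 3), then stretched to (10, 3)
c = np.arange(10)   # shape (10,): padded to (1, 10), which clashes with (10, 3)

print((a + b).shape)  # (10, 3)

# Rule 3: the trailing sizes 3 and 10 disagree and neither is 1
try:
    a + c
except ValueError as err:
    print("Broadcasting failed:", err)

# Giving c the shape (10, 1) makes it compatible with (10, 3)
print((a + c.reshape(10, 1)).shape)  # (10, 3)
```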

Let's try to center an array:

In [30]:
X = np.random.random((10, 3))
X

array([[0.94184422, 0.60988128, 0.75779938],
       [0.78453848, 0.75236185, 0.67625871],
       [0.6913031 , 0.65004607, 0.75261537],
       [0.12581446, 0.19242525, 0.52631881],
       [0.84183838, 0.14133457, 0.27797456],
       [0.77427511, 0.14505094, 0.72103279],
       [0.88974499, 0.95589889, 0.13018557],
       [0.78147822, 0.14631096, 0.75225142],
       [0.43078037, 0.62707699, 0.18763019],
       [0.78690986, 0.3502185 , 0.14556253]])

In [31]:
# compute the mean of each column (axis=0 collapses the rows)
x_mean = X.mean(axis=0)
x_mean

array([0.70485272, 0.45706053, 0.49276293])

In [32]:
X_centered = X - x_mean
X_centered

array([[ 0.2369915 ,  0.15282075,  0.26503645],
       [ 0.07968576,  0.29530132,  0.18349578],
       [-0.01354962,  0.19298554,  0.25985244],
       [-0.57903826, -0.26463528,  0.03355588],
       [ 0.13698566, -0.31572596, -0.21478837],
       [ 0.06942239, -0.31200959,  0.22826986],
       [ 0.18489227,  0.49883836, -0.36257736],
       [ 0.0766255 , -0.31074957,  0.25948849],
       [-0.27407235,  0.17001646, -0.30513275],
       [ 0.08205714, -0.10684202, -0.34720041]])

Let's check that each column of the centered array has a mean close to zero (up to floating-point error):

In [33]:
X_centered.mean(axis=0)

array([ 9.99200722e-17,  6.10622664e-17, -6.66133815e-17])
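Centering each row instead of each column needs one extra step, because a plain row mean has shape (10,) and does not broadcast against (10, 3). One way to handle this, sketched below, is the _keepdims=True_ argument, which keeps the reduced axis with length 1. The seeded generator and the names `row_mean` and `X_row_centered` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the example is repeatable
X = rng.random((10, 3))

# keepdims=True gives shape (10, 1) instead of (10,)
row_mean = X.mean(axis=1, keepdims=True)
print(row_mean.shape)            # (10, 1)

# (10, 3) - (10, 1) broadcasts the row means across the columns
X_row_centered = X - row_mean
print(np.allclose(X_row_centered.mean(axis=1), 0))  # True
```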

## Reshaping

There are three key tools for reshaping NumPy arrays:

- _.reshape()_
- _np.newaxis_
- _.ravel()/.flatten()_

In [34]:
x = np.full((4, 3), 3.14)
x

array([[3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14]])

In [35]:
x.reshape(6, 2)

array([[3.14, 3.14],
       [3.14, 3.14],
       [3.14, 3.14],
       [3.14, 3.14],
       [3.14, 3.14],
       [3.14, 3.14]])

In [36]:
# using -1 will calculate the dimension for you (if possible)
x.reshape(2, -1)

array([[3.14, 3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14, 3.14]])

In [37]:
x[:, np.newaxis]

array([[[3.14, 3.14, 3.14]],

       [[3.14, 3.14, 3.14]],

       [[3.14, 3.14, 3.14]],

       [[3.14, 3.14, 3.14]]])

In [38]:
x.flatten()

array([3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14,
       3.14])

In [39]:
x.ravel()

array([3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14,
       3.14])
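The outputs of _.flatten()_ and _.ravel()_ look identical, but they differ in how they use memory: _.flatten()_ always returns a copy, while _.ravel()_ returns a view of the original data when it can. The sketch below checks this with _np.shares_memory()_ and also prints the shapes produced by _np.newaxis_ (the names `flat_copy` and `flat_view` are illustrative):

```python
import numpy as np

x = np.full((4, 3), 3.14)

# np.newaxis inserts a dimension of length 1 at the given position
print(x[:, np.newaxis].shape)  # (4, 1, 3)
print(x[np.newaxis, :].shape)  # (1, 4, 3)

flat_copy = x.flatten()        # always a new, independent array
flat_view = x.ravel()          # a view here, because x is contiguous

print(np.shares_memory(x, flat_copy))  # False
print(np.shares_memory(x, flat_view))  # True

# Writing through the view changes the original array
flat_view[0] = 0.0
print(x[0, 0])                 # 0.0
```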

## Exercises

#### Create an array of 10 ones and an array of 10 zeros:

In [40]:
# TODO: your answer here

#### Create a matrix using the two arrays you created as rows

In [41]:
# TODO: your answer here

#### Calculate the sum of each column in the matrix

In [42]:
# TODO: your answer here

## Resources

```{bibliography}
:filter: docname in docnames
```