# Numpy Tutorial

Oliver W. Layton and Stephanie R Taylor

CS251: Data Analysis and Visualization

Spring 2021

In [12]:
import numpy as np

## Numpy ndarray basics

### Creation from Python lists

In [13]:
# Make a numpy array from a 2D python list
arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

In [14]:
# print it
print(arr)
print(arr[0, 0]) #compared to arr[0][0]

[[1. 2. 3.]
 [4. 5. 6.]]
1.0


### Data type of ndarray

In [15]:
# determine data type
print('Type of array is\n', arr.dtype)

Type of array is
 float64


Type can be set or changed in a few ways. 

1. when creating array: (a) implicitly or (b) explicitly
2. by casting types.

In [16]:
# 1b explicitly


In [17]:
# 2. NOTE: Casting is a METHOD of the array, not a FUNCTION


In [18]:
# Can also be string. be careful in your CSV parser that your "numbers"
# aren't actually strings!
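
The three cases above (implicit type, explicit type, casting) can be sketched as follows. This is a minimal illustration, not the course's reference solution; the variable names are made up for the example.

```python
import numpy as np

# 1a. implicit: dtype is inferred from the list contents
ints = np.array([1, 2, 3])            # some integer dtype (width depends on platform)
floats = np.array([1.0, 2.0, 3.0])    # float64

# 1b. explicit: pass dtype when creating the array
explicit = np.array([1, 2, 3], dtype=np.float64)

# 2. casting: astype is a method of the array and returns a new array
as_int = floats.astype(int)

# Strings: "numbers" read from text stay strings until you convert them
strs = np.array(['1.5', '2.5'])
parsed = strs.astype(float)

print(ints.dtype, floats.dtype, explicit.dtype, as_int.dtype, strs.dtype, parsed.dtype)
```

Checking `.dtype` right after parsing a CSV is a cheap way to catch the string case early.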


### Convert back to Python list

In [19]:
# Convert back from ndarray to Python list (2 methods)
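
One plausible reading of the "2 methods" is the `.tolist()` method and the built-in `list()`. The sketch below assumes that reading and shows that they are not equivalent for 2D arrays.

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

nested = arr.tolist()   # nested Python lists of Python floats
outer = list(arr)       # Python list whose items are still 1D ndarrays (the rows)

print(nested)
print(type(nested[0]), type(outer[0]))
```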


### Other ways to create ndarrays quickly

#### 1. zeros

#### 2. ones

In [20]:
# can easily make any constant array
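
A short sketch of zeros, ones, and a constant array. The shape `[2, 3]` and the constant 7 are arbitrary choices for illustration.

```python
import numpy as np

z = np.zeros([2, 3])                 # all 0.0
o = np.ones([2, 3])                  # all 1.0
sevens = 7 * np.ones([2, 3])         # scale ones to get any constant
also_sevens = np.full([2, 3], 7.0)   # same result in one call

print(z)
print(sevens)
```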


#### 3. Random values

In [21]:
# Uniform random values
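
A minimal sketch using `np.random.random`, the same function used later in this notebook. The scaling to the interval [10, 15) is an added illustration.

```python
import numpy as np

np.random.seed(0)                              # reproducible values
u = np.random.random([2, 3])                   # uniform floats in [0, 1)
scaled = 10 + 5 * np.random.random([2, 3])     # uniform floats in [10, 15)

print(u)
print(scaled)
```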


#### 4. Equally spaced floats in an interval

#### 5. Equally spaced ints in an interval
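
Items 4 and 5 can be sketched with `np.linspace` and `np.arange`. The endpoints and counts below are arbitrary examples.

```python
import numpy as np

# 4. linspace: you choose the NUMBER of points; both endpoints are included
floats = np.linspace(0, 1, 5)

# 5. arange: you choose the STEP; start is included, stop is excluded
ints = np.arange(0, 10, 2)

print(floats)
print(ints)
```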

### Check dimensions - `shape`

In [22]:
# check shape of a 1D array
arr = np.arange(stop=5.0, step=0.25, start = 0)
print(arr.shape)

(20,)


In [23]:
# check number of dimensions (M)
print(len(arr.shape))

1


In [24]:
# Access a value by its index along the 1st dim (row) and 2nd dim (col) (Use f-string)
arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]) / 9
print(arr)
row_idx = 1
col_idx = 2
print(f'Value at row {row_idx} and column {col_idx} is {arr[row_idx, col_idx]}')

[[0.11111111 0.22222222 0.33333333]
 [0.44444444 0.55555556 0.66666667]]
Value at row 1 and column 2 is 0.6666666666666666


In [25]:
# Check number of elements total
print(arr.size)

6


## Brief detour: Rapidly build python lists (list comprehension)

In [26]:
# Brief detour: In python you can replace the workflow of 
# list-building by creating an empty list and looping to append...


In [27]:
# ...with Python list comprehensions


In [28]:
# you can build lists using any function of i. How about i squared (i**2)?
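
The detour above can be sketched like this. Note that squaring is written `i**2` in Python, since `^` is bitwise XOR.

```python
# Workflow 1: create an empty list and loop to append
squares_loop = []
for i in range(5):
    squares_loop.append(i**2)

# Workflow 2: the equivalent list comprehension
squares = [i**2 for i in range(5)]

print(squares_loop)
print(squares)   # [0, 1, 4, 9, 16]
```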


## ndarray indexing

Basic Accessing and modifying of ndarrays.

### Access and modify single elements

In [29]:
# To access elements in a multidimensional ndarray use ONE set of square brackets []
# Make a new random array
np.random.seed(0)  # ensures random numbers come up the same each time. Useful for debugging.
arr = np.random.random([2 ,3])

In [30]:
# Get the 1st element
print(arr)
print(arr[0,0])
arr[0,0] = 5
print(arr)

[[0.5488135  0.71518937 0.60276338]
 [0.54488318 0.4236548  0.64589411]]
0.5488135039273248
[[5.         0.71518937 0.60276338]
 [0.54488318 0.4236548  0.64589411]]


In [31]:
# Modifying single values is similar

print('arr is now:\n', arr)

arr is now:
 [[5.         0.71518937 0.60276338]
 [0.54488318 0.4236548  0.64589411]]


### Slicing: real power of numpy

Use **colon** notation for all values in a dimension

Access and modify different ranges of data along different dimensions 

Make a 2x3 random array. The 2nd column is `arr[:, 1]`.

In [32]:
np.random.seed(0)
arr = np.random.random([2 ,3])
print(arr)

[[0.5488135  0.71518937 0.60276338]
 [0.54488318 0.4236548  0.64589411]]


Access 1st row

In [33]:
print(arr[0, :])

[0.5488135  0.71518937 0.60276338]


Access last 2 columns

In [34]:
print(arr[:,-2:])
print(arr[:,1:])

[[0.71518937 0.60276338]
 [0.4236548  0.64589411]]
[[0.71518937 0.60276338]
 [0.4236548  0.64589411]]


Access columns at indices 1-2 in the 1st row. Careful about off-by-one.

- Low range (before :) CONTAINS that index
- High range (after :) DOES NOT contain that index (a high range of i stops at index i-1)

Use slicing to assign values efficiently in batch without loops

In [35]:
# Assign 1st row to -1s


In [36]:
# Assign 1st row to increasing ints


In [37]:
# Multiply the 3rd row by 5 and update the row
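
A sketch of the batch assignments described above. It assumes a fresh 3x5 array (so that a 3rd row exists), which differs from the 2x3 array printed earlier.

```python
import numpy as np

np.random.seed(0)
arr = np.random.random([3, 5])       # assumed shape, for illustration only

# Columns at indices 1-2 in the 1st row: high range 3 is excluded
first_row_mid = arr[0, 1:3].copy()

row2_before = arr[2, :].copy()       # kept only so we can check the result

arr[0, :] = -1                       # assign 1st row to -1s
arr[0, :] = np.arange(5)             # assign 1st row to increasing ints
arr[2, :] = 5 * arr[2, :]            # multiply 3rd row by 5 and update it

print(arr)
```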


### What if we want to access a set of rows or columns that are not adjacent?

Can't use colon notation. Instead use `np.ix_`

Example: Say we want column indices 0, 2, 4 and all rows.

**Syntax for `np.ix_`:**
- `np.ix_` goes inside the square brackets: `arr[np.ix_(blah)]`
- Give it `M` arguments (e.g. 2 for a 2D matrix).
- Each argument is a Python list (or ndarray) of indices to take along that dimension.
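
A minimal sketch of the example above, using NumPy's `np.ix_` function. The 3x5 array of consecutive ints is an assumption chosen so the selected columns are easy to read.

```python
import numpy as np

arr = np.arange(15).reshape(3, 5)     # rows: 0..4, 5..9, 10..14

rows = np.arange(arr.shape[0])        # all rows, without hard coding the count
cols = [0, 2, 4]                      # non-adjacent columns

sub = arr[np.ix_(rows, cols)]         # one argument per dimension
print(sub)
```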

## Memory

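- Numpy tries to be efficient with arrays, so assignment (`b = a`) does not copy any data: both names refer to the same array, and slices are views into it. To get an independent (deep) copy, you need to use the `.copy()` method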

In [38]:
a = np.linspace(-1, 1, 5)
a

array([-1. , -0.5,  0. ,  0.5,  1. ])

In [39]:
# fixed with .copy()
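
The problem and its fix can be sketched as below. The names `alias`, `fresh`, and `independent` are made up for this illustration.

```python
import numpy as np

# Problem: plain assignment gives a second name for the SAME array
a = np.linspace(-1, 1, 5)
alias = a
alias[0] = 100
print(a[0])          # a changed too

# Fix: .copy() allocates separate memory
fresh = np.linspace(-1, 1, 5)
independent = fresh.copy()
independent[0] = 100
print(fresh[0])      # fresh is untouched
```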

## Reshaping

**Problem:**
- You want to preserve the number of elements in an ndarray but "regroup" the elements

Example: Have a `64 x 64` image and want to make one big `64*64` 1D vector by "gluing the rows together":

e.g: (3, 3) -> (9,)

Turn:

    [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

Into:

     [1, 2, 3, 4, 5, 6, 7, 8, 9]

How can we do this without hard coding?

**Key:** Total number of elements in ndarray doesn't change.
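
One way to flatten without hard coding is sketched below, assuming `reshape` is the intended tool. The size is computed from the array itself, or inferred by numpy with `-1`.

```python
import numpy as np

img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

flat = img.reshape(img.size)      # (3, 3) -> (9,), size taken from the array
flat_inferred = img.reshape(-1)   # -1 asks numpy to work out the length
back = flat.reshape(img.shape)    # and back again: element count is unchanged

print(flat)
```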

## Combining multiple ndarrays

**Problem:**
- You have two ndarrays and want to concatenate them
- You have an ndarray and **want to append a column or row vector**

### Add/append a new column: "stack horizontally"

**Mnemonic**: Columns go horizontally.

Have `a`:

    [[1, 2]
     [3, 4]]
and `b`

    [[9]
     [9]]
    
want to make:

    [[1, 2, 9]
     [3, 4, 9]]
    
i.e. stack horizontally. Could be two matrices (not just a matrix and a vector).

**Caveat:** We need to make sure shapes are compatible for stacking (the number of rows must match):

- Result shape = `(2, 3)`
- We are starting with `a` shape: `(2, 2)`

The shape of `b` needs to be `(2, 1)` (why wouldn't `(1, 2)` work?)
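
A sketch of the example using `np.hstack`. The `wrong_b` case is an added illustration of the shape question above.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[9],
              [9]])               # shape (2, 1): one new column

c = np.hstack([a, b])             # (2, 2) and (2, 1) -> (2, 3)
print(c)

# A (1, 2) array has the wrong number of rows for horizontal stacking
wrong_b = np.array([[9, 9]])
try:
    np.hstack([a, wrong_b])
    stacked_ok = True
except ValueError:
    stacked_ok = False
print(stacked_ok)
```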

## Math with ndarrays (vectorization)
There are many options for doing math with ndarrays. If we can apply a function to an entire array or have an entire array as an operand in a mathematical operation (which we can), then we call this **Vectorization** and we avoid loops.

First, we talk about functions. Then, we talk about mathematical expressions.

## Built-in functions that compute along dimensions (`axes`)

- Axes are the numpy term for different ndarray dimensions.
- Example: axis 0 is the rows, axis 1 is the columns, etc.
- We can apply functions over one or more axes super efficiently in one line of code! This is MUCH MUCH faster than loops (stay tuned).

Sum along rows -> "collapse" across rows to get sum within each column: 5 numbers for a (3, 5) array

Sum along columns -> "collapse" across columns to get sum within each row: 3 numbers for a (3, 5) array

**Careful:** Applying a function without specifying the axis may compute across the ENTIRE ndarray.

**Mnemonic trick:** Applying a function along an axis eliminates that dimension from the shape. Left with remaining dimensions.
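
The sums described above can be sketched on a (3, 5) array. The array of consecutive ints is an assumption for illustration.

```python
import numpy as np

arr = np.arange(15).reshape(3, 5)

col_sums = arr.sum(axis=0)   # collapse axis 0 (rows): shape (5,)
row_sums = arr.sum(axis=1)   # collapse axis 1 (columns): shape (3,)
total = arr.sum()            # no axis: sums the ENTIRE ndarray

print(col_sums.shape, row_sums.shape)
print(col_sums, row_sums, total)
```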

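In [40]:
# `one` is not defined earlier in this notebook; a 3x5 array of ones is assumed here
one = np.ones([3, 5])
print(one.shape)
the_mean = one.mean(axis=0)
print(f'Mean across axis 0: {the_mean.shape}')

print(one.shape)
the_mean = one.mean(axis=1)
print(f'Mean across axis 1: {the_mean.shape}')

(3, 5)
Mean across axis 0: (5,)
(3, 5)
Mean across axis 1: (3,)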

## Math Operations with ndarrays
In NumPy, you can apply the same operations as with scalars (+,-,\*,/,\**), but to every element of an ndarray. This is super straightforward if the operands are ndarrays with the same shape. Then the operations are simply performed element-wise.

But what if the operands are not identically sized? Under certain conditions, NumPy will do the work (hopefully the work you want) anyway. NumPy does this via a powerful feature called Broadcasting.

## Broadcasting

**This is the most useful numpy feature thus far! This will become your bread-and-butter!**

### Simple example: A Scalar and an ndarray

As we saw, we can create an array of any size with any constant value WITHOUT ANY LOOPS. This is the simplest example of numpy **broadcasting** the scalar across the ndarray.

In [None]:
# Example with basic arithmetic
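
A minimal sketch of scalar broadcasting with basic arithmetic. The array and the constants are arbitrary.

```python
import numpy as np

arr = np.ones([2, 3])

plus_four = arr + 4          # the scalar 4 is applied to every element
scaled = arr * 2 - 1         # operations chain element-wise
squared = (arr + 1) ** 2

print(plus_four)
print(scaled)
print(squared)
```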


### Example: Subtract the minimum value of each column for the original 2D array
In this case, we have a 2D array and a 1D array. 

In [None]:
np.random.seed(0)
rand_inds = np.random.randint(low=0, high=5, size=(5, 6))
rand_inds

In [None]:
# Take the min across each column and subtract it from rand_inds. Print shape of result
result = rand_inds - np.min(rand_inds, axis=0)
print(result.shape)

#### What's going on??

Let's look at the shapes:

Numpy is **broadcasting** the vector to operate on the 2d ndarray (**draw this out on board**):

- Numpy looks for axis shape compatibility among the different arrays.
- Numpy sees the column dimension (5, **6**) matches the min vector (**6,**)
- Numpy adds a **singleton dimension** (a "fake" leading dimension for rows). Now the min vector is treated with shape: **(1, 6)**.
- Numpy dynamically "grows" the singleton dimension to the needed shape (5). So now the min vector is treated like a (5, 6) array.
- Numpy subtracts the two arrays element-wise: a (5, 6) array minus a (5, 6) array.
- Numpy returns the result, which is a (5, 6) array!

Process is **very memory efficient**: the "grown" (5, 6) version of the min vector is never actually allocated in memory.

#### Broadcasting only adds singleton dimensions LEFTWARD to the ndarray with smaller number of dimensions.

What if we did the same thing as above, but now wanted to **subtract the minimum value in each row**?

In [None]:
print(f'rand_inds shape: {rand_inds.shape}')
print(f'min shape: {np.min(rand_inds, axis=1).shape}')

**Problem:** adding singleton dimensions to the LEFT could never make the shapes compatible!! For example, numpy tries: 

    rand_inds shape: (5, 6)
    min shape: (1, 5)

but that won't work! Crash...

#### Adding singleton dimensions by ourselves

We can help numpy out and add a "new axis" ourselves to make the shapes compatible with `np.newaxis`!

Can also make this more readable by defining a temp variable...
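
A sketch of the row-minimum case with `np.newaxis`, using a temp variable for readability. It recreates `rand_inds` with the same seed so the block runs on its own.

```python
import numpy as np

np.random.seed(0)
rand_inds = np.random.randint(low=0, high=5, size=(5, 6))

row_mins = np.min(rand_inds, axis=1)        # shape (5,)
row_mins_col = row_mins[:, np.newaxis]      # shape (5, 1): singleton added on the RIGHT

result = rand_inds - row_mins_col           # (5, 6) - (5, 1) broadcasts
print(row_mins.shape, row_mins_col.shape, result.shape)
```

After the subtraction, every row of `result` has a minimum of 0.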

#### Squeeze if you need to get rid of all singleton dimensions

"Undo" a new axis / singleton dimension

In [None]:
ex = np.zeros([1, 2, 1, 3, 1, 4, 5, 1])
ex.shape
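
A short sketch of `np.squeeze` applied to the array above.

```python
import numpy as np

ex = np.zeros([1, 2, 1, 3, 1, 4, 5, 1])
squeezed = np.squeeze(ex)       # drops every dimension of length 1

print(ex.shape)
print(squeezed.shape)           # (2, 3, 4, 5)
```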

#### Not automatically squeezing computations

As we saw above, functions like `min` over an axis eliminate that axis from the result: 

In [None]:
print(f'Min over axis 0 shape: {np.min(rand_inds, axis=0).shape}')
print(f'Min over axis 1 shape: {np.min(rand_inds, axis=1).shape}')

If we want `min` (and some other functions) to keep the singleton dimension, we can use the optional argument `keepdims=True`:

In [None]:
print(f'Min over axis 0 shape: {np.min(rand_inds, axis=0, keepdims=True).shape}')
print(f'Min over axis 1 shape: {np.min(rand_inds, axis=1, keepdims=True).shape}')

This can **help with broadcasting compatibility when performing an operation on an axis then modifying the original ndarray.** Then we don't need to manually add a new axis.

Example with subtracting the mean:

In [None]:
# Subtract off mean along either axis
rand_inds_centered = rand_inds.copy()
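
One way to finish this example is sketched below. It assumes a float copy is wanted, because `rand_inds` holds ints and the means are generally fractional. The name `cols_centered` is made up for the illustration.

```python
import numpy as np

np.random.seed(0)
rand_inds = np.random.randint(low=0, high=5, size=(5, 6))

# astype returns a new float array, so the original stays untouched
rand_inds_centered = rand_inds.astype(float)

# keepdims=True keeps shape (5, 1), so no manual np.newaxis is needed
rand_inds_centered -= rand_inds_centered.mean(axis=1, keepdims=True)

# Same idea along axis 0: the mean has shape (1, 6)
cols_centered = rand_inds - rand_inds.mean(axis=0, keepdims=True)

print(rand_inds_centered.mean(axis=1))
print(cols_centered.mean(axis=0))
```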

## Vectorization speed vs loops

Time the computation of summing an ndarray with a loop vs. vectorized.

In [None]:
import time


def timeit(fun):
    '''Just a function to time the runtime of another function'''
    def timer():
        start = time.time()
        fun()
        end = time.time()
        print(f'Took {end - start:.3} secs to run.')
    return timer


@timeit
def sumLoop():
    '''Use for loop to sum a row vector'''
    longRow = np.array([i for i in range(1, 1000000)])
    theSum = 0
    for i in range(len(longRow)):
        theSum += longRow[i]


@timeit
def sumVectorized():
    '''Vectorized version of summing a row vector'''
    longRow = np.array([i for i in range(1, 1000000)])
    theSum = np.sum(longRow)

In [None]:
# Dynamic typing in python makes for loops with lots of small
# operations slow
print('sumLoop:')
sumLoop()

# Vectorization lets Numpy skip per-element type checks at runtime
# and use efficient pre-compiled functions to batch-process
# the computation over the whole array
print('sumVectorized:')
sumVectorized()