# NumPy

<div>
<img src="../images/numpy_logo.png" width="300"/>
</div>

## Intro

- NumPy ("Numerical Python") is a widely used Python package for working with arrays, linear algebra, and other numerical computations
    - Because its core is written in C, NumPy handles arrays much more efficiently than pure-Python lists do
- Let us import NumPy and explore the basics of this helpful package:
    - It is included in `env-dstools.yml` in the root directory and can also be installed with `conda install numpy`

In [1]:
import numpy as np

In [2]:
first_array = np.array([1, 2, 3, 4, 5])
print(first_array)
print(type(first_array))

[1 2 3 4 5]
<class 'numpy.ndarray'>


- Unlike Python lists, a NumPy array holds elements of a single data type; mixed inputs are converted to a common type (strings, in the example below)

In [3]:
python_list = ["this is a string", 1, True]
[type(element) for element in python_list]

[str, int, bool]

In [4]:
numpy_array = np.array(python_list)
[type(element) for element in numpy_array]

[numpy.str_, numpy.str_, numpy.str_]
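
The common type chosen by NumPy can be inspected through the `.dtype` attribute. The following is a minimal sketch; the variable names `mixed_numbers` and `mixed_with_text` are illustrative and do not come from the original notebook.

```python
import numpy as np

# Integers and floats are promoted to a common floating-point type
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)  # float64

# Once a string is present, every element is stored as a string
mixed_with_text = np.array(["this is a string", 1, True])
print(mixed_with_text.dtype.kind)  # 'U' (Unicode string)
```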

- Some examples of creating NumPy arrays from Python lists and with other methods:

In [32]:
print("2D array: \n", np.array([[1, 2], [3, 4], [5, 6]]))
print("Arithmetic sequence: \n", np.arange(1, 10, 1.5))
print("Arithmetic sequence v2: \n", np.linspace(0, 10, 5))
print("Generate sequence of constants: \n", np.full((3, 9), 9.9))
# Set random seed for reproducibility
np.random.seed(8888)
print("Generate random numbers from 0 to 1: \n", np.random.rand(3, 5))

# Uncomment the following line to see all the accessible 
#  methods and attributes for NumPy N-d array
#print(dir(np.ndarray))

2D array: 
 [[1 2]
 [3 4]
 [5 6]]
Arithmetic sequence: 
 [1.  2.5 4.  5.5 7.  8.5]
Arithmetic sequence v2: 
 [ 0.   2.5  5.   7.5 10. ]
Generate sequence of constants: 
 [[9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9]
 [9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9]
 [9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9]]
Generate random numbers from 0 to 1: 
 [[0.96238432 0.69422036 0.44226615 0.02448785 0.07069067]
 [0.7940371  0.60528839 0.50772913 0.13890398 0.55663346]
 [0.57006236 0.05038574 0.58028586 0.84061497 0.97075989]]


- We can access the shapes of NumPy arrays with `.shape`:

In [38]:
numpy_array = np.ones((3, 5))
print(numpy_array)
print("Number of dimensions: ", numpy_array.ndim)
print("Shape of the array: ", numpy_array.shape)
print("Number of elements: ", numpy_array.size)

[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]
Number of dimensions:  2
Shape of the array:  (3, 5)
Number of elements:  15


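- A one-dimensional sequence of values can be stored in three different shapes: a true 1-D array of shape `(5,)`, a 2-D row of shape `(1, 5)`, or a 2-D column of shape `(5, 1)`. Element-wise comparison still works through broadcasting, but `np.array_equal` returns `False` because the shapes differ: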

In [45]:
ones1 = np.ones(5)
ones2 = np.ones((1, 5))
ones3 = np.ones((5, 1))
print("Compare each element:", ones1 == ones2)
print("Compare the whole array:", np.array_equal(ones1, ones2))

Compare each element: [[ True  True  True  True  True]]
Compare the whole array: False
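
To see why the whole-array comparison fails, it may help to print the shape of each form. This sketch recreates the three arrays from the cell above and uses `.squeeze()` to drop the length-1 axis.

```python
import numpy as np

ones1 = np.ones(5)       # true 1-D array
ones2 = np.ones((1, 5))  # 2-D array with a single row
ones3 = np.ones((5, 1))  # 2-D array with a single column

for name, arr in [("ones1", ones1), ("ones2", ones2), ("ones3", ones3)]:
    print(name, "ndim:", arr.ndim, "shape:", arr.shape)

# Dropping the length-1 axis recovers the plain 1-D form
print(np.array_equal(ones1, ones2.squeeze()))  # True
```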


- We can perform some basic element-wise operations on NumPy arrays:

In [50]:
print(ones1 + 1)
print(ones1 - ones2)
print(ones2 * ones3)

[2. 2. 2. 2. 2.]
[[0. 0. 0. 0. 0.]]
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]


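- More importantly, NumPy arrays support **broadcasting**: NumPy automatically expands arrays of different but compatible shapes so that element-wise operations still work: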

In [54]:
cost = np.array([20, 15, 25])
sales = np.array([
    [2, 3, 1],
    [6, 3, 3],
    [5, 3, 5]
])
print("Cost matrix:\n", cost)
print("Sales matrix:\n", sales)

Cost matrix:
 [20 15 25]
Sales matrix:
 [[2 3 1]
 [6 3 3]
 [5 3 5]]


In [62]:
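# cost has shape (3,), so it is broadcast along the rows:
#   cost[j] scales column j of sales (not what we want)
print(cost * sales)
# Repeat cost as a (3, 3) matrix of columns so that cost[i] scales row i (what we want)
print(cost.reshape(3, 1).repeat(3, axis=1) * sales)
# Same result without the explicit repeat: broadcasting stretches the (3, 1) column
print(cost.reshape(3, 1) * sales)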

[[ 40  45  25]
 [120  45  75]
 [100  45 125]]
[[ 40  60  20]
 [ 90  45  45]
 [125  75 125]]
[[ 40  60  20]
 [ 90  45  45]
 [125  75 125]]


In [64]:
total = np.zeros((3, 3))
time_loop = %timeit -q -o -r 3 for col in range(sales.shape[1]): total[:, col] = sales[:, col] * np.squeeze(cost)
time_vec = %timeit -q -o -r 3 cost.reshape(3, 1) * sales
print(f"Broadcasting is {time_loop.average / time_vec.average:.2f}x faster than looping in our example.")

Broadcasting is 3.60x faster than looping in our example.


- Note that not all arrays are compatible for broadcasting:
   - Two dimensions are compatible if _they are equal_ or _one of them is 1_
- NumPy checks compatibility dimension by dimension, starting from the right-most (trailing) dimension; missing leading dimensions are treated as 1.

```
Broadcastable:
A      (3d array): 256 x 256 x 3
B      (1d array):             3
Result (3d array): 256 x 256 x 3

A      (4d array): 8 x 1 x 6 x 1
B      (3d array):     7 x 1 x 5
Result (4d array): 8 x 7 x 6 x 5
```
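
The rules can be checked without allocating any arrays. The sketch below assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available; the incompatible pair `(2, 3)` and `(4,)` is an illustrative example, not one from the original notebook.

```python
import numpy as np

# The two compatible examples from the table above
print(np.broadcast_shapes((256, 256, 3), (3,)))      # (256, 256, 3)
print(np.broadcast_shapes((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)

# Trailing dimensions 3 and 4 are neither equal nor 1, so this fails
try:
    np.broadcast_shapes((2, 3), (4,))
    compatible = True
except ValueError as err:
    compatible = False
    print("Not broadcastable:", err)
```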

- Reshaping arrays in NumPy mainly relies on the following:
    - `.reshape()` to return the array's data in a new shape
    - `np.newaxis` to add a new axis (usually for broadcasting)
    - `.flatten()` to return a flattened copy of the array, or `.ravel()` to return a flattened view when possible (and a copy otherwise)

In [78]:
numpy_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], 
                        [10, 11, 12], [13, 14, 15]])
print(numpy_array)
print(numpy_array.reshape(3, 5))
# Pass -1 for one dimension and NumPy infers it from the array size
print(numpy_array.reshape(1, -1))
# Set order='F' for column-major (Fortran-style) reshaping
print(numpy_array.reshape(1, -1, order='F'))

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]
 [13 14 15]]
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]
[[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]]
[[ 1  4  7 10 13  2  5  8 11 14  3  6  9 12 15]]


In [95]:
try:
    print(numpy_array + np.ones((5, 3, 2)))
except ValueError:
    # Shapes (5, 3) and (5, 3, 2) cannot be broadcast, so add a trailing axis first.
    # The commented lines below are equivalent: None is an alias of np.newaxis,
    #   and reshape can add the extra dimension as well
    print(numpy_array[:, :, np.newaxis] + np.ones((5, 3, 2)))
    #print(numpy_array[:, :, None] + np.ones((5, 3, 2)))
    #print(numpy_array.reshape(5, 3, -1) + np.ones((5, 3, 2)))

[[[ 2.  2.]
  [ 3.  3.]
  [ 4.  4.]]

 [[ 5.  5.]
  [ 6.  6.]
  [ 7.  7.]]

 [[ 8.  8.]
  [ 9.  9.]
  [10. 10.]]

 [[11. 11.]
  [12. 12.]
  [13. 13.]]

 [[14. 14.]
  [15. 15.]
  [16. 16.]]]


In [98]:
numpy_array.flatten()
numpy_array.ravel()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])
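
Although `.flatten()` and `.ravel()` print the same values, they can differ in whether the result shares memory with the original array. The sketch below rebuilds the same 5 x 3 array with `np.arange` (an assumption made for brevity) and checks this with `np.shares_memory`.

```python
import numpy as np

numpy_array = np.arange(1, 16).reshape(5, 3)

flat_copy = numpy_array.flatten()  # always a new, independent array
flat_view = numpy_array.ravel()    # a view here, because the data is contiguous

print(np.shares_memory(numpy_array, flat_copy))  # False
print(np.shares_memory(numpy_array, flat_view))  # True

# Writing through the view changes the original; the copy is unaffected
flat_view[0] = 100
print(numpy_array[0, 0])  # 100
print(flat_copy[0])       # 1
```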

## Indexing and Slicing

(TODO): pretty much similar to Python lists
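
Until this section is written, the following minimal sketch illustrates the similarity to list indexing. The arrays and names are illustrative placeholders rather than material from the original course notes.

```python
import numpy as np

numpy_array = np.arange(10)
python_list = list(range(10))

# Basic indexing and slicing use the same syntax as lists
print(numpy_array[0], numpy_array[-1])  # 0 9
print(numpy_array[2:5])                 # [2 3 4]
print(python_list[2:5])                 # [2, 3, 4]

# Multi-dimensional arrays take one index or slice per axis
matrix = np.arange(12).reshape(3, 4)
print(matrix[1, 2])  # 6
print(matrix[:, 0])  # [0 4 8]
```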

## Vectorization

- Since NumPy is implemented in C, replacing Python loops with NumPy's built-in array operations is usually faster and more efficient

In [99]:
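# loop method (write into a separate array so that repeated timing runs
#   do not keep squaring the same values)
array = np.array(range(50))
squared = np.empty_like(array)
time_loop = %timeit -q -o -r 3 for i, element in enumerate(array): squared[i] = element ** 2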
# vectorized method
array = np.array(range(50))
time_vec = %timeit -q -o -r 3 array ** 2
print(f"Vectorized operation is {time_loop.average / time_vec.average:.2f}x faster than looping here.")

Vectorized operation is 23.20x faster than looping here.
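
The `%timeit` magic only works inside IPython or Jupyter. As a rough equivalent for a plain Python script, the comparison could be sketched with the standard-library `timeit` module. The helper functions `square_with_loop` and `square_vectorized` are hypothetical names introduced for this example, and the measured times will vary by machine.

```python
import timeit
import numpy as np

array = np.arange(50)

def square_with_loop(values):
    # Square one element at a time in Python
    result = np.empty_like(values)
    for i, element in enumerate(values):
        result[i] = element ** 2
    return result

def square_vectorized(values):
    # Square the whole array in a single NumPy operation
    return values ** 2

# Both approaches give the same answer
same_result = np.array_equal(square_with_loop(array), square_vectorized(array))
print("Same result:", same_result)

time_loop = timeit.timeit(lambda: square_with_loop(array), number=1000)
time_vec = timeit.timeit(lambda: square_vectorized(array), number=1000)
print(f"Loop: {time_loop:.4f}s, vectorized: {time_vec:.4f}s")
```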


## (Optional) Advanced NumPy
- (TODO) Array Data Type
- (TODO) Memory Layout and Strides -- why is NumPy so great?

## Practices

(TODO) Practice on the existing questions

## Reference

- Credits to UBC MDS DSCI 511, as most of the content was adapted from the course materials.