# Section 6 Introduction to NumPy

[NumPy -- *Numerical Python*](https://numpy.org/) provides the building blocks for the entire ecosystem of data science tools in Python. It is an efficient tool for storing and manipulating data, and it is [friendly to Matlab users](https://numpy.org/doc/stable/user/numpy-for-matlab-users.html).

Unfortunately, native NumPy does not support GPU operations. For arrays on the GPU there are several popular alternatives, such as tensors in [TensorFlow](https://www.tensorflow.org/) and arrays in [JAX](https://github.com/google/jax#quickstart-colab-in-the-cloud) (both by Google), tensors in [PyTorch](https://pytorch.org/) (originally by Facebook), or arrays in [CuPy](https://cupy.dev/) (originally by Preferred Networks). All of them are closely related to NumPy and offer a similar interface. Therefore, learning the basic concepts of NumPy is crucial for doing data science with Python.

In [2]:
import numpy as np

my_arr = np.arange(1000000)
my_list = list(range(1000000))

In [2]:
print(type(my_arr))

<class 'numpy.ndarray'>


In [4]:
my_arr

array([     0,      1,      2, ..., 999997, 999998, 999999])

In [7]:
%time for _ in range(10): my_arr2 = my_arr * 2 

Wall time: 28.5 ms


In [8]:
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

Wall time: 1.1 s


## Difference between ndarray and list: Data Memory Perspective

[Intuitively speaking](https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html), the built-in list object in Python can be viewed as an "address book" that stores pointers to separate, possibly heterogeneous Python objects as its elements. A NumPy array, on the other hand, stores a pointer to one contiguous memory block (the data buffer) managed by code implemented in C. That is why all elements of a NumPy array must have the same fixed type, and why operations on it are more efficient than on a list.
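
A minimal sketch of this difference, using only the standard attributes `dtype`, `itemsize` and `nbytes`. The variable names are illustrative, and the `int64` dtype is chosen explicitly here so that the byte counts do not depend on the platform:

```python
import numpy as np

# A list can hold references to objects of different types.
mixed_list = [1, "two", 3.0]
print([type(item).__name__ for item in mixed_list])

# An ndarray has one fixed element type (dtype) for the whole buffer.
fixed_arr = np.array([1, 2, 3, 4], dtype=np.int64)
print(fixed_arr.dtype)     # element type shared by all entries
print(fixed_arr.itemsize)  # bytes per element
print(fixed_arr.nbytes)    # total bytes in the data buffer

# Mixing ints and floats forces one common type for every element.
upcast_arr = np.array([1, 2.5])
print(upcast_arr.dtype)
```

Because every element has the same size, NumPy can locate any element by simple arithmetic on the buffer address instead of following one pointer per element.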

In [9]:
a = np.array([1,2,3,4]) #numpy 1-d array, initialization with list
l = [1,2,3,4]  # python built-in list

Slicing a NumPy array creates a *view* instead of a *copy*. The view shares the same data buffer with the original array.

In [10]:
b = a[0:2] # creating view by slicing

In [11]:
print(b)
b.base # a view has a base object because its memory belongs to another array

[1 2]


array([1, 2, 3, 4])

We can also check `flags` to see whether the array owns its data (`OWNDATA`).

In [15]:
b.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [16]:
a.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

This mechanism can surprise beginners: modifying a view also modifies the original array.

In [12]:
b[0] = 1000 # change the first element of b (which is the slice of a -- view) 
a

array([1000,    2,    3,    4])

In [15]:
b2 = a.copy()
print(b2)

[1000    2    3    4]


In [16]:
b2[0] = 11
print(b2)
print(a)

[11  2  3  4]
[1000    2    3    4]


In [17]:
b2.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

This is very different from the Python built-in list, where slicing creates a copy.

In [13]:
c = l[0: 2] #slicing in list, c is a copy of l
c[0] = 100
l

[1, 2, 3, 4]

Many other methods/functions in NumPy create a **view** instead of a **copy**, because a view avoids copying the data and is therefore far cheaper.

For example, `reshape` returns a view whenever possible, which is the case for most contiguous arrays when the new shape has the same total number of elements.

In [18]:
a_mat = a.reshape(2,2)
print(a_mat)

[[1000    2]
 [   3    4]]


In [19]:
print(a_mat[1,0]) #vertical index (top to bottom)
print(a_mat[0,1]) #horizontal index (left to right)

3
2


In [20]:
a_mat.base

array([1000,    2,    3,    4])

In [22]:
a_mat[0,0] = 2000 # same as a_mat[0][0]
a_mat[1,1] = -5
a

array([2000,    2,    3,   -5])

In [24]:
a_mat

array([[2000,    2],
       [   3,   -5]])

In [23]:
a_mat[0]

array([2000,    2])
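
The "whenever possible" qualifier for `reshape` matters. As a small sketch (with illustrative variable names), `np.shares_memory` can be used to check whether a reshaped array still shares its buffer with the original. Reshaping a transposed array usually cannot be expressed as a view, so NumPy falls back to a copy:

```python
import numpy as np

base_arr = np.arange(6).reshape(2, 3)

# Reshaping a contiguous array only reinterprets the same buffer: a view.
as_view = base_arr.reshape(3, 2)
print(np.shares_memory(base_arr, as_view))

# The transpose is a view, but flattening it in row-major order needs the
# elements in the order 0, 3, 1, 4, 2, 5. No regular stride over the
# original buffer produces that order, so reshape has to copy.
as_copy = base_arr.T.reshape(-1)
print(as_copy)
print(np.shares_memory(base_arr, as_copy))

as_copy[0] = 100       # modifies only the copy
print(base_arr[0, 0])  # the original is unchanged
```

When it matters whether a result is a view, checking with `np.shares_memory` is safer than assuming.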

Transpose also creates a **view**.

In [25]:
a_t = a_mat.T # attribute
a_tt = a_mat.transpose() # method
print(a_t) #swapped row and column
print(a_tt)

[[2000    3]
 [   2   -5]]
[[2000    3]
 [   2   -5]]


In [26]:
a_t.base

array([2000,    2,    3,   -5])

In [27]:
a_t[0,0] = 0 # change the view -- change the data buffer -- the base a is also changed!
a

array([ 0,  2,  3, -5])

Conversely, once the "base" is changed, **all** the associated "view" objects are changed!

In [28]:
a_mat # reshape of a -- view, changed!

array([[ 0,  2],
       [ 3, -5]])

In [29]:
b # slicing of a -- view, changed!

array([0, 2])

Use the `copy` method to create a new data buffer.

In [30]:
a_copy = a.copy()
print(a_copy)
print(a_copy.base)

[ 0  2  3 -5]
None


In [31]:
a_copy.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [32]:
a_mat_copy = a_mat.copy()

In [33]:
a_mat_copy.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

## NumPy ndarray as an Object

As an object created by NumPy, an ndarray has an identity, a type, a value, attributes and methods.

In [34]:
type(a) 

numpy.ndarray

In [None]:
dir(a)

In [None]:
help(a)

In [74]:
a = np.arange(12)
print(a.shape) # 1-d array with length 12 -- different from a 12x1 2-d array!
print(a)
a.T

(12,)
[ 0  1  2  3  4  5  6  7  8  9 10 11]


array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [76]:
b = a.reshape(-1,4) # -1 lets NumPy infer this dimension from the array size: 12/4 = 3 rows
print(b.shape)
print(b)
b.T

(3, 4)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

In [63]:
a_mat.shape 

(2, 2)

In [65]:
L = a_mat.tolist() #each row of matrix is an element of the new list
print(L)
print(L[1])
print(L[1][0])

[[0, 2], [3, -5]]
[3, -5]
3


In [66]:
type(L)

list

In [68]:
print(a_mat)
print(a_mat[0])
print(a_mat[0][0])

[[ 0  2]
 [ 3 -5]]
[0 2]
0


In [69]:
a.mean()

1.5

In [70]:
help(a.mean)

Help on built-in function mean:

mean(...) method of numpy.ndarray instance
    a.mean(axis=None, dtype=None, out=None, keepdims=False)
    
    Returns the average of the array elements along given axis.
    
    Refer to `numpy.mean` for full documentation.
    
    See Also
    --------
    numpy.mean : equivalent function



In [71]:
np.mean(a)

1.5

In [72]:
help(a.reshape)

Help on built-in function reshape:

reshape(...) method of numpy.ndarray instance
    a.reshape(shape, order='C')
    
    Returns an array containing the same data with a new shape.
    
    Refer to `numpy.reshape` for full documentation.
    
    See Also
    --------
    numpy.reshape : equivalent function
    
    Notes
    -----
    Unlike the free function `numpy.reshape`, this method on `ndarray` allows
    the elements of the shape parameter to be passed in as separate arguments.
    For example, ``a.reshape(10, 11)`` is equivalent to
    ``a.reshape((10, 11))``.



## Dimension and Axis of ndarray

NumPy uses the terms *dimension* and *axis* (indexed from 0) to describe the independent directions along which an array is indexed; the number of axes is given by the attribute `ndim`. [See the illustrations here.](https://www.cs.ubc.ca/~pcarter/cs189/cs189_ch7s3.html)

In [52]:
a = np.arange(24).reshape(2,3,4) # 3-d array, or tensor  
a

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
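
As a brief sketch of how these terms map onto array attributes (the variable name `cube` is illustrative): `ndim` is the number of axes, `shape` gives the length of each axis, and reducing along an axis removes that axis from the result.

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)

print(cube.ndim)   # number of axes
print(cube.shape)  # length of each axis, in order axis 0, 1, 2
print(cube.size)   # total number of elements

# Summing along one axis removes exactly that axis from the shape.
for axis in range(cube.ndim):
    print(axis, cube.sum(axis=axis).shape)
```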

In the `reshape` method, you can pass the value -1 for at most one dimension to let NumPy infer its length from the total number of elements.

In [4]:
np.arange(24).reshape(2,-1,4)

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
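
A short sketch of the rules for -1 in `reshape` (variable names are illustrative): the inferred length is the total size divided by the product of the known lengths, only one dimension may be left unknown, and the division must be exact.

```python
import numpy as np

flat = np.arange(24)
error_messages = []

# The -1 entry is inferred as 24 / (2 * 4) = 3.
print(flat.reshape(2, -1, 4).shape)

# Only one dimension may be left unknown.
try:
    flat.reshape(-1, -1)
except ValueError as err:
    error_messages.append(str(err))

# The known dimensions must divide the total size evenly (24 / 5 is not an integer).
try:
    flat.reshape(5, -1)
except ValueError as err:
    error_messages.append(str(err))

for message in error_messages:
    print("ValueError:", message)
```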

In [78]:
print(a[1,0,0]) #first index is depth (from front to back)
print(a[0,1,0]) #second index is vertical (from top to bottom), which row do you want?
print(a[0,0,1]) #third index is horizontal (from left to right), which column do you want?

12
4
1


In [5]:
help(np.arange) # note the difference from range()

Help on built-in function arange in module numpy:

arange(...)
    arange([start,] stop[, step,], dtype=None, *, like=None)
    
    Return evenly spaced values within a given interval.
    
    Values are generated within the half-open interval ``[start, stop)``
    (in other words, the interval including `start` but excluding `stop`).
    For integer arguments the function is equivalent to the Python built-in
    `range` function, but returns an ndarray rather than a list.
    
    When using a non-integer step, such as 0.1, the results will often not
    be consistent.  It is better to use `numpy.linspace` for these cases.
    
    Parameters
    ----------
    start : integer or real, optional
        Start of interval.  The interval includes this value.  The default
        start value is 0.
    stop : integer or real
        End of interval.  The interval does not include this value, except
        in some cases where `step` is not an integer and floating point
        round-off 

In [53]:
a

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [59]:
a[0:1,:,1:2]

array([[[1],
        [5],
        [9]]])

In [81]:
print(a.T) #Exercise: Describe what happens here
a.T.shape

[[[ 0 12]
  [ 4 16]
  [ 8 20]]

 [[ 1 13]
  [ 5 17]
  [ 9 21]]

 [[ 2 14]
  [ 6 18]
  [10 22]]

 [[ 3 15]
  [ 7 19]
  [11 23]]]


(4, 3, 2)

In [61]:
a.T[1:2,:,0:1]

array([[[1],
        [5],
        [9]]])
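
One way to read the two cells above: for an array with more than two axes, `.T` reverses the order of all axes, so `a.T[k, j, i]` refers to the same element as `a[i, j, k]`. The sketch below uses `np.transpose` with an explicit `axes` argument to reproduce that permutation and to show a different one:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# .T reverses the axis order: shape (2, 3, 4) becomes (4, 3, 2).
reversed_axes = np.transpose(a, axes=(2, 1, 0))
print(np.array_equal(a.T, reversed_axes))

# The same element, addressed through the original and the transposed array.
print(a[1, 2, 3], a.T[3, 2, 1])

# Other permutations are possible, e.g. swapping only the last two axes.
swapped_last_two = np.transpose(a, axes=(0, 2, 1))
print(swapped_last_two.shape)
```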

In [63]:
A4 = np.arange(48).reshape(2,3,2,4)
print(A4)

[[[[ 0  1  2  3]
   [ 4  5  6  7]]

  [[ 8  9 10 11]
   [12 13 14 15]]

  [[16 17 18 19]
   [20 21 22 23]]]


 [[[24 25 26 27]
   [28 29 30 31]]

  [[32 33 34 35]
   [36 37 38 39]]

  [[40 41 42 43]
   [44 45 46 47]]]]


In [82]:
a_1d = np.array([1,2,3,4])
a_1d.shape #1 axis

(4,)

In [83]:
a_1d.T.shape # the transpose is still a 1-D array! This is very different from Matlab!

(4,)

In [84]:
print(a_1d)
print(a_1d.T)

[1 2 3 4]
[1 2 3 4]


In [86]:
a_2d = a_1d[:,np.newaxis] # add a dimension; the new axis of length 1 is the second one (axis 1)
a_2d.shape

(4, 1)

In [87]:
help(np.newaxis) # np.newaxis is just an alias for None, so help() shows NoneType; inside an index it inserts a new axis of length 1

Help on NoneType object:

class NoneType(object)
 |  Methods defined here:
 |  
 |  __bool__(self, /)
 |      self != 0
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |  
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.



In [88]:
print(a_2d) #recall the first index is vertical!
print(a_2d.T)

[[1]
 [2]
 [3]
 [4]]
[[1 2 3 4]]


In [13]:
a_1d

array([1, 2, 3, 4])

In [89]:
print(a_1d.ndim)
print(a_2d.ndim)

1
2


In [90]:
a_2d_alt = a_1d[np.newaxis,1:3]
print(a_2d_alt)
print(a_2d_alt.shape)
a_3d = a_1d[:,np.newaxis,np.newaxis] # 3-d array where the original array runs along the first (depth) axis
print(a_3d)
print(a_3d.shape)
# experiment with a_1d[np.newaxis,:,np.newaxis] and a_1d[np.newaxis,np.newaxis,:] to see what you get!

[[2 3]]
(1, 2)
[[[1]]

 [[2]]

 [[3]]

 [[4]]]
(4, 1, 1)
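
As a small sketch, the following shows that `np.newaxis` is the same object as `None`, and that indexing with it gives the same shape as `reshape` or `np.expand_dims`. The variable names are illustrative:

```python
import numpy as np

a_1d = np.array([1, 2, 3, 4])

# np.newaxis is only a readable name for None.
print(np.newaxis is None)

# Four ways to turn the 1-d array into a column of shape (4, 1).
col_newaxis = a_1d[:, np.newaxis]
col_none = a_1d[:, None]
col_reshape = a_1d.reshape(-1, 1)
col_expand = np.expand_dims(a_1d, axis=1)

for col in (col_newaxis, col_none, col_reshape, col_expand):
    print(col.shape)
```

Writing `np.newaxis` instead of `None` is mainly a readability choice: it makes the intent of the index explicit.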


To change a multi-dimensional array into a 1-d array, in addition to `reshape` (returns a view when possible), we can also choose `ravel` (returns a view when possible) or `flatten` (always returns a copy).

In [91]:
a_mat = np.zeros((2,2)) # the shape is passed as a tuple, hence the double parentheses
a_mat_reshape = a_mat.reshape(-1) # -1 lets NumPy infer the length (here 4) -- creates a view
a_mat_ravel =  a_mat.ravel()
a_mat_flatten = a_mat.flatten()

In [92]:
a_mat_reshape

array([0., 0., 0., 0.])

In [93]:
a_mat_ravel #view of a_mat

array([0., 0., 0., 0.])

In [94]:
a_mat_flatten #copy of a_mat

array([0., 0., 0., 0.])

In [95]:
a_mat_ravel.base

array([[0., 0.],
       [0., 0.]])

In [53]:
a_mat_flatten.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
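
The practical consequence of `ravel` versus `flatten` can be sketched by writing into each result (variable names are illustrative): a write through the `ravel` result reaches the original matrix, while a write into the `flatten` result does not.

```python
import numpy as np

mat = np.zeros((2, 2))
raveled = mat.ravel()      # view of mat (mat is contiguous)
flattened = mat.flatten()  # independent copy

raveled[0] = 1.0    # also changes mat[0, 0]
flattened[1] = 2.0  # changes only the copy

print(mat)
print(flattened)
```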

## Indexing of ndarray

**1. Slicing: Similar to list indexing**

Always remember that slicing creates a view instead of a copy!

In [3]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)
b = a[:2, 1:3] # creates a view instead of copy
print(b)
print(a[0, 1])   
b[0, 0] = 77
print(a[0, 1])
print(a)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
[[2 3]
 [6 7]]
2
77
[[ 1 77  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


Be cautious with the difference between simple indexing (one integer index) and slicing: an integer index removes that axis from the result, while a slice keeps it.

In [5]:
a[:,1] # 1-d array

array([77,  6, 10])

In [7]:
a[:,1:2] # 2-d array

array([[77],
       [ 6],
       [10]])

In [22]:
a[0:1,:] # 2-d array, look for 2 sets of brackets [[ ]]

array([[ 1, 77,  3,  4]])

In [23]:
a[0,:] #1-d array, look for 1 set of brackets []

array([ 1, 77,  3,  4])

For more practice, see Figure 4-2 in [this material](https://www.oreilly.com/library/view/python-for-data/9781449323592/ch04.html).

**2. Boolean Indexing**

In [8]:
a<5

array([[ True, False,  True,  True],
       [False, False, False, False],
       [False, False, False, False]])

In [9]:
a[a<5] = 0 # In NumPy terms, a<5 creates a "mask" containing True or False values

In [10]:
a

array([[ 0, 77,  0,  0],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [11]:
b = a[a>2]
b

array([77,  5,  6,  7,  8,  9, 10, 11, 12])

In [12]:
b[0] = -3
print(b)
print(a)
b.flags

[-3  5  6  7  8  9 10 11 12]
[[ 0 77  0  0]
 [ 5  6  7  8]
 [ 9 10 11 12]]


  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

Boolean indexing always creates a new NumPy ndarray (a copy) instead of a view.

In [13]:
x = np.arange(10)
y = x[(x>4) & (x<8)] # combine masks with the element-wise operator &; the keyword "and" raises an error here

In [14]:
y

array([5, 6, 7])

In [15]:
y.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
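
A short sketch of why `&` is needed when combining masks (variable names are illustrative). The keyword `and` asks for a single truth value of a whole array, which NumPy refuses to give; `np.logical_and` is an equivalent function form of `&` for boolean masks. The parentheses around each comparison are required because `&` binds more tightly than `>` and `<`.

```python
import numpy as np

x = np.arange(10)

# Element-wise combination of two boolean masks.
mask = (x > 4) & (x < 8)
print(x[mask])

# "and" tries to convert each mask to a single bool and fails.
try:
    (x > 4) and (x < 8)
    and_failed = False
except ValueError as err:
    and_failed = True
    print("ValueError:", err)

# Function form of the same element-wise operation.
print(x[np.logical_and(x > 4, x < 8)])
```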

**3. Integer Array Indexing (Fancy Indexing)**

General rule: `arr[[ind1,ind2]]` just means `np.array([arr[ind1],arr[ind2]])`

In [17]:
ind = np.array([1,0,2]) # a plain list [1,0,2] works as well
x = np.arange(0,10,2)
print(x)
y = x[ind] # equivalently, x[[1,0,2]]
print(y)
y.flags

[0 2 4 6 8]
[2 0 4]


  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [18]:
y[0] = 27
print(y)
print(x)

[27  0  4]
[0 2 4 6 8]


In [19]:
z = np.linspace(1,2,6) #6 evenly spaced points between 1 and 2
print(z)
z[ind]

[1.  1.2 1.4 1.6 1.8 2. ]


array([1.2, 1. , 1.4])

In [20]:
a = np.arange(12).reshape(3,4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [21]:
a[[1,0,2],:] #Rearranges the rows, keeps everything in each row

array([[ 4,  5,  6,  7],
       [ 0,  1,  2,  3],
       [ 8,  9, 10, 11]])

In [25]:
a[2,[1,0,2]] #Last row, then pull from indices on that row

array([ 9,  8, 10])
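
When index arrays are given for both axes, they are paired element by element rather than combined into a grid. The sketch below contrasts this with `np.ix_`, which builds the row-by-column selection instead (variable names are illustrative):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Paired indices: picks a[0, 1], a[1, 0] and a[2, 2].
paired = a[[0, 1, 2], [1, 0, 2]]
print(paired)

# Grid selection: rows 0 and 2 crossed with columns 1 and 3.
block = a[np.ix_([0, 2], [1, 3])]
print(block)
```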

## NumPy Universal Functions (ufuncs) and Aggregate Functions

As in Matlab, explicit loops in Python can be very slow for large-scale problems. To address this, NumPy relies on [vectorization](https://numpy.org/doc/stable/glossary.html#term-vectorization): the loops run in optimized C code and are exposed through NumPy universal functions (ufuncs).

NumPy ufuncs operate on ndarrays in an element-by-element fashion. You can find all the ufuncs in the [documentation](https://numpy.org/doc/stable/reference/ufuncs.html).

In [28]:
x = np.arange(1000000)
print(np.log(10))
print(np.log(1000000))
np.log(1+x) #ln(1+0), ln(1+1) up to ln(1+999999)

2.302585092994046
13.815510557964274


array([ 0.        ,  0.69314718,  1.09861229, ..., 13.81550856,
       13.81550956, 13.81551056])
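
As a small sketch of what "element by element" means, the following compares a ufunc call with an explicit Python loop over the same values, and shows that the arithmetic operator `+` on arrays corresponds to the ufunc `np.add`. The variable names are illustrative:

```python
import math
import numpy as np

x = np.arange(5)

# np.log is an instance of numpy.ufunc.
print(isinstance(np.log, np.ufunc))

# One vectorized call versus an explicit loop over the elements.
vectorized = np.log(1 + x)
looped = np.array([math.log(1 + value) for value in x])
print(np.allclose(vectorized, looped))

# Arithmetic operators on arrays dispatch to ufuncs such as np.add.
print(np.add(x, x))
print(np.array_equal(x + x, np.add(x, x)))
```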

We can also iterate over the elements of a NumPy array just as with a Python built-in list (and, of course, you can always access elements by iterating over indices), although this is not recommended for large-scale problems.

In [34]:
a = np.arange(6)
for elem in a:
    #print(elem)
    print(elem, end =" " )

0 1 2 3 4 5 

In [40]:
help(print)

Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.



In [35]:
a = a.reshape(2,-1)
for row in a:
    print(row, end =" " )  #default value is end ='\n', for a new line

[0 1 2] [3 4 5] 

In [36]:
for row in a:
    for elem in row:
        print(elem, end =" " ) 

0 1 2 3 4 5 

In [37]:
for elem in np.nditer(a):
    print(elem, end =" " )

0 1 2 3 4 5 

In [38]:
for (idx, elem) in np.ndenumerate(a):
    print([idx, elem])

[(0, 0), 0]
[(0, 1), 1]
[(0, 2), 2]
[(1, 0), 3]
[(1, 1), 4]
[(1, 2), 5]


NumPy also provides some useful aggregate functions.

In [39]:
a = np.arange(6).reshape(2,3)
a 

array([[0, 1, 2],
       [3, 4, 5]])

In [40]:
a.sum(axis=0)

array([3, 5, 7])

In [41]:
a.sum(axis=1).reshape(-1,1)

array([[ 3],
       [12]])

In [42]:
a.sum()

15

In [43]:
a.min(axis=1).reshape(-1,1)

array([[0],
       [3]])

In [44]:
b = np.arange(24).reshape(2,3,-1)
b

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [45]:
b.sum(axis=1) # sum over axis 1 (down the rows of each 3x4 block): one sum per column, 8 columns in total

array([[12, 15, 18, 21],
       [48, 51, 54, 57]])

In [49]:
b.max(axis=0) #max along depth axis.

array([[12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])
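
The `reshape(-1,1)` calls above restore the axis that the aggregation removed. As a sketch of an alternative, the `keepdims` parameter (listed in the `help(a.mean)` output earlier) keeps the reduced axis with length 1:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# keepdims=True keeps the summed axis, so the result has shape (2, 1).
row_sums = a.sum(axis=1, keepdims=True)
print(row_sums.shape)
print(np.array_equal(row_sums, a.sum(axis=1).reshape(-1, 1)))

# The same parameter is available for other aggregates such as min.
row_mins = a.min(axis=1, keepdims=True)
print(row_mins)
```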

## NumPy Linear Algebra Functions

See the reference [here](https://numpy.org/doc/stable/reference/routines.linalg.html?highlight=linear%20algebra#matrix-and-vector-products) and [compare it with Matlab](https://numpy.org/doc/stable/user/numpy-for-matlab-users.html). Be cautious with the operators `*` (element-wise product) and `@` (matrix product, available since Python 3.5) and the functions/methods `dot`, `vdot` and `matmul`.
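
A minimal sketch of how these operators and functions differ on small 2-d arrays. The matrices `A`, `B` and the vector `v` are illustrative; for arrays with more than two dimensions, `dot` and `matmul` follow different rules, which the `help` output describes.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
v = np.array([1, 2])

print(A * B)  # element-wise product, unlike Matlab's A*B
print(A @ B)  # matrix product

# For 2-d arrays, @, np.matmul and np.dot give the same result.
print(np.array_equal(A @ B, np.matmul(A, B)))
print(np.array_equal(A @ B, np.dot(A, B)))

# vdot flattens both arguments and returns a scalar.
print(np.vdot(A, B))

# A 1-d array works as a vector on either side of @.
print(A @ v)
print(np.dot(v, v))
```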

In [None]:
help(np.dot)

In [None]:
help(np.vdot)

In [None]:
help(np.matmul)