# Week 3: NumPy Basics

Numerical Python, also known as NumPy, is one of the most important packages for numerical computing and for data scientists. The other packages that we will discuss later in this course are built on it. It provides an efficient multidimensional array object, mathematical functions for fast operations on array data, and linear algebra objects and methods.

## ndarray: A Multidimensional Array Object

ndarray is a fast, flexible container for large datasets in Python. Arrays let you perform numerical operations on whole blocks of data using a simple syntax, similar to the one you use for operations between two scalars.

In [6]:
import numpy as np

NumPy makes it easy to generate random numbers and store them in arrays:

In [7]:
data = np.random.randn(2,3)

In [8]:
data

array([[ 0.37039954, -0.02220144,  0.64383258],
       [ 0.14493145, -1.21294423,  1.51413584]])

The outputs of the next two cells come from an earlier run with a different random draw of data, so they do not equal twice the values shown above. The point to notice is that data*2 and data + data produce the same element-wise result.

In [4]:
data*2

array([[ 2.36224275, -2.90004382, -3.08053861],
       [-2.69232785, -1.81205138, -3.6935566 ]])

In [5]:
data + data

array([[ 2.36224275, -2.90004382, -3.08053861],
       [-2.69232785, -1.81205138, -3.6935566 ]])
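
To make the contrast with plain Python sequences concrete, here is a minimal sketch comparing the same arithmetic on a list and on an array. The names `values`, `arr` and `doubled` are illustrative and do not appear in the cells above.

```python
import numpy as np

values = [1.0, 2.0, 3.0]
arr = np.array(values)

# On a list, * repeats the sequence; on an array it multiplies each element.
print(values * 2)   # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
print(arr * 2)      # [2. 4. 6.]

# Doubling and adding an array to itself agree element by element.
print(np.array_equal(arr * 2, arr + arr))  # True

# The list equivalent needs an explicit loop or comprehension.
doubled = [v * 2 for v in values]
print(doubled)      # [2.0, 4.0, 6.0]
```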

Unlike built-in Python sequences such as lists and tuples, an ndarray holds homogeneous data: all the elements of the array have the same type.

Every array has a shape attribute, a tuple giving the size of each dimension (for a two-dimensional array, the number of rows and columns). You can also retrieve the data type of an ndarray using the .dtype attribute.

In [6]:
data.shape

(2, 3)

In [7]:
data.dtype

dtype('float64')
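
The homogeneity rule can be seen directly when mixed values are passed in. The following is a small sketch with illustrative names (`mixed`, `as_list`, `grid`): NumPy picks one common dtype for the whole array.

```python
import numpy as np

# Mixing integers and a float: every element is stored as float64.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64

# A list keeps each element's own type; an array does not.
as_list = [1, 2, 3.5]
print([type(x).__name__ for x in as_list])  # ['int', 'int', 'float']

# shape has one entry per dimension; ndim and size follow from it.
grid = np.zeros((2, 3))
print(grid.shape, grid.ndim, grid.size)  # (2, 3) 2 6
```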

## Creating ndarrays

There are many ways to create an ndarray. The easiest is to use the __array__ function.

In [8]:
data = [3,4,5,0,0.5]

In [9]:
ar_1 = np.array(data)

In [10]:
ar_1

array([ 3. ,  4. ,  5. ,  0. ,  0.5])

Nested sequences create multidimensional arrays:

In [11]:
data_2 = [[1,4,5],[6,0,3]]

In [12]:
ar_2 = np.array(data_2)

In [13]:
ar_2.shape

(2, 3)

We can also use the .ndim attribute to retrieve the number of dimensions of the array:

In [14]:
ar_2.ndim

2

In [15]:
ar_2.dtype

dtype('int32')

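Other functions create special arrays with specific characteristics. np.zeros fills the array with zeros, while np.empty only allocates memory without initializing it, so its contents are arbitrary.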

In [16]:
np.zeros((2,5))

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [17]:
np.empty((2,3,2))

array([[[  9.78249979e-322,   0.00000000e+000],
        [  0.00000000e+000,   0.00000000e+000],
        [  0.00000000e+000,   0.00000000e+000]],

       [[  0.00000000e+000,   0.00000000e+000],
        [  0.00000000e+000,   0.00000000e+000],
        [  0.00000000e+000,   0.00000000e+000]]])

More interesting and useful is __arange__:

In [18]:
np.arange(15,25,2)

array([15, 17, 19, 21, 23])

arange is similar to the built-in range() that we saw earlier, but instead of a range object it returns an array.
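
As a rough sketch of the creation functions, the block below adds np.ones, np.full and np.zeros_like (real NumPy functions that are not shown in the cells above) and compares arange with range. The variable names are illustrative.

```python
import numpy as np

ones = np.ones((2, 3))         # filled with 1.0
sevens = np.full((2, 2), 7)    # filled with a chosen value
like = np.zeros_like(sevens)   # zeros with the shape and dtype of another array

# range() only accepts integers; arange() also accepts fractional steps.
ints = np.arange(15, 25, 2)
halves = np.arange(0, 2, 0.5)

print(list(range(15, 25, 2)))  # [15, 17, 19, 21, 23]
print(ints)                    # [15 17 19 21 23]
print(halves)                  # [0.  0.5 1.  1.5]
```

As with range(), the stop value is excluded from the result.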

## Data Types for ndarrays

__dtype__ is a special object containing the information (metadata) the ndarray needs to interpret a chunk of memory as a particular type of data:

In [19]:
ar_1 = np.array([1,2,3], dtype = np.float64)

In [20]:
ar_2 = np.array([1,2,3], dtype = np.int32)

You can explicitly cast an array from one dtype to another using the ndarray astype method, which returns a new array:

In [21]:
ar_1 = np.array([1,2,3]) # integer to float
ar_1.astype(np.float64)

array([ 1.,  2.,  3.])

In [22]:
ar_2 = np.array([2.33,4.11,4,1.725]) # float to integer: the fractional part is truncated
ar_2.astype(np.int32)

array([2, 4, 4, 1])

In [23]:
ar_3 = np.array(['1','4','101']) # string to integer
ar_3.astype(np.int32)

array([  1,   4, 101])
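
Two details of astype are easy to miss: truncation goes toward zero (so negative values are not rounded down), and the original array is left unchanged. A minimal sketch with illustrative names:

```python
import numpy as np

floats = np.array([2.33, -2.33, 4.99, -4.99])
ints = floats.astype(np.int32)

print(ints)          # [ 2 -2  4 -4]
print(floats.dtype)  # float64, the original array is untouched

# Strings that look like numbers can be converted to floats as well.
strings = np.array(['1.5', '2.25'])
as_floats = strings.astype(np.float64)
print(as_floats.tolist())  # [1.5, 2.25]
```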

## Basic Indexing and Slicing

NumPy offers many ways to select subsets of data from arrays. For a one-dimensional array it is straightforward:

In [11]:
ar_1 = np.arange(1,11)

In [12]:
ar_1

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [13]:
ar_1[2]

3

In [14]:
ar_1[1:3]

array([2, 3])

In [15]:
ar_1[1:3] = (20,21)

In [16]:
ar_1

array([ 1, 20, 21,  4,  5,  6,  7,  8,  9, 10])

Assigning a tuple of matching length to a slice overwrites the selected elements one by one. Assigning a single scalar instead (for example ar_1[1:3] = 20) writes that value to the whole selection, which is known as broadcasting. It is important to note that in NumPy a slice of an array is just a _view_ of the original array and not a new copy. For example:

In [17]:
ar_slice = ar_1[1:3]

In [18]:
ar_slice

array([20, 21])

In [19]:
ar_slice[1] = 55

In [20]:
ar_slice

array([20, 55])

In [34]:
ar_1

array([ 1, 20, 55,  4,  5,  6,  7,  8,  9, 10])

If you would like to modify a sliced subset of an array without affecting the original array, you have to copy it first:

In [22]:
ar_slice = ar_1[1:3].copy()

In [23]:
ar_slice

array([20, 55])

In [24]:
ar_slice[1] = 8

In [25]:
ar_slice

array([20,  8])

In [26]:
ar_1

array([ 1, 20, 55,  4,  5,  6,  7,  8,  9, 10])
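
One way to check whether two arrays share data is np.shares_memory. The sketch below uses it to contrast a slice with an explicit copy, and also shows a scalar being broadcast to a slice. The names `base`, `view` and `copy` are illustrative.

```python
import numpy as np

base = np.arange(1, 11)

base[1:3] = 20            # the scalar is broadcast to both selected positions
print(base[:4])           # [ 1 20 20  4]

view = base[5:8]          # a slice shares memory with base
copy = base[5:8].copy()   # an explicit, independent copy

print(np.shares_memory(base, view))  # True
print(np.shares_memory(base, copy))  # False

view[:] = 0               # also changes base
copy[:] = 99              # leaves base alone
print(base)               # [ 1 20 20  4  5  0  0  0  9 10]
```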

With a two-dimensional array, the elements at each index are no longer scalars but one-dimensional arrays:

In [27]:
arr_2 = np.array([[1,2,3],[4,5,6],[7,8,9]])

In [28]:
arr_2[2]

array([7, 8, 9])

To make your selection more granular, you can use either of the two forms below:

In [29]:
arr_2[0][2]

3

In [30]:
arr_2[0,2]

3

![Indexing](indexing.jpg)

The picture above, taken from McKinney's book, illustrates the logic of two-dimensional indexing.

### Indexing with Slice

Two-dimensional arrays can be sliced with the same logic we used for one-dimensional arrays.

In [32]:
arr_2

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [33]:
arr_2[:2,1:]

array([[2, 3],
       [5, 6]])

In [45]:
arr_2[2,:]

array([7, 8, 9])

In [46]:
arr_2[:,:2]

array([[1, 2],
       [4, 5],
       [7, 8]])

In [47]:
arr_2[1,:2]

array([4, 5])

![Slicing](slicing.jpeg)
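
Mixing an integer index with a slice drops one dimension, while using slices on both axes keeps the result two-dimensional. A short sketch on a 3×3 array like the one above (the names `grid`, `row` and `row_2d` are illustrative):

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row = grid[1, :2]        # the integer index drops that axis
row_2d = grid[1:2, :2]   # a length-one slice keeps it

print(row, row.shape)        # [4 5] (2,)
print(row_2d, row_2d.shape)  # [[4 5]] (1, 2)

# Assigning to a slice writes through to the original array.
grid[:2, 1:] = 0
print(grid.tolist())         # [[1, 0, 0], [4, 0, 0], [7, 8, 9]]
```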

### Boolean Indexing

Boolean indexing is very useful for analysing data. A good grasp of the concept is crucial for mastering the next package we will discuss, pandas.

We start by creating a random two-dimensional array using NumPy's built-in functionality (more on that later).

In [48]:
data = np.random.randn(7,4)

In [49]:
data

array([[ -4.52740784e-01,   1.13714280e+00,   9.75559451e-01,
          9.39912618e-01],
       [ -1.55725906e+00,  -3.18342638e-01,   4.22905375e-01,
          5.11407394e-01],
       [  1.32655386e+00,  -9.97085938e-01,  -6.30765406e-01,
         -7.56694522e-01],
       [  1.47583815e-01,   3.53870296e-02,  -2.11233226e-02,
         -2.13722297e-01],
       [ -6.77126648e-01,   2.35127756e+00,   9.50454220e-01,
         -3.60610630e-01],
       [  2.44289302e-01,   1.27030282e+00,   1.06703758e+00,
         -2.31210201e-01],
       [ -2.98261092e-01,   4.57562779e-01,   1.24715431e-03,
          7.50791818e-02]])

Suppose that each row of random numbers is associated with a person:

In [35]:
persons = np.array(['Marc','John','Bob', 'Stuart','Linda','Susan','Romeo'])

Like the arithmetic operators, the comparison operators are vectorized. Vectorization applies an operation to every element of an array without an explicit Python loop, which usually makes the code both shorter and much faster. As we have seen in previous weeks, many operations can be applied this way, including element-wise multiplication and dot products on vectors (arrays). Below is an example of vectorization applied to a comparison operator:

In [37]:
y = (persons == "Bob")
y

array([False, False,  True, False, False, False, False])

Comparing the array of names to a single name returns a boolean array with a __True__ value at each position where the name matches.

Now if we pass that boolean array as an index into the data array, we get the rows corresponding to Bob.

In [52]:
data[persons=="Bob"]

array([[ 1.32655386, -0.99708594, -0.63076541, -0.75669452]])

Naturally, the length of the names array needs to be the same as the length of the data array axis it indexes. Moreover, the order of the data and the names needs to match.

Boolean indexing can be used in conjunction with simple indexing:

In [53]:
data[persons=="Bob",2:]

array([[-0.63076541, -0.75669452]])

The operator __~__ can be used to select the opposite:

In [54]:
data[~(persons=="Bob"),2:]

array([[ 0.97555945,  0.93991262],
       [ 0.42290538,  0.51140739],
       [-0.02112332, -0.2137223 ],
       [ 0.95045422, -0.36061063],
       [ 1.06703758, -0.2312102 ],
       [ 0.00124715,  0.07507918]])

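The other comparison operators we saw in week 1, such as smaller than and larger than (<, >), apply here as well. To combine conditions, use & (and) and | (or) instead of the Python keywords and / or, which do not work element-wise on boolean arrays.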

In [56]:
data[(persons=="Bob")|(persons=="John")]

array([[-1.55725906, -0.31834264,  0.42290538,  0.51140739],
       [ 1.32655386, -0.99708594, -0.63076541, -0.75669452]])
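
The sketch below puts the pieces of boolean indexing together on a small hand-made dataset (the values and the names `names`, `scores`, `mask` are made up for illustration). It also shows the error raised by the keyword or, and how a boolean mask can be used for assignment.

```python
import numpy as np

names = np.array(['Marc', 'John', 'Bob', 'Stuart'])
scores = np.array([[1.0, -2.0], [-3.0, 4.0], [5.0, 6.0], [-7.0, -8.0]])

# Each comparison needs its own parentheses because | binds tighter than ==.
mask = (names == 'Bob') | (names == 'John')
selected = scores[mask]
print(selected.tolist())   # [[-3.0, 4.0], [5.0, 6.0]]

# The keyword `or` asks for a single truth value and fails on arrays.
raised = False
try:
    (names == 'Bob') or (names == 'John')
except ValueError:
    raised = True
print(raised)              # True

# A boolean mask can also select the elements to overwrite.
scores[scores < 0] = 0
print(scores.tolist())     # [[1.0, 0.0], [0.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
```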

### Fancy Indexing

_Fancy indexing_ is a term adopted by NumPy to describe indexing using integer arrays.

In [39]:
arr_3 = np.empty((8,4))

In [40]:
arr_3

array([[ 1.03753786e-322,  0.00000000e+000,  2.12199579e-314,
         5.02034658e+175],
       [ 2.88186658e+180,  1.67468241e-076,  2.27235613e+184,
         3.72534537e-057],
       [ 5.01163106e+217,  5.49419094e-143,  1.50008929e+248,
         1.06396443e+224],
       [ 2.54943597e+161,  1.89165522e+219,  1.35617292e+248,
         9.56900005e-071],
       [ 2.81148185e-033,  2.25366708e+180,  3.40637061e+175,
         5.18941905e-144],
       [ 1.16097020e-028,  9.15558397e-072,  2.73744222e-052,
         1.72451605e-047],
       [ 1.11763122e+261,  1.16318408e-028,  6.79764374e-310,
         2.17903258e-314],
       [-2.00389129e+000,  1.49166815e-154,  2.16917212e-314,
        -3.19703655e-285]])

In [41]:
for i in range(8):
    arr_3[i] = i

In [60]:
arr_3

array([[ 0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.],
       [ 5.,  5.,  5.,  5.],
       [ 6.,  6.,  6.,  6.],
       [ 7.,  7.,  7.,  7.]])

One feature of fancy indexing is that it allows you to select a subset of the data in a chosen order:

In [61]:
arr_3[[3,1,5]]

array([[ 3.,  3.,  3.,  3.],
       [ 1.,  1.,  1.,  1.],
       [ 5.,  5.,  5.,  5.]])

As with simple indexing, negative indices select rows counting from the end:

In [42]:
arr_3[[-1,-5]]

array([[7., 7., 7., 7.],
       [3., 3., 3., 3.]])

Passing multiple index arrays does something slightly different; it selects a one-dimensional array of elements corresponding to each __tuple of indices__:

In [44]:
arr_4 = np.arange(32).reshape((8,4)) # we will revisit the reshape method at a later stage

In [45]:
arr_4

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [46]:
arr_4[[1],[2]]

array([6])

In [47]:
arr_4[[2],[3]]

array([11])

In [48]:
arr_4[[2,3],[2,3]]

array([10, 15])

Here the elements (2,2) and (3,3) are selected.

Unlike slicing, fancy indexing always copies the data into a new array.
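
Both points can be checked in a few lines. The sketch below shows that modifying a fancy-indexed result leaves the source untouched, and uses np.ix_ (a NumPy helper not covered above) to select a rectangular block rather than individual elements. The names `table`, `picked`, `pairs` and `block` are illustrative.

```python
import numpy as np

table = np.arange(32).reshape((8, 4))

picked = table[[1, 5]]    # fancy indexing returns a new array
picked[:] = -1            # modifying it...
print(table[1])           # [4 5 6 7]  ...does not touch table

# Two index arrays are paired up: elements (2, 2) and (3, 3).
pairs = table[[2, 3], [2, 3]]
print(pairs)              # [10 15]

# np.ix_ crosses them instead: rows 2 and 3 with columns 2 and 3.
block = table[np.ix_([2, 3], [2, 3])]
print(block.tolist())     # [[10, 11], [14, 15]]
```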

## Universal Functions

Universal functions (ufuncs) perform fast element-wise operations on an array. As with vectorization and comprehensions, the goal is to avoid writing explicit loops. Examples of universal functions are __np.sqrt()__ (square root) and __np.exp()__ (exponential).

In [68]:
arr_5 = np.arange(10)

In [69]:
arr_5

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [70]:
np.sqrt(arr_5)

array([ 0.        ,  1.        ,  1.41421356,  1.73205081,  2.        ,
        2.23606798,  2.44948974,  2.64575131,  2.82842712,  3.        ])

In [71]:
np.exp(arr_5)

array([  1.00000000e+00,   2.71828183e+00,   7.38905610e+00,
         2.00855369e+01,   5.45981500e+01,   1.48413159e+02,
         4.03428793e+02,   1.09663316e+03,   2.98095799e+03,
         8.10308393e+03])

![Slicing](ufunc_2.png)

![Slicing](ufunc_1.png)

In [50]:
ar_1 = np.arange(3)
ar_2 = np.arange(3,4,1)

In [51]:
ar_1

array([0, 1, 2])

In [53]:
ar_2

array([3])

In [73]:
np.add(ar_1,ar_2)

array([3, 4, 5])
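
np.add is a binary ufunc: it takes two arrays and combines them element by element. A length-one array, like the one used with np.add above, is broadcast against every element of the other operand. A small sketch with illustrative names, also using np.maximum:

```python
import numpy as np

a = np.array([1, 5, 3])
b = np.array([4, 2, 6])

print(np.add(a, b))       # [5 7 9], the same as a + b
print(np.maximum(a, b))   # [4 5 6], element-wise maximum

# A length-one array is broadcast across the other operand.
shifted = np.add(a, np.array([10]))
print(shifted)            # [11 15 13]
```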

### Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping that similarly returns a view
on the underlying data without copying anything. Arrays have the transpose method
and also the special T attribute: 



In [54]:
arr = np.arange(15).reshape((3, 5))
arr


array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [55]:
arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

When doing matrix computations, you may do this very often, for
example when computing the inner matrix product using np.dot:
 

In [10]:
arr = np.random.randn(6, 3)
arr


array([[-1.51048571,  0.00401612, -0.25099796],
       [ 1.05553343, -1.16102139, -2.52110152],
       [-0.43360617, -0.12973116, -0.87338529],
       [-0.11162047,  1.84808923,  1.0179832 ],
       [ 1.08956057,  1.574586  , -0.4978109 ],
       [-1.38216834, -0.50905537, -0.54511047]])

The outputs of the next two cells come from the earlier 3×5 integer arr (np.arange(15).reshape((3, 5))), not from the random 6×3 array just shown. For the 6×3 array, arr.T has shape (3, 6) and np.dot(arr.T, arr) has shape (3, 3).

In [58]:
arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

In [56]:
np.dot(arr.T, arr)

array([[125, 140, 155, 170, 185],
       [140, 158, 176, 194, 212],
       [155, 176, 197, 218, 239],
       [170, 194, 218, 242, 266],
       [185, 212, 239, 266, 293]])
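
As a sketch of the shape logic, the block below recomputes this product for a 3×5 integer array and compares np.dot with the @ operator, which is an equivalent way to write a matrix product for two-dimensional arrays. The names `X`, `gram` and `same` are illustrative.

```python
import numpy as np

X = np.arange(15).reshape((3, 5))

gram = np.dot(X.T, X)   # (5, 3) times (3, 5) gives (5, 5)
same = X.T @ X          # the @ operator computes the same product

print(gram.shape)                     # (5, 5)
print(np.array_equal(gram, same))     # True
print(np.array_equal(gram, gram.T))   # True, X.T @ X is always symmetric
print(gram[0])                        # [125 140 155 170 185]
```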

For higher dimensional arrays, transpose will accept a tuple of axis numbers
to permute the axes (for extra mind bending): 
 

In [11]:
arr = np.arange(16).reshape((2, 2, 4))
arr


array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [12]:
arr.transpose((1, 0, 2))

array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],

       [[ 4,  5,  6,  7],
        [12, 13, 14, 15]]])

Here, the axes have been reordered with the second axis first, the
first axis second, and the last axis unchanged. Simple transposing with .T is a
special case of swapping axes. ndarray has the method swapaxes , which
takes a pair of axis numbers and switches the indicated axes to
rearrange the data: 

 

In [14]:
arr.swapaxes(1, 2)

array([[[ 0,  4],
        [ 1,  5],
        [ 2,  6],
        [ 3,  7]],

       [[ 8, 12],
        [ 9, 13],
        [10, 14],
        [11, 15]]])
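
A quick way to keep track of axis permutations is to look at the resulting shapes and at where a single element ends up. The following sketch uses illustrative names (`cube`, `swapped`, `permuted`) on a 2×2×4 array:

```python
import numpy as np

cube = np.arange(16).reshape((2, 2, 4))

swapped = cube.swapaxes(1, 2)          # exchange the last two axes
permuted = cube.transpose((1, 0, 2))   # exchange the first two axes

print(swapped.shape)    # (2, 4, 2)
print(permuted.shape)   # (2, 2, 4)

# swapaxes(1, 2) is the same as transposing with the tuple (0, 2, 1).
print(np.array_equal(swapped, cube.transpose((0, 2, 1))))  # True

# Element [i, j, k] of the permuted array is element [j, i, k] of the original.
print(permuted[1, 0, 2] == cube[0, 1, 2])                  # True

# Both results are views on the same data, not copies.
print(np.shares_memory(cube, swapped))                     # True
```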

### Expressing Conditional Logic as Array Operations

The numpy.where function is a
vectorized version of the ternary expression x if condition else y . Suppose we had a
boolean array and two arrays of values:  

In [16]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

Suppose we wanted to take a value from xarr whenever the corresponding value in
cond is True , and otherwise take the value from yarr . A list comprehension doing this might
look like: 
 

In [17]:
result = [(x if c else y)
          for x, y, c in zip(xarr, yarr, cond)]
result

[1.1, 2.2, 1.3, 1.4, 2.5]

This has multiple problems. First, it will not be very fast for
large arrays (because all the work is being done in interpreted Python
code). Second, it will not work with multidimensional arrays. With
np.where you can write this very
concisely:  

In [18]:
result = np.where(cond, xarr, yarr)
result

array([1.1, 2.2, 1.3, 1.4, 2.5])
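
The second and third arguments of np.where do not have to be arrays; either can be a scalar, and the call works unchanged on multidimensional input. A minimal sketch with made-up values (`values`, `signs` and `clipped` are illustrative names):

```python
import numpy as np

values = np.array([[1.5, -0.5], [-2.0, 3.0]])

# Replace positive entries with 2 and all others with -2.
signs = np.where(values > 0, 2, -2)
print(signs.tolist())     # [[2, -2], [-2, 2]]

# Keep positive entries and set the rest to 0.
clipped = np.where(values > 0, values, 0.0)
print(clipped.tolist())   # [[1.5, 0.0], [0.0, 3.0]]
```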

### Sorting

Like Python’s built-in list type, NumPy arrays can be sorted in-place with the sort method: 
 

In [21]:
arr = np.random.randn(6)
arr


array([-0.12524889, -0.28581913, -0.52520661,  0.73728121,  0.77442323,
        0.95569573])

In [22]:
arr.sort()
arr

array([-0.52520661, -0.28581913, -0.12524889,  0.73728121,  0.77442323,
        0.95569573])

You can sort each one-dimensional section of values in a multidimensional array
in-place along an axis by passing the axis number to
sort : 
 

In [65]:
arr = np.random.randn(5, 3)
arr


array([[-0.6573369 , -1.47411544,  0.11401005],
       [ 0.63486172,  0.89230587,  0.930767  ],
       [-0.11451278,  1.61558426,  0.15553616],
       [-0.50800112, -0.3281146 , -0.90926197],
       [ 2.38992183,  0.69229835, -1.58705303]])

In [67]:
arr.sort(0)
arr

array([[-0.6573369 , -1.47411544, -1.58705303],
       [-0.50800112, -0.3281146 , -0.90926197],
       [-0.11451278,  0.69229835,  0.11401005],
       [ 0.63486172,  0.89230587,  0.15553616],
       [ 2.38992183,  1.61558426,  0.930767  ]])

The top-level function np.sort returns a sorted copy of an array instead of modifying the array
in-place. A quick-and-dirty way to compute the quantiles of an array is
to sort it and select the value at a particular rank:

In [63]:
large_arr = np.random.randn(1000)


In [64]:
large_arr.sort()
large_arr[int(0.05 * len(large_arr))] # 5% quantile

-1.5300630180614765
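
The difference between the two ways of sorting, and how the rank-based quantile compares with NumPy's own np.quantile, can be sketched as follows. The data here is deterministic rather than random so that the results are easy to check; all names are illustrative.

```python
import numpy as np

unsorted = np.array([3.0, 1.0, 2.0, 5.0, 4.0])

sorted_copy = np.sort(unsorted)   # returns a new, sorted array
print(unsorted.tolist())          # [3.0, 1.0, 2.0, 5.0, 4.0]
print(sorted_copy.tolist())       # [1.0, 2.0, 3.0, 4.0, 5.0]

unsorted.sort()                   # sorts in place and returns None
print(unsorted.tolist())          # [1.0, 2.0, 3.0, 4.0, 5.0]

# Rank-based 5% quantile versus np.quantile on the values 0, 1, ..., 999.
data = np.arange(1000, dtype=float)
by_rank = np.sort(data)[int(0.05 * len(data))]
by_numpy = np.quantile(data, 0.05)
print(by_rank)                    # 50.0
print(np.isclose(by_numpy, 49.95))  # True
```

By default np.quantile interpolates linearly between neighbouring values, which is why the two results differ slightly.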

### Pseudorandom Number Generation

In [29]:
samples = np.random.normal(size=(4, 4))
samples

array([[-1.30317429, -0.1472297 , -0.16501691, -0.47490453],
       [ 0.11782677, -0.9796637 , -0.05836418,  0.0867748 ],
       [ 0.36661933,  0.12126098,  0.46321487,  1.44863245],
       [ 0.10524789, -0.18578626, -0.29095006, -2.42795169]])

In [30]:
np.random.seed(1234)
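
Seeding makes the pseudorandom sequence reproducible: the same seed always yields the same draws. The sketch below checks this for np.random.seed and, as an aside not covered in the cells above, for np.random.default_rng, which creates a generator object with its own state (available in NumPy 1.17 and later). The variable names are illustrative.

```python
import numpy as np

# Resetting the global seed replays the same sequence.
np.random.seed(1234)
first = np.random.randn(3)
np.random.seed(1234)
second = np.random.randn(3)
print(np.array_equal(first, second))   # True

# A Generator keeps its state in an object instead of globally.
rng = np.random.default_rng(1234)
draws = rng.standard_normal((2, 2))
again = np.random.default_rng(1234).standard_normal((2, 2))
print(np.array_equal(draws, again))    # True
print(draws.shape)                     # (2, 2)
```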