# Week 3: NumPy Basics

Numerical Python, also known as NumPy, is one of the most important packages for numerical computing and for data scientists. The other packages that we will discuss later in this course are also built on it. It provides an efficient multidimensional array object, mathematical functions for fast operations on array data, and linear algebra objects and methods.

## ndarray: A Multidimensional Array Object

ndarray is a fast, flexible container for large datasets in Python. Arrays let you perform numerical operations on whole blocks of data using simple syntax, similar to the syntax you use for operations between two scalars.

In [2]:
import numpy as np

NumPy is well suited to generating random series and storing them in arrays:

In [3]:
data = np.random.randn(2,3)

In [4]:
data

array([[-0.74717722, -0.11582525,  0.02094417],
       [-0.79871831,  0.9607829 ,  1.1026757 ]])

In [5]:
data*2

array([[-1.49435443, -0.23165049,  0.04188835],
       [-1.59743662,  1.92156581,  2.2053514 ]])

In [6]:
data + data

array([[-1.49435443, -0.23165049,  0.04188835],
       [-1.59743662,  1.92156581,  2.2053514 ]])
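To make the contrast with plain Python sequences concrete, the sketch below doubles the same values once with a list comprehension and once with array arithmetic. The names `values`, `doubled_list` and `doubled_arr` are illustrative and do not appear in the notebook above.

```python
import numpy as np

values = [1.5, -2.0, 0.25]            # illustrative data, not from the notebook

# Plain Python: loop over the elements explicitly (here a list comprehension)
doubled_list = [v * 2 for v in values]

# NumPy: the multiplication is applied to every element of the array
arr = np.array(values)
doubled_arr = arr * 2

print(doubled_list)
print(doubled_arr)

# Multiplying a list by 2 repeats the list instead of doubling its elements
repeated = values * 2
print(repeated)
```

The array version reads like arithmetic on a single number, which is the "simple syntax" described earlier.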

Unlike built-in Python sequences such as lists and tuples, an ndarray must hold homogeneous data, which means all the elements of the array are of the same type.
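A small sketch of what homogeneity means in practice: when the input mixes types, NumPy picks one type that can represent every element. The variable names below are illustrative only.

```python
import numpy as np

# Integers mixed with a float: every element is stored as a float
mixed_numbers = np.array([1, 2, 3.5])
print(mixed_numbers.dtype)            # float64

# Numbers mixed with a string: every element is stored as a string
mixed_text = np.array([1, 2, "three"])
print(mixed_text.dtype.kind)          # 'U' stands for a unicode string type

# The "numbers" are now text, so + concatenates instead of adding
joined = mixed_text[0] + mixed_text[1]
print(joined)                         # 12
```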

Every array has a shape attribute, a tuple giving the size of each dimension (here the number of rows and columns). You can also retrieve the data type of an ndarray using the .dtype attribute.

In [7]:
data.shape

(2, 3)

In [8]:
data.dtype

dtype('float64')

## Creating ndarrays

There are many ways to create an ndarray. The easiest one is to use the __array__ function.

In [9]:
data = [3,4,5,0,0.5]

In [10]:
ar_1 = np.array(data)

In [11]:
ar_1

array([3. , 4. , 5. , 0. , 0.5])

Nested sequences will create multidimensional arrays:

In [12]:
data_2 = [[1,4,5],[6,0,3]]

In [13]:
ar_2 = np.array(data_2)

In [14]:
ar_2.shape

(2, 3)

We can also use the .ndim attribute to retrieve the number of dimensions of the array:

In [15]:
ar_2.ndim

2

In [16]:
ar_2.dtype

dtype('int64')

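Other functions create arrays with specific characteristics. For example, zeros fills the array with zeros, while empty allocates the array without initializing its values, so its contents are arbitrary.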

In [17]:
np.zeros((2,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [18]:
np.empty((2,3,2))

array([[[-2.00000000e+000, -1.73059846e-077],
        [ 7.41098469e-323,  0.00000000e+000],
        [ 0.00000000e+000,  0.00000000e+000]],

       [[ 0.00000000e+000,  0.00000000e+000],
        [ 0.00000000e+000,  0.00000000e+000],
        [ 0.00000000e+000,  0.00000000e+000]]])

More interesting and useful is __arange__:

In [19]:
np.arange(15,25,2)

array([15, 17, 19, 21, 23])

arange is similar to the built-in range() that we have seen earlier, but instead of returning a range object it returns an array.
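The following sketch puts range() and arange side by side and adds a few other creation functions from NumPy (`linspace`, `ones`, `eye`) that were not used above. It is meant as a quick tour, not a complete list.

```python
import numpy as np

# range() yields Python integers; arange returns an ndarray
from_range = list(range(15, 25, 2))
from_arange = np.arange(15, 25, 2)

# Unlike range(), arange also accepts a fractional step
halves = np.arange(0, 2, 0.5)

# linspace takes the number of points instead of the step;
# the end point is included by default
points = np.linspace(0, 1, 5)

ones = np.ones((2, 3))      # 2 x 3 array filled with 1.0
identity = np.eye(3)        # 3 x 3 identity matrix

print(from_arange, halves, points, sep="\n")
```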

## Data Types for ndarrays

__dtype__ is a special object containing the information (metadata) the ndarray needs to interpret a chunk of memory as a particular type of data:

In [20]:
ar_1 = np.array([1,2,3], dtype = np.float64)

In [21]:
ar_2 = np.array([1,2,3], dtype = np.int32)

You can explicitly cast an array from one dtype to another using the ndarray astype method:

In [22]:
ar_1 = np.array([1,2,3]) # integer to float
ar_1.astype(np.float64)

array([1., 2., 3.])

In [23]:
ar_2 = np.array([2.33,4.11,4,1.725]) # float to integer: the fractional part is truncated
ar_2.astype(np.int32)

array([2, 4, 4, 1], dtype=int32)

In [24]:
ar_3 = np.array(['1','4','101']) # string to integer
ar_3.astype(np.int32)

array([  1,   4, 101], dtype=int32)
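Two details of astype are easy to miss: it returns a new array and leaves the original untouched, and float-to-integer conversion truncates toward zero rather than rounding. The sketch below illustrates both, plus what happens when a string cannot be parsed. All variable names are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = ints.astype(np.float64)

# astype returned a new array; the original keeps its integer dtype
print(ints.dtype.kind)      # 'i' (integer)
print(floats.dtype)         # float64

# Truncation is toward zero: -2.7 becomes -2, not -3
truncated = np.array([-2.7, 2.7]).astype(np.int32)
print(truncated)

# A string that is not a valid number makes the conversion fail
bad = np.array(["1", "4", "abc"])
try:
    bad.astype(np.int32)
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print("conversion failed:", err)
```

To keep the converted values, assign the result of astype to a name, as done with `floats` here.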

## Basic Indexing and Slicing

NumPy offers many ways to select subsets of data from arrays. For a one-dimensional array it is straightforward:

In [25]:
ar_1 = np.arange(10)

In [27]:
ar_1

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [29]:
ar_1[2]

2

In [30]:
ar_1[1:3]

array([1, 2])

In [32]:
ar_1[1:3] = 20

In [33]:
ar_1

array([ 0, 20, 20,  3,  4,  5,  6,  7,  8,  9])

Assigning the scalar value 20 to a slice, also known as broadcasting, overwrites the whole selection with that value. It is important to note that in NumPy a slice of an array is just a _view_ of the original array and not a new copy. For example:

In [35]:
ar_slice = ar_1[1:3]

In [36]:
ar_slice

array([20, 20])

In [37]:
ar_slice[1] = 55

In [38]:
ar_slice

array([20, 55])

In [40]:
ar_1

array([ 0, 20, 55,  3,  4,  5,  6,  7,  8,  9])

If you would like to modify a sliced subset of an array without affecting the original array, you have to copy it first:

In [50]:
ar_slice = ar_1[1:3].copy()

In [51]:
ar_slice

array([20, 55])

In [52]:
ar_slice[1] = 8

In [53]:
ar_slice

array([20,  8])

In [54]:
ar_1

array([ 0, 20, 55,  3,  4,  5,  6,  7,  8,  9])
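One way to check whether two arrays are a view and its original, rather than independent copies, is `np.shares_memory`. The sketch below repeats the experiment above on a fresh array; `base`, `view` and `independent` are illustrative names.

```python
import numpy as np

base = np.arange(10)
view = base[1:3]                 # a slice: shares memory with base
independent = base[1:3].copy()   # an explicit copy: has its own memory

print(np.shares_memory(base, view))          # True
print(np.shares_memory(base, independent))   # False

view[0] = 99          # writes through to base
independent[1] = -1   # leaves base untouched

print(base[:4])       # elements 0..3 of base after both assignments
```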

In a two-dimensional array, the elements at each index are no longer scalars but one-dimensional arrays:

In [57]:
arr_2 = np.array([[1,2,3],[4,5,6],[7,8,9]])

In [58]:
arr_2[2]

array([7, 8, 9])

To make your selection more granular and pick a single element, you can use either of the two forms below:

In [62]:
arr_2[0][2]

3

In [63]:
arr_2[0,2]

3
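The two forms return the same element. The chained form first builds the row and then indexes into it, while the comma form passes the row and column together in one step. A short sketch with an illustrative array named `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

chained = grid[0][2]    # take row 0, then element 2 of that row
direct = grid[0, 2]     # row 0, column 2 in a single indexing step
print(chained, direct)

# Negative indices count from the end along each axis
last = grid[-1, -1]
print(last)

# The comma form also works on the left-hand side of an assignment
grid[1, 1] = 50
print(grid[1])
```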

![Indexing](indexing.jpg)

The picture above, taken from McKinney's book, illustrates the logic of two-dimensional indexing.

### Indexing with Slice

Two-dimensional arrays can be sliced with the same logic we used for one-dimensional arrays.

In [65]:
arr_2[:2,1:]

array([[2, 3],
       [5, 6]])

In [67]:
arr_2[2,:]

array([7, 8, 9])

In [68]:
arr_2[:,:2]

array([[1, 2],
       [4, 5],
       [7, 8]])

In [69]:
arr_2[1,:2]

array([4, 5])

![Slicing](slicing.jpeg)
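Mixing an integer index with a slice drops one dimension, while using slices on both axes keeps the result two-dimensional. The sketch below checks the shapes for an illustrative array named `grid`.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row_part = grid[1, :2]       # integer + slice: one-dimensional result
row_block = grid[1:2, :2]    # slice + slice: still two-dimensional

column = grid[:, 1]          # one-dimensional column
column_block = grid[:, 1:2]  # a 3 x 1 two-dimensional column

print(row_part.shape, row_block.shape)
print(column.shape, column_block.shape)
```

The values selected are the same in each pair; only the number of dimensions differs.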

### Boolean Indexing

Boolean indexing is very useful for analysing data. A good grasp of the concept is crucial for mastering the next package we will discuss, pandas.

We start by creating a random two-dimensional array using NumPy's built-in functionality (more on that later).

In [70]:
data = np.random.randn(7,4)

In [71]:
data

array([[ 0.37983556,  1.07828681,  1.35302116,  0.52333715],
       [-0.28659569, -0.62474032,  1.79544835, -1.39742396],
       [-0.2129049 ,  0.91275131,  0.89733811,  0.08029183],
       [-0.89040692, -1.20032778, -0.26549335,  0.030391  ],
       [-0.44854832, -0.39691022,  0.7380965 , -1.89056374],
       [-0.28435641, -0.32650264,  0.18771804,  0.08716628],
       [ 1.0851778 ,  0.8529362 ,  1.27369232,  2.05066963]])

Suppose that each row of random numbers is associated with a person:

In [73]:
persons = np.array(['Marc','John','Bob', 'Stuart','Linda','Susan','Romeo'])

Like the arithmetic operators, the comparison operators are vectorized. Vectorization means that an operation is applied to a whole array at once, without writing an explicit Python loop, which usually makes the code both shorter and faster. As we have seen in previous weeks, various operations can be applied this way, including element-wise multiplication and the dot product of vectors (arrays). Below is an example of vectorization applied to a comparison operator:

In [74]:
persons == "Bob"

array([False, False,  True, False, False, False, False])
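For comparison, here is the same test written with an explicit loop and with a vectorized comparison. The arrays `names` and `scores` are made-up examples.

```python
import numpy as np

names = np.array(["Marc", "John", "Bob"])

# Explicit loop: compare the names one at a time
loop_result = [name == "Bob" for name in names.tolist()]

# Vectorized: one expression compares every element
vector_result = names == "Bob"
print(loop_result, vector_result)

# The same works for numeric comparisons; True counts as 1 when summing
scores = np.array([0.4, -1.2, 0.9])
positive = scores > 0
n_positive = positive.sum()
print(n_positive)
```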

Comparing the array of names to a single name returns a boolean array with a __True__ value at the position of that name.

If we now pass that boolean array as an index into the original data array, we get the row corresponding to Bob.

In [75]:
data[persons=="Bob"]

array([[-0.2129049 ,  0.91275131,  0.89733811,  0.08029183]])

Naturally, the boolean array must have the same length as the axis of the data array it indexes. Moreover, the order of the names must match the order of the rows in the data.
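The length requirement is enforced by NumPy: a boolean index that is too short or too long raises an IndexError. A minimal sketch with made-up arrays:

```python
import numpy as np

values = np.arange(12).reshape(3, 4)        # 3 rows, 4 columns
good_mask = np.array([True, False, True])   # one entry per row
short_mask = np.array([True, False])        # one entry missing

selected = values[good_mask]
print(selected.shape)

try:
    values[short_mask]
    mismatch_detected = False
except IndexError as err:
    mismatch_detected = True
    print("mask rejected:", err)
```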

Boolean indexing can be combined with simple indexing and slicing:

In [76]:
data[persons=="Bob",2:]

array([[0.89733811, 0.08029183]])

The operator __~__ can be used to invert the condition and select everything else:

In [77]:
data[~(persons=="Bob"),2:]

array([[ 1.35302116,  0.52333715],
       [ 1.79544835, -1.39742396],
       [-0.26549335,  0.030391  ],
       [ 0.7380965 , -1.89056374],
       [ 0.18771804,  0.08716628],
       [ 1.27369232,  2.05066963]])

In [78]:
data[persons=="Bob",2:]

array([[0.89733811, 0.08029183]])

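The other comparison operators we have seen in week 1, such as smaller than and larger than (<, >), apply here as well. To combine conditions, use & (and) and | (or) rather than the Python keywords and and or, which do not work with boolean arrays.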

In [79]:
data[(persons=="Bob")|(persons=="John")]

array([[-0.28659569, -0.62474032,  1.79544835, -1.39742396],
       [-0.2129049 ,  0.91275131,  0.89733811,  0.08029183]])
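The sketch below collects a few related points: why the keyword `or` fails on arrays, how `np.isin` expresses the same "either name" test, and how a boolean mask can be used to assign values. The arrays `people` and `table` are made up for the illustration.

```python
import numpy as np

people = np.array(["Marc", "John", "Bob", "John"])
table = np.array([[1.0, -2.0], [-3.0, 4.0], [5.0, -6.0], [7.0, 8.0]])

# | combines the two boolean arrays element by element
either = (people == "Bob") | (people == "John")

# The keyword `or` needs a single True/False, so it raises on arrays
try:
    (people == "Bob") or (people == "John")
    keyword_failed = False
except ValueError:
    keyword_failed = True

# np.isin tests membership in a list of values in one call
same = np.isin(people, ["Bob", "John"])

# Selecting with a mask returns a copy: changing it leaves table alone
selected = table[either]
selected[:] = 0

# Assigning through a mask does change the array in place
table[table < 0] = 0
print(table)
```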

### Fancy Indexing

_Fancy indexing_ is a term adopted by NumPy to describe indexing using integer arrays.

In [80]:
arr_3 = np.empty((8,4))

In [81]:
arr_3

array([[ 0.00000000e+000,  4.47032019e-038,  4.97288234e-091,
         1.38240109e-047],
       [ 2.81080421e-033,  1.01423523e-070,  3.99910963e+252,
         5.49419094e-143],
       [-2.00000000e+000, -2.00000000e+000,  2.14027814e+161,
         7.69801264e+218],
       [ 2.07081704e+161,  9.30537467e+199,  2.89938602e-057,
         1.43466005e-051],
       [ 3.73306734e-061,  6.45399498e-067,  4.30238741e-096,
         8.37170584e-144],
       [ 5.27485048e-091,  1.20142593e-071,  1.14461116e-071,
         5.74020278e+180],
       [ 8.37174974e-144,  1.25186958e-308,  6.93532342e-310,
         3.50786609e-322],
       [ 0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
         1.50008929e+248]])

In [82]:
for i in range(8):
    arr_3[i] = i

In [83]:
arr_3

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])
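With an array like this, where every row holds its own row number, the effect of an integer list as index is easy to read off. The sketch below rebuilds the same kind of array under the illustrative name `rows` and selects rows in a chosen order.

```python
import numpy as np

rows = np.empty((8, 4))
for i in range(8):
    rows[i] = i          # row i is filled with the value i

# A list of integers selects those rows, in the order given
picked = rows[[4, 3, 0, 6]]
print(picked)

# Negative integers select rows counting from the end
from_end = rows[[-1, -3]]
print(from_end)

# Unlike a slice, fancy indexing returns a copy of the data
picked[0] = -1
print(rows[4])           # still all 4s
```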