# Week 3: NumPy Basics

NumPy, short for Numerical Python, is one of the most important packages for numerical computing and for data scientists. The other packages that we will discuss later in this course are built on it. It provides an efficient multidimensional array object, mathematical functions for fast operations on array data, and linear algebra objects and methods.

## ndarray: A Multidimensional Array Object

ndarray is a fast, flexible container for large datasets in Python. Arrays let you perform numerical operations on whole blocks of data using simple syntax, similar to the syntax you use for operations between two scalars.

In [2]:
import numpy as np

NumPy is great for generating random series and storing them in arrays:

In [3]:
data = np.random.randn(2,3)

In [4]:
data

array([[ 1.181651  ,  1.05792236,  1.33022129],
       [ 0.7965656 , -0.66605603,  0.41649146]])

In [5]:
data*2

array([[ 2.363302  ,  2.11584473,  2.66044257],
       [ 1.59313119, -1.33211206,  0.83298292]])

In [6]:
data + data

array([[ 2.363302  ,  2.11584473,  2.66044257],
       [ 1.59313119, -1.33211206,  0.83298292]])

Unlike built-in Python sequences such as lists and tuples, an ndarray must hold homogeneous data, which means all the elements of the array have the same type.
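
As a minimal sketch of what homogeneity means in practice, the cell below mixes integers and a float in the input; NumPy then picks one common type for every element. The variable names `mixed` and `py_list` are illustrative only.

```python
import numpy as np

# Mixing int and float inputs: NumPy upcasts everything to one float dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64
print(mixed)         # [1.  2.  3.5]

# A plain Python list, by contrast, keeps each element's own type.
py_list = [1, 2, 3.5]
print([type(x).__name__ for x in py_list])  # ['int', 'int', 'float']
```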

Every array has a shape attribute, a tuple giving the size of each dimension (here, the number of rows and columns). You can also retrieve the data type of an ndarray using the .dtype attribute.

In [7]:
data.shape

(2, 3)

In [8]:
data.dtype

dtype('float64')

## Creating ndarrays

There are many ways to create an ndarray. The easiest is to use the __array__ function.

In [9]:
data = [3,4,5,0,0.5]

In [10]:
ar_1 = np.array(data)

In [11]:
ar_1

array([3. , 4. , 5. , 0. , 0.5])

Nested sequences will create multidimensional arrays:

In [12]:
data_2 = [[1,4,5],[6,0,3]]

In [13]:
ar_2 = np.array(data_2)

In [14]:
ar_2.shape

(2, 3)

We can also use the attribute .ndim to retrieve the dimensionality of the array:

In [15]:
ar_2.ndim

2

In [16]:
ar_2.dtype

dtype('int64')

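There are other functions that create special arrays with specific characteristics. np.zeros fills the array with zeros, while np.empty allocates memory without initializing it, so its contents are arbitrary.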

In [17]:
np.zeros((2,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [18]:
np.empty((2,3,2))

array([[[ 0.00000000e+000,  0.00000000e+000],
        [ 0.00000000e+000,  0.00000000e+000],
        [ 0.00000000e+000,  0.00000000e+000]],

       [[ 0.00000000e+000,  1.06396443e+224],
        [-3.10503618e+231, -3.10503618e+231],
        [ 1.35617292e+248,  3.20748339e-057]]])

More interesting and useful is __arange__:

In [19]:
np.arange(15,25,2)

array([15, 17, 19, 21, 23])

arange is similar to the built-in range() that we have seen earlier, but instead of a range object it returns an array.
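
A short sketch comparing the two side by side may help. The names `r`, `a` and `f` are illustrative; the last line shows one difference, namely that arange also accepts a non-integer step.

```python
import numpy as np

r = range(15, 25, 2)        # lazy Python range object
a = np.arange(15, 25, 2)    # NumPy array holding the same values

print(list(r))              # [15, 17, 19, 21, 23]
print(a)                    # [15 17 19 21 23]

# Unlike range(), arange() also accepts a non-integer step.
f = np.arange(0, 1, 0.25)
print(f)                    # [0.   0.25 0.5  0.75]
```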

## Data Types for ndarrays

__dtype__ is a special object containing the information (metadata) the ndarray needs to interpret a chunk of memory as a particular type of data:

In [20]:
ar_1 = np.array([1,2,3], dtype = np.float64)

In [21]:
ar_2 = np.array([1,2,3], dtype = np.int32)

You can explicitly cast an array from one dtype to another using the ndarray astype method:

In [22]:
ar_1 = np.array([1,2,3]) # integer to float
ar_1.astype(np.float64)

array([1., 2., 3.])

In [23]:
ar_2 = np.array([2.33,4.11,4,1.725]) # float to integer: the part after the decimal point is truncated
ar_2.astype(np.int32)

array([2, 4, 4, 1], dtype=int32)

In [24]:
ar_3 = np.array(['1','4','101']) # string to int
ar_3.astype(np.int32)

array([  1,   4, 101], dtype=int32)
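
One detail worth seeing explicitly is that astype returns a new array and leaves the original untouched. The sketch below illustrates this with made-up names (`ints`, `floats`, `vals`) and also shows that truncation goes toward zero for negative values.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = ints.astype(np.float64)   # new array; ints itself is not modified

print(ints.dtype.kind)   # i  (still an integer array)
print(floats.dtype)      # float64

# Casting floats to integers drops the fractional part (toward zero).
vals = np.array([2.9, -2.9])
print(vals.astype(np.int32))   # [ 2 -2]
```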

## Basic Indexing and Slicing

In NumPy there are many ways to select subsets of data from arrays. In a one-dimensional array it is straightforward:

In [25]:
ar_1 = np.arange(10)

In [26]:
ar_1

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [27]:
ar_1[2]

2

In [28]:
ar_1[1:3]

array([1, 2])

In [29]:
ar_1[1:3] = 20

In [30]:
ar_1

array([ 0, 20, 20,  3,  4,  5,  6,  7,  8,  9])

Assigning the scalar value 20 to a slice, also known as broadcasting, overwrites the whole selection with the value of the scalar. It is important to note that in NumPy a slice of an array is just a _view_ of the original array and not a new copy. For example:

In [31]:
ar_slice = ar_1[1:3]

In [32]:
ar_slice

array([20, 20])

In [33]:
ar_slice[1] = 55

In [34]:
ar_slice

array([20, 55])

In [35]:
ar_1

array([ 0, 20, 55,  3,  4,  5,  6,  7,  8,  9])

If you would like to modify a sliced subset of an array without affecting the original array, you have to copy it first:

In [36]:
ar_slice = ar_1[1:3].copy()

In [37]:
ar_slice

array([20, 55])

In [38]:
ar_slice[1] = 8

In [39]:
ar_slice

array([20,  8])

In [40]:
ar_1

array([ 0, 20, 55,  3,  4,  5,  6,  7,  8,  9])
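
The view-versus-copy distinction can also be checked directly. The sketch below uses np.shares_memory to test whether two arrays overlap in memory; the names `base`, `view` and `dup` are illustrative.

```python
import numpy as np

base = np.arange(10)
view = base[1:3]          # slice: a view onto base
dup = base[1:3].copy()    # explicit copy: independent data

print(np.shares_memory(base, view))  # True
print(np.shares_memory(base, dup))   # False

view[0] = 99              # writes through to base
print(base[1])            # 99
dup[0] = -1               # does not touch base
print(base[1])            # 99
```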

With a two-dimensional array, the elements at each index are no longer scalars but one-dimensional arrays:

In [41]:
arr_2 = np.array([[1,2,3],[4,5,6],[7,8,9]])

In [42]:
arr_2[2]

array([7, 8, 9])

To make your selection more granular, you can use either of the two forms below:

In [43]:
arr_2[0][2]

3

In [44]:
arr_2[0,2]

3

![Indexing](indexing.jpg)

The picture above, taken from McKinney's book, illustrates the logic of two-dimensional indexing.

### Indexing with Slice

Two-dimensional arrays can be sliced with the same logic we used for one-dimensional arrays.

In [45]:
arr_2[:2,1:]

array([[2, 3],
       [5, 6]])

In [46]:
arr_2[2,:]

array([7, 8, 9])

In [47]:
arr_2[:,:2]

array([[1, 2],
       [4, 5],
       [7, 8]])

In [48]:
arr_2[1,:2]

array([4, 5])

![Slicing](slicing.jpeg)

### Boolean Indexing

Boolean indexing is very useful for analysing data. A good grasp of the concept is crucial for mastering the next package we will discuss, pandas.

We start by creating a random two-dimensional array using NumPy's built-in functionality (more on that later).

In [49]:
data = np.random.randn(7,4)

In [50]:
data

array([[-0.41127922,  0.49356613,  0.07425284, -0.91020734],
       [ 0.69447578,  0.77654216, -0.77941591, -0.41152265],
       [-0.21381986,  0.42325474, -0.12498377, -0.27651019],
       [-1.57431172,  0.64672834,  1.12702635, -0.32928371],
       [-0.95702715,  0.22513531,  1.19058315, -0.43435616],
       [ 1.3149757 ,  0.3258301 , -1.39679604, -0.45601009],
       [ 0.0805522 ,  0.37074721, -1.68476036,  0.32535508]])

Suppose that each row of random numbers is associated with a person:

In [51]:
persons = np.array(['Marc','John','Bob', 'Stuart','Linda','Susan','Romeo'])

Like the arithmetic operators, the comparison operators are vectorized. Vectorization speeds up Python code by applying an operation to a whole array at once instead of looping over its elements in Python, which can reduce running time considerably. Various operations can be applied this way, as we have seen in previous weeks, including element-wise multiplication and the dot product on vectors (arrays). Below is an example of vectorization applied to a comparison operator:

In [52]:
persons == "Bob"

array([False, False,  True, False, False, False, False])

Comparing the array of names to a single name returns a boolean array with a __True__ value at the position of that name.

Now if we pass that boolean array as an index into the original array, we get back the data corresponding to Bob.

In [53]:
data[persons=="Bob"]

array([[-0.21381986,  0.42325474, -0.12498377, -0.27651019]])

Naturally, the length of the names array needs to be the same as the length of the data array axis it indexes. Moreover, the order of the data and the names needs to match.
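
To see what happens when the lengths do not match, here is a small sketch with hypothetical arrays (`values`, `labels`, `short_mask`). In current NumPy versions a boolean mask of the wrong length raises an IndexError rather than silently selecting rows.

```python
import numpy as np

values = np.arange(12).reshape(3, 4)     # 3 rows of data
labels = np.array(['a', 'b', 'c'])       # one label per row

print(values[labels == 'b'])             # [[4 5 6 7]]

short_mask = np.array([True, False])     # only 2 entries for 3 rows
try:
    values[short_mask]
except IndexError as err:
    print("IndexError:", err)
```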

Boolean indexing can be used in conjunction with simple indexing:

In [54]:
data[persons=="Bob",2:]

array([[-0.12498377, -0.27651019]])

The operator __~__ can be used to select the opposite:

In [55]:
data[~(persons=="Bob"),2:]

array([[ 0.07425284, -0.91020734],
       [-0.77941591, -0.41152265],
       [ 1.12702635, -0.32928371],
       [ 1.19058315, -0.43435616],
       [-1.39679604, -0.45601009],
       [-1.68476036,  0.32535508]])

The other comparison operators we have seen in week 1 apply here as well, such as smaller than and larger than (<, >). To combine conditions, use & (and) and | (or); the Python keywords and and or do not work with boolean arrays.

In [57]:
data[(persons=="Bob")|(persons=="John")]

array([[ 0.69447578,  0.77654216, -0.77941591, -0.41152265],
       [-0.21381986,  0.42325474, -0.12498377, -0.27651019]])
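
The sketch below combines a name condition with a numeric comparison on made-up data (`scores` and `names` are illustrative). Note the parentheses around each condition: & and | bind more tightly than the comparison operators.

```python
import numpy as np

scores = np.array([[ 1.5, -0.2],
                   [-0.7,  0.4],
                   [ 0.3,  2.1]])
names = np.array(['Ann', 'Ben', 'Cat'])

# Each condition is wrapped in parentheses before combining with &.
mask = (names != 'Ben') & (scores[:, 0] > 1.0)
print(mask)            # [ True False False]
print(scores[mask])    # [[ 1.5 -0.2]]

# A comparison on the data itself also yields a boolean array.
print(scores < 0)
```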

### Fancy Indexing

_Fancy indexing_ is a term adopted by NumPy to describe indexing using integer arrays.

In [58]:
arr_3 = np.empty((8,4))

In [59]:
arr_3

array([[ 1.63041663e-322,  0.00000000e+000,  0.00000000e+000,
         0.00000000e+000],
       [ 0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
         0.00000000e+000],
       [ 0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
                     nan],
       [-3.10503618e+231, -3.10503618e+231,  1.28457068e-322,
         0.00000000e+000],
       [ 0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
         0.00000000e+000],
       [ 0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
         0.00000000e+000],
       [ 0.00000000e+000,  2.05833592e-312, -3.10503618e+231,
        -3.10503618e+231],
       [ 2.07955588e-312,  2.14321575e-312, -3.10503618e+231,
        -3.10503618e+231]])

In [60]:
for i in range(8):
    arr_3[i] = i

In [61]:
arr_3

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])

One feature of fancy indexing is that it allows you to select a subset of the data in a certain order:

In [62]:
arr_3[[3,1,5]]

array([[3., 3., 3., 3.],
       [1., 1., 1., 1.],
       [5., 5., 5., 5.]])

As we have seen with simple indexing, negative numbers select from the bottom:

In [63]:
arr_3[[-1,-5]]

array([[7., 7., 7., 7.],
       [3., 3., 3., 3.]])

Passing multiple index arrays does something slightly different; it selects a one-dimensional array of elements corresponding to each __tuple of indices__:

In [64]:
arr_4 = np.arange(32).reshape((8,4)) # we will revisit the reshape method at a later stage.

In [65]:
arr_4

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [66]:
arr_4[[1],[2]]

array([6])

In [67]:
arr_4[[2],[3]]

array([11])

In [68]:
arr_4[[2,3],[2,3]]

array([10, 15])

Here the elements (2,2) and (3,3) are selected.

Unlike slicing, fancy indexing creates new copies.
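
A quick way to convince yourself of this is to modify the result of each kind of selection and look at the original array. This is only a sketch; the names `grid`, `picked` and `sliced` are illustrative.

```python
import numpy as np

grid = np.arange(32).reshape(8, 4)

picked = grid[[2, 3]]        # fancy indexing: a new, independent array
picked[0, 0] = -1
print(grid[2, 0])            # 8   (original unchanged)

sliced = grid[2:4]           # slicing: a view onto grid
sliced[0, 0] = -1
print(grid[2, 0])            # -1  (original changed)
```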

## Universal Functions

Universal functions (ufuncs) perform element-wise operations on an array quickly. As with vectorization and comprehensions, the goal here is to avoid explicit loops. Examples of universal functions are __np.sqrt()__ (square root) and __np.exp()__ (exponential).

In [69]:
arr_5 = np.arange(10)

In [70]:
arr_5

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [71]:
np.sqrt(arr_5)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

In [72]:
np.exp(arr_5)

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

![Slicing](ufunc_2.png)

![Slicing](ufunc_1.png)

In [73]:
ar_1 = np.arange(3)
ar_2 = np.arange(3,4,1)

In [74]:
np.add(ar_1,ar_2)

array([3, 4, 5])
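
In the cell above, np.arange(3,4,1) holds the single value 3, which is broadcast across ar_1. The sketch below shows the more common case of a binary ufunc applied to two arrays of equal length; the names `x` and `y` are illustrative.

```python
import numpy as np

x = np.arange(3)             # [0 1 2]
y = np.array([10, 20, 30])

print(np.add(x, y))          # [10 21 32]
print(x + y)                 # [10 21 32], the + operator gives the same result
print(np.maximum(x, 1))      # [1 1 2], element-wise maximum against a scalar
```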

### Mathematical and Statistical Methods

NumPy also comes with a useful set of methods that compute statistics over the entire array. It supports aggregations (also known as reductions) like sum, mean and std. For example:

In [76]:
ar_1.sum()

3

In [78]:
ar_1.mean()


1.0

In [77]:
np.mean(ar_1)

1.0
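
The same methods work on two-dimensional arrays. As a sketch that goes slightly beyond the text above, the optional axis argument restricts the reduction to columns (axis=0) or rows (axis=1); the array `m` is made up for illustration.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.sum())          # 21, over the entire array
print(m.mean())         # 3.5
print(m.sum(axis=0))    # [5 7 9], one sum per column
print(m.sum(axis=1))    # [ 6 15], one sum per row
print(m.std())          # standard deviation of all six values
```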

![Stats](Stat_methods.jpeg)

One last useful function that can be applied to a NumPy array is __unique()__:

In [81]:
names = np.array(['Bob','Carl','Ben','Bob','Bob'])
np.unique(names)

array(['Ben', 'Bob', 'Carl'], dtype='<U4')

### Expressing Conditional Logic as Array Operations

What follows is another way to vectorize what would otherwise be a few lines of slow code. By using the function __numpy.where__ we avoid looping with the expression _x if condition else y_. Assume we have the three arrays below:

In [79]:
arr1 = np.array([1,2,3,4,5])
arr2 = np.array([11,12,13,14,15])
cond = np.array([True,True,False,False,True])

Suppose we want to take a value from arr1 where cond is True and from arr2 otherwise. With a plain Python loop:

In [84]:
result = []
for i in range(len(cond)):
    # take the element from arr1 where cond is True, otherwise from arr2
    if cond[i]:
        result.append(arr1[i])
    else:
        result.append(arr2[i])
result

[1, 2, 13, 14, 5]

This can also be achieved in one line using the __np.where__ function:

In [87]:
result = np.where(cond,arr1,arr2)
result

array([ 1,  2, 13, 14,  5])

This is not only faster, it also works with multidimensional arrays.
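
As a small illustration of the multidimensional case, the sketch below replaces every non-positive entry of a hypothetical 2x2 array `grid` with 0, keeping the positive entries as they are.

```python
import numpy as np

grid = np.array([[ 1, -2],
                 [-3,  4]])

# Keep positive entries, replace everything else with 0.
clipped = np.where(grid > 0, grid, 0)
print(clipped)
# [[1 0]
#  [0 4]]
```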

Moreover, the second and third arguments don't have to be arrays; the function also accepts scalars.

In [88]:
result = np.where(cond,arr1,100)
result

array([  1,   2, 100, 100,   5])

In [89]:
result = np.where(cond,500,100)
result

array([500, 500, 100, 100, 500])

## Pseudorandom Number Generation

The np.random module, which we have already seen, supplements the built-in Python random module with functions that efficiently generate whole arrays of sample values from many kinds of probability distributions. For example, you can get a 2x3 array of samples from the standard normal distribution using __normal__:

In [84]:
ex = np.random.normal(size=(2,3))

In [85]:
ex

array([[-2.06919692,  1.81385599, -1.83052923],
       [ 1.9737368 ,  0.67379244,  0.10997746]])

We say that these are pseudorandom numbers because they are generated by an algorithm with deterministic behavior based on the seed of the random number generator. You can change NumPy's random number generation seed using np.random.seed:



In [88]:
np.random.seed(1234) 
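
The practical consequence of seeding is reproducibility: resetting the same seed replays the same sequence. A minimal sketch, with illustrative names `first` and `second`:

```python
import numpy as np

np.random.seed(1234)
first = np.random.randn(3)

np.random.seed(1234)         # reset to the same seed
second = np.random.randn(3)

print(np.array_equal(first, second))  # True
```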


The data generation functions in numpy.random use a global random seed. To avoid global state, you can use numpy.random.RandomState to create a random number generator isolated from others:



In [89]:
rst = np.random.RandomState(1234)

In [90]:
rst.randn(10)

array([ 0.47143516, -1.19097569,  1.43270697, -0.3126519 , -0.72058873,
        0.88716294,  0.85958841, -0.6365235 ,  0.01569637, -2.24268495])
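
The isolation can be demonstrated with a short sketch: two RandomState objects created with the same seed stay in step with each other even when the global generator is reseeded and used in between. The names `rng_a` and `rng_b` are illustrative.

```python
import numpy as np

rng_a = np.random.RandomState(1234)
rng_b = np.random.RandomState(1234)

np.random.seed(0)            # touching the global generator...
_ = np.random.randn(5)       # ...does not affect rng_a or rng_b

print(np.array_equal(rng_a.randn(4), rng_b.randn(4)))  # True
```

Newer NumPy releases also offer np.random.default_rng for creating isolated generators; it is not used in this course material.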

![random](random.jpeg)