## Introduction to NumPy

This workbook aims to provide a quick and in-depth introduction to `numpy`. 

In [1]:
import numpy as np

You may also see code that imports everything with `from numpy import *`.

This is not a good idea: the `numpy` namespace is large and contains a number of functions whose names conflict with built-in Python functions (such as `min` and `max`). As a best practice, alias the namespace as `np`, so that everything in `numpy` is accessible via the `np.<name>` syntax.
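To make the name clash concrete, here is a small sketch that compares the built-in `sum` with `np.sum` on identical arguments, without actually performing the star import. The variable names are illustrative only.

```python
import builtins
import numpy as np

values = [0, 1, 2, 3, 4]

# Built-in sum: the second argument is a *start* value added to the total.
builtin_total = builtins.sum(values, -1)   # -1 + 0 + 1 + 2 + 3 + 4

# np.sum: the second positional argument is the *axis* to reduce over.
numpy_total = np.sum(values, -1)           # 0 + 1 + 2 + 3 + 4 along the last axis

print(builtin_total)   # 9
print(numpy_total)     # 10
```

After a star import, the bare name `sum` would refer to NumPy's version, so a call like this would silently change its result.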


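NumPy arrays are also much faster than plain Python lists for element-wise work. The following cells time the same doubling operation on an array and on a list of one million integers.

In [2]: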
my_arr = np.arange(1000000)

my_list = list(range(1000000))

In [3]:
%time for _ in range(10): my_arr2 = my_arr * 2

CPU times: user 17.4 ms, sys: 6.32 ms, total: 23.7 ms
Wall time: 23.1 ms


In [4]:
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

CPU times: user 650 ms, sys: 156 ms, total: 806 ms
Wall time: 816 ms
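The `%time` magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used. The array size and repeat count below are arbitrary choices, and absolute timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 100_000
my_arr = np.arange(n)
my_list = list(range(n))

# Time 10 repetitions of each approach.
numpy_seconds = timeit.timeit(lambda: my_arr * 2, number=10)
list_seconds = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)

print(f"NumPy array : {numpy_seconds:.4f} s")
print(f"Python list : {list_seconds:.4f} s")

# Both approaches compute the same values.
results_match = (my_arr * 2).tolist() == [x * 2 for x in my_list]
print(results_match)   # True
```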


**Generating Random Data**

In [5]:
data = np.random.randn(2,3) #Sample from Standard Normal Distribution

In [6]:
data

array([[-0.02011543, -1.82746148,  0.15227316],
       [-0.10798547, -1.05172216,  0.03996201]])
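`np.random.randn` belongs to NumPy's legacy random interface. As a hedged alternative sketch, the newer `Generator` API (`np.random.default_rng`) draws the same kind of sample and accepts a seed so that the numbers are reproducible. The seed value `42` and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)          # seeded generator -> repeatable output
sample = rng.standard_normal((2, 3))     # standard normal, shape given as a tuple

# Re-creating the generator with the same seed reproduces the same sample.
repeat = np.random.default_rng(42).standard_normal((2, 3))

print(sample.shape)                      # (2, 3)
print(np.array_equal(sample, repeat))    # True
```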

In [7]:
data * 10  # batch mathematical operations can be performed directly. Element-wise multiplication.

array([[ -0.20115434, -18.27461476,   1.52273156],
       [ -1.07985473, -10.51722159,   0.3996201 ]])

In [8]:
data + data # Element-wise addition.

array([[-0.04023087, -3.65492295,  0.30454631],
       [-0.21597095, -2.10344432,  0.07992402]])

## `ndarray` Multi-dimensional Arrays

An `ndarray` is a generic multidimensional container for homogeneous data. In simple terms, all the elements must be of the same type: for example, all ints or all floats.

Every array has a `shape` property, which is a tuple describing the dimensions. 

In [9]:
data.shape #2 rows, 3 columns

(2, 3)

The `dtype` property describes the data type of the array.

In [10]:
data.dtype # in this case it is float64

dtype('float64')

### Creating `ndarray`

To create an array the `array` function can be used. A simple list can be converted into `ndarray`.

In [12]:
data1 = [1,2,3,4,5,6]

arr1 = np.array(data1)

print('Array: {}'.format(arr1))
print('Type : {}'.format(type(arr1)))

Array: [1 2 3 4 5 6]
Type : <class 'numpy.ndarray'>


Nested sequences of equal length will be converted into multi-dimensional arrays.

In [11]:
data2 = [[1,2,3,4,5],[6,7,8,9,10]]

arr2 = np.array(data2)

print('Array: {}'.format(arr2))
print('Type: {}'.format(type(arr2)))
print('Type (alternative): {}'.format(arr2.dtype))
print('Dimension: {}'.format(arr2.ndim))
print('Shape: {}'.format(arr2.shape))

Array: [[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
Type: <class 'numpy.ndarray'>
Type (alternative): int64
Dimension: 2
Shape: (2, 5)


In the previous example, the nested lists were of the same length. What happens if that is not the case?

In [13]:
data3 = [[1,2,3,4,5],[6,7,8,9]]

arr3 = np.array(data3)

print('Array: {}'.format(arr3))
print('Type: {}'.format(type(arr3)))
print('Type (alternative): {}'.format(arr3.dtype))
print('Dimension: {}'.format(arr3.ndim))
print('Shape: {}'.format(arr3.shape))


Array: [list([1, 2, 3, 4, 5]) list([6, 7, 8, 9])]
Type: <class 'numpy.ndarray'>
Type (alternative): object
Dimension: 1
Shape: (2,)


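**Notice that because the lists differ in length, NumPy created a one-dimensional array of list objects (dtype `object`) instead of a two-dimensional array.** The output above comes from an older NumPy release; since NumPy 1.24 the same call raises a `ValueError` unless `dtype=object` is passed explicitly.

> **Important:** When creating a NumPy array from nested lists, check that all the inner lists have the same length.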
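A minimal sketch of that length check, followed by the explicit way to build a ragged container when one is really intended. The names `rows`, `row_lengths`, and `is_rectangular` are hypothetical helpers, not part of NumPy.

```python
import numpy as np

rows = [[1, 2, 3, 4, 5], [6, 7, 8, 9]]

# A nested list is rectangular only if every row has the same length.
row_lengths = {len(row) for row in rows}
is_rectangular = len(row_lengths) == 1
print(is_rectangular)                 # False

# If a ragged container is really intended, request an object array explicitly.
ragged = np.array(rows, dtype=object)
print(ragged.shape, ragged.dtype)     # (2,) object
```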



Additionally, you can create initialized arrays using helper methods like `zeros`, `ones` and `empty`.

1. `zeros` : Initializes an array of requested shape with all zeros.

2. `ones`  : Initializes an array of requested shape with all ones.

3. `empty` : Allocates an array of requested shape **without initializing its values**; the contents are whatever happened to be in memory.

To create higher-dimensional arrays, pass in a tuple with the shape specification.

Let's look at an illustration.

In [14]:
z1 = np.zeros(10)
print('Array : {}'.format(z1))
print('Dimension: {}'.format(z1.ndim))
print('Shape: {}'.format(z1.shape))

Array : [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Dimension: 1
Shape: (10,)


In [15]:
z2 = np.zeros((2,3))
print('Array : {}'.format(z2))
print('Dimension: {}'.format(z2.ndim))
print('Shape: {}'.format(z2.shape))

Array : [[0. 0. 0.]
 [0. 0. 0.]]
Dimension: 2
Shape: (2, 3)


In [16]:
o1 = np.ones((2,3))
print('Array : {}'.format(o1))
print('Dimension: {}'.format(o1.ndim))
print('Shape: {}'.format(o1.shape))

Array : [[1. 1. 1.]
 [1. 1. 1.]]
Dimension: 2
Shape: (2, 3)


In [17]:
e1 = np.empty((2,3))
print('Array : {}'.format(e1))
print('Dimension: {}'.format(e1.ndim))
print('Shape: {}'.format(e1.shape))

Array : [[1. 1. 1.]
 [1. 1. 1.]]
Dimension: 2
Shape: (2, 3)


**Note:** With `np.empty`, the values are not guaranteed to be 0 or 1. The array is uninitialized, so it may contain arbitrary (garbage) values.
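Because the contents of an `np.empty` array are unspecified, a safe pattern is to write every element before reading any. The sketch below shows that pattern next to `np.full`, which allocates and fills in one step; the fill value `7.5` is an arbitrary choice.

```python
import numpy as np

# np.empty only allocates memory; its contents are unspecified until written.
buffer = np.empty((2, 3))
buffer[:] = 7.5                 # overwrite every element before reading

# np.full allocates and fills in one step.
filled = np.full((2, 3), 7.5)

print(np.array_equal(buffer, filled))   # True
```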

`arange` is an array-valued version of Python's built-in `range` function.

In [18]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

**Array Creation Functions**

| Function | Description |
|----------|-------------|
| array    | Convert input data (list, tuple, array, or other sequence type) to an ndarray |
| asarray  | Convert input to ndarray, but do not copy if it is already an ndarray |
| arange   | Like the built-in `range`, but returns an ndarray |
| ones     | Array of 1s with the given shape and dtype |
| ones_like | Create ndarray of ones matching the shape and dtype of the input array |
| zeros    | Array of 0s with the given shape and dtype |
| zeros_like | Create ndarray of zeros matching the shape and dtype of the input array |
| empty    | Uninitialized array (garbage data) with the given shape and dtype |
| empty_like | Create uninitialized ndarray (garbage data) matching the shape and dtype of the input array |
| full    | Array with the given shape and dtype, with every element set to the "fill value" |
| full_like | Create ndarray filled with the "fill value", matching the shape and dtype of the input array |
| eye, identity | Create a square N x N identity matrix (1s on the diagonal, 0s elsewhere) |
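A short sketch exercising a few of the functions from the table. The variable names are illustrative, and the behaviour shown for `np.array` and `np.asarray` assumes their default arguments.

```python
import numpy as np

base = np.array([[1, 2, 3], [4, 5, 6]])

same_object = np.asarray(base)     # already an ndarray -> returned as-is
new_copy = np.array(base)          # np.array copies by default

zeros = np.zeros_like(base)        # same shape and dtype as base
sevens = np.full((2, 2), 7)        # every element set to the fill value
identity = np.eye(3)               # 3 x 3, ones on the diagonal

print(same_object is base)         # True
print(new_copy is base)            # False
print(zeros.shape, zeros.dtype == base.dtype)   # (2, 3) True
print(sevens.tolist())             # [[7, 7], [7, 7]]
print(identity.tolist())           # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```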



## Data Types for ndarrays

The *data type* or *dtype* is a special object containing the information (or metadata) about the data in the array. The NumPy data types include:

int8, uint8, int16, uint16, int32, uint32, int64, uint64, float16, float32, float64, float128, complex64, complex128, complex256, bool, object, bytes_ (formerly string_), str_ (formerly unicode_)

The extended-precision types float128 and complex256 are only available on some platforms, and the old `string_` and `unicode_` aliases were removed in NumPy 2.0.

Explicit type casting of an array can be done using the `astype` method.


In [19]:
arr = np.array([1,2,3,4,5])
print('Initial DataType : {}'.format(arr.dtype))
float_arr = arr.astype(np.float64)
print('Converted DataType : {}'.format(float_arr.dtype))

Initial DataType : int64
Converted DataType : float64


The above example converts integers to floating point. **Note:** When casting from floating point to integer, the fractional part is truncated. This is typical *down-casting* behaviour, which is lossy (irreversible): you lose the information in the fractional part, and it cannot be recovered by converting back to float. Whether that loss is acceptable depends on the problem being solved.

In [20]:
arr = np.array([3.7,2.8, -0.5, 0.5, 12.1, 11.4])
print('Initial DataType : {}'.format(arr.dtype))
int_arr = arr.astype(np.int64) # np.int was removed in NumPy 1.24; use np.int64 or the built-in int
print('Converted Array: {}'.format(int_arr)) # Note there is no rounding.

Initial DataType : float64
Converted Array: [ 3  2  0  0 12 11]


In [21]:
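# np.string_ and np.float were removed from recent NumPy releases; np.bytes_ and np.float64 are the equivalents.
numeric_string = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
numeric_string.astype(np.float64)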

array([ 1.25, -9.6 , 42.  ])

**Note:** Calling `astype` *always* creates a new array (a copy of the data), even if the new dtype is the same as the old dtype.
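The two points above, truncation on float-to-int casts and the copy made by `astype`, can be checked with a short sketch. Using `np.rint` to round before casting is one possible approach when rounding is wanted; note that it rounds halves to the nearest even integer.

```python
import numpy as np

arr = np.array([3.7, 2.8, -0.5, 0.5, 12.1, 11.4])

truncated = arr.astype(np.int64)            # fractional part dropped
rounded = np.rint(arr).astype(np.int64)     # round to nearest integer first

print(truncated.tolist())   # [3, 2, 0, 0, 12, 11]
print(rounded.tolist())     # [4, 3, 0, 0, 12, 11]

# astype returns a new array by default, even for an unchanged dtype.
same_dtype = arr.astype(arr.dtype)
print(same_dtype is arr)                    # False
print(np.shares_memory(same_dtype, arr))    # False
```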

## Arithmetic with NumPy Arrays

Batch operations can be performed on data without writing any `for` loops. This is called *vectorization*. Arithmetic between arrays of equal shape is applied element-wise.

In [22]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

array([[1., 2., 3.],
       [4., 5., 6.]])

In [24]:
arr * arr #Element-wise operations or vectorization

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

In [23]:
arr - arr

array([[0., 0., 0.],
       [0., 0., 0.]])
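As a small sketch of the equal-shape requirement, the example below first tries to add two arrays whose shapes do not match, then confirms that the loop-free multiplication gives the same values as an explicit nested loop. The names `a`, `b`, and `looped` are illustrative.

```python
import numpy as np

a = np.ones((2, 3))
b = np.ones((3, 2))

try:
    a + b                      # shapes (2, 3) and (3, 2) are incompatible
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print("ValueError:", exc)

# The vectorized product matches the explicit loop version.
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
looped = [[x * x for x in row] for row in arr.tolist()]
print((arr * arr).tolist() == looped)   # True
```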

**Scalar Operations**

In [25]:
1/arr

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

In [26]:
arr ** 0.5 

array([[1.        , 1.41421356, 1.73205081],
       [2.        , 2.23606798, 2.44948974]])

Comparisons between ndarrays of the same shape result in a boolean ndarray of that shape.

In [27]:
arr2 = np.array([[0., 4., 1.],[7., 2., 12.]])
arr2

array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

In [28]:
arr2 > arr

array([[False,  True, False],
       [ True, False,  True]])
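The boolean result of a comparison is an ordinary array, so it can be summarized directly. A brief sketch using the same two arrays; the name `greater` is illustrative.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

greater = arr2 > arr            # element-wise comparison

print(greater.shape)            # (2, 3)
print(greater.dtype)            # bool
print(int(greater.sum()))       # 3  (True counts as 1)
print(bool(greater.any()))      # True  (at least one element is larger)
print(bool(greater.all()))      # False (not every element is larger)
```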

## Basic Indexing and Slicing

In [29]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [30]:
arr[5] #6th element in the array

5

In [31]:
arr[5:8]

array([5, 6, 7])

In [32]:
arr[5:8] = 12 # This broadcasts the value to the whole selection and updates the array in place.

In [33]:
arr # Notice the index positions 5:8 are updated in original array.

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

*Note:* Array slices are *views*, the data is not copied and any modifications to the view will be reflected in the source array. 

In [34]:
arr_slice = arr[5:8]
arr_slice

array([12, 12, 12])

In [35]:
arr_slice[1] = 12345
arr

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])

In [36]:
arr_slice[:] = 64 # all indexes in slice will be assigned 64
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array - for example, `arr[5:8].copy()`.
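A minimal sketch contrasting a view with an explicit copy. `np.shares_memory` is used here only to confirm which of the two still points at the original data.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]             # slice -> view on arr's data
copy = arr[5:8].copy()      # explicit copy -> independent data

view[:] = 64                # changes arr
copy[:] = -1                # does not change arr

print(arr.tolist())                   # [0, 1, 2, 3, 4, 64, 64, 64, 8, 9]
print(np.shares_memory(arr, view))    # True
print(np.shares_memory(arr, copy))    # False
```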

In [62]:
arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])

In [63]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [64]:
arr2d[2]

array([7, 8, 9])

In [65]:
arr2d[0][2]

3

In [66]:
arr2d[0, 2] # (rows, column) same as above and concise. Axis 0 = "Rows", Axis 1 = "Columns".

3

### Indexing with Slices

In NumPy, indexing with slices works much like slicing standard Python lists.

In [67]:
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

In [68]:
arr[1:6]

array([ 1,  2,  3,  4, 64])

In [69]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [70]:
arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

Multiple slices can be passed just like multiple indexes, and each slice applies to its respective axis.

In [71]:
arr2d[:2, ]

array([[1, 2, 3],
       [4, 5, 6]])

In [72]:
arr2d[:2, 1:]

array([[2, 3],
       [5, 6]])
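Integer indexes and slices can also be mixed. As a sketch of the difference: an integer index removes its axis from the result, while a slice keeps it, even when the slice selects a single row.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row_1d = arr2d[1, :2]       # integer index on axis 0 drops that axis
row_2d = arr2d[1:2, :2]     # slice on axis 0 keeps it
first_col = arr2d[:, 0]     # bare ':' selects the whole axis

print(row_1d.tolist(), row_1d.shape)    # [4, 5] (2,)
print(row_2d.tolist(), row_2d.shape)    # [[4, 5]] (1, 2)
print(first_col.tolist())               # [1, 4, 7]
```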

Assignments also work with slices.

In [73]:
arr2d[:2, 1:] = 0
arr2d

array([[1, 0, 0],
       [4, 0, 0],
       [7, 8, 9]])

## Boolean Indexing

In [74]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7,4)

In [75]:
names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

In [76]:
data

array([[ 0.40430751, -0.22607386, -1.71664478, -1.20199625],
       [ 1.2242563 ,  0.75351702, -0.08157072, -0.49116335],
       [ 0.1642233 ,  0.11522248,  0.44212813,  0.52408164],
       [-1.01723423,  0.08714652,  0.77233403, -0.03337033],
       [-0.34905077,  2.00457051, -1.2266707 ,  0.44645096],
       [ 0.58761652,  1.01370728, -0.99481037, -0.58019532],
       [-0.25351247, -0.30593909, -0.49698721,  0.47693796]])

Suppose each name corresponds to a row in the `data` array and we want to select all the rows whose name is 'Bob'. Like arithmetic operations, the `==` comparison is vectorized.

In [77]:
names == 'Bob' #yields array of booleans

array([ True, False, False,  True, False, False, False])

This boolean array can be passed when indexing the array. *Note*: The boolean array *must* be of the same length as the array axis it is indexing. Current NumPy versions raise an `IndexError` when the lengths do not match.

In [78]:
data[ names== 'Bob']

array([[ 0.40430751, -0.22607386, -1.71664478, -1.20199625],
       [-1.01723423,  0.08714652,  0.77233403, -0.03337033]])
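The length requirement on the boolean array can be seen with a small sketch. It assumes a recent NumPy version, where a mask of the wrong length is rejected with an `IndexError`; `short_mask` is a hypothetical name.

```python
import numpy as np

data = np.zeros((7, 4))
short_mask = np.array([True, False, True])   # length 3, but axis 0 has length 7

try:
    data[short_mask]
    mismatch_raised = False
except IndexError as exc:
    mismatch_raised = True
    print("IndexError:", exc)

# A mask with one entry per row works as expected.
full_mask = np.array([True, False, False, True, False, False, False])
print(data[full_mask].shape)    # (2, 4)
```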

Filtering rows with a boolean array and slicing columns can be combined in a single indexing expression.

In [79]:
data[names == 'Bob', 2:]

array([[-1.71664478, -1.20199625],
       [ 0.77233403, -0.03337033]])

In [80]:
names != 'Bob'

array([False,  True,  True, False,  True,  True,  True])

In [81]:
data[~(names == 'Bob')] # Notice the ~ operator which is used for negation.

array([[ 1.2242563 ,  0.75351702, -0.08157072, -0.49116335],
       [ 0.1642233 ,  0.11522248,  0.44212813,  0.52408164],
       [-0.34905077,  2.00457051, -1.2266707 ,  0.44645096],
       [ 0.58761652,  1.01370728, -0.99481037, -0.58019532],
       [-0.25351247, -0.30593909, -0.49698721,  0.47693796]])

Multiple names can be selected by combining conditions with the `&` (and) and `|` (or) operators. Note that the standard Python keywords `and` and `or` do not work with boolean arrays.

In [82]:
mask = (names == 'Bob') | (names == 'Will')
mask

array([ True, False,  True,  True,  True, False, False])

In [83]:
data[mask]

array([[ 0.40430751, -0.22607386, -1.71664478, -1.20199625],
       [ 0.1642233 ,  0.11522248,  0.44212813,  0.52408164],
       [-1.01723423,  0.08714652,  0.77233403, -0.03337033],
       [-0.34905077,  2.00457051, -1.2266707 ,  0.44645096]])
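A short sketch of why the keyword form fails: `or` asks for a single truth value for the whole array, which NumPy refuses to guess. The `np.isin` call is shown as one possible alternative for testing membership in several names at once.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

# Element-wise OR; the parentheses are needed because | binds tighter than ==.
mask = (names == 'Bob') | (names == 'Will')

try:
    (names == 'Bob') or (names == 'Will')
    or_raised = False
except ValueError as exc:
    or_raised = True
    print("ValueError:", exc)

# np.isin builds the same mask from a list of accepted values.
same_mask = np.isin(names, ['Bob', 'Will'])
print(np.array_equal(mask, same_mask))   # True
```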

Setting values with boolean arrays.

In [84]:
data[data < 0] = 0

In [85]:
data

array([[0.40430751, 0.        , 0.        , 0.        ],
       [1.2242563 , 0.75351702, 0.        , 0.        ],
       [0.1642233 , 0.11522248, 0.44212813, 0.52408164],
       [0.        , 0.08714652, 0.77233403, 0.        ],
       [0.        , 2.00457051, 0.        , 0.44645096],
       [0.58761652, 1.01370728, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.47693796]])

In [86]:
data[ names != 'Joe'] = 7
data

array([[7.        , 7.        , 7.        , 7.        ],
       [1.2242563 , 0.75351702, 0.        , 0.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [0.58761652, 1.01370728, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.47693796]])
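Assignments through a boolean mask modify the array in place. When the original values need to be kept, one possible alternative is `np.where`, which builds a new array instead. The sketch below uses a seeded generator (seed `0`, an arbitrary choice) so that it is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((7, 4))
original = data.copy()

# Take 0 where the condition holds and the existing value elsewhere.
clipped = np.where(data < 0, 0, data)

print(np.array_equal(data, original))    # True  (data is untouched)
print(bool((clipped >= 0).all()))        # True  (no negatives remain)
```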