# A practical introduction to data wrangling using NumPy

## NumPy

The NumPy package focuses on numerical calculations. The heart of the package is the **ndarray**, or n-dimensional array. This object is a multidimensional, homogeneous array with a fixed number of items.

- All data items are of the same type, described by a NumPy **dtype** object
- Once the array is created, its size is fixed
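
The two properties above can be checked directly. The following is a minimal sketch; the variable names (`mixed`, `grid`, `longer`) are illustrative and not part of the original notebook.

```python
import numpy as np

# A list that mixes ints and a float is stored with one common dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64

# shape and size are set when the array is created.
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.shape)    # (2, 3)
print(grid.size)     # 6

# np.append does not grow the array in place; it returns a new array.
longer = np.append(mixed, 4.0)
print(mixed.size, longer.size)   # 3 4
```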

### Subject overview
- [Create an array](#Create-an-array)
- [Generate an array](#Generate-an-array)
- [Basic operations](#Basic-operations)
- [Indexing](#Indexing)
- [Slicing](#Slicing)


In [3]:
import numpy as np

<a id="section_create"></a>

## Create an array
[Back to index](#Subject-overview)

Create a simple one-dimensional NumPy array from a Python list.

In [27]:
np.array([1., 2., 3.])

array([ 1.,  2.,  3.])

Create a simple 2 by 2 NumPy array.

In [28]:
np.array([[1, 2], [3, 4]])

array([[1, 2],
       [3, 4]])

Create a simple array and specify the data type.

Some of the types to choose from:
- bool
- int, int8, int16, int32, int64
- uint8, uint16, uint32, uint64
- float, float16, float32, float64
- complex
- str

In [14]:
c = np.array(((5, 6), (7, 8)), dtype=str)
c

array([['5', '6'],
       ['7', '8']], 
      dtype='<U1')
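
The same `dtype` argument works for the numeric types in the list. As a small sketch (the names `f` and `i` are made up for this example), the block below requests 32-bit floats and then converts them with `astype()`, which returns a new array.

```python
import numpy as np

# Request 32-bit floats explicitly instead of the default float64.
f = np.array([[5, 6], [7, 8]], dtype=np.float32)
print(f.dtype)      # float32
print(f.itemsize)   # 4 (bytes per element)

# astype() returns a converted copy; the original keeps its dtype.
i = f.astype(np.int16)
print(i.dtype)      # int16
print(f.dtype)      # float32
```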

## Generate an array
[Back to index](#Subject-overview)

Create an array of zeros or ones.

In [26]:
np.zeros((3, 3))

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [25]:
np.ones((6, 2))

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

Create an array from a generated sequence.

In [24]:
np.arange(2, 10)

array([2, 3, 4, 5, 6, 7, 8, 9])

In [23]:
np.arange(0, 10, 2)

array([0, 2, 4, 6, 8])

A function similar to *arange()* is *linspace()*, which takes the number of elements required instead of the step size.

In [22]:
np.linspace(0, 6, 5)

array([ 0. ,  1.5,  3. ,  4.5,  6. ])
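
To make the difference concrete, this sketch builds comparable sequences with both functions. Note that `arange()` excludes the stop value, while `linspace()` includes it unless `endpoint=False` is passed. The variable names are illustrative.

```python
import numpy as np

# arange: start, stop (excluded), step size
by_step = np.arange(0, 6, 1.5)
print(by_step.tolist())     # [0.0, 1.5, 3.0, 4.5]

# linspace: start, stop (included by default), number of elements
by_count = np.linspace(0, 6, 5)
print(by_count.tolist())    # [0.0, 1.5, 3.0, 4.5, 6.0]

# endpoint=False leaves the stop value out
half_open = np.linspace(0, 6, 4, endpoint=False)
print(half_open.tolist())   # [0.0, 1.5, 3.0, 4.5]
```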

These are just one-dimensional arrays. To create multidimensional arrays, we need to *reshape()* them.

In [21]:
np.arange(0, 12).reshape((3, 4))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
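
A short sketch of two details of `reshape()` that are easy to trip over: one dimension can be given as `-1` so NumPy infers it, and the total number of elements has to stay the same. The names `flat` and `m` are illustrative.

```python
import numpy as np

flat = np.arange(12)

# -1 asks NumPy to infer this dimension: 12 elements / 3 rows = 4 columns.
m = flat.reshape((3, -1))
print(m.shape)   # (3, 4)

# The element count must match, otherwise reshape raises a ValueError.
try:
    flat.reshape((5, 3))
except ValueError as err:
    print("cannot reshape:", err)
```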

Or generate random values, drawn uniformly from the interval [0, 1).

In [29]:
np.random.random(3)

array([ 0.42485936,  0.24004095,  0.08131715])

In [31]:
np.random.random((4, 5))

array([[ 0.06459895,  0.84031473,  0.68726673,  0.47809327,  0.98323487],
       [ 0.68924383,  0.40895858,  0.89195416,  0.68001887,  0.54300379],
       [ 0.98599101,  0.13860767,  0.43204627,  0.85974912,  0.14087153],
       [ 0.91331493,  0.63871075,  0.86657309,  0.05456826,  0.63491184]])
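
Random output differs on every run, which makes notebooks hard to reproduce. One possible approach, sketched below, is a seeded generator created with `np.random.default_rng()` (available in NumPy 1.17 and later). The seed value 42 is an arbitrary choice for this example.

```python
import numpy as np

# Two generators created with the same seed produce the same values.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

x = rng_a.random((4, 5))
y = rng_b.random((4, 5))

print(x.shape)                          # (4, 5)
print(np.array_equal(x, y))             # True
print(x.min() >= 0.0, x.max() < 1.0)    # True True
```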

## Basic operations
[Back to index](#Subject-overview)

Arithmetic operations are performed elementwise. The same holds for universal functions like **.sqrt()** or **.sin()**. When a scalar is used, it is broadcast over all elements.

In [33]:
a = np.arange(1, 5)
a

array([1, 2, 3, 4])

In [34]:
a + 2

array([3, 4, 5, 6])

In [35]:
a * 2

array([2, 4, 6, 8])

In [38]:
b = np.arange(5, 9)
b

array([5, 6, 7, 8])

In [39]:
a + b

array([ 6,  8, 10, 12])

In [40]:
a * b

array([ 5, 12, 21, 32])
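
Broadcasting is not limited to scalars. As a minimal sketch (with illustrative names `m` and `row`), a one-dimensional array whose length matches the last axis of a two-dimensional array is reused for every row.

```python
import numpy as np

m = np.arange(6).reshape((2, 3))   # [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])

# The scalar is applied to every element.
doubled = m * 2
print(doubled.tolist())   # [[0, 2, 4], [6, 8, 10]]

# The row of length 3 is matched against the last axis and added to each row.
shifted = m + row
print(shifted.tolist())   # [[10, 21, 32], [13, 24, 35]]
```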

A universal function is applied to each element.

In [41]:
np.sqrt(a)

array([ 1.        ,  1.41421356,  1.73205081,  2.        ])

The **dot** product is also supported. For two-dimensional arrays, it performs matrix multiplication.

In [44]:
c = np.arange(0, 9).reshape((3, 3))
c

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [46]:
d = np.arange(10, 22).reshape((3, 4))
d

array([[10, 11, 12, 13],
       [14, 15, 16, 17],
       [18, 19, 20, 21]])

In [47]:
c.dot(d)

array([[ 50,  53,  56,  59],
       [176, 188, 200, 212],
       [302, 323, 344, 365]])
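
For two-dimensional arrays, the `@` operator gives the same result as `.dot()`. The sketch below rebuilds `c` and `d` so it runs on its own, and also shows that the inner dimensions must match.

```python
import numpy as np

c = np.arange(0, 9).reshape((3, 3))
d = np.arange(10, 22).reshape((3, 4))

# (3, 3) @ (3, 4) -> (3, 4)
p = c @ d
print(p.shape)                        # (3, 4)
print(np.array_equal(p, c.dot(d)))    # True

# (3, 4) @ (3, 3) fails: the inner dimensions 4 and 3 do not match.
try:
    d @ c
except ValueError as err:
    print("shape mismatch:", err)
```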

**Aggregate** functions work on the whole array unless told otherwise. Use the **axis** parameter to choose the axis along which the aggregation is performed.

In [49]:
np.sum(c)

36

Sum per column.

In [50]:
np.sum(c, axis=0)

array([ 9, 12, 15])

Sum per row.

In [51]:
np.sum(c, axis=1)

array([ 3, 12, 21])
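
The **axis** parameter works the same way for other aggregates. The following sketch rebuilds `c` and uses `max`, `mean`, and the `keepdims` option, which keeps the reduced axis with length 1. The result names are illustrative.

```python
import numpy as np

c = np.arange(0, 9).reshape((3, 3))

# axis=0 collapses the rows: one result per column.
col_max = c.max(axis=0)
print(col_max.tolist())    # [6, 7, 8]

# axis=1 collapses the columns: one result per row.
row_mean = c.mean(axis=1)
print(row_mean.tolist())   # [1.0, 4.0, 7.0]

# keepdims=True keeps the result two-dimensional, here with shape (3, 1).
row_sum = c.sum(axis=1, keepdims=True)
print(row_sum.shape)       # (3, 1)
print(row_sum.tolist())    # [[3], [12], [21]]
```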

## Indexing
[Back to index](#Subject-overview)

Indexing refers to retrieving specific data from the array. An index can be seen as the position, per dimension, at which a data item is stored. NumPy indexes start at zero (0), so the first position is 0.

You can also use negative indexes. These start at the last item and move back towards the start.

In [56]:
a = np.arange(0, 20, 2)
a

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [57]:
a[1]

2

In [58]:
a[-2]

16
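
As a small sketch of how negative indexes map onto positive ones: for an array of length `n`, index `-k` refers to the same item as index `n - k`. The array is rebuilt here so the block runs on its own.

```python
import numpy as np

a = np.arange(0, 20, 2)
n = len(a)   # 10

print(a[-1] == a[n - 1])   # True, the last item
print(a[-2] == a[n - 2])   # True
print(a[-n] == a[0])       # True, the first item

# Indexes outside the valid range raise an IndexError.
try:
    a[n]
except IndexError as err:
    print("out of range:", err)
```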

Selecting multiple items at once can be achieved by passing a list or array of indexes.

In [59]:
a[[2, 4, 6]]

array([ 4,  8, 12])

For multidimensional arrays, pass one index per axis.

In [60]:
b = np.linspace(10, 20, 9).reshape((3, 3))
b

array([[ 10.  ,  11.25,  12.5 ],
       [ 13.75,  15.  ,  16.25],
       [ 17.5 ,  18.75,  20.  ]])

In [61]:
b[2, 1]

18.75

To get multiple items from a multidimensional array, pass one array of indexes per axis. The arrays are paired up position by position, so the first item is taken from row 1, column 2.

In [72]:
b[[1, 1, 0], [2, 1, 2]]

array([ 16.25,  15.  ,  12.5 ])
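
The pairing of the index arrays can be spelled out with an explicit loop. This is only a sketch to show the semantics; `rows`, `cols`, `picked`, and `by_hand` are illustrative names.

```python
import numpy as np

b = np.linspace(10, 20, 9).reshape((3, 3))

rows = [1, 1, 0]
cols = [2, 1, 2]

# One index array per axis: the items at (1, 2), (1, 1) and (0, 2).
picked = b[rows, cols]

# The same selection written as a loop over (row, column) pairs.
by_hand = np.array([b[r, c] for r, c in zip(rows, cols)])

print(picked.shape)                      # (3,)
print(np.array_equal(picked, by_hand))   # True
```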

## Slicing
[Back to index](#Subject-overview)

Slicing is similar to indexing, but it takes a contiguous portion out of the array instead of individual items. To specify a range of indexes, put a ':' (colon) between the start and end index. The end index itself is not included.

In [73]:
a

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [74]:
a[2:4]

array([4, 6])

In [75]:
a[:4]

array([0, 2, 4, 6])

Take a subset, but only every n-th item. The step size is specified as a third parameter.

In [77]:
a[:4:2]

array([0, 4])
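
A slice `start:stop:step` visits the same positions as Python's `range(start, stop, step)`. The sketch below checks this for one arbitrary choice of values. The names `start`, `stop`, `step`, `sliced`, and `indexes` are illustrative.

```python
import numpy as np

a = np.arange(0, 20, 2)
start, stop, step = 1, 8, 3

sliced = a[start:stop:step]
indexes = list(range(start, stop, step))

print(indexes)           # [1, 4, 7]
print(sliced.tolist())   # [2, 8, 14]

# Picking the items one by one gives the same result.
print(sliced.tolist() == [int(a[i]) for i in indexes])   # True
```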

In [78]:
b

array([[ 10.  ,  11.25,  12.5 ],
       [ 13.75,  15.  ,  16.25],
       [ 17.5 ,  18.75,  20.  ]])

To take one row, use only the ':' for the column axis. Without a start and end index, the range runs from the first to the last item.

In [79]:
b[1, :]

array([ 13.75,  15.  ,  16.25])

Take one column.

In [80]:
b[:, 1]

array([ 11.25,  15.  ,  18.75])
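
Selecting a row or column with a single index drops that axis, so the result is one-dimensional. If a two-dimensional result is needed, a slice of length one keeps the axis. A minimal sketch, with illustrative names:

```python
import numpy as np

b = np.linspace(10, 20, 9).reshape((3, 3))

row = b[1, :]        # integer index: the row axis is dropped
row_2d = b[1:2, :]   # slice of length one: the row axis is kept
print(row.shape)     # (3,)
print(row_2d.shape)  # (1, 3)

col = b[:, 1]
col_2d = b[:, 1:2]
print(col.shape)     # (3,)
print(col_2d.shape)  # (3, 1)
```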

To apply a function over each row or column, one could write for-loops, but NumPy provides its own function for that: **apply_along_axis(func1d, axis, arr)**. With `axis=0` the function receives each column; with `axis=1` it receives each row.

In [82]:
np.apply_along_axis(np.mean, axis=0, arr=b)

array([ 13.75,  15.  ,  16.25])

In [83]:
np.apply_along_axis(np.mean, axis=1, arr=b)

array([ 11.25,  15.  ,  18.75])

In [86]:
def divide(x):
    return x/2

In [87]:
np.apply_along_axis(divide, axis=0, arr=b)

array([[  5.   ,   5.625,   6.25 ],
       [  6.875,   7.5  ,   8.125],
       [  8.75 ,   9.375,  10.   ]])
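
For functions that NumPy already provides, the built-in **axis** parameter or plain elementwise arithmetic gives the same result as `apply_along_axis`. The sketch below compares both routes; the result names are illustrative.

```python
import numpy as np

b = np.linspace(10, 20, 9).reshape((3, 3))

# Column means: apply_along_axis versus the axis parameter of np.mean.
via_apply = np.apply_along_axis(np.mean, axis=0, arr=b)
via_axis = np.mean(b, axis=0)
print(np.allclose(via_apply, via_axis))   # True

# Halving every item: apply_along_axis versus elementwise division.
halved_apply = np.apply_along_axis(lambda x: x / 2, axis=0, arr=b)
halved_direct = b / 2
print(np.allclose(halved_apply, halved_direct))   # True
```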

In [88]:
z = np.arange(0,9).reshape((3,3))
z

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [6]:
type(a)

numpy.ndarray

In [7]:
b.dtype

dtype('float64')