# 2. Data Wrangling

A crash course in data generation, handling, manipulation, transformation and preparation. These steps pave the way for further analysis, statistical or visual (explored in future workshops), or for training predictive models. We assume the data is already cleaned and structured rather than raw (cleaning is briefly explored in future workshops).

Why data manipulation in Python? The language is well suited for this task: [the JetBrains 2019 survey](https://www.jetbrains.com/lp/devecosystem-2019/python/) showed that data analysis is the most frequent use of Python:

<img src="https://i.imgur.com/aA19E1Q.png" width=600></img>

## 1a. NumPy Arrays

Why NumPy and Pandas? They are the most popular data science libraries, [the same survey](https://www.jetbrains.com/lp/devecosystem-2019/python/) shows:

<img src="https://i.imgur.com/PgSfnn1.png" width=600></img>

NumPy offers powerful array objects. It is part of the de-facto [Python ecosystem](https://www.scipy.org) for mathematics, science and engineering, and sits at the foundation of most scientific computation libraries.

### Table of contents

- NumPy Arrays
  - Creation
  - Properties
    - Shape
    - Data Type
  - Generation
    - Randomness
  - Accessing
    - Iteration
  - Copying
  - Array Operations
    - Arithmetic
    - Logical
    - Reshaping
    - Broadcasting
    - Masking
  - Extension

- Further reading

NumPy arrays are homogeneous containers (all elements have the same type), usually for numbers, that are indexed by integers. They are similar to lists but offer much more functionality and better performance.

In [1]:
import numpy as np  # import the package into our namespace, under the usual name `np`

### Array Creation

Create an array from a regular `list` object:

In [2]:
squares = np.array([0, 1, 4, 9, 16, 25, 36, 49])

In [3]:
squares

array([ 0,  1,  4,  9, 16, 25, 36, 49])

Create a 2D array/a matrix:

In [4]:
m = np.array([
    [5, 2, 3],
    [4, 5, 1],
    [7, 1, 2],
    [6, 2, 9],
])

In [5]:
m

array([[5, 2, 3],
       [4, 5, 1],
       [7, 1, 2],
       [6, 2, 9]])

Arbitrarily many dimensions:

In [6]:
# pixels, which are 3-dimensional points:
R = [1, 0, 0]  # red
B = [0, 0, 1]  # blue
W = [1, 1, 1]  # white

In [7]:
image = np.array([
    [B, B, R, R],
    [B, B, W, W],
    [R, R, R, R],
    [W, W, W, W],
    [R, R, R, R],
    [W, W, W, W],
])

In [8]:
image

array([[[0, 0, 1],
        [0, 0, 1],
        [1, 0, 0],
        [1, 0, 0]],

       [[0, 0, 1],
        [0, 0, 1],
        [1, 1, 1],
        [1, 1, 1]],

       [[1, 0, 0],
        [1, 0, 0],
        [1, 0, 0],
        [1, 0, 0]],

       [[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]],

       [[1, 0, 0],
        [1, 0, 0],
        [1, 0, 0],
        [1, 0, 0]],

       [[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]]])

**💪 Exercise**: create a numpy array of three rows and two columns of arbitrary numbers from `0` to `5`:

In [9]:
np.array([
    [2, 1],
    [4, 2],
    [2, 1],
])

array([[2, 1],
       [4, 2],
       [2, 1]])

#### Array Shape

The _shape_ of an array is the size of each of its dimensions:

In [10]:
squares

array([ 0,  1,  4,  9, 16, 25, 36, 49])

In [11]:
squares.shape  # 8 elements

(8,)

In [12]:
m.shape  # 4 rows, 3 columns

(4, 3)

In [13]:
image.shape  # 6 rows, 4 columns, each element containing 3 coordinates

(6, 4, 3)

An array's _rank_ is the number of dimensions.
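
As a small sketch (the array `grid` is a made-up example, not one used elsewhere in this workshop), the rank is exposed as the `ndim` attribute, and `size` gives the total number of elements:

```python
import numpy as np

# hypothetical example array: 2 rows, 3 columns
grid = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

rank = grid.ndim    # number of dimensions, same as len(grid.shape)
total = grid.size   # total number of elements: 2 * 3
print(grid.shape, rank, total)
```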

---

Every sub-list along a dimension must have the same length, meaning we can't have a _jagged_ array:

In [14]:
a = np.array([
    [1,2,3], 
    [1,2]
])
a

array([list([1, 2, 3]), list([1, 2])], dtype=object)

In [15]:
a.shape

(2,)

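NumPy just interpreted it as an array of two elements, each element being a list. Recent NumPy versions (1.24 and later) raise a `ValueError` for such input unless `dtype=object` is passed explicitly. More details on data types in the next sub-section.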

#### Array Data Types

An array's data type (`dtype`) is the type of the elements it holds.

In [16]:
squares.dtype

dtype('int64')

In [17]:
m.dtype

dtype('int64')

In [18]:
np.array([1.5, 2.3, 4.9]).dtype

dtype('float64')

In [19]:
np.array([True, True, False]).dtype

dtype('bool')

In [20]:
np.array(['abc', 'def', 'xy']).dtype  # unicode with 3 or fewer characters

dtype('<U3')

Compatible data types are "up-scaled" (promoted) to the most general one:

In [21]:
a = np.array([True, 5])  # bool gets promoted to int
a

array([1, 5])

In [22]:
a.dtype

dtype('int64')

In [23]:
np.array([1, 2.5]).dtype  # int gets promoted to float

dtype('float64')

In [24]:
np.array([True, 2, 3.5]).dtype  # bool and int get promoted to float

dtype('float64')

In [25]:
np.array([7, 'abc']).dtype  # calls `str()` on them

dtype('<U21')

Incompatible datatypes are put under the `object` umbrella:

In [26]:
s = {1, 2, 3}
np.array([s, 5]).dtype

dtype('O')
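
To check a promotion without building an array, `np.promote_types` can be asked directly. This is only an illustrative sketch; the variable names are made up for the example:

```python
import numpy as np

# ask NumPy which dtype two dtypes are promoted to
bool_and_int = np.promote_types(np.bool_, np.int64)
int_and_float = np.promote_types(np.int64, np.float64)
print(bool_and_int, int_and_float)

# a set has no numeric dtype, so the array falls back to `object`
mixed = np.array([{1, 2, 3}, 5])
print(mixed.dtype)
```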

You can also convert types manually:

In [27]:
np.array([2, 4, 0]).astype(bool)

array([ True,  True, False])

In [28]:
np.array([2, 4, 0]).astype(float)

array([2., 4., 0.])

Or upon creation:

In [29]:
np.array([2, 4], dtype=float)

array([2., 4.])

The nesting of the input lists determines the shape. Compare a single row, a single column, and a flat array:

In [30]:
np.array([
    [1, 2, 3]
]).shape

(1, 3)

In [31]:
np.array([
    [1], 
    [2], 
    [3]
]).shape

(3, 1)

In [32]:
np.array([1, 2, 3]).shape

(3,)

**💪 Exercise**: create an array of three booleans, of `dtype` `str`:

In [33]:
np.array([True, True, False], dtype=str)

array(['True', 'True', 'False'], dtype='<U5')

#### Array Generation

Similar to the built-in `range`, generate an array of sequential numbers:

In [34]:
np.arange(7)

array([0, 1, 2, 3, 4, 5, 6])

**ℹ️ Tip**: it's called `a range` as in `an interval`, not `arrange` as in `align` — it confused me for the longest time.

A more flexible, non-integer counterpart, `linspace`, generates `num` evenly spaced values from `start` to `stop`, both included:

In [35]:
np.linspace(start=0, stop=15, num=5)

array([ 0.  ,  3.75,  7.5 , 11.25, 15.  ])

Similarly, there is a logarithmic counterpart:

In [36]:
np.logspace(0, 3, num=4, base=10.)

array([   1.,   10.,  100., 1000.])
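
A short sketch of how these generators relate to each other (the variable names are made up for the example): `arange` works with a step and excludes the stop value, `linspace` works with a number of points and includes it, and `logspace` is the base raised to a `linspace`.

```python
import numpy as np

# `arange` takes a step and excludes the stop value
by_step = np.arange(0, 15, 3.75)

# `linspace` takes a number of points and includes the stop value
by_count = np.linspace(start=0, stop=15, num=5)

print(by_step)   # 4 values: 15 is excluded
print(by_count)  # 5 values: 15 is included

# `logspace` is equivalent to raising the base to a `linspace`
powers = 10.0 ** np.linspace(0, 3, num=4)
print(np.allclose(powers, np.logspace(0, 3, num=4, base=10.)))
```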

---

Generate an array of equal elements:

In [37]:
np.ones(4)

array([1., 1., 1., 1.])

In [38]:
np.zeros((3, 2))  # any shape

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

Specify shape based on another array's:

In [39]:
np.zeros(m.shape)

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [40]:
np.zeros_like(m)

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

In [41]:
np.zeros_like(squares)  # same shape and dtype as `squares`

array([0, 0, 0, 0, 0, 0, 0, 0])

If you just want to instantiate an array and fill in the elements later, you can skip the filling step and create one with bogus elements:

In [42]:
np.empty(30)  # the memory is left uninitialized, so you will likely see different values each time this is run

array([-2.00000000e+000, -1.29074518e-231,  7.90505033e-323,
        0.00000000e+000,  2.29175545e-312,  2.14027814e+161,
        4.50602192e-144,  7.79952704e-143,  4.50606090e-144,
        7.79952704e-143,  8.00376614e+169,  4.26976898e-090,
        4.79121995e-037,  2.23113595e+160,  4.30604566e-096,
        5.23081515e-143,  1.27038358e-075,  5.94845000e-091,
        1.71862197e+185,  1.54667645e+185,  4.26397229e-096,
        6.32299154e+233,  6.48224638e+170,  5.22411352e+257,
        1.41529403e+161,  6.00736899e-067,  3.67220939e+097,
        7.13185209e-067,  4.09604276e+126,  2.08600674e-308])

**💪 Exercise**: generate an array of 6 rows, 3 columns of ones:

In [43]:
np.ones((6, 3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

#### Array Copying

Assign a new "label" to the same array object (similar to `&` references in C++):

In [44]:
a = np.ones(3)  # create an array, and "assign the label" `a` to it
a

array([1., 1., 1.])

In [45]:
b = a  # `b` is now another label for the same array

Modifications in any label affect the base object:

In [46]:
b[0] = 5
b

array([5., 1., 1.])

In [47]:
a  # modified indirectly

array([5., 1., 1.])

---

To "clone" the object, use `copy` instead:

In [48]:
a = np.ones(3)
a

array([1., 1., 1.])

In [49]:
b = np.copy(a)

In [50]:
b[0] = 5
b

array([5., 1., 1.])

In [51]:
a

array([1., 1., 1.])

**ℹ️ Tip**: `np.copy` makes a shallow copy, so this still fails if you store mutable, non-primitive objects:

In [52]:
a = np.array([
    {1, 2, 3},  # a set
    {5, 4},
])
a

array([{1, 2, 3}, {4, 5}], dtype=object)

In [53]:
b = np.copy(a)

In [54]:
b[0].remove(3)
b

array([{1, 2}, {4, 5}], dtype=object)

In [55]:
a  # still affected

array([{1, 2}, {4, 5}], dtype=object)

In this case, `deepcopy` would be useful, from the [copy built-in library](https://docs.python.org/3.7/library/copy.html).
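
A minimal sketch of the difference, reusing the two sets from the cells above (the names `shallow` and `deep` are made up for the example):

```python
import copy
import numpy as np

a = np.array([{1, 2, 3}, {5, 4}])   # object array holding two sets

shallow = np.copy(a)       # new array, but it points to the same set objects
deep = copy.deepcopy(a)    # new array with independent copies of the sets

deep[0].remove(3)          # only changes the deep copy
print(a[0])                # the original set still contains 3
print(shallow[0] is a[0])  # the shallow copy shares the set object
print(deep[0] is a[0])     # the deep copy does not
```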

#### Randomness

Random number generation is used mostly when training predictive models, but it is also handy for producing example data and in some advanced data visualizations.

Uniformly distributed $V \sim U(0, 1)$:

In [56]:
np.zeros((4, 2))

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [57]:
np.random.rand(4, 2)

array([[0.41743152, 0.31767253],
       [0.58279742, 0.8164259 ],
       [0.90546049, 0.86193167],
       [0.21541149, 0.67950082]])

Standard normal distribution, with zero mean and unit standard deviation, $V \sim N(0, 1)$:

In [58]:
np.random.randn(3)

array([0.63136506, 0.99443542, 0.15893066])

Uniformly distributed integers in a half-open interval, $V \sim U_{\mathbb{Z}}[a, b)$ (the upper bound is excluded):

In [59]:
np.random.randint(0, 10, size=3)

array([6, 9, 9])

Sampling elements from a given set, with or without replacement:

In [60]:
np.random.choice(['red', 'green', 'blue'], size=5)

array(['blue', 'green', 'green', 'green', 'green'], dtype='<U5')
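
The cell above samples with replacement, which is the default. A small sketch of both variants, using the `replace` parameter of `np.random.choice` (the variable names are made up for the example):

```python
import numpy as np

colors = ['red', 'green', 'blue']

# with replacement (the default): the same element can be drawn repeatedly
with_repl = np.random.choice(colors, size=5)

# without replacement: every element is drawn at most once,
# so `size` cannot exceed the number of available elements
without_repl = np.random.choice(colors, size=3, replace=False)
print(with_repl, without_repl)
```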

Generating permutations $\sigma \in S_n$:

In [61]:
np.random.permutation([4, 2, 1])

array([1, 2, 4])

**ℹ️ Tip**: setting the random seed allows for reproducibility of results when randomness is involved. The results are still (pseudo)random, but they are the same ones every time. Libraries that delegate their random generation to NumPy's global generator are covered by `np.random.seed(123)`; others may need their own seed. Read more about [random number generation](https://en.wikipedia.org/wiki/Pseudorandom_number_generator).
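
A minimal sketch of seeding in practice. The first part uses the global seed mentioned in the tip; the second part uses `np.random.default_rng`, a generator object with its own seed that is available in newer NumPy versions (1.17 and later) and is not otherwise used in this workshop.

```python
import numpy as np

np.random.seed(123)          # fix the global seed
first = np.random.rand(3)

np.random.seed(123)          # reset to the same seed
second = np.random.rand(3)   # the same "random" numbers again

print(np.array_equal(first, second))

# alternative: a self-contained generator object with its own seed
rng = np.random.default_rng(123)
sample = rng.integers(0, 10, size=3)   # upper bound excluded, like `randint`
print(sample)
```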

**💪 Exercise**: generate an array of 6 rows, 3 columns of random integers between `0` and `9`:

In [62]:
np.random.randint(0, 10, size=(6, 3))  # the upper bound is exclusive, so use 10 to include 9

array([[4, 6, 5],
       [7, 8, 0],
       [5, 6, 2],
       [5, 2, 2],
       [4, 3, 6],
       [2, 2, 0]])

### Array Accessing

Indexing and slicing work as they do for `list`s:

In [63]:
squares  # to remember what it contains

array([ 0,  1,  4,  9, 16, 25, 36, 49])

In [64]:
squares[2]  # remember, zero-indexed

4

In [65]:
squares[2:6]  # slices

array([ 4,  9, 16, 25])

---

It extends naturally to multi-dimensional arrays:

In [66]:
m

array([[5, 2, 3],
       [4, 5, 1],
       [7, 1, 2],
       [6, 2, 9]])

In [67]:
m[:2]  # first two rows

array([[5, 2, 3],
       [4, 5, 1]])

In [68]:
m[:, :2]  # all rows, first two columns

array([[5, 2],
       [4, 5],
       [7, 1],
       [6, 2]])

In [69]:
m[:2, :2]  # first two rows of the first two columns

array([[5, 2],
       [4, 5]])

---

_Fancy_ indexing (this is the actual term) allows for accessing multiple elements at once:

In [70]:
indices = [4, 2, 2]  # can also repeat
squares[indices]

array([16,  4,  4])

In [71]:
row_indices = [0, 1]
col_indices = [0, 2]
m[row_indices, col_indices]

array([5, 1])
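
Note that the two index lists above are paired up element by element. As a sketch of the alternative, `np.ix_` selects the full cross product of rows and columns instead (the names `paired` and `block` are made up for the example):

```python
import numpy as np

m = np.array([
    [5, 2, 3],
    [4, 5, 1],
    [7, 1, 2],
    [6, 2, 9],
])
row_indices = [0, 1]
col_indices = [0, 2]

# index lists are paired up: elements (0, 0) and (1, 2)
paired = m[row_indices, col_indices]

# `np.ix_` builds the cross product instead: rows 0 and 1, columns 0 and 2
block = m[np.ix_(row_indices, col_indices)]
print(paired)
print(block)
```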

**💪 Exercise**: access all rows, and the second and third columns, of `m`:

In [72]:
m[:, 1:3]

array([[2, 3],
       [5, 1],
       [1, 2],
       [2, 9]])

### Array Iteration

Iteration works the same as for `list`s:

In [73]:
for sq in squares:
    print(sq)

0
1
4
9
16
25
36
49


Enumeration has its n-dimensional counterpart:

In [74]:
for index, element in np.ndenumerate(m):
    print('index', index, 'element', element)

index (0, 0) element 5
index (0, 1) element 2
index (0, 2) element 3
index (1, 0) element 4
index (1, 1) element 5
index (1, 2) element 1
index (2, 0) element 7
index (2, 1) element 1
index (2, 2) element 2
index (3, 0) element 6
index (3, 1) element 2
index (3, 2) element 9


### Array Operations

Arithmetic operations are _vectorized_: they are applied to every element of the array at once:

In [75]:
squares  # to remember what it contains

array([ 0,  1,  4,  9, 16, 25, 36, 49])

In [76]:
squares + 100  # add 100 to each element

array([100, 101, 104, 109, 116, 125, 136, 149])

In [77]:
squares ** .5  # raise every element to the power 0.5 (square root it)

array([0., 1., 2., 3., 4., 5., 6., 7.])

**ℹ️ Tip**: an array containing all `5`s is generated by `np.ones(dim) * 5`, or more directly by `np.full(dim, 5)`

---

Comparison operators are vectorized as well, and their result is a boolean array:

In [78]:
squares > 5

array([False, False, False,  True,  True,  True,  True,  True])

In [79]:
squares == 1

array([False,  True, False, False, False, False, False, False])

In [80]:
even = (squares % 2 == 0)
even

array([ True, False,  True, False,  True, False,  True, False])

---

There are also unary operators, such as negation:

In [81]:
~even

array([False,  True, False,  True, False,  True, False,  True])

In [82]:
-squares

array([  0,  -1,  -4,  -9, -16, -25, -36, -49])

---

Aggregations and other more complex operations are available as methods or as `np` functions:

In [83]:
squares.sum()  # sum of all elements

140

In [84]:
sum(squares)  # same result with the built-in, but slower on large arrays

140

In [85]:
np.log(squares + 1)

array([0.        , 0.69314718, 1.60943791, 2.30258509, 2.83321334,
       3.25809654, 3.61091791, 3.91202301])

In [86]:
squares.mean()  # equivalent to sum/len

17.5

In [87]:
squares.std()  # standard deviation

16.680827317612277

In [88]:
squares.cumsum()  # cumulative sum

array([  0,   1,   5,  14,  30,  55,  91, 140])

---

In [89]:
a = np.array([2, 0, 4])

In [90]:
a.min()

0

In [91]:
a.argmin()  # the index of the minimum element

1

In [92]:
a.argsort()  # the indices that would sort the array

array([1, 0, 2])

In [93]:
a[a.argsort()]

array([0, 2, 4])

---

Operators naturally extend to multiple dimensions as well:

In [94]:
m

array([[5, 2, 3],
       [4, 5, 1],
       [7, 1, 2],
       [6, 2, 9]])

In [95]:
m * 10

array([[50, 20, 30],
       [40, 50, 10],
       [70, 10, 20],
       [60, 20, 90]])

In [96]:
m == 5

array([[ True, False, False],
       [False,  True, False],
       [False, False, False],
       [False, False, False]])

Element-wise application of operators between two arrays:

In [97]:
a

array([2, 0, 4])

In [98]:
b = np.array([9, 6, 6])

In [99]:
a + b

array([11,  6, 10])

In [100]:
a * b

array([18,  0, 24])

---

Logical operations between boolean arrays:

In [101]:
a = squares > 5
a

array([False, False, False,  True,  True,  True,  True,  True])

In [102]:
b = (squares % 2 == 0)
b

array([ True, False,  True, False,  True, False,  True, False])

In [103]:
a & b

array([False, False, False, False,  True, False,  True, False])

In [104]:
a | b

array([ True, False,  True,  True,  True,  True,  True,  True])

---

Mathematical functions such as `np.sin` are applied element-wise as well:

In [105]:
x = np.linspace(0, np.pi, num=5)
x

array([0.        , 0.78539816, 1.57079633, 2.35619449, 3.14159265])

In [106]:
np.sin(x).round(3)

array([0.   , 0.707, 1.   , 0.707, 0.   ])

---

Functions, when applied to multi-dimensional arrays, allow you to specify an axis. In a 2D matrix, that means either column-wise or row-wise:

In [107]:
m

array([[5, 2, 3],
       [4, 5, 1],
       [7, 1, 2],
       [6, 2, 9]])

In [108]:
m.sum()  # overall sum of all elements, no axis specified

47

In [109]:
m.sum(axis=0)  # along the first axis (down the rows): one result for each column

array([22, 10, 15])

In [110]:
m.sum(axis=1)  # along the second axis (across the columns): one result for each row

array([10, 10, 10, 17])
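
The same `axis` argument is accepted by the other aggregations. A short sketch on the same matrix; the `keepdims` parameter, not used elsewhere in this workshop, keeps the reduced axis with size 1:

```python
import numpy as np

m = np.array([
    [5, 2, 3],
    [4, 5, 1],
    [7, 1, 2],
    [6, 2, 9],
])

col_means = m.mean(axis=0)   # one mean per column
row_max = m.max(axis=1)      # one maximum per row
print(col_means)
print(row_max)

# `keepdims=True` keeps the reduced axis, giving a column of row sums
row_sums = m.sum(axis=1, keepdims=True)
print(row_sums.shape)
```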

---

The `*` operator gives the Hadamard product (element-wise multiplication) between matrices:

In [111]:
m * m

array([[25,  4,  9],
       [16, 25,  1],
       [49,  1,  4],
       [36,  4, 81]])

Matrix multiplication is done using the `@` operator (available since Python 3.5; the older `a.dot(b)` spelling still works):

In [112]:
m @ m.transpose()

array([[ 38,  33,  43,  61],
       [ 33,  42,  35,  43],
       [ 43,  35,  54,  62],
       [ 61,  43,  62, 121]])

---

**💪 Exercise**: re-generate the `squares` array, but using numpy:

In [113]:
np.arange(8) ** 2

array([ 0,  1,  4,  9, 16, 25, 36, 49])

#### Reshaping

Arrays can be morphed into a different (compatible) shape:

In [114]:
squares  # original

array([ 0,  1,  4,  9, 16, 25, 36, 49])

In [115]:
squares.reshape(2, 4)  # 2 rows, 4 columns

array([[ 0,  1,  4,  9],
       [16, 25, 36, 49]])

In [116]:
squares.reshape(4, 2)  # 4 rows, 2 columns

array([[ 0,  1],
       [ 4,  9],
       [16, 25],
       [36, 49]])

Flatten an array of any shape with `.reshape(-1)`:

In [117]:
m

array([[5, 2, 3],
       [4, 5, 1],
       [7, 1, 2],
       [6, 2, 9]])

In [118]:
m.reshape(-1)

array([5, 2, 3, 4, 5, 1, 7, 1, 2, 6, 2, 9])

In [119]:
image.reshape(-1).shape

(72,)

In [120]:
6 * 4 * 3

72

Transposition (axis inversion):

In [121]:
m.T

array([[5, 4, 7, 6],
       [2, 5, 1, 2],
       [3, 1, 2, 9]])

---

Generating new axes can be useful when certain functions require the data in a particular shape, even if it is degenerate:

In [122]:
squares[:, np.newaxis]  # turn each element into its own row

array([[ 0],
       [ 1],
       [ 4],
       [ 9],
       [16],
       [25],
       [36],
       [49]])

In [123]:
squares[np.newaxis, :]  # wrap the array

array([[ 0,  1,  4,  9, 16, 25, 36, 49]])

In [124]:
squares.shape  # original shape

(8,)

In [125]:
squares[:, np.newaxis].shape

(8, 1)

In [126]:
squares[np.newaxis, :].shape

(1, 8)

**ℹ️ Tip**: shapes are simply "views" of the underlying data, which is stored the same way, regardless of assigned shape. Read more about how [data is stored internally](https://docs.scipy.org/doc/numpy-1.13.0/reference/internals.html).
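
A small sketch of what "view" means in practice, assuming a freshly created contiguous array (where `reshape` does not need to copy): the reshaped array shares memory with the original, so writing through one is visible through the other. The name `grid` is made up for the example.

```python
import numpy as np

squares = np.arange(8) ** 2
grid = squares.reshape(2, 4)   # a view on the same data, not a copy

print(np.shares_memory(squares, grid))

grid[0, 0] = -1                # writing through the view...
print(squares[0])              # ...changes the original array too
```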

**💪 Exercise**: in how many ways can an array of size 12 be reshaped?

In [127]:
np.arange(12).reshape(4, 3)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [None]:
np.arange(12).reshape(2, 6)

In [None]:
np.arange(12).reshape(12, 1)
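
One way to count the answers, as a rough sketch restricted to two-dimensional shapes (the exercise leaves the number of dimensions open; `n`, `shapes_2d` and `inferred` are names made up for this example): every divisor of 12 gives one valid shape.

```python
import numpy as np

n = 12

# every divisor `rows` of n gives one valid 2D shape (rows, n // rows)
shapes_2d = [(rows, n // rows) for rows in range(1, n + 1) if n % rows == 0]
print(len(shapes_2d), shapes_2d)

# each of them is accepted by `reshape`
for shape in shapes_2d:
    assert np.arange(n).reshape(shape).shape == shape

# `-1` lets NumPy infer one dimension from the others
inferred = np.arange(n).reshape(3, -1)
print(inferred.shape)
```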

**👾 Trivia**: [this is why](https://www.youtube.com/watch?v=U6xJfP7-HCc) a base-12 number system would make arithmetic easier.

### Broadcasting

The same operator, `+`, is used both when adding a constant to each element and when performing element-wise addition. This concept extends to arbitrarily many dimensions: the operand with the smaller shape is _broadcast_ (virtually repeated) until it matches the other operand's shape.

In [128]:
squares + 10

array([10, 11, 14, 19, 26, 35, 46, 59])

In [129]:
tens = [10] * 8  # eight elements, each equal to 10
squares + tens  # same result: broadcasting the scalar 10 behaves like this explicit repetition

array([10, 11, 14, 19, 26, 35, 46, 59])

It becomes non-trivial in higher dimensions:

In [130]:
m  # content refresher

array([[5, 2, 3],
       [4, 5, 1],
       [7, 1, 2],
       [6, 2, 9]])

In [131]:
m + [100, 10, 0]  # add these values to each row
# for each row, the first element gets +100, the second +10, the third +0

array([[105,  12,   3],
       [104,  15,   1],
       [107,  11,   2],
       [106,  12,   9]])

In [132]:
m + [[1000], [100], [10], [0]]  # add these values to each column
# for each column, the first element gets +1000, the second +100, the third +10, the fourth +0

array([[1005, 1002, 1003],
       [ 104,  105,  101],
       [  17,   11,   12],
       [   6,    2,    9]])

Read more about [broadcasting](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html).
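
To make the "virtual repetition" visible, here is a sketch using `np.broadcast_to`, which is not otherwise used in this workshop. It also shows a pair of shapes that cannot be broadcast together; the names `row`, `col` and the `*_stretched` variables are made up for the example.

```python
import numpy as np

m = np.array([
    [5, 2, 3],
    [4, 5, 1],
    [7, 1, 2],
    [6, 2, 9],
])
row = np.array([100, 10, 0])                 # shape (3,)
col = np.array([[1000], [100], [10], [0]])   # shape (4, 1)

# `np.broadcast_to` shows the virtually repeated operand explicitly
row_stretched = np.broadcast_to(row, m.shape)
col_stretched = np.broadcast_to(col, m.shape)

print(np.array_equal(m + row, m + row_stretched))
print(np.array_equal(m + col, m + col_stretched))

# shapes are compared from the last dimension backwards;
# (4, 3) and (4,) do not line up, so this raises a ValueError
try:
    m + np.array([1, 2, 3, 4])
except ValueError as error:
    print('incompatible shapes:', error)
```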

### Array Masking

Boolean indexing accesses only those elements where the indexing array is `True`:

In [133]:
mask = (squares > 5)

In [134]:
mask

array([False, False, False,  True,  True,  True,  True,  True])

In [135]:
squares[mask]

array([ 9, 16, 25, 36, 49])

**💪 Exercise**: select only even `squares`:

In [136]:
squares[squares % 2 == 0]

array([ 0,  4, 16, 36])
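
Masks can also be combined and used for assignment. The following is an illustrative sketch; `capped`, `middle` and `labels` are names made up for the example, and `np.where` is not otherwise used in this workshop.

```python
import numpy as np

squares = np.arange(8) ** 2
capped = squares.copy()    # work on a copy, keep the original intact

capped[capped > 20] = 20   # assign to all elements selected by the mask
print(capped)

# combine conditions with `&` and `|` (the parentheses are required)
middle = squares[(squares > 5) & (squares < 30)]
print(middle)

# `np.where` picks element-wise between two alternatives
labels = np.where(squares % 2 == 0, 'even', 'odd')
print(labels)
```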

### Array Extension

Since `+` is reserved for addition, array concatenation is done with a function:

In [137]:
np.concatenate([squares, squares])

array([ 0,  1,  4,  9, 16, 25, 36, 49,  0,  1,  4,  9, 16, 25, 36, 49])

In the multi-dimensional case, arrays are joined along the first axis by default, and the remaining dimensions must match:

In [138]:
a = np.arange(6).reshape(3, 2)
a

array([[0, 1],
       [2, 3],
       [4, 5]])

In [139]:
b = np.ones((2, 2))
b

array([[1., 1.],
       [1., 1.]])

In [140]:
np.concatenate([a, b])

array([[0., 1.],
       [2., 3.],
       [4., 5.],
       [1., 1.],
       [1., 1.]])

---

In [141]:
c = np.zeros((2, 2))

In [142]:
np.vstack([b, c])  # on top of each other

array([[1., 1.],
       [1., 1.],
       [0., 0.],
       [0., 0.]])

In [143]:
np.hstack([b, c])  # next to each other

array([[1., 1., 0., 0.],
       [1., 1., 0., 0.]])
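
For 2D arrays, both helpers correspond to `np.concatenate` with an explicit `axis` argument, as this short sketch shows (the names `stacked_rows` and `stacked_cols` are made up for the example):

```python
import numpy as np

b = np.ones((2, 2))
c = np.zeros((2, 2))

stacked_rows = np.concatenate([b, c], axis=0)   # same as np.vstack([b, c])
stacked_cols = np.concatenate([b, c], axis=1)   # same as np.hstack([b, c])
print(stacked_rows.shape, stacked_cols.shape)
```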

## Further reading
 - NumPy: 
   - [cheatsheet](https://www.dataquest.io/blog/large_files/numpy-cheat-sheet.pdf)
   - [official quickstart guide](https://docs.scipy.org/doc/numpy-1.15.0/user/quickstart.html)
   - [official reference](https://docs.scipy.org/doc/numpy/reference/index.html#reference)
 - SciPy: [tutorial](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html)
 - Python/NumPy/SciPy/Matplotlib: [quick tutorial](http://cs231n.github.io/python-numpy-tutorial/)
 
Links to more details about particular concepts are placed at the end of their respective (sub)sections.