## Vectorization 

In [1]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)

3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2


In [2]:
npa = np.random.random_integers(0,50,20)

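I have touched on the value of vectorization in the last couple of videos. To review, the power of vectorization comes from two properties:

- Concise: one expression replaces an explicit loop
- Efficient: the looping happens inside NumPy's compiled code rather than in Python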

The fundamental idea behind array programming is that operations apply at once to an entire set of values. This makes it a high-level programming model as it allows the programmer to think and operate on whole aggregates of data without having to resort to explicit loops of individual scalar operations.

You can read more here:
https://en.wikipedia.org/wiki/Array_programming
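
As a rough illustration of the idea, the sketch below doubles a small array twice: once with an explicit loop of scalar operations and once with a single array expression. The names `values`, `looped`, and `vectorized` are made up for this example.

```python
import numpy as np

values = np.array([3, 35, 0, 50, 22])

# Explicit loop: one scalar multiplication per element.
looped = np.empty_like(values)
for i in range(len(values)):
    looped[i] = values[i] * 2

# Array expression: the multiplication applies to every element at once.
vectorized = values * 2

print(looped)       # [  6  70   0 100  44]
print(vectorized)   # [  6  70   0 100  44]
```

Both versions give the same answer; the second simply says what we want done to the whole array.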

In [3]:
npa

array([ 3, 35,  0, 50, 22, 49, 17,  0, 22, 32, 24, 20, 42, 43, 20, 12, 43,
       50, 29, 50])

As we have touched on, vectorization lets us apply a change to the entire array efficiently, without writing a for loop. If we want to double the array, we multiply it by 2; if we want to cube it, we raise it to the third power.

In [4]:
npa * 2

array([  6,  70,   0, 100,  44,  98,  34,   0,  44,  64,  48,  40,  84,
        86,  40,  24,  86, 100,  58, 100])

In [5]:
npa ** 3

array([    27,  42875,      0, 125000,  10648, 117649,   4913,      0,
        10648,  32768,  13824,   8000,  74088,  79507,   8000,   1728,
        79507, 125000,  24389, 125000])

In [6]:
npa ** 2 ** 3

array([          6561,  2251875390625,              0, 39062500000000,
          54875873536, 33232930569601,     6975757441,              0,
          54875873536,  1099511627776,   110075314176,    25600000000,
        9682651996416, 11688200277601,    25600000000,      429981696,
       11688200277601, 39062500000000,   500246412961, 39062500000000])
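
That last cell is easy to misread. In Python, `**` groups from right to left, so `npa ** 2 ** 3` means `npa ** (2 ** 3)`, that is `npa ** 8`, which is why the 3 in the array became 6561. Here is a small check on a made-up array `a`:

```python
import numpy as np

a = np.array([1, 2, 3])

right_to_left = a ** 2 ** 3      # parsed as a ** (2 ** 3), i.e. a ** 8
explicit = a ** 8
left_to_right = (a ** 2) ** 3    # a ** 6, a different result

print(right_to_left)   # [   1  256 6561]
print(left_to_right)   # [  1  64 729]
```

If you mean "square it, then cube it", add the parentheses.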

We can actually take this a step further and define our own vectorized functions, which will operate extremely efficiently. Let's say that we need to do a series of transformations. We can include these in a function to make them a bit more readable.

We can simply declare exactly what we want done to each value.

In [7]:
def transformation(val):
    return ((val * 2) - 1)

In [8]:
transformation(2)

3

And then we can apply it to an array.

In [9]:
transformation(npa)

array([ 5, 69, -1, 99, 43, 97, 33, -1, 43, 63, 47, 39, 83, 85, 39, 23, 85,
       99, 57, 99])

While this worked, it will not always. A function that uses Python control flow, such as an `if` that filters values or applies a different transformation to certain values, fails when it is handed a whole array. In those cases we can wrap the function with `np.vectorize` so that we can pass an array into it.

In [10]:
transform_2 = np.vectorize(transformation)

In [11]:
transform_2(npa)

array([ 5, 69, -1, 99, 43, 97, 33, -1, 43, 63, 47, 39, 83, 85, 39, 23, 85,
       99, 57, 99])
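
To make the failure case concrete, here is a minimal sketch. The function `double_if_positive` and the array `values` are hypothetical; they are only there to show a branch that works on one value but not on an array.

```python
import numpy as np

def double_if_positive(val):
    # A plain Python `if` needs a single True/False answer,
    # so this only works when val is one scalar value.
    if val < 0:
        return 0
    return val * 2

values = np.array([-3, 1, 4, -1, 5])

try:
    double_if_positive(values)
except ValueError as err:
    print("direct call failed:", err)

# np.vectorize calls the function once per element and collects the results.
vectorized = np.vectorize(double_if_positive)
print(vectorized(values))                      # [ 0  2  8  0 10]

# The same filter written as a whole-array expression.
print(np.where(values < 0, 0, values * 2))     # [ 0  2  8  0 10]
```

When a rule can be written with `np.where` or a Boolean mask, that is usually the better choice; `np.vectorize` is the fallback for logic that is awkward to express that way.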

In [12]:
%timeit [transformation(x) for x in npa]

100000 loops, best of 3: 13.6 µs per loop


In [13]:
%timeit transform_2(npa)

The slowest run took 4.07 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 27.6 µs per loop


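We can see that in this case it was not faster than a list comprehension. Remember that this is just a small, simple array, so the fixed overhead of `np.vectorize` dominates. Also keep in mind that `np.vectorize` is mainly a convenience: it still calls the Python function once per element, so the big speedups come from whole-array expressions such as `transformation(npa)`.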
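
If you want to check this on your own machine, here is one way to do it with the standard `timeit` module on a larger array. The array `big` and its size are arbitrary choices for this sketch, and I am deliberately not showing timings because they vary from machine to machine.

```python
import timeit
import numpy as np

def transformation(val):
    return (val * 2) - 1

big = np.arange(20000)
transform_2 = np.vectorize(transformation)

candidates = [
    ("list comprehension", lambda: [transformation(x) for x in big]),
    ("np.vectorize", lambda: transform_2(big)),
    ("array expression", lambda: transformation(big)),
]

for label, func in candidates:
    # Total time in seconds for 3 runs of each approach.
    seconds = timeit.timeit(func, number=3)
    print(label, seconds)
```

All three approaches compute the same values; only the place where the loop runs is different.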

## Multidimensional Arrays

We learned in the last video how to generate simple arrays. Now let’s move on to arrays with multiple dimensions. Think in terms of matrices or cubes.

We can create these by reshaping one-dimensional arrays. One of the simplest ways to do this is with the `reshape` method. Passing it `(x, y)` gives us an `x` by `y` array, as long as the original array has exactly `x * y` values.

In [14]:
npa = np.arange(25)

In [15]:
npa2 = npa.reshape((5,5))
npa2

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
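
The one rule with `reshape` is that the new shape has to hold exactly the same number of values. A short sketch with a made-up array `flat`:

```python
import numpy as np

flat = np.arange(12)

grid = flat.reshape((3, 4))        # 3 rows, 4 columns
inferred = flat.reshape((2, -1))   # -1 asks NumPy to work out the missing size

print(grid.shape)       # (3, 4)
print(inferred.shape)   # (2, 6)

try:
    flat.reshape((5, 5))           # 12 values cannot fill 25 slots
except ValueError as err:
    print("reshape failed:", err)
```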

To create a matrix of zeros we can use the `zeros` function along with the desired shape.

In [16]:
np.zeros((5,2))

array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

In [17]:
np.zeros((5,5))

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

What's nice about this is that a lot of outputs can be used as inputs.

In [18]:
np.zeros(npa2.shape)

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

This works because `shape` returns a tuple, which is exactly what `zeros` expects.

In [19]:
npa2.shape

(5, 5)
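
A related shortcut is `np.zeros_like`, which copies the shape and also the data type of an existing array. The array `template` below is just a stand-in for this sketch:

```python
import numpy as np

template = np.arange(6).reshape((2, 3))   # an integer array

from_shape = np.zeros(template.shape)     # zeros default to floats
like = np.zeros_like(template)            # keeps template's integer dtype

print(from_shape.shape, from_shape.dtype)   # (2, 3) float64
print(like.shape == template.shape)         # True
print(like.dtype == template.dtype)         # True
```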

Now these have all been two-dimensional arrays. Let's explore the third dimension.

In [20]:
np.arange(8).reshape(2,2,2)

array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])

Of course, when we start diving into theory, we can start going to higher and higher dimensions. Here is the fourth dimension.

In [21]:
np.zeros((4,4,4,4))

array([[[[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]]],


       [[[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]]],

       ...

(Output truncated here; the remaining blocks are also all zeros.)

For the most part, we will be working in two dimensions. Just be aware that NumPy is capable of representing more dimensions. Now you are going to see a lot of these things come together. You are likely going to enjoy vectorization and Boolean selection after this section.

First I will set a seed so you can get exactly the same results that I do. A seed is a number that initializes the computer's process for generating pseudorandom numbers. Although these numbers are meant to look as close to random as possible, if you and I set the same seed, we will get the same sequence. A seed is what makes random results reproducible.

In [22]:
np.random.seed(10)
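
Here is a small sketch of that reproducibility. To avoid disturbing the seed we just set, it uses two separate `np.random.RandomState` generators instead of the global one, and it draws with `randint` (whose upper bound is exclusive) rather than `random_integers`, which is deprecated in later NumPy releases. The names `gen_a`, `gen_b`, `first`, and `second` are made up for the example.

```python
import numpy as np

# Two independent generators started from the same seed.
gen_a = np.random.RandomState(10)
gen_b = np.random.RandomState(10)

first = gen_a.randint(1, 11, 5)    # integers from 1 to 10 inclusive
second = gen_b.randint(1, 11, 5)

# Same seed, same sequence.
print(np.array_equal(first, second))   # True
```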

In [23]:
npa2 = np.random.random_integers(1,10,25).reshape(5,5)
npa2

array([[10,  5,  1,  2, 10],
       [ 1,  2,  9, 10,  1],
       [ 9,  7,  5,  4,  1],
       [ 5,  7,  9,  2,  9],
       [ 5,  2,  4,  7,  6]])

In [24]:
npa3 = np.random.random_integers(1,10,25).reshape(5,5)
npa3

array([[ 4, 10,  7, 10,  2],
       [10,  5,  3,  7,  8],
       [ 9,  9, 10,  3,  1],
       [ 7,  8,  9,  2,  8],
       [ 2,  5,  1,  9,  6]])

Now let's say we want to multiply the first array by 2.

In [25]:
npa2 * 2

array([[20, 10,  2,  4, 20],
       [ 2,  4, 18, 20,  2],
       [18, 14, 10,  8,  2],
       [10, 14, 18,  4, 18],
       [10,  4,  8, 14, 12]])

Let's say we want to compare the values and see where `npa2` is less than `npa3`.

In [26]:
npa2 < npa3

array([[False,  True,  True,  True, False],
       [ True,  True, False, False,  True],
       [False,  True,  True, False, False],
       [ True,  True, False, False, False],
       [False,  True, False,  True, False]], dtype=bool)

Isn't that surprisingly easy? Think about all the for loops you would have to write to achieve the same result.

We can also sum a Boolean array, since each `True` counts as 1. Here we count the positions where `npa2` is greater than `npa3`.

In [27]:
(npa2 > npa3).sum()

8

Or when they are equal.

In [28]:
(npa2 == npa3).sum()

5
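
The same Boolean arrays can also be used to pull out the matching values, which is the Boolean selection I mentioned earlier. A minimal sketch with two made-up arrays `a` and `b`:

```python
import numpy as np

a = np.array([[10, 5, 1],
              [1, 2, 9]])
b = np.array([[4, 10, 7],
              [10, 2, 3]])

mask = a > b              # Boolean array, same shape as a and b

print(mask.sum())         # 2  (each True counts as 1)
print((a == b).sum())     # 1
print(a[mask])            # [10  9]  (the values of a where a > b)
```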

Just like with vectors, we can get minimums and maximums extremely easily.

In [29]:
npa2.min()

1

We can even do that by row or by column using the `axis` argument.

In [30]:
npa2

array([[10,  5,  1,  2, 10],
       [ 1,  2,  9, 10,  1],
       [ 9,  7,  5,  4,  1],
       [ 5,  7,  9,  2,  9],
       [ 5,  2,  4,  7,  6]])

In [31]:
npa2.max(axis=1) # by row

array([10, 10,  9,  9,  7])

In [32]:
npa2.max(axis=0) # by column

array([10,  7,  9, 10, 10])
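
The way I keep `axis` straight is that it names the dimension that gets collapsed. A sketch on a small made-up matrix `m`, where the shape of each result makes this visible:

```python
import numpy as np

m = np.array([[1, 8, 3],
              [7, 2, 9]])    # 2 rows, 3 columns

print(m.max())          # 9        (whole array)
print(m.max(axis=0))    # [7 8 9]  (collapse the rows: one value per column)
print(m.max(axis=1))    # [8 9]    (collapse the columns: one value per row)
print(m.min(axis=0))    # [1 2 3]
```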

There are also plenty of other functions, such as the transpose.

In [33]:
npa2.transpose()

array([[10,  1,  9,  5,  5],
       [ 5,  2,  7,  7,  2],
       [ 1,  9,  5,  9,  4],
       [ 2, 10,  4,  2,  7],
       [10,  1,  1,  9,  6]])

In [34]:
npa2.T

array([[10,  1,  9,  5,  5],
       [ 5,  2,  7,  7,  2],
       [ 1,  9,  5,  9,  4],
       [ 2, 10,  4,  2,  7],
       [10,  1,  1,  9,  6]])

In [35]:
npa2.T == npa2.transpose()

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)
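
With a square array it is hard to see that the transpose swaps the dimensions, so here is a sketch on a made-up 2 by 3 array `m`. It also shows two ways to reduce that grid of `True` values to a single answer.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.shape)      # (2, 3)
print(m.T.shape)    # (3, 2)

# Collapse the element-by-element comparison into one True/False.
print((m.T == m.transpose()).all())         # True
print(np.array_equal(m.T, m.transpose()))   # True
```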

We can also multiply arrays very easily. The `*` operator does value-by-value multiplication; it is not a matrix product. Other products, such as the dot product, have their own functions.

In [36]:
npa2 * npa3

array([[40, 50,  7, 20, 20],
       [10, 10, 27, 70,  8],
       [81, 63, 50, 12,  1],
       [35, 56, 81,  4, 72],
       [10, 10,  4, 63, 36]])
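
To see the difference between the value-by-value product and a matrix product, here is a sketch with two small made-up arrays. I use `np.dot` for the matrix product.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

elementwise = a * b        # each value times the value in the same position
matrix = np.dot(a, b)      # rows of a combined with columns of b

print(elementwise.tolist())   # [[5, 12], [21, 32]]
print(matrix.tolist())        # [[19, 22], [43, 50]]
```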

We can also flatten and ravel the array. These look the same, but `flatten` returns a new array (a copy) while `ravel` returns a view that points to the original array whenever it can.

In [37]:
np2 = npa2.flatten()
np2

array([10,  5,  1,  2, 10,  1,  2,  9, 10,  1,  9,  7,  5,  4,  1,  5,  7,
        9,  2,  9,  5,  2,  4,  7,  6])

In [38]:
np2[0] = 25

In [39]:
npa2

array([[10,  5,  1,  2, 10],
       [ 1,  2,  9, 10,  1],
       [ 9,  7,  5,  4,  1],
       [ 5,  7,  9,  2,  9],
       [ 5,  2,  4,  7,  6]])

In [40]:
r = npa2.ravel()
r

array([10,  5,  1,  2, 10,  1,  2,  9, 10,  1,  9,  7,  5,  4,  1,  5,  7,
        9,  2,  9,  5,  2,  4,  7,  6])

In [41]:
r[0] = 25

In [42]:
npa2

array([[25,  5,  1,  2, 10],
       [ 1,  2,  9, 10,  1],
       [ 9,  7,  5,  4,  1],
       [ 5,  7,  9,  2,  9],
       [ 5,  2,  4,  7,  6]])
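
If you would rather check this than remember it, `np.shares_memory` reports whether two arrays point at the same data. A sketch on a made-up array `m`; it also shows a case where `ravel` has to copy, because the transposed data is not laid out in the order `ravel` needs.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

flat_copy = m.flatten()
flat_view = m.ravel()

print(np.shares_memory(flat_copy, m))   # False (independent copy)
print(np.shares_memory(flat_view, m))   # True  (same underlying data)

flat_view[0] = 99
print(m[0, 0])                          # 99 (the change shows up in m)

# ravel on the transpose cannot be a view here, so it returns a copy.
print(np.shares_memory(m.T.ravel(), m))   # False
```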

We can use some other helpful functions like `cumsum` and `cumprod` to get the cumulative sums and products. These work on arrays of any dimension; by default they run over the flattened array.

In [43]:
npa2.cumsum()

array([ 25,  30,  31,  33,  43,  44,  46,  55,  65,  66,  75,  82,  87,
        91,  92,  97, 104, 113, 115, 124, 129, 131, 135, 142, 148])

In [44]:
npa2.cumprod()

array([              25,              125,              125,
                    250,             2500,             2500,
                   5000,            45000,           450000,
                 450000,          4050000,         28350000,
              141750000,        567000000,        567000000,
             2835000000,      19845000000,     178605000000,
           357210000000,    3214890000000,   16074450000000,
         32148900000000,  128595600000000,  900169200000000,
       5401015200000000])
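
Both functions also accept an `axis` argument if you want running totals down the columns or across the rows instead of over the flattened array. A sketch on a small made-up array `m`:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.cumsum().tolist())          # [1, 3, 6, 10, 15, 21]
print(m.cumsum(axis=0).tolist())    # [[1, 2, 3], [5, 7, 9]]   (down each column)
print(m.cumsum(axis=1).tolist())    # [[1, 3, 6], [4, 9, 15]]  (along each row)
print(m.cumprod().tolist())         # [1, 2, 6, 24, 120, 720]
```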

That covers many of the basic functions you will need when working with pandas. NumPy is a very deep library that does a lot more than I have covered here, but these basics are the ones that will come up when we work with pandas. Although this has likely felt fairly academic so far, it provides a valuable foundation for pandas.