## Vectorization 

In [1]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)

3.3.5 |Anaconda 2.2.0 (64-bit)| (default, Sep  2 2014, 13:55:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
1.9.2


In [2]:
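# random_integers includes both endpoints (0 and 50 here); newer NumPy
# deprecates it in favor of np.random.randint(0, 51, 20)
npa = np.random.random_integers(0,50,20)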

I've touched on the value of vectorization in the last couple of videos. To review, the true power of vectorization lies in two properties:

- Concise
- Efficient

The fundamental idea behind array programming is that operations apply at once to an entire set of values. This makes it a high-level programming model as it allows the programmer to think and operate on whole aggregates of data, without having to resort to explicit loops of individual scalar operations.

You can read more here:
https://en.wikipedia.org/wiki/Array_programming
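As a rough sketch of what "no explicit loops" means in practice, the comparison below doubles the same values once with a Python loop and once with a single array expression. The names `values`, `doubled_loop`, and `doubled_vec` are made up for this illustration.

```python
import numpy as np

values = np.array([13, 30, 4, 30, 14])  # hypothetical sample data

# Scalar style: visit each element explicitly.
doubled_loop = []
for v in values:
    doubled_loop.append(v * 2)

# Array style: one expression applies to every element at once.
doubled_vec = values * 2

print(doubled_loop)
print(doubled_vec)
```

Both approaches produce the same numbers; the array version is shorter and pushes the loop down into NumPy's compiled code.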

In [3]:
npa

array([13, 30,  4, 30, 14, 46, 33, 18, 20, 39, 22, 19, 21, 20, 50, 46, 35,
        3, 14, 20])

As we touched on, with vectorization we can apply changes to the entire array extremely efficiently, with no more for loops. If we want to double the array, we just multiply it by 2. If we want to cube it, we just raise it to the power of 3.

In [4]:
npa * 2

array([ 26,  60,   8,  60,  28,  92,  66,  36,  40,  78,  44,  38,  42,
        40, 100,  92,  70,   6,  28,  40])

In [5]:
npa ** 3

array([  2197,  27000,     64,  27000,   2744,  97336,  35937,   5832,
         8000,  59319,  10648,   6859,   9261,   8000, 125000,  97336,
        42875,     27,   2744,   8000])

In [6]:
npa ** 2 ** 3

array([     815730721,   656100000000,          65536,   656100000000,
           1475789056, 20047612231936,  1406408618241,    11019960576,
          25600000000,  5352009260481,    54875873536,    16983563041,
          37822859361,    25600000000, 39062500000000, 20047612231936,
        2251875390625,           6561,     1475789056,    25600000000])
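One detail worth spelling out about the last cell: Python's `**` operator is right-associative, so `npa ** 2 ** 3` is evaluated as `npa ** (2 ** 3)`, that is `npa ** 8`, and not as `(npa ** 2) ** 3`. A small check, using a made-up array `a`:

```python
import numpy as np

a = np.array([2, 3], dtype=np.int64)  # hypothetical example values

right_assoc = a ** 2 ** 3     # parsed as a ** (2 ** 3), i.e. a ** 8
left_grouped = (a ** 2) ** 3  # explicit grouping gives a ** 6

print(right_assoc)   # values 256 and 6561
print(left_grouped)  # values 64 and 729
```

If you mean "square, then cube", add the parentheses yourself.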

We can take this a step further and define our own functions that operate on whole arrays. Let's say that we've got to do a series of transformations. We can put these in a function to make them a bit more readable.

What we can do is just declare exactly what we want done to each value.

In [25]:
def transformation(val):
    return ((val * 2) - 1)

In [26]:
transformation(2)

3

And then we can apply it to an array.

In [30]:
transformation(npa)

array([25, 59,  7, 59, 27, 91, 65, 35, 39, 77, 43, 37, 41, 39, 99, 91, 69,
        5, 27, 39])

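This worked because `transformation` uses only arithmetic operators, which NumPy applies element by element. It won't always work: if the function includes filters or treats certain values differently (for example with an `if` statement), calling it directly on an array will fail. In those cases we can wrap it with the `vectorize` function to get something that we can pass an array into.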

In [31]:
transform_2 = np.vectorize(transformation)

In [32]:
transform_2(npa)

array([25, 59,  7, 59, 27, 91, 65, 35, 39, 77, 43, 37, 41, 39, 99, 91, 69,
        5, 27, 39])
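To illustrate the "filters" case mentioned above, here is a minimal sketch with a branching function. The function `double_if_positive` and the array `arr` are hypothetical names invented for this example; `np.where` is shown as one possible loop-free alternative, not the only one.

```python
import numpy as np

def double_if_positive(val):
    # Hypothetical scalar function with a branch.
    if val > 0:
        return val * 2
    return 0

arr = np.array([-3, 1, 4, -1, 5])

try:
    double_if_positive(arr)  # `if` on a whole array is ambiguous
except ValueError as err:
    print("direct call failed:", err)

# np.vectorize calls the scalar function once per element.
double_vec = np.vectorize(double_if_positive)
print(double_vec(arr))

# The same logic written with array operations only.
print(np.where(arr > 0, arr * 2, 0))
```

The wrapped function and the `np.where` expression give the same result; the difference is that `np.where` never drops back into a Python-level loop.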

In [35]:
%timeit [transformation(x) for x in npa]

100000 loops, best of 3: 6.3 µs per loop


In [36]:
%timeit transform_2(npa)

The slowest run took 4.16 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 14.9 µs per loop


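We can see that in this case `np.vectorize` wasn't faster than a list comprehension. Part of that is fixed overhead on a small, simple array, but keep in mind that `np.vectorize` is mainly a convenience: internally it still calls the Python function once per element. The big speedups come from expressing the work as array operations, as in `transformation(npa)` above, and those matter more as arrays and matrices get larger.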
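If you want to see how the timings behave on a bigger input, a sketch like the following can be run outside IPython using the standard `timeit` module. The array size of 100000 and the repeat count are arbitrary choices for illustration, and the numbers will differ by machine and NumPy version, so none are quoted here.

```python
import timeit
import numpy as np

def transformation(val):
    return (val * 2) - 1

big = np.arange(100000)  # assumed size, chosen only for illustration
transform_2 = np.vectorize(transformation)

# Time three runs of each approach.
t_list = timeit.timeit(lambda: [transformation(x) for x in big], number=3)
t_vectorize = timeit.timeit(lambda: transform_2(big), number=3)
t_native = timeit.timeit(lambda: transformation(big), number=3)

print("list comprehension:", t_list)
print("np.vectorize:      ", t_vectorize)
print("array expression:  ", t_native)
```

All three approaches compute the same values; only the place where the loop runs changes.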

## Multi-Dimensional Arrays

We learned in the last video how to generate simple arrays. Now let's generate multidimensional arrays. These are, as you might guess, arrays with multiple dimensions. Think matrices or cubes.

One of the simplest ways to create these is to reshape an existing array with the `reshape` method. The new shape has to hold exactly as many elements as the original, so the 25 elements below give us a 5 by 5 array.

In [38]:
npa = np.arange(25)

In [44]:
npa2 = npa.reshape((5,5))
npa2

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
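A short sketch of how `reshape` treats the element count, using a made-up 12-element array `a`. Passing `-1` for one dimension asks NumPy to infer it from the others.

```python
import numpy as np

a = np.arange(12)  # hypothetical example array

print(a.reshape((3, 4)).shape)   # 3 rows, 4 columns
print(a.reshape((2, -1)).shape)  # -1 is inferred as 6

try:
    a.reshape((5, 5))  # 12 elements cannot fill 25 slots
except ValueError as err:
    print("reshape failed:", err)
```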

To create a matrix of zeros we can use the `zeros` function along with the desired shape.

In [40]:
np.zeros((5,2))

array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

In [42]:
np.zeros((5,5))

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

What's cool is that numpy is designed so that a lot of outputs can be used directly as inputs. For example, we can pass the shape of one array straight into `zeros`.

In [46]:
np.zeros(npa2.shape)

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

This works because the `shape` attribute is a tuple, which is exactly what `zeros` expects.

In [48]:
npa2.shape

(5, 5)
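A related shortcut, shown here as a small sketch: `np.zeros_like` takes the array itself rather than its shape, and copies the dtype as well. The name `template` is just an illustrative placeholder.

```python
import numpy as np

template = np.arange(6).reshape((2, 3))  # hypothetical integer array

by_shape = np.zeros(template.shape)  # float64 by default
like = np.zeros_like(template)       # same shape and dtype as template

print(by_shape.dtype, like.dtype)
```

Which one you want depends on whether the result should keep the integer dtype or fall back to floats.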

So far these have all been two-dimensional arrays. Let's explore the third dimension.

In [49]:
np.arange(8).reshape(2,2,2)

array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])

Of course, when we start diving into theory, we can start going to higher and higher dimensions. Here's the fourth dimension!

In [50]:
np.zeros((4,4,4,4))

array([[[[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]]],


       [[[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]],

        [[ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.]]],

       ... (remaining output truncated)

For the most part, we'll be working in 2 dimensions. Just be aware that higher dimensions are there if you need them. Now you're in for a treat because we're going to see a lot of these things come together. You're going to love vectorization and boolean selection after this section.

First I'm going to set a seed so you can get the exact same results that I do. A seed is the starting value for numpy's pseudorandom number generator: the same seed always produces the same sequence of "random" numbers. It makes random results reproducible.

In [51]:
np.random.seed(10)
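A quick way to convince yourself that seeding makes results reproducible is to seed twice and compare the draws. This sketch uses `np.random.randint`, whose upper bound is exclusive; the variable names `first` and `second` are made up for the example.

```python
import numpy as np

np.random.seed(10)
first = np.random.randint(1, 11, 5)  # integers from 1 to 10

np.random.seed(10)  # reset to the same starting state
second = np.random.randint(1, 11, 5)

print(first, second)
print(np.array_equal(first, second))  # True: same seed, same draws
```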

In [52]:
npa2 = np.random.random_integers(1,10,25).reshape(5,5)
npa2

array([[10,  5,  1,  2, 10],
       [ 1,  2,  9, 10,  1],
       [ 9,  7,  5,  4,  1],
       [ 5,  7,  9,  2,  9],
       [ 5,  2,  4,  7,  6]])

In [53]:
npa3 = np.random.random_integers(1,10,25).reshape(5,5)
npa3

array([[ 4, 10,  7, 10,  2],
       [10,  5,  3,  7,  8],
       [ 9,  9, 10,  3,  1],
       [ 7,  8,  9,  2,  8],
       [ 2,  5,  1,  9,  6]])

Now let's say we want to multiply the first array by 2.

In [54]:
npa2 * 2

array([[20, 10,  2,  4, 20],
       [ 2,  4, 18, 20,  2],
       [18, 14, 10,  8,  2],
       [10, 14, 18,  4, 18],
       [10,  4,  8, 14, 12]])

Let's say we want to compare the values and see where `npa2` is less than `npa3`.

In [55]:
npa2 < npa3

array([[False,  True,  True,  True, False],
       [ True,  True, False, False,  True],
       [False,  True,  True, False, False],
       [ True,  True, False, False, False],
       [False,  True, False,  True, False]], dtype=bool)

How easy was that? Isn't that crazy? Think of all the for loops you'd have to write to achieve that same result!
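For comparison, here is roughly what the loop-based version of an elementwise comparison looks like. The small arrays `a` and `b` are made-up stand-ins, not the arrays from the cells above.

```python
import numpy as np

a = np.array([[10, 5, 1], [1, 2, 9]])  # small made-up arrays
b = np.array([[4, 10, 7], [10, 5, 3]])

# Explicit loops: build the boolean result one element at a time.
looped = np.empty(a.shape, dtype=bool)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        looped[i, j] = a[i, j] < b[i, j]

# Vectorized: one expression.
vectorized = a < b

print(np.array_equal(looped, vectorized))  # True
```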

We can also sum a boolean array to count how many elements satisfy a condition, since each `True` counts as 1. Here we count where `npa2` is greater than `npa3`.

In [56]:
(npa2 > npa3).sum()

8

Or when they're equal!

In [57]:
(npa2 == npa3).sum()

5
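The same boolean arrays can also be used to pull out the matching values, which is the boolean selection mentioned earlier. A minimal sketch with made-up arrays `a` and `b`:

```python
import numpy as np

a = np.array([[10, 5, 1], [1, 2, 9]])  # small made-up arrays
b = np.array([[4, 10, 7], [10, 5, 3]])

mask = a > b  # boolean array, same shape as a

print(mask.sum())              # True counts as 1, so this is 2
print(np.count_nonzero(mask))  # same count, different spelling
print(a[mask])                 # values of a where the mask is True: 10 and 9
```

Note that selecting with a mask returns a flat, one-dimensional array of the matching values.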

Just like with vectors, we can get minimums and maximums super easily.

In [58]:
npa2.min()

1

We can even do that by row or by column using the `axis` argument.

In [60]:
npa2

array([[10,  5,  1,  2, 10],
       [ 1,  2,  9, 10,  1],
       [ 9,  7,  5,  4,  1],
       [ 5,  7,  9,  2,  9],
       [ 5,  2,  4,  7,  6]])

In [64]:
npa2.max(axis=1) # by row

array([10, 10,  9,  9,  7])

In [65]:
npa2.max(axis=0) # by column

array([10,  7,  9, 10, 10])
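The `axis` argument names the dimension that gets collapsed, which can feel backwards at first. A small sketch on a hypothetical 2 by 3 array `m`:

```python
import numpy as np

m = np.array([[10, 5, 1],
              [1, 2, 9]])  # hypothetical 2x3 array

col_max = m.max(axis=0)     # collapse the rows: one value per column
row_max = m.max(axis=1)     # collapse the columns: one value per row
row_pos = m.argmax(axis=1)  # index of each row's maximum

print(col_max)  # values 10, 5, 9
print(row_max)  # values 10, 9
print(row_pos)  # values 0, 2
```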

We've also got plenty of other functions, like the transpose, which is available both as the `transpose` method and as the `T` attribute.

In [66]:
npa2.transpose()

array([[10,  1,  9,  5,  5],
       [ 5,  2,  7,  7,  2],
       [ 1,  9,  5,  9,  4],
       [ 2, 10,  4,  2,  7],
       [10,  1,  1,  9,  6]])

In [67]:
npa2.T

array([[10,  1,  9,  5,  5],
       [ 5,  2,  7,  7,  2],
       [ 1,  9,  5,  9,  4],
       [ 2, 10,  4,  2,  7],
       [10,  1,  1,  9,  6]])

In [68]:
npa2.T == npa2.transpose()

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)

We can also multiply arrays very easily. The `*` operator does value-by-value (elementwise) multiplication; other products, such as the matrix product, have their own functions like `np.dot`.

In [70]:
npa2 * npa3

array([[40, 50,  7, 20, 20],
       [10, 10, 27, 70,  8],
       [81, 63, 50, 12,  1],
       [35, 56, 81,  4, 72],
       [10, 10,  4, 63, 36]])
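To make the difference between the elementwise product and the matrix product concrete, here is a sketch on two small made-up matrices `x` and `y`. It uses `dot`, which works on the Python 3.3 shown at the top; on Python 3.5 and later with a recent NumPy, `x @ y` is an equivalent spelling.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])  # hypothetical 2x2 matrices
y = np.array([[5, 6], [7, 8]])

elementwise = x * y        # multiplies matching positions
matrix_product = x.dot(y)  # rows of x times columns of y

print(elementwise)     # [[5, 12], [21, 32]]
print(matrix_product)  # [[19, 22], [43, 50]]
```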

We can also flatten and ravel the array. The results look the same, but `flatten` always returns a new copy, while `ravel` returns a view that points to the original data whenever possible.

In [73]:
np2 = npa2.flatten()
np2

array([10,  5,  1,  2, 10,  1,  2,  9, 10,  1,  9,  7,  5,  4,  1,  5,  7,
        9,  2,  9,  5,  2,  4,  7,  6])

In [74]:
np2[0] = 25

In [75]:
npa2

array([[10,  5,  1,  2, 10],
       [ 1,  2,  9, 10,  1],
       [ 9,  7,  5,  4,  1],
       [ 5,  7,  9,  2,  9],
       [ 5,  2,  4,  7,  6]])

In [76]:
r = npa2.ravel()
r

array([10,  5,  1,  2, 10,  1,  2,  9, 10,  1,  9,  7,  5,  4,  1,  5,  7,
        9,  2,  9,  5,  2,  4,  7,  6])

In [77]:
r[0] = 25

In [78]:
npa2

array([[25,  5,  1,  2, 10],
       [ 1,  2,  9, 10,  1],
       [ 9,  7,  5,  4,  1],
       [ 5,  7,  9,  2,  9],
       [ 5,  2,  4,  7,  6]])
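Instead of modifying an element and checking the original, you can ask NumPy directly whether two arrays share memory. This sketch assumes a newer NumPy than the 1.9.2 shown at the top, since `np.shares_memory` was added in a later release; `base` is a made-up array.

```python
import numpy as np

base = np.arange(6).reshape((2, 3))  # hypothetical example array

flat_copy = base.flatten()  # always a copy
flat_view = base.ravel()    # a view here, because base is contiguous
t_ravel = base.T.ravel()    # transposed data is not contiguous, so this copies

print(np.shares_memory(base, flat_copy))  # False
print(np.shares_memory(base, flat_view))  # True
print(np.shares_memory(base, t_ravel))    # False
```

The last line is the "whenever possible" caveat: `ravel` falls back to a copy when the elements cannot be read in order without rearranging them.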


Now we can use some other helpful functions like `cumsum` and `cumprod` to get the cumulative sums and products. These work on arrays of any dimension; without an `axis` argument the array is flattened first, as in the outputs below.

In [79]:
npa2.cumsum()

array([ 25,  30,  31,  33,  43,  44,  46,  55,  65,  66,  75,  82,  87,
        91,  92,  97, 104, 113, 115, 124, 129, 131, 135, 142, 148])

In [80]:
npa2.cumprod()

array([              25,              125,              125,
                    250,             2500,             2500,
                   5000,            45000,           450000,
                 450000,          4050000,         28350000,
              141750000,        567000000,        567000000,
             2835000000,      19845000000,     178605000000,
           357210000000,    3214890000000,   16074450000000,
         32148900000000,  128595600000000,  900169200000000,
       5401015200000000])
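Both functions also accept an `axis` argument if you want running totals along rows or columns instead of over the flattened array. A minimal sketch on a made-up 2 by 3 array `m`:

```python
import numpy as np

m = np.arange(1, 7).reshape((2, 3))  # [[1, 2, 3], [4, 5, 6]]

flat_sum = m.cumsum()        # flattened: 1, 3, 6, 10, 15, 21
down_cols = m.cumsum(axis=0)   # running sum down each column
along_rows = m.cumsum(axis=1)  # running sum along each row
row_prod = m.cumprod(axis=1)   # running product along each row

print(flat_sum)
print(down_cols)   # [[1, 2, 3], [5, 7, 9]]
print(along_rows)  # [[1, 3, 6], [4, 9, 15]]
print(row_prod)    # [[1, 2, 6], [4, 20, 120]]
```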


That really covers a lot of the basic functions you're going to use or need when working with pandas, but it is worth being aware that numpy is a very deep library that does a lot more than I've covered here. I wanted to cover these basics because they're going to come up when we're working with pandas. I'm sure this has felt fairly academic at this point, but I can promise you that it provides a valuable foundation for pandas.

If there's anything you have questions about, feel free to ask along the side and I can create some appendix videos to help you along.
