# NumPy

We've talked a bit about NumPy before, but this is going to be a much deeper dive into the subject.

To Review:


NumPy is the fundamental package for scientific computing in Python. It contains:

- N-dimensional array objects
- Vectorization of functions
- Tools for integrating C/C++ and Fortran code
- Linear algebra, Fourier transform, and random number tools

NumPy is the basis for a lot of other packages, like scikit-learn, SciPy, and pandas, among others. Those packages keep NumPy fairly well hidden, but it provides a lot of power in and of itself, and it gives us a strong foundation for learning some of the operations and concepts we’ll be applying later on.

http://numpy.org

Let’s go ahead and get started.

In [1]:
import sys
print(sys.version)

3.3.5 |Anaconda 2.2.0 (64-bit)| (default, Sep  2 2014, 13:55:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]


First we'll need to import the library. Remember it's the convention to rename it np. You should rename it using the 'as' keyword every time you import it.

In [7]:
import numpy as np
print(np.__version__)

1.9.2


One of the reasons that NumPy is such a popular tool is that it is more efficient than regular Python lists. Why is that, you might ask?

Without going into too much detail: NumPy is typically built against a BLAS (Basic Linear Algebra Subprograms) library that lives on your computer.

This allows for super efficient matrix multiplication and other operations to take place. It won't always be faster, but as a rule of thumb it's going to be faster and more efficient. We're going to skip the details entirely, just trust that it's true.
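To make the matrix multiplication claim a little more concrete, here is a minimal sketch that multiplies two small matrices twice: once with a hand-written triple loop (the helper `matmul_loops` is hypothetical, written only for this comparison) and once with `np.dot`. It only checks that the two agree. Whether BLAS is actually used, and how large the speedup is, depends on how your NumPy was built and on the size of the matrices.

```python
import numpy as np

def matmul_loops(x, y):
    """Multiply two matrices (lists of lists) with plain Python loops."""
    rows, inner, cols = len(x), len(y), len(y[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                result[i][j] += x[i][k] * y[k][j]
    return result

m1 = [[1.0, 2.0], [3.0, 4.0]]
m2 = [[5.0, 6.0], [7.0, 8.0]]

slow = matmul_loops(m1, m2)               # pure Python, one multiply at a time
fast = np.dot(np.array(m1), np.array(m2)) # NumPy hands the work to compiled code

print(slow)                               # [[19.0, 22.0], [43.0, 50.0]]
print(np.allclose(slow, fast))            # True
```

On matrices this small the difference in speed is negligible; the gap only becomes visible as the matrices grow.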

Let's start playing around with the NumPy api.

We've created ranges with `range`, now let's use NumPy's version, `arange`.

In [8]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [10]:
a = np.arange(10)

Another advantage of NumPy is all the baked-in functions. For example, we can get a lot of standard statistics just through method calls, including the mean, sum, maximum, minimum, and standard deviation.

In [11]:
a.mean()

4.5

In [12]:
a.sum()

45

In [13]:
a.max()

9

In [14]:
a.min()

0

In [15]:
a.std()

2.8722813232690143

Now NumPy arrays can be iterated over just like Python lists, so we can do things like list comprehensions with them, but that's really not the recommended way of doing things. The recommended way is to vectorize our computation.

In [17]:
%timeit [x*x for x in a]

The slowest run took 5.94 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.69 µs per loop


In [18]:
%timeit a*a

The slowest run took 36.36 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 365 ns per loop


$ \mu s $ = microseconds or $ 10^{-6} $ seconds

ns = nanoseconds or $ 10^{-9} $ seconds

So from this we can see that `a*a` is between four and five times faster than the list comprehension, even on this tiny array, because the operation can be *vectorized*.

Vectorization in simplest terms allows you to apply a function to an entire array instead of doing it value by value - similar to what we were doing with map and filter in the previous lessons. 

This typically makes things much more concise, readable and faster. This isn't always the case but it's a good rule of thumb, especially when we move into more complicated analysis.
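As a small illustration of the difference, here is a sketch that computes square roots two ways: value by value with `map`, and in one vectorized call with `np.sqrt`. The array `values` is just an example input chosen for this comparison.

```python
import math
import numpy as np

values = np.arange(1, 6)

# Value by value: map applies math.sqrt to one element at a time.
roots_map = list(map(math.sqrt, values))

# Vectorized: np.sqrt is applied to the whole array in a single call.
roots_vec = np.sqrt(values)

print(np.allclose(roots_map, roots_vec))  # True
```

Both give the same numbers; the vectorized form simply leaves the looping to NumPy.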

A good rule of thumb is that if you’re hand-writing for loops over NumPy arrays or over certain things in pandas, you're probably doing it wrong. Loops are a poor fit here because they can introduce errors and they force us to think in steps instead of functional transformations.

We'll go a bit deeper on this topic shortly but first we're going to cover Boolean Selection.

## Boolean Selection

Boolean selection is super cool because it allows us to think very declaratively instead of imperatively. This means that we can ask the computer for what we want instead of exactly how to get there. The computer can then do it in a more efficient way than we could. For example:

In [20]:
a = np.arange(20)

In [21]:
[x for x in a if x % 2 == 0]

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

This is imperative because we're dictating what the computer should do. Go through each value in this list and if it's divisible by two keep it. Compare that to:

In [24]:
a[a % 2 == 0]

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

This is a lot shorter and certainly introduces some new syntax, so let's break it apart. In effect, rather than checking each value one at a time, we apply the check to the whole array at once and get back an array of results.

In [25]:
a % 2 == 0

array([ True, False,  True, False,  True, False,  True, False,  True,
       False,  True, False,  True, False,  True, False,  True, False,
        True, False], dtype=bool)

Which should remind you of our filter and map functions!

In [26]:
list(map(lambda x: x % 2 == 0, a))

[True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False]

It's, in essence, a more concise filter operation and is known as boolean selection. Boolean values are True or False, so we're selecting the positions where the value is True; it's as simple as that. We then filter our original array by those values using the bracket notation.
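To see the two steps separately, the sketch below stores the boolean array in a variable (the name `mask` is just a convention used here) before using it to select, and compares the result with the built-in `filter` function.

```python
import numpy as np

a = np.arange(20)

mask = a % 2 == 0            # boolean array, one True/False per element
evens = a[mask]              # keep only the positions where mask is True

# The same selection written with the built-in filter function.
evens_filter = list(filter(lambda x: x % 2 == 0, a))

print(evens.tolist())        # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(mask.sum())            # 10, because each True counts as 1
```

Keeping the mask around is handy when you want to reuse the same selection or count how many values matched.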

So who cares, it's shorter right? Well let's see how that ends up mattering.

In [27]:
np2 = np.arange(20000)

In [28]:
%timeit [x for x in np2 if x % 2 == 0]

100 loops, best of 3: 4 ms per loop


In [29]:
%timeit np2[np2 % 2 == 0]

1000 loops, best of 3: 254 µs per loop


We can see that it is about 16 times faster (254 µs versus 4 ms) than the list comprehension, and that's because it's optimized under the hood. Additionally, it allows for chaining operations in ways that look a lot better.

In [38]:
a[(a > 15) | (a < 5)]

array([ 0,  1,  2,  3,  4, 16, 17, 18, 19])

Think about how complicated that query would have to be if we were going to do it in a for loop or in a list comprehension. With boolean selection it's just that easy.

However, you've got to be sure to always put parentheses around each comparison, otherwise it won't work: `|` and `&` bind more tightly than comparisons like `>` and `<`, so Python won't know how to combine the different statements.
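Here is a short sketch of what goes wrong without the parentheses. The `try`/`except` is only there so the example runs to the end; in a notebook you would simply see the error.

```python
import numpy as np

a = np.arange(20)

# With parentheses, each comparison produces its own boolean array first.
selected = a[(a > 15) | (a < 5)]
print(selected.tolist())     # [0, 1, 2, 3, 4, 16, 17, 18, 19]

# Without them, Python evaluates 15 | a first and then chains the comparisons,
# which asks for the truth value of a whole array and fails.
try:
    a[a > 15 | a < 5]
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)
```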

On a final note, we've got to be sure that we're filtering with an array of the same size - otherwise it won't know exactly how to filter it.
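A quick sketch of the size requirement: the mask `short_mask` below is a made-up example with only three entries. Recent NumPy versions raise an `IndexError` in this situation; older releases may behave differently, so treat the exact error as an assumption about your installed version.

```python
import numpy as np

a = np.arange(20)
short_mask = np.array([True, False, True])   # 3 entries for a 20-element array

try:
    a[short_mask]
    error_name = None
except IndexError as err:
    error_name = type(err).__name__
    print("IndexError:", err)
```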

## Helpful Methods & Shortcuts
Now we’ve got our array and a method or two that we can use to create them. Let’s learn a bit more about some of the functions and methods that arrays give us.

In [53]:
npa = np.arange(25)

In [54]:
print(type(npa))

<class 'numpy.ndarray'>


In [55]:
print(npa.dtype)

int64


Now NumPy has different data types that are more specific than the ones we have in regular Python. That's because we typically need more control over precision and memory than we would in a regular Python program.

Here's a list of the types, each followed by its description:

- bool_
    - Boolean (True or False) stored as a byte
- int_
    - Default integer type (same as C long; normally either int64 or int32)
- intc 
    - Identical to C int (normally int32 or int64)
- intp 
    - Integer used for indexing (same as C ssize_t; normally either int32 or int64)
- int8 
    - Byte (-128 to 127)
- int16 
    - Integer (-32768 to 32767)
- int32 
    - Integer (-2147483648 to 2147483647)
- int64 
    - Integer (-9223372036854775808 to 9223372036854775807)
- uint8 
    - Unsigned integer (0 to 255)
- uint16 
    - Unsigned integer (0 to 65535)
- uint32 
    - Unsigned integer (0 to 4294967295)
- uint64 
    - Unsigned integer (0 to 18446744073709551615)
- float_
    - Shorthand for float64.
- float16
    - Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
- float32
    - Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
- float64
    - Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
- complex_
    - Shorthand for complex128.
- complex64 
    - Complex number, represented by two 32-bit floats (real and imaginary components)
- complex128 
    - Complex number, represented by two 64-bit floats (real and imaginary components)

Because of how computers store values in memory, there are limits to the size of numbers we can use in our programs. This is super important to be aware of when you're developing programs, because choosing a type that won't fit your values will cause serious problems.
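To get a feel for those limits, the sketch below asks NumPy for the range of `int8` with `np.iinfo` and then pushes an array one past the maximum. The variable names are only for illustration.

```python
import numpy as np

# np.iinfo reports the smallest and largest value an integer type can hold.
info = np.iinfo(np.int8)
print(info.min, info.max)                  # -128 127

# Array arithmetic that exceeds the range wraps around instead of raising.
small = np.array([127], dtype=np.int8)
wrapped = small + np.array([1], dtype=np.int8)
print(wrapped.tolist())                    # [-128]

# A wider type holds the same result without trouble.
wide = small.astype(np.int16) + 1
print(wide.tolist())                       # [128]
```

The wraparound happens silently for arrays, which is exactly why picking a type that fits your values matters.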

We can get at these types by name under the `np.` module. Some of them, such as `float128`, are only available on certain platforms.

In [83]:
np.float128(25)

25.0

In [84]:
np.int16(234)

234

It’s worth stating that NumPy arrays can be created on the fly simply by passing a list into the `array` function.

By default, NumPy will try to infer the type from the values you pass in, but on creation you can also specify which type you would like. All elements of a NumPy array have to have the same type; you can’t have a mix of floats and integers, for example.

In [56]:
np.array(range(20))

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [57]:
np.array(range(20)).dtype

dtype('int64')

In [58]:
np.array([1.0,0,2,3])

array([ 1.,  0.,  2.,  3.])

In [59]:
np.array([1.0,0,2,3]).dtype

dtype('float64')

This allows for consistent memory allocation and vectorization. Sometimes, though, you'll get types you're not quite expecting, because NumPy converts every value to a single type that can represent all of them.

In [60]:
np.array([True,2,2.0])

array([ 1.,  2.,  2.])

In [61]:
np.array([True,2,2.0]).dtype

dtype('float64')

If you have any worry about what kind of array you’re creating, just specify the type.

In [62]:
np.array([True, 1, 2.0], dtype='bool_')

array([ True,  True,  True], dtype=bool)

In [63]:
np.array([True, 1, 2.0], dtype='float_')

array([ 1.,  1.,  2.])

In [64]:
np.array([True, 1, 2.0], dtype='uint8')

array([1, 1, 2], dtype=uint8)

NumPy arrays also have some nifty attributes that make our lives a lot easier. We can get the `size`, which for a one-dimensional array is equivalent to `len` in raw Python. We can get the `shape`, which gives us a hint as to where we’ll be going next: towards multidimensional arrays.

In [67]:
npa.size

25

In [68]:
npa.shape

(25,)
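The difference between `size` and `len` only shows up once an array has more than one dimension. As a small preview, the sketch below reshapes six values into a 2 by 3 grid (`flat` and `grid` are example names):

```python
import numpy as np

flat = np.arange(6)
grid = flat.reshape(2, 3)     # same six values arranged as 2 rows, 3 columns

print(flat.size, len(flat), flat.shape)   # 6 6 (6,)
print(grid.size, len(grid), grid.shape)   # 6 2 (2, 3)
```

`size` always counts every element, while `len` only reports the length of the first dimension.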

As we mentioned before, we can also get things like the maximum, the minimum, the mean, the standard deviation, and variance.

In [69]:
print(npa.min(), npa.max(), npa.std(), npa.var())

0 24 7.21110255093 52.0


We can also do super useful things like get the location of certain values, kind of like `index` in regular lists.

Here are the methods for the location of minimum and maximum values.

In [70]:
npa.argmin(), npa.argmax()

(0, 24)
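For locating values other than the minimum and maximum, one option is to combine a boolean condition with `np.flatnonzero`, which returns the positions where the condition is True. This is a sketch of one approach, not the only way to do it.

```python
import numpy as np

npa = np.arange(25)

# Positions where the condition holds, similar in spirit to list.index
# but returning every match rather than only the first.
where_seven = np.flatnonzero(npa == 7)
where_large = np.flatnonzero(npa > 20)

print(where_seven.tolist())     # [7]
print(where_large.tolist())     # [21, 22, 23, 24]
```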

Now that we’ve seen how we can play around with arrays and some of their helper functions, we can elaborate on the different ways of creating them. We’ve seen `arange`, but we can actually do some cool things with it. We can count up or down.

In [71]:
?np.arange

Just like `range`, we can specify the step we want to use when we create an array. Like by two, for example.

In [72]:
np.arange(0,20,2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
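Counting down works the same way with a negative step, and unlike `range`, `arange` also accepts a fractional step. A quick sketch:

```python
import numpy as np

down = np.arange(10, 0, -2)       # start at 10, stop before 0, step -2
quarters = np.arange(0, 1, 0.25)  # fractional steps are allowed

print(down.tolist())              # [10, 8, 6, 4, 2]
print(quarters.tolist())          # [0.0, 0.25, 0.5, 0.75]
```

As with `range`, the stop value itself is never included.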

We can also fill in between two numbers. Say we want 1 to 2 broken up into 5 numbers.


In [73]:
np.linspace(1,2,5)

array([ 1.  ,  1.25,  1.5 ,  1.75,  2.  ])

In [74]:
np.linspace(0,10,11)

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])
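Note that `linspace` includes the stop value by default, which is the opposite of `arange`. The sketch below shows the `endpoint` and `retstep` parameters that control and report this behaviour.

```python
import numpy as np

# Default: the stop value is included as the last point.
closed = np.linspace(0, 1, 5)
print(closed.tolist())            # [0.0, 0.25, 0.5, 0.75, 1.0]

# endpoint=False leaves it out, like arange does.
half_open = np.linspace(0, 1, 4, endpoint=False)
print(half_open.tolist())         # [0.0, 0.25, 0.5, 0.75]

# retstep=True also returns the spacing between points.
points, step = np.linspace(0, 10, 11, retstep=True)
print(step)                       # 1.0
```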

And of course we can convert that to a different type if we need to, but be careful: converting floats to integers drops the fractional part, so we can lose information this way.

In [76]:
np.linspace(1,2,5).astype('int')

array([1, 1, 1, 1, 2])

In [77]:
np.linspace(0,10,11).astype('int')

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
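If truncation is not what you want, one option is to round first and convert afterwards. The sketch below uses `np.rint`, which rounds to the nearest integer and sends exact halves to the nearest even number.

```python
import numpy as np

values = np.linspace(1, 2, 5)            # 1.0, 1.25, 1.5, 1.75, 2.0

truncated = values.astype('int')         # drops the fractional part
rounded = np.rint(values).astype('int')  # rounds to the nearest integer first

print(truncated.tolist())                # [1, 1, 1, 1, 2]
print(rounded.tolist())                  # [1, 1, 2, 2, 2]
```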

## NaN values

Now there is one value that hasn’t been listed yet and is a bit difficult to describe: `nan`, the not-a-number value.

This will certainly come up and can throw off your analysis. `NaN`s are technically floats, and they come up as the result of certain operations. One way to get one is from an undefined division, such as zero divided by zero, that doesn't return a real number.

In [79]:
problem = np.array([0.])/np.array([0.])

In [81]:
problem

array([ nan])

We can also get this value from:

In [85]:
np.nan

nan
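One property worth knowing before you go looking for these values: `nan` is not equal to anything, including itself, so an `==` comparison will never find it. The sketch below uses `np.isnan` instead (the array `values` is a made-up example).

```python
import numpy as np

print(np.nan == np.nan)        # False: NaN never compares equal, even to itself
print(np.isnan(np.nan))        # True

values = np.array([1.0, np.nan, 3.0])
nan_mask = np.isnan(values)
print(nan_mask.tolist())       # [False, True, False]
```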

So what's the big deal, you might ask? Well, in your work you might come across these and have to deal with them. If they're inside of an array, you're going to have to make sure that they're handled correctly because they'll throw off your computations. For example:

In [86]:
ar = np.linspace(0,10,11).astype(np.float32)
ar

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.], dtype=float32)

In [87]:
ar[0] = np.nan
ar

array([ nan,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.], dtype=float32)

In [88]:
ar.min()

nan

In [89]:
ar.max()

nan

In [90]:
ar.mean()

nan

As I said, we’ll dive into this in more detail in the future. I just want you to be aware that there is a representation of invalid float values in NumPy and pandas. NumPy does not skip them unless you are explicit about it, while pandas does some of that work for you.
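As a brief sketch of what being explicit can look like, NumPy ships NaN-aware versions of the common reductions (`np.nanmin`, `np.nanmax`, `np.nanmean`), and the boolean selection from earlier works too. The array is rebuilt here so the example runs on its own.

```python
import numpy as np

ar = np.linspace(0, 10, 11).astype(np.float32)
ar[0] = np.nan

# NaN-aware versions skip the missing value instead of propagating it.
print(np.nanmin(ar), np.nanmax(ar), np.nanmean(ar))   # 1.0 10.0 5.5

# Equivalent approach: drop the NaNs with a boolean mask, then reduce as usual.
clean = ar[~np.isnan(ar)]
print(clean.mean())                                    # 5.5
```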

Now that we understand more about array creation, some array properties, and data types, let’s dive a bit more into vectorization.