# NumPy

We have talked a bit about NumPy before, but this is going to be a much deeper look into the subject.

To review:


NumPy is the fundamental base for scientific computing in Python. It contains:

- N-dimensional array objects
- Vectorization of functions
- Tools for integrating C/C++ and Fortran code
- Linear algebra, Fourier transformation, and random number tools

NumPy is the basis for a lot of other packages, including scikit-learn, SciPy, and pandas, but it provides a lot of power in and of itself. NumPy remains pretty abstract; however, it provides a strong foundation for us to learn some of the operations and concepts we will be applying later on.

http://numpy.org

Let’s get started.

In [1]:
import sys
print(sys.version)

3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]


First we will need to import the library. Remember that the convention is to alias it as `np`, using the `as` keyword every time you import it.

In [2]:
import numpy as np
print(np.__version__)

1.9.2


One of the reasons that NumPy is such a popular tool is that its arrays are more efficient than regular Python lists for numerical work.

Why is this? Without going into too much detail, NumPy stores values of a single type in a contiguous block of memory and runs its operations in compiled code. For linear algebra, such as matrix multiplication, it can also call the Basic Linear Algebra Subprograms (BLAS) library available on your computer. It will not always be faster, but as a rule of thumb it will be faster and more efficient.

Let's start playing around with the NumPy API.

We have created ranges with `range` before. Going forward, we will use NumPy's version, `arange`.

In [3]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]:
a = np.arange(10)

Another advantage of NumPy is all the included functions. For example, we can get a lot of standard statistics, including the mean, sum, maximum, minimum, and standard deviation, just through method calls.

In [5]:
a.mean()

4.5

In [6]:
a.sum()

45

In [7]:
a.max()

9

In [8]:
a.min()

0

In [9]:
a.std()

2.8722813232690143
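It is worth knowing which standard deviation this is. By default, `std` divides by the number of values (the population formula) rather than by one less than that (the sample formula). The following is a minimal sketch that reproduces the value above by hand; the names `manual_std` and `sample_std` are made up for this illustration.

```python
import numpy as np

a = np.arange(10)

# Population standard deviation: square root of the mean squared deviation.
manual_std = np.sqrt(((a - a.mean()) ** 2).mean())

# Passing ddof=1 divides by N - 1 instead of N (the sample formula).
sample_std = a.std(ddof=1)

print(manual_std)  # agrees with a.std()
print(sample_std)  # slightly larger than a.std()
```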

NumPy arrays are iterable, just like lists in Python, so we can do things like list comprehensions over them, but that is really not the recommended way of doing things. The recommended way is to vectorize our computation.

In [10]:
%timeit [x*x for x in a]

The slowest run took 6.97 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 3.7 µs per loop


In [11]:
%timeit a*a

The slowest run took 20.24 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 643 ns per loop


$ \mu s $ = microseconds or $ 10^{-6} $ seconds

ns = nanoseconds or $ 10^{-9} $ seconds

From this we can see that `a*a` is several times faster than the list comprehension (643 ns versus 3.7 µs) because the multiplication can be *vectorized*.

Vectorization in simplest terms allows you to apply a function to an entire array instead of doing it value by value, similar to what we did with map and filter in previous lessons. 

This typically makes things much more concise, readable, and faster. This is not always the case, but it is a good rule of thumb, especially when we move into more complicated analysis.
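To make the comparison with `map` concrete, here is a small sketch that squares every value both ways. The variable names are chosen only for this example.

```python
import numpy as np

a = np.arange(10)

# Value by value: map calls the Python function once per element.
squares_map = list(map(lambda x: x * x, a))

# Vectorized: one expression applied to the whole array at once.
squares_vec = a * a

# NumPy's own functions, such as np.sqrt, work on whole arrays the same way.
roots = np.sqrt(a)

print(squares_vec)
print(roots)
```

Both approaches produce the same squares; the vectorized form simply leaves the looping to NumPy.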

Another good rule of thumb is that if you are hand-writing for loops over NumPy arrays or pandas objects, you are probably doing it wrong. Loops are bad because they can introduce errors and force us to think in steps instead of functional transformations.

We will further discuss this topic shortly, but first we will cover Boolean selection.

## Boolean Selection

Boolean selection is useful because it allows us to think declaratively instead of imperatively. This means that we can ask the computer for what we want instead of exactly how to get there. The computer can then do it in a more efficient way than we could, as you can see in this example.

In [12]:
a = np.arange(20)

In [13]:
[x for x in a if x % 2 == 0]

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

This is imperative because we are dictating what the computer should do: go through each value in this list, and if it is divisible by 2, keep it. Compare that to the following example.

In [14]:
a[a % 2 == 0]

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

This is a lot shorter and certainly introduces some new syntax, so let's break it apart. In effect, rather than testing each value for divisibility by 2 one at a time, we test the whole array at once and get back an array of Boolean values.

In [15]:
a % 2 == 0

array([ True, False,  True, False,  True, False,  True, False,  True,
       False,  True, False,  True, False,  True, False,  True, False,
        True, False], dtype=bool)

This should remind you of our `filter` and `map` functions.

In [16]:
list(map(lambda x: x % 2 == 0, a))

[True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False]

This is essentially a more concise filter operation and is known as Boolean selection. Boolean values are `True` or `False`, so we are selecting what is `True`; it is as simple as that. We then filter our original array by those values using the bracket notation.

Why is it important that it is shorter? Let's take a look at why that matters.

In [27]:
np2 = np.arange(20000)

In [28]:
%timeit [x for x in np2 if x % 2 == 0]

100 loops, best of 3: 4 ms per loop


In [29]:
%timeit np2[np2 % 2 == 0]

1000 loops, best of 3: 254 µs per loop


We can see that Boolean selection is more than an order of magnitude faster here (about 254 µs versus 4 ms) than the standard pythonic way, because the comparison and the selection run in NumPy's optimized, compiled code rather than in a Python loop. Additionally, it allows for chaining operations in ways that are much easier to understand than a complicated list comprehension.

In [17]:
a[(a > 15) | (a < 5)]

array([ 0,  1,  2,  3,  4, 16, 17, 18, 19])

Think about how complicated that query would have to be if we were going to do it in a for loop or in a list comprehension. With Boolean selection it is just that easy.

You must be sure to use parentheses around each comparison. The `|` and `&` operators bind more tightly than comparisons such as `>` and `<`, so without parentheses Python groups the expression differently and it will not work as intended.
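A short sketch of what goes wrong without the parentheses, using the same array as above. The `failed` flag is only there to make the outcome visible.

```python
import numpy as np

a = np.arange(20)

# With parentheses, each comparison produces a Boolean array first,
# and | then combines the two arrays element by element.
selected = a[(a > 15) | (a < 5)]

# Without parentheses, Python evaluates 15 | a first and then chains the
# comparisons, which asks for the truth value of a whole array and fails.
try:
    a[a > 15 | a < 5]
    failed = False
except ValueError:
    failed = True

print(selected)
print(failed)
```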

On a final note, we must be sure that we are filtering with a Boolean array of the same size as the array being filtered; otherwise, NumPy will not know how to line up the two.
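As an illustration of the size requirement, the sketch below builds one mask that matches the array and one that does not. It assumes a reasonably recent NumPy release, where a mismatched mask raises an `IndexError`; older releases may behave differently. The name `short_mask` is hypothetical.

```python
import numpy as np

a = np.arange(20)

# A mask with one Boolean per element works as expected.
mask = a % 2 == 0
evens = a[mask]

# A mask of a different length cannot be lined up with the array.
short_mask = np.array([True, False, True])
try:
    a[short_mask]
    mismatch_rejected = False
except IndexError:
    mismatch_rejected = True

print(evens)
print(mismatch_rejected)
```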

## Helpful Methods and Shortcuts
Now we have our array, and we have a method or two that we can use to create them. Let’s learn a bit more about some of the functions and methods that arrays give us.

In [18]:
npa = np.arange(25)

In [19]:
print(type(npa))

<class 'numpy.ndarray'>


In [20]:
print(npa.dtype)

int64


NumPy has different data types that are more specific than the ones we have in regular Python because NumPy arrays need every item to take up the same amount of memory. This is part of what makes them so efficient for computation.
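One way to see the fixed per-item memory is through the `itemsize` and `nbytes` attributes. This is a minimal sketch; the dtype is set explicitly here so the numbers do not depend on the platform default.

```python
import numpy as np

npa = np.arange(25, dtype=np.int64)  # dtype fixed for the example

print(npa.itemsize)  # bytes used by each element
print(npa.nbytes)    # total bytes: itemsize times the number of elements
```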

Here is a list of all the types:

- bool_
    - Boolean (True or False) stored as a byte
- int_
    - Default integer type (same as C long; normally either int64 or int32)
- intc 
    - Identical to C int (normally int32 or int64)
- intp 
    - Integer used for indexing (same as C ssize_t; normally either int32 or int64)
- int8 
    - Byte (-128 to 127)
- int16 
    - Integer (-32768 to 32767)
- int32 
    - Integer (-2147483648 to 2147483647)
- int64 
    - Integer (-9223372036854775808 to 9223372036854775807)
- uint8 
    - Unsigned integer (0 to 255)
- uint16 
    - Unsigned integer (0 to 65535)
- uint32 
    - Unsigned integer (0 to 4294967295)
- uint64 
    - Unsigned integer (0 to 18446744073709551615)
- float_
    - Shorthand for float64.
- float16 
    - Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
- float32 
    - Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
- float64 
    - Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
- complex_
    - Shorthand for complex128.
- complex64 
    - Complex number, represented by two 32-bit floats (real and imaginary components)
- complex128 
    - Complex number, represented by two 64-bit floats (real and imaginary components)

Because of how computers store values in memory, there are limits to the size of numbers we can use in our programs. This is extremely important to be aware of when you are developing programs because choosing a type that will not fit your values will cause serious problems.

We can access these types by name under the `np.` namespace.

In [22]:
np.float128(25)

25.0

In [23]:
np.int16(234)

234
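To see the size limits mentioned above in practice, the following sketch asks NumPy for the range of `int8` and then steps past it. The array names are made up for this example. Note that the array addition wraps around silently rather than raising an error, which is exactly why choosing too small a type can cause problems.

```python
import numpy as np

# np.iinfo reports the smallest and largest value an integer type can hold.
limits = np.iinfo(np.int8)
print(limits.min, limits.max)

# Adding 1 to the largest int8 value wraps around to the smallest one.
x = np.array([127], dtype=np.int8)
one = np.array([1], dtype=np.int8)
wrapped = x + one
print(wrapped)
```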

It is worth stating that NumPy arrays can be created on the fly simply by passing a list into the `np.array` function.

By default, NumPy will try to guess what kinds of types you have, but on creation you can also specify which type you would like. NumPy arrays all must have the same type; you cannot have a mix of floats and integers, for example.

In [24]:
np.array(range(20))

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [25]:
np.array(range(20)).dtype

dtype('int64')

In [58]:
np.array([1.0,0,2,3])

array([ 1.,  0.,  2.,  3.])

In [59]:
np.array([1.0,0,2,3]).dtype

dtype('float64')

This allows for consistent memory allocation and vectorization, although sometimes you will get strange types when you are not quite expecting them.

In [60]:
np.array([True,2,2.0])

array([ 1.,  2.,  2.])

In [61]:
np.array([True,2,2.0]).dtype

dtype('float64')

If you are worried about the kind of array you are creating, just specify the type.

In [62]:
np.array([True, 1, 2.0], dtype='bool_')

array([ True,  True,  True], dtype=bool)

In [63]:
np.array([True, 1, 2.0], dtype='float_')

array([ 1.,  1.,  2.])

In [27]:
np.array([True, 1, 2.0], dtype='uint8')

array([1, 1, 2], dtype=uint8)

NumPy arrays also have some attributes that make our lives a lot easier. We can get the `size`, the total number of elements, which for a one-dimensional array is equivalent to `len` in raw Python. We can get the `shape`, which gives us a hint as to where we will be going next, toward multidimensional arrays.

In [28]:
npa.size

25

In [29]:
npa.shape

(25,)
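The difference between `size`, `shape`, and `len` only shows up once an array has more than one dimension. Here is a small preview, using a hypothetical 2-by-3 array named `grid`.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)  # hypothetical 2-by-3 example array

print(len(grid))   # length of the first dimension only
print(grid.size)   # total number of elements
print(grid.shape)  # size of each dimension
```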

As we mentioned before, we can also get things like the maximum, minimum, mean, standard deviation, and variance.

In [30]:
print(npa.min(), npa.max(), npa.std(), npa.var())

0 24 7.21110255093 52.0


We can also do useful things like get the location of certain values, a bit like `index` in regular lists.

Here are the methods for the location of minimum and maximum values.

In [31]:
npa.argmin(), npa.argmax()

(0, 24)

Now that we have seen how we can play around with arrays and some of their helper functions, we can elaborate on all the different ways of creating them. We have seen `arange`, but we can actually do some interesting things with it. We can count up or down.

In [32]:
?np.arange

Just like `range`, `arange` accepts a step to use when creating the array. For example, we can count by twos.

In [33]:
np.arange(0,20,2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
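Counting down works the same way with a negative step, and unlike `range`, `arange` also accepts fractional steps. A brief sketch, with variable names chosen for this example:

```python
import numpy as np

# Start at 10, stop before 0, step by -2.
down = np.arange(10, 0, -2)
print(down)

# Fractional steps are allowed; the stop value is still excluded.
quarters = np.arange(0, 1, 0.25)
print(quarters)
```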

We can also fill in between two numbers. Say we want 1 to 2 broken up into five numbers.


In [34]:
np.linspace(1,2,5)

array([ 1.  ,  1.25,  1.5 ,  1.75,  2.  ])

In [35]:
np.linspace(0,10,11)

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])
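The spacing `linspace` uses follows from its arguments: with the endpoint included, the step is (stop - start) / (num - 1). The sketch below checks this with the `retstep` option, which returns the step alongside the array, and shows the `endpoint=False` variant. The variable names are illustrative.

```python
import numpy as np

# retstep=True returns the array together with the spacing it used.
values, step = np.linspace(1, 2, 5, retstep=True)
manual_step = (2 - 1) / (5 - 1)
print(step, manual_step)

# endpoint=False leaves out the stop value, so the spacing
# becomes (stop - start) / num instead.
open_ended = np.linspace(0, 10, 10, endpoint=False)
print(open_ended)
```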

And of course we can convert that to a different type if we need to, but be careful: converting floats to integers drops the fractional part, so we can lose information this way.

In [36]:
np.linspace(1,2,5).astype('int')

array([1, 1, 1, 1, 2])

In [37]:
np.linspace(0,10,11).astype('int')

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
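If rounding is what you actually want, one option is to round before converting. The following sketch contrasts the two; it uses `np.rint`, which rounds to the nearest integer, and the variable names are only for illustration.

```python
import numpy as np

values = np.linspace(1, 2, 5)  # 1.0, 1.25, 1.5, 1.75, 2.0

# astype(int) simply drops the fractional part.
truncated = values.astype(int)

# Rounding first keeps each value as close as possible to the original.
rounded = np.rint(values).astype(int)

# Dropping the fraction moves negative values toward zero, not downward.
toward_zero = np.array([-1.7, 1.7]).astype(int)

print(truncated, rounded, toward_zero)
```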

## NaN values

There is one value that has not been covered yet, which is a bit difficult to describe: this is the `NaN` or "not a number" value.

This will certainly come up and can throw off your analysis. `NaN` values are technically floats, and they result from certain operations. One way this can happen is a division with no defined result, such as zero divided by zero.

In [79]:
problem = np.array([0.])/np.array([0.])



In [81]:
problem

array([ nan])

We can also get this value from the following.

In [85]:
np.nan

nan
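One property of `NaN` that often surprises people is that it does not compare equal to anything, including itself, so an equality test cannot be used to find it. A minimal sketch of how detection can be done with `np.isnan` instead (the array here is made up for the example):

```python
import numpy as np

# NaN is a float, but it never compares equal to itself.
print(type(np.nan))
print(np.nan == np.nan)

# np.isnan returns a Boolean array marking where the NaN values are.
values = np.array([np.nan, 1.0, 2.0])
nan_mask = np.isnan(values)
print(nan_mask)
```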

You may be wondering why this is important. You might come across these in your work and have to deal with them. If they are inside an array, you will need to make sure that they are handled correctly because they will throw off your computations. Take a look at the following example.

In [86]:
ar = np.linspace(0,10,11).astype(np.float32)
ar

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.], dtype=float32)

In [87]:
ar[0] = np.nan
ar

array([ nan,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.], dtype=float32)

In [88]:
ar.min()

nan

In [89]:
ar.max()

nan

In [90]:
ar.mean()

nan

We will get into this in more detail in the future. For now you should be aware that there is a representation of undefined float values in NumPy and pandas. NumPy does not skip them unless you ask it to explicitly, while pandas does some of that work for you.
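As a preview of what being explicit can look like, NumPy provides NaN-aware counterparts such as `np.nanmin`, `np.nanmax`, and `np.nanmean`, and Boolean selection can drop the `NaN` values first. This sketch rebuilds the same array as above; `clean` is a name chosen for this example.

```python
import numpy as np

ar = np.linspace(0, 10, 11).astype(np.float32)
ar[0] = np.nan

# The nan-prefixed functions ignore NaN values instead of propagating them.
print(np.nanmin(ar), np.nanmax(ar), np.nanmean(ar))

# Alternatively, select only the values that are not NaN, then compute.
clean = ar[~np.isnan(ar)]
print(clean.mean())
```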

Now that we understand more about array creation, some array properties, and data types, let’s look further into vectorization.