# Intro to Numpy 

Today, we're going to start off with a foray into the `Numpy` library, which is one of the fundamental packages for scientific computing in Python. It turns out that the `Pandas DataFrames` we work with are actually built on top of the `numpy array` (which we'll get to), so it's important to have some basic knowledge of what's running under the hood of our `DataFrames`. 

While `Numpy` offers an amazing amount of functionality (see the [docs](http://www.numpy.org/) for a better idea), one of its mainstays is the `numpy array` (an n-dimensional array), which is what we'll focus on. There are loads of things that you can do with `numpy arrays`, and today we'll introduce some of their capabilities. Learning everything `numpy arrays` can do really just takes working with them day in and day out, so again we'll aim for breadth over depth. We want you to walk away from these notes with an idea of the many things you can do with `numpy` and `numpy arrays`, and the types of situations where you would want to use them. 

Finally, this [cheat-sheet](https://github.com/sallamander/just-enough/blob/master/numpy.md) may come in handy. 

## Why does this matter?

As mentioned, `pandas` DataFrames are built on top of `numpy` arrays (in some sense DataFrames are two-dimensional arrays with row and column labels attached). At times, this means that you'll end up working with `numpy` arrays directly, so it's good to understand what you can do with them. 

## The basics of the Array

#### What's the big deal with Numpy Arrays?

What's so special about a `numpy array`? From a high level, they are kind of like lists - they just store a bunch of stuff in a container. It turns out, though, that a numpy array is much faster to interact with and perform certain types of calculations with than a standard Python list. Why is that, though? The two main reasons that they are faster are: 

1. They are stored as one contiguous block of memory, rather than being spread out across multiple locations like the objects in a list. 
2. Each item in a numpy array is of the same data type (i.e. all integers, all floats, etc.), rather than a conglomerate of any number of data types (as a list is). We call this idea homogeneity, as opposed to the possible heterogeneity of Python lists.
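Both properties can be inspected directly on an array. The snippet below is a minimal sketch; the variable names `arr` and `mixed` are made up for illustration, and the `int64` dtype is set explicitly so the byte counts are predictable.

```python
import numpy as np

# A small array with an explicit dtype so the sizes below are predictable.
arr = np.arange(5, dtype=np.int64)

print(arr.dtype)                  # every element shares this one type
print(arr.itemsize)               # bytes used per element
print(arr.nbytes)                 # total bytes in the data buffer
print(arr.flags['C_CONTIGUOUS'])  # whether the data sits in one contiguous block

# A list, by contrast, can mix types; each element is a separate Python object.
mixed = [1, 2.5, "three"]
mixed_types = [type(item).__name__ for item in mixed]
print(mixed_types)
```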

Just how much faster are they? Let's take the numbers from 0 to 1 million, and sum those numbers, timing it with both a list and a numpy array.

In [1]:
import numpy as np # Standard alias when importing numpy - follow this convention!
numpy_array = np.arange(0, 1000000)
python_list = list(range(1000000)) # list() keeps this an actual list on Python 3 as well

In [2]:
%timeit np.sum(numpy_array)

1000 loops, best of 3: 587 µs per loop


In [3]:
%timeit sum(python_list)

100 loops, best of 3: 8.34 ms per loop


In [4]:
%timeit sum(numpy_array)

10 loops, best of 3: 77.7 ms per loop


So, numpy runs more than an order of magnitude faster (roughly 14x in this run)! This is because of those two points above. Because numpy arrays store data in contiguous blocks of memory, they are able to take advantage of **vectorization**, which is the ability of a CPU to perform one operation on multiple pieces of data at once. In addition, since a numpy array knows the type of the objects it stores (and those types don't change), it doesn't have to waste time checking the type of each object (as a list does). The combination of these two things speeds up our calculation quite a bit.

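Notice, too, that we had to specify `np.sum()` - numpy's implementation of sum. When we just used the built-in Python `sum()` on the numpy array, the calculation was actually slower than summing the list (77.7 ms versus 8.34 ms)! This is because numpy arrays are optimized for vectorized computations: the built-in `sum()` walks through the array one element at a time, converting each element to a Python object along the way, so we pay a price for the non-vectorized operation. 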
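Outside of IPython, where `%timeit` isn't available, a similar comparison could be sketched with the standard-library `timeit` module. The helper `time_call` and the `timings` dictionary are hypothetical names for this example, and the numbers you get will depend on your machine, so no expected timings are shown.

```python
import timeit
import numpy as np

n = 1000000
# An explicit 64-bit dtype avoids overflow on platforms with 32-bit default ints.
numpy_array = np.arange(n, dtype=np.int64)
python_list = list(range(n))

def time_call(func, repeats=5):
    """Return the best wall-clock time (in seconds) over a few single runs."""
    return min(timeit.repeat(func, number=1, repeat=repeats))

timings = {
    "np.sum on array": time_call(lambda: np.sum(numpy_array)),
    "sum on list": time_call(lambda: sum(python_list)),
    "sum on array": time_call(lambda: sum(numpy_array)),
}
for label, seconds in timings.items():
    print("{}: {:.2f} ms".format(label, seconds * 1000))
```

All three calls compute the same total; only the route taken to get there differs.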

It's also worth noting that all we did above was a sum - just a **simple** sum. When we move to doing more complicated operations, we'll save even more time! Let's look at what else numpy arrays can do...

#### Making a Numpy Array

Now that we know how awesome `numpy arrays` can be, let's dive into them. We're not going to cover everything that you can do with `numpy arrays` (see the [methods docs](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ndarray.html#numpy.ndarray) for that), but we'll look at the basics. 

Let's start with how we can create a `numpy array`. To do this, we use the `np.array()` constructor, which expects some kind of array or something that exposes the array interface (i.e. acts like an array - lists and tuples are examples). So, this means that we can create a numpy array by passing in a list or tuple. 

In [5]:
my_lst_ndarray = np.array([1, 2, 3, 4, 5])
my_tuple_ndarray = np.array((1, 2, 3, 4, 5), np.int32) # You can specify the data type 
                                                       # upon creation. 

#### Just like we have the shape attribute on Pandas DataFrames, we also have it on numpy arrays.

In [6]:
print(my_lst_ndarray.shape)
print(my_tuple_ndarray.shape)

(5,)
(5,)


#### We also have the dtype attribute, which will tell us the data type of the objects in our ndarray.

In [7]:
print(my_lst_ndarray.dtype)
print(my_tuple_ndarray.dtype)

int64
int32


In [8]:
# With mixed strings and integers, numpy stores every element as a string. 
my_lst_ndarray2 = np.array(["1", 2, 3, "10", 5])
print(my_lst_ndarray2.dtype)

|S2


#### If you try to tell the ndarray to be a certain data type, it will try to cast every object you pass in to that data type (here a 32-bit integer), and fail if it can't cast it to that data type. Below, we are able to cast "10" to a 32-bit integer, so this is fine. 

In [9]:
my_lst_ndarray3 = np.array([1, 2, 3, "10", 5], np.int32) 
print(my_lst_ndarray3.dtype)

int32


In [10]:
# This will not work, because Python can't cast the string 'bozo' as a 32 bit integer. 
my_lst_ndarray3 = np.array([1, 2, 3, "bozo", 5], np.int32) 

ValueError: invalid literal for long() with base 10: 'bozo'
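The same casting behavior shows up when converting an existing array with the `astype()` method, which these notes don't otherwise cover. The sketch below is only an illustration; the names `numeric_strings`, `as_ints`, and `cast_failed` are made up here, and the exact error message varies between Python versions.

```python
import numpy as np

numeric_strings = np.array(["1", "2", "10"])
as_ints = numeric_strings.astype(np.int32)  # every string parses as an integer
print(as_ints, as_ints.dtype)

cast_failed = False
try:
    np.array(["1", "bozo"]).astype(np.int32)  # 'bozo' cannot be parsed
except ValueError as err:
    cast_failed = True
    print("Cast failed:", err)
```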

#### Here are some other methods of constructing a numpy array. It's helpful to know these exist. 

In [11]:
zeros_arr = np.zeros((3,4)) # Create a matrix of zeros with 3 rows and 4 columns. 
ones_arr = np.ones((10,20))  # Create a matrix of ones with 10 rows and 20 columns.
identity_arr = np.identity(50) # Create an identity matrix with 50 rows and 50 columns. 
random_arr = np.random.rand(2, 2) # Create a 2x2 array of random floats ranging from 0 to 1. 
range_arr = np.arange(0, 20, 0.5) # Create a numpy array with arguments (start, stop, step_size); stop is excluded. 
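To see what these constructors produce, it can help to check the shape, dtype, and end points of the results. This is a small sketch; it uses a 3 x 3 identity matrix rather than the 50 x 50 one above purely so the printed output stays readable.

```python
import numpy as np

zeros_arr = np.zeros((3, 4))
identity_arr = np.identity(3)      # smaller than the 50 x 50 above, for readability
range_arr = np.arange(0, 20, 0.5)  # the stop value (20) is not included

print(zeros_arr.shape, zeros_arr.dtype)  # floats by default
print(identity_arr)                      # ones on the diagonal, zeros elsewhere
print(range_arr.shape, range_arr[0], range_arr[-1])
```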

## Numpy Array Math

When working with a `numpy` array, we have all of the basic mathematical operators available to us - `+`, `-`, `*`, `/`, `**`, and `%`. Frequently, we'll be working with two arrays that are the same shape, in which case these operators will just be performed **element-wise**. For example: 

In [13]:
first_arr = np.array([1, 2, 3, 4])
second_arr = np.array([5, 6, 7, 8])
first_arr + second_arr # Each element is lined up with its corresponding element in the other 
                       # array, and the addition is then performed. 

array([ 6,  8, 10, 12])

In [14]:
first_arr = np.array([[1, 2], [3, 4]]) # This is now a two-dimensional array. 
second_arr = np.array([[5, 6], [7, 8]]) # This is now a two-dimensional array. 
first_arr * second_arr # Each element is lined up with its corresponding element in the other 
                       # array, and the multiplication is then performed. 

array([[ 5, 12],
       [21, 32]])

It turns out that our numerical operations can also work when we want to perform an operation between a `numpy array` and a single value. For example, let's say that we want to subtract `4` from `first_arr` above, or multiply it by `5`, or find the remainder when everything is divided by `3`. We can do that via the following: 

In [15]:
first_arr = np.array([[1, 2], [3, 4]]) # This is now a two-dimensional array. 

In [16]:
first_arr - 4

array([[-3, -2],
       [-1,  0]])

In [17]:
first_arr * 5

array([[ 5, 10],
       [15, 20]])

In [18]:
first_arr % 3

array([[1, 2],
       [0, 1]])

The concept that allows this to happen is referred to as [broadcasting](http://docs.scipy.org/doc/numpy-1.10.1/user/basics.broadcasting.html). It is a concept that will be particularly useful when working with `numpy` arrays. Basically, it takes that single number on the right (the `4`, `5`, or `3` above), and **broadcasts** it to match the shape of `first_arr` (`2 x 2`). After doing so, it then performs the operation element-wise like we saw before. 
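One way to picture the stretching step is with `np.broadcast_to()`, which builds the broadcast version of a value explicitly. This is only a conceptual sketch (the name `stretched` is made up here); when you write `first_arr - 4`, numpy gets the same result without you having to build the stretched array yourself.

```python
import numpy as np

first_arr = np.array([[1, 2], [3, 4]])

# Conceptually, the scalar 4 is stretched into a 2 x 2 array of 4s.
stretched = np.broadcast_to(4, first_arr.shape)
print(stretched)

# Subtracting the stretched array matches subtracting the scalar directly.
same_result = np.array_equal(first_arr - 4, first_arr - stretched)
print(same_result)
```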

It turns out that things can get a little more intricate than this. If we wanted, we could perform mathematical operations like the above at a column level, or row level. For example, we could subtract off `4` from the first column and `5` from the second column, or `4` from the first row and `5` from the second row. We would do that via the following: 

In [19]:
first_arr = np.array([[1, 2], [3, 4]]) # This is now a two-dimensional array. 

In [20]:
first_arr - [4, 5] # Here, we subtract 4 off the first **column** and 5 off the second **column**. 

array([[-3, -3],
       [-1, -1]])

In [21]:
first_arr - [[4], [5]] # Here, we subtract 4 from the first **row** and 5 from the second **row**. 

array([[-3, -2],
       [-2, -1]])
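The difference between the two cases comes down to the shape of the thing being subtracted. The sketch below makes those shapes explicit; `per_column` and `per_row` are names chosen for this example.

```python
import numpy as np

first_arr = np.array([[1, 2], [3, 4]])
per_column = np.array([4, 5])       # shape (2,): one value per column
per_row = per_column.reshape(2, 1)  # shape (2, 1): one value per row

print(per_column.shape, per_row.shape)
print(first_arr - per_column)  # 4 off the first column, 5 off the second
print(first_arr - per_row)     # 4 off the first row, 5 off the second
```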

## A Little bit more of Numpy Arrays

Okay, so now that we know a little bit about a `numpy array`, what else can we do with it? There are actually quite a number of things we can do. We can index into them, perform calculations, ask for aggregation type metrics, etc. 

#### Indexing 

Let's begin by indexing into them. With `numpy arrays`, we don't have the `.loc[]`, `.iloc[]`, or `.ix[]` methods like we do on a DataFrame - we simply index into them like we would a list. It's effectively a multidimensional list, though. Therefore, we can pass it multiple indexing values. Let's take a look. 

In [22]:
# Reshape will reshape the data to the shape that you tell it to (here 5 rows, 4 columns). 
range_arr = np.arange(0, 20, 1).reshape(5, 4)
range_arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [23]:
range_arr[:, 2] # Grab every row, but only the element at index 2 in those rows. 

array([ 2,  6, 10, 14, 18])

In [24]:
range_arr[0:2] # With no second index, this defaults to taking the rows. 

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [25]:
range_arr[0:2, 1:3] # The first set of numbers refers to the rows to 
                    # grab, the second set the columns.  

array([[1, 2],
       [5, 6]])
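To see why passing multiple index values is convenient, here is a rough comparison with a plain nested list. The names `nested_list`, `column_from_list`, and `column_from_arr` are made up for this sketch.

```python
import numpy as np

nested_list = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
arr = np.array(nested_list)

# Column at index 2: the list needs a loop, the array takes one index expression.
column_from_list = [row[2] for row in nested_list]
column_from_arr = arr[:, 2]
print(column_from_list, column_from_arr)

# A single element: chained indexing on the list, a (row, column) pair on the array.
print(nested_list[1][3], arr[1, 3])
```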

#### Other methods

Now let's look at some of the other methods that are available. Again, there is a ton we can do, and we're aiming here to at least get your eyes on a lot of the things that are possible. We also want to give you a notebook here that you can look back at to see what is possible (Google is also amazing for this). 

In [26]:
# Remember what this looks like. 
range_arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

#### We can perform sums in any direction with a method on the arrays.

In [27]:
range_arr.sum(axis=0) # Sum along the rows (i.e. get column totals)

array([40, 45, 50, 55])

In [28]:
range_arr.sum(axis=1) # Sum along the columns (i.e. get row totals)

array([ 6, 22, 38, 54, 70])

In [29]:
range_arr.sum() # Get sum of all elements in numpy array. 

190
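A handy way to remember what `axis` does is that the axis you name is the one that disappears from the shape. The following sketch (with the made-up names `col_totals` and `row_totals`) checks this on the same array.

```python
import numpy as np

range_arr = np.arange(20).reshape(5, 4)

col_totals = range_arr.sum(axis=0)  # collapses axis 0 (the 5 rows)
row_totals = range_arr.sum(axis=1)  # collapses axis 1 (the 4 columns)

print(range_arr.shape, col_totals.shape, row_totals.shape)
# Either set of totals adds up to the overall sum.
print(col_totals.sum() == range_arr.sum() == row_totals.sum())
```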

#### We can also grab the mean, standard deviation, max, and min values along the rows (i.e. for the columns). We could also do this along the columns, or for the array as a whole (just like we did with `.sum()`).

In [30]:
range_arr.mean(axis=0)

array([  8.,   9.,  10.,  11.])

In [31]:
range_arr.std(axis=0)

array([ 5.65685425,  5.65685425,  5.65685425,  5.65685425])

In [32]:
range_arr.max(axis=0)

array([16, 17, 18, 19])

In [33]:
range_arr.min(axis=0)

array([0, 1, 2, 3])

#### If we want to instead grab the **index** at which those min and max values occur (either along the rows or columns), then we can use the `argmin()` and `argmax()` methods available on our numpy array. 

In [33]:
range_arr.argmin(axis=0) # We see that the mins of each column occur at row 1 (index 0).

array([0, 0, 0, 0])

In [34]:
range_arr.argmax(axis=0) # We see that the maxes of each column occur at row 5 (index 4).

array([4, 4, 4, 4])

In [35]:
range_arr.argmin() # With no axis, we get the index of the overall minimum in the flattened array (index 0).

0

In [36]:
range_arr.argmax() # With no axis, we get the index of the overall maximum in the flattened array (the last index). 

19
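When `argmax()` is called without an axis, the result is a position in the flattened array. If you want the row and column instead, one option is `np.unravel_index()`, which these notes don't otherwise cover; the sketch below assumes the same 5 x 4 `range_arr`.

```python
import numpy as np

range_arr = np.arange(20).reshape(5, 4)

flat_index = range_arr.argmax()  # position in the flattened array
# Convert the flat position back into a (row, column) pair.
row, col = np.unravel_index(flat_index, range_arr.shape)
print(flat_index, (row, col))
print(range_arr[row, col] == range_arr.max())
```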

#### We can get the cumulative sum or product with the following. 

In [37]:
range_arr.cumsum(axis=0)  # Here it gets the cumsum along the rows (i.e. from top to bottom)

array([[ 0,  1,  2,  3],
       [ 4,  6,  8, 10],
       [12, 15, 18, 21],
       [24, 28, 32, 36],
       [40, 45, 50, 55]])

In [38]:
range_arr.cumprod(axis=0) # Gets the cumprod along the rows

array([[    0,     1,     2,     3],
       [    0,     5,    12,    21],
       [    0,    45,   120,   231],
       [    0,   585,  1680,  3465],
       [    0,  9945, 30240, 65835]])

In [39]:
# We can flatten our arrays as follows. 
range_arr.flatten() # Always returns a copy. 
range_arr.ravel()  # Returns a view when possible; the values look the same in this case. 

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
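Although the printed values match, `flatten()` and `ravel()` differ in whether the result shares data with the original array. Here is a small sketch of that difference, using the made-up names `flat_copy` and `flat_view` and a freshly built `range_arr`.

```python
import numpy as np

range_arr = np.arange(20).reshape(5, 4)

flat_copy = range_arr.flatten()
flat_view = range_arr.ravel()

print(np.shares_memory(range_arr, flat_copy))  # the copy has its own data
print(np.shares_memory(range_arr, flat_view))  # the view reuses the original data

flat_copy[0] = 99  # does not touch range_arr
flat_view[1] = -1  # also changes range_arr[0, 1]
print(range_arr[0])
```

If you plan to modify the flattened result, `flatten()` is the safer choice, since changes made through a view show up in the original array.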

## A brief look at a cool Numpy method

The majority, if not all, of the methods that we looked at for numpy arrays are available on Pandas columns. They might have some slightly different naming conventions (`idxmax` on a column versus `argmax` on a numpy array, for example), but since Pandas DataFrames are built on numpy arrays, the methods available on numpy arrays largely coincide with the methods available on Pandas DataFrames. 

Many of these methods are available as functions on the `numpy` module itself, as well. Just like we can call the `argmax()` method on a numpy array, we can call `np.argmax()` and pass in a list or tuple. Before we move back to DataFrames, let's look at one last method that is available in `numpy`, `np.where()`. `np.where()` can help us to find what elements in a numpy array meet some condition. 

In [40]:
my_ndarray = np.array([2, 4, 6, 8, 24, 3, 8, 9, 12])

In [41]:
print(np.where(my_ndarray <= 2)) # Returns the indices where the data meet the condition. 
print(np.where(my_ndarray == 8)) # Returns the indices where the data meet the condition. 
print(np.where(my_ndarray > 6)) # Returns the indices where the data meet the condition. 

(array([0]),)
(array([3, 6]),)
(array([3, 4, 6, 7, 8]),)
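Note that `np.where()` returns a tuple holding one index array per dimension, which is why the output above is wrapped in parentheses. The sketch below (with the made-up names `mask` and `indices`) shows how those indices relate to the boolean condition itself, which can also be used to pull out the matching values.

```python
import numpy as np

my_ndarray = np.array([2, 4, 6, 8, 24, 3, 8, 9, 12])

mask = my_ndarray > 6        # boolean array, one entry per element
indices = np.where(mask)[0]  # unpack the single index array from the tuple

print(indices)
print(my_ndarray[indices])  # values selected by position
print(my_ndarray[mask])     # same values selected by the mask
```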
