# Introduction 

Throughout this entire notebook you should be experimenting with the code in the non-text cells. A great way to begin to get a feel for Python is by playing with it. So have some fun by changing the values in the cells and then running them again with Shift-Enter. Before you do, think about what you expect the output to be, and make sure your intuition matches up with what you run. If it doesn't, take some time to think about what happened so you can hone your intuition.

At the end of each section there will be some questions to help further your understanding. Remember, in Python we can always manually test code by running it; however, you should try to think about the answers to these questions before you run some code. This way you can check and verify your understanding of the section's topic.

# Intro to Numpy 

Tonight, we're going to start off with a foray into the `Numpy` library, which is one of the fundamental packages for scientific computing in Python. It turns out that the `Pandas DataFrames` we worked with last class are actually built off the `numpy array` (which we'll get to), so it's important to have some basic knowledge of what's running under the hood of our `DataFrames`. We started with `DataFrames` as opposed to `numpy arrays` because they are a little bit more intuitive and we're able to interact with them from a higher level (this is largely due to the ability to label our data). 

While `Numpy` offers an amazing amount of functionality (see the [docs](http://www.numpy.org/) for a better idea), one of its mainstays is the `numpy array` (an n-dimensional array), which is what we'll focus on tonight. There are loads of things that you can do with `numpy arrays`, and tonight we're going to introduce some of their amazing capabilities. Learning about everything `numpy arrays` can do really just takes working with them day in and day out, and so tonight we'll try to aim for breadth over depth. We want you to walk out of tonight with an idea of the many things you can do with `numpy` and `numpy arrays`, and the types of situations where you would want to use them. 

## The basics of the Array

#### What's the big deal with Numpy Arrays?

What's so special about a `numpy array`? From a high level, they are kind of like lists - they just store a bunch of stuff in a container. It turns out, though, that a numpy array is much faster to interact with and perform certain types of calculations with than a standard Python list. Why is that, though? The two main reasons that they are faster are: 

1. They are stored as one contiguous block of memory, rather than as separate objects scattered across multiple memory locations, as the items of a list are. 
2. Each item in a numpy array is of the same data type (i.e. all integers, all floats, etc.), rather than a conglomerate of any number of data types (as a list is). We call this idea homogeneity, as opposed to the possible heterogeneity of Python lists.
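
Both properties can be inspected directly on an array. The snippet below is a minimal sketch; the names `arr` and `mixed_list` are made up for illustration, and the dtype is set explicitly so the byte counts are the same on every platform.

```python
import numpy as np

# Request 64-bit integers explicitly so the sizes below do not depend on the platform.
arr = np.arange(5, dtype=np.int64)

# Homogeneity: every element shares one dtype, and so one size in bytes.
print(arr.dtype)     # int64
print(arr.itemsize)  # 8 (bytes per element)
print(arr.nbytes)    # 40 (5 elements * 8 bytes)

# Contiguity: the flags report whether the data sits in one contiguous block.
print(arr.flags['C_CONTIGUOUS'])  # True

# A Python list, by contrast, is free to mix types.
mixed_list = [1, 2.5, "three"]
print([type(item).__name__ for item in mixed_list])  # ['int', 'float', 'str']
```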

Just how much faster are they? Let's take the numbers from 0 to 1 million, and sum those numbers, timing it with both a list and a numpy array.

In [1]:
import numpy as np # Standard alias when importing numpy - follow this convention!
numpy_array = np.arange(0, 1000000)
python_list = list(range(1000000)) # list() keeps this a real list on Python 3, where range() is lazy.

In [2]:
%timeit np.sum(numpy_array)

1000 loops, best of 3: 713 µs per loop


In [3]:
%timeit sum(python_list)

100 loops, best of 3: 8.59 ms per loop


In [4]:
%timeit sum(numpy_array)

10 loops, best of 3: 81.6 ms per loop


So, numpy runs roughly an order of magnitude faster (about 12x in the run above)! This is because of those two points above. Because numpy arrays store data in contiguous blocks of memory, they are able to take advantage of **vectorization**, which is the ability of a CPU to perform one operation on multiple pieces of data at once. In addition, since a numpy array knows the type of every object it stores (and those types don't change), it doesn't have to waste time checking each object's type (as a list does). The combination of these two things speeds up our calculation quite a bit.

Notice, too, that we had to specify `np.sum()` - numpy's implementation of sum. When we just used the built-in Python `sum()` on the numpy array, the calculation was actually slower! This is because numpy arrays are optimized for vectorized computations, and when we try to do a non-vectorized operation we pay a price. 
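
`%timeit` only works inside IPython or a notebook. Outside of one, a similar comparison can be sketched with the standard library's `timeit` module, as below. The `timings` dictionary is just an illustrative name, and the numbers it reports will differ from machine to machine, so no specific timings are claimed here.

```python
import timeit

import numpy as np

# Explicit 64-bit integers so the large total cannot overflow on any platform.
numpy_array = np.arange(1000000, dtype=np.int64)
python_list = list(range(1000000))

# All three approaches compute the same total.
total_np = int(np.sum(numpy_array))
total_py = sum(python_list)
print(total_np == total_py)  # True

# Time each approach over a few runs; the absolute values depend on the machine.
timings = {
    "np.sum(numpy_array)": timeit.timeit(lambda: np.sum(numpy_array), number=3),
    "sum(python_list)": timeit.timeit(lambda: sum(python_list), number=3),
    "sum(numpy_array)": timeit.timeit(lambda: sum(numpy_array), number=3),
}
for label, seconds in timings.items():
    print(label, round(seconds / 3, 6), "seconds per run")
```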

It's also worth noting that all we did above was a sum - just a **simple** sum. When we move to doing more complicated operations, we'll save even more time! Let's look at what else numpy arrays can do...

#### Making a Numpy Array

Now that we know how awesome `numpy arrays` can be, let's dive into them. We're not going to cover everything that you can do with `numpy arrays` (see the [methods docs](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ndarray.html#numpy.ndarray) for that), but we'll look at the basics. 

Let's start with how we can create a `numpy array`. To do this, we use the `np.array()` constructor, which expects some kind of array or something that exposes the array interface (i.e. acts like an array - lists and tuples are examples). So, this means that we can create a numpy array by passing in a list or tuple. 

In [5]:
my_lst_ndarray = np.array([1, 2, 3, 4, 5])
my_tuple_ndarray = np.array((1, 2, 3, 4, 5), np.int32) # You can specify the data type 
                                                       # upon creation. 

#### Just like we have the shape attribute on Pandas DataFrames, we also have it on numpy arrays.

In [6]:
print(my_lst_ndarray.shape)
print(my_tuple_ndarray.shape)

(5,)
(5,)


#### We also have the dtype attribute, which will tell us the data type of the objects in our ndarray.

In [7]:
print(my_lst_ndarray.dtype)
print(my_tuple_ndarray.dtype)

int64
int32


In [8]:
my_lst_ndarray2 = np.array(["1", 2, 3, "10", 5])
print(my_lst_ndarray2.dtype)

|S2
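
When the inputs have mixed types and no data type is specified, numpy picks one data type that can represent every item, which is why the array above ended up holding strings. The following is a small sketch of the same behaviour; the variable names are made up for illustration, and the exact string dtype that is reported depends on the Python and numpy versions.

```python
import numpy as np

# Integers mixed with a float are all stored as floats.
int_and_float = np.array([1, 2, 3.5])
print(int_and_float.dtype)  # float64

# Integers mixed with a string are all stored as strings.
with_strings = np.array(["1", 2, 3])
print(with_strings.dtype.kind)  # 'U' (unicode) on Python 3, 'S' (bytes) on Python 2
print(with_strings[1] == "2")   # True: the integer 2 was stored as the string "2"
```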


#### If you try to tell the ndarray to be a certain data type, it will try to cast every object you pass in to that data type (here a 32-bit integer), and fail if it can't cast it to that data type. Below, we are able to cast "10" to a 32-bit integer, so this is fine. 

In [9]:
my_lst_ndarray3 = np.array([1, 2, 3, "10", 5], np.int32) 
print(my_lst_ndarray3.dtype)

int32


In [10]:
# This will not work, because numpy can't cast the string 'bozo' to a 32-bit integer. 
my_lst_ndarray3 = np.array([1, 2, 3, "bozo", 5], np.int32) 

ValueError: invalid literal for long() with base 10: 'bozo'

#### Here are some other methods of constructing a numpy array. It's helpful to know these exist. 

In [11]:
zeros_arr = np.zeros((3,4)) # Create a matrix of zeros with 3 rows and 4 columns. 
ones_arr = np.ones((10,20)) # Create a matrix of ones with 10 rows and 20 columns.
identity_arr = np.identity(50) # Create an identity matrix with 50 rows and 50 columns. 
random_arr = np.random.rand(2, 2) # Create a 2x2 array of random floats ranging from 0 to 1. 
range_arr = np.arange(0, 20, 0.5) # Create a numpy array with arguments (start, stop, step_size); stop is excluded. 
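
A quick way to confirm what each of these constructors produced is to look at the `shape` attribute. The sketch below recreates the same arrays so that it can run on its own.

```python
import numpy as np

zeros_arr = np.zeros((3, 4))
ones_arr = np.ones((10, 20))
identity_arr = np.identity(50)
random_arr = np.random.rand(2, 2)
range_arr = np.arange(0, 20, 0.5)

print(zeros_arr.shape)     # (3, 4)
print(ones_arr.shape)      # (10, 20)
print(identity_arr.shape)  # (50, 50)
print(random_arr.shape)    # (2, 2)
print(range_arr.shape)     # (40,)

# np.arange stops before the stop value, so the last element is 19.5, not 20.
print(range_arr[-1])       # 19.5
```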

##### Numpy Array Questions

1. How could I create a `numpy array` holding the numbers `1` - `10`?
2. How could I create a `numpy array` holding the numbers `10` - `20` ?
3. How would I create a `numpy array` of all zeros that has 10 rows and 20 columns? 
4. How would I create a `numpy array` of all ones that has 4 rows and 2 columns?

## Numpy Array Math

When working with a `numpy` array, we have all of the basic mathematics operators available to us - `+`, `-`, `*`, `/`, `**`, and `%`. Frequently, we'll be working with two arrays that are the same size, in which case these operators will just be performed **element-wise**. For example: 

In [12]:
first_arr = np.array([1, 2, 3, 4])
second_arr = np.array([5, 6, 7, 8])
first_arr + second_arr # Each element is lined up with its corresponding element in the other 
                       # array, and the addition is then performed. 

array([ 6,  8, 10, 12])

In [13]:
first_arr = np.array([[1, 2], [3, 4]]) # This is now a two-dimensional array. 
second_arr = np.array([[5, 6], [7, 8]]) # This is now a two-dimensional array. 
first_arr * second_arr # Each element is lined up with its corresponding element in the other 
                       # array, and the multiplication is then performed. 

array([[ 5, 12],
       [21, 32]])
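
The remaining operators work element-wise in the same way. Here is a brief sketch, assuming Python 3 (where `/` is true division and always gives floats); the arrays `a` and `b` are made-up examples, and `.tolist()` is only used to keep the printed output simple.

```python
import numpy as np

a = np.array([2, 4, 6, 8])
b = np.array([1, 2, 4, 5])

print((a - b).tolist())   # [1, 2, 2, 3]
print((a / b).tolist())   # [2.0, 2.0, 1.5, 1.6]
print((a ** b).tolist())  # [2, 16, 1296, 32768]
print((a % b).tolist())   # [0, 0, 2, 3]
```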

It turns out that our numerical operations can also work when we want to perform an operation between a `numpy array` and a single value. For example, let's say that we want to subtract `4` from `first_arr` above, or multiply it by `5`, or find the remainder when everything is divided by `3`. We can do that via the following: 

In [14]:
first_arr = np.array([[1, 2], [3, 4]]) # This is now a two-dimensional array. 

In [15]:
first_arr - 4

array([[-3, -2],
       [-1,  0]])

In [16]:
first_arr * 5

array([[ 5, 10],
       [15, 20]])

In [17]:
first_arr % 3

array([[1, 2],
       [0, 1]])

The concept that allows this to happen is referred to as [broadcasting](http://docs.scipy.org/doc/numpy-1.10.1/user/basics.broadcasting.html). It is a concept that will be particularly useful when working with `numpy` arrays. Basically, it takes that single number on the right (the `4`, `5`, or `3` above), and **broadcasts** it to match the shape of `first_arr` (`2 x 2`). After doing so, it then performs the operation element-wise like we saw before. 
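
One way to picture this is with `np.broadcast_to`, which returns a view of a value stretched out to a given shape. The sketch below is only meant as an illustration of the idea (the name `stretched` is made up); when evaluating `first_arr - 4`, numpy does not need to build this larger array explicitly.

```python
import numpy as np

first_arr = np.array([[1, 2], [3, 4]])

# What the single value 4 conceptually looks like once it matches first_arr's shape.
stretched = np.broadcast_to(4, first_arr.shape)
print(stretched)
# [[4 4]
#  [4 4]]

# Subtracting the scalar gives the same result as subtracting the stretched array.
print(np.array_equal(first_arr - 4, first_arr - stretched))  # True
```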

It turns out that things can get a little more intricate than this. If we wanted, we could perform mathematical operations like the above at a column level, or row level. For example, we could subtract off `4` from the first column and `5` from the second column, or `4` from the first row and `5` from the second row. We would do that via the following: 

In [18]:
first_arr = np.array([[1, 2], [3, 4]]) # This is now a two-dimensional array. 

In [19]:
first_arr - [4, 5] # Here, we subtract 4 off the first **column** and 5 off the second **column**. 

array([[-3, -3],
       [-1, -1]])

In [20]:
first_arr - [[4], [5]] # Here, we subtract 4 from the first **row** and 5 from the second **row**. 

array([[-3, -2],
       [-2, -1]])
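
The difference between the two cases comes down to the shape of the right-hand operand: `[4, 5]` has shape `(2,)` and is repeated down the rows, while `[[4], [5]]` has shape `(2, 1)` and is repeated across the columns. The sketch below makes that visible with `np.broadcast_to`; the names `per_column` and `per_row` are made up for illustration.

```python
import numpy as np

first_arr = np.array([[1, 2], [3, 4]])

per_column = np.array([4, 5])   # shape (2,): one value per column
per_row = np.array([[4], [5]])  # shape (2, 1): one value per row

# How each operand is stretched to the (2, 2) shape of first_arr.
print(np.broadcast_to(per_column, first_arr.shape))
# [[4 5]
#  [4 5]]
print(np.broadcast_to(per_row, first_arr.shape))
# [[4 4]
#  [5 5]]

print(first_arr - per_column)
# [[-3 -3]
#  [-1 -1]]
print(first_arr - per_row)
# [[-3 -2]
#  [-2 -1]]

# Indexing with np.newaxis turns the (2,) array into a (2, 1) array.
print(np.array_equal(per_column[:, np.newaxis], per_row))  # True
```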

##### Math/Broadcasting Questions

Assume that we have `my_arr1` equal to `[[10, 20], [30, 40]]` and `my_arr2` equal to `[[50, 60], [70, 80]]`: 

1. What is the result of `my_arr1 + my_arr2`?
2. What is the result of `my_arr1 * my_arr2`?
3. What is the result of `my_arr2 - my_arr1`?
4. What is the result of `my_arr2 - 100`?
5. What is the result of `my_arr1 - [10, 20]`?