# Section 6: Getting Started with NumPy
## Contents
- [The Fundamentals: the ND-array](#The-Fundamentals:-the-ND-array)
  - [Accessing Data Along Multiple Dimensions](#Accessing-Data-Along-Multiple-Dimensions)
    - [One-dimensional Arrays](#One-dimensional-Arrays)
    - [Two-dimensional Arrays](#Two-dimensional-Arrays)
    - [N-dimensional Arrays](#N-dimensional-Arrays)
  - ["Vectorized" Operations: Optimized Computations on NumPy Arrays](#"Vectorized"-Operations:-Optimized-Computations-on-NumPy-Arrays)
  
It is time to start familiarizing ourselves with NumPy. To use this package, we need to be sure to "import" the NumPy module into our code:

```python
import numpy as np
```

You could have run `import numpy` instead, but the prescribed method allows us to use the abbreviation 'np' throughout our code, instead of having to write 'numpy'. This abbreviation is also used in the sections of the online book that you will be working through. 




## The Fundamentals: the ND-array
The ND-array (N-dimensional array) is the star of the show for NumPy. This array simply stores a sequence of numbers. Like a Python list, you can access individual entries in this array by "indexing" into the array, and you can access a sub-sequence of the array by "slicing" it. So what distinguishes NumPy's ND-array from a Python list, and why is there a whole numerical library that revolves around this array? There are two major features that make the ND-array special - it can:
 1. Provide an interface to its underlying data to be accessed along multiple dimensions.
 2. Rapidly perform mathematical operations over all of its elements, or over patterned subsequences of its elements, using compiled C code instead of Python.
 
We will take a moment to understand these two capabilities and why they are important.

### Accessing Data Along Multiple Dimensions
In this section, we will: 
 - define the "dimensionality" of an array
 - discuss the usefulness of ND-arrays
 - introduce the basic indexing scheme for accessing the contents of ND-arrays along their "axes"

#### One-dimensional Arrays
Let's begin our discussion by constructing a simple ND-array containing three fractional-valued (a.k.a. "floating point") numbers. 

```python
simple_array = np.array([2.3, 0.1, -9.1])
```
This supports the same indexing as a Python string or list:

```
 +------+------+------+
 |  2.3 |  0.1 | -9.1 | 
 +------+------+------+
 0      1      2      3
-3     -2     -1       
```
The first row of numbers gives the position of the indices 0…3 in the array; the second row gives the corresponding negative indices. The slice from $i$ to $j$ returns an array containing all of the numbers between the edges labeled $i$ and $j$, respectively:

```python
>>> simple_array[0]
2.3

>>> simple_array[-2]
0.1

>>> simple_array[1:3]
array([ 0.1, -9.1])

>>> simple_array[3]
IndexError: index 3 is out of bounds for axis 0 with size 3
```
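
As a quick check of the diagram above, the following sketch compares a few positive and negative indices and slices on `simple_array`. The particular indices and variable names are illustrative choices and do not come from the text above.

```python
import numpy as np

simple_array = np.array([2.3, 0.1, -9.1])

# index 0 and index -3 refer to the same (first) entry
first = simple_array[0]
also_first = simple_array[-3]

# slicing between edges 0 and 2 returns the first two entries
front = simple_array[0:2]

# a negative start counts from the end: the last two entries
back = simple_array[-2:]

print(first == also_first)  # True
print(front)
print(back)
```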

Given this indexing scheme, only *one* integer is needed to specify a unique entry in the array. Similarly, only *one* slice is needed to uniquely specify a subsequence of entries in the array. For this reason, we say that this is a **1-dimensional array**. In general, the **dimensionality** of a NumPy array specifies the number of indices that are required to uniquely specify one of its entries.

<div class="alert alert-block alert-info"> **Definition**: the **dimensionality** of an array specifies the number of indices that are required to uniquely specify one of its entries. </div>

This definition of dimensionality is common far beyond NumPy - one needs to use three numbers to uniquely specify a point in physical space, which is why it is said that space consists of three dimensions.
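
NumPy exposes the dimensionality of an array directly. The short sketch below uses the `ndim` and `shape` attributes, which are part of NumPy but are not otherwise introduced in this section, to confirm that `simple_array` is 1-dimensional.

```python
import numpy as np

simple_array = np.array([2.3, 0.1, -9.1])

# number of indices needed to specify a single entry
n_dims = simple_array.ndim

# number of entries along each axis, reported as a tuple
shape = simple_array.shape

print(n_dims)  # 1
print(shape)   # (3,)
```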

#### Two-dimensional Arrays
Before proceeding further down the path of high-dimensional arrays, let's briefly consider a very simple dataset for which accessing the data along multiple dimensions is clearly desirable. Consider the following table from a gradebook:


|         | Exam 1 (%)           | Exam 2 (%) |
| ------------- |:-------------:| -----:|
| Ashley     | $93$ | $95$ |
| Brad     | $84$      |   $100$ |
| Cassie | $99$      |    $87$ |

This dataset contains 6 grade-values. It is almost immediately clear that storing these in a 1-dimensional array is not ideal:

```python
grades = np.array([93, 95, 84, 100, 99, 87])
```

While no data has been lost, accessing this data using a single index is less than convenient; we want to be able to specify both the student and the exam when accessing a grade, so it is natural to ascribe *two dimensions* to this data. Let's construct a 2D array containing these grades:

```python
grades = np.array([[93,  95], 
                   [84, 100], 
                   [99,  87]])
```

NumPy is able to see the structure of the data passed to `np.array`, and resolve the two dimensions of data, which we deem the 'student' dimension and the 'exam' dimension, respectively. 
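
One way to confirm that NumPy resolved two dimensions from the nested lists is to inspect the array's `ndim` and `shape` attributes (standard NumPy attributes, though not introduced in the text above). A minimal sketch:

```python
import numpy as np

grades = np.array([[93,  95],
                   [84, 100],
                   [99,  87]])

print(grades.ndim)   # 2
print(grades.shape)  # (3, 2): 3 students, 2 exams
```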

<div class="alert alert-block alert-warning"> **Note**:  Although NumPy does formally recognize the concept of dimensionality precisely in the way that it is discussed here, its documentation refers to an individual dimension of an array as an **axis**. Thus you will see "axes" (pronounced "aks-ēz") used in place of "dimensions". They mean the same thing.</div>

NumPy specifies the row-axis (students) of a 2D array as "axis-0" and the column-axis (exams) as axis-1. You must now provide **two** indices, one for each axis (dimension), to uniquely specify an element in this 2D array; the first number is an index along axis-0, the second is an index along axis-1. The zero-based indexing schema that we reviewed earlier applies to each axis of the ND-array:

```                                                    
                                                  -- axis-1 -> 
                                                  -2  -1    
                                                   0   1   2
                                        |     -3 0 +---+---+  
                                        |          |93 | 95| 
                                        |     -2 1 +---+---+
                                      axis-0       |84 |100|  
                                        |     -1 2 +---+---+
                                        |          |99 | 87|
                                        V        3 +---+---+
```



Thus, if we want to access Brad's (axis-0: 1) score for Exam  1 (axis-1: 0) we simply specify:

```python
>>> grades[1, 0]  # Brad's score on Exam 1
84
```

We can also use *slices* to access subsequences of our data. Suppose we want the scores of all the students for Exam 2. We can slice from 0 through 3 along axis-0 (refer to the indexing diagram above) to include all the students, and specify index 1 to select Exam 2:

```python
>>> grades[0:3, 1]  # Exam 2 scores for all students
array([ 95, 100,  87])
```
> You can specify an "empty" slice to include all possible entries along an axis, by default: `grades[:, 1]` is equivalent to `grades[0:3, 1]`, in this instance.
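
The equivalence described above can be checked directly. The sketch below also slices along both axes to pull out both exam scores for the last two students; that second selection is an illustrative choice, not one taken from the text above.

```python
import numpy as np

grades = np.array([[93,  95],
                   [84, 100],
                   [99,  87]])

# an "empty" slice selects every entry along axis-0
exam_2_explicit = grades[0:3, 1]
exam_2_empty = grades[:, 1]
print(np.array_equal(exam_2_explicit, exam_2_empty))  # True

# slices can be used along both axes at once:
# students 1 and 2 (Brad and Cassie), both exams
last_two = grades[1:3, :]
print(last_two)
```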

What happens if we only supply one index to our array? It may be surprising that `grades[0]` does not throw an error; instead, it returns all of the exam scores for student-0 (Ashley):

```python
>>> grades[0] 
array([93, 95])
```
This is because NumPy will automatically insert slices for you if you don't provide as many indices as there are dimensions for your array:
> Suppose you have an $N$-dimensional array, and only provide $j$ indices for the array. NumPy will automatically insert $N-j$ trailing slices for you. 

> In the case that $N=5$ and $j=3$: `d5_array[0, 0, 0]` is treated as  `d5_array[0, 0, 0, :, :]`

Thus `grades[0]` was treated as `grades[0, :]`.
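
To see this rule at work on a higher-dimensional array, the sketch below builds a hypothetical 5-dimensional array of zeros with `np.zeros` (the shape `(2, 3, 4, 5, 6)` is an arbitrary choice for illustration) and compares the shape of the result with and without the trailing slices written out.

```python
import numpy as np

# hypothetical 5D array; the shape is arbitrary
d5_array = np.zeros((2, 3, 4, 5, 6))

implicit = d5_array[0, 0, 0]        # trailing slices inserted by NumPy
explicit = d5_array[0, 0, 0, :, :]  # trailing slices written out

print(implicit.shape)  # (5, 6)
print(explicit.shape)  # (5, 6)
```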


<div class="alert alert-block alert-warning">
**FYI**: Keeping track of the meaning of an array's various dimensions can quickly become unwieldy when working with real-world datasets. [xarray](http://xarray.pydata.org/en/stable/) is a Python library that provides functionality comparable to NumPy, but allows users to provide *explicit labels* for an array's dimensions. Using an `xarray`, selecting Brad's scores could look like `grades.sel(student='Brad')`. </div>

#### N-dimensional Arrays
We'll wrap up this section by building up some intuition for arrays with a dimensionality higher than 2. The following code creates a 3-dimensional array:
```python
d3_array = np.array([[[0, 1],
                      [2, 3]],

                     [[4, 5],
                      [6, 7]]])
```
You can think of axis-0 as specifying which of the 2x2 "sheets" to select from, axis-1 as specifying the row along the sheets, and axis-2 as specifying the column along the sheets:

```
                                           sheet 0:
                                                  [0, 1]
                                                  [2, 3]

                                           sheet 1:
                                                  [4, 5]
                                                  [6, 7]
                                                  


                                        |       -- axis-2 ->
                                        |    |    
                                        |  axis-1 [0, 1]
                                        |    |    [2, 3]
                                        |    V
                                     axis-0     
                                        |      -- axis-2 ->
                                        |    |    
                                        |  axis-1 [4, 5]
                                        |    |    [6, 7]
                                        V    V

```

Thus `d3_array[0, 1, 0]` specifies the element residing in sheet-0, at row-1 and column-0:
```python
>>> d3_array[0, 1, 0]
2
```

`d3_array[:, 0, 0]` specifies the elements in row-0 and column-0 of **both** sheets:
```python
>>> d3_array[:, 0, 0]
array([0, 4])
```
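
The same indexing scheme extends to selecting whole sheets, rows, or columns. The selections below are illustrative additions that reuse `d3_array` from above; the variable names are chosen for this example only.

```python
import numpy as np

d3_array = np.array([[[0, 1],
                      [2, 3]],

                     [[4, 5],
                      [6, 7]]])

# one index: all of sheet-1 (a 2D array)
sheet_1 = d3_array[1]

# row-1 of every sheet
row_1_all_sheets = d3_array[:, 1, :]

# column-1 of every sheet
col_1_all_sheets = d3_array[:, :, 1]

print(d3_array.shape)  # (2, 2, 2)
print(sheet_1)
print(row_1_all_sheets)
print(col_1_all_sheets)
```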

In four dimensions, one can think of "stacks of sheets with rows and columns" where axis-0 selects the stack of sheets you are working with, axis-1 chooses the sheet, axis-2 chooses the row, and axis-3 chooses the column. Extrapolating to higher dimensions ("collections of stacks of sheets ...") continues in the same tedious fashion.  
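
As a sketch of the four-dimensional picture, the example below builds a hypothetical 4D array with `np.arange` and `reshape` (the shape is an arbitrary choice, and `reshape` is not covered in this section). Each of the four indices selects, in order, a stack, a sheet, a row, and a column.

```python
import numpy as np

# 2 stacks, each with 3 sheets of 2 rows and 2 columns;
# the entries are simply 0, 1, ..., 23 in order
d4_array = np.arange(24).reshape(2, 3, 2, 2)

# stack-1, sheet-2, row-0, column-1
entry = d4_array[1, 2, 0, 1]

# supplying only the stack and sheet returns one 2x2 sheet
sheet = d4_array[1, 2]

print(d4_array.ndim)  # 4
print(entry)          # 21
print(sheet)
```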


<div class="alert alert-block alert-success">
**Takeaway**: Although accessing data along varying dimensions is ultimately all a matter of judicious bookkeeping (you *could* access all of this data from a 1-dimensional array, after all), NumPy's ability to provide users with an interface for accessing data along dimensions is incredibly useful. It affords us an ability to impose intuitive, abstract structure on our data. </div>
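
To make the bookkeeping remark concrete, the sketch below stores the gradebook in a 1-dimensional array and computes by hand the single index that corresponds to a given student and exam. The helper name `flat_index` is hypothetical, and the row-by-row layout of the 1D array is an assumption of this example.

```python
import numpy as np

grades_2d = np.array([[93,  95],
                      [84, 100],
                      [99,  87]])

# the same six grades, stored row by row in one dimension
grades_1d = np.array([93, 95, 84, 100, 99, 87])

num_exams = 2  # number of entries along axis-1


def flat_index(student, exam):
    """Hypothetical helper: map (student, exam) to a 1D position."""
    return student * num_exams + exam


# Brad (student 1) on Exam 1 (exam 0), accessed both ways
print(grades_1d[flat_index(1, 0)])  # 84
print(grades_2d[1, 0])              # 84
```

The 2D interface does this arithmetic for us, which is exactly the bookkeeping that becomes tedious by hand as the number of dimensions grows.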

### "Vectorized" Operations: Optimized Computations on NumPy Arrays

NumPy's ND-arrays are *homogeneous* arrays - an array can only contain data of a single type. For instance, an array can contain a sequence of 8-bit integers or a sequence of 32-bit floating point numbers, but not a mix of the two. This is in stark contrast to Python's lists (and its other containers), which are entirely unrestricted in the variety of contents they can possess; a given list could simultaneously contain strings, integers, floats, and other objects. This restriction on an array's contents comes at a great benefit - in "knowing" that an array's contents are homogeneous in data type, NumPy is able to delegate the task of performing mathematical operations on the array's contents to optimized, compiled C code, a process that is referred to as **vectorization**. The outcome of this can be a *tremendous* speedup relative to the analogous computation performed in Python, which must painstakingly check the data type of *every* one of the array's items as it encounters it.
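
The homogeneity of an array can be observed through its `dtype` attribute (a standard NumPy attribute that the text above does not introduce). In the sketch below, mixing integers with a float produces an array whose entries are all floats, whereas a list keeps each item's original type. The variable names are chosen for illustration.

```python
import numpy as np

# a list can hold items of different types
mixed_list = [1, 2.5, "three"]
print([type(item).__name__ for item in mixed_list])  # ['int', 'float', 'str']

# an array stores a single data type; the integers are converted to floats
mixed_numbers = np.array([1, 2, 3.5])
print(mixed_numbers.dtype)  # float64

# every entry shares that one type
print(all(isinstance(x, np.float64) for x in mixed_numbers))  # True
```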

<div class="alert alert-block alert-info"> 
**Definition**: In high-level languages like Python, Matlab, and R, **vectorization** describes the use of optimized, pre-compiled code written in a low-level language (e.g. C) to perform mathematical operations over a sequence of data in place of explicit iteration written in the native language code (e.g. Python). 
</div>

Consider, for instance, the task of summing the integers 0-9,999 stored in a NumPy array. Calling NumPy's `sum` function cues optimized C code to iterate over the integers in the array and tally the sum; the result is then returned "to Python" (this is therefore a "vectorized" function). Let's time how long it takes to compute this sum: 


<div class="alert alert-block alert-warning"> 
**Tip**: `%%timeit` is a utility built into Jupyter notebooks and the IPython console that allows one to time how long it takes to execute the contents of a cell. It is **not** part of the Python language.</div>

```python
import numpy as np
```

```python
%%timeit
# sum an array, using NumPy's vectorized 'sum' function
np.sum(np.arange(10000))
```

```
29.5 µs ± 632 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```


Now let's compare this to the time required to *explicitly* loop over the array in Python and tally up the sum. Python is unable to take advantage of the fact that the array's contents are all of a single data type - it has to check, on every iteration, whether it is dealing with an integer, a string, a floating point number, etc., just as it does when iterating over a list. This will slow down the computation massively.

```python
%%timeit
# sum an array by explicitly looping over the array in Python
total = 0
for i in np.arange(10000):
    total = i + total
```

```
1.91 ms ± 48.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
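
Outside of Jupyter or IPython, a similar comparison can be made with the `timeit` module from Python's standard library. The sketch below is one way to do so; the function names are hypothetical, the repetition count is arbitrary, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

numbers = np.arange(10000)


def vectorized_sum():
    # NumPy's compiled code iterates over the array
    return np.sum(numbers)


def python_loop_sum():
    # explicit iteration in Python
    total = 0
    for i in numbers:
        total = i + total
    return total


# both approaches compute the same value
assert vectorized_sum() == python_loop_sum()

# total time, in seconds, for 100 calls of each function
t_numpy = timeit.timeit(vectorized_sum, number=100)
t_loop = timeit.timeit(python_loop_sum, number=100)
print(f"NumPy sum:   {t_numpy:.5f} s")
print(f"Python loop: {t_loop:.5f} s")
```

Unlike the timed cells above, this version creates the array once, outside the timed functions, so it measures only the summation itself.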


Timed on my computer, the sum is **over 50 times faster when performed in NumPy**! This should make it clear that, whenever computational efficiency is important, one should avoid performing explicit for-loops over long sequences of data in Python, be they lists or NumPy arrays. NumPy provides a whole suite of vectorized functions. In fact, the name of the game when it comes to leveraging NumPy to do computations over arrays of numbers is to exclusively leverage its vectorized functions. The following computations all invoke vectorized functions: 

```python
# multiply 2 with each number in the array
>>> 2 * np.array([2, 3, 4]) 
array([4, 6, 8])

# subtract the corresponding entries of the two arrays
>>> np.array([10.2, 3.5, -0.9]) - np.array([8.2, 3.5, 6.5])
array([ 2. ,  0. , -7.4])

# Take the "dot product" of the two arrays 
# - multiply their corresponding entries and sum the result
>>> np.dot(np.array([1, -3, 4]), np.array([2, 0, 1]))
6
```
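
To connect the dot product with its description, the sketch below recomputes it as an element-wise multiplication followed by a sum; both steps are themselves vectorized. The variable names are chosen for illustration.

```python
import numpy as np

x = np.array([1, -3, 4])
y = np.array([2, 0, 1])

# multiply corresponding entries: [1*2, -3*0, 4*1]
products = x * y

# sum the products
manual_dot = np.sum(products)

print(products)
print(manual_dot)                  # 6
print(manual_dot == np.dot(x, y))  # True
```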

## Required Reading
Read through the provided material, following along in your own Jupyter notebook:

- [The Basics of NumPy (from the official NumPy tutorial)](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html#the-basics)
- [Creating Arrays from Python Lists](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb#Creating-Arrays-from-Python-Lists)
- [Creating Arrays from Scratch](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb#Creating-Arrays-from-Scratch)
- [NumPy Standard Data Types](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb#NumPy-Standard-Data-Types)
- [The Basics of NumPy Arrays](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.02-The-Basics-Of-NumPy-Arrays.ipynb#The-Basics-of-NumPy-Arrays) (work through all sections on this page)
- [Computation on NumPy Arrays: Universal Functions](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.03-Computation-on-arrays-ufuncs.ipynb#Computation-on-NumPy-Arrays:-Universal-Functions) (work through all sections on this page)

These sections are from the [Python Data Science Handbook](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb), which is available for free online (it was actually written in Jupyter notebooks!). Please make note of the book's contents - this is a great resource to return to time and time again.

If you are already considerably experienced with NumPy, take a look at [From Python to NumPy](http://www.labri.fr/perso/nrougier/from-python-to-numpy/) (which is free, and published online). Its contents are quite advanced, but exceedingly useful. Working through this will earn you a black belt in NumPy!

### NumPy data types
What is the data type of the elements in `a = np.array([3.14, 1, 2, 3])`?

 - [ ] float and int
 - [x] float64
 - [ ] float


What is `type(a[2])`?
 - [x] numpy.float64
 - [ ] float
 - [ ] np.int32 or np.int64