# numpy

Thus far, we have explored the "built-in" functionality of Python.

However, there is a large ecosystem of packages out there that do almost any task you can think of.

numpy (pronounced num-pie) is one of those extremely useful packages. 

### Why numpy?

You can think of numpy as a "supercharged" way of managing ```list``` data types.

Under the hood, the function calls you will be using are powered by very fast compiled languages (C/C++/Cython). numpy essentially passes your data to those tools and waits for the result. Once the calculation completes, the result is made available in regular Python format.

Due to this difference between ```lists``` and numpy data types, calculations on numerical data, particularly large amounts of numerical data, can be completed in a fraction of the time.

We can use the built-in ```%timeit``` functionality in Jupyter to demonstrate how ```lists``` and numpy compare when calculating the sum of a large amount of numbers.

First, build a ```list``` of integers, followed by a ```numpy.ndarray``` of the same integers:

In [14]:
python_list = list(range(10000))

print(python_list[:10]) # first ten ints
print(python_list[-10:]) # last ten ints
print(len(python_list)) # length of python list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[9990, 9991, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999]
10000


To use numpy, you first need to import the package (like we did with classes).

When you use the ```as``` keyword, you are creating an *alias*. In almost all examples you will see online, the alias for numpy is ```np```.

The combination looks like this:

```
import numpy as np
```

Which is equivalent to

```
import numpy
```

But instead of typing out ```numpy``` every time you use dot notation, you just type ```np```.

For example:

```
import numpy as np

numbers = np.arange(100)
```

vs.

```
import numpy

numbers = numpy.arange(100)
```

Both are correct, and do the same thing, but the first example requires less typing, is more convenient, and matches up with most online examples and codebase formatting choices.

Once imported, you can use dot notation to access numpy functions.

One such function, named ```arange```, is very similar to Python's built-in ```range()```.

In other words,

```
np.arange(10000)
```

is the same as

```
list(range(10000))
```

except the first case will give you a ```numpy.ndarray``` data type instead of a ```list``` data type. This is important, because ```numpy.ndarray``` is much more flexible than ```list``` when it comes to complex calculations. There are certainly cases where you will still use ```list``` in your career, but often it is just a *preprocessing* (i.e., preparing and cleaning data) step before converting the ```list``` to a ```numpy.ndarray``` (i.e., casting!). This is very easy to do and I will show examples later in the notebook.
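As a quick preview of that conversion, here is a minimal sketch. The variable names are illustrative only, and the values stand in for the result of some preprocessing step.

```python
import numpy as np

# A plain Python list, e.g., the result of a preprocessing step
python_values = [3, 1, 4, 1, 5]

# Cast the list to a numpy.ndarray
array_values = np.array(python_values)

# Cast back to a regular Python list if you ever need one again
round_trip = array_values.tolist()

print(type(python_values))          # <class 'list'>
print(type(array_values))           # <class 'numpy.ndarray'>
print(round_trip == python_values)  # True
```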

The following example creates a ```numpy.ndarray``` of ```int``` values that range from 0 to 9999. Like ```range```, ```arange``` is *not inclusive* of the last number, so I pass it 10000 to get the values 0 through 9999. The same slicing and indexing rules we learned with ```list``` generally apply (i.e., the last number in a slice is not inclusive for both ```list``` and ```numpy.ndarray```), with some exceptions / nuances we will discuss and examine later.

In [16]:
import numpy as np

numpy_array = np.arange(10000)

print(numpy_array[:10]) # first ten ints
print(numpy_array[-10:]) # last ten ints
print(len(numpy_array)) # length of numpy array

[0 1 2 3 4 5 6 7 8 9]
[9990 9991 9992 9993 9994 9995 9996 9997 9998 9999]
10000
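Like ```range```, ```arange``` also accepts optional start and step values. The short sketch below (variable names are illustrative) shows that the stop value is excluded in every case and that each call matches the equivalent ```range``` call.

```python
import numpy as np

stop_only = np.arange(5)         # 0, 1, 2, 3, 4 (5 is excluded)
start_stop = np.arange(2, 6)     # 2, 3, 4, 5 (6 is excluded)
with_step = np.arange(0, 10, 2)  # 0, 2, 4, 6, 8 (10 is excluded)

# Each array holds the same values as the equivalent range() call
print(stop_only.tolist() == list(range(5)))         # True
print(start_stop.tolist() == list(range(2, 6)))     # True
print(with_step.tolist() == list(range(0, 10, 2)))  # True
```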


Next, use ```%timeit``` to calculate the sum of a Python list: 

In [17]:
%timeit sum(python_list)

113 µs ± 16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


Similarly, use ```%timeit``` to calculate the sum of a ```numpy.ndarray```:

In [18]:
import numpy as np

%timeit np.sum(numpy_array)

16.1 µs ± 627 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


For this simple example, the numpy version is about 7 times faster, on average, compared to ```list``` (113 µs vs. 16.1 µs).

A difference of ~100 microseconds (0.0001 seconds) might not mean much to you, but imagine doing this same calculation 1 billion times.

This might seem like an impossible situation, but consider this. A ~4-km WRF (Weather Research and Forecasting Model) simulation containing the CONUS can have over 1 million grid points, each with a ```float``` value.

A climate change dataset called "WRF BCC" used by researchers in our department has a simulation grid with ~1 million grid points. We have output every 15 minutes, and our study periods span 75 years.

   - $1 \times 10^{6} \times 4$ = 4 million floats per hour

   - $4 \times 10^{6} \times 24$ = 96 million floats per day

   - $9.6 \times 10^{7} \times 365$ = 35 billion floats per year

   - $3.5 \times 10^{10} \times 75$ = 2.6 trillion floats in the dataset

In other words, if you would like to run a simple calculation like a temperature conversion on the entire dataset, you would need to run that calculation 2.6 trillion times. Those microseconds can **really** add up quickly! And trust me when I say, datasets are not getting any smaller.

Put another way, if each individual calculation on "WRF BCC" took 16 microseconds with a ```numpy.ndarray```, processing the full dataset would take hundreds of days (roughly 490). If each calculation took 113 microseconds with a ```list```, it would take thousands of days (roughly 3,400). Luckily, there are even better solutions than numpy that I will talk about in my course **EAE 495 / 598 - Seminar - Climate Science**.
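The arithmetic behind these estimates can be checked with a few lines of plain Python. This is a rough sketch: it assumes, hypothetically, that every single calculation costs the full 16 or 113 microseconds, and the variable names are illustrative.

```python
# Dataset size, following the numbers quoted above
grid_points = 1_000_000  # ~1 million grid points
outputs_per_hour = 4     # one output every 15 minutes
hours_per_day = 24
days_per_year = 365
years = 75

total_floats = grid_points * outputs_per_hour * hours_per_day * days_per_year * years
print(total_floats)  # 2628000000000, i.e., ~2.6 trillion

# Hypothetical cost per calculation, in microseconds
microseconds_per_calc = {"numpy": 16, "list": 113}
seconds_per_day = 24 * 60 * 60

days_needed = {}
for label, microseconds in microseconds_per_calc.items():
    total_seconds = total_floats * microseconds * 1e-6
    days_needed[label] = total_seconds / seconds_per_day
    print(label, round(days_needed[label]), "days")
```

With these assumptions, the numpy case comes out at hundreds of days and the ```list``` case at thousands of days.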

Other cases can produce an even larger speed up.
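The ```%timeit``` magic is only available in IPython/Jupyter. In a plain Python script, the standard-library ```timeit``` module can make a similar comparison. The following is a minimal sketch: the repetition count is an arbitrary choice, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

python_list = list(range(10000))
numpy_array = np.arange(10000)

repetitions = 1000  # arbitrary choice for this sketch

# Total time (in seconds) to run each sum `repetitions` times
list_seconds = timeit.timeit(lambda: sum(python_list), number=repetitions)
numpy_seconds = timeit.timeit(lambda: np.sum(numpy_array), number=repetitions)

# Convert to microseconds per call for easier comparison
print("list :", list_seconds / repetitions * 1e6, "µs per call")
print("numpy:", numpy_seconds / repetitions * 1e6, "µs per call")

# Both approaches give the same answer
list_total = sum(python_list)
numpy_total = int(np.sum(numpy_array))
print(list_total == numpy_total)  # True
```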

# Basic numpy examples

You only need to import once in your Notebook:

In [19]:
import numpy as np

### Finding the mean of a list of numbers

numpy can interface with Python ```lists``` in many cases.

For example, you can pass a ```list``` as an argument to the ```numpy.mean()``` method:

In [20]:
a = [1, 2, 3, 4, 5]

print("mean of", a, "=", np.mean(a))

mean of [1, 2, 3, 4, 5] = 3.0


We can examine the ```mean``` method by using the ```help()``` function. This usually gives examples on how to use the method at the end of the "doc string":

In [21]:
help(np.mean)

Help on function mean in module numpy:

mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>)
    Compute the arithmetic mean along the specified axis.
    
    Returns the average of the array elements.  The average is taken over
    the flattened array by default, otherwise over the specified axis.
    `float64` intermediate and return values are used for integer inputs.
    
    Parameters
    ----------
    a : array_like
        Array containing numbers whose mean is desired. If `a` is not an
        array, a conversion is attempted.
    axis : None or int or tuple of ints, optional
        Axis or axes along which the means are computed. The default is to
        compute the mean of the flattened array.
    
        .. versionadded:: 1.7.0
    
        If this is a tuple of ints, a mean is performed over multiple axes,
        instead of a single axis or all the axes as before.
    dtype : data-type, optional
        Type to use in computing the mean.

### Sorting a list

In [22]:
a = [3, 2, 1, 5, 4]

print("sorted", a, "=", np.sort(a))

sorted [3, 2, 1, 5, 4] = [1 2 3 4 5]


### Finding the maximum value in a list

In [23]:
a = [3, 2, 1, 5, 4]

print(np.max(a))

5


### Finding the minimum value in a list

In [24]:
a = [3, 2, 1, 5, 4]

print(np.min(a))

1


### Other methods you can use for simple calculations on a ```list```: sum, median, unique, std
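A minimal sketch of these four functions, using an illustrative list that contains one repeated value so that ```unique``` has something to remove:

```python
import numpy as np

a = [3, 2, 1, 5, 4, 4]

total = np.sum(a)           # adds up all values
middle = np.median(a)       # middle value of the sorted data
distinct = np.unique(a)     # sorted array with duplicates removed
spread = np.std(a)          # standard deviation

print("sum =", total)       # sum = 19
print("median =", middle)   # median = 3.5
print("unique =", distinct) # unique = [1 2 3 4 5]
print("std =", spread)
```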

# Comparing numpy arrays to Python lists

Numpy arrays allow fast, *elementwise* mathematical operations.

Here is how you would multiply every number in a Python list by 10:

In [27]:
a = [1, 2, 3, 4, 5]

print("before multiplication", a)

for i in range(len(a)):
    
    a[i] = a[i] * 10

print("After math", a)

before multiplication [1, 2, 3, 4, 5]
After math [10, 20, 30, 40, 50]


You have to visit every index, multiply the value stored there by 10, and then assign the result back to the same index position.

Compare this to numpy, where the only extra step is converting a Python list to a ```numpy.ndarray```.

```np.array``` is a function with one required argument. The argument should be a composite data type (like a ```list```), and the function 'returns' a numpy array representation of that list.

In [29]:
a = np.array([1, 2, 3, 4, 5]) 

print("Before math", a)

a = a * 10 # multiplies every value in 'a' by 10, automatically!

print("After math", a)

Before math [1 2 3 4 5]
After math [10 20 30 40 50]


Although they give the same result, numpy is much easier to work with because it is one line with no loops!
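One detail worth knowing is that the ```*``` operator means something different for a ```list``` than for a ```numpy.ndarray```. The sketch below (variable names are illustrative) contrasts the two and also shows elementwise arithmetic between two arrays of the same length.

```python
import numpy as np

python_list = [1, 2, 3]
numpy_array = np.array([1, 2, 3])

list_result = python_list * 2   # repeats the list
array_result = numpy_array * 2  # doubles each element

print(list_result)   # [1, 2, 3, 1, 2, 3]
print(array_result)  # [2 4 6]

# Arithmetic between two arrays of the same length is also elementwise
other_array = np.array([10, 20, 30])
array_sum = numpy_array + other_array
print(array_sum)     # [11 22 33]
```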

## Python lists and numpy arrays have the same indexing approach:

1. ### Get the 3rd value in a...

<div style="display: inline-block">
    
| Python list| Numpy array|
|------------|------------|
| my_list[2] | my_array[2]|
    
</div>

2. ### Get the 2nd through 3rd values in a... 

<div style="display: inline-block">
    
| Python list  | Numpy array  |
|--------------|--------------|
| my_list[1:3] | my_array[1:3]|
    
</div>

3. ### Skip every other value in a...

<div style="display: inline-block">
    
| Python list  | Numpy array  |
|--------------|--------------|
| my_list[::2] | my_array[::2]|
    
</div>

4. ### Reverse the order of a...

<div style="display: inline-block">
    
| Python list   | Numpy array   |
|---------------|---------------|
| my_list[::-1] | my_array[::-1]|
    
</div>
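The four cases in the tables above can be checked side by side. This is a small sketch with illustrative values; ```my_list``` and ```my_array``` hold the same numbers.

```python
import numpy as np

my_list = [10, 20, 30, 40, 50]
my_array = np.array(my_list)

print(my_list[2], my_array[2])        # 30 30
print(my_list[1:3], my_array[1:3])    # [20, 30] [20 30]
print(my_list[::2], my_array[::2])    # [10, 30, 50] [10 30 50]
print(my_list[::-1], my_array[::-1])  # [50, 40, 30, 20, 10] [50 40 30 20 10]
```

The values match in every case; only the printed format differs (a ```numpy.ndarray``` prints without commas).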

# N-dimensional data

If you have more than 1 dimension, numpy can handle visiting every index location automatically.

To work with a 2D python list, you would have to use nested for loops:

In [31]:
a = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]

print("Before multiply", a)

for i in range(2):
    for j in range(5):
        a[i][j] = a[i][j] * 10

print("After multiply", a)

Before multiply [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
After multiply [[10, 20, 30, 40, 50], [10, 20, 30, 40, 50]]


This is much easier to deal with using numpy:

In [32]:
import numpy as np

a = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
a_n = np.array(a)

print("Before multiply", a_n)

a_n = a_n * 10

print("After multiply", a_n)

Before multiply [[1 2 3 4 5]
 [1 2 3 4 5]]
After multiply [[10 20 30 40 50]
 [10 20 30 40 50]]
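The same idea extends to functions like ```np.mean```, whose ```axis``` argument appeared in the ```help()``` output earlier. The following is a minimal sketch using the same 2D data; the ```shape``` attribute reports the size of each dimension.

```python
import numpy as np

a_n = np.array([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])

print(a_n.shape)  # (2, 5): 2 rows, 5 columns

overall_mean = np.mean(a_n)         # mean of all ten values
column_means = np.mean(a_n, axis=0) # mean down each column
row_means = np.mean(a_n, axis=1)    # mean across each row

print(overall_mean)  # 3.0
print(column_means)  # [1. 2. 3. 4. 5.]
print(row_means)     # [3. 3.]
```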
