# NumPy

In the section on [lists](content:lists) you learned how `list` collection types can be used to store multiple individual objects, such as individual numerical values that together correspond to an experimental data set.

While `lists` are convenient for storing and working with relatively small amounts of data, they are not ideal for working with large numerical data sets. Creating large `lists` and iterating over them—for example, to perform some statistical analysis on our dataset—is relatively slow, and can require writing non-trivial amounts of &ldquo;boilerplate&rdquo; code, which can make writing code for data analysis cumbersome and prone to bugs.

For example, imagine we have recorded a set of experimental measurements and we want to compute the mean, $\bar{x}$, and standard error, $\delta \bar{x}$, which are given by:

$$
\bar{x} = \frac{1}{N}\sum_i x_i
$$

$$
\sigma_{N-1} = \sqrt{\frac{1}{N-1}\sum_i\left(x_i - \bar{x}\right)^2}
$$

$$
\delta \bar{x} = \frac{\sigma_{N-1}}{\sqrt{N}}
$$

Using lists this might look like:

In [1]:
import math

data = [1.0026, 1.0019, 0.9972, 0.9986, 1.0009]

def mean(data):
    """Calculate the mean of a dataset.
    
    Args:
        data (list(float)): The numerical data.
        
    Returns:
        float
        
    """
    return sum(data)/len(data)

def std_dev(data, ddof=0):
    """Calculate the standard deviation of a dataset.
    
    Args:
        data (list(float)): The numerical data.
        ddof (int): Delta degrees of freedom. The divisor used is
                    len(data) - ddof; ddof=1 corrects for the bias
                    when estimating the population variance from a
                    limited sample. Default is 0.
                    
    Returns:
        float
        
    """
    x_mean = mean(data)
    return math.sqrt(sum([(x - x_mean)**2 for x in data])/(len(data)-ddof))

x_bar = mean(data)
delta_x_bar = std_dev(data, ddof=1) / math.sqrt(len(data))

print(f'mean = {x_bar} ± {delta_x_bar}')

mean = 1.00024 ± 0.0010171528891961034
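
As a cross-check on the hand-written functions above, the same quantities can be computed with the standard library `statistics` module. This is only a sketch of an alternative, not part of the original analysis; it relies on `statistics.stdev` using the $N-1$ divisor, which corresponds to `ddof=1` in `std_dev`.

```python
import math
import statistics

data = [1.0026, 1.0019, 0.9972, 0.9986, 1.0009]

# statistics.stdev divides by N - 1 (equivalent to ddof=1 above).
x_bar = statistics.mean(data)
delta_x_bar = statistics.stdev(data) / math.sqrt(len(data))

print(f'mean = {x_bar} ± {delta_x_bar}')
```

This saves writing the functions ourselves, but the calculation still runs as pure Python code over a `list`.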


Performing numerical analyses of this type on large datasets is a very common computational workflow. To do this efficiently using compact code we can use the [NumPy](https://numpy.org) package, which provides a large number of numerical computing tools that are fast and efficient.

NumPy is not part of the Python standard library, so it must be installed separately. Like any other module, it must then be imported before use.

````{margin}
```{note}
The ability to import libraries is one of the things that makes the Python programming language so powerful and attractive, giving easy access to algorithms and methods developed by other (probably better) programmers.
Using `as` in the import command lets us assign the `numpy` module to a different (shorter) variable name.
```
````

In [2]:
import numpy as np

We can now reimplement our analysis above using the following, much more compact, `numpy` code:

In [3]:
data = np.array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])

x_bar = data.mean()
delta_x_bar = data.std(ddof=1) / np.sqrt(len(data))

print(f'mean = {x_bar} ± {delta_x_bar}')

mean = 1.00024 ± 0.0010171528891961034


The NumPy version is also significantly faster when we have large datasets to process.

In [4]:
data = np.random.random(10000) # generate 10,000 random numbers between 0 and 1.

````{margin}
```{note}
The `%%timeit` command is not Python, but is a so-called &ldquo;magic&rdquo; command built into the Jupyter Notebook application. Jupyter Notebooks provide several &ldquo;magic&rdquo; commands that are always prefixed by one or more percent signs `%`. The `%%timeit` command here runs the cell it is entered in multiple times and then reports the mean time the cell took to run, ± the standard deviation.
```
````

In [5]:
%%timeit

x_bar = data.mean()
delta_x_bar = data.std(ddof=1) / np.sqrt(len(data))

18.1 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [6]:
%%timeit

x_bar = mean(data)
delta_x_bar = std_dev(data, ddof=1) / math.sqrt(len(data))

2.03 ms ± 48.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
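
Outside a Jupyter Notebook the `%%timeit` magic is not available, but a similar comparison can be sketched with the standard library `timeit` module. The function names, the seeded random generator, and the choice of 20 repeats below are assumptions made for illustration. The absolute timings will differ between machines, so no expected output is shown.

```python
import math
import timeit

import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so that the example is repeatable
values = rng.random(10_000)          # NumPy array of 10,000 random numbers
values_list = values.tolist()        # the same numbers stored in a list

def numpy_version():
    """Standard error of the mean using NumPy array methods."""
    return values.std(ddof=1) / np.sqrt(len(values))

def python_version():
    """Standard error of the mean using explicit loops over a list."""
    n = len(values_list)
    x_mean = sum(values_list) / n
    variance = sum((x - x_mean) ** 2 for x in values_list) / (n - 1)
    return math.sqrt(variance) / math.sqrt(n)

# timeit.timeit returns the total time in seconds for `number` calls.
numpy_time = timeit.timeit(numpy_version, number=20)
python_time = timeit.timeit(python_version, number=20)

print(f'NumPy:       {numpy_time / 20:.2e} s per call')
print(f'Pure Python: {python_time / 20:.2e} s per call')
```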


## NumPy Arrays

NumPy provides its own data type, called the NumPy array. We create a NumPy array in much the same way as we create a `list`, by passing the values to the `np.array()` function:

In [7]:
data = np.array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])

This can then be indexed and sliced in the same way as a `list`:

In [8]:
data[2:5]

array([0.9972, 0.9986, 1.0009])

In [9]:
data[::-1]

array([1.0009, 0.9986, 0.9972, 1.0019, 1.0026])
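
To make the correspondence with `list` indexing explicit, the short sketch below applies the same index and slice expressions to a `list` and to a NumPy array built from it. The variable names are chosen here for illustration.

```python
import numpy as np

values = [1.0026, 1.0019, 0.9972, 0.9986, 1.0009]
values_array = np.array(values)

print(values[-1], values_array[-1])      # last element
print(values[1:4], values_array[1:4])    # second to fourth elements
print(values[::2], values_array[::2])    # every other element
```

In each case the same elements are selected; slicing the `list` gives a new `list`, and slicing the array gives an array.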

NumPy arrays can only contain data of the same type, in contrast to `lists`, which can contain data of different types. Trying to create a NumPy array that contains data of different types will either cause an error, or your data will be converted to a single consistent type (which probably is not what you wanted to happen).

In [10]:
my_list = ['yellow', 43.0, False]
print(my_list)

['yellow', 43.0, False]


In [11]:
my_array = np.array(['yellow', 43.0, False])
print(my_array)

['yellow' '43.0' 'False']
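
One way to see which single type NumPy has settled on is to inspect the `dtype` attribute of the array. The sketch below is illustrative; the exact name printed for the integer and string types can vary between platforms and NumPy versions, so no output is shown.

```python
import numpy as np

int_array = np.array([1, 2, 3])                    # all integers
float_array = np.array([1, 2.5, 3])                # integers and a float
mixed_array = np.array(['yellow', 43.0, False])    # string, float, and bool

print(int_array.dtype)     # an integer type
print(float_array.dtype)   # the integers have been converted to floats
print(mixed_array.dtype)   # everything has been converted to strings
```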


NumPy arrays can be created from `lists`:

In [12]:
data = [1.0026, 1.0019, 0.9972, 0.9986, 1.0009]
data_array = np.array(data)
data_array

array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])

NumPy arrays can also have more than one dimension, which makes them useful for representing multidimensional datasets, or mathematical objects like matrices.

In [13]:
A = np.array([[1, 2, 0], [1, 3, 4], [-2, 0.5, 1]])
A

array([[ 1. ,  2. ,  0. ],
       [ 1. ,  3. ,  4. ],
       [-2. ,  0.5,  1. ]])
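
The dimensions of an array can be checked using its `ndim`, `shape`, and `size` attributes. A minimal sketch using the same matrix:

```python
import numpy as np

A = np.array([[1, 2, 0], [1, 3, 4], [-2, 0.5, 1]])

print(A.ndim)   # number of dimensions
print(A.shape)  # length along each dimension (rows, columns)
print(A.size)   # total number of elements
```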

Individual elements, or slices, of an N-dimensional NumPy array can be referenced using the same syntax as for nested lists. For example, `A[1]` gives the second &ldquo;row&rdquo; of this matrix:

In [14]:
A[1] # second row

array([1., 3., 4.])

which we can then index again to select a single element:

In [15]:
A[1][2] # second row, third element

4.0

NumPy also allows a more compact N-dimensional index notation, using comma-separated values:

In [16]:
A[1,2] # second row, third column

4.0

In [17]:
A[-1,1:] # last row, everything from the second column to the end

array([0.5, 1. ])
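
The comma-separated notation also makes it straightforward to select a column or a rectangular block, which is awkward with nested lists. The following sketch uses the same matrix; the variable names are chosen for illustration.

```python
import numpy as np

A = np.array([[1, 2, 0], [1, 3, 4], [-2, 0.5, 1]])

first_column = A[:, 0]   # every row, first column
top_left = A[:2, :2]     # first two rows, first two columns

print(first_column)
print(top_left)
```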

## Arithmetic with NumPy arrays

The fact that all of the items in a NumPy array are of the same type means that it is possible to perform arithmetic on the array as a whole. 
For example, all of the items in an array can be multiplied by a single value. 

In [18]:
np.array([0.1, 0.2, 0.3]) * 10.0

array([1., 2., 3.])

Alternatively, two NumPy arrays of the same shape can be combined element by element.

In [19]:
np.array([5, 10, 15]) + np.array([5.0, 0.0, -5.0])

array([10., 10., 10.])

These operations on every element of a NumPy array are called **vector operations**.

Using NumPy arrays for value-wise operations such as those shown here is **very** efficient and runs much faster than explicitly looping over elements in a list.
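
For comparison, the sketch below shows what happens when similar operations are attempted with a `list`. Multiplying a `list` by an integer repeats it rather than scaling its items, so scaling a `list` requires an explicit loop (here a list comprehension). The variable names are chosen for illustration.

```python
import numpy as np

values = [0.1, 0.2, 0.3]

repeated = values * 2                       # the list is repeated, not scaled
scaled_list = [x * 10.0 for x in values]    # a list needs an explicit loop
scaled_array = np.array(values) * 10.0      # the array is scaled in one step

print(repeated)
print(scaled_list)
print(scaled_array)
```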

## NumPy functions

In addition to the power of the NumPy array (on which some of Python's most impressive libraries are built), the NumPy library also enables access to a broad range of useful functions. 
For example, the `np.log` function differs from the `math.log` function introduced earlier: the former can operate on every element of a NumPy array, whereas the latter accepts only a single number. 

In [20]:
K = np.array([1.06, 3.8, 15.0, 45.44, 150.6])

np.log(K)

array([0.05826891, 1.33500107, 2.7080502 , 3.81639277, 5.01462732])

In [21]:
from math import log

log(K)

TypeError: only length-1 arrays can be converted to Python scalars

The function from the `math` module raises an error, because it expects a single number rather than an array. 
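
The `math.log` function can still be applied to an array if it is given one number at a time, for example in a list comprehension. The sketch below does this and checks that the result agrees with `np.log`. The `try`/`except` block and the variable names are additions made for illustration.

```python
import math

import numpy as np

K = np.array([1.06, 3.8, 15.0, 45.44, 150.6])

# Passing the whole array to math.log raises a TypeError.
try:
    math.log(K)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# math.log works when it is given one number at a time.
log_K_loop = [math.log(k) for k in K]
log_K_numpy = np.log(K)

print(raised_type_error)
print(np.allclose(log_K_loop, log_K_numpy))
```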

Alongside these mathematical operations, the NumPy library also provides statistical operations on NumPy arrays. 
For example, the sum, mean, and standard deviation are easy to find. These functions can be called using the `np.function_name()` syntax, or as **methods** of a `numpy` array (using dot syntax).

In [28]:
mass_numbers = np.array([112, 114, 115, 116, 117, 118, 119, 120, 122, 124])

print(np.sum(mass_numbers))
print(np.mean(mass_numbers))
print(np.std(mass_numbers))

1177
117.7
3.4942810419312296


In [29]:
print(mass_numbers.sum())
print(mass_numbers.mean())
print(mass_numbers.std())

1177
117.7
3.4942810419312296
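
By default, `std` divides by $N$. As in the standard error calculation at the start of this section, the `ddof` argument can be used to divide by $N-1$ instead. A minimal sketch, with variable names chosen for illustration:

```python
import numpy as np

mass_numbers = np.array([112, 114, 115, 116, 117, 118, 119, 120, 122, 124])
n = len(mass_numbers)

population_std = mass_numbers.std()      # divides by N (ddof=0, the default)
sample_std = mass_numbers.std(ddof=1)    # divides by N - 1

print(population_std)
print(sample_std)
```

The two values differ by a factor of $\sqrt{N/(N-1)}$, which becomes less important as the dataset grows.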


## Exercises: 

1. The fractional abundances of the natural isotopes of tin are:<br>
112: 0.0097,<br>
114: 0.0066,<br>
115: 0.0034,<br>
116: 0.1454,<br>
117: 0.0768,<br>
118: 0.2422,<br>
119: 0.0859,<br>
120: 0.3258,<br>
122: 0.0463,<br>
124: 0.0579<br>
Calculate the average (mean) mass of naturally occurring tin.

2. Rewrite or modify your code from the **Loops** Exercise to calculate the distances between each pair of atoms, using **numpy arrays** to store the atom positions, and **vector arithmetic** to calculate the vectors between pairs of atoms.