# NumPy
(Víctor Sojo | vsojo@amnh.org)

Here we will take a look at **NumPy**, a third-party library to optimise Python code.


**References:**
+ The most excellent [beginner's tutorial on the NumPy website](https://numpy.org/doc/stable/user/absolute_beginners.html).

## Contents
&emsp;[Getting started](#Getting-started)<br/>
&emsp;&emsp;[Importing required libraries](#Importing-required-libraries)<br/>
&emsp;[Introduction - What's the point of NumPy?](#Introduction---What's-the-point-of-NumPy?)<br/>
&emsp;[Defining a numpy array](#Defining-a-numpy-array)<br/>
&emsp;&emsp;[So, is NumPy always faster?](#So,-is-NumPy-always-faster?)<br/>
&emsp;&emsp;[Defining a specific data type](#Defining-a-specific-data-type)<br/>
&emsp;[Array indexing and slicing works just like with Python lists](#Array-indexing-and-slicing-works-just-like-with-Python-lists)<br/>
&emsp;[How about appending items to the array?](#How-about-appending-items-to-the-array?)<br/>
&emsp;[Filtering desired values](#Filtering-desired-values)<br/>
&emsp;[Multidimensional arrays](#Multidimensional-arrays)<br/>
&emsp;[Pre-filled arrays](#Pre-filled-arrays)<br/>
&emsp;&emsp;[Use np.zeros\(\) to get an array filled with... 0s:](#Use-np.zeros\(\)-to-get-an-array-filled-with...-0s:)<br/>
&emsp;&emsp;[Use np.ones\(\) to get an array filled with... 1s:](#Use-np.ones\(\)-to-get-an-array-filled-with...-1s:)<br/>
&emsp;&emsp;[Use np.empty\(\) to get an array that is... NOT empty!?!?!:](#Use-np.empty\(\)-to-get-an-array-that-is...-NOT-empty!?!?!:)<br/>
&emsp;&emsp;[Use np.arange\(\) to create an array with a succession of numbers](#Use-np.arange\(\)-to-create-an-array-with-a-succession-of-numbers)<br/>
&emsp;&emsp;[Use np.random.random\(\) to get an array with random floating-point numbers](#Use-np.random.random\(\)-to-get-an-array-with-random-floating-point-numbers)<br/>
&emsp;&emsp;[Pre-filled multi-dimensional arrays](#Pre-filled-multi-dimensional-arrays)<br/>
&emsp;&emsp;[Use np.random.randint\(max, size=\(\)\) to get an array with random integers](#Use-np.random.randint\(max,-size=\(\)\)-to-get-an-array-with-random-integers)<br/>
&emsp;[Reshaping arrays and transposing matrices](#Reshaping-arrays-and-transposing-matrices)<br/>
&emsp;&emsp;[Reshaping a 1-dimensional array to a 2-d \"matrix\"](#Reshaping-a-1-dimensional-array-to-a-2-d-\"matrix\")<br/>
&emsp;&emsp;[Transposing a matrix \(a multi dimensional array\) with .T](#Transposing-a-matrix-\(a-multi-dimensional-array\)-with-.T)<br/>
&emsp;[Saving numpy arrays to files](#Saving-numpy-arrays-to-files)<br/>
&emsp;[If you're seeing your 2-D array as a table with column names, you want pandas, not numpy](#If-you're-seeing-your-2-D-array-as-a-table-with-column-names,-you-want-pandas,-not-numpy)<br/>
&emsp;[Where to from here](#Where-to-from-here)<br/>

## Getting started
Let's make sure that we're using the `data` environment that we created in the `Py301` notebook:

You should see `bioinfo` being printed out.

If you're on Windows, remember that every line starting with a `!`, such as `!my code`, should be changed to `!wsl my code`, and you should have an active [WSL installation](https://docs.microsoft.com/en-us/windows/wsl/install-win10).

### Importing required libraries
We will need:

Module        | Use
:-------------|:-----------------------------------------
**numpy**  | Makes big data and scientific computation much more efficient.

Note that `numpy` is almost always imported as `np` for brevity.
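
A minimal sketch of the conventional import. Printing the version is an optional extra, handy because some behaviour differs between NumPy releases:

```python
import numpy as np  # "np" is the conventional alias

# Optional: check which NumPy version is installed
print(np.__version__)
```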

## Introduction - What's the point of NumPy?
**Q: Isn't Python already a very efficient language?**<br/>
Well, yes, but not really. Python was built to make _writing_ code extremely efficient. The running of the code itself is pretty fast too, but it is not optimal for scientific computation.

In particular, the way Python accesses memory is a bit problematic for big data. Imagine a Python `list`. The way Python places it into memory is by giving a position to each element independently, and adjusting the size of memory requested as necessary depending on what's stored. This is efficient for flexibility, since it allows us to put anything we wish into a list:
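
For instance (the values here are made up purely for illustration):

```python
# A single list can hold values of completely different types and sizes
mixed_list = [42, "a string", 3.14, [1, 2, 3], None]

for item in mixed_list:
    print(type(item).__name__, item)
```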

This is extremely flexible and comfortable for coding. However, this flexibility comes at a price. In particular, since Python doesn't know what exactly you're planning to put into positions `0`, `1`, and `2`, it can't store the items side by side in equally sized slots. Instead, each item ends up wherever Python found room for it, possibly far away from the others, and the list has to keep track of where each one lives. This is wasteful.

Consider a list with only numbers:
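
Something like this, with illustrative values:

```python
# Every element is an integer, and all are well below 1 billion
numbers_list = [12, 7, 1500, 3, 250000]

print(all(isinstance(n, int) for n in numbers_list))  # True
print(max(numbers_list) < 1_000_000_000)              # True
```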

If you know for sure that your list will contain only numbers of relatively similar sizes (say, all less than 1 billion), you don't need the memory size to be so flexible. **It would be much more efficient to have each of the values placed right next to each other, in equally sized contiguous chunks of memory**. This way, when you ask for position `[3]`, the computer can find it much quicker by simply multiplying the size of the chunk – which is fixed – times the position you're looking for. This is, in a not entirely accurate nutshell, what NumPy does.
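
To make the "fixed chunk size times position" idea concrete, here is a small sketch using an array's `itemsize` and `strides` attributes. The values and the choice of `np.int64` are just assumptions for the example:

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50], dtype=np.int64)

print(values.itemsize)  # bytes per element: 8
print(values.strides)   # bytes to step to reach the next element: (8,)
print(values.nbytes)    # total bytes for the data: 40

# Where does position 3 start? chunk size * position
offset = values.itemsize * 3
print(offset)  # 24
```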

**In another (not entirely accurate) nutshell:** NumPy optimises Python code chiefly by turning **`list`s** with somewhat random positions and sizes in memory into **`array`s** with pre-defined sizes and locations in memory. This makes processing big data far quicker. In the words of NumPy themselves:
>_While a Python list can contain different data types within a single list, all of the elements in a NumPy array should be homogeneous._

These arrays don't have to be one-dimensional. Just like a Python `list`, they can be n-dimensional, so you can have matrices of whatever dimensions you choose. This makes `numpy` great for big-data analyses.

Let's take a short ride with `numpy` in the rest of this notebook.

## Defining a `numpy` array
The simplest way to define a numpy array: you just feed a list to it:
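
A minimal sketch (the numbers are arbitrary):

```python
import numpy as np

my_array = np.array([1, 2, 3, 4, 5])

print(my_array)        # [1 2 3 4 5]
print(type(my_array))  # <class 'numpy.ndarray'>
```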

We can also feed a pre-existing list by name:
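
For example, assuming a list called `my_list` (an illustrative name):

```python
import numpy as np

my_list = [10, 20, 30, 40]
my_np_array = np.array(my_list)

print(my_np_array)  # [10 20 30 40]
```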

This doesn't seem to be anything special. What's special is how the array is stored in memory, and therefore how long it takes to run complex calculations.

For example, let's do some absurdly silly calculation, and let's use the Jupyter `%%timeit` magic to tell us how long a cell is taking to run that calculation on average.

First, let's do it with the regular `list`:

And now the `numpy.array`:
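
A rough sketch of such a comparison. It uses the standard-library `timeit` module as a stand-in for the `%%timeit` magic so that it also runs outside Jupyter, and the calculation (a sum of squares) and sizes are assumptions chosen for illustration. The timings will differ from machine to machine:

```python
import timeit
import numpy as np

n = 100_000
big_list = list(range(n))
big_array = np.arange(n, dtype=np.int64)  # explicit 64-bit ints to avoid overflow

def with_list():
    # square every element and add them up, one item at a time
    return sum(x * x for x in big_list)

def with_array():
    # the same calculation, applied to the whole array at once
    return int(np.sum(big_array * big_array))

list_time = timeit.timeit(with_list, number=20)
array_time = timeit.timeit(with_array, number=20)

print(f"list:  {list_time:.4f} s for 20 runs")
print(f"array: {array_time:.4f} s for 20 runs")
```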

As you can see, it is much quicker with numpy.

### So, is NumPy always faster?
⚠️ Don't be fooled into thinking that it's always worth using numpy. Sometimes it may actually make things slower, in particular for simple calculations on simple data.

**With `list`:**

**With `numpy.array`:**
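
Both cases can be compared side by side in a sketch like the following, again using the standard-library `timeit` module rather than the `%%timeit` magic. The three-element data is an assumption picked to represent "simple calculations on simple data":

```python
import timeit
import numpy as np

small_list = [1, 2, 3]
small_array = np.array(small_list)

# Summing three numbers: the overhead of calling into NumPy tends to dominate
list_time = timeit.timeit(lambda: sum(small_list), number=20_000)
array_time = timeit.timeit(lambda: np.sum(small_array), number=20_000)

print(f"list:  {list_time:.4f} s")
print(f"array: {array_time:.4f} s")
```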

On my machine at least, the `numpy` code takes over 10 times longer than the normal `list`. So, it isn't always worth making your lists into arrays. The bigger and more repetitive your data, the more likely it is that `numpy` will help.

### Defining a specific data type
The advantage of NumPy is thus that every item in an array can be of the same type and occupy an equally sized space in contiguous blocks of memory. If you don't specify the type, NumPy will make a best guess, but if you do know that your numbers will never be negative and will all fit in 8 bits, for example, you could specify `np.uint8`:
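
A minimal sketch, assuming an array of ages (the name `ages` and the values are illustrative):

```python
import numpy as np

ages = np.array([34, 7, 81, 19], dtype=np.uint8)

print(ages)
print(ages.dtype)     # uint8
print(ages.itemsize)  # 1 (one byte per element)
```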

There's no apparent change to when we defined the array above, but this way, the largest number we can have is an unsigned (always positive) 8-bit number with all eight bits on: $2^7+2^6+2^5+2^4+2^3+2^2+2^1+2^0=255$

Seems limited, but if we're absolutely sure that the numbers in this array will never be negative or larger than `255` (e.g. we're dealing with human ages or height in centimetres), then this could be a very efficient choice.

Now, it turns out there was a fellow by the name of [Robert Wadlow](https://en.wikipedia.org/wiki/Robert_Wadlow) who was a whopping **272 cm** tall (8ft 11.1in)! In fact there have been [at least four people](https://en.wikipedia.org/wiki/List_of_tallest_people) taller than 255 cm recorded in history, so our choice of `uint8` may not be so great for height in centimetres and we'd be safer going for `uint16`.

This is important. See what happens if we try to enter poor Robert's height into the first position of our `uint8` array:
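
A sketch of the wrap-around. Note an assumption here: depending on your NumPy version, assigning `272` directly into a `uint8` array may wrap silently, emit a warning, or raise an error, so this example forces the conversion with `.astype()` instead, which wraps in the same way:

```python
import numpy as np

heights_cm = np.array([272, 180, 165])  # default integer type, plenty of room
as_uint8 = heights_cm.astype(np.uint8)  # squeeze into 8 bits: values wrap around

print(as_uint8[0])  # 16
print(272 % 256)    # 16, the same wrap-around done by hand
```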

Why do we get `16`? Well, a simple count will let you see that NumPy just got to the maximum of the `uint8`, which is `255`, and then it kept counting the remaining `17` again from `0`. This gives `16`.

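⚠️ Note that we got no warning whatsoever from NumPy. It just trusts that we know what we're doing. (Newer NumPy releases may warn or raise an error for this particular assignment, but other operations, such as converting with `.astype()`, still wrap around silently.) So: **make sure no element will ever be larger than the maximum of your chosen data type!** ⚠️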

Incidentally, you'll note that we used typical Python indexing with `[0]` to access the first element of the array. Let's look a little closer at that.

## Array indexing and slicing works just like with Python `list`s
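A few illustrative cases, using a made-up array:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60])

print(arr[0])    # 10 (first element)
print(arr[-1])   # 60 (last element)
print(arr[1:4])  # [20 30 40] (positions 1, 2 and 3)
print(arr[::2])  # [10 30 50] (every second element)
```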

## How about appending items to the array?
If you try to do something like
```python
my_np_array.append(new_item)
```
you'll get an error because numpy arrays don't have the method `.append()`. How about `.add()` or something like that? Nope, also an error.

So, how do you append new items to the end of an array? The answer may shock you:
⚠️ **You do not append anything to an array. That would defeat the whole purpose of `numpy`** ⚠️<br/>
Remember that NumPy arrays take a fixed space in memory because the computer always knows how many items of which size they contain. If you tried to add an item, the entire array would have to be rebuilt from scratch. So, what should you do? Best to create a pre-filled array with as many elements as you think you could possibly ever have, and then fill it in. We'll look into defining pre-filled arrays below.
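
One possible way to work with a fixed-size array is sketched here: allocate it up front, fill in positions as data arrives, and keep a counter of how many are in use. The names `max_items`, `readings` and `count` are hypothetical:

```python
import numpy as np

max_items = 5                                   # the most we ever expect to store
readings = np.zeros(max_items, dtype=np.int64)  # allocated once, up front
count = 0                                       # how many positions are in use

for value in [7, 3, 9]:
    readings[count] = value  # overwrite a position instead of appending
    count += 1

print(readings)          # [7 3 9 0 0]
print(readings[:count])  # [7 3 9]
```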

## Filtering desired values
You can filter elements of an array that meet a certain condition. For example, let's declare an array to contain all numbers between `0` and `30`, and then filter it to get all the elements that are larger than 12:
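
A minimal sketch (the variable name `numbers` is an assumption):

```python
import numpy as np

numbers = np.arange(31)  # 0, 1, 2, ..., 30

larger_than_12 = numbers[numbers > 12]
print(larger_than_12)  # everything from 13 up to 30
```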

You can combine two _necessary_ conditions (both of which must be met) with `&`:
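
For example, to keep only the numbers that are larger than 12 and also smaller than 20. Note that each condition needs its own parentheses:

```python
import numpy as np

numbers = np.arange(31)

in_between = numbers[(numbers > 12) & (numbers < 20)]
print(in_between)  # [13 14 15 16 17 18 19]
```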

And you can also specify _alternative_ conditions (only one of which must be met) with `|` (read "or"). For example, here are the multiples of `3` _and_ the multiples of `5`:
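
A sketch of that filter (as with `&`, each condition goes in its own parentheses):

```python
import numpy as np

numbers = np.arange(31)

threes_or_fives = numbers[(numbers % 3 == 0) | (numbers % 5 == 0)]
print(threes_or_fives.tolist())
# [0, 3, 5, 6, 9, 10, 12, 15, 18, 20, 21, 24, 25, 27, 30]
```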

Note that for a computer this is an "or", not an "and": _is either this or that condition valid?_

If you do the filtering without the `[]` subsetting, you get an `array` of `True`s and `False`s:
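
For instance, with a short made-up array:

```python
import numpy as np

numbers = np.arange(6)  # 0 to 5

mask = numbers > 2
print(mask.tolist())  # [False, False, False, True, True, True]
print(mask.dtype)     # bool

# Using the mask inside [] gives back the matching elements
print(numbers[mask])  # [3 4 5]
```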

We have only explored this type of filtering very briefly here, because this is probably better done in a `pandas` dataframe, which we will study in the next lesson.

## Multidimensional arrays
Just send `numpy` a list of lists, obviously with appropriate dimensions (every inner list must have the same length).
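
A minimal sketch with two rows of three values each (the values are arbitrary):

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.shape)  # (2, 3): two rows, three columns
print(matrix.ndim)   # 2
print(matrix[1, 2])  # 6 (second row, third column)
```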

## Pre-filled arrays
NumPy provides a number of very handy functions for creating pre-filled arrays.

### Use `np.zeros()` to get an array filled with... `0`s:
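A minimal sketch, asking for five elements:

```python
import numpy as np

zeros = np.zeros(5)

print(zeros)        # [0. 0. 0. 0. 0.]
print(zeros.dtype)  # float64 (the default)
```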

### Use `np.ones()` to get an array filled with... `1`s:
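A minimal sketch, asking for four elements:

```python
import numpy as np

ones = np.ones(4)

print(ones)        # [1. 1. 1. 1.]
print(ones.dtype)  # float64 (the default)
```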

### Use `np.empty()` to get an array that is... _NOT_ empty!?!?!:
Just when you thought you were figuring this stuff out... the `empty` function actually gives us an array with some semi-random nonsense in it:
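
A minimal sketch. The values printed are unpredictable and will differ from run to run, so none are shown here:

```python
import numpy as np

not_really_empty = np.empty(5)

print(not_really_empty)        # whatever happened to be in that memory
print(not_really_empty.shape)  # (5,)
print(not_really_empty.dtype)  # float64
```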

By default we get floats, but we can specify the type with `dtype`, just like we did above:
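
For example, asking for unsigned 8-bit integers (again, the contents are arbitrary leftovers):

```python
import numpy as np

empty_ints = np.empty(5, dtype=np.uint8)

print(empty_ints)        # arbitrary values between 0 and 255
print(empty_ints.dtype)  # uint8
```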

⚠️ Why on Earth would NumPy give us an array with something in it when we specifically asked for an empty one?! Well... think about it: if we actually had an **empty** array, then it would be an array of nothingness, so it couldn't be an array of numbers. If it is an array of numbers, then it has to have numbers in it; that's how NumPy works. What `np.empty()` does is reserve the memory without bothering to clear it, so you see whatever values happened to be sitting there already.

For this reason, most people try to generate their arrays _after_ they've read in their data and they already have it all in memory.

But what if I don't know what will be in my array when I create it? In that case, I'd prefer to define the array to a desirable size with zeros, with ones, or better still, with an absurd but easily identifiable number within my context. For example, for ages, I would make an array of integers, all `-999`:
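
A sketch of that idea, assuming six records (`n_people` is a hypothetical name). `np.full()` is shown as an alternative that gives the same result in one call:

```python
import numpy as np

n_people = 6  # hypothetical number of records

# An array of integer ones, multiplied by the placeholder value
ages = np.ones(n_people, dtype=np.int64) * -999
print(ages)  # [-999 -999 -999 -999 -999 -999]

# np.full() builds the same array directly
same_ages = np.full(n_people, -999, dtype=np.int64)
print(np.array_equal(ages, same_ages))  # True
```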

(yes, you can multiply an entire array by a number if you wish)

Then I'd go on and fill in my array as necessary.

### Use `np.arange()` to create an array with a succession of numbers
This works much like the normal Python `range()` function, except that it gives you back an array. By default it starts at `0` and goes until _one before_ the number you give it:
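
A minimal sketch:

```python
import numpy as np

first_ten = np.arange(10)

print(first_ten)  # [0 1 2 3 4 5 6 7 8 9]
```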

And just like the normal `range()` function from Python, this one can take starting, ending, and step:
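
For example, starting at `2`, stopping before `20`, in steps of `3` (arbitrary values):

```python
import numpy as np

stepped = np.arange(2, 20, 3)

print(stepped.tolist())  # [2, 5, 8, 11, 14, 17]
```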

### Use `np.random.random()` to get an array with random floating-point numbers
These are uniformly distributed between `0.0` and `1.0`:
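
A minimal sketch. The actual numbers change on every run, so none are shown:

```python
import numpy as np

random_floats = np.random.random(5)

print(random_floats)        # five floats, each at least 0.0 and below 1.0
print(random_floats.shape)  # (5,)
```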

### Pre-filled multi-dimensional arrays
Most of the functions above can create arrays with multiple dimensions if you provide a tuple with the desired dimensions instead of a single number (`np.arange()` is the exception: it always returns a 1-d array):
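
For example, a sketch with 3 rows and 4 columns of random floats (the shape is arbitrary):

```python
import numpy as np

random_matrix = np.random.random((3, 4))  # note the tuple: (rows, columns)

print(random_matrix)
print(random_matrix.shape)  # (3, 4)
```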

This works identically for `zeros` and `ones`.

### Use `np.random.randint(max, size=())` to get an array with random integers
Generating random integers is a little different to floating point numbers, zeros and ones. If we only specify a single number, we simply get a random integer between 0 and (just before) that number:
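
A minimal sketch, using `10` as the upper limit:

```python
import numpy as np

lucky = np.random.randint(10)

print(lucky)  # a single integer from 0 to 9
```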

Go ahead and run that code above several times. You'll see you get random numbers between 0 and 9. To get an array of random integers, we also need to specify a size. This can be one-dimensional:
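
For example, five random integers below `10`:

```python
import numpy as np

random_ints = np.random.randint(10, size=5)

print(random_ints)        # five integers, each from 0 to 9
print(random_ints.shape)  # (5,)
```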

And of course in multiple dimensions too:
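
A sketch with an arbitrary shape of 3 rows by 4 columns:

```python
import numpy as np

random_int_matrix = np.random.randint(10, size=(3, 4))

print(random_int_matrix)
print(random_int_matrix.shape)  # (3, 4)
```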

## Reshaping arrays and transposing matrices

### Reshaping a 1-dimensional array to a 2-d "matrix"
Let's first create a 1-d array with 15 elements:
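
One way to do that is with `np.arange()` (the name `one_d` is illustrative):

```python
import numpy as np

one_d = np.arange(15)  # 0, 1, ..., 14

print(one_d)
print(one_d.shape)  # (15,)
```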

And now we can easily reshape it to 3 rows x 5 columns:
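
A minimal sketch using the `.reshape()` method:

```python
import numpy as np

one_d = np.arange(15)
matrix = one_d.reshape(3, 5)  # 3 rows x 5 columns = 15 elements

print(matrix)
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]]
```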

### Transposing a matrix (a multi dimensional array) with `.T`
Let's transpose that last `3x5` matrix to `5x3` instead:
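
A minimal sketch, rebuilding the `3x5` matrix first so that the example is self-contained:

```python
import numpy as np

matrix = np.arange(15).reshape(3, 5)
transposed = matrix.T  # rows become columns

print(transposed.shape)        # (5, 3)
print(transposed[0].tolist())  # [0, 5, 10] (the first column of the original)
```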

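As you can see, transposing works just like in linear algebra. So do element-by-element operations like addition and subtraction. Multiplication is the exception: `*` multiplies arrays element by element, and the linear-algebra matrix product is written with `@` (or `np.matmul()`) instead.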
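
To make the point about multiplication concrete, here is a small sketch with two made-up `2x2` matrices. In NumPy, `*` multiplies element by element, while the matrix product uses `@`:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20],
              [30, 40]])

print((a + b).tolist())  # [[11, 22], [33, 44]]    element-wise sum
print((a * b).tolist())  # [[10, 40], [90, 160]]   element-wise product
print((a @ b).tolist())  # [[70, 100], [150, 220]] matrix product
```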

## Saving numpy arrays to files
You can easily save an array to a text file such as a `CSV` or `TSV`. For example, we can specify tab as the delimiter to get a `TSV` of the `transposed` matrix that we created in the previous section:
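
A sketch using `np.savetxt()`. The file name and the temporary-directory location are assumptions made so that the example is self-contained, and `fmt="%d"` is an optional extra that writes whole numbers instead of the default scientific notation:

```python
import os
import tempfile
import numpy as np

transposed = np.arange(15).reshape(3, 5).T

# Hypothetical output location: a file in the system's temporary directory
out_path = os.path.join(tempfile.gettempdir(), "transposed.tsv")

np.savetxt(out_path, transposed, delimiter="\t", fmt="%d")

# Read it back to check that nothing was lost
loaded = np.loadtxt(out_path, delimiter="\t", dtype=int)
print(np.array_equal(loaded, transposed))  # True
```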

⚠️ If you're wondering how you could add column names, don't. That's not what `numpy` is for; you want `pandas`. Please read on.

## If you're seeing your 2-D array as a table with column names, you want `pandas`, not `numpy`
NumPy arrays are not meant to be tables, with column names and such. That's what `pandas` is for. Please take a look at the `pandas` lesson (or [their website](https://pandas.pydata.org/)).

## Where to from here
There is a lot more to numpy arrays. A lot. Some of it overlaps with `pandas`, so take a look at the `pandas` lesson too. But to know more about `numpy` itself, I strongly recommend following the excellent [beginner's tutorial on their official website](https://numpy.org/doc/stable/user/absolute_beginners.html), which I relied on very heavily to create this document.