## Notebook 6.0: Intro to numpy

Hands-on usage of numpy arrays

Before starting this notebook you should have completed your assigned reading, chapters 2-3 of the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html), which includes a tutorial on using the package `numpy`. We'll spend a lot of time talking about and using `numpy` because it is by far the most widely used package for scientific computing in Python, and it is incredibly powerful. Follow along with your reading and execute code in a notebook to try out the functions and concepts it introduces. Here, I provide a number of additional exercises that have biological significance and may be more interesting.

### Required software

In [None]:
# conda install numpy -c conda-forge

In [None]:
import numpy as np

### Numpy arrays
Numpy arrays are efficient for storing and operating on sets of values that are all of the same `dtype`. 

In [None]:
# default type is float64 (large)
arr = np.zeros(10)
print(arr)
print(arr.dtype)
print(arr.nbytes)

In [None]:
# default int type is int64 (large)
arr = np.zeros(10, dtype=int)
print(arr)
print(arr.dtype)
print(arr.nbytes)

In [None]:
# smaller int types are faster and use less memory (1/4 of the above example)
arr = np.zeros(10, dtype=np.int16)
print(arr)
print(arr.dtype)
print(arr.nbytes)

In [None]:
# smaller int types are faster and use less memory (int8 is 1/8 the size of the int64 example)
arr = np.zeros(10, dtype=np.int8)
print(arr)
print(arr.dtype)
print(arr.nbytes)

The default numeric types are int64 and float64, which can hold extremely large numbers (int64 goes up to about 9.2 x 10^18). But they take up more memory too. In general this doesn't matter much, since memory is cheap. But when you're writing code for performance, using smaller dtypes can make your code much faster. For example, if your data is always composed of only 4 values (e.g., think of DNA, which has only 4 characters), then you do not need a dtype with so much extra capacity. The danger of using a small dtype is that you can experience 'overflow' if your values exceed its capacity. For example, int8 can only store values from -128 to 127. When overflow happens you can get strange results that will cause big errors in your analyses! This is why the default is to use large dtypes (like np.int64).

To keep things simple when you're first starting to learn numpy you could just always use the default dtype when working with numeric data (np.float64). But it is good to be aware of dtypes since understanding these will help you to become a power user.
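
If you want to check the capacity of an integer dtype rather than memorize it, `np.iinfo` reports the smallest and largest values it can hold. The sketch below also compares the memory used by a small array of DNA bases stored as integer codes; the 0-3 encoding is only an illustrative assumption, not something used elsewhere in this notebook.

```python
import numpy as np

# np.iinfo reports the range of values an integer dtype can hold
for dtype in (np.int8, np.int16, np.int64):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)

# hypothetical encoding: store the four DNA bases as the integers 0-3
codes = np.array([0, 1, 2, 3, 3, 0], dtype=np.uint8)
print(codes.nbytes)                     # 6 bytes (1 per element)
print(codes.astype(np.int64).nbytes)    # 48 bytes (8 per element)
```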

In [None]:
# max int8 is 127 (uh oh, values wrap around to -128 after 127!)
# (astype casts without checking bounds, so the overflow happens silently)
arr = np.arange(125, 135).astype(np.int8)
print(arr)



### Modifying arrays (a copy versus a view)

Although arrays seem similar to lists, they are in fact very different, and you will likely run into many errors early on due to this confusion. Arrays can be indexed and sliced like lists, and they are also mutable, so you can change values within an array just as in a list. However, they differ in how operators behave (e.g., `+ 1` adds 1 to every element of an array, but raises an error on a list), and they have very different functions and attributes.
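
A quick way to see this difference is to apply the same operators to a list and to an array. This is a minimal sketch; the names `nums_list` and `nums_arr` are made up for the example.

```python
import numpy as np

nums_list = [1, 2, 3]
nums_arr = np.array([1, 2, 3])

# for lists, + means concatenation and * means repetition
print(nums_list + [1])    # [1, 2, 3, 1]
print(nums_list * 2)      # [1, 2, 3, 1, 2, 3]

# for arrays, the same operators act on every element
print(nums_arr + 1)       # [2 3 4]
print(nums_arr * 2)       # [2 4 6]

# adding a bare number to a list is an error
try:
    nums_list + 1
except TypeError as err:
    print("TypeError:", err)
```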

Another difference that is important to be aware of is that a different type of object is returned when you <b>slice</b> an array versus a list. This topic may seem like a minor detail, but it is a common 'gotcha', and so a good thing to try to comprehend.

Essentially, an array keeps only a single copy of its data in memory unless you *explicitly* tell it to make another copy by calling the `.copy()` method. Otherwise, what is returned when you slice an array is called a `view` (a view of the same data). If you modify a view of an array then you will have modified the original array as well.

Lists, on the other hand, return a new list (a copy) when you slice them, so the original is unchanged if you modify the result. This is demonstrated below. First I show that mutating a single indexed value is not a problem: lists and arrays behave the same. It is only when slicing that the difference arises.

In [None]:
# mutate the first element in a list
ll = ['a', 'b', 'c']
ll[0] = 'd'
print(ll)

In [None]:
# mutate the first element in an array
arr = np.array(['a', 'b', 'c'])
arr[0] = 'd'
print(arr)

In [None]:
# make a copy of a list and mutate it, both exist as separate instances
lc = ll.copy()
lc[0] = 'e'
print(lc, ll)

In [None]:
# make a copy of an array and mutate it, both exist as separate instances
carr = arr.copy()
carr[0] = 'e'
print(carr, arr)

So far so good. Nothing unexpected happened. But now you will see how they act differently:

In [None]:
# make a copy of list by slicing, mutate first element, and compare: they are different
lsub = ll[:2]
lsub[0] = 'x'
print(lsub, ll)

In [None]:
# make a view of array by slicing, mutate first element, and compare: the original changed too!
asub = arr[:2]
asub[0] = 'x'
print(asub, arr)

In [None]:
# to get a copy of the array we need to call .copy() explicitly!
asub = arr[:2].copy()
asub[0] = 'y'
print(asub, arr)
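
If you are ever unsure whether you are holding a view or a copy, `np.shares_memory` can tell you. The sketch below uses a made-up array named `base`. It also shows that indexing with a list of positions returns a copy, unlike a plain slice.

```python
import numpy as np

base = np.arange(6)

view = base[:3]             # basic slice: a view of the same data
copied = base[:3].copy()    # explicit copy
picked = base[[0, 1, 2]]    # indexing with a list of positions: a copy

print(np.shares_memory(base, view))     # True
print(np.shares_memory(base, copied))   # False
print(np.shares_memory(base, picked))   # False
```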

## Genomic sequence data as an array
The string characters A,C,G,T can be sampled into an array to represent a sequence of DNA. Here we use the `random` module of numpy, which is similar to the `random` module from the standard library, but much more powerful, as it returns arrays and has many more scientific methods for sampling random distributions, as we'll see. The array of sequence data in this case is six rows and 12 columns, or in other words, we have data for 6 haploid individuals for 12 sites of DNA.

In [None]:
seq = np.random.choice(list("ACGT"), size=12, replace=True)  # make array that is 12 bases long
seqs = np.array([seq] * 6)                                   # make 6 copies of seq

In [None]:
print(seqs)

### Fancy indexing

In [None]:
# select the first four rows
seqs[:4, :]

In [None]:
# select the last six columns
seqs[:, -6:]

In [None]:
# select first two rows and first four columns
seqs[:2, :4]

In [None]:
# create boolean mask of whether element is "G"
seqs == "G"

In [None]:
# view the boolean mask as int8 values (easier to read than True/False)
np.int8(seqs == "G")

In [None]:
# create boolean mask for whether any sites in a column are "G"
np.any(seqs == "G", axis=0)
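
Because `seqs` is random, it can help to see the `axis` argument on a small fixed array where you can check the answer by eye. This is a sketch using a made-up array named `toy`: `axis=0` collapses the rows and gives one answer per column, and `axis=1` collapses the columns and gives one answer per row.

```python
import numpy as np

# small fixed example: 3 individuals (rows) x 4 sites (columns)
toy = np.array([
    list("ACGT"),
    list("ACTA"),
    list("GCTA"),
])
is_g = toy == "G"

print(np.any(is_g, axis=0))   # one value per column: [ True False  True False]
print(np.any(is_g, axis=1))   # one value per row:    [ True False  True]
print(np.sum(is_g))           # total number of G's:  2
```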

## Generate variable sequence data

Don't worry too much about this function right now; we'll dive into it in detail in the next notebook. For now we'll just use it to generate variable sequence data.

In [None]:
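def seqdata(ninds, nsites, seed=None):
    """
    Generate a ninds x nsites array of A,C,G,T string data
    and randomly add mutations to some sites.
    """
    # seed the random number generator (None gives a different result each call)
    np.random.seed(seed)
    
    # sample one random sequence and copy it to every individual (row)
    oseq = np.random.choice(list("ACGT"), size=nsites)
    arr = np.array([oseq] * ninds)
    
    # introduce some mutations: each cell is marked with probability 0.1
    muts = np.random.binomial(1, 0.1, (ninds, nsites))
    for col in range(nsites):
        # pick a new base that differs from the original base at this site
        # (sorted so the result is reproducible for a given seed)
        others = sorted(set("ACGT") - set(arr[0, col]))
        newbase = np.random.choice(others)
        mask = muts[:, col].astype(bool)
        arr[mask, col] = newbase
    return arr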

In [None]:
# generate an array of variable sequence data
arr = seqdata(8, 10, 12345)
print(arr)

### Find variable sites
Here we can use broadcasting to compare sequences and find whether there is any variation among them. We could examine each column individually and count the number of distinct elements in it, but a much easier way is to perform an operation over an `axis` of the array that will return True or False depending on whether there is variation. One way is to compare each column to the value in the first row. Broadcasting allows this to work, so that across all rows each value in each column is compared to its respective first row element.

Look at the boolean array below and the array above to see how it identifies the columns that contain variation (columns where not all elements are the same as the first row). Complex tasks like this can be made very simple by learning how to think in terms of array operations.

In [None]:
# ask which elements differ from the first row (True marks a difference)
print(arr != arr[0])
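
To see what broadcasting is doing in the comparison above, it can help to look at the shapes involved. The sketch below uses a small made-up array named `toy` and writes out the "stretching" of the first row by hand with `np.broadcast_to`, only to show that it gives the same answer; in practice you would let numpy do this for you.

```python
import numpy as np

toy = np.array([
    list("ACGT"),
    list("ACGA"),
    list("GCGT"),
])

first = toy[0]
print(toy.shape, first.shape)    # (3, 4) (4,)

# the (4,) row is stretched across all 3 rows before comparing
diff = toy != first
print(diff.shape)                # (3, 4)

# writing the stretch out by hand gives the same answer
stretched = np.broadcast_to(first, toy.shape)
print(np.array_equal(diff, toy != stretched))    # True
```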

### Using *any* and *all*
These are commonly used operations to select or mask parts of an array based on a boolean comparison. Below we call `np.any()` on a boolean array. The operation `arr != arr[0]` returns the boolean array shown above (all True or False values). The `any` function returns True if at least one of the values it examines is True. By telling it to work over an axis we are asking it to tell us if *any* of the values in a given column are True. From looking at the array above we can see that some columns contain a True and some don't. We expect to get a result that includes a True or False value for each column. Thus, by operating over the array with an axis argument we expect the number of *dimensions* of the result to be one less than the original array (i.e., the shape of arr is (8,10) and the shape of the result below is (10,)).

In [None]:
# broadcast with any() to get columns (sites) that are variable
np.any(arr != arr[0], axis=0)
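
The companion function `np.all` asks the opposite question: are *all* of the values True? A minimal sketch on a made-up array named `toy` shows how the two relate, and how the shape of the result depends on whether you pass an axis.

```python
import numpy as np

toy = np.array([
    list("ACGT"),
    list("ACGA"),
    list("GCGT"),
])
diff = toy != toy[0]

# a site is variable if ANY row differs from the first row
variable = np.any(diff, axis=0)
print(variable)               # [ True False False  True]

# a site is invariant if ALL rows match the first row
invariant = np.all(~diff, axis=0)
print(invariant)              # [False  True  True False]

# with an axis the result has one value per column; without it, a single value
print(variable.shape)         # (4,)
print(np.any(diff))           # True
```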

## Masking (filtering)

Here we will use masking through an example for calculating population genetic statistics from a sequence array.

Often we are interested in filtering sequence data based on some criterion before we calculate statistics on it. Examples would be filtering to remove sites with missing data (often coded in DNA by the character `N`), or filtering to remove sites with rare alleles (if an allele is found in very few individuals it may just be an error). The latter is often applied as a minor allele frequency (MAF) filter. Let's practice calculating the minor allele frequency and filtering based on it.

In [None]:
# generate a larger array of variable sequence data
arr = seqdata(16, 10, 12345)
print(arr)

### Calculate the frequency of the rare allele in each column
Let's think about how to do this. First, we need to find which elements in each column differ from the first row, then count them, and then divide by the length of the column (the number of individuals) to get a frequency. All of this information is present in the operations we performed above to find the variable sites. Let's use that same framework here.

#### 1. view which sites are variable

In [None]:
# comparison returns True or False for every value (shown as ints for easier viewing here)
np.int8(arr != arr[0])

#### 2. get each column as a frequency
Try to tease apart what each part is doing here. Open a new jupyter cell below and execute parts of the code. What is returned by the sum function? What is `arr.shape[0]`? Try to figure it out by exploring.

In [None]:
# get sum of each column in array from above. Divide each column sum by the column length.
np.sum(arr != arr[0], axis=0) / arr.shape[0]
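
If you want to check your understanding on something small, here is a sketch of the same calculation on a made-up 4 x 4 array named `toy`. Summing a boolean array counts the True values (True counts as 1), and dividing by the number of rows turns the counts into frequencies. Taking the mean of the boolean array is an equivalent shortcut.

```python
import numpy as np

toy = np.array([
    list("ACGT"),
    list("ACGA"),
    list("GCGT"),
    list("ACGA"),
])
diff = toy != toy[0]

counts = np.sum(diff, axis=0)    # number of differing rows in each column
nrows = toy.shape[0]             # number of rows (individuals)
print(counts)                    # [1 0 0 2]
print(nrows)                     # 4

freqs = counts / nrows
print(freqs)

# the mean of a boolean array is the fraction of True values
print(np.array_equal(freqs, np.mean(diff, axis=0)))   # True
```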

#### 3. to get the minor allele frequency

In [None]:
# store the frequencies computed in the cell above
freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]

# store a copy so we do not modify the original array 'freqs'
maf = freqs.copy()

# subselect sites with freq (>0.5) and modify to be 1-value (so it is a MINOR freq)
maf[maf > 0.5] = 1 - maf[maf > 0.5]

# print minor allele frequencies
print(maf)
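
The masked assignment above is a good exercise in boolean indexing, but the same result can also be written as an element-wise minimum. The sketch below uses made-up frequencies, and it assumes each site has at most two alleles, which is the case for the data generated by `seqdata`.

```python
import numpy as np

# hypothetical frequencies for five sites
freqs = np.array([0.0, 0.0625, 0.25, 0.75, 0.9375])

# masked assignment, as in the cell above
maf = freqs.copy()
maf[maf > 0.5] = 1 - maf[maf > 0.5]

# alternative: element-wise minimum of the frequency and 1 - frequency
maf_min = np.minimum(freqs, 1 - freqs)

print(maf)
print(np.allclose(maf, maf_min))   # True
```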

#### 4. filter columns of the array by MAF
For our analyses we might only want to analyze sites with a MAF > 0.1. This excludes two sites from the original array, one that was not variable and one that was variable at only a single haplotype. 

In [None]:
print(arr[:, maf > 0.1])
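
The same masking pattern works for the other filter mentioned at the start of this section, removing sites with missing data. The array generated by `seqdata` contains no `N` characters, so this sketch uses a small made-up array named `toy`.

```python
import numpy as np

toy = np.array([
    list("ACGN"),
    list("ACGT"),
    list("NCGT"),
])

# True for each column that contains at least one N
has_missing = np.any(toy == "N", axis=0)
print(has_missing)            # [ True False False  True]

# keep only the columns without missing data (~ flips True and False)
filtered = toy[:, ~has_missing]
print(filtered)
# [['C' 'G']
#  ['C' 'G']
#  ['C' 'G']]
```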

<div class="alert alert-success">
    Congrats, you have finished the notebook. There is no assessment here; move on to the next notebook. It is OK if you found this part challenging. It will take some time to become comfortable with the numpy notation. But be sure to practice as you go by dissecting these statements and running them bit by bit in your notebook.
</div>