# Python for Data Science Teaching Session 3: Data Processing

## Introduction

### Session Objectives

- Creating arrays and accessing elements
- Reshaping and combining arrays
- Arithmetic and aggregations
- Filtering and sorting

## Getting Started with NumPy

### What is NumPy?

In one word, fast. NumPy is a Python package (and part of the SciPy stack) devoted to processing and performing mathematical operations on data. Although faster solutions do exist ([Numba](https://numba.pydata.org/), [TensorFlow](https://www.tensorflow.org/overview), etc.), NumPy is powerful enough for almost all use cases.

The backbone of NumPy is the array. These arrays differ from Pandas dataframes and base Python lists in two ways:

- The same datatype must be used for all elements in an array
- Every axis (dimension) must contain blocks of the same shape (i.e. arrays are rectangular, cuboidal, etc.)

Because we have consistent types and structure, operations on NumPy arrays can be performed much faster than methods we've seen so far.

> **Technical Note**
>
> Technically, the second rule can be relaxed, in which case we obtain what is known as a _ragged array_. This however loses most of the efficiency of NumPy so is advised against.

Typically, we import `numpy` under the alias `np`.

In [None]:
# Import packages
import matplotlib.pyplot as plt
# ...
import pandas as pd

### Creating an Array

#### From Lists

The simplest way to create an array is by passing a number, list, list-of-lists, etc. into `np.array`. As with Pandas dataframes, we can then view the dimensions of the array with the `.shape` attribute.
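As a minimal sketch of the pattern described above (the variable names `scalar`, `vector` and `matrix` are illustrative and not part of the session), each extra level of list nesting adds one dimension to the resulting array:

```python
import numpy as np

# A bare number gives a 0-D array, whose shape is the empty tuple
scalar = np.array(5)
print(scalar.shape)  # ()

# A flat list gives a 1-D array
vector = np.array([1, 2, 3])
print(vector.shape)  # (3,)

# A list of equal-length lists gives a 2-D array (rows, columns)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.shape)  # (2, 3)
```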

In [None]:
# Create a 0-D array and print its shape


In [None]:
# Create a 1-D array and print its shape


> **An Idea to Ponder**
>
> What is the difference between `(3)` and `(3,)`? Perhaps look at their `type(...)`.

In [None]:
# Create a 2-D array and print its shape


If we provide lists containing different datatypes, NumPy will try to coerce them into the same type (giving an error if this process fails).

In [None]:
# Create an array from different data types
arr = np.array([
    [True, 2],
    [3.0, 4]
])
arr

In general, it can be hard to predict what datatype you will obtain when doing this, so it is best practice to ensure that you have the same datatype _before_ creating an array.
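To make the coercion concrete, the sketch below inspects the `dtype` attribute of two mixed-type arrays. The arrays `mixed_numbers` and `mixed_text` are illustrative examples; the exact result depends on which types are mixed.

```python
import numpy as np

# Booleans, integers and floats are all promoted to a common float type
mixed_numbers = np.array([[True, 2], [3.0, 4]])
print(mixed_numbers.dtype)  # float64

# Mixing numbers with text coerces every element to a string
mixed_text = np.array([1, "two", 3.0])
print(mixed_text.dtype.kind)  # U (a unicode string type)

# Converting explicitly with .astype() makes the intended type unambiguous
as_ints = mixed_numbers.astype(np.int32)
print(as_ints.dtype)  # int32
```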

> **Further Reading**
>
> You can convert between types by using the `.astype()` method of a NumPy array, passing in one of the types listed [here](https://numpy.org/devdocs/user/basics.types.html). E.g. `arr.astype(np.int32)`.

Notice that by nesting lists in lists, you can create arrays with different dimensions that contain the same data.

In [None]:
# Create arr1, arr2 with the same data but different dimensions
# ...
print(arr1.shape, arr2.shape)

#### Random Arrays

NumPy has the ability to create random arrays. This is done using the `random` submodule whose documentation can be found [here](https://numpy.org/doc/1.16/reference/routines.random.html).

In [None]:
# Create a 3x2x3 grid of random integers between 1 and 100


> **Warning**
>
> All `np.random` functions use the parameter `size` to specify the array's shape. This is because many distributions, such as `gamma`, already have a `shape` parameter.

We can create reproducible randomness using `np.random.seed(...)`.
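A short sketch of both ideas, assuming the legacy `np.random` functions used throughout this session: `size` sets the shape of the output, and re-seeding before a draw reproduces the same samples. Note that the upper bound of `np.random.randint` is exclusive, so `101` is needed to include 100.

```python
import numpy as np

# `size` (not `shape`) sets the dimensions; the upper bound is exclusive
grid = np.random.randint(1, 101, size=(3, 2, 3))
print(grid.shape)  # (3, 2, 3)

# Seeding before each draw reproduces exactly the same samples
np.random.seed(42)
first_draw = np.random.standard_normal(size=10)
np.random.seed(42)
second_draw = np.random.standard_normal(size=10)
print(np.array_equal(first_draw, second_draw))  # True
```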

In [None]:
# Set the random seed to 42 and generate 10 random standard normal samples


#### Arrays from Dataframes

We can convert entire dataframes, series or indexes into NumPy arrays using the `.to_numpy()` method. It is important to only do this when datatypes are consistent; otherwise, this may result in unexpected type conversions or errors.

> **Good to Know**
>
> In older versions of Pandas, the same effect could be achieved using the `.values` attribute. This is however being phased out in favour of the more consistent `.to_numpy()` approach.

For example, we can extract the predictors from the iris dataset as a NumPy array.

In [None]:
# Load the iris dataset
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.head()

In [None]:
# Extract the predictors from the first 10 rows as an array, X
X = # ...
X

In fact, Pandas dataframes are built on top of NumPy arrays. More on this in the project session.
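The conversion can be sketched without downloading any data. The dataframe `df` below is a hypothetical stand-in (made-up column names and values, not the iris dataset) that shows why it matters to select consistently typed columns before calling `.to_numpy()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a dataset with numeric predictors and a text label
df = pd.DataFrame({
    "length": [5.1, 4.9, 6.3],
    "width": [3.5, 3.0, 3.3],
    "label": ["a", "a", "b"],
})

# Selecting only the numeric columns keeps a consistent float datatype
predictors = df[["length", "width"]].to_numpy()
print(predictors.shape, predictors.dtype)  # (3, 2) float64

# Including the text column forces the generic `object` datatype instead
everything = df.to_numpy()
print(everything.dtype)  # object
```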

### Array Attributes

We have already seen the `shape` attribute of NumPy arrays. There are a few more of interest.
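As a sketch, the attributes in question can be read off a small array whose datatype is fixed explicitly, so that the numbers are predictable. `X_demo` is an illustrative array, not the `X` extracted from iris.

```python
import numpy as np

# A 10x4 array of 64-bit floats
X_demo = np.zeros((10, 4), dtype=np.float64)

print(X_demo.ndim)      # 2 dimensions
print(X_demo.dtype)     # float64
print(X_demo.size)      # 40 elements in total
print(X_demo.itemsize)  # 8 bytes per element
print(X_demo.nbytes)    # 320 bytes, i.e. size * itemsize
```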

In [None]:
# Number of dimensions of X


In [None]:
# Datatype of X


In [None]:
# Number of elements in X


In [None]:
# Storage size for each element in X


In [None]:
# Total storage size of X (= X.size * X.itemsize)


### Indexing and Slicing

We can access individual elements of an array or create sub-arrays in exactly the same way we would with Python lists, or with Pandas dataframes using `iloc`. We can specify as many indexes or slices as we like, up to the number of dimensions in the array, separated by commas.
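The sketch below illustrates these rules on a hypothetical 10x5x3 array called `cube`, filled with the numbers 0 to 149 so that every result is easy to verify by hand.

```python
import numpy as np

# Hypothetical 10x5x3 array filled with 0..149
cube = np.arange(150).reshape(10, 5, 3)

single = cube[7, 4, 2]       # one index per dimension gives a single element
row = cube[3, 4]             # fewer indexes than dimensions gives a sub-array
first_layer = cube[:, :, 0]  # a bare slice (:) keeps that dimension whole
strided = cube[::3, ::-1]    # every 3rd block, with the 2nd dimension reversed

print(single)             # 119
print(row.tolist())       # [57, 58, 59]
print(first_layer.shape)  # (10, 5)
print(strided.shape)      # (4, 5, 3)
```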

> If your knowledge of slicing is lacking, check out [session five](https://education.wdss.io/beginners-python/session-four/) of WDSS's Beginner's Python.

In [None]:
# Create an array to slice
np.random.seed(253)
arr = np.random.random(size=(10, 5, 3))

In [None]:
# Extract the element in position (7, 4, 2)


In [None]:
# Extract the 1D array in position (3, 4)


In [None]:
# Extract the first value of the 3rd dimension


In [None]:
# Reverse the 2nd dimension and extract every 3rd value of the 1st dimension


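As with dataframes (but unlike lists), indexing or slicing out a sub-array returns a reference to the original data rather than a copy, so changes made through the result also change the original.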

In [None]:
X = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
X1 = X[0]
X1 *= 2
X

To create copies, we use the `.copy()` method.

In [None]:
X = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
X1 = X[0].copy()
X1 *= 2
X
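One way to check which of the two behaviours applies is `np.shares_memory`, which reports whether two arrays overlap in memory. This is a sketch using illustrative names (`X_demo`, `view`, `independent`) rather than part of the session's exercises.

```python
import numpy as np

X_demo = np.array([[1, 2, 3], [4, 5, 6]])

view = X_demo[0]                # slicing returns a view onto the same data
independent = X_demo[0].copy()  # .copy() allocates new storage

print(np.shares_memory(X_demo, view))         # True
print(np.shares_memory(X_demo, independent))  # False

# Writing through the view changes the original; writing to the copy does not
view[0] = 100
independent[1] = -1
print(X_demo[0].tolist())  # [100, 2, 3]
```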

> **Further Reading**
>
> NumPy also allows indexing using arrays. This is an incredibly powerful tool for manipulating array elements but is beyond the scope of this course. You can read more about it [here](https://numpy.org/doc/stable/user/basics.indexing.html#index-arrays).

## Reshaping and Combining Arrays

### Reshaping Arrays

We can reshape an array using the `.reshape()` method, passing in the new desired shape as a tuple.

In [None]:
# Flatten a 5x5 random array
arr = np.random.random(size=(5, 5))
# ...

We are allowed to set one dimension to have the value `-1`, in which case this will be chosen so that the size of the new array matches the original.

In [None]:
# Group a flat array into 2x2 blocks
arr = np.random.random(size=(20,))
# ...

We can also use `reshape` to add new dummy dimensions.

In [None]:
# Turn a 2x2 array into a 1x2x2 array
np.array([[1, 2], [3, 4]]) # ...
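The three uses of `.reshape()` described in this section can be sketched as follows. The arrays are built with `np.arange` (an assumption made here so that the contents are predictable) instead of random values.

```python
import numpy as np

# Flatten a 5x5 array into 25 elements
square = np.arange(25).reshape((5, 5))
flat = square.reshape((25,))
print(flat.shape)  # (25,)

# -1 asks NumPy to infer that dimension: 20 elements / (2 * 2) = 5 blocks
blocks = np.arange(20).reshape((-1, 2, 2))
print(blocks.shape)  # (5, 2, 2)

# A dimension of length 1 adds a dummy axis without changing the data
with_dummy = np.array([[1, 2], [3, 4]]).reshape((1, 2, 2))
print(with_dummy.shape)  # (1, 2, 2)
```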

### Combining Arrays

NumPy has many functions for combining (and splitting) arrays. For the sake of time, we will only focus on a few, though you may want to check out the following documentation for yourself:

- [block](https://numpy.org/doc/stable/reference/generated/numpy.block.html)
- [split](https://numpy.org/doc/stable/reference/generated/numpy.split.html)
- [vstack](https://numpy.org/doc/stable/reference/generated/numpy.vstack.html), [hstack](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html), [dstack](https://numpy.org/doc/stable/reference/generated/numpy.dstack.html)

One of the functions we will cover is `concatenate`, which joins a sequence of arrays along an existing axis.

In [None]:
# Combine three random 2x2 arrays into a 6x2 array
X = np.random.random(size=(2, 2))
Y = np.random.random(size=(2, 2))
Z = np.random.random(size=(2, 2))
C = # ...
C.shape

A similar function is `stack`, which also joins a sequence of arrays, creating a _new_ axis in the process.

In [None]:
# Combine three random 2x2 arrays into a 3x2x2 array
D = # ...
D.shape
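To contrast the two functions, the sketch below uses three constant-valued arrays (a hypothetical list called `parts`, not the random `X`, `Y` and `Z` above) so that it is easy to see where each input ends up.

```python
import numpy as np

# Three 2x2 arrays filled with 0, 1 and 2 respectively
parts = [np.full((2, 2), value) for value in range(3)]

# concatenate joins along an existing axis: rows are appended, giving 6x2
joined = np.concatenate(parts, axis=0)
print(joined.shape)  # (6, 2)

# stack creates a new leading axis, keeping each 2x2 array intact: 3x2x2
stacked = np.stack(parts, axis=0)
print(stacked.shape)  # (3, 2, 2)
```

With `stack`, the original arrays can be recovered by indexing the new axis (`stacked[1]` is the array of ones), whereas with `concatenate` they occupy consecutive row ranges (`joined[2:4]`).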

## Maths with NumPy

The main reason we want to use NumPy is for quick and elegant mathematical operations. These take advantage of a technique called _vectorisation_, in which an operation is written once for a whole array and NumPy applies it to every element in compiled code, rather than us looping over the elements in Python.

### Arithmetic

NumPy comes with a wide array (eh?) of built-in mathematical functions. Most are written in low-level, optimised C code so are lightning fast. A full list can be found [here](https://numpy.org/doc/stable/reference/routines.math.html). We can also use most base Python operators, which are then converted to the faster underlying functions.
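As a sketch of the operator and function forms side by side, the block below uses two small 2x2 arrays. The result names (`total`, `product`, and so on) are illustrative.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

total = A + B               # operator form
same_total = np.add(A, B)   # the underlying NumPy function
product = A * B             # element-wise (not matrix) multiplication
roots = np.sqrt(A)          # applied to every element
ratio = np.round(B / A, 1)  # divide element-wise, then round to 1dp
matrix_product = A @ B      # matrix multiplication, for comparison

print(total.tolist())           # [[6, 8], [10, 12]]
print(product.tolist())         # [[5, 12], [21, 32]]
print(ratio.tolist())           # [[5.0, 3.0], [2.3, 2.0]]
print(matrix_product.tolist())  # [[19, 22], [43, 50]]
```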

In [None]:
# Create arrays to play with
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

In [None]:
# Add A and B


In [None]:
# Add A and B using underlying function


In [None]:
# Multiply corresponding elements of A and B


> **Note**
> 
> You can perform matrix multiplication using `np.matmul` or the `@` operator (e.g. `A @ B`).

In [None]:
# Square root the elements in A


In [None]:
# Divide the elements of B by A and round to 1dp


This brings us back to a faster way of plotting functions with `matplotlib`.

In [None]:
x = np.linspace(0, 10, 100)
y = np.log(x + 1)
fig, ax = plt.subplots()
ax.plot(x, y)
plt.show()

> **Further Reading**
>
> It is possible (and incredibly powerful) to perform element-wise operations on arrays with different shapes. This is known as broadcasting and is an advanced NumPy feature documented [here](https://numpy.org/doc/stable/user/basics.broadcasting.html).

We can vectorise our own functions for use with NumPy using `np.vectorize`, though performance will almost certainly suffer in the process. Note, there is no way in Python to make this function correspond to a new operator.

In [None]:
def random_op(x, y):
    rnd = np.random.random()
    if rnd < 0.4:
        return x + y
    elif rnd < 0.8:
        return x - y
    else:
        return x * y
    
# Vectorise random_op and apply to A and B
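Because `random_op` gives different results on every run, the sketch below demonstrates `np.vectorize` on a deterministic, hypothetical function (`positive_gap`) instead. The same wrapping pattern would apply to any function of scalars.

```python
import numpy as np

def positive_gap(x, y):
    """Hypothetical scalar function: x - y if x exceeds y, otherwise 0."""
    if x > y:
        return x - y
    return 0

# np.vectorize wraps the scalar function so that it accepts whole arrays
vectorised_gap = np.vectorize(positive_gap)

P = np.array([[1, 9], [3, 4]])
Q = np.array([[5, 6], [7, 2]])
gaps = vectorised_gap(P, Q)
print(gaps.tolist())  # [[0, 3], [0, 2]]
```

Under the hood this still calls the Python function once per element, which is why it is a convenience rather than a performance tool.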

### Aggregations

Whereas the above operations returned arrays the same size as the original, aggregate operations will return either a single value or a smaller array. The most common of these functions are listed [here](https://www.pythonprogramming.in/numpy-aggregate-and-statistical-functions.html).
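The role of the `axis` argument can be sketched on a small hand-written array. The `scores` data below is hypothetical (2 groups, 3 participants each, 3 measurements) and chosen so that the results can be checked by eye.

```python
import numpy as np

# Hypothetical data: 2 groups, 3 participants each, 3 repeat measurements
scores = np.array([
    [[1, 2, 3], [2, 2, 2], [0, 5, 10]],
    [[4, 4, 5], [1, 9, 1], [3, 3, 3]],
])

grand_total = scores.sum()               # no axis: collapse everything
per_participant = scores.mean(axis=2)    # collapse the measurement axis
per_group_max = scores.max(axis=(1, 2))  # collapse participants and measurements

print(grand_total)             # 60
print(per_participant.shape)   # (2, 3)
print(per_group_max.tolist())  # [10, 9]
```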

In [None]:
# Create data to play with
np.random.seed(42)
# 2 groups, 5 participants in each, 3 repeat measurements for each
arr = np.random.standard_normal(size=(2, 5, 3))

In [None]:
# Sum all elements in arr


In [None]:
# Average value for each participant over the 3 measurements


In [None]:
# Maximum value in each group


A particularly interesting pair of aggregate functions is `argmin`/`argmax`. Rather than finding the smallest/largest element, they tell you where it is.
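A minimal sketch of `argmin`, again on hypothetical hand-written data (`scores`, with 2 groups of 3 participants and 3 measurements each): first aggregate to get a variance per participant, then ask where the smallest one sits within each group.

```python
import numpy as np

# Hypothetical data: 2 groups, 3 participants each, 3 repeat measurements
scores = np.array([
    [[1, 2, 3], [2, 2, 2], [0, 5, 10]],
    [[4, 4, 5], [1, 9, 1], [3, 3, 3]],
])

# Variance of each participant's measurements (axis=-1 is the last axis)
variances = scores.var(axis=-1)

# Position of the smallest variance within each group
steadiest = variances.argmin(axis=1)
print(steadiest.tolist())  # [1, 2]
```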

In [None]:
# Which participant of each group had the lowest variance?


> **Useful Tip**
>
> You can also reference axes using negative indices (e.g. `axis=-1` refers to the last axis).

### Accumulations

A related family of functions is accumulations. These apply a function over an axis, returning the result at each step.

In [None]:
# Cumulative sum of arr
arr = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
# ...
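As a sketch of how the axis changes an accumulation, `np.cumsum` can be applied to a 2x3 array in three ways. The result names are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

running_flat = np.cumsum(arr)            # no axis: accumulate over the flattened array
running_down = np.cumsum(arr, axis=0)    # accumulate down each column
running_across = np.cumsum(arr, axis=1)  # accumulate along each row

print(running_flat.tolist())    # [1, 3, 6, 10, 15, 21]
print(running_down.tolist())    # [[1, 2, 3], [5, 7, 9]]
print(running_across.tolist())  # [[1, 3, 6], [4, 9, 15]]
```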

You can find a full list of these functions by searching for the prefix "cum" on the NumPy [function list](https://numpy.org/doc/stable/reference/routines.math.html?highlight=sum).

> **Further Reading**
>
> You are also free to create your own aggregations (specifically, [reductions](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.reduce.html)) and [accumulations](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.accumulate.html) using the documentation linked here, though do note that these are both quite advanced features of the package.

## Filtering and Sorting

Finally, we will look at how we can filter and sort arrays. We begin with filtering.

### Filtering

We can use comparison operators with NumPy arrays to create Boolean arrays.

In [None]:
# Which rows of A have a sum greater than one? Call the result B
np.random.seed(253)
A = np.random.standard_normal(size=(10, 10))
B = # ...
B

We can then use Boolean arrays to filter along an axis of an array.
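The two steps (build a Boolean array, then index with it) can be sketched on a small hand-written array. `M` and `mask` are hypothetical names used here to avoid clashing with the exercise variables.

```python
import numpy as np

M = np.array([
    [1.0, -2.0, 0.5],
    [2.0, 1.0, 1.0],
    [-1.0, 0.0, 0.5],
    [0.5, 0.5, 0.5],
])

# Comparing the row sums with 1 gives one Boolean per row
mask = M.sum(axis=1) > 1
print(mask.tolist())  # [False, True, False, True]

# Indexing with the mask keeps only the rows marked True
kept = M[mask]
print(kept.shape)  # (2, 3)
```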

In [None]:
# Select only rows with a sum greater than one


> **Review**
>
> Now that you are aware of Boolean arrays, you may want to review the aggregation functions `np.all` and `np.any`.

We can also pass in a NumPy array of indexes (which could have been obtained from `argmin`, `argsort` (see later), or `argwhere` (read docs)).

In [None]:
# Select 2nd and 4th columns using an array of indexes
A[:, np.array([1, 3])]

At this point you are likely wondering what happens when we run code such as `A[np.array([[1, 2], [4, 2]])]` or `A[A > 0]`. The answer all comes down to broadcasting, which we will not cover here, but can be read about at [this link](https://numpy.org/doc/stable/user/basics.broadcasting.html).
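Without going into the underlying rules, the shapes that come out of these two kinds of indexing can at least be observed directly. The sketch below uses a hypothetical 10x10 array, `grid`, containing the values -50 to 49.

```python
import numpy as np

# Hypothetical 10x10 array with a mix of negative and positive values
grid = np.arange(100).reshape(10, 10) - 50

# A Boolean array of the same shape selects matching elements into a flat 1-D result
positives = grid[grid > 0]
print(positives.shape)  # (49,)

# A 2-D array of row indexes returns one full row per index, laid out as 2x2
picked = grid[np.array([[1, 2], [4, 2]])]
print(picked.shape)  # (2, 2, 10)
```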

### Sorting Arrays

There are two ways to sort arrays in NumPy. The first is `arr.sort(...)`, which sorts **in-place**, and the second is `np.sort(arr, ...)`, which returns a sorted **copy** of the original. We will use the former, but the same logic applies to the latter.

The only argument we need to specify in most use cases is `axis` which determines the axis to sort over. It is generally difficult to handle ties although you will find ways if you search online.

In [None]:
# Sort A over the last dimension
A = np.random.random(size=(2, 2, 2))
# ...
print(A)

When using `np.sort()`, by setting `axis=None` we sort the flattened array.
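The difference between the two approaches, and the effect of `axis=None`, can be sketched on a small array with known contents. The names `unsorted`, `by_row`, `flattened` and `in_place` are illustrative.

```python
import numpy as np

unsorted = np.array([[3, 1, 2], [9, 7, 8]])

# np.sort returns a sorted copy and leaves its input untouched
by_row = np.sort(unsorted, axis=-1)
flattened = np.sort(unsorted, axis=None)
print(by_row.tolist())     # [[1, 2, 3], [7, 8, 9]]
print(flattened.tolist())  # [1, 2, 3, 7, 8, 9]
print(unsorted.tolist())   # [[3, 1, 2], [9, 7, 8]]

# The .sort() method rearranges the array itself and returns None
in_place = unsorted.copy()
result = in_place.sort(axis=-1)
print(result)  # None
```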

In [None]:
# Sort the flattened array of A


Finally, we can use `np.argsort()` to return the indices that would result in a sorted array.

In [None]:
# Argsort A, creating B
np.random.seed(42)
A = np.random.randint(100, size=(10,))
B = # ...
B

In [None]:
np.all(np.sort(A) == A[B])
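The same relationship is easier to follow on three hand-picked numbers. In this sketch (`values` and `order` are illustrative names), the indexes returned by `np.argsort` are used to reorder the array, and reversing them gives a descending sort.

```python
import numpy as np

values = np.array([30, 10, 20])

# The smallest value is at index 1, then index 2, then index 0
order = np.argsort(values)
print(order.tolist())  # [1, 2, 0]

# Indexing with the order gives the sorted array
print(values[order].tolist())  # [10, 20, 30]

# Reversing the order sorts from largest to smallest
print(values[order[::-1]].tolist())  # [30, 20, 10]
```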

## Wrapping Up

### Saving NumPy arrays

It is possible to save NumPy arrays to disk for later use. This isn't too common a requirement so we won't cover it in this session, but you can read more about it [here](https://numpy.org/doc/stable/reference/generated/numpy.save.html).

### Sparsity

The sparsity of an array is given by the proportion of zero elements out of all elements. When an array is highly sparse (typically >70-90% depending on the scenario), it can be stored in a different way to make operations faster and use less storage. NumPy only has the ability to work with _dense_ arrays, but thankfully, SciPy has us covered. The feature set of sparse arrays is more limited, but when these meet your needs, they can often offer much faster computations with lower storage requirements. Read more [here](https://docs.scipy.org/doc/scipy/reference/sparse.html).
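As a rough sketch, assuming SciPy is installed, sparsity can be measured with plain NumPy and the array then converted to one of SciPy's sparse formats. CSR is chosen here purely for illustration; the linked documentation describes the alternatives.

```python
import numpy as np
from scipy import sparse

# Hypothetical 100x100 array that is zero everywhere except the diagonal
dense = np.eye(100)

# Sparsity: the proportion of elements that are zero
sparsity = np.mean(dense == 0)
print(sparsity)  # 0.99

# The CSR format stores only the non-zero values and their positions
compact = sparse.csr_matrix(dense)
print(compact.nnz)  # 100 stored values rather than 10,000

# Converting back recovers the original dense array
print(np.array_equal(compact.toarray(), dense))  # True
```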