# McKinney Chapter 4 - NumPy Basics: Arrays and Vectorized Computation

## Introduction

Chapter 4 of Wes McKinney's [*Python for Data Analysis*](https://wesmckinney.com/pages/book.html) discusses the NumPy package (short for numerical Python), which is the foundation for numerical computing in Python, including pandas.

We will focus on:

1. Creating arrays
1. Slicing arrays
1. Performing mathematical operations on arrays
1. Applying functions and methods to arrays
1. Using conditional logic with arrays (i.e., `np.where()` and `np.select()`)

***Note:*** Indented block quotes are from McKinney unless otherwise indicated. The section numbers here may differ from McKinney because we may not discuss every topic.

The typical abbreviation for NumPy is `np`.

In [None]:
import numpy as np

The following prints NumPy arrays to 4 decimal places without changing the precision of the underlying array.

In [None]:
%precision 4

NumPy is critical to numerical computing in Python.
We will often use NumPy via McKinney's pandas, but data analysts must know the fundamentals of NumPy.

> NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Most computational packages providing scientific functionality use NumPy’s array objects as the lingua franca for data exchange.
>
> Here are some of the things you’ll find in NumPy:
> - ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
> - Mathematical functions for fast operations on entire arrays of data without having to write loops.
> - Tools for reading/writing array data to disk and working with memory-mapped files.
> - Linear algebra, random number generation, and Fourier transform capabilities.
> - A C API for connecting NumPy with libraries written in C, C++, or FORTRAN.
>
> Because NumPy provides an easy-to-use C API, it is straightforward to pass data to external libraries written in a low-level language and also for external libraries to return data to Python as NumPy arrays. This feature has made Python a language of choice for wrapping legacy C/C++/Fortran codebases and giving them a dynamic and easy-to-use interface.
>
> While NumPy by itself does not provide modeling or scientific functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools with array-oriented semantics, like pandas, much more effectively. Since NumPy is a large topic, I will cover many advanced NumPy features like broadcasting in more depth later (see Appendix A). For most data analysis applications, the main areas of functionality I’ll focus on are:
> - Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations
> - Common array algorithms like sorting, unique, and set operations
> - Efficient descriptive statistics and aggregating/summarizing data
> - Data alignment and relational data manipulations for merging and joining together heterogeneous datasets
> - Expressing conditional logic as array expressions instead of loops with if-elif-else branches
> - Group-wise data manipulations (aggregation, transformation, function application)
>
> While NumPy provides a computational foundation for general numerical data processing, many readers will want to use pandas as the basis for most kinds of statistics or analytics, especially on tabular data. pandas also provides some more domain-specific functionality like time series manipulation, which is not present in NumPy.
>
> One of the reasons NumPy is so important for numerical computations in Python is because it is designed for efficiency on large arrays of data. There are a number of reasons for this:
> - NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
> - NumPy operations perform complex computations on entire arrays without the need for Python for loops.


McKinney provides a clear example of NumPy's power and speed.
He creates two sequences of the numbers from 0 to 999,999, one as a NumPy array and one as a Python list, then multiplies each by 2.
The NumPy array supports vectorized operations and it "just works" when he multiplies the array by 2.
However, he must use a list comprehension to multiply the list by 2 element-by-element.

In [None]:
my_list = list(range(1000000))

In [None]:
my_arr = np.arange(1000000)

In [None]:
my_list[:5]

In [None]:
my_arr[:5]

Multiplying a list by an integer repeats the list (concatenating copies of it) instead of multiplying elementwise.
So we use a list comprehension to multiply the elements in `my_list` by 2.

In [None]:
# my_list * 2 # concatenates two copies of my_list

In [None]:
# [2 * x for x in my_list] # we use a list comprehension for elementwise multiplication of a list

However, math on NumPy arrays "just works".

In [None]:
my_arr * 2

We use the "magic" function `%timeit` to time these two calculations.

In [None]:
%timeit [x * 2 for x in my_list]

In [None]:
%timeit my_arr * 2

The NumPy version is typically one to two orders of magnitude faster than the list version, although the exact speedup depends on the machine.
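
Outside of a notebook, the standard library's `timeit` module gives a comparable measurement to `%timeit`. The following is a minimal sketch, not McKinney's code; the names `sketch_list` and `sketch_arr` are hypothetical, and the speedup it reports will vary with hardware and NumPy version.

```python
import timeit

import numpy as np

n = 1_000_000
sketch_list = list(range(n))  # hypothetical names, separate from my_list and my_arr
sketch_arr = np.arange(n)

# Run each calculation 5 times and report the average seconds per run
reps = 5
list_time = timeit.timeit(lambda: [x * 2 for x in sketch_list], number=reps) / reps
arr_time = timeit.timeit(lambda: sketch_arr * 2, number=reps) / reps

print(f"list comprehension: {list_time:.4f} s per run")
print(f"NumPy array:        {arr_time:.4f} s per run")
print(f"speedup:            {list_time / arr_time:.0f}x")
```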

## The NumPy ndarray: A Multidimensional Array Object

> One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

We can generate random data to explore NumPy arrays.
Whenever we use random data, we should set the random number seed with `np.random.seed(42)`, which makes our random numbers repeatable.
Because we set the random number seed just before we generate `data`, everyone who runs this code gets an identical `data` array.

In [None]:
np.random.seed(42)
data = np.random.randn(2, 3)

Multiplying `data` by 10 multiplies each element in `data` by 10, and adding `data` to itself does element-wise addition.
The compromise to achieve this common-sense behavior is that NumPy arrays must contain homogeneous data types (e.g., all floats or all integers).

In [None]:
data * 10

Addition in NumPy is also elementwise.

In [None]:
data + data

NumPy arrays also have attributes.
Recall that Jupyter notebooks provide tab completion, which lists these attributes after we type `data.` and press Tab.

In [None]:
data.shape

In [None]:
data.dtype

In [None]:
data[0]

In [None]:
data[0][0] # zero row, then the zero element in the zero row

In [None]:
data[0, 0] # zero row, zero column

### Creating ndarrays

> The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data

In [None]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)

In [None]:
arr1.dtype

We could coerce these values to integers, but we would lose information.
The default is to select a `dtype` that would not lose information.

In [None]:
np.array(data1, dtype=np.int64)

Note that `np.array()` recast the values in `data1` to floats because NumPy arrays must contain homogeneous data types.
A list of lists becomes a two-dimensional array.

In [None]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

In [None]:
arr2.ndim

In [None]:
arr2.shape

In [None]:
arr1.dtype

In [None]:
arr2.dtype

There are several other ways to create NumPy arrays.

In [None]:
np.zeros(10)

In [None]:
np.zeros((3, 6))

The `np.arange()` function is similar to the core `range()` but it creates an array directly.

In [None]:
list(range(15))

In [None]:
np.arange(15)

***Table 4-1*** lists the NumPy array creation functions.

- `array`: Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype; copies the input data by default
- `asarray`:  Convert input to ndarray, but do not copy if the input is already an ndarray 
- `arange`:  Like the built-in range but returns an ndarray instead of a list
- `ones`, `ones_like`:  Produce an array of all 1s with the given shape and dtype; `ones_like` takes another array and produces a `ones` array of the same shape and dtype
- `zeros`, `zeros_like`:  Like `ones` and `ones_like` but producing arrays of 0s instead
- `empty`, `empty_like`:  Create new arrays by allocating new memory, but do not populate with any values like ones and zeros
- `full`, `full_like`:  Produce an array of the given shape and dtype with all values set to the indicated "fill value"
- `eye`, `identity`:  Create a square N-by-N identity matrix (1s on the diagonal and 0s elsewhere)
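
A few of these creation functions are easiest to understand side by side. The sketch below uses a hypothetical `template` array and shows, in particular, the difference between `asarray` (no copy when the input is already an ndarray) and `array` (copies by default).

```python
import numpy as np

template = np.array([[1.5, 2.5, 3.5], [4.5, 5.5, 6.5]])  # hypothetical example array

ones = np.ones_like(template)  # same shape (2, 3) and dtype (float64) as template
filled = np.full((2, 2), 7)    # 2-by-2 array with every value set to 7
identity = np.eye(3)           # 3-by-3 identity matrix

same = np.asarray(template)    # input is already an ndarray, so no copy is made
copied = np.array(template)    # np.array() copies the input by default

print(same is template)                    # True
print(np.shares_memory(copied, template))  # False
```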

### Arithmetic with NumPy Arrays

> Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operations between equal-size arrays applies the operation element-wise

In [None]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])

NumPy array addition is elementwise.

In [None]:
arr + arr

NumPy array multiplication is elementwise.

In [None]:
arr * arr

NumPy array division is elementwise.

In [None]:
1 / arr

In [None]:
arr ** 0.5

In [None]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
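
Comparisons between equal-size arrays are also elementwise and return a Boolean array. The sketch below repeats the definitions of `arr` and `arr2` from above so that it runs on its own; the name `greater` is hypothetical.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

# Each element of arr2 is compared with the element of arr in the same position
greater = arr2 > arr
print(greater)
# [[False  True False]
#  [ True False  True]]
```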

### Basic Indexing and Slicing

One-dimensional arrays index and slice the same way as lists.

In [None]:
arr = np.arange(10)

In [None]:
arr

In [None]:
arr[5]

In [None]:
arr[5:8]

In [None]:
equiv_list = list(range(10))

In [None]:
equiv_list

In [None]:
equiv_list[5:8]

In [None]:
# equiv_list[5:8] = 12 # TypeError: can only assign an iterable

In [None]:
equiv_list[5:8] = [12] * 3

In [None]:
equiv_list

With NumPy arrays, we do not have to jump through this hoop.

In [None]:
arr[5:8] = 12
arr

> As you can see, if you assign a scalar value to a slice, as in `arr[5:8] = 12`, the value is propagated (or broadcasted henceforth) to the entire selection. An important first distinction from Python’s built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.

In [None]:
arr_slice = arr[5:8]

In [None]:
arr_slice

In [None]:
arr_slice[1] = 12345

In [None]:
arr_slice

In [None]:
arr

The `:` slices every element in `arr_slice`.

In [None]:
arr_slice[:] = 64

In [None]:
arr_slice

In [None]:
arr

> If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array, for example, `arr[5:8].copy()`.

In [None]:
arr_slice_2 = arr[5:8].copy()

In [None]:
arr_slice_2

In [None]:
arr_slice_2[:] = 2001

In [None]:
arr_slice_2

In [None]:
arr
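
When it is unclear whether a variable is a view or a copy, we can ask NumPy directly. The following sketch uses `np.shares_memory()` and the `.base` attribute on hypothetical arrays named `base`, `view`, and `copied`.

```python
import numpy as np

base = np.arange(10)       # hypothetical array
view = base[5:8]           # basic slicing returns a view on base
copied = base[5:8].copy()  # .copy() allocates separate memory

print(np.shares_memory(base, view))    # True
print(np.shares_memory(base, copied))  # False
print(view.base is base)               # True

view[:] = -1  # writing through the view changes base
print(base)   # [ 0  1  2  3  4 -1 -1 -1  8  9]
```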

> With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays... Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements.

In [None]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

In [None]:
arr2d

In [None]:
arr2d[2]

In [None]:
arr2d[0][2]

In [None]:
arr2d[0, 2] # row, column notation is more common and typically easier to read

### Indexing with slices

In [None]:
arr2d

In [None]:
arr2d[:2]

In [None]:
arr2d[:2, 1:]

In [None]:
arr2d[1, :2]

In [None]:
arr2d[:2, 2]

A colon (`:`) by itself selects the entire dimension and is necessary to slice higher dimensions.

In [None]:
arr2d[:, :1]

In [None]:
arr2d[:2, 1:] = 0
arr2d

***ALWAYS CHECK YOUR OUTPUT!***
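
One detail worth checking in the output is the shape. An integer index drops a dimension, while a slice keeps it, even if the slice selects only one row or column. A short sketch with a hypothetical array `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # hypothetical array

col_1d = grid[:, 0]   # integer index: the column dimension is dropped
col_2d = grid[:, :1]  # slice: the column dimension is kept

print(col_1d.shape)  # (3,)
print(col_2d.shape)  # (3, 1)
```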

### Boolean Indexing

We can use Booleans (`True`s and `False`s) to slice, too.
Think of `names` as a sequence of seven names that line up with the seven rows in `data`.
To keep things simple, we will not give column names.
The following example is like `index(match(), match())` from Excel.

In [None]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.random.seed(42)
data = np.random.randn(7, 4)

In [None]:
names

In [None]:
data

In [None]:
names == 'Bob'

In [None]:
data[names == 'Bob']

In [None]:
data[names == 'Bob', 2:]

In [None]:
data[names == 'Bob', 3]

In [None]:
names != 'Bob'

The `~` inverts a Boolean condition.

In [None]:
data[~(names == 'Bob')]

In [None]:
cond = names == 'Bob'
data[~cond]

For NumPy arrays, we must use `&` and `|` instead of `and` and `or`.

In [None]:
mask = (names == 'Bob') | (names == 'Will')

In [None]:
data[mask]
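
The reason is that `and` and `or` try to reduce each array to a single `True` or `False`, which is ambiguous for an array with more than one element. The sketch below, with a hypothetical `labels` array, shows the resulting `ValueError` and the elementwise alternative. The parentheses matter because `|` and `&` bind more tightly than `==`.

```python
import numpy as np

labels = np.array(['Bob', 'Joe', 'Will', 'Bob'])  # hypothetical array

try:
    (labels == 'Bob') or (labels == 'Will')  # `or` needs a single truth value
    raised = False
except ValueError as err:
    raised = True
    print(err)

# The | operator combines the two Boolean arrays element by element
either = (labels == 'Bob') | (labels == 'Will')
print(either)  # [ True False  True  True]
```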

The "not" operator is `~` in NumPy.
The "not" operator is typically `!` in other programming languages.

In [None]:
data[~(names == 'Bob')]

In [None]:
data

In [None]:
data < 0

In [None]:
data[data < 0] = 0

In [None]:
data

In [None]:
data[names != 'Joe'] = 7

In [None]:
data

Note:

> Selecting data from an array by boolean indexing always creates a copy of the data, even if the returned array is unchanged.
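
The distinction between selecting and assigning is easy to miss. In the sketch below (hypothetical names `values`, `negative`, and `selected`), changing the selected copy leaves the original alone, while assigning through the Boolean mask changes the original in place.

```python
import numpy as np

values = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # hypothetical array
negative = values < 0

selected = values[negative]  # Boolean selection returns a copy
selected[:] = 99             # modifies only the copy
print(values)                # [-2. -1.  0.  1.  2.]

values[negative] = 0         # assignment through the mask modifies values itself
print(values)                # [0. 0. 0. 1. 2.]
```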

## Universal Functions: Fast Element-Wise Array Functions

> A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.

In [None]:
arr = np.arange(10)

In [None]:
np.sqrt(arr)

Here `np.exp(x)` is $e^x$.

In [None]:
np.exp(arr)

In [None]:
2**arr

> These are referred to as unary ufuncs. Others, such as add or maximum, take two arrays (thus, binary ufuncs) and return a single array as the result.

In [None]:
np.random.seed(42)
x = np.random.randn(8)
y = np.random.randn(8)

In [None]:
np.maximum(x, y)

***Be careful! Function names are not the whole story. Check your output and read the docstring!***

In [None]:
np.max(x)
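
A small example makes the difference concrete: `np.maximum()` compares two arrays element by element and returns an array, while `np.max()` reduces one array to its largest value. The arrays `p` and `q` below are hypothetical.

```python
import numpy as np

p = np.array([1.0, 5.0, 3.0])  # hypothetical arrays
q = np.array([4.0, 2.0, 6.0])

print(np.maximum(p, q))  # elementwise maximum of two arrays: [4. 5. 6.]
print(np.max(p))         # largest value in one array: 5.0
```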

***Table 4-3*** lists the fast, element-wise unary functions:

- `abs`, `fabs`: Compute the absolute value element-wise for integer, floating-point, or complex values
- `sqrt`: Compute the square root of each element (equivalent to arr ** 0.5) 
- `square`: Compute the square of each element (equivalent to arr ** 2)
- `exp`: Compute the exponent $e^x$ of each element
- `log`, `log10`, `log2`, `log1p`: Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively
- `sign`: Compute the sign of each element: 1 (positive), 0 (zero), or –1 (negative)
- `ceil`: Compute the ceiling of each element (i.e., the smallest integer greater than or equal to that number)
- `floor`: Compute the floor of each element (i.e., the largest integer less than or equal to each element)
- `rint`: Round elements to the nearest integer, preserving the dtype
- `modf`: Return fractional and integral parts of array as a separate array
- `isnan`: Return boolean array indicating whether each value is NaN (Not a Number)
- `isfinite`, `isinf`: Return boolean array indicating whether each element is finite (non-inf, non-NaN) or infinite, respectively
- `cos`, `cosh`, `sin`, `sinh`, `tan`, `tanh`: Regular and hyperbolic trigonometric functions
- `arccos`, `arccosh`, `arcsin`, `arcsinh`, `arctan`, `arctanh`: Inverse trigonometric functions
- `logical_not`: Compute truth value of not x element-wise (equivalent to ~arr).
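
Most of these functions return one array, but `modf` returns two. The sketch below, on a hypothetical array `vals`, unpacks the fractional and integral parts and also shows that `rint` keeps the float dtype.

```python
import numpy as np

vals = np.array([3.7, -1.2, 5.0])  # hypothetical array

frac, whole = np.modf(vals)  # fractional parts first, integral parts second
print(whole)                 # [ 3. -1.  5.]

rounded = np.rint(vals)      # nearest integers, still stored as floats
print(rounded)               # [ 4. -1.  5.]
print(rounded.dtype)         # float64
```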

***Table 4-4*** lists the fast, element-wise binary functions:

- `add`: Add corresponding elements in arrays
- `subtract`: Subtract elements in second array from first array
- `multiply`: Multiply array elements
- `divide`, `floor_divide`: Divide or floor divide (truncating the remainder)
- `power`: Raise elements in first array to powers indicated in second array
- `maximum`, `fmax`: Element-wise maximum; `fmax` ignores `NaN`
- `minimum`, `fmin`: Element-wise minimum; `fmin` ignores `NaN`
- `mod`: Element-wise modulus (remainder of division)
- `copysign`: Copy sign of values in second argument to values in first argument
- `greater`, `greater_equal`, `less`, `less_equal`, `equal`, `not_equal`: Perform element-wise comparison, yielding boolean array (equivalent to infix operators >, >=, <, <=, ==, !=)
- `logical_and`, `logical_or`, `logical_xor`: Compute element-wise truth value of logical operation (equivalent to infix operators &, |, ^)
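
The difference between `maximum` and `fmax` only appears when the data contain `NaN`. A minimal sketch with hypothetical arrays `a` and `b`:

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])  # hypothetical arrays with missing values
b = np.array([2.0, 2.0, np.nan])

propagated = np.maximum(a, b)  # NaN wherever either input is NaN
ignored = np.fmax(a, b)        # NaN is ignored when the other input is a number

print(propagated)
print(ignored)
```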

## Array-Oriented Programming with Arrays

> Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is commonly referred to as vectorization. In general, vectorized array operations will often be one or two (or more) orders of magnitude faster than their pure Python equivalents, with the biggest impact in any kind of numerical computations. Later, in Appendix A, I explain broadcasting, a powerful method for vectorizing computations.

### Expressing Conditional Logic as Array Operations

> The numpy.where function is a vectorized version of the ternary expression x if condition else y.

In [None]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

In [None]:
result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]

In [None]:
result

NumPy's `where()` is an if-else statement that operates like Excel's `if()`.

In [None]:
np.where(cond, xarr, yarr)

We could also use `np.select()`.

In [None]:
np.select(
    condlist=[cond, ~cond],
    choicelist=[xarr, yarr]
)
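
`np.select()` is most useful with more than two outcomes, where `np.where()` would need nesting. The sketch below uses a hypothetical `returns` array; conditions are checked in order, and `default` fills the elements that match no condition.

```python
import numpy as np

returns = np.array([-0.03, 0.00, 0.02, -0.01, 0.05])  # hypothetical data

# +1 for positive values, -1 for negative values, and 0 otherwise
signs = np.select(
    condlist=[returns > 0, returns < 0],
    choicelist=[1, -1],
    default=0
)
print(signs)  # [-1  0  1 -1  1]
```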

### Mathematical and Statistical Methods

> A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as methods of the array class. You can use aggregations (often called reductions) like sum, mean, and std (standard deviation) either by calling the array instance method or using the top-level NumPy function.

We will use these aggregations extensively in pandas.

In [None]:
np.random.seed(42)
arr = np.random.randn(5, 4)

In [None]:
arr.mean()

In [None]:
np.mean(arr)

In [None]:
arr.sum()

The aggregation methods above aggregated the whole array.
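We can use the `axis` argument to aggregate along one dimension: `axis=0` aggregates down the rows and returns one value per column, and `axis=1` aggregates across the columns and returns one value per row.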

In [None]:
arr.mean(axis=1)

In [None]:
arr[0].mean()

In [None]:
arr.mean(axis=0)
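
The `axis` argument is easier to verify on a small array whose sums we can compute by hand. The sketch below uses a hypothetical 2-by-3 array named `table`.

```python
import numpy as np

table = np.array([[1, 2, 3], [4, 5, 6]])  # hypothetical array: 2 rows, 3 columns

col_sums = table.sum(axis=0)  # collapses the rows: one sum per column
row_sums = table.sum(axis=1)  # collapses the columns: one sum per row

print(col_sums)  # [5 7 9]
print(row_sums)  # [ 6 15]
```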

In [None]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
arr.cumsum()

In [None]:
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

In [None]:
arr

In [None]:
arr.cumsum(axis=0)

In [None]:
arr.cumprod(axis=1)

***Table 4-5*** lists the basic statistical methods:

- `sum`: Sum of all the elements in the array or along an axis; zero-length arrays have sum 0
- `mean`: Arithmetic mean; zero-length arrays have NaN mean
- `std`, `var`: Standard deviation and variance, respectively, with optional degrees of freedom adjustment (default denominator $n$)
- `min`, `max`: Minimum and maximum
- `argmin`, `argmax`: Indices of minimum and maximum elements, respectively
- `cumsum`: Cumulative sum of elements starting from 0
- `cumprod`: Cumulative product of elements starting from 1

NumPy's `.var()` and `.std()` methods return *population* statistics (denominators of $n$).
The pandas equivalents return *sample* statistics (denominators of $n-1$), which are more appropriate for financial data analysis where we have a sample instead of a population.
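
In NumPy, the `ddof` argument ("delta degrees of freedom") switches between the two denominators. A minimal sketch on a hypothetical array named `sample`:

```python
import numpy as np

sample = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical data

pop_var = sample.var()         # default ddof=0: denominator n
samp_var = sample.var(ddof=1)  # ddof=1: denominator n - 1

print(pop_var)   # 1.25
print(samp_var)  # about 1.667
```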

### Methods for Boolean Arrays

In [None]:
np.random.seed(42)
arr = np.random.randn(100)

In [None]:
(arr > 0).sum() # Number of positive values

In [None]:
(arr > 0).mean() # fraction of positive values

In [None]:
bools = np.array([False, False, True, False])

In [None]:
bools.any()

In [None]:
bools.all()

### Sorting

> Like Python's built-in list type, NumPy arrays can be sorted in-place with the sort method.

In [None]:
np.random.seed(42)
arr = np.random.randn(6)

In [None]:
arr.sort()

In [None]:
np.random.seed(42)
arr = np.random.randn(5, 3)

For two-dimensional arrays (and beyond), we can sort along an axis.
The default sort axis is the last axis.
Recall that `axis=1` operates across the columns, so `arr.sort(1)` sorts the values within each row.
Note that these row sorts are independent of one another.

In [None]:
arr.sort(1)
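
Because the `.sort()` method works in place, it returns `None` and overwrites the original order. When we want to keep the original, the top-level `np.sort()` function returns a sorted copy instead. A short sketch with a hypothetical array named `unsorted`:

```python
import numpy as np

unsorted = np.array([3, 1, 2])  # hypothetical array

sorted_copy = np.sort(unsorted)  # returns a sorted copy
print(unsorted)                  # original is unchanged: [3 1 2]
print(sorted_copy)               # [1 2 3]

result = unsorted.sort()         # sorts in place and returns None
print(result)                    # None
print(unsorted)                  # [1 2 3]
```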

## Practice

***Practice:***
Create a 1-dimensional array named `a1` that counts from 0 to 24 by 1.

***Practice:***
Create a 1-dimensional array named `a2` that counts from 0 to 24 by 3.

***Practice:***
Create a 1-dimensional array named `a3` that counts from 0 to 100 by multiples of 3 and 5.

***Practice:***
Create a 1-dimensional array `a3` that contains the squares of the even integers through 100,000.
How much faster is the NumPy version than the list comprehension version?

***Practice:***
Write functions that mimic Excel's `pv` and `fv` functions.

***Practice:***
Create a copy of `data` named `data2`, and replace negative values with -1 and positive values with +1.

In [None]:
np.random.seed(42)
data = np.random.randn(7, 4)

***Practice:***
Write a function that calculates the number of payments that generate $x\%$ of the present value of a perpetuity given $C_1$, $r$, and $g$.
Recall the present value of a growing perpetuity is $PV = \frac{C_1}{r - g}$.

***Practice:***
Write a function that calculates the internal rate of return given a NumPy array of cash flows.