<h1>NumPy Basics: Arrays and Vectorized Computation</h1>


NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Many computational packages providing scientific functionality use NumPy's array objects as one of the standard interfaces for data exchange.

Here are some of the things you'll find in NumPy:

- ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities
- Mathematical functions for fast operations on entire arrays of data without having to write loops
- Tools for reading/writing array data to disk and working with memory-mapped files
- Linear algebra, random number generation, and Fourier transform capabilities
- A C API for connecting NumPy with libraries written in C, C++, or FORTRAN

While NumPy by itself does not provide modeling or scientific functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools with array computing semantics, like pandas, much more effectively.

For most data analysis applications, the main areas of functionality we’ll focus on are:

- Fast array-based operations for data munging and cleaning, subsetting and filtering, transformation, and any other kind of computation
- Common array algorithms like sorting, unique, and set operations
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining heterogeneous datasets
- Expressing conditional logic as array expressions instead of loops with if-elif-else branches
- Group-wise data manipulations (aggregation, transformation, and function application)

One of the reasons NumPy is so important for numerical computations in Python is that it is designed for efficiency on large arrays of data. There are a number of reasons for this:

- NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy's library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
- NumPy operations perform complex computations on entire arrays without the need for Python `for` loops, which can be slow for large sequences. NumPy is faster than regular Python code because its C-based algorithms avoid overhead present with regular interpreted Python code.
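
The two points above can be checked with a rough timing sketch. The array size `n` and the variable names are chosen here for illustration, and the measured times will vary from machine to machine, so treat the printed numbers as indicative only:

```python
import time

import numpy as np

n = 1_000_000                # illustrative size, not taken from the text
my_arr = np.arange(n)        # NumPy array holding 0 .. n-1
my_list = list(range(n))     # the same values in a built-in list

# Vectorized: a single expression, the loop runs in compiled code.
start = time.perf_counter()
arr_doubled = my_arr * 2
numpy_seconds = time.perf_counter() - start

# Pure Python: the interpreter processes every element itself.
start = time.perf_counter()
list_doubled = [x * 2 for x in my_list]
list_seconds = time.perf_counter() - start

print(f"NumPy: {numpy_seconds:.4f} s, list: {list_seconds:.4f} s")
print(f"array data buffer: {my_arr.nbytes} bytes")
```

Both approaches produce the same values; only the time and memory needed to get there differ.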

<h1>The NumPy ndarray: A Multidimensional Array Object</h1>

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

To give you a flavor of how NumPy enables batch computations with syntax similar to that of scalar operations on built-in Python objects, we will first import NumPy and create a small array.

In your Anaconda environment, install the `numpy` package using the following command:

In [3]:
conda install numpy

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: c:\Users\dm\anaconda3

  added / updated specs:
    - numpy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-22.11.1              |   py39haa95532_3         931 KB
    ruamel.yaml-0.17.21        |   py39h2bbff1b_0         174 KB
    ruamel.yaml.clib-0.2.6     |   py39h2bbff1b_1         101 KB
    ------------------------------------------------------------
                                           Total:         1.2 MB

The following NEW packages will be INSTALLED:

  ruamel.yaml        pkgs/main/win-64::ruamel.yaml-0.17.21-py39h2bbff1b_0 None
  ruamel.yaml.clib   pkgs/main/win-64::ruamel.yaml.clib-0.2.6-py39h2bbff1b_1 None

The following packages will be UPDATED:

  conda                               22.9.0-py39haa95532_0 -

In [1]:
import numpy as np

In [5]:
data = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])

data

array([[ 1.5, -0.1,  3. ],
       [ 0. , -3. ,  6.5]])

We then write mathematical operations with `data`:

In [6]:
data * 10

array([[ 15.,  -1.,  30.],
       [  0., -30.,  65.]])

In [7]:
data + data

array([[ 3. , -0.2,  6. ],
       [ 0. , -6. , 13. ]])

In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each "cell" in the array have been added to each other.

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:

In [8]:
data.shape

(2, 3)

In [9]:
data.dtype

dtype('float64')

<h2>Creating ndarrays</h2>

The easiest way to create an array is to use the `array` function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:

In [10]:
data1 = [6, 7.5, 8, 0, 1]

arr1 = np.array(data1)
 
arr1

array([6. , 7.5, 8. , 0. , 1. ])

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In [11]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
 
arr2 = np.array(data2)

arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Since `data2` was a list of lists, the NumPy array `arr2` has two dimensions, with shape inferred from the data. We can confirm this by inspecting the `ndim` and `shape` attributes:

In [12]:
arr2.ndim

2

In [13]:
arr2.shape

(2, 4)

Unless explicitly specified, `numpy.array` tries to infer a good data type for the array that it creates. The data type is stored in a special `dtype` metadata object; for example, in the previous two examples we have:

In [14]:
arr1.dtype

dtype('float64')

In [15]:
arr2.dtype

dtype('int32')
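
The `int32` reported for `arr2` reflects the machine these cells were run on. The default integer type depends on the platform and NumPy version, and many setups report `int64` for the same code. As a minimal sketch, passing `dtype` explicitly removes that dependence (the names `inferred` and `explicit` are illustrative):

```python
import numpy as np

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]

inferred = np.array(data2)                  # integer width chosen by NumPy
explicit = np.array(data2, dtype=np.int64)  # integer width chosen explicitly

print(inferred.dtype)   # platform- and version-dependent integer type
print(explicit.dtype)   # int64 everywhere
```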

In addition to `numpy.array`, there are a number of other functions for creating new arrays. As examples, `numpy.zeros` and `numpy.ones` create arrays of 0s or 1s, respectively, with a given length or shape. `numpy.empty` creates an array without initializing its values to any particular value, so its contents are arbitrary and should not be assumed to be zero. To create a higher dimensional array with these functions, pass a tuple for the shape:

In [16]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [17]:
np.ones(5)

array([1., 1., 1., 1., 1.])

In [18]:
np.zeros((3, 6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [19]:
np.empty((1))

array([0.])

`numpy.arange` is an array-valued version of the built-in Python `range` function:

In [20]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

<b>Note:</b> See some important NumPy array creation functions in this <a href="https://numpy.org/doc/stable/user/basics.creation.html">link</a>.
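
A few of the other creation functions can be sketched as follows. The selection is illustrative rather than exhaustive, and the `template` array is made up for the example:

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]])

zeros_like = np.zeros_like(template)   # same shape and dtype as template, all 0s
ones_like = np.ones_like(template)     # same shape and dtype as template, all 1s
filled = np.full((2, 3), 7.5)          # given shape, every element set to 7.5
identity = np.eye(3)                   # 3 x 3 identity matrix

print(zeros_like)
print(ones_like)
print(filled)
print(identity)
```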

<h2>Data Types for ndarrays</h2>

The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:

In [21]:
arr1 = np.array([1, 2, 3], dtype=np.float64)

arr1.dtype

dtype('float64')

In [22]:
arr2 = np.array([1, 2, 3], dtype=np.int32)

arr2.dtype

dtype('int32')

Data types are a source of NumPy's flexibility for interacting with data coming from other systems. In most cases they provide a mapping directly onto an underlying disk or memory representation, which makes it possible to read and write binary streams of data to disk and to connect to code written in a low-level language like C or FORTRAN. The numerical data types are named the same way: a type name, like `float` or `int`, followed by a number indicating the number of bits per element. A standard double-precision floating-point value (what’s used under the hood in Python’s `float` object) takes up 8 bytes or 64 bits. Thus, this type is known in NumPy as `float64`.

<b>Note:</b> See NumPy data types in this <a href="https://numpy.org/doc/stable/user/basics.types.html">link</a>.
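
The relationship between a type name and its size can be inspected directly. This short sketch uses the `itemsize` and `nbytes` attributes; the list of type names is only a sample:

```python
import numpy as np

for name in ("int8", "int32", "float32", "float64"):
    dt = np.dtype(name)
    # itemsize is in bytes; multiply by 8 to get the bit count in the name
    print(name, dt.itemsize, "bytes =", dt.itemsize * 8, "bits")

arr = np.zeros(10, dtype=np.float64)
print(arr.itemsize)  # bytes per element
print(arr.nbytes)    # total bytes in the data buffer
```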

You can explicitly convert or cast an array from one data type to another using ndarray’s `astype` method:

In [23]:
arr = np.array([1, 2, 3, 4, 5])

arr.dtype

dtype('int32')

In [24]:
float_arr = arr.astype(np.float64)

float_arr

array([1., 2., 3., 4., 5.])

In [25]:
float_arr.dtype

dtype('float64')

In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer data type, the decimal part will be truncated:

In [26]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

arr

array([ 3.7, -1.2, -2.6,  0.5, 12.9, 10.1])

In [27]:
arr.astype(np.int32)

array([ 3, -1, -2,  0, 12, 10])

If you have an array of strings representing numbers, you can use `astype` to convert them to numeric form:

In [28]:
# np.bytes_ is the current name; the older alias np.string_ was removed in NumPy 2.0
numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.bytes_)

numeric_strings.astype(float)

array([ 1.25, -9.6 , 42.  ])

If casting were to fail for some reason (like a string that cannot be converted to `float64`), a `ValueError` will be raised. Before, we wrote `float` instead of `np.float64`; NumPy aliases the Python types to its own equivalent data types.
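
A failing cast can be sketched like this. The `bad_strings` array is a hypothetical input invented for the example:

```python
import numpy as np

bad_strings = np.array(["1.25", "not a number", "42"])  # hypothetical input

try:
    converted = bad_strings.astype(np.float64)
except ValueError as exc:
    converted = None
    print("cast failed:", exc)

# The built-in float maps to the same dtype as np.float64.
print(np.dtype(float) == np.float64)
```

Note that the cast is all-or-nothing: one unparseable element makes the whole `astype` call fail.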

You can also use another array’s `dtype` attribute:

In [29]:
int_array = np.arange(10)

floats = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)

int_array.astype(floats.dtype)

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

There are shorthand type code strings you can also use to refer to a `dtype`:

In [30]:
zeros_uint32 = np.zeros(8, dtype="u4")

zeros_uint32

array([0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)
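
To see which data type a given code stands for, one option is to build a `dtype` object from it and print its name. The codes below are a small illustrative sample:

```python
import numpy as np

# Letter = kind (i: signed int, u: unsigned int, f: float); digit = size in bytes.
codes = ["i1", "i4", "u4", "f4", "f8"]
for code in codes:
    print(code, "->", np.dtype(code).name)
```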

<h2>Arithmetic with NumPy Arrays</h2>

Arrays are important because they enable you to express batch operations on data without writing any `for` loops. NumPy users call this vectorization. Any arithmetic operations between equal-size arrays apply the operation element-wise:

In [31]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])

arr

array([[1., 2., 3.],
       [4., 5., 6.]])

In [32]:
arr * arr

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

In [33]:
arr - arr

array([[0., 0., 0.],
       [0., 0., 0.]])

Arithmetic operations with scalars propagate the scalar argument to each element in the array:

In [34]:
1 / arr

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

In [35]:
arr ** 2

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

Comparisons between arrays of the same size yield Boolean arrays:

In [36]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

arr2

array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

In [37]:
arr2 > arr

array([[False,  True, False],
       [ True, False,  True]])
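
To make the idea of vectorization concrete, the sketch below spells out the explicit loop that a single expression such as `arr * arr` replaces. The names `vectorized` and `looped` are illustrative:

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

# Vectorized form: one expression for the whole array.
vectorized = arr * arr

# Equivalent element-by-element loop written out by hand.
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = arr[i, j] * arr[i, j]

print(np.array_equal(vectorized, looped))
```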

<h2>Basic Indexing and Slicing</h2>

NumPy array indexing is a deep topic, as there are many ways you may want to select a subset of your data or individual elements. One-dimensional arrays are simple; on the surface they act similarly to Python lists:

In [38]:
arr = np.arange(10)

arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [39]:
arr[5]

5

In [40]:
arr[5:8]

array([5, 6, 7])

In [41]:
arr[5:8] = 12

arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

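As you can see, if you assign a scalar value to a slice, as in `arr[5:8] = 12`, the value is propagated (or broadcast henceforth) to the entire selection. An important distinction from Python's built-in lists is that array slices are views on the original array: the data is not copied, and any modification to the view is reflected in the source array.

To give an example of this, we will first create a slice of `arr`: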

In [42]:
arr_slice = arr[5:8]

arr_slice

array([12, 12, 12])

Now, when I change values in `arr_slice`, the mutations are reflected in the original array `arr`:

In [43]:
arr_slice[1] = 12345

arr

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])

The "bare" slice `[:]` will assign to all values in an array:

In [44]:
arr_slice[:] = 64

arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

If you are new to NumPy, you might be surprised by this, especially if you have used other array programming languages that copy data more eagerly. As NumPy has been designed to be able to work with very large arrays, you could imagine performance and memory problems if NumPy insisted on always copying data.
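
When an independent copy of a slice is wanted instead of a view, one way is to call the `copy` method on the slice. A minimal sketch, with illustrative variable names:

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]           # shares memory with arr
copy = arr[5:8].copy()    # independent data

copy[:] = -1              # leaves arr untouched
view[:] = 64              # writes through to arr

print(arr)
print(np.shares_memory(arr, view), np.shares_memory(arr, copy))
```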

With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:

In [53]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

arr2d[2]

array([7, 8, 9])

Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements. So these are equivalent:

In [46]:
arr2d[0][2]

3

In [47]:
arr2d[1, 2]

6

In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions. So in the 2 × 2 × 3 array `arr3d`:

In [48]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [49]:
arr3d.shape

(2, 2, 3)

`arr3d[0]` is a 2 × 3 array:

In [45]:
arr3d[0]

array([[1, 2, 3],
       [4, 5, 6]])

Both scalar values and arrays can be assigned to `arr3d[0]`:

In [46]:
old_values = arr3d[0].copy()

arr3d[0] = 42

arr3d

array([[[42, 42, 42],
        [42, 42, 42]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [47]:

arr3d[0] = old_values

arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

Similarly, `arr3d[1, 0]` gives you all of the values whose indices start with `(1, 0)`, forming a one-dimensional array:

In [48]:
arr3d[1, 0]

array([7, 8, 9])

This expression is the same as though we had indexed in two steps:

In [49]:
x = arr3d[1]

# x

x[0]

array([7, 8, 9])

Note that in all of these cases where subsections of the array have been selected, the returned arrays are views.

<b>Note:</b> This multidimensional indexing syntax for NumPy arrays will not work with regular Python objects, such as lists of lists.
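
The difference can be seen in a short sketch. The list `nested` is an illustrative stand-in for any list of lists:

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
arr2d = np.array(nested)

print(arr2d[1, 2])      # comma-separated indices work on an ndarray
print(nested[1][2])     # a list of lists needs chained indexing

try:
    nested[1, 2]        # a tuple index is not accepted by a list
    tuple_index_ok = True
except TypeError as exc:
    tuple_index_ok = False
    print("list indexing failed:", exc)
```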

<h3>Indexing with slices</h3>

Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax:

In [50]:
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

In [51]:
arr[1:6]

array([ 1,  2,  3,  4, 64])

Consider the two-dimensional array from before, `arr2d`. Slicing this array is a bit different:

In [52]:
arr2d 

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [53]:
arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range of elements along an axis. It can be helpful to read the expression `arr2d[:2]` as "select the first two rows of `arr2d`."

You can pass multiple slices just like you can pass multiple indexes:

In [50]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [54]:
arr2d[:2, 1:]

array([[2, 3],
       [5, 6]])

When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indexes and slices, you get lower dimensional slices.

For example, I can select the second row but only the first two columns, like so:

In [57]:
lower_dim_slice = arr2d[1, :2]
lower_dim_slice

array([4, 5])

In [55]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Here, while `arr2d` is two-dimensional, `lower_dim_slice` is one-dimensional, and its shape is a tuple with one axis size:

In [56]:
lower_dim_slice.shape

(2,)

Similarly, I can select the third column but only the first two rows, like so:

In [57]:
arr2d[:2, 2]

array([3, 6])

Note that a colon by itself means to take the entire axis, so you can slice only higher dimensional axes by doing:

In [58]:
arr2d[:, :1]

array([[1],
       [4],
       [7]])

Of course, assigning to a slice expression assigns to the whole selection:

In [59]:
arr2d[:2, 1:] = 0

arr2d

array([[1, 0, 0],
       [4, 0, 0],
       [7, 8, 9]])
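
The effect of mixing integer indexes and slices on the number of dimensions can be sketched by selecting the same row in two ways. The names `row_1d` and `row_2d` are illustrative:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row_1d = arr2d[1, :2]     # integer index on axis 0 drops that axis
row_2d = arr2d[1:2, :2]   # a length-one slice keeps the axis

print(row_1d.shape)
print(row_2d.shape)

# Both selections are views on arr2d rather than copies.
print(np.shares_memory(arr2d, row_1d), np.shares_memory(arr2d, row_2d))
```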

<h2>Boolean Indexing</h2>

Let’s consider an example where we have some data in an array and an array of names with duplicates:

In [60]:
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])

data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])

names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

In [61]:
data

array([[  4,   7],
       [  0,   2],
       [ -5,   6],
       [  0,   0],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

Suppose each name corresponds to a row in the `data` array and we wanted to select all the rows with the corresponding name `Bob`. Like arithmetic operations, comparisons (such as `==`) with arrays are also vectorized. Thus, comparing `names` with the string `Bob` yields a Boolean array:

In [62]:
names == "Bob"

array([ True, False, False,  True, False, False, False])

This Boolean array can be passed when indexing the array:

In [63]:
data[names == "Bob"]

array([[4, 7],
       [0, 0]])

The Boolean array must be of the same length as the array axis it’s indexing. You can even mix and match Boolean arrays with slices or integers (or sequences of integers; more on this later).
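
As a sketch of the length requirement, indexing with a mask of the wrong length fails on recent NumPy versions. The `short_mask` array is a hypothetical example:

```python
import numpy as np

data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])
short_mask = np.array([True, False, True])  # hypothetical mask: length 3, not 7

try:
    data[short_mask]
    mismatch_allowed = True
except IndexError as exc:
    mismatch_allowed = False
    print("indexing failed:", exc)
```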

In these examples, I select from the rows where `names == "Bob"` and index the columns, too:

In [64]:
data[names == "Bob", 1:]

array([[7],
       [0]])

In [65]:
data[names == "Bob", 1]

array([7, 0])

To select everything but `"Bob"` you can either use `!=` or negate the condition using `~`:

In [66]:
names != "Bob"

array([False,  True,  True, False,  True,  True,  True])

In [67]:
~(names == "Bob")

array([False,  True,  True, False,  True,  True,  True])

In [68]:
data[~(names == "Bob")]

array([[  0,   2],
       [ -5,   6],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

The `~` operator can be useful when you want to invert a Boolean array referenced by a variable:

In [69]:
cond = names == "Bob"

data[~cond]

array([[  0,   2],
       [ -5,   6],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

To select two of the three names, combine multiple Boolean conditions with Boolean arithmetic operators like `&` (and) and `|` (or):

In [70]:
mask = (names == "Bob") | (names == "Will")

mask

array([ True, False,  True,  True,  True, False, False])

In [71]:
data[mask]

array([[ 4,  7],
       [-5,  6],
       [ 0,  0],
       [ 1,  2]])

Selecting data from an array by Boolean indexing and assigning the result to a new variable always creates a copy of the data, even if the returned array is unchanged.

<b>Note:</b> The Python keywords `and` and `or` do not work with Boolean arrays. Use `&` (and) and `|` (or) instead.
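
Both points can be sketched together: the keyword form raises an error, while the operator form works, and the selected rows are a copy. The variable names `keyword_worked` and `selected` are illustrative:

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])

try:
    mask = (names == "Bob") or (names == "Will")   # keyword form
    keyword_worked = True
except ValueError as exc:
    keyword_worked = False
    print("`or` failed:", exc)

mask = (names == "Bob") | (names == "Will")        # operator form

selected = data[mask]     # Boolean indexing returns a copy
selected[:] = 0           # so this does not change data
print(data[0])
print(np.shares_memory(data, selected))
```

The parentheses around each comparison matter, because `|` and `&` bind more tightly than `==`.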

Setting values with Boolean arrays works by substituting the value or values on the right-hand side into the locations where the Boolean array's values are `True`. To set all of the negative values in `data` to 0, we need only do:

In [72]:
data[data < 0] = 0

data

array([[4, 7],
       [0, 2],
       [0, 6],
       [0, 0],
       [1, 2],
       [0, 0],
       [3, 4]])

You can also set whole rows or columns using a one-dimensional Boolean array:

In [73]:
data[names != "Joe"] = 7

data

array([[7, 7],
       [0, 2],
       [7, 7],
       [7, 7],
       [7, 7],
       [0, 0],
       [3, 4]])

As we will see later, these types of operations on two-dimensional data are convenient to do with pandas.

<h2>Fancy Indexing</h2>

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays. Suppose we had an 8 × 4 array:

In [74]:
arr = np.zeros((8, 4))

for i in range(8):
    arr[i] = i
    
arr

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])

To select a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order:

In [75]:
arr[[4, 3, 0, 6]]

array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [0., 0., 0., 0.],
       [6., 6., 6., 6.]])

Hopefully this code did what you expected! Using negative indices selects rows from the end:

In [76]:
arr[[-3, -5, -7]]

array([[5., 5., 5., 5.],
       [3., 3., 3., 3.],
       [1., 1., 1., 1.]])

Passing multiple index arrays does something slightly different; it selects a one-dimensional array of elements corresponding to each tuple of indices:

In [77]:
arr = np.arange(32).reshape((8, 4))

arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [78]:

arr[[1, 5, 7, 2], [0, 3, 1, 2]]

array([ 4, 23, 29, 10])

Here the elements `(1, 0), (5, 3), (7, 1)`, and `(2, 2)` were selected. The result of fancy indexing with as many integer arrays as there are axes is always one-dimensional.

The behavior of fancy indexing in this case is a bit different from what some users might have expected (myself included), which is the rectangular region formed by selecting a subset of the matrix’s rows and columns. Here is one way to get that:

In [79]:
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]

array([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])

Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array when assigning the result to a new variable. If you assign values with fancy indexing, the indexed values will be modified:

In [80]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]]

array([ 4, 23, 29, 10])

In [81]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]] = 0

arr

array([[ 0,  1,  2,  3],
       [ 0,  5,  6,  7],
       [ 8,  9,  0, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22,  0],
       [24, 25, 26, 27],
       [28,  0, 30, 31]])
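
Two of the points in this section can be sketched in one example. First, `numpy.ix_` is an alternative way (not used in the text above) to select the rectangular region. Second, reading with fancy indexing yields a copy. The names `rows`, `cols`, and `picked` are illustrative:

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

two_step = arr[rows][:, cols]        # two-step approach from the text
with_ix = arr[np.ix_(rows, cols)]    # alternative: open mesh of row/column indices

print(np.array_equal(two_step, with_ix))

picked = arr[rows]        # fancy indexing on read: a new array
picked[:] = -1            # does not touch arr
print(arr[1])
print(np.shares_memory(arr, picked))
```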

<h2>Transposing Arrays and Swapping Axes</h2>

Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Arrays have the `transpose` method and the special `T` attribute:

In [82]:
arr = np.arange(15).reshape((3, 5))

arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [83]:
arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

When doing matrix computations, you may do this very often—for example, when computing the inner matrix product using `numpy.dot`:

In [84]:
arr = np.array([[0, 1, 0], [1, 2, -2], [6, 3, 2], [-1, 0, -1], [1, 0, 1]])

arr

array([[ 0,  1,  0],
       [ 1,  2, -2],
       [ 6,  3,  2],
       [-1,  0, -1],
       [ 1,  0,  1]])

In [85]:
np.dot(arr.T, arr)

array([[39, 20, 12],
       [20, 14,  2],
       [12,  2, 10]])

The `@` infix operator is another way to do matrix multiplication:

In [86]:
arr.T @ arr

array([[39, 20, 12],
       [20, 14,  2],
       [12,  2, 10]])

Simple transposing with `.T` is a special case of swapping axes. ndarray has the method `swapaxes`, which takes a pair of axis numbers and switches the indicated axes to rearrange the data:

In [87]:
arr

array([[ 0,  1,  0],
       [ 1,  2, -2],
       [ 6,  3,  2],
       [-1,  0, -1],
       [ 1,  0,  1]])

In [88]:
arr.swapaxes(0, 1)

array([[ 0,  1,  6, -1,  1],
       [ 1,  2,  3,  0,  0],
       [ 0, -2,  2, -1,  1]])

`swapaxes` similarly returns a view on the data without making a copy.
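
A minimal sketch of the view behavior: for a two-dimensional array, `.T`, `transpose()`, and `swapaxes(0, 1)` give the same result, and writing through the transposed view changes the original. Variable names are illustrative:

```python
import numpy as np

arr = np.arange(15).reshape((3, 5))

transposed = arr.T
swapped = arr.swapaxes(0, 1)

print(np.array_equal(transposed, swapped))
print(np.array_equal(transposed, arr.transpose()))
print(np.shares_memory(arr, transposed))

transposed[0, 1] = -99    # position (0, 1) in the view is (1, 0) in arr
print(arr[1, 0])
```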