# CS-6570 Lecture 4 - An Intro to NumPy

**Dylan Zwick**

*Weber State University*

Today, we're going to go over the basics of NumPy, which is short for "Numerical Python". Honestly, it probably would have made more sense for us to discuss NumPy before Pandas, but it's not that critical.

Just as the basic object of interest in Pandas is the dataframe, the basic object of interest in NumPy is the array, or *ndarray*: an efficient, multidimensional array optimized for storage and computation. NumPy also comes with tools for reading and writing array data to disk, some nice built-in math functions, and tools for integrating with C libraries.

The basic idea behind NumPy is that, while Python is great for doing many things, it's not inherently optimized for large-scale numeric computation or for storing large datasets. Instead of requiring data analysts who need to work with such things to switch to a different language, NumPy provides tools that, essentially, translate basic numeric analysis needs into much more efficient, lower-level (C-style) implementations.

To give you an idea of the performance difference, check out the following:

In [3]:
import numpy as np # NumPy is almost always abbreviated as np

In [4]:
my_arr = np.arange(1000000)
my_list = list(range(1000000))

In [5]:
%timeit my_arr2 = my_arr*2

779 µs ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [6]:
%timeit my_list2 = [x * 2 for x in my_list]

54.1 ms ± 5.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


That's quite the difference! Generally, NumPy-based algorithms are 10 to 100 times faster (or more!) than their pure Python counterparts and use significantly less memory. So, if you're writing for loops to go through lists for large numeric computations, you're probably doing it wrong.
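
To get a rough feel for the memory side of that claim, here is a minimal sketch. It assumes the standard library's *sys.getsizeof* gives a fair estimate of a list's size (the list object itself plus each int object it points to); the exact numbers will depend on your platform and Python version.

```python
import sys
import numpy as np

n = 1_000_000
my_arr = np.arange(n)
my_list = list(range(n))

# Bytes used by the array's data buffer
arr_bytes = my_arr.nbytes

# Rough estimate for the list: the list object plus every int object it references
list_bytes = sys.getsizeof(my_list) + sum(sys.getsizeof(x) for x in my_list)

print(arr_bytes, list_bytes)
```

The array stores its numbers in one contiguous block, while the list stores references to separate Python int objects, which is where the extra memory goes.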

### The NumPy ndarray: A Multidimensional Array Object

The basic data object in NumPy is the *ndarray*, and applying standard arithmetic operations to these arrays uses syntax very similar to standard arithmetic.

For example, let's create an array:

In [10]:
data = np.array([[1, 4.2, 7], [5, 8, 2.71]])
data

array([[1.  , 4.2 , 7.  ],
       [5.  , 8.  , 2.71]])

We can multiply every element in the array by 2 with the following:

In [12]:
data * 2

array([[ 2.  ,  8.4 , 14.  ],
       [10.  , 16.  ,  5.42]])

Or, we can get the same result by adding the array to itself:

In [14]:
data + data

array([[ 2.  ,  8.4 , 14.  ],
       [10.  , 16.  ,  5.42]])

We could even square every element in the array:

In [16]:
data ** 2

array([[ 1.    , 17.64  , 49.    ],
       [25.    , 64.    ,  7.3441]])

Again, these operations are generally quite fast relative to trying to implement them with a for loop.

#### Data Types for ndarrays

In general, an array is a container for homogeneous data; in other words, all the elements must be the same type. Every array also has a *shape*, which is a tuple indicating the size of each dimension, and a *dtype*, which describes the data type of the array.

In [19]:
data.shape

(2, 3)

In [20]:
data.dtype

dtype('float64')

How do we create ndarrays? The easiest way is with the *array* function used above, where we just specify the elements in a list or nested lists. Unless explicitly told otherwise, the array function will try to infer a good data type for the array based upon its inputs.
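
As a quick sketch of that type inference (the variable names here are just for illustration):

```python
import numpy as np

ints = np.array([1, 2, 3])            # all ints -> an integer dtype
mixed = np.array([1, 2.5, 3])         # an int/float mix -> float64
nested = np.array([[1, 2], [3, 4]])   # nested lists -> a two-dimensional array

print(ints.dtype, mixed.dtype, nested.shape)
```

The exact integer dtype (for example int64 versus int32) can depend on your platform, but a single float in the input is enough to make the whole array floating point.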

In addition to the *array* function, there are a number of other functions for creating new arrays. For example:

In [23]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [24]:
np.zeros((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [25]:
np.ones((4,3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

The function *arange* is like the *range* function in Python, except it creates a NumPy array instead of a *range* object.

In [27]:
np.arange(12)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [28]:
np.arange(3,11)

array([ 3,  4,  5,  6,  7,  8,  9, 10])
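
Like *range*, *arange* also accepts an optional step as a third argument. A small sketch comparing the two:

```python
import numpy as np

stepped = np.arange(3, 11, 2)         # start, stop (exclusive), step
from_range = list(range(3, 11, 2))    # the pure Python equivalent

print(stepped)       # [3 5 7 9]
print(from_range)    # [3, 5, 7, 9]
```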

We can, if desired, explicitly convert or *cast* an array from one data type to another using the *astype* method:

In [30]:
arr = np.array([1,2,3,4,5])
arr.dtype

dtype('int64')

In [31]:
float_arr = arr.astype(np.float64)
float_arr

array([1., 2., 3., 4., 5.])

If we go the other way, from floats to ints, that will not generate an error, but the decimal parts will be truncated.

In [33]:
arr = np.array([2.3, 5.0, 4.999, 2.71828])
arr.dtype

dtype('float64')

In [34]:
int_arr = arr.astype(np.int64)
int_arr

array([2, 5, 4, 2])
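
Truncation is not the same as rounding, which matters most for negative numbers. Here's a small sketch; using *np.rint* to round first is just one option, not the only way to do it:

```python
import numpy as np

arr = np.array([2.3, 4.999, -2.7])

truncated = arr.astype(np.int64)          # drops the decimal part (toward zero)
rounded = np.rint(arr).astype(np.int64)   # rounds to the nearest integer first

print(truncated)   # [ 2  4 -2]
print(rounded)     # [ 2  5 -3]
```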

We can convert strings that make sense as numbers to numbers:

In [36]:
arr = np.array(["2.3", "1", "42"])
arr.dtype

dtype('<U3')

In [37]:
float_arr = arr.astype(np.float64)
float_arr

array([ 2.3,  1. , 42. ])

But, this will cause an error if the strings don't make sense as numbers:

In [39]:
arr = np.array(["2.3", "1", "42", "Shrubbery"])
arr

array(['2.3', '1', '42', 'Shrubbery'], dtype='<U9')

In [40]:
float_arr = arr.astype(np.float64)

ValueError: could not convert string to float: 'Shrubbery'

The data type for an array can also be specified when you create it:


In [81]:
arr = np.array([1, 2, 3])
arr

array([1, 2, 3])

In [82]:
arr = np.array([1, 2, 3], dtype=np.float64)
arr

array([1., 2., 3.])

#### Arithmetic with NumPy Arrays

Arrays enable you to express batch operations on data without writing any for loops. In NumPy this is called "vectorization". Any arithmetic operations between equal-sized arrays apply the operation element-wise:

In [84]:
arr = np.array([[1,2,3],[3,2,1]])
arr

array([[1, 2, 3],
       [3, 2, 1]])

In [85]:
arr * arr

array([[1, 4, 9],
       [9, 4, 1]])

In [86]:
arr + 2 * arr

array([[3, 6, 9],
       [9, 6, 3]])
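
To make "element-wise" concrete, here is a sketch that spells out the same multiplication with explicit loops and checks that the two results agree. The loop version is only for illustration; it's exactly what vectorization lets you avoid.

```python
import numpy as np

arr = np.array([[1, 2, 3], [3, 2, 1]])

# Vectorized: one expression, no explicit loops
vectorized = arr * arr

# The same computation written out with Python for loops
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = arr[i, j] * arr[i, j]

print(np.array_equal(vectorized, looped))   # True
```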

#### Basic indexing and slicing

Selecting a subset of an array is a somewhat deep topic that we'll only touch on here. One-dimensional arrays on the surface act similarly to Python lists:

In [88]:
arr = np.arange(10)
arr[5]

5

In [89]:
arr[5:8]

array([5, 6, 7])

In [90]:
arr[5:8] = 12
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

As you can see, assigning a scalar value to a slice propagates (or *broadcasts*) the value to the entire selection.

An important first distinction from Python lists is that array slices are views **on the original array**, which means any modification to them is reflected in the source array.

For example:

In [92]:
arr_slice = arr[5:8]
arr_slice

array([12, 12, 12])

In [93]:
arr_slice[1] = 42

In [94]:
arr

array([ 0,  1,  2,  3,  4, 12, 42, 12,  8,  9])

This might seem surprising. The idea behind this is that NumPy has been designed to work with very large arrays, and you can imagine that lots of copying of big arrays could lead to performance and memory problems.

If you do want to create a copy and not just a view, you can do so with the *copy* method:

In [96]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [97]:
arr_slice = arr[3:6]
arr_slice

array([3, 4, 5])

In [98]:
arr_copy = arr[3:6].copy()
arr_copy

array([3, 4, 5])

In [99]:
arr_slice[0] = 23
arr_copy[1] = 42
arr

array([ 0,  1,  2, 23,  4,  5,  6,  7,  8,  9])
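
For contrast with plain Python lists, here is a short sketch of the same experiment run on both. Slicing a list gives you a new list, while slicing an array gives you a view:

```python
import numpy as np

py_list = list(range(10))
np_arr = np.arange(10)

list_slice = py_list[5:8]   # slicing a list makes a new list
arr_slice = np_arr[5:8]     # slicing an array makes a view

list_slice[0] = 99
arr_slice[0] = 99

print(py_list[5])   # 5: the original list is untouched
print(np_arr[5])    # 99: the original array changed
```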

For higher dimensional arrays there are more options.

In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:

In [101]:
arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2d[2]

array([7, 8, 9])

If we wanted to access the third element of the first array, we could do so either recursively:

In [103]:
arr2d[0][2]

3

Or, using a comma-separated list:

In [105]:
arr2d[0,2]

3

If we assign a scalar value to an entire row, it broadcasts that value to every entry in the row:

In [107]:
arr2d[0]

array([1, 2, 3])

In [108]:
arr2d[0] = 42
arr2d

array([[42, 42, 42],
       [ 4,  5,  6],
       [ 7,  8,  9]])

Multiple slices can be passed to multi-dimensional arrays just like we can pass multiple indices.

In [110]:
arr2d[:2]

array([[42, 42, 42],
       [ 4,  5,  6]])

In [111]:
arr2d[:2,1:]

array([[42, 42],
       [ 5,  6]])

By mixing indexes and slices, you can get lower dimensional slices:

In [113]:
arr2d[:2,2]

array([42,  6])

Keep in mind, these slices are *views*, so assigning to one changes the original array:

In [115]:
arr2d[:2,2] = 801
arr2d

array([[ 42,  42, 801],
       [  4,   5, 801],
       [  7,   8,   9]])

We can index the elements in an array with **boolean indexing** much like we saw last time with Pandas. For example, suppose we have the following array of names, along with an array of data:

In [117]:
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4,7], [0,2], [-5,6], [0,0], [1,2], [-12,-4], [3,4]])

In [118]:
names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

In [119]:
data

array([[  4,   7],
       [  0,   2],
       [ -5,   6],
       [  0,   0],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

Suppose each name corresponds with a row in the *data* array and we wanted to select all the rows with the corresponding name "Bob". Like arithmetic operations, comparisons (like ==) with arrays are also vectorized. So, this command produces a Boolean array:

In [121]:
names == "Bob"

array([ True, False, False,  True, False, False, False])

If we pass this as an index to our array we get:

In [123]:
data[names=="Bob"]

array([[4, 7],
       [0, 0]])

Note that the Boolean array must be the same length as the array axis it's indexing.

You can even mix and match Boolean arrays with slices:

In [125]:
data[names == "Bob", 1:]

array([[7],
       [0]])

In [126]:
data[names == "Bob", 1]

array([7, 0])

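We can also negate a condition with != or *not* (~), and combine conditions with *and* (&) and *or* (|). The Python keywords *and* and *or* do not work with Boolean arrays, so use & and | instead.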

In [128]:
data[names != "Bob"]

array([[  0,   2],
       [ -5,   6],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

In [129]:
data[~(names == "Bob")]

array([[  0,   2],
       [ -5,   6],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

In [130]:
data[(names == "Bob") | (names == "Will")]

array([[ 4,  7],
       [-5,  6],
       [ 0,  0],
       [ 1,  2]])
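
None of the examples above use &, so here's a quick sketch that combines a condition on *names* with a condition on the data itself. The particular condition is made up for illustration.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])

# Rows named "Joe" whose first column is positive.
# The parentheses matter: & binds more tightly than == and >.
mask = (names == "Joe") & (data[:, 0] > 0)

print(data[mask])   # [[3 4]]
```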

Note that selecting data from an array by Boolean indexing and assigning the result to a new variable creates a *copy* of the data. I know it's a bit confusing when a copy is created and when it's not. Sorry, I didn't make the rules.
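
If you're ever unsure which one you have, one way to check is *np.shares_memory*, which reports whether two arrays overlap in memory. A minimal sketch:

```python
import numpy as np

arr = np.arange(10)

sliced = arr[3:6]        # basic slice
masked = arr[arr > 6]    # Boolean indexing

print(np.shares_memory(arr, sliced))   # True: a view on arr
print(np.shares_memory(arr, masked))   # False: a separate copy
```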

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays. Sorry, I didn't make the terminology either. For example:

In [133]:
arr = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
arr

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [134]:
arr[[1,0,2]]

array([[4, 5, 6],
       [1, 2, 3],
       [7, 8, 9]])

Using negative indices selects rows from the end:

In [136]:
arr[[-1,-2]]

array([[10, 11, 12],
       [ 7,  8,  9]])

Passing multiple index arrays selects an array of elements corresponding to each tuple of indices:

In [138]:
arr[[1,0,2],[2,0,1]]

array([6, 1, 8])

Fancy indexing, unlike slicing, always copies the data into a new array when assigning the results to a new variable. Again, not my rules.
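
Here's a short sketch of what that means in practice: changing the array you got from fancy indexing leaves the original alone.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

picked = arr[[1, 0, 2]]   # fancy indexing gives a new array
picked[0, 0] = -1         # modify the result...

print(picked[0])    # [-1  5  6]
print(arr[1])       # [4 5 6]: the original row is unchanged
```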

Finally, like with Pandas dataframes, we can transpose an array with the *T* attribute:

In [141]:
arr

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [142]:
arr.T

array([[ 1,  4,  7, 10],
       [ 2,  5,  8, 11],
       [ 3,  6,  9, 12]])
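
One more thing to keep in mind: like a slice, the transpose is a view on the original array rather than a copy. A quick sketch that checks this:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
transposed = arr.T

print(arr.shape, transposed.shape)          # (4, 3) (3, 4)
print(np.shares_memory(arr, transposed))    # True

transposed[0, 1] = 400   # row 0, column 1 of the transpose...
print(arr[1, 0])         # ...is row 1, column 0 of the original: 400
```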