## The `numpy` Module

In a previous lesson, you learned to load and manipulate data using the `pandas` module. Usage of `pandas` centers on the `DataFrame`, a container data type that holds our data in a tabular format reminiscent of a CSV file or an Excel worksheet. Pandas DataFrames are powerful tools that we will use extensively. However, when working with some of the machine learning modules that we will learn about in future lessons, it will be more convenient to package our data in a different form that is specialized for working with arrays of numbers. Therefore, we will now learn about the `numpy` module and its core data type, the numpy array.

## `numpy`: Multidimensional Arrays in Python

To understand the role that numpy plays in the Python ecosystem, it is useful to consider the problem of working with two-dimensional arrays of numbers. The list objects that we learned about in previous lessons are very useful for managing one-dimensional arrays of numbers like `[1,2,3,4,5]`. But what if we need to manipulate two-dimensional arrays of numbers, such as a table of numbers in an Excel spreadsheet? We can start out by writing our two-dimensional array as a "list of lists," as shown below:

In [1]:
example_list_of_lists = [
    [1,2,3,4],
    [5,6,7,8],
    [9,10,11,12]
]

Notice that the list above, `example_list_of_lists`, has three elements. Each of those elements is itself a list containing four numbers.
This method of storing two-dimensional arrays of numbers allows us to do certain basic tasks. For example, we can print each row of the table above using the code below:

In [2]:
print("first row:", example_list_of_lists[0])
print("second row:", example_list_of_lists[1])
print("third row:", example_list_of_lists[2])

first row: [1, 2, 3, 4]
second row: [5, 6, 7, 8]
third row: [9, 10, 11, 12]


But what if we want to print the second *column*, rather than the second *row*? For operations such as this, the "list of lists" structure above becomes very awkward. We therefore wish to teach Python better techniques for manipulating large arrays of numbers. We can teach Python these skills by importing the `numpy` module:

In [3]:
import numpy as np

Note that we have used the `import as` statement to alias `numpy` to the easier-to-type name `np`, as is the convention. Now that Python knows the skills contained within the `numpy` module, we can use the `array` function provided by this module to turn our "list of lists" into a **numpy array**:

In [4]:
numpy_array = np.array(example_list_of_lists)

The variable `numpy_array`, which stores the return value of `np.array`, now contains a numpy array **object**. This object is just like the ones that we learned about in our previous lesson. Just as we did before, we can type a `.` after the name of the object and hit the `Tab` key to see the attributes and methods contained within this object. Click on the right edge of the cell below and hit `Tab` to view this information:

In [10]:
numpy_array.argmax()

11
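
Outside of an interactive notebook, `Tab` completion is not available. As a rough stand-in, Python's built-in `dir` function returns the same attribute and method names as a list. The sketch below rebuilds the array from this lesson so that it runs on its own; the variable name `public_names` is introduced here only for illustration.

```python
import numpy as np

numpy_array = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
])

# dir() lists every attribute and method name on the object;
# here we skip the "private" names that start with an underscore
public_names = [name for name in dir(numpy_array) if not name.startswith("_")]
print("number of public names:", len(public_names))
print("a few of them:", public_names[:10])

# argmax() returns the position of the largest value, counting through
# the array row by row starting from 0 - the value 12 sits at position 11
print(numpy_array.argmax())
```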

As you can see, the list shown above is quite long. This is because numpy arrays can perform a wide variety of mathematical operations on the multidimensional data that they contain. Let's take a look at what the `numpy` array object looks like:

In [11]:
print(numpy_array)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


This looks a lot like our "list of lists" from above - although note that `print` does not show you everything inside an object, just an "easy-to-read" view of some parts of it. What exactly is better about a numpy array? Let's determine the answer to this question by learning the basics of manipulating data inside `numpy` arrays.

Recall that we access the elements of a one-dimensional list using the indexing operator `[...]`:

In [12]:
my_list = [1,2,3]
print("first element:", my_list[0])
print("second element:", my_list[1])
print("third element:", my_list[2])

first element: 1
second element: 2
third element: 3


We can access the elements of the numpy array above in a similar manner, but we will need to pass two numbers to the indexing operator `[...]` because the data we are working with is two-dimensional (meaning that it has both rows and columns, not just one or the other). Let's see how this works by printing the first row element by element:

In [13]:
print("first row, first column:", numpy_array[0,0])
print("first row, second column:", numpy_array[0,1])
print("first row, third column:", numpy_array[0,2])
print("first row, fourth column:", numpy_array[0,3])

first row, first column: 1
first row, second column: 2
first row, third column: 3
first row, fourth column: 4


As you can see, the first number that we pass to the indexing operator of the numpy array specifies the *row* that we want, while the second number specifies the *column*. 

<span style="color:red;font-weight:bold">Try It</span>
: Using the indexing operator as above, print the value located in the *third row* and the *second column* of `numpy_array`. Remember, row and column indices start at `0` - you can look at the above example for reference.

In [15]:
numpy_array[2,1]

10

What if we want an entire row, or an entire column, rather than an individual array element? We can accomplish this by replacing either the column index or the row index with a colon (`:`). Using a colon for the column index means that we want "all the columns" for the specified row, and using it for the row index means that we want "all the rows" for the specified column. You can see this below:

In [16]:
print("the first row:")
print(numpy_array[0,:])
print("the first column:")
print(numpy_array[:,0])

the first row:
[1 2 3 4]
the first column:
[1 5 9]
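
To make the contrast with the "list of lists" concrete, the sketch below extracts the third column both ways. It rebuilds the data from earlier in this lesson so that it runs on its own; the names `third_column_from_lists` and `third_column_from_numpy` are made up for this illustration.

```python
import numpy as np

example_list_of_lists = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
numpy_array = np.array(example_list_of_lists)

# With a list of lists, we have to visit every row and pick out one element
third_column_from_lists = []
for row in example_list_of_lists:
    third_column_from_lists.append(row[2])
print(third_column_from_lists)

# With a numpy array, a single indexing expression does the same job
third_column_from_numpy = numpy_array[:, 2]
print(third_column_from_numpy)
```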


<span style="color:red;font-weight:bold">Try It</span>
: Using the special `:` syntax above, print the second column of `numpy_array`:

In [20]:
numpy_array[:,1]

array([ 2,  6, 10])

### Looping over Multidimensional Arrays

We saw in the previous lesson on loops how we can use a `for` loop to loop over the elements of the following things:

1. Lists
2. The output of the `range(...)` function

Is there a similar way to loop over the elements of a numpy array? Yes. It turns out that `for` loops can be used to iterate over any *iterable* object. You can think of *iterable* as meaning: "has elements that can be accessed one by one." Numpy arrays are *iterable*, though in a slightly different way than traditional lists. Since the numpy arrays that we are most interested in are two-dimensional, the solution will require two `for` loops nested inside each other. This is demonstrated with the following code:

In [21]:
my_two_d_array = np.array([
    [1,2,3,4],
    [5,6,7,8],
    [9,10,11,12]
])
for row in my_two_d_array:
    print("starting new row:")
    for element in row:
        print(element)

starting new row:
1
2
3
4
starting new row:
5
6
7
8
starting new row:
9
10
11
12


Let's break down how the `for` loops above work. Our first `for` loop traverses the *rows* of the numpy array, while the second `for` loop nested inside the first one traverses the individual *elements* of each row. Remember, everything in the body of a `for` loop (the indented portion) is run once per iteration of the loop, so the entire inner `for` loop is run during every iteration of the outer `for` loop.
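
The same nested-loop pattern can do more than print. As a small sketch, the code below keeps a running total of every element it visits; the variable name `total` is chosen here for illustration.

```python
import numpy as np

my_two_d_array = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
])

total = 0
for row in my_two_d_array:      # outer loop: one iteration per row
    for element in row:         # inner loop: one iteration per element of that row
        total = total + element
print(total)
```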

<span style="color:blue;font-weight:bold">Exercise</span>: Write a function called `count_zeros` that accepts a single variable called `my_array` as an argument, and returns a single number that is the total number of zeros in the array. For example, given a numpy array constructed from the following values (as above): 

```
[[1,0,1,0],
 [0,0,0,1],
 [1,0,1,1]]
```

your function should return a value of `6`. Hint: this challenging exercise will require you to use the following elements of previous lessons:

1. `for` loops (not just one `for` loop, but a second `for` loop inside the body of the first)
2. `if` statements
3. Comparison operators (At least one of `<`,`>`, `==`, etc.)

In [31]:
def count_zeros(my_array):
    return sum(my_array.flatten() == 0)
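
The solution above compares the whole array to zero in one step. For comparison, here is a sketch of the approach that the hint describes, using nested `for` loops, an `if` statement, and the `==` operator. The function name `count_zeros_with_loops` is hypothetical and is used only so that it does not clash with the exercise function.

```python
import numpy as np

def count_zeros_with_loops(my_array):
    # Start the counter at zero and add one for every zero we encounter
    n_zeros = 0
    for row in my_array:
        for element in row:
            if element == 0:
                n_zeros = n_zeros + 1
    return n_zeros

example = np.array([
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
])
print(count_zeros_with_loops(example))
```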

In [31]:
import numpy as np
check_function_definition("count_zeros")
test_array_zero = np.array([
    [1,0,1,0],
    [0,0,0,1],
    [1,0,1,1]
])
assert count_zeros(test_array_zero) == 6, "Your function does not count zeros correctly - try calling it on this array: <code>np.array([[1,0,1,0],[0,0,0,1],[1,0,1,1]])</code>"
test_array_one = np.array([
    [1,0,1,0,0],
    [0,0,0,1,1],
    [1,0,1,0,1],
    [1,1,1,0,1]
])
assert count_zeros(test_array_one) == 9, "Your function does not count zeros correctly - double-check your code."
success()

### Tuples

We must now cover a data type that we intentionally skipped in previous lessons, but that appears quite often when using the numpy module: *tuples*. You can think of tuples as being much like lists, but defined using `()` instead of `[]`:

In [32]:
my_tuple = (1,2,3)

In many ways, tuples are just like lists - we can find their length with `len(...)` and access their elements with `[...]`:

In [33]:
print(len(my_tuple))
print(my_tuple[0])

3
1


However, tuples are *immutable*, meaning their contents cannot be changed. The following cell will therefore fail with an error:

In [34]:
my_tuple[0] = "foo"

TypeError: 'tuple' object does not support item assignment
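
If we need a changed version of a tuple, one option is to build a new tuple rather than modify the existing one. The sketch below catches the error from the cell above and then constructs a replacement; the name `new_tuple` is introduced here for illustration.

```python
my_tuple = (1, 2, 3)

# Attempting to change an element raises a TypeError, which we catch here
try:
    my_tuple[0] = "foo"
except TypeError as error:
    print("could not modify the tuple:", error)

# Build a brand-new tuple from "foo" plus the remaining elements instead
new_tuple = ("foo",) + my_tuple[1:]
print(new_tuple)
print(my_tuple)  # the original tuple is unchanged
```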

So you can think of tuples as "lists that you can't change." We are introducing tuples in this lesson because the numpy module uses them extensively. In particular, the *shape* of our numpy arrays (the number of rows and columns) is stored as a tuple:

In [35]:
three_by_two = np.array([
    [1,2],
    [3,4],
    [5,6]
])
three_by_two.shape

(3, 2)

We can see that the attribute `shape` of a numpy array object contains the number of rows (first tuple entry) and the number of columns (second tuple entry). In the next section, we will see how to use tuples to quickly construct numpy arrays with particular shapes.
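
Because `shape` is an ordinary tuple, we can index it or unpack it into separate variables. The following sketch shows both; the names `n_rows` and `n_cols` are chosen here for illustration.

```python
import numpy as np

three_by_two = np.array([
    [1, 2],
    [3, 4],
    [5, 6],
])

# Index the tuple like any other tuple
print("rows:", three_by_two.shape[0])

# Or unpack both entries at once
n_rows, n_cols = three_by_two.shape
print("rows:", n_rows, "columns:", n_cols)
```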

### Constructing Numpy Arrays Efficiently

The numpy module provides several convenient functions that we can use to quickly generate large arrays with particular shapes. For example, we can use `np.zeros` to create an array filled with zeros, which is often a useful initial condition for our programs:

In [36]:
three_zeros = np.zeros(3)
print(three_zeros)

[0. 0. 0.]


Passing a single number to `np.zeros` creates a one-dimensional array (a single row) of zeros with that many elements. We can create a two-dimensional array of zeros by passing a tuple containing the desired shape (note the doubled parentheses: one set for the function call, one set for the tuple):

In [37]:
five_by_six = np.zeros((5,6))
print(five_by_six)

[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
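
The decimal points in the output above indicate that `np.zeros` produces floating-point numbers by default. As a brief sketch, the optional `dtype` argument requests whole numbers instead, and the related function `np.ones` fills an array with ones. The variable names below are made up for this example.

```python
import numpy as np

# Ask for integers rather than the default floating-point zeros
int_zeros = np.zeros((2, 3), dtype=int)
print(int_zeros)

# np.ones works the same way as np.zeros, but fills the array with ones
two_by_three_ones = np.ones((2, 3))
print(two_by_three_ones)
```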


We can also quickly construct numpy arrays with more interesting patterns. For example, we can use the `np.arange` function to construct arrays that span a particular numerical range:

In [38]:
range_arr = np.arange(0,20,2)
print(range_arr)

[ 0  2  4  6  8 10 12 14 16 18]


Let's break down the result above. The `arange` function takes three arguments:

1. The *start* number: `0`
2. The *end* number (will not be included in the output): `20`
3. The *step* by which to increment at each position: `2`

Thus the function call above gives all the numbers from zero up to (but not including) twenty, counting by two. 
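
Like Python's `range(...)`, `np.arange` can also be called with fewer arguments. In the sketch below, leaving out the step counts by one, and passing only a single number also starts the count at zero. The variable names are chosen for illustration.

```python
import numpy as np

# start and end only: the step defaults to 1
two_to_five = np.arange(2, 6)
print(two_to_five)

# end only: the start defaults to 0 and the step to 1
zero_to_four = np.arange(5)
print(zero_to_four)
```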

### Mathematical Operations on Numpy Arrays

Numpy provides many useful mathematical functions that can operate on numpy array objects. A principal difference between these mathematical functions and those found in other Python modules is that they can apply a mathematical operation to an entire array at once. For example, we can apply the exponential function $e^x$ to every element of an array at once using the `np.exp` function call:

In [39]:
powers = np.array([0,1,2])
np.exp(powers)

array([1.        , 2.71828183, 7.3890561 ])

If you are unfamiliar with the function $e^x$ from math class, do not worry - the point of the above is simply to show that the operation is applied to every single array element, resulting in an array of three new elements. 
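
The same element-by-element behavior applies to ordinary arithmetic operators. As a short sketch, compare what `*` does to a numpy array with what it does to a plain list (the variable names here are only for illustration):

```python
import numpy as np

powers = np.array([0, 1, 2])

# On a numpy array, arithmetic is applied to every element
doubled = powers * 2
shifted = powers + 10
print(doubled)
print(shifted)

# On a plain list, * repeats the list instead of multiplying its elements
repeated = [0, 1, 2] * 2
print(repeated)
```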

Let's take a look at another example: we can also find the `mean` and the `min` of a given array. These functions work even on two-dimensional arrays - they will use all of the elements across all rows and columns to compute the answer:

In [40]:
two_by_five = np.array([
    [1,2,3,4,5],
    [6,7,8,9,10],
])
print(np.mean(two_by_five))
print(np.min(two_by_five))

5.5
1
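
Sometimes we want one answer per column or per row rather than a single number for the whole array. These functions accept an optional `axis` argument for that purpose. The sketch below is an aside that goes slightly beyond the example above; the variable names are made up for illustration.

```python
import numpy as np

two_by_five = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
])

# axis=0 collapses the rows, giving one mean per column
column_means = np.mean(two_by_five, axis=0)
print(column_means)

# axis=1 collapses the columns, giving one mean per row
row_means = np.mean(two_by_five, axis=1)
print(row_means)
```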


### Reshaping Numpy Arrays

The two-dimensional numpy arrays that we have been using so far in this lesson have a fixed number of rows and columns. What if we want to change their shape by changing the number of rows/columns? We can do so using the `reshape` method, as shown below:

In [41]:
arr = np.array([
    [1,2,3],
    [4,5,6]
])
reshaped_arr = arr.reshape(3,2)
print(reshaped_arr)

[[1 2]
 [3 4]
 [5 6]]


You can see from the output above that the `reshape` method returns a *new* numpy array with a different shape. The first argument to `reshape` specifies the number of rows that we want in the new array, while the second argument specifies the number of columns.
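
Two details are worth checking for yourself: the original array keeps its shape, and the requested shape must hold exactly the same number of elements as the original. The sketch below demonstrates both; the flag `reshape_failed` is a name introduced here for illustration.

```python
import numpy as np

arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
])
reshaped_arr = arr.reshape(3, 2)

# The original array still has two rows and three columns
print(arr.shape)
print(reshaped_arr.shape)

# Six elements cannot fill a 4-by-2 array, so numpy raises a ValueError
reshape_failed = False
try:
    arr.reshape(4, 2)
except ValueError as error:
    reshape_failed = True
    print("reshape failed:", error)
```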

You have now seen how `reshape` can be used to change the shape of two-dimensional arrays. Can we also reshape one-dimensional arrays? Standard one-dimensional numpy arrays take the form of individual *rows*, as shown below: 

In [42]:
row_array = np.array([1,2,3])
print(row_array)

[1 2 3]


In subsequent lessons, we will often use the `reshape` method to perform a simple "trick" that transforms our data into a particular shape that is compatible with several machine learning modules. To work with these modules, we often need our one-dimensional arrays to be *column*-shaped, rather than *row*-shaped. To "flip" a numpy array from a row to a column, we can use the following special `reshape` call:

In [43]:
col_array = row_array.reshape(-1,1)
print(col_array)

[[1]
 [2]
 [3]]


As you can see, the use of the magic value of `-1` as the "number of rows" argument causes numpy to perform the following tasks:

1. Look at the number of *columns* that we want: `1`
2. Calculate the number of *rows* that our new array must have so that the total number of elements is the same as before: `3`

Therefore, using `reshape(-1,1)` on a row will "flip it" to produce a column of the same length. We will use this trick extensively in the following lessons.
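
Inspecting the `shape` attribute before and after makes the effect of the trick explicit. The sketch below also flattens the column back into a one-dimensional array using `reshape(-1)`; the name `flat_again` is chosen here for illustration.

```python
import numpy as np

row_array = np.array([1, 2, 3])
print(row_array.shape)   # one-dimensional: a tuple with a single entry

col_array = row_array.reshape(-1, 1)
print(col_array.shape)   # two-dimensional: three rows, one column

# A single -1 asks numpy for a one-dimensional array of whatever length fits
flat_again = col_array.reshape(-1)
print(flat_again.shape)
```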

<span style="color:blue;font-weight:bold">Exercise</span>: Write a function called `shaped_range` that accepts four numerical arguments: `start`, `end`, `n_rows`, and `n_cols`. Your function should return a numpy array with `n_rows` rows and `n_cols` columns, containing the numbers ranging from `start` to `end` (but not including `end`). For example, the function call `shaped_range(1,11,5,2)` should return this numpy array:

```
[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]]
```
Note: we would usually need to write extra code to check that the number of elements in the specified range matches the number of rows and columns requested in the function call, but we will skip this precaution in this exercise.


In [44]:
def shaped_range(start, end, n_rows, n_cols):
    return np.arange(start, end).reshape(n_rows, n_cols)
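
For reference, here is one way the precaution mentioned in the note might look. This is only a sketch: the function name `shaped_range_checked` and the wording of the error message are assumptions made for this example, not part of the exercise.

```python
import numpy as np

def shaped_range_checked(start, end, n_rows, n_cols):
    values = np.arange(start, end)
    # The range must contain exactly enough numbers to fill the requested shape
    if values.size != n_rows * n_cols:
        raise ValueError(
            f"range has {values.size} elements, but a {n_rows} by {n_cols} "
            f"array needs {n_rows * n_cols}"
        )
    return values.reshape(n_rows, n_cols)

print(shaped_range_checked(1, 11, 5, 2))

# Ten numbers cannot fill a 3-by-4 array, so this call is rejected
try:
    shaped_range_checked(1, 11, 3, 4)
except ValueError as error:
    print("rejected:", error)
```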

In [44]:
check_function_definition("shaped_range")
try:
    resone = shaped_range(1,11,5,2)
except Exception as e:
    raise ExerciseError(f"Your function produced an error when called as follows: <code>shaped_range(1,11,5,2)</code> - the error was: <code>{e}</code>")

try:
    restwo = shaped_range(3,15,3,4)
except Exception as e:
    raise ExerciseError(f"Your function produced an error when called as follows: <code>shaped_range(3,15,3,4)</code> - the error was: <code>{e}</code>")
    
assert isinstance(resone, np.ndarray), "Your function should return a numpy array."
assert isinstance(restwo, np.ndarray), "Your function should return a numpy array."
assert np.array_equal(resone, np.array([
    [1,2],
    [3,4],
    [5,6],
    [7,8],
    [9,10]
])), "Your function did not return the correct value when called as follows: <code>shaped_range(1,11,5,2)</code>"
assert np.array_equal(restwo, np.array([
    [3,4,5,6],
    [7,8,9,10],
    [11,12,13,14]
])),"Your function did not return the correct value when called as follows: <code>shaped_range(3,15,3,4)</code>"
success()

## Converting Pandas DataFrame Contents to Numpy Arrays

A tedious but necessary task that you will undertake in subsequent lessons is extracting data from pandas DataFrames and converting it to an appropriate numpy array. Consider the DataFrame below:

In [45]:
import pandas as pd
df = pd.read_excel("data/demo.xlsx")
df.head()

Unnamed: 0,Product Name,Sales Q1 2019 (USD),Sales Q2 2019 (USD)
0,Apples,1000000,1500000
1,Oranges,2000000,3000000
2,Bananas,3000000,4500000
3,Pineapples,1500000,2250000
4,Peaches,2000000,1000000


Suppose we wanted a numpy array containing the columns `Sales Q1 2019 (USD)` and `Sales Q2 2019 (USD)` - we could create this using the `to_numpy()` method as follows:

In [46]:
# extract the columns we want
numeric_columns = ["Sales Q1 2019 (USD)", "Sales Q2 2019 (USD)"]
my_np_array = df[numeric_columns].to_numpy()
my_np_array

array([[1000000, 1500000],
       [2000000, 3000000],
       [3000000, 4500000],
       [1500000, 2250000],
       [2000000, 1000000],
       [2000000, 4000000],
       [3000000, 4500000],
       [2000000, 3000000],
       [3000000, 4500000],
       [1500000, 2250000]])

Alternatively, if we need to extract only a single column as a numpy array, we might simply use `np.array` to build our array from the appropriate set of values:

In [48]:
one_column_np_array = np.array(df["Sales Q1 2019 (USD)"])
one_column_np_array

array([1000000, 2000000, 3000000, 1500000, 2000000, 2000000, 3000000,
       2000000, 3000000, 1500000])
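
The two conversions produce arrays of different shapes, which matters once we pass them to other modules. The sketch below uses a small DataFrame built by hand (a stand-in for `data/demo.xlsx`, containing only the first three rows shown above) so that it runs without the Excel file, and it applies the `reshape(-1,1)` trick from earlier to turn the single column into a column-shaped array. The variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

# A hand-built stand-in for the spreadsheet, using the first three rows shown above
df = pd.DataFrame({
    "Product Name": ["Apples", "Oranges", "Bananas"],
    "Sales Q1 2019 (USD)": [1000000, 2000000, 3000000],
    "Sales Q2 2019 (USD)": [1500000, 3000000, 4500000],
})

# Several columns at once give a two-dimensional array
two_columns = df[["Sales Q1 2019 (USD)", "Sales Q2 2019 (USD)"]].to_numpy()
print(two_columns.shape)

# A single column gives a one-dimensional array
one_column = np.array(df["Sales Q1 2019 (USD)"])
print(one_column.shape)

# reshape(-1, 1) turns it into a single column with one row per product
one_column_as_column = one_column.reshape(-1, 1)
print(one_column_as_column.shape)
```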

You will see us use both of these conversion methods in subsequent lessons.