# Introduction to NumPy arrays

## $ \S 1 $ Motivation

Suppose that we wish to represent three-dimensional vectors such as $ \mathbf{u}
= (1, 2, 3) $ or $ \mathbf{v} = (-1, 0, 1) $ in Python. It is natural to think
that either lists or tuples might be a good choice for this task.

In [3]:
u = [1, 2, 3]   # Create a list whose elements are 1, 2, 3
v = [-1, 0, 1]

However, at some point we will probably wish not only to store, but to manipulate
these vectors. For instance, how can we add $ \mathbf u $ and $ \mathbf v $ or
take a multiple of $ \mathbf v $? It is reasonable to try the following code:

In [4]:
s = u + v
multiple = 3 * v
print(s)
print(multiple)

[1, 2, 3, -1, 0, 1]
[-1, 0, 1, -1, 0, 1, -1, 0, 1]


These unexpected results can be explained by recalling that for either lists or
tuples (or strings), the `+` operator denotes _concatenation_, not addition; and
accordingly, `*` denotes _repetition_, not multiplication. This behavior is not
so strange at all if we take into account that lists and tuples are _generic_
sequential types, capable of holding objects of arbitrary types, for which
addition and multiplication might not make sense.
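
With plain lists, the intended vector operations have to be spelled out entry by entry. The following is only a minimal sketch of one way to do this; the names `vector_sum` and `scalar_multiple` are ours, chosen for illustration.

```python
u = [1, 2, 3]
v = [-1, 0, 1]

# Component-wise sum: pair up corresponding entries with zip and add them.
vector_sum = [a + b for a, b in zip(u, v)]

# Scalar multiple: multiply every entry of v by 3.
scalar_multiple = [3 * b for b in v]

print(vector_sum)       # [0, 2, 4]
print(scalar_multiple)  # [-3, 0, 3]
```

This works, but every operation requires its own explicit loop or comprehension, which quickly becomes cumbersome.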

__Exercise:__ What happens if you represent $ \mathbf u $ and $ \mathbf v $ as
tuples and try to take their difference $ \mathbf u - \mathbf v $? What if they
are represented as lists?

Vectors and matrices are fundamental objects in engineering, data
science and machine learning. There is thus a clear need for a library that
extends Python by providing efficient ways to operate on these objects.

## $ \S 2 $ Arrays

**NumPy**, which stands for _Numerical Python_, is a foundational package
for scientific computing in Python. It is almost universally imported with the
`np` alias, as follows:

In [2]:
import numpy as np

The central feature in NumPy is a new data structure called an **ndarray** (an
abbreviation of $n$-dimensional array), or simply **array**. An ndarray is a
grid of values _of the same type_; in other words, arrays must be
**homogeneous**. This type is usually, but not always, numerical. For example,
it is also possible to create an array whose elements are booleans, or 
strings. 
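
To see homogeneity at work, one can inspect the attribute `dtype` of an array, which reports the common type of its elements. This attribute is not discussed in the text above; the sketch below assumes only that it exists on every ndarray, and the variable names are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])         # one float among integers
flags = np.array([True, False])
words = np.array(["pi", "e", "phi"])

print(ints.dtype.kind)    # 'i': a signed integer type
print(mixed.dtype)        # float64: the integers were converted to floats
print(flags.dtype)        # bool
print(words.dtype.kind)   # 'U': a Unicode string type
```

When the inputs have different types, as in `mixed`, NumPy converts them all to a single type capable of representing every element.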

A $ 1 $-dimensional (numerical) array is essentially a vector in the sense of
Linear Algebra, as in the discussion in $ \S 1 $.  Arrays can be instantiated
with the `array` function:

In [15]:
u = np.array([1, 2, 3])
print(u)
print("Note the absence of commas (,) separating the entries when an array is displayed.")
print(f"The type of an array such as u is: {type(u)}")

[1 2 3]
Note the absence of commas (,) separating the entries when an array is displayed.
The type of an array such as u is: <class 'numpy.ndarray'>


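The number of dimensions of an array is also called its **rank** (not to be
confused with the rank of a matrix in linear algebra); it is available through
the attribute `ndim`. A $ 2 $-dimensional array, or array of rank $ 2 $, is just
a matrix.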

In [16]:
A = np.array([[1, 1, 1, 1],   # first row of matrix A
              [2, 2, 2, 2],   # second row
              [3, 3, 3, 3]])  # third row
print(A)
print(type(A))  # Print the type of object A

[[1 1 1 1]
 [2 2 2 2]
 [3 3 3 3]]
<class 'numpy.ndarray'>


Note the use of _double_ brackets here: the first opening bracket `[` serves
to delimit the array itself, while the second one is being used to delimit the elements
of the first row. The rows are separated by commas, as are elements within each row.

The __shape__ of an array is a tuple of integers indicating the size of each of
its dimensions. The preceding array $ \mathbf A $ has shape $ (3, 4) $ since it
has three rows and four columns.

In [14]:
print(A.shape)  # Print the shape of A

(3, 4)


__Exercise:__ What is the shape of a one-dimensional array, for instance the array
$ \mathbf b $ below? Can you explain the result of `b.shape`?

In [12]:
b = np.array([True, False, False, True, False])

__Exercise:__ How would you create the matrix
$$
\mathbf B = \left[ \begin{array}{cc}
b_{11} & b_{12} \\
b_{21} & b_{22} \\
b_{31} & b_{32} \\
b_{41} & b_{42}
\end{array} \right]
$$
where $ b_{ij} = i \cdot j $?

We may also use `array` to convert an existing tuple or list to a one-dimensional array:

In [7]:
pi = 3.14
e = 2.72
phi = 1.62
constants = [pi, e, phi]      # Create a list containing the values of three important numbers
names = ('pi', 'e', 'phi')    # Create a tuple containing their names and assign it to `names`
print(np.array(constants))    # Convert `constants` to an array and print the result
print(np.array(names))        # Convert `names` to an array and print the result

[3.14 2.72 1.62]
['pi' 'e' 'phi']


__Exercise:__ Can you make your solution to the exercise on the matrix
$ \mathbf B $ above more concise by using a list comprehension to generate the
$ b_{ij} $ and then converting the list to an array? _Hint:_ You will need a
double comprehension, of the form `[[... for j in ...] for i in ...]`.

📝 To recap, `ndarray` is the official name of the data type provided by NumPy, and `array` is both
the informal name of this data type and the name of the function that we can use
to create ndarrays.


A $ 3D$ array is to a matrix as a solid block is to a rectangle. In other words,
a rank $ 3 $ array is one having three axes, instead of just two.

<img src="array_3D.png" alt="drawing" width="400"/>

__Exercise:__ What are the rank and shape of the array depicted above?

Here's a concrete example of a $ 3D $ array of shape $ 2 \times 2 \times 2 $.
Think of it as an array having $ 2 $ "rows",
each of which is a $ 2 \times 2 $ matrix.

In [62]:
A = np.array([[[1, 2],   # The "first row" is a 2x2 matrix
               [3, 4]], 

              [[5, 6],   # The "second row" is also a 2x2 matrix
               [7, 8]]])
print(A)

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]


__Exercise:__ What happens if you delete the blank line inside the definition of $ \mathbf A $ above?

Note that a $ 3D $ array need not be a "cube" (i.e., have all three dimensions
of the same length) as in the previous example.

__Exercise:__ Build a three-dimensional array of shape $ (2, 3, 4) $. 

NumPy supports arrays with a large number of dimensions (the implementation
limit is 32 in NumPy 1.x and 64 since NumPy 2.0), although in many applications
arrays of dimension greater than $ 3 $ are rarely needed.
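
As a quick check of these notions, the sketch below prints the rank and shape of arrays with two, three and four levels of nesting. It relies on the attribute `ndim`, which returns the number of dimensions; the names `matrix`, `block` and `hyper` are purely illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
block = np.array([[[1, 2], [3, 4]],
                  [[5, 6], [7, 8]]])
hyper = np.array([[[[1, 2]]]])    # four levels of nesting

for name, arr in [("matrix", matrix), ("block", block), ("hyper", hyper)]:
    print(name, arr.ndim, arr.shape)

# matrix 2 (2, 3)
# block 3 (2, 2, 2)
# hyper 4 (1, 1, 1, 2)
```

In each case the length of the shape tuple equals the rank, and both match the depth of bracket nesting in the definition.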

## $ \S 3 $ Other ways to create arrays

There are other ways of creating arrays that are often more convenient than through use of the `array` function. For instance,
to instantiate an array of a desired shape filled with $ 0\text{s} $, we can use the function `zeros`:

In [8]:
Z = np.zeros((4, 4))  # The parameter of `zeros` is the shape you want the array to have
print(Z)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


📝 Note the double parentheses in this call: the outer pair delimits the arguments of the function, while the inner pair builds the tuple that specifies the shape.
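
The following sketch illustrates why the inner parentheses matter. It assumes, as in current NumPy versions, that the second positional parameter of `zeros` is the data type (`dtype`), a parameter not discussed in the text above.

```python
import numpy as np

Z = np.zeros((2, 3))              # shape passed as a single tuple
print(Z.shape)                    # (2, 3)

# Without the inner parentheses, 3 is taken as the second parameter
# of `zeros` (the data type), which is not what we want:
try:
    np.zeros(2, 3)
except TypeError as error:
    print("TypeError:", error)

# The element type defaults to float; `dtype` selects another one.
K = np.zeros((2, 3), dtype=int)
print(K)
```

The same convention applies to `ones` and to other functions that take a shape as their first argument.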

__Exercise:__ Create and print a $ 3D $ array of shape $ (3, 4, 5) $ filled with zeros.

Arrays can also be automatically populated with $ 1\text{s} $ by means of the function `ones`: 

In [9]:
U = np.ones((2, 3))
print(U)

[[1. 1. 1.]
 [1. 1. 1.]]


__Exercise:__ Create a $ 1D $ array having $ 50 $ coordinates, all of them equal to $ 1 $.

The `arange` function is much like the Python built-in `range`, but it returns an ndarray:

In [10]:
digits = np.arange(10)
print(type(digits), digits)


<class 'numpy.ndarray'> [0 1 2 3 4 5 6 7 8 9]


The full syntax is `arange(<start>, <stop>, <step>)`. Note that the starting
value is included, but the stopping value is not (exactly as in vanilla `range`).

In [11]:
y = np.arange(4, 10, 2)
print(y)

[4 6 8]


One advantage of `arange` over `range` is that _it accepts arguments of type float_, for instance:

In [12]:
x = np.arange(0.1, 1, 0.1)
print(x)

[0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]


However, this feature must be used with care, because sometimes rounding errors may lead to unexpected results, as in the following example:

In [13]:
print(np.arange(1, 1.3, 0.1))
print("Note that the value 1.3 was included!")

[1.  1.1 1.2 1.3]
Note that the value 1.3 was included!


__Exercise:__ For each item, create a $ 1D $ array containing the elements described:

(a) All integers from $ 5 $  to $ 15 $ (inclusive), but represented as
floating-point numbers.

(b) The sequence of even numbers between $ 2 $ and $ 19 $.

(c) All integers from $ 10 $ down to $ 1 $.

(d) All numbers from $ -3.14 $ to $ 2.86 = -3.14 + 6 $, with a stride of $ 2 $.

Alternatively, with `linspace` we can create an array containing linearly spaced
values inside a specified interval. The syntax is similar to that of `arange`,
except that the stop value in the second argument is included in the result, and
_the third argument gives the number of values to be generated, instead of the
step size_:

In [64]:
z = np.linspace(0, 10, 11)
print(z, '\n')

w = np.linspace(0, 10, 10)
print(w)

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.] 

[ 0.          1.11111111  2.22222222  3.33333333  4.44444444  5.55555556
  6.66666667  7.77777778  8.88888889 10.        ]


Thus in general the syntax is `linspace(<start>, <stop (inclusive)>, <# elements>)`.
We can exclude the stop value in `linspace` using `endpoint=False`:

In [51]:
print(np.linspace(0, 10, 10, endpoint=False))

[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
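
Since `linspace` takes the number of elements rather than the step size, it can help to ask it which spacing it actually used. The sketch below does this with the optional argument `retstep=True`, which is not covered in the text above. It also shows that the length of the result is always the one requested, which sidesteps the rounding issue seen with `arange` and float arguments.

```python
import numpy as np

# With the endpoint included, the spacing is (stop - start) / (num - 1).
values_incl, step_incl = np.linspace(0, 10, 11, retstep=True)
print(step_incl)                  # 1.0, i.e. (10 - 0) / (11 - 1)

# With the endpoint excluded, the spacing is (stop - start) / num.
values_excl, step_excl = np.linspace(0, 10, 10, endpoint=False, retstep=True)
print(step_excl)                  # 1.0, i.e. (10 - 0) / 10

# Four values from 1 to 1.3: the length is exactly what we asked for.
x = np.linspace(1, 1.3, 4)
print(len(x))                     # 4
```

For float ranges, specifying the number of points with `linspace` is therefore often a safer choice than specifying a float step with `arange`.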


__Exercise:__ What happens to the result of `linspace` if the starting value is greater than the stopping value? What if they are equal? And what if the third argument is zero or negative? 

## $ \S 4 $ Accessing and modifying individual array elements

Consider the following vector $ \mathbf{a} $:

In [37]:
a = np.array([1, 2, 3])
print(a)

[1 2 3]


To access or modify the $ 0 $-th element of $ \mathbf a $ (recall that we always
count from $ 0 $ in Python), we use the same syntax as we would if it were a
list:

In [38]:
print(a[0])  # Access 0-th element of `a`
a[0] = -1    # Modify this element
print(a)     # Print the result


1
[-1  2  3]


If we are dealing with a $ 2D $ array, we use `[i, j]` to access its $ (i, j) $-th entry, that is, the element in row $ i $ and column $ j $:

In [10]:
A = np.ones((2, 2))
print("Before modifications:")
print(A, '\n')

A[0, 1] = 0
A[1, 0] = 0 
print("After modifications:")
print(A)

Before modifications:
[[1. 1.]
 [1. 1.]] 

After modifications:
[[1. 0.]
 [0. 1.]]


In general, when dealing with an $ n $-dimensional array, use `[k_1, k_2, ..., k_n]` to access its element having indices $ k_1, k_2, \dots, k_n $.
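
For instance, here is a small sketch of indexing into a $ 3D $ array of shape $ (2, 2, 2) $. The name `T` is ours; the first index selects one of the two $ 2 \times 2 $ matrices, the second a row within it, and the third a column.

```python
import numpy as np

T = np.array([[[1, 2],
               [3, 4]],

              [[5, 6],
               [7, 8]]])

# Indices are listed in order: matrix, then row, then column.
print(T[1, 0, 1])   # 6: matrix 1, row 0, column 1

T[0, 1, 1] = 0      # replace the entry 4 by 0
print(T[0])         # the modified first matrix
```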

__Exercise:__ Build a "$ 3D $ identity array" $ M $ of shape $ (5, 5, 5) $ by
first populating it with zeros, then setting all elements with indices
of the form $ (i, i, i) $ equal to $ 1 $. How could you use a `for` loop to do
this? Can you set all five elements to $ 1 $ with just one instruction?

In [11]:
# Populate M with zeros:
# M = ...

# Set diagonal elements equal to 1:
# ...

# Print the result:
# print(M)


## $ \S 5 $ Slices

NumPy arrays can be sliced in a similar way to lists, by using the `:`
notation. More precisely, in the case of $ 1D $ arrays, the syntax is
`array[<start>:<stop>:<step>]`, where any of the three parts may be omitted.
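
A few illustrative slices of a small $ 1D $ array follow; the array `c` is an arbitrary example of ours.

```python
import numpy as np

c = np.array([10, 20, 30, 40, 50, 60])

print(c[1:4])    # [20 30 40]: indices 1, 2, 3 (the stop index is excluded)
print(c[:3])     # [10 20 30]: an omitted start defaults to the beginning
print(c[::2])    # [10 30 50]: every second element
print(c[::-1])   # [60 50 40 30 20 10]: a negative step reverses the order
```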

__Exercise:__ Let $ \mathbf a = (0, 1, \cdots, 10) $.

(a) Instantiate this array using `linspace` or `arange`.

(b) Take a slice of $ \mathbf a $ resulting in $ (0, 1, \cdots, 4) $.

(c) Slice $ \mathbf a $ to obtain the array $ (5, 6, \cdots, 10) $.

(d) Construct a slice to retrieve the subarray $ (3, 5, 7, 9) $.

(e) Take a full slice of $ \mathbf a $, call it $ \mathbf b $, and modify its $ 0
$-th element. Is $ \mathbf a $ affected?

📝 _Slicing an array only creates a view on the original array, not a copy_.
This behavior is by design, for efficiency reasons. However, this means that
modifying the data in the slice will also modify the original array. We will
discuss later how to create independent copies.
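
The sketch below makes the view behavior concrete. It uses `np.shares_memory` to test whether two arrays share data and, as a brief preview, the method `copy`; neither is introduced in the text above, and the variable names are illustrative.

```python
import numpy as np

a = np.array([10, 20, 30, 40])

view = a[1:3]                      # a slice: shares its data with `a`
view[0] = -1
print(a)                           # [10 -1 30 40]: `a` changed too
print(np.shares_memory(a, view))   # True

independent = a[1:3].copy()        # an explicit copy owns its data
independent[0] = 99
print(a)                           # [10 -1 30 40]: `a` is unaffected
```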

We can also access several specific indices of an array at once by passing a
list of indices. Unlike a slice, the result is a copy rather than a view:

In [25]:
b = np.arange(11)
print(b[[2, 5, 7]])  # Elements at indices 2, 5, and 7


[2 5 7]


Slicing becomes more interesting with higher-dimensional arrays. For a $ 2D $
array, we need to specify slices for both dimensions, separated by a comma.

In [23]:
M = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(M, '\n')

# Accessing a specific row:
print(M[1, :], '\n')  # Row of index 1 (the second row)

# Accessing a specific column:
print(M[:, 2], '\n')  # Column of index 2 (the third column)

# Sub-array slicing:
print(M[0:2, 1:3])  # Top right 2x2 sub-array


[[1 2 3]
 [4 5 6]
 [7 8 9]] 

[4 5 6] 

[3 6 9] 

[[2 3]
 [5 6]]


__Exercise:__ Create a $ 2D $ array of shape $ (4, 4) $ filled with your
favorite integers and print it for reference.

(a) Extract the last row using a slice.

(b) Extract the column of index $ 3 $.

(c) Extract a $ 2\times 2 $ sub-array from the center of this array.

(d) Extract the $ 3 \times 2 $ lower-left corner of the array.

(e) Extract the subarray consisting of columns indexed by $ 1 $ and $ 3 $ in two
different ways: by specifying these columns explicitly, and by using a slice
with step size $ 2 $.

📝 All of the principles we have seen extend to arrays of higher dimensions. For
each dimension, you can specify a `<start>:<stop>:<step>` slice, separated by
commas.
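
As a small illustration with a $ 3D $ array of shape $ (2, 2, 2) $ (the name `T` is ours), each of the three positions below holds either a single index or a slice:

```python
import numpy as np

T = np.array([[[1, 2],
               [3, 4]],

              [[5, 6],
               [7, 8]]])

print(T[:, 0, :])    # row 0 of each 2x2 matrix: rows [1 2] and [5 6]
print(T[1, :, 1])    # column 1 of matrix 1: [6 8]
print(T[:, :, 0])    # column 0 of each matrix: rows [1 3] and [5 7]
```

Each position filled with a single index removes one dimension from the result, while each slice keeps one.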

__Boolean indexing__ allows us to select elements from an array using an array
of boolean values of the same shape. This method is particularly useful for
filtering elements based on some condition. Here's an example:

In [5]:
v = np.array([1, 2, 3, 4, 5])
print("Original array:", v)

# Create a boolean index array (named `mask` so as not to shadow
# Python's built-in function `filter`):
mask = v > 2
print("Filter: ", mask)

# Use boolean indexing to select elements:
selected_elements = v[mask]
print("Selected elements:", selected_elements)

Original array: [1 2 3 4 5]
Filter:  [False False  True  True  True]
Selected elements: [3 4 5]


In NumPy, and Python in general, the Boolean value `True` is treated as
equivalent to $ 1 $ in numerical contexts, and `False` is equivalent to
$ 0 $. Thus, in the preceding example, we can find the number of elements
greater than $ 2 $ by taking the sum of the entries of the Boolean array:

In [10]:
print("Number of elements greater than two:", np.sum(mask))

Number of elements greater than two: 3
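
Conditions can also be combined, and a Boolean array can appear on the left-hand side of an assignment. The sketch below shows both; it relies on the element-wise operators `&` (and) and `|` (or), which are not covered in the text above, and its variable names are illustrative.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 5])

# Combine conditions element-wise with & (and) and | (or).
# The parentheses are required because & and | bind tighter than < and >.
middle = v[(v > 1) & (v < 5)]
print(middle)                 # [2 3 4]

extremes = v[(v < 2) | (v > 4)]
print(extremes)               # [1 5]

# A Boolean array can also select the elements to be overwritten.
w = np.array([1, 2, 3, 4, 5])
w[w > 3] = 0
print(w)                      # [1 2 3 0 0]
```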


__Exercise:__ You're provided below with a dataset `rainfall_data` of daily
rainfall measurements (in millimeters) in a city for four months (June to
September), stored as a 2D NumPy array.  Each row represents a month, and each
column represents a day.

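(a) Compute the average rainfall and the standard deviation across the entire
period using the functions `np.mean` and `np.std`. (By default `np.std` divides
by the number $ N $ of values, giving the population standard deviation; pass
`ddof=1` to obtain the sample standard deviation instead.)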

(b) Identify days with above-average rainfall for each month, and calculate the
percentage of such days in each month. _Hint:_ Use a for loop to iterate over
each month/row of the dataset. Use a Boolean filter to find which days
had greater than average rainfall and take the sum of the `True` values
as discussed above to compute the percentages. 

In [35]:
np.random.seed(123456789)

# Sample mean and standard deviation for each month (in mm):
mean_rainfall = [1.5, 2.7, 2.7, 0.9]  # Average daily rainfall (mm) for June, July, August, September
std_deviation = [1.5, 2, 2.5, 1.5]  # Variability in daily rainfall (mm)

# Generating the 2D array of rainfall data:
months = ["June", "July", "August", "September"]
days_in_month = 30
rainfall_data = np.zeros((4, days_in_month))

for i in range(4):
    # Simulate daily rainfall using a normal distribution:
    rainfall_data[i, :] = np.clip(np.random.normal(mean_rainfall[i],
                                  std_deviation[i], days_in_month), 0, None)
# Round to 2 decimal digits:
rainfall_data = np.round(rainfall_data, 2)

In [36]:
print(rainfall_data)

[[4.82 4.69 4.26 1.62 2.79 0.26 3.24 3.56 2.91 2.77 2.29 0.65 2.75 1.85
  0.23 0.   1.55 2.07 0.3  0.25 0.   1.78 1.42 3.09 2.37 1.02 1.1  1.98
  3.56 1.19]
 [3.25 3.03 1.83 2.89 0.   3.86 9.27 3.12 1.66 6.75 3.34 1.62 1.77 3.44
  2.05 4.97 0.   0.19 2.66 4.4  1.84 2.44 0.76 0.   1.29 1.43 2.95 4.09
  4.81 3.46]
 [3.49 1.97 4.84 1.42 0.52 3.14 1.3  1.48 1.29 0.71 4.79 1.14 1.37 0.43
  6.15 2.31 1.46 4.01 3.85 0.56 0.65 4.01 4.39 1.43 2.33 6.08 7.46 3.95
  7.37 2.54]
 [0.09 3.55 0.37 0.   1.54 1.46 0.61 0.   1.08 2.33 0.19 0.   0.   1.42
  0.   2.57 1.65 2.28 0.48 0.48 4.24 0.48 0.   1.08 1.62 0.   2.26 2.73
  0.   1.73]]


## $ \S 6 $ Other NumPy features

In summary, NumPy provides a high-performance multidimensional array object,
along with a wide range of tools for working with these arrays. As we will see
in the next notebook, arrays are far more efficient and convenient for numerical
computation than Python's built-in data types such as lists, both in memory
usage and in computation time. NumPy is used in data analysis, machine learning,
engineering and any other field that requires intensive numerical computation. It
also serves as the basis for higher-level scientific computing libraries such as
SciPy, pandas, and scikit-learn. Other features supplied by NumPy include (but
are not limited to):

* Basic statistical operations;
* Random number generation;
* Discrete Fourier transforms;
* Reading and writing arrays from and to files in text and binary formats.

We will meet and use some of these in other notebooks.
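
As a brief preview, the sketch below touches on several of these features. It is only an illustration under our own choice of functions (`np.random.default_rng`, `np.fft.rfft`, `np.savetxt` and `np.loadtxt`) and of made-up data; it is not meant as a survey of the corresponding modules.

```python
import io
import numpy as np

# Random number generation and basic statistics:
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=2.0, scale=0.5, size=1000)
print(sample.mean(), sample.std())

# Discrete Fourier transform of a cosine with 4 cycles over 32 samples:
t = np.arange(32)
signal = np.cos(2 * np.pi * 4 * t / 32)
spectrum = np.abs(np.fft.rfft(signal))
print(spectrum.argmax())          # 4: the dominant frequency bin

# Writing an array as text and reading it back (here via an in-memory buffer):
buffer = io.StringIO()
np.savetxt(buffer, np.array([[1.0, 2.0], [3.0, 4.0]]))
buffer.seek(0)
print(np.loadtxt(buffer))
```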