# Package: `numpy`

The main reference for this chapter is @McK2017.


## Basics

The core data structure of `numpy` is `numpy.ndarray`, called the *NumPy N-d array*. In most cases we will refer to it as `array` for simplicity. You may treat it as a generalized version of `list`, although it can do much more than the built-in `list`.

To use `numpy`, we just import it. In most cases you would like to use the alias `np`.

In [None]:
import numpy as np

Using the alias, we will refer to a NumPy N-d array as `np.array`.

### Understanding `np.array`
The simplest way to look at an `np.array` is to think of it as a list of lists. Here are some examples.

- This is an example of a 1d array. Note that it can be treated as a list. You may get access to its entries by 1 index, e.g. `a[0]`. This means that: we have a list, and we want to get the `0`th element in the list.

In [None]:
a = np.array([1, 2])
a

- This is an example of a 2d array. Note that it can be treated as a list of lists. You may get access to its entries by 2 indexes, e.g. `b[0, 0]`. This means that: we have a list of lists. We first get the `0`th element (which is a list), and then get the `0`th element from this `0`th list (which is a number).

In [None]:
b = np.array([[1, 2], [3,4]])
b

- This is an example of a 3d array. Note that it can be treated as a list of lists of lists. You may get access to its entries by 3 indexes, e.g. `c[0, 0, 0]`. This means that: we have a list of lists of lists. We first get the `0`th element (which is a list of lists), and then get the `0`th element (which is a list) from this `0`th list of lists, and then get the `0`th element (which is a number) from the previous list.

In [None]:
c = np.array([[[1, 2], [3,4]], [[1, 2], [3,4]]])
c

#### The dimension of `np.array`
There is a very confusing piece of terminology for `np.array`: dimension. The word actually used in the documentation is *axes*. The dimension of an `np.array` is the number of axes, that is, the number of indexes required to locate an entry.

In the previous example, `a` is a 1d array since you only need 1 index to get entries, `b` is a 2d array since you need 2 indexes to get entries, and `c` is a 3d array since you need 3 indexes to get entries.

We could use `.ndim` to check the dimension of a `np.array`.

In [None]:
d = np.array([[1, 2, 3], [4, 5, 6]])
d.ndim

::: {.callout-note}
## Comparing to Linear algebras
The dimension of an `np.array` and the dimension of a vector in linear algebra are totally different. For example, the array `a = np.array([1, 2])` defined earlier is a 1d `np.array` of length `2`. As a vector, it is a 2d vector.
:::

To describe the length of each axis, we could use `.shape`. It tells us the length of each axis. Since indexes start from `0`, the maximal index along an axis is its length minus `1`.


::: {#exm-}

In [None]:
d = np.array([[1, 2, 3], [4, 5, 6]])
d.shape

The shape of `d` is `(2, 3)`, which means that the length of axis 0 is `2` and the length of axis 1 is `3`. 

- Axis 0 is the vertical axis, and its index corresponds to rows. The length of axis 0 is the number of rows.
- Axis 1 is the horizontal axis, and its index corresponds to columns. The length of axis 1 is the number of columns.

So a 2d array can be treated as a matrix, and the shape being `(2, 3)` means that the matrix has `2` rows and `3` columns.
:::

::: {.callout-caution}
`.ndim` and `.shape` are attributes, not methods. No `()` follows them.
:::


#### Moving along axis
Many `numpy` methods have an argument `axis=`, which lets you specify the axis along which the action is performed. You may understand "along an axis" in the following way. `axis=i` means that when we perform the action, we keep all other indexes the same and only change the index on axis `i`.

For example, `b.sum(axis=0)` means that we want to add all entries along axis `0`. We start from a certain entry, keep all other indexes the same while changing only the index on axis `0`, and add all these entries together. Since axis `0` corresponds to the row index, changing only the row index means moving vertically. So if `b` is a 2d array, `b.sum(axis=0)` adds up the entries within each column and returns one sum per column.

We will do more examples later in this section.
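The "along an axis" rule can be checked on a small case. The following is a minimal sketch that reuses the 2d array `b` from the earlier example. The names `col_sums` and `row_sums` are introduced here only for illustration.

```python
import numpy as np

b = np.array([[1, 2], [3, 4]])

# axis=0: only the row index changes, so the entries of each column are added.
col_sums = b.sum(axis=0)

# axis=1: only the column index changes, so the entries of each row are added.
row_sums = b.sum(axis=1)

# The axis that is summed over disappears from the shape.
print(col_sums, col_sums.shape)
print(row_sums, row_sums.shape)
```

Both results are 1d arrays of length `2`: the first holds one sum per column, the second one sum per row.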



### Create `np.array`
As described above, `np.array` refers to the NumPy N-d array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy dimensions are called *axes*. The dimension of an array is the number of axes needed to index it, and it is NOT the dimension of a vector space.

There are many ways to create an `np.array`.

- convert a list into a numpy array.
- `np.zeros`, `np.zeros_like`
- `np.ones`, `np.ones_like`
- `np.eye`
- `np.random.rand`
- `np.arange`
- `np.linspace`
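As a quick orientation, here is a minimal sketch that calls each of the functions listed above once. The variable names and the particular shapes are chosen only for illustration.

```python
import numpy as np

from_list = np.array([[1, 2], [3, 4]])   # convert a nested list
zeros = np.zeros((2, 3))                 # the shape is passed as a tuple
ones_like = np.ones_like(from_list)      # same shape and dtype as from_list
identity = np.eye(3)                     # 3x3 identity matrix
uniform = np.random.rand(2, 3)           # uniform random numbers in [0, 1)
steps = np.arange(0, 10, 2)              # start, stop (excluded), step
grid = np.linspace(0, 1, 5)              # 5 evenly spaced points, endpoints included

for name, arr in [("zeros", zeros), ("ones_like", ones_like),
                  ("identity", identity), ("uniform", uniform),
                  ("steps", steps), ("grid", grid)]:
    print(name, arr.shape, arr.dtype)
```

Note that `np.arange` excludes the stop value while `np.linspace` includes it by default.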

::: {.callout-note}
Please be very careful about the format of the input. For example, to specify the shape of the array with `np.zeros`, you need to pass a `tuple`. On the other hand, `np.random.rand` takes the lengths of the axes one by one as separate arguments.

In [None]:
#| eval: false
import numpy as np

np.zeros((3, 2))
np.random.rand(3, 2)

In cases like this, the official documentation is always your friend.
:::



### Mathematical and Statistical Methods

- `+`, `-`, `*`, `/`, `**`, etc.
- `np.sin`, `np.exp`, `np.sqrt`, etc.

- `mean`, `sum`, `std`, `var`, `cumsum`
- `max` and `min`
- `maximum` and `minimum`
- `argmin` and `argmax`

- `np.sort`
- `np.unique`, `np.any`

- `np.dot`: matrix multiplication for 2d arrays
- `np.concatenate`

- Broadcasting
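Some of the names above are easy to mix up, for instance `max` versus `maximum`. The following is a minimal sketch on two small arrays `x` and `y`, which are made up for illustration.

```python
import numpy as np

x = np.array([[1, 5], [7, 3]])
y = np.array([[4, 2], [6, 8]])

total_max = x.max()              # largest entry of the whole array
col_max = x.max(axis=0)          # largest entry of each column
where_max = x.argmax(axis=1)     # column index of the largest entry in each row
elementwise = np.maximum(x, y)   # entry-by-entry maximum of two arrays

# Arithmetic operators act entry by entry, while np.dot is matrix multiplication.
entry_product = x * y
matrix_product = np.dot(x, y)

# Broadcasting: the length-2 array is added to every row of x.
shifted = x + np.array([10, 20])

print(total_max, col_max, where_max)
print(elementwise)
print(entry_product)
print(matrix_product)
print(shifted)
```

`max` reduces one array, possibly along an axis, while `maximum` compares two arrays entry by entry and returns an array of the same shape.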


::: {#exm-}
## Axis
Given `A = np.array([[1,2],[3,4]])` and `B = np.array([[5,6],[7,8]])`, please use `np.concatenate` to concatenate these two matrices to get a new matrix, in the order:

- `A` left, `B` right
- `A` right, `B` left
- `A` up, `B` down
- `A` down, `B` up
:::
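One possible way to produce the four layouts in the example above is sketched below. The variable names are chosen here for illustration. The key point is that `axis=1` joins arrays side by side and `axis=0` stacks them vertically.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# axis=1: the column index grows, so the arrays sit side by side.
left_right = np.concatenate((A, B), axis=1)   # A left, B right
right_left = np.concatenate((B, A), axis=1)   # A right, B left

# axis=0: the row index grows, so the arrays are stacked vertically.
up_down = np.concatenate((A, B), axis=0)      # A up, B down
down_up = np.concatenate((B, A), axis=0)      # A down, B up

print(left_right)
print(up_down)
```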


### Common attributes and methods

- `shape`
- `dtype`
- `ndim`
- Any arithmetic operation between equal-size arrays is applied element-wise.



::: {#exm-}
`MNIST` is a very famous dataset of handwritten images. Here is how to load it. Note that in this instance of the dataset the data are stored as `numpy` arrays.

In [None]:
#| eval: false
import tensorflow as tf

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train.shape

:::



## Indexing

### Basic indexing and slicing

First see the following example.


::: {#exm-}

In [None]:
import numpy as np
arr = np.arange(10)

print(arr[5])
print(arr[5:8])

arr[5:8] = 12
print(arr)

print(arr[5:8:2])
print(arr[8:5:-1])
print(arr[::-1])

:::


To slice in the higher dimensional case, you may either treat a `numpy` array as a nested list, or work with it directly using multi-indexes.


::: {#exm-}

In [None]:
import numpy as np
arr3d = np.arange(12).reshape(2, 2, 3)

print('case 1:\n {}'.format(arr3d))
print('case 2:\n {}'.format(arr3d[0, 1, 2]))
print('case 3:\n {}'.format(arr3d[:, 0: 2, 1]))
print('case 4:\n {}'.format(arr3d[:, 0: 2, 1:2]))

:::
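Cases 3 and 4 above select the same numbers but return arrays of different shapes. The following minimal sketch makes the difference explicit: an integer index removes its axis, while a slice keeps it. The variable names are chosen here for illustration.

```python
import numpy as np

arr3d = np.arange(12).reshape(2, 2, 3)

# Nested-list style and multi-index style reach the same entry.
nested = arr3d[0][1][2]
multi = arr3d[0, 1, 2]

# An integer index on the last axis drops that axis.
dropped = arr3d[:, 0:2, 1]

# A length-1 slice on the last axis keeps it.
kept = arr3d[:, 0:2, 1:2]

print(nested, multi)
print(dropped.shape, kept.shape)
```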


### Boolean Indexing
A `numpy` array can be indexed by a boolean `numpy` array. The entries at the positions marked `True` are selected.


::: {#exm-}

In [None]:
import numpy as np
a = np.arange(4)
b = np.array([True, True, False, True])
print(a)
print(b)
print(a[b])

:::

We could combine boolean indexing with logical computations to select only the elements we want.


::: {#exm-}
Please replace each odd number in the array with its negative.

In [None]:
import numpy as np
arr = np.arange(10)
odd = arr % 2 == 1          # boolean mask marking the odd entries
arr[odd] = arr[odd] * (-1)  # assigning through the mask modifies arr in place

print(arr)

:::
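Several conditions can be combined into one mask. The following is a minimal sketch, with names chosen for illustration. It uses `&` (and), `|` (or) and `~` (not), which act entry by entry on boolean arrays. Each comparison is wrapped in parentheses because these operators bind more tightly than `==` or `>`.

```python
import numpy as np

arr = np.arange(10)

even_and_big = (arr % 2 == 0) & (arr > 3)   # both conditions hold
small_or_big = (arr < 2) | (arr > 7)        # at least one condition holds
the_rest = ~even_and_big                    # negation of a mask

print(arr[even_and_big])
print(arr[small_or_big])
print(arr[the_rest])
```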

### Fancy indexing
Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays. 

::: {#exm-}

In [None]:
import numpy as np

arr = np.zeros((8, 4))
for i in range(8):
    arr[i] = i

arr[[4, 3, 0, 6]]

:::

::: {#exm-}

In [None]:
import numpy as np

arr = np.arange(32).reshape((8, 4))
print(arr)
print(arr[[1, 5, 7, 2], [0, 3, 1, 2]])
print(arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]])

:::
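The last two lines of the example above behave differently. Passing two index lists at once pairs them up and picks the entries `(1, 0)`, `(5, 3)`, `(7, 1)`, `(2, 2)`, while indexing rows first and columns second selects a whole rectangular block. The following minimal sketch compares the two and also shows `np.ix_`, which builds the block selection in one step. The variable names are chosen for illustration.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

paired = arr[rows, cols]             # one entry per (row, column) pair
two_steps = arr[rows][:, cols]       # reorder rows, then reorder columns
one_step = arr[np.ix_(rows, cols)]   # same block, built with np.ix_

print(paired.shape, two_steps.shape)
print(np.array_equal(two_steps, one_step))
```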


### [Copies and views](https://numpy.org/doc/stable/user/basics.copies.html)
A view of a `numpy` array is a way to access the array without copying its internal data. When you modify data through a view, the original array as well as all other views of it are modified simultaneously.

By default, basic indexing and slicing create views, and advanced indexing (e.g. boolean indexing, fancy indexing, etc.) creates copies. For other operations, you need to check the documentation to know how they work. For example, `np.reshape` creates a view where possible, and the method `.flatten()` always creates a copy.

You may use the methods `.view()` or `.copy()` to make views or copies explicitly.

::: {#exm-}

In [None]:
import numpy as np
arr = np.arange(10)
b = arr[5:8]
print('arr is {}'.format(arr))
print('b is {}'.format(b))

b[0] = -1
print('arr is {}'.format(arr))
print('b is {}'.format(b))


arr[6] = -2
print('arr is {}'.format(arr))
print('b is {}'.format(b))

print('The base of b is {}'.format(b.base))

:::


The way to make an explicit copy is `.copy()`.


::: {#exm-}

In [None]:
import numpy as np
arr = np.arange(10)
b = arr[5:8].copy()
print('arr is {}'.format(arr))
print('b is {}'.format(b))

b[0] = -1
print('arr is {}'.format(arr))
print('b is {}'.format(b))


arr[6] = -2
print('arr is {}'.format(arr))
print('b is {}'.format(b))

print('The base of b is {}'.format(b.base))

:::
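To check whether two arrays are backed by the same data, one option is `np.shares_memory`. The following is a minimal sketch, with names chosen for illustration, comparing a slice with a boolean-indexed and a fancy-indexed result.

```python
import numpy as np

arr = np.arange(10)

sliced = arr[5:8]        # basic slicing: a view
masked = arr[arr > 6]    # boolean indexing: a copy
fancy = arr[[5, 6, 7]]   # fancy indexing: a copy

print(np.shares_memory(arr, sliced))
print(np.shares_memory(arr, masked))
print(np.shares_memory(arr, fancy))

# Writing to a copy leaves the original untouched.
fancy[0] = -1
print(arr[5])
```

Note that this only concerns reading with an index. Assigning through an advanced index, such as `arr[arr > 6] = 0`, still modifies `arr` itself.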


## More commands

- `.T`
- `axis=n` is very important.
- `np.reshape()`
- `np.tile()`
- `np.repeat()`
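The following minimal sketch runs each of the commands above on small arrays that are made up for illustration. In particular it contrasts `np.tile`, which repeats the whole array, with `np.repeat`, which repeats each entry.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # 2 rows, 3 columns
v = np.array([7, 8])

transposed = m.T                  # rows and columns are swapped
tiled = np.tile(v, 3)             # the whole array is repeated 3 times
repeated = np.repeat(v, 3)        # each entry is repeated 3 times
rows_doubled = np.repeat(m, 2, axis=0)   # each row is repeated twice

print(transposed)
print(tiled)
print(repeated)
print(rows_doubled)
```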


### More advanced commands

- `np.where()`
- `np.any()`
- `np.all()`
- `np.argsort()`




::: {#exm-}
Get the position where elements of `a` and `b` match.

In [None]:
a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])

np.where(a == b)

:::

::: {#exm-}
With three arguments, `np.where(cond, x, y)` takes the entry from `x` where `cond` is `True` and the entry from `y` otherwise.

In [None]:
a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])

np.where(a == b, a*2, b+1)

:::


::: {#exm-}
## Playing with axis


In [None]:
import numpy as np
a = np.array([[[1,2],[3,4]],[[5,6],[7,8]]])

np.any(a==1, axis=0)
np.any(a==1, axis=1)
np.any(a==1, axis=2)


np.any(a==2, axis=0)
np.any(a==2, axis=1)
np.any(a==2, axis=2)

np.any(a==5, axis=0)
np.any(a==5, axis=1)
np.any(a==5, axis=2)

:::
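To read the outputs of the example above, it helps to track one entry. The following minimal sketch uses the value `6`, which is not one of the values tested above and is chosen here because it gives three different results. It sits at index `(1, 0, 1)`. Reducing along an axis removes that axis, so the `True` lands at the position given by the remaining two indexes.

```python
import numpy as np

a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

hit = a == 6   # True only at index (1, 0, 1)

along0 = np.any(hit, axis=0)   # axis 0 removed: True at (0, 1)
along1 = np.any(hit, axis=1)   # axis 1 removed: True at (1, 1)
along2 = np.any(hit, axis=2)   # axis 2 removed: True at (1, 0)

print(along0)
print(along1)
print(along2)
```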


## Examples
::: {#exm-}
## Random walks
Adam walks randomly along the axis. He starts from `0`. At every step he has equal probability of going left or right. Please simulate this process.

Use `choices` to record the choice of Adam at each step. We may generate a random array where `0` represents left and `1` represents right.

Use `positions` to record the position of Adam at each step. Using `choices`, the position increases by `1` if we see a `1` and decreases by `1` if we see a `0`. So the most elegant way to perform this is to 

1. Convert `choices` from `{0, 1}` to `{-1, 1}`.
2. Attach `0` to the beginning of the new `choices` to record the starting position.
3. Apply `cumsum` to `choices` to get `positions`.

In [None]:
import numpy as np

step = 30
choices = np.random.randint(2, size=step)
choices = choices * 2 - 1
choices = np.concatenate(([0], choices))
positions = choices.cumsum()

import matplotlib.pyplot as plt
plt.plot(positions)

:::

::: {#exm-}
## Many random walks
We mainly use `numpy.ndarray` to write the code in the previous example. The best part here is that it can be easily generalized to many random walks.

Still keep `choices` and `positions` in mind. Now we would like to deal with multiple people simultaneously. Each row represents one person's random walk. All the formulas stay the same. We only need to update the dimension setting in the previous code.

- Update `size` in `np.random.randint`.
- Update `[0]` to `np.zeros((N, 1))` in `concatenate`.
- For `cumsum` and `concatenate`, add `axis=1` to indicate that we perform the operations along `axis 1`.
- We plot each row in the same figure. `plt.legend` is used to show the label for each line.

In [None]:
import numpy as np

step = 30
N = 3
choices = np.random.randint(2, size=(N, step))
choices = choices * 2 - 1
choices = np.concatenate((np.zeros((N, 1)), choices), axis=1)
positions = choices.cumsum(axis=1)

import matplotlib.pyplot as plt
for row in positions:
    plt.plot(row)
plt.legend([1, 2, 3])

:::


::: {#exm-}
## Analyze `positions`
We play with the `numpy` array `positions` to get some information about the three random walks generated in the previous example.

- The maximal position:

In [None]:
positions.max()

- The maximal position for each one:

In [None]:
positions.max(axis=1)

- The maximal position across all three for each step:

In [None]:
positions.max(axis=0)

- Check, for each person, whether they ever reached position 3:

In [None]:
(positions>=3).any(axis=1)

- The number of people who ever reached position 3:

In [None]:
(positions>=3).any(axis=1).sum()

- The step at which each person first reaches their rightmost position:

In [None]:
positions.argmax(axis=1)

:::
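The same tools can also tell at which step each walk first reaches position 3. The following is a minimal sketch. To keep the result reproducible it uses a small hand-written `positions` array instead of the random one above, and the sentinel value `-1` for "never reached" is an assumption made here for illustration. It relies on the fact that `argmax` on a boolean array returns the index of the first `True`.

```python
import numpy as np

# Hand-written walks for illustration: one row per person.
positions = np.array([[0, 1, 2, 3, 2],
                      [0, -1, 0, -1, -2],
                      [0, 1, 0, 1, 2]])

reached = positions >= 3
ever_reached = reached.any(axis=1)

# argmax gives the first True in each row, but also 0 for rows with no True,
# so rows that never reach 3 are replaced by the sentinel -1.
first_step = np.where(ever_reached, reached.argmax(axis=1), -1)

print(ever_reached)
print(first_step)
```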


## Exercises

Many exercises are from @Pra2018.

::: {#exr-}

1. Create a $3\times3$ matrix with values ranging from 2 to 10.
2. Create a $10\times10$ 2D-array with `1` on the border and `0` inside.
3. Create a 2D array of shape `5x3` to contain random decimal numbers between `5` and `10`.
4. Create a 1D zero `np.array` of size 10 and update the sixth value to 11.
:::


::: {#exr-}
Write a function to reverse a `np.ndarray` (first element becomes last).
:::



::: {#exr-}
Given `a = np.array([1,2,3])`, please get the desired output `array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])`. You may use `np.repeat()` and `np.tile()`.
:::


::: {#exr-}
## Compare two `numpy` arrays
Consider two `np.array`s `x` and `y` of the same length. Compare them entry by entry. We would like to know how many entries are the same.

Please wrap your code into a function that returns the number of equal entries between `x` and `y`.
:::





::: {#exr-}
## Manipulate matrices
Please finish the following tasks. Let `arr = np.arange(9).reshape(3,3)`.

1. Swap rows `1` and `2` in the array `arr`.
2. Reverse the rows of a 2D array `arr`.
3. Reverse the columns of a 2D array `arr`.
:::



::: {#exr-}
Consider a 2d `np.array`.

In [None]:
arr = np.random.rand(4, 4)

1. Please compute the mean of each column.
2. Please compute the sum of each row.
3. Please compute the maximum of the whole array.
:::


::: {#exr-}
## Adding one axis
Please download [this file](assests/img/20220824224849.png).   

1. Please use `matplotlib.pyplot.imread()` to read the file as a 3d `np.array`. You may need to use the `matplotlib` package. It will be introduced later in this course. You may go to its [homepage](https://matplotlib.org/stable/users/getting_started/) to install it.
2. Check the shape of the array.
3. Add one additional axis to it as axis 0 to make it into a 4D array. 
:::


::: {#exr-}
## Understanding colored pictures
Please download [this file](assests/img/20220824224849.png) and use `matplotlib.pyplot.imread()` to read the file as a 3d `np.array`. You may need to use the `matplotlib` package. It will be introduced later in this course. You may go to its [homepage](https://matplotlib.org/stable/users/getting_started/) to install it.

A colored picture is stored as a 3d `np.array`. Axis `0` and axis `1` are the vertical and horizontal coordinates, which locate a specific point in the picture. Axis `2` has length `3`. It holds the color vector of the point, which represents the three primary colors: red, green and blue.

1. Find the maximum and minimum of the values in the array.
2. Compute the mean of the three colors at each point to get a 2d `np.array` where each entry represents the mean of the three colors at each point of the picture.
:::





::: {#exr-}
## Queries

1. Get all items between `5` and `10` from an array `a = np.array([2, 6, 1, 9, 10, 3, 27])`.
2. Consider `x = np.array([1, 2, 1, 1, 3, 4, 3, 1, 1, 2, 1, 1, 2])`. Please find the index of the 5th repetition of the number `1` in `x`.
:::


::: {#exr-}
Use the following code to get the dataset `iris` and three related `np.array`: `iris_1d`, `iris_2d` and `sepallength`. 

In [None]:
import numpy as np

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None, encoding=None)
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', encoding=None)
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0],
                            encoding=None)

1. `iris_1d` is a 1D `numpy` array in which each item is a tuple. Please construct a new 1D `numpy` array in which each item is the last component of the corresponding tuple in `iris_1d`.

2. Convert `iris_1d` into a 2D array `iris_2d` by omitting the last field of each item.
3. `np.isnan()` is a function to check whether each entry of a `np.array` is `nan` or not. Please use `np.isnan()` as well as `np.where` to find all `nan` entries in `iris_2d`. 
4. Select the rows of `iris_2d` that do not have any `nan` value.
5. Replace all `nan` with `0` in `iris_2d`.
:::






::: {#exr-}
## Random
Please finish the following tasks. 

1. Use the package `np.random` to flip a coin 100 times and record the result in a list `coin`.
2. Assume that the coin is not fair, and the probability of getting `H` is `p`. Write code to flip the coin 100 times and record the result in a list `coin`, with a given parameter `p`. You may use `p=.4` as the first choice.
3. For each list `coin` created above, write code to find the longest `H` streak. We only need the largest number of consecutive `H` we get during these 100 tosses. It is NOT necessary to know when the streak starts.
:::

<!-- 
<details>
<summary>Click for Hint.</summary>

::: {.solution}
The following ideas can be used to solve the problem.

- `np.where`
- string, `split` and `join`
:::

</details>
 -->


::: {#exr-}
## Bins
Please read the [document of `np.digitize`](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html#numpy-digitize), and use it to do the following task.

Set the following bins:

- Less than `3`: `small`
- `3-5`: `medium`
- Bigger than `5`: `large`

Please transform the following data `iris_2c` into texts using the given bins.

In [None]:
import numpy as np
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2c = np.genfromtxt(url, delimiter=',', dtype='object')[:, 2].astype('float')

:::



::: {#exr-}
Consider a 2D numpy array `a`. 

In [None]:
import numpy as np
a = np.random.rand(5, 5)

1. Please sort it along the 3rd column.
2. Please sort it along the 2nd row.

You may use `np.argsort()` for the problem.
:::



::: {#exr-}
## One-hot vector
Compute the one-hot encodings of a given array. You may use the following array as a test example. In this example, there are `3` labels. So the one-hot vectors are 3-dimensional vectors.

For more information about one-hot encodings, you may check the [Wiki page](https://en.wikipedia.org/wiki/One-hot#Machine_learning_and_statistics). You are not allowed to use packages that can directly compute the one-hot encodings for this problem. 

In [None]:
import numpy as np
arr = np.random.randint(1,4, size=6)

:::



::: {#exr-}
Consider `arr = np.arange(8)`. A stride of `arr` with a window length of `4` and strides of `2` is a 2d `np.array` that looks like `[[0,1,2,3], [2,3,4,5], [4,5,6,7]]`.

Please write a function that takes `arr`, `length` and `strides` as inputs and returns the corresponding stride of `arr`.
:::





## References {.unnumbered}

::: {#refs}
:::