# Package: `numpy`

The main reference for this chapter is [@McK2017].


## Basics

The core data structure of `numpy` is `numpy.ndarray`, the *NumPy N-dimensional array*. In most cases we create one with the function `np.array()` and simply call it an array. You may treat it as a generalized version of `list`, although it can do much more than the built-in `list`.

To use `numpy`, we just import it. In most cases you would like to use the alias `np`.

In [None]:
import numpy as np

<!-- Using alias, we will just call NumPy Nd array `np.array`. -->

### Understanding `ndarray`
The simplest way to look at an `ndarray` is to think of it as a list of lists. Here are some examples.

- This is an example of a 1d array. Note that it can be treated as a list. You may get access to its entries by 1 index, e.g. `a[0]`. This means that: we have a list, and we want to get the `0`th element in the list.

In [None]:
a = np.array([1, 2])
a

- This is an example of a 2d array. Note that it can be treated as a list of lists. You may get access to its entries by 2 indexes, e.g. `b[0, 0]`. This means that: we have a list of lists. We first get the `0`th element (which is a list), and then get the `0`th element from this `0`th list (which is a number).

In [None]:
b = np.array([[1, 2], [3,4]])
b

- This is an example of a 3d array. Note that it can be treated as a list of lists of lists. You may get access to its entries by 3 indexes, e.g. `c[0, 0, 0]`. This means that: we have a list of lists of lists. We first get the `0`th element (which is a list of lists), and then get the `0`th element (which is a list) from this `0`th list of lists, and then get the `0`th element (which is a number) from the previous list.

In [None]:
c = np.array([[[1, 2], [3,4]], [[1, 2], [3,4]]])
c

#### The dimension of `ndarray`
There is a potentially confusing term for `ndarray`: dimension. The word used in the official documents is *axes*. The dimension of an `ndarray` is the number of its axes, that is, the number of indexes required to locate an entry.

In the previous example, `a` is a 1d array since you only need 1 index to get entries, `b` is a 2d array since you need 2 indexes to get entries, and `c` is a 3d array since you need 3 indexes to get entries.

We can use `.ndim` to check the dimension of an `ndarray`.

In [None]:
d = np.array([[1, 2, 3], [4, 5, 6]])
d.ndim

::: {.callout-note}
## Comparing to linear algebra
The dimension of an `ndarray` and the dimension of a vector in linear algebra are totally different. In this example, as an `ndarray`, `a=np.array([1, 2])` is a 1d `ndarray` of length `2`. As a vector, it is a 2d vector.
:::

To describe the length of each axis, we can use `.shape`. It tells us the length of each axis. In other words, it tells us how many indexes are available on each axis (the maximal index is the length minus `1`).


::: {#exm-}

In [None]:
d = np.array([[1, 2, 3], [4, 5, 6]])
d.shape

The shape of `d` is `(2, 3)`, which means that the length of axis 0 is `2` and the length of axis 1 is `3`. 

- Axis 0 is the vertical axis, and its index corresponds to rows. The length of axis 0 is the number of rows.
- Axis 1 is the horizontal axis, and its index corresponds to columns. The length of axis 1 is the number of columns.

So a 2d array can be treated as a matrix, and the shape being `(2, 3)` means that the matrix has `2` rows and `3` columns.
:::

::: {.callout-caution}
`.ndim` and `.shape` are attributes, not methods. There is no `()` after them.
:::
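
As a quick check of these two attributes, the following sketch rebuilds the arrays `a`, `b` and `c` from the examples above and reads `.ndim` and `.shape` for each of them. The names `dims` and `shapes` are only used here for illustration.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([[1, 2], [3, 4]])
c = np.array([[[1, 2], [3, 4]], [[1, 2], [3, 4]]])

# .ndim and .shape are attributes, so they are read without parentheses.
dims = [a.ndim, b.ndim, c.ndim]        # [1, 2, 3]
shapes = [a.shape, b.shape, c.shape]   # [(2,), (2, 2), (2, 2, 2)]

# The number of entries in .shape always equals .ndim.
print(dims, shapes)
```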


#### Moving along axis {#sec-moving-along-axis}
A lot of `numpy` functions and methods have an argument `axis=`, which lets you specify the axis along which the action is performed. You may understand this "along axis" in the following way. `axis=i` means that when we perform the action, we keep all other indexes the same and only change the index on axis `i`.

For example, `b.sum(axis=0)` means that we want to add all entries along axis `0`. So we start from a certain entry, keep all other indexes the same while changing the index on axis `0` only, and add all these entries together. Since axis `0` corresponds to the row index, only changing the row index means that we are moving vertically. So if `b` is a 2d array, `b.sum(axis=0)` adds up the entries within each column and returns one sum per column.

We will do more examples later in this section.
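
The idea of "moving along an axis" can be sketched with the array `b` from the earlier example. The explicit loop in `manual` is only there to show which index changes; it is not how `numpy` computes the sum internally.

```python
import numpy as np

b = np.array([[1, 2], [3, 4]])

# axis=0: the row index varies and the column index is fixed -> one sum per column.
col_sums = b.sum(axis=0)   # array([4, 6])

# axis=1: the column index varies and the row index is fixed -> one sum per row.
row_sums = b.sum(axis=1)   # array([3, 7])

# The same column sums written as an explicit loop over the index on axis 0.
manual = [sum(b[i, j] for i in range(b.shape[0])) for j in range(b.shape[1])]
```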



### Create `ndarrays`
There are many ways to create `ndarrays`. We list some basic ways below.


::: {.callout-note collapse="true"}
# Converting from a Python `list`
You may apply `np.array()` to a `list` to convert it into a `ndarray`.

1. A list of numbers will create a 1d `ndarray`.
2. A list of lists will create a 2d `ndarray`.
3. Further nested lists will create a higher-dimensional `ndarray`.

All arrays in the previous sections are created in this way.
:::

::: {.callout-note collapse="true"}
# Intrinsic `numpy` array creation functions
Here is an incomplete list of such functions.

1. `np.ones()` and `np.zeros()`
    - Both of them create an `ndarray` of the specified shape, filled with ones or zeros.
2. `np.eye()` and `np.diag()`
    - `np.eye()` creates a 2d array with ones on the diagonal, and `np.diag()` builds a 2d diagonal array from a 1d input. So they can also be treated as creating matrices. Note that `np.diag()` applied to a 2d array extracts its diagonal as a 1d array instead.

3. `np.arange(start, stop, step)`
    - It only creates a 1d array, which runs from `start` towards `stop` with the step size `step`.
    - `start` is by default `0` and `step` is by default `1`.
    - `stop` is NOT included, which is similar to slicing a Python `list`. With a non-integer `step`, rounding may occasionally make the last entry reach `stop`, so `np.linspace()` is the safer choice in that case.
    - The syntax is very similar to `range()`. The main difference between them is the object type of the output.

4. `np.linspace(start, stop, num)`
    - It only creates a 1d array of `num` equally spaced points, which starts at `start` and stops at `stop`.
    - By default both `start` and `stop` are INCLUDED in the array. Setting `endpoint=False` excludes `stop`.

5. `np.random.rand()` and many other functions in `np.random` package.

These functions are straightforward. You may go to the official documents for more details. For example [this](https://numpy.org/doc/stable/reference/generated/numpy.arange.html) is the page for `np.arange()`. You may find other functions on the left navigation bar, or you may use the search function to locate them.
:::
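
The following sketch calls each of the creation functions listed above once, with small inputs chosen only for illustration.

```python
import numpy as np

ones = np.ones((2, 3))            # 2 rows, 3 columns, all entries 1.0
identity = np.eye(3)              # 3x3 identity matrix
diagonal = np.diag([1, 2, 3])     # 3x3 matrix with 1, 2, 3 on the diagonal

evens = np.arange(0, 10, 2)       # array([0, 2, 4, 6, 8]); stop=10 is excluded
grid = np.linspace(0, 1, 5)       # array([0., 0.25, 0.5, 0.75, 1.]); both ends included

random_block = np.random.rand(2, 2)   # uniform random numbers in [0, 1)
```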


::: {.callout-note collapse="true"}
# Reading from files
`numpy` provides several functions to read and write files. We discuss the most commonly used one: `np.genfromtxt()`.

`np.genfromtxt()` is used to load data from a text file, with missing values handled as specified. The idea of this function is to first read the file as a string and then parse the structure of the string, automatically. 

There are many arguments. Here are a few commonly used. For more details please read the [official tutorial](https://numpy.org/doc/stable/user/basics.io.genfromtxt.html).

- `dtype`: Data type of the resulting array. If `None`, the dtypes will be determined by the contents of each column, individually.
- `delimiter`: The string used to separate values. By default, any consecutive whitespace acts as a delimiter.
- `usecols`: Which columns to read, with 0 being the first.
- `encoding`: This is used to decode the input file. In older versions of `numpy` the default setting for `encoding` is `bytes`. If it is set to `None` the system default is used. Please pay attention to the differences between these two, and check the documents of your installed version.

Note that when choosing `dtype`, if the type is NOT a single type, the output will be a 1d array with each entry being a tuple. If it is a single type, the output will be a 2d array. Please see the following example.

In [None]:
import numpy as np

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None, encoding=None)
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', encoding=None)
iris_2d_str = np.genfromtxt(url, delimiter=',', dtype='str', encoding=None)
iris_1d[:10]

In [None]:
iris_2d[:10]

In [None]:
iris_2d_str[:10]

We only show the first 10 rows to save some display room.

You may also download the datafile from the `url` provided in the code. The file can be opened with any editor. It is displayed below for reference. 

In [None]:
#| echo: false
import urllib.request
datafile = urllib.request.urlopen(url).read()
print(datafile)

The file can be understood as follows. `\n` separates rows and `,` separates columns. Each row contains five columns, where the last one is a string and the first four are numeric. Therefore the whole dataset is of mixed type.

1. In the first command, `dtype=None`. Each column gets its own type, so it returns a 1d array with each row being a tuple.
2. In the second command, `dtype='float'`. Only `float` data is accepted, so we get a 2d array in which the last column (string data that cannot be translated into a float) is `np.nan`.
3. In the third command, `dtype='str'`. All data are translated into strings, and we get a 2d array.



<!-- - Binary files
    - `np.save()` is used to save one `np.array` array into the binary `.npy` format. 
    - `np.savez()` is used to save multiple `np.array` arries into one uncompressed `.npz` format. 
    - `np.load()` is used to load the `np.array` data from either `.npy` or `.npz`.

    `.npy` is the standard binary file format in `numpy`. `.npz` is a ZipFile containing multiple `.npy` files. These file formats are used in `numpy` naively so `numpy` work with them very fast, comparing to other file formats like `.txt` and `.csv`. 
    
    The downside is that these files are not human-readable, and are hard to be understood by other programs. So if you would use other ways to deal with these data, you may want to save the data in other formats. -->
:::
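
If the `url` is not reachable, the effect of `dtype` can still be seen on a small in-memory sample. This is a minimal sketch: the two rows in `sample` are made up for illustration, and `io.StringIO` is used so that `np.genfromtxt()` can read the string like a file.

```python
import io
import numpy as np

# Two made-up rows in the same style as the iris file: numeric columns, then a string column.
sample = "5.1,3.5,setosa\n4.9,3.0,setosa\n"

# dtype=None: each column gets its own type, so the result is a 1d structured array.
mixed = np.genfromtxt(io.StringIO(sample), delimiter=',', dtype=None, encoding=None)

# dtype=float: a plain 2d array; the string column cannot be converted and becomes nan.
floats = np.genfromtxt(io.StringIO(sample), delimiter=',', dtype=float, encoding=None)

print(mixed.shape, floats.shape)
```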



::: {.callout-note collapse="true"}
# Changing the shape of other `ndarrays`
There are multiple ways to manipulate the shapes of `ndarrays`. We will only mention some commonly used ones in this section.

1. `np.concatenate()`

`np.concatenate()` is used to join a sequence of `ndarrays` along an existing axis. The major input arguments are:

- A tuple which represents the sequence of `ndarrays`.
- The axis for the `ndarrays` to be concatenated. The default is `axis=0`.

The setting for `axis` is the same as in @sec-moving-along-axis. That is, working along the axis `i` means that we collect all the entries which share the same other indexes and differ only in the `i`th index.

A quick example is a 2d `ndarray`. When talking about `axis=0`, we are looking at entries that have the same `1st` index and different `0th` index. These are all the entries in one column. So if we want to do something vertically, we need to set `axis=0`.

Similarly, `axis=1` means that we are looking at the entries with the same `0th` index and different `1st` index. These are entries in the same row. So `axis=1` means horizontally. Please see the following example.

::: {#exm-}
## Axis
Given `A = np.array([[1,2],[3,4]])` and `B = np.array([[5,6],[7,8]])`, please use `np.concatenate` to concatenate these two matrices into a new matrix, in each of the following orders:

In [None]:
#| echo: false
A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])

- `A` left, `B` right

In [None]:
np.concatenate((A, B), axis=1)

- `A` right, `B` left

In [None]:
np.concatenate((B, A), axis=1)

- `A` up, `B` down

In [None]:
np.concatenate((A, B), axis=0)

- `A` down, `B` up

In [None]:
np.concatenate((B, A), axis=0)

:::

2. Reshape

The `np.reshape()` function and the `.reshape()` method are equivalent. They return an `ndarray` with the same data in a new shape, while the original `ndarray` keeps its shape. Please see the following example.


In [None]:
A = np.array([[1, 2, 3], [4, 5, 6]])
A.reshape((6, 1))

3. Transpose

There are three ways to perform transpose.

- `np.transpose()` function
- `.transpose()` method
- `.T` attribute

Please see the following example.

In [None]:
A = np.array([[1, 2, 3], [4, 5, 6]])
A.T

Note that in the third way, `.T` is an attribute and NOT a function, so there is no `()` at the end.

:::
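
The following sketch compares reshaping and transposing on the same `A`. It also uses `-1` in the shape, which asks `numpy` to infer that length from the total number of entries; this shortcut is not discussed above and is shown here only as a convenience.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

# -1 lets numpy infer the length of that axis (here 6).
column = A.reshape((-1, 1))

# Reshape keeps the entries in reading order, transpose swaps the two axes.
reshaped = np.reshape(A, (3, 2))   # [[1, 2], [3, 4], [5, 6]]
transposed = A.T                   # [[1, 4], [2, 5], [3, 6]]

# The three ways to transpose agree, and A itself keeps its shape.
same = np.array_equal(np.transpose(A), A.transpose()) and np.array_equal(A.T, A.transpose())
print(A.shape, column.shape, same)
```

Although `reshaped` and `transposed` both have shape `(3, 2)`, their entries are arranged differently.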


::: {.callout-caution collapse="true"}
# Pay attention to the format of inputs
Please be very careful about the format of the input. For example, when you want to specify the shape of the array using `np.zeros`, you need to input a `tuple`. On the other hand, when using `np.random.rand`, you directly input the lengths of the axes one by one.

In [None]:
#| eval: false
import numpy as np

np.zeros((3, 2))
np.random.rand(3, 2)

In this case, the official documents are always your friend.
:::



### Mathematical and Statistical Methods
Many functions perform element-wise operations on data in `ndarrays`, and support array broadcasting, type casting, and several other standard features. A function of this type is called a universal function (or *ufunc* for short).

With ufuncs, using `ndarrays` enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is commonly referred to as *vectorization*.

Please see the following example.

::: {#exm-}

In [None]:
import numpy as np
x = np.linspace(0, 1, 101)
y = np.sin(x)
z = y**2 + 2*y-3

This example defines two functions $y=\sin(x)$ and $z=y^2+2y-3$. The syntax is very similar to the math language.
:::
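
To see what vectorization replaces, the following sketch computes the same `z` twice: once with array expressions and once with an explicit Python loop based on `math.sin`. The names `z_vec` and `z_loop` are only used here for illustration.

```python
import math
import numpy as np

x = np.linspace(0, 1, 101)

# Vectorized: one array expression, no explicit loop.
y = np.sin(x)
z_vec = y**2 + 2*y - 3

# The same computation with an explicit loop over the entries of x.
z_loop = np.array([math.sin(t)**2 + 2*math.sin(t) - 3 for t in x])

# Both approaches agree up to floating point rounding.
same = np.allclose(z_vec, z_loop)
print(same)
```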


::: {.callout-caution} 
Please pay attention to the difference between `numpy` functions and `ndarray` methods. `numpy` functions are functions defined in the `numpy` package, and you use them by applying them to the arguments. `ndarray` methods are functions attached to an `ndarray`, and you use them by calling them after the `ndarray` with the `.` symbol. In the official documents, a `numpy` function looks like `numpy.XXX()` while an `ndarray` method looks like `numpy.ndarray.XXX()`. Please see the following example.

- [`np.amax()`](https://numpy.org/doc/stable/reference/generated/numpy.amax.html#numpy.amax) is a numpy function. It is used to find the maximum of an array. Assuming `a` is a `np.array`, then the syntax is `np.amax(a)`.
- [`.max()`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.max.html) is a `np.array` method. It is used to find the maximum of an array. Assuming `a` is a `np.array`, then the syntax is `a.max()`.
:::
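
A function and the method of the same name do not always behave identically. The following sketch shows that `np.amax(a)` and `a.max()` agree, while `np.sort(a)` returns a sorted copy and `a.sort()` sorts `a` itself and returns `None`.

```python
import numpy as np

a = np.array([3, 1, 2])

# Function and method give the same maximum.
largest_fn = np.amax(a)      # 3
largest_method = a.max()     # 3

# np.sort returns a new sorted array and leaves a untouched.
sorted_copy = np.sort(a)     # array([1, 2, 3])
before = a.copy()            # still array([3, 1, 2])

# The .sort() method sorts a in place and returns None.
returned = a.sort()
print(sorted_copy, before, a, returned)
```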
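Here is an incomplete list of commonly used functions and methods. Many of them are ufuncs, while others (e.g. `np.dot()`, `np.unique()` and `np.sort()`) work on the whole array. Some come with brief introductions. For more details please read the official documents.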

- `numpy` functions
    - `+`, `-`, `*`, `/`, `**`, etc.
    - `>`, `<`, `>=`, `<=`, `==`, `!=`, etc.
    - `np.sin()`, `np.exp()`, `np.sqrt()`, etc.
    - `np.dot()`: Matrix multiplication.
    - `np.unique()`: Find all unique values in the array.
    - `np.maximum()` and `np.minimum()`: These are used to find the entry-wise maximum/minimum of two `np.array`.
    - `np.argmax()` and `np.argmin()`: Return the indices of the maximum/minimum values. There are also `.argmin()` and `.argmax()` methods.
    - `np.sort()`: Return a sorted copy of the array. There is also a `.sort()` method.
- `ndarray` methods
    - `.mean()`, `.sum()`, `.std()`, `.var()`: Array methods that are used to compute corresponding properties of the array.
    - `.cumsum()`: Return the cumulative sum of the elements along a given axis. 
    - `.max()` and `.min()`: These are used to find the maximal/minimal entry of one `np.array`.
    - `.argmax()` and `.argmin()`: Return the indices of the maximum/minimum values. There are also `np.argmax()` and `np.argmin()` functions.
    - `.sort()`: Sort the array in place. There is also a `np.sort()` function, which returns a sorted copy.


::: {.callout-tip}
Don't forget that most of these functions and methods have an `axis` argument to specify which axis you want to move along.
:::
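
The following sketch applies a few of the methods above with and without `axis`, using the array `d` from the earlier section. It also shows that `np.maximum()` compares two arrays entry by entry, which is different from `.max()`.

```python
import numpy as np

d = np.array([[1, 2, 3], [4, 5, 6]])

overall = d.sum()                # 21, no axis: all entries are used
col_means = d.mean(axis=0)       # array([2.5, 3.5, 4.5]), one mean per column
row_cumsum = d.cumsum(axis=1)    # [[1, 3, 6], [4, 9, 15]], running sums inside each row
row_argmax = d.argmax(axis=1)    # array([2, 2]), column index of the maximum in each row

# np.maximum works entry by entry on two arrays: here [1, 2, 3] and [6, 5, 4].
pairwise = np.maximum(d[0], d[1][::-1])   # array([6, 5, 4])
```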

#### Broadcasting 

Although most `numpy` functions and `ndarray` methods compute entry-wise, when we perform binary operations, the shapes of the two arrays don't have to be the same. If they are not the same, the Broadcasting Rule applies, and some entries will be filled automatically by repeating themselves.


::: {.callout-note}
# The Broadcasting Rule
Two arrays are compatible for broadcasting if, comparing their shapes from the last axis backwards, for each dimension the axis lengths match or either of the lengths is 1. Broadcasting is then performed over the missing or length 1 dimensions.
:::
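
The rule can be checked on shapes directly. In this sketch, a `(3, 1)` array and a `(4,)` array are compatible and give a `(3, 4)` result, while arrays of lengths `3` and `2` are not compatible and raise a `ValueError`. The names `col`, `row` and `table` are only used here for illustration.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4)                 # shape (4,), treated as (1, 4)

# Lengths 1 and 4 on the last axis, 3 and (missing) on the first: compatible.
table = col + row                  # shape (3, 4), table[i, j] == i + j

# Lengths 3 and 2 on the last axis: neither matches nor equals 1.
try:
    np.arange(3) + np.arange(2)
    compatible = True
except ValueError:
    compatible = False

print(table.shape, compatible)
```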

Please see the following examples.

In [None]:
import numpy as np
a = np.array([1, 2])
a + 1

In [None]:
b = np.array([[3, 4], [5, 6]])
a + b

In [None]:
c = np.array([[1], [2]])
b + c

## Indexing

### Basic indexing

Basic indexing is very similar to indexing and slicing for `list`. Please see the following examples.


::: {#exm-}

In [None]:
import numpy as np
arr = np.arange(10)
arr

In [None]:
arr[5]

In [None]:
arr[5:8]

In [None]:
arr[5:8:2]

In [None]:
arr[8:5:-1]

In [None]:
arr[::-1]

In [None]:
arr[5:8] = 12
arr

:::


To slice a higher-dimensional array, you may put one index or one slice on each axis, separated by commas.


::: {#exm-}

In [None]:
import numpy as np
arr3d = np.arange(12).reshape(2, 2, 3)
arr3d

In [None]:
arr3d[0, 1, 2]

In [None]:
arr3d[:, 0: 2, 1]

In [None]:
arr3d[:, 0: 2, 1:2]

:::
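
A useful way to read such expressions is to look at the shape of the result: an integer index removes its axis, while a slice keeps it, even if the slice has length `1`. The following sketch repeats the indexing above and records the shapes.

```python
import numpy as np

arr3d = np.arange(12).reshape(2, 2, 3)

single = arr3d[0, 1, 2]          # 5, three integers: no axis is left
middle = arr3d[:, 0:2, 1]        # [[1, 4], [7, 10]], the last axis is removed
kept = arr3d[:, 0:2, 1:2]        # same entries, but the last axis is kept with length 1

print(middle.shape, kept.shape)  # (2, 2) (2, 2, 1)
```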


::: {.callout-caution}
# Nested indexes
In theory, since `ndarrays` can be treated as lists of lists, it is possible to use nested indexes to get access to entries. For example, assuming `a` is a 2d `ndarray`, we might use `a[0][0]` to get access to `a[0, 0]`. This is legal syntax.

However it is strongly recommended NOT to do so. The main reason is the copy/view rules that will be described later. Nested indexes might cause a lot of confusion and are very likely to cause unexpected errors.
:::
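
One situation where nested indexes go wrong is assignment. In the following sketch, the first assignment is silently lost because `a[[0, 1]]` is a copy, while the second one uses a single multi-index and changes `a`. The arrays are made up for illustration.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
original = a.copy()

# Reading: both forms give the same entry.
same_entry = (a[0][0] == a[0, 0])

# Writing through nested indexes: a[[0, 1]] is a copy, so the assignment is lost.
a[[0, 1]][:, 0] = -1
lost = np.array_equal(a, original)   # True, a did not change

# Writing with one multi-index reaches the original array.
a[[0, 1], 0] = -1
print(lost, a)
```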

### Advanced Indexing
Advanced indexing is triggered when the selection object satisfies some conditions. The concrete definition is technical and abstract. You may (not entirely correctly) understand it as "everything other than basic indexing (concrete coordinates or slicing)". Please read the [official document](https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing) for more details. 

Here we mainly focus on some typical advanced indexing methods.

::: {.callout-caution}
There are some very exotic examples for which it is very hard to tell whether they belong to basic indexing or advanced indexing. Our suggestion is to avoid this type of code, and to code in the most straightforward way. You could come back to understand this problem later when you are more experienced, but it is more of a Programming Language problem than a Data Science problem.
:::




::: {.callout-note collapse="true"}
# Fancy indexing
Fancy indexing is a term adopted by `numpy` to describe indexing using integer arrays. 

The basic idea is to use a `list` of indexes to select entries. The general rule is relatively complicated. Here we will only talk about 1d and 2d cases.


::: {.callout-tip}
# 1d case
:::

When dealing with a 1d `ndarray`, indexing by a `list` is straightforward. Please see the following example.

In [None]:
import numpy as np
arr = np.arange(16)

arr[[1, 3, 0, 2]]

::: {.callout-caution collapse="true"}
# A tricky example
Please consider the following two indexings. 

In [None]:
arr[1:2]

In [None]:
arr[[1]]

At first glance, the two outputs look the same. However they come from two different techniques.

- The `1:2` in `arr[1:2]` is a `slice`. Therefore the first indexing is basic indexing. 
- The `[1]` in `arr[[1]]` is a `list`. Therefore the second indexing is advanced indexing.

The main reason to distinguish these two indexings is about view and copy, which will be discussed in the next section.
:::


::: {.callout-tip}
# 2d case
:::

When dealing with a 2d `ndarray`, there are multiple possibilities. In the following discussion we will use the following example.

In [None]:
A = np.arange(16).reshape((4, 4))
A

::: {.callout-note collapse="true"}
# 1. If only one `list` is given
If only one `list` is given, this `list` is considered as the list of row indexes. The resulting `ndarray` is always 2d.

In [None]:
A[[3, 1]]

:::


::: {.callout-note collapse="true"}
# 2. If two arguments are given, one is a `list`, the other is `:`
If two arguments are given, one is a `list`, the other is `:`, this `list` refers to row indexes if it is in the first argument place, and refers to column indexes if it is in the second argument place. The resulting `ndarray` is always 2d.

In [None]:
A[[3, 1], :]

In [None]:
A[:, [3, 1]]

:::


::: {.callout-note collapse="true"}
# 3. If both arguments are `lists` of the same length
If both arguments are `lists` of the same length, they are considered as the `list` of `axis 0` coordinates and the `list` of `axis 1` coordinates. In this case, the resulting `ndarray` is 1d.

In [None]:
A[[0, 1], [3, 1]]

In this example, the two `lists` together give two entries.

- The coordinate of the first entry is `(0, 3)` since they are the first entry of each `list`. The `(0, 3)` entry in `A` is `3`.
- The coordinate of the second entry is `(1, 1)` since they are the second entry of each `list`. The `(1, 1)` entry in `A` is `5`. 

Then the result is `array([3, 5])`, as shown above.

:::


::: {.callout-note collapse="true"}
# 4. If both arguments are `lists`, and one of the `lists` is of length `1`
If both arguments are `lists`, and one of the `lists` is of length `1`, it is the same as the previous case, with the `list` of length `1` being broadcast.


In [None]:
A[[0], [3, 1]]

In this example, after broadcasting, the result is the same as `A[[0,0], [3,1]]`.


:::

For higher dimensions, please read the documents to understand how it actually works. 

Note that `ndarray` can also be used as indexes and it behaves very similar to `list`.

:::
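
The following sketch repeats cases 1, 3 and 4 with `ndarrays` as indexes instead of `lists`, on the same `A`. The names `rows` and `cols` are only used here for illustration.

```python
import numpy as np

A = np.arange(16).reshape((4, 4))
rows = np.array([0, 1])
cols = np.array([3, 1])

# Case 1: one index array selects whole rows, so the result is 2d.
selected_rows = A[rows]               # rows 0 and 1, shape (2, 4)

# Case 3: two index arrays of the same length are paired up: (0, 3) and (1, 1).
picked = A[rows, cols]                # array([3, 5])

# Case 4: the length-1 list is broadcast, so this is the same as A[[0, 0], [3, 1]].
broadcasted = A[[0], [3, 1]]          # array([3, 1])
```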



::: {.callout-note collapse="true"}
# Boolean Indexing
An `ndarray` can also be indexed by an `ndarray` of boolean values of the same shape. The entries at the positions of `True` are selected.


::: {#exm-}

In [None]:
import numpy as np
a = np.arange(4)
b = np.array([True, True, False, True])
a

In [None]:
b

In [None]:
a[b]

:::

We can combine this with logical computations to select the elements we want and drop the ones we don't want.

::: {#exm-}
Please find the odd numbers in `arr`. 

In [None]:
arr = np.arange(10)
odd = (arr %2 == 1)
arr[odd] 

:::
:::
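
Several conditions can be combined before indexing. This is a minimal sketch: for boolean `ndarrays` we use `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses. A boolean index can also be used on the left side of an assignment.

```python
import numpy as np

arr = np.arange(10)

# Odd AND larger than 4.
odd_and_large = arr[(arr % 2 == 1) & (arr > 4)]   # array([5, 7, 9])

# NOT odd, i.e. even.
not_odd = arr[~(arr % 2 == 1)]                    # array([0, 2, 4, 6, 8])

# Assignment through a boolean index: cap every entry at 6.
clipped = arr.copy()
clipped[clipped > 6] = 6
```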





### [Copies and views](https://numpy.org/doc/stable/user/basics.copies.html)
A view of an `ndarray` is a way to get access to the array without copying the internal data. When we modify the data through a view, the original array as well as all other views of it are modified simultaneously.

::: {#exm-}

In [None]:
import numpy as np
arr = np.arange(10)
b = arr[5:8]
print('arr is {}'.format(arr))
print('b is {}'.format(b))

In [None]:
b[0] = -1
print('arr is {}'.format(arr))
print('b is {}'.format(b))

In [None]:
arr[6] = -2
print('arr is {}'.format(arr))
print('b is {}'.format(b))

:::



The default setting for copies and views is that basic indexing always makes views, and advanced indexing (e.g. boolean indexing, fancy indexing, etc.) makes copies. For other operations, you need to check the documents to know how they work. For example, `np.reshape()` creates a view where possible, and the `.flatten()` method always creates a copy.

The way to check whether something is a view or not is the attribute `.base`. If it is a view of another `ndarray`, you may see that `ndarray` in the attribute `.base`. If it is not a view, in other words, if it is a copy, the `.base` attribute is `None`.


::: {#exm-}

In [None]:
A = np.random.rand(3, 3)
A

In [None]:
A[1:2].base

Basic indexing creates views. In this example, the `base` of `A[1:2]` is `A`, which means that `A[1:2]` is a view of `A`.

In [None]:
print(A[[1]].base)

Advanced indexing creates copies. In this example, the `base` is `None`. So `A[[1]]` is NOT a view of anything.
:::
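
The same check works for operations other than indexing. The following sketch looks at `.reshape()` and `.flatten()`; it also uses `np.shares_memory()`, which is not discussed above, as a second way to confirm the result.

```python
import numpy as np

arr = np.arange(6)
reshaped = arr.reshape(2, 3)     # a view of arr in this case
flat = reshaped.flatten()        # always a copy

# Changing the view changes arr, changing the copy does not.
reshaped[0, 0] = 100
flat[1] = -1

print(reshaped.base is arr, flat.base is None)   # True True
print(np.shares_memory(arr, reshaped), np.shares_memory(arr, flat))
print(arr)                       # arr[0] is 100, arr[1] is still 1
```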




You may use the `.view()` or `.copy()` methods to make views or copies explicitly.

::: {#exm-}

In [None]:
arr = np.arange(10)
b = arr[5:8].copy()
print('arr is {}'.format(arr))
print('b is {}'.format(b))

In [None]:
b[0] = -1
print('arr is {}'.format(arr))
print('b is {}'.format(b))

In [None]:
arr[6] = -2
print('arr is {}'.format(arr))
print('b is {}'.format(b))

In [None]:
print('The base of b is {}'.format(b.base))

:::


## More functions

We introduce a few more advanced functions here. All the following functions are related to the indexes of entries.

::: {.callout-note collapse="true"}
# `np.where()`
`np.where()` is a very powerful function. The basic usage is `np.where(A satisfies condition)`. The output is a tuple of `ndarrays` of indexes, one for each axis of `A`, which locates the entries of `A` that satisfy the condition.

- When the `ndarray` in question is 1d, the output is a tuple which contains a single 1d `ndarray` of indexes.

In [None]:
import numpy as np
a = np.random.randint(10, size=10)
a

In [None]:
np.where(a%3 == 1)

Since the output is the `ndarray` of indexes, it is possible to directly use it to get those entries. 

In [None]:
a[np.where(a%3 == 1)]

Note that this is a fancy indexing, so the result is a copy.

- When the `ndarray` in question is 2d, the output is a tuple which consists of two 1d `ndarrays` of indexes. The two `ndarrays` are the arrays of the `axis 0` indexes and the `axis 1` indexes of the matching entries.

In [None]:
b = np.random.randint(10, size=(3, 3))
b

In [None]:
np.where(b%2 == 0)

Similar to the previous case, we may directly use fancy indexing to get an `ndarray` of the entries, and what we get is a copy.

In [None]:
b[np.where(b%2 == 0)]

- `np.where()` has two more optional arguments. 

In [None]:
#| eval: false
np.where(arr satisfies condition, x, y)

The output is an `ndarray` of the same shape as `arr`. For each entry, if it satisfies the `condition`, the entry is `x`. Otherwise it is `y`.


In [None]:
arr = np.arange(10)
np.where(arr<5, 0, 1)

`numpy` will go over all entries in `arr`, and check whether they are smaller than `5`. If an entry is smaller than `5`, it is set to `0`. If an entry is not smaller than `5`, it is set to `1`.

This is a very convenient way to recode or group the entries of an array.

:::
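
Since the examples above use random arrays, here is the same idea on a small fixed array, so that the outputs can be checked by hand. The array `b` is made up for illustration.

```python
import numpy as np

b = np.array([[1, 2], [4, 3]])

# One argument: a tuple with one 1d index array per axis.
rows, cols = np.where(b % 2 == 0)     # rows = [0, 1], cols = [1, 0]

# The even entries sit at (0, 1) and (1, 0).
evens = b[rows, cols]                 # array([2, 4])

# Three arguments: same shape as b, 0 where the entry is even, the entry itself otherwise.
replaced = np.where(b % 2 == 0, 0, b) # [[1, 0], [0, 3]]
```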


::: {.callout-note collapse="true"}
# `np.any()` and `np.all()`
Both of them take a boolean `ndarray`, which usually comes from checking whether each entry of an `ndarray` satisfies a certain condition. `np.any()` returns `True` if at least one entry satisfies the condition. `np.all()` returns `True` if all entries satisfy the condition.

Both of them also accept an `axis` argument. In this case the output is an `ndarray` which gives the results along the specified axis.

Please see the following examples.

In [None]:
a = np.array([[1,2],[2,4], [3,5]])
np.any(a%2==0)

In [None]:
np.any(a%2==0, axis=0)

In [None]:
np.any(a%2==0, axis=1)

In [None]:
np.all(a%2==0)

In [None]:
np.all(a%2==0, axis=0)

In [None]:
np.all(a%2==0, axis=1)

:::


::: {.callout-note collapse="true"}
# `np.argsort()`
`np.argsort()` returns the indices that would sort an array. Indexing the array with these indices results in a sorted `ndarray`, which is a copy of the original one since this indexing is fancy indexing.


In [None]:
import numpy as np
a = np.random.randint(100, size=10)
a

In [None]:
a[np.argsort(a)]

:::
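
The indices from `np.argsort()` are most useful when another array has to be reordered in the same way. In this sketch the arrays `scores` and `names` are made up for illustration.

```python
import numpy as np

scores = np.array([30, 10, 20])
names = np.array(['a', 'b', 'c'])

order = np.argsort(scores)            # array([1, 2, 0])

ascending = scores[order]             # array([10, 20, 30])
names_by_score = names[order]         # array(['b', 'c', 'a'])

# Reversing the indices gives the descending order.
descending = scores[order[::-1]]      # array([30, 20, 10])
```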

### Some examples


::: {#exm-}
Get the positions where elements of `a` and `b` match.

In [None]:
a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])

np.where(a == b)

:::

::: {#exm-}
Where `a` and `b` match, take twice the entry of `a`. Elsewhere, take the entry of `b` plus `1`.

In [None]:
a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])

np.where(a == b, a*2, b+1)

:::





::: {#exm-}
## Playing with axis
Please think through the example and understand what actually happens in each case.

In [None]:
import numpy as np
a = np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
a

In [None]:
np.any(a==1, axis=0)

In [None]:
np.any(a==1, axis=1)

In [None]:
np.any(a==1, axis=2)

In [None]:
np.any(a==2, axis=0)

In [None]:
np.any(a==2, axis=1)

In [None]:
np.any(a==2, axis=2)

In [None]:
np.any(a==5, axis=0)

In [None]:
np.any(a==5, axis=1)

In [None]:
np.any(a==5, axis=2)

:::
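
One way to organize the thinking is to look at shapes first: applying `np.any()` along `axis=k` removes axis `k` from the shape. The following sketch uses a separate array of shape `(2, 3, 4)`, made up for illustration, so that the three axes have different lengths, and checks the `axis=1` case against an explicit loop.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)
mask = (a % 5 == 0)

# Axis k disappears from the shape (2, 3, 4).
shapes = [np.any(mask, axis=k).shape for k in range(3)]   # [(3, 4), (2, 4), (2, 3)]

# axis=1 by hand: fix the indexes i and k, and let the middle index j vary.
manual = np.array([[any(mask[i, j, k] for j in range(3)) for k in range(4)]
                   for i in range(2)])
agree = np.array_equal(manual, np.any(mask, axis=1))
print(shapes, agree)
```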


## Projects Examples

### Toss a coin
Tossing a coin can be modeled by picking a random number between `0` and `1`. If the number is `<0.5`, we call it `H` (head). If the number is `>=0.5`, we call it `T` (tail). 


::: {.callout-tip collapse="true"}

In [None]:
import numpy as np

def tossacoin():
    r = np.random.rand()
    if r < 0.5:
        result = 'H'
    else:
        result = 'T'
    return result

:::


If we want to do it 10 times, we may use a `for` loop.


::: {.callout-tip collapse="true"}

In [None]:
results = []
for i in range(10):
    results.append(tossacoin())

:::

The above code can be written in terms of list comprehension.

::: {.callout-tip collapse="true"}

In [None]:
results = [tossacoin() for _ in range(10)]

Note that since the loop parameter `i` is actually not used in the loop body, we could replace it by `_` to indicate that it is not used.
:::

Now we would like to rewrite this code using `np.where()`. Consider all tossing actions simultaneously. So we generate an `ndarray` of random numbers to model all tossing actions.

::: {.callout-tip collapse="true"}

In [None]:
toss = np.random.rand(10)

:::

Then we use `np.where()` to check whether each of them is `H` or `T`.

::: {.callout-tip collapse="true"}

In [None]:
results = np.where(toss<0.5, 'H', 'T')

:::

Since now `results` is an `ndarray`, we could directly use it to count the number of `H`. 

::: {.callout-tip collapse="true"}

In [None]:
(results=='H').sum()

:::
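
The steps above can be put together into one reproducible sketch. Here we swap `np.random.rand` for a seeded generator from `np.random.default_rng`, so that the same tosses come out on every run; the seed `0` and the number of tosses `1000` are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator makes the experiment reproducible.
rng = np.random.default_rng(seed=0)

toss = rng.random(1000)                     # 1000 random numbers in [0, 1)
results = np.where(toss < 0.5, 'H', 'T')    # same rule as tossacoin()

heads = (results == 'H').sum()
tails = (results == 'T').sum()
head_ratio = heads / results.size           # expected to be close to 0.5

print(heads, tails, head_ratio)
```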

### Random walks
Adam walks randomly along the axis. He starts from `0`. At every step he has equal probability to go left or right. Please simulate this process.

Use `choices` to record the choice of Adam at each step. We may generate a random array where `0` represents left and `1` represents right.

Use `positions` to record the position of Adam at each step. Using `choices`, the position changes by `+1` if we see a `1` and by `-1` if we see a `0`. So the most elegant way to perform this is to

1. Convert `choices` from `{0, 1}` to `{-1, 1}`.
2. To record the starting position, we attach `0` to the beginning of the new `choices`.
3. Apply `.cumsum()` to `choices` to get `positions`.

::: {.callout-tip collapse="true"}

In [None]:
import numpy as np

step = 30
choices = np.random.randint(2, size=step)
choices = choices * 2 - 1
choices = np.concatenate(([0], choices))
positions = choices.cumsum()

import matplotlib.pyplot as plt
plt.plot(positions)

:::

### Many random walks
We mainly use `numpy.ndarray` to write the code in the previous example. The best part here is that it can be easily generalized to many random walks.

Still keep `choices` and `positions` in mind. Now we would like to deal with multiple people simultaneously. Each row represents one person's random walk. All the formulas stay the same. We only need to update the dimension setting in the previous code.

- Update `size` in `np.random.randint`.
- Update `[0]` to `np.zeros((N, 1))` in `concatenate`.
- For `cumsum` and `concatenate`, add `axis=1` to indicate that we perform the operations along `axis 1`.
- We plot each row in the same figure. `plt.legend` is used to show the label for each line.

::: {.callout-tip collapse="true"}

In [None]:
import numpy as np

step = 30
N = 3
choices = np.random.randint(2, size=(N, step))
choices = choices * 2 - 1
choices = np.concatenate((np.zeros((N, 1)), choices), axis=1)
positions = choices.cumsum(axis=1)

import matplotlib.pyplot as plt
for row in positions:
    plt.plot(row)
plt.legend([1, 2, 3])

:::
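
Here is a variation of the same simulation without plotting, so that the structure of `positions` can be checked directly. It is only a sketch under a few assumptions: a seeded generator from `np.random.default_rng` is used for reproducibility, the steps `-1` and `1` are drawn directly with `rng.choice`, and the starting column is created with `dtype=int` so that `positions` stays an integer array.

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # arbitrary seed, for reproducibility only

step = 30
N = 3

# Each row is one person, each column is one step of -1 or +1.
choices = rng.choice([-1, 1], size=(N, step))

# Attach the starting position 0 as the first column, then accumulate along axis 1.
start = np.zeros((N, 1), dtype=int)
positions = np.concatenate((start, choices), axis=1).cumsum(axis=1)

print(positions.shape)   # (3, 31): the starting position plus 30 steps
```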

### Analyze `positions`
We play with the numpy array `positions` to get some information about the three random walks generated in the previous example.

- The maximal position:

::: {.callout-tip collapse="true"}

In [None]:
positions.max()

:::
- The maximal position for each one:

::: {.callout-tip collapse="true"}

In [None]:
positions.max(axis=1)

:::
- The maximal position across all three for each step:


::: {.callout-tip collapse="true"}

In [None]:
positions.max(axis=0)

:::
- Check, for each one, whether they ever got to position 3 or beyond:


::: {.callout-tip collapse="true"}

In [None]:
(positions>=3).any(axis=1)

:::

- The number of people who ever got to position 3 or beyond:


::: {.callout-tip collapse="true"}

In [None]:
(positions>=3).any(axis=1).sum()

:::

- The step at which each one first reaches their rightmost position:

::: {.callout-tip collapse="true"}

In [None]:
positions.argmax(axis=1)

:::
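Since `positions` is random, the outputs above change from run to run. As a rough illustration, the same calls can be applied to a small hand-made array whose results are easy to verify by eye. The data below is made up for this sketch and is not generated by the random walk code.

```python
import numpy as np

# Hand-made walks (hypothetical data): 2 people, start plus 5 steps each.
positions = np.array([[0, 1, 2, 3, 2, 3],
                      [0, -1, 0, 1, 0, -1]])

print(positions.max())                     # 3
print(positions.max(axis=1))               # [3 1]
print(positions.max(axis=0))               # [0 1 2 3 2 3]
print((positions >= 3).any(axis=1))        # [ True False]
print((positions >= 3).any(axis=1).sum())  # 1
print(positions.argmax(axis=1))            # [3 3]
```

Note that the first person reaches the maximum `3` at both step `3` and step `5`. `argmax` reports only the first occurrence.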


## Exercises

Many exercises are from [@Pra2018].

::: {#exr-}

1. Create a $3\times3$ matrix with values ranging from 2 to 10.
2. Create a $10\times10$ 2D-array with `1` on the border and `0` inside.
3. Create a 2D array of shape `5x3` to contain random decimal numbers between `5` and `10`.
4. Create a 1D zero `ndarray` of size 10 and update the sixth value to 11.
:::


::: {#exr-}
Write a function to reverse a 1d `ndarray` (first element becomes last).
:::



::: {#exr-}
Given `a = np.array([1,2,3])`, please get the desired output `array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])`. You may use `np.repeat()` and `np.tile()`.
:::
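As a hint for the exercise above, the following small sketch shows the difference between `np.repeat()` and `np.tile()`. The array `v` is a made-up example, not the one from the exercise.

```python
import numpy as np

v = np.array([7, 8])  # hypothetical array, not the one from the exercise

# repeat: each entry is repeated in place.
print(np.repeat(v, 2))  # [7 7 8 8]

# tile: the whole array is repeated as a block.
print(np.tile(v, 2))    # [7 8 7 8]
```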


::: {#exr-}
## Compare two `ndarrays`
Consider two `ndarray`s `x` and `y` of the same length. Compare them entry by entry. We would like to know the percentage of the entries that are the same.

Please wrap your code into a function that returns the above percentage.
:::





::: {#exr-}
## Manipulate matrices
Please finish the following tasks. Let `arr = np.arange(9).reshape(3,3)`.

1. Swap rows `1` and `2` in the array `arr`.
2. Reverse the rows of a 2D array `arr`.
3. Reverse the columns of a 2D array `arr`.
:::



::: {#exr-}
Consider a 2d `ndarray`.

In [None]:
arr = np.random.rand(4, 4)

1. Please compute the mean of each column.
2. Please compute the sum of each row.
3. Please compute the maximum of the whole array.
:::


::: {#exr-}
## Adding one axis
Please download [this file](assests/img/20220824224849.png).   

1. Please use `matplotlib.pyplot.imread()` to read the file as a 3d `ndarray`. You may need to use the `matplotlib` package. It will be introduced later in this course. You may go to its [homepage](https://matplotlib.org/stable/users/getting_started/) to install it.
2. Check the shape of the array.
3. Add one additional axis to it as `axis 0` to make it into a 4d `ndarray`. 
:::


::: {#exr-}
## Understanding colored pictures
Please download [this file](assests/img/20220824224849.png) and use `matplotlib.pyplot.imread()` to read the file as a 3d `ndarray`. You may need to use the `matplotlib` package. It will be introduced later in this course. You may go to its [homepage](https://matplotlib.org/stable/users/getting_started/) to install it.

A colored picture is stored as a 3d `ndarray`. `axis 0` and `axis 1` are the vertical and horizontal coordinates and help us locate a specific point in the picture. `axis 2` is an array with `3` elements. It is the color vector which represents the three principal colors: red, green and blue.

1. Find the maximum and minimum of the values in the array.
2. Compute the mean of the three colors at each point to get a 2d `ndarray` where each entry represents the mean of the three colors at each point of the picture.
:::





::: {#exr-}
## Queries

1. Get all items between `5` and `10` from an array `a = np.array([2, 6, 1, 9, 10, 3, 27])`.
2. Consider `x = np.array([1, 2, 1, 1, 3, 4, 3, 1, 1, 2, 1, 1, 2])`. Please find the index of the 5th repetition of the number `1` in `x`.
:::


::: {#exr-}
Use the following code to get the dataset `iris` and three related `np.array`: `iris_1d`, `iris_2d` and `sepallength`. 

In [None]:
import numpy as np

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None, encoding=None)
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', encoding=None,
                        usecols=[0, 1, 2, 3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0],
                            encoding=None)

1. `iris_1d` is a 1D numpy array in which each item is a tuple. Please construct a new 1D numpy array in which each item is the last component of the corresponding tuple in `iris_1d`.

2. Convert `iris_1d` into a 2D array `iris_2d` by omitting the last field of each item.
3. `np.isnan()` is a function to check whether each entry of an `ndarray` is `nan` or not. Please use `np.isnan()` as well as `np.where()` to find all `nan` entries in `iris_2d`.
4. Select the rows of `iris_2d` that do not have any `nan` value.
5. Replace all `nan` with `0` in `iris_2d`.
:::
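As a hint for the exercise above, the following sketch shows how `np.isnan()` and `np.where()` work together on a tiny made-up array. The array `m` is hypothetical and much smaller than `iris_2d`.

```python
import numpy as np

# Hypothetical 2x2 array with a single nan entry at row 0, column 1.
m = np.array([[1.0, np.nan],
              [3.0, 4.0]])

# Boolean array of the same shape, True exactly where the entry is nan.
mask = np.isnan(m)

# With a single boolean argument, np.where returns one index array per axis.
rows, cols = np.where(mask)
print(rows, cols)  # [0] [1]
```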






::: {#exr-}
## Random
Please finish the following tasks. 

1. Use the package `np.random` to flip a coin 100 times and record the result in a list `coin`.
2. Assume that the coin is not fair, and the probability to get `H` is `p`. Write code that flips the coin 100 times for a given parameter `p` and records the result in a list `coin`. You may use `p=.4` as a first choice.
3. For each list `coin` created above, write code to find the longest `H` streak. We only need the largest number of consecutive `H` obtained during these 100 tosses. It is NOT necessary to know when the streak starts.
:::




::: {#exr-}
## Bins
Please read the [documentation of `np.digitize()`](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html#numpy-digitize), and use it to do the following task.

Set the following bins:

- Less than `3`: `small`
- `3-5`: `medium`
- Bigger than `5`: `large`

Please transform the following data `iris_2c` into texts using the given bins.

In [None]:
import numpy as np
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2c = np.genfromtxt(url, delimiter=',', dtype='object')[:, 2].astype('float')

:::
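As a hint for the exercise above, the following sketch shows what `np.digitize()` returns on made-up numbers. Both the data `x` and the bin edges `edges` are hypothetical and differ from the ones in the exercise. With the default `right=False`, the returned index `i` satisfies `edges[i-1] <= x < edges[i]`.

```python
import numpy as np

x = np.array([5, 10, 15, 25])  # hypothetical data
edges = np.array([10, 20])     # hypothetical bin edges

# 5 is below 10 -> 0; 10 and 15 are in [10, 20) -> 1; 25 is at least 20 -> 2.
idx = np.digitize(x, edges)
print(idx)  # [0 1 1 2]
```

The result is an array of integer bin indices, which can then be translated into text labels.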



::: {#exr-}
Consider a 2d `ndarray` `a`. 

In [None]:
import numpy as np
a = np.random.rand(5, 5)

1. Please sort it along the 3rd column.
2. Please sort it along the 2nd row.

You may use `np.argsort()` for the problem.
:::
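As a hint for the exercise above, the following sketch shows what `np.argsort()` returns for a 1d array. The array `v` is a made-up example.

```python
import numpy as np

v = np.array([30, 10, 20])  # hypothetical data

# argsort returns the indices that would sort the array.
order = np.argsort(v)
print(order)     # [1 2 0]

# Indexing with these indices gives the sorted array.
print(v[order])  # [10 20 30]
```

The same index array can be used to reorder the rows or the columns of a 2d array.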



::: {#exr-}
## One-hot vector
Compute the one-hot encodings of a given array. You may use the following array as a test example. In this example, there are `3` labels. So the one-hot vectors are 3-dimensional vectors.

For more information about one-hot encodings, you may check the [Wiki page](https://en.wikipedia.org/wiki/One-hot#Machine_learning_and_statistics). You are not allowed to use packages that can directly compute the one-hot encodings for this problem.

In [None]:
import numpy as np
arr = np.random.randint(1,4, size=6)

:::



::: {#exr-}
Consider `arr = np.arange(8)`. A stride of `arr` with a window length of `4` and strides of `2` is a 2d `np.array` that looks like:

In [None]:
#| echo: false
arr = np.arange(8)
np.array([arr[i:i+4] for i in range(0, 8, 2) if i+4<=8])

Please write a function that takes `arr`, `length` and `strides` as inputs and returns the corresponding stride as output.
:::
