In [None]:
from __future__ import print_function

# Comparisons, Masks, and Boolean Logic

This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays.
Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold.
In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.

## Comparison Operators as ufuncs

NumPy also implements comparison operators such as ``<`` (less than) and ``>`` (greater than) as element-wise ufuncs.
The result of these comparison operators is always an array with a Boolean data type.
All six of the standard comparison operations are available:

In [1]:
import numpy as np

In [2]:
x = np.array([1, 2, 3, 4, 5, 6])

In [3]:
x < 3  # less than

array([ True,  True, False, False, False, False])

In [4]:
x > 3  # greater than

array([False, False, False,  True,  True,  True])

In [5]:
x <= 3  # less than or equal

array([ True,  True,  True, False, False, False])

In [6]:
x >= 3  # greater than or equal

array([False, False,  True,  True,  True,  True])

In [7]:
x != 3  # not equal

array([ True,  True, False,  True,  True,  True])

In [8]:
x == 3  # equal

array([False, False,  True, False, False, False])

It is also possible to do an element-wise comparison of two arrays, and to include compound expressions:

In [9]:
(2 * x) == (x ** 2)

array([False,  True, False, False, False, False])

As in the case of arithmetic operators, the comparison operators are implemented as ufuncs in NumPy; for example, when you write ``x < 3``, internally NumPy uses ``np.less(x, 3)``.
A summary of the comparison operators and their equivalent ufuncs is shown here:

| Operator	    | Equivalent ufunc    || Operator	   | Equivalent ufunc    |
|---------------|---------------------||---------------|---------------------|
|``==``         |``np.equal``         ||``!=``         |``np.not_equal``     |
|``<``          |``np.less``          ||``<=``         |``np.less_equal``    |
|``>``          |``np.greater``       ||``>=``         |``np.greater_equal`` |

In each case, the result is a Boolean array, and NumPy provides a number of straightforward patterns for working with these Boolean results.
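As a quick check of the operator/ufunc equivalence described above, the following minimal sketch compares ``x < 3`` with an explicit call to ``np.less``. The variable names ``mask_operator`` and ``mask_ufunc`` are only illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

# The operator form and the explicit ufunc call should give the same mask.
mask_operator = x < 3
mask_ufunc = np.less(x, 3)

print(mask_operator.dtype)                        # bool
print(np.array_equal(mask_operator, mask_ufunc))  # True
```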

## Working with Boolean Arrays

Given a Boolean array, there are a host of useful operations you can do.
We'll work with ``x``, the one-dimensional array we created earlier.

In [10]:
print(x)

[1 2 3 4 5 6]


### Counting entries

To count the number of ``True`` entries in a Boolean array, ``np.count_nonzero`` is useful:

In [11]:
# how many values less than 6?
np.count_nonzero(x < 6)

5

We see that there are five array entries that are less than 6.
Another way to get at this information is to use ``np.sum``; in this case, ``False`` is interpreted as ``0``, and ``True`` is interpreted as ``1``:

In [12]:
np.sum(x < 6)

5

The benefit of ``sum()`` is that, as with other NumPy aggregation functions, the summation can also be done along rows or columns:

In [13]:
x = np.arange(6).reshape(2,3)
x

array([[0, 1, 2],
       [3, 4, 5]])

In [14]:
# how many values less than 6 in each row?
np.sum(x < 6, axis=1)

array([3, 3])

This counts the number of values less than 6 in each row of the matrix.
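The same per-axis counting can also be sketched with ``np.count_nonzero``, which accepts an ``axis`` argument in reasonably recent NumPy releases. The array name ``grid`` and the threshold of 4 are chosen here only for illustration.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # same values as the 2-D x above

# Count entries below 4 along each axis in two equivalent ways.
per_row_sum = np.sum(grid < 4, axis=1)
per_row_count = np.count_nonzero(grid < 4, axis=1)
per_col_sum = np.sum(grid < 4, axis=0)

print(per_row_sum.tolist())    # [3, 1]
print(per_row_count.tolist())  # [3, 1]
print(per_col_sum.tolist())    # [2, 1, 1]
```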

If we're interested in quickly checking whether any or all the values are true, we can use (you guessed it) ``np.any`` or ``np.all``:

In [15]:
# are there any values greater than 8?
np.any(x > 8)

False

In [16]:
# are there any values less than zero?
np.any(x < 0)

False

In [17]:
# are all values less than 10?
np.all(x < 10)

True

In [18]:
# are all values equal to 6?
np.all(x == 6)

False

``np.all`` and ``np.any`` can be used along particular axes as well. For example:

In [19]:
# are all values in each row less than 8?
np.all(x < 8, axis=1)

array([ True,  True])

Finally, a quick warning: as mentioned before, Python has built-in ``sum()``, ``any()``, and ``all()`` functions. These have a different syntax than the NumPy versions, and in particular will fail or produce unintended results when used on multidimensional arrays. Be sure that you are using ``np.sum()``, ``np.any()``, and ``np.all()`` for these examples!
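To make the warning concrete, here is a small sketch of what the built-in functions do with a two-dimensional mask. The names ``grid`` and ``mask`` are illustrative, not part of the dataset used elsewhere.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # illustrative 2-D array
mask = grid < 4

total_np = np.sum(mask)      # NumPy sums over every element
total_builtin = sum(mask)    # built-in sum adds the rows together instead

try:
    any(mask)                # built-in any() calls bool() on each row
    builtin_any_failed = False
except ValueError:
    builtin_any_failed = True

print(total_np)                 # 4
print(total_builtin.tolist())   # [2, 1, 1]
print(builtin_any_failed)       # True
```

The built-in ``sum`` silently returns per-column counts rather than a single total, which is the kind of unintended result that is easy to miss.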

<div style="background-color:yellow; padding: 10px"><h3><span class="fa fa-flash"></span> Quick Exercise:</h3></div>

Find the `sum` of clear days in `curah-hujan-amfoang.csv`. To load data with NumPy, you can use `numpy.genfromtxt` or `numpy.loadtxt`. The difference is that `np.genfromtxt` can read CSV files with missing data and provides parameters such as `missing_values` and `filling_values` to handle the missing entries. The data from the previous recipe can be loaded in one step with

```
data = np.loadtxt(data_path, delimiter=',', skiprows=1)
```
or with the more powerful `numpy.genfromtxt`:
```
data = np.genfromtxt(data_path, delimiter=',', names=True)
```
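where the `names=True` argument tells NumPy to read the header row, which lets us access the columns by their header names. Note that the result is then a structured array, so columns are selected by name rather than by position.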
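The two loading styles can be tried without the real file. The sketch below uses an in-memory CSV whose column names (`day`, `rain`, `humidity`) and values are invented for illustration, and it assumes that a "clear day" means a day with zero rain; the actual file may be laid out differently.

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV file: names and values are invented.
csv_text = """day,rain,humidity
1,0.0,80.1
2,12.5,
3,0.0,91.0
"""

# Plain 2-D float array; the empty humidity field becomes nan.
table = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

# Structured array; columns are accessed by header name.
named = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True)

# Assumed definition of a clear day: no rain recorded.
clear_days = np.sum(named['rain'] == 0)

print(table.shape)             # (3, 3)
print(named.dtype.names)       # ('day', 'rain', 'humidity')
print(np.isnan(table[1, 2]))   # True
print(clear_days)              # 2
```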

<hr>

### Boolean operators

We've already seen how we might count, say, all days with humidity less than 85%, or all days with humidity greater than 50%.
But what if we want to know about all days with humidity less than 85% and greater than 50%?
This is accomplished through Python's *bitwise logic operators*, ``&``, ``|``, ``^``, and ``~``.
Like with the standard arithmetic operators, NumPy overloads these as ufuncs which work element-wise on (usually Boolean) arrays.

For example, we can address this sort of compound question as follows:

In [24]:
hum = data[:,2]

In [25]:
hum

array([        nan, 80.14814815, 76.43055556, 80.00694444, 81.46527778,
       78.96527778, 82.15972222, 81.40277778, 80.7       , 96.07843137,
       90.22222222, 88.50694444, 91.97916667, 96.90277778, 97.29166667,
       94.82638889, 94.16666667, 91.09722222, 93.86111111, 94.77083333,
       88.40972222, 83.15277778, 85.33333333, 85.27083333, 87.91666667,
       87.85416667, 89.88194444, 87.53846154])

In [26]:
np.sum((hum > 50) & (hum < 85))

9

So we see that there are 9 days with mean humidity between 50% and 85%.

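Note that the parentheses here are important: because of operator precedence rules, ``&`` is evaluated before the comparison operators, so with the parentheses removed the expression would be parsed as ``hum > (50 & hum) < 85``, which raises an error.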
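As a sketch of why the parentheses matter, the example below evaluates the compound condition with and without them. The array ``hum_demo`` holds invented sample values rather than the real humidity column.

```python
import numpy as np

hum_demo = np.array([np.nan, 80.1, 76.4, 96.1, 85.3])  # invented values

# With parentheses, each comparison is evaluated first, then combined.
with_parens = (hum_demo > 50) & (hum_demo < 85)

# Without parentheses, Python evaluates `50 & hum_demo` first, which
# is not defined for floating-point arrays.
try:
    hum_demo > 50 & hum_demo < 85
    unparenthesized_failed = False
except (TypeError, ValueError):
    unparenthesized_failed = True

print(np.sum(with_parens))       # 2
print(unparenthesized_failed)    # True
```

Note also that the ``nan`` entry fails both comparisons, so it is never counted.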

Using the equivalence of *A AND B* and *NOT (NOT A OR NOT B)* (which you may remember if you've taken an introductory logic course), we can compute the same result in a different manner:

In [27]:
np.sum(~( ~(hum > 50) | ~(hum < 85)))

9

Combining comparison operators and Boolean operators on arrays can lead to a wide range of efficient logical operations.

The following table summarizes the bitwise Boolean operators and their equivalent ufuncs:

| Operator	    | Equivalent ufunc    || Operator	    | Equivalent ufunc    |
|---------------|---------------------||---------------|---------------------|
|``&``          |``np.bitwise_and``   ||&#124;         |``np.bitwise_or``    |
|``^``          |``np.bitwise_xor``   ||``~``          |``np.bitwise_not``   |
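The correspondence in the table can be checked directly. This is a minimal sketch using two small Boolean arrays, ``a`` and ``b``, that are invented for the purpose. The last line shows why these operators are best reserved for Boolean arrays: on integers, ``~`` flips bits rather than truth values.

```python
import numpy as np

a = np.array([True, True, False, False])
b = np.array([True, False, True, False])

# Each operator should match its ufunc counterpart element by element.
and_matches = np.array_equal(a & b, np.bitwise_and(a, b))
or_matches = np.array_equal(a | b, np.bitwise_or(a, b))
xor_matches = np.array_equal(a ^ b, np.bitwise_xor(a, b))
not_matches = np.array_equal(~a, np.bitwise_not(a))

print(and_matches, or_matches, xor_matches, not_matches)  # True True True True

# On integer arrays, ~ is a bitwise complement, not a logical NOT.
int_flip = ~np.array([0, 1])
print(int_flip.tolist())   # [-1, -2]
```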

Using these tools, we can start to answer the types of questions we have about our weather data by combining masking with aggregations.
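A few such combinations are sketched below on a small invented dataset. The arrays ``rain_demo`` and ``hum_demo`` and the thresholds are assumptions made for illustration and do not come from the CSV file.

```python
import numpy as np

# Invented daily values, one entry per day.
rain_demo = np.array([0.0, 12.5, 0.0, 3.2, 40.1, 0.0])
hum_demo = np.array([78.0, 91.5, 80.2, 88.0, 96.3, 83.1])

n_dry = np.sum(rain_demo == 0)
n_wet = np.sum(rain_demo > 0)
n_dry_and_humid = np.sum((rain_demo == 0) & (hum_demo > 80))
n_heavy_or_very_humid = np.sum((rain_demo > 20) | (hum_demo > 90))

print("Days without rain:            ", n_dry)                  # 3
print("Days with rain:               ", n_wet)                  # 3
print("Dry days with humidity > 80:  ", n_dry_and_humid)        # 2
print("Rain > 20 or humidity > 90:   ", n_heavy_or_very_humid)  # 2
```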

## Boolean Arrays as Masks

In the preceding section we looked at aggregates computed directly on Boolean arrays.
A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.
Returning to our ``x`` array from before, suppose we want an array of all values in the array that are less than, say, 5:

In [28]:
x

array([[0, 1, 2],
       [3, 4, 5]])

We can obtain a Boolean array for this condition easily, as we've already seen:

In [29]:
x < 5

array([[ True,  True,  True],
       [ True,  True, False]])

Now to *select* these values from the array, we can simply index on this Boolean array; this is known as a *masking* operation:

In [30]:
x[x < 5]

array([0, 1, 2, 3, 4])

What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all the values in positions at which the mask array is ``True``.

We are then free to operate on these values as we wish.
For example, we can compute some relevant statistics on our rain data:

In [31]:
# construct a mask of all rainy days
rainrate = data[:,1]
rainy = (rainrate > 0)

print("Median precip on rainy days:   ",
      np.median(rainrate[rainy]))


Median precip on rainy days:    17.7292


By combining Boolean operations, masking operations, and aggregates, we can very quickly answer these sorts of questions for our dataset.
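Masks can be used for modifying values as well as for selecting them. The sketch below, on an illustrative array named ``grid``, shows that selecting with a mask returns a copy, while assigning through a mask changes the original array in place.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)

# Selecting with a mask returns a new array, so editing it
# leaves the original untouched.
small = grid[grid < 3]
small[:] = -1

# Assigning through a mask modifies the original array in place.
grid[grid >= 4] = 0

print(small.tolist())   # [-1, -1, -1]
print(grid.tolist())    # [[0, 1, 2], [3, 0, 0]]
```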

# Fancy Indexing

In the previous sections, we saw how to access and modify portions of arrays using simple indices (e.g., ``arr[0]``), slices (e.g., ``arr[:5]``), and Boolean masks (e.g., ``arr[arr > 0]``).
In this section, we'll look at another style of array indexing, known as *fancy indexing*.
Fancy indexing is like the simple indexing we've already seen, but we pass arrays of indices in place of single scalars.
This allows us to very quickly access and modify complicated subsets of an array's values.

## Exploring Fancy Indexing

Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at once.
For example, consider the following array:

In [32]:
import numpy as np
rand = np.random.RandomState(42)

x = rand.randint(100, size=10)
print(x)

[51 92 14 71 60 20 82 86 74 74]


Suppose we want to access three different elements. We could do it like this:

In [33]:
[x[3], x[7], x[4]]

[71, 86, 60]

Alternatively, we can pass a single list or array of indices to obtain the same result:

In [34]:
ind = [3, 7, 4]
x[ind]

array([71, 86, 60])

When using fancy indexing, the shape of the result reflects the shape of the *index arrays* rather than the shape of the *array being indexed*:

In [35]:
ind = np.array([[3, 7],
                [4, 5]])
x[ind]

array([[71, 86],
       [60, 20]])
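The shape rule can be checked explicitly. In this sketch, ``values`` repeats the numbers printed above so that the block runs on its own, and the index arrays are illustrative.

```python
import numpy as np

values = np.array([51, 92, 14, 71, 60, 20, 82, 86, 74, 74])

idx_1d = np.array([3, 7, 4])
idx_2d = np.array([[3, 7],
                   [4, 5]])

# The result takes the shape of the index array, not of `values`.
print(values[idx_1d].shape)   # (3,)
print(values[idx_2d].shape)   # (2, 2)

# Indices may repeat; each occurrence is looked up independently.
repeated = values[[0, 0, 9]]
print(repeated.tolist())      # [51, 51, 74]
```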

Fancy indexing also works in multiple dimensions. Consider the following array:

In [36]:
X = np.arange(12).reshape((3, 4))
X

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Like with standard indexing, the first index refers to the row, and the second to the column:

In [37]:
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]

array([ 2,  5, 11])

Notice that the first value in the result is ``X[0, 2]``, the second is ``X[1, 1]``, and the third is ``X[2, 3]``.
The pairing of indices in fancy indexing follows all the broadcasting rules covered in the earlier discussion of broadcasting.
So, for example, if we combine a column vector and a row vector within the indices, we get a two-dimensional result:

In [38]:
X[row[:, np.newaxis], col]

array([[ 2,  1,  3],
       [ 6,  5,  7],
       [10,  9, 11]])

It is always important to remember with fancy indexing that the return value reflects the *broadcasted shape of the indices*, rather than the shape of the array being indexed.
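One way to see the broadcasted shape is to ask NumPy for it directly and compare the fancy-indexed result with an explicit loop. This is a minimal sketch; ``grid`` mirrors the ``X`` array above, and ``expected`` is a helper name introduced only for the comparison.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# Shape produced by broadcasting a (3, 1) column against a (3,) row.
index_shape = np.broadcast(row[:, np.newaxis], col).shape

result = grid[row[:, np.newaxis], col]

# The same pairing written as explicit loops: every row with every column.
expected = np.array([[grid[r, c] for c in col] for r in row])

print(index_shape)                       # (3, 3)
print(result.shape)                      # (3, 3)
print(np.array_equal(result, expected))  # True
```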

## Combined Indexing

For even more powerful operations, fancy indexing can be combined with the other indexing schemes we've seen:

In [39]:
print(X)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


We can combine fancy and simple indices:

In [40]:
X[2, [2, 0, 1]]

array([10,  8,  9])

We can also combine fancy indexing with slicing:

In [41]:
X[1:, [2, 0, 1]]

array([[ 6,  4,  5],
       [10,  8,  9]])

And we can combine fancy indexing with masking:

In [42]:
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]

array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])
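When a Boolean mask is mixed with an integer index array like this, it may help to think of the mask as shorthand for the positions of its ``True`` entries. The sketch below checks that reading with ``np.flatnonzero``; the names ``grid`` and ``cols`` are illustrative.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
row = np.array([0, 1, 2])
mask = np.array([True, False, True, False])

# Positions where the mask is True.
cols = np.flatnonzero(mask)

via_mask = grid[row[:, np.newaxis], mask]
via_indices = grid[row[:, np.newaxis], cols]

print(cols.tolist())                          # [0, 2]
print(np.array_equal(via_mask, via_indices))  # True
```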

All of these indexing options combined lead to a very flexible set of operations for accessing and modifying array values.
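As a brief sketch of the modifying side, fancy indexing can also appear on the left of an assignment. The example uses an invented array ``values`` and also contrasts the copy returned by fancy indexing with the view returned by a plain slice.

```python
import numpy as np

values = np.arange(10)
idx = np.array([2, 1, 8, 4])

# Assign to several positions at once, then update them in place.
values[idx] = 99
values[idx] -= 10
print(values.tolist())   # [0, 89, 89, 3, 89, 5, 6, 7, 89, 9]

# Fancy indexing returns a copy; basic slicing returns a view.
picked = values[idx]
fancy_shares = np.shares_memory(picked, values)
slice_shares = np.shares_memory(values[1:5], values)
print(fancy_shares)      # False
print(slice_shares)      # True
```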