# Tutorial 2.3: NumPy Array Comparisons & Masking
Python for Data Analytics | Module 2  
Professor James Ng

In a previous tutorial, we covered how to retrieve elements out of `ndarray` objects. In this tutorial we will expand upon that by learning about array comparisons and how you can use them to selectively pull data out of an array.

To get started, I'm going to create a couple of NumPy arrays from our `chicago-employees.csv` data set. This data set holds a number of data points on all public employees in the city of Chicago.

As before, don't worry about how I'm creating these as that will be covered in later modules.

In [None]:
# import numpy and pandas (for the data load)
import numpy as np
import pandas as pd

In [None]:
# Create a directory to hold our data sets and download them
!mkdir -p data-sets
!wget --show-progress -O data-sets/chicago-employees.csv https://osf.io/8svw4/download

In [None]:
# Load the Chicago government employees data set.
chicago_employees = pd.read_csv('data-sets/chicago-employees.csv')

In [None]:
# A quick preview of the sort of data in the set.
chicago_employees.head()

In [None]:
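# Some data cleaning stuff happening here.
# We'll learn about it in Module 3.
# Keep only the rows that have an annual salary, then strip the "$" and ","
# characters so the remaining text can be converted to numbers.
salary_text = chicago_employees['Annual Salary'][chicago_employees['Annual Salary'].notnull()]
employee_salaries = np.array(pd.to_numeric(salary_text.str.replace(r'[$,]', '', regex=True)))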

In [None]:
employee_departments = np.array(chicago_employees['Department'])

In [None]:
employee_titles = np.array(chicago_employees['Job Titles'])

In [None]:
employee_names = np.array(chicago_employees['Name'])

So, now you have four arrays:
* `employee_salaries`: Holds annual salary information for each employee. The datatype of this array is `float`.
* `employee_departments`: Holds information on what department each employee works for.
* `employee_titles`: Holds information on each employee's title.
* `employee_names`: Holds each employee's name.

## Available NumPy Comparison Functions
You can invoke NumPy's comparison functions either through an operator or by an explicit function call. You need to be familiar with both styles as you will see both in other people's code. 

Here are the available functions:

| Operator    | Equivalent ufunc    |
|---------------|---------------------|
|``==``         |``np.equal``         |
|``!=``         |``np.not_equal``     |
|``<``          |``np.less``          |
|``<=``         |``np.less_equal``    |
|``>``          |``np.greater``       |
|``>=``         |``np.greater_equal`` |
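
To make the equivalence concrete, here is a minimal sketch on a small made-up array (the name `demo_values` is only for illustration and is not part of the data set). The operator and its ufunc should give the same boolean array.

```python
import numpy as np

# Hypothetical toy array, used only for this illustration
demo_values = np.array([1, 5, 10])

# Operator style and explicit ufunc style
via_operator = demo_values >= 5
via_ufunc = np.greater_equal(demo_values, 5)

print(via_operator)                             # [False  True  True]
print(np.array_equal(via_operator, via_ufunc))  # True
```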

## Getting Started with Comparisons
Alright, let's try using some of the comparison functions. Let's just jump right in and explain things as we go.

In [None]:
# Which employees have salaries of over $100,000?
employee_salaries > 100000

**Pythonista Note:** What's up with the `...` in the display of the result?

Whenever NumPy is asked to display an array with a large number of elements, it will use `...` to indicate "there are many more elements here, but I'm not going to display them all".

Just imagine how difficult your Notebook would be to work with if it had printed all 33,000 records of this data set.
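
If you ever do need to see every element, one option is to raise NumPy's print threshold temporarily. The sketch below uses a made-up array (`long_array` is hypothetical) and the `np.printoptions` context manager, which assumes a reasonably recent NumPy release.

```python
import sys
import numpy as np

# Hypothetical array that is long enough to be summarized when printed
long_array = np.arange(2000)

# Default display: NumPy abbreviates the middle of the array with "..."
short_text = str(long_array)
print("..." in short_text)  # True

# Inside this block the summarization threshold is raised,
# so the full array is rendered
with np.printoptions(threshold=sys.maxsize):
    full_text = str(long_array)

print("..." in full_text)   # False
```

Because the setting is scoped to the `with` block, the rest of the Notebook keeps the abbreviated display.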

Anyway, you can see that our comparison function returns a new array that is full of `boolean` values. If the value is `True` at a given index, it means that employee's salary is over $100,000.

In [None]:
# Which employees do NOT work in the police department?
# This time I'll explicitly call the comparison function.
np.not_equal(employee_departments, "POLICE")

## Ok, Now What?
While it is somewhat interesting to get these arrays full of boolean values, you might be wondering *what exactly am I supposed to do with these?*

### Comparison Functions and `np.sum`, `np.all`, or `np.any`
Above, we answered the question *which employees make more than $100,000?* Now we will combine that information with additional functions to answer the following:

In [None]:
# The `np.any()` function will return True if 
# any array values are `True`.

# So, we can ask "Do ANY of the employees make more than $100,000?"
np.any(employee_salaries > 100000)

In [None]:
# The `np.all()` function returns true if ALL the array values are true.

# Do ALL Chicago employees make over $100000?
np.all(employee_salaries > 100000)

In [None]:
# The `np.sum()` function will return the total 
# number of `True` values in a boolean array.

# So, we can ask "How many of the employees make more than $100,000?"
# P.S. It is a LOT of employees. Maybe think about working for 
# the city of Chicago.
np.sum(employee_salaries > 100000)

**Pythonista Note:** Where is `np.sum` getting a number from?

Turns out that in Python the boolean `True` value has a corresponding numeric value of `1`. So, each time `np.sum` encounters `True` in the boolean array, it adds a `1` to its running total. Clever.
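
Here is a small sketch of that idea, using a made-up mask (`demo_mask` is hypothetical). It also shows `np.count_nonzero`, which is another way to count the `True` values.

```python
import numpy as np

# In plain Python, booleans behave like the integers 1 and 0
print(True + True)   # 2
print(int(False))    # 0

# Hypothetical toy mask, used only for this illustration
demo_mask = np.array([True, False, True, True])

# Summing a boolean array counts its True values
true_count = np.sum(demo_mask)
print(true_count)                   # 3

# np.count_nonzero gives the same count and states the intent more directly
print(np.count_nonzero(demo_mask))  # 3
```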

### Comparison Functions and Bitwise Boolean Operators
Oftentimes, we will want to perform multiple comparisons at the same time.

For instance, let's say that we wanted to know which employees make between `$100000` and `$125000` annually. 

**Bitwise boolean operators** allow us to combine and join comparisons together and get the net result.

We will begin by demonstrating the use of the `&` (bitwise and) operator:

In [None]:
# Which employees make between $100000 and $125000 annually?
(employee_salaries >= 100000) & (employee_salaries <= 125000)

#### Parentheses are Important Here
<p>
The parentheses here are important because of 
<a href="https://docs.python.org/3/reference/expressions.html#operator-precedence" target="_blank">
Python's operator precedence rules</a>. The <code>&amp;</code> operator binds more tightly than the comparison operators, so without the parentheses the expression would be evaluated as:
<code>employee_salaries &gt;= (100000 &amp; employee_salaries) &lt;= 125000</code>
</p>
<p>
That is not the comparison we intended. So, be mindful to use parentheses to force the correct order of operations when combining NumPy comparisons with bitwise boolean operators.
</p>
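
The effect of the parentheses can be checked on a small made-up integer array (`demo_values` is hypothetical). This is only a sketch of the precedence issue, not something you need for the data set.

```python
import numpy as np

# Hypothetical toy array, used only for this illustration
demo_values = np.array([1, 5, 10])

# With parentheses: each comparison is evaluated first, then combined
in_range = (demo_values >= 2) & (demo_values <= 8)
print(in_range)  # [False  True False]

# Without parentheses: `2 & demo_values` is evaluated first, and the
# remaining chained comparison cannot be reduced to a single True/False
try:
    demo_values >= 2 & demo_values <= 8
    error_name = None
except ValueError as err:
    error_name = type(err).__name__

print(error_name)  # ValueError
```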

In [None]:
# Ok, now let's bring back in `np.sum()` to get a 
# count of how many employees match these two criteria
np.sum((employee_salaries >= 100000) & (employee_salaries <= 125000))

Now let's try using the `|` (bitwise or) operator. Instead of requiring both conditions to be true like the `&` operator does, this one allows you to ask if at least one of multiple conditions is `True`.

In [None]:
# How many employees make less than $50000 or more than $125000?
np.sum((employee_salaries < 50000) | (employee_salaries > 125000))

In the examples above, we've used two different **bitwise boolean operators**:
* "and" (`&`) 
* "or" (`|`)

While these are the most common ones that you will use, there are a couple of others, which we will demonstrate in the next section. For your reference, here is the full list:

| Operator   | Equivalent ufunc  |
|------------|-------------------|
|`&`         |`np.bitwise_and`   |
|&#124;      |`np.bitwise_or`    |
|`^`         |`np.bitwise_xor`   |
|`~`         |`np.bitwise_not`   |


#### How Bitwise Boolean Operators Work
Under the covers, when a bitwise boolean operator is used, NumPy pairs up the elements of the two arrays by index.

For each pair, NumPy applies the operator to the two boolean values and stores the resulting `True` or `False` at that same index in the output array.
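
As a quick illustration of this element-by-element behavior, the sketch below uses two made-up boolean arrays (`left` and `right` are hypothetical) that cover every `True`/`False` pairing.

```python
import numpy as np

# Hypothetical boolean arrays covering every True/False pairing
left = np.array([True, True, False, False])
right = np.array([True, False, True, False])

print(left & right)  # [ True False False False]
print(left | right)  # [ True  True  True False]
print(left ^ right)  # [False  True  True False]
print(~left)         # [False False  True  True]

# The operators are shorthand for the corresponding ufuncs
same_as_ufunc = np.array_equal(left & right, np.bitwise_and(left, right))
print(same_as_ufunc)  # True
```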

Here are some additional examples of using the various comparison operators/functions on two arrays. These are different from the previous example in that we are comparing the data in separate arrays.

#### Aside: Using the keywords and/or versus the operators &/|
If you tried `(employee_salaries >= 100000) and (employee_salaries <= 125000)`, you would get a `ValueError`. The `and` keyword tries to treat `(employee_salaries >= 100000)` as a single Boolean value, which is not possible because an array with more than one element has no single truth value. For a detailed explanation, see: https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html#Aside:-Using-the-Keywords-and/or-Versus-the-Operators-&/|
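
The difference is easy to reproduce on a small made-up array (`demo_values` is hypothetical). The sketch catches the error so that the cell still runs to the end.

```python
import numpy as np

# Hypothetical toy array, used only for this illustration
demo_values = np.array([1, 5, 10])

# `and` asks Python for a single truth value of the whole array,
# which NumPy refuses to guess for arrays with more than one element
try:
    (demo_values >= 2) and (demo_values <= 8)
    message = "no error"
except ValueError as err:
    message = str(err)

print(message)

# `&` works element by element instead
print((demo_values >= 2) & (demo_values <= 8))  # [False  True False]
```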

In [None]:
# Which employees work in the "FIRE" department 
# and have the title of "LIEUTENANT-EMT"?

# First, determine which employees are in the fire department
fire_department_employees = (employee_departments == "FIRE")

# Then, which employees have the title "LIEUTENANT-EMT"? 
lieutenant_emts = (employee_titles == "LIEUTENANT-EMT")

In [None]:
# np.bitwise_and (&)
# We can now use the & operator to determine which employees
# meet both of these conditions
print(fire_department_employees & lieutenant_emts)

# Which of course isn't all that useful by itself.
# But, what happens if we use `np.sum()` with that?
np.sum(fire_department_employees & lieutenant_emts)

In [None]:
# This could also be stated like this and avoid
# the extra variable definition.
np.sum((employee_departments == 'FIRE') & 
       (employee_titles == "LIEUTENANT-EMT"))

**You are now one of the select few people on earth who know exactly how many Lieutenant EMTs work for the Fire Department in Chicago!** I'm sure that will prove useful in Jeopardy one day.

In [None]:
# np.bitwise_or (|)
# With this operator, True is returned for a given index if
# at least one of the two boolean arrays being compared has 
# a value of True at a given index

# Question: How many employees work in the "GENERAL SERVICES" 
# department OR have a Job Title of "SERGEANT"?
sergeants = (employee_titles == 'SERGEANT')
general_services_employees = (employee_departments == "GENERAL SERVICES")

np.sum(sergeants | general_services_employees)

In [None]:
# np.bitwise_xor (^)
# True is returned for a given index if one, but not both, 
# of the elements being compared is `True`.

# How many employees are EITHER working for the fire department
# OR hold the title of "SERGEANT", but not both?
np.sum(sergeants ^ fire_department_employees)

#### Special Note for `np.bitwise_not` (`~`)
The `np.bitwise_not` operator is the odd man out in this collection.

Unlike the other bitwise operators, which combine two arrays, this one takes a single boolean array and simply reverses its values.

In [None]:
# np.bitwise_not (~)
# Here we will demonstrate that this operation simply switches
# the True/False values of the `sergeants` boolean array.

# Total Records in `sergeants` array
print(len(sergeants))

# Number of "Sergeants"
print(np.sum(sergeants))

# Invert the values of the array with `~`
# to get the number of employees that are NOT sergeants.
print(np.sum(~sergeants))

### A Short Review
Alright, you have covered **a lot** of ground so far in this tutorial. Good job. 

Here is a brief review of what we've covered:
1. We've seen how the comparison functions (`np.equal`, `np.less`, `np.greater`, etc.) generate boolean arrays that indicate whether a given element of an array meets (or doesn't meet) the condition of the function.

1. We then showed how you could pass these boolean arrays to `np.sum`, `np.all`, and `np.any` to derive additional information on your data set.

1. Finally, we demonstrated how you could logically compare two boolean arrays with the **bitwise** operators to perform multistep data comparisons.

For the last segment of this tutorial, we are going to demonstrate using comparison functions to return original items of the array that meet the condition of the comparison.

## Comparison Operations as Array Masks
Previously, we showed how you could select data from an array using index or slice notation. Here we will introduce another data selection technique called **masking**.

Basically, it looks similar to slice notation. In case you've forgotten what that looks like, here is a reminder.

In [None]:
# Let's say we have the following simple array
simple_int_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Slice elements with indexes 5, 6, 7 from our simple_int_array
simple_int_array[5:8]

The difference with a mask is that instead of putting `[start:stop:step]` inside the brackets, you actually invoke a comparison function.

In [None]:
# Simple comparison operation
simple_int_array < 7

In [None]:
# Now we pass that comparison invocation as a "mask"
# It will return all the values of simple_int_array that are less than 7
masked_array = simple_int_array[simple_int_array < 7]
masked_array

**Pretty cool huh?** This is how it works:
1. The comparison function inside the brackets is evaluated first. 
2. It returns a boolean array where the first 7 elements have a `True` value, and the rest have `False`.
3. It then applies this boolean array as a 'mask' on the original array. The end result is that only values from the original array that have a matching `True` value in the mask are returned.
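
The steps above can be sketched with the mask stored in its own variable. The arrays `scores` and `players` are made up for this illustration and are not part of the data set.

```python
import numpy as np

# Hypothetical toy arrays, used only for this illustration
scores = np.array([55, 90, 72, 38])
players = np.array(["Ann", "Bo", "Cy", "Di"])

# Steps 1 and 2: the comparison produces a boolean mask
passing_mask = scores >= 60
print(passing_mask)           # [False  True  True False]

# Step 3: the mask keeps only the positions that are True
print(scores[passing_mask])   # [90 72]

# A mask can also index a different array of the same length
print(players[passing_mask])  # ['Bo' 'Cy']
```

The last line works only because `players` and `scores` have the same length and line up position by position.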

Let's apply this to a couple of our other NumPy arrays:

### Practice Exercises

In [None]:
# What salaries are over $180000?
employee_salaries[employee_salaries > 180000]

In [None]:
# Which employees work for the FIRE or GENERAL SERVICES departments? Instead of a truncated display, 
# print the names of all these employees.



In [None]:
# Referring to the previous question, how many such employees are there?