# Filtering with the `query` Method

The previous chapters on boolean selection showed us how to filter our DataFrames and Series based on their values. We created conditions, usually involving the comparison operators, resulting in boolean Series and passed them to *just the brackets* to filter the data.

In this chapter we cover the `query` method which enables us to also make selections based on the values of the DataFrame or Series. The `query` method is easier and more intuitive to use than boolean selection, but doesn't provide as much functionality to filter the data. Still, it is a good method to know about to make your subset selections more readable.

## The `query` method

The `query` method allows you to filter the data by writing the condition as a string. For instance, you would pass the string `'tripduration > 1000'` to select all rows of the `bikes` dataset that have a `tripduration` greater than 1000. Let's read in the bikes dataset and run this command now.

In [None]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.head(3)

In [None]:
bikes.query('tripduration > 1000').head(3)

### Less syntax and more readable

The `query` method generally uses less syntax than boolean selection and is usually more readable. For instance, the following reproduces the last result with boolean selection:

```
bikes[bikes['tripduration'] > 1000]
```

This looks a bit clumsy with the name `bikes` written twice right next to one another. The `query` method has its own set of rules for what constitutes a correctly written condition within the string you pass it. The rest of this chapter covers the available functionality of the `query` method. This syntax only works within the string passed to `query` (and the closely related `eval` method) and is not valid in ordinary pandas code.
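
To see that the two approaches return the same rows, here is a minimal sketch on a small, made-up DataFrame. The name `rides` and its values are assumptions for illustration only and are not part of the bikes dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the bikes data (values invented)
rides = pd.DataFrame({
    'tripduration': [300, 1200, 950, 4100],
    'temperature': [55.0, 88.0, 61.0, 90.0],
})

# The condition written as a string for query
via_query = rides.query('tripduration > 1000')

# The same condition written as boolean selection
via_boolean = rides[rides['tripduration'] > 1000]

# equals returns True when both results hold identical rows and values
print(via_query.equals(via_boolean))
```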

### Use strings `and`, `or`, `not`

Unlike boolean selection, you can use the words `and`, `or`, and `not` instead of the operators `&`, `|`, and `~`, which further aids readability with `query`. Let's select all rides with `tripduration` greater than 1,000 and `temperature` greater than 85.

In [None]:
bikes.query('tripduration > 1000 and temperature > 85').head(3)
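
The words `or` and `not` work the same way. The sketch below uses a small, invented DataFrame (the name `rides` and its values are assumptions) and compares each query with the boolean selection that uses `|` and `~`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the bikes data (values invented)
rides = pd.DataFrame({
    'tripduration': [300, 1200, 950, 4100],
    'temperature': [55.0, 88.0, 61.0, 90.0],
})

# 'or' keeps a row when either condition is true
long_or_warm = rides.query('tripduration > 1000 or temperature > 60')

# 'not' inverts a condition; the parentheses make the grouping explicit
not_long = rides.query('not (tripduration > 1000)')

# Boolean selection equivalents need | and ~ plus parentheses
same_or = rides[(rides['tripduration'] > 1000) | (rides['temperature'] > 60)]
same_not = rides[~(rides['tripduration'] > 1000)]

print(long_or_warm.equals(same_or), not_long.equals(same_not))
```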

### Chained comparisons

Let's say we want to find all rides where the temperature was between 50 and 60 degrees. You can do this with `query` by using `and`.

In [None]:
bikes.query('temperature >= 50 and temperature <= 60').head(3)

While this syntax is valid, there is a better way. You can use a **chained comparison** to make the string even more readable and concise. A chained comparison places the column name between two comparison operators. The following states that 50 is less than or equal to the temperature and the temperature is less than or equal to 60, which is equivalent to our previous selection.

In [None]:
bikes.query('50 <= temperature <= 60').head(3)
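
Chained comparisons are a convenience of the query string. As a rough sketch on invented data (the name `rides` and its values are assumptions), the same chained comparison written directly on a Series raises a `ValueError`, while the `between` method gives the boolean selection equivalent.

```python
import pandas as pd

# Hypothetical temperatures (values invented for illustration)
rides = pd.DataFrame({'temperature': [45.0, 50.0, 57.5, 60.0, 72.0]})

# Chained comparison inside the query string; both ends are inclusive here
mild = rides.query('50 <= temperature <= 60')

# Boolean selection equivalent; between is inclusive on both ends by default
same = rides[rides['temperature'].between(50, 60)]

# Outside of query, Python cannot chain comparisons on a Series
plain_python_failed = False
try:
    rides[50 <= rides['temperature'] <= 60]
except ValueError:
    plain_python_failed = True

print(mild.equals(same), plain_python_failed)
```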

### Reference strings with quotes

If you would like to reference a literal string within `query`, you need to surround it in quotes, or else pandas will attempt to use it as a column name. Let's select all rides done by a 'Female' with a trip duration greater than 2,000.

In [None]:
bikes.query('gender == "Female" and tripduration > 2000').head(3)

### Forgetting quotes

If you do not use quotes around your literal string, pandas assumes the value is a column name. The following raises an error because pandas believes you are accessing a column named `Female`, which doesn't exist.

In [None]:
bikes.query('gender == Female and tripduration > 2000')
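
If you want to see the error without stopping your notebook, you can catch it. The sketch below uses an invented DataFrame (the name `rides` and its values are assumptions). The exception pandas raises here is a subclass of Python's built-in `NameError`, so catching `NameError` is enough.

```python
import pandas as pd

# Hypothetical rides (values invented for illustration)
rides = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female'],
    'tripduration': [2500, 800, 1900],
})

caught = False
message = ''
try:
    # Female is unquoted, so pandas looks for a column with that name
    rides.query('gender == Female')
except NameError as err:
    caught = True
    message = str(err)

print(caught, message)

# With quotes, Female is a literal string and the query succeeds
females = rides.query('gender == "Female"')
```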

### Column to column comparisons

It is possible to compare each value in one column with each value in another column. Here, we filter for all the rides where there were more bikes at the start than at the end.

In [None]:
bikes.query('dpcapacity_start > dpcapacity_end').head(3)

### Use 'in' for multiple equalities

You can check whether each value in a column is equal to one or more other values by using the word 'in' within your query. Use the syntax for creating a list within the query string to contain all the values you'd like to check. The following tests whether the ride weather event was snow or rain.

In [None]:
bikes.query('events in ["snow", "rain"]').head(3)

There are multiple syntaxes for the above that all work the same, but I prefer using the above as it is most similar to the `isin` method used during boolean selection.

* `bikes.query('["snow", "rain"] in events')`
* `bikes.query('["snow", "rain"] == events')`
* `bikes.query('events == ["snow", "rain"]')`
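
For comparison, here is a small sketch of `in` next to the `isin` method on invented data (the name `rides` and its values are assumptions).

```python
import pandas as pd

# Hypothetical weather events (values invented for illustration)
rides = pd.DataFrame({'events': ['rain', 'cloudy', 'snow', 'clear']})

# 'in' inside the query string
via_in = rides.query('events in ["snow", "rain"]')

# The isin method with boolean selection
via_isin = rides[rides['events'].isin(['snow', 'rain'])]

print(via_in.equals(via_isin))
```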

### Use 'not in' to invert the condition

You can invert the result of an 'in' clause by placing the word 'not' before it. Here, we find all the rides that did not have the weather events cloudy, partly cloudy or mostly cloudy.

In [None]:
bikes.query('events not in ["cloudy", "partlycloudy", "mostlycloudy"]').head(3)
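
One way to convince yourself that 'not in' is the exact inverse of 'in' is to check that the two results split the rows between them. This is a sketch on invented data (the name `rides` and its values are assumptions).

```python
import pandas as pd

# Hypothetical weather events (values invented for illustration)
rides = pd.DataFrame({
    'events': ['cloudy', 'rain', 'mostlycloudy', 'clear', 'partlycloudy'],
})

cloudy = rides.query('events in ["cloudy", "partlycloudy", "mostlycloudy"]')
not_cloudy = rides.query('events not in ["cloudy", "partlycloudy", "mostlycloudy"]')

# Boolean selection equivalent: invert isin with ~
same = rides[~rides['events'].isin(['cloudy', 'partlycloudy', 'mostlycloudy'])]

# Every row lands in exactly one of the two results
print(len(cloudy) + len(not_cloudy) == len(rides))
```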

### Arithmetic operations within `query`

It is possible to write arithmetic operations within `query` just as you would outside of it. For instance, if we wanted to find all the rides where the start station had more than 20 more bikes than the end station, we do the following.

In [None]:
bikes.query('dpcapacity_start - dpcapacity_end > 20').head(3)
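
Note that `>` is strict, so a difference of exactly 20 is excluded. A short sketch on invented values (the name `rides` and the numbers are assumptions) makes this visible.

```python
import pandas as pd

# Hypothetical capacities (values invented); the differences are 24, 0, 20, -12
rides = pd.DataFrame({
    'dpcapacity_start': [35, 15, 47, 19],
    'dpcapacity_end': [11, 15, 27, 31],
})

# Only the first row has a difference strictly greater than 20
big_gap = rides.query('dpcapacity_start - dpcapacity_end > 20')

# Use >= to keep the row where the difference is exactly 20
at_least_20 = rides.query('dpcapacity_start - dpcapacity_end >= 20')

print(list(big_gap.index), list(at_least_20.index))
```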

### Filtering for right triangles

Let's read in the triangles dataset which contains the lengths of each side of a triangle as the columns `a`, `b`, and `c`.

In [None]:
triangles = pd.read_csv('../data/triangles.csv')
triangles.head()

We can use the `query` method to find all the right triangles, those that satisfy the Pythagorean Theorem. We write the condition using the arithmetic and comparison operators.

In [None]:
triangles.query('a ** 2 + b ** 2 == c ** 2').head()

The syntax is quite a bit nicer than the boolean selection alternative.

In [None]:
filt = triangles['a'] ** 2 + triangles['b'] ** 2 == triangles['c'] ** 2
triangles[filt].head()
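
One caution: the `==` test is exact, so it is reliable for whole-number side lengths but can miss right triangles when the sides are floating-point values with rounding error. A possible workaround, sketched here on invented data with an assumed tolerance (the names `tri`, `lo`, and `hi` are hypothetical), is to check that the difference falls within a small band.

```python
import pandas as pd

# Hypothetical triangles with floating-point sides (values invented)
tri = pd.DataFrame({
    'a': [0.3, 1.0, 0.6],
    'b': [0.4, 1.0, 0.8],
    'c': [0.5, 1.5, 1.0],
})

# Assumed tolerance band; pick one that suits the scale of your data
lo, hi = -1e-9, 1e-9

# A chained comparison keeps a**2 + b**2 - c**2 inside the band
nearly_right = tri.query('@lo < a ** 2 + b ** 2 - c ** 2 < @hi')

print(list(nearly_right.index))
```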

### Use the `@` symbol to reference a variable name

By default, every name within the query string is treated as a column name. You can, however, reference a variable name by preceding it with the `@` symbol. Let's assign the variable name `min_length` to 5000 and reference it in a query to find all the rides where trip duration was greater than it.

In [None]:
min_length = 5000
bikes.query('tripduration > @min_length').head(3)
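
The `@` symbol also picks up variables that are local to a function, which makes it easy to wrap a query in a reusable helper. The function `longer_than` and the DataFrame `rides` below are hypothetical names on invented data.

```python
import pandas as pd

# Hypothetical rides (values invented for illustration)
rides = pd.DataFrame({'tripduration': [300, 1200, 950, 4100, 6200]})


def longer_than(df, min_length):
    """Return the rows of df whose tripduration exceeds min_length."""
    # @min_length refers to the function argument, not to a column
    return df.query('tripduration > @min_length')


over_1000 = longer_than(rides, 1000)
over_5000 = longer_than(rides, 5000)

print(len(over_1000), len(over_5000))
```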

### Using the index with `query`

You can even use the word `index` to make comparisons against the index as if it were a normal column. In the bikes DataFrame, the index is just the integers beginning at 0. Here, we select only the `events` that were 'cloudy' for an index value greater than 4000.

In [None]:
bikes.query('index > 4000 and events == "cloudy" ').head(3)
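
Here is a small sketch of the same idea on invented data with the default integer index (the name `rides` and its values are assumptions), next to the boolean selection that uses the `index` attribute.

```python
import pandas as pd

# Hypothetical weather events with the default 0..4 index (values invented)
rides = pd.DataFrame({
    'events': ['cloudy', 'rain', 'cloudy', 'clear', 'cloudy'],
})

# The word index refers to the DataFrame's index inside the query string
late_cloudy = rides.query('index > 1 and events == "cloudy"')

# Boolean selection equivalent
same = rides[(rides.index > 1) & (rides['events'] == 'cloudy')]

print(late_cloudy.equals(same))
```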

### Using column names with spaces

pandas allows DataFrames to have column names with spaces in them. In order to use a column name containing spaces within `query`, you'll need to surround it with back ticks. If you don't use the back ticks you'll get an error. Let's read in the San Francisco employee compensation dataset which contains multiple column names that have spaces.

In [None]:
sf_emp = pd.read_csv('../data/sf_employee_compensation.csv')
sf_emp.head(3)

Let's find all the employees that are in the organization group of 'Public Protection'.

In [None]:
sf_emp.query('`organization group` == "Public Protection"').head(3)
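
The back ticks can be combined with every other rule in this chapter. The sketch below uses an invented DataFrame (the name `emp`, the `total salary` column, and all values are assumptions, not the real compensation data) and assumes pandas 0.25 or later, where back tick support was added.

```python
import pandas as pd

# Hypothetical employee data with spaces in the column names (values invented)
emp = pd.DataFrame({
    'organization group': ['Public Protection', 'Culture & Recreation',
                           'Public Protection'],
    'total salary': [98000, 54000, 61000],
})

# Each column name containing a space is wrapped in back ticks
well_paid = emp.query(
    '`organization group` == "Public Protection" and `total salary` > 70000'
)

# Boolean selection equivalent
same = emp[(emp['organization group'] == 'Public Protection')
           & (emp['total salary'] > 70000)]

print(well_paid.equals(same))
```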

### Selecting columns with `query`

Unfortunately the `query` method does not give us the ability to select a subset of the columns when filtering the data. You would have to do normal column selection after calling the method. Here, we use *just the brackets* to select three columns after finding all the rides where the weather was snow or rain.

In [None]:
cols = ['starttime', 'temperature', 'events']
bikes.query('events in ["snow", "rain"]')[cols].head()
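
If you prefer to filter rows and select columns in a single step, the `loc` indexer with a boolean Series is one alternative. This sketch on invented data (the name `rides` and its values are assumptions) shows that both routes produce the same result.

```python
import pandas as pd

# Hypothetical rides (values invented for illustration)
rides = pd.DataFrame({
    'tripduration': [300, 1200, 950, 4100],
    'temperature': [55.0, 28.0, 61.0, 30.0],
    'events': ['rain', 'snow', 'clear', 'cloudy'],
})

cols = ['temperature', 'events']

# Filter the rows with query, then select the columns with just the brackets
via_query = rides.query('events in ["snow", "rain"]')[cols]

# loc takes the row condition and the column list together
via_loc = rides.loc[rides['events'].isin(['snow', 'rain']), cols]

print(via_query.equals(via_loc))
```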

## Summary

The `query` method provides an alternative to boolean selection to filter the data based on the values. Here are the rules for the string you provide.

* The expression in the string must evaluate as True or False for every row
* Column names may be accessed directly with their name
* Often you will use one of the comparison operators to create a condition
* Use `and`, `or`, and `not` to create more complex conditions
* To use a literal string, surround it with quotes
* Use chained comparison operators to shorten syntax
* Use `in` to test multiple equalities. Provide the test values in a list
* All arithmetic operators work just as they do outside of the string
* Use the `@` character to reference a variable name
* Use back ticks to reference a column name with spaces in it
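
Several of these rules can appear in a single query string. The following is only a sketch on invented data; the name `rides`, the `trip duration` column with a space, the variable `max_temp`, and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rides (values invented); note the space in 'trip duration'
rides = pd.DataFrame({
    'trip duration': [300, 1200, 950, 4100, 2600],
    'temperature': [55.0, 88.0, 61.0, 90.0, 34.0],
    'events': ['cloudy', 'clear', 'rain', 'clear', 'snow'],
})

max_temp = 89

# Back ticks, a chained comparison, an @ variable, and 'not in' together
result = rides.query(
    '`trip duration` > 1000 '
    'and 30 <= temperature <= @max_temp '
    'and events not in ["cloudy"]'
)

print(list(result.index))
```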

## Exercises

Use the bikes dataset for the first few exercises.

### Exercise 1

<span  style="color:green; font-size:16px">Use the `query` method to select trip durations between 5000 and 10000.</span>

### Exercise 2

<span  style="color:green; font-size:16px">Use the `query` method to select trip durations between 5000 and 10000 when the weather was snow or rain. Retrieve the same data with boolean selection.</span>

### Exercise 3

<span  style="color:green; font-size:16px">Use the `query` method to select trip durations between 5000 and 10000 when it was snow or rain. Create a list outside of the query method to hold the weather and reference that variable with `@` within query.</span>

Read in the movie dataset by executing the cell below and use it for the following exercises.

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 50)
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head(3)

### Exercise 4

<span  style="color:green; font-size:16px">Use the `query` method to find all movies where the total number of Facebook likes for all three actors is greater than 50,000.</span>

### Exercise 5

<span  style="color:green; font-size:16px">Select all the movies where the number of user voters is less than 10 times the number of reviews.</span>

### Exercise 6

<span  style="color:green; font-size:16px">Select all the movies made in the 1990s that were rated R with an IMDB score greater than 8.</span>