# Filtering with the `query` Method

The `query` method is often easier and more intuitive to use than boolean selection, but it provides less functionality for filtering data. Still, it is worth knowing because it can make your subset selections more readable.

The `query` method allows you to filter the data by writing the condition as a string.

In [2]:
import pandas as pd
bikes = pd.read_csv('bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.query('tripduration > 1000').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy
8,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,31.0,Wood St & Division St,15.0,71.1,0.0,cloudy
10,Male,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,15.0,Damen Ave & Pierce Ave,19.0,79.0,9.2,mostlycloudy
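
To make the relationship to boolean selection concrete, here is a minimal sketch that applies the same condition both ways. The `rides` DataFrame and its values are made up for illustration and only stand in for the bikes data.

```python
import pandas as pd

# Hypothetical stand-in for the bikes data (values are made up)
rides = pd.DataFrame({
    'tripduration': [993, 623, 1040, 1300, 1523],
    'temperature': [73.9, 69.1, 73.0, 71.1, 79.0],
})

# The condition is written as a string and evaluated by query
by_query = rides.query('tripduration > 1000')

# The same filter written with boolean selection
by_mask = rides[rides['tripduration'] > 1000]

# Both approaches keep the same rows
print(by_query.equals(by_mask))
```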


Unlike boolean selection, `query` lets you use the words `and`, `or`, and `not` instead of the operators `&`, `|`, and `~`, which further aids readability. Let's select all rides with `tripduration` greater than 1,000 and `temperature` greater than 85.

In [3]:
bikes.query('tripduration > 1000 and temperature > 85').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
40,Female,2013-07-14 14:08:00,2013-07-14 15:53:00,6274,Wabash Ave & Roosevelt Rd,19.0,Lake Shore Dr & Monroe St,11.0,87.1,8.1,partlycloudy
53,Male,2013-07-16 13:04:00,2013-07-16 13:28:00,1435,Canal St & Jackson Blvd,35.0,Canal St & Jackson Blvd,35.0,90.0,8.1,mostlycloudy
60,Male,2013-07-17 10:23:00,2013-07-17 10:40:00,1024,Clinton St & Washington Blvd,31.0,Larrabee St & Menomonee St,15.0,88.0,5.8,partlycloudy
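
As a rough sketch of how the word operators map onto the symbols used in boolean selection, the example below writes one filter both ways. The `rides` DataFrame is hypothetical, and the parentheses after `not` are added only to keep the precedence obvious.

```python
import pandas as pd

# Hypothetical data (values are made up)
rides = pd.DataFrame({
    'tripduration': [6274, 1435, 500, 1024, 2000],
    'temperature': [87.1, 90.0, 88.0, 60.0, 85.0],
})

# Words inside the query string
with_words = rides.query('tripduration > 1000 and not (temperature <= 85)')

# Equivalent boolean selection: each condition needs its own parentheses
with_symbols = rides[(rides['tripduration'] > 1000) & ~(rides['temperature'] <= 85)]

# `or` keeps rows that satisfy at least one condition
either = rides.query('tripduration > 5000 or temperature < 65')

print(with_words.equals(with_symbols))
```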


You can also restrict a single column to a range of values. Writing the range as two conditions joined by `and` is valid, but there is a more compact way: a **chained comparison**, which places the column name between two comparison operators. Here, we select the rides with a temperature between 50 and 60, inclusive.

In [4]:
bikes.query('50 <= temperature <= 60').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
590,Female,2013-09-13 07:55:00,2013-09-13 08:01:00,319,Greenview Ave & Fullerton Ave,15.0,Sheffield Ave & Fullerton Ave,15.0,54.0,15.0,partlycloudy
591,Male,2013-09-13 08:04:00,2013-09-13 08:16:00,738,Lincoln Ave & Armitage Ave,19.0,Larrabee St & Kingsbury St,27.0,57.9,13.8,partlycloudy
592,Female,2013-09-13 08:04:00,2013-09-13 08:14:00,599,Orleans St & Elm St,15.0,Sheffield Ave & Kingsbury St,15.0,57.9,13.8,partlycloudy
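
The sketch below, on a small made-up `rides` DataFrame, suggests that a chained comparison selects the same rows as the two-condition form. The `between` call is included only as the closest boolean-selection counterpart; it is inclusive on both ends by default.

```python
import pandas as pd

# Hypothetical temperatures (values are made up)
rides = pd.DataFrame({'temperature': [54.0, 57.9, 73.0, 45.0, 60.0, 50.0]})

# Chained comparison: the column name sits between two operators
chained = rides.query('50 <= temperature <= 60')

# The longer form with two separate conditions
explicit = rides.query('temperature >= 50 and temperature <= 60')

# Boolean selection with Series.between (both endpoints included)
between = rides[rides['temperature'].between(50, 60)]

print(chained.equals(explicit), chained.equals(between))
```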


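To compare against a string value, put it in quotes inside the query, using a different quote style than the one that delimits the query string itself.

In [5]: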
bikes.query('gender == "Female" and tripduration > 2000').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
40,Female,2013-07-14 14:08:00,2013-07-14 15:53:00,6274,Wabash Ave & Roosevelt Rd,19.0,Lake Shore Dr & Monroe St,11.0,87.1,8.1,partlycloudy
77,Female,2013-07-21 11:35:00,2013-07-21 13:54:00,8299,State St & 19th St,15.0,Sheffield Ave & Kingsbury St,15.0,82.9,5.8,mostlycloudy
173,Female,2013-08-08 08:49:00,2013-08-08 09:31:00,2502,Sheffield Ave & Addison St,27.0,Dearborn St & Adams St,19.0,71.1,10.4,mostlycloudy
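
String values inside a query must be quoted with a different quote character than the one wrapping the query string. A minimal sketch with made-up data shows that either combination works:

```python
import pandas as pd

# Hypothetical data (values are made up)
rides = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female'],
    'tripduration': [993, 6274, 319],
})

# Single quotes outside, double quotes around the string value
double_inside = rides.query('gender == "Female" and tripduration > 2000')

# Double quotes outside, single quotes around the string value
single_inside = rides.query("gender == 'Female' and tripduration > 2000")

print(double_inside.equals(single_inside))
```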


## Column to column comparisons

It is possible to compare each value in one column with the corresponding value in another column. Here, we filter for all the rides where the capacity of the start station was greater than the capacity of the end station.

In [6]:
bikes.query('start_capacity > end_capacity').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
6,Male,2013-07-02 17:47:00,2013-07-02 17:56:00,565,Clark St & Randolph St,31.0,Ravenswood Ave & Irving Park Rd,19.0,66.0,15.0,cloudy
8,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,31.0,Wood St & Division St,15.0,71.1,0.0,cloudy
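
As a small illustration of what a column-to-column comparison does, the sketch below compares two columns row by row on a hypothetical `stations` DataFrame and checks the result against boolean selection.

```python
import pandas as pd

# Hypothetical capacities (values are made up)
stations = pd.DataFrame({
    'start_capacity': [31.0, 15.0, 19.0, 35.0],
    'end_capacity': [19.0, 23.0, 19.0, 15.0],
})

# Each row's start_capacity is compared with the same row's end_capacity
larger_start = stations.query('start_capacity > end_capacity')

# Equivalent boolean selection
same = stations[stations['start_capacity'] > stations['end_capacity']]

print(larger_start.equals(same))
```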


## Use 'in' for multiple equalities

You can check whether each value in a column is equal to one or more other values by using the word `in` within your query. Write the values you'd like to check as a list inside the query string. The following tests whether the weather event was snow or rain.

In [7]:
bikes.query('events in ["snow", "rain"]').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
45,Male,2013-07-15 16:43:00,2013-07-15 16:55:00,727,Greenwood Ave & 47th St,15.0,State St & Harrison St,19.0,82.9,5.8,rain
112,Male,2013-07-26 19:10:00,2013-07-26 19:33:00,1395,Larrabee St & Kingsbury St,27.0,Damen Ave & Pierce Ave,19.0,66.9,12.7,rain
124,Male,2013-07-30 18:53:00,2013-07-30 19:00:00,442,Canal St & Jackson Blvd,35.0,Racine Ave & Congress Pkwy,19.0,69.1,3.5,rain


Use `not in` to keep the rows whose value does not appear in the list.

In [8]:
bikes.query('events not in ["cloudy", "partlycloudy", "mostlycloudy"]').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
25,Female,2013-07-11 08:17:00,2013-07-11 08:31:00,830,Wabash Ave & Roosevelt Rd,19.0,Daley Center Plaza,47.0,73.9,8.1,clear
26,Male,2013-07-12 01:07:00,2013-07-12 01:24:00,1043,State St & Harrison St,19.0,Racine Ave & 18th St,15.0,64.9,0.0,clear
33,Male,2013-07-12 17:22:00,2013-07-12 17:34:00,730,Clark St & Congress Pkwy,27.0,Racine Ave & Congress Pkwy,19.0,79.0,10.4,clear
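
For comparison with boolean selection, `in` and `not in` behave much like the `isin` method and its negation. The following sketch uses a made-up `rides` DataFrame:

```python
import pandas as pd

# Hypothetical weather events (values are made up)
rides = pd.DataFrame({
    'events': ['cloudy', 'rain', 'clear', 'snow', 'partlycloudy'],
})

# `in` inside the query string
wet = rides.query('events in ["snow", "rain"]')

# Boolean selection with isin
wet_isin = rides[rides['events'].isin(['snow', 'rain'])]

# `not in` and the negated isin
not_wet = rides.query('events not in ["snow", "rain"]')
not_wet_isin = rides[~rides['events'].isin(['snow', 'rain'])]

print(wet.equals(wet_isin), not_wet.equals(not_wet_isin))
```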


## Arithmetic operations within `query`

It is possible to write arithmetic operations within `query` just as you would outside of it.

In [9]:
bikes.query('start_capacity - end_capacity >= 20').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
54,Male,2013-07-16 15:13:00,2013-07-16 15:18:00,347,Daley Center Plaza,47.0,State St & Van Buren St,27.0,91.0,8.1,mostlycloudy
66,Male,2013-07-17 20:56:00,2013-07-17 21:14:00,1073,Millennium Park,35.0,Morgan St & 18th St,15.0,86.0,9.2,partlycloudy
116,Male,2013-07-27 09:54:00,2013-07-27 09:56:00,121,Daley Center Plaza,47.0,LaSalle St & Washington St,15.0,60.8,11.5,cloudy
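
A brief sketch of the same idea on hypothetical data: the arithmetic inside the query string is applied column-wise, just like the equivalent expression in boolean selection.

```python
import pandas as pd

# Hypothetical capacities (values are made up)
stations = pd.DataFrame({
    'start_capacity': [47.0, 35.0, 31.0, 15.0],
    'end_capacity': [27.0, 15.0, 19.0, 23.0],
})

# Subtraction happens inside the query string
big_drop = stations.query('start_capacity - end_capacity >= 20')

# Equivalent arithmetic outside of query
diff = stations['start_capacity'] - stations['end_capacity']
big_drop_mask = stations[diff >= 20]

print(big_drop.equals(big_drop_mask))
```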


## Reference variable names with the `@` symbol

By default, all words within the query string attempt to reference a column name. You can, however, reference a variable name by preceding it with the `@` symbol. Let's assign the variable name `min_length` to 5,000 and reference it in a query to find all the rides where trip duration was greater than it.

In [10]:
min_length = 5000
bikes.query('tripduration > @min_length').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
18,Male,2013-07-09 13:12:00,2013-07-09 14:42:00,5396,Canal St & Jackson Blvd,35.0,Millennium Park,35.0,79.0,13.8,cloudy
40,Female,2013-07-14 14:08:00,2013-07-14 15:53:00,6274,Wabash Ave & Roosevelt Rd,19.0,Lake Shore Dr & Monroe St,11.0,87.1,8.1,partlycloudy
77,Female,2013-07-21 11:35:00,2013-07-21 13:54:00,8299,State St & 19th St,15.0,Sheffield Ave & Kingsbury St,15.0,82.9,5.8,mostlycloudy
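
The `@` prefix is not limited to single numbers. As a sketch on made-up data, the example below references both a number and a list; `wanted_events` is a hypothetical variable name introduced here for illustration.

```python
import pandas as pd

# Hypothetical data (values are made up)
rides = pd.DataFrame({
    'tripduration': [5396, 727, 8299, 6000],
    'events': ['cloudy', 'rain', 'mostlycloudy', 'snow'],
})

min_length = 5000
wanted_events = ['snow', 'rain']  # hypothetical variable for this sketch

# @min_length refers to the Python variable, not to a column
long_rides = rides.query('tripduration > @min_length')

# A list variable can be combined with `in`
long_wet = rides.query('tripduration > @min_length and events in @wanted_events')

print(len(long_rides), len(long_wet))
```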


## Using the index with `query`

You can even use the word `index` to make comparisons against the index as if it were a normal column. In the bikes DataFrame, the index is just the integers beginning at 0. Here, we select the rides with an index value greater than 4,000 where `events` was 'cloudy'.

In [11]:
bikes.query('index > 4000 and events == "cloudy"').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
4007,Male,2014-06-07 14:07:00,2014-06-07 14:31:00,1434,Lake Shore Dr & North Blvd,15.0,Halsted St & Roscoe St,15.0,82.0,13.8,cloudy
4008,Male,2014-06-07 14:58:00,2014-06-07 15:19:00,1258,Theater on the Lake,15.0,Sheridan Rd & Buena Ave,15.0,82.0,13.8,cloudy
4009,Male,2014-06-07 15:23:00,2014-06-07 15:28:00,297,Sheffield Ave & Addison St,27.0,Pine Grove Ave & Waveland Ave,23.0,80.1,13.8,cloudy


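If the index has a name, you can use that name in the query just as you would a column name. Here, we move `from_station_name` into the index and then query against it by name.

In [12]: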
bikes_idx = bikes.set_index('from_station_name')
bikes_idx.head(3)

from_station_name,gender,starttime,stoptime,tripduration,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
Lake Shore Dr & Monroe St,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
Clinton St & Washington Blvd,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
Sheffield Ave & Kingsbury St,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy


In [13]:
bikes_idx.query('from_station_name == "Theater on the Lake"').head(3)

from_station_name,gender,starttime,stoptime,tripduration,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
Theater on the Lake,Male,2013-08-23 17:57:00,2013-08-23 18:16:00,1166,15.0,Lincoln Ave & Roscoe St,19.0,79.0,9.2,partlycloudy
Theater on the Lake,Female,2013-08-24 15:31:00,2013-08-24 15:59:00,1661,15.0,Fairbanks Ct & Grand Ave,15.0,84.9,6.9,partlycloudy
Theater on the Lake,Male,2013-09-07 14:28:00,2013-09-07 14:37:00,540,15.0,Sheffield Ave & Fullerton Ave,15.0,88.0,10.4,mostlycloudy
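
Both ways of querying the index can be tried on a small made-up DataFrame. This is only a sketch: the word `index` is used while the index is the default integers, and the index name is used after `set_index`.

```python
import pandas as pd

# Hypothetical data (values are made up)
rides = pd.DataFrame({
    'from_station_name': ['Millennium Park', 'Theater on the Lake',
                          'Daley Center Plaza', 'Theater on the Lake'],
    'tripduration': [993, 1166, 347, 1661],
})

# Default integer index: refer to it with the word `index`
late = rides.query('index > 1')

# Named index: refer to it by its name
by_station = rides.set_index('from_station_name')
theater = by_station.query('from_station_name == "Theater on the Lake"')

print(len(late), len(theater))
```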


## Use backticks to reference column names with spaces

pandas allows DataFrames to have column names with spaces in them. In order to use a column name containing spaces within `query`, you'll need to surround it with backticks. If you don't use the backticks you'll get an error. Let's read in the San Francisco employee compensation dataset which contains multiple column names that have spaces.
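
As a minimal sketch of the backtick syntax, independent of any particular dataset, the example below builds a tiny DataFrame whose column names contain spaces. The names `job title` and `total salary` and all values are made up for illustration, and the example assumes a reasonably recent version of pandas.

```python
import pandas as pd

# Hypothetical data: column names and values are made up,
# they only need to contain spaces
pay = pd.DataFrame({
    'job title': ['Engineer', 'Analyst', 'Manager'],
    'total salary': [120000, 85000, 140000],
})

# Backticks let query treat the whole name as one column
high_pay = pay.query('`total salary` > 100000')

# Backticked names combine with quoted string values as usual
managers = pay.query('`job title` == "Manager"')

print(len(high_pay), len(managers))
```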

## Selecting columns with `query`

Unfortunately, the `query` method cannot select a subset of the columns while filtering the rows. You have to do normal column selection after calling the method. Here, we use *just the brackets* to select three columns after finding all the rides where the weather was snow or rain.

In [14]:
cols = ['starttime', 'temperature', 'events']
bikes.query('events in ["snow", "rain"]')[cols].head()

Unnamed: 0,starttime,temperature,events
45,2013-07-15 16:43:00,82.9,rain
112,2013-07-26 19:10:00,66.9,rain
124,2013-07-30 18:53:00,69.1,rain
161,2013-08-05 17:09:00,68.0,rain
498,2013-09-07 16:09:00,81.0,rain
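
Because `query` returns a DataFrame, any ordinary column selection can follow it. The sketch below, on made-up data, shows *just the brackets* next to an equivalent `.loc` call:

```python
import pandas as pd

# Hypothetical data (values are made up)
rides = pd.DataFrame({
    'starttime': pd.to_datetime(['2013-07-15 16:43', '2013-07-16 13:04',
                                 '2013-07-26 19:10']),
    'gender': ['Male', 'Male', 'Male'],
    'temperature': [82.9, 90.0, 66.9],
    'events': ['rain', 'mostlycloudy', 'rain'],
})

cols = ['starttime', 'temperature', 'events']

# Filter rows with query, then select columns with just the brackets
wet = rides.query('events in ["snow", "rain"]')[cols]

# The same column selection written with .loc
wet_loc = rides.query('events in ["snow", "rain"]').loc[:, cols]

print(wet.equals(wet_loc))
```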
