# Selecting by Callable
This is the fourth entry in a series on indexing and selecting in pandas. Here is what we've covered so far:

 * [Basic indexing, selecting by label and location](https://www.wrighters.io/indexing-and-selecting-in-pandas-part-1/)
 * [Slicing in pandas](https://www.wrighters.io/indexing-and-selecting-in-pandas-slicing/)
 * [Selecting by boolean indexing](https://www.wrighters.io/boolean-indexing-in-pandas/)
 
In all of the discussion so far, we've focused on the three main methods of selecting data in the two main pandas data structures, ```Series``` and ```DataFrame```. 

 * The array indexing operator, or ```[]```
 * The ```.loc``` selector, for selection using the label on the index
 * The ```.iloc``` selector, for selection using the location
 
We noted in the last entry in the series that all three can take a boolean vector as an indexer to select data from the object. It turns out that you can also pass in a callable. If you're not familiar with callables, a callable is a function or any object with a ```__call__``` method. When used for pandas selection, the callable needs to take one argument, which will be the pandas object, and return a result that will select data from the dataset. Why would you want to do this? We'll look at how this can be useful.
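
To make the three kinds of callable concrete, here is a minimal sketch. The ```toy``` frame, its column names, and the ```RateBelow``` class are made up for illustration and are not part of the Chicago dataset; the point is only that a function, a lambda, and an object with ```__call__``` should all be accepted in the same spot.

```python
import pandas as pd

# Hypothetical toy data (not the Chicago dataset used in this post)
toy = pd.DataFrame({
    "title": ["CLERK", "ENGINEER", "INTERN"],
    "rate": [18.0, 45.0, 14.0],
})

# 1. A plain function that takes the frame and returns a boolean Series
def low_rate(frame):
    return frame["rate"] < 20

# 2. A lambda doing the same thing
low_rate_lambda = lambda frame: frame["rate"] < 20

# 3. An object with a __call__ method, which lets us carry a parameter around
class RateBelow:
    def __init__(self, threshold):
        self.threshold = threshold

    def __call__(self, frame):
        return frame["rate"] < self.threshold

by_function = toy[low_rate]
by_lambda = toy[low_rate_lambda]
by_object = toy[RateBelow(20)]
```

All three selections should contain the same two rows (the clerk and the intern). The class version is handy when the threshold needs to vary without writing a new function each time.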

In this series I've been grabbing data from the [Chicago Data Portal](https://data.cityofchicago.org). For this post, I've grabbed the [list of current employees](https://data.cityofchicago.org/Administration-Finance/Current-Employee-Names-Salaries-and-Position-Title/xzkq-xp2w) for the city. The dataset includes full-time and part-time employees, both salaried and hourly.


In [1]:
import pandas as pd

# you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
# it also only returns 1000 rows (or at least it did for me without an API key)
df = pd.read_json("https://data.cityofchicago.org/resource/xzkq-xp2w.json")

In [2]:
df.dtypes

name                  object
job_titles            object
department            object
full_or_part_time     object
salary_or_hourly      object
annual_salary        float64
typical_hours        float64
hourly_rate          float64
dtype: object

In [3]:
df.describe()

Unnamed: 0,annual_salary,typical_hours,hourly_rate
count,785.0,215.0,215.0
mean,87307.076637,35.55814,34.706
std,20342.094746,8.183932,13.027963
min,20568.0,20.0,3.0
25%,76164.0,40.0,22.35
50%,87006.0,40.0,38.35
75%,97386.0,40.0,44.4
max,180000.0,40.0,57.04


In [4]:
df.shape

(1000, 8)

In [5]:
df = df.drop('name', axis=1)   # no need to include personal info in this post

## Simple callables
So we have some data, which is a subset of the total list of employees for the city of Chicago. The full dataset should be about 32,000 rows.

Before we give a few examples, let's clarify what this callable should do. First, the callable will take one argument, which will be the ```DataFrame``` or ```Series``` being indexed. What it needs to return is a valid value for indexing. This could be any value that we've already discussed in earlier posts.
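
One way to think about this, sketched below with a made-up ```toy``` frame: indexing with a callable should give the same result as calling the function yourself and indexing with whatever it returns. The column names here are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "title": ["CLERK", "ENGINEER", "INTERN"],
    "hours": [20.0, 40.0, 35.0],
})

def part_time(frame):
    # Receives the object being indexed, returns a boolean Series
    return frame["hours"] <= 20

# Pass the callable itself and let pandas call it with the frame...
via_callable = toy.loc[part_time]

# ...or call it ourselves and pass the result
via_direct = toy.loc[part_time(toy)]
```

Both ```via_callable``` and ```via_direct``` should hold the same single row, so the callable form adds nothing new here. It only starts to pay off when there is no convenient name for the object being indexed.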

So, if we are using the array indexing operator, on a ```DataFrame``` you'll remember that you can pass in a single column, or a list of columns to select.

In [6]:
def select_job_titles(df):
    return "job_titles"

df[select_job_titles]

0                                    SERGEANT
1      POLICE OFFICER (ASSIGNED AS DETECTIVE)
2                    CHIEF CONTRACT EXPEDITER
3                           CIVIL ENGINEER IV
4                            CONCRETE LABORER
                        ...                  
995                 AVIATION SECURITY OFFICER
996                           FIREFIGHTER-EMT
997                              LIBRARIAN IV
998               HUMAN SERVICE SPECIALIST II
999                            POLICE OFFICER
Name: job_titles, Length: 1000, dtype: object

In [7]:
def select_job_titles_typical_hours(df):
    return ["job_titles", "typical_hours"]

df[select_job_titles_typical_hours].dropna()

Unnamed: 0,job_titles,typical_hours
4,CONCRETE LABORER,40.0
6,TRAFFIC CONTROL AIDE-HOURLY,20.0
7,ELECTRICAL MECHANIC,40.0
10,FOSTER GRANDPARENT,20.0
21,ELECTRICAL MECHANIC (AUTOMOTIVE),40.0
...,...,...
971,CONSTRUCTION LABORER,40.0
974,HOISTING ENGINEER,40.0
977,CONSTRUCTION LABORER,40.0
988,CONSTRUCTION LABORER,40.0


We can also return a boolean indexer, since that's a valid argument.

In [8]:
def select_20_hours_or_less(df):
    return df['typical_hours'] <= 20

df[select_20_hours_or_less].head()

Unnamed: 0,job_titles,department,full_or_part_time,salary_or_hourly,annual_salary,typical_hours,hourly_rate
6,TRAFFIC CONTROL AIDE-HOURLY,OEMC,P,Hourly,,20.0,19.86
10,FOSTER GRANDPARENT,FAMILY & SUPPORT,P,Hourly,,20.0,3.0
91,CROSSING GUARD,OEMC,P,Hourly,,20.0,18.52
113,SENIOR COMPANION,FAMILY & SUPPORT,P,Hourly,,20.0,3.0
125,TITLE V PROGRAM TRAINEE I,FAMILY & SUPPORT,P,Hourly,,20.0,13.0


You can also use callables for both the first (row indexer) and second (column indexer) arguments in a ```DataFrame```.

In [9]:
df.loc[lambda df: df['typical_hours'] <= 20, lambda df: ['job_titles', 'typical_hours']].head()

Unnamed: 0,job_titles,typical_hours
6,TRAFFIC CONTROL AIDE-HOURLY,20.0
10,FOSTER GRANDPARENT,20.0
91,CROSSING GUARD,20.0
113,SENIOR COMPANION,20.0
125,TITLE V PROGRAM TRAINEE I,20.0


### But why?
OK, so this all seems kind of unnecessary because you could do this much more directly. Why write a separate function to provide another level of indirection?

I have to admit that before writing this post, I hadn't used callable indexing much, if at all. But one use case where it's helpful is something that I do all the time. Maybe you do as well.

Let's say we want to find job titles with an average hourly pay rate below some threshold. Usually you'll do a group by followed by a selector on the resulting ```DataFrame```.

In [10]:
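# numeric_only=True averages only the numeric columns (pandas 2.0+ raises a TypeError on the text columns otherwise)
temp = df.groupby('job_titles').mean(numeric_only=True)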
temp[temp['hourly_rate'] < 20]

job_titles,annual_salary,typical_hours,hourly_rate
ALDERMANIC AIDE,41760.0,25.0,14.0
CROSSING GUARD - PER CBA,,20.0,15.195
CUSTODIAL WORKER,,40.0,19.2
FOSTER GRANDPARENT,,20.0,3.0
HOSPITALITY WORKER,,20.0,14.11
LAW CLERK,,35.0,14.95
LIBRARY CLERK - HOURLY,,20.0,18.01
LIBRARY PAGE,,20.0,13.75
SENIOR COMPANION,,20.0,3.0
STUDENT INTERN,,35.0,16.0


But with a callable, you can do this without the temporary ```DataFrame``` variable.

In [11]:
df.groupby('job_titles').mean(numeric_only=True).loc[lambda df: df['hourly_rate'] < 20]

job_titles,annual_salary,typical_hours,hourly_rate
ALDERMANIC AIDE,41760.0,25.0,14.0
CROSSING GUARD - PER CBA,,20.0,15.195
CUSTODIAL WORKER,,40.0,19.2
FOSTER GRANDPARENT,,20.0,3.0
HOSPITALITY WORKER,,20.0,14.11
LAW CLERK,,35.0,14.95
LIBRARY CLERK - HOURLY,,20.0,18.01
LIBRARY PAGE,,20.0,13.75
SENIOR COMPANION,,20.0,3.0
STUDENT INTERN,,35.0,16.0
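
The callable matters even more when the column you want to filter on doesn't exist on the original object. Here's a small sketch of that situation, using a made-up ```toy``` frame and a hypothetical ```weekly_pay``` column that is created in the middle of the chain.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "title": ["CLERK", "CLERK", "ENGINEER", "INTERN"],
    "hours": [20.0, 40.0, 40.0, 35.0],
    "rate": [18.0, 19.0, 45.0, 14.0],
})

weekly_under_700 = (
    toy
    # weekly_pay only exists on the intermediate result of assign()
    .assign(weekly_pay=lambda d: d["hours"] * d["rate"])
    # so the filter has to receive that intermediate frame as its argument
    .loc[lambda d: d["weekly_pay"] < 700]
)
```

Writing ```toy["weekly_pay"] < 700``` inside the ```.loc``` would fail here, because ```toy``` itself never gets the new column. The lambda sidesteps that by operating on whatever the previous step produced.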


One thing to note is that there's nothing special about these callables. They still have to return the correct values for the selector you are choosing to use. So for example, you can do this using ```.loc```:

In [12]:
df.loc[lambda df: df['department'] == 'CITY COUNCIL']

Unnamed: 0,job_titles,department,full_or_part_time,salary_or_hourly,annual_salary,typical_hours,hourly_rate
124,STUDENT INTERN - ALDERMANIC,CITY COUNCIL,F,Hourly,,35.0,14.0
175,LEGISLATIVE AIDE,CITY COUNCIL,F,Salary,20568.0,,
220,ALDERMANIC AIDE,CITY COUNCIL,F,Salary,47520.0,,
365,STUDENT INTERN - ALDERMANIC,CITY COUNCIL,F,Hourly,,35.0,14.0
400,STUDENT INTERN - ALDERMANIC,CITY COUNCIL,F,Hourly,,35.0,14.0
480,ALDERMANIC AIDE,CITY COUNCIL,P,Hourly,,20.0,15.0
522,ASST TO THE ALDERMAN,CITY COUNCIL,F,Salary,90000.0,,
689,ALDERMANIC AIDE,CITY COUNCIL,P,Hourly,,20.0,15.0
862,ASST TO THE ALDERMAN,CITY COUNCIL,F,Salary,65160.0,,
923,STAFF ASST TO THE ALDERMAN,CITY COUNCIL,F,Salary,60228.0,,


But you can't do this, because ```.iloc``` requires a boolean vector without an index (as we talked about in [the post on boolean indexing](https://www.wrighters.io/2021/01/04/boolean-indexing-in-pandas/)).

In [13]:
try:
    df.iloc[lambda df: df['department'] == 'CITY COUNCIL']
except NotImplementedError as nie:
    print(nie)
    
    
# instead, return just the boolean vector
df.iloc[lambda df: (df['department'] == 'CITY COUNCIL').to_numpy()]
# or 
df.iloc[lambda df: (df['department'] == 'CITY COUNCIL').values]


iLocation based boolean indexing on an integer type is not available


Unnamed: 0,job_titles,department,full_or_part_time,salary_or_hourly,annual_salary,typical_hours,hourly_rate
124,STUDENT INTERN - ALDERMANIC,CITY COUNCIL,F,Hourly,,35.0,14.0
175,LEGISLATIVE AIDE,CITY COUNCIL,F,Salary,20568.0,,
220,ALDERMANIC AIDE,CITY COUNCIL,F,Salary,47520.0,,
365,STUDENT INTERN - ALDERMANIC,CITY COUNCIL,F,Hourly,,35.0,14.0
400,STUDENT INTERN - ALDERMANIC,CITY COUNCIL,F,Hourly,,35.0,14.0
480,ALDERMANIC AIDE,CITY COUNCIL,P,Hourly,,20.0,15.0
522,ASST TO THE ALDERMAN,CITY COUNCIL,F,Salary,90000.0,,
689,ALDERMANIC AIDE,CITY COUNCIL,P,Hourly,,20.0,15.0
862,ASST TO THE ALDERMAN,CITY COUNCIL,F,Salary,65160.0,,
923,STAFF ASST TO THE ALDERMAN,CITY COUNCIL,F,Salary,60228.0,,
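
Since ```.iloc``` is about positions, a callable passed to it can also return integer positions or a slice instead of a boolean array. The sketch below uses a made-up ```toy``` frame with a string index so that labels and positions are clearly different things.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame(
    {"dept": ["A", "B", "A", "C"], "rate": [14.0, 30.0, 15.0, 22.0]},
    index=["w", "x", "y", "z"],
)

# A list of integer positions computed from the frame's length
first_and_last = toy.iloc[lambda d: [0, len(d) - 1]]

# A slice object works as well
every_other = toy.iloc[lambda d: slice(None, None, 2)]

# Convert a boolean condition into integer positions
dept_a = toy.iloc[lambda d: np.flatnonzero((d["dept"] == "A").to_numpy())]
```

Each of these returns something ```.iloc``` already understands, which is the same rule as before: the callable's return value has to be valid for the selector it's used with.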


Also, while I've used the ```DataFrame``` for all these examples, this works in ```Series``` as well.

In [14]:
s = df['annual_salary']

s[lambda s: s < 30000]

175    20568.0
Name: annual_salary, dtype: float64
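
The chaining benefit applies to a ```Series``` too. As a sketch with made-up data, here is one way to keep only the values that appear more than once, without storing the counts in a temporary variable.

```python
import pandas as pd

# Hypothetical toy data for illustration only
titles = pd.Series(["CLERK", "INTERN", "CLERK", "ENGINEER", "CLERK", "INTERN"])

# value_counts() returns a new Series of counts; the lambda receives that Series
repeated = titles.value_counts().loc[lambda counts: counts > 1]
```

Here ```repeated``` should contain only the titles that show up at least twice.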

In summary, indexing with a callable gives you some flexibility to condense code that would otherwise require temporary variables. The thing to remember about the callable is that it needs to return a result that would be acceptable in the same place as the callable.

I hope you'll stay tuned for future updates. I'll plan to talk about the ```.where``` method of selection next.