# Advanced Filters

In this section, we take a closer look at common filtering patterns. This list is based on the Common Filter Operations section of the [SQLAlchemy tutorial](https://docs.sqlalchemy.org/en/latest/orm/tutorial.html) from the SQLAlchemy documentation, which is copyright © by SQLAlchemy authors and contributors. SQLAlchemy and its documentation are licensed under the MIT license.

In [1]:
import pandas as pd
from dfply import *
import seaborn as sns
%matplotlib inline

### Common Filter Operators

In this lecture, we will focus on the following filters:

* Like/ilike
* In/not in


## Set up

Let's read in a data set.

In [2]:
from more_dfply import fix_names
heroes_raw = pd.read_csv('./data/heroes_information.csv', na_values=['-', '-99.0', ''])
heroes = (heroes_raw >> fix_names)
heroes.head()

| | Unnamed_0 | name | Gender | Eye_color | Race | Hair_color | Height | Publisher | Skin_color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A-Bomb | Male | yellow | Human | No Hair | 203.0 | Marvel Comics | | good | 441.0 |
| 1 | 1 | Abe Sapien | Male | blue | Icthyo Sapien | No Hair | 191.0 | Dark Horse Comics | blue | good | 65.0 |
| 2 | 2 | Abin Sur | Male | blue | Ungaran | No Hair | 185.0 | DC Comics | red | good | 90.0 |
| 3 | 3 | Abomination | Male | green | Human / Radiation | No Hair | 203.0 | Marvel Comics | | bad | 441.0 |
| 4 | 4 | Abraxas | Male | blue | Cosmic Entity | Black | | Marvel Comics | | bad | |


## LIKE and ILIKE

`LIKE` and `ILIKE` are SQL operators for matching string patterns.

* Both use the `%` wildcard, which matches any sequence of characters
    * like `.*` in a regular expression
* `LIKE` is case-sensitive
* `ILIKE` is case-insensitive
    * Actual details are platform dependent

### Examples of Like patterns from `SQL`

* `abc%` matches any string that starts with `abc`
* `%abc` matches any string that ends with `abc`
* `%abc%` matches any string that contains `abc`
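
The three patterns above can be translated into regular expressions mechanically. Here is a minimal sketch, assuming a hypothetical helper named `like_to_regex` (it is not part of `pandas`, `dfply`, or SQLAlchemy). It also handles SQL's single-character `_` wildcard, which this section does not otherwise cover, and it ignores `ESCAPE` clauses. The names in the list are toy values.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression.

    Hypothetical helper for illustration only; ESCAPE clauses are not handled.
    """
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')           # % matches any run of characters, including none
        elif ch == '_':
            parts.append('.')            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # every other character is matched literally
    # DOTALL lets the wildcards span newlines as well
    return re.compile(''.join(parts), re.DOTALL)


names = ['Superboy', 'Hellboy', 'Boy Wonder', 'Astro Boy']

# fullmatch is used because a LIKE pattern must match the whole string
starts = [n for n in names if like_to_regex('Super%').fullmatch(n)]
ends = [n for n in names if like_to_regex('%boy').fullmatch(n)]
contains = [n for n in names if like_to_regex('%boy%').fullmatch(n)]

print(starts, ends, contains)
```

Because the match is case-sensitive, `Boy Wonder` and `Astro Boy` are not selected by any of the three patterns.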

## `pandas` lacks `LIKE`/`ILIKE`

Instead we use

* `str.startswith`
* `str.endswith`
* `str.contains`

#### `LIKE 'Super%'`

In [3]:
(heroes
 >> filter_by(X.name.str.startswith('Super'))
 >> head(2))

| | Unnamed_0 | name | Gender | Eye_color | Race | Hair_color | Height | Publisher | Skin_color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 643 | 643 | Superboy | Male | blue | | Black | 170.0 | DC Comics | | good | 68.0 |
| 644 | 644 | Superboy-Prime | Male | blue | Kryptonian | Black / Blue | 180.0 | DC Comics | | bad | 77.0 |


#### `LIKE '%boy'`

In [4]:
(heroes
 >> filter_by(X.name.str.endswith('boy'))
 >> head(2))

| | Unnamed_0 | name | Gender | Eye_color | Race | Hair_color | Height | Publisher | Skin_color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | 142 | Bumbleboy | Male | | | | | Marvel Comics | | good | |
| 321 | 321 | Hellboy | Male | gold | Demon | Black | 259.0 | Dark Horse Comics | | good | 158.0 |


#### `LIKE '%boy%'`

In [5]:
(heroes
 >> filter_by(X.name.str.contains('boy'))
 >> head(2))

| | Unnamed_0 | name | Gender | Eye_color | Race | Hair_color | Height | Publisher | Skin_color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | 142 | Bumbleboy | Male | | | | | Marvel Comics | | good | |
| 321 | 321 | Hellboy | Male | gold | Demon | Black | 259.0 | Dark Horse Comics | | good | 158.0 |


#### `ILIKE` using `str.lower()`

In [6]:
(heroes
 >> filter_by((X.name
               .str.lower()
               .str.contains('boy'))
              )
 >> head(3))

| | Unnamed_0 | name | Gender | Eye_color | Race | Hair_color | Height | Publisher | Skin_color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 46 | 46 | Astro Boy | Male | brown | | Black | | | | good | |
| 75 | 75 | Beast Boy | Male | green | Human | Green | 173.0 | DC Comics | green | good | 68.0 |
| 142 | 142 | Bumbleboy | Male | | | | | Marvel Comics | | good | |
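
Lower-casing the column first is one way to emulate `ILIKE`. As a sketch of two alternatives, `str.contains` also takes a `case` argument and a `flags` argument. The example below uses plain `pandas` on a small toy `Series` (not the `heroes` data) and compares the three approaches.

```python
import re

import pandas as pd

# Toy data, including one missing value
names = pd.Series(['Astro Boy', 'Bumbleboy', 'Batman', None], dtype=object)

# Approach used above: normalize the case, then match
via_lower = names.str.lower().str.contains('boy', na=False)

# Alternative 1: ask str.contains to ignore case
via_case = names.str.contains('boy', case=False, na=False)

# Alternative 2: pass a regular expression flag
via_flags = names.str.contains('boy', flags=re.IGNORECASE, na=False)

print(via_lower.tolist())
print(via_case.tolist())
print(via_flags.tolist())
```

All three masks select `Astro Boy` and `Bumbleboy`, so the choice is mostly a matter of readability.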


## Cry 'Havoc!,' and let slip the dogs of RegEx

Some `pandas` string methods, such as `str.contains`, accept regular expressions.

In [7]:
(heroes
 >> filter_by(X.Publisher.str.contains('DC|Marvel', na=False))
 >> filter_by(X.name.str.contains(r'\s[Bb]oy|\wboy', na=False))
 >> head
)

| | Unnamed_0 | name | Gender | Eye_color | Race | Hair_color | Height | Publisher | Skin_color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 75 | 75 | Beast Boy | Male | green | Human | Green | 173.0 | DC Comics | green | good | 68.0 |
| 142 | 142 | Bumbleboy | Male | | | | | Marvel Comics | | good | |
| 183 | 183 | Colossal Boy | Male | | | | | DC Comics | | good | |
| 643 | 643 | Superboy | Male | blue | | Black | 170.0 | DC Comics | | good | 68.0 |
| 644 | 644 | Superboy-Prime | Male | blue | Kryptonian | Black / Blue | 180.0 | DC Comics | | bad | 77.0 |
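
Because `str.contains` treats its pattern as a regular expression by default, characters such as `.` act as wildcards. The sketch below, on toy values rather than the `heroes` data, shows two ways to search for literal text: passing `regex=False` or escaping the pattern with `re.escape`.

```python
import re

import pandas as pd

# Toy names chosen so that the '.' in the pattern matters
names = pd.Series(['Mr. Freeze', 'Mr Immortal', 'Mrs. Marvel'], dtype=object)

# As a regex, '.' matches any character, so 'Mr ' and 'Mrs' also match
as_regex = names.str.contains('Mr.', na=False)

# Treat the pattern as plain text
as_literal = names.str.contains('Mr.', regex=False, na=False)

# Or keep regex matching but escape the special characters
as_escaped = names.str.contains(re.escape('Mr.'), na=False)

print(as_regex.tolist())
print(as_literal.tolist())
print(as_escaped.tolist())
```

The literal and escaped searches agree with each other, while the unescaped regex matches every row.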


## Be careful about missing values

* `pandas` string methods map `NA` $\rightarrow$ `NA`
* When values are missing, the result has `dtype('O')` instead of `dtype('bool')`
* Use `na=False` to guarantee `dtype('bool')`

In [8]:
(heroes.Publisher
 .str.contains('DC Comics|Marvel')
 .dtype
)

dtype('O')

In [9]:
(heroes.Publisher
 .str.contains('DC Comics|Marvel', na=False)
 .dtype
)

dtype('bool')

In [10]:
_ = (heroes
     >> filter_by(X.name.str.lower().str.contains('boy')) # Unsafe: may raise if the mask contains NA
     >> filter_by(X.name.str.lower().str.contains('boy', na=False)) # Safe
    )
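
To see the effect of `na=False` in isolation, here is a small sketch using plain `pandas` on a toy `Series` with one missing value (the values are illustrative, and the column is built with `dtype=object` on purpose). It counts the missing entries in the unsafe mask and uses the safe mask for boolean indexing.

```python
import pandas as pd

# Toy publisher column with one missing value
publisher = pd.Series(['DC Comics', None, 'Dark Horse Comics'], dtype=object)

# Without na=False the missing value is propagated into the mask
raw_mask = publisher.str.contains('DC Comics|Marvel')

# With na=False the missing value is treated as "no match"
safe_mask = publisher.str.contains('DC Comics|Marvel', na=False)

print(raw_mask.dtype, safe_mask.dtype)

# Number of entries in the unsafe mask that are neither True nor False
n_missing = int(raw_mask.isna().sum())

# The safe mask can be used directly for filtering
kept = publisher[safe_mask].tolist()
print(n_missing, kept)
```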

## Checking membership with `IN` and `NOT IN`

`SQL` has `IN` and `NOT IN`, which are used to check if a value is in/not in a collection.

### Using `IN`/`NOT IN` in `pandas`

* `pandas` uses the `isin` method of a column
* Prepend the column expression with `~` for `NOT IN`

#### `IN` 

In [11]:
(heroes
 >> filter_by(X.Publisher.isin(['DC Comics', 'Marvel Comics']))
 >> head(2))

| | Unnamed_0 | name | Gender | Eye_color | Race | Hair_color | Height | Publisher | Skin_color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A-Bomb | Male | yellow | Human | No Hair | 203.0 | Marvel Comics | | good | 441.0 |
| 2 | 2 | Abin Sur | Male | blue | Ungaran | No Hair | 185.0 | DC Comics | red | good | 90.0 |


#### `NOT IN` 

In [12]:
(heroes
 >> filter_by(~X.Publisher.isin(['DC Comics', 'Marvel Comics']))
 >> head(2))

| | Unnamed_0 | name | Gender | Eye_color | Race | Hair_color | Height | Publisher | Skin_color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Abe Sapien | Male | blue | Icthyo Sapien | No Hair | 191.0 | Dark Horse Comics | blue | good | 65.0 |
| 6 | 6 | Adam Monroe | Male | blue | | Blond | | NBC - Heroes | | good | |
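
One difference from SQL is worth keeping in mind: in SQL, `NULL NOT IN (...)` evaluates to `NULL`, so the row is dropped, whereas `~isin(...)` in `pandas` keeps rows whose value is missing. The sketch below uses plain `pandas` on a toy `Series` to show the difference and one possible way to mimic the SQL behavior.

```python
import pandas as pd

# Toy publisher column with one missing value
publisher = pd.Series(['DC Comics', None, 'NBC - Heroes'], dtype=object)
big_two = ['DC Comics', 'Marvel Comics']

# pandas: a missing value is simply "not in" the list, so it is kept
not_in = ~publisher.isin(big_two)

# SQL-like: additionally require the value to be present
not_in_sql_like = not_in & publisher.notna()

print(not_in.tolist())
print(not_in_sql_like.tolist())
```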


## <font color="red"> Exercise 4.2.1 - The Super Hero Dating Game - Part 2</font>

Yesterday, you noticed another singles ad in the local paper, which read

> SBiM looking for SyFy super hero (will also consider Star Wars (George Lucas), Star Trek, or NBC - Heroes ... check the `Publisher` column).  Eye color must be either blue or brown and last name must start with either B or P.

Write a query in `dfply` to help find candidates for this personal ad.

In [25]:
# Your dfply solution here
(heroes
 >> filter_by(X.Publisher.isin(['George Lucas', 'Star Trek', 'NBC - Heroes']))
 >> filter_by(X.Eye_color.str.contains('blue|brown', na=False))
 >> filter_by(X.name.str.contains(r'\s[BP]', na=False))  # last name starts with B or P
)


| | Unnamed_0 | name | Gender | Eye_color | Race | Hair_color | Height | Publisher | Skin_color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 177 | 177 | Claire Bennet | Female | blue | | Blond | | NBC - Heroes | | good | |
| 238 | 238 | Elle Bishop | Female | blue | | Blond | | NBC - Heroes | | bad | |
| 486 | 486 | Nathan Petrelli | Male | brown | | | | NBC - Heroes | | good | |
