# Advanced Filters

In this section, we will take a closer look at common filtering patterns. This list is based on the Common Filter Operations section of the [SQLAlchemy tutorial](https://docs.sqlalchemy.org/en/latest/orm/tutorial.html) from the SQLAlchemy documentation, which is copyright © by SQLAlchemy authors and contributors. SQLAlchemy and its documentation are licensed under the MIT license.

### Common Filter Operators

Most filters are built from the following operations.

* EQUALS/NOT EQUALS and other inequalities
* LIKE and ILIKE
* IN and NOT IN
* IS NULL and IS NOT NULL
* AND and OR
* CONTAINS
* `text_filter` and `text_facet`


## How we will proceed

Let's look at how each of the operations is performed in `pyspark`.  We need a dataset that is ripe for filtering, so we will return to the super hero data set.  Who doesn't love a super hero?

## Set up


In [1]:
from pyspark.sql import SparkSession
from more_pyspark import to_pandas

spark = SparkSession.builder.appName('Ops').getOrCreate()


In [2]:
heros = (spark.read.csv('./data/heroes_information.csv', 
                       header=True, 
                       inferSchema=True,
                       nanValue='-99.0',
                       nullValue='-')
        )

heros.take(2) >> to_pandas

                                                                                

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0


In [3]:
from more_pyspark import pprint_schema

heros >> pprint_schema

StructType([StructField('ID', IntegerType(), True),
            StructField('name', StringType(), True),
            StructField('Gender', StringType(), True),
            StructField('Eye color', StringType(), True),
            StructField('Race', StringType(), True),
            StructField('Hair color', StringType(), True),
            StructField('Height', DoubleType(), True),
            StructField('Publisher', StringType(), True),
            StructField('Skin color', StringType(), True),
            StructField('Alignment', StringType(), True),
            StructField('Weight', DoubleType(), True)])


## Category 1 - Equality and Inequality

In `pyspark`, equalities/inequalities are performed using the regular Python operators on column expressions.

#### EQUALS

In [4]:
from pyspark.sql.functions import col

(heros
 .where(col('Eye color') == 'blue')
 .take(2)
) >> to_pandas 

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
1,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0


#### NOT EQUALS

In [5]:
(heros
 .where(col('Eye color') != 'blue')
 .take(2)) >> to_pandas 

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,,bad,441.0
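
One subtlety worth keeping in mind: under SQL semantics, comparing a null with anything yields null rather than true or false, and a filter keeps a row only when the predicate is true. Rows with a missing eye color should therefore appear in neither the EQUALS nor the NOT EQUALS result. The sketch below mimics that behavior in plain Python, with `None` standing in for null; the helper names `sql_equals` and `sql_not` are made up for illustration and are not part of `pyspark`.

```python
from typing import Optional


def sql_equals(value: Optional[str], target: str) -> Optional[bool]:
    """Mimic SQL `=`: comparing a null (None) yields null, not False."""
    if value is None:
        return None
    return value == target


def sql_not(flag: Optional[bool]) -> Optional[bool]:
    """Mimic SQL `NOT`: the negation of null is still null."""
    return None if flag is None else not flag


# None stands in for a hero whose eye color is missing.
eye_colors = ["blue", "yellow", None]

# A filter keeps a row only when the predicate is exactly True.
equals_blue = [c for c in eye_colors if sql_equals(c, "blue") is True]
not_equals_blue = [c for c in eye_colors if sql_not(sql_equals(c, "blue")) is True]

print(equals_blue)      # ['blue']
print(not_equals_blue)  # ['yellow']
```

The missing value lands in neither list, so the two filters together do not cover the whole table.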


#### Other inequalities

In [6]:
(heros
 .where(heros.Height > 200)
 .where(heros.Weight <= 440)
 .take(2)
) >> to_pandas 

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,17,Alien,Male,,Xenomorph XX121,No Hair,244.0,Dark Horse Comics,black,bad,169.0
1,19,Amazo,Male,red,Android,,257.0,DC Comics,,bad,173.0


## LIKE and ILIKE

`LIKE` and `ILIKE` are SQL operators that

* match string patterns
* use the `%` wildcard, which stands for any run of characters
    * like `.*` in a regular expression
* `LIKE` is case-sensitive
* `ILIKE` is case-insensitive
    * Actual details are platform dependent

### Examples

* `abc%` matches any string that starts with `abc`
* `%abc` matches any string that ends with `abc`
* `%abc%` matches any string that contains `abc`
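
To make the wildcard rules concrete, here is a minimal pure-Python sketch that translates a `LIKE` pattern into a regular expression. It is an illustration only, not how any database implements `LIKE`; the function names `like_to_regex` and `like` are hypothetical, and escape characters are deliberately ignored.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into a regular expression string."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches any run of characters, including none
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is taken literally
    return "".join(parts)


def like(value: str, pattern: str, ignore_case: bool = False) -> bool:
    """Return True when the whole value matches the LIKE pattern."""
    flags = re.DOTALL | (re.IGNORECASE if ignore_case else 0)
    return re.fullmatch(like_to_regex(pattern), value, flags) is not None


print(like("Hellboy", "%boy%"))                      # True
print(like("Astro Boy", "%boy%"))                    # False, LIKE is case-sensitive
print(like("Astro Boy", "%boy%", ignore_case=True))  # True, ILIKE-style match
```

Note that the pattern must match the whole string, which is why `abc%` only matches values that start with `abc`.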

## `pyspark` columns have a case-sensitive `like` method

In [7]:
(heros
 .where(heros.name.like('%boy%'))
 .take(2)) >> to_pandas 

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,142,Bumbleboy,Male,,,,,Marvel Comics,,good,
1,321,Hellboy,Male,gold,Demon,Black,259.0,Dark Horse Comics,,good,158.0


## Replicating `ILIKE` in `pyspark`

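* Older versions of `pyspark` have no `ilike` method (`Column.ilike` was added in Spark 3.3)
* Use `lower` then `like`, which works in any version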

In [8]:
from pyspark.sql.functions import lower

(heros
 .where(lower(heros.name).like('%boy%'))
 .take(2)) >> to_pandas 

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,46,Astro Boy,Male,brown,,Black,,,,good,
1,75,Beast Boy,Male,green,Human,Green,173.0,DC Comics,green,good,68.0


## Unleash the power of `rlike`

`rlike` works like `like`, but takes a regular expression instead of a wildcard pattern.

In [9]:
(heros
 .where(heros.name.rlike(r'\s[bB]oy|\wboy'))  # raw string keeps \s and \w intact
 .where(heros.Publisher.rlike('DC Comics|Marvel'))
 .take(2)) >> to_pandas 

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,75,Beast Boy,Male,green,Human,Green,173.0,DC Comics,green,good,68.0
1,142,Bumbleboy,Male,,,,,Marvel Comics,,good,
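
Unlike `like`, a regular expression passed to `rlike` only needs to match somewhere inside the string. The sketch below uses Python's `re` module as a stand-in to show the difference between "found anywhere" and "matches the whole string"; Spark uses Java regular expressions, so details of the dialect may differ, and `"Boy Wonder"` is a made-up name added for contrast.

```python
import re

# Same pattern as the rlike example: " Boy"/" boy" after whitespace,
# or lowercase "boy" directly after a word character.
pattern = re.compile(r"\s[bB]oy|\wboy")

names = ["Beast Boy", "Bumbleboy", "Boy Wonder"]  # "Boy Wonder" is hypothetical

# search succeeds when the pattern occurs anywhere in the name.
found_anywhere = [n for n in names if pattern.search(n)]

# fullmatch succeeds only when the pattern covers the entire name.
whole_string = [n for n in names if pattern.fullmatch(n)]

print(found_anywhere)  # ['Beast Boy', 'Bumbleboy']
print(whole_string)    # []
```

To require a match at the start or end of the value, anchor the pattern with `^` or `$`.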


## Checking membership with `IN` and `NOT IN`

SQL's `IN` and `NOT IN` check whether a value is, or is not, in a collection. In `pyspark`, use the column `isin` method for `IN`, and negate the resulting expression with `~` for `NOT IN`.

#### `IN` in `pyspark`

In [10]:
(heros
 .where(heros.Publisher.isin(['DC Comics', 'Marvel Comics']))
 .take(2)) >> to_pandas

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0


#### `NOT IN` in `pyspark`

In [11]:
(heros
 .where(~heros.Publisher.isin(['DC Comics', 'Marvel Comics']))
 .take(2)) >> to_pandas

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
1,6,Adam Monroe,Male,blue,,Blond,,NBC - Heroes,,good,


## Using  `IS NULL`/`IS NOT NULL` in `pyspark`
 
* Use the column `isNull` method to find nulls (here, the text columns, where `-` was read as null)
* Use the `isnan` function to find `NaN` values (here, the numeric columns, where `-99.0` was read as `NaN`)
* Prepend the expression with `~` for `IS NOT NULL`
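
Null and `NaN` are different kinds of "missing", which is why two separate checks are needed. The plain-Python sketch below illustrates the distinction, with `None` standing in for null; `is_null` and `is_nan` are hypothetical helpers that mirror the roles of `isNull` and `isnan`, not `pyspark` functions.

```python
import math

# A toy weight column: two real values, one NaN, one null.
weights = [441.0, float("nan"), None, 65.0]


def is_null(x) -> bool:
    """True only for a genuinely absent value."""
    return x is None


def is_nan(x) -> bool:
    """True only for a float that is 'not a number'."""
    return x is not None and math.isnan(x)


# NaN is not equal to anything, including itself, so an equality
# filter can never find it.
print(float("nan") == float("nan"))  # False

missing = [w for w in weights if is_null(w) or is_nan(w)]
present = [w for w in weights if not (is_null(w) or is_nan(w))]

print(len(missing))  # 2
print(present)       # [441.0, 65.0]
```

In `pyspark`, the analogous combined test would be `isnan(c) | c.isNull()` for a numeric column expression `c`.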

#### `IS  NULL` in `pyspark`

Check for `Null` in text columns

In [12]:
(heros
 .where(col('Skin color').isNull())
 .collect()) >> to_pandas

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,,bad,441.0
2,4,Abraxas,Male,blue,Cosmic Entity,Black,,Marvel Comics,,bad,
3,5,Absorbing Man,Male,blue,Human,No Hair,193.0,Marvel Comics,,bad,122.0
4,6,Adam Monroe,Male,blue,,Blond,,NBC - Heroes,,good,
...,...,...,...,...,...,...,...,...,...,...,...
657,727,Yellow Claw,Male,blue,,No Hair,188.0,Marvel Comics,,bad,95.0
658,728,Yellowjacket,Male,blue,Human,Blond,183.0,Marvel Comics,,good,83.0
659,729,Yellowjacket II,Female,blue,Human,Strawberry Blond,165.0,Marvel Comics,,good,52.0
660,732,Zatanna,Female,blue,Human,Black,170.0,DC Comics,,good,57.0


#### `IS  NULL` for numeric columns using `isnan`

Check for `nan` in numeric columns

In [13]:
from pyspark.sql.functions import isnan

(heros
 .where(isnan(heros.Weight))
 .take(2)
) >> to_pandas

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,4,Abraxas,Male,blue,Cosmic Entity,Black,,Marvel Comics,,bad,
1,6,Adam Monroe,Male,blue,,Blond,,NBC - Heroes,,good,


#### Use `~` to perform `IS  NOT NULL` in `pyspark`

As above, we have to handle both cases: `Null` in text columns and `NaN` in numeric columns.

In [14]:
(heros
 .where(~col('Skin color').isNull())
 .take(2)) >> to_pandas

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
1,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0


In [15]:
(heros
 .where(~isnan(heros.Weight))
 .take(2)) >> to_pandas

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0


## `AND` and `OR`

#### `AND` using `&`

In [16]:
(heros
 .where((col('Hair color') == 'No Hair') & (col('Eye color') == 'blue'))
 .take(2)) >> to_pandas 

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
1,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0


#### `AND` using multiple `where`s

In [17]:
(heros
 .where(col('Hair color') == 'No Hair')
 .where(col('Eye color') == 'blue')
 .take(2)) >> to_pandas 

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
1,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0


#### `OR` using `|`

In [18]:
(heros
 .where((col('Hair color') == 'No Hair') | (col('Eye color') == 'blue'))
 .take(2)) >> to_pandas 

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
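
Every comparison in the `&` and `|` examples is wrapped in parentheses, and that is not optional. In Python, `&` and `|` bind more tightly than `==`, so an unparenthesized condition is parsed differently from how it reads. The sketch below shows the parse with plain integers standing in for column values; with real column expressions, the unparenthesized form typically fails with an error about converting a column to a bool instead of quietly returning `False`.

```python
hair, eyes = 1, 2  # plain integers standing in for column values

# Intended meaning: both comparisons are true.
with_parens = (hair == 1) & (eyes == 2)

# Without parentheses Python reads this as: hair == (1 & eyes) == 2,
# a chained comparison around the bitwise AND of 1 and 2 (which is 0).
without_parens = hair == 1 & eyes == 2

print(with_parens)     # True
print(without_parens)  # False
```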


## CONTAINS using the column `contains` method

* CONTAINS is used to look for substrings
* `pyspark` column expressions have `contains` method
* Similar to LIKE and RLIKE, but the argument is matched literally, with no wildcards
* Doesn't accept regular expressions
    * Use `rlike` in those cases
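
The difference between a literal substring test and a regular expression matters most when the search text contains characters such as `.` that are special in a pattern. Here is a small pure-Python sketch of that difference, using the `in` operator as a stand-in for `contains` and `re.search` as a stand-in for `rlike`; the name `"A.I. Hero"` is made up for illustration.

```python
import re

names = ["A.I. Hero", "Hellboy"]  # "A.I. Hero" is a hypothetical name

# Literal substring test: "." means an actual period.
literal = [n for n in names if "." in n]

# Regular expression: "." means "any character", so everything matches.
as_regex = [n for n in names if re.search(".", n)]

# Escaping the pattern recovers the literal meaning.
escaped = [n for n in names if re.search(re.escape("."), n)]

print(literal)   # ['A.I. Hero']
print(as_regex)  # ['A.I. Hero', 'Hellboy']
print(escaped)   # ['A.I. Hero']
```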

In [19]:
(heros
.where(heros.Publisher.contains('Comics'))
.where(col('name').contains('-'))
.take(5)
) >> to_pandas

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,11,Air-Walker,Male,blue,,White,188.0,Marvel Comics,,bad,108.0
2,29,Ant-Man,Male,blue,Human,Blond,211.0,Marvel Comics,,good,122.0
3,30,Ant-Man II,Male,blue,Human,Blond,183.0,Marvel Comics,,good,86.0
4,31,Anti-Monitor,Male,yellow,God / Eternal,No Hair,61.0,DC Comics,,bad,


## Refactoring conditional statements


* Column expressions are lazy, which makes refactoring easy.
* Make code readable by giving each conditional expression a descriptive name.
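
To see why laziness makes this kind of refactoring safe, consider the toy model below. It is a rough pure-Python analogy, not how `pyspark` implements columns: the hypothetical `Pred` class wraps a row test that does nothing until it is applied, so predicates can be named, stored, and combined with `&`, `|`, and `~` before any data is touched.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Row = Dict[str, Any]


@dataclass(frozen=True)
class Pred:
    """A deferred row test; nothing runs until it is called on a row."""
    test: Callable[[Row], bool]

    def __call__(self, row: Row) -> bool:
        return self.test(row)

    def __and__(self, other: "Pred") -> "Pred":
        return Pred(lambda row: self(row) and other(row))

    def __or__(self, other: "Pred") -> "Pred":
        return Pred(lambda row: self(row) or other(row))

    def __invert__(self) -> "Pred":
        return Pred(lambda row: not self(row))


# Named predicates are built first, with no data in sight.
is_nbc = Pred(lambda r: r["Publisher"] == "NBC - Heroes")
has_blue_or_brown_eyes = Pred(lambda r: r["Eye color"] in ("blue", "brown"))
candidate = is_nbc & has_blue_or_brown_eyes

rows = [
    {"name": "Claire Bennet", "Eye color": "blue", "Publisher": "NBC - Heroes"},
    {"name": "Abe Sapien", "Eye color": "blue", "Publisher": "Dark Horse Comics"},
    {"name": "Nathan Petrelli", "Eye color": "brown", "Publisher": "NBC - Heroes"},
]

# Evaluation happens only here, when the predicate meets the rows.
matches = [r["name"] for r in rows if candidate(r)]
print(matches)  # ['Claire Bennet', 'Nathan Petrelli']
```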

### Example - The Super Hero Dating Game - Part 1

Yesterday, you noticed another singles ad in the local paper, which read

> SBiM looking for SyFy super hero (will also consider Star Wars, Star Trek, or NBC - Heroes).  Eye color must be either blue or brown and last name must start with either B or P.

1. Write a query in `pyspark` to help find candidates for this personal ad.
2. Refactor the conditional expressions to be more readable.

In [20]:
# Original solution
(heros
 .select(heros.name,
         'Eye color',
         heros.Publisher)
 .where(heros.Publisher.isin(['NBC - Heroes', 'SyFy', 'Star Trek', 'George Lucas']))
 .where(col('Eye color').rlike('blue|brown'))
 .where(heros.name.rlike(' [BP]'))
 .collect()
) >> to_pandas
        

Unnamed: 0,name,Eye color,Publisher
0,Claire Bennet,blue,NBC - Heroes
1,Elle Bishop,blue,NBC - Heroes
2,Nathan Petrelli,brown,NBC - Heroes


In [21]:
# refactored solution

is_syfy = heros.Publisher.isin(['NBC - Heroes', 'SyFy', 'Star Trek', 'George Lucas'])
has_blue_or_brown_eyes = col('Eye color').rlike('blue|brown')
name_startswith_B_or_P = heros.name.rlike(' [BP]')

(heros
 .select(heros.name,
         'Eye color',
         heros.Publisher)
 .where(is_syfy & has_blue_or_brown_eyes & name_startswith_B_or_P)
 .collect()
) >> to_pandas

Unnamed: 0,name,Eye color,Publisher
0,Claire Bennet,blue,NBC - Heroes
1,Elle Bishop,blue,NBC - Heroes
2,Nathan Petrelli,brown,NBC - Heroes


## <font color="red"> Exercise 6.2.3 - The Super Hero Dating Game - Part 3</font>

Yesterday, you noticed one more singles ad in the local paper, which read

> W4A (Woman for Androgynous) looking for super hero.  Must be tall (at least 6 feet tall) and either God/Eternal/Cosmic Entity; or have no body hair.  Bad heroes need not reply.

1. Write a query in all three frameworks to help find candidates for this personal ad.  You should complete each query with **exactly one filter_by/where**.
2. Refactor all conditional statements.

In [50]:
# Your dfply solution here
import pandas as pd
from dfply import *

taller_than_6 = (X.Height >= 6)
has_no_hair = (X['Hair color'] == 'No Hair')
is_eternal = (X.Race.isin(['God','Eternal','Cosmic Entity']))
is_not_bad = (X.Alignment != 'bad')

heroespd = pd.read_csv('./data/heroes_information.csv')
(heroespd
 >> mutate(Height = X.Height / 30.48)
 >> filter_by( taller_than_6 &
              (has_no_hair | is_eternal) &
              is_not_bad
             )
)

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,6.660105,Marvel Comics,-,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,6.266404,Dark Horse Comics,blue,good,65.0
2,2,Abin Sur,Male,blue,Ungaran,No Hair,6.069554,DC Comics,red,good,90.0
78,78,Beta Ray Bill,Male,-,-,No Hair,6.594488,Marvel Comics,-,good,216.0
91,91,Bishop,Male,brown,Mutant,No Hair,6.496063,Marvel Comics,-,good,124.0
102,102,Black Lightning,Male,brown,-,No Hair,6.069554,DC Comics,-,good,90.0
212,212,Deadpool,Male,brown,Mutant,No Hair,6.167979,Marvel Comics,-,neutral,95.0
233,233,Drax the Destroyer,Male,red,Human / Altered,No Hair,6.332021,Marvel Comics,green,good,306.0
245,245,Etrigan,Male,red,Demon,No Hair,6.332021,DC Comics,yellow,neutral,203.0
255,255,Fin Fang Foom,Male,red,Kakarantharaian,No Hair,31.988189,Marvel Comics,green,good,18.0


In [62]:
# Your pyspark solution here
from pyspark.sql.functions import column, col

taller_than_6 = (heros.Height >= 6)
has_no_hair = (col('Hair color') == 'No Hair')
is_eternal = (heros.Race.isin(['God','Eternal','Cosmic Entity']))
is_not_bad = (heros.Alignment != 'bad')

(heros
 .withColumn('Height', heros.Height/30.48)
 .where(taller_than_6 & (has_no_hair | is_eternal) & is_not_bad)
 .collect()
) >> to_pandas

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,6.660105,Marvel Comics,,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,6.266404,Dark Horse Comics,blue,good,65.0
2,2,Abin Sur,Male,blue,Ungaran,No Hair,6.069554,DC Comics,red,good,90.0
3,78,Beta Ray Bill,Male,,,No Hair,6.594488,Marvel Comics,,good,216.0
4,91,Bishop,Male,brown,Mutant,No Hair,6.496063,Marvel Comics,,good,124.0
5,102,Black Lightning,Male,brown,,No Hair,6.069554,DC Comics,,good,90.0
6,112,Blaquesmith,,black,,No Hair,,Marvel Comics,,good,
7,120,Bloodhawk,Male,black,Mutant,No Hair,,Marvel Comics,,good,
8,189,Crimson Dynamo,Male,brown,,No Hair,5.905512,Marvel Comics,,good,104.0
9,212,Deadpool,Male,brown,Mutant,No Hair,6.167979,Marvel Comics,,neutral,95.0


In [24]:
# Your sqlalchemy solution here