# `dfply_sql` - `dplyr`-like functions for `sqlalchemy`

## Example - Hero database

In [1]:
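from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base

# Connect to the SQLite file that holds the heroes data
engine = create_engine("sqlite:///databases/heroes_2_1.db")

# Reflect the existing tables into mapped classes
# (reflect=True is deprecated since SQLAlchemy 1.4 in favor of autoload_with=engine)
Base = automap_base()
Base.prepare(engine, reflect=True)
Hero = Base.classes.heroes  # mapped class for the `heroes` table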

## What is `dfply_sql`

* `dfply`-like functions for `sqlalchemy`
* Raw and likely has bugs (I just made it)
* Could use some extra hands ...

In [28]:
import dfply_sql as q
Hero >> q.to_statement >> q.head(num = 2) >> q.to_pandas(engine)

Unnamed: 0,id,name,gender,eye_color,race,hair_color,height,publisher,skin_color,alignment,weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0


## A simple example - `select`, `filter_by`, and `mutate`

In [21]:
from dfply_sql import T, D, func

s = (Hero >> q.to_statement
     >> q.select(['name', T.gender, D['heroes'].c.eye_color])
     >> q.filter_by(T.gender == 'Male')
     >> q.mutate(double_height = T.height*2,
                 triple_height = D['heroes'].c.height*3))

#### `dfply_sql` functions generate SQL statements

In [2]:
s >> q.pprint

SELECT heroes.name,
       heroes.gender,
       heroes.eye_color,
       heroes.height * :height_1 AS double_height,
       heroes.height * :height_2 AS triple_height
FROM heroes
WHERE heroes.gender = :gender_1


<sqlalchemy.sql.selectable.Alias at 0x1171ba940; %(4682656064 anon)s>
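The `:height_1`, `:height_2`, and `:gender_1` tokens in the printed statement are named bind parameters: the literal values (`2`, `3`, `'Male'`) travel separately from the SQL text. As a rough illustration only, the sketch below runs the same SQL with the standard-library `sqlite3` module against a toy in-memory `heroes` table. The table layout, the third row, and the `params` dictionary are assumptions made for the example, not something `dfply_sql` produces.

```python
import sqlite3

# Toy stand-in for the heroes table (assumption: only the columns used here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE heroes (name TEXT, gender TEXT, eye_color TEXT, height REAL)"
)
conn.executemany(
    "INSERT INTO heroes VALUES (?, ?, ?, ?)",
    [
        ("A-Bomb", "Male", "yellow", 203.0),
        ("Abe Sapien", "Male", "blue", 191.0),
        ("Example Heroine", "Female", "green", 170.0),  # made-up row
    ],
)

# Same statement as printed by q.pprint, with named placeholders.
sql = """
SELECT heroes.name,
       heroes.gender,
       heroes.eye_color,
       heroes.height * :height_1 AS double_height,
       heroes.height * :height_2 AS triple_height
FROM heroes
WHERE heroes.gender = :gender_1
"""

# Values are bound by name at execution time.
params = {"height_1": 2, "height_2": 3, "gender_1": "Male"}
rows = conn.execute(sql, params).fetchall()
for row in rows:
    print(row)
```

Keeping values out of the SQL text is what lets the same statement be reused with different filters.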

#### The resulting head of the table

In [3]:
s >> q.head >> q.to_pandas(engine)

Unnamed: 0,name,gender,eye_color,double_height,triple_height
0,A-Bomb,Male,yellow,406.0,609.0
1,Abe Sapien,Male,blue,382.0,573.0
2,Abin Sur,Male,blue,370.0,555.0
3,Abomination,Male,green,406.0,609.0
4,Abraxas,Male,blue,,


## Basic usage

* Import as `q` to avoid conflicts with `dfply`
* Start by passing the `sqlalchemy` table class (e.g. `Hero`) through `to_statement`
* Pipe into standard functions
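For readers wondering how `>>` can chain functions at all, here is a minimal sketch of one way to build such a pipe in plain Python. The `Pipe` class and the `head`/`total` helpers are hypothetical names for illustration; this is not the `dfply` or `dfply_sql` source.

```python
class Pipe:
    """Wrap a function so that `value >> Pipe(f)` calls `f(value)`."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, left):
        # Python falls back to this when `left` does not define `>>` itself.
        return self.func(left)

    def __call__(self, *args, **kwargs):
        # Supplying extra arguments returns a new Pipe that still awaits its input.
        return Pipe(lambda left: self.func(left, *args, **kwargs))


@Pipe
def head(rows, num=5):
    """Keep the first `num` items."""
    return rows[:num]


@Pipe
def total(rows):
    """Sum the items."""
    return sum(rows)


# Reads left to right: take the list, keep three items, add them up.
result = [1, 2, 3, 4, 5, 6, 7] >> head(num=3) >> total
print(result)
```

Both forms seen in the examples work with this design: a bare name such as `total` and a call with arguments such as `head(num=3)`.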

## `T` - The table `Intention` 

* `T` for table
* References the current table
* Only use with one table/statement in `FROM`


In [4]:
T.gender

<dfply.base.Intention at 0x106177780>

In [22]:
T.gender.evaluate(s)

Column('gender', VARCHAR(), table=<heroes>)
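The two cells above show the key idea: `T.gender` is only a recipe, and `evaluate` turns it into a real column once a statement is available. A minimal sketch of that deferred-lookup pattern follows. The `Deferred` class is a hypothetical stand-in written for this example, and the `SimpleNamespace` objects are placeholders for real tables; the actual `Intention` class in `dfply` may work differently.

```python
from types import SimpleNamespace


class Deferred:
    """Record attribute/item access now; replay it on a real object later."""

    def __init__(self, steps=()):
        self._steps = steps

    def __getattr__(self, name):
        # Only reached for names not found normally, e.g. `.gender`.
        if name.startswith("_"):
            raise AttributeError(name)
        return Deferred(self._steps + (("attr", name),))

    def __getitem__(self, key):
        return Deferred(self._steps + (("item", key),))

    def evaluate(self, context):
        # Replay every recorded step against the supplied object.
        obj = context
        for kind, key in self._steps:
            obj = getattr(obj, key) if kind == "attr" else obj[key]
        return obj


# Placeholder "table" with two named columns (assumption for the example).
table = SimpleNamespace(gender="gender column", height="height column")

T_like = Deferred()
resolved = T_like.gender.evaluate(table)

# Item access followed by attribute access is recorded the same way.
database = {"heroes": SimpleNamespace(c=table)}
resolved_by_name = Deferred()["heroes"].c.height.evaluate(database)
print(resolved, "|", resolved_by_name)
```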

## `D` - The database `Intention` 

* `D` for database
* Access a table using `D['table_name']`
* Access a column using `D['table_name'].c.col_name`
* Used when there are 2+ tables in `FROM`


In [25]:
D['heroes'].c.gender

<dfply.base.Intention at 0x117c0e048>

In [26]:
D['heroes'].c.gender.evaluate(s)

Column('gender', VARCHAR(), table=<heroes>)

## Example 2 - Group and Aggregate

In [7]:
(Hero >> q.to_statement
 >> q.select([T.gender,
              T.eye_color])
 >> q.group_by([T.gender])
 >> q.summarise(avg_height = q.func.avg(T.height))
 >> q.to_pandas(engine)
)

Unnamed: 0,gender,eye_color,avg_height
0,,brown,177.066667
1,Female,blue,174.684028
2,Male,red,191.97486


## Example 3 - Filter after aggregation

Note the automatic subquery: filtering on the aggregated `cnt` column wraps the grouped statement in a subquery.

In [12]:
sum_then_filt = (Hero
                 >> q.to_statement
                 >> q.select([T.publisher])
                 >> q.filter_by(T.publisher != 'None')
                 >> q.group_by([T.publisher])
                 >> q.summarise(cnt = q.func.count('*'))
                 >> q.filter_by(T.cnt > 50))
_ = sum_then_filt >> q.pprint

SELECT anon_1.publisher,
       anon_1.cnt
FROM
  (SELECT heroes.publisher AS publisher,
          count(:count_1) AS cnt
   FROM heroes
   WHERE heroes.publisher != :publisher_1
   GROUP BY heroes.publisher) AS anon_1
WHERE anon_1.cnt > :cnt_1
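The subquery is needed because the inner `WHERE` runs before grouping, so it cannot see `cnt`. Hand-written SQL often expresses the same filter with `HAVING`. The sketch below uses the standard-library `sqlite3` module on a made-up `heroes` table (the rows, publisher names, and threshold are assumptions) to check that both forms return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heroes (name TEXT, publisher TEXT)")
# Made-up rows: 3 for "Pub A", 1 for "Pub B", 2 with the literal string 'None'.
conn.executemany(
    "INSERT INTO heroes VALUES (?, ?)",
    [("h1", "Pub A"), ("h2", "Pub A"), ("h3", "Pub A"),
     ("h4", "Pub B"), ("h5", "None"), ("h6", "None")],
)

# Shape of the generated statement: aggregate inside, filter outside.
subquery_form = """
SELECT anon_1.publisher, anon_1.cnt
FROM (SELECT heroes.publisher AS publisher, count(:count_1) AS cnt
      FROM heroes
      WHERE heroes.publisher != :publisher_1
      GROUP BY heroes.publisher) AS anon_1
WHERE anon_1.cnt > :cnt_1
"""

# Equivalent single-level query using HAVING.
having_form = """
SELECT heroes.publisher AS publisher, count(:count_1) AS cnt
FROM heroes
WHERE heroes.publisher != :publisher_1
GROUP BY heroes.publisher
HAVING count(:count_1) > :cnt_1
"""

# Binding the string '*' counts every row, because the constant is never NULL.
params = {"count_1": "*", "publisher_1": "None", "cnt_1": 2}
via_subquery = conn.execute(subquery_form, params).fetchall()
via_having = conn.execute(having_form, params).fetchall()
print(via_subquery, via_having)
```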


In [13]:
sum_then_filt >> q.head >> q.to_pandas(engine)

Unnamed: 0,publisher,cnt
0,DC Comics,215
1,Marvel Comics,388


## Example 4 - Double aggregation

In [10]:
(Hero >> q.to_statement
 >> q.select([T.gender,
              T.eye_color])
 >> q.filter_by(T.eye_color != "green")
 >> q.group_by([T.gender, T.eye_color])
 >> q.summarise(avg_height = func.avg(T.height),
                cnt = func.count('*')
               )
 >> q.group_by([T.gender])
 >> q.summarise(max_height_by_eye_color = q.func.max(T.avg_height))
 >> q.pprint
 >> q.head(num=5)
 >> q.to_pandas(engine)
)

SELECT anon_1.gender,
       anon_1.eye_color,
       anon_1.avg_height,
       anon_1.cnt,
       max(anon_1.avg_height) AS max_height_by_eye_color
FROM
  (SELECT heroes.gender AS gender,
          heroes.eye_color AS eye_color,
          avg(heroes.height) AS avg_height,
          count(:count_1) AS cnt
   FROM heroes
   WHERE heroes.eye_color != :eye_color_1
   GROUP BY heroes.gender,
            heroes.eye_color) AS anon_1
GROUP BY anon_1.gender


Unnamed: 0,gender,eye_color,avg_height,cnt,max_height_by_eye_color
0,,yellow,198.0,1,198.0
1,Female,hazel,213.333333,3,213.333333
2,Male,black,264.4,17,264.4


In [19]:
dbl_agg = (Hero >> q.to_statement
           >> q.select([T.gender, 
                        T.eye_color])
           >> q.filter_by(T.eye_color != "green")
           >> q.group_by([T.gender, T.eye_color])
           >> q.summarise(avg_height = func.avg(T.height),
                          cnt = func.count('*')
                         )
           >> q.group_by([T.gender])
           >> q.summarise(max_height_by_eye_color = q.func.max(T.avg_height)))

_ = dbl_agg >> q.pprint

SELECT anon_1.gender,
       anon_1.eye_color,
       anon_1.avg_height,
       anon_1.cnt,
       max(anon_1.avg_height) AS max_height_by_eye_color
FROM
  (SELECT heroes.gender AS gender,
          heroes.eye_color AS eye_color,
          avg(heroes.height) AS avg_height,
          count(:count_1) AS cnt
   FROM heroes
   WHERE heroes.eye_color != :eye_color_1
   GROUP BY heroes.gender,
            heroes.eye_color) AS anon_1
GROUP BY anon_1.gender


In [20]:
dbl_agg >> q.to_pandas(engine)

Unnamed: 0,gender,eye_color,avg_height,cnt,max_height_by_eye_color
0,,yellow,198.0,1,198.0
1,Female,hazel,213.333333,3,213.333333
2,Male,black,264.4,17,264.4
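One detail of this result is specific to SQLite. The outer `SELECT` lists `eye_color`, `avg_height`, and `cnt`, yet only `gender` is in the outer `GROUP BY`. SQLite accepts such "bare" columns, and when the query contains a single `max()` it fills them from the row that holds the maximum, which is why each row shows the eye color of the tallest group. Many other databases reject a query like this. A small sketch with the standard-library `sqlite3` module, using a made-up `by_eye` table in place of the inner query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up per-(gender, eye_color) averages standing in for the inner query.
conn.execute(
    "CREATE TABLE by_eye (gender TEXT, eye_color TEXT, avg_height REAL)"
)
conn.executemany(
    "INSERT INTO by_eye VALUES (?, ?, ?)",
    [("Female", "blue", 170.0), ("Female", "hazel", 210.0),
     ("Male", "black", 260.0), ("Male", "blue", 185.0)],
)

# eye_color and avg_height are bare columns: not grouped, not aggregated.
rows = conn.execute("""
    SELECT gender,
           eye_color,
           avg_height,
           max(avg_height) AS max_height_by_eye_color
    FROM by_eye
    GROUP BY gender
    ORDER BY gender
""").fetchall()
for row in rows:
    print(row)
```

For a query that should run on other databases, select only the grouped and aggregated columns in the outer step.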


## <font color="red"> Exercise 1: Blue-eyed Heroes </font>

Use `dfply_sql` functions to write a query that:

1. Selects the `name`, `gender`, and `eye_color` columns
2. Filters on `eye_color == 'blue'`

In [29]:
# Your code here

## <font color="red"> Exercise 2: Tall Heroes </font>

Use `dfply_sql` functions to write a query that:

1. Selects the `name`, `gender`, and `height` columns
2. Computes the height in inches as `height_in`
    * Check [here](https://www.kaggle.com/claudiodavi/superhero-set) to determine the current units.
3. Filters on `height_in > 72`

In [30]:
# Your code here

## <font color="red"> Exercise 3: Strong and Fast! </font>

Use `dfply_sql` functions to answer the following question.

How many heroes have both Super Strength and Super Speed?

In [30]:
# Your code here

## Up Next

Stuff