# Select, Filter, and Mutate

In this lecture, we will look at three important actions used to process data frames: choosing columns, keeping rows, and creating new columns. While each framework uses different names for these functions, we will use the names from the `R` library `dplyr`, namely `select`, `filter`, and `mutate`. The most important takeaway is that, regardless of framework or scale, we can process data frames in the same way by applying the same sequence of data verbs.

## R and Python can interact!

In [6]:
#!pip install rpy2 tzlocal
import rpy2
%load_ext rpy2.ipython

ModuleNotFoundError: No module named 'rpy2'

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
%%R
rnorm(5, 2, 3)

ERROR:root:Cell magic `%%R` not found.


## We love dplyr!

In [7]:
%%R 
library(dplyr)
artists <- read.csv('./data/Artists.csv')

(artists %>%
  select(BeginDate, DisplayName, Nationality) %>%
  filter(BeginDate > 0) %>%
  head) -> output
output

ERROR:root:Cell magic `%%R` not found.


## What makes `dplyr` so great?

* Focus on data verbs
* Pipes lead to code that is
    * More readable
    * Easy to compose and debug
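
To make the pipe idea concrete, here is a minimal pure-Python sketch of how a `>>` pipe can be built with the `__rrshift__` hook. The `Verb` class, the two helper verbs, and the sample rows are all hypothetical and written only for illustration; this is not how `dfply` or `dplyr` is actually implemented.

```python
class Verb:
    """Wraps a function so that `data >> verb` calls `verb.func(data)`."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python falls back to this hook for `data >> self`
        # when `data` does not implement `>>` itself.
        return self.func(data)


def select(*columns):
    # Keep only the named keys of every row.
    return Verb(lambda rows: [{c: row[c] for c in columns} for row in rows])


def filter_by(predicate):
    # Keep only the rows for which the predicate is true.
    return Verb(lambda rows: [row for row in rows if predicate(row)])


# Hypothetical rows, loosely modeled on the artists table.
artists = [
    {"DisplayName": "Artist A", "BeginDate": 1930, "Nationality": "American"},
    {"DisplayName": "Artist B", "BeginDate": 0, "Nationality": "Spanish"},
]

output = (artists
          >> select("BeginDate", "DisplayName")
          >> filter_by(lambda row: row["BeginDate"] > 0))
print(output)
```

Each step reads top to bottom as one verb, and any prefix of the chain can be run on its own while debugging.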

## Set up

Let's read in a data set in each of the three frameworks.

#### `pandas` and `dfply`

In [20]:
import pandas as pd
from dfply import *
heroes = pd.read_csv('./data/heroes_information.csv')

#### `sqlalchemy` 

In [22]:
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine, func
from sqlalchemy.ext.automap import automap_base

engine = create_engine("sqlite:///databases/heroes.db")

Base = automap_base()
Base.prepare(engine, reflect=True)
# Mapped class for the `heroes` table, used in the queries below
# (automap only maps tables that have a primary key)
Hero = Base.classes.heroes

Session = sessionmaker(bind=engine)
session = Session()

AttributeError: heroes

In [23]:
Base.metadata.tables

immutabledict({})

#### `pyspark`

In [10]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean

spark1 = SparkSession.builder.appName('Ops').getOrCreate()
df_spark = spark1.read.csv('data/heroes_information.csv', inferSchema=True, header=True)

## Selecting Columns

The first verb, `select`,

* keeps a chosen subset of the *columns*
* is at the core of `SQL` statements
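
As a point of reference, the sketch below shows the raw `SQL` that a column selection boils down to, using Python's built-in `sqlite3` module. The in-memory table and its two rows are a small stand-in for the heroes data, not the actual course database.

```python
import sqlite3

# Stand-in table held in memory (an assumption for illustration only).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE heroes (name TEXT, gender TEXT, weight REAL, publisher TEXT)"
)
con.executemany(
    "INSERT INTO heroes VALUES (?, ?, ?, ?)",
    [
        ("A-Bomb", "Male", 441.0, "Marvel Comics"),
        ("Abe Sapien", "Male", 65.0, "Dark Horse Comics"),
    ],
)

# SELECT names the columns to keep; publisher is dropped from the result.
rows = con.execute("SELECT name, gender, weight FROM heroes").fetchall()
print(rows)
```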

## How to select

* `pandas`: pipe (`>>`) into `select`
* `sqlalchemy`: Use `session.query` or a `select` expression
* `pyspark`: Use the `select` method

#### `select` in `pandas` + `dfply`

In [13]:
from dfply import select as select_dfply
(heroes >>
   select_dfply(X.name, 
                X.Gender, 
                X.Weight) >>
   head)

Unnamed: 0,name,Gender,Weight
0,A-Bomb,Male,441.0
1,Abe Sapien,Male,65.0
2,Abin Sur,Male,90.0
3,Abomination,Male,441.0
4,Abraxas,Male,-99.0


#### `select` expression in `sqlalchemy`

In [14]:
from sqlalchemy import select as select_sql
stmt = (select_sql([Hero.name, Hero.gender, Hero.weight]).
          select_from(Hero).
          limit(5))
print(stmt)

NameError: name 'Hero' is not defined

In [36]:
session.execute(stmt).fetchall()

[('A-Bomb', 'Male', 441.0),
 ('Abe Sapien', 'Male', 65.0),
 ('Abin Sur', 'Male', 90.0),
 ('Abomination', 'Male', 441.0),
 ('Abraxas', 'Male', None)]

#### Convert the result to a `pandas.DataFrame`

In [37]:
pd.read_sql_query(stmt, con=engine)

Unnamed: 0,name,gender,weight
0,A-Bomb,Male,441.0
1,Abe Sapien,Male,65.0
2,Abin Sur,Male,90.0
3,Abomination,Male,441.0
4,Abraxas,Male,


#### `select` in `pyspark`

In [16]:
(df_spark.
    select(df_spark.name, 
           df_spark.Gender, 
           df_spark.Weight).
    take(5))

[Row(name='A-Bomb', Gender='Male', Weight=441.0),
 Row(name='Abe Sapien', Gender='Male', Weight=65.0),
 Row(name='Abin Sur', Gender='Male', Weight=90.0),
 Row(name='Abomination', Gender='Male', Weight=441.0),
 Row(name='Abraxas', Gender='Male', Weight=-99.0)]

#### Convert the result to a `pandas.DataFrame`

In [17]:
result = (df_spark.
            select(df_spark.name, 
                   df_spark.Gender, 
                   df_spark.Weight).
            take(5))
spark1.createDataFrame(result).toPandas()

IllegalArgumentException: 'Unsupported class file major version 55'

## Filtering Rows

The next verb, `filter` 

* filters the *rows*
* is related to the `SQL` `WHERE` clause
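
The link to `WHERE` can be seen directly in raw `SQL`. The following sketch uses Python's built-in `sqlite3` module with a small in-memory stand-in table (an assumption for illustration, not the course database).

```python
import sqlite3

# Stand-in table held in memory.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE heroes (name TEXT, gender TEXT, weight REAL)")
con.executemany(
    "INSERT INTO heroes VALUES (?, ?, ?)",
    [
        ("A-Bomb", "Male", 441.0),
        ("Abe Sapien", "Male", 65.0),
        ("Agent 13", "Female", 61.0),
    ],
)

# WHERE keeps only the rows that satisfy the condition;
# the ? placeholder passes the value safely as a parameter.
male_rows = con.execute(
    "SELECT * FROM heroes WHERE gender = ?", ("Male",)
).fetchall()
print(male_rows)
```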

## How to filter

* `pandas`: pipe (`>>`) into `filter_by`
* `sqlalchemy`: Use `session.query(...).filter` or `session.query(...).filter_by`, or `where` on a `select` expression
* `pyspark`: Use the `where` method

#### `filter_by` in `pandas` + `dfply`

In [29]:
(heroes >>
  filter_by(X.Gender == 'Male') >>
  head)

Unnamed: 0,Id,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,-,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,-,bad,441.0
4,4,Abraxas,Male,blue,Cosmic Entity,Black,-99.0,Marvel Comics,-,bad,-99.0


#### `where` in a `sqlalchemy` `select` expression

In [55]:
# Every SQL query starts with a select
f_stmt = (select_sql('*').
           where(Hero.gender == 'Male').
           limit(5))
session.execute(f_stmt).fetchall()

[(0, 'A-Bomb', 'Male', 'yellow', 'Human', 'No Hair', 203.0, 'Marvel Comics', None, 'good', 441.0),
 (1, 'Abe Sapien', 'Male', 'blue', 'Icthyo Sapien', 'No Hair', 191.0, 'Dark Horse Comics', 'blue', 'good', 65.0),
 (2, 'Abin Sur', 'Male', 'blue', 'Ungaran', 'No Hair', 185.0, 'DC Comics', 'red', 'good', 90.0),
 (3, 'Abomination', 'Male', 'green', 'Human / Radiation', 'No Hair', 203.0, 'Marvel Comics', None, 'bad', 441.0),
 (4, 'Abraxas', 'Male', 'blue', 'Cosmic Entity', 'Black', None, 'Marvel Comics', None, 'bad', None)]

#### Convert the result to a `pandas.DataFrame`

In [56]:
pd.read_sql_query(f_stmt, con = engine)

Unnamed: 0,id,name,gender,eye_color,race,hair_color,height,publisher,skin_color,alignment,weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,,bad,441.0
4,4,Abraxas,Male,blue,Cosmic Entity,Black,,Marvel Comics,,bad,


#### `where` in `pyspark`

In [61]:
f_result = (df_spark.
            where(df_spark.Gender == 'Male').
            take(5))
f_result

[Row(Id=0, name='A-Bomb', Gender='Male', Eye color='yellow', Race='Human', Hair color='No Hair', Height=203.0, Publisher='Marvel Comics', Skin color='-', Alignment='good', Weight=441.0),
 Row(Id=1, name='Abe Sapien', Gender='Male', Eye color='blue', Race='Icthyo Sapien', Hair color='No Hair', Height=191.0, Publisher='Dark Horse Comics', Skin color='blue', Alignment='good', Weight=65.0),
 Row(Id=2, name='Abin Sur', Gender='Male', Eye color='blue', Race='Ungaran', Hair color='No Hair', Height=185.0, Publisher='DC Comics', Skin color='red', Alignment='good', Weight=90.0),
 Row(Id=3, name='Abomination', Gender='Male', Eye color='green', Race='Human / Radiation', Hair color='No Hair', Height=203.0, Publisher='Marvel Comics', Skin color='-', Alignment='bad', Weight=441.0),
 Row(Id=4, name='Abraxas', Gender='Male', Eye color='blue', Race='Cosmic Entity', Hair color='Black', Height=-99.0, Publisher='Marvel Comics', Skin color='-', Alignment='bad', Weight=-99.0)]

In [63]:
spark1.createDataFrame(f_result).toPandas()

Unnamed: 0,Id,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,-,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,-,bad,441.0
4,4,Abraxas,Male,blue,Cosmic Entity,Black,-99.0,Marvel Comics,-,bad,-99.0


## Chaining Data Verbs

* Processing a data frame $\rightarrow$ chaining data verbs
* Accomplished through pipes/dot-chains

## Example 1 - `select` + `filter`

#### `pandas` + `dfply`

In [50]:
(heroes >>
   filter_by(X.Gender == 'Male') >>
   select_dfply(X.name, X.Gender, X.Weight) >>
   head)

Unnamed: 0,name,Gender,Weight
0,A-Bomb,Male,441.0
1,Abe Sapien,Male,65.0
2,Abin Sur,Male,90.0
3,Abomination,Male,441.0
4,Abraxas,Male,-99.0


#### `select`  expression in `sqlalchemy`

In [66]:
from sqlalchemy import select
# Make an SQL expression
sel_filt_stmt = (select_sql([Hero.name, 
                             Hero.gender, 
                             Hero.weight]).
                   where(Hero.gender == 'Male').
                   limit(5))
# Execute the expression
pd.read_sql_query(sel_filt_stmt, con=engine)

Unnamed: 0,name,gender,weight
0,A-Bomb,Male,441.0
1,Abe Sapien,Male,65.0
2,Abin Sur,Male,90.0
3,Abomination,Male,441.0
4,Abraxas,Male,


####  `pyspark`

In [67]:
sf_result = (df_spark.
            select(df_spark.name, 
                   df_spark.Gender, 
                   df_spark.Weight).
            where(df_spark.Gender == 'Male').
            take(5))
spark1.createDataFrame(sf_result).toPandas()

Unnamed: 0,name,Gender,Weight
0,A-Bomb,Male,441.0
1,Abe Sapien,Male,65.0
2,Abin Sur,Male,90.0
3,Abomination,Male,441.0
4,Abraxas,Male,-99.0


## Example 2 - `filter` + `filter`

Note that chaining `filter`s is an `and` operation.
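
A quick way to convince yourself of this is to compare two chained filters against a single filter that combines both conditions with `&`. The sketch below uses plain `pandas` boolean indexing on a tiny hypothetical frame rather than `dfply`.

```python
import pandas as pd

# Tiny hypothetical frame with the same column names as the heroes data.
heroes_small = pd.DataFrame({
    "name": ["A-Bomb", "Abraxas", "Agent 13"],
    "Gender": ["Male", "Male", "Female"],
    "Weight": [441.0, -99.0, 61.0],
})

# Two filters applied one after the other.
chained = heroes_small[heroes_small["Gender"] == "Male"]
chained = chained[chained["Weight"] > 0]

# One filter with both conditions joined by a logical and.
combined = heroes_small[
    (heroes_small["Gender"] == "Male") & (heroes_small["Weight"] > 0)
]

print(chained.equals(combined))
```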

####  `pandas` + `dfply`

In [68]:
(heroes >>
   select_dfply(X.name, X.Gender, X.Weight) >>
   filter_by(X.Gender == 'Male') >>
   filter_by(X.Weight > 0) >>
   head)

Unnamed: 0,name,Gender,Weight
0,A-Bomb,Male,441.0
1,Abe Sapien,Male,65.0
2,Abin Sur,Male,90.0
3,Abomination,Male,441.0
5,Absorbing Man,Male,122.0


#### `select`  expression in `sqlalchemy`

In [70]:
from sqlalchemy import select
# Make an SQL expression
ff_stmt = (select_sql([Hero.name, 
                       Hero.gender, 
                       Hero.weight]).
            where(Hero.gender == 'Male').
            where(Hero.weight > 0).
            limit(5))
# Execute the expression
pd.read_sql_query(ff_stmt, con=engine)

Unnamed: 0,name,gender,weight
0,A-Bomb,Male,441.0
1,Abe Sapien,Male,65.0
2,Abin Sur,Male,90.0
3,Abomination,Male,441.0
4,Absorbing Man,Male,122.0


####  `pyspark`

In [71]:
ff_result = (df_spark.
               select(df_spark.name, df_spark.Gender, df_spark.Weight).
               where(df_spark.Gender == 'Male').
               where(df_spark.Weight > 0).
               take(5))
spark1.createDataFrame(ff_result).toPandas()

Unnamed: 0,name,Gender,Weight
0,A-Bomb,Male,441.0
1,Abe Sapien,Male,65.0
2,Abin Sur,Male,90.0
3,Abomination,Male,441.0
4,Absorbing Man,Male,122.0


## <font color="red"> Exercise 1: Blue-eyed Heroes </font>

Create a query that

1. Selects the name, Gender, and Eye Color columns
2. Filters on eye_color == 'blue'

####  `pandas` + `dfply`

#### `query`  in `sqlalchemy`

#### `select`  expression in `sqlalchemy`

####  `pyspark`

## Constructing New Columns

The third verb, `mutate` 

* Creates new columns
* Changes existing columns

## How to mutate

* `pandas`: pipe (`>>`) into `mutate`
* `sqlalchemy`: Use a formula and alias in `session.query` or `select`
* `pyspark`: Use the `withColumn` method

## Example 3 - Converting Weight to kilograms

Currently, the weight column is in pounds.  Let's convert to kilograms.

####  `pandas` + `dfply`

In [72]:
(heroes >>
   select_dfply(X.name, 
                X.Gender, 
                X.Weight) >>
   mutate(Weight_kg = X.Weight/2.2046) >>
   head)

Unnamed: 0,name,Gender,Weight,Weight_kg
0,A-Bomb,Male,441.0,200.036288
1,Abe Sapien,Male,65.0,29.483807
2,Abin Sur,Male,90.0,40.823732
3,Abomination,Male,441.0,200.036288
4,Abraxas,Male,-99.0,-44.906105


#### `select`  expression in `sqlalchemy`

In [75]:
from sqlalchemy import select
m_stmt = (select_sql([Hero.name, 
                      Hero.gender, 
                      Hero.weight, 
                      (Hero.weight/2.2046).label('Weight_kg')]).
            limit(5))
pd.read_sql_query(m_stmt, con=engine)

Unnamed: 0,name,gender,weight,Weight_kg
0,A-Bomb,Male,441.0,200.036288
1,Abe Sapien,Male,65.0,29.483807
2,Abin Sur,Male,90.0,40.823732
3,Abomination,Male,441.0,200.036288
4,Abraxas,Male,,


####  `pyspark`

In [77]:
m_result = (df_spark.
              select(df_spark.name, 
                     df_spark.Gender, 
                     df_spark.Weight).
              withColumn('Weight_kg', df_spark.Weight/2.2046).
              take(5))
spark1.createDataFrame(m_result).toPandas()

Unnamed: 0,name,Gender,Weight,Weight_kg
0,A-Bomb,Male,441.0,200.036288
1,Abe Sapien,Male,65.0,29.483807
2,Abin Sur,Male,90.0,40.823732
3,Abomination,Male,441.0,200.036288
4,Abraxas,Male,-99.0,-44.906105


## Referencing a new column

Each framework provides a way to reference a new column.

* `pandas` + `dfply`: Use the `X` `Intention`
* `sqlalchemy`: Use `column` function with the label from `select`
* `pyspark`: Use the `col` function with the label from `withColumn`
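
In raw `SQL` terms, referencing a new column means filtering on the alias given in the `SELECT` list. The sketch below does this with the built-in `sqlite3` module on a small in-memory stand-in table. SQLite accepts an alias inside `WHERE`, but this is not standard `SQL`, so treat the behavior as an assumption for other databases, where the expression may need to be repeated or wrapped in a subquery.

```python
import sqlite3

# Stand-in table held in memory (not the course database).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE heroes (name TEXT, weight REAL)")
con.executemany(
    "INSERT INTO heroes VALUES (?, ?)",
    [("A-Bomb", 441.0), ("Abe Sapien", 65.0), ("Abin Sur", 90.0)],
)

# Weight_kg is defined in the SELECT list and then reused in WHERE.
query = """
    SELECT name, weight / 2.2046 AS Weight_kg
    FROM heroes
    WHERE Weight_kg < 100
    ORDER BY name
"""
light = con.execute(query).fetchall()
print(light)
```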

## Example 4 - Converting Weight to kilograms and filtering

Let's find all heroes with a weight under 100kg.

####  `pandas` + `dfply`

In [78]:
(heroes >>
   select_dfply(X.name, X.Gender, X.Weight) >>
   mutate(Weight_kg = X.Weight/2.2046) >>
   filter_by(X.Weight_kg < 100) >>
   head)

Unnamed: 0,name,Gender,Weight,Weight_kg
1,Abe Sapien,Male,65.0,29.483807
2,Abin Sur,Male,90.0,40.823732
4,Abraxas,Male,-99.0,-44.906105
5,Absorbing Man,Male,122.0,55.338837
6,Adam Monroe,Male,-99.0,-44.906105


#### `select`  expression in `sqlalchemy`

In [82]:
from sqlalchemy import column
new_col_stmt = (select_sql([Hero.name, 
                            Hero.gender, 
                            Hero.weight, 
                            (Hero.weight/2.2046).label('Weight_kg')]).
                  where(column('Weight_kg') < 100).
                  limit(5))
pd.read_sql_query(new_col_stmt, con=engine)

Unnamed: 0,name,gender,weight,Weight_kg
0,Abe Sapien,Male,65.0,29.483807
1,Abin Sur,Male,90.0,40.823732
2,Absorbing Man,Male,122.0,55.338837
3,Adam Strange,Male,88.0,39.916538
4,Agent 13,Female,61.0,27.669418
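
Notice that this result omits Abraxas, while the `pandas` result keeps him with a weight of -99.0. A plausible explanation, given the `None` weights in the earlier `sqlalchemy` output, is that the database stores missing weights as `NULL` while the CSV uses -99 as a placeholder. The sketch below, on a small in-memory `sqlite3` stand-in table, shows that a comparison against `NULL` is never true, so such rows drop out of a `WHERE`.

```python
import sqlite3

# Stand-in table: Abraxas has a missing (NULL) weight.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE heroes (name TEXT, weight REAL)")
con.executemany(
    "INSERT INTO heroes VALUES (?, ?)",
    [("Abe Sapien", 65.0), ("Abraxas", None), ("A-Bomb", 441.0)],
)

# NULL / 2.2046 is NULL, and NULL < 100 is not true,
# so the Abraxas row is filtered out along with the heavy A-Bomb.
under_100 = con.execute(
    "SELECT name, weight / 2.2046 AS Weight_kg "
    "FROM heroes WHERE weight / 2.2046 < 100"
).fetchall()
print(under_100)
```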


####  `pyspark`

In [83]:
from pyspark.sql.functions import col
new_col_result = (df_spark.
                   select(df_spark.name, df_spark.Gender, df_spark.Weight).
                   withColumn('Weight_kg', df_spark.Weight/2.2046).
                   where(col('Weight_kg') < 100 ).
                   take(5))
spark1.createDataFrame(new_col_result).toPandas()

Unnamed: 0,name,Gender,Weight,Weight_kg
0,Abe Sapien,Male,65.0,29.483807
1,Abin Sur,Male,90.0,40.823732
2,Abraxas,Male,-99.0,-44.906105
3,Absorbing Man,Male,122.0,55.338837
4,Adam Monroe,Male,-99.0,-44.906105


## <font color="red"> Exercise 2: Tall Heroes </font>

Create a query that

1. Selects the name, Gender, and Height columns
2. Computes the height in inches
3. Filters on height_in > 72

####  `pandas` + `dfply`

#### `query`  in `sqlalchemy`

#### `select`  expression in `sqlalchemy`

####  `pyspark`

# <font color="red"> TODO </font>

* More complicated mutations
    * Add many similar transforms with `**kwarg` unpacking (example below)

In [None]:
# Students will likely come up with a solution like this; we will fix this next.
# regex=True makes '[()]' a character class; recent pandas treats the pattern as literal text by default.
fix_parentheses = lambda df: (df >> 
                             mutate(
                                 ArtistBio = X.ArtistBio.str.replace('[()]', '', regex=True),
                                 Nationality = X.Nationality.str.replace('[()]', '', regex=True),
                                 BeginDate = X.BeginDate.str.replace('[()]', '', regex=True),
                                 EndDate = X.EndDate.str.replace('[()]', '', regex=True),
                                 Gender = X.Gender.str.replace('[()]', '', regex=True),
                             )
                            )
fix_parentheses(first_chuck).head()
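
One way the repetition above could be collapsed is to build the keyword arguments in a dictionary and unpack them with `**`. The sketch below uses plain `pandas` `assign` on a small hypothetical frame; the frame and the `columns_to_fix` list are assumptions for illustration. The same unpacking idea should carry over to `mutate`, though that is not tested here.

```python
import pandas as pd

# Hypothetical stand-in for a chunk of the artists data.
artists_small = pd.DataFrame({
    "ArtistBio": ["(American, 1930)", "(Spanish)"],
    "Nationality": ["(American)", "(Spanish)"],
})

columns_to_fix = ["ArtistBio", "Nationality"]

# One keyword argument per column, built with a dict comprehension.
# The default argument c=c binds each column name when the lambda is created.
fixes = {
    c: (lambda df, c=c: df[c].str.replace("[()]", "", regex=True))
    for c in columns_to_fix
}

# assign calls each function with the frame and stores the result under its key.
cleaned = artists_small.assign(**fixes)
print(cleaned)
```

Adding another column to clean then only requires extending `columns_to_fix`.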