# Select, Filter, and Mutate

In this lecture, we will look at three important actions used to process data frames.  While each framework uses different names for these functions, we will use the names from the `R` library `dplyr`, namely `select`, `mutate`, and `filter`.  The most important takeaway will be that, regardless of framework or scale, we can process data frames in the same way by applying the same sequence of data verbs.

## Set up

Let's read in the data set, shown here using `polars`.

In [2]:
import polars as pl
heroes = pl.read_csv('./data/heroes_information.csv')
heroes.head()

Unnamed: 0_level_0,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
i64,str,str,str,str,str,f64,str,str,str,f64
0,"""A-Bomb""","""Male""","""yellow""","""Human""","""No Hair""",203.0,"""Marvel Comics""","""-""","""good""",441.0
1,"""Abe Sapien""","""Male""","""blue""","""Icthyo Sapien""","""No Hair""",191.0,"""Dark Horse Com...","""blue""","""good""",65.0
2,"""Abin Sur""","""Male""","""blue""","""Ungaran""","""No Hair""",185.0,"""DC Comics""","""red""","""good""",90.0
3,"""Abomination""","""Male""","""green""","""Human / Radiat...","""No Hair""",203.0,"""Marvel Comics""","""-""","""bad""",441.0
4,"""Abraxas""","""Male""","""blue""","""Cosmic Entity""","""Black""",-99.0,"""Marvel Comics""","""-""","""bad""",-99.0


## Selecting Columns

The first verb, `select`

* filters the *columns*
* is at the core of `SQL` statements (the `SELECT` clause)

## How to select

* dot-chain (`.`) into `select`
* Provide a list of column expressions using `'column name'`, `df['column name']`, or `pl.col('column name')`

In [4]:
(heroes
 .select(['Eye color',
          pl.col('name'), 
          heroes['Gender'],
         ]
          )
 .head()
)

Eye color,name,Gender
str,str,str
"""yellow""","""A-Bomb""","""Male"""
"""blue""","""Abe Sapien""","""Male"""
"""blue""","""Abin Sur""","""Male"""
"""green""","""Abomination""","""Male"""
"""blue""","""Abraxas""","""Male"""


In [5]:
# Refactored
cols_to_keep = ['Eye color', pl.col('name'),  heroes['Gender']] 

(heroes
 .select(cols_to_keep)
 .head()
)

Eye color,name,Gender
str,str,str
"""yellow""","""A-Bomb""","""Male"""
"""blue""","""Abe Sapien""","""Male"""
"""blue""","""Abin Sur""","""Male"""
"""green""","""Abomination""","""Male"""
"""blue""","""Abraxas""","""Male"""


In [6]:
['Eye color', # Column name (type: string)
 pl.col('name'),  # Column expression (type: pl.Expr)
 heroes['Gender']] # Actual data (type: pl.Series)

['Eye color',
 <polars.internals.expr.expr.Expr at 0x122a1ecd0>,
 shape: (734,)
 Series: 'Gender' [str]
 [
 	"Male"
 	"Male"
 	"Male"
 	"Male"
 	"Male"
 	"Male"
 	"Male"
 	"Male"
 	"Female"
 	"Male"
 	"Male"
 	"Male"
 	...
 	"Male"
 	"Female"
 	"Female"
 	"Male"
 	"Female"
 	"Male"
 	"Male"
 	"Male"
 	"Female"
 	"Male"
 	"Male"
 	"Female"
 	"Male"
 ]]

## Filtering Rows

<img src="./img/filter.png">

The next verb, `filter` 

* filters the *rows*
* is related to the `SQL` `WHERE` clause

## How to filter

* dot (`.`) into `filter`
* First argument is a boolean expression
* Reference columns with `df['column_name']` or `pl.col('column name')`

In [8]:
(heroes 
 .filter(pl.col('Gender') == 'Male') 
 .head()
)

Unnamed: 0_level_0,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
i64,str,str,str,str,str,f64,str,str,str,f64
0,"""A-Bomb""","""Male""","""yellow""","""Human""","""No Hair""",203.0,"""Marvel Comics""","""-""","""good""",441.0
1,"""Abe Sapien""","""Male""","""blue""","""Icthyo Sapien""","""No Hair""",191.0,"""Dark Horse Com...","""blue""","""good""",65.0
2,"""Abin Sur""","""Male""","""blue""","""Ungaran""","""No Hair""",185.0,"""DC Comics""","""red""","""good""",90.0
3,"""Abomination""","""Male""","""green""","""Human / Radiat...","""No Hair""",203.0,"""Marvel Comics""","""-""","""bad""",441.0
4,"""Abraxas""","""Male""","""blue""","""Cosmic Entity""","""Black""",-99.0,"""Marvel Comics""","""-""","""bad""",-99.0


In [9]:
pl.col('Gender') == 'Male' # lazy Expr

## Chaining Data Verbs

* Processing a data frame $\rightarrow$ chaining data verbs
* Accomplished through pipes/dot-chains

## Example 1 - `select` + `filter`

In [10]:
(heroes 
 .filter(pl.col('Gender') == 'Male') 
 .select(['name', 'Gender', 'Weight'])
 .head()
)

name,Gender,Weight
str,str,f64
"""A-Bomb""","""Male""",441.0
"""Abe Sapien""","""Male""",65.0
"""Abin Sur""","""Male""",90.0
"""Abomination""","""Male""",441.0
"""Abraxas""","""Male""",-99.0


## Example 2 - `filter` + `filter`

Note that chaining `filter`s is an `and` operation.

In [11]:
(heroes
.select(['name', 'Gender', 'Weight'])
.filter(pl.col('Gender') == 'Male')
.filter(pl.col('Weight') > 0)
.head()
)

name,Gender,Weight
str,str,f64
"""A-Bomb""","""Male""",441.0
"""Abe Sapien""","""Male""",65.0
"""Abin Sur""","""Male""",90.0
"""Abomination""","""Male""",441.0
"""Absorbing Man""","""Male""",122.0


## <font color="red"> Exercise 2.2.1: Blue-eyed Heroes </font>

Create a query that

1. Selects the `name`, `Gender`, and `Eye color` columns
2. Filters on `Eye color` == 'blue'

In [12]:
# Your code here

## Constructing New Columns

The third verb, `mutate` 

* Creates new columns
* Changes existing columns

## Example 3 - Converting Weight to kilograms

Currently, the weight column is in pounds.  Let's convert to kilograms.

### Method 1 - Inside `select`

*  dot (`.`) into `select`
* Include one or more mutates in the `list` of expressions.
* Name the new column using `alias`

In [14]:
(heroes 
 .select(['name', 
          'Gender', 
          'Weight',
          (pl.col('Weight')/2.2046).alias('Weight_kg'),
         ]) 
 .head()
)

name,Gender,Weight,Weight_kg
str,str,f64,f64
"""A-Bomb""","""Male""",441.0,200.036288
"""Abe Sapien""","""Male""",65.0,29.483807
"""Abin Sur""","""Male""",90.0,40.823732
"""Abomination""","""Male""",441.0,200.036288
"""Abraxas""","""Male""",-99.0,-44.906105


### Method 2 - A single mutate using `with_column`

*  dot (`.`) into `with_column`
* First argument is a transformational expression
* Reference columns with `pl.col('column_name')` or `df['column name']`
* Name the new column using `alias`

In [15]:
(heroes 
 .select(['name', 
          'Gender', 
          'Weight'
         ]) 
 .with_column((pl.col('Weight')/2.2046).alias('Weight_kg')) 
 .head()
)

name,Gender,Weight,Weight_kg
str,str,f64,f64
"""A-Bomb""","""Male""",441.0,200.036288
"""Abe Sapien""","""Male""",65.0,29.483807
"""Abin Sur""","""Male""",90.0,40.823732
"""Abomination""","""Male""",441.0,200.036288
"""Abraxas""","""Male""",-99.0,-44.906105


## Example 4 - Converting Weight to grams and kilograms

Now, to illustrate performing multiple mutates, let's convert the weight to both kg and g.

### Method 1 - Inside `select`

*  dot (`.`) into `select`
* Include one or more mutates in the `list` of expressions.
* Name the new column using `alias`

In [16]:
(heroes 
 .select(['name', 
          'Gender', 
          'Weight',
          (pl.col('Weight')/2.2046).alias('Weight_kg'),
          (pl.col('Weight')/2.2046*1000).alias('Weight_g'),
         ]) 
 .head()
)

name,Gender,Weight,Weight_kg,Weight_g
str,str,f64,f64,f64
"""A-Bomb""","""Male""",441.0,200.036288,200036.287762
"""Abe Sapien""","""Male""",65.0,29.483807,29483.806586
"""Abin Sur""","""Male""",90.0,40.823732,40823.732196
"""Abomination""","""Male""",441.0,200.036288,200036.287762
"""Abraxas""","""Male""",-99.0,-44.906105,-44906.105416


### Method 2 - Separate mutates using `with_column`

*  dot (`.`) into `with_column`
* Reference columns with `pl.col('column_name')` or `df['column name']`
* Name the new column using `alias`

In [19]:
(heroes 
 .select(['name', 
          'Gender', 
          'Weight'
         ]) 
 .with_column((pl.col('Weight')/2.2046).alias('Weight_kg'))
 .with_column((pl.col('Weight')/2.2046*1000).alias('Weight_g'))
 .head()
)

name,Gender,Weight,Weight_kg,Weight_g
str,str,f64,f64,f64
"""A-Bomb""","""Male""",441.0,200.036288,200036.287762
"""Abe Sapien""","""Male""",65.0,29.483807,29483.806586
"""Abin Sur""","""Male""",90.0,40.823732,40823.732196
"""Abomination""","""Male""",441.0,200.036288,200036.287762
"""Abraxas""","""Male""",-99.0,-44.906105,-44906.105416


### Method 3 - Multiple mutates in a single `with_columns`

*  dot (`.`) into `with_columns`
* First argument is a `list` of transformational expressions
* Reference columns with `pl.col('column_name')` or `df['column name']`
* Name the new column using `alias`

In [20]:
(heroes 
 .select(['name', 
          'Gender', 
          'Weight'
         ]) 
 .with_columns([(pl.col('Weight')/2.2046).alias('Weight_kg'),
                (pl.col('Weight')/2.2046*1000).alias('Weight_g'),
               ]) 
 .head()
)

name,Gender,Weight,Weight_kg,Weight_g
str,str,f64,f64,f64
"""A-Bomb""","""Male""",441.0,200.036288,200036.287762
"""Abe Sapien""","""Male""",65.0,29.483807,29483.806586
"""Abin Sur""","""Male""",90.0,40.823732,40823.732196
"""Abomination""","""Male""",441.0,200.036288,200036.287762
"""Abraxas""","""Male""",-99.0,-44.906105,-44906.105416


### Making this more readable with keyword arguments

* Use `new_var = col_expr`.
* `new_var` must be a proper Python name
* Experimental; requires setting `pl.Config.with_columns_kwargs = True`

In [21]:
pl.Config.with_columns_kwargs = True

(heroes 
 .select(['name', 
          'Gender', 
          'Weight'
         ]) 
 .with_columns(Weight_kg = pl.col('Weight')/2.2046,
               Weight_g =  pl.col('Weight')/2.2046*1000,
               ) 
 .head()
)

name,Gender,Weight,Weight_kg,Weight_g
str,str,f64,f64,f64
"""A-Bomb""","""Male""",441.0,200.036288,200036.287762
"""Abe Sapien""","""Male""",65.0,29.483807,29483.806586
"""Abin Sur""","""Male""",90.0,40.823732,40823.732196
"""Abomination""","""Male""",441.0,200.036288,200036.287762
"""Abraxas""","""Male""",-99.0,-44.906105,-44906.105416


## Referencing a new column

* Newly created columns can't be referenced in the same `select` or `with_column(s)`.
* Use `pl.col('new_column')` to reference in later method calls.
* Add additional `with_column(s)` to use in new expressions.

In [22]:
pl.Config.with_columns_kwargs = True

(heroes 
 .select(['name', 
          'Gender', 
          'Weight'
         ]) 
 .with_columns(Weight_kg = pl.col('Weight')/2.2046)
 .with_columns(Weight_g =  pl.col('Weight_kg')*1000)
 .filter(pl.col('Weight_kg') < 100) 
 .head()
)

name,Gender,Weight,Weight_kg,Weight_g
str,str,f64,f64,f64
"""Abe Sapien""","""Male""",65.0,29.483807,29483.806586
"""Abin Sur""","""Male""",90.0,40.823732,40823.732196
"""Abraxas""","""Male""",-99.0,-44.906105,-44906.105416
"""Absorbing Man""","""Male""",122.0,55.338837,55338.836977
"""Adam Monroe""","""Male""",-99.0,-44.906105,-44906.105416


## <font color="red"> Exercise 2.2.2: Tall Heroes </font>

Create a query that

1. Selects the `name`, `Gender`, and `Height` columns
2. Computes the height in inches as a new column, `height_in`.
    * Check [here](https://www.kaggle.com/claudiodavi/superhero-set) to determine the current units.
3. Filters on `height_in` > 72

In [None]:
# Your code here