# Introduction to `pyspark.sql.DataFrame`s

Adapted from [Databricks' tutorial](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html)

## Installing a `pyspark` Anaconda virtual environment

Use Anaconda Navigator to create a virtual environment with the following packages installed:

#### `pyspark` Stuff
1. `openjdk` to install Java,
2. `pyspark` to install `spark` and `pyspark`, and
3. `findspark` to (possibly) deal with any issues finding `spark`.

#### The usual suspects - data management

1. `pandas`
2. `polars`
3. `pyarrow[all]`

#### The usual suspects - visualization and ML

1. `scikit-learn`
2. `seaborn`
3. `plotnine`
4. `altair`

In [1]:
# import the SparkSession class from the pyspark.sql module
from pyspark.sql import SparkSession

## What is spark?

* Built for the Hadoop platform
* Replacement of MapReduce
* Second-generation optimization
    * Lazy
    * Optimized
    * Persistent data structures
* Written in Scala
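
The "lazy" idea can be illustrated without `spark` at all. The sketch below is only an analogy using plain Python generators; `read_rows` and its made-up rows are hypothetical stand-ins for a data source, not part of `pyspark`. Building the pipeline does no work, and asking for results reads only as many rows as are needed.

```python
from itertools import islice

reads = []  # records which rows were actually pulled from the source


def read_rows(n=1_000_000):
    """Hypothetical data source that yields one made-up row at a time."""
    for i in range(n):
        reads.append(i)
        yield {"id": i, "eye_color": "yellow" if i % 3 == 0 else "blue"}


# Building the pipeline is lazy: nothing has been read yet.
yellow = (row for row in read_rows() if row["eye_color"] == "yellow")
assert reads == []

# Asking for results is eager, but only does as much work as required.
first_two = list(islice(yellow, 2))
print(first_two)
print(len(reads))  # far fewer than 1,000,000 rows were touched
```

The `filter`/`limit` versus `take`/`collect` cells later in this notebook follow the same split between describing work and running it.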

## Ok ... so what's Hadoop?

* Distributed computing platform
* [Used by lots of companies](https://wiki.apache.org/hadoop/PoweredBy)
* Becoming an industry standard


## What is `pyspark`?

* Python interface to spark
* Needs a spark session
    * `session` $\leftrightarrow$ spark


## Step 0 - Create a spark session

`pyspark` (Python) communicates with `spark` (JVM via Scala) through a session

In [2]:
spark = SparkSession.builder.appName('Ops').getOrCreate()

## Overview -  `pyspark.DataFrame`

* A `DataFrame` is a collection of `Row`s
* `Row`s can be distributed over many machines
* `spark`
    * Hides the messy details
    * Optimizes operations

## How to think about a `pyspark.DataFrame`

<img src="https://github.com/wsu-stat489/module5_intro_to_pyspark/blob/main/img/pyspark_df.png?raw=1" width=600>

## Reading a `csv` file with `spark.read.csv`

#### `read.csv` is lazy

In [3]:
(heroes := 
 spark.read.csv('./data/heroes_information.csv', header=True)
)

DataFrame[ID: string, name: string, Gender: string, Eye color: string, Race: string, Hair color: string, Height: string, Publisher: string, Skin color: string, Alignment: string, Weight: string]

## Example - `filter` and `collect`

#### `filter` is lazy

In [4]:
from pyspark.sql.functions import col

(heroes
 .filter(col('Eye color') == 'yellow')
)

DataFrame[ID: string, name: string, Gender: string, Eye color: string, Race: string, Hair color: string, Height: string, Publisher: string, Skin color: string, Alignment: string, Weight: string]

#### `limit` is lazy

In [5]:
from pyspark.sql.functions import col

(heroes
 .filter(col('Eye color') == 'yellow')
 .limit(5)
)

DataFrame[ID: string, name: string, Gender: string, Eye color: string, Race: string, Hair color: string, Height: string, Publisher: string, Skin color: string, Alignment: string, Weight: string]

#### `take` is eager

In [6]:
from pyspark.sql.functions import col

(heroes
 .filter(col('Eye color') == 'yellow')
 .take(5)
)

[Row(ID='0', name='A-Bomb', Gender='Male', Eye color='yellow', Race='Human', Hair color='No Hair', Height='203', Publisher='Marvel Comics', Skin color='-', Alignment='good', Weight='441'),
 Row(ID='24', name='Angel Dust', Gender='Female', Eye color='yellow', Race='Mutant', Hair color='Black', Height='165', Publisher='Marvel Comics', Skin color='-', Alignment='good', Weight='57'),
 Row(ID='31', name='Anti-Monitor', Gender='Male', Eye color='yellow', Race='God / Eternal', Hair color='No Hair', Height='61', Publisher='DC Comics', Skin color='-', Alignment='bad', Weight='-99'),
 Row(ID='56', name='Azazel', Gender='Male', Eye color='yellow', Race='Neyaphem', Hair color='Black', Height='183', Publisher='Marvel Comics', Skin color='red', Alignment='bad', Weight='67'),
 Row(ID='207', name='Darth Vader', Gender='Male', Eye color='yellow', Race='Cyborg', Hair color='No Hair', Height='198', Publisher='George Lucas', Skin color='-', Alignment='bad', Weight='135')]

#### `collect` is eager

In [7]:
from pyspark.sql.functions import col

(heroes
 .filter(col('Eye color') == 'yellow')
#  .limit(5)
 .collect()
)

[Row(ID='0', name='A-Bomb', Gender='Male', Eye color='yellow', Race='Human', Hair color='No Hair', Height='203', Publisher='Marvel Comics', Skin color='-', Alignment='good', Weight='441'),
 Row(ID='24', name='Angel Dust', Gender='Female', Eye color='yellow', Race='Mutant', Hair color='Black', Height='165', Publisher='Marvel Comics', Skin color='-', Alignment='good', Weight='57'),
 Row(ID='31', name='Anti-Monitor', Gender='Male', Eye color='yellow', Race='God / Eternal', Hair color='No Hair', Height='61', Publisher='DC Comics', Skin color='-', Alignment='bad', Weight='-99'),
 Row(ID='56', name='Azazel', Gender='Male', Eye color='yellow', Race='Neyaphem', Hair color='Black', Height='183', Publisher='Marvel Comics', Skin color='red', Alignment='bad', Weight='67'),
 Row(ID='207', name='Darth Vader', Gender='Male', Eye color='yellow', Race='Cyborg', Hair color='No Hair', Height='198', Publisher='George Lucas', Skin color='-', Alignment='bad', Weight='135'),
 Row(ID='209', name='Data', Gende

### Why is `pyspark` so slow?

* Optimized for 
    * Distributed computation
    * Big data 
* Not great for
    * Local work
    * Small data

### Why is `pyspark` so fast?

* Distributed nature $\longrightarrow$ leverage multi-core CPU,
* Data model can optimize data access via predicate/projection/slice pushdown,
* Lazy evaluation allows optimized memory usage (e.g., for a grouped aggregation), and
* Arrow allows FAST implementation of `pandas` user-defined functions (UDFs).

See [this article](https://www.databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html) for details.

## `filter` and `collect` illustrated

<img src="https://github.com/wsu-stat489/module5_intro_to_pyspark/blob/main/img/pyspark_filter_collect.gif?raw=1" width=600>

## Inspecting the column types

In [11]:
heroes.printSchema()

root
 |-- ID: string (nullable = true)
 |-- name: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Eye color: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Hair color: string (nullable = true)
 |-- Height: string (nullable = true)
 |-- Publisher: string (nullable = true)
 |-- Skin color: string (nullable = true)
 |-- Alignment: string (nullable = true)
 |-- Weight: string (nullable = true)



## Gathering results in `pyspark.sql`

* **Important fact** All `pyspark` queries end in a collection method
* **Why?** Data is (possibly) spread across many machines
* <font color = "red"> **Warning** This might be *expensive*! Know how much data you are requesting! </font>

## Gathering methods

Here are the default methods for gathering the results.

* `collect()` returns all rows
* `take(n)` returns the first `n` rows
* `sample(fraction)` is lazy: it keeps roughly that fraction of the rows at random, and still needs `collect`, `take`, or `toPandas` to gather them

**Note.** These are cumbersome, as they return a list of `Row`s :(

### Inspecting the content - `take`

In [12]:
heroes.take(5) # BAD!!!

[Row(ID='0', name='A-Bomb', Gender='Male', Eye color='yellow', Race='Human', Hair color='No Hair', Height='203', Publisher='Marvel Comics', Skin color='-', Alignment='good', Weight='441'),
 Row(ID='1', name='Abe Sapien', Gender='Male', Eye color='blue', Race='Icthyo Sapien', Hair color='No Hair', Height='191', Publisher='Dark Horse Comics', Skin color='blue', Alignment='good', Weight='65'),
 Row(ID='2', name='Abin Sur', Gender='Male', Eye color='blue', Race='Ungaran', Hair color='No Hair', Height='185', Publisher='DC Comics', Skin color='red', Alignment='good', Weight='90'),
 Row(ID='3', name='Abomination', Gender='Male', Eye color='green', Race='Human / Radiation', Hair color='No Hair', Height='203', Publisher='Marvel Comics', Skin color='-', Alignment='bad', Weight='441'),
 Row(ID='4', name='Abraxas', Gender='Male', Eye color='blue', Race='Cosmic Entity', Hair color='Black', Height='-99', Publisher='Marvel Comics', Skin color='-', Alignment='bad', Weight='-99')]

## Inspecting the whole table - `collect`

In [13]:
heroes.collect() # BAD!!!1!

[Row(ID='0', name='A-Bomb', Gender='Male', Eye color='yellow', Race='Human', Hair color='No Hair', Height='203', Publisher='Marvel Comics', Skin color='-', Alignment='good', Weight='441'),
 Row(ID='1', name='Abe Sapien', Gender='Male', Eye color='blue', Race='Icthyo Sapien', Hair color='No Hair', Height='191', Publisher='Dark Horse Comics', Skin color='blue', Alignment='good', Weight='65'),
 Row(ID='2', name='Abin Sur', Gender='Male', Eye color='blue', Race='Ungaran', Hair color='No Hair', Height='185', Publisher='DC Comics', Skin color='red', Alignment='good', Weight='90'),
 Row(ID='3', name='Abomination', Gender='Male', Eye color='green', Race='Human / Radiation', Hair color='No Hair', Height='203', Publisher='Marvel Comics', Skin color='-', Alignment='bad', Weight='441'),
 Row(ID='4', name='Abraxas', Gender='Male', Eye color='blue', Race='Cosmic Entity', Hair color='Black', Height='-99', Publisher='Marvel Comics', Skin color='-', Alignment='bad', Weight='-99'),
 Row(ID='5', name='Ab

## Converting to `pandas` using `pyarrow`

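The `toPandas` method collects the data and converts it to a `pandas` `DataFrame`. It needs `pandas` installed; `pyarrow` is optional, but can make the conversion much faster when Arrow-based transfer is enabled.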

#### Use `limit` to collect the head.

In [14]:
heroes.limit(5).toPandas() # Good!

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203,Marvel Comics,-,good,441
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191,Dark Horse Comics,blue,good,65
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185,DC Comics,red,good,90
3,3,Abomination,Male,green,Human / Radiation,No Hair,203,Marvel Comics,-,bad,441
4,4,Abraxas,Male,blue,Cosmic Entity,Black,-99,Marvel Comics,-,bad,-99


#### Use `sample` and `toPandas` to get a random sample.

In [15]:
(sample := 
 heroes
 .sample(fraction=0.01)
).toPandas()

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,175,Chuck Norris,Male,-,-,-,178,,-,good,-99
1,210,Dazzler,Female,blue,Mutant,Blond,173,Marvel Comics,-,good,52
2,256,Firebird,Female,brown,-,Black,165,Marvel Comics,-,good,56
3,298,Green Goblin,Male,blue,Human,Auburn,180,Marvel Comics,-,bad,83
4,331,Hulk,Male,green,Human / Radiation,Green,244,Marvel Comics,green,good,630
5,583,Scorpion,Male,brown,Human,Brown,211,Marvel Comics,-,bad,310
6,586,Shadow King,-,red,-,-,185,Marvel Comics,-,good,149
7,640,Storm,Female,blue,Mutant,White,180,Marvel Comics,-,good,57


#### Use `toPandas` to collect the whole table (careful...)

In [16]:
heroes.toPandas()

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203,Marvel Comics,-,good,441
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191,Dark Horse Comics,blue,good,65
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185,DC Comics,red,good,90
3,3,Abomination,Male,green,Human / Radiation,No Hair,203,Marvel Comics,-,bad,441
4,4,Abraxas,Male,blue,Cosmic Entity,Black,-99,Marvel Comics,-,bad,-99
...,...,...,...,...,...,...,...,...,...,...,...
729,729,Yellowjacket II,Female,blue,Human,Strawberry Blond,165,Marvel Comics,-,good,52
730,730,Ymir,Male,white,Frost Giant,No Hair,304.8,Marvel Comics,white,good,-99
731,731,Yoda,Male,brown,Yoda's species,White,66,George Lucas,green,good,17
732,732,Zatanna,Female,blue,Human,Black,170,DC Comics,-,good,57


## Houston, we have a problem! (Did you notice?)

<img src="https://github.com/wsu-stat489/module5_intro_to_pyspark/blob/main/img/pyspark_missing_values.png?raw=1" width=400>

### Specifying a `nullValue`

In [17]:
(heroes := 
 spark.read.csv('./data/heroes_information.csv', header=True, nullValue='-')
).limit(5).toPandas()

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203,Marvel Comics,,good,441
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191,Dark Horse Comics,blue,good,65
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185,DC Comics,red,good,90
3,3,Abomination,Male,green,Human / Radiation,No Hair,203,Marvel Comics,,bad,441
4,4,Abraxas,Male,blue,Cosmic Entity,Black,-99,Marvel Comics,,bad,-99


### Did you notice?

<img src="https://github.com/wsu-stat489/module5_intro_to_pyspark/blob/main/img/pyspark_default_types.png?raw=1" width=400>

Default type is a string

### Letting `spark` guess the types

Set `inferSchema=True`

In [18]:
(heroes := 
 spark.read.csv('./data/heroes_information.csv', header=True, inferSchema=True, nullValue='-')
)

DataFrame[ID: int, name: string, Gender: string, Eye color: string, Race: string, Hair color: string, Height: double, Publisher: string, Skin color: string, Alignment: string, Weight: int]

## Checking the column types after `inferSchema`

In this case, `spark` guessed correctly

In [19]:
heroes.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Eye color: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Hair color: string (nullable = true)
 |-- Height: double (nullable = true)
 |-- Publisher: string (nullable = true)
 |-- Skin color: string (nullable = true)
 |-- Alignment: string (nullable = true)
 |-- Weight: integer (nullable = true)



## Inspecting the content - `limit(5).toPandas()`

In [20]:
heroes.limit(5).toPandas()

Unnamed: 0,ID,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90
3,3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,,bad,441
4,4,Abraxas,Male,blue,Cosmic Entity,Black,-99.0,Marvel Comics,,bad,-99


## Explicit `schema` specification

Format is `add(name, type, nullable?)`

In [21]:
from pyspark.sql.types import StructType
from pyspark.sql.types import DoubleType, StringType, IntegerType

hero_schema = (StructType()
  .add('Id', IntegerType(), False)
  .add('name', StringType(), True)
  .add('Gender', StringType(), True)
  .add('Eye color', StringType(), True)
  .add('Race', StringType(), True)
  .add('Hair color', StringType(), True)
  .add('Height', DoubleType(), True)
  .add('Publisher', StringType(), True)
  .add('Skin color', StringType(), True)
  .add('Alignment', StringType(), True)
  .add('Weight', DoubleType(), True))

(heroes := 
 spark.read.csv('./data/heroes_information.csv', header=True, schema=hero_schema, nullValue='-')
).limit(5).toPandas()

Unnamed: 0,Id,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,,bad,441.0
4,4,Abraxas,Male,blue,Cosmic Entity,Black,-99.0,Marvel Comics,,bad,-99.0


## `pyspark.sql` queries are like `SQL` queries

#### Filter, group, and aggregate (categorical)

In [22]:
(heroes
.where(col('Gender') == 'Male')
.groupby('Eye color')
.count()
.limit(5)
).toPandas()

Unnamed: 0,Eye color,count
0,grey,6
1,green,30
2,yellow,16
3,bown,1
4,,121


#### Group by multiple and aggregate (categorical)

In [23]:
(heroes
 .groupby('Eye color', 'Gender')
 .count()
 .limit(5)
).toPandas()

Unnamed: 0,Eye color,Gender,count
0,yellow (without irises),,1
1,green,Male,30
2,violet,Female,2
3,hazel,Female,3
4,blue,Male,143


## <font color="red"> Exercise 4.2 </font>

First, define a `schema` and read in `./data/super_hero_powers.csv`, then perform `pyspark.sql` queries to answer each of the following questions.

1. How many heroes have both Super Strength and Super Speed?
2. How many heroes have names that start with the word *Black*?
3. Are heroes with Agility more likely to have Stealth?
4. What fraction of all heroes that can fly also have Super Strength?
5. Consider heroes that have names that contain `"girl"`, `"boy"`, `"woman"`, or `"man"`.  Compute the following ratio

$$\frac{N(\text{boy or man})}{N(\text{girl or woman})}$$

**Hint:** You will need to use some combination of `where`, `groupby`, and `count` for each part.

In [24]:
%%bash

ls -alG data | grep hero

-rwxrwxrwx 1 jb5983on     46276 Nov 11 21:30 heroes_information.csv
-rwxrwxrwx 1 jb5983on    672305 Nov 11 21:30 super_hero_powers.csv


In [31]:
from pyspark.sql.types import StructType
from pyspark.sql.types import BooleanType, StringType 

(powers := 
 spark.read.csv('./data/super_hero_powers.csv', header=True, inferSchema=True, nullValue='-')
)

powers.printSchema()

root
 |-- hero_names: string (nullable = true)
 |-- Agility: boolean (nullable = true)
 |-- Accelerated Healing: boolean (nullable = true)
 |-- Lantern Power Ring: boolean (nullable = true)
 |-- Dimensional Awareness: boolean (nullable = true)
 |-- Cold Resistance: boolean (nullable = true)
 |-- Durability: boolean (nullable = true)
 |-- Stealth: boolean (nullable = true)
 |-- Energy Absorption: boolean (nullable = true)
 |-- Flight: boolean (nullable = true)
 |-- Danger Sense: boolean (nullable = true)
 |-- Underwater breathing: boolean (nullable = true)
 |-- Marksmanship: boolean (nullable = true)
 |-- Weapons Master: boolean (nullable = true)
 |-- Power Augmentation: boolean (nullable = true)
 |-- Animal Attributes: boolean (nullable = true)
 |-- Longevity: boolean (nullable = true)
 |-- Intelligence: boolean (nullable = true)
 |-- Super Strength: boolean (nullable = true)
 |-- Cryokinesis: boolean (nullable = true)
 |-- Telepathy: boolean (nullable = true)
 |-- Energy Armor: boolea

In [32]:
powers.limit(5).toPandas()

Unnamed: 0,hero_names,Agility,Accelerated Healing,Lantern Power Ring,Dimensional Awareness,Cold Resistance,Durability,Stealth,Energy Absorption,Flight,...,Web Creation,Reality Warping,Odin Force,Symbiote Costume,Speed Force,Phoenix Force,Molecular Dissipation,Vision - Cryo,Omnipresent,Omniscient
0,3-D Man,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,A-Bomb,False,True,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,Abe Sapien,True,True,False,False,True,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,Abin Sur,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,Abomination,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [57]:
#1
(powers
.where((col('Super Strength') == True) & (col('Super Speed') == True))
.count()
)

# 219 superheroes

219

In [65]:
#2
(powers
.where(col('hero_names').startswith('Black'))
.count()
)

# 16 superheroes

16

In [64]:
#3
agility = (powers
           .where(col('Agility') == True)
           .count()
           )

agility_stealth = (powers
                  .where((col('Agility') == True) & 
                         (col('Stealth') == True))
                  .count()
                  )

likelihood = agility_stealth / agility

likelihood

# About 39% of superheroes with Agility also have Stealth.
# To judge "more likely", compare this with the Stealth rate among heroes without Agility.

0.3925619834710744

In [70]:
#4
fly = (powers
      .where(col('Flight') == True)
      .count()
      )

fly_strength = (powers
               .where((col('Flight') == True) & 
                      (col('Super Strength') == True))
               .count()
               )

fraction = fly_strength / fly

fraction

# 69% of superheroes that can fly also have super strength

0.6933962264150944

In [77]:
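#5
# Caveat: contains is case-sensitive, and 'man' also matches names
# containing 'woman', so these counts are approximate.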
man = (powers
      .where(((col('hero_names')).contains('boy')) | 
             ((col('hero_names')).contains('man')))
      .count()
      )

woman = (powers
      .where(((col('hero_names')).contains('girl')) | 
             ((col('hero_names')).contains('woman')))
      .count()
      )
       
ratio = man / woman

ratio

2.9
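
Substring matching deserves a second look here, because `'man'` is itself a substring of `'woman'` and the match is case-sensitive. The plain-Python sketch below shows the effect on a short, made-up list of names (not counts from the real file); the helpers `is_female` and `is_male` are hypothetical.

```python
# Made-up names for illustration only
names = ['Batman', 'Wonder Woman', 'Spider-Woman',
         'Superboy', 'Batgirl', 'Hawkwoman']

# Naive, case-sensitive matching (mirrors the contains-based query)
naive_male = [n for n in names if 'boy' in n or 'man' in n]
naive_female = [n for n in names if 'girl' in n or 'woman' in n]


def is_female(name):
    """Case-insensitive match on 'girl' or 'woman'."""
    lowered = name.lower()
    return 'girl' in lowered or 'woman' in lowered


def is_male(name):
    """Case-insensitive match on 'boy' or 'man', excluding female matches."""
    lowered = name.lower()
    return ('boy' in lowered or 'man' in lowered) and not is_female(name)


male = [n for n in names if is_male(n)]
female = [n for n in names if is_female(n)]

print(naive_male, naive_female)
print(male, female)
```

In `pyspark`, a similar effect could probably be had by lower-casing the column first with `pyspark.sql.functions.lower` and excluding the `'woman'` matches from the male count. Names such as those containing "human" would still need separate handling.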

# Appendix

## Creating a `Row` of data

* Use the `Row` class
* Pass data using keywords
    * key == column name
    * value == cell value

In [None]:
from pyspark.sql import Row  # Row is not imported by the earlier cells

department1 = Row(id='123456', name='Computer Science')
department1

## Unpacking a `Row` dictionary

* Data is in a row dictionary
* Unpack keywords using `**`

In [None]:
dept2_info = {'id':'789012', 'name':'Mechanical Engineering'}
department2 = Row(**dept2_info)
department2

## Unpacking a list of row dictionaries

In [None]:
dept_info = [{'id':123456, 'name':'Computer Science'},
             {'id':789012, 'name':'Mechanical Engineering'},
             {'id':345678, 'name':'Theater and Drama'},
             {'id':901234, 'name':'Indoor Recreation'}]

dept_rows = [Row(**r) for r in dept_info]
dept_rows

## Access `Row` content with column attributes

In [None]:
[dept.id for dept in dept_rows]

In [None]:
[dept.name for dept in dept_rows]

## Creating a `pyspark.DataFrame`

* A `DataFrame` is a collection of `Row`s
* Create with `spark.createDataFrame`
* Need to have a 

In [None]:
df = spark.createDataFrame(dept_rows)
df

## Creating rows from list of data

## Creating a Row class

* Pass `Row` the column names
* Creates a specialized `Row` class

In [None]:
Employee = Row("firstName", "lastName", "email", "salary")
Employee

## Creating an `Employee` instance

* Pass the data to `Employee` to make a row
* Order matters ... use the same order as names

In [None]:
Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('michael', 'armbrust', 'no-reply@berkeley.edu', 100000)
employee1

## Unpacking a data list

* Suppose the data is in a list/tuple.
* Use sequence unpacking with `*`

In [None]:
empl2_info = ('xiangrui', 'meng', 'no-reply@stanford.edu', 120000)
empl2_info

In [None]:
employee2 = Employee(*empl2_info)
employee2

## Unpacking a list of data tuples

In [None]:
# Create the Employees
Employee = Row("firstName", "lastName", "email", "salary")
employees = [('michael', 'armbrust', 'no-reply@berkeley.edu', 100000),
             ('xiangrui', 'meng', 'no-reply@stanford.edu', 120000),
             ('matei', None, 'no-reply@waterloo.edu', 140000),
             (None, 'wendell', 'no-reply@berkeley.edu', 160000)]
emp_rows = [Employee(*r) for r in employees]
emp_rows