#### Setting up Colab

To run this notebook on [Google's Colab](https://colab.research.google.com), you will need to perform the following steps.

#### Step 1. Install `pyspark`

Since `pyspark` isn't included in Colab's Python installation, you will need to install it each time you open this notebook.

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845513 sha256=088630ac1d09342807f684b3b323f769a79fd413a5c041f09844462078149657
  Stored in directory: /home/wavessurfer/.cache/pip/wheels/51/c8/18/298a4ced8ebb3ab8a7d26a7198c0cc7035abb906bde94a4c4b
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.1


In [2]:
!pip install composable



#### Step 2. Download and unzip the data

Next, download and unzip the data archive; this is the easiest way to access the module's data files.

In [3]:
!wget https://github.com/wsu-stat489/module5_intro_to_pyspark/raw/main/data.zip

--2022-10-25 12:10:27--  https://github.com/wsu-stat489/module5_intro_to_pyspark/raw/main/data.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/wsu-stat489/module5_intro_to_pyspark/main/data.zip [following]
--2022-10-25 12:10:27--  https://raw.githubusercontent.com/wsu-stat489/module5_intro_to_pyspark/main/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14977210 (14M) [application/zip]
Saving to: ‘data.zip.1’


2022-10-25 12:10:30 (4.64 MB/s) - ‘data.zip.1’ saved [14977210/14977210]



In [4]:
!unzip data.zip

Archive:  data.zip
  inflating: __MACOSX/._data         
  inflating: data/.DS_Store          
  inflating: __MACOSX/data/._.DS_Store  
replace data/Rochester_temps_2019.xlsx? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
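
The `unzip` run above stopped at an overwrite prompt and also unpacked macOS metadata (`__MACOSX`, `.DS_Store`). As a hedged alternative, here is a minimal sketch that uses Python's standard `zipfile` module to extract an archive without prompting and to skip that metadata. The helper name `extract_clean` and the tiny in-memory archive are made up for illustration; they are not part of the course materials.

```python
import io
import tempfile
import zipfile
from pathlib import Path

def extract_clean(zip_source, target_dir):
    """Extract an archive, skipping macOS metadata and overwriting existing files."""
    extracted = []
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if member.startswith("__MACOSX/") or Path(member).name == ".DS_Store":
                continue  # macOS residue, not course data
            archive.extract(member, target_dir)  # overwrites silently, no prompt
            extracted.append(member)
    return extracted

# Demonstration on a tiny in-memory archive (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("data/department.csv", "id,name\n")
    demo.writestr("data/.DS_Store", "")
    demo.writestr("__MACOSX/data/._.DS_Store", "")
buffer.seek(0)

with tempfile.TemporaryDirectory() as tmp:
    kept = extract_clean(buffer, tmp)
    department_exists = (Path(tmp) / "data" / "department.csv").exists()
```

In the notebook, the equivalent call would be `extract_clean("data.zip", ".")`, assuming the archive was saved under that name.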


In [5]:
!ls ./data

auto_sales_apr.csv	  Rochester_temps_2019.xlsx
auto_sales.csv		  super_hero_powers.csv
auto_sales_may.csv	  TB_bad.csv
baseball		  uber-raw-data-apr14-sample.csv
department.csv		  uber-raw-data-aug14-sample.csv
employee.csv		  uber-raw-data-jul14-sample.csv
health_survey.csv	  uber-raw-data-jun14-sample.csv
heroes_information.csv	  uber-raw-data-may14-sample.csv
PEW_income_religion.csv   uber-raw-data-sep14-sample.csv
Rochester_temps_2019.csv


In [6]:
!wget https://github.com/wsu-stat489/module5_intro_to_pyspark/raw/main/more_pyspark.py

--2022-10-25 12:15:34--  https://github.com/wsu-stat489/module5_intro_to_pyspark/raw/main/more_pyspark.py
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/wsu-stat489/module5_intro_to_pyspark/main/more_pyspark.py [following]
--2022-10-25 12:15:34--  https://raw.githubusercontent.com/wsu-stat489/module5_intro_to_pyspark/main/more_pyspark.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2786 (2.7K) [text/plain]
Saving to: ‘more_pyspark.py.1’


2022-10-25 12:15:34 (12.0 MB/s) - ‘more_pyspark.py.1’ saved [2786/2786]



# Introduction to `pyspark.sql.DataFrame`s

Adapted from [Databricks' tutorial](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html)

In [1]:
# Import the Row and SparkSession classes from the pyspark.sql module
from pyspark.sql import Row, SparkSession

## What is Spark?

* Built for the Hadoop platform
* Replacement for MapReduce
* Second-generation optimization
    * Lazy
    * Optimized
    * Persistent data structures
* Written in Scala
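
To make the "lazy" bullet concrete, here is a plain-Python analogy (not Spark code): a generator describes the filtering step but does no work until something consumes it, much as a Spark query waits for a gathering method. The names `lazy_filter`, `processed`, and `plan` are invented for this sketch.

```python
# Plain-Python analogy for lazy evaluation (this is not Spark code).
processed = []  # records which items have actually been examined

def lazy_filter(items, predicate):
    """Generator: describes the filtering but does no work until consumed."""
    for item in items:
        processed.append(item)
        if predicate(item):
            yield item

names = ["Computer Science", "Mechanical Engineering", "Chemistry"]
plan = lazy_filter(names, lambda name: name.startswith("C"))

processed_before = list(processed)  # still empty: nothing has run yet
result = list(plan)                 # consuming the generator triggers the work
```

Here `processed_before` is empty even though `plan` already exists; the items are only examined once `list(plan)` asks for the results.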

## Ok ... so what's Hadoop?

* Distributed computing platform
* [Used by lots of companies](https://wiki.apache.org/hadoop/PoweredBy)
* Becoming an industry standard


## What is `pyspark`?

* Python interface to spark
* Needs a spark session
    * `session` $\leftrightarrow$ spark


## Step 0 - Create a spark session

* `pyspark` communicates with `spark` through a session
* Similar to a `sqlalchemy` session

In [2]:
spark = SparkSession.builder.appName('Ops').getOrCreate()

22/10/27 16:32:15 WARN Utils: Your hostname, nn1448lr222 resolves to a loopback address: 127.0.1.1; using 172.17.25.238 instead (on interface eth0)
22/10/27 16:32:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/10/27 16:32:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/27 16:32:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/10/27 16:32:18 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
22/10/27 16:32:18 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


## Overview -  `pyspark.DataFrame`

* A `DataFrame` is a collection of `Row`s
* `Row`s can be distributed over many machines
* `spark`
    * Hides the messy details
    * Optimizes operations

## Creating a `Row` of data

* Use the `Row` class
* Pass data using keywords
    * key == column name
    * value == cell value

In [3]:
department1 = Row(id='123456', name='Computer Science')
department1

Row(id='123456', name='Computer Science')

## Unpacking a `Row` dictionary

* Data is in a row dictionary
* Unpack keywords using `**`

In [4]:
dept2_info = {'id':'789012', 'name':'Mechanical Engineering'}
department2 = Row(**dept2_info)
department2

Row(id='789012', name='Mechanical Engineering')

## Unpacking a list of row dictionaries

In [5]:
dept_info = [{'id':123456, 'name':'Computer Science'},
             {'id':789012, 'name':'Mechanical Engineering'},
             {'id':345678, 'name':'Theater and Drama'},
             {'id':901234, 'name':'Indoor Recreation'}]

dept_rows = [Row(**r) for r in dept_info]
dept_rows

[Row(id=123456, name='Computer Science'),
 Row(id=789012, name='Mechanical Engineering'),
 Row(id=345678, name='Theater and Drama'),
 Row(id=901234, name='Indoor Recreation')]

## Access `Row` content with column attributes

In [6]:
[dept.id for dept in dept_rows]

[123456, 789012, 345678, 901234]

In [7]:
[dept.name for dept in dept_rows]

['Computer Science',
 'Mechanical Engineering',
 'Theater and Drama',
 'Indoor Recreation']

## Creating a `pyspark.DataFrame`

* A `DataFrame` is a collection of `Row`s
* Create with `spark.createDataFrame`
* Need to have a `spark` session (see Step 0)

In [8]:
df = spark.createDataFrame(dept_rows)
df

DataFrame[id: bigint, name: string]

## How to think about a `pyspark.DataFrame`

<img src="https://github.com/wsu-stat489/module5_intro_to_pyspark/blob/main/img/pyspark_df.png?raw=1" width=600>

## Example - `filter` and `collect`

In [9]:
output = (df
            .filter(df.name.startswith('C'))
            .collect())
output

                                                                                

[Row(id=123456, name='Computer Science')]

## Why is `pyspark` so slow?

* Optimized for 
    * Distributed computation
    * Big data 
* Not great for
    * Local work
    * Small data

## `filter` and `collect` illustrated

<img src="https://github.com/wsu-stat489/module5_intro_to_pyspark/blob/main/img/pyspark_filter_collect.gif?raw=1" width=600>
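
The same idea can be mimicked in plain Python, as a rough sketch rather than a description of Spark's internals: each inner list stands in for the rows held by one machine, the filter runs on each partition separately, and `collect` gathers the survivors into one local list. The two-partition split and the helper names are assumptions made for illustration.

```python
# Plain-Python analogy: each inner list plays the role of one machine's partition.
partitions = [
    [{"id": 123456, "name": "Computer Science"},
     {"id": 789012, "name": "Mechanical Engineering"}],
    [{"id": 345678, "name": "Theater and Drama"},
     {"id": 901234, "name": "Indoor Recreation"}],
]

def filter_partition(rows, predicate):
    """Runs where the data lives; only matching rows survive."""
    return [row for row in rows if predicate(row)]

def collect(filtered_partitions):
    """Gathers the surviving rows from every partition into one local list."""
    return [row for part in filtered_partitions for row in part]

def starts_with_c(row):
    return row["name"].startswith("C")

output = collect(filter_partition(part, starts_with_c) for part in partitions)
```

Only the single matching row travels back to the caller, which is why filtering before collecting keeps the transfer small.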

## Reading a `csv` file with `spark.read.csv`

In [10]:
heros = spark.read.csv('./data/heroes_information.csv', header=True)
heros

DataFrame[_c0: string, name: string, Gender: string, Eye color: string, Race: string, Hair color: string, Height: string, Publisher: string, Skin color: string, Alignment: string, Weight: string]

## Inspecting the column types

In [11]:
heros.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- name: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Eye color: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Hair color: string (nullable = true)
 |-- Height: string (nullable = true)
 |-- Publisher: string (nullable = true)
 |-- Skin color: string (nullable = true)
 |-- Alignment: string (nullable = true)
 |-- Weight: string (nullable = true)



## Gathering results in `pyspark.sql`

* **Important fact** All `pyspark` queries end in a collection method
* **Why?** Data is (possibly) spread across many machines
* <font color = "red"> **Warning** This might be *expensive*! Know how much data you are requesting! </font>

## Gathering methods

* `collect()` returns all rows
* `take(n)` returns the first `n` rows
* `sample(fraction=f)` returns a new `DataFrame` holding roughly a fraction `f` of the rows, chosen at random; follow it with `collect` to gather them
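
Sampling takes a fraction rather than a row count, so the number of rows that come back varies from run to run. The sketch below is a plain-Python analogy, assuming each row is kept independently with the given probability; as far as I know this matches how Spark samples without replacement, but treat it as an illustration, not as Spark's implementation. The 1000-row stand-in is hypothetical.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))  # hypothetical stand-in for a table of 1000 rows
samples = [bernoulli_sample(rows, 0.01, seed=s) for s in range(5)]
sizes = [len(sample) for sample in samples]  # close to 10, but not always exactly 10
```

Each sample is a subset of the original rows in their original order; only the size is left to chance.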

## Inspecting the content - `take`

In [12]:
heros.take(5)

22/10/27 16:32:28 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
 Schema: _c0, name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
Expected: _c0 but found: 
CSV file: file:///home/wavessurfer/github-classroom/wsu-stat489/module-5-lectures-wavessurfer/data/heroes_information.csv


[Row(_c0='0', name='A-Bomb', Gender='Male', Eye color='yellow', Race='Human', Hair color='No Hair', Height='203.0', Publisher='Marvel Comics', Skin color='-', Alignment='good', Weight='441.0'),
 Row(_c0='1', name='Abe Sapien', Gender='Male', Eye color='blue', Race='Icthyo Sapien', Hair color='No Hair', Height='191.0', Publisher='Dark Horse Comics', Skin color='blue', Alignment='good', Weight='65.0'),
 Row(_c0='2', name='Abin Sur', Gender='Male', Eye color='blue', Race='Ungaran', Hair color='No Hair', Height='185.0', Publisher='DC Comics', Skin color='red', Alignment='good', Weight='90.0'),
 Row(_c0='3', name='Abomination', Gender='Male', Eye color='green', Race='Human / Radiation', Hair color='No Hair', Height='203.0', Publisher='Marvel Comics', Skin color='-', Alignment='bad', Weight='441.0'),
 Row(_c0='4', name='Abraxas', Gender='Male', Eye color='blue', Race='Cosmic Entity', Hair color='Black', Height='-99.0', Publisher='Marvel Comics', Skin color='-', Alignment='bad', Weight='-99.0

## Inspecting the content - `sample`

In [13]:
sample = heros.sample(fraction=0.01).collect()
sample

22/10/27 16:32:29 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
 Schema: _c0, name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
Expected: _c0 but found: 
CSV file: file:///home/wavessurfer/github-classroom/wsu-stat489/module-5-lectures-wavessurfer/data/heroes_information.csv


[Row(_c0='25', name='Angel Salvadore', Gender='Female', Eye color='brown', Race='-', Hair color='Black', Height='163.0', Publisher='Marvel Comics', Skin color='-', Alignment='good', Weight='54.0'),
 Row(_c0='75', name='Beast Boy', Gender='Male', Eye color='green', Race='Human', Hair color='Green', Height='173.0', Publisher='DC Comics', Skin color='green', Alignment='good', Weight='68.0'),
 Row(_c0='90', name='Birdman', Gender='Male', Eye color='-', Race='God / Eternal', Hair color='-', Height='-99.0', Publisher='Hanna-Barbera', Skin color='-', Alignment='good', Weight='-99.0'),
 Row(_c0='330', name='Howard the Duck', Gender='Male', Eye color='brown', Race='-', Hair color='Yellow', Height='79.0', Publisher='Marvel Comics', Skin color='-', Alignment='good', Weight='18.0'),
 Row(_c0='333', name='Huntress', Gender='Female', Eye color='blue', Race='-', Hair color='Black', Height='180.0', Publisher='DC Comics', Skin color='-', Alignment='good', Weight='59.0'),
 Row(_c0='346', name='Iron Mong

## Switching to a `pd.DataFrame` (because that was UGLY)

You can pipe the results into `more_pyspark.to_pandas` to get them as a `pandas` dataframe.

In [14]:
from more_pyspark import to_pandas

sample >> to_pandas

| | _c0 | name | Gender | Eye color | Race | Hair color | Height | Publisher | Skin color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Angel Salvadore | Female | brown | - | Black | 163.0 | Marvel Comics | - | good | 54.0 |
| 1 | 75 | Beast Boy | Male | green | Human | Green | 173.0 | DC Comics | green | good | 68.0 |
| 2 | 90 | Birdman | Male | - | God / Eternal | - | -99.0 | Hanna-Barbera | - | good | -99.0 |
| 3 | 330 | Howard the Duck | Male | brown | - | Yellow | 79.0 | Marvel Comics | - | good | 18.0 |
| 4 | 333 | Huntress | Female | blue | - | Black | 180.0 | DC Comics | - | good | 59.0 |
| 5 | 346 | Iron Monger | Male | blue | - | No Hair | -99.0 | Marvel Comics | - | bad | 2.0 |
| 6 | 355 | Jean Grey | Female | green | Mutant | Red | 168.0 | Marvel Comics | - | good | 52.0 |
| 7 | 425 | Magus | Male | black | - | - | 183.0 | Marvel Comics | - | bad | -99.0 |
| 8 | 449 | Micah Sanders | Male | brown | - | Black | -99.0 | NBC - Heroes | - | good | -99.0 |
| 9 | 454 | Misfit | Female | blue | - | Red | -99.0 | DC Comics | - | good | -99.0 |


## Getting all results with `collect`

<font color = "red"> **Warning** This might be *expensive*! Know how much data you are requesting! </font>

**The `collect` rule:** `count` before `collect`

In [15]:
heros.collect() >> to_pandas # <-- probably don't do this

22/10/27 16:32:29 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
 Schema: _c0, name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
Expected: _c0 but found: 
CSV file: file:///home/wavessurfer/github-classroom/wsu-stat489/module-5-lectures-wavessurfer/data/heroes_information.csv


| | _c0 | name | Gender | Eye color | Race | Hair color | Height | Publisher | Skin color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A-Bomb | Male | yellow | Human | No Hair | 203.0 | Marvel Comics | - | good | 441.0 |
| 1 | 1 | Abe Sapien | Male | blue | Icthyo Sapien | No Hair | 191.0 | Dark Horse Comics | blue | good | 65.0 |
| 2 | 2 | Abin Sur | Male | blue | Ungaran | No Hair | 185.0 | DC Comics | red | good | 90.0 |
| 3 | 3 | Abomination | Male | green | Human / Radiation | No Hair | 203.0 | Marvel Comics | - | bad | 441.0 |
| 4 | 4 | Abraxas | Male | blue | Cosmic Entity | Black | -99.0 | Marvel Comics | - | bad | -99.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 729 | 729 | Yellowjacket II | Female | blue | Human | Strawberry Blond | 165.0 | Marvel Comics | - | good | 52.0 |
| 730 | 730 | Ymir | Male | white | Frost Giant | No Hair | 304.8 | Marvel Comics | white | good | -99.0 |
| 731 | 731 | Yoda | Male | brown | Yoda's species | White | 66.0 | George Lucas | green | good | 17.0 |
| 732 | 732 | Zatanna | Female | blue | Human | Black | 170.0 | DC Comics | - | good | 57.0 |


In [16]:
heros.filter(heros['Eye Color'] == 'blue').collect() >> to_pandas # <-- better but still might be lots, let's check

22/10/27 16:32:30 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
 Schema: _c0, name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
Expected: _c0 but found: 
CSV file: file:///home/wavessurfer/github-classroom/wsu-stat489/module-5-lectures-wavessurfer/data/heroes_information.csv


| | _c0 | name | Gender | Eye color | Race | Hair color | Height | Publisher | Skin color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Abe Sapien | Male | blue | Icthyo Sapien | No Hair | 191.0 | Dark Horse Comics | blue | good | 65.0 |
| 1 | 2 | Abin Sur | Male | blue | Ungaran | No Hair | 185.0 | DC Comics | red | good | 90.0 |
| 2 | 4 | Abraxas | Male | blue | Cosmic Entity | Black | -99.0 | Marvel Comics | - | bad | -99.0 |
| 3 | 5 | Absorbing Man | Male | blue | Human | No Hair | 193.0 | Marvel Comics | - | bad | 122.0 |
| 4 | 6 | Adam Monroe | Male | blue | - | Blond | -99.0 | NBC - Heroes | - | good | -99.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 220 | 726 | X-Man | Male | blue | - | Brown | 175.0 | Marvel Comics | - | good | 61.0 |
| 221 | 727 | Yellow Claw | Male | blue | - | No Hair | 188.0 | Marvel Comics | - | bad | 95.0 |
| 222 | 728 | Yellowjacket | Male | blue | Human | Blond | 183.0 | Marvel Comics | - | good | 83.0 |
| 223 | 729 | Yellowjacket II | Female | blue | Human | Strawberry Blond | 165.0 | Marvel Comics | - | good | 52.0 |


In [17]:
# Iverson's Law -- Count before you collect!
heros.filter(heros['Eye Color'] == 'blue').count()

225
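
The "count before you collect" rule can be wrapped in a small guard. This is a minimal sketch: `collect_if_small` and its `max_rows` threshold are hypothetical names, and `FakeFrame` is only a stand-in exposing the `count` and `collect` methods so the example runs without Spark.

```python
def collect_if_small(df, max_rows=1000):
    """Count first; collect only when the result is small enough."""
    n_rows = df.count()
    if n_rows > max_rows:
        raise ValueError(
            f"{n_rows} rows exceeds the limit of {max_rows}; filter or take() instead"
        )
    return df.collect()

class FakeFrame:
    """Stand-in exposing the two methods used above (not a Spark class)."""
    def __init__(self, rows):
        self._rows = list(rows)

    def count(self):
        return len(self._rows)

    def collect(self):
        return list(self._rows)

small = collect_if_small(FakeFrame(range(225)))  # under the limit, so rows come back
try:
    collect_if_small(FakeFrame(range(5000)))     # over the limit, so the guard refuses
    refused = False
except ValueError:
    refused = True
```

With a real `DataFrame`, the count is itself a query, so the guard trades one extra pass over the data for protection against pulling back far more rows than expected.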

## Did you notice?

<img src="https://github.com/wsu-stat489/module5_intro_to_pyspark/blob/main/img/pyspark_missing_values.png?raw=1" width=400>

## Specifying a `nullValue`

In [18]:
heros = spark.read.csv('./data/heroes_information.csv', header=True, nullValue='-')
heros.take(5) >> to_pandas

22/10/27 16:32:31 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: -, name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
 Schema: _c0, name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
Expected: _c0 but found: -
CSV file: file:///home/wavessurfer/github-classroom/wsu-stat489/module-5-lectures-wavessurfer/data/heroes_information.csv


| | _c0 | name | Gender | Eye color | Race | Hair color | Height | Publisher | Skin color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A-Bomb | Male | yellow | Human | No Hair | 203.0 | Marvel Comics | | good | 441.0 |
| 1 | 1 | Abe Sapien | Male | blue | Icthyo Sapien | No Hair | 191.0 | Dark Horse Comics | blue | good | 65.0 |
| 2 | 2 | Abin Sur | Male | blue | Ungaran | No Hair | 185.0 | DC Comics | red | good | 90.0 |
| 3 | 3 | Abomination | Male | green | Human / Radiation | No Hair | 203.0 | Marvel Comics | | bad | 441.0 |
| 4 | 4 | Abraxas | Male | blue | Cosmic Entity | Black | -99.0 | Marvel Comics | | bad | -99.0 |


## Did you notice?

<img src="https://github.com/wsu-stat489/module5_intro_to_pyspark/blob/main/img/pyspark_default_types.png?raw=1" width=400>

Default type is a string
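
Why the string default matters: text compares character by character, so numeric columns read as strings sort and aggregate incorrectly. A small plain-Python illustration using the `Weight` values from the first five rows shown above:

```python
weights_as_text = ["441.0", "65.0", "90.0", "441.0", "-99.0"]
weights_as_numbers = [float(w) for w in weights_as_text]

largest_text = max(weights_as_text)        # '90.0' because '9' sorts after '4' and '6'
largest_number = max(weights_as_numbers)   # 441.0
sorted_text = sorted(weights_as_text)      # ['-99.0', '441.0', '441.0', '65.0', '90.0']
```

The same pitfall applies to any comparison or ordering done on a numeric column that was read in as text.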

## Letting `spark` guess the types

Set `inferSchema=True`

In [19]:
heros = spark.read.csv('./data/heroes_information.csv', header=True, inferSchema=True, nullValue='-')
heros

DataFrame[_c0: int, name: string, Gender: string, Eye color: string, Race: string, Hair color: string, Height: double, Publisher: string, Skin color: string, Alignment: string, Weight: double]

## Checking the column types after `inferSchema`

In this case, `spark` guessed correctly.

In [20]:
heros.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Eye color: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Hair color: string (nullable = true)
 |-- Height: double (nullable = true)
 |-- Publisher: string (nullable = true)
 |-- Skin color: string (nullable = true)
 |-- Alignment: string (nullable = true)
 |-- Weight: double (nullable = true)



## Inspecting the content - `take`

In [21]:
heros.take(5) >> to_pandas

22/10/27 16:32:32 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: -, name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
 Schema: _c0, name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
Expected: _c0 but found: -
CSV file: file:///home/wavessurfer/github-classroom/wsu-stat489/module-5-lectures-wavessurfer/data/heroes_information.csv


| | _c0 | name | Gender | Eye color | Race | Hair color | Height | Publisher | Skin color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A-Bomb | Male | yellow | Human | No Hair | 203.0 | Marvel Comics | | good | 441.0 |
| 1 | 1 | Abe Sapien | Male | blue | Icthyo Sapien | No Hair | 191.0 | Dark Horse Comics | blue | good | 65.0 |
| 2 | 2 | Abin Sur | Male | blue | Ungaran | No Hair | 185.0 | DC Comics | red | good | 90.0 |
| 3 | 3 | Abomination | Male | green | Human / Radiation | No Hair | 203.0 | Marvel Comics | | bad | 441.0 |
| 4 | 4 | Abraxas | Male | blue | Cosmic Entity | Black | -99.0 | Marvel Comics | | bad | -99.0 |


## Explicit `schema` specification

Format is `add(name, type, nullable?)`

In [22]:
from pyspark.sql.types import StructType
from pyspark.sql.types import DoubleType, StringType, IntegerType

hero_schema = (StructType()
  .add('Id', IntegerType(), False)
  .add('name', StringType(), True)
  .add('Gender', StringType(), True)
  .add('Eye color', StringType(), True)
  .add('Race', StringType(), True)
  .add('Hair color', StringType(), True)
  .add('Height', DoubleType(), True)
  .add('Publisher', StringType(), True)
  .add('Skin color', StringType(), True)
  .add('Alignment', StringType(), True)
  .add('Weight', DoubleType(), True))

heros = spark.read.csv('./data/heroes_information.csv', header=True, schema=hero_schema, nullValue='-')
heros

DataFrame[Id: int, name: string, Gender: string, Eye color: string, Race: string, Hair color: string, Height: double, Publisher: string, Skin color: string, Alignment: string, Weight: double]

In [23]:
heros.take(5)

22/10/27 16:32:32 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: -, name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
 Schema: Id, name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight
Expected: Id but found: -
CSV file: file:///home/wavessurfer/github-classroom/wsu-stat489/module-5-lectures-wavessurfer/data/heroes_information.csv


[Row(Id=0, name='A-Bomb', Gender='Male', Eye color='yellow', Race='Human', Hair color='No Hair', Height=203.0, Publisher='Marvel Comics', Skin color=None, Alignment='good', Weight=441.0),
 Row(Id=1, name='Abe Sapien', Gender='Male', Eye color='blue', Race='Icthyo Sapien', Hair color='No Hair', Height=191.0, Publisher='Dark Horse Comics', Skin color='blue', Alignment='good', Weight=65.0),
 Row(Id=2, name='Abin Sur', Gender='Male', Eye color='blue', Race='Ungaran', Hair color='No Hair', Height=185.0, Publisher='DC Comics', Skin color='red', Alignment='good', Weight=90.0),
 Row(Id=3, name='Abomination', Gender='Male', Eye color='green', Race='Human / Radiation', Hair color='No Hair', Height=203.0, Publisher='Marvel Comics', Skin color=None, Alignment='bad', Weight=441.0),
 Row(Id=4, name='Abraxas', Gender='Male', Eye color='blue', Race='Cosmic Entity', Hair color='Black', Height=-99.0, Publisher='Marvel Comics', Skin color=None, Alignment='bad', Weight=-99.0)]

## <font color="red"> Exercise 1 </font>

Define a `schema` and read in `./data/super_hero_powers.csv`

In [24]:
from pyspark.sql.types import *

nameglob = ("hero_names,Agility,Accelerated Healing,Lantern Power Ring,Dimensional Awareness,Cold Resistance,Durability,Stealth,Energy Absorption,Flight,Danger Sense,Underwater breathing,Marksmanship,Weapons Master,Power Augmentation,Animal Attributes,Longevity,Intelligence,Super Strength,Cryokinesis,Telepathy,Energy Armor,Energy Blasts,Duplication,Size Changing,Density Control,Stamina,Astral Travel,Audio Control,Dexterity,Omnitrix,Super Speed,Possession,Animal Oriented Powers,Weapon-based Powers,Electrokinesis,Darkforce Manipulation,Death Touch,Teleportation,Enhanced Senses,Telekinesis,Energy Beams,Magic,Hyperkinesis,Jump,Clairvoyance,Dimensional Travel,Power Sense,Shapeshifting,Peak Human Condition,Immortality,Camouflage,Element Control,Phasing,Astral Projection,Electrical Transport,Fire Control,Projection,Summoning,Enhanced Memory,Reflexes,Invulnerability,Energy Constructs,Force Fields,Self-Sustenance,Anti-Gravity,Empathy,Power Nullifier,Radiation Control,Psionic Powers,Elasticity,Substance Secretion,Elemental Transmogrification,Technopath/Cyberpath,Photographic Reflexes,Seismic Power,Animation,Precognition,Mind Control,Fire Resistance,Power Absorption,Enhanced Hearing,Nova Force,Insanity,Hypnokinesis,Animal Control,Natural Armor,Intangibility,Enhanced Sight,Molecular Manipulation,Heat Generation,Adaptation,Gliding,Power Suit,Mind Blast,Probability Manipulation,Gravity Control,Regeneration,Light Control,Echolocation,Levitation,Toxin and Disease Control,Banish,Energy Manipulation,Heat Resistance,Natural Weapons,Time Travel,Enhanced Smell,Illusions,Thirstokinesis,Hair Manipulation,Illumination,Omnipotent,Cloaking,Changing Armor,Power Cosmic,Biokinesis,Water Control,Radiation Immunity,Vision - Telescopic,Toxin and Disease Resistance,Spatial Awareness,Energy Resistance,Telepathy Resistance,Molecular Combustion,Omnilingualism,Portal Creation,Magnetism,Mind Control Resistance,Plant Control,Sonar,Sonic Scream,Time Manipulation,Enhanced Touch,Magic "
            "Resistance,Invisibility,Sub-Mariner,Radiation Absorption,Intuitive aptitude,Vision - Microscopic,Melting,Wind Control,Super Breath,Wallcrawling,Vision - Night,Vision - Infrared,Grim Reaping,Matter Absorption,The Force,Resurrection,Terrakinesis,Vision - Heat,Vitakinesis,Radar Sense,Qwardian Power Ring,Weather Control,Vision - X-Ray,Vision - Thermal,Web Creation,Reality Warping,Odin Force,Symbiote Costume,Speed Force,Phoenix Force,Molecular Dissipation,Vision - Cryo,Omnipresent,Omniscient")
colnames = nameglob.split(",")
len(colnames)

168

In [25]:
my_schema = StructType()
my_schema.add("hero_names", StringType(), False)

StructType([StructField('hero_names', StringType(), False)])

In [26]:
power_schema = StructType().add("hero_names", StringType(), False)

for col in colnames[1:]:
    power_schema.add(col,BooleanType(), False)

In [27]:
power_schema[-10:]

StructType([StructField('Web Creation', BooleanType(), False), StructField('Reality Warping', BooleanType(), False), StructField('Odin Force', BooleanType(), False), StructField('Symbiote Costume', BooleanType(), False), StructField('Speed Force', BooleanType(), False), StructField('Phoenix Force', BooleanType(), False), StructField('Molecular Dissipation', BooleanType(), False), StructField('Vision - Cryo', BooleanType(), False), StructField('Omnipresent', BooleanType(), False), StructField('Omniscient', BooleanType(), False)])

In [28]:
# With 168 columns to define, inferring the schema would take less typing; here we use the schema built in the loop above

powers = spark.read.csv('./data/super_hero_powers.csv', header=True, schema=power_schema)
powers.take(1) >> to_pandas

22/10/27 16:32:32 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


| | hero_names | Agility | Accelerated Healing | Lantern Power Ring | Dimensional Awareness | Cold Resistance | Durability | Stealth | Energy Absorption | Flight | ... | Web Creation | Reality Warping | Odin Force | Symbiote Costume | Speed Force | Phoenix Force | Molecular Dissipation | Vision - Cryo | Omnipresent | Omniscient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3-D Man | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


In [29]:
powers.printSchema()

root
 |-- hero_names: string (nullable = true)
 |-- Agility: boolean (nullable = true)
 |-- Accelerated Healing: boolean (nullable = true)
 |-- Lantern Power Ring: boolean (nullable = true)
 |-- Dimensional Awareness: boolean (nullable = true)
 |-- Cold Resistance: boolean (nullable = true)
 |-- Durability: boolean (nullable = true)
 |-- Stealth: boolean (nullable = true)
 |-- Energy Absorption: boolean (nullable = true)
 |-- Flight: boolean (nullable = true)
 |-- Danger Sense: boolean (nullable = true)
 |-- Underwater breathing: boolean (nullable = true)
 |-- Marksmanship: boolean (nullable = true)
 |-- Weapons Master: boolean (nullable = true)
 |-- Power Augmentation: boolean (nullable = true)
 |-- Animal Attributes: boolean (nullable = true)
 |-- Longevity: boolean (nullable = true)
 |-- Intelligence: boolean (nullable = true)
 |-- Super Strength: boolean (nullable = true)
 |-- Cryokinesis: boolean (nullable = true)
 |-- Telepathy: boolean (nullable = true)
 |-- Energy Armor: boolea

## `pyspark.sql` queries are like `SQL` queries

#### Filter, group, and aggregate (categorical)

In [30]:
(heros
     .where(heros.Gender == 'Male')
     .groupby(heros['Eye color'])
     .count()
     .take(5)
) >> to_pandas

| | Eye color | count |
|---|---|---|
| 0 | grey | 6 |
| 1 | green | 30 |
| 2 | yellow | 16 |
| 3 | bown | 1 |
| 4 | | 121 |


#### Group by multiple and aggregate (categorical)

In [31]:
(heros
     .groupby(heros['Eye color'], heros.Gender)
     .count()
     .take(5)
) >> to_pandas

| | Eye color | Gender | count |
|---|---|---|---|
| 0 | yellow (without irises) | | 1 |
| 1 | green | Male | 30 |
| 2 | violet | Female | 2 |
| 3 | hazel | Female | 3 |
| 4 | blue | Male | 143 |


## <font color="red"> Exercise 2 </font>
    
Perform `pyspark.sql` queries to answer each of the following questions.

1. How many heroes have both Super Strength and Super Speed?
2. How many heroes have names that start with the word *Black*?
3. Are heroes with Agility more likely to have Stealth?
4. What fraction of all heroes that can fly also have Super Strength?
5. Consider heroes that have names that contain `"girl"`, `"boy"`, `"woman"`, or `"man"`.  Compute the following ratio:

$$\frac{N(\text{boy or man})}{N(\text{girl or woman})}$$

**Hint:** You will need to use some combination of `where`, `groupby`, and `count` for each part.

In [32]:
(powers
     .where( powers['Super Strength'] == True)
     .where( powers['Super Speed'] == True)
     .groupby(powers['Super Speed'] , powers['Super Strength'])
     .count()
     .take(5)
) >> to_pandas

| | Super Speed | Super Strength | count |
|---|---|---|---|
| 0 | True | True | 219 |


In [33]:
# 2. How many heroes have names that start with the word *Black*
#16 rows = 16 superheroes
(powers
     .filter(powers.hero_names.startswith('Black'))
     .collect()
)  >> to_pandas

| | hero_names | Agility | Accelerated Healing | Lantern Power Ring | Dimensional Awareness | Cold Resistance | Durability | Stealth | Energy Absorption | Flight | ... | Web Creation | Reality Warping | Odin Force | Symbiote Costume | Speed Force | Phoenix Force | Molecular Dissipation | Vision - Cryo | Omnipresent | Omniscient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Black Abbott | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | Black Adam | False | True | False | False | False | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | Black Bolt | True | False | False | False | False | True | False | True | True | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | Black Canary | True | False | False | False | False | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | Black Cat | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 5 | Black Flash | True | False | False | False | False | False | True | False | True | ... | False | False | False | False | False | False | False | False | False | False |
| 6 | Black Knight III | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 7 | Black Lightning | False | False | False | False | False | False | False | True | True | ... | False | False | False | False | False | False | False | False | False | False |
| 8 | Black Mamba | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 9 | Black Manta | True | False | False | False | True | True | True | False | False | ... | False | False | False | False | False | False | False | False | False | False |


In [34]:
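# 3. Are heroes with Agility more likely to have Stealth?
# NOTE: the flag below is True when a hero has Agility OR Stealth, so the counts show
# how many heroes have at least one of the two powers. Answering the question needs the
# Stealth rate among heroes with Agility compared to the rate among heroes without it
# (for example, by grouping on both columns).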

from pyspark.sql import functions as F

(powers
     .select(powers['Agility'] , powers['Stealth'])
     .withColumn('Agility and Stealth', F.when(powers['Agility'] == True, True)
                                           .when(powers['Stealth'] == True, True)
                                           .otherwise(False))
     .groupby('Agility and Stealth')
     .count()
     .collect()
) >> to_pandas

| | Agility and Stealth | count |
|---|---|---|
| 0 | True | 273 |
| 1 | False | 394 |


In [35]:
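# 4. What fraction of all heroes that can fly also have Super Strength?
# NOTE: the flag below is True when a hero has Flight OR Super Strength, so the 63%
# computed from these counts is the share of all heroes with at least one of the two
# powers, not the share of flying heroes that also have Super Strength.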

(powers
     .withColumn('Flight and Super Strength', F.when(powers['Flight'] == True, True)
                                           .when(powers['Super Strength'] == True, True)
                                           .otherwise(False))
     .groupby('Flight and Super Strength')
     .count()
     .collect()
) >> to_pandas

| | Flight and Super Strength | count |
|---|---|---|
| 0 | True | 425 |
| 1 | False | 242 |


In [36]:
x = 425 / (425 + 242)
x

0.6371814092953523
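
The question itself asks for a conditional share: among heroes with Flight, what fraction also have Super Strength. A minimal sketch of that arithmetic in plain Python follows. It uses a small made-up list of `(flight, super_strength)` pairs rather than the real `powers` data, and both the helper `conditional_share` and the sample values are illustrative assumptions.

```python
def conditional_share(pairs, given_index=0, target_index=1):
    """Fraction of rows with the target flag among rows where the given flag is True."""
    given_rows = [p for p in pairs if p[given_index]]
    if not given_rows:
        raise ValueError("no rows satisfy the condition")
    both = sum(1 for p in given_rows if p[target_index])
    return both / len(given_rows)


# Hypothetical (flight, super_strength) flags for six heroes; not taken from the dataset.
sample = [
    (True, True),
    (True, False),
    (True, True),
    (False, True),
    (False, False),
    (True, True),
]

# Share of fliers that also have super strength: 3 of the 4 fliers.
share_of_fliers = conditional_share(sample)

# Share of all heroes with at least one of the two powers: 5 of 6.
either = sum(1 for f, s in sample if f or s) / len(sample)

print(share_of_fliers)
print(either)
```

The two numbers differ even on this toy sample, which is why the denominator matters: the first divides by the number of fliers, the second by the number of all heroes.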

In [54]:
#5. Consider heroes that have names that contain "girl", "boy", "woman", or "man".
from more_pyspark import to_pandas
from pyspark.sql.functions import column, col

(powers
     # Inside [...] the "|" is a literal character, so the classes are written [Mm] and [Bb].
     # Caveat: "man" also matches inside "Woman", which is why some names match both patterns.
     .withColumn('boy or man', F.when(powers.hero_names.rlike(r'[Mm]an|[Bb]oy') == True, True)
                                .otherwise(False))
     .withColumn('girl or woman', F.when(powers.hero_names.rlike(r'[Ww]oman|[Gg]irl') == True, True)
                                   .otherwise(False))
     .select('hero_names', 'boy or man', 'girl or woman')
     .groupby('boy or man', 'girl or woman')
     .count()
     .collect()
) >> to_pandas

Unnamed: 0,boy or man,girl or woman,count
0,True,False,54
1,True,True,8
2,False,False,587
3,False,True,18


In [51]:
# List the heroes whose names match both patterns.
(powers
     .withColumn('boy or man', F.when(powers.hero_names.rlike(r'[Mm]an|[Bb]oy') == True, True)
                                .otherwise(False))
     .withColumn('girl or woman', F.when(powers.hero_names.rlike(r'[Ww]oman|[Gg]irl') == True, True)
                                   .otherwise(False))
     .select('hero_names', 'boy or man', 'girl or woman')
     .where(col('girl or woman') == True)
     .where(col('boy or man') == True)
     .collect()
) >> to_pandas

Unnamed: 0,hero_names,boy or man,girl or woman
0,Batwoman V,True,True
1,Bionic Woman,True,True
2,Catwoman,True,True
3,Invisible Woman,True,True
4,Spider-Woman,True,True
5,Spider-Woman III,True,True
6,Spider-Woman IV,True,True
7,Wonder Woman,True,True
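
Every name in this output contains "woman", which suggests the boy/man pattern is matching the "man" inside "Woman". The sketch below reproduces that with Python's `re` module on a short illustrative list of names. Spark's `rlike` uses Java regular expressions, which are assumed to treat these simple patterns the same way. The `stricter` pattern is one possible alternative, not something the notebook uses.

```python
import re

# Illustrative names only; the real column is powers.hero_names.
names = ["Batman", "Wonder Man", "Superboy", "Catwoman", "Wonder Woman", "Batgirl"]

original = re.compile(r"[M|m]an|[B|b]oy")          # pattern used in the notebook
stricter = re.compile(r"(?<![Ww]o)[Mm]an|[Bb]oy")  # assumed alternative: skip "man" preceded by "wo"

original_hits = [n for n in names if original.search(n)]
stricter_hits = [n for n in names if stricter.search(n)]

print(original_hits)  # includes Catwoman and Wonder Woman
print(stricter_hits)  # only Batman, Wonder Man, Superboy

# Inside brackets "|" is a literal character, so "[M|m]an" also matches "|an".
pipe_matches = bool(original.search("x|an"))
print(pipe_matches)
```

The lookbehind is a narrow fix: names such as "Mantis" would still match, so a real analysis might need word boundaries or an explicit list of suffixes.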


In [39]:
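# 54 names match only boy/man. The 8 names that match both patterns are all
# "...woman" names, so they are counted as female together with the 18 girl/woman-only names.
male_over_female_ratio = 54 / (8 + 18)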
male_over_female_ratio

2.076923076923077

In [40]:
# Pattern: df.filter(df['name'].rlike(expression))
# Inside [...] the "|" is a literal character, so the classes are written [Ww] and [Gg].
(powers
 .filter(powers.hero_names.rlike(r'[Ww]oman|[Gg]irl'))
 .collect()
) >> to_pandas

Unnamed: 0,hero_names,Agility,Accelerated Healing,Lantern Power Ring,Dimensional Awareness,Cold Resistance,Durability,Stealth,Energy Absorption,Flight,...,Web Creation,Reality Warping,Odin Force,Symbiote Costume,Speed Force,Phoenix Force,Molecular Dissipation,Vision - Cryo,Omnipresent,Omniscient
0,Atom Girl,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,Batgirl,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
2,Batgirl IV,True,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
3,Batgirl VI,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,Batwoman V,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
5,Bionic Woman,True,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,Bizarro-Girl,True,True,False,False,False,True,False,True,True,...,False,False,False,False,False,False,False,True,False,False
7,Catwoman,True,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
8,Elastigirl,True,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,Hawkgirl,False,True,False,False,True,True,False,False,True,...,False,False,False,False,False,False,False,False,False,False


In [41]:
# Pattern: df.filter(df['name'].rlike(expression))
# Note: "man" also matches inside "Woman", so names such as Wonder Woman appear here too.
(powers
 .filter(powers.hero_names.rlike(r'[Mm]an|[Bb]oy'))
 .collect()
) >> to_pandas

Unnamed: 0,hero_names,Agility,Accelerated Healing,Lantern Power Ring,Dimensional Awareness,Cold Resistance,Durability,Stealth,Energy Absorption,Flight,...,Web Creation,Reality Warping,Odin Force,Symbiote Costume,Speed Force,Phoenix Force,Molecular Dissipation,Vision - Cryo,Omnipresent,Omniscient
0,3-D Man,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,Absorbing Man,False,False,False,False,True,True,False,True,False,...,False,False,False,False,False,False,False,False,False,False
2,Animal Man,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,Ant-Man,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,Ant-Man II,True,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57,Superboy-Prime,False,True,False,False,False,True,False,True,True,...,False,False,False,False,False,False,False,False,False,False
58,Superman,True,True,False,False,True,True,False,True,True,...,False,False,False,False,False,False,False,False,False,False
59,Wonder Man,False,False,False,False,False,True,False,False,True,...,False,False,False,False,False,False,False,False,False,False
60,Wonder Woman,False,True,False,False,False,True,False,False,True,...,False,False,False,False,False,False,False,False,False,False


In [42]:
dir(powers)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_collect_as_arrow',
 '_jcols',
 '_jdf',
 '_jmap',
 '_joinAsOf',
 '_jseq',
 '_lazy_rdd',
 '_repr_html_',
 '_sc',
 '_schema',
 '_session',
 '_sort_cols',
 '_sql_ctx',
 '_support_repr_html',
 '_to_corrected_pandas_type',
 'agg',
 'alias',
 'approxQuantile',
 'cache',
 'checkpoint',
 'coalesce',
 'colRegex',
 'collect',
 'columns',
 'corr',
 'count',
 'cov',
 'createGlobalTempView',
 'createOrReplaceGlobalTempView',
 'createOrReplaceTempView',
 'createTempView',
 'crossJoin',
 'crosstab',
 'cube',
 'describe',
 'distinct',
 'drop',
 'dropDuplicates',
 'drop_duplicates',
 'dropna',
 'dtypes',
 

In [43]:
help(powers.colRegex)

Help on method colRegex in module pyspark.sql.dataframe:

colRegex(colName: str) -> pyspark.sql.column.Column method of pyspark.sql.dataframe.DataFrame instance
    Selects column based on the column name specified as a regex and returns it
    as :class:`Column`.
    
    .. versionadded:: 2.3.0
    
    Parameters
    ----------
    colName : str
        string, column name specified as a regex.
    
    Examples
    --------
    >>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c",  3)], ["Col1", "Col2"])
    >>> df.select(df.colRegex("`(Col1)?+.+`")).show()
    +----+
    |Col2|
    +----+
    |   1|
    |   2|
    |   3|
    +----+



In [44]:
powers.hero_names

Column<'hero_names'>

# Appendix

## Creating rows from a list of data

## Creating a Row class

* Pass `Row` the column names
* This creates a specialized `Row` class

In [45]:
from pyspark.sql import Row

Employee = Row("firstName", "lastName", "email", "salary")
Employee

<Row('firstName', 'lastName', 'email', 'salary')>

## Creating an `Employee` instance

* Pass the data to `Employee` to make a row
* Order matters: pass the values in the same order as the column names

In [46]:
Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('michael', 'armbrust', 'no-reply@berkeley.edu', 100000)
employee1

Row(firstName='michael', lastName='armbrust', email='no-reply@berkeley.edu', salary=100000)

## Unpacking a data list

* Suppose the data is in a list/tuple.
* Use sequence unpacking with `*`

In [47]:
empl2_info = ('xiangrui', 'meng', 'no-reply@stanford.edu', 120000)
empl2_info

('xiangrui', 'meng', 'no-reply@stanford.edu', 120000)

In [48]:
employee2 = Employee(*empl2_info)
employee2

Row(firstName='xiangrui', lastName='meng', email='no-reply@stanford.edu', salary=120000)

## Unpacking a list of tuples

In [49]:
# Create the Employees
Employee = Row("firstName", "lastName", "email", "salary")
employees = [('michael', 'armbrust', 'no-reply@berkeley.edu', 100000),
             ('xiangrui', 'meng', 'no-reply@stanford.edu', 120000),
             ('matei', None, 'no-reply@waterloo.edu', 140000),
             (None, 'wendell', 'no-reply@berkeley.edu', 160000)]
emp_rows = [Employee(*r) for r in employees]
emp_rows

[Row(firstName='michael', lastName='armbrust', email='no-reply@berkeley.edu', salary=100000),
 Row(firstName='xiangrui', lastName='meng', email='no-reply@stanford.edu', salary=120000),
 Row(firstName='matei', lastName=None, email='no-reply@waterloo.edu', salary=140000),
 Row(firstName=None, lastName='wendell', email='no-reply@berkeley.edu', salary=160000)]
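
The unpacking step can be tried without a Spark session. The sketch below uses `collections.namedtuple` as a standard-library stand-in for the `Row` factory (the name `EmployeeTuple` is hypothetical, and `namedtuple` is not the pyspark `Row` class), just to show what `*r` does inside the list comprehension.

```python
from collections import namedtuple

# Stand-in for Row("firstName", ...): a factory that accepts values positionally.
EmployeeTuple = namedtuple("EmployeeTuple", ["firstName", "lastName", "email", "salary"])

records = [
    ("michael", "armbrust", "no-reply@berkeley.edu", 100000),
    ("matei", None, "no-reply@waterloo.edu", 140000),
]

# *r spreads each tuple into four positional arguments, in order.
unpacked = [EmployeeTuple(*r) for r in records]
print(unpacked[0].firstName, unpacked[0].salary)

# Without *, the whole tuple arrives as a single argument. namedtuple rejects
# that with a TypeError; pyspark's Row may behave differently in this case.
try:
    EmployeeTuple(records[0])
    missing_star_fails = False
except TypeError:
    missing_star_fails = True
print(missing_star_fails)
```

Because the values are matched to field names purely by position, a tuple in a different order would be accepted silently and end up under the wrong names.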