# Data Munging with PySpark

### <font color='green'>_This notebook supports Google Colab_  </font>

<font color='green'>Look for the "_Sidebar_: Google Colab" section below to set up and run this Spark notebook on Google Colab.</font>

# Installation etc.

Look, I hate to be that guy, but Google/DDG are your friends here.  
Getting ```Spark```, ```pySpark``` and other libraries to run is more than a little tedious.  
The following are a 'best-guess' set of instructions.  
The ones that worked for me.  
Using Windows 10 here.  
Your mileage may vary.  

Once you are through the tedium of installation and setup - it gets good. I promise. :)  

## Download (and install where applicable)
* JDK (prefer 8.x/11.x, 64bit, more open the better)
* Hadoop (3.2.x, at this time) - _for windows, we just need Hadoop Winutils_
* [Hadoop *winutils* (corresponding to the version of Hadoop)](https://github.com/cdarlint/winutils), [another repo](https://github.com/kontext-tech/winutils)
* [Spark (3.x, at this time)](https://spark.apache.org/downloads.html)  
* [Anaconda - Open Source/Individual Edition](https://www.anaconda.com/products/distribution)

## Setup environment variables 

We set these environment variables that help manage paths better.
Example variable values would look like:
Java:  
* JAVA_HOME = ```C:\[Java]```  
    
Hadoop:  
* HADOOP_HOME = ```C:\hadoop\hadoop-3.4.0-win10-x64```  
_this is just a sample_  
_On Windows, HADOOP_HOME must be the folder whose ```bin``` subfolder contains **winutils.exe**_  
_For dev purposes: you can just download winutils and move on, just ensure this env variable points to the folder that holds ```bin\winutils.exe```_

finally, Spark:   
* SPARK_HOME = ```C:\Spark\spark-3.4.1-bin-hadoop3```  

*Notice there are no backslashes at the end. This is because slashes will be added in the next step when we set up the path.*      

## Update system **'PATH'**

We use the variables defined above to set-up paths.  

* Java: ```%JAVA_HOME%\bin```
* Hadoop 01: ```%HADOOP_HOME%\bin``` 
* Hadoop 02: ```%HADOOP_HOME%\sbin``` (*sbin needed in addition to bin*)
* Spark: ```%SPARK_HOME%\bin```  
    
(*here we add backslashes before bin*)

N.B.: If you are only doing the dev setup (just winutils and nothing else), you may not need the Hadoop 01/02 entries above. I added those to my system anyway.
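
A quick way to confirm the variables took effect is to check them from Python in a fresh terminal or notebook. This is a minimal sketch using only the standard library. The helper name `check_spark_env` is made up for illustration, and it assumes the three variable names used above and that each home folder has a ```bin``` subfolder.

```python
import os
from pathlib import Path


def check_spark_env(env=None):
    """Return a list of problems found; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []
    for name in ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        # the variables should not end with a slash, the PATH entries add it
        if value.endswith(("\\", "/")):
            problems.append(f"{name} should not end with a slash: {value}")
        if not (Path(value) / "bin").is_dir():
            problems.append(f"{name} has no bin folder: {value}")
    # on Windows, winutils.exe is expected inside the bin folder of HADOOP_HOME
    hadoop_home = env.get("HADOOP_HOME")
    if os.name == "nt" and hadoop_home:
        if not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
            problems.append("winutils.exe not found in the bin folder of HADOOP_HOME")
    return problems


for problem in check_spark_env():
    print("WARNING:", problem)
```

No output means the basic layout looks right. It does not prove that the Java, Hadoop and Spark versions work together.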

## Patch Hadoop  

This is *needed* when Hadoop is run on Windows.

* copy the ```bin``` folder from the right version of winutils to replace ```%HADOOP_HOME%\bin```  

* copy ```hadoop-yarn-server-timelineservice-3.0.3``` from ```%HADOOP_HOME%\share\hadoop\yarn\timelineservice``` to ```%HADOOP_HOME%\share\hadoop\yarn``` (the parent directory).  

## Install the Python libraries

Prefer installing [Anaconda](https://www.anaconda.com/products/distribution). 
It resolves other dependencies like Pandas, Numpy, Jupyter etc. too.  
Once there, use either pip or conda. Both work, but mixing them in the same environment can lead to conflicting packages, so pick one and stick with it.  
The conda-forge channel can lag a few days behind the pip one.  
We're only running on the local machine here, no complicated infrastructure to care about.  
So, you do you.  

Use one of the following commands (from the command line obvs) to install each:  
* pyspark:
    * ```pip install pyspark``` or
    * ```conda install -c conda-forge pyspark```
* findspark:
    * ```pip install findspark``` or
    * ```conda install -c conda-forge findspark```

## References  

* [How to install Hadoop on Win 10](https://muhammadbilalyar.github.io/blogs/How-to-install-Hadoop-on-Window-10/)
* [Hadoop on Windows](https://github.com/MuhammadBilalYar/Hadoop-On-Window)
* [Hadoop and Spark on Windows](https://dev.to/awwsmm/installing-and-running-hadoop-and-spark-on-windows-33kc)

### (_Optionally_) Configure Hadoop

*only needed if you want to use hadoop as your file storage system*  

* create a folder for ```namenode```
* create a folder for ```datanode```
* four files: ```core-site.xml```, ```mapred-site.xml```, ```hdfs-site.xml```, ```yarn-site.xml``` - see code for each in the [reference repo](https://github.com/MuhammadBilalYar/Hadoop-On-Window) above.

### Also,

The scope of this notebook is *usage*, not setup or troubleshooting. I am pretty sure these installation instructions will be outdated soon, replaced by pre-built Docker images, shell scripts, automated installs for Windows or such-like.  

# Setup

This boilerplate helps, especially in Jupyter Notebook situations.

## _Sidebar_: Google Colab

You don't need to run this on your local machine.
The notebook is set up to run on Google Colab as well.

For a detailed description of how this is set up, see the [02.000 (optional) Setup_Spark_in_Google_Colab](https://github.com/shauryashaurya/learn-data-munging/blob/main/02.000%20(optional)%20Setup_Spark_in_Google_Colab.ipynb) notebook.

Open the notebook in Google Colab using the following button, then uncomment the setup marked # SETUP FOR COLAB  
  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/02.001%20-%2010%2B%20minutes%20to%20pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>    

_NOTE: keep the # SETUP FOR COLAB step below commented (disabled) when you are running this notebook locally_

In [1]:
# SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on windows)

# # grab spark
# # as of Dec 2022, the latest version is 3.2.3, get the link from Apache Spark's website
# ! wget -q https://dlcdn.apache.org/spark/spark-3.2.3/spark-3.2.3-bin-hadoop3.2.tgz
# # unzip spark
# !tar xf spark-3.2.3-bin-hadoop3.2.tgz
# # install findspark package
# !pip install -q findspark

# # got to provide JAVA_HOME and SPARK_HOME variables
# import os
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
# os.environ["SPARK_HOME"] = "/content/spark-3.2.3-bin-hadoop3.2"

In [1]:
# Step 1: initialize findspark
import findspark
findspark.init()

In [2]:
# Step 2: import pyspark
import pyspark
from pyspark.sql import SparkSession

In [3]:
pyspark.__version__

'3.4.1'

In [4]:
# Step 3: Create a spark session

# 'local[1]' indicates spark on 1 core on the local machine, specify the number of cores needed
# use .config("spark.some.config.option", "some-value") for additional configuration

spark = SparkSession \
    .builder \
    .master('local[1]') \
    .appName("10+ minutes to pyspark") \
    .getOrCreate()

# spark

Back in the day you'd need various 'contexts' as entry points into spark functionality.  
All of this is now wrapped into a SparkSession, easy to manage.

In [None]:
# The SparkSession carries the sparkContext
# spark.sparkContext

Check out the spark UI link when you uncomment the lines in the two cells above.  
Your local UI should launch at a link like: http://localhost:4041/jobs/ (Spark tries port 4040 first and moves up to 4041, 4042, ... when a port is already taken).

In [5]:
# before we close the notebook, stop spark, otherwise Jupyter closes but the Spark JVM process keeps running...
# spark.stop()

# Dataframes

## DataFrames: Create and View

In [6]:
from datetime import datetime, date
import numpy as np
import pandas as pd
from pyspark.sql import Row

In [7]:
# use a list of pyspark.sql.Row
df1 = spark.createDataFrame(
    [
        Row(a=1,b=2.,c='span a',d=date(2022,7,1),e=datetime(2022,7,1,12,0)),
        Row(a=1,b=3.,c='can a ',d=date(2022,7,2),e=datetime(2022,7,2,12,0,1)),
        Row(a=1,b=4.,c='banana',d=date(2022,7,3),e=datetime(2022,7,3,12,0,2))
    ]
)

In [8]:
df1

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [10]:
# df1's not been evaluated yet. It's lazy evaluation
# to eval it, we go
df1.show()

+---+---+------+----------+-------------------+
|  a|  b|     c|         d|                  e|
+---+---+------+----------+-------------------+
|  1|2.0|span a|2022-07-01|2022-07-01 12:00:00|
|  1|3.0|can a |2022-07-02|2022-07-02 12:00:01|
|  1|4.0|banana|2022-07-03|2022-07-03 12:00:02|
+---+---+------+----------+-------------------+



In [11]:
# use a list of tuples with explicit schema
df2 = spark.createDataFrame(
    [
        (2,5.,'man a',date(2022,7,1),datetime(2022,7,1,12,0)),
        (2,6.,'can a',date(2022,8,1),datetime(2022,7,2,12,0)),
        (2,7.,'manna',date(2022,9,1),datetime(2022,7,3,12,0))
    ],
    schema = 'a bigint, b double, c string, d date, e timestamp'
)
df2

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [12]:
# use spark RDDs to create a dataframe
# go to the sparksession's sparkContext to access parallelize method
rdd3 = spark.sparkContext.parallelize(
    [
        (3,5., 'main', date(2022,7,1), datetime(2022,7,1,12,0,1)),
        (3,5., 'brain', date(2022,7,1), datetime(2022,7,1,12,0,1)),
        (3,5., 'pain', date(2022,7,1), datetime(2022,7,1,12,0,1))
    ]
)

df3 = spark.createDataFrame(rdd3, schema=['a','b','c','d','e'])

df3

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [13]:
# can also use a pandas dataframe to create a spark dataframe
df4_pd = pd.DataFrame(
    {
        'a': np.random.randint(0,10, size = 3),
        'b': np.random.randn(3),
        'c': ["gandalf's manager", "said", 'no'],
        'd': [date(2022,7,1),date(2022,7,2),date(2022,7,3)],
        'e': [datetime(2022,7,1,12,0,1),datetime(2022,7,2,12,0,2),datetime(2022,7,3,12,0,3)]
    }
)

df4 = spark.createDataFrame(df4_pd)

df4.show()

+---+------------------+-----------------+----------+-------------------+
|  a|                 b|                c|         d|                  e|
+---+------------------+-----------------+----------+-------------------+
|  0|1.3461612779974976|gandalf's manager|2022-07-01|2022-07-01 12:00:01|
|  3|0.8235447997632228|             said|2022-07-02|2022-07-02 12:00:02|
|  3|1.1436223517625508|               no|2022-07-03|2022-07-03 12:00:03|
+---+------------------+-----------------+----------+-------------------+



In [14]:
df4_pd

Unnamed: 0,a,b,c,d,e
0,0,1.346161,gandalf's manager,2022-07-01,2022-07-01 12:00:01
1,3,0.823545,said,2022-07-02,2022-07-02 12:00:02
2,3,1.143622,no,2022-07-03,2022-07-03 12:00:03


In [15]:
# use printSchema() to..., you know, it says what it does
# also don't you hate that pySpark is following the Java camelCase instead of the Python snake_case?
# yeah, what's that all about?

df1.printSchema()
df2.printSchema()
df3.printSchema()
df4.printSchema()

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)



In [16]:
# show only x rows
df1.show(1)
# vertical - if the row is too long for horizontal display
df4.show(2, vertical=True)

+---+---+------+----------+-------------------+
|  a|  b|     c|         d|                  e|
+---+---+------+----------+-------------------+
|  1|2.0|span a|2022-07-01|2022-07-01 12:00:00|
+---+---+------+----------+-------------------+
only showing top 1 row

-RECORD 0------------------
 a   | 0                   
 b   | 1.3461612779974976  
 c   | gandalf's manager   
 d   | 2022-07-01          
 e   | 2022-07-01 12:00:01 
-RECORD 1------------------
 a   | 3                   
 b   | 0.8235447997632228  
 c   | said                
 d   | 2022-07-02          
 e   | 2022-07-02 12:00:02 
only showing top 2 rows



In [17]:
# collect() - collects the entire df from across all nodes to the driver
# if you don't have enough memory, here's how you crash spark
# careful is the word
df3.collect()

[Row(a=3, b=5.0, c='main', d=datetime.date(2022, 7, 1), e=datetime.datetime(2022, 7, 1, 12, 0, 1)),
 Row(a=3, b=5.0, c='brain', d=datetime.date(2022, 7, 1), e=datetime.datetime(2022, 7, 1, 12, 0, 1)),
 Row(a=3, b=5.0, c='pain', d=datetime.date(2022, 7, 1), e=datetime.datetime(2022, 7, 1, 12, 0, 1))]

In [18]:
# take(n) in spark returns the first n rows of a dataframe as a list of Row objects
# (pandas has a take() too, but that one picks rows by position)
df2.take(2)

[Row(a=2, b=5.0, c='man a', d=datetime.date(2022, 7, 1), e=datetime.datetime(2022, 7, 1, 12, 0)),
 Row(a=2, b=6.0, c='can a', d=datetime.date(2022, 8, 1), e=datetime.datetime(2022, 7, 2, 12, 0))]

In [19]:
# same as take() but returns the last n rows of a dataframe
df2.tail(2)

[Row(a=2, b=6.0, c='can a', d=datetime.date(2022, 8, 1), e=datetime.datetime(2022, 7, 2, 12, 0)),
 Row(a=2, b=7.0, c='manna', d=datetime.date(2022, 9, 1), e=datetime.datetime(2022, 7, 3, 12, 0))]

In [23]:
# hey you want another way of crashing the spark driver?
# convert the spark dataframe to a pandas dataframe
# it'll collect all data from all workers into the driver

# this naive approach can fail (e.g. with pandas 2.x) because we have date-time values in df3.
# df3.toPandas()
# possible error: TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.

# workaround: format the date and timestamp columns as strings before handing them to Pandas
from pyspark.sql.functions import date_format
df3.withColumn("d", date_format("d", "yyyy-MM-dd HH:mm:ss")) \
    .withColumn("e", date_format("e", "yyyy-MM-dd HH:mm:ss")) \
    .toPandas()

Unnamed: 0,a,b,c,d,e
0,3,5.0,main,2022-07-01 00:00:00,2022-07-01 12:00:01
1,3,5.0,brain,2022-07-01 00:00:00,2022-07-01 12:00:01
2,3,5.0,pain,2022-07-01 00:00:00,2022-07-01 12:00:01


In [24]:
# how do I make it so my dataframes are evaluated eagerly?
# instead of the regular lazy eval?
# y'know when am in notebooks and stuff?
# set the config
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)

In [25]:
# there you go, eager evaluation
df1

a,b,c,d,e
1,2.0,span a,2022-07-01,2022-07-01 12:00:00
1,3.0,can a,2022-07-02,2022-07-02 12:00:01
1,4.0,banana,2022-07-03,2022-07-03 12:00:02


In [26]:
# set it back to false if you like
spark.conf.set('spark.sql.repl.eagerEval.enabled', False)

In [27]:
# No more eagerly evaluated dataframes
df1

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

# Selecting and Accessing Data

In [28]:
# access a column
df1.a, df2.b, df3.c, df4.d

(Column<'a'>, Column<'b'>, Column<'c'>, Column<'d'>)

In [29]:
from pyspark.sql import Column
from pyspark.sql.functions import upper

In [30]:
type(df1.c) == type(upper(df1.c)) == type(df1.c.isNull())
# TODO: what's going on with type(df1.c.isNull()) above???

True

In [31]:
df1.c.isNull()

Column<'(c IS NULL)'>

In [32]:
# use dataframe's select method to identify a column and show() it
df1.select(df1.c).show()

+------+
|     c|
+------+
|span a|
|can a |
|banana|
+------+



In [33]:
# also there's dataframe.filter()
df4.filter(df4.a>0).show()

+---+------------------+----+----------+-------------------+
|  a|                 b|   c|         d|                  e|
+---+------------------+----+----------+-------------------+
|  3|0.8235447997632228|said|2022-07-02|2022-07-02 12:00:02|
|  3|1.1436223517625508|  no|2022-07-03|2022-07-03 12:00:03|
+---+------------------+----+----------+-------------------+



In [34]:
# spark.stop()

In [35]:
# assign a new column instance to the dataframe
df4_withNewCol = df4.withColumn('upper_c', upper(df4.c))
df4_withNewCol.show()

+---+------------------+-----------------+----------+-------------------+-----------------+
|  a|                 b|                c|         d|                  e|          upper_c|
+---+------------------+-----------------+----------+-------------------+-----------------+
|  0|1.3461612779974976|gandalf's manager|2022-07-01|2022-07-01 12:00:01|GANDALF'S MANAGER|
|  3|0.8235447997632228|             said|2022-07-02|2022-07-02 12:00:02|             SAID|
|  3|1.1436223517625508|               no|2022-07-03|2022-07-03 12:00:03|               NO|
+---+------------------+-----------------+----------+-------------------+-----------------+



In [36]:
df4.show()
df4_withNewCol.show()

+---+------------------+-----------------+----------+-------------------+
|  a|                 b|                c|         d|                  e|
+---+------------------+-----------------+----------+-------------------+
|  0|1.3461612779974976|gandalf's manager|2022-07-01|2022-07-01 12:00:01|
|  3|0.8235447997632228|             said|2022-07-02|2022-07-02 12:00:02|
|  3|1.1436223517625508|               no|2022-07-03|2022-07-03 12:00:03|
+---+------------------+-----------------+----------+-------------------+

+---+------------------+-----------------+----------+-------------------+-----------------+
|  a|                 b|                c|         d|                  e|          upper_c|
+---+------------------+-----------------+----------+-------------------+-----------------+
|  0|1.3461612779974976|gandalf's manager|2022-07-01|2022-07-01 12:00:01|GANDALF'S MANAGER|
|  3|0.8235447997632228|             said|2022-07-02|2022-07-02 12:00:02|             SAID|
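|  3|1.1436223517625508|               no|2022-07-03|2022-07-03 12:00:03|               NO|
+---+------------------+-----------------+----------+-------------------+-----------------+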

In [37]:
# filter
df4.filter(df4.a == 9).show()

+---+---+---+---+---+
|  a|  b|  c|  d|  e|
+---+---+---+---+---+
+---+---+---+---+---+



In [38]:
df4.filter(df4.b > 0).show()

+---+------------------+-----------------+----------+-------------------+
|  a|                 b|                c|         d|                  e|
+---+------------------+-----------------+----------+-------------------+
|  0|1.3461612779974976|gandalf's manager|2022-07-01|2022-07-01 12:00:01|
|  3|0.8235447997632228|             said|2022-07-02|2022-07-02 12:00:02|
|  3|1.1436223517625508|               no|2022-07-03|2022-07-03 12:00:03|
+---+------------------+-----------------+----------+-------------------+



# UDFs: Applying a Function

In [39]:


import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('long')
def pandas_plus_one(series: pd.Series) -> pd.Series:
    # add one to every value of the pandas Series
    return series+1

df4.select(df4.a, pandas_plus_one(df4.a)).show()

+---+------------------+
|  a|pandas_plus_one(a)|
+---+------------------+
|  0|                 1|
|  3|                 4|
|  3|                 4|
+---+------------------+



In [40]:
def pandas_filter(iterator):
    for pandas_df in iterator:
        yield pandas_df[pandas_df.b>0]
        
df4.mapInPandas(pandas_filter, schema = df4.schema).show()

+---+------------------+-----------------+----------+-------------------+
|  a|                 b|                c|         d|                  e|
+---+------------------+-----------------+----------+-------------------+
|  0|1.3461612779974976|gandalf's manager|2022-07-01|2022-07-01 12:00:01|
|  3|0.8235447997632228|             said|2022-07-02|2022-07-02 12:00:02|
|  3|1.1436223517625508|               no|2022-07-03|2022-07-03 12:00:03|
+---+------------------+-----------------+----------+-------------------+



# Grouping Data

Split-Apply-Combine, just like pandas

In [41]:
# group by fruit, color etc.
df5 = spark.createDataFrame([
    ['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
    ['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
    ['red', 'banana', 7, 70], ['red', 'grape', 8, 80]], schema=['color', 'fruit', 'v1', 'v2'])

df5.show()

+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|  red|banana|  1| 10|
| blue|banana|  2| 20|
|  red|carrot|  3| 30|
| blue| grape|  4| 40|
|  red|carrot|  5| 50|
|black|carrot|  6| 60|
|  red|banana|  7| 70|
|  red| grape|  8| 80|
+-----+------+---+---+



In [42]:
df5.groupby('color').avg().show()

+-----+-------+-------+
|color|avg(v1)|avg(v2)|
+-----+-------+-------+
|  red|    4.8|   48.0|
|black|    6.0|   60.0|
| blue|    3.0|   30.0|
+-----+-------+-------+



In [43]:
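# subtract each color's mean v1 from v1 (despite its name, plus_mean subtracts)
# v1 is declared as long in df5.schema, so the fractional part of each deviation is dropped in the output
# TODO: How to get the deviations in a new column?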
def plus_mean(pandas_df):
    return pandas_df.assign(v1=pandas_df.v1 - pandas_df.v1.mean())

df5.groupby('color').applyInPandas(plus_mean, schema = df5.schema).show()

+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|black|carrot|  0| 60|
| blue|banana| -1| 20|
| blue| grape|  1| 40|
|  red|banana| -3| 10|
|  red|carrot| -1| 30|
|  red|carrot|  0| 50|
|  red|banana|  2| 70|
|  red| grape|  3| 80|
+-----+------+---+---+



In [44]:
# Co-grouping and applying a function.

df6 = spark.createDataFrame(
    [
        (20000101, 1, 1.0), 
        (20000101, 2, 2.0), 
        (20000102, 1, 3.0), 
        (20000102, 2, 4.0)
    ],
    ('time', 'id', 'v1')
)

df7 = spark.createDataFrame(
    [
        (20000101, 1, 'x'), 
        (20000101, 2, 'y')
    ],
    ('time', 'id', 'v2')
)

In [45]:
def asof_join(l, r):
    return pd.merge_asof(l,r,on='time', by='id')

In [46]:
df6_gb = df6.groupby('id')
df7_gb = df7.groupby('id')

In [47]:
co_grp = df6_gb.cogroup(df7_gb)

In [48]:
rslt = co_grp.applyInPandas(asof_join, schema='time int, id int, v1 double, v2 string')
rslt.show()

+--------+---+---+---+
|    time| id| v1| v2|
+--------+---+---+---+
|20000101|  1|1.0|  x|
|20000102|  1|3.0|  x|
|20000101|  2|2.0|  y|
|20000102|  2|4.0|  y|
+--------+---+---+---+
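
To see what `asof_join` does for a single co-group, the same `pd.merge_asof` call can be run on plain pandas frames. A sketch with hand-built frames (`left` and `right` are names made up here) that mirror the `id == 1` rows of `df6` and `df7`:

```python
import pandas as pd

left = pd.DataFrame({'time': [20000101, 20000102], 'id': [1, 1], 'v1': [1.0, 3.0]})
right = pd.DataFrame({'time': [20000101], 'id': [1], 'v2': ['x']})

# for every left row, take the most recent right row with the same id
# whose time is not later than the left row's time
joined = pd.merge_asof(left, right, on='time', by='id')
print(joined)
```

Both left rows pick up `'x'`, because the only matching right row is at or before each of their times. That matches the `id == 1` rows in the Spark output above.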



# Getting data in and out

In [49]:
# CSV
# .mode('overwrite') can be taken out when needed.
df5.write.mode('overwrite').csv('fruits.csv', header=True)
spark.read.csv('fruits.csv', header=True).show()

+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|  red|banana|  1| 10|
| blue|banana|  2| 20|
|  red|carrot|  3| 30|
| blue| grape|  4| 40|
|  red|carrot|  5| 50|
|black|carrot|  6| 60|
|  red|banana|  7| 70|
|  red| grape|  8| 80|
+-----+------+---+---+



In [50]:
# Parquet
# .mode('overwrite') can be taken out when needed.
df5.write.mode('overwrite').parquet('fruits.parquet')
spark.read.parquet('fruits.parquet').show()

+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|  red|banana|  1| 10|
| blue|banana|  2| 20|
|  red|carrot|  3| 30|
| blue| grape|  4| 40|
|  red|carrot|  5| 50|
|black|carrot|  6| 60|
|  red|banana|  7| 70|
|  red| grape|  8| 80|
+-----+------+---+---+



# Working with SQL

In [51]:
df5.createOrReplaceTempView('tableA')

In [52]:
spark.sql('select count(*) from tableA').show()

+--------+
|count(1)|
+--------+
|       8|
+--------+



In [53]:
# UDFs in SQL
# register and invoke
@pandas_udf('integer')
def add_one(s: pd.Series) -> pd.Series:
    return s+1

spark.udf.register('add_one', add_one)

<function __main__.add_one(s: pandas.core.series.Series) -> pandas.core.series.Series>

In [54]:
spark.sql('SELECT v1, add_one(v1) from tableA').show()

+---+-----------+
| v1|add_one(v1)|
+---+-----------+
|  1|          2|
|  2|          3|
|  3|          4|
|  4|          5|
|  5|          6|
|  6|          7|
|  7|          8|
|  8|          9|
+---+-----------+



In [55]:
# can mix/match sql expressions 
# for e.g. take the expressions from above

from pyspark.sql.functions import expr

df5.selectExpr('add_one(v1)').show()

+-----------+
|add_one(v1)|
+-----------+
|          2|
|          3|
|          4|
|          5|
|          6|
|          7|
|          8|
|          9|
+-----------+



In [56]:
df5.select(expr('count(*)')).show()

+--------+
|count(1)|
+--------+
|       8|
+--------+



In [57]:
df5.select(expr('count(*)')>0).show()

+--------------+
|(count(1) > 0)|
+--------------+
|          true|
+--------------+



# Pandas API on Spark

In [58]:
import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession



In [59]:
# create a pandas on spark series

s1 = ps.Series([1,2,3,np.nan,5,np.nan,7,8,9])
s1

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    NaN
6    7.0
7    8.0
8    9.0
dtype: float64

In [98]:
psdf1 = ps.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]
    },
    index=[10, 20, 30, 40, 50, 60]
)

psdf1

Unnamed: 0,a,b,c
10,1,100,one
20,2,200,two
30,3,300,three
40,4,400,four
50,5,500,five
60,6,600,six


In [134]:
# create a pandas dataframe and convert to pandas-spark
dates1 = pd.date_range('20220101', periods=6)

In [197]:
pdf1 = pd.DataFrame(np.random.randn(6,4), index=dates1, columns = list('ABCD'))


In [198]:
pdf1

Unnamed: 0,A,B,C,D
2022-01-01,-0.16899,0.218452,-0.213017,-1.393176
2022-01-02,0.708587,-0.443116,-0.450608,-1.138874
2022-01-03,0.134243,-0.210103,0.617508,-1.238837
2022-01-04,-1.952191,1.187688,0.344275,-0.500572
2022-01-05,-0.313474,-0.05007,1.588762,-0.064186
2022-01-06,0.268115,0.743052,0.202203,-0.840306


In [199]:
pdf1.index

DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05', '2022-01-06'],
              dtype='datetime64[ns]', freq='D')

In [200]:
# create spark dataframe from the pandas dataframe

# the naive method breaks with pandas 2.x because of the date-time index
# psdf1 = ps.from_pandas(pdf1)

# we follow a 2-step approach here:
# first convert datetime to string
pdf1.index = pdf1.index.astype("string")
psdf1 = ps.from_pandas(pdf1)

# now cast index back to datetime64 
psdf1.index = psdf1.index.astype("datetime64")

In [201]:
# this is a pandas-on-spark dataframe now
type(psdf1)

pyspark.pandas.frame.DataFrame

In [164]:
psdf1

Unnamed: 0,A,B,C,D
2022-01-01,0.231783,-1.084731,-0.608712,1.246273
2022-01-02,-0.805125,-2.042498,-0.830551,0.660595
2022-01-03,-1.364529,-1.882992,1.696941,0.615413
2022-01-04,0.873966,1.508365,-1.282595,0.707897
2022-01-05,-0.253708,0.637862,-0.188042,-0.059372
2022-01-06,0.166734,0.417695,-1.10731,-0.194818


In [165]:
# create a SPARK df from a PANDAS df (we've done this before)
# note: the pandas index is not carried over, only columns A-D make it into sdf2
# in the next cell we create a PANDAS-ON-SPARK df from this SPARK df

sdf2 = spark.createDataFrame(pdf1)
sdf2.show()

+--------------------+-------------------+-------------------+--------------------+
|                   A|                  B|                  C|                   D|
+--------------------+-------------------+-------------------+--------------------+
| 0.23178298083436075|-1.0847310277456157|-0.6087115856498937|  1.2462731288493991|
| -0.8051253969964965|-2.0424981029532785|-0.8305510126251235|  0.6605951086327256|
| -1.3645288946045742| -1.882992437515232| 1.6969414646290797|  0.6154125218090376|
|  0.8739660400170647| 1.5083646652561309|-1.2825951299127099|    0.70789746162282|
|-0.25370835805637826| 0.6378616352178663|-0.1880421255151992| -0.0593716881962378|
| 0.16673376667661077|0.41769524872208375|-1.1073103769831572|-0.19481750928898367|
+--------------------+-------------------+-------------------+--------------------+



In [190]:
psdf2 = sdf2.pandas_api()
# this works like a pandas df now...
psdf2

Unnamed: 0,A,B,C,D
0,0.231783,-1.084731,-0.608712,1.246273
1,-0.805125,-2.042498,-0.830551,0.660595
2,-1.364529,-1.882992,1.696941,0.615413
3,0.873966,1.508365,-1.282595,0.707897
4,-0.253708,0.637862,-0.188042,-0.059372
5,0.166734,0.417695,-1.10731,-0.194818


In [191]:
type(psdf2)

pyspark.pandas.frame.DataFrame

In [192]:
psdf2.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

In [193]:
# pandas head()
psdf2.head()

Unnamed: 0,A,B,C,D
0,0.231783,-1.084731,-0.608712,1.246273
1,-0.805125,-2.042498,-0.830551,0.660595
2,-1.364529,-1.882992,1.696941,0.615413
3,0.873966,1.508365,-1.282595,0.707897
4,-0.253708,0.637862,-0.188042,-0.059372


In [194]:
psdf2.index

Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [195]:
psdf2.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [196]:
psdf2.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.191813,-0.407717,-0.386711,0.495998
std,0.79989,1.466506,1.0908,0.535692
min,-1.364529,-2.042498,-1.282595,-0.194818
25%,-0.805125,-1.882992,-1.10731,-0.059372
50%,-0.253708,-1.084731,-0.830551,0.615413
75%,0.231783,0.637862,-0.188042,0.707897
max,0.873966,1.508365,1.696941,1.246273


In [172]:
# warning - this pulls all the data into the driver, just like collect()
# it can exhaust driver memory on large dataframes, be careful
psdf2.to_numpy()

array([[ 0.23178298, -1.08473103, -0.60871159,  1.24627313],
       [-0.8051254 , -2.0424981 , -0.83055101,  0.66059511],
       [-1.36452889, -1.88299244,  1.69694146,  0.61541252],
       [ 0.87396604,  1.50836467, -1.28259513,  0.70789746],
       [-0.25370836,  0.63786164, -0.18804213, -0.05937169],
       [ 0.16673377,  0.41769525, -1.10731038, -0.19481751]])

In [173]:
# transpose
psdf2.T

Unnamed: 0,0,1,2,3,4,5
A,0.231783,-0.805125,-1.364529,0.873966,-0.253708,0.166734
B,-1.084731,-2.042498,-1.882992,1.508365,0.637862,0.417695
C,-0.608712,-0.830551,1.696941,-1.282595,-0.188042,-1.10731
D,1.246273,0.660595,0.615413,0.707897,-0.059372,-0.194818


In [174]:
# sort index
psdf2.sort_index(ascending = False)

Unnamed: 0,A,B,C,D
5,0.166734,0.417695,-1.10731,-0.194818
4,-0.253708,0.637862,-0.188042,-0.059372
3,0.873966,1.508365,-1.282595,0.707897
2,-1.364529,-1.882992,1.696941,0.615413
1,-0.805125,-2.042498,-0.830551,0.660595
0,0.231783,-1.084731,-0.608712,1.246273


In [175]:
# sort values
psdf2.sort_values(by='B', ascending = True)

Unnamed: 0,A,B,C,D
1,-0.805125,-2.042498,-0.830551,0.660595
2,-1.364529,-1.882992,1.696941,0.615413
0,0.231783,-1.084731,-0.608712,1.246273
5,0.166734,0.417695,-1.10731,-0.194818
4,-0.253708,0.637862,-0.188042,-0.059372
3,0.873966,1.508365,-1.282595,0.707897


# Missing Data - Pandas on Spark

In [204]:
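# you'll notice we are repeating a lot of the tasks 
# from 10 minutes to pandas 
# note: pdf1's index was cast to strings a few cells back, so reindexing with the
# datetime labels in dates1 matches nothing and columns A-D come back as NaN below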
pdf2 = pdf1.reindex(index=dates1[0:4], columns=list(pdf1.columns) + ['E'])

In [205]:
pdf2.loc[dates1[0]:dates1[1], 'E'] = 1

In [206]:
# updated for pandas 2.x+
# explicitly cast dates to string
pdf2.index = pdf2.index.astype("string")
psdf3 = ps.from_pandas(pdf2)
# convert index back to datetime64
psdf3.index = psdf3.index.astype("datetime64")

In [207]:
psdf3

Unnamed: 0,A,B,C,D,E
2022-01-01,,,,,1.0
2022-01-02,,,,,1.0
2022-01-03,,,,,
2022-01-04,,,,,


In [208]:
psdf3.dropna(how='any')

Unnamed: 0,A,B,C,D,E


In [209]:
psdf3.fillna(value=5)

Unnamed: 0,A,B,C,D,E
2022-01-01,5.0,5.0,5.0,5.0,1.0
2022-01-02,5.0,5.0,5.0,5.0,1.0
2022-01-03,5.0,5.0,5.0,5.0,5.0
2022-01-04,5.0,5.0,5.0,5.0,5.0


# Operations - Pandas on Spark

## Stats

In [210]:
psdf1.mean()

A   -0.220619
B    0.240984
C    0.348187
D   -0.862659
dtype: float64

## Spark Configurations

In [211]:
# store the current config value
previous = spark.conf.get('spark.sql.execution.arrow.pyspark.enabled')
previous

'false'

In [212]:
# use the 'distributed' index type to avoid the overhead of building the default index
ps.set_option('compute.default_index_type', 'distributed')

In [213]:
# ignore warnings coming from Arrow optimizations
import warnings
warnings.filterwarnings('ignore')

Let's toggle ```spark.sql.execution.arrow.pyspark.enabled``` to see the difference

In [214]:
# this is supposed to be faster by an order of magnitude
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)

In [215]:
%timeit ps.range(3000000).to_pandas()

2.29 s ± 762 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [216]:
# is supposed to be slower
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)

In [None]:
%timeit ps.range(3000000).to_pandas()

In [None]:
# put everything back as it was
ps.reset_option('compute.default_index_type')
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', previous)
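
The save, toggle, restore pattern used in the cells above can be wrapped in a context manager, so the original value comes back even if the timed code raises. A minimal sketch that relies only on the `spark.conf.get` and `spark.conf.set` calls already used above. The helper name `temporary_conf` is made up for illustration, and it assumes the key already has a value (as this one does).

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(session, key, value):
    """Set a Spark conf value for the duration of a with-block, then restore it."""
    previous = session.conf.get(key)  # remember the current value
    session.conf.set(key, value)
    try:
        yield previous
    finally:
        session.conf.set(key, previous)  # always put the old value back


# usage in this notebook, where `spark` and `ps` are already defined:
# with temporary_conf(spark, 'spark.sql.execution.arrow.pyspark.enabled', True):
#     ps.range(3000000).to_pandas()
```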

# Grouping  
  
  
It's the same split-apply-combine deal.

In [None]:
psdf4 = ps.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})

In [None]:
psdf4

In [None]:
psdf4.groupby('A').sum()

In [None]:
psdf4.groupby(['A', 'B']).sum()

In [None]:
pser = pd.Series(np.random.randn(1000),
                 index=pd.date_range('1/1/2000', periods=1000))

In [None]:
psser = ps.Series(pser)

In [None]:
psser = psser.cummax()

In [None]:
psser.plot()

In [None]:
pdf4 = pd.DataFrame(np.random.randn(1000, 4), index=pser.index,
                   columns=['A', 'B', 'C', 'D'])

In [None]:
psdf5 = ps.from_pandas(pdf4)

In [None]:
psdf5 = psdf5.cummax()

In [None]:
psdf5.plot()

# Getting data in/out - Pandas on Spark

In [None]:
psdf4.to_csv('./data/foo.csv')
ps.read_csv('./data/foo.csv').head(4)

In [None]:
psdf4.to_parquet('./data/bar.parquet')
ps.read_parquet('./data/bar.parquet').head(3)

In [None]:
spark.stop()

# Up Next

Check out the exercises on real-world datasets next