# Data Munging with PySpark

# Installation etc.

Look, I hate to be that guy, but Google/DDG are your friends here.  
Getting ```Spark```, ```pySpark``` and other libraries to run is more than a little tedious.  
The following is a 'best-guess' set of instructions: the ones that worked for me on Windows 10.  
Your mileage may vary.  

Once you are through the tedium of installation and setup - it gets good. I promise. :)  

## Download (and install where applicable)
* JDK (prefer 8.x or 11.x, 64-bit; an open-source build such as OpenJDK works fine)
* Hadoop (3.2.x, at this time) - _for windows, we just need Hadoop Winutils_
* [Hadoop *winutils* (corresponding to the version of Hadoop)](https://github.com/cdarlint/winutils), [another repo](https://github.com/kontext-tech/winutils)
* [Spark (3.x, at this time)](https://spark.apache.org/downloads.html)  
* [Anaconda - Open Source/Individual Edition](https://www.anaconda.com/products/distribution)

## Setup environment variables 

Setting these environment variables makes the paths easier to manage.  
Example values look like this:  

Java:  
* JAVA_HOME = ```C:\[Java]```  
    
Hadoop:  
* HADOOP_HOME = ```C:\hadoop\hadoop-3.2.1```  
*On Windows 10, ```HADOOP_HOME``` just needs to be the folder whose ```bin``` subfolder holds **winutils.exe***

Finally, Spark:   
* SPARK_HOME = ```C:\Spark\spark-3.2.1-bin-hadoop3.2```  

*Notice there are no trailing backslashes. They get added in the next step, when we set up the path.*      

## Update system **'PATH'**

We use the variables defined above to set up the paths.  

* Java: ```%JAVA_HOME%\bin```
* Hadoop 01: ```%HADOOP_HOME%\bin```
* Hadoop 02: ```%HADOOP_HOME%\sbin``` (*sbin is needed in addition to bin*)
* Spark: ```%SPARK_HOME%\bin```  
    
(*here we add the backslashes before bin*)
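
Once the variables are set, a quick sanity check from Python can save some head-scratching. The sketch below is not part of any Spark tooling: ```check_spark_env``` is a hypothetical helper that only looks at the three variables named above and checks whether each one points at a folder with a ```bin``` subfolder.

```python
import os
from pathlib import Path

# The three variables set in the steps above
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")


def check_spark_env(env=None):
    """Return {variable: status} for each required variable.

    status is 'missing' if the variable is not set,
    'no bin folder' if <value>/bin is not a directory, else 'ok'.
    """
    env = os.environ if env is None else env
    report = {}
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            report[name] = "missing"
        elif not (Path(value) / "bin").is_dir():
            report[name] = "no bin folder"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_spark_env().items():
        print(f"{name}: {status}")
```

Anything other than ```ok``` is a hint to revisit the variable before blaming Spark.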

## Patch Hadoop  

This is *needed* when Hadoop is run on Windows.

* copy the ```bin``` folder from the matching version of winutils to replace ```%HADOOP_HOME%\bin```  

* copy ```hadoop-yarn-server-timelineservice-3.0.3``` from ```%HADOOP_HOME%\share\hadoop\yarn\timelineservice``` to ```%HADOOP_HOME%\share\hadoop\yarn``` (the parent directory).  
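
The first bullet can be scripted if you prefer. A minimal sketch, assuming Python 3.8+ (for ```dirs_exist_ok```); ```patch_hadoop_bin``` is a hypothetical helper, not part of Hadoop or winutils. Note that it merges the winutils files into the existing folder, overwriting files with the same name, rather than deleting the folder first.

```python
import shutil
from pathlib import Path


def patch_hadoop_bin(winutils_bin, hadoop_home):
    """Copy every file from the winutils bin folder into <hadoop_home>/bin.

    Files with the same name are overwritten; other existing files are kept.
    """
    target = Path(hadoop_home) / "bin"
    shutil.copytree(winutils_bin, target, dirs_exist_ok=True)  # needs Python 3.8+
    return target


# Example call (both paths are placeholders, adjust to your machine):
# patch_hadoop_bin(r"C:\winutils\hadoop-3.2.1\bin", r"C:\hadoop\hadoop-3.2.1")
```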

## Install the Python libraries

Prefer installing [Anaconda](https://www.anaconda.com/products/distribution). 
It takes care of other dependencies like Pandas, NumPy, Jupyter etc. too.  
Once there, use either pip or conda. Both work, but mixing them in one environment can cause conflicts, so pick one and stick with it.  
The conda-forge packages can lag a little behind the PyPI releases.  
We're only running on the local machine here, no complicated infrastructure to care about.  
So, you do you.  

Use one of the following commands (from the command line obvs) to install each:  
* pyspark:
    * ```pip install pyspark``` or
    * ```conda install -c conda-forge pyspark```
* findspark:
    * ```pip install findspark``` or
    * ```conda install -c conda-forge findspark```

## References  

* [How to install Hadoop on Win 10](https://muhammadbilalyar.github.io/blogs/How-to-install-Hadoop-on-Window-10/)
* [Hadoop on Windows](https://github.com/MuhammadBilalYar/Hadoop-On-Window)
* [Hadoop and Spark on Windows](https://dev.to/awwsmm/installing-and-running-hadoop-and-spark-on-windows-33kc)

### (_Optionally_) Configure Hadoop

*Only needed if you want to use Hadoop (HDFS) as your file storage system.*  

* create a folder for ```namenode```
* create a folder for ```datanode```
* set up four config files: ```core-site.xml```, ```mapred-site.xml```, ```hdfs-site.xml```, ```yarn-site.xml``` - see the code for each in the [reference repo](https://github.com/MuhammadBilalYar/Hadoop-On-Window) above.

### Also,

The scope of this notebook is *usage*, not setup or troubleshooting. I'm pretty sure these installation instructions will be outdated soon, replaced by pre-built Docker images, shell scripts, automated installers for Windows or suchlike.  

# Setup

This boilerplate helps, especially in Jupyter notebooks.

In [1]:
# Step 1: initialize findspark
import findspark
findspark.init()

In [2]:
# Step 2: import pyspark
import pyspark
from pyspark.sql import SparkSession
pyspark.__version__

'3.3.0'

In [3]:
# Step 3: Create a spark session

# 'local[1]' runs Spark on the local machine with 1 worker thread
# use 'local[n]' for n threads, or 'local[*]' for as many threads as there are cores
# use .config("spark.some.config.option", "some-value") for additional configuration

spark = SparkSession \
    .builder \
    .master('local[1]') \
    .appName("10+ minutes to pyspark") \
    .getOrCreate()

# spark

Back in the day you'd need various 'contexts' (```SparkContext```, ```SQLContext```, ```HiveContext```) as entry points into Spark functionality.  
All of this is now wrapped into a ```SparkSession```, which is easier to manage.

In [4]:
# The SparkSession carries the sparkContext
# spark.sparkContext

Check out the Spark UI link when you uncomment the lines in the two cells above.  
Your local UI should launch at a link like: http://localhost:4041/jobs/ (the default port is 4040; Spark moves on to 4041, 4042, ... if that one is already taken).

In [5]:
# before closing the notebook, stop Spark; otherwise Jupyter shuts down but the Spark JVM process keeps running...
# spark.stop()

# DataFrames

## DataFrames: Create and View

In [6]:
from datetime import datetime, date
import numpy as np
import pandas as pd
from pyspark.sql import Row

In [7]:
# use a list of pyspark.sql.Row
df1 = spark.createDataFrame(
    [
        Row(a=1,b=2.,c='span a',d=date(2022,7,1),e=datetime(2022,7,1,12,0)),
        Row(a=1,b=3.,c='can a ',d=date(2022,7,2),e=datetime(2022,7,2,12,0,1)),
        Row(a=1,b=4.,c='banana',d=date(2022,7,3),e=datetime(2022,7,3,12,0,2))
    ]
)

df1

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [8]:
# df1 has not been evaluated yet: Spark evaluates lazily
# to force evaluation, call an action such as show()
df1.show()

+---+---+------+----------+-------------------+
|  a|  b|     c|         d|                  e|
+---+---+------+----------+-------------------+
|  1|2.0|span a|2022-07-01|2022-07-01 12:00:00|
|  1|3.0|can a |2022-07-02|2022-07-02 12:00:01|
|  1|4.0|banana|2022-07-03|2022-07-03 12:00:02|
+---+---+------+----------+-------------------+



In [9]:
# use a list of tuples with explicit schema
df2 = spark.createDataFrame(
    [
        (2,5.,'man a',date(2022,7,1),datetime(2022,7,1,12,0)),
        (2,6.,'can a',date(2022,8,1),datetime(2022,7,2,12,0)),
        (2,7.,'manna',date(2022,9,1),datetime(2022,7,3,12,0))
    ],
    schema = 'a bigint, b double, c string, d date, e timestamp'
)
df2

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [10]:
# use spark RDDs to create a dataframe
# go to the sparksession's sparkContext to access parallelize method
rdd3 = spark.sparkContext.parallelize(
    [
        (3,5., 'main', date(2022,7,1), datetime(2022,7,1,12,0,1)),
        (3,5., 'brain', date(2022,7,1), datetime(2022,7,1,12,0,1)),
        (3,5., 'pain', date(2022,7,1), datetime(2022,7,1,12,0,1))
    ]
)

df3 = spark.createDataFrame(rdd3, schema=['a','b','c','d','e'])

df3

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [11]:
# can also use a pandas dataframe to create a spark dataframe
df4_pd = pd.DataFrame(
    {
        'a': np.random.randint(0,10, size = 3),
        'b': np.random.randn(3),
        'c': ["gandalf's manager", "said", 'no'],
        'd': [date(2022,7,1),date(2022,7,2),date(2022,7,3)],
        'e': [datetime(2022,7,1,12,0,1),datetime(2022,7,2,12,0,2),datetime(2022,7,3,12,0,3)]
    }
)

df4 = spark.createDataFrame(df4_pd)

df4.show()

+---+------------------+-----------------+----------+-------------------+
|  a|                 b|                c|         d|                  e|
+---+------------------+-----------------+----------+-------------------+
|  7|  2.06919879867935|gandalf's manager|2022-07-01|2022-07-01 12:00:01|
|  1|1.2779543149855563|             said|2022-07-02|2022-07-02 12:00:02|
|  4|-1.445713114804323|               no|2022-07-03|2022-07-03 12:00:03|
+---+------------------+-----------------+----------+-------------------+



In [12]:
df4_pd

Unnamed: 0,a,b,c,d,e
0,7,2.069199,gandalf's manager,2022-07-01,2022-07-01 12:00:01
1,1,1.277954,said,2022-07-02,2022-07-02 12:00:02
2,4,-1.445713,no,2022-07-03,2022-07-03 12:00:03


In [13]:
# use printSchema() to..., you know, it says what it does
# also don't you hate that pySpark is following the Java camelCase instead of the Python snake_case?
# yeah, what's that all about?

df1.printSchema()
df2.printSchema()
df3.printSchema()
df4.printSchema()

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)



In [14]:
# show only x rows
df1.show(1)
# vertical - if the row is too long for horizontal display
df4.show(2, vertical=True)

+---+---+------+----------+-------------------+
|  a|  b|     c|         d|                  e|
+---+---+------+----------+-------------------+
|  1|2.0|span a|2022-07-01|2022-07-01 12:00:00|
+---+---+------+----------+-------------------+
only showing top 1 row

-RECORD 0------------------
 a   | 7                   
 b   | 2.06919879867935    
 c   | gandalf's manager   
 d   | 2022-07-01          
 e   | 2022-07-01 12:00:01 
-RECORD 1------------------
 a   | 1                   
 b   | 1.2779543149855563  
 c   | said                
 d   | 2022-07-02          
 e   | 2022-07-02 12:00:02 
only showing top 2 rows



In [15]:
# collect() - collects the entire df from across all nodes to the driver
# if the data doesn't fit in the driver's memory, this is how you crash spark
# so: be careful
df3.collect()

[Row(a=3, b=5.0, c='main', d=datetime.date(2022, 7, 1), e=datetime.datetime(2022, 7, 1, 12, 0, 1)),
 Row(a=3, b=5.0, c='brain', d=datetime.date(2022, 7, 1), e=datetime.datetime(2022, 7, 1, 12, 0, 1)),
 Row(a=3, b=5.0, c='pain', d=datetime.date(2022, 7, 1), e=datetime.datetime(2022, 7, 1, 12, 0, 1))]

In [16]:
# pandas has a take() too, but there it selects rows by integer position
# take(n) in spark returns the first n rows of a dataframe as a list of Row objects
df2.take(2)

[Row(a=2, b=5.0, c='man a', d=datetime.date(2022, 7, 1), e=datetime.datetime(2022, 7, 1, 12, 0)),
 Row(a=2, b=6.0, c='can a', d=datetime.date(2022, 8, 1), e=datetime.datetime(2022, 7, 2, 12, 0))]
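
A note on the pandas side: ```DataFrame.take()``` exists in current pandas, but it picks rows by integer position instead of returning the first n rows. The closest pandas match for Spark's ```take(n)``` is ```head(n)```. A small pandas-only sketch (the frame ```pdf``` is made up for illustration):

```python
import pandas as pd

pdf = pd.DataFrame({"a": [2, 2, 2], "c": ["man a", "can a", "manna"]})

# pandas take(): select rows by position, here the first and the third row
picked = pdf.take([0, 2])
print(picked["c"].tolist())

# pandas head(n): the first n rows, like take(n) in Spark
first_two = pdf.head(2)
print(first_two["c"].tolist())
```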

In [17]:
# same as take() but returns the last n rows of a dataframe
df2.tail(2)

[Row(a=2, b=6.0, c='can a', d=datetime.date(2022, 8, 1), e=datetime.datetime(2022, 7, 2, 12, 0)),
 Row(a=2, b=7.0, c='manna', d=datetime.date(2022, 9, 1), e=datetime.datetime(2022, 7, 3, 12, 0))]

In [18]:
# hey you want another way of crashing the spark driver?
# convert the spark dataframe to a pandas dataframe
# it'll collect all data from all workers into the driver
df3.toPandas()

Unnamed: 0,a,b,c,d,e
0,3,5.0,main,2022-07-01,2022-07-01 12:00:01
1,3,5.0,brain,2022-07-01,2022-07-01 12:00:01
2,3,5.0,pain,2022-07-01,2022-07-01 12:00:01


In [19]:
# how do I make it so my dataframes are evaluated eagerly?
# instead of the regular lazy eval?
# y'know when am in notebooks and stuff?
# set the config
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)

In [20]:
# there you go, eager evaluation
df1

a,b,c,d,e
1,2.0,span a,2022-07-01,2022-07-01 12:00:00
1,3.0,can a,2022-07-02,2022-07-02 12:00:01
1,4.0,banana,2022-07-03,2022-07-03 12:00:02


In [21]:
# set it back to false if you like
spark.conf.set('spark.sql.repl.eagerEval.enabled', False)

In [22]:
# No more eagerly evaluated dataframes
df1

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

# Selecting and Accessing Data

In [23]:
# access a column
df1.a, df2.b, df3.c, df4.d

(Column<'a'>, Column<'b'>, Column<'c'>, Column<'d'>)

In [24]:
from pyspark.sql import Column
from pyspark.sql.functions import upper

In [25]:
type(df1.c) == type(upper(df1.c)) == type(df1.c.isNull())
# all three are Column objects: upper() and isNull() just build new column expressions,
# so isNull() returns a Column (see the next cell), not a Python bool

True

In [26]:
df1.c.isNull()

Column<'(c IS NULL)'>

In [27]:
# use dataframe's select method to identify a column and show() it
df1.select(df1.c).show()

+------+
|     c|
+------+
|span a|
|can a |
|banana|
+------+



In [28]:
# also there's dataframe.filter()
df4.filter(df4.a>0).show()

+---+------------------+-----------------+----------+-------------------+
|  a|                 b|                c|         d|                  e|
+---+------------------+-----------------+----------+-------------------+
|  7|  2.06919879867935|gandalf's manager|2022-07-01|2022-07-01 12:00:01|
|  1|1.2779543149855563|             said|2022-07-02|2022-07-02 12:00:02|
|  4|-1.445713114804323|               no|2022-07-03|2022-07-03 12:00:03|
+---+------------------+-----------------+----------+-------------------+



In [29]:
# spark.stop()

In [30]:
# withColumn() returns a new dataframe with the extra column; df4 itself is not modified
df4_withNewCol = df4.withColumn('upper_c', upper(df4.c))
df4_withNewCol.show()

+---+------------------+-----------------+----------+-------------------+-----------------+
|  a|                 b|                c|         d|                  e|          upper_c|
+---+------------------+-----------------+----------+-------------------+-----------------+
|  7|  2.06919879867935|gandalf's manager|2022-07-01|2022-07-01 12:00:01|GANDALF'S MANAGER|
|  1|1.2779543149855563|             said|2022-07-02|2022-07-02 12:00:02|             SAID|
|  4|-1.445713114804323|               no|2022-07-03|2022-07-03 12:00:03|               NO|
+---+------------------+-----------------+----------+-------------------+-----------------+



In [31]:
df4.show()
df4_withNewCol.show()

+---+------------------+-----------------+----------+-------------------+
|  a|                 b|                c|         d|                  e|
+---+------------------+-----------------+----------+-------------------+
|  7|  2.06919879867935|gandalf's manager|2022-07-01|2022-07-01 12:00:01|
|  1|1.2779543149855563|             said|2022-07-02|2022-07-02 12:00:02|
|  4|-1.445713114804323|               no|2022-07-03|2022-07-03 12:00:03|
+---+------------------+-----------------+----------+-------------------+

+---+------------------+-----------------+----------+-------------------+-----------------+
|  a|                 b|                c|         d|                  e|          upper_c|
+---+------------------+-----------------+----------+-------------------+-----------------+
|  7|  2.06919879867935|gandalf's manager|2022-07-01|2022-07-01 12:00:01|GANDALF'S MANAGER|
|  1|1.2779543149855563|             said|2022-07-02|2022-07-02 12:00:02|             SAID|
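|  4|-1.445713114804323|               no|2022-07-03|2022-07-03 12:00:03|               NO|
+---+------------------+-----------------+----------+-------------------+-----------------+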

In [32]:
# filter on equality: no row in this random sample has a == 9, so the result is empty
df4.filter(df4.a == 9).show()

+---+---+---+---+---+
|  a|  b|  c|  d|  e|
+---+---+---+---+---+
+---+---+---+---+---+



In [33]:
df4.filter(df4.b > 0).show()

+---+------------------+-----------------+----------+-------------------+
|  a|                 b|                c|         d|                  e|
+---+------------------+-----------------+----------+-------------------+
|  7|  2.06919879867935|gandalf's manager|2022-07-01|2022-07-01 12:00:01|
|  1|1.2779543149855563|             said|2022-07-02|2022-07-02 12:00:02|
+---+------------------+-----------------+----------+-------------------+



# UDFs: Applying a Function

In [34]:


import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('long')
def pandas_plus_one(series: pd.Series) -> pd.Series:
    # add one to every element of the incoming pandas Series
    return series+1

df4.select(df4.a, pandas_plus_one(df4.a)).show()

+---+------------------+
|  a|pandas_plus_one(a)|
+---+------------------+
|  7|                 8|
|  1|                 2|
|  4|                 5|
+---+------------------+



In [35]:
def pandas_filter(iterator):
    for pandas_df in iterator:
        yield pandas_df[pandas_df.b>0]
        
df4.mapInPandas(pandas_filter, schema = df4.schema).show()

+---+------------------+-----------------+----------+-------------------+
|  a|                 b|                c|         d|                  e|
+---+------------------+-----------------+----------+-------------------+
|  7|  2.06919879867935|gandalf's manager|2022-07-01|2022-07-01 12:00:01|
|  1|1.2779543149855563|             said|2022-07-02|2022-07-02 12:00:02|
+---+------------------+-----------------+----------+-------------------+



# Grouping Data

Split-Apply-Combine, just like pandas

In [36]:
# group by fruit, color etc.
df5 = spark.createDataFrame([
    ['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
    ['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
    ['red', 'banana', 7, 70], ['red', 'grape', 8, 80]], schema=['color', 'fruit', 'v1', 'v2'])

df5.show()

+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|  red|banana|  1| 10|
| blue|banana|  2| 20|
|  red|carrot|  3| 30|
| blue| grape|  4| 40|
|  red|carrot|  5| 50|
|black|carrot|  6| 60|
|  red|banana|  7| 70|
|  red| grape|  8| 80|
+-----+------+---+---+



In [37]:
df5.groupby('color').avg().show()

+-----+-------+-------+
|color|avg(v1)|avg(v2)|
+-----+-------+-------+
|  red|    4.8|   48.0|
|black|    6.0|   60.0|
| blue|    3.0|   30.0|
+-----+-------+-------+
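
For comparison, here is the same split-apply-combine in plain pandas, using the values of ```df5``` (the name ```pdf5``` is just for this sketch):

```python
import pandas as pd

pdf5 = pd.DataFrame(
    [
        ["red", "banana", 1, 10], ["blue", "banana", 2, 20], ["red", "carrot", 3, 30],
        ["blue", "grape", 4, 40], ["red", "carrot", 5, 50], ["black", "carrot", 6, 60],
        ["red", "banana", 7, 70], ["red", "grape", 8, 80],
    ],
    columns=["color", "fruit", "v1", "v2"],
)

# split by color, apply mean to the numeric columns, combine into one frame
means = pdf5.groupby("color")[["v1", "v2"]].mean()
print(means)
```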



In [38]:
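# TODO: How to get the deviations in a new column?
# subtract the group mean from v1 (this overwrites v1 with its deviation)
# the schema keeps v1 as a long, so the deviations come out truncated to whole numbers
def subtract_mean(pandas_df):
    return pandas_df.assign(v1=pandas_df.v1 - pandas_df.v1.mean())

df5.groupby('color').applyInPandas(subtract_mean, schema = df5.schema).show()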

+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|black|carrot|  0| 60|
| blue|banana| -1| 20|
| blue| grape|  1| 40|
|  red|banana| -3| 10|
|  red|carrot| -1| 30|
|  red|carrot|  0| 50|
|  red|banana|  2| 70|
|  red| grape|  3| 80|
+-----+------+---+---+
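
One way to tackle the TODO above, sketched under assumptions: return the deviation in a new column instead of overwriting ```v1```. The column name ```v1_dev``` and the function name ```add_deviation``` are made up for illustration. Since ```applyInPandas``` needs a schema that matches the returned frame, the extra column has to be declared too, for instance ```schema='color string, fruit string, v1 long, v2 long, v1_dev double'```. Declaring it as ```double``` also avoids the truncation to whole numbers seen in the output above. The function itself is plain pandas and can be checked on a single group without Spark:

```python
import pandas as pd


def add_deviation(pandas_df):
    # keep v1 as it is and add its deviation from the group mean as v1_dev
    return pandas_df.assign(v1_dev=pandas_df.v1 - pandas_df.v1.mean())


# the 'blue' group from df5, built by hand as a pandas DataFrame
blue = pd.DataFrame({
    "color": ["blue", "blue"],
    "fruit": ["banana", "grape"],
    "v1": [2, 4],
    "v2": [20, 40],
})

blue_with_dev = add_deviation(blue)
print(blue_with_dev)
```

With Spark, the call would then look like ```df5.groupby('color').applyInPandas(add_deviation, schema=...)``` with the schema string from above.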



In [39]:
# Co-grouping and applying a function.

df6 = spark.createDataFrame(
    [
        (20000101, 1, 1.0), 
        (20000101, 2, 2.0), 
        (20000102, 1, 3.0), 
        (20000102, 2, 4.0)
    ],
    ('time', 'id', 'v1')
)

df7 = spark.createDataFrame(
    [
        (20000101, 1, 'x'), 
        (20000101, 2, 'y')
    ],
    ('time', 'id', 'v2')
)

In [40]:
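# as-of join: match each left row with the latest right row at or before its time, within the same id
def asof_join(l, r):
    return pd.merge_asof(l,r,on='time', by='id')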

In [41]:
df6_gb = df6.groupby('id')
df7_gb = df7.groupby('id')

In [42]:
co_grp = df6_gb.cogroup(df7_gb)

In [43]:
rslt = co_grp.applyInPandas(asof_join, schema='time int, id int, v1 double, v2 string')
rslt.show()

+--------+---+---+---+
|    time| id| v1| v2|
+--------+---+---+---+
|20000101|  1|1.0|  x|
|20000102|  1|3.0|  x|
|20000101|  2|2.0|  y|
|20000102|  2|4.0|  y|
+--------+---+---+---+
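
What the co-grouped function does is easier to see in plain pandas. ```pd.merge_asof``` is a left join that, for every left row, picks the last right row whose key is less than or equal to the left key (the default ```direction='backward'```), after matching exactly on the ```by``` column. A sketch with the same values as ```df6``` and ```df7```, built directly as pandas frames (```left``` and ```right``` are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({
    "time": [20000101, 20000101, 20000102, 20000102],
    "id": [1, 2, 1, 2],
    "v1": [1.0, 2.0, 3.0, 4.0],
})
right = pd.DataFrame({
    "time": [20000101, 20000101],
    "id": [1, 2],
    "v2": ["x", "y"],
})

# both frames must be sorted by the 'on' key
joined = pd.merge_asof(left, right, on="time", by="id")
print(joined)
```

The rows for 20000102 have no exact match in ```right```, so they pick up the most recent earlier value for the same ```id```.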



# Getting data in and out

In [44]:
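# CSV
# Spark writes 'fruits.csv' as a folder of part files, not a single file
# .mode('overwrite') replaces existing output; without it the write fails if the path already exists
# note: read.csv() returns every column as a string unless you pass inferSchema=True or a schema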
df5.write.mode('overwrite').csv('fruits.csv', header=True)
spark.read.csv('fruits.csv', header=True).show()

+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|  red|banana|  1| 10|
| blue|banana|  2| 20|
|  red|carrot|  3| 30|
| blue| grape|  4| 40|
|  red|carrot|  5| 50|
|black|carrot|  6| 60|
|  red|banana|  7| 70|
|  red| grape|  8| 80|
+-----+------+---+---+
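
Spark writes ```fruits.csv``` as a folder of part files (```part-*.csv```, plus marker files such as ```_SUCCESS```) rather than as a single file. A rough sketch for reading such a folder back without Spark, using only the standard library. ```read_spark_csv_dir``` is a hypothetical helper, and it assumes every part file starts with the header row, which is what ```header=True``` on write produces.

```python
import csv
from pathlib import Path


def read_spark_csv_dir(folder):
    """Read all part-*.csv files in a Spark output folder into a list of dicts."""
    rows = []
    for part in sorted(Path(folder).glob("part-*.csv")):
        with part.open(newline="", encoding="utf-8") as handle:
            # DictReader uses the first line of each part file as the header
            rows.extend(csv.DictReader(handle))
    return rows
```

All values come back as strings, so convert the numeric columns yourself if you need them as numbers.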



In [45]:
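# Parquet
# .mode('overwrite') replaces existing output; without it the write fails if the path already exists
# unlike CSV, Parquet stores the schema, so the column types survive the round trip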
df5.write.mode('overwrite').parquet('fruits.parquet')
spark.read.parquet('fruits.parquet').show()

+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|  red|banana|  1| 10|
| blue|banana|  2| 20|
|  red|carrot|  3| 30|
| blue| grape|  4| 40|
|  red|carrot|  5| 50|
|black|carrot|  6| 60|
|  red|banana|  7| 70|
|  red| grape|  8| 80|
+-----+------+---+---+



# Working with SQL

In [46]:
df5.createOrReplaceTempView('tableA')

In [47]:
spark.sql('select count(*) from tableA').show()

+--------+
|count(1)|
+--------+
|       8|
+--------+



In [48]:
# UDFs in SQL
# register and invoke
@pandas_udf('integer')
def add_one(s: pd.Series) -> pd.Series:
    return s+1

spark.udf.register('add_one', add_one)

<function __main__.add_one(s: pandas.core.series.Series) -> pandas.core.series.Series>

In [49]:
spark.sql('SELECT v1, add_one(v1) from tableA').show()

+---+-----------+
| v1|add_one(v1)|
+---+-----------+
|  1|          2|
|  2|          3|
|  3|          4|
|  4|          5|
|  5|          6|
|  6|          7|
|  7|          8|
|  8|          9|
+---+-----------+



In [50]:
# SQL expressions can be mixed with dataframe code
# e.g. reuse the add_one UDF registered above

from pyspark.sql.functions import expr

df5.selectExpr('add_one(v1)').show()

+-----------+
|add_one(v1)|
+-----------+
|          2|
|          3|
|          4|
|          5|
|          6|
|          7|
|          8|
|          9|
+-----------+



In [51]:
df5.select(expr('count(*)')).show()

+--------+
|count(1)|
+--------+
|       8|
+--------+



In [52]:
df5.select(expr('count(*)')>0).show()

+--------------+
|(count(1) > 0)|
+--------------+
|          true|
+--------------+



# Pandas API on Spark

In [53]:
import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession



In [54]:
# create a pandas on spark series

s1 = ps.Series([1,2,3,np.nan,5,np.nan,7,8,9])
s1

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    NaN
6    7.0
7    8.0
8    9.0
dtype: float64

In [55]:
psdf1 = ps.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]
    },
    index=[10, 20, 30, 40, 50, 60]
)

psdf1

Unnamed: 0,a,b,c
10,1,100,one
20,2,200,two
30,3,300,three
40,4,400,four
50,5,500,five
60,6,600,six


In [56]:
# create a pandas dataframe and convert it to pandas-on-Spark
dates1 = pd.date_range('20220101', periods=6)
pdf1 = pd.DataFrame(np.random.randn(6,4), index=dates1, columns = list('ABCD'))
# create a pandas-on-Spark dataframe from the pandas dataframe (this overwrites the earlier psdf1)
psdf1 = ps.from_pandas(pdf1)

In [57]:
psdf1

Unnamed: 0,A,B,C,D
2022-01-01,-0.180368,-0.071954,-0.969922,1.754936
2022-01-02,2.809529,-0.330146,0.274938,-0.511405
2022-01-03,-0.993214,-1.811281,0.425113,-0.436873
2022-01-04,0.088916,0.336168,0.368839,0.010667
2022-01-05,0.347266,0.703723,0.75873,-0.81983
2022-01-06,-0.82359,-0.639242,-1.192412,0.186218


In [62]:
# create a SPARK df from a PANDAS df (we've done this before)
# note: the pandas index (the dates) is not carried over into the spark df
# creating a PANDAS-ON-SPARK df from a SPARK df comes in the next cell

sdf2 = spark.createDataFrame(pdf1)
sdf2.show()

+-------------------+--------------------+-------------------+--------------------+
|                  A|                   B|                  C|                   D|
+-------------------+--------------------+-------------------+--------------------+
|-0.1803676446120318|-0.07195374148708228|-0.9699224383554612|   1.754936270711452|
| 2.8095287412059653|-0.33014550044668356|0.27493761020477425| -0.5114051599724003|
|-0.9932135592367661| -1.8112805783021024|0.42511273144125733|-0.43687264087575084|
|0.08891577560108033| 0.33616818654299235| 0.3688387884737041|  0.0106672238161948|
|0.34726638819126543|  0.7037233061126154| 0.7587298906825163|  -0.819830479668479|
|-0.8235902111006436| -0.6392421962423047|-1.1924118488403428| 0.18621838183614017|
+-------------------+--------------------+-------------------+--------------------+



In [63]:
psdf2 = sdf2.pandas_api()
# this works like a pandas df now...
psdf2

Unnamed: 0,A,B,C,D
0,-0.180368,-0.071954,-0.969922,1.754936
1,2.809529,-0.330146,0.274938,-0.511405
2,-0.993214,-1.811281,0.425113,-0.436873
3,0.088916,0.336168,0.368839,0.010667
4,0.347266,0.703723,0.75873,-0.81983
5,-0.82359,-0.639242,-1.192412,0.186218


In [64]:
psdf2.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

In [68]:
# pandas head()
psdf2.head()

Unnamed: 0,A,B,C,D
0,-0.180368,-0.071954,-0.969922,1.754936
1,2.809529,-0.330146,0.274938,-0.511405
2,-0.993214,-1.811281,0.425113,-0.436873
3,0.088916,0.336168,0.368839,0.010667
4,0.347266,0.703723,0.75873,-0.81983


In [69]:
psdf2.index

Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [70]:
psdf2.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [71]:
psdf2.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.20809,-0.302122,-0.055786,0.030619
std,1.37546,0.879103,0.813873,0.920161
min,-0.993214,-1.811281,-1.192412,-0.81983
25%,-0.82359,-0.639242,-0.969922,-0.511405
50%,-0.180368,-0.330146,0.274938,-0.436873
75%,0.347266,0.336168,0.425113,0.186218
max,2.809529,0.703723,0.75873,1.754936


In [73]:
# warning - to_numpy() collects all the data to the driver, just like collect()
# it can crash the driver if the data doesn't fit in memory, so be careful
psdf2.to_numpy()



array([[-0.18036764, -0.07195374, -0.96992244,  1.75493627],
       [ 2.80952874, -0.3301455 ,  0.27493761, -0.51140516],
       [-0.99321356, -1.81128058,  0.42511273, -0.43687264],
       [ 0.08891578,  0.33616819,  0.36883879,  0.01066722],
       [ 0.34726639,  0.70372331,  0.75872989, -0.81983048],
       [-0.82359021, -0.6392422 , -1.19241185,  0.18621838]])

In [74]:
# transpose
psdf2.T

Unnamed: 0,0,1,2,3,4,5
A,-0.180368,2.809529,-0.993214,0.088916,0.347266,-0.82359
B,-0.071954,-0.330146,-1.811281,0.336168,0.703723,-0.639242
C,-0.969922,0.274938,0.425113,0.368839,0.75873,-1.192412
D,1.754936,-0.511405,-0.436873,0.010667,-0.81983,0.186218


In [75]:
# sort index
psdf2.sort_index(ascending = False)

Unnamed: 0,A,B,C,D
5,-0.82359,-0.639242,-1.192412,0.186218
4,0.347266,0.703723,0.75873,-0.81983
3,0.088916,0.336168,0.368839,0.010667
2,-0.993214,-1.811281,0.425113,-0.436873
1,2.809529,-0.330146,0.274938,-0.511405
0,-0.180368,-0.071954,-0.969922,1.754936


In [77]:
# sort values
psdf2.sort_values(by='B', ascending = True)

Unnamed: 0,A,B,C,D
2,-0.993214,-1.811281,0.425113,-0.436873
5,-0.82359,-0.639242,-1.192412,0.186218
1,2.809529,-0.330146,0.274938,-0.511405
0,-0.180368,-0.071954,-0.969922,1.754936
3,0.088916,0.336168,0.368839,0.010667
4,0.347266,0.703723,0.75873,-0.81983


# Missing Data - Pandas on Spark