# Partitioning

## Prerequisites

Install Spark and Java in the VM

In [11]:
# install Java 8 (headless JDK)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download Spark 3.5.3 (prebuilt for Hadoop 3)
!wget -q https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

In [12]:
ls -l # check the .tgz is there

total 391476
drwxr-xr-x 1 root root      4096 Nov 25 19:13 sample_data/
-rw-r--r-- 1 root root 400864419 Sep  9 05:35 spark-3.5.3-bin-hadoop3.tgz


In [13]:
# extract the archive
!tar xf spark-3.5.3-bin-hadoop3.tgz

In [14]:
!pip install -q findspark

Defining the environment

In [15]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.3-bin-hadoop3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"
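
Before starting the session, it can help to confirm that the paths set above actually exist, since a wrong `JAVA_HOME` or `SPARK_HOME` usually only surfaces later as a failure when Spark starts. The following is a minimal sketch; `missing_dirs` is a hypothetical helper name, not part of Spark or findspark.

```python
import os

def missing_dirs(var_names, env=os.environ):
    """Return the variables that are unset or do not point to an existing directory."""
    return [name for name in var_names if not os.path.isdir(env.get(name, ""))]

# An empty list means both directories were found
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```

If the list is not empty, re-check the install and extract steps before continuing.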

Start Spark Session

---

In [16]:
import findspark
findspark.init(os.environ["SPARK_HOME"])  # use the SPARK_HOME set above

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("Window Partitioning Exercises") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.5.3'

In [17]:
spark

In [18]:
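# Enable Arrow to speed up Spark <-> pandas conversion
# (this key replaces spark.sql.execution.arrow.enabled, deprecated since Spark 3.0)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")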

In [19]:
# Import sql functions
from pyspark.sql.functions import *

Download datasets

In [20]:
!mkdir -p /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/bank.csv -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/vehicles.csv -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/characters.csv -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/planets.csv -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/species.csv -P /dataset
!ls /dataset

bank.csv  characters.csv  planets.csv  species.csv  vehicles.csv


## Examples

### Partitioning

In [21]:
# Load characters CSV
charactersDF = spark.read.option("inferSchema", "true").option("header", "true").csv("/dataset/characters.csv")

In [22]:
# Show how the data is partitioned now
charactersDF \
  .withColumn("partitionId", spark_partition_id()) \
  .groupBy("partitionId") \
  .count() \
  .orderBy(col("count").desc()) \
  .show()

+-----------+-----+
|partitionId|count|
+-----------+-----+
|          0|   87|
+-----------+-----+



In [23]:
# We will now repartition the DF to 20 partitions
charactersRepDF = charactersDF.repartition(20)
charactersRepDF \
  .withColumn("partitionId", spark_partition_id()) \
  .groupBy("partitionId") \
  .count() \
  .orderBy(col("partitionId")) \
  .show()

+-----------+-----+
|partitionId|count|
+-----------+-----+
|          0|    4|
|          1|    4|
|          2|    4|
|          3|    4|
|          4|    4|
|          5|    4|
|          6|    4|
|          7|    4|
|          8|    4|
|          9|    5|
|         10|    5|
|         11|    5|
|         12|    5|
|         13|    5|
|         14|    5|
|         15|    5|
|         16|    4|
|         17|    4|
|         18|    4|
|         19|    4|
+-----------+-----+
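
The counts above are what an even spread of 87 rows over 20 partitions looks like: each partition gets 4 rows, and the 7 leftover rows go one each to 7 partitions. The arithmetic can be sketched in plain Python. `even_split_sizes` is a hypothetical helper, not a Spark function, and it only predicts the sizes, not which partition ids receive the extra row (that depends on Spark internals).

```python
def even_split_sizes(num_rows, num_partitions):
    """Partition sizes when rows are spread as evenly as possible."""
    base, extra = divmod(num_rows, num_partitions)
    # `extra` partitions hold one additional row; the rest hold `base` rows
    return [base + 1] * extra + [base] * (num_partitions - extra)

sizes = even_split_sizes(87, 20)
print(sizes.count(4), sizes.count(5))  # prints: 13 7
```

This matches the table: 13 partitions with 4 rows and 7 partitions with 5 rows.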



In [24]:
# Now we can use coalesce to reduce the number of partitions
charactersRepDF \
  .coalesce(5) \
  .withColumn("partitionId", spark_partition_id()) \
  .groupBy("partitionId") \
  .count() \
  .orderBy(col("partitionId")) \
  .show()

+-----------+-----+
|partitionId|count|
+-----------+-----+
|          0|   17|
|          1|   18|
|          2|   18|
|          3|   17|
|          4|   17|
+-----------+-----+
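
A quick sanity check on the three outputs above: changing the number of partitions should only move rows around, never add or drop them. The sketch below uses the counts copied by hand from the tables in this notebook (the variable names are illustrative) and confirms that each layout still holds all 87 rows.

```python
# Row counts per partition, copied from the outputs above
original = [87]
repartitioned = [4] * 9 + [5] * 7 + [4] * 4  # partition ids 0-8, 9-15, 16-19
coalesced = [17, 18, 18, 17, 17]

layouts = [
    ("original", original),
    ("repartition(20)", repartitioned),
    ("coalesce(5)", coalesced),
]
for name, counts in layouts:
    print(f"{name}: {len(counts)} partitions, {sum(counts)} rows")
```

Every line should report 87 rows; only the partition count changes.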



## Partitioning Exercises

1. Try repartition/coalesce yourself