# Spark Optimization

## Getting Started

- Create and activate a Conda environment

  ```bash
  conda update -n base conda
  conda create -n spark python=3.11
  conda activate spark
  pip install --upgrade pip
  ```

- Install PySpark

  ```bash
  pip install pyspark==3.5.5
  ```

- Install ipykernel to run the notebook

  ```bash
  pip install ipykernel
  ```

- Install wget to download remote files for processing

  ```bash
  pip install wget
  ```

In [1]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("spark-optimizations")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4")
    .getOrCreate()
)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/12 07:44:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/05/12 07:44:39 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [2]:
import wget
import os

# Sample CSV files - https://github.com/datablist/sample-csv-files?tab=readme-ov-file

# url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
url = "https://drive.google.com/uc?id=1N1xoxgcw2K3d-49tlchXAWw4wuxLj7EV&export=download"
tmp_dir = "/tmp/"
local_path = tmp_dir + "customers-100000.csv"

if os.path.exists(local_path):
    os.remove(local_path)

wget.download(url, local_path)

'/tmp/customers-100000.csv'
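
The cell above deletes and re-downloads the file on every run. As a minimal sketch of an alternative that reuses an existing copy, the helper below relies only on the standard library. `download_if_missing` is a hypothetical name that is not part of the original notebook, and the sketch assumes the URL serves the file directly without an intermediate confirmation page.

```python
import os
import urllib.request


def download_if_missing(url: str, dest: str) -> str:
    """Download url to dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        # urlretrieve writes the response body to the destination file
        urllib.request.urlretrieve(url, dest)
    return dest
```

With the variables from the cell above, this could be called as `download_if_missing(url, local_path)`.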

In [None]:
# Load the downloaded CSV file into a DataFrame (inferSchema=True costs an extra pass over the data)
df = spark.read.csv(
    "file://" + local_path,
    header=True,
    inferSchema=True,
)

df.show(5)  # show the first 5 rows of the DataFrame

                                                                                

+-----+---------------+----------+----------+--------------------+--------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|Index|    Customer Id|First Name| Last Name|             Company|          City|             Country|             Phone 1|             Phone 2|               Email|Subscription Date|             Website|
+-----+---------------+----------+----------+--------------------+--------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|    1|ffeCAb7AbcB0f07|     Jared|    Jarvis|    Sanchez-Fletcher| Hatfieldshire|             Eritrea|  274.188.8773x41185|001-215-760-4642x969|gabriellehartman@...|       2021-11-11|https://www.mccar...|
|    2|b687FfC4F1600eC|     Marie|    Malone|           Mckay PLC|Robertsonburgh|            Botswana|        283-236-9529| (189)129-8356x63741|kstafford@sexton.com|       2021-05-

In [None]:
# Cache and persist the DataFrame
from pyspark import StorageLevel

df.cache()  # Shorthand for persist() with the default storage level (memory, spilling to disk)
df.persist(StorageLevel.MEMORY_AND_DISK)  # Redundant: df is already cached, so Spark logs the warning below
df.unpersist()  # Remove the DataFrame from memory and disk

25/05/12 07:44:58 WARN CacheManager: Asked to cache already cached data.


DataFrame[Index: int, Customer Id: string, First Name: string, Last Name: string, Company: string, City: string, Country: string, Phone 1: string, Phone 2: string, Email: string, Subscription Date: date, Website: string]

In [6]:
# Broadcast Join

from pyspark.sql.functions import broadcast

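# Self-join for demonstration; broadcast() hints Spark to ship the right side to every executor.
# Both sides share column names, so the result contains duplicate columns (see output).
df_selfjoin = df.join(broadcast(df), "Customer Id")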

df_selfjoin.show(5)  # show the first 5 rows of the DataFrame

+---------------+-----+----------+----------+--------------------+--------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+-----+----------+----------+--------------------+--------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|    Customer Id|Index|First Name| Last Name|             Company|          City|             Country|             Phone 1|             Phone 2|               Email|Subscription Date|             Website|Index|First Name| Last Name|             Company|          City|             Country|             Phone 1|             Phone 2|               Email|Subscription Date|             Website|
+---------------+-----+----------+----------+--------------------+--------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+-----+------

In [7]:
from pyspark.sql.functions import when

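# Note: the string arguments are literal values, not column references,
# so "First Name" is overwritten with the text "Phone 1" or "Phone 2"
df.withColumn("First Name", when(df["Index"] > 2, "Phone 1").otherwise("Phone 2")).show(5)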

+-----+---------------+----------+----------+--------------------+--------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|Index|    Customer Id|First Name| Last Name|             Company|          City|             Country|             Phone 1|             Phone 2|               Email|Subscription Date|             Website|
+-----+---------------+----------+----------+--------------------+--------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|    1|ffeCAb7AbcB0f07|   Phone 2|    Jarvis|    Sanchez-Fletcher| Hatfieldshire|             Eritrea|  274.188.8773x41185|001-215-760-4642x969|gabriellehartman@...|       2021-11-11|https://www.mccar...|
|    2|b687FfC4F1600eC|   Phone 2|    Malone|           Mckay PLC|Robertsonburgh|            Botswana|        283-236-9529| (189)129-8356x63741|kstafford@sexton.com|       2021-05-