# Data Processing 

Hydrating a system with data is an essential part of a well-oiled recommendation system. As datasets grow larger, it becomes beneficial to use specialized computing libraries that can handle the data at scale.

## PySpark 

PySpark is the Python API for Spark, a computing library used to process and transform large-scale datasets.

PySpark also provides a convenient SQL API that lets us run SQL queries directly against large datasets.

## Simple PySpark Recommender System 

`SparkSession` serves as the entry point for all Spark functionality.

You *must* start a session before you can build or manipulate anything in Spark.

In [1]:
from pyspark.sql import SparkSession

# start a spark session 
spark = (
    SparkSession.builder
    .appName("simple-spark-recommender")
    .config("spark.memory.offHeap.enabled", "true")  # let Spark allocate memory outside the JVM heap
    .config("spark.memory.offHeap.size", "10g")  # amount of off-heap memory Spark may use
    .getOrCreate()
)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/27 13:22:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# create a dataframe 
df = spark.read.csv("../data/saved_track_features.csv", header=True, escape="\"")

In [3]:
# show the first 5 rows without truncating long column values
df.show(5, truncate=False)

+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+--------------+----------------------+------------------------------------+--------------------------------------------------------+----------------------------------------------------------------+-----------+--------------+
|danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence|tempo  |type          |id                    |uri                                 |track_href                                              |analysis_url                                                    |duration_ms|time_signature|
+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+--------------+----------------------+------------------------------------+--------------------------------------------------------+----------------------------------------------------------------+-----------+--------------+
|0.756 

In [6]:
# get n observations
print(f"There are {df.count()} observations in this dataset")

There are 1716 observations in this dataset


In [7]:
# how many unique ids are there in this dataset
n_unique_ids = df.select("id").distinct().count()

print(f"There are {n_unique_ids} unique IDs in this dataset")

There are 1716 unique IDs in this dataset


Build aggregate tables with the DataFrame methods `.groupBy()` and `.agg()`, combined with aggregate functions such as `countDistinct()` from `pyspark.sql.functions`.

In [10]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# what is the most common key -- generate a frequency table
key_freq = (
    df.groupBy("key")
        .agg(countDistinct("id").alias("count"))
        .orderBy(desc("count"))
)

key_freq.show()

+---+-----+
|key|count|
+---+-----+
|  1|  224|
|  0|  210|
|  7|  163|
|  5|  156|
|  6|  141|
|  2|  140|
|  9|  132|
| 11|  131|
|  8|  130|
|  4|  126|
| 10|  104|
|  3|   59|
+---+-----+

