# Data Analysis using __PySpark__  
*Fun with the __MovieLens__ dataset*  

**Part 1: Overview, Starting Spark and Loading the data**

<font color='green'>__Support for Google Colab__  </font>

Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/03-Spark/002.01-Analyze-MovieLens-using-PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

<font color='green'>Uncomment and execute the cell below to set up and run this Spark notebook on Google Colab.</font>

## Update Jan 2025 
It has been very frustrating to run Spark 3.5.x with Python 3.12.  
While the official documentation says it works, the Python workers for PySpark have been crashing randomly.   
Not cool!  

So for the time being we'll stick with Spark 3.4.3 (ditto for PySpark: v3.4.3).  
See [this](https://www.reddit.com/r/dataengineering/comments/1dupogi/wasted_45_hours_to_install_pyspark_locally_pain/) Reddit post that shares my frustration.  
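Before chasing worker crashes, it helps to confirm which Python and PySpark versions the notebook kernel actually sees. The snippet below is a minimal sketch that uses only the standard library; ```environment_report``` is a hypothetical helper name, not part of PySpark.

```python
import sys
from importlib import metadata


def environment_report() -> dict:
    """Collect the Python and PySpark versions visible to this interpreter."""
    try:
        pyspark_version = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        pyspark_version = None  # PySpark is not installed in this environment
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "pyspark": pyspark_version,
    }


report = environment_report()
print(report)
```

If the reported pair is not the one you meant to install, fix the environment first and only then start the Spark session.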

In [1]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on windows)

# # grab spark
# # as of 2023-06-23, the latest version is 3.4.1, get the link from Apache Spark's website
# ! wget -q https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
# # unzip spark
# !tar xf spark-3.4.1-bin-hadoop3.tgz
# # install findspark package
# !pip install -q findspark
# # Let's download and unzip the MovieLens 25M Dataset as well.
# ! mkdir ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-25m.zip
# ! unzip ./ml-25m.zip -d ./../data/

# # need to provide the JAVA_HOME and SPARK_HOME variables
# import os
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
# # IMPORTANT - UPDATE THE SPARK_HOME PATH BASED ON THE PACKAGE YOU DOWNLOAD
# os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"
# ! echo "DONE"

## Overview
This series covers the PySpark DataFrame features most commonly used in data analysis: select, filter, join, groupby, pivot, and windows.  
Instead of toy examples and '10 minutes to xx', we load an actual dataset and ask meaningful questions about it.
  
We'll use the [MovieLens](https://grouplens.org/datasets/movielens/#:~:text=MovieLens%2025M%20Dataset) dataset for these exercises.  
This dataset is non-trivial and should expand to about __1GB__ on your local disk.  

Download and unzip [MovieLens 25M Dataset](https://grouplens.org/datasets/movielens/25m/) for this analysis.

Either ensure the data is in the ```"../data/ml-25m"``` folder or update the path to the data below.

**Citation**:  
*F. Maxwell Harper and Joseph A. Konstan.* 2015.  
The MovieLens Datasets: History and Context.  
ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>  

You got this.  


## Approach

The idea is to tackle simple Spark use-cases first and move on to more complex ones.  

Starting with simply loading the data into a dataframe, we then perform a data evaluation, some cleanup and finally analysis. We first ask questions based on individual data files, then move on to combining data from multiple files.

We are going to try and avoid the more mathematically involved parts of exploratory data analysis, e.g. statistical analysis of individual features. The core focus is the ability to grok PySpark functions and have fun while doing it.  

By the end you'll have not only a working idea of PySpark, but also a feel for how we ask questions of a chunk of data and analyze it.  

_You may also end up with a watch-list to binge on your next weekend._ :)   

## Setup the Spark Cluster

In [2]:
# Step 1: initialize findspark
import findspark

findspark.init()

In [3]:
# Step 2: import pyspark
import pyspark
from pyspark.sql import SparkSession

pyspark.__version__

'3.5.4'

In [4]:
# Step 3: Create a spark session

# 'local[1]' runs Spark on 1 core of the local machine; change the number to use more cores
# 'local[*]' uses as many logical cores as are available; stick with 1 when in doubt
# use .config("spark.some.config.option", "some-value") for additional configuration

spark = (
    SparkSession.builder.master("local[1]")
    .appName("Analyzing Movielens Data")
    .getOrCreate()
)
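To decide between ```local[1]```, ```local[N]``` and ```local[*]```, it can help to see how many logical cores Python reports on your machine. This is a small sketch: ```local_master``` is a hypothetical helper that only builds the master string, and the count from ```os.cpu_count()``` is assumed to be a reasonable proxy for what ```local[*]``` would use.

```python
import os


def local_master(cores=None) -> str:
    """Build a Spark master URL for local mode.

    cores=None maps to "local[*]" (let Spark use every logical core).
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return f"local[{cores}]"


print("logical cores reported by Python:", os.cpu_count())
print(local_master())   # local[*]
print(local_master(1))  # local[1]
```

The resulting string is what you would pass to ```SparkSession.builder.master(...)```.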

In [5]:
# spark

# ...to read and load the data *correctly*

This is typically the first problem you need to work out. You'll see.  
  
If you've downloaded and unzipped the data, you'll see that some of the files are quite large (genome-scores.csv is 400+ MB, ratings.csv is 600+ MB).  

So before we start loading the data to explore further, let's go through the [readme](https://files.grouplens.org/datasets/movielens/ml-25m-README.html) file to build a strategy for loading and analyzing data without clogging up the system.  

In real life, you'll either have to load files in small chunks to work out a strategy or rely on a defined schema for the data.  
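One cheap way to look at a large file in a small chunk is to read only its first few lines. The sketch below uses plain Python; ```peek``` is a hypothetical helper, and an in-memory ```io.StringIO``` stands in for the real file so the snippet runs without the dataset (the sample rows are copied from the ratings output shown later in this notebook).

```python
import io
from itertools import islice


def peek(lines, n=5):
    """Return the first n lines of an iterable of lines without reading the rest."""
    return [line.rstrip("\n") for line in islice(lines, n)]


# Stand-in for open("../data/ml-25m/ratings.csv", encoding="utf-8")
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,296,5.0,1147880044\n"
    "1,306,3.5,1147868817\n"
    "1,307,5.0,1147868828\n"
)

for line in peek(sample, n=3):
    print(line)
```

Because a file object is read line by line, the same call on a 600+ MB file touches only the lines you ask for.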

## Schema Spec

Here's the list of files (as of Aug 2022) that you get when you unzip the dataset:
1. **movies**.csv - list of movies with at least one rating.  
    Header: ```movieId,title,genres```  
1. **links**.csv - IDs to generate links to the movie listing on imdb.com and themoviedb.org  
    Header: ```movieId,imdbId,tmdbId```  
1. **ratings**.csv - Each line of this file after the header row represents one rating of one movie by one user.  
    Header: ```userId,movieId,rating,timestamp```  
1. **tags**.csv - Each line of this file after the header row represents one tag applied to one movie by one user.  
    Header: ```userId,movieId,tag,timestamp```  
1. Tag Genome: The tag genome contains tag relevance scores for movies. See [this paper](http://files.grouplens.org/papers/tag_genome.pdf)  
    1. **genome-tags**.csv - A list of tags  
        Header: ```tagId,tag```  
    1. **genome-scores**.csv - Each movie in the genome has a relevance score value for every tag in the genome  
        Header: ```movieId,tagId,relevance```  
1. README.txt - Check out the README.txt for more details about the files.  
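The headers listed above can be checked against a file before handing it to Spark. This is a sketch in plain Python: ```EXPECTED_HEADERS``` and ```header_matches``` are hypothetical names, and the in-memory sample stands in for an opened CSV file.

```python
import csv
import io

# Header rows as listed in the schema spec above
EXPECTED_HEADERS = {
    "movies.csv": ["movieId", "title", "genres"],
    "links.csv": ["movieId", "imdbId", "tmdbId"],
    "ratings.csv": ["userId", "movieId", "rating", "timestamp"],
    "tags.csv": ["userId", "movieId", "tag", "timestamp"],
    "genome-tags.csv": ["tagId", "tag"],
    "genome-scores.csv": ["movieId", "tagId", "relevance"],
}


def header_matches(file_name, handle) -> bool:
    """Compare the first CSV row of an open file with the expected header."""
    first_row = next(csv.reader(handle), [])
    return first_row == EXPECTED_HEADERS[file_name]


# Stand-in for open("../data/ml-25m/movies.csv", encoding="utf-8", newline="")
sample = io.StringIO("movieId,title,genres\n1,Toy Story (1995),Adventure\n")
print(header_matches("movies.csv", sample))
```

Only the first row is read, so the check stays cheap even for the largest files.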

## Data encoding details

From the Readme file, we have the following observations about the data:
1. Each file is a CSV with a single header row
1. Separator char is ```,```
1. Escape char is ```"```
1. Encoding is UTF-8

Let's set these options when reading the CSV files.
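To see why the quote character matters, here is a sketch with Python's standard ```csv``` module rather than Spark. The data row is an illustrative line in the style of movies.csv: the title contains a comma, so the field is wrapped in double quotes.

```python
import csv
import io

# Illustrative sample: header plus one row whose title contains a comma
raw = (
    "movieId,title,genres\n"
    '11,"American President, The (1995)",Comedy|Drama|Romance\n'
)

rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))
header, first = rows[0], rows[1]

print(len(first))  # 3
print(first[1])    # American President, The (1995)
```

Without the quote handling, a naive split on ```,``` would break this row into four fields instead of three.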

## Specify the schema for Spark  
  
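Avoid ```inferSchema``` where possible: an explicit schema is cleaner, skips the extra pass over the data that inference needs, and keeps the column types predictable.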

In [6]:
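# import only the types the schemas below use, instead of a wildcard import
from pyspark.sql.types import (
    DecimalType,
    FloatType,
    StringType,
    StructField,
    StructType,
)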

In [7]:
# schema for movies.csv
schema_movies = StructType(
    [
        StructField("movieId", StringType(), False),
        StructField("title", StringType(), False),
        StructField("genres", StringType(), True),
    ]
)

In [8]:
# schema for links.csv
schema_links = StructType(
    [
        StructField("movieId", StringType(), False),
        StructField("imdbId", StringType(), True),
        StructField("tmdbId", StringType(), True),
    ]
)

In [9]:
# schema for ratings.csv
schema_ratings = StructType(
    [
        StructField("userId", StringType(), False),
        StructField("movieId", StringType(), False),
        StructField("rating", FloatType(), True),
        StructField("timestamp", StringType(), True),
    ]
)

In [10]:
# schema for tags.csv
schema_tags = StructType(
    [
        StructField("userId", StringType(), False),
        StructField("movieId", StringType(), False),
        StructField("tag", StringType(), True),
        StructField("timestamp", StringType(), True),
    ]
)

In [11]:
# schema for genome-tags.csv
schema_genome_tags = StructType(
    [
        StructField("tagId", StringType(), False),
        StructField("tag", StringType(), False),
    ]
)

In [12]:
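# schema for genome-scores.csv
# using signed decimals (java.math.BigDecimal) for relevance scores
# NOTE: DecimalType() defaults to precision 10 and scale 0, so fractional scores are rounded to whole numbers;
# pass an explicit precision and scale, e.g. DecimalType(10, 5), to keep the fractional part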
schema_genome_scores = StructType(
    [
        StructField("movieId", StringType(), False),
        StructField("tagId", StringType(), False),
        StructField("relevance", DecimalType(), False),
    ]
)

## Specify the location of your data  

Change this folder if you saved the data somewhere else.

In [13]:
datalocation = "../data/ml-25m/"

In [14]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"
file_path_genome_tags = datalocation + "genome-tags.csv"
file_path_genome_scores = datalocation + "genome-scores.csv"
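A quick existence check catches a wrong data folder before Spark produces a longer error. This is a sketch in plain Python; ```missing_files``` is a hypothetical helper, and the folder passed to it is assumed to be the same ```datalocation``` used above.

```python
from pathlib import Path

# The six CSV files this notebook loads
EXPECTED_FILES = [
    "movies.csv",
    "links.csv",
    "ratings.csv",
    "tags.csv",
    "genome-tags.csv",
    "genome-scores.csv",
]


def missing_files(folder):
    """Return the expected file names that are not present in folder."""
    base = Path(folder)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


missing = missing_files("../data/ml-25m/")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All MovieLens files found.")
```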

## Load the data and review

Let's load each file in turn and observe, just to get a sense of familiarity with the data.  

### A note on comparing the *method-chaining* syntax between pandas and pyspark  

Pandas code is often written in that nice "method chaining" style where you club everything in parens  
and write one operation per line.  
The same style works in PySpark, and the cells below use it.  
The alternative is the multi-line format: end each line with a space and a backslash,  
and Python joins the next line onto the same logical line of code.  

The good thing about the parenthesized syntax is that  
you can comment out a line and the next one is picked up just fine.  
In pandas you can also ```pipe()``` things to another function for debugging or capturing state.  
Commenting out a line in the middle of a backslash-continued chain definitely breaks it.  
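The difference between the two continuation styles is plain Python syntax, so it can be shown without pandas or Spark. In this sketch a string stands in for a DataFrame; the backslash variant is kept as source text and passed to ```compile()``` so that the snippet itself still runs.

```python
raw_title = "  Toy Story (1995)  "

# Parenthesized chain: one step is commented out and the rest still runs
cleaned = (
    raw_title
    .strip()
    # .upper()  # a step commented out mid-chain; the expression still parses
    .replace("(1995)", "")
    .strip()
)
print(cleaned)  # Toy Story

# Backslash chain with the same step commented out, held as source text
backslash_chain = (
    'cleaned = raw_title.strip() \\\n'
    '    # .upper() \\\n'
    '    .replace("(1995)", "")\n'
)
try:
    compile(backslash_chain, "<chain>", "exec")
    backslash_chain_ok = True
except SyntaxError:
    # The comment swallows the trailing backslash, so the chain is cut short
    backslash_chain_ok = False

print(backslash_chain_ok)  # False
```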

### Movies

In [15]:
movies_raw = (
    spark.read.format("csv")
    .option("encoding", "UTF-8")
    .option("header", True)
    .option("sep", ",")
    .option("escape", '"')
    .schema(schema_movies)
    .load(file_path_movies)
)

In [16]:
# Spark collects all the transformations needed,
# and execution doesn't begin until an "action" is triggered
#
# 'show' is an action: it triggers execution, limiting computation
# (where relevant) to the number of rows you want to display
movies_raw.show(10, False)

+-------+----------------------------------+-------------------------------------------+
|movieId|title                             |genres                                     |
+-------+----------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)|Comedy                                     |
|6      |Heat (1995)                       |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                    |Comedy|Romance                             |
|8      |Tom and Huck (1995)               |Adventure|Children                         |
|9      |Sudden Death

### RDDs and DataFrames  

RDDs are Spark's fundamental data structure.
DataFrames are a higher-level abstraction built on top of RDDs.
DataFrame queries pass through Spark's optimizer before they are translated to RDD operations, so the resulting execution plan is usually efficient.
Prefer DataFrames unless RDDs are absolutely needed: the API is cleaner and performance is typically better.

References:
* [RDD Actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions) - this stuff triggers execution
* [RDD Transformations](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations) - these are accumulated in the DAG and executed in order when an action is triggered
* [DataFrames: Quickstart](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html#Viewing-Data) - This is what we'll leverage in the workshop
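The split between transformations (recorded, not run) and actions (which trigger the work) has a rough analogy in plain Python generators. The sketch below is only an analogy and does not use Spark; ```read_rows``` and ```log``` are hypothetical names.

```python
log = []  # records when work actually happens


def read_rows():
    for title in ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"]:
        log.append(f"read {title}")
        yield title


# "Transformation": building the pipeline does no work yet
pipeline = (title.upper() for title in read_rows())
work_before_action = list(log)
print(work_before_action)  # []

# "Action": asking for one result pulls just enough data through the pipeline
first = next(pipeline)
work_after_action = list(log)
print(first)               # TOY STORY (1995)
print(work_after_action)   # ['read Toy Story (1995)']
```

Only one row was read to produce one result, which is similar in spirit to ```show(10)``` limiting computation to the rows it needs to display.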

Onwards with loading and viewing the rest of the data files.

### Links

In [17]:
links_raw = (
    spark.read.format("csv")
    .option("encoding", "UTF-8")
    .option("header", True)
    .option("sep", ",")
    .option("escape", '"')
    .schema(schema_links)
    .load(file_path_links)
)

In [18]:
links_raw.show(10, False)

+-------+-------+------+
|movieId|imdbId |tmdbId|
+-------+-------+------+
|1      |0114709|862   |
|2      |0113497|8844  |
|3      |0113228|15602 |
|4      |0114885|31357 |
|5      |0113041|11862 |
|6      |0113277|949   |
|7      |0114319|11860 |
|8      |0112302|45325 |
|9      |0114576|9091  |
|10     |0113189|710   |
+-------+-------+------+
only showing top 10 rows



### Ratings

In [19]:
ratings_raw = (
    spark.read.format("csv")
    .option("encoding", "UTF-8")
    .option("header", True)
    .option("sep", ",")
    .option("escape", '"')
    .schema(schema_ratings)
    .load(file_path_ratings)
)

In [20]:
ratings_raw.show(10, False)

+------+-------+------+----------+
|userId|movieId|rating|timestamp |
+------+-------+------+----------+
|1     |296    |5.0   |1147880044|
|1     |306    |3.5   |1147868817|
|1     |307    |5.0   |1147868828|
|1     |665    |5.0   |1147878820|
|1     |899    |3.5   |1147868510|
|1     |1088   |4.0   |1147868495|
|1     |1175   |3.5   |1147868826|
|1     |1217   |3.5   |1147878326|
|1     |1237   |5.0   |1147868839|
|1     |1250   |4.0   |1147868414|
+------+-------+------+----------+
only showing top 10 rows
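The ```timestamp``` column is loaded as a string here. The MovieLens README describes these values as seconds since midnight UTC of January 1, 1970, so a single value can be decoded with the standard library. This sketch uses the first timestamp from the output above.

```python
from datetime import datetime, timezone

ts = "1147880044"  # first timestamp in the ratings output above

# Interpret the value as seconds since the Unix epoch, in UTC
rated_at = datetime.fromtimestamp(int(ts), tz=timezone.utc)
print(rated_at.isoformat())  # 2006-05-17T15:34:04+00:00
```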



### Tags

In [21]:
tags_raw = (
    spark.read.format("csv")
    .option("encoding", "UTF-8")
    .option("header", True)
    .option("sep", ",")
    .option("escape", '"')
    .schema(schema_tags)
    .load(file_path_tags)
)

In [22]:
tags_raw.show(10, False)

+------+-------+-----------------------+----------+
|userId|movieId|tag                    |timestamp |
+------+-------+-----------------------+----------+
|3     |260    |classic                |1439472355|
|3     |260    |sci-fi                 |1439472256|
|4     |1732   |dark comedy            |1573943598|
|4     |1732   |great dialogue         |1573943604|
|4     |7569   |so bad it's good       |1573943455|
|4     |44665  |unreliable narrators   |1573943619|
|4     |115569 |tense                  |1573943077|
|4     |115713 |artificial intelligence|1573942979|
|4     |115713 |philosophical          |1573943033|
|4     |115713 |tense                  |1573943042|
+------+-------+-----------------------+----------+
only showing top 10 rows



### Tag Genome

In [23]:
genome_tags_raw = (
    spark.read.format("csv")
    .option("encoding", "UTF-8")
    .option("header", True)
    .option("sep", ",")
    .option("escape", '"')
    .schema(schema_genome_tags)
    .load(file_path_genome_tags)
)

In [24]:
genome_tags_raw.show(10, False)

+-----+------------+
|tagId|tag         |
+-----+------------+
|1    |007         |
|2    |007 (series)|
|3    |18th century|
|4    |1920s       |
|5    |1930s       |
|6    |1950s       |
|7    |1960s       |
|8    |1970s       |
|9    |1980s       |
|10   |19th century|
+-----+------------+
only showing top 10 rows



### Tag Genome Scores

In [25]:
genome_scores_raw = (
    spark.read.format("csv")
    .option("encoding", "UTF-8")
    .option("header", True)
    .option("sep", ",")
    .option("escape", '"')
    .schema(schema_genome_scores)
    .load(file_path_genome_scores)
)

In [26]:
genome_scores_raw.show(10, False)

+-------+-----+---------+
|movieId|tagId|relevance|
+-------+-----+---------+
|1      |1    |0        |
|1      |2    |0        |
|1      |3    |0        |
|1      |4    |0        |
|1      |5    |0        |
|1      |6    |0        |
|1      |7    |0        |
|1      |8    |0        |
|1      |9    |0        |
|1      |10   |0        |
+-------+-----+---------+
only showing top 10 rows
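Every relevance value above displays as ```0``` because ```DecimalType()``` defaults to precision 10 and scale 0, which leaves no digits after the decimal point. The effect can be sketched with Python's ```decimal``` module; the input value is a hypothetical relevance score, not one read from the file.

```python
from decimal import Decimal, ROUND_HALF_UP

raw = Decimal("0.02875")  # hypothetical relevance score between 0 and 1

# Scale 0: no digits after the decimal point, so the fraction is rounded away
scale_0 = raw.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

# Scale 5: five digits after the decimal point are kept
scale_5 = raw.quantize(Decimal("0.00001"), rounding=ROUND_HALF_UP)

print(scale_0, scale_5)  # 0 0.02875
```

In the schema, declaring the column as ```DecimalType(10, 5)``` (an assumed precision and scale) or as ```DoubleType()``` would keep the fractional part.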



# Clear cache and stop the spark cluster

In [27]:
# clear cache
spark.catalog.clearCache()

In [28]:
# stop spark
spark.stop()

# Insights

In most cases, prefer loading files in a just-in-time manner to conserve memory and computing resources.  

IRL you'd load a file only when needed - big data means big memory, big processing, big everything but it doesn't mean big bull in a china shop. Brute force is rarely going to be the answer - you've got to learn to be lean in your approach. 

# Next

Next, we start analysing the data through a series of exercises.  
The first set of exercises works with the tags.csv data in the MovieLens dataset.