# 202 Spark - Movielens

The goal of this lab is to run some analysis on the [MovieLens](https://grouplens.org/datasets/movielens/) dataset.

- [Spark programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
- [PySpark RDD APIs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html)

This lab's notebook is in the ```material``` folder; the solutions will be released in the same folder.

The cluster configuration is the same as in 201.
- Clone the previous cluster
- Update the addresses in Putty

Download the dataset [here](https://big.csr.unibo.it/downloads/bigdata/ml-dataset.zip), unzip it and upload the files to S3.

- ml_movies.csv (<u>movieId</u>:Long, title:String, genres:String) 
    - genres are separated by pipes (e.g., "comedy|drama|action")
    - each movie is associated with many ratings

- ml_ratings.csv (<u>userId</u>:Long, <u>movieId</u>:Long, rating:Double, year:Int)
    - each rating is associated with many tags
- ml_tags.csv (<u>userId</u>:Long, <u>movieId</u>:Long, <u>tag</u>:String, year:Int) 

In [None]:
%%configure -f
{"executorMemory":"8G", "numExecutors":2, "executorCores":3, "conf": {"spark.dynamicAllocation.enabled": "false"}}

In [None]:
bucketname = "univ-tours-bd2223-egallinucci"

path_ml_movies = "s3a://"+bucketname+"/datasets/movielens/ml-movies.csv"
path_ml_ratings = "s3a://"+bucketname+"/datasets/movielens/ml-ratings-sample.csv"
path_ml_tags = "s3a://"+bucketname+"/datasets/movielens/ml-tags.csv"

sc.applicationId

"SPARK UI: Enable forwarding of port 20888 and connect to http://localhost:20888/proxy/" + sc.applicationId + "/"

In [None]:
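import re

# Split on commas / pipes only when they are outside double quotes
commaRegex = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
pipeRegex = "\\|(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

quotes = "\""
def parseText(title):
    """Strip surrounding whitespace and any leading/trailing double quotes."""
    try:
        title = title.strip()
        # Drop leading quotes
        while (re.match(quotes,title[0:1])):
            title = title[1:]
        # Drop trailing quotes
        while (re.match(quotes,title[len(title)-1:])):
            title = title[:len(title)-1]
        return title
    except Exception:
        # E.g., a missing (None) value: fall back to an empty string
        return ""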
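
To see what the two regexes above are for, here is a small standalone sketch on a made-up CSV line. The sample line and the names `comma_regex`, `pipe_regex`, `strip_quotes`, and `sample_line` are illustrative assumptions, not part of the lab material. The point is that a comma inside a quoted title does not split the field.

```python
import re

# Same patterns as in the cell above: a delimiter matches only when it is
# followed by an even number of double quotes, i.e. when it is outside quotes.
comma_regex = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
pipe_regex = "\\|(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

def strip_quotes(text):
    """Illustrative stand-in for parseText: trim whitespace, then outer quotes."""
    return text.strip().strip('"')

# Made-up line in the shape of the movies file (movieId, title, genres).
sample_line = '11,"American President, The (1995)",Comedy|Drama|Romance'

fields = re.split(comma_regex, sample_line)   # the quoted comma is not a split point
movie_id = int(fields[0])
title = strip_quotes(fields[1])
genres = re.split(pipe_regex, fields[2])

print(movie_id, title, genres)
```

In the loading cell the comma splitting is already handled by `spark.read.csv`, so only the quote stripping and the pipe splitting remain relevant there.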

In [None]:
from itertools import islice
from datetime import datetime
import re

rddMovies = spark.read.option("header","true").csv(path_ml_movies).rdd.map(lambda row: (int(row[0]),parseText(row[1]),row[2]))
rddRatings = spark.read.option("header","true").csv(path_ml_ratings).rdd.map(lambda row: (int(row[0]),int(row[1]),float(row[2]),datetime.utcfromtimestamp(int(row[3])).strftime('%Y')))
#rddTags = spark.read.option("header","true").csv(path_ml_tags).rdd.map(lambda row: (int(row[0]),int(row[1]),parseText(row[2]),datetime.utcfromtimestamp(int(row[3])).strftime('%Y')))

## 303-1 Datasets exploration

Cache the datasets and answer the following questions:

- How many (distinct) users, movies, ratings, and tags?
- How many (distinct) genres?
- On average, how many ratings per user?
- On average, how many ratings per movie?
- On average, how many genres per movie?
- What is the range of ratings?
- On average, how many ratings per year?
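
Most of the questions above reduce to the same few RDD operations. The following is a minimal sketch rather than the official solution: the function names are hypothetical, and it assumes the tuple layouts produced by the loading cell, i.e. `(movieId, title, genres)` for movies and `(userId, movieId, rating, year)` for ratings.

```python
def count_distinct(rdd, index):
    """Number of distinct values in one tuple position (e.g., 0 = userId)."""
    return rdd.map(lambda row: row[index]).distinct().count()

def avg_ratings_per_key(rdd_ratings, index):
    """Average number of ratings per distinct key (0 = user, 1 = movie, 3 = year)."""
    return rdd_ratings.count() / count_distinct(rdd_ratings, index)

def distinct_genres(rdd_movies):
    """Set of genres, obtained by splitting the pipe-separated genres field."""
    return set(rdd_movies.flatMap(lambda m: m[2].split("|")).distinct().collect())

def rating_range(rdd_ratings):
    """(min, max) of the rating column."""
    ratings = rdd_ratings.map(lambda r: r[2])
    return ratings.min(), ratings.max()
```

For example, `avg_ratings_per_key(rddRatingsCached, 0)` would give the average number of ratings per user, and `len(distinct_genres(rddMoviesCached))` the number of distinct genres.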

In [None]:
rddMoviesCached = rddMovies.cache()
rddRatingsCached = rddRatings.cache()
#rddTagsCached = rddTags.cache()

rddMoviesCached.count()
rddRatingsCached.count()
#rddTagsCached.count()

## 303-2 Compute the average rating for each movie

- Export the result to S3
- Do not start from cached RDDs
- Evaluate:
  - Join-and-Aggregate vs Aggregate-and-Join
  - The better of the two join strategies vs a broadcast join
- Use Tableau to check the results
  - Download the file from S3 rather than connecting Tableau to S3
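
The three strategies to evaluate can be sketched as follows. This is one possible formulation under stated assumptions, not the official solution: all function names are hypothetical, the RDDs are assumed to have the tuple layouts from the loading cell, and `sc` is the notebook's SparkContext. Each variant returns `(movieId, title, avgRating)` tuples.

```python
def add_rating(acc, rating):
    """Fold one rating into a (sum, count) accumulator."""
    return (acc[0] + rating, acc[1] + 1)

def merge_acc(a, b):
    """Merge two (sum, count) accumulators coming from different partitions."""
    return (a[0] + b[0], a[1] + b[1])

def aggregate_and_join(rdd_movies, rdd_ratings):
    """Average per movieId first, then join the (smaller) result with the titles."""
    avg_by_movie = (rdd_ratings
        .map(lambda r: (r[1], r[2]))                     # (movieId, rating)
        .aggregateByKey((0.0, 0), add_rating, merge_acc)
        .mapValues(lambda acc: acc[0] / acc[1]))         # (movieId, avg)
    titles = rdd_movies.map(lambda m: (m[0], m[1]))      # (movieId, title)
    return avg_by_movie.join(titles).map(lambda kv: (kv[0], kv[1][1], kv[1][0]))

def join_and_aggregate(rdd_movies, rdd_ratings):
    """Join every single rating with its title first, then average per movie."""
    titles = rdd_movies.map(lambda m: (m[0], m[1]))
    return (rdd_ratings
        .map(lambda r: (r[1], r[2]))
        .join(titles)                                    # (movieId, (rating, title))
        .map(lambda kv: ((kv[0], kv[1][1]), kv[1][0]))   # ((movieId, title), rating)
        .aggregateByKey((0.0, 0), add_rating, merge_acc)
        .map(lambda kv: (kv[0][0], kv[0][1], kv[1][0] / kv[1][1])))

def broadcast_and_aggregate(sc, rdd_movies, rdd_ratings):
    """Replace the shuffle join with a lookup in a broadcast dict of titles."""
    titles = sc.broadcast(rdd_movies.map(lambda m: (m[0], m[1])).collectAsMap())
    return (rdd_ratings
        .map(lambda r: (r[1], r[2]))
        .aggregateByKey((0.0, 0), add_rating, merge_acc)
        # Unlike the inner joins above, an unknown movieId yields a None title.
        .map(lambda kv: (kv[0], titles.value.get(kv[0]), kv[1][0] / kv[1][1])))
```

Since the exercise asks not to start from cached RDDs, these would be called on `rddMovies` and `rddRatings`. The execution times and shuffle sizes of the resulting jobs can then be compared in the Spark UI.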

In [None]:
path_output_avgRatPerMovie = "s3a://"+bucketname+"/spark/avgRatPerMovie-test"
# rdd.coalesce(1).toDF().write.format("csv").mode('overwrite').save(path_output_avgRatPerMovie)

# Unpersist every RDD that is still cached
for (rdd_id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()

## 303-3 Genres

Build a ranking of the best-rated genres (by average rating), export the result to S3, then use Tableau to check it.

Use cached RDDs.

Two possible workflows:

1. Pre-aggregation (3 shuffles)

  - Aggregate ratings by movieId
  - Join with movies and map to genres
  - Aggregate by genres
  
2. Join & aggregate (2 shuffles)

  - Join with movies and map to genres
  - Aggregate by genres



In [None]:
path_output_avgRatPerGenre = "s3a://"+bucketname+"/spark/avgRatPerGenre-test"

# Unpersist every RDD that is still cached
for (rdd_id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()