# Fun exercises on Spark using the MovieLens Dataset

We are going to use the [MovieLens](https://grouplens.org/datasets/movielens/) dataset for these exercises. The download is not small and should expand to about 1 GB on your hard drive.

Download and unzip [MovieLens 25M Dataset](https://grouplens.org/datasets/movielens/25m/) for this analysis.

Either ensure the data is in the ```"./data/ml-25m"``` folder or update the path to the data below.

**Citation**:  
*F. Maxwell Harper and Joseph A. Konstan.* 2015.  
The MovieLens Datasets: History and Context.  
ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>  

You got this.  


In [18]:
# Step 1: initialize findspark
import findspark
findspark.init()

In [19]:
# Step 2: import pyspark
import pyspark
from pyspark.sql import SparkSession
pyspark.__version__

'3.3.0'

In [20]:
# Step 3: Create a spark session

# 'local[1]' runs Spark on 1 core of the local machine; change the number to use more cores
# use .config("spark.some.config.option", "some-value") for additional configuration

spark = SparkSession \
    .builder \
    .master('local[1]') \
    .appName("Analyzing Movielens Data") \
    .getOrCreate()

# spark

# ...to read and load the data *correctly*

This is typically the first problem you need to work out. You'll see.  
  
If you've downloaded and unzipped the data, you'll see that some of the files are quite large (genome-scores.csv is 400+ MB, ratings.csv is 600+ MB).  

So before we start loading the data to explore further, let's go through the [readme](https://files.grouplens.org/datasets/movielens/ml-25m-README.html) file to build a strategy for loading and analyzing data without clogging up the system.  

In real life, you'll either have to load files in small chunks to work out a strategy, or rely on a defined schema for the data.  

Here's the list of files (as of Aug 2022) that you get when you unzip the dataset:
1. movies.csv - list of movies with at least one rating.  
1. links.csv - IDs to generate links to the movie listing on imdb.com and themoviedb.org  
1. ratings.csv - Each line of this file after the header row represents one rating of one movie by one user.  
    Header: ```userId,movieId,rating,timestamp```  
1. tags.csv - Each line of this file after the header row represents one tag applied to one movie by one user.  
    Header: ```userId,movieId,tag,timestamp```  
1. Tag Genome: The tag genome contains tag relevance scores for movies. See [this](http://files.grouplens.org/papers/tag_genome.pdf)  
	1. genome-tags.csv - A list of tags  
	1. genome-scores.csv - Each movie in the genome has a relevance score value for every tag in the genome  
1. README.txt - Check out the README.txt for more details about the files.  

## formatting and encoding

From the Readme file, we have the following observations about the data:
1. Each file is a CSV with a single header row
1. Separator char is ```,```
1. Escape char is ```"```
1. Encoding is UTF-8

Let's set these options when reading the CSV files.
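One way to bundle these options with an explicit schema per file is sketched below. Treat it as a starting point, not the definitive loader: `CSV_OPTIONS`, `SCHEMAS`, and `load_csv` are hypothetical names, the column types are assumptions based on the file headers, and the helper expects the `spark` session created in Step 3.

```python
import os

# Reader options taken from the README observations above.
CSV_OPTIONS = {
    "header": True,       # each file has a single header row
    "sep": ",",           # separator char
    "quote": '"',         # fields that contain commas are wrapped in double quotes
    "escape": '"',        # a literal quote inside a quoted field is doubled
    "encoding": "UTF-8",
}

# Explicit schemas as DDL strings, so Spark does not scan a file to infer types.
# The types are assumptions; the IDs in links.csv stay strings so that any
# leading zeros survive.
SCHEMAS = {
    "movies.csv": "movieId INT, title STRING, genres STRING",
    "links.csv": "movieId INT, imdbId STRING, tmdbId STRING",
    "ratings.csv": "userId INT, movieId INT, rating DOUBLE, timestamp BIGINT",
    "tags.csv": "userId INT, movieId INT, tag STRING, timestamp BIGINT",
    "genome-tags.csv": "tagId INT, tag STRING",
    "genome-scores.csv": "movieId INT, tagId INT, relevance DOUBLE",
}


def load_csv(spark, file_name, data_dir="./data/ml-25m"):
    """Read one MovieLens CSV with a fixed schema instead of inferSchema."""
    path = os.path.join(data_dir, file_name)
    return spark.read.csv(path, schema=SCHEMAS[file_name], **CSV_OPTIONS)
```

With the session from Step 3, `ratings = load_csv(spark, "ratings.csv")` followed by `ratings.printSchema()` should show typed columns instead of all strings.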

In [None]:
# where possible, let's avoid inferSchema


In [5]:
# load
genome_tags = spark.read.csv("./data/ml-25m/genome-tags.csv", header = True)
genome_tags.show()

+-----+---------------+
|tagId|            tag|
+-----+---------------+
|    1|            007|
|    2|   007 (series)|
|    3|   18th century|
|    4|          1920s|
|    5|          1930s|
|    6|          1950s|
|    7|          1960s|
|    8|          1970s|
|    9|          1980s|
|   10|   19th century|
|   11|             3d|
|   12|           70mm|
|   13|            80s|
|   14|           9/11|
|   15|        aardman|
|   16|aardman studios|
|   17|       abortion|
|   18|         absurd|
|   19|         action|
|   20|  action packed|
+-----+---------------+
only showing top 20 rows



In [6]:
movies = spark.read.csv("./data/ml-25m/movies.csv", header = True)
movies.schema

StructType([StructField('movieId', StringType(), True), StructField('title', StringType(), True), StructField('genres', StringType(), True)])

In [10]:
movies.show(10,False)

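# Draft (left commented out): keep only the part of the title before the first " ("
# from pyspark.sql.functions import split

# dir(pyspark.sql.functions)  # lists the functions that are available

# movies = movies.withColumn("title_only", split(movies["title"], r" \(").getItem(0))
# movies.show(25, False)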

+-------+----------------------------------+-------------------------------------------+
|movieId|title                             |genres                                     |
+-------+----------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)|Comedy                                     |
|6      |Heat (1995)                       |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                    |Comedy|Romance                             |
|8      |Tom and Huck (1995)               |Adventure|Children                         |
|9      |Sudden Death

In [11]:
tags = spark.read.csv("./data/ml-25m/tags.csv", header=True)
tags.schema

StructType([StructField('userId', StringType(), True), StructField('movieId', StringType(), True), StructField('tag', StringType(), True), StructField('timestamp', StringType(), True)])

In [12]:
tags.show(10, False)

+------+-------+-----------------------+----------+
|userId|movieId|tag                    |timestamp |
+------+-------+-----------------------+----------+
|3     |260    |classic                |1439472355|
|3     |260    |sci-fi                 |1439472256|
|4     |1732   |dark comedy            |1573943598|
|4     |1732   |great dialogue         |1573943604|
|4     |7569   |so bad it's good       |1573943455|
|4     |44665  |unreliable narrators   |1573943619|
|4     |115569 |tense                  |1573943077|
|4     |115713 |artificial intelligence|1573942979|
|4     |115713 |philosophical          |1573943033|
|4     |115713 |tense                  |1573943042|
+------+-------+-----------------------+----------+
only showing top 10 rows



In [17]:
# count tags per user, list the 10 most active taggers, then summarize the distribution
results = tags.groupby('userId').count()
results.sort(results['count'].desc()).show(10,False)
results.describe('count').show()

+------+------+
|userId|count |
+------+------+
|6550  |183356|
|21096 |20317 |
|62199 |13700 |
|160540|12076 |
|155146|11445 |
|70092 |10582 |
|131347|10195 |
|14116 |10167 |
|31047 |8463  |
|141263|7114  |
+------+------+
only showing top 10 rows

+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             14592|
|   mean| 74.92872807017544|
| stddev|1570.0725288977699|
|    min|                 1|
|    max|            183356|
+-------+------------------+



Here are the exercises we'll do with this data next:

# 1  

List all unique genres found in ```movies.csv```  
List all unique tags found in ```tags.csv```  
Cross-check these tags against the tag genome: add the tag_genome_id and relevance score for each tag, and save the result as tags_with_relevance.csv

# 2  

Extract the year of release in movies.csv into a new column year_of_release  

Add another column num_genres and list total number of genres associated with each film

Find number of films associated with each genre - absolute_frequency_of_genre

Is there a 'variety' metric? sum of absolute frequencies divided by total absolute frequency?
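A possible starting point for the first three items, written with Spark SQL expressions so that no extra imports are needed. This is a sketch under assumptions: the helper names are hypothetical, the year is assumed to be the four digits in parentheses at the very end of the title, and `(no genres listed)` is counted as zero genres.

```python
# Regex for a trailing "(1995)"-style year; written without backslashes so it
# can be dropped into a Spark SQL string literal unchanged.
YEAR_REGEX = "[(]([0-9]{4})[)][ ]*$"

# Expressions for DataFrame.selectExpr: keep every column and add two new ones.
MOVIE_COLUMNS = [
    "*",
    # regexp_extract returns '' when there is no match; NULLIF turns that into NULL
    f"CAST(NULLIF(regexp_extract(title, '{YEAR_REGEX}', 1), '') AS INT) AS year_of_release",
    # genres are pipe-separated; treat the placeholder value as zero genres
    "CASE WHEN genres = '(no genres listed)' THEN 0 "
    "ELSE size(split(genres, '[|]')) END AS num_genres",
]


def add_year_and_genre_count(movies_df):
    """Return movies_df with year_of_release and num_genres columns added."""
    return movies_df.selectExpr(*MOVIE_COLUMNS)


def genre_frequency(movies_df):
    """Return one row per genre with the number of films that carry it."""
    return (
        movies_df
        .selectExpr("explode(split(genres, '[|]')) AS genre")
        .groupBy("genre")
        .count()
        .withColumnRenamed("count", "absolute_frequency_of_genre")
    )
```

Calling `add_year_and_genre_count(movies).show(10, False)` on the `movies` DataFrame loaded earlier is a quick way to check for titles where no year was found.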


# 3  

Prepare a yearwise list of movies - list all the movies released in 1995, then 1996 and so on...  


# 4  

Prepare a list of the highest rated movies (movies with at least 5 instances where users have rated the film a 4 or a 5). Present this list by year of release, sorted in alphabetical order by movie title.  

Expected Columns in the output::year of release::movie title::# of 4s::# of 5s

# 5  

Prepare a list of movies whose titles contain at least two vowels other than 'e', and sort the list by month and year of video release.  

Expected Columns in the output::year of video release::month of video release::movie title::# of vowels that are not e
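For the vowel-counting part only, one possible approach is to delete every character that is not a counted vowel and measure what is left. The names below are hypothetical, and the sketch assumes that vowels means a, i, o, u (e excluded), case-insensitively.

```python
# Character class matching everything EXCEPT the vowels we want to count.
NOT_A_COUNTED_VOWEL = "[^aiou]"

# Spark SQL expression: lowercase the title, strip all other characters,
# and take the length of the remainder.
VOWEL_COUNT_EXPR = (
    f"length(regexp_replace(lower(title), '{NOT_A_COUNTED_VOWEL}', ''))"
)


def movies_with_non_e_vowels(movies_df, minimum=2):
    """Keep movies whose title has at least `minimum` vowels other than 'e'."""
    return (
        movies_df
        .selectExpr("*", f"{VOWEL_COUNT_EXPR} AS num_vowels_not_e")
        .filter(f"num_vowels_not_e >= {int(minimum)}")
    )
```

The sorting by release month and year still has to be added once those columns exist.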

# 6  

Prepare a genre-wise list of movies - list all the movies for 'unknown', for 'action', and so on...  

The list must be sorted in descending count of genres - a movie with 3 genres should rank higher than a movie with only 1 genre.

Expected Columns in the output::genre::movie title::::

Count the number of movies in each genre.

Find out whether any movie has real genres associated with it and also has ```(no genres listed)``` - if so, count how many such movies exist in the dataset


# 7  

Find the top 3 highest rated movies for each year - highest rated means where the sum of 4 and 5 ratings is the highest for the particular year.  

Expected Columns in the output::year of release::movie title::# of 4s::# of 5s
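One way to approach this is a window function that ranks movies within each release year. The query below is a sketch, not a reference solution: the view names, `rank_in_year`, and `top_rated_per_year` are hypothetical, ties are broken alphabetically by title (an arbitrary choice), and `movies` is assumed to already carry the `year_of_release` column from exercise 2.

```python
# Spark SQL for "top 3 movies per release year by number of 4 and 5 ratings".
# Assumes two temp views: `ratings` (userId, movieId, rating, timestamp) and
# `movies` (movieId, title, year_of_release).
TOP_RATED_PER_YEAR_SQL = """
WITH counts AS (
    SELECT movieId,
           SUM(CASE WHEN CAST(rating AS DOUBLE) = 4.0 THEN 1 ELSE 0 END) AS num_4s,
           SUM(CASE WHEN CAST(rating AS DOUBLE) = 5.0 THEN 1 ELSE 0 END) AS num_5s
    FROM ratings
    GROUP BY movieId
),
ranked AS (
    SELECT m.year_of_release, m.title, c.num_4s, c.num_5s,
           ROW_NUMBER() OVER (
               PARTITION BY m.year_of_release
               ORDER BY c.num_4s + c.num_5s DESC, m.title
           ) AS rank_in_year
    FROM counts c
    JOIN movies m ON c.movieId = m.movieId
    WHERE m.year_of_release IS NOT NULL
)
SELECT year_of_release, title, num_4s, num_5s
FROM ranked
WHERE rank_in_year <= 3
ORDER BY year_of_release, rank_in_year
"""


def top_rated_per_year(spark, movies_df, ratings_df):
    """Register both DataFrames as temp views and run the ranking query."""
    movies_df.createOrReplaceTempView("movies")
    ratings_df.createOrReplaceTempView("ratings")
    return spark.sql(TOP_RATED_PER_YEAR_SQL)
```

The `counts` subquery could likely be reused for exercise 4 by filtering on `num_4s + num_5s >= 5` instead of ranking.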