# Movie Recommender

Project info here

## Configuration

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
INPUT_DIRECTORY = "/content/drive/MyDrive/Grad School/DSCI 632/MovieRecommender/data/" #for google mount
# INPUT_DIRECTORY = "./data/" #for jupyter notebook

In [3]:
%%capture 
#prevent large printout with %%capture

#Download Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

#Install Apache Spark 3.2.1 with Hadoop 3.2, get zipped folder
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz

#Unzip folder
!tar xvf spark-3.2.1-bin-hadoop3.2.tgz

#Install findspark, pyspark 3.2.1
!pip install -q findspark
!pip install pyspark==3.2.1

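#Set environment variables (absolute SPARK_HOME so it still resolves if the working directory changes)
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = os.path.abspath("spark-3.2.1-bin-hadoop3.2")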

## Load Packages and Functions

In [23]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import max as spark_max #alias avoids shadowing Python's built-in max

In [24]:
def calculate_sparsity(spark_df, rating_col="rating", userId_col="userId", movieId_col="movieId"):
  #get number of ratings in dataset
  numerator = spark_df.select(rating_col).count()

  #get number of distinct users and movies (from the dataframe passed in, not the global `ratings`)
  users_count = spark_df.select(userId_col).distinct().count()
  movies_count = spark_df.select(movieId_col).distinct().count()

  #get number of total possible ratings
  denominator = users_count * movies_count

  #calculate sparsity as a percentage: 1 - (num ratings / num possible ratings)
  sparsity = (1.0 - numerator / denominator) * 100
  print(f"The ratings dataframe is {sparsity:.2f}% empty.")
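To make the formula concrete, here is a minimal plain-Python sketch of the same calculation on a tiny in-memory dataset. The helper `sparsity_percent` and the toy triples are hypothetical and only illustrate the arithmetic; they are not part of the notebook's Spark pipeline.

```python
def sparsity_percent(triples):
    """Return the percentage of empty cells in the user-by-movie rating matrix.

    `triples` is a non-empty list of (userId, movieId, rating) tuples.
    """
    num_ratings = len(triples)
    users = {user for user, _movie, _rating in triples}
    movies = {movie for _user, movie, _rating in triples}
    possible = len(users) * len(movies)  # every user rating every movie
    return (1.0 - num_ratings / possible) * 100


# Hypothetical toy data: 3 users x 4 movies = 12 possible ratings, 5 observed.
toy_ratings = [
    (1, 10, 4.0),
    (1, 20, 5.0),
    (2, 10, 3.0),
    (3, 30, 2.5),
    (3, 40, 4.5),
]

toy_sparsity = sparsity_percent(toy_ratings)
print(f"The toy rating matrix is {toy_sparsity:.2f}% empty.")
```

Here 5 of 12 cells are filled, so 7 of 12 (about 58%) are empty.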

## Import Data and Preprocessing


### Import Data

In [4]:
#create SparkSession and SparkContext objects
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()

In [5]:
#Import data
file_path = INPUT_DIRECTORY + "ratings.csv"
ratings = spark.read.csv(file_path, header=True, inferSchema=True)
ratings.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
|     1|    163|   5.0|964983650|
|     1|    216|   5.0|964981208|
|     1|    223|   3.0|964980985|
|     1|    231|   5.0|964981179|
|     1|    235|   4.0|964980908|
|     1|    260|   5.0|964981680|
|     1|    296|   3.0|964982967|
|     1|    316|   3.0|964982310|
|     1|    333|   5.0|964981179|
|     1|    349|   4.0|964982563|
+------+-------+------+---------+
only showing top 20 rows



### Preprocessing

#### Check datatypes

In [6]:
ratings.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



#### No duplicate ratings

Now let's confirm there are no duplicates of `movieId` and `userId` (i.e. no double ratings for a given movie-user pair):

In [27]:
from pyspark.sql.functions import max as spark_max #alias avoids shadowing Python's built-in max

ratings.groupby("userId", "movieId").count().select(spark_max("count")).show()

+----------+
|max(count)|
+----------+
|         1|
+----------+



Each user-movie pair has only one rating.
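The check above groups by the user-movie pair, counts the rows in each group, and takes the largest count; a maximum of 1 means no pair was rated twice. A minimal plain-Python sketch of that logic, using a hypothetical helper `max_ratings_per_pair` and made-up rows:

```python
from collections import Counter


def max_ratings_per_pair(rows):
    """Return the largest number of ratings recorded for any (userId, movieId) pair."""
    pair_counts = Counter((user, movie) for user, movie, _rating in rows)
    return max(pair_counts.values())


# Hypothetical rows of (userId, movieId, rating).
clean_rows = [(1, 1, 4.0), (1, 3, 4.0), (2, 1, 5.0)]
dup_rows = clean_rows + [(1, 3, 2.0)]  # user 1 rates movie 3 a second time

clean_max = max_ratings_per_pair(clean_rows)
dup_max = max_ratings_per_pair(dup_rows)
print(clean_max, dup_max)
```

Any value above 1 would signal duplicates that need to be resolved before modelling.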

#### Sparsity

As a final data-quality check, we'll measure how sparse the ratings are:


In [26]:
calculate_sparsity(ratings)

The ratings dataframe is 98.30% empty.


Rating matrices in recommender systems are often around 99% empty, so a sparsity of 98.30% is normal and, if anything, slightly denser than usual.

#### Join data for readability

`userId`, `movieId`, and `rating` have appropriate datatypes for the analysis. We'll join `ratings` with the movie titles for readability:

In [9]:
#Import movie titles
file_path = INPUT_DIRECTORY + "movies.csv"
movies = spark.read.csv(file_path, header=True, inferSchema=True)
movies = movies.select("movieId", "title") #remove genres
movies.show(5, truncate=False)

+-------+----------------------------------+
|movieId|title                             |
+-------+----------------------------------+
|1      |Toy Story (1995)                  |
|2      |Jumanji (1995)                    |
|3      |Grumpier Old Men (1995)           |
|4      |Waiting to Exhale (1995)          |
|5      |Father of the Bride Part II (1995)|
+-------+----------------------------------+
only showing top 5 rows



In [18]:
movie_ratings = ratings.join(movies, on="movieId", how="left")
movie_ratings.show(5, truncate=False)

+-------+------+------+---------+---------------------------+
|movieId|userId|rating|timestamp|title                      |
+-------+------+------+---------+---------------------------+
|1      |1     |4.0   |964982703|Toy Story (1995)           |
|3      |1     |4.0   |964981247|Grumpier Old Men (1995)    |
|6      |1     |4.0   |964982224|Heat (1995)                |
|47     |1     |5.0   |964983815|Seven (a.k.a. Se7en) (1995)|
|50     |1     |5.0   |964982931|Usual Suspects, The (1995) |
+-------+------+------+---------+---------------------------+
only showing top 5 rows
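A left join keeps every row of `ratings`, even when a `movieId` has no match in `movies`; such rows would simply carry a null title. The plain-Python sketch below illustrates that behaviour. The helper `left_join_titles` and the toy data, including the unmatched id 999, are hypothetical.

```python
def left_join_titles(rating_rows, title_by_movie):
    """Attach a title to each (userId, movieId, rating) row; None when no title matches."""
    return [
        (movie, user, rating, title_by_movie.get(movie))
        for user, movie, rating in rating_rows
    ]


# Hypothetical toy data; movieId 999 deliberately has no title.
toy_ratings = [(1, 1, 4.0), (1, 3, 4.0), (2, 999, 5.0)]
toy_titles = {1: "Toy Story (1995)", 3: "Grumpier Old Men (1995)"}

joined = left_join_titles(toy_ratings, toy_titles)
missing = [row for row in joined if row[3] is None]
print(len(joined), missing)
```

The row count is unchanged by the join, and rows with a missing title can be inspected separately if any appear.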



Finally, we'll reorganize the columns for ALS and remove `timestamp`:

In [17]:
movie_ratings = movie_ratings.select("title", "movieId", "userId", "rating")
movie_ratings.show(5, truncate=False)

+---------------------------+-------+------+------+
|title                      |movieId|userId|rating|
+---------------------------+-------+------+------+
|Toy Story (1995)           |1      |1     |4.0   |
|Grumpier Old Men (1995)    |3      |1     |4.0   |
|Heat (1995)                |6      |1     |4.0   |
|Seven (a.k.a. Se7en) (1995)|47     |1     |5.0   |
|Usual Suspects, The (1995) |50     |1     |5.0   |
+---------------------------+-------+------+------+
only showing top 5 rows



## ALS Model Creation

## Performance Evaluation