# Building a Movie Recommendation Engine
## Xander Hieken
***
### Data Preparation

Load the data from the `ratings.csv` and `movies.csv` files. 

*Both files contain only complete records, so missing values are not a concern at this stage.*

In [1]:
import pandas as pd
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import lit
from pyspark.sql import Row
import warnings
warnings.filterwarnings('ignore')
# Show full column contents; None replaces the deprecated -1 value
pd.set_option('display.max_colwidth', None)

# Starting the SparkSession
spark = SparkSession.builder \
                    .appName("MovieRatings") \
                    .getOrCreate()

# Defining the schema for both csv files
movieSchema = StructType().add("movieId", "integer") \
                         .add("title", "string") \
                         .add("genres", "string")

ratingSchema = StructType().add("userId", "integer") \
                         .add("movieId", "integer") \
                         .add("rating", "float") \
                         .add("timestamp", "integer")

# Reading each csv file using my defined schema
movies = spark.read.csv("movielens/movies.csv", header = True, schema = movieSchema)
ratings = spark.read.csv("movielens/ratings.csv", header = True, schema = ratingSchema)

***
### Training the Recommendation Engine

Using the data from the last step, I will create a movie recommendation model using collaborative filtering. 

Before training the recommendation model, I split the data into a training dataset and a testing dataset using the DataFrame `randomSplit` method.

*80% of the data will be used for training and 20% for testing.*
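
As a rough illustration of what a weighted random split does, the plain-Python sketch below assigns each row to the training or test set with an independent random draw. It is only an analogy to Spark's `randomSplit`, not its implementation; the helper `random_split` and the toy rows are hypothetical. Because each row is drawn independently, the resulting proportions are approximately, not exactly, 80/20.

```python
import random

def random_split(rows, train_weight=0.8, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        # Rows land in the training set with probability train_weight
        (train if rng.random() < train_weight else test).append(row)
    return train, test

# Hypothetical toy data: 50 users x 20 movies, all rated 3.0
toy_ratings = [(user, movie, 3.0) for user in range(1, 51) for movie in range(1, 21)]
train_rows, test_rows = random_split(toy_ratings, train_weight=0.8, seed=42)
print(len(train_rows), len(test_rows))
```

Spark's `randomSplit` also accepts an optional `seed` argument, for example `ratings.randomSplit([0.8, 0.2], seed=42)` (the value 42 is arbitrary), which keeps the split and therefore the reported metrics repeatable between runs.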

In [2]:
# Using randomSplit to create an 80/20 train/test split
(trainingRatings, testRatings) = ratings.randomSplit([0.8, 0.2])

als = ALS(userCol='userId', itemCol='movieId', ratingCol='rating', coldStartStrategy="drop")
model = als.fit(trainingRatings)
predictions = model.transform(testRatings)
predictions.toPandas().head()

| | userId | movieId | rating | timestamp | prediction |
|---|---|---|---|---|---|
| 0 | 133 | 471 | 4.0 | 843491793 | 2.850344 |
| 1 | 372 | 471 | 3.0 | 874415126 | 2.754405 |
| 2 | 599 | 471 | 2.5 | 1498518822 | 2.611227 |
| 3 | 182 | 471 | 4.5 | 1054779644 | 3.658455 |
| 4 | 387 | 471 | 3.0 | 1139047519 | 2.957531 |
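
For intuition about what `ALS` does during `fit`, here is a minimal plain-Python sketch of alternating least squares with a single latent factor per user and per movie. It is a simplified illustration, not Spark's implementation: Spark's `ALS` learns a configurable number of factors (its `rank` parameter) and runs distributed. The function `als_rank1`, the toy ratings, and the hyperparameter values are all hypothetical.

```python
def als_rank1(ratings, n_iters=20, reg=0.01):
    """Fit one latent factor per user and item by alternating least squares.

    ratings: list of (user, item, rating) tuples.
    """
    by_user, by_item = {}, {}
    for user, item, rating in ratings:
        by_user.setdefault(user, []).append((item, rating))
        by_item.setdefault(item, []).append((user, rating))

    user_f = {user: 1.0 for user in by_user}
    item_f = {item: 1.0 for item in by_item}

    for _ in range(n_iters):
        # Hold item factors fixed: each user's factor has a closed-form ridge solution
        for user, rated in by_user.items():
            num = sum(r * item_f[i] for i, r in rated)
            den = reg + sum(item_f[i] ** 2 for i, r in rated)
            user_f[user] = num / den
        # Hold user factors fixed: solve for each item's factor the same way
        for item, raters in by_item.items():
            num = sum(r * user_f[u] for u, r in raters)
            den = reg + sum(user_f[u] ** 2 for u, r in raters)
            item_f[item] = num / den
    return user_f, item_f

# Hypothetical toy data: user 3 has not rated movie "b"
toy = [(1, "a", 4.0), (1, "b", 2.0), (2, "a", 2.0), (2, "b", 1.0), (3, "a", 4.0)]
user_f, item_f = als_rank1(toy)
predicted = user_f[3] * item_f["b"]
print(round(predicted, 2))
```

The predicted rating for an unseen user/movie pair is the product of the two learned factors, which is the same idea the fitted model applies when `transform` scores the test set.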


In [3]:
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
print('The root mean squared error for our model is: {}'.format(evaluator.evaluate(predictions)))

The root mean squared error for our model is: 0.8792733496088538
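
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below applies that definition in plain Python to the five sample rows displayed earlier; the helper name `rmse` is hypothetical, and because it covers only those five rows its result will differ from the full test-set figure.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Ratings and predictions copied from the five sample rows shown earlier
sample_actual = [4.0, 3.0, 2.5, 4.5, 3.0]
sample_predicted = [2.850344, 2.754405, 2.611227, 3.658455, 2.957531]
sample_rmse = rmse(sample_actual, sample_predicted)
print(sample_rmse)
```

Since RMSE is expressed in rating units, the model's value of about 0.88 means predictions are typically off by a bit less than one point on the rating scale.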


***
### Generating Top 10 Movie Recommendations

Using the recommendation model, I now build a function that generates the top recommendations for a given user, and apply it to a few users to get their top ten.

In [4]:
def recommendMovies(model, user, nbRecommendations):
    # Create a Spark DataFrame with the specified user and all the movies listed in the ratings DataFrame
    dataSet = ratings.select('movieId').distinct().withColumn('userId', lit(user))

    # Create a Spark DataFrame with the movies that have already been rated by this user
    moviesAlreadyRated = ratings.filter(ratings.userId == user).select('movieId', 'userId')

    # Apply the recommender system to the data set without the already rated movies to predict ratings
    predictions = (
        model.transform(dataSet.subtract(moviesAlreadyRated))
        .dropna()
        .orderBy('prediction', ascending=False)
        .limit(nbRecommendations)
        .select('movieId', 'prediction')
    )

    # Join with the movies DataFrame to get the movie titles
    recommendations = (
        predictions.join(movies, predictions.movieId == movies.movieId)
        .orderBy(predictions.prediction, ascending=False)
        .select(movies.title, predictions.prediction)
    )

    return recommendations
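
The candidate-filtering logic of `recommendMovies` can also be sketched without Spark. The example below is a hypothetical plain-Python analogue (the names `recommend_top_n` and `predicted_scores` and the toy data are made up for illustration): it drops movies the user has already rated, skips movies with no prediction, and keeps the `n` highest-scoring ones.

```python
import heapq

def recommend_top_n(predicted_scores, already_rated, n):
    """Return the n best (movie, score) pairs among movies not yet rated."""
    candidates = (
        (movie, score)
        for movie, score in predicted_scores.items()
        # Exclude rated movies and movies without a usable prediction
        if movie not in already_rated and score is not None
    )
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Hypothetical toy data: movie "D" is already rated, movie "C" has no prediction
predicted_scores = {"A": 4.2, "B": 3.1, "C": None, "D": 4.8, "E": 2.5}
already_rated = {"D"}
top_two = recommend_top_n(predicted_scores, already_rated, n=2)
print(top_two)
```

Excluding already-rated movies matters because otherwise the highest predictions would often be for titles the user has already seen and scored highly.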

In [5]:
print('Top 10 Recommendations for User 133:')
recommendMovies(model, 133, 10).toPandas()

Top 10 Recommendations for User 133:


| | title | prediction |
|---|---|---|
| 0 | Saving Face (2004) | 3.734774 |
| 1 | Dead Man (1995) | 3.678385 |
| 2 | Top Hat (1935) | 3.669599 |
| 3 | Raiders of the Lost Ark: The Adaptation (1989) | 3.65819 |
| 4 | Star Wars: Episode IV - A New Hope (1977) | 3.648085 |
| 5 | Submarine (2010) | 3.647146 |
| 6 | Victory (a.k.a. Escape to Victory) (1981) | 3.619778 |
| 7 | The Big Bus (1976) | 3.611397 |
| 8 | Seve (2014) | 3.611397 |
| 9 | Creature Comforts (1989) | 3.60394 |


In [6]:
print('Top 10 Recommendations for User 471:')
recommendMovies(model, 471, 10).toPandas()

Top 10 Recommendations for User 471:


| | title | prediction |
|---|---|---|
| 0 | The Artist (2011) | 5.122711 |
| 1 | Jetée, La (1962) | 4.864706 |
| 2 | Man Bites Dog (C'est arrivé près de chez vous) (1992) | 4.677149 |
| 3 | World of Tomorrow (2015) | 4.581001 |
| 4 | Philadelphia Story, The (1940) | 4.572469 |
| 5 | Rivers and Tides (2001) | 4.571997 |
| 6 | On the Beach (1959) | 4.567577 |
| 7 | Charlie Brown Christmas, A (1965) | 4.557418 |
| 8 | Funny Games U.S. (2007) | 4.539883 |
| 9 | Mary and Max (2009) | 4.528869 |


In [7]:
print('Top 10 Recommendations for User 496:')
recommendMovies(model, 496, 10).toPandas()

Top 10 Recommendations for User 496:


| | title | prediction |
|---|---|---|
| 0 | Man Bites Dog (C'est arrivé près de chez vous) (1992) | 4.917277 |
| 1 | The Artist (2011) | 4.864929 |
| 2 | Man on Wire (2008) | 4.862256 |
| 3 | American: The Bill Hicks Story (2009) | 4.815118 |
| 4 | Eddie Izzard: Dress to Kill (1999) | 4.812021 |
| 5 | Bill Hicks: Revelations (1993) | 4.796528 |
| 6 | Jetée, La (1962) | 4.795767 |
| 7 | World of Tomorrow (2015) | 4.765717 |
| 8 | Midnight Clear, A (1992) | 4.741269 |
| 9 | Baby Driver (2017) | 4.680682 |


In [8]:
spark.stop()