In [1]:
import findspark
findspark.init()

<b>`findspark` is a Python library that locates the Spark installation on the machine and makes `pyspark` importable.</b>

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('movielens') \
                    .getOrCreate()


<b>The master URL `local[*]` tells Spark to run locally with as many worker threads as there are logical cores on your machine.</b>
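
As a small illustration of the local master URL forms, the sketch below builds `local[*]` or `local[N]` strings and prints the logical core count that Python sees, which is typically the number of threads `local[*]` ends up using. The helper `local_master` is hypothetical and not part of PySpark.

```python
import os

def local_master(n_threads=None):
    """Build a local Spark master URL (hypothetical helper, not part of PySpark).

    None   -> 'local[*]' (let Spark use every logical core)
    an int -> 'local[N]' (pin the number of worker threads)
    """
    if n_threads is None:
        return "local[*]"
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return f"local[{n_threads}]"

# Logical cores visible to Python; can be None on some platforms
print("logical cores:", os.cpu_count())
print(local_master())    # local[*]
print(local_master(2))   # local[2]
```

The resulting string could be passed to `.master(...)` in the builder above, for example to cap the thread count on a shared machine.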

In [3]:
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

schema = StructType([
    StructField("userId", IntegerType()),
    StructField("movieId", IntegerType()),
    StructField("rating", DoubleType())
])

ratings = spark.read.csv("rating.csv", header=True, schema=schema).limit(10000)
ratings.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)



<b>Without an explicit schema, Spark's CSV reader treats every column as a string unless schema inference is enabled, so I defined the schema on read with the desired data types. Note that `.limit(10000)` keeps only the first 10,000 ratings.</b>
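
The same typing problem can be seen in plain Python, which makes for a quick sketch that runs without Spark. This is an analogy rather than Spark code: the inlined rows follow the layout shown by `ratings.show()`, and the `converters` mapping is a hypothetical stand-in for the `StructType` above.

```python
import csv
import io

# A few rows in the layout shown by ratings.show(), inlined so the sketch runs on its own
raw = "userId,movieId,rating\n1,2,3.5\n1,29,3.5\n1,32,3.5\n"

# Hypothetical plain-Python counterpart of the StructType: column name -> converter
converters = {"userId": int, "movieId": int, "rating": float}

# Without type information, every field comes back as text
untyped = list(csv.DictReader(io.StringIO(raw)))

# Applying the "schema" casts each field to the desired type
typed = [
    {name: converters[name](value) for name, value in row.items()}
    for row in untyped
]

print(type(untyped[0]["rating"]).__name__)  # str
print(typed[0])
```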

## Data Exploration

In [4]:
ratings.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      2|   3.5|
|     1|     29|   3.5|
|     1|     32|   3.5|
|     1|     47|   3.5|
|     1|     50|   3.5|
|     1|    112|   3.5|
|     1|    151|   4.0|
|     1|    223|   4.0|
|     1|    253|   4.0|
|     1|    260|   4.0|
|     1|    293|   4.0|
|     1|    296|   4.0|
|     1|    318|   4.0|
|     1|    337|   3.5|
|     1|    367|   3.5|
|     1|    541|   4.0|
|     1|    589|   3.5|
|     1|    593|   3.5|
|     1|    653|   3.0|
|     1|    919|   3.5|
+------+-------+------+
only showing top 20 rows



In [5]:
# To check number of movies each user has rated
ratings.groupBy("userId").count().show()

+------+-----+
|userId|count|
+------+-----+
|     1|  175|
|     2|   61|
|     3|  187|
|     4|   28|
|     5|   66|
|     6|   24|
|     7|  276|
|     8|   70|
|     9|   35|
|    10|   38|
|    11|  504|
|    12|   36|
|    13|   62|
|    14|  243|
|    15|   49|
|    16|   60|
|    17|   26|
|    18|  121|
|    19|   50|
|    20|   28|
+------+-----+
only showing top 20 rows



In [6]:
from pyspark.sql.functions import col, avg, min  # note: this `min` shadows Python's built-in min()
# Minimum number of ratings received by any movie
print("Movie with the fewest ratings: ")
ratings.groupBy("movieId").count().select(min("count")).show()

Movie with the fewest ratings: 
+----------+
|min(count)|
+----------+
|         1|
+----------+



In [7]:
# Avg number of ratings per movie
print("Avg num ratings per movie: ")
ratings.groupBy("movieId").count().select(avg("count")).show()

Avg num ratings per movie: 
+-----------------+
|       avg(count)|
+-----------------+
|3.461405330564209|
+-----------------+



In [8]:
# Min number of ratings for user
print("User with the fewest ratings: ")
ratings.groupBy("userId").count().select(min("count")).show()

User with the fewest ratings: 
+----------+
|min(count)|
+----------+
|        20|
+----------+



<b>Each user in the sample has rated at least 20 movies.</b>

In [9]:
# Avg number of ratings per user
print("Avg num ratings per user: ")
ratings.groupBy("userId").count().select(avg("count")).show()

Avg num ratings per user: 
+------------------+
|        avg(count)|
+------------------+
|109.89010989010988|
+------------------+
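
The two averages above also reveal how many distinct users and movies the sample contains, and therefore how sparse the user-movie matrix is. The sketch below works this out with plain arithmetic, assuming all 10,000 rows kept by `.limit(10000)` have a valid `userId` and `movieId`. It suggests roughly 91 users and 2,889 movies. The helper name `implied_counts` is hypothetical.

```python
def implied_counts(n_ratings, avg_per_user, avg_per_movie):
    """Recover the number of distinct users and movies from the group averages.

    avg = n_ratings / n_groups, so n_groups = n_ratings / avg.
    """
    n_users = round(n_ratings / avg_per_user)
    n_movies = round(n_ratings / avg_per_movie)
    return n_users, n_movies

# Averages copied from the outputs above; 10000 comes from .limit(10000)
# and assumes none of those rows has a null userId or movieId
n_ratings = 10000
n_users, n_movies = implied_counts(n_ratings, 109.89010989010988, 3.461405330564209)

# Fraction of the user x movie matrix that has no rating
sparsity = 1 - n_ratings / (n_users * n_movies)
print(n_users, n_movies, f"{sparsity:.1%}")
```

A matrix this sparse is the usual setting for collaborative filtering: the model has to predict the many missing entries from the few observed ones.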



## Model Building and Evaluation

In [10]:
# Import the required functions
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create train and test sets
train, test = ratings.randomSplit([0.8, 0.2], seed=1234)

In [11]:
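# Create and train a baseline ALS model.
# coldStartStrategy="drop" removes rows whose prediction would be NaN
# (users or movies that were not seen during training).
als = ALS(rank=5, maxIter=10, seed=0,
          userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop", nonnegative=True)
model = als.fit(train.select(["userId", "movieId", "rating"]))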

In [12]:
# Evaluate predictions on the held-out test set
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
predictions = model.transform(test)
rmse = evaluator.evaluate(predictions)
print("RMSE="+str(rmse))
predictions.show()

RMSE=1.118327786291745
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|    54|   1088|   3.0| 2.8224826|
|    17|   1580|   4.0| 3.5979238|
|    23|   1580|   5.0|  4.184193|
|    25|   1580|   1.5| 3.4518485|
|    46|   1580|   4.0|  3.137214|
|    76|   1591|   1.0| 3.0296664|
|    58|   1645|   4.0|  3.402373|
|    58|   1959|   4.0| 3.9246402|
|     3|   2366|   4.0| 4.0371027|
|    77|   3175|   3.0| 3.5477808|
|    73|   3175|   4.0| 3.1020522|
|    14|   3175|   4.5| 2.3943448|
|    69|    540|   3.0| 1.8369429|
|    58|    540|   2.0| 2.4598627|
|    44|    858|   5.0|  4.484232|
|    35|    858|   5.0|  4.887832|
|    58|    858|   5.0|  5.212182|
|    11|    858|   2.5|  4.303579|
|    74|    858|   5.0| 4.4629736|
|    85|   1025|   4.5| 2.1133268|
+------+-------+------+----------+
only showing top 20 rows



<b>With the baseline model we achieved an RMSE of about 1.12 on the test set.</b>
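
To make the metric concrete, here is a minimal sketch of how RMSE is computed: the square root of the mean squared difference between actual and predicted ratings. It uses only the first five rows of the table above, so its result will differ from the RMSE over the full test set. The function `rmse` is a plain-Python illustration, not the `RegressionEvaluator` implementation.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# First five (rating, prediction) rows from the table above
sample = [
    (3.0, 2.8224826),
    (4.0, 3.5979238),
    (5.0, 4.184193),
    (1.5, 3.4518485),
    (4.0, 3.137214),
]
print(f"RMSE on these five rows: {rmse(sample):.3f}")
```

Because the errors are squared before averaging, a single large miss such as the 1.5 rating predicted as 3.45 weighs much more heavily than several small ones.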

## Hyperparameter Tuning and Cross Validation

In [13]:
paramGrid = ParamGridBuilder().addGrid(als.rank, [1, 5, 10])\
                              .addGrid(als.maxIter, [20])\
                              .addGrid(als.regParam, [0.05, 0.1, 0.5])\
                              .build()
crossval = CrossValidator(estimator=als, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

In [14]:
cvModel = crossval.fit(train)

In [15]:
predictions = cvModel.transform(test)

In [16]:
print('The root mean squared error for our model is: {}'.format(evaluator.evaluate(predictions.na.drop())))

The root mean squared error for our model is: 0.9771839049156706


<b>This time we got an RMSE of 0.98, down from about 1.12 for the baseline, which is good enough to get decent recommendations. Keep in mind that we have only taken ten thousand ratings, and there is scope for more hyperparameter tuning.</b>
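
Before widening the grid, it helps to estimate how many models the cross-validation has to train. The sketch below counts the fits for the grid defined earlier, using plain Python and values copied from the `ParamGridBuilder` and `CrossValidator` calls. It assumes the usual `CrossValidator` behaviour of refitting the best candidate once on the full training set.

```python
from itertools import product

# Values copied from the ParamGridBuilder and CrossValidator calls above
ranks = [1, 5, 10]
max_iters = [20]
reg_params = [0.05, 0.1, 0.5]
num_folds = 5

# Every combination of the grid values is one candidate model
grid = list(product(ranks, max_iters, reg_params))

# Each candidate is trained once per fold; the winner is then refit on all training data
cv_fits = len(grid) * num_folds
total_fits = cv_fits + 1

print(len(grid), "candidates,", cv_fits, "cross-validation fits,", total_fits, "fits in total")
```

Adding one more value to any of the lists multiplies the number of candidates, so the cost grows quickly as the grid widens.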

In [17]:
predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|    54|   1088|   3.0| 3.0709338|
|    17|   1580|   4.0|  4.146607|
|    23|   1580|   5.0| 4.0154433|
|    25|   1580|   1.5| 3.4749508|
|    46|   1580|   4.0| 3.7126157|
|    76|   1591|   1.0| 3.7709544|
|    58|   1645|   4.0|  3.096809|
|    58|   1959|   4.0| 3.7447603|
|     3|   2366|   4.0| 4.9220014|
|    77|   3175|   3.0|  3.516558|
|    73|   3175|   4.0|    3.2938|
|    14|   3175|   4.5| 3.4732938|
|    69|    540|   3.0| 3.1697989|
|    58|    540|   2.0| 3.8704958|
|    44|    858|   5.0|  4.800685|
|    35|    858|   5.0| 4.9648337|
|    58|    858|   5.0|  5.177353|
|    11|    858|   2.5| 5.2545037|
|    74|    858|   5.0|  5.044103|
|    85|   1025|   4.5| 3.2344704|
+------+-------+------+----------+
only showing top 20 rows



<b>Next, I read the movie dataset, which contains the title and genres of each movie.</b>

In [18]:
schema = StructType([
    StructField("movieId", IntegerType()),
    StructField("title", StringType()),
    StructField("genres", StringType())
])

movies = spark.read.csv("movie.csv", header=True, schema=schema)
movies.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



In [19]:
original = ratings.join(movies,on=['movieId'], how='inner')

In [20]:
preds = predictions.join(movies,on=['movieId'], how='inner')

<b>I perform an `INNER JOIN` of both the ratings and the predictions DataFrames with the movies DataFrame on the `movieId` column. An `INNER JOIN` keeps only the rows whose `movieId` appears in both tables, i.e. their intersection.</b>
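
The join semantics can be sketched in plain Python with a few rows taken from the tables in this notebook. The `inner_join` helper is hypothetical and only mimics what the DataFrame join does; movieId 999999 is a made-up id, added to show that a rating without a matching movie is dropped.

```python
# Tiny stand-ins for the two DataFrames; movieId 999999 is a made-up id with no movie entry
ratings_rows = [
    {"movieId": 1, "userId": 91, "rating": 4.0},
    {"movieId": 1088, "userId": 54, "rating": 3.0},
    {"movieId": 999999, "userId": 54, "rating": 2.0},
]
movies_rows = [
    {"movieId": 1, "title": "Toy Story (1995)"},
    {"movieId": 1088, "title": "Dirty Dancing (1987)"},
    {"movieId": 1959, "title": "Out of Africa (1985)"},
]

def inner_join(left, right, key):
    """Keep only rows whose key value occurs on both sides, merging their columns."""
    right_by_key = {}
    for right_row in right:
        right_by_key.setdefault(right_row[key], []).append(right_row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right_by_key.get(left_row[key], [])
    ]

joined = inner_join(ratings_rows, movies_rows, "movieId")
for row in joined:
    print(row)
```

Only the two ratings with a matching movie survive, and the movie that nobody rated in this sample (movieId 1959) does not appear either.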

In [21]:
original.show()

+-------+------+------+----------------+--------------------+
|movieId|userId|rating|           title|              genres|
+-------+------+------+----------------+--------------------+
|      1|    91|   4.0|Toy Story (1995)|Adventure|Animati...|
|      1|    90|   3.5|Toy Story (1995)|Adventure|Animati...|
|      1|    84|   5.0|Toy Story (1995)|Adventure|Animati...|
|      1|    82|   5.0|Toy Story (1995)|Adventure|Animati...|
|      1|    80|   3.0|Toy Story (1995)|Adventure|Animati...|
|      1|    69|   4.0|Toy Story (1995)|Adventure|Animati...|
|      1|    66|   4.0|Toy Story (1995)|Adventure|Animati...|
|      1|    59|   4.5|Toy Story (1995)|Adventure|Animati...|
|      1|    58|   5.0|Toy Story (1995)|Adventure|Animati...|
|      1|    54|   4.0|Toy Story (1995)|Adventure|Animati...|
|      1|    53|   4.0|Toy Story (1995)|Adventure|Animati...|
|      1|    47|   1.0|Toy Story (1995)|Adventure|Animati...|
|      1|    39|   5.0|Toy Story (1995)|Adventure|Animati...|
|      1

In [28]:
preds.show()

+-------+------+------+----------+--------------------+--------------------+
|movieId|userId|rating|prediction|               title|              genres|
+-------+------+------+----------+--------------------+--------------------+
|   1088|    54|   3.0| 3.0709338|Dirty Dancing (1987)|Drama|Musical|Rom...|
|   1580|    17|   4.0|  4.146607|Men in Black (a.k...|Action|Comedy|Sci-Fi|
|   1580|    23|   5.0| 4.0154433|Men in Black (a.k...|Action|Comedy|Sci-Fi|
|   1580|    25|   1.5| 3.4749508|Men in Black (a.k...|Action|Comedy|Sci-Fi|
|   1580|    46|   4.0| 3.7126157|Men in Black (a.k...|Action|Comedy|Sci-Fi|
|   1591|    76|   1.0| 3.7709544|        Spawn (1997)|Action|Adventure|...|
|   1645|    58|   4.0|  3.096809|Devil's Advocate,...|Drama|Mystery|Thr...|
|   1959|    58|   4.0| 3.7447603|Out of Africa (1985)|       Drama|Romance|
|   2366|     3|   4.0| 4.9220014|    King Kong (1933)|Action|Adventure|...|
|   3175|    77|   3.0|  3.516558| Galaxy Quest (1999)|Adventure|Comedy|...|

<b>Now we will compare the original ratings of a particular user with the predictions the model makes for that user.</b>

In [24]:
original.filter(col('userId')==11).filter(col('rating')>=4.5).sort('rating',ascending=False).show()

+-------+------+------+--------------------+--------------------+
|movieId|userId|rating|               title|              genres|
+-------+------+------+--------------------+--------------------+
|    551|    11|   5.0|Nightmare Before ...|Animation|Childre...|
|   1591|    11|   5.0|        Spawn (1997)|Action|Adventure|...|
|    588|    11|   5.0|      Aladdin (1992)|Adventure|Animati...|
|     32|    11|   5.0|Twelve Monkeys (a...|Mystery|Sci-Fi|Th...|
|    593|    11|   5.0|Silence of the La...|Crime|Horror|Thri...|
|    150|    11|   5.0|    Apollo 13 (1995)|Adventure|Drama|IMAX|
|    673|    11|   5.0|    Space Jam (1996)|Adventure|Animati...|
|    172|    11|   5.0|Johnny Mnemonic (...|Action|Sci-Fi|Thr...|
|    736|    11|   5.0|      Twister (1996)|Action|Adventure|...|
|   1073|    11|   5.0|Willy Wonka & the...|Children|Comedy|F...|
|   1580|    11|   5.0|Men in Black (a.k...|Action|Comedy|Sci-Fi|
|   1196|    11|   5.0|Star Wars: Episod...|Action|Adventure|...|
|    260| 

<b>The highest-rated movies of user 11 mostly belong to the Action, Adventure and Sci-Fi genres.</b>

In [25]:
preds.filter(col('userId')==11).filter(col('prediction')>=4.0).sort('prediction',ascending=False).show()

+-------+------+------+----------+--------------------+--------------------+
|movieId|userId|rating|prediction|               title|              genres|
+-------+------+------+----------+--------------------+--------------------+
|   8371|    11|   4.5| 6.0230994|Chronicles of Rid...|Action|Sci-Fi|Thr...|
|  54001|    11|   5.0|  5.576076|Harry Potter and ...|Adventure|Drama|F...|
|    858|    11|   2.5| 5.2545037|Godfather, The (1...|         Crime|Drama|
|   4006|    11|   2.0|  5.180258|Transformers: The...|Adventure|Animati...|
|  71282|    11|   5.0|  5.096739|   Food, Inc. (2008)|         Documentary|
|  33493|    11|   5.0|  5.013171|Star Wars: Episod...|Action|Adventure|...|
|   5378|    11|   4.5|  5.013171|Star Wars: Episod...|Action|Adventure|...|
|    318|    11|   5.0| 4.9973865|Shawshank Redempt...|         Crime|Drama|
|   2762|    11|   5.0| 4.9828043|Sixth Sense, The ...|Drama|Horror|Mystery|
|   4232|    11|   2.5| 4.8184795|     Spy Kids (2001)|Action|Adventure|...|

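<b>The predictions are reasonably accurate: of the ten rows shown, seven are movies that user 11 actually rated 4.5 or 5.0. There are also clear misses, such as movieId 858 (rated 2.5, predicted 5.25), movieId 4006 (rated 2.0, predicted 5.18) and movieId 4232 (rated 2.5, predicted 4.82).</b>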
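
Several predictions in the table above lie outside the rating scale, for example 6.0230994, because the ALS output is not bounded by the scale. One common post-processing step, which is not part of the original notebook, is to clip predictions to the valid range before displaying or evaluating them. The sketch below assumes a 0.5 to 5.0 scale; `clip_rating` is a hypothetical helper.

```python
def clip_rating(prediction, low=0.5, high=5.0):
    """Clamp a predicted rating to [low, high]; the 0.5-5.0 bounds are an assumption.

    Written with comparisons instead of min()/max() so it still works in a session
    where `min` has been rebound by `from pyspark.sql.functions import min`.
    """
    if prediction < low:
        return low
    if prediction > high:
        return high
    return prediction

# Top predictions for user 11 from the table above
raw_predictions = [6.0230994, 5.576076, 5.2545037, 5.180258, 4.9973865]
clipped = [clip_rating(p) for p in raw_predictions]
print(clipped)  # [5.0, 5.0, 5.0, 5.0, 4.9973865]
```

On a DataFrame, a similar effect could probably be achieved with `pyspark.sql.functions.least` and `greatest` applied to the `prediction` column.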

In [26]:
original.filter(col('userId')==63).filter(col('rating')>=4.5).sort('rating',ascending=False).show()

+-------+------+------+--------------------+--------------------+
|movieId|userId|rating|               title|              genres|
+-------+------+------+--------------------+--------------------+
|    527|    63|   5.0|Schindler's List ...|           Drama|War|
|    750|    63|   5.0|Dr. Strangelove o...|          Comedy|War|
|    912|    63|   5.0|   Casablanca (1942)|       Drama|Romance|
|    920|    63|   5.0|Gone with the Win...|   Drama|Romance|War|
|    969|    63|   5.0|African Queen, Th...|Adventure|Comedy|...|
|   1050|    63|   5.0|Looking for Richa...|   Documentary|Drama|
|   1177|    63|   5.0|Enchanted April (...|       Drama|Romance|
|   1183|    63|   5.0|English Patient, ...|   Drama|Romance|War|
|   1196|    63|   5.0|Star Wars: Episod...|Action|Adventure|...|
|   1204|    63|   5.0|Lawrence of Arabi...| Adventure|Drama|War|
|   1210|    63|   5.0|Star Wars: Episod...|Action|Adventure|...|
|   1224|    63|   5.0|      Henry V (1989)|Action|Drama|Roma...|
|   1233| 

<b>The highest-rated movies of user 63 mostly belong to the Action, Drama and War genres.</b>

In [27]:
preds.filter(col('userId')==63).filter(col('prediction')>=4.0).sort('prediction',ascending=False).show()

+-------+------+------+----------+--------------------+-----------------+
|movieId|userId|rating|prediction|               title|           genres|
+-------+------+------+----------+--------------------+-----------------+
|   1050|    63|   5.0|  5.068888|Looking for Richa...|Documentary|Drama|
|   5060|    63|   5.0|  4.994673|M*A*S*H (a.k.a. M...| Comedy|Drama|War|
|   1208|    63|   4.0| 4.6689067|Apocalypse Now (1...| Action|Drama|War|
|   1234|    63|   5.0| 4.5235553|   Sting, The (1973)|     Comedy|Crime|
|   2028|    63|   5.0| 4.3869686|Saving Private Ry...| Action|Drama|War|
|   2858|    63|   4.0|  4.350927|American Beauty (...|     Comedy|Drama|
+-------+------+------+----------+--------------------+-----------------+



<b>The ALS predictions are fairly consistent with the users' original preferences. Of course, a lot more can be done to improve the model's performance, such as adding more data and tuning the hyperparameters further.</b>