In [1]:
// Recommender Systems
// The two most common types of recommender systems are content-based and collaborative filtering (CF).

// Collaborative filtering produces recommendations from users' known attitudes toward items. It uses
// the "wisdom of the crowd" to recommend items.

// Content-based recommender systems focus on the attributes of the items and recommend items whose attributes are similar.


Intitializing Scala interpreter ...

Spark Web UI available at http://74e4be6e0134:4045
SparkContext available as 'sc' (version = 3.2.1, master = local[*], app id = local-1651749247805)
SparkSession available as 'spark'


In general, collaborative filtering (CF) is more commonly used than content-based filtering because it usually gives better results
and is relatively easy to understand (from an overall implementation perspective).
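
To make the distinction between the two approaches concrete, here is a minimal plain-Scala sketch. It is only an illustration: the genre flags and ratings are hypothetical values, and cosine similarity is one common similarity measure, not what the ALS model below computes (ALS learns latent factors by matrix factorization).

```scala
// Cosine similarity between two equal-length numeric vectors.
def cosine(a: Seq[Double], b: Seq[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  // Define similarity as 0 when either vector is all zeros.
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Content-based: compare items by their attributes
// (hypothetical genre flags: action, comedy, drama).
val movieA = Seq(1.0, 0.0, 1.0)
val movieB = Seq(1.0, 0.0, 0.0)
val itemSimilarity = cosine(movieA, movieB)

// Collaborative: compare users by the ratings they gave to the same
// three movies (hypothetical ratings).
val userX = Seq(5.0, 3.0, 4.0)
val userY = Seq(4.0, 2.0, 5.0)
val userSimilarity = cosine(userX, userY)
```

The same function serves both cases; what changes is the input. Content-based filtering needs item attributes, while collaborative filtering needs only the ratings themselves.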

In [3]:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS

val ratings = spark.read.option("header","true").option("inferSchema","true").csv("movie_ratings.csv")

ratings.head()
ratings.printSchema()

val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

// Build the recommendation model using ALS on the training data
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(training)

// Evaluate the model by computing the absolute error between each prediction and the actual rating
val predictions = model.transform(test)

// import to use abs()
import org.apache.spark.sql.functions._
val error = predictions.select(abs($"rating"-$"prediction"))

// Drop NaN predictions (users or movies in the test set that were not seen in training), then summarize
error.na.drop().describe().show()


root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)

+-------+--------------------------+
|summary|abs((rating - prediction))|
+-------+--------------------------+
|  count|                     19179|
|   mean|        0.8278953580014744|
| stddev|        0.7243543891232413|
|    min|       1.10626220703125E-4|
|    max|         6.501978397369385|
+-------+--------------------------+



import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS
ratings: org.apache.spark.sql.DataFrame = [userId: int, movieId: int ... 1 more field]
training: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userId: int, movieId: int ... 1 more field]
test: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userId: int, movieId: int ... 1 more field]
als: org.apache.spark.ml.recommendation.ALS = als_004f1e3e5f2c
model: org.apache.spark.ml.recommendation.ALSModel = ALSModel: uid=als_004f1e3e5f2c, rank=10
predictions: org.apache.spark.sql.DataFrame = [userId: int, movieId: int ... 2 more fields]
import org.apache.spark.sql.functions._
error: org.apache.spark.sql.DataFrame = [abs((rating - prediction)): double]
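
The cell above imports `RegressionEvaluator` but computes the error by hand. As a possible alternative, the sketch below uses the evaluator directly and lets ALS drop cold-start rows itself through `setColdStartStrategy("drop")`. It assumes the `training` and `test` DataFrames from the cell above are still in scope; the names `alsDrop`, `predictionsDrop`, `evaluator`, `rmse` and `mae` are introduced here for illustration only.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS

// Same hyperparameters as above, but rows whose prediction would be NaN
// (users or movies present in test but absent from training) are dropped.
val alsDrop = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop")

// Assumes `training` and `test` come from the randomSplit in the cell above.
val predictionsDrop = alsDrop.fit(training).transform(test)

val evaluator = new RegressionEvaluator()
  .setLabelCol("rating")
  .setPredictionCol("prediction")

// Root mean squared error and mean absolute error on the held-out ratings.
val rmse = evaluator.setMetricName("rmse").evaluate(predictionsDrop)
val mae = evaluator.setMetricName("mae").evaluate(predictionsDrop)
println(s"RMSE = $rmse, MAE = $mae")
```

Because `randomSplit` was called without a seed, the exact numbers will differ from run to run; the MAE should be in the same range as the mean reported by `describe()` above.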


In [4]:
predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|     1|   2294|   2.0| 1.9593182|
|     2|    372|   3.0|  4.620823|
|     2|    319|   1.0|  5.221806|
|     2|    225|   3.0|   3.78617|
|     2|    168|   3.0|  5.548937|
|     1|   2968|   1.0| 2.2339349|
|     1|   3671|   3.0| 2.7036228|
|     2|    720|   4.0|  2.949804|
|     3|    247|   3.5| 3.3029373|
|     2|    314|   4.0| 6.0166364|
|     2|    509|   4.0|  4.419809|
|     3|    527|   3.0| 3.8954492|
|     2|    261|   4.0|  4.357129|
|     3|    356|   5.0|   3.58261|
|     3|    866|   3.0|  3.367491|
|     3|    595|   2.0| 3.4563918|
|     2|    382|   3.0|0.56606513|
|     2|    150|   5.0| 2.6441379|
|     2|    508|   4.0| 2.4963877|
|     2|    454|   4.0| 2.7321844|
+------+-------+------+----------+
only showing top 20 rows
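
Some predictions in the table fall outside the range of the observed ratings (for example 6.0166364), since ALS does not constrain its output. One simple option is to clip predictions to the rating scale before reporting them. The sketch below assumes a scale of 0.5 to 5.0, which is a guess based on the values shown and should be adjusted to the actual dataset; `minRating`, `maxRating` and `clipped` are hypothetical names.

```scala
import org.apache.spark.sql.functions.{col, greatest, least, lit}

// Assumed rating scale; adjust to the bounds of your dataset.
val minRating = 0.5
val maxRating = 5.0

// Drop NaN predictions first: Spark orders NaN above every other number,
// so least(NaN, maxRating) would otherwise turn a NaN into maxRating.
val clipped = predictions
  .na.drop(Seq("prediction"))
  .withColumn(
    "clippedPrediction",
    least(greatest(col("prediction"), lit(minRating)), lit(maxRating))
  )

clipped.show(5)
```

Clipping can only move an out-of-range prediction closer to the true rating, so it should not increase the error, but it does not fix the underlying model. Tuning `regParam` or `rank` is the more principled remedy.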

