## Recommender
The classic recommender tutorial uses the [MovieLens dataset](https://grouplens.org/datasets/movielens/). It plays the same role for recommenders that the iris or MNIST datasets play for other algorithms. Let's code along to get an idea of how this all works!

Looking for more datasets? Check out: https://gist.github.com/entaroadun/1653794

In [0]:
from pyspark.sql import SparkSession

In [0]:
spark = SparkSession.builder.appName('recommender').getOrCreate()

In [0]:
# read the ratings CSV into a Spark DataFrame
df = spark.read.csv('dbfs:/FileStore/movielens_ratings.csv', inferSchema=True, header=True)

In [0]:
df.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      2|   3.0|     0|
|      3|   1.0|     0|
|      5|   2.0|     0|
|      9|   4.0|     0|
|     11|   1.0|     0|
|     12|   2.0|     0|
|     15|   1.0|     0|
|     17|   1.0|     0|
|     19|   1.0|     0|
|     21|   1.0|     0|
|     23|   1.0|     0|
|     26|   3.0|     0|
|     27|   1.0|     0|
|     28|   1.0|     0|
|     29|   1.0|     0|
|     30|   1.0|     0|
|     31|   1.0|     0|
|     34|   1.0|     0|
|     37|   1.0|     0|
|     41|   2.0|     0|
+-------+------+------+
only showing top 20 rows



In [0]:
print((df.count(), len(df.columns)))

(1501, 3)


In [0]:
df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- userId: integer (nullable = true)



In [0]:
df.groupBy('rating').count().show()

+------+-----+
|rating|count|
+------+-----+
|   1.0|  941|
|   4.0|   99|
|   3.0|  179|
|   2.0|  207|
|   5.0|   75|
+------+-----+



In [0]:
from pyspark.sql.functions import col, sum as _sum

In [0]:
# count missing (null) values in each column
missing_val = df.select([_sum(col(c).isNull().cast('int')).alias(c) for c in df.columns])
missing_val.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      0|     0|     0|
+-------+------+------+



In [0]:
df.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



## Format for MLlib ALS (Alternating Least Squares)

In [0]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

In [0]:
# train/test split
train, test = df.randomSplit([0.8, 0.2])

In [0]:
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating')

In [0]:
model = als.fit(train)
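
To give a feel for what "alternating least squares" means, here is a minimal pure-Python sketch with a single latent factor per user and per movie. It is an illustration under simplifying assumptions, not Spark's implementation: the ratings are made up, and the names `REG`, `loss`, and `solve` are hypothetical helpers introduced only for this example. Each step holds one side fixed and solves the other side in closed form, so the regularized loss should not increase from one iteration to the next.

```python
# Toy rank-1 ALS on made-up ratings: rating ~ user_factor * movie_factor.
# Illustration only; Spark's ALS uses higher ranks and a distributed solver.
ratings = {
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 2.0, (2, 2): 1.0,
}  # (userId, movieId) -> rating, hypothetical data
users = sorted({u for u, _ in ratings})
items = sorted({i for _, i in ratings})
REG = 0.1  # assumed regularization strength for this sketch


def loss(user_f, item_f):
    """Squared error on the known ratings plus an L2 penalty on all factors."""
    fit = sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())
    penalty = REG * (sum(x * x for x in user_f.values())
                     + sum(x * x for x in item_f.values()))
    return fit + penalty


def solve(ids, other_f, pick):
    """Closed-form ridge update for one side while the other side is held fixed.

    pick=0 updates user factors, pick=1 updates movie factors.
    """
    new_f = {}
    for k in ids:
        num = den = 0.0
        for pair, r in ratings.items():
            if pair[pick] == k:
                x = other_f[pair[1 - pick]]
                num += r * x
                den += x * x
        new_f[k] = num / (den + REG)
    return new_f


user_f = {u: 1.0 for u in users}
item_f = {i: 1.0 for i in items}
history = [loss(user_f, item_f)]
for _ in range(10):
    user_f = solve(users, item_f, pick=0)  # fix movies, solve users
    item_f = solve(items, user_f, pick=1)  # fix users, solve movies
    history.append(loss(user_f, item_f))

print("loss before:", history[0], "loss after:", history[-1])
```

In the notebook, `maxIter` plays the role of the loop count and `regParam` the role of the penalty weight, although Spark's exact objective differs from this toy version.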

In [0]:
predictions = model.transform(test)

In [0]:
predictions.show()

+-------+------+------+-----------+
|movieId|rating|userId| prediction|
+-------+------+------+-----------+
|      0|   1.0|    11| -3.1088016|
|      0|   1.0|    26|  0.6337264|
|      1|   1.0|     3|  0.5017787|
|      1|   1.0|     5|  0.9044224|
|      1|   1.0|    19|  0.6668815|
|      1|   4.0|    15|  0.8858888|
|      2|   1.0|    23|  2.8407087|
|      3|   1.0|    17| 0.16158539|
|      3|   1.0|    21|  2.1598358|
|      3|   1.0|    29|  0.4726417|
|      4|   1.0|     5|  -1.049664|
|      4|   1.0|    12|  0.3452071|
|      4|   1.0|    29|-0.55316293|
|      4|   3.0|     2|  1.8077749|
|      6|   1.0|     4| 0.49540302|
|      6|   1.0|     9|0.026313577|
|      6|   1.0|    14|   1.120514|
|      6|   1.0|    28|-0.63778573|
|      7|   1.0|    16| 0.79914165|
|      7|   1.0|    18|  3.4117157|
+-------+------+------+-----------+
only showing top 20 rows



In [0]:
# evaluation
recommEval = RegressionEvaluator(predictionCol='prediction', labelCol='rating', metricName='rmse')
rmse = recommEval.evaluate(predictions)
print("Root Mean Squared Error : "+ str(rmse))

Root Mean Squared Error : 2.014427223609934
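
To make the metric concrete, the sketch below computes RMSE by hand on the first five (rating, prediction) pairs from the predictions table shown earlier. The helper name `rmse_by_hand` is hypothetical, and five rows are only a small sample, so the result will not match the full-test-set value reported by the evaluator.

```python
import math

# First five (rating, prediction) pairs copied from the predictions output.
sample = [
    (1.0, -3.1088016),
    (1.0, 0.6337264),
    (1.0, 0.5017787),
    (1.0, 0.9044224),
    (1.0, 0.6668815),
]


def rmse_by_hand(pairs):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(label - pred) ** 2 for label, pred in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print("RMSE on the five-row sample:", rmse_by_hand(sample))
```

Because the errors are squared before averaging, the single large miss in the first row accounts for most of the total here. The same effect is worth keeping in mind when reading the RMSE on a 1 to 5 rating scale.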


In [0]:
single_user = test.filter(test['userId']==11).select('movieId', 'userId')

In [0]:
# ratings for user 11 that landed in the test set (14 in this run)
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|      0|    11|
|     22|    11|
|     36|    11|
|     40|    11|
|     41|    11|
|     62|    11|
|     64|    11|
|     67|    11|
|     69|    11|
|     70|    11|
|     75|    11|
|     76|    11|
|     78|    11|
|     90|    11|
+-------+------+



In [0]:
recommendation = model.transform(single_user)

In [0]:
recommendation.orderBy('prediction', ascending=False).show()

+-------+------+-----------+
|movieId|userId| prediction|
+-------+------+-----------+
|     90|    11|   6.527989|
|     22|    11|   5.256767|
|     69|    11|   3.832805|
|     36|    11|   0.862678|
|     64|    11| 0.43639562|
|     78|    11|-0.41080028|
|     67|    11| -1.4732944|
|     75|    11|   -2.77473|
|      0|    11| -3.1088016|
|     76|    11| -3.9456007|
|     40|    11|  -4.123136|
|     41|    11| -4.5272965|
|     62|    11|  -5.027648|
|     70|    11| -5.8205004|
+-------+------+-----------+
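
The predictions above fall outside the 1 to 5 rating scale because ALS output is not bounded. One possible way to present a top-N list is to rank on the raw score and clip only for display, as in this plain-Python sketch. The scores are copied from the table above; the helper `top_n` and its `lo`/`hi` bounds are assumptions made for illustration, not part of the Spark API.

```python
# (movieId, prediction) pairs for user 11, copied from the output above.
scored = [
    (0, -3.1088016), (22, 5.256767), (36, 0.862678), (40, -4.123136),
    (41, -4.5272965), (62, -5.027648), (64, 0.43639562), (67, -1.4732944),
    (69, 3.832805), (70, -5.8205004), (75, -2.77473), (76, -3.9456007),
    (78, -0.41080028), (90, 6.527989),
]


def top_n(pairs, n, lo=1.0, hi=5.0):
    """Rank by raw prediction, keep the best n, then clip scores into [lo, hi]."""
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)[:n]
    return [(movie_id, min(max(score, lo), hi)) for movie_id, score in ranked]


print(top_n(scored, 3))  # [(90, 5.0), (22, 5.0), (69, 3.832805)]
```

Ranking before clipping keeps the order of movies 90 and 22 intact even though both display as 5.0.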



## Good job!