## Recommendation Engine For Movies

### Importing Data set

In [1]:
# Locate the local Spark installation so that pyspark can be imported
import findspark
findspark.init('/home/ubuntu/spark-2.4.0-bin-hadoop2.7')

In [2]:
from pyspark.sql import SparkSession

In [3]:
#creating a spark session
spark = SparkSession.builder.appName('movieRec').getOrCreate()

We will use collaborative filtering to make movie recommendations. The image below, taken from Wikipedia, shows visually how the collaborative approach works.

<img src=https://upload.wikimedia.org/wikipedia/commons/5/52/Collaborative_filtering.gif />

In [4]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [5]:
movie_data = spark.read.csv('ratings1m.csv',inferSchema=True,header=True)

### Brief view of the data set

In [6]:
movie_data.head()

Row(Column1=1, Column2=1193, Column3=5, Column4=978300760)

In [7]:
#movie_data.describe().show()

+-------+------------------+----------------+------------------+--------------------+
|summary|            userId|         movieId|            rating|           timestamp|
+-------+------------------+----------------+------------------+--------------------+
|  count|            100836|          100836|            100836|              100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|1.2059460873684695E9|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|2.1626103599513078E8|
|    min|                 1|               1|               0.5|           828124615|
|    max|               610|          193609|               5.0|          1537799250|
+-------+------------------+----------------+------------------+--------------------+



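The summary above reports about 100k ratings (100,836 rows), not 100k users: the user IDs in it range from 1 to 610. It labels the columns userId, movieId, rating and timestamp, which correspond to Column1 through Column4 in the data loaded above.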

Now we will build our model using Spark's implementation of collaborative filtering, ALS (alternating least squares).
We first split the data set randomly into a training set and a test set, roughly 70% and 30% of the rows.

In [7]:
(train, test) = movie_data.randomSplit([0.7, 0.3])
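
As a rough illustration of what a weighted random split does, the plain-Python sketch below assigns each row to a split by drawing one random number per row, so the resulting sizes are only approximately 70% and 30%. The helper `random_split`, the toy `rows` list and the seed value are hypothetical and not part of the notebook; this is a sketch of the idea, not Spark's code.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # uniform in [0, 1)
        for k, upper in enumerate(bounds):
            if x < upper:
                splits[k].append(row)
                break
    return splits

rows = list(range(1000))  # stand-in for the rating rows
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Because the assignment is random, rerunning without a seed gives a different split and therefore slightly different metrics. `randomSplit` also accepts an optional `seed` argument if a repeatable split is wanted.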

In [9]:
# model building
als = ALS(maxIter=10, regParam=0.01, userCol="Column1", itemCol="Column2", ratingCol="Column3", coldStartStrategy="drop")
model = als.fit(train)
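
To give a feel for what "alternating least squares" means, here is a minimal plain-Python sketch with a single latent factor per user and per movie. It is not Spark's implementation, which uses several factors and runs distributed; the function `als_rank1` and the toy ratings are assumptions made up for illustration. Each pass fixes the movie factors and solves a small regularized least-squares problem for every user in closed form, then swaps the roles.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iters=10):
    """Rank-1 ALS on (user, item, rating) triples with 0-based ids."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iters):
        # Fix item factors; each user's factor is a 1-D ridge regression.
        for u in range(n_users):
            seen = [(i, r) for (uu, i, r) in ratings if uu == u]
            num = sum(r * item_f[i] for i, r in seen)
            den = reg + sum(item_f[i] ** 2 for i, _ in seen)
            user_f[u] = num / den
        # Fix user factors; solve for each item's factor the same way.
        for i in range(n_items):
            seen = [(u, r) for (u, ii, r) in ratings if ii == i]
            num = sum(r * user_f[u] for u, r in seen)
            den = reg + sum(user_f[u] ** 2 for u, _ in seen)
            item_f[i] = num / den
    return user_f, item_f

# Toy data: user 1 rates everything at 0.8 times user 0's rating.
# The pair (user 1, item 2) is held out; the consistent value would be 2.0.
toy_ratings = [(0, 0, 5.0), (0, 1, 4.0), (0, 2, 2.5), (1, 0, 4.0), (1, 1, 3.2)]
user_f, item_f = als_rank1(toy_ratings, n_users=2, n_items=3)
held_out_prediction = user_f[1] * item_f[2]
print("prediction for the held-out pair:", held_out_prediction)
```

The predicted rating is simply the product of the user factor and the movie factor (a dot product when there are several factors), which is why the model can score movies a user has never rated.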

In [10]:
predictions = model.transform(test)

In [11]:
predictions.show()

+-------+-------+-------+---------+----------+
|Column1|Column2|Column3|  Column4|prediction|
+-------+-------+-------+---------+----------+
|    673|    148|      5|975620824|   5.03162|
|   3184|    148|      4|968708953| 3.6129827|
|   4784|    148|      3|970000570| 2.9289355|
|   2383|    148|      2|974417654| 2.1427507|
|   1242|    148|      3|974909976| 2.3306286|
|   1069|    148|      2|974945135| 4.0862164|
|    216|    148|      2|976870439| 1.9472154|
|   2456|    148|      2|974178993| 3.3313732|
|   2507|    148|      4|974082717| 2.7342834|
|   3053|    148|      3|970170090| 3.2123873|
|   4169|    463|      2|976589687| 2.8363774|
|     26|    463|      3|978271588| 3.7174568|
|   2051|    463|      1|974663178| 2.6280828|
|   1146|    463|      2|974939610| 1.5839099|
|   3683|    463|      1|966523740| 1.5243543|
|   5249|    463|      3|961602410| 3.6745543|
|    746|    463|      1|975470754| 2.1278586|
|    721|    463|      4|975775726| 3.4372332|
|   5795|    

### Evaluate the model

In [13]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="Column3",predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 0.9040936851898465
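
For reference, RMSE is the square root of the mean squared difference between the actual and the predicted rating. The plain-Python sketch below computes it for five (Column3, prediction) pairs copied from the `predictions.show()` output above; the helper name `rmse` is an assumption for illustration and is not the evaluator's code.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    squared = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared) / len(squared))

# (Column3, prediction) values from the first rows of predictions.show()
sample = [
    (5, 5.03162),
    (4, 3.6129827),
    (3, 2.9289355),
    (2, 2.1427507),
    (3, 2.3306286),
]
print("RMSE on the five sample rows =", rmse(sample))
```

The figure for these five rows is not comparable with the value reported above, which is computed over the whole test set.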


### Making predictions

In [15]:
#Let's say we want to give some recommendations to user 11
single_user = test.filter(test['Column1']==11).select(['Column2','Column1'])

In [16]:
single_user.show()

+-------+-------+
|Column2|Column1|
+-------+-------+
|     50|     11|
|    249|     11|
|    318|     11|
|    345|     11|
|    454|     11|
|    543|     11|
|    597|     11|
|   1244|     11|
|   1546|     11|
|   1563|     11|
|   1573|     11|
|   1732|     11|
|   1887|     11|
|   2109|     11|
|   2174|     11|
|   2306|     11|
|   2431|     11|
|   2507|     11|
|   2710|     11|
|   2746|     11|
+-------+-------+
only showing top 20 rows



In [17]:
recommended_movies = model.recommendForUserSubset(single_user,5)

In [18]:
recommended_movies.head()

Row(Column1=11, recommendations=[Row(Column2=2998, rating=8.055205345153809), Row(Column2=119, rating=7.981388568878174), Row(Column2=2963, rating=7.274289131164551), Row(Column2=1549, rating=7.258699417114258), Row(Column2=966, rating=7.174513339996338)])
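
The nested result is easier to read once flattened. The sketch below does that in plain Python using stand-in namedtuples, `UserRecs` and `Rec`, which are hypothetical and only mimic the shape of the Spark `Row` shown above; the values are copied from that output. Spark `Row` objects support the same attribute-style field access.

```python
from collections import namedtuple

# Stand-ins for the Spark Row structure shown in the output above.
Rec = namedtuple("Rec", ["Column2", "rating"])
UserRecs = namedtuple("UserRecs", ["Column1", "recommendations"])

row = UserRecs(
    Column1=11,
    recommendations=[
        Rec(2998, 8.055205345153809),
        Rec(119, 7.981388568878174),
        Rec(2963, 7.274289131164551),
        Rec(1549, 7.258699417114258),
        Rec(966, 7.174513339996338),
    ],
)

def flatten(user_recs):
    """Turn one nested recommendation row into (user, movie, score) tuples."""
    return [
        (user_recs.Column1, rec.Column2, rec.rating)
        for rec in user_recs.recommendations
    ]

flat = flatten(row)
for user, movie, score in flat:
    print(f"user {user}: movie {movie} (predicted score {score:.2f})")
```

Note that the scores are above 5, the highest rating in the data. ALS predictions are products of latent factors and are not confined to the rating scale, so they are best read as a ranking rather than as literal ratings.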