Recommender systems can recommend many kinds of items to users.
Common categories include:
1. Retail Products
2. Jobs
3. Connections/Friends
4. Movies/Music/Videos/Books/Articles
5. Ads

RS = recommender system <br>
There are five main types of RS:
1. Popularity Based RS
2. Content Based RS
3. Collaborative Filtering based RS
4. Hybrid RS
5. Association Rule Mining based RS

### Popularity Based RS
This is the simplest RS that can be used to recommend products or
content to users. It recommends the items that most users have
bought, viewed, liked, or downloaded. It is easy to implement, but
the recommendations are not personalized because they are the same
for every user. Even so, it sometimes outperforms some of the more
sophisticated RS.

Popularity can be measured by signals such as:

1. No. of times downloaded
2. No. of times bought
3. No. of times viewed
4. Highest rated
5. No. of times shared
6. No. of times liked
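
As a rough illustration of the idea, the sketch below ranks items by how often they appear in an interaction log and returns the same top items for every user. The `interactions` data and the `most_popular` helper are hypothetical names made up for this example; they are not part of the notebook's dataset or of any library.

```python
from collections import Counter

# Hypothetical interaction log: (user_id, item) pairs, one per purchase/view/like.
interactions = [
    (1, "A"), (2, "A"), (3, "A"),
    (1, "B"), (2, "B"),
    (3, "C"),
]

def most_popular(interactions, n=2):
    """Return the n items with the most interactions, most popular first."""
    counts = Counter(item for _, item in interactions)
    # Sort by descending count, then by item name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]

# Every user receives this same list, which is why the results are not personalized.
top_items = most_popular(interactions)
print(top_items)
```

Swapping the counted event (downloads, purchases, shares, and so on) changes the popularity signal without changing the logic.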

### Content Based RS
This type of RS recommends items similar to those the user has liked
in the past. The idea is to calculate a similarity score between any
two items and to recommend items to the user based on the profile of
the user’s interests.
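
One common choice of similarity score is cosine similarity between item feature vectors. The sketch below assumes each movie is described by a small, made-up vector of genre weights; `item_features`, `cosine_similarity`, and `similar_items` are hypothetical names used only for illustration, and real systems typically derive such vectors from metadata or text.

```python
import math

# Hypothetical item profiles: genre weights in the order (action, comedy, romance).
item_features = {
    "Movie A": [1.0, 0.0, 0.0],
    "Movie B": [0.9, 0.1, 0.0],
    "Movie C": [0.0, 1.0, 0.5],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_items(liked, item_features, n=1):
    """Return the n items most similar to an item the user liked."""
    target = item_features[liked]
    scores = {
        title: cosine_similarity(target, vec)
        for title, vec in item_features.items()
        if title != liked  # do not recommend the item itself
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

recommended = similar_items("Movie A", item_features)
print(recommended)
```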

In [2]:


#import and create sparksession object
from pyspark.sql import SparkSession 
spark=SparkSession.builder.appName('rc').getOrCreate()



In [3]:
#import the required functions (only rand is used below)
from pyspark.sql.functions import rand

In [4]:
#load the dataset and create spark dataframe
df=spark.read.csv('movie_ratings_df.csv',inferSchema=True,header=True)

In [5]:
df.show(10)

+------+------------+------+
|userId|       title|rating|
+------+------------+------+
|   196|Kolya (1996)|     3|
|    63|Kolya (1996)|     3|
|   226|Kolya (1996)|     5|
|   154|Kolya (1996)|     3|
|   306|Kolya (1996)|     5|
|   296|Kolya (1996)|     4|
|    34|Kolya (1996)|     5|
|   271|Kolya (1996)|     4|
|   201|Kolya (1996)|     4|
|   209|Kolya (1996)|     4|
+------+------------+------+
only showing top 10 rows



In [6]:
#validate the shape of the data 
print("Shape of dataset is:",(df.count(),len(df.columns)))

Shape of dataset is: (100000, 3)


In [7]:
#check columns in dataframe
df.printSchema()



root
 |-- userId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- rating: integer (nullable = true)



In [8]:
#check number of ratings by each user
df.groupBy('userId').count().orderBy('count',ascending=False).show(10,False)

+------+-----+
|userId|count|
+------+-----+
|405   |737  |
|655   |685  |
|13    |636  |
|450   |540  |
|276   |518  |
|416   |493  |
|537   |490  |
|303   |484  |
|234   |480  |
|393   |448  |
+------+-----+
only showing top 10 rows



In [9]:
#number of times each movie has been rated
df.groupBy('title').count().orderBy('count',ascending=False).show(10,False)

+-----------------------------+-----+
|title                        |count|
+-----------------------------+-----+
|Star Wars (1977)             |583  |
|Contact (1997)               |509  |
|Fargo (1996)                 |508  |
|Return of the Jedi (1983)    |507  |
|Liar Liar (1997)             |485  |
|English Patient, The (1996)  |481  |
|Scream (1996)                |478  |
|Toy Story (1995)             |452  |
|Air Force One (1997)         |431  |
|Independence Day (ID4) (1996)|429  |
+-----------------------------+-----+
only showing top 10 rows



In [11]:
#import StringIndexer and IndexToString to convert between string and numeric values
from pyspark.ml.feature import StringIndexer,IndexToString



In [12]:
#creating string indexer to convert the movie title column values into numerical values
stringIndexer = StringIndexer(inputCol="title", outputCol="title_new")

In [13]:
#applying stringindexer object on dataframe movie title column
model = stringIndexer.fit(df)

In [14]:
#creating new dataframe with transformed values
indexed = model.transform(df)

In [15]:
#number of times each numerical movie title has been rated 
indexed.groupBy('title_new').count().orderBy('count',ascending=False).show(10,False)

+---------+-----+
|title_new|count|
+---------+-----+
|0.0      |583  |
|1.0      |509  |
|2.0      |508  |
|3.0      |507  |
|4.0      |485  |
|5.0      |481  |
|6.0      |478  |
|7.0      |452  |
|8.0      |431  |
|9.0      |429  |
+---------+-----+
only showing top 10 rows
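
The counts for `title_new` match the per-title counts shown earlier (583, 509, 508, ...) because `StringIndexer`, by default, assigns indices in order of descending label frequency, so the most frequently rated title receives index 0.0. The plain-Python sketch below mimics that ordering on a tiny made-up list. `frequency_index` is a hypothetical helper, not Spark's implementation, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

# Tiny made-up sample of the title column.
titles = [
    "Star Wars (1977)", "Fargo (1996)", "Star Wars (1977)",
    "Contact (1997)", "Star Wars (1977)", "Contact (1997)",
]

def frequency_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Assumption: ties are broken alphabetically to keep the mapping deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

mapping = frequency_index(titles)
print(mapping)
```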



In [16]:
#split the data into training and test datasets
train,test=indexed.randomSplit([0.75,0.25])

In [17]:
#count number of records in train set
print("shape of train data",train.count())

shape of train data 74973


In [18]:
#count number of records in test set
print("shape of test data",test.count())

shape of test data 25027


In [19]:
#import ALS recommender function from pyspark ml library
from pyspark.ml.recommendation import ALS

In [20]:


#create the ALS recommender object with its hyperparameters
rec=ALS(maxIter=10,regParam=0.01,userCol='userId',itemCol='title_new',ratingCol='rating',nonnegative=True,coldStartStrategy="drop")



In [21]:
#fit the model on train set
rec_model=rec.fit(train)
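
For intuition about what fitting does, ALS factorizes the ratings into a latent vector per user and per item. A textbook form of the objective is sketched below. It is not necessarily the exact objective Spark optimizes, since the library's handling of regularization and other details may differ. Here $\mathcal{K}$ is the set of observed (user, item) ratings in `train`, $r_{ui}$ is an observed rating, $\mathbf{x}_u$ and $\mathbf{y}_i$ are the user and item factor vectors (symbols introduced for this sketch), and $\lambda$ plays the role of `regParam`.

$$
\min_{X,\,Y} \sum_{(u,i)\in\mathcal{K}} \left(r_{ui} - \mathbf{x}_u^{\top}\mathbf{y}_i\right)^2
+ \lambda\left(\sum_u \lVert \mathbf{x}_u \rVert^2 + \sum_i \lVert \mathbf{y}_i \rVert^2\right)
$$

"Alternating" refers to fixing the item factors and solving a least-squares problem for the user factors, then fixing the user factors and solving for the item factors, repeated up to `maxIter` times. With `nonnegative=True`, the factors are additionally constrained to be nonnegative.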

In [22]:
#making predictions on test set 
predicted_ratings=rec_model.transform(test)

In [23]:
#columns in predicted ratings dataframe
predicted_ratings.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- rating: integer (nullable = true)
 |-- title_new: double (nullable = false)
 |-- prediction: float (nullable = false)



In [24]:
#predicted vs actual ratings for test set 
predicted_ratings.orderBy(rand()).show(10)

+------+--------------------+------+---------+----------+
|userId|               title|rating|title_new|prediction|
+------+--------------------+------+---------+----------+
|   669|Dances with Wolve...|     4|     61.0| 3.6622992|
|    13|Wes Craven's New ...|     1|    742.0| 2.7547276|
|    58|     Die Hard (1988)|     4|     73.0| 3.9408581|
|   559|       Gandhi (1982)|     4|    122.0| 3.1528165|
|    24|      Amadeus (1984)|     5|     50.0|  4.792174|
|   921| First Knight (1995)|     4|    389.0| 3.1264238|
|   850|Dead Poets Societ...|     3|     65.0| 4.8176537|
|   533|    Notorious (1946)|     4|    587.0|  2.701545|
|   378|Hunchback of Notr...|     5|    253.0| 3.5229619|
|    11|       Grease (1978)|     2|    166.0|  3.551807|
+------+--------------------+------+---------+----------+
only showing top 10 rows



In [31]:
#importing Regression Evaluator to measure RMSE
from pyspark.ml.evaluation import RegressionEvaluator

#create RegressionEvaluator object for measuring accuracy
#predictionCol must match the 'prediction' column produced by rec_model.transform
evaluator=RegressionEvaluator(metricName='rmse',predictionCol='prediction',labelCol='rating')


In [33]:
#apply the evaluator on the predicted_ratings dataframe to calculate RMSE
rmse=evaluator.evaluate(predicted_ratings)
#print RMSE error
print(rmse)
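
RMSE is the square root of the mean squared difference between actual and predicted ratings. The plain-Python sketch below computes it for the ten sampled rows shown in the predicted-versus-actual table above, purely to show what the metric measures. `root_mean_squared_error` is a hypothetical helper, and the value for ten rows is not the RMSE of the full test set.

```python
import math

# (actual rating, predicted rating) pairs copied from the ten sampled rows above.
pairs = [
    (4, 3.6622992), (1, 2.7547276), (4, 3.9408581), (4, 3.1528165),
    (5, 4.792174), (4, 3.1264238), (3, 4.8176537), (4, 2.701545),
    (5, 3.5229619), (2, 3.551807),
]

def root_mean_squared_error(pairs):
    """Square root of the mean of squared (actual - predicted) differences."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

sample_rmse = root_mean_squared_error(pairs)
print(round(sample_rmse, 3))
```

Because errors are squared before averaging, a few large misses (such as a predicted 4.8 for an actual 3) raise RMSE more than many small ones.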