Movie Recommendation Engine Building in Apache Spark

For complete code in Spark, please click here

Overview

This project uses the MovieLens data set, which includes about 600 users and 9,500 movies.

The goals of the project are to:

Find similar movies.
Gain insight into movie recommendations for users.

To achieve the goal, we will:

Train the model with the Alternating Least Squares (ALS) algorithm.
Predict movie ratings and recommend movies to users.
Find the correlations between different movies and infer similarities.

The project is implemented on Spark because it is fast for large-scale data processing and easy to use.

About ALS

Many recommendation systems suggest items to users based on collaborative filtering (CF) techniques. However, CF has some major problems:

Scalability: CF scales poorly to larger datasets as more users and items are added to the database.
Item cold start: newly added movies have few or no interactions, and the system relies on a movie’s interactions to make recommendations.
Popularity bias: the system recommends the movies with the most interactions, without any personalization.

In collaborative filtering, matrix factorization is the state-of-the-art solution for the sparse data problem. The ALS recommender is a matrix factorization algorithm that uses Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR). In matrix factorization, the rating matrix is decomposed into a user matrix and a movie matrix. The columns of the user matrix describe the latent features of the users, and the rows of the movie matrix describe the latent features of the movies.

This allows the model to predict better-personalized movie ratings for users.

With matrix factorization, less-rated movies can have latent features as rich as those of popular movies, which improves the recommender’s ability to recommend less-known movies.
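To make the alternating step concrete, here is a minimal sketch of ALS with a rating-count-weighted penalty, written in plain Python so it runs without a Spark cluster. It is not the project's Spark code: the function names (`solve`, `update`, `als`, `predict`, `objective`), the tiny ratings dictionary, and the hyperparameters `k`, `lam`, and `iters` are all assumptions chosen for illustration.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def update(fixed, ratings_by_row, k, lam):
    """Re-solve every factor vector on one side while the other side is fixed."""
    out = {}
    for row_id, rated in ratings_by_row.items():
        n = len(rated)
        # Normal equations: (sum f f^T + lam * n * I) x = sum r * f
        A = [[lam * n if i == j else 0.0 for j in range(k)] for i in range(k)]
        rhs = [0.0] * k
        for col_id, r in rated.items():
            f = fixed[col_id]
            for i in range(k):
                rhs[i] += r * f[i]
                for j in range(k):
                    A[i][j] += f[i] * f[j]
        out[row_id] = solve(A, rhs)
    return out

def als(ratings, k=2, lam=0.1, iters=20, seed=0):
    """ratings maps (user, movie) -> rating; returns user and movie factors."""
    rng = random.Random(seed)
    by_user, by_movie = {}, {}
    for (u, m), r in ratings.items():
        by_user.setdefault(u, {})[m] = r
        by_movie.setdefault(m, {})[u] = r
    movies = {m: [rng.random() for _ in range(k)] for m in by_movie}
    users = {}
    for _ in range(iters):
        users = update(movies, by_user, k, lam)   # fix movies, solve users
        movies = update(users, by_movie, k, lam)  # fix users, solve movies
    return users, movies

def predict(users, movies, u, m):
    """Predicted rating is the dot product of the two latent vectors."""
    return sum(a * b for a, b in zip(users[u], movies[m]))

def objective(ratings, users, movies, lam):
    """Regularized squared error that each ALS half-step minimizes."""
    n_u, n_m = {}, {}
    loss = 0.0
    for (u, m), r in ratings.items():
        loss += (r - predict(users, movies, u, m)) ** 2
        n_u[u] = n_u.get(u, 0) + 1
        n_m[m] = n_m.get(m, 0) + 1
    loss += lam * sum(n * sum(x * x for x in users[u]) for u, n in n_u.items())
    loss += lam * sum(n * sum(x * x for x in movies[m]) for m, n in n_m.items())
    return loss

# Hypothetical toy data: three users, three movies, six observed ratings.
toy = {(1, 10): 5.0, (1, 20): 3.0, (2, 10): 4.0,
       (2, 30): 1.0, (3, 20): 2.0, (3, 30): 5.0}
toy_users, toy_movies = als(toy)
print(predict(toy_users, toy_movies, 1, 30))  # a rating user 1 never gave
```

Because each half-step solves its least-squares problem exactly, the regularized objective never increases from one iteration to the next, which is why the method is stable to run for a fixed number of iterations.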

The details of the matrix factorization are shown below:

For more about ALS, please refer to here

Exploratory data analysis

The rating dataframe is 98.3% empty.
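As a rough illustration of how such a sparsity figure can be computed, the sketch below counts the observed (user, movie) pairs and compares them with the number of cells in the full user-by-movie matrix. The function name `sparsity` and the sample pairs are hypothetical, and plain Python is used here instead of Spark dataframes.

```python
def sparsity(rated_pairs):
    """Fraction of the user-by-movie matrix that has no rating."""
    pairs = set(rated_pairs)                 # ignore duplicate ratings
    users = {user for user, _ in pairs}
    movies = {movie for _, movie in pairs}
    return 1.0 - len(pairs) / (len(users) * len(movies))

# Hypothetical sample: 4 ratings in a 3 x 3 matrix.
sample = [(1, 10), (1, 20), (2, 10), (3, 30)]
print(f"{sparsity(sample):.1%} of the matrix is empty")
```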

Histogram of the user rating count
Histogram of the average user rating

Most movie ratings lie between 1.5 and 4.5.

Model performance

RMSE = 0.687 on all data, meaning that the model's predictions are typically about 0.69 rating points above or below the values in the original ratings matrix.
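For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below shows the calculation on made-up numbers; the function name `rmse` and the sample values are assumptions, not output from the model.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical ratings and predictions.
print(rmse([4.0, 3.5, 5.0], [3.6, 3.9, 4.1]))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, so it is usually somewhat higher than the mean absolute error on the same predictions.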

Predicting movie (top) and user (bottom) ratings

Movie recommendation

We first recommend movies to users 414 and 599, who rated the largest numbers of movies (2698 and 2478, respectively).
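One way to turn latent factors into a recommendation list is to score every movie by the dot product of the user and movie vectors and keep the highest scores. The sketch below assumes that approach; the function name `recommend`, the optional `exclude` set for movies the user has already rated, and the sample vectors are all hypothetical and may differ from what the notebook does.

```python
def recommend(user_vec, movie_vecs, exclude=(), n=10):
    """Return the n movie ids with the highest predicted rating for one user."""
    scores = {
        movie: sum(u * f for u, f in zip(user_vec, vec))
        for movie, vec in movie_vecs.items()
        if movie not in exclude
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical two-dimensional latent vectors.
sample_movies = {1: [0.5, 9.0], 2: [2.0, 0.0], 3: [1.0, 1.0]}
print(recommend([1.0, 0.0], sample_movies, exclude={2}, n=2))
```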

Finding similar movies

We find similar movies based on the cosine similarity between their feature vectors: the closer the value is to 1, the more similar the two movies are. Fortunately, the matrix factorization has already extracted the latent movie features, so we can use them here.

For more about cosine similarity, please refer to here
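A minimal sketch of the similarity search, assuming the latent movie features are available as a dictionary from movie id to vector. The names `cosine_similarity` and `most_similar` and the sample vectors are illustrative only; the vectors are not the real features of any movie.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(target, features, n=5):
    """Rank all other movies by cosine similarity to the target movie."""
    sims = [
        (movie, cosine_similarity(features[target], vec))
        for movie, vec in features.items()
        if movie != target
    ]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:n]

# Hypothetical latent features keyed by movie id.
sample_features = {471: [1.0, 0.0], 1: [2.0, 0.1], 2: [0.0, 1.0], 3: [-1.0, 0.0]}
for movie, sim in most_similar(471, sample_features):
    print(movie, round(sim, 3))
```

Cosine similarity depends only on the direction of the vectors, not their length, so two movies with proportional feature vectors score as identical.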

Note that we only use movies with high rating counts here, in order to increase the accuracy of our findings. The more often a movie is rated, the more data the matrix factorization has for it, and therefore the more likely it is that the latent features extracted for that movie are close to the correct ones.

We saw a sharp decline in the number of ratings at around 100, which defines our threshold. That is, we only look for similar movies among those that have been rated more than 100 times.
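The threshold can be applied by counting ratings per movie and keeping only the movies above the cutoff. The sketch below is one way to do that in plain Python; the function name `popular_movies` and the sample ids are hypothetical.

```python
from collections import Counter

def popular_movies(rated_movie_ids, min_count=100):
    """Return ids of movies rated strictly more than min_count times."""
    counts = Counter(rated_movie_ids)
    return {movie for movie, count in counts.items() if count > min_count}

# Hypothetical rating log: one movie id per rating event.
sample_log = [1] * 101 + [2] * 100 + [3] * 5
print(popular_movies(sample_log))
```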

We pick movie 471 and find movies similar to it.

Summary

Overall, the model did a good job of predicting user ratings, specifically those between 1.5 and 4.5. However, it needs improvement when predicting extreme scores (either low or high) such as 0.5 or 5.

Two possible reasons:

Most ratings are between 2 and 4 (as the avg(rating)-count plot shows), which means extremely high or low ratings are sparse, so accurately predicting ratings in those regions may be more difficult.
The dataframe is highly sparse (98.3%), which also makes prediction more challenging.

Suggestions:

We could increase the number of iterations further to lower the RMSE. However, we did not try it here because of the long running time, and we noticed earlier that it did not help much.
Collect more data. For now, using a more complex model is not a good idea because the ratings are limited. In the future, we could encourage users to rate more films to reduce sparsity. With a larger dataset, we could use a more complex model, such as a neural network, for better performance.

weiziyuan/Movie-recommendation