# Recommendation Engine with MovieTweetings Data <br> Movie Rating Prediction Model

### Table of Contents

* [Introduction - Business Understanding](#chapter1)
* [Data Understanding](#chapter2)
* [Data Preparation](#chapter3)
* [Rating Prediction Model](#chapter4)
* [Model Evaluation](#chapter5)
* [Recommendation Engine](#chapter6)
* [Conclusion](#chapter7)

## Introduction - Business Understanding<a class="anchor" id="chapter1"></a>

Ratings a commonly used in business by consumers for scoring the quality of products, services and articles. This information is then used by providers to make recommendations to other consumers. Many recommendation systems are rank-based, content based or based on collaborative filtering. The latter consists to group consumers with similar behavior and recommend them similar products, services or articles.
Predictive models try to predict the rating a consumer will assign to a product or service if he or she interacts with this. 

A movies streaming platform would like to make recommendations to registrated users. The current work will create a movie recommendation engine based on a movie rating prediction model which will be develop here. The model will be created on MovieTweetings dataset with contains movies details and review data.

# Data Understanding<a class="anchor" id="chapter2"></a>

The MovieTweetings dataset is a live movie rating dataset collected from Twitter. This dataset is availble here [movietweetings](https://data.world/sidooms/movietweetings) for free (e-mail registration may be necessary). It contains 3 sub-datasets.

- movies.dat: contains movie-id, movie title and genre 
- users.dat: contains details (will not be used here)
- ratings.dat: contains user-id, movie-id, rating and timestamp

## Data Preparation<a class="anchor" id="chapter3"></a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import svd_tests as t
from IPython.display import HTML, display
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from tabulate import tabulate
from funk_svd import SVD

%matplotlib inline

**Extract Data**

In [2]:
# Read in the datasets
def load_data():
    """
    Description: Read text-files separeted by "::"
    Input:
        None
    Output:
        movies, reviews - (pandas df) dataframes with movies and reviews data
    """
    movies = pd.read_csv('movies.dat', delimiter="::",encoding="utf-8", \
                  names=["movie_id","movie", "genre"], engine='python')
    
    reviews = pd.read_csv('ratings.dat', delimiter="::",encoding="utf-8", \
                  names=["user_id","movie_id","rating", "timestamp"], engine='python')
    
    movies.drop("genre",axis=1, inplace=True)
    
    return movies, reviews

movies, reviews = load_data()

# Create user-by-item matrix
def create_user_by_item_matrix(reviews):
    return reviews[['user_id', 'movie_id', 'rating', 'timestamp']]

user_items = create_user_by_item_matrix(reviews)

reviews.head(3)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,114508,8,1381006850
1,2,75314,1,1595468524
2,2,102926,9,1590148016


**Helper functions**

In [3]:
def get_movie(movie_id):
    """
    Description: Retrieve the movie title corresponding to the movie-id
    Input:
        movie_id - movie-id
    Output:
        movie title corresponding to the movie-id (str)
    """
    ldf = movies[movies.movie_id==movie_id]
    if ldf.shape[0]==0:
        return None
    return ldf.movie.values[0]
    
def movies_watched(user_id):
    """
    Description: Retrieve list of movie-ids watched by user
    Input:
        user_id - user-id
    Output:
        list of movie-ids watched by user (list)
    """
    ldf = reviews[reviews.user_id==user_id]
    if ldf.shape[0]==0:
        return []
    return ldf.movie_id.to_list()

def get_user_items(user_id, exclude_movies_watched=True):
    """
    Description: Retrieve a subset of reviews data where user did not watch the movies so far
    Input:
        user_id - user-id
        exclude_movies_watched - flag to control whether watched movies are taken into account
    Output:
        u_i_df - (pandas df) dataframe containing potentially new movies for the user 
    """
    u_movies = list(reviews.movie_id.unique())
    if exclude_movies_watched:
        u_movies = list(set(u_movies)-set(movies_watched(user_id)))
    u_ids = [user_id for i in range(len(u_movies))]
    u_i_df = pd.DataFrame(data={"u_id":u_ids, "i_id":u_movies})
    return u_i_df

get_movie(114508)

'Species (1995)'

## Rating Prediction Model <a class="anchor" id="chapter4"></a>

**Create model data**

In [4]:
def create_train_val_test(reviews, order_by, training_size, val_size, testing_size, verbose=0):
    '''    
    Description: Create and retrieves DataFrames of training data, validation and testing data 
    INPUT:
        reviews - (pandas df) dataframe to split into train and test
        order_by - (string) column name to sort by
        training_size - (int) number of rows in training set
        testing_size - (int) number of columns in the test set
        verbose - Flag to control the debugging
    
    OUTPUT:
        training_df -  (pandas df) dataframe of the training set
        validation_df - (pandas df) dataframe of the validation set
        testing_df - (pandas df) dataframe of the test set
    '''
    reviews_new = reviews.sort_values(order_by)
    training_df = reviews_new.head(training_size)[["user_id","movie_id","rating"]]
    validation_df = reviews_new.iloc[training_size:training_size+val_size][["user_id","movie_id","rating"]]
    testing_df = reviews_new.iloc[training_size+val_size:training_size+val_size+testing_size][["user_id","movie_id","rating"]]
    training_df.columns = ["u_id","i_id","rating"]
    validation_df.columns = ["u_id","i_id","rating"]
    testing_df.columns = ["u_id","i_id","rating"]
    if verbose > 0:
        print("\nTrain data observations:",training_df.shape[0])
        display(training_df.head(3))
        print("\nValidation data observations:",validation_df.shape[0])
        display(validation_df.head(3))
        print("\nTesting data observations:",testing_df.shape[0])
        display(testing_df.head(3))
    
    return training_df, validation_df, testing_df

# Use our function to create training and test datasets
train_df, val_df, test_df = create_train_val_test(user_items, 'timestamp', 10000, 2500, 1000,  verbose=1)



Train data observations: 10000


Unnamed: 0,u_id,i_id,rating
141432,11228,2171847,6
591800,46461,444778,8
618518,48416,1411238,6



Validation data observations: 2500


Unnamed: 0,u_id,i_id,rating
484885,38123,1397280,6
484881,38123,936501,7
317002,25198,2608224,10



Testing data observations: 1000


Unnamed: 0,u_id,i_id,rating
170312,13506,1074638,6
640555,50067,2302755,9
640546,50067,1860355,9


**Build the model**

In [5]:
def build_SVD_Model(training_size=10000, val_size=2500, testing_size=1000,learning_rate=0.001, \
                    regularization=0.005, n_epochs=100,n_factors=15, min_rating=1, max_rating=10, random_state=42):
    """
    Description: Build rating prediction model based on matrix factorization
    Input:
        training_size - size of training data 
        val_size - size of validation data 
        testing_size - size of test data 
        learning_rate - learning rate of the model 
        regularization - regularization parameter of the model 
        n_epochs - number of epochs to train the model
        n_factors - number of latent factor to be taken into account
        min_rating - minimum rating value 
        max_rating - maximum rating value 
        random_state - seed used to make the model reproducible
    Output:
        svd - rating prediction model based on matrix factorization
        train_df - train data of the model
        val_df - validation data of the model
        test_df - test data of the model
    """
    # Read in the datasets
    movies, reviews = load_data()
    
    if training_size <= 1:
        
        if training_size+val_size+testing_size > 1:
            training_size =  int(0.75*reviews.shape[0])
            val_size = int(0.15*reviews.shape[0])
            testing_size = int(0.10*reviews.shape[0])
        else:
            training_size =  int(training_size*reviews.shape[0])
            val_size = int(val_size*reviews.shape[0])
            testing_size = int(testing_size*reviews.shape[0])             
    
    user_items = create_user_by_item_matrix(reviews)
    
    train_df, val_df, test_df = create_train_val_test(user_items, 'timestamp',training_size=training_size, \
                                                      val_size=val_size, testing_size=testing_size)
    
    np.random.seed(random_state)
    
    svd = SVD(learning_rate=learning_rate, regularization=regularization, n_epochs=n_epochs,n_factors=n_factors, \
              min_rating=min_rating, max_rating=max_rating)
    
    svd.fit(X=train_df , X_val=val_df , early_stopping=True, shuffle=False)
    
    return svd, train_df, val_df, test_df

svd, train_df, val_df, test_df = build_SVD_Model()

Preprocessing data...

Epoch 1/100  | val_loss: 3.32 - val_rmse: 1.82 - val_mae: 1.42 - took 1.1 sec
Epoch 2/100  | val_loss: 3.29 - val_rmse: 1.81 - val_mae: 1.41 - took 0.0 sec
Epoch 3/100  | val_loss: 3.26 - val_rmse: 1.80 - val_mae: 1.40 - took 0.0 sec
Epoch 4/100  | val_loss: 3.23 - val_rmse: 1.80 - val_mae: 1.39 - took 0.0 sec
Epoch 5/100  | val_loss: 3.20 - val_rmse: 1.79 - val_mae: 1.38 - took 0.0 sec
Epoch 6/100  | val_loss: 3.18 - val_rmse: 1.78 - val_mae: 1.38 - took 0.0 sec
Epoch 7/100  | val_loss: 3.16 - val_rmse: 1.78 - val_mae: 1.37 - took 0.0 sec
Epoch 8/100  | val_loss: 3.14 - val_rmse: 1.77 - val_mae: 1.37 - took 0.0 sec
Epoch 9/100  | val_loss: 3.12 - val_rmse: 1.77 - val_mae: 1.36 - took 0.0 sec
Epoch 10/100 | val_loss: 3.11 - val_rmse: 1.76 - val_mae: 1.36 - took 0.0 sec
Epoch 11/100 | val_loss: 3.09 - val_rmse: 1.76 - val_mae: 1.35 - took 0.0 sec
Epoch 12/100 | val_loss: 3.08 - val_rmse: 1.75 - val_mae: 1.35 - took 0.0 sec
Epoch 13/100 | val_loss: 3.07 - val_rmse:

## Model Evaluation<a class="anchor" id="chapter5"></a>

In [6]:
def display_model_performance(svd, train_df, val_df, test_df):
    """
    Description: Display model performance metrics
    Input:
        svd - rating prediction model based on matrix factorization
        train_df - train data of the model
        val_df - validation data of the model
        test_df - test data of the model
    Output:
        None
    """
    #val_loss: 2.77 - val_rmse: 1.67 - val_mae: 1.27
    pred = svd.predict(test_df)
    val_rmse = svd.metrics_[-1][1]
    val_mae = svd.metrics_[-1][2]
    mae = mean_absolute_error(test_df['rating'], pred)
    rmse = np.sqrt(mean_squared_error(test_df['rating'], pred))
    ldf = pd.DataFrame(data={"Validation":[val_rmse,val_mae],"Test":[rmse, mae]},index=["rmse","mae"])
    print("Model performance")
    print(tabulate(ldf, headers='keys', tablefmt='psql'))
    test_df["pred_rating"] = pred
    print("\nTest rating prediction")
    display(test_df)
    val_df["pred_rating"] = svd.predict(val_df)
    print("\nValidation rating prediction")
    display(val_df)

display_model_performance(svd, train_df, val_df, test_df)

Model performance
+------+--------------+---------+
|      |   Validation |    Test |
|------+--------------+---------|
| rmse |      1.66263 | 1.7569  |
| mae  |      1.26711 | 1.31104 |
+------+--------------+---------+

Test rating prediction


Unnamed: 0,u_id,i_id,rating,pred_rating
170312,13506,1074638,6,7.495256
640555,50067,2302755,9,7.186410
640546,50067,1860355,9,7.151885
193541,15451,1446192,8,7.630570
193537,15451,1234719,3,6.398318
...,...,...,...,...
559398,43780,1343727,7,6.567281
435361,33875,1024648,9,8.150694
807613,63543,848228,10,7.656689
863109,67621,1583421,10,7.340000



Validation rating prediction


Unnamed: 0,u_id,i_id,rating,pred_rating
484885,38123,1397280,6,6.497524
484881,38123,936501,7,7.395053
317002,25198,2608224,10,7.317537
731837,57299,181875,8,7.465280
203576,16449,1034314,6,7.361781
...,...,...,...,...
447754,35027,1480656,7,6.699510
300760,23823,1675434,9,8.394175
217963,17446,446029,7,7.208574
123585,9711,1707386,8,8.091933


## Recommendation Engine<a class="anchor" id="chapter6"></a>

In [7]:
def make_recommendations(svd, user_id, n_recommendations=10):
    """
    Description: make recommendations of movies with the highest predicted rating
    Input:
        svd - rating prediction model based on matrix factorization
        user_id - User who will receive the recommended movies
        n_recommendations - number of recommendations
    Output:
        rec_movies - list of recommended movies
    """
    if n_recommendations < 1:
        return []
    user_items_df = get_user_items(user_id=user_id)
    pred = svd.predict(user_items_df)
    user_items_df["pred"] = [int(round(rat,0)) for rat in pred]
    user_items_df = user_items_df.sort_values(by=["pred"])
    if user_items_df.shape[0] < 1:
        return []
    n_recommendations = min(n_recommendations,user_items_df.shape[0])
    
    rec_movies = [get_movie(i_id) for i_id in user_items_df[:n_recommendations].i_id.to_list()]
    
    return rec_movies

make_recommendations(svd, 37287, n_recommendations=10)  

['Paranormal Activity 4 (2012)',
 'Killing Them Softly (2012)',
 'Movie 43 (2013)',
 'Taken 2 (2012)',
 'A Good Day to Die Hard (2013)',
 'Playing for Keeps (2012)',
 'The Darkest Hour (2011)',
 'Beautiful Creatures (2013)',
 'Identity Thief (2013)',
 'The Twilight Saga: Breaking Dawn - Part 2 (2012)']

In [8]:
## Movie Recommendation Engine
class MovieRecommendationEngine():
    """
    Movie Recommendation Engine - Instance of this class will make movies recommendations to users
    based on predicted rating. This class builds an SVD model which is used for predicting rating.
    
    """
    
    def __init__(self):
        self.training_size=10000
        self.val_size=2500
        self.testing_size=1000
        self.learning_rate=0.001
        self.regularization=0.005
        self.n_epochs=100
        self.n_factors=15
        self.min_rating=0
        self.max_rating=10
        self.random_state=42
            
    def build_model(self,training_size=10000, val_size=2500, testing_size=1000,learning_rate=0.001,\
                   regularization=0.005,n_epochs=100, n_factors=15,min_rating=0, max_rating=10,random_state=42 ):
        """
        Description: method to wrap the build_SVD_model-function above
       
        Input:
            training_size - size of training data 
            val_size - size of validation data 
            testing_size - size of test data 
            learning_rate - learning rate of the model 
            regularization - regularization parameter of the model 
            n_epochs - number of epochs to train the model
            n_factors - number of latent factor to be taken into account
            min_rating - minimum rating value 
            max_rating - maximum rating value 
            random_state - seed used to make the model reproducible
        Output:
            None
        """
        self.training_size=training_size
        self.val_size=val_size
        self.testing_size=testing_size
        self.learning_rate=learning_rate
        
        model_data = build_SVD_Model(training_size=self.training_size, val_size=self.val_size,
                                     testing_size=self.testing_size,learning_rate=self.learning_rate,
                                     regularization=self.regularization,n_epochs=self.n_epochs,
                                     n_factors=self.n_factors,min_rating=self.min_rating,
                                     max_rating = self.max_rating,random_state=self.random_state
                                    )
        
        self.svd, self.train_df, self.val_df, self.test_df = model_data
        
    def display_model_performance(self):
        """
        Description: Display model performance metrics
        Input:
            None
        Output:
            None
        """
        display_model_performance(self.svd, self.train_df, self.val_df, self.test_df)
        
    def make_recommendations(self, user_id, n_recommendations=10):
        """
        Description: make recommendations of movies with the highest predicted rating
        Input:
            user_id - User who will receive the recommended movies
            n_recommendations - number of recommendations
        Output:
            rec_movies - list of recommended movies
        """
        return make_recommendations(self.svd, user_id=user_id, n_recommendations=n_recommendations)

  
MREngine = MovieRecommendationEngine()
MREngine.build_model(training_size=0.65,val_size=0.25, testing_size=0.10)
MREngine.display_model_performance()
print("\nMaking movies recommendation...")
MREngine.make_recommendations(user_id=37287, n_recommendations=10)

Preprocessing data...

Epoch 1/100  | val_loss: 3.20 - val_rmse: 1.79 - val_mae: 1.37 - took 0.0 sec
Epoch 2/100  | val_loss: 3.10 - val_rmse: 1.76 - val_mae: 1.34 - took 0.0 sec
Epoch 3/100  | val_loss: 3.04 - val_rmse: 1.74 - val_mae: 1.33 - took 0.0 sec
Epoch 4/100  | val_loss: 3.00 - val_rmse: 1.73 - val_mae: 1.31 - took 0.1 sec
Epoch 5/100  | val_loss: 2.97 - val_rmse: 1.72 - val_mae: 1.30 - took 0.1 sec
Epoch 6/100  | val_loss: 2.95 - val_rmse: 1.72 - val_mae: 1.30 - took 0.1 sec
Epoch 7/100  | val_loss: 2.93 - val_rmse: 1.71 - val_mae: 1.29 - took 0.1 sec
Epoch 8/100  | val_loss: 2.92 - val_rmse: 1.71 - val_mae: 1.29 - took 0.0 sec
Epoch 9/100  | val_loss: 2.90 - val_rmse: 1.70 - val_mae: 1.28 - took 0.0 sec
Epoch 10/100 | val_loss: 2.89 - val_rmse: 1.70 - val_mae: 1.28 - took 0.0 sec
Epoch 11/100 | val_loss: 2.89 - val_rmse: 1.70 - val_mae: 1.28 - took 0.0 sec
Epoch 12/100 | val_loss: 2.88 - val_rmse: 1.70 - val_mae: 1.28 - took 0.0 sec
Epoch 13/100 | val_loss: 2.87 - val_rmse:

Unnamed: 0,u_id,i_id,rating,pred_rating
343861,26955,481499,7,7.755913
243017,19461,7131622,9,7.301161
814385,64136,7131622,8,7.301161
324908,25750,107818,7,8.010638
152164,12045,1618434,6,7.277356
...,...,...,...,...
350189,27482,5862902,8,7.301161
440455,34352,9426210,8,7.301161
430802,33572,2584384,8,7.301161
719492,56408,11464826,8,7.301161



Validation rating prediction


Unnamed: 0,u_id,i_id,rating,pred_rating
662730,51624,2543164,10,8.259562
762193,59839,2671706,8,7.741912
236637,18877,2119532,9,8.090947
755420,59283,4034228,9,8.317521
547400,42855,2679552,10,9.449665
...,...,...,...,...
267939,21178,6146586,8,7.301161
602990,47300,1663202,5,8.316754
169355,13468,98546,7,6.348442
846382,66557,369339,7,7.688600



Making movies recommendation...


['The Thin Red Line (1998)',
 'Left Behind (2014)',
 'Scary Movie 5 (2013)',
 'A Most Wanted Man (2014)',
 'Fantastic Four (2015)',
 'Poltergeist (2015)',
 'Getaway (2013)',
 'Tammy (2014)',
 'Zoolander 2 (2016)',
 'Kod Adi K.O.Z. (2015)']

## Conclusion<a class="anchor" id="chapter7"></a>

The Movies Recommendation Engine built in the current work with a fast matrix factorization procedure known as ["Funk SVD"](https://github.com/gbolmier/funk-svd). It is an efficient matrix factorization algorithm which is well-adapted to sparse matrices as it is usually the case for rating data. 

The performance of the SVD-model is quite good with RMSE arround 1.68 and MAE arround 1.26 on validation data for ratings in a range of 1 to 10. 

|          | **Validation** | **Test**   |
|:---------| :-------------:| ---------: |
|**RMSE**  | 1.68491        | 1.7615    |
|**MAE**   | 1.26415        | 1.34874    | 

The SVD model on MovieTweeting data can predict the ratings with a mean absolute error of 1.35 point on unknown data (test data). This shows that this model is reliable.