## Movie Recommendation System based on Latent Factor Model (LFM) Algorithm

### 1 Abstract

This study presents a personalized movie recommendation system based on a Latent Factor Model (LFM), intended to reduce the decision fatigue users face when choosing among large volumes of digital content. Using the MovieLens dataset, we built a model that provides tailored recommendations. Training and validation loss differ noticeably, which may point to inconsistencies in how the two are computed or to insufficient model complexity. The final evaluation on the test set yielded a mean squared error (MSE) of 0.9465, indicating adequate performance with room for refinement.

### 2 Motivation

In today's digital age, the sheer volume of movies, TV shows, and video content available to consumers has reached unprecedented levels. Paradoxically, this wealth of options often leads to confusion and indecision among users. They can spend excessive amounts of time searching for content that aligns with their preferences, ultimately detracting from their entertainment experience. Furthermore, as online media platforms intensify their competition for user engagement, the need for personalized content recommendations becomes increasingly paramount. Without tailored suggestions, platforms risk losing their user base and, consequently, their profitability.

To address these challenges, we have chosen to focus on movie recommendations. Our objective is to develop a personalized movie recommendation system that empowers users to swiftly discover content that resonates with their tastes. This not only enhances the user experience but also enables online media platforms to deliver tailored services, retain users, and boost their bottom line.

### 3 Data description

The dataset comprises 1,000,209 anonymous ratings from 6,040 users who joined MovieLens in 2000, reflecting a diverse set of movie preferences. The raw data comes from the GroupLens Research Project at the University of Minnesota; the source is listed in the references. Each record consists of a User ID, a Movie ID, a Rating, and a Timestamp. User IDs range from 1 to 6040, Movie IDs range from 1 to 3952, and ratings are on a 5-star scale. The timestamp is given in seconds since the epoch.
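
As a minimal sketch of the record layout described above, the snippet below parses one line in the `UserID::MovieID::Rating::Timestamp` form (the `::` separator is the one passed to `split_data` later in this document). The helper name `parse_rating_line` and the sample record are illustrative assumptions, not part of the project code.

```python
from datetime import datetime, timezone

def parse_rating_line(line: str) -> dict:
    """Parse one 'UserID::MovieID::Rating::Timestamp' record (hypothetical helper)."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return {
        "user": int(user_id),
        "item": int(movie_id),
        "rate": float(rating),
        # The timestamp is stored as seconds since the Unix epoch (UTC).
        "time": datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    }

# Illustrative record in the same layout as ratings.dat.
record = parse_rating_line("1::1193::5::978300760")
print(record["user"], record["item"], record["rate"], record["time"].year)
```

Converting the timestamp is optional here; the model in this report uses only the user, item, and rating columns.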

### 4 Define and Use Class and Function to Process the Data

##### *INFORMATION*

**1. Function**

* Read and preprocess the movie rating data
* Split the data into train (80%), validation (10%) and test (10%) sets

**2. Class**

* Class **DataSampler** returns batched train and validation data
* Class **DataSamplerForTest** returns the test data, either in batches or all at once

**3. Others**

* To run this script, please download [*searchparameters.py* ](https://github.com/si-tong-chen/Computational-Tool-for-Data-Science)
* Click [*dataset*](https://github.com/si-tong-chen/Computational-Tool-for-Data-Science) to download the data
* If you run into problems when running this script, please use the same versions as follows:

       - python 3.9.11 
       - TensorFlow==2.8.0
       - cuda/11.4 
       - cudnn/v8.2.2.26-prod-cuda-11.4 



#### 4.1 Define Class and Function

In [2]:
# Read and preprocess the movie rating data
import numpy as np
import pandas as pd

def read_data_and_process(filename, sep="\t"):
    col_names = ["user", "item", "rate", "st"] # 'st' is the rating timestamp (seconds since the epoch)
    df = pd.read_csv(filename, sep=sep, header=None, names=col_names, engine='python') 

    # Drop rows with null values before casting, because NaN cannot be cast to int32
    if df.isnull().any().any():
        print("The document contained null values, and these null values were deleted.")
        df = df.dropna()
        df = df.reset_index(drop=True)
    else:
        print("The document doesnot contain null values")

    df["user"] -= 1  # user IDs in the raw data start from 1
    df["item"] -= 1  # item IDs in the raw data start from 1
    
    for col in ("user", "item"): 
        df[col] = df[col].astype(np.int32)
    df["rate"] = df["rate"].astype(np.float32)
    return df

In [35]:
# Split the data into train (80%), validation (10%) and test (10%) sets

def split_data(path):
    df = read_data_and_process(path, sep="::")
    rows = len(df)
    # Shuffle the rows before splitting (unseeded, so the split differs between runs)
    df = df.iloc[np.random.permutation(rows)].reset_index(drop=True)
    
    split_index_train_val = int(rows * 0.8)
    split_index_val_test = int(rows * 0.9)
    df_train = df.iloc[:split_index_train_val,:]  
    df_validation = df.iloc[split_index_train_val:split_index_val_test,:]
    df_test = df.iloc[split_index_val_test:,:] 

    print('The shape of train:', df_train.shape)
    print('The shape of validaiton:',df_validation.shape)
    print('The shape of test:',df_test.shape)
    
    return df_train, df_validation,df_test
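
The 80/10/10 cut points used in `split_data` can be checked on plain row indices. The following is a minimal sketch using only the standard library; the helper name `split_indices` and the fixed seed are assumptions added here for reproducibility (the original function shuffles without a seed).

```python
import random

def split_indices(n_rows, seed=42):
    """Shuffle row indices and cut them at 80% and 90% (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed gives a reproducible split
    train_end = int(n_rows * 0.8)
    val_end = int(n_rows * 0.9)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

# 1,000,209 is the number of ratings in the dataset described in Section 3.
train_idx, val_idx, test_idx = split_indices(1_000_209)
print(len(train_idx), len(val_idx), len(test_idx))  # 800167 100021 100021
```

The three parts are disjoint and together cover every row exactly once, which is the property the split relies on.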

In [4]:
# Class that returns random batches of train or validation data

class DataSampler(object):
    """
    DataSampler is used for obtaining batched train data and validation data.

    Args:
    - inputs: Input data for training or validation.
    - batch_size: The batch size for sampling data.

    Attributes:
    - inputs: Transposed and stacked input data.
    - batch_size: The specified batch size.
    - num_cols: Number of columns in the input data.
    - len: Length of the input data.

    Methods:
    - __len__: Returns the length of the input data.
    - __iter__: Returns the data sampler itself.
    - __next__: Returns the next batch of data.
    - next: Returns the next batch of data.

    Usage:
    - Create an instance of DataSampler to sample batches of training or validation data.
    - Iterate over the DataSampler to obtain batches of data.
    """

    def __init__(self, inputs, batch_size=64):
        self.inputs = inputs 
        self.batch_size = batch_size
        self.num_cols = len(self.inputs)
        self.len = len(self.inputs[0])
        self.inputs = np.transpose(np.vstack([np.array(self.inputs[i]) for i in range(self.num_cols)]))

    def __len__(self):
        return self.len 

    def __iter__(self):
        return self   
    
    def __next__(self): 
        return self.next() 
    
    def next(self): 
        ids = np.random.randint(0, self.len, (self.batch_size,))
        out = self.inputs[ids, :] 
        return [out[:, i] for i in range(self.num_cols)]

In [5]:
# Class that returns the test data in sequential batches (or all at once)
class DataSamplerForTest(DataSampler):
    """
    DataSamplerForTest is used for obtaining batched test data.

    Args:
    - inputs: Input data for testing.
    - batch_size: The batch size for testing data. If batch_size is positive, data will be split into batches.

    Attributes:
    - idx_group: List of indices indicating the data groups.
    - group_id: Current group index.

    Methods:
    - next: Retrieves the next batch of test data.

    Usage:
    - Create an instance of DataSamplerForTest to sample test data in batches for testing purposes.
    """
    def __init__(self, inputs, batch_size=64):
        super(DataSamplerForTest, self).__init__(inputs, batch_size=batch_size)        
        
        if batch_size > 0:
            self.idx_group = np.array_split(np.arange(self.len), np.ceil(self.len / batch_size))
        else:
            self.idx_group = [np.arange(self.len)] 
            
        self.group_id = 0

    def next(self):
        if self.group_id >= len(self.idx_group):
            self.group_id = 0
            raise StopIteration 
                
        out = self.inputs[self.idx_group[self.group_id], :]
        self.group_id += 1
        return [out[:, i] for i in range(self.num_cols)]
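
The two samplers differ in how they pick rows: `DataSampler` draws random indices with replacement and never stops, while `DataSamplerForTest` walks through every row exactly once and then raises `StopIteration`. The sketch below illustrates the second behaviour with plain lists. It is a simplified stand-in (fixed-size chunks rather than the balanced groups produced by `np.array_split`), and the function name `sequential_batches` is hypothetical.

```python
def sequential_batches(n_rows, batch_size):
    """Yield lists of row indices that cover 0..n_rows-1 exactly once, in order."""
    if batch_size <= 0:
        # A non-positive batch size means "return everything as one batch".
        yield list(range(n_rows))
        return
    for start in range(0, n_rows, batch_size):
        yield list(range(start, min(start + batch_size, n_rows)))

batches = list(sequential_batches(10, 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

whole = list(sequential_batches(10, -1))
print(len(whole))  # 1
```

Covering each test row exactly once matters for evaluation, because a metric computed on randomly resampled rows would weight some ratings more than others.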

#### 4.2 Use Class and Function

In [6]:
batch_size = 2000
path = '/zhome/77/2/193848/BACT/data/movielens/ml-1m/ratings.dat'
df_train,df_validation,df_test = split_data(path)

The document doesnot contain null values
The shape of train: (800167, 4)
The shape of validaiton: (100021, 4)
The shape of test: (100021, 4)


In [7]:
# get the batched train, validation and test data
train_data = DataSampler([df_train["user"],
                                  df_train["item"],
                                  df_train["rate"]],
                                  batch_size=batch_size)
validation_data = DataSampler([df_validation["user"],
                                  df_validation["item"],
                                  df_validation["rate"]],
                                  batch_size=batch_size)
test_data = DataSamplerForTest([df_test["user"],
                                   df_test["item"],
                                   df_test["rate"]],
                                   batch_size=-1)

### 5 Model Part

##### *INFORMATION*

**1. Create SVDModel**

- Create and initialize the embedding matrices for all users and items.
- Construct the LFM algorithm:

    $y_{pred[u, i]} = bias_{global} + bias_{user[u]} + bias_{item[i]} + \langle embedding_{user[u]}, embedding_{item[i]} \rangle$

- Prepare to add regularization into the loss function:

    $loss = \sum_{u, i} \left( |y_{pred[u, i]} - y_{true[u, i]}|^2 + \lambda(\|embedding_{user[u]}\|^2 + \|embedding_{item[i]}\|^2) \right)$

<img src="image/tf_svd_graph.png" alt="Image 1" width="400" height="200"> 
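
To make the prediction and loss formulas concrete, here is a small worked example in plain Python for a single (user, item) pair. All numbers are made up for illustration, and the function names are hypothetical; the clipping to the range 1 to 5 mirrors what `SVDModel.call` does in Section 5.1.

```python
def predict(global_bias, user_bias, item_bias, user_vec, item_vec):
    """Global bias + user bias + item bias + dot product, clipped to [1, 5]."""
    dot = sum(p * q for p, q in zip(user_vec, item_vec))
    raw = global_bias + user_bias + item_bias + dot
    return min(5.0, max(1.0, raw))

def regularized_squared_error(y_true, y_pred, user_vec, item_vec, lam):
    """Squared error plus lambda times the squared norms of both embeddings."""
    penalty = sum(p * p for p in user_vec) + sum(q * q for q in item_vec)
    return (y_pred - y_true) ** 2 + lam * penalty

# Made-up values for one user and one movie (2 latent factors).
user_vec, item_vec = [0.5, 1.0], [1.0, 0.5]
y_pred = predict(3.5, 0.2, -0.1, user_vec, item_vec)      # 3.5 + 0.2 - 0.1 + 1.0
loss = regularized_squared_error(4.0, y_pred, user_vec, item_vec, lam=0.1)
print(round(y_pred, 4), round(loss, 4))  # 4.6 0.61
```

The penalty term grows with the size of the embedding entries, which is what discourages the model from fitting individual ratings with very large latent factors.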

**2. Model Initialization and Setup**

- Data and problem setup: the number of users and items, the dimensionality of the latent factors, and the regularization strength
- Create an instance of the SVD model
- Define a loss function and select an optimizer
- Choose a performance metric


**3. Test Model**

- Test forward propagation

**4. Hyperparameter Tuning**
- Find the best hyperparameters

**5. Train Model**

**6. Model Evaluation**



#### 5.1 Create SVDModel

In [8]:
# Create SVDModel
import tensorflow as tf

class SVDModel(tf.keras.Model):
    '''

    This class represents a Singular Value Decomposition (SVD) model used in the Latent Factor Model (LFM) algorithm.
    
    Attributes:
    - user_num: Number of users in the dataset.
    - item_num: Number of items (movies) in the dataset.
    - dim: Dimensionality of the latent factors for users and items.
    - reg_strength: Regularization strength for preventing overfitting.
    
    Methods:
    - __init__: Initializes the SVD model with user and item embeddings, biases, and regularization.
    - call: Defines the forward pass of the model to compute predicted ratings.
    '''
 
    def __init__(self, user_num, item_num, dim,reg_strength):
        super(SVDModel, self).__init__()
        self.reg_strength = reg_strength
        initializer = tf.keras.initializers.TruncatedNormal(stddev=0.02)
        self.user_emb = tf.keras.layers.Embedding(user_num, dim, embeddings_initializer=initializer,trainable=True)
        self.item_emb = tf.keras.layers.Embedding(item_num, dim, embeddings_initializer=initializer,trainable=True)
                           
        self.global_bias = tf.Variable(initial_value=0.0, dtype=tf.float32, name="global_bias")
        self.bias_user = tf.keras.layers.Embedding(user_num, 1, embeddings_initializer='zeros', name="bias_user")
        self.bias_item = tf.keras.layers.Embedding(item_num, 1, embeddings_initializer='zeros', name="bias_item")
    
    
    def call(self, inputs):
        '''
        Defines the forward pass of the SVD model to compute predicted ratings.

        Args:
        - inputs: A tuple of user and item indices.

        Returns:
        - output_star: Predicted ratings clipped between 1.0 and 5.0.
        '''
        user, item = inputs
        user_emb = self.user_emb(user)
        item_emb = self.item_emb(item)
        
        dot = tf.reduce_sum(tf.multiply(user_emb, item_emb), axis=1)
        global_bias = self.global_bias
        bias_user = self.bias_user(user)
        bias_item = self.bias_item(item)
        
        output = dot + global_bias + tf.transpose(bias_user)+tf.transpose(bias_item)
        output_star = tf.clip_by_value(output, 1.0, 5.0) # Clip scores to the valid rating range: values below 1 become 1 and values above 5 become 5.
        
        reg_loss = self.reg_strength * (tf.reduce_sum(tf.square(user_emb)) + tf.reduce_sum(tf.square(item_emb))) 
        self.add_loss(reg_loss) # register the regularization term so it can be added to the loss during training

        return output_star
        


#### 5.2 Model Initialization and Setup

In [12]:
# Model Initialization and Setup
user_num = 6040
item_num = 3952
dim  = 15
reg_strength = 0.1

model = SVDModel(user_num, item_num, dim,reg_strength)
loss_object = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1, clipvalue=2.0)
mae_metric = tf.keras.metrics.MeanAbsoluteError()


#### 5.3 Test Model 

In [13]:
# Test Model 
user_input = tf.constant([1.0, 2.0, 3.0,4.0])
item_input =  tf.constant([4.0, 5.0, 6.0,7.0])
output_star = model([user_input, item_input])
output_star

<tf.Tensor: shape=(1, 4), dtype=float32, numpy=array([[1., 1., 1., 1.]], dtype=float32)>

#### 5.4 Hyperparameter Tuning

##### **Note**
We do not run the hyperparameter search in this script because it is time-consuming. Instead, we moved it into a .py file and ran it on an HPC cluster, which gave the following best hyperparameters:

        - dim: 6
        - regularization parameter: 0.1
        - learning rate: 0.1
        - clipvalue: 2.0
        - best_val_loss: 2.1972108
        - batch size: 2000
  


In [None]:
# Hyperparameter Tuning
import searchparameters  # helper module from the project repository (see Section 4)

num_epochs = 10
num_searches = 100
dim_values = list(range(5, 15, 1))
reg_strength_values = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]

learning_rate_values = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
clipvalue_values = list(range(2, 20, 3))

search_processor = searchparameters.SearchPara(num_epochs,num_searches,dim_values,
                 reg_strength_values,learning_rate_values,
                 clipvalue_values,item_num,user_num,train_data,validation_data,batch_size)

best_parameters = search_processor.search()
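
The internals of `searchparameters.SearchPara` are not shown in this document. Since it takes a `num_searches` budget that is much smaller than the full grid, one plausible reading is a random search; the sketch below shows that technique under this assumption. The function `random_search` and the stand-in objective `toy_validation_loss` are hypothetical and do not train any model.

```python
import random

def random_search(search_space, evaluate, num_searches, seed=0):
    """Sample parameter combinations at random and keep the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(num_searches):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Same candidate values as in the tuning cell above.
search_space = {
    "dim": list(range(5, 15)),
    "reg_strength": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    "learning_rate": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    "clipvalue": list(range(2, 20, 3)),
}

def toy_validation_loss(params):
    # Stand-in objective for illustration only; a real run would train the
    # model for a few epochs and return its validation loss.
    return abs(params["dim"] - 6) + abs(params["learning_rate"] - 0.1)

best_params, best_score = random_search(search_space, toy_validation_loss, num_searches=100)
print(best_params, best_score)
```

With 10 x 7 x 7 x 6 = 2940 possible combinations, a budget of 100 trials covers only a small fraction of the grid, so the result depends on the random seed.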

#### 5.5 Train Model

Using the tuned parameters, we train the model with a batch size of 2000 for 500 epochs. During training we record the train loss and validation loss, as well as the train mean absolute error (MAE) and validation MAE, for each epoch. The results are shown in the figures below.


The figures show that the training loss stabilizes around 10.8, indicating a relatively steady error on the training data. The validation loss converges around 0.98, which is relatively low and suggests that the model generalizes well to new data. The training MAE decreases from 1.1 to 0.75, showing that the fit improves over the course of training. The validation MAE decreases from 0.83 to around 0.75, matching the training MAE, which indicates consistency between the training and validation sets.

However, the large gap between training and validation loss may indicate an inconsistency in how the loss is calculated or differences in data distribution, possibly due to anomalies during data collection. On the first point, note that `train_step` reports the MSE plus the regularization term, whereas `validation_step` reports the MSE only, so the two curves are not directly comparable and at least part of the gap is expected. Additionally, the relatively high training loss could suggest that the model is not complex enough to capture all variations in the data. The stabilization of the loss and MAE over the training epochs indicates that, beyond a certain point, the model does not learn significantly from additional training.


|<img src="image/loss.png" alt="Image 1" width="400" height="200"> | <img src="image/mae.png" alt="Image 2" width="400" height="200">|
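
One way to see why the two loss curves sit on different scales: in the training cell of this section, `train_step` returns the MSE plus the regularization term that `SVDModel.call` registers through `add_loss`, while `validation_step` returns the MSE alone. The sketch below works through the arithmetic. The mean squared embedding entry (0.004) and the MSE of 1.0 are assumed values chosen for illustration, not measurements from the trained model.

```python
def regularization_term(reg_strength, batch_size, dim, mean_sq_entry):
    """reg_strength * (sum of squared user entries + sum of squared item entries)."""
    entries_per_side = batch_size * dim          # one embedding row per sample
    return reg_strength * 2 * entries_per_side * mean_sq_entry

mse = 1.0                                         # hypothetical batch MSE
penalty = regularization_term(reg_strength=0.1, batch_size=2000, dim=6,
                              mean_sq_entry=0.004)  # hypothetical magnitude
print(round(penalty, 4), round(mse + penalty, 4))  # 9.6 10.6
```

Because the penalty is summed over the whole batch rather than averaged, even small embedding values can add a term much larger than the MSE itself.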

In [14]:
user_num = 6040
item_num = 3952
dim  = 6
reg_strength = 0.1

model = SVDModel(user_num, item_num, dim,reg_strength)
loss_object = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1, clipvalue=2.0)
mae_metric = tf.keras.metrics.MeanAbsoluteError()

In [None]:
# Train Model
import time
draw_num = 10
num_epochs = 500


iter = []
train_loss,train_mae=[],[]
valid_loss, valid_mae=[],[]

train_iter,validation_iter = [],[]

train_loss_batch_record, train_mae_batch_record=[],[]
valid_loss_batch_record, valid_mae_batch_record=[],[]

# Training and validation steps, compiled into TensorFlow graphs with tf.function
@tf.function
def train_step(users, items, rates):
    ''' 
    Perform a single training step for a recommendation model.
        
    Args:
    - users: TensorFlow tensor representing user data.
    - items: TensorFlow tensor representing item data.
    - rates: TensorFlow tensor representing rating data.

    Returns:
    - total_loss: Total loss for the current training step.
    - mae: Mean Absolute Error (MAE) metric for the current batch.
    
        
    '''

    with tf.device('/GPU:0'):  
        with tf.GradientTape() as tape:
            users= tf.convert_to_tensor(users)
            items = tf.convert_to_tensor(items)
            rates =tf.convert_to_tensor(rates)

            inputs = (users, items)
            output_star = model(inputs, training=True)
            loss = loss_object(rates, output_star)
            total_loss = loss + tf.reduce_sum(model.losses)       

        gradients = tape.gradient(total_loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        mae_metric.update_state(rates, output_star)
        mae = mae_metric.result()

    return total_loss, mae

@tf.function 
def validation_step(users, items, rates):
    """
    Perform a single validation step for a recommendation model.
    
    Args:
    - users: TensorFlow tensor representing user data.
    - items: TensorFlow tensor representing item data.
    - rates: TensorFlow tensor representing rating data.

    Returns:
    - val_loss: Validation loss for the current validation step.
    - val_mae: Mean Absolute Error (MAE) metric for the current batch.
    """
    users= tf.convert_to_tensor(users)
    items = tf.convert_to_tensor(items)
    rates =tf.convert_to_tensor(rates)

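    inputs = (users, items)
    # training=False is sufficient here: no gradients are computed, so the model's
    # trainable flag is left untouched (setting it inside a tf.function would persist after tracing)
    output_star = model(inputs, training=False)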
    val_loss = loss_object(rates, output_star)

    mae_metric.update_state(rates, output_star)
    val_mae = mae_metric.result()

    return val_loss, val_mae

for epoch in range(num_epochs):

    #train data 
    mae_metric.reset_states() 
    train_loss_batch, train_mae_batch= [], [] 
    start = time.time()
    for i, (users, items, rates) in enumerate(train_data):


        loss, mae = train_step(users, items, rates)

        train_loss_batch.append(loss.numpy())
        train_mae_batch.append(mae.numpy())
     

        train_loss_batch_record.append(loss.numpy())
        train_mae_batch_record.append(mae.numpy())

        train_iter.append(i+1)

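        # batch_size is reused here as the number of batches drawn per epoch,
        # because DataSampler never raises StopIteration on its own
        if i+1 >= batch_size: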
            break

    train_loss.append(np.mean(train_loss_batch))
    train_mae.append(np.mean(train_mae_batch))

    
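    # validation data
    mae_metric.reset_states()  # start from a clean state so the validation MAE excludes this epoch's training batches
    valid_loss_batch, valid_mae_batch = [],[]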

    for j, (users, items, rates) in enumerate(validation_data):
        val_loss, val_mae = validation_step(users, items, rates)

        valid_loss_batch.append(val_loss.numpy())
        valid_mae_batch.append(val_mae.numpy())        

        valid_loss_batch_record.append(val_loss.numpy())
        valid_mae_batch_record.append(val_mae.numpy())

        validation_iter.append(j+1)

        # same stopping rule as in training: batch_size batches per epoch
        if j+1 >= batch_size:
            break 
    valid_loss.append(np.mean(valid_loss_batch))
    valid_mae.append(np.mean(valid_mae_batch))

    iter.append(epoch+1)

    if epoch % draw_num == 0:
        end = time.time()
        print(f"Epoch {epoch+1}: Train Loss = {np.mean(train_loss_batch)}, Train MAE = {np.mean(train_mae_batch)},Time Consuming = {round((end - start),3)}(s)")
        print(f"Epoch {epoch+1}: Validation Loss = {np.mean(valid_loss_batch)}, Validation MAE = {np.mean(valid_mae_batch)},Time Consuming = {round((end - start),3)}(s)")

print("Finished training.")
# save model 
tf.saved_model.save(model, '/work3/s230027/CT/result/final_model')


#### 5.6 Model Evaluation

##### **Note**
If you want to run the following code, please download [*'final_model'*](https://github.com/si-tong-chen/Computational-Tool-for-Data-Science)

In [36]:

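from sklearn.metrics import mean_squared_error  # assumed source of mean_squared_error; the import is not shown in the original

loaded_model = tf.saved_model.load('/work3/s230027/CT/result/final_model')

inference = loaded_model.signatures["serving_default"]
# test_data was created with batch_size=-1, so this loop yields a single batch holding the whole test set
for user,item,rate in test_data:
    user = tf.constant(user, dtype=tf.float32)
    item = tf.constant(item, dtype=tf.float32)
    rate = tf.constant(rate, dtype=tf.float32)


# Predict ratings for the test set and compare them with the true ratings
output_star = inference(input_1=user,input_2=item)
rate = tf.reshape(rate, output_star['output_1'].shape)
mse = mean_squared_error(rate, output_star['output_1']) 
print(f"Mean Squared Error (MSE): {mse}")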



Mean Squared Error (MSE): 0.9465680122375488
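
The training cells track MAE while the evaluation above reports MSE, so the two numbers are not on the same scale. The sketch below computes both metrics on a small made-up set of ratings and also converts the reported test MSE into a root mean squared error (RMSE), which is expressed in rating stars. The toy ratings are illustrative assumptions.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up ratings and predictions.
y_true = [5.0, 3.0, 4.0, 1.0]
y_pred = [4.0, 3.5, 2.0, 1.0]
print(mse(y_true, y_pred), mae(y_true, y_pred))  # 1.3125 0.875

# RMSE corresponding to the test MSE reported above.
rmse = math.sqrt(0.9465680122375488)
print(round(rmse, 3))  # 0.973
```

MSE penalizes large errors more heavily than MAE, so the same set of predictions generally gives an MSE-based figure that is less forgiving of occasional large misses.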


### 6 Conclusion

Based on our analysis in Section 5, our movie recommendation system built using the Latent Factor Model (LFM) appears to be successful. The LFM-based system can provide personalized recommendation services.

However, we have identified some shortcomings, such as the high training loss and the lack of significant improvement from additional training. Due to time constraints, we were unable to make further improvements, but we have some ideas for future work.

Our idea is to enhance feature extraction in the LFM by incorporating latent feature vectors corresponding to movie titles and customer movie-watching behavior.

To address issues such as the high training loss and the large difference between training and validation loss, we propose merging the latent feature vectors generated by the LFM with other user and movie attributes, such as user names, gender, age, and occupation, as well as movie genres and duration. These merged attributes would be fed into a Multi-Layer Perceptron (MLP) or another type of neural network to improve the model's performance and generalization capability.
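
As a rough illustration of the proposed extension, the sketch below concatenates a user latent vector, an item latent vector, and a few encoded side attributes, then passes them through a one-hidden-layer MLP. Everything here is hypothetical: the vectors, the attribute encoding, the layer sizes, and the random untrained weights are assumptions made only to show the data flow.

```python
import random

def relu(x):
    return max(0.0, x)

def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: one output per (weight row, bias) pair."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

rng = random.Random(0)
user_latent = [0.1, -0.3, 0.5]     # hypothetical LFM user factors
item_latent = [0.4, 0.2, -0.1]     # hypothetical LFM item factors
side_features = [1.0, 0.0, 0.25]   # hypothetical encoded attributes (e.g. gender, genre flag, scaled age)

# Merge latent factors with side attributes into a single input vector.
features = user_latent + item_latent + side_features

hidden_weights = [[rng.uniform(-0.5, 0.5) for _ in features] for _ in range(4)]
hidden = dense(features, hidden_weights, [0.0] * 4, activation=relu)

output_weights = [[rng.uniform(-0.5, 0.5) for _ in hidden]]
raw_rating = dense(hidden, output_weights, [3.0])[0]
rating = min(5.0, max(1.0, raw_rating))   # keep the prediction on the 1-5 scale
print(len(features), len(hidden), rating)
```

In a real implementation the weights would be learned jointly with, or on top of, the LFM embeddings rather than drawn at random.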