1. Reading the Data

In [3]:
import json

def read_data(filename):
    with open(filename, 'r') as file:
        data = [json.loads(line) for line in file]
    return data

train_data = read_data("goodreads_reviews_historybio_train.json")
# print the first 10 line of the data
print(train_data[:10])


[{'user_id': '26d5737b1eaff71248069cde4f590338', 'book_id': '30109111', 'review_id': 'd567c1be612401ee5cbe3da05683561f', 'rating': 5, 'date_added': 'Sun Jun 12 09:38:16 -0700 2016'}, {'user_id': 'e0a970290631fd711484f0d8155f2a06', 'book_id': '7198269', 'review_id': '1fca74b92a06f2cccdc81c1288687495', 'rating': 5, 'date_added': 'Thu May 10 18:37:25 -0700 2012'}, {'user_id': 'cca945e8a7369eeb035afd21527c339b', 'book_id': '32148570', 'review_id': 'e519d8377a10308742dd49f66f4a728a', 'rating': 3, 'date_added': 'Mon May 29 10:32:28 -0700 2017'}, {'user_id': 'd1789d248a75d3cb7c5f16eeee9fe419', 'book_id': '40024', 'review_id': 'a75d9f435773a377fbe81361a1ea19c6', 'rating': 2, 'date_added': 'Mon Jan 26 21:04:34 -0800 2009'}, {'user_id': '819f2797459b579a7782d4bd595e1c36', 'book_id': '3272163', 'review_id': 'cfbc4e10f33bad3bd235f775e1833b2d', 'rating': 3, 'date_added': 'Wed Jun 25 12:12:10 -0700 2014'}, {'user_id': '7d0b0d563843507c71f867720801d84e', 'book_id': '361056', 'review_id': '41556ee650e

## Task 1 [10 points]: Explore biases
(A) [4 points] The global bg bias

In [4]:
def calculate_global_bias(data):
    total_rating = sum([review['rating'] for review in data])
    return total_rating / len(data)

bg = calculate_global_bias(train_data)
print(f"Global bias bg: {bg}")

Global bias bg: 3.7669762808387413


(B) [3 points] The user specific bias of user id= “3913f3be1e8fadc1de34dc49dab06381”

In [5]:
def calculate_user_bias(data, uid, global_bias):
    user_reviews = [review for review in data if review['user_id'] == uid]
    #calculate the average rating of the user
    average_user_rating = sum([review['rating'] for review in user_reviews])/len(user_reviews)
    return average_user_rating - global_bias
user_id = "3913f3be1e8fadc1de34dc49dab06381"
b_user = calculate_user_bias(train_data, user_id, bg)
print(f"User bias for user {user_id}: {b_user}")

User bias for user 3913f3be1e8fadc1de34dc49dab06381: -0.1139150563489455


(C) [3 points] The item specific bias of book id = “16130”.

In [6]:
def calculate_item_bias(data, book_id, global_bias):
    item_reviews = [review for review in data if review['book_id'] == book_id]
    total_item_bias = sum([review['rating'] for review in item_reviews])/ len(item_reviews)
    return total_item_bias -global_bias if item_reviews else 0

book_id = "16130"
b_item = calculate_item_bias(train_data, book_id, bg)
print(f"Item specific bias for book id {book_id}: {b_item}")

Item specific bias for book id 16130: 0.4562653093753264


### Task 2 [45 points]: Implement the regularized latent factor model without bias using SGD
(A) [30 points] Implement the regularized latent factor model without considering the bias.

In [7]:
import numpy as np

# 1. Initialization
def initialization(k,train_data):
    num_users = len(set([d['user_id'] for d in train_data]))
    num_items = len(set([d['book_id'] for d in train_data]))
    P = np.random.normal(scale=0.01,size=(num_users, k))
    Q = np.random.normal(scale=0.01,size=(num_items, k))
    # Mapping user_ids and book_ids to integer indices for easier array operations
    user_map = {user_id: idx for idx, user_id in enumerate(set([d['user_id'] for d in train_data]))}
    book_map = {book_id: idx for idx, book_id in enumerate(set([d['book_id'] for d in train_data]))}
    return P,Q,user_map,book_map

k = 8
eta = 0.01
lambda1 = lambda2 = 0.3
epochs = 10
P, Q, user_map, book_map = initialization(k,train_data)

# 2. SGD
for epoch in range(epochs):
    np.random.shuffle(train_data)
    for review in train_data:
        i = user_map[review['user_id']]
        j = book_map[review['book_id']]
        r_ij = review['rating']
        e_ij = r_ij - np.dot(Q[j], P[i])

        # Update using gradients
        temp_q = Q[j, :]
        Q[j, :] += 2*eta * (e_ij * P[i, :] - lambda1 * Q[j, :])
        P[i, :] += 2*eta * (e_ij * temp_q - lambda2 * P[i, :])

    # 3. RMSE Calculation
    squared_errors = []
    for review in train_data:
        i = user_map[review['user_id']]
        j = book_map[review['book_id']]
        r_ij = review['rating']
        squared_errors.append((r_ij - np.dot(Q[j], P[i])) ** 2)
    rmse = np.sqrt(sum(squared_errors) / len(squared_errors))
    print(f"Epoch {epoch+1}: RMSE = {rmse}")


Epoch 1: RMSE = 3.971009273832496
Epoch 2: RMSE = 3.7346380322699684
Epoch 3: RMSE = 2.9982514795639115
Epoch 4: RMSE = 2.547222226771712
Epoch 5: RMSE = 2.235647198449414
Epoch 6: RMSE = 2.0094515101989767
Epoch 7: RMSE = 1.8344387330227594
Epoch 8: RMSE = 1.696898743950972
Epoch 9: RMSE = 1.5849721950948181
Epoch 10: RMSE = 1.4910604859689456


(B) [15 points] Use SGD to train the latent factor model on the training data for different values of k in {4,8,16}.

In [8]:
def train_latent_factors(P, Q, user_map, book_map, epochs=10):
    for epoch in range(epochs):
        np.random.shuffle(train_data)
        for review in train_data:
            i = user_map[review['user_id']]
            j = book_map[review['book_id']]
            r_ij = review['rating']
            e_ij = r_ij - np.dot(Q[j], P[i])

            # Update using gradients
            temp_q = Q[j, :]
            Q[j, :] += 2*eta * (e_ij * P[i, :] - lambda1 * Q[j, :])
            P[i, :] += 2*eta * (e_ij * temp_q - lambda2 * P[i, :])

    return P, Q

def compute_rmse(data, P, Q, user_map, book_map):
    squared_errors = []
    for review in data:
        if review['user_id'] not in user_map or review['book_id'] not in book_map:
            continue  # Skip this review if user or book not found in the maps
        i = user_map[review['user_id']]
        j = book_map[review['book_id']]
        r_ij = review['rating']
        squared_errors.append((r_ij - np.dot(Q[j], P[i])) ** 2)
    rmse = np.sqrt(sum(squared_errors) / len(squared_errors))
    return rmse

# Using the functions
best_rmse = float('inf') # set it to positive infinity
best_k = None
best_P = None
best_Q = None

for k in [4, 8, 16]:
    P, Q, user_map, book_map = initialization(k, train_data)
    P, Q = train_latent_factors(P, Q, user_map, book_map)
    validation_rmse = compute_rmse(train_data, P, Q, user_map, book_map)
    print(f"k = {k}, Validation RMSE: {validation_rmse:.4f}")

    if validation_rmse < best_rmse:
        best_rmse = validation_rmse
        best_k = k
        best_P = P
        best_Q = Q
test_data = read_data("goodreads_reviews_historybio_test.json")
test_rmse = compute_rmse(test_data, best_P, best_Q, user_map, book_map)
print(f"Best k = {best_k}, Test RMSE: {test_rmse:.4f}")



k = 4, Validation RMSE: 1.5069
k = 8, Validation RMSE: 1.4869
k = 16, Validation RMSE: 1.4745
Best k = 16, Test RMSE: 1.5992


### Task 3 [45 points]: Implement the regularized latent factor model with bias using SGD
(A) [30 points] Incorporate the bias terms bg, b(user) and b(item) to the latent factor model. ij

In [9]:
lambda1 = lambda2 = lambda3 =lambda4 =0.3
user_id = "3913f3be1e8fadc1de34dc49dab06381"
book_id = "16130"
bg = calculate_global_bias(train_data)
b_user = calculate_user_bias(train_data, user_id, bg)
b_item = calculate_item_bias(train_data, book_id, bg)

def train_latent_factors_with_bias(lambda4, lambda3, lambda2, lambda1, bg, b_user, b_item, data):
    np.random.shuffle(data)
    for review in data:
        i = user_map[review['user_id']]
        j = book_map[review['book_id']]
        r_ij_actual = review['rating']
        r_ij_predicted = bg + b_user + b_item + np.dot(Q[j], P[i])
        e_ij = r_ij_actual - r_ij_predicted

        # Update using gradients
        temp_q = Q[j, :]
        Q[j, :] += 2*eta * (e_ij * P[i, :] - lambda1 * Q[j, :])
        P[i, :] += 2*eta * (e_ij * temp_q - lambda2 * P[i, :])

        # Update biases
        b_user += 2*eta * (e_ij - lambda3 * b_user)
        b_item += 2*eta * (e_ij - lambda4 * b_item)
    return P, Q, b_user, b_item
def compute_rmse(P,Q, bg, b_user, b_item, data):
    squared_errors = []
    for review in train_data:
        i = user_map[review['user_id']]
        j = book_map[review['book_id']]
        r_ij = review['rating']
        squared_errors.append((r_ij - (bg + b_user + b_item + np.dot(Q[j], P[i]))) ** 2)
    rmse = np.sqrt(sum(squared_errors) / len(squared_errors))
    return rmse

for epoch in range(epochs):
    P, Q, b_user, b_item = train_latent_factors_with_bias(lambda4, lambda3, lambda2, lambda1, bg, b_user, b_item,train_data)

    # 3. RMSE Calculation
    # squared_errors = []
    # for review in train_data:
    #     i = user_map[review['user_id']]
    #     j = book_map[review['book_id']]
    #     r_ij = review['rating']
    #     squared_errors.append((r_ij - (bg + b_user + b_item + np.dot(Q[j], P[i]))) ** 2)
    # rmse = np.sqrt(sum(squared_errors) / len(squared_errors))
    rmse = compute_rmse(P,Q, bg, b_user, b_item, train_data)
    print(f"Epoch {epoch+1}: RMSE = {rmse}")
# After finishing all epoches, report the learned user-specific bias of the user with user id= “3913f3be1e8fadc1de34dc49dab06381” , and the learned item- specific bias of the book with book id = “16130”.

print(f"User bias for user {user_id}: {b_user}")
print(f"Item specific bias for book id {book_id}: {b_item}")

Epoch 1: RMSE = 1.1096063055209227
Epoch 2: RMSE = 1.0664791285164892
Epoch 3: RMSE = 1.0671081198948262
Epoch 4: RMSE = 1.0499299164387785
Epoch 5: RMSE = 1.0449913828986686
Epoch 6: RMSE = 1.0431716040150023
Epoch 7: RMSE = 1.0415781334173932
Epoch 8: RMSE = 1.0331426353178252
Epoch 9: RMSE = 1.0450186464762201
Epoch 10: RMSE = 1.032782589963173
User bias for user 3913f3be1e8fadc1de34dc49dab06381: -0.26721905308528415
Item specific bias for book id 16130: -0.26721905308528415


(B) [15 points] Similar to Task 2 (B), find the best k in {4, 8, 16} for the model you developed in Task 3 (A) on the validation set, by using RMSE to compare across these models, and apply the best of these models to the test data. Compare the resulting test RMSE with Task 2 (B). Analyse and explain your findings.

In [12]:
import json

def read_data(filename):
    with open(filename, 'r') as file:
        data = [json.loads(line) for line in file]
    return data

best_k = None
best_rmse = float('inf')
best_P = None
best_Q = None
validation_data = read_data("goodreads_reviews_historybio_val.json")
test_data = read_data("goodreads_reviews_historybio_test.json")

for k in [4, 8, 16]:
    P, Q, user_map, book_map = initialization(k, train_data)
    
    # Use the above train function to train the model
    P, Q, b_user, b_item = train_latent_factors_with_bias(lambda4, lambda3, lambda2, lambda1, bg, b_user, b_item,train_data)
    
    # Calculate RMSE on the validation data
    validation_rmse = compute_rmse(P,Q, bg, b_user, b_item,validation_data)
    
    # Update best RMSE and k
    if validation_rmse < best_rmse:
        best_rmse = validation_rmse
        best_k = k
        best_P = P
        best_Q = Q

# Calculate RMSE on the test data with the best k
test_rmse = compute_rmse(best_P, best_Q, bg, b_user, b_item, test_data)

print(f"Best k for Task 3 is {best_k}, with a test RMSE of {test_rmse}")


Best k for Task 3 is 4, with a test RMSE of 1.2667078269360939


**Findings:**

1. **Lower RMSE in Task 3(B) compared to Task 2(B):**
    - The inclusion of bias terms in the latent factor model (Task 3(B)) led to a better fit to the data, resulting in a lower RMSE. This suggests that incorporating user and item-specific biases can capture inherent characteristics in the data, enhancing prediction accuracy.

2. **Higher optimal \( k \) value in Task 2(B) than in Task 3(B):**
    - The model in Task 3(B) required a smaller number of latent factors \( k \) to achieve optimal performance, likely because the added bias terms already captured some intrinsic properties of users and items. This indicates that the bias-enhanced model can achieve comparable or better performance using fewer latent factors.