# Movie Recommendation System

This project builds a movie recommendation system using Graph Neural Networks (GNNs), with the MovieLens dataset as the primary data source. The notebook walks through data preprocessing, a train-test split, and three models (GCN, GraphSAGE, and LightGCN) that predict ratings as edge weights on a user-movie graph.

# Setup

In [1]:
import torch_geometric
import torch

In [2]:
torch_geometric.__version__

'2.5.2'

In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

# Data Preprocessing:
We begin by converting the MovieLens dataset into a graph representation. In this graph, nodes represent users and movies, while edges signify user-movie interactions, such as ratings. This structured format allows us to effectively model the relationships between users and movies.
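
As a minimal sketch of this representation, the snippet below turns a few (user, movie, rating) triples into adjacency lists for the two node types. The first three triples are rows from the head of the ratings file; the fourth triple and the `build_bipartite_adjacency` helper are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def build_bipartite_adjacency(ratings):
    """Map each user node to its rated movies and each movie node to its raters."""
    user_to_movies = defaultdict(dict)
    movie_to_users = defaultdict(dict)
    for user_id, movie_id, rating in ratings:
        # One rating is one edge; the rating value acts as the edge weight
        user_to_movies[user_id][movie_id] = rating
        movie_to_users[movie_id][user_id] = rating
    return dict(user_to_movies), dict(movie_to_users)

ratings = [
    (196, 242, 3),
    (186, 302, 3),
    (22, 377, 1),
    (196, 302, 4),  # hypothetical extra rating so that two users share a movie
]
user_to_movies, movie_to_users = build_bipartite_adjacency(ratings)
print(user_to_movies[196])   # movies rated by user 196
print(movie_to_users[302])   # users who rated movie 302
```

Users are only ever connected to movies, never to other users, which is what makes the graph bipartite.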

# Read from CSV

In [4]:
import pandas as pd

In [5]:
columns_name=['userId','movieId','rating','timestamp']
df = pd.read_csv("u.data",sep='\t',names = columns_name)
print(len(df))
display(df.head(5))

100000


Unnamed: 0,userId,movieId,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [6]:
# How many ratings are a 3 or above?
df_higher_than_3 = df[df['rating']>=3]
# df_higher_than_3 = df
print(len(df_higher_than_3))

82520


In [7]:
# What's the distribution of ratings across the full dataset?
print("Rating Distribution")
df.groupby(['rating'])['rating'].count()

Rating Distribution


rating
1     6110
2    11370
3    27145
4    34174
5    21201
Name: rating, dtype: int64

# Prepare the test sets

## Perform an 80/20 train-test split on the interactions in the dataset

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
train, test = train_test_split(df_higher_than_3.values, test_size=0.2, random_state=42)
train_df = pd.DataFrame(train, columns=df_higher_than_3.columns)
test_df = pd.DataFrame(test, columns=df_higher_than_3.columns)

In [10]:
train_df

Unnamed: 0,userId,movieId,rating,timestamp
0,109,157,4,880577961
1,436,553,3,887769777
2,864,562,4,888891794
3,690,197,4,881180427
4,668,283,5,881605324
...,...,...,...,...
66011,321,485,4,879439787
66012,804,981,3,879444077
66013,812,300,5,877625461
66014,188,443,4,875074329


In [11]:
print(f"Train Size  : {len(train_df)}")
print(f"Test Size   : {len(test_df)}")

Train Size  : 66016
Test Size   : 16504
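
The counts reported so far can be cross-checked with a little arithmetic. The sketch below copies the rating distribution printed earlier and recomputes the filtered size and the split sizes. It assumes that scikit-learn rounds the test count up when `test_size` is a fraction and gives the remainder to the training set.

```python
import math

# Rating counts copied from the distribution printed earlier in the notebook
rating_counts = {1: 6110, 2: 11370, 3: 27145, 4: 34174, 5: 21201}

total = sum(rating_counts.values())
kept = sum(count for rating, count in rating_counts.items() if rating >= 3)

# Assumption: a fractional test_size is rounded up to a whole number of rows
n_test = math.ceil(0.2 * kept)
n_train = kept - n_test

print(total, kept, n_train, n_test)
```

Under that assumption the numbers agree with the notebook: 100000 ratings in total, 82520 with a rating of 3 or above, and a 66016/16504 split.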


## Label Encoding

User and movie IDs are label-encoded into consecutive integers starting from 0 because:

- Embedding layers and GNNs look up users and items by integer position.
- PyTorch Geometric (PyG) uses these integer IDs to construct the interaction graph for the recommender system.
- Fitting the encoders on the training set and reusing them on the test set keeps the mapping of users and items consistent between the two.

Most GNN libraries, PyG included, expect nodes to be indexed as 0, 1, 2, ..., N-1, because these indices are used to reference node embeddings and for efficient tensor indexing. Raw user and movie IDs can be arbitrary (e.g., 1001, 5678) and are not guaranteed to be consecutive.
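
To make the mapping concrete, here is a small pure-Python sketch of what a label encoder does: distinct IDs are sorted and numbered from 0. The helper names `fit_label_encoder` and `transform` and the raw IDs are hypothetical; the notebook itself relies on scikit-learn's `LabelEncoder`.

```python
def fit_label_encoder(values):
    """Return a dict mapping each distinct value to an index in 0..N-1 (sorted order)."""
    return {value: index for index, value in enumerate(sorted(set(values)))}

def transform(mapping, values):
    """Encode values, raising ValueError for IDs that were not seen during fitting."""
    unseen = [value for value in values if value not in mapping]
    if unseen:
        raise ValueError(f"unseen labels: {unseen}")
    return [mapping[value] for value in values]

train_movie_ids = [5678, 1001, 5678, 42]  # hypothetical raw IDs
movie_index = fit_label_encoder(train_movie_ids)
encoded = transform(movie_index, train_movie_ids)
print(movie_index)  # {42: 0, 1001: 1, 5678: 2}
print(encoded)      # [2, 1, 2, 0]
```

An ID that was never seen during fitting has no index, so `transform` fails on it. This is why test rows with unknown users or movies have to be removed before the encoders are applied to the test set.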

In [12]:
from sklearn import preprocessing as pp

In [13]:
le_user = pp.LabelEncoder()
le_item = pp.LabelEncoder()
train_df['userId_index'] = le_user.fit_transform(train_df['userId'].values)
train_df['movieId_index'] = le_item.fit_transform(train_df['movieId'].values)

In [14]:
train_user_ids = train_df['userId'].unique()
train_item_ids = train_df['movieId'].unique()

print(len(train_user_ids), len(train_item_ids))


943 1547


Filter out test interactions whose user or movie does not appear in the training set

In [15]:
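test_df = test_df[
    test_df['userId'].isin(train_user_ids)
    & test_df['movieId'].isin(train_item_ids)
].copy()  # copy so that columns added later do not write into a view
# Note: `test` is the unfiltered split array; len(test_df) gives the size after filtering
print(len(test))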

16504


In [16]:
test_df['userId_index'] = le_user.transform(test_df['userId'].values)
test_df['movieId_index'] = le_item.transform(test_df['movieId'].values)

In [17]:
n_users = train_df['userId_index'].nunique()
n_items = train_df['movieId_index'].nunique()
print("Number of Unique Users : ", n_users)
print("Number of unique Items : ", n_items)

Number of Unique Users :  943
Number of unique Items :  1547


In [18]:
train_df

Unnamed: 0,userId,movieId,rating,timestamp,userId_index,movieId_index
0,109,157,4,880577961,108,156
1,436,553,3,887769777,435,547
2,864,562,4,888891794,863,556
3,690,197,4,881180427,689,196
4,668,283,5,881605324,667,282
...,...,...,...,...,...,...
66011,321,485,4,879439787,320,479
66012,804,981,3,879444077,803,964
66013,812,300,5,877625461,811,299
66014,188,443,4,875074329,187,437


In [19]:
test_df

Unnamed: 0,userId,movieId,rating,timestamp,userId_index,movieId_index
0,796,732,5,893047241,795,724
1,381,216,5,892695996,380,215
2,178,97,5,882827020,177,96
3,541,15,3,883864806,540,14
4,75,222,5,884050194,74,221
...,...,...,...,...,...,...
16499,445,1129,4,891199994,444,1110
16500,301,610,3,882077176,300,602
16501,795,433,4,880588141,794,431
16502,665,343,3,884292654,664,341


## Data Preparation

In [20]:
import numpy as np

In [21]:
import torch
from torch_geometric.data import Data

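# Create edge_index and edge_attr from the training DataFrame (train_df).
# Caveat: user and movie indices both start at 0, so user i and movie i share
# the same node id here, even though num_nodes below is n_users + n_items.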
edge_index = torch.tensor(
    np.stack([train_df['userId_index'].values, train_df['movieId_index'].values]),
    dtype=torch.long
).to(device)

edge_attr = torch.tensor(train_df['rating'].values, dtype=torch.float).to(device)

# Define the number of users and items
n_users = train_df['userId_index'].nunique()
n_items = train_df['movieId_index'].nunique()
 
# Create the PyG Data object
data = Data(edge_index=edge_index, edge_attr=edge_attr, num_nodes = n_users+n_items)
num_nodes = n_users+n_items

In [22]:
data

Data(edge_index=[2, 66016], edge_attr=[66016], num_nodes=2490)
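
The `Data` object reports 2490 nodes (943 users plus 1547 movies), but the user and movie indices stored in `edge_index` both start at 0, so the two node types overlap in one index range. One possible way to give every movie its own node id is to shift movie indices by `n_users` and to add reverse edges so that messages can flow in both directions. The sketch below illustrates this with NumPy on a hypothetical toy graph; `build_undirected_edge_index` is an illustrative helper, not the construction behind the results reported in this notebook.

```python
import numpy as np

def build_undirected_edge_index(user_idx, movie_idx, n_users):
    """Offset movie indices by n_users and add a reverse edge for every rating."""
    user_nodes = np.asarray(user_idx, dtype=np.int64)
    movie_nodes = np.asarray(movie_idx, dtype=np.int64) + n_users
    forward = np.stack([user_nodes, movie_nodes])    # user -> movie
    backward = np.stack([movie_nodes, user_nodes])   # movie -> user
    return np.concatenate([forward, backward], axis=1)

# Hypothetical toy graph: 3 users, 2 movies, 3 ratings
edge_index_toy = build_undirected_edge_index([0, 1, 2], [0, 0, 1], n_users=3)
print(edge_index_toy)
```

With this layout, users occupy node ids 0 to `n_users - 1` and movies occupy `n_users` to `n_users + n_items - 1`. Edge attributes would have to be duplicated for the reverse edges as well.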

In [23]:
special = []
for i in range(n_users):
    if (test_df[test_df["userId_index"]==i].empty):
        special.append(i)

In [24]:
special

[26, 240, 417, 684, 764, 872]

In [25]:
test_df[test_df["userId_index"]==91]

Unnamed: 0,userId,movieId,rating,timestamp,userId_index,movieId_index
571,92,55,3,875654245,91,54
916,92,845,3,886442565,91,833
1080,92,204,4,875653913,91,203
1775,92,356,3,875813171,91,354
1989,92,237,4,875640318,91,236
2097,92,1073,5,875653271,91,1055
2999,92,156,4,875656086,91,155
3605,92,591,4,875640294,91,585
3632,92,636,3,875812064,91,628
4073,92,230,3,875656055,91,229


In [26]:
def movie_watched(userID_index:int):
    return test_df[test_df["userId_index"]==userID_index]["movieId_index"].tolist()

movie = []  # Per-user lists of test movies whose ratings are to be predicted
for i in range(n_users):
    movie.append(movie_watched(i))



In [27]:
def ground_truth_rating(userID_index:int):
    return test_df[test_df["userId_index"]==userID_index]["rating"].tolist()

ground_truth = [] # Ground Truth Rating
for i in range(0,n_users):
    ground_truth.append(ground_truth_rating(i))

## GCN

In [28]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.utils import negative_sampling
from torch_geometric.data import Data

In [29]:
class GCNLinkPredictor(torch.nn.Module):
    def __init__(self, num_nodes, embedding_dim, hidden_channels):
        super(GCNLinkPredictor, self).__init__()
        # Initialize node embeddings
        self.embeddings = torch.nn.Embedding(num_nodes, embedding_dim)
        self.conv1 = GCNConv(embedding_dim, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)

    def forward(self, edge_index):
        # Get node embeddings
        x = self.embeddings.weight
        # GCN layers
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return x

    def decode(self, z, edge_index):
        # Use dot product to predict edge weights (ratings)
        return (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)

# Initialize the model, embeddings, and optimizer
embedding_dim = 16
hidden_channels = 8
model = GCNLinkPredictor(num_nodes=num_nodes, embedding_dim=embedding_dim, hidden_channels=hidden_channels)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Helper function to get positive and negative edges
def get_train_edges(data):
    pos_edge_index = data.edge_index
    neg_edge_index = negative_sampling(
        edge_index=data.edge_index, num_nodes=data.num_nodes, num_neg_samples=pos_edge_index.size(1)
    )
    return pos_edge_index, neg_edge_index

# Training loop using Mean Squared Error (MSE) loss
def train(data, model, optimizer):
    model.train()
    optimizer.zero_grad()

    # Forward pass
    z = model(data.edge_index)

    # Get positive and negative edges
    pos_edge_index, neg_edge_index = get_train_edges(data)

    # Decode predictions for edge weights
    pos_pred = model.decode(z, pos_edge_index)
    neg_pred = model.decode(z, neg_edge_index)

    # Use MSE loss for regression on edge weights
    pos_loss = F.mse_loss(pos_pred, data.edge_attr[:pos_edge_index.size(1)])
    # zeros_like keeps the target on the same device and dtype as the predictions
    neg_loss = F.mse_loss(neg_pred, torch.zeros_like(neg_pred))
    loss = pos_loss + neg_loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Train the model
for epoch in range(200):
    loss = train(data, model, optimizer)
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Make predictions on unseen edges only


Epoch 0, Loss: 15.6612
Epoch 20, Loss: 5.2036
Epoch 40, Loss: 4.1604
Epoch 60, Loss: 3.7967
Epoch 80, Loss: 3.5580
Epoch 100, Loss: 3.4046
Epoch 120, Loss: 3.2619
Epoch 140, Loss: 3.1815
Epoch 160, Loss: 3.1174
Epoch 180, Loss: 3.0590
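
The training objective combines two terms: observed edges are regressed toward their rating, and sampled non-edges are regressed toward 0, with the dot product of the two node embeddings as the prediction. The following worked example spells this out in plain Python for one positive and one negative edge. The embeddings and the rating are hypothetical values chosen to keep the arithmetic easy to follow.

```python
def dot(a, b):
    """Dot-product decoder: the predicted edge weight for two node embeddings."""
    return sum(x * y for x, y in zip(a, b))

def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# Hypothetical 2-dimensional embeddings after the GNN layers
z = {"user": [1.0, 2.0], "rated_movie": [2.0, 1.0], "unrated_movie": [0.5, -0.5]}

pos_pred = [dot(z["user"], z["rated_movie"])]     # 1*2 + 2*1 = 4.0
neg_pred = [dot(z["user"], z["unrated_movie"])]   # 0.5 - 1.0 = -0.5

pos_loss = mse(pos_pred, [5.0])   # observed rating of 5: (4 - 5)^2 = 1.0
neg_loss = mse(neg_pred, [0.0])   # sampled non-edge, target 0: 0.25
loss = pos_loss + neg_loss
print(loss)  # 1.25
```

Because the printed training loss is the sum of both terms, it is not directly comparable to an MSE computed on rated test edges only.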


In [32]:
rating_predicted = []

# Compute node embeddings and the set of training edges once, outside the per-user loop
model.eval()
with torch.no_grad():
    z = model(data.edge_index)
existing_edges = set(zip(edge_index[0].tolist(), edge_index[1].tolist()))

for user in range(n_users):
    if user in special:  # users without any test interactions
        continue

    # Candidate test edges: this user paired with each of their test movies
    candidates = [(user, m) for m in movie[user]]

    # Keep only edges that do not appear in the training graph
    unseen = [edge for edge in candidates if edge not in existing_edges]
    if not unseen:
        print("No unseen edges found.")
        continue
    unseen_edges = torch.tensor(unseen, dtype=torch.long).t()

    # Predict ratings (edge weights) for unseen edges
    with torch.no_grad():
        pred_ratings = model.decode(z, unseen_edges)
    rating_predicted.append(pred_ratings)
    # Display predicted ratings for unseen edges
    for k, (u, v) in enumerate(unseen_edges.t()):
        print(f"Unseen Edge: ({u.item()}, {v.item()}), Predicted Rating: {pred_ratings[k].item():.2f}")

Unseen Edge: (0, 161), Predicted Rating: 3.75
Unseen Edge: (0, 226), Predicted Rating: 4.11
Unseen Edge: (0, 176), Predicted Rating: 4.08
Unseen Edge: (0, 87), Predicted Rating: 4.58
Unseen Edge: (0, 178), Predicted Rating: 4.60
Unseen Edge: (0, 222), Predicted Rating: 4.31
Unseen Edge: (0, 12), Predicted Rating: 4.48
Unseen Edge: (0, 151), Predicted Rating: 3.89
Unseen Edge: (0, 4), Predicted Rating: 4.09
Unseen Edge: (0, 169), Predicted Rating: 4.74
Unseen Edge: (0, 99), Predicted Rating: 5.54
Unseen Edge: (0, 191), Predicted Rating: 4.22
Unseen Edge: (0, 23), Predicted Rating: 4.32
Unseen Edge: (0, 193), Predicted Rating: 4.67
Unseen Edge: (0, 249), Predicted Rating: 4.42
Unseen Edge: (0, 231), Predicted Rating: 3.61
Unseen Edge: (0, 216), Predicted Rating: 4.03
Unseen Edge: (0, 71), Predicted Rating: 4.37
Unseen Edge: (0, 70), Predicted Rating: 4.50
Unseen Edge: (0, 136), Predicted Rating: 4.39
Unseen Edge: (0, 37), Predicted Rating: 4.16
Unseen Edge: (0, 206), Predicted Rating: 3.

In [33]:
movie = []  # Per-user lists of test movies whose ratings are to be predicted
for i in range(n_users):
    if i in special:
        continue
    movie.append(movie_watched(i))

In [34]:
ground_truth = [] # Ground Truth Rating
for i in range(0,n_users):
    if i in special:
        continue
    ground_truth.append(ground_truth_rating(i))

In [35]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
ground_truth_flat = [rating for sublist in ground_truth for rating in sublist]
rating_predicted_flat = torch.cat(rating_predicted).tolist()

In [36]:
mean_squared_error(ground_truth_flat, rating_predicted_flat)

1.246371067606549

In [37]:
r2_score(ground_truth_flat, rating_predicted_flat)

-1.1566854613828634

In [38]:
mean_absolute_error(ground_truth_flat, rating_predicted_flat)

0.889264211000151
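
A negative R² can look surprising next to a moderate MSE. R² compares the model's squared error with that of a constant predictor that always outputs the mean of the true ratings, so it drops below 0 whenever the model does worse than that baseline. The sketch below computes the three metrics by hand on hypothetical ratings in the 3 to 5 range; `regression_metrics` is an illustrative helper.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) computed from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    variance = sum((t - mean_true) ** 2 for t in y_true) / n
    r2 = 1.0 - mse / variance
    return mse, mae, r2

# Hypothetical ratings, all between 3 and 5 like the filtered test set
y_true = [3, 4, 4, 5]
y_pred = [5.0, 3.0, 5.0, 3.0]
mse, mae, r2 = regression_metrics(y_true, y_pred)
print(mse, mae, r2)  # 2.5 1.5 -4.0

# Baseline: always predict the mean rating (4.0)
baseline_mse, _, baseline_r2 = regression_metrics(y_true, [4.0] * 4)
print(baseline_mse, baseline_r2)  # 0.5 0.0
```

Since only ratings of 3 and above are kept, the target variance is small, and a model has to be quite precise before its R² turns positive.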

In [39]:
print(data.is_cuda)

False


## GraphSAGE

In [40]:
def movie_watched(userID_index:int):
    return test_df[test_df["userId_index"]==userID_index]["movieId_index"].tolist()

movie = []  # Per-user lists of test movies whose ratings are to be predicted
for i in range(n_users):
    movie.append(movie_watched(i))


def ground_truth_rating(userID_index:int):
    return test_df[test_df["userId_index"]==userID_index]["rating"].tolist()

ground_truth = [] # Ground Truth Rating
for i in range(0,n_users):
    ground_truth.append(ground_truth_rating(i))


In [41]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
from torch_geometric.utils import negative_sampling

class GraphSAGELinkPredictor(torch.nn.Module):
    def __init__(self, num_nodes, embedding_dim, hidden_channels):
        super(GraphSAGELinkPredictor, self).__init__()
        # Initialize node embeddings
        self.embeddings = torch.nn.Embedding(num_nodes, embedding_dim)
        self.conv1 = SAGEConv(embedding_dim, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, hidden_channels)

    def forward(self, edge_index):
        # Get node embeddings
        x = self.embeddings.weight
        # GraphSAGE layers
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return x

    def decode(self, z, edge_index):
        # Use dot product to predict edge weights (ratings)
        return (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)

# Initialize the model, embeddings, and optimizer
embedding_dim = 16
hidden_channels = 8
model = GraphSAGELinkPredictor(num_nodes=num_nodes, embedding_dim=embedding_dim, hidden_channels=hidden_channels)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Helper function to get positive and negative edges
def get_train_edges(data):
    pos_edge_index = data.edge_index
    neg_edge_index = negative_sampling(
        edge_index=data.edge_index, num_nodes=data.num_nodes, num_neg_samples=pos_edge_index.size(1)
    )
    return pos_edge_index, neg_edge_index

# Training loop using Mean Squared Error (MSE) loss
def train(data, model, optimizer):
    model.train()
    optimizer.zero_grad()

    # Forward pass
    z = model(data.edge_index)

    # Get positive and negative edges
    pos_edge_index, neg_edge_index = get_train_edges(data)

    # Decode predictions for edge weights
    pos_pred = model.decode(z, pos_edge_index)
    neg_pred = model.decode(z, neg_edge_index)

    # Use MSE loss for regression on edge weights
    pos_loss = F.mse_loss(pos_pred, data.edge_attr[:pos_edge_index.size(1)])
    # zeros_like keeps the target on the same device and dtype as the predictions
    neg_loss = F.mse_loss(neg_pred, torch.zeros_like(neg_pred))
    loss = pos_loss + neg_loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Train the model
for epoch in range(200):
    loss = train(data, model, optimizer)
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")


Epoch 0, Loss: 10.9482
Epoch 20, Loss: 4.4388
Epoch 40, Loss: 3.6440
Epoch 60, Loss: 3.1320
Epoch 80, Loss: 2.8587
Epoch 100, Loss: 2.7226
Epoch 120, Loss: 2.6175
Epoch 140, Loss: 2.5765
Epoch 160, Loss: 2.5645
Epoch 180, Loss: 2.5916
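
The `get_train_edges` helper passes `num_nodes=data.num_nodes` to `negative_sampling`, which draws node pairs from the whole node range, so a sampled pair is not guaranteed to connect a user with a movie. A sketch of a sampler that only proposes (user, item) pairs is shown below in plain Python. The function name, the `seed` parameter, and the toy edges are hypothetical.

```python
import random

def sample_bipartite_negatives(pos_edges, n_users, n_items, num_samples, seed=0):
    """Sample distinct (user, item) pairs that are not among the observed edges."""
    observed = set(pos_edges)
    if num_samples > n_users * n_items - len(observed):
        raise ValueError("not enough unobserved pairs to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < num_samples:
        pair = (rng.randrange(n_users), rng.randrange(n_items))
        if pair not in observed:  # reject pairs that are real interactions
            negatives.add(pair)
    return sorted(negatives)

# Hypothetical toy data: 3 users, 2 items, 4 observed ratings
pos_edges = [(0, 0), (0, 1), (1, 1), (2, 0)]
negatives = sample_bipartite_negatives(pos_edges, n_users=3, n_items=2, num_samples=2)
print(negatives)  # [(1, 0), (2, 1)]
```

Rejection sampling like this is cheap when the graph is sparse, as it is here, because most random pairs are unobserved.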


In [42]:
rating_predicted = []

# Compute node embeddings and the set of training edges once, outside the per-user loop
model.eval()
with torch.no_grad():
    z = model(data.edge_index)
existing_edges = set(zip(edge_index[0].tolist(), edge_index[1].tolist()))

for user in range(n_users):
    if user in special:  # users without any test interactions
        continue

    # Candidate test edges: this user paired with each of their test movies
    candidates = [(user, m) for m in movie[user]]

    # Keep only edges that do not appear in the training graph
    unseen = [edge for edge in candidates if edge not in existing_edges]
    if not unseen:
        print("No unseen edges found.")
        continue
    unseen_edges = torch.tensor(unseen, dtype=torch.long).t()

    # Predict ratings (edge weights) for unseen edges
    with torch.no_grad():
        pred_ratings = model.decode(z, unseen_edges)
    rating_predicted.append(pred_ratings)
    # Display predicted ratings for unseen edges
    for k, (u, v) in enumerate(unseen_edges.t()):
        print(f"Unseen Edge: ({u.item()}, {v.item()}), Predicted Rating: {pred_ratings[k].item():.2f}")

Unseen Edge: (0, 161), Predicted Rating: 3.78
Unseen Edge: (0, 226), Predicted Rating: 3.76
Unseen Edge: (0, 176), Predicted Rating: 4.05
Unseen Edge: (0, 87), Predicted Rating: 4.02
Unseen Edge: (0, 178), Predicted Rating: 4.28
Unseen Edge: (0, 222), Predicted Rating: 4.19
Unseen Edge: (0, 12), Predicted Rating: 4.44
Unseen Edge: (0, 151), Predicted Rating: 4.43
Unseen Edge: (0, 4), Predicted Rating: 3.80
Unseen Edge: (0, 169), Predicted Rating: 4.04
Unseen Edge: (0, 99), Predicted Rating: 4.82
Unseen Edge: (0, 191), Predicted Rating: 4.11
Unseen Edge: (0, 23), Predicted Rating: 4.12
Unseen Edge: (0, 193), Predicted Rating: 4.18
Unseen Edge: (0, 249), Predicted Rating: 4.24
Unseen Edge: (0, 231), Predicted Rating: 3.70
Unseen Edge: (0, 216), Predicted Rating: 3.79
Unseen Edge: (0, 71), Predicted Rating: 3.88
Unseen Edge: (0, 70), Predicted Rating: 4.00
Unseen Edge: (0, 136), Predicted Rating: 4.39
Unseen Edge: (0, 37), Predicted Rating: 4.07
Unseen Edge: (0, 206), Predicted Rating: 3.

In [43]:
movie = []  # Per-user lists of test movies whose ratings are to be predicted
for i in range(n_users):
    if i in special:
        continue
    movie.append(movie_watched(i))

In [44]:
ground_truth = [] # Ground Truth Rating
for i in range(0,n_users):
    if i in special:
        continue
    ground_truth.append(ground_truth_rating(i))

In [45]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
ground_truth_flat = [rating for sublist in ground_truth for rating in sublist]
rating_predicted_flat = torch.cat(rating_predicted).tolist()

In [46]:
mean_squared_error(ground_truth_flat, rating_predicted_flat)

1.1780590104434243

In [47]:
r2_score(ground_truth_flat, rating_predicted_flat)

-1.0384801978382079

In [48]:
mean_absolute_error(ground_truth_flat, rating_predicted_flat)

0.8357644481780635

## LightGCN

In [49]:
def movie_watched(userID_index:int):
    return test_df[test_df["userId_index"]==userID_index]["movieId_index"].tolist()

movie = []  # Per-user lists of test movies whose ratings are to be predicted
for i in range(n_users):
    movie.append(movie_watched(i))


def ground_truth_rating(userID_index:int):
    return test_df[test_df["userId_index"]==userID_index]["rating"].tolist()

ground_truth = [] # Ground Truth Rating
for i in range(0,n_users):
    ground_truth.append(ground_truth_rating(i))


In [51]:
import torch
import torch.nn.functional as F
from torch_geometric.utils import negative_sampling

class LightGCNLinkPredictor(torch.nn.Module):
    def __init__(self, num_nodes, embedding_dim, num_layers):
        super(LightGCNLinkPredictor, self).__init__()
        # Initialize node embeddings
        self.embeddings = torch.nn.Embedding(num_nodes, embedding_dim)
        self.num_layers = num_layers
        self.reset_parameters()

    def reset_parameters(self):
        torch.nn.init.xavier_uniform_(self.embeddings.weight)

    def forward(self, edge_index):
        # Get initial embeddings
        x = self.embeddings.weight
        # Aggregate embeddings over multiple layers
        embeddings = x
        for _ in range(self.num_layers):
            # Propagate embeddings through neighbors without activation
            x = self.propagate_embeddings(x, edge_index)
            embeddings = embeddings + x
        # Average the embeddings across all layers
        return embeddings / (self.num_layers + 1)

    def propagate_embeddings(self, x, edge_index):
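        # Symmetric degree normalization (LightGCN aggregation).
        # Note: degrees are counted from source indices (row) only, so a node that
        # never appears as a source gets a normalization weight of 0.
        row, col = edge_index
        deg = torch.bincount(row, minlength=x.size(0)).float()
        deg_inv_sqrt = deg.pow(-0.5)
        deg_inv_sqrt[deg_inv_sqrt == float('inf')] = 0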

        # Aggregate neighbor embeddings
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
        out = torch.zeros_like(x)
        out.index_add_(0, row, norm.view(-1, 1) * x[col])
        return out

    def decode(self, z, edge_index):
        # Use dot product to predict edge weights (ratings)
        return (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)

# Initialize the model, embeddings, and optimizer
embedding_dim = 16
num_layers = 3
model = LightGCNLinkPredictor(num_nodes=num_nodes, embedding_dim=embedding_dim, num_layers=num_layers)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Helper function to get positive and negative edges
def get_train_edges(data):
    pos_edge_index = data.edge_index
    neg_edge_index = negative_sampling(
        edge_index=data.edge_index, num_nodes=data.num_nodes, num_neg_samples=pos_edge_index.size(1)
    )
    return pos_edge_index, neg_edge_index

# Training loop using Mean Squared Error (MSE) loss
def train(data, model, optimizer):
    model.train()
    optimizer.zero_grad()

    # Forward pass
    z = model(data.edge_index)

    # Get positive and negative edges
    pos_edge_index, neg_edge_index = get_train_edges(data)

    # Decode predictions for edge weights
    pos_pred = model.decode(z, pos_edge_index)
    neg_pred = model.decode(z, neg_edge_index)

    # Use MSE loss for regression on edge weights
    pos_loss = F.mse_loss(pos_pred, data.edge_attr[:pos_edge_index.size(1)])
    # zeros_like keeps the target on the same device and dtype as the predictions
    neg_loss = F.mse_loss(neg_pred, torch.zeros_like(neg_pred))
    loss = pos_loss + neg_loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Train the model
for epoch in range(200):
    loss = train(data, model, optimizer)
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")


Epoch 0, Loss: 16.0249
Epoch 20, Loss: 10.0815
Epoch 40, Loss: 4.9099
Epoch 60, Loss: 4.2709
Epoch 80, Loss: 3.9604
Epoch 100, Loss: 3.7335
Epoch 120, Loss: 3.5747
Epoch 140, Loss: 3.4156
Epoch 160, Loss: 3.2750
Epoch 180, Loss: 3.2085
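
The propagation step in `propagate_embeddings` follows the LightGCN rule x' = D^(-1/2) A D^(-1/2) x, where the degrees are taken from the source indices of `edge_index`. That rule is usually stated for an undirected graph in which every edge appears in both directions. The NumPy sketch below shows one propagation step under that assumption, on a hypothetical graph with two users (nodes 0 and 1) who both rated one movie (node 2).

```python
import numpy as np

def lightgcn_propagate(x, edge_index):
    """One LightGCN step on an edge list that contains both directions of each edge."""
    row, col = edge_index
    num_nodes = x.shape[0]
    deg = np.bincount(row, minlength=num_nodes).astype(float)
    deg_inv_sqrt = np.zeros(num_nodes)
    deg_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5   # isolated nodes keep weight 0
    norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
    out = np.zeros_like(x)
    np.add.at(out, row, norm[:, None] * x[col])    # sum neighbor messages per node
    return out

# Hypothetical toy graph: edges 0-2 and 1-2, listed in both directions
edge_index_toy = np.array([[0, 1, 2, 2],
                           [2, 2, 0, 1]])
x_toy = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
out_toy = lightgcn_propagate(x_toy, edge_index_toy)
print(out_toy)
```

Each user receives the movie embedding scaled by 1/sqrt(2), and the movie receives the sum of both user embeddings with the same scaling. No weight matrices or activations are involved, which is the defining simplification of LightGCN.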


In [52]:
rating_predicted = []

# Compute node embeddings and the set of training edges once, outside the per-user loop
model.eval()
with torch.no_grad():
    z = model(data.edge_index)
existing_edges = set(zip(edge_index[0].tolist(), edge_index[1].tolist()))

for user in range(n_users):
    if user in special:  # users without any test interactions
        continue

    # Candidate test edges: this user paired with each of their test movies
    candidates = [(user, m) for m in movie[user]]

    # Keep only edges that do not appear in the training graph
    unseen = [edge for edge in candidates if edge not in existing_edges]
    if not unseen:
        print("No unseen edges found.")
        continue
    unseen_edges = torch.tensor(unseen, dtype=torch.long).t()

    # Predict ratings (edge weights) for unseen edges
    with torch.no_grad():
        pred_ratings = model.decode(z, unseen_edges)
    rating_predicted.append(pred_ratings)
    # Display predicted ratings for unseen edges
    for k, (u, v) in enumerate(unseen_edges.t()):
        print(f"Unseen Edge: ({u.item()}, {v.item()}), Predicted Rating: {pred_ratings[k].item():.2f}")

Unseen Edge: (0, 161), Predicted Rating: 3.49
Unseen Edge: (0, 226), Predicted Rating: 3.65
Unseen Edge: (0, 176), Predicted Rating: 4.75
Unseen Edge: (0, 87), Predicted Rating: 2.98
Unseen Edge: (0, 178), Predicted Rating: 3.16
Unseen Edge: (0, 222), Predicted Rating: 4.32
Unseen Edge: (0, 12), Predicted Rating: 6.20
Unseen Edge: (0, 151), Predicted Rating: 4.95
Unseen Edge: (0, 4), Predicted Rating: 4.25
Unseen Edge: (0, 169), Predicted Rating: 3.45
Unseen Edge: (0, 99), Predicted Rating: 4.32
Unseen Edge: (0, 191), Predicted Rating: 3.49
Unseen Edge: (0, 23), Predicted Rating: 4.52
Unseen Edge: (0, 193), Predicted Rating: 5.24
Unseen Edge: (0, 249), Predicted Rating: 4.75
Unseen Edge: (0, 231), Predicted Rating: 4.21
Unseen Edge: (0, 216), Predicted Rating: 3.55
Unseen Edge: (0, 71), Predicted Rating: 4.43
Unseen Edge: (0, 70), Predicted Rating: 3.91
Unseen Edge: (0, 136), Predicted Rating: 4.45
Unseen Edge: (0, 37), Predicted Rating: 4.30
Unseen Edge: (0, 206), Predicted Rating: 4.

In [53]:
movie = []  # Per-user lists of test movies whose ratings are to be predicted
for i in range(n_users):
    if i in special:
        continue
    movie.append(movie_watched(i))

In [54]:
ground_truth = [] # Ground Truth Rating
for i in range(0,n_users):
    if i in special:
        continue
    ground_truth.append(ground_truth_rating(i))

In [55]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
ground_truth_flat = [rating for sublist in ground_truth for rating in sublist]
rating_predicted_flat = torch.cat(rating_predicted).tolist()

In [56]:
mean_squared_error(ground_truth_flat, rating_predicted_flat)

1.8424100625363011

In [57]:
r2_score(ground_truth_flat, rating_predicted_flat)

-2.188054584264368

In [58]:
mean_absolute_error(ground_truth_flat, rating_predicted_flat)

1.0925375528126515
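
The three evaluations can be put side by side and checked for consistency. Since R² = 1 - MSE / Var(y), each reported (MSE, R²) pair implies a variance for the test ratings, and all three models were scored on the same test set, so the implied variances should agree. The sketch below uses only the numbers printed in the evaluation cells of this notebook.

```python
# (MSE, R^2) pairs copied from the three evaluation sections
reported = {
    "GCN": (1.246371067606549, -1.1566854613828634),
    "GraphSAGE": (1.1780590104434243, -1.0384801978382079),
    "LightGCN": (1.8424100625363011, -2.188054584264368),
}

implied_variance = {}
for name, (mse, r2) in reported.items():
    # R^2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R^2)
    implied_variance[name] = mse / (1.0 - r2)
    print(f"{name:10s} MSE={mse:.3f}  R2={r2:.3f}  implied Var(y)={implied_variance[name]:.4f}")
```

All three pairs imply a test-rating variance of roughly 0.58, which is the MSE a constant prediction at the test mean would achieve. Every model's MSE is above that value, which is what the negative R² scores express. Among the three, GraphSAGE has the lowest MSE and MAE, followed by GCN and then LightGCN.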