# Light Graph Convolution Network (LightGCN)

This is a TensorFlow implementation of LightGCN with a custom training loop.

LightGCN is adapted from Neural Graph Collaborative Filtering (NGCF). It operates by:

1. Graph-Based Approach: The data is treated as a graph with nodes (users, items) and edges (interactions such as ratings), which lets the model capture relationships beyond direct user-item pairs.
2. Simplified Convolution Layers: LightGCN uses graph convolution layers to blend features from neighboring nodes, and streamlines NGCF by omitting feature transformations and nonlinear activations.

Because it has so few moving parts, LightGCN relies directly on the graph structure to make recommendations. In applications like Goodreads book recommendations, it uses user-book connections to suggest titles that match a user's preferences and reading habits.

In [1]:
import sys
import os
# Append the parent directory to sys.path for relative imports
project_root = os.path.dirname(os.getcwd())
sys.path.append(project_root)

import numpy as np
import pandas as pd
import random
import scipy.sparse as sp
import tensorflow as tf
from tensorflow.keras.utils import Progbar
from src.utils import preprocess, metrics
from src.models import LightGCN

# Suppress warnings for cleaner notebook presentation
import warnings
warnings.simplefilter("ignore")

# Prepare data

This LightGCN implementation takes an adjacency matrix in a sparse tensor format as input.

To prepare the data for LightGCN, we must:

1. Load the data
2. Perform a stratified train/test split
3. Create a normalized adjacency matrix
4. Convert it to a tensor


# Load data

The data we use is a Goodreads ratings dataset loaded from `ratings.csv`. As the outputs below show, it contains 91,226 ratings from 1,371 users across 2,720 books.

In [2]:
# Loading ratings data
rating_file = os.path.join('..', 'src', 'data', 'goodreads_2m', 'ratings.csv')
ratings = pd.read_csv(rating_file)

# Displaying the shape of the dataset and a random sample of 5 entries
print(f'Shape: {ratings.shape}')
ratings.sample(5, random_state=123)  # Setting a seed for reproducibility

Shape: (91226, 3)


Unnamed: 0,user_id,book_name,rating
74505,2540,"A Game of Thrones (A Song of Ice and Fire, #1)",4
60643,5886,The Amazing Adventures of Kavalier & Clay,4
87603,4411,The World to Come,5
81524,4934,Harry Potter and the Philosopher's Stone (Harr...,5
60556,5791,"Bloodsucking Fiends (A Love Story, #1)",3


# Train test split

We use a stratified split so that every user in the training set also appears in the test set. LightGCN cannot generate recommendations for users it has not seen during training.

Here we use a training size of 75%.
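The idea of a per-user stratified split can be sketched with plain pandas. This is only an illustration under assumptions: `stratified_split_sketch` is a hypothetical helper, not the `preprocess.stratified_split` function used in this notebook, whose implementation may differ.

```python
import pandas as pd

def stratified_split_sketch(df, by, train_size, seed=0):
    """Hypothetical helper: split each group's rows so that every
    value of `by` can appear in both the train and the test part."""
    # Sample a `train_size` fraction of the rows inside each group
    train = df.groupby(by, group_keys=False).sample(frac=train_size, random_state=seed)
    # Whatever was not sampled becomes the test part
    test = df.drop(train.index)
    return train, test

# Toy data: two users with four ratings each
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "book_name": list("abcdefgh"),
    "rating": [5, 4, 3, 5, 2, 4, 4, 1],
})
train_toy, test_toy = stratified_split_sketch(toy, "user_id", 0.75)
same_users = set(train_toy.user_id) == set(test_toy.user_id)
```

Note that in this sketch a user with a single rating would end up in only one of the two parts, so such users would need separate handling.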

In [3]:
# Split data into training and testing sets
train_size = 0.75
train, test = preprocess.stratified_split(ratings, 'user_id', train_size)
print(f'Train Shape: {train.shape}\nTest Shape: {test.shape}')
print(f'Do they have the same users?: {set(train.user_id) == set(test.user_id)}')

Train Shape: (68435, 3)
Test Shape: (22791, 3)
Do they have the same users?: True


# Reindex

Reindex users and books from 0 to n-1 for both the training and test data. This makes users and books easier to track and lets their IDs double as row and column positions. Dictionaries are created so we can easily translate back and forth between the old index and the new index.

We would also normally remove users with no ratings, but in this case all entries have a user and a rating between 1 and 5.

In [4]:
# Assuming train and test DataFrames are already defined
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

# Combine train and test data
combined = pd.concat([train, test]).reset_index(drop=True)

# Count unique users and books (n_movies holds the number of books)
n_users = combined['user_id'].nunique()
n_movies = combined['book_name'].nunique()
print('Number of users:', n_users)
print('Number of books:', n_movies)

# Create user and item mappings
user2id = {uid: idx for idx, uid in enumerate(combined['user_id'].unique())}
book2id = {book: idx for idx, book in enumerate(combined['book_name'].unique())}
id2user = {idx: uid for uid, idx in user2id.items()}
id2item = {idx: book for book, idx in book2id.items()}

# Apply mappings to train and test sets
train['user_id_new'] = train['user_id'].map(user2id)
train['book_name_new'] = train['book_name'].map(book2id)
test['user_id_new'] = test['user_id'].map(user2id)
test['book_name_new'] = test['book_name'].map(book2id)

# Check for NaNs after reindexing
print("NaNs in train_reindex user_id_new:", train['user_id_new'].isna().sum())
print("NaNs in test_reindex user_id_new:", test['user_id_new'].isna().sum())

# Create a DataFrame to keep track of which books each user has interacted with
interacted = train.groupby("user_id_new")["book_name_new"].apply(set).reset_index()
interacted.rename(columns={"book_name_new": "book_interacted"}, inplace=True)

Number of users: 1371
Number of books: 2720
NaNs in train_reindex user_id_new: 0
NaNs in test_reindex user_id_new: 0


# Adjacency matrix

The adjacency matrix is a data structure that represents a graph by encoding the connections between nodes. In our case, the nodes are both users and books. The rows and columns each contain ALL the nodes, and every connection (a reviewed book) is marked with the value 1.

To build the adjacency matrix, we first create a user-item graph in which, as in the adjacency matrix, connected users and books are represented as 1 in a sparse array. Unlike the adjacency matrix, the user-item graph has users along one axis and items along the other, whereas the adjacency matrix has users and items concatenated along both rows and columns.

Because the graph is undirected (the connections between nodes have no specified direction), the adjacency matrix is symmetric. We use this to our advantage by placing the user-item graph in one off-diagonal block and its transpose in the other.

Our adjacency matrix does not include self-connections, where a node is connected to itself.
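As a small worked example of the block layout described above, the following sketch builds a dense adjacency matrix for a toy graph with 2 users and 3 books. The toy matrix `R_toy` is made up for illustration, and a dense NumPy array is used only for readability; the notebook itself uses sparse matrices.

```python
import numpy as np

# Toy user-item matrix: user 0 reviewed books 0 and 2, user 1 reviewed book 1
R_toy = np.array([[1, 0, 1],
                  [0, 1, 0]], dtype=np.float32)
n_u, n_i = R_toy.shape

# Users and items are concatenated along both axes:
#   [[ 0   R ]
#    [ R^T 0 ]]
A_toy = np.block([
    [np.zeros((n_u, n_u), dtype=np.float32), R_toy],
    [R_toy.T, np.zeros((n_i, n_i), dtype=np.float32)],
])

print(A_toy)
```

The zero blocks on the diagonal reflect that there are no user-user or item-item edges and no self-connections.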

# Create adjacency matrix

In [5]:
# Create user-item interaction matrix
R = sp.dok_matrix((n_users, n_movies), dtype=np.float32)
for _, row in train.iterrows():
    R[row['user_id_new'], row['book_name_new']] = 1

# Create adjacency matrix
adj_mat = sp.dok_matrix((n_users + n_movies, n_users + n_movies), dtype=np.float32)
adj_mat[:n_users, n_users:] = R
adj_mat[n_users:, :n_users] = R.T

## Normalize adjacency matrix

This helps numerically stabilize values when repeating graph convolution operations, preventing the scale of the embeddings from increasing or decreasing.

$$\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$$

$D$ is the degree (diagonal) matrix, which is zero everywhere except on its diagonal. Each diagonal entry holds the neighborhood size of a node (how many other nodes that node connects to).

$D^{-\frac{1}{2}}$ on the left side scales $A$ by the neighborhood size of the source node, while $D^{-\frac{1}{2}}$ on the right side scales by the neighborhood size of the destination node rather than the source node.
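To make the symmetric normalization concrete, here is a minimal dense sketch on a made-up toy graph (2 users, 3 books). The variable names are illustrative and the toy data is an assumption, not part of the notebook's dataset.

```python
import numpy as np

# Toy user-item matrix and its adjacency matrix
R_toy = np.array([[1, 0, 1],
                  [0, 1, 0]], dtype=np.float64)
n_u, n_i = R_toy.shape
A_toy = np.block([
    [np.zeros((n_u, n_u)), R_toy],
    [R_toy.T, np.zeros((n_i, n_i))],
])

# Degree of each node = number of neighbors = row sum
deg = A_toy.sum(axis=1)

# D^{-1/2}, leaving isolated nodes (degree 0) at 0 instead of infinity
d_inv_sqrt = np.zeros_like(deg)
nonzero = deg > 0
d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5

# Scale rows by the source degree and columns by the destination degree
A_norm = d_inv_sqrt[:, None] * A_toy * d_inv_sqrt[None, :]
```

Each edge weight becomes 1 / sqrt(deg(source) * deg(destination)). For instance, user 0 has two neighbors and book 0 has one, so their edge is scaled to 1 / sqrt(2).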

In [6]:
# Calculate normalized adjacency matrix
D_values = np.array(adj_mat.sum(1))
D_inv_values = np.power(D_values + 1e-9, -0.5).flatten()
D_inv_values[np.isinf(D_inv_values)] = 0.0
D_inv_sq_root = sp.diags(D_inv_values)
norm_adj_mat = D_inv_sq_root.dot(adj_mat).dot(D_inv_sq_root)

## Convert to tensor

In [7]:
# Convert to SparseTensor for TensorFlow
coo = norm_adj_mat.tocoo().astype(np.float32)
# Stack row/column indices into an (nnz, 2) int64 array, as tf.SparseTensor expects
indices = np.stack([coo.row, coo.col], axis=1).astype(np.int64)
A_tilde = tf.SparseTensor(indices, coo.data, coo.shape)

## LightGCN

LightGCN keeps neighbor aggregation while removing self-connections, feature transformation, and nonlinear activation, which simplifies the model and can improve recommendation performance.

Neighbor aggregation is done through graph convolutions, which learn embeddings that represent the nodes. The embedding size is a tunable hyperparameter. In this notebook, we set the embedding dimension to 64.

In matrix form, a graph convolution is a matrix multiplication of the normalized adjacency matrix with the current embeddings. The implementation provides a graph convolution layer that performs exactly this, so we can stack as many graph convolutions as we want. We use 10 layers in this notebook.
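The matrix view of stacked graph convolutions can be sketched in a few lines of NumPy. This is an illustration only: `lightgcn_propagate` is a hypothetical function, and averaging the per-layer embeddings with equal weights follows the LightGCN paper; the `LightGCN` class in `src.models` may combine layers differently.

```python
import numpy as np

def lightgcn_propagate(A_norm, E0, n_layers):
    """Hypothetical sketch: repeat E <- A_norm @ E and average all layers."""
    layers = [E0]
    for _ in range(n_layers):
        # One graph convolution: no weight matrix, no nonlinearity
        layers.append(A_norm @ layers[-1])
    # Equal-weight mean over layer 0 .. n_layers
    return np.mean(layers, axis=0)

# Toy graph with two connected nodes and 2-dimensional embeddings
A_norm = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
E0 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
E_final = lightgcn_propagate(A_norm, E0, n_layers=2)
```

In this toy graph each convolution swaps the two embeddings, so after two layers each node's final embedding is a 2/3 to 1/3 blend of its own and its neighbor's initial embedding.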

## Custom training

For training, we sample a batch of users from the training set and, for each user, sample a single positive item (a book the user has reviewed) and a single negative item (a book the user has not reviewed). The loss rewards the model for scoring the positive item above the negative one.
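The pairwise objective over (user, positive, negative) triples can be sketched in NumPy as follows. It mirrors the softplus form used in the training loop of this notebook, but `bpr_loss` is a hypothetical standalone function, and the regularization term is left out for brevity.

```python
import numpy as np

def bpr_loss(user_emb, pos_emb, neg_emb):
    """Hypothetical sketch of the pairwise ranking loss (no regularizer)."""
    # Score = dot product between user and item embeddings
    pos_scores = np.sum(user_emb * pos_emb, axis=1)
    neg_scores = np.sum(user_emb * neg_emb, axis=1)
    # softplus(-x) = log(1 + exp(-x)) = -log(sigmoid(x))
    return np.mean(np.logaddexp(0.0, -(pos_scores - neg_scores)))

u = np.array([[1.0, 0.0]])
tied = bpr_loss(u, np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]]))
well_ranked = bpr_loss(u, np.array([[10.0, 0.0]]), np.array([[-10.0, 0.0]]))
```

When the positive and negative items score the same, the loss equals log(2); it approaches 0 as the positive item's score pulls ahead of the negative item's score.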

In [8]:
# Model configuration
N_LAYERS = 10
EMBED_DIM = 64
DECAY = 0.0001
EPOCHS = 200
BATCH_SIZE = 1024
LEARNING_RATE = 1e-2

# Initialize LightGCN model
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model = LightGCN.LightGCN(A_tilde, n_users, n_movies, N_LAYERS, EMBED_DIM, DECAY)

In [9]:
%%time
# Custom training loop with negative sampling and gradient updates
for epoch in range(1, EPOCHS + 1):
    print(f'Epoch {epoch}/{EPOCHS}')
    n_batch = len(train) // BATCH_SIZE + (len(train) % BATCH_SIZE != 0)
    bar = Progbar(n_batch)
    
    for _ in range(n_batch):
        # Sample a batch of users
        users = np.random.choice(n_users, BATCH_SIZE, replace=False)

        # Function for negative sampling
        def sample_neg(user_interacted_items):
            while True:
                neg_item = random.randint(0, n_movies - 1)
                if neg_item not in user_interacted_items:
                    return neg_item

        # Sample positive and negative items for each user
        pos_items = [random.choice(list(interacted[interacted['user_id_new'] == u]['book_interacted'].values[0])) for u in users]
        neg_items = [sample_neg(interacted[interacted['user_id_new'] == u]['book_interacted'].values[0]) for u in users]

        with tf.GradientTape() as tape:
            # Call LightGCN with user and item embeddings
            new_user_embeddings, new_item_embeddings = model(
                (model.user_embedding, model.item_embedding)
            )

            # Embeddings after convolutions
            user_embeddings = tf.nn.embedding_lookup(new_user_embeddings, users)
            pos_item_embeddings = tf.nn.embedding_lookup(new_item_embeddings, pos_items)
            neg_item_embeddings = tf.nn.embedding_lookup(new_item_embeddings, neg_items)

            # Initial embeddings before convolutions
            old_user_embeddings = tf.nn.embedding_lookup(model.user_embedding, users)
            old_pos_item_embeddings = tf.nn.embedding_lookup(model.item_embedding, pos_items)
            old_neg_item_embeddings = tf.nn.embedding_lookup(model.item_embedding, neg_items)

            # Calculate loss
            pos_scores = tf.reduce_sum(tf.multiply(user_embeddings, pos_item_embeddings), axis=1)
            neg_scores = tf.reduce_sum(tf.multiply(user_embeddings, neg_item_embeddings), axis=1)
            regularizer = (tf.nn.l2_loss(old_user_embeddings) +
                           tf.nn.l2_loss(old_pos_item_embeddings) +
                           tf.nn.l2_loss(old_neg_item_embeddings)) / BATCH_SIZE
            mf_loss = tf.reduce_mean(tf.nn.softplus(-(pos_scores - neg_scores)))
            emb_loss = DECAY * regularizer
            loss = mf_loss + emb_loss

        # Apply gradients
        grads = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
        bar.add(1, values=[('training loss', float(loss))])

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

In [10]:
# Generate recommendations
users = np.array([user2id[x] for x in test['user_id'].unique()])
recommendations = model.recommend(users, k=10)

# Replace new user and book IDs with original IDs
recommendations['user_id'] = recommendations['user_id'].map(id2user)
recommendations['book_name'] = recommendations['book_name'].map(id2item)

# Display the first 5 recommendations
recommendations.head(5)

Unnamed: 0,user_id,book_name,prediction
0,1,To Kill a Mockingbird,9.68592
1,1,"The Da Vinci Code (Robert Langdon, #2)",9.277096
2,1,The Tipping Point: How Little Things Can Make ...,8.906483
3,1,A Tale of Two Cities,8.489778
4,1,Brave New World,8.306227


## Evaluation Metrics

The performance of our model is evaluated using the test set, which consists of the exact same users in the training set but with books the users have reviewed that the model has not seen before. A good model will recommend books that the user has also reviewed in the test set.

---

### Precision@k

Out of the books that are recommended, what proportion is relevant? A book is relevant in this case if the user has reviewed it in the test set.

A precision@10 of about 0.1 means that about 10% of the recommendations are relevant to the user. In other words, out of the 10 recommendations made, on average a user will have 1 book that is actually relevant.

### Recall@k

Out of all the relevant books (in the test set), what proportion is recommended?

A recall@10 of 0.1 means that about 10% of the relevant books were recommended. Even if all the recommendations made were relevant, recall@k cannot exceed k divided by the number of relevant books, so a user with 20 relevant books has a maximum recall@10 of 0.5. A higher k means that more relevant books can be recommended.
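For a single user, precision@k and recall@k can be computed as in the sketch below. `precision_recall_at_k` is a hypothetical helper written for illustration; the functions in `src.utils.metrics` operate on DataFrames and may differ in detail.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Hypothetical per-user helper.

    recommended: list of items ordered from best to worst
    relevant:    set of items the user reviewed in the test set
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
```

Here two of the five recommendations are relevant, giving a precision@5 of 0.4, and two of the three relevant books are recovered, giving a recall@5 of about 0.67. The dataset-level metrics average these per-user values.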

### Mean Average Precision (MAP)

Calculate the average precision for each user, then take the mean of those values over all users. Because precision is accumulated at the rank of each relevant book, MAP penalizes rankings that place relevant books low in the list.
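A per-user average precision can be sketched as below. `average_precision_at_k` is a hypothetical helper, and dividing by `min(len(relevant), k)` is one common convention, assumed here; the `metrics.mean_average_precision` function used in this notebook may normalize differently.

```python
def average_precision_at_k(recommended, relevant, k):
    """Hypothetical per-user helper: mean of precision values at each hit."""
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a hit occurs
            score += hits / rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}
ap = average_precision_at_k(recommended, relevant, k=5)
```

The hits at ranks 2 and 5 contribute 1/2 and 2/5, and dividing their sum by 3 gives an average precision of 0.3. MAP is the mean of this value over all users.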

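### Normalized Discounted Cumulative Gain (NDCG)

Looks at both the relevant books and the order in which they are ranked, discounting hits that appear lower in the list. Each user's score is normalized by the ideal DCG (the score of a perfect ranking) so it falls between 0 and 1, and the result is averaged over all users.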
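With binary relevance (a book is either relevant or not), a per-user NDCG@k can be sketched as follows. `ndcg_at_k` is a hypothetical helper for illustration, and `metrics.ndcg` in this notebook may be implemented differently.

```python
import math

def ndcg_at_k(recommended, relevant, k):
    """Hypothetical per-user helper using binary relevance."""
    # DCG: each hit is discounted by log2(rank + 1)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    # Ideal DCG: all relevant items placed at the top of the list
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

relevant = {"b", "e", "z"}
imperfect = ndcg_at_k(["a", "b", "c", "d", "e"], relevant, k=5)
perfect = ndcg_at_k(["b", "e", "z", "a", "c"], relevant, k=5)
```

A ranking that places every relevant book at the top scores 1.0, while the same hits placed lower in the list score less.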

---


In [11]:
# Evaluate model performance
top_k = recommendations.copy()
top_k['rank'] = top_k.groupby('user_id', sort=False).cumcount() + 1

# Calculate evaluation metrics
precision_at_k = metrics.precision_at_k(top_k, test, 'user_id', 'book_name', 'rank')
recall_at_k = metrics.recall_at_k(top_k, test, 'user_id', 'book_name', 'rank')
mean_average_precision = metrics.mean_average_precision(top_k, test, 'user_id', 'book_name', 'rank')
ndcg = metrics.ndcg(top_k, test, 'user_id', 'book_name', 'rank')

# Display evaluation metrics
print(f'Precision: {precision_at_k:.6f}',
      f'Recall: {recall_at_k:.6f}',
      f'MAP: {mean_average_precision:.6f}',
      f'NDCG: {ndcg:.6f}', sep='\n')

Precision: 0.200073
Recall: 0.148815
MAP: 0.080734
NDCG: 0.240993
