Based on this Medium post:
- [[Medium] LightGCN for Movie Recommendation](https://medium.com/stanford-cs224w/lightgcn-for-movie-recommendation-eb6d112f1e8)
-  Code: https://colab.research.google.com/drive/1VfP6JlWbX_AJnx88yN1tM3BYE6XAADiy?usp=sharing


It is worth studying how this notebook is set up, for example how the training loop is written. If I can explain this notebook clearly, I think there is still a lot of room for me to add value.

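Installing PyG on a Mac is really painful; I ran into all kinds of problems. For example, even with PyTorch 1.11 and PyG installed through conda as the official instructions suggest, `import torch_geometric` still fails with: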
- "oserror: dlopen(/opt/anaconda3/envs/pyg/lib/python3.7/site-packages/torch_sparse/_convert_cpu.so, 6): symbol not found: __zn2at8internal13_parallel_runexxxrknst3__18functionifvxxmeee"

I finally found an installation method that works:
- https://gist.github.com/AnirudhDagar/05e9c51257dda06206a44c3b09aced4b

I recorded a [video](https://www.youtube.com/watch?v=UuMjJVqCMQo) that walks through the installation. The basic steps are:
```
conda create -n pygeometric python=3.7
conda activate pygeometric
conda install pytorch torchvision -c pytorch
python -c "import torch; print(torch.__version__)"
conda install -y clang_osx-64 clangxx_osx-64 gfortran_osx-64
export MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++
pip install torch_scatter
python -c "import torch_scatter; print(torch_scatter.__version__)"
pip install torch_sparse
python -c "import torch_sparse; print(torch_sparse.__version__)"
pip install torch_cluster
python -c "import torch_cluster; print(torch_cluster.__version__)"
pip install torch-spline-conv
python -c "import torch_spline_conv; print(torch_spline_conv.__version__)"
pip install torch_geometric
python -c "import torch_geometric; print(torch_geometric.__version__)"
```

# Imports

In [1]:
import collections
import math
import os
import os.path as osp
from tqdm import tqdm
from typing import List
import random
import time
import zipfile

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
pd.options.display.max_rows = 10
from sklearn import metrics
from tensorly import decomposition # used for matrix factorization


In [2]:

import torch
from torch.functional import tensordot
from torch import nn, optim, Tensor
import torch_geometric
from torch_geometric.data import Dataset, Data, download_url, extract_zip
from torch_geometric.nn import MessagePassing
from torch_geometric.typing import Adj

# Configurations

Configure the model and training process. These parameters will make more sense as you move along.

In [3]:
rating_threshold = 1  #@param {type: "integer"}: Ratings equal to or greater than this threshold are positive items (with 1, every rated movie counts as positive).

config_dict = {
    "num_samples_per_user": 500,
    "num_users": 200,

    "epochs": 10,
    "batch_size": 128,
    "lr": 0.001,
    "weight_decay": 0.1,

    "embedding_size": 64,
    "num_layers": 3,
    "K": 10,
    "mf_rank": 8,

    "minibatch_per_print": 100,
    "epochs_per_print": 1,

    "val_frac": 0.2,
    "test_frac": 0.1,

    "model_name": "model.pth"
}

# Data Exploration

A great publicly available dataset for training movie recommenders is the MovieLens 1M dataset. It consists of about 1 million movie ratings on a 1 to 5 scale, from roughly 6,000 users on roughly 4,000 movies.

This section explores the data by manually running some of the operations from the Create Own Dataset class, which makes them easier to follow.

In [6]:
DATA_PATH = "https://files.grouplens.org/datasets/movielens/ml-1m.zip"

In [7]:
# Explore the data here
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
        
users = pd.read_table('./ml-1m/users.dat', 
                  sep='::', 
                  header=None, 
                  names=unames,
                  engine='python', 
                  encoding='latin-1')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('./ml-1m/ratings.dat', 
                    sep='::', 
                    header=None, 
                    names=rnames, 
                    engine='python',
                    encoding='latin-1')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('./ml-1m/movies.dat', 
                   sep='::', 
                   header=None, 
                   names=mnames, 
                   engine='python',
                   encoding='latin-1')

# merge here is simply a join
dat = pd.merge(pd.merge(ratings, users), movies)

In [8]:
dat.shape[0]

1000209

In [126]:
dat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   user_id     1000209 non-null  int64 
 1   movie_id    1000209 non-null  int64 
 2   rating      1000209 non-null  int64 
 3   timestamp   1000209 non-null  int64 
 4   gender      1000209 non-null  object
 5   age         1000209 non-null  int64 
 6   occupation  1000209 non-null  int64 
 7   zip         1000209 non-null  object
 8   title       1000209 non-null  object
 9   genres      1000209 non-null  object
dtypes: int64(6), object(4)
memory usage: 83.9+ MB


In [127]:
dat.head()

| | user_id | movie_id | rating | timestamp | gender | age | occupation | zip | title | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 | F | 1 | 10 | 48067 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 1 | 2 | 1193 | 5 | 978298413 | M | 56 | 16 | 70072 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 2 | 12 | 1193 | 4 | 978220179 | M | 25 | 12 | 32793 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 3 | 15 | 1193 | 4 | 978199279 | M | 25 | 7 | 22903 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 4 | 17 | 1193 | 5 | 978158471 | M | 50 | 1 | 95350 | One Flew Over the Cuckoo's Nest (1975) | Drama |


In [128]:
users = users['user_id']
movies = movies['movie_id']

num_users = config_dict["num_users"]
# Is this just taking the first 200 users?
if num_users != -1:
    users = users[:num_users]

# What does this do? It is the same idea as the label encoder in my earlier basic CF notebook:
# remap user_id and movie_id to ranges that start from 0.
user_ids = range(len(users))
movie_ids = range(len(movies))

# This seems unnecessarily complicated; a label encoder would do the same thing.
# Note that my validation set there also used the label-encoded, 0-based dataset.
# TODO: try a label encoder here later
user_to_id = dict(zip(users, user_ids))
movie_to_id = dict(zip(movies, movie_ids))
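
The hand-built dictionaries above amount to label encoding. As a minimal sketch of the alternative, assuming toy IDs (the `raw_user_ids` array is made up for illustration), `np.unique(..., return_inverse=True)` returns the sorted unique IDs together with a 0-based code for each entry:

```python
import numpy as np

# Hypothetical raw IDs with gaps, standing in for a user_id or movie_id column.
raw_user_ids = np.array([10, 3, 10, 42, 3])

# uniques: sorted distinct IDs; codes: 0-based position of each entry in uniques.
uniques, codes = np.unique(raw_user_ids, return_inverse=True)

# Same information as a user_to_id dictionary built by hand.
user_to_id = {int(raw): idx for idx, raw in enumerate(uniques)}

print(uniques.tolist())  # [3, 10, 42]
print(codes.tolist())    # [1, 0, 1, 2, 0]
print(user_to_id)        # {3: 0, 10: 1, 42: 2}
```

One difference to keep in mind: this sketch orders IDs by value, while the notebook's `dict(zip(...))` follows the order of rows in the file. The two agree only when the file is already sorted by ID.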


In [129]:
print(user_ids)

range(0, 200)


In [130]:
print(movie_ids)

range(0, 3883)


In [131]:
# This is the manual approach without a label encoder: a dictionary mapping (user_id -> 0-based user_id)
print(user_to_id)

{1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7, 9: 8, 10: 9, 11: 10, 12: 11, 13: 12, 14: 13, 15: 14, 16: 15, 17: 16, 18: 17, 19: 18, 20: 19, 21: 20, 22: 21, 23: 22, 24: 23, 25: 24, 26: 25, 27: 26, 28: 27, 29: 28, 30: 29, 31: 30, 32: 31, 33: 32, 34: 33, 35: 34, 36: 35, 37: 36, 38: 37, 39: 38, 40: 39, 41: 40, 42: 41, 43: 42, 44: 43, 45: 44, 46: 45, 47: 46, 48: 47, 49: 48, 50: 49, 51: 50, 52: 51, 53: 52, 54: 53, 55: 54, 56: 55, 57: 56, 58: 57, 59: 58, 60: 59, 61: 60, 62: 61, 63: 62, 64: 63, 65: 64, 66: 65, 67: 66, 68: 67, 69: 68, 70: 69, 71: 70, 72: 71, 73: 72, 74: 73, 75: 74, 76: 75, 77: 76, 78: 77, 79: 78, 80: 79, 81: 80, 82: 81, 83: 82, 84: 83, 85: 84, 86: 85, 87: 86, 88: 87, 89: 88, 90: 89, 91: 90, 92: 91, 93: 92, 94: 93, 95: 94, 96: 95, 97: 96, 98: 97, 99: 98, 100: 99, 101: 100, 102: 101, 103: 102, 104: 103, 105: 104, 106: 105, 107: 106, 108: 107, 109: 108, 110: 109, 111: 110, 112: 111, 113: 112, 114: 113, 115: 114, 116: 115, 117: 116, 118: 117, 119: 118, 120: 119, 121: 120, 122: 12

In [132]:
num_user = users.shape[0]
num_item = movies.shape[0]
print(f"num_user {num_user}, num_item {num_item}")

num_user 200, num_item 3883


In [133]:
ratings.head()

| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


In [134]:
# initialize the adjacency matrix

# rat stands for the rating matrix; simply calling it the adjacency matrix would be clearer.
rat = torch.zeros(num_user, num_item)

# Build the adjacency matrix here
for index, row in ratings.iterrows():
    # Take the first three columns; the fourth column is the timestamp, which is not needed
    user, movie, rating = row[:3]
    
    if num_users != -1:
        if user not in user_to_id: break
    
    # create ratings matrix where (i, j) entry represents the ratings
    # of movie j given by user i.
    # So rat is indexed by the 0-based encoded user_id and movie_id, which keeps the indices within the bounds of the matrix shape
    rat[user_to_id[user], movie_to_id[movie]] = rating
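
Looping over a million rows with `iterrows()` is slow. A minimal sketch of a vectorized alternative, assuming the IDs have already been remapped to 0-based indices (the arrays below are made-up toy data, and NumPy is used instead of torch to keep the example small):

```python
import numpy as np

# Toy (user, movie, rating) triples, already re-indexed from 0 (made-up data).
user_idx = np.array([0, 0, 1, 2])
movie_idx = np.array([0, 2, 1, 2])
rating = np.array([5.0, 3.0, 4.0, 1.0])

num_user, num_item = 3, 4
rat = np.zeros((num_user, num_item))

# Fancy indexing writes all ratings at once instead of looping over rows.
rat[user_idx, movie_idx] = rating
print(rat)
```

The same index-assignment pattern should carry over to a torch tensor with integer index tensors, but that is an assumption to verify against the notebook's data.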

In [135]:
print(rat)
print(rat.size())

tensor([[5., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [5., 3., 0.,  ..., 0., 0., 0.],
        [0., 0., 3.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
torch.Size([200, 3883])


In [137]:
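# Q: Odd... shouldn't edge_index be in COO format? Why is the dense adjacency matrix used directly?
data_before_transform = Data(edge_index = rat.clone(), # TODO: this looks wrong, it should be in COO format. clone() is needed, otherwise later in-place edits would also change rat
            raw_edge_index = rat.clone(), # What is this for? It is used later in _sample_pos_neg; this and the fields below are custom key-value pairs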
            data = ratings,
            users = users,
            items = movies)

In [139]:
print(data_before_transform['edge_index']) # Why is it still a dense matrix?

tensor([[5., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [5., 3., 0.,  ..., 0., 0., 0.],
        [0., 0., 3.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])


In [140]:
data = trans_ml(data_before_transform, [rating_threshold])
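
`trans_ml` (defined in the Create Own Dataset section) binarizes the rating matrix around the threshold. A minimal sketch of the same masking logic on a toy matrix, written with NumPy and a hypothetical helper name `binarize`; unlike `trans_ml`, it works on a copy:

```python
import numpy as np

def binarize(matrix, thres):
    """Return a copy where entries >= thres become 1 and entries in (-1, thres) become 0."""
    out = matrix.copy()
    out[(out < thres) & (out > -1)] = 0
    out[out >= thres] = 1
    return out

# Made-up ratings: 0 means "no rating".
ratings_toy = np.array([[5.0, 0.0, 2.0],
                        [3.0, 4.0, 0.0]])

print(binarize(ratings_toy, 3).tolist())  # [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(binarize(ratings_toy, 1).tolist())  # [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
```

With a threshold of 1, every rated movie becomes a positive edge, which is what the second call shows.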

In [141]:
print(data.keys)

['items', 'edge_index', 'users', 'data', 'raw_edge_index']


In [142]:
for key, item in data:
    print(f'{key} => {item}')

edge_index => tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [1., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
raw_edge_index => tensor([[5., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [5., 3., 0.,  ..., 0., 0., 0.],
        [0., 0., 3.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
data =>          user_id  movie_id  rating  timestamp
0              1      1193       5  978300760
1              1       661       3  978302109
2              1       914       3  978301968
3              1      3408       4  978300275
4              1      2355       5  978824291
...          ...       ...     ...        ...
1000204     6040      1091       1  956716541
1000205     6040      1094       5  956704887
1000206     6040       562       5  95

In [143]:
# After the transform it is still not in COO format, which still looks wrong,
# unless the model uses it in some unusual way.
# A quick fix would be to convert the adjacency matrix into a list of edge pairs and transpose it.
print(data['edge_index'])

tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [1., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])


In [144]:
# One way to do it: https://discuss.pytorch.org/t/adjacency-matrix-to-edge-index-solution/148343
print(data['edge_index'].clone().nonzero().t().contiguous())

tensor([[   0,    0,    0,  ...,  199,  199,  199],
        [   0,   47,  148,  ..., 2982, 3339, 3682]])
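
The `nonzero().t()` trick above can be checked on a small matrix. This is a minimal NumPy sketch with made-up data. The offset step is an assumption: it numbers item nodes after user nodes, matching the single embedding table of size `num_users + num_items` used by the model later on.

```python
import numpy as np

# Toy binary user-item matrix: 2 users x 3 items (made-up data).
adj = np.array([[1, 0, 1],
                [0, 1, 0]])

# Row 0 holds user indices, row 1 holds item indices: shape (2, num_edges).
edge_index = np.stack(np.nonzero(adj))

# Number item nodes after user nodes so both live in one node index space.
num_users = adj.shape[0]
edge_index_offset = edge_index.copy()
edge_index_offset[1] += num_users

print(edge_index.tolist())         # [[0, 0, 1], [0, 2, 1]]
print(edge_index_offset.tolist())  # [[0, 0, 1], [2, 4, 3]]
```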


_-------------------------------------------------------_

The data exploration above is there because the dataset class below is hard to follow without it.

# Create Own Dataset

In [4]:
# What is this? A pre-processing helper? (It is called as self.transform(...) inside process() below.)
def trans_ml(dat, thres):
    """
    Transform function that sets non-negative entries >= thres to 1 and
    non-negative entries < thres to 0. Other entries are left unchanged.
    """
    
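    # This is clumsy: the caller passes a list, data = self.transform(data, [rating_threshold]),
    # so thres[0] is needed here to pull the rating_threshold int back out.
    thres = thres[0]
    # Get edge_index; note that it actually stores the dense adjacency matrix, not COO format.
    # Maybe this is where it gets converted to COO?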
    matrix = dat['edge_index']
    matrix[(matrix < thres) & (matrix > -1)] = 0
    matrix[(matrix >= thres)] = 1
    # TOTRY: check whether the problem is here by converting the adjacency matrix to COO.
    # Tried it: nothing crashes, but recall@k still does not improve. Will keep looking.
    # dat['edge_index'] = matrix.clone().nonzero().t().contiguous()
    dat['edge_index'] = matrix
    return dat


# Q: What class is Dataset? A: It comes from `from torch_geometric.data import Dataset`, so it is a dataset class for graphs.
class MovieLens(Dataset):
    def __init__(self, root, transform=None, pre_transform=None,
            transform_args=None, pre_transform_args=None):
        """
        root = where the dataset should be stored. This folder is split
        into raw_dir (downloaded dataset) and processed_dir (process data).
        """
        super(MovieLens, self).__init__(root, transform, pre_transform)
        self.transform = transform
        self.pre_transform = pre_transform
        self.transform_args = transform_args
        self.pre_transform_args = pre_transform_args

    @property
    def raw_file_names(self):
        # Returning ml-1m.zip is fine: this only checks whether <root_dir>/raw contains the file, so downloading it ahead of time also works
        return ["ml-1m.zip"]

    @property
    def processed_file_names(self):
        return ["data_movielens.pt"]

    def download(self):
        # Download to `self.raw_dir`.
        download_url(DATA_PATH, self.raw_dir)

    # What does this step do? It reads the files and converts them into pandas DataFrames.
    # It is the first thing process() calls.
    def _load(self):
        print(self.raw_dir)
        # extract_zip(self.raw_paths[0], self.raw_dir)
        # Unzip the archive
        with zipfile.ZipFile(self.raw_paths[0], 'r') as zip_ref:
            zip_ref.extractall(self.raw_dir)
            
        unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
        
        users = pd.read_table(self.raw_dir+'/ml-1m/users.dat', 
                              sep='::', 
                              header=None, 
                              names=unames,
                              engine='python', 
                              encoding='latin-1')
        
        rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
        ratings = pd.read_table(self.raw_dir+'/ml-1m/ratings.dat', 
                                sep='::', 
                                header=None, 
                                names=rnames, 
                                engine='python',
                                encoding='latin-1')
        
        mnames = ['movie_id', 'title', 'genres']
        movies = pd.read_table(self.raw_dir+'/ml-1m/movies.dat', 
                               sep='::', 
                               header=None, 
                               names=mnames, 
                               engine='python',
                               encoding='latin-1')
        
        # As in the data exploration above, this joins everything into one DataFrame with the columns below:
        # user_id, movie_id, rating, timestamp, gender, age, occupation, zip, title, genres
        dat = pd.merge(pd.merge(ratings, users), movies)

        return users, ratings, movies, dat

    def process(self):
        print('run process')
        # load information from file
        users, ratings, movies, dat = self._load()

        # users and movies are pandas DataFrames; here we take the unique user and movie ids from them
        users = users['user_id']
        movies = movies['movie_id']

        num_users = config_dict["num_users"]
        if num_users != -1:
            users = users[:num_users]

        user_ids = range(len(users))
        movie_ids = range(len(movies))

        user_to_id = dict(zip(users, user_ids))
        movie_to_id = dict(zip(movies, movie_ids))

        # get adjacency info
        self.num_user = users.shape[0]
        self.num_item = movies.shape[0]

        # initialize the adjacency matrix
        rat = torch.zeros(self.num_user, self.num_item)

        # Build the adjacency matrix here
        for index, row in ratings.iterrows():
            user, movie, rating = row[:3]
            if num_users != -1:
                if user not in user_to_id: break
            # create ratings matrix where (i, j) entry represents the ratings
            # of movie j given by user i.
            rat[user_to_id[user], movie_to_id[movie]] = rating

        # create Data object
        # Data also comes from torch_geometric.data
        data = Data(edge_index = rat,
                    raw_edge_index = rat.clone(),
                    data = ratings,
                    users = users,
                    items = movies)

        # apply any pre-transformation
        # We probably do not have one
        if self.pre_transform is not None:
            data = self.pre_transform(data, self.pre_transform_args)

        # apply any post_transformation
        # if self.transform is not None:
        #     # data = self.transform(data, self.transform_args)
        data = self.transform(data, [rating_threshold])

        # save the processed data into .pt file
        # a .pt file holds an object serialized with torch.save (here the Data object, not a model)
        torch.save(data, osp.join(self.processed_dir, 'data_movielens.pt'))
        print('process finished')
      
    def len(self):
        """
        return the number of examples in your graph
        # What does "number of examples" mean here, and how should it be defined?
        """
        # TODO: how to define number of examples
        # For now, use the number of rows of the merged DataFrame dat
        
        users, ratings, movies, dat = self._load()
        
        return dat.shape[0]

    def get(self):
        """
        The logic to load a single graph
        """
        # Just load the .pt file saved above
        data = torch.load(osp.join(self.processed_dir, 'data_movielens.pt'))
        return data

    def train_val_test_split(self, val_frac=0.2, test_frac=0.1):
        """
        Return three mask matrices (M, N) that represent the edges present in
        the train, validation and test sets.
        Worth looking at how the split is done here.
        """
        try:
            self.num_user, self.num_item
        except AttributeError:
            data = self.get()
            self.num_user = len(data["users"].unique())
            self.num_item = len(data["items"].unique())
            
        # get number of edges masked for training and validation
        num_train_replaced = round((test_frac + val_frac) * self.num_user * self.num_item)
        num_val_show = round(val_frac * self.num_user * self.num_item)

        # edges masked during training
        # Note how the training data is handled here
        indices_user = np.random.randint(0, self.num_user, num_train_replaced)
        indices_item = np.random.randint(0, self.num_item, num_train_replaced)
        
        # sample part of edges from training stage to be unmasked during
        # validation
        indices_val_user = np.random.choice(indices_user, num_val_show)
        indices_val_item = np.random.choice(indices_item, num_val_show)

        train_mask = torch.ones(self.num_user, self.num_item)
        train_mask[indices_user, indices_item] = 0

        val_mask = train_mask.clone()
        val_mask[indices_val_user, indices_val_item] = 1

        test_mask = torch.ones_like(train_mask)

        return train_mask, val_mask, test_mask
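
Two details of the split above are easy to miss: `np.random.randint` samples cells with replacement, so fewer cells than requested can end up masked, and the validation users and items are drawn independently, so the revealed pairs are not guaranteed to be masked pairs. The sketch below is a variant, not the notebook's method. It uses a hypothetical helper `split_masks` that hides an exact fraction of cells via a permutation and reveals a subset of exactly those cells for validation.

```python
import numpy as np

def split_masks(num_user, num_item, val_frac=0.2, test_frac=0.1, seed=0):
    """Hide (val_frac + test_frac) of all cells in training; reveal val_frac of them in validation."""
    rng = np.random.default_rng(seed)
    n = num_user * num_item
    perm = rng.permutation(n)
    n_val = round(val_frac * n)
    n_test = round(test_frac * n)
    hidden = perm[:n_val + n_test]  # cells hidden during training
    val_shown = hidden[:n_val]      # subset revealed again for validation

    train_mask = np.ones(n)
    train_mask[hidden] = 0
    val_mask = train_mask.copy()
    val_mask[val_shown] = 1
    test_mask = np.ones(n)

    shape = (num_user, num_item)
    return train_mask.reshape(shape), val_mask.reshape(shape), test_mask.reshape(shape)

train_mask, val_mask, test_mask = split_masks(10, 20)
print(train_mask.mean(), val_mask.mean(), test_mask.mean())  # 0.7 0.9 1.0
```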

# LightGCN neighborhood aggregation layer

Starting with the initial embeddings $E^{(0)}$ and the bipartite graph, we iterate over each node to perform neighborhood aggregation. Note that LightGCN uses **a simple weighted sum aggregator** and **avoids the heavy-lifting feature transformation and nonlinear activation**.

Within each layer, for each user in the graph, we compute its updated embedding as the weighted sum of embeddings from all its neighboring items (movies) following the formula below:
$$ \textbf{e}_u^{(k+1)} = \sum_{i \in N_u} \frac{1}{\sqrt{|N_u|} \sqrt{|N_i|}} \textbf{e}_i^{(k)} $$
where $ \textbf{e}_u^{(k)} $ and $ \textbf{e}_i^{(k)} $ are the user and item (movie) node embeddings at the k-th layer. $ |N_u| $ and $ |N_i| $ are the user and item nodes’ number of neighbors.

Similarly, for each item, the updated embedding is computed using weighted sum of its neighboring users:
$$ \textbf{e}_i^{(k+1)} = \sum_{u \in N_i} \frac{1}{\sqrt{|N_i|} \sqrt{|N_u|}} \textbf{e}_u^{(k)} $$
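
The two update rules can be checked numerically. This is a minimal NumPy sketch on a made-up graph with 2 users and 3 items; the embeddings are chosen by hand and are not taken from the notebook.

```python
import numpy as np

# Toy binary user-item matrix: 2 users x 3 items (made-up data).
user_item = np.array([[1.0, 1.0, 0.0],
                      [0.0, 1.0, 1.0]])

deg_user = user_item.sum(axis=1)  # |N_u| for each user
deg_item = user_item.sum(axis=0)  # |N_i| for each item

# weights[u, i] = 1 / sqrt(|N_u| * |N_i|) on edges, 0 elsewhere.
denom = np.sqrt(np.outer(deg_user, deg_item))
weights = np.divide(user_item, denom, out=np.zeros_like(user_item), where=denom > 0)

# Layer-k embeddings (dimension 2), chosen by hand for illustration.
user_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
item_emb = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]])

new_user_emb = weights @ item_emb    # each user sums over its neighboring items
new_item_emb = weights.T @ user_emb  # each item sums over its neighboring users
print(new_user_emb)
print(new_item_emb)
```

Item 1 is connected to both users, each with weight 1 / sqrt(2 * 2) = 0.5, so its new embedding is the average of the two user embeddings.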

In [5]:
class LightGCNConv(MessagePassing):
    r"""The neighbor aggregation operator from the `"LightGCN: Simplifying and
    Powering Graph Convolution Network for Recommendation"
    <https://arxiv.org/abs/2002.02126#>`_ paper

    Args:
        in_channels (int): Size of each input sample, or :obj:`-1` to derive
            the size from the first input(s) to the forward method.
        out_channels (int): Size of each output sample.
        num_users (int): Number of users for recommendation.
        num_items (int): Number of items to recommend.
        **kwargs (optional): Additional arguments of
            :class:`torch_geometric.nn.conv.MessagePassing`.
    """
    def __init__(self, 
                 in_channels: int, 
                 out_channels: int,
                 num_users: int, 
                 num_items: int, 
                 **kwargs):
        
        super(LightGCNConv, self).__init__(**kwargs)

        self.in_channels = in_channels
        self.out_channels = out_channels

        self.num_users = num_users
        self.num_items = num_items

        self.reset_parameters()

    def reset_parameters(self):
        pass  # There are no layer parameters to learn.

    def forward(self, x: Tensor, edge_index: Adj) -> Tensor:
        """Performs neighborhood aggregation for user/item embeddings."""
        user_item = torch.zeros(self.num_users, self.num_items, device=x.device)
        
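        # What is this? It sets entry (u, i) to 1 for every (user, item) pair in edge_index, rebuilding the dense user-item matrix.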
        user_item[edge_index[:, 0], edge_index[:, 1]] = 1
        
        user_neighbor_counts = torch.sum(user_item, axis=1)
        
        item_neighbor_counts = torch.sum(user_item, axis=0)

        # Compute weight for aggregation: 1 / sqrt(N_u * N_i)
        weights = user_item / torch.sqrt(user_neighbor_counts.repeat(self.num_items, 1).T *
                                         item_neighbor_counts.repeat(self.num_users, 1))
        
        weights = torch.nan_to_num(weights, nan=0)
        
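        # Q: What is @? A: a @ b is matrix multiplication (same as torch.matmul). ref: https://stackoverflow.com/questions/6392739/what-does-the-at-symbol-do-in-python
        # Q: Where did the items go? A: weights @ x[self.num_users:] aggregates item embeddings into users,
        # and weights.T @ x[:self.num_users] aggregates user embeddings into items. The user block must
        # come first so that the output keeps the same [users; items] row order as x.
        out = torch.cat((weights @ x[self.num_users:], weights.T @ x[:self.num_users]), 0)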
        return out

    # Q: What is this, for pretty printing? A: Python's __repr__() returns the string representation of the object.
    def __repr__(self):
        return '{}({}, {})'.format(self.__class__.__name__, self.in_channels, self.out_channels)


# LightGCN model

At layer combination, instead of taking the embedding of the final layer, LightGCN computes **a weighted sum of the embeddings at different layers**:
$$ \textbf{e}_u = \sum_{k=0}^K \alpha_k \textbf{e}_u^{(k)} $$
$$ \textbf{e}_i = \sum_{k=0}^K \alpha_k \textbf{e}_i^{(k)} $$
with $ \alpha_k \ge 0 $. Here, the $ \alpha_k $ values can either be learned as network parameters or set as empirical hyperparameters. It has been found that $ \alpha_k = \frac{1}{K + 1} $ works well.

LightGCN predicts based on the inner product of the final user and item (movie) embeddings:
$$ \hat{y}_{ui} = \textbf{e}_u^T \textbf{e}_i $$
This inner product measures the similarity between the user and movie, therefore allowing us to understand how likely it is for the user to like the movie.
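
As a worked example of the layer combination and scoring, here is a minimal NumPy sketch for one user and one item with K = 2. The per-layer embeddings are made up for illustration.

```python
import numpy as np

# Hypothetical embeddings of one user and one item at layers k = 0, 1, 2 (K = 2).
user_layers = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
item_layers = np.array([[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]])

K = user_layers.shape[0] - 1
alpha = np.full(K + 1, 1.0 / (K + 1))  # uniform weights alpha_k = 1 / (K + 1)

e_u = alpha @ user_layers  # weighted sum over layers: final user embedding
e_i = alpha @ item_layers  # weighted sum over layers: final item embedding
score = e_u @ e_i          # inner product, the predicted preference
print(e_u, e_i, score)
```

Here the final embeddings are approximately (0.5, 0.5) and (1, 1), giving a score of about 1. Note that the `forward` method below stacks only the outputs of the convolution layers, so the layer-0 term of this sum does not appear in that implementation.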

In [6]:
class LightGCN(nn.Module):
    def __init__(self, 
                 config: dict,
                 device=None,
                 **kwargs):
        super().__init__()

        self.num_users  = config["n_users"]
        self.num_items  = config["m_items"]
        self.embedding_size = config["embedding_size"]
        self.in_channels = self.embedding_size
        self.out_channels = self.embedding_size
        self.num_layers = config["num_layers"]

        # 0-th layer embedding.
        self.embedding_user_item = torch.nn.Embedding(
            num_embeddings=self.num_users + self.num_items,
            embedding_dim=self.embedding_size)
        self.alpha = None

        # random normal init seems to be a better choice since LightGCN does
        # not use any non-linear activation function
        nn.init.normal_(self.embedding_user_item.weight, std=0.1)
        print('use NORMAL distribution initializer')

        self.f = nn.Sigmoid()

        self.convs = nn.ModuleList()
        self.convs.append(LightGCNConv(
                self.embedding_size, self.embedding_size,
                num_users=self.num_users, num_items=self.num_items, **kwargs))

        for _ in range(1, self.num_layers):
            self.convs.append(
                LightGCNConv(
                        self.embedding_size, self.embedding_size, 
                        num_users=self.num_users, num_items=self.num_items,
                        **kwargs))

        self.device = None
        if device is not None:
            self.convs.to(device)
            self.device = device

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()

    def forward(self, x: Tensor, edge_index: Adj, *args, **kwargs) -> Tensor:
        xs: List[Tensor] = []

        # At first this looked wrong, as if the GCN layers were not chained together, but they actually are.
        # References (especially the first one):
        # - https://www.kaggle.com/code/dipanjandas96/lightgcn-pytorch-from-scratch
        # - https://medium.com/stanford-cs224w/lightgcn-with-pytorch-geometric-91bab836471e 
        #   - This one also uses MovieLens: https://colab.research.google.com/drive/1KKugoFyUdydYC0XRyddcROzfQdMwDcnO?usp=sharing
        # - https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/nn/models/lightgcn.html
        edge_index = torch.nonzero(edge_index)
        for i in range(self.num_layers):
            # This is where the LightGCNConv layers are chained: the embedding x from the previous
            # layer is passed to the next one and each new embedding is appended to the list.
            x = self.convs[i](x, edge_index, *args, **kwargs)
            if self.device is not None:
                x = x.to(self.device)
            xs.append(x)
        xs = torch.stack(xs)
        
        self.alpha = 1 / (1 + self.num_layers) * torch.ones(xs.shape)
        if self.device is not None:
            self.alpha = self.alpha.to(self.device)
            xs = xs.to(self.device)
        x = (xs * self.alpha).sum(dim=0)  # Sum along K layers.
        
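        # TODO: should x go through a final linear layer here, e.g. self.out = nn.Linear(64, 1)?
        # Then the target would also have to change from binary to the 5-point score.
        # I think I should trace a forward pass by hand.
        # Thoughts: rather than multiplying the item and user embeddings directly, maybe each
        # embedding should get its own weight vector before the dot product, and the result should be
        # compared with the actual score instead of the binary relevance. Then again, doesn't
        # LightGCNConv above already weight the embeddings?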
        return x

    def __repr__(self) -> str:
        return (f'{self.__class__.__name__}({self.in_channels}, '
                f'{self.out_channels}, num_layers={self.num_layers})')

# Utility functions

The utility functions allow us to retrieve embeddings and compute user-item similarities. These will become useful later on.

In [7]:
def getUsersRating(model, users, data):
    """ Get the predicted ratings of the given users for all items
    INPUT:
        model: the LightGCN model you are training on
        users: this is the user index (note: use 0-indexed and not user number, which is 1-indexed)
        data: the entire data, used to fetch all users and all items
    """
    all_users_items = model(model.embedding_user_item.weight.clone(), data["edge_index"])
    all_users = all_users_items[:len(data["users"])]
    items_emb = all_users_items[len(data["users"]):]
    users_emb = all_users[users.long()]
    rating = model.f(torch.matmul(users_emb, items_emb.t()))
    print(f"getUsersRating rating {rating}")
    return rating

def getEmbedding(model, users, pos, neg, data, mask):
    """
    INPUT:
        model: the LightGCN model you are training on
        users: this is the user index (note: use 0-indexed and not user number,
            which is 1-indexed)
        pos: positive index corresponding to an item that the user like
        neg: negative index corresponding to an item that the user doesn't like
        data: the entire data, used to fetch all users and all items
        mask: Masking matrix indicating edges present in the current
            train / validation / test set.
    """
    # assuming we always search for users and items by their indices (instead of
    # user/item number)
    all_users_items = model(model.embedding_user_item.weight.clone(), data["edge_index"] * mask)
    all_users = all_users_items[:len(data["users"])]
    all_items = all_users_items[len(data["users"]):]
    users_emb = all_users[users]
    pos_emb = all_items[pos]
    neg_emb = all_items[neg]
    n_user = len(data["users"])
    users_emb_ego = model.embedding_user_item(users)
    
    # offset the index to fetch embedding from user_item
    pos_emb_ego = model.embedding_user_item(pos + n_user)
    neg_emb_ego = model.embedding_user_item(neg + n_user)
    
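    # What is ego? The *_ego values are the layer-0 embeddings looked up directly from the embedding table, before any propagation.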
    return users_emb, pos_emb, neg_emb, users_emb_ego, pos_emb_ego, neg_emb_ego

# Bayesian Personalized Ranking loss (BPR loss)

To train the LightGCN model, we need an objective function that aligns with our goal for movie recommendation. We use the Bayesian Personalized Ranking (BPR) loss, which encourages observed user-item predictions to have increasingly higher values than unobserved ones, along with $ L_2 $ regularization:
$$ L_{BPR} = - \sum_{u=1}^M \sum_{i \in N_u} \sum_{j \notin N_u} \ln \sigma(\hat{y}_{ui} - \hat{y}_{uj}) + \lambda ||\textbf{E}^{(0)} ||^2 $$
where $ \textbf{E}^{(0)} $ is a matrix with column vectors being the 0-th layer embeddings to learn.
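
The ranking term of this loss can be checked by hand. Since $ -\ln \sigma(x) = \ln(1 + e^{-x}) $, each term equals softplus applied to (negative score minus positive score). This is a minimal pure-Python sketch with made-up scores; the helper names are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpr_term(pos_score, neg_score):
    """-ln sigma(pos - neg), written as softplus(neg - pos)."""
    return softplus(neg_score - pos_score)

good = bpr_term(pos_score=2.0, neg_score=-1.0)  # positive item ranked above the negative one
bad = bpr_term(pos_score=-1.0, neg_score=2.0)   # positive item ranked below the negative one
print(good, bad)
```

The loss is small when the positive item scores higher than the negative one and grows roughly linearly when the order is reversed.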

In [8]:
# Bayesian Personalized Ranking (BPR) loss 
# Q: Why use this loss?
# - https://towardsdatascience.com/recommender-system-using-bayesian-personalized-ranking-d30e98bba0b9
# - https://d2l.ai/chapter_recommender-systems/ranking.html

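# Q: Could we skip BPR and train with RMSE instead, feeding the two embeddings straight into a linear layer?
# That might need a heterogeneous graph, so check how to do it. Actually a heterogeneous graph is not needed here.
#
# From a quick look at the formula, BPR is essentially a cross-entropy loss with negative sampling.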

def bpr_loss(model, users, pos, neg, data, mask):
    """ 
    INPUT:
        model: the LightGCN model you are training on
        users: this is the user index (note: use 0-indexed and not user number,
            which is 1-indexed)
        pos: positive index corresponding to an item that the user like
            (0-indexed, note to index items starting from 0)
        neg: negative index corresponding to an item that the user doesn't like
        data: the entire data, used to fetch all users and all items
        mask: Masking matrix indicating edges present in the current
            train / validation / test set.
    OUTPUT:
        loss, reg_loss
    """
    # assuming we always sample the same number of positive and negative sample
    # per user
    assert len(users) == len(pos) and len(users) == len(neg)
    
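    # What is each returned item? Propagated embeddings for users, positives and negatives, followed by their layer-0 embeddings (see getEmbedding).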
    (users_emb, pos_emb, neg_emb, userEmb0,  posEmb0, negEmb0) = getEmbedding(model, 
                                                                              users.long(), 
                                                                              pos.long(),
                                                                              neg.long(), 
                                                                              data, 
                                                                              mask)
    
    reg_loss = (1/2)*(userEmb0.norm(2).pow(2) + 
                        posEmb0.norm(2).pow(2)  +
                        negEmb0.norm(2).pow(2))/float(len(users))

    pos_scores = torch.mul(users_emb, pos_emb)
    pos_scores = torch.sum(pos_scores, dim=1)
    neg_scores = torch.mul(users_emb, neg_emb)
    neg_scores = torch.sum(neg_scores, dim=1)
    
    loss = torch.mean(torch.nn.functional.softplus(neg_scores - pos_scores))
    
    return loss, reg_loss

# Personalized top K precision and recall

To evaluate training progress and model performance, we compute the **top K precision and recall** scores. Specifically, for each user, we rank movie items in order of decreasing similarity and choose the best K to recommend. Then, we compute the precision and recall of those K recommendations against ground truth items that the user likes and dislikes.
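
A minimal pure-Python sketch of top K precision and recall for a single user, with made-up scores and a hypothetical helper name. It does not pad short prediction lists, so it differs from the notebook's function in that respect.

```python
def precision_recall_at_k(scores, liked, k):
    """scores: predicted score per item index; liked: set of item indices the user likes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = ranked[:k]
    hits = sum(1 for i in top_k if i in liked)
    precision = hits / k
    recall = hits / len(liked) if liked else 0.0
    return precision, recall

scores = [0.9, 0.1, 0.8, 0.3, 0.7]  # one score per item
liked = {0, 3, 4}                   # ground-truth positives

print(precision_recall_at_k(scores, liked, k=2))  # (0.5, 0.3333333333333333)
```

With K = 2 the top items are 0 and 2. Only item 0 is liked, so precision is 1/2 and recall is 1/3.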

In [9]:
def personalized_topk(pred, K, user_indices, edge_index):
    """Computes TopK precision and recall.

    Args:
        TODO: What exactly is `pred`? Is it each test-set user's score for each item? It seems to come from getUsersRating.
        pred: Predicted similarities between user and item.
        K: Number of items to rank.
        user_indices: Indices of users for each prediction in `pred`.
        edge_index: User and item connection matrix.

    Returns:
        Average Top K precision and recall for users in `user_indices`.
    """
    
    # user_id -> list of predicted items
    per_user_preds = collections.defaultdict(list)
    
    for index, user in enumerate(user_indices):
        per_user_preds[user.item()].append(pred[index].item())
        
    precisions = 0.0
    recalls = 0.0
    
    for user, preds in per_user_preds.items():
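        # What happens if a user has fewer than K predictions? It looks like this pads with random picks until there are K, which is very different from how I wrote it for NCF.
        # There I do not pad at all: e.g. if 4 predictions are relevant and there are 5 relevant items in total, I use 5 instead of K. That alone can make a sizeable difference in the metric.
        # TODO: try changing this tomorrow; this looks very suspicious to me. Is it padding the predicted relevant items up to 10?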
        while len(preds) < K:
            preds.append(random.choice(range(edge_index.shape[1])))
        
        # Not sure what this step does.
        top_ratings, top_items = torch.topk(torch.tensor(preds), K)
        
        # I don't fully understand this step either; I haven't looked at how the edges are processed.
        correct_preds = edge_index[user, top_items].sum().item()
        
        total_pos = edge_index[user].sum().item()
        
        precisions += correct_preds / K
        
        recalls += correct_preds / total_pos if total_pos != 0 else 0
    
    num_users = len(user_indices.unique())
    return precisions / num_users, recalls / num_users
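
One thing worth checking in `personalized_topk` (this is my reading of the code, not something the author states): `torch.topk` returns positions inside the tensor it is given, so `top_items` would be positions within `preds` rather than item ids, and `preds` itself only stores scores. The pure-Python sketch below mimics that behaviour and shows one way to map positions back to item ids. `topk_item_ids` and the toy data are hypothetical.

```python
def topk_item_ids(item_ids, scores, k):
    """Return (positions, item ids) of the k highest scores.

    `positions` is what a top-k routine reports: indices into `scores`.
    The item ids have to be looked up separately.
    """
    positions = sorted(range(len(scores)),
                       key=lambda i: scores[i], reverse=True)[:k]
    return positions, [item_ids[i] for i in positions]

# Three sampled items for one user, with their predicted scores.
item_ids = [42, 7, 1500]
scores = [0.2, 0.9, 0.5]

positions, top_items = topk_item_ids(item_ids, scores, k=2)
print(positions)   # positions inside `scores`
print(top_items)   # the item ids those positions refer to
```

If this reading is right, indexing `edge_index[user, positions]` would look up columns 1 and 2 instead of items 7 and 1500, which could explain part of the low precision and recall.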

In [10]:
def _sample_pos_neg(data, mask, num_samples_per_user):
    """Samples (user, positive item, negative item) tuples per user.

    If a user does not have a positive (negative) item, we choose an item
    with unknown liking (an item without raw rating data).

    Args:
        data: Dataset object containing edge_index and raw ratings matrix.
        mask: Masking matrix indicating edges present in the current
            train / validation / test set.
        num_samples_per_user: Number of samples to generate for each user.

    Returns:
        torch.Tensor object of (user, positive item, negative item) samples.
    """
    print("=====Starting to sample=====")
    start = time.time()
    samples = []
    all_items = set(range(len(data["items"])))
    for user_index, user in enumerate(data["users"]):
        pos_items = set(
            torch.nonzero(data["edge_index"][user_index])[:, 0].tolist())
        unknown_items = all_items.difference(
                set(
                    torch.nonzero(
                        data["raw_edge_index"][user_index])[:, 0].tolist()))
        neg_items = all_items.difference(
            set(pos_items)).difference(set(unknown_items))
        unmasked_items = set(torch.nonzero(mask[user_index])[:, 0].tolist())
        if len(unknown_items.union(pos_items)) == 0 or \
                len(unknown_items.union(neg_items)) == 0:
            continue
        for _ in range(num_samples_per_user):
            if len(pos_items.intersection(unmasked_items)) == 0:
                pos_item_index = random.choice(
                    list(unknown_items.intersection(unmasked_items)))
            else:
                pos_item_index = random.choice(
                    list(pos_items.intersection(unmasked_items)))
            if len(neg_items.intersection(unmasked_items)) == 0:
                neg_item_index = random.choice(
                    list(unknown_items.intersection(unmasked_items)))
            else:
                neg_item_index = random.choice(
                    list(neg_items.intersection(unmasked_items)))
            samples.append((user_index, pos_item_index, neg_item_index))
    end = time.time()
    print(f"=====Sampling completed (took {end - start} seconds)=====")
    return torch.tensor(samples, dtype=torch.int32)

def sample_pos_neg(data, train_mask, val_mask, test_mask, num_samples_per_user):
    """Samples (user, positive item, negative item) tuples per user.

    If a user does not have a positive (negative) item, we choose an item
    with unknown liking (an item without raw rating data).

    Args:
        data: Dataset object containing edge_index and raw ratings matrix.
        train_mask: Masking matrix indicating edges present in train set.
        val_mask: Masking matrix indicating edges present in validation set.
        test_mask: Masking matrix indicating edges present in test set.
        num_samples_per_user: Number of samples to generate for each user.

    Returns:
        torch.Tensor object of (user, positive item, negative item) samples for
        train, validation and test.
    """
    train_samples = _sample_pos_neg(data, train_mask, num_samples_per_user)
    val_samples = _sample_pos_neg(data, val_mask, num_samples_per_user)
    test_samples = _sample_pos_neg(data, test_mask, num_samples_per_user)
    return train_samples, val_samples, test_samples
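
To make the sampling rule easier to follow, here is a stripped-down sketch for a single user, assuming the item sets are already known. It ignores the train / validation / test mask that `_sample_pos_neg` applies, and `sample_triples` is a name I made up. The fallback to unknown items mirrors the docstring above.

```python
import random

def sample_triples(user, pos_items, neg_items, unknown_items, num_samples, rng):
    """Sample (user, positive item, negative item) tuples for one user.

    Falls back to unknown items when the user has no positive
    (or no negative) items, as described in the docstring above.
    """
    pos_pool = sorted(pos_items) or sorted(unknown_items)
    neg_pool = sorted(neg_items) or sorted(unknown_items)
    if not pos_pool or not neg_pool:
        return []  # nothing to sample from, skip this user
    return [(user, rng.choice(pos_pool), rng.choice(neg_pool))
            for _ in range(num_samples)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
triples = sample_triples(user=0, pos_items={1, 2}, neg_items={5},
                         unknown_items={8, 9}, num_samples=4, rng=rng)

# A user with no negative ratings draws "negatives" from unknown items.
fallback = sample_triples(user=1, pos_items={3}, neg_items=set(),
                          unknown_items={8, 9}, num_samples=4, rng=rng)
print(triples, fallback)
```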

# Prep Training 

In [11]:
# getcwd: get current working directory in absolute path
root = os.getcwd()

movielens = MovieLens(root=root, transform=trans_ml)

data = movielens.get()

train_mask, val_mask, test_mask = movielens.train_val_test_split(val_frac=config_dict["val_frac"],
                                                                 test_frac=config_dict["test_frac"])

n_users = len(data["users"].unique())
m_items = len(data["items"].unique())
print(f"#Users: {n_users}")
print(f"#Items: {m_items}")

model_config = {
    "n_users": n_users,
    "m_items": m_items,
    "embedding_size": config_dict["embedding_size"],
    "num_layers": config_dict["num_layers"],
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

lightGCN = LightGCN(model_config, device=device)

num_samples_per_user = config_dict["num_samples_per_user"]
epochs = config_dict["epochs"]
batch_size = config_dict["batch_size"]
lr = config_dict["lr"]
weight_decay = config_dict["weight_decay"]

K = config_dict["K"]

lightGCN.to(device)

samples_train, samples_val, samples_test =  sample_pos_neg(data, train_mask, val_mask, test_mask,
                                                          num_samples_per_user)

# Move these tensors onto the GPU, if one is available
samples_train=samples_train.to(device)
samples_val=samples_val.to(device)
samples_test=samples_test.to(device)

train_mask=train_mask.to(device)
val_mask=val_mask.to(device)
test_mask=test_mask.to(device)

data = data.to(device)

print(f"#Training samples: {len(samples_train)}",
      f"#Validation samples: {len(samples_val)}",
      f"#Test samples: {len(samples_test)}")

optimizer = optim.Adam(lightGCN.parameters(), lr=lr)
print("Optimizer:", optimizer)


#Users: 200
#Items: 3883
use NORMAL distribution initilizer
=====Starting to sample=====
=====Sampling completed (took 6.090041160583496 seconds)=====
=====Starting to sample=====
=====Sampling completed (took 5.583359003067017 seconds)=====
=====Starting to sample=====
=====Sampling completed (took 5.980459928512573 seconds)=====
#Training samples: 100000 #Validation samples: 100000 #Test samples: 100000
Optimizer: Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.001
    maximize: False
    weight_decay: 0
)


# Training Loop

In [13]:

epochs_tracked = []
train_topks = []
val_topks = []
loss_plot = []

for epoch in range(epochs):
    print("Training on the {} epoch".format(epoch))
    lightGCN.train()
    loss_sum = 0
    # Shuffle the order of rows.
    samples_train = samples_train[torch.randperm(samples_train.size()[0])]
    
    for batch_idx in range(math.ceil(len(samples_train) / batch_size)):
        optimizer.zero_grad()

        current_batch = samples_train[batch_idx*batch_size: (batch_idx+1)*batch_size]

        # Shuffle the order of rows.
        current_batch = current_batch[torch.randperm(current_batch.size()[0])]
        users = current_batch[:, 0:1]
        pos = current_batch[:, 1:2]
        neg = current_batch[:, 2:3]

        loss, reg_loss = bpr_loss(lightGCN, users, pos, neg, data, train_mask)
        reg_loss = reg_loss * weight_decay
        loss = loss + reg_loss
        loss_sum += loss.detach()
        
        loss_plot.append(loss.item())

        loss.backward()
        optimizer.step()

        if batch_idx % config_dict["minibatch_per_print"] == 0:
            all_users = torch.linspace(start=0, end=n_users - 1, steps=n_users).long()
            user_indices = current_batch[:, 0]
            user_indices = user_indices.repeat(2).long()
            item_indices = torch.cat((current_batch[:, 1], current_batch[:, 2])).long()
            
            pred = getUsersRating(lightGCN,
                                  all_users,
                                  data)[user_indices, item_indices]
            
            truth = data["edge_index"][user_indices, item_indices]
            
            topk_precision, topk_recall = personalized_topk(pred, K, user_indices, data["edge_index"])

            print("Training on epoch {} minibatch {}/{} completed\n".format(epoch, batch_idx+1,
                                                                            math.ceil(len(samples_train) / batch_size)),
                  "bpr_loss on current minibatch is {}, and regularization loss is {}.\n".format(round(float(loss.detach().cpu()), 6),
                                                                                                 round(float(reg_loss.detach().cpu()), 6)),
                  "Top K precision = {}, recall = {}.".format(topk_precision, topk_recall))

    if epoch % config_dict["epochs_per_print"] == 0:
        epochs_tracked.append(epoch)

        # evaluation on both the training and validation set
        lightGCN.eval()
        
        # predict on the training set
        users = samples_train[:, 0:1]
        user_indices = samples_train[:, 0]
        user_indices = user_indices.repeat(2).long()
        item_indices = torch.cat((samples_train[:, 1], samples_train[:, 2])).long()
        
        pred = getUsersRating(lightGCN,
                              users[:,0],
                              data)[user_indices, item_indices]
        
        truth = data["edge_index"][users.long()[:,0]][user_indices, item_indices]

        train_topk_precision, train_topk_recall = personalized_topk(pred, K, user_indices, data["edge_index"])

        train_topks.append((train_topk_precision, train_topk_recall))

        # predict on the validation set
        users_val = samples_val[:, 0:1]
        pos_val = samples_val[:, 1:2]
        neg_val = samples_val[:, 2:3]

        loss_val, reg_loss_val = bpr_loss(lightGCN, users_val, pos_val, neg_val, data, val_mask)
        
        reg_loss_val = reg_loss_val * weight_decay

        # compute top K metrics on the validation set
        user_indices = samples_val[:, 0]
        
        user_indices = user_indices.repeat(2).long()
        
        item_indices = torch.cat((samples_val[:, 1], samples_val[:, 2])).long()
        
        pred_val = getUsersRating(lightGCN,
                                  users_val[:,0],
                                  data)[user_indices, item_indices]
        
        truth_val = data["edge_index"][users_val.long()[:,0]][user_indices, item_indices]
        
        val_topk_precision, val_topk_recall = personalized_topk(pred_val, K, user_indices, data["edge_index"])
        
        val_topks.append((val_topk_precision, val_topk_recall))

        print("\nTraining on {} epoch completed.\n".format(epoch),
              "Average bpr_loss on train set is {} for the current epoch.\n".format(round(float(loss_sum/len(samples_train)), 6)),
              "Training top K precision = {}, recall = {}.\n".format(train_topk_precision, train_topk_recall),
              "Average bpr_loss on the validation set is {}, and regularization loss is {}.\n".format(round(float((loss_val+reg_loss_val)/len(samples_val)), 6),
                                                                                                      round(float(reg_loss_val/len(samples_val)), 6)),
              "Validation top K precision = {}, recall = {}.\n".format(val_topk_precision, val_topk_recall))

Training on the 0 epoch
getUsersRating rating tensor([[0.5000, 0.5000, 0.5000,  ..., 0.5001, 0.4999, 0.5000],
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.4999, 0.5000],
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        ...,
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000,  ..., 0.5001, 0.4999, 0.5000]],
       grad_fn=<SigmoidBackward0>)
Training on epoch 0 minibatch 1/782 completed
 bpr_loss on current minibatch is 0.271874, and regularization loss is --.
 Top K precision = 0.08749999999999998, recall = 0.0077243154779485105.
getUsersRating rating tensor([[0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        ...,
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000,  ...

KeyboardInterrupt: 
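
The `1/782` in the log comes from the mini-batch slicing in the loop. A quick sketch of that arithmetic, assuming a batch size of 128 (my guess; it is the value consistent with 100000 samples giving 782 batches, but `config_dict` is not shown here):

```python
import math

def batch_slices(num_samples, batch_size):
    """Return (start, end) index pairs covering all samples in order."""
    num_batches = math.ceil(num_samples / batch_size)
    return [(b * batch_size, min((b + 1) * batch_size, num_samples))
            for b in range(num_batches)]

slices = batch_slices(num_samples=100000, batch_size=128)
print(len(slices))   # number of mini-batches per epoch
print(slices[-1])    # the last batch is shorter than the others
```

In the notebook the upper bound is not clamped with `min`; tensor slicing past the end simply returns the remaining rows, so the effect is the same.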

In [None]:
plt.plot(epochs_tracked, [precision for precision, _ in train_topks], label="Train")

plt.plot(epochs_tracked, [precision for precision, _ in val_topks], label="Val")
plt.ylabel(f"Top {K} precision")
plt.xlabel("Epochs")
plt.legend()
plt.show()

In [None]:
plt.plot(loss_plot)

In [None]:
plt.plot(epochs_tracked, [recall for _, recall in train_topks],
         label="Train")
plt.plot(epochs_tracked, [recall for _, recall in val_topks],
         label="Val")
plt.ylabel(f"Top {K} recall")
plt.xlabel("Epochs")
plt.legend()
plt.show()

In [55]:
# predict on the test set
lightGCN.eval()
print("Training completed after {} epochs".format(epochs))

users_test = samples_test[:, 0:1]
pos_test = samples_test[:, 1:2]
neg_test = samples_test[:, 2:3]

loss_test, reg_loss_test = bpr_loss(lightGCN, users_test, pos_test, neg_test, data, test_mask)

reg_loss_test = reg_loss_test * weight_decay

# compute top K metrics on the test set
user_indices = samples_test[:, 0]
user_indices = user_indices.repeat(2).long()
item_indices = torch.cat((samples_test[:, 1], samples_test[:, 2])).long()

pred_test = getUsersRating(lightGCN, users_test[:,0], data)[user_indices, item_indices]

truth_test = data["edge_index"][users_test.long()[:,0]][user_indices, item_indices]

test_topk_precision, test_topk_recall = personalized_topk(pred_test, K, user_indices, data["edge_index"])

print("Average bpr_loss on the test set is {}, and regularization loss is {}.\n"
      .format(
          round(float((loss_test+reg_loss_test)/len(samples_test)), 6),
          round(float(reg_loss_test/len(samples_test)), 6)
      ),
      "Top K precision = {}, recall = {}.".format(test_topk_precision, test_topk_recall))

# Save the whole model object (torch.save pickles the module, not only its embeddings).
torch.save(lightGCN, config_dict["model_name"])

Training completed after 10 epochs
getUsersRating rating tensor([[0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        ...,
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000]],
       grad_fn=<SigmoidBackward0>)
Average bpr_loss on the test set is 7e-06, and regularization loss is 0.0.
 Top K precision = 0.08699999999999994, recall = 0.007430810538437454.
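
A note on why the reported test loss is so tiny. `bpr_loss` already returns a mean (`torch.mean(...)`), and the print statement divides it by `len(samples_test)` again. If the scores are all close to zero, which the 0.5000 sigmoid outputs suggest, each loss term is about softplus(0) = ln 2. The sketch below checks that this guess reproduces the printed number; it is a sanity check under that assumption, not a confirmed diagnosis.

```python
import math

num_samples = 100000                 # size of samples_test from the log above
mean_loss_untrained = math.log(2)    # softplus(0), assuming all scores are ~0

# The notebook divides the (already averaged) loss by the sample count again.
reported = round(mean_loss_untrained / num_samples, 6)
print(reported)
```

So the `7e-06` is probably a mean loss of roughly 0.69 divided once too often, meaning the model has barely moved from its initial state.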


# Run matrix factorization as baseline performance

In [42]:
def matrix_factorization(user_item, rank):
    """Runs matrix factorization on `user_item` and get user-item similarities.

    Args:
        user_item: User-item connectivity matrix.
        rank: Number of latent factors used to represent each user / item.

    Returns:
        User-item similarities.
    """
    weights, (user_factors, item_factors) = decomposition.parafac(user_item, rank)
    similarities = user_factors @ item_factors.T
    return 1 / (1 + np.exp(- similarities))
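
The last two lines of `matrix_factorization` are just a dot product per (user, item) pair followed by a sigmoid. A tiny pure-Python sketch with hand-written rank-2 factors (made-up numbers, no tensorly or numpy involved; `mf_scores` is a hypothetical helper):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mf_scores(user_factors, item_factors):
    """Score matrix: sigmoid(user_factors @ item_factors.T), row per user."""
    return [[sigmoid(sum(u * v for u, v in zip(user_row, item_row)))
             for item_row in item_factors]
            for user_row in user_factors]

user_factors = [[1.0, 0.0],    # user 0 only cares about factor 0
                [0.0, 2.0]]    # user 1 only cares about factor 1
item_factors = [[3.0, 0.0],    # item 0 loads on factor 0
                [0.0, 0.0],    # item 1 carries no signal
                [0.0, 1.0]]    # item 2 loads on factor 1

scores_mf = mf_scores(user_factors, item_factors)
print(scores_mf)
```

An all-zero dot product gives exactly 0.5, so items a user has no affinity for sit at 0.5 rather than 0.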

In [43]:
# Compute baseline metrics using matrix factorization.
baseline_pred = matrix_factorization(
        data["edge_index"].detach().cpu().numpy(),
        config_dict["mf_rank"])[user_indices.cpu(), item_indices.cpu()]

baseline_topk_precision, baseline_topk_recall = personalized_topk(baseline_pred, 
                                                                  K, 
                                                                  user_indices, 
                                                                  data["edge_index"])

print("Baseline (PARAFAC matrix factorization) produces ",
      "Top K precision = {}, recall = {}.".format(baseline_topk_precision,
                                                  baseline_topk_recall))

Baseline (PARAFAC matrix factorization) produces  Top K precision = 0.03699999999999999, recall = 0.002789248078607412.


In [None]:
# Why is it so hard to fetch a single item here??
# For reference, see: GNN Project #2 - Creating a Custom Dataset in Pytorch Geometric
#      https://www.youtube.com/watch?v=QLIkOtKS4os

# Running Notes: 
Sep/10/2022
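Wow, this code is really complex. I have no idea how many hours the author spent writing it; I got lost reading it...
- If I can explain this code end to end, that would already be quite an achievement.
- Some parts I really cannot follow: partly because I am not familiar with PyG, and partly because my Python really needs to improve.
- Work through it bit by bit and see how long it takes. (I started reading it seriously on Sep/09.) Give myself two or three weeks.
- I think I also need to read it alongside the PyG [official documentation](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html), otherwise it is hard to follow.
- Also check how the node features are actually used here.
- TODO: think about whether the edges are handled poorly. Here a rating >3 basically becomes an edge and <3 becomes no edge. Compared with my NCF approach trained with RMSE, this throws away a lot of information, so I need to work out how to change it.
Resting for today; will keep working through it tomorrow.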

Sep/12/2022 
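- Found a GNN video channel that explains things pretty well and also covers PyG, worth learning from: https://www.youtube.com/c/DeepFindr/videos . The diagrams are explained well, but it does not walk through the code as carefully as I do.
- Here are examples from a Kumo.ai founder: https://github.com/rusty1s/examples
- AntonioLonga's videos seem to go through things in real detail; worth watching when I get a chance.

- If I still cannot follow, I will need to watch the Lindsey AI video: https://www.youtube.com/watch?v=-UjytpbqX4A&t=2016s
- I think switching the BPR loss to RMSE should be roughly equivalent? And it would still need random sampling?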

Sep/14/2022
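- It may not be a BPR problem at all. See https://www.kaggle.com/code/dipanjandas96/lightgcn-pytorch-from-scratch: basically the same idea, just written without PyG and probably with some different details, and it reaches a recall@10 of 0.2152.
- I suspect the evaluation function personalized_topk() is seriously flawed.
- Following this is exhausting. The code is very hard to read, and the Python types are poorly documented, so I have to dig them out bit by bit. If all else fails, should I read it in PyCharm?
I tried that and it did not help much: PyTorch's low-level code appears to be C, so clicking through does not show much either.
- Perhaps the best approach is still to manually step through a simple dataset.
- I started my own Data Exploration section and went through the dataset-creation part myself, and things feel much clearer, so I really do need to go step by step. Continuing tomorrow.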