## Setup

In [10]:
import torch

print(torch.__version__)

2.3.1


## Link Regression on the MovieLens Dataset

This notebook shows how to load a set of `*.csv` files into a `torch_geometric.data.HeteroData` object and how to train a [heterogeneous graph model](https://pytorch-geometric.readthedocs.io/en/latest/notes/heterogeneous.html#hgtutorial).

We are going to use the [MovieLens dataset](https://grouplens.org/datasets/movielens/), collected by the GroupLens Research group. The toy dataset describes movies, users, and their ratings. We are going to predict the rating a user gives to a movie.
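Graph libraries generally expect node indices to be contiguous integers starting at 0, while the raw CSV IDs are arbitrary. The sketch below shows one way to encode raw IDs and build an edge index in COO layout (a `[2, num_edges]` integer array). It uses a toy table with the same column names as the orders file; the use of `pd.factorize` is an assumption about how the encoding could be done, not necessarily how the mapping files in this notebook were produced.

```python
import numpy as np
import pandas as pd

# Toy orders table using the same column names as final_orders.csv.
orders = pd.DataFrame({
    "user_id": [6032699, 6032699, 5769655, 6508081],
    "item_id": [40101002000100, 40101002000100, 70100001001400, 70100001001400],
})

# factorize assigns each distinct raw ID a contiguous code 0..n-1
# in order of first appearance, and also returns the unique raw IDs.
user_codes, user_uniques = pd.factorize(orders["user_id"])
item_codes, item_uniques = pd.factorize(orders["item_id"])

# Edge index in COO layout: row 0 holds user indices, row 1 holds item indices.
edge_index = np.stack([user_codes, item_codes]).astype(np.int64)

print(edge_index)
print(len(user_uniques), len(item_uniques))
```

An array like this can be converted with `torch.from_numpy(edge_index)` and stored as the `edge_index` of an edge type in a `HeteroData` object; the edge type name you choose (for example `('user', 'buys', 'item')`) is up to you and is hypothetical here.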

## Data Ingestion

In [11]:
import pandas as pd

dataset_name = 'itstore'

orders_path = f'data/{dataset_name}/final_orders.csv'
prod_path = f'data/{dataset_name}/final_products.csv'
cus_path = f'data/{dataset_name}/final_customers.csv'

print(f'Loading orders from {orders_path}')
print(f'Loading customers from {cus_path}')
print(f'Loading products from {prod_path}')

Loading orders from data/itstore/final_orders.csv
Loading customers from data/itstore/final_customers.csv
Loading products from data/itstore/final_products.csv


In [12]:
import pandas as pd
# Load the orders, products, and customers tables into memory:
#orders_df = pd.read_csv(orders_path)[["user_id", "item_id"]]

orders_df = pd.read_csv(orders_path, header=0, sep=' ')[["user_id", "item_id"]]
prod_df = pd.read_csv(prod_path, header=0, sep=',')
cus_df = pd.read_csv(cus_path, header=0, sep=',')[["user_id", "user_name", "user_cat", "sf_cat"]]

# Display the entire row in one line
#pd.set_option('display.width', None)
#pd.set_option('display.max_colwidth', None)
pd.options.display.max_colwidth = 40

print('orders.csv:')
print('=' * 21)
print(orders_df[["user_id", "item_id"]].head())
print(f"Number of ratings: {len(orders_df)}")
print()
print('final_customers.csv:')
print('=' * 41)

print(cus_df.head())
print(f"Number of customers: {len(cus_df)}")
print()
print('final_products.csv:')
print('=' * 65)
print(prod_df.head())
print(f"Number of products: {len(prod_df)}")
print()

N_USERS = len(orders_df['user_id'].unique())
N_ITEMS = 17192 #len(orders_df['item_id'].unique())

print(f'Number of users: {N_USERS}')
print(f'Number of items: {N_ITEMS}')

orders.csv:
   user_id         item_id
0  6032699  40101002000100
1  6032699  40101002000100
2  5769655  70100001001400
3  6508081  70100001001400
4  6508081  70100001001400
Number of ratings: 229832

final_customers.csv:
   user_id                                user_name  user_cat    sf_cat
0       47  "Монгол улсын төлөөллийн байгууллагы...  Entities  6. Төсөл
1  9927654      50-н ортой 3-н цэцэрлэг барих төсөл  Entities  6. Төсөл
2      277  Beijing Oriental Shine International...  Entities    7. ОУБ
3  2154565                  Canon Singapore Pte Ltd  Entities    7. ОУБ
4      111                    Dell Asia Pacific Sdn  Entities    7. ОУБ
Number of customers: 7453

final_products.csv:
       item_id                                item_name    item_category  \
0            9         Гаалийн бүрдүүлэлт (0-1,000,000)  Барааны ангилал   
1           14                   1 хүн сарын ажлын хөлс  Барааны ангилал   
2           14         Зэрэгцээ компани нэмэлт мэдүүлэг  Барааны ангила

### Add your ratings manually


We recommend adding at least 10 ratings. Let's first check out the most rated movies. Additional movies in the table are: *Avatar*, *The Dark Knight*, *Pretty Woman*,
*Titanic*, *The Lion King*, *Jurassic Park*, *The Matrix*, *The Lord of the Rings* and *The Avengers*. Please note that the article in the movie title is often at the end of the title.

In [13]:

print('Most Purchased Items:')
print('=====================')

# Get the top 10 most purchased item IDs and their counts as a DataFrame
most_purchased_items = orders_df['item_id'].value_counts().head(10).reset_index()
most_purchased_items.columns = ['item_id', 'purchase_count']

# Merge with prod_df to get item names
most_purchased_items_df = most_purchased_items.merge(prod_df[['item_id', 'item_name']], on='item_id', how='left')

# Display the result
print(most_purchased_items_df[['item_name', 'purchase_count']])


Most Purchased Items:
                                  item_name  purchase_count
0                  itstore стандарт хүргэлт           13445
1   Service Software and Hardware  Оношл...            4198
2   Software ESET NOD32 Antivirus 8 New ...            3084
3                                       NaN            3084
4   Mouse: Dell Optical Wireless Mouse W...            1980
5                                       NaN            1980
6   Printers Supply Ink Cartridge: Canon...            1596
7   Mouse: Dell Optical Wired Mouse, MS1...            1539
8   Service of Printer Ажиллагаа ихтэй г...            1491
9    Printers Supply Ink Cartridge: Cano...            1458
10  Printers Supply Ink Cartridge: Canon...            1436
11  Printers Supply Ink Cartridge: Canon...            1425
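The table above has 12 rows although only 10 items were requested, and some names are `NaN` with the same count as the row before them. One plausible cause (an assumption; it is not confirmed by the source) is that `prod_df` contains several rows per `item_id`, some without a name, so the left merge repeats the matching rows. The products preview earlier does show `item_id` 14 twice. A minimal sketch on toy data that drops unnamed and duplicate product rows before merging:

```python
import pandas as pd

# Toy data: item 1 appears twice in the product table, once without a name.
orders = pd.DataFrame({"item_id": [1, 1, 1, 2, 2, 3]})
products = pd.DataFrame({
    "item_id": [1, 1, 2, 3],
    "item_name": ["Mouse", None, "Keyboard", "Ink"],
})

# Keep one named row per item_id so the merge cannot multiply rows.
unique_products = (
    products.dropna(subset=["item_name"])
            .drop_duplicates(subset="item_id")
)

# Top-2 items by purchase count, with explicit column names.
top_items = (
    orders["item_id"].value_counts().head(2)
          .rename_axis("item_id")
          .reset_index(name="purchase_count")
)

# validate raises a MergeError if the right side still has duplicate keys.
merged = top_items.merge(
    unique_products, on="item_id", how="left", validate="many_to_one"
)
print(merged[["item_name", "purchase_count"]])
```

Keeping the first named row per `item_id` is a simplification. If duplicate IDs refer to genuinely different products, the ID scheme itself needs fixing rather than the merge.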


In [14]:
print('Top Customers:')
print('=====================')

# Get the top 10 customers by number of orders and their counts as a DataFrame
top_customers = orders_df['user_id'].value_counts().head(10).reset_index()
top_customers.columns = ['user_id', 'purchase_count']

# Merge with cus_df to get customer names
top_customers_df = top_customers.merge(cus_df[['user_id', 'user_name']], on='user_id', how='left')

# Display the result
print(top_customers_df[['user_name', 'purchase_count']])

Top Customers:
                user_name  purchase_count
0                Хувь хүн           41544
1               Оюутолгой            9297
2           Дижитал повер            5733
3          Премиум нэксус            4755
4                 Хасбанк            3978
5            Могул сервис            3243
6                Киберком            2991
7       Нэкст-Электроникс            2809
8  Худалдаа хөгжлийн банк            1827
9          Могул экспресс            1670


## Evaluation

From the validation results, our model can generalize well to unseen data. The validation RMSE should be around 0.9, meaning that, on average, our model is off by roughly 0.9 stars. We can now evaluate our model on the test set and take a closer look at the predictions.
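As a small worked example of the metric, the sketch below computes RMSE on made-up predictions after clipping them to the valid rating range of 0 to 5. The numbers are illustrative only and are not results from this notebook. The mean absolute error is shown alongside because RMSE penalizes large errors more heavily, so "off by 0.9 stars on average" is an approximation.

```python
import numpy as np

# Made-up predictions and targets, for illustration only.
pred = np.array([4.5, 3.0, 6.2, 1.0])
target = np.array([5.0, 3.0, 4.0, 2.0])

# Clip predictions to the valid rating range before scoring.
pred = np.clip(pred, 0.0, 5.0)

rmse = np.sqrt(np.mean((pred - target) ** 2))  # root mean squared error
mae = np.mean(np.abs(pred - target))           # mean absolute error

print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")  # RMSE: 0.7500, MAE: 0.6250
```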

In [15]:
# from model import RecSysGNN
# import torch
import numpy as np

# device = "cuda" if torch.cuda.is_available() else "cpu"

config = {
    'model': 'hyperGCN',
    'dataset': 'itstore',
    'batch_size': 1024,
    'layers': 2,
    'epochs': 201,
    'e_attr_mode': 'exp',
    'g_seed': 2020,
    'edge': 'knn'
}

# cf_rec = RecSysGNN(model=config['model'], emb_dim=64,  n_layers=config['layers'], n_users=N_USERS, n_items=N_ITEMS, edge_attr_mode = config['e_attr_mode']).to(device)
    
#model_file_path = f"./models/params/{config['model']}_{device}_{config['g_seed']}_{config['dataset']}_{config['batch_size']}__{config['layers']}_{config['epochs']}_{config['edge']}"
    
#cf_rec.load_state_dict(torch.load(model_file_path, weights_only=True))

predictions = np.load(f"./models/preds/{config['model']}_{config['dataset']}_{config['batch_size']}__{config['layers']}_{config['edge']}.npy")

print(predictions[0])

# with torch.no_grad():
#     test_data = test_data.to(device)
#     pred = model(test_data.x_dict, test_data.edge_index_dict,
#                  test_data['user', 'movie'].edge_label_index)
#     pred = pred.clamp(min=0, max=5)
#     target = test_data['user', 'movie'].edge_label.float()
#     rmse = F.mse_loss(pred, target).sqrt()
#     print(f'Test RMSE: {rmse:.4f}')

# userId = test_data['user', 'movie'].edge_label_index[0].cpu().numpy()
# movieId = test_data['user', 'movie'].edge_label_index[1].cpu().numpy()
# pred = pred.cpu().numpy()
# target = target.cpu().numpy()

# print(pd.DataFrame({'userId': userId, 'movieId': movieId, 'rating': pred, 'target': target}))

[13.38094805 -2.75934431 -2.45643293 ... -1.56500767 -0.37430094
 -2.34160023]


In [16]:
# Define a function that maps an original user ID to its encoded user ID
import pandas as pd

def get_user_id(original_u_id):

    # Load user_id_mapping.csv, which maps original user IDs to encoded user IDs
    u_id_map = pd.read_csv("data/itstore/user_id_mapping.csv")

    # Look up the encoded ID for the given original user ID
    # Note: Adjust the condition based on your actual data structure
    user_id = u_id_map[u_id_map['original_user_id'] == original_u_id]['encoded_user_id']
    return user_id.iloc[0] if not user_id.empty else None  # Return None if not found

#orig_user_id = 2693321 # Khan bank
orig_user_id = 2702673 # APU
user_id = get_user_id(orig_user_id)

print(f"User ID for original user ID {orig_user_id}: {user_id}")

User ID for original user ID 2702673: None
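The lookup returned `None`, which means the original ID was not found in the mapping table. A possible alternative, sketched here on a toy mapping with the same column names, is to build a dictionary once and normalize IDs to `int` so that a type mismatch (for example a string ID) cannot cause a silent miss. The helper name `encode_user_id` is hypothetical.

```python
import pandas as pd

# Toy mapping table with the same column names as user_id_mapping.csv.
u_id_map = pd.DataFrame({
    "original_user_id": [2693321, 47, 277],
    "encoded_user_id": [0, 1, 2],
})

# Build the lookup once instead of re-reading the CSV on every call.
orig_to_encoded = dict(zip(
    u_id_map["original_user_id"].astype(int).tolist(),
    u_id_map["encoded_user_id"].astype(int).tolist(),
))

def encode_user_id(original_id):
    """Return the encoded ID, or None if the original ID is unknown."""
    return orig_to_encoded.get(int(original_id))

print(encode_user_id(47))       # 1
print(encode_user_id(2702673))  # None
```

Callers should check for `None` before using the result as an array index.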


In [17]:
import numpy as np

top_K = 20

print(predictions[user_id])

# Indices of the top-K highest-scoring items for the selected user, best first
top_K = np.argsort(predictions[user_id])[::-1][:top_K]
print(top_K)

[[[ 13.38094805  -2.75934431  -2.45643293 ...  -1.56500767  -0.37430094
    -2.34160023]
  [ -2.75934431  38.48698287   7.87614963 ...  -3.50444894  -1.05881497
    -0.61370405]
  [ -2.45643293   7.87614963  36.61102729 ...  -5.72542861   3.88337826
    -3.20178445]
  ...
  [  0.73303808  16.48046322   0.39615993 ...  -5.28285117  -1.40467194
    -1.13411003]
  [ -4.73472715   9.60171178  10.28527777 ...  -3.18204057  -1.41607489
    -4.17884758]
  [-15.7150025   42.13952578  -2.94858629 ...  -8.07945955  -2.57560702
     1.28864057]]]
[[[ 6482  2451  7402 ... 12302     0   269]
  [12826 12854 12914 ...  5178  4018  3214]
  [15807 15777 15848 ...  1343  3794  1382]
  ...
  [15807 15848 15777 ...  1144   188   792]
  [15807 15848 15847 ...  7384  7402  4911]
  [15807 15777 15778 ...  3933  1349  7402]]]


## Movie recommendations

We can now use the model to generate ratings for a movie we haven't seen.


In [22]:
# Step 1: Load item_id_mapping.csv to map encoded_item_id to original_item_id
item_id_mapping = pd.read_csv("data/itstore/item_id_mapping.csv")
encoded_to_original = dict(zip(item_id_mapping["encoded_item_id"], item_id_mapping["original_item_id"]))

# Step 2: Load final_products.csv to map original_item_id to item_name
final_products = pd.read_csv("data/itstore/final_products.csv")
original_to_name = dict(zip(final_products["item_id"], final_products["item_name"]))

# Load final_customers.csv to get user_name based on user_id
final_customers = pd.read_csv("data/itstore/final_customers.csv")
user_info = final_customers.set_index("user_id").to_dict()["user_name"]
user_name = user_info.get(orig_user_id, "Unknown User")

# Step 3: Get names and scores for the top 20 recommendations
top_20_item_details = []
for encoded_ids in top_K[0]:
    
    for encoded_id in encoded_ids:
        print(encoded_id)
        original_id = encoded_to_original.get(encoded_id)
        item_name = original_to_name.get(original_id, "Unknown Item")  # Use "Unknown Item" if not found
        print(item_name)
        ranking_score = predictions[user_id][encoded_id]
        top_20_item_details.append((item_name, ranking_score))

# Display results
print(f"Recommendations for User: {user_name} (ID: {user_id})")
for item_name, ranking_score in top_20_item_details:
    print(f"Ranking Score: {ranking_score:.2f}, Name: {item_name}")

6482
Unknown Item


IndexError: index 6482 is out of bounds for axis 0 with size 1
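The `IndexError` above is consistent with `user_id` being `None`. NumPy treats `None` as `np.newaxis`, so `predictions[None]` does not select a row; it returns the whole array with an extra leading axis of size 1. That would also explain the three-dimensional output of the previous cell. A minimal sketch on a toy score matrix, with a guarded top-K helper (the function name is hypothetical):

```python
import numpy as np

# Toy score matrix: 3 users x 5 items (values are made up for illustration).
scores = np.array([
    [0.1, 2.0, -1.0, 0.7, 1.5],
    [3.0, 0.2, 0.4, -0.5, 0.0],
    [-2.0, 1.0, 4.0, 0.3, 0.9],
])

# Indexing with None is the same as np.newaxis: it prepends an axis of size 1.
print(scores[None].shape)  # (1, 3, 5)


def top_k_items(scores, user_idx, k):
    """Return (item indices, scores) of the k highest-scoring items for one user."""
    if user_idx is None:
        raise ValueError("user_idx is None: the original ID was not found in the mapping")
    row = scores[int(user_idx)]        # 1-D vector of item scores for this user
    order = np.argsort(row)[::-1][:k]  # item indices sorted by descending score
    return order, row[order]


items, item_scores = top_k_items(scores, user_idx=2, k=3)
print(items)        # [2 1 4]
print(item_scores)  # [4.  1.  0.9]
```

With a valid integer index, the result is a flat array of item indices, so a single loop over `items` is enough to look up names and scores.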

## Explaining the Predictions

PyTorch Geometric also provides a way to explain the predictions of a GNN. Let's check which movie ratings have influenced this prediction the most.

We will use the [captum](https://captum.ai/) library to explain the predictions.
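To give some intuition for the Integrated Gradients method used below, here is a toy numeric sketch. It is not Captum's or PyG's implementation: it applies the idea to a simple function with a known gradient, averaging the gradient along a straight path from a baseline to the input and scaling by the input difference. The function `f`, the zero baseline, and the step count are all assumptions chosen for illustration.

```python
import numpy as np

def f(x):
    # Toy model: sum of squares, so the gradient is known in closed form.
    return float(np.sum(x ** 2))

def grad_f(x):
    return 2.0 * x

def integrated_gradients(x, baseline, grad_fn, steps=50):
    # Midpoint Riemann sum over the straight path from baseline to x.
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)      # (steps, n_features)
    avg_grad = np.mean([grad_fn(p) for p in path], axis=0)  # mean gradient along the path
    return (x - baseline) * avg_grad

x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)

attr = integrated_gradients(x, baseline, grad_f)
print(attr)                              # per-feature attributions
print(attr.sum(), f(x) - f(baseline))    # the two values should agree
```

For this function the attributions equal each feature's squared value, and they sum to the difference between the model output at the input and at the baseline. In the graph setting, the attributed quantities are edge masks rather than input features.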

In [None]:
from torch_geometric.explain import Explainer, CaptumExplainer

explainer = Explainer(
    model=cf_rec,
    algorithm=CaptumExplainer('IntegratedGradients'),
    explanation_type='model',
    model_config=dict(
        mode='regression',
        task_level='edge',
        return_type='raw',
    ),
    node_mask_type=None,
    edge_mask_type='object',
)

explanation = explainer(
    test_data.x_dict, test_data.edge_index_dict, index=0,
    edge_label_index=edge_label_index).cpu().detach()
explanation

HeteroExplanation(
  prediction=[1],
  target=[1],
  index=[1],
  edge_label_index=[2],
  user={ x=[611, 611] },
  movie={ x=[9742, 404] },
  (user, rates, movie)={
    edge_mask=[90762],
    edge_index=[2, 90762],
  },
  (movie, rev_rates, user)={
    edge_mask=[90762],
    edge_index=[2, 90762],
  }
)

In [None]:
# User to movie link + attribution
user_to_movie = explanation['user', 'movie'].edge_index.numpy().T
user_to_movie_attr = explanation['user', 'movie'].edge_mask.numpy().T
user_to_movie_df = pd.DataFrame(
    np.hstack([user_to_movie, user_to_movie_attr.reshape(-1,1)]),
    columns = ['mappedUserId', 'mappedMovieId', 'attr']
)

# Movie to user link + attribution
movie_to_user = explanation['movie', 'user'].edge_index.numpy().T
movie_to_user_attr = explanation[ 'movie', 'user'].edge_mask.numpy().T
movie_to_user_df = pd.DataFrame(
    np.hstack([movie_to_user, movie_to_user_attr.reshape(-1,1)]),
    columns = ['mappedMovieId', 'mappedUserId','attr']
)
explanation_df = pd.concat([user_to_movie_df, movie_to_user_df])
explanation_df[["mappedUserId", "mappedMovieId"]] = explanation_df[["mappedUserId", "mappedMovieId"]].astype(int)

print(f"Attribution for all edges towards prediction of movie rating of movie:\n {movie['title'].item()}")
print("==========================================================================================")
print(explanation_df.sort_values(by='attr'))

Attribution for all edges towards prediction of movie rating of movie:
 World Trade Center (2006)
       mappedUserId  mappedMovieId      attr
45606           447           5253 -0.024605
39417           273           5253 -0.019153
75311           248           5253 -0.010085
20443           610            926 -0.000082
72392           610            460 -0.000065
...             ...            ...       ...
70294           610             16  0.015985
76402           610             20  0.016374
21129           610             34  0.016384
11582           610              0  0.021239
19793           176           5253  0.023719

[181524 rows x 3 columns]


In [None]:
# Select links that connect to our user
explanation_df = explanation_df[explanation_df['mappedUserId'] == mapped_user_id]

# We group the attribution scores by movie
explanation_df = explanation_df.groupby('mappedMovieId').sum()

# Merge with movies_df to receive title
# But first, we need to add the original id
explanation_df = explanation_df.merge(unique_movie_id, on='mappedMovieId')
explanation_df = explanation_df.merge(movies_df, on='movieId')

pd.options.display.float_format = "{:,.9f}".format

print("Top movies that influenced the prediction:")
print("==============================================")
print(explanation_df.sort_values(by='attr', ascending=False, key= lambda x: abs(x))[['title', 'attr']].head())

Top movies that influenced the prediction:
                              title        attr
0                  Toy Story (1995) 0.021200064
4  Silence of the Lambs, The (1991) 0.016353807
2               Forrest Gump (1994) 0.016348168
1               Pulp Fiction (1994) 0.015957274
6  Shawshank Redemption, The (1994) 0.015772806
