Official repo with code for the ECIR'23 paper "Augmenting Graph Convolutional Networks with Textual Data for Recommendations".
Uses `python==3.10`.
To run the code:
- Create a separate environment and install `requirements.txt`. You might need to change the index paths to the appropriate CUDA versions for your machine.
- Download the Amazon data and put it into the `data` folder. I used the reviews and metadata from the "Per-category data" table.
- Clean the data using the `process_data.py` script. It:
  - removes the unnecessary columns and features;
  - removes the products and users with fewer than n reviews (n can be specified when calling the `core_n` function, 5 by default; see the filtering sketch after this list);
  - synchronizes the metadata with the reviews (i.e. removes the products that are not in the reviews);
  - creates the train and test splits to save time when running experiments; the splits can be force-regenerated when running `main.py` by specifying the `--reshuffle` flag.
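
The filtering step works roughly as follows. Here is a minimal sketch, assuming the reviews are loaded into a pandas DataFrame with `user` and `item` columns; the actual `core_n` in `process_data.py` may differ in its details:

```python
import pandas as pd

def core_n(reviews: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Iteratively drop users and items with fewer than n reviews.

    Filtering one side can push the other side below the threshold
    again, so repeat until the DataFrame stops shrinking.
    """
    while True:
        before = len(reviews)
        reviews = reviews[reviews.groupby('user')['item'].transform('size') >= n]
        reviews = reviews[reviews.groupby('item')['user'].transform('size') >= n]
        if len(reviews) == before:
            return reviews
```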
- Run the `main.py` file, specifying the desired parameters. The most important ones are:
```
options:
  --model {lgcn,adv_sampling,ltr_linear,ltr_pop}
                        which model to use
  --data DATA           folder with the train/test data
  --epochs EPOCHS, -e EPOCHS
                        number of epochs to train
  --eval_every EVAL_EVERY
                        how often evaluation is performed during training
  --batch_size BATCH_SIZE
                        batch size for training and prediction
  -k [K ...]            list of k's for metrics @k
  --gpu GPU             comma-delimited list of GPUs that torch can see
  --load_base LOAD_BASE
                        path to the base LightGCN model to load for training the linear layer on top
```
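
For example, a full training invocation might look like this (the parameter values here are illustrative, not tuned recommendations):

```
python main.py --model lgcn --data data/Electronics --epochs 1000 --eval_every 25 --batch_size 2048 -k 10 20 --gpu 0
```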
To train a TextGCN model, you first need to train a LightGCN model on the same data and then load it as the base model for the TextGCN model.
After the model is trained, a `runs` folder is created, and the results for each experiment are saved there:
- `latest_checkpoint.pkl` - the latest checkpoint of the model
- `best.pkl` - the best checkpoint of the model
- `log.log` - the log of the training process
For example, to train the TextGCN model described in the paper on the `Electronics` category, you first need to train a default LightGCN model by running

```
python main.py --model lgcn --data data/Electronics --uid lightgcn_electronics
```

and then train the TextGCN model on top of it by running

```
python main.py --model ltr_linear --data data/Electronics --load_base runs/lightgcn_electronics --uid ltr_linear_electronics
```
The available models are:
- `lgcn` - the default LightGCN model from the original paper. Defined in the `BaseModel` class.
- `adv_sampling` - LightGCN with dynamic negative sampling as described in the paper: instead of a single random negative sample, it selects several negative samples with the highest scores. Better performance, but much slower; it can be run for a smaller number of epochs and evaluated every 2 epochs instead of the standard 25. Defined in the `AdvSamplModel` class (see the sampling sketch below).
- `ltr_linear` - the TextGCN model, which uses the LightGCN score and 4 textual features, combining them in a linear layer on top. Defined in the `LTRLinear` class. Corresponds to the architecture from the paper (see the scoring-head sketch below).
- `ltr_pop` - same as `ltr_linear`, but also uses the popularity of the item and the user as features. Defined in the `LTRLinearWPop` class.
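
To make the dynamic negative sampling idea concrete, here is a minimal sketch; the function name, candidate-pool size, and dot-product scoring are assumptions, not the actual `AdvSamplModel` code:

```python
import torch

def dynamic_negative_samples(user_emb: torch.Tensor,
                             item_emb: torch.Tensor,
                             num_candidates: int = 200,
                             num_neg: int = 10) -> torch.Tensor:
    """For each user, draw random candidate items and keep the num_neg
    candidates the current model scores highest (hard negatives).

    user_emb: (batch, dim) embeddings of the users in the batch.
    item_emb: (num_items, dim) embeddings of all items.
    Returns a (batch, num_neg) tensor of item indices.
    """
    batch = user_emb.size(0)
    # Random candidate pool per user; a real implementation would also
    # exclude items the user has already interacted with.
    candidates = torch.randint(item_emb.size(0), (batch, num_candidates))
    cand_emb = item_emb[candidates]                      # (batch, cand, dim)
    scores = (user_emb.unsqueeze(1) * cand_emb).sum(-1)  # (batch, cand)
    top = scores.topk(num_neg, dim=1).indices            # (batch, num_neg)
    return candidates.gather(1, top)
```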
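
Similarly, here is a sketch of the `ltr_linear` scoring head: a single linear layer over the LightGCN score and the 4 textual features. The module name and input layout are illustrative; see the `LTRLinear` class for the actual implementation:

```python
import torch
import torch.nn as nn

class LinearScoringHead(nn.Module):
    """Combine the LightGCN score with textual features in one
    linear layer, as in the ltr_linear model (sketch only)."""

    def __init__(self, num_text_features: int = 4):
        super().__init__()
        # 1 LightGCN score + 4 textual features -> 1 final score.
        self.linear = nn.Linear(1 + num_text_features, 1)

    def forward(self, gcn_score: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
        # gcn_score: (batch, 1); text_features: (batch, num_text_features)
        features = torch.cat([gcn_score, text_features], dim=-1)
        return self.linear(features).squeeze(-1)
```

`ltr_pop` would extend the same head with two extra inputs for the user and item popularity.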
There are other models in other files, like `text`, or the gradient-boosted versions I have experimented with, but they show worse performance and are not included in the paper.
Feel free to ask questions by opening an issue.