This repository contains the implementation of:
- several recommender system models suitable for large-scale job recommendations,
- a hyperparameter tuning pipeline,
- evaluation metrics.
Currently implemented models:
- ALS/WRMF: proposed in Collaborative Filtering for Implicit Feedback Datasets; implementation based on the Implicit library
- Prod2vec: proposed in E-commerce in Your Inbox: Product Recommendations at Scale; implementation based on the Gensim Word2vec implementation
- RP3Beta: proposed in Updatable, Accurate, Diverse, and Scalable Recommendations for Interactive Applications
- SLIM: proposed in SLIM: Sparse Linear Methods for Top-N Recommender Systems
- LightFM: proposed in Metadata Embeddings for User and Item Cold-start Recommendations; implementation based on the original LightFM implementation
- P3LTR: our method, described in Learning edge importance in bipartite graph-based recommendations
If you use conda, set up a conda environment with a kernel (works with anaconda3):
make ckernel
If you use virtualenv, set up a virtual environment with a kernel:
make vkernel
Then activate the environment:
source activate jobs-research
The input data file interactions.csv should be stored in the directory data/raw/your-dataset-name. For example, data/raw/jobs_published/interactions.csv. The file is expected to contain the following columns: user, item, event, timestamp.
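A quick way to load and inspect a file in this format (a minimal sketch using pandas; whether the file ships with a header row is an assumption, adjust accordingly):

```python
import pandas as pd

# Load the raw interactions; the four expected columns are listed above.
interactions = pd.read_csv(
    "data/raw/jobs_published/interactions.csv",
    names=["user", "item", "event", "timestamp"],  # drop `names` if the file already has a header
)
print(interactions.head())
print(interactions["event"].value_counts())
```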
To reproduce our results, download the olx-jobs dataset from Kaggle.
Execute the command:
python run.py
The script will:
- split the input data,
- run the hyperparameter optimization for all models,
- train the models,
- generate the recommendations,
- evaluate the models.
By default, the script executes all the aforementioned steps, namely:
--steps '["prepare", "tune", "run", "evaluate"]'
The prepare step:
- loads the raw interactions,
- splits the interactions into the train_and_validation and test sets (see the sketch after this list),
- splits the train_and_validation set into the train and validation sets,
- prepares the target_users sets, for whom recommendations are generated,
- saves all the prepared datasets.
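For illustration only, a common way to implement such splits is by timestamp; a minimal sketch (the repository's actual splitting logic lives in its preparation code and may differ):

```python
import pandas as pd

interactions = pd.read_csv(
    "data/raw/jobs_published/interactions.csv",
    names=["user", "item", "event", "timestamp"],
)

# Hypothetical time-based split: the most recent 20% of events form the test set.
cutoff = interactions["timestamp"].quantile(0.8)
train_and_validation = interactions[interactions["timestamp"] <= cutoff]
test = interactions[interactions["timestamp"] > cutoff]

# The same idea applied again yields the train/validation split.
inner_cutoff = train_and_validation["timestamp"].quantile(0.8)
train = train_and_validation[train_and_validation["timestamp"] <= inner_cutoff]
validation = train_and_validation[train_and_validation["timestamp"] > inner_cutoff]
```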
Due to the large size of our dataset, we introduced additional parameters that decrease the size of the train and validation sets used in the hyperparameter tuning (see the sketch below):
--validation_target_users_size 30000
--validation_fraction_users 0.2
--validation_fraction_items 0.2
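To illustrate what such downsampling does (a hedged sketch; the parameter semantics are inferred from their names, and the repository's implementation may differ):

```python
import numpy as np
import pandas as pd

def downsample(interactions: pd.DataFrame,
               fraction_users: float = 0.2,
               fraction_items: float = 0.2,
               seed: int = 42) -> pd.DataFrame:
    # Keep only interactions between a random subset of users and a random subset of items.
    rng = np.random.default_rng(seed)
    users = interactions["user"].unique()
    items = interactions["item"].unique()
    kept_users = rng.choice(users, size=int(len(users) * fraction_users), replace=False)
    kept_items = rng.choice(items, size=int(len(items) * fraction_items), replace=False)
    mask = interactions["user"].isin(kept_users) & interactions["item"].isin(kept_items)
    return interactions[mask]
```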
The tune step performs Bayesian hyperparameter tuning on the train and validation sets.
For each model, the search space and the tuning parameters are defined in the src/tuning/config.py file.
The results of all iterations are stored.
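As a rough illustration of how Bayesian tuning of this kind can be set up (a sketch using hyperopt; the actual search spaces are defined in src/tuning/config.py, and the objective below is hypothetical):

```python
from hyperopt import Trials, fmin, hp, tpe

# Hypothetical search space for an ALS-style model.
space = {
    "factors": hp.choice("factors", [32, 64, 128]),
    "regularization": hp.loguniform("regularization", -8, 0),
}

def objective(params):
    # Train on the train set, score on the validation set, and return a loss,
    # e.g. negative precision@10. A constant placeholder is used here.
    return 1.0

trials = Trials()  # keeps the results of all iterations
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)
```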
The run step, for each model (see the sketch after this list):
- loads the best hyperparameters (if available),
- trains the model,
- generates and saves recommendations,
- saves efficiency metrics.
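For instance, training a single ALS model and producing recommendations with the Implicit library could look like this (a hedged sketch, not the repository's actual run code; recent Implicit versions expect a user-item matrix in fit):

```python
import numpy as np
import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares

# Toy user-item interaction matrix (rows: users, columns: items).
user_items = sp.csr_matrix((np.random.rand(100, 50) > 0.9).astype(np.float32))

model = AlternatingLeastSquares(factors=64, regularization=0.01, iterations=15)
model.fit(user_items)

# Top-10 recommendations for user 0.
ids, scores = model.recommend(0, user_items[0], N=10)
```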
The evaluate step, for each model (see the metric sketch after this list):
- loads stored recommendations,
- evaluates them based on the implemented metrics,
- displays and stores the evaluation results.
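As an example of one such metric, a minimal precision@k implementation (a sketch; the repository ships its own metric implementations):

```python
def precision_at_k(recommended: list, relevant: set, k: int = 10) -> float:
    """Fraction of the top-k recommended items the user actually interacted with."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Example: 2 of the top-5 recommendations are relevant -> 0.4
print(precision_at_k(["a", "b", "c", "d", "e"], {"b", "e"}, k=5))
```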
Notebooks to analyze the dataset structure and distribution.
Notebooks to demonstrate the usage of the particular models.
Notebooks to better understand the results. They utilize recommendations and metrics generated during the execution of the run script.