diff --git a/README.md b/README.md
index 13cb73f..89dda4e 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,14 @@
-# A Two-Phase Framework for Temporal-Aware Knowledge Graph Completion
-This repository contains the code to train and evaluate models from the paper:
-_Learning Cross-modal Embeddings for Cooking Recipes and Food Images_
+# PyTorch implementation of A Two-Phase Framework for Temporal-Aware Knowledge Graph Completion
-Clone it using:
+Paper: [A Two-Phase Framework for Temporal-Aware Knowledge Graph Completion](https://arxiv.org/abs/1904.05530)
-```shell
-git clone --recursive https://github.com/torralba-lab/im2recipe.git
-```
+We propose a novel two-phase model that infers missing facts in a temporal knowledge graph by exploiting temporal information. This repository contains the implementation of the two-phase model described in the paper.
+
+

-If you find this code useful, please consider citing:
+The temporal knowledge graph (TKG) completion task aims to add new facts to a KG by making inferences from the facts contained in the existing triples and from the validity times attached to them. Many methods have been proposed for this problem, but most of them ignore temporal rules, which are accurate and explainable for the temporal KG completion task. In this paper, we present a novel two-phase framework that integrates the advantages of time-aware embeddings and temporal rules. First, a translation-based temporal KG representation method is proposed to model the semantic and temporal information of the KG. Then a refinement model is used to further improve performance on the task by solving a joint optimization problem formulated as an integer linear program (ILP).
+
+If you make use of this code in your work, please cite the following paper:
 
 ```
 @inproceedings{salvador2017learning,
@@ -22,211 +22,77 @@ If you find this code useful, please consider citing:
 ## Contents
 1. [Installation](#installation)
-2. [Recipe1M Dataset](#recipe1m-dataset)
-3. [Vision models](#vision-models)
-4. [Out-of-the-box training](#out-of-the-box-training)
-5. [Prepare training data](#prepare-training-data)
-6. [Training](#training)
-7. [Testing](#testing)
-8. [Visualization](#visualization)
-9. [Pretrained model](#pretrained-model)
-10. [Contact](#contact)
+2. [Train_and_Test](#train_and_test)
+3. [Datasets](#datasets)
+4. [Baselines](#baselines)
+5. [Contact](#contact)
 
 ## Installation
 
-Install [Torch](http://torch.ch/docs/getting-started.html):
-```
-git clone https://github.com/torch/distro.git ~/torch --recursive
-cd ~/torch; bash install-deps;
-./install.sh
-```
-
 Install the following packages:
 
 ```
-luarocks install torch
-luarocks install nn
-luarocks install image
+pip install torch
+pip install numpy
 ```
 
 Install CUDA and cudnn. Then run:
 
 ```
-luarocks install cutorch
-luarocks install cunn
-luarocks install cudnn
-```
-
-A custom fork of torch-hdf5 with string support is needed:
-
-```
-cd ~/torch/extra
-git clone https://github.com/nhynes/torch-hdf5.git
-cd torch-hdf5
-git checkout chars2
-luarocks build hdf5-0-0.rockspec
-```
-
-We use Python2.7 for data processing. Install dependencies with ```pip install -r requirements.txt```
-
-## Recipe1M Dataset
-
-Our Recipe1M dataset is available for download [here](http://im2recipe.csail.mit.edu/dataset/download).
-
-## Vision models
-
-We used the following pretrained vision models:
-
-- VGG-16 ([prototxt](https://gist.githubusercontent.com/ksimonyan/211839e770f7b538e2d8/raw/ded9363bd93ec0c770134f4e387d8aaaaa2407ce/VGG_ILSVRC_16_layers_deploy.prototxt) and [caffemodel](http://www.robots.ox.ac.uk/~vgg/software/very_deep/caffe/VGG_ILSVRC_16_layers.caffemodel)).
-
-when training, point arguments ```-proto``` and ```-caffemodel``` to the files you just downloaded.
-
-- ResNet-50 ([torchfile](https://d2j0dndfm35trm.cloudfront.net/resnet-50.t7)).
-
-when training, point the argument ```-resnet_model``` to this file.
-
-## Out-of-the-box training
-
-To train the model, you will need the following files:
-* `data/data.h5`: HDF5 file containing skip-instructions vectors, ingredient ids, categories and preprocessed images.
-* `data/text/vocab.bin`: ingredient Word2Vec vocabulary. Used during training to select word2vec vectors given ingredient ids.
-
-The links to download them are available [here](http://im2recipe.csail.mit.edu/dataset/download).
-
-## Prepare training data
-
-We also provide the steps to format and prepare Recipe1M data for training the trijoint model. We hope these instructions will allow others to train similar models with other data sources as well.
-
-### Choosing semantic categories
-
-We provide the script we used to extract semantic categories from bigrams in recipe titles:
-
-- Run ```python bigrams --crtbgrs```. This will save to disk all bigrams in the corpus of all recipe titles in the training set, sorted by frequency.
-- Running the same script with ```--nocrtbgrs``` will create class labels from those bigrams adding food101 categories.
-
-These steps will create a file called ```classes1M.pkl``` in ```./data/``` that will be used later to create the HDF5 file including categories.
-
-### Word2Vec
-
-Training word2vec with recipe data:
-
-- Run ```python tokenize_instructions.py train``` to create a single file with all training recipe text.
-- Run the same ```python tokenize_instructions.py``` to generate the same file with data for all partitions (needed for skip-thoughts later).
-- Download and compile [word2vec](https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip)
-- Train with:
-
-```
-./word2vec -hs 1 -negative 0 -window 10 -cbow 0 -iter 10 -size 300 -binary 1 -min-count 10 -threads 20 -train tokenized_instructions_train.txt -output vocab.bin
-```
-
-- Run ```python get_vocab.py vocab.bin``` to extract dictionary entries from the w2v binary file. This script will save ```vocab.txt```, which will be used to create the dataset later.
-- Move ```vocab.bin``` and ```vocab.txt``` to ```./data/text/```.
-
-### Skip-instructions
-
-- Navigate to ```th-skip```
-- Create directories where data will be stored:
-```
-mkdir data
-mkdir snaps
+# check that the CUDA build of PyTorch can see the GPU
+# (note: CUDA and cuDNN themselves are not pip packages)
+python -c "import torch; print(torch.cuda.is_available())"
 ```
-- Prepare the dataset running from ```scripts``` directory:
+Then clone the repository:
 ```
-python mk_dataset.py
---dataset /path/to/recipe1M/
---vocab /path/to/w2v/vocab.txt
---toks /path/to/tokenized_instructions.txt
+git clone https://github.com/shengyp/TKGComplt.git
 ```
-where ```tokenized_instructions.txt``` contains text instructions for the entire dataset (generated in step 2 of the Word2Vec section above), and ```vocab.txt``` are the entries of the word2vec dictionary (generated in step 6 in the previous section).
-
-
-- Train the model with:
-
-```
-moon main.moon --dataset data/dataset.h5 --dim 1024 --nEncRNNs 2 --snapfile snaps/snapfile --savefreq 500 --batchSize 128 --w2v /path/to/w2v/vocab.bin
-```
+We use Python 3 for data processing, and the code itself is also written in Python 3.
+
-- Get encoder from the trained model. From ```scripts```:
+## Train_and_Test
+Before running, first preprocess the chosen dataset:
 ```
-moon extract_encoder.moon
-../snaps/snapfile_xx.t7
-encoder.t7
-true
+cd datasets/DATA_NAME
+python data_processing.py
 ```
-- Extract features. From ```scripts```:
-
+Then train the model:
 ```
-moon encode.moon --data ../data/dataset.h5 --model encoder.t7 --partition test --out encs_test_1024.t7
+cd ..
+python Train.py -d
 ```
-
-Run for ```-partition = {train,val,test}``` and ```-out={encs_train_1024,encs_val_1024,encs_test_1024}``` to extract features for the dataset.
-
-- Move files ```encs_*_1024.t7``` containing skip-instructions features to ```./data/text```.
-
-
-### Creating HDF5 file
-
-Navigate back to ```./```. Run the following from ```./pyscripts```:
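+For intuition, the following is a minimal sketch of the kind of translation-based, time-aware scoring that the first phase relies on. It is not the repository's actual `Train.py`; the module name, embedding sizes, and the way the time stamp enters the score are illustrative assumptions only.
+
+```python
+import torch
+import torch.nn as nn
+
+class ToyTemporalTransE(nn.Module):
+    """Illustrative only: score(s, r, o, t) = -|| e_s + e_r + e_t - e_o ||_1."""
+    def __init__(self, n_entities, n_relations, n_times, dim=100):
+        super().__init__()
+        self.ent = nn.Embedding(n_entities, dim)
+        self.rel = nn.Embedding(n_relations, dim)
+        self.time = nn.Embedding(n_times, dim)  # one vector per discretized time step
+
+    def forward(self, s, r, o, t):
+        # translate the subject by the relation vector, shifted by a time-specific vector
+        trans = self.ent(s) + self.rel(r) + self.time(t) - self.ent(o)
+        return -torch.norm(trans, p=1, dim=-1)  # higher score = more plausible fact
+
+# margin-based ranking against randomly corrupted objects (standard TransE-style loss)
+model = ToyTemporalTransE(n_entities=10000, n_relations=20, n_times=500)
+s = torch.randint(0, 10000, (32,))
+r = torch.randint(0, 20, (32,))
+o = torch.randint(0, 10000, (32,))
+t = torch.randint(0, 500, (32,))
+neg_o = torch.randint(0, 10000, (32,))  # corrupted object entities
+loss = torch.relu(1.0 + model(s, r, neg_o, t) - model(s, r, o, t)).mean()
+loss.backward()
+```
+
+Roughly speaking, the second phase then refines the highest-scoring candidate facts by enforcing the temporal rules through the ILP step below.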
-
+Finally, run the ILP model to predict new facts:
 ```
-python mk_dataset.py --vocab /path/to/w2v/vocab.txt --dataset /path/to/recipe1M/ --h5_data /path/to/h5/outfile/data.h5 --stvecs /path/to/skip-instr_files/
+python ILP_solver.py
 ```
+The default hyperparameters give the best performance.
-## Training
-
-- Train the model with:
-```
-th main.lua --dataset /path/to/h5/file/data.h5 --ingrW2V /path/to/w2v/vocab.bin --net resnet --resnet_model /path/to/resnet/model/resnet-50.t7 --snapfile snaps/snap --dispfreq 1000 --valfreq 10000
-```
+
+## Datasets
-*Note: Again, this can be run without arguments with default parameters if files are in the default location.*
+Three datasets are used in our experiments: YAGO11K, WIKIDATA12K, and WIKIDATA36K. The facts in each dataset carry a time annotation of the form "[start_time, end_time]". Each data folder contains six files (a minimal loading sketch is given just before the baseline table below):
-- You can use multiple GPUs to train the model with the ```-ngpus``` flag. With 4 GTX Titan X you can set ```-batchSize``` to ~150. This is the default config, which will make the model converge in about 3 days.
-- Plot loss curves anytime with ```python plotcurve.py -logfile /path/to/logfile.txt```. If ```dispfreq``` and ```valfreq``` are different than default, they need to be passed as arguments to this script for the curves to be correctly displayed. Running this script will also give you the elapsed training time. ```logifle.txt``` should contain the stdout of ```main.lua```. Redirect it with ```th main.lua > /path/to/logfile.txt ```.
+- entity2id.txt: the first column is the entity name and the second column is the entity index.
-
-## Testing
+- relation2id.txt: the first column is the relation name and the second column is the relation index.
-
-- Extract features from test set ```th main.lua -test 1 -loadsnap snaps/snap_xx.dat```. They will be saved in ```results```.
-- After feature extraction, compute MedR and recall scores with ```python rank.py```.
+- train.txt, test.txt, valid.txt: the columns are, in order, the subject entity index, the relation index, the object entity index, the start time of the fact, and the end time of the fact.
-- Extracting embeddings for any dataset partition is possible with the ```extract``` flag, which can be either ```train```, ```val``` or ```test``` (default).
+- stat.txt: the number of entities and the number of relations.
-
-## Visualization
+
+## Baselines
-
-We provide a script to visualize top-1 im2recipe examples in ```./pyscripts/vis.py ```. It will save figures under ```./data/figs/```.
+We use the following publicly available code for the baseline experiments (see the table below):
-
-## Pretrained model
-
-Our best model can be downloaded [here](http://data.csail.mit.edu/im2recipe/im2recipe_model.t7.gz).
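+Before the baseline table, here is the loading sketch referred to in the Datasets section above. It is a minimal illustration only; the exact separators and directory layout in this repository may differ, so treat the whitespace splitting and the example path as assumptions.
+
+```python
+from pathlib import Path
+
+def load_quintuples(path):
+    """Read facts as (subject_id, relation_id, object_id, start_time, end_time)."""
+    facts = []
+    for line in Path(path).read_text().splitlines():
+        if not line.strip():
+            continue
+        # assumes whitespace-separated columns in the order described above;
+        # times are kept as strings because their format may vary per dataset
+        s, r, o, start, end = line.split()[:5]
+        facts.append((int(s), int(r), int(o), start, end))
+    return facts
+
+# example (hypothetical path): facts = load_quintuples("datasets/YAGO11K/train.txt")
+```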
-You can test it with:
-```
-th main.lua -test 1 -loadsnap im2recipe_model.t7
-```
+| Baselines | Code | Embedding size | Batch num |
+|-------------|---------------------------------------------------------------------------|----------------|------------|
+| TransE ([Bordes et al., 2013](https://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data)) | [Link](https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch/openke) | 100, 200 | 100, 200 |
+| TransH ([Wang et al., 2014](https://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8531/8546)) | [Link](https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch/openke) | 100, 200 | 100, 200 |
+| t-TransE ([Leblay et al., 2018](https://dl.acm.org/doi/fullHtml/10.1145/3184558.3191639)) | [Link](https://github.com/INK-USC/RE-Net/tree/master/baselines) | 50, 100, 200 | 100, 200 |
+| TA-TransE ([García-Durán et al., 2018](https://www.aclweb.org/anthology/D18-1516.pdf)) | [Link](https://github.com/INK-USC/RE-Net/tree/master/baselines) | 100, 200 | Default |
+| HyTE ([Dasgupta et al., 2018](http://talukdar.net/papers/emnlp2018_HyTE.pdf)) | [Link](https://github.com/malllabiisc/HyTE) | Default | Default |
 
 ## Contact