Update README.md #1

Open · wants to merge 3 commits into master

224 changes: 45 additions & 179 deletions README.md

# PyTorch implementation of A Two-Phase Framework for Temporal-Aware Knowledge Graph Completion

Paper: [A Two-Phase Framework for Temporal-Aware Knowledge Graph Completion](https://arxiv.org/abs/1904.05530)

We propose a novel two-phase model that infers missing facts in a temporal knowledge graph by exploiting temporal information; this repository contains the implementation of the model described in the paper.

<p align="center"><img src="figs/renet.png" width="500"/></p>

The temporal knowledge graph (TKG) completion task aims to add new facts to a KG by making inferences from the facts contained in its existing triples and from their valid-time information. Many methods have been proposed for this problem, but most of them ignore temporal rules, which are accurate and explainable for the temporal KG completion task. In this paper, we present a novel two-phase framework that integrates the advantages of time-aware embeddings and temporal rules. First, a translation-based temporal KG representation method is proposed to model the semantic and temporal information of the KG. Then a refinement model is used to further improve performance on the task by solving a joint optimization problem formulated as an integer linear program (ILP).
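
For intuition, here is a minimal sketch of the translation-based, time-aware scoring that the first phase describes. The module name, the embedding layout, and the L1 distance are illustrative assumptions, not the paper's exact parameterization:

```python
import torch.nn as nn

class TimeAwareTransE(nn.Module):
    """Illustrative trans-based scorer (assumption): the relation and the
    timestamp both act as translations between entity embeddings."""

    def __init__(self, n_ent, n_rel, n_time, dim=100):
        super().__init__()
        self.ent = nn.Embedding(n_ent, dim)
        self.rel = nn.Embedding(n_rel, dim)
        self.tim = nn.Embedding(n_time, dim)  # one vector per discretized time step

    def score(self, s, r, o, t):
        # Lower is better: ||e_s + w_r + w_t - e_o||_1
        return (self.ent(s) + self.rel(r) + self.tim(t) - self.ent(o)).norm(p=1, dim=-1)
```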

If you make use of this code in your work, please cite the paper linked above.

## Contents
1. [Installation](#installation)
2. [Train and Test](#train-and-test)
3. [Datasets](#datasets)
4. [Baselines](#baselines)
5. [Contact](#contact)

## Installation


Install the following packages:

```
pip install torch
pip install numpy
```

Then clone the repository:

```
git clone https://github.com/shengyp/TKGComplt.git
```

We use Python 3 throughout: both the data processing scripts and the model code are written in Python 3.

## Train and Test

Before running, process the datasets first:
```
cd datasets/DATA_NAME
python data_processing.py
```

Then, train the model:
```
cd ..
python Train.py -d
```
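
The sketch below shows the kind of margin-based ranking loop a script like `Train.py` typically runs for the embedding phase, using a scorer like the `TimeAwareTransE` sketch above; the negative-sampling scheme (corrupting object entities) and the default margin are our assumptions, not the script's documented behavior:

```python
import torch

def train_epoch(model, quads, n_ent, optimizer, margin=1.0):
    """One illustrative epoch; quads is a LongTensor of (s, r, o, t) rows."""
    s, r, o, t = quads.unbind(dim=1)
    neg_o = torch.randint(0, n_ent, o.shape)       # corrupt the object entity
    pos = model.score(s, r, o, t)                  # scores of true quadruples
    neg = model.score(s, r, neg_o, t)              # scores of corrupted ones
    loss = torch.clamp(margin + pos - neg, min=0).mean()  # margin ranking loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```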


Finally, run the ILP model to predict new facts:
```
python ILP_solver.py
```
The default hyperparameters give the best performance.
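
To make the refinement phase concrete, the toy 0/1 ILP below accepts the candidate facts with the highest total confidence subject to a temporal-consistency constraint. The specific rule encoded (at most one object for the same subject, relation, and year), the toy data, and the use of the PuLP solver are illustrative assumptions; `ILP_solver.py` defines the actual variables and constraints:

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

# Hypothetical candidate facts with model confidences.
cands = [("A", "worksFor", "X", 2001, 0.9),
         ("A", "worksFor", "Y", 2001, 0.7),
         ("A", "worksFor", "X", 2003, 0.6)]

prob = LpProblem("tkg_refinement", LpMaximize)
x = [LpVariable(f"x{i}", cat=LpBinary) for i in range(len(cands))]

# Objective: total confidence of the accepted facts.
prob += lpSum(x[i] * cands[i][4] for i in range(len(cands)))

# Temporal rule as a hard constraint: conflicting facts that share a
# subject, relation, and year cannot both be accepted.
for i in range(len(cands)):
    for j in range(i + 1, len(cands)):
        if cands[i][:2] + (cands[i][3],) == cands[j][:2] + (cands[j][3],):
            prob += x[i] + x[j] <= 1

prob.solve()
print([cands[i][:4] for i in range(len(cands)) if x[i].value() == 1])
```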

## Datasets

There are three datasets used in our experiments: YAGO11K, WIKIDATA12K, and WIKIDATA36K. Every fact in these datasets has a time annotation of the form "[start_time, end_time]". Each dataset folder contains six files (a minimal loading sketch follows the list):

- entity2id.txt: the first column is the entity name, and the second column is the entity index.

- relation2id.txt: the first column is the relation name, and the second column is the relation index.

- train.txt, test.txt, valid.txt: the first column is the subject entity index, the second is the relation index, the third is the object entity index, and the fourth and fifth are the start and end times of the fact.

- stat.txt: the number of entities and the number of relations.
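
Under the file layout described above, a minimal loader for the fact files might look like the following sketch; the function name is ours, the path follows the `datasets/DATA_NAME` convention from the Train and Test section, and the time columns are kept as raw strings since their exact format varies by dataset:

```python
def load_facts(path):
    """Read one fact per line: s_id, r_id, o_id, start_time, end_time."""
    facts = []
    with open(path) as f:
        for line in f:
            s, r, o, start, end = line.split()
            facts.append((int(s), int(r), int(o), start, end))
    return facts

train_facts = load_facts("datasets/YAGO11K/train.txt")
```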

## Baselines

We use the following public implementations for the baseline experiments.

| Baselines | Code | Embedding size | Batch num |
|-------------|---------------------------------------------------------------------------|----------------|------------|
| TransE ([Bordes et al., 2013](https://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data)) | [Link](https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch/openke) | 100, 200 | 100, 200 |
| TransH ([Wang et al., 2014](https://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8531/8546)) | [Link](https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch/openke) | 100, 200 | 100, 200 |
| t-TransE ([Leblay et al., 2018](https://dl.acm.org/doi/fullHtml/10.1145/3184558.3191639)) | [Link](https://github.com/INK-USC/RE-Net/tree/master/baselines) | 50, 100, 200 | 100, 200 |
| TA-TransE ([García-Durán et al., 2018](https://www.aclweb.org/anthology/D18-1516.pdf)) | [Link](https://github.com/INK-USC/RE-Net/tree/master/baselines) | 100, 200 | Default |
| HyTE ([Dasgupta et al., 2018](http://talukdar.net/papers/emnlp2018_HyTE.pdf)) | [Link](https://github.com/malllabiisc/HyTE) | Default | Default |

## Contact
