Sample code for Visual Summary Identification from Scientific Publications via Self-Supervised Learning. As we do not retain the right to distribute the data used for training, we release the code with sample training data.
- pytorch 1.7.1
- spacy 2.3.2
- transformers 2.11.0
- tensorboardX 2.0
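These can be installed with pip, for example (note that PyTorch is published on PyPI as torch):

pip install torch==1.7.1 spacy==2.3.2 transformers==2.11.0 tensorboardX==2.0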
We placed the sample training data in "data/train". For each figure, a single JSON file containing the figure caption and the paragraph that mentions the figure is created, e.g. "data/train/1/Figure_1.json". Each JSON file should be named "Figure_[number].json". Figures from the same paper should be placed in the same directory; for example, "data/train/1" contains all figures from a single paper.
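For reference, here is a minimal sketch of what a single training instance might look like. The field names ("caption" and "paragraph") are assumptions based on the description above, so please check the sample files in "data/train" for the actual schema.

```python
import json
import os

# Dummy training instance; the field names are assumptions, not the released schema.
instance = {
    "caption": "Figure 1: Overview of the proposed model.",          # figure caption
    "paragraph": "As shown in Figure 1, our model consists of ...",  # paragraph that mentions the figure
}

# Figures from the same paper go into the same directory.
os.makedirs("data/train/1", exist_ok=True)
with open("data/train/1/Figure_1.json", "w") as f:
    json.dump(instance, f, indent=2)
```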
As we do not retain the right to distribute all samples for training, we only provide the sample instances in "data/train". We collected the original papers of the data available at https://github.com/viziometrics/centraul_figure. The papers we collected are publicly available, though we do not retain distribution rights for some of them.
Examples of open-access sources:
- CVF: https://openaccess.thecvf.com/menu
- ACLAnthology: https://aclanthology.org/
There are also several conferences providing proceedings as open access.
Model training requires a figure caption and a paragraph that mentions the figure. To obtain figure captions, we used the DeepFigures library, available at https://github.com/allenai/deepfigures-open. To obtain the paragraph that mentions a figure, we used the ScienceParse library, available at https://github.com/allenai/science-parse.
We provide code to prepare training instances in "data/convert.py". DeepFigures and ScienceParse generate JSON files named *deepfigures-results.json and *.pdf.json, respectively. After generating these files, please run "data/convert.py" as below.
python data/convert.py --scienceparse /path/to/*.pdf.json --deepfigure /path/to/*deepfigures-results.json --output_dir /path/to/output
All figures for a given paper are saved in "/path/to/output".
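To process many papers at once, a small driver script can loop over per-paper directories. The sketch below is hypothetical (the "parsed_papers" layout is an assumption) and simply reuses the convert.py flags shown above:

```python
import glob
import os
import subprocess

# Hypothetical layout: parsed_papers/<paper>/ holds the ScienceParse *.pdf.json
# and DeepFigures *deepfigures-results.json outputs for one paper.
for paper_dir in sorted(glob.glob("parsed_papers/*")):
    scienceparse = glob.glob(os.path.join(paper_dir, "*.pdf.json"))
    deepfigures = glob.glob(os.path.join(paper_dir, "*deepfigures-results.json"))
    if not scienceparse or not deepfigures:
        continue  # skip papers where either tool produced no output
    subprocess.run([
        "python", "data/convert.py",
        "--scienceparse", scienceparse[0],
        "--deepfigure", deepfigures[0],
        "--output_dir", os.path.join("data/train", os.path.basename(paper_dir)),
    ], check=True)
```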
We prepared the sample test data in "data/test/test_sample.json". The JSON file needs to contain the abstract and the captions of all figures from a single paper.
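Here is a minimal sketch of constructing such a test file; the key names ("abstract" and "captions") are assumptions, so please refer to "data/test/test_sample.json" for the actual schema.

```python
import json

# Dummy test instance for one paper; the key names are assumptions.
test_instance = {
    "abstract": "We propose a self-supervised method for ...",  # paper abstract
    "captions": [                                               # captions of all figures
        "Figure 1: Overview of the proposed model.",
        "Figure 2: Results on the benchmark dataset.",
    ],
}

with open("data/test/my_paper.json", "w") as f:
    json.dump(test_instance, f, indent=2)
```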
We used the data available at https://github.com/viziometrics/centraul_figure.
Please send an e-mail to s.yamamoto(at)fuji.waseda.jp
python test.py --test_data /path/to/test/data.json --model /path/to/model/weight --bert /path/to/BERT/model
We used scibert_base_uncased, available at https://github.com/allenai/scibert.
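To verify that the downloaded SciBERT weights load correctly with transformers 2.11.0, a quick check like the following can be used (the path is a placeholder for wherever you extracted the model):

```python
import torch
from transformers import BertModel, BertTokenizer

# "/path/to/BERT/model" is a placeholder: the directory containing the extracted
# scibert_base_uncased files (vocab.txt and the PyTorch weights).
tokenizer = BertTokenizer.from_pretrained("/path/to/BERT/model")
model = BertModel.from_pretrained("/path/to/BERT/model")
model.eval()

with torch.no_grad():
    tokens = tokenizer.encode("Figure 1 shows the overall architecture.", return_tensors="pt")
    outputs = model(tokens)  # transformers 2.x returns a tuple (last_hidden_state, pooler_output)

print(outputs[0].shape)  # e.g. torch.Size([1, seq_len, 768])
```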