TIES was my undergraduate thesis, Table Information Extraction System. I picked the name from there and made it 2.0 from there.
This is a repository containing source code for the arxiv paper 1905.13391 (link). This paper has been accepted into ICDAR 2019. To cite the paper, use:
@article{rethinkingGraphs,
author = {Qasim, Shah Rukh and Mahmood, Hassan and Shafait, Faisal},
title = {Rethinking Table Recognition using Graph Neural Networks},
journal = {Accepted into ICDAR 2019},
volume = {abs/1905.13391},
year = {2019},
url = {https://arxiv.org/abs/1905.13391},
archivePrefix = {arXiv},
eprint = {1905.13391},
}
We are still working to improve a few technical details for your convenience. We'll remove this note once we are done. Expect them to be done by June 15, 2019. We are also working to improve dataset format for easier understanding.
Partial dataset which was used for test can be found here. We are uploading rest of the dataset. The current format of the dataset is tfrecords.
In the meantime, if you want to generate the dataset, head on to the following repository:
github.com/hassan-mahmood/Structural_Analysis
The project is divided into language parts, python and cpp, for python and C++ respectively. There is nothing in the
cpp folder as of now.
The python dir is supposed to be the path where a script is to be run, or alternatively, it could be added to the
$PYTHONPATH environmental variable. It would contain further directories:
bincontain the scripts which are to be run from the terminal. Within bin, there would be multiple folders, short for different classes of executable programs.iteratefor running training or inference.analysefor analysing inference output.checksthis was for testing various files while development. You can safely ignore it.
iteratorsprovides functionality to iterate through the datasets while you are training or testing.layerscontains basic layers for graph networksmodelscontains the main model and network segments. Most of the functionality can be found inbasic_model.py. Start to trace from there.opscontains basic modified operations. These contains the advanced graph operations code.readersis for readers, entities responsible for reading the data fromtfrecords. Their format can be changed in this file.libscontains all other helper and library functions.
Within the context of this repository, iterate refers to any of train, test or anything which is done iteratively. You
can say anything that is done iteratively mostly on the GPU. So if there is an iterator somewhere, it probably refers
to an entity which handles training, testing etc.
- Prepare the dataset. For this, you are required to divide the dataset into three different sections, test, train and validation. Test set will be used to run the analysis after training is done. Backpropagation will be run on the train set. Validation set is used to produce plots for tensorboard to monitor performance of the network.
- The dataset files have to be in
tfrecordsformat. Make a new file calledtrain_files.txt. It should contain full paths of all the training tfrecords files. For example:/home/shahrukhqasim/dataset/train_1.tfrecord /home/shahrukhqasim/dataset/train_2.tfrecord /home/shahrukhqasim/dataset/train_3.tfrecord - Similarly, prepare
validation_files.txt,test_files.txt. The contents of these three files should not be overlapping. - Make a config file according to the format given in
configs/config.ini.example. This file determines all the settings, dataset locations and results generation paths. The example config file contains documentation for your ease. If you are unclear about a setting, send an email to me or generate an issue in this repository. - Each config file will contain multiple configurations. These configurations are recommended to be used for different models.
So, for instance, you make different configs for
DGCNN,GravNetandConvolutionalnetworks.
To run the training, you need to issue the following command:
$ python bin/iterate/table_adjacency_parsing.py path/to/the/config/file config
While you are running the training, you can monitor using tensorboard. The paths are to be set into the config file as described in the previous step. Use the following command to run the tensorboard:
$ tensorboard --logdir=/media/all/shahrukhqasim/Tables/TrainOut/betaout/summary
You can monitor the performance after that in your browser. The port number will be displayed when you run the above command.
You first need to run inference which will generate bin files in numpy pickle format.
$ python bin/iterate/table_adjacency_parsing.py path/to/the/config/file config --test True
TODO: Analaysis code and further documentation is coming.
Python 3.5+ is needed. We recommend using virtualenv but anaconda should also work fine.
The required packages are listed in requirements.txt. They can be installed by:
$ pip install -r requirements.txt
In addition to this, you need to download another repository from here:
github.com/jkiesele/caloGraphNN
Let's say you clone it into /home/shahrukhqasim/caloGraphNN. You need to add this path to the $PYTHONPATH environmental variable.
$ export PYTHONPATH=$PYTHONPATH:/home/shahrukhqasim/caloGraphNN
In addition to this, you should run all the commands from inside of python directory. And python should also be present in $PYTHONPATH environmental variable.
$ export PYTHONPATH=$PYTHONPATH:/home/shahrukhqasim/TIES-2.0/python
You can also add . to the $PYTHONPATH if you know you will always run the commands from inside of python directory.
It is advised you make a sh file with these export commands and a command which activates the virtual environment.
I use the following sourcing file (ties.sh):
source ~/Envs/h3/bin/activate
cd /Users/shahrukhqasim/Workspace/TIES-2.0/python
export PYTHONPATH=$PYTHONPATH:/Users/shahrukhqasim/Workspace/caloGraphNN:/Users/shahrukhqasim/Workspace/TIES-2.0
I source it every time I want to run training or inference using:
$ source ties.sh
- Training data uploaded
- Trained models