Parses references in academic papers and extracts their metadata, such as authors, title, and date.
To use it, you must first train it on some hand-labeled data. An example of
such data is given in the file training-data.txt. To train, run this command
in a terminal:
$ php run-train.php > trained-data.php
This creates the file trained-data.php containing the trained model
parameters. If you also want to see the actual transition and emission
probability matrices, you can generate an HTML file with tables:
$ php run-train.php html > trained-data.html
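The mention of transition and emission matrices suggests a hidden Markov model. As a rough illustration only, here is a sketch in Python (not the project's PHP code) of how such matrices can be estimated from labeled tokens by counting. The function name estimate_hmm, the add-one smoothing, and the sample data are all assumptions made for this example; run-train.php may work differently.

```python
from collections import Counter, defaultdict

def estimate_hmm(sequences, smoothing=1.0):
    """Estimate transition and emission probabilities by counting.

    sequences: list of references, each a list of (label, token) pairs.
    smoothing: additive constant so unseen events keep a non-zero probability
               (an assumption of this sketch, not taken from the project).
    """
    transitions = defaultdict(Counter)  # previous label -> next label -> count
    emissions = defaultdict(Counter)    # label -> token -> count
    labels, vocab = set(), set()

    for seq in sequences:
        prev = None
        for label, token in seq:
            token = token.lower()
            labels.add(label)
            vocab.add(token)
            emissions[label][token] += 1
            if prev is not None:
                transitions[prev][label] += 1
            prev = label

    def normalize(counts, outcomes):
        # Turn raw counts into a smoothed probability distribution.
        total = sum(counts.values()) + smoothing * len(outcomes)
        return {o: (counts[o] + smoothing) / total for o in outcomes}

    trans_p = {a: normalize(transitions[a], labels) for a in labels}
    emit_p = {a: normalize(emissions[a], vocab) for a in labels}
    return trans_p, emit_p

# Hypothetical hand-labeled data: A = author, T = title, D = date.
data = [
    [("A", "Smith"), ("A", "J."), ("T", "Parsing"), ("T", "references"), ("D", "2001")],
    [("A", "Doe"), ("T", "Hidden"), ("T", "models"), ("D", "1999")],
]
trans_p, emit_p = estimate_hmm(data)
print(trans_p["A"])  # probabilities of the label that follows an author token
```

Each row of either matrix sums to 1, which is what the HTML tables would be expected to show if the tool follows the same approach.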
To perform the actual extraction, execute:
$ php run-extract.php
This should print out the labeled fields of a reference.
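If the model is a hidden Markov model, labeling a reference amounts to finding the most probable label sequence for its tokens, which is usually done with the Viterbi algorithm. The sketch below is in Python (not the project's PHP code) and is only an illustration: the function name viterbi, the probability tables, and the fallback probability for unseen tokens are made up for this example, and run-extract.php may do this differently.

```python
import math

def viterbi(tokens, labels, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable label sequence for tokens (log-space Viterbi).

    Assumes every start and transition probability is greater than zero.
    Tokens missing from emit_p get the small fallback probability `unseen`.
    """
    if not tokens:
        return []

    def emit(label, token):
        return math.log(emit_p[label].get(token, unseen))

    # best[i][label] = (log-probability of the best path ending in label
    #                   at position i, label chosen at position i - 1)
    best = [{l: (math.log(start_p[l]) + emit(l, tokens[0]), None) for l in labels}]
    for token in tokens[1:]:
        column = {}
        for l in labels:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][l]), p) for p in labels
            )
            column[l] = (score + emit(l, token), prev)
        best.append(column)

    # Trace back from the best final label to the first token.
    label = max(labels, key=lambda l: best[-1][l][0])
    path = [label]
    for column in reversed(best[1:]):
        label = column[label][1]
        path.append(label)
    return path[::-1]

# Hypothetical model: A = author, T = title, D = date.
labels = ["A", "T", "D"]
start_p = {"A": 0.8, "T": 0.1, "D": 0.1}
trans_p = {
    "A": {"A": 0.5, "T": 0.4, "D": 0.1},
    "T": {"A": 0.05, "T": 0.6, "D": 0.35},
    "D": {"A": 0.1, "T": 0.1, "D": 0.8},
}
emit_p = {
    "A": {"smith": 0.5, "j.": 0.5},
    "T": {"parsing": 0.5, "references": 0.5},
    "D": {"2001": 1.0},
}
tokens = ["smith", "j.", "parsing", "references", "2001"]
print(list(zip(tokens, viterbi(tokens, labels, start_p, trans_p, emit_p))))
```

Working in log space avoids numeric underflow on long references, since the product of many small probabilities becomes a sum.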
training-data.txt contains some hand-labeled data with the following fields:
<T> TITLE
<A> AUTHOR
<D> DATE
<P> PAGES
<V> VOLUME
<J> JOURNAL
<N> NUMBER
<U> URL
<B> PUBLISHER
<L> LOCATION
Of course, you can create your own training data and capture the fields you want. The quality of extraction depends greatly on the volume of this data.
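The exact layout of training-data.txt is not described here, so the following Python sketch (not the project's PHP code) rests on an assumption: each field starts with its tag and runs until the next tag. Under that assumption, a labeled reference can be split into (label, token) pairs like this. The function name parse_labeled_line and the sample line are hypothetical.

```python
import re

# Matches a single-letter tag such as <A> or <T> (assumed tag format).
TAG = re.compile(r"<([A-Z])>")

def parse_labeled_line(line):
    """Split a tagged reference into (label, token) pairs."""
    # With one capture group, split returns:
    # [text before first tag, tag, text, tag, text, ...]
    parts = TAG.split(line)
    pairs = []
    for label, text in zip(parts[1::2], parts[2::2]):
        pairs.extend((label, token) for token in text.split())
    return pairs

# Hypothetical labeled reference, not taken from training-data.txt.
line = "<A> Smith, J. <T> Parsing references <D> 2001"
print(parse_labeled_line(line))
```

Adding a new field under this assumption would only mean choosing another single-letter tag and labeling some references with it.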