MTE Parsers

Steven Lu edited this page Aug 20, 2021 · 1 revision

MTE Parser Indexer


The MTE Parser Indexer contains one base parser and seven parsers, each built for a specific purpose.

TODO: Update the following links once the parser scripts are merged from the parser branch into the master branch.

Base Parser: The base class from which all other parsers should inherit.

TIKA Parser: The TIKA parser uses the Apache Tika service to convert PDF files to text files.

ADS Parser: The ADS parser utilizes the search API of the Astrophysics Data System (ADS) to extract information including title, author, primary author, affiliation, publication venue, and publication date.

CoreNLP Parser: The CoreNLP parser uses the Named Entity Recognition (NER) sub-module of the Stanford CoreNLP package to label words with named-entity types (e.g., target, mineral, element).

JSRE Parser: The JSRE parser utilizes the Java Simple Relation Extraction (JSRE) toolkit to extract relations between named entities.

Paper Parser: The Paper parser is a generic parser suitable for papers from any publication venue. It modifies or removes content in ways that are general to all papers (e.g., translating some UTF-8 punctuation to ASCII and removing hyphenation at the end of lines).

LPSC Parser: The LPSC parser is designed for the two-page abstracts of the Lunar and Planetary Science Conference (LPSC). It uses regular-expression matching to remove content specific to LPSC abstracts (e.g., abstract id, conference header).

JGR Parser: The JGR parser is designed for papers from the Journal of Geophysical Research.
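
The clean-up steps described for the Paper and LPSC parsers can be illustrated with a minimal sketch. This is not the project's code: the punctuation table, the function names `clean_generic` and `clean_lpsc`, and both regular expressions are assumptions made for illustration, including the assumed shape of the LPSC header line and abstract id.

```python
import re

# Hypothetical punctuation table; the table the real Paper parser uses is not shown here.
PUNCT_MAP = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
    "\u2026": "...",                # horizontal ellipsis
}
_PUNCT_TABLE = str.maketrans(PUNCT_MAP)

# Assumed shape of an LPSC abstract id, e.g. "1234.pdf".
LPSC_ABSTRACT_ID = re.compile(r"\b\d{4}\.pdf\b")
# Assumed shape of the conference header: any line naming the conference.
LPSC_HEADER = re.compile(
    r"^.*Lunar and Planetary Science Conference.*$\n?", re.MULTILINE
)


def clean_generic(text):
    """Clean-up general to all papers (Paper parser style)."""
    # Translate some UTF-8 punctuation to ASCII.
    text = text.translate(_PUNCT_TABLE)
    # Re-join words hyphenated across a line break: "spectro-\nmeter" -> "spectrometer".
    # Caveat: this also joins genuine compounds that happen to break at the hyphen.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def clean_lpsc(text):
    """LPSC-specific clean-up, followed by the generic steps."""
    text = LPSC_ABSTRACT_ID.sub("", text)
    text = LPSC_HEADER.sub("", text)
    return clean_generic(text)


sample = (
    "47th Lunar and Planetary Science Conference (2016) 1234.pdf\n"
    "The spectro-\nmeter saw \u201ciron\u201d oxides.\n"
)
print(clean_lpsc(sample))
```

Running the venue-specific step first and then delegating to the generic step mirrors the description above, in which the LPSC parser handles only what is specific to LPSC abstracts.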

The class diagram of the parsers is shown below:

MTE Parser class diagram
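
The diagram itself is not reproduced here, so the following is only a rough sketch of the inheritance idea. The option lists further down this page grow from parser to parser, which suggests that each parser builds on another; the class names, the `parse` method, and the returned fields below are all hypothetical and do not come from the MTE code.

```python
class BaseParser:
    """Hypothetical base class: subclasses are expected to override parse()."""

    def parse(self, text):
        raise NotImplementedError("subclasses must implement parse()")


class TextParser(BaseParser):
    """Hypothetical first layer: returns the cleaned text."""

    def parse(self, text):
        return {"content": text.strip()}


class WordCountParser(TextParser):
    """Hypothetical second layer: reuses the parent's result and adds a field."""

    def parse(self, text):
        result = super().parse(text)
        result["n_words"] = len(result["content"].split())
        return result


print(WordCountParser().parse("  two words "))
```

Each layer calls `super().parse()` and extends the returned dictionary, so a parser further down the hierarchy keeps everything its ancestors produced.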


  • TIKA Parser
>>> python -h
usage: [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE
                      [-l LOG_FILE] [-p TIKA_SERVER_URL]

optional arguments:
  -h, --help            show this help message and exit
  -i IN_FILE, --in_file IN_FILE
                        Path to input file
  -li IN_LIST, --in_list IN_LIST
                        Path to input list
  -o OUT_FILE, --out_file OUT_FILE
                        Path to output JSON file
  -l LOG_FILE, --log_file LOG_FILE
                        Log file that contains processing information. It is
                        default to ./tika-parser-log.txt unless otherwise
  -p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
                        Tika server URL
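
The usage text above can be reproduced with Python's `argparse` module. The sketch below is a reconstruction from the help output, not the parser's actual source: the function name `build_arg_parser` is hypothetical, and the default of `None` for `-p` is an assumption because the help text does not state one.

```python
import argparse


def build_arg_parser():
    """Rebuild the TIKA parser's command-line interface from its help text."""
    parser = argparse.ArgumentParser()

    # "(-i IN_FILE | -li IN_LIST)": exactly one of the two must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-i", "--in_file", help="Path to input file")
    source.add_argument("-li", "--in_list", help="Path to input list")

    # "-o OUT_FILE" appears without brackets in the usage line, so it is required.
    parser.add_argument("-o", "--out_file", required=True,
                        help="Path to output JSON file")
    parser.add_argument("-l", "--log_file", default="./tika-parser-log.txt",
                        help="Log file that contains processing information")
    # Assumption: no default URL is documented, so None is used here.
    parser.add_argument("-p", "--tika_server_url", default=None,
                        help="Tika server URL")
    return parser


args = build_arg_parser().parse_args(["-i", "paper.pdf", "-o", "out.json"])
print(args.in_file, args.out_file, args.log_file)
```

Passing both `-i` and `-li`, or neither, makes `argparse` print an error and exit, which matches the parenthesised group in the usage line.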

Note that the -p TIKA_SERVER_URL argument is optional. The following command is an example of using the TIKA parser:

  • ADS Parser
>>> python -h
usage: [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE [-l LOG_FILE]
                     [-p TIKA_SERVER_URL] [-a ADS_URL] [-t ADS_TOKEN]

optional arguments:
  -h, --help            show this help message and exit
  -i IN_FILE, --in_file IN_FILE
                        Path to input file
  -li IN_LIST, --in_list IN_LIST
                        Path to input list
  -o OUT_FILE, --out_file OUT_FILE
                        Path to output JSON file
  -l LOG_FILE, --log_file LOG_FILE
                        Log file that contains processing information. It is
                        default to ./ads-parser-log.txt unless otherwise
  -p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
                        Tika server URL
  -a ADS_URL, --ads_url ADS_URL
                        ADS RESTful API. The ADS RESTful API should not need
                        to be changed frequently unless someting at the ADS is
  -t ADS_TOKEN, --ads_token ADS_TOKEN
                        The ADS token, which is required to use the ADS
                        RESTful API. The token was obtained using the
                        instructions at
                        api#access. The ADS token should not need to be
                        changed frequently unless something at the ADS is
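
To illustrate how the -a and -t values might be combined, here is a minimal sketch that builds, but does not send, a request to the ADS search API. The endpoint shown is the public ADS search URL and is only assumed to match the parser's default; the function name `build_ads_request`, the title-based query, and the choice of returned fields are likewise assumptions, picked to match the metadata listed for the ADS parser.

```python
import urllib.parse
import urllib.request

# Assumed value for ADS_URL; this page does not state the parser's default.
ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"


def build_ads_request(title, ads_token, ads_url=ADS_SEARCH_URL):
    """Build an ADS search request for one paper title (hypothetical query)."""
    params = {
        # Assumption: look the paper up by its title.
        "q": 'title:"%s"' % title,
        # ADS field names for title, authors, primary author, affiliation,
        # publication venue, and publication date.
        "fl": "title,author,first_author,aff,pub,pubdate",
        "rows": 1,
    }
    url = ads_url + "?" + urllib.parse.urlencode(params)
    # The ADS API expects the token as a Bearer token in the Authorization header.
    headers = {"Authorization": "Bearer " + ads_token}
    return urllib.request.Request(url, headers=headers)


request = build_ads_request("Mars", "MY_ADS_TOKEN")
print(request.full_url)
```

Sending the request, for example with `urllib.request.urlopen(request)`, requires a valid token and network access, so that step is left out of the sketch.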

The example command is shown below:

  • CoreNLP Parser
>>> python -h
usage: [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE
                         [-l LOG_FILE] [-p TIKA_SERVER_URL]
                         [-c CORENLP_SERVER_URL] [-n NER_MODEL] [-a ADS_URL]
                         [-t ADS_TOKEN]

optional arguments:
  -h, --help            show this help message and exit
  -i IN_FILE, --in_file IN_FILE
                        Path to input file
  -li IN_LIST, --in_list IN_LIST
                        Path to input list
  -o OUT_FILE, --out_file OUT_FILE
                        Path to output JSON file
  -l LOG_FILE, --log_file LOG_FILE
                        Log file that contains processing information. It is
                        default to ./corenlp-parser-log.txt unless otherwise
  -p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
                        Tika server URL
  -c CORENLP_SERVER_URL, --corenlp_server_url CORENLP_SERVER_URL
                        CoreNLP Server URL
  -n NER_MODEL, --ner_model NER_MODEL
                        Path to a Named Entity Recognition (NER) model
  -a ADS_URL, --ads_url ADS_URL
                        ADS RESTful API. The ADS RESTful API should not need
                        to be changed frequently unless someting at the ADS is
  -t ADS_TOKEN, --ads_token ADS_TOKEN
                        The ADS token, which is required to use the ADS
                        RESTful API. The token was obtained using the
                        instructions at
                        api#access. The ADS token should not need to be
                        changed frequently unless something at the ADS is
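
As a rough illustration of what can be done with the NER output, the sketch below groups consecutive tokens that share a label into entities. It assumes the JSON layout returned by the CoreNLP server (`sentences`, `tokens`, `word`, `ner`); the function name `extract_entities`, the sample response, and the label spellings are hypothetical, and the MTE parser's actual post-processing may differ.

```python
def extract_entities(response):
    """Merge consecutive tokens with the same non-'O' NER label into entities."""
    entities = []
    for sentence in response.get("sentences", []):
        current = None  # entity being extended within this sentence, if any
        for token in sentence.get("tokens", []):
            label = token.get("ner", "O")
            if label == "O":
                # 'O' marks a token outside any named entity.
                current = None
                continue
            if current is not None and current["label"] == label:
                current["text"] += " " + token["word"]
            else:
                current = {"text": token["word"], "label": label}
                entities.append(current)
    return entities


# Hand-written sample in the assumed response layout (not real server output).
sample_response = {
    "sentences": [
        {
            "tokens": [
                {"word": "Gale", "ner": "Target"},
                {"word": "Crater", "ner": "Target"},
                {"word": "contains", "ner": "O"},
                {"word": "hematite", "ner": "Mineral"},
            ]
        }
    ]
}
print(extract_entities(sample_response))
```

Grouping stops at sentence boundaries, so an entity is never stitched together across two sentences.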

The example command is shown below:

  • JSRE Parser
>>> python -h
usage: [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE
                      [-l LOG_FILE] [-p TIKA_SERVER_URL]
                      [-c CORENLP_SERVER_URL] [-n NER_MODEL] [-jr JSRE_ROOT]
                      -jm JSRE_MODEL [-jt JSRE_TMP_DIR] [-a ADS_URL]
                      [-t ADS_TOKEN]

optional arguments:
  -h, --help            show this help message and exit
  -i IN_FILE, --in_file IN_FILE
                        Path to input file
  -li IN_LIST, --in_list IN_LIST
                        Path to input list
  -o OUT_FILE, --out_file OUT_FILE
                        Path to output JSON file
  -l LOG_FILE, --log_file LOG_FILE
                        Log file that contains processing information. It is
                        default to ./jsre-parser-log.txt unless otherwise
  -p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
                        Tika server URL
  -c CORENLP_SERVER_URL, --corenlp_server_url CORENLP_SERVER_URL
                        CoreNLP Server URL
  -n NER_MODEL, --ner_model NER_MODEL
                        Path to a Named Entity Recognition (NER) model
  -jr JSRE_ROOT, --jsre_root JSRE_ROOT
                        Path to jSRE installation directory. Default is
  -jm JSRE_MODEL, --jsre_model JSRE_MODEL
                        Path to jSRE model
  -jt JSRE_TMP_DIR, --jsre_tmp_dir JSRE_TMP_DIR
                        Path to a directory for jSRE to temporarily store
                        input and output files. Default is /tmp
  -a ADS_URL, --ads_url ADS_URL
                        ADS RESTful API. The ADS RESTful API should not need
                        to be changed frequently unless someting at the ADS is
  -t ADS_TOKEN, --ads_token ADS_TOKEN
                        The ADS token, which is required to use the ADS
                        RESTful API. The token was obtained using the
                        instructions at
                        api#access. The ADS token should not need to be
                        changed frequently unless something at the ADS is

The example command is shown below:

  • Paper Parser
>>> python -h
usage: [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE
                       [-l LOG_FILE] [-p TIKA_SERVER_URL]
                       [-c CORENLP_SERVER_URL] [-n NER_MODEL] [-jr JSRE_ROOT]
                       -jm JSRE_MODEL [-jt JSRE_TMP_DIR] [-a ADS_URL]
                       [-t ADS_TOKEN]

optional arguments:
  -h, --help            show this help message and exit
  -i IN_FILE, --in_file IN_FILE
                        Path to input file
  -li IN_LIST, --in_list IN_LIST
                        Path to input list
  -o OUT_FILE, --out_file OUT_FILE
                        Path to output JSON file
  -l LOG_FILE, --log_file LOG_FILE
                        Log file that contains processing information. It is
                        default to ./paper-parser-log.txt unless otherwise
  -p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
                        Tika server URL
  -c CORENLP_SERVER_URL, --corenlp_server_url CORENLP_SERVER_URL
                        CoreNLP Server URL
  -n NER_MODEL, --ner_model NER_MODEL
                        Path to a Named Entity Recognition (NER) model
  -jr JSRE_ROOT, --jsre_root JSRE_ROOT
                        Path to jSRE installation directory. Default is
  -jm JSRE_MODEL, --jsre_model JSRE_MODEL
                        Path to jSRE model
  -jt JSRE_TMP_DIR, --jsre_tmp_dir JSRE_TMP_DIR
                        Path to a directory for jSRE to temporarily store
                        input and output files. Default is /tmp
  -a ADS_URL, --ads_url ADS_URL
                        ADS RESTful API. The ADS RESTful API should not need
                        to be changed frequently unless someting at the ADS is
  -t ADS_TOKEN, --ads_token ADS_TOKEN
                        The ADS token, which is required to use the ADS
                        RESTful API. The token was obtained using the
                        instructions at
                        api#access. The ADS token should not need to be
                        changed frequently unless something at the ADS is

The example command is shown below:

  • LPSC Parser
>>> python -h
usage: [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE
                      [-l LOG_FILE] [-p TIKA_SERVER_URL]
                      [-c CORENLP_SERVER_URL] [-n NER_MODEL] [-jr JSRE_ROOT]
                      -jm JSRE_MODEL [-jt JSRE_TMP_DIR] [-a ADS_URL]
                      [-t ADS_TOKEN]

optional arguments:
  -h, --help            show this help message and exit
  -i IN_FILE, --in_file IN_FILE
                        Path to input file
  -li IN_LIST, --in_list IN_LIST
                        Path to input list
  -o OUT_FILE, --out_file OUT_FILE
                        Path to output JSON file
  -l LOG_FILE, --log_file LOG_FILE
                        Log file that contains processing information. It is
                        default to ./lpsc-parser-log.txt unless otherwise
  -p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
                        Tika server URL
  -c CORENLP_SERVER_URL, --corenlp_server_url CORENLP_SERVER_URL
                        CoreNLP Server URL
  -n NER_MODEL, --ner_model NER_MODEL
                        Path to a Named Entity Recognition (NER) model
  -jr JSRE_ROOT, --jsre_root JSRE_ROOT
                        Path to jSRE installation directory. Default is
  -jm JSRE_MODEL, --jsre_model JSRE_MODEL
                        Path to jSRE model
  -jt JSRE_TMP_DIR, --jsre_tmp_dir JSRE_TMP_DIR
                        Path to a directory for jSRE to temporarily store
                        input and output files. Default is /tmp
  -a ADS_URL, --ads_url ADS_URL
                        ADS RESTful API. The ADS RESTful API should not need
                        to be changed frequently unless someting at the ADS is
  -t ADS_TOKEN, --ads_token ADS_TOKEN
                        The ADS token, which is required to use the ADS
                        RESTful API. The token was obtained using the
                        instructions at
                        api#access. The ADS token should not need to be
                        changed frequently unless something at the ADS is

The example command is shown below:

  • JGR Parser
>>> python -h
usage: [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE [-l LOG_FILE]
                     [-p TIKA_SERVER_URL] [-c CORENLP_SERVER_URL]
                     [-n NER_MODEL] [-jr JSRE_ROOT] -jm JSRE_MODEL
                     [-jt JSRE_TMP_DIR] [-a ADS_URL] [-t ADS_TOKEN]

optional arguments:
  -h, --help            show this help message and exit
  -i IN_FILE, --in_file IN_FILE
                        Path to input file
  -li IN_LIST, --in_list IN_LIST
                        Path to input list
  -o OUT_FILE, --out_file OUT_FILE
                        Path to output JSON file
  -l LOG_FILE, --log_file LOG_FILE
                        Log file that contains processing information. It is
                        default to ./jgr-parser-log.txt unless otherwise
  -p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
                        Tika server URL
  -c CORENLP_SERVER_URL, --corenlp_server_url CORENLP_SERVER_URL
                        CoreNLP Server URL
  -n NER_MODEL, --ner_model NER_MODEL
                        Path to a Named Entity Recognition (NER) model
  -jr JSRE_ROOT, --jsre_root JSRE_ROOT
                        Path to jSRE installation directory. Default is
  -jm JSRE_MODEL, --jsre_model JSRE_MODEL
                        Path to jSRE model
  -jt JSRE_TMP_DIR, --jsre_tmp_dir JSRE_TMP_DIR
                        Path to a directory for jSRE to temporarily store
                        input and output files. Default is /tmp
  -a ADS_URL, --ads_url ADS_URL
                        ADS RESTful API. The ADS RESTful API should not need
                        to be changed frequently unless someting at the ADS is
  -t ADS_TOKEN, --ads_token ADS_TOKEN
                        The ADS token, which is required to use the ADS
                        RESTful API. The token was obtained using the
                        instructions at
                        api#access. The ADS token should not need to be
                        changed frequently unless something at the ADS is

The example command is shown below: