A competency extractor using CoreNLP dependency parsing and Semgrex


CompEx


Extract competency triples from written text.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

  • Python 3.7
  • pipenv
  • (optional) pyenv to install the required Python automatically
    • If pyenv is not installed, Python 3.7 must already be available; with pyenv, it is installed for you
  • Java JRE 1.8+ for CoreNLP server
  • Stanford CoreNLP

Installing

Set up a Python virtual environment and download all dependencies

$ pipenv install --dev

CompEx requires an installation of CoreNLP with German models. Download the required CoreNLP Java server and German models from here to a destination of your choosing. You can use the following script to automate this process; it downloads all required files to ./.corenlp:

$ ./download_corenlp.sh

Enter the pipenv virtual environment

$ pipenv shell

Running

Set the environment variable $CORENLP_HOME to the directory where CoreNLP and the German models are located. If you used the helper script download_corenlp.sh, the files are in ./.corenlp.

$ export CORENLP_HOME=./.corenlp
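If the variable is missing or points at the wrong directory, the CoreNLP server will fail to start later. A minimal sketch of the kind of sanity check you can run yourself beforehand (this helper is illustrative and not part of compex):

```python
import os
from pathlib import Path

def check_corenlp_home() -> Path:
    """Verify that $CORENLP_HOME points at a directory containing CoreNLP jars.

    Illustrative helper, not part of compex itself.
    """
    home = os.environ.get("CORENLP_HOME")
    if not home:
        raise RuntimeError("CORENLP_HOME is not set; see the Running section")
    path = Path(home)
    if not any(path.glob("*.jar")):
        raise RuntimeError(f"no CoreNLP jars found in {path}")
    return path
```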

Show help

$ python -m compex -h

Extraction

Show help

$ python -m compex extract -h

Extract competencies from a simple sentence (you can pipe text data into compex!)

$ echo "Die studierenden beherrschen grundlegende Techniken des wissenschaftlichen Arbeitens." | python -m compex extract

or use a file

$ python -m compex extract testsentences.txt

or use stdin

$ python -m compex extract < testsentences.txt
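The three invocations above differ only in where the text comes from; conceptually, the CLI falls back to stdin when no file argument is given. A sketch of that pattern (not compex's actual implementation):

```python
import sys
from typing import Optional

def read_sentences(path: Optional[str] = None) -> str:
    """Return the input text: from a file if a path is given, else from stdin."""
    if path is not None:
        with open(path, encoding="utf-8") as f:
            return f.read()
    return sys.stdin.read()
```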

Check for taxonomy verbs. This checks whether a found competency verb is in the given taxonomy verb dictionary; if not, it is ignored. In addition, this parameter fills the taxonomy_dimension field of the extracted competency. You can use the sample file blooms_taxonomy.json.

$ python -m compex extract --taxonomyjson blooms_taxonomy.json testsentences.txt
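Conceptually, the taxonomy check is a dictionary lookup: find the dimension whose verb list contains the extracted verb. A sketch assuming the taxonomy maps dimension names to verb lists (the actual schema of blooms_taxonomy.json may differ, and the excerpt below is hypothetical):

```python
from typing import Dict, List, Optional

def taxonomy_dimension(verb: str, taxonomy: Dict[str, List[str]]) -> Optional[str]:
    """Return the taxonomy dimension whose verb list contains the verb, or None."""
    for dimension, verbs in taxonomy.items():
        if verb in verbs:
            return dimension
    return None

# Hypothetical excerpt; the real blooms_taxonomy.json may be structured differently.
blooms = {
    "knowledge": ["nennen", "beschreiben"],
    "application": ["beherrschen", "anwenden"],
}
```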

Sample output on stdout (formatted for better readability)

{
    "Die studierenden beherrschen grundlegende Techniken des wissenschaftlichen Arbeitens.": [
        {
            "objects": [],
            "taxonomy_dimension": null,
            "word": {
                "index": 2,
                "word": "beherrschen"
            }
        }
    ]
}
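The output maps each input sentence to its list of extracted competencies, so downstream code can consume it with plain json, e.g.:

```python
import json

# The sample output shown above, as emitted on stdout.
raw = """{
    "Die studierenden beherrschen grundlegende Techniken des wissenschaftlichen Arbeitens.": [
        {
            "objects": [],
            "taxonomy_dimension": null,
            "word": {"index": 2, "word": "beherrschen"}
        }
    ]
}"""

result = json.loads(raw)
for sentence, competencies in result.items():
    verbs = [c["word"]["word"] for c in competencies]
    print(verbs)  # ['beherrschen']
```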

Evaluation

Evaluate compex against pre-annotated data. Outputs recall, precision, and F1. To evaluate, a pre-annotated WebAnno TSV 3.2 file is needed. See here for the file format. You can use WebAnno to annotate data and evaluate compex with it. This repository contains pre-annotated data from module handbooks of Department VI of Beuth University of Applied Sciences Berlin. It can be found here: tests/resources/bht-annotated. The corresponding WebAnno project is located at tests/resources/webanno/BHT+Test_2020-03-22_1808.zip.

Show help

$ python -m compex evaluate -h

Evaluate only competency verbs

$ python -m compex evaluate tests/resources/test.tsv

Evaluate competency verbs and objects

$ python -m compex evaluate --objects tests/resources/test.tsv

Evaluate competency verbs, objects and contexts

$ python -m compex evaluate --objects --contexts tests/resources/test.tsv

It is possible to use a dedicated taxonomy JSON file, just like with the extract function

$ python -m compex evaluate --taxonomyjson blooms_taxonomy.json tests/resources/test.tsv

Sample evaluation output on stdout (formatted for better readability)

{
    "f1": 0.5024705551113972,
    "negatives": {
        "false": 168.36206347622323,
        "true": 81.63793652377686
    },
    "positives": {
        "false": 137.53333333333336,
        "true": 154.4666666666666
    },
    "precision": 0.5289954337899542,
    "recall": 0.4784786862008745
}
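The scores follow the standard precision/recall/F1 definitions over the (possibly fractional) true/false positive and negative counts; recomputing them from the counts above reproduces the reported values:

```python
def precision_recall_f1(tp: float, fp: float, fn: float):
    """Standard precision, recall, and F1 from (possibly fractional) counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the sample evaluation output above.
p, r, f1 = precision_recall_f1(
    tp=154.4666666666666,
    fp=137.53333333333336,
    fn=168.36206347622323,
)
# p ≈ 0.5290, r ≈ 0.4785, f1 ≈ 0.5025 — matching the sample output
```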

Running the tests

Run the unit tests. A CoreNLP installation in ./.corenlp is required!

$ pytest

Get test coverage

Run coverage

$ coverage run --source=./compex/ -m pytest

Export the coverage report as HTML

$ coverage html

Generate coverage badge

$ coverage-badge -o coverage.svg

Built With

Authors

  • Timo Raschke - Initial work - traschke

License

This project is licensed under the MIT License - see the LICENSE file for details

Acknowledgments

Sources for Bloom's Taxonomy verbs: