The application extracts information about the annotation schemes and languages used in annotated language resources such as corpora, in various formats (e.g. RDF, CoNLL and XML). Results are made available via a web interface that can be used to edit and export the harvested metadata. A forthcoming paper and the manual in the doc folder describe the use case in more depth.
Annohub was developed in the context of the Specialized Information Service Linguistics (FID), funded by the German Research Foundation (DFG/LIS, 2017-2019).
Installation
-
Prerequisites
- Linux/Unix distribution
- Java runtime >= 1.8
- 7z (7za) file archiver utility
- rapper RDF utility (http://librdf.org/raptor/)
- TomEE >= 7.1.0 (https://tomee.apache.org/)
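The command-line prerequisites above can be checked before starting. The following is a minimal sketch, not part of Annohub: the helper names `missing_tools` and `has_archiver` are made up for illustration, and the script only checks that the commands are on the PATH (it does not verify the Java version or the TomEE installation).

```shell
#!/bin/sh
# Print the name of every given command that is not found on PATH
# (one per line); print nothing if all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# 7z and 7za are alternatives, so one of the two is enough.
has_archiver() {
    command -v 7z >/dev/null 2>&1 || command -v 7za >/dev/null 2>&1
}

missing=$(missing_tools java rapper)
if [ -n "$missing" ]; then
    echo "Missing tools: $missing"
fi
if ! has_archiver; then
    echo "Missing archiver: install 7z or 7za"
fi
```

If the script prints nothing, the command-line tools were found.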
-
Download TinkerPop Gremlin Server version 3.3.10
-
Unpack the file and install the neo4j-gremlin driver
cd apache-tinkerpop-gremlin-server-3.3.10
bin/gremlin-server.sh install org.apache.tinkerpop neo4j-gremlin 3.3.10
Plugin installation is handled by Grape, which resolves dependencies into the classpath. If you run into problems, further information on setting up Grape is available at https://tinkerpop.apache.org/docs/current/reference/#neo4j-gremlin
-
Edit the Gremlin Server configuration file conf/neo4j-empty.properties to set the server's database directory
gremlin.neo4j.directory=/your/server/directory
-
Start the server
bin/gremlin-server.sh
-
Edit the Annohub configuration file (you can use /src/main/resources/FIDConfig.xml as a template)
Database setup
a. Gremlin.Server.home - /your/path/to/apache-tinkerpop-gremlin-server-3.3.10
b. Gremlin.Server.conf - /your/path/to/apache-tinkerpop-gremlin-server-3.3.10/conf/gremlin-server-neo4j.yaml
c. Gremlin.Server.data - another database directory (this must be different from the directory set as gremlin.neo4j.directory in conf/neo4j-empty.properties!)
Application setup
a. RunParameter.downloadFolder - crawler-download-directory (e.g. /tmp/annohub/downloads)
b. RunParameter.ServiceUploadDirectory - web-application-upload-directory (e.g. /tmp/annohub/uploads)
c. RunParameter.decompressionUtility - enter 7z or 7za, depending on which one is installed
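The download and upload directories named in the application setup can be created up front. This is a small sketch using the example paths from above; whether Annohub creates missing directories on its own is not stated here, so creating them beforehand is only a precaution.

```shell
# Example paths taken from the settings above; adjust them to your setup.
DOWNLOAD_DIR=/tmp/annohub/downloads
UPLOAD_DIR=/tmp/annohub/uploads

# -p creates missing parent directories and does not fail if they exist.
mkdir -p "$DOWNLOAD_DIR" "$UPLOAD_DIR"
```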
-
For easier maintenance of your configuration, you can set the environment variable FID_CONFIG_FILE to the location of your configuration file
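For example, the variable can be exported in the shell (or in a shell profile) before running Annohub. The path below is a hypothetical location chosen for illustration; use the location of your own configuration file.

```shell
# Hypothetical location; point this at your own copy of the configuration file.
export FID_CONFIG_FILE="$HOME/annohub/FIDConfig.xml"

# Warn early if the file is not there yet.
if [ ! -f "$FID_CONFIG_FILE" ]; then
    echo "Warning: $FID_CONFIG_FILE does not exist" >&2
fi
```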
-
Build the Annohub application with maven
mvn clean install
-
Initialize the Annohub model database
run.sh -init
-
After initialization has finished, you can parse data
run.sh -execute -seed seed_file
where seed_file contains a list of language resource URLs (one URL per line)
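A seed file in the format described above can be written as follows. The two URLs are placeholders for illustration only and do not point to real resources; replace them with the locations of actual language resources.

```shell
# Placeholder URLs (not real resources); one URL per line.
cat > seed_file <<'EOF'
https://example.org/corpora/sample-corpus.conllu
https://example.org/corpora/sample-annotations.rdf.gz
EOF

# Show how many entries the seed file contains.
wc -l < seed_file
```

The resulting file can then be passed to run.sh with the -seed option as shown above.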
-
For the deployment of the Annohub web-application an installation of TomEE (https://tomee.apache.org/) is required.
Please consider the following configuration options:
- CATALINA_OPTS=-Xmx4g -Xss5m
- in context.xml set <Resources cachingAllowed="true" cacheMaxSize="100000" />
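One common way to set CATALINA_OPTS is a bin/setenv.sh file, which the Tomcat startup scripts that TomEE is based on read if it exists. The sketch below assumes that CATALINA_HOME points at your TomEE installation; the fallback directory is a scratch path used only so the example runs on its own. Note that it overwrites an existing setenv.sh.

```shell
# Assumption: CATALINA_HOME is your TomEE directory; the fallback is a scratch path.
CATALINA_HOME="${CATALINA_HOME:-/tmp/annohub/tomee}"
mkdir -p "$CATALINA_HOME/bin"

# Write the memory settings listed above (overwrites an existing setenv.sh).
cat > "$CATALINA_HOME/bin/setenv.sh" <<'EOF'
#!/bin/sh
# Heap size and thread stack size for the Annohub web application
export CATALINA_OPTS="-Xmx4g -Xss5m"
EOF
chmod +x "$CATALINA_HOME/bin/setenv.sh"
```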