TeXoo – A Zoo of Text Extractors

TeXoo is a framework for Deep Learning based text analytics in Java, developed at DATEXIS, Beuth University of Applied Sciences Berlin. TeXoo comes with an NLP-style document model and a zoo of Deep Learning extraction models, which you can access in the texoo-models module. Here is a brief overview:

Features

Java Framework for language-independent text extraction
Language-independent document model
Convenient document readers with tokenization and sentence splitting
Named Entity Recognition
Named Entity Linking
Topic Classification and Segmentation

Getting Started

These instructions will get you a copy of TeXoo up and running on your local machine for development and testing purposes. If you only want to use TeXoo as a Maven dependency, you can skip this step.

Prerequisites

TeXoo comes with a Dockerfile that contains all software necessary to run on most systems, including the CUDA 10 toolkit for GPUs.

Docker Container platform
https://docs.docker.com/install/
nvidia-docker v2 for CUDA 10 support
https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)

The following dependencies are required if you are planning to run TeXoo locally. They are already contained in the Dockerfile:

OpenJDK 8
Apache Maven Build system for Java
https://maven.apache.org/guides/index.html

Installation

First, we need to build a Docker image with all dependencies (including CUDA 10.1):

run docker build -t texoo .

And then we're ready to build TeXoo from source:

run bin/run-docker texoo-build

Usage

Command Line

The bin/ directory contains several run scripts. You can start them directly in the Docker container, e.g. to run all JUnit tests:

run bin/run-docker texoo-test
or run bin/run-docker-cuda texoo-test

See the Modules Overview for more examples.

Maven Dependency

To use TeXoo NER in your Java project, just add the following dependencies to your pom.xml:

<dependency>
  <groupId>de.datexis</groupId>
  <artifactId>texoo-core</artifactId>
  <version>1.3.3</version>
  <type>jar</type>
</dependency>
<dependency>
  <groupId>de.datexis</groupId>
  <artifactId>texoo-entity-recognition</artifactId>
  <version>1.3.3</version>
  <type>jar</type>
</dependency>

To enable CUDA support, add the following dependencies to your project. The ${dl4j.version} placeholder must be defined as a property in your pom.xml:

<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-cuda-9.2-platform</artifactId>
  <version>${dl4j.version}</version>
</dependency>
<!-- DL4j cuDNN -->
<dependency>
  <groupId>org.deeplearning4j</groupId>
  <artifactId>deeplearning4j-cuda-9.2</artifactId>
  <version>${dl4j.version}</version>
</dependency>
<!-- DL4j CUDA + cuDNN binaries -->
<dependency>
  <groupId>org.bytedeco.javacpp-presets</groupId>
  <artifactId>cuda</artifactId>
  <version>9.2-7.1-1.4.2</version>
  <classifier>linux-x86_64-redist</classifier>
</dependency>

To enable AVX512 CPU optimizations, add the following dependency to your project:

<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-native</artifactId>
  <classifier>linux-x86_64-avx512</classifier>
</dependency>

See the texoo-examples module for implementation examples.

texoo-core – Document Model and Core Library

Package / Class	Description
de.datexis.model	TeXoo Document model (see below)
de.datexis.encoder	Implementations of Bag-of-words, Word2Vec, Trigrams, etc.
DocumentFactory	Factory to create Document objects from text
RawTextDatasetReader	Reader to create Datasets from files
AnnotatorFactory	Factory to create and load models from the zoo
ObjectSerializer	Helper methods to import/export JSON
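To illustrate what a bag-of-words encoder from the table above does conceptually, here is a minimal, self-contained sketch. It is not TeXoo's implementation: the class name BagOfWordsSketch, the whitespace tokenizer and the lowercasing are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical bag-of-words encoder; NOT part of de.datexis.encoder. */
final class BagOfWordsSketch {

  // Maps each known word to its position in the output vector.
  private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

  /** Builds the vocabulary from lowercased, whitespace-separated tokens. */
  BagOfWordsSketch(List<String> trainingSentences) {
    for (String sentence : trainingSentences) {
      for (String token : tokenize(sentence)) {
        vocabulary.putIfAbsent(token, vocabulary.size());
      }
    }
  }

  /** Returns a count vector; words outside the vocabulary are ignored. */
  double[] encode(String text) {
    double[] vector = new double[vocabulary.size()];
    for (String token : tokenize(text)) {
      Integer index = vocabulary.get(token);
      if (index != null) {
        vector[index] += 1.0;
      }
    }
    return vector;
  }

  // Assumption: simple whitespace tokenization, no punctuation handling.
  private static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!t.isEmpty()) {
        tokens.add(t);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    BagOfWordsSketch bow = new BagOfWordsSketch(Arrays.asList("the cat sat", "the dog"));
    System.out.println(Arrays.toString(bow.encode("The cat and the cat")));
  }
}
```

Each vector position corresponds to one vocabulary word in order of first appearance, so documents of any length map to fixed-size inputs.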

texoo-entity-recognition – Named Entity Recognition (NER)

This module contains Annotators for Named Entity Recognition (NER). The model is a robust deep learning model that can be trained with only 4000-5000 sentences. It is based on a bidirectional LSTM with letter-trigram encoding; see http://arxiv.org/abs/1608.06757.
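The letter-trigram idea mentioned above can be sketched as follows. This is one common form of the technique, assuming "#" as a word boundary marker and lowercased input. The class name LetterTrigramSketch is hypothetical, and TeXoo's own encoder may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical letter-trigram splitter; NOT TeXoo's encoder. */
final class LetterTrigramSketch {

  /**
   * Splits a word into overlapping three-letter windows.
   * Assumptions: "#" marks the word boundaries, input is lowercased.
   */
  static List<String> trigrams(String word) {
    String padded = "#" + word.toLowerCase(Locale.ROOT) + "#";
    List<String> result = new ArrayList<>();
    for (int i = 0; i + 3 <= padded.length(); i++) {
      result.add(padded.substring(i, i + 3));
    }
    return result;
  }

  public static void main(String[] args) {
    // prints [#he, hel, ell, llo, lo#]
    System.out.println(trigrams("hello"));
  }
}
```

Because unseen or misspelled words still share most of their trigrams with known words, an encoding of this kind tends to cope better with idiosyncratic vocabulary than a plain word lookup.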

Command Line Usage:

run bin/run-docker texoo-annotate-ner

usage: texoo-annotate-ner -i <arg> [-o <arg>]
TeXoo: run pre-trained MentionAnnotator model
 -i,--input <arg>    path or file name for raw input text
 -o,--output <arg>   path to create and store the output JSON, otherwise dump to stdout

run bin/run-docker texoo-train-ner

usage: texoo-train-ner -i <arg> [-l <arg>] -o <arg> [-t <arg>] [-u] [-v
       <arg>]
TeXoo: train MentionAnnotator with CoNLL annotations
 -i,--input <arg>        path to input training data (CoNLL format)
 -l,--language <arg>     language to use for sentence splitting and
                         stopwords (EN or DE)
 -o,--output <arg>       path to create and store the model
 -t,--test <arg>         path to test data (CoNLL format)
 -u,--ui                 enable training UI (http://127.0.0.1:9000)
 -v,--validation <arg>   path to validation data (CoNLL format)
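The training data above is expected in CoNLL format. The source does not state which CoNLL variant TeXoo's reader accepts, so the following sketch assumes the widespread column layout: one token per line, the NER tag in the last column, blank lines between sentences and optional -DOCSTART- marker lines. The class ConllSketch is hypothetical and only shows how such a file breaks down into tagged sentences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader for column-based CoNLL data; NOT CoNLLDatasetReader. */
final class ConllSketch {

  /** A token together with the tag found in the last column. */
  static final class TaggedToken {
    final String text;
    final String tag;

    TaggedToken(String text, String tag) {
      this.text = text;
      this.tag = tag;
    }
  }

  /** Groups lines into sentences; blank and -DOCSTART- lines end a sentence. */
  static List<List<TaggedToken>> parse(List<String> lines) {
    List<List<TaggedToken>> sentences = new ArrayList<>();
    List<TaggedToken> current = new ArrayList<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("-DOCSTART-")) {
        if (!current.isEmpty()) {
          sentences.add(current);
          current = new ArrayList<>();
        }
        continue;
      }
      String[] columns = trimmed.split("\\s+");
      current.add(new TaggedToken(columns[0], columns[columns.length - 1]));
    }
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  public static void main(String[] args) {
    // Made-up sample lines, for illustration only.
    List<String> lines = Arrays.asList(
        "-DOCSTART- -X- -X- O", "",
        "EU NNP B-NP B-ORG", "rejects VBZ B-VP O", "",
        "Berlin NNP B-NP B-LOC");
    System.out.println(parse(lines).size() + " sentences");
  }
}
```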

run bin/run-docker texoo-train-ner-seed

usage: texoo-train-ner-seed -i <arg> -o <arg> -s <arg> [-u]
TeXoo: train MentionAnnotator with seed list
 -i,--input <arg>    path and file name pattern for raw input text
 -o,--output <arg>   path to create and store the model
 -s,--seed <arg>     path to seed list text file
 -u,--ui             enable training UI (http://127.0.0.1:9000)

Java Classes:

Package / Class	Description / Reference
MentionAnnotator	Named Entity Recognition
GenericMentionAnnotator	Pre-trained models for English and German
MatchingAnnotator	Gazetteer that uses Lists to annotate Documents
CoNLLDatasetReader	Reader for CoNLL files
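As a rough illustration of what a gazetteer such as the one listed above does, the following sketch annotates a token sequence with the longest matching entry from a list. It does not use the MatchingAnnotator API. The class GazetteerSketch, its case-insensitive matching and its greedy longest-match strategy are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical list-based matcher; NOT TeXoo's MatchingAnnotator. */
final class GazetteerSketch {

  /** A matched span in token indices; end is exclusive. */
  static final class Match {
    final int begin;
    final int end;
    final String text;

    Match(int begin, int end, String text) {
      this.begin = begin;
      this.end = end;
      this.text = text;
    }
  }

  private final Set<String> entries = new HashSet<>();
  private int maxTokens = 1;

  /** Stores each entry lowercased with single spaces between its tokens. */
  GazetteerSketch(Collection<String> list) {
    for (String entry : list) {
      String normalized = entry.trim().toLowerCase(Locale.ROOT);
      if (normalized.isEmpty()) {
        continue;
      }
      String[] tokens = normalized.split("\\s+");
      entries.add(String.join(" ", tokens));
      maxTokens = Math.max(maxTokens, tokens.length);
    }
  }

  /** Scans left to right and prefers the longest entry at each position. */
  List<Match> annotate(String[] tokens) {
    List<Match> matches = new ArrayList<>();
    int i = 0;
    while (i < tokens.length) {
      int matchedEnd = -1;
      for (int len = Math.min(maxTokens, tokens.length - i); len >= 1; len--) {
        String span = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
        if (entries.contains(span.toLowerCase(Locale.ROOT))) {
          matchedEnd = i + len;
          break;
        }
      }
      if (matchedEnd > 0) {
        String text = String.join(" ", Arrays.copyOfRange(tokens, i, matchedEnd));
        matches.add(new Match(i, matchedEnd, text));
        i = matchedEnd; // continue after the match, so matches never overlap
      } else {
        i++;
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    GazetteerSketch gazetteer =
        new GazetteerSketch(Arrays.asList("Berlin", "New York", "New York City"));
    for (Match m : gazetteer.annotate("She moved from Berlin to New York City".split(" "))) {
      System.out.println(m.begin + "-" + m.end + ": " + m.text);
    }
  }
}
```

Preferring the longest entry keeps "New York City" as one mention instead of stopping at "New York".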

Cite

If you use this module for research, please cite:

Sebastian Arnold, Felix A. Gers, Torsten Kilias, Alexander Löser: Robust Named Entity Recognition in Idiosyncratic Domains. arXiv:1608.06757 [cs.CL] 2016 https://arxiv.org/abs/1608.06757

texoo-entity-linking – Named Entity Linking (NEL)

This module contains the Annotators for Named Entity Linking (NEL) and is currently under development. No model is included, but you can use the Knowledge Base and Annotators with your own datasets; see https://www.aclweb.org/anthology/C/C16/C16-2024.pdf.

Package / Class	Description / Reference
NamedEntityAnnotator	Named Entity Linking used in TASTY
ArticleIndexFactory	Knowledge Base implemented as local Lucene Index which imports Wikidata entities

If you use this module for research, please cite:

Sebastian Arnold, Robert Dziuba, Alexander Löser: TASTY: Interactive Entity Linking As-You-Type. COLING (Demos) 2016: 111–115

texoo-sector – Topic Classification and Segmentation (SECTOR)

This module contains Annotators for SECTOR models trained on the WikiSection dataset.

Package / Class	Description / Reference
SectorAnnotator	Topic Segmentation and Classification for Long Documents
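To make the relation between classification and segmentation concrete, the sketch below merges consecutive sentences that carry the same topic label into one segment. This is a deliberate simplification for illustration, not the SECTOR method, which is described in the paper cited in this section. The class TopicSegmentsSketch and the example labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that derives segments from per-sentence labels; NOT SectorAnnotator. */
final class TopicSegmentsSketch {

  /** A run of sentences sharing one topic; end index is exclusive. */
  static final class Segment {
    final int begin;
    final int end;
    final String topic;

    Segment(int begin, int end, String topic) {
      this.begin = begin;
      this.end = end;
      this.topic = topic;
    }
  }

  /** Starts a new segment whenever the label differs from the current segment's label. */
  static List<Segment> segments(List<String> sentenceTopics) {
    List<Segment> result = new ArrayList<>();
    int begin = 0;
    for (int i = 1; i <= sentenceTopics.size(); i++) {
      boolean boundary = i == sentenceTopics.size()
          || !sentenceTopics.get(i).equals(sentenceTopics.get(begin));
      if (boundary) {
        result.add(new Segment(begin, i, sentenceTopics.get(begin)));
        begin = i;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up labels, one per sentence of a document.
    List<String> topics = Arrays.asList(
        "symptoms", "symptoms", "treatment", "treatment", "treatment", "symptoms");
    for (Segment s : segments(topics)) {
      System.out.println(s.topic + " [" + s.begin + ", " + s.end + ")");
    }
  }
}
```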

If you use this module for research, please cite:

Sebastian Arnold, Rudolf Schneider, Philippe Cudré-Mauroux, Felix A. Gers and Alexander Löser. SECTOR: A Neural Model for Coherent Topic Segmentation and Classification. Transactions of the Association for Computational Linguistics 2019 Vol. 7, 169-184

Command Line Usage:

run bin/run-docker texoo-train-sector

usage: texoo-train-sector -i <arg> -o <arg> [-u]
TeXoo: train SectorAnnotator from WikiSection dataset
 -i,--input <arg>    file name of WikiSection training dataset
 -o,--output <arg>   path to create and store the model
 -u,--ui             enable training UI (http://127.0.0.1:9000)

About TeXoo

Frameworks used in TeXoo

Deeplearning4j Machine learning library
http://deeplearning4j.org/documentation
ND4J Scientific computing library
http://nd4j.org/userguide
Apache OpenNLP Natural language processing
https://opennlp.apache.org/docs/

Contributors

Sebastian Arnold – core developer https://prof.beuth-hochschule.de/loeser/people/sebastian-arnold/

Rudolf Schneider https://prof.beuth-hochschule.de/loeser/people/rudolf-schneider/

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
