Parse strings and create objects from them based on a dictionary


Tiny program which does two things:

  1. Tokenize a string, read in a dictionary (.json) and find matches in the input
  2. Use these matches to create objects holding the parsed data

The exemplary data provided is themed around some beverages' descriptions. The extending class shows how parsed data can be transformed into an object with which things can be done. The given dict.json is filled with sample data which is used to parse the given input texts.

With extensive use of reflection, classes are instantiated and field values are filled. The user has to customize the rules for parsing an element. Exemplary data is provided; the core classes should not be tweaked! For parsing, you only have to create

  • classes which extend RawParserObject such that jsonToFieldname is filled in the constructor
  • a dictionary as dict.json.
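A minimal sketch of such an extending class, assuming RawParserObject exposes a protected jsonToFieldname map that the subclass fills in its constructor (the stub, the class name Beverage, and the keys below are invented for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in stub for the real RawParserObject; only the part relevant
// here (the jsonToFieldname map) is sketched.
abstract class RawParserObject {
    protected final Map<String, String> jsonToFieldname = new LinkedHashMap<>();
}

// Hypothetical extending class: the constructor maps dictionary keys
// (as they appear in dict.json) to this object's own field names.
class Beverage extends RawParserObject {
    String name;
    String grape;

    Beverage() {
        jsonToFieldname.put("NAME", "name");
        jsonToFieldname.put("GRAPE", "grape");
    }
}
```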

Uses Jackson to read and process the dictionary (a .json file). Such a nice package.


RawParserObject

As the name suggests, this is supposed to be used as the superclass for all the data parsed. It holds information about

  • the input from which it is created
  • the keywords which matched against words from a given dictionary
  • the parsed information from the matched keywords
  • probable field values (by default the most common value)

A typical workflow is that an object gets instantiated, filled with keywords and field information via tokenizer.parse(), and finalized at the end. The matchings have to be customized in the extending class of RawParserObject but will then work fully automatically. The way fields (if they are not just plain Strings) are filled has to be customized by overriding RawParserObject.mapFinalFieldValues() (see the exemplary classes for an example).

The method createChildClass() returns an object of the most probable/most common class which is present in fieldValues. An example can be seen in the dict.json file, where the key "_CLASS" references the class name to which a certain keyword matches and should be cast to via reflection.
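The underlying idea can be sketched with plain reflection. The helper below is an assumption about how such a lookup might work, not the actual createChildClass() implementation; the class names Wine and Beer are invented:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy parse targets; in the real project these would extend RawParserObject.
class Wine { }
class Beer { }

class ChildClassDemo {
    // Pick the most common "_CLASS" value collected during parsing and
    // instantiate it via reflection, as createChildClass() is described to do.
    static Object mostCommonInstance(List<String> classNames) throws Exception {
        String winner = classNames.stream()
                .collect(Collectors.groupingBy(n -> n, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get()
                .getKey();
        return Class.forName(winner).getDeclaredConstructor().newInstance();
    }
}
```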


ParserDict

Self-explanatory: reads the .json file. It handles the various JSON data types and saves them accordingly (String vs Array vs Dict/Map, see below under dict.json).
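The dispatch on the three JSON value types might look like this in plain Java (the class and method names are illustrative, not ParserDict's actual API):

```java
import java.util.List;
import java.util.Map;

class DictEntryDemo {
    // Classify a deserialized JSON value the way the dictionary reader is
    // described to: plain synonym, array of synonyms, or nested dict/map.
    static String classify(Object value) {
        if (value instanceof String) return "synonym";
        if (value instanceof List)   return "array";
        if (value instanceof Map)    return "dict";
        throw new IllegalArgumentException("unsupported entry type: " + value);
    }
}
```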


Tokenizer

Handles the tokenization of the given string (surprise) and does the actual parsing of the input. The function tokenizer.parse() instantiates and fills the parsed objects.


KeywordFinder

Reads the possible entries from the dictionary and compares tokens (single ones and, if there is a possible match, concatenations of tokens until no candidate match is left).
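The token-concatenation idea can be sketched as follows; the data structure (a sorted set used for cheap prefix checks) and the method name are assumptions, not the KeywordFinder's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class KeywordMatchDemo {
    // For each starting token, keep concatenating following tokens as long
    // as some dictionary key still begins with the current candidate, and
    // record every exact hit along the way.
    static List<String> findMatches(String[] tokens, TreeSet<String> dict) {
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String candidate = tokens[i];
            int j = i;
            while (true) {
                if (dict.contains(candidate)) matches.add(candidate);
                // Is there any dictionary key starting with "candidate "?
                // The smallest key >= that prefix decides.
                String next = dict.ceiling(candidate + " ");
                if (next == null || !next.startsWith(candidate + " ") || ++j >= tokens.length) break;
                candidate = candidate + " " + tokens[j];
            }
        }
        return matches;
    }
}
```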


dict.json

There is a large comment in the file itself, but here it is again:

The .json file holds all the dictionary entries which are used to parse the input. There are three types of entries which the program then understands as such:

  1. "Keyword":"Synonym" - self-explanatory.
  2. "Keyword":[Array] - saves the Keyword with the same value as Array[0] (will be overhauled)
  3. "Keyword":{Dict} - saves the Keyword with Dict as a LinkedHashMap for easier mapping while parsing

There are several keywords or prefixes worth noting:

  • "_CLASS":"classname" - the full name of the java class the entry should be parsed to
  • "_COMMENTARY":"comment" - prefixing a key with _COMMENTARY makes the ParserDict ignore this entry fully
  • "_DICT:Keyword":{Dict} - handles the given Dict as a dictionary of its own but extends each element with the definition of "Keyword". See the given dict.json for an example of how to define multiple grapes at once without writing down "Grape":"THIS" every time.
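A hypothetical dict.json sketch illustrating the three entry types and the special keys; all keywords and the class name below are invented, not taken from the shipped file:

```json
{
    "_COMMENTARY note": "this whole entry is ignored by the ParserDict",
    "_CLASS": "full.package.ClassName",
    "Keyword": "Synonym",
    "OtherKeyword": ["FirstSynonym", "CurrentlyIgnored"],
    "MappedKeyword": { "field": "value" },
    "_DICT:Grape": {
        "Merlot": "Merlot",
        "Syrah": "Shiraz"
    }
}
```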

Why not just use String.contains() instead of using the KeywordFinder?

You actually should use String.contains() instead: this program's runtime is nowhere near O(n), unlike e.g. the Boyer-Moore algorithm.

This is just something I did for educational purposes, and I like how (and that) it works in reasonable time; the parsing is somewhere around O(m * log(n)) because words which do not match any dictionary entry never enter the time-consuming matching loop.


Copyright (c) 2017 Roman Lamsal, University of Konstanz.

All rights reserved. This program and the accompanying materials are made available under the terms of the GNU Public License v3.0 which accompanies this distribution, and is available at

For distributors of proprietary software, other licensing is possible on request: