Parse strings and create objects from them based on a dictionary

ParseStringsWithDicts

Tiny program which does two things:

  1. Tokenize a string, read in a dictionary (.json) and find matches in the input
  2. Use these inputs in order to create objects, parsed from the matches

The example data provided (everything except RawParserObject.java, Tokenizer.java and KeywordFinder.java) is themed around descriptions of beverages. The extending class Wine.java shows how parsed data can be transformed into an object that can then be worked with. The given dict.json is likewise filled with random data, which is used to parse the given input texts.

Classes are instantiated and their field values are filled through extensive use of reflection. The user has to customize the rules for parsing an element. Example data is provided; the core classes RawParserObject.java, ParserDict.java, KeywordFinder.java and Tokenizer.java should not be tweaked! For parsing, you only have to create

  • classes which extend RawParserObject such that jsonToFieldname is filled in the constructor
  • a dictionary as dict.json.
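The reflection step described above can be pictured roughly as follows. This is a minimal sketch, not the repository's code: `ReflectionSketch`, `Beverage`, `instantiate` and the assumed shape of `jsonToFieldname` (dictionary key mapped to Java field name) are hypothetical stand-ins, and the real RawParserObject API may differ.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

final class ReflectionSketch {
    // Instantiates a class by name and copies parsed values into its fields.
    static Object instantiate(String className, Map<String, String> parsed) {
        try {
            Class<?> type = Class.forName(className);
            Constructor<?> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            Object target = ctor.newInstance();

            // Assumption: the class exposes its key -> field mapping in a field
            // called jsonToFieldname, filled by its constructor.
            Field mappingField = type.getDeclaredField("jsonToFieldname");
            mappingField.setAccessible(true);
            Map<?, ?> mapping = (Map<?, ?>) mappingField.get(target);

            for (Map.Entry<String, String> entry : parsed.entrySet()) {
                Object fieldName = mapping.get(entry.getKey());
                if (fieldName == null) {
                    continue; // dictionary key without a mapped field: skip it
                }
                Field field = type.getDeclaredField(fieldName.toString());
                field.setAccessible(true);
                field.set(target, entry.getValue());
            }
            return target;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("Cannot build " + className, ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("Grape", "Riesling");
        parsed.put("Vintage", "2015"); // not mapped, so it is ignored
        Beverage b = (Beverage) instantiate(Beverage.class.getName(), parsed);
        System.out.println(b.grape + " / " + b.region); // prints "Riesling / null"
    }
}

// Hypothetical stand-in for a class that would extend RawParserObject.
class Beverage {
    final Map<String, String> jsonToFieldname = new LinkedHashMap<>();
    String grape;
    String region;

    Beverage() {
        jsonToFieldname.put("Grape", "grape");
        jsonToFieldname.put("Region", "region");
    }
}
```

Only String fields are handled here; anything richer would need the kind of custom mapping that the README assigns to the extending class.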

Uses Jackson to read and process the dictionary (which is a JSON file). Such a nice package.

RawParserObject

As the name suggests, this is supposed to be used as the superclass for all parsed data. It holds information about

  • the input from which it is created
  • the keywords which matched against words from a given dictionary
  • the parsed information from the matched keywords
  • probable field values (by default the most common value)

A typical workflow: an object is instantiated, filled with keywords and field information by tokenizer.parse(), and finalized at the end. The matchings have to be customized in the class extending RawParserObject, but then work fully automatically. How fields are filled (if they are not just plain Strings) has to be customized by overriding RawParserObject.mapFinalFieldValues() (see Wine.java for an example).

The method createChildClass() returns an object of the most probable (i.e. most common) class present in fieldValues. An example can be seen in the dict.json file, where the key "_CLASS" names the class that a certain keyword maps to and that the object should be cast to via reflection.
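The idea of picking the most common class name and instantiating it can be sketched like this. It is an illustration under assumptions, not the actual createChildClass(): `MostCommonClassSketch`, `mostCommon` and `createChild` are hypothetical names, the candidates are passed in as a plain list, and JDK classes stand in for real "_CLASS" values.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MostCommonClassSketch {
    // Returns the value that occurs most often; ties go to the one seen first.
    static String mostCommon(List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    // Instantiates the most common class name through its no-argument constructor.
    static Object createChild(List<String> classNames) {
        String name = mostCommon(classNames);
        if (name == null) {
            throw new IllegalStateException("No class candidates given");
        }
        try {
            return Class.forName(name).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("Cannot instantiate " + name, ex);
        }
    }

    public static void main(String[] args) {
        // JDK classes used as stand-ins for class names found under "_CLASS".
        List<String> candidates = Arrays.asList(
                "java.util.ArrayList", "java.util.HashMap", "java.util.ArrayList");
        Object child = createChild(candidates);
        System.out.println(child.getClass().getName()); // prints "java.util.ArrayList"
    }
}
```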

ParserDict

Self-explanatory: it reads the .json file. It handles the various JSON data types and stores them accordingly (String vs. Array vs. Dict/Map, see below under dict.json).

Tokenizer

Handles the tokenization of the given string (surprise) and drives the actual parsing of the input. The function tokenizer.parse() instantiates

KeywordFinder

Reads the possible entries from the dictionary and compares tokens against them (single tokens first and, if there is a possible match, concatenations of tokens until no candidate match is left).
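The matching strategy described above (extend a phrase token by token while some dictionary entry could still match) might look roughly like this. This is a sketch under assumptions, not the real KeywordFinder: `KeywordFinderSketch`, `findKeywords` and `hasCandidate` are hypothetical names, matching is case-sensitive, and the use of a sorted `TreeSet` for the candidate check is a choice made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class KeywordFinderSketch {
    // True if at least one dictionary key starts with the given phrase.
    // In a sorted set, the smallest key >= phrase is the only one to check.
    // The check is character-based, so it may keep a phrase alive slightly
    // longer than needed; only exact key matches are ever recorded.
    static boolean hasCandidate(TreeSet<String> keys, String phrase) {
        String next = keys.ceiling(phrase);
        return next != null && next.startsWith(phrase);
    }

    // Returns the longest dictionary match starting at each token position.
    static List<String> findKeywords(List<String> tokens, TreeSet<String> keys) {
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String phrase = tokens.get(i);
            String longest = null;
            int longestEnd = i;
            int j = i;
            // Tokens with no candidate never enter this loop.
            while (hasCandidate(keys, phrase)) {
                if (keys.contains(phrase)) {
                    longest = phrase;
                    longestEnd = j;
                }
                j++;
                if (j >= tokens.size()) {
                    break;
                }
                phrase = phrase + " " + tokens.get(j);
            }
            if (longest != null) {
                matches.add(longest);
            }
            i = longestEnd + 1; // skip past the match, or advance by one token
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(
                Arrays.asList("Pinot Noir", "Pinot Grigio", "dry", "Riesling"));
        List<String> tokens = Arrays.asList("A dry Pinot Noir from Baden".split("\\s+"));
        System.out.println(findKeywords(tokens, keys)); // prints "[dry, Pinot Noir]"
    }
}
```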

dict.json

There is a large comment in the file itself, but here it is again:

The .json file holds all the dictionary entries which are used to parse the input. There are three types of entries which the program then understands as such:

  1. "Keyword":"Synonym" - self-explanatory.
  2. "Keyword":[Array] - saves the Keyword with the same value as Array[0] (will be overhauled)
  3. "Keyword":{Dict} - saves the Keyword with Dict as a LinkedHashMap for easier mapping while parsing

There are several keywords and prefixes worth noting:

  • "_CLASS":"classname" - the full name of the java class the entry should be parsed to
  • "_COMMENTARY":"comment" - prefixing a key with _COMMENTARY makes ParserDict ignore the entry entirely
  • "_DICT:Keyword":{Dict} - handles the given Dict as a dictionary of its own but extends each element by the definition of "Keyword". See the given dict.json for an example of how to define multiple grapes at once without writing down "Grape":"THIS" every time.
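The entry types and prefixes listed above could be handled along these lines. This is a sketch of one possible reading, not ParserDict itself: it starts from an already-parsed `Map` instead of reading JSON, `DictEntrySketch` and `normalize` are hypothetical names, the sample entries are invented rather than taken from the repository's dict.json, and the way "_DICT:Keyword" extends each element (by adding the keyword with the element's own name as value) is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DictEntrySketch {
    private static final String DICT_PREFIX = "_DICT:";

    static Map<String, Object> normalize(Map<String, Object> raw) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : raw.entrySet()) {
            String key = entry.getKey();
            Object value = entry.getValue();
            if (key.startsWith("_COMMENTARY")) {
                continue; // commentary entries are ignored entirely
            }
            if (key.startsWith(DICT_PREFIX) && value instanceof Map) {
                // Assumed semantics: every element becomes its own entry and
                // additionally carries Keyword -> element name.
                String keyword = key.substring(DICT_PREFIX.length());
                for (Map.Entry<?, ?> element : ((Map<?, ?>) value).entrySet()) {
                    Map<String, Object> extended = new LinkedHashMap<>();
                    if (element.getValue() instanceof Map) {
                        for (Map.Entry<?, ?> inner : ((Map<?, ?>) element.getValue()).entrySet()) {
                            extended.put(String.valueOf(inner.getKey()), inner.getValue());
                        }
                    }
                    extended.put(keyword, element.getKey());
                    out.put(String.valueOf(element.getKey()), extended);
                }
            } else if (value instanceof List) {
                List<?> list = (List<?>) value;
                out.put(key, list.isEmpty() ? null : list.get(0)); // Array[0] only
            } else if (value instanceof Map) {
                out.put(key, new LinkedHashMap<Object, Object>((Map<?, ?>) value));
            } else {
                out.put(key, value); // plain "Keyword":"Synonym"
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> wine = new LinkedHashMap<>();
        wine.put("_CLASS", "example.Wine"); // invented class name
        Map<String, Object> grapes = new LinkedHashMap<>();
        grapes.put("Riesling", new LinkedHashMap<String, Object>());
        grapes.put("Merlot", new LinkedHashMap<String, Object>());

        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("_COMMENTARY: note", "ignored");
        raw.put("vino", "Wine");
        raw.put("red", Arrays.asList("Colour", "unused"));
        raw.put("Wine", wine);
        raw.put("_DICT:Grape", grapes);

        System.out.println(normalize(raw));
    }
}
```

With the sample input, the commentary entry disappears, "red" keeps only its first array element, and "Riesling" and "Merlot" each end up as entries of their own with a "Grape" field.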

Why not just use String.contains() instead of using the KeywordFinder?

You actually should use String.contains() instead. The runtime of the KeywordFinder is nowhere near O(n), unlike e.g. the Boyer-Moore algorithm.

This is just something I did for educational purposes, and I like how (and: that) it works in a reasonable time; the parsing is somewhere around O(m * log(n)), because words which do not match any dictionary entry never enter the time-consuming matching loop.

License

Copyright (c) 2017 Roman Lamsal, University of Konstanz.

All rights reserved. This program and the accompanying materials are made available under the terms of the GNU Public License v3.0 which accompanies this distribution, and is available at http://www.gnu.org/licenses/gpl.html

For distributors of proprietary software, other licensing is possible on request: roman@lamsal.com