Rita rust engine #87

Merged
merged 14 commits into from
Aug 29, 2020
3 changes: 3 additions & 0 deletions .coveragerc
@@ -3,5 +3,8 @@ branch = True
source =
rita

omit = rita/engine/translate_rust.py

[report]
show_missing = True
omit = rita/engine/translate_rust.py
58 changes: 58 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,61 @@
0.6.0 (2020-08-29)
****************************

Features
--------

- Implemented ability to alias macros, eg.:

.. code-block::

numbers = {"one", "two", "three"}
@alias IN_LIST IL

IL(numbers) -> MARK("NUMBER")

Now using "IL" will actually call "IN_LIST" macro.
#66
- Introduce the TAG element as a module. It needs a new parser for the spaCy translation.
It allows more flexible matching of detailed part-of-speech tags, such as all adjectives or nouns: TAG("^NN|^JJ").

Implemented by:
Roland M. Mueller (https://github.com/rolandmueller)
#81
- Add a new module for a PLURALIZE tag
For a noun or a list of nouns, it will match any singular or plural word.

Implemented by:
Roland M. Mueller (https://github.com/rolandmueller)
#82
- Add a new configuration option `implicit_hyphon` (default `false`) that automatically adds hyphen characters (`-`) to the rules.

Implemented by:
Roland M. Mueller (https://github.com/rolandmueller)
#84
- Allow passing a custom regex implementation. By default, `re` is used.
#86
- Add an interface for using the Rust engine.

In general, it is identical to `standalone`, but differs in one crucial part: all of the rules are compiled into actual binary code, which provides a large performance boost.
The engine itself is proprietary because there are various caveats: it is a bit more fragile and needs to be tuned for a very specific case
(eg. a few long texts with many matches vs. a lot of short texts with few matches).
#87

Fix
---

- Fix `-` bug when it is used as a standalone word
#71
- Fix regex matching, when shortest word is selected from IN_LIST
#72
- Fix IN_LIST regex so that it wouldn't take part of word
#75
- Fix IN_LIST operation bug - it was ignoring them
#77
- Use list branching only when using spaCy Engine
#80


0.5.0 (2020-06-18)
****************************

10 changes: 0 additions & 10 deletions changes/66.feature.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/71.fix.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/72.fix.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/75.fix.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/77.fix.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/80.fix.rst

This file was deleted.

5 changes: 0 additions & 5 deletions changes/81.feature.rst

This file was deleted.

5 changes: 0 additions & 5 deletions changes/82.feature.rst

This file was deleted.

4 changes: 0 additions & 4 deletions changes/84.feature.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/86.feature.rst

This file was deleted.

36 changes: 36 additions & 0 deletions docs/engines.md
@@ -0,0 +1,36 @@
# Engines

In RITA, an `engine` is the system that rules are compiled to and that does the heavy lifting afterwards.

Currently there are three engines:

## spaCy

Activated by using `rita.compile(<rules_file>, use_engine="spacy")`

With this engine, all of the RITA rules are compiled into spaCy patterns, which can be used natively by spaCy in various scenarios.
Most often they are used to improve NER (Named Entity Recognition) by adding entities derived from your rules.

It requires the spaCy package to be installed (`pip install spacy`). To actually use it later, a language model needs to be downloaded (`python -m spacy download <language_code>`).

## Standalone

Activated by using `rita.compile(<rules_file>, use_engine="standalone")`. It compiles rules into pure regex and can be used with zero dependencies.
By default, it uses the Python `re` library. Since version `0.5.10`, you can pass a custom regex implementation to use,
eg. the regex package: `rita.compile(<rules_file>, use_engine="standalone", regex_impl=regex)`

It is very lightweight and very fast (compared to spaCy), but it lacks some functionality which only a proper language model can bring:
- Patterns by entity (PERSON, ORGANIZATION, etc)
- Patterns by Lemmas
- Patterns by POS (Part Of Speech)

Only generic things, like WORD, NUMBER can be matched.
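
To make "compiles into pure regex" more concrete, here is a minimal sketch of the idea. It is not RITA's actual implementation: `compile_patterns`, `execute` and the sample patterns are hypothetical names chosen for illustration. The named-group shape follows `_build_regex_str` from the Rust engine interface in this PR, and `regex_impl` is assumed to be any module exposing an `re`-compatible `compile()` and `IGNORECASE`.

```python
import re


def build_regex_str(label, rules):
    # One named group per rule, labelled with the entity name.
    return r"(?P<{0}>{1})".format(label, "".join(rules))


def compile_patterns(patterns, regex_impl=re, ignore_case=True):
    # regex_impl defaults to the standard library; any re-compatible
    # module could be passed instead (assumption for this sketch).
    flags = regex_impl.IGNORECASE if ignore_case else 0
    return [
        (label, regex_impl.compile(build_regex_str(label, rules), flags))
        for label, rules in patterns
    ]


def execute(compiled, text):
    # Run every compiled pattern over the text and report each match.
    for label, pattern in compiled:
        for match in pattern.finditer(text):
            yield {
                "start": match.start(),
                "end": match.end(),
                "text": match.group().strip(),
                "label": label,
            }


# Hypothetical patterns standing in for compiled RITA rules.
patterns = [
    ("NUMBER", [r"\b(?:one|two|three)\b"]),
    ("CRITTER", [r"\bsquirrel\b"]),
]
compiled = compile_patterns(patterns)
results = list(execute(compiled, "I saw two squirrels and one squirrel"))
```

Because only regular expressions are involved, nothing here knows about lemmas, entities or parts of speech, which is why those pattern types are unavailable in this engine.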


## Rust (new in `0.6.0`)

Only the interface is included in the code; the engine itself is proprietary.

In general, it is identical to `standalone`, but differs in one crucial part: all of the rules are compiled into actual binary code, which provides a large performance boost.
It is proprietary because there are various caveats: the engine itself is a bit more fragile and needs to be tuned for a very specific case
(eg. a few long texts with many matches vs. a lot of short texts with few matches).
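
The interface in `rita/engine/translate_rust.py` receives matches from the shared library as a fixed-size C struct (`ResultEntity * 32`). The sketch below shows how such a struct can be unpacked with `ctypes`. Since the real library is not available, `fake_execute` is a hypothetical stand-in that fills the struct from Python, and `unpack_results` and `MAX_RESULTS` are names introduced only for this example.

```python
from ctypes import Structure, c_char_p, c_size_t, c_uint

MAX_RESULTS = 32  # matches the fixed array size used by the interface


class ResultEntity(Structure):
    _fields_ = [
        ("label", c_char_p),
        ("text", c_char_p),
        ("start", c_size_t),
        ("end", c_size_t),
    ]


class ResultsWrapper(Structure):
    _fields_ = [
        ("count", c_uint),
        ("results", ResultEntity * MAX_RESULTS),
    ]


def fake_execute():
    # Hypothetical stand-in for lib.execute(context, text): builds the
    # struct in Python instead of calling into the shared library.
    raw = ResultsWrapper()
    raw.results[0].label = b"NUMBER"
    raw.results[0].text = b"two"
    raw.results[0].start = 6
    raw.results[0].end = 9
    raw.count = 1
    return raw


def unpack_results(raw):
    # Only the first `count` slots are meaningful; clamp defensively so a
    # bad count can never index past the fixed-size array.
    for i in range(min(raw.count, MAX_RESULTS)):
        match = raw.results[i]
        yield {
            "start": match.start,
            "end": match.end,
            "text": match.text.decode("UTF-8").strip(),
            "label": match.label.decode("UTF-8"),
        }


matches = list(unpack_results(fake_execute()))
```

One consequence of the fixed-size array is that a single call can report at most 32 matches, which fits the note above about tuning the engine for a specific workload.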
56 changes: 56 additions & 0 deletions docs/modules.md
@@ -0,0 +1,56 @@
# Modules

Modules are like plugins to the system. They usually provide additional functionality at some cost: they need additional dependencies, support only a specific language, etc.
That's why they are not included in the core system, but can be easily imported into your rules.

eg.
```
!IMPORT("rita.modules.fuzzy")

FUZZY("squirrel") -> MARK("CRITTER")
```

**NOTE**: the import path can be any valid Python import path, so you can add extra functionality without modifying RITA's source code.
More on that in the [Extending section](./extend.md).

## Fuzzy

This is more of an example than a proper module. Its main goal is to generate possible misspelled variants of a given word, so that the rule matches more cases.
It is very useful when dealing with actual natural language, eg. comments or social media posts. The word `you` can be matched automatically by both `you` and `u`, `for` by both `for` and `4`, etc.

Usage:
```
!IMPORT("rita.modules.fuzzy")

FUZZY("squirrel") -> MARK("CRITTER")
```
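
As a rough illustration of the idea (not the module's actual algorithm), the sketch below expands a word into a few variants and builds one regex alternation from them. The `SLANG` table, the doubled-letter rule and the function names are all assumptions made for this example.

```python
import re

# Hypothetical substitution table; the real module may use other rules.
SLANG = {"you": ["u"], "for": ["4"], "to": ["2"]}


def variants(word):
    found = {word}
    found.update(SLANG.get(word, []))
    # Common typo: a doubled letter typed once ("squirrel" -> "squirel").
    found.add(re.sub(r"(.)\1", r"\1", word))
    # Longest first, so the full spelling is tried before shorter ones.
    return sorted(found, key=lambda w: (-len(w), w))


def fuzzy_pattern(word):
    alternatives = "|".join(re.escape(v) for v in variants(word))
    return r"\b(?:{})\b".format(alternatives)


critter = re.compile(fuzzy_pattern("squirrel"))
hit = critter.search("a squirel ran by")
```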

## Pluralize

Takes a list of words (or a single word) and creates the plural version of each of them.

Requires the `inflect` library (`pip install inflect`) to be installed before use. Works only on English words.

Usage:

```
!IMPORT("rita.modules.pluralize")

vehicles={"car", "motorbike", "bicycle", "ship", "plane"}
{NUM, PLURALIZE(vehicles)}->MARK("VEHICLES")
```
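
The sketch below illustrates what such an expansion could look like. To stay dependency-free it swaps in a deliberately naive suffix rule instead of `inflect`, so it handles only regular English nouns; `naive_plural` and `pluralize_pattern` are hypothetical helpers, not part of RITA.

```python
import re


def naive_plural(noun):
    # Simplified stand-in for inflect: regular English plurals only.
    if re.search(r"(?:s|x|z|ch|sh)$", noun):
        return noun + "es"
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"
    return noun + "s"


def pluralize_pattern(nouns):
    # Collect the singular and plural form of every noun.
    forms = set()
    for noun in nouns:
        forms.add(noun)
        forms.add(naive_plural(noun))
    ordered = sorted(forms, key=lambda w: (-len(w), w))
    return r"\b(?:{})\b".format("|".join(re.escape(f) for f in ordered))


vehicles = ["car", "motorbike", "bicycle", "ship", "plane"]
vehicle_re = re.compile(pluralize_pattern(vehicles))
found = vehicle_re.search("3 planes landed")
```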

## Tag

Used for generating POS/TAG patterns based on a regex,
e.g. TAG("^NN|^JJ") for nouns or adjectives.

Works only with the spaCy engine.

Usage:

```
!IMPORT("rita.modules.tag")

{WORD*, TAG("^NN|^JJ")}->MARK("TAGGED_MATCH")
```
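
One way to picture what the regex selects: it is applied to tag names rather than to the text. The sketch below filters a small, hand-picked subset of Penn Treebank tags and wraps the result in a spaCy-style token pattern. Whether the module expands tags like this internally is an assumption; `PENN_TAGS`, `tags_matching` and `tag_token_pattern` are names invented for the example.

```python
import re

# A small subset of Penn Treebank tags, listed here only for illustration.
PENN_TAGS = ["JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "RB", "VB", "VBD", "VBG"]


def tags_matching(tag_regex, tags=PENN_TAGS):
    # Keep every tag name the regex matches.
    pattern = re.compile(tag_regex)
    return [tag for tag in tags if pattern.search(tag)]


def tag_token_pattern(tag_regex):
    # spaCy's Matcher accepts {"TAG": {"IN": [...]}} as a token pattern.
    return {"TAG": {"IN": tags_matching(tag_regex)}}


token_pattern = tag_token_pattern("^NN|^JJ")
```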
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -7,6 +7,8 @@ nav:
- Quickstart: quickstart.md
- Syntax: syntax.md
- Macros: macros.md
- Engines: engines.md
- Modules: modules.md
- Extending: extend.md
- Config: config.md
- Advanced: advanced.md
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "rita-dsl"
-version = "0.5.10"
+version = "0.6.0"
description = "DSL for building language rules"
authors = [
"Šarūnas Navickas <zaibacu@gmail.com>"
2 changes: 1 addition & 1 deletion rita/__init__.py
@@ -10,7 +10,7 @@

logger = logging.getLogger(__name__)

-__version__ = (0, 5, 10, os.getenv("VERSION_PATCH"))
+__version__ = (0, 6, 0, os.getenv("VERSION_PATCH"))


def get_version():
2 changes: 2 additions & 0 deletions rita/config.py
@@ -8,6 +8,7 @@
pass

from rita.engine.translate_standalone import compile_rules as standalone_engine
from rita.engine.translate_rust import compile_rules as rust_engine

from rita.utils import SingletonMixin

@@ -27,6 +28,7 @@ def __init__(self):
# spacy_engine is not imported
pass
self.register_engine(2, "standalone", standalone_engine)
self.register_engine(3, "rust", rust_engine)

def register_engine(self, priority, key, compile_fn):
self.available_engines.append((priority, key, compile_fn))
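
The registration calls above pair each engine with a numeric priority. As a minimal sketch of how such a registry could pick a default, the class below assumes that the lowest priority number wins, which is consistent with the updated test expecting spaCy to be the default when it is installed. `EngineRegistry`, `get_engine` and the three stand-in compile functions are hypothetical and are not RITA's `Config` class.

```python
class EngineRegistry:
    def __init__(self):
        self.available_engines = []

    def register_engine(self, priority, key, compile_fn):
        self.available_engines.append((priority, key, compile_fn))

    @property
    def default_engine(self):
        # Assumption: the engine with the lowest priority number is the default.
        _, _, compile_fn = min(self.available_engines, key=lambda e: e[0])
        return compile_fn

    def get_engine(self, key):
        # Look an engine up by name, e.g. use_engine="rust".
        for _, name, compile_fn in self.available_engines:
            if name == key:
                return compile_fn
        raise KeyError("Unknown engine: {}".format(key))


# Stand-in compile functions, for illustration only.
def spacy_engine(rules, config, **kwargs):
    return "spacy"


def standalone_engine(rules, config, **kwargs):
    return "standalone"


def rust_engine(rules, config, **kwargs):
    return "rust"


registry = EngineRegistry()
registry.register_engine(2, "standalone", standalone_engine)
registry.register_engine(3, "rust", rust_engine)
```

Under this assumption, adding the Rust engine with priority 3 leaves the default unchanged: it is only used when requested explicitly.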
89 changes: 89 additions & 0 deletions rita/engine/translate_rust.py
@@ -0,0 +1,89 @@
import os
import logging

from ctypes import (c_char_p, c_size_t, c_uint, Structure, cdll, POINTER)

from rita.engine.translate_standalone import rules_to_patterns, RuleExecutor

logger = logging.getLogger(__name__)


class ResultEntity(Structure):
    _fields_ = [
        ("label", c_char_p),
        ("text", c_char_p),
        ("start", c_size_t),
        ("end", c_size_t),
    ]


class ResultsWrapper(Structure):
    _fields_ = [
        ("count", c_uint),
        ("results", (ResultEntity * 32))
    ]


class Context(Structure):
    _fields_ = []


def load_lib():
    try:
        if os.name == "nt":
            lib = cdll.LoadLibrary("rita_rust.dll")
        elif os.uname().sysname == "Darwin":
            # os.name is "posix" on both macOS and Linux, so check the
            # kernel name to load the .dylib only on macOS.
            lib = cdll.LoadLibrary("librita_rust.dylib")
        else:
            lib = cdll.LoadLibrary("librita_rust.so")
        lib.compile.restype = POINTER(Context)
        lib.execute.argtypes = [POINTER(Context), c_char_p]
        lib.execute.restype = ResultsWrapper
        lib.clean_env.argtypes = [POINTER(Context)]
        return lib
    except Exception as ex:
        # On failure the function logs the error and implicitly returns None.
        logger.error("Failed to load rita-rust library, reason: {}\n\n"
                     "Most likely you don't have required shared library to use it".format(ex))


class RustRuleExecutor(RuleExecutor):
    def __init__(self, patterns, config):
        self.config = config
        self.context = None

        self.lib = load_lib()
        self.patterns = [self._build_regex_str(label, rules)
                         for label, rules in patterns]

        self.compile()

    @staticmethod
    def _build_regex_str(label, rules):
        return r"(?P<{0}>{1})".format(label, "".join(rules))

    def compile(self):
        flag = 0 if self.config.ignore_case else 1
        c_array = (c_char_p * len(self.patterns))(*list([p.encode("UTF-8") for p in self.patterns]))
        self.context = self.lib.compile(c_array, len(c_array), flag)
        return self.context

    def _results(self, text):
        raw = self.lib.execute(self.context, text.encode("UTF-8"))
        for i in range(0, raw.count):
            match = raw.results[i]
            yield {
                "start": match.start,
                "end": match.end,
                "text": match.text.decode("UTF-8").strip(),
                "label": match.label.decode("UTF-8"),
            }

    def clean_context(self):
        self.lib.clean_env(self.context)


def compile_rules(rules, config, **kwargs):
    logger.info("Using rita-rust rule implementation")
    patterns = [rules_to_patterns(*group) for group in rules]
    executor = RustRuleExecutor(patterns, config)
    return executor
2 changes: 1 addition & 1 deletion tests/test_config.py
@@ -20,7 +20,7 @@ def test_registered_engines(cfg):
def test_registered_engines_has_spacy(cfg):
pytest.importorskip("spacy", minversion="2.1")
from rita.engine.translate_spacy import compile_rules
-assert len(cfg.available_engines) == 2
+assert len(cfg.available_engines) == 3
assert cfg.default_engine == compile_rules

