This repository has been archived by the owner on Apr 27, 2022. It is now read-only.

suamin/PyNemex

PyNemex

PyNemex is a Python package for named entity matching and extraction using approximate string matching. It is currently based mainly on the Faerie algorithm [1].
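Approximate matching means that a substring of the document can match a dictionary entity even when the two differ by a few character edits. As a rough illustration, here is a minimal sketch of the standard Levenshtein edit distance. It is not taken from the PyNemex code base, and the function name `edit_distance` is made up for this example; how PyNemex computes similarity internally may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (two-row dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


# One deleted character separates the entity from the misspelled mention.
print(edit_distance("chaudhuri", "chadhuri"))  # 1
```

A matcher built on this idea would accept a candidate substring when its distance to some entity stays under a chosen threshold.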

Installation

The package is planned to be installable with pip:

pip install nemex

Quickstart

The following example extracts the entities of a predefined dictionary from a document.

from nemex import Nemex
import json

E = [
    "kaushik ch",
    "chakrabarti",
    "chaudhuri",
    "venkatesh",
    "surajit ch"
]
D = "an efficient filter for approximate membership checking. venkaee shga kamunshik kabarati, dong xin, surauijt chadhurisigmod."

nemex = Nemex(E) # initialize with the entity dictionary
output = nemex(D) # query document

print(json.dumps(output, indent=2))

Running the example produces the following output:

{
  "document": "an efficient filter for approximate membership checking. venkaee shga kamunshik kabarati, dong xin, surauijt chadhurisigmod.",
  "matches": [
    {
      "valid": true,
      "entity": [
        "chaudhuri",
        2
      ],
      "score": 2,
      "match": " chadhuri",
      "span": [
        108,
        117
      ]
    },
    {
      "valid": true,
      "entity": [
        "chaudhuri",
        2
      ],
      "score": 1,
      "match": "chadhuri",
      "span": [
        109,
        117
      ]
    },
    {
      "valid": true,
      "entity": [
        "chaudhuri",
        2
      ],
      "score": 2,
      "match": "chadhuris",
      "span": [
        109,
        118
      ]
    },
    {
      "valid": true,
      "entity": [
        "chaudhuri",
        2
      ],
      "score": 2,
      "match": "hadhuri",
      "span": [
        110,
        117
      ]
    },
    {
      "valid": true,
      "entity": [
        "venkatesh",
        3
      ],
      "score": 2,
      "match": "venkaee sh",
      "span": [
        57,
        67
      ]
    },
    {
      "valid": true,
      "entity": [
        "surajit ch",
        4
      ],
      "score": 2,
      "match": "surauijt ch",
      "span": [
        100,
        111
      ]
    }
  ]
}
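The output lists several overlapping matches for the same entity, so a common next step is to keep only the best one per entity. The sketch below works on a trimmed copy of the output above and relies only on the fields shown there. Two things are assumptions read off the sample rather than documented behavior: a lower "score" is treated as a better match, and "span" is treated as a half-open [start, end) range into the document. The helper name `best_per_entity` is hypothetical and not part of PyNemex.

```python
# Trimmed copy of the output above (four of the six matches).
output = {
    "document": "an efficient filter for approximate membership checking. "
                "venkaee shga kamunshik kabarati, dong xin, surauijt chadhurisigmod.",
    "matches": [
        {"valid": True, "entity": ["chaudhuri", 2], "score": 2,
         "match": " chadhuri", "span": [108, 117]},
        {"valid": True, "entity": ["chaudhuri", 2], "score": 1,
         "match": "chadhuri", "span": [109, 117]},
        {"valid": True, "entity": ["venkatesh", 3], "score": 2,
         "match": "venkaee sh", "span": [57, 67]},
        {"valid": True, "entity": ["surajit ch", 4], "score": 2,
         "match": "surauijt ch", "span": [100, 111]},
    ],
}


def best_per_entity(result):
    """Keep, for each entity, the valid match with the lowest score.

    Assumes that a lower score means a closer match; ties keep the first hit.
    """
    best = {}
    for m in result["matches"]:
        if not m["valid"]:
            continue
        name = m["entity"][0]  # first element is the entity string
        if name not in best or m["score"] < best[name]["score"]:
            best[name] = m
    return best


for name, m in best_per_entity(output).items():
    start, end = m["span"]  # assumed half-open: document[start:end]
    print(name, "->", repr(output["document"][start:end]), "score", m["score"])
```

Slicing the document with the span, as in the last loop, reproduces the "match" string for each entry in the sample.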

History

The authors of Faerie [1] released their implementation, written in C++, as a binary. The first open-source version, called NEMEX, was originally written in Java by Günter Neumann and later maintained in part by Amir Moin at DFKI, Saarbrücken.

Update (16.09.2021): iodike added a test suite and documentation.

References

[1] Li, G., Deng, D., & Feng, J. (2011, June). Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (pp. 529-540). ACM.
