Extract and disambiguate named entities

Installation

$ pip install faktotum

Getting started

You can use pre-trained models for both literary texts (literary-texts) and press releases (press-texts).

Named Entity Recognition

>>> import faktotum
>>> text = "Er hieß Eduard. Ihr Name war Klara, manchmal auch Klärchen."
>>> faktotum.ner(text, domain="literary-texts")
    sentence_id      word entity
0             0        Er    NaN
1             0      hieß    NaN
2             0    Eduard  B-PER
3             0         .    NaN
4             1       Ihr    NaN
5             1      Name    NaN
6             1       war    NaN
7             1     Klara  B-PER
8             1         ,    NaN
9             1  manchmal    NaN
10            1      auch    NaN
11            1  Klärchen  B-PER
12            1         .    NaN

Named Entity Linking

You have to provide a knowledge base in a JSON file like:

{
    "Q1": {
        "CONTEXTS": [
            [
                "Eduard",
                "ist",
                "ein",
                "schöner",
                "Name",
                "."]
        ],
        "ENTITY_INDICES": [
            [
                0
            ]
        ]
    },
    "Q2": {
        "CONTEXTS": [
            [
                "Klara",
                "ist",
                "auch",
                "ein",
                "schöner",
                "Name",
                "."
            ]
        ],
        "ENTITY_INDICES": [
            [
                0
            ]
        ]
    }

where an identifier maps to a list of contexts and a list of indices referring to entities in the contexts.

>>> import json
>>> import faktotum
>>> with open("kb.json", "r", encoding="utf-8") as f:
...     data = json.load(f)
>>> kb = faktotum.KnowledgeBase(data, domain="literary-texts")
>>> text = "Er hieß Eduard. Ihr Name war Klara, manchmal auch Klärchen."
>>> faktotum.nel(text, kb, domain="literary-texts")
    sentence_id      word entity entity_id
0             0        Er    NaN       NaN
1             0      hieß    NaN       NaN
2             0    Eduard  B-PER        Q1
3             0         .    NaN       NaN
4             1       Ihr    NaN       NaN
5             1      Name    NaN       NaN
6             1       war    NaN       NaN
7             1     Klara  B-PER        Q2
8             1         ,    NaN       NaN
9             1  manchmal    NaN       NaN
10            1      auch    NaN       NaN
11            1  Klärchen  B-PER        Q2
12            1         .    NaN       NaN

Or with a knowledge base with two records related to press releases:

{
    "Q1": {
        "CONTEXTS": [
            [
                "Apple",
                "produziert",
                "Software",
                "und",
                "Hardware",
                "."
            ]
        ],
        "ENTITY_INDICES": [
            [
                0
            ]
        ]
    },
    "Q2": {
        "CONTEXTS": [
            [
                "Apple",
                "ist",
                "ein",
                "Plattenlabel",
                "."
            ]
        ],
        "ENTITY_INDICES": [
            [
                0
            ]
        ]
    }
}

set the domain argument to press-texts:

>>> import json
>>> import faktotum
>>> with open("kb.json", "r", encoding="utf-8") as f:
...     data = json.load(f)
>>> kb = faktotum.KnowledgeBase(data, domain="press-texts")
>>> text = "Apple hat kürzlich neue Computer vorgestellt. Das erste Musikalbum auf Apple ist 1968 erschienen."
>>> faktotum.nel(text, kb, "press-texts")
    sentence_id         word entity entity_id
0             0        Apple  B-ORG        Q1
1             0          hat    NaN       NaN
2             0     kürzlich    NaN       NaN
3             0         neue    NaN       NaN
4             0     Computer    NaN       NaN
5             0  vorgestellt    NaN       NaN
6             0            .    NaN       NaN
7             1          Das    NaN       NaN
8             1        erste    NaN       NaN
9             1   Musikalbum    NaN       NaN
10            1          auf    NaN       NaN
11            1        Apple  B-ORG        Q2
12            1          ist    NaN       NaN
13            1         1968    NaN       NaN
14            1   erschienen    NaN       NaN
15            1            .    NaN       NaN

Name		Name	Last commit message	Last commit date
Latest commit History 1,127 Commits
faktotum		faktotum
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

faktotum

faktotum

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

poetry.lock

poetry.lock

pyproject.toml

pyproject.toml

Repository files navigation

Extract and disambiguate named entities

Installation

Getting started

Named Entity Recognition

Named Entity Linking

About

Releases

Packages

Contributors 2

Languages

License

severinsimmler/faktotum

Folders and files

Latest commit

History

Repository files navigation

Extract and disambiguate named entities

Installation

Getting started

Named Entity Recognition

Named Entity Linking

About

Resources

License

Stars

Watchers

Forks

Languages