Showing 7 changed files with 281 additions and 1 deletion.
.. _page-classifier_training:

Training a classifier
=====================

Once a Wikidata dump is preprocessed and indexed, we can train a classifier
to predict matches in text.

Getting a NIF dataset
---------------------

Training requires access to a dataset encoded in `NIF (NLP Interchange Format) <https://github.com/dice-group/gerbil/wiki/NIF>`__.
Various such datasets can be found at the `NLP2RDF dashboard <http://dashboard.nlp2rdf.aksw.org/>`__.
The NIF dataset must use Wikidata entity URIs for its annotations. Here is an example of what it looks like in practice::

    <https://zenodo.org/wd_affiliations/4> a nif:Context,
            nif:OffsetBasedString ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "67"^^xsd:nonNegativeInteger ;
        nif:isString "Konarka Technologies, 116 John St., Suite 12, Lowell, MA 01852, USA" ;
        nif:sourceUrl <https://doi.org/10.1002/aenm.201100390> .

    <https://zenodo.org/wd_affiliations/4#offset_64_67> a nif:OffsetBasedString,
            nif:Phrase ;
        nif:anchorOf "USA" ;
        nif:beginIndex "64"^^xsd:nonNegativeInteger ;
        nif:endIndex "67"^^xsd:nonNegativeInteger ;
        nif:referenceContext <https://zenodo.org/wd_affiliations/4> ;
        itsrdf:taIdentRef <http://www.wikidata.org/entity/Q30> .

Annotating your own dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to annotate your own dataset, you can use an existing annotator such as `NIFify <https://github.com/henryrosalesmendez/NIFify_v2>`__ (although it currently does not seem to handle large datasets very well).

Converting an existing dataset to Wikidata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have an existing dataset with URIs pointing to another knowledge base, such as DBpedia, you can convert it to Wikidata.
This first requires translating the existing annotations, which can be done automatically with tools such as `nifconverter <https://github.com/wetneb/nifconverter>`__. Then comes the harder part: you need to annotate any mention of an entity which is not
covered by the original knowledge base but is included in Wikidata. If out-of-KB mentions are already annotated in your dataset,
you can extract these and use tools such as `OpenRefine <http://openrefine.org>`__ to match their phrases to Wikidata. Otherwise, you can extract them with a named entity recognition tool, or annotate them manually.
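
If you need to check which annotations in a NIF file already point to Wikidata, a small
script can help. The sketch below is only a rough illustration (not part of OpenTapioca);
it assumes ``rdflib`` is installed and ``my_dataset.ttl`` is a placeholder file name. It
lists every annotated phrase together with the entity URI it points to::

    import rdflib

    NIF = rdflib.Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
    ITSRDF = rdflib.Namespace("http://www.w3.org/2005/11/its/rdf#")

    g = rdflib.Graph()
    g.parse("my_dataset.ttl", format="turtle")  # hypothetical dataset file

    for phrase, target in g.subject_objects(ITSRDF.taIdentRef):
        anchor = g.value(phrase, NIF.anchorOf)
        # Mentions whose target is not a www.wikidata.org URI still need converting
        print(anchor, "->", target)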

Training with cross-validation
------------------------------

Training a classifier on a dataset is done via the CLI, as follows::

    tapioca train-classifier -c my_solr_collection -b my_language_model.pkl -p my_pagerank.npy -d my_dataset.ttl -o my_classifier.pkl

This will save the classifier as ``my_classifier.pkl``, which can then be used to tag text in the web app.
.. OpenTapioca documentation master file, created by
   sphinx-quickstart on Sun Apr 14 18:36:13 2019.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to OpenTapioca's documentation!
=======================================

This documentation explains how to install and configure OpenTapioca.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   install
   indexing
   classifier_training
   webapp
   testing


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
.. _indexing:

Dump preprocessing and indexing
===============================

Various components need to be trained in order to obtain a functional
tagger.

First, download a Wikidata JSON dump compressed in ``.bz2`` format::

    wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2

Language model
--------------

We will first use this dump to train a bag-of-words language model::

    tapioca train-bow latest-all.json.bz2

This will create a ``bow.pkl`` file which counts the number of
occurrences of words in Wikidata labels.
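
To give a rough idea of what such a model contains, the toy sketch below (purely
illustrative, not OpenTapioca's actual implementation) counts word occurrences over a
handful of labels and derives a simple unigram frequency from them::

    from collections import Counter

    labels = ["Konarka Technologies", "United States of America", "United States"]
    counts = Counter(word.lower() for label in labels for word in label.split())
    # counts["united"] == 2, counts["states"] == 2, counts["konarka"] == 1, ...

    total = sum(counts.values())
    print(counts["united"] / total)  # relative frequency of "united" in this tiny corpus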

PageRank computation
--------------------

Second, we will use the dump to extract a more compact graph of entities
that can be stored in memory. This graph will be used to compute the pagerank
of items. We convert a Wikidata dump into an adjacency matrix and a pagerank
vector in four steps:

1. Preprocess the dump, extracting only the information we need: this creates
   a TSV file containing on each line the item id (without the leading Q), the
   list of ids this item points to, and the number of occurrences of such links::

       tapioca preprocess latest-all.json.bz2

2. Sort this dump externally (for instance with GNU ``sort``); doing the sorting
   outside Python is more efficient than doing it in Python itself::

       sort -n -k 1 latest-all.unsorted.tsv > wikidata_graph.tsv

3. Convert the sorted dump into a Numpy sparse adjacency matrix,
   ``wikidata_graph.npz``::

       tapioca compile wikidata_graph.tsv

4. Compute the pagerank from the Numpy sparse matrix and store it as a dense
   vector, ``wikidata_graph.pgrank.npy``::

       tapioca compute-pagerank wikidata_graph.npz

This slightly convoluted setup makes it possible to compute the adjacency
matrix and pagerank from entire dumps on a machine with little memory (8 GB).
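
If you want to sanity-check the resulting files, they can be opened with standard
NumPy/SciPy tools. This is only a sketch; it assumes the file names above and that the
adjacency matrix was saved in SciPy's sparse ``.npz`` format::

    import numpy as np
    from scipy import sparse

    adjacency = sparse.load_npz("wikidata_graph.npz")  # sparse adjacency matrix
    pagerank = np.load("wikidata_graph.pgrank.npy")    # dense pagerank vector

    print(adjacency.shape, pagerank.shape)
    # Assuming row indices correspond to numeric item ids (Q42 -> row 42):
    print(pagerank[42])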

Indexing for tagging
--------------------

We then need to index the Wikidata dump in a Solr collection. This uses
the JSON dump only. It also requires an indexing profile, which defines
which items will be indexed and how. A sample profile is provided to
index people, organizations and places at
``profiles/human_organization_place.json`` (the ``#`` comments below are
explanations only; standard JSON does not support comments)::

    {
        "language": "en",                       # The preferred language
        "name": "human_organization_location",  # An identifier for the profile
        "restrict_properties": [
            # Include all items bearing any of these properties
            "P2427", "P1566", "P496"
        ],
        "restrict_types": [
            # Include all items with any of these types, or subclasses of them
            {"type": "Q43229", "property": "P31"},
            {"type": "Q618123", "property": "P31"},
            {"type": "Q5", "property": "P31"}
        ],
        "alias_properties": [
            # Add the values of these properties as aliases
            {"property": "P496", "prefix": null},
            {"property": "P2002", "prefix": "@"},
            {"property": "P4550", "prefix": null}
        ]
    }

Pick a Solr collection name and run::

    tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json

Note that if you have multiple cores available, you might want to run
decompression as a separate process, given that it is generally the
bottleneck::

    bunzip2 < latest-all.json.bz2 | tapioca index-dump my_collection_name - --profile profiles/human_organization_place.json
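
Once indexing has finished, you can check that documents made it into the collection.
This is just a quick sanity check; it assumes Solr runs on its default port 8983 and
uses the collection name chosen above::

    curl "http://localhost:8983/solr/my_collection_name/select?q=*:*&rows=1"
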
.. _page-install:

Installing OpenTapioca
======================

This software is a Python web service that requires Solr.

Installing Solr
---------------

OpenTapioca relies on a feature introduced in Solr 7.4.0, which was previously available as
an external plugin, `SolrTextTagger <https://github.com/OpenSextant/SolrTextTagger>`__.
If you cannot use a recent Solr version, it is possible to use older versions with the plugin
installed: this will require changing the class names in the Solr configs (in the ``configsets`` directory).

Install `Solr <https://lucene.apache.org/solr/>`__ 7.4.0 or above.

OpenTapioca requires that Solr runs in Cloud mode, so start it as follows::

    bin/solr start -c -m 4g

The memory available to Solr (here 4 GB) will determine how many indexing operations you can run in parallel
(searching is cheap).

In Cloud mode, Solr reads the configuration for its indices from so-called "configsets", which govern the
configuration of multiple collections. OpenTapioca comes with the appropriate configsets for its collections;
the default one is called "tapioca". You need to upload it to Solr before indexing any data, as follows::

    bin/solr zk -upconfig -z localhost:9983 -n tapioca -d configsets/tapioca
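
You can verify that the configset was uploaded by listing the configsets stored in
ZooKeeper (this assumes Solr's embedded ZooKeeper on its default port, 9983)::

    bin/solr zk ls /configs -z localhost:9983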

Custom analyzers
~~~~~~~~~~~~~~~~

Some profiles require custom Solr analyzers and tokenizers. For instance,
the Twitter profile can be used to index Twitter usernames and hashtags
as labels, which is useful to annotate mentions in Twitter feeds. This
requires a special tokenizer which handles these tokens appropriately.
This tokenizer is provided as a Solr plugin in the ``plugins``
directory. It can be installed by adding this jar to the
``server/solr/lib`` directory of your Solr instance (the ``lib``
subfolder needs to be created first).
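
For example, from the OpenTapioca repository, the installation could look roughly like
this (the exact jar file name in ``plugins`` and the Solr installation path are
placeholders)::

    mkdir -p /path/to/solr/server/solr/lib
    cp plugins/*.jar /path/to/solr/server/solr/lib/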

Installing Python dependencies
------------------------------

OpenTapioca is known to work with Python 3.6, and offers a command-line interface
to manipulate Wikidata dumps and train classifiers from datasets.

In a virtualenv, run ``pip install -r requirements.txt`` to install the
Python dependencies, and ``python setup.py install`` to install the CLI
in your PATH.
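
Concretely, a full setup could look like this (this is just one way to create the
virtualenv; adapt the paths and Python version to your environment)::

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    python setup.py install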

When developing OpenTapioca, you can use ``pip install -e .`` to install the CLI
from the local files, so that your changes to the source code are directly reflected
in the CLI, without the need to run ``python setup.py install`` every time you change
something.
.. _page-testing:

Testing OpenTapioca
===================

OpenTapioca comes with a test suite that can be run with ``pytest``.
This requires a Solr server to be running on ``localhost:8983``, in Cloud mode.
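
For example, a local test run could look like this (the Solr installation path is a
placeholder; Cloud mode uses the default port 8983)::

    /path/to/solr/bin/solr start -c -m 4g
    pytest
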
.. _page-webapp:

Running the web app
===================

Once a classifier is trained, you can run the web app. This requires supplying
the filenames of the various preprocessed files and the Solr collection name via environment
variables. You can run the application locally for development as follows::

    export TAPIOCA_BOW="my_language_model.pkl"
    export TAPIOCA_PAGERANK="my_pagerank.npy"
    export TAPIOCA_CLASSIFIER="my_classifier.pkl"
    export TAPIOCA_COLLECTION="my_solr_collection"
    python app.py

For production deployment, you should use a proper web server with WSGI support.
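
For instance, a minimal sketch with `Gunicorn <https://gunicorn.org/>`__ could look like
the following, assuming the WSGI application object exposed by ``app.py`` is named ``app``
(check the module to confirm the actual name) and that the same environment variables as
above are set::

    pip install gunicorn
    gunicorn --workers 4 --bind 0.0.0.0:8000 app:app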