Showing 7 changed files with 281 additions and 1 deletion.
.. _page-classifier_training:

Training a classifier
=====================

Once a Wikidata dump is preprocessed and indexed, we can train a classifier
to predict matches in text.

Getting a NIF dataset
---------------------

Training requires access to a dataset encoded in `NIF (NLP Interchange Format) <https://github.com/dice-group/gerbil/wiki/NIF>`__.
Various such datasets can be found at the `NLP2RDF dashboard <http://dashboard.nlp2rdf.aksw.org/>`__.
The NIF dataset must use Wikidata entity URIs for its annotations. Here is an example of what it looks like in practice::

    <https://zenodo.org/wd_affiliations/4> a nif:Context,
            nif:OffsetBasedString ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "67"^^xsd:nonNegativeInteger ;
        nif:isString "Konarka Technologies, 116 John St., Suite 12, Lowell, MA 01852, USA" ;
        nif:sourceUrl <https://doi.org/10.1002/aenm.201100390> .

    <https://zenodo.org/wd_affiliations/4#offset_64_67> a nif:OffsetBasedString,
            nif:Phrase ;
        nif:anchorOf "USA" ;
        nif:beginIndex "64"^^xsd:nonNegativeInteger ;
        nif:endIndex "67"^^xsd:nonNegativeInteger ;
        nif:referenceContext <https://zenodo.org/wd_affiliations/4> ;
        itsrdf:taIdentRef <http://www.wikidata.org/entity/Q30> .

Annotating your own dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to annotate your own dataset, you can use an existing annotator such as `NIFify <https://github.com/henryrosalesmendez/NIFify_v2>`__ (although it currently does not seem to handle large datasets very well).

Converting an existing dataset to Wikidata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have an existing dataset with URIs pointing to another knowledge base, such as DBpedia, you can convert it to Wikidata.
This first requires translating the existing annotations, which can be done automatically with tools such as `nifconverter <https://github.com/wetneb/nifconverter>`__. Then comes the harder part: you need to annotate any mention of an entity which is not
covered by the original knowledge base but is included in Wikidata. If out-of-KB mentions are already annotated in your dataset,
you can extract these and use tools such as `OpenRefine <http://openrefine.org>`__ to match their phrases to Wikidata. Otherwise, you can extract them with a named entity recognition tool, or annotate them manually.
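
If you need to check which annotations in a NIF file already point to Wikidata, a small
script can help. The sketch below is only a rough illustration (not part of OpenTapioca);
it assumes ``rdflib`` is installed and ``my_dataset.ttl`` is a placeholder file name. It
lists every annotated phrase together with the entity URI it points to::

    import rdflib

    NIF = rdflib.Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
    ITSRDF = rdflib.Namespace("http://www.w3.org/2005/11/its/rdf#")

    g = rdflib.Graph()
    g.parse("my_dataset.ttl", format="turtle")  # hypothetical dataset file

    for phrase, target in g.subject_objects(ITSRDF.taIdentRef):
        anchor = g.value(phrase, NIF.anchorOf)
        # Mentions whose target is not a www.wikidata.org URI still need converting
        print(anchor, "->", target)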

Training with cross-validation
------------------------------

Training a classifier on a dataset is done via the CLI, as follows::

    tapioca train-classifier -c my_solr_collection -b my_language_model.pkl -p my_pagerank.npy -d my_dataset.ttl -o my_classifier.pkl

This will save the classifier as ``my_classifier.pkl``, which can then be used to tag text in the web app.
.. OpenTapioca documentation master file, created by
   sphinx-quickstart on Sun Apr 14 18:36:13 2019.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to OpenTapioca's documentation!
=======================================

This documentation explains how to install and configure OpenTapioca.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   install
   indexing
   classifier_training
   webapp
   testing


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
.. _indexing:

Dump preprocessing and indexing
===============================

Various components need to be trained in order to obtain a functional
tagger.

First, download a Wikidata JSON dump compressed in ``.bz2`` format::

    wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2

Language model
--------------

We will first use this dump to train a bag-of-words language model::

    tapioca train-bow latest-all.json.bz2

This will create a ``bow.pkl`` file which counts the number of
occurrences of words in Wikidata labels.
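
To give a rough idea of what such a model contains, the toy sketch below (purely
illustrative, not OpenTapioca's actual implementation) counts word occurrences over a
handful of labels and derives a simple unigram frequency from them::

    from collections import Counter

    labels = ["Konarka Technologies", "United States of America", "United States"]
    counts = Counter(word.lower() for label in labels for word in label.split())
    # counts["united"] == 2, counts["states"] == 2, counts["konarka"] == 1, ...

    total = sum(counts.values())
    print(counts["united"] / total)  # relative frequency of "united" in this tiny corpus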

PageRank computation
--------------------

Second, we will use the dump to extract a more compact graph of entities
that can be stored in memory. This graph will be used to compute the pagerank
of items. We convert a Wikidata dump into an adjacency matrix and a pagerank
vector in four steps:

1. Preprocess the dump, extracting only the information we need: this creates
   a TSV file containing on each line the item id (without the leading Q), the
   list of ids this item points to, and the number of occurrences of such links::

       tapioca preprocess latest-all.json.bz2

2. Sort this dump externally (for instance with GNU ``sort``); doing the sorting
   outside Python is more efficient than doing it in Python itself::

       sort -n -k 1 latest-all.unsorted.tsv > wikidata_graph.tsv

3. Convert the sorted dump into a Numpy sparse adjacency matrix,
   ``wikidata_graph.npz``::

       tapioca compile wikidata_graph.tsv

4. Compute the pagerank from the Numpy sparse matrix and store it as a dense
   vector, ``wikidata_graph.pgrank.npy``::

       tapioca compute-pagerank wikidata_graph.npz

This slightly convoluted setup makes it possible to compute the adjacency
matrix and pagerank from entire dumps on a machine with little memory (8 GB).
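
If you want to sanity-check the resulting files, they can be opened with standard
NumPy/SciPy tools. This is only a sketch; it assumes the file names above and that the
adjacency matrix was saved in SciPy's sparse ``.npz`` format::

    import numpy as np
    from scipy import sparse

    adjacency = sparse.load_npz("wikidata_graph.npz")  # sparse adjacency matrix
    pagerank = np.load("wikidata_graph.pgrank.npy")    # dense pagerank vector

    print(adjacency.shape, pagerank.shape)
    # Assuming row indices correspond to numeric item ids (Q42 -> row 42):
    print(pagerank[42])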

Indexing for tagging
--------------------

We then need to index the Wikidata dump in a Solr collection. This uses
the JSON dump only. It also requires an indexing profile, which defines
which items will be indexed and how. A sample profile is provided to
index people, organizations and places at
``profiles/human_organization_place.json`` (the ``#`` comments below are
explanations only; standard JSON does not support comments)::

    {
        "language": "en",                       # The preferred language
        "name": "human_organization_location",  # An identifier for the profile
        "restrict_properties": [
            # Include all items bearing any of these properties
            "P2427", "P1566", "P496"
        ],
        "restrict_types": [
            # Include all items with any of these types, or subclasses of them
            {"type": "Q43229", "property": "P31"},
            {"type": "Q618123", "property": "P31"},
            {"type": "Q5", "property": "P31"}
        ],
        "alias_properties": [
            # Add the values of these properties as aliases
            {"property": "P496", "prefix": null},
            {"property": "P2002", "prefix": "@"},
            {"property": "P4550", "prefix": null}
        ]
    }

Pick a Solr collection name and run::

    tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json

Note that if you have multiple cores available, you might want to run
decompression as a separate process, given that it is generally the
bottleneck::

    bunzip2 < latest-all.json.bz2 | tapioca index-dump my_collection_name - --profile profiles/human_organization_place.json
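
Once indexing has finished, you can check that documents made it into the collection.
This is just a quick sanity check; it assumes Solr runs on its default port 8983 and
uses the collection name chosen above::

    curl "http://localhost:8983/solr/my_collection_name/select?q=*:*&rows=1"
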
.. _page-install:

Installing OpenTapioca
======================

This software is a Python web service that requires Solr.

Installing Solr
---------------

OpenTapioca relies on a feature introduced in Solr 7.4.0, which was previously available as
an external plugin, `SolrTextTagger <https://github.com/OpenSextant/SolrTextTagger>`__.
If you cannot use a recent Solr version, it is possible to use older versions with the plugin
installed: this will require changing the class names in the Solr configs (in the ``configsets`` directory).

Install `Solr <https://lucene.apache.org/solr/>`__ 7.4.0 or above.

OpenTapioca requires that Solr runs in Cloud mode, so start it as follows::

    bin/solr start -c -m 4g

The memory available to Solr (here 4 GB) will determine how many indexing operations you can run in parallel
(searching is cheap).

In Cloud mode, Solr reads the configuration for its indices from so-called "configsets", which govern the
configuration of multiple collections. OpenTapioca comes with the appropriate configsets for its collections;
the default one is called "tapioca". You need to upload it to Solr before indexing any data, as follows::

    bin/solr zk -upconfig -z localhost:9983 -n tapioca -d configsets/tapioca
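
You can verify that the configset was uploaded by listing the configsets stored in
ZooKeeper (this assumes Solr's embedded ZooKeeper on its default port, 9983)::

    bin/solr zk ls /configs -z localhost:9983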

Custom analyzers
~~~~~~~~~~~~~~~~

Some profiles require custom Solr analyzers and tokenizers. For instance,
the Twitter profile can be used to index Twitter usernames and hashtags
as labels, which is useful to annotate mentions in Twitter feeds. This
requires a special tokenizer which handles these tokens appropriately.
This tokenizer is provided as a Solr plugin in the ``plugins``
directory. It can be installed by adding this jar to the
``server/solr/lib`` directory of your Solr instance (the ``lib``
subfolder needs to be created first).
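
For example, from the OpenTapioca repository, the installation could look roughly like
this (the exact jar file name in ``plugins`` and the Solr installation path are
placeholders)::

    mkdir -p /path/to/solr/server/solr/lib
    cp plugins/*.jar /path/to/solr/server/solr/lib/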

Installing Python dependencies
------------------------------

OpenTapioca is known to work with Python 3.6, and offers a command-line interface
to manipulate Wikidata dumps and train classifiers from datasets.

In a virtualenv, run ``pip install -r requirements.txt`` to install the
Python dependencies, and ``python setup.py install`` to install the CLI
in your PATH.
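
Concretely, a full setup could look like this (this is just one way to create the
virtualenv; adapt the paths and Python version to your environment)::

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    python setup.py install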

When developing OpenTapioca, you can use ``pip install -e .`` to install the CLI
from the local files, so that your changes to the source code are directly reflected
in the CLI, without the need to run ``python setup.py install`` every time you change
something.
.. _page-testing:

Testing OpenTapioca
===================

OpenTapioca comes with a test suite that can be run with ``pytest``.
This requires a Solr server to be running on ``localhost:8983``, in Cloud mode.
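
For example, a local test run could look like this (the Solr installation path is a
placeholder; Cloud mode uses the default port 8983)::

    /path/to/solr/bin/solr start -c -m 4g
    pytest
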
.. _page-webapp:

Running the web app
===================

Once a classifier is trained, you can run the web app. This requires supplying
the filenames of the various preprocessed files and the Solr collection name via environment
variables. You can run the application locally for development as follows::

    export TAPIOCA_BOW="my_language_model.pkl"
    export TAPIOCA_PAGERANK="my_pagerank.npy"
    export TAPIOCA_CLASSIFIER="my_classifier.pkl"
    export TAPIOCA_COLLECTION="my_solr_collection"
    python app.py

For production deployment, you should use a proper web server with WSGI support.
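
For instance, a minimal sketch with `Gunicorn <https://gunicorn.org/>`__ could look like
the following, assuming the WSGI application object exposed by ``app.py`` is named ``app``
(check the module to confirm the actual name) and that the same environment variables as
above are set::

    pip install gunicorn
    gunicorn --workers 4 --bind 0.0.0.0:8000 app:app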