Quick start
JATE2.0 is a modular, scalable, configurable and ready-to-use term extraction tool and development framework. It integrates with Apache Solr and provides a number of general-purpose plugins for text mining. Currently, JATE2.0 does not support multiple languages directly; instead, it provides a framework so that language-specific plugins can be developed and configured.
This page walks through several examples to get you started as quickly as possible. Two YouTube videos (Embedded Mode Demo and Plugin Mode Demo) are also available to help you follow this tutorial.
JATE2.0 is based on JDK 1.8. Please download & install Java 1.8 or later.
JATE2.0 is based on Apache Solr 7.2.1. Please download & install it. This is optional if you don't need to run JATE within your own external Solr server (as in the plugin mode).
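For example, one way to fetch, unpack and start Solr 7.2.1 on Linux/macOS is shown below; the download URL assumes the standard Apache archive layout:
#!/bin/sh
# Download and unpack Solr 7.2.1 (the URL assumes the standard Apache archive layout).
wget http://archive.apache.org/dist/lucene/solr/7.2.1/solr-7.2.1.tgz
tar xzf solr-7.2.1.tgz
# Start Solr on the default port 8983 (only needed for the plugin mode).
solr-7.2.1/bin/solr start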
Next, you need to download the latest release of JATE2.0 via the Nexus Repository or Git.
JATE uses Maven to manage its libraries. If you are using JATE as a library in your project, the only library that is not available in the Maven Central repository is Dragontools. For this, either configure it manually in your IDE, or run mvn install to install it in your local Maven repository properly. The version of dragontool from edu.drexel that JATE uses doesn't exist on Maven Central; alternatively, the same version exists at de.julielab (credit to @catap) if you would like to configure JATE as an external library in your project.
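For example, a minimal sketch of installing JATE (together with the bundled Dragontools jar) into your local Maven repository, assuming you have cloned the JATE2.0 source into a folder named jate:
#!/bin/sh
# Run from the root of your JATE2.0 checkout (the folder name 'jate' is a placeholder);
# this installs the project, including the bundled Dragontools jar, into ~/.m2/repository.
cd jate
mvn install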
The Maven setup of JATE2.0 is as follows, if you are using JATE as a library:
<dependency>
<groupId>uk.ac.shef.dcs</groupId>
<artifactId>jate</artifactId>
<version>2.0-beta.11</version>
</dependency>
See more details in Using-JATE
JATE2.0 requires a Solr core instance directory for indexing documents and storing term statistics. The JATE2.0 testbed contains several examples of Solr core configurations for different corpora. Note: a document unique id field is mandatory and the field name MUST be 'id'.
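For instance, a simple way to create your own core configuration is to copy one of the sample testbed cores and adapt its conf/ files; the target name 'mycorpus' below is just a placeholder:
#!/bin/sh
# Start from the sample ACLRDTEC core shipped with the testbed and adapt it for your own corpus.
cp -r <JATE_HOME>/testdata/solr-testbed/ACLRDTEC <JATE_HOME>/testdata/solr-testbed/mycorpus
# Then edit the schema.xml and solrconfig.xml under mycorpus/conf as needed.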
jate.properties defines a number of properties you may need to configure for JATE2.0 to work with your Solr instance. However, if you use the default Solr instance configuration, you should not need to provide this file; it will automatically be loaded from the classpath.
Next, to run an ATE program on your corpus, you can simply fire up the JATE standalone jar file from the command line as shown below. Running JATE2.0 in embedded mode means that you don't have to install Apache Solr (i.e., you can skip Step 2):
java -cp jate-2.0-*-jar-with-dependencies.jar <APP_ALGORITHM> <OPTIONS> <SOLR_HOME_PATH> <SOLR_CORE_NAME>
For example, to analyse a given corpus with the CValue algorithm and the default Solr setting, using the example core "ACLRDTEC", you can run the following program from the command line:
java -cp <PATH>/jate-2.0-*-jar-with-dependencies.jar uk.ac.shef.dcs.jate.app.AppCValue -corpusDir <CORPUS_DIR> -o cvalue-terms.json <JATE_HOME>/testdata/solr-testbed ACLRDTEC
Then, to rank and output weighted terms with a different algorithm (e.g., TF-IDF), you can simply switch to the corresponding application class:
java -cp <PATH>/jate-2.0-*-jar-with-dependencies.jar uk.ac.shef.dcs.jate.app.AppTFIDF -o tfidf-terms.json <JATE_HOME>/testdata/solr-testbed ACLRDTEC
The following algorithms can be run as standalone applications:
Algorithm | APP_ALGORITHM |
---|---|
TTF | uk.ac.shef.dcs.jate.app.AppTTF |
ATTF | uk.ac.shef.dcs.jate.app.AppATTF |
TF-IDF | uk.ac.shef.dcs.jate.app.AppTFIDF |
RIDF | uk.ac.shef.dcs.jate.app.AppRIDF |
CValue | uk.ac.shef.dcs.jate.app.AppCValue |
ChiSquare | uk.ac.shef.dcs.jate.app.AppChiSquare |
RAKE | uk.ac.shef.dcs.jate.app.AppRAKE |
Weirdness | uk.ac.shef.dcs.jate.app.AppWeirdness |
GlossEx | uk.ac.shef.dcs.jate.app.AppGlossEx |
TermEx | uk.ac.shef.dcs.jate.app.AppTermEx |
Basic | uk.ac.shef.dcs.jate.app.Basic |
ComboBasic | uk.ac.shef.dcs.jate.app.ComboBasic |
Run-time parameter options for the standalone applications:
options | Expected Type | description |
---|---|---|
-corpusDir | string | The directory of the corpus that will be processed. |
-prop | string | jate.properties file(path) for the configuration of Solr schema. |
-c | boolean | Expects 'true' or 'false'. This parameter specifies whether to collect term information for exporting, e.g., offsets in documents. Default is false. Setting it to true will significantly increase the post-processing time needed to query the Solr index for such information. |
-r | string | Reference corpus frequency file (path) is required by AppGlossEx, AppTermEx and AppWeirdness. An example is provided in '/testdata/solr-testbed/ACLRDTEC/conf/bnc_unifrqs.normal'. |
-cf.t | number | This is a post-filtering setting. Cutoff score threshold for selecting terms. If multiple -cf.* parameters are set the preference order will be cf.t, cf.k, cf.kp. |
-cf.k | number | This is a post-filtering setting. Cutoff top ranked K terms to be selected. If multiple -cf.* parameters are set the preference order will be cf.t, cf.k, cf.kp. |
-cf.kp | number | This is a post-filtering setting. Cutoff top ranked K% terms to be selected. If multiple -cf.* parameters are set the preference order will be cf.t, cf.k, cf.kp. |
-pf.mttf | number | Pre-filter minimum total term frequency. Any candidate term whose total frequency in the corpus is less than this value will not be considered for ranking. |
-pf.mtcf | number | Pre-filter minimum context frequency of a term (used by co-occurrence based methods). This is the number of context objects in which a term appears. If a candidate's context frequency is lower than this value it will not be considered for ranking. |
-o | string | File (path) to save the output. Only JSON output is currently supported. |
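For example, a sketch combining several of these options (all paths are placeholders): rank candidates with RAKE, keep only the top 300 terms, and write them to a JSON file:
#!/bin/sh
# Keep only the top 300 RAKE-ranked terms and save them as JSON (paths are placeholders).
java -cp <PATH>/jate-2.0-*-jar-with-dependencies.jar uk.ac.shef.dcs.jate.app.AppRAKE \
     -corpusDir <CORPUS_DIR> -cf.k 300 -o rake-terms.json \
     <JATE_HOME>/testdata/solr-testbed ACLRDTEC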
You can also run JATE2.0 as a Solr plugin in your own Solr server (only supported since the Beta version). The recommended setting in an individual SolrCore is as follows:
Create a new folder 'jate' in a lib or contrib directory in the instanceDir of your SolrCore. Then, place the JATE 2.0 jars (simply use the jate-2.0-**-dependencies.jar) in the $SOLR_HOME/contrib/jate/lib folder.
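For example, assuming you have built or downloaded the jar-with-dependencies, the copy step might look like this (paths are placeholders):
#!/bin/sh
# Create the plugin folder and copy the JATE jar into it (adjust $SOLR_HOME and the jar path).
mkdir -p $SOLR_HOME/contrib/jate/lib
cp <PATH>/jate-2.0-*-jar-with-dependencies.jar $SOLR_HOME/contrib/jate/lib/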
Assuming you have created a Solr core called 'jate', there should be a folder '$SOLR_HOME/solr/jate' containing content similar to the two sample cores provided (e.g., in the testbed). Next, configure the JATE jars path in your '$SOLR_HOME/solr/jate/conf/solrconfig.xml'. An example setting is as follows:
<lib dir="${solr.install.dir:../../..}/contrib/jate/lib" regex=".*\.jar" />
You may also refer to how to add custom plugins in SolrCloud mode.
Candidate extraction and term scoring are two separate processes. The former is performed automatically at index time, and the latter needs to be triggered separately. In JATE2.0, scoring can be triggered by an HTTP request. The term recognition request handler needs to be configured in your solrconfig.xml so that candidate terms (processed at document indexing time) can be scored, ranked, filtered and exported. The final selected terms are saved in a field defined in your schema, by default called jate_domain_terms; this field name can be changed in jate.properties.
In addition to the run-time parameter options listed above, the following parameters can be configured for the request handler:
options | Expected Type | Is Required | description |
---|---|---|---|
algorithm | string | Y | The ATE algorithm that is used to weight candidate terms. For accepted values, please refer to the algorithms listed above. |
extraction | boolean | N | Set true or false to determine whether candidate terms will be (re)extracted from the current index. Default is false. Essentially, this is a re-indexing process. For example, you can set it to true to try out different term PoS sequence patterns or pre-filtering settings. Note: remember to use RELOAD for Configuration_Changes_In_Solr (see the reload example after the request handler setting below). |
indexTerm | boolean | N | Set true or false to determine whether filtered candidate terms will be indexed and stored (e.g., for supporting faceted navigation/search). If set to true, a corresponding Solr field must be configured in the schema. The value is false by default. Indexing filtered terms with boosting is only available in plugin mode in the current version. |
boosting | boolean | N | Set true or false to determine whether the term score will be used as a boosting value when indexing filtered terms. The value is false by default. You'll need to set 'omitNorms' to 'false' in the jate_domain_terms field of your schema before you set the boosts. Enabling boosting requires more memory. Warning: this only works in the Beta.1 version (which supports Solr 5.x), because index-time boosts are no longer supported since Solr 6.5. |
An example setting is as follows:
<requestHandler name="/termRecogniser" class="uk.ac.shef.dcs.jate.solr.TermRecognitionRequestHandler">
<lst name="defaults">
<str name="algorithm">CValue</str>
<bool name="extraction">false</bool>
<bool name="indexTerm">true</bool>
<bool name="boosting">false</bool>
<str name="-prop"><YOUR_PATH>/resource/jate.properties</str>
<float name="-cf.t">0</float>
<str name="-o"><YOUR_PATH>/industry_terms.json</str>
</lst>
</requestHandler>
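Whenever you change this request handler configuration (or the schema), remember to reload the core so that the changes take effect. A minimal sketch using Solr's standard CoreAdmin API (the core name 'jate' is a placeholder):
#!/bin/sh
# Reload the 'jate' core so that changes to solrconfig.xml/schema.xml take effect.
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=jate"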
To make sure JATE2.0 works with your indexing engine, you need to configure your schema.xml properly. Two Solr content analysis fields (an ngram field and a term candidate field) are mandatory. Please refer to jate_text_2_ngrams and jate_text_2_terms in the schema.xml of the sample instance cores for examples.
To enable JATE2.0 to work with the Tika plugin, you need to make sure that: (1) the Tika requestHandler in your solrconfig.xml defines fmap.content to map to the text field defined in your schema.xml; (2) your schema.xml copies the text field into the two fields required by JATE2.0; (3) your text field is set with indexed="true" stored="true"; (4) in terms of boosting, the setting for the jate_domain_terms field in your schema is compatible with the setting for the requestHandler in your solrconfig.
An example setting is shown below:
----------- solrconfig.xml --------------------
<!-- Solr Cell Update Request Handler in solrconfig.xml
http://wiki.apache.org/solr/ExtractingRequestHandler
-->
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
...
<str name="fmap.content">text</str>
<bool name="boosting">false</bool>
...
</lst>
...
</requestHandler>
----------- schema.xml-----------
...
<types>
...
<fieldType name="jate_text_2_ngrams" class="solr.TextField" positionIncrementGap="100"> ... </fieldType>
<fieldType name="jate_text_2_terms" class="solr.TextField" positionIncrementGap="100"> ... </fieldType>
...
</types>
<fields>
...
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
...
<!-- Field to index text with n-gram tokens-->
<field name="jate_ngraminfo" type="jate_text_2_ngrams" indexed="true" stored="false" multiValued="false"
termVectors="true" termPositions="true" termOffsets="true"
termPayloads="true"/>
<!-- Field to index text with candidate terms. -->
<field name="jate_cterms" type="jate_text_2_terms" indexed="true" stored="false" multiValued="false"
termVectors="true"/>
<field name="jate_domain_terms" type="string" indexed="true" stored="true" omitNorms="true" required="false" multiValued="true"/>
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
<copyField source="text" dest="jate_cterms" />
<copyField source="text" dest="jate_ngraminfo" />
</fields>
Now you are able to upload documents with the Tika plugin, and term candidates will be extracted and indexed at document indexing time.
For example, you can use JATE2.0 to analyse your local archive by directly uploading documents with the Solr POST tool:
#!/bin/sh
<SOLR_HOME>/bin/post -c <CORE_NAME> -host <HOST_NAME> -p <PORT_NO> <CORPUS_DIR>
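For instance, to index everything under a local folder into a core named 'jate' on a Solr instance running on localhost (all values below are placeholders):
#!/bin/sh
# Index all documents under ./mycorpus into the 'jate' core of a local Solr instance.
<SOLR_HOME>/bin/post -c jate -host localhost -p 8983 ./mycorpus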
Alternatively, you can alter the algorithm setting to analyse your content with a different ATE algorithm.
Step 8.5 term scoring, ranking, filtering, indexing, storing and exporting triggered by an HTTP request
Please note that in plugin mode, the term scoring process is separate from candidate extraction (which happens at index time), and the scoring and filtering process can be triggered by sending an HTTP request to Solr. Candidate extraction can also be enabled as an option (by setting 'extraction' to true in the request handler config).
For example, with the setting above, sending a POST request will export the final ranked & filtered terms into a JSON file for further analysis:
$ curl -X POST http://localhost:8983/solr/jateCore/termRecogniser
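To try a different algorithm without editing solrconfig.xml, you can pass it as a request parameter, assuming the handler merges per-request parameters with the configured defaults as standard Solr request handlers do ('RAKE' below is one of the algorithm values listed above):
#!/bin/sh
# Override the configured algorithm for this request only (assumes standard Solr parameter handling).
curl -X POST "http://localhost:8983/solr/jateCore/termRecogniser?algorithm=RAKE"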
For the live demo at the LREC 2016 conference, please refer to jateSolrPluginDemo for more details.