Purging index between JATE calls #32

Closed
eltimster opened this issue Oct 24, 2016 · 1 comment
@eltimster
eltimster commented Oct 24, 2016

I am trying to run JATE on different corpora, but found that it incrementally adds to the Solr index when it indexes a new corpus, meaning I get terms not just from the corpus of interest but from the union of all corpora processed to that point. My workaround has been to purge files from the relevant data/index directory with rm, but this now causes an exception:

```
2016-10-25 09:24:04 INFO  AppCValue:328 - Indexing corpus from [docs/english] and perform candidate extraction ...
2016-10-25 09:24:05 INFO  AppCValue:331 -  [151996] files are scanned and will be indexed and analysed.
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:09 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:09 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:09 AEDT 2016 loading done
Tue Oct 25 09:24:09 AEDT 2016 loading done
2016-10-25 09:24:09 ERROR SolrCore:525 - [jateCore] Solr index directory '/home/tim/forum-style/jate-2.0-beta.1/testdata/solr-testbed/jateCore/data/index/' is locked.  Throwing exception.
2016-10-25 09:24:09 ERROR CoreContainer:740 - Error creating core [jateCore]: Index locked for write for core 'jateCore'. Solr now longer supports forceful unlocking via 'unlockOnStartup'. Please verify locks manually!
org.apache.solr.common.SolrException: Index locked for write for core 'jateCore'. Solr now longer supports forceful unlocking via 'unlockOnStartup'. Please verify locks manually!
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:820)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:659)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:727)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
2016-10-25 09:24:12 ERROR SolrCore:139 - org.apache.solr.common.SolrException: Exception writing document id 112188-q to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:250)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:179)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:174)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:139)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:153)
        at uk.ac.shef.dcs.jate.util.JATEUtil.addNewDoc(JATEUtil.java:339)
        at uk.ac.shef.dcs.jate.app.App.indexJATEDocuments(App.java:374)
        at uk.ac.shef.dcs.jate.app.App.lambda$index$4(App.java:340)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
        at uk.ac.shef.dcs.jate.app.App.index(App.java:338)
        at uk.ac.shef.dcs.jate.app.AppCValue.main(AppCValue.java:45)
Caused by: java.lang.NullPointerException
        at opennlp.tools.util.Cache.put(Cache.java:134)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:195)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:87)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:32)
        at opennlp.tools.ml.BeamSearch.bestSequences(BeamSearch.java:102)
        at opennlp.tools.ml.BeamSearch.bestSequences(BeamSearch.java:168)
        at opennlp.tools.ml.BeamSearch.bestSequence(BeamSearch.java:173)
        at opennlp.tools.postag.POSTaggerME.tag(POSTaggerME.java:194)
        at opennlp.tools.postag.POSTaggerME.tag(POSTaggerME.java:190)
        at uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP.tag(POSTaggerOpenNLP.java:23)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.assignPOS(OpenNLPPOSTaggerFilter.java:103)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.createTags(OpenNLPPOSTaggerFilter.java:97)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.incrementToken(OpenNLPPOSTaggerFilter.java:51)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.getNextToken(ComplexShingleFilter.java:335)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.shiftInputWindow(ComplexShingleFilter.java:412)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.incrementToken(ComplexShingleFilter.java:175)
        at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:45)
        at org.apache.lucene.analysis.jate.EnglishLemmatisationFilter.incrementToken(EnglishLemmatisationFilter.java:30)
        at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:613)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
        ... 23 more

2016-10-25 09:24:12 ERROR TransactionLog:567 - Error: Forcing close of tlog{file=/home/tim/forum-style/jate-2.0-beta.1/testdata/solr-testbed/ACLRDTEC/data/tlog/tlog.0000000000000004167 refcount=2}
```

Is there a clean way to do what I want to do?

Also, by way of note, the lack of support for concurrent processes (also caused by Solr allowing only one JATE indexer at a time) is a real bottleneck ...

@jerrygaoLondon
jerrygaoLondon (Collaborator) commented Nov 15, 2016

Thanks for reporting this issue. From my perspective, these are not bugs. Please post in our Google group for further discussion; I put my short answer below.

To analyse different corpora, you can create separate Solr core directories with corpus-specific settings; you don't need to purge the Solr index every time. If you want to try a different ATE algorithm, you do NOT need to run candidate extraction again. The corpus directory is optional in both embedded mode and plugin mode. In embedded mode, if "-corpusDir" is not provided, JATE skips the term candidate extraction step and directly runs term scoring, ranking and exporting over the provided Solr core directory. In plugin mode, there is an 'extraction' option in 'solrconfig.xml'.
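A minimal sketch of this two-phase embedded-mode workflow. Only `AppCValue` and `-corpusDir` appear in this thread; the jar name, the `AppTFIDF` class for the second run, and the exact positional arguments are assumptions for illustration, so the script only builds and prints the commands rather than executing them:

```shell
#!/bin/sh
JATE_JAR="jate-2.0-beta.1.jar"      # hypothetical jar name
SOLR_HOME="testdata/solr-testbed"   # Solr home seen in the log above
CORE="jateCore"

# Run 1: index the corpus AND extract/score candidates (has -corpusDir)
INDEX_CMD="java -cp $JATE_JAR uk.ac.shef.dcs.jate.app.AppCValue -corpusDir docs/english $SOLR_HOME $CORE"

# Run 2, e.g. with a different algorithm: omit -corpusDir, so JATE skips
# candidate extraction and only scores/ranks/exports over the existing core
SCORE_CMD="java -cp $JATE_JAR uk.ac.shef.dcs.jate.app.AppTFIDF $SOLR_HOME $CORE"

echo "$INDEX_CMD"
echo "$SCORE_CMD"
```

Check the JATE documentation for the real CLI signature before running either command.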

From the Solr exception you are reporting, I suspect that your Solr core index is not clean. Check whether a write.lock file is present there. This usually happens when the Solr/JATE process was not shut down cleanly or was killed; in that case, check that no JATE/Solr process is still running and simply remove all files in the data directory. Alternatively, provided your Solr index is not corrupted and you don't want to re-index the corpus, you can just remove the write lock manually.
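A minimal sketch of that cleanup, simulated in a temporary directory so it is safe to run; the real path would be the `.../jateCore/data/index/` directory shown in the log above:

```shell
#!/bin/sh
# Simulate a core's data directory with a stale write.lock, as left
# behind when a JATE/Solr process is killed mid-run
DATA_DIR="$(mktemp -d)/data"
mkdir -p "$DATA_DIR/index"
touch "$DATA_DIR/index/write.lock"

# Option 1: index intact and no Solr/JATE process still running ->
# just drop the stale lock and keep the indexed documents
rm -f "$DATA_DIR/index/write.lock"

# Option 2: index possibly corrupted -> wipe the data directory
# entirely and re-index the corpus from scratch (uncomment to use):
# rm -rf "$DATA_DIR"/*
```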

For concurrent processing/indexing of a large corpus, there are many ways to scale Solr up or out. At large scale, JATE's embedded mode is not a good choice; go for plugin mode and look into setting up a SolrCloud cluster, for instance.

Note that JATE is not intended simply as an app; we make it easy to run for demo purposes. It is designed and developed as a library that works with Apache Solr, and you can extend it with your own ATE algorithm on top of the Solr framework.
