Purging index between JATE calls #32

Closed
eltimster opened this issue Oct 24, 2016 · 1 comment
@eltimster
eltimster commented Oct 24, 2016

I am trying to run JATE on different corpora, but found that it incrementally adds to the Solr index when it indexes a new corpus, meaning I get terms not just from the corpus of interest but from the union of all corpora processed to that point. My workaround has been to purge files from the relevant data/index directory with rm, but this now causes an exception:

```
2016-10-25 09:24:04 INFO  AppCValue:328 - Indexing corpus from [docs/english] and perform candidate extraction ...
2016-10-25 09:24:05 INFO  AppCValue:331 -  [151996] files are scanned and will be indexed and analysed.
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:09 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:09 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:09 AEDT 2016 loading done
Tue Oct 25 09:24:09 AEDT 2016 loading done
2016-10-25 09:24:09 ERROR SolrCore:525 - [jateCore] Solr index directory '/home/tim/forum-style/jate-2.0-beta.1/testdata/solr-testbed/jateCore/data/index/' is locked.  Throwing exception.
2016-10-25 09:24:09 ERROR CoreContainer:740 - Error creating core [jateCore]: Index locked for write for core 'jateCore'. Solr now longer supports forceful unlocking via 'unlockOnStartup'. Please verify locks manually!
org.apache.solr.common.SolrException: Index locked for write for core 'jateCore'. Solr now longer supports forceful unlocking via 'unlockOnStartup'. Please verify locks manually!
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:820)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:659)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:727)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
2016-10-25 09:24:12 ERROR SolrCore:139 - org.apache.solr.common.SolrException: Exception writing document id 112188-q to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:250)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:179)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:174)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:139)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:153)
        at uk.ac.shef.dcs.jate.util.JATEUtil.addNewDoc(JATEUtil.java:339)
        at uk.ac.shef.dcs.jate.app.App.indexJATEDocuments(App.java:374)
        at uk.ac.shef.dcs.jate.app.App.lambda$index$4(App.java:340)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
        at uk.ac.shef.dcs.jate.app.App.index(App.java:338)
        at uk.ac.shef.dcs.jate.app.AppCValue.main(AppCValue.java:45)
Caused by: java.lang.NullPointerException
        at opennlp.tools.util.Cache.put(Cache.java:134)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:195)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:87)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:32)
        at opennlp.tools.ml.BeamSearch.bestSequences(BeamSearch.java:102)
        at opennlp.tools.ml.BeamSearch.bestSequences(BeamSearch.java:168)
        at opennlp.tools.ml.BeamSearch.bestSequence(BeamSearch.java:173)
        at opennlp.tools.postag.POSTaggerME.tag(POSTaggerME.java:194)
        at opennlp.tools.postag.POSTaggerME.tag(POSTaggerME.java:190)
        at uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP.tag(POSTaggerOpenNLP.java:23)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.assignPOS(OpenNLPPOSTaggerFilter.java:103)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.createTags(OpenNLPPOSTaggerFilter.java:97)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.incrementToken(OpenNLPPOSTaggerFilter.java:51)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.getNextToken(ComplexShingleFilter.java:335)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.shiftInputWindow(ComplexShingleFilter.java:412)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.incrementToken(ComplexShingleFilter.java:175)
        at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:45)
        at org.apache.lucene.analysis.jate.EnglishLemmatisationFilter.incrementToken(EnglishLemmatisationFilter.java:30)
        at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:613)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
        ... 23 more

2016-10-25 09:24:12 ERROR TransactionLog:567 - Error: Forcing close of tlog{file=/home/tim/forum-style/jate-2.0-beta.1/testdata/solr-testbed/ACLRDTEC/data/tlog/tlog.0000000000000004167 refcount=2}
```

Is there a clean way to do what I want to do?

Also, by way of note, the lack of support for concurrent processes (also caused by Solr allowing only one JATE indexer at a time) is a real bottleneck ...

@jerrygaoLondon
jerrygaoLondon (Collaborator) commented Nov 15, 2016

Thanks for reporting this issue. From my perspective, these are not bugs. Please post in our Google group for further discussion; I put my short answer below.

To analyse different corpora, you can create separate Solr core directories with corpus-specific settings; you don't need to purge the Solr index every time. If you want to try a different ATE algorithm, you do NOT need to run candidate extraction again. The corpus directory is optional in both embedded mode and plugin mode. In embedded mode, if "-corpusDir" is not provided, JATE skips the term candidate extraction step and directly runs term scoring, ranking and exporting over the provided Solr core directory. In plugin mode, there is an 'extraction' option in 'solrconfig.xml'.
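A minimal sketch of this two-phase embedded-mode workflow. Only `AppCValue` and `-corpusDir` appear in this thread; the jar name, the `AppTFIDF` class for the second run, and the exact positional arguments are assumptions for illustration, so the script only builds and prints the commands rather than executing them:

```shell
#!/bin/sh
JATE_JAR="jate-2.0-beta.1.jar"      # hypothetical jar name
SOLR_HOME="testdata/solr-testbed"   # Solr home seen in the log above
CORE="jateCore"

# Run 1: index the corpus AND extract/score candidates (has -corpusDir)
INDEX_CMD="java -cp $JATE_JAR uk.ac.shef.dcs.jate.app.AppCValue -corpusDir docs/english $SOLR_HOME $CORE"

# Run 2, e.g. with a different algorithm: omit -corpusDir, so JATE skips
# candidate extraction and only scores/ranks/exports over the existing core
SCORE_CMD="java -cp $JATE_JAR uk.ac.shef.dcs.jate.app.AppTFIDF $SOLR_HOME $CORE"

echo "$INDEX_CMD"
echo "$SCORE_CMD"
```

Check the JATE documentation for the real CLI signature before running either command.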

From the Solr exception you are reporting, I suspect that your Solr core index is not clean. Check whether a write.lock file is present there. This usually happens when the Solr/JATE process was not shut down cleanly or was killed; in that case, check that no JATE/Solr process is still running and simply remove all files in the data directory. Alternatively, provided your Solr index is not corrupted and you don't want to re-index the corpus, you can just remove the write lock manually.
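A minimal sketch of that cleanup, simulated in a temporary directory so it is safe to run; the real path would be the `.../jateCore/data/index/` directory shown in the log above:

```shell
#!/bin/sh
# Simulate a core's data directory with a stale write.lock, as left
# behind when a JATE/Solr process is killed mid-run
DATA_DIR="$(mktemp -d)/data"
mkdir -p "$DATA_DIR/index"
touch "$DATA_DIR/index/write.lock"

# Option 1: index intact and no Solr/JATE process still running ->
# just drop the stale lock and keep the indexed documents
rm -f "$DATA_DIR/index/write.lock"

# Option 2: index possibly corrupted -> wipe the data directory
# entirely and re-index the corpus from scratch (uncomment to use):
# rm -rf "$DATA_DIR"/*
```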

For concurrent processing/indexing of a large corpus, there are many ways to scale Solr up or out. At large scale, JATE's embedded mode is not a good choice; go for plugin mode and look into setting up a SolrCloud cluster, for instance.

Note that JATE is not intended simply as an app; we make it easy to run for demo purposes. It is designed and developed as a library that works with Apache Solr, and you can extend it with your own ATE algorithm on top of the Solr framework.
