Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extracting remote file stream.uri causing TikaException #474

Closed
dragos-dumi opened this issue Jan 11, 2017 · 1 comment
Closed

Extracting remote file stream.uri causing TikaException #474

dragos-dumi opened this issue Jan 11, 2017 · 1 comment

Comments

@dragos-dumi
Copy link
Contributor

When trying to extract a file via update/extract, the built in Tika from Solr returns this error.

ERROR org.apache.solr.core.SolrCore  – org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: XML parse error
	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:368)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: XML parse error
	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
	... 32 more
Caused by: org.xml.sax.SAXParseException; Premature end of file.
	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
	... 36 more
dragos-dumi added a commit to dragos-dumi/solarium that referenced this issue Jan 11, 2017
@dragos-dumi
Copy link
Contributor Author

dragos-dumi commented Jan 11, 2017

When using stream.url, the request is not sending any header (as in the file method with form-data) but it's still using method POST method without any post data and that's what causing that Tika error. Since we are sending only parameters when using stream.url, we can use GET method (#475 )

if (preg_match('/^(http|https):\/\/(.+)/i', $file)) {
            $request->addParam('stream.url', $file);
        } elseif (is_readable($file)) {
            $request->setFileUpload($file);
            $request->addParam('resource.name', basename($query->getFile()));
            $request->addHeader('Content-Type: multipart/form-data');
        } else {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants