Introduce Apache Tika extension #2890

sberyozkin · 2019-06-19T13:03:21Z

No description provided.

sberyozkin · 2019-06-19T13:05:13Z

@gsmet FYI. I'll keep growing TikaParser going forward, it is a simple start but there is definitely room for further enhancements. Will demo the MBRs with the quick starts PR later on, thanks

sberyozkin · 2019-06-20T08:53:15Z

@gsmet FYI, for the moment I'm thinking TikaParser should not be an injectable interface. I'd like to iterate a few times to settle on a good enough and stable API worthy of the interface (and I'm not convinced yet that the interface is actually needed).
And yeah, I squashed a 2nd commit :-)

agiertli · 2019-06-21T18:47:11Z

extensions/tika/deployment/src/main/java/io/quarkus/tika/deployment/TikaProcessor.java

+    }
+
+    @BuildStep
+    public void producePdfBoxResources(BuildProducer<SubstrateResourceBuildItem> resource) throws Exception {


Not sure if it's required for the Tika extension, but when I was trying to make PDFBox work in Quarkus and called:

PDFTextStripper stripper = new PDFTextStripper();

I was receiving this error inside native image:

https://gist.github.com/agiertli/35170cfdd5a8b3fd7e86fc5d6e238e3a

I solved it by adding Identity-h to the native-image resources:

https://github.com/cvt-oss/fb-invoice-pdf-analyzer/blob/quarkus/pom.xml#L112

Hope this helps.

@agiertli thanks for the review and the hint. I don't recall right now but without those (or one of those) inclusions I was getting some native error as well.
It depends on whether the PDFTextStripper is used on the Tika path somewhere or not. We may need to take care of the Identity-HS resources as well.

Can you consider giving this PR a try and see if it works for your case ? We may get some valuable early review this way. Or please attach or share a link to some test PDF and I'll give it a try with this extension, may be faster :-)
Thanks

It's better to have the extension add these resource files instead of forcing the user to set them in the pom.xml, for info. That's one of the reasons extensions exist: to hide the GraalVM ugliness.

@emmanuelbernard I agree :-) but @agiertli has not used Tika but PDFBox directly so I'm not sure that is needed for the Tika code path yet, the PDF extraction test works fine though it is a primitive PDF file

Update: Identity-HS is here. It is hard right now to get all the resources which will actually be needed so I propose to add them whenever the exception is raised at the Tika path.

I confirm that parsing the same PDF using the tika-extension API succeeds:

TikaParser parser = new TikaParser();
String invoiceText = parser.getContent(is).getText();

This extension also simplifies the project configuration a lot! Hacks like this are no longer needed:
https://github.com/gunnarmorling/quarkus-pdf-extract/blob/master/pom.xml#L113

The only difference in behavior is that the above code produces a string with extra \n lines (while using the original PDFBox API does not do this).

Hi @agiertli thanks, Identity-HS and friends may still show up on some Tika code paths for some complex PDF files :-), but we will take care of those resources if needed for sure

For info, in Hibernate ORM we had a long list of files to add and we used the following script, which we run once a new version of Hibernate is integrated, and adapt the processor accordingly
https://github.com/quarkusio/quarkus/blob/master/extensions/hibernate-orm/deployment/src/test/java/io/quarkus/hibernate/orm/HqlNodeScannerTestCase.java

Might be an inspiration

Hi Emmanuel, yes, something like that would be needed to smartly add all the resources that may be needed (as far as I understand the idea, based on the link, would be to iterate over all, in this case, non-class resources and add them), I was contemplating how one would do it for all the resources, there could indeed be a large number of resources scattered around PDFBox, FontBox and other libraries :-)

In practice, we generated that list and copied it into our extension code, so that we don't have to iterate over this list every single build. The burden is remembering to rerun that script when the extension updates to the next version of the underlying library. It happens relatively rarely though
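The "generate the list once, then copy it into the extension" approach described above could look roughly like the following. This is only a sketch using JDK classes: it follows the reading that the files of interest are the non-class entries of a library jar, and the class name `ResourceListSketch` and its methods are made up for illustration, not taken from the Hibernate ORM script or from this PR.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper: prints the non-class entries of a jar so the list
// can be pasted into an extension processor by hand.
final class ResourceListSketch {

    /** Keeps only entries that are neither directories nor .class files, sorted by name. */
    static List<String> nonClassResources(List<String> entryNames) {
        List<String> result = new ArrayList<>();
        for (String name : entryNames) {
            if (!name.endsWith("/") && !name.endsWith(".class")) {
                result.add(name);
            }
        }
        Collections.sort(result);
        return result;
    }

    /** Reads all entry names from the jar at the given path. */
    static List<String> entryNames(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a library jar, e.g. one of the PDF-related jars
        for (String resource : nonClassResources(entryNames(args[0]))) {
            System.out.println(resource);
        }
    }
}
```

Such a list will usually over-approximate what the native image actually needs, which is the trade-off against adding resources only when an exception shows up on the Tika path.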

sberyozkin · 2019-06-24T19:43:44Z

@gsmet Hi, so how does this PR look to you (and the team)? IMHO it can be useful. It will require more work to make it really useful, but until it lands as an extension that won't happen as fast as I'd like, as I can't predict how users will want to use it beyond the straightforward text and metadata extraction :-), even though I have a plan for how to grow it in the short term.

gsmet · 2019-06-25T05:20:06Z

@sberyozkin I agree it's useful, I just didn't have the time to review it yet (preparing a talk).

Not sure yet if it will make 0.18.0 as I plan to release it tomorrow (we want the GraalVM 19 support out) but it will be in 0.19.0 for sure.

sberyozkin · 2019-06-25T11:01:05Z

@gsmet thanks. I agree there is no rush around having this feature in, will be happy to see it in whenever it makes it (might tweak a few bits in the meantime)

sberyozkin · 2019-06-25T12:10:33Z

Recording here future enhancement ideas after the conversation with @agiertli:
1) Optionally strip off the empty lines - will require replacing ToTextContentHandler with a custom one
2) Support the user provided content handlers
3) Add a TikaParser.getText() shortcut as an alternative to TikaParser.getContent().getText() which is simpler when no metadata is needed. (done)
4) Support the Tika parsers and ParserContexts alternative to the AutoDetectParser which can be more optimized to deal with a given file
5) Add a few handy utilities like checking if the given PDF (or other format) file has been created on this date or by that author, etc

(Will open the issues after the extension makes it to the master)
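Until the first idea above is implemented inside the extension, the extra empty lines can be dropped on the caller side by post-processing the extracted text. The sketch below is plain string handling, not the custom content handler the idea mentions, and the class name `EmptyLineStripperSketch` is made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical caller-side helper, not part of the extension API.
final class EmptyLineStripperSketch {

    /** Returns the text with empty and whitespace-only lines removed. */
    static String stripEmptyLines(String text) {
        return Arrays.stream(text.split("\\R"))           // split on any line terminator
                .filter(line -> !line.trim().isEmpty())    // keep lines with visible content
                .collect(Collectors.joining("\n"));        // rejoin with a single newline
    }

    public static void main(String[] args) {
        System.out.println(stripEmptyLines("Invoice 42\n\n\nTotal: 10 EUR\n  \nThanks"));
    }
}
```

Doing this in a content handler instead would avoid building the larger intermediate string, which is presumably why the idea above points at replacing ToTextContentHandler.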

gsmet

Hi @sberyozkin ,

Finally reviewing this, sorry for the delay.

I added a couple of comments inline, let's try to get this in for the next version!

gsmet · 2019-07-01T13:59:09Z

bom/pom.xml

+                 </exclusion>
+               </exclusions>
+               <version>${tika.version}</version>
+            </dependency>


I think this is not properly indented: it should use 4 spaces and AFAICS it uses a mix of 3 and 2 spaces.

gsmet · 2019-07-01T13:59:21Z

devtools/common/src/main/filtered/extensions.json

+  {
+    "name": "Apache Tika",
+    "labels": [
+      "apache",


let's remove this one.

gsmet · 2019-07-01T13:59:38Z

devtools/common/src/main/filtered/extensions.json

+    "labels": [
+      "apache",
+      "tika",
+      "data",


not sure this one has value, it's too generic

gsmet · 2019-07-01T14:00:11Z

extensions/tika/deployment/pom.xml

+    <modelVersion>4.0.0</modelVersion>
+
+    <artifactId>quarkus-tika-deployment</artifactId>
+    <name>Quarkus - Tika - Deployment</name>


Here, we should use the full name Apache Tika. Same for the other artifacts.

gsmet · 2019-07-01T14:03:11Z

integration-tests/tika/pom.xml

+
+    <artifactId>quarkus-integration-test-tika</artifactId>
+    <name>Quarkus - Integration Tests - Tika</name>
+    <description>The Tika integration tests module</description>


Same here, let's use the full name in name and description.

gsmet · 2019-07-01T14:04:50Z

extensions/tika/runtime/src/main/java/io/quarkus/tika/Content.java

+        this.metadata = metadata;
+    }
+
+    public String getText() {


The getter should be consistent.

@gsmet can you clarify it please ?

The getter is called getText() whereas the field is called content.

gsmet · 2019-07-01T14:08:38Z

extensions/tika/runtime/src/main/java/io/quarkus/tika/Content.java

@@ -0,0 +1,20 @@
+package io.quarkus.tika;
+
+public class Content {


I would name this one TikaResult.

I'm really not too keen to change Content to TikaResult: I'd then have parser.getResult(InputStream) instead of parser.getContent(InputStream), and the latter reflects better what the user is after IMHO... It looks natural to me to ask the parser to return a composite Content from a provided input stream. Update: as agreed, it is not a concern any longer, with getContent to be changed to parse

gsmet · 2019-07-01T14:09:22Z

extensions/tika/runtime/src/main/java/io/quarkus/tika/Metadata.java

+import java.util.Map;
+import java.util.Set;
+
+public class Metadata {


I would get rid of that one and expose TikaMetadata. I don't see any value in having an additional abstraction layer.

Originally I started with this one I think, but the Tika Metadata interface is a bit verbose and somewhat complex (arrays, etc), and it is publicly modifiable, so my plan was to expose it via a simple API sufficient to get the String metadata properties.
Does it sound reasonable ?

I still think that this one should be named TikaMetadata to be consistent with TikaContent.

As for being reasonable, yes. I'm not a big fan of instantiating new objects in a getter but maybe it makes sense to not do it for all the extracted metadata, considering you would probably only use part of it.

gsmet · 2019-07-01T14:11:58Z

extensions/tika/runtime/src/main/java/io/quarkus/tika/TikaParser.java

+                return new Content(tikaHandler == null ? null : tikaHandler.toString().trim(), convert(tikaMetadata));
+            }
+        } catch (Exception e) {
+            LOG.warnf("%s stream can not be parsed", contentType);


That's not how it should be done:

you shouldn't have a warning

you should have a proper message for your exception

not sure if the contentType information is useful but in any case it might be an issue: if it's null, you will end up with "null stream can not be parsed", which is misleading. "Unable to parse stream for content-type ..." might be better, with the last part conditioned on the contentType variable not being null.

content-type will never be null at the MBR level, it is guaranteed by the JAX-RS spec, but your proposed message reads better, happy to update to it

You're passing null as a contentType just a few lines above.

Sorry, I forgot it was now not hidden any longer inside MBR :-)
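The message handling suggested in this thread can be sketched as follows. The helper and the exception type are made-up names for illustration only; the actual TikaParser code in the PR is not shown here and may differ.

```java
// Hypothetical sketch: build the message conditionally and throw instead of logging a warning.
final class ParseMessageSketch {

    /** Mentions the content type only when one is known. */
    static String unableToParseMessage(String contentType) {
        return contentType == null
                ? "Unable to parse stream"
                : "Unable to parse stream for content-type " + contentType;
    }

    /** Hypothetical exception type carrying the message and the original cause. */
    static final class SketchParseException extends RuntimeException {
        SketchParseException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Wraps a low-level failure in an exception with a readable message. */
    static SketchParseException wrap(String contentType, Exception cause) {
        return new SketchParseException(unableToParseMessage(contentType), cause);
    }
}
```

With a null content type this yields "Unable to parse stream" rather than the misleading "null stream can not be parsed".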

gsmet · 2019-07-01T14:13:18Z

extensions/tika/runtime/src/main/java/io/quarkus/tika/TikaParser.java

+import org.xml.sax.ContentHandler;
+import org.xml.sax.helpers.DefaultHandler;
+
+public class TikaParser {


I think this should be an @ApplicationScoped bean that we can inject.

The Tika parser is supposed to be thread safe so it would be highly beneficial if it was instantiated only once.

sberyozkin · 2019-07-02T12:06:17Z

Hi @gsmet thanks for the review, let me work on it...

gsmet · 2019-07-02T12:11:18Z

@sberyozkin maybe we could rename the method to `parse`? I think that would be a better choice.

sberyozkin · 2019-07-02T12:17:45Z

@gsmet Changing it to parse is good so then Content would become less important to me :-) (sorry for moving the comment around)

sberyozkin · 2019-07-02T16:04:22Z

Hi @gsmet
I've updated the PR,

The only thing I've held back on is removing the Metadata wrapper, for the reasons described above. Let me know please if you are OK with it.

(IMHO it would be a bit cleaner to go with the wrapper: the metadata is supposed to be immutable after it has been extracted, and the wrapper also avoids exposing Tika packages. Without it, it would be more difficult to add something useful re the metadata checks which may make sense for Quarkus but not necessarily at the lower level for Tika, and the Quarkus and Tika package mix at the extension API level does not seem ideal to me.)

Also added a TikaParser.getText() shortcut as planned earlier (to support the cases when only text is needed)

So maybe it is good to go in :-) ?

thanks

sberyozkin · 2019-07-05T11:08:35Z

@gsmet I'm having recurring doubts about the Content to TikaResult renaming.
This reads OK:

TikaResult result = parser.parse(inputStream);

but if I demo it done at the MessageBodyReader level (which was agreed to move out of this PR) then this is not so cool as TikaResult is somewhat out of context:

@POST
@Consumes("application/pdf")
public Response post(TikaResult result) {
}

I wonder if TikaContent could become a good compromise between the original Content and the current TikaResult:

TikaContent tikaContent = parser.parse(inputStream);
processTextAndMetadata(tikaContent.getText(), tikaContent.getMetadata());

and

@POST
@Consumes("application/pdf")
public Response post(TikaContent tikaContent) {
    processTextAndMetadata(tikaContent.getText(), tikaContent.getMetadata());
}

If TikaResult stays it won't really be a problem, I'm probably exaggerating a bit :-), but I'd like to check what you think

gsmet · 2019-07-06T14:18:59Z

@sberyozkin I agree with your proposal. Can you make the changes?

gsmet · 2019-07-06T14:20:00Z

integration-tests/tika/pom.xml

+  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  ~ See the License for the specific language governing permissions and
+  ~ limitations under the License.
+  -->


Can you remove the copyright headers? We decided to get rid of all of them (in Java and XML files altogether).

sberyozkin · 2019-07-08T09:25:35Z

@gsmet sorry, missed your comments, let me take care of them...

sberyozkin · 2019-07-08T10:08:59Z

@gsmet done

sberyozkin · 2019-07-08T13:09:39Z

@gsmet I realized I forgot to give TikaProcessor a FeatureBuildItem. That should really be it :-). I have a few other minor enhancements in mind but I'd rather follow up with them once this PR is merged

gsmet

@sberyozkin I added some hopefully final comments :).

Would you mind taking a look? Once it's done, I think we can merge this first iteration.

The next step will probably be to write a quickstart and a guide.

BTW, would you mind squashing everything in one atomic commit? Thanks!

gsmet · 2019-07-08T14:19:16Z

integration-tests/tika/src/main/java/io/quarkus/it/tika/GreetingResource.java

+import io.quarkus.tika.TikaParser;
+
+@Path("/parse")
+public class GreetingResource {


Could we rename this class to something that is more relevant?

gsmet · 2019-07-08T14:21:39Z

integration-tests/tika/src/test/java/io/quarkus/it/tika/GreetingResourceTest.java

+import io.quarkus.test.junit.QuarkusTest;
+
+@QuarkusTest
+public class GreetingResourceTest {


Same here, the name is not consistent with what it tests.

gsmet · 2019-07-08T14:21:49Z

integration-tests/tika/src/test/java/io/quarkus/it/tika/NativeGreetingResourceIT.java

+import io.quarkus.test.junit.SubstrateTest;
+
+@SubstrateTest
+public class NativeGreetingResourceIT extends GreetingResourceTest {


Same here about the name.

gsmet · 2019-07-08T14:23:39Z

extensions/tika/runtime/src/main/java/io/quarkus/tika/Metadata.java

+import java.util.Map;
+import java.util.Set;
+
+public class Metadata {


I still think that this one should be named TikaMetadata to be consistent with TikaContent.

As for being reasonable, yes. I'm not a big fan of instantiating new objects in a getter but maybe it makes sense to not do it for all the extracted metadata, considering you would probably only use part of it.

sberyozkin · 2019-07-08T14:57:44Z

@gsmet sure, renaming the Metadata is fine, as all other classes there are starting with Tika now, but what did you mean about the instantiation part? Update: I think you probably meant to avoid the unmodifiableList wrapper when a single property value is needed

gsmet · 2019-07-08T15:15:49Z

@sberyozkin yes, basically, every time you call getNames(), for instance, you will create a new wrapper.

For the keys, I think I would maybe create the list in the constructor. For the values, I'm not totally sure it's a good idea as we would create potentially a lot of wrappers, just to extract a property.

So for now, I would say let's leave it at that and if someone complains with a proper use case, we can revisit it.
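The trade-off discussed here can be illustrated with a small stand-in class. This is a sketch under assumptions, not the PR's Metadata class: the name `MetadataSketch` and the `Map<String, List<String>>` storage are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical read-only metadata holder.
final class MetadataSketch {
    private final Map<String, List<String>> values;
    private final Set<String> names;

    MetadataSketch(Map<String, List<String>> values) {
        this.values = new HashMap<>(values);
        // The read-only view of the keys is created once, not on every getNames() call.
        this.names = Collections.unmodifiableSet(this.values.keySet());
    }

    Set<String> getNames() {
        return names;
    }

    /** Allocates a new read-only wrapper per call; acceptable if rarely used. */
    List<String> getValues(String name) {
        List<String> list = values.get(name);
        return list == null ? null : Collections.unmodifiableList(list);
    }

    /** No wrapper needed: a String is already immutable. */
    String getSingleValue(String name) {
        List<String> list = values.get(name);
        return list == null || list.isEmpty() ? null : list.get(0);
    }
}
```

Wrapping every value list eagerly in the constructor would avoid the per-call allocation in getValues, at the cost of creating wrappers for properties that are never read, which is the concern raised above.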

sberyozkin · 2019-07-08T15:23:10Z

@gsmet OK, I've also removed the wrapper on the getSingleValue path. Now the most difficult part, the squashing :-), which I should've done straight away but thought I'd get the latest changes out asap

gsmet · 2019-07-08T15:53:11Z

Following our discussion on Zulip, I rebased and force-pushed.

Let's wait for CI. Thanks for your patience and your efforts!

sberyozkin · 2019-07-08T19:52:43Z

@gsmet thanks for helping to make this extension a noteworthy feature :-). CI is green now and it is good to go in. I'll need to follow up with some interesting enough quick start example, which in itself will likely drive more feature enhancements

sberyozkin requested a review from gsmet June 19, 2019 13:03

sberyozkin added the kind/new-feature label Jun 19, 2019

agiertli reviewed Jun 21, 2019

gsmet requested changes Jul 1, 2019

gsmet added this to the 0.19.0 milestone Jul 6, 2019

gsmet reviewed Jul 6, 2019

gsmet reviewed Jul 8, 2019

Introduce Apache Tika extension

cd3a0df

gsmet added the release/noteworthy-feature and triage/waiting-for-ci labels Jul 8, 2019

gsmet approved these changes Jul 8, 2019

sberyozkin merged commit 56a6875 into quarkusio:master Jul 9, 2019

sberyozkin deleted the apache_tika_extension branch July 9, 2019 09:33
