Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Apache Tika extension #2890

Merged
merged 1 commit into from
Jul 9, 2019
Merged

Introduce Apache Tika extension #2890

merged 1 commit into from
Jul 9, 2019

Conversation

sberyozkin
Copy link
Member

No description provided.

@sberyozkin sberyozkin requested a review from gsmet June 19, 2019 13:03
@sberyozkin
Copy link
Member Author

@gsmet FYI. I'll keep growing TikaParser going forward, it is a simple start but there is definitely a space for further enhancements. Will demo the MBRs with the quick starts PR later on, thanks

@sberyozkin
Copy link
Member Author

sberyozkin commented Jun 20, 2019

@gsmet FYI, for the moment I'm thinking TikaParser should not be an injectable interface, I'd like to iterate few times to settle on the good enough and stable API worthy of the interface (and I'm not convinced yet the interface is actually needed).
And yeah, I squashed a 2nd commit :-)

}

@BuildStep
public void producePdfBoxResources(BuildProducer<SubstrateResourceBuildItem> resource) throws Exception {
Copy link

@agiertli agiertli Jun 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it's required for Tika extension, but when I was trying to make PDFBox work in quarkus, when I was calling:

        PDFTextStripper stripper = new PDFTextStripper();

I was receiving this error inside native image:

https://gist.github.com/agiertli/35170cfdd5a8b3fd7e86fc5d6e238e3a

I solved it by adding Identity-h to the native-image resources:

https://github.com/cvt-oss/fb-invoice-pdf-analyzer/blob/quarkus/pom.xml#L112

Hope this helps.

Copy link
Member Author

@sberyozkin sberyozkin Jun 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agiertli thanks for the review and the hint. I don't recall right now but without those (or one of those) inclusions I was getting some native error as well.
It depends on whether the PDFTextStripper is used on the Tika path somewhere or not. We may need to take care of the Identity-HS resources as well.

Can you consider giving this PR a try and see if it works for your case ? We may get some valuable early review this way. Or please attach or share a link to some test PDF and I'll give it a try with this extension, may be faster :-)
Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It;s better to have the extension add these resource files instead of forcing the user to set them in the pom.ml for info. That's one the reason they exist, hide the GraalVM ugliness.

Copy link
Member Author

@sberyozkin sberyozkin Jun 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emmanuelbernard I agree :-) but @agiertli has not used Tika but PDFBox directly so I'm not sure that is needed for the Tika code path yet, the PDF extraction test works fine though it is a primitive PDF file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: Identity-HS is here. It is hard right now to get all the resources which will actually be needed so I propose to add them whenever the exception is raised at the Tika path.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirm that parsing the same PDF using tika-extension API succeed:

      TikaParser parser = new TikaParser();
      String invoiceText = parser.getContent(is).getText();

This extension also simplifies the project configuration a lot! No more hacks like this are needed anymore:
https://github.com/gunnarmorling/quarkus-pdf-extract/blob/master/pom.xml#L113

The only difference in behavior is that the above code produces string with extra \n lines (while using original PDFBox API does not do this)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @agiertli thanks, Identity-HS and friends may still show up on some Tika code paths for some complex PDF files :-), but we will take care of those resources if needed for sure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For info in Hobernate ORM we had a long list of files ot add and we used the following script that we run one a new version of Hibernate is integrated and adapt the processor accordingly
https://github.com/quarkusio/quarkus/blob/master/extensions/hibernate-orm/deployment/src/test/java/io/quarkus/hibernate/orm/HqlNodeScannerTestCase.java

Might be an inspiration

Copy link
Member Author

@sberyozkin sberyozkin Jun 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Emmanuel, yes, something like that would be needed to smartly add all the resources that may be needed (as far as I understand the idea, based on the link, would be to iterate over all, in this case, non-class resources and add them), I was contemplating how one would do it for all the resources, there could indeed be a large number of resources scattered around PDFBox, FontBox and other libraries :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, we generated that list and copied it in our extension code. So that we don't have to iterate over this list every single build. But that's a burden when the extension updates to the next version of the underlying library to make sure to rerun that script. It happens relatively rarely though

@sberyozkin
Copy link
Member Author

@gsmet Hi, so how does this PR look to you (and the team) ? IMHO it can be useful, will require more work to make it really useful, but until it makes it as an extension it won't be happening as fast as I'd like to as I can't predict how users will want to use it beyond the straightforward text and metadata extraction :-), even though I have a plan how to grow it in the short term.

@gsmet
Copy link
Member

gsmet commented Jun 25, 2019

@sberyozkin I agree it's useful, I just didn't have the time to review it yet (preparing a talk).

Not sure yet if it will make 0.18.0 as I plan to release it tomorrow (we want the GraalVM 19 support out) but it will be in 0.19.0 for sure.

@sberyozkin
Copy link
Member Author

@gsmet thanks. I agree there is no rush around having this feature in, will be happy to see it in whenever it makes it (might tweak few bits in meantime)

@sberyozkin
Copy link
Member Author

sberyozkin commented Jun 25, 2019

Recording here future enhancement ideas after the conversation with @agiertli : 1) Optionally strip off the empty lines - will require replacing ToTextContentHandler with a custom one 2) Support the user provided content handlers 3) Add a TikaParser.getText() shortcut as an alternative to TikaParser.getContent().getText() which is simpler when no metadata is needed. (done) 4) Support the Tika parsers and ParserContexts alternative to the AutoDetectParser which can be more optimized to deal with a given file 5) Add few handy utilities like check if the given PDF(or other format) file has been created on this date or by that author, etc

(Will open the issues after the extension makes it to the master)

Copy link
Member

@gsmet gsmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sberyozkin ,

Finally reviewing this, sorry for the delay.

I added a couple of comments inline, let's try to get this in for the next version!

</exclusion>
</exclusions>
<version>${tika.version}</version>
</dependency>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not properly indented: it should use 4 spaces and AFAICS it uses a mix of 3 and 2 spaces.

{
"name": "Apache Tika",
"labels": [
"apache",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove this one.

"labels": [
"apache",
"tika",
"data",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure this one has value, it's too generic

<modelVersion>4.0.0</modelVersion>

<artifactId>quarkus-tika-deployment</artifactId>
<name>Quarkus - Tika - Deployment</name>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we should use the full name Apache Tika. Same for the other artifacts.


<artifactId>quarkus-integration-test-tika</artifactId>
<name>Quarkus - Integration Tests - Tika</name>
<description>The Tika integration tests module</description>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, let's use the full name in name and description.

this.metadata = metadata;
}

public String getText() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The getter should be consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gsmet can you clarify it please ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The getter is called getText() whereas the field is called content.

@@ -0,0 +1,20 @@
package io.quarkus.tika;

public class Content {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would name this one TikaResult.

Copy link
Member Author

@sberyozkin sberyozkin Jul 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not too keen to change Content to TikaResult, so now I'd have parser.getResult(InputStream) instead of parser.getContent(InputStream) the latter reflects better what the user is after IMHO... It looks natural to me to ask the parser to return a composite Content from a provided input stream. Update as agreed it is not a concern any longer with getContent to be changed to parse

import java.util.Map;
import java.util.Set;

public class Metadata {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would get rid of that one and expose TikaMetadata. I don't see any value in having an additional abstraction layer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I started with this one I think, but Tika Metadata interface is a bit verbose and somewhat complex (arrays, etc), and it is publicly modifiable so my plan was expose it via a simple API sufficient to get the String metadata properties.
Does it sound reasonable ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think that this one should be named TikaMetadata to be consistent with TikaContent.

As for being reasonable, yes. I'm not a big fan of instantiating new objects in a getter but maybe it makes sense to not do it for all the extracted metadata, considering you would probably only use part of it.

return new Content(tikaHandler == null ? null : tikaHandler.toString().trim(), convert(tikaMetadata));
}
} catch (Exception e) {
LOG.warnf("%s stream can not be parsed", contentType);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not how it should be done:

  • you shouldn't have a warning
  • you should have a proper message for your exception
  • not sure if the contentType information is useful but in any case it might be an issue as if it's null, you will end up with null stream can not be parsed which is misleading. Unable to parse stream for content-type ... might be better with the last part conditioned to the contentType variable not being null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

content-type will never be null at the MBR level, it is guaranteed by the JAX-RS spec, but your proposed message reads better, happy to update to it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're passing null as a contentType just a few lines above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot it was now not hidden any longer inside MBR :-)

import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

public class TikaParser {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be an @ApplicationScoped bean that we can inject.

The Tika parser is supposed to be thread safe so it would be highly beneficial if it was instantiated only once.

@sberyozkin
Copy link
Member Author

sberyozkin commented Jul 2, 2019

Hi @gsmet thanks for the review, let me work on it...

@gsmet
Copy link
Member

gsmet commented Jul 2, 2019

@sberyozkin maybe we could rename the method to `parseˋ? I think that would be a better choice.

@sberyozkin
Copy link
Member Author

@gsmet Changing it to parse is good so then Content would become less important to me :-) (sorry for moving the comment around)

@sberyozkin
Copy link
Member Author

sberyozkin commented Jul 2, 2019

Hi @gsmet
I've updated the PR,

I've only stopped from removing the Metadata wrapper for the reasons described above, let me know please if you are OK with it.

(IMHO it would be a bit cleaner to go with the wrapper - as the metadata is supposed to be immutable after it has been extracted as it also avoids exposing Tika packages - it would be more difficult to add something useful re the metadata checks which may make sense for Quarkus but not necessarily at the lower level for Tika, and the Quarkus and Tika package mix at the extension api level does not seem ideal to me).

Also added a TikaParser.getText() shortcut as planned earlier (to support the cases when only text is needed)

So may be it is good to go in :-) ?

thanks

@sberyozkin
Copy link
Member Author

@gsmet I'm having recurring doubts about the Content to TikaResult renaming.
This reads OK:

TikaResult result = parser.parse(inputStream);

but if I demo it done at the MessageBodyReader level (which was agreed to move out of this PR) then this is not so cool as TikaResult is somewhat out of context:

@POST
@Consumes("application/pdf")
public Response post(TikaResult result) {
}

I wonder if TikaContent could become a good compromise between the original Content and the current TikaResult:

TikaContent tikaContent = parser.parse(inputStream);
processTextAndMetadata(tikaContent.getText(), tikaContent.getMetadata());

and

@POST
@Consumes("application/pdf")
public Response post(TikaContent tikaContent) {
    processTextAndMetadata(tikaContent.getText(), tikaContent.getMetadata());
}

If TikaResult stays then it won't be really a problem that I'm probably exaggerating a bit :-), but i'd like to check what you think

@gsmet gsmet added this to the 0.19.0 milestone Jul 6, 2019
@gsmet
Copy link
Member

gsmet commented Jul 6, 2019

@sberyozkin I agree with your proposal. Can you make the changes?

~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the copyright headers? We decided to get rid of all of them (in Java and XML files altogether).

@sberyozkin
Copy link
Member Author

@gsmet sorry, missed your comments, let me take of them...

@sberyozkin
Copy link
Member Author

@gsmet done

@sberyozkin
Copy link
Member Author

sberyozkin commented Jul 8, 2019

@gsmet I realized I forgot TikaProcessor to have a FeatureBuildItem. That should really be it :-). I have few other minor enhancements in mind but I'd rather follow up with them once this PR is merged

Copy link
Member

@gsmet gsmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sberyozkin I added some hopefully final comments :).

Would you mind taking a look? Once it's done, I think we can merge this first iteration.

The next step will probably be to write a quickstart and a guide.

BTW, would you mind squashing everything in one atomic commit? Thanks!

import io.quarkus.tika.TikaParser;

@Path("/parse")
public class GreetingResource {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename this class to something that is more relevant?

import io.quarkus.test.junit.QuarkusTest;

@QuarkusTest
public class GreetingResourceTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, the name is not consistent with what it tests.

import io.quarkus.test.junit.SubstrateTest;

@SubstrateTest
public class NativeGreetingResourceIT extends GreetingResourceTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here about the name.

import java.util.Map;
import java.util.Set;

public class Metadata {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think that this one should be named TikaMetadata to be consistent with TikaContent.

As for being reasonable, yes. I'm not a big fan of instantiating new objects in a getter but maybe it makes sense to not do it for all the extracted metadata, considering you would probably only use part of it.

@sberyozkin
Copy link
Member Author

sberyozkin commented Jul 8, 2019

@gsmet sure, renaming the Metadata is fine, as all other classes there are starting with Tika now, but what did you mean about the instantiation part ? Update I think you probably meant to avoid the unmodifiableList wrapper when a single property value is needed

@gsmet
Copy link
Member

gsmet commented Jul 8, 2019

@sberyozkin yes, basically, every time you call getNames(), for instance, you will create a new wrapper.

For the keys, I think I would maybe create the list in the constructor. For the values, I'm not totally sure it's a good idea as we would create potentially a lot of wrappers, just to extract a property.

So for now, I would say let's leave it at that and if someone complains with a proper use case, we can revisit it.

@sberyozkin
Copy link
Member Author

@gsmet OK, I've also removed the wrapper on the getSingleValue path. Now the most difficult part, the squashing :-), I should've done straight away but thought I'd get the latest changes out asap

@gsmet
Copy link
Member

gsmet commented Jul 8, 2019

Following our discussion on Zulip, I rebased and force-pushed.

Let's wait for CI. Thanks for your patience and your efforts!

@gsmet gsmet added release/noteworthy-feature triage/waiting-for-ci Ready to merge when CI successfully finishes labels Jul 8, 2019
@sberyozkin
Copy link
Member Author

@gsmet thanks for helping to make this extension a noteworthy feature :-). CI is green now and it is good to go in. I'll need to follow up with some interesting enough quick start example which in itself will likely drive more feature enhancements)

@sberyozkin sberyozkin merged commit 56a6875 into quarkusio:master Jul 9, 2019
@sberyozkin sberyozkin deleted the apache_tika_extension branch July 9, 2019 09:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants