-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Apache Tika extension #2890
Conversation
@gsmet FYI. I'll keep growing TikaParser going forward, it is a simple start but there is definitely a space for further enhancements. Will demo the MBRs with the quick starts PR later on, thanks |
@gsmet FYI, for the moment I'm thinking TikaParser should not be an injectable interface, I'd like to iterate few times to settle on the good enough and stable API worthy of the interface (and I'm not convinced yet the interface is actually needed). |
} | ||
|
||
@BuildStep | ||
public void producePdfBoxResources(BuildProducer<SubstrateResourceBuildItem> resource) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it's required for Tika extension, but when I was trying to make PDFBox work in quarkus, when I was calling:
PDFTextStripper stripper = new PDFTextStripper();
I was receiving this error inside native image:
https://gist.github.com/agiertli/35170cfdd5a8b3fd7e86fc5d6e238e3a
I solved it by adding Identity-h to the native-image resources:
https://github.com/cvt-oss/fb-invoice-pdf-analyzer/blob/quarkus/pom.xml#L112
Hope this helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agiertli thanks for the review and the hint. I don't recall right now but without those (or one of those) inclusions I was getting some native error as well.
It depends on whether the PDFTextStripper
is used on the Tika path somewhere or not. We may need to take care of the Identity-HS
resources as well.
Can you consider giving this PR a try and see if it works for your case ? We may get some valuable early review this way. Or please attach or share a link to some test PDF and I'll give it a try with this extension, may be faster :-)
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It;s better to have the extension add these resource files instead of forcing the user to set them in the pom.ml for info. That's one the reason they exist, hide the GraalVM ugliness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emmanuelbernard I agree :-) but @agiertli has not used Tika but PDFBox directly so I'm not sure that is needed for the Tika code path yet, the PDF extraction test works fine though it is a primitive PDF file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: Identity-HS
is here. It is hard right now to get all the resources which will actually be needed so I propose to add them whenever the exception is raised at the Tika path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirm that parsing the same PDF using tika-extension API succeed:
TikaParser parser = new TikaParser();
String invoiceText = parser.getContent(is).getText();
This extension also simplifies the project configuration a lot! No more hacks like this are needed anymore:
https://github.com/gunnarmorling/quarkus-pdf-extract/blob/master/pom.xml#L113
The only difference in behavior is that the above code produces string with extra \n lines (while using original PDFBox API does not do this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @agiertli thanks, Identity-HS
and friends may still show up on some Tika code paths for some complex PDF files :-), but we will take care of those resources if needed for sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For info in Hobernate ORM we had a long list of files ot add and we used the following script that we run one a new version of Hibernate is integrated and adapt the processor accordingly
https://github.com/quarkusio/quarkus/blob/master/extensions/hibernate-orm/deployment/src/test/java/io/quarkus/hibernate/orm/HqlNodeScannerTestCase.java
Might be an inspiration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Emmanuel, yes, something like that would be needed to smartly add all the resources that may be needed (as far as I understand the idea, based on the link, would be to iterate over all, in this case, non-class resources and add them), I was contemplating how one would do it for all the resources, there could indeed be a large number of resources scattered around PDFBox, FontBox and other libraries :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice, we generated that list and copied it in our extension code. So that we don't have to iterate over this list every single build. But that's a burden when the extension updates to the next version of the underlying library to make sure to rerun that script. It happens relatively rarely though
@gsmet Hi, so how does this PR look to you (and the team) ? IMHO it can be useful, will require more work to make it really useful, but until it makes it as an extension it won't be happening as fast as I'd like to as I can't predict how users will want to use it beyond the straightforward text and metadata extraction :-), even though I have a plan how to grow it in the short term. |
@sberyozkin I agree it's useful, I just didn't have the time to review it yet (preparing a talk). Not sure yet if it will make 0.18.0 as I plan to release it tomorrow (we want the GraalVM 19 support out) but it will be in 0.19.0 for sure. |
@gsmet thanks. I agree there is no rush around having this feature in, will be happy to see it in whenever it makes it (might tweak few bits in meantime) |
Recording here future enhancement ideas after the conversation with @agiertli : 1) Optionally strip off the empty lines - will require replacing (Will open the issues after the extension makes it to the master) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sberyozkin ,
Finally reviewing this, sorry for the delay.
I added a couple of comments inline, let's try to get this in for the next version!
</exclusion> | ||
</exclusions> | ||
<version>${tika.version}</version> | ||
</dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not properly indented: it should use 4 spaces and AFAICS it uses a mix of 3 and 2 spaces.
{ | ||
"name": "Apache Tika", | ||
"labels": [ | ||
"apache", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's remove this one.
"labels": [ | ||
"apache", | ||
"tika", | ||
"data", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure this one has value, it's too generic
extensions/tika/deployment/pom.xml
Outdated
<modelVersion>4.0.0</modelVersion> | ||
|
||
<artifactId>quarkus-tika-deployment</artifactId> | ||
<name>Quarkus - Tika - Deployment</name> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we should use the full name Apache Tika
. Same for the other artifacts.
integration-tests/tika/pom.xml
Outdated
|
||
<artifactId>quarkus-integration-test-tika</artifactId> | ||
<name>Quarkus - Integration Tests - Tika</name> | ||
<description>The Tika integration tests module</description> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, let's use the full name in name and description.
this.metadata = metadata; | ||
} | ||
|
||
public String getText() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The getter should be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsmet can you clarify it please ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The getter is called getText()
whereas the field is called content
.
@@ -0,0 +1,20 @@ | |||
package io.quarkus.tika; | |||
|
|||
public class Content { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would name this one TikaResult
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really not too keen to change Content
to TikaResult
, so now I'd have parser.getResult(InputStream)
instead of parser.getContent(InputStream)
the latter reflects better what the user is after IMHO... It looks natural to me to ask the parser to return a composite Content
from a provided input stream. Update as agreed it is not a concern any longer with getContent
to be changed to parse
import java.util.Map; | ||
import java.util.Set; | ||
|
||
public class Metadata { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would get rid of that one and expose TikaMetadata
. I don't see any value in having an additional abstraction layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally I started with this one I think, but Tika Metadata interface is a bit verbose and somewhat complex (arrays, etc), and it is publicly modifiable so my plan was expose it via a simple API sufficient to get the String metadata properties.
Does it sound reasonable ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think that this one should be named TikaMetadata
to be consistent with TikaContent
.
As for being reasonable, yes. I'm not a big fan of instantiating new objects in a getter but maybe it makes sense to not do it for all the extracted metadata, considering you would probably only use part of it.
return new Content(tikaHandler == null ? null : tikaHandler.toString().trim(), convert(tikaMetadata)); | ||
} | ||
} catch (Exception e) { | ||
LOG.warnf("%s stream can not be parsed", contentType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not how it should be done:
- you shouldn't have a warning
- you should have a proper message for your exception
- not sure if the
contentType
information is useful but in any case it might be an issue as if it's null, you will end up withnull stream can not be parsed
which is misleading.Unable to parse stream for content-type ...
might be better with the last part conditioned to thecontentType
variable not being null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
content-type will never be null at the MBR level, it is guaranteed by the JAX-RS spec, but your proposed message reads better, happy to update to it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're passing null
as a contentType
just a few lines above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I forgot it was now not hidden any longer inside MBR :-)
import org.xml.sax.ContentHandler; | ||
import org.xml.sax.helpers.DefaultHandler; | ||
|
||
public class TikaParser { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be an @ApplicationScoped
bean that we can inject.
The Tika parser is supposed to be thread safe so it would be highly beneficial if it was instantiated only once.
Hi @gsmet thanks for the review, let me work on it... |
@sberyozkin maybe we could rename the method to `parseˋ? I think that would be a better choice. |
@gsmet Changing it to |
Hi @gsmet I've only stopped from removing the Metadata wrapper for the reasons described above, let me know please if you are OK with it. (IMHO it would be a bit cleaner to go with the wrapper - as the metadata is supposed to be immutable after it has been extracted as it also avoids exposing Tika packages - it would be more difficult to add something useful re the metadata checks which may make sense for Quarkus but not necessarily at the lower level for Tika, and the Quarkus and Tika package mix at the extension api level does not seem ideal to me). Also added a TikaParser.getText() shortcut as planned earlier (to support the cases when only text is needed) So may be it is good to go in :-) ? thanks |
@gsmet I'm having recurring doubts about the
but if I demo it done at the
I wonder if
and
If |
@sberyozkin I agree with your proposal. Can you make the changes? |
integration-tests/tika/pom.xml
Outdated
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
~ See the License for the specific language governing permissions and | ||
~ limitations under the License. | ||
--> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove the copyright headers? We decided to get rid of all of them (in Java and XML files altogether).
@gsmet sorry, missed your comments, let me take of them... |
@gsmet done |
@gsmet I realized I forgot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sberyozkin I added some hopefully final comments :).
Would you mind taking a look? Once it's done, I think we can merge this first iteration.
The next step will probably be to write a quickstart and a guide.
BTW, would you mind squashing everything in one atomic commit? Thanks!
import io.quarkus.tika.TikaParser; | ||
|
||
@Path("/parse") | ||
public class GreetingResource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we rename this class to something that is more relevant?
import io.quarkus.test.junit.QuarkusTest; | ||
|
||
@QuarkusTest | ||
public class GreetingResourceTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, the name is not consistent with what it tests.
import io.quarkus.test.junit.SubstrateTest; | ||
|
||
@SubstrateTest | ||
public class NativeGreetingResourceIT extends GreetingResourceTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here about the name.
import java.util.Map; | ||
import java.util.Set; | ||
|
||
public class Metadata { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think that this one should be named TikaMetadata
to be consistent with TikaContent
.
As for being reasonable, yes. I'm not a big fan of instantiating new objects in a getter but maybe it makes sense to not do it for all the extracted metadata, considering you would probably only use part of it.
@gsmet sure, renaming the |
@sberyozkin yes, basically, every time you call For the keys, I think I would maybe create the list in the constructor. For the values, I'm not totally sure it's a good idea as we would create potentially a lot of wrappers, just to extract a property. So for now, I would say let's leave it at that and if someone complains with a proper use case, we can revisit it. |
@gsmet OK, I've also removed the wrapper on the |
Following our discussion on Zulip, I rebased and force-pushed. Let's wait for CI. Thanks for your patience and your efforts! |
@gsmet thanks for helping to make this extension a noteworthy feature :-). CI is green now and it is good to go in. I'll need to follow up with some interesting enough quick start example which in itself will likely drive more feature enhancements) |
No description provided.