feature: Document.text #72
Comments
Yeah, I'm thinking it probably makes sense to have a
This particular snippet misses text that's in tables, so there would need to be a little more to it, but I'm sure it would be modest in size.
One thing other folks have mentioned is also capturing text that's in headers, footers, footnotes, and endnotes. I'm supposing it's enough that some folks will want that, but wondering a little bit about whether it makes sense to keep those bits separate, perhaps returning a tuple like
Based on your needs, do you have a point of view on that?
Ah, I didn't realize this would miss tables, headers, and footers. If capturing those is feasible, that would be awesome. I've recently started a project to extract text from any document, and I think it would be helpful to be able to omit headers and footers but keep tables, for example. In my particular use case, it would actually be beneficial to have the tables correctly interwoven with the body text, so returning a tuple is less desirable. Maybe instead of a
class Document(object):
    def get_text(self, omit_tables=False, omit_footers=False, omit_headers=False):
        pass
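To make the `omit_*` idea concrete, here is a rough sketch written as a standalone function rather than a method. It assumes a document object that exposes `paragraphs`, `tables`, and `sections` (with `header` and `footer`) the way recent python-docx releases do; the stand-in objects and the output ordering are illustrative assumptions, not anything the library defines. Note that it appends table text after the body text instead of interweaving it, so it does not cover the interleaving use case described above.

```python
from types import SimpleNamespace


def get_text(document, omit_tables=False, omit_footers=False, omit_headers=False):
    """Collect text from a python-docx-style document into one string.

    Assumption: table text is appended after body text, not interwoven.
    """
    parts = []
    if not omit_headers:
        for section in document.sections:
            parts.extend(p.text for p in section.header.paragraphs)
    # Body paragraphs, in document order.
    parts.extend(p.text for p in document.paragraphs)
    if not omit_tables:
        for table in document.tables:
            for row in table.rows:
                parts.extend(cell.text for cell in row.cells)
    if not omit_footers:
        for section in document.sections:
            parts.extend(p.text for p in section.footer.paragraphs)
    return "\n".join(parts)


def _texts(*values):
    # Hypothetical helper: builds stand-ins that only carry a .text attribute.
    return [SimpleNamespace(text=value) for value in values]


# Stand-in document so the sketch runs without a .docx file on disk.
demo = SimpleNamespace(
    sections=[
        SimpleNamespace(
            header=SimpleNamespace(paragraphs=_texts("Header")),
            footer=SimpleNamespace(paragraphs=_texts("Footer")),
        )
    ],
    paragraphs=_texts("Body one", "Body two"),
    tables=[SimpleNamespace(rows=[SimpleNamespace(cells=_texts("A1", "B1"))])],
)

print(get_text(demo, omit_headers=True, omit_footers=True))
```

With python-docx installed, the same function should accept the object returned by `docx.Document("example.docx")` in place of `demo`.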
This is related to #40 and deanmalmgren/textract#92, too. Just adding this here as a note for myself and anyone else that might take a crack at this.
@mikemaccana's old project had a simple script for extracting text from a document. Took me a few minutes to figure it out, but this is really simple now:
Opening this issue with this little code snippet might be enough to document the methodology, but it would be nice to include it somewhere in the documentation or as a script that is installed with the package. I'm happy to contribute.
Do you have any preferences on a script vs documenting this two-liner? If just documenting is enough, any thoughts on where it should go?
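For anyone reading along, a minimal sketch of a paragraph-joining approach of the kind discussed here might look like the following. It is not the original snippet from this issue; the function name `paragraphs_text` and the blank-line separator are assumptions made for illustration, and it relies only on the `paragraphs` collection and each paragraph's `text`. As noted in the comments, this kind of approach skips text inside tables.

```python
from types import SimpleNamespace


def paragraphs_text(document, separator="\n\n"):
    """Join the text of every body paragraph, in document order."""
    return separator.join(paragraph.text for paragraph in document.paragraphs)


# Stand-in object so the sketch runs without a .docx file on disk.
# With python-docx installed, pass docx.Document("example.docx") instead.
fake_document = SimpleNamespace(
    paragraphs=[
        SimpleNamespace(text="First paragraph."),
        SimpleNamespace(text="Second paragraph."),
    ]
)

print(paragraphs_text(fake_document))
```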