feature: Document.text #72

Open
deanmalmgren opened this issue Jul 1, 2014 · 3 comments

@deanmalmgren

@mikemaccana's old project had a simple script for extracting text from a document. Took me a few minutes to figure it out, but this is really simple now:

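import docx

def extract_text(filename):  # helper name is illustrative
    # Join the text of every body paragraph, separated by blank lines.
    document = docx.Document(filename)
    return '\n\n'.join(
        paragraph.text for paragraph in document.paragraphs
    )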

Opening this issue with this little code snippet might already serve the purpose of documenting the methodology, but it might be nice to include it somewhere in the documentation or as a script that is installed with the package. I'm happy to contribute.

Do you have any preferences on a script vs documenting this two-liner? If just documenting is enough, any thoughts on where it should go?

@scanny
Contributor

scanny commented Jul 8, 2014

Yeah, I'm thinking it probably makes sense to have a Document.text property or something like that that produces a list of strings roughly like this. The question comes up from time to time for indexing purposes and so forth.

This particular snippet misses text that's in tables, so there would need to be a little more to it, but I'm sure it would be modest in size.
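A minimal sketch of what "a little more" could look like, assuming the `tables`, `rows`, `cells`, and `paragraphs` attributes that python-docx exposes on documents, tables, rows, and cells. The function name `body_and_table_text` is hypothetical. Note that this appends table text after the body paragraphs rather than keeping document order, and merged cells may be visited more than once.

```python
def body_and_table_text(document):
    """Return a list of strings: body paragraph text, then table cell text.

    `document` is expected to behave like a python-docx Document, e.g. the
    result of docx.Document(filename). The function name is hypothetical.
    """
    # Text of the top-level body paragraphs.
    chunks = [paragraph.text for paragraph in document.paragraphs]
    # Text inside tables lives in the paragraphs of each cell.
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                chunks.extend(paragraph.text for paragraph in cell.paragraphs)
    return chunks
```

With python-docx installed, it could be called as `body_and_table_text(docx.Document(filename))`.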

One thing other folks have mentioned is also capturing text that's in headers, footers, footnotes, and endnotes. I'm supposing it's enough that some folks will want that, but wondering a little bit about whether it makes sense to keep those bits separate, perhaps returning a tuple like (document_text, hdr_ftr_text, end_and_foot_note_text) so folks could pick and choose without having to go to several different objects to collect it all.

Based on your needs, do you have a point of view on that?
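To make the tuple idea concrete, here is a rough sketch of the return shape only. `DocumentText` and `combine` are hypothetical names, not part of python-docx, and the sketch says nothing about how the header, footer, or note text would actually be extracted.

```python
from collections import namedtuple

# Hypothetical container mirroring the tuple proposed above;
# the field names are illustrative, not python-docx API.
DocumentText = namedtuple(
    "DocumentText",
    ["document_text", "hdr_ftr_text", "end_and_foot_note_text"],
)


def combine(parts, include_hdr_ftr=False, include_notes=False):
    """Flatten the selected groups of a DocumentText into one list of strings."""
    selected = list(parts.document_text)
    if include_hdr_ftr:
        selected.extend(parts.hdr_ftr_text)
    if include_notes:
        selected.extend(parts.end_and_foot_note_text)
    return selected
```

A caller who only wants body text can unpack the first element and ignore the rest, while an indexing tool could pass both flags to `combine` to get everything.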

@scanny changed the title from "clarify how to extract text from document" to "feature: Document.text" on Jul 8, 2014
@deanmalmgren
Author

Ah, I didn't realize this would miss tables, headers, and footers. If it's feasible to capture those, that would be awesome. I've recently started a project to extract text from any document and I think it would be helpful to be able to omit headers and footers but keep tables, for example. In my particular use case, it would actually be beneficial to have the tables correctly interwoven with the body text, so returning a tuple is less desirable.

Maybe instead of a Document.text property it could be a method that has a signature with optional kwargs that make it easy to select different parts of the text:

class Document(object):
    def get_text(self, omit_tables=False, omit_footers=False, omit_headers=False):
        pass
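As a sketch of how those kwargs might behave, the function below filters a sequence of text blocks that is already in document order, which keeps tables interwoven with body text. The `blocks` argument and its `(kind, text)` pairs are assumptions made for illustration; producing that sequence from a real .docx file is the part python-docx would need to supply.

```python
def get_text(blocks, omit_tables=False, omit_footers=False, omit_headers=False):
    """Join text blocks in document order, skipping the omitted kinds.

    `blocks` is a hypothetical iterable of (kind, text) pairs, where kind is
    one of "header", "paragraph", "table", or "footer".
    """
    skipped = set()
    if omit_tables:
        skipped.add("table")
    if omit_footers:
        skipped.add("footer")
    if omit_headers:
        skipped.add("header")
    # Order is preserved, so table text stays between its neighboring paragraphs.
    return "\n\n".join(text for kind, text in blocks if kind not in skipped)
```

For the use case described above, `get_text(blocks, omit_headers=True, omit_footers=True)` would keep paragraphs and tables in their original order.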

@deanmalmgren
Author

This is related to #40 and deanmalmgren/textract#92, too. Just adding this here as a note for myself and anyone else that might take a crack at this.
