feature: Document.text #72

deanmalmgren · 2014-07-01T10:22:16Z

@mikemaccana's old project had a simple script for extracting text from a document. Took me a few minutes to figure it out, but this is really simple now:

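import docx

def get_document_text(filename):
    # Join the text of every body paragraph, separated by blank lines.
    # paragraph.text is already a text string, so no .encode() is needed;
    # on Python 3, joining encoded bytes with a str separator raises TypeError.
    document = docx.Document(filename)
    return '\n\n'.join(paragraph.text for paragraph in document.paragraphs)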

Opening this issue with the snippet may be enough to document the approach, but it might be nice to include it somewhere in the documentation or as a script that is installed with the package. I'm happy to contribute.

Do you have any preferences on a script vs documenting this two-liner? If just documenting is enough, any thoughts on where it should go?

scanny · 2014-07-08T05:51:58Z

Yeah, I think it makes sense to have a Document.text property, or something similar, that produces a list of strings roughly like this. The question comes up from time to time, for indexing purposes and so forth.

This particular snippet misses text that's in tables, so there would need to be a little more to it, but I'm sure it would be modest in size.
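The table gap mentioned above could be closed roughly like this. This is only a sketch, assuming python-docx's `Document.paragraphs`, `Document.tables`, `Table.rows`, `Row.cells`, and `Cell.text`; the helper names `table_text`, `document_text`, and `text_from_file` are hypothetical and not part of the library.

```python
def table_text(table):
    """Return a table's text, one line per row, cells separated by tabs."""
    return '\n'.join(
        '\t'.join(cell.text for cell in row.cells) for row in table.rows
    )


def document_text(document):
    """Return body paragraph text followed by the text of top-level tables.

    `document` is expected to be a python-docx Document, or any object
    exposing the same `paragraphs` and `tables` attributes.
    """
    blocks = [paragraph.text for paragraph in document.paragraphs]
    blocks.extend(table_text(table) for table in document.tables)
    return '\n\n'.join(blocks)


def text_from_file(filename):
    """Hypothetical convenience wrapper; needs python-docx installed."""
    import docx  # imported lazily so the helpers above carry no hard dependency
    return document_text(docx.Document(filename))
```

Note that this appends all table text after the paragraphs instead of placing each table where it appears in the document, and it does not descend into tables nested inside cells. Merged cells may also show up more than once in a row.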

Other folks have also mentioned capturing text that's in headers, footers, footnotes, and endnotes. I suppose enough folks will want that, but I'm wondering whether it makes sense to keep those bits separate, perhaps returning a tuple like (document_text, hdr_ftr_text, end_and_foot_note_text) so folks could pick and choose without having to go to several different objects to collect it all.

Based on your needs, do you have a point of view on that?

deanmalmgren · 2014-07-08T10:41:51Z

Ah, I didn't realize this would miss tables, headers, and footers. If covering those is feasible, that would be awesome. I've recently started a project to extract text from any document, and I think it would be helpful to be able to omit headers and footers but keep tables, for example. In my particular use case, it would actually be beneficial to have the tables correctly interwoven with the body text, so returning a tuple is less desirable.
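To illustrate what "interwoven" could mean in practice, here is a rough sketch that bypasses python-docx entirely and walks the body of the WordprocessingML with the standard library, emitting paragraphs and tables in document order. It assumes the main part lives at `word/document.xml` (the usual location), only collects `w:t` text, and ignores tabs, line breaks, and anything wrapped in content controls. All function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def xml_paragraph_text(p):
    """Concatenate the text of every w:t element inside one w:p element."""
    return ''.join(t.text or '' for t in p.iter(W + 't'))


def xml_table_text(tbl):
    """One line per row, cells separated by tabs.

    Paragraphs inside a cell (including those of nested tables) are
    flattened into a single space-separated string.
    """
    rows = []
    for tr in tbl.findall(W + 'tr'):
        cells = [
            ' '.join(xml_paragraph_text(p) for p in tc.iter(W + 'p'))
            for tc in tr.findall(W + 'tc')
        ]
        rows.append('\t'.join(cells))
    return '\n'.join(rows)


def body_text_from_xml(xml_bytes):
    """Walk the direct children of w:body so tables stay in position."""
    body = ET.fromstring(xml_bytes).find(W + 'body')
    blocks = []
    for child in body:
        if child.tag == W + 'p':
            blocks.append(xml_paragraph_text(child))
        elif child.tag == W + 'tbl':
            blocks.append(xml_table_text(child))
    return '\n\n'.join(blocks)


def body_text(filename):
    """Read the main document part from a .docx file and extract its text."""
    with zipfile.ZipFile(filename) as archive:
        return body_text_from_xml(archive.read('word/document.xml'))
```

Because headers and footers are stored in separate parts of the package, this walk leaves them out, which happens to match the use case described above.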

Maybe instead of a Document.text property it could be a method that has a signature with optional kwargs that make it easy to select different parts of the text:

class Document(object):
    def get_text(self, omit_tables=False, omit_footers=False, omit_headers=False):
        pass
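
One way the proposed signature might be filled in, written as a free function so it does not pretend to be part of python-docx. This is a sketch under assumptions: it relies on `Document.sections`, `Section.header`, `Section.footer`, and `is_linked_to_previous`, which are available in recent python-docx releases, and it places header text first and footer text last instead of interleaving tables with body text.

```python
def get_text(document, omit_tables=False, omit_footers=False,
             omit_headers=False):
    """Hypothetical stand-in for the proposed Document.get_text()."""
    blocks = []

    if not omit_headers:
        for section in document.sections:
            # A linked header just repeats the previous section's header,
            # so only sections with their own definition are collected.
            if not section.header.is_linked_to_previous:
                blocks.extend(p.text for p in section.header.paragraphs)

    # Body paragraphs are always included
    blocks.extend(p.text for p in document.paragraphs)

    if not omit_tables:
        for table in document.tables:
            for row in table.rows:
                blocks.append('\t'.join(cell.text for cell in row.cells))

    if not omit_footers:
        for section in document.sections:
            if not section.footer.is_linked_to_previous:
                blocks.extend(p.text for p in section.footer.paragraphs)

    return '\n\n'.join(blocks)
```

With a python-docx document loaded via `docx.Document(filename)`, a call such as `get_text(document, omit_headers=True, omit_footers=True)` would return body paragraphs and table rows only. Footnotes and endnotes are left out of this sketch.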

deanmalmgren · 2015-08-31T10:38:02Z

This is related to #40 and deanmalmgren/textract#92, too. Just adding this here as a note for myself and anyone else that might take a crack at this.

scanny changed the title ~~clarify how to extract text from document~~ feature: Document.text Jul 8, 2014

scanny added the text label Nov 27, 2014

scanny mentioned this issue Feb 13, 2015

Legacy getdocumenttext #32

Closed

deanmalmgren mentioned this issue Aug 31, 2015

Extract table contents from docx deanmalmgren/textract#92

Merged
