Extracting data from tables #13

BeebBenjamin · 2014-02-13T14:14:22Z

Hi,

Is it possible to extract data from tables using the docx module?

It would be useful to have more examples for learning how to use this library.

Regards,

Ben

scanny · 2014-02-13T20:29:34Z

If you can say a bit more about your use case that would be helpful to inform future feature design. Also it will probably enable me to offer more specific advice.

In general the current approach would be to iterate over the cells and then iterate over the paragraphs sequence for each of the cells. The paragraphs in a table cell are just like the ones found elsewhere in Word. So something like this might work for you:

for row in table.rows:
    for cell in row.cells:
        for paragraph in cell.paragraphs:
            print(paragraph.text)

http://python-docx.readthedocs.org/en/latest/api/table.html#docx.table._Cell.paragraphs
http://python-docx.readthedocs.org/en/latest/api/text.html#docx.text.Paragraph

BeebBenjamin · 2014-02-17T16:54:52Z

Thanks for the reply.

I have a Word document with a series of tables in it. I want to extract just the tables and export them to a CSV file.

At the moment using your module I can get a list of the tables in the file using the following:

tblList = document.xpath('//w:tbl', namespaces=document.nsmap)

Now, I do not know what to do with this list. I don't know how to grab the table and use it as an object so that I can iterate through its elements.

scanny · 2014-02-17T23:12:38Z

Which version are you using? I can't see how the line of code you're using would work with this version of python-docx. Can you provide a few lines of context code?

scanny · 2014-02-18T00:25:54Z

@BeebBenjamin if you install python-docx (v0.3.0+) rather than docx (v0.2.x), there is an API I think you can use for this, roughly like so:

document = Document(path_to_your_docx)
tables = document.tables
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)

To get the right version installed you might need to do this:

$ pip uninstall docx
$ pip install -U --pre python-docx

The documentation you reference above is for this latest version. The legacy version doesn't have any documentation beyond the README.txt. Note the two versions are in separate repositories. This repository is for the latest one. The legacy version is here: https://github.com/mikemaccana/python-docx

Let me know if it gives you trouble.
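For the CSV export mentioned earlier in the thread, a minimal sketch might look like the following. It assumes a python-docx release in which `cell.text` is readable (it was not in the version discussed here, where you would join `cell.paragraphs` instead). The helper names `table_to_rows`, `write_rows_to_csv`, and `export_tables` are made up for illustration and are not part of the library.

```python
import csv


def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    # Assumes cell.text is readable, as in later python-docx releases.
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def write_rows_to_csv(rows, csv_path):
    """Write a list of string rows to csv_path."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def export_tables(docx_path, prefix="table"):
    """Write each table in docx_path to <prefix>_<n>.csv and return the paths."""
    # python-docx is imported here so the two helpers above need only the stdlib.
    from docx import Document

    paths = []
    for index, table in enumerate(Document(docx_path).tables, start=1):
        csv_path = f"{prefix}_{index}.csv"
        write_rows_to_csv(table_to_rows(table), csv_path)
        paths.append(csv_path)
    return paths
```

Calling `export_tables("report.docx")` (a hypothetical file name) would then produce one CSV file per table. Merged cells may show up as repeated values, since `row.cells` can return the same cell more than once.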

BeebBenjamin · 2014-02-18T12:31:25Z

@scanny thanks for the help with this. I get the following error:

Traceback (most recent call last):
  File "OpenDocs.py", line 9, in <module>
    print(cell.text)
AttributeError: unreadable attribute

Thanks

scanny · 2014-02-18T19:46:20Z

Hmm, weird, it seems to have saved an earlier revision of my comment. What you want is:

...
    for cell in row.cells:
        for paragraph in cell.paragraphs:
            print(paragraph.text)

I'll edit the code above to show the change.

BeebBenjamin · 2014-02-19T16:06:05Z

Yes, that has fixed it. Thanks so much for your help and your development of this tool!

scanny · 2014-02-19T16:49:30Z

You're welcome @BeebBenjamin; glad you're finding it useful :)

abruski · 2014-10-17T10:34:14Z

What this does is print/return every cell on a new line. Is it possible to print/return every cell belonging to the same row on one line?

abruski · 2014-10-17T10:52:23Z

I worked it out. Might not be the best solution.

import sys
from docx import Document

file = sys.argv[1]
document = Document(file)
tables = document.tables
for row in tables[2].rows:
    # Concatenate the first paragraph of the first and third cells of each row.
    print(row.cells[0].paragraphs[0].text + row.cells[2].paragraphs[0].text)

Substitute list indexes where needed.
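A more general variant, as a rough sketch: join the text of every cell in a row with a separator, so each row prints on one line without hard-coding the cell indexes. The names `row_text` and `print_table` are illustrative only, and the tab separator is just an assumption about the desired output.

```python
def row_text(row, separator="\t"):
    """Return the text of all cells in a row joined by separator."""
    # A cell may hold several paragraphs; join them with a space so the
    # whole row still fits on a single line.
    cells = (" ".join(p.text for p in cell.paragraphs) for cell in row.cells)
    return separator.join(cells)


def print_table(table, separator="\t"):
    """Print one line per table row."""
    for row in table.rows:
        print(row_text(row, separator))
```

With the script above, `print_table(tables[2])` would replace the final loop.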

divyaiyer · 2016-04-18T10:12:03Z

@scanny - Is there any way I can extract tables with a custom name I had given in my template?
Use case: parsing a docx file and storing the values in a database. The docx contains a lot of tables, but I want to extract only the tables with the style name 'docReader_Tables'. Is this possible? Thanks in advance.

scanny · 2016-04-18T18:40:26Z

Hi @divyaiyer, you might want to post this one on StackOverflow if you have a login there. It's more of a support question than a code issue.

I do think you'll need to provide a bit more detail.

Off the top of my head, tables have a different type of style than say paragraphs, if I remember correctly, and are specified by a GUID rather than a name. So that part doesn't make sense to me. The tables are available in Document.tables, so you can iterate through them that way once you have the test worked out to identify the one you want.
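If the test turns out to be the table style, a sketch along these lines might work. It assumes a python-docx release in which `Table.style` returns a style object with a `name` attribute (my understanding of later versions, so worth checking against the release you have installed). `tables_with_style` is a hypothetical helper, not library API.

```python
def tables_with_style(tables, style_name):
    """Return the tables whose applied style has the given name."""
    matches = []
    for table in tables:
        # Assumption: table.style is a style object exposing .name, and it
        # may be None when the document defines no default table style.
        style = table.style
        if style is not None and style.name == style_name:
            matches.append(table)
    return matches
```

Usage would be something like `tables_with_style(document.tables, "docReader_Tables")`.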

divyaiyer · 2016-04-19T06:15:59Z

Thanks for your reply @scanny. I shall post on StackOverflow; I do have an account. My case is a bit tricky, as I have a particular paragraph heading and I want to fetch all the tables that come under that heading and store them in a database. Thanks.

scanny · 2016-04-19T16:40:00Z

If your test for inclusion is following a heading paragraph, you'll need to iterate over both paragraphs and tables in document order. The document.iter_block_level_items() method is designed for that (although it hasn't been officially implemented yet).

You may want to check out this issue that has details on that.
#40

divyaiyer · 2016-04-20T03:46:21Z

Thanks @scanny. I tried that program and I get the blocks as paragraphs and tables. What I need exactly is this: I have a paragraph heading with a custom style, and if I pass that to a method, it should return all the headings, subheadings, and tables under it.

Ex:

1.1. Heading 1 (defined custom style)
1.1.1. Sub heading 1 (defined custom style)
1.1.1.1. Child of sub heading (defined custom style)
----
#Table#
1.2. Heading 2 (defined custom style)
1.2.1. Sub heading 1 (defined custom style)
1.2.1.1. Child of sub heading (defined custom style)
----
#Table#

scanny · 2016-04-20T04:29:45Z

How were you thinking to approach the problem? It sounds like you've successfully identified your "start" marker. Have you thought about how to handle the end marker? Like when to stop and return what you've found?
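One way to handle the start and end markers, as a rough sketch: treat the heading with the given text and style as the start, and the next paragraph carrying that same style as the end. The function below works on any sequence of blocks in document order, for example the output of a helper like the `iter_block_items()` recipe discussed in #40. The name `blocks_under_heading` is made up, and detecting tables by the presence of a `rows` attribute is an assumption made to keep the sketch free of imports.

```python
def blocks_under_heading(blocks, heading_text, heading_style):
    """Collect the blocks between a heading and the next heading of the same style."""
    collected = []
    inside = False
    for block in blocks:
        # Assumption: tables have .rows and paragraphs do not.
        is_table = hasattr(block, "rows")
        is_marker = (
            not is_table
            and block.style is not None
            and block.style.name == heading_style
        )
        if is_marker:
            if inside:
                break  # next heading at the same level: end marker reached
            # Start collecting only if this is the heading we were asked for.
            inside = block.text.strip() == heading_text
        elif inside:
            collected.append(block)
    return collected
```

This stops only at a heading with the same style as the start heading, so a section that is closed by a higher-level heading would need an extra check. Running off the end of the document also ends the section.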

divyaiyer · 2016-04-21T03:36:55Z

@scanny, yes, I know the end marker is not easy. Currently I read the tables, look through rows -> columns, and validate each table using the first column name that I have defined. Only such tables are parsed, and once the end of the document is reached, the method returns. While parsing I am pushing them to the DB. I know this is not a very good way of dealing with my requirement; I am still searching for a better solution. Thanks.
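The first-column check described above could be sketched roughly as follows. It assumes `cell.text` is readable in the installed python-docx release, and the case-insensitive comparison is my own assumption about what counts as a match. `tables_with_first_header` is a hypothetical name.

```python
def tables_with_first_header(tables, expected):
    """Return the tables whose top-left cell text matches expected."""
    wanted = expected.strip().lower()
    selected = []
    for table in tables:
        rows = table.rows
        if len(rows) == 0:
            continue  # nothing to validate in an empty table
        cells = rows[0].cells
        # Compare the first header cell, ignoring case and outer whitespace.
        if cells and cells[0].text.strip().lower() == wanted:
            selected.append(table)
    return selected
```

Each selected table can then be parsed and pushed to the database as before.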

npregot · 2017-10-26T20:11:10Z

@scanny I tried your code:

for table in tables:
    assignments.write(table)
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                file.write(paragraph.text)  # I am storing the data in a txt file

My output is repeated several times. In other words, when I open my txt file and look for a unique piece of data, I notice that it appears multiple times in the file (6 to 8 times, depending on the cell).

scanny · 2017-10-26T20:51:36Z

@Nawuy I believe that row.cells will return a "merged" cell multiple times, once for each cell that is merged into it. Could that be your problem?
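If merged cells are the cause, one possible workaround is to skip a cell when it wraps the same underlying element as the previous one in the row. This is only a sketch: it relies on the private `_tc` attribute of a cell, which is an implementation detail and could change, and `unique_cells` is a made-up helper name.

```python
def unique_cells(row):
    """Return row.cells with consecutive repeats of one merged cell collapsed."""
    unique = []
    previous_tc = None
    for cell in row.cells:
        # Assumption: repeated entries for a merged cell share the same
        # underlying <w:tc> element, reachable via the private _tc attribute.
        tc = cell._tc
        if tc is previous_tc:
            continue
        unique.append(cell)
        previous_tc = tc
    return unique
```

Replacing `for cell in row.cells:` with `for cell in unique_cells(row):` would then write each horizontally merged cell once per row. A cell merged vertically would still appear once in each row it spans.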

lxj0276 · 2017-12-12T07:30:00Z

When I parsed my docx file, I found that data was lost when parsing tables.
My method is like this:

document = Document(path_to_your_docx)
tables = document.tables
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)

I found that some data became empty strings.

scanny · 2018-01-02T17:57:57Z

@movinghands Sounds like a question for StackOverflow.

Aviru · 2018-06-04T08:06:45Z

Hi,
I am using python-docx to read tables from docx files, but different docx files contain different table structures: in some files the table heads are top aligned and in others they are left aligned. I am unable to get the alignment/position of the table heads, so when I iterate through a table and try to fetch the table data as key-value pairs, the keys and values are paired incorrectly.
I have attached images showing the table head alignment.

In the 1st image the table head is left aligned.
In the 2nd image the table head is top aligned.

Please help me.

Thanks & Regards,
Aviru Bhattacharjee

npregot · 2018-06-04T19:55:40Z

Can we see your code? Also, if I were you I would ask this question on Stackoverflow as well :)

Aviru · 2018-06-05T07:22:02Z

# Module paths below are those used by recent python-docx releases.
from docx import Document
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, _Row, Table
from docx.text.paragraph import Paragraph


def convertDocxToText(path):
    doc = Document(path)
    fullText = []

    for block in iter_block_items(doc):
        if isinstance(block, Paragraph):
            # .... This part is for paragraph.
            pass
        elif isinstance(block, Table):
            # .... This part is for Table.
            for i, row in enumerate(block.rows):
                # Generator over the text of each cell in this row.
                text = (cell.text for cell in row.cells)

                # for col in block.columns:
                #     for colCel in col.cells:
                #         print(colCel.text)
                #
                # for cell in row.cells:
                #     for paragraph in cell.paragraphs:
                #         print(paragraph.text)

                # Establish the mapping based on the first row
                # headers; these will become the keys of our dictionary
                # if i == 0:
                #     tpl = tuple(text)
                #     keys = tpl[0]  # tuple(text)
                #     continue
                if i == 0:
                    keys = tuple(text)
                    continue

                # Construct a dictionary for this row, mapping
                # keys to values for this row
                row_data = dict(zip(keys, text))
                print(str(row_data))
                fullText.append(str(row_data))
                # for key, val in row_data.items():
                #     print(key + ": " + val)
                #     strKeyVal = key + ": " + val
                #     fullText.append(strKeyVal)

                #     for paragraph in cell.paragraphs:
                #         fullText.append(paragraph.text)
                # return ("\t".join(fullText))
    return '\n'.join(fullText)


def iter_block_items(parent):
    # Yield each paragraph and table child of parent, in document order.
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    elif isinstance(parent, _Row):
        parent_elm = parent._tr
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

Aviru · 2018-06-07T06:20:40Z

@Nawuy any help please.

npregot · 2018-06-08T23:24:48Z

@Aviru, what is your ultimate goal? To get the data from the cells, or to get the alignment of the text inside the cell?
Are the tables always going to be different, or is there any consistency in their format?

Aviru · 2018-06-09T05:55:09Z

Hi, thank you for asking. I am trying to build a resume parser. As resumes follow no specific format, when I parse them I see that in some resumes the table heads are top aligned and in others they are left aligned (see the images I posted on GitHub).

I am trying to build a list of key-value pairs for the tables, where the keys are the table heads and the values are the corresponding row or column values, e.g. [{'degree': 'msc', 'name of institution': 'abc'}] and so on. But while traversing the table I am unable to differentiate between the table heads and the table body.

Thanks & Regards,
Aviru Bhattacharjee
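Once the orientation of a table is known, building the key-value records is mostly a matter of transposing. The sketch below does not detect the orientation; it assumes the caller already knows whether the heads run along the top row or down the left column, and it takes the table as a plain list of rows of strings. `records_from_grid` is a hypothetical helper name.

```python
def records_from_grid(grid, header="top"):
    """Turn a grid of cell strings into a list of dicts keyed by the table heads.

    header="top" means the first row holds the heads;
    header="left" means the first column holds them.
    """
    if header == "left":
        # Transpose so that the heads end up in the first row.
        grid = [list(column) for column in zip(*grid)]
    elif header != "top":
        raise ValueError("header must be 'top' or 'left'")
    if not grid:
        return []
    keys, *body = grid
    # zip() silently truncates rows that are longer than the head row.
    return [dict(zip(keys, row)) for row in body]
```

For a top-aligned table the grid could be built with something like `[[cell.text for cell in row.cells] for row in table.rows]`, assuming `cell.text` is readable in the installed release.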


scanny closed this as completed Feb 19, 2014

Aviru mentioned this issue Dec 6, 2018

How to determine table head alignment while reading docx file #584

Open

bhavasagar-dv pushed a commit to bhavasagar-dv/python-docx that referenced this issue May 3, 2024

Create add_ole_object_to_run func (python-openxml#13)

9be99b3

* Update version * reverse version to origin * Add add_ole_object_to_run func
