Extracting data from tables #13

Closed
BeebBenjamin opened this issue Feb 13, 2014 · 27 comments

Comments

@BeebBenjamin

Hi,

Is it possible to extract data from tables using the docx module?

It would be useful to have more examples with regards to learning to use this library.

Regards,

Ben

@scanny
Contributor

scanny commented Feb 13, 2014

If you can say a bit more about your use case that would be helpful to inform future feature design. Also it will probably enable me to offer more specific advice.

In general the current approach would be to iterate over the cells and then iterate over the paragraphs sequence for each of the cells. The paragraphs in a table cell are just like the ones found elsewhere in Word. So something like this might work for you:

for row in table.rows:
    for cell in row.cells:
        for paragraph in cell.paragraphs:
            print(paragraph.text)

http://python-docx.readthedocs.org/en/latest/api/table.html#docx.table._Cell.paragraphs
http://python-docx.readthedocs.org/en/latest/api/text.html#docx.text.Paragraph

@BeebBenjamin
Author

Thanks for the reply.

I have a word document with a series of tables in it. I want to extract just the tables and export them to CSV file.

At the moment using your module I can get a list of the tables in the file using the following:

tblList = document.xpath('//w:tbl', namespaces=document.nsmap)

Now, I do not know what to do with this list. I don't know how to grab the table and use it as an object so that I can iterate through its elements.

@scanny
Contributor

scanny commented Feb 17, 2014

Which version are you using? I can't see how the line of code you're using would work with this version of python-docx. Can you provide a few lines of context code?

@scanny
Contributor

scanny commented Feb 18, 2014

@BeebBenjamin if you install python-docx (v0.3.0+) rather than docx (v0.2.x), there is an API I think you can use for this, roughly like so:

from docx import Document

document = Document(path_to_your_docx)
tables = document.tables
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)

To get the right version installed you might need to do this:

$ pip uninstall docx
$ pip install -U --pre python-docx

The documentation you reference above is for this latest version. The legacy version doesn't have any documentation beyond the README.txt. Note the two versions are in separate repositories. This repository is for the latest one. The legacy version is here: https://github.com/mikemaccana/python-docx

Let me know if it gives you trouble.
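
For the CSV export described earlier in the thread, one possible sketch is below. It assumes Python 3 and that a cell's text can be read by joining its paragraphs, as in the loop above. The names `table_to_rows` and `write_tables_to_csv`, and the `table_<n>.csv` file naming, are hypothetical choices for illustration and not part of python-docx.

```python
import csv


def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for row in table.rows:
        # Join a cell's paragraphs with newlines so a multi-paragraph cell
        # still ends up in a single CSV field.
        rows.append(
            ["\n".join(p.text for p in cell.paragraphs) for cell in row.cells]
        )
    return rows


def write_tables_to_csv(docx_path, csv_prefix="table"):
    """Write each table to <csv_prefix>_<n>.csv (file naming is an assumption)."""
    from docx import Document  # python-docx v0.3.0+, imported lazily

    written = []
    for index, table in enumerate(Document(docx_path).tables, start=1):
        csv_path = "{}_{}.csv".format(csv_prefix, index)
        # newline="" lets the csv module control line endings itself.
        with open(csv_path, "w", newline="") as handle:
            csv.writer(handle).writerows(table_to_rows(table))
        written.append(csv_path)
    return written
```

Calling `write_tables_to_csv("report.docx")` (a made-up path) would be expected to produce one CSV file per table. Merged cells may show up more than once in a row, so the output may need de-duplicating for documents that use them.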

@BeebBenjamin
Author

@scanny thanks for the help with this. I get the following error:

Traceback (most recent call last):
  File "OpenDocs.py", line 9, in <module>
    print(cell.text)
AttributeError: unreadable attribute

Thanks

@scanny
Contributor

scanny commented Feb 18, 2014

Hmm, weird, it seems to have saved an earlier revision of my comment. What you want is:

...
    for cell in row.cells:
        for paragraph in cell.paragraphs:
            print(paragraph.text)

I'll edit the code above to show the change.

@BeebBenjamin
Author

Yes, that has fixed it. Thanks so much for your help and your development of this tool!

@scanny
Contributor

scanny commented Feb 19, 2014

You're welcome @BeebBenjamin; glad you're finding it useful :)

scanny closed this as completed Feb 19, 2014

@abruski

abruski commented Oct 17, 2014

What this does is print/return every cell on a new line. Is it possible to print/return every cell belonging to the same row on one line?

@abruski

abruski commented Oct 17, 2014

I worked it out. Might not be the best solution.

import sys
from docx import Document

file = sys.argv[1]
document = Document(file)
tables = document.tables
for row in tables[2].rows:
    print(row.cells[0].paragraphs[0].text + row.cells[2].paragraphs[0].text)

Substitute the list indexes where needed.
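
A more general variant of the same idea, as a rough sketch: join the text of every cell in a row instead of picking fixed indexes. The helper name `row_to_line` and the tab separator are assumptions made for illustration.

```python
def row_to_line(row, separator="\t"):
    """Join the text of every cell in a row into a single line."""
    parts = []
    for cell in row.cells:
        # A cell may hold several paragraphs; collapse them with spaces so
        # the whole row still fits on one output line.
        parts.append(" ".join(p.text for p in cell.paragraphs))
    return separator.join(parts)
```

With the script above this could be used as `for row in tables[2].rows: print(row_to_line(row))`.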

@divyaiyer

divyaiyer commented Apr 18, 2016

@scanny - Is there any way I can extract tables with a custom name I had given in my template?
Use case: parsing a docx file and storing the values in a database. The docx contains a lot of tables, but I want to extract only the tables with the style name 'docReader_Tables'. Is this possible? Thanks in advance.

@scanny
Contributor

scanny commented Apr 18, 2016

Hi @divyaiyer, you might want to post this one on StackOverflow if you have a login there. It's more of a support question than a code issue.

I do think you'll need to provide a bit more detail.

Off the top of my head, tables have a different type of style than say paragraphs, if I remember correctly, and are specified by a GUID rather than a name. So that part doesn't make sense to me. The tables are available in Document.tables, so you can iterate through them that way once you have the test worked out to identify the one you want.
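
Once a test is worked out, the filtering itself can be small. The sketch below assumes a python-docx release in which `table.style` returns a style object with a `name` attribute; that is an assumption to verify against the installed version, and the name reported there may not match the name shown in Word. `tables_with_style` is a hypothetical helper, not a library function.

```python
def tables_with_style(tables, style_name):
    """Return the tables whose applied style has the given name."""
    matches = []
    for table in tables:
        # Assumption: table.style is an object with a .name attribute.
        style = getattr(table, "style", None)
        # Tables without a readable style are skipped rather than guessed at.
        if style is not None and getattr(style, "name", None) == style_name:
            matches.append(table)
    return matches
```

It would be called as `tables_with_style(document.tables, "docReader_Tables")`.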

@divyaiyer

Thanks for your reply, @scanny. I do have an account, so I'll post on StackOverflow. My case is a bit tricky: I have a particular paragraph heading, and I want to fetch all the tables that come under that heading and store them in a database. Thanks.

@scanny
Contributor

scanny commented Apr 19, 2016

If your test for inclusion is following a heading paragraph, you'll need to iterate over both paragraphs and tables in document order. The document.iter_block_level_items() method is designed for that (although it hasn't been officially implemented yet).

You may want to check out this issue that has details on that.
#40

@divyaiyer

divyaiyer commented Apr 20, 2016

Thanks @scanny. I tried that program and I get the blocks as paragraphs and tables. What I need exactly is this: I have a paragraph heading with a custom style, and if I pass that to a method, it should return all the headings, subheadings, and tables under it.

Ex:

1.1. Heading 1 (defined custom style)
1.1.1. sub Heading 1 (defined custom style)
1.1.1.1. Child of sub heading (defined custom style)
----
#Table#
1.2. Heading 2 (defined custom style)
1.2.1. sub Heading 1 (defined custom style)
1.2.1.1. Child of sub heading (defined custom style)
----
#Table#

@scanny
Contributor

scanny commented Apr 20, 2016

How were you thinking to approach the problem? It sounds like you've successfully identified your "start" marker. Have you thought about how to handle the end marker? Like when to stop and return what you've found?

@divyaiyer

@scanny, yes, I know the end marker is not easy. Currently I read each table, walk its rows and columns, and validate the table by checking the first column name against the one I have defined. Only such tables are parsed, and parsing returns once it reaches the end of the document. While parsing I push the rows to the DB. I know this is not a very good way of handling my requirement, so I'm still searching for a better solution. Thanks.
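
One way the start/end marker idea could be sketched: treat the named heading as the start marker and the next heading at the same or a higher level as the end marker. The code assumes `blocks` holds paragraphs and tables in document order (for example from the `iter_block_items` recipe discussed in #40), that paragraphs expose `.text` and `.style.name`, and that only tables expose `.rows`. The function name and the `heading_styles` parameter are hypothetical.

```python
def tables_under_heading(blocks, heading_text, heading_styles):
    """Collect tables between a matching heading and the next peer heading.

    blocks: paragraphs and tables in document order.
    heading_text: text of the heading paragraph that opens the section.
    heading_styles: style names ordered from top level to deepest level.
    """
    collected = []
    start_level = None  # None means we are outside the wanted section
    for block in blocks:
        if hasattr(block, "rows"):  # assumption: only tables expose .rows
            if start_level is not None:
                collected.append(block)
            continue
        style_name = block.style.name
        if style_name not in heading_styles:
            continue  # body paragraph: neither a start nor an end marker
        level = heading_styles.index(style_name)
        if start_level is None:
            if block.text.strip() == heading_text:
                start_level = level  # start marker found
        elif level <= start_level:
            break  # end marker: a heading at the same or a higher level
    return collected
```

Subheadings below the starting level do not end the section, so tables under "1.1.1.1" are still returned when the search starts at "1.1".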

@npregot

npregot commented Oct 26, 2017

@scanny I tried your code:

for table in tables:
    assignments.write(table)
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                 file.write(paragraph.text)  # I am storing the data in a txt file

My output is repeated several times. In other words, when I open my txt file and look for a value that should be unique, I find it repeated in the file (6 to 8 times, depending on the cell).

@scanny
Contributor

scanny commented Oct 26, 2017

@Nawuy I believe that row.cells will return a "merged" cell multiple times, once for each cell that is merged into it. Could that be your problem?
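
If merged cells are the cause, one possible workaround is to skip cells whose underlying XML element has already been seen in the row. This sketch relies on `_tc`, a private attribute of a cell (it also appears in the `iter_block_items` code later in this thread), so it is an assumption that may not hold across versions. `unique_cells` is a hypothetical helper name.

```python
def unique_cells(row):
    """Return the row's cells with merged-cell repeats dropped."""
    seen = []  # underlying elements already emitted, compared by identity
    cells = []
    for cell in row.cells:
        # _tc is the private XML element behind a cell; fall back to the
        # cell object itself if the attribute is missing.
        element = getattr(cell, "_tc", cell)
        if any(element is previous for previous in seen):
            continue  # same merged cell reported again
        seen.append(element)
        cells.append(cell)
    return cells
```

The inner loop would then read `for cell in unique_cells(row):`. This only removes repeats within one row; repeats across rows from vertically merged cells would need the `seen` list shared between rows.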

@lxj0276

lxj0276 commented Dec 12, 2017

When I parsed my docx file, I found that data was lost when parsing tables.
My method looks like this:

document = Document(path_to_your_docx)
tables = document.tables
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)

I found that some data became empty strings.

@scanny
Contributor

scanny commented Jan 2, 2018

@movinghands Sounds like a question for StackOverflow.

@Aviru

Aviru commented Jun 4, 2018

Hi,
I am using python-docx to read tables from docx files, but different files use different table structures: in some the table heads are top aligned and in others they are left aligned. I am unable to get the alignment/position of the table heads, so when I iterate through a table and try to fetch its data as key-value pairs, the keys and values come out mismatched.
I have attached images showing the table head alignment.
[image: left_aligned_table_head]
[image: top_aligned_table_head]

In the 1st image the table head is left aligned.
In the 2nd image the table head is top aligned.

Please help me.

Thanks & Regards,
Aviru Bhattacharjee

@npregot

npregot commented Jun 4, 2018

Can we see your code? Also, if I were you I would ask this question on Stackoverflow as well :)

@Aviru

Aviru commented Jun 5, 2018

# Imports assumed from the iter_block_items recipe (see issue #40).
from docx import Document
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, _Row, Table
from docx.text.paragraph import Paragraph


def convertDocxToText(path):
    doc = Document(path)
    fullText = []

    for block in iter_block_items(doc):
        if isinstance(block, Paragraph):
            # .... This part is for paragraph.
            pass
        elif isinstance(block, Table):
            for i, row in enumerate(block.rows):
                # Generator over this row's cell text; consumed once below.
                text = (cell.text for cell in row.cells)

                # .... This part is for Table.

                # for col in block.columns:
                #     for colCel in col.cells:
                #         print(colCel.text)
                #
                # for cell in row.cells:
                #     for paragraph in cell.paragraphs:
                #         print(paragraph.text)

                # Establish the mapping based on the first row
                # headers; these will become the keys of our dictionary
                # if i == 0:
                # tpl = tuple(text)
                # keys = tpl[0]  # tuple(text)
                # continue
                # Construct a dictionary for this row, mapping
                # keys to values for this row

                if i == 0:
                    keys = tuple(text)
                    continue

                row_data = dict(zip(keys, text))
                print(str(row_data))
                fullText.append(str(row_data))
                # for key, val in row_data.items():
                #     print(key + ": " + val)
                #     strKeyVal = key + ": " + val
                #     fullText.append(strKeyVal)

                #     for paragraph in cell.paragraphs:
                #         fullText.append(paragraph.text)
                # return ("\t".join(fullText))
    return '\n'.join(fullText)


def iter_block_items(parent):
    # Yield each paragraph and table child of parent, in document order.
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    elif isinstance(parent, _Row):
        parent_elm = parent._tr
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

@Aviru

Aviru commented Jun 7, 2018

@Nawuy any help please.

@npregot

npregot commented Jun 8, 2018

@Aviru, what is your ultimate goal: to get the data from the cells, or to get the alignment of the text inside the cell?
Are the tables always going to be different, or is there some regularity in their format?
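
If the goal is the cell data as key-value pairs, the two layouts differ only in whether the keys sit in the first row or the first column. Nothing in this thread shows python-docx reporting which layout a table uses, so the sketch below leaves that decision to the caller. `grid_to_records` and its `header` parameter are hypothetical, and the grid is assumed to be a plain list of rows of cell strings.

```python
def grid_to_records(grid, header="top"):
    """Turn a table grid into dictionaries keyed by its header cells.

    grid: list of rows, each a list of cell strings.
    header: "top" if the first row holds the keys, "left" if the first
            column does. The caller must supply this.
    """
    if header == "left":
        # Transpose so the header column becomes a header row.
        grid = [list(column) for column in zip(*grid)]
    elif header != "top":
        raise ValueError("header must be 'top' or 'left'")
    if not grid:
        return []
    keys = grid[0]
    return [dict(zip(keys, values)) for values in grid[1:]]
```

A grid can be built from a table with `[[cell.text for cell in row.cells] for row in table.rows]`. Note that `zip` stops at the shortest row, so ragged tables lose trailing cells when transposed.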

@Aviru

Aviru commented Jun 9, 2018 via email
