Extracting data from tables #13
Comments
If you can say a bit more about your use case, that would help inform future feature design. It would also probably let me offer more specific advice. In general, the current approach is to iterate over the cells and then iterate over the paragraphs in each cell:

    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)

http://python-docx.readthedocs.org/en/latest/api/table.html#docx.table._Cell.paragraphs |
Thanks for the reply. I have a Word document with a series of tables in it. I want to extract just the tables and export them to a CSV file. At the moment, using your module, I can get a list of the tables in the file with the following:

    tblList = document.xpath('//w:tbl', namespaces=document.nsmap)

Now I do not know what to do with this list. I don't know how to grab a table and use it as an object so that I can iterate through its elements. |
Which version are you using? I can't see how the line of code you're using would work with this version of python-docx. Can you provide a few lines of context code? |
@BeebBenjamin if you install python-docx (v0.3.0+) rather than docx (v0.2.x), there is an API I think you can use for this, roughly like so:

    document = Document(path_to_your_docx)
    tables = document.tables
    for table in tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    print(paragraph.text)

To get the right version installed you might need to do this:

    $ pip uninstall docx
    $ pip install -U --pre python-docx

The documentation you reference above is for this latest version. The legacy version doesn't have any documentation beyond the README.txt. Note that the two versions are in separate repositories. This repository is for the latest one. The legacy version is here: https://github.com/mikemaccana/python-docx

Let me know if it gives you trouble. |
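Since the stated goal earlier in the thread is a CSV export, here is a minimal sketch of one way to do it with the standard `csv` module. The helper names `table_to_rows` and `write_tables_csv` are hypothetical and not part of python-docx; the code only assumes objects shaped like the ones used above (`table.rows`, `row.cells`, and a `text` attribute on each cell).

```python
import csv


def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def write_tables_csv(tables, stream):
    """Write every table to `stream` as CSV, with a blank line between tables."""
    writer = csv.writer(stream)
    for index, table in enumerate(tables):
        if index:
            writer.writerow([])  # separator line between consecutive tables
        writer.writerows(table_to_rows(table))


# Possible usage (file names are placeholders):
#
#     from docx import Document
#     with open("tables.csv", "w", newline="") as f:
#         write_tables_csv(Document("report.docx").tables, f)
```

Writing all tables into one file is only one choice; a file per table may suit some documents better.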
@scanny thanks for the help with this. I get the following error: Traceback (most recent call last): Thanks |
Hmm, weird, it seems to have saved an earlier revision of my comment. What you want is:

    ...
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    print(paragraph.text)

I'll edit the code above to show the change. |
Yes, that has fixed it. Thanks so much for your help and your development of this tool! |
You're welcome @BeebBenjamin; glad you're finding it useful :) |
What this does is print/return every cell on a new line. Is it possible to print/return every cell belonging to the same row on one line? |
I worked it out. Might not be the best solution.

    import sys
    from docx import Document

    file = sys.argv[1]  # path to the .docx file
    document = Document(file)
    tables = document.tables
    # Print the first and third cells of each row in the third table
    for row in tables[2].rows:
        print(row.cells[0].paragraphs[0].text + row.cells[2].paragraphs[0].text)

Substitute list indexes where needed. |
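A slightly more general sketch of the same idea joins every cell of a row rather than picking cells by index. `row_text` is a hypothetical helper, not a python-docx function; it assumes only that each cell exposes a `text` attribute and that a tab is an acceptable separator.

```python
def row_text(row, separator="\t"):
    """Join the text of every cell in `row` into a single line.

    Whitespace inside a cell (including line breaks between paragraphs)
    is collapsed to single spaces so each row stays on one line.
    """
    return separator.join(" ".join(cell.text.split()) for cell in row.cells)


# Possible usage with the loop above:
#
#     for row in tables[2].rows:
#         print(row_text(row))
```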
@scanny - Is there any way I can extract tables with a custom name I had given in my template? |
Hi @divyaiyer, you might want to post this one on StackOverflow if you have a login there. It's more of a support question than a code issue. I do think you'll need to provide a bit more detail. Off the top of my head, tables have a different type of style than say paragraphs, if I remember correctly, and are specified by a GUID rather than a name. So that part doesn't make sense to me. The tables are available in Document.tables, so you can iterate through them that way once you have the test worked out to identify the one you want. |
Thanks for your reply, @scanny. I will post on StackOverflow; I do have an account. My case is a bit tricky: I have a particular paragraph heading, and I want to fetch all the tables that come under that heading and store them in a database. Thanks |
If your test for inclusion is following a heading paragraph, you'll need to iterate over both paragraphs and tables in document order. The document.iter_block_level_items() method is designed for that (although it hasn't been officially implemented yet). You may want to check out this issue that has details on that. |
Thanks, @scanny. I tried that program and I get the blocks as paragraphs and tables. What I need exactly is this: I have a paragraph heading with a custom style, and if I pass that to a method, it should return all the headings, subheadings and tables under it. Ex: 1.1. Heading 1 (defined custom style) |
How were you thinking to approach the problem? It sounds like you've successfully identified your "start" marker. Have you thought about how to handle the end marker? Like when to stop and return what you've found? |
@scanny, yes, I know the end marker is not easy. Currently I read the tables, look through rows and columns, and validate each table using the name of the first column, which I have defined. Only such tables are parsed, and once the end of the document is reached, it returns. While parsing I push them to the DB. I know this is not a very good way of handling my requirement; I am still searching for a better solution. Thanks |
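One way to frame the start and end markers discussed above: start collecting at the matching heading and stop at the next paragraph that uses the same heading style, so subheadings in other styles do not end the section. The sketch below is only an illustration under stated assumptions. `tables_under_heading` is a hypothetical helper, `blocks` is assumed to be paragraphs and tables already in document order (from whatever document-order iterator is available to you), paragraphs are assumed to expose `text` and `style.name`, and anything with a `rows` attribute is treated as a table.

```python
def tables_under_heading(blocks, heading_text, heading_styles=("Heading 1",)):
    """Return the tables found between the heading `heading_text` and the
    next paragraph whose style name is in `heading_styles`.

    Paragraphs in other styles (body text, subheadings) do not end the section.
    """
    collecting = False
    found = []
    for block in blocks:
        if hasattr(block, "rows"):  # duck-typed check: treat as a table
            if collecting:
                found.append(block)
            continue
        if block.style.name in heading_styles:
            # A heading at the marker level either opens or closes the section.
            collecting = block.text.strip() == heading_text
    return found
```

If no later heading of that style exists, collection simply runs to the end of the document, which matches the "stop at end of document" behaviour described above.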
@scanny I tried your code:

    for table in tables:
        assignments.write(table)
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    file.write(paragraph.text)  # I am storing the data in a txt file

My output is repeated several times. In other words, when I open my txt file and look for a value that should be unique, I find it repeated multiple times in the file (6 to 8 times, depending on the cell). |
@Nawuy I believe that |
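One known source of repeated text when walking `row.cells` is merged cells: python-docx reports a merged cell once for each grid position it spans. Whether that is what happened in the document above is an assumption. The sketch below skips such repeats within a row. `unique_cells` is a hypothetical helper, and it relies on the private `_tc` attribute (the underlying XML element of a cell), which is not a public API and may change between versions.

```python
def unique_cells(row):
    """Yield each distinct cell in `row` once, skipping merged-cell repeats.

    Cells are compared by the identity of their underlying `_tc` element
    (assumption: repeated entries for a merged cell share that element).
    Objects without `_tc` are compared by their own identity.
    """
    seen = set()
    for cell in row.cells:
        key = id(getattr(cell, "_tc", cell))
        if key not in seen:
            seen.add(key)
            yield cell


# Possible usage in the loop above:
#
#     for row in table.rows:
#         for cell in unique_cells(row):
#             ...
```

This only removes repeats inside a single row; cells merged vertically can still show up again in the rows below.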
When I parsed my docx file, I found that data was lost while parsing the tables: some of the values came back as empty strings. |
@movinghands Sounds like a question for StackOverflow. |
Can we see your code? Also, if I were you I would ask this question on Stackoverflow as well :) |
|
@Nawuy any help please. |
@Aviru , what is your ultimate goal? get the data from the cells? or get the alignment of the text inside the cell? |
Hi,
Thank you for asking. Actually, I am trying to build a resume parser. Resumes follow no specific format, so when I parse them I see that in some resumes the table heads are top-aligned and in others they are left-aligned (see the image I have posted on GitHub). I am trying to build a list of key-value pairs for the tables, where the keys are the table heads and the values are the corresponding row or column values, e.g. [{'degree':'msc','name of institution':'abc'}] and so on. But while traversing the table I am unable to differentiate between the table heads and the table body.
Thanks & Regards,
Aviru Bhattacharjee
…On Sat, Jun 9, 2018 at 4:54 AM Nahuel ***@***.***> wrote:
@Aviru <https://github.com/Aviru> , what is your ultimate goal? get the
data from the cells? or get the alignment of the text inside the cell?
|
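Once it is known whether the heads run along the top row or down the left column, building the key-value list described above is straightforward. The sketch below shows only that step and does not solve the harder problem of detecting the orientation. `records_from_rows` is a hypothetical helper that works on plain lists of strings (for example, cell text already extracted from a table), and lower-casing the keys is an assumption taken from the example in the comment.

```python
def records_from_rows(rows, heads="top"):
    """Turn a table (list of rows of strings) into a list of dicts.

    heads="top":  the first row holds the keys, each later row is a record.
    heads="left": the first column holds the keys, each later column is a record.
    """
    if not rows:
        return []
    if heads == "left":
        # Transpose so the left-hand heads become a top header row.
        rows = [list(column) for column in zip(*rows)]
    header, *body = rows
    keys = [head.strip().lower() for head in header]
    return [dict(zip(keys, record)) for record in body]
```

For example, `records_from_rows([["Degree", "Name of institution"], ["msc", "abc"]])` yields `[{'degree': 'msc', 'name of institution': 'abc'}]`, and the same table laid out with the heads in the left column gives the same result with `heads="left"`.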
Hi,
Is it possible to extract data from tables using the docx module?
It would be useful to have more examples with regards to learning to use this library.
Regards,
Ben