-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
python = 10.x.x
Windows OS
python-docx module = 0.8.11
I am trying to read the bulleted data from the word document using document.xml and numbering.xml
below is the code in which checking for w:ilvl exists or not and w:del for deleted content.
then not considering the w:numId value == 0 because we don't have a equivalent in the numbering.xml
then i am capturing the bulleted data and also the type of bullet format from the numbering.xml
document = Document("\\path_to_file.docx")
num_ele = document.part.numbering_part.element
doc_ele = document._part._element
word_paragraphs = doc_ele.xpath(".//w:p[boolean(.//w:pPr//w:numPr//w:ilvl)][not(boolean(.//w:del))]")
for word_paragraph in word_paragraphs:
paragraph_properties = word_paragraph.xpath(".//w:pPr")
number_properties = paragraph_properties[0].xpath(".//w:numPr")
number_id = number_properties[0].xpath(".//w:numId//@w:val")
if number_id[0] != '0':
word_runs = word_paragraph.xpath(".//w:r")
word_text = [word_run.xpath(".//w:t")[0].text for word_run in word_runs if len(word_run.xpath(".//w:t"))>0]
word_text = ''.join(word_text)
indentation_level = number_properties[0].xpath(".//w:ilvl//@w:val")
abstract_num_id = num_ele.xpath(".//w:num[@w:numId="+number_id[0]+"]//w:abstractNumId//@w:val")[0]
abstract_num = num_ele.xpath(".//w:abstractNum[@w:abstractNumId="+abstract_num_id+"]")[0]
word_level = abstract_num.xpath(".//w:lvl[@w:ilvl="+indentation_level[0]+"]", namespaces={'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})[0]
number_format = word_level.xpath(".//w:numFmt//@w:val", namespaces={'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})[0]
After extracting the data and saved to database.
When it is required i am generating the document again back using the python-docx module and for bulleted points i have made an observation that python-docx is having a different format compared to the MS word actual format for the numbering.xml
the basic difference i have identified are as below
python-docx numbering.xml
MS Word numbering.xml
the difference in the indentation levels of the document are different, we can observe that in the following document.xml files
word generated document.xml
python-docx generated document.xml
we can observe that python-docx always maintains w:ilvl in the document.xml as '0' and updated the numbering.xml accordingly but the challenge is due to difference in the format i have following queries.
Questions:
could you please suggest what are the possible ways i can maintain similar implementation for both types.
Or give me an idea how can i make a difference b/w word edited and python-docx edited/modify document .. i tried custom tags but not worked because MS Word is removing automatically