Skip to content

Capture Bullet points #1134

@eswarthammana

Description

@eswarthammana

python = 10.x.x
Windows OS
python-docx module = 0.8.11

I am trying to read the bulleted data from the word document using document.xml and numbering.xml
below is the code in which checking for w:ilvl exists or not and w:del for deleted content.
then not considering the w:numId value == 0 because we don't have a equivalent in the numbering.xml
then i am capturing the bulleted data and also the type of bullet format from the numbering.xml

document = Document("\\path_to_file.docx")
num_ele = document.part.numbering_part.element
doc_ele = document._part._element
word_paragraphs = doc_ele.xpath(".//w:p[boolean(.//w:pPr//w:numPr//w:ilvl)][not(boolean(.//w:del))]")
for word_paragraph in word_paragraphs:
    paragraph_properties = word_paragraph.xpath(".//w:pPr")
    number_properties = paragraph_properties[0].xpath(".//w:numPr")
    number_id = number_properties[0].xpath(".//w:numId//@w:val")
    if number_id[0] != '0':
        word_runs = word_paragraph.xpath(".//w:r")
        word_text = [word_run.xpath(".//w:t")[0].text for word_run in word_runs if len(word_run.xpath(".//w:t"))>0]
        word_text = ''.join(word_text)
        indentation_level = number_properties[0].xpath(".//w:ilvl//@w:val")
        abstract_num_id = num_ele.xpath(".//w:num[@w:numId="+number_id[0]+"]//w:abstractNumId//@w:val")[0]
        abstract_num = num_ele.xpath(".//w:abstractNum[@w:abstractNumId="+abstract_num_id+"]")[0]
        word_level = abstract_num.xpath(".//w:lvl[@w:ilvl="+indentation_level[0]+"]", namespaces={'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})[0]
        number_format =  word_level.xpath(".//w:numFmt//@w:val", namespaces={'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})[0]

After extracting the data and saved to database.
When it is required i am generating the document again back using the python-docx module and for bulleted points i have made an observation that python-docx is having a different format compared to the MS word actual format for the numbering.xml

the basic difference i have identified are as below

python-docx numbering.xml

python-docx numbering.xml

MS Word numbering.xml

MS Word numbering.xml

the difference in the indentation levels of the document are different, we can observe that in the following document.xml files

word generated document.xml

word generated document.xml

python-docx generated document.xml

python-docx generated document.xml

we can observe that python-docx always maintains w:ilvl in the document.xml as '0' and updated the numbering.xml accordingly but the challenge is due to difference in the format i have following queries.

Questions:
could you please suggest what are the possible ways i can maintain similar implementation for both types.

Or give me an idea how can i make a difference b/w word edited and python-docx edited/modify document .. i tried custom tags but not worked because MS Word is removing automatically

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions