<a href="https://colab.research.google.com/github/stancsz/notebook-scripts/blob/main/docx_to_markdown_converter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This script, docx_to_markdown_converter.py, automates the conversion of Microsoft Word documents (.docx) into Markdown (.md). It uses the python-docx library to read the Word document and passes the text of each paragraph through the markdownify library, an HTML-to-Markdown converter. The paragraphs are then joined and written out as a single Markdown file.

The user provides the path to the Word document and the desired output folder. The script creates a Markdown file named output.md in that folder, creating the folder first if it does not exist. This tool is useful for getting document text into a form suitable for web publishing, GitHub repositories, or other platforms where Markdown is the preferred format.

Note that the script reads only the plain text of each body paragraph (para.text). Word formatting such as heading styles, bold and italic runs, and list numbering is not carried over, and content outside body paragraphs, such as tables and images, is not included in the output. These elements need separate handling or manual cleanup after conversion.
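
Carrying some formatting across means looking at each paragraph's style and runs rather than only its text. The following is a minimal sketch of that idea, not part of the original script. The helper names `run_to_markdown`, `paragraph_to_markdown`, and `docx_to_markdown` are hypothetical, and the sketch assumes the document uses Word's built-in style names ("Heading 1", "List Bullet"), which many real documents do not.

```python
def run_to_markdown(run):
    """Wrap one run's text in Markdown emphasis markers.

    `run` only needs .text, .bold and .italic, which python-docx Run
    objects provide.
    """
    text = run.text
    core = text.strip()
    if not core:
        return text  # whitespace-only runs pass through unchanged
    # python-docx reports bold/italic as True, False, or None (inherited
    # from the style); this sketch treats None like False.
    marker = ("**" if run.bold else "") + ("*" if run.italic else "")
    # Keep surrounding whitespace outside the markers so emphasis renders.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    return f"{lead}{marker}{core}{marker}{trail}"


def paragraph_to_markdown(para):
    """Convert a paragraph-like object (.style.name, .runs) to Markdown."""
    body = "".join(run_to_markdown(run) for run in para.runs)
    style_name = para.style.name or ""
    if style_name.startswith("Heading "):
        level_text = style_name[len("Heading "):]
        if level_text.isdigit():
            # Markdown supports heading levels 1 to 6 only.
            return "#" * min(int(level_text), 6) + " " + body
    if style_name.startswith("List Bullet"):
        return "- " + body
    return body


def docx_to_markdown(docx_path):
    """Read a .docx file and return Markdown text (requires python-docx)."""
    # Imported here so the helpers above can be used without python-docx.
    from docx import Document

    doc = Document(docx_path)
    lines = [paragraph_to_markdown(p) for p in doc.paragraphs]
    # Blank lines separate paragraphs in Markdown; skip empty paragraphs.
    return "\n\n".join(line for line in lines if line.strip())
```

The sketch leaves several things out: adjacent runs with the same formatting are not merged, Markdown special characters in the text are not escaped, and numbered lists, tables, and images are still ignored.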

In [1]:
!pip install python-docx markdownify

Collecting python-docx
  Downloading python_docx-1.1.0-py3-none-any.whl (239 kB)
Collecting markdownify
  Downloading markdownify-0.11.6-py3-none-any.whl (16 kB)
Installing collected packages: python-docx, markdownify
Successfully installed markdownify-0.11.6 python-docx-1.1.0


In [None]:
import os
from docx import Document
from markdownify import markdownify as md

def convert_docx_to_md(docx_path, output_folder):
    # Read the Word document
    doc = Document(docx_path)
    full_text = []

    # Convert each paragraph in the document to markdown
    for para in doc.paragraphs:
        full_text.append(md(para.text))

    # Join paragraphs with a blank line so Markdown renders them as separate paragraphs
    md_text = '\n\n'.join(full_text)

    # Create the output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Define the output file path
    output_path = os.path.join(output_folder, 'output.md')

    # Write the markdown text to the output file
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(md_text)

    print(f"Markdown file saved to {output_path}")

# Example usage
# convert_docx_to_md('path_to_your_word_document.docx', 'path_to_output_folder')

In [None]:
convert_docx_to_md('path_to_your_word_document.docx', 'path_to_output_folder')
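
Since every call writes to output.md, converting several documents into the same folder overwrites earlier results. One possible way around this is to name each Markdown file after its source document. The sketch below assumes two hypothetical helpers, `markdown_path_for` and `find_docx_files`, which are not part of the original script and use only the standard library.

```python
from pathlib import Path


def markdown_path_for(docx_path, output_folder):
    """Return <output_folder>/<docx stem>.md, creating the folder if needed."""
    folder = Path(output_folder)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / (Path(docx_path).stem + ".md")


def find_docx_files(input_folder):
    """List the .docx files in a folder, skipping Word's '~$' lock files."""
    return sorted(
        p for p in Path(input_folder).glob("*.docx")
        if not p.name.startswith("~$")
    )
```

The path returned by `markdown_path_for` could replace the hard-coded 'output.md' path inside convert_docx_to_md, and `find_docx_files` could drive a loop that converts a whole folder.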