<a href="https://colab.research.google.com/github/rainbirddigital/mammoth/blob/main/Mammoth_word_docx_to_html_github.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mammoth: Word .docx to HTML Conversion

* By default, [Mammoth](https://github.com/mwilliamson/python-mammoth) does not convert blockquotes or captions, so these must be marked with custom styles in Word. Apply the Word paragraph style **Quote** to blockquotes and **Caption** to image captions.
* If you need further custom Word styles mapped to the exported HTML, add them in the **'Convert File'** cell below. Consult the [Mammoth documentation](https://github.com/mwilliamson/python-mammoth#writing-style-maps) for more information.
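
Each style-map line pairs a Word matcher with an HTML target in the form `word-style => html-style`. As a minimal sketch, the helper below assembles such a string from a dictionary. `build_style_map` and the **Pull Quote** style are hypothetical names used for illustration only; they are not part of Mammoth or of this notebook.

```python
def build_style_map(mappings):
    """Join {word matcher: html path} pairs into a style-map string."""
    lines = [f"{matcher} => {html_path}" for matcher, html_path in mappings.items()]
    return "\n".join(lines)


style_map = build_style_map({
    "p[style-name='Quote']": "blockquote.quote:fresh",
    "p[style-name='Caption']": "figcaption.caption:fresh",
    # Hypothetical extra Word paragraph style, shown only as an example.
    "p[style-name='Pull Quote']": "aside.pull-quote:fresh",
})
print(style_map)
```

The resulting string could be passed as the `style_map` argument in the **'Convert File'** cell in place of the hand-written one.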



## 1. Add Your Google Drive to Colab

* Run the cell below to connect your Google Drive to Colab.
* Google will ask for permission to access your account and then supply you with a code. Enter the code into the input box below, then press Return/Enter.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 2. Install Mammoth, Import Files and Settings
* The cell below installs Mammoth, creates a **mammoth** directory on your Google Drive, and imports the header/footer files and settings.
* If you have previously installed the files on your Google Drive, the installation is skipped and only the settings are applied.
* If you need custom styles applied to your exported HTML, edit them in the **header.htm** file.
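
The "skip if already installed" behaviour can be sketched with the standard library alone. The function name `ensure_workspace` is an assumption for illustration, and the example runs against a temporary folder rather than a real Drive mount; it is not the code the cell below actually uses.

```python
import os
import tempfile


def ensure_workspace(base_dir):
    """Return (input_dir, output_dir), creating the folders only if they are missing."""
    input_dir = os.path.join(base_dir, "mammoth")
    output_dir = os.path.join(input_dir, "exported-html")
    os.makedirs(output_dir, exist_ok=True)  # no error when the folder already exists
    return input_dir, output_dir


# Demonstrated against a temporary folder instead of /content/drive/MyDrive.
with tempfile.TemporaryDirectory() as base:
    first = ensure_workspace(base)
    second = ensure_workspace(base)  # second call finds the folders and changes nothing
    workspace_exists = os.path.isdir(first[1])

print(first == second, workspace_exists)
```

Because `os.makedirs(..., exist_ok=True)` is safe to call repeatedly, rerunning a setup step written this way does not fail on an existing installation.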

In [None]:
#@title Install Mammoth and settings
!pip install mammoth
header_filename = "header.htm"
footer_filename = "footer.htm"
input_directory = "/content/drive/MyDrive/mammoth/"
output_directory = "/content/drive/MyDrive/mammoth/exported-html/"

# Create Directories and Import Files
import os
if os.path.isdir("/content/drive/MyDrive/mammoth"):
  %cd "/content/drive/MyDrive/mammoth" 
elif os.path.isdir("/content/drive/"):
  %cd "/content/drive/MyDrive/"
  !git clone https://github.com/rainbirddigital/mammoth
  %cd mammoth
  !mkdir exported-html
elif os.path.isdir("/content/mammoth"):
  %cd "/content/mammoth/"
  input_directory = "/content/mammoth/"
  output_directory = "/content/mammoth/exported-html/"
else:
  !git clone https://github.com/rainbirddigital/mammoth
  %cd mammoth
  !mkdir exported-html
  input_directory = "/content/mammoth/"
  output_directory = "/content/mammoth/exported-html/"

# Make sure the export folder exists, whichever branch ran above
os.makedirs(output_directory, exist_ok=True)

# Read the header and footer templates (newlines are stripped)
%cd $input_directory
with open(header_filename, 'r') as file:
  header = file.read().replace('\n', '')
with open(footer_filename, 'r') as file:
  footer = file.read().replace('\n', '')

## 3. Convert .docx to .html
* Add your .docx files to the **mammoth** directory on your Google Drive.
* Enter the filename of your .docx document in the cell below.
* Run the cell to convert the document to HTML.
* To convert additional files, enter a new .docx filename and run the cell again.
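
When many documents need converting, it can help to list the input/output pairs up front instead of typing each filename. The sketch below uses only the standard library; `plan_conversions` is a hypothetical helper, and the example files are empty placeholders created in a temporary folder.

```python
import tempfile
from pathlib import Path


def plan_conversions(input_dir, output_dir):
    """Pair each .docx in input_dir with the .html path it would be exported to."""
    pairs = []
    for docx_path in sorted(Path(input_dir).glob("*.docx")):
        if docx_path.name.startswith("~$"):  # skip Word's temporary lock files
            continue
        pairs.append((docx_path, Path(output_dir) / (docx_path.stem + ".html")))
    return pairs


# Demonstrated with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.docx", "a.docx", "~$a.docx", "notes.txt"]:
        (Path(tmp) / name).touch()
    planned = [
        (src.name, dst.name)
        for src, dst in plan_conversions(tmp, Path(tmp) / "exported-html")
    ]

print(planned)
```

Each source path in the list could then be opened in binary mode and converted in the same way as the single file in the **'Convert File'** cell.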

In [None]:
#@title Convert File
docx_filename = "mammoth-demo.docx" #@param {type:"string"}

# Derive the output filename, e.g. "mammoth-demo.docx" -> "mammoth-demo.html"
output_filename = docx_filename.rsplit(".", 1)[0] + ".html"
%cd $input_directory
import mammoth
# Map custom styles here. Format: word-style => html-style.optional-css-class
style_map = """
u => u.underlined
i => em
table => div.tableWrap > table
p[style-name='Quote'] => blockquote.quote:fresh
p[style-name='Caption'] => figcaption.caption:fresh
"""
# Convert the file and save the HTML
with open(docx_filename, "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=style_map)
html = result.value
# result.messages holds any messages, such as warnings raised during conversion
# print(result.messages)
%cd $output_directory
# Write as UTF-8 so non-ASCII characters from the document are preserved
with open(output_filename, "w", encoding="utf-8") as file:
    file.write(header + html + footer)
    # file.write(html)  # Alternative: write the HTML without the header/footer