# Mammoth: Word .docx to HTML Conversion

* Run each of the cells below in order
* [Mammoth Support documentation](https://github.com/mwilliamson/python-mammoth)
* v3 July 2023

## 1. Install Mammoth

In [5]:
!pip install mammoth
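
To confirm the install worked before moving on, a check along the following lines may help. It is only a sketch using the Python standard library. The helper name `installed_version` is hypothetical and not part of Mammoth.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("mammoth")
if version is None:
    print("mammoth is not installed; re-run the pip install cell above")
else:
    print("mammoth version:", version)
```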



## 2. Add Your Google Drive to Colab

* Run the cell below to connect your Google Drive to Colab
* Google will ask for permission to share your details and then supply you with a code. Enter the code into the input box below, then press Return/Enter

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 3. Import Header & Footer HTML Files

* Copy your header.htm, footer.htm and .docx files to a directory on Google Drive
* Add the path to the Google Drive input directory that contains the header, footer and Word .docx files
* Run the cell below

In [None]:
input_directory = "/content/drive/MyDrive/" #@param {type:"string"}
header_filename = "header.htm" #@param {type:"string"}
footer_filename = "footer.htm" #@param {type:"string"}

# Read the header and footer files, stripping newlines so each becomes a single line
%cd $input_directory
with open(header_filename, 'r', encoding='utf-8') as file:   # The header filename
    ests_header = file.read().replace('\n', '')
with open(footer_filename, 'r', encoding='utf-8') as file:  # The footer filename
    ests_footer = file.read().replace('\n', '')
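
If the header or footer file is missing, the cell above fails with a bare `FileNotFoundError`. A slightly more defensive variant might look like the sketch below. It assumes the files are UTF-8 encoded, and the helper name `read_fragment` is hypothetical. The temporary directory only stands in for the Google Drive folder so the example can run anywhere.

```python
import tempfile
from pathlib import Path


def read_fragment(directory, filename):
    """Read an HTML fragment as UTF-8 and strip newlines, as the cell above does."""
    path = Path(directory) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected file not found: {path}")
    return path.read_text(encoding="utf-8").replace("\n", "")


# Demonstration with a throwaway directory in place of the Drive input directory
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "header.htm").write_text("<html>\n<body>\n", encoding="utf-8")
    print(read_fragment(demo_dir, "header.htm"))  # prints <html><body>
```

In the notebook itself, the equivalent calls would be `read_fragment(input_directory, header_filename)` and `read_fragment(input_directory, footer_filename)`.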

## 4. Convert .docx to .html
* Add the path to the Google Drive output directory to which you wish to export the HTML files
* Add the filename of your .docx document (excluding the path)
* Then run the cell below
* To convert additional files, enter a new .docx filename and run cell 4 again
* If you need custom Word styles mapped to the exported HTML, add them to the style map in the code below. Consult the [Mammoth documentation](https://github.com/mwilliamson/python-mammoth#writing-style-maps) for more information
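
A style map is plain text with one `word-style => html-style` rule per line. When there are many custom styles, it can be easier to keep them in a dictionary and join them into that format. The following is a minimal sketch; `build_style_map` and `example_style_map` are hypothetical names, not part of Mammoth.

```python
def build_style_map(mappings):
    """Join {word_matcher: html_path} pairs into one 'matcher => path' rule per line."""
    return "\n".join(
        f"{matcher} => {html_path}" for matcher, html_path in mappings.items()
    )


custom_styles = {
    "p[style-name='Quote']": "blockquote.quote:fresh",
    "r[style-name='sourceSans']": "span.sourceSans:fresh",
}
example_style_map = build_style_map(custom_styles)
print(example_style_map)
```

The resulting string can be passed as the `style_map` argument in the same way as the triple-quoted string in the cell below.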

In [None]:
output_directory = "/content/drive/MyDrive/" #@param {type:"string"}
docx_filename = "mammoth-demo.docx" #@param {type:"string"}

# Derive the output filename by swapping the .docx extension for .html
output_filename = docx_filename.rsplit(".", 1)[0] + ".html"
%cd $input_directory
import mammoth
# Map custom styles here. Format: word-style => html-style.optional-css-class
style_map = """
u => u.underlined
i => em
table => div.tableWrap > table
p[style-name='Quote'] => blockquote.quote:fresh
p[style-name='imageCaption'] => figcaption.imageCaption:fresh
r[style-name='sourceSans'] => span.sourceSans:fresh
p[style-name='headerAuthors'] => div.headerAuthors
r[style-name='columnBreak'] => p
p[style-name='copyrightMeta'] => p.copyrightMeta
p[style-name='bibloReference'] => p.bibloReference:fresh
"""
# Convert the file and save the HTML
with open(docx_filename, "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=style_map)
html = result.value
# Uncomment to print any messages, such as warnings raised during conversion
#print(result.messages)
%cd $output_directory
with open(output_filename, "w", encoding="utf-8") as file:
    file.write(ests_header + html + ests_footer)
    #file.write(html)  # Alternative: write the converted body without header and footer
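
Re-running the cell for each document works for a handful of files. For a whole folder, a loop along these lines may be more convenient. This is a sketch rather than part of the original notebook: `convert_all` and `mammoth_convert` are hypothetical helpers, and only `mammoth.convert_to_html`, `result.value` and `result.messages` come from Mammoth itself.

```python
from pathlib import Path


def convert_all(input_dir, output_dir, convert, header="", footer=""):
    """Convert every .docx in input_dir and write header + body + footer to output_dir.

    `convert` is any callable that takes a binary file object and returns an HTML string.
    Returns the names of the files written, in alphabetical order.
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    written = []
    for docx_path in sorted(Path(input_dir).glob("*.docx")):
        with docx_path.open("rb") as docx_file:
            html = convert(docx_file)
        target = output_path / (docx_path.stem + ".html")
        target.write_text(header + html + footer, encoding="utf-8")
        written.append(target.name)
    return written


def mammoth_convert(docx_file, style_map=""):
    """Run Mammoth on one file, print any conversion messages, and return the HTML."""
    import mammoth  # imported here so convert_all can be used without Mammoth installed

    result = mammoth.convert_to_html(docx_file, style_map=style_map)
    for message in result.messages:
        print(message)
    return result.value
```

In the notebook, a call such as `convert_all(input_directory, output_directory, lambda f: mammoth_convert(f, style_map), ests_header, ests_footer)` would reuse the variables defined in the cells above.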