An HTML to Markdown converter

unmarked

An HTML to Markdown converter, written in JavaScript.

Demo

I built this library while adding web clipping support to Quiver: The Programmer's Notebook.

Design

Unmarked is really a DOM-to-Markdown converter. To feed it an HTML string, you first need to create a DOM element from that string, using the browser, jQuery, or jsdom on Node.js.

By converting an HTML string into a DOM element first, unmarked avoids the pitfalls of a regex-based approach. Since mapping a tree structure to plain text is strictly predictable, unmarked doesn't choke on complex webpages. In the worst case, some nodes won't have perfect Markdown output, but the converted result is always clean and readable.
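To illustrate why walking a tree is predictable, here is a minimal sketch of a recursive converter. It is not unmarked's actual implementation: the plain-object node shape (`tag`, `text`, `children`) and the names `toMarkdown` and `converters` are hypothetical. Tags without a converter fall back to the converted content of their children, which mirrors the worst case described above.

```javascript
// Hypothetical node shape: { tag: 'p', children: [...] } or { text: '...' }.
// A real DOM node would expose nodeName and childNodes instead.
var converters = {
  p: function (content) { return content + '\n\n'; },
  strong: function (content) { return '**' + content + '**'; },
  em: function (content) { return '*' + content + '*'; }
};

function toMarkdown(node) {
  // Text nodes map directly to their text.
  if (typeof node.text === 'string') {
    return node.text;
  }
  // Convert the children first, then wrap the result for this tag.
  var content = (node.children || []).map(toMarkdown).join('');
  var known = Object.prototype.hasOwnProperty.call(converters, node.tag);
  // Unknown tags degrade to the plain content of their children.
  return known ? converters[node.tag](content) : content;
}

var tree = {
  tag: 'p',
  children: [
    { text: 'Hello ' },
    { tag: 'strong', children: [{ text: 'world' }] },
    { tag: 'blink', children: [{ text: '!' }] }
  ]
};

console.log(JSON.stringify(toMarkdown(tree)));
```

Because each node is handled by exactly one function and children are converted before their parent, malformed or unexpected markup can only affect the output of its own subtree.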

Unmarked does NOT aim to create a perfect Markdown representation of the input HTML; that would be an unrealistic goal. Converting HTML to Markdown is always lossy, since many HTML tags have no Markdown equivalent. Although HTML is allowed inside a Markdown document, that's almost never what users want from a converter like this.

The goal of unmarked is to create a clean and readable Markdown document from the input HTML. The conversion is lossy: some tags may be dropped, and whitespace and styles may change, but the text content and basic formatting should be preserved. No HTML tags should remain in the converted Markdown document, and there should be no extra spaces or line breaks. Nested lists should be indented and numbered correctly.
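As a rough sketch of what "indented and numbered correctly" means for nested lists, the function below renders a mixed ordered/unordered list. The list shape (`ordered`, `items`, `sublist`) and the name `renderList` are hypothetical and only serve the illustration; unmarked works on DOM nodes and its own logic may differ. The example uses an indent of four spaces rather than unmarked's default `tabSize` of 2.

```javascript
// Hypothetical list shape: { ordered: boolean, items: [{ text, sublist? }] }.
function renderList(list, indent, tabSize) {
  var lines = [];
  list.items.forEach(function (item, i) {
    // Ordered lists are numbered per nesting level, starting from 1.
    var marker = list.ordered ? (i + 1) + '.' : '*';
    lines.push(indent + marker + ' ' + item.text);
    if (item.sublist) {
      // Children are indented one level deeper than their parent item.
      var childIndent = indent + new Array(tabSize + 1).join(' ');
      lines.push(renderList(item.sublist, childIndent, tabSize));
    }
  });
  return lines.join('\n');
}

var list = {
  ordered: true,
  items: [
    { text: 'Install' },
    {
      text: 'Configure',
      sublist: {
        ordered: false,
        items: [{ text: 'Options' }, { text: 'Converters' }]
      }
    },
    { text: 'Run' }
  ]
};

console.log(renderList(list, '', 4));
```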

Features

  • A DOM to Markdown converter.
  • Does not choke on complex pages. It will give you a plain-text Markdown document even in the worst case.
  • Properly handles tricky cases such as deeply nested lists, lists inside blockquotes, paragraphs inside lists, mixed list types, headers inside anchors.
  • An extensive test suite.
  • Supports options such as italic style, bold style, horizontal rule style, etc.
  • Supports GFM linebreaks, strikethrough, and fenced code blocks.
  • Supports custom converters.

Note: Currently no table support.
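To make the GFM features listed above concrete, the sketch below shows how fenced code blocks and strikethrough differ from plain Markdown output. The helpers `renderCodeBlock` and `renderStrikethrough` are hypothetical and not part of unmarked. The fallback behavior when GFM is off (indented code blocks, struck text left as plain text) is an assumption, not documented behavior.

```javascript
// Hypothetical helpers; the non-GFM fallbacks are assumptions.
function renderCodeBlock(code, gfm) {
  if (gfm) {
    // GFM fenced code block: three backticks before and after the code.
    var fence = '`'.repeat(3);
    return fence + '\n' + code + '\n' + fence;
  }
  // Plain Markdown: indent every line by four spaces.
  return code.split('\n').map(function (line) {
    return '    ' + line;
  }).join('\n');
}

function renderStrikethrough(text, gfm) {
  // GFM wraps struck text in double tildes; plain Markdown has no equivalent.
  return gfm ? '~~' + text + '~~' : text;
}

console.log(renderCodeBlock('var a = 1;\nvar b = 2;', false));
console.log(renderStrikethrough('obsolete', true));
```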

Usage

Convert a DOM element to Markdown:

var md = unmarked(el, options);

Convert an HTML string to Markdown:

var md = unmarked(html, options);

See below for all the supported options.

Use in Browser

Unmarked itself doesn't have any dependencies. To use it, include the script in your page's head. It exposes a global variable unmarked.

<script src="unmarked.js"></script>

Unmarked takes a DOM element or an HTML string as the input.

var el = document.getElementById("content");
var md = unmarked(el); // from a DOM element

var html = "<h1>Title</h1><p>Some text.</p>";
md = unmarked(html); // from an HTML string

Use on Node.js

Install jsdom via npm. Then:

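var unmarked = require("unmarked");
var jsdom = require("jsdom").jsdom; // legacy jsdom API (before jsdom 10)

var html = "<h1>Title</h1><p>Some text.</p>";
var doc = jsdom(html); // parse the HTML string into a document
var md = unmarked(doc.documentElement);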

Options

Example setting options:

unmarked.setOptions({
  hrStyle: '* * *',
  boldStyle: '__'
});

Default options:

unmarked.defaults = {
  gfm: false, // turn this on if you want strikethrough, GFM linebreaks, fenced code blocks, etc.
  trim: true, // trim whitespace from the beginning and end
  tabSize: 2,
  headerStyle: 'setext', // ['setext', 'atx']
  hrStyle: '---',
  italicStyle: '*',
  boldStyle: '**',
  ulStyle: '*',
  ignored: ['head', 'script', 'style', 'meta']
};
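The two `headerStyle` values correspond to the two heading syntaxes in Markdown. The sketch below shows the difference; `renderHeader` is a hypothetical helper, not unmarked's API. Setext syntax only exists for levels 1 and 2, so the fallback to ATX for deeper levels is an assumption about how a converter might handle them.

```javascript
// Hypothetical helper illustrating the 'setext' and 'atx' header styles.
function renderHeader(text, level, headerStyle) {
  // Setext headers are underlined with '=' (level 1) or '-' (level 2).
  if (headerStyle === 'setext' && level <= 2) {
    var underline = level === 1 ? '=' : '-';
    return text + '\n' + new Array(text.length + 1).join(underline);
  }
  // ATX headers (and, in this sketch, setext levels 3+) use leading '#'.
  return new Array(level + 1).join('#') + ' ' + text;
}

console.log(renderHeader('Usage', 1, 'setext'));
console.log(renderHeader('Usage', 2, 'setext'));
console.log(renderHeader('Usage', 3, 'atx'));
```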

Custom Converters

Add a custom converter for a tag:

unmarked.addConverter({
  tag: 'dl',
  converter: function(node, options, content, indent) {
    // node: the node being processed
    // options: unmarked options, see above
    // content: content of child nodes, already converted to Markdown
    // indent: indentation set by the parent node. Most of the time this is an empty string and can be ignored. But sometimes a child node needs to be indented, e.g., a paragraph inside a list item.
    return 'converted markdown';
  }
});

You can also pass tags (an array) to register one converter for multiple tags at once:

unmarked.addConverter({
  tags: ['dl', 'dd', 'dt'],
  converter: function(node, options, content, indent) {
    //...
  }
});

You can also override the default converter for a tag by providing a custom converter.
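As an illustration, a converter for definition terms and descriptions might look like the sketch below. It follows the `(node, options, content, indent)` signature documented above, but the output format (bold term, description on the next line) is only one possible choice. The sketch assumes `node.nodeName` is the upper-case tag name, as on DOM elements, and that the converter is responsible for prefixing `indent`. Stub nodes are used so the function can be tried without a DOM.

```javascript
// Sketch of a converter for <dt> and <dd>.
// Assumption: node.nodeName is the upper-case tag name, as on DOM elements.
function definitionConverter(node, options, content, indent) {
  var text = content.trim();
  if (node.nodeName === 'DT') {
    // Render the term in bold, using the configured bold delimiter.
    return indent + options.boldStyle + text + options.boldStyle + '\n';
  }
  // Render the description as its own paragraph under the term.
  return indent + text + '\n\n';
}

// Stub nodes and options, so the function can be exercised without a DOM.
var stubOptions = { boldStyle: '**' };
var term = definitionConverter({ nodeName: 'DT' }, stubOptions, ' Markdown ', '');
var description = definitionConverter({ nodeName: 'DD' }, stubOptions, 'A plain-text format.', '');
console.log(term + description);

// With unmarked loaded, the converter would be registered like this:
// unmarked.addConverter({ tags: ['dt', 'dd'], converter: definitionConverter });
```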

Run Tests

Make sure you have mocha installed. Then:

$ cd unmarked/
$ npm install --dev
$ mocha

Acknowledgement

Unmarked is based on the code from these projects, but largely rewritten.

License

Released under the MIT license.