syohex/org-protocol-capture-html

 
 

org-protocol-capture-html

org-protocol is awesome, but browsers do a pretty poor job of turning a page’s HTML content into plain text. Pandoc, however, can convert HTML to Org-mode, so we can use it to turn captured HTML into Org-mode content. It can even turn HTML tables into Org tables!

Screenshot

Here’s an example of what you get in Emacs from capturing this page:

screenshot.png

Requirements

  • org-protocol: This is what connects org-mode to the “outside world” using a MIME protocol handler. The instructions on the org-protocol page are a bit out of date, so you might want to try the org-protocol Instructions in the Appendix instead.
  • Pandoc: Version 1.8 or later is required.
  • python-readability: This is optional but very handy. It lets you automatically capture just the article or main content from a page.
  • The shell script uses curl to download URLs (if you use it in that mode).

Installation

Emacs

Put org-protocol-capture-html.el in your load-path and add to your init file:

(require 'org-protocol-capture-html)

org-capture Template

You need a suitable org-capture template. I recommend the one below. Whatever you choose, the default selection key is w, so if you want to use a different key, you’ll need to modify the script and the bookmarklets.

("w" "Web site" entry
  (file "")
  "* %a :website:\n\n%U %?\n\n%:initial")

Bookmarklets

Now you need to make a bookmarklet in your browser(s) of choice. You can select text in the page before you capture and it will be copied into the template, or you can just capture the page title and URL. The selection is read by the selection-grabbing function described in the Appendix.

Note: The w in the URL in these bookmarklets chooses the corresponding capture template. You can leave it out if you want to be prompted for the template, or change it to another letter for a different template key.
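To make the URL layout concrete, here is a minimal sketch of how the Firefox-style bookmarklets assemble it. The function name buildCaptureUrl and the example.com values are hypothetical and not part of this project; only the URL shape is taken from the bookmarklets.

```javascript
// Hypothetical helper (not part of this project): builds a capture URL in the
// same shape as the Firefox bookmarklets.
//   handler     - "capture-html" or "capture-readability"
//   templateKey - org-capture template key, e.g. "w"
function buildCaptureUrl(handler, templateKey, pageUrl, title, body) {
    // Each field is percent-encoded so that a "/" inside it cannot be
    // mistaken for a separator between fields.
    var fields = [pageUrl, title, body || ""].map(function (field) {
        return encodeURIComponent(field);
    });
    return "org-protocol://" + handler + "://" + templateKey + "/" + fields.join("/");
}

console.log(buildCaptureUrl("capture-html", "w", "https://example.com/a b", "Title", "<p>Hi</p>"));
// org-protocol://capture-html://w/https%3A%2F%2Fexample.com%2Fa%20b/Title/%3Cp%3EHi%3C%2Fp%3E
```

Passing a different templateKey corresponds to changing the w in the bookmarklet URL.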

Firefox

This bookmarklet captures what is currently selected in the browser or, if nothing is selected, just the page’s URL and title.

javascript:location.href = 'org-protocol://capture-html://w/' + encodeURIComponent(location.href) + '/' + encodeURIComponent(document.title) + '/' + encodeURIComponent(function () {var html = ""; if (typeof document.getSelection != "undefined") {var sel = document.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}());

This one uses python-readability to capture the article or main content of the page.

javascript:location.href = 'org-protocol://capture-readability://w/' + encodeURIComponent(location.href) + '/' + encodeURIComponent(document.title) + '/';

Note: When you click on one of these bookmarklets for the first time, Firefox will ask what program to use to handle the org-protocol protocol. You can simply choose the default program that appears (org-protocol).

Pentadactyl

If you use Pentadactyl, you can use the Firefox bookmarklets above, or you can put these commands in your .pentadactylrc:

map -modes=n,v ch -javascript content.location.href = 'org-protocol://capture-html://w/' + encodeURIComponent(content.location.href) + '/' + encodeURIComponent(content.document.title) + '/' + encodeURIComponent(function () {var html = ""; if (typeof content.document.getSelection != "undefined") {var sel = content.document.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = content.document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = content.document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}());
map -modes=n,v cr -javascript content.location.href = 'org-protocol://capture-readability://w/' + encodeURIComponent(content.location.href) + '/' + encodeURIComponent(content.document.title) + '/';

Note: The JavaScript objects are slightly different for running as Pentadactyl commands since it has its own chrome.

Chrome

These bookmarklets work in Chrome:

javascript:location.href = 'org-protocol:///capture-html:///w/' + encodeURIComponent(location.href) + '/' + encodeURIComponent(document.title) + '/' + encodeURIComponent(function () {var html = ""; if (typeof window.getSelection != "undefined") {var sel = window.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}());
javascript:location.href = 'org-protocol:///capture-readability:///w/' + encodeURIComponent(location.href) + '/' + encodeURIComponent(document.title) + '/';

Note: The first two sets of slashes are tripled compared to the Firefox bookmarklets. When testing with Chrome, I found that xdg-open was collapsing the double slashes into single slashes, which breaks org-protocol. I’m not sure why the extra slashes don’t seem to be necessary for Firefox. If you have any trouble with this, you might try removing the extra slashes.
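As an illustration of the difference between the two URL forms, the following sketch converts a Firefox-style capture URL into the tripled-slash form used by the Chrome bookmarklets. The function name withTripledSlashes is hypothetical. It relies on the payload fields being percent-encoded, so the only literal "://" sequences are the two separators.

```javascript
// Hypothetical helper: turns "org-protocol://capture-html://w/..." into
// "org-protocol:///capture-html:///w/...". The lookahead skips separators
// that are already tripled, so applying it twice changes nothing.
function withTripledSlashes(captureUrl) {
    return captureUrl.replace(/:\/\/(?!\/)/g, ":///");
}

console.log(withTripledSlashes("org-protocol://capture-html://w/https%3A%2F%2Fexample.com%2F/Title/"));
```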

Shell script

The shell script is handy for piping any HTML (or plain-text) content to Org through the shell, or downloading and capturing any URL directly (without a browser), but it’s not required. You can use it like this:

org-protocol-capture-html.sh [OPTIONS] [HTML]
cat html | org-protocol-capture-html.sh [OPTIONS]

Send HTML to Emacs through org-protocol, passing it through Pandoc to
convert HTML to Org-mode.  HTML may be passed as an argument or
through STDIN.  If only URL is given, it will be downloaded and its
contents used.

Options:
    -h, --heading HEADING     Heading
    -r, --readability         Capture web page article with python-readability
    -t, --template TEMPLATE   org-capture template key (default: w)
    -u, --url URL             URL

    --debug  Print debug info
    --help   I need somebody!

Usage

After installing the bookmarklets, you can select some text on a web page with your mouse, click the bookmarklet in the browser, and Emacs should pop up an Org capture buffer. You can also do it without selecting text first, if you just want to capture a link to the page.

You can also pass data through the shell script, for example:

dmesg | grep -i sata | org-protocol-capture-html.sh --heading "dmesg SATA messages" --template i

org-protocol-capture-html.sh --readability --url "https://lwn.net/Articles/615220/"

org-protocol-capture-html.sh -h "TODO Feed the cat!" -t i "He gets grouchy if I forget!"

Changelog

<2016-04-03 Sun>

<2016-03-23 Wed>

  • Add URL downloading to the shell script. Now you can run org-protocol-capture-html.sh -u http://example.com and it will download and capture the page.
  • Add org-capture template to the readme. This will make it much easier for new users.

Appendix

org-protocol Instructions

1. Add protocol handler

Create the file ~/.local/share/applications/org-protocol.desktop containing:

[Desktop Entry]
Name=org-protocol
Exec=emacsclient %u
Type=Application
Terminal=false
Categories=System;
MimeType=x-scheme-handler/org-protocol;

Note: Each line’s key must be capitalized exactly as displayed, or it will be an invalid .desktop file.

Then update ~/.local/share/applications/mimeinfo.cache by running:

  • On KDE: kbuildsycoca4
  • On GNOME: update-desktop-database ~/.local/share/applications/

2. Configure Emacs

Init file

Add to your Emacs init file:

(server-start) 
(require 'org-protocol)

Capture template

You’ll probably want to add a capture template something like this:

("w" "Web site"
 entry (file+olp "~/org/inbox.org" "Web")
 "* %c :website:\n%U %?%:initial")

Note: Using %:initial instead of %i seems to handle multi-line content better.

This will result in a capture like this:

* [[http://orgmode.org/worg/org-contrib/org-protocol.html][org-protocol.el – Intercept calls from emacsclient to trigger custom actions]] :website:
[2015-09-29 Tue 11:09] About org-protocol.el org-protocol.el is based on code and ideas from org-annotation-helper.el and org-browser-url.el.

3. Configure Firefox

On some versions of Firefox, it may be necessary to add this setting. You may skip this step and come back to it if you get an error saying that Firefox doesn’t know how to handle org-protocol links.

Open about:config and create a new boolean value named network.protocol-handler.expose.org-protocol and set it to true.

Note: If you do skip this step, and you do encounter the error, Firefox may replace all open tabs in the window with the error message, making it difficult or impossible to recover those tabs. It’s best to use a new window with a throwaway tab to test this setup until you know it’s working.

Selection-grabbing function

This function gets the HTML from the browser’s selection. It’s from an answer on StackOverflow.

function () {
    var html = "";

    if (typeof content.document.getSelection != "undefined") {
        var sel = content.document.getSelection();
        if (sel.rangeCount) {
            var container = document.createElement("div");
            for (var i = 0, len = sel.rangeCount; i < len; ++i) {
                container.appendChild(sel.getRangeAt(i).cloneContents());
            }
            html = container.innerHTML;
        }
    } else if (typeof document.selection != "undefined") {
        if (document.selection.type == "Text") {
            html = document.selection.createRange().htmlText;
        }
    }

    var relToAbs = function (href) {
        var a = content.document.createElement("a");
        a.href = href;
        var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash;
        a.remove();
        return abs;
    };
    var elementTypes = [
        ['a', 'href'],
        ['img', 'src']
    ];

    var div = content.document.createElement('div');
    div.innerHTML = html;

    elementTypes.map(function(elementType) {
        var elements = div.getElementsByTagName(elementType[0]);
        for (var i = 0; i < elements.length; i++) {
            elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));
        }
    });
    return div.innerHTML;
}

Here’s a one-line version of it, better for pasting into bookmarklets and such:

function () {var html = ""; if (typeof content.document.getSelection != "undefined") {var sel = content.document.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = content.document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = content.document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}
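The relToAbs helper above resolves relative links by assigning them to a temporary anchor element, which only works inside a browser. As a rough sketch of the same idea outside the browser, the standard URL class can resolve a link against an explicit base. This is a swapped-in technique for illustration, not the method the bookmarklets use, and the example.com addresses are made up.

```javascript
// Sketch only: resolves href against an explicit base URL with the standard
// URL class, instead of the temporary <a> element used by the function above.
function relToAbs(href, base) {
    return new URL(href, base).href;
}

var base = "https://example.com/docs/page.html";
console.log(relToAbs("../img/logo.png", base)); // https://example.com/img/logo.png
console.log(relToAbs("#top", base));            // https://example.com/docs/page.html#top
```

Links that are already absolute pass through unchanged, so captured content keeps working links once it is detached from the original page.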

To-Do

Handle long chunks of HTML

If you try to capture too long a chunk of HTML, it will fail with “argument list too long” errors from emacsclient. Working around this will require capturing via STDIN instead of arguments. Since org-protocol is based on using URLs, this will probably require a shell script and a new Emacs function, and perhaps another MIME protocol handler. Even then, it might still run into problems, because the data is passed to the shell script as an argument in the protocol handler. Working around that would probably require a method not based on a protocol handler, using a browser extension to send the HTML directly via STDIN. This might be possible with Pentadactyl instead of making an entirely new browser extension. Also, maybe the Org-mode Capture Firefox extension could be extended (…) to do this.

However, most of the time, this is not a problem.
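For the length limit described above, a rough check is to measure the capture URL before handing it off. The sketch below is not part of this project. The 131072-byte figure is an assumption about the per-argument limit on typical Linux systems, so treat it as a placeholder and pass your own value if your system differs.

```javascript
// Assumption: 131072 bytes is the per-argument limit on typical Linux systems.
// The real limit depends on the operating system.
var ASSUMED_MAX_ARG_BYTES = 131072;

// Counts UTF-8 bytes; a percent-encoded capture URL is plain ASCII, so this
// normally equals its length in characters.
function captureUrlByteLength(url) {
    return new TextEncoder().encode(url).length;
}

// Returns true when the URL should fit into a single command-line argument.
function fitsInOneArgument(url, maxBytes) {
    return captureUrlByteLength(url) < (maxBytes || ASSUMED_MAX_ARG_BYTES);
}

console.log(fitsInOneArgument("org-protocol://capture-html://w/https%3A%2F%2Fexample.com%2F/Title/"));
```

A bookmarklet could use a check like this to fall back to capturing only the URL and title when the selection is too large.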

Package for MELPA

This would be nice.
