docconv

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.

Note for returning users: the Go import path for this package has been moved to code.sajari.com/docconv. Follow the installation instructions to check out a version of the code in the correct place.

Installation

If you haven't set up Go before, you first need to set a GOPATH (see https://golang.org/doc/code.html#GOPATH).

To fetch and build the code:

$ go get code.sajari.com/docconv/...

This will also build the command line tool docd into $GOPATH/bin (assumed to be in your PATH already).

Dependencies

tidy, wv, poppler-utils, unrtf, https://github.com/JalfResi/justext

Example installation of the dependencies (using apt-get; other systems will differ):

$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext

Optional Dependencies

To add image support to the docconv library, first install and build https://github.com/otiai10/gosseract. Then add -tags ocr to any go command when building or fetching docconv to include support for processing images:

$ go get -tags ocr code.sajari.com/docconv/...

This command may report errors on OSX, which you can fix by installing tesseract via brew:

$ brew install tesseract

docd tool

The docd tool can be run in one of three ways:

  1. a service on port 8888 (by default)

    Documents can be sent as a multipart POST request. The plain text (body) and meta information are then returned as a JSON object.

  2. a service exposed from within a Docker container

    This also runs as a service, but from within a Docker container. There are three build scripts: ./docd/debian.sh, ./docd/alpine.sh and ./docd/appengine.sh.

    The Debian version uses the Debian package repository, so dependency versions can vary between builds. The Alpine version uses a minimal Linux distribution to produce a container of about 40MB. It also locks the dependency versions for consistency, but may miss out on future updates. The appengine version is a flex-based custom runtime for Google Cloud.

  3. via the command line.

    A document can be passed as an argument, e.g.

    docd -input document.pdf

Optional Flags

  • "addr" - the bind address for the HTTP server, default is ":8888"
  • "log-level"
    • 0: errors & critical info
    • 1: includes 0 and logs each request as well
    • 2: includes 1 and logs the response payloads
  • "readability-length-low" - Sets the readability length low if the ?readability=1 parameter is set
  • "readability-length-high" - Sets the readability length high if the ?readability=1 parameter is set
  • "readability-stopwords-low" - Sets the readability stopwords low if the ?readability=1 parameter is set
  • "readability-stopwords-high" - Sets the readability stopwords high if the ?readability=1 parameter is set
  • "readability-max-link-density" - Sets the readability max link density if the ?readability=1 parameter is set
  • "readability-max-heading-distance" - Sets the readability max heading distance if the ?readability=1 parameter is set
  • "readability-use-classes" - Comma separated list of readability classes to use if the ?readability=1 parameter is set
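
To make the defaults above concrete, here is a minimal sketch of how flags named "addr" and "log-level" behave when parsed with Go's standard flag package. It is not docd's actual source: the config type and parseFlags function are hypothetical, and the default log level of 0 is an assumption, since the list above does not state one.

```go
package main

import (
	"flag"
	"fmt"
)

// config is a hypothetical holder for the two flags shown here.
type config struct {
	addr     string
	logLevel uint
}

// parseFlags parses args the way a tool with these two flags might.
// The ":8888" default comes from the list above; the log-level default
// of 0 is an assumption.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("docd", flag.ContinueOnError)
	fs.StringVar(&c.addr, "addr", ":8888", "bind address for the HTTP server")
	fs.UintVar(&c.logLevel, "log-level", 0, "0: errors, 1: also requests, 2: also response payloads")
	err := fs.Parse(args)
	return c, err
}

func main() {
	c, err := parseFlags([]string{"-addr", ":8000", "-log-level", "1"})
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Printf("addr=%s log-level=%d\n", c.addr, c.logLevel)
}
```

With no arguments, the sketch falls back to the ":8888" bind address.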

How to start the service

docd -log-level 0 # will only log errors & critical info

docd -addr :8000 -log-level 1 # will run on port 8000 and log each request as well

Example Usage (code)

Some basic code is shown below. Normally you would accept the file over HTTP or open it from the file system, but this should be enough to get you started.

Use case 1: run locally

Note: this assumes you have the dependencies installed.

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	res, err := docconv.ConvertPath("your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Use case 2: request over the network

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv/client"
)

func main() {
	// Create a new client, using the default endpoint (localhost:8888)
	c := client.New()

	res, err := client.ConvertPath(c, "your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}
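
If you would rather not depend on the client package, the JSON object returned by the service can be decoded with the standard library. This is a minimal sketch under stated assumptions: the JSON keys "body", "meta" and "error" are assumed rather than taken from this README, the convertResponse type and decodeResponse function are hypothetical, and the sample payload is invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// convertResponse mirrors the JSON object the service is described as
// returning. The key names are assumptions; verify them against a real reply.
type convertResponse struct {
	Body  string            `json:"body"`
	Meta  map[string]string `json:"meta"`
	Error string            `json:"error"`
}

// decodeResponse parses a reply and turns a non-empty error field into a Go error.
func decodeResponse(data []byte) (*convertResponse, error) {
	var r convertResponse
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	if r.Error != "" {
		return nil, fmt.Errorf("conversion failed: %s", r.Error)
	}
	return &r, nil
}

func main() {
	// Hypothetical payload, standing in for a real service reply.
	sample := []byte(`{"body":"Hello, world.","meta":{"Author":"Jane"}}`)
	r, err := decodeResponse(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("text:", r.Body)
	fmt.Println("author:", r.Meta["Author"])
}
```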