This Omeka plugin creates XML files from PDF files using pdftohtml. The XML is stored as a new file associated with the item.

Extract OCR (plugin for Omeka)


Omeka plugin that extracts OCR text from PDF files as XML, enabling full-text search within the BookReader plugin for Omeka.

See a demo in the Bibliothèque numérique de l'université Rennes 2 (France).


  • This plugin needs the pdftohtml command-line tool on your server:
    sudo apt-get install poppler-utils
  • Upload the Extract OCR plugin folder into your plugins folder on the server.
  • Alternatively, install the plugin via GitHub:
    cd omeka/plugins  
    git clone "ExtractOcr"
  • Activate it from the admin → Settings → Plugins page
  • If necessary, allow the upload of XML files in the Security Settings: Add xml to the Allowed File Extensions list and application/xml to the Allowed File Types list.
  • Click the Configure link to choose whether existing PDF files should be processed.
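
Before activating the plugin, it can help to confirm that pdftohtml is actually reachable on the server. The following is a minimal sketch and not part of the plugin; the helper name `have_tool` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the plugin): succeed if the named
# command can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether the pdftohtml dependency is available.
if have_tool pdftohtml; then
    echo "pdftohtml found: $(command -v pdftohtml)"
else
    echo "pdftohtml is missing; install poppler-utils first"
fi
```

Note that the web server may run with a different PATH than your login shell, so a successful check here is a good sign rather than a guarantee.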

Using the Extract OCR Plugin

  • Create an item
  • Add PDF file(s) to this item
  • Save Item
  • To locate the extracted OCR XML file, select the item to which the PDF is attached. You should normally see an XML file attached to the record, with the same filename as the PDF file.
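
The naming convention described above can be reproduced by hand when checking a file outside Omeka. This is a rough sketch under stated assumptions: the helper names `xml_name_for` and `extract_xml` are hypothetical, and the pdftohtml options shown (`-i` to ignore images, `-xml` for XML output) are not necessarily the exact ones the plugin passes.

```shell
#!/bin/sh
# Hypothetical helper: derive the XML filename expected for a given PDF
# (same name, with the last extension replaced by .xml).
xml_name_for() {
    printf '%s.xml\n' "${1%.*}"
}

# Hypothetical helper: convert one PDF by hand. With -xml, pdftohtml is
# expected to append .xml to the output base name given as last argument.
extract_xml() {
    pdftohtml -i -xml "$1" "${1%.*}"
}

xml_name_for "report.pdf"   # prints report.xml
```

Running `extract_xml report.pdf` on a machine with poppler-utils installed should then leave a `report.xml` next to the PDF, which can be compared with the file attached to the item.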

Optional plugins

  • BookReader: This plugin adds the Internet Archive BookReader to Omeka. If both plugins (BookReader and ExtractOcr) are installed, full-text search is available within the BookReader frame. To enable it, overwrite Bookreader/libraries/BookReaderCustom.php with Bookreader/libraries/BookReaderCustom_extractOCR.php
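
The overwrite step can be scripted so that the original file is kept as a backup. This is only a sketch: the function name `enable_ocr_search`, the `.bak` suffix, and the idea of passing the plugins directory as an argument are assumptions for illustration; the two file paths are the ones named above.

```shell
#!/bin/sh
# Hypothetical helper: back up BookReaderCustom.php, then replace it with
# the ExtractOcr variant. $1 is the path to the Omeka plugins directory
# (an assumption; adjust it to your installation).
enable_ocr_search() {
    lib="$1/Bookreader/libraries"
    # Stop early if the replacement file is not where we expect it.
    [ -f "$lib/BookReaderCustom_extractOCR.php" ] || {
        echo "missing $lib/BookReaderCustom_extractOCR.php" >&2
        return 1
    }
    # Keep a copy of the original so the change can be undone.
    if [ -f "$lib/BookReaderCustom.php" ]; then
        cp -p "$lib/BookReaderCustom.php" "$lib/BookReaderCustom.php.bak" || return 1
    fi
    cp "$lib/BookReaderCustom_extractOCR.php" "$lib/BookReaderCustom.php"
}
```

For example, `enable_ocr_search omeka/plugins` would apply the change; copying the `.bak` file back over BookReaderCustom.php restores the original behaviour.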


See the online Extract OCR issues.


This plugin is published under [GNU/GPL].

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.


  • Sylvain Machefert, Université Bordeaux 3 (see symac)