Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats



PHP Apache Tika

This tool provides Apache Tika bindings for PHP, allowing you to extract text and metadata from documents, images and other formats.

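The following modes are supported:

  • App mode: runs the Apache Tika app JAR through the command line
  • Server mode: sends HTTP requests to a running Apache Tika server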

Server mode is recommended because it is 5 times faster, but some shared hosts don't allow running processes in the background.
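When it is not known in advance whether a server can run on the host, one option is to probe the port and fall back to app mode. The following is only a sketch: `tikaClientArguments()` is a hypothetical helper, not part of the library, and the host, port and JAR path are placeholder values. It only builds the argument list for `Client::make()`, so it runs without the library installed.

```php
<?php
// Hypothetical helper (not part of the library): returns the arguments to pass
// to \Vaites\ApacheTika\Client::make(), preferring server mode when reachable.
function tikaClientArguments(string $host, int $port, string $jarPath, float $timeout = 1.0): array
{
    // Try to open a TCP connection to the Tika server; warnings are suppressed
    // because a refused connection is an expected outcome here.
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);

    if ($socket !== false) {
        fclose($socket);

        return [$host, $port]; // server mode
    }

    return [$jarPath]; // app mode fallback
}

// Usage, assuming the library is installed through Composer:
// $args   = tikaClientArguments('localhost', 9998, '/path/to/tika-app.jar');
// $client = \Vaites\ApacheTika\Client::make(...$args);
```

An open port only shows that something is listening there, not that it is a Tika server, so a stricter check may be needed in practice.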

Although the library contains a list of supported versions, any version of Apache Tika should be compatible as long as the Tika team maintains backward compatibility. Therefore, you don't need to wait for a library update to work with new versions of the tool.


  • Simple class interface to Apache Tika features:
    • Text and HTML extraction
    • Metadata extraction
    • OCR recognition
  • Standardized metadata for documents
  • Support for local and remote resources
  • No heavyweight library dependencies
  • Compatible with Apache Tika 1.7 or greater
    • Tested up to 1.19.1



Install using Composer:

composer require vaites/php-apache-tika

If you want to use OCR you must install Tesseract:

  • Fedora/CentOS: sudo yum install tesseract (use dnf instead of yum on Fedora 22 or greater)
  • Debian/Ubuntu: sudo apt-get install tesseract-ocr
  • Mac OS X: brew install tesseract (using Homebrew)

The library assumes the tesseract binary is in the PATH, so you can compile it yourself or install it using any other method.


Start Apache Tika server with caution:

java -jar tika-server-x.xx.jar

If you are using a JRE instead of a JDK and have Java 9 or greater, you must run:

java --add-modules java.se.ee -jar tika-server-x.xx.jar

Instantiate the class:

$client = \Vaites\ApacheTika\Client::make('localhost', 9998);           // server mode (default)
$client = \Vaites\ApacheTika\Client::make('/path/to/tika-app.jar');     // app mode 

Use the class to extract text from documents:

$language = $client->getLanguage('/path/to/your/document');
$metadata = $client->getMetadata('/path/to/your/document');

$html = $client->getHTML('/path/to/your/document');
$text = $client->getText('/path/to/your/document');
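The calls above can be grouped into a single helper that also handles failures. This is a minimal sketch, assuming PHP 7.2 or greater and assuming that errors surface as exceptions, which this section does not state; `describeDocument()` is a hypothetical name, and the client is passed in as a plain object so that only the methods shown above are used.

```php
<?php
// Hypothetical helper: gathers language, metadata and text for one file
// using only the client methods shown above.
function describeDocument(object $client, string $file): array
{
    try {
        return [
            'ok'       => true,
            'language' => $client->getLanguage($file),
            'metadata' => $client->getMetadata($file),
            'text'     => $client->getText($file),
        ];
    } catch (\Exception $e) {
        // Assumption: unsupported or unreadable files end up here.
        return ['ok' => false, 'error' => $e->getMessage()];
    }
}

// Usage, assuming $client was created with Client::make() as shown above:
// $result = describeDocument($client, '/path/to/your/document');
// echo $result['ok'] ? $result['text'] : 'Extraction failed: ' . $result['error'];
```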

Or use it to extract text from images:

$client = \Vaites\ApacheTika\Client::make($host, $port);
$metadata = $client->getMetadata('/path/to/your/image');

$text = $client->getText('/path/to/your/image');

You can use a URL instead of a file path and the library will download the file and pass it to Apache Tika. There's no need to add -enableUnsecureFeatures -enableFileUrl to the command line when starting the server, as described here.


Here is the full list of available methods:


Tika file related methods:


Other Tika related methods:


Supported versions related methods:


Set/get a callback for sequential read of response:


Set/get the chunk size for sequential read:


Enable/disable the internal remote file downloader:


Command line client

Set/get JAR/Java paths (only CLI mode):



Web client

Set/get host properties




Set/get cURL client options:

$client->setOption($option, $value);
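Several cURL options can be applied in one go by looping over an array. This is only a sketch: `applyCurlOptions()` is a hypothetical helper, the option values are examples, and it relies solely on the `setOption($option, $value)` call shown above. It assumes PHP 7.2 or greater.

```php
<?php
// Hypothetical helper: applies each cURL option through setOption().
function applyCurlOptions(object $client, array $options): void
{
    foreach ($options as $option => $value) {
        $client->setOption($option, $value);
    }
}

// Usage, assuming $client is a server mode client and the cURL extension is loaded:
// applyCurlOptions($client, [
//     CURLOPT_TIMEOUT        => 30, // overall request timeout in seconds
//     CURLOPT_CONNECTTIMEOUT => 5,  // connection timeout in seconds
// ]);
```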

Set/get cURL client common options:



Tests are designed to cover all features for all supported versions of Apache Tika in app mode and server mode. There are a few samples to test against:

  • sample1: document metadata and text extraction
  • sample2: image metadata
  • sample3: text recognition
  • sample4: unsupported media
  • sample5: huge text for callbacks


Some issues were found during tests that are not related to this library:

  • Version 1.9 running on Java 7 in server mode throws random 500 errors (Unexpected RuntimeException)
  • Version 1.14 in server mode throws random errors (Expected ';', got ',') when parsing image metadata
  • Tesseract slows down document parsing as described in TIKA-2359