Plugin: File filter

cmsmerge (Searchdaimon) edited this page Aug 27, 2013 · 2 revisions

Stub article: This article is a work in progress. You can help Searchdaimon by expanding it with information you know.

File filters are plugins to the document manager that extract text from files so that the text can be added to the index. The goal is to be able to use any text-based program on Linux as a file filter.

If you make any changes, you will need to restart the document manager for them to take effect.

The file filters reside in the fileFilter folder. Each file filter has its own folder with some of the following files and folders:

  • runinfo – Configuration file that tells the ES what to run for this file filter and how to run it
  • src – Folder with the source code of the file filter, if available
  • test – Folder with example files useful for testing
  • [binary] – Any binary the file filter needs

The document manager reads the runinfo files at startup and then runs the correct file filter for each file that gets crawled.
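The folder layout above can be explored with a short script. This is a minimal sketch, not the document manager's own startup code: the function names and the throwaway demo tree are assumptions made for illustration, and the only thing taken from this page is that each file filter folder may contain a file named runinfo.

```python
import tempfile
from pathlib import Path


def find_runinfo_files(filter_root):
    """Return {filter folder name: path to its runinfo file}."""
    found = {}
    for folder in sorted(Path(filter_root).iterdir()):
        runinfo = folder / "runinfo"
        # Only folders that actually ship a runinfo file are reported.
        if folder.is_dir() and runinfo.is_file():
            found[folder.name] = runinfo
    return found


def make_demo_tree():
    """Build a throwaway fileFilter-like tree so the sketch runs anywhere."""
    root = Path(tempfile.mkdtemp()) / "fileFilter"
    for name in ("pdftotext", "identify"):
        (root / name / "test").mkdir(parents=True)
        (root / name / "runinfo").write_text("documentstype: demo\n")
    (root / "notes").mkdir()  # hypothetical folder without runinfo; skipped
    return root


if __name__ == "__main__":
    for name, path in find_runinfo_files(make_demo_tree()).items():
        print(name, "->", path)
```

Pointing find_runinfo_files at a real fileFilter folder instead of the demo tree lists the filters that are installed there.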

Test files

We try to collect at least one test file for each format the ES can support. If there are any test files, they will be in the test folder for that file filter.

Runinfo options

The runinfo file is the main configuration file. It lists one section for each file format it knows how to convert. The runinfo configuration file supports several different types of file filters.

command

command describes what the document manager should do when it encounters a file. Normally it will execute a script or binary, located either in the file filter folder or somewhere else on the system.

For example, the command

/usr/bin/binary --to=txt #file

will execute /usr/bin/binary with the path of the file inserted in place of #file.
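The #file substitution can be illustrated with a few lines of Python. This is a sketch under assumptions: this page does not describe how the document manager actually performs the substitution or handles quoting, and build_argv is a hypothetical helper name. The sketch splits the command into arguments first and substitutes afterwards, so that a path containing spaces stays a single argument.

```python
import shlex


def build_argv(command, file_path):
    """Turn a runinfo command line into an argument list for one file."""
    # shlex.split does not treat '#' as a comment marker by default,
    # so "#file" survives as an ordinary token.
    return [token.replace("#file", file_path)
            for token in shlex.split(command)]


if __name__ == "__main__":
    # The path below is made up for illustration.
    print(build_argv("/usr/bin/binary --to=txt #file", "/tmp/my report.doc"))
```

The resulting list could be handed to subprocess.run without going through a shell, which avoids having to escape the file name.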

outputformat

outputformat describes what format the file filter will use to output its result. It can be one of:

  • html - The output will be formatted as html
  • text - The output will be formatted as plain text
  • htmlfile - The output will be a new file that is formatted as html
  • textfile - The output will be a new file that is formatted as text
  • dir - The output will be a new directory containing new files. The document manager will then go through this new directory and call other file filters on each file. This is normally used for files that contain other files. For example, a .zip file may contain other archived files, and an email file may contain attachments.

outputtype

outputtype describes where the file filter will output its results. It is currently only used for html and text, and must then be stdio to indicate that the file filter will use stdout and stderr for output.
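The rules for outputformat and outputtype can be written down as a small consistency check. This is a sketch of a hypothetical lint, not something the ES ships: check_section and the dictionary representation of a section are assumptions. Whether outputtype may be omitted for html and text is not stated on this page, so the sketch only complains when outputtype is present and does not fit.

```python
OUTPUT_FORMATS = {"html", "text", "htmlfile", "textfile", "dir"}
STDIO_FORMATS = {"html", "text"}  # the only formats outputtype is used with


def check_section(section):
    """Return a list of problems found in one runinfo section (a dict)."""
    problems = []
    fmt = section.get("outputformat")
    out_type = section.get("outputtype")
    if fmt not in OUTPUT_FORMATS:
        problems.append(f"unknown outputformat: {fmt!r}")
    if out_type is not None:
        if out_type != "stdio":
            problems.append(f"unknown outputtype: {out_type!r}")
        elif fmt in OUTPUT_FORMATS and fmt not in STDIO_FORMATS:
            problems.append(f"outputtype is not used with outputformat {fmt!r}")
    return problems


if __name__ == "__main__":
    print(check_section({"outputformat": "text", "outputtype": "stdio"}))
    print(check_section({"outputformat": "dir", "outputtype": "stdio"}))
```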

filtertype

filtertype is used to run a specially crafted Perl module as a filter. It is currently undocumented.

Example of runinfo configuration file

For example, fileFilter/pdftotext/runinfo is the runinfo file for pdftotext, a program for converting PDF files to text. It therefore has a section for PDF files like this:

documentstype: pdf
command: ./pdftotext #file
outputformat: textfile

It tells the document manager to run the pdftotext binary, located in the file filter's own folder, for each PDF file. The pdftotext program will then create a new text file with the extracted text.
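To make the section structure concrete, here is a minimal parser sketch. It is not the document manager's parser and rests on two assumptions: every setting is a "key: value" line, and each documentstype line starts a new section. The sample text joins the PDF section above with the PNG section quoted further down this page purely to exercise the code; in the source tree they belong to separate runinfo files.

```python
def parse_runinfo(text):
    """Parse runinfo text into a list of {key: value} sections."""
    sections = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only: command values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line; ignored in this sketch
        key, value = key.strip(), value.strip()
        # Assumption: a documentstype line opens a new section.
        if key == "documentstype" or not sections:
            sections.append({})
        sections[-1][key] = value
    return sections


SAMPLE = """\
documentstype: pdf
command: ./pdftotext #file
outputformat: textfile

documentstype: png
command: identify -ping -format "Geometry: %wx%h, Format: %m" #file
outputformat: text
outputtype: stdio
"""

if __name__ == "__main__":
    for section in parse_runinfo(SAMPLE):
        print(section["documentstype"], "->", section["command"])
```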

This file filter has some test files, so you can try it out directly from the console by running the following from the boithoTools folder:

fileFilter/pdftotext/pdftotext fileFilter/pdftotext/test/dmca.pdf

This will give you a new file fileFilter/pdftotext/test/dmca.txt with the text in the dmca.pdf file.

Image files

Files that don't contain any text, like images, may also benefit from file filters. For example, for images the ES will extract the name of the file format and the geometry.

For example, the file fileFilter/identify/runinfo has a section for PNG images like this:

documentstype: png
command: identify -ping -format "Geometry: %wx%h, Format: %m" #file
outputformat: text
outputtype: stdio

It tells the document manager to run ImageMagick's identify program, with some parameters and the file name, for each PNG file. The identify program will then write its text to stdout, which is what outputtype: stdio indicates.

This file filter also has some test files, so you can try it out directly from the console by running the following from the boithoTools folder:

identify -ping -format "Geometry: %wx%h, Format: %m" fileFilter/identify/test/png_sample.png

This will print the geometry and format of the png_sample.png file as a single line of the form "Geometry: <width>x<height>, Format: PNG".
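The text produced by this filter follows the -format string given in the runinfo section, so it is easy to check by machine. The sketch below parses one such line; parse_identify_line is a hypothetical helper, and the sample line is made up for illustration rather than taken from the test file.

```python
import re

# Mirrors the -format string "Geometry: %wx%h, Format: %m".
LINE = re.compile(r"Geometry: (\d+)x(\d+), Format: (\S+)")


def parse_identify_line(line):
    """Return (width, height, format) from one line of output, or None."""
    match = LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


if __name__ == "__main__":
    # Invented sample values; real ones depend on the image.
    print(parse_identify_line("Geometry: 640x480, Format: PNG"))
```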