INTRODUCTION
============
This DETAILS file accompanies doc2html version 3.0.1.
Read this file for instructions on the installation and use of the
doc2html scripts.
The set of files is:
    DETAILS      - this file
    doc2html.pl  - the main Perl script
    doc2html.cfg - configuration file for use with wp2html
    doc2html.sty - style file for use with wp2html
    pdf2html.pl  - Perl script for converting PDF files to HTML
    swf2html.pl  - Perl script for extracting links from Shockwave flash files
    README       - brief description
doc2html.pl is a Perl5 script for use as an external converter with
htdig 3.1.4 or later. It takes as input the name of a file containing a
document in one of several possible formats, together with the
document's MIME type. It uses the appropriate conversion utility to
convert the document to HTML on standard output.
doc2html.pl was designed to be easily adapted to use whatever conversion
utilities are available, and although it has been written around the
"wp2html" utility, it does not require wp2html to function.
NOTE: version 3.0.1 has only been tested on Unix.
pdf2html.pl is a Perl script which uses a pair of utilities (pdfinfo and
pdftotext) to extract information and text from an Adobe PDF file and
write HTML output. It can be called directly from htdig, but calling it
via doc2html.pl is recommended.
swf2html.pl is a Perl script which calls a utility (swfparse) and
outputs HTML containing links to the URLs found in a Shockwave flash
file. It can be called directly from htdig, but calling it via
doc2html.pl is recommended.
ABOUT DOC2HTML.PL
=================
doc2html.pl is essentially a wrapper script, and is itself only capable
of reading plain text files. It requires the utility programs described
below to work properly.
doc2html.pl was written by David Adams <d.j.adams@soton.ac.uk>. It is
based on conv_doc.pl, written by Gilles Detillieux <grdetil@scrc.umanitoba.ca>,
which in turn was based on the parse_word_doc.pl script, written by
Jesse op den Brouw <MSQL_User@st.hhs.nl>.
doc2html.pl makes up to three attempts to read a file. It first tries
utilities which convert directly into HTML. If one is not found, or no
output is produced, it then tries utilities which output plain text. If
none is found, and the file is not of a type known to be unconvertible,
then doc2html.pl attempts to read the file itself, stripping out any
control characters.
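The fallback order can be pictured with a minimal Perl sketch. This is not the code in doc2html.pl: the sub names (convert_with_fallback, strip_control_chars) and the stand-in converters are hypothetical, and the check for file types known to be unconvertible is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch of the three-stage fallback; not taken from doc2html.pl.
# Each stage is a code reference that returns converted text, or undef/''
# when no converter is available or it produced no output.
sub convert_with_fallback {
    my ($file, @stages) = @_;
    for my $stage (@stages) {
        my $out = $stage->($file);
        return $out if defined $out && length $out;
    }
    return;    # nothing could read the file
}

# Last resort: strip control characters, keeping tab, newline and return.
sub strip_control_chars {
    my ($text) = @_;
    $text =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F\x7F//d;
    return $text;
}

# Stand-in stages, for illustration only.
my $to_html = sub { return };       # e.g. HTML converter not installed
my $to_text = sub { return '' };    # e.g. text converter gave no output
my $raw     = sub { return strip_control_chars("plain\x00 text\x07\n") };

my $result = convert_with_fallback('example.doc', $to_html, $to_text, $raw);
print $result;
```

Passing the stages as an ordered list keeps the "try the next one" logic in a single loop, which is one way to make adding or removing a converter a one-line change.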
doc2html.pl is written to be flexible and easy to adapt to whatever
conversion utilities are available. New conversion utilities may be
added simply by making additions to routine 'store_methods', with no
other changes being necessary. The existing lines in store_methods
should provide sufficient examples of how to add more converters. Note
that converters which produce HTML are entered differently from those
that produce plain text.
htdig provides three arguments which are read by doc2html.pl:
    1) the name of a temporary file containing a copy of the
       document to be converted.
    2) the MIME type of the document.
    3) the URL of the document (which is used in generating the
       title in the output).
The test for document type uses both the MIME type passed as the second
argument and the "magic number" of the file (its leading bytes).
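As a rough illustration of combining the two checks, the sketch below reads the three arguments and compares the file's leading bytes against a few well-known signatures. The signature table and the sub names (magic_type, read_head) are assumptions made for this example and are not the ones used in doc2html.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: combine the MIME type with the file's leading bytes.
# The signature table is illustrative and is not the one in doc2html.pl.
my @SIGNATURES = (
    [ 'pdf',        qr/\A%PDF-/ ],
    [ 'postscript', qr/\A%!/ ],
    [ 'rtf',        qr/\A\{\\rtf/ ],
    [ 'ole2',       qr/\A\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1/ ],  # MS Office 97-style container
);

# Return a short label for the leading bytes, or 'unknown'.
sub magic_type {
    my ($head) = @_;
    for my $sig (@SIGNATURES) {
        my ($label, $re) = @$sig;
        return $label if $head =~ $re;
    }
    return 'unknown';
}

# Read the first $len bytes of a file in binary mode.
sub read_head {
    my ($path, $len) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $n = read($fh, my $head, $len);
    die "Cannot read $path: $!" unless defined $n;
    close $fh;
    return $head;
}

# htdig passes: temporary file name, MIME type, URL.
if (@ARGV >= 3) {
    my ($file, $mime, $url) = @ARGV;
    my $magic = magic_type(read_head($file, 8));
    print "MIME type: $mime, magic: $magic, URL: $url\n";
}
```

Checking the leading bytes as well as the MIME type guards against servers that report a wrong or generic type for a document.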
INSTALLATION
============
Installation requires that you acquire, compile and install the utilities
you need to do the conversions. Those already set up in the Perl scripts
are described below.
If you don't have the Perl module Sys::AlarmCall installed, then consider
installing it; see the "TIMEOUT" section below.
You may need to change the first line of each script to the location of
Perl on your system.
Edit doc2html.pl to include the full pathname of each utility you have
installed. For example:
    my $WP2HTML = '/opt/local/wp2html-3.2/bin/wp2html';
If you don't have a particular utility then leave its location as an
empty string.
Then place doc2html.pl and the other scripts where htdig can access them.
If you are going to convert PDF files then you will need to edit pdf2html.pl
and include its full path name in doc2html.pl.
If you are going to extract links from Shockwave flash files then you will
need to edit swf2html.pl and include its full path name in doc2html.pl.
Edit the htdig.conf configuration file to use the script, as in this example:
    external_parsers: application/rtf->text/html /usr/local/scripts/doc2html.pl \
        text/rtf->text/html /usr/local/scripts/doc2html.pl \
        application/pdf->text/html /usr/local/scripts/doc2html.pl \
        application/postscript->text/html /usr/local/scripts/doc2html.pl \
        application/msword->text/html /usr/local/scripts/doc2html.pl \
        application/Wordperfect5.1->text/html /usr/local/scripts/doc2html.pl \
        application/msexcel->text/html /usr/local/scripts/doc2html.pl \
        application/vnd.ms-excel->text/html /usr/local/scripts/doc2html.pl \
        application/vnd.ms-powerpoint->text/html /usr/local/scripts/doc2html.pl \
        application/x-shockwave-flash->text/html /usr/local/scripts/doc2html.pl \
        application/x-shockwave-flash2-preview->text/html /usr/local/scripts/doc2html.pl
If you are using wp2html then place the files doc2html.cfg and doc2html.sty in the
wp2html library directory.
UTILITY WP2HTML
===============
Obtain wp2html from http://www.res.bbsrc.ac.uk/wp2html/
Note that wp2html is not free; its author charges a small fee for
"registration". Various pre-compiled versions and the source code are
available, together with extensive documentation. Upgrades are
available at no further charge.
wp2html converts WordPerfect documents (5.1 and later) to HTML.
Versions 3.2 and later will also convert Word7 and Word97 documents to
HTML. A feature of wp2html which doc2html.pl exploits is that the -q
option will result in either good HTML or no output at all.
wp2html is very flexible in the output it creates. The two files,
doc2html.cfg and doc2html.sty, should be placed in the wp2html library
directory along with the .cfg and .sty files supplied with wp2html.
Edit the line in doc2html.pl:
    my $WP2HTML = '';
to set $WP2HTML to the full pathname of wp2html.
wp2html looks for the title in a document and, if one is found, outputs
it in <TITLE>....</TITLE> markup. If a title is not found then it
defaults to the file name in square brackets.
If wp2html is unable to convert a document, or is not installed,
then doc2html.pl can use the "catdoc" or "catwpd" utilities instead.
UTILITY CATDOC
==============
Obtain catdoc from http://www.ice.ru/~vitus/catdoc/; it is available
under the terms of the GNU General Public License.
Edit the line in doc2html.pl:
    my $CATDOC = '';
to set $CATDOC to the full pathname of catdoc. You might want
to use a different version of catdoc for Word2 documents or for Mac Word
files.
catdoc converts MS Word6, Word7, etc., documents to plain text. The
latest beta version is also able to convert Word2 documents. catdoc
also produces a certain amount of "garbage" as well as the text of the
document. The -b option improves the likelihood that catdoc will
extract all the text from the document, but at the expense of increasing
the garbage as well. doc2html.pl removes some non-printing characters
to minimise the garbage. If a later version of catdoc than 0.91.4 is
obtained then the use of the -b option should be reviewed.
UTILITY CATWPD
==============
Obtain catwpd from the contribs section of the Ht://Dig web site where
you obtained doc2html. It extracts words from some versions of WordPerfect
files. You won't need it if you buy the superior wp2html.
If you do use it, then edit the line in doc2html.pl:
    my $CATWPD = '';
to set $CATWPD to the full pathname of catwpd.
UTILITY PPTHTML
===============
Obtain ppthtml from http://www.xlhtml.org, where it is bundled with
xlhtml.
In doc2html.pl, edit the line:
    my $PPT2HTML = '';
to set $PPT2HTML to the full pathname of ppthtml.
ppthtml converts Microsoft PowerPoint files into HTML. It uses the input
filename as the title. doc2html.pl replaces this with the original
filename from the URL in square brackets.
UTILITY XLHTML
==============
Obtain xlhtml from http://www.xlhtml.org
In doc2html.pl, edit the line:
    my $XLS2HTML = '';
to set $XLS2HTML to the full pathname of xlhtml.
xlhtml converts Microsoft Excel spreadsheets into HTML. It uses the input
filename as the title. doc2html.pl replaces this with the original
filename from the URL in square brackets.
The present version of xlhtml (0.4) writes HTML output, but does not
mark up hyperlinks in .xls files as links in its output.
An alternative to xlhtml is xls2csv; see "UTILITY CATXLS" below.
UTILITY RTF2HTML
================
Obtain rtf2html from http://www.ice.ru/~vitus/catdoc/
In doc2html.pl, edit the line:
    my $RTF2HTML = '';
to set $RTF2HTML to the full pathname of rtf2html.
rtf2html converts Rich Text Format documents into HTML. It uses the input
filename as the title. doc2html.pl replaces this with the original
filename from the URL in square brackets.
UTILITY PS2ASCII
================
ps2ascii is a PostScript to text converter.
In doc2html.pl, edit the line:
    my $CATPS = '';
to set $CATPS to the full pathname of ps2ascii.
ps2ascii comes with the Ghostscript package (version 3.33 or later),
which is pre-installed on many Unix systems. Commonly, it is a
Bourne-shell script which invokes "gs", the Ghostscript binary.
doc2html.pl has provision for adding the location of gs to the search
path.
UTILITY PDFTOTEXT
=================
pdftotext converts Adobe PDF files to text. pdfinfo is a tool which
displays information about the document, and is used to obtain its
title, etc. Get them from the xpdf package at
http://www.foolabs.com/xpdf/
In script pdf2html.pl, change the lines:
    my $PDFTOTEXT = "/... .../pdftotext";
    my $PDFINFO = "/... .../pdfinfo";
to the correct full pathnames.
Edit doc2html.pl to include the full pathname of the pdf2html.pl script.
pdftotext may fail to convert PDF documents which have been truncated
because htdig has max_doc_size set to less than the document's full
size. Some PDF documents do not allow text to be extracted.
UTILITY CATXLS
==============
The Excel to .csv converter, xls2csv, is included with recent versions of
catdoc. This is an alternative to xlhtml (see above).
In doc2html.pl, edit the line:
    my $CATXLS = '';
to set $CATXLS to the full pathname of xls2csv.
xls2csv translates Excel spreadsheets into comma-separated data.
UTILITY SWFPARSE
================
swfparse (aka swfdump) extracts information from Shockwave flash files,
and can be obtained from the contribs section of the Ht://Dig web site,
where you obtained doc2html.
Perl script swf2html.pl calls swfparse and writes HTML output containing
links to the URLs found in the Shockwave file. It does NOT extract text
from the file.
In script swf2html.pl, change the line:
    my $SWFPARSE = "/... .../swfdump";
to set $SWFPARSE to the full pathname of swfparse.
Edit doc2html.pl to include the full pathname of the swf2html.pl script.
LOGGING
=======
Output of logging information and error messages is controlled by the
environment variable DOC2HTML_LOG, which may be set in the rundig
script. If it is not set, then only error messages output by doc2html.pl
and by the conversion utilities it calls are returned to htdig and will
appear in its STDOUT. If DOC2HTML_LOG is set to a filename, then
doc2html.pl appends logging information and any error messages to the
file. If DOC2HTML_LOG is set but blank, or the file cannot be opened
for writing, logging information and error messages are passed back to
htdig and will appear in its STDOUT.
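The three cases can be summarised in a short Perl sketch. It is an illustration only: the sub name open_log and the "verbose" flag are hypothetical, and STDERR is used simply as a stand-in for "passed back to htdig"; how doc2html.pl itself routes these messages is not shown here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch of the DOC2HTML_LOG rules; not the code in doc2html.pl.
# Returns (filehandle, verbose). "verbose" is true when logging information,
# and not just error messages, should be written.
sub open_log {
    my ($setting) = @_;
    return (\*STDERR, 0) unless defined $setting;    # unset: errors only
    if (length($setting) && open(my $fh, '>>', $setting)) {
        return ($fh, 1);                             # append to the log file
    }
    return (\*STDERR, 1);    # blank or unwritable: everything goes back
}

my ($log, $verbose) = open_log($ENV{DOC2HTML_LOG});
print {$log} "doc2html: starting conversion\n" if $verbose;
```

Opening the file in append mode ('>>') matches the description above, so that successive calls from one htdig run accumulate in a single log.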
In doc2html.pl, the variables $Emark and $EEmark, set in subroutine init,
are used to highlight error messages.
The number of lines of STDERR output from a utility which is logged or
passed back to htdig is controlled by the variable $Maxerr set in
routine "init" of doc2html.pl. This is provided in order to curb the
large number of error messages which some utilities can produce from
processing a single file.
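A cap of this kind might look like the following sketch. The sub name limit_errors is hypothetical, and the trailing "suppressed" note is an addition made for this example; whether doc2html.pl reports the number of dropped lines is not stated here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: keep at most $max lines of a utility's STDERR output.
sub limit_errors {
    my ($max, @lines) = @_;
    return @lines if @lines <= $max;
    my $dropped = @lines - $max;
    return (@lines[0 .. $max - 1], "... $dropped further line(s) suppressed\n");
}

# Five lines of pretend STDERR output, capped at three.
my @stderr = map { "warning $_\n" } 1 .. 5;
print limit_errors(3, @stderr);
```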
TIMEOUT
=======
If possible, install Perl module Sys::AlarmCall, obtainable from CPAN if
you don't already have it. This module is used by doc2html.pl to
terminate a utility if it takes too long to finish. The line in
doc2html.pl:
    $Time = 60; # allow 60 seconds for external utility to complete
may be altered to suit.
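For readers unfamiliar with the technique, the sketch below shows the general idea of a time limit using Perl's built-in alarm and eval. It does not use Sys::AlarmCall and is not the code in doc2html.pl; the sub name run_with_timeout is hypothetical. It relies on signals, so it assumes a Unix system, and it does not by itself kill a child process that is still running.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch using Perl's built-in alarm, not the Sys::AlarmCall module.
# Runs $code with a time limit; returns its result, or undef on timeout.
sub run_with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $seconds;
        $result = $code->();
        alarm 0;
        1;
    };
    alarm 0;                            # make sure no alarm is left pending
    return $result if $ok;
    die $@ unless $@ eq "timeout\n";    # re-throw unrelated errors
    return;
}

my $Time = 60;    # same default as the line quoted above
my $out = run_with_timeout($Time, sub { return "converted text" });
print defined $out ? "finished: $out\n" : "timed out\n";
```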
LIMITING INPUT AND OUTPUT
=========================
The environment variable DOC2HTML_IP_LIMIT may be set in the rundig
script to limit the size of the file which doc2html.pl will attempt to
convert. The default value is 20000000. doc2html.pl will return no
output to htdig if the file size is equal to or greater than this size.
You are recommended to set DOC2HTML_IP_LIMIT to the same value as the
"max_doc_size" parameter in the htdig configuration file. Then no
attempt will be made to extract text from files which have been truncated
by htdig. It is not possible to extract any text from .PDF files, for
example, if they have been truncated.
The environment variable DOC2HTML_OP_LIMIT may be set in the rundig
script to limit the output sent back to htdig by a single call to
doc2html.pl. The default value is 10000000. doc2html.pl will stop
returning output to htdig once the DOC2HTML_OP_LIMIT has been reached.
This is a precaution against the unlikely event of a conversion utility
returning disproportionately large amounts of data.
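The two limits can be illustrated with a small Perl sketch. The sub names too_big and cap_output are hypothetical, and the sketch counts characters in an in-memory string; how doc2html.pl measures and truncates its output is an assumption here. Only the environment variable names and default values come from the text above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch of the two limits; defaults are those stated above.
my $ip_limit = $ENV{DOC2HTML_IP_LIMIT} || 20_000_000;
my $op_limit = $ENV{DOC2HTML_OP_LIMIT} || 10_000_000;

# True when a file of $size bytes is too large to convert
# (equal to or greater than the input limit).
sub too_big {
    my ($size, $limit) = @_;
    return $size >= $limit;
}

# Truncate output so that no more than $limit characters are returned.
sub cap_output {
    my ($text, $limit) = @_;
    return length($text) > $limit ? substr($text, 0, $limit) : $text;
}

print "skip file\n" if too_big(25_000_000, $ip_limit);
print cap_output("abcdef", 4), "\n";
```

The input check would normally use the size of the temporary file (for example, Perl's -s file test) before any converter is started, so that truncated documents are skipped cheaply.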
CONTACT
=======
Any queries regarding doc2html are best sent to the mailing list
htdig-general@lists.sourceforge.net
The author can be emailed at D.J.Adams@soton.ac.uk
David Adams
Information Systems Services
University of Southampton
27-November-2002