This repository has been archived by the owner on Jul 26, 2024. It is now read-only.
Analysing Yiddish
Assaf Urieli edited this page Mar 28, 2019 · 5 revisions
Starting with v2.3.5, it is possible to use a pre-built Yiddish model and lexicon to analyse Yiddish.
First, install Jochre, using the installation instructions.
Now, let's assume you created a working directory structure as follows:
- jochre
  - bin - all JARs and other content copied from jochre_distribution/target/jochre-distribution-X.X.X-bin
  - input - where you place the PDF you would like to analyse
  - resources - where to place resources (statistical models and lexicons)
  - output - where the Jochre output will go
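The layout above can be created in one step. The following is only a minimal sketch: the `WORK_DIR` variable is an assumption introduced here for illustration, and the contents of `bin` still have to be copied from the distribution build by hand.

```shell
#!/bin/sh
# Create the working directory layout described above.
# WORK_DIR is an assumed variable name (not part of Jochre); it defaults to ./jochre.
WORK_DIR="${WORK_DIR:-jochre}"

# -p creates parent directories as needed and does not fail if they already exist.
mkdir -p "$WORK_DIR/bin" "$WORK_DIR/input" "$WORK_DIR/resources" "$WORK_DIR/output"

# List the result so the four subdirectories can be checked at a glance.
ls "$WORK_DIR"
```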
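Copy the following resources into the resources directory: the Yiddish lexicon (jochre-yiddish-lexicon-X.X.X.zip) and the Yiddish letter model (yiddish_letter_model.zip).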
Let's assume you want to analyse Sholem Aleykhem's "Tevye der milkhiker". Copy the PDF (nybc200076.pdf in the command below) into the input directory.
Now, in the jochre directory, run the following command:
java -jar -Xmx3G bin/jochre-yiddish-X.X.X.jar command=analyseFile file=input/nybc200076.pdf outDir=output/nybc200076 lexicon=resources/jochre-yiddish-lexicon-X.X.X.zip letterModel=resources/yiddish_letter_model.zip outputFormat=Alto4zip,HTML,Text,ImageExtractor
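Before launching a long analysis, it can help to confirm that the files named in the command are actually in place. The helper below is a hypothetical sketch and not part of Jochre: `check_inputs` is a name invented here, and it only tests that each path exists as a regular file.

```shell
#!/bin/sh
# Pre-flight check (sketch): report which of the given files are missing.
# check_inputs is a hypothetical helper, not something Jochre provides.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # Return 0 when every file exists, 1 otherwise.
    return "$missing"
}
```

Run from the jochre directory, it might be used as `check_inputs input/nybc200076.pdf resources/yiddish_letter_model.zip`, with the Jochre command run only if the check succeeds.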
To analyse only a subset of pages, use the additional parameters first and last, as in first=18 last=30.
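As a sketch of how the page range fits into the full command, the script below assembles the same invocation with first and last appended. The variable names (`JAR_VERSION`, `LEXICON_VERSION`, `BOOK`, `FIRST`, `LAST`, `CMD`) are assumptions made for this example, and the version placeholders stay as X.X.X because the page does not state them. The script only prints the command, so it can be inspected before it is run.

```shell
#!/bin/sh
# Assemble the analyseFile command for a page range (sketch).
# The versions are left as the X.X.X placeholder used on this page;
# substitute the versions you actually installed.
JAR_VERSION="X.X.X"
LEXICON_VERSION="X.X.X"
BOOK="nybc200076"
FIRST=18
LAST=30

# Backslash-newline inside double quotes is a line continuation,
# so CMD ends up as a single line.
CMD="java -jar -Xmx3G bin/jochre-yiddish-${JAR_VERSION}.jar \
command=analyseFile \
file=input/${BOOK}.pdf \
outDir=output/${BOOK} \
lexicon=resources/jochre-yiddish-lexicon-${LEXICON_VERSION}.zip \
letterModel=resources/yiddish_letter_model.zip \
outputFormat=Alto4zip,HTML,Text,ImageExtractor \
first=${FIRST} last=${LAST}"

# Print the command for inspection; paste the output into the shell to run it.
echo "$CMD"
```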
The output formats above allow you to:
- Alto4zip: Generate an Alto4 layer of the book
- HTML: Generate an HTML page of the book which can be opened in a browser
- Text: Generate a text file containing the book's text
- ImageExtractor: Extract the individual images analysed (useful when constructing a training/test corpus)
The full list of available output formats is listed here.