Analysing Yiddish

Starting with v2.3.5, Jochre can analyse Yiddish texts using a pre-built Yiddish model and lexicon.

First, install Jochre, using the installation instructions.

Now, let's assume you created a working directory structure as follows:

jochre
- bin - all JARs and other content copied from jochre_distribution/target/jochre-distribution-X.X.X-bin
- input - where you place the PDF you would like to analyse
- resources - where to place resources (statistical models and lexicons)
- output - where the Jochre output will go
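
The layout above can be created from a shell with a sketch like the following. It assumes a POSIX shell and that you run it in the directory that should contain jochre. The bin directory still has to be filled from the distribution afterwards.

```shell
# Create the working directory and its four subdirectories.
# -p creates parent directories as needed and does not fail if they already exist.
mkdir -p jochre/bin jochre/input jochre/resources jochre/output

# List the result to confirm the layout.
ls jochre
```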

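Copy the following resources into the resources directory (file names as used in the command below):

- jochre-yiddish-lexicon-X.X.X.zip - the Yiddish lexicon
- yiddish_letter_model.zip - the Yiddish letter model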

Let's assume you want to analyse Sholem Aleykhem's "Tevye der milkhiker". Copy the PDF into the input directory:

nybc200076.pdf

Now, in the jochre directory, run the following command:

java -jar -Xmx3G bin/jochre-yiddish-X.X.X.jar command=analyseFile file=input/nybc200076.pdf outDir=output/nybc200076 lexicon=resources/jochre-yiddish-lexicon-X.X.X.zip letterModel=resources/yiddish_letter_model.zip outputFormat=Alto4zip,HTML,Text,ImageExtractor

To analyse only a subset of pages, use the additional parameters first and last, as in first=18 last=30.
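
As a minimal sketch, the page-range variant of the command above can be assembled from a few shell variables and printed for inspection before running it. The variable names are illustrative only and are not part of Jochre. X.X.X remains a placeholder for the versions you actually installed, and the jar and lexicon versions are kept separate because they are not assumed to match.

```shell
# Placeholder values: replace X.X.X with the versions you actually have.
JOCHRE_VERSION="X.X.X"
LEXICON_VERSION="X.X.X"
PDF="nybc200076"   # base name of the PDF in the input directory
FIRST=18           # first page to analyse
LAST=30            # last page to analyse

# Same parameters as the full command above, with first and last appended.
CMD="java -jar -Xmx3G bin/jochre-yiddish-${JOCHRE_VERSION}.jar command=analyseFile file=input/${PDF}.pdf outDir=output/${PDF} lexicon=resources/jochre-yiddish-lexicon-${LEXICON_VERSION}.zip letterModel=resources/yiddish_letter_model.zip outputFormat=Alto4zip,HTML,Text,ImageExtractor first=${FIRST} last=${LAST}"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

Once the printed command looks right, paste it into the shell from the jochre directory to run it.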

The output formats above allow you to:

Alto4zip: Generate an Alto4 layer of the book
HTML: Generate an HTML page of the book which can be opened in a browser
Text: Generate a text file containing the book's text
ImageExtractor: Extract the individual images analysed (useful when constructing a training/test corpus)

The full list of available output formats is listed here.
