This repository has been archived by the owner on Jul 26, 2024. It is now read-only.

Analysing Yiddish

Assaf Urieli edited this page Mar 28, 2019 · 5 revisions

Starting with v2.3.5, you can analyse Yiddish text using a pre-built Yiddish model and lexicon.

First, install Jochre by following the installation instructions.

Now, let's assume you created a working directory structure as follows:

  • jochre
    • bin - all JARs and other content copied from jochre_distribution/target/jochre-distribution-X.X.X-bin
    • input - where you place the PDF you would like to analyse
    • resources - where to place resources (statistical models and lexicons)
    • output - where the Jochre output will go
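As a minimal sketch, the layout above could be created from a shell as follows. The directory names are taken from the list; where you create the top-level jochre directory is up to you.

```shell
# Create the working directory layout described above.
# bin       - JARs and other content from the Jochre distribution
# input     - PDFs to analyse
# resources - statistical models and lexicons
# output    - Jochre output
mkdir -p jochre/bin jochre/input jochre/resources jochre/output
```

After this, copy the contents of the distribution's bin directory into jochre/bin.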

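Copy the following resources into the resources directory (the file names are those referenced by the command further down):

  • jochre-yiddish-lexicon-X.X.X.zip - the Yiddish lexicon
  • yiddish_letter_model.zip - the Yiddish letter model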

Let's assume you want to analyse Sholem Aleykhem's "Tevye der milkhiker". Copy the PDF into the input directory; the command further down assumes it is named nybc200076.pdf.

Now, in the jochre directory, run the following command:

java -jar -Xmx3G bin/jochre-yiddish-X.X.X.jar command=analyseFile file=input/nybc200076.pdf outDir=output/nybc200076 lexicon=resources/jochre-yiddish-lexicon-X.X.X.zip letterModel=resources/yiddish_letter_model.zip outputFormat=Alto4zip,HTML,Text,ImageExtractor

To analyse only a subset of pages, use the additional parameters first and last, as in first=18 last=30.
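For illustration, the page-range variant of the command can be wrapped in a small shell function. This is only a sketch: the function name analyse_pages and the JAVA override variable are hypothetical and not part of Jochre; the arguments themselves are the ones from the command above, with first and last appended. X.X.X is left as a placeholder for your installed version.

```shell
# Hypothetical helper (not part of Jochre): analyse pages $1..$2 of the sample PDF.
# JAVA is an assumed override for the launcher; setting JAVA=echo prints the
# arguments instead of starting the analysis.
analyse_pages() {
  "${JAVA:-java}" -jar -Xmx3G bin/jochre-yiddish-X.X.X.jar \
    command=analyseFile \
    file=input/nybc200076.pdf \
    outDir=output/nybc200076 \
    lexicon=resources/jochre-yiddish-lexicon-X.X.X.zip \
    letterModel=resources/yiddish_letter_model.zip \
    outputFormat=Alto4zip,HTML,Text,ImageExtractor \
    first="$1" last="$2"
}

# Dry run: show the arguments that would be passed for pages 18 to 30.
JAVA=echo analyse_pages 18 30
```

To run the analysis for real, replace X.X.X with your version and call analyse_pages 18 30 from the jochre directory without setting JAVA.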

The output formats above allow you to:

  • Alto4zip: Generate an Alto4 layer of the book
  • HTML: Generate an HTML page of the book which can be opened in a browser
  • Text: Generate a text file containing the book's text
  • ImageExtractor: Extract the individual images analysed (useful when constructing a training/test corpus)

The full list of available output formats is listed here.
