rewicks/ParaDocs

Quick Use

There are two scripts: paradocs-hf, which downloads the data from Hugging Face and filters it, and paradocs, which filters an already downloaded file (this may be faster than going through Hugging Face).

They both have the same general usage:

usage: paradocs [-h] [--NAME NAME] [OPTIONS]

ParaDocs: Extracts document-level information from RAW ParaCrawl data and filters output for easy bitext extraction.
      Example: paradocs --path en-de-strict.gz

options:
  -h, --help            show this help message and exit
  --path PATH           The path to the data
  --minimum_size MINIMUM_SIZE
                        The minimum consecutive sequence length to include in the output. By default, set to 1 (all sentences included)
  --frequency_cutoff FREQUENCY_CUTOFF
                        Will break documents at any line that occurs above this frequency in the original data. This is an alternative to sentence-level deduping.
  --lid_cutoff LID_CUTOFF
                        Will break documents at any line that occurs below this probability for the assigned language.
  --min_avg_score MIN_AVG_SCORE
                        To print out a document, the whole document must maintain an average score across the whole document.
  --outpath OUTPATH     The output path to write the data. If not set, defaults to sys.stdout
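
The options above can be read as a split-then-filter pass over each document. The sketch below is one possible reading and is not the ParaDocs implementation. The `Line` record, the `split_document` function, the use of raw counts for frequency, dropping the line that triggers a break, and applying `min_avg_score` to each resulting run are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, NamedTuple


class Line(NamedTuple):
    # Hypothetical record; the real ParaDocs input format may differ.
    text: str        # sentence text
    lid_prob: float  # language-ID probability for the assigned language
    score: float     # per-sentence quality score


def split_document(
    doc: List[Line],
    freq: Counter,
    minimum_size: int = 1,
    frequency_cutoff: float = float("inf"),
    lid_cutoff: float = 0.0,
    min_avg_score: float = 0.0,
) -> List[List[Line]]:
    """Break one document into consecutive runs of acceptable lines."""
    runs: List[List[Line]] = []
    current: List[Line] = []
    for line in doc:
        too_frequent = freq[line.text] > frequency_cutoff
        low_lid = line.lid_prob < lid_cutoff
        if too_frequent or low_lid:
            # Assumption: the offending line ends the run and is dropped.
            if current:
                runs.append(current)
            current = []
        else:
            current.append(line)
    if current:
        runs.append(current)
    # Keep runs that are long enough and whose mean score is high enough.
    return [
        run for run in runs
        if len(run) >= minimum_size
        and sum(l.score for l in run) / len(run) >= min_avg_score
    ]


if __name__ == "__main__":
    doc = [
        Line("Hello world.", 0.99, 0.9),
        Line("Click here", 0.95, 0.8),
        Line("A second sentence.", 0.98, 0.7),
        Line("A third sentence.", 0.97, 0.9),
    ]
    # Pretend "Click here" is boilerplate seen 1000 times in the corpus.
    freq = Counter({"Click here": 1000})
    for run in split_document(doc, freq, minimum_size=2, frequency_cutoff=100):
        print([l.text for l in run])
```

In this toy input, the frequent line splits the document in two. With a minimum size of 2, the single-sentence run before it is discarded and only the two-sentence run after it is kept.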

To use paradocs-hf instead, replace --path with --name and pass the name of a data split.
