# Onboard Project

This notebook is an alternative to running silnlp.common.onboard_project from the command line. Every command-line or config-file argument can be edited here.

To run the entire notebook at once, fill in at least the project and copy_from parameters (and zip_password, if needed) in the first code cell. The rest of the notebook will then run successfully with the default arguments.

## Setup Local Project

This block prepares the local project for cleaning and uploading. It unzips the project if it is zipped, checks Settings.xml for errors, and renames the project by replacing hyphens with underscores and, optionally, appending a datestamp.
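The renaming step can be pictured with a minimal sketch. The helper name `normalized_project_name` and the `YYYY_MM_DD` datestamp format are assumptions made for illustration; the actual behavior of `onboard_project.setup_local_project` may differ.

```python
from datetime import date
from typing import Optional


def normalized_project_name(name: str, datestamp: bool = False, today: Optional[date] = None) -> str:
    """Hypothetical helper: replace hyphens with underscores and optionally append a date."""
    new_name = name.replace("-", "_")
    if datestamp:
        # The YYYY_MM_DD format is an assumption, not confirmed by the source.
        stamp = (today or date.today()).strftime("%Y_%m_%d")
        new_name = f"{new_name}_{stamp}"
    return new_name


print(normalized_project_name("my-test-project"))
print(normalized_project_name("my-test-project", datestamp=True, today=date(2024, 1, 31)))
```

The `today` argument exists only so the example gives a reproducible result.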

### Parameters
- **project**: str - The name of the Paratext project to be used for Onboarding. No default. Input is always required.
- **copy_from**: str - This is the path to the local project to be uploaded to the bucket. Default is None.
- **zip_password**: str - This is the password to the project's zip file. Default is None.
- **datestamp**: bool - When True, a datestamp will be appended to the end of the project name. Default is False.
- **overwrite**: bool - When True, overwrite any existing files and folders. Default is False.
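As a rough illustration of how zip_password might come into play, the sketch below extracts an archive with Python's standard `zipfile` module. Both the helper `unzip_project` and the use of `zipfile` are assumptions; the source does not say how silnlp unzips projects. Because `zipfile` cannot create encrypted archives, the demonstration uses an unencrypted one.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Optional


def unzip_project(zip_path: Path, dest_dir: Path, password: Optional[str] = None) -> Path:
    """Hypothetical helper: extract a project archive, optionally with a password."""
    # zipfile expects the password as bytes; None means "no password".
    pwd = password.encode("utf-8") if password else None
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir, pwd=pwd)
    return dest_dir


# Demo with a small unencrypted archive built in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    demo_zip = tmp_path / "demo_project.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("Settings.xml", "<ScriptureText />")
    extracted = unzip_project(demo_zip, tmp_path / "demo_project")
    print(sorted(p.name for p in extracted.iterdir()))
```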

In [None]:
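from pathlib import Path
from typing import Optional
#----------Parameters to Edit----------
project: str = "INSERT_PROJECT_NAME_HERE"
copy_from: Optional[str] = None
zip_password: Optional[str] = None
datestamp: bool = False
overwrite: bool = False
#--------------------------------------

import silnlp.common.onboard_project as onboard_project

# Path(None) raises a TypeError, so only wrap copy_from when it is set.
# Passing None through assumes setup_local_project accepts a missing path.
copy_from_path = Path(copy_from) if copy_from else None
project_name, local_project_path, copy_from = onboard_project.setup_local_project(
    project, copy_from_path, zip_password, datestamp
)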

## Clean Project

This block optionally cleans the local project before uploading by removing unnecessary files. Cleaning is done by running silnlp.common.clean_projects.

### Parameters
- **no_clean**: bool - When True, skip cleaning the project. Default is False.

In [None]:
#----------Parameters to Edit----------
no_clean = False
#--------------------------------------

from silnlp.common.clean_projects import process_single_project_for_cleaning

if not no_clean:
    print(f"Cleaning Paratext project: {project_name}.")
    process_single_project_for_cleaning(
        local_project_path,
    )

## Copy Paratext Project

This block copies the local Paratext project to the bucket. It is skipped when copy_from is not set.

In [None]:
from silnlp.common.environment import SIL_NLP_ENV

if copy_from:
    print(
        f"Copying project: {project_name} from {copy_from} to {SIL_NLP_ENV.pt_projects_dir}/{project_name}"
    )
    source_path = Path(copy_from)
    if source_path.name != project_name:
        source_path = source_path / project_name
    paratext_project_dir: Path = onboard_project.create_paratext_project_folder_if_not_exists(project_name)
    onboard_project.copy_paratext_project_folder(source_path, paratext_project_dir, overwrite=overwrite)

## Extract Corpora

This block extracts text corpora from the Paratext project by running silnlp.common.extract_corpora.

### Parameters
- **include**: List[str] - The list of books to include. Default is [].
- **exclude**: List[str] - The list of books to exclude. Default is [].
- **markers**: bool - When True, include USFM markers in extraction. Default is False.
- **lemmas**: bool - When True, extract lemmas, if available. Default is False.
- **project_vrefs**: bool - When True, extract verse references. Default is False.
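To make the include and exclude parameters more concrete, here is a minimal sketch of one way such filters could combine. It assumes that an empty include list means "all books" and that exclude is applied afterwards; `select_books` is a hypothetical helper, and silnlp.common.extract_corpora may interpret these lists differently. The book identifiers are standard three-letter USFM codes.

```python
from typing import List


def select_books(all_books: List[str], include: List[str], exclude: List[str]) -> List[str]:
    """Hypothetical helper: keep included books (all if include is empty), then drop excluded ones."""
    selected = [book for book in all_books if not include or book in include]
    return [book for book in selected if book not in exclude]


books = ["GEN", "EXO", "MAT", "MRK"]
print(select_books(books, include=[], exclude=[]))
print(select_books(books, include=["GEN", "MAT"], exclude=["MAT"]))
```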

In [None]:
from typing import List
#----------Parameters to Edit----------
include: List[str] = []
exclude: List[str] = []
markers: bool = False
lemmas: bool = False
project_vrefs: bool = False
#--------------------------------------

extract_config = {
    "include": include,
    "exclude": exclude,
    "markers": markers,
    "lemmas": lemmas,
    "project_vrefs": project_vrefs,
}

onboard_project.extract_corpora_wrapper(
    project_name,
    extract_config,
    overwrite,
)

## Collect Verse Counts

This block collects the project's verse counts by running silnlp.common.collect_verse_counts.

### Parameters
- **input_folder**: Path - Folder with corpus of Bible extract files. Default is MT/scripture.
- **output_folder**: Path - Folder to store the results. Default is MT/experiments/OnboardingRequests/project_name/verse_counts.
- **files**: str - Semicolon-delimited list of patterns of extract file names to count. Default is "\*project_name\*.txt".
- **deutero**: bool - When True, include counts for Deuterocanonical books. Default is False.
- **recount**: bool - When True, force a recount of the verses. Default is False.
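The files parameter packs several glob-style patterns into one semicolon-delimited string. The sketch below shows one way such a string could be applied to a list of file names. `match_extract_files` and the sample file names are hypothetical, and the real matching in silnlp.common.collect_verse_counts may work differently.

```python
from fnmatch import fnmatchcase
from typing import List


def match_extract_files(file_names: List[str], patterns: str) -> List[str]:
    """Hypothetical helper: keep names matching any of the semicolon-delimited patterns."""
    pattern_list = [p.strip() for p in patterns.split(";") if p.strip()]
    # fnmatchcase keeps matching case-sensitive on every platform.
    return [name for name in file_names if any(fnmatchcase(name, p) for p in pattern_list)]


names = ["en-KJV.txt", "abc-MyProj_2024.txt", "abc-MyProj.src.detok.txt", "notes.md"]
print(match_extract_files(names, "*MyProj*.txt;*.md"))
```

With the default pattern `*project_name*.txt`, only extract files whose names contain the project name would be selected under this reading.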

In [None]:
#----------Parameters to Edit----------
output_folder: Path = SIL_NLP_ENV.mt_experiments_dir / "OnboardingRequests" / project_name / "verse_counts"
input_folder: Path = SIL_NLP_ENV.mt_scripture_dir
files: str = f"*{project_name}*.txt"
deutero: bool = False
recount: bool = False
#--------------------------------------

verse_counts_config = {
    "input_folder": input_folder,
    "output_folder": output_folder,
    "files": files,
    "deutero": deutero,
    "recount": recount,
}

onboard_project.collect_verse_counts_wrapper(project_name, verse_counts_config, overwrite)

## Wildebeest Analysis

This block generates a Wildebeest report and stores the results in MT/experiments/OnboardingRequests/project_name/wildebeest.

### Parameters
- **max_examples**: int - Max number of examples per line. Default is 500.
- **max_cases**: int - Max number of cases per group. Default is 500.
- **ref_id_file**: str - Reference filename. Default is silnlp/assets/vref.txt.

In [None]:
#----------Parameters to Edit----------
max_examples: int = 500
max_cases: int = 500
ref_id_file: str = "silnlp/assets/vref.txt"
#--------------------------------------

wildebeest_config = {
    "max_examples": max_examples,
    "max_cases": max_cases,
    "ref_id_file": ref_id_file,
}

onboard_project.wildebeest_analysis_wrapper(project_name, wildebeest_config, overwrite)

## Stats

This block calculates tokenization statistics by running silnlp.nmt.preprocess --stats with the project as both the source and the target. Results are stored in MT/experiments/OnboardingRequests/project_name/stats.

### Parameters
- **stats_config**: dict - The config to use for the preprocess step. Default is None.

In [None]:
#----------Parameters to Edit----------
stats_config: dict = None
#--------------------------------------

onboard_project.calculate_tokenization_stats(project_name, stats_config, overwrite=overwrite)