**** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
    make venv 
    source venv/bin/activate 
    pip install jupyterlab
```

In [1]:
%%capture
# These installs are here for reference only.
# Users and application developers should pin the tag of the latest release on PyPI.
%pip install data-prep-toolkit
%pip install data-prep-toolkit-transforms==0.2.2.dev3

**** Configure the transform parameters. The dictionary keys that hold the HAP transform configuration are as follows:
 - model_name_or_path - the HAP model to use, which should be compatible with HuggingFace's AutoModelForSequenceClassification. Defaults to IBM's open-source toxicity classifier ibm-granite/granite-guardian-hap-38m.
 - annotation_column - the name of the column that holds the HAP (toxicity) score in the output .parquet file. Defaults to hap_score.
 - doc_text_column - the name of the column that holds the document text in the input .parquet file. Defaults to contents.
 - batch_size - the inference batch size; adjust it to the capacity of the infrastructure. Defaults to 128.
 - max_length - the maximum length for the tokenizer. Defaults to 512.
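
As a minimal sketch of how the defaults listed above combine with user overrides, the snippet below merges a dictionary of the documented defaults with keyword overrides and checks the two numeric settings. The names `HAP_DEFAULTS` and `build_hap_params` are hypothetical helpers for illustration only and are not part of the toolkit.

```python
# Documented defaults from the list above. HAP_DEFAULTS and build_hap_params
# are hypothetical names used only for this illustration.
HAP_DEFAULTS = {
    "model_name_or_path": "ibm-granite/granite-guardian-hap-38m",
    "annotation_column": "hap_score",
    "doc_text_column": "contents",
    "batch_size": 128,
    "max_length": 512,
}


def build_hap_params(**overrides):
    """Return a copy of the defaults with the given overrides applied."""
    merged = {**HAP_DEFAULTS, **overrides}
    # Both numeric settings must be positive integers to make sense.
    for key in ("batch_size", "max_length"):
        if not isinstance(merged[key], int) or merged[key] <= 0:
            raise ValueError(f"{key} must be a positive integer, got {merged[key]!r}")
    return merged


# Lower the batch size for a machine with less memory; other keys keep their defaults.
small_batch = build_hap_params(batch_size=32)
print(small_batch["batch_size"], small_batch["max_length"])  # 32 512
```

Keeping the defaults in one place makes it easy to see which values a given run actually changed.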

***** Import required classes and modules

In [2]:
import ast
import os
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from hap_transform_python import HAPPythonTransformConfiguration

[nltk_data] Downloading package punkt_tab to /Users/ian/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


***** Setup runtime parameters for this transform

In [3]:
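# Create the launcher parameters.
# Notebooks do not define __file__, so the current working directory stands in for it.
# Note that os.path.dirname() below strips the last path component, so the folders
# are resolved relative to the parent of the working directory.
__file__ = os.getcwd()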
input_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "../test-data/input"))
output_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "../output"))
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}

params = {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
}


hap_params = {
    "model_name_or_path": "ibm-granite/granite-guardian-hap-38m",
    "annotation_column": "hap_score",
    "doc_text_column": "contents",
    "inference_engine": "CPU",
    "max_length": 512,
    "batch_size": 128,
}
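
The launcher only picks up `.parquet` files and exits early when the input folder contains none, so it can be worth checking the folder before launching. The sketch below uses only the standard library; `find_input_files` is a hypothetical helper, not a toolkit function, and the temporary directory stands in for the real `input_folder`.

```python
import tempfile
from pathlib import Path


def find_input_files(folder, extensions=(".parquet",)):
    """Return the sorted files directly inside `folder` whose suffix is in `extensions`."""
    root = Path(folder)
    if not root.is_dir():
        # A missing folder is treated the same as an empty one.
        return []
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix in extensions)


# Demonstration on a throwaway directory; in the notebook, pass input_folder instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "test1.parquet").touch()
    (Path(tmp) / "notes.txt").touch()
    found = [p.name for p in find_input_files(tmp)]

print(found)  # ['test1.parquet']
```

An empty result for the real input folder usually means the path resolves to a different directory than intended.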

***** Use python runtime to invoke the transform

In [4]:
%%capture
sys.argv = ParamsUtils.dict_to_req(d=params | hap_params)
launcher = PythonTransformLauncher(runtime_config=HAPPythonTransformConfiguration())
launcher.launch()

11:29:11 INFO - hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} 
11:29:11 INFO - pipeline id pipeline_id
11:29:11 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
11:29:11 INFO - data factory data_ is using local data access: input_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/test-data/input output_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/output
11:29:11 INFO - data factory data_ max_files -1, n_sample -1
11:29:11 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
11:29:11 INFO - orchestrator hap started at 2024-12-03 11:29:11
11:29:11 ERROR - No input files to process - exiting
11:29:11 INFO - Completed execution in 0.0 min, execution resul
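
The launch cell merges the two parameter dictionaries with `|` and hands the result to the launcher as command-line style arguments through `sys.argv`. The sketch below illustrates that general pattern with the standard library only. `dict_to_argv` is a hypothetical stand-in written for this example and is not the implementation of `ParamsUtils.dict_to_req`; the `--key=value` layout, the demo values, and the use of `str()` plus `ast.literal_eval` for the nested dictionary are assumptions made for illustration.

```python
import argparse
import ast


def dict_to_argv(d, executor="launcher.py"):
    """Hypothetical stand-in: render a flat dict as ['prog', '--key=value', ...]."""
    return [executor] + [f"--{key}={value}" for key, value in d.items()]


demo_params = {
    # Nested settings travel as a string that can be parsed back into a dict.
    "data_local_config": str({"input_folder": "/tmp/in", "output_folder": "/tmp/out"}),
    "runtime_job_id": "job_id",
}
demo_hap_params = {"max_length": 512, "batch_size": 128}

# `a | b` (Python 3.9+) builds a new dict; on duplicate keys the right operand wins.
demo_argv = dict_to_argv(demo_params | demo_hap_params)

parser = argparse.ArgumentParser()
parser.add_argument("--data_local_config", type=ast.literal_eval)
parser.add_argument("--runtime_job_id")
parser.add_argument("--max_length", type=int)
parser.add_argument("--batch_size", type=int)

# Skip the program name, as argparse does with sys.argv[0].
args = parser.parse_args(demo_argv[1:])

print(args.data_local_config["input_folder"])  # /tmp/in
print(args.batch_size)  # 128
```

Passing everything through an argument list lets the same configuration be used from a notebook and from a command line without changes.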

**** The output folder will contain the transformed parquet files along with a metadata.json file.

In [5]:
# List the files written to the output folder
import glob
glob.glob("../output/*")

['../output/metadata.json', '../output/test1.parquet']