##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab could be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv 
source venv/bin/activate 
pip install jupyterlab
```

In [1]:
%%capture
## This cell is here for reference only.
# Users and application developers must use the right tag for the latest release on PyPI.
%pip install data-prep-toolkit
%pip install data-prep-toolkit-transforms==0.2.2.dev3

##### **** Configure the transform parameters. The dictionary keys that hold the DocQualityTransform configuration values are as follows:
* text_lang - specifies the language of the text content. By default, "en" is used.
* doc_content_column - specifies the name of the column that contains the document text. By default, "contents" is used.
* bad_word_filepath - specifies the local path (a file or a directory) to the bad word file. You do not have to set this parameter if you do not need to specify bad words.
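
To make the role of `bad_word_filepath` more concrete, here is a minimal sketch of how a bad word file could be loaded and checked against a document. It assumes the file holds one term per line, as the LDNOOBW lists do. The helpers `load_bad_words` and `contains_bad_word` are hypothetical names introduced for illustration; the matching logic inside DocQualityTransform may differ.

```python
import re
import tempfile
from pathlib import Path


def load_bad_words(path: str) -> set[str]:
    """Hypothetical helper: read one term per line from a file,
    or from every file in a directory."""
    p = Path(path)
    files = sorted(f for f in p.iterdir() if f.is_file()) if p.is_dir() else [p]
    words = set()
    for f in files:
        for line in f.read_text(encoding="utf-8").splitlines():
            term = line.strip().lower()
            if term:  # skip blank lines
                words.add(term)
    return words


def contains_bad_word(text: str, bad_words: set[str]) -> bool:
    """Hypothetical helper: True if any whole word of the text is in the set.
    Multi-word phrases are not matched by this simple tokenization."""
    tokens = re.findall(r"\w+", text.lower())
    return any(tok in bad_words for tok in tokens)


# Small self-contained demo with a made-up word list.
with tempfile.TemporaryDirectory() as tmp:
    word_file = Path(tmp) / "en"
    word_file.write_text("darn\nheck\n", encoding="utf-8")
    words = load_bad_words(str(word_file))
    print(contains_bad_word("Well, heck.", words))
```

Matching on whole tokens rather than substrings avoids flagging harmless words that merely contain a listed term.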

##### ***** Import required classes and modules

In [2]:
import os
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from doc_quality_transform import (
    bad_word_filepath_cli_param,
    doc_content_column_cli_param,
    text_lang_cli_param,
)
from doc_quality_transform_python import DocQualityPythonTransformConfiguration

##### ***** Setup runtime parameters for this transform

In [3]:

# create parameters
input_folder = os.path.join("python", "test-data", "input")
output_folder = os.path.join("python", "output")
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # execution info
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
    # doc_quality params
    text_lang_cli_param: "en",
    doc_content_column_cli_param: "contents",
    bad_word_filepath_cli_param: os.path.join("python", "ldnoobw", "en"),
}
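
The nested dictionaries above (`local_conf`, `code_location`) are passed through `ParamsUtils.convert_to_ast` before they go into `params`. The sketch below illustrates the general idea of carrying a dictionary as a single string value and recovering it later with `ast.literal_eval`. The functions `to_literal_string` and `from_literal_string` are hypothetical stand-ins, and whether the library does exactly this is an assumption.

```python
import ast


def to_literal_string(d: dict) -> str:
    """Hypothetical stand-in: render a dict as a Python-literal string."""
    return repr(d)


def from_literal_string(s: str) -> dict:
    """Parse the literal string back into a dict without executing code."""
    return ast.literal_eval(s)


local_conf = {
    "input_folder": "python/test-data/input",
    "output_folder": "python/output",
}
encoded = to_literal_string(local_conf)   # a plain string, usable as one parameter value
decoded = from_literal_string(encoded)    # the original dict again
print(decoded == local_conf)
```

`ast.literal_eval` only accepts Python literals, so it is a safer way to parse such strings than `eval`.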

##### ***** Use python runtime to invoke the transform

In [4]:
%%capture
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=DocQualityPythonTransformConfiguration())
launcher.launch()

12:39:07 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': 'python/ldnoobw/en', 's3_cred': None, 'docq_data_factory': <data_processing.data_access.data_access_factory.DataAccessFactory object at 0x12ae67650>}
12:39:07 INFO - pipeline id pipeline_id
12:39:07 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
12:39:07 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output
12:39:07 INFO - data factory data_ max_files -1, n_sample -1
12:39:07 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:39:07 INFO - orchestrator docq started at 2024-11-25 12:39:07
12:39:07 INFO - Number of files is 1, source profile {'max_file_size': 0.0009870529174804688, 'min_file_size': 0.0009870529174804688, 'total_file_size': 0.00098705291748046
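
The launcher cell above hands its configuration to the transform through `sys.argv`. As a rough sketch of that pattern, the code below flattens a dictionary into `--key=value` arguments and parses them back. `dict_to_argv` is a hypothetical helper, not the library's `ParamsUtils.dict_to_req`, and the use of an argparse-style parser on the receiving side is an assumption.

```python
import argparse


def dict_to_argv(params: dict, program: str = "") -> list[str]:
    """Hypothetical helper: flatten a dict into [program, '--key=value', ...]."""
    return [program] + [f"--{key}={value}" for key, value in params.items()]


argv = dict_to_argv({"runtime_pipeline_id": "pipeline_id", "runtime_job_id": "job_id"})

# The receiving side can recover the values with a standard parser.
parser = argparse.ArgumentParser()
parser.add_argument("--runtime_pipeline_id")
parser.add_argument("--runtime_job_id")
args = parser.parse_args(argv[1:])  # argv[0] is the program-name slot
print(args.runtime_pipeline_id, args.runtime_job_id)
```

Setting `sys.argv` this way lets a notebook drive code that was written as a command-line entry point, without changing that code.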

##### **** The specified output folder will contain the transformed parquet files.

In [5]:
import glob
glob.glob("python/output/*")

['python/output/metadata.json', 'python/output/test1.parquet']
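
Beyond listing the files, it can be handy to summarize the output folder and peek at `metadata.json`. The sketch below uses only the standard library. `summarize_output` is a hypothetical helper, and nothing is assumed about the keys inside the metadata file.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_output(folder: str) -> dict:
    """Hypothetical helper: count files by suffix and load metadata.json if present."""
    root = Path(folder)
    files = sorted(p for p in root.iterdir() if p.is_file())
    summary = {
        "counts_by_suffix": dict(Counter(p.suffix for p in files)),
        "metadata": None,
    }
    meta = root / "metadata.json"
    if meta.exists():
        summary["metadata"] = json.loads(meta.read_text(encoding="utf-8"))
    return summary


# Only runs against the notebook's output folder when it exists.
if Path("python/output").is_dir():
    print(summarize_output("python/output"))
```

Reading the parquet files themselves would need an extra package such as pandas or pyarrow, which this sketch deliberately avoids.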