##### **** These pip installs need to be adapted to the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv 
source venv/bin/activate 
pip install jupyterlab
```

In [1]:
%%capture
## This is here as a reference only
# Users and application developers must use the right tag for the latest release on PyPI
#!pip install data-prep-toolkit
#!pip install data-prep-toolkit-transforms
#!pip install data-prep-connector

##### **** Configure the transform parameters. This notebook only shows the use of pdf2parquet_double_precision. For the complete list, refer to the README.md for this transform.

| parameter:type | Description |
| --- | --- |
| data_files_to_use: list | List of file extensions in the input folder to use when running the transform |
| pdf2parquet_double_precision: int | If set, all floating-point values (e.g. bounding boxes) are rounded to this precision. For tests, 0 is advised |
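
To illustrate what `pdf2parquet_double_precision` controls, here is a minimal sketch. It assumes the transform's rounding behaves like Python's built-in `round`; the helper `round_bbox` and the sample coordinates are hypothetical and not part of the toolkit.

```python
# Hypothetical illustration: assumes rounding equivalent to Python's round().
def round_bbox(bbox, precision):
    """Round each coordinate of a bounding box to `precision` decimal places."""
    return [round(value, precision) for value in bbox]


# Made-up coordinates (left, top, right, bottom) for demonstration only.
sample_bbox = [72.123456, 540.98765, 301.5000001, 702.25]

coarse = round_bbox(sample_bbox, 0)  # the precision used in this notebook
fine = round_bbox(sample_bbox, 2)

print(coarse)
print(fine)
```

With a precision of 0, small floating-point differences between runs or platforms are rounded away, which is presumably why that value is advised for tests.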



##### **** Import required classes and modules

In [2]:
import ast
import os
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration


##### **** Set up runtime parameters for this transform

In [3]:

# create parameters
input_folder = os.path.join("python", "test-data", "input")
output_folder = os.path.join("python", "output")
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf','.docx','.pptx','.zip']"),
    # execution info
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    # pdf2parquet params
    "pdf2parquet_double_precision": 0,
}
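
The nested `local_conf` dict is converted before it goes into `params`, since the parameters are later handed to the launcher through `sys.argv` as flat values. The sketch below shows the general idea using only the standard library: a dict is turned into its string form and recovered with `ast.literal_eval`. It is an assumption that `ParamsUtils.convert_to_ast` behaves along these lines; `to_literal` is a hypothetical stand-in, not the toolkit's implementation.

```python
import ast
import os


def to_literal(value):
    """Hypothetical stand-in: serialize a Python literal to its string form."""
    return str(value)


local_conf = {
    "input_folder": os.path.join("python", "test-data", "input"),
    "output_folder": os.path.join("python", "output"),
}

encoded = to_literal(local_conf)     # a plain string, safe to pass as one argument
decoded = ast.literal_eval(encoded)  # parsed back into a dict, no code is executed

print(type(encoded).__name__, type(decoded).__name__)
```

`ast.literal_eval` only accepts Python literals (strings, numbers, lists, dicts and similar), so it is a safer choice than `eval` for this kind of round trip.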

##### **** Use the Python runtime to invoke the transform

In [4]:
%%capture
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=Pdf2ParquetPythonTransformConfiguration())
launcher.launch()


15:13:18 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 0}
15:13:18 INFO - pipeline id pipeline_id
15:13:18 INFO - code location None
15:13:18 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output
15:13:18 INFO - data factory data_ max_files -1, n_sample -1
15:13:18 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.docx', '.pptx', '.zip'], files to checkpoint ['.parquet']
15:13:18 INFO - orchestrator pdf2parquet started at 2024-11-20 15:13:18
15:13:18 INFO - Number of files is 2, source profile {'max_file_size': 0.30131721496
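
The invocation cell assigns `sys.argv` before calling `launch()`, which suggests the launcher reads its configuration the way a command-line program would. As a rough sketch of that pattern, assuming a `--key=value` argument format (the format actually produced by `ParamsUtils.dict_to_req` may differ), the hypothetical helper `dict_to_argv` below builds such a list and `argparse` reads it back.

```python
import argparse


def dict_to_argv(params, program="demo"):
    """Hypothetical helper: flatten a dict into a command-line style list."""
    argv = [program]
    for key, value in params.items():
        argv.append(f"--{key}={value}")
    return argv


# A reduced parameter set, for illustration only.
demo_params = {
    "runtime_pipeline_id": "pipeline_id",
    "pdf2parquet_double_precision": 0,
}
argv = dict_to_argv(demo_params)

# Parse the list back, skipping the program name in position 0.
parser = argparse.ArgumentParser()
parser.add_argument("--runtime_pipeline_id", type=str)
parser.add_argument("--pdf2parquet_double_precision", type=int)
args = parser.parse_args(argv[1:])

print(argv)
print(args.pdf2parquet_double_precision)
```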

##### **** The output folder now contains the transformed parquet files.

In [5]:
import glob
glob.glob("python/output/*")

['python/output/redp5110-ch1.parquet',
 'python/output/metadata.json',
 'python/output/archive1.parquet']
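
`glob.glob` returns paths in arbitrary order, and the folder mixes parquet files with `metadata.json`. A small sketch for separating the two in a stable order follows. `split_outputs` is a hypothetical helper, and the demo runs against a temporary folder holding empty placeholder files named after the listing above, so it does not depend on the transform having been run.

```python
import tempfile
from pathlib import Path


def split_outputs(folder):
    """Return (parquet file names, other file names), each sorted by name."""
    folder = Path(folder)
    parquet = sorted(p.name for p in folder.glob("*.parquet"))
    other = sorted(p.name for p in folder.iterdir() if p.suffix != ".parquet")
    return parquet, other


# Placeholder files only; they contain no parquet data.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("redp5110-ch1.parquet", "metadata.json", "archive1.parquet"):
        (Path(tmp) / name).touch()
    parquet_files, other_files = split_outputs(tmp)

print(parquet_files)
print(other_files)
```

On the real output, the same helper could be called as `split_outputs("python/output")`.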