# Data Prep Kit Introduction


## PDF Processing Pipeline

This notebook demonstrates how to process PDF files with Data Prep Kit (DPK).


Here is the workflow:

![](https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/media/data-prep-kit-3-workflow.png)

Open this notebook in Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/santoshborse/pydatanyc2024/blob/main/dpk-intro.ipynb)

In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   !rm -rf sample_data
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False


Running in Colab


## Step-0.1: Download the Input Data


In [3]:
if RUNNING_IN_COLAB:
    !mkdir -p 'input'
    !wget -O 'input/earth.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/earth.pdf'
    !wget -O 'input/mars.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/mars.pdf'
    !wget -O 'utils.py'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/utils.py'


--2024-11-04 21:25:34--  https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/earth.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58535 (57K) [application/octet-stream]
Saving to: ‘input/earth.pdf’


2024-11-04 21:25:34 (5.19 MB/s) - ‘input/earth.pdf’ saved [58535/58535]

--2024-11-04 21:25:34--  https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/mars.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57872 (57K) [application/octet-stream]
Saving to: ‘input

## Step-0.2: Install DPK and Required Transforms


In [4]:
# Install DPK and the required transforms
if RUNNING_IN_COLAB:
  !pip install data-prep-toolkit==0.2.2.dev2 && pip install data-prep-toolkit-transforms[pdf2parquet]==0.2.2.dev1


Collecting data-prep-toolkit==0.2.2.dev2
  Downloading data_prep_toolkit-0.2.2.dev2-py3-none-any.whl.metadata (2.2 kB)
Collecting pyarrow==16.1.0 (from data-prep-toolkit==0.2.2.dev2)
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Collecting boto3==1.34.69 (from data-prep-toolkit==0.2.2.dev2)
  Downloading boto3-1.34.69-py3-none-any.whl.metadata (6.6 kB)
Collecting argparse (from data-prep-toolkit==0.2.2.dev2)
  Downloading argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting mmh3 (from data-prep-toolkit==0.2.2.dev2)
  Downloading mmh3-5.0.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (14 kB)
Collecting botocore<1.35.0,>=1.34.69 (from boto3==1.34.69->data-prep-toolkit==0.2.2.dev2)
  Downloading botocore-1.34.162-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3==1.34.69->data-prep-toolkit==0.2.2.dev2)
  Downloading jmespath-1.0.1-py3-none-any.wh

## STAGE-1: pdf2parquet - Convert data from PDF to Parquet
The pdf2parquet transform iterates through PDF files, or zip archives of PDF files, and generates Parquet files containing the converted documents, in Markdown format or, as configured in this notebook, in JSON.

The PDF conversion uses the [Docling package](https://github.com/DS4SD/docling).




In [2]:
STAGE = 1
input_folder = "input"
output_folder =  "s1-pdf2parquet"
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-1: Processing input='input' --> output='s1-pdf2parquet'


In [3]:
%%time

import ast, os, sys

from pdf2parquet_transform import (
    pdf2parquet_contents_type_cli_param,
    pdf2parquet_contents_types,
)
from data_processing.runtime.pure_python import PythonTransformLauncher
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration

from data_processing.utils import GB, ParamsUtils


# create parameters
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
}


sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
# create launcher
launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")

21:27:07 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}
INFO:pdf2parquet_transform:pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}
21:27:07 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
21:27:07 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
21:27:07 INFO - data factory data_ is using local data access: input_folder - input output_folder - s1-pdf2parquet
INFO:data_processing.data_access.data_access_factory_base30d8ba92-2747-462d-8b7f-8ec2ef17fc55:data factory data_ is using local data access: input_folder - input output_folder - s1-pdf2parquet
21:27:07 INFO - data factory data_ max_

21:29:31 INFO - Completed 1 files (50.0%) in 0.109 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 1 files (50.0%) in 0.109 min
21:29:36 INFO - Completed 2 files (100.0%) in 0.189 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 2 files (100.0%) in 0.189 min
21:29:36 INFO - Done processing 2 files, waiting for flush() completion.
INFO:data_processing.runtime.pure_python.transform_orchestrator:Done processing 2 files, waiting for flush() completion.
21:29:36 INFO - done flushing in 0.0 sec
INFO:data_processing.runtime.pure_python.transform_orchestrator:done flushing in 0.0 sec
21:29:36 INFO - Completed execution in 2.484 min, execution result 0
INFO:data_processing.runtime.pure_python.transform_launcher:Completed execution in 2.484 min, execution result 0


✅ Stage:1 completed successfully
CPU times: user 41 s, sys: 5.69 s, total: 46.7 s
Wall time: 2min 42s
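
The launch cell above passes its parameters to the transform by overwriting `sys.argv`, so the launcher can parse them as if they had come from the command line. The sketch below illustrates that pattern with the standard library only. `dict_to_argv` is a hypothetical stand-in written for this example, not the `ParamsUtils.dict_to_req` implementation, and the `--key=value` layout it produces is an assumption.

```python
import argparse
import ast
from typing import Any


def dict_to_argv(params: dict[str, Any], executor: str = "") -> list[str]:
    """Hypothetical helper: turn a parameter dict into an argv-style list."""
    argv = [executor]  # argv[0] is conventionally the program name
    for key, value in params.items():
        argv.append(f"--{key}={value}")
    return argv


example_params = {
    "data_local_config": {"input_folder": "input", "output_folder": "s1-pdf2parquet"},
    "data_files_to_use": [".pdf"],
}
argv = dict_to_argv(example_params)

# The receiving side can rebuild the Python objects from their string form.
parser = argparse.ArgumentParser()
parser.add_argument("--data_local_config", type=ast.literal_eval)
parser.add_argument("--data_files_to_use", type=ast.literal_eval)
args = parser.parse_args(argv[1:])

print(argv)
print(args.data_local_config["output_folder"], args.data_files_to_use)
```

Passing dictionaries and lists as literal strings is what makes a step like `ast.literal_eval` necessary on the receiving side.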


In [16]:
# Inspect the STAGE 1 output
from utils import read_parquet_files_as_df
output_df = read_parquet_files_as_df(output_folder)
print ("Output dimensions (rows x columns)= ", output_df.shape)
output_df.head(10)

Output dimensions (rows x columns)=  (8, 16)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.84518433, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.18510437, 570.83258057, 374.99838257, 581...",dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...
3,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87112427, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...
6,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...
7,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...


In [7]:
import pprint
import json

# Pretty-print the Docling JSON stored in the first row's 'contents' column
pprint.pprint(json.loads(output_df.iloc[0]["contents"]))

{'_name': '',
 'description': {'logs': []},
 'equations': [],
 'figures': [],
 'file-info': {'#-pages': 1,
               'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',
               'filename': 'mars.pdf',
               'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',
                                'model': 'default',
                                'page': 1}]},
 'footnotes': [],
 'main-text': [{'name': 'Section-header',
                'prov': [{'bbox': [133.35137939,
                                   654.45184326,
                                   169.88169861,
                                   667.98492432],
                          'page': 1,
                          'span': [0, 4]}],
                'text': 'Mars',
                'type': 'subtitle-level-1'},
               {'name': 'Section-header',
                'prov': [{'bbox': [133.09541321,
                                   630.681

## STAGE 2 - Chunk the PDF file contents into multiple logical documents
This transform uses the Quackling `HierarchicalChunker` to chunk documents according to their layout segmentation, i.e. it respects the original document components such as paragraphs, tables, and enumerations. It relies on documents converted with the Docling library in the pdf2parquet transform using the option `contents_type: "application/json"`, which provides the required JSON structure.
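
To make the idea of layout-based chunking concrete, here is a minimal sketch that walks a Docling-style `main-text` list and emits one chunk per text element, prefixed with the nearest preceding section header. It is a simplified illustration, not the Quackling implementation: the function name `chunk_by_layout`, the `"Text"` element name, and the toy document are assumptions made for this example.

```python
from typing import Any


def chunk_by_layout(doc: dict[str, Any]) -> list[dict[str, Any]]:
    """Emit one chunk per text element, prefixed with the latest heading."""
    chunks = []
    heading = None
    for index, item in enumerate(doc.get("main-text", [])):
        text = item.get("text", "")
        if item.get("name") == "Section-header":
            heading = text  # remember the heading; it is not a chunk by itself
            continue
        if not text:
            continue  # skip elements without text
        contents = f"{heading}\n{text}" if heading else text
        chunks.append({"contents": contents, "doc_jsonpath": f"$.main-text[{index}]"})
    return chunks


# Toy document for illustration only; the real Docling JSON carries more fields.
toy_doc = {
    "main-text": [
        {"name": "Section-header", "text": "Mars"},
        {"name": "Text", "text": "Mars is the fourth planet from the Sun."},
        {"name": "Section-header", "text": "Basic facts"},
        {"name": "Text", "text": "Mars has two moons."},
    ]
}

chunks = chunk_by_layout(toy_doc)
for chunk in chunks:
    print(chunk)
```

Each chunk keeps a JSON path back to the element it came from, similar in spirit to the `doc_jsonpath` column in the transform's output.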

In [8]:
# Set Input/output Folder
STAGE = 2
input_folder = output_folder # s1-pdf2parquet - previous output folder is the input folder for the current stage
output_folder = 's2-doc-chunks'

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input = '{input_folder}' --> output = '{output_folder}'")

🏃🏼 STAGE-2: Processing input='s1-pdf2parquet' --> output='s2-doc-chunks'


In [10]:
!pip install llama_index

Collecting llama_index
  Downloading llama_index-0.11.21-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.4.0,>=0.3.4 (from llama_index)
  Downloading llama_index_agent_openai-0.3.4-py3-none-any.whl.metadata (728 bytes)
Collecting llama-index-cli<0.4.0,>=0.3.1 (from llama_index)
  Downloading llama_index_cli-0.3.1-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.12.0,>=0.11.20 (from llama_index)
  Downloading llama_index_core-0.11.21-py3-none-any.whl.metadata (2.4 kB)
Collecting llama-index-embeddings-openai<0.3.0,>=0.2.4 (from llama_index)
  Downloading llama_index_embeddings_openai-0.2.5-py3-none-any.whl.metadata (686 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.3.0 (from llama_index)
  Downloading llama_index_indices_managed_llama_cloud-0.4.0-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama_index)
  Downloading llama_index_legacy-0.9.48.post3-py3-none-any.whl.metadata (8.5 kB)
Collecti

In [11]:
%%time

from data_processing.runtime.pure_python import PythonTransformLauncher
from doc_chunk_transform_python import DocChunkPythonTransformConfiguration

# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # doc_chunk arguments
    # ...
}

# Pass the commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = PythonTransformLauncher(DocChunkPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")

21:43:35 INFO - doc_chunk parameters are : {'chunking_type': <chunking_types.DL_JSON: 'dl_json'>, 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30}
INFO:doc_chunk_transformcfg:doc_chunk parameters are : {'chunking_type': <chunking_types.DL_JSON: 'dl_json'>, 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30}
21:43:35 INFO - pipeline id pipeline_id


✅ Stage:2 completed successfully
CPU times: user 3.29 s, sys: 359 ms, total: 3.65 s
Wall time: 4.78 s


In [18]:
# Inspect Stage 2 output
from utils import read_parquet_files_as_df
input_df = read_parquet_files_as_df(input_folder)
output_df = read_parquet_files_as_df(output_folder)
print (f"Files processed : {input_df.shape[0]:,}")
print (f"Chunks created : {output_df.shape[0]:,}")
print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

Files processed : 2
Chunks created : 8
Output dimensions (rows x columns)=  (8, 16)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.84518433, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.18510437, 570.83258057, 374.99838257, 581...",dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...
3,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87112427, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...
6,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...
7,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...


## STAGE 3 - DOC ID generation of Chunks
This transform annotates documents with document IDs. It supports the following annotations of the original data:

 - Adding a document hash: adds a hash-based ID to the data. The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To enable this annotation, set **hash_column** (passed as `doc_id_hash_column` in this notebook) to the name of the column where you want to store it.
 - Adding an integer document ID: adds an integer ID that is unique across all rows in all tables provided to the `transform()` method. To enable this annotation, set **int_column** (passed as `doc_id_int_column` in this notebook) to the name of the column where you want to store it.

**This is a prerequisite for fuzzy dedup** in the pipeline.
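
The two annotations can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the transform's own code: `add_doc_ids` is a hypothetical helper, rows are modelled as dictionaries instead of Parquet tables, and the sample texts are made up. Only the hash formula is taken from the description above.

```python
import hashlib


def add_doc_ids(rows, doc_column="contents", hash_column="chunk_hash",
                int_column="chunk_id", start_id=0):
    """Return copies of the rows with a SHA-256 hash and a running integer id."""
    annotated = []
    for offset, row in enumerate(rows):
        new_row = dict(row)  # copy so the input rows stay untouched
        new_row[hash_column] = hashlib.sha256(
            row[doc_column].encode("utf-8")
        ).hexdigest()
        new_row[int_column] = start_id + offset
        annotated.append(new_row)
    return annotated


sample_rows = [
    {"contents": "Earth is the third planet from the Sun."},
    {"contents": "Mars is the fourth planet from the Sun."},
    {"contents": "Earth is the third planet from the Sun."},
]
annotated_rows = add_doc_ids(sample_rows)
for row in annotated_rows:
    print(row["chunk_id"], row["chunk_hash"][:12], row["contents"])
```

Identical text always produces the same hash, while the integer ID stays unique per row, which is why the two columns serve different purposes downstream.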

In [19]:
# Set Input/output Folder
STAGE = 3
input_folder = output_folder # s2-doc-chunks - previous output folder is the input folder for the current stage
output_folder = 's3-doc-ids'

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input = '{input_folder}' --> output = '{output_folder}'")

🏃🏼 STAGE-3: Processing input = 's2-doc-chunks' --> output = 's3-doc-ids'


In [20]:
%%time

from data_processing.runtime.pure_python import PythonTransformLauncher
from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    # doc id configuration
    "doc_id_doc_column": "contents",
    "doc_id_hash_column": "chunk_hash",
    "doc_id_int_column": "chunk_id",
}
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch

launcher = PythonTransformLauncher(DocIDPythonTransformRuntimeConfiguration())

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")

21:57:13 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}
INFO:doc_id_transform_base:Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}
21:57:13 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
21:57:13 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
21:57:13 INFO - data factory data_ is using local data access: input_folder - s2-doc-chunks output_folder - s3-doc-ids
INFO:data_processing.data_access.data_access_factory_base30d8ba92-2747-462d-8b7f-8ec2ef17fc55:data factory data_ is using local data access: input_folder - s2-doc-chunks output_folder - s3-doc-ids
21:57:13 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_base30d8ba92-2747-462d-8b7f-8ec2ef17fc55:data factory data_ m

✅ Stage:3 completed successfully
CPU times: user 63.7 ms, sys: 2.89 ms, total: 66.6 ms
Wall time: 85.4 ms


In [26]:
# Inspect stage 3 output
from utils import read_parquet_files_as_df

input_df = read_parquet_files_as_df(input_folder)
output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"Newly added columns = {set(output_df.columns)-set(input_df.columns)}")
output_df.head(10)



Input data dimensions (rows x columns)=  (8, 16)
Output data dimensions (rows x columns)=  (8, 18)
Newly added columns = {'chunk_id', 'chunk_hash'}


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_hash,chunk_id
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.84518433, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,4
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.18510437, 570.83258057, 374.99838257, 581...",dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,5
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...,a31663e06fac41470ecc459f5a58658a3f9997d7801053...,6
3,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-04T21:29:36.635612,4.782521,mars.pdf,f1a7c918-a02b-47ea-be91-bf00fe1020a2,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87112427, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,0
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,1
6,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,2
7,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-04T21:29:31.832960,6.511793,earth.pdf,b75e7b29-6c82-4f46-9f4a-6ac94f18b612,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,3


## STAGE 4 - Exact Dedup
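
As a rough illustration of what exact deduplication means, the sketch below keeps the first row for each distinct content hash and drops later repeats. It is not the DPK exact dedup transform: `exact_dedup` is a hypothetical helper, rows are plain dictionaries, and the sample texts are placeholders invented for this example.

```python
import hashlib


def exact_dedup(rows, doc_column="contents"):
    """Keep the first row for each distinct content hash; drop later repeats."""
    seen = set()
    kept = []
    for row in rows:
        digest = hashlib.sha256(row[doc_column].encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical content was already kept
        seen.add(digest)
        kept.append(row)
    return kept


# Placeholder rows: two files that share one identical chunk.
dedup_input = [
    {"filename": "earth.pdf", "contents": "Shared introduction text."},
    {"filename": "earth.pdf", "contents": "Earth-specific text."},
    {"filename": "mars.pdf", "contents": "Shared introduction text."},
    {"filename": "mars.pdf", "contents": "Mars-specific text."},
]
dedup_output = exact_dedup(dedup_input)
print(f"Rows before: {len(dedup_input)}, rows after: {len(dedup_output)}")
```

Only byte-identical text is removed by this approach; chunks that differ by even one character are kept.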


In [24]:
# Set Input/output Folder
STAGE = 4
input_folder = output_folder # s3-doc-ids - previous output folder is the input folder for the current stage
output_folder =  "s4-exact-dedup"

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

Output data dimensions (rows x columns)=  Index(['filename', 'num_pages', 'num_tables', 'num_doc_elements', 'ext',
       'hash', 'size', 'date_acquired', 'pdf_convert_time', 'source_filename',
       'source_document_id', 'contents', 'doc_jsonpath', 'page_number', 'bbox',
       'document_id', 'chunk_hash', 'chunk_id'],
      dtype='object')


{'chunk_hash', 'chunk_id'}