# Data Prep Kit Demo 1 - Ray Version

This notebook introduces Data Prep Kit (DPK) and showcases some of its capabilities.

Here is the workflow

![](https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/media/data-prep-kit-3-workflow.png)

References
- [Data Prep Kit](https://ibm.github.io/data-prep-kit) is an open-source framework that helps with data wrangling.
- [GitHub repo](https://github.com/IBM/data-prep-kit)


## How to run this notebook

Two options:

- **Google Colab:** the easiest option; no setup is required. Click this badge to open the notebook in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/santoshborse/pydatanyc2024/blob/main/dpk_intro_1_ray.ipynb)

## Step-1: Inspect the Data

We will use simple PDFs about the solar system. The files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data/solar-system):

- [earth.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/earth.pdf)
- [mars.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/mars.pdf)



## Step-2: Figure out Runtime Environment

### 2.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.

In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

Running in Colab


### 2.2 - Download Data if running on Google Colab

In [2]:
if RUNNING_IN_COLAB:
    !mkdir -p 'input'
    !wget -O 'input/earth.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/earth.pdf'
    !wget -O 'input/mars.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/mars.pdf'
    !wget -O 'utils.py'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/utils.py'

--2024-11-05 03:11:06--  https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/earth.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58535 (57K) [application/octet-stream]
Saving to: ‘input/earth.pdf’


2024-11-05 03:11:07 (3.65 MB/s) - ‘input/earth.pdf’ saved [58535/58535]

--2024-11-05 03:11:07--  https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/mars.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57872 (57K) [application/octet-stream]
Saving to

### 2.3 - Install dependencies if running on Google Colab

In [None]:
if RUNNING_IN_COLAB:
    ! pip install  --default-timeout=100  \
        data-prep-toolkit-transforms==0.2.1 \
        data-prep-toolkit-transforms-ray==0.2.1 \
        deepsearch-toolkit

### 2.4 - Restart Runtime

After installing dependencies, be sure to <font color="red">restart the runtime</font> so that the libraries are loaded.

You do this by going to **`Runtime --> Restart Session`**

Then you can continue to the next step (no need to re-run the notebook)
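After the restart, it can be reassuring to confirm that the packages are visible to the new session. The following is a minimal sketch using only the standard library; the helper name `installed_version` is ours, and the package names are the ones from the pip command in section 2.3.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names are taken from the pip command in section 2.3.
for name in ("data-prep-toolkit-transforms", "data-prep-toolkit-transforms-ray"):
    print(f"{name}: {installed_version(name) or 'NOT INSTALLED'}")
```

If either line reports `NOT INSTALLED`, re-run the install cell and restart the runtime again.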

## Step-2: Configuration

### 2.1 - Basic Config

In [3]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   !rm -rf sample_data/
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

Running in Colab


In [22]:
import os

## Configuration
class MyConfig:
    pass

MY_CONFIG = MyConfig ()

if RUNNING_IN_COLAB:
  MY_CONFIG.INPUT_DATA_DIR = 'input'
else:
  MY_CONFIG.INPUT_DATA_DIR = os.path.join (os.path.abspath (''), '..', 'data', 'solar-system')
MY_CONFIG.OUTPUT_FOLDER = "output"
MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , "output_final")


## RAY CONFIGURATION
### For local runs, we can use more parallelism
### For google colab, be conservative as usually only 2 CPUs are available in free account

if RUNNING_IN_COLAB:
  MY_CONFIG.RAY_RUNTIME_WORKERS = 2
  MY_CONFIG.RAY_NUM_CPUS =  0.3
  MY_CONFIG.RAY_MEMORY_GB = 2  # GB
else:  # local run
  num_cpus_available =  os.cpu_count()
  # print (num_cpus_available)
  MY_CONFIG.RAY_NUM_CPUS =  1
  MY_CONFIG.RAY_MEMORY_GB = 2  # GB
  # MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3
  MY_CONFIG.RAY_RUNTIME_WORKERS = 2

print ('MY_CONFIG.RAY_RUNTIME_WORKERS:', MY_CONFIG.RAY_RUNTIME_WORKERS)
print ('MY_CONFIG.RAY_NUM_CPUS:', MY_CONFIG.RAY_NUM_CPUS)
print ('MY_CONFIG.RAY_MEMORY_GB:', MY_CONFIG.RAY_MEMORY_GB)


MY_CONFIG.RAY_RUNTIME_WORKERS: 2
MY_CONFIG.RAY_NUM_CPUS: 0.3
MY_CONFIG.RAY_MEMORY_GB: 2
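To see why these values are conservative, it helps to add up what the workers ask for. Each Ray worker requests `num_cpus` logical CPUs, so the pool as a whole requests workers × CPUs per worker. The sketch below is only a rough budget check; the helper name `cpu_budget` is hypothetical, and it ignores whatever other Ray processes need.

```python
import os


def cpu_budget(num_workers, cpus_per_worker, cpus_available=None):
    """Return (requested, available, fits) for a pool of identical workers."""
    if cpus_available is None:
        # Fall back to the machine's CPU count (os.cpu_count() may return None).
        cpus_available = os.cpu_count() or 1
    requested = num_workers * cpus_per_worker
    return requested, cpus_available, requested <= cpus_available


# Colab values from the configuration cell, assuming 2 CPUs are available.
requested, available, fits = cpu_budget(num_workers=2, cpus_per_worker=0.3, cpus_available=2)
print(f"requested={requested:.1f} CPUs, available={available}, fits={fits}")
```

With the Colab settings, the two workers together request well under the two CPUs assumed here, which leaves headroom for the rest of the runtime.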


In [5]:
## Add parent dir to path
import os,sys

this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append (os.path.abspath (parent_dir))

### 2.2 - Set up input/output directories

In [6]:
import os, sys
import shutil

if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):
    raise Exception (f"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found")

output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')
output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')
output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')
output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')
output_fuzzy_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_fuzzy_dedupe_out')
output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '06_embeddings_out')

## clear output folder
shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)

print ("✅ Cleared output directory")

✅ Cleared output directory


## Step-3: pdf2parquet -  Convert data from PDF to Parquet

This step reads the input folder containing the PDF files and ingests them into a Parquet table using the [Docling package](https://github.com/DS4SD/docling).
The documents are converted into a JSON format, which makes them easy to chunk in later steps.



### 3.1 - Set Input/output Folder

In [7]:
STAGE = 1

input_folder = MY_CONFIG.INPUT_DATA_DIR
output_folder =  output_parquet_dir

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-1: Processing input='input' --> output='output/01_parquet_out'


### 3.2 - Execute

In [8]:
%%time

import ast
import os
import sys

from pdf2parquet_transform import (
    pdf2parquet_contents_type_cli_param,
    pdf2parquet_contents_types,
)
from data_processing_ray.runtime.ray import RayTransformLauncher
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration
from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration

from data_processing.utils import GB, ParamsUtils


# create parameters
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS, "memory": MY_CONFIG.RAY_MEMORY_GB * GB}
ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    # "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    "runtime_num_workers": 1,  # setting it to 1 for this instance
}


sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
# create launcher
launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())
# launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")


03:15:59 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}
INFO:pdf2parquet_transform:pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}
03:15:59 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
03:15:59 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
03:15:59 INFO - number of workers 1 worker options {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1}
INFO:data_processing_ray.runtime.ray.execution_configuration:number of workers 1 worker options {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1}
03:15:59 INFO - actor creation delay 0
INFO:data_processing_ray.runtime.ray.exec

(RayTransformFileProcessor pid=2215) Progress: |--------------------------------------------------| 0.1% Complete

(RayTransformFileProcessor pid=2215) Downloading recognition model, please wait. This may take several minutes depending upon your network connection.


(RayTransformFileProcessor pid=2215) Progress: |--------------------------------------------------| 0.2% Complete
Progress: |██████████████------------------------------------| 29.3% Complete
Progress: |███████████████████████████████████████████-------| 86.3% Complete
Progress: |███████████████████████████████████████████-------| 87.0% Complete
Progress: |██████████████████████████████████████████████████| 100.0% Complete


(orchestrate pid=2070) 03:17:24 INFO - Completed 1 files in 1.21 min
(orchestrate pid=2070) 03:17:24 INFO - Completed 1 files (50.0%)  in 1.21 min. Waiting for completion
(orchestrate pid=2070) 03:17:29 INFO - Completed processing 2 files in 1.29 min
(orchestrate pid=2070) 03:17:29 INFO - done flushing in 0.002 sec
03:17:39 INFO - Completed execution in 1.674 min, execution result 0
INFO:data_processing_ray.runtime.ray.transform_launcher:Completed execution in 1.674 min, execution result 0


✅ Stage:1 completed successfully
CPU times: user 11.1 s, sys: 1.52 s, total: 12.7 s
Wall time: 1min 56s


### 3.3 - Inspect Generated output

Here we should see one entry per input file processed.

In [9]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display certain columns
# output_df[['column1', 'column2', 'column3']].head(5)

Output dimensions (rows x columns)=  (2, 12)


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,ext,hash,size,date_acquired,pdf_convert_time,source_filename
0,mars.pdf,"{""_name"":"""",""type"":""pdf-document"",""description...",1,0,11,b89261d9-fcaf-488e-b0a2-ded576779a51,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf
1,earth.pdf,"{""_name"":"""",""type"":""pdf-document"",""description...",1,0,11,d4fd8ebe-fb1e-4425-9334-cb547962b117,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf



### 3.4 - Understand the output

Here are some interesting attributes to note:

- **filename**: original filename
- **contents**: the extracted document content, stored as a JSON string
- **document_id**: unique ID (UUID) assigned to this document
- **hash**: hash of the document
- **pdf_convert_time**: time taken to convert this PDF, in seconds

Let's inspect the **contents** column.  See how the text is being divided up!

In [None]:
import pprint
import json

pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))
# json.loads(output_df.iloc[0, ]['contents'])

In [None]:
pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))

##  Step-4: Doc chunks

In the previous step, we extracted text from our PDFs. But the content of each file is stored as a single row in our Parquet output.

In this step, we are going to split the documents into chunks, according to their layout segmentation.

This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`
to chunk according to the document layout segmentation, i.e. respecting the original document components such as paragraphs, tables, enumerations, etc.
It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: "application/json"`,
which provides the required JSON structure.
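To build some intuition for layout-based chunking, here is a deliberately simplified sketch. It is not Quackling's `HierarchicalChunker`; it assumes a hypothetical flat list of `(kind, text)` elements and simply pairs each paragraph with the most recent heading above it. Real Docling JSON is much richer than this.

```python
def chunk_by_layout(elements):
    """Pair each paragraph with the most recent heading above it.

    `elements` is a hypothetical flat list of (kind, text) tuples, where kind
    is "heading" or "paragraph". This is an illustration only.
    """
    chunks = []
    current_heading = None
    for kind, text in elements:
        if kind == "heading":
            # Remember the heading; it becomes the prefix of later chunks.
            current_heading = text
        elif kind == "paragraph":
            prefix = f"{current_heading}\n" if current_heading else ""
            chunks.append(prefix + text)
    return chunks


sample = [
    ("heading", "Solar System"),
    ("paragraph", "Our solar system is a vast and fascinating expanse."),
    ("paragraph", "For more details about the Solar system see Chapter 1."),
    ("heading", "Mars"),
    ("paragraph", "Mars is the fourth planet from the Sun."),
]
for chunk in chunk_by_layout(sample):
    print(repr(chunk))
```

Each paragraph becomes its own chunk and keeps its heading as context, so a chunk is never cut in the middle of a paragraph.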

### 4.1 - Set Input/output Folder

In [12]:
STAGE = 2

input_folder = output_parquet_dir # previous output folder is the input folder for the current stage
output_folder =  output_chunk_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'


### 4.2 - Execute

In [13]:
%%time

from data_processing_ray.runtime.ray import RayTransformLauncher
from doc_chunk_transform_ray import DocChunkRayTransformConfiguration


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # doc_chunk arguments
    # ...
}

# Pass the commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = RayTransformLauncher(DocChunkRayTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

03:20:11 INFO - doc_chunk parameters are : {'chunking_type': <chunking_types.DL_JSON: 'dl_json'>, 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}
INFO:doc_chunk_transformcfg:doc_chunk parameters are : {'chunking_type': <chunking_types.DL_JSON: 'dl_json'>, 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}
03:20:11 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
03:20:11 INFO - code location N

✅ Stage:2 completed successfully
CPU times: user 3.81 s, sys: 689 ms, total: 4.5 s
Wall time: 36.9 s


### 4.3 - Inspect Generated output

We should see that the documents are split into many chunks.

In [15]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print (f"Files processed : {input_df.shape[0]:,}")
print (f"Chunks created : {output_df.shape[0]:,}")

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"New columns Added: {set(output_df.columns)-set(input_df.columns)}")
output_df.head(10)

Files processed : 2
Chunks created : 8
Input data dimensions (rows x columns)=  (2, 12)
Output data dimensions (rows x columns)=  (8, 16)
New columns Added: {'page_number', 'bbox', 'doc_jsonpath', 'source_document_id'}


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.84518433, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.18510437, 570.83258057, 374.99838257, 581...",dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...
3,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87112427, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...
6,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...
7,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...


### 4.4 - Understanding the Output

Here we see that the 2 PDF files are split into 8 chunks. The documents are split along 'natural boundaries' such as paragraphs and bullet points.

See how the original document ID is carried through in **source_document_id**, while each chunk gets its own **document_id**. This helps us trace every chunk back to its original document.

Also note that **contents** is now plain text (not JSON as before).
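Because every chunk row keeps a **source_document_id**, chunks can be regrouped by their original document. The sketch below shows the idea with plain dictionaries rather than the real Parquet data; the IDs and texts are made-up placeholders and the helper name `chunks_per_source` is ours.

```python
from collections import defaultdict


def chunks_per_source(rows):
    """Group chunk texts by the ID of the document they came from."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["source_document_id"]].append(row["contents"])
    return dict(grouped)


# Placeholder rows; the real rows carry UUIDs and many more columns.
rows = [
    {"source_document_id": "doc-mars", "contents": "Solar System\n..."},
    {"source_document_id": "doc-mars", "contents": "Mars\n..."},
    {"source_document_id": "doc-earth", "contents": "Earth\n..."},
]
for source_id, chunks in chunks_per_source(rows).items():
    print(source_id, "->", len(chunks), "chunk(s)")
```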

In [None]:
output_df[['filename', 'contents']]

Unnamed: 0,filename,contents
0,mars.pdf,Solar System\nOur solar system is a vast and f...
1,mars.pdf,Solar System\nFor more details about the Solar...
2,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is..."
3,mars.pdf,Basic facts about Mars:\n· Distance from the S...
4,earth.pdf,Solar System\nOur solar system is a vast and f...
5,earth.pdf,Solar System\nFor more details about our Solar...
6,earth.pdf,Earth\nEarth is the third planet from the Sun....
7,earth.pdf,Earth\nBasic facts about Earth:\n· Distance fr...


In [16]:
for f in output_df['filename'].unique():
    print ('==========' , f, '===========')
    chunks = output_df[output_df['filename'] == f]['contents']
    for idx , chunk in enumerate(chunks):
        print (f'-------Chunk {idx}------\n{chunk}\n-------')

-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
-------
-------Chunk 1------
Solar System
For more details about the Solar system see Chapter 1.
-------
-------Chunk 2------
Mars
Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.
-------
-------Chunk 3------
Basic facts about Mars:
· Distance from the Sun: Average of 228 million kilometers (142 million miles)
· Rotation Period: 24.6 hours (one Martian day - called a "sol")
· Moons: Two small moons, Phobos and Deimos.
-------
-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other ce

## Step-5:  DOC ID generation

This transform annotates documents with document "ids". It supports the following transformations of the original data:

 - Adding document hash: this enables the addition of a document hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column, where you want to store it.
 - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_id_column** to the name of the column, where you want to store it.

**This is a pre-requisite for fuzzy dedup** in the pipeline.
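The two annotations can be sketched in a few lines. This is only an illustration of the idea, not the transform's implementation: the hash uses the same `hashlib.sha256(...)` expression quoted above, the output keys mirror the column names used later in this notebook (`chunk_hash`, `chunk_id`), and the helper name `add_doc_ids` is hypothetical.

```python
import hashlib


def add_doc_ids(docs, start_id=0):
    """Return one record per text with a SHA-256 hash and a running integer ID."""
    return [
        {
            "contents": doc,
            # Same hashing expression as described in the text above.
            "chunk_hash": hashlib.sha256(doc.encode("utf-8")).hexdigest(),
            "chunk_id": start_id + offset,
        }
        for offset, doc in enumerate(docs)
    ]


records = add_doc_ids(["Solar System", "Mars", "Solar System"])
for record in records:
    print(record["chunk_id"], record["chunk_hash"][:12], record["contents"])
```

Note that identical texts get identical hashes but different integer IDs, which is what makes the hash useful for spotting duplicates in the next stage.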

### 5.1 - Set Input/output Folder

In [17]:

# Input for this stage is the output of the doc chunk stage.
# The output of this stage makes it possible for the fuzzy dedup (fdedup) component to run on the data.

STAGE  = 3

input_folder = output_chunk_dir
output_folder =  output_docid_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'


### 5.2 - Execute

In [18]:
%%time

from data_processing_ray.runtime.ray import RayTransformLauncher
from doc_id_transform_ray import DocIDRayTransformRuntimeConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # doc id configuration
    "doc_id_doc_column": "contents",
    "doc_id_hash_column": "chunk_hash",
    "doc_id_int_column": "chunk_id",
}
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch

launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

03:22:43 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}
INFO:doc_id_transform_base:Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}
03:22:43 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
03:22:43 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
03:22:43 INFO - number of workers 2 worker options {'num_cpus': 0.5, 'max_restarts': -1}
INFO:data_processing_ray.runtime.ray.execution_configuration:number of workers 2 worker options {'num_cpus': 0.5, 'max_restarts': -1}
03:22:43 INFO - actor creation delay 0
INFO:data_processing_ray.runtime.ray.execution_configuration:actor creation delay 0
03:22:43 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}
INFO:data_processing_r

✅ Stage:3 completed successfully
CPU times: user 343 ms, sys: 172 ms, total: 515 ms
Wall time: 25.7 s


### 5.3 - Inspect Generated output

You will notice we have two extra columns:

- **chunk_hash** (the hash column)
- **chunk_id** (the integer ID column)

But we still have the same number of rows as before.

In [19]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"New columns Added: {set(output_df.columns)-set(input_df.columns)}")

output_df.head(10)

Input data dimensions (rows x columns)=  (8, 16)
Output data dimensions (rows x columns)=  (8, 18)
New columns Added: {'chunk_id', 'chunk_hash'}


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_hash,chunk_id
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.84518433, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,0
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.18510437, 570.83258057, 374.99838257, 581...",dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,1
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...,a31663e06fac41470ecc459f5a58658a3f9997d7801053...,2
3,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,3
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87112427, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,4
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,5
6,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,6
7,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,7


## Step-6: Exact Dedup
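
Exact dedup drops chunks whose text is identical to a chunk we have already seen. The sketch below illustrates the idea on a single machine by hashing each text and keeping only the first occurrence. It is not the DPK transform itself; keeping the first occurrence is our assumption, and the helper name `exact_dedup` is hypothetical.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first occurrence of each distinct text; return (kept, removed)."""
    seen = set()
    kept, removed = [], []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            # Same bytes as an earlier text, so this one is an exact duplicate.
            removed.append(doc)
        else:
            seen.add(digest)
            kept.append(doc)
    return kept, removed


kept, removed = exact_dedup(["Solar System intro", "Mars facts", "Solar System intro"])
print(f"kept={len(kept)}, removed={len(removed)}")
```

In our data, the shared "Solar System" introduction appears in both PDFs, so we can expect one of the two copies to be removed.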



### 6.1 - Set Input/output Folder

In [20]:
STAGE  = 4

input_folder = output_docid_dir # previous output folder is the input folder for the current stage
output_folder =  output_exact_dedupe_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'


### 6.2 - Execute

In [23]:
%%time

from data_processing_ray.runtime.ray import RayTransformLauncher
from ededup_transform_ray import EdedupRayTransformRuntimeConfiguration


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # ededup parameters
    "ededup_hash_cpu": 0.5,
    "ededup_num_hashes": 2,
    "ededup_doc_column": "contents",
    "ededup_doc_id_column": "chunk_hash",
}

# Pass the commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

03:25:53 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}
INFO:ededup_transform_base:exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}
03:25:53 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
03:25:53 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
03:25:53 INFO - number of workers 2 worker options {'num_cpus': 0.3, 'max_restarts': -1}
INFO:data_processing_ray.runtime.ray.execution_configuration:number of workers 2 worker options {'num_cpus': 0.3, 'max_restarts': -1}
03:25:53 INFO - actor creation delay 0
INFO:data_processing_ray.runtime.ray.execution_configuration:actor creation delay 0
03:25:53 INFO - job details {'job category': 'preproces

✅ Stage:4 completed successfully
CPU times: user 342 ms, sys: 177 ms, total: 520 ms
Wall time: 26.1 s


### 6.3 - Inspect Generated output

In [24]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"Input chunks before exact dedupe : {input_df.shape[0]:,}")
print (f"Output chunks after exact dedupe : {output_df.shape[0]:,}")
print ("Duplicate chunks removed :  ", (input_df.shape[0] - output_df.shape[0]))
print (f"New columns Added: {set(output_df.columns)-set(input_df.columns)}")

output_df.head(10)

Input data dimensions (rows x columns)=  (8, 18)
Output data dimensions (rows x columns)=  (7, 19)
Input chunks before exact dedupe : 8
Output chunks after exact dedupe : 7
Duplicate chunks removed :   1
New columns Added: {'removed'}


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_hash,chunk_id,removed
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.18510437, 570.83258057, 374.99838257, 581...",dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,1,[44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...,a31663e06fac41470ecc459f5a58658a3f9997d7801053...,2,[]
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,3,[]
3,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87112427, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,4,[]
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,5,[]
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,6,[]
6,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,7,[]


In [25]:
output_df[['filename', 'contents']]

Unnamed: 0,filename,contents
0,mars.pdf,Solar System\nFor more details about the Solar...
1,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is..."
2,mars.pdf,Basic facts about Mars:\n· Distance from the S...
3,earth.pdf,Solar System\nOur solar system is a vast and f...
4,earth.pdf,Solar System\nFor more details about our Solar...
5,earth.pdf,Earth\nEarth is the third planet from the Sun....
6,earth.pdf,Earth\nBasic facts about Earth:\n· Distance fr...


In [26]:
for f in output_df['filename'].unique():
    print ('==========' , f, '===========')
    chunks = output_df[output_df['filename'] == f]['contents']
    for idx , chunk in enumerate(chunks):
        print (f'-------Chunk {idx}------\n{chunk}\n-------')

-------Chunk 0------
Solar System
For more details about the Solar system see Chapter 1.
-------
-------Chunk 1------
Mars
Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.
-------
-------Chunk 2------
Basic facts about Mars:
· Distance from the Sun: Average of 228 million kilometers (142 million miles)
· Rotation Period: 24.6 hours (one Martian day - called a "sol")
· Moons: Two small moons, Phobos and Deimos.
-------
-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
-------
-------Chunk 1------
Solar System
For more details about our Solar system see Chapter 1.
-------
-------Chunk 2------
Earth
Earth is the third planet from the Sun. It's our home p

### 6.4 - Understanding the output

Remember we had 8 chunks initially.  Now we have 7!  One duplicate chunk is removed.

If you compare the PDFs, the following paragraph appears in both `earth.pdf` and `mars.pdf`, and exact dedupe kept only one copy.  In the output above, the copy in `earth.pdf` survives and the one in `mars.pdf` is gone.  Pretty neat, eh!

```text
## Solar System

Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
```
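
To make the idea concrete, here is a minimal sketch of exact deduplication by content hash. It is not the DPK implementation; the `exact_dedupe` helper, the choice of SHA-256, and the shortened sample chunks are assumptions made for illustration only.

```python
import hashlib


def exact_dedupe(chunks):
    """Keep the first occurrence of each distinct chunk, comparing by content hash."""
    seen = set()
    kept, removed = [], []
    for text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            removed.append(text)   # same bytes already seen: drop it
        else:
            seen.add(digest)
            kept.append(text)
    return kept, removed


# Hypothetical, shortened chunks; the third repeats the first byte for byte.
chunks = [
    "Our solar system is a vast and fascinating expanse.",
    "Earth is the third planet from the Sun.",
    "Our solar system is a vast and fascinating expanse.",
    "Mars, the fourth planet from the Sun, is a cold, desert world.",
]

kept, removed = exact_dedupe(chunks)
print(f"kept={len(kept)} removed={len(removed)}")  # kept=3 removed=1
```

Because only byte-identical text produces the same hash, a chunk that differs by a single word is kept. That case is what the fuzzy dedupe step is for.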

## Step-7: Fuzzy Dedup

After exact deduplication, fuzzy deduplication removes documents that are not identical but differ only by **slight variations**, which reduces bias in the data further.

Such small variations are common in code data, for example different variable values or added logging statements, and they also show up in text, as the chunks in this notebook demonstrate.
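
One common way to quantify "slight variations" is the Jaccard similarity between sets of word shingles (runs of consecutive words). The sketch below is a toy illustration of that measure, not the DPK transform; the helper names `word_shingles` and `jaccard` are made up here. It uses the two near-identical sentences that appear in `earth.pdf` and `mars.pdf`.

```python
def word_shingles(text, size, delimiter=" "):
    """Return the set of overlapping runs of `size` consecutive words."""
    words = text.split(delimiter)
    if len(words) <= size:
        return {delimiter.join(words)}
    return {delimiter.join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


earth = "For more details about our Solar system see Chapter 1."
mars = "For more details about the Solar system see Chapter 1."

for size in (1, 2, 3):
    sim = jaccard(word_shingles(earth, size), word_shingles(mars, size))
    print(f"shingle size {size}: similarity {sim:.3f}")
# shingle size 1: similarity 0.818
# shingle size 2: similarity 0.636
# shingle size 3: similarity 0.455
```

A single changed word touches every shingle that contains it, so larger shingles make the comparison stricter. These numbers come from the toy functions above and are not expected to reproduce the decisions of the DPK transform, which has its own shingling and hashing pipeline.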

### 7.1 - Set Input/output Folder

In [31]:
## Input to this component is the output of the exact dedup component.

STAGE  = 5

input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage
output_folder =  output_fuzzy_dedupe_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-5: Processing input='output/04_exact_dedupe_out' --> output='output/05_fuzzy_dedupe_out'


### 7.2 - Execute

In [None]:
%%time

import os
import sys

from data_processing.utils import ParamsUtils
from fdedup_transform_ray import FdedupRayTransformConfiguration
from data_processing_ray.runtime.ray import RayTransformLauncher

# create parameters

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # Orchestration parameters
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # columns used
    "fdedup_doc_column": "contents",
    "fdedup_id_column": "chunk_id",
    "fdedup_cluster_column": "chunk_hash",
    # infrastructure
    "fdedup_bucket_cpu": 0.3,
    "fdedup_doc_cpu": 0.3,
    "fdedup_mhash_cpu": 0.3,
    "fdedup_num_doc_actors": 1,
    "fdedup_num_bucket_actors": 1,
    "fdedup_num_minhash_actors": 1,
    "fdedup_num_preprocessors": 1,
    # fuzzy parameters
    "fdedup_num_permutations": 64,
    "fdedup_threshold": 0.7, # (default 0.8)
    "fdedup_shingles_size": 5,
    "fdedup_delimiters": " "
}

# Pass commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch

launcher = RayTransformLauncher(FdedupRayTransformConfiguration())

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

03:30:41 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.3}}
INFO:fdedup_transform_ray:fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.3}}
03:30:

### 7.3 - Inspect Generated output

In [29]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)
print ("Duplicate chunks removed  by fuzzy-dedupe:  ", (input_df.shape[0] - output_df.shape[0]))

output_df.head(10)

Input data dimensions (rows x columns)=  (8, 18)
Output data dimensions (rows x columns)=  (6, 18)
Duplicate chunks removed  by fuzzy-dedupe:   2


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_id,chunk_hash
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.84518433, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,0,4
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.18510437, 570.83258057, 374.99838257, 581...",dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,1,5
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...,2,-1
3,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-11-05T03:17:29.759389,4.81335,mars.pdf,b89261d9-fcaf-488e-b0a2-ded576779a51,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,3,-1
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,6,-1
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-11-05T03:17:24.929663,4.296048,earth.pdf,d4fd8ebe-fb1e-4425-9334-cb547962b117,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,7,-1


In [None]:
output_df[['filename', 'contents']]

Unnamed: 0,filename,contents
0,mars.pdf,Solar System\nOur solar system is a vast and f...
1,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is..."
2,mars.pdf,Basic facts about Mars:\n· Distance from the S...
3,earth.pdf,Solar System\nFor more details about our Solar...
4,earth.pdf,Earth\nEarth is the third planet from the Sun....
5,earth.pdf,Earth\nBasic facts about Earth:\n· Distance fr...


In [None]:
for f in output_df['filename'].unique():
    print ('==========' , f, '===========')
    chunks = output_df[output_df['filename'] == f]['contents']
    for idx , chunk in enumerate(chunks):
        print (f'-------Chunk {idx}------\n{chunk}\n-------')

-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
-------
-------Chunk 1------
Mars
Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.
-------
-------Chunk 2------
Basic facts about Mars:
· Distance from the Sun: Average of 228 million kilometers (142 million miles)
· Rotation Period: 24.6 hours (one Martian day - called a "sol")
· Moons: Two small moons, Phobos and Deimos.
-------
-------Chunk 0------
Solar System
For more details about our Solar system see Chapter 1.
-------
-------Chunk 1------
Earth
Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.
-------
-------Chunk 2------
Earth
Basic fac

### 7.4 - Understanding the output

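With the exact dedupe output as input, we start with 7 rows and end up with 6.  Fuzzy dedupe removed one of the following two **very similar** chunks.  (The saved output above reports 8 input rows and 2 removals; it appears to come from a run whose input was the doc-id stage output, so that count also includes the exact duplicate.)

The two chunks are identical except for the words 'the' and 'our'.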

**earth.pdf**

`For more details about *our* Solar system see Chapter 1.`

**mars.pdf**

`For more details about *the* Solar system see Chapter 1.`

Pretty neat, eh? 👏

### Configuring Fuzzy de-dupe

You can tune fuzzy dedupe by adjusting the following parameters:

```python
    # fuzzy parameters
    "fdedup_num_permutations": 64,
    "fdedup_threshold": 0.7, #  (default 0.8)
    "fdedup_shingles_size": 5,
    "fdedup_delimiters": " "
```

In our case, we set the `fdedup_threshold` parameter to 0.7, below the default of 0.8, so chunks that are somewhat less similar still count as duplicates.
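
The parameter names (`num_permutations`, `threshold`) point to a MinHash-style similarity estimate. Here is a minimal sketch of the generic MinHash technique, assuming seeded SHA-1 hashes stand in for permutations. It is not DPK's code, and the helpers `minhash_signature` and `estimate_similarity` are hypothetical names used only for this illustration.

```python
import hashlib


def minhash_signature(items, num_permutations=64):
    """For each seeded hash function, record the smallest hash over all items."""
    signature = []
    for seed in range(num_permutations):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Word sets of the two near-identical sentences (single-word shingles for brevity).
a = set("for more details about our solar system see chapter 1".split())
b = set("for more details about the solar system see chapter 1".split())

exact = len(a & b) / len(a | b)
approx = estimate_similarity(minhash_signature(a), minhash_signature(b))
threshold = 0.7  # same value as fdedup_threshold above

print(f"exact Jaccard: {exact:.3f}")
print(f"MinHash estimate with 64 hashes: {approx:.3f}")
print("treated as near-duplicates in this toy:", approx >= threshold)
```

More hash functions give a tighter estimate at the cost of more computation, and a lower threshold marks more pairs as duplicates. The estimate is approximate, so a pair close to the threshold can land on either side of it.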


## Step-8:   Text encoding

Encode text for the vector storage.

### 8.1 - Set Input/output Folder

In [None]:
STAGE  = 6

input_folder = output_fuzzy_dedupe_dir # previous output folder is the input folder for the current stage
output_folder =  output_embeddings_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-6: Processing input='output/05_fuzzy_dedupe_out' --> output='output/06_embeddings_out'


### 8.2 - Execute

In [None]:
%%time

from text_encoder_transform_ray import TextEncoderRayTransformConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
## Embedding model
MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # text_encoder
    "text_encoder_model_name": MY_CONFIG.EMBEDDING_MODEL,
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = RayTransformLauncher(TextEncoderRayTransformConfiguration())
# Launch the ray actor(s) to process the input

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

00:36:56 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}
00:36:56 INFO - pipeline id pipeline_id
00:36:56 INFO - code location None
00:36:56 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}
00:36:56 INFO - actor creation delay 0
00:36:56 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}
00:36:56 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out
00:36:56 INFO - data factory data_ max_files -1, n_sample -1
00:36:56 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:36:56 INFO - Running locally
2024-10-20 00:36:58,405	INFO worker.py:1744 -- Started a local Ray instance.

✅ Stage:6 completed successfully
CPU times: user 566 ms, sys: 201 ms, total: 767 ms
Wall time: 22 s


### 8.3 - Inspect Generated output

You will see a column called `embeddings` added at the end.  This is the text content converted into vectors, or embeddings.  We used the model `sentence-transformers/all-MiniLM-L6-v2`.

In [None]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

Input data dimensions (rows x columns)=  (6, 18)
Output data dimensions (rows x columns)=  (6, 19)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_id,chunk_hash,embeddings
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-20T00:35:21.542009,1.907552,mars.pdf,3ae7e976-20df-4dea-96a9-5f92543ddbff,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.84518433, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,4,-1,"[0.0077404897, -0.020559434, 0.026426662, 0.01..."
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-20T00:35:21.542009,1.907552,mars.pdf,3ae7e976-20df-4dea-96a9-5f92543ddbff,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...,6,-1,"[0.07728298, 0.024971062, -0.04318075, 0.05809..."
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-20T00:35:21.542009,1.907552,mars.pdf,3ae7e976-20df-4dea-96a9-5f92543ddbff,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7,-1,"[0.1059802, 0.025460616, 0.02362733, 0.0390564..."
3,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-20T00:35:19.630048,1.933564,earth.pdf,f9ad6dea-b764-4172-9cb9-cedf00fb184c,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,1,5,"[-0.062105577, -0.0053322953, 0.03127779, 0.04..."
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-20T00:35:19.630048,1.933564,earth.pdf,f9ad6dea-b764-4172-9cb9-cedf00fb184c,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,2,-1,"[0.0724358, -0.058001805, -0.01977186, -0.0243..."
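
Embeddings matter for vector storage because similar texts end up with vectors that point in similar directions, and cosine similarity is a common way to compare them. The sketch below shows that comparison on made-up 4-dimensional vectors. The real `all-MiniLM-L6-v2` vectors have 384 dimensions, and the `cosine_similarity` helper and the sample values are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical vectors standing in for a query embedding and two chunk embeddings.
query = [0.1, 0.3, -0.2, 0.4]
chunk_embeddings = {
    "chunk_a": [0.2, 0.6, -0.4, 0.8],   # same direction as the query
    "chunk_b": [-0.3, 0.1, 0.4, 0.0],   # points mostly away from the query
}

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunk_embeddings,
    key=lambda name: cosine_similarity(query, chunk_embeddings[name]),
    reverse=True,
)
print(ranked)  # ['chunk_a', 'chunk_b']
```

A vector database performs this kind of ranking at scale, usually with approximate nearest-neighbor indexes instead of a full scan.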


## Step-9: Copy output to final output dir

In [None]:
import shutil

shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)
shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)

print (f"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'")

✅ Copied output from 'output/06_embeddings_out' --> 'output/output_final'
