<div style="background-color: #04D7FD; padding: 20px; text-align: left;">
    <h1 style="color: #000000; font-size: 36px; margin: 0;">Data Processing for RAG with Data Prep Kit (Python)</h1>
    
</div>


## Before Running the notebook

Please complete [setting up python dev environment](./setup-python-dev-env.md)

## Overview

This notebook processes PDF documents as part of a RAG pipeline.

### RAG Overview

![](media/rag-overview-2.png)

### PDF processing / cleaning

This notebook performs step 1 of the RAG pipeline.

Here is the workflow:

- docling2parquet: Extract text from PDF documents
- docid: compute hashes
- exact dedupe : filter out identical documents
- fuzzy dedupe : filter out 'near duplicates'
- document quality: scoring documents for quality
- HAP (Hate, Abuse, Profanity) detector: filtering out HAP speech

![](https://raw.githubusercontent.com/data-prep-kit/data-prep-kit/dev/examples/pdf-processing-1/images/data-prep-kit-3-workflow.png)

## Step-1: Configuration

In [1]:
from my_config import MY_CONFIG

In [2]:
## setup path to utils folder
import sys
sys.path.append('../utils')

## Step-2:  Data

We will use white papers  about LLMs.  

- [Granite Code Models](https://arxiv.org/abs/2405.04324)
- [Attention is all you need](https://arxiv.org/abs/1706.03762)

You can of course substitute your own data below.

### 2.1 - data

In [3]:
import os, sys
import shutil
from file_utils import download_file

print ("Using input folder:", MY_CONFIG.INPUT_DATA_DIR)

if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):
    raise Exception (f"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found")

Using input folder: data


### 2.2 - Set input/output path variables for the pipeline

In [4]:
import os, sys
import shutil

print ("Using output directory:", MY_CONFIG.OUTPUT_FOLDER)

## clear output folder
shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)

print ("✅ Cleared output directory")


output_docling2pq_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '1_docling2pq_out')
output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '2_docid_out')
output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '3_exact_dedupe_out')
output_fuzzy_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '4_fuzzy_dedupe_out')
output_doc_quality_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '5_doc_quality_out')
output_doc_quality_clean_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '6_doc_quality_clean_out')
output_hap_detection_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '7_hap_detection_out')
output_final_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, 'output_final')

Using output directory: output
✅ Cleared output directory


## Step-3: docling2parquet -  Convert data from PDF to Parquet

This step reads the input folder containing all the PDF files and ingests them into a Parquet table using the [Docling package](https://github.com/DS4SD/docling).
The documents are converted to Markdown (the `docling2parquet_contents_type` set in the cell below), which makes them easy to chunk in later steps.



### 3.1 - Execute 

In [5]:
%%time 

from dpk_docling2parquet import Docling2Parquet
from dpk_docling2parquet import docling2parquet_contents_types

STAGE = 1
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{MY_CONFIG.INPUT_DATA_DIR}' --> output='{output_docling2pq_dir}'\n", flush=True)

result = Docling2Parquet(input_folder=MY_CONFIG.INPUT_DATA_DIR,
                    output_folder=output_docling2pq_dir,
                    data_files_to_use=['.pdf'],
                    docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN,   # markdown
                    ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed")

🏃🏼 STAGE-1: Processing input='data' --> output='output/1_docling2pq_out'



{"time": "22:22:07", "logger": "dpk", "logLevel": "INFO", "message": "docling2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <docling2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <docling2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <docling2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8, 'pipeline': <docling2parquet_pipeline.MULTI_STAGE: 'multi_stage'>, 'generate_picture_images': False, 'generate_page_images': False, 'images_scale': 2.0}"}
{"time": "22:22:07", "logger": "dpk", "logLevel": "INFO", "message": "pipeline id pipeline_id"}
{"time": "22:22:07", "logger": "dpk", "logLevel": "INFO", "message": "code location {'github': 'UNDEFINED', 'build-date': 'UNDEFINED', 'commit_hash': 'UNDEFINED', 'path': 'UNDEFINED'}"}
{"time": "22:22:07", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ max_files -1, n_sam

✅ Stage:1 completed successfully
CPU times: user 40min 28s, sys: 1min 53s, total: 42min 22s
Wall time: 7min 50s


### 3.2 -  Inspect Generated output

Here we should see one entry per input file processed

In [6]:
from file_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_docling2pq_dir)

# print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

## To display certain columns
#parquet_df[['column1', 'column2', 'column3']].head(5)

Successfully read 8 parquet files with 8 total rows


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,date_acquired,document_convert_time,source_filename
0,granite-similar.pdf,## Granite Code Models: A Family of Open Found...,28,17,488,221416c6-c2a2-4aa4-be34-62b47f2b44b8,2995119145186144319,pdf,0de7807974e6888cabafef3484ef571ceb5c3167f7433a...,121236,2025-11-19T22:27:35.280582,144.064493,granite-similar.pdf
1,lorem-ipsum.pdf,Lorem ipsum Lorem ipsum Lorem ipsum,1,0,2,1579cb34-5a75-46af-b4c2-f0b80e0fede5,50660627009040522,pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,35,2025-11-19T22:29:53.377453,0.899349,lorem-ipsum.pdf
2,granite-duplicate.pdf,## Granite Code Models: A Family of Open Found...,28,17,484,ba897c8f-7637-49e7-9cb4-19e770fd5031,3127757990743433032,pdf,58342470e7d666dca0be87a15fb0552f949a5632606fe1...,121131,2025-11-19T22:25:11.161649,145.223961,granite-duplicate.pdf
3,spam.pdf,Free xxx,1,0,2,412bba11-c661-456d-bbbc-bf57de69cbc4,5577338085393325113,pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,8,2025-11-19T22:29:54.286406,0.90472,spam.pdf
4,attention.pdf,"Provided proper attribution is provided, Googl...",15,4,513,b21cf9b2-6c3a-4a22-a5dc-aacecf1a99dc,2949302674760005271,pdf,bcaa6e6e0c640fb63393c7e40229a512571f61ccdba42d...,48981,2025-11-19T22:22:45.882819,35.593498,attention.pdf
5,hap2.pdf,## HAP Example - Hate\n\nI hate all immigrants!,1,0,3,9a9c972d-86df-470b-acca-51a3135e9074,17586229987672717381,pdf,18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a...,45,2025-11-19T22:29:52.475010,0.917891,hap2.pdf
6,hap1.pdf,## HAP example - Abuse and Profanity\n\nYou ar...,1,0,3,d4614cde-d599-4da9-969b-1aa9ac57310d,11080891969043035065,pdf,c21aebe6661c25c508faf03d9030813497e4ded4c840f4...,135,2025-11-19T22:29:51.554267,0.814161,hap1.pdf
7,granite.pdf,## Granite Code Models: A Family of Open Found...,28,17,484,83d8d8fc-df9a-4476-832d-07fa9f58ca03,3127757990743433032,pdf,58342470e7d666dca0be87a15fb0552f949a5632606fe1...,121131,2025-11-19T22:29:50.727842,135.392525,granite.pdf


## Step-4:  Create DOC ID for Documents

This transform annotates documents with document IDs. It supports the following annotations of the original data:

 - Adding a document hash: a hash-based ID computed with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column where the hash should be stored.
 - Adding an integer document ID: an integer ID that is unique across all rows in all tables provided to the `transform()` method. To enable this annotation, set **int_id_column** to the name of the column where the ID should be stored.

In the Python API used below, these are passed as `doc_id_hash_column` and `doc_id_int_column`.

**This step is a prerequisite for fuzzy dedupe** in the pipeline.

[DocID documentation](https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/universal/doc_id)
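
To make the two annotations concrete, here is a minimal sketch in plain Python. It is not the DocID implementation: the helper name `add_doc_ids` and the list-of-dicts "tables" are assumptions made for illustration. Only the hashing expression comes from the description above.

```python
import hashlib

def add_doc_ids(tables, doc_column="contents", hash_column="doc_hash",
                int_column="int_id_column", start_id=0):
    """Annotate each row (a dict) with a SHA-256 hash and a running integer id.

    Hypothetical helper for illustration only; the real transform works on
    Parquet tables and may assign integer ids in a different order.
    """
    next_id = start_id
    for table in tables:
        for row in table:
            text = row[doc_column]
            row[hash_column] = hashlib.sha256(text.encode("utf-8")).hexdigest()
            row[int_column] = next_id  # unique across all tables
            next_id += 1
    return tables

# Two small "tables"; the first and last rows have identical contents.
tables = [
    [{"contents": "Lorem ipsum"}, {"contents": "Free xxx"}],
    [{"contents": "Lorem ipsum"}],
]
annotated = add_doc_ids(tables)

for table in annotated:
    for row in table:
        print(row["int_id_column"], row["doc_hash"][:12], row["contents"])
```

Identical contents yield identical hashes, while the integer ID stays unique per row. That is why the later fuzzy dedupe step can use `int_id_column` as its document identifier.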

### 4.1 - Execute

In [7]:
%%time

from dpk_doc_id import DocID

STAGE = 2
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_docling2pq_dir}' --> output='{output_docid_dir}'\n", flush=True)

result = DocID(input_folder= output_docling2pq_dir,
                output_folder= output_docid_dir,
                doc_id_doc_column= "contents",
                doc_id_hash_column= "doc_hash",
                # doc_id_int_column= "doc_id_int",
                doc_id_int_column= "int_id_column",
                #doc_id_start_id= 5
                ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed")


🏃🏼 STAGE-2: Processing input='output/1_docling2pq_out' --> output='output/2_docid_out'



{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'doc_hash', 'int_column': 'int_id_column', 'start_id': 0}"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "pipeline id pipeline_id"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "code location {'github': 'UNDEFINED', 'build-date': 'UNDEFINED', 'commit_hash': 'UNDEFINED', 'path': 'UNDEFINED'}"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ max_files -1, n_sample -1"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ Data Access:  DataAccessLocal"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", 

✅ Stage:2 completed successfully
CPU times: user 48.5 ms, sys: 7.21 ms, total: 55.8 ms
Wall time: 44.3 ms


### 4.2 - Inspect Generated output

You should see two new columns: **doc_hash** and **int_id_column**.

In [8]:
from file_utils import read_parquet_files_as_df
print ("Displaying contents of : ", output_docid_dir)
output_df = read_parquet_files_as_df(output_docid_dir)
output_df.head(10)

Displaying contents of :  output/2_docid_out
Successfully read 8 parquet files with 8 total rows


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,date_acquired,document_convert_time,source_filename,doc_hash,int_id_column
0,granite-similar.pdf,## Granite Code Models: A Family of Open Found...,28,17,488,221416c6-c2a2-4aa4-be34-62b47f2b44b8,2995119145186144319,pdf,0de7807974e6888cabafef3484ef571ceb5c3167f7433a...,121236,2025-11-19T22:27:35.280582,144.064493,granite-similar.pdf,0de7807974e6888cabafef3484ef571ceb5c3167f7433a...,2
1,lorem-ipsum.pdf,Lorem ipsum Lorem ipsum Lorem ipsum,1,0,2,1579cb34-5a75-46af-b4c2-f0b80e0fede5,50660627009040522,pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,35,2025-11-19T22:29:53.377453,0.899349,lorem-ipsum.pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,6
2,granite-duplicate.pdf,## Granite Code Models: A Family of Open Found...,28,17,484,ba897c8f-7637-49e7-9cb4-19e770fd5031,3127757990743433032,pdf,58342470e7d666dca0be87a15fb0552f949a5632606fe1...,121131,2025-11-19T22:25:11.161649,145.223961,granite-duplicate.pdf,58342470e7d666dca0be87a15fb0552f949a5632606fe1...,1
3,spam.pdf,Free xxx,1,0,2,412bba11-c661-456d-bbbc-bf57de69cbc4,5577338085393325113,pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,8,2025-11-19T22:29:54.286406,0.90472,spam.pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,7
4,attention.pdf,"Provided proper attribution is provided, Googl...",15,4,513,b21cf9b2-6c3a-4a22-a5dc-aacecf1a99dc,2949302674760005271,pdf,bcaa6e6e0c640fb63393c7e40229a512571f61ccdba42d...,48981,2025-11-19T22:22:45.882819,35.593498,attention.pdf,bcaa6e6e0c640fb63393c7e40229a512571f61ccdba42d...,0
5,hap2.pdf,## HAP Example - Hate\n\nI hate all immigrants!,1,0,3,9a9c972d-86df-470b-acca-51a3135e9074,17586229987672717381,pdf,18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a...,45,2025-11-19T22:29:52.475010,0.917891,hap2.pdf,18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a...,5
6,hap1.pdf,## HAP example - Abuse and Profanity\n\nYou ar...,1,0,3,d4614cde-d599-4da9-969b-1aa9ac57310d,11080891969043035065,pdf,c21aebe6661c25c508faf03d9030813497e4ded4c840f4...,135,2025-11-19T22:29:51.554267,0.814161,hap1.pdf,c21aebe6661c25c508faf03d9030813497e4ded4c840f4...,4
7,granite.pdf,## Granite Code Models: A Family of Open Found...,28,17,484,83d8d8fc-df9a-4476-832d-07fa9f58ca03,3127757990743433032,pdf,58342470e7d666dca0be87a15fb0552f949a5632606fe1...,121131,2025-11-19T22:29:50.727842,135.392525,granite.pdf,58342470e7d666dca0be87a15fb0552f949a5632606fe1...,3


## Step-5: Eliminate Duplicate Documents

We have 2 duplicate documents here: `granite.pdf` and `granite-duplicate.pdf`.

Note how the `hash` values for these two documents are the same.

We are going to perform **exact de-dupe**.

A SHA256 hash is computed on the content of each document, and records with identical hashes are then de-duplicated.

[Dedupe transform documentation](https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/universal/ededup/README.md)
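
The idea can be sketched in a few lines of plain Python. This is an illustration of hash-based exact de-duplication under simplifying assumptions, not the Ededup implementation: `exact_dedupe` is a hypothetical helper, documents are plain dicts, and the first copy seen is the one kept.

```python
import hashlib

def exact_dedupe(docs, doc_column="contents"):
    """Keep the first document for each distinct SHA-256 content hash."""
    seen_hashes = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc[doc_column].encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # identical content already kept, drop this copy
        seen_hashes.add(digest)
        kept.append(doc)
    return kept

# Illustrative documents (not the notebook's data).
docs = [
    {"filename": "a.pdf", "contents": "same text"},
    {"filename": "b.pdf", "contents": "same text"},
    {"filename": "c.pdf", "contents": "other text"},
]
unique_docs = exact_dedupe(docs)
print([d["filename"] for d in unique_docs])
```

Which copy survives depends on processing order, so in a real run either of two identical files may be the one that is kept.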

### 5.1 - Execute 

In [9]:
%%time 

from dpk_ededup.transform_python import Ededup

STAGE = 3
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_docid_dir}' --> output='{output_exact_dedupe_dir}'\n", flush=True)

result = Ededup(input_folder=output_docid_dir,
    output_folder=output_exact_dedupe_dir,
    ededup_doc_column="contents",
    ededup_doc_id_column="document_id"
    ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed")

🏃🏼 STAGE-3: Processing input='output/2_docid_out' --> output='output/3_exact_dedupe_out'



{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'document_id', 'use_snapshot': False, 'snapshot_directory': None}"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "pipeline id pipeline_id"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "code location {'github': 'UNDEFINED', 'build-date': 'UNDEFINED', 'commit_hash': 'UNDEFINED', 'path': 'UNDEFINED'}"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ max_files -1, n_sample -1"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ Data Access:  DataAccessLocal"}
{"time": "22:29:54", "logger": "dpk", "logLevel":

✅ Stage:3 completed successfully
CPU times: user 59.7 ms, sys: 5.22 ms, total: 64.9 ms
Wall time: 55.7 ms


### 5.2 - Inspect Generated output

We should see 7 documents. One of the two identical documents (`granite.pdf` and `granite-duplicate.pdf`) has been filtered out; in this run `granite.pdf` was dropped and `granite-duplicate.pdf` was kept.

In [10]:
from file_utils import read_parquet_files_as_df

input_df = read_parquet_files_as_df(output_docid_dir)
output_df = read_parquet_files_as_df(output_exact_dedupe_dir)

# print ("Input data dimensions (rows x columns)= ", input_df.shape)
# print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"Input files before exact dedupe : {input_df.shape[0]:,}")
print (f"Output files after exact dedupe : {output_df.shape[0]:,}")
print ("Duplicate files removed :  ", (input_df.shape[0] - output_df.shape[0]))

output_df.sample(min(10, output_df.shape[0]))

Successfully read 8 parquet files with 8 total rows
Successfully read 7 parquet files with 7 total rows
Input files before exact dedupe : 8
Output files after exact dedupe : 7
Duplicate files removed :   1


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,date_acquired,document_convert_time,source_filename,doc_hash,int_id_column
6,hap1.pdf,## HAP example - Abuse and Profanity\n\nYou ar...,1,0,3,d4614cde-d599-4da9-969b-1aa9ac57310d,11080891969043035065,pdf,c21aebe6661c25c508faf03d9030813497e4ded4c840f4...,135,2025-11-19T22:29:51.554267,0.814161,hap1.pdf,c21aebe6661c25c508faf03d9030813497e4ded4c840f4...,4
3,spam.pdf,Free xxx,1,0,2,412bba11-c661-456d-bbbc-bf57de69cbc4,5577338085393325113,pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,8,2025-11-19T22:29:54.286406,0.90472,spam.pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,7
2,granite-duplicate.pdf,## Granite Code Models: A Family of Open Found...,28,17,484,ba897c8f-7637-49e7-9cb4-19e770fd5031,3127757990743433032,pdf,58342470e7d666dca0be87a15fb0552f949a5632606fe1...,121131,2025-11-19T22:25:11.161649,145.223961,granite-duplicate.pdf,58342470e7d666dca0be87a15fb0552f949a5632606fe1...,1
5,hap2.pdf,## HAP Example - Hate\n\nI hate all immigrants!,1,0,3,9a9c972d-86df-470b-acca-51a3135e9074,17586229987672717381,pdf,18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a...,45,2025-11-19T22:29:52.475010,0.917891,hap2.pdf,18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a...,5
0,granite-similar.pdf,## Granite Code Models: A Family of Open Found...,28,17,488,221416c6-c2a2-4aa4-be34-62b47f2b44b8,2995119145186144319,pdf,0de7807974e6888cabafef3484ef571ceb5c3167f7433a...,121236,2025-11-19T22:27:35.280582,144.064493,granite-similar.pdf,0de7807974e6888cabafef3484ef571ceb5c3167f7433a...,2
4,attention.pdf,"Provided proper attribution is provided, Googl...",15,4,513,b21cf9b2-6c3a-4a22-a5dc-aacecf1a99dc,2949302674760005271,pdf,bcaa6e6e0c640fb63393c7e40229a512571f61ccdba42d...,48981,2025-11-19T22:22:45.882819,35.593498,attention.pdf,bcaa6e6e0c640fb63393c7e40229a512571f61ccdba42d...,0
1,lorem-ipsum.pdf,Lorem ipsum Lorem ipsum Lorem ipsum,1,0,2,1579cb34-5a75-46af-b4c2-f0b80e0fede5,50660627009040522,pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,35,2025-11-19T22:29:53.377453,0.899349,lorem-ipsum.pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,6


## Step-6: Fuzzy Dedupe

In the previous step, we removed **exact duplicates (identical documents)**.

Fuzzy de-dupe can further filter out documents that are **not exactly identical, but nearly identical**.

Here is a simple example:

`Our solar system is a vast and fascinating expanse`

`The solar system is a vast and fascinating expanse`

Only one word is different `Our` vs `The`.

Imagine two documents with one extra blank line.  For our purposes they are the same.

[Fuzzy dedupe documentation](https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/universal/fdedup)

### Tweaking fuzzy matches

**`jaccard_similarity_threshold`** is the parameter used to tune how similar two documents must be to count as near duplicates. Its value is between 0 and 1.0. Values close to 1.0 mean stricter checking (fewer documents will qualify as duplicates). A lower threshold means more lenient matching (more documents will qualify).

Adjust this value to find what works for your documents.
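
To get a feel for the threshold, the sketch below computes the exact Jaccard similarity of the two example sentences over word shingles. It is an illustration only: `word_shingles` and `jaccard` are hypothetical helpers, and the fuzzy dedupe transform estimates similarity with MinHash signatures rather than computing it exactly. The shingle size of 5 is taken from the `word_shingle_size` value that appears in the transform's log output.

```python
def word_shingles(text, size=5):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = text.lower().split()
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

s1 = "Our solar system is a vast and fascinating expanse"
s2 = "The solar system is a vast and fascinating expanse"

sim_5 = jaccard(word_shingles(s1, size=5), word_shingles(s2, size=5))
sim_1 = jaccard(word_shingles(s1, size=1), word_shingles(s2, size=1))
print(f"5-word shingles: {sim_5:.3f}")
print(f"single words:    {sim_1:.3f}")
```

For such short sentences a single changed word affects a large share of the shingles, so the similarity drops quickly. In long documents a one-word change touches only a handful of shingles, and the similarity stays close to 1.0.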

### 6.1 - Execute

In [11]:
%%time

from dpk_fdedup.transform_python import Fdedup

STAGE = 4
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_exact_dedupe_dir}' --> output='{output_fuzzy_dedupe_dir}'\n", flush=True)

result = Fdedup(input_folder=output_exact_dedupe_dir,
                output_folder=output_fuzzy_dedupe_dir,
                contents_column= "contents",
                # document_id_column= "doc_id",
                document_id_column= "int_id_column",
                num_permutations= 112,
                num_bands= 14,
                num_minhashes_per_band= 8,
                jaccard_similarity_threshold = 0.8, # between 0 - 1.  higher means more strict checking
                operation_mode="filter_duplicates",
                # operation_mode="annotate",
                ).transform()
if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed (result={result})")

🏃🏼 STAGE-4: Processing input='output/3_exact_dedupe_out' --> output='output/4_fuzzy_dedupe_out'



{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "Starting SignatureCalculation step"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "Got parameters for SignatureCalculation"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "minhash parameters are : {'document_id_column': 'int_id_column', 'contents_column': 'contents', 'seed': 42, 'num_permutations': 112, 'jaccard_similarity_threshold': 0.8, 'word_shingle_size': 5, 'num_bands': 14, 'num_minhashes_per_band': 8, 'num_segments': 1, 'shingle_option': 'word'}"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory scdata_ Missing local configuration"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory scdata_ max_files -1, n_sample -1"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory scdata_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pa

✅ Stage:4 completed successfully
CPU times: user 352 ms, sys: 92 ms, total: 444 ms
Wall time: 276 ms
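
The `num_permutations`, `num_bands`, and `num_minhashes_per_band` values passed above are related: 14 bands of 8 minhashes each account for all 112 permutations. In the textbook MinHash LSH scheme, two documents with Jaccard similarity `s` become a candidate pair if they agree on every minhash in at least one band, which happens with probability `1 - (1 - s^r)^b` for `b` bands of `r` rows. The sketch below evaluates that standard formula. Treat it as background on the technique and an assumption about how these parameters interact, not as a description of the transform's internals.

```python
def candidate_probability(similarity, num_bands=14, rows_per_band=8):
    """Probability that two documents share at least one identical LSH band.

    Textbook MinHash LSH formula: 1 - (1 - s**r)**b.
    """
    return 1.0 - (1.0 - similarity ** rows_per_band) ** num_bands

num_permutations = 112
assert 14 * 8 == num_permutations  # the bands cover every minhash exactly once

for s in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(f"similarity {s:.2f} -> candidate probability {candidate_probability(s):.3f}")
```

More rows per band make the curve steeper, so fewer dissimilar pairs become candidates. More bands shift it toward lower similarities, so fewer true near duplicates are missed.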


### 6.2 - Inspect Output

FuzzyDedupe writes the documents that survive filtering to the **output/4_fuzzy_dedupe_out/cleaned** folder.

You will notice that only one copy of the Granite paper made it: `granite-similar.pdf` was kept, and the nearly identical `granite-duplicate.pdf` was filtered out.

In [12]:
from file_utils import read_parquet_files_as_df
input_df = read_parquet_files_as_df(output_exact_dedupe_dir)
output_df = read_parquet_files_as_df(os.path.join(output_fuzzy_dedupe_dir, "cleaned"))

# print ("Input data dimensions (rows x columns)= ", input_df.shape)
# print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"Input files before fuzzy dedupe : {input_df.shape[0]:,}")
print (f"Output files after fuzzy dedupe : {output_df.shape[0]:,}")
print ("Near duplicate files removed :  ", (input_df.shape[0] - output_df.shape[0]))

print ("Displaying contents of : ", output_fuzzy_dedupe_dir)
output_df.head(10)

Successfully read 7 parquet files with 7 total rows
Successfully read 6 parquet files with 6 total rows
Input files before fuzzy dedupe : 7
Output files after fuzzy dedupe : 6
Near duplicate files removed :   1
Displaying contents of :  output/4_fuzzy_dedupe_out


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,date_acquired,document_convert_time,source_filename,doc_hash,int_id_column
0,granite-similar.pdf,## Granite Code Models: A Family of Open Found...,28,17,488,221416c6-c2a2-4aa4-be34-62b47f2b44b8,2995119145186144319,pdf,0de7807974e6888cabafef3484ef571ceb5c3167f7433a...,121236,2025-11-19T22:27:35.280582,144.064493,granite-similar.pdf,0de7807974e6888cabafef3484ef571ceb5c3167f7433a...,2
1,lorem-ipsum.pdf,Lorem ipsum Lorem ipsum Lorem ipsum,1,0,2,1579cb34-5a75-46af-b4c2-f0b80e0fede5,50660627009040522,pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,35,2025-11-19T22:29:53.377453,0.899349,lorem-ipsum.pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,6
2,spam.pdf,Free xxx,1,0,2,412bba11-c661-456d-bbbc-bf57de69cbc4,5577338085393325113,pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,8,2025-11-19T22:29:54.286406,0.90472,spam.pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,7
3,attention.pdf,"Provided proper attribution is provided, Googl...",15,4,513,b21cf9b2-6c3a-4a22-a5dc-aacecf1a99dc,2949302674760005271,pdf,bcaa6e6e0c640fb63393c7e40229a512571f61ccdba42d...,48981,2025-11-19T22:22:45.882819,35.593498,attention.pdf,bcaa6e6e0c640fb63393c7e40229a512571f61ccdba42d...,0
4,hap2.pdf,## HAP Example - Hate\n\nI hate all immigrants!,1,0,3,9a9c972d-86df-470b-acca-51a3135e9074,17586229987672717381,pdf,18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a...,45,2025-11-19T22:29:52.475010,0.917891,hap2.pdf,18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a...,5
5,hap1.pdf,## HAP example - Abuse and Profanity\n\nYou ar...,1,0,3,d4614cde-d599-4da9-969b-1aa9ac57310d,11080891969043035065,pdf,c21aebe6661c25c508faf03d9030813497e4ded4c840f4...,135,2025-11-19T22:29:51.554267,0.814161,hap1.pdf,c21aebe6661c25c508faf03d9030813497e4ded4c840f4...,4


## Step-7: Document Quality

This transform scores documents across many quality metrics.

Here we will look at the 'bad words' metric.

[Document quality documentation](https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/language/doc_quality)

By default it uses [bad words collection](https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/language/doc_quality/dpk_doc_quality/ldnoobw).  You can supply a custom file by passing an argument `bad_word_filepath=/path/to/badwords_file`

### 7.1 - Execute

In [13]:
%%time

from dpk_doc_quality.transform_python import DocQuality

STAGE = 5
output_fuzzy_dedupe_cleaned_dir = os.path.join(output_fuzzy_dedupe_dir, "cleaned")
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_fuzzy_dedupe_cleaned_dir}' --> output='{output_doc_quality_dir}'\n", flush=True)

result = DocQuality(input_folder=output_fuzzy_dedupe_cleaned_dir,
                    output_folder= output_doc_quality_dir,
                    docq_text_lang = "en",
                    docq_doc_content_column ="contents",
                    ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed (result={result})")

🏃🏼 STAGE-5: Processing input='output/4_fuzzy_dedupe_out/cleaned' --> output='output/5_doc_quality_out'



{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': '/home/sujee/my-stuff/ai-alliance/data-prep-kit-examples/dpk-dev/.venv/lib/python3.12/site-packages/dpk_doc_quality/ldnoobw/en', 'docq_data_factory': <data_processing.data_access.data_access_factory.DataAccessFactory object at 0x7dfaee840260>}"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory docq_ Missing local configuration"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory docq_ max_files -1, n_sample -1"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory docq_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']"}
{"time": "22:29:54", "logger": "dpk", "logLevel": "INFO", "message": "data factory docq_ Data Access:  DataAccessLocal

✅ Stage:5 completed successfully
CPU times: user 217 ms, sys: 18.7 ms, total: 236 ms
Wall time: 224 ms


### 7.2 - Inspect the Output

We will see several new columns starting with the name **docq_**.

Look at the column **docq_contain_bad_word**; this will flag documents with 'bad words'.

Also inspect the column **docq_lorem_ipsum_ratio**; this will flag documents with 'lorem ipsum' text

For more information see : [Doc Quality documentation](https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/language/doc_quality)

In [14]:
from file_utils import read_parquet_files_as_df
output_df = read_parquet_files_as_df(output_doc_quality_dir)
print ("Displaying contents of : ", output_doc_quality_dir)
output_df.head(10)

Successfully read 6 parquet files with 6 total rows
Displaying contents of :  output/5_doc_quality_out


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,...,docq_mean_word_len,docq_symbol_to_word_ratio,docq_sentence_count,docq_lorem_ipsum_ratio,docq_curly_bracket_ratio,docq_contain_bad_word,docq_bullet_point_ratio,docq_ellipsis_line_ratio,docq_alphabet_word_ratio,docq_contain_common_en_words
0,granite-similar.pdf,## Granite Code Models: A Family of Open Found...,28,17,488,221416c6-c2a2-4aa4-be34-62b47f2b44b8,2995119145186144319,pdf,0de7807974e6888cabafef3484ef571ceb5c3167f7433a...,121236,...,5.177224,0.002295,2809,0.0,0.0,False,0.110682,0.0,0.663245,True
1,lorem-ipsum.pdf,Lorem ipsum Lorem ipsum Lorem ipsum,1,0,2,1579cb34-5a75-46af-b4c2-f0b80e0fede5,50660627009040522,pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,35,...,5.0,0.0,1,0.085714,0.0,False,0.0,0.0,1.0,False
2,spam.pdf,Free xxx,1,0,2,412bba11-c661-456d-bbbc-bf57de69cbc4,5577338085393325113,pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,8,...,3.5,0.0,1,0.0,0.0,True,0.0,0.0,1.0,False
3,attention.pdf,"Provided proper attribution is provided, Googl...",15,4,513,b21cf9b2-6c3a-4a22-a5dc-aacecf1a99dc,2949302674760005271,pdf,bcaa6e6e0c640fb63393c7e40229a512571f61ccdba42d...,48981,...,5.058187,0.004962,541,0.0,0.0,False,0.117808,0.0,0.79988,True
4,hap2.pdf,## HAP Example - Hate\n\nI hate all immigrants!,1,0,3,9a9c972d-86df-470b-acca-51a3135e9074,17586229987672717381,pdf,18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a...,45,...,4.0,0.111111,1,0.0,0.0,False,0.0,0.0,0.777778,False
5,hap1.pdf,## HAP example - Abuse and Profanity\n\nYou ar...,1,0,3,d4614cde-d599-4da9-969b-1aa9ac57310d,11080891969043035065,pdf,c21aebe6661c25c508faf03d9030813497e4ded4c840f4...,135,...,4.625,0.041667,2,0.0,0.0,False,0.0,0.0,0.916667,True
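
To build intuition for two of these columns, here is a minimal sketch. The definitions are assumptions, not the transform's code: `lorem_ipsum_ratio` counts occurrences of "lorem ipsum" per character of text, which happens to reproduce the 0.085714 reported for `lorem-ipsum.pdf` in the table above, and `BAD_WORDS` is a tiny hypothetical stand-in for the real bad-words collection.

```python
def lorem_ipsum_ratio(text):
    """Occurrences of 'lorem ipsum' divided by the text length in characters.

    Assumed definition for illustration; the real metric may be computed differently.
    """
    if not text:
        return 0.0
    return text.lower().count("lorem ipsum") / len(text)

# Hypothetical stand-in for the bad-words file used by the transform.
BAD_WORDS = {"xxx"}

def contains_bad_word(text, bad_words=BAD_WORDS):
    """True if any whitespace-separated word is in the bad-word set."""
    return any(word in bad_words for word in text.lower().split())

print(round(lorem_ipsum_ratio("Lorem ipsum Lorem ipsum Lorem ipsum"), 6))  # 0.085714
print(contains_bad_word("Free xxx"))  # True
print(contains_bad_word("Attention is all you need"))  # False
```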


### 7.3 - Filtering 'quality' documents

From the output above, we see that **spam.pdf** is flagged for containing bad words (**docq_contain_bad_word=True**).

Also, **lorem-ipsum.pdf** is flagged for placeholder content **lorem ipsum** (**docq_lorem_ipsum_ratio > 0**).

We are going to filter them both out.

In [15]:
all_docs_df = read_parquet_files_as_df(output_doc_quality_dir)

# remove documents with badwords
clean_docs_df = all_docs_df[all_docs_df['docq_contain_bad_word'] == False]

# also filter out 'lorem ipsum' text
clean_docs_df = clean_docs_df[clean_docs_df['docq_lorem_ipsum_ratio'] == 0]

clean_docs_df.head(10)

Successfully read 6 parquet files with 6 total rows


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,...,docq_mean_word_len,docq_symbol_to_word_ratio,docq_sentence_count,docq_lorem_ipsum_ratio,docq_curly_bracket_ratio,docq_contain_bad_word,docq_bullet_point_ratio,docq_ellipsis_line_ratio,docq_alphabet_word_ratio,docq_contain_common_en_words
0,granite-similar.pdf,## Granite Code Models: A Family of Open Found...,28,17,488,221416c6-c2a2-4aa4-be34-62b47f2b44b8,2995119145186144319,pdf,0de7807974e6888cabafef3484ef571ceb5c3167f7433a...,121236,...,5.177224,0.002295,2809,0.0,0.0,False,0.110682,0.0,0.663245,True
3,attention.pdf,"Provided proper attribution is provided, Googl...",15,4,513,b21cf9b2-6c3a-4a22-a5dc-aacecf1a99dc,2949302674760005271,pdf,bcaa6e6e0c640fb63393c7e40229a512571f61ccdba42d...,48981,...,5.058187,0.004962,541,0.0,0.0,False,0.117808,0.0,0.79988,True
4,hap2.pdf,## HAP Example - Hate\n\nI hate all immigrants!,1,0,3,9a9c972d-86df-470b-acca-51a3135e9074,17586229987672717381,pdf,18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a...,45,...,4.0,0.111111,1,0.0,0.0,False,0.0,0.0,0.777778,False
5,hap1.pdf,## HAP example - Abuse and Profanity\n\nYou ar...,1,0,3,d4614cde-d599-4da9-969b-1aa9ac57310d,11080891969043035065,pdf,c21aebe6661c25c508faf03d9030813497e4ded4c840f4...,135,...,4.625,0.041667,2,0.0,0.0,False,0.0,0.0,0.916667,True


### 7.4 - Save clean quality docs

In [16]:
import shutil

shutil.rmtree(output_doc_quality_clean_dir, ignore_errors=True)
os.makedirs(output_doc_quality_clean_dir, exist_ok=True)

clean_docs_df.to_parquet(os.path.join(output_doc_quality_clean_dir,"clean_docs.parquet"))
print (f"✅ Saved CLEAN parquet output to '{output_doc_quality_clean_dir}'")


✅ Saved CLEAN parquet output to 'output/6_doc_quality_clean_out'


## Step-8: HAP Detector

[HAP transform documentation](https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/universal/hap/)

Key parameters:

- `model_name_or_path` - the HAP model to use; it must be compatible with Hugging Face's `AutoModelForSequenceClassification`. Defaults to IBM's open-source toxicity classifier **ibm-granite/granite-guardian-hap-38m**.
- `annotation_column` - the name of the column that holds the HAP (toxicity) score in the output .parquet file. Defaults to `hap_score`.
- `doc_text_column` - the name of the column that holds the document text in the input .parquet file. Defaults to `contents`.
- `batch_size` - the inference batch size; tune it to your infrastructure capacity. Defaults to 128.
- `max_length` - the maximum sequence length for the tokenizer. Defaults to 512.

Available HAP detection models:

- [ibm-granite/granite-guardian-hap-38m](https://huggingface.co/ibm-granite/granite-guardian-hap-38m)
- [ibm-granite/granite-guardian-hap-125m](https://huggingface.co/ibm-granite/granite-guardian-hap-125m)
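
To make the role of `batch_size` more concrete, here is a minimal sketch of document-level scoring. It assumes the behaviour described in the transform documentation linked above: the document is split into sentences, the sentences are scored in batches, and the highest sentence score becomes the document's `hap_score`. The helper names and the keyword-based `toy_scorer` are hypothetical stand-ins for the real classifier, and `max_length` truncation is left out.

```python
from typing import Callable, List


def batched(items: List[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def document_hap_score(
    sentences: List[str],
    score_batch: Callable[[List[str]], List[float]],
    batch_size: int = 128,
) -> float:
    """Score sentences batch by batch and return the maximum score."""
    scores: List[float] = []
    for batch in batched(sentences, batch_size):
        scores.extend(score_batch(batch))
    # An empty document gets a score of 0.0 in this sketch (an assumption).
    return max(scores, default=0.0)


def toy_scorer(batch: List[str]) -> List[float]:
    """Hypothetical stand-in for the model: flags sentences containing 'hate'."""
    return [0.9 if "hate" in sentence.lower() else 0.1 for sentence in batch]


doc_sentences = [
    "This paper introduces a new model.",
    "I hate everything.",
    "Results are in Table 2.",
]
print(document_hap_score(doc_sentences, toy_scorer, batch_size=2))
```

With max aggregation, a single high-scoring sentence is enough to flag an otherwise clean document, which is worth keeping in mind when choosing a filtering threshold.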

### 8.1 - Execute

In [17]:
from dpk_hap.transform_python import HAP


result = HAP(input_folder= output_doc_quality_clean_dir,
        output_folder= output_hap_detection_dir,
        model_name_or_path= 'ibm-granite/granite-guardian-hap-38m',
        annotation_column= "hap_score",
        doc_text_column= "contents",
        inference_engine= "CPU",
        max_length= 512,
        batch_size= 128,
        ).transform()

if result == 0:
    print ("✅ Operation completed successfully")
else:
    raise Exception ("❌ Operation failed")

[nltk_data] Downloading package punkt_tab to /home/sujee/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
{"time": "22:29:55", "logger": "dpk", "logLevel": "INFO", "message": "hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} "}
{"time": "22:29:55", "logger": "dpk", "logLevel": "INFO", "message": "pipeline id pipeline_id"}
{"time": "22:29:55", "logger": "dpk", "logLevel": "INFO", "message": "code location {'github': 'UNDEFINED', 'build-date': 'UNDEFINED', 'commit_hash': 'UNDEFINED', 'path': 'UNDEFINED'}"}
{"time": "22:29:55", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ max_files -1, n_sample -1"}
{"time": "22:29:55", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], file

✅ Operation completed successfully


### 8.2 - Inspect Generated output

Let's inspect the output, in particular the **hap_score** column.

In [18]:
from file_utils import read_parquet_files_as_df
output_df = read_parquet_files_as_df(output_hap_detection_dir)
print ("Displaying contents of : ", output_hap_detection_dir)
output_df.head(10)

Successfully read 1 parquet files with 4 total rows
Displaying contents of :  output/7_hap_detection_out


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,...,docq_symbol_to_word_ratio,docq_sentence_count,docq_lorem_ipsum_ratio,docq_curly_bracket_ratio,docq_contain_bad_word,docq_bullet_point_ratio,docq_ellipsis_line_ratio,docq_alphabet_word_ratio,docq_contain_common_en_words,hap_score
0,granite-similar.pdf,## Granite Code Models: A Family of Open Found...,28,17,488,221416c6-c2a2-4aa4-be34-62b47f2b44b8,2995119145186144319,pdf,0de7807974e6888cabafef3484ef571ceb5c3167f7433a...,121236,...,0.002295,2809,0.0,0.0,False,0.110682,0.0,0.663245,True,0.154206
1,attention.pdf,"Provided proper attribution is provided, Googl...",15,4,513,b21cf9b2-6c3a-4a22-a5dc-aacecf1a99dc,2949302674760005271,pdf,bcaa6e6e0c640fb63393c7e40229a512571f61ccdba42d...,48981,...,0.004962,541,0.0,0.0,False,0.117808,0.0,0.79988,True,0.181448
2,hap2.pdf,## HAP Example - Hate\n\nI hate all immigrants!,1,0,3,9a9c972d-86df-470b-acca-51a3135e9074,17586229987672717381,pdf,18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a...,45,...,0.111111,1,0.0,0.0,False,0.0,0.0,0.777778,False,0.90329
3,hap1.pdf,## HAP example - Abuse and Profanity\n\nYou ar...,1,0,3,d4614cde-d599-4da9-969b-1aa9ac57310d,11080891969043035065,pdf,c21aebe6661c25c508faf03d9030813497e4ded4c840f4...,135,...,0.041667,2,0.0,0.0,False,0.0,0.0,0.916667,True,0.997993


In [19]:
output_df[['filename', 'contents', 'hap_score']]

Unnamed: 0,filename,contents,hap_score
0,granite-similar.pdf,## Granite Code Models: A Family of Open Found...,0.154206
1,attention.pdf,"Provided proper attribution is provided, Googl...",0.181448
2,hap2.pdf,## HAP Example - Hate\n\nI hate all immigrants!,0.90329
3,hap1.pdf,## HAP example - Abuse and Profanity\n\nYou ar...,0.997993


### 8.3 - Extract clean docs from HAP

In [20]:
from file_utils import read_parquet_files_as_df

hap_output_df = read_parquet_files_as_df(output_hap_detection_dir)
# keep only documents whose HAP score is below the 0.2 cut-off used in this example
clean_docs_df = hap_output_df[hap_output_df['hap_score'] < 0.2]

print ('clean documents')
clean_docs_df[['filename', 'contents', 'hap_score']]

Successfully read 1 parquet files with 4 total rows
clean documents


Unnamed: 0,filename,contents,hap_score
0,granite-similar.pdf,## Granite Code Models: A Family of Open Found...,0.154206
1,attention.pdf,"Provided proper attribution is provided, Googl...",0.181448
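
The 0.2 cut-off is a choice made for this sample data, so it is worth checking how sensitive the result is to it. The sketch below reuses the four `hap_score` values shown in section 8.2 and lists which documents would survive at a few different thresholds. `docs_below` is a hypothetical helper, not part of Data Prep Kit.

```python
# HAP scores copied from the output in section 8.2.
hap_scores = {
    "granite-similar.pdf": 0.154206,
    "attention.pdf": 0.181448,
    "hap2.pdf": 0.90329,
    "hap1.pdf": 0.997993,
}


def docs_below(scores: dict, threshold: float) -> list:
    """Return the names of documents whose score is strictly below threshold."""
    return [name for name, score in scores.items() if score < threshold]


for threshold in (0.16, 0.2, 0.95):
    print(threshold, docs_below(hap_scores, threshold))
```

On this data, any threshold between the highest clean score (about 0.18) and the lowest HAP score (about 0.90) keeps the same two documents.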


## Step-9: Save final docs

In [21]:
## clear out final output folder

import os
import shutil

shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER_FINAL, exist_ok=True)

output_final_dir_parquet = os.path.join (MY_CONFIG.OUTPUT_FOLDER_FINAL, 'pq')
os.makedirs(output_final_dir_parquet, exist_ok=True)

# output_final_dir_markdown = os.path.join (MY_CONFIG.OUTPUT_FOLDER_FINAL, 'markdown')
output_final_dir_markdown = MY_CONFIG.OUTPUT_FOLDER_FINAL_MD
os.makedirs(output_final_dir_markdown, exist_ok=True)

In [22]:
## save parquet

clean_docs_df.to_parquet(os.path.join(output_final_dir_parquet, "clean_docs.parquet"))
print (f"✅ Saved CLEAN parquet output to '{output_final_dir_parquet}'")

✅ Saved CLEAN parquet output to 'output/output_final/pq'


In [23]:
## save markdown text

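for index, row in clean_docs_df.iterrows():
    # the '.md' suffix is appended to the full file name, e.g. 'attention.pdf.md'
    output_file_name = os.path.join (output_final_dir_markdown, row['filename'] + '.md')
    print (f"Saving file: {output_file_name}")
    # write as UTF-8 so non-ASCII characters in the extracted text are saved consistently across platforms
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        output_file.write(row['contents'])

print (f"✅ Saved CLEAN markdown output to '{output_final_dir_markdown}'")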

Saving file: output/output_final/markdown/granite-similar.pdf.md
Saving file: output/output_final/markdown/attention.pdf.md
✅ Saved CLEAN markdown output to 'output/output_final/markdown'
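
The cell above appends `.md` to the full file name, which yields names such as `attention.pdf.md`. If you would rather have `attention.md`, one option is to swap the extension with `pathlib`. This is a small sketch, and `markdown_name` is a hypothetical helper that is not part of the notebook.

```python
from pathlib import Path


def markdown_name(filename: str) -> str:
    """Replace the last extension of filename with '.md' (hypothetical helper)."""
    return Path(filename).with_suffix(".md").name


print(markdown_name("attention.pdf"))
print(markdown_name("granite-similar.pdf"))
```

Note that swapping the extension can cause name collisions if the input folder contains files that differ only by extension, which the original `.pdf.md` naming avoids.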
