<div style="background-color: #04D7FD; padding: 20px; text-align: left;">
    <h1 style="color: #000000; font-size: 36px; margin: 0;">Data Processing for RAG with Data Prep Kit (Python)</h1>
    
</div>


## Before Running the notebook

Please complete [setting up the Python dev environment](./setup-python-dev-env.md) first.

## Overview

This notebook processes PDF documents as part of a RAG pipeline.

![](media/rag-overview-2.png)

This notebook will perform steps 1, 2, 3 and 4 in the RAG pipeline.

Here are the processing steps:

- **docling2parquet**: Extract text (in markdown format) from PDFs and store it as parquet files
- **Exact Dedup**: Documents with identical content are filtered out
- **Chunk documents**: Split the PDFs into 'meaningful sections' (paragraphs, sentences, etc.)
- **Text encoder**: Convert chunks into vectors using embedding models

## Step-1: Configuration

In [1]:
from my_config import MY_CONFIG

In [2]:
## setup path to utils folder
import sys
sys.path.append('../utils')
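
The contents of `my_config.py` are not shown in this notebook. As a rough sketch, assuming it only needs the three attributes used in the cells below, it might look like the following. The class name `MyConfig` is hypothetical, and the values are inferred from the printed outputs later in this notebook; the real file may define more settings.

```python
import os

# Hypothetical stand-in for my_config.py; the real file is not shown here
# and may define additional settings.
class MyConfig:
    pass

MY_CONFIG = MyConfig()

# Folder containing the input PDFs (matches the output printed in step 2.1)
MY_CONFIG.INPUT_DATA_DIR = "../data/papers"

# Intermediate outputs such as 01_parquet_out and 02_dedupe_out go here
MY_CONFIG.OUTPUT_FOLDER = "output-papers"

# Final cleaned documents (parquet and markdown) go here
MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER, "output_final")
```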

## Step-2:  Data

We will use white papers about LLMs.

- [Granite Code Models](https://arxiv.org/abs/2405.04324)
- [Attention is all you need](https://arxiv.org/abs/1706.03762)

You can, of course, substitute your own data below.

### 2.1 - data

In [3]:
import os, sys
import shutil
from file_utils import download_file

print ("Using input data:", MY_CONFIG.INPUT_DATA_DIR)

Using input data: ../data/papers


In [4]:
# import os, sys
# import shutil
# from file_utils import download_file

# shutil.rmtree(MY_CONFIG.INPUT_DATA_DIR, ignore_errors=True)
# os.makedirs(MY_CONFIG.INPUT_DATA_DIR, exist_ok=True)
# print ("✅ Cleared input directory")
 
# download_file (url = 'https://arxiv.org/pdf/1706.03762', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'attention.pdf' ))
# download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'granite.pdf' ))
# download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'granite2.pdf' )) # duplicate
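
The `download_file` helper comes from `file_utils`, whose implementation is not shown in this notebook. A minimal sketch of what such a helper could look like, using only the standard library, is given here. The name `download_file_sketch` is hypothetical; only the `url` and `local_file` parameter names are taken from the calls above.

```python
import os
import urllib.request

def download_file_sketch(url, local_file):
    """Download `url` and save it to `local_file`.

    Hypothetical stand-in for file_utils.download_file; the real helper
    may behave differently (for example, it may skip existing files).
    """
    # Make sure the destination folder exists before writing
    os.makedirs(os.path.dirname(local_file) or ".", exist_ok=True)
    # urlretrieve streams the response body straight to the given file path
    urllib.request.urlretrieve(url, local_file)
    return local_file
```

It would be called the same way as the commented lines above, for example `download_file_sketch(url='https://arxiv.org/pdf/1706.03762', local_file=os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'attention.pdf'))`.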


### 2.2 - Set input/output path variables for the pipeline

In [5]:
import os, sys
import shutil

if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):
    raise Exception (f"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found")

output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')
output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_dedupe_out')
# output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_chunk_out')
# output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_embeddings_out')

## clear output folder
shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)

print ("✅ Cleared output directory")

✅ Cleared output directory


## Step-3: docling2parquet -  Convert data from PDF to Parquet

This step reads all PDF files in the input folder and ingests them into a parquet table using the [Docling package](https://github.com/DS4SD/docling).
The text of each document is extracted in markdown format (set via `docling2parquet_contents_type` in the cell below), which makes it easy to chunk in later steps.



### 3.1 - Execute 

In [6]:
%%time 

from dpk_docling2parquet import Docling2Parquet
from dpk_docling2parquet import docling2parquet_contents_types

STAGE = 1
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{MY_CONFIG.INPUT_DATA_DIR}' --> output='{output_parquet_dir}'\n", flush=True)

result = Docling2Parquet(input_folder=MY_CONFIG.INPUT_DATA_DIR,
                    output_folder=output_parquet_dir,
                    data_files_to_use=['.pdf'],
                    docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN,   # markdown
                    ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE} failed")

🏃🏼 STAGE-1: Processing input='../data/papers' --> output='output-papers/01_parquet_out'



{"time": "15:57:14", "logger": "dpk", "logLevel": "INFO", "message": "docling2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <docling2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <docling2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <docling2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8, 'pipeline': <docling2parquet_pipeline.MULTI_STAGE: 'multi_stage'>, 'generate_picture_images': False, 'generate_page_images': False, 'images_scale': 2.0}"}
{"time": "15:57:14", "logger": "dpk", "logLevel": "INFO", "message": "pipeline id pipeline_id"}
{"time": "15:57:14", "logger": "dpk", "logLevel": "INFO", "message": "code location {'github': 'UNDEFINED', 'build-date': 'UNDEFINED', 'commit_hash': 'UNDEFINED', 'path': 'UNDEFINED'}"}
{"time": "15:57:14", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ max_files -1, n_sam

✅ Stage:1 completed successfully
CPU times: user 1min 17s, sys: 2.42 s, total: 1min 19s
Wall time: 57.9 s


### 3.2 -  Inspect Generated output

Here we should see one entry per input file processed

In [7]:
from file_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_parquet_dir)

# print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display only certain columns
# output_df[['column1', 'column2', 'column3']].head(5)

Successfully read 3 parquet files with 3 total rows


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,date_acquired,document_convert_time,source_filename
0,attention.pdf,"Provided proper attribution is provided, Googl...",15,4,513,6ac50bf5-f455-42db-a820-7d2f895995c3,2949302674760005271,pdf,214960a61e817387f01087f0b0b323cf1ebd8035fffcab...,48981,2025-11-16T15:57:30.045742,11.843195,attention.pdf
1,granite2.pdf,## Granite Code Models: A Family of Open Found...,28,17,486,a71ae005-ed2b-4866-96a9-32e28f6eca0a,3127757990743433032,pdf,58342470e7d666dca0be87a15fb0552f949a5632606fe1...,121131,2025-11-16T15:58:08.089522,18.953736,granite2.pdf
2,granite.pdf,## Granite Code Models: A Family of Open Found...,28,17,486,c86b16c1-d83d-4366-b409-78b9884757e2,3127757990743433032,pdf,58342470e7d666dca0be87a15fb0552f949a5632606fe1...,121131,2025-11-16T15:57:49.075365,18.963363,granite.pdf


## Step-4: Eliminate Duplicate Documents

We have 2 duplicate documents here: `granite.pdf` and `granite2.pdf`.

Note that the `hash` values for these two documents are identical.

We are going to perform exact **de-dupe**.

A SHA256 hash is computed on the content of each document, and records with identical hashes are then de-duplicated so that only one copy remains.

[Dedupe transform documentation](https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/universal/ededup/README.md)
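
The idea can be sketched in plain Python. This is a minimal illustration of hash-based exact de-duplication, not the `Ededup` transform's actual implementation; the `exact_dedup` function and the sample records are hypothetical.

```python
import hashlib

def exact_dedup(docs, doc_column="contents"):
    """Keep the first record seen for each distinct SHA-256 content hash."""
    seen_hashes = set()
    unique_docs = []
    for doc in docs:
        # Hash the document text; identical text always yields the same digest
        digest = hashlib.sha256(doc[doc_column].encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of an earlier record, drop it
        seen_hashes.add(digest)
        unique_docs.append(doc)
    return unique_docs

# Illustrative records only; the real contents are the extracted markdown
docs = [
    {"filename": "attention.pdf", "contents": "Attention text"},
    {"filename": "granite2.pdf", "contents": "Granite text"},
    {"filename": "granite.pdf", "contents": "Granite text"},
]

kept = exact_dedup(docs)
print([d["filename"] for d in kept])  # ['attention.pdf', 'granite2.pdf']
```

In this sketch the first copy encountered is kept, so which of the two duplicates survives depends on the order in which the records are processed.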

### 4.1 - Execute 

In [8]:
%%time 

from dpk_ededup.transform_python import Ededup

STAGE = 2
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_parquet_dir}' --> output='{output_exact_dedupe_dir}'\n", flush=True)

result = Ededup(input_folder=output_parquet_dir,
    output_folder=output_exact_dedupe_dir,
    ededup_doc_column="contents",
    ededup_doc_id_column="document_id"
    ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE} failed")

🏃🏼 STAGE-2: Processing input='output-papers/01_parquet_out' --> output='output-papers/02_dedupe_out'



{"time": "15:58:08", "logger": "dpk", "logLevel": "INFO", "message": "exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'document_id', 'use_snapshot': False, 'snapshot_directory': None}"}
{"time": "15:58:08", "logger": "dpk", "logLevel": "INFO", "message": "pipeline id pipeline_id"}
{"time": "15:58:08", "logger": "dpk", "logLevel": "INFO", "message": "code location {'github': 'UNDEFINED', 'build-date': 'UNDEFINED', 'commit_hash': 'UNDEFINED', 'path': 'UNDEFINED'}"}
{"time": "15:58:08", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ max_files -1, n_sample -1"}
{"time": "15:58:08", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']"}
{"time": "15:58:08", "logger": "dpk", "logLevel": "INFO", "message": "data factory data_ Data Access:  DataAccessLocal"}
{"time": "15:58:08", "logger": "dpk", "logLevel":

✅ Stage:2 completed successfully
CPU times: user 36.5 ms, sys: 4.87 ms, total: 41.4 ms
Wall time: 37.1 ms


### 4.2 - Inspect Generated output

We should see 2 documents: `attention.pdf` and `granite.pdf`. The duplicate `granite2.pdf` has been filtered out!

In [9]:
from file_utils import read_parquet_files_as_df

input_df = read_parquet_files_as_df(output_parquet_dir)
output_df = read_parquet_files_as_df(output_exact_dedupe_dir)

# print ("Input data dimensions (rows x columns)= ", input_df.shape)
# print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"Input files before exact dedupe : {input_df.shape[0]:,}")
print (f"Output files after exact dedupe : {output_df.shape[0]:,}")
print ("Duplicate files removed :  ", (input_df.shape[0] - output_df.shape[0]))

output_df.sample(min(3, output_df.shape[0]))

Successfully read 3 parquet files with 3 total rows
Successfully read 2 parquet files with 2 total rows
Input files before exact dedupe : 3
Output files after exact dedupe : 2
Duplicate files removed :   1


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,date_acquired,document_convert_time,source_filename
0,attention.pdf,"Provided proper attribution is provided, Googl...",15,4,513,6ac50bf5-f455-42db-a820-7d2f895995c3,2949302674760005271,pdf,214960a61e817387f01087f0b0b323cf1ebd8035fffcab...,48981,2025-11-16T15:57:30.045742,11.843195,attention.pdf
1,granite.pdf,## Granite Code Models: A Family of Open Found...,28,17,486,c86b16c1-d83d-4366-b409-78b9884757e2,3127757990743433032,pdf,58342470e7d666dca0be87a15fb0552f949a5632606fe1...,121131,2025-11-16T15:57:49.075365,18.963363,granite.pdf


## Step-5: Save final docs

In [10]:
## clear out final output folder

import os
import shutil

shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER_FINAL, exist_ok=True)

## sub-folder for the final parquet output
output_final_dir_parquet = os.path.join (MY_CONFIG.OUTPUT_FOLDER_FINAL, 'pq')
os.makedirs(output_final_dir_parquet, exist_ok=True)

## sub-folder for the final markdown output
output_final_dir_markdown = os.path.join (MY_CONFIG.OUTPUT_FOLDER_FINAL, 'markdown')
os.makedirs(output_final_dir_markdown, exist_ok=True)

In [11]:
## save parquet

output_df.to_parquet(os.path.join(output_final_dir_parquet, "clean_docs.parquet"))
print (f"✅ Saved CLEAN parquet output to '{output_final_dir_parquet}'")

✅ Saved CLEAN parquet output to 'output-papers/output_final/pq'


In [12]:
## save markdown text

for index, row in output_df.iterrows():
    output_file_name = os.path.join (output_final_dir_markdown, row['filename'] + '.md')
    with open(output_file_name, 'w') as output_file:
        output_file.write(row['contents'])

print (f"✅ Saved CLEAN markdown output to '{output_final_dir_markdown}'")

✅ Saved CLEAN markdown output to 'output-papers/output_final/markdown'
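
Because `row['filename']` still carries the `.pdf` extension, the loop above writes files named like `attention.pdf.md`. If plain `attention.md` names are preferred, one option is to strip the extension first. This is an optional variation, not part of the original notebook, and `markdown_name` is a hypothetical helper.

```python
import os

def markdown_name(filename):
    """Return `filename` with its last extension replaced by '.md'."""
    # splitext splits 'attention.pdf' into ('attention', '.pdf')
    stem, _ext = os.path.splitext(filename)
    return stem + ".md"

print(markdown_name("attention.pdf"))  # attention.md
```

Inside the save loop, `markdown_name(row['filename'])` would then take the place of `row['filename'] + '.md'`.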
