# Data Prep Kit Intro - Quick start

Let's get started with Data Prep Kit (DPK) by reading some PDF files and extracting their text.

## Step-1: Figure out Runtime Environment

### 1.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.

In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 1.2 - Install dependencies if running on Google Colab

In [2]:
%%capture

if RUNNING_IN_COLAB:
    ! pip install  --default-timeout=100  \
        data-prep-toolkit-transforms[pdf2parquet]==1.0.0


## Step-2: Settings / Config

In [3]:
# Route Hugging Face downloads through a mirror.
# Comment out the HF_ENDPOINT line if https://huggingface.co/ is reachable directly.
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

In [4]:
## Set up input / output dirs
import os
import shutil

os.makedirs('input', exist_ok=True)

# Start from a clean output folder on every run
shutil.rmtree('output', ignore_errors=True)
os.makedirs('output/md', exist_ok=True)
os.makedirs('output/pq', exist_ok=True)

In [5]:
import pandas as pd
import glob

## Reads parquet files in a folder into a pandas dataframe
def read_parquet_files_as_df (parquet_dir):
    parquet_files = glob.glob(f'{parquet_dir}/*.parquet')
    # read each parquet file into a DataFrame and store in a list
    dfs = [pd.read_parquet (f) for f in parquet_files]
    dfs = [df for df in dfs if not df.empty]  # filter out empty dataframes
    # Concatenate all DataFrames into a single DataFrame
    if len(dfs) > 0:
        data_df = pd.concat(dfs, ignore_index=True)
        return data_df
    else:
        return pd.DataFrame() # return empty df
# ------------

## Step-3: Checkout Data files

We will use simple PDFs.  The files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data)





In [6]:
if RUNNING_IN_COLAB:
    input_dir = 'input'
    !wget -O  'input/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/earth.pdf'
    !wget -O  'input/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/mars.pdf'
    !wget -O  'input/solar-system-overview.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/solar-system-overview.pdf'
else:
    input_dir = '../data/solar-system/'
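
If `wget` is not available, the same three sample files can be fetched with the Python standard library. This is a minimal sketch, not part of DPK: the helper names `raw_url` and `download_samples` are hypothetical, and the base URL is copied from the `wget` commands above.

```python
import os
import urllib.request

# Base URL copied from the wget commands above
BASE_URL = ("https://raw.githubusercontent.com/sujee/"
            "data-prep-kit-examples/main/data/solar-system")
SAMPLE_FILES = ["earth.pdf", "mars.pdf", "solar-system-overview.pdf"]


def raw_url(file_name, base_url=BASE_URL):
    """Build the raw download URL for one sample file (hypothetical helper)."""
    return f"{base_url}/{file_name}"


def download_samples(target_dir="input", file_names=SAMPLE_FILES):
    """Download each sample PDF into target_dir. Requires network access."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for name in file_names:
        destination = os.path.join(target_dir, name)
        urllib.request.urlretrieve(raw_url(name), destination)
        saved.append(destination)
    return saved
```

Calling `download_samples('input')` and then setting `input_dir = 'input'` would mirror the Colab branch above.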
    

## Step-4: Extract Text from PDF

In this step we read the PDF files and extract their text.

[Pdf2Parquet documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/README.md)

We use the [Docling package](https://github.com/DS4SD/docling).
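
To get a feel for what Docling does on its own, here is a minimal sketch that converts a single PDF to Markdown using Docling's `DocumentConverter`. It is an illustration only: how Pdf2Parquet configures and calls Docling internally may differ, and the helper name `pdf_to_markdown` is hypothetical.

```python
def pdf_to_markdown(pdf_path):
    """Convert one PDF to a Markdown string using Docling directly."""
    # Imported lazily so the function can be defined before Docling is installed
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()
```

For example, `pdf_to_markdown('../data/solar-system/mars.pdf')` should return Markdown text for that file, assuming Docling is installed and its models can be downloaded.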

### 4.1 - Execute

In [7]:
%%time

from dpk_pdf2parquet.transform_python import Pdf2Parquet
from dpk_pdf2parquet.transform import pdf2parquet_contents_types

result = Pdf2Parquet(input_folder= input_dir,
                    output_folder= "output/pq",
                    data_files_to_use=['.pdf'],
                    pdf2parquet_contents_type=pdf2parquet_contents_types.MARKDOWN,   # markdown
                    ).transform()

if result == 0:
    print("✅ Success!")
else:
    raise Exception("❌ Failed")

23:20:45 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
23:20:45 INFO - pipeline id pipeline_id
23:20:45 INFO - code location None
23:20:45 INFO - data factory data_ is using local data access: input_folder - ../data/solar-system/ output_folder - output/pq
23:20:45 INFO - data factory data_ max_files -1, n_sample -1
23:20:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
23:20:45 INFO - orchestrator pdf2parquet started at 2025-01-29 23:20:45
23:20:45 INFO - Number of files is 4, source profile {'max_file_size': 0.05775737762451172, 'min_file_size': 0.0551

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

23:20:49 INFO - Completed 1 files (25.0%) in 0.017 min
23:20:50 INFO - Completed 2 files (50.0%) in 0.031 min
23:20:51 INFO - Completed 3 files (75.0%) in 0.042 min
23:20:53 INFO - Completed 4 files (100.0%) in 0.077 min
23:20:53 INFO - Done processing 4 files, waiting for flush() completion.
23:20:53 INFO - done flushing in 0.0 sec
23:20:53 INFO - Completed execution in 0.139 min, execution result 0


✅ Success!
CPU times: user 19.3 s, sys: 2.01 s, total: 21.3 s
Wall time: 11.4 s


### 4.2 - Inspect Generated output

Here we should see one entry per input file processed.

In [8]:
output_df = read_parquet_files_as_df("output/pq")
output_df.head(10)

## To display only certain columns
# output_df[['filename', 'num_pages', 'num_tables']].head(5)

|   | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | solar-system-overview.pdf | ## Solar System\n\n\| Planet \| Distance from ... | 2 | 3 | 4 | 23cf296c-8b64-4b1c-b809-c1754d27efb9 | 13478721415333026623 | pdf | 3fa3a3d372ee03f02b37267a012560052323655cd60451... | 3111 | 2025-01-29T23:20:53.487661 | 2.131391 | solar-system-overview.pdf |
| 1 | mars.pdf | ## Mars\n\n## Solar System\n\nOur solar system... | 1 | 0 | 11 | f27e5a0b-fe21-4f8d-85f6-02d722bcf702 | 10359780639229817778 | pdf | a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... | 717 | 2025-01-29T23:20:51.352227 | 0.642222 | mars.pdf |
| 2 | earth-copy.pdf | ## Earth\n\n## Solar System\n\nOur solar syste... | 1 | 0 | 11 | cf774291-d5a4-4471-a8a7-23172fffdcc6 | 17915699055171962696 | pdf | 6140cf695f269a3ddca6568536076756105ad3186086b2... | 610 | 2025-01-29T23:20:49.851863 | 1.0001 | earth-copy.pdf |
| 3 | earth.pdf | ## Earth\n\n## Solar System\n\nOur solar syste... | 1 | 0 | 11 | f5b79114-56cf-48a1-804e-e10fb3a3eec8 | 17915699055171962696 | pdf | 6140cf695f269a3ddca6568536076756105ad3186086b2... | 610 | 2025-01-29T23:20:50.708209 | 0.85324 | earth.pdf |
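
Two quick sanity checks can be run on this output. The sketch below confirms that every input PDF produced a row, and groups files that share a `document_hash` (in the output above, `earth.pdf` and `earth-copy.pdf` have the same value). Both helper names are hypothetical, and treating an equal `document_hash` as "same content" is an assumption based on the column name, not on documented DPK behavior.

```python
from collections import defaultdict
from pathlib import Path


def missing_outputs(input_dir, processed_names, pattern="*.pdf"):
    """Return input file names that have no matching row in the output."""
    expected = {p.name for p in Path(input_dir).glob(pattern)}
    return sorted(expected - set(processed_names))


def duplicate_groups(records, key="document_hash"):
    """Group file names by `key` and keep only groups with more than one file."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["filename"])
    return {value: sorted(names)
            for value, names in groups.items() if len(names) > 1}
```

With the notebook's variables, these could be called as `missing_outputs(input_dir, output_df['filename'])` and `duplicate_groups(output_df.to_dict('records'))`.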


## Step-5: Save Content in Markdown

In [9]:
for _, row in output_df.iterrows():
    output_file_name = os.path.join("output", "md", row['filename'] + '.md')
    # Write as UTF-8 so non-ASCII characters survive on every platform
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        output_file.write(row['contents'])

print("✅ Saved CLEAN markdown output to 'output/md'")

✅ Saved CLEAN markdown output to 'output/md'
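
Because the loop above appends `.md` to the original file name, the saved files are named like `earth.pdf.md`. If you would rather have `earth.md`, one option is to swap the extension instead. This is an optional sketch, and `markdown_name` is a hypothetical helper.

```python
from pathlib import Path


def markdown_name(pdf_name):
    """Replace the final extension with .md, keeping the rest of the name."""
    return Path(pdf_name).with_suffix(".md").name
```

In the loop, `os.path.join("output", "md", markdown_name(row['filename']))` would then be used as the output path.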
