# Data Prep Kit: Process PDF with OCR

We will process scanned PDFs and run OCR on handwritten text.


References
- [Data Prep Kit](https://ibm.github.io/data-prep-kit) is an open source framework that helps with data wrangling.
- [GitHub repo](https://github.com/IBM/data-prep-kit)


## How to run this notebook

Two options:

- **Option 1 - Google Colab:** The easiest option; no setup required. Click this link to open the notebook in Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb)
- **Option 2 - Local Python dev environment:** Set up using this guide: [setting up a Python dev environment](../setup-python-dev-env.md)

The notebook works as-is in both environments.

## Step-1: Inspect the Data

We will use two simple PDFs. The files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data/ocr):

- [poem1.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/ocr/poem1.pdf)
- [shopping-list1.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/ocr/shopping-list1.pdf)
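
Before running the pipeline, it can be useful to confirm which PDFs are actually present in the input folder. The sketch below is an illustration only: `list_pdfs` is a hypothetical helper (not part of Data Prep Kit), and the path `../data/ocr` assumes the local layout configured later in this notebook (on Colab the files are downloaded to `input` instead).

```python
from pathlib import Path


def list_pdfs(folder):
    """Return (name, size_in_bytes) for each PDF directly inside `folder`, sorted by name."""
    folder = Path(folder)
    if not folder.is_dir():
        # A missing folder simply yields an empty listing
        return []
    return sorted((p.name, p.stat().st_size) for p in folder.glob("*.pdf"))


# Assumed local path; adjust to 'input' when running on Colab
for name, size in list_pdfs("../data/ocr"):
    print(f"{name}: {size / 1024:.1f} KB")
```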

## Step-2: Figure out Runtime Environment

### 2.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.

In [1]:
import os

# Colab sets the COLAB_RELEASE_TAG environment variable
if os.getenv("COLAB_RELEASE_TAG"):
    print("Running in Colab")
    RUNNING_IN_COLAB = True
else:
    print("NOT in Colab")
    RUNNING_IN_COLAB = False

NOT in Colab


### 2.2 - Download Data if running on Google Colab

In [None]:
if RUNNING_IN_COLAB:
    !mkdir -p 'input'
    !wget -O 'input/poem1.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/ocr/poem1.pdf'
    !wget -O 'input/shopping-list.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/ocr/shopping-list1.pdf'
    !wget -O 'utils.py'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/utils.py'

### 2.3 - Install dependencies if running on Google Colab

In [None]:
if RUNNING_IN_COLAB:
    ! pip install  --default-timeout=100  \
        data-prep-toolkit-transforms==0.2.2dev2 \
        data-prep-toolkit-transforms-ray==0.2.2dev2 \
        deepsearch-toolkit

### 2.4 - Restart Runtime

After installing dependencies, be sure to <font color="red">restart the runtime</font> so the newly installed libraries are loaded.

You do this by going to **`Runtime --> Restart Session`**

Then you can continue to the next step (there is no need to re-run the previous cells).
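
After the restart, a quick check that the required modules can be found may save a confusing import error later. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module names are the ones imported in step 4.2 of this notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level modules imported later in this notebook (step 4.2)
required = ["data_processing", "pdf2parquet_transform", "pdf2parquet_transform_python"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available")
```

If anything is reported missing, re-running the install cell in step 2.3 and restarting the runtime again is the likely fix.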

## Step-3: Configuration

### 3.1 - Basic Config

In [4]:
import os

# The runtime restart in step 2.4 clears all variables, so detect the environment again
if os.getenv("COLAB_RELEASE_TAG"):
    print("Running in Colab")
    RUNNING_IN_COLAB = True
else:
    print("NOT in Colab")
    RUNNING_IN_COLAB = False

NOT in Colab


In [5]:
import os

## Configuration
class MyConfig:
    pass

MY_CONFIG = MyConfig ()

if RUNNING_IN_COLAB:
  MY_CONFIG.INPUT_DATA_DIR = 'input'
else:
  MY_CONFIG.INPUT_DATA_DIR = os.path.join ('..', 'data', 'ocr')
  # MY_CONFIG.INPUT_DATA_DIR = os.path.join ('..', 'data', 'forms')
  
MY_CONFIG.OUTPUT_FOLDER = "output"
MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , "output_final")


In [6]:
## Add parent dir to path
import os,sys

this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append (os.path.abspath (parent_dir))

### 3.2 - Set up input/output directories

In [7]:
import os, sys
import shutil

if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):
    raise Exception (f"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found")

output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')
output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')

## clear output folder
shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)

print ("✅ Cleared output directory")

✅ Cleared output directory


## Step-4: pdf2parquet -  Convert data from PDF to Parquet

This step reads every PDF file in the input folder and ingests it into a Parquet table using the [Docling package](https://github.com/DS4SD/docling).
The documents are converted into a JSON format, which makes them easy to chunk in later steps.



### 4.1 - Set Input/output Folder

In [8]:
STAGE = 1

input_folder = MY_CONFIG.INPUT_DATA_DIR
output_folder =  output_parquet_dir

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-1: Processing input='../data/ocr' --> output='output/01_parquet_out'


### 4.2 - Execute

In [9]:
%%time

import ast
import os
import sys

from pdf2parquet_transform import (
    pdf2parquet_contents_type_cli_param,
    pdf2parquet_contents_types,
)
from data_processing.runtime.pure_python import PythonTransformLauncher
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration

from data_processing.utils import GB, ParamsUtils


# create parameters
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

params = {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
    "pdf2parquet_do_ocr" : True,
}


sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
# create launcher
launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")


23:31:44 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
23:31:44 INFO - pipeline id pipeline_id
23:31:44 INFO - code location None
23:31:44 INFO - data factory data_ is using local data access: input_folder - ../data/ocr output_folder - output/01_parquet_out
23:31:44 INFO - data factory data_ max_files -1, n_sample -1
23:31:44 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
23:31:44 INFO - orchestrator pdf2parquet started at 2024-11-04 23:31:44
23:31:44 INFO - Number of files is 2, source profile {'max_file_size': 0.31499481201171875, 'min_file_size': 0.285

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

23:31:49 ERROR - Fatal error with file file_name='/home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/data/ocr/poem1.pdf'. No results produced.
  File "/home/sujee/apps/anaconda3/envs/dpk-6-basic-022dev2-py311/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
    out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sujee/apps/anaconda3/envs/dpk-6-basic-022dev2-py311/lib/python3.11/site-packages/pdf2parquet_transform.py", line 371, in transform_binary
    table = pa.Table.from_pylist(data)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
  File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1547

✅ Stage:1 completed successfully
CPU times: user 13.3 s, sys: 2.34 s, total: 15.6 s
Wall time: 11.3 s


### 4.3 - Inspect Generated output

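Here we should see one entry per input file that was processed successfully. In this run, `poem1.pdf` failed with the error logged in step 4.2, so only one row appears.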

In [10]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display certain columns
# output_df[['column1', 'column2', 'column3']].head(5)

Output dimensions (rows x columns)=  (1, 13)


| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | shopping-list.pdf | {"schema_name":"DoclingDocument","version":"1.... | 1 | 1 | 0 | 3ed7066c-1fe8-48b8-9901-6c3ffa0614e2 | 7976019744775581307 | pdf | ce06863613c95cb82f3eb90f1f6426bdb92bb189762fbd... | 3279 | 2024-11-04T23:31:51.237089 | 1.417158 | shopping-list.pdf |
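
The launcher in step 4.2 reported success even though the log shows a fatal error for `poem1.pdf`, so it can help to compare the input folder against the `filename` column of the output. The following is a sketch under stated assumptions: `find_unprocessed` is a hypothetical helper, not a Data Prep Kit function, and it assumes output filenames match the input file names exactly.

```python
from pathlib import Path


def find_unprocessed(input_folder, processed_names, pattern="*.pdf"):
    """Return names of input files that have no matching entry in `processed_names`."""
    processed = set(processed_names)
    return sorted(
        p.name for p in Path(input_folder).glob(pattern) if p.name not in processed
    )
```

In this notebook it could be called as `find_unprocessed(input_folder, output_df["filename"])`; a non-empty result lists the files that produced no row.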



### 4.4 - Understand the output

Here are some interesting attributes to note:

- **filename** : original filename
- **contents** : document content, serialized as JSON (`DoclingDocument` schema)
- **document_id**: unique id (UUID) assigned to this document
- **hash** : hash of document
- **pdf_convert_time** : time to convert this pdf in seconds

Let's inspect the **contents** column.  See how the text is being divided up!

In [11]:
import pprint
import json

pprint.pprint(json.loads(output_df.iloc[0]['contents']))
# json.loads(output_df.iloc[0]['contents'])

{'body': {'children': [{'$ref': '#/tables/0'}],
          'label': 'unspecified',
          'name': '_root_',
          'self_ref': '#/body'},
 'furniture': {'children': [],
               'label': 'unspecified',
               'name': '_root_',
               'self_ref': '#/furniture'},
 'groups': [],
 'key_value_items': [],
 'name': 'shopping-list',
 'origin': {'binary_hash': 7976019744775581307,
            'filename': 'shopping-list.pdf',
            'mimetype': 'application/pdf'},
 'pages': {'1': {'page_no': 1,
                 'size': {'height': 658.32000732, 'width': 477.6000061}}},
 'pictures': [],
 'schema_name': 'DoclingDocument',
 'tables': [{'captions': [],
             'children': [],
             'data': {'grid': [[{'bbox': {'b': 553.32000732,
                                          'coord_origin': 'BOTTOMLEFT',
                                          'l': 79.43351746,
                                          'r': 154.33332825,
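
In the output above, `body.children` does not embed the table directly; it holds a `$ref` string (`'#/tables/0'`) that points to an entry elsewhere in the same document. A minimal sketch of following such a reference is shown here. `resolve_ref` is a hypothetical helper written for illustration, and the small `doc` dictionary only mimics the shape of the printed output.

```python
def resolve_ref(doc, ref):
    """Follow a reference such as '#/tables/0' starting from the document root."""
    node = doc
    for part in ref.removeprefix("#/").split("/"):
        # List positions are numeric path segments; everything else is a dict key
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Tiny stand-in with the same shape as the printed document
doc = {
    "body": {"children": [{"$ref": "#/tables/0"}]},
    "tables": [{"captions": [], "children": []}],
    "pictures": [],
}

for child in doc["body"]["children"]:
    target = resolve_ref(doc, child["$ref"])
    print(child["$ref"], "->", sorted(target.keys()))
```

With the real data, `doc` would come from `json.loads(output_df.iloc[0]['contents'])`.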
                                       

In [12]:
# Only one row was produced in this run (poem1.pdf failed), so index 1 is out of bounds
pprint.pprint(json.loads(output_df.iloc[1]['contents']))

IndexError: single positional indexer is out-of-bounds
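
The `IndexError` occurs because the output holds a single row, so position 1 does not exist. Iterating over the `contents` values instead of hard-coding row positions avoids this. The sketch below uses only the standard library; `parse_contents` is a hypothetical helper, and the sample string stands in for the real column values.

```python
import json


def parse_contents(contents):
    """Parse every JSON string in `contents` and return the resulting dicts."""
    return [json.loads(c) for c in contents]


# Stand-in for output_df["contents"]; any iterable of JSON strings works
docs = parse_contents(['{"name": "shopping-list", "tables": [{}]}'])

for i, doc in enumerate(docs):
    print(i, doc["name"], "tables:", len(doc.get("tables", [])))
```

In this notebook the call would be `parse_contents(output_df["contents"])`, which handles however many rows were actually produced.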