# Data Prep Kit Introduction


## PDF Processing Pipeline

This notebook demonstrates how to process PDFs with Data Prep Kit (DPK).


Here is the workflow:

![](https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/media/data-prep-kit-3-workflow.png)

Open this notebook in Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/santoshborse/pydatanyc2024/blob/main/dpk-intro.ipynb)

In [3]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   !rm -rf sample_data
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False


Running in Colab


## Step-0.1: Download and inspect the input data


In [4]:
if RUNNING_IN_COLAB:
    !mkdir -p 'input'
    !wget -O 'input/earth.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/earth.pdf'
    !wget -O 'input/mars.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/mars.pdf'
    !wget -O 'utils.py'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/utils.py'


--2024-11-04 18:26:42--  https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/earth.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58535 (57K) [application/octet-stream]
Saving to: ‘input/earth.pdf’


2024-11-04 18:26:42 (6.84 MB/s) - ‘input/earth.pdf’ saved [58535/58535]

--2024-11-04 18:26:42--  https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/mars.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57872 (57K) [application/octet-stream]
Saving to: ‘input

## Step-0.2: Install DPK and Required Transforms


In [5]:
# Step-0.2: install DPK and the pdf2parquet transform
if RUNNING_IN_COLAB:
  !pip install data-prep-toolkit==0.2.2.dev2 && pip install data-prep-toolkit-transforms[pdf2parquet]==0.2.2.dev2


Collecting argparse (from data-prep-toolkit==0.2.2.dev2)
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0
Collecting argparse (from data-prep-toolkit>=0.2.2.dev2->data-prep-toolkit-transforms==0.2.2.dev2->data-prep-toolkit-transforms[pdf2parquet]==0.2.2.dev2)
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0
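
After installation, it can be useful to confirm which versions actually ended up in the environment. The sketch below relies on the standard-library `importlib.metadata` module; the helper name `installed_version` is hypothetical, and the package names are the ones passed to `pip install` above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent.

    Hypothetical helper for illustration only.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names taken from the pip install command in this step
for pkg in ("data-prep-toolkit", "data-prep-toolkit-transforms"):
    print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
```

Outside Colab the install cell is skipped, so this check also shows whether the packages are already present locally.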


## Step-1: pdf2parquet - Convert data from PDF to Parquet


In [None]:
STAGE = 1
input_folder = "input"
output_folder = "s1-pdf2parquet"
print(f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

In [None]:
%%time

import ast
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from pdf2parquet_transform import (
    pdf2parquet_contents_type_cli_param,
    pdf2parquet_contents_types,
)
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration

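# Local data access: read PDFs from input_folder, write Parquet to output_folder
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
# Transform-specific option: store the extracted document contents as JSON
ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # Process only files with the .pdf extension
    "data_files_to_use": ast.literal_eval("['.pdf']"),
}

# The launcher reads its configuration from the command line, so pass the
# merged parameters through sys.argv (the dict union operator needs Python 3.9+)
sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))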
# Create the launcher for the pure-Python runtime
launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())
# Run the transform; a return code of 0 means success
return_code = launcher.launch()

if return_code == 0:
    print(f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception("❌ Job failed")