# Billing Extraction Demo

This demonstration shows how GroundX, running on OpenShift and integrated with OpenShift AI, extracts structured data from an image that contains unstructured data. The structured data is returned in JSON format.

Before you start, view the document in the `test-docs` folder. While the file extension suggests it is a PDF, it is actually a photo of a mobile phone statement and bill in `.jpg` format. 
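
Because the extension and the real format disagree, it can help to check the file's leading bytes rather than trust its name. The following is a minimal sketch, not part of the GroundX tooling: the helper names `sniff_file_type` and `sniff_path` are made up for illustration, and only the well-known JPEG (`FF D8 FF`) and PDF (`%PDF-`) signatures are checked.

```python
from pathlib import Path

# Leading "magic" bytes for the two formats relevant to this demo.
JPEG_MAGIC = b"\xff\xd8\xff"
PDF_MAGIC = b"%PDF-"


def sniff_file_type(header: bytes) -> str:
    """Guess a file type from its first bytes, ignoring the extension."""
    if header.startswith(JPEG_MAGIC):
        return "jpeg"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


def sniff_path(path: str) -> str:
    """Read only the first few bytes of `path` and classify them."""
    with Path(path).open("rb") as fh:
        return sniff_file_type(fh.read(8))
```

If the test document is the photo described above, calling `sniff_path` on the file in `test-docs` should report `jpeg` even though the name ends in `.pdf`.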

GroundX will extract five fields: the account number, the amount due, the due date, the company owed payment, and where to send payment. These fields are defined in `./prompts/simple.yaml`. You can view the prompts used for the data extraction in the `*.py` files in the `prompts` folder.

## Update your GroundX client

Start by installing or upgrading the GroundX client with its `extract` extras, then install the additional notebook dependencies.

In [None]:
pip install -U "groundx[extract]" && pip install ipywidgets smolagents

## Confirm installation

Check that the `groundx` package is installed and see which version you have.

In [None]:
pip show groundx

## Initialize GroundX Client and Prompt Manager

The GroundX client provides an interface to the GroundX API. To create it, you need the base URL of the GroundX application and the GroundX admin API key. Both values should already be set as environment variables in the workbench, as `GROUNDX_BASE_URL` and `GROUNDX_ADMIN_API_KEY` respectively.

In [None]:
import typing
import os

### set this to the local yaml file and path (leave off the .yaml from the file name though)

cache_path = "./prompts"
file_name = "simple"

###

### set these if working with specific files

document_id: typing.Optional[str] = None
process_id: typing.Optional[str] = None

### Set GroundX access information

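# Same as values/values.groundx.secret.yaml, "GROUNDX_ADMIN_API_KEY"
api_key = os.environ.get("GROUNDX_ADMIN_API_KEY")

# Route for groundx service with /api added
base_url = os.environ.get("GROUNDX_BASE_URL")

# Fail early with a clear message instead of a TypeError on None
if not api_key or not base_url:
    raise RuntimeError("set GROUNDX_ADMIN_API_KEY and GROUNDX_BASE_URL")

# Print only the tail of the key so the secret does not end up in saved output
print(f"GroundX API Key: ...{api_key[-4:]}")
print(f"GroundX Base URL: {base_url}")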


from groundx import GroundX

gx_client = GroundX(api_key=api_key, base_url=base_url)

from groundx.extract import Logger, Source
from manager import ExtractPromptManager

logger = Logger(name="manage-workflows", level="info")
prompt_manager = ExtractPromptManager(
    cache_source=Source(
        logger=logger,
        cache_path=cache_path,
    ),
    config_source=Source(
        logger=logger,
        cache_path=cache_path,
    ),
    logger=logger,
    default_file_name=file_name,
    default_workflow_id=file_name,
    gx_client=gx_client,
)

## Create a Bucket

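This step is optional: run it only if you do not already have a test bucket to work with. If you skip it, set `bucket_id` to the ID of an existing bucket, because the ingest step later in this notebook uses that variable.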

In the GroundX platform by EyeLevel, Buckets serve as the fundamental organizational unit for your content. Think of them as high-level folders or containers that help you manage, group, and secure your data before it is used for search and RAG (Retrieval-Augmented Generation).

In [None]:
res = gx_client.buckets.create(name="workflow-test")

if res.bucket:
    bucket_id = res.bucket.bucket_id
    print(f"bucket_id=[{bucket_id}]")
else:
    print(res)

## Create a Workflow

If you have not already done so, create a workflow and apply it to your account or to a bucket (described in subsequent steps).

A GroundX workflow is the end-to-end process of transforming complex, unstructured enterprise data into "LLM-ready" context for Retrieval-Augmented Generation (RAG). Rather than being a single feature, a workflow represents the automated orchestration of several distinct stages. Because GroundX is designed for high-stakes enterprise environments, these workflows focus heavily on visual document understanding (handling tables, charts, and complex layouts) to prevent hallucinations caused by poor data parsing.

In [None]:
res = gx_client.workflows.create(
    chunk_strategy="element",
    name=file_name,
    # loads extract prompt from `{cache_path}/{file_name}.yaml`
    steps=prompt_manager.workflow_steps(file_name=file_name),
    # configures workflow to be an `extract` workflow
    extract=prompt_manager.workflow_extract_dict(file_name=file_name),
)

workflow_id = res.workflow.workflow_id

print(f"[{res.workflow.workflow_id}]\t\t[{res.workflow.name}]")

## Assign to Account as the Default Prompt

This optional step makes the new workflow the default prompt for the account.

**Note: this will replace the current default account prompt.**

In [None]:
if not workflow_id:
    raise Exception("set workflow_id in the Create a Workflow step")

res = gx_client.workflows.add_to_account(workflow_id=workflow_id)

print(res)

## Extract Information from a File

Upload the image of the billing statement so that GroundX can extract information from it.

In [None]:
from groundx import Document

res = gx_client.ingest(
    documents=[
        Document(
            bucket_id=bucket_id,
            file_path="./test-docs/t-mobile.pdf",
        ),
    ],
)

process_id = res.ingest.process_id
print(f"process_id = [{process_id}]")

## Check Document Processing Status by `process_id`

Check the processing status of the file by `process_id`. In the output, the status first progresses to `training` and then to `complete`; rerun the cell until it does. Once processing is complete, move on to the next step to see the structured data extracted as JSON. The output will be saved to a file in the object storage bucket.

In [None]:
if not process_id:
    raise Exception("process_id is not set")

res = gx_client.documents.get_processing_status_by_id(
    process_id=process_id,
)

document_id: typing.Optional[str] = None
if res.ingest.progress:
    if res.ingest.progress.complete and res.ingest.progress.complete.documents:
        document_id = res.ingest.progress.complete.documents[0].document_id
    elif res.ingest.progress.processing and res.ingest.progress.processing.documents:
        document_id = res.ingest.progress.processing.documents[0].document_id

print(f"[{res.ingest.status}]\t[{res.ingest.process_id}]\t\t[{document_id}]")
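
Instead of rerunning the status cell by hand, the check can be wrapped in a small polling loop. The sketch below is a generic helper and not part of the GroundX SDK: `wait_for_status`, its parameters, and the failure status names `error` and `cancelled` are assumptions made for illustration. Only `complete` and the `get_processing_status_by_id` call come from the steps above.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    fetch_status: Callable[[], str],
    done: Iterable[str] = ("complete",),
    failed: Iterable[str] = ("error", "cancelled"),  # assumed failure names
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch_status` until it returns a finished status or time runs out."""
    done_set, failed_set = set(done), set(failed)
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in done_set:
            return status
        if status in failed_set:
            raise RuntimeError(f"processing ended with status [{status}]")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still [{status}] after {timeout_s} seconds")
        # Wait before asking again so the API is not polled in a tight loop.
        sleep(interval_s)
```

In this notebook, a call could look like `wait_for_status(lambda: gx_client.documents.get_processing_status_by_id(process_id=process_id).ingest.status)`. Passing the fetch step as a callable keeps the loop independent of the client, which also makes it easy to try out with canned statuses.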

## Download Extractions

The extracted data, in JSON format, represents the final extractions from the GroundX pipeline. This step will pull the contents of the extracted data file.

In [None]:
if not document_id:
    raise Exception("set document_id")

print(document_id)

gx_client.documents.get_extract(document_id=document_id)
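
Once the extraction is available, a quick sanity check is to confirm that all five fields from the introduction were filled in. The following is a sketch under stated assumptions: it assumes the extraction can be treated as a plain dictionary, and the key names in `EXPECTED_FIELDS` are hypothetical stand-ins. The real names are whatever `./prompts/simple.yaml` defines.

```python
from typing import Any, List, Mapping

# Hypothetical key names for the five fields listed in the introduction;
# the real names come from ./prompts/simple.yaml and may differ.
EXPECTED_FIELDS = (
    "account_number",
    "amount_due",
    "due_date",
    "company",
    "payment_address",
)


def missing_fields(extraction: Mapping[str, Any]) -> List[str]:
    """Return the expected fields that are absent or empty in `extraction`."""
    empty_values = (None, "", [], {})
    return [key for key in EXPECTED_FIELDS if extraction.get(key) in empty_values]
```

An empty list from `missing_fields` means every expected field has a value. A non-empty list names the fields worth checking against the source image or the prompt definitions.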