## SKU Price List Conversion: Dynamic PDF to JSON Parser

This Jupyter notebook, `dynamic_pdf_to_json.ipynb`, converts SKU price lists from unstructured PDF pages into structured JSON. It renders a selected range of PDF pages as images, extracts SKU information from each image with the Gemini Pro Vision model, and collects the results into a single list of JSON objects that is saved to one file.

##### IMPORT LIBRARIES

In [1]:
import google.generativeai as genai
import os
from pdf2image import convert_from_path
import json

##### SETUP API KEY

In [2]:
# Placeholder for your API key; avoid committing a real key with the notebook
GOOGLE_API_KEY = "your-api-key-here"  # replace 'your-api-key-here' with your actual API key

genai.configure(api_key=GOOGLE_API_KEY)
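
A hardcoded key is easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable. The helper name `load_api_key` and the variable name `GOOGLE_API_KEY` are assumptions for illustration, not part of the original notebook.

```python
import os

def load_api_key(env_var="GOOGLE_API_KEY"):
    """Return the API key stored in the given environment variable.

    Hypothetical helper: raises RuntimeError if the variable is unset or blank.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook")
    return key
```

The returned value could then be passed to `genai.configure(api_key=...)` in place of the placeholder string.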

##### MODEL LIST

In [3]:
for m in genai.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)

models/gemini-1.0-pro
models/gemini-1.0-pro-001
models/gemini-1.0-pro-latest
models/gemini-1.0-pro-vision-latest
models/gemini-pro
models/gemini-pro-vision


##### LOAD MODEL WITH CONFIGS

In [4]:
# Model Configuration
MODEL_CONFIG = {
  "temperature": 0.2,
  "top_p": 1,
  "top_k": 32,
  "max_output_tokens": 4096,
}

# Safety settings for the model
safety_settings = [
  {
    "category": "HARM_CATEGORY_HARASSMENT",
    "threshold": "BLOCK_MEDIUM_AND_ABOVE"
  },
  {
    "category": "HARM_CATEGORY_HATE_SPEECH",
    "threshold": "BLOCK_MEDIUM_AND_ABOVE"
  },
  {
    "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "threshold": "BLOCK_MEDIUM_AND_ABOVE"
  },
  {
    "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
    "threshold": "BLOCK_MEDIUM_AND_ABOVE"
  }
]

In [5]:
model = genai.GenerativeModel(model_name="gemini-pro-vision",
                              generation_config=MODEL_CONFIG,
                              safety_settings=safety_settings)

##### DEFINE IMAGE INPUT FORMAT

In [6]:
from pathlib import Path

def image_format(image_path):
    img = Path(image_path)

    if not img.exists():
        raise FileNotFoundError(f"Could not find image: {img}")

    image_parts = [
        {
            "mime_type": "image/png",  # Supported MIME types: image/png (PNG), image/jpeg (JPEG), image/webp (WEBP)
            "data": img.read_bytes()
        }
    ]
    return image_parts
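
`image_format` always reports `image/png`, which matches the PNG files this notebook writes. If JPEG or WEBP inputs were ever passed in, the MIME type would need to follow the file extension. The sketch below shows one way to do that. `MIME_TYPES` and `guess_image_mime_type` are hypothetical names, and the table covers only the three formats listed in the comment above.

```python
from pathlib import Path

# Hypothetical lookup table limited to the formats named in the notebook comment
MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
}

def guess_image_mime_type(image_path):
    """Return the MIME type for an image path based on its extension."""
    suffix = Path(image_path).suffix.lower()
    if suffix not in MIME_TYPES:
        raise ValueError(f"Unsupported image extension: {suffix!r}")
    return MIME_TYPES[suffix]
```

The result could replace the literal `"image/png"` in the `mime_type` field.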


##### DEFINE GEMINI MODEL OUTPUT

In [7]:
def gemini_output(image_path, system_prompt, user_prompt):
    # Send the system prompt, the page image, and the user prompt in a single request
    image_info = image_format(image_path)
    input_prompt = [system_prompt, image_info[0], user_prompt]
    response = model.generate_content(input_prompt)
    return response.text

##### DEFINE PDF PATH & CREATE DIRECTORY

In [8]:
pdf_path = "pdfs/boundaried/sku_list_2.pdf"  # Path of the PDF to parse
pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
os.makedirs(pdf_name, exist_ok=True)

##### DEFINE PAGES TO PARSE

In [9]:
images = convert_from_path(pdf_path, first_page=2, last_page=10)  # 1-based, inclusive page range to parse
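
Because `first_page=2`, the first element of `images` is page 2 of the PDF, so a list index of 0 does not correspond to page 1. A small helper like the one below could keep file names and log messages tied to the actual PDF page. The function name `page_image_name` and the `page_NNN.png` naming scheme are assumptions for illustration.

```python
def page_image_name(index, first_page=2):
    """Map a 0-based position in `images` to a file name carrying the PDF page number."""
    if index < 0 or first_page < 1:
        raise ValueError("index must be >= 0 and first_page must be >= 1")
    return f"page_{first_page + index:03d}.png"
```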

##### DEFINE PROMPTS

In [10]:
system_prompt = """
As an expert in analyzing price lists, your role involves processing images containing SKU price lists. Your objective is to transform these images into a collection of JSON objects, where each object encapsulates details of an individual SKU. Each JSON must include the following attributes: 'ID' (refer to this as Model Number or IDH No), 'Title' (equivalent to Model Category), 'Description' (or Model Name/Product Description), 'Price' (known as MRP/Cost), and 'Pack Size' (referred to as Pack/Unit). Ensure the values for these attributes accurately reflect the data presented in the tables. The final output should be in JSON format, directly representing the extracted data without converting it into a string.
"""

user_prompt = """
Carefully examine the entire image, focusing on the PriceList data. Extract information for 10 unique SKUs present and organize it into JSON format. Each JSON object must include the following fields with their respective values derived from the data: 'ID' (also known as Model Number), 'Title' (also known as Model Category), 'Description' (also known as Model Name/Description), 'Price' (also known as MRP/Cost), and 'Pack Size' (including units like Pack, Unit, Ml, Kg). Ensure each field's values are closely matched to the data within the tables. Output should be in JSON format without including the term 'SKUs' within the objects. Avoid using triple backticks (```) at the start or end of your response. Prioritize extracting SKUs with distinct 'ID's, if such uniqueness is depicted in the input image.
"""

##### PROCESS IMAGES & GENERATE JSON OUTPUTS

In [11]:
all_json_outputs = [] 

for i, image in enumerate(images):
    image_path = os.path.join(pdf_name, f"output_image_{i}.png")
    image.save(image_path, "PNG")

    response_text = gemini_output(image_path, system_prompt, user_prompt)

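    try:
        json_output = json.loads(response_text)
    except json.JSONDecodeError:
        print(f"Failed to parse JSON for image {i}. Response was: {response_text}")
        continue

    # The model may return a single object instead of a list of objects
    if isinstance(json_output, dict):
        json_output = [json_output]
    all_json_outputs.extend(json_output)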
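
The user prompt asks the model not to wrap its answer in triple backticks, but a model may still do so, and `json.loads` then fails on an otherwise valid response. The sketch below is one tolerant way to parse the raw text. `parse_sku_response` is a hypothetical helper, not part of the original notebook, and it assumes at most one surrounding fence.

```python
import json

FENCE = "`" * 3  # triple backtick, built this way to keep the literal out of the source

def parse_sku_response(response_text):
    """Return a list of SKU dicts parsed from raw model text, or [] if parsing fails."""
    text = response_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag such as "json"
        _, _, text = text.partition("\n")
        text = text.strip()
        if text.endswith(FENCE):
            text = text[: -len(FENCE)].strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    if isinstance(parsed, dict):
        # A single object is wrapped so callers always receive a list
        return [parsed]
    if isinstance(parsed, list):
        return [item for item in parsed if isinstance(item, dict)]
    return []
```

Inside the loop, `all_json_outputs.extend(parse_sku_response(response_text))` could stand in for the `try`/`except` block, at the cost of no longer printing the failing response.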

##### SAVE ALL JSON OUTPUTS

In [12]:
final_json_path = os.path.join(pdf_name, f"{pdf_name}.json")
with open(final_json_path, 'w') as f:
    json.dump(all_json_outputs, f)

print(f"Final JSON saved to: {final_json_path}.")

Final JSON saved to: sku_list_2/sku_list_2.json.
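
The prompts ask for five fields per SKU and for distinct `ID` values, but nothing in the notebook checks the model's output against those requirements. A check along the following lines could run on `all_json_outputs` before the list is written to disk. `REQUIRED_FIELDS` mirrors the field names in the prompts. `clean_sku_records` is a hypothetical helper, and keeping the first record for each `ID` is an assumption.

```python
REQUIRED_FIELDS = ("ID", "Title", "Description", "Price", "Pack Size")

def clean_sku_records(records):
    """Keep records that have every required field, dropping repeated IDs."""
    seen_ids = set()
    cleaned = []
    for record in records:
        if not isinstance(record, dict):
            continue
        if any(field not in record for field in REQUIRED_FIELDS):
            continue
        sku_id = str(record["ID"]).strip()
        if sku_id in seen_ids:
            # Assumption: the first occurrence of an ID wins
            continue
        seen_ids.add(sku_id)
        cleaned.append(record)
    return cleaned
```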
