Catalogist

Transform product/SKU price lists from PDF to JSON: an automated extraction, cleaning, and conversion pipeline for SKU data.

Overview

This repository contains a Python-based solution for converting SKU (Stock Keeping Unit) price lists from PDF format into structured JSON objects. It handles PDFs whose tables have boundary lines as well as PDFs whose tables do not, cleans the extracted data to remove unwanted information, and exports the result to JSON files in which each object represents one SKU and its attributes.

Prerequisites

Python 3.8 or higher
Java 8 or higher (required by tabula-py)

Installation

Clone the repository:

```
git clone https://github.com/txhno/sku-list-parser.git
cd sku-list-parser
```

Install required Python packages:
```
pip install -r requirements.txt
```
Ensure Java (version 8 or above) is installed and properly set up on your system. You can verify this by running:
```
java -version
```
If Java is not installed, install it from Oracle or another Java distribution of your choice.
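
The same check can be done from Python before running the pipeline. The following is a minimal sketch, not part of the repository; the helper name `java_version_banner` is hypothetical. It relies only on the standard library and on the fact that `java -version` prints its banner to stderr.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the text printed by `java -version`, or None if Java is not on PATH.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip() or None


if __name__ == "__main__":
    banner = java_version_banner()
    if banner:
        print(banner)
    else:
        print("Java not found on PATH; tabula-py will not be able to run.")
```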

Usage

Place your SKU price list PDFs in the pdfs/boundaried or pdfs/unboundaried directory, depending on whether the tables in each PDF have boundary lines around them.
Then run the pipeline script:
```
python3 run_pipeline.py
```

This script processes all PDFs, extracts their tables to CSV, cleans the CSV data, and converts it into JSON files. The JSON files are saved in the exported-jsons directory.
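
As an illustration of why the two input directories matter, the sketch below chooses tabula-py extraction flags from the folder a PDF sits in. It is an assumption about how such a step could look, not the code in pdf_extractor.py; the function names `tabula_options` and `extract_tables` are hypothetical. The `lattice` and `stream` flags are real `tabula.read_pdf` options: lattice mode relies on ruling lines, and stream mode infers columns from whitespace.

```python
from pathlib import Path


def tabula_options(pdf_path):
    """Pick tabula-py extraction flags from the PDF's parent folder name."""
    folder = Path(pdf_path).parent.name
    if folder == "boundaried":
        # Lattice mode uses the ruling lines drawn around table cells.
        return {"lattice": True}
    if folder == "unboundaried":
        # Stream mode infers columns from the whitespace between text runs.
        return {"stream": True}
    raise ValueError(f"unexpected folder for {pdf_path!r}: {folder!r}")


def extract_tables(pdf_path):
    """Return a list of DataFrames, one per table that tabula-py detects."""
    import tabula  # imported lazily: requires tabula-py and a Java runtime

    return tabula.read_pdf(str(pdf_path), pages="all", **tabula_options(pdf_path))
```

For example, `tabula_options("pdfs/boundaried/list.pdf")` selects lattice mode, while a file under pdfs/unboundaried selects stream mode. The file name `list.pdf` is only a placeholder.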

Dynamic PDF Parser

For dynamic PDF-to-JSON conversion, install the requirements, then open and run the dynamic_pdf_to_json.ipynb Jupyter notebook.

The notebook walks step by step through converting selected PDF pages to structured JSON using the Gemini Pro Vision model. It is aimed at SKU price list PDFs, including image-based ones. Both the model settings and the range of pages to convert can be configured within the notebook.

Project Structure

- pdfs/boundaried and pdfs/unboundaried: Directories to place your PDF files for processing.
- src: Contains the Python scripts for the pipeline steps.
  - pdf_extractor.py: Extracts tables from PDFs to CSV format.
  - csv_cleaner.py: Cleans the extracted CSV files.
  - json_converters: Contains various converter scripts to transform cleaned CSVs into JSON format.
- exported-jsons: The output directory where the JSON files are saved.
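
To make the clean-then-convert steps concrete, here is a small self-contained sketch. It is not the code in csv_cleaner.py or json_converters. The cleaning rules (trim whitespace, drop empty rows, drop repeated header rows) and the column names are assumptions chosen for illustration, and it uses the standard library rather than pandas so that it runs without extra dependencies.

```python
import csv
import io
import json


def clean_rows(rows):
    """Strip whitespace, drop empty rows, and drop repeated header rows."""
    header = [cell.strip() for cell in rows[0]]
    cleaned = []
    for row in rows[1:]:
        cells = [cell.strip() for cell in row]
        if not any(cells) or cells == header:
            continue
        cleaned.append(cells)
    return header, cleaned


def rows_to_json(header, rows):
    """Serialize each row as one JSON object keyed by the header names."""
    return json.dumps([dict(zip(header, row)) for row in rows], indent=2)


# Hypothetical extracted CSV; the column names are illustrative only.
raw_csv = (
    "sku,description,price\n"
    "A-100, Widget ,4.50\n"
    ",,\n"
    "sku,description,price\n"
    "B-200,Gadget,12.00\n"
)
header, rows = clean_rows(list(csv.reader(io.StringIO(raw_csv))))
print(rows_to_json(header, rows))
```

Each surviving row becomes one JSON object, which matches the "one object per SKU" output described in the Overview. Values stay as strings here; converting prices to numbers would be a further cleaning rule.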

Built With

- Python - The primary programming language used.
- pandas - Data manipulation and analysis library.
- tabula-py - Python wrapper for Tabula, used to extract tables from PDFs into pandas DataFrames.
- OpenAI Assistants API - Its knowledge-retrieval tool was used to create a separate JSON converter for each CSV layout.

Contributing

Contributions to Catalogist are welcome! Please submit pull requests to suggest improvements, or open issues to report bugs.

Authors

Roshan Warrier - Project Owner - Txhno
