zhangpelf/PDF-

China Energy Yearbook PDF Extractor

This project extracts regional energy consumption data from the China Energy Statistical Yearbook PDFs using OCR (RapidOCR) and creates structured Excel spreadsheets.

Environment Requirements

  • Python 3
  • pdf2image (requires poppler to be installed on the system)
  • rapidocr_onnxruntime
  • pandas
  • openpyxl
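
Before running the scripts, it can help to confirm that the requirements listed above are present. The following is a minimal sketch, not part of the repository: the function name `missing_dependencies` is hypothetical, and the check for `pdftoppm` assumes poppler's command-line tools are expected on the PATH, which is how pdf2image normally finds them.

```python
import importlib.util
import shutil

# Python packages named in the requirements list.
REQUIRED_MODULES = ["pdf2image", "rapidocr_onnxruntime", "pandas", "openpyxl"]


def missing_dependencies(modules=REQUIRED_MODULES, binaries=("pdftoppm",)):
    """Return the names of modules and executables that cannot be found.

    Hypothetical helper for illustration only.
    """
    # A module counts as missing when no import spec can be located for it.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # pdftoppm ships with poppler; its absence suggests poppler is not installed.
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All dependencies found.")
```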

Usage

  1. Place your target PDFs in a pdfs/ directory.
  2. Run the batch script:
     python batch_extract_all.py
  3. The extracted Excel records will be generated in the output/ directory.
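
The workflow above (read every PDF in pdfs/, write one result per PDF to output/) could be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of batch_extract_all.py: the function `batch_extract`, the `extract_one` callback, and the `<pdf name>.xlsx` output naming are all hypothetical.

```python
from pathlib import Path


def batch_extract(pdf_dir, out_dir, extract_one):
    """Process each PDF in pdf_dir independently and report per-file status.

    extract_one(pdf_path, target_path) is a hypothetical callback that
    performs the OCR and writes the spreadsheet for a single PDF.
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    results = {}
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # Assumed naming scheme: one workbook per PDF, same base name.
        target = out_dir / (pdf_path.stem + ".xlsx")
        try:
            extract_one(pdf_path, target)
            results[pdf_path.name] = "ok"
        except Exception as exc:
            # A failure on one PDF should not stop the rest of the batch.
            results[pdf_path.name] = f"failed: {exc}"
    return results
```

Catching errors per file keeps one unreadable scan from aborting the whole run, which matters when OCR quality varies between yearbooks.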

File Overview

  • extract_energy_final.py: The single-file OCR processing script. Uses rapidocr_onnxruntime to parse tabular rows/columns based on coordinate alignment.
  • batch_extract_all.py: Batch loop script to extract all PDFs in a folder one by one without merging them.
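
To illustrate what parsing rows and columns "based on coordinate alignment" can look like, here is a minimal sketch. It is not the code in extract_energy_final.py. It assumes each OCR detection arrives as a `(box, text, score)` triple, where `box` is a list of four `[x, y]` corner points; the function name `group_into_rows` and the `y_tolerance` parameter are hypothetical.

```python
from statistics import mean


def group_into_rows(detections, y_tolerance=10.0):
    """Group OCR detections into table rows, each sorted left to right.

    detections: iterable of (box, text, score), with box given as four
    [x, y] corner points (assumed input format).
    """
    items = []
    for box, text, _score in detections:
        xs = [point[0] for point in box]
        ys = [point[1] for point in box]
        # Represent each detection by its vertical center and left edge.
        items.append((mean(ys), min(xs), text))

    # Walk from the top of the page down, opening a new row whenever the
    # vertical gap to the current row exceeds the tolerance.
    items.sort(key=lambda item: item[0])
    rows = []
    for y, x, text in items:
        if rows and abs(y - rows[-1]["y"]) <= y_tolerance:
            row = rows[-1]
            row["cells"].append((x, text))
            # Keep a running average of the row's vertical center.
            row["y"] += (y - row["y"]) / len(row["cells"])
        else:
            rows.append({"y": y, "cells": [(x, text)]})

    # Within each row, order cells by their left edge.
    return [[text for _x, text in sorted(row["cells"])] for row in rows]
```

A fixed pixel tolerance is the simplest choice; skewed or low-resolution scans may need a tolerance scaled to the detected text height.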

Disclaimer

This project uses OCR to digitize data from statistical yearbooks where raw files or valid text layers are not available. Accuracy depends on the OCR engine and the PDF quality.
