This project extracts regional energy consumption data from China Energy Statistical Yearbook PDFs using OCR (RapidOCR) and writes it to structured Excel spreadsheets.
- Python 3
- `pdf2image` (requires `poppler` to be installed on the system)
- `rapidocr_onnxruntime`
- `pandas`
- `openpyxl`
- Place your target PDFs in a `pdfs/` directory.
- Run the batch script: `python batch_extract_all.py`
- The extracted Excel files will be generated in the `output/` directory.
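
The batch step can be pictured roughly as follows. This is a minimal sketch, not the contents of `batch_extract_all.py`: the function name `batch_extract`, the `extract_one` callback, and the choice to skip a failing PDF and continue are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, List


def batch_extract(pdf_dir: Path, out_dir: Path,
                  extract_one: Callable[[Path, Path], None]) -> List[Path]:
    """Run extract_one on every PDF in pdf_dir, writing one output file per PDF.

    extract_one is a hypothetical callback (pdf_path, xlsx_path) standing in
    for the single-file OCR step.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # One workbook per PDF; results are never merged.
        xlsx_path = out_dir / (pdf_path.stem + ".xlsx")
        try:
            extract_one(pdf_path, xlsx_path)
        except Exception as exc:
            # Assumption: a bad PDF should not stop the rest of the batch.
            print(f"skipped {pdf_path.name}: {exc}")
            continue
        written.append(xlsx_path)
    return written
```

Called as `batch_extract(Path("pdfs"), Path("output"), my_extractor)`, it returns the list of output paths that were written successfully.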
- `extract_energy_final.py`: The single-file OCR processing script. Uses `rapidocr_onnxruntime` to parse tabular rows and columns based on coordinate alignment.
- `batch_extract_all.py`: Batch script that loops over all PDFs in a folder and extracts them one by one, without merging the results.
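
To illustrate what parsing rows and columns by coordinate alignment can look like, here is a minimal sketch. It assumes each OCR detection is available as a pair of a four-corner bounding box and its text; the function name `group_into_rows` and the tolerance `y_tol` are hypothetical and not taken from `extract_energy_final.py`.

```python
from typing import List, Sequence, Tuple

Box = Sequence[Sequence[float]]  # four (x, y) corner points of a text box


def group_into_rows(items: List[Tuple[Box, str]],
                    y_tol: float = 10.0) -> List[List[str]]:
    """Group OCR text boxes into table rows, then order each row left to right.

    y_tol is an assumed pixel tolerance: a box joins the current row if its
    vertical center is within y_tol of the first box in that row.
    """
    # Reduce every box to its center point.
    cells = []
    for box, text in items:
        cx = sum(p[0] for p in box) / len(box)
        cy = sum(p[1] for p in box) / len(box)
        cells.append((cy, cx, text))
    cells.sort()  # top to bottom

    rows: List[List[Tuple[float, str]]] = []
    row_y = None  # vertical center of the first box in the current row
    for cy, cx, text in cells:
        if row_y is None or abs(cy - row_y) > y_tol:
            rows.append([])  # start a new row
            row_y = cy
        rows[-1].append((cx, text))

    # Within each row, sort by horizontal center to recover column order.
    return [[text for _, text in sorted(row)] for row in rows]
```

The resulting list of rows could then be handed to `pandas.DataFrame(rows)` and written out with `to_excel`. A suitable `y_tol` depends on the scan resolution and would need tuning per document.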
This project uses OCR to digitize data from statistical yearbooks where raw files or valid text layers are not available. Accuracy depends on the OCR engine and the PDF quality.