China Energy Yearbook PDF Extractor

This project extracts regional energy consumption data from China Energy Statistical Yearbook PDFs using OCR (RapidOCR) and writes the results to structured Excel spreadsheets.

Environment Requirements

Python 3
pdf2image (requires Poppler to be installed on the system)
rapidocr_onnxruntime
pandas
openpyxl
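
Before running the scripts, it can help to confirm that the packages above are importable and that Poppler is reachable. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes Poppler is detected through its `pdftoppm` binary, which pdf2image calls by default.

```python
import importlib.util
import shutil

# Import names for the packages listed above.
REQUIRED_MODULES = ["pdf2image", "rapidocr_onnxruntime", "pandas", "openpyxl"]


def find_missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def poppler_available():
    """Return True if Poppler's pdftoppm binary is on PATH (assumed detection method)."""
    return shutil.which("pdftoppm") is not None


if __name__ == "__main__":
    missing = find_missing_modules()
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    if not poppler_available():
        print("Poppler (pdftoppm) was not found on PATH.")
    if not missing and poppler_available():
        print("Environment looks ready.")
```

If Poppler is installed outside PATH, this check will report it as missing even though pdf2image can still be pointed at it explicitly.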

Usage

Place your target PDFs in a pdfs/ directory.
Run the batch script:

python batch_extract_all.py

The extracted Excel files are written to the output/ directory.
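
The batch flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in batch_extract_all.py: `plan_batch`, `run_batch`, and the `extract_one` callback are hypothetical names, and naming each workbook after its PDF is a guess at the output convention.

```python
from pathlib import Path


def plan_batch(pdf_dir="pdfs", out_dir="output"):
    """Pair every PDF in pdf_dir with its own .xlsx path in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(p for p in Path(pdf_dir).iterdir() if p.suffix.lower() == ".pdf")
    return [(pdf, out / (pdf.stem + ".xlsx")) for pdf in pdfs]


def run_batch(extract_one, pdf_dir="pdfs", out_dir="output"):
    """Call extract_one(pdf_path, xlsx_path) per PDF and collect failures."""
    failures = []
    for pdf, xlsx in plan_batch(pdf_dir, out_dir):
        try:
            extract_one(pdf, xlsx)
        except Exception as exc:  # one bad scan should not stop the whole batch
            failures.append((pdf.name, str(exc)))
    return failures
```

Collecting failures instead of raising lets a long batch finish and report which yearbooks need a second look.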

File Overview

extract_energy_final.py: The single-file OCR processing script. Uses rapidocr_onnxruntime to recognize text, then reconstructs table rows and columns from the coordinate alignment of the detected text.
batch_extract_all.py: Batch script that processes every PDF in a folder one at a time, without merging the results.
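
To illustrate what parsing rows by coordinate alignment can look like, here is a minimal sketch. It is not the logic in extract_energy_final.py: it assumes each OCR detection is a `(box, text, score)` entry with a four-point box, and the function names and the `y_tolerance` threshold are hypothetical.

```python
def box_center(box):
    """Return the (x, y) center of a four-point box [[x, y], ...]."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def group_into_rows(detections, y_tolerance=10.0):
    """Cluster detections into rows by vertical center, then order each row left to right.

    detections: iterable of (box, text, score) entries (assumed shape).
    y_tolerance: hypothetical pixel threshold for "same row"; tune per scan resolution.
    """
    items = sorted(
        ((*box_center(box), text) for box, text, _score in detections),
        key=lambda item: item[1],  # sort top to bottom by center y
    )
    rows = []  # each row: {"y": running mean of centers, "cells": [(x, text), ...]}
    for x, y, text in items:
        if rows and abs(y - rows[-1]["y"]) <= y_tolerance:
            row = rows[-1]
            row["cells"].append((x, text))
            row["y"] += (y - row["y"]) / len(row["cells"])  # update running mean
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    return [[text for _x, text in sorted(row["cells"])] for row in rows]
```

A fixed pixel tolerance is the simplest choice; skewed or low-resolution scans may need deskewing first or a tolerance scaled to the text height.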

Disclaimer

This project uses OCR to digitize data from statistical yearbooks where raw files or valid text layers are not available. Accuracy depends on the OCR engine and the PDF quality.
