A repo that shows how to automatically extract the data from a line chart. It is mainly a wrapper around LineFormer and ChartDete.
- You need a modal.com account to run this repo out of the box. Sign up at modal.com.
- Deploy the relevant functions by running:
chmod +x deploy.sh && ./deploy.sh
If you'd like to see a "modal-free" version of this, ping me.
All images in the folder input will be processed.
- Add your images to the input folder.
- In the root folder, run the data extraction using:
modal run plextract/main.py
- Download the processed files using
modal volume get plextract-vol <run_id>
The run id is a UUID and can be found in the console log. For the example files, the result will look like this:
<run_id>/
├── input
│   ├── input1.jpeg
│   ├── input2.jpeg
│   └── input3.png
└── output
    ├── input1.jpeg
    │   ├── axis_label_texts.json  # Text extracted from axis labels
    │   ├── chartdete
    │   │   ├── bounding_boxes.json
    │   │   ├── cropped_xlabels_0.jpg  # Cropped images of axis labels
    │   │   ├── ...
    │   │   ├── cropped_ylabels_0.jpg
    │   │   ├── ...
    │   │   ├── label_coordinates.json  # Coordinates of the detected elements
    │   │   └── predictions.jpg  # Image with bounding boxes of detected elements
    │   ├── converted_datapoints
    │   │   ├── data.json  # The extracted data!
    │   │   └── plot.png  # The plot generated from the extracted data
    │   └── lineformer
    │       ├── coordinates.json  # The image relative coordinates of the lines
    │       └── prediction.png
    ├── input2.jpeg
    │   ├── ...
    └── input3.png
        ├── ...

14 directories, 60 files
- The extracted data is provided as JSON, e.g. <run_id>/output/input1.jpeg/converted_datapoints/data.json.
- You can use display_extracted_data.ipynb to plot the extracted data.
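Since every input image gets its own output folder, it can be handy to collect all results of a run in one go. The following is a minimal sketch, not part of the repo: `load_extracted_data` is a hypothetical helper that relies only on the folder layout shown above. The schema of data.json is not described here, so the sketch returns the parsed JSON untouched.

```python
import json
from pathlib import Path


def load_extracted_data(run_dir):
    """Return {image name: parsed data.json} for one downloaded run folder.

    Hypothetical helper; it assumes the layout
    <run_dir>/output/<image name>/converted_datapoints/data.json.
    """
    results = {}
    pattern = "output/*/converted_datapoints/data.json"
    for data_file in sorted(Path(run_dir).glob(pattern)):
        # The folder two levels above data.json is named after the input image.
        image_name = data_file.parent.parent.name
        with data_file.open(encoding="utf-8") as fh:
            results[image_name] = json.load(fh)
    return results
```

Call it with the folder downloaded from the volume, e.g. `load_extracted_data("<run_id>")`; images without a data.json are simply skipped.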
The pipeline works as follows:
- Use ChartDete to detect chart elements, most importantly axis labels and the plot area.
- OCR the numbers from the labels.
- Extract the coordinates of the lines in the line chart using LineFormer.
- Correct the coordinates of the lines to be relative to the plot origin.
- Calculate the conversion from pixels to axis values.
- Convert the coordinates using the conversion parameters from the previous step.
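The last three steps can be sketched in plain Python. This is an illustration under stated assumptions, not the repo's actual code: the function names are hypothetical, the axes are assumed to be linear, the plot origin is assumed to be the bottom-left corner of the plot area in image pixels (with image y growing downward), and the label values are assumed to be already parsed to numbers by the OCR step.

```python
def to_plot_frame(point, origin):
    """Shift an image-pixel point so it is relative to the plot origin.

    Assumption: `origin` is the bottom-left corner of the plot area and
    image y grows downward, so the y coordinate is flipped.
    """
    x, y = point
    ox, oy = origin
    return x - ox, oy - y


def fit_axis(pixels, values):
    """Least-squares fit of value = slope * pixel + intercept (linear axes only)."""
    n = len(pixels)
    if n < 2 or n != len(values):
        raise ValueError("need at least two matching (pixel, value) pairs")
    mean_p = sum(pixels) / n
    mean_v = sum(values) / n
    spread = sum((p - mean_p) ** 2 for p in pixels)
    if spread == 0:
        raise ValueError("label positions must not all coincide")
    slope = sum((p - mean_p) * (v - mean_v) for p, v in zip(pixels, values)) / spread
    return slope, mean_v - slope * mean_p


def convert_line(points, origin, x_fit, y_fit):
    """Convert image-pixel line points to axis values using the two fits."""
    converted = []
    for point in points:
        px, py = to_plot_frame(point, origin)
        converted.append((x_fit[0] * px + x_fit[1], y_fit[0] * py + y_fit[1]))
    return converted


# Made-up numbers for illustration, not taken from the example files.
origin = (50, 400)
# x labels 0, 10, 20 detected at image x = 50, 150, 250
x_fit = fit_axis([to_plot_frame((x, 400), origin)[0] for x in (50, 150, 250)], [0, 10, 20])
# y labels 0, 5, 10 detected at image y = 400, 300, 200
y_fit = fit_axis([to_plot_frame((50, y), origin)[1] for y in (400, 300, 200)], [0, 5, 10])
line = convert_line([(100, 350), (200, 250)], origin, x_fit, y_fit)
print(line)
```

Fitting over all detected labels rather than just two makes the conversion less sensitive to a single misread label. Logarithmic axes would need the fit to be done on the logarithm of the label values instead.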
This chart was generated with matplotlib from the extracted data (example/data.json).
If you need help setting this up or would just like to use it, shoot me an email: mail@timonschneider.de