# Chart Extractor from PDF — Starter Notebook

This notebook gives you a practical, reproducible workflow to extract **key info from charts inside PDFs**.

### What you can do here
- Convert PDF pages to images (high DPI)
- Crop to a chart region (manual coords)
- Run OCR on axis labels, tick labels, and legend text (via Tesseract)
- (Optional) Detect bars/lines with OpenCV and map pixels → data values
- Export cleaned results to CSV

### Requirements
- Python 3.8+
- Suggested packages (install in your local environment if missing):
  - `pymupdf` (PyMuPDF, imported as `fitz`) **or** `pdf2image`
  - `opencv-python`
  - `pytesseract` + Tesseract engine installed on your OS
  - `pandas`, `numpy`, `matplotlib`, `Pillow` (the last two are used for the previews and cropping in step 3)
  - (Optional) `pdfplumber` for table text; `camelot-py`/`tabula-py` for tables

If a required library is missing, the affected cell raises an error (or prints a message) that names the package to install.

---
**Tip:** If your chart is a table, try table extraction tools first (Camelot/Tabula). If it's a vector chart, `pdfplumber` or `pymupdf` text extraction may give you labels/values directly. If it's a raster image, use OCR + (optional) the bar/line detection helpers below.
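
The routing advice above can be sketched as a small decision helper. This is only an illustration: `suggest_route` and `page_stats` are hypothetical names, and the rules are assumptions rather than tuned thresholds. `page_stats` relies on PyMuPDF's `get_text("words")`, `get_images()`, and `get_drawings()` and is imported lazily so the decision logic can be tried without PyMuPDF installed.

```python
def suggest_route(n_words: int, n_images: int, n_drawings: int) -> str:
    """Rough guess at which extraction route to try first for one page.

    The rules are illustrative assumptions, not tuned values.
    """
    if n_words == 0 and n_images > 0:
        return "ocr"            # no text layer: likely a scanned or rasterized page
    if n_words > 0 and n_drawings > 0:
        return "vector-text"    # text layer plus vector paths: try direct text extraction
    if n_words > 0:
        return "table-or-text"  # text only: try Camelot/Tabula or plain text extraction
    return "ocr"                # nothing readable: fall back to rendering + OCR


def page_stats(pdf_path, page_index=0):
    """Count words, embedded images, and vector drawings on one page (needs PyMuPDF)."""
    import fitz  # imported lazily so suggest_route works without PyMuPDF
    with fitz.open(str(pdf_path)) as doc:
        page = doc[page_index]
        return (len(page.get_text("words")),
                len(page.get_images(full=True)),
                len(page.get_drawings()))


# Example with a real file (uncomment once PDF_PATH is set):
# print(suggest_route(*page_stats(PDF_PATH, 0)))
print(suggest_route(0, 1, 0))
print(suggest_route(120, 0, 45))
```

Treat the result as a starting point only; mixed pages (a raster chart next to real text) still need a manual look.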


## 0) Installation helper (run locally if needed)
Uncomment and run this cell if you need to install packages into your local environment. You can skip if everything is already installed.

In [None]:
# %%bash
# pip install --upgrade pip
# pip install pymupdf pdf2image opencv-python pytesseract pandas numpy pdfplumber matplotlib pillow
#
# # On macOS with Homebrew, you may need:
# # brew install tesseract
# # On Ubuntu/Debian:
# # sudo apt-get update && sudo apt-get install -y tesseract-ocr poppler-utils
#
# # If you plan to try table extraction:
# # pip install camelot-py tabula-py
# # Note: camelot requires ghostscript; tabula requires Java.


## 1) Set paths & options
Fill in your input PDF path and choose an output folder. Adjust DPI for image clarity (300–600 recommended).

In [None]:
from pathlib import Path

# === User inputs ===
PDF_PATH = Path('your_file.pdf')  # <-- replace with your PDF file path
OUT_DIR = Path('chart_extractor_output')
DPI = 400  # 300–600 recommended

OUT_DIR.mkdir(parents=True, exist_ok=True)
print('PDF:', PDF_PATH.resolve())
print('Output dir:', OUT_DIR.resolve())
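
To get a feel for what a given `DPI` means in practice, the sketch below converts page dimensions in PDF points (72 per inch) to rendered pixels. `rendered_size` is a hypothetical helper, and the page sizes are common examples rather than values read from your file; the actual rendered image may differ by a pixel because of rounding.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Pixel size of a page rendered at `dpi` (PDF user space is 72 points per inch)."""
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points; A4 is roughly 595 x 842 points.
for dpi in (300, 400, 600):
    w, h = rendered_size(612, 792, dpi)
    # Three bytes per pixel approximates an uncompressed RGB image in memory.
    print(f"{dpi} DPI -> {w} x {h} px (~{w * h * 3 / 1e6:.0f} MB as uncompressed RGB)")
```

Memory grows with the square of the DPI, which is why very high settings slow down the later OpenCV steps.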


## 2) Convert PDF pages to images
Uses **PyMuPDF** (preferred) or **pdf2image** (fallback; it needs Poppler installed on your system).

In [None]:
import importlib.util  # find_spec lives in the importlib.util submodule
from typing import List

def pdf_to_images(pdf_path, out_dir, dpi=400) -> List[Path]:
    images = []
    # Try PyMuPDF first (fast & precise)
    if importlib.util.find_spec('fitz') is not None:
        import fitz  # PyMuPDF
        doc = fitz.open(pdf_path)
        for i, page in enumerate(doc):
            mat = fitz.Matrix(dpi/72, dpi/72)
            pix = page.get_pixmap(matrix=mat, alpha=False)
            out_path = out_dir / f'page_{i+1:03d}.png'
            pix.save(out_path.as_posix())
            images.append(out_path)
        return images
    # Fallback to pdf2image
    elif importlib.util.find_spec('pdf2image') is not None:
        from pdf2image import convert_from_path
        pages = convert_from_path(pdf_path, dpi=dpi)
        for i, img in enumerate(pages):
            out_path = out_dir / f'page_{i+1:03d}.png'
            img.save(out_path)
            images.append(out_path)
        return images
    else:
        raise RuntimeError('Neither PyMuPDF (fitz) nor pdf2image found. Please install one of them.')

images = pdf_to_images(PDF_PATH, OUT_DIR, dpi=DPI)
print(f'Exported {len(images)} page image(s):')
for p in images:
    print(' -', p.name)


## 3) Preview pages and pick a chart region to crop
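Set the page index (0-based) and crop coordinates `(x1, y1, x2, y2)` in **pixels** of the rendered page image, with the origin at the top-left corner. You can read coordinates off any image viewer or iterate by trial. The values depend on `DPI`, so re-pick or rescale them if you change it.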

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

PAGE_INDEX = 0  # 0-based
x1, y1, x2, y2 = 100, 100, 1200, 900  # <-- adjust these

page_img = Image.open(images[PAGE_INDEX])
plt.figure(figsize=(10, 8))
plt.imshow(page_img)
plt.title(f'Full Page Preview (page {PAGE_INDEX+1})')
plt.axis('off')
plt.show()

chart = page_img.crop((x1, y1, x2, y2))
chart_path = OUT_DIR / f'chart_crop_p{PAGE_INDEX+1}.png'
chart.save(chart_path)
print('Saved chart crop ->', chart_path)

plt.figure(figsize=(8,6))
plt.imshow(chart)
plt.title('Chart Crop Preview')
plt.axis('off')
plt.show()
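
Pillow's `crop` does not complain when the box extends past the page; it pads the missing area with black, which can confuse later thresholding. A minimal sketch of two guards follows. `clamp_box` and `scale_box` are hypothetical helpers, not part of the notebook.

```python
def clamp_box(box, width, height):
    """Clip (x1, y1, x2, y2) to the image bounds and reject empty boxes."""
    x1, y1, x2, y2 = box
    x1, x2 = sorted((x1, x2))  # tolerate corners given in either order
    y1, y2 = sorted((y1, y2))
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    if x2 <= x1 or y2 <= y1:
        raise ValueError(f"Crop box {box} lies outside a {width}x{height} image")
    return x1, y1, x2, y2


def scale_box(box, from_dpi, to_dpi):
    """Rescale a pixel box picked at one DPI so it covers the same region at another."""
    factor = to_dpi / from_dpi
    return tuple(round(v * factor) for v in box)


print(clamp_box((100, 100, 5000, 900), width=3400, height=4400))
print(scale_box((100, 100, 1200, 900), from_dpi=400, to_dpi=600))
```

In the notebook you could pass `page_img.width` and `page_img.height` to `clamp_box` before calling `page_img.crop(...)`.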


## 4) OCR axis/legend text (pytesseract)
This extracts visible text from the cropped chart. Useful for axis labels, tick values, and legend labels.

> **Note:** You need Tesseract installed on your OS and `pytesseract` in Python.

In [None]:
import importlib.util  # find_spec lives in the importlib.util submodule
if importlib.util.find_spec('pytesseract') is None:
    raise RuntimeError('pytesseract not found. Install it (and OS-level Tesseract) to enable OCR.')
import pytesseract
import numpy as np
import cv2

img = cv2.imread(chart_path.as_posix())
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

ocr_text = pytesseract.image_to_string(gray)
print('--- OCR TEXT ---')
print(ocr_text)
with open(OUT_DIR / 'ocr_text.txt', 'w', encoding='utf-8') as f:
    f.write(ocr_text)
print('Saved OCR text to ocr_text.txt')
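
The raw OCR text mixes labels and numbers. As a rough sketch, the helper below pulls numeric tokens (tick values, percentages) out of a string like `ocr_text`. `parse_tick_values` is a hypothetical name, the sample string is made up for illustration, and the pattern assumes English-style numbers (`,` as thousands separator, `.` as decimal point).

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals,
# optional percent sign. Tokens glued to letters (e.g. "Q1") are skipped.
_NUM = re.compile(r"(?<![\w.])[-+]?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?%?(?!\w)")


def parse_tick_values(ocr_text: str):
    """Return (token, value, is_percent) for each numeric token in the OCR text."""
    results = []
    # OCR sometimes emits the Unicode minus sign; normalize it first.
    for match in _NUM.finditer(ocr_text.replace("\u2212", "-")):
        token = match.group(0)
        value = float(token.rstrip("%").replace(",", ""))
        results.append((token, value, token.endswith("%")))
    return results


sample = "Revenue (USD)\n1,000\n750\n500\n250\n0\nQ1 Q2 Q3 Q4\n\u221212.5%"
ticks = parse_tick_values(sample)
for token, value, is_percent in ticks:
    print(token, value, is_percent)
```

OCR misreads (for example `O` for `0`) are not corrected here, so compare the parsed values against the chart before using them for calibration.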


## 5) (Optional) Detect bars (bar charts) and convert to values
If your chart is a **bar chart**, this helper finds vertical bars via contour detection and estimates their heights. You must specify the **data range** (y-axis min & max) and the **pixel range** (baseline and top) for correct mapping.

**Workflow:**
1. Identify the y-axis baseline and top in pixels within your crop (e.g., via eyeballing/measurement).
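2. Provide `YPIX_MIN` (pixel row of the baseline, where the axis minimum sits) and `YPIX_MAX` (pixel row of the axis top). Image rows grow downward, so `YPIX_MIN` is the larger number.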
3. Provide `YVAL_MIN`, `YVAL_MAX` according to y-axis scale.
4. The code computes bar heights in data units and outputs a CSV.


In [None]:
import pandas as pd
import numpy as np
import cv2

# === User inputs for mapping ===
YPIX_MIN = 800   # pixel y of baseline (within crop)
YPIX_MAX = 100   # pixel y of top (within crop)
YVAL_MIN = 0.0   # y-axis min value
YVAL_MAX = 100.0 # y-axis max value

def pixel_to_value(y_pix):
    # Map pixel y (top-down) to data value (bottom-up). Adjust if your axis is inverted.
    # Here, smaller y (towards top) -> larger data value.
    ratio = (YPIX_MIN - y_pix) / max(1, (YPIX_MIN - YPIX_MAX))
    return YVAL_MIN + ratio * (YVAL_MAX - YVAL_MIN)

chart_bgr = cv2.imread(chart_path.as_posix())
chart_gray = cv2.cvtColor(chart_bgr, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(chart_gray, (3,3), 0)
thr = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1]

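# findContours treats white pixels as foreground. On a typical chart (dark bars,
# light background) Otsu leaves the background white, so invert when white dominates.
num_black = (thr == 0).sum()
num_white = (thr == 255).sum()
if num_white > num_black:
    thr = cv2.bitwise_not(thr)

# [-2] picks the contour list on both OpenCV 3 (3 return values) and OpenCV 4 (2).
cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]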
bars = []
for c in cnts:
    x, y, w, h = cv2.boundingRect(c)
    aspect = h / max(1.0, w)
    area = w * h
    # Heuristics for vertical bars
    if aspect > 1.5 and area > 300:  # tweak thresholds as needed
        bar_top = y
        bar_bottom = y + h
        bar_val = pixel_to_value(bar_top)
        bars.append({'x': x, 'y_top': bar_top, 'y_bottom': bar_bottom, 'w': w, 'h': h, 'value_estimate': bar_val})

bars_df = pd.DataFrame(sorted(bars, key=lambda d: d['x']))
print(bars_df.head())
bars_csv = OUT_DIR / 'bars_estimates.csv'
bars_df.to_csv(bars_csv, index=False)
print('Saved bar value estimates ->', bars_csv)
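
The mapping in `pixel_to_value` is a straight line through two reference points (baseline and top). A slightly more general sketch is shown below: it calibrates from any two tick marks whose pixel rows and values you have read off the chart, with an optional log-scale variant. `make_axis_mapper` is a hypothetical helper, and the log option assumes a base-10 axis.

```python
import math


def make_axis_mapper(pix_a, val_a, pix_b, val_b, log_scale=False):
    """Return f(pixel) -> value from two reference ticks on the same axis."""
    if pix_a == pix_b:
        raise ValueError("Reference ticks must sit at different pixel positions")
    if log_scale:
        if val_a <= 0 or val_b <= 0:
            raise ValueError("Log axes need positive reference values")
        # Interpolate linearly in log10 space, then convert back.
        val_a, val_b = math.log10(val_a), math.log10(val_b)
    slope = (val_b - val_a) / (pix_b - pix_a)

    def to_value(pix):
        v = val_a + (pix - pix_a) * slope
        return 10 ** v if log_scale else v

    return to_value


# Same calibration as the defaults above: row 800 is 0.0, row 100 is 100.0.
linear = make_axis_mapper(800, 0.0, 100, 100.0)
print(linear(450))  # halfway between the two rows

# Log axis where row 800 is 1 and row 100 is 1000.
log_axis = make_axis_mapper(800, 1, 100, 1000, log_scale=True)
print(log_axis(450))
```

Calibrating from labeled ticks rather than the axis ends can be more forgiving when the crop cuts off part of the axis.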


## 6) (Optional) Detect line/scatter points
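Line/scatter extraction is more chart-specific. A simple approach is to isolate the series color with an HSV threshold, then take the topmost matching pixel in each x-column and map it via `pixel_to_value`. The cell below reuses `chart_bgr` and `pixel_to_value` from step 5, so run that cell first. Note that the example range targets black, which also matches axes, tick labels, and gridlines inside the crop; crop tightly to the plot area or switch to the series color.
Below is a minimal example you can adapt.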

In [None]:
import numpy as np
import pandas as pd
import cv2

# HSV color range example (tweak these to match your line color)
LOWER_HSV = (0, 0, 0)      # e.g., black line lower bound
UPPER_HSV = (180, 255, 60) # e.g., black line upper bound

hsv = cv2.cvtColor(chart_bgr, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, LOWER_HSV, UPPER_HSV)

coords = []
height, width = mask.shape
for x in range(width):
    col = mask[:, x]
    ys = np.where(col > 0)[0]
    if len(ys) > 0:
        y_top = ys.min()
        val = pixel_to_value(y_top)
        coords.append({'x': x, 'y': y_top, 'value_estimate': val})

line_df = pd.DataFrame(coords)
line_csv = OUT_DIR / 'line_estimates.csv'
line_df.to_csv(line_csv, index=False)
print('Saved line estimates ->', line_csv)
print(line_df.head())
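
The per-column trace yields one row per pixel column, which is usually far more than the chart's real data points. One possible way to reduce it, sketched below, is to sample the trace at chosen x positions (for example the x-axis tick locations) and take the median row in a small window to dampen stray pixels. `sample_trace` is a hypothetical helper; it expects a list of dicts with `x` and `y` keys like `coords` above, and the demo data is synthetic.

```python
from statistics import median


def sample_trace(coords, x_positions, half_window=2):
    """Median pixel row of the traced line near each requested x position."""
    by_x = {c["x"]: c["y"] for c in coords}
    samples = []
    for x0 in x_positions:
        window = range(x0 - half_window, x0 + half_window + 1)
        ys = [by_x[x] for x in window if x in by_x]
        # None marks positions where the mask had no pixels at all.
        samples.append({"x": x0, "y": median(ys) if ys else None})
    return samples


# Synthetic trace: a rising line with one stray pixel at x=150.
demo = [{"x": x, "y": 500 - x} for x in range(100, 201)]
demo[50]["y"] = 0
picked = sample_trace(demo, [150, 300])
print(picked)
```

Each sampled `y` can then be passed through `pixel_to_value` to get a data value.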


## 7) (Optional) Extract text from vector PDFs
If your PDF is vector-based (not raster), text objects may be directly extractable with PyMuPDF or pdfplumber. This can pull axis labels and legend text without OCR.

In [None]:
import importlib.util  # find_spec lives in the importlib.util submodule
if importlib.util.find_spec('fitz') is None:
    print('PyMuPDF (fitz) not found; install it to use vector text extraction.')
else:
    import fitz
    doc = fitz.open(PDF_PATH)
    page = doc[PAGE_INDEX]
    text = page.get_text("text")
    print('--- Page text (vector) ---\n')
    print(text[:2000])
    with open(OUT_DIR / f'page_{PAGE_INDEX+1:03d}_text.txt', 'w', encoding='utf-8') as f:
        f.write(text)
    print('Saved vector text to', OUT_DIR / f'page_{PAGE_INDEX+1:03d}_text.txt')
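
To limit vector text extraction to the chart area, the pixel crop box from step 3 can be converted back to PDF points. The sketch below assumes an unrotated page whose rendered image starts at the page's top-left corner. `pixel_box_to_points` is a hypothetical helper, and the commented lines use the `clip` argument of PyMuPDF's `get_text`.

```python
def pixel_box_to_points(box, dpi):
    """Convert (x1, y1, x2, y2) in rendered pixels to PDF points (72 per inch)."""
    factor = 72 / dpi
    return tuple(v * factor for v in box)


clip_pts = pixel_box_to_points((100, 100, 1200, 900), dpi=400)
print(clip_pts)

# With PyMuPDF installed and `page` loaded as in the cell above:
# import fitz
# chart_text = page.get_text("text", clip=fitz.Rect(*clip_pts))
# print(chart_text)
```

If the clipped text comes back empty while the full-page text does not, the box is probably off or the chart labels are not stored as text objects.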


## 8) Export & sanity checks
At this point, you should have:
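- `page_NNN.png`: one rendered image per PDF page
- `chart_crop_pX.png`: the cropped chart image
- `ocr_text.txt`: OCR'd labels/ticks/legend (if the OCR step was run)
- `bars_estimates.csv` or `line_estimates.csv`: estimated numeric data (if those steps were run)
- `page_NNN_text.txt`: the page's embedded text (if step 7 was run)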

Always **spot-check** a few points against the chart (e.g., read a bar height and compare to CSV) to validate mapping and thresholds.
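
A spot-check can be as simple as comparing a few hand-read values against the estimates. The sketch below is one way to do that; `spot_check` is a hypothetical helper, the numbers are made up, and the 5% tolerance (as a fraction of the axis span) is an arbitrary starting point.

```python
def spot_check(estimates, manual_readings, tolerance=0.05, axis_span=100.0):
    """Compare estimates with hand-read values; tolerance is a fraction of the axis span."""
    report = []
    for key, expected in manual_readings.items():
        got = estimates.get(key)
        ok = got is not None and abs(got - expected) <= tolerance * axis_span
        report.append((key, expected, got, ok))
    return report


# Keys are bar positions (row index in the CSV); values are in data units.
estimates = {0: 41.8, 1: 67.0, 2: 12.4}
manual = {0: 42, 1: 75}
report = spot_check(estimates, manual)
for key, expected, got, ok in report:
    print(f"bar {key}: read {expected}, estimated {got}, {'OK' if ok else 'CHECK'}")
```

With the notebook's outputs, `estimates` could be built from the `value_estimate` column of `bars_df`, and `axis_span` from `YVAL_MAX - YVAL_MIN`.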

## 9) Troubleshooting & Tips
- Increase `DPI` in step 1 and re-render (step 2) if OCR is noisy (try 600). Crop coordinates are in pixels, so rescale them after changing `DPI`.
- Adjust the crop to include full axes (so OCR sees ticks/labels).
- For bar detection, tweak the `aspect` and `area` thresholds and consider morphological operations.
- For line extraction, refine HSV ranges to isolate the series color.
- If you only need insights (trends, peaks), reading the chart and writing bullet-point summaries is often faster.