# Mixed PDF Page Extractor: Text + Tables + Bar Charts

This notebook segments a PDF page into **text**, **table**, and **bar-chart** regions and extracts key info from each.

### What you'll get
- **Text regions** → TXT files
- **Table regions** → CSV files
- **Bar chart regions** → numeric CSV + recreated chart PNG
- **(Optional) PPTX** summarizing the outputs

### Strategies
1. **Vector-first** (recommended):
   - Use `pdfplumber`/`PyMuPDF` to get page objects, words, images, and lines.
   - Detect **tables** with Camelot (lattice/stream) or `pdfplumber` table parser.
   - Detect **figures/charts** via `page.images`/non-text vector regions; pass suspected charts to the bar-digitizer.
2. **Vision-first** (fallback):
   - Convert page → image, run OpenCV:
     - **Tables**: detect gridlines (Hough/erosion) and cells → export to CSV.
     - **Bar-charts**: find vertical bars and map pixel heights → values.
     - **Text**: OCR within non-table/non-chart regions.

You can run either or both, depending on your document.
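
One rough way to choose between the two paths is to count how many words the PDF's text layer yields (for example `len(page.extract_words())` in `pdfplumber`). The sketch below assumes that signal; the function name `choose_strategy` and the `min_words` cutoff are illustrative and not part of this notebook.

```python
def choose_strategy(word_count, min_words=5):
    """Pick an extraction path from the number of words in the PDF text layer.

    `min_words` is an assumed cutoff; tune it for your documents.
    """
    return 'vector' if word_count >= min_words else 'vision'

print(choose_strategy(120))  # born-digital page with a text layer
print(choose_strategy(0))    # scanned page without a text layer
```

A page that mixes a text layer with scanned figures may still benefit from running both paths.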


## 0) Install packages (run locally if missing)
Uncomment the cell below and install packages in your local environment.

In [None]:
# %%bash
# pip install --upgrade pip
# pip install pymupdf pdfplumber camelot-py tabula-py opencv-python pytesseract pdf2image pandas numpy matplotlib python-pptx
#
# # OS dependencies:
# # macOS: brew install tesseract ghostscript
# # Ubuntu/Debian: sudo apt-get update && sudo apt-get install -y tesseract-ocr poppler-utils ghostscript default-jre


## 1) Set inputs and output directory
- Provide the PDF path, page index, and an output directory.
- Choose DPI for rasterization.


In [None]:
from pathlib import Path

PDF_PATH = Path('your_file.pdf')   # <-- change this to your PDF
PAGE_INDEX = 0                     # 0-based page index
OUT_DIR = Path('mixed_pdf_output')
DPI = 400

OUT_DIR.mkdir(parents=True, exist_ok=True)
print('PDF:', PDF_PATH.resolve())
print('Page:', PAGE_INDEX+1)
print('OUT_DIR:', OUT_DIR.resolve())


## 2) Convert the target page to an image (for Vision-first path)
Uses PyMuPDF (`fitz`) if available, otherwise pdf2image.

In [None]:
import importlib.util  # `import importlib` alone does not guarantee the `util` submodule is loaded
def page_to_image(pdf_path, page_index, out_dir, dpi=400):
    if importlib.util.find_spec('fitz') is not None:
        import fitz
        doc = fitz.open(pdf_path)
        page = doc[page_index]
        mat = fitz.Matrix(dpi/72, dpi/72)
        pix = page.get_pixmap(matrix=mat, alpha=False)
        out = out_dir / f'page_{page_index+1:03d}.png'
        pix.save(out.as_posix())
        return out
    elif importlib.util.find_spec('pdf2image') is not None:
        from pdf2image import convert_from_path
        imgs = convert_from_path(pdf_path, dpi=dpi, first_page=page_index+1, last_page=page_index+1)
        out = out_dir / f'page_{page_index+1:03d}.png'
        imgs[0].save(out)
        return out
    else:
        raise RuntimeError('Install PyMuPDF or pdf2image to rasterize PDF pages.')

PAGE_IMG = page_to_image(PDF_PATH, PAGE_INDEX, OUT_DIR, dpi=DPI)
PAGE_IMG

## 3A) Vector-first segmentation (objects from pdfplumber)
We try to identify **tables**, **images (figures/charts)**, and text regions directly from the PDF.
- Tables: try `camelot` (lattice/stream) on the page. If not available or fails, try `pdfplumber` tables.
- Figures/charts: `page.images` + non-text vector regions.
- Text: `page.extract_words()` grouped into blocks.

This section collects candidate regions, each with a bounding box, in the in-memory `regions` list for further processing.
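
The regions live only in memory. If you want to keep them between runs, one option is to serialize them to JSON. The sketch below mirrors the `Region` dataclass used in the next cell; the helper name `regions_to_json` is hypothetical.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Region:
    kind: str    # 'text' | 'table' | 'figure'
    bbox: tuple  # (x0, top, x1, bottom)
    meta: dict

def regions_to_json(regions):
    """Serialize Region objects to a JSON string (tuples are written as lists)."""
    return json.dumps([asdict(r) for r in regions], indent=2)

demo = [Region('figure', (72.0, 100.0, 300.0, 260.0), {'source': 'image'})]
print(regions_to_json(demo))
```

Writing the returned string to a file under `OUT_DIR` would let later steps reload the regions without re-parsing the PDF.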


In [None]:
import importlib.util, json
from dataclasses import dataclass

@dataclass
class Region:
    kind: str  # 'text' | 'table' | 'figure'
    bbox: tuple  # (x0,y0,x1,y1) in PDF coordinate space
    meta: dict

regions = []

if importlib.util.find_spec('pdfplumber') is None:
    print('pdfplumber not installed, skipping vector-first. Install to enable.')
else:
    import pdfplumber
    with pdfplumber.open(PDF_PATH) as pdf:
        page = pdf.pages[PAGE_INDEX]
        # 1) Candidate figures from embedded images
        for im in page.images:
            bbox = (im['x0'], im['top'], im['x1'], im['bottom'])
            regions.append(Region('figure', bbox, {'source': 'image'}))

        # 2) Text regions: group words by proximity (simple heuristic)
        words = page.extract_words(use_text_flow=True)
        # If words empty, skip
        if words:
            # Simple block merging: expand a running bbox for contiguous words by y-line proximity
            lines = {}
            for w in words:
                y_center = (w['top'] + w['bottom'])/2
                key = round(y_center/6)*6  # bin by approx 6 pts
                lines.setdefault(key, []).append(w)
            # merge lines into blocks by vertical gaps
            line_boxes = []
            for _, ws in lines.items():
                xs0 = min(w['x0'] for w in ws); ys0 = min(w['top'] for w in ws)
                xs1 = max(w['x1'] for w in ws); ys1 = max(w['bottom'] for w in ws)
                line_boxes.append([xs0, ys0, xs1, ys1])
            # merge nearby lines vertically
            line_boxes.sort(key=lambda b: (b[1], b[0]))
            merged = []
            for b in line_boxes:
                if not merged:
                    merged.append(b)
                else:
                    a = merged[-1]
                    # if vertical gap small and x-overlap reasonable, merge
                    if b[1] - a[3] < 12 and not (b[0] > a[2] or b[2] < a[0]):
                        a[0] = min(a[0], b[0]); a[1] = min(a[1], b[1])
                        a[2] = max(a[2], b[2]); a[3] = max(a[3], b[3])
                    else:
                        merged.append(b)
            for b in merged:
                regions.append(Region('text', tuple(b), {'source': 'words'}))

        # 3) Tables via Camelot (if available) else pdfplumber
        if importlib.util.find_spec('camelot') is not None:
            import camelot
            for flavor in ('stream', 'lattice'):
                try:
                    found = camelot.read_pdf(PDF_PATH.as_posix(), pages=str(PAGE_INDEX+1), flavor=flavor)
                    for t in found:
                        bx0, by0, bx1, by1 = t._bbox  # Camelot coords, origin at bottom-left
                        # Flip y so the bbox matches pdfplumber's (x0, top, x1, bottom) with a top-left origin
                        top = page.height - max(by0, by1)
                        bottom = page.height - min(by0, by1)
                        regions.append(Region('table', (bx0, top, bx1, bottom),
                                              {'source': f'camelot_{flavor}', 'camelot_bbox': tuple(t._bbox)}))
                except Exception as e:
                    print(f'Camelot {flavor} error:', e)
        else:
            # Fall back to pdfplumber's line-based table finder, which reports one bbox per table
            try:
                table_settings = {
                    'vertical_strategy': 'lines',
                    'horizontal_strategy': 'lines'
                }
                for tb in page.find_tables(table_settings):
                    # tb.bbox is (x0, top, x1, bottom) in pdfplumber page coordinates
                    regions.append(Region('table', tuple(tb.bbox), {'source': 'pdfplumber_lines'}))
            except Exception as e:
                print('pdfplumber table finding error:', e)

len(regions)

## 3B) Vision-first segmentation (OpenCV heuristics)
We detect **tables** via gridlines and **bar charts** via vertical rectangles; remaining areas are likely **text**.
The cell stores candidate bounding boxes in image (pixel) coordinates: `table_boxes` as `(x, y, w, h)` and `chart_regions_img` as `(x0, y0, x1, y1)`.
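
Step 3A works in PDF points and this step works in pixels, so comparing regions across the two paths needs a unit conversion. A minimal sketch, assuming the page was rasterized at `DPI` with no rotation or crop offset and that boxes use a top-left origin (the helper names are hypothetical):

```python
def pdf_bbox_to_pixels(bbox, dpi):
    """Convert (x0, top, x1, bottom) in PDF points to integer pixel coordinates."""
    scale = dpi / 72.0  # a PDF point is 1/72 inch
    x0, top, x1, bottom = bbox
    return (int(x0 * scale), int(top * scale), int(x1 * scale), int(bottom * scale))

def pixel_box_to_pdf(box, dpi):
    """Convert an OpenCV-style (x, y, w, h) pixel box to (x0, top, x1, bottom) in points."""
    scale = 72.0 / dpi
    x, y, w, h = box
    return (x * scale, y * scale, (x + w) * scale, (y + h) * scale)

print(pdf_bbox_to_pixels((72, 144, 216, 288), dpi=144))  # (144, 288, 432, 576)
```

At the notebook's default of 400 DPI the scale factor is 400/72, so one point spans about 5.56 pixels.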

In [None]:
import cv2, numpy as np
from PIL import Image

img = cv2.imread(str(PAGE_IMG))
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thr = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1]

# --- Table detection via line morphology ---
horiz = thr.copy(); vert = thr.copy()
h_scale = max(10, thr.shape[1]//60)
v_scale = max(10, thr.shape[0]//60)
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (h_scale,1))
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,v_scale))
h_lines = cv2.morphologyEx(255 - horiz, cv2.MORPH_OPEN, h_kernel)
v_lines = cv2.morphologyEx(255 - vert,  cv2.MORPH_OPEN, v_kernel)
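# Union of both line masks, so each table's ruling lines form one connected outline.
# (bitwise_and would keep only the line intersections, which are too small to pass the size filter below.)
grid = cv2.bitwise_or(h_lines, v_lines)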

cnts,_ = cv2.findContours(grid, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
table_boxes = []
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    if w*h > 5000 and w>80 and h>60:
        table_boxes.append((x,y,w,h))

# --- Bar chart detection via tall rectangles ---
blur = cv2.GaussianBlur(gray, (3,3), 0)
bt = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1]
# findContours treats white as foreground, so on a light page invert to make dark bars white
if (bt==255).sum() > (bt==0).sum():
    bt = cv2.bitwise_not(bt)
bt = cv2.morphologyEx(bt, cv2.MORPH_CLOSE, cv2.getStructuringElement(cv2.MORPH_RECT,(3,3)), iterations=1)
cnts,_ = cv2.findContours(bt, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
bar_boxes = []
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    aspect = h / max(1.0,w)
    area = w*h
    if aspect>1.5 and area>300 and h>30:
        bar_boxes.append((x,y,w,h))

# Merge bars that are horizontally close into a single chart region (union bounding box)
def merge_boxes(boxes, pad=10, gap=40):
    if not boxes: return []
    merged = []
    for (x,y,w,h) in sorted(boxes):
        # pad each bar box, clamping at 0 so later array slicing stays valid
        bx = [max(0, x-pad), max(0, y-pad), x+w+pad, y+h+pad]
        a = merged[-1] if merged else None
        # boxes are sorted by x, so merge when the next box starts within `gap` px of the running region
        if a is not None and bx[0] <= a[2] + gap:
            a[0]=min(a[0],bx[0]); a[1]=min(a[1],bx[1]); a[2]=max(a[2],bx[2]); a[3]=max(a[3],bx[3])
        else:
            merged.append(bx)
    return [tuple(m) for m in merged]

chart_regions_img = merge_boxes(bar_boxes)
print('Table boxes (image coords):', table_boxes)
print('Chart candidate regions (image coords):', chart_regions_img)


## 4) Extract text by region (vector-first or OCR fallback)
We will:
- Use `pdfplumber` (if available) to extract the page's embedded text in a single pass. The cell below reads the whole page rather than cropping to the regions gathered earlier.
- When vector extraction is not possible, OCR the page image, optionally restricting it to non-table/non-chart areas.


In [None]:
import importlib.util

text_outputs = []
if importlib.util.find_spec('pdfplumber') is not None:
    import pdfplumber
    with pdfplumber.open(PDF_PATH) as pdf:
        page = pdf.pages[PAGE_INDEX]
        texts = page.extract_text(x_tolerance=1.5, y_tolerance=2)
        if texts:
            fp = OUT_DIR / 'page_text_vector.txt'
            with open(fp,'w',encoding='utf-8') as f:
                f.write(texts)
            text_outputs.append(fp)

# OCR fallback: runs when pdfplumber is missing or the page has no text layer (e.g. a scan)
if not text_outputs:
    print('No vector text extracted; falling back to OCR for text.')
    if importlib.util.find_spec('pytesseract') is not None:
        import pytesseract, cv2
        img = cv2.imread(str(PAGE_IMG))
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        gray = cv2.threshold(gray,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1]
        txt = pytesseract.image_to_string(gray)
        fp = OUT_DIR / 'page_text_ocr.txt'
        with open(fp,'w',encoding='utf-8') as f:
            f.write(txt)
        text_outputs.append(fp)
    else:
        print('pytesseract not installed; cannot OCR text.')

text_outputs

## 5) Extract tables to CSV
Try **Camelot** first (lattice, then stream), then **pdfplumber** table extraction. If vector extraction produces no tables, we fall back to **Vision-first** extraction: OCR inside each detected table box.
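
In the vision fallback, OCR tokens have to be grouped into rows before they can be written as CSV. Binning `top` into fixed 10 px buckets can split one row across two buckets (tops of 14 and 16 round to 10 and 20). A gap-based alternative is sketched below; the function name `group_rows` and the `max_gap` default are assumptions, not part of this notebook.

```python
def group_rows(tokens, max_gap=8):
    """Group (top, left, text) OCR tokens into rows.

    A new row starts when a token's `top` exceeds the previous token's `top`
    by more than `max_gap` pixels. `max_gap` is an assumed tuning value.
    """
    rows = []
    prev_top = None
    for top, left, text in sorted(tokens):
        if prev_top is None or top - prev_top > max_gap:
            rows.append([])
        rows[-1].append((left, text))
        prev_top = top
    # Order the cells in each row from left to right
    return [[text for _, text in sorted(row)] for row in rows]

tokens = [(14, 10, 'Item'), (16, 120, 'Qty'), (44, 10, 'Bolt'), (46, 120, '7')]
print(group_rows(tokens))  # [['Item', 'Qty'], ['Bolt', '7']]
```

A sensible `max_gap` is somewhat less than the line height at your chosen DPI.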

In [None]:
import importlib.util
table_csvs = []
if importlib.util.find_spec('camelot') is not None:
    import camelot
    try:
        tables = camelot.read_pdf(PDF_PATH.as_posix(), pages=str(PAGE_INDEX+1), flavor='lattice')
        for i,t in enumerate(tables):
            fp = OUT_DIR / f'table_lattice_{i+1}.csv'
            t.to_csv(fp.as_posix())
            table_csvs.append(fp)
    except Exception as e:
        print('Camelot lattice failed:', e)
    try:
        tables = camelot.read_pdf(PDF_PATH.as_posix(), pages=str(PAGE_INDEX+1), flavor='stream')
        for i,t in enumerate(tables):
            fp = OUT_DIR / f'table_stream_{i+1}.csv'
            t.to_csv(fp.as_posix())
            table_csvs.append(fp)
    except Exception as e:
        print('Camelot stream failed:', e)

if not table_csvs and importlib.util.find_spec('pdfplumber') is not None:
    import pdfplumber, csv
    with pdfplumber.open(PDF_PATH) as pdf:
        page = pdf.pages[PAGE_INDEX]
        try:
            table = page.extract_table({'vertical_strategy':'lines','horizontal_strategy':'lines'})
            if table:
                fp = OUT_DIR / 'table_pdfplumber_lines.csv'
                with open(fp,'w',newline='',encoding='utf-8') as f:
                    writer = csv.writer(f)
                    writer.writerows(table)
                table_csvs.append(fp)
        except Exception as e:
            print('pdfplumber table failed:', e)

if not table_csvs:
    # Vision fallback: crop each detected table box and attempt OCR-based CSV (simple heuristic)
    import cv2, pytesseract, csv
    img = cv2.imread(str(PAGE_IMG))
    for i,(x,y,w,h) in enumerate(table_boxes, start=1):
        roi = img[y:y+h, x:x+w]
        gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
        gray = cv2.threshold(gray,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1]
        data = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)
        # naive row grouping by y
        rows = {}
        for j in range(len(data['text'])):
            txt = data['text'][j].strip()
            if not txt: continue
            yy = data['top'][j]
            key = round(yy/10)*10
            rows.setdefault(key, []).append((data['left'][j], txt))
        csv_rows = []
        for _, cells in sorted(rows.items()):
            line = [t for _,t in sorted(cells)]
            csv_rows.append(line)
        fp = OUT_DIR / f'table_vision_{i}.csv'
        with open(fp,'w',newline='',encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerows(csv_rows)
        table_csvs.append(fp)

table_csvs

## 6) Extract bar-chart numbers from detected chart regions
For each chart candidate region, we run bar detection and convert pixel heights → values.
- You must provide a **y-axis mapping** for each region (baseline/top pixel and data min/max).
- If the chart has tick labels, you can add OCR code to infer mapping automatically from two ticks.
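
Once two tick labels are known (by OCR or by reading them off the image), the pixel-to-value mapping is a straight line through the two points. A minimal sketch, assuming a linear y-axis; the function name `axis_mapping_from_ticks` is hypothetical:

```python
def axis_mapping_from_ticks(tick_a, tick_b):
    """Return a function mapping a y pixel to a data value.

    Each tick is (y_pixel, value). Assumes a linear axis; a log axis
    needs a different mapping.
    """
    (pa, va), (pb, vb) = tick_a, tick_b
    if pa == pb:
        raise ValueError('ticks must be at different pixel rows')
    slope = (vb - va) / (pb - pa)  # data units per pixel (negative, since y grows downward)
    return lambda y_pixel: va + (y_pixel - pa) * slope

to_value = axis_mapping_from_ticks((450, 0.0), (50, 100.0))
print(to_value(250))  # 50.0, a bar top halfway between the two ticks
```

Passing `(YPIX_BASELINE, YVAL_MIN)` and `(YPIX_TOP, YVAL_MAX)` as the two ticks gives the same mapping the cell below computes in `map_val`.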

In [None]:
import cv2, numpy as np, pandas as pd
from PIL import Image

# === Provide defaults; adjust per region ===
YVAL_MIN = 0.0
YVAL_MAX = 100.0

def extract_bars_from_roi(roi_bgr, ypix_baseline, ypix_top, yval_min, yval_max, idx_prefix='chart'):
    gray = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3,3), 0)
    thr = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1]
    # findContours treats white as foreground, so on a light background invert to make dark bars white
    if (thr==255).sum() > (thr==0).sum():
        thr = cv2.bitwise_not(thr)
    thr = cv2.morphologyEx(thr, cv2.MORPH_CLOSE, cv2.getStructuringElement(cv2.MORPH_RECT,(3,3)), iterations=1)
    cnts,_ = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    bars=[]
    for c in cnts:
        x,y,w,h = cv2.boundingRect(c)
        area=w*h; aspect = h/max(1.0,w)
        if aspect>1.5 and area>300 and h>20:
            bars.append((x,y,w,h))
    bars = sorted(bars, key=lambda t: t[0])
    span_pix = max(1, ypix_baseline - ypix_top)
    def map_val(y_top):
        r = (ypix_baseline - y_top)/span_pix
        return yval_min + r*(yval_max - yval_min)
    rows=[]
    for i,(x,y,w,h) in enumerate(bars, start=1):
        rows.append({'index': i, 'x': x, 'y_top': y, 'w': w, 'h_px': h, 'value': map_val(y)})
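    # Fix the columns so an empty result still writes a CSV header that Step 7 can read
    return pd.DataFrame(rows, columns=['index', 'x', 'y_top', 'w', 'h_px', 'value'])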

chart_csvs = []
img = cv2.imread(str(PAGE_IMG))
for k,(x0,y0,x1,y1) in enumerate(chart_regions_img, start=1):
    roi = img[y0:y1, x0:x1]
    # TODO: set these per chart region (measure in the ROI image coordinates)
    YPIX_BASELINE = int(0.9*(y1-y0))
    YPIX_TOP      = int(0.1*(y1-y0))
    df = extract_bars_from_roi(roi, YPIX_BASELINE, YPIX_TOP, YVAL_MIN, YVAL_MAX, idx_prefix=f'c{k}')
    fp = OUT_DIR / f'chart_{k}_bars.csv'
    df.to_csv(fp, index=False)
    chart_csvs.append(fp)
chart_csvs

## 7) Recreate a bar chart from extracted CSV and save PNG
Pick one of the `chart_*.csv` files and plot it.

In [None]:
import pandas as pd, matplotlib.pyplot as plt
from pathlib import Path

# Choose a chart CSV (change if needed)
CHART_CSV = chart_csvs[0] if 'chart_csvs' in globals() and chart_csvs else None
if CHART_CSV:
    df = pd.read_csv(CHART_CSV)
    plt.figure(figsize=(8,5))
    plt.bar(df['index'].astype(str), df['value'].astype(float))
    plt.ylabel('Value')
    plt.title(CHART_CSV.name)
    plt.tight_layout()
    CHART_PNG = OUT_DIR / (Path(CHART_CSV).stem + '.png')
    plt.savefig(CHART_PNG.as_posix(), dpi=200)
    plt.show()
    print('Saved chart PNG:', CHART_PNG)  # a bare expression inside an if-block is not displayed
else:
    print('No chart CSVs produced yet. Adjust detection/mapping in Step 6 and re-run.')


## 8) (Optional) Build a PPTX with extracts
Creates a PowerPoint with:
- Page text extract (first 2,000 chars)
- First detected table CSV as a table
- First chart PNG


In [None]:
import importlib.util
if importlib.util.find_spec('pptx') is None:
    print('python-pptx not installed; skipping PPT build.')
else:
    from pptx import Presentation
    from pptx.util import Inches
    prs = Presentation()
    # Title
    slide = prs.slides.add_slide(prs.slide_layouts[0])
    slide.shapes.title.text = 'PDF Mixed-Layout Extracts'
    slide.placeholders[1].text = f'Page {PAGE_INDEX+1} from {PDF_PATH.name}'
    # Text
    if 'text_outputs' in globals() and text_outputs:
        slide2 = prs.slides.add_slide(prs.slide_layouts[1])
        slide2.shapes.title.text = 'Text Extract'
        with open(text_outputs[0], 'r', encoding='utf-8') as f:
            snippet = f.read()[:2000]
        slide2.placeholders[1].text = snippet
    # Table
    if 'table_csvs' in globals() and table_csvs:
        import pandas as pd
        df = pd.read_csv(table_csvs[0], header=None)
        slide3 = prs.slides.add_slide(prs.slide_layouts[6])
        rows, cols = df.shape
        table = slide3.shapes.add_table(rows+1, cols, Inches(0.5), Inches(1), Inches(9), Inches(5)).table
        for j in range(cols):
            table.cell(0,j).text = f'Col {j+1}'
        for i in range(rows):
            for j in range(cols):
                table.cell(i+1,j).text = str(df.iat[i,j])
    # Chart
    if 'CHART_PNG' in globals() and CHART_PNG:
        slide4 = prs.slides.add_slide(prs.slide_layouts[6])
        slide4.shapes.add_picture(CHART_PNG.as_posix(), Inches(1), Inches(1), width=Inches(8))
    PPTX_PATH = OUT_DIR / 'mixed_layout_summary.pptx'
    prs.save(PPTX_PATH.as_posix())
    print('Saved PPTX:', PPTX_PATH)  # a bare expression inside an else-block is not displayed


## 9) Next steps & tuning
- If **tables** aren't detected:
  - Try Camelot `flavor='stream'` vs `'lattice'`.
  - Increase DPI and re-run.
  - Tweak morphology kernel sizes in Step 3B.
- If **bar charts** aren't detected:
  - Adjust the aspect/area thresholds.
  - Provide exact `YPIX_BASELINE` and `YPIX_TOP` (measure in the ROI) plus real `YVAL_MIN/MAX`.
- If **text** is poor:
  - Use vector text (pdfplumber) when possible.
  - For OCR, try `--psm 6`/`--oem 1` configs and language packs.
- For multi-series charts or stacked bars, segment by color and detect sub-rectangles (requires more OpenCV steps).
