# Bar Chart → Numbers → PowerPoint (Reproduction Pipeline)

This notebook detects bars in a chart image (from a PDF), extracts numeric values, and reproduces the chart inside a PowerPoint file.

### Workflow
1. **Convert PDF pages to images** (PyMuPDF or pdf2image)
2. **Crop** to the bar chart region
3. **Detect bars** with OpenCV and estimate their heights
4. **Map pixels → values** using your y-axis scale (linear mapping)
5. **(Optional) OCR** labels near bars for category names
6. **Export CSV** of results
7. **Recreate chart** with matplotlib (PNG)
8. **Generate PPTX** with the chart image and a data table

This template uses manual y-axis mapping as the robust default. If you need automatic mapping, add OCR for the tick labels and compute the linear mapping from two detected ticks.
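
As a rough illustration of the two-tick idea, the sketch below derives a linear pixel-to-value mapping from two tick positions. The function name `mapping_from_ticks` and the tick coordinates are hypothetical; in a real run they would come from OCR of the tick labels, which this notebook does not implement.

```python
def mapping_from_ticks(tick_a, tick_b):
    """Return a function mapping pixel y to data value, given two (y_pixel, value) ticks.

    Assumes a linear axis; the two ticks must sit at different pixel rows.
    """
    (ya, va), (yb, vb) = tick_a, tick_b
    if ya == yb:
        raise ValueError("ticks must be at different pixel rows")
    slope = (vb - va) / (yb - ya)  # data units per pixel (negative, since y grows downward)
    return lambda y: va + slope * (y - ya)

# Hypothetical ticks: value 0 at pixel row 780, value 50 at pixel row 450
to_value = mapping_from_ticks((780, 0.0), (450, 50.0))
print(to_value(120))  # value at pixel row 120
```

Any two ticks on the axis determine the same line, so ticks that are far apart are preferable: a one-pixel error in locating them then has less effect on the slope.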


## 0) Install (run locally if needed)
Uncomment if you are missing packages.

In [None]:
# %%bash
# pip install --upgrade pip
# pip install pymupdf pdf2image opencv-python pytesseract matplotlib pandas numpy python-pptx
#
# # macOS (Homebrew) for Tesseract and Poppler (pdf2image needs Poppler):
# # brew install tesseract poppler
# # Ubuntu/Debian:
# # sudo apt-get update && sudo apt-get install -y tesseract-ocr poppler-utils


## 1) Paths and options
- Set your PDF path and output folder.
- Choose the DPI for rasterizing pages (300–600 usually works well).

In [None]:
from pathlib import Path

PDF_PATH = Path('your_file.pdf')   # <-- change
OUT_DIR = Path('bar_pipeline_output')
DPI = 400  # 300–600 recommended
OUT_DIR.mkdir(parents=True, exist_ok=True)
print('PDF:', PDF_PATH.resolve())
print('OUT_DIR:', OUT_DIR.resolve())


## 2) PDF → page images
Tries **PyMuPDF** (fitz) first, then falls back to **pdf2image**.

In [None]:
import importlib.util  # find_spec lives in the importlib.util submodule
from typing import List

def pdf_to_images(pdf_path, out_dir, dpi=400) -> List[Path]:
    images = []
    if importlib.util.find_spec('fitz') is not None:
        import fitz
        doc = fitz.open(str(pdf_path))  # pass a plain string path rather than a Path object
        for i, page in enumerate(doc):
            mat = fitz.Matrix(dpi/72, dpi/72)
            pix = page.get_pixmap(matrix=mat, alpha=False)
            out_path = out_dir / f'page_{i+1:03d}.png'
            pix.save(out_path.as_posix())
            images.append(out_path)
        return images
    elif importlib.util.find_spec('pdf2image') is not None:
        from pdf2image import convert_from_path
        pages = convert_from_path(pdf_path, dpi=dpi)
        for i, img in enumerate(pages):
            out_path = out_dir / f'page_{i+1:03d}.png'
            img.save(out_path)
            images.append(out_path)
        return images
    else:
        raise RuntimeError('Install PyMuPDF or pdf2image to convert PDF pages to images.')

images = pdf_to_images(PDF_PATH, OUT_DIR, dpi=DPI)
print(f'Exported {len(images)} page image(s).')
for p in images: print(' -', p.name)
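
The `dpi/72` zoom factor used above comes from PDF page sizes being expressed in points (1/72 inch). A small sketch, with the hypothetical helper `raster_size`, estimates how large the rendered page will be for a given DPI, which helps when choosing crop coordinates in the next step. The exact pixmap size may differ by a pixel because of rounding in the renderer.

```python
def raster_size(width_pt, height_pt, dpi):
    """Estimate the pixel dimensions of a page rendered at `dpi`.

    PDF units are points (1/72 inch), so pixels = points * dpi / 72.
    """
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)

# US Letter is 612 x 792 points
for dpi in (300, 400, 600):
    print(dpi, raster_size(612, 792, dpi))
```

Doubling the DPI quadruples the pixel count, so very high settings mainly cost memory and time once the bar edges are already sharp.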


## 3) Pick page & crop chart area
- Set `PAGE_INDEX` and crop coords `(x1, y1, x2, y2)`.
- Use the previews to dial in a tight crop around the bars **including axes** for better OCR/mapping.

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

PAGE_INDEX = 0  # 0-based
x1, y1, x2, y2 = 100, 100, 1200, 900  # <-- adjust

full = Image.open(images[PAGE_INDEX])
plt.figure(figsize=(10,8)); plt.imshow(full); plt.axis('off'); plt.title(f'Page {PAGE_INDEX+1}'); plt.show()

crop = full.crop((x1,y1,x2,y2))
CROP_PATH = OUT_DIR / f'crop_p{PAGE_INDEX+1}.png'
crop.save(CROP_PATH)
plt.figure(figsize=(8,6)); plt.imshow(crop); plt.axis('off'); plt.title('Chart Crop'); plt.show()
print('Saved crop at', CROP_PATH)
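
Because the crop coordinates are set by hand, a box that overshoots the page is easy to produce, and Pillow does not raise an error for it; the out-of-range area can end up as blank pixels that confuse detection. The hypothetical helper `clamp_box` below is one way to guard against that by clamping the box to the image size first.

```python
def clamp_box(box, width, height):
    """Clamp an (x1, y1, x2, y2) crop box to an image of the given size."""
    x1, y1, x2, y2 = box
    # Clamp each coordinate into range, then order them so x1 <= x2 and y1 <= y2
    x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
    y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
    if x1 == x2 or y1 == y2:
        raise ValueError("crop box is empty after clamping")
    return x1, y1, x2, y2

# Hypothetical page of 1000 x 800 pixels with a box that overshoots right and bottom
print(clamp_box((100, 100, 1200, 900), 1000, 800))
```

In this notebook it could be applied as `full.crop(clamp_box((x1, y1, x2, y2), *full.size))`, since `Image.size` is `(width, height)`.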


## 4) Detect bars (OpenCV) and compute heights
This finds candidate vertical rectangles (bars) and sorts them left→right. Adjust thresholds if needed.

In [None]:
import cv2, numpy as np, pandas as pd

img = cv2.imread(CROP_PATH.as_posix())
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thr = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1]

# Invert if needed so bars are white on black: findContours treats white as foreground,
# so the majority (background) pixels must end up black.
if (thr==255).sum() > (thr==0).sum():
    thr = cv2.bitwise_not(thr)

# Clean up noise
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
thr = cv2.morphologyEx(thr, cv2.MORPH_CLOSE, kernel, iterations=1)

cnts, _ = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
bars = []
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    area = w*h
    aspect = h / max(1.0, w)
    # Heuristics: tall-enough, skinny-enough, sizeable area
    if h > 20 and w > 5 and aspect > 1.5 and area > 300:
        bars.append((x,y,w,h))

bars = sorted(bars, key=lambda t: t[0])
print(f'Found {len(bars)} bar candidates.')

# Visual check
vis = img.copy()
for (x,y,w,h) in bars:
    cv2.rectangle(vis, (x,y), (x+w,y+h), (0,0,255), 2)
VIS_PATH = OUT_DIR / 'bars_detected_preview.png'
cv2.imwrite(VIS_PATH.as_posix(), vis)
print('Saved detection preview at', VIS_PATH)
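
As a cross-check on the contour results, a different and simpler technique is a column projection: count foreground pixels in each column and group adjacent columns that exceed a threshold. This is not the method used in the cell above, only a sketch of an alternative, shown on a toy binary mask in plain Python. The function name and `min_height` parameter are hypothetical.

```python
def bars_from_column_profile(mask, min_height=2):
    """Find bars in a binary mask (list of rows, 1 = foreground) by column projection.

    Returns (x_start, x_end_exclusive, height_px) for each run of adjacent
    columns whose foreground count is at least `min_height`.
    """
    n_cols = len(mask[0])
    profile = [sum(row[c] for row in mask) for c in range(n_cols)]
    runs, start = [], None
    for c, count in enumerate(profile + [0]):  # trailing 0 closes a run at the right edge
        if count >= min_height and start is None:
            start = c
        elif count < min_height and start is not None:
            runs.append((start, c, max(profile[start:c])))
            start = None
    return runs

# Toy 5x8 mask with two bars: columns 1-2 (height 4) and columns 5-6 (height 2)
mask = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
]
print(bars_from_column_profile(mask))
```

If the number of runs disagrees with the number of contour candidates, that is a hint to revisit the thresholds. On a real crop, `min_height` would need to exceed the thickness of the x-axis line, which contributes a few foreground pixels to every column.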


## 5) Map pixel heights → numeric values
Provide your y-axis mapping:
- `YPIX_BASELINE`: pixel y at the baseline (x-axis) **within the crop**
- `YPIX_TOP`: pixel y at the top of the plotted range
- `YVAL_MIN` and `YVAL_MAX`: corresponding data values

The function assumes a **linear** axis. For a log-scale axis, interpolate in the logarithm of the value instead.

In [None]:
# === Set these to match your chart ===
YPIX_BASELINE = 780   # pixel y for baseline (larger y is lower on image)
YPIX_TOP      = 120   # pixel y for top of chart area
YVAL_MIN      = 0.0
YVAL_MAX      = 100.0

def ypix_to_value(y_top):
    # y decreases upward; top of bar is y_top
    span_pix = max(1, (YPIX_BASELINE - YPIX_TOP))
    ratio = (YPIX_BASELINE - y_top) / span_pix
    return YVAL_MIN + ratio * (YVAL_MAX - YVAL_MIN)

rows = []
for idx,(x,y,w,h) in enumerate(bars, start=1):
    val = ypix_to_value(y)
    rows.append({'index': idx, 'x': x, 'y_top': y, 'width': w, 'height_px': h, 'value': val})

data_df = pd.DataFrame(rows)
DATA_CSV = OUT_DIR / 'bar_values.csv'
data_df.to_csv(DATA_CSV, index=False)
data_df
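
Once values are computed, a quick spot-check against a few values read by eye from the original chart helps catch calibration mistakes early. The sketch below uses a hypothetical helper `spot_check` and made-up numbers; it only reports bars whose extracted value differs from the hand-read one by more than a chosen tolerance.

```python
def spot_check(extracted, expected, tolerance):
    """Compare extracted values with hand-read ones.

    `extracted` is a list of values in bar order; `expected` maps a 1-based bar
    index to the value read from the original chart. Returns a list of
    (index, extracted, expected) tuples that differ by more than `tolerance`.
    """
    mismatches = []
    for idx, want in expected.items():
        got = extracted[idx - 1]
        if abs(got - want) > tolerance:
            mismatches.append((idx, got, want))
    return mismatches

# Hypothetical numbers: bars 1 and 3 were read by eye from the source chart
values = [42.1, 57.8, 90.3]
print(spot_check(values, {1: 42.0, 3: 85.0}, tolerance=1.0))
```

With the notebook's data this could be called as `spot_check(data_df['value'].tolist(), {...}, tolerance=...)`. A consistent offset across all checked bars usually points to `YPIX_BASELINE` or `YPIX_TOP` being off rather than to a detection problem.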


## 6) (Optional) OCR category labels (bottom zones under each bar)
We crop a small rectangle under each bar and OCR it as the category label. Adjust `PAD` and `ROI_HEIGHT` if needed.

In [None]:
import importlib.util
import cv2, numpy as np

if importlib.util.find_spec('pytesseract') is None:
    print('pytesseract not installed. Skipping OCR labels step.')
else:
    import pytesseract
    PAD = 5          # horizontal padding around each bar, in pixels
    ROI_HEIGHT = 40  # height of the region below the baseline to search for x labels
    crop_bgr = cv2.imread(CROP_PATH.as_posix())
    img_h, img_w = crop_bgr.shape[:2]
    labels = []
    for (x, y, w, h) in bars:
        # rx*/ry* names avoid overwriting the crop coordinates (x1, y1, x2, y2) from step 3
        rx1 = max(0, x - PAD)
        rx2 = min(img_w, x + w + PAD)
        ry1 = min(img_h - 1, int(YPIX_BASELINE) + 5)
        ry2 = min(img_h, ry1 + ROI_HEIGHT)
        roi = crop_bgr[ry1:ry2, rx1:rx2]
        if roi.size == 0:
            txt = ''
        else:
            roi_gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
            roi_bin = cv2.threshold(roi_gray, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1]
            # --psm 7 tells Tesseract to treat the region as a single line of text
            txt = pytesseract.image_to_string(roi_bin, config='--psm 7').strip()
        labels.append(txt)
    # Attach labels to dataframe
    if len(labels) == len(data_df):
        data_df['label'] = labels
    else:
        data_df['label'] = ''
    DATA_CSV = OUT_DIR / 'bar_values_with_labels.csv'
    data_df.to_csv(DATA_CSV, index=False)
    print(data_df)  # a bare expression inside the else block would not be displayed


## 7) Recreate bar chart (matplotlib) and export PNG
- Single plot using matplotlib's default style and colors.

In [None]:
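import matplotlib.pyplot as plt

# Use OCR labels when available; fall back to 1-based bar numbers otherwise.
# The 'label' column does not exist if the OCR step was skipped.
fallback = [str(i) for i in range(1, len(data_df) + 1)]
if 'label' in data_df.columns:
    ocr_labels = data_df['label'].fillna('').astype(str).str.strip().tolist()
else:
    ocr_labels = [''] * len(data_df)
labels_for_plot = [lab if lab else fb for lab, fb in zip(ocr_labels, fallback)]
values_for_plot = data_df['value'].astype(float).values

# Plot against integer positions so duplicate labels do not collapse onto one bar
positions = list(range(len(values_for_plot)))
plt.figure(figsize=(8,5))
plt.bar(positions, values_for_plot)
plt.xticks(positions, labels_for_plot)
plt.ylabel('Value')
plt.title('Recreated Bar Chart')
plt.tight_layout()
CHART_PNG = OUT_DIR / 'recreated_bar_chart.png'
plt.savefig(CHART_PNG.as_posix(), dpi=200)
plt.show()
print('Saved chart image at', CHART_PNG)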


## 8) Build PPT (python-pptx) with chart image and data table
Creates a simple PPT with:
- Title slide
- Slide with the chart image
- Slide with a table of (label, value)

In [None]:
import importlib.util
from datetime import datetime  # used for the timestamp on the title slide

if importlib.util.find_spec('pptx') is None:
    raise RuntimeError('python-pptx is not installed. Please install it to build PPTX.')
from pptx import Presentation
from pptx.util import Inches, Pt

prs = Presentation()

# Title slide
title_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(title_layout)
slide.shapes.title.text = 'Bar Chart Reproduction'
slide.placeholders[1].text = f'Generated on {datetime.now().strftime("%Y-%m-%d %H:%M")}'

# Chart image slide
blank = prs.slide_layouts[6]
slide2 = prs.slides.add_slide(blank)
left, top = Inches(1), Inches(1)
slide2.shapes.add_picture(CHART_PNG.as_posix(), left, top, width=Inches(8))

# Data table slide
slide3 = prs.slides.add_slide(blank)
rows = len(data_df) + 1
cols = 2 if 'label' in data_df.columns else 1
table = slide3.shapes.add_table(rows, cols, Inches(0.5), Inches(1), Inches(9), Inches(5)).table
if cols == 2:
    table.cell(0,0).text = 'Category'
    table.cell(0,1).text = 'Value'
    for i,(lab,val) in enumerate(zip(data_df.get('label', ['']*len(data_df)), data_df['value']), start=1):
        table.cell(i,0).text = str(lab)
        table.cell(i,1).text = f"{float(val):.4g}"
else:
    table.cell(0,0).text = 'Value'
    for i,val in enumerate(data_df['value'], start=1):
        table.cell(i,0).text = f"{float(val):.4g}"

PPTX_PATH = OUT_DIR / 'bar_chart_reproduction.pptx'
prs.save(PPTX_PATH.as_posix())
PPTX_PATH


## 9) Tips & accuracy checks
- Ensure crop contains the **full y-range** and baseline.
- Tune detection thresholds (`aspect`, `area`, morphology) as needed.
- Verify 2–3 bars by reading their values from the original chart and comparing them to your CSV.
- For stacked bars, detect sub-rectangles (contour hierarchy) or color-segment each layer.
- For log scales, replace linear `ypix_to_value` with a log-mapping.
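
For the log-scale case, one possible replacement for the linear function is sketched below. It interpolates linearly in `log10` of the value between the two calibration rows. The function name `ypix_to_value_log` and the sample axis limits are assumptions for illustration; unlike the notebook's `ypix_to_value`, it takes the calibration values as arguments instead of reading globals.

```python
import math

def ypix_to_value_log(y_top, ypix_baseline, ypix_top, yval_min, yval_max):
    """Map a pixel row to a data value on a log10-scaled y-axis.

    Interpolates linearly in log10(value); both axis limits must be > 0.
    """
    if yval_min <= 0 or yval_max <= 0:
        raise ValueError("log axes require positive axis limits")
    span_pix = ypix_baseline - ypix_top
    if span_pix == 0:
        raise ValueError("baseline and top must be different pixel rows")
    ratio = (ypix_baseline - y_top) / span_pix
    log_min, log_max = math.log10(yval_min), math.log10(yval_max)
    return 10 ** (log_min + ratio * (log_max - log_min))

# Hypothetical axis running from 1 (pixel row 780) to 1000 (pixel row 120)
print(ypix_to_value_log(450, 780, 120, 1.0, 1000.0))
```

A bar top halfway up the axis lands at the geometric mean of the limits rather than the arithmetic mean. Note that the template's default `YVAL_MIN = 0.0` cannot be used here, because zero has no logarithm; use the value of the lowest labeled tick instead.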
