# Task 1b: Turning a pdf into an image

**We will aim for a resolution of 300 x 300 DPI.**
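
PDF page sizes are expressed in points (1 pt = 1/72 inch), so a target DPI translates into a scale factor of `dpi / 72`. As a rough sketch, the expected pixel size of a rendered page can be estimated as below. The helper name `pixels_at_dpi` is hypothetical, and rounding up is an assumption: individual renderers may round differently.

```python
import math

POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def pixels_at_dpi(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Estimate the rendered size in pixels of a page given in PDF points."""
    # Multiply before dividing to keep the arithmetic exact for common sizes.
    width_px = width_pt * dpi / POINTS_PER_INCH
    height_px = height_pt * dpi / POINTS_PER_INCH
    # Rounding up is an assumption; renderers may round differently.
    return math.ceil(width_px), math.ceil(height_px)


# US Letter is 612 x 792 pt (8.5 x 11 in)
print(pixels_at_dpi(612, 792))  # (2550, 3300)
```

This gives a quick sanity check for the image dimensions each library produces.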



| Library          | Installation          | Key Features                          | Best For                          |
|------------------|-----------------------|---------------------------------------|-----------------------------------|
| **PyMuPDF (fitz)** | `pip install pymupdf` | - Fastest conversion<br>- Low memory<br>- Multi-page support | Batch processing, high-performance apps |
| **pdf2image**    | `pip install pdf2image`<br>(+ `poppler-utils`) | - Pillow-compatible<br>- Simple API<br>- Multi-threaded | Quick conversions, Pillow workflows |
| **Wand**         | `pip install wand`<br>(+ `imagemagick`) | - High-quality rendering<br>- Supports PDF transparency/vectors | Complex PDFs, print-quality output |
| **pypdfium2 (PDFium)** | `pip install pypdfium2` | - Chrome’s PDF engine<br>- Pixel-perfect rendering | Web apps, fidelity-critical cases |


We measure each method with Python's `time` module and `psutil`, and will later run a Scalene report to compare the efficiency of each.
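
The same tracking code is repeated in every conversion cell. One way to avoid that, sketched here under assumptions, is a small context manager. This version deliberately swaps `psutil` for standard-library timers (`time.perf_counter` for wall-clock time, `time.process_time` for CPU time of the current process only), so it does not measure memory and does not see work done in child processes. The name `track` is hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def track(name, stats):
    """Append wall-clock and CPU seconds for the wrapped block to `stats`."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield
    finally:
        # Record the measurement even if the wrapped block raises.
        stats.append({
            "name": name,
            "time": time.perf_counter() - wall_start,
            "cpu_seconds": time.process_time() - cpu_start,
        })


stats = []
with track("demo", stats):
    sum(i * i for i in range(100_000))  # stand-in for a conversion loop
print(stats)
```

Each conversion loop could then sit inside a `with track("pdf2image", image_conversion_stats):` block.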

I will store the resulting images in the folder [task1b_images](./task1b_images/)
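
The conversion cells write into this folder and assume it already exists. A minimal sketch for creating it up front, using a hypothetical helper `ensure_output_dir` (the demo runs in a temporary directory so nothing is left behind):

```python
import tempfile
from pathlib import Path


def ensure_output_dir(path="task1b_images"):
    """Create the output folder (and any missing parents) and return it."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    out = ensure_output_dir(Path(tmp) / "task1b_images")
    created = out.is_dir()
print(created)  # True
```

In the notebook itself, calling `ensure_output_dir()` once before the first conversion cell would be enough.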

In [1]:
%pip install python-dotenv
from dotenv import load_dotenv
load_dotenv()

Note: you may need to restart the kernel to use updated packages.


True

In [2]:
image_conversion_stats = []

## Conversion Method 1: pdf2image


In [3]:
%pip install pdf2image
%pip install opencv-python
%pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [4]:
###                                   TRACKING                                     ###
import time
import psutil
import os

process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()

###                                   TRACKING END                                 ###
from dotenv import load_dotenv
load_dotenv()
from pdf2image import convert_from_path
import cv2
import numpy as np
from PIL import Image

filePath = os.environ.get("filePath")

print(filePath)
images = convert_from_path(filePath, dpi=300)

for i, image in enumerate(images):
    # Convert PIL image to OpenCV format
    cv_img = np.array(image)
    cv_img = cv2.cvtColor(cv_img, cv2.COLOR_RGB2GRAY)

    # Apply GaussianBlur to reduce noise
    denoised_img = cv2.GaussianBlur(cv_img, (5, 5), 0)

    # Convert back to PIL format
    pil_img = Image.fromarray(denoised_img)

    # Save the denoised image
    pil_img.save(os.path.join("task1b_images/", f'pdf2image_denoised_page_{i + 1}.png'), 'PNG')
    print(f'Saved denoised output_page_{i + 1}.png')

###                                   TRACKING                                     ###

# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

image_conversion_stats.append( {
    "name": "pdf2image",
    "time": total_time,
    "num_images": len(images),
    "memory_MB": memory_MB / (1024 * 1024),  # Delta in MB
    "cpu_percent": cpu_used,
}
)

###                                   TRACKING END                                 ###

/home//Documents/clinical-trials-ocr-idp/highlighted_output.pdf
Saved denoised output_page_1.png
Saved denoised output_page_2.png
Saved denoised output_page_3.png
Saved denoised output_page_4.png
Saved denoised output_page_5.png


[Return to the top](#Task-1b-Turning-a-pdf-into-an-image)

## Conversion Method 2: PyMuPDF

In [5]:
%pip install PyMuPDF

Note: you may need to restart the kernel to use updated packages.


In [6]:
###                                   TRACKING                                     ###
import time
import psutil
import os

process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()

###                                   TRACKING END                                 ###
import fitz  # PyMuPDF
import os
from PIL import Image, ImageFilter

filePath = os.environ.get("filePath")
output_folder = "task1b_images/"

doc = fitz.open(filePath)

# Calculate scale matrix for 300 DPI (72 dpi is default)
zoom = 300 / 72  # 4.1667 approx

matrix = fitz.Matrix(zoom, zoom)

for i in range(len(doc)):
    page = doc.load_page(i)
    pix = page.get_pixmap(matrix=matrix)
    output_path = os.path.join(output_folder, f'PyMuPDF_page_{i + 1}.png')
    pix.save(output_path)
    
    # Convert image to grayscale using PIL and apply denoising
    with Image.open(output_path) as im:
        grayscale_image = im.convert('L')  # Convert to grayscale
        denoised_image = grayscale_image.filter(ImageFilter.GaussianBlur(radius=2))  # Apply Gaussian blur for denoising
        denoised_image.save(output_path, dpi=(300, 300))

###                                   TRACKING                                     ###

# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

image_conversion_stats.append( {
    "name": "PyMuPDF",
    "time": total_time,
    "num_images": len(doc),
    "memory_MB": memory_MB / (1024 * 1024),  # Delta in MB
    "cpu_percent": cpu_used,
}
)

###                                   TRACKING END                                 ###

[Return to the top](#Task-1b-Turning-a-pdf-into-an-image)

## Conversion Method 3: pypdfium2 (PDFium)


In [7]:
%pip install pypdfium2

Note: you may need to restart the kernel to use updated packages.


In [8]:
###                                   TRACKING                                     ###
import time
import psutil
import os

process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()

###                                   TRACKING END                                 ###

import pypdfium2 as pdfium
from PIL import Image

# Load a document
pdf = pdfium.PdfDocument(os.environ.get("filePath"))

# Loop over pages and render
for i in range(len(pdf)):
    page = pdf[i]
    # scale is relative to 72 DPI, so 300 / 72 targets 300 DPI (scale=4 would give 288 DPI)
    image = page.render(scale=300 / 72).to_pil()
    # Convert image to black and white
    bw_image = image.convert('1')
    bw_image.save(os.path.join("task1b_images/", f"pdfium_image_bw_{i+1:03d}.png"), format="PNG", dpi=(300, 300))

###                                   TRACKING                                     ###

# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

image_conversion_stats.append( {
    "name": "pdfium2",
    "time": total_time,
    "num_images": len(pdf),
    "memory_MB": memory_MB / (1024 * 1024),  # Delta in MB
    "cpu_percent": cpu_used,
}
)

###                                   TRACKING END                                 ###

[Return to the top](#Task-1b-Turning-a-pdf-into-an-image)

## Conversion Method 4: Wand

In [9]:
%pip install wand

Collecting wand
  Downloading Wand-0.6.13-py2.py3-none-any.whl.metadata (4.0 kB)
Downloading Wand-0.6.13-py2.py3-none-any.whl (143 kB)
Installing collected packages: wand
Successfully installed wand-0.6.13
Note: you may need to restart the kernel to use updated packages.


You also need to install ImageMagick on your system. See the Wand installation guide:
https://docs.wand-py.org/en/latest/guide/install.html

**Debian/ubuntu**:
```bash
sudo apt-get install libmagickwand-dev
```
**macOS**:
```bash
brew install imagemagick
```
**Windows**:

Use the installer from the [ImageMagick download page](https://imagemagick.org/script/download.php#windows).


In [12]:
###                                   TRACKING                                     ###
import time
import psutil
import os

process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()

###                                   TRACKING END                                 ###
from wand.image import Image
from wand.color import Color

pdf_path = os.environ.get("filePath")
output_folder = "task1b_images"
output_prefix = "wand_image"
resolution = 300  # DPI for high quality
# Render every page at the target resolution; no alpha-channel removal or compression tuning is applied
with Image(filename=pdf_path, resolution=resolution) as pdf:
    num_pages = len(pdf.sequence)  # Store the number of pages
    for i, page in enumerate(pdf.sequence):
        with Image(page) as img:
            img.format = 'png'
            output_path = os.path.join(output_folder, f"{output_prefix}-{i + 1}.png")
            img.save(filename=output_path)
            print(f"Saved: {output_path}")

###                                   TRACKING                                     ###


# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

image_conversion_stats.append( {
    "name": "Wand",
    "time": total_time,
    "num_images": num_pages,
    "memory_MB": memory_MB / (1024 * 1024),  # Delta in MB
    "cpu_percent": cpu_used,
}
)

###                                   TRACKING END                                 ###

Saved: task1b_images/wand_image-1.png
Saved: task1b_images/wand_image-2.png
Saved: task1b_images/wand_image-3.png
Saved: task1b_images/wand_image-4.png
Saved: task1b_images/wand_image-5.png


[Return to the top](#Task-1b-Turning-a-pdf-into-an-image)

## PDF to image method comparison

### Native performance check:

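This is a good way to measure elapsed time and the number of images produced, but the memory and CPU figures from this method should not be trusted: the RSS delta covers the whole notebook process, including memory left over from earlier cells, and `cpu_percent` can exceed 100% when several cores are used. Scalene is probably a better tool for those two metrics.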

In [14]:
%pip install tabulate

Collecting tabulate
  Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Installing collected packages: tabulate
Successfully installed tabulate-0.9.0
Note: you may need to restart the kernel to use updated packages.


In [15]:
from tabulate import tabulate

print(tabulate(image_conversion_stats, headers="keys"))


name          time    num_images    memory_MB    cpu_percent
---------  -------  ------------  -----------  -------------
pdf2image  1.25702             5     277.648           109.8
PyMuPDF    2.54073             5     149.57             99.6
pdfium2    0.8929              5      58.0508           65
Wand       6.38445             5     664.609            83.8
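
To compare the methods per page, the recorded times can be divided by the page count. The sketch below only reuses the numbers printed above (single run, one machine); note that the cells are not like-for-like, since each method applies different post-processing (OpenCV blur, PIL blur after a re-read from disk, 1-bit conversion, or none).

```python
# Values copied from the table printed above
recorded = [
    {"name": "pdf2image", "time": 1.25702, "num_images": 5},
    {"name": "PyMuPDF", "time": 2.54073, "num_images": 5},
    {"name": "pdfium2", "time": 0.8929, "num_images": 5},
    {"name": "Wand", "time": 6.38445, "num_images": 5},
]

# Seconds per page, then rank from fastest to slowest
for row in recorded:
    row["sec_per_page"] = row["time"] / row["num_images"]

ranking = sorted(recorded, key=lambda row: row["sec_per_page"])
for row in ranking:
    print(f'{row["name"]:<10} {row["sec_per_page"]:.3f} s/page')
```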


### Quality check: 


In [None]:
import os
from PIL import Image, ImageStat
from tabulate import tabulate

# Known physical page size in inches (A4)
PAGE_WIDTH_INCH = 8.27
PAGE_HEIGHT_INCH = 11.69

def analyze_image(image_path):
    with Image.open(image_path) as img:
        width, height = img.size
        mode = img.mode

        # Calculate DPI from pixels and physical size
        dpi_x = width / PAGE_WIDTH_INCH
        dpi_y = height / PAGE_HEIGHT_INCH
        dpi_str = f"{dpi_x:.2f}x{dpi_y:.2f}"

        # Calculate contrast from grayscale stddev
        grayscale = img.convert('L')
        stat = ImageStat.Stat(grayscale)
        contrast = stat.stddev[0]
        total_pixels = width * height

        return {
            "File": os.path.basename(image_path),
            "Width": width,
            "Height": height,
            "Total Pixels": total_pixels,
            "Calculated DPI": dpi_str,
            "Mode": mode,
            "Contrast": round(contrast, 2)
        }

def batch_check_images(folder):
    results = []
    for filename in sorted(os.listdir(folder)):
        if filename.lower().endswith(".png"):
            path = os.path.join(folder, filename)
            info = analyze_image(path)
            results.append(info)

    print(tabulate(results, headers="keys", tablefmt="grid"))

# Run the check
batch_check_images("task1b_images/")


+-------------------------+---------+----------+----------------+------------------+--------+------------+
| File                    |   Width |   Height |   Total Pixels | Calculated DPI   | Mode   |   Contrast |
+=========================+=========+==========+================+==================+========+============+
| PyMuPDF_page_1.png      |    2481 |     3508 |        8703348 | 300.00x300.09    | 1      |      56.18 |
+-------------------------+---------+----------+----------------+------------------+--------+------------+
| PyMuPDF_page_2.png      |    2481 |     3508 |        8703348 | 300.00x300.09    | 1      |      49.75 |
+-------------------------+---------+----------+----------------+------------------+--------+------------+
| PyMuPDF_page_3.png      |    2481 |     3508 |        8703348 | 300.00x300.09    | 1      |      51.4  |
+-------------------------+---------+----------+----------------+------------------+--------+------------+
| PyMuPDF_page_4.png      |    2481 |     3508 |        8703348 | 300.00x300.09    | 1      |      52.26 |
+-------------------------+---------+----------+----------------+------------------+--------+------------+
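
The "Contrast" column is the standard deviation of the grayscale pixel values. To show what that number means, here is a toy sketch on plain lists instead of images, assuming the population standard deviation is the relevant formula (which is my understanding of what `ImageStat.Stat(...).stddev` computes). The helper name `contrast` is hypothetical.

```python
import statistics


def contrast(gray_levels):
    """Population standard deviation of 8-bit gray levels (0-255)."""
    return statistics.pstdev(gray_levels)


flat_page = [200] * 8        # uniform gray: no contrast at all
checkerboard = [0, 255] * 4  # half black, half white: the 8-bit maximum

print(contrast(flat_page))     # 0.0
print(contrast(checkerboard))  # 127.5
```

A higher value therefore means a wider spread between dark and light pixels, not necessarily a more legible page.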

### Third-party Performance Check: 

In [None]:
%pip install -U scalene

Note: you may need to restart the kernel to use updated packages.


How to get a report:


> **WARNING**: Remember to run this entire notebook to download the needed libraries.

- Press CTRL + SHIFT + P
- Open a terminal and choose Bash
- `cd` to `/RESEARCH/task1b_images/`
- In the terminal, run:
```bash
scalene --cpu --memory pdfToImageReport.py
```
- Scalene will open the report in your browser.



My Report

I used the command-line (CLI) output format and wrote the report to a text file:
[pdfToImageReport](./task1b_images/pdfToImageReport.txt)


To produce it, I did the following:
- Press CTRL + SHIFT + P
- Open a terminal and choose Bash
- `cd` to `/RESEARCH/task1b_images/`
- In the terminal, run:
```bash
scalene --cpu --memory --cli --column-width 6000 --outfile=pdfToImageReport.txt pdfToImageReport.py
```

My Report on HTML

To produce it, I did the following:
- Press CTRL + SHIFT + P
- Open a terminal and choose Bash
- `cd` to `/RESEARCH/task1b_images/`
- In the terminal, run:
```bash
scalene --cpu --memory --outfile=pdfToImageReport.html pdfToImageReport.py
```
- In the same terminal, open the report with the command for your operating system:
```bash
xdg-open pdfToImageReport.html     #Linux
open pdfToImageReport.html         #MacOS
explorer.exe pdfToImageReport.html #Windows
```
- The report will open in your browser. If you get an installation error, run the command it asks for, either in your Bash terminal or in a Python cell of the notebook.
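
As an alternative to the three OS-specific commands, the report can be opened from Python with the standard-library `webbrowser` module. This is only a sketch: the helper names are hypothetical, and the default file name assumes the report was written to `pdfToImageReport.html` in the current directory.

```python
import webbrowser
from pathlib import Path


def report_uri(report_path="pdfToImageReport.html"):
    """Return a file:// URI for the report, relative to the current directory."""
    return Path(report_path).absolute().as_uri()


def open_report(report_path="pdfToImageReport.html"):
    """Open the report in the default browser; False if no browser could be launched."""
    return webbrowser.open(report_uri(report_path))
```

Calling `open_report()` from a notebook cell should then behave the same on Linux, macOS, and Windows.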

> **Warning**: The statistics depend on your machine; run the tests several times to check that the results make sense.
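
One simple way to run several tests is to repeat a conversion and look at the median rather than a single timing. A minimal sketch, assuming each conversion method is wrapped in a function that takes no arguments (`time_repeated` is a hypothetical helper, and the lambda is only a stand-in workload):

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Run `func` several times and summarise the wall-clock durations."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


# Stand-in workload; replace with a function wrapping one conversion method
summary = time_repeated(lambda: sum(range(100_000)), repeats=5)
print(summary)
```

A large gap between `min` and `max` suggests the machine was busy with other work during the run.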

[Return to the top](#Task-1b-Turning-a-pdf-into-an-image)