Commit 83c4743

added pdf image extractor tutorial
1 parent 79a41f5 commit 83c4743

5 files changed, +47 -0 lines changed


Diff for: README.md

+1
@@ -90,6 +90,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [How to Get Domain Name Information in Python](https://www.thepythoncode.com/article/extracting-domain-name-information-in-python). ([code](web-scraping/get-domain-info))
 - [How to Extract YouTube Comments in Python](https://www.thepythoncode.com/article/extract-youtube-comments-in-python). ([code](web-scraping/youtube-comments-extractor))
 - [How to Extract All PDF Links in Python](https://www.thepythoncode.com/article/extract-pdf-links-with-python). ([code](web-scraping/pdf-url-extractor))
+- [How to Extract Images from PDF in Python](https://www.thepythoncode.com/article/extract-pdf-images-in-python). ([code](web-scraping/pdf-image-extractor))

 - ### [Python Standard Library](https://www.thepythoncode.com/topic/python-standard-library)
 - [How to Transfer Files in the Network using Sockets in Python](https://www.thepythoncode.com/article/send-receive-files-using-sockets-python). ([code](general/transfer-files/))

Diff for: web-scraping/pdf-image-extractor/1710.05006.pdf

5.09 MB
Binary file not shown.

Diff for: web-scraping/pdf-image-extractor/README.md

+15
@@ -0,0 +1,15 @@
+# [How to Extract Images from PDF in Python](https://www.thepythoncode.com/article/extract-pdf-images-in-python)
+To run this:
+- `pip3 install -r requirements.txt`
+- To extract and save all images from the `1710.05006.pdf` file, run:
+```
+python pdf_image_extractor.py 1710.05006.pdf
+```
+This saves all available images in the current directory and prints:
+```
+[!] No images found on page 0
+[+] Found a total of 3 images in page 1
+[+] Found a total of 3 images in page 2
+[!] No images found on page 3
+[!] No images found on page 4
+```
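
The README invokes the script with the PDF path as an argument, while the committed script hardcodes that path. A minimal sketch of reading the path from the command line, assuming a hypothetical helper `get_pdf_path` (not part of this commit) that falls back to the sample file when no argument is given:

```python
import sys

DEFAULT_PDF = "1710.05006.pdf"  # sample file shipped with the tutorial


def get_pdf_path(argv):
    """Return the PDF path given on the command line, or the sample file.

    `argv` is expected to look like sys.argv: the script name first,
    then any user-supplied arguments. Hypothetical helper for illustration.
    """
    return argv[1] if len(argv) > 1 else DEFAULT_PDF


if __name__ == "__main__":
    print(get_pdf_path(sys.argv))
```

With something like this in place, `file = get_pdf_path(sys.argv)` could stand in for the hardcoded assignment in the script.
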
Diff for: web-scraping/pdf-image-extractor/pdf_image_extractor.py

+30
@@ -0,0 +1,30 @@
+import fitz  # PyMuPDF
+import io
+from PIL import Image
+
+# path of the PDF file to extract images from
+file = "1710.05006.pdf"
+# open the file
+pdf_file = fitz.open(file)
+# iterate over PDF pages
+for page_index in range(len(pdf_file)):
+    # get the page itself
+    page = pdf_file[page_index]
+    image_list = page.get_images()  # getImageList() in older PyMuPDF releases
+    # print the number of images found on this page
+    if image_list:
+        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
+    else:
+        print("[!] No images found on page", page_index)
+    for image_index, img in enumerate(image_list, start=1):
+        # get the XREF of the image
+        xref = img[0]
+        # extract the image bytes
+        base_image = pdf_file.extract_image(xref)  # extractImage() in older releases
+        image_bytes = base_image["image"]
+        # get the image extension
+        image_ext = base_image["ext"]
+        # load it to PIL
+        image = Image.open(io.BytesIO(image_bytes))
+        # save it to local disk; passing a path lets PIL open and close the file itself
+        image.save(f"image{page_index+1}_{image_index}.{image_ext}")
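
The script above round-trips each image through PIL before saving it. Since the extracted dictionary already carries the encoded bytes and their extension, the bytes could also be written out directly. A minimal sketch under that assumption; `save_image_bytes` and its `out_dir` parameter are hypothetical names, not part of this commit:

```python
from pathlib import Path


def save_image_bytes(image_bytes, image_ext, page_index, image_index, out_dir="."):
    """Write already-encoded image bytes to disk and return the path.

    Uses the same naming scheme as the script: image<page>_<n>.<ext>,
    with the page number 1-based.
    """
    out_path = Path(out_dir) / f"image{page_index + 1}_{image_index}.{image_ext}"
    out_path.write_bytes(image_bytes)
    return out_path
```

Inside the loop this would be called as `save_image_bytes(base_image["image"], base_image["ext"], page_index, image_index)`. Skipping PIL avoids re-encoding the image, at the cost of keeping whatever format the PDF stored.
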

Diff for: web-scraping/pdf-image-extractor/requirements.txt

+1
@@ -0,0 +1 @@
+PyMuPDF
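
The script imports `PIL` in addition to `fitz`, while this requirements file lists only PyMuPDF, so Pillow may need to be installed separately. A small sketch for checking which modules are importable before running the script; `missing_modules` is a hypothetical helper, not part of this commit:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # fitz is provided by PyMuPDF, PIL by Pillow
    missing = missing_modules(["fitz", "PIL"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available")
```
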
