added pdf image extractor tutorial

x4nth055 · x4nth055 · commit 83c4743317d8 · 2020-08-31T16:11:57.000+02:00
diff --git a/README.md b/README.md
@@ -90,6 +90,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
     - [How to Get Domain Name Information in Python](https://www.thepythoncode.com/article/extracting-domain-name-information-in-python). ([code](web-scraping/get-domain-info))
     - [How to Extract YouTube Comments in Python](https://www.thepythoncode.com/article/extract-youtube-comments-in-python). ([code](web-scraping/youtube-comments-extractor))
     - [How to Extract All PDF Links in Python](https://www.thepythoncode.com/article/extract-pdf-links-with-python). ([code](web-scraping/pdf-url-extractor))
+    - [How to Extract Images from PDF in Python](https://www.thepythoncode.com/article/extract-pdf-images-in-python). ([code](web-scraping/pdf-image-extractor))
 
 - ### [Python Standard Library](https://www.thepythoncode.com/topic/python-standard-library)
     - [How to Transfer Files in the Network using Sockets in Python](https://www.thepythoncode.com/article/send-receive-files-using-sockets-python). ([code](general/transfer-files/))
diff --git a/web-scraping/pdf-image-extractor/1710.05006.pdf b/web-scraping/pdf-image-extractor/1710.05006.pdf
diff --git a/web-scraping/pdf-image-extractor/README.md b/web-scraping/pdf-image-extractor/README.md
@@ -0,0 +1,15 @@
+# [How to Extract Images from PDF in Python](https://www.thepythoncode.com/article/extract-pdf-images-in-python)
+To run this:
+- `pip3 install -r requirements.txt`
+- To extract and save all images from the `1710.05006.pdf` file, run:
+    ```
+    python pdf_image_extractor.py 1710.05006.pdf
+    ```
+    This saves all available images in the current directory and prints:
+    ```
+    [!] No images found on page 0
+    [+] Found a total of 3 images in page 1
+    [+] Found a total of 3 images in page 2
+    [!] No images found on page 3
+    [!] No images found on page 4
+    ```
diff --git a/web-scraping/pdf-image-extractor/pdf_image_extractor.py b/web-scraping/pdf-image-extractor/pdf_image_extractor.py
@@ -0,0 +1,31 @@
+import fitz # PyMuPDF
+import io
+import sys
+from PIL import Image
+
+# file path you want to extract images from (first CLI argument, else the sample PDF)
+file = sys.argv[1] if len(sys.argv) > 1 else "1710.05006.pdf"
+# open the file
+pdf_file = fitz.open(file)
+# iterate over PDF pages
+for page_index in range(len(pdf_file)):
+    # get the page itself
+    page = pdf_file[page_index]
+    image_list = page.get_images()
+    # printing number of images found in this page
+    if image_list:
+        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
+    else:
+        print("[!] No images found on page", page_index)
+    for image_index, img in enumerate(image_list, start=1):
+        # get the XREF of the image
+        xref = img[0]
+        # extract the image bytes
+        base_image = pdf_file.extract_image(xref)
+        image_bytes = base_image["image"]
+        # get the image extension
+        image_ext = base_image["ext"]
+        # load it to PIL
+        image = Image.open(io.BytesIO(image_bytes))
+        # save it to local disk
+        image.save(f"image{page_index+1}_{image_index}.{image_ext}")
diff --git a/web-scraping/pdf-image-extractor/requirements.txt b/web-scraping/pdf-image-extractor/requirements.txt
@@ -0,0 +1,2 @@
+PyMuPDF
+Pillow
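
Note on the script above: the log lines use the 0-based page index, while the saved file names use a 1-based page number. The sketch below is not part of this commit. It restates those two conventions as small helpers and wraps the same loop in a function. The names `page_report`, `image_filename`, `extract_images` and the `out_dir` parameter are hypothetical, and it assumes a PyMuPDF release that provides the snake_case `get_images` and `extract_image` methods. It also writes the embedded bytes directly rather than going through PIL, which is a different choice from the script.

```python
from pathlib import Path


def page_report(page_index, image_count):
    """Return the status line the script prints for one page (0-based index)."""
    if image_count:
        return f"[+] Found a total of {image_count} images in page {page_index}"
    return f"[!] No images found on page {page_index}"


def image_filename(page_index, image_index, ext):
    """Build the output name; the page number in the file name is 1-based."""
    return f"image{page_index + 1}_{image_index}.{ext}"


def extract_images(pdf_path, out_dir="."):
    """Hypothetical helper: save every embedded image, return the paths written."""
    import fitz  # PyMuPDF, imported lazily so the helpers above work without it

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as pdf_file:
        for page_index, page in enumerate(pdf_file):
            image_list = page.get_images()
            print(page_report(page_index, len(image_list)))
            for image_index, img in enumerate(image_list, start=1):
                # img[0] is the XREF of the image
                base_image = pdf_file.extract_image(img[0])
                name = image_filename(page_index, image_index, base_image["ext"])
                target = out_dir / name
                # write the embedded bytes as-is instead of re-encoding via PIL
                target.write_bytes(base_image["image"])
                written.append(target)
    return written
```

Assuming the sample file is present, `extract_images("1710.05006.pdf", "images")` would write the files into an `images` directory instead of the current one.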