# Create a labeled screenshot collection
With Python, Selenium, Pillow

https://github.com/wendlingd; last updated 2021-09-04

This script reads URLs from an Excel-based inventory and writes screenshots that show each page's URL at the top of the image. Example use case: Create a gallery / "photo album" of home page screenshots for a web product portfolio.

**Requirements**
- https://selenium-python.readthedocs.io
- https://selenium.dev/selenium/docs/api/py/
- webdriver; choices include:
  - [Chromedriver](https://chromedriver.chromium.org/downloads) (used here)
  - [Geckodriver](https://github.com/mozilla/geckodriver) (not used here)
- Pillow:
  - Used to write text, such as the URL or ID, on top of the image
  - [Pillow doc](https://pillow.readthedocs.io/en/stable/index.html)
  - [Cropping examples](https://stackoverflow.com/questions/9983263/how-to-crop-an-image-using-pil)
  - [Font doc](https://pillow.readthedocs.io/en/stable/reference/ImageFont.html) (used here: Verdana, a macOS system font; on Windows consider arial.ttf)


**Notes**

- Adjust screenshot appearance, cropping, and font to your resolution and operating system.
- Will not capture pages that are behind authentication.
- Check the output in case requests time out; you can manually alter the code to collect the missed pages as one-offs.
- Serving suggestions: Import into photo manager app; convert slide show into a video; link images from product portfolio database.
  

## Get started

In [31]:
import pandas as pd
import os
import sys

from selenium import webdriver # The web scraper
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import WebDriverException

from PIL import Image, ImageDraw, ImageFont # To write filename onto images

from time import sleep

```webdriver.Chrome``` will not run if your Chrome and Chromedriver versions differ. If the test below fails, download and install a matching version of Chromedriver. It might be easiest to (1) use a path when invoking the driver and (2) record your specific updating procedure here, for example:
- Go to https://chromedriver.chromium.org/downloads
- Download the appropriate version
- Uncompress
- Copy to the path you have chosen

In [32]:
driver = webdriver.Chrome()
cdver = driver.capabilities['chrome']['chromedriverVersion'].split(' ')[0]
driver.quit()

print(f'Current chromedriverVersion is {cdver}; okay to proceed.')


Current chromedriverVersion is 93.0.4577.15; okay to proceed.
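
The check above prints only the Chromedriver version. As a minimal sketch, the comparison with the browser version could be automated along these lines. The helper name `major_versions_match` is hypothetical, the version strings are illustrative, and the sketch assumes that agreement on the major version (the part before the first dot) is what matters. On a live session, the browser version is typically available as `driver.capabilities['browserVersion']`.

```python
def major_versions_match(browser_version: str, driver_version: str) -> bool:
    """Return True when both version strings share the same major number."""
    browser_major = browser_version.split('.')[0]
    driver_major = driver_version.split('.')[0]
    return browser_major == driver_major


# Illustrative version strings, not read from a live browser
print(major_versions_match('93.0.4577.63', '93.0.4577.15'))
print(major_versions_match('94.0.4606.54', '93.0.4577.15'))
```

If the function returns False, the update procedure listed above applies.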


## Option 1: Process multi-column file

In [33]:
sourceInDf = pd.read_excel('productList.xlsx', engine='openpyxl')
sourceInDf.head()

| | product_id | product | home_page |
|---|---|---|---|
| 0 | 1001 | Official Guide to U.S. Government Information ... | https://www.usa.gov/ |
| 1 | 1002 | U.S. Government Web Traffic | https://analytics.usa.gov/ |
| 2 | 1003 | USAJOBS - The Federal Government’s official em... | https://www.usajobs.gov/ |


In [34]:
# If you want to limit rows for testing, re-running errors, etc.
# sourceInDf = sourceInDf.iloc[0:3]
# sourceInDf = sourceInDf[sourceInDf['product'].str.contains("Gov") == True]
# sourceInDf = sourceInDf[sourceInDf.product_id == 1002]
# sourceInDf

In [35]:
def get_UrlBox(size, fontname, fontsize, bg, fg, currImgText, position):
   """
   Example call: urlBox = get_UrlBox((w,90), 'Verdana', 30, '#FFC', 'black', currImgText, (4,4))

   Uses Pillow to return light-yellow-background box that will be at image top;
   'draws' the URL text on it. More options:
   https://stackoverflow.com/questions/61742298/using-pythons-pillow-library-how-to-draw-text-without-creating-draw-object-of
   https://newbedev.com/pil-drawing-a-semi-transparent-square-overlay-on-image
   """
   urlBox = Image.new('RGBA', size, bg)

   # Get a drawing context
   draw = ImageDraw.Draw(urlBox)
   font = ImageFont.truetype(fontname, fontsize)
   draw.text(position, currImgText, fg, font=font)

   return urlBox
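
Because the font name depends on the operating system, a fallback can keep the script from failing when a font is missing. The following is a minimal sketch, not part of the original script: the helper name `load_label_font` and the sample label text are hypothetical, and it assumes that a missing font surfaces as an `OSError` from `ImageFont.truetype`.

```python
from PIL import Image, ImageDraw, ImageFont


def load_label_font(candidates, fontsize):
    """Return the first TrueType font that loads, else Pillow's built-in font."""
    for name in candidates:
        try:
            return ImageFont.truetype(name, fontsize)
        except (OSError, ImportError):
            # Font file not found, or FreeType support unavailable; try the next one
            continue
    return ImageFont.load_default()


font = load_label_font(['Verdana', 'arial.ttf'], 30)

# Draw a two-line label on a light-yellow box, similar to get_UrlBox
box = Image.new('RGBA', (800, 90), '#FFC')
ImageDraw.Draw(box).text((4, 4), 'Example product\nhttps://example.org/', 'black', font=font)
print(box.size, box.mode)
```

Note that Pillow's built-in font may be much smaller than a 30-point TrueType font, so the label can look different when the fallback is used.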


In [36]:
def get_vertConcat(urlBox, img_mod):
    """
    Uses Pillow to unite and return the URL textbox joined to the screenshot as one image; 
    vertical concat of URL textbox + cropped image. More options: 
    https://note.nkmk.me/en/python-pillow-concat-images/
    """
    combinedImg = Image.new('RGB', (urlBox.width, urlBox.height + img_mod.height))
    combinedImg.paste(urlBox, (0, 0))
    combinedImg.paste(img_mod, (0, urlBox.height))
    
    return combinedImg
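
To see what the vertical concatenation produces without opening a browser, here is a small sketch that uses solid-color stand-ins for the URL box and the screenshot. The function name `stack_vertically` and the image sizes are illustrative only; the pasting logic mirrors the function above.

```python
from PIL import Image


def stack_vertically(top, bottom):
    """Paste `top` above `bottom` on a new canvas as wide as `top`."""
    combined = Image.new('RGB', (top.width, top.height + bottom.height))
    combined.paste(top, (0, 0))
    combined.paste(bottom, (0, top.height))
    return combined


label = Image.new('RGBA', (200, 90), '#FFC')       # stand-in for the URL box
screenshot = Image.new('RGB', (200, 300), 'blue')  # stand-in for the cropped page
combined = stack_vertically(label, screenshot)
print(combined.size)
```

The canvas takes its width from the top image, so both inputs should have the same width, as they do in this script.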


In [37]:
def save_screenshots(sourceList):
    """
    Walks the df containing URLs and writes one screenshot file for each row, putting the 
    product name and URL at the top in a separate box. Files are written to reports/ and 
    named product-<product_id>.png, for example to hyperlink from a product portfolio. 
    currImgText: You could include other info in the spreadsheet and pass it through, 
    such as traffic, device stats, line of business, etc. 
    More on cropping: https://stackoverflow.com/questions/9983263/how-to-crop-an-image-using-pil
    """
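    # Build the options first and pass them in; options set after the
    # driver has been created have no effect.
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--ignore-certificate-errors")
    # Headless Chrome may use a small default window; consider adding a
    # --window-size argument if the screenshots come out too small.
    driver = webdriver.Chrome(options=options) # Might require path
    driver.set_page_load_timeout(7)
    os.makedirs('reports', exist_ok=True) # Make sure the output folder exists

    for key, value in sourceList.iterrows():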
        currId = value['product_id']
        currProduct = value['product']
        currUrl = value['home_page']
        currImgText = currProduct + '\n' + currUrl
        
        try :
            driver.get(currUrl)
            # Write to disk
            pathAndName = 'reports/product-' + str(currId) + ".png"
            driver.save_screenshot(pathAndName)
            # Open the saved screenshot with Pillow
            image = Image.open(pathAndName)
            # Reduce screenshot size
            w, h = image.size
            # Cap the height at 1500 px; cropping past the image edge would pad with black
            (left, upper, right, lower) = (0, 0, w, min(h, 1500))
            img_mod = image.crop((left, upper, right, lower))
            # Generate URL-text box. Font: try Verdana for MacOS and arial.ttf for Windows
            urlBox = get_UrlBox((w,90), 'Verdana', 30, '#FFC', 'black', currImgText, (4,4))
            # Concat images
            get_vertConcat(urlBox, img_mod).save(pathAndName)

            print(f"{currUrl}")
            sleep(0.5)
        except (TimeoutException, WebDriverException): #  as e
            print(f"## Error, skipping: {currUrl}  ID {currId}")

    driver.quit()   

save_screenshots(sourceInDf)


https://www.usa.gov/
https://analytics.usa.gov/
https://www.usajobs.gov/
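
The Notes suggest re-collecting pages that timed out. One possible approach, sketched here under stated assumptions, is to collect the failed rows in a list instead of only printing them. The names `capture_all`, `capture`, and `fake_capture` are hypothetical: `capture` stands in for the per-URL work (load, screenshot, label), and `fake_capture` simulates a timeout so the sketch runs without a browser.

```python
def capture_all(rows, capture):
    """Call capture(product_id, url) for each row; return the rows that failed."""
    failed = []
    for product_id, url in rows:
        try:
            capture(product_id, url)
        except Exception as err:  # In the real script: TimeoutException, WebDriverException
            print(f"## Error, skipping: {url}  ID {product_id} ({err})")
            failed.append((product_id, url))
    return failed


def fake_capture(product_id, url):
    """Stand-in for the browser work; pretends one site times out."""
    if product_id == 1002:
        raise TimeoutError('page load timed out')


rows = [(1001, 'https://www.usa.gov/'),
        (1002, 'https://analytics.usa.gov/'),
        (1003, 'https://www.usajobs.gov/')]
failed = capture_all(rows, fake_capture)
```

The returned list could then be passed back in for a second attempt, for example with a longer page-load timeout.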
