## Google Image Web Scraper with Selenium in Julia/Python

The purpose of this notebook is to collect images from the web to use for training and testing a neural network for image classification.

[Selenium](https://selenium-python.readthedocs.io/) is a convenient Python package for controlling a web browser from code, which makes it well suited to automating web scraping. I was unable to find an equally convenient pure-Julia package that replaces the functionality of ```selenium```, but ```PyCall``` makes calling Python packages from Julia very easy!

There are many good tutorials on scraping Google Images with ```selenium```; I used this [tutorial](https://medium.com/@wwwanandsuresh/web-scraping-images-from-google-9084545808a2) by Anand Suresh, posted on Medium, as my guide. Using his code as the base, I modified it to account for Julia language differences and for some personal preferences and conveniences.

In [1]:
# UNCOMMENT AND RUN TO INSTALL PACKAGES
# using Pkg
# Pkg.add("Images")
# Pkg.add("PyCall")
# Pkg.add("Conda")
# using Conda
# Conda.add("selenium")

In [2]:
using Images
using PyCall

Use ```PyCall``` to import the ```selenium webdriver``` and define the path where the Chrome (or other browser) web driver executable is saved.

In [3]:
# pyimport(...) is the function form of the older @pyimport macro
webdriver = pyimport("selenium.webdriver")

DRIVER_PATH = "C:/WebDriver/bin/chromedriver"    # replace with path to your Chrome Web Driver

"C:/WebDriver/bin/chromedriver"

Now we'll write the functions to control the web driver and execute search queries.

In [4]:
function google_image_urls(query::String, max_links::Int64, webdrv::PyObject, sleep_time::Int64 )

    # Build query
    search_url = string("https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q=",query,"&oq=",query,"&gs_l=img")
    # Load url
    webdrv.get(search_url)

    image_urls = Set{String}()
    # Plain local counters; `global` is not needed inside a function
    image_count = 0
    results_start = 1

    while image_count < max_links

        # scroll to end so that more thumbnails are loaded
        webdrv.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(sleep_time)

        # NOTE: the CSS class names used here (Q4LuWd, n3VNCb, mye4qd) come from
        # Google's page markup and may change over time
        thumbnails = webdrv.find_elements_by_css_selector("img.Q4LuWd")
        number_results = length(thumbnails)
        println("Found: $number_results search results. Extracting links from results $results_start:$number_results")

        for img in thumbnails[results_start:number_results]
            # Click the thumbnail to open the full-size preview; skip it if the click fails
            try
                img.click()
                sleep(1)
            catch e
                continue
            end

            actual_images = webdrv.find_elements_by_css_selector("img.n3VNCb")
            for act_img in actual_images
                # `src` can be missing (Python None), so check the type before searching it
                src = act_img.get_attribute("src")
                if src isa AbstractString && occursin("http", src)
                    push!(image_urls, src)
                end
            end

            image_count = length(image_urls)
            if image_count >= max_links
                println("Found: $image_count image links. Complete.")
                break
            end

        end

        if image_count < max_links
            println("Found: $image_count image links, continuing search...")
        end

        sleep(3)

        # Click the "load more" button only if it is present on the page.
        # find_elements returns an empty list instead of raising when nothing matches.
        load_more_buttons = webdrv.find_elements_by_css_selector(".mye4qd")
        if !isempty(load_more_buttons)
            webdrv.execute_script("document.querySelector('.mye4qd').click();")
        end

        # reset start point for results (the last thumbnail is revisited; the Set drops duplicate links)
        results_start = length(thumbnails)

    end

    return image_urls
end

google_image_urls (generic function with 1 method)
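The search URL above is built by pasting the raw query into the string, so spaces and characters such as `&` end up in the URL unescaped. Browsers tend to tolerate spaces, but a query containing `&` or `=` would be split into separate parameters. Below is a minimal sketch of percent-encoding the query first. The helpers `encode_query` and `build_search_url` are hypothetical names introduced for illustration and are not part of the original notebook.

```julia
# Hypothetical helper (not in the original notebook): percent-encode a query
# string so it can be embedded safely in a URL.
function encode_query(query::AbstractString)
    io = IOBuffer()
    for byte in codeunits(query)
        c = Char(byte)
        if byte < 0x80 && (isletter(c) || isdigit(c) || c in ('-', '_', '.', '~'))
            print(io, c)    # unreserved ASCII characters pass through unchanged
        else
            # every other byte (spaces, '&', UTF-8 bytes, ...) becomes %XX
            print(io, '%', uppercase(string(byte, base=16, pad=2)))
        end
    end
    return String(take!(io))
end

# Hypothetical helper: same URL layout as in google_image_urls, with the query encoded.
function build_search_url(query::AbstractString)
    q = encode_query(query)
    return string("https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q=", q, "&oq=", q, "&gs_l=img")
end

println(build_search_url("Yoga Crane Crow Pose"))
```

If this approach is adopted, `google_image_urls` could call `build_search_url(query)` in place of its inline `string(...)` call.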

The function below saves the images into a subdirectory of ```web_images```, which is created relative to the current working directory.

In [5]:
function download_image_urls(dirname, url_set; filename="Image-")
    root_path = "web_images" # Folder where images will be saved
    dl_path = joinpath(root_path,dirname)
    if !isdir(dl_path)
        println("Creating new directory to store images at $dl_path")
        mkpath(dl_path) # unlike mkdir, this also creates root_path if it doesn't exist
    end

    in_count = length(url_set)
    image_counter = 1
    println("Downloading images...")
    for url ∈ url_set
        try
            # download to a temporary file, decode it, then re-save it under a numbered name
            img = load(download(url))
            fl_name = string(filename, image_counter,".jpg")
            fullname = joinpath(dl_path,fl_name)
            save(fullname, img)
            image_counter +=1
        catch e
            # skip links that fail to download or decode
        end
    end

    # count every file now in the directory (may include files from earlier runs)
    out_count = length(readdir(dl_path))

    println("Saved $out_count images out of $in_count links to $dl_path")
end

download_image_urls (generic function with 1 method)
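The download loop silently skips any link that fails, including transient server errors such as the 504 that appears in the run further down. One option is to retry a failed call a few times before giving up. The sketch below uses Julia's built-in `retry` and `ExponentialBackOff`; the wrapper name `with_retries` and its default values are assumptions made for illustration, not part of the original notebook.

```julia
# Hypothetical helper (not in the original notebook): call `f` up to `attempts`
# times, sleeping with exponential backoff between failed tries. The error from
# the final attempt is rethrown if every try fails.
function with_retries(f; attempts::Int=3, first_delay::Real=1.0)
    delays = ExponentialBackOff(n=attempts - 1, first_delay=first_delay)
    return retry(f; delays=delays)()
end

# Demonstration with a stand-in for a flaky download: fails twice, then succeeds.
tries = Ref(0)
outcome = with_retries(attempts=3, first_delay=0.01) do
    tries[] += 1
    tries[] < 3 && error("simulated transient failure")
    "downloaded"
end
println("$outcome after $(tries[]) tries")
```

Inside `download_image_urls`, the same wrapper could surround the download step, for example `with_retries(() -> download(url))`, while leaving the existing `try`/`catch` in place to skip links that still fail.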

### Running & Saving a Single Query

The process I used for collecting the yoga pose images was to run a separate query for each pose name, then manually review the harvest of each scrape and clean it up as much as I could: deleting irrelevant images or incorrect poses, removing or altering duplicates, and deleting poor-quality images.

Below is the code for a single query scrape; the next section shows how we can provide multiple queries and loop the function to scrape many poses in one go.

In [6]:
# Initiate WebDriver
wd = webdriver.Chrome(executable_path=DRIVER_PATH)

# Get Urls from Image Query
goog_links = google_image_urls("Yoga Crane Crow Pose", 250, wd, 2)

# Quit WebDriver session
wd.quit()

# Download and save images to given directory
download_image_urls("cranecrow", goog_links; filename="cranecrow_" )

Found: 100 search results. Extracting links from results 1:100
Found: 175 image links, continuing search...
Found: 312 search results. Extracting links from results 100:312
Found: 251 image links. Complete.
Creating new directory to store images at web_images\cranecrow
Downloading images...


┌ Error: Download failed: curl: (22) The requested URL returned error: 504 Gateway Time-out
└ @ Base download.jl:43


Saved 276 images out of 251 links to web_images\cranecrow


Note that more images were saved than links were collected. Sometimes a single image link yields multiple images, for example several frames from a video/GIF.
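Part of the manual cleanup described at the start of this section is removing duplicates. Exact byte-for-byte duplicates can be flagged automatically before the manual review. The sketch below hashes every file in a directory with SHA-256 (from Julia's `SHA` standard library) and reports files whose contents match an earlier file. The function name `find_duplicate_files` is hypothetical, and this approach does not catch resized or re-encoded copies of the same picture.

```julia
using SHA

# Hypothetical helper (not in the original notebook): return pairs
# `duplicate_path => first_seen_path` for files with identical contents.
function find_duplicate_files(dir::AbstractString)
    seen = Dict{String,String}()          # content hash => first file with that content
    duplicates = Pair{String,String}[]
    for name in sort(readdir(dir))
        path = joinpath(dir, name)
        isfile(path) || continue          # ignore subdirectories
        digest = bytes2hex(open(sha256, path))
        if haskey(seen, digest)
            push!(duplicates, path => seen[digest])
        else
            seen[digest] = path
        end
    end
    return duplicates
end

# Example usage on one of the scraped folders (nothing is deleted):
# for (dup, original) in find_duplicate_files(joinpath("web_images", "cranecrow"))
#     println("$dup duplicates $original")
# end
```

The function only reports matches, so the decision to delete a file stays part of the manual review.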

### Looping Multiple Queries

In [7]:
pose_queries = [ 
                "yoga bridge pose",
                "yoga childs pose",
                "yoga downward dog pose",
                "yoga mountain pose",
                "yoga plank pose",
                "yoga seated forward fold",
                "yoga triangle pose",
                "yoga warrior one pose",
                "yoga warrior two pose",
                "yoga tree pose"
                ]

10-element Array{String,1}:
 "yoga bridge pose"
 "yoga childs pose"
 "yoga downward dog pose"
 "yoga mountain pose"
 "yoga plank pose"
 "yoga seated forward fold"
 "yoga triangle pose"
 "yoga warrior one pose"
 "yoga warrior two pose"
 "yoga tree pose"

In [None]:
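for q in pose_queries
    # start a fresh browser session for each query
    wd = webdriver.Chrome(executable_path=DRIVER_PATH)
    try
        image_url_links = google_image_urls(q, 500, wd, 2)
        # the query string itself is used as the directory name
        download_image_urls(q, image_url_links)
    finally
        wd.quit() # always close the browser, even if a query fails
    end
end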
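In the loop above, each query string, spaces included, is used directly as the directory name, whereas the single-query example used a short hand-picked name (`cranecrow`). A minimal sketch of deriving a filesystem-friendly name from a query follows. The helper `query_to_dirname` is a hypothetical addition, and the naming rule (lowercase, underscores) is just one possible convention.

```julia
# Hypothetical helper (not in the original notebook): turn a search query into
# a lowercase directory name containing only letters, digits, and underscores.
function query_to_dirname(query::AbstractString)
    slug = lowercase(strip(query))
    slug = replace(slug, r"[^a-z0-9]+" => "_")   # collapse runs of other characters
    return String(strip(slug, '_'))              # drop leading/trailing underscores
end

println(query_to_dirname("yoga warrior one pose"))
```

With such a helper, the loop body could call `download_image_urls(query_to_dirname(q), image_url_links; filename=string(query_to_dirname(q), "_"))` so that directory and file names follow the same pattern as the single-query example.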