## Google Image Web Scraper with Selenium in Julia/Python

The purpose of this notebook is to collect images from the web to use for training and testing a neural network for image classification.

[Selenium](https://selenium-python.readthedocs.io/) is a convenient Python package for controlling a web browser using code, automating web scraping processes. There is not an equivalently convenient package that replaces the fuctionality of ```selinium``` in pure Julia, but ```PyCall``` makes calling Python packages from Julia very easy!

There are many good tutorials available on scraping Google Images with ```selinium``` and I chose to use this [tutorial](https://medium.com/@wwwanandsuresh/web-scraping-images-from-google-9084545808a2) from Anand Suresh posted on Medium as my guide.  Using his code as the skeleton I only modified out of need for Julia language differences and some personal preference and conveniences.

In [1]:
using Images

The code below can be used to install ```selenium``` in the default Conda python environment (see docs for details).

In [2]:
#using Conda
#Conda.add("selenium")

Use ```PyCall``` to import the ```selenium webdriver``` and define the path where the Chrome (or other browser) web driver executable is saved.

In [3]:
using PyCall

@pyimport selenium.webdriver as webdriver

DRIVER_PATH = "C:/WebDriver/bin/chromedriver"    # replace with path to your Chrome Web Driver

"C:/Users/Ryn/anaconda3/ChromeWebDrivers/bin/chromedriver"

Now we we'll write the functions to control the web driver and execute search queries.

In [7]:
function google_image_urls(query::String, max_links::Int64, webdrv::PyObject, sleep_time::Int64 )
    
    # Build query
    search_url = string("https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q=",query,"&oq=",query,"&gs_l=img")
    # Load url
    webdrv.get(search_url)
    
    image_urls = Set()
    global image_count = 0
    global results_start = 1
    
    while image_count < max_links
        
        # scroll to end
        webdrv.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(sleep_time)
        
        thumbnails = webdrv.find_elements_by_css_selector("img.Q4LuWd")
        number_results = length(thumbnails)
        println("Found: $number_results search results. Extracting links from $results_start:$number_results")
        
        for img in thumbnails[results_start:number_results]
            try
                img.click()
                sleep(1)
            catch e
                continue
            end
            
            actual_images = webdrv.find_elements_by_css_selector("img.n3VNCb")
            for act_img in actual_images
                if occursin("http", act_img.get_attribute("src"))
                    push!(image_urls, act_img.get_attribute("src"))
                end
            end
            
            global image_count = length(image_urls)
            
            if image_count >= max_links
                println("Found: $image_count image links. Complete.")
                break
            end
            println("Found: $image_count image links, continuing search...")
        end
            
        sleep(5)
        
        load_more_button = webdrv.find_element_by_css_selector(".mye4qd")
        if typeof(load_more_button) == "PyObject"
            webdrv.execute_script("document.querySelector('.mye4qd').click();")
        end   

        # reset start point for results
        global results_start = length(thumbnails)
        
    end
    
    return image_urls
end

google_image_urls (generic function with 1 method)

The function below will save the images into a local directory named ```/web_images```.

In [10]:
function download_image_urls(dirname, url_set; filename="Image-")
    root_path = "web_images"
    dl_path = joinpath(root_path,dirname)
    if !isdir(dl_path)
        println("Creating new directory to store images at $dl_path")
        mkdir(dl_path)
    end
    
    in_count = length(url_set)
    image_counter = 1
    println("Downloading images...")
    for i ∈ url_set
        try
            img = load(download(i))
            fl_name = string(filename, image_counter,".jpg")
            fullname = joinpath(dl_path,fl_name)
            save(fullname, img)
            image_counter +=1
        catch e
            continue
        end
    end
    
    out_count = length(readdir(dl_path))
    
    println("Saved $out_count images out of $in_count links to $dl_path")
end

download_image_urls (generic function with 1 method)

### Running & Saving a Single Query

The processed that I used for collecting the yoga pose images was running separate queries for each pose name, then manually reviewing the harvest of each scrape and cleaning up the images as much as I could, deleting irrelevant images or incorrect poses, removing or altering duplicates, and deleting poor quality images.

Below is the code for a single query scrape, and the next section shows how we can provide multiple queries and loop the funtion to scrape many poses in one go.

In [None]:
# Initiate WebDriver
wd = webdriver.Chrome(executable_path=DRIVER_PATH)

# Get Urls from Image Query
goog_links = google_image_urls("sacramento kings", 250, wd, 2)

# Quit WebDriver session
wd.quit()

# Download and save images to given directory
download_image_urls("sacramento kings", goog_links; filename="SacKings_" )



### Looping Multiple Queries

In [7]:
pose_queries = [ 
                "yoga bridge pose",
                "yoga childs pose",
                "yoga downward dog pose",
                "yoga mountain pose",
                "yoga plank pose",
                "yoga seated forward fold",
                "yoga triangle pose",
                "yoga warrior one pose",
                "yoga warrior one pose",
                "ypga tree pose"
                ]

10-element Array{String,1}:
 "yoga bridge pose"
 "yoga childs pose"
 "yoga downward dog pose"
 "yoga mountain pose"
 "yoga plank pose"
 "yoga seated forward fold"
 "yoga triangle pose"
 "yoga warrior one pose"
 "yoga warrior one pose"
 "ypga tree pose"

In [8]:
for q in pose_queries
    wd = webdriver.Chrome(executable_path=DRIVER_PATH)
    image_url_links = google_image_urls(q, 500, wd, 2)
    download_image_urls(q, image_url_links)
    wd.quit()
end

Found: 100 search results. Extracting links from 1:100
Found: 1 image links, continuing search...
Found: 1 image links, continuing search...
Found: 1 image links, continuing search...
Found: 2 image links, continuing search...
Found: 3 image links, continuing search...
Found: 4 image links, continuing search...
Found: 5 image links, continuing search...
Found: 6 image links, continuing search...
Found: 7 image links, continuing search...
Found: 8 image links, continuing search...
Found: 9 image links, continuing search...
Found: 10 image links, continuing search...
Found: 11 image links, continuing search...
Found: 12 image links, continuing search...
Found: 13 image links, continuing search...
Found: 14 image links, continuing search...
Found: 15 image links, continuing search...
Found: 16 image links, continuing search...
Found: 17 image links, continuing search...
Found: 19 image links, continuing search...
Found: 21 image links, continuing search...
Found: 23 image links, continuin

┌ Error: Download failed: curl: (22) The requested URL returned error: 406 Not Acceptable
└ @ Base download.jl:43


Saved 502 images out of 500 links to web_images\yoga bridge pose
Found: 100 search results. Extracting links from 1:100
Found: 0 image links, continuing search...
Found: 1 image links, continuing search...
Found: 2 image links, continuing search...
Found: 3 image links, continuing search...
Found: 4 image links, continuing search...
Found: 5 image links, continuing search...
Found: 6 image links, continuing search...
Found: 7 image links, continuing search...
Found: 8 image links, continuing search...
Found: 9 image links, continuing search...
Found: 10 image links, continuing search...
Found: 11 image links, continuing search...
Found: 12 image links, continuing search...
Found: 13 image links, continuing search...
Found: 14 image links, continuing search...
Found: 15 image links, continuing search...
Found: 16 image links, continuing search...
Found: 17 image links, continuing search...
Found: 18 image links, continuing search...
Found: 20 image links, continuing search...
Found: 22 

┌ Error: Download failed: curl: (22) The requested URL returned error: 404 Not Found
└ @ Base download.jl:43
┌ Error: Download failed: curl: (22) The requested URL returned error: 403 Forbidden
└ @ Base download.jl:43
└ @ PNGFiles C:\Users\Ryn\.julia\packages\PNGFiles\CQUsD\src\wraphelpers.jl:2


Saved 583 images out of 501 links to web_images\yoga childs pose
Found: 100 search results. Extracting links from 1:100
Found: 0 image links, continuing search...
Found: 1 image links, continuing search...
Found: 2 image links, continuing search...
Found: 3 image links, continuing search...
Found: 4 image links, continuing search...
Found: 5 image links, continuing search...
Found: 6 image links, continuing search...
Found: 7 image links, continuing search...
Found: 8 image links, continuing search...
Found: 9 image links, continuing search...
Found: 10 image links, continuing search...
Found: 11 image links, continuing search...
Found: 12 image links, continuing search...
Found: 13 image links, continuing search...
Found: 14 image links, continuing search...
Found: 15 image links, continuing search...
Found: 16 image links, continuing search...
Found: 17 image links, continuing search...
Found: 19 image links, continuing search...
Found: 21 image links, continuing search...
Found: 23 

└ @ PNGFiles C:\Users\Ryn\.julia\packages\PNGFiles\CQUsD\src\wraphelpers.jl:2


Saved 973 images out of 501 links to web_images\yoga downward dog pose
Found: 100 search results. Extracting links from 1:100
Found: 1 image links, continuing search...
Found: 1 image links, continuing search...
Found: 1 image links, continuing search...
Found: 2 image links, continuing search...
Found: 3 image links, continuing search...
Found: 4 image links, continuing search...
Found: 5 image links, continuing search...
Found: 6 image links, continuing search...
Found: 7 image links, continuing search...
Found: 8 image links, continuing search...
Found: 9 image links, continuing search...
Found: 10 image links, continuing search...
Found: 11 image links, continuing search...
Found: 11 image links, continuing search...
Found: 12 image links, continuing search...
Found: 13 image links, continuing search...
Found: 14 image links, continuing search...
Found: 15 image links, continuing search...
Found: 16 image links, continuing search...
Found: 18 image links, continuing search...
Found

┌ Error: Download failed: curl: (22) The requested URL returned error: 403 Forbidden
└ @ Base download.jl:43
└ @ PNGFiles C:\Users\Ryn\.julia\packages\PNGFiles\CQUsD\src\wraphelpers.jl:2


Saved 519 images out of 501 links to web_images\yoga mountain pose
Found: 100 search results. Extracting links from 1:100
Found: 1 image links, continuing search...
Found: 1 image links, continuing search...
Found: 2 image links, continuing search...
Found: 3 image links, continuing search...
Found: 4 image links, continuing search...
Found: 5 image links, continuing search...
Found: 6 image links, continuing search...
Found: 7 image links, continuing search...
Found: 8 image links, continuing search...
Found: 9 image links, continuing search...
Found: 10 image links, continuing search...
Found: 11 image links, continuing search...
Found: 12 image links, continuing search...
Found: 13 image links, continuing search...
Found: 14 image links, continuing search...
Found: 15 image links, continuing search...
Found: 16 image links, continuing search...
Found: 17 image links, continuing search...
Found: 18 image links, continuing search...
Found: 20 image links, continuing search...
Found: 2

┌ Error: Download failed: curl: (22) The requested URL returned error: 403 Forbidden
└ @ Base download.jl:43


Saved 496 images out of 501 links to web_images\yoga plank pose
Found: 100 search results. Extracting links from 1:100
Found: 1 image links, continuing search...
Found: 2 image links, continuing search...
Found: 2 image links, continuing search...
Found: 3 image links, continuing search...
Found: 4 image links, continuing search...
Found: 5 image links, continuing search...
Found: 6 image links, continuing search...
Found: 7 image links, continuing search...
Found: 8 image links, continuing search...
Found: 9 image links, continuing search...
Found: 10 image links, continuing search...
Found: 11 image links, continuing search...
Found: 12 image links, continuing search...
Found: 13 image links, continuing search...
Found: 14 image links, continuing search...
Found: 15 image links, continuing search...
Found: 16 image links, continuing search...
Found: 17 image links, continuing search...
Found: 18 image links, continuing search...
Found: 20 image links, continuing search...
Found: 22 i

┌ Error: Download failed: curl: (22) The requested URL returned error: 403 Forbidden
└ @ Base download.jl:43


Saved 553 images out of 501 links to web_images\yoga seated forward fold
Found: 100 search results. Extracting links from 1:100
Found: 1 image links, continuing search...
Found: 1 image links, continuing search...
Found: 1 image links, continuing search...
Found: 2 image links, continuing search...
Found: 3 image links, continuing search...
Found: 4 image links, continuing search...
Found: 5 image links, continuing search...
Found: 6 image links, continuing search...
Found: 7 image links, continuing search...
Found: 8 image links, continuing search...
Found: 9 image links, continuing search...
Found: 10 image links, continuing search...
Found: 11 image links, continuing search...
Found: 12 image links, continuing search...
Found: 13 image links, continuing search...
Found: 14 image links, continuing search...
Found: 15 image links, continuing search...
Found: 16 image links, continuing search...
Found: 17 image links, continuing search...
Found: 19 image links, continuing search...
Fou

┌ Error: Download failed: curl: (22) The requested URL returned error: 406 Not Acceptable
└ @ Base download.jl:43
┌ Error: Download failed: curl: (22) The requested URL returned error: 406 Not Acceptable
└ @ Base download.jl:43


Saved 502 images out of 500 links to web_images\yoga triangle pose
Found: 100 search results. Extracting links from 1:100
Found: 1 image links, continuing search...
Found: 2 image links, continuing search...
Found: 2 image links, continuing search...
Found: 3 image links, continuing search...
Found: 4 image links, continuing search...
Found: 5 image links, continuing search...
Found: 6 image links, continuing search...
Found: 7 image links, continuing search...
Found: 8 image links, continuing search...
Found: 9 image links, continuing search...
Found: 10 image links, continuing search...
Found: 11 image links, continuing search...
Found: 12 image links, continuing search...
Found: 13 image links, continuing search...
Found: 14 image links, continuing search...
Found: 15 image links, continuing search...
Found: 16 image links, continuing search...
Found: 17 image links, continuing search...
Found: 18 image links, continuing search...
Found: 20 image links, continuing search...
Found: 2

└ @ PNGFiles C:\Users\Ryn\.julia\packages\PNGFiles\CQUsD\src\wraphelpers.jl:2
┌ Error: Download failed: curl: (22) The requested URL returned error: 403 Forbidden
└ @ Base download.jl:43
└ @ PNGFiles C:\Users\Ryn\.julia\packages\PNGFiles\CQUsD\src\wraphelpers.jl:2


Saved 500 images out of 500 links to web_images\yoga warrior one pose
Found: 100 search results. Extracting links from 1:100
Found: 0 image links, continuing search...
Found: 1 image links, continuing search...
Found: 2 image links, continuing search...
Found: 3 image links, continuing search...
Found: 4 image links, continuing search...
Found: 5 image links, continuing search...
Found: 6 image links, continuing search...
Found: 7 image links, continuing search...
Found: 8 image links, continuing search...
Found: 9 image links, continuing search...
Found: 10 image links, continuing search...
Found: 11 image links, continuing search...
Found: 12 image links, continuing search...
Found: 13 image links, continuing search...
Found: 14 image links, continuing search...
Found: 15 image links, continuing search...
Found: 16 image links, continuing search...
Found: 17 image links, continuing search...
Found: 18 image links, continuing search...
Found: 20 image links, continuing search...
Found

└ @ PNGFiles C:\Users\Ryn\.julia\packages\PNGFiles\CQUsD\src\wraphelpers.jl:2
┌ Error: Download failed: curl: (22) The requested URL returned error: 403 Forbidden
└ @ Base download.jl:43
└ @ PNGFiles C:\Users\Ryn\.julia\packages\PNGFiles\CQUsD\src\wraphelpers.jl:2


Saved 501 images out of 500 links to web_images\yoga warrior one pose
Found: 100 search results. Extracting links from 1:100
Found: 0 image links, continuing search...
Found: 0 image links, continuing search...
Found: 0 image links, continuing search...
Found: 1 image links, continuing search...
Found: 2 image links, continuing search...
Found: 3 image links, continuing search...
Found: 4 image links, continuing search...
Found: 5 image links, continuing search...
Found: 6 image links, continuing search...
Found: 7 image links, continuing search...
Found: 8 image links, continuing search...
Found: 9 image links, continuing search...
Found: 10 image links, continuing search...
Found: 11 image links, continuing search...
Found: 12 image links, continuing search...
Found: 13 image links, continuing search...
Found: 14 image links, continuing search...
Found: 15 image links, continuing search...
Found: 17 image links, continuing search...
Found: 19 image links, continuing search...
Found: 