<a href="https://colab.research.google.com/github/vasugamdha/random-colab-notebooks/blob/main/Google_Images_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Web Scraping Google Images using Selenium**


## Installing selenium and downloading chromedriver

In [15]:
!pip install selenium
# !apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
# !cp /usr/lib/chromium-browser/chromedriver /usr/bin

Reading package lists... Done
Building dependency tree       
Reading state information... Done
chromium-chromedriver is already the newest version (87.0.4280.66-0ubuntu0.18.04.1).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.


## Adding path to system

In [16]:
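import sys
# Note: sys.path is Python's module search path, so this line does not change
# how the chromedriver binary is located; it is kept from the original recipe.
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
# Explicit path to the driver binary, passed to search_n_save() later on
DRIVER_PATH = "/usr/bin/chromedriver"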
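
Before launching the browser, it can help to confirm that the driver binary really is where `DRIVER_PATH` points. The following is a minimal sketch under that assumption; `resolve_driver` is a hypothetical helper name (it is not part of Selenium or of this notebook) that falls back to a lookup on the system `PATH`.

```python
import os
import shutil

def resolve_driver(preferred_path, name="chromedriver"):
    """Return a usable driver path, or None if no driver is found.

    Hypothetical helper for illustration; not part of Selenium.
    """
    # Use the preferred path when it is an executable file
    if os.path.isfile(preferred_path) and os.access(preferred_path, os.X_OK):
        return preferred_path
    # Otherwise search the directories listed in the PATH environment variable
    return shutil.which(name)

# Prints the resolved path, or None when no chromedriver is installed
print(resolve_driver("/usr/bin/chromedriver"))
```

If this prints `None`, the install cell above probably needs to be rerun before any `webdriver.Chrome(...)` call can succeed.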

## Configuring options to resolve the following **`WebDriverException`**

Message: unknown error: Chrome failed to start: exited abnormally.  
(unknown error: DevToolsActivePort file doesn't exist)  
(The process started from chrome location /usr/bin/chromium-browser is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

In [17]:
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
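# chromedriver is resolved from PATH; passing only options works on Selenium 3 and 4
wd = webdriver.Chrome(options=chrome_options)
wd.get("https://www.website-url.com")  # placeholder URL; replace with the page you want to load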

<font color='red' size=4>Note:</font>  
If you want to fetch many images (50 or more), find `scroll_to_end` in the cell below and uncomment both its <u>definition</u> and the line that <u>calls</u> it.
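
As an illustration of the scrolling idea, here is a hedged sketch of a variant that keeps scrolling until the page height stops growing. `scroll_until_stable`, `pause`, and `max_rounds` are assumed names that do not appear in the notebook; the only driver method used is `execute_script`, the same call `scroll_to_end` relies on.

```python
import time

def scroll_until_stable(wd, pause=1.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops changing.

    Hypothetical helper; `wd` is any object exposing execute_script(),
    such as a Selenium WebDriver instance.
    """
    last_height = wd.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded thumbnails time to appear
        new_height = wd.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop scrolling
        last_height = new_height
    return last_height
```

The `max_rounds` cap is a design choice to keep the loop bounded on pages that grow indefinitely.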

In [18]:
import time
from selenium.webdriver.common.by import By

def get_image_urls(query:str, max_links_to_fetch:int, wd:webdriver.Chrome, sleep_between_interactions:float=1):
    
    # def scroll_to_end(wd):
    #     wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    #     time.sleep(sleep_between_interactions)    
    
    search_url = "https://www.google.com/search?q={q}&tbm=isch" # build the google query
    
    wd.get(search_url.format(q=query))                          # get the page

    image_urls = set()
    image_count = 0
    results_start = 0
    while image_count < max_links_to_fetch:
        
        # scroll_to_end(wd)                                     # <---- uncomment if needed

        # get all image thumbnail results
        thumbnail_results = wd.find_elements(By.CSS_SELECTOR, "img.Q4LuWd")
        number_results = len(thumbnail_results)

        # stop if no new thumbnails appeared since the last pass (avoids looping forever)
        if number_results == results_start:
            break
        
        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")
        
        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail such that we can get the real image behind it
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception:
                continue

            # extract image urls    
            actual_images = wd.find_elements(By.CSS_SELECTOR, 'img.n3VNCb')
            for actual_image in actual_images:
                src = actual_image.get_attribute('src')
                if src and 'http' in src:
                    image_urls.add(src)

            image_count = len(image_urls)

            if image_count >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
        else:
            # the for loop finished without reaching max_links_to_fetch
            print("Found:", len(image_urls), "image links, looking for more ...")
            # find_elements returns an empty list (no exception) when the button is absent
            load_more_buttons = wd.find_elements(By.CSS_SELECTOR, ".mye4qd")
            if load_more_buttons:
                wd.execute_script("document.querySelector('.mye4qd').click();")
                time.sleep(sleep_between_interactions)

        # move the result startpoint further down
        results_start = len(thumbnail_results)

    # always return the set collected so far, even if fewer links were found than requested
    return image_urls
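
The search URL above is built by formatting the raw query into the string, so a term such as "Elon Musk" goes in with a literal space. A minimal sketch of URL-encoding the query first, assuming a hypothetical helper named `build_search_url` (not part of the notebook) and using `urllib.parse.quote_plus` from the standard library:

```python
from urllib.parse import quote_plus

# Same URL template as in get_image_urls
SEARCH_URL = "https://www.google.com/search?q={q}&tbm=isch"

def build_search_url(query):
    """Return the image-search URL with the query safely URL-encoded."""
    # quote_plus turns spaces into '+' and escapes characters such as '&'
    return SEARCH_URL.format(q=quote_plus(query))

print(build_search_url("Elon Musk"))
# prints: https://www.google.com/search?q=Elon+Musk&tbm=isch
```

Encoding matters most for queries containing `&` or `#`, which would otherwise be read as part of the URL syntax rather than the search term.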

In [19]:
def save_image(folder_path:str,url:str):
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print("ERROR - Could not download", url,"-",e)
        return  # nothing to save if the download failed

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print("ERROR - Could not save", url,"-",e)
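
`save_image` names each file after the first 10 hex characters of the SHA-1 digest of its bytes, so downloading identical content twice overwrites the same file instead of creating a duplicate. The sketch below isolates that naming step; `hashed_file_path` is a hypothetical helper name introduced only for illustration.

```python
import hashlib
import os

def hashed_file_path(folder_path, image_content):
    """Build the .jpg path that save_image would use for these bytes."""
    # First 10 hex characters of the SHA-1 digest of the raw content
    digest = hashlib.sha1(image_content).hexdigest()[:10]
    return os.path.join(folder_path, digest + ".jpg")

# Identical bytes always map to the same file name
same = hashed_file_path("/content/tesla", b"abc") == hashed_file_path("/content/tesla", b"abc")
print(same)  # prints: True
```

Truncating the digest to 10 characters keeps names short at the cost of a small collision risk, which is negligible for a dataset of a few thousand images.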
    

In [20]:
from selenium.webdriver.chrome.service import Service

def search_n_save(search_term:str,driver_path:str,target_path='/content',number_images=5):
    # one folder per search term, e.g. "Elon Musk" -> <target_path>/elon_musk
    target_folder = os.path.join(target_path,'_'.join(search_term.lower().split(' ')))

    os.makedirs(target_folder, exist_ok=True)

    # Selenium 4 takes the driver path through a Service object
    # (Selenium 3 used executable_path=driver_path instead)
    with webdriver.Chrome(service=Service(driver_path), options=chrome_options) as wd:
        res = get_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=0.5)
        
    for elem in res:
        save_image(target_folder,elem)

## Variables you can manipulate

1. Set `search_term` to a list of strings, one per search query (class) you want images for.
2. Set `number_images` to the number of images to download for each class.
3. Set `target_path` to the directory in which the image dataset should be created.
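
Each search term gets its own subfolder under `target_path`, named by lowercasing the term and replacing spaces with underscores. A small sketch of that mapping, using the hypothetical helper name `target_folder_for` and the same expression that `search_n_save` uses:

```python
import os

def target_folder_for(search_term, target_path="/content"):
    """Return the folder where images for this search term are saved."""
    # Lowercase the term and join its space-separated words with underscores
    return os.path.join(target_path, "_".join(search_term.lower().split(" ")))

for term in ["Tesla", "Elon Musk"]:
    print(target_folder_for(term))
```

On Colab (a Linux environment) this yields `/content/tesla` and `/content/elon_musk`.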

In [21]:
search_term = ["Tesla","Elon Musk"]
number_images = 5
target_path = '/content'

In [22]:
import os 
import time
import requests
import io
from PIL import Image
import hashlib

for i in search_term:
    search_n_save(search_term = i, driver_path=DRIVER_PATH, target_path=target_path, number_images=number_images)

Found: 48 search results. Extracting links from 0:48
Found: 5 image links, done!
SUCCESS - saved https://cnet1.cbsistatic.com/img/6u2kmEm0RJUfcSgupFaBkEsKeK4=/940x0/2020/02/13/ae1b9b28-ac0f-4b16-90c7-1f232b6633e4/press00-model-x-rear-three-quarter-with-doors-open.jpg - as /content/tesla/f24f70e79e.jpg
SUCCESS - saved https://hips.hearstapps.com/hmg-prod/amv-prod-cad-assets/wp-content/uploads/2017/11/Tesla-Roadster-103.jpg?crop=0.779xw:0.950xh;0.141xw,0.0499xh&resize=640:* - as /content/tesla/ca3323f563.jpg
SUCCESS - saved https://tesla-cdn.thron.com/delivery/public/image/tesla/32e5e0f3-5c04-42ef-8f8f-c6b1c26f8a9e/bvlatuR/std/2880x1800/ms-main-hero-desktop - as /content/tesla/0d984c8f95.jpg
SUCCESS - saved https://tesla-cdn.thron.com/delivery/public/image/tesla/c82315a6-ac99-464a-a753-c26bc0fb647d/bvlatuR/std/1200x628/lhd-model-3-social - as /content/tesla/7b96c89c6a.jpg
SUCCESS - saved https://cdn.motor1.com/images/mgl/Yp07j/s1/tesla-pricing-lead.jpg - as /content/tesla/73a3296ae4.jpg
