In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Scraping Images from Google**

All the labeled CORROSION and NO CORROSION images were collected by scraping Google Images. Selenium was used to automate the web browser from Python: it drives a real browser, so it can open pages, scroll, and click buttons as a user would.

The CORROSION images were scraped from Google Images using keyword searches covering eight categories of corrosion problems: **‘Steel Corrosion/Rust,’ ‘Ships Corrosion,’ ‘Ship Propellers Corrosion,’ ‘Cars Corrosion,’ ‘Oil and Gas Pipelines Corrosion,’ ‘Concrete Rebar Corrosion,’ ‘Water/Oil Tanks Corrosion,’ and ‘Stainless Steel Corrosion.’** The **NO CORROSION** images were also scraped from Google Images, using the same search terms without the word "corrosion".

In [2]:
!pip install selenium

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting selenium
  Downloading selenium-4.8.3-py3-none-any.whl (6.5 MB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.10.2-py3-none-any.whl (17 kB)
Collecting trio~=0.17
  Downloading trio-0.22.0-py3-none-any.whl (384 kB)
Collecting sniffio
  Downloading sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting async-generator>=1.9
  Downloading async_generator-1.10-py3-none-any.whl (18 kB)
Collecting outcome
  Downloading outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)

In [4]:

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
import PIL
from PIL import Image
import time
import pathlib
import glob
import os, os.path, shutil
import requests
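# Regular expressions make it easier to parse text
import re
# In-memory byte streams (io) and hashing (hashlib) are used when saving images
import io
import hashlib
# Path handed to webdriver.Chrome as the ChromeDriver location
DRIVER_PATH = '/content/drive/MyDrive/Major_Project/data'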

**Searching for a specific term and getting image links**

The function fetch_image_urls takes three required parameters and one optional parameter:

1. query : Search term, like "Corrosion on Steel Structures"
2. max_links_to_fetch : Number of links the scraper is supposed to collect
3. wd : Instantiated WebDriver
4. sleep_between_interactions : Optional pause, in seconds, after each scroll or click (default 1)
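The search term ends up inside the search URL, so terms containing spaces or characters such as `/` or `&` are safer when percent-encoded first. The following is a minimal sketch, assuming a hypothetical helper `build_search_url` (not part of the original notebook) and the standard-library `urllib.parse.quote_plus`; the URL template is the one used in fetch_image_urls.

```python
from urllib.parse import quote_plus

# Same URL template as in fetch_image_urls
SEARCH_URL = (
    "https://www.google.com/search?safe=off&site=&tbm=isch"
    "&source=hp&q={q}&oq={q}&gs_l=img"
)


def build_search_url(query: str) -> str:
    """Hypothetical helper: percent-encode the query before building the URL."""
    return SEARCH_URL.format(q=quote_plus(query))


url = build_search_url("Steel Corrosion/Rust")
print(url)
```

With this approach the space becomes `+` and the slash becomes `%2F`, so the whole term is treated as a single query value.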

In [5]:
def fetch_image_urls(query: str, max_links_to_fetch: int, wd: webdriver.Chrome, sleep_between_interactions: float = 1):
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)    
    
    # build the google query
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

    # load the page
    wd.get(search_url.format(q=query))

    image_urls = set()
    image_count = 0
    results_start = 0
    while image_count < max_links_to_fetch:
        scroll_to_end(wd)

        # get all image thumbnail results
        thumbnail_results = wd.find_elements(By.CSS_SELECTOR,"img.Q4LuWd")
        number_results = len(thumbnail_results)
        
        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")
        
        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail such that we can get the real image behind it
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception:
                continue

            # extract image urls    
            actual_images = wd.find_elements(By.CSS_SELECTOR,'img.n3VNCb')
            for actual_image in actual_images:
                # read the attribute once and keep only absolute http(s) links
                src = actual_image.get_attribute('src')
                if src and 'http' in src:
                    image_urls.add(src)

            image_count = len(image_urls)

            if len(image_urls) >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
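        else:
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(30)
            # click the "load more" button if it is present
            load_more_button = wd.find_elements(By.CSS_SELECTOR,".mye4qd")
            if load_more_button:
                wd.execute_script("document.querySelector('.mye4qd').click();")
            elif number_results == results_start:
                # no new thumbnails and nothing more to load: stop instead of looping forever
                break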

        # move the result startpoint further down
        results_start = len(thumbnail_results)

    return image_urls

**Downloading the images**

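The persist_image function downloads the image at url and saves it as a JPEG in folder_path. The file name is not random: it is the first 10 hexadecimal characters of the SHA-1 hash of the downloaded bytes, so downloading the same image twice produces the same file name.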
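The naming scheme can be illustrated in isolation. This is a minimal sketch, assuming a hypothetical helper `content_based_path` that mirrors the file-name logic of persist_image without downloading anything; the byte strings are placeholders, not real image data.

```python
import hashlib
import os


def content_based_path(folder_path: str, image_content: bytes) -> str:
    """Hypothetical helper: build a file path from the SHA-1 of the content."""
    digest = hashlib.sha1(image_content).hexdigest()[:10]
    return os.path.join(folder_path, digest + ".jpg")


# Placeholder bytes standing in for downloaded image content
a = content_based_path("images", b"same bytes")
b = content_based_path("images", b"same bytes")
c = content_based_path("images", b"other bytes")

print(a == b)  # identical content maps to the same file name
print(a == c)  # different content maps to a different file name
```

One consequence of this design is that exact duplicates returned by different searches overwrite each other instead of piling up in the folder.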

In [6]:
def persist_image(folder_path: str, url: str):
    try:
        image_content = requests.get(url).content

    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        # nothing to save if the download failed
        return

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")

**Putting the Search and Download Functions Together**

The following function, search_and_download, combines the previous two functions and adds some resiliency to how we use the ChromeDriver. 
More precisely, the ChromeDriver is used inside a with block, which guarantees that the browser is closed cleanly even if something inside the block raises an error. search_and_download lets you set number_images, which defaults to 100, to however many images you want to download.
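The guarantee provided by the with block can be seen without launching a browser. This is a minimal sketch, assuming a hypothetical `FakeDriver` class that stands in for the real driver and only records whether it was closed.

```python
class FakeDriver:
    """Hypothetical stand-in for a browser driver."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # A real driver would shut the browser down here
        self.closed = True
        # Returning False lets the original exception propagate
        return False


driver = FakeDriver()
try:
    with driver as wd:
        raise RuntimeError("scraping failed")
except RuntimeError:
    pass

print(driver.closed)
```

The cleanup in `__exit__` runs even though the body raised, which is the behavior search_and_download relies on.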

In [7]:
def search_and_download(search_term: str, driver_path: str, target_path: str = './images', number_images: int = 100):
    # one sub-folder per search term, e.g. "ships_corrosion"
    target_folder = os.path.join(target_path, '_'.join(search_term.lower().split(' ')))

    os.makedirs(target_folder, exist_ok=True)

    # executable_path is deprecated in Selenium 4; newer releases expect a Service object instead
    with webdriver.Chrome(executable_path=driver_path) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=0.5)

    # fetch_image_urls may return nothing if no links were collected
    for elem in res or []:
        persist_image(target_folder, elem)

**The following lines of code can be used to scrape the eight corrosion categories mentioned above, plus NO CORROSION.**

1. **Steel Corrosion/Rust**

  search_term = 'Steel Corrosion/Rust'

  search_and_download(search_term = search_term, driver_path=DRIVER_PATH)

2. **Ships Corrosion**

3. **Ship Propellers Corrosion**

4. **Cars Corrosion**

5. **Oil and Gas Pipelines Corrosion**

6. **Concrete Rebar Corrosion**

7. **Water/Oil Tanks Corrosion**

8. **Stainless Steel Corrosion**

9. **NO CORROSION**
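The same call can be repeated for every category in a loop. This is a minimal sketch, assuming a hypothetical helper `scrape_all` that takes the download function as an argument; the dry run below uses a stand-in function and a placeholder driver path so that no browser is started. The NO CORROSION queries are left out because their exact wording is not given above.

```python
CORROSION_TERMS = [
    "Steel Corrosion/Rust",
    "Ships Corrosion",
    "Ship Propellers Corrosion",
    "Cars Corrosion",
    "Oil and Gas Pipelines Corrosion",
    "Concrete Rebar Corrosion",
    "Water/Oil Tanks Corrosion",
    "Stainless Steel Corrosion",
]


def scrape_all(search_terms, download_fn, **kwargs):
    """Hypothetical helper: call download_fn once per search term."""
    for term in search_terms:
        download_fn(search_term=term, **kwargs)


# Dry run: record the terms instead of scraping
calls = []


def record_call(search_term, **kwargs):
    calls.append(search_term)


scrape_all(CORROSION_TERMS, record_call, driver_path="path/to/chromedriver")
print(len(calls))
```

In the notebook, the real run would be `scrape_all(CORROSION_TERMS, search_and_download, driver_path=DRIVER_PATH)`. Note that terms containing `/` produce a nested target folder, since the slash is kept when the folder name is built.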