### article link : https://towardsdatascience.com/image-scraping-with-python-a96feda8af2d

#  #1 Scraping static pages

Static pages are pages that don't rely on JavaScript to create a high degree of interaction.

In [1]:
# import requests
# # # download wikipage
# wikipage = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_continent_(data_file)"
# result = requests.get(wikipage)
# result.content

# using Beautiful Soup

documentation link : https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Beautiful Soup version: 4.9.2 (works with both Python 2.7 and Python 3.8)

It is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [3]:
# # download wikipage
# wikipage = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_continent_(data_file)"
# result = requests.get(wikipage)

In [4]:
# # if successful parse the download into a BeautifulSoup object, which allows easy manipulation
# if result.status_code == 200:
#     soup = BeautifulSoup(result.content, "html.parser")

In [5]:
# # find the object with HTML class wikitable sortable
# table = soup.find('table', {'class':'wikitable sortable'})

In [6]:
# # loop through all the rows and pull the text
# new_table = []
# for row in table.find_all('tr')[1:]:
#     columns_marker = 0
#     columns = row.find_all('td')
#     new_table.append([column.get_text() for column in columns])

In [7]:
# df = pd.DataFrame(new_table, columns = ['ContinentCode', 'Alpha2', 'Alpha3',
#                                        'PhoneCode', 'Name'])
# df['Name'] = df['Name'].str.replace('\n', '')
# df

Parsing the download with Beautiful Soup gives us a clean parse tree. In this tree, we can search for an element of type "table" with the class "wikitable sortable". To find an element's type and class, right-click the table in the browser and choose Inspect to see the source code. We then loop through the table and extract the data row by row, which produces the DataFrame built in the cell above.
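The same find-then-loop pattern can be tried offline on a small inline snippet. This is only a minimal sketch: the HTML string and its two data rows are made up for illustration and are not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page (an assumption for illustration)
html = """
<table class="wikitable sortable">
  <tr><th>CC</th><th>A-2</th><th>Name</th></tr>
  <tr><td>AS</td><td>AF</td><td>Afghanistan</td></tr>
  <tr><td>EU</td><td>AL</td><td>Albania</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Same lookup as in the notebook: a <table> whose class attribute is "wikitable sortable"
table = soup.find("table", {"class": "wikitable sortable"})

# Skip the header row, then collect the whitespace-stripped text of every <td> cell
rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in table.find_all("tr")[1:]
]
print(rows)
```

Passing `strip=True` to `get_text` trims surrounding whitespace, which may make a separate newline clean-up step unnecessary for simple cells.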

In [8]:
# to use pandas built in read_html method 
# pip install lxml 

In [9]:
# res = pd.read_html(wikipage)
# res[2]

We call res[2] because pd.read_html() dumps everything that even loosely resembles a table into its own DataFrame, so we have to check which of the resulting DataFrames contains the desired data. read_html is worth trying when the data is nicely structured.
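Instead of checking the returned DataFrames by index, one option is the `match` argument of `pd.read_html`, which keeps only tables containing text that matches a given string or regular expression. The sketch below runs on a made-up HTML string with two tables (an assumption for illustration) and needs a parser such as lxml to be installed.

```python
import io

import pandas as pd

# Made-up HTML with a navigation-like table and a data table (illustrative only)
html = """
<table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
<table>
  <tr><th>Alpha2</th><th>Name</th></tr>
  <tr><td>AF</td><td>Afghanistan</td></tr>
  <tr><td>AL</td><td>Albania</td></tr>
</table>
"""

# Without match, both tables would be returned; with it, only the one mentioning "Alpha2"
tables = pd.read_html(io.StringIO(html), match="Alpha2")
df = tables[0]

print(len(tables))
print(df)
```

Wrapping the string in `io.StringIO` passes it as a file-like object, so the same call shape works whether the HTML comes from a string or from a file.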

# #2 Scraping interactive pages

In [10]:
!pip install selenium



Selenium pretends to be a real user: it opens the browser, “moves” the cursor around, and clicks buttons if you tell it to do so. The initial idea behind Selenium is automated testing. However, it is equally powerful when it comes to automating repetitive web-based tasks.

In [11]:
#import selenium 

In [12]:
# from selenium import webdriver

# DRIVER_PATH = ('/Users/msowmya/Desktop/Personal/Projects/FINAL YEAR/Files/chromedriver')

# wd = webdriver.Chrome(executable_path = DRIVER_PATH)

# #3 Scraping images from Google

In [14]:
import time  # fetch_image_urls below calls time.sleep
import selenium
from selenium import webdriver

In [15]:
def fetch_image_urls(query:str, max_links_to_fetch:int, wd:webdriver, sleep_between_interactions:int=1):
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)    
    
    # build the google query
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

    # load the page
    wd.get(search_url.format(q=query))

    image_urls = set()
    image_count = 0
    results_start = 0
    while image_count < max_links_to_fetch:
        scroll_to_end(wd)

        # get all image thumbnail results
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd")
        number_results = len(thumbnail_results)
        
        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")
        
        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail such that we can get the real image behind it
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception:
                continue

            # extract image urls    
            actual_images = wd.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
                    image_urls.add(actual_image.get_attribute('src'))

            image_count = len(image_urls)

            if len(image_urls) >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
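        else:
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(30)
            # find_elements returns an empty list (instead of raising) when nothing matches
            load_more_buttons = wd.find_elements_by_css_selector(".mye4qd")
            if not load_more_buttons:
                # nothing more to load: return what was collected rather than None
                return image_urls
            wd.execute_script("document.querySelector('.mye4qd').click();")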

        # move the result startpoint further down
        results_start = len(thumbnail_results)

    return image_urls

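The function "fetch_image_urls" expects three required input parameters and one optional one :
    1. query : Search term, e.g. Dog
    2. max_links_to_fetch : Number of links the scraper is supposed to collect
    3. wd : instantiated WebDriver
    4. sleep_between_interactions : pause in seconds between interactions (optional, defaults to 1)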
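Note that the query is formatted into the search URL as-is. Browsers tend to tolerate spaces, but a character such as "&" in the search term would change the meaning of the URL. A minimal sketch of encoding the query first, where `build_search_url` is a hypothetical helper and not part of the original function:

```python
from urllib.parse import quote_plus

# Same URL template as in fetch_image_urls
SEARCH_URL = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"


def build_search_url(query: str) -> str:
    """Hypothetical helper: percent-encode the query before formatting it into the URL."""
    return SEARCH_URL.format(q=quote_plus(query))


print(build_search_url("bunch of tomatoes"))
print(build_search_url("salt & pepper"))
```

With this approach spaces become "+" and reserved characters are percent-encoded, so the search term always stays inside the `q` and `oq` parameters.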

In [16]:
# install pillow
!pip install Pillow



In [17]:
import io, os
import time
import pandas as pd
import hashlib
from PIL import Image

In [18]:
def persist_image(folder_path:str,url:str):
    try:
        image_content = requests.get(url).content

    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        return  # nothing was downloaded, so there is nothing to save

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")

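The "persist_image" function downloads the image at "url" and saves it as a JPEG in "folder_path". The file name is the first 10 hexadecimal characters of the SHA-1 hash of the image content, so it is derived from the image bytes rather than random, and downloading the same image twice produces the same file name.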
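The naming rule can be checked in isolation. In the sketch below, `image_file_name` is a hypothetical helper that mirrors the expression used inside "persist_image", and the byte strings are placeholders for real image content.

```python
import hashlib


def image_file_name(image_content: bytes) -> str:
    """Hypothetical helper mirroring the naming rule used in persist_image."""
    return hashlib.sha1(image_content).hexdigest()[:10] + ".jpg"


# Placeholder bytes stand in for downloaded image content
first = image_file_name(b"same bytes")
second = image_file_name(b"same bytes")
other = image_file_name(b"different bytes")

print(first == second)  # identical content maps to the same file name
print(first == other)   # different content gets a different name here
```

One consequence of this design is that a duplicate image overwrites the earlier file instead of being stored twice.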

In [19]:
DRIVER_PATH = ('/Users/msowmya/Desktop/Personal/Projects/#4 Final Year/Files/chromedriver')

In [20]:
def search_and_download(search_term:str,driver_path:str,target_path='./images',number_images=20):
    target_folder = os.path.join(target_path,'_'.join(search_term.lower().split(' ')))

    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    with webdriver.Chrome(executable_path=driver_path) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=0.5)
        
    for elem in res:
        persist_image(target_folder,elem)

The function search_and_download combines the previous two functions and adds some resiliency to how we use the ChromeDriver. We use the ChromeDriver within a "with" context, which guarantees that the browser shuts down in an orderly way, even if something within the "with" context raises an error.

"search_and_download" allows you to specify "number_images", which is set to 20 by default but can be set to however many images we want to download.

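The clean-up guarantee of the "with" context can be seen without starting a browser. The sketch below uses `FakeDriver`, a made-up stand-in class (not a Selenium class), purely to show that the exit step runs even when the body raises.

```python
class FakeDriver:
    """Made-up stand-in for a browser driver, used only to show the pattern."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit and on error alike
        self.closed = True
        return False  # do not swallow the exception


driver = FakeDriver()
try:
    with driver as wd:
        raise RuntimeError("scraping failed")
except RuntimeError as err:
    print(f"caught: {err}")

print(driver.closed)
```

The error still propagates to the caller, but by the time it does the driver has already been closed.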
In [22]:
search_term = 'bunch of tomatoes'

search_and_download(
    search_term = search_term,
    driver_path = DRIVER_PATH
)

WebDriverException: Message: Service /Users/msowmya/Desktop/Personal/Projects/#4 Final Year/Files/chromedriver unexpectedly exited. Status code was: -9
