# Dude, Where's My House?
## Part 1: Gathering Data

    Authors:
      1. Aman Hafez
      2. Kavan Pandya
      3. Yan Nusinovich
      
    Using Google Street View to retrieve (pre-event) photos of structures and extracting residential house values from Zillow.
    
    Notebook drafted by Aman Hafez

### Gathering Data 

    Our datasets are built by scraping images from Google Images using these search terms, and variations thereof:
    1. Good Houses.
    2. Damaged Houses.
    3. Destroyed Houses.
    
    Each Google search yields approximately 300 images. We use the Selenium package in Python to automate the browser through the Google Chrome WebDriver, collect the image URLs, and download each image into folder_path.

In [1]:
from selenium import webdriver     #Selenium can be used to automate web browser interaction with Python
import io
import os
import time
import requests
import hashlib
from PIL import Image

# Put the path for ChromeDriver here
DRIVER_PATH = '/Users/amanhafez/Desktop/chromedriver'
wd = webdriver.Chrome(executable_path=DRIVER_PATH)

In [2]:
wd.get('https://google.com')

In [3]:
search_box = wd.find_element_by_css_selector('input.gLFyf')
search_box.send_keys('damaged houses')

In [4]:
wd.quit()

Selenium was originally designed for automated testing, but it is equally powerful for automating repetitive web-based tasks.

### Searching for a Particular Phrase & Getting the Image Links

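    The function fetch_image_urls expects three required input parameters and one optional parameter:

     1. query : Search term, like Dog
     2. max_links_to_fetch : Number of links the scraper is supposed to collect
     3. wd : Instantiated WebDriver
     4. sleep_between_interactions : Seconds to pause after each scroll and click (optional, default 1)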
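
The search term is substituted into the search URL as-is, so a term with spaces or other reserved characters relies on the browser to encode it. A minimal sketch of encoding the term explicitly with the standard library is shown here; `build_search_url` is a hypothetical helper that is not part of this notebook, and the URL template is copied from `fetch_image_urls`.

```python
from urllib.parse import quote_plus

# URL template copied from fetch_image_urls; build_search_url is a hypothetical helper.
SEARCH_URL = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"


def build_search_url(query: str) -> str:
    """Return the image-search URL with the query percent-encoded."""
    return SEARCH_URL.format(q=quote_plus(query))


print(build_search_url("damaged houses"))
```

With this helper, spaces become `+` and characters such as `&` are percent-encoded, so they cannot be mistaken for URL parameter separators.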

In [5]:
def fetch_image_urls(query:str, max_links_to_fetch:int, wd:webdriver.Chrome, sleep_between_interactions:float=1):
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)    
    
    # build the google query
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

    # load the page
    wd.get(search_url.format(q=query))

    image_urls = set()
    image_count = 0
    results_start = 0
    while image_count < max_links_to_fetch:
        scroll_to_end(wd)

        # get all image thumbnail results
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd")
        number_results = len(thumbnail_results)
        
        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")
        
        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail such that we can get the real image behind it
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception:
                continue

            # extract image urls    
            actual_images = wd.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
                    image_urls.add(actual_image.get_attribute('src'))

            image_count = len(image_urls)

            if len(image_urls) >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
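        else:
            # the for loop ended without break: not enough links yet, so ask for more results
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(30)
            # find_elements returns an empty list instead of raising when the button is absent
            load_more_buttons = wd.find_elements_by_css_selector(".mye4qd")
            if load_more_buttons:
                wd.execute_script("document.querySelector('.mye4qd').click();")
            elif number_results == results_start:
                # no new thumbnails and no "load more" button: return what we have
                break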
        # move the result startpoint further down
        results_start = len(thumbnail_results)

    return image_urls
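
The loop in `fetch_image_urls` relies on Python's `for ... else`: the `else` branch runs only when the `for` loop finishes without hitting `break`, which here means the current batch of thumbnails ran out before enough links were collected. The following is a small standalone sketch of the same control flow, with a hypothetical `collect` function and plain lists standing in for batches of thumbnails.

```python
def collect(batches, target):
    """Gather items batch by batch until `target` unique items are found."""
    found = set()
    log = []
    for batch in batches:              # stands in for each scroll of the results page
        for item in batch:             # stands in for each thumbnail in the batch
            found.add(item)
            if len(found) >= target:
                log.append("done")
                break                  # enough items: the else branch is skipped
        else:
            # reached only when the inner loop was not left via break
            log.append("need more")
            continue
        break                          # leave the outer loop once the target is met
    return found, log


found, log = collect([["a", "b"], ["b", "c"], ["d"]], target=3)
print(sorted(found), log)
```

The first batch is exhausted without reaching the target, so the `else` branch records "need more"; the second batch reaches the target and breaks out before the third batch is touched.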

### Downloading the Images

    The persist_image function takes an image URL, url, and downloads the image into folder_path. The file name is the first 10 hexadecimal characters of the SHA-1 hash of the image content, so it is deterministic rather than random.
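
In the code below, the file name comes from `hashlib.sha1(image_content).hexdigest()[:10]`, so identical image bytes always produce the same name and a repeated download overwrites the earlier file instead of duplicating it. A short sketch of that naming rule in isolation follows; `content_file_name` is a hypothetical helper, and the byte strings are placeholders rather than real image data.

```python
import hashlib


def content_file_name(image_content: bytes, length: int = 10) -> str:
    """Name a file after the first `length` hex characters of its SHA-1 digest."""
    return hashlib.sha1(image_content).hexdigest()[:length] + ".jpg"


first = content_file_name(b"placeholder image bytes")
again = content_file_name(b"placeholder image bytes")

# identical content always maps to the same file name
print(first == again)
```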

In [6]:
def persist_image(folder_path:str,url:str):
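    # download the raw image bytes; skip this URL if the request fails
    try:
        image_content = requests.get(url).content

    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        return  # image_content is undefined here, so there is nothing to save

    # decode the bytes, convert to RGB and save as a JPEG named after the content hash
    try: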
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")
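
The second `try` block is where a response that is not a decodable image fails. One possible pre-check, which this notebook does not use, is to inspect the leading bytes of the download before handing it to PIL. The sketch below assumes only the well-known file signatures of JPEG, PNG, GIF and WebP; `sniff_image_type` is a hypothetical helper.

```python
def sniff_image_type(data: bytes):
    """Return a format label based on the leading bytes, or None if unrecognised."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


# a JPEG-like header is recognised, an HTML error page is not
print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
print(sniff_image_type(b"<!DOCTYPE html><html>"))
```

A `None` result would suggest the server returned something other than an image, such as an HTML error page, which is worth logging separately from genuine decoding failures.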

### Putting It All Together

    The following function, search_and_download, combines the previous two functions and adds some resiliency to how we use the ChromeDriver. More precisely, we use the ChromeDriver within a with context, which guarantees that the browser closes down properly even if something within the with context raises an error. search_and_download lets you specify number_images, which defaults to 200 but can be set to however many images you want to download.
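
To illustrate what the `with` context guarantees, here is a minimal sketch that uses a stand-in class instead of a real browser. `FakeDriver` is hypothetical and only mimics the context-manager behaviour; it assumes, as described above for the ChromeDriver, that leaving the block shuts the driver down.

```python
class FakeDriver:
    """Hypothetical stand-in for a browser driver; records whether it was closed."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.quit()        # runs on normal exit and when an exception is raised
        return False       # do not swallow the exception


driver = FakeDriver()
try:
    with driver as wd:
        raise RuntimeError("simulated scraping failure")
except RuntimeError as err:
    print(f"caught: {err}")

# the driver was closed even though the block raised
print(driver.closed)
```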

In [7]:
def search_and_download(search_term:str,driver_path:str,target_path='./images',number_images=200):
    target_folder = os.path.join(target_path,'_'.join(search_term.lower().split(' ')))

    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    with webdriver.Chrome(executable_path=driver_path) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=0.5)
        
    for elem in res:
        print(elem)
        persist_image(target_folder,elem)
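
The target folder is derived from the search term by lower-casing it and replacing spaces with underscores. A small sketch of that rule on its own, applied to the three search terms listed at the top of this notebook, is given here; `folder_for` is a hypothetical helper name.

```python
import os


def folder_for(search_term: str, target_path: str = "./images") -> str:
    """Mirror the folder naming used in search_and_download."""
    return os.path.join(target_path, "_".join(search_term.lower().split(" ")))


for term in ["Good Houses", "Damaged Houses", "Destroyed Houses"]:
    print(folder_for(term))
```

Each search term therefore gets its own subfolder, which keeps the three image classes separate on disk.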

### Testing 

In [12]:
search_term = 'damaged houses'
search_and_download(search_term = search_term, driver_path = DRIVER_PATH)

Found: 200 search results. Extracting links from 0:200
Found: 201 image links, done!
https://c8.alamy.com/comp/E9J7CD/a-fire-damaged-house-E9J7CD.jpg
SUCCESS - saved https://c8.alamy.com/comp/E9J7CD/a-fire-damaged-house-E9J7CD.jpg - as ./images/fire_slightly_damaged_houses/bbb964c297.jpg
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQfmKcFImmFVWKbfRlYlcqR7JqT88RfPCRSHop5UfGs6gxe_93j&usqp=CAU
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQfmKcFImmFVWKbfRlYlcqR7JqT88RfPCRSHop5UfGs6gxe_93j&usqp=CAU - as ./images/fire_slightly_damaged_houses/1faa2514c2.jpg
https://wateroutfortwayne.com/wp-content/uploads/2019/08/shutterstock_707832229-1-1000x450.jpg
SUCCESS - saved https://wateroutfortwayne.com/wp-content/uploads/2019/08/shutterstock_707832229-1-1000x450.jpg - as ./images/fire_slightly_damaged_houses/287e682369.jpg
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcS9sfIDQkLjQhd9_eJO_a2qROImKlgnKXgoYttg1YAD00fS0h8M&usqp=CAU
SUCCESS - saved https://e

SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRS6kG_Iirf9WtBkjZbLU7N6hR0COd-dTterDRXMUddtWip1uiA&usqp=CAU - as ./images/fire_slightly_damaged_houses/a392583b6b.jpg
https://i.ytimg.com/vi/SXGdUfa1peM/maxresdefault.jpg
SUCCESS - saved https://i.ytimg.com/vi/SXGdUfa1peM/maxresdefault.jpg - as ./images/fire_slightly_damaged_houses/d089ebd03c.jpg
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRD0Lkd-0m10BJ4BdIIeDmDiYrA7HApYrJxDQbWBSsoxWLoUgi0&usqp=CAU
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRD0Lkd-0m10BJ4BdIIeDmDiYrA7HApYrJxDQbWBSsoxWLoUgi0&usqp=CAU - as ./images/fire_slightly_damaged_houses/c75383138a.jpg
https://www.vmcdn.ca/f/files/guelphtoday/images/fire-and-ems/20170418-fire-2-ts.jpg;w=960;h=640;bgcolor=000000
SUCCESS - saved https://www.vmcdn.ca/f/files/guelphtoday/images/fire-and-ems/20170418-fire-2-ts.jpg;w=960;h=640;bgcolor=000000 - as ./images/fire_slightly_damaged_houses/c9737e4475.jpg
https://dynamicmedia.zuza.co

SUCCESS - saved https://dynamicmedia.zuza.com/zz/m/original_/a/4/a4129335-6b4e-44ba-ae0d-6a6be597f467/Featherstone%20House%20Fire_Super_Portrait.jpg - as ./images/fire_slightly_damaged_houses/07ad2ba187.jpg
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcR--Gt0ll-5_BBhj7D1kZS-UkYff9L27RH5HtBOyfqu5YE8uilN&usqp=CAU
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcR--Gt0ll-5_BBhj7D1kZS-UkYff9L27RH5HtBOyfqu5YE8uilN&usqp=CAU - as ./images/fire_slightly_damaged_houses/96d8d06689.jpg
https://2qibqm39xjt6q46gf1rwo2g1-wpengine.netdna-ssl.com/wp-content/uploads/2019/10/19049977_web1_M-Brief-Marysville-Fire-edh-191023.jpg
ERROR - Could not save https://2qibqm39xjt6q46gf1rwo2g1-wpengine.netdna-ssl.com/wp-content/uploads/2019/10/19049977_web1_M-Brief-Marysville-Fire-edh-191023.jpg - cannot identify image file <_io.BytesIO object at 0x10fe9e290>
https://cdn.vox-cdn.com/thumbor/-lFKtXVTWjzItKq--NBxyIHOoCU=/0x0:648x486/1200x800/filters:focal(257x33:359x135)/cdn.vox-cdn.c

ERROR - Could not save https://www.realestate.com.au/blog/images/800x600-fit,progressive/2016/07/20152215/Guernsey_St_Busby_fire_damage_article.jpg - cannot identify image file <_io.BytesIO object at 0x110506dd0>
https://rdcnewscdn.realtor.com/wp-content/uploads/2018/01/fire-damage-house-1024x576.jpg
SUCCESS - saved https://rdcnewscdn.realtor.com/wp-content/uploads/2018/01/fire-damage-house-1024x576.jpg - as ./images/fire_slightly_damaged_houses/ee3e05f21d.jpg
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQ6Q-sDve5Axscyqld1vB0hn7zHE2cKQYnHF1poKZP0lkq_Z379&usqp=CAU
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQ6Q-sDve5Axscyqld1vB0hn7zHE2cKQYnHF1poKZP0lkq_Z379&usqp=CAU - as ./images/fire_slightly_damaged_houses/1c73af383c.jpg
https://www.iheartradio.ca/image/policy:1.10587636:1582121832/am800-news-totten-fire-february-2020.jpg?f=default&$p$f=9164768
SUCCESS - saved https://www.iheartradio.ca/image/policy:1.10587636:1582121832/am800-news-totten-fire-fe

SUCCESS - saved https://media.socastsrm.com/wordpress/wp-content/blogs.dir/2225/files/2020/03/f7cfebc8-4c92-43ba-9640-b51a8fc625ca.jpg - as ./images/fire_slightly_damaged_houses/ad0ac639e2.jpg
https://blackburnnews.com/wp-content/uploads/2020/02/a-fire.jpg
SUCCESS - saved https://blackburnnews.com/wp-content/uploads/2020/02/a-fire.jpg - as ./images/fire_slightly_damaged_houses/e15e96df72.jpg
https://dynamicmedia.zuza.com/zz/m/original_/6/c/6ca6dbd8-87cf-46a9-a6eb-636cac9021ad/Fire2_Super_Portrait.jpg
SUCCESS - saved https://dynamicmedia.zuza.com/zz/m/original_/6/c/6ca6dbd8-87cf-46a9-a6eb-636cac9021ad/Fire2_Super_Portrait.jpg - as ./images/fire_slightly_damaged_houses/14279afc6b.jpg
https://images.glaciermedia.ca/polopoly_fs/1.24030364.1575757831!/fileImage/httpImage/image.jpg_gen/derivatives/landscape_804/photo-view-royal-fire-dec-7-2019.jpg
SUCCESS - saved https://images.glaciermedia.ca/polopoly_fs/1.24030364.1575757831!/fileImage/httpImage/image.jpg_gen/derivatives/landscape_804/phot

ERROR - Could not save https://4rfnv3jdfte8qj2229aqgj4h-wpengine.netdna-ssl.com/wp-content/uploads/2020/02/20437947_web1_200207-CCI-Fire-Auchinachie-Road-House-fire_2.jpg - cannot identify image file <_io.BytesIO object at 0x11050ac50>
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTEwKrso_zSJN6L3J4zlUSzOTpmRT-tJ9tIe8DDi0G8VYjWq7uO&usqp=CAU
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTEwKrso_zSJN6L3J4zlUSzOTpmRT-tJ9tIe8DDi0G8VYjWq7uO&usqp=CAU - as ./images/fire_slightly_damaged_houses/d50766b34e.jpg
https://completedki.com/wp-content/uploads/2016/12/picture-992615-764x1024.jpg
SUCCESS - saved https://completedki.com/wp-content/uploads/2016/12/picture-992615-764x1024.jpg - as ./images/fire_slightly_damaged_houses/2d7c6de7d5.jpg
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQVFa9VFhOATmQfcO_AqBbacaS7_ctHDfqjpfFqmPxnfw-tIog7&usqp=CAU
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQVFa9VFhOATmQfcO_AqBbacaS7_ctHDfqjpfFq

SUCCESS - saved https://c8.alamy.com/comp/BFKNFT/fire-damaged-house-in-north-carolina-usa-BFKNFT.jpg - as ./images/fire_slightly_damaged_houses/330f97aad6.jpg
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRS1iQHG7fhGzYvxbZwe0P_OIugLpyWp-_5AEB1ZkvBUd_14tWy&usqp=CAU
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRS1iQHG7fhGzYvxbZwe0P_OIugLpyWp-_5AEB1ZkvBUd_14tWy&usqp=CAU - as ./images/fire_slightly_damaged_houses/c166d8ee5c.jpg
https://bloximages.chicago2.vip.townnews.com/qctimes.com/content/tncms/assets/v3/editorial/5/16/516d4ba3-9961-551d-af65-484e8745783d/5ddc40d534fb5.image.jpg?resize=1200%2C900
SUCCESS - saved https://bloximages.chicago2.vip.townnews.com/qctimes.com/content/tncms/assets/v3/editorial/5/16/516d4ba3-9961-551d-af65-484e8745783d/5ddc40d534fb5.image.jpg?resize=1200%2C900 - as ./images/fire_slightly_damaged_houses/7306ae27b2.jpg
https://image.shutterstock.com/image-photo/burned-house-interior-after-fire-260nw-1117451939.jpg
SUCCESS - save

## Fetching Data is Done 