# Summary
---

The code in this notebook pulls data for multiple house listings from Redfin. Selenium is used to log in to Redfin and download price history data (users will need to enter their own Redfin username and password). 

To keep track of which house listings have already been scraped, a dictionary of URLs is maintained. Additional URLs are pulled from each webpage and added to the full dictionary of URLs.
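
As a minimal sketch of that bookkeeping, the dictionary maps each URL to a boolean that records whether it has been scraped. The helper names `add_urls` and `unscraped` and the example URLs are hypothetical and do not appear in the notebook's own code:

```python
def add_urls(urls_dict, new_urls):
    """Register URLs as unscraped without resetting ones already scraped."""
    for url in new_urls:
        urls_dict.setdefault(url, False)


def unscraped(urls_dict):
    """Return URLs whose scraped flag is still False, in insertion order."""
    return [url for url, done in urls_dict.items() if not done]


# Hypothetical listing URLs, for illustration only
urls = {}
add_urls(urls, ["https://www.redfin.com/DC/a", "https://www.redfin.com/DC/b"])
urls["https://www.redfin.com/DC/a"] = True  # mark as scraped
add_urls(urls, ["https://www.redfin.com/DC/a", "https://www.redfin.com/DC/c"])
print(unscraped(urls))
```

Using `setdefault` here means a URL rediscovered on a later page keeps its existing flag, so an already-scraped listing is not queued again.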

Webpages are scraped in batches (user to enter batch size) to minimize data loss if any errors occur or if the scraping must be paused.
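
The batching idea can be sketched roughly as follows. The function `scrape_in_batches`, the stand-in `fake_scrape`, and the `batchN.pkl` file names are assumptions made for illustration; the notebook's own loop, defined later, differs in its details (for example, this sketch only marks a URL as scraped after its scrape call returns):

```python
import os
import pickle
import tempfile


def scrape_in_batches(urls_dict, batch_size, scrape_one, out_dir):
    """Scrape unscraped URLs batch by batch, pickling results after each batch."""
    batch_num = 0
    while True:
        batch = [u for u, done in urls_dict.items() if not done][:batch_size]
        if not batch:
            break
        results = {}
        for url in batch:
            results[url] = scrape_one(url)
            urls_dict[url] = True
        # Checkpoint: an error in a later batch cannot lose this one
        path = os.path.join(out_dir, "batch{}.pkl".format(batch_num))
        with open(path, "wb") as f:
            pickle.dump(results, f)
        batch_num += 1
    return batch_num


def fake_scrape(url):
    """Stand-in for a real scraper: returns a made-up record for each URL."""
    return {"length": len(url)}


demo_urls = {"url{}".format(i): False for i in range(5)}
with tempfile.TemporaryDirectory() as tmp:
    n_batches = scrape_in_batches(demo_urls, 2, fake_scrape, tmp)
    saved_files = sorted(os.listdir(tmp))
print(n_batches, saved_files)
```

A smaller batch size means more frequent checkpoints, at the cost of more files to combine afterwards.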

Note that only a small sample is saved in the scraped_data folder. User will need to determine how much data they need.

# Table of Contents
---
0. [Import Packages](#import_packages)

1. [Create Functions to Pull Data From Redfin](#create_functions)
  * [Set Up Webdriver and Helper Functions to Pull From Individual Webpages](#set_up_webdriver_function)
  * [Pull Listing Details From Individual Webpages](#pull_listing_details_function)
  * [Add URLs from Similar Houses to Dictionary of All URLS to Pull](#add_to_urls_function)
  * [Pull Data From List of URLs](#pull_data_function)
  
2. [Extract Data](#extract_data)
  * [A. Open Existing Dictionary of URLs to Scrape Or Create New URLs Dictionary](#open_urls_dict)
  * [B. Add more URLs from Redfin Searches (Optional)](#add_urls)
  * [C. View How Many Links Are Unscraped in the List of URLs](#view_urls)
  * [D. Start Scraping and Saving Data](#pull_data)

## 0: Import Packages  <a name="import_packages"></a>
---

In [65]:
# Webscraping
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By 
from bs4 import BeautifulSoup

# Timing
import time
import random

# Store data
from collections import defaultdict
import pickle
import pandas as pd

## 1: Create Functions to Pull Data From Redfin  <a name="create_functions"></a>
---

### Set Up Webdriver and Helper Functions to Pull From Individual Webpages <a name="set_up_webdriver_function"></a>

In [107]:
def create_driver(a_url=None, incognito=False):
    """
    Opens an instance of Chrome using Selenium, and returns the driver.
    Requires that chromedriver is installed in Applications folder.

    Parameters
    ----------
    a_url: string, optional, default None
        Loads the URL, if provided.

    incognito: boolean, optional, default False
        Opens Chrome in incognito mode if True.
    """

    # Set options for webdriver
    option = webdriver.ChromeOptions()
    if incognito:
        option.add_argument("--incognito")

    # Create webdriver
    driver = webdriver.Chrome(
        executable_path='/Applications/chromedriver', options=option)

    # Opens URL if provided
    if a_url:
        driver.get(a_url)

    return driver


def soup_from_driver(a_driver):
    soup = BeautifulSoup(a_driver.page_source, 'html.parser')
    return soup


def clean_from_text(text):
    """
    Helper function to clean text.
    Add additional cleaning functions as necessary.
    """
    cleaned_text = text.strip()
    return cleaned_text


def open_xpath_with_driver(xpath, a_driver):
    """
    Helper function to use Selenium to click items
    on the webpage (e.g., Javascript). Item location
    is indicated by xpath. 
    """
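    # find_element(By.XPATH, ...) works in both Selenium 3 and 4;
    # find_element_by_xpath was removed in later Selenium 4 releases
    a_driver.find_element(By.XPATH, xpath).click()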

### Pull Listing Details From Individual Webpages <a name="pull_listing_details_function"></a>

In [108]:
def get_address(a_soup, a_dict, a_driver):
    """
    Pulls address information from webpage.

    Data is pulled from a_soup and added to all the previously
    scraped data in a_dict (data first stored in temporary dict).
    a_driver is used to get the URL of the page scraped 
    (URL is used as key for a_dict).
    """

    # Temporary dictionary to store address data
    address_dict = defaultdict(str)

    # Get addresses and save in address_dict
    for line in a_soup.find(itemprop='address').find_all('span'):
        if line.has_attr('itemprop'):
            address_dict[clean_from_text(
                line['itemprop'])] = clean_from_text(line.text)

    # Get latitude/longitude and save in address_dict
    for line in a_soup.find(itemprop='geo').find_all('meta'):
        if line.has_attr('itemprop'):
            address_dict[clean_from_text(line['itemprop'])] = float(
                line['content'])

    # Adds address information from address_dict into a_dict
    a_dict[a_driver.current_url] = {
        **a_dict[a_driver.current_url], **address_dict}


def get_key_details(a_soup, a_dict, a_driver):
    """
    Pulls main info table from webpage (underneath photos). This includes
    Public Details (Condo/Coop, Garage, etc.), County, Building,
    Community, and Year Built. 

    Data is pulled from a_soup and added to all the previously
    scraped data in a_dict (data first stored in temporary dict).
    a_driver is used to get the URL of the page scraped 
    (URL is used as key for a_dict).
    """
    key_details_dict = defaultdict(str)

    # Try statement used in case webpage does not have key data table
    try:
        for div in a_soup.find(class_='keyDetailsList').find_all('div'):
            key_details_dict[clean_from_text(div.contents[0].text)] = clean_from_text(
                div.contents[1].text)
    except Exception:
        # Key details table missing or malformed; keep whatever was parsed
        pass

    a_dict[a_driver.current_url] = {
        **a_dict[a_driver.current_url], **key_details_dict}


def get_listing_details(a_soup, a_dict, a_driver):
    """
    Pulls Property Details table from webpage. This includes Listing Information,
    such as, Listing Price, Target List Date, Property Type, # of 
    Bedrooms, etc.

    Data is pulled from a_soup and added to all the previously
    scraped data in a_dict (data first stored in temporary dict).
    a_driver is used to get the URL of the page scraped 
    (URL is used as key for a_dict).
    """
    list_details_dict = defaultdict(str)
    for super_group in a_soup.find_all(class_='super-group-content'):
        for data in super_group.find_all('span', {'class': 'entryItemContent'}):
            try:
                list_details_dict[clean_from_text(data.contents[0].string)] = \
                    clean_from_text(data.contents[1].text)
            except:
                list_details_dict[clean_from_text(
                    data.contents[0].string)] = True
    a_dict[a_driver.current_url] = {
        **a_dict[a_driver.current_url], **list_details_dict}


def get_home_facts(a_soup, a_dict, a_driver):
    """
    Pulls Home Facts table from webpage. This includes number of
    Beds, Baths, Finished Sq. Ft., Unfinished Sq. Ft., etc.

    Data is pulled from a_soup and added to all the previously
    scraped data in a_dict (data first stored in temporary dict).
    a_driver is used to get the URL of the page scraped 
    (URL is used as key for a_dict).
    """
    home_facts_dict = defaultdict(str)
    for row in a_soup.find(class_='facts-table'):
        home_facts_dict[clean_from_text(row.span.text)] = \
            clean_from_text(row.div.text)
    a_dict[a_driver.current_url] = {
        **a_dict[a_driver.current_url], **home_facts_dict}


def get_transit_scores(a_soup, a_dict, a_driver):
    """
    Pulls Transportation table from webpage. This includes 
    Walk Score, Transit Score, and Bike Score.

    Data is pulled from a_soup and added to all the previously
    scraped data in a_dict (data first stored in temporary dict).
    a_driver is used to get the URL of the page scraped 
    (URL is used as key for a_dict).
    """
    transit_scores_dict = defaultdict(str)

    # Try statement used in case transit score not available
    for score in a_soup.find_all(class_='score'):
        try:
            transit_scores_dict[clean_from_text(score.find(
                class_='label').text)] = \
                    clean_from_text(score.find(class_='percentage').text)
        except:
            None
    a_dict[a_driver.current_url] = {
        **a_dict[a_driver.current_url], **transit_scores_dict}


def get_recent_area_offer_data(a_soup, a_dict, a_driver):
    """
    Pulls Real Estate Sales (Last 30 days) table from webpage. 
    This includes Median List Price, Median $/Sq.Ft., etc.

    Data is pulled from a_soup and added to all the previously
    scraped data in a_dict (data first stored in temporary dict).
    a_driver is used to get the URL of the page scraped 
    (URL is used as key for a_dict).
    """
    recent_area_offer_data_dict = defaultdict(str)
    recent_area_offer_data_dict['current area'] = clean_from_text(
        a_soup.find(class_='OfferInsights').find('a').text)
    for td in a_soup.find(class_='OfferInsights').find(class_='basic-table').tbody.find_all('td'):
        recent_area_offer_data_dict['Area Current ' + clean_from_text(td.find(class_='field').text)] =\
            clean_from_text(td.find(class_='value').text)
    a_dict[a_driver.current_url] = {
        **a_dict[a_driver.current_url], **recent_area_offer_data_dict}


def get_schools(a_dict, a_driver):
    """
    Pulls Schools and Great School Ratings from webpage. 
    a_driver used to click through Elementary,
    Middle, and High school tabs.

    A fresh soup is created from a_driver after each tab is clicked.
    One entry per school is appended to the lists in a_dict
    (school name, rating, school type, and URL of the page scraped).
    """
    xpaths = ['//*[@id="schools-scroll"]/div/div[1]/div/div[1]/div[1]/button', '//*[@id="schools-scroll"]/div/div[1]/div/div[1]/div[2]/button',
              '//*[@id="schools-scroll"]/div/div[1]/div/div[1]/div[3]/button', '//*[@id="schools-scroll"]/div/div[1]/div/div[1]/div[4]/button']
    types = ['serving this home', 'elementary', 'middle', 'high']

    # Try statements used in case not all tabs or data available
    for i, xpath in enumerate(xpaths):
        try:
            open_xpath_with_driver(xpath, a_driver)
        except Exception:
            continue  # Tab not present on this page; try the next one
        a_soup = soup_from_driver(a_driver)
        for td in a_soup.find(class_='schools-content').find('table').tbody.find_all('td'):
            try:
                # Parse every field first so the lists stay aligned
                # if any field is missing
                name = clean_from_text(td.find(class_='school-name').text)
                rating_list = td.find(class_='gs-rating-row').text.split(":")
                rating = clean_from_text(rating_list[1])
            except Exception:
                continue
            a_dict['school name'].append(name)
            a_dict['great schools rating'].append(rating)
            a_dict['type'].append(types[i])
            a_dict['url'].append(a_driver.current_url)


def get_price_history(a_soup, a_dict, a_driver):
    """
    Pulls Price History from webpage. 

    Data is pulled from a_soup, and one entry per price history row
    is appended to the lists in a_dict (date, event, source, price).
    a_driver is used to get the URL of the page scraped 
    (URL is stored with each row).
    """

    # Try statement used in case data not available
    for row in a_soup.find(id='property-history-transition-node').table.tbody.find_all('tr'):
        try:
            # Parse every field first so the lists stay aligned
            # if any field is missing from the row
            date = clean_from_text(row.find(class_='date-col').text)
            event = clean_from_text(
                row.find(class_='event-col').findChildren()[0].text)
            source = clean_from_text(row.find(class_='source-info').text)
            price = clean_from_text(row.find(class_='price-col').text)
        except Exception:
            continue
        a_dict['date'].append(date)
        a_dict['event'].append(event)
        a_dict['source'].append(source)
        a_dict['price'].append(price)
        a_dict['url'].append(a_driver.current_url)

### Add URLs from Similar Houses to Dictionary of All URLS to Pull <a name="add_to_urls_function"></a>

In [109]:
def get_new_urls(a_soup, a_driver, a_urls_dict):
    """
    Pulls URLs of similar homes from webpage to add to the 
    dictionary of URLs to scrape.

    URLs are pulled from a_soup and added to a_urls_dict as
    unscraped (False). URLs already in a_urls_dict keep their
    current value. a_driver is not used.
    """
    for row in a_soup.find_all(class_='SimilarHomeCardReact'):
        new_url = "https://www.redfin.com" + row.find('a', href=True)['href']
        # setdefault adds the URL as unscraped without overwriting
        # the flag of a URL that was already scraped
        a_urls_dict.setdefault(new_url, False)

### Pull Data From List of URLs <a name="pull_data_function"></a>

In [110]:
def start_pulling_data(urls_dict, num_urls_to_visit):
    """
    Provided a dictionary of URLs to visit (urls_dict) and
    the number of URLs to visit, this function will log in to
    Redfin (account needed for full price history), create empty
    dictionaries to store data, and download data using get_data.
    """

    try:
        # Create dictionaries to store scraped data
        house_data_dict = defaultdict(dict)
        price_history_dict = defaultdict(list)
        schools_dict = defaultdict(list)

        # Filter for unscraped URLS
        all_urls = list(filter(lambda x: not urls_dict[x], urls_dict.keys()))

        # Pull only DC data
        all_urls = [url for url in all_urls if '.com/DC/' in url]

        # Make sure number of URLs to visit not larger than total list length
        if num_urls_to_visit < len(all_urls):
            all_urls = all_urls[:num_urls_to_visit]

        # Create driver and log-in
        driver = create_driver(all_urls[0])
        time.sleep(2)  # Time for page to load
        open_xpath_with_driver(
            "//button[@class='button Button compact tertiary-alt']", driver)
        time.sleep(1)  # Time for log-in to load
        open_xpath_with_driver(
            "//button[contains(@class, 'emailSignInButton')]", driver)
        time.sleep(1)  # Time for email log-in to load
        driver.find_element(
            By.XPATH, "//*[@name='emailInput']").send_keys('XXXXXXXXXXXXXXXX')
        driver.find_element(
            By.XPATH, '//*[@name="passwordInput"]').send_keys('XXXXXXXXXXXXXXXX')
        open_xpath_with_driver(
            "//*[contains(concat(' ', @class, ' '), ' button Button primary submitButton ')]", driver)
        time.sleep(3)  # Ensure enough time for webpage to load

        # Start pulling from each webpage now that you are logged in
        for url in all_urls:
            print(url)
            get_data(url, house_data_dict, price_history_dict,
                     schools_dict, driver, urls_dict)
            urls_dict[url] = True

        return house_data_dict, price_history_dict, schools_dict, driver, urls_dict
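    except Exception as e:
        # Report the failure; the function then returns None as before
        print("Scraping stopped early: {}".format(e))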


def get_data(starting_url, house_data_dict, price_history_dict, schools_dict, driver, urls_dict):
    """
    Uses Selenium to open the full price history, then calls 
    individual functions to pull data from tables on the webpage.
    """
    try:
        driver.get(starting_url)

        # Time for full webpage to load
        delay = 5
        time.sleep(delay)

        # Expands price history
        history_xpath = '//*[@id="propertyHistory-expandable-segment"]/div[2]/div/span'
        try:
            myElem = WebDriverWait(driver, delay).until(
                EC.element_to_be_clickable((By.XPATH, history_xpath)))
            myElem.click()
        except TimeoutException:
            print("Loading took too much time!")

        # Pulls information from each data table
        try:
            soup = soup_from_driver(driver)
            get_address(soup, house_data_dict, driver)
            get_key_details(soup, house_data_dict, driver)
            get_listing_details(soup, house_data_dict, driver)
            get_home_facts(soup, house_data_dict, driver)
            get_transit_scores(soup, house_data_dict, driver)
            get_recent_area_offer_data(soup, house_data_dict, driver)
            get_schools(schools_dict, driver)
            get_price_history(soup, price_history_dict, driver)
            get_new_urls(soup, driver, urls_dict)
        except:
            None
    except:
        None

## 2: Extract Data  <a name="extract_data"></a>
---

### A. Open Existing Dictionary of URLs to Scrape Or Create New URLs Dictionary<a name="open_urls_dict"></a>

In [101]:
###########################################################
# Open existing dict of URLs if available, otherwise,
# create a new dict of URLs from an export of listings from
# Redfin (search page export).
#
# The dictionary keys are URLs, and values are booleans
# indicating whether or not the links have been scraped.
###########################################################

try:
    # Change this file to the latest urls_dict file (in scraped_data
    # folder) if scraping in multiple batches
    with open("initial_data/urls_dict.pkl", 'rb') as file:
        old_urls_dict = pickle.load(file)
except Exception:
    old_urls_dict = defaultdict(bool)
    new_urls_df = pd.read_csv('initial_data/redfin_search1.csv')
    for url in new_urls_df['URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)']:
        # Accessing a missing key on a defaultdict(bool) adds it as False
        old_urls_dict[url]

### B. Add more URLs from Redfin Searches (Optional)<a name="add_urls"></a>

In [102]:
###########################################################
# Add URLs to old_urls_dict from an export of listings from
# Redfin (search page export).
#
# The dictionary keys are URLs, and values are booleans
# indicating whether or not the links have been scraped.
###########################################################

try:
    new_urls_df = pd.read_csv('initial_data/redfin_search2.csv')
    for url in new_urls_df['URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)']:
        old_urls_dict[url]
except:
    None

### C. View How Many Links Are Unscraped in the List of URLs<a name="view_urls"></a>

In [103]:
view_new_urls = [url for url in old_urls_dict if not old_urls_dict[url]]
view_new_dc_urls = [url for url in view_new_urls if '.com/DC/' in url]
print("There are {} unscraped URLs.".format(len(view_new_urls)))
print("{} of the unscraped URLs are in DC.".format(len(view_new_dc_urls)))

There are 706 unscraped URLs.
706 of the unscraped URLs are in DC.


### D. Start Scraping and Saving Data<a name="pull_data"></a>

In [106]:
###########################################################
# Creates Chrome instance to scrape
# num_sites_to_scrape_per_instance sites per Chrome instance.
# This is to batch scraping in case you need to close your
# computer or pause the operation.
#
# Data is pickled in separate files ending with the number
# of the scraping instance. In the data cleaning file,
# all of the pickled files will be concatenated together.
###########################################################
starting_instance_num = 0
ending_instance_num = 1
num_sites_to_scrape_per_instance = 10


for i in range(starting_instance_num, ending_instance_num):
    try:
        house_data_dict, price_history_dict, schools_dict, driver, urls_dict = \
            start_pulling_data(old_urls_dict, num_sites_to_scrape_per_instance)

        # Pickle each dictionary in its own file, ending with the
        # number of the scraping instance
        to_save = {'house_data_dict': house_data_dict,
                   'price_history_dict': price_history_dict,
                   'schools_dict': schools_dict,
                   'urls_dict': urls_dict}
        for name, data in to_save.items():
            with open("scraped_data/{}{}.pkl".format(name, i), "wb") as filehandler:
                pickle.dump(data, filehandler)

        old_urls_dict = urls_dict
    except:
        None

http://www.redfin.com/DC/Washington/700-New-Hampshire-Ave-NW-20037/unit-1417/home/40395253
http://www.redfin.com/DC/Washington/913-25th-St-NW-20037/home/9043791
http://www.redfin.com/DC/Washington/730-24th-St-NW-20037/unit-605/home/40447764
http://www.redfin.com/DC/Washington/2522-I-St-NW-20037/home/9043021
http://www.redfin.com/DC/Washington/2700-Virginia-Ave-NW-20037/unit-605/home/12534758
Loading took too much time!
http://www.redfin.com/DC/Washington/2401-H-St-NW-20037/unit-814/home/9046921
http://www.redfin.com/DC/Washington/730-24th-St-NW-20037/unit-503/home/143291059
http://www.redfin.com/DC/Washington/730-24th-St-NW-20037/unit-511/home/18444824
http://www.redfin.com/DC/Washington/700-New-Hampshire-Ave-NW-20037/unit-1114/home/12059612
http://www.redfin.com/DC/Washington/922-24th-St-NW-20037/unit-719/home/12534370
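
The per-instance pickle files are meant to be concatenated in the data cleaning file, which is not shown here. One possible way to combine them is sketched below. The helpers `merge_keyed`, `merge_columns`, and `load_batches` are hypothetical, the demo data is made up, and the sketch assumes the file naming used above (a prefix followed by the instance number):

```python
import os
import pickle
import tempfile
from collections import defaultdict


def merge_keyed(batches):
    """Merge dicts keyed by URL (the house_data_dict shape); later batches win."""
    merged = {}
    for batch in batches:
        merged.update(batch)
    return merged


def merge_columns(batches):
    """Concatenate dicts of lists (the price_history_dict and schools_dict shape)."""
    merged = defaultdict(list)
    for batch in batches:
        for column, values in batch.items():
            merged[column].extend(values)
    return dict(merged)


def load_batches(folder, prefix, count):
    """Load prefix0.pkl through prefix{count-1}.pkl from folder."""
    loaded = []
    for i in range(count):
        with open(os.path.join(folder, "{}{}.pkl".format(prefix, i)), "rb") as f:
            loaded.append(pickle.load(f))
    return loaded


# Demo with made-up data written to a temporary folder
demo = [{"date": ["Jan 1"], "price": ["$1"]},
        {"date": ["Feb 2"], "price": ["$2"]}]
with tempfile.TemporaryDirectory() as tmp:
    for i, batch in enumerate(demo):
        path = os.path.join(tmp, "price_history_dict{}.pkl".format(i))
        with open(path, "wb") as f:
            pickle.dump(batch, f)
    history = merge_columns(load_batches(tmp, "price_history_dict", 2))

houses = merge_keyed([{"u1": {"Beds": "2"}}, {"u2": {"Beds": "3"}}])
print(history)
print(houses)
```

The merged column dictionaries can then be passed to `pd.DataFrame` as long as every list in them has the same length.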
