# Browser Automation Homework
Due 7-23<br>
Completed by: **TK YOUR NAME**

We're going to visit the real estate site Zillow.com and search "For sale" listings in a town of your choosing.

We'll collect the listings in the first 5 pages (or all if you like), and get a feel for the price range in that town.

Ultimately, I want to know the median listing price in that town.

Note: if you are asked whether you're a bot, just complete the challenge manually.

In [1]:
import os
import random
import time

from playwright.async_api import async_playwright, expect

In [2]:
os.makedirs('data/', exist_ok=True)

### 1) Open the browser, hide automation signs, visit Zillow.com

In [3]:
async def open_browser(headless=False):
    """
    Starts the automated browser and opens a new window
    """
    # Start playwright
    playwright = await TK.TK()

    # Open firefox browser, can use chromium (chrome) or others
    browser = await playwright.TK
  
    # Create a new browser window
    page = await browser.TK()

    return browser, page

In [4]:
driver, page = await open_browser()

In [5]:
url = 'https://zillow.com'
await page.TK(url)

<Response url='https://www.zillow.com/' request=<Request url='https://www.zillow.com/' method='GET'>>

### 2) Find Search Box

Use Playwright's [locator](https://playwright.dev/docs/locators) functions to find the search box.

In [6]:
search_box = page.TK()
search_box

<Locator frame=<Frame name= url='https://www.zillow.com/'> selector='//input[@aria-label="Search"]'>

### 3) Input a geography into search bar

After you've found `search_box`, find a way to type or send `search_term` into the input field.

Feel free to change `search_term` to wherever you like.

In [7]:
search_term = 'Worcester, MA'
await search_box.TK

### 4) Make the search

Originally, I thought we could get away with just pressing "ENTER". If you try that, you'll see that the listings are not from the geography you searched for.

Instead, you'll see a list of suggestions. Click the first suggestion.

You can do that by first finding that suggestion (either the first element, or listing all elements then getting the first), then [clicking](https://playwright.dev/docs/input#mouse-click) on it.

In [8]:
xpath_1st_opt = 'TK'
await expect(TK).to_be_TK()

In [9]:
options = await page.TK().all()

In [10]:
first_option = TK

In [11]:
await first_option.TK()

### 5) Pick "For sale," if asked
You might be prompted to choose between rentals and sales. This prompt doesn't always show up, but be prepared to click "For sale" if it does. Use the element's `is_visible()` method to decide whether to click.

In [12]:
for_sale = page.TK()
# check if visible
if await TK:
    # click the element...
    await TK

### 6) What are the prices of the houses on the first page?
Print the `text_content()` of each listing's property price below:

In [13]:
for card in await page.TK().all():
    print(await card.TK())

Note: you _should_ see more than nine listings.

You'll need to find a way to scroll down the page to load each new card. From my tests, each page holds up to 40.

This is not a simple task! I found one way to do it below. Can you find a better one?

In [14]:
# use `await asyncio.sleep(1)` instead of `time.sleep(1)`.
import asyncio

In [15]:
N = 0
while True:
    # get all the listings, and scroll to the last one, then wait two seconds.
    cards = await page.locator('//span[@data-test="property-card-price"]').all()
    last_listing = cards[-1]
    
    # you can use playwright to issue JavaScript commands:
    await last_listing.evaluate("elm => elm.scrollIntoView();")
    N_cards = len(cards)
    if N_cards == N:
        break
    N = N_cards
    await asyncio.sleep(2)

In [16]:
# how many postings do we have after loading them all?
len(cards)

41

Is there a better way to do this? Feel free to experiment, but it's not necessary for the assignment.
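If you do experiment, one option is to pull the "scroll until nothing new loads" logic into a reusable helper. The sketch below is only one possible shape, not the expected answer. `scroll_until_stable`, `count_items`, and `scroll_once` are hypothetical names: the last two are async callables you would write yourself, and they are not part of Playwright. Unlike the loop above, it also copes with a page that has no listings at all, and it gives up after `max_rounds` attempts.

```python
import asyncio


async def scroll_until_stable(count_items, scroll_once, pause=2.0, max_rounds=50):
    """
    Calls `scroll_once` repeatedly until `count_items` stops growing.

    `count_items` and `scroll_once` are hypothetical async callables
    that you supply; they are not Playwright functions.
    """
    seen = 0
    for _ in range(max_rounds):
        n_items = await count_items()
        if n_items == seen:
            # nothing new appeared since the last scroll, so stop
            break
        seen = n_items
        await scroll_once()
        # give the page time to load the next batch of cards
        await asyncio.sleep(pause)
    return seen
```

In the notebook, `count_items` could wrap `page.locator(...).count()` and `scroll_once` could wrap the `scrollIntoView()` call from the cell above, or `page.mouse.wheel(0, 2000)`. You would then run it with `await scroll_until_stable(count_items, scroll_once)`.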

### 7) Save the results as HTML
Save the page source to `html_out` as an HTML file

In [17]:
html_out = 'data/zillow_playwright_test.html'

In [18]:
# how to save what the emulator sees
TK

### 8) Go to the next page
After collecting the first page, go to the next one by clicking the "Next page" button.

In [19]:
next_page = page.TK()

In [20]:
await next_page.TK()

In [21]:
# can also do it this way!
# page.get_by_title('Next page').click()

### 9) Cycle through each page of results
Above we outlined each step. Now put it all together here and collect as many results as you can. Add an `await asyncio.sleep(2)` (or some other reasonable pause) between each step.

You can stop after the 5th page to save time.

Note: you can parse the prices from the listings directly with Playwright here, or save each page as HTML and parse the files after you collect them. I recommend the latter, but for the sake of the homework feel free to take the shortcut.
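For the pauses between steps, a fixed delay works, but you may prefer to vary it a little so that requests are not perfectly regular. Here is a minimal sketch, assuming a hypothetical helper named `polite_pause` (it is not part of Playwright or `asyncio`):

```python
import asyncio
import random


async def polite_pause(base=2.0, jitter=1.0):
    """
    Sleeps for `base` seconds plus a random extra of up to `jitter` seconds.
    Returns the delay that was used, which is handy for logging.
    """
    delay = base + random.uniform(0, jitter)
    await asyncio.sleep(delay)
    return delay
```

In a notebook cell you would call it as `await polite_pause()` wherever the outline uses `await asyncio.sleep(...)`.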

In [22]:
# first close the browser to start anew
await driver.close()

In [23]:
async def get_results_on_page(page, fn_out):
    """
    Scrolls to load all listings and then saves them to `fn_out`.
    """
    N = 0
    while True:
        # get all the listings, and scroll to the last one, then wait two seconds.
        cards = await page.locator('//span[@data-test="property-card-price"]').all()
        last_listing = cards[-1]

        # you can use playwright to issue JavaScript commands:
        await last_listing.evaluate("elm => elm.scrollIntoView();")
        N_cards = len(cards)
        if N_cards == N:
            break
        N = N_cards
        await asyncio.sleep(2)
        
    # how to save what the emulator sees
    source = await page.content()
    with open(fn_out, 'w', encoding='utf-8') as f:
        f.write(source)

In [24]:
await driver.close()

In [25]:
search_term = 'Beacon, NY'
# search_term = 'worcester, MA'

driver, page = await open_browser()
await page.goto(url)

# find the search box
print("finding search box")
xpath_search = "TK"

# select the first suggestion
print("selecting first option")
xpath_1st_opt = 'TK'
TK
await asyncio.sleep(2)

# select only for sale listings...
print("Press for sale")
xpath_for_sale = 'TK'
for_sale = page.locator(xpath_for_sale)
if await for_sale.TK_CHECK_IF_VISIBLE:
    # TK click it
    await asyncio.sleep(1)
    
# save each page of results
xpath_next = 'TK'
page_n = 0
await expect(page.locator(xpath_next).first).to_be_visible()
while True:
    fn_out = f'data/zillow_page_{page_n}.html'
    print(f"Getting results for page {page_n}")
    await get_results_on_page(page, fn_out)
    page_n += 1
     
    # stop after 10 pages
    if page_n == 10:
        print("Done")
        break
    await asyncio.sleep(1)

    # see if there are more pages of results
    next_page = page.locator(xpath_next)
    if await next_page.is_visible():
        await next_page.click()
    else:
        print("Done")
        break
    await asyncio.sleep(1)

finding search box
selecting first option
Press for sale
Getting results for page 0
Getting results for page 1
Getting results for page 2
Getting results for page 3


### 10) Parse the prices

Parse the prices into a list or a Pandas Series, and list the median price.
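Once you have pulled the price text out of each saved page, you still need to turn strings into numbers before you can take a median. The helper below is a sketch of that cleaning step only, and it assumes prices look like `$515,000`. `parse_price` is a hypothetical name, and it does not handle abbreviated prices such as `$1.2M`.

```python
import re


def parse_price(text):
    """
    Converts a price string such as '$515,000' into a float.
    Returns None when the text holds no digits (for example 'Contact agent').
    """
    if text is None:
        return None
    # first run of digits, allowing thousands separators and a decimal part
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    return float(match.group().replace(',', ''))
```

Dropping the `None` results before building the DataFrame keeps unparseable cards from skewing the median.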

In [26]:
import glob
import pandas as pd
from lxml import etree

In [27]:
files = glob.glob('data/zillow_page_*.html')
len(files)

4

In [28]:
prices = []
for fn in files:
    # parse the prices from each file and add them to the `prices` list
    TK

In [29]:
df = pd.DataFrame({'prices': prices})
df.prices.describe()

count    1.410000e+02
mean     5.363063e+05
std      3.193751e+05
min      5.500000e+04
25%      3.499000e+05
50%      5.150000e+05
75%      6.950000e+05
max      2.400000e+06
Name: prices, dtype: float64

In [30]:
len(df)

141

In [31]:
df.prices.median()

515000.0

## Extra credit
- What is the median price per square foot?
- Which realtor has the most listings?
- Can you highlight listings over $1M in red and take a full-page screenshot?
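For the first two questions, the arithmetic is simple once each listing is a record with a price, a square footage, and a realtor. The sketch below uses only the standard library. The rows in `listings` are made up for illustration, and the key names (`price`, `sqft`, `realtor`) and both function names are assumptions; how you extract those fields from the HTML is up to you.

```python
from collections import Counter
from statistics import median

# made-up example rows; real rows would come from your parsed HTML
listings = [
    {"price": 300000.0, "sqft": 1500.0, "realtor": "Agency A"},
    {"price": 515000.0, "sqft": 2000.0, "realtor": "Agency B"},
    {"price": 450000.0, "sqft": None, "realtor": "Agency A"},
]


def median_price_per_sqft(rows):
    """Median of price / sqft, skipping rows where either value is missing."""
    ratios = [r["price"] / r["sqft"] for r in rows if r["price"] and r["sqft"]]
    return median(ratios) if ratios else None


def top_realtor(rows):
    """Returns (realtor, number_of_listings) for the most frequent realtor."""
    counts = Counter(r["realtor"] for r in rows if r["realtor"])
    return counts.most_common(1)[0] if counts else None
```

The same calculations are one-liners in pandas if you load the records into a DataFrame.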

In [32]:
await driver.close()