This project is to build a web scraper that can:
* scrape all search results from real estate websites and build a databasee
* perform exploratory data analysis, and find valued propertities

The website I will be scraping is zillow(https://www.zillow.com/).

**Agenda**:
* preparing for the packages
* requests html pages of the data
* use beautiful soup to parse information
* extract specific house information
* get each house link as house_id 
* scrape house details

**Preparing for the packages**

In [328]:
from bs4 import BeautifulSoup
from requests import get
import re
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

**Requests html pages of the data**

In [424]:
url = "https://www.zillow.com/homes/for_sale/27560_rb/2_p"
headers = ({'User-Agent':
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3607.0 Safari/537.36'})
response = get(url, headers = headers)

**Use beautiful soup to parse information**

In [425]:
html_soup = BeautifulSoup(response.text, 'html.parser')

In [426]:
facts = html_soup.find('div', class_='zsg-separator')
house_containers = html_soup.find_all('div', class_='zsg-photo-card-content zsg-aspect-ratio-content')

In [421]:
pattern = 'for Sale:[0-9]+'
line = re.search(pattern, facts.text).group()
ind = line.index(':')
total = int(line[ind+1:])
total

76

In [430]:
first = house_containers[4]
first.find_all('span')

[<span class="hide" itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress"><span itemprop="streetAddress">104 Carter Grove Ct</span><span itemprop="addressLocality"> MORRISVILLE </span><span itemprop="addressRegion">NC </span><span class="hide" itemprop="postalCode">27560</span></span>,
 <span itemprop="streetAddress">104 Carter Grove Ct</span>,
 <span itemprop="addressLocality"> MORRISVILLE </span>,
 <span itemprop="addressRegion">NC </span>,
 <span class="hide" itemprop="postalCode">27560</span>,
 <span itemprop="geo" itemscope="" itemtype="http://schema.org/GeoCoordinates"><meta content="35.8365" itemprop="latitude"/><meta content="-78.8686" itemprop="longitude"/></span>,
 <span class="zsg-photo-card-status"><span class="zsg-icon-for-sale"></span>House for sale</span>,
 <span class="zsg-icon-for-sale"></span>,
 <span class="zsg-photo-card-price">$475,000</span>,
 <span class="zsg-photo-card-info">4 bds <span class="interpunct">·</span> 3 ba <span class="interpunc

**Extract specific house information**

extract house price

In [428]:
var_1 = first.find_all('span')[8].text
var_1

'$314,200+'

extrace house location

In [413]:
location = first.find_all('p')[1].text
location

'107 Gratiot Dr, Morrisville, NC'

extract house size

In [414]:
size = first.find_all('span')[9].text
size.split(' · ')

['4 bds', '3 ba', '2,060 sqft']

get all the links

In [107]:
links = []
l = len(house_containers)
for i in range(l):
    for url in house_containers[i].find_all('a'):
        link = url.get('href')
        if link.startswith('/myzillow') is False:
            link = 'https://www.zillow.com' + link
            links.append(link)        

**Get each house link as house_id**

In [448]:
from bs4 import BeautifulSoup
from requests import get
import time, random 
import re
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
#scrape house information from main page
n_page=1
sum_house=0
prices = []
beds = []
baths = []
sqfts = []
house_ids = []
total = 1000
i = 0

def get_house_soup():
    response = get(url, headers = headers)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    return html_soup

def get_house_containers(html_soup):
    house_containers = html_soup.find_all('div', class_="zsg-photo-card-content zsg-aspect-ratio-content")
    return house_containers

def get_num_houses_per_page(house_containers):
    num_house_page = len(house_containers)
    print("len is " + str(num_house_page))
    return num_house_page

def get_total_houses(html_soup):
    facts = html_soup.find('div', class_='zsg-separator')
    pattern = 'for Sale:[0-9]+'
    line = re.search(pattern, facts.text).group()
    ind = line.index(':')
    total = int(line[ind+1:])
    return total

def get_individial_house(house_containers):
    for i in range(len(house_containers)):        
        for instance in house_containers[i].find_all('a'):
            house_id = instance.get('href')
            if house_id.startswith('/myzillow') is False:
                house_id = 'https://www.zillow.com' + house_id
                house_ids.append(house_id)    

while sum_house < total:
    url = "https://www.zillow.com/homes/for_sale/27560_rb/" + str(n_page+i) + "_p"

    headers = ({'User-Agent':
                    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3607.0 Safari/537.36'})
    soup = get_house_soup()
    house_containers = get_house_containers(soup)
    sum_house = sum_house + get_num_houses_per_page(house_containers)
    total = get_total_houses(soup)
    get_individial_house(house_containers)
    i = i + 1
    time.sleep(5)

len is 25
len is 25
len is 25
len is 8


**Scrape house details**

In [585]:
url = "https://www.zillow.com/homedetails/1713-Legendary-Ln-Morrisville-NC-27560/79883305_zpid/"
headers = ({'User-Agent':
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3607.0 Safari/537.36'})
response = get(url, headers = headers)

html_soup = BeautifulSoup(response.text, 'html.parser')
summary = html_soup.find('div', class_="home-details-summary-and-price")
summary = summary.find_all('span')
house_infos = []
for i in range(len(summary)):
    info = summary[i].text
    print(info)
    house_infos.append(info)

 
3 beds
 
2.5 baths
 
2,360 sqft

$369,900


In [543]:
for info in infos:
    if "$" in info:
        sqft = info
    else:
        continue
sqft

'$354,900'

In [586]:
schools = html_soup.find('div', class_="hdp-nearby-schools")
sch_summary = schools.find_all('span')
l = len(sch_summary)
sch_infos = []

for i in range(l-1):
    info = sch_summary[i].text
    if info.isdigit():
        sch_infos.append(info)
sch_infos

['9', '6', '6']

The completed code is saved as zillow_scraper.py