# Redfin Real Estate Listing Scraper

## Team Members
- Zhixi Lin (zl536)  
- Seth Coward (sac484)  
- Maryam Nasralla (mn688)  
- Andrew Stevens (aes464)
- Jason Chew (jc4768)

## Project Overview
Our project is a tool that lets users enter a zip code and then search, filter, and save real estate listings from Redfin. Instead of paging through the website and copying listings into a spreadsheet or onto paper, users can filter for the data they want and save it as CSV or JSON for later use. We built the tool for home buyers and real estate professionals, because Redfin doesn't offer a public API for gathering this kind of data.

The tool first asks the user for the zip code to search. It then requests the Redfin results page for that zip code, reads how many pages of results there are, and loads the URL of each page into a queue. Working through that queue, it parses each listing on every page, skips listings that are missing required information, and stores the rest in a list. Next, it asks the user for optional filters: minimum and maximum price, number of bedrooms, number of bathrooms, and square footage. Finally, it prints the matching listings and asks whether to save them as CSV or JSON. If the user picks one of those formats, the results are written to a file in that format.

## 1. Install Required Packages for the Tool
This cell runs the pip commands that install "BeautifulSoup" for easier HTML parsing and "requests" for making HTTP requests.

In [1]:
# Required packages
!pip install BeautifulSoup4 --break-system-packages
!pip install requests --break-system-packages


## 2. Import Modules/Packages
These are the packages the script needs. We use "requests" to make HTTP requests to the Redfin website, "BeautifulSoup" and "re" to parse HTML and match regular expressions, "time" to pace the requests sent to the Redfin website, and "csv" and "json" to write the output files.

In [2]:
import requests
from bs4 import BeautifulSoup
import re
import time
import csv
import json

## 3. Set Up Web Session
To be able to communicate with and read the Redfin website, we set up a web session using the "requests" package with the proper headers needed to load the pages. This is set up globally so that any of our functions can use it.

In [3]:
# Headers to make the request look like a browser request
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36"
}

# Create a web session
web_session = requests.Session()

# Add the browser headers
web_session.headers.update(headers)

## 4. Define Queue Class and Function
Define the Queue class that the later functions use to hold the page URLs waiting to be scanned.

In [4]:
class Queue:
    # sac484
    # This class is for implementing a queue abstract data type as a data structure
    # It contains methods:
    #     is_empty(): Checks if the queue is empty and returns a boolean accordingly
    #     enqueue(item): Adds an item to the back of the queue
    #     dequeue(): Removes and returns the item from the front of the queue
    #     size(): Returns the size of the queue

    # Constructor for queue
    def __init__(self):
        self.items = []

    # Checks if the queue is empty
    def is_empty(self):
        return len(self.items) == 0

    # Adds an item to the back of the queue
    def enqueue(self, item):
        self.items.insert(0, item)

    # Removes and returns an item from the front of the queue
    def dequeue(self):
        return self.items.pop()

    # Returns the current size of the queue
    def size(self):
        return len(self.items)
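
The list-based Queue above inserts at index 0, which shifts every stored element on each enqueue. As a possible alternative (not what this notebook uses), here is a minimal sketch that keeps the same four methods but stores items in `collections.deque` from the standard library. The class name `DequeQueue` and the sample page names are hypothetical.

```python
from collections import deque


class DequeQueue:
    """FIFO queue with the same interface as Queue, backed by collections.deque."""

    def __init__(self):
        self.items = deque()

    def is_empty(self):
        return len(self.items) == 0

    def enqueue(self, item):
        # Append to the right end (the back of the queue)
        self.items.append(item)

    def dequeue(self):
        # Remove from the left end (the front of the queue)
        return self.items.popleft()

    def size(self):
        return len(self.items)


# Items come out in the order they went in
q = DequeQueue()
for url in ["page-1", "page-2", "page-3"]:
    q.enqueue(url)
first = q.dequeue()
```

For the handful of result pages a single zip code produces, either implementation should behave the same; the difference only matters for much longer queues.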

## 5. Define get_all_page_urls Function
Prepare the get_all_page_urls function, which builds a queue of page URLs for a given zip code to be scanned and parsed later. It checks that the response has a 200 status code and that the element holding the page count is present and can be parsed. If any of these checks fails, it returns an empty queue.

In [5]:
def get_all_page_urls(zip_code):
    # sac484
    # This function returns a queue of strings that includes all URLs for
    # pages that have real estate listings in a given zip code
    #
    # Parameters:
    # zip_code (string): the zip code to be searched in
    #
    # Returns:
    # A queue of zero or more URLs to be searched through

    # Create an empty queue
    url_queue = Queue()

    # Define a variable for the base Redfin URL for the zip code
    base_url = f"https://www.redfin.com/zipcode/{zip_code}/"

    # Define regex used to parse the max number of pages of listings
    page_regex = re.compile(r"Viewing page \d+ of (\d+)")

    # Get the content of the first page of listings
    request = web_session.get(base_url)
    # Return the empty queue if there was an issue with the page
    if request.status_code != 200:
        return url_queue

    # Create a BeautifulSoup HTML object out of the page content
    html_body = request.text
    page = BeautifulSoup(html_body, "html.parser")

    # Parse through the page content and find the text listing the max number of pages
    element = page.find("span", attrs={"data-rf-test-name": "download-and-save-page-number-text"})
    # Return the empty queue if there was an issue getting the required text
    if element is None:
        return url_queue

    # Parse the max number of pages from the element text
    regex_result = page_regex.search(element.get_text())
    # Return the empty queue if there was a parsing issue
    if regex_result is None:
        return url_queue

    # Add the base URL to the queue to be searched
    url_queue.enqueue(base_url)

    # Create an integer variable out of the captured max number of pages
    max_page = int(regex_result.group(1))

    # If there is more than one page, add each subsequent URL after the first page to the queue
    if max_page > 1:
        for i in range(2, max_page + 1):
            url_queue.enqueue(f"{base_url}page-{i}")

    # Return the URL queue
    return url_queue
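
To see the URL-building step in isolation, here is a small sketch that runs the same regular expression against a sample pager string and returns a plain list instead of a queue. The helper name `build_page_urls` and the sample text "Viewing page 1 of 3" are assumptions for illustration; no request is sent.

```python
import re

# Same pattern as in get_all_page_urls, written as a raw string
PAGE_REGEX = re.compile(r"Viewing page \d+ of (\d+)")


def build_page_urls(base_url, pager_text):
    """Return the page URLs implied by the pager text, or [] if it doesn't parse."""
    match = PAGE_REGEX.search(pager_text)
    if match is None:
        return []
    max_page = int(match.group(1))
    # Page 1 is the base URL; later pages append "page-N"
    return [base_url] + [f"{base_url}page-{i}" for i in range(2, max_page + 1)]


urls = build_page_urls("https://www.redfin.com/zipcode/19104/", "Viewing page 1 of 3")
```

With this input, `urls` holds the base URL followed by the `page-2` and `page-3` URLs, which is the order in which get_all_page_urls enqueues them.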

## 6. Define get_all_listings Function
Prepare the get_all_listings function that will utilize a queue of URLs, crawl each page, and parse the required data for each real estate listing. Checks are put in place here to exclude listings that are missing certain required information like how much they cost, how many bedrooms or bathrooms they have, etc.

In [6]:
def get_all_listings(zip_code):
    # sac484
    # This function returns a list of dictionaries that each contain data
    # about the real estate listings in the given zip code
    #
    # Parameters:
    # zip_code (string): the zip code to be searched in
    #
    # Returns:
    # A list of dictionaries, each with information about the listing including
    # the listing_url, price, address, num_beds, num_baths, square_footage, and contact_info

    # Get a queue of URLs to be searched for the zip code
    page_url_queue = get_all_page_urls(zip_code)

    # Create an empty variable for the resulting listings
    all_listings = []

    # While there are still URLs to parse
    while not page_url_queue.is_empty():
        # Get the next URL
        url = page_url_queue.dequeue()
        # Get its HTML body content
        html_body = web_session.get(url).text
        # Create a BeautifulSoup object out of the HTML body content
        page = BeautifulSoup(html_body, "html.parser")

        # For every listing card on the page
        for listing_card in page.find_all("div", class_="HomeCardContainer"):
            try:
                # Skip the card if it's an in-page advertisement
                if listing_card.get("aria-label") and listing_card["aria-label"] == "Advertisement":
                    continue

                # Capture the URL for its listing page
                listing_href = listing_card.find("a", href=True, class_="bp-InteractiveHomecard")["href"]

                # Capture the price for the listing
                price = listing_card.find("span", class_="bp-Homecard__Price--value").text
                # Skip the listing if the price is unknown
                if price == "Unknown":
                    continue
                else:
                    # Clean the string of commas and dollar signs and turn it into a float
                    price = float(price[1:].replace(",", ""))

                # Capture the address for the listing
                address = listing_card.find("div", class_="bp-Homecard__Address").string

                # Capture the number of beds on the listing
                num_beds = listing_card.find("span", class_="bp-Homecard__Stats--beds").string.split(" ")[0]
                # Skip the listing if they aren't listed
                if num_beds == "—":
                    continue
                else:
                    # Turn the value into a float
                    num_beds = float(num_beds)

                # Capture the number of baths on the listing
                num_baths = listing_card.find("span", class_="bp-Homecard__Stats--baths").string.split(" ")[0]
                # Skip the listing if they aren't listed
                if num_baths == "—":
                    continue
                else:
                    # Turn the value into a float
                    num_baths = float(num_baths)

                # Capture the square footage of the listing
                square_footage = listing_card.find("span", class_="bp-Homecard__Stats--sqft").find_all(string=True)[0]
                # Skip the listing if it isn't listed
                if square_footage in ["—", "sq ft"]:
                    continue
                else:
                    # Clean the string of commas and turn it into a float
                    square_footage = float(square_footage.replace(",", ""))

                # Capture the contact info on the listing
                try:
                    contact_info = listing_card.find("div", class_="bp-Homecard__Attribution").string
                except AttributeError:
                    # Tell the user to check the listing for more info if it isn't available
                    contact_info = "See listing via URL"

            except Exception:
                # Skip any card that doesn't have the expected structure
                continue

            # Combine all of the captured information into an easily readable dictionary
            listing_info = {
                "listing_url": f"https://redfin.com{listing_href}",
                "price": price,
                "address": address,
                "num_beds": num_beds,
                "num_baths": num_baths,
                "square_footage": square_footage,
                "contact_info": contact_info
            }

            # Add the listing to the list of all listings
            all_listings.append(listing_info)
        # Wait one second in between requests to not overload the web page
        time.sleep(1)

    # Return all of the captured listings
    return all_listings
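
The price and square footage cleaning above follows one pattern: return nothing for a placeholder, otherwise strip the formatting and convert to a float. A minimal sketch of that pattern as a single helper follows. The name `parse_number` is hypothetical, and the sample strings are assumed examples of the card text rather than captured page content.

```python
def parse_number(text, missing=("—", "Unknown", "sq ft")):
    """Convert text such as "$325,000" or "1,176" to a float, or return None if missing."""
    text = text.strip()
    if text in missing:
        return None
    # Drop a leading dollar sign and any thousands separators
    return float(text.lstrip("$").replace(",", ""))


price = parse_number("$325,000")
sqft = parse_number("1,176")
# Bed and bath text is split first, mirroring the split(" ")[0] calls above
beds = parse_number("3 beds".split(" ")[0])
unlisted = parse_number("—")
```

Returning `None` lets the caller decide whether to skip the listing, instead of repeating the placeholder check for every field.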

## 7. Define filter_listings Function
Prepare the filter_listings function that will be used in combination with the output from the get_all_listings function to filter the results based on a given set of criteria from the user.

In [7]:
def filter_listings(all_listings, filters):
    # sac484
    # This function will filter the list of listing dictionaries based on
    # the provided dictionary of filters
    #
    # Parameters:
    # all_listings (list): the list of dictionaries of listings
    # filters (dict): the dictionary of filters to filter on
    #
    # Returns:
    # The resulting list of dictionaries following the application of filters

    # If there are no filters to be applied, just return the original list
    if all(value is None for value in filters.values()):
        return all_listings

    # Create an empty list for the filtered listings
    filtered_listings = []
    # For every listing in the list
    for listing in all_listings:
        # For each filter, the condition is True if the filter is missing or set to None.
        # Otherwise, the condition is True only if the listing's value is within the
        # bound given by the filter
        conditions = [
            (filters.get("min_price") is None or listing["price"] >= filters["min_price"]),
            (filters.get("max_price") is None or listing["price"] <= filters["max_price"]),
            (filters.get("min_beds") is None or listing["num_beds"] >= filters["min_beds"]),
            (filters.get("max_beds") is None or listing["num_beds"] <= filters["max_beds"]),
            (filters.get("min_baths") is None or listing["num_baths"] >= filters["min_baths"]),
            (filters.get("max_baths") is None or listing["num_baths"] <= filters["max_baths"]),
            (filters.get("min_sqft") is None or listing["square_footage"] >= filters["min_sqft"]),
            (filters.get("max_sqft") is None or listing["square_footage"] <= filters["max_sqft"])
        ]

        # If all conditions are met, meaning all values are set to True in the list, add
        # the listing to the filtered list
        if all(conditions):
            filtered_listings.append(listing)

    # Return the list of filtered listings
    return filtered_listings
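
As an illustration of the filter semantics (a filter that is missing or `None` is ignored), here is a compact, self-contained sketch with two sample listings whose values are taken from the output shown later in this notebook. The names `RANGE_FILTERS` and `passes_filters` are hypothetical, and this is a table-driven variant rather than the function defined above.

```python
# Maps each listing field to its (min filter key, max filter key)
RANGE_FILTERS = {
    "price": ("min_price", "max_price"),
    "num_beds": ("min_beds", "max_beds"),
    "num_baths": ("min_baths", "max_baths"),
    "square_footage": ("min_sqft", "max_sqft"),
}


def passes_filters(listing, filters):
    """Return True if the listing satisfies every filter that is not None."""
    for field, (min_key, max_key) in RANGE_FILTERS.items():
        low, high = filters.get(min_key), filters.get(max_key)
        if low is not None and listing[field] < low:
            return False
        if high is not None and listing[field] > high:
            return False
    return True


sample_listings = [
    {"price": 325000.0, "num_beds": 2.0, "num_baths": 1.0, "square_footage": 1176.0},
    {"price": 180000.0, "num_beds": 3.0, "num_baths": 1.0, "square_footage": 1140.0},
]
sample_filters = {"max_price": 200000.0, "min_beds": None}
cheap = [item for item in sample_listings if passes_filters(item, sample_filters)]
```

Only the 180000.0 listing survives the `max_price` filter, and the `min_beds` entry set to `None` has no effect.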

## 8. Define input_with_type Function
Prepare the input_with_type function that will be used to better parse the information given by the user and return it in a proper Python type rather than just a string.

In [8]:
def input_with_type(prompt, value_type):
    # sac484
    # This function returns the value that a user gives for a given prompt but
    # as the specific data type that was provided
    #
    # Parameters:
    # prompt (string): the prompt to ask the user
    # value_type (type): the type that user-input should be converted to
    #
    # Returns:
    # None if the user enters a blank string,
    # otherwise the resulting input from the user as the provided data type

    # Loop to keep asking the user the question
    while True:
        # Capture the response of the prompt and strip any spaces
        response = input(prompt).strip()

        # If the user provided a blank response, return None
        if response == "":
            response = None
            break

        try:
            # Convert the user input to the provided data type
            response = value_type(response)
            break
        except Exception:
            # If the user didn't give a proper value and there was an error in
            # conversion, prompt the user to enter a proper value
            print("Please enter a proper value!")

    # Return the resulting converted value
    return response
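
Because input_with_type reads from the keyboard, it is awkward to test directly. One option, sketched here under the assumption that the conversion logic is pulled into its own function, is a pure helper that can be checked without user input. The name `convert_or_none` is hypothetical.

```python
def convert_or_none(text, value_type):
    """Return None for blank input, otherwise text converted to value_type.

    Raises ValueError if the text cannot be converted (for example float("cheap")).
    """
    text = text.strip()
    if text == "":
        return None
    return value_type(text)


blank = convert_or_none("   ", float)
price = convert_or_none(" 40000 ", float)
try:
    convert_or_none("cheap", float)
    rejected = False
except ValueError:
    rejected = True
```

The prompting loop would then only need to call this helper and repeat the question whenever it raises.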

## 9. Define save_output Function
Define the save_output function that will be used to output the final real estate listings to either a CSV or JSON file if the user so chooses after filtering the listings.

In [9]:
def save_output(listings, format_choice):
    # zl536
    # This function will output the provided listings to either a
    # CSV or JSON file, or do nothing at all depending on the format choice
    #
    # Parameters:
    # listings (list): the list of listings to be saved
    # format_choice (string): the format to be saved in
    #
    # Returns:
    # Nothing

    # If the user chose CSV
    if format_choice == "csv":
        # Open the filtered_listings.csv file in writing mode
        with open("filtered_listings.csv", "w", newline="", encoding="utf-8") as f:
            # Create a DictWriter object out of the file
            writer = csv.DictWriter(f, fieldnames=listings[0].keys())
            # Write the headers to the file
            writer.writeheader()
            # Write each row to the file
            writer.writerows(listings)
        # Print that the results were saved to the filtered_listings.csv file
        print("Saved to filtered_listings.csv")

    # If the user chose JSON
    elif format_choice == "json":
        # Open the filtered_listings.json file in writing mode
        with open("filtered_listings.json", "w", encoding="utf-8") as f:
            # Dump the list of dictionaries to the json file
            json.dump(listings, f, indent=2)
        # Print that the results were saved to the filtered_listings.json file
        print("Saved to filtered_listings.json")

    # If the user chose anything else
    else:
        # Print that the output was not saved
        print("Output not saved.")
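
To preview what the two formats look like without touching the file system, the sketch below writes to an in-memory buffer using the same `csv.DictWriter` and `json` calls. The helper name `listings_to_csv_text` is hypothetical, the sample row reuses one address and price from the output shown later, and the empty-list guard is an addition that save_output does not have.

```python
import csv
import io
import json


def listings_to_csv_text(listings):
    """Serialize a list of dictionaries to CSV text; return "" for an empty list."""
    if not listings:
        return ""
    # newline="" leaves the csv module in charge of line endings
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(listings[0].keys()))
    writer.writeheader()
    writer.writerows(listings)
    return buffer.getvalue()


sample = [{"address": "341 N State St, Philadelphia, PA 19104", "price": 325000.0}]
csv_text = listings_to_csv_text(sample)
json_text = json.dumps(sample, indent=2)
```

Note that the address contains commas, so the CSV writer wraps it in double quotes; spreadsheet programs read it back as a single cell.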

## 10. Retrieve Zip Code from User and Get Listings
Get input from the user to get what zip code they want to search for listings in. Then, utilize the get_all_listings function to get all of the available listings in that zip code. However, if the user types in an invalid zip code or one that doesn't have any listings in it, it will keep asking them to enter a different zip code.

In [10]:
# While loop to keep asking the question to the user
while True:
    # Capture the zip code from the user and strip any extra spaces
    zip_code = input("Enter ZIP code: ").strip()

    # Get all listings in the zip code and print the status to the user
    print("Gathering listings...", end="")
    listings = get_all_listings(zip_code)
    print("Done")

    # If no listings in that zip code were found
    if len(listings) == 0:
        # Ask the user to enter another zip code and restart the loop
        print("No listings found for that zip code, please enter a different one!")

    # If listings were found
    else:
        # Break the loop and continue
        break

Enter ZIP code:  19104


Gathering listings...Done
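
The loop above sends a request for whatever the user types. A possible refinement, which this notebook does not implement, is to check the format before contacting Redfin. The sketch assumes five-digit US zip codes, and the name `looks_like_zip` is hypothetical.

```python
import re

# Assumption: only plain five-digit zip codes are accepted
ZIP_PATTERN = re.compile(r"\d{5}")


def looks_like_zip(text):
    """Return True if text is exactly five digits after stripping whitespace."""
    return ZIP_PATTERN.fullmatch(text.strip()) is not None


valid = looks_like_zip(" 19104 ")
too_short = looks_like_zip("1910")
```

A check like this only rules out malformed input; a well-formed zip code can still return no listings, so the empty-result message remains necessary.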


## 11. Ask the User to Provide Any Filters for the Listings
Create a dictionary of filters and ask the user to either provide a value for the filter or just press "Enter" to skip it.

In [11]:
# Create an empty dictionary of filters
filters = {}

# Ask the user for filters on min/max price, min/max beds, min/max baths, and min/max square footage
# and add those values to the filters dictionary as float data types
filters["min_price"] = input_with_type("Minimum price (press Enter to skip): ", float)
filters["max_price"] = input_with_type("Maximum price (press Enter to skip): ", float)
filters["min_beds"] = input_with_type("Minimum number of bedrooms (press Enter to skip): ", float)
filters["max_beds"] = input_with_type("Maximum number of bedrooms (press Enter to skip): ", float)
filters["min_baths"] = input_with_type("Minimum number of bathrooms (press Enter to skip): ", float)
filters["max_baths"] = input_with_type("Maximum number of bathrooms (press Enter to skip): ", float)
filters["min_sqft"] = input_with_type("Minimum square footage (press Enter to skip): ", float)
filters["max_sqft"] = input_with_type("Maximum square footage (press Enter to skip): ", float)

# Filter the gathered listings based on the user input and save it to a new variable
filtered_listings = filter_listings(listings, filters)

Minimum price (press Enter to skip):  40000
Maximum price (press Enter to skip):  350000
Minimum number of bedrooms (press Enter to skip):  
Maximum number of bedrooms (press Enter to skip):  3
Minimum number of bathrooms (press Enter to skip):  
Maximum number of bathrooms (press Enter to skip):  3
Minimum square footage (press Enter to skip):  
Maximum square footage (press Enter to skip):  


## 12. Print Out the Listings that were Found
For each listing that exists after filtering (there could also be no listings), output the data to the user.

In [12]:
# Tell the user how many listings were found
print(f"Found {len(filtered_listings)} matching listing(s).\n\n")

# Print each listing
for item in filtered_listings:
    print(item)

Found 32 matching listing(s).


{'listing_url': 'https://redfin.com/PA/Philadelphia/341-N-State-St-19104/home/38665125', 'price': 325000.0, 'address': '341 N State St, Philadelphia, PA 19104', 'num_beds': 2.0, 'num_baths': 1.0, 'square_footage': 1176.0, 'contact_info': '(215) 546-0550'}
{'listing_url': 'https://redfin.com/PA/Philadelphia/4336-Parrish-St-19104/home/38188692', 'price': 180000.0, 'address': '4336 Parrish St, Philadelphia, PA 19104', 'num_beds': 3.0, 'num_baths': 1.0, 'square_footage': 1140.0, 'contact_info': '(215) 646-2900'}
{'listing_url': 'https://redfin.com/PA/Philadelphia/4217-Chestnut-St-19104/unit-206/home/142996108', 'price': 329000.0, 'address': '4217 Chestnut St #206, Philadelphia, PA 19104', 'num_beds': 1.0, 'num_baths': 1.0, 'square_footage': 674.0, 'contact_info': '(888) 397-7352'}
{'listing_url': 'https://redfin.com/PA/Philadelphia/706-N-Shedwick-St-19104/home/38669915', 'price': 139900.0, 'address': '706 N Shedwick St, Philadelphia, PA 19104', 'num_beds': 2
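
Printing raw dictionaries works, but the lines get long. As one option, a minimal sketch of a formatter that turns a listing into a short summary line is shown below. The name `format_listing` is hypothetical, and the sample values come from the first listing in the output above.

```python
def format_listing(listing):
    """Return a one-line, human-readable summary of a listing dictionary."""
    return (
        f"${listing['price']:,.0f} | {listing['address']} | "
        f"{listing['num_beds']:g} bd, {listing['num_baths']:g} ba, "
        f"{listing['square_footage']:,.0f} sq ft"
    )


sample = {
    "price": 325000.0,
    "address": "341 N State St, Philadelphia, PA 19104",
    "num_beds": 2.0,
    "num_baths": 1.0,
    "square_footage": 1176.0,
}
line = format_listing(sample)
# line == "$325,000 | 341 N State St, Philadelphia, PA 19104 | 2 bd, 1 ba, 1,176 sq ft"
```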

## 13. Output the Listings to a File
If the user has listings that they want to save to a file, they can do that here and choose either CSV or JSON format.

In [13]:
# After filtering, if there are any listings available
if len(filtered_listings) > 0:
    # Ask the user if they want to save the results as a CSV or JSON file
    output_format = input("Save results to file? (csv/json/none): ").strip().lower()
    # Save the output based on the user input
    save_output(filtered_listings, output_format)

Save results to file? (csv/json/none):  csv


Saved to filtered_listings.csv
