# Apartments.com listing scraper

This notebook reads the listings map page of apartments.com and extracts the details of each rental property for use in the modeling.

In [2]:
import string
from bs4 import BeautifulSoup
import numpy as np
import requests
import re
import time
import pandas as pd
import os

Apartments.com employs some anti-scraping technology. To circumvent these measures (for educational purposes only, of course), we send bogus User-Agent headers with each HTTP request and route the requests through a list of proxies.

The proxies were obtained from https://www.us-proxy.org/. The list included in this repository contains only the first three entries from that site. Any given proxy server may turn out to be unusable, so it is important to have a large list of potential proxies.
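As a minimal sketch of how one entry from the proxy list could be handed to `requests`: the library expects a dictionary keyed by URL scheme (`http`, `https`). The helper name `build_proxy_map` and the sample address are hypothetical, and the sketch assumes each line of `proxies.txt` is either a `host:port` pair or a full proxy URL.

```python
def build_proxy_map(proxy):
    """Return a requests-style proxies dict for a single proxy entry.

    Assumes `proxy` is "host:port" or a full URL; when the scheme is
    missing, http:// is prepended.
    """
    proxy = proxy.strip()
    if "://" not in proxy:
        proxy = "http://" + proxy
    # Use the same proxy for both plain and TLS traffic.
    return {"http": proxy, "https": proxy}


# Hypothetical entry (an address from a documentation range, not a real proxy).
proxy_map = build_proxy_map("203.0.113.10:8080\n")
print(proxy_map)
```

The resulting dictionary would be passed as `requests.get(url, proxies=proxy_map, timeout=10)`.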

In [3]:
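# Load the proxy list, one proxy per line, skipping blank lines.
with open("proxies.txt", "r") as fin:
    proxies = [line.strip() for line in fin if line.strip()]

# Bogus User-Agent strings: the 62 single characters 0-9, a-z, A-Z.
user_agents = list(string.printable[:62])
headers = {'User-Agent': np.random.choice(user_agents)}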

These helper functions are used later to process the scraped HTML, pull out the relevant listing details, and add them to a listing dictionary.

In [4]:
def check(my_string):
    # Return True if my_string is made up solely of balanced bracket pairs.
    brackets = ['()', '{}', '[]']
    while any(x in my_string for x in brackets):
        for br in brackets:
            my_string = my_string.replace(br, '')
    return not my_string

# Each helper fails silently: if an element is missing from the page, the
# corresponding key is simply left out of listing_props. Catching Exception
# (rather than a bare except) keeps Ctrl-C working during long scrapes.

def get_propid(soup, listing_props):
    # The listing id is stored as a data attribute on the <main> tag.
    try:
        attrs = soup.find("main").attrs
        listing_props["listingid"] = attrs["data-listingid"]
    except Exception:
        pass

def get_basic(soup, listing_props):
    basic_table = soup.find(id="priceBedBathAreaInfoWrapper")
    for prop in ["Monthly Rent", "Bedrooms", "Bathrooms", "Square Feet"]:
        try:
            # The value sits in the element immediately after the label text.
            listing_props[prop] = basic_table.find(string=prop).find_next().contents[0]
        except Exception:
            pass

def get_zip(soup, listing_props):
    try:
        zipcode = soup.find(class_="stateZipContainer").find_next().find_next().contents[0]
        listing_props["ZIP"] = int(zipcode)
    except Exception:
        pass

def get_deposit(soup, listing_props):
    try:
        deposit = soup.find(class_="detailsTextWrapper leaseDepositLabel").find_next().find_next().contents[0]
        listing_props["Deposit"] = deposit
    except Exception:
        pass

def get_scores(soup, listing_props):
    for score in ["walkScore", "transitScore"]:
        try:
            attrs = soup.find(class_=f"component-header ratingCol {score}").attrs
            listing_props[score] = attrs["data-score"]
        except Exception:
            pass

def get_coords(soup, listing_props):
    # Coordinates are published in the page's place:location meta tags.
    try:
        listing_props["latitude"] = soup.find_all(property="place:location:latitude")[0]["content"]
    except Exception:
        pass
    try:
        listing_props["longitude"] = soup.find_all(property="place:location:longitude")[0]["content"]
    except Exception:
        pass

def get_neighborhood(soup, listing_props):
    try:
        listing_props["neighborhood"] = soup.find_all(class_="neighborhood")[0].contents[0]
    except Exception:
        pass

#Note: this was just a first attempt at scraping the pet policy. In our work, we found that the actual pet policies would require processing the
#body text of the listing as well as a more robust treatment of the built-in pet policy table.
#Encoding: 0 = no pets, 2 = dogs only, 3 = cats only, 5 = dogs and cats.
def get_pet(soup, listing_props):
    try:
        pet = soup.find(class_="feePolicyTitle petPolicyTitle").contents[0].lower()
        if "no pet" in pet:
            listing_props["Pet"] = 0
        if "dog" in pet:
            listing_props["Pet"] = 2
        if "cat" in pet:
            listing_props["Pet"] = 3
        if ("dog" in pet) and ("cat" in pet):
            listing_props["Pet"] = 5
    except Exception:
        pass
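
The encoding applied by `get_pet` can be sketched as a standalone function, which makes it easy to try against sample policy titles without fetching a page. The function name `encode_pet_policy` and the sample titles are hypothetical; the codes mirror the ones used above, and returning `None` for an unrecognized title is an assumption of this sketch (the helper above simply leaves the key unset).

```python
def encode_pet_policy(title):
    """Map a pet policy title to the numeric code used by get_pet.

    0 = no pets, 2 = dogs only, 3 = cats only, 5 = dogs and cats,
    None = nothing recognized.
    """
    text = title.lower()
    code = None
    # Later checks deliberately override earlier ones, as in get_pet.
    if "no pet" in text:
        code = 0
    if "dog" in text:
        code = 2
    if "cat" in text:
        code = 3
    if "dog" in text and "cat" in text:
        code = 5
    return code


# Hypothetical policy titles, for illustration only.
for sample in ["No Pets Allowed", "Dogs Allowed", "Cats Allowed", "Dogs and Cats Allowed"]:
    print(sample, "->", encode_pet_policy(sample))
```

Because the checks are plain substring matches, any title containing a word such as "location" would also count as mentioning "cat", which is one reason a more robust treatment is needed.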


Here we scrape the listing map and pull out the URLs of all the apartment listings. The loop pulls only 28 pages of listings, which was the number
of listings pages available when we began the project. This number will change depending on the location, the apartment selection criteria,
and naturally with time.

In [5]:

base_url = 'https://www.apartments.com/indianapolis-in/over-200/'
listing_links = []

#Regex match to the form of the apartment listing id: 7 alphanumeric characters followed by a trailing slash.
listing_pattern = re.compile(r"/[a-zA-Z0-9]{7}/$")

for i in range(1, 29):
    url = base_url + str(i)

    headers = {'User-Agent': np.random.choice(user_agents)} #Change the user agent to avoid anti-crawler measures

    res = requests.get(url, timeout=10, headers=headers)
    soup = BeautifulSoup(res.content, 'lxml')

    for anchor in soup.find_all("a"):
        link = anchor.get("href")
        #Anchors without an href return None, so skip them explicitly.
        if link and listing_pattern.search(link) and ("sitemap" not in link):
            listing_links.append(link)

    time.sleep(np.random.uniform(1, 10)) #Wait for a short moment before submitting the next request.

#Extract unique listings and save to a text file to skip this step in subsequent runs.
unique_listings = list(set(listing_links))

with open("listings.txt", "w") as out:
    for link in unique_listings:
        out.write(link+"\n")
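
The link filter used in the loop above can be checked in isolation. The sketch below applies the same pattern to a few made-up hrefs; the helper name `is_listing_link` and all sample URLs are hypothetical and only illustrate which shapes pass the filter.

```python
import re

# Same pattern as in the scraping loop: a 7-character alphanumeric id
# followed by a trailing slash at the end of the URL.
LISTING_PATTERN = re.compile(r"/[a-zA-Z0-9]{7}/$")


def is_listing_link(href):
    """Return True if href looks like an individual listing URL."""
    if not href:  # anchors without an href yield None
        return False
    return bool(LISTING_PATTERN.search(href)) and "sitemap" not in href


# Hypothetical hrefs, for illustration only.
samples = [
    "https://www.apartments.com/example-apartments-indianapolis-in/abc1234/",
    "https://www.apartments.com/indianapolis-in/over-200/2/",
    "https://www.apartments.com/sitemap/abc1234/",
    None,
]
results = [is_listing_link(href) for href in samples]
print(results)
```

Only the first sample passes: the second ends in a page number rather than a 7-character id, the third is excluded by the `sitemap` check, and the last has no href at all.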


In [6]:
#If listing links already scraped, just run this cell to get the listing links loaded.
if os.path.exists("listings.txt"):
    with open("listings.txt", "r") as fin:
        unique_listings = [line.replace("\n", "") for line in fin.readlines()]


Here we scrape each of the listing links to extract the rental listing properties. This cell can take a while, as there are hundreds of listings
and we include a generous pause between HTTP requests (~20 seconds) to avoid being flagged as a bot.

The workflow is as follows: attempt to grab the page, retrying a few times if the request times out or fails; create a dictionary to hold the listing's properties;
call the helper functions to extract the properties; and add the dictionary to the dataset. We then convert the scraped data into a dataframe and save it.
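
The retry step of this workflow can be sketched independently of the network. In the example below, `fetch_with_retries` and `flaky_fetch` are hypothetical names: the first retries any callable a fixed number of times, and the second is a stand-in for `requests.get` that fails twice before succeeding. The `sleep` parameter is injectable only so the demonstration does not actually wait.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=4, wait_seconds=5, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached.

    Returns the fetched result, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as err:
            print(f"Attempt {attempt} failed: {err}")
            if attempt < max_attempts:
                sleep(wait_seconds)  # pause before the next try
    return None


calls = {"count": 0}


def flaky_fetch(url):
    """Stand-in for a real request: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"


# The URL is a placeholder; no network request is made.
page = fetch_with_retries(flaky_fetch, "https://example.com/listing/", sleep=lambda s: None)
print(page, calls["count"])
```

Returning `None` after the last failed attempt lets the caller skip a listing explicitly instead of parsing a page left over from the previous iteration.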

In [None]:
listings = []
#We limit this cell to the first 50 listings for testing purposes.
for listing in unique_listings[:50]:
    soup = None
    for attempt in range(1, 5):
        headers = {'User-Agent': np.random.choice(user_agents)} #Anti-anti-crawler
        proxy = np.random.choice(proxies)
        print(f"Reading: {listing} \n on proxy server: {proxy}")
        try:
            #requests expects the proxies dict to be keyed by URL scheme.
            res = requests.get(listing, timeout=10, headers=headers,
                               proxies={"http": proxy, "https": proxy})
            soup = BeautifulSoup(res.content, 'lxml')
            break
        except requests.exceptions.RequestException:
            #requests raises its own exception types (Timeout, ProxyError, ...) rather than the built-in TimeoutError.
            #We can fail for a few reasons: bad proxy server or we got flagged. Wait a bit and try again with different headers and proxy.
            print(f"Request failed on attempt {attempt}, trying again...")
            time.sleep(5)

    if soup is None:
        #Every attempt failed, so skip this listing rather than parse a stale or missing page.
        continue

    listing_props = {"url": listing}

    get_propid(soup, listing_props)
    get_zip(soup, listing_props)
    get_basic(soup, listing_props)
    get_scores(soup, listing_props)
    get_deposit(soup, listing_props)
    get_coords(soup, listing_props)
    get_pet(soup, listing_props)
    get_neighborhood(soup, listing_props)

    listings.append(listing_props.copy())
    time.sleep(np.random.uniform(5, 30))

#Finally, save the data so we don't have to scrape again. Whole scraping process took ~4 hours!
df = pd.DataFrame(listings)
df.to_excel("apartment_dotcom.xlsx")