# New York Apartments Scraper



### Introduction

To assess the affordability of New York apartments, data will be extracted from Rentals.com, which lists over 10,000 units at a time. Information such as the price and the number of bedrooms and bathrooms will be scraped from the corresponding HTML tags using the Selenium library in Python.

### Importing the Packages

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import csv 
import re

### Open a Chromium Browser

Selenium drives a real browser and scrapes whatever page that browser currently has loaded. The following code opens a Chromium browser through the driver located on my local computer and directs it to the Rentals.com page that displays New York City's apartment listings.

In [None]:
s = Service('C:/chromedriver')
driver = webdriver.Chrome(service=s)

URL = "https://www.rentals.com/New-York/New-York-City/?page=1"
driver.get(URL)

### Scraping

Rentals.com separates its listings into pages, with each page holding 30 listings. So first we determine the number of pages, then we loop through each page to get the desired data.

In [None]:
#Determine the number of pages of listings.
pages = int(driver.find_elements(By.XPATH, '//div[@data-tid="pagination_page_count"]')[0].text.split(" ")[-1])

#Each listing will represent one row, and will be stored as a nested list in rows.
rows = []

#Loop through every page.
for page in range(0, pages):
    
    #Redirect the chromium browser to the current page
    URL = "https://www.rentals.com/New-York/New-York-City/?page=" + str(page+1)
    driver.get(URL)
    
    #Each listing has a unique ID that is used to navigate to their individual page.
    #Each page of listings has a list of IDs, that list will be stored into houses
    houses = driver.find_elements(By.XPATH, '//div[@data-tid="listing-section"]')
    
    #Each unit in a page will have their own link
    links = []
    
    #There are 31 units per page instead of 30 as there's an additional one for ads
    house_per_page = 31
    
    for i in range(house_per_page):
        try:
            #Each link is just the base URL plus the unique ID
            links.append("https://www.rentals.com/New-York/New-York/" + houses[i].get_attribute("id"))
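        except Exception:
            #Skip slots that don't exist (fewer than 31 units on the page) or that have no usable ID.
            pass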

    for link in links:
        #Navigate to each unit by using their unique link
        driver.get(link)
        
        #All the information is stored in this variable
        #I located it with a CSS selector as the div tag doesn't have an ID or a descriptive class name to reference.
        general_info = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "._3EHgB"))).text.split("\n")

        #Price is the first item in the list.
        price = general_info[0]
        
        #Address isn't stored in general_info so another command is needed.
        address_headline = driver.find_element(By.XPATH, '//div[@data-tid="address_headline"]').text
        if (address_headline):
            street = address_headline
        else:
            #Sometimes the apartment includes their name in the address headline and places their actual address in a different tag.
            street = driver.find_element(By.XPATH, '//div[@data-tid="property_label_headline"]').text
            
        #Loop through general_info to find where the city, state, and zipcode information is.
        #Default to "NA" so a listing without a match doesn't leave an empty list in the row.
        city_state_zipcode = "NA"
        for info in general_info:
            #The position of this line isn't constant in general_info, so a regular expression is used to pinpoint it.
            #A raw string keeps \d from being read as a string escape.
            match = re.search(r".*, .{2} \d{5}$", info)
            if match:
                city_state_zipcode = match.group(0)
                #Break the loop once the information is found
                break

        #Bedroom and bathroom information isn't located in general information either.
        bed_bath = driver.find_element(By.XPATH, '//span[@data-tid="bed_bath_section"]').text.split(",")
        #bedroom is the first element. split() always returns at least one item,
        #so missing information shows up as an empty string rather than an IndexError.
        bed = bed_bath[0].strip() or "NA"
        try:
            #bathroom is the second element
            bathroom = bed_bath[1].strip()
        except IndexError:
            #Sometimes there are edge cases where there isn't any bathroom information
            bathroom = "NA"
            
        #Compile the desired data into a list and add it to rows
        data = [price, bed, bathroom, street, city_state_zipcode]
        rows.append(data)
        
        #Reset the information for the next iteration.
        price = ""
        bed = ""
        bathroom = ""
        street = ""
        city_state_zipcode = ""
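
The text-parsing steps in the loop above can be tried out without a browser. The following is a minimal sketch, not part of the scraper itself: the function names (`parse_page_count`, `find_city_state_zip`, `split_bed_bath`) and the sample strings are made up for illustration, and the real text on Rentals.com may be formatted differently.

```python
import re

# Same pattern as in the scraper: "<city>, <2-character state> <5-digit zip>" at the end of a line.
CITY_STATE_ZIP = re.compile(r".*, .{2} \d{5}$")


def parse_page_count(pagination_text):
    """Return the last space-separated token of the pagination text as an int."""
    return int(pagination_text.split(" ")[-1])


def find_city_state_zip(lines, default="NA"):
    """Return the first line that matches the pattern, or default if none does."""
    for line in lines:
        match = CITY_STATE_ZIP.search(line)
        if match:
            return match.group(0)
    return default


def split_bed_bath(text, default="NA"):
    """Split 'beds, baths' text into two values, filling in default for missing parts."""
    parts = [part.strip() for part in text.split(",")]
    bed = parts[0] if parts[0] else default
    bath = parts[1] if len(parts) > 1 and parts[1] else default
    return bed, bath


# Hypothetical sample text, only to exercise the helpers.
sample_lines = ["$3,200", "1 Example St", "New York, NY 10001"]
print(parse_page_count("1 of 345"))
print(find_city_state_zip(sample_lines))
print(split_bed_bath("2 Beds, 1 Bath"))
print(split_bed_bath(""))
```

Keeping the parsing in small functions like these makes the edge cases (no matching address line, missing bedroom or bathroom text) easy to check before running the full scrape.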

### Save the Data

Lastly, we save the data into a csv file.

In [None]:
#These will be the column names in the csv file
fields = ["price", "number_of_beds", "number_of_bathrooms", "street", "city_state_zipcode"] 
filename = "9-15-2022 NY Rental Data.csv"
    
with open(filename, 'w', newline='', encoding="utf-8") as csvfile: 
    csvwriter = csv.writer(csvfile) 
    csvwriter.writerow(fields) 
    csvwriter.writerows(rows)

#Read the csv file back and compare the number of entries written with the number scraped
csvFile = pd.read_csv(filename)
print(len(csvFile), "rows written;", len(rows), "rows scraped")
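
The write-then-verify step can also be checked on its own. This is a small sketch under a few assumptions: it uses only the standard `csv` module instead of pandas, writes to a temporary directory rather than the real output file, and the two rows are placeholders, not scraped data. The helper name `write_and_count` is hypothetical.

```python
import csv
import os
import tempfile

fields = ["price", "number_of_beds", "number_of_bathrooms", "street", "city_state_zipcode"]

# Placeholder rows for illustration only; these are not real listings.
sample_rows = [
    ["$3,200", "2 Beds", "1 Bath", "1 Example St", "New York, NY 10001"],
    ["$2,500", "Studio", "NA", "2 Example Ave", "New York, NY 10002"],
]


def write_and_count(path, header, rows):
    """Write header and rows to path, then read the file back and count the data rows."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    with open(path, newline="", encoding="utf-8") as f:
        # DictReader consumes the header line, so only data rows are counted.
        return sum(1 for _ in csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    written = write_and_count(os.path.join(tmp, "sample.csv"), fields, sample_rows)

print(written == len(sample_rows))
```

Values that contain commas, such as the price and the city line, are quoted by `csv.writer` automatically, so they come back as single fields.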