## Scraping StreetEasy.com to analyze housing prices in New York City

My goal here is to collect housing prices for both rentals and sales in New York City. I looked at three major real estate websites: Trulia, Zillow, and StreetEasy. Compared to the other two, StreetEasy gives the most information on the search results page, and the format of each listing is very consistent, which makes it well suited to web scraping.<br />
<a href="http://streeteasy.com/">
<img alt="StreetEasy" src="map/streetEasy_logo.jpg" height="30px" width="150px"/></a><br />

Web scraping is done using the BeautifulSoup package in Python. I wrote two helper functions that build the search-results URL for a given page number, so that a loop can step through every page of results, and created empty lists to store the scraped values. Below are the steps I took to scrape StreetEasy:
1. Analyzing the HTML page: the HTML code of a web page can be viewed by right-clicking and selecting 'Inspect'. This helps identify the HTML tags that hold the information to be scraped.
2. Making the soup!: It is important to select the correct parser for your data type. I used the HTML parser (`html.parser`).
3. Navigating the parse tree and iterating through tags: once the soup is made, we have the HTML code in Python. We can then find the desired information by searching through HTML tags.
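
The three steps can be sketched on a small inline HTML string, with no network access required. This is only an illustration written for Python 3: the markup is a hypothetical, simplified stand-in for a results page that reuses class names from the scraping code in this notebook, and `first_string` is a helper name introduced here, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a simplified stand-in for one search-results page.
html = """
<div data-id="101">
  <div class="details-title"><a href="/rental/101">123 Example St #4A</a></div>
  <span class="price">$3,500</span>
  <span class="first_detail_cell">1 bed</span>
</div>
<div data-id="102">
  <div class="details-title"><a href="/rental/102">456 Sample Ave #2</a></div>
</div>
"""

def first_string(tag, name, css_class):
    """Return the text of the first matching descendant, or '' if none exists."""
    match = tag.find(name, class_=css_class)
    return match.get_text(strip=True) if match is not None else ''

# Step 2: make the soup with Python's built-in HTML parser.
soup = BeautifulSoup(html, 'html.parser')

# Step 3: each listing carries a data-id attribute, so select on that.
listings = soup.find_all(lambda tag: tag.has_attr('data-id'))

# Collect one dict per listing; missing fields become empty strings.
rows = [
    {
        'street': first_string(item, 'div', 'details-title'),
        'price': first_string(item, 'span', 'price'),
        'bed': first_string(item, 'span', 'first_detail_cell'),
    }
    for item in listings
]
```

Wrapping the "find the tag or fall back to an empty string" pattern in one helper keeps every output list the same length, even when a listing omits a field.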

In [1]:
def package_url_sale(page):
    return 'http://streeteasy.com/for-sale/nyc?page=' + page

In [2]:
def package_url_rent(page):
    return 'http://streeteasy.com/for-rent/nyc?page=' + page

In [3]:
from bs4 import BeautifulSoup
import urllib
import pandas as pd
import numpy as np

price=[]
where=[]
bed=[]
bath=[]
size=[]
monthly=[]
street=[]

In [None]:
for x in range(757,1500): #loop through all pages
    url=package_url_rent(str(x))
    r = urllib.urlopen(url).read()
    soup = BeautifulSoup(r,'html.parser')
    lst = soup.find_all(lambda tag: tag.has_attr('data-id'))
    for i in range(len(lst)):
        #price
        if lst[i].find_all('span',{'class':'price'})==[]:
            price.append('')
        else:
            price.append(lst[i].find_all('span',{'class':'price'})[0].string)
        #where
        length=len(lst[i].find_all('div',{'class':'details_info'}))
        if(lst[i].find_all('div',{'class':'details_info'})[0].find_all('a',href=True)==[]):
            if(length==1):
                where.append('')
            else:
                if(lst[i].find_all('div',{'class':'details_info'})[1].find_all('a',href=True)==[]):
                    where.append('')
                else:
                    where.append(lst[i].find_all('div',{'class':'details_info'})[1].find_all('a',href=True)[0].string)
        else:
            where.append(lst[i].find_all('div',{'class':'details_info'})[0].find_all('a',href=True)[0].string)
        #bedroom
        if(lst[i].find_all('span',{'class':'first_detail_cell'})==[]):
            bed.append('')
        else:
            bed.append(lst[i].find_all('span',{'class':'first_detail_cell'})[0].string)
        #bathroom
        if(lst[i].find_all('span',{'class':'detail_cell'})==[]):
            bath.append('')
        else:
            bath.append(lst[i].find_all('span',{'class':'detail_cell'})[0].string)
        #size
        if(lst[i].find_all('span',{'class':'last_detail_cell'})==[]):
            size.append('')
        else:
            size.append(lst[i].find_all('span',{'class':'last_detail_cell'})[0].string)
        #monthly rent
        #monthly.append(lst[i].find_all('span',{'class':'monthly_payment'})[0].string)
        #street
        street.append(lst[i].find_all('div',{'class':'details-title'})[0].a.string)   
    #print x

print 'done'

## Data Manipulation

For some listings, the information on the number of bedrooms, number of bathrooms, and apartment size is incomplete or mixed up. I performed data manipulation to move misplaced values into the correct columns and to clean up extra symbols such as commas and dollar signs. <br />
Finally, I have two data sets containing the housing information for apartments for rent and apartments for sale. The for-sale data set has 8,456 rows and 8 columns, and the for-rent data set has 20,988 rows and 7 columns.
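
One way to think about the "mixed up" columns is to classify each scraped detail string by its unit word and then file it under the matching column. The sketch below, written for Python 3, is a compact restatement of that idea rather than the code used in this notebook. `classify_detail` and `sort_details` are hypothetical names, the sample strings are assumed formats, and treating any unrecognized unit as a size is an assumption.

```python
def classify_detail(text):
    """Guess which column a scraped detail string belongs to.

    Returns 'bed', 'bath', 'size', 'furnished', or None for blanks.
    """
    if not text:
        return None
    lowered = text.strip().lower()
    if lowered == 'furnished':
        return 'furnished'
    if lowered == 'studio':
        return 'bed'
    parts = lowered.split()
    unit = parts[1] if len(parts) > 1 else ''
    if unit in ('bed', 'beds'):
        return 'bed'
    if unit in ('bath', 'baths'):
        return 'bath'
    # Assumption: anything else (e.g. a square-footage string) is a size.
    return 'size'

def sort_details(cells):
    """File each detail string under the column its unit word implies."""
    row = {'bed': '', 'bath': '', 'size': '', 'furnished': 0}
    for text in cells:
        kind = classify_detail(text)
        if kind == 'furnished':
            row['furnished'] = 1
        elif kind is not None:
            row[kind] = text
    return row

# A listing whose first cell holds the bath count instead of the bed count.
example = sort_details(['2 baths', '', '850 ft²'])
```

Here `example` has the bath count under `'bath'` and the size under `'size'`, regardless of which cell each value was scraped from.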

In [122]:
import pandas as pd
import numpy as np
data={'street':street,'price':price,'where':where,'bed':bed, 'bath':bath, 'size':size,'furnished':0}
data=pd.DataFrame(data)

#is the apartment furnished?
cond=data['bed']=='Furnished'
data.loc[cond,'furnished']=1
data.loc[cond,'bed']=''

#move from size to bath
cond=[]
for i in data['size']:
    if(i==''):
        cond.append(False)
    else:
        cond.append(i.split(" ")[1] in ('bath','baths'))
data.loc[cond,'bath']=data.loc[cond,'size'] 
data.loc[cond,'size']=''

#move from bed to bath
cond=[]
for i in data['bed']:
    if(i=='' or i=='Furnished' or i=='studio'):
        cond.append(False)
    else:
        cond.append(i.split(" ")[1] in ('bath','baths'))
data.loc[cond,'bath']=data.loc[cond,'bed'] 
data.loc[cond,'bed']=''

#move from bath to bed
cond=[]
for i in data['bath']:
    if(i==''):
        cond.append(False)
    else:
        if(len(i.split(" "))==1):
            cond.append(True)
        else:
            if(i.split(" ")[1] in ('bath','baths')):
                cond.append(False)
            else:
                cond.append(True)
data.loc[cond,'bed']=data.loc[cond,'bath'] 
data.loc[cond,'bath']=''

#move from bed to size
cond=[]
for i in data['bed']:
    if(i=='' or i=='studio'):
        cond.append(False)
    else:
        if(i.split(" ")[1] in ('bed','beds')):
            cond.append(False)
        else:
            cond.append(True)
data.loc[cond,'size']=data.loc[cond,'bed'] 
data.loc[cond,'bed']=''


#replace blank with nan
data=data.applymap(lambda x: np.nan if x=='' else x)

#data
data.to_csv('rent.csv',encoding='utf-8')

In [None]:
#size to numeric
cond=data['size'].isnull()
for i in range(0,len(cond)):
    if (not cond[i]):
        data.loc[i,'size']=int(data['size'][i].split(" ")[0].replace(',',''))
#bath to numeric
cond=data['bath'].isnull()
for i in range(0,len(cond)):
    if (not cond[i]):
        data.loc[i,'bath']=float(data['bath'][i].split(" ")[0].replace('+',''))
#bed to numeric
cond=data['bed'].isnull()
data['bed']=data['bed'].replace('studio','0 bed')
for i in range(0,len(cond)):
    if (not cond[i]):
        data.loc[i,'bed']=float(data['bed'][i].split(" ")[0].replace(',','').replace('+',''))
#remove dollar sign and commas; listings with no price stay as NaN
data['price']=[int(i.replace('$','').replace(',','')) if pd.notnull(i) else np.nan for i in data['price']]

data.to_csv('rent_2.csv')
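
The numeric conversions above all follow the same pattern: strip the symbols, then read the leading number. As a rough sketch for Python 3, that pattern could be folded into a single function. `to_number` is a hypothetical name, and the sample strings in the comments are assumed formats rather than verified StreetEasy output.

```python
import re

def to_number(text):
    """Extract the leading number from a scraped string.

    Handles strings such as '$3,500', '1,200 ft²', or '2.5+ baths'.
    Returns 0.0 for 'studio' and None for missing or non-string values.
    """
    if not isinstance(text, str) or text.strip() == '':
        return None
    if text.strip().lower() == 'studio':
        return 0.0
    # Drop the dollar sign and thousands separators before matching.
    cleaned = text.replace('$', '').replace(',', '')
    match = re.search(r'\d+(?:\.\d+)?', cleaned)
    return float(match.group()) if match else None
```

Because it returns `None` for anything that is not a non-empty string, such a function could be applied to a column that already contains NaN values without a separate null check.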