### Introduction and problem statement

Finding a place to live is a big challenge, especially in big cities. Not only it is time consuming, but compare them whn we narrow down our choice to a few is another challenge. For families with children the best school district is priority, for single workers, being closer to a subway is a plus.

In this notebook, we will:
- Scrape the information from an apartment listing data 
- Clean and transform the data
- Analyze and visualize the data
- Forcasting the price for apartment for different neigborhoods

#### Scraping Data from the web
For this exercise, we will be using RentHopsite(http://www.renthop.com). We will focus our energy only on apartments in NYC and neighorhoods.
![title](img/homes_apartment.jpg)

Above is the overview of the site. We will be retrieving the informations below:
- Address
- Number of beds and number of baths
- The listing price
- The property type (Home, Townhome, Apts)
- Zip code

#### Adding libraries we will be using

For scraping we will be using **requests** to pull down listings and use **BeautifulSoup** to extract different attributes. There are two populars way to navigate  HTML structures known as **Document Object Models (DOMs)** XPath and CSS Selectors. We will be using the CSS selectors, which is a pattern language built to work with HTML and identify elements using a combination of element type, class,and ID properties.

In [10]:
import numpy as np 
import pandas as pd 
import requests 
import matplotlib.pyplot as plt 
from bs4 import BeautifulSoup
%matplotlib inline

#### 1- getting the whole page to scrape
  Fist of all, let us define a function call get_url that will allow us to get the Document Object Model for a given page. 

In [48]:
def get_url (url):
    try:
        response = requests.get(url).content
        return BeautifulSoup(response,'html.parser')
    except requests.exceptions.RequestException as e: 
        print("Something when wrong !")
        raise SystemExit(e)


In [None]:
#getting the whole page
soup=get_url('https://www.renthop.com/nyc/manhattan-apartments')


#### 2- getting the list of the all listings to scrape
After getting the HTML for the whole page, we are going to write another fuction to get the area that has only all the properties that we are interested in.
Using Chrome developer console, we see that the main text is within a **div** tag with *id=search-results-list* and each listing is in another **div** tag with *class=search-listing*. therefore, we will be getting all *search-listing*.
As we can see, each page has 72 listings.

In [66]:
APARTMENTS='div.search-listing'

listing_divs = soup.select(APARTMENTS)#[0].select_one(APARTMENTS)


In [29]:
def get_all_listings(soup, locator):
    '''
    our function takes a soup object and a locator and return the
     list of all individual listings.
    '''
    return soup.select(locator) 

#### 3- processing each listing to get individual attributes


In [162]:
LISTINGID_LOCATOR = 'div.search-listing'
LINK_LOCATOR = 'div.search-listing-details a'
NAME_LOCATOR = 'div.search-listing-details div.font-size-9.overflow-ellipsis'
BEDS_LOCATOR = 'div.search-listing div.search-results-bed' 
BATH_LOCATOR = 'div.search-listing div.search-results-bath' 
TYPE_LOCATOR = 'div.search-listing div.d-block.font-gray-1.font-size-9'
DATE_LOCATOR = 'div.search-listing-details>div.col-12>div.font-size-9'
DESCRIPTION_LOCATOR = 'div.px-3.px-lg-0>div.font-size-10'
page=get_all_listings(soup,locator=APARTMENTS)
page_details=get_all_listings(soup,locator=LINK_LOCATOR)

In [113]:
listingID = page[0].attrs['data-id']
longitude = page[0].attrs['data-longitude']
latitude  = page[0].attrs['data-latitude']
listPrice = page[0].attrs['data-price']
down_payment = page[0].attrs['data-min-downpay-pct']
propertyTax = page[0].attrs['data-tax']
address = page[0].attrs['data-listing-name']
link=page_details[0].attrs['href']#.select_one()
name_area = get_all_listings(soup,locator=NAME_LOCATOR)[0].text
beds=get_all_listings(soup,locator=BEDS_LOCATOR)[0].text.split('\n')
bed=[b for b in beds if len(b)>0 and b.isnumeric()][0]

baths=get_all_listings(soup,locator=BATH_LOCATOR)[0].text.split('\n')
bath=[b for b in baths if len(b)>0 and b.isnumeric()][0]
house_type=get_all_listings(soup,locator=TYPE_LOCATOR)[0].text.split(',')[0].strip()
date_published=get_all_listings(soup,locator=DATE_LOCATOR)[0].text.strip()[7:]

soup_details=get_url(link)
description=get_all_listings(soup_details,locator=DESCRIPTION_LOCATOR)[0].text

In [164]:

print(listPrice)


319900


### Pulling out the individual data points

In [24]:
href=listing_divs[0].select('a[id*=title]')[0]['href']
addy=listing_divs[0].select('a[id*=title]')[0].text
hood=listing_divs[0].select('div[id*=hood]')[0].string.replace('\n','')

### Getting other elements

In [30]:
listing_specs =listing_divs[0].select('table[id*=info] tr')
for spec in listing_specs:
    spec_data=spec.text.strip().replace(' ','_').split()
    print(spec_data)

['$2,600', '2_Bed', '1_Bath']


In [59]:
def parse_data(listing_divs):
    listing_list=[]
    for idx in range(len(listing_divs)):
        indv_listing=[]
        current_listing=listing_divs[idx]
        href=listing_divs[0].select('a[id*=title]')[0]['href']
        addy=listing_divs[0].select('a[id*=title]')[0].text.replace(',','_')
        hood=listing_divs[0].select('div[id*=hood]')[0].string.replace('\n','').replace(',','_')

        indv_listing.append(href)
        indv_listing.append(addy)
        indv_listing.append(hood)

        listing_specs=current_listing.select('table[id*=info] tr')
        for spec in listing_specs:
            try:
                indv_listing.extend(spec.text.strip().replace(' ','_').replace(',','_').split())
                indv_listing=[x for x in indv_listing if len(x.strip())!=0]
            except:
                indv_listing.extend(np.nan)
        listing_list.append(indv_listing)
    return listing_list



In [68]:
all_pages_parsed=[]
for i in range(1,21):
    target_page=f"https://www.renthop.com/search/nyc?max_price=50000&min_price={i}&sort=hopscore&q=&search=0"
    print(target_page)
    r=requests.get(target_page).content
    
    soup=BeautifulSoup(r,'html5lib')
    listing_divs=soup.select('div[class*=search-info]')
    one_page_parsed =parse_data(listing_divs)
    all_pages_parsed.extend(one_page_parsed)


https://www.renthop.com/search/nyc?max_price=50000&min_price=1&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=2&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=3&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=4&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=5&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=6&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=7&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=8&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=9&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=10&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=11&sort=hopscore&q=&search=0
https://www.renthop

In [71]:
all_pages_parsed[:3]
#df = pd.DataFrame(all_pages_parsed, columns=['url', 'address', 'neighborhood', 'rent', 'beds', 'baths','last']) 
#del df['last']
#df.head()

[['https://www.renthop.com/listings/e20-street/na/15733028',
  'E20 street',
  'Stuyvesant Town - Peter Cooper Village_ Midtown Manhattan_ Manhattan',
  '$4_831',
  '3_Bed',
  '_1_Bath'],
 ['https://www.renthop.com/listings/e20-street/na/15733028',
  'E20 street',
  'Stuyvesant Town - Peter Cooper Village_ Midtown Manhattan_ Manhattan',
  '$5_600',
  '4_Bed',
  '3_Bath'],
 ['https://www.renthop.com/listings/e20-street/na/15733028',
  'E20 street',
  'Stuyvesant Town - Peter Cooper Village_ Midtown Manhattan_ Manhattan',
  '$2_450',
  '1_Bed',
  '1_Bath']]