### Introduction and problem statement

Finding a place to live is a big challenge, especially in big cities. Not only is it time consuming, but comparing the options once we have narrowed our choice down to a few is another challenge. For families with children, the best school district is the priority; for single workers, being close to a subway is a plus.

In this notebook, we will:
- Scrape apartment listing data from the web
- Clean and transform the data
- Analyze and visualize the data
- Forecast apartment prices for different neighborhoods

#### Scraping Data from the web
For this exercise, we will be using the RentHop site (http://www.renthop.com). We will focus only on apartments in NYC and its neighborhoods.
![title](../homes.jpg)

Above is an overview of the site. We will be retrieving the information below:
- Address
- Number of beds and number of baths
- The rental listing price
- Zip code
- Apartment description
- etc

#### Adding libraries we will be using

For scraping, we will use **requests** to pull down listings and **BeautifulSoup** to extract the different attributes. There are two popular ways to navigate HTML structures, known as **Document Object Models (DOMs)**: XPath and CSS selectors. We will use CSS selectors, a pattern language built to work with HTML that identifies elements using a combination of element type, class, and ID properties.
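
To make the selector syntax concrete, here is a minimal sketch run against a made-up HTML fragment. The markup, the `results` ID, and the `card` class are assumptions chosen for illustration and are not taken from the RentHop page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this illustration only.
html = """
<div id="results">
  <div class="card featured"><a href="/a">First</a></div>
  <div class="card"><a href="/b">Second</a></div>
  <p class="card">Not a div</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_type = soup.select("p")                       # match by element type
by_class = soup.select("div.card")               # element type + class
by_id = soup.select_one("#results")              # match by ID
nested = soup.select("#results div.card > a")    # descendant, then direct child

# Collect the href of every link found inside a card div.
card_links = [a["href"] for a in nested]
print(len(by_type), len(by_class), card_links)
```

`select` returns a list of every match, while `select_one` returns only the first match (or `None` when nothing matches).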

In [1]:
import re 
import numpy as np 
import pandas as pd 
import requests 
import matplotlib.pyplot as plt 
import googlemaps 
from bs4 import BeautifulSoup
%matplotlib inline

#### 1- Getting the whole page to scrape
First of all, let us define a function called get_url that returns the Document Object Model for a given page.

In [2]:
def get_url(url):
    try:
        # download the raw page content and parse it into a soup object
        response = requests.get(url).content
        return BeautifulSoup(response,'html.parser')
    except requests.exceptions.RequestException as e:
        print("Something went wrong!")
        raise SystemExit(e)


#### 2- Getting the list of the all listings to scrape
After getting the HTML for the whole page, we are going to write another function to get the area that contains only the properties we are interested in.
Using the Chrome developer console, we see that the main content is within a **div** tag with *id=search-results-list* and each listing is in another **div** tag with *class=search-listing*. Therefore, we will select all *search-listing* elements.
As we can see, each page has 20 listings.
We define the locator that gathers all listings as **APTS_LOCATOR='div.search-listing'**.
All the other locators follow the same pattern.


In [3]:
def get_all_listings(soup, locator):
    '''
    takes a soup object and a locator and returns the
    list of all elements matching that locator.
    '''
    return soup.select(locator) 


After inspecting all the components we will be scraping, below are the locators we will use to get the different attributes.

In [4]:
# Here is the list of all the locators we have identified

APTS_LOCATOR = 'div.search-listing'
LOC_LOCATOR  = 'div.search-listing'
LINK_LOCATOR = 'div.search-listing a'
LISTINGID_LOCATOR = 'div.search-listing'
LATITUDE_LOCATOR = 'div.search-listing'
LONGITUDE_LOCATOR = 'div.search-listing'
ADDR_LOCATOR = 'div.search-listing div.search-info'
NAME_LOCATOR = 'div.search-listing div.search-info div.font-size-9.overflow-ellipsis'
ADRNAME_LOCATOR = 'div.search-info>div>a'
PRICE_LOCATOR = 'div.search-listing  div.search-info div table tr'
BROKER_LOCATOR= 'div.search-listing div.search-info'
SQFT_LOCATOR  ='div.row.no-gutters div.px-3>div>table'
DESC_LOCATOR  = 'div.row.no-gutters div.font-size-10 p'


#### 3- Processing each listing to get individual attributes
Here we are going to get the data for each individual listing. We are going to build different functions that allow us to get our data.

##### a) Getting location and link attributes
We see that all the location attributes and the listing ID can be obtained using the same locator.

In [5]:
def get_location(apt, locator=LOC_LOCATOR):
    '''
      return listingID, longitude, latitude for a given listing
    '''
    # the listing's own div carries these values as HTML attributes,
    # so they are read from apt directly and locator is not needed
    listingID = apt.attrs['listing_id']
    longitude = apt.attrs['longitude']
    latitude  = apt.attrs['latitude']
        
    return {
        'listingID':listingID,
        'longitude':longitude,
        'latitude' :latitude,
       }
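
As a small illustration of how tag attributes are read, the sketch below parses a single hypothetical listing `div`. The attribute names follow `get_location`; the values and the missing `sqft` attribute are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup: attribute names as used in get_location,
# values invented for illustration.
html = (
    '<div class="search-listing" listing_id="15897278" '
    'latitude="40.7592" longitude="-73.996"></div>'
)
apt = BeautifulSoup(html, "html.parser").select_one("div.search-listing")

# attrs is a plain dict of attribute name -> string value
listing_id = apt.attrs["listing_id"]

# Tag.get returns None for a missing attribute instead of raising KeyError
missing = apt.get("sqft")

# attribute values are strings, so convert before doing arithmetic
coords = (float(apt["latitude"]), float(apt["longitude"]))
print(listing_id, missing, coords)
```

Using `apt.get(...)` instead of `apt.attrs[...]` may be worth considering if some listings lack one of the attributes.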

In [6]:
#getting the link for the property detail page
def get_link(apt,locator=LINK_LOCATOR):
    '''
    return the link to the detail page for each apartment
    '''
    link = apt.select_one(locator)
    return link.attrs['href']

In [7]:
def get_address(apt,locator=ADRNAME_LOCATOR):
    '''
      return the apartment address
    '''
    return apt.select_one(locator).text


In [8]:
def get_neighborhood(apt,locator=NAME_LOCATOR):
    '''
    return the neighborhood for a given listing
    '''
    name=apt.select_one(locator)
    return name.text.strip()

In [9]:
def get_description(soup,locator=DESC_LOCATOR):
    '''
    return the description paragraphs for each listing
    '''
    # select() always returns a list (possibly empty), never None
    return get_all_listings(soup,locator)

In [10]:
def get_price_bed_bath(soup_detail,locator=PRICE_LOCATOR):
    '''
    From the detail page, return the price and the number of beds and baths
    for the listing
    '''
    result={}
    # note: the details table is always selected with SQFT_LOCATOR; the locator argument is unused
    price_bath_bed=get_all_listings(soup_detail,locator=SQFT_LOCATOR)[0].text.split('\n\n')
    price_bath_bed=[b.strip() for b in price_bath_bed if len(b)>0]
    if price_bath_bed[0][0]=='$':
        price =float(price_bath_bed[0].strip().replace(',','')[1:])
        result['price']=price
    
    n = len(price_bath_bed)
    for i in range(1,n):
        element = price_bath_bed[i].split(' ')#[0]
        if len(element)>1:
            result[element[1].strip().replace('\n/','')]=element[0].strip()
        else:
            result['Bed']=element[0].strip()
    
    return result
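
The parsing rules used above can be tried without any network access. The sketch below is a simplified variant, not the notebook's function: `parse_detail_tokens` is a hypothetical helper, it checks every token for a dollar sign rather than only the first, and the sample tokens are an assumption about what the cleaned table cells look like.

```python
def parse_detail_tokens(tokens):
    """Turn cleaned table cells into a dict of listing details."""
    result = {}
    for token in tokens:
        if token.startswith("$"):
            # "$3,625" -> 3625.0
            result["price"] = float(token[1:].replace(",", ""))
            continue
        parts = token.split(" ")
        if len(parts) > 1:
            # "1 Bath" -> {"Bath": "1"}, "No Fee" -> {"Fee": "No"}
            result[parts[1]] = parts[0]
        else:
            # a single word such as "Studio" is stored under "Bed"
            result["Bed"] = parts[0]
    return result


# Assumed sample of cleaned cell texts from a detail page.
sample = ["$3,625", "Studio", "1 Bath", "No Fee"]
details = parse_detail_tokens(sample)
print(details)
```

Because the keys come from the page text itself, listings with different cells produce dicts with different keys.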


#### 4- Putting all functions together to scrape a page
In this section, we are going to combine all the functions defined above to get all the attributes for the apartments on a given page.

In [11]:
def scrape_page(url):
    soup = get_url(url)
    apartments=get_all_listings(soup,locator=APTS_LOCATOR)
    result=[]
    
    
    for apt in apartments:
        apartment={}

        #getting location
        location = get_location(apt,locator=LOC_LOCATOR)
        #getting apartment link
        link=get_link(apt,locator=LINK_LOCATOR)
        #getting property address
        address=get_address(apt,locator=ADRNAME_LOCATOR)
        #getting apts name
        neighborhood = get_neighborhood(apt,locator=NAME_LOCATOR)
        #getting the property detail page
        soup_detail=get_url(link)

        #getting renting price, apart bed, bath and sqft
        apt_detail =get_price_bed_bath(soup_detail,PRICE_LOCATOR)

        description = get_description(soup_detail,locator=DESC_LOCATOR)

        apartment.update(location)
        apartment['link']=link
        apartment['address']=address
        apartment['neighborhood']=neighborhood
        apartment.update(apt_detail)
        apartment['description']=description
        
        result.append(apartment)

    return result

Each results page has 20 listings, so scraping 100 pages would give 100 × 20 = 2,000 apartments. The loop below makes 9 requests (`range(1, 10)`); widen the range to collect more data for the analysis.

In [12]:
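all_pages_parsed=[]
for i in range(1,10):
    # i is the results page number; `page` as the query parameter name is an assumption about the site's URL scheme
    target_page=f"https://www.renthop.com/search/nyc?max_price=50000&min_price=0&page={i}&sort=hopscore&q=&search=0"
    #print(target_page)
    result=scrape_page(target_page)
    all_pages_parsed.extend(result)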
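
One way to keep the search URL readable is to build the query string from a dict. This is a sketch only: `build_page_url` is a hypothetical helper, and treating `page` as the pagination parameter is an assumption about RentHop's URL scheme.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.renthop.com/search/nyc"


def build_page_url(page, min_price=0, max_price=50000):
    """Build the search URL for one results page (hypothetical helper)."""
    params = {
        "max_price": max_price,
        "min_price": min_price,
        "page": page,          # assumed name of the pagination parameter
        "sort": "hopscore",
        "q": "",
        "search": 0,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# URLs for the first three results pages
urls = [build_page_url(p) for p in range(1, 4)]
for url in urls:
    print(url)
```

Keeping the price bounds fixed and varying only the page number means each request asks for a different slice of the same search.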


In [13]:
# take the final result from our last step and create a pandas DataFrame
data=pd.DataFrame(all_pages_parsed)
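
Since each listing dict can carry a different set of keys, it helps to see how pandas handles that. The records below are invented for illustration; only the column names echo the ones used in this notebook.

```python
import pandas as pd

# Two made-up records with different keys: the first has no Sqft,
# the second has no Fee.
records = [
    {"listingID": "1", "price": 3625.0, "Bath": "1", "Fee": "No"},
    {"listingID": "2", "price": 5130.0, "Bath": "2", "Sqft": "1100"},
]
df = pd.DataFrame(records)

# The columns are the union of all keys; missing values become NaN.
print(df.columns.tolist())
print(df.isna().sum().to_dict())
```

This is why columns such as Bed, Fee, or Sqft can be empty for some rows: the corresponding key was simply absent for that listing.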

In [14]:
#a view of our data
data.head()

| Unnamed: 0 | listingID | longitude | latitude | link | address | neighborhood | price | Bed | Bath | Fee | description | Sqft |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15897278 | -73.996 | 40.7592 | https://www.renthop.com/listings/561-10th-aven... | 561 10th Avenue, Apt 37A | Hell's Kitchen, Midtown Manhattan, Manhattan | 3625.0 | | 1 | No | [[All photos, amenities, and descriptions are ... | |
| 1 | 15905872 | -73.9964 | 40.7442 | https://www.renthop.com/listings/west-23/714/1... | West 23 | Chelsea, Midtown Manhattan, Manhattan | 2795.0 | | 1 | | [[Chelsea is located on the West Side of Manha... | |
| 2 | 15910427 | -73.9758 | 40.7463 | https://www.renthop.com/listings/236-e-36th-st... | 236 E 36th St, Apt 2J | Murray Hill, Midtown Manhattan, Manhattan | 2700.0 | | 1 | | [[In the early 1900s, Murray Hill was known fo... | |
| 3 | 15905865 | -74.0073 | 40.7381 | https://www.renthop.com/listings/jane-street/n... | Jane Street | West Village, Downtown Manhattan, Manhattan | 3495.0 | | 1 | | [[The West Village is known for its bohemian c... | |
| 4 | 15902892 | -74.0162 | 40.7056 | https://www.renthop.com/listings/west-street/2... | West Street | Financial District, Downtown Manhattan, Manhattan | 5130.0 | | 2 | No | [[All photos, amenities, and descriptions are ... | 1100.0 |


#### Adding Zip code
When dealing with geographic data, it is important to add the zip code, which the scraped data does not directly provide.
We can get this information using the Google Geocoding API. This is not a free service, but the free credit that Google provides is enough for our purpose.
We will sign up at https://developers.google.com/maps/documentation/geocoding/intro.
Click **Get Started** in the upper right corner of the page.

Make sure to copy and save the API key; it is needed for the calls below.

In [15]:
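import os

# create the Google Maps client; the key is read from an environment variable (GOOGLE_MAPS_API_KEY is an assumed name)
gmaps = googlemaps.Client(key=os.environ['GOOGLE_MAPS_API_KEY'])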

In [16]:
#Example of location
ta = data.loc[3,['address']].values[0]+' '+data.loc[3,['neighborhood']].values[0].split(', ')[-1]
ta

'Jane Street Manhattan'

In [17]:
#Example of JSON return
geocode_result=gmaps.geocode(ta)
geocode_result

[{'address_components': [{'long_name': 'Jane Street',
    'short_name': 'Jane St',
    'types': ['route']},
   {'long_name': 'Manhattan',
    'short_name': 'Manhattan',
    'types': ['political', 'sublocality', 'sublocality_level_1']},
   {'long_name': 'New York',
    'short_name': 'New York',
    'types': ['locality', 'political']},
   {'long_name': 'New York County',
    'short_name': 'New York County',
    'types': ['administrative_area_level_2', 'political']},
   {'long_name': 'New York',
    'short_name': 'NY',
    'types': ['administrative_area_level_1', 'political']},
   {'long_name': 'United States',
    'short_name': 'US',
    'types': ['country', 'political']},
   {'long_name': '10014', 'short_name': '10014', 'types': ['postal_code']}],
  'formatted_address': 'Jane St, New York, NY 10014, USA',
  'geometry': {'bounds': {'northeast': {'lat': 40.7387725, 'lng': -74.002056},
    'southwest': {'lat': 40.737537, 'lng': -74.0096307}},
   'location': {'lat': 40.738158, 'lng': -74.00

In [18]:
#how do we get the postal code
for p in geocode_result[0]['address_components']:
    if 'postal_code' in p['types']:
        print(p['short_name'])

10014
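
The lookup above can be wrapped in a small function that also copes with empty responses. `extract_postal_code` is a hypothetical helper name, and the sample input is a trimmed copy of the response structure shown earlier.

```python
def extract_postal_code(geocode_result):
    """Return the first postal code in a geocoding response, or None."""
    if not geocode_result:
        # the geocoder found nothing for this query
        return None
    for component in geocode_result[0].get("address_components", []):
        if "postal_code" in component.get("types", []):
            return component["short_name"]
    return None


# Trimmed sample with the same shape as the response shown above.
sample = [{
    "address_components": [
        {"long_name": "Jane Street", "short_name": "Jane St", "types": ["route"]},
        {"long_name": "10014", "short_name": "10014", "types": ["postal_code"]},
    ]
}]

found = extract_postal_code(sample)
not_found = extract_postal_code([])
print(found, not_found)
```

Returning `None` for an empty result avoids an `IndexError` on `geocode_result[0]` when an address cannot be geocoded.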


In [19]:
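def get_zip(row):
    '''
    return the zip code for a given listing, or NaN when the address has
    no street number or no postal code is found
    '''
    try:
        allrow = row['address'] + ' ' + row['neighborhood'].split(', ')[-1]
        # only geocode addresses that start with a street number
        if not re.match(r'^\d+\s\w', allrow):
            return np.nan
        geocode_result = gmaps.geocode(allrow)
        for piece in geocode_result[0]['address_components']:
            if 'postal_code' in piece['types']:
                return piece['short_name']
        # no postal code in the response
        return np.nan
    except Exception:
        # missing fields, empty results, or API errors all map to NaN
        return np.nan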
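
The regular expression in `get_zip` decides which addresses are sent to the geocoder. The sketch below applies the same pattern to query strings built from addresses seen in this notebook, which shows why listings without a street number end up without a zip code.

```python
import re

# Same pattern as in get_zip: digits, one whitespace, then a word character.
STREET_NUMBER = re.compile(r"^\d+\s\w")

queries = [
    "561 10th Avenue, Apt 37A Manhattan",
    "Jane Street Manhattan",
    "West 23 Manhattan",
    "236 E 36th St, Apt 2J Manhattan",
]

# True means the query starts with a street number and would be geocoded.
would_geocode = [bool(STREET_NUMBER.match(q)) for q in queries]
for query, ok in zip(queries, would_geocode):
    print(f"{query!r}: {ok}")
```

Only queries that begin with a house number pass the check; street-only addresses such as "Jane Street" are skipped and get NaN.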
 

After defining the function to extract the zip code, let's apply it to our dataframe.

In [20]:
data['zip'] = data.apply(get_zip, axis=1)

In [21]:
#A view of our data with zip code.
data.head()

| Unnamed: 0 | listingID | longitude | latitude | link | address | neighborhood | price | Bed | Bath | Fee | description | Sqft | zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15897278 | -73.996 | 40.7592 | https://www.renthop.com/listings/561-10th-aven... | 561 10th Avenue, Apt 37A | Hell's Kitchen, Midtown Manhattan, Manhattan | 3625.0 | | 1 | No | [[All photos, amenities, and descriptions are ... | | 10036.0 |
| 1 | 15905872 | -73.9964 | 40.7442 | https://www.renthop.com/listings/west-23/714/1... | West 23 | Chelsea, Midtown Manhattan, Manhattan | 2795.0 | | 1 | | [[Chelsea is located on the West Side of Manha... | | |
| 2 | 15910427 | -73.9758 | 40.7463 | https://www.renthop.com/listings/236-e-36th-st... | 236 E 36th St, Apt 2J | Murray Hill, Midtown Manhattan, Manhattan | 2700.0 | | 1 | | [[In the early 1900s, Murray Hill was known fo... | | 10016.0 |
| 3 | 15905865 | -74.0073 | 40.7381 | https://www.renthop.com/listings/jane-street/n... | Jane Street | West Village, Downtown Manhattan, Manhattan | 3495.0 | | 1 | | [[The West Village is known for its bohemian c... | | |
| 4 | 15902892 | -74.0162 | 40.7056 | https://www.renthop.com/listings/west-street/2... | West Street | Financial District, Downtown Manhattan, Manhattan | 5130.0 | | 2 | No | [[All photos, amenities, and descriptions are ... | 1100.0 | |


### Conclusion
With this dataframe defined, the scraping section ends. We save our dataframe to CSV and continue the analysis from that file.

In [22]:
data.to_csv('apts_with_zip.csv')
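
When the CSV is read back, two details are easy to miss: the saved index reappears as an `Unnamed: 0` column, and zip codes are parsed as floats (for example `10036.0`). The sketch below shows one possible way to avoid both, using a tiny made-up frame and an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the scraped data.
df = pd.DataFrame({
    "listingID": ["15897278", "15905872"],
    "zip": ["10036", np.nan],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False: do not write the row index
buffer.seek(0)

# Read identifiers and zip codes as strings so they are not turned into floats.
reloaded = pd.read_csv(buffer, dtype={"listingID": str, "zip": str})
print(reloaded.columns.tolist())
print(reloaded)
```

Keeping zip codes as strings also preserves leading zeros, which matters for zip codes outside New York.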

In [23]:
#zdf = data[data['zip'].notnull()].copy()