# Hotel Listings from The Green Book

In this notebook, we extract hotel listings from The Green Book using BeautifulSoup.

This notebook was written at the request of a friend who needed hotel listings.

## Imports

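First, we import the libraries we need: `csv` for writing the output, `requests` for fetching pages, and BeautifulSoup for parsing the HTML.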

In [1]:
import csv
import requests
from bs4 import BeautifulSoup as bs

## Using BeautifulSoup to Extract Listings

Next, we use BeautifulSoup to extract the fields we need.

Listings in The Green Book are split into pages of 15 listings each, with the following URL structure:
`http://www.thegreenbook.com/companies/search/(category)/page/(page number)/`.

Our friend needs listings for hotels, so the category will be `hotel`.
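
The URL pattern above can be captured in a small helper, which also makes it easier to switch categories later. This is only a minimal sketch: the name `build_search_url` is hypothetical and not part of the original notebook, and it assumes that the category is a single URL-safe word and that other categories follow the same pattern as `hotel`.

```python
BASE_URL = 'http://www.thegreenbook.com/companies/search'

def build_search_url(category, page):
    # Hypothetical helper: fills in the (category) and (page number) slots
    # of the URL structure described above.
    return '{}/{}/page/{}/'.format(BASE_URL, category, page)

print(build_search_url('hotel', 1))
# http://www.thegreenbook.com/companies/search/hotel/page/1/
```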

First, we check that BeautifulSoup parses the page as expected:

In [2]:
def get_html_content(url):
    # Get the page
    result = requests.get(url)

    # Get page content
    return bs(result.content, 'html.parser') if result.status_code == 200 else None

In [3]:
url = 'http://www.thegreenbook.com/companies/search/hotel/page/1/'

soup = get_html_content(url)
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head id="Head1">
  <meta content="SG" name="geo.region"/>
  <meta content="Singapore" name="geo.placename"/>
  <meta content="1.352083;103.819836" name="geo.position"/>
  <meta content="1.352083, 103.819836" name="ICBM"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="all" name="robots"/>
  <meta content="noindex" id="robot" name="robots"/>
  <!-- InstanceBeginEditable name="doctitle" -->
  <!-- InstanceEndEditable -->
  <script src="http://www.thegreenbook.com/_js/master-combine.js" type="text/javascript">
  </script>
  <script src="http://www.thegreenbook.com/js/searchresults/search_log.js" type="text/javascript">
  </script>
  <!--For CSS-->
  <link href="/_css/global_style.css" rel="stylesheet" type="text/css"/>
  <link href="/_css/style_mq320.css" media="screen and 

From the extracted HTML, we can see that each listing has the following structure (truncated for brevity):

```html
<div class="info">
    <h3><a class="H4Ver2" itemprop="CompanyName">Name of Hotel</a></h3>
    <p class="address" itemprop="CompanyAddress">Address of Hotel</p>
</div>
```

We use the `info`, `H4Ver2`, and `address` class names to pull out each hotel's name and address.

In [4]:
for el in soup.find_all('div','info'):
    print(el.find('a','H4Ver2').get_text().strip(),
          el.find('p','address').get_text().strip())

Hotel 165 165 Kitchener Rd  Tong Guan Bldg S(208532)
Hotel 1929 Pte Ltd 50 Keong Saik Rd   S(089154)
HOTEL 34 34 Lor 22 Geylang   S(398691)
Hotel 6 5 Lor 6 Geylang #01-00  S(399167)
Hotel 81-Balestier 226 Balestier Rd  Hotel 81 Balestier S(329688)
Hotel 81-Bencoolen 41 Bencoolen St #01-00 S(189623)
Hotel 81-Bugis 31 Middle Rd   S(188995)
Hotel 81-Changi 428 Changi Rd #01-01  S(419871)
Hotel 81-Cherry 3 Lor 12 Geylang   S(399014)
Hotel 81-Chinatown 181 New Bridge Rd   S(059418)
Hotel 81-Classic 12 Joo Chiat Rd   S(427353)
Hotel 81-Cosy 8 Jiak Chuan Rd S(089263)
Hotel 81-Dickson 3 Dickson Rd #01-00  S(209530)
Hotel 81-Elegance 30 Foch Rd   S(209276)
Hotel 81-Fuji 269 Balestier Rd #01-00  S(329720)


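Each address ends with the postal code in the form `S(nnnnnn)`, which is 9 characters long. We can therefore split the address into the street address and the six-digit postal code by slicing from the end of the string.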

In [5]:
address = '5 Lor 6 Geylang #01-00  S(399167)'

print(address[:-9].strip())
print(address[-7:-1])

5 Lor 6 Geylang #01-00
399167
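
Slicing by fixed offsets works only when the address ends in exactly `S(nnnnnn)`. As a more defensive sketch, a regular expression can locate the postal code and fall back to an empty string when it is missing. The function name `split_address` is hypothetical, and the assumption that the postal code always appears as `S(`, six digits, then `)` is based only on the sample output above.

```python
import re

# Matches a trailing postal code such as "S(399167)"
POSTCODE_PATTERN = re.compile(r'S\((\d{6})\)\s*$')

def split_address(full_address):
    match = POSTCODE_PATTERN.search(full_address)
    if match is None:
        # No recognisable postal code: keep the whole string as the address
        return full_address.strip(), ''
    return full_address[:match.start()].strip(), match.group(1)

print(split_address('5 Lor 6 Geylang #01-00  S(399167)'))
# ('5 Lor 6 Geylang #01-00', '399167')
```

Returning an empty postal code, rather than raising an error, lets the rest of the extraction continue and leaves the odd rows easy to find in the CSV afterwards.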


## Putting it All Together

Now we will put what we know about the data we're dealing with into a single function:

In [6]:
def extract_contents(page):
    url = 'http://www.thegreenbook.com/companies/search/hotel/page/' + str(page) + '/'
    soup = get_html_content(url)

    # get_html_content returns None on a non-200 response
    if soup is None:
        return []

    listings = soup.find_all('div', 'info')

    def get_hotel_contents(el):
        hotel_name = el.find('a', 'H4Ver2').get_text().strip()
        hotel_full_address = el.find('p', 'address').get_text().strip()
        # The last 9 characters hold the postal code in the form S(nnnnnn)
        hotel_address = hotel_full_address[:-9].strip()
        hotel_postcode = hotel_full_address[-7:-1]

        return hotel_name, hotel_address, hotel_postcode

    return [get_hotel_contents(el) for el in listings]

In [7]:
# Extract contents from the first page only
extract_contents(1)

[('Hotel 165', '165 Kitchener Rd  Tong Guan Bldg', '208532'),
 ('Hotel 1929 Pte Ltd', '50 Keong Saik Rd', '089154'),
 ('HOTEL 34', '34 Lor 22 Geylang', '398691'),
 ('Hotel 6', '5 Lor 6 Geylang #01-00', '399167'),
 ('Hotel 81-Balestier', '226 Balestier Rd  Hotel 81 Balestier', '329688'),
 ('Hotel 81-Bencoolen', '41 Bencoolen St #01-00', '189623'),
 ('Hotel 81-Bugis', '31 Middle Rd', '188995'),
 ('Hotel 81-Changi', '428 Changi Rd #01-01', '419871'),
 ('Hotel 81-Cherry', '3 Lor 12 Geylang', '399014'),
 ('Hotel 81-Chinatown', '181 New Bridge Rd', '059418'),
 ('Hotel 81-Classic', '12 Joo Chiat Rd', '427353'),
 ('Hotel 81-Cosy', '8 Jiak Chuan Rd', '089263'),
 ('Hotel 81-Dickson', '3 Dickson Rd #01-00', '209530'),
 ('Hotel 81-Elegance', '30 Foch Rd', '209276'),
 ('Hotel 81-Fuji', '269 Balestier Rd #01-00', '329720')]

## Writing to File

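Finally, we write the contents to a CSV file.

The loop runs **31 times** because there are 31 pages of hotel listings in total. The file is opened in append mode, so rerunning the loop adds duplicate rows unless `hotels.csv` is deleted first.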

In [8]:
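def write_list_to_csv(filename, lists):
    # Append mode, so rows from successive pages accumulate in one file;
    # newline='' stops the csv module from writing blank lines on Windows
    with open(filename, 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(lists)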

In [9]:
for i in range(31):
    hotels = extract_contents(i + 1)
    write_list_to_csv('hotels.csv', hotels)
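
Hard-coding 31 pages means the loop must be edited whenever the directory grows or shrinks. One alternative, sketched below, is to keep requesting pages until one comes back with no listings. This assumes that a page past the end yields an empty result rather than repeating the last page, which has not been verified against the site. The name `collect_all_pages` and the `max_pages` safety cap are hypothetical; `fetch_page` stands in for `extract_contents`, provided it returns an empty list for a page with no listings.

```python
def collect_all_pages(fetch_page, max_pages=100):
    # fetch_page is any callable that maps a page number to a list of rows,
    # for example the extract_contents function defined earlier.
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if not page_rows:
            # Assumed stopping condition: an empty page marks the end
            break
        rows.extend(page_rows)
    return rows

# Demonstration with a stand-in fetcher that has three pages of data
fake_pages = {1: [('A',)], 2: [('B',)], 3: [('C',)]}
print(collect_all_pages(lambda page: fake_pages.get(page, [])))
# [('A',), ('B',), ('C',)]
```

In the notebook, `write_list_to_csv('hotels.csv', collect_all_pages(extract_contents))` could then take the place of the fixed loop.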

## Suggested Improvements

* Corner cases:
    * Postal code not at the expected position
    * Malaysian addresses
* Fetching pages in parallel (for example, with multiprocessing), since the network requests rather than the CSV write are likely to dominate the run time
* Adapting code to other kinds of listing within The Green Book
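
As a rough sketch of the performance idea in the list above: most of the run time is probably spent waiting on HTTP responses, so this example swaps multiprocessing for a thread pool from the standard library's `concurrent.futures`. The name `fetch_pages_concurrently` and the `max_workers` value are hypothetical, and `fetch_page` stands in for `extract_contents`.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_pages_concurrently(fetch_page, pages, max_workers=4):
    # executor.map preserves the order of `pages` in its results
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = executor.map(fetch_page, pages)
        # Flatten the per-page lists into one list of rows
        return [row for page_rows in results for row in page_rows]

# Demonstration with a stand-in fetcher instead of real HTTP requests
rows = fetch_pages_concurrently(lambda page: [('hotel', page)], range(1, 4))
print(rows)
# [('hotel', 1), ('hotel', 2), ('hotel', 3)]
```

Keeping `max_workers` small is a courtesy to the server, and writing all rows with a single `write_list_to_csv` call afterwards keeps the file writes sequential.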