# TripAdvisor Crawling Suite User Guide

Kevin Jiahua Du <br />
14 July, 2017

In [2]:
import datetime
print('Last Updated: ' + str(datetime.datetime.now()))

Last Updated: 2017-07-14 04:30:01.661954


This guide will walk you through the steps of using the crawling suite to effortlessly collect hotel data from TripAdvisor, including basic hotel information, reviews, and user profiles.

## Initialization

First of all, we need to configure the suite to specify locations where we want to collect hotel data.
Open `config.ini` with a text editor, and you will see something like this:

> **[THREAD]** <br />
SleepTime = 2 <br />
SnippetThread = 3 <br />
HotelThread = 3 <br />
ReviewThread = 3 <br />
UserThread = 3 <br />
**[LOCATION]** <br />
List = placeholder

If you are not familiar with threads, just leave the THREAD section as it is. 
Otherwise, edit it as needed: `SleepTime` is the number of seconds to wait between connections to TripAdvisor, and `SnippetThread`, `HotelThread`, `ReviewThread`, and `UserThread` set the number of threads in each crawling module. 
In practice, a pause of at least two seconds and at most three threads per crawler are recommended.
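
To make the layout of `config.ini` concrete, here is a minimal sketch of how such a file could be read with Python's standard `configparser`. It is not the suite's own `load_config()`; the function name `read_settings` and the returned dictionary layout are assumptions made for illustration.

```python
import configparser

# Sample content mirroring the config.ini shown above, with the placeholder
# replaced by the Melbourne URL used later in this guide.
SAMPLE = """
[THREAD]
SleepTime = 2
SnippetThread = 3
HotelThread = 3
ReviewThread = 3
UserThread = 3
[LOCATION]
List = https://www.tripadvisor.com.au/Hotels-g255100-Melbourne_Victoria-Hotels.html
"""

THREAD_KEYS = ('SnippetThread', 'HotelThread', 'ReviewThread', 'UserThread')


def read_settings(text):
    """Parse config text into a plain dict (hypothetical helper)."""
    # Interpolation is disabled so a '%' inside a URL is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    thread = parser['THREAD']
    return {
        'sleep_time': thread.getfloat('SleepTime'),
        'threads': {key: thread.getint(key) for key in THREAD_KEYS},
        # Locations are separated by semicolons, with no spaces.
        'urls': [u.strip() for u in parser['LOCATION']['List'].split(';')
                 if u.strip()],
    }


settings = read_settings(SAMPLE)
print(settings['sleep_time'], settings['threads'])
```

Keeping the parsed values in a plain dictionary makes it easy to check them before any crawling starts, for example to confirm that the pause is at least two seconds.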

Telling the suite where to collect data is easy. The simplest way is to replace the placeholder with a URL describing a location.
You can find the URL after you submit a query for a location on TripAdvisor. For Melbourne, for example, it is: 

> https://www.tripadvisor.com.au/Hotels-g255100-Melbourne_Victoria-Hotels.html

The placeholder also accepts multiple locations: join all URLs with semicolons and no spaces, for example `URL_1;URL_2;URL_3;...;URL_n`. 
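
As a small illustration of the format, the following sketch builds the value for `List` from a Python list of URLs. The helper name `join_locations` is hypothetical and not part of the suite; the two URLs are the Melbourne and Victoria examples from this guide.

```python
def join_locations(urls):
    """Join location URLs with semicolons and no spaces (hypothetical helper)."""
    cleaned = [u.strip() for u in urls if u.strip()]
    for u in cleaned:
        # A semicolon or space inside a URL would corrupt the list format.
        if ';' in u or ' ' in u:
            raise ValueError('URL contains a semicolon or space: {!r}'.format(u))
    return ';'.join(cleaned)


value = join_locations([
    'https://www.tripadvisor.com.au/Hotels-g255100-Melbourne_Victoria-Hotels.html',
    'https://www.tripadvisor.com.au/Hotels-g255098-Victoria-Hotels.html',
])
print('List = ' + value)
```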

The suite also offers a crawling module that automatically gathers all location URLs within a given area, such as all cities within Victoria or even Australia. 
Once the gathering has finished, the module stores the URL string in the same directory.


In [3]:
from crawlers import crawlLocations

area_url = 'https://www.tripadvisor.com.au/Hotels-g255098-Victoria-Hotels.html'
crawlLocations.start(area_url)

locations in 10 pages.
[page 1] https://www.tripadvisor.com.au/Hotels-g255098-Victoria-Hotels.html
[page 2] https://www.tripadvisor.com.au/Hotels-g255098-oa20-Victoria-Hotels.html
[page 3] https://www.tripadvisor.com.au/Hotels-g255098-oa40-Victoria-Hotels.html
[page 4] https://www.tripadvisor.com.au/Hotels-g255098-oa60-Victoria-Hotels.html
[page 5] https://www.tripadvisor.com.au/Hotels-g255098-oa80-Victoria-Hotels.html
[page 6] https://www.tripadvisor.com.au/Hotels-g255098-oa100-Victoria-Hotels.html
[page 7] https://www.tripadvisor.com.au/Hotels-g255098-oa120-Victoria-Hotels.html
[page 8] https://www.tripadvisor.com.au/Hotels-g255098-oa140-Victoria-Hotels.html
[page 9] https://www.tripadvisor.com.au/Hotels-g255098-oa160-Victoria-Hotels.html
[page 10] https://www.tripadvisor.com.au/Hotels-g255098-oa180-Victoria-Hotels.html
195 locations found.
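
The page URLs in the output above follow a regular pattern: each page after the first inserts `-oa<offset>-` right after the location id, and the offset grows by 20 per page. The sketch below reproduces that pattern. It is inferred from the output only; the function name `page_urls` is hypothetical and this is not necessarily how `crawlLocations` builds its URLs.

```python
import re


def page_urls(first_page_url, pages, step=20):
    """Return the URLs of `pages` result pages, starting from the first one."""
    urls = [first_page_url]
    for i in range(1, pages):
        offset = i * step
        # Insert '-oa<offset>-' directly after the '-g<digits>-' location id.
        urls.append(re.sub(r'-(g\d+)-',
                           r'-\1-oa{}-'.format(offset),
                           first_page_url, count=1))
    return urls


area_url = 'https://www.tripadvisor.com.au/Hotels-g255098-Victoria-Hotels.html'
for number, url in enumerate(page_urls(area_url, 3), start=1):
    print('[page {}] {}'.format(number, url))
```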


Next, we need to set up a database called `ta.db` and create four tables for storing locations, hotels, reviews, and reviewers. This takes only a few lines of code.

In [1]:
from os.path import isfile
from tadb import taDB

fn = 'ta.db'
if not isfile(fn):
    with taDB(fn) as db:
        db.create_tables()
        print('database {} created.'.format(fn))

database ta.db created.
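
The cell above uses `taDB` as a context manager. Its implementation is not shown in this guide, so the following is only a sketch of how such a wrapper could look, assuming SQLite through the standard `sqlite3` module. The class name `SketchDB`, the table names, and the two-column layout are all invented for illustration; only the four table roles come from the guide.

```python
import sqlite3


class SketchDB:
    """Hypothetical stand-in for taDB: opens on enter, closes on exit."""

    def __init__(self, path):
        self.path = path
        self.conn = None

    def __enter__(self):
        self.conn = sqlite3.connect(self.path)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Commit if the block succeeded, roll back otherwise, then close.
        if exc_type is None:
            self.conn.commit()
        else:
            self.conn.rollback()
        self.conn.close()
        return False  # never swallow exceptions

    def create_tables(self):
        # Assumed layout: an id plus the raw HTML for each record type.
        for name in ('locations', 'hotels', 'reviews', 'users'):
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS {} '
                '(id TEXT PRIMARY KEY, html TEXT)'.format(name))

    def table_names(self):
        rows = self.conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name")
        return [row[0] for row in rows]


with SketchDB(':memory:') as db:
    db.create_tables()
    print(db.table_names())
```

Wrapping the connection in a context manager means the database is closed even if a crawl or extraction step raises an exception.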


Before we get started, we also need to turn on the event logger to record the crawling process and apply the configuration we just made to the suite.

In [2]:
from start import *

# Trigger the event logger.
init_logger()
# Load and set crawler parameters.
urls = load_config()

INFO - parameters: [0; 3; 3; 3; 5]
INFO - 2 locations found


Alright, so far we have done all the preparation.

## Data Acquisition

Collecting data from TripAdvisor using the suite is straightforward: the suite traverses each URL in the location list, invoking different modules to collect and dump the corresponding information in the form of raw HTML. 
Note that the four modules must be run in the order shown.
Nevertheless, depending on the type of data required, you can stop the procedure after any step, as long as the preceding steps have been completed.
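
The ordering rule can be sketched as a small runner that executes the stages in sequence and stops after a chosen one. The function `run_stages` and the stage labels are hypothetical; the placeholder callables stand in for the crawlers' `start()` functions so that the sketch runs without network access.

```python
def run_stages(stages, stop_after=None):
    """Run (name, callable) pairs in order; stop once `stop_after` has run."""
    names = [name for name, _ in stages]
    if stop_after is not None and stop_after not in names:
        raise ValueError('unknown stage: {}'.format(stop_after))
    completed = []
    for name, run in stages:
        run()
        completed.append(name)
        if name == stop_after:
            break
    return completed


# Placeholder callables; in real use each would call a crawler module.
stages = [
    ('snippets', lambda: None),
    ('hotels', lambda: None),
    ('reviews', lambda: None),
    ('users', lambda: None),
]

# Reviews are needed but user profiles are not, so stop after 'reviews'.
print(run_stages(stages, stop_after='reviews'))
```

Because the stages always run from the start of the list, a later stage can never run before the ones it depends on.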

In [3]:
import re
from crawlers import crawlSnippets
from crawlers import crawlHotels
from crawlers import crawlReviews
from crawlers import crawlUsers

# As an example, we only collect data 
# from Trawool and Mount Waverley.
for url in urls:
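    # Keep only the digits of the URL, i.e. the location id (raw string avoids an invalid-escape warning).
    gid = re.sub(r'\D', '', url)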
    crawlSnippets.start(gid, url.strip())
    crawlHotels.start(gid)
    crawlReviews.start(gid)
    crawlUsers.start()

INFO - [location 552264] Trawool Victoria
INFO - 1 hotels in 1 pages
INFO - 0 hotels in local cache
INFO - [page 1] https://www.tripadvisor.com.au/Hotels-g552264-Trawool_Victoria-Hotels.html?seen=0&sequence=1&geo=552264&requestingServlet=Hotels&refineForm=true&hs=&adults=2&rooms=1&o=a0&pageSize=&rad=0&dateBumped=NONE&displayedSortOrder=popularity
INFO - 	#0, totaling 1
INFO - [worker 2] shutting down
INFO - [worker 3] shutting down
INFO - [worker 1] shutting down
INFO - all hotel ids are ready
INFO - [hotel 660346] https://www.tripadvisor.com.au/Hotel_Review-g552264-d660346-Reviews-Comfort_Inn_Trawool-Trawool_Victoria.html
INFO - 	[hotel 660346] 72 reviews in 15 pages
INFO - 	[hotel 660346] [page 1] 5 reviews, totaling 5
INFO - 	[hotel 660346] [page 2] 5 reviews, totaling 10
INFO - 	[hotel 660346] [page 3] 5 reviews, totaling 15
INFO - 	[hotel 660346] [page 4] 5 reviews, totaling 20
INFO - 	[hotel 660346] [page 5] 5 reviews, totaling 25
INFO - 	[hotel 660346] [page 6] 5 reviews, totali

INFO - [user 471F5F614835BDAE44624ADCC5BC3179] PASSED: verified
INFO - [user 4F346C4C664117DE8D840F498F38806E] PASSED: verified
INFO - [user 546ACCC7FA3B724786513F49043ABE4F] PASSED: verified
INFO - [user 54F0A20ADAB6F11948BCE047DFB9F605] PASSED: verified
INFO - [user 5A434ACB51E8CE5FF2B4E68972C20F51] PASSED: verified
INFO - [user 5B09E3237D7B24CB450A534DBFBA02AA] PASSED: verified
INFO - [user 5D311C1A5C54D251F2EEDC8E79273CC3] PASSED: verified
INFO - [user 6BA070D60358CD1DA4CAD9D29BFF7CD1] PASSED: verified
INFO - [user 72C227ACF2158E2AFC461D91BA7B86ED] PASSED: verified
INFO - [user 75105453B9261F313D437E154646367C] PASSED: verified
INFO - [user 79E95D034C1E6EC625D350191185D00D] PASSED: verified
INFO - [user 7D5B2A809A82FAAFDDFECE414E2C13B5] PASSED: verified
INFO - [user 7FB95052A8FDC1458FA16F4063C3D4FF] PASSED: verified
INFO - [user 812AF6FA6D84E4A6EB7B6822E7943310] PASSED: verified
INFO - [user 83991B6C21929E9B62398A46DB9E8BF4] PASSED: verified
INFO - [user 83EAE255A3CB3B7261923F2AB68
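
The loop in this section derives the location id by removing every non-digit character from the URL. That works for the first-page URLs used here, but it would pick up extra digits from a URL that contains a page offset such as `oa20`. A more defensive variant, offered as a sketch rather than as the suite's own method, matches only the number after `-g`; the function name `location_id` is hypothetical.

```python
import re


def location_id(url):
    """Return the digits that follow '-g' in a TripAdvisor location URL."""
    match = re.search(r'-g(\d+)-', url)
    if match is None:
        raise ValueError('no location id found in {!r}'.format(url))
    return match.group(1)


first_page = ('https://www.tripadvisor.com.au/'
              'Hotels-g552264-Trawool_Victoria-Hotels.html')
second_page = ('https://www.tripadvisor.com.au/'
               'Hotels-g255098-oa20-Victoria-Hotels.html')

print(location_id(first_page))          # 552264
print(location_id(second_page))         # 255098
print(re.sub(r'\D', '', second_page))   # 25509820 (offset digits leak in)
```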

## Data Extraction

The suite then extracts the desired information offline by matching several patterns against the raw HTML strings.
Unlike the crawling modules, the extraction components can be run in any order.
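
To illustrate what offline pattern matching on stored HTML can look like, here is a toy sketch using the standard `re` module. The markup in `RAW_HTML`, the ids, the review texts, and the pattern are all made up for this example; they do not reflect TripAdvisor's real page structure or the patterns the suite actually uses.

```python
import re

# Invented markup standing in for a stored raw HTML string.
RAW_HTML = (
    '<div class="review" data-id="1">'
    '<span class="rating" data-score="4"></span>'
    '<p class="text">Quiet rooms.</p></div>'
    '<div class="review" data-id="2">'
    '<span class="rating" data-score="5"></span>'
    '<p class="text">Great breakfast.</p></div>'
)

# One pattern captures the id, score, and text of each review block.
REVIEW_PATTERN = re.compile(
    r'<div class="review" data-id="(?P<id>\d+)">.*?'
    r'data-score="(?P<score>\d)".*?'
    r'<p class="text">(?P<text>.*?)</p>',
    re.S)


def extract_reviews(html):
    """Return one dict per review found in the HTML string."""
    return [{'id': m.group('id'),
             'score': int(m.group('score')),
             'text': m.group('text').strip()}
            for m in REVIEW_PATTERN.finditer(html)]


for review in extract_reviews(RAW_HTML):
    print(review)
```

Since the input is a stored string rather than a live page, extraction of this kind can be repeated or refined at any time without contacting the website again.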

In [1]:
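# Re-import and redefine in case this cell runs in a fresh kernel.
from tadb import taDB

fn = 'ta.db'
with taDB(fn) as db: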
    db.extract_hotel_info()
    db.extract_review_info()
    db.extract_user_info()

extracting hotel 566124...
extracting hotel 660346...
hotel extraction done.
extracting review 104230596...
extracting review 108754073...
extracting review 111906174...
extracting review 112695566...
extracting review 116100228...
extracting review 120515398...
extracting review 121842783...
extracting review 122253992...
extracting review 122793559...
extracting review 124190506...
extracting review 129439069...
extracting review 129840731...
extracting review 132278498...
extracting review 134164136...
extracting review 136358640...
extracting review 142183627...
extracting review 143872401...
extracting review 144521437...
extracting review 146191725...
extracting review 157990498...
extracting review 161113564...
extracting review 165109868...
extracting review 165745798...
extracting review 170788761...
extracting review 174595897...
extracting review 175185987...
extracting review 175374291...
extracting review 178915904...
extracting review 179569259...
extracting review 184434

That's it. Now you have your data ready in the local database. Enjoy!