# Points of Interest (POIs): Collection, Retrieval, and Visualization

Kevin Jiahua Du <br />
12 April, 2018

In [1]:
import datetime
print('Last Updated: ' + str(datetime.datetime.now()))

Last Updated: 2020-07-23 18:06:51.493762


## Introduction

This notebook collects geo-tagged POIs in Australia from three online sources, retrieves the POIs neighboring a given location, and visualizes the retrieved POIs. Additional geo-related resources, tools, and libraries are listed at the end.

- Holiday in Australia
- GeoNames
- OpenStreetMap
- <del>TripAdvisor</del>

**Note**: The last two sources also contain POIs from around the world.

In [2]:
import json
import glob
import pandas as pd
import numpy as np
import ast
import math
import re
import sqlite3
from tqdm import tqdm
import swifter

import requests
from bs4 import BeautifulSoup

from IPython.display import HTML, display

PH = '_'

## 1. Collecting POIs from Holiday in Australia

[Holidays in Australia](https://www.australia.com/en/explore.html) introduces authority-recommended attractions within Australia.
It provides a range of travel information, such as places to go and things to do, each with a "must" subset.

To start, save the following GeoJSON files in a single folder named ```hia```.

- To do
    - [General](https://www.australia.com/content/dam/ta/mapbox/geojson/cat-do.geojson)
    - [Sleep](https://www.australia.com/content/dam/ta/mapbox/geojson/cat-sleep.geojson)
    - [Events](https://www.australia.com/content/dam/ta/mapbox/geojson/cat-events.geojson)
    - [Eat](https://www.australia.com/content/dam/ta/mapbox/geojson/cat-eat.geojson)
- Must do
    - [General](https://www.australia.com/content/dam/ta/mapbox/geojson/cat-all-MUSTDOTA.geojson)
    - [Sleep](https://www.australia.com/content/dam/ta/mapbox/geojson/cat-all-MUSTDOSLEEPTA.geojson)
    - [Events](https://www.australia.com/content/dam/ta/mapbox/geojson/cat-all-MustDoEventsTA.geojson)
    - [Eat](https://www.australia.com/content/dam/ta/mapbox/geojson/cat-all-MustdoEatTA.geojson)

1. Merge all downloaded GeoJSON files into a data frame.
1. Assign the ID, category and the 'Must' flag to each POI.
1. Remove redundant POIs.
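
Step 2 relies on reading the category and the 'Must' flag off each file name. The following is a minimal sketch of that mapping, assuming the eight file names listed above; `classify_geojson` and `CATEGORY_KEYWORDS` are hypothetical names, not part of the original notebook. The more specific keywords are tested before `do`, because every "Must do" file name also contains `do`.

```python
import os

# Most specific keywords first: every "must do" file name also contains "do".
CATEGORY_KEYWORDS = [
    ('sleep', 'SLEEP'),
    ('events', 'EVENT'),
    ('eat', 'EAT'),
    ('do', 'DO'),
]

def classify_geojson(path, placeholder='_'):
    """Return (category, must_flag) inferred from a GeoJSON file name."""
    name = os.path.basename(path).lower()
    must = int('must' in name)
    for keyword, category in CATEGORY_KEYWORDS:
        if keyword in name:
            return category, must
    # No known keyword: fall back to the placeholder category.
    return placeholder, must
```

For example, `classify_geojson('hia/cat-all-MustdoEatTA.geojson')` yields `('EAT', 1)`, whereas a check that tests `do` first would label the same file `DO`.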

In [None]:
df = []
for fn in glob.glob('hia/*.geojson'):
    fn_lower = fn.lower()
    # test the specific categories first: every "must do" file name
    # also contains 'do', so checking 'do' first would mislabel them
    cflag = PH
    if 'eat' in fn_lower:
        cflag = 'EAT'
    elif 'sleep' in fn_lower:
        cflag = 'SLEEP'
    elif 'events' in fn_lower:
        cflag = 'EVENT'
    elif 'do' in fn_lower:
        cflag = 'DO'
    with open(fn, encoding='utf8') as f:
        poi_jsons = json.load(f)['features']
    df.extend([
        {'cate': cflag, 'pid': x['properties']['atdwProductId']}
        # 'must': int('must' in fn_lower), 
        for x in poi_jsons
    ])
# keep one record per product ID
df = pd.DataFrame(df).drop_duplicates(subset='pid')
df['json_txt'] = PH

Gather details of individual attractions and then remove those without details.

In [None]:
hia_url = 'https://www.australia.com/bin/australia/atdw/product.%s.json'

json_txts = []
for pid, txt in tqdm(zip(df['pid'], df['json_txt'])):
    if txt == PH:
        r = requests.get(hia_url % pid)
        if r.status_code == 200:
            # successful
            txt = r.json()
        else:
            # failed
            txt = PH
    # else: cached
    json_txts.append(txt)
df['json_txt'] = json_txts

df.to_csv('hia/holiday_in_au.csv', header=True, index=False)

<!-- Fields shared across all POI records:
- address, allTypeIds, atdwProdId, currency, description, detailsUnavailable, disabledAccess, displayActivitiesOnFrontTile, displayCuisinesOnFrontTile, displayExperiencesOnFrontTile, displayTypesOnExpanded, displayTypesOnFrontTile, dreamTripId, geoCoordinates, imagePath, localAddress, location, media, productPagePath, productPixelURL, productShortDescription, state, title, typeIds, types. -->

Extract useful attributes from the JSON strings and export the POIs.

In [None]:
FIELDS = [
    'address', 'state', 'location', 'localAddress', 'title', 
    'description', 'allTypeIds', 'types', 
]

def parse_dict(j_str):
    json_dict = ast.literal_eval(j_str)
    # 'geoCoordinates' originally contains 'latitude' and 'longitude'
    row = json_dict['geoCoordinates']
    for field in FIELDS:
        row[field] = json_dict[field]
    return row

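df = pd.read_csv('hia/holiday_in_au.csv')
# drop POIs without details and reset the index so that the rows
# line up with json_df (which has a fresh 0..n-1 index) in the concat
df = df[df['json_txt']!=PH].reset_index(drop=True)
json_df = df['json_txt'].swifter.apply(parse_dict)
json_df = pd.DataFrame(json_df.tolist())
df = pd.concat([df, json_df], axis=1, sort=False)
df = df.drop('json_txt', axis=1)
df.to_csv('hia/hia_pois.csv', header=True, index=False)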
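
The detail records reach the CSV as the text form of Python dictionaries rather than as JSON, which is presumably why `parse_dict` uses `ast.literal_eval` instead of `json.loads`. The sketch below illustrates the difference with a made-up record; the field values are illustrative only and do not come from the real data.

```python
import ast
import json

# A made-up record in the shape of the detail JSON (values are illustrative).
record = {
    'title': 'Example Lookout',
    'geoCoordinates': {'latitude': -37.8, 'longitude': 144.9},
    'disabledAccess': False,   # Python writes False, JSON would write false
    'description': None,       # Python writes None, JSON would write null
}

as_text = str(record)                  # the text that ends up in the json_txt column
restored = ast.literal_eval(as_text)   # evaluates Python literals only, no arbitrary code

# The same text is not valid JSON (single quotes, False, None).
try:
    json.loads(as_text)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False
```

Storing `r.text` instead of `r.json()` would keep real JSON in the column and allow `json.loads`, at the cost of changing the cached file format.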

The first few records are shown below.

In [None]:
# the 'must' column only exists if the flag was enabled during collection
hia_df = pd.read_csv('hia/hia_pois.csv').drop(columns='must', errors='ignore')
hia_df.rename(columns={'latitude':'lat', 'longitude':'lng'}, inplace=True)
print(len(hia_df))
hia_df.head(5)

## 2. Collecting POIs from GeoNames

[GeoNames](http://www.geonames.org/) is a free geographical database that contains [gazetteer data](http://download.geonames.org/export/dump/) of over 11 million placenames worldwide. 
It also offers other useful resources, including [postal code data](http://download.geonames.org/export/zip/) and country boundaries (in both [plain text](http://download.geonames.org/export/dump/shapes_all_low.zip) and [GeoJSON](http://download.geonames.org/export/dump/shapes_simplified_low.json.zip)).


To start, save the following files in a single folder named ```geo```. They are tab-delimited text files without a header row, so each header below must be inserted as a tab-separated first line (the commas here only separate the column names).
- [The GeoNames core](http://download.geonames.org/export/dump/AU.zip)
    - Decompress the zip file and then insert the header at the beginning of the decompressed file (``AU.txt``).
        - ```geonameid, name, asciiname, alternatenames, latitude, longitude, feature_class, feature_code, country_code, cc2, admin1_code, admin2_code, admin3_code, admin4_code, population, elevation, dem, timezone, modification_date```
- [Administrative division code - level 1](http://download.geonames.org/export/dump/admin1CodesASCII.txt)
    - Insert the header at the beginning of the downloaded file.
        - ```code, name, name_ascii, geonameId```
- [Administrative division code - level 2](http://download.geonames.org/export/dump/admin2Codes.txt)
    - Insert the header at the beginning of the downloaded file.
        - ```code, name, name_ascii, geonameId```

<!-- - [Feature code](http://download.geonames.org/export/dump/featureCodes_en.txt) [optional]
    - You need to insert the header at the beginning of the downloaded file.
        - ```code, value, desc``` -->
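
A possible alternative to editing the downloaded files by hand is to pass the column names to pandas when reading them. The sketch below assumes the tab-delimited, header-less layout described above; `read_headerless_tsv` is a hypothetical helper, and the two sample rows (including the numeric IDs) are made up for illustration.

```python
import io
import pandas as pd

ADMIN_COLUMNS = ['code', 'name', 'name_ascii', 'geonameId']

def read_headerless_tsv(source, columns, **kwargs):
    """Read a tab-delimited dump that has no header row."""
    return pd.read_csv(source, sep='\t', header=None, names=columns, **kwargs)

# Two made-up rows in the layout of admin1CodesASCII.txt (IDs are placeholders).
sample = io.StringIO(
    "AU.07\tVictoria\tVictoria\t1001\n"
    "AU.02\tNew South Wales\tNew South Wales\t1002\n"
)
admin1 = read_headerless_tsv(sample, ADMIN_COLUMNS, dtype={'code': str})
```

With this approach the files stay exactly as downloaded, so re-downloading a newer dump does not require repeating the manual header step.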

Keep only the useful attributes and rename the columns. Please see the [README](http://download.geonames.org/export/dump/readme.txt) for more information.

In [None]:
df = pd.read_csv('geo/AU.txt', delimiter='\t')
df['gclass'] = [
    str(cls) + '.' + str(code) 
    for cls, code in zip(df.feature_class, df.feature_code)
]
df = df[[
    'geonameid', 'name', 'latitude', 'longitude', 
    'admin1_code', 'admin2_code', 
    'feature_class', 'feature_code', 'gclass',
]]
df.columns=[
    'gid', 'name', 'lat', 'lng', 
    'admin1', 'admin2', 
    'gclass1', 'gclass2', 'gclass',
]
df.drop_duplicates(subset='gid', inplace=True)

Convert administrative division code (both level 1 and 2) into names and export the POIs.

In [None]:
admin1_dict = pd.read_csv('geo/admin1CodesASCII.txt', delimiter='\t')
admin1_dict = admin1_dict[admin1_dict.code.str.startswith('AU')]
admin1_dict = dict(zip(admin1_dict.code, admin1_dict.name))

admin2_dict = pd.read_csv('geo/admin2Codes.txt', delimiter='\t')
admin2_dict = admin2_dict[admin2_dict.code.str.startswith('AU')]
admin2_dict = dict(zip(admin2_dict.code, admin2_dict.name))

In [None]:
def get_admin_names(row):
    a1, a2 = row.admin1, row.admin2
    n1, n2 = PH, PH
    
    if not math.isnan(a1):
        # admin level 1
        pat = 'AU.0'+str(int(a1))
        if pat in admin1_dict:
            n1 = admin1_dict[pat]
        # admin level 2
        try:
            pat+='.'+str(int(a2))
            if pat in admin2_dict:
                n2 = admin2_dict[pat]
        except (ValueError, TypeError):
            # admin2 is missing (NaN) or not numeric: keep the placeholder
            pass
    return  n1, n2

df['admin1'], df['admin2'] = zip(*df.apply(get_admin_names, axis=1))
df.to_csv('geo/geo_pois.csv', header=True, index=False)
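
The lookup in `get_admin_names` builds its keys by prefixing `AU.0` to the level-1 code, which only works for single-digit codes. The following sketch shows a slightly more general key builder, under the assumption that level-1 codes are zero-padded to two digits (as the `AU.0x` keys above suggest); `admin_keys` is a hypothetical helper and the codes in the usage note are illustrative.

```python
import math

def admin_keys(admin1, admin2, country='AU'):
    """Build lookup keys such as 'AU.07' and 'AU.07.24600' (or None if missing)."""
    def to_int(value):
        # Codes arrive as floats (or NaN) once pandas has parsed the column.
        try:
            number = float(value)
        except (TypeError, ValueError):
            return None
        return None if math.isnan(number) else int(number)

    a1, a2 = to_int(admin1), to_int(admin2)
    key1 = None if a1 is None else f'{country}.{a1:02d}'
    key2 = None if key1 is None or a2 is None else f'{key1}.{a2}'
    return key1, key2
```

For instance, `admin_keys(7.0, 24600.0)` gives `('AU.07', 'AU.07.24600')`, and a missing level-1 code gives `(None, None)`, which can then be mapped to the placeholder.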

The first few records are shown below.

In [None]:
geo_df = pd.read_csv('geo/geo_pois.csv')
print(len(geo_df))
geo_df.head(5)

## 3. Collecting POIs from OpenStreetMap

[OpenStreetMap](https://www.openstreetmap.org) (OSM) is a community-maintained map database covering roads, trails, cafés, railway stations, etc. all over the world.

To start, save the following OSM archives in a single folder named ```osm```.

- Both [BBBike](https://download.bbbike.org/osm/bbbike/) and [Geofabrik](http://download.geofabrik.de/) provide extracts of the volunteered OSM data for a chosen region. Choose files ending with ``.osm.pbf`` where possible.
    - [Australia archive](http://download.geofabrik.de/australia-oceania.html)
    - [Melbourne archive](https://download.bbbike.org/osm/bbbike/Melbourne/)

Once the OSM archives are downloaded, go to [MorbZ's GitHub repository](https://github.com/MorbZ/OsmPoisPbf) and download the OSM POI extractor. Check ``readme.md`` for detailed instructions.
- [The POI extractor](https://github.com/MorbZ/OsmPoisPbf/releases)
- [The POI type mapping](https://github.com/MorbZ/OsmPoisPbf/blob/master/doc/poi_types.csv)



Extract POI entities and convert POI type code into names.
<!--- ```java -Xmx4g -jar osmpois.jar planet.osm.pbf``` --->

In [None]:
# mapping: poi_type_code -> poi_type_name
code_type_pair = pd.read_csv('osm/poi_types.csv')
code_type_pair = code_type_pair[~code_type_pair['POI TYPE'].isnull()]
code_type_pair = dict(zip(code_type_pair['CODE'], code_type_pair['POI TYPE']))

def open_openstreetmap(fn):
    tmp_pois = pd.read_csv(fn, delimiter='|', 
        names=['poi_type_id', 'poi_id', 'lat', 'lng', 'poi_name']
    )
    tmp_pois['poi_subtype'] = tmp_pois['poi_type_id'].apply(lambda x: code_type_pair[str(x)])
    tmp_pois['poi_type'] = tmp_pois['poi_subtype'].apply(lambda x: x[:x.index('_')])
    return tmp_pois.drop_duplicates()

df = open_openstreetmap('osm/au_pois.csv')

Remove POIs outside a specified bounding box (optional) and export the rest.

In [None]:
# e.g., the bounding box (not the actual boundary) of Victoria with a slight offset
# bbox = [north, south, west, east] in decimal degrees
# bbox = [-33-58/60, -39-10/60, 140+57/60, 150+1/60]
# bbox = [bbox[0]+0.1, bbox[1]-0.1, bbox[2]-0.1, bbox[3]+0.1]
# df = df[(df.lat>bbox[1]) & (df.lat<bbox[0]) & (df.lng>bbox[2]) & (df.lng<bbox[3])]

df.to_csv('osm/osm_pois.csv', header=True, index=False)
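
The optional bounding-box step can be wrapped in a small function. This is a minimal sketch, assuming the `lat`/`lng` column names used above; `filter_bbox` and `VIC_BBOX` are hypothetical names, the box reuses the numbers from the commented-out example, and the two demo coordinates are approximate.

```python
import pandas as pd

def filter_bbox(pois, north, south, west, east, lat_col='lat', lng_col='lng'):
    """Keep the rows whose coordinates fall strictly inside the bounding box."""
    inside = (
        (pois[lat_col] > south) & (pois[lat_col] < north)
        & (pois[lng_col] > west) & (pois[lng_col] < east)
    )
    return pois[inside]

# Bounding box of Victoria from the commented-out example, padded by 0.1 degrees.
VIC_BBOX = dict(
    north=-33 - 58 / 60 + 0.1, south=-39 - 10 / 60 - 0.1,
    west=140 + 57 / 60 - 0.1, east=150 + 1 / 60 + 0.1,
)

# Two approximate city-centre coordinates for the demonstration.
demo = pd.DataFrame({
    'poi_name': ['Melbourne CBD', 'Sydney CBD'],
    'lat': [-37.814, -33.8688],
    'lng': [144.96332, 151.2093],
})
inside_vic = filter_bbox(demo, **VIC_BBOX)
```

Because a bounding box is only a rectangle, points just across the state border can still pass the filter; an exact test would need the boundary polygon.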

The first few records are shown below.

In [3]:
osm_df = pd.read_csv('osm/osm_pois.csv')
osm_df = osm_df[~osm_df['poi_type'].str.startswith('POI')]
print(len(osm_df))
osm_df.head(5)

224947


| Unnamed: 0 | poi_type_id | poi_id | lat | lng | poi_name | poi_subtype | poi_type |
|---|---|---|---|---|---|---|---|
| 1 | 139 | N17026283 | -31.435607 | 152.823424 | Cassegrain Winery | TOURIST_ATTRACTION | TOURIST |
| 2 | 16 | N17026830 | -31.428795 | 152.909814 | Court House | AMENITY_PUBLICBUILDING | AMENITY |
| 3 | 139 | N17027559 | -31.443056 | 152.919524 | Koala Hospital | TOURIST_ATTRACTION | TOURIST |
| 4 | 105 | N17027753 | -31.477181 | 152.914608 | Coles | SHOP_SUPERMARKET | SHOP |
| 5 | 40 | N17027848 | -31.473039 | 152.929881 | Public Reserve | LANDUSE_GRASS | LANDUSE |


## 4. Retrieving and Visualizing POIs

In [4]:
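# great-circle (haversine) distance in km between two [lat, lng] points given in degrees;
# with euclidean=True the result is the straight-line difference in degrees, not km
# (see the [Precision] section on https://en.wikipedia.org/wiki/Decimal_degrees)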
from sklearn.metrics.pairwise import haversine_distances
from math import radians
def get_distance(a, b, euclidean=False):
    if euclidean:
        return np.linalg.norm(np.asarray(a)-np.asarray(b))
    else:
        a_in_radians = [radians(_) for _ in a]
        b_in_radians = [radians(_) for _ in b]
        dist = haversine_distances([a_in_radians, b_in_radians])[-1][0]
        return dist * 6371000/1000


# get the top-k nearest POIs
def get_neighbors(cur_loc, known_pois, within_km, top_k=5):
    # within_km<=0 means no distance constraint
    # top_k<=0 returns all results
    ndf = known_pois.copy()
    coords = np.concatenate([ndf[['lat']], ndf[['lng']]], axis=1)
    ndf['dist'] = [get_distance(cur_loc, coord) for coord in coords]
    if within_km > 0:
        ndf = ndf[ndf['dist']<=within_km]
    ndf = ndf.sort_values('dist', ascending=True)
    if top_k > 0:
        ndf = ndf.head(top_k)
    return ndf
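
`get_neighbors` calls `get_distance` once per POI, which can be slow for a few hundred thousand rows. The sketch below computes the same haversine distance for all rows at once with NumPy; `haversine_km` is a hypothetical helper, not part of the original notebook, and it uses the same 6371 km Earth radius as `get_distance`.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, the value used in get_distance

def haversine_km(origin, lats, lngs):
    """Great-circle distance in km from one (lat, lng) origin to many points."""
    lat1, lng1 = np.radians(origin[0]), np.radians(origin[1])
    lat2 = np.radians(np.asarray(lats, dtype=float))
    lng2 = np.radians(np.asarray(lngs, dtype=float))
    dlat, dlng = lat2 - lat1, lng2 - lng1
    # haversine formula: a is the squared half-chord length between the points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

Inside `get_neighbors`, the list comprehension could then be replaced by `ndf['dist'] = haversine_km(cur_loc, ndf['lat'], ndf['lng'])`.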

As an example, the cell below retrieves up to 20 of the closest POIs within a 0.5 km radius of central Melbourne from the OpenStreetMap dataset; the following cells plot them on a map.

In [5]:
MELBOURNE = [-37.814, 144.96332]
WITHIN = 0.5
TOPK = 20

neighbors = get_neighbors(cur_loc=MELBOURNE, known_pois=osm_df, within_km=WITHIN, top_k=TOPK)
display(neighbors)   

| Unnamed: 0 | poi_type_id | poi_id | lat | lng | poi_name | poi_subtype | poi_type | dist |
|---|---|---|---|---|---|---|---|---|
| 98007 | 25 | N3472659754 | -37.813982 | 144.963292 | Federal Coffee Palace | FOOD_CAFE | FOOD | 0.003192 |
| 51100 | 138 | N2162178792 | -37.814098 | 144.963428 | The Public Purse | TOURIST_ART | TOURIST | 0.014448 |
| 104733 | 168 | N3945798969 | -37.814174 | 144.963363 | Stop 5: Bourke Street Mall, Elizabeth Street | TRANSPORT_TRAMSTOP | TRANSPORT | 0.019713 |
| 51676 | 168 | N2190619591 | -37.813984 | 144.963064 | Stop 3: Bourke Street Mall, Bourke Street | TRANSPORT_TRAMSTOP | TRANSPORT | 0.022567 |
| 27202 | 75 | N907778391 | -37.813848 | 144.963106 | Elizabeth Street/Bourke Street | SHOP_BICYCLE | SHOP | 0.02528 |
| 73700 | 80 | N2776401990 | -37.813766 | 144.963342 | H&M | SHOP_CLOTHES | SHOP | 0.026078 |
| 53243 | 88 | N2265081590 | -37.813921 | 144.9636 | Flowers @ GPO | SHOP_FLORIST | SHOP | 0.026118 |
| 104732 | 168 | N3945798968 | -37.814066 | 144.963633 | Stop 5: Bourke Street Mall, Elizabeth Street | TRANSPORT_TRAMSTOP | TRANSPORT | 0.028447 |
| 152016 | 29 | N6344541343 | -37.813748 | 144.963415 | CA de VIN | FOOD_RESTAURANT | FOOD | 0.029266 |
| 29793 | 80 | N1058524474 | -37.81432 | 144.963523 | Bardot | SHOP_CLOTHES | SHOP | 0.039743 |


In [6]:
import folium
from folium.map import Icon

color_options = list(Icon.color_options)
color_options.remove('black')
color_options.reverse()
poi_types = neighbors['poi_type'].unique().tolist()
if len(poi_types) <= len(color_options):
    poi_color = dict(zip(poi_types, color_options[:len(poi_types)]))
else:
    poi_color = dict(zip(poi_types, ['lightgray'] * len(poi_types)))

In [7]:
# check https://python-visualization.github.io/folium/quickstart.html for more details
# project POIs onto a map
m = folium.Map(location=MELBOURNE, zoom_start=16)
folium.Marker(
    width='30%', height='30%',
    location=MELBOURNE,
    popup='You Are Here',
    icon=folium.Icon(icon='arrow-down', color='black')
).add_to(m)

folium.Circle(location=MELBOURNE, radius=WITHIN*1000).add_to(m)

for lat, lng, pname, ptype in zip(
    neighbors['lat'], neighbors['lng'], 
    neighbors['poi_name'], neighbors['poi_type']
):
    folium.Marker(
        location=[lat, lng], popup='[%s]\n%s' % (ptype, pname),
        icon=folium.Icon(color=poi_color[ptype])
    ).add_to(m)

display(m)

## 5. More Geographic Resources

### 5.1 Geo-enhanced Social Media and GeoInfo Websites
The following platforms are popular sources for geo-based analysis.
- [Flickr](https://www.flickr.com/)
- [Google Maps](https://maps.google.com/)
- [Twitter](https://twitter.com/)
- [Foursquare](https://foursquare.com/)
- [Wikimapia](http://wikimapia.org/api/) utilizes an interactive "clickable" web map that marks and describes all geographical objects in the world.
- [TIXIK](http://www.tixik.com/info/api/) provides information about interesting places around the world, with hundreds of thousands of presentations with pictures and texts in multiple languages.


### 5.2 Travel Blogs/Texts
The following platforms are sources of travel text suitable for information extraction.
- [TravelBlog](https://www.travelblog.org/)
- [Off Exploring](https://www.offexploring.com/)
- [TravelJournal](http://www.traveljournal.com)
- [Official Site for Victoria](http://www.visitvictoria.com/)
- [Holidays in Australia](https://www.australia.com/en)
- [Yahoo! Travel](https://www.smartertravel.com/author/yahoo-travel/)


### 5.3 Boundary Determination
- [OpenStreetMap administrative boundaries in GeoJSON](https://peteris.rocks/blog/openstreetmap-administrative-boundaries-in-geojson/)
- [Here’s how to use data from OpenStreetMap for your infographics](http://www.konradlischka.info/en/2015/05/blog-en/heres-how-you-pull-data-from-openstreetmap-for-your-infographics/)
- [MapBox OSM Boundaries](https://github.com/mapbox/osm-boundaries)
- [Extracting Administrative Boundaries from OpenStreetMap](https://www.mysociety.org/2012/06/23/extracting-administrative-boundaries-from-openstreetmap/)


### 5.4 Weather and Season Data
- [Weather Underground](https://www.wunderground.com/) finds historical weather by searching for a city, zip code, or airport code.


### 5.5 Australia Open Data
Below are some public data sources released by the Australian government.

- [GOV.AU](https://www.gov.au/) offers a listing of websites that lead to many useful resources such as the Victorian Heritage Database.
- [Australian Government](https://www.australia.gov.au/) helps find government information and services.
- [Victoria's Open Data Directory](https://www.data.vic.gov.au/) finds data that Victorian government departments and agencies have opened and made available to the public. 
- [Australian Bureau of Statistics](http://www.abs.gov.au/browse?opendocument&ref=topBar) contains statistical data, including tourism and transport.
- [Department of Environment, Land, Water and Planning](https://www2.delwp.vic.gov.au/) provides information about Victoria's environment and natural resources, including spatial data.
- [Parks Victoria](http://parkweb.vic.gov.au/) provides information about Victoria's parks, including spatial data.
- [Australian Government Datasets](https://data.gov.au/dataset?tags=Boundaries) provides PSMA administrative boundaries of all states and territories in Australia.
    - Geographic coordinates making up polygons for each area.
    - Support multiple data formats, including PDF, SHP, WMS, WFS and GeoJSON.
- [Bureau of Meteorology](http://www.bom.gov.au/) shows climate statistics in Australia.

## 6. Geo-related Tools and Libraries

### 6.1 Geo Coordinates Clustering
- DBSCAN (built in scikit-learn)
- P-DBSCAN performs DBSCAN while taking into account the ownership of photos.
- [CFSFDP](https://pypi.org/project/Dcluster/)
    - based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities.
    - see [the original paper](http://science.sciencemag.org/content/344/6191/1492) along with [an automatic method for selecting threshold values](https://arxiv.org/abs/1501.04267) for details.
- [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) performs DBSCAN over varying epsilon values (densities) and finds the clusters that give the best stability over epsilon.


### 6.2 Geo Information Retrieval
[Geocoder](https://geocoder.readthedocs.io/index.html) is a wrapper obtaining geo-related information from multiple different geocoding providers such as Google, Bing, and OpenStreetMap.


### 6.3 Geo Visualization

- [Folium](https://github.com/python-visualization/folium) manipulates Python data and then visualizes it via interactive maps with Leaflet.js.
- [smopy](https://github.com/rossant/smopy) returns an OpenStreetMap tile image given a bounding box and converts geographical coordinates to pixels.
- [WorldWind SDK](https://worldwind.arc.nasa.gov/) allows developers to create interactive visualizations of 3D globe, map and geographical information.

## \[Deprecated\] TripAdvisor

<div class="alert alert-block alert-warning">
The code in this section is not working any longer as TripAdvisor has changed its website layout. The code is kept as a reference for those who have to collect attraction information from the platform.
</div>

[TripAdvisor](https://www.tripadvisor.com.au/) provides information and reviews of hotels, restaurants, attractions, flights, and other travel-related content.


To start, open TripAdvisor's overview page for Victoria.
> https://www.tripadvisor.com.au/Tourism-g255098-Victoria-Vacations.html
- In the [Popular Destinations] section, keep pressing the [See more] button until all destinations are loaded.
- Manually save all destination URLs, one per line, in ``ta/ta_list.txt`` by inspecting the page source.

In [None]:
TA_BASE = 'https://www.tripadvisor.com.au'

In [None]:
def get_id_name(cell):
    x = re.search('(?<=Tourism-).+(?=_Victoria)', cell).group(0)
    x_id = x[:x.index('-')]
    x_name = x[x.index('-')+1:]
    return x_id, x_name

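with open('ta/ta_list.txt') as f:
    # skip blank lines (e.g., a trailing newline), which would break get_id_name
    df = pd.DataFrame({'init_url': [u for u in f.read().splitlines() if u.strip()]})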

df['gid'], df['dname'] = zip(*df.init_url.apply(get_id_name))

Convert a destination url to its home page address following the pattern ``/Attractions-[DESTINATION_ID]-Activities[PAGE_INDICATOR]-[DESTINATION_NAME]_Victoria.html`` and save the corresponding HTML.

In [None]:
def get_ta_html(url, full_page=False):
    # not until we get the page
    while (True):
        r = requests.get(url)
        if r.status_code == 200:
            tsoup = BeautifulSoup(r.text, 'lxml')
            if not full_page:
                tsoup = tsoup.find('div', {'id': 'FILTERED_LIST'})
            break
    return str(tsoup)
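
`get_ta_html` retries forever and without any pause, so a page that never returns status 200 would stall the whole crawl. The sketch below shows a bounded retry with exponential backoff instead; `fetch_with_retry` is a hypothetical helper, and the `fetch` callable is injected (an assumption made here so the logic stays independent of any particular HTTP client).

```python
import time

def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns text, giving up after max_attempts.

    `fetch` is any callable that returns the page text on success and None on
    failure; `sleep` can be replaced to avoid real waiting in tests.
    """
    for attempt in range(max_attempts):
        text = fetch(url)
        if text is not None:
            return text
        if attempt < max_attempts - 1:
            # exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

A wrapper around `requests.get` that returns `r.text` for status 200 and `None` otherwise could be passed as `fetch`; callers then have to handle a `None` result for pages that never load.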

In [None]:
BASE_URL = TA_BASE + "/Attractions-%s-Activities%s-%s_Victoria.html"
def gen_att_url(des_id, des_name, page=1):
    # page=1 indicates the home (first) page; later pages are offset by 30 items each
    page_idx = '' if page==1 else '-oa'+str(30*(page-1))
    return BASE_URL % (des_id, page_idx, des_name)

cnt=0
tol=df.shape[0]
htmls = []
for gid, dname in zip(df.gid, df.dname):
    cnt+=1
    print('fetching... %d/%d' % (cnt, tol), end='\r')
    hurl = gen_att_url(gid, dname)
    txt = get_ta_html(hurl)
    htmls.append(txt)
df['html'] = htmls

For each destination, check how many pages it has and save the HTML of all other pages where possible.

In [None]:
def extract_page_num(cell):
    soup = BeautifulSoup(cell, 'lxml')
    psoup = soup.find('div', {'class': 'pagination'})
    num_page = 1
    if psoup is not None:
        num_page = psoup.find_all('a')[-1]['data-page-number']
    return int(num_page)

df['num_page'] = df.html.apply(extract_page_num)

cnt=0
tol=df.shape[0]
rest_htmls = []
for did, dname, num_page in zip(df.gid, df.dname, df.num_page):
    cnt+=1
    pages=[]
    for i in range(2, num_page+1):
        print('fetching... %d/%d, page=%d' % (cnt, tol, i), end='\r')
        hurl = gen_att_url(did, dname, i)
        txt = get_ta_html(hurl)
        pages.append(txt)
    rest_htmls.append(str(pages))
    
df['rest_htmls'] = rest_htmls

When searching for popular spots in a destination, TripAdvisor returns both attractions physically within the destination and those close enough to (but outside) the destination.
There are two types of attractions:
- **Individual spots**: each record is one spot
- **Spot groups**: each record is a group that contains a list of spots separated in pages

In [None]:
spot_pool = {}
    
def get_spots(cell):
    def single_spots(cell):
        res = []
        soup = BeautifulSoup(cell, 'lxml')
        for asoup in soup.find_all('div', {
            'class': 'attraction_element'
        }):
            within = asoup.find('var') is None
            url = asoup.find('div', {'class': 'listing_title'})
            url = url.find('a')['href']
            sid = re.search(r'd\d+', url).group(0)
            spot_pool[sid] = url
            res.append((within, sid))
        return res

    def spot_groups(cell):
        res = []
        soup = BeautifulSoup(cell, 'lxml')
        for gsoup in soup.find_all('div', {
            'class': 'attraction_type_group'
        }):
            url = gsoup.find('div', {
                'class': 'listing_title'
            }).find('a')
            title = url.text
            title = title[:title.index('(')].strip()
            url = TA_BASE + url['href']
            txt = get_ta_html(url)
            tmp_res = single_spots(txt)
            p_num = extract_page_num(txt)
            if p_num>1:
                pos = url.rindex('-')
                for i in range(2, p_num+1):
                    hurl = url[:pos]+'-oa'+str(30*(i-1))+url[pos:]
                    htxt = get_ta_html(hurl)
                    tmp_res.extend(single_spots(htxt))
            res.append((title, tmp_res))
        return res

    return single_spots(cell), spot_groups(cell)

In [None]:
# The encoding scheme (Python dict object)
# - 'single'
#     - single_list=[(w_flag, spot_id), ...]
#     - *w_flag* indicates whether a spot is physically in current destination
# - 'group'
#     - group_list=[(group_type, single_list), ...]

spots=[]
cnt=0
tol = df.shape[0]
for first_page, rest_pages in zip(df.html, df.rest_htmls):
    cnt+=1
    print('%d / %d ...' % (cnt, tol), end='\r')
    pages = ast.literal_eval(rest_pages)
    pages.append(first_page)
    single_list = []
    group_list = []
    for page in pages:
        single_items, group_items = get_spots(page)
        single_list.extend(single_items)
        group_list.extend(group_items)
    spots.append({'single': single_list, 'group': group_list})
df['spots']=spots

df.to_csv('ta/ta_base.csv', header=True, index=False)
with open('ta/spot_pool.txt', 'w') as f:
    f.write(str(spot_pool))

Pool all unique spots and save the HTML of each spot's home page.

In [None]:
df = pd.read_csv('ta/ta_base.csv')
with open('ta/spot_pool.txt', 'r') as f:
    spot_pool = f.read()
spot_pool = ast.literal_eval(spot_pool)

In [None]:
def update_pool(c, records):
    c.executemany('''INSERT INTO pool VALUES (?, ?)''', records)
    conn.commit()

# init the pool db
conn = sqlite3.connect('ta/spot_html_pool.db')
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS 
            pool (id TEXT, html TEXT)''')

#  add records
cnt=0
tol = len(spot_pool)
spot_htmls = []
for spot_id, spot_url in spot_pool.items():
    cnt+=1
    print('%d / %d ...' % (cnt, tol), end='\r')
    html_txt = get_ta_html(TA_BASE + spot_url, full_page=True)
    spot_htmls.append((spot_id, html_txt))
    if cnt % 100 == 0:
        update_pool(c, spot_htmls)
        spot_htmls = []
update_pool(c, spot_htmls)
c = None
conn.close()
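
The cell above writes to SQLite in batches of 100 so that an interrupted crawl loses at most one batch. The batching pattern on its own can be sketched as follows, using an in-memory database and made-up records; `store_in_batches` is a hypothetical helper, and only the table layout (`pool (id TEXT, html TEXT)`) is taken from the notebook.

```python
import sqlite3

def store_in_batches(conn, records, batch_size=100):
    """Insert (id, html) pairs into the pool table, committing once per batch."""
    conn.execute('CREATE TABLE IF NOT EXISTS pool (id TEXT, html TEXT)')
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            with conn:  # commits on success, rolls back on error
                conn.executemany('INSERT INTO pool VALUES (?, ?)', batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        with conn:
            conn.executemany('INSERT INTO pool VALUES (?, ?)', batch)

conn = sqlite3.connect(':memory:')  # in-memory database for the demonstration
fake_pages = ((f'd{i}', f'<html>{i}</html>') for i in range(250))
store_in_batches(conn, fake_pages, batch_size=100)
row_count = conn.execute('SELECT COUNT(*) FROM pool').fetchone()[0]
conn.close()
```

Taking the connection as an argument, rather than reading a global `conn`, keeps the helper usable with any database file.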

1. Extract representative attributes from the HTML pool.
1. Keep only the spots physically within a destination.
1. Map an attraction to its attributes via the spot ID.

In [None]:
def get_within_spots(cell):
    def get_true(l):
        return [sid for w_flag, sid in l if w_flag]
        
    cell = ast.literal_eval(cell)
    single_spots = get_true(cell['single'])
    group_spots = [
        (title, get_true(one_group))
        for title, one_group in cell['group']
    ]
    return {'single': single_spots, 'group': group_spots}

tdf = df[['gid', 'dname', 'spots']].copy()
tdf.spots = tdf.spots.apply(get_within_spots)

In [None]:
def get_spot_attrs(sid):
    cell = c.execute('''
        SELECT html FROM pool WHERE id=? LIMIT 1;
    ''', (sid,)).fetchone()[0]
    soup = BeautifulSoup(cell, 'lxml')
    title = soup.find('h1', {'id': 'HEADING'}).text.strip()
    region = soup.find(
        'div', {'class': 'ppr_priv_global_nav_geopill'
    }).text.strip()
    try:
        num = soup.find('a', {'class': 'more'}).text.strip()
        num = num[:num.index(' ')]
        num = int(str(num).replace(',', ''))
    except:
        num = 0
    try:
        cates = [a.text.strip() for a in soup.find(
            'span', {'class': 'attraction_details'}
        ).find_all('a')]
    except:
        cates= []
    addr = soup.find(
            'div', {'class': 'address'}
        ).text.replace('|', '').strip()
    for sc in soup.find_all('script'):
        sc = str(sc)
        if 'lng: ' not in sc:
            continue
        lat = re.search(r'(?<=lat: )[-]?\d+\.\d+(?=,)', sc).group(0)
        lng = re.search(r'(?<=lng: )[-]?\d+\.\d+(?=,)', sc).group(0)
        loc_id = re.search(r'(?<=locId: )\d+(?=,)', sc).group(0)
        geo_id = re.search(r'(?<=geoId: )\d+(?=,)', sc).group(0)
        break
    return title, region, cates, addr, lat, lng, loc_id, geo_id, num

In [None]:
conn = sqlite3.connect('ta/spot_html_pool.db')
c = conn.cursor()
sdf = pd.read_sql_query('''SELECT id FROM pool;''', conn)

cnt=0
tol = sdf.shape[0]
attrs = []
for sid in sdf.id:
    cnt+=1
    print('%d / %d' % (cnt, tol), end='\r')
    attrs.append(get_spot_attrs(sid))
    
c = None
conn.close()

sdf = sdf.join(
    pd.DataFrame(attrs, columns = [
        'name', 'region', 'cates', 
        'addr', 'lat', 'lng', 'loc_id', 
        'geo_id', 'num_review'
    ])
)

Assign single/group tags to each spot:
- A single tag (default: *False*) marks whether the spot is stated as a single spot.
- A group tag (default: *\[\]*) lists the names of all groups the spot belongs to.

In [None]:
# The encoding scheme: (single_tag: boolean, group_tags: list)

sdf.shape
mdict = {sid: (False, []) for sid in sdf.id}
for destination in tdf.spots:
    # update single
    for spot_id in destination['single']:
        smark, gmarks = mdict[spot_id]
        smark = True
        mdict[spot_id] = (smark, gmarks)
    
    # update group
    for cate, spot_ids in destination['group']:
        for spot_id in spot_ids:
            smark, gmarks = mdict[spot_id]
            gmarks.append(cate)
            mdict[spot_id] = (smark, gmarks)
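
To make the tagging scheme concrete, the sketch below applies the same logic to a tiny made-up input. `tag_spots` is a hypothetical function that mirrors the loop above; the spot IDs and group names are invented for illustration.

```python
def tag_spots(all_spot_ids, destinations):
    """Return {spot_id: (single_tag, group_tags)} following the scheme above."""
    tags = {sid: (False, []) for sid in all_spot_ids}
    for dest in destinations:
        # a spot listed on its own gets single_tag=True
        for sid in dest['single']:
            tags[sid] = (True, tags[sid][1])
        # a spot listed inside a group collects that group's name
        for group_name, sids in dest['group']:
            for sid in sids:
                tags[sid][1].append(group_name)
    return tags

# Made-up IDs and group names, for illustration only.
destinations = [
    {'single': ['d1', 'd2'], 'group': [('Museums', ['d2', 'd3'])]},
    {'single': [], 'group': [('Parks', ['d3'])]},
]
tags = tag_spots(['d1', 'd2', 'd3', 'd4'], destinations)
```

Here `d1` is tagged as a single spot only, `d2` as both, `d3` as a member of two groups, and `d4` keeps the default `(False, [])`.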

Attach the single/group tags to each spot and export the POIs.

In [None]:
# - (True, []) means a spot is listed only as an individual spot;
# - (False, [...]) means a spot shows up only in group(s);
# - (True, [...]) means a spot is listed both individually and in group(s);
# - (False, []) indicates a spot appearing only in nearby destination(s) rather than its physical one.
#     - this could happen since TripAdvisor returns inconsistent results for the same destination and the same page. 
#     - I suspect this may be a strategy to protect data from crawling. 

def set_marks(cell):
    smark, gmarks = mdict[cell]
    return smark, str(gmarks)

sdf['smark'], sdf['gmarks'] = zip(*sdf.id.apply(set_marks))
# sdf[(sdf.smark==False) & (sdf.gmarks=='[]')]

sdf.to_csv('ta/ta_pois.csv', header=True, index=False)

The first record is shown below.

In [None]:
sdf = pd.read_csv('ta/ta_pois.csv')
sdf.shape
sdf.head(1)