# 00. Exploratory Scraping

There is a large volume of content to sift through in order to obtain the information that we want, and ideally we wish to do so in as few attempts as possible.

Thus, the purpose of this notebook is purely to explore the structure of the [Rakuten Travel](https://travel.rakuten.co.jp/) website.

## Imports

In [None]:
import numpy as np
import pandas as pd
import re
import requests
from bs4 import BeautifulSoup


## Exploration

### Exploring the Homepage

We first scrape the [homepage](https://travel.rakuten.co.jp/) of Rakuten Travel to see how we can access the reviews.

In [None]:
homepage_res = requests.get(url='https://travel.rakuten.co.jp/')
homepage_soup = BeautifulSoup(homepage_res.text)

In [None]:
print(homepage_soup.prettify())

The Japanese text is not rendered properly, which suggests an encoding issue. Let's compare the encoding that `requests` assumed with the apparent encoding detected from the scraped website text.

In [None]:
print(homepage_res.encoding)
print(homepage_res.apparent_encoding)

As suspected, there is a discrepancy in the encoding used. Let's force the encoding to be UTF-8 (the encoding detected from the content itself) so that we can actually read the text.

In [None]:
homepage_res.encoding = 'utf-8'
homepage_soup = BeautifulSoup(homepage_res.text)
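To see why the text comes out garbled, here is a minimal sketch that needs no network access. It assumes the response was decoded as ISO-8859-1, which is the fallback `requests` uses for text responses whose headers declare no charset. The sample string is hypothetical and is not taken from the page.

```python
# Hypothetical sample text; the real page content will differ.
original = '東京のホテル'

# The server sends UTF-8 bytes over the wire.
raw_bytes = original.encode('utf-8')

# Decoding with the wrong codec does not raise an error, because every
# byte is a valid ISO-8859-1 character. It silently yields mojibake.
garbled = raw_bytes.decode('iso-8859-1')

# Decoding with the correct codec recovers the text.
fixed = raw_bytes.decode('utf-8')

print(repr(garbled))
print(fixed)
```

Setting `homepage_res.encoding = 'utf-8'` has the same effect as the second decode: it changes how the already-downloaded bytes are turned into `homepage_res.text`.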

In [None]:
print(homepage_soup.prettify())

And that seems to have fixed the problem. 

Now we can proceed to get the names of each individual location, which are located in `<dd class="area dmArea">` tags.

In [None]:
areas = homepage_soup.find('dd', attrs={'class': 'area dmArea'})
areas

This tells us the names of the different prefectures, but the names alone are not enough to navigate to each prefecture's hotel listing.

Let us take a closer look at how links between two prefectures differ. For the sake of this comparison, we shall compare the Tokyo and Aomori prefectures. Both links start with `https://search.travel.rakuten.co.jp/ds/undated/search?`, but where they differ is in their optional parameters.

|Prefecture|URL|
|---|---|
|Tokyo|`https://search.travel.rakuten.co.jp/ds/undated/search?f_dai=japan&f_sort=hotel&f_page=1&f_hyoji=30&f_tab=hotel&f_cd=02&`<br>`f_layout=list&f_campaign=&` `f_chu=tokyo` `&f_shou=&f_sai=&f_charge_users=&l-id=topC_search_hotel_undated`|
|Aomori|`https://search.travel.rakuten.co.jp/ds/undated/search?f_dai=japan&f_sort=hotel&f_page=1&f_hyoji=30&f_tab=hotel&f_cd=02&`<br>`f_layout=list&f_campaign=&` `f_chu=aomori` `&f_shou=&f_sai=&f_charge_users=&l-id=topC_search_hotel_undated`|

As we can see, the only place where the two URLs differ is in the `f_chu` parameter, which takes the name of the prefecture found in the `<option value=___>` tag on the homepage.

This allows us to programmatically search through all the prefectures and get all their hotels.
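As a sketch of how such a URL could be assembled, the helper below fills in `f_chu` and leaves every other parameter as it appears in the two URLs compared above. The names `BASE_URL`, `BASE_PARAMS`, and `build_prefecture_url` are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://search.travel.rakuten.co.jp/ds/undated/search'

# Query parameters copied from the Tokyo and Aomori URLs.
# Only f_chu differs between prefectures.
BASE_PARAMS = {
    'f_dai': 'japan',
    'f_sort': 'hotel',
    'f_page': 1,
    'f_hyoji': 30,
    'f_tab': 'hotel',
    'f_cd': '02',
    'f_layout': 'list',
    'f_campaign': '',
    'f_chu': '',
    'f_shou': '',
    'f_sai': '',
    'f_charge_users': '',
    'l-id': 'topC_search_hotel_undated',
}


def build_prefecture_url(prefecture):
    """Return the hotel search URL for one prefecture (hypothetical helper)."""
    params = dict(BASE_PARAMS, f_chu=prefecture)
    return BASE_URL + '?' + urlencode(params)


print(build_prefecture_url('aomori'))
```

Keeping the parameters in a dictionary rather than a long string makes it harder to mistype the URL when looping over prefectures.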

Let us get a list of all the prefecture names that go into the `f_chu` parameter.

In [None]:
list_of_prefectures = [pref_tag.get('value') for pref_tag in areas.findAll('option')]
len(list_of_prefectures)

In [None]:
list_of_prefectures

**Sanity check:** There are only 47 prefectures in Japan, but we seem to be getting repeats.

Looking at the `list_of_prefectures` variable above, we see that two prefectures (Kanagawa and Shizuoka) are repeated (with different `id`s, as seen in the `areas` variable).

However, as the `value` for both versions of each of the two prefectures is identical, it is safe to simply collapse them together.


In [None]:
list_of_prefectures = list(set(list_of_prefectures))
len(list_of_prefectures)

In [None]:
list_of_prefectures

And now we get 47 prefectures, as intended.
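One caveat: a `set` does not preserve the order in which the prefectures appeared on the page. If that order matters, for example to make repeated runs visit prefectures in the same sequence, `dict.fromkeys` is one order-preserving alternative. The sample list below is hypothetical and much shorter than the real one.

```python
# Hypothetical sample; the real list comes from the <option> tags.
sample = ['hokkaido', 'kanagawa', 'kanagawa', 'shizuoka', 'tokyo', 'shizuoka']

# Dictionary keys are unique and keep first-seen insertion order.
unique_in_order = list(dict.fromkeys(sample))

print(unique_in_order)
```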

As an aside, most English speakers know prefectures such as 福島 and 千葉 by their [Hepburn romanization](https://en.wikipedia.org/wiki/Hepburn_romanization), Fukushima and Chiba respectively. This page instead uses a spelling close to the [Nihon-shiki romanization](https://en.wikipedia.org/wiki/Nihon-shiki_romanization) (Japanese-style romanization) used more commonly in Japan, writing them as Hukushima and Tiba respectively. The match is not exact, as 福島 written in proper Nihon-shiki romanization would be "hukusima".

We can follow the same procedure in navigating all the way to the last page - checking the `<li class="pagingBack">` tag.

### Reviews from Individual Hotels

In [None]:
test_prefecture_url_tokyo = 'https://search.travel.rakuten.co.jp/ds/undated/search?f_dai=japan&f_sort=hotel&f_page=1&f_hyoji=30&f_tab=hotel&f_cd=02&f_layout=list&f_campaign=&f_chu=tokyo&f_shou=&f_sai=&f_charge_users=&l-id=topC_search_hotel_undated'


In [None]:
tokyo_res = requests.get(test_prefecture_url_tokyo)
tokyo_soup = BeautifulSoup(tokyo_res.text)

In [None]:
print(tokyo_soup.prettify())

The webpage we are currently on lists all the hotels in that prefecture, page by page. While we could in theory go into each hotel's page and access the reviews from there, a link to the reviews already exists in the same box as the hotel listing, beside the hotel rating.

It can be directly accessed by searching for the `<p class="cstmrEvl">` tag. As usual, the links are stored inside `<a href=___>` tags.

In [None]:
tokyo_review_links = [hotel.find('a').get('href') for hotel in tokyo_soup.findAll('p', attrs={'class': 'cstmrEvl'})]
tokyo_review_links

### Scraping a Single Review

We start off small by scraping a single page of reviews, then gradually scale up as we go along.

In [None]:
res = requests.get(url='https://travel.rakuten.co.jp/HOTEL/28096/review.html')

In [None]:
soup = BeautifulSoup(res.text)
print(soup.prettify())

We see immediately that the Japanese text isn't being rendered properly. This is an encoding problem, and is easily remedied by declaring the encoding of the response from `requests` to be UTF-8.

In [None]:
# this forces the encoding to be UTF-8
# otherwise the output would be unreadable
res.encoding = 'utf-8'

In [None]:
soup = BeautifulSoup(res.text)
print(soup.prettify())

We see from a cursory inspection that reviews are in the `<p class="commentSentence">` tags. 

In [None]:
reviews = [review.text for review in soup.findAll('p', attrs={'class': 'commentSentence'})]
len(reviews)

In [None]:
reviews

However, they seem to have some extra formatting noise, mainly newline characters. Let us remove them.

In [None]:
# remove carriage-return and newline characters, then trim surrounding whitespace
reviews = [re.sub(r'[\r\n]', '', review).strip() for review in reviews]

In [None]:
reviews

Much better. Now, we also notice that half of the "reviews" are in fact replies by the hotel. Those replies always start with the same line(s) of greetings and thanks, which makes it a lot easier for us to filter out such comments.

Later on, we shall encounter a more efficient way to gather specifically guest reviews or hotel replies, but this is exploratory, so let's go with this for now.

In [None]:
guest_reviews = [review for review in reviews if 'この度はホテルマイステイズ浅草' not in review]
len(guest_reviews)

In [None]:
guest_reviews

As for the more efficient way of determining which comments are by actual guests and which are hotel replies, going up one tag level makes it clear that guest reviews and hotel replies are wrapped under different `<dl>` tags.

The tag `<div class="commentReputationBoth">` captures both the user review and the reply by the hotel front desk.

The user reviews are under the tag `<dl class="commentReputation">`, and the hotel replies are under the tag `<dl class="commentHotel">`.

If one were to only look at comments without any care for who they come from, simply searching for tags with `<p class="commentSentence">` would suffice.

Additionally, each review has a specific id, which can be found in the first `<div class="voteQuestion">` tag and is accessed through its `id` attribute. This returns a string of the form `voteans_[id number]`, which we shall use.

As we are mainly concerned with user reviews, we shall first extract only those under the `<dl class="commentReputation">` tag.

If we ever decide to work with hotel replies, we can simply extract them under the `<dl class="commentHotel">` tag.
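The tag structure described above can be tried out on a small piece of markup. The HTML below is a hypothetical, heavily simplified stand-in for the real page: the class names come from the description above, but the exact nesting (including where the `voteQuestion` tag sits) is an assumption. The helper `comments_under` is likewise hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the tags described above.
sample_html = '''
<div class="commentBox">
  <div class="commentReputationBoth">
    <dl class="commentReputation">
      <dd><p class="commentSentence">Guest review text</p></dd>
    </dl>
    <dl class="commentHotel">
      <dd><p class="commentSentence">Hotel reply text</p></dd>
    </dl>
  </div>
  <div class="voteQuestion" id="voteans_12345"></div>
</div>
'''

sample_soup = BeautifulSoup(sample_html, 'html.parser')


def comments_under(soup, dl_class):
    """Collect commentSentence text found inside <dl class=dl_class> tags."""
    texts = []
    for dl in soup.find_all('dl', attrs={'class': dl_class}):
        for p in dl.find_all('p', attrs={'class': 'commentSentence'}):
            texts.append(p.get_text(strip=True))
    return texts


guest_texts = comments_under(sample_soup, 'commentReputation')
reply_texts = comments_under(sample_soup, 'commentHotel')

# The id attribute looks like "voteans_[id number]"; keep the part after the underscore.
raw_id = sample_soup.find('div', attrs={'class': 'voteQuestion'}).get('id')
review_id = raw_id.split('_', 1)[1]

print(guest_texts, reply_texts, review_id)
```

Filtering on the enclosing `<dl>` class avoids having to match on the hotel's greeting text, so the same code should work for any hotel, provided the real pages follow this structure.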

In [None]:
soup_comments_meta = soup.findAll('div', attrs={'class': 'commentBox'})

In [None]:
print(soup_comments_meta[0].prettify())

### Looking Forward - Aspect-Based Sentiment Analysis

We need a way to obtain aspects in order to perform aspect-based sentiment analysis.

Thankfully, Rakuten Travel already splits up their guest ratings into six categories:
- Service (サービス)
- Location (立地)
- Room (部屋)
- Amenities (設備・アメニティ)
- Bathroom (風呂)
- Meals (食事)

Each category is rated on a scale of 1 to 5, and gives us an idea of what the guest liked or did not like about the experience.

These six scores are then aggregated to give a total score (総合) on a scale of 1 to 5.

First, we have to get the url that links to the scores. 

Every customer review contains a link to a more detailed review page with their score breakdown for the six categories. This is the `href` attribute of the very first `<a>` tag that appears under each `<div class="commentBox">` tag.

In [None]:
review_breakdown_link = soup_comments_meta[0].find('a').get('href')
review_breakdown_link

We now scrape the link to the individual's review details, and a cursory scan reveals that the score data is located under the `<ul class="rateDetail">` tag.

In [None]:
test_review_res = requests.get(url=review_breakdown_link)
# force UTF-8 before parsing, as with the earlier pages
test_review_res.encoding = 'utf-8'
test_review_soup = BeautifulSoup(test_review_res.text)

In [None]:
print(test_review_soup.prettify())

There should only be one element that has the tag `<ul class="rateDetail">` in the reviews page.

Extracting the text, and then splitting on the newline `\n` character, would then give us the necessary review scores for a single customer.

In [None]:
scores_string = test_review_soup.find('ul', attrs={'class': 'rateDetail'}).text
scores_string

In [None]:
test_scores = scores_string.split('\n')[1:-1] # the string starts and ends with a newline character
test_scores
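If the scores are needed as numbers rather than strings, one possible approach is sketched below. It assumes each entry ends with the numeric score, and it treats an entry without a trailing digit as missing. The sample strings, `score_names`, and `parse_scores` are all hypothetical; the real entries may be laid out differently.

```python
# Hypothetical sample of what the split list might look like;
# the real labels and layout on the page may differ.
sample_scores = ['総合5', 'サービス4', '立地5', '部屋4',
                 '設備・アメニティ3', '風呂4', '食事5']

# English names in the same order as the entries above.
score_names = ['overall', 'service', 'location', 'room',
               'amenities', 'bathroom', 'meals']


def parse_scores(raw_scores):
    """Map each category name to the trailing digit of its entry, or None."""
    parsed = {}
    for name, raw in zip(score_names, raw_scores):
        last = raw.strip()[-1:]  # empty string if the entry is blank
        parsed[name] = int(last) if last.isdigit() else None
    return parsed


print(parse_scores(sample_scores))
```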

The next page can be accessed by the link with the text `次の__件`. Its URL is the `href` attribute of the first `<a>` tag inside the `<li class="pagingNext">` tag.

In [None]:
# next page
next_page = soup.find('li', attrs={'class': 'pagingNext'}).find('a').get('href')
next_page

Let's check what happens when we reach the end of the reviews. As it so happens, at the time of writing, the current hotel page only has two pages of reviews. This makes it quite an illustrative minimal example.

In [None]:
next_page_res = requests.get(url=next_page)
next_page_soup = BeautifulSoup(next_page_res.text)

In [None]:
print(next_page_soup.prettify())

Let's check if there's a link to the next page.

In [None]:
next_page_soup.find('li', attrs={'class': 'pagingNext'}) is None

This serves as a nice check for when we reach the end of the comments, so that we know when to call it a day with a specific hotel and move on.
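The stopping rule can be wrapped in a small generator. This is only a sketch: `iter_pages`, `fetch`, and `find_next_url` are hypothetical names, and the pages here are stand-in dictionaries so that the loop can be tried without touching the network.

```python
def iter_pages(first_url, fetch, find_next_url):
    """Yield each page in turn, following next-page links until none is left.

    fetch(url) returns a parsed page; find_next_url(page) returns the next
    URL or None. Both are hypothetical callables supplied by the caller.
    """
    url = first_url
    seen = set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against a page that links back to itself
        page = fetch(url)
        yield page
        url = find_next_url(page)


# Stand-in pages: two pages of reviews, the second with no next link.
fake_site = {
    'page1': {'reviews': ['a', 'b'], 'next': 'page2'},
    'page2': {'reviews': ['c'], 'next': None},
}

all_reviews = []
for page in iter_pages('page1', fake_site.get, lambda p: p['next']):
    all_reviews.extend(page['reviews'])

print(all_reviews)
```

Against the real site, `fetch` would wrap `requests.get` and `BeautifulSoup`, and `find_next_url` would look up the `pagingNext` tag and return `None` when it is absent.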

Now that we know how to deal with a single hotel's reviews, let's look at how we can loop through all the hotels.

### Overall Scraping Strategy

Let us now put everything we have uncovered above together to obtain the data that we need.

We shall do this by writing out some pseudocode before actually attempting the scrape.

---

Initialize empty lists in a dictionary for the columns that we want:<br>
`review_id`, `review_time`, `review_text`, `hotel_reply_time`, `hotel_reply_text`, `hotel_name`, `prefecture`, and the 7 scores in the order of overall, service, location, room, amenities, bathroom, and meals.

Scrape the Rakuten homepage, set the response encoding to UTF-8, and then create the homepage soup.

From the homepage soup, extract out the list of prefecture names.

**for loop 1: From homepage, looping through prefectures** <br>
- `for prefecture in prefecture_names:`
    - scrape link to prefecture hotels list and make soup

    - initialize an empty list for the hotel review links
    - extract list of review links from the current page (`findAll('p', attrs={'class': 'cstmrEvl'})` and `find('a').get('href')`)
    - `while prefecture_hotels_soup.find('li', attrs={'class': 'pagingNext'}) is not None:`
        - go to next page, make soup, and add that page's review links to the list
    
    - **for loop 2: From prefecture, looping through hotels** 
    - `for hotel_review_link in list_of_hotel_review_links:`
        - make soup

        - initialize an empty list for the customers
        - extract list of customer review details links from the current page (`findAll('div', attrs={'class': 'commentBox'})` and `find('a').get('href')`)
        - `while hotel_soup.find('li', attrs={'class': 'pagingNext'}) is not None:`
            - go to next page, make soup, and add that page's customer review details links to the list

        - **for loop 3: From hotel, looping through customers**
        - `for customer_review in list_of_customer_reviews:`
            - make soup
            - encode the response to utf-8 first

            - extract review id (`find('div', attrs={'class': 'voteQuestion'}).get('id')`)

            - extract review timestamp (`find('span', attrs={'class': 'time'}).text`)

            - extract review text (`find('p', attrs={'class': 'commentSentence'}).text`)

            - try
                - make hotel reply soup (if it exists)
                - extract hotel reply timestamp
                - extract hotel reply text
            - except (no hotel reply)
                - set hotel reply timestamp and text to `None`
            
            - extract hotel name (`find('a', attrs={'class': 'rtconds fn'}).text`)

            - extract scores as a string (`find('ul', attrs={'class': 'rateDetail'}).text`)

            - split scores (`split('\n')[1:-1]`)

            - append review id
            - append review time
            - append review text
            - append hotel reply timestamp
            - append hotel reply text
            - append hotel name
            - append prefecture
            - `for i in range(7):`
                - we only want the number, the order of the categories is already encoded in the initialization above
                - `lists_of_scores[i].append(customer_scores[i][-1])`
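The bookkeeping part of the outline above (one list per column, one append per review) can be sketched without any scraping. The names `text_columns`, `score_columns`, and `append_record` are hypothetical, and the sample values are made up for illustration. Fields that are not supplied are stored as `None` so that every list stays the same length.

```python
score_columns = ['overall', 'service', 'location', 'room',
                 'amenities', 'bathroom', 'meals']
text_columns = ['review_id', 'review_time', 'review_text',
                'hotel_reply_time', 'hotel_reply_text',
                'hotel_name', 'prefecture']

# One empty list per column.
data = {column: [] for column in text_columns + score_columns}


def append_record(data, fields, customer_scores):
    """Append one review; columns missing from `fields` are stored as None."""
    for column in text_columns:
        data[column].append(fields.get(column))
    for column, raw in zip(score_columns, customer_scores):
        # Keep only the trailing character, as in the outline above.
        data[column].append(raw[-1])


# Hypothetical values for illustration only.
append_record(
    data,
    {'review_id': 'voteans_12345', 'review_text': 'Sample review',
     'prefecture': 'tokyo'},
    ['総合5', 'サービス4', '立地5', '部屋4', '設備・アメニティ3', '風呂4', '食事5'],
)

print({column: values[0] for column, values in data.items()})
```

Because every list has the same length, the dictionary can be passed directly to `pd.DataFrame(data)` once the scrape is done.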

