### Imports

In [1]:
import numpy as np
import pandas as pd
import re
import requests
import time

from bs4 import BeautifulSoup

### Overall Scraping Strategy

As a convenient reference for writing the actual scraping code, the pseudocode at the end of the previous notebook is copied here.

---

Initialize empty lists in a dictionary for the columns that we want:<br>
`review_id`, `review_time`, `review_text`, `hotel_reply_time`, `hotel_reply_text`, `hotel_name`, `prefecture`, and the 7 scores in the order of overall, service, location, room, amenities, bathroom, and food.

Scrape the Rakuten homepage, set the response encoding to utf-8, and then create the homepage soup.

From the homepage soup, extract out the list of prefecture names.

**for loop 1: From homepage, looping through prefectures** <br>
- `for prefecture in prefecture_names:`
    - scrape link to prefecture hotels list and make soup

    - initialize an empty list for the hotel links
    - extract list of review links (`findAll('p', attrs={'class': 'cstmrEvl'})` and `find('a').get('href')`)
    - `while prefecture_hotels_soup.find('li', attrs={'class': 'pagingNext'}) != None:`
        - go to next page
    
    - **for loop 2: From prefecture, looping through hotels** 
    - `for hotel_review_link in list_of_hotel_review_links:`
        - make soup

        - initialize an empty list for the customers
        - extract list of customer review details links (`findAll('div', attrs={'class': 'commentBox'})` and `find('a').get('href')`)
        - `while hotel_soup.find('li', attrs={'class': 'pagingNext'}) != None:`
            - go to next page

        - **for loop 3: From hotel, looping through customers**
        - `for customer_review in list_of_customer_reviews:`
            - fetch the page and set the response encoding to utf-8
            - make soup

            - extract review id (`find('div', attrs={'class': 'voteQuestion'}).get('id')`)

            - extract review timestamp (`find('span', attrs={'class': 'time'}).text`)

            - extract review text (`find('p', attrs={'class': 'commentSentence'})`)

            - try
                - make hotel reply soup (if it exists)
                - extract hotel reply timestamp
                - extract hotel reply text
            
            - extract hotel name (`find('a', attrs={'class': 'rtconds fn'}).text`)

            - extract scores as a string (`find('ul', attrs={'class': 'rateDetail'}).text`)

            - split scores (`split('\n')[1:-1]`)

            - append review id
            - append review time
            - append review text
            - append hotel reply timestamp
            - append hotel reply text
            - append hotel name
            - append prefecture
            - `for i in range(7):`
                - we only want the number; the order of the categories is already encoded in the initialization above
                - `lists_of_scores[i].append(customer_scores[i][-1])`
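
The "go to next page" steps in loops 1 and 2 share one pattern: collect the links on the current page, then follow the `pagingNext` link until there is none (or until a page cap is reached). Below is a minimal sketch of that pattern. To keep it runnable offline, the network is replaced by a hypothetical in-memory mapping `FAKE_PAGES`; the names `fetch_page` and `collect_links` and the page keys are illustrative assumptions, not part of the actual scraper.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for the website:
# page key -> (links found on that page, key of the next page or None).
FAKE_PAGES = {
    "page1": (["hotel_a", "hotel_b"], "page2"),
    "page2": (["hotel_c"], "page3"),
    "page3": (["hotel_d"], None),
}


def fetch_page(url: str) -> Tuple[List[str], Optional[str]]:
    """Pretend to download and parse one page.

    Assumption: real code would call requests.get and build a soup here.
    """
    return FAKE_PAGES[url]


def collect_links(first_url: str,
                  fetch: Callable[[str], Tuple[List[str], Optional[str]]],
                  max_pages: Optional[int] = None) -> List[str]:
    """Follow 'next page' links, gathering links from every page visited."""
    links: List[str] = []
    url: Optional[str] = first_url
    pages_read = 0
    # Stop when there is no next page or the optional page cap is reached.
    while url is not None and (max_pages is None or pages_read < max_pages):
        page_links, next_url = fetch(url)
        links.extend(page_links)
        pages_read += 1
        url = next_url
    return links


all_links = collect_links("page1", fetch_page)
first_two_pages = collect_links("page1", fetch_page, max_pages=2)
print(all_links)
print(first_two_pages)
```

The optional `max_pages` argument plays the role of a page limit, which is useful when only the first few pages are wanted.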



In [2]:
def scrape_rakuten():
    '''This is a script to scrape Rakuten Travel, as outlined in the accompanying pseudocode.

    Returns
    ---
    A dictionary, where each component list is a column of the DataFrame we want to end up with.

    This will later be exported to a `.csv` file.
    '''

    ########## output lists ##########
    final_dict = {'review_id': [],
                  'review_time': [],
                  'review_text': [],
                  'hotel_reply_time': [],
                  'hotel_reply_text': [],
                  'hotel_name': [],
                  'prefecture': [],
                  'overall_score': [],
                  'service_score': [],
                  'location_score': [],
                  'room_score': [],
                  'amenities_score': [],
                  'bathroom_score': [],
                  'food_score': []}
    
    score_order = ['overall_score', 
                   'service_score', 
                   'location_score',
                   'room_score',
                   'amenities_score',
                   'bathroom_score',
                   'food_score']
    ##################################



    # initial scraping of Rakuten Travel homepage
    # we only need to do this once
    homepage_res = requests.get(url='https://travel.rakuten.co.jp/')
    homepage_res.encoding = 'utf-8'
    homepage_soup = BeautifulSoup(homepage_res.text)

    # extracts out the list of 47 unique prefectures
    areas = homepage_soup.find('dd', attrs={'class': 'area dmArea'})
    list_of_prefectures = [pref_tag.get('value') for pref_tag in areas.findAll('option')]
    list_of_prefectures = list(set(list_of_prefectures))

    # due to the time constraint, we limit the scraping to hotels from the Tohoku region
    # Aomori, Iwate, Miyagi, Akita, Yamagata, Fukushima
    list_of_prefectures = ['aomori', 'iwate', 'miyagi', 'akita', 'yamagata', 'hukushima']

    # loop 1: homepage -> prefectures
    for prefecture in list_of_prefectures:
        # sets the correct url
        pref_url = f"https://search.travel.rakuten.co.jp/ds/undated/search?f_dai=japan&f_sort=hotel&f_page=1&f_hyoji=30&f_tab=hotel&f_cd=02&f_layout=list&f_campaign=&f_chu={prefecture}&f_shou=&f_sai=&f_charge_users=&l-id=topC_search_hotel_undated"
        pref_res = requests.get(url=pref_url)
        pref_res.encoding = 'utf-8'
        pref_soup = BeautifulSoup(pref_res.text)

        # initializes list of individual hotel review links
        list_of_hotel_review_links = []

        # this gets all the hotels from the first page
        for _ in range(1):
            # appends hotel review links
            list_of_hotel_review_links.extend([hotel.find('a').get('href') for hotel
                                               in pref_soup.findAll('p', attrs={'class': 'cstmrEvl'})
                                               if hotel.find('a') is not None])

            # ends the loop if there is no next page
            if pref_soup.find('li', attrs={'class': 'pagingNext'}) is None:
                break
            else:
                # gets the next page's url
                next_hotel_page_url = pref_soup.find('li', attrs={'class': 'pagingNext'}).find('a').get('href')
                next_hotel_page_res = requests.get(url=next_hotel_page_url)
                next_hotel_page_res.encoding = 'utf-8'
                # overwrites pref_soup so that the next iteration (if any) parses the new page
                pref_soup = BeautifulSoup(next_hotel_page_res.text)
        
        # this is to keep track of progress
        counter = 0

        # loop 2: prefectures -> hotels
        for hotel_review_link in list_of_hotel_review_links:
            # makes soup
            hotel_res = requests.get(hotel_review_link)
            hotel_res.encoding = 'utf-8'
            hotel_soup = BeautifulSoup(hotel_res.text)

            # initializes list of customer review links
            list_of_customer_review_links = []

            # this gets all reviews from the first 2 pages
            for _ in range(2):
                # appends each review's individual link
                list_of_customer_review_links.extend([comment_tag.find('a').get('href') for comment_tag 
                                                      in hotel_soup.findAll('div', attrs={'class': 'commentBox'})])

                # ends the loop if there is no next page
                if hotel_soup.find('li', attrs={'class': 'pagingNext'}) is None:
                    break
                else:
                    # gets the next page's url
                    next_review_page_url = hotel_soup.find('li', attrs={'class': 'pagingNext'}).find('a').get('href')
                    next_review_page_res = requests.get(url=next_review_page_url)
                    next_review_page_res.encoding = 'utf-8'
                    # overwrites hotel_soup so that the next iteration parses the new page
                    hotel_soup = BeautifulSoup(next_review_page_res.text)
            
            # loop 3: hotels -> customer reviews
            for customer_review_link in list_of_customer_review_links:
                # makes soup
                review_res = requests.get(customer_review_link)
                time.sleep(1) # throttling to prevent overloading Rakuten's servers
                review_res.encoding = 'utf-8'
                review_soup = BeautifulSoup(review_res.text)

                # extracts the html for the customer review only
                customer_review_soup = review_soup.find('dl', attrs={'class': 'commentReputation'})

                # extracts review id
                review_id = customer_review_soup.find('div', attrs={'class': 'voteQuestion'}).get('id')
                
                # extracts review timestamp
                review_time = customer_review_soup.find('span', attrs={'class': 'time'}).text

                # extracts review text
                review_text = customer_review_soup.find('p', attrs={'class': 'commentSentence'}).text
                # removes newline characters and surrounding whitespace
                review_text = re.sub(r'[\r\n]', '', review_text).strip()

                # extracts the html for hotel reply (if it exists)
                try:
                    hotel_reply_soup = review_soup.find('dl', attrs={'class': 'commentHotel'})

                    # extracts timestamp of hotel's reply
                    hotel_reply_time = hotel_reply_soup.find('span', attrs={'class': 'time'}).text

                    # extracts hotel reply
                    hotel_reply_text = hotel_reply_soup.find('p', attrs={'class': 'commentSentence'}).text
                    hotel_reply_text = re.sub(r'[\r\n]', '', hotel_reply_text)
                except AttributeError: # in the case that the hotel does not leave a reply
                    hotel_reply_time = np.nan
                    hotel_reply_text = np.nan

                # extracts hotel name
                hotel_name = review_soup.find('a', attrs={'class': 'rtconds fn'}).text

                try: 
                    # extracts the respective scores
                    review_scores_string = customer_review_soup.find('ul', attrs={'class': 'rateDetail'}).text
                    # splits the scores into a list
                    review_scores = review_scores_string.split('\n')[1:-1]
                except AttributeError: # sometimes people don't vote
                    continue

                # appending data to the respective list
                final_dict['review_id'].append(review_id)
                final_dict['review_time'].append(review_time)
                final_dict['review_text'].append(review_text)
                final_dict['hotel_reply_time'].append(hotel_reply_time)
                final_dict['hotel_reply_text'].append(hotel_reply_text)
                final_dict['hotel_name'].append(hotel_name)
                final_dict['prefecture'].append(prefecture)
                for i in range(7):
                    final_dict[score_order[i]].append(review_scores[i][-1])
                
            # this is here just to indicate progress
            counter += 1
            print(f'Scraping completed for {prefecture.capitalize()} hotel {counter}.')

        # this is here just to indicate progress
        print(f'Scraping completed for {prefecture.capitalize()} prefecture.')

    print('All hotels scraped.')

    return final_dict
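
The text-cleaning and score-splitting steps inside loop 3 can be checked in isolation, without touching the network. The sketch below assumes that the `rateDetail` text has one category per line with the score as the final character, which is what `split('\n')[1:-1]` followed by `[-1]` relies on. The sample strings and the helper names `clean_text` and `parse_scores` are made up for illustration.

```python
import re

# Same order as the score columns in the scraper.
SCORE_ORDER = ['overall_score', 'service_score', 'location_score',
               'room_score', 'amenities_score', 'bathroom_score',
               'food_score']


def clean_text(text: str) -> str:
    """Remove carriage returns and newlines, then strip outer whitespace."""
    return re.sub(r'[\r\n]', '', text).strip()


def parse_scores(rate_detail_text: str) -> dict:
    """Map each score column to the last character of its line."""
    # Drop the empty strings produced by the leading and trailing newline.
    lines = rate_detail_text.split('\n')[1:-1]
    if len(lines) != len(SCORE_ORDER):
        raise ValueError(f'expected {len(SCORE_ORDER)} scores, got {len(lines)}')
    return {name: line[-1] for name, line in zip(SCORE_ORDER, lines)}


# Hypothetical inputs, shaped like the text the scraper is assumed to see.
sample_text = "\n  Lovely view.\r\nGreat food.  \n"
sample_scores = "\nOverall4\nService4\nLocation5\nRoom4\nAmenities3\nBathroom4\nFood5\n"

cleaned = clean_text(sample_text)
scores = parse_scores(sample_scores)
print(cleaned)
print(scores)
```

Note that each parsed score is a one-character string, not an integer, so it needs converting before any numeric analysis.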




We first scrape the data into a dictionary `rakuten_dict`.

Scraping just the top 2 pages of reviews for the first page of hotels in the Tohoku region (Aomori, Iwate, Miyagi, Akita, Yamagata, and Fukushima prefectures) takes around 4 hours in total. This amounts to around 40 reviews from each of 187 hotels.

Had time and computing resources permitted, we would have scraped all the hotels in all 47 prefectures. However, that would take days of nonstop scraping, because the individual scores only appear on each review's own page (reached through the review's title link), so every review requires a separate request.
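
As a rough check on the 4-hour figure, the review counts quoted above imply a lower bound set by the one-second throttle alone. The arithmetic below is a back-of-the-envelope sketch: it assumes exactly 40 review pages per hotel and ignores the requests for hotel-list and review-list pages, so the real numbers will differ somewhat.

```python
hotels = 187             # hotels scraped (from the text above)
reviews_per_hotel = 40   # approximate; assumed exact for this estimate
sleep_seconds = 1        # time.sleep(1) after each review request
total_hours = 4          # observed total run time (approximate)

review_requests = hotels * reviews_per_hotel
throttle_hours = review_requests * sleep_seconds / 3600
seconds_per_review = total_hours * 3600 / review_requests

print(f'review page requests: {review_requests}')                     # 7480
print(f'time spent sleeping:  {throttle_hours:.2f} hours')            # 2.08 hours
print(f'average per review:   {seconds_per_review:.2f} seconds')      # 1.93 seconds
```

Under these assumptions, about half of the run time is the deliberate throttle, and the remainder is network latency and parsing.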

In [3]:
rakuten_dict = scrape_rakuten()

Scraping completed for Aomori hotel 1.
Scraping completed for Aomori hotel 2.
Scraping completed for Aomori hotel 3.
Scraping completed for Aomori hotel 4.
Scraping completed for Aomori hotel 5.
Scraping completed for Aomori hotel 6.
Scraping completed for Aomori hotel 7.
Scraping completed for Aomori hotel 8.
Scraping completed for Aomori hotel 9.
Scraping completed for Aomori hotel 10.
Scraping completed for Aomori hotel 11.
Scraping completed for Aomori hotel 12.
Scraping completed for Aomori hotel 13.
Scraping completed for Aomori hotel 14.
Scraping completed for Aomori hotel 15.
Scraping completed for Aomori hotel 16.
Scraping completed for Aomori hotel 17.
Scraping completed for Aomori hotel 18.
Scraping completed for Aomori hotel 19.
Scraping completed for Aomori hotel 20.
Scraping completed for Aomori hotel 21.
Scraping completed for Aomori hotel 22.
Scraping completed for Aomori hotel 23.
Scraping completed for Aomori hotel 24.
Scraping completed for Aomori hotel 25.
Scraping 

Then we throw it into a DataFrame.

In [4]:
rakuten_df = pd.DataFrame.from_dict(rakuten_dict)
rakuten_df.head()

| | review_id | review_time | review_text | hotel_reply_time | hotel_reply_text | hotel_name | prefecture | overall_score | service_score | location_score | room_score | amenities_score | bathroom_score | food_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | voteans_21301527 | 2024-02-18 01:26:29 | 家族でのんびり過ごせて良かったです。お料理が美味しかったし、景色も美しかったです。温泉は思っ... | NaN | NaN | 鶴の舞橋と岩木山　絶景の宿　つがる富士見荘 | aomori | 4 | 4 | 4 | 4 | 4 | 4 | 5 |
| 1 | voteans_21251321 | 2024-01-29 12:03:50 | ロビーからの鶴の舞橋の景観が素晴らしく、雪景色もあって楽しめました。食事も満腹となるほどの量... | 2024-02-15 18:11:20 | この度は当館にご宿泊いただき誠にありがとうございます。ロビーから見える鶴の舞橋と雪化粧した岩... | 鶴の舞橋と岩木山　絶景の宿　つがる富士見荘 | aomori | 4 | 4 | 5 | 4 | 4 | 4 | 4 |
| 2 | voteans_21170325 | 2023-12-26 15:04:31 | 先日はお世話になりました。夕食を部屋食への変更、ありがとうございました。お食事はとても味付け... | NaN | NaN | 鶴の舞橋と岩木山　絶景の宿　つがる富士見荘 | aomori | 4 | 5 | 5 | 5 | 4 | 5 | 5 |
| 3 | voteans_21061209 | 2023-11-21 16:12:52 | 大変満足でした。お掃除も行き届いていましたしお料理も美味しかったです。お風呂のお湯も豊富で気... | NaN | NaN | 鶴の舞橋と岩木山　絶景の宿　つがる富士見荘 | aomori | 5 | 4 | 5 | 4 | 3 | 4 | 4 |
| 4 | voteans_21031432 | 2023-11-12 19:16:54 | 【良かった点】・部屋から見える景色が素晴らしく大満足。・フロントの対応が丁寧。【気になった点... | NaN | NaN | 鶴の舞橋と岩木山　絶景の宿　つがる富士見荘 | aomori | 2 | 2 | 5 | 2 | 3 | 2 | 2 |
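
Because each score is taken as the last character of a line of text, the score lists hold single-character strings rather than numbers, even though they print like integers. Below is a minimal sketch of converting them before analysis, shown on a small hand-made dictionary. The names `sample_dict` and `scores_to_int` are hypothetical; with the real DataFrame, applying `pd.to_numeric` to each score column would serve the same purpose.

```python
SCORE_COLUMNS = ['overall_score', 'service_score', 'location_score',
                 'room_score', 'amenities_score', 'bathroom_score',
                 'food_score']


def scores_to_int(data: dict) -> dict:
    """Return a copy of the dictionary with score columns cast to int."""
    converted = dict(data)  # shallow copy; non-score lists are shared
    for column in SCORE_COLUMNS:
        converted[column] = [int(value) for value in data[column]]
    return converted


# Hand-made sample in the same shape as the scraper's output dictionary.
sample_dict = {
    'review_id': ['r1', 'r2'],
    'overall_score': ['4', '2'],
    'service_score': ['4', '2'],
    'location_score': ['5', '5'],
    'room_score': ['4', '2'],
    'amenities_score': ['3', '3'],
    'bathroom_score': ['4', '2'],
    'food_score': ['5', '2'],
}

numeric_dict = scores_to_int(sample_dict)
mean_overall = sum(numeric_dict['overall_score']) / len(numeric_dict['overall_score'])
print(numeric_dict['overall_score'])
print(mean_overall)
```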


Lastly, we export it to a `.csv` file.

In [5]:
rakuten_df.to_csv('../datasets/rakuten_tohoku_reviews_replies.csv', index=False)