__Steps:__ 

- __Initialization__

  Initialize a new folder and a Git repository within it. Name the folder using your student ID.
     
- __Web Crawling__

  Write a web crawler to fetch data from [booking.com](https://www.booking.com/index.zh-tw.html?label=gen173nr-1DCAEoggI46AdIM1gEaOcBiAEBmAEwuAEHyAEN2AED6AEBiAIBqAIDuALnxKuoBsACAdICJDc3MGNmMGE5LTdlYTAtNDMyZS1iM2Y4LTNiMzI5NDZkYTMxZNgCBOACAQ&sid=d2bbb0e0a1dbbf961b544750b10edeb5&keep_landing=1&sb_price_type=total&). Implement a function that takes `location`, `check-in date`, and `check-out date` as inputs and returns a DataFrame containing hotel details like `name`, `location`, `price`, `rating`, `distance`, and `comments`. Commit this notebook to your Git repository with a clear commit message.

- __Data Cleaning__

  After scraping, ensure each column has the correct data type: `price` as an integer, `rating` as a float, `comments` as a string, and `distance` as a float in kilometers.
  
- __Data Visualization__

  Use `Plotly` to visualize the data in `web_crawler.ipynb`. The scatter plot should show `price` on the x-axis and `distance from the center` on the y-axis, with points color-coded by `ratings`. Commit the updated notebook to Git with a clear commit message.
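
The Data Cleaning step can be sketched as two small helpers. This is a minimal sketch, assuming prices arrive as strings like `TWD 16,386` and distances as strings like `距中心 3.1 公里` (kilometers) or `距中心 700 公尺` (meters), which is the format the parsing code in this notebook implies. The helper names `parse_price` and `parse_distance_km` are hypothetical and not part of the assignment.

```python
import re

def parse_price(text):
    """Convert a price string such as 'TWD 16,386' to an int (assumed format)."""
    # Drop the currency code, spaces, and thousands separators
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None

def parse_distance_km(text):
    """Convert a distance string to kilometers as a float (assumed format)."""
    if text is None:
        return None
    match = re.search(r"(\d+(?:\.\d+)?)", text)
    if match is None:
        return None
    value = float(match.group(1))
    # '公尺' means meters; any other unit is treated as kilometers
    return value / 1000 if "公尺" in text else value

print(parse_price("TWD 16,386"))
print(parse_distance_km("距中心 700 公尺"))
print(parse_distance_km("距中心 3.1 公里"))
```

Keeping the parsing in named functions makes it possible to check each conversion on a single string before applying it to a whole column with `Series.apply`.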


In [53]:
import pandas as pd
import requests as rq
from bs4 import BeautifulSoup as bs
from urllib import parse
import re

def get_hotels(location, checkin, checkout, num_results=100):
    string = "https://www.booking.com/searchresults.zh-tw.html?"
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15"
    }

    # Initialize an empty DataFrame with columns
    columns = ["name", "location", "price", "ratings", "distance", "comments"]
    hotels = pd.DataFrame(columns=columns)

    offset = 0
    while len(hotels) < num_results:
        query = {
            "ss": location,
            "checkin": checkin,
            "checkout": checkout,
            "offset": offset
        }

        
        url = string + parse.urlencode(query)
        #print(f"Currently searching for {url}")
        
        res = rq.get(url, headers=headers)
        #print(f"The status code is {res.status_code}")
        
        soup = bs(res.text, 'html.parser')
        #if offset == 0:
        #    print(soup.select('h1.f6431b446c.d5f78961c3')[0].text.strip())
        
        offset += 25
    
            
        ratings_data = [rating.text.strip() for rating in soup.select('div.aca0ade214.a5f1aae5b2.cd2e7d62b0')]
        if not ratings_data:  # no data on this page, so stop paging
            break
        
        # Initialize a new temp DataFrame for each loop iteration
        temp_df = pd.DataFrame(columns=columns)

        # Extract the data from the list and add it to the temp DataFrame
        for item in ratings_data[1:52:2]:
            
            #print(item)
            
            # Use regular expression to extract ratings and comments  
            match = re.match(r'(\d\.\d)(\D+)(\d*,?\d+\s則評語)', item)
            if match:
                rating, comment_text, _ = match.groups()
            else:
                rating = None
                comment_text = None
                
            temp_df.loc[len(temp_df)] = [None, None, None, rating, None, comment_text]

            
        temp_df["name"] = [name.text.strip() for name in soup.select('div[data-testid="title"].f6431b446c.a15b38c233')]
        temp_df["location"] = [address.text.strip() for address in soup.select('span.aee5343fdb.def9bc142a[data-testid="address"]')]
        temp_df["price"] = [price.text.strip() for price in soup.select("span.f6431b446c.fbfd7c1165.e84eb96b1f")]
        if soup.select('span[aria-expanded="false"][data-testid="distance"]'):
            temp_df["distance"] = [distance.text.strip() for distance in soup.select('span[aria-expanded="false"][data-testid="distance"]')]

        # Append temp_df to the main DataFrame hotels
        hotels = pd.concat([hotels, temp_df], ignore_index=True)
        
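    # Clean up column types: price -> int, ratings -> float, comments -> str, distance -> km (float)
    hotels['price'] = hotels['price'].str.replace('TWD', '').str.replace(',', '').astype(int)
    hotels["ratings"] = hotels["ratings"].astype(float)
    hotels["comments"] = hotels["comments"].astype(str)

    def to_km(text):
        # "距中心" means "from center"; "公尺" means meters, otherwise the value is already in kilometers
        if pd.isna(text):
            return None
        value = float(text.replace('距中心 ', '').split(' ')[0])
        return value / 1000 if '公尺' in text else value

    hotels['distance'] = hotels['distance'].apply(to_km)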
    return hotels[:num_results]


hotels = get_hotels("Paris", "2023-12-12", "2023-12-14", num_results=100)

hotels





| | name | location | price | ratings | distance | comments |
|---|---|---|---|---|---|---|
| 0 | Hôtel Opéra Liège | 9區 - 歌劇院, 巴黎 | 16386 | 8.8 | 3.1 | 很棒 |
| 1 | Sonder L'Edmond Parc Monceau | 17區 - 巴蒂諾爾, 巴黎 | 10637 | 8.5 | 4.0 | 非常好 |
| 2 | 2 lits doubles près de la Seine/Notre Dame | 5區 - 拉丁區, 巴黎 | 9971 | | 0.7 | |
| 3 | Rare 18th century washhouse convert | 18區 - 蒙馬特, 巴黎 | 27345 | | 3.7 | |
| 4 | Delphine | 16區 - 帕西, 巴黎 | 18644 | | 6.3 | |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | Hotel Paris Italie | 13. 義大利廣場, 巴黎 | 9446 | 7.9 | 3.0 | 好 |
| 96 | 馨樂庭奧斯特利茲巴黎酒店 | 13. 義大利廣場, 巴黎 | 9426 | 8.2 | 2.5 | 非常好 |
| 97 | 提姆酒店 | 10區 - 共和區, 巴黎 | 8010 | 7.6 | 2.1 | 好 |
| 98 | Louisa Hotel Paris | 16區 - 帕西, 巴黎 | 13120 | 7.4 | 4.8 | 好 |
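
Some rows in the output above have no rating (for example rows 2 to 4), so they cannot be color-coded by `ratings` in the scatter plot. The following is a minimal sketch of one way to handle this, assuming that dropping unrated listings is acceptable for the visualization. The small frame reuses three rows from the output, and the names `sample` and `rated` are hypothetical.

```python
import pandas as pd

# Three rows copied from the output above; a missing rating is represented as None
sample = pd.DataFrame({
    "name": ["Hôtel Opéra Liège", "Delphine", "Louisa Hotel Paris"],
    "price": [16386, 18644, 13120],
    "ratings": [8.8, None, 7.4],
    "distance": [3.1, 6.3, 4.8],
})

# Keep only listings that have a rating, so every plotted point can be color-coded
rated = sample.dropna(subset=["ratings"]).reset_index(drop=True)

print(len(sample) - len(rated), "listing(s) without a rating were dropped")
print(rated["name"].tolist())
```

An alternative would be to keep the unrated rows and plot them in a neutral color; which choice is appropriate depends on what the plot is meant to show.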


In [None]:
import plotly.express as px

# Create a scatter plot
fig = px.scatter(hotels, x='price', y='distance', color='ratings',
                 hover_name='name',
                 hover_data={'price': True, 'ratings': True}) # to be displayed when hovering over datapoints

# Customize the plot
fig.update_traces(marker=dict(size=12, opacity=0.7),
                  selector=dict(mode='markers'))  # px.scatter traces use mode 'markers'

# Add titles and labels
fig.update_layout(
    title='Hotel Prices vs. Distance from Center',
    xaxis_title='Price',
    yaxis_title='Distance from Center (kilometers)'
)

# Show the plot
fig.show()