# The Relationship Between a Hotel's Geographical Proximity to the Center of the Country and Its Price

* Yarin Cohen, ID: 211361720
* Amit Shiber, ID: 322372582

## About Our Project

The gap between the periphery and the center of the country comes up in the media from time to time. We decided to look into the subject by comparing hotel prices in cities near Tel Aviv with prices in distant cities. After crawling the data from the hotel website, we will use an additional function that calls an API to calculate the distance between two locations. Is there a connection between a hotel's price and its proximity to the center of the country?

### Information Sources and Data Acquisition Methods

* **Crawling Booking.com** - One of the largest online travel agencies. As of December 31, 2022, Booking.com offered lodging reservation services for approximately 2.7 million properties (about 400,000 hotels, motels, and resorts and 2.3 million homes and apartments) in over 220 countries and in over 40 languages. It will supply the hotel data for this project.

* **GeoDB Cities API** - Online cities database. It exposes city, region, and country data via both GraphQL and REST APIs. It will help us calculate the distance between two cities.

### Data Set Description

Each line in the data set represents a hotel.

Columns in the data set:
* Hotel name
* Hotel Address
* Hotel Description
* Price per night (on a fixed date, the cheapest deal)
* Score - general
* Score - staff
* Score - facilities
* Score - convenience
* Score - value for money
* Score - location
* Score - cleanliness
* Proximity to the center of the country (km)

### Machine Learning

* **Type of ML**: Regression

* We will start with simple regression models (a single variable and low powers) and try each pair of an explanatory variable and an explained variable.

* There is no rule for how many variables make a regression heavy and slow. If the software starts to struggle, we will stop and consider whether adding more variables and powers improves the prediction or only adds computational cost (run time, memory, etc.). We need to balance predictive power against complexity and resources such as our own time.

* If the regression results are not satisfactory, we will switch to classification, dividing the prices into price levels.
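The plan in this list can be illustrated with a minimal, standard-library sketch: a one-variable least-squares fit with its R² value, followed by the fallback of binning prices into levels by quantile. The numbers are invented placeholders, not scraped results, and the helper names (`fit_line`, `r_squared`, `price_levels`) are hypothetical. In the notebook itself, `sklearn.linear_model.LinearRegression` would presumably take the place of `fit_line`.

```python
from statistics import mean, quantiles


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (one variable)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def price_levels(prices, n_levels=3):
    """Label each price 0..n_levels-1 according to the quantile it falls in."""
    cuts = quantiles(prices, n=n_levels)
    return [sum(p > c for c in cuts) for p in prices]


# Invented toy numbers for illustration only (NOT data from the crawl):
distance_km = [5, 20, 60, 120, 200, 280]
price_nis = [900, 850, 700, 600, 520, 480]

slope, intercept = fit_line(distance_km, price_nis)
print("slope:", slope, "intercept:", intercept)
print("R^2:", r_squared(distance_km, price_nis, slope, intercept))
print("price levels:", price_levels(price_nis))
```

Quantile-based levels keep the classes roughly balanced, which is one reasonable choice for the classification fallback; fixed price thresholds would be another.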

## Imports

In [11]:
import requests
import bs4
from bs4 import BeautifulSoup
import time
import random
import pandas as pd
import scipy as sc
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import sklearn
from sklearn import linear_model, metrics, preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import r2_score, f1_score
%matplotlib inline

## Step 1: Defining a Research Question

Is it possible to predict the price of a night in a given hotel from its proximity to the center of the country and the scores that guests gave it in the various categories?

## Step 2: Data Acquisition

### Data Acquisition by Crawling

First of all, we will check the rules in Booking.com's robots.txt, to see whether there are pages we may not crawl: https://booking.com/robots.txt
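The same check can also be done programmatically. The sketch below uses Python's standard `urllib.robotparser` on a small, made-up rules text, so it does not reflect Booking.com's actual robots.txt. The helper name `is_allowed` and the sample rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration (NOT the real Booking.com file):
sample_rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/hotel/some-hotel.html"))
print(is_allowed(sample_rules, "https://example.com/private/page.html"))
```

To test the live file instead, the text of https://booking.com/robots.txt could be downloaded and passed in as `robots_txt`.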

* We will start by manually searching Booking.com's main page for a vacation in Israel on 01-02/02/2024 (the check-in and check-out dates that appear in the search URL).

* The results page will be crawled first.
* Because the desktop HTML is complex, we will use the mobile version of Booking.com.
* <a href="https://www.booking.com/searchresults.he.html?ss=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&ss=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&group_adults=2&group_children=0&no_rooms=1&sb_travel_purpose=leisure&ssne=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&ssne_untouched=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&sb_changed_dates=1&label=gen173nr-1BCAEoggI46AdIM1gEaGqIAQGYAQ64AQfIAQzYAQHoAQGIAgGoAgO4AsPO_6IGwAIB0gIkN2EzYmVmMjgtNTkwYS00YjMyLWI5ZmUtMmZjMTQwOTdmM2I42AIF4AIB&sid=ae3ca57b743d1747c5f828a2fabc4587&aid=304142&lang=he&sb=1&src_elem=sb&src=searchresults&dest_id=103&dest_type=country&checkin=2024-02-01&checkout=2024-02-02&prefer_site_type=mdot" >This is</a> the first page that will be crawled.

#### Auxiliary Functions

In [12]:
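# Load soup object:

def loadSoupObject(url):

    # Mobile user agent, so that Booking serves the simpler mobile markup:
    headers = { "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148" }

    # Random pause between requests, to avoid overloading the site:
    time.sleep(random.randint(1,5))
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors (4xx/5xx)

    return BeautifulSoup(response.content,"html.parser")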

In [13]:
# Getting URLs of all the hotels in the page:

def getHotelsURL(soupObj):

    links = []

    # find_all is the current name of the deprecated findAll alias:
    for link in soupObj.find_all("a", {"data-testid" : "title"}):
        links.append(link.get("href"))

    return links

In [14]:
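# Getting URL of the next results page (None if there is no next page):

def getNextPage(soupObj):
    # "לעמוד הבא" is the Hebrew title ("to the next page") of the next-page link:
    nextLink = soupObj.find("a", {"title" : "לעמוד הבא"})
    return nextLink.get("href") if nextLink is not None else None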

In [61]:
# Getting information from a hotel page:

def getHotelData(soupObj):

    # Text of the first matching element, or None if the page lacks it
    # (find() returns None on a miss, and None.text would raise AttributeError):
    def textOf(tag, attrs):
        element = soupObj.find(tag, attrs)
        return element.text if element is not None else None

    dataOfHotel = []

    # Hotel name:
    dataOfHotel.append(textOf("span", {"class" : "hp-header--title--text"}))

    # Hotel address:
    dataOfHotel.append(textOf("span", {"class" : "js_hp_address_text_line"}))

    # Hotel description:
    dataOfHotel.append(textOf("div", {"class" : "page-section--content"}))

    # Price per night (on a fixed date, the cheapest deal):
    dataOfHotel.append(textOf("div", {"class" : "prco-js-headline-price"}))

    # Score - general:
    dataOfHotel.append(textOf("div", {"data-testid" : "review-score-component"}))

    # NOTE: ids such as ":R5m:-label" look auto-generated and may differ between pages.

    # Score - staff:
    # dataOfHotel.append(soupObj.find("div",{"id" : ":rb:-label"}).text)

    # Score - facilities:
    # dataOfHotel.append(soupObj.find("div",{"id" : ":r9:-label"}).text)

    # Score - convenience:
    # dataOfHotel.append(soupObj.find("div",{"id" : ":ra:-label"}).text)

    # Score - value for money:
    dataOfHotel.append(textOf("div", {"id" : ":R5m:-label"}))

    # Score - location:
    dataOfHotel.append(textOf("div", {"id" : ":R4m:-label"}))

    # Score - cleanliness:
    dataOfHotel.append(textOf("div", {"id" : ":R56:-label"}))

    return dataOfHotel


#### Main Function

In [16]:
urlResults = "https://www.booking.com/searchresults.he.html?ss=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&ss=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&group_adults=2&group_children=0&no_rooms=1&sb_travel_purpose=leisure&ssne=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&ssne_untouched=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&sb_changed_dates=1&label=gen173nr-1BCAEoggI46AdIM1gEaGqIAQGYAQ64AQfIAQzYAQHoAQGIAgGoAgO4AsPO_6IGwAIB0gIkN2EzYmVmMjgtNTkwYS00YjMyLWI5ZmUtMmZjMTQwOTdmM2I42AIF4AIB&sid=ae3ca57b743d1747c5f828a2fabc4587&aid=304142&lang=he&sb=1&src_elem=sb&src=searchresults&dest_id=103&dest_type=country&checkin=2024-02-01&checkout=2024-02-02&prefer_site_type=mdot"
currentPage = loadSoupObject(urlResults)

resultsPages = []
hotelsLinks = []

In [17]:
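# Collecting the results pages (as soup objects):

resultsPages.append(currentPage)

# Up to 33 more pages; stop early if a page has no next-page link:
for i in range(33):
    nextUrl = getNextPage(currentPage)
    if nextUrl is None:
        break
    currentPage = loadSoupObject(nextUrl)
    resultsPages.append(currentPage)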

In [30]:
# Collecting links of hotels:

for page in resultsPages:
    hotelsLinks.extend(getHotelsURL(page))
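If the same property shows up on more than one results page, `hotelsLinks` will hold it several times, and the copies will not be byte-identical because each link carries query parameters such as `hpos`. A minimal sketch of one way to deduplicate, assuming that two links with the same URL path point at the same hotel (the function name `dedupe_hotel_links` is hypothetical):

```python
from urllib.parse import urlsplit


def dedupe_hotel_links(links):
    """Keep the first link seen for each URL path, preserving the original order."""
    first_by_path = {}
    for link in links:
        path = urlsplit(link).path  # e.g. "/hotel/il/some-hotel.he.html"
        first_by_path.setdefault(path, link)
    return list(first_by_path.values())


# Illustrative links with shortened, made-up hotel names:
example_links = [
    "https://www.booking.com/hotel/il/a.he.html?hpos=1",
    "https://www.booking.com/hotel/il/b.he.html?hpos=2",
    "https://www.booking.com/hotel/il/a.he.html?hpos=26",
]
print(dedupe_hotel_links(example_links))
```

Keying on the path rather than the full URL ignores the tracking parameters while keeping one complete, usable link per hotel.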

In [46]:
hotelsLinks

['https://www.booking.com/hotel/il/one-bedroom-apartment-with-view-sheshet-hayamim.he.html?label=gen173nr-1BCAEoggI46AdIM1gEaGqIAQGYAQ64AQfIAQzYAQHoAQGIAgGoAgO4AsPO_6IGwAIB0gIkN2EzYmVmMjgtNTkwYS00YjMyLWI5ZmUtMmZjMTQwOTdmM2I42AIF4AIB&sid=ae3ca57b743d1747c5f828a2fabc4587&aid=304142&ucfs=1&arphpl=1&checkin=2024-02-01&checkout=2024-02-02&dest_id=103&dest_type=country&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=1&hapos=1&sr_order=popularity&srpvid=af4073c3bbbb01e7&srepoch=1684513671&all_sr_blocks=775517801_335848457_2_0_0&highlighted_blocks=775517801_335848457_2_0_0&matching_block_id=775517801_335848457_2_0_0&sr_pri_blocks=775517801_335848457_2_0_0__80801',
 'https://www.booking.com/hotel/il/khvvt-yn-gdy-ein-gedi-farm.he.html?label=gen173nr-1BCAEoggI46AdIM1gEaGqIAQGYAQ64AQfIAQzYAQHoAQGIAgGoAgO4AsPO_6IGwAIB0gIkN2EzYmVmMjgtNTkwYS00YjMyLWI5ZmUtMmZjMTQwOTdmM2I42AIF4AIB&sid=ae3ca57b743d1747c5f828a2fabc4587&aid=304142&ucfs=1&arphpl=1&checkin=2024-02-01&checkout=202

In [43]:
loadSoupObject(hotelsLinks[0])

<!DOCTYPE html>

<!--
You know you could be getting paid to poke around in our code?
We're hiring designers and developers to work in Amsterdam:
https://careers.booking.com/
-->
<!-- mdot-548 -->
<html class="no-js" lang="he">
<head>
<link crossorigin="" href="https://cf.bstatic.com" rel="dns-prefetch"/>
<link crossorigin="" href="https://cf.bstatic.com" rel="dns-prefetch"/>
<script nonce="D05KcSCziaTLEZM" type="text/javascript">
document.addEventListener('DOMContentLoaded', function () {
/**
* provides the current user's cookie consent
* in order to use it:
* 1. inline privacy/cookieConsent.js in the page you need to use it.
* please note that this library relies on window.PCM.isCountryNeedCookieBanner to be initialised
* before using (calling getValue function) it
* 2. in your js file:
*
* var privacyCookieConsent = B.require('privacyCookieConsent');
* var consent = privacyCookieConsent.getValue();
*/
B.define('privacyCookieConsent', function () {
var consentGroupIsAllowed = {
analyt

In [62]:
# Crawling data from hotel pages:

getHotelData(loadSoupObject(hotelsLinks[0]))

['One Bedroom Apartment with View by Stay Eilat',
 '\n7014 Sderot Sheshet HaYamim, אילת\n',
 '\nמקום האירוח ׳דירת חדר שינה אחד עם נוף באילת׳ (One Bedroom Apartment with View by Stay Eilat) ממוקם באילת, במרחק 2.3 ק"מ מחוף קיסוסקי ומחוף הפנינה, ומציע מיזוג אוויר, אינטרנט אלחוטי חינם ונוף לעיר ולים. \nבדירה יש חדר שינה אחד, מטבח עם מיקרוגל\n …\n\n\n\nהצג עוד\n\n\n\n',
 '\n808\xa0₪\n',
 '9.0 מעולה\xa0·\xa03 חוות דעת',
 '8.3',
 '7.5',
 '9.2']
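The raw strings in this output still contain newlines, non-breaking spaces, the currency sign and review text, so they need to be converted to numbers before any regression. A minimal sketch of that conversion, assuming prices are always quoted in shekels and the first number in a score string is the score itself (the helper names `parse_price` and `parse_score` are hypothetical):

```python
import re


def parse_price(raw):
    """Extract the numeric price from text such as '\\n808\\xa0₪\\n'; None if absent."""
    if raw is None:
        return None
    digits = re.sub(r"[^\d.]", "", raw.replace(",", ""))
    return float(digits) if digits else None


def parse_score(raw):
    """Return the first decimal number in a score string; None if absent."""
    if raw is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


print(parse_price("\n808\xa0₪\n"))
print(parse_score("9.0 מעולה\xa0·\xa03 חוות דעת"))
print(parse_score("8.3"))
```

Returning `None` for missing values keeps the gaps visible, so they can be dealt with in the missing-data step rather than silently turned into zeros.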

### Data Acquisition by API

#### Auxiliary Functions

In [20]:
# Calculate distance between two locations:
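Once the coordinates of two cities are known, the distance itself can be computed locally. The sketch below uses the haversine great-circle formula; it is not a call to the GeoDB Cities API, whose request code is not part of this notebook yet. The coordinates are approximate values assumed for illustration, and the name `haversine_km` is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # Mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate coordinates (assumed for illustration):
TEL_AVIV = (32.0853, 34.7818)
EILAT = (29.5577, 34.9519)

print(round(haversine_km(*TEL_AVIV, *EILAT)), "km")
```

This gives the straight-line distance over the Earth's surface, not the driving distance, which is a simplification worth keeping in mind when interpreting "proximity to the center".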



#### Main Function

In [21]:
# Main



## Step 3: Data Handling

### Missing Data

#### Auxiliary Functions

In [22]:
# Code:



#### Main Function

In [23]:
# Main



### Data Duplication

#### Auxiliary Functions

In [24]:
# Code:



#### Main Function

In [25]:
# Main



### Outliers

#### Auxiliary Functions

In [26]:
# Code:
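This step is still empty in the notebook. One common option, assumed here and not taken from the authors' plan, is Tukey's interquartile-range (IQR) rule: flag any price further than 1.5 IQRs below the first quartile or above the third. The helper names and the sample prices are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return a list of booleans, True where the value lies outside the IQR limits."""
    low, high = iqr_bounds(values, k)
    return [v < low or v > high for v in values]


# Invented nightly prices for illustration only (NOT data from the crawl):
prices = [400, 450, 500, 520, 560, 600, 5000]
print(iqr_bounds(prices))
print(flag_outliers(prices))
```

Flagging rather than deleting leaves the decision about each extreme price, for example a luxury suite versus a parsing error, to a later manual check.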



#### Main Function

In [27]:
# Main



## Step 4: Machine Learning

#### Auxiliary Functions

In [28]:
# Code:



#### Main Function

In [29]:
# Main

