# The Relationship Between a Hotel's Geographical Proximity to the Center of the Country and Its Price

* Yarin Cohen, ID: 211361720
* Amit Shiber, ID: 322372582

## About Our Project

The gap between the periphery and the center of the country comes up in the media from time to time. We decided to look into the subject by comparing hotel prices in cities near Tel Aviv with prices in more distant cities. After crawling the data from the hotel booking website, we will use an additional function that calls an API to calculate the distance between two locations. Our question: is there a connection between a hotel's price and its proximity to the center of the country?

### Information Sources and Data Acquisition Methods

* **Crawling Booking.com** - One of the largest online travel agencies. As of December 31, 2022, Booking.com offered lodging reservation services for approximately 2.7 million properties, including 400,000 hotels, motels, and resorts and 2.3 million homes and apartments, in over 220 countries and in over 40 languages. It will provide the hotel data for this project.

* **GeoDB Cities API** - Online cities database. It exposes city, region, and country data via both GraphQL and REST APIs. It will help us calculate the distance between two cities.

### Data Set Description

Each line in the data set represents a hotel.

Columns in the data set:
* Hotel name
* Hotel Address
* Hotel Description
* Price per night (on a fixed date, the cheapest deal)
* Score - general
* Score - staff
* Score - facilities
* Score - convenience
* Score - value for money
* Score - location
* Proximity to the center of the country (km)

### Machine Learning

* **Type of ML**: Regression

* We will start with simple regression models (a single variable and low powers) and go through each pair of an explanatory variable and an explained variable.

* There is no rule for how many variables make a regression heavy and sluggish. If the software starts to struggle, we will stop and consider whether adding variables and higher powers contributes to the prediction or only adds to the cost in computation, memory, etc. We will need to balance predictive power against complexity and resources such as our own time.

* If the regression results are not satisfactory, we will switch to classification by dividing the prices into price levels.
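The plan above can be sketched on made-up data. This is a minimal illustration and not the project's final model: it uses NumPy's least-squares polynomial fit in place of scikit-learn, the `fit_polynomial` and `price_levels` helpers are hypothetical names, and the synthetic distances and prices are generated only to make the code runnable, so they say nothing about real hotel prices.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit of y on x; returns (coefficients, R^2)."""
    coeffs = np.polyfit(x, y, degree)
    predictions = np.polyval(coeffs, x)
    ss_res = np.sum((y - predictions) ** 2)   # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # total sum of squares
    return coeffs, 1.0 - ss_res / ss_tot

def price_levels(prices, n_levels=3):
    """Assign each price to a quantile-based level in 0..n_levels-1."""
    inner_quantiles = np.linspace(0, 1, n_levels + 1)[1:-1]
    edges = np.quantile(prices, inner_quantiles)
    return np.digitize(prices, edges)

# Synthetic data (an assumption for illustration only, not crawled values).
rng = np.random.default_rng(0)
distance_km = rng.uniform(0, 300, size=200)
price = 900 - 1.5 * distance_km + rng.normal(0, 40, size=200)

# Start simple: one explanatory variable, low powers.
r2_by_degree = {}
for degree in (1, 2):
    _, r2 = fit_polynomial(distance_km, price, degree)
    r2_by_degree[degree] = r2
    print(f"degree {degree}: R^2 = {r2:.3f}")

# Fallback: turn the regression target into price levels for classification.
levels = price_levels(price)
print("hotels per level:", np.bincount(levels))
```

Comparing R² across degrees shows whether a higher power adds predictive value or only complexity; the quantile-based levels give roughly balanced classes for the classification fallback.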

## Imports

In [93]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import scipy as sc
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import sklearn
from sklearn import linear_model, metrics, preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import r2_score, f1_score
%matplotlib inline

## Step 1: Defining a Research Question

Is it possible to predict the price of a night at a given hotel based on its proximity to the center of the country and the scores users gave it in the various categories?

## Step 2: Data Acquisition

### Data Acquisition by Crawling

First of all, we will check Booking.com's robots.txt to see whether there are any pages we are not allowed to crawl: https://booking.com/robots.txt
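The robots.txt check can also be done in code with the standard library's `urllib.robotparser`. The sketch below parses a small set of hypothetical rules so that it runs offline; these rules are not Booking.com's actual robots.txt, and `build_parser` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_lines):
    """Build a robots.txt parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser

# Hypothetical rules for illustration only; NOT Booking.com's real file.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

parser = build_parser(SAMPLE_RULES)
allowed = parser.can_fetch("*", "https://example.com/searchresults.html")
blocked = parser.can_fetch("*", "https://example.com/private/page.html")
print(allowed, blocked)

# Against the live site, the parser would instead be loaded with:
#   parser = RobotFileParser()
#   parser.set_url("https://booking.com/robots.txt")
#   parser.read()
```

Calling `can_fetch` before each request keeps the crawler within whatever the site's file permits.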

* We will start by searching manually on Booking's main page for a vacation in Israel, on 01-02/08/2023.

* The results page will be crawled first.
* Because the HTML elements are complex, we will use the mobile version of Booking.com.
* <a href="https://www.booking.com/searchresults.he.html?ss=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&ss=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&group_adults=2&group_children=0&no_rooms=1&sb_travel_purpose=leisure&ssne=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&ssne_untouched=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&sb_changed_dates=1&label=gen173nr-1BCAEoggI46AdIM1gEaGqIAQGYAQ64AQfIAQzYAQHoAQGIAgGoAgO4AsPO_6IGwAIB0gIkN2EzYmVmMjgtNTkwYS00YjMyLWI5ZmUtMmZjMTQwOTdmM2I42AIF4AIB&sid=ae3ca57b743d1747c5f828a2fabc4587&aid=304142&lang=he&sb=1&src_elem=sb&src=searchresults&dest_id=103&dest_type=country&checkin=2024-02-01&checkout=2024-02-02&prefer_site_type=mdot">This is</a> the first page that will be crawled.

#### Auxiliary Functions

In [94]:
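# Load soup object:

def loadSoupObject(url):
    # Fetch the page; fail early on HTTP errors instead of parsing an error page.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")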

In [95]:
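# Getting URLs of all the hotels in the page:

from urllib.parse import urljoin

def getHotelsURL(soupObj):
    links = []

    # Each hotel card wraps its header in a link with this class.
    for link in soupObj.find_all("a", {"class": "bui-card__header_full_link_wrap"}):
        # urljoin avoids a doubled slash when href already starts with "/".
        links.append(urljoin("https://www.booking.com/", link.get("href")))

    return links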

In [96]:
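# Getting URL of the next results page:

def getNextPage(soupObj):
    # Work in progress: currently returns every anchor on the page.
    # The link to narrow down to is <a title="לעמוד הבא" ...> ("to the next page").
    return soupObj.find_all("a")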
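One possible way to narrow the search to the next-page link is to match on the `title` attribute noted in the comment above. This is a sketch under assumptions: the `get_next_page_url` name, the sample markup, and the `offset=25` query parameter are hypothetical, and the real page structure may differ.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://www.booking.com/"
NEXT_PAGE_TITLE = "לעמוד הבא"  # Hebrew for "to the next page"

def get_next_page_url(soup_obj):
    """Return the absolute URL of the next results page, or None if absent."""
    link = soup_obj.find("a", attrs={"title": NEXT_PAGE_TITLE})
    if link is None or not link.get("href"):
        return None
    return urljoin(BASE_URL, link["href"])

# Hypothetical markup for illustration; not copied from Booking.com.
sample_html = (
    '<a href="/index.he.html">home</a>'
    f'<a title="{NEXT_PAGE_TITLE}" href="/searchresults.he.html?offset=25">next</a>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")
next_url = get_next_page_url(sample_soup)
print(next_url)

# A page without the link signals the last results page.
last_page_url = get_next_page_url(BeautifulSoup("<p>no links</p>", "html.parser"))
print(last_page_url)
```

Returning `None` on the last page gives the crawl loop a simple stopping condition.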

In [97]:
# Getting information from hotel page:



#### Main Function

In [98]:
# Main

soup = loadSoupObject("https://www.booking.com/searchresults.he.html?ss=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&ss=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&group_adults=2&group_children=0&no_rooms=1&sb_travel_purpose=leisure&ssne=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&ssne_untouched=%D7%99%D7%A9%D7%A8%D7%90%D7%9C&sb_changed_dates=1&label=gen173nr-1BCAEoggI46AdIM1gEaGqIAQGYAQ64AQfIAQzYAQHoAQGIAgGoAgO4AsPO_6IGwAIB0gIkN2EzYmVmMjgtNTkwYS00YjMyLWI5ZmUtMmZjMTQwOTdmM2I42AIF4AIB&sid=ae3ca57b743d1747c5f828a2fabc4587&aid=304142&lang=he&sb=1&src_elem=sb&src=searchresults&dest_id=103&dest_type=country&checkin=2024-02-01&checkout=2024-02-02&prefer_site_type=mdot");
links = getHotelsURL(soup)
nextPage = getNextPage(soup)  # renamed so the built-in next() is not shadowed
nextPage

[<a class="a11y-skip-to-content" href="#indexsearch">דלג לתוכן העיקרי</a>,
 ...

### Data Acquisition by API
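Until the API call is written, the distance itself can be illustrated with the haversine great-circle formula. This is a stand-in technique and not the GeoDB Cities API: `haversine_km` is a hypothetical helper, and the city coordinates below are approximate values supplied for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate coordinates (assumptions for illustration).
TEL_AVIV = (32.0853, 34.7818)
EILAT = (29.5577, 34.9519)

distance_to_center = haversine_km(*EILAT, *TEL_AVIV)
print(f"Eilat to Tel Aviv: {distance_to_center:.0f} km")
```

The result is a straight-line distance, so it will be shorter than any driving distance between the same two cities.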

#### Auxiliary Functions

In [99]:
# Calculate distance between two locations:



#### Main Function

In [100]:
# Main



## Step 3: Data Handling
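The three handling steps in this section (missing data, duplicates, outliers) could look roughly like the sketch below. It is one possible approach and not the project's final code: the `clean_hotels` helper, the column names `name`, `address`, and `price`, the 1.5 × IQR outlier rule, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd

def clean_hotels(df, price_col="price"):
    """Drop rows with no price, duplicate hotels, and price outliers (IQR rule)."""
    # Missing data: a hotel without a price cannot be used for prediction.
    cleaned = df.dropna(subset=[price_col])

    # Duplicates: treat the same name and address as the same hotel.
    cleaned = cleaned.drop_duplicates(subset=["name", "address"])

    # Outliers: keep prices within 1.5 * IQR of the quartiles.
    q1 = cleaned[price_col].quantile(0.25)
    q3 = cleaned[price_col].quantile(0.75)
    iqr = q3 - q1
    in_range = cleaned[price_col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return cleaned[in_range].reset_index(drop=True)

# Made-up rows for illustration only.
hotels = pd.DataFrame({
    "name": ["A", "A", "B", "C", "D", "E", "F"],
    "address": ["1 St", "1 St", "2 St", "3 St", "4 St", "5 St", "6 St"],
    "price": [400.0, 400.0, 450.0, None, 500.0, 520.0, 9000.0],
})

cleaned = clean_hotels(hotels)
print(cleaned)
```

Whether to drop or impute missing values, and which outlier threshold to use, depends on how much data the crawl actually yields.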

### Missing Data

#### Auxiliary Functions

In [101]:
# Code:



#### Main Function

In [102]:
# Main



### Data Duplication

#### Auxiliary Functions

In [103]:
# Code:



#### Main Function

In [104]:
# Main



### Outliers

#### Auxiliary Functions

In [105]:
# Code:



#### Main Function

In [106]:
# Main



## Step 4: Machine Learning

#### Auxiliary Functions

In [107]:
# Code:



#### Main Function

In [108]:
# Main

