# <center>Author: Victor Diallen</center>

# Table of Contents
* [1. Introduction](#section1)
* [2. Importing Libraries](#section2)
* [3. Web Scraping](#section3)
* [4. Creating Dataframe with Data Received](#section4)
* [5. Saving Dataframe to a CSV File](#section5)
* [6. End of Project's Part 1](#section6)

<a id="section1"></a>
# Introduction

- This project is built in two parts.
- The first part is done here in Python, and the second part will be done in Power BI.
- The aim is to scrape Paris hotel listings from Booking.com with Python (the first part)
  and then analyze the collected data with Power BI (the second part).

<a id="section2"></a>
# Importing Libraries

In [1]:
import requests
from bs4 import BeautifulSoup as soup
import pandas as pd
import re
import time
import webbrowser
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

<a id="section3"></a>
# Web Scraping

In [31]:
# Creating empty lists to receive data from the website
hotel_list = []
classification_list = []
grade_list = []
price_list = []
taxes_list = []
center_distance_list = []
location_list = []

# Request headers (a browser-like User-Agent) sent with requests.get
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# Editable inputs for website URL
rooms = '1'
adults = '2'
children = '0'
check_in = '2024-05-07'
check_out = '2024-05-14'

# Web Scraping pages 1 to 10
for page in range(1,11):
    
    offset = (page*25 - 25)
    
    web_address = 'https://www.booking.com/searchresults.pt-br.html?label=gen173nr-1BCAEoggI46AdIM1gEaCCIAQGYAS24ARfIAQzYAQHoAQGIAgGoAgO4AvyF_KwGwAIB0gIkYjg3NmQ1ZmUtMGI2OC00NjliLWJiMTItZGU0ZjYyMGI2Mjc42AIF4AIB&sid=5b63f355c060b3ea5e8de0519fbf6c68&aid=304142&ss=Paris&ssne=Paris&ssne_untouched=Paris&lang=pt-br&src=index&dest_id=-1456928&dest_type=city&ac_position=0&ac_click_type=b&ac_langcode=xb&ac_suggestion_list_length=5&search_selected=true&search_pageview_id=bd93927e391c01a2&ac_meta=GhBiZDkzOTI3ZTM5MWMwMWEyIAAoATICeGI6BVBhcmlzQABKAFAA&checkin='+check_in+'&checkout='+check_out+'&group_adults='+adults+'&no_rooms='+rooms+'&group_children='+children+'&nflt=review_score%3D70&offset='+str(offset)
    
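    # Using a separate name so the loop variable `page` is not overwritten
    response = requests.get(web_address, headers=headers)
    # Naming the parser explicitly avoids BeautifulSoup having to guess one
    bsobj = soup(response.content, 'html.parser')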
    
    # Obtaining information about the hotel names
    name_find = bsobj.findAll('div', {'data-testid':'title'})
    for name in name_find:
        hotel_list.append(name.text.strip())
            
    # Obtaining information about the hotel classification
    classification_find = bsobj.findAll('div', {'class':'a3b8729ab1 e6208ee469 cb2cbb3ccb'})    
    for classifying in classification_find:
        classification_list.append(classifying.text.strip())

    # Obtaining information about the hotel grades
    grade_find = bsobj.findAll('div', {'class':'a3b8729ab1 d86cee9b25'}) 
    for grading in grade_find:
        grade_list.append(grading.text.strip())
        
    # Obtaining information about the hotel prices
    price_find = bsobj.findAll('span', {'class':'f6431b446c fbfd7c1165 e84eb96b1f'})
    for pricing in price_find:
        price_list.append(pricing.text.strip())
        
    # Obtaining information about the hotel taxes
    tax_find = bsobj.findAll('div', {'class':'c5ca594cb1 f19ed67e4b'})
    for tax in tax_find:
        tax_element = tax.find('div', {'data-testid': 'taxes-and-charges'})
        # Keeping a placeholder when the element is missing so list positions stay aligned
        taxes_list.append(tax_element.text.strip() if tax_element is not None else None)
        
    # Obtaining information about the hotel distance to center
    distance_find = bsobj.findAll('span', {'data-testid':'distance'})
    for distance in distance_find:
        center_distance_list.append(distance.text.strip())
        
    # Obtaining information about the hotel location
    location_find = bsobj.findAll('span', {'data-testid':'address'})
    for loc in location_find:
        location_list.append(loc.text.strip())
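
The search URL in the cell above is assembled by string concatenation. A minimal sketch of the same pagination idea with `urllib.parse.urlencode` follows. It keeps only a handful of the query parameters visible in the original URL; whether Booking.com returns the same results without the omitted ones (`label`, `sid`, `aid` and so on) is an assumption that has not been verified here. `build_search_url` and `RESULTS_PER_PAGE` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.booking.com/searchresults.pt-br.html'
RESULTS_PER_PAGE = 25  # same step as the offset computed in the loop above


def build_search_url(page, check_in, check_out, adults='2', rooms='1', children='0'):
    """Return the search URL for a 1-based results page (hypothetical helper)."""
    params = {
        'ss': 'Paris',
        'dest_id': '-1456928',
        'dest_type': 'city',
        'checkin': check_in,
        'checkout': check_out,
        'group_adults': adults,
        'no_rooms': rooms,
        'group_children': children,
        'nflt': 'review_score=70',  # urlencode escapes '=' as %3D
        'offset': (page - 1) * RESULTS_PER_PAGE,
    }
    return BASE_URL + '?' + urlencode(params)


urls = [build_search_url(p, '2024-05-07', '2024-05-14') for p in range(1, 11)]
print(urls[0])
print(urls[-1])
```

Building the query from a dictionary keeps each parameter on its own line and takes care of URL escaping, which makes the editable inputs easier to spot than in one long string.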

<a id="section4"></a>
# Creating Dataframe with Data Received

In [246]:
# Retrieving the length of the lists after receiving data
print(len(hotel_list))
print(len(classification_list))
print(len(grade_list))
print(len(price_list))
print(len(taxes_list))
print(len(center_distance_list))
print(len(location_list))

252
252
250
252
252
252
252
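
The counts above show the grades list two entries shorter than the others. That can happen when each field is collected page-wide into its own list: a listing that lacks one element shifts every later value in that list. One way to avoid this, sketched here under assumptions, is to loop over each listing's container and look up the fields inside it, recording `None` when one is missing. The sample HTML is made up, and the `property-card` and `review-score` values of `data-testid` are assumptions for illustration; only `title` and `address` are taken from the notebook, and the real page structure would need to be checked.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a results page; the second card has no score.
SAMPLE_HTML = """
<div data-testid="property-card">
  <div data-testid="title">Hotel A</div>
  <div data-testid="review-score">8,5</div>
  <span data-testid="address">9º arr., Paris</span>
</div>
<div data-testid="property-card">
  <div data-testid="title">Hotel B</div>
  <span data-testid="address">Paris</span>
</div>
"""


def text_or_none(card, tag, testid):
    """Return the stripped text of the first matching descendant, or None."""
    element = card.find(tag, {'data-testid': testid})
    return element.text.strip() if element is not None else None


def parse_cards(html):
    """Build one dict per listing so fields from different hotels never mix."""
    page = BeautifulSoup(html, 'html.parser')
    records = []
    for card in page.find_all('div', {'data-testid': 'property-card'}):
        records.append({
            'Hotel': text_or_none(card, 'div', 'title'),
            'Grade': text_or_none(card, 'div', 'review-score'),
            'Location': text_or_none(card, 'span', 'address'),
        })
    return records


records = parse_cards(SAMPLE_HTML)
print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`, which yields one row per listing with missing values left empty.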


In [251]:
# Creating dataframe
# Grades are not added yet: the Booking website sometimes merges the classification and the
# grade into a single element, so the grades list comes out shorter than the others.
# Instead, we build the dataframe from the other lists and then remove the rows where this
# happened, so that it ends up with the same length as the grades list.

df = pd.DataFrame({
    
    'Hotel' : hotel_list,
    'Classification' : classification_list,
    'Price' : price_list,
    'Taxes' : taxes_list,
    'Distance to Center' : center_distance_list,
    'Location' : location_list
    
})

In [252]:
df.head()

| | Hotel | Classification | Price | Taxes | Distance to Center | Location |
|---|---|---|---|---|---|---|
| 0 | Austin's Saint Lazare Hotel | Muito bom | R$ 7.069 | +R$ 390 em impostos e taxas | 2,9 km do centro | 9º arr., Paris |
| 1 | Sonder Le Frochot | Bom | R$ 7.332 | +R$ 141 em impostos e taxas | 2,9 km do centro | 9º arr., Paris |
| 2 | UCPA SPORT STATION HOSTEL PARIS | Muito bom | R$ 3.717 | +R$ 75 em impostos e taxas | 4,6 km do centro | 19º arr., Paris |
| 3 | Sonder Quintinie | Muito bom | R$ 9.520 | +R$ 375 em impostos e taxas | 3,8 km do centro | 15º arr., Paris |
| 4 | Holiday Inn Paris Montmartre, an IHG Hotel | Muito bom | R$ 7.773 | +R$ 611 em impostos e taxas | 3,8 km do centro | 18º arr., Paris |


In [253]:
# Removing rows which are messed up (classification and grade merged together)
# .copy() makes the result an independent dataframe, so later column assignments are safe
df = df[~df['Classification'].str.contains(',')].copy()

In [255]:
# Joining grades list to dataframe
df['Grade'] = grade_list
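
Assigning `grade_list` as a column only works when its length equals the number of rows left after the filter, and the pairing is purely positional. The following sketch makes that dependency explicit with made-up data; `attach_grades` is a hypothetical helper, and the fused value `'Muito bom 8,1'` is an invented example of a merged classification and grade.

```python
import pandas as pd


def attach_grades(frame, grades):
    """Add a 'Grade' column, failing loudly if the lengths disagree."""
    if len(frame) != len(grades):
        raise ValueError(
            f'{len(frame)} rows but {len(grades)} grades; cannot pair them by position'
        )
    result = frame.copy()
    result['Grade'] = list(grades)
    return result


# Made-up rows: the second one has classification and grade fused together.
sample = pd.DataFrame({
    'Hotel': ['Hotel A', 'Hotel B', 'Hotel C'],
    'Classification': ['Bom', 'Muito bom 8,1', 'Bom'],
})
clean = sample[~sample['Classification'].str.contains(',')]
graded = attach_grades(clean, ['8,5', '7,6'])
print(graded)
```

A matching length does not prove the grades belong to the right hotels, so spot-checking a few rows against the website is still worthwhile.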

In [256]:
# Reorganizing dataframe
df = df[['Hotel', 'Classification', 'Grade', 'Location', 'Distance to Center', 'Price', 'Taxes']]

In [257]:
# Removing the ' do centro' ("from the center") suffix from the distance
df['Distance to Center'] = df['Distance to Center'].str.replace(' do centro', '')

In [258]:
# Removing the ' em impostos e taxas' ("in taxes and fees") suffix from the taxes
df['Taxes'] = df['Taxes'].str.replace(' em impostos e taxas', '')

In [259]:
df

| | Hotel | Classification | Grade | Location | Distance to Center | Price | Taxes |
|---|---|---|---|---|---|---|---|
| 0 | Austin's Saint Lazare Hotel | Muito bom | 85 | 9º arr., Paris | 2,9 km | R$ 7.069 | +R$ 390 |
| 1 | Sonder Le Frochot | Bom | 92 | 9º arr., Paris | 2,9 km | R$ 7.332 | +R$ 141 |
| 2 | UCPA SPORT STATION HOSTEL PARIS | Muito bom | 83 | 19º arr., Paris | 4,6 km | R$ 3.717 | +R$ 75 |
| 3 | Sonder Quintinie | Muito bom | 79 | 15º arr., Paris | 3,8 km | R$ 9.520 | +R$ 375 |
| 4 | Holiday Inn Paris Montmartre, an IHG Hotel | Muito bom | 81 | 18º arr., Paris | 3,8 km | R$ 7.773 | +R$ 611 |
| 5 | Central Hotel Paris | Bom | 84 | 14º arr., Paris | 2,7 km | R$ 8.261 | +R$ 390 |
| 6 | Odalys City Paris XVII | Bom | 81 | 17º arr., Paris | 5,3 km | R$ 4.699 | +R$ 390 |
| 7 | Hotel Anya | Bom | 76 | 11º arr., Paris | 2,2 km | R$ 3.531 | +R$ 203 |
| 8 | Hotel des Vosges | Muito bom | 81 | 20º arr., Paris | 2,6 km | R$ 3.616 | +R$ 195 |
| 9 | Aparthotel Adagio Porte de Versailles | Bom | 84 | Paris | 5,9 km | R$ 5.214 | +R$ 611 |
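
The price, tax and distance columns in the table above are still text in Brazilian number format (`.` as thousands separator, `,` as decimal separator). A minimal sketch of converting them to numbers follows, assuming every value matches the patterns shown in the table. `parse_brl` and `parse_km` are hypothetical helpers; amounts with cents and distances in other units (metres, for example) are not handled.

```python
def parse_brl(text):
    """Convert strings like 'R$ 7.069' or '+R$ 390' to an integer amount."""
    digits = text.replace('+', '').replace('R$', '').replace('.', '').strip()
    return int(digits)


def parse_km(text):
    """Convert strings like '2,9 km' to a float number of kilometres."""
    number = text.replace('km', '').strip().replace(',', '.')
    return float(number)


print(parse_brl('R$ 7.069'))
print(parse_brl('+R$ 390'))
print(parse_km('2,9 km'))
```

Applied to the dataframe, this could look like `df['Price'].map(parse_brl)` and `df['Distance to Center'].map(parse_km)`, which gives numeric columns that can be summed or averaged directly.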


<a id="section5"></a>
# Saving Dataframe to a CSV File

In [260]:
# Saving dataframe to a csv file
df.to_csv('Paris_booking_hotels_analysis.csv')
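
By default, `to_csv` writes the dataframe index as a first column with an empty header, which pandas labels `Unnamed: 0` when the file is read back. If that column is not wanted in the analysis, passing `index=False` leaves it out. The sketch below shows the difference using an in-memory buffer and a made-up row instead of the real file.

```python
import io

import pandas as pd

sample = pd.DataFrame({'Hotel': ['Hotel A'], 'Price': ['R$ 7.069']})

# Default behaviour: the index is written as a first column with an empty header.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()
print(cols_default)

# index=False writes only the data columns.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()
print(cols_no_index)
```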

<a id="section6"></a>
# End of Project's Part 1

- Part 2 of the project will be done in Power BI using the CSV file created above. Its aim is to analyze the collected data.