# Web Scraping for Hotel Data Analysis

This Jupyter Notebook collects hotel data from Booking.com using web scraping. The goal is to compare hotel prices and availability in Barcelona and Valencia on the Primavera Sound Festival weekend (30 May to 2 June 2024) and on the weekend before it (23 to 26 May 2024). Selenium automates the browser, BeautifulSoup is imported for HTML parsing, and pandas stores and manipulates the results. The workflow is: set up the browser, open the site, enter the search criteria, scrape the hotel listings, and compile them into a single table for later analysis.

### Cell 0: Importing Libraries
This cell imports the libraries used throughout the notebook for browser automation, HTML parsing, and data handling.

In [1]:
import json
import os
import time

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import (
    ElementClickInterceptedException,
    NoSuchElementException,
    StaleElementReferenceException,
)
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

### Cell 1: Browser Configuration Functions
Defines helper functions that configure Firefox (download folder, automatic downloads, and the uBlock Origin extension), start a browser session, and click elements with retries. The cell then opens Booking.com and dismisses the cookie banner.

In [2]:
def ffx_preferences(dfolder, download=False):
    '''
    Sets the preferences of the Firefox browser: download folder, file types
    that are saved without a prompt, and the ad-blocker extension.
    '''
    profile = webdriver.FirefoxProfile()
    # set download folder:
    profile.set_preference("browser.download.dir", dfolder)
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    # MIME types are comma-separated with no surrounding spaces
    profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                           "application/msword,application/rtf,application/csv,text/csv,image/png,image/jpeg,application/pdf,text/html,text/plain,application/octet-stream")

    profile.add_extension('/Users/ruimaciel/Desktop/Barcelona/NLP_I/TA_sessions/2/uBlock0@raymondhill.net.xpi')

    # download PDFs automatically instead of opening them in the built-in viewer
    # (note: this replaces the MIME type list set above)
    if download:
        profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf")
        profile.set_preference("pdfjs.disabled", True)

    options = Options()
    options.profile = profile
    return options


def start_up(link, dfolder, geko_path, download=True):
    '''
    Starts a Firefox session with the given geckodriver and opens the link.
    '''
    os.makedirs(dfolder, exist_ok=True)

    options = ffx_preferences(dfolder, download)
    service = Service(geko_path)
    browser = webdriver.Firefox(service=service, options=options)
    # Open the target website
    browser.get(link)
    time.sleep(5)  # Adjust sleep time as needed
    return browser


def check_and_click(browser, xpath, type):
    '''
    Function that tries to click on the object once per second, for up to 15
    attempts. Returns True if the click succeeded and False otherwise.
    '''
    for _ in range(15):
        if check_obscures(browser, xpath, type):
            return True
        time.sleep(1)
    return False

def check_obscures(browser, xpath, type):
    '''
    Function that tries to click on the object. Returns False if the object is
    missing, stale, or "obscured" by another element so that it is not clickable.
    Important: if it returns True, the object has already been clicked!
    '''
    try:
        if type == "xpath":
            browser.find_element('xpath', xpath).click()
        elif type == "id":
            browser.find_element('id', xpath).click()
        elif type == "css":
            browser.find_element('css selector', xpath).click()
        elif type == "class":
            browser.find_element('class name', xpath).click()
        elif type == "link":
            browser.find_element('link text', xpath).click()
        else:
            # An unknown locator type would otherwise report success without clicking
            raise ValueError(f"Unknown locator type: {type}")
    except (ElementClickInterceptedException, NoSuchElementException, StaleElementReferenceException) as e:
        print(e)
        return False
    return True

# Let's open Booking.com:
dfolder = '/Users/ruimaciel/Downloads'
geko_path = '/Users/ruimaciel/Desktop/Barcelona/NLP_I/TA_sessions/2/geckodriver'
link = 'https://www.booking.com/index.en.html'

browser = start_up(dfolder=dfolder, link=link, geko_path=geko_path)

# Dismiss the cookie banner
browser.find_element(by='xpath', value='//*[@id="onetrust-reject-all-handler"]').click()
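
The retry idea behind `check_and_click` can be written independently of Selenium, which makes it easy to test without a browser. The following is a minimal sketch, assuming a hypothetical helper named `retry_until` that is not part of the notebook or of Selenium. The caller passes in the exceptions that should trigger another attempt.

```python
import time


def retry_until(action, retry_on, attempts=15, delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Returns True as soon as action() completes without raising one of the
    exception types in retry_on, and False if every attempt raised.
    retry_until is a hypothetical helper, not a Selenium function.
    """
    for attempt in range(attempts):
        try:
            action()
            return True
        except retry_on:
            # Wait only if another attempt will follow
            if attempt < attempts - 1:
                sleep(delay)
    return False


# Possible use with the notebook's browser object (not executed here):
# clicked = retry_until(
#     lambda: browser.find_element('xpath', '//*[@id="onetrust-reject-all-handler"]').click(),
#     retry_on=(ElementClickInterceptedException, NoSuchElementException,
#               StaleElementReferenceException),
# )
```

Returning a boolean lets the caller decide what to do when the element never becomes clickable, instead of continuing silently. Selenium's own `WebDriverWait` with `expected_conditions.element_to_be_clickable` is an alternative worth considering for the same purpose.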

### Cell 2: Search Function
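Defines a function that types a destination into the search box, picks the check-in and check-out dates in the calendar widget, and submits the search. Dates are passed as ISO strings (`YYYY-MM-DD`), which the function compares against each calendar cell's `data-date` attribute. The element selectors are site-generated class names and may need updating if Booking.com changes its markup.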

In [5]:
def search_place_and_dates(browser, place, from_day, to_day):
    # Enter the place in the search field
    search1 = browser.find_element(by='xpath', value='//*[@id=":re:"]')
    search1.send_keys(place)

    # Click on the calendar
    calendar = 'button.ebbedaf8ac:nth-child(2) > span:nth-child(1)'
    browser.find_element('css selector', calendar).click()

    # Calendar navigation paths
    path = '//div[@id="calendar-searchboxdatepicker"]//table[@class="eb03f3f27f"]//tbody//td[@class="b80d5adb18"]//span[@class="cf06f772fa"]'
    next_button_path_1 = '.f4552b6561'
    next_button_path_2 = 'button.f38b6daa18:nth-child(2)'

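    # Flags to track if dates are found
    found_from_day = False
    found_to_day = False
    is_first_iteration = True
    months_advanced = 0
    max_months = 24  # safety limit so a date that never appears cannot loop forever

    # Iterate to find from_day and to_day
    while not (found_from_day and found_to_day):
        clicked = False
        dates = browser.find_elements('xpath', path)
        for date in dates:
            if date.get_attribute("data-date") == from_day and not found_from_day:
                date.click()
                found_from_day = True
                clicked = True
                break  # re-query the cells after each click

            if date.get_attribute("data-date") == to_day and not found_to_day:
                date.click()
                found_to_day = True
                clicked = True
                break

        # Navigate the calendar only when no date was clicked in this pass,
        # so a check-out date in the month already shown is not skipped
        if not clicked:
            if months_advanced >= max_months:
                raise ValueError(f"Could not find {from_day} / {to_day} in the calendar")
            if is_first_iteration:
                browser.find_element('css selector', next_button_path_1).click()
                is_first_iteration = False
            else:
                browser.find_element('css selector', next_button_path_2).click()
            months_advanced += 1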

    # Click the search button
    search_button = '/html/body/div[3]/div[2]/div/form/div[1]/div[4]/button/span'
    browser.find_element('xpath', search_button).click()


search_place_and_dates(browser, "Barcelona","2024-05-30","2024-06-02")
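
The notebook searches the same stay twice: once for the festival weekend and once for the weekend before it. The comparison dates can be derived from the festival dates rather than typed by hand. This is a minimal sketch using only the standard library; `shift_stay` is a hypothetical helper name introduced for illustration.

```python
from datetime import date, timedelta


def shift_stay(from_day, to_day, days):
    """Shift an ISO-formatted (YYYY-MM-DD) stay by a number of days.

    Raises ValueError if a date is malformed or the stay is not at least
    one night long. shift_stay is a hypothetical helper.
    """
    start = date.fromisoformat(from_day)
    end = date.fromisoformat(to_day)
    if end <= start:
        raise ValueError("to_day must be after from_day")
    delta = timedelta(days=days)
    return (start + delta).isoformat(), (end + delta).isoformat()


festival = ("2024-05-30", "2024-06-02")
previous = shift_stay(*festival, days=-7)
print(previous)  # ('2024-05-23', '2024-05-26')
```

Because `date.fromisoformat` rejects malformed strings, a typo in a date surfaces before the browser starts clicking through the calendar.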


### Cell 3: Data Scraping Function
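Defines the scraping function. For each result card it reads the hotel name, rating, price, promotion text, description, and link, storing `NaN` for any field that is missing, then moves through the result pages until the next-page button is disabled. The rows are returned as a pandas DataFrame.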

In [10]:
def scrap_hotel_data(browser):
    missing = (NoSuchElementException, StaleElementReferenceException)
    rows = []  # one dict per hotel; the DataFrame is built once at the end
    stop = False  # Flag to stop the loop
    while not stop:
        containers = browser.find_elements('xpath', '//div[@class="c066246e13"]')
        for hotel in containers:
            hotel_name = hotel.find_element('xpath', './/div[@class="f6431b446c a15b38c233"]').text
            try:
                hotel_rating = hotel.find_element('xpath', './/div[@class="a3b8729ab1 d86cee9b25"]').text
            except missing:
                hotel_rating = np.nan
            try:
                promotion = hotel.find_element('xpath', './/div[@class="d17181842f"]').text
            except missing:
                promotion = np.nan
            try:
                hotel_description_long = hotel.find_element('css selector', 'div.b1037148f8').text
            except missing:
                hotel_description_long = np.nan
            try:
                hotel_price = hotel.find_element('xpath', './/span[@class="f6431b446c fbfd7c1165 e84eb96b1f"]').text
            except missing:
                hotel_price = np.nan
            try:
                url = hotel.find_element('xpath', './/a[@href]')
                hotel_url = url.get_attribute('href')
            except missing:
                hotel_url = np.nan

            rows.append({'Hotels': hotel_name, 'Ratings': hotel_rating, 'Price': hotel_price, 'Promotion': promotion, 'Description': hotel_description_long, 'Link': hotel_url})

        # Go to the next results page, or stop on the last one
        try:
            next_button_check = browser.find_element("xpath", "/html/body/div[4]/div/div[2]/div/div[2]/div[3]/div[2]/div[2]/div[4]/div[2]/nav/nav/div/div[3]/button")
            is_disabled = next_button_check.get_attribute("disabled")
            if is_disabled == "true":
                stop = True  # Last page reached
            else:
                browser.find_element('css selector', 'div.b16a89683f:nth-child(3) > button:nth-child(1) > span:nth-child(1) > span:nth-child(1)').click()
                time.sleep(2)
        except (NoSuchElementException, ElementClickInterceptedException, StaleElementReferenceException) as e:
            # Without a usable next button the same page would be scraped again forever
            print(e)
            stop = True

    return pd.DataFrame(rows, columns=['Hotels', 'Ratings', 'Price', 'Promotion', 'Description', 'Link'])

barcelona_1 = scrap_hotel_data(browser)
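
The scraper stores prices as display text such as `€ 1,254`, which has to be converted to a number before any price comparison. The following is a minimal sketch of that conversion. It assumes the comma is a thousands separator, as on the English-language page used here, and `parse_price` is a hypothetical helper name.

```python
import math
import re


def parse_price(text):
    """Convert display text such as '€ 1,254' to a float.

    Assumes ',' is a thousands separator. Returns NaN when the value is
    not a string or contains no number. parse_price is a hypothetical helper.
    """
    if not isinstance(text, str):
        return math.nan
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return math.nan
    return float(match.group().replace(",", ""))


print(parse_price("€ 1,254"))  # 1254.0
print(parse_price("€ 662"))    # 662.0

# Possible use on the scraped table (not executed here):
# barcelona_1['Price_num'] = barcelona_1['Price'].map(parse_price)
```

Returning `NaN` for missing or blank prices keeps those rows in the table while excluding them from numeric summaries.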

### Cell 4: Data Enrichment Function
Defines a function that tags every scraped row with the searched place and the check-in and check-out dates, so the four result sets can be told apart after they are combined.

In [12]:
def add_columns_to_dataframe(dataframe, place, from_day, to_day):
    # Add the 'Place' column with the specified value
    dataframe['Place'] = place
    
    # Add the 'From Date' column with the specified value
    dataframe['From Date'] = from_day
    
    # Add the 'To Day' column with the specified value
    dataframe['To Day'] = to_day
    
    return dataframe

# Call the function to add the columns with the specified values
barcelona_1 = add_columns_to_dataframe(barcelona_1, "Barcelona", "2024-05-30", "2024-06-02")

print(barcelona_1)

                                               Hotels Ratings    Price  \
0                              Top Location Apartment     8.9    € 662   
1                       Le Palacete powered by Sonder     8.1  € 1,254   
2                              Cami Gallery Barcelona     8.4    € 571   
3     TWO Hotel Barcelona by Axel 4* Sup- Adults Only     8.4    € 785   
4                                  Stylish Apartments     9.4  € 1,427   
...                                               ...     ...      ...   
996                       Click&Flat Stylish Torrijos     8.5  € 2,561   
997                          Click&Flat Floridablanca     7.4  € 2,370   
998                            Casa Cosi - Eixample 5     9.5  € 3,974   
999                                AB Roger de Lluria     7.0  € 3,786   
1000                           Casa Cosi - Eixample 2      10  € 4,227   

                                       Promotion  \
0          Only 1 left at this price on our site   
1     O

### Cell 5: Start Booking.com to Scrape the Barcelona Data (Previous Weekend)

In [13]:
#Barcelona previous weekend
browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)
browser.find_element(by='xpath', value='//*[@id="onetrust-reject-all-handler"]').click()
search_place_and_dates(browser, "Barcelona","2024-05-23","2024-05-26")

### Cell 6: Scrape and Display Barcelona Data (Previous Weekend)
Applies the data collection process to Barcelona for the weekend before the Primavera Sound Festival.

Prints the resulting DataFrame to confirm that it was created correctly.

In [14]:
barcelona_0 = scrap_hotel_data(browser)
barcelona_0 = add_columns_to_dataframe(barcelona_0, "Barcelona", "2024-05-23", "2024-05-26")
print(barcelona_0)


                                               Hotels Ratings    Price  \
0                                        Casa Abamita     8.5    € 740   
1                       Le Palacete powered by Sonder     8.1    € 925   
2                      Motel One Barcelona-Ciutadella     8.8    € 605   
3     TWO Hotel Barcelona by Axel 4* Sup- Adults Only     8.4    € 610   
4                   Inside Barcelona Apartments Sants     8.6    € 494   
...                                               ...     ...      ...   
1000                         Barceloneta Port Ramblas     9.4    € 921   
1001                               The 8 Boutique B&B     7.7    € 751   
1002                                    Hotel Transit     7.6    € 703   
1003                       Aspasios Garden Apartments     8.8  € 1,167   
1004                    Fontanella By BCN URBAN Rooms     8.4    € 913   

                                        Promotion  \
0      Only 1 room left at this price on our site   
1    

### Cell 7: Start Booking.com for Valencia (Primavera Weekend)

In [15]:
#Valencia Primavera Weekend (1)
browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)
browser.find_element(by='xpath', value='//*[@id="onetrust-reject-all-handler"]').click()
search_place_and_dates(browser, "Valencia","2024-05-30", "2024-06-02")

### Cell 8: Scraping Valencia Hotel Data (Primavera Weekend)
Applies the data collection process to Valencia for the weekend during the Primavera Sound Festival.

Prints the resulting DataFrame to confirm that it was created correctly.

In [16]:
valencia_1 = scrap_hotel_data(browser)
valencia_1 = add_columns_to_dataframe(valencia_1, "Valencia","2024-05-30", "2024-06-02")
print(valencia_1)

                                   Hotels Ratings  Price  \
0                               Niña Mala     8.8  € 299   
1              COOL LOFTS center valencia     7.5  € 312   
2                                                          
3                                                          
4                                                          
..                                    ...     ...    ...   
585                             BIZZBEACH     7.8  € 220   
586                       Apart and Benef     7.4  € 296   
587  Habitación familiar con baño privado     5.2  € 497   
588                     Coworking Balance     7.8  € 164   
589                 HABITACION INDIVIDUAL     8.8  € 298   

                                      Promotion  \
0    Only 1 room left at this price on our site   
1    Only 1 room left at this price on our site   
2                                                 
3                                           NaN   
4                       

### Cell 9: Start Booking.com to Scrape the Valencia Data (Previous Weekend)

In [17]:
#Valencia Previous Weekend
browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)
browser.find_element(by='xpath', value='//*[@id="onetrust-reject-all-handler"]').click()
search_place_and_dates(browser, "Valencia","2024-05-23", "2024-05-26")

### Cell 10: Scrape and Display Valencia Data (Previous Weekend)
Applies the data collection process to Valencia for the weekend before the Primavera Sound Festival.

Prints the resulting DataFrame to confirm that it was created correctly.

In [18]:
valencia_0 = scrap_hotel_data(browser)
valencia_0 = add_columns_to_dataframe(valencia_0, "Valencia","2024-05-23", "2024-05-26")
print(valencia_0)

                                          Hotels Ratings  Price  \
0                    Hotel Olympia Universidades     8.4  € 259   
1                                 Melia Valencia     8.5  € 771   
2                                   Zalamera BnB     8.9  € 417   
3                       North Station Apartments     8.7  € 311   
4                        Flatsforyou San Vicente     8.3  € 347   
..                                           ...     ...    ...   
856                              Cantagua Hostel     8.9  € 180   
857                           UP Hostel Valencia     7.7  € 178   
858                            Coworking Balance     7.8  € 166   
859  Livensa Living Studios Valencia Marina Real     8.1  € 760   
860                        HABITACION INDIVIDUAL     8.8  € 324   

                                      Promotion  \
0                                           NaN   
1                                           NaN   
2    Only 1 room left at this price on our

### Cell 11: Data Aggregation and Export
Combines all collected and processed DataFrames into a single DataFrame and exports it to a CSV file, concluding the data collection and preparation phase of the project.

In [19]:
final_df = pd.concat([barcelona_0, barcelona_1, valencia_1, valencia_0], ignore_index=True)
final_df.to_csv('/Users/ruimaciel/Desktop/Barcelona/NLP_I/Intro_NLP_PS1_Group_12/final_df.csv', index=False)
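
The export above writes to an absolute path on one machine. A path built relative to the working directory makes the notebook easier to run elsewhere. This is a minimal sketch, assuming a hypothetical helper `csv_output_path` and a hypothetical default folder name `data`.

```python
from pathlib import Path


def csv_output_path(filename, base_dir="data"):
    """Return base_dir/filename, creating base_dir if it does not exist.

    csv_output_path and the default folder name 'data' are assumptions
    made for illustration.
    """
    folder = Path(base_dir)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / filename


# Possible use with the combined table (not executed here):
# final_df.to_csv(csv_output_path('final_df.csv'), index=False)
```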